How to set OpenCL's local work size?
I'm doing image processing using OpenCL.
For example, I use an image of size 100*200. In the .cl code, I halve each pixel value with:
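    {
        // Image dimensions, taken from the global work size
        int width = get_global_size(0);
        int height = get_global_size(1);
        // Column index (0 .. width-1)
        int x = get_global_id(0);
        // Row index (0 .. height-1)
        int y = get_global_id(1);
        // Halve the pixel value, using row-major indexing
        data_output[y * width + x] = (unsigned char)(data_input[y * width + x] / 2);
    }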
After setting the kernel's parameters, I run the kernel with:
    clEnqueueNDRangeKernel(queue, kernel_dip, 2, NULL, global_work_size, local_work_size, 0, NULL, NULL);
The global_work_size is the image size:
    size_t global_work_size[2] = {100, 200};
I found that even though the .cl code doesn't contain any call to get_local_id(0), the local_work_size has a lot of influence on performance.
Both "size_t local_work_size[2] = {1, 1};" (a small local work size) and "size_t local_work_size[2] = {50, 50};" (a big local work size) are slow.
A suitable size in between, like the one below, is much faster:
    size_t local_work_size[2] = {10, 10};
So here are my questions:
Why is code without get_local_id() influenced by the local work size?
How can I set the best local size to make it run at the highest speed?
I also tested the running speed on other platforms, such as Freescale's i.MX6. It seems that changing the local work size has no effect there at all! Why?
If you know the answer, please help. Thank you very much!
As DarkZeros mentioned, you can set the local work size to NULL to let OpenCL choose a size that it considers "appropriate", given the global work size and the device that it is executed on.
However, for some global work sizes, OpenCL may not be able to choose a "suitable" local work size, particularly when the global work size is a prime number that is larger than the maximum local work size. Then it might be forced to use a local work size of 1. You may consider padding your input data so that it may be distributed nicely among several work-groups. (I wrote a few words about this in https://stackoverflow.com/a/22969485 )
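The padding idea could look roughly like the following sketch in host-side C. The helper name round_up is hypothetical (it is not part of the OpenCL API), and the sketch assumes the kernel receives the real image width and height as extra arguments so that the padded work-items can return early.

```c
#include <stddef.h>

/* Hypothetical helper (not an OpenCL function): round `global` up to the
 * next multiple of `local`, so the padded global work size is evenly
 * divisible by the local work size. `local` must be greater than 0. */
size_t round_up(size_t global, size_t local)
{
    size_t remainder = global % local;
    return remainder == 0 ? global : global + (local - remainder);
}

/* Usage idea for a 100 x 200 image and a 16 x 16 work-group:
 *
 *     size_t local_work_size[2]  = {16, 16};
 *     size_t global_work_size[2] = {round_up(100, 16), round_up(200, 16)};
 *
 * The kernel then has to skip the padded work-items itself, e.g. with
 *     if (x >= width || y >= height) return;
 * where width and height are passed in as kernel arguments (an assumption
 * of this sketch; the kernel in the question does not have them yet). */
```

With this, the global work size no longer has to match the image size exactly, so an awkward (for example prime) image dimension does not force a work-group size of 1.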
For complex kernels, you may consider querying CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE and basing your computation on that, but for a simple kernel, this should not be necessary.
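As a rough sketch of what "basing the computation on that" might mean: the helper below (pick_group_size is a hypothetical name, not an OpenCL function) takes the two values that clGetKernelWorkGroupInfo reports for CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE and CL_KERNEL_WORK_GROUP_SIZE, and returns the largest multiple of the preferred value that still fits. Picking the largest one is only one possible heuristic, so it should be measured rather than trusted.

```c
#include <stddef.h>

/* Hypothetical helper (not part of OpenCL).
 * preferred_multiple: value queried via clGetKernelWorkGroupInfo with
 *                     CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
 * max_group_size:     value queried via clGetKernelWorkGroupInfo with
 *                     CL_KERNEL_WORK_GROUP_SIZE
 * Returns the largest multiple of preferred_multiple that does not exceed
 * max_group_size. If no such multiple exists, it falls back to
 * max_group_size (or 1 when that is 0). */
size_t pick_group_size(size_t preferred_multiple, size_t max_group_size)
{
    if (preferred_multiple == 0 || preferred_multiple > max_group_size)
        return max_group_size > 0 ? max_group_size : 1;
    return (max_group_size / preferred_multiple) * preferred_multiple;
}
```

The result is a total number of work-items per work-group; for a 2D kernel it still has to be split into two factors (for example 256 into 16 x 16), and the global work size has to be divisible by them.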
Additionally, you might want to have a look at the "AMD APP KernelAnalyzer" or the "NVIDIA Occupancy Calculator" - these tools may give hints about the appropriate configuration for the target platforms (although, preferably, the code should be written as generically as possible, as long as this does not have a severe performance impact).