multithreading - OpenCL - how to effectively distribute work items to different devices


I'm writing an OpenCL application that has n work items which I want to distribute across d devices (n > d). In turn, each device can process the elements of its own work item in parallel, achieving a sort of "double" parallelism.

Here is the code I have written to try to achieve this.

First I create an event for each of the devices and set them all to complete:

    cl_int err;
    cl_event *events = new cl_event[deviceCount];
    for (int i = 0; i < deviceCount; i++) {
        events[i] = clCreateUserEvent(context, &err);
        err = clSetUserEventStatus(events[i], CL_COMPLETE); // note: the return code must not overwrite the event
    }

Each device has its own command queue and its own "instance" of the kernel.

Then I enter the "main loop" that distributes the work items. The code finds the first available device and enqueues a work item on it:

    /*--- Loop over the available jobs ---*/
    for (int i = 0; i < numWorkItems; i++) {
        WorkItem item = workItems[i];

        bool found = false; // check device availability
        int index = -1;     // index of the found device
        while (!found)      // loop continuously until a free device is found
        {
            for (int j = 0; j < deviceCount; j++) // total number of CPUs + GPUs
            {
                cl_int status;
                err = clGetEventInfo(events[j], CL_EVENT_COMMAND_EXECUTION_STATUS,
                                     sizeof(cl_int), &status, NULL);
                if (status == CL_COMPLETE) /* the current device has completed all of its tasks */
                {
                    found = true; // exit the infinite loop
                    index = j;    // choose the current device
                    break;        // break out of the inner loop
                }
            }
        }

        // enqueue the kernel
        clSetKernelArg(kernels[index], 0, sizeof(cl_mem), &item);
        clEnqueueNDRangeKernel(queues[index], kernels[index], 1, NULL,
                               &glob, &loc, 0, NULL, &events[index]);

        clFlush(queues[index]);
    }

And I wrap up by calling clFinish on all the devices:

    /*--- Wait for completion ---*/
    for (int i = 0; i < deviceCount; i++) {
        clFinish(queues[i]);
    }

This approach has a few problems, however:

1) It doesn't distribute work to all the devices. On my current computer I have 3 devices. The algorithm above distributes work only to devices 1 and 2. Device 3 gets left out because devices 1 and 2 finish so quickly that they can snatch more work items before device 3 gets a chance.

2) Even with devices 1 and 2 running together, I see only a very mild speed increase. For instance, if I assign all the work items to device 1 it might take 10 seconds to complete, and if I assign them all to device 2 it might take 11 seconds; but if I split the work between them, the combination might take 8-9 seconds, when I was hoping for something between 4-5 seconds. I have a feeling they might not be running in parallel with each other the way I want.

How can I fix these issues?

You have to be careful with kernel sizes and memory location. Typically these factors are not considered when dealing with GPU devices. I would ask you:

  • What are your kernel sizes?
  • How fast do they finish?

    • If the kernels are small and finish quite quickly, the overhead of launching them is high. The finer granularity of distributing them across many devices will not overcome that overhead. In that case it is better to directly increase the work size and use 1 device only.
  • Are the kernels independent? Do they use different buffers?

    • Another important thing is to have separate memory for each device; otherwise memory thrashing between devices will delay the kernel launches, and in that case 1 single device (holding the memory buffers locally) will perform better.
    • OpenCL will copy to the device all the buffers a kernel uses, and it will "block" kernels (even on other devices) that use buffers another kernel is writing to; it will wait for that kernel to finish and then copy the buffer to the other device.
  • Is the host the bottleneck?

    • The host is not as fast as you may think, and kernels often run so fast that the host is a big bottleneck scheduling jobs for them.
    • If you use the CPU as a CL device, it cannot do both tasks (act as the host and run kernels). You should prefer GPU devices over CPU devices when scheduling kernels.
  • Never let a device go empty

    • Waiting until a device has finished execution before queuing more work is typically a bad idea. You should queue kernels preemptively (1 or 2) in advance, before the current kernel has finished. Otherwise device utilization will not even reach 80%, since there is a big amount of time between a kernel finishing and the host realizing it, and an even bigger amount of time until the host queues more data to the kernel (typically >2 ms; on a 10 ms kernel, that's 33% wasted).

What I would do:

  1. Change the status-check line to: if(status >= CL_SUBMITTED)

  2. Make sure the devices are ordered GPU -> CPU. So, the GPUs are devices 0 and 1, and the CPU is device 2.

  3. Try removing the CPU device (using only the GPUs). Maybe the speed will be better.
