sorting - CUDA - How to make a thread in a kernel wait for its children


I'm trying to implement a simple merge sort using CUDA's recursive kernel launches (dynamic parallelism, available for compute capability 3.5 and higher), but I cannot find a way to tell the parent thread to launch its children concurrently and then wait for their computation, since cudaEventSynchronize() and cudaStreamSynchronize() are host only. __syncthreads() does not achieve the desired effect, since the parent's next line should be executed only after its children have completed their computation.

    __global__ void simple_mergesort(int *data, int *dataaux, int begin, int end, int depth) {
        int middle = (end + begin) / 2;
        int i0 = begin;
        int i1 = middle;
        int index;
        int n = end - begin;

        cudaStream_t s, s1;

        // If we're too deep or there are few elements left, fall back to a simple sort.
        if (depth >= max_depth || end - begin <= insertion_sort) {
            selection_sort(data, begin, end);
            return;
        }

        if (n < 2) {
            return;
        }

        // Launch a new block to sort the left part.
        cudaStreamCreateWithFlags(&s, cudaDeviceScheduleBlockingSync);
        simple_mergesort<<<1, 1, 0, s>>>(data, dataaux, begin, middle, depth + 1);
        cudaStreamDestroy(s);

        // Launch a new block to sort the right part.
        cudaStreamCreateWithFlags(&s1, cudaDeviceScheduleBlockingSync);
        simple_mergesort<<<1, 1, 0, s1>>>(data, dataaux, middle, end, depth + 1);
        cudaStreamDestroy(s1);

        // Wait until the children have returned. This does not compile.
        cudaStreamSynchronize(s);
        cudaStreamSynchronize(s1);

        // Merge the two sorted halves into the auxiliary buffer.
        for (index = begin; index < end; index++) {
            if (i0 < middle && (i1 >= end || data[i0] <= data[i1])) {
                dataaux[index] = data[i0];
                i0++;
            } else {
                dataaux[index] = data[i1];
                i1++;
            }
        }

        // Copy the merged range back into the original array.
        for (index = begin; index < end; index++) {
            data[index] = dataaux[index];
        }
    }

What adaptation should I make to the code so that it achieves the desired effect?

Thanks for reading.

The typical barrier used to force kernels to complete is cudaDeviceSynchronize(), which works in parent kernels as well, forcing their child kernels to complete.

As indicated in the documentation:

"As cudaStreamSynchronize() and cudaStreamQuery() are unsupported by the device runtime, cudaDeviceSynchronize() should be used instead when the application needs to know that stream-launched child kernels have completed."
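To make the suggestion concrete, here is a minimal sketch of the question's kernel with the two cudaStreamSynchronize() calls replaced by a single device-side cudaDeviceSynchronize(). Several things here are assumptions rather than part of the original post: the constants MAX_DEPTH and SMALL_CUTOFF, the body of selection_sort, and the host helper sort_on_device are hypothetical. The streams are created with cudaStreamNonBlocking, which is the flag the device runtime expects. It also assumes a toolkit that still supports device-side cudaDeviceSynchronize() (it was deprecated and later removed in newer CUDA releases) and a build along the lines of `nvcc -arch=sm_35 -rdc=true sort.cu -lcudadevrt`.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define MAX_DEPTH 4       // hypothetical recursion limit (assumption)
#define SMALL_CUTOFF 32   // hypothetical size below which we sort serially (assumption)

// Serial selection sort on data[begin, end), run by a single device thread.
__device__ void selection_sort(int *data, int begin, int end) {
    for (int i = begin; i < end; ++i) {
        int min_idx = i;
        for (int j = i + 1; j < end; ++j)
            if (data[j] < data[min_idx]) min_idx = j;
        int tmp = data[i];
        data[i] = data[min_idx];
        data[min_idx] = tmp;
    }
}

__global__ void simple_mergesort(int *data, int *aux, int begin, int end, int depth) {
    // Too deep or too few elements: sort this range serially and stop recursing.
    if (depth >= MAX_DEPTH || end - begin <= SMALL_CUTOFF) {
        selection_sort(data, begin, end);
        return;
    }
    int middle = begin + (end - begin) / 2;

    // Separate streams let the two child grids run concurrently.
    cudaStream_t left, right;
    cudaStreamCreateWithFlags(&left, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&right, cudaStreamNonBlocking);
    simple_mergesort<<<1, 1, 0, left>>>(data, aux, begin, middle, depth + 1);
    simple_mergesort<<<1, 1, 0, right>>>(data, aux, middle, end, depth + 1);

    // Device-side barrier: returns once the child grids launched above have finished.
    cudaDeviceSynchronize();
    cudaStreamDestroy(left);
    cudaStreamDestroy(right);

    // Merge the two sorted halves into aux, then copy the range back.
    int i0 = begin, i1 = middle;
    for (int index = begin; index < end; ++index) {
        if (i0 < middle && (i1 >= end || data[i0] <= data[i1])) aux[index] = data[i0++];
        else aux[index] = data[i1++];
    }
    for (int index = begin; index < end; ++index) data[index] = aux[index];
}

// Hypothetical host helper: copies values to the GPU, sorts them, copies them back.
bool sort_on_device(std::vector<int> &values) {
    int n = static_cast<int>(values.size());
    if (n == 0) return true;
    size_t bytes = n * sizeof(int);
    // Raise the sync depth limit (default is 2); MAX_DEPTH + 1 is a conservative guess.
    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, MAX_DEPTH + 1);
    int *d_data = nullptr, *d_aux = nullptr;
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_aux, bytes) != cudaSuccess) { cudaFree(d_data); return false; }
    cudaMemcpy(d_data, values.data(), bytes, cudaMemcpyHostToDevice);
    simple_mergesort<<<1, 1>>>(d_data, d_aux, 0, n, 0);
    cudaError_t err = cudaDeviceSynchronize();  // host-side wait for the whole recursion
    cudaMemcpy(values.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_aux);
    return err == cudaSuccess;
}
```

One barrier covers both children because the device-side cudaDeviceSynchronize() waits for all child grids launched by the calling thread block, regardless of which stream they were launched into. The host raises cudaLimitDevRuntimeSyncDepth before the first launch because synchronizing deeper than the default depth would otherwise fail.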

