sorting - CUDA - How to make a thread in a kernel wait for its children
I'm trying to implement a simple merge sort using CUDA dynamic parallelism (recursive kernel launches, compute capability 3.5 and above). I cannot find a way to tell the parent thread to launch its children concurrently and then wait for their computation to finish, since cudaEventSynchronize() and cudaStreamSynchronize() are host-only. __syncthreads() does not achieve the desired effect, since the parent's next line should be executed only after its children have completed their computation.

    __global__ void simple_mergesort(int* data, int *dataaux, int begin, int end, int depth){
        int middle = (end+begin)/2;
        int i0 = begin;
        int i1 = middle;
        int index;
        int n = end-begin;
        cudaStream_t s,s1;

        // If we're deep or there are few elements left, use insertion sort...
        if( depth >= max_depth || end-begin <= insertion_sort ){
            selection_sort( data, begin, end );
            return;
        }

        if(n < 2){
            return;
        }

        // Launches a new block to sort the left part.
        cudaStreamCreateWithFlags(&s,cudadevicescheduleb...
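One way to get the ordering described above without blocking the parent thread is to let device-side streams and events do the waiting. The parent launches both halves into separate streams, makes one stream wait on an event recorded in the other, and then launches the merge as a third child kernel. The following is a minimal sketch under several assumptions, not the original poster's code: `kMaxDepth`, `kSmallRange`, `merge_halves`, `merge_sort` and `sort_on_device` are hypothetical names, each kernel runs with a single thread, and the body of `selection_sort` is my own stand-in. It relies on the documented rule that a grid is not considered complete until all the grids it launched have completed. Older toolkits also accepted a device-side `cudaDeviceSynchronize()` in the parent, but that call was deprecated in CUDA 11.6 and, as far as I know, is not available in the default dynamic-parallelism mode of CUDA 12.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical thresholds; the question's max_depth / insertion_sort values are not shown.
constexpr int kMaxDepth = 16;
constexpr int kSmallRange = 8;

// Stand-in helper: sorts data[begin, end) in place, used for small ranges.
__device__ void selection_sort(int* data, int begin, int end) {
    for (int i = begin; i < end; ++i) {
        int min_idx = i;
        for (int j = i + 1; j < end; ++j)
            if (data[j] < data[min_idx]) min_idx = j;
        int tmp = data[i]; data[i] = data[min_idx]; data[min_idx] = tmp;
    }
}

// Merges the sorted halves [begin, middle) and [middle, end) through aux.
__global__ void merge_halves(int* data, int* aux, int begin, int middle, int end) {
    int i0 = begin, i1 = middle;
    for (int k = begin; k < end; ++k) {
        if (i0 < middle && (i1 >= end || data[i0] <= data[i1])) aux[k] = data[i0++];
        else aux[k] = data[i1++];
    }
    for (int k = begin; k < end; ++k) data[k] = aux[k];
}

// Launch with <<<1, 1>>>: one thread drives each level of the recursion.
__global__ void merge_sort(int* data, int* aux, int begin, int end, int depth) {
    if (depth >= kMaxDepth || end - begin <= kSmallRange) {
        selection_sort(data, begin, end);
        return;
    }
    int middle = begin + (end - begin) / 2;

    // Device-side streams must be created with cudaStreamNonBlocking,
    // and device-side events with cudaEventDisableTiming.
    cudaStream_t s_left, s_right;
    cudaEvent_t right_done;
    cudaStreamCreateWithFlags(&s_left, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&s_right, cudaStreamNonBlocking);
    cudaEventCreateWithFlags(&right_done, cudaEventDisableTiming);

    // The two halves go into different streams, so they may run concurrently.
    merge_sort<<<1, 1, 0, s_left>>>(data, aux, begin, middle, depth + 1);
    merge_sort<<<1, 1, 0, s_right>>>(data, aux, middle, end, depth + 1);

    // Later work in s_left waits for the right half; stream order already
    // places it after the left half. The parent thread itself never blocks.
    cudaEventRecord(right_done, s_right);
    cudaStreamWaitEvent(s_left, right_done, 0);
    merge_halves<<<1, 1, 0, s_left>>>(data, aux, begin, middle, end);

    // Destroying here only releases the handles; queued work still runs.
    cudaEventDestroy(right_done);
    cudaStreamDestroy(s_left);
    cudaStreamDestroy(s_right);
}

// Host wrapper: sorts n ints in place and returns true if no CUDA error occurred.
bool sort_on_device(int* host, int n) {
    assert(host != nullptr && n > 0);
    int *d_data = nullptr, *d_aux = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&d_aux, n * sizeof(int)) != cudaSuccess) { cudaFree(d_data); return false; }
    cudaMemcpy(d_data, host, n * sizeof(int), cudaMemcpyHostToDevice);

    merge_sort<<<1, 1>>>(d_data, d_aux, 0, n, 0);
    // On the host, this waits for the top-level grid and all of its descendants.
    cudaError_t err = cudaDeviceSynchronize();

    cudaMemcpy(host, d_data, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_aux);
    return err == cudaSuccess;
}
```

Dynamic parallelism needs relocatable device code and the device runtime library, so something like `nvcc -rdc=true -arch=sm_70 sort.cu -lcudadevrt` should build it once a host `main()` that calls `sort_on_device` is added (adjust `-arch` to your GPU). The single-threaded merge is only there to show the ordering; a real implementation would parallelize it.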