sorting - CUDA - How to make a thread in a kernel wait for its children
I'm trying to implement a simple merge sort using CUDA's recursive kernel launches (dynamic parallelism, available on compute capability 3.5 and above), but I cannot find a way to tell the parent thread to launch its children concurrently and then wait for their computation to finish, since cudaEventSynchronize() and cudaStreamSynchronize() are host-only. __syncthreads() does not achieve the desired effect either, since the parent's next line should be executed only after its children have completed their computation.

    __global__ void simple_mergesort(int *data, int *dataaux, int begin, int end, int depth) {
        int middle = (end + begin) / 2;
        int i0 = begin;
        int i1 = middle;
        int index;
        int n = end - begin;

        cudaStream_t s, s1;

        // If we're too deep or there are few elements left, use insertion sort...
        if (depth >= max_depth || end - begin <= insertion_sort) {
            selection_sort(data, begin, end);
            return;
        }

        if (n < 2) {
            return;
        }

        // Launch a new block to sort the left part.
        cudaStreamCreateWithFlags(&s, cudaDeviceScheduleB...