Is it possible to use an OpenCL data-parallel kernel to sum a vector of size n, without doing the partial-sum trick?

Say I have access to 16 work items and a vector of size 16. Wouldn't it be possible to have a kernel doing the following:

```c
__kernel void summation(__global float* input, __global float* sum)
{
    int idx = get_global_id(0);
    sum[0] += input[idx];
}
```

When I've tried this, the sum variable doesn't get updated, only overwritten. I've read something about using barriers, and I tried inserting a barrier before the summation above. It does update the variable somehow, but it doesn't reproduce the correct sum.
Let me try to explain why sum[0] is overwritten rather than updated.
In the case of 16 work items, there are 16 threads running simultaneously. sum[0] is a single memory location shared by all of the threads, and the line sum[0] += input[idx] is run by each of the 16 threads, simultaneously.

Now the instruction sum[0] += input[idx] (I think) expands into three steps: it performs a read of sum[0], adds input[idx] to it, and then writes the result back to sum[0].
There will therefore be a data race, with multiple threads reading and writing the same shared memory location. What might happen is:

- All of the threads may read the value of sum[0] before any other thread writes its updated result to sum[0], in which case the final value of sum[0] is the value of input[idx] of whichever thread executed slowest. Since this will be different each time, if you run the example multiple times you should see different results.
- Or, one thread may execute slightly more slowly than the others, in which case another thread may already have written its updated result to sum[0] before the slow thread reads sum[0]. In that case the final result includes the values of more than one thread, but not of all the threads.
So how can you avoid this?
Option 1 - Atomics (the worse option):

You can use atomics to force each thread to block if another thread is performing an operation on the shared memory location, but this results in a loss of performance, since you are making a parallel process serial (while still incurring the costs of parallelisation, such as moving memory between the host and device and creating the threads).
Option 2 - Reduction (the better option):

The best solution would be to reduce the array, since this uses the parallelism effectively and gives O(log(n)) performance. Here is an overview of reduction using OpenCL: reduction example.