Is it possible to use an OpenCL data-parallel kernel to sum a vector of size n, without doing the partial-sum trick?
Say, if I have access to 16 work items and a vector of size 16, wouldn't it be possible to have a kernel doing the following:
__kernel void summation(__global float* input, __global float* sum)
{
    int idx = get_global_id(0);
    sum[0] += input[idx];
}
When I've tried this, the sum variable isn't updated, but overwritten. I've read something about using barriers, and I tried inserting a barrier before the summation above. It does update the variable somehow, but it doesn't reproduce the correct sum.
Let me try to explain why sum[0] is overwritten rather than updated.
In your case of 16 work items, there are 16 threads running simultaneously. sum[0] is a single memory location shared by all of the threads, and the line sum[0] += input[idx] is run by each of the 16 threads, simultaneously.
Now the instruction sum[0] += input[idx] (I think) expands to three steps: it performs a read of sum[0], adds input[idx] to it, and then writes the result back to sum[0].
There will be a data race, with multiple threads reading and writing the same shared memory location. What might happen is:
- All threads may read the value of sum[0] before any other thread has written its updated result to sum[0], in which case the final value of sum[0] is just the value of input[idx] from whichever thread executed slowest. Since this will be different each time, if you run the example multiple times you should see different results.
- Or, one thread may execute somewhat more slowly than the rest, in which case another thread may have written its updated result to sum[0] before the slow thread reads sum[0]. In that case the final result includes an addition using the values of more than one thread, but not of all threads.
So how can you avoid this?
Option 1 - Atomics (worse option):
You can use atomics to force threads to block while another thread is performing an operation on the shared memory location, but this results in a loss of performance, since you are making the parallel process serial (while still incurring the costs of parallelisation, such as moving memory between the host and device and creating the threads).
Option 2 - Reduction (better option):
The best solution is to reduce the array, since this uses the parallelism effectively and can give O(log(n)) performance. Here is an overview of reduction using OpenCL: Reduction Example.