Matrix filling for the Smith-Waterman algorithm in CUDA
I need advice on optimizing my implementation of the Smith-Waterman algorithm in CUDA.
The part I want to optimize is the matrix filling. Due to the data dependence between matrix elements (each element depends on its left, upper, and upper-left neighbours), I am filling the anti-diagonals of the matrix in parallel, as illustrated in the picture below:
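For reference, the per-cell update looks roughly like this (a linear gap penalty is assumed in this sketch, and MATCH/MISMATCH/GAP are just placeholders for my actual scoring scheme):

// One Smith-Waterman cell: it depends on the upper-left, upper and left neighbours.
// H is the score matrix stored row-major with pitch = size_b + 1.
#define MATCH    2
#define MISMATCH (-1)
#define GAP      1

__device__ int sw_cell(const int *H, int pitch, int i, int j, char ai, char bj)
{
    int sim  = (ai == bj) ? MATCH : MISMATCH;
    int best = max(0, H[(i - 1) * pitch + (j - 1)] + sim);  // upper-left
    best     = max(best, H[(i - 1) * pitch + j] - GAP);     // upper
    best     = max(best, H[i * pitch + (j - 1)] - GAP);     // left
    return best;
}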
My program operates in a loop like this:

int diag = 1;
for (int x = 0; x < size_b; x++)
{
    // Never use more threads than there are elements on the current diagonal.
    block_size = 1024;
    if (block_size > diag)
        block_size = diag;

    // One kernel launch per anti-diagonal.
    safe_kernel_call((dev_init_diag<<<(diag - 1) / block_size + 1, block_size>>>(
        h, size_a, size_b, x, sequence_a, sequence_b,
        false, x_offset, y_offset, null_ind)));

    diag++;
}
As you can see, there is one kernel call for each diagonal.
Since I have quite large matrices (with 21000 elements on a side), there are a lot of kernel calls. As a result, the CUDA kernel launch overhead is large and wastes about half of the processing time, as can be seen in the screenshot from the Visual Profiler (look at the "Profiler Overhead" line):
So, my question is how to get rid of the multiple kernel calls and eliminate this overhead.
There is one important thing to notice: the reason I call a new kernel for each diagonal is that I need to synchronize the threads and blocks before the next call and, as I understand it, the only way to synchronize CUDA blocks is to finish the kernel and start it again. Nevertheless, for this algorithm there might be a better solution.
Thank you for reading this!
//////////////////////////////////////////////////////////////
OK, thank you for the response! One more question, more about CUDA: so, I have to implement a new kernel, something like this:
__global__ void kernel(...)
{
    for (int diag_num = 0; diag_num < size; diag_num++)
    {
        // Fill one anti-diagonal, then wait for all threads before the next one.
        init_one_diag(...);
        __syncthreads();
    }
}
But does that mean I have to launch this kernel on only one CUDA block? (Because, as far as I know, there is no synchronization between different blocks.)
Before, I launched the kernels this way:
dev_init_diag<<<(diag - 1)/block_size + 1, block_size>>>(...)
Will the new approach be efficient?
I recommend going through the available literature to implement an efficient approach to the matrix filling problem of the Smith-Waterman algorithm.
From the description of your code, you are choosing to fill the anti-diagonals in parallel and to launch one kernel for each anti-diagonal. As you mentioned, this is quite ineffective due to the multiple kernel launches.
A simple alternative is constructing a single kernel function in charge of calculating all the anti-diagonals. That kernel should be launched with a number of threads at least equal to the length of the longest anti-diagonal. The kernel performs a number of iterations equal to the number of anti-diagonals to be calculated; for anti-diagonals shorter than the longest one, only a subset of the threads remains active. This approach (a minimal sketch of it is given right after the list below) is described in

Parallelizing Smith-Waterman Local Alignment Algorithm using CUDA

but it is ineffective for two reasons:
- most of the threads remain inactive for a significant number of the computed anti-diagonals;
- the memory accesses are highly uncoalesced.
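For concreteness, here is a minimal sketch of that single-kernel scheme. This is not the code from the referenced paper: it assumes a linear gap penalty, placeholder MATCH/MISMATCH/GAP scoring constants, and a zero-initialized first row and column of H. It must be launched with a single block, e.g. sw_all_diagonals<<<1, 1024>>>(...), because __syncthreads() is the only barrier used between diagonals (as your follow-up question correctly suspects); a strided inner loop lets that single block cover diagonals longer than the block size.

#define MATCH    2
#define MISMATCH (-1)
#define GAP      1

// H is (rows+1) x (cols+1), row-major, with the first row and column set to 0.
__global__ void sw_all_diagonals(int *H, const char *a, const char *b,
                                 int rows, int cols)
{
    int pitch = cols + 1;

    // Anti-diagonal d contains the cells (i, j) with i + j == d.
    for (int d = 2; d <= rows + cols; ++d)
    {
        int i_min = max(1, d - cols);
        int i_max = min(rows, d - 1);

        // Each active thread handles one or more cells of the current diagonal.
        for (int i = i_min + threadIdx.x; i <= i_max; i += blockDim.x)
        {
            int j = d - i;

            int sim  = (a[i - 1] == b[j - 1]) ? MATCH : MISMATCH;
            int best = max(0, H[(i - 1) * pitch + (j - 1)] + sim);  // upper-left
            best     = max(best, H[(i - 1) * pitch + j] - GAP);     // upper
            best     = max(best, H[i * pitch + (j - 1)] - GAP);     // left
            H[i * pitch + j] = best;
        }

        // All cells of diagonal d must be written before diagonal d+1 starts.
        __syncthreads();
    }
}

Note that along a diagonal consecutive threads read and write addresses that are pitch - 1 elements apart, which is exactly why the accesses are uncoalesced.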
An alternative to anti-diagonal matrix filling is provided by the approach in

Acceleration of Smith–Waterman Algorithm Using Single and Multiple Graphics Processors

There it is shown how the Smith–Waterman (anti-diagonal) matrix filling algorithm can be reformulated so that the calculations can be performed in parallel one row (or column) at a time. It is also underlined how row (or column) calculations allow the GPU memory accesses to be consecutive (coalesced) and therefore fast. Although not explicitly mentioned, I believe this approach also mitigates (or totally removes) the above mentioned issue of inactive threads.
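To give an idea of the reformulation, here is a much-simplified sketch of the row-wise idea as I read it, for a linear gap penalty (the paper itself handles affine gaps and is considerably more involved, so this is not its exact formulation). The left-neighbour dependency inside a row is removed by rewriting

H(i,j) = max( T(j), H(i,j-1) - GAP ),  with  T(j) = max( 0, H(i-1,j-1) + s(i,j), H(i-1,j) - GAP ),

as

H(i,j) = [ inclusive max-scan of (T(k) + k*GAP) ](j) - j*GAP,

so each row needs one fully parallel kernel plus one parallel max-scan, and all memory accesses within a row are to consecutive addresses.

#include <thrust/scan.h>
#include <thrust/functional.h>
#include <thrust/execution_policy.h>

#define MATCH    2
#define MISMATCH (-1)
#define GAP      1

// Step 1: everything that depends only on the PREVIOUS row, in parallel over j.
__global__ void row_step1(const int *H_prev, int *W, const char *a, const char *b,
                          int i, int cols)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x + 1;   // j = 1 .. cols
    if (j > cols) return;

    int sim = (a[i - 1] == b[j - 1]) ? MATCH : MISMATCH;
    int t   = max(0, H_prev[j - 1] + sim);               // upper-left
    t       = max(t, H_prev[j] - GAP);                   // upper
    W[j]    = t + j * GAP;                               // input of the max-scan
}

// Step 3: undo the scan shift, again fully parallel over j.
__global__ void row_step3(const int *W_scanned, int *H_cur, int cols)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x + 1;
    if (j > cols) return;
    H_cur[j] = W_scanned[j] - j * GAP;
}

// Host-side loop over rows. d_H is (rows+1) x (cols+1), row-major, zero-initialized;
// d_W is a device scratch array of cols+1 ints.
void fill_rows(int *d_H, int *d_W, const char *d_a, const char *d_b, int rows, int cols)
{
    int pitch = cols + 1, threads = 256, blocks = (cols + threads - 1) / threads;
    for (int i = 1; i <= rows; ++i)
    {
        int *H_prev = d_H + (i - 1) * pitch;
        int *H_cur  = d_H + i * pitch;

        row_step1<<<blocks, threads>>>(H_prev, d_W, d_a, d_b, i, cols);
        // Step 2: the parallel max-scan resolves the left-to-right dependency.
        thrust::inclusive_scan(thrust::device, d_W + 1, d_W + cols + 1, d_W + 1,
                               thrust::maximum<int>());
        row_step3<<<blocks, threads>>>(d_W, H_cur, cols);
    }
}

In practice one would use a custom in-kernel max-scan rather than three separate launches per row; the sketch is only meant to show the reformulation and why the accesses become coalesced while every thread stays busy.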
EDIT
The GPU Computing Gems book, Emerald Edition, dedicates two chapters to the Smith-Waterman algorithm, namely,

Chapter 11, Accurate Scanning of Sequence Databases with the Smith-Waterman Algorithm

and

Chapter 13, GPU-Supercomputer Acceleration of Pattern Matching

The latter chapter is by the same authors as the second approach mentioned above. The former contains a step-by-step derivation of optimized CUDA code, which may turn out to be useful for future readers.