Matrix filling for the Smith-Waterman algorithm in CUDA


I need advice on how to optimize my implementation of the Smith-Waterman algorithm in CUDA.

The part I want to optimize is the matrix filling. Due to the data dependences between matrix elements (each element depends on its left neighbor, its upper neighbor, and its upper-left neighbor), I'm filling the anti-diagonals of the matrix in parallel, as illustrated in the picture below:

(figure: the cells of one anti-diagonal are filled in parallel, one diagonal per step)
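For concreteness, the dependence in question is the usual Smith-Waterman recurrence; written with a linear gap penalty (an assumption here, since the post does not show its scoring scheme), the cell update is

    H(i, j) = max( 0,
                   H(i-1, j-1) + s(a_i, b_j),   // upper-left neighbor
                   H(i-1, j)   - gap,           // upper neighbor
                   H(i, j-1)   - gap )          // left neighbor

so every cell on anti-diagonal d = i + j depends only on cells of diagonals d - 1 and d - 2, which is why all cells of one anti-diagonal can be computed in parallel.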

My program operates in a loop like this:

    int diag = 1;
    for (int x = 0; x < size_b; x++) {
        block_size = 1024;
        if (block_size > diag) {
            block_size = diag;
        }
        safe_kernel_call((dev_init_diag<<<(diag - 1) / block_size + 1, block_size>>>(
            h, size_a, size_b, x, sequence_a, sequence_b, false,
            x_offset, y_offset, null_ind)));
        diag++;
    }

As you can see, there is one kernel call for each diagonal.

Since I have quite large matrices (21000 elements on a side), there are a lot of kernel calls. As a result, the CUDA kernel launches produce a large overhead, wasting half of the processing time, as can be seen in this screenshot from the Visual Profiler (look at the "Profiler Overhead" row):

(screenshot: Visual Profiler timeline; the "Profiler Overhead" row shows launch overhead consuming about half the runtime)

So, my question is how to get rid of the multiple kernel calls and eliminate the overhead.

There is one important thing to note: the reason I call a new kernel for each diagonal is that I need to synchronize the threads and blocks before the next call, and as far as I understand, the only way to synchronize CUDA blocks is to finish the kernel and launch it again. Nevertheless, for this algorithm there might be a better solution.

Thank you for reading this!

//////////////////////////////////////////////////////////////

OK, thank you for the response! One more question, this one more about CUDA itself: so I would have to implement a new kernel, something like this:

    __global__ void kernel(...)
    {
        for (int diag_num = 0; diag_num < size; diag_num++) {
            init_one_diag(...);
            __syncthreads();
        }
    }

But doesn't that mean I have to launch the kernel on only one CUDA block? (Because, as far as I know, there is no synchronization between different blocks.)

Before, I launched the kernels this way:

dev_init_diag<<<(diag - 1)/block_size + 1, block_size>>>(...)

Will the new approach be efficient?
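For reference, here is a minimal sketch of the one-block variant asked about above, assuming a linearized score matrix h of pitch size_b + 1 with a zeroed row 0 and column 0, and placeholder scoring values (the post does not show the actual scoring scheme). With a single block, __syncthreads() is a legal barrier, but since a block is capped at 1024 threads, each thread has to stride over the cells of long diagonals:

    __global__ void fill_matrix_one_block(int *h, const char *a, const char *b,
                                          int size_a, int size_b)
    {
        const int gap = 1, match = 2, mismatch = -1;   // placeholder scores
        const int pitch = size_b + 1;
        for (int d = 2; d <= size_a + size_b; ++d) {   // anti-diagonal index d = i + j
            int i_lo = max(1, d - size_b);
            int i_hi = min(size_a, d - 1);
            for (int i = i_lo + threadIdx.x; i <= i_hi; i += blockDim.x) {
                int j = d - i;
                int s = (a[i - 1] == b[j - 1]) ? match : mismatch;
                int v = h[(i - 1) * pitch + (j - 1)] + s;   // upper-left neighbor
                v = max(v, h[(i - 1) * pitch + j] - gap);   // upper neighbor
                v = max(v, h[i * pitch + (j - 1)] - gap);   // left neighbor
                h[i * pitch + j] = max(v, 0);
            }
            __syncthreads();   // all cells of diagonal d are done before d + 1
        }
    }

Launched as fill_matrix_one_block<<<1, 1024>>>(...), this removes the per-diagonal launch overhead entirely, but note that a single block executes on a single multiprocessor, leaving most of the GPU idle.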

I recommend going through the available literature to implement an efficient approach to the matrix filling problem for the Smith-Waterman algorithm.

From your description of the code, you have chosen to fill the anti-diagonals in parallel and to launch one kernel for each anti-diagonal. As you mentioned, this is quite ineffective due to the many kernel launches.

A simple alternative consists of constructing a single kernel function in charge of calculating all the anti-diagonals. That kernel should be launched with a number of threads at least equal to the length of the longest anti-diagonal. The kernel performs a number of iterations equal to the number of anti-diagonals to be calculated. For anti-diagonals shorter than the longest one, only a subset of the threads remains active. This approach is described in

Parallelizing the Smith-Waterman Local Alignment Algorithm using CUDA
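A minimal sketch of this single-kernel scheme, under the assumption of a CUDA 9.0+ toolkit (newer than what the question likely used): cooperative groups provide a grid-wide barrier, so the kernel can span several blocks (and hence several multiprocessors) instead of the single block discussed above. Scoring values are placeholders and the kernel name is illustrative:

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __global__ void dev_fill_all_diags(int *h, const char *a, const char *b,
                                       int size_a, int size_b)
    {
        cg::grid_group grid = cg::this_grid();
        const int gap = 1, match = 2, mismatch = -1;   // placeholder scores
        const int pitch  = size_b + 1;
        const int tid    = blockIdx.x * blockDim.x + threadIdx.x;
        const int stride = gridDim.x * blockDim.x;
        for (int d = 2; d <= size_a + size_b; ++d) {
            int i_lo = max(1, d - size_b);
            int i_hi = min(size_a, d - 1);
            for (int i = i_lo + tid; i <= i_hi; i += stride) {
                int j = d - i;
                int s = (a[i - 1] == b[j - 1]) ? match : mismatch;
                int v = h[(i - 1) * pitch + (j - 1)] + s;
                v = max(v, h[(i - 1) * pitch + j] - gap);
                v = max(v, h[i * pitch + (j - 1)] - gap);
                h[i * pitch + j] = max(v, 0);
            }
            grid.sync();   // grid-wide barrier between consecutive anti-diagonals
        }
    }

Note that grid.sync() is only valid when the kernel is launched with cudaLaunchCooperativeKernel and the whole grid is co-resident on the device; on toolkits that predate cooperative groups, finishing the kernel (or the one-block variant above) really is the only inter-block barrier, as the question suspected.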

but this approach is ineffective for two reasons:

  1. most of the threads remain inactive for a significant number of the computations (anti-diagonals);
  2. the memory accesses are highly uncoalesced.

An alternative to anti-diagonal matrix filling is provided by the approach in

Acceleration of the Smith–Waterman Algorithm Using Single and Multiple Graphics Processors

There it is shown how the Smith–Waterman (anti-diagonal) matrix filling algorithm can be reformulated so that the calculations can be performed in parallel one row (or column) at a time. It is underlined how row (or column) calculations allow the GPU memory accesses to be consecutive (coalesced) and therefore fast. Although not explicitly mentioned, I believe this approach mitigates (or totally removes) the above-mentioned issue of inactive threads.
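To make the row-parallel idea concrete, here is a much simplified sketch that assumes a linear gap penalty g (the paper treats the general case). The awkward within-row dependence on H(i, j-1) can be eliminated: writing T(i, j) = max(0, H(i-1, j-1) + s, H(i-1, j) - g), which depends only on the previous row, one can show that H(i, j) = max(T(i, j), P(j-1) - j*g), where P(j) = max over k <= j of (T(i, k) + k*g). P is an inclusive max-scan, so every step is parallel over j with consecutive (coalesced) accesses:

    // Step 1: per-column temporary T, reading only the previous row (coalesced).
    __global__ void row_temp(const int *h_prev, int *t, const char *a, const char *b,
                             int i, int size_b, int g, int match, int mismatch)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x + 1;   // columns 1..size_b
        if (j > size_b) return;
        int s = (a[i - 1] == b[j - 1]) ? match : mismatch;
        t[j] = max(0, max(h_prev[j - 1] + s, h_prev[j] - g));
    }

    // Step 2 (not shown): p = inclusive max-scan of t[j] + j * g over j = 0..size_b,
    // with t[0] = 0; e.g. thrust::inclusive_scan with thrust::maximum<int>().

    // Step 3: fold the row-wise gap term back in.
    __global__ void row_combine(const int *t, const int *p, int *h_cur,
                                int size_b, int g)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x + 1;
        if (j > size_b) return;
        h_cur[j] = max(t[j], p[j - 1] - j * g);
    }

Looping over the rows still costs a few kernel launches per row, but all accesses within each step are now coalesced, which is the point the paper stresses.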

EDIT

The GPU Computing Gems book, Emerald Edition, dedicates two chapters to the Smith-Waterman algorithm, namely,

Chapter 11, Accurate Scanning of Sequence Databases with the Smith-Waterman Algorithm

and

Chapter 13, GPU-Supercomputer Acceleration of Pattern Matching

The latter chapter is by the same authors as the second approach mentioned above. The former contains a step-by-step derivation of optimized CUDA code, which may prove useful to future readers.

