image processing - What is the best way to parallelize Hough Transform algorithm? -
in hough line transform, each edge pixel, find corresponding rho , theta in hough parameter space. accumulator rho , theta should global. if want parallelize algorithm, best way split accumulator space?
what best way parallelise algorithm may depend on several aspects. important such aspect hardware targeting. have tagged question "openmp", assume that, in case, target smp system.
to answer question, let start looking @ typical, straightforward implementation of hough transform (i use c, follows applies c++ , fortran well):
size_t *hough(bool *pixels, size_t w, size_t h, size_t res, size_t *rlimit) { *rlimit = (size_t)(sqrt(w * w + h * h)); double step = m_pi_2 / res; size_t *accum = calloc(res * *rlimit, sizeof(size_t)); size_t x, y, t; (x = 0; x < w; ++x) (y = 0; y < h; ++y) if (pixels[y * w + x]) (t = 0; t < res; ++t) { double theta = t * step; size_t r = x * cos(theta) + y * sin(theta); ++accum[r * res + t]; } return accum; }
given array of black-and-white pixels (stored row-wise), width, height , target resolution angle-component of hough space, function hough
returns accumulator array hough space (organised "angle-wise") , stores upper bound distance dimension in output argument rlimit
. is, number of elements in returned accumulator array given res * (*rlimit)
.
the body of function centres on 3 nested loops: 2 outermost loops iterate on input pixels, while conditionally executed innermost loop iterates on angular dimension of hough space.
to parallelise algorithm, have somehow decompose pieces can execute concurrently. typically such decomposition induced either structure of computation or otherwise structure of data operated on.
as, besides iteration, computationally interesting task carried out function trigonometry in body of innermost loop, there no obvious opportunities decomposition based on structure of computation. therefore, let focus on decompositions based on structure of data, , let distinguish between
- data decompositions based on structure of input data, and
- data decompositions based on structure of output data.
the structure of input data, in our case, given pixel array passed argument function hough
, iterated on 2 outermost loops in body of function.
the structure of output data given structure of returned accumulator array , iterated on innermost loop in body of function.
we first @ output-data decomposition as, hough transform, leads simplest parallel algorithm.
output-data decomposition
decomposing output data units can produced relatively independently materialises having iterations of innermost loop execute in parallel.
doing so, 1 has take account so-called loop-carried dependencies loop parallelise. in case, straightforward there no such loop-carried dependencies: iterations of loop require read-write accesses shared array accum
, each iteration operates on own "private" segment of array (i.e., elements have indices i
i % res == t
).
using openmp, gives following straightforward parallel implementation:
size_t *hough(bool *pixels, size_t w, size_t h, size_t res, size_t *rlimit) { *rlimit = (size_t)(sqrt(w * w + h * h)); double step = m_pi_2 / res; size_t *accum = calloc(res * *rlimit, sizeof(size_t)); size_t x, y, t; (x = 0; x < w; ++x) (y = 0; y < h; ++y) if (pixels[y * w + x]) #pragma omp parallel (t = 0; t < res; ++t) { double theta = t * step; size_t r = x * cos(theta) + y * sin(theta); ++accum[r * res + t]; } return accum; }
input-data decomposition
a data decomposition follows structure of input data can obtained parallelising outermost loop.
that loop, however, have loop-carried flow dependency each loop iteration potentially requires read-write access each cell of shared accumulator array. hence, in order obtain correct parallel implementation have synchronise these accumulator accesses. in case, can done updating accumulators atomically.
the loop carries 2 so-called antidependencies. these induced induction variables y
, t
of inner loops , trivially dealt making them private variables of parallel outer loop.
the parallel implementation end looks this:
size_t *hough(bool *pixels, size_t w, size_t h, size_t res, size_t *rlimit) { *rlimit = (size_t)(sqrt(w * w + h * h)); double step = m_pi_2 / res; size_t *accum = calloc(res * *rlimit, sizeof(size_t)); size_t x, y, t; #pragma omp parallel private(y, t) (x = 0; x < w; ++x) (y = 0; y < h; ++y) if (pixels[y * w + x]) (t = 0; t < res; ++t) { double theta = t * step; size_t r = x * cos(theta) + y * sin(theta); #pragma omp atomic ++accum[r * res + t]; } return accum; }
evaluation
evaluating 2 data-decomposition strategies, observe that:
- for both strategies, end parallelisation in computationally heavy parts of algorithm (the trigonometry) nicely distributed on threads.
- decomposing output data gives parallelisation of innermost loop in function
hough
. loop not have loop-carried dependencies not incur data-synchronisation overhead. however, innermost loop executed every set input pixel, incur quite lot of overhead due repeatedly forming team of threads etc. - decomposing input data gives parallelisation of outermost loop. loop executed once , threading overhead minimal. on other hand, however, incur data-synchronisation overhead dealing loop-carried flow dependency.
atomic operations in openmp can typically assumed quite efficient, while threading overheads known rather large. consequently, 1 expects that, hough transform, input-data decomposition gives more efficient parallel algorithm. confirmed simple experiment. experiment, applied 2 parallel implementations randomly generated 1024x768 black-and-white picture target resolution of 90 (i.e., 1 accumulator per degree of arc) , compared results sequential implementation. table shows relative speedups obtained 2 parallel implementations different numbers of threads:
# threads | output decomposition | input decomposition ----------+----------------------+-------------------- 2 | 1.2 | 1.9 4 | 1.4 | 3.7 8 | 1.5 | 6.8
(the experiment carried out on otherwise unloaded dual 2.2 ghz quad-core intel xeon e5520. speedups averages on 5 runs. average running time of sequential implementation 2.66 s.)
false sharing
note parallel implementations susceptible false sharing of accumulator array. implementation based on decomposition of output data false sharing can large extent avoided transposing accumulator array (i.e., organising "distance-wise"). doing , measuring impact, did, in experiments, not result in observable further speedups.
conclusion
returning question, "what best way split accumulator space?", answer seems best not split accumulator space @ all, instead split input space.
if, reason, have set mind on splitting accumulator space, may consider changing structure of algorithm outermost loops iterate on hough space , inner loop on whichever smallest of input picture's dimensions. way, can still derive parallel implementation incurs threading overhead once , comes free of data-synchronisation overhead. however, in scheme, trigonometry can no longer conditional , so, in total, each loop iteration have more work in scheme above.
Comments
Post a Comment