Okay.

So, good morning everyone, it's my pleasure to present our work here. I'm from the Centre for Biomedical Image Analysis at Masaryk University in Brno, and in my talk I would like to present work done together with my colleague, who is from this faculty.

The outline of my talk is as follows. First, I will briefly recall what convolution is and how to compute it. Then I will talk about the motivation of our work, which is the use of convolution in image processing, and about the acceleration of convolution on graphics cards. Then I will present our algorithm and show the results, and finally I will conclude my talk.

So first, what is convolution? I'm sure most of you are familiar with convolution, or have at least heard about it, but just to recall some basics: it is an essential operation in image and signal processing. To show an example on these two images: convolution takes two images as an input and produces one image as an output. One of the input images is usually called the convolution kernel. What actually happens is that at every pixel of the output image, the value is computed as the sum of the point-wise multiplication of the convolution kernel and an appropriate window of the input image.
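In symbols, this is the standard discrete 2D convolution (notation mine, not from the slides):

$$ (f * g)(x, y) = \sum_{i} \sum_{j} g(i, j)\, f(x - i,\, y - j), $$

where $f$ is the input image and $g$ the convolution kernel.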

There are several approaches to computing the convolution. One of them is the so-called naive approach, directly from the definition, which is the slowest one. Then there are some faster solutions, like separable convolution or recursive filters, which are, however, suitable only for some specific kinds of convolution kernels. For a large, general convolution kernel, it turns out that the best solution is the so-called fast convolution, which is actually the convolution computed in the frequency domain according to the so-called convolution theorem. That means that we compute the Fourier transforms of the input image and the convolution kernel, we compute their point-wise multiplication, and then we compute the inverse Fourier transform of the result.
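Written out, the convolution theorem used here is (standard form, notation mine):

$$ f * g = \mathcal{F}^{-1}\bigl\{ \mathcal{F}\{f\} \cdot \mathcal{F}\{g\} \bigr\}, $$

where $\mathcal{F}$ denotes the (fast) Fourier transform and $\cdot$ the point-wise product.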

Then the question is: why accelerate convolution on a GPU? The answer is simple: nowadays, GPU implementations achieve up to approximately ten times speedup over optimized CPU solutions. And if you look for resources, you can find some convolution implementations in the NVIDIA CUDA documentation; for example, you can find the naive and the separable convolution in the CUDA SDK. You can also find an implementation of the fast Fourier transform in the cuFFT library, which can help you to implement the fast convolution.
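As a rough illustration of how such a cuFFT-based fast convolution can be put together, here is a minimal sketch; it assumes both device buffers are already zero-padded to n0 x n1, and the function and kernel names are my own, not code from the talk or the SDK:

```cuda
#include <cufft.h>
#include <cuComplex.h>

// Point-wise complex multiplication of two spectra, with the 1/N normalization
// folded in (cuFFT's inverse transform is unnormalized).
__global__ void pointwiseMultiply(cufftComplex *a, const cufftComplex *b,
                                  int n, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        cufftComplex c = cuCmulf(a[i], b[i]);
        a[i].x = c.x * scale;
        a[i].y = c.y * scale;
    }
}

// d_image and d_kernel are device buffers already zero-padded to n0 x n1.
// On return, d_image holds the convolution result.
void fastConvolution(cufftComplex *d_image, cufftComplex *d_kernel,
                     int n0, int n1)
{
    cufftHandle plan;
    cufftPlan2d(&plan, n0, n1, CUFFT_C2C);

    cufftExecC2C(plan, d_image,  d_image,  CUFFT_FORWARD);   // F(image)
    cufftExecC2C(plan, d_kernel, d_kernel, CUFFT_FORWARD);   // F(kernel)

    int n = n0 * n1;
    int threads = 256;
    pointwiseMultiply<<<(n + threads - 1) / threads, threads>>>(
        d_image, d_kernel, n, 1.0f / n);

    cufftExecC2C(plan, d_image, d_image, CUFFT_INVERSE);     // back to space domain
    cufftDestroy(plan);
}
```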

Now, the motivation for our work is the following. In our laboratory we develop a simulator of optical microscopy images; you can try it yourself at this web page.

The simulation itself is divided into three phases. In the first step, an artificial image is created; in the second step, the image is blurred according to the configuration of a real optical microscope; and in the final phase, noise is added. The reason why we use this simulator is that we can generate as many images as we want, and for these images we know the ground truth, we know the number of objects, so we can test image processing algorithms on them and evaluate how they perform.

The essential part is the second phase, where the convolution of the input image and the so-called point spread function is computed, and at this phase the convolution is computed over large images. For your information, the input images can have 1000 × 1000 × 100 pixels and the convolution kernel can have 100 × 100 × 100 pixels, so these images are really big.

If one tries to implement this convolution on a GPU, it turns out that the GPU implementation has a significantly lower computation time than the CPU implementations, but there are two drawbacks. The first one is that the GPU implementation is basically for complex data; it is not optimized for images which have real values, so if we compare this solution with an optimized CPU solution, the speedup is not that big. The other drawback is that if the input image is too large, then we don't have enough GPU memory to compute the convolution. So the goals are, first, to decompose the problem of convolution, and second, to optimize it for real data.

As for the decomposition of the problem, there are several approaches, but with respect to the number of operations and the number of data transfers between the CPU and the GPU, it turns out that the best solution is the so-called decimation-in-frequency algorithm, which is well known from the fast Fourier transform.

So we use this decimation-in-frequency algorithm, and the idea looks basically as follows: on the CPU we compute the first step of the fast Fourier transform according to these equations, and this allows us to split the input data into several parts, in this case two parts, so that we can compute the fast Fourier transform, and subsequently the convolution, in the two parts separately.
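For reference, the standard radix-2 decimation-in-frequency split being referred to is:

$$ X(2k) = \sum_{n=0}^{N/2-1} \Bigl[ x(n) + x\bigl(n + \tfrac{N}{2}\bigr) \Bigr] W_{N/2}^{nk} $$
$$ X(2k+1) = \sum_{n=0}^{N/2-1} \Bigl[ x(n) - x\bigl(n + \tfrac{N}{2}\bigr) \Bigr] W_{N}^{n}\, W_{N/2}^{nk}, \qquad W_N = e^{-2\pi i / N}, $$

so the even and odd halves of the spectrum become two independent transforms of length $N/2$ that can be processed, and convolved, separately.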

As for the optimization for real data, the algorithm changes slightly: we add two steps into this workflow. We convert the data from real to complex, and in the final phase we combine the results. The first step, the transformation of real data into complex data, is actually trivial, because what we do is take the odd pixels as the real part and the even pixels as the imaginary part of a complex image. What we get is a complex image of half the size, without doing any computations or data transfers; we just recast the array in the computer memory.
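A minimal sketch of that recast, assuming the real samples are stored contiguously in memory (names and layout are my own illustration, not code from the talk):

```cuda
#include <cufft.h>
#include <cstddef>

// Reinterpret a contiguous real-valued image of numReal samples as a complex
// image of numReal/2 samples: sample 2i becomes the real part and sample 2i+1
// the imaginary part of the i-th complex value. No copy, no data transfer.
cufftComplex *packRealAsComplex(float *realImage, std::size_t numReal,
                                std::size_t *numComplex)
{
    *numComplex = numReal / 2;                            // half the elements
    return reinterpret_cast<cufftComplex *>(realImage);   // pure recast
}
```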

So this is done essentially instantly. The final phase takes some time, but it is not that computationally expensive.

To summarize the whole algorithm: we take the input signal and the input kernel, we decompose both of them and recast them to complex signals; this is done on the CPU. Then we compute all the Fourier transforms, the point-wise multiplications and the inverse Fourier transforms on the GPU; this is the computationally expensive part, so we try to keep most of the computations on the GPU. Finally, we transfer the data back, compose them together and recombine them, so that we get the result.
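In symbols (notation mine): if the decimation-in-frequency step on the CPU splits the padded image $f$ and kernel $g$ into parts $f_1, \dots, f_p$ and $g_1, \dots, g_p$, the GPU computes for each part

$$ h_j = \mathcal{F}^{-1}\bigl\{ \mathcal{F}\{f_j\} \cdot \mathcal{F}\{g_j\} \bigr\}, \qquad j = 1, \dots, p, $$

and the CPU recombines $h_1, \dots, h_p$ into the final result $f * g$.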

Now I will speak about some experiments. This is the configuration we used, and we used two datasets. In the first dataset, the images have dimensions that are powers of two, which is optimal for the fast Fourier transform, so these images are expected to have good computation times. In the second dataset, the images are slightly larger, which means that the computation times will be longer.

As you can see in this graph, on the x-axis we have the image size and on the y-axis we have the performance, which is computed as the ratio between the output image size and the computation time, so higher is better.
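That is, the plotted quantity is (notation mine):

$$ \text{performance} = \frac{\text{output image size}}{\text{computation time}}. $$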

The GPU implementation optimized for real data performs best, and even the GPU implementation without the optimization for real data performs better than the best CPU solution; with the optimization for real data we get roughly two times more on top of that. This graph is for the first dataset.

This graph is for the second dataset. Clearly, the performance goes down when the images don't have dimensions of the specific sizes, but for both datasets it turns out that the GPU solution is much faster than the CPU one, and the speedup achieved is approximately five to ten, depending on the input image.

We also performed a test with a fixed input image, where we measured the performance for different values of the parameter p. The parameter p is the number of sub-images the data are split into, so p equal to one means that no decomposition is made, and p equal to two means that the data are decomposed into two parts. We can clearly see that the decomposition imposes some performance decrease,

but then, with increasing p, the performance does not decrease that much, and it always stays significantly higher than the performance of the CPU implementation. As for the GPU memory that is required, it of course decreases with p. For p equal to two and p equal to four the memory requirements are the same, because of the properties of the recombination phase,

but you can see that the memory requirements can be decreased as far as you need, according to the graphics card you have.

As for the numerical precision, it turns out that the error of the output image is below the level of one. That means that the maximum difference between the output image computed on the GPU and the output image computed on the CPU in long double precision is less than one, so the images are essentially the same.
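Stated as a formula (notation mine):

$$ \max_{\mathbf{x}} \bigl| I_{\text{GPU}}(\mathbf{x}) - I_{\text{CPU, long double}}(\mathbf{x}) \bigr| < 1 \ \text{intensity level}. $$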

However, the images that we use have an effective bit depth of about twelve bits, which is common for optical microscopy images. So if one wants to use images of a higher bit depth, let's say sixteen or thirty-two bits, then this precision might not be enough. In that case, one should not use single precision for the computation but rather, let's say, double precision; fortunately, modern graphics cards allow the use of double precision.

We also tried to implement a multi-GPU solution, and we made some optimizations of the synchronization so that the overall time of the multi-GPU solution is as good as possible. However, it turns out that with two graphics cards one definitely does not achieve a two times speedup. In fact, for some images two GPUs perform more or less the same as one GPU, and that is because of the overhead of the data transfers. For some images one can achieve maybe a 1.5 times speedup, so it can be interesting for some kinds of input images.

To conclude my talk: our algorithm for the computation of convolution on the GPU actually has no limit on the data size; the only limit is the CPU memory. But there are approaches to decompose the data even if you don't have enough CPU memory, so for extremely large input images you can combine our approach with some other approaches to perform the computation.

The algorithm is optimized for real data, which are the most common in practical applications. It has quite stable performance, meaning that with an increasing number of sub-images the data are split into, the performance does not decrease significantly, and there are no redundant data transfers or computations.

A multi-GPU solution is also possible: if you have more graphics cards, you can split the data into more parts and send them to the individual graphics cards, which then compute in parallel. The disadvantage is that, as I said, the multi-GPU solution is not that efficient,

and the numerical precision may not be sufficient for some applications, so one should consider, let's say, using double precision for the computation. So that's all; to conclude my talk, thank you very much for listening, and I'm looking forward to your questions.

Uh-huh.

Uh-huh, yes, thank you, those are definitely interesting questions. I didn't mention it, but it was on the slide: the CPU was a four-core Intel Core i7 and the GPU was a GeForce GTX 480. The CPU implementation was based on the FFTW library, which is probably considered the fastest FFT library, so we used as fast a CPU implementation as possible.

Yes, actually, for the sake of simplicity we...

Well, let me put it differently: in fast convolution, both the data and the kernel have to be padded to the same size. So in fact it doesn't matter whether you have a kernel of 100 × 100 or 200 × 200, because in the end it is padded to the size of the input image plus the size of the kernel, so it is something like 1100 or 1200, for example. For the sake of simplicity, during our experiments we used an input image of the same size as the convolution kernel, but it really doesn't matter how large the convolution kernel is.
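As a worked example of that padding (the minus one is the usual lower bound for linear convolution via the FFT; the answer simply rounds it to the sum):

$$ N_{\text{padded}} \ge N_{\text{image}} + N_{\text{kernel}} - 1, \qquad \text{e.g. } 1000 + 200 = 1200 \text{ per dimension}. $$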

Uh-huh.

Oh, because for the numerical precision test we wanted to compare the single precision on the GPU with the ground truth, because we wanted to make sure that the results are correct. So for that test we used the long double precision on the CPU; however, for the performance tests we used single precision on both the CPU and the GPU. I'm sorry if that wasn't clear during the talk.

Block convolution, you mean? Splitting it into small blocks and performing the convolution with overlaps? Yeah, that approach is tempting, and actually I was working on that approach myself some time ago. But the disadvantage of this approach is that the number of operations, in terms of asymptotic complexity, is higher, and so is the number of data transfers, because of the overlaps. It turns out that with a GPU implementation the number of data transfers is essential, so we wanted to keep the number of data transfers as low as possible, and for this reason we used the decimation in frequency instead of the overlap-save or overlap-add method, because we get a significantly lower number of data transfers.

Uh-huh.

Yeah, p... oh yeah, I'm sorry, I didn't mention it: p means the number of parts the data are decomposed into. So in fact, with overlap-save and overlap-add, with an increasing number of sub-parts you get a higher number of data transfers.

Okay.