0:00:19 So, good morning everyone, it's my pleasure to present our work here. My name is Pavel Karas, I'm from the Centre for Biomedical Image Analysis at Masaryk University, Brno. In my talk I would like to present joint work with my colleague, who is from this faculty.
0:00:42 The outline of my talk is as follows. First, I will briefly recall convolution and how to compute it. Then I will talk about the motivation of our work, which is the use of convolution in image processing and the acceleration of convolution on graphics cards. Then I will present our algorithm and show the results, and finally I will conclude my talk.
0:01:22 First, what is convolution? I'm sure most of you are familiar with convolution, or have at least heard about it, but just to recall some basics: it's an essential operation in image and signal processing. To show an example on these two images: convolution takes two images as an input and produces one image as an output. One of the input images is usually called the convolution kernel. What actually happens is that at every pixel of the output image, the value is computed as the sum of the pointwise multiplication of the convolution kernel and an appropriate window of the input image.
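The per-pixel sum just described can be sketched directly. Here is a minimal NumPy version of the naive approach; the function name and the "full" output-size convention are my own choices, not from the talk:

```python
import numpy as np

def naive_convolve2d(image, kernel):
    """Direct 2D convolution from the definition: each output pixel is the
    sum of pointwise products of the (flipped) kernel and the image window."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    # "Full" output size: image size plus kernel size minus one, per axis.
    oh, ow = ih + kh - 1, iw + kw - 1
    # Zero-pad the image so every window of the sliding kernel is defined.
    padded = np.zeros((ih + 2 * (kh - 1), iw + 2 * (kw - 1)))
    padded[kh - 1:kh - 1 + ih, kw - 1:kw - 1 + iw] = image
    flipped = kernel[::-1, ::-1]  # convolution flips the kernel
    out = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            out[y, x] = np.sum(padded[y:y + kh, x:x + kw] * flipped)
    return out
```

For an N×N image and K×K kernel this costs O(N²K²) operations, which is why the talk calls it the slowest approach.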
0:02:28 There are several approaches to computing the convolution. One of them is the so-called naive approach, from the definition, which is the slowest one. Then there are some faster solutions, like separable convolution or recursive filters, which are, however, suitable only for some specific kinds of convolution kernels. For a large general convolution kernel, it turns out that the best solution is so-called fast convolution, which is actually the convolution computed in the frequency domain according to the convolution theorem. That means we compute the Fourier transforms of the input image and the convolution kernel, compute the pointwise multiplication, and then compute the inverse Fourier transform of the result.
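The fast-convolution recipe just stated can be sketched in a few lines of NumPy (using NumPy's FFT in place of a GPU library like cuFFT; the function name is my own):

```python
import numpy as np

def fft_convolve2d(image, kernel):
    """Fast convolution via the convolution theorem: transform both inputs,
    multiply the spectra pointwise, then transform back. Both inputs are
    zero-padded to the full output size so the circular convolution of the
    FFT equals the desired linear convolution."""
    oh = image.shape[0] + kernel.shape[0] - 1
    ow = image.shape[1] + kernel.shape[1] - 1
    F = np.fft.fft2(image, s=(oh, ow))   # zero-pads to (oh, ow)
    G = np.fft.fft2(kernel, s=(oh, ow))
    return np.real(np.fft.ifft2(F * G))  # inputs are real, so drop the
                                         # (numerically tiny) imaginary part
```

The cost is dominated by the FFTs, O(N² log N) for an N×N padded size, independent of the kernel size, which is why this wins for large general kernels.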
0:03:31 Then the question is why to accelerate convolution on GPU. The answer is simple: nowadays, GPU implementations achieve up to approximately ten times speedup over optimized CPU solutions. If you look for resources, you can find some convolution implementations in the NVIDIA documentation; for example, you can find naive and separable convolution in the CUDA SDK. You can also find an implementation of the fast Fourier transform in the cuFFT library, which can help you to implement the fast convolution.
0:04:24 Now, the motivation for our work is the following. In our laboratory we develop a simulator of optical microscopy images; you can try it yourself at this web page.
0:04:48 The simulation itself is divided into three phases. In the first step, an artificial image is created. In the second step, the image is blurred according to a configuration of a real optical microscope, and in the final phase, noise is added. The reason why we use this simulator is that we can generate as many images as we want, and with these images we know the ground truth: we know the number of objects, so we can test image processing algorithms on these images and evaluate how they perform.
0:05:42 The essential part is the second phase, where the convolution of an input image and the so-called point spread function is computed. At this phase, the convolution is computed over large images: for your information, the input images can have 1000 × 1000 × 100 pixels and the convolution kernel can have 100 × 100 × 100 pixels, so these images are really big.
0:06:17 If one tries to implement this convolution on GPU, it turns out that the GPU implementation has a significantly lower computation time than CPU implementations, but there are two drawbacks. The first one is that the GPU implementation is basically for complex data; it's not optimized for images which have real values. So if we compare this solution with the optimized CPU solution, the speedup is not that big. The other drawback is that if the input image is too large, then we don't have enough GPU memory to compute the convolution. So the goals are, first, to decompose the problem of convolution, and second, to optimize it for real data.
0:07:27 As for the decomposition of the problem, there are several approaches, but with respect to the number of operations and the number of data transfers between CPU and GPU, it turns out that the best solution is the so-called decimation in frequency algorithm, which is well known from the fast Fourier transform. So we use this decimation in frequency algorithm, and the idea looks basically as follows: on the CPU, we compute the first step of the fast Fourier transform according to these equations. This allows us to split the input data into n parts, in this case two parts, and we can compute the fast Fourier transform, and subsequently the convolution, in two parts separately.
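In 1-D, the first decimation-in-frequency butterfly that the talk performs on the CPU can be sketched as follows (a minimal sketch with my own variable names): it turns one length-N signal into two independent length-N/2 signals whose FFTs are the even- and odd-indexed bins of the full spectrum, so each half can be transformed, and convolved, separately on the GPU.

```python
import numpy as np

def dif_split(x):
    """First decimation-in-frequency step of the FFT.
    Returns two half-length signals a and b such that
    FFT(a) gives the even-indexed bins of FFT(x) and
    FFT(b) gives the odd-indexed bins."""
    n = len(x)
    half = n // 2
    w = np.exp(-2j * np.pi * np.arange(half) / n)  # twiddle factors W_N^k
    a = x[:half] + x[half:]          # sum branch   -> even bins
    b = (x[:half] - x[half:]) * w    # twiddled difference branch -> odd bins
    return a, b
```

Applying the step recursively splits the data into 4, 8, ... parts, which is the parameter p discussed in the experiments.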
0:08:32 As for the optimization for real data, the algorithm changes slightly: we add two steps into this workflow. We convert the data from real to complex, and in the final phase we recombine the results. The first step, the transformation of real data into complex data, is actually trivial, because what we do is take the odd pixels as the real part and the even pixels as the imaginary part of a complex image. What we get is a complex image of half the size, without doing any computations or data transfers: we just recast the array in the computer memory. So this is done instantly. The final phase takes some time, but it's not that computationally expensive.
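The recast-and-recombine trick can be illustrated in 1-D: pack two real sequences into one complex sequence, take a single complex FFT, and separate the two spectra afterwards using conjugate symmetry. This is a sketch with my own naming; the talk packs the odd and even pixels of one image in the same way.

```python
import numpy as np

def two_real_ffts(a, b):
    """Compute the FFTs of two real signals with one complex FFT.
    Packing b into the imaginary part is, in memory, just a recast of
    the interleaved real array. The spectra are then separated using
    the conjugate symmetry Z[k] = A[k] + i*B[k], where A and B are
    conjugate-symmetric."""
    z = a + 1j * b
    Z = np.fft.fft(z)
    Zr = np.conj(np.roll(Z[::-1], 1))   # conj(Z[(N - k) mod N])
    A = 0.5 * (Z + Zr)                  # spectrum of a
    B = -0.5j * (Z - Zr)                # spectrum of b
    return A, B
```

The packing itself costs nothing; only the separation (the "recombination phase" in the talk) does a small amount of extra work.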
0:10:00 To summarize the whole algorithm: we get the input signal and the input kernel, we decompose both of them and recast them to complex signals; this is done on the CPU. Then we compute all Fourier transforms, pointwise multiplications, and inverse Fourier transforms on the GPU; this is the computationally expensive part, so we try to keep most of the computation on the GPU. Finally, we transfer the data back, compose them together, and recombine them, so that we get the result.
0:10:49 Now I will speak about some experiments. This is the configuration we used, and we used two datasets. In the first dataset, the images have dimensions that are powers of two, which is optimal for the fast Fourier transform, so these images are expected to have good computation times. In the second dataset, the images are slightly larger, which means that the computation times will be longer. As you can see on this graph, on the x-axis we have the image size; on the y-axis we have the performance, which is computed as the ratio between the output image size and the computation time, so higher is better.
0:12:00 The GPU solution optimized for real data performs best. Even the GPU solution without the optimization for real data performs better than the best CPU solution, but with the optimization for real data we get about two times more on top of that. This graph is for the first dataset.
0:12:43 This graph is for the second dataset. Clearly, the performance goes down when the images don't have dimensions of a specific size, but for both datasets it turns out that the GPU solution is much faster than the CPU one, and the speedup achieved is approximately five to ten, depending on the input image.
0:13:18 We also performed a test with a fixed input image, and we computed the performance for different values of the parameter p. The parameter p is the number of sub-images the data are split into, so p equal to one means that no decomposition is made, two means that the data are decomposed into two parts, and so on. We can clearly see that the decomposition imposes some performance decrease, but with increasing p the performance doesn't decrease that much, and it stays significantly higher than the performance of the CPU implementation. As for the GPU memory that is required, of course it decreases with increasing p. With p equal to two or p equal to four, the memory requirements are the same, because of the properties of the recombination phase, but you can see that the memory requirements can be decreased to any number you desire, according to the graphics card you have.
0:15:13 As for the numerical precision, it turns out that the error of the output image is under the level of one. That means that the maximum difference between the output image computed on GPU and the output image computed on CPU in long double precision is less than one, so the images are essentially the same. But the images that we use have an effective bit depth of about twelve bits, which is common for optical microscopy images. If one wants to use images of a higher bit depth, let's say sixteen or thirty-two bits, then this precision might not be enough. In that case, one should use not single precision but, let's say, double precision for the computation; fortunately, modern graphics cards allow the use of double precision.
0:16:54 We also tried to implement a multi-GPU solution, and we made some optimizations of the synchronization so that the overall time of the multi-GPU solution is as low as possible. However, it turns out that with two graphics cards one definitely does not achieve a two times speedup. In fact, for some images, two GPUs perform more or less the same as one GPU, because of the overhead of data transfers. For some images one can achieve maybe a 1.5 times speedup, so it can be interesting for some kinds of input images.
0:17:53 To conclude my talk: our algorithm for the computation of convolution on GPU actually has no limit on the data size; the only limit is the CPU memory. But there are approaches to decompose the data even if you don't have enough CPU memory, so for extremely large input images you can combine our approach with some other approaches to perform the computation. The algorithm is optimized for real data, which are most common in practical applications. It has quite stable performance, meaning that with an increasing number of sub-images the data are split into, the performance doesn't decrease significantly. There are no redundant data transfers or computations. A multi-GPU solution is also possible: if you have many graphics cards, you can split the data into more parts, send the data to the individual graphics cards, and let them compute in parallel. The disadvantages are that, as I told you, the multi-GPU solution is not that efficient, and the numerical precision may not be sufficient for some applications, so one should consider using double precision for the computation. So, that's all. Thank you very much for listening, and I'm looking forward to your questions.
0:20:23 Yes, thank you, those are definitely interesting questions. I didn't mention it, but it was on the slide: the CPU was a four-core Intel Core i7, and the GPU was a GeForce GTX 480. The CPU implementation was based on the FFTW library, which is probably considered the fastest FFT library, so we used as fast a CPU implementation as possible.
0:21:29 Yes, actually, for the sake of simplicity we... let me put it differently. In fast convolution, both the data and the kernel have to be padded to the same size, so in fact it doesn't matter whether you have a kernel of 100 × 100 × 100 or 200 × 200 × 100, because at the end everything is padded to the size of the input image plus the size of the kernel, which is something like 1100 × 1100 × 200, for example. For the sake of simplicity, during our experiments we used an input image of the same size as the convolution kernel, but it really doesn't matter how large the convolution kernel is.
0:23:16 Because for the numerical precision test, we wanted to compare the single precision on GPU with the ground truth; we wanted to make sure that the results are correct. So for that test we used long double precision on the CPU. However, for the performance tests, we used single precision both on CPU and GPU. I'm sorry that this wasn't clear during the talk.
0:24:15 Block convolution, you mean splitting it into small blocks and performing overlap-add or overlap-save? Yes, that approach is well known, and actually I was working on this approach at first. But the disadvantage of this approach is that the number of operations, in terms of asymptotic complexity, is higher, and so is the number of data transfers, because of the overlaps. It turns out that with a GPU implementation, the number of data transfers is essential, so we wanted to keep the number of data transfers as low as possible. For this reason, we used decimation in frequency instead of the overlap-save or overlap-add method, because we got a significantly lower number of data transfers.
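For comparison, the overlap-add scheme raised in the question looks roughly like this in 1-D (my own sketch, not the speaker's code). Each block incurs its own transforms, and on a GPU each block would also incur its own transfers, which is the overhead being discussed:

```python
import numpy as np

def overlap_add_convolve(x, h, block=256):
    """Overlap-add block convolution: convolve the signal chunk by chunk
    with the kernel via FFT, and add the overlapping tails together."""
    n = block + len(h) - 1            # FFT size per block (no circular aliasing)
    H = np.fft.fft(h, n)              # kernel spectrum, computed once
    out = np.zeros(len(x) + len(h) - 1)
    for start in range(0, len(x), block):
        chunk = x[start:start + block]
        y = np.real(np.fft.ifft(np.fft.fft(chunk, n) * H))
        out[start:start + n] += y[:len(out) - start]  # add overlapping tail
    return out
```

Note that the tails of adjacent blocks overlap by len(h) - 1 samples, so the same output samples are touched by two blocks; in a CPU-GPU setting this redundancy shows up as extra transferred data, which is the drawback the answer points out.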
0:25:47 Oh yeah, I'm sorry, I didn't mention it: p means the number of parts the data are decomposed into. So in fact, with overlap-save and overlap-add, with an increasing number of sub-parts you get a higher number of data transfers. Okay.