0:00:16 Okay, so good morning everyone, it's my pleasure to present our work here. My name is Pavel Karas, I'm from the Centre for Biomedical Image Analysis at Masaryk University, Brno, and in my talk I would like to present work done together with my colleague, who is from this faculty.

The outline of my talk is as follows. First I will shortly recall convolution and how to compute it, then I will talk about the motivation of our work, which is the use of convolution in image processing and the acceleration of convolution on graphics cards. Then I will present our algorithm, I will show the results, and finally I will conclude my talk.

First, what is convolution? I'm sure most of you are familiar with convolution, or at least have heard about it, but just to recall some basics: it is an essential operation in image and signal processing. To show an example on these two images: convolution takes two images as input and produces one image as output, and one of the input images is usually called the convolution kernel. What actually happens is that at every pixel of the output image, the value is computed as the sum of the point-wise multiplication of the convolution kernel and an appropriate window of the input image.

There are several approaches to computing the convolution. One of them is the so-called naive approach, from the definition, which is the slowest one. Then there are some faster solutions like separable convolution or recursive filters, which are, however, suitable only for some specific kinds of convolution kernels. For a large general convolution kernel, it turns out that the best solution is so-called fast convolution, which is convolution computed in the frequency domain according to the convolution theorem. That means that we compute the Fourier transforms of the input image and the convolution kernel, we compute a point-wise multiplication, and then we compute the inverse Fourier transform of the result.

Then the question is why to accelerate convolution on a GPU. The answer is simple: nowadays, GPU implementations achieve up to approximately ten times speedup over optimized CPU solutions. If you look for resources, you can find some convolution implementations in the NVIDIA documentation; for example, you can find naive and separable convolution in the CUDA SDK. You can also find an implementation of the fast Fourier transform in the cuFFT library, which can help you to implement the fast convolution.

Now, the motivation for our work is the following. In our laboratory we develop a simulator of optical microscopy images; you can try it yourselves at this web page. The simulation itself is divided into three phases: in the first step, an artificial image is created; in the second step, the image is blurred according to the configuration of a real optical microscope; and in the final phase, noise is added. The reason why we use this simulator is that we can generate as many images as we want, and with these images we know the ground truth, we know the number of objects, so that we can test image processing algorithms on these images and evaluate how they perform. The essential part is the second phase, where the convolution of an input image and a so-called point spread function is computed, and in this phase the convolution is computed over large images. For your information, the input images can have 1000 × 1000 × 100 pixels and the convolution kernel can have 100 × 100 × 100 pixels, so these images are really big.

If one tries to implement this convolution on a GPU, it turns out that the GPU implementation has a significantly lower computation time than CPU implementations.
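The fast-convolution pipeline just described (transform both inputs, multiply point-wise, transform back) can be sketched in a few lines. This is a minimal 1-D pure-Python illustration, not the CUDA code from the talk; a naive O(n²) DFT stands in for cuFFT so the example is self-contained:

```python
import cmath

def dft(x, inverse=False):
    # Naive O(n^2) discrete Fourier transform -- a stand-in for the FFT
    # (cuFFT on the GPU) so the example stays dependency-free.
    n = len(x)
    s = 1 if inverse else -1
    out = [sum(x[m] * cmath.exp(s * 2j * cmath.pi * k * m / n) for m in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def conv_direct(img, ker):
    # Circular convolution straight from the definition (the "naive" way).
    n = len(img)
    return [sum(img[(k - m) % n] * ker[m] for m in range(n)) for k in range(n)]

def conv_fast(img, ker):
    # Convolution theorem: transform both inputs, multiply point-wise,
    # transform back.
    F, G = dft(img), dft(ker)
    return [v.real for v in dft([f * g for f, g in zip(F, G)], inverse=True)]

img = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
ker = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print([round(v, 6) for v in conv_fast(img, ker)])  # agrees with conv_direct(img, ker)
```

Note that the frequency-domain route computes a circular convolution; padding the inputs (discussed in the Q&A at the end) makes it equal to the ordinary linear convolution.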
However, there are two drawbacks. The first one is that the GPU implementation is basically for complex data; it is not optimized for images which have real values, so if we compare it with an optimized CPU solution, the speedup is not that big. The other drawback is that if the input image is too large, then we don't have enough GPU memory to compute the convolution. So the goals are, first, to decompose the problem of convolution, and second, to optimize it for real data.

As for the decomposition of the problem, there are several approaches, but with respect to the number of operations and the number of data transfers between CPU and GPU, it turns out that the best solution is the so-called decimation-in-frequency algorithm, which is well known from the fast Fourier transform. So we use this decimation-in-frequency algorithm, and the idea looks basically as follows: on the CPU we compute the first step of the fast Fourier transform according to these equations, and this allows us to split the input data into P parts, in this case two parts, so that we can compute the fast Fourier transform, and subsequently the convolution, in two parts separately.

As for the optimization for real data, the algorithm changes slightly: we add two steps to this workflow. We convert the data from real to complex, and in the final phase we combine the results. The first step, the transformation of real data into complex data, is actually trivial, because what we do is take the odd pixels as the real part of a complex number and the even pixels as the imaginary part of the complex image. What we get is a complex image of half size, without doing any computations or data transfers; we just recast the array in computer memory.
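The real-to-complex packing and the later recombination can be sketched as follows. This is a hedged pure-Python sketch of the general technique, not the talk's implementation: adjacent samples of a real signal are recast into one complex sample (here even indices go to the real part and odd indices to the imaginary part, under 0-based indexing), a half-length transform is computed, and a cheap recombination pass recovers the full spectrum:

```python
import cmath

def dft(x):
    # Naive O(n^2) DFT, standing in for the FFT run on the GPU.
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]

def real_fft_via_packing(x):
    # Pack even-indexed samples as the real part and odd-indexed samples as
    # the imaginary part of a half-length complex array -- a recast only,
    # no arithmetic and no data transfer.
    n = len(x)
    h = n // 2
    z = [complex(x[2 * m], x[2 * m + 1]) for m in range(h)]
    Z = dft(z)
    # Cheap recombination: split Z into the spectra of the even and odd
    # sub-sequences, then merge them with twiddle factors.
    X = [0j] * n
    for k in range(h):
        zc = Z[-k % h].conjugate()
        E = (Z[k] + zc) / 2        # spectrum of even-indexed samples
        O = (Z[k] - zc) / 2j       # spectrum of odd-indexed samples
        w = cmath.exp(-2j * cmath.pi * k / n) * O
        X[k] = E + w
        X[k + h] = E - w
    return X

x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
err = max(abs(a - b) for a, b in zip(real_fft_via_packing(x), dft(x)))
print(err)  # tiny -- the half-size transform plus recombination matches the full DFT
```

The point of the trick is that the expensive transform runs at half length; only the recombination pass, which is linear in the data size, is extra work.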
This is done essentially instantly. The final recombination phase takes some time, but it is not that computationally expensive.

To summarize the whole algorithm: we take the input signal and the input kernel, we decompose both of them and recast them to complex signals; this is done on the CPU. Then we compute all the Fourier transforms, the point-wise multiplications, and the inverse Fourier transforms on the GPU; this is the computationally expensive part, so we try to keep most of the computation on the GPU. Finally, we transfer the data back, compose them together, and recombine them, so that we get the result.

Now I will speak about some experiments. This is the configuration we used, and we used two datasets. In the first dataset, the images have dimensions that are powers of two, which is optimal for the fast Fourier transform, so these images are expected to have good computation times. In the second dataset, the images are slightly larger, which means that the computation times will be longer. As you can see in this graph, on the x-axis we have the image size, and on the y-axis we have the performance, which is computed as the ratio between the output image size and the computation time, so higher is better. The GPU solution optimized for real data performs best, and even the GPU solution without the optimization for real data performs better than the best CPU solution; with the optimization for real data we gain about a factor of two more. This graph is for the first dataset, and this one is for the second dataset. Clearly the performance goes down when the images don't have dimensions of a specific size, but for both datasets it turns out that the GPU solution is much faster than the CPU one, and the speedup achieved is approximately five to ten, depending on the input image. We also performed tests with a fixed input image.
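The decomposition step described above — computing the first decimation-in-frequency stage on the CPU so that the parts can be transformed independently — can be sketched like this for P = 2. Again this is a pure-Python illustration with a naive DFT, not the talk's CUDA code:

```python
import cmath

def dft(x):
    # Naive O(n^2) DFT, standing in for the per-part FFT run on the GPU.
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]

def dif_split(x):
    # One decimation-in-frequency step, done on the CPU: it produces two
    # half-length signals whose DFTs are the even and odd frequency bins
    # of DFT(x), so the two halves can be processed separately.
    n = len(x)
    h = n // 2
    x1 = [x[m] + x[m + h] for m in range(h)]
    x2 = [(x[m] - x[m + h]) * cmath.exp(-2j * cmath.pi * m / n) for m in range(h)]
    return x1, x2

x = [complex(v) for v in [3, 1, 4, 1, 5, 9, 2, 6]]
x1, x2 = dif_split(x)          # split on the CPU
even, odd = dft(x1), dft(x2)   # each half fits in (and can run on) one GPU
X = dft(x)                     # reference: the full transform
print(max(abs(even[k] - X[2 * k]) for k in range(4)))      # tiny
print(max(abs(odd[k] - X[2 * k + 1]) for k in range(4)))   # tiny
```

Applying the split recursively gives P = 4, 8, … parts, which is what the P parameter in the next experiment controls.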
We computed the performance for different values of the parameter P. The parameter P is the number of sub-images the data are split into: one means that no decomposition is made, two means that the data are decomposed into two parts, and so on. We can clearly see that the decomposition imposes some performance decrease, but then, with increasing P, the performance doesn't decrease that much, and it always stays significantly higher than the performance of the CPU implementation. As for the GPU memory that is required, it of course decreases. With P equal to two or P equal to four, the memory requirements are the same, because of the properties of the recombination phase, but you can see that the memory requirements can be decreased to any level you desire, according to the graphics card you have.

As for the numerical precision, it turns out that the error of the output image is under the level of one. That means that the maximum difference between the output image computed on the GPU and the output image computed on the CPU in long double precision is less than one, so the images are essentially the same. The images that we use have an effective bit depth of about twelve bits, which is common for optical microscopy images. If one wants to use images of higher bit depth, let's say sixteen or thirty-two bits, then this precision might not be enough; in that case, one should use not single precision but, let's say, double precision for the computation. Fortunately, modern graphics cards allow the use of double precision.

We also tried to implement a multi-GPU solution, and we made some optimizations of the synchronization so that the overall time of the multi-GPU solution is as low as possible. However, it turns out that with two graphics cards one definitely does not achieve a two-times speedup. In fact, for some images, two GPUs perform more or less the same as one GPU, because of the overhead of the data transfers. For some images one can achieve maybe a 1.5× speedup, so it can be interesting for some kinds of input images.

To complete my talk: our algorithm for the computation of convolution on the GPU has actually no limit on the data size; the only limit is the CPU memory, but there are approaches to decompose the data even if you don't have enough CPU memory, so for extremely large input images you can combine our approach with some other approaches to perform the computation. The algorithm is optimized for real data, which are the most common in practical applications. It has quite stable performance, meaning that with an increasing number of sub-images the data are split into, the performance doesn't decrease significantly. There are no redundant data transfers or computations, and a multi-GPU solution is also possible: if you have many graphics cards, you can split the data into more parts, send them to the individual graphics cards, and let them compute in parallel. The disadvantages are that, as I told you, the multi-GPU solution is not that efficient, and the numerical precision may not be sufficient for some applications, so one should consider using double precision for the computation. So, that's all to conclude my talk. Thank you very much for listening, and I'm looking forward to your questions.

[Audience question about the hardware and CPU baseline.]

Yes, thank you, those are definitely interesting questions. I didn't mention it, but it was on the slide: the CPU was a quad-core Intel Core i7, and the GPU was a GeForce GTX 480, and the CPU implementation was based on the FFTW library, which is probably considered the fastest FFT library, so we used as fast a CPU implementation as possible.

[Audience question about the kernel size.]

Yes — let me put it differently: in fast convolution, both the data and the kernel have to be padded to the same size. So in fact it doesn't matter whether you have a kernel of 100 × 100 × 100 or 200 × 200 × 100, because in the end everything is padded to the size of the input image plus the size of the kernel, which is something like 1100 × 1100 × 200, for example. For the sake of simplicity, during our experiments we used a convolution kernel of the same size as the input image, but it really doesn't matter how large the convolution kernel is.

[Audience question about the precision used on the CPU.]

Because for the numerical precision test we wanted to compare single precision on the GPU against the ground truth, to make sure that the results are correct, for that test we used long double precision on the CPU. However, for the performance tests we used single precision both on the CPU and on the GPU. I'm sorry if that wasn't clear during the talk.

[Audience question about block convolution.]

Block convolution — you mean splitting the data into small blocks and performing the convolution with overlaps? Yes; at the beginning I was working on this approach as well, but its disadvantage is that the number of operations, in terms of asymptotic complexity, is higher, and so is the number of data transfers, because of the overlaps. It turns out that with a GPU implementation the number of data transfers is essential, so we wanted to keep it as low as possible, and for this reason we used decimation in frequency instead of the overlap-save and overlap-add methods, because we got a significantly lower number of data transfers.

Yes, P — I'm sorry, I didn't mention it: P means the number of parts the data are decomposed into. And in fact, with overlap-save and overlap-add, with an increasing number of sub-parts you get a higher number of data transfers. Okay.
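The padding rule from the kernel-size answer can be illustrated in 1-D. Both inputs are padded to at least (signal length + kernel length − 1) before the transforms, so the circular convolution computed in the frequency domain equals the ordinary linear convolution; the talk rounds this up to signal size plus kernel size. A pure-Python sketch with a naive DFT in place of cuFFT:

```python
import cmath

def dft(x, inverse=False):
    # Naive O(n^2) DFT / inverse DFT, a stand-in for the FFT.
    n = len(x)
    s = 1 if inverse else -1
    out = [sum(x[m] * cmath.exp(s * 2j * cmath.pi * k * m / n) for m in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def fft_linear_conv(signal, kernel):
    # Pad both inputs to len(signal) + len(kernel) - 1 so that the circular
    # convolution computed through the transforms equals the linear
    # (zero-boundary) convolution.
    n = len(signal) + len(kernel) - 1
    a = list(signal) + [0.0] * (n - len(signal))
    b = list(kernel) + [0.0] * (n - len(kernel))
    F, G = dft(a), dft(b)
    return [v.real for v in dft([f * g for f, g in zip(F, G)], inverse=True)]

print([round(v, 6) for v in fft_linear_conv([1.0, 2.0, 3.0], [1.0, 1.0])])
# -> [1.0, 3.0, 5.0, 3.0]
```

This is also why the kernel's original size barely matters for the cost: after padding, both transforms run at the common padded size.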