0:00:16 | okay a |
---|---|

0:00:19 | so good morning everyone it's my pleasure to present here of work my name is |

0:00:24 | public address i'm from centre for by medical image analysis in uh at some uh |

0:00:29 | select university banal and in my talk i would like to present the work of |

0:00:34 | the in my colleagues at that with when on about same chi who is from |

0:00:39 | uh this faculty |

0:00:42 | uh in my talk uh or the outline of my talk would be as follows |

0:00:49 | so first uh i will shortly uh recall uh convolution and how to compute its |

0:00:57 | and then i will talk about motivation of a little more which is use of |

0:01:01 | convolution in image processing and uh acceleration of convolution on a graphic cards |

0:01:09 | uh then i will present our algorithm uh i will show the results on that |

0:01:16 | uh finally i will conclude my talk so uh |

0:01:22 | first uh what is convolution i'm sure most of you are uh familiar with a |

0:01:29 | convolution or at least heard about it but uh just to recall uh some basics |

0:01:37 | uh it's uh and essential operation in the image or signal processing and uh |

0:01:45 | uh to show an example on a uh |

0:01:50 | here on the two images in fact convolution takes two images as an input |

0:01:57 | and uh it has one images and i'm |

0:02:01 | and uh one of the input images is usually called a convolution kernel so what |

0:02:07 | actually happens is that uh |

0:02:11 | in at every pixel of and output image the value is computed as the sum |

0:02:17 | of all well point wise multiplication or for the convolution kernel and an appropriate window |

0:02:25 | the input image |

0:02:28 | um there are several approaches how to compute the convolution uh one of them is |

0:02:35 | the so-called naive approach from the definition which is uh the slowest one then there |

0:02:41 | are some uh some faster solutions like separable convolution or recursive filters uh which are |

0:02:50 | however a suitable only for some specific uh kind of convolution kernels uh for a |

0:02:58 | large general convolution kernel it turns out that the best solution a so called fast |

0:03:04 | convolution which is uh |

0:03:08 | uh actually convolution computed in the frequency domain according to so-called convolution theorem that means |

0:03:16 | that we compute fourier transforms of input image and the convolution kernel we compute point |

0:03:23 | wise multiplication and then we compute uh the inverse fourier transform of the results |

0:03:31 | uh then the question is a white to accelerate convolution on gpu uh the answer |

0:03:39 | is simple uh nowadays gpu implementations uh achieve up to approximately ten times speedup over |

0:03:48 | uh optimized cpu solutions |

0:03:51 | and the uh if you look for uh resources you can find some of convolution |

0:03:59 | implementations in uh and we D a uh documentations for example you can find night |

0:04:06 | and separable convolution income to its decay you can also find uh |

0:04:13 | implementation of the fast fourier transform in those the cufft library which can help you |

0:04:20 | to implement the fast convolution |

0:04:24 | uh no motivation for our work |

0:04:27 | is uh |

0:04:29 | uh following uh and our laboratory we develop uh a simulator of optical microscopy images |

0:04:39 | uh you can try it uh by herself a at this web page |

0:04:44 | cheers |

0:04:46 | and's uh |

0:04:48 | and this simulation itself uh is divided into three phases in and the first step |

0:04:56 | and artificial image is created in the second step the image is blurred according to |

0:05:04 | a configuration of for real optical microscope and in this final face uh and noise |

0:05:12 | is added and the reason why we use this simulator is that uh we can |

0:05:18 | we can generate as much images as uh as many images as we want and |

0:05:24 | uh with these images of we know the ground truth we know the number of |

0:05:30 | objects and therefore there's so that we can uh test the image processing algorithms on |

0:05:37 | these images and we can evaluate how they perform |

0:05:42 | uh the essential part is this second phase where the convolution of an input image |

0:05:49 | and so-called point spread function is computed and uh at this phase the convolution is |

0:05:56 | computed over a large images to for your uh information uh the input images can |

0:06:05 | have uh thousand four thousand four hundred pixels and the convolution kernel can have hundred |

0:06:12 | four hundred four hundred pixels so these images are really big |

0:06:17 | uh |

0:06:19 | if uh if one tries to implement this convolution on gpu uh |

0:06:27 | it turns out that the gpu implementation has significantly lower computation time than cpu implementations |

0:06:37 | but there are two drawbacks first one is that the uh the gpu implementation is |

0:06:45 | uh basically for complex data uh it's not optimized for uh images which are uh |

0:06:54 | which have real values |

0:06:57 | so if we compare the solution with the optimized cpu solution it's not the speed |

0:07:03 | up is not that big the other drawback is that uh if the input image |

0:07:09 | is too large then we don't have you know gpu memory to compute the convolution |

0:07:16 | so the goals are to pressed to decompose the problem of convolution and the second |

0:07:23 | one is to optimize it for real data |

0:07:27 | uh as for the decomposition of problem there are several approaches but uh with respect |

0:07:34 | to number of operations and number of data transfers between cpu and gpu it turns |

0:07:40 | out that uh the best solution is so called decimation in frequency algorithm which is |

0:07:45 | uh well known from the fast fourier transform uh algorithm |

0:07:52 | so we use these decimation in frequency algorithm and the idea looks basically as follows |

0:08:00 | uh on cpu we actually compute the first step of the fast fourier transform according |

0:08:09 | to uh these equations |

0:08:11 | and the this allows us to split the input data into and parts in this |

0:08:18 | case it's two parts and we can compute the fast fourier transform and subsequently the |

0:08:25 | convolution uh in two parts separately |

0:08:32 | as for the optimisation for a real data the algorithm changes slightly so we add |

0:08:40 | two steps into these uh these the work flow uh we convert the data from |

0:08:48 | real too complex and in the final phase we somehow we combine the results and |

0:08:55 | uh these first steps the transformation of real data into complex data is actually a |

0:09:02 | trivial because uh what we do is actually that we take |

0:09:08 | the old pixels as the real part of a complex number and we take even |

0:09:15 | pixels as a complex part of uh |

0:09:20 | as imaginary part of or for complex uh image and what we get is a |

0:09:27 | complex image of house is |

0:09:30 | without doing any computations or data transfers we just recast |

0:09:36 | the a error rate and uh the computer memory |

0:09:42 | so |

0:09:43 | so this is done in a uh instantly the final phase uh take some time |

0:09:51 | but it's not uh it's not that's computationally expensive |

0:09:57 | um |

0:10:00 | and to summarize the whole are written uh |

0:10:05 | we get the input signal and input kernel we decompose both of them and recast |

0:10:13 | them to complex signals this is done on cpu then we compute all fourier transforms |

0:10:21 | point wise multiplication so and inverse fourier transforms on gpu this is a computationally expensive |

0:10:28 | part so we try to keep the most computations on gpu |

0:10:33 | and finally each uh we transferred data back and compose them together and recombine then |

0:10:42 | so that uh |

0:10:44 | we get uh the result |

0:10:49 | uh now i will speak about some experiments so this is the configuration we used |

0:10:57 | and uh we used to dataset |

0:11:01 | uh actually the first dataset has uh in the first dataset image have images have |

0:11:09 | well uh the dimensions that are powers of two which is the optimal for which |

0:11:16 | is optimal for fast fourier transform so these images are expected to have uh |

0:11:24 | good uh computation times i in the second dataset the images are slightly larger and |

0:11:32 | that means that to the computation times will be uh will be uh longer |

0:11:40 | and as you can see on this graph uh on X axis we have image |

0:11:48 | size on y-axis we have performance which is computed as a ratio between the output |

0:11:55 | image size and the computation time so the higher is better |

0:12:00 | uh |

0:12:03 | the |

0:12:05 | uh the gpu optimized for real data performs best |

0:12:12 | and the |

0:12:16 | uh |

0:12:20 | even gpu of without the optimisation for a real data performs better than the best |

0:12:28 | cpu solution but we optimize section for the real data we get like a two |

0:12:34 | times more and more as you know |

0:12:37 | uh this graph is for the first dataset |

0:12:43 | these graphics for second dataset so clearly the performance goes down when uh when the |

0:12:50 | image don't have dimensions of a specific size but uh for any data sets it |

0:12:57 | turns out that the gpu uh solution is uh much uh much of faster than |

0:13:03 | the cpu one and the speed up to achieve this approximately uh about uh five |

0:13:10 | two to ten depending on uh the input image |

0:13:18 | and uh we also come uh perform test |

0:13:24 | with uh a fixed input image and we uh computed the performance with different values |

0:13:32 | of the parameter P parameter means that uh |

0:13:38 | B parameter is actually the number of four of sub images the data are split |

0:13:45 | into so one means that no decomposition is actually made |

0:13:51 | two means that data are decomposed into two parts and so we can clearly see |

0:13:57 | that the decomposition uh imposes some uh some uh |

0:14:03 | performance decrease |

0:14:06 | but then with increasing be parameter the performance doesn't decrease that much and that it |

0:14:14 | stays uh |

0:14:17 | it's always higher significantly higher than the performance of cpu uh implementation as for the |

0:14:26 | gpu memory that is required it uh of course it uh decreases with the be |

0:14:36 | equal to two or be equal to four it's the that you can memory requirements |

0:14:42 | are the same because of the uh properties of the recombination phase |

0:14:51 | but you can see that uh |

0:14:54 | uh |

0:14:55 | the memory requirements can decrease |

0:15:00 | uh two |

0:15:04 | can i can decrease to any number you desire according uh to the graphic arts |

0:15:11 | you have |

0:15:13 | as for the numerical precision it turns out that uh |

0:15:19 | uh that uh the error of the output image is under the level of well |

0:15:29 | all for one that means that the |

0:15:35 | the maximum difference between output image computed on gpu and output image computed on cpu |

0:15:44 | in long double precision |

0:15:46 | is less uh use less than one and that means that uh the images are |

0:15:53 | uh essentially the same but uh |

0:15:57 | the images that we use hats uh and uh actually effective uh a bit that's |

0:16:06 | all about twelve bits which is uh which is uh come on it that's for |

0:16:15 | a optical microscopy images so if one have |

0:16:20 | you for wanna use this images of higher bit that let's at sixteen or thirty |

0:16:28 | two then uh this precision |

0:16:32 | uh change D not enough so um |

0:16:39 | so in this case one should use not single precision for computation but uh let's |

0:16:45 | see double precision uh fortunately modern graphic cards allow to use double precision |

0:16:54 | uh we also tried to implement um more multi-gpu solution |

0:17:01 | and uh we made some optimisation with synchronisation so that uh |

0:17:09 | the overall time of the multi-gpu uh multi-gpu solution is as best as possible however |

0:17:18 | it turns out that the with too graphic cards one does not definitely achieve two |

0:17:24 | times speed up uh in fact for some images two gpus perform more or less |

0:17:31 | the same this one gpu and that's because of the overhead of data transfers |

0:17:38 | for some images one can achieve uh maybe one point five speed up |

0:17:46 | so it can be interesting for some uh some kind of input images |

0:17:53 | uh to complete my talk uh |

0:17:57 | our algorithm for uh computation of uh on convolution on gpu uh has actually known |

0:18:05 | timit an only needs of data sides the only name it is actually the cpu |

0:18:10 | memory but there are some approaches to uh decompose the data even if you don't |

0:18:18 | have enough to cpu memory so for extremely large uh input images uh |

0:18:26 | you can uh you can combine maybe our approach with some other approaches to perform |

0:18:33 | the computation |

0:18:36 | oh |

0:18:37 | the algorithm is optimized for real data uh which are uh oh |

0:18:43 | most common in practical applications it has quite stable performance that means that uh with |

0:18:51 | uh increasing the number of sub images the data are split into the performance doesn't |

0:18:57 | decrease significantly uh there are non rigid and data transfers so our computations |

0:19:04 | and the multi-gpu solution is also possible so if you have many graphic cards you |

0:19:11 | can split the data into uh more parts and send the data into uh individual |

0:19:18 | graphic cards let then compute parallel and the disadvantages is that as i told you |

0:19:26 | the multi gpu solution is not that efficient |

0:19:30 | and uh the numerical precision may not be sufficient for some applications so uh what |

0:19:36 | should consider uh like uh using uh double precision for the computation |

0:19:45 | so uh that's all to conclude my talk thank you very much for listening and |

0:19:50 | i'm looking forward to have questions |

0:20:15 | uh_huh |

0:20:23 | uh_huh yes thank you that's and that are definitely interesting questions uh |

0:20:30 | so uh i didn't i didn't i didn't mention it but it wasn't slide so |

0:20:36 | the cpu was kinda like a four course |

0:20:41 | intel core I seven and the gpu was uh you course you D X of |

0:20:47 | four hundred and eighty |

0:20:50 | and the cpu implementation was based on the fftw uh a library so probably |

0:21:00 | uh it is considered the fastest |

0:21:04 | fft library uh so we use the |

0:21:11 | as fast as cpu implementation is possible |

0:21:29 | uh yes uh we actually of for the sake of the simplicity we |

0:21:38 | uh |

0:21:40 | yeah |

0:21:42 | uh oh |

0:21:44 | uh i will put in different we uh in a fast convolution the current knows |

0:21:51 | both the data both the data and can also have to be uh that it |

0:21:56 | in to the same size so in fact it doesn't matter if you have uh |

0:22:01 | current no one hundred or one hundred per one hundred or two hundred or two |

0:22:05 | hundred or one hundred because |

0:22:09 | at the end it's uh it is but it's to the size |

0:22:13 | oh of well |

0:22:18 | uh of size of input image last the size of kernel so it is like |

0:22:24 | a one thousand or one some part two hundred for example and uh for the |

0:22:29 | sake of simplicity during our experiments we use uh we use the |

0:22:35 | uh the input image of the same size as the uh convolution kernel but uh |

0:22:42 | it really doesn't matter |

0:22:45 | how large the convolution car noise |

0:22:59 | uh_huh |

0:23:16 | oh because uh for uh in america precision test we wanted to compare the single |

0:23:24 | precision on gpu with uh the ground truth |

0:23:29 | because we wanted to make sure that the results are correct so that for this |

0:23:35 | test we use the along double precision on cpu however for performance test we use |

0:23:42 | the single precision both on cpu and gpu |

0:23:49 | i'm sorry it's the wasn't that clear during the talk |

0:24:00 | i |

0:24:02 | it's |

0:24:15 | uh block convolution you mean you mean uh splitting it into small blocks and you |

0:24:22 | uh performing poorly yeah or what overlap that yeah that's uh |

0:24:29 | i that's the this approach it's timing |

0:24:35 | and uh actually unlikely to and of its what i swore i was working on |

0:24:42 | this approach and its uh but the disadvantage of this approach is that |

0:24:49 | the number of operation |

0:24:52 | uh in means of a asymptotic complexity is uh higher |

0:24:57 | and so is the number of data transfers because of the overlaps |

0:25:04 | and uh it turns out that the with the gpu implementation the number of data |

0:25:09 | transfers is essential |

0:25:11 | so we uh we wanted to keep the number of data transfers as low as |

0:25:16 | possible and the |

0:25:18 | uh for this reason we use the decimation in frequency instead of instead of overlap |

0:25:26 | set safe overlap-add method because we got significantly lower number of uh data transfers |

0:25:44 | uh_huh |

0:25:47 | yeah D oh yeah i'm sorry i didn't mention it D means number of parts |

0:25:52 | the data are decomposed into |

0:25:55 | so in fact with the overlap save an overlap add with increasing number of some |

0:26:02 | parts you get uh you get higher number of data transfers |

0:26:17 | okay |