Good morning, I am here to present this work. The starting point in audio-visual signal processing is an audio-visual signal, which is composed of two modalities: the video part and the audio part. The video is recorded with a camera and the audio with a microphone. There could be more cameras and more microphones, but here we prefer this setup because it is the simplest problem and also the most common one in this domain. The video and the audio signals are very different, but they share the temporal axis. The resolution of this axis is different: you have many more audio samples than video frames, because the sampling rate of the sound is much higher than the frame rate of the video.

The main idea in audio-visual signal processing is to combine both modalities in order to extract the maximum amount of information from a given scene. There are several applications. For example, in speech recognition you can use the video modality in order to better understand the speech, or you can combine the information in both signals in order to localize the sound sources. In this domain the main assumption is that related events in both channels happen more or less at the same time. In this example, for instance, you have a guy who is playing the guitar, and the guitar sounds are correlated with the movements of his hand: they happen more or less at the same instant. That is the main assumption that we will use in this work.

Now let's move to the goal. What we want to do in this work is to extract the audio-visual objects in a scene. It looks like this: we have a sequence where there are two objects. The first object, the speaker in this case, is associated with the soundtrack, and there is another person who is moving the lips, but we cannot hear any sound from him. So this one is the distractor and this one is the audio-visual object. What we want to do is to extract the part of the video which is associated with the soundtrack, without the interference of the distractor.

Why do we want to do this? Because many applications in this domain do not need the entire signal, just the part of the signal which is associated with the sound. In lip reading, for example, you just need the speaker's lips, or the region around the mouth; we do not care if there is another person or a moving object nearby.

The approach that we are going to follow is this one. First we have the sequence, and we want to identify the regions whose motion is correlated with the soundtrack: the regions of interest from the audio-visual point of view. For this purpose we will use an audio-visual diffusion process. Once we have a map of the correlation between the motion in the video and the soundtrack, where the regions most synchronized with the sound appear in white, we can extract the video regions which are most correlated. As you see in this sequence, for example, we had this person who was speaking while we were recording, and we can extract this region. Then, once we have this starting point, we use a segmentation approach based on graph cuts in order to extract the whole region which is correlated with the sound: the region which is more or less homogeneous in colour and which has a high synchrony with the soundtrack.
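To make this pipeline concrete, here is a minimal sketch of how such a correlation map could be computed. This is only an illustration under assumed choices, not the method of the paper: it takes per-frame audio energy as the audio feature, absolute temporal differences as the motion feature, and a plain normalized correlation over the whole clip; the sampling rates are assumed too.

    import numpy as np

    def av_correlation_map(frames, audio, sr=16000, fps=25):
        """Per-pixel correlation between video motion and audio energy.

        frames: (T, H, W) float grayscale video; audio: 1-D waveform.
        sr and fps are assumed sampling rates.
        """
        T = frames.shape[0]
        spf = sr // fps  # audio samples per video frame
        # Frame-level audio energy, aligned to the video temporal axis.
        energy = np.array([np.sum(audio[t * spf:(t + 1) * spf] ** 2)
                           for t in range(T)])
        # Per-pixel motion: absolute temporal derivative of the intensity.
        motion = np.abs(np.diff(frames, axis=0))           # (T-1, H, W)
        e = energy[1:] - energy[1:].mean()
        m = motion - motion.mean(axis=0)
        # Normalized correlation over time at every pixel.
        num = (m * e[:, None, None]).sum(axis=0)
        den = np.sqrt((m ** 2).sum(axis=0) * (e ** 2).sum()) + 1e-8
        return num / den                                    # (H, W), in [-1, 1]

Thresholding the returned map would give the white regions mentioned above; the actual work obtains this map through a diffusion process rather than through this direct correlation.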
That is the first part: we want to know where the sound sources are. Here we use the audio-visual video diffusion that was presented at ICASSP last year. The objective is to remove all the information which is not associated with the soundtrack; we want to preserve just the information which is interesting from an audio-visual point of view. In this case, for example, there is a hand playing a piano and there is an object moving in the background; we would like to preserve the hand and blur what is not interesting.

To do it, we define an audio-visual diffusion coefficient which is a function of the synchrony between the motion and the sound. Here you see it: the diffusion coefficient is a function of this audio-visual synchrony measure, which combines the audio energy and the temporal derivative of the video signal, which is the motion in the video. We can then smooth the regions without motion in order to reduce the effect of noise. As you can see here, the diffusion coefficient is a function of this synchrony measure: when the synchrony is high, the diffusion coefficient is close to zero and the diffusion stops; when the synchrony is low, so the video motion and the sounds are not correlated, the diffusion coefficient is constant and equal to one, and the region is blurred.

This is the result on this sequence. I am not sure if you can see it, but here the points on the audio-visual object are still sharp; it is difficult to see, but the diffusion smooths everything except the audio-visual object, and the distractor is blurred.

Now let's see what happens to the motion in this sequence. Again, I am not sure if you can see it, but the motion in the original frame has more or less equal magnitude on the distracting moving object, the head of the rocking horse, and on the audio-visual object. After the diffusion process, the main intensity of the motion is situated on the audio-visual object. Now, since we do not want to penalize regions with low motion, what we need to do is to compare the motion after the diffusion to the motion before the diffusion, so that we see how much each region has been diffused through the process.

Again, you might not see it well, but if we plot just the highest values of these features, we see that at the beginning the high values of the original motion are equally distributed between the audio-visual object and the distractor: we have more or less the same number of points on the hand and on the head of the horse. After the diffusion, most of the high values are already situated on the hand, which is generating the sound. And finally, when comparing both motions, there are just a couple of spots where the high values are misclassified.

So now we have the points with the highest correlation, and what we want to do is to extract the whole region. For that we use an audio-visual segmentation approach. We need some starting points for the segmentation process; we call them the seeds. The seeds for the source are the pixels with the highest audio-visual values: the pixels, or the regions, which move in accordance with the soundtrack. Since we do not want to make any assumption about the background, we do not want to say that the regions which are less correlated with the soundtrack are the background; we do not want to make this assumption, because it would mean pretending to know something about the background.
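The diffusion itself can be pictured with a short sketch. This is a toy illustration, not the scheme of the paper: it assumes a coefficient of the form c = exp(-(s/k)^2), which is roughly one where the synchrony s is low (the region gets blurred) and close to zero where s is high (the diffusion stops), matching the behaviour described above. The constants k and lam are assumed.

    import numpy as np

    def diffusion_step(image, synchrony, k=1.0, lam=0.2):
        """One explicit step of diffusion driven by audio-visual synchrony."""
        # Audio-visual diffusion coefficient: ~1 for low synchrony, ~0 for high.
        c = np.exp(-(synchrony / k) ** 2)
        # Discrete Laplacian of the image (4-neighbour differences).
        lap = (np.roll(image, 1, 0) + np.roll(image, -1, 0) +
               np.roll(image, 1, 1) + np.roll(image, -1, 1) - 4.0 * image)
        # Diffuse (blur) only where the synchrony is low.
        return image + lam * c * lap

Iterating this step blurs the asynchronous parts of the frame while the synchronous object stays sharp, which is the behaviour shown on the rocking-horse sequence.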
But then, how do we obtain the seeds for the background? We place them along the border of the frame, and we do not fix any seeds inside the video, because we do not want to condition the result on pixels where we do not know what is there. In fact there are pixels which are most probably background, but we prefer not to fix them, and here you see the results obtained with our method.

Now I will explain how, from this starting point, we reach this result. We use a graph cut segmentation approach, and we introduce an audio-visual term whose purpose is to keep together regions with high audio-visual synchrony. Typically, we want to minimize this energy. First we have the data term, which compares the colour of each pixel with the estimates of the colour for the background and for the foreground. Then we have the boundary term, which keeps together neighbouring pixels that have a similar colour. And we define this audio-visual term so that it keeps together regions which present a high audio-visual synchrony. The first two terms are commonly used; the last term is new.

Let's study the audio-visual term more deeply. This term forces the segmentation to keep together regions with high audio-visual synchrony but, in contrast, it does not affect regions with low synchrony. We define it like this: it is proportional to the audio-visual coherence. When two neighbouring pixels have a high and similar audio-visual synchrony, we keep them together through the segmentation process: they act like a block. In contrast, when two neighbouring pixels have a different audio-visual synchrony, they are likely to be separated from each other, and we do not keep them together. And when the audio-visual synchrony is low, we do not do anything: this term does not affect the segmentation, and we let the other terms, the data term and the boundary term, do the work.

Here is the starting point of the segmentation: the seeds for the source, the seeds for the background, and everywhere else the algorithm decides. We see that in this case this person is speaking, and most of the seeds fall on the mouth of this person, while the right person is not speaking. On the bottom line are the results without the audio-visual term: we extract just a part of the mouth of the speaker. When we add this term, the mouth of the speaker is classified as a block and we can extract a bigger region: in this case we extract the face rather than the mouth.

Let's compare our method with previous methods. The main difference is that previous methods assume that pixels or regions presenting a low audio-visual synchrony cannot belong to the foreground; they cannot belong to the source. In our case we do not make this assumption, since we want to extract a region which is homogeneous in colour, so that we can extract regions like, for example, the forehead of the speaker. As you see here, the forehead is not moving, and previous methods cannot extract the whole face because they assume that if the synchrony is low the region cannot be part of the source. Since we do not make this assumption, our results are more complete. Our results are also more robust to distracting motion: here the left person is speaking and the right person is just moving the lips.
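As a rough reconstruction of the energy described above (the exact notation in the paper may differ, so take lambda and mu as assumed trade-off parameters), the minimized function can be written as:

    E(L) = \sum_p D_p(L_p)
         + \lambda \sum_{(p,q) \in N} B_{p,q} \, [L_p \neq L_q]
         + \mu \sum_{(p,q) \in N} A_{p,q} \, [L_p \neq L_q]

Here L_p is the foreground/background label of pixel p, D_p is the data term comparing the colour of p with the foreground and background colour models, B_{p,q} is the usual boundary term that discourages cutting between similar colours, and A_{p,q} is the audio-visual term: it is large when p and q both show a high and similar audio-visual synchrony, so cutting between them is expensive and they move as a block, and it goes to zero when the synchrony is low, leaving the decision to the other two terms.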
We do not want to restrict ourselves to speakers: we expect the method to work for general audio-visual sources. That is why we do not use a face detector, because we want to extract any kind of audio-visual sound source, not just speakers. In fact, here the hand which is playing the piano is extracted and the rocking horse is not.

And what happens if we have two persons that speak at the same time? In fact, we do not force our algorithm to choose between them. In some frames one person will be more synchronized with the soundtrack, so there will be more seeds on the mouth of one person, and in other frames there will be more seeds on the mouth of the other person. But in general we can extract the two of them, without making the assumption that there is just one source.

Now some results on video sequences. The first one is with speakers: at the beginning nobody speaks, then the right person speaks, and then the left person starts speaking. We would like to extract first the face of the right person and then the face of the left person. Let's see the result. [The video is played.] As you can see, when the right person stops speaking, our method is able to stop the extraction, and we extract the face of the person who is speaking.

We expect this not only for persons: the method also deals with general audio-visual sources with distracting motion. In this sequence there is a hand which is playing a piano, and there is a fan moving in the background during the sequence. In the first frames we extract the audio-visual object, the object which is interesting for us, but also a little bit of the fan; this disappears quickly, though, and we keep on extracting the hand. [The video is played.] As you can see, we also extract the keyboard. The thing is that when the finger pushes a key there is some motion on the key which is associated with the sound: it happens at the same time as the sound. This is normal, because they are pushing the key and the sound comes at the same instant. So there are seeds situated on the keys of the keyboard and, since the region also matches in terms of colour, we end up extracting the keyboard. Notice, however, that we do not extract the black keys, because they are not pressed at any time.

So, some discussion. I have presented a method to extract audio-visual objects from a scene. This method is based on the main assumption in this domain, which states that related events happen at the same time in both channels: the events in the two channels are more or less synchronous, so the video motion is more or less synchronized with the appearance of the sounds in the soundtrack. Our method deals with any kind of audio-visual source, even with different types of audio-visual sources showing different kinds of activity, and it is able to extract more complete audio-visual objects, since we do not require the whole region to be synchronized with the sound. The main limitation is that, since our approach is unsupervised, we cannot control the aspect of the extracted region. We could make it semi-supervised and say that we want the region to look like some model, but then we would compromise the unsupervised nature of the approach. Finally, there is an idea to extend this work: when we have an audio-visual sequence where there are multiple sources, we could allow the user to choose the source to be extracted, and then extract not only the video part of this source but also the audio part associated with it. Do you have any questions?