0:00:18 Hello, my name is [inaudible], and I'm here to present this work.
0:00:28 The setting that has received a lot of attention in audio-visual signal processing, and the one that we have here, is the following: we have an audio-visual signal which is composed of two modalities, the video part and the audio part. The video is recorded with a camera and the audio with a microphone. There could be more cameras and microphones, but here we prefer to address the simplest problem, which is also the most common in this domain.
0:01:03 The video and audio signals are very different, but they share the temporal axis. The resolution of this axis is different: we usually have many more sound samples than video frames, because the sampling rate of the sound is much higher.
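Before the two modalities can be compared they have to share a common temporal resolution, so the audio is usually summarized at the video frame rate. A minimal sketch of one common way to do this (the function name and parameters are mine, not from the talk):

```python
import numpy as np

def per_frame_audio_energy(audio, sr, fps, n_frames):
    """Average short-time audio energy over the span of each video frame.

    audio: 1-D array of samples, sr: audio sampling rate in Hz,
    fps: video frame rate, n_frames: number of video frames.
    """
    spf = int(round(sr / fps))            # audio samples per video frame
    energy = np.zeros(n_frames)
    for t in range(n_frames):
        chunk = audio[t * spf:(t + 1) * spf].astype(float)
        if chunk.size:
            energy[t] = np.mean(chunk ** 2)
    return energy
```

For example, with sr = 44100 Hz and fps = 25, each video frame covers 1764 audio samples.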
0:01:19 The main idea, in general, in audio-visual signal processing is to combine all the available modalities in order to extract the maximum amount of information about a given scene. There are several applications: for example, in speech recognition you can use the video modality in order to better understand speech, or you can combine the information in both signals in order to localize the sound sources; these are the main applications in this domain.
0:01:52 The main assumption is that related events in both channels happen more or less at the same time. In this case, for example, you have a guy who is playing a guitar, and the guitar sounds are correlated with the movement of the hands: they occur more or less at the same time. That's the main assumption that we will use in this work.
0:02:14 So now let's go to the goal, what we want to do in this work: to extract the audio-visual objects in a scene.
0:02:23 It looks like this. We have a sequence where there are two objects. The first object, a speaker in this case, is associated to the soundtrack, and there is another person who is moving the lips, but whose sounds we cannot listen to. So this is the distractor, and this is the audio-visual object. What we want to do is to extract the video part which is associated to the soundtrack, without the interference of the distractor.
0:02:54 Why do we want to do this? Because many applications in this domain do not use the entire signal; they just use the part of the signal which is associated to the sound. In lip reading, for example, you just need the speaker's lips, or the region around the mouth; we don't care if there is another person or a table there.
0:03:19 The strategy that we are going to follow is this one. First we have this sequence, and what we want to do is to identify the regions whose motion is correlated to the soundtrack: the regions of interest from the audio-visual point of view. For this purpose we use an audio-visual diffusion process. Once we have here a map of the correlation between the motion in the video and the soundtrack (the redder a region, the more correlated it is, let's say, with the sound), we can extract the video regions which are most correlated, shown in white here. As you see in this sequence, for example, we had this person who is speaking, and we can extract this.
0:04:13 Then, once we have this starting point, we use a segmentation approach based on graph cuts in order to extract the whole region which is correlated to the sound: we want to extract the region which is homogeneous in color and has a high synchrony with the sound.
0:04:38 That's the first step: we want to know where the sound sources are. Here we use this audio-visual video diffusion, which was presented at ICASSP this year. The aim is to remove all the information which is not associated to the soundtrack; we want to preserve just the information which is interesting from the audio-visual point of view. In this case, for example, there is a hand playing a piano and there is an object moving in the background; we would like to preserve the hand and blur what is not interesting.
0:05:16 To do it, we define an audio-visual diffusion coefficient which is a function of the synchrony between the motion and the sound. Here we see that this diffusion coefficient is a function of this audio-visual synchrony measure: a combination of the audio energy and the temporal derivative of the video signal, which is the motion in the video. Then we smooth this measure in order to reduce the effect of noise.
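The talk describes this synchrony measure only as a combination of the audio energy and the temporal derivative of the video; one plausible reading of that description, as a sketch (the product form, the smoothing window, and all names are assumptions; the exact combination in the paper may differ):

```python
import numpy as np
from scipy.ndimage import uniform_filter1d

def av_synchrony(frames, audio_energy, win=5):
    """Per-pixel audio-visual synchrony: the motion (absolute temporal
    derivative of the video) multiplied by the per-frame audio energy,
    smoothed over a short temporal window to reduce the effect of noise.

    frames: (T, H, W) grayscale video, audio_energy: (T,) per-frame energy.
    """
    motion = np.abs(np.diff(frames.astype(float), axis=0))  # (T-1, H, W)
    sync = motion * audio_energy[1:, None, None]            # high only where both co-occur
    return uniform_filter1d(sync, size=win, axis=0)         # temporal smoothing
```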
0:05:48 As you can see here, the diffusion coefficient is a function of this synchrony measure. When the synchrony is high, the diffusion coefficient is close to zero and the diffusion stops. When the synchrony is low, so the video motion and the sounds are not correlated, the diffusion coefficient is constant and equal to one and the region is blurred.
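This is the behavior of an inhomogeneous, Perona-Malik-style diffusion whose conductivity is driven by the synchrony instead of the image gradient. A minimal sketch under that assumption (the exponential form of the coefficient and the parameters dt and k are mine):

```python
import numpy as np

def av_diffusion_step(img, sync, dt=0.2, k=1.0):
    """One explicit step of audio-visual diffusion on a 2-D frame:
    c ~ 1 where synchrony is low  -> the region gets blurred,
    c ~ 0 where synchrony is high -> diffusion stops, the region stays sharp.
    """
    c = np.exp(-(sync / k) ** 2)                  # diffusion coefficient in (0, 1]
    lap = (np.roll(img, 1, axis=0) + np.roll(img, -1, axis=0) +
           np.roll(img, 1, axis=1) + np.roll(img, -1, axis=1) - 4.0 * img)
    return img + dt * c * lap                     # dt <= 0.25 keeps the explicit scheme stable
```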
0:06:17 And this is the result on this sequence. I'm not sure if you can see it, but here the audio-visual object stays sharp, although it is difficult to see, while the diffusion blurs the distractor.
0:06:36 So let's see what happens with the motion in this example. Here we have, I'm not sure if you can see it, the motion in one frame, which is equally distributed between the distracting moving object, the head of the rocking horse, and the audio-visual object. After the diffusion process we are here, and the main intensity of the motion is situated in the audio-visual object.
0:07:08 Now, since we don't want to penalize regions with low motion, what we need to do is to compare the motion after the diffusion to the motion before the diffusion, so that we see how much each region has been diffused through the process.
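The talk only says "compare"; a simple per-pixel ratio is one way to do it, as a sketch (the ratio form and eps are assumptions):

```python
import numpy as np

def relative_motion(motion_before, motion_after, eps=1e-8):
    """Fraction of the original motion that survives the audio-visual
    diffusion. Regions moving in synchrony with the sound keep their
    motion (values near 1); distractors are smoothed away (values near 0).
    Normalizing by the original motion avoids penalizing regions that
    simply move little.
    """
    return motion_after / (motion_before + eps)
```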
0:07:30 Here is the result; again, you might not see it, but if we plot just the highest values for these features, we see that at the beginning the high values of the original motion are equally distributed between the audio-visual object and the distractor, so we have more or less the same number of points in the hand and in the head of the horse. After the diffusion, most of the highest values are already situated in the hand, which is generating the sounds. And finally, when comparing both, there are just two spots where the highest values are misclassified.
0:08:15 Now we have the points with the highest correlation, and what we want to do is to extract the whole region. So we use an audio-visual segmentation approach. We need some starting points for the segmentation process; we retrieve, let's say, the seeds for the source. The starting points for the source are the pixels with the highest audio-visual relevance: the pixels of the regions which move according to the soundtrack. And since we don't want to make any assumption about the background, we don't want to say that the regions which are less correlated to the soundtrack are the background. We don't want to make this assumption because we don't know anything about the background, so what we do is to situate the seeds for the background here. And then we don't fix any seed inside ambiguous parts, because we don't want to condition the result in parts where we don't know what the source is; there is motion there, but we do not fix any seeds. And these are the results with our method.
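A sketch of this seeding strategy (where exactly the background seeds are placed is not fully clear from the recording; a thin band on the image border is one common choice and is an assumption here, as are the names and thresholds):

```python
import numpy as np

def pick_seeds(relevance, fg_quantile=0.999, border=5):
    """Foreground seeds: the pixels with the highest audio-visual relevance.
    Background seeds: the image border only (assumed). Crucially, interior
    pixels with low synchrony get no seed at all, so nothing forces them
    into the background.
    """
    fg = relevance >= np.quantile(relevance, fg_quantile)
    bg = np.zeros_like(fg, dtype=bool)
    bg[:border, :] = True; bg[-border:, :] = True
    bg[:, :border] = True; bg[:, -border:] = True
    bg &= ~fg                       # a pixel cannot seed both labels
    return fg, bg
```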
0:09:28 Now I will explain how, from this starting point, we reach this result. What we use is a graph-cut segmentation approach, and we introduce an audio-visual term whose purpose is to keep together regions with high audio-visual synchrony. Typically we want to minimize this equation. First we have the regional term, which compares the color of each pixel with the color estimates for the foreground and the background. Then we have the boundary term, which keeps together neighboring pixels which have a similar color. And we define this audio-visual term so that it keeps together regions which present a high audio-visual synchrony. The first two terms are commonly used; the last term is new.
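Written out, a plausible form of the energy being minimized is the following (the notation and the weights \lambda and \gamma are mine; the talk does not show the exact formula):

E(L) = \sum_{p} R_p(L_p) + \lambda \sum_{(p,q) \in \mathcal{N}} B_{p,q}\,[L_p \neq L_q] + \gamma \sum_{(p,q) \in \mathcal{N}} AV_{p,q}\,[L_p \neq L_q]

where L_p is the foreground/background label of pixel p, R_p is the regional term comparing the color of p with the foreground and background color models, B_{p,q} is the usual contrast-dependent boundary term over neighboring pixels, and AV_{p,q} is the new audio-visual term.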
0:10:23 So let's study the audio-visual term more deeply. Its goal is to keep together regions with high audio-visual synchrony; in contrast, it doesn't affect regions with low synchrony. What we do is define it like this: it is proportional to the audio-visual coherence. When two neighboring pixels have a high and similar audio-visual synchrony, we keep them together through the segmentation process; it's like a block. In contrast, when two neighboring pixels have a different audio-visual synchrony, they are likely to be situated one on each side of the boundary, and we do not keep them together. And when the audio-visual synchrony is low, we don't do anything: this term doesn't affect the segmentation, and we let the other terms, the regional term and the boundary term, do the work.
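Sketching those three cases as a pairwise weight (the exact formula is an assumption; only the behavior described above is reproduced):

```python
import numpy as np

def av_pair_weight(s_p, s_q, low=0.5):
    """Audio-visual term between neighboring pixels with synchrony values
    s_p and s_q:
    - high and similar -> large weight: cutting between them is expensive,
      so the region moves through the segmentation as a block;
    - high but different -> small weight: a boundary between them is cheap;
    - both low -> zero: the regional and boundary terms alone decide.
    """
    if max(s_p, s_q) < low:
        return 0.0
    return max(s_p, s_q) * np.exp(-abs(s_p - s_q))
```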
0:11:17 So here are the starting points of the segmentation: the seeds for the source; for the background, the seeds are situated everywhere else. Here we see that, in this case, this person is speaking and most of the seeds are in the mouth of this person; then the right person is speaking in the bottom line. And these are the results without the audio-visual term: we extract just a part of the mouth of the speaker. When we add this term, the mouth of the speaker is classified as a block, and we can extract a bigger region: in this case we extract the whole face, or the whole mouth.
0:12:06 Let's compare our method with a previous method. The main difference between our method and previous methods is that they assume that pixels or regions presenting a low audio-visual synchrony cannot belong to the foreground, cannot belong to the source. In our case we don't make this assumption, since we want to extract a region which is homogeneous in color. So we can extract regions like, for example, the forehead of the speaker. In that case, as you see here, the forehead is not extracted, and they cannot extract the whole face, because they assume that if the synchrony is low this region cannot be part of the source. In our case, since we do not make this assumption, the results are more satisfactory.
0:13:02 More results. In this sequence the left person is speaking and the right person is just moving the lips. We expect our method to work not only for speakers but also when we have general audio-visual sources; that's why we don't use a face detector: we want to extract any kind of audio-visual source, not just speakers. So, for example, the hand which is playing the piano can be extracted, while the rocking horse is not.
0:13:31 And what happens if we have two persons that speak at the same time? In fact, we do not force our algorithm to choose between them. In some frames one person will be more synchronous, so there will be more seeds in the mouth of that person, and in other frames there will be more seeds in the mouth of the other person. But in general we can extract the two of them, without making an assumption about just one.
0:14:04 Now some results on video sequences. The first of them is with speakers: we have two persons that alternate speaking. First the right person will speak, and then the left person will start speaking. We would like to extract first the face of the right person, and then the face of the left person.
0:14:26 Let's see the result.
0:14:28 [video playback]
0:14:38 So when the right person stops speaking, our method is able to stop the extraction there and switch to the left person. And, as we said, the method is expected to work not only for persons but also with general audio-visual sources, even when there is distracting motion.
0:14:56 As you can see, in the first frame we have a hand which is playing a piano again, and there is a fan moving at the bottom during the entire sequence. In the first frame we extract this audio-visual object, the object which is interesting for us, but also a little bit of the fan. But you can see that this disappears quickly, and we keep on extracting the hand. [video playback]
0:15:32 At the end we also extract the keyboard. The thing is that when the fingers press a piano key, there is some motion in the key which is associated to the sound: it happens at the same time as the sound appears. This is normal, because the finger pushes the key and the sound occurs at the same time. So there are seeds situated on the keys of the keyboard, and since this region is also homogeneous in terms of color, we keep extracting until we extract the whole keyboard. Notice, however, that we do not extract the black keys, because they are not pressed at any time.
0:16:15 Now some conclusions. I have presented a method to extract audio-visual objects from a scene. This method is based on the main assumption in this domain, which states that related events in the audio and video channels happen more or less at the same time, so the two channels are more or less synchronous: the video motion is more or less synchronous with the appearance of the sounds in the soundtrack. Our method can deal with any kind of audio-visual source, even with different types of audio-visual sources, and it is able to extract more complete audio-visual objects, since we do not require the whole region to be synchronous with the sound. The main limitation is that, since our approach is unsupervised, we cannot control the aspect of the extracted region. If we wanted to make it semi-supervised, and say that we want the region to look like this, then we would compromise the unsupervised nature of the approach.
0:17:24 One thing that we could do with a semi-supervised approach is, for example, when we have an audio-visual sequence where there are multiple sources, to allow the user to choose the source to be extracted, and then extract both the video part of this source, let's say the video modality of the source, and also the audio part of this source.
0:17:50 So, do you have any questions?