0:00:13 This is the multimedia signal processing session, on joint audio-visual signal processing. We have several papers, and for each one we have more or less twenty minutes.
0:00:28 We start with the first paper, which is on audio-visual synchronization recovery in multimedia content. 0:00:38 The authors are from a group of technology in Switzerland.
0:00:50 Hello everyone. I am from UPF, and the title of my talk is
0:00:56 Audio-Visual Synchronization Recovery in Multimedia Content.
0:01:01 This is the outline of my talk. First I am going to introduce the general problem of audio-visual synchronization in multimedia content, and then I am going to explain the contribution of my work. Then I will explain in detail the proposed method, with some details about the measures used to quantify the correlation between the audio and video signals. Then I will show some experimental results, and I am going to conclude my talk with a summary and future work.
0:01:35 The problem of audio-visual synchronization in multimedia content can be explained in this context. Multimedia content usually contains both audio and video, and when we talk about the quality of multimedia, we have quality components from these two modalities. For audio we have loudness, noise, or jitter, and for video we have blurriness, jerkiness, pixel noise, et cetera.
0:02:09 Another important point is that the two signals have some mutual interaction: the qualities of the two signals mutually affect each other, and there is also the problem of synchronization between the two modalities. This is the problem that I want to address.
0:02:32 Usually we expect some synchrony between audio and video signals in our daily life. For example, if somebody calls my name, I expect to see the corresponding shape of the mouth at the same time. This is our expectation of synchronization in daily life.
0:02:53 There are some studies about this synchronization problem for audio and video signals, and people found that there is some tolerance in the synchronization. For example, there is an intersensory integration window, which is about two hundred milliseconds wide, during which audio-visual perception is not degraded as long as the synchronization error stays within it.
0:03:21 For example, in this graph, even if the two signals are not perfectly synchronized, within this area people still say that the two signals are synchronized.
0:03:38 Based on many studies of synchronization, there is also a standard document from the ITU. This ITU document specifies a detectability threshold of around plus or minus one hundred milliseconds, and broadcasting systems should follow this guideline. Beyond this boundary, we start to perceive the audio and video signals as misaligned.
0:04:17 Unfortunately, we may have some asynchrony between the audio and video signals, and this may happen at all steps of the multimedia processing chain. For example, in acquisition, we know that the speeds of light and sound are different. During editing, the two signals may have different processing times, or people can simply make a mistake. During transmission, they may suffer from different network transfer delays, and during restitution they may have different decoding delays.
0:04:52 The result of this asynchrony is, first of all, that the quality is degraded, so people may get annoyed. Furthermore, people may not understand the content.
0:05:08 To solve this problem, in our work we developed an automatic algorithm to detect whether there is asynchrony between the audio and video signals, and to recover the original synchronization. For this, we exploit the audio-visual correlation structure that is inherent in the two signals.
0:05:32 The features of the method are as follows. First, we do not make any assumption about the content; therefore we do not need any training, and the method can be applied to any kind of content, both speech and non-speech, as long as there is a visible motion responsible for the sound. In particular, we use two different correlation measures and compare their results.
0:06:05 Let me explain the proposed method in detail. The idea is quite simple. When we have the audio and video signals of some content, we do not know whether they are aligned well or not, so we shift the audio signal relative to the video signal, step by step, and measure the correlation at each step. We then pick the time shift at which we get the maximum correlation between the two signals.
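The shift-and-correlate search described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: it uses plain Pearson correlation between two one-dimensional feature sequences sampled at the video frame rate, whereas the talk computes per-patch correlation measures, described next.

```python
import numpy as np

def estimate_time_shift(audio_feat, video_feat, max_shift):
    """Shift the audio feature sequence relative to the video features,
    step by step, and return the shift (in video frames) that gives the
    maximum correlation between the two signals."""
    best_shift, best_corr = 0, -np.inf
    n = len(video_feat)
    for shift in range(-max_shift, max_shift + 1):
        # overlapping region when the audio is shifted by `shift` frames
        a = audio_feat[max(0, shift): n + min(0, shift)]
        v = video_feat[max(0, -shift): n - max(0, shift)]
        if len(a) < 2 or a.std() == 0 or v.std() == 0:
            continue
        corr = np.corrcoef(a, v)[0, 1]
        if corr > best_corr:
            best_corr, best_shift = corr, shift
    return best_shift
```

With this sign convention, a positive result means the audio lags the video by that many frames; multiplying by the frame period converts it to milliseconds.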
0:06:34 The algorithm can be summarized like this. The first step is to extract some features, and then we divide the signals into small units to which we can apply the correlation analysis. First, we divide the whole signal along the temporal dimension, so that we have small segments; we call each one a temporal block, and this is applied to both audio and video. Then we further segment the video signal, dividing the image frames into small spatial patches, which in our case are four by four pixels. By doing this, we can find where the sound is actually coming from.
0:07:27 Then, for each hypothetical time shift (that is, we shift the audio signal step by step, one frame at a time), and for each temporal block, we do the analysis: for each spatial patch, we measure the correlation between the time-shifted audio and the video signal of that patch. After measuring the correlation over the whole image frame, we take the maximum, and we expect that the patch having this maximum correlation is the sound source. We then compute the average of this maximum correlation over the temporal blocks, from the beginning of the signal to the end. Now, for each time shift, we have a correlation measure of the two signals, and we choose the shift with the maximum value.
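The per-shift score just described (maximum over the spatial patches of each temporal block, then the average over the blocks) can be sketched as follows; the array shapes are hypothetical, and patches without motion are skipped, as the talk mentions later when discussing computation savings.

```python
import numpy as np

def shift_score(patch_corr, motion_mask):
    """patch_corr: (n_blocks, n_patches) correlation of each spatial
    patch with the shifted audio, one row per temporal block.
    motion_mask: same shape, True where the patch contains motion.
    Returns the per-block maximum averaged over the temporal blocks."""
    # excluded patches get -inf so they never win the per-block maximum;
    # each block is assumed to contain at least one patch with motion
    masked = np.where(motion_mask, patch_corr, -np.inf)
    return masked.max(axis=1).mean()
```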
0:08:33 Finally, after all this, we refine the time shift at a finer resolution. Here, the time shifting is done at the resolution of the video frame rate, so we get one correlation measure per video frame of shift. When we have this kind of correlation curve over the different time shifts, say we get the discrete maximum here; we then do a parabolic fitting over the three points around it and take the maximum of the parabola. This is the final time shift that we obtain.
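The parabolic fitting over the three points around the discrete maximum has a simple closed form; a minimal sketch, assuming equally spaced shifts:

```python
def parabolic_refine(shifts, corrs):
    """Fit a parabola through three equally spaced (shift, correlation)
    points around the discrete maximum and return the location of the
    parabola's vertex, i.e. the refined time shift."""
    x0, x1, x2 = shifts
    y0, y1, y2 = corrs
    denom = y0 - 2 * y1 + y2
    if denom == 0:          # the three points are collinear
        return x1
    step = x1 - x0
    return x1 + 0.5 * step * (y0 - y2) / denom
```

Fitting through correlations sampled at shifts of, say, -1, 0, and +1 video frames returns a fractional peak location, which can then be converted to milliseconds with the frame rate.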
0:09:11 So far this is quite clear, but the question is what kind of correlation measure we can use. I compare two different methods: one is mutual information, and the other one is canonical correlation.
0:09:28 Mutual information is, as you probably know, a measure of the mutual dependence between two signals. In particular, I use the quadratic mutual information, which is based on the quadratic (Renyi) entropy and uses Parzen estimation for the marginal and joint PDFs.
0:10:00 The equation is given like this. Here we need to model each PDF using a sum of kernels: since it is a Parzen estimation, a kernel is centered on each data point, and there is a parameter that has to be fixed, namely the width of the Gaussian kernels. This is a user parameter that we have to set; in our experiments, we did some search and took the best value.
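The quadratic mutual information with Gaussian Parzen windows admits a closed form as a double sum over sample pairs. The sketch below follows the "Euclidean distance" quadratic MI estimator from information-theoretic learning for one-dimensional features; the exact formulation in the paper may differ, and the kernel width `sigma` is the user parameter discussed above.

```python
import numpy as np

def _gauss(d, var):
    """Gaussian density of the pairwise differences d with variance var."""
    return np.exp(-d ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def quadratic_mi(x, y, sigma=0.5):
    """Euclidean-distance quadratic mutual information between two 1-D
    samples x and y, using Gaussian Parzen windows of width sigma.
    Nonnegative, and approximately zero when x and y are independent."""
    var = 2 * sigma ** 2           # convolution of two Parzen kernels
    gx = _gauss(x[:, None] - x[None, :], var)
    gy = _gauss(y[:, None] - y[None, :], var)
    v_joint = np.mean(gx * gy)                            # from p_xy^2
    v_marg = np.mean(gx) * np.mean(gy)                    # from (p_x p_y)^2
    v_cross = np.mean(gx.mean(axis=1) * gy.mean(axis=1))  # cross term
    return v_joint - 2 * v_cross + v_marg
```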
0:10:42 The other correlation measure is the canonical correlation, which measures the correlation in a space onto which the signals are projected for maximum correlation. Finding this projection is equivalent to finding a common representation space of the two signals.
0:11:03 This is the equation of the canonical correlation. As you can see, we need to find the projection vectors w, which project the input vectors x and y corresponding to the audio and the video, and we try to maximize rho, the correlation measure. This problem can be solved as an eigenvalue problem, which is available in many publications.
0:11:37 These are the two correlation measures that I use.
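For the canonical correlation, the maximization over the two projection vectors reduces to a standard linear-algebra problem. A minimal sketch, computing only the first canonical correlation via whitening and an SVD (equivalent to the eigenvalue formulation mentioned in the talk; the small ridge term is an addition here, for numerical stability):

```python
import numpy as np

def cca_max_corr(X, Y, reg=1e-8):
    """First canonical correlation between samples X (n, dx) and
    Y (n, dy): the maximum over w_x, w_y of corr(X @ w_x, Y @ w_y)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    cxy = X.T @ Y / n
    lx = np.linalg.cholesky(cxx)
    ly = np.linalg.cholesky(cyy)
    # singular values of the whitened cross-covariance are the
    # canonical correlations; the largest one is rho
    t = np.linalg.solve(lx, cxy) @ np.linalg.inv(ly).T
    return np.linalg.svd(t, compute_uv=False)[0]
```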
0:11:42 Now let me explain some experimental results. I tested the algorithm on three audio-visual sequences: two are speech and the other one is non-speech. I simulated asynchronies between zero and plus or minus one second.
0:12:01 For the features, I used quite simple ones, because I found them to work very well, but of course more complex features can also be used. For the visual features, I took the pixel intensities and computed their derivative along the time dimension, and for the audio features, I computed the energy and then its derivative in the temporal dimension.
0:12:28 The analysis unit in time was fifty video frames, which corresponds to around two seconds, depending on the sequence. As I mentioned, the spatial patch was four by four pixels, but this is after down-sampling the image frames: each image frame was down-sampled by sixteen, that is, by four in each spatial dimension.
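The feature extraction described above can be sketched like this. The helper names and shapes are hypothetical; here the spatial down-sampling and the patch division are folded into a single patch-averaging step, and the temporal derivatives are taken as absolute differences between consecutive frames.

```python
import numpy as np

def video_features(frames, patch=4):
    """frames: (T, H, W) grayscale video with H and W divisible by
    `patch`. Average the intensity over each patch, then take the
    absolute temporal derivative: shape (T-1, H//patch, W//patch)."""
    t, h, w = frames.shape
    patches = frames.reshape(t, h // patch, patch,
                             w // patch, patch).mean(axis=(2, 4))
    return np.abs(np.diff(patches, axis=0))

def audio_features(samples, hop):
    """Short-time audio energy over windows of `hop` samples (one per
    video frame), then its absolute temporal derivative."""
    windows = samples[: len(samples) // hop * hop].reshape(-1, hop)
    energy = (windows ** 2).sum(axis=1)
    return np.abs(np.diff(energy))
```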
0:12:58 Here you can see some of the results. These are the three sequences that I used. The first one is a monologue by one person. In the second one there are two people, but only one of them is speaking, while the other one just moves a bit, shaking his head and so on. The third one is non-speech and includes the bumping sound made by a pen on a table.
0:13:27 This is the result. The x axis shows the simulated asynchrony, from zero to plus or minus one thousand milliseconds, and the y axis shows the estimation error in milliseconds.
0:13:44 The black bars are the results from using mutual information, and the white bars are the results from canonical correlation.
0:13:54 Looking first at the mutual information results: normally they are okay, but there are some cases where the error is not acceptable. For example, this case is more than one hundred milliseconds, and this one is more than four hundred milliseconds, which is outside the acceptability threshold.
0:14:15 The main reason for this was that, as I mentioned, for mutual information we need to set the width parameter of the Gaussian kernel. I tried various kernel widths, and this was the best overall, but I could not find a single value that works everywhere: for some cases the best width is one value, while for other cases it is a different one. That was the main difficulty in using mutual information.
0:14:46 On the other hand, if you look at the canonical correlation results, the error is always less than one hundred milliseconds. As I mentioned before, the acceptability threshold is around one hundred milliseconds, so here we can conclude that, for these sequences, the canonical correlation was successful.
0:15:12 This figure simply shows how the correlation measure changes according to the hypothetical time shift. In this case I used perfectly synchronized signals, so the column in the middle corresponds to the correct hypothesis, while the right-side column is a wrong hypothesis with a shift of thirty-one frames, which means around one second.
0:15:42 Here you can see that the canonical correlation measure is larger when the signals are synchronized, in the middle column, than in the right-side column: about zero point eight versus zero point seven in the top case, and zero point nine versus zero point six in the bottom case.
0:16:06 One more thing you can see here is the black area. When I measured the correlation between the different patches and the audio signal, I took only those patches that have motion inside; when the motion is negligible, I skipped the analysis to save computation, and those skipped patches are shown in black. In this case you can see that the analyzed patches are quite few in comparison to the whole frame.
0:16:42 Finally, the conclusion. To summarize, we proposed an automatic synchronization method, and I tried different correlation measures. I found that the quadratic mutual information was quite sensitive to the Gaussian kernel parameter, whereas the canonical correlation showed overall quite robust results.
0:17:09 One thing I would like to mention here is that a similar approach was also applied to 3D, that is, stereoscopic video synchronization.
0:17:20 In stereoscopic video you have two video streams, and if they are not synchronized you see a doubled region for some objects. I do not know if you can see it clearly, but look at the laptop: the lid of the laptop is seen twice, once here and once here.
0:17:40 So this synchronization problem may also arise in this case. We applied a similar technique, and it could solve the problem; this was presented the previous year at ICIP.
0:17:56 Finally, for future work, we would like to test the method on more diverse content, because we used only three sequences, and we would also like to continue studying this synchronization problem in different media, like mobile TV or 3DTV.
0:18:20 We have time for just one question.
0:18:31 If I remember correctly, I think that method is only for speech. I think they first try to find the lip area, and then they use some lip-specific features to recover the synchronization; that is what I remember. In our case, the difference is that I did not do that.
0:18:58 Thank you. We move to the second paper.