Speech transcript - AUDIO-VISUAL SYNCHRONIZATION RECOVERY IN MULTIMEDIA CONTENT

0:00:13	[unclear] multimedia signal processing session on joint audio-visual signal processing
0:00:19	we have
0:00:21	[unclear]
0:00:22	and for each one we have more or less twenty minutes [unclear]
0:00:27	[unclear]
0:00:28	we start with the first paper, which is on audio-visual synchronization recovery in multimedia content
0:00:35	the presenter is Jong-Seok Lee
0:00:38	and the other author is Ebrahimi, from the Swiss Federal Institute of Technology
0:00:43	in Lausanne, Switzerland
0:00:45	please
0:00:50	thank you everyone, my name is Jong-Seok Lee, from EPFL, Switzerland
0:00:54	and the title of my talk is
0:00:56	audio-visual synchronization recovery in multimedia content
0:01:01	this is the outline of my talk: first i'm going to introduce the problem of audio-visual synchronization in
0:01:07	multimedia content
0:01:09	and then i'm going to explain the contribution of my work
0:01:12	and then i will
0:01:14	explain in detail the proposed method
0:01:18	and some details about the correlation measures used to measure the correlation between audio and video signals
0:01:24	and then i will show some experimental results, and then
0:01:27	i am
0:01:28	going to conclude my talk with a summary and
0:01:31	future work
0:01:35	so
0:01:35	the problem of audio-visual synchronization in multimedia content can be explained in this context
0:01:41	so
0:01:43	when you have some multimedia content, it
0:01:45	usually contains both audio and video
0:01:48	so
0:01:49	when you talk about the quality of multimedia
0:01:52	we have
0:01:53	the quality components
0:01:54	from these
0:01:56	two modalities
0:01:57	so for audio
0:01:59	we have loudness, noise, or jitter
0:02:01	components
0:02:03	and in video we have
0:02:05	blurriness, jerkiness,
0:02:07	pixel noise, et cetera
0:02:09	but another important part is that
0:02:12	the quality of the two signals has some mutual
0:02:16	interaction
0:02:17	so for example the qualities of the two signals
0:02:20	mutually interact with each other
0:02:23	and also there is a problem of synchronization of the two modalities
0:02:27	so this is the problem that i want to talk about
0:02:31	so
0:02:32	usually we expect synchronous audio and video signals in our life
0:02:36	this is, for example, my wife, and
0:02:39	if she calls my name, then i expect this
0:02:43	shape of the mouth
0:02:44	at the same time
0:02:46	and
0:02:47	this is our expectation
0:02:49	of the synchronization in our daily life
0:02:53	and there are some studies about this synchronization problem in audio and video signals
0:03:00	and people found that there is also some tolerance in the synchronization, for example
0:03:06	there is
0:03:08	an inter-sensory integration window, which is about two hundred milliseconds wide
0:03:13	during which the audio-visual perception is not degraded
0:03:17	when the synchronization is within this error
0:03:20	bound
0:03:21	so for example
0:03:22	if you see this graph
0:03:24	the
0:03:25	two signals
0:03:26	even if they are not perfectly synchronized
0:03:30	in this
0:03:31	area
0:03:32	people still
0:03:34	perceive
0:03:35	that the two signals are synchronized
0:03:38	so based on many studies of synchronization, there is also a standard document from ITU
0:03:45	so this ITU document specifies the acceptability threshold
0:03:50	as around plus minus one hundred milliseconds
0:03:53	and
0:03:54	so
0:03:55	the
0:03:56	standards or broadcasting systems
0:03:59	should follow this guideline
0:04:02	but
0:04:03	over this boundary
0:04:06	people
0:04:07	start to perceive the
0:04:10	misaligned
0:04:12	audio and video signals
0:04:15	so
0:04:17	but
0:04:17	unfortunately you may have some asynchrony in the audio-visual
0:04:22	signals
0:04:23	and this
0:04:24	may happen during
0:04:26	all steps in the multimedia processing chain
0:04:28	so for example in acquisition
0:04:30	we know the speed
0:04:31	of light and of sound is different
0:04:34	and during the editing they may have different processing times
0:04:39	or people can simply make a mistake
0:04:41	and during transmission they may suffer from different network
0:04:45	transfer delays
0:04:46	and during the
0:04:48	restitution they may have different delays in decoding
0:04:52	the result of this asynchrony
0:04:55	is, first of all, that the quality is degraded, so people may get angry about that
0:05:02	and
0:05:03	furthermore
0:05:04	people don't understand the content
0:05:06	actually
0:05:08	so to solve this problem, in
0:05:10	our work
0:05:12	we developed an automatic algorithm to detect
0:05:16	whether there is asynchrony between the audio and video signals
0:05:20	and to recover the original synchronization
0:05:23	and for this we exploit the audio-visual
0:05:26	correlation structure
0:05:28	which is inherent
0:05:30	in the two signals
0:05:32	so the features of the method are: first, we don't make any assumption
0:05:36	on the content
0:05:37	so
0:05:38	therefore we don't need any training
0:05:41	and also this can be applied to any kind of content
0:05:45	both speech and non-speech content
0:05:48	as long as there is a visible motion
0:05:51	responsible for the
0:05:53	sound
0:05:54	and in particular we
0:05:56	used
0:05:57	two different correlation measures
0:06:00	and we compared the results
0:06:05	so let me explain in detail the proposed method
0:06:09	the idea is quite simple
0:06:10	so when we have
0:06:12	the two audio and video signals
0:06:15	and we don't know whether they are
0:06:18	aligned well or not
0:06:20	we shift the audio signal relative to the video signal step by step
0:06:25	and measure the correlation
0:06:27	and we find the moment where
0:06:30	we get the maximum correlation between the two signals
0:06:34	so the algorithm can be summarized
0:06:36	like this
0:06:38	so the first step is to extract some features
0:06:42	and then we divide the signal
0:06:44	into some small units
0:06:47	where we can apply some correlation analysis
0:06:50	so first we divide the whole signal
0:06:54	in the temporal dimension so that we have some small
0:06:58	segments
0:06:59	we call this a temporal block here
0:07:01	and this is applied for both audio and video
0:07:04	and then
0:07:05	further we segment the image frames of the video signal into small
0:07:10	tiles
0:07:11	which in our case are four by four pixels
0:07:16	and
0:07:18	so
0:07:19	by doing this we
0:07:22	find where the sound is actually coming from
0:07:27	so then
0:07:28	for each hypothetical time shift
0:07:30	so this hypothetical time shift means we
0:07:34	shift the audio signal step by step, one by one
0:07:37	and then for each temporal block we do some analysis
0:07:41	and then get the correlation
0:07:43	and the correlation is the maximum correlation
0:07:46	between the time-shifted audio and
0:07:50	the
0:07:51	video signal in this tile
0:07:54	and
0:07:55	we measure
0:07:57	the correlation over the whole image frame and then we take the maximum
0:08:01	and we expect
0:08:02	the
0:08:03	location
0:08:04	having this maximum correlation
0:08:07	is
0:08:08	the sound source
0:08:10	and then after this we
0:08:13	compute the average of this maximum correlation over the temporal blocks, so from the beginning of the signal
0:08:18	to the end of the signal
0:08:20	and from this
0:08:22	and then
0:08:23	now for each time shift we have the correlation measure
0:08:27	of the two signals
0:08:28	and then we choose the maximum value
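
[Editor's sketch] The search described above (shift the audio, take the maximum correlation over tiles within each temporal block, average over blocks, pick the best shift) might look roughly like the following. This is only an illustration under stated assumptions: the function names are hypothetical, each tile is reduced to one feature value per frame, and plain Pearson correlation stands in for the mutual information and canonical correlation measures that the talk actually uses.

```python
import numpy as np

def pearson(a, b):
    """Absolute Pearson correlation of two 1-D arrays (0 if either is constant)."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return 0.0 if denom == 0 else abs((a * b).sum() / denom)

def estimate_shift(audio_feat, tile_feats, max_shift, block_len):
    """Hypothetical helper, not the authors' code.

    audio_feat: (T,) one audio feature value per video frame.
    tile_feats: (T, K) one visual feature value per frame for each of K tiles.
    Returns (best_shift, scores); shifts are in video frames, and a positive
    best_shift means the audio lags the video by that many frames.
    """
    T, K = tile_feats.shape
    shifts = np.arange(-max_shift, max_shift + 1)
    scores = np.zeros(len(shifts))
    for i, s in enumerate(shifts):
        block_max = []
        # Keep a margin so the shifted audio window always stays inside the signal.
        for start in range(max_shift, T - max_shift - block_len + 1, block_len):
            a = audio_feat[start + s : start + s + block_len]
            # Maximum over tiles: the tile presumed to contain the sound source.
            block_max.append(max(
                pearson(a, tile_feats[start : start + block_len, k])
                for k in range(K)
            ))
        # Average of the per-block maxima, from the beginning to the end of the signal.
        scores[i] = np.mean(block_max)
    return int(shifts[np.argmax(scores)]), scores
```

The returned `scores` array is the correlation curve over hypothetical time shifts; its maximum gives the estimate at video-frame resolution.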
0:08:33	and finally
0:08:35	after all these
0:08:36	steps
0:08:36	we find the time shift at
0:08:40	a smaller resolution
0:08:42	here the time shift is done at the resolution of the video frame rate
0:08:47	so
0:08:47	we get the correlation measures at each video
0:08:51	frame
0:08:53	so when you have this kind of correlation curve
0:08:56	for different time shifts
0:08:58	let's say we get the maximum here, but actually
0:09:00	we do a parabola fitting over the three points
0:09:04	and then you get the maximum value here, so this is the final time shift that we can get
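
[Editor's sketch] The three-point parabola fit mentioned here can be written in closed form. The snippet below is a minimal illustration, assuming equally spaced shifts; the function name is hypothetical and the talk does not give the exact formula used.

```python
def parabolic_offset(y_prev, y_peak, y_next):
    """Sub-sample offset of the vertex of the parabola through three
    equally spaced points at positions -1, 0, +1 (hypothetical helper).

    Returns 0.0 when the three points are collinear (no unique vertex).
    """
    denom = y_prev - 2.0 * y_peak + y_next
    if denom == 0:
        return 0.0
    return 0.5 * (y_prev - y_next) / denom
```

If the coarse maximum is at frame shift `k`, the refined shift would be `k + parabolic_offset(...)` frames, which can then be converted to milliseconds with the frame period.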
0:09:11	so
0:09:13	the idea is quite clear here, but the question is what kind of correlation measure we can use
0:09:20	so i compared two different methods
0:09:22	one is the mutual information and the other one is the canonical correlation
0:09:28	the mutual information is
0:09:30	as you
0:09:31	probably know well,
0:09:33	a measure of the mutual
0:09:35	dependence between two signals
0:09:38	and
0:09:38	in particular
0:09:39	i use the quadratic mutual information proposed by
0:09:43	[unclear]
0:09:46	this uses Rényi's
0:09:48	quadratic entropy
0:09:50	and it
0:09:51	also uses the Parzen pdf estimation for estimating
0:09:55	the marginal and the joint pdfs
0:10:00	so the equation is given by this
0:10:03	so
0:10:04	and here we need to
0:10:05	model each pdf
0:10:08	using a sum of
0:10:10	Gaussian
0:10:11	kernels
0:10:12	and so
0:10:13	this
0:10:14	Gaussian,
0:10:15	since it's a Parzen pdf estimation,
0:10:18	this kernel is set on each data point
0:10:21	and
0:10:22	and we have a parameter
0:10:24	that has to be
0:10:26	fixed
0:10:27	which is
0:10:28	the width of the Gaussian kernels; this is a user parameter that we have to set
0:10:33	in our experiment we did a search and then
0:10:37	took the best one
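
[Editor's sketch] The slide equation is not captured in the transcript. For orientation only, the standard ingredients of a quadratic mutual information estimate with Parzen windows are written out below. The symbols (N samples x_i, kernel width \sigma) and the Euclidean-distance form of the measure are assumptions; the talk does not confirm which variant was used.

```latex
% Parzen estimate with isotropic Gaussian kernels of width \sigma (the user parameter)
\hat{p}(x) = \frac{1}{N} \sum_{i=1}^{N} G\!\left(x - x_i,\ \sigma^2 I\right)

% Renyi's quadratic entropy, and the closed form of its integral under the Parzen estimate
H_2(X) = -\log \int p(x)^2 \, dx,
\qquad
\int \hat{p}(x)^2 \, dx
  = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} G\!\left(x_i - x_j,\ 2\sigma^2 I\right)

% One common quadratic mutual information (Euclidean-distance form); assumed here
I_{ED}(X;Y) = \iint \left( p(x,y) - p(x)\,p(y) \right)^2 dx \, dy
```

Because every term reduces to pairwise Gaussian evaluations between samples, the estimate depends directly on \sigma, which is consistent with the sensitivity to the kernel width reported in the talk.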
0:10:42	the other correlation measure is the canonical correlation; this is a measure of correlation in the space where the
0:10:48	projected
0:10:50	signals
0:10:50	have
0:10:51	maximum correlation
0:10:53	so
0:10:54	finding this projection
0:10:56	is equivalent to finding
0:10:58	a common representation space of the two signals
0:11:02	so
0:11:03	this is the equation of the
0:11:06	canonical correlation, so as you can see
0:11:09	we need to find this projection vector W here
0:11:13	which projects the input vectors X and Y,
0:11:19	which correspond to the audio and
0:11:21	video
0:11:22	signals
0:11:23	and we try to maximize
0:11:26	the correlation measure
0:11:28	and this problem can be solved, and the solution
0:11:31	is available in many
0:11:34	publications
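
[Editor's sketch] One textbook way to solve the canonical correlation problem is to whiten both signals and take a singular value decomposition of the whitened cross-covariance; the largest singular value is then the maximum correlation between the projections w_x^T x and w_y^T y. The code below is a sketch of that standard solution, not necessarily the solver used by the authors. The function names and the small regularization term `reg` are assumptions added for numerical stability.

```python
import numpy as np

def inv_sqrt(C, eps=1e-8):
    """Inverse matrix square root of a symmetric positive semi-definite matrix."""
    w, V = np.linalg.eigh(C)
    w = np.maximum(w, eps)
    return (V / np.sqrt(w)) @ V.T

def first_canonical_correlation(X, Y, reg=1e-6):
    """X: (n, p) and Y: (n, q), rows are paired observations.

    Returns (rho, wx, wy): the largest canonical correlation and the
    projection vectors for X and Y (hypothetical helper).
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)
    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky)
    wx = Kx @ U[:, 0]
    wy = Ky @ Vt[0]
    return float(s[0]), wx, wy
```

Unlike the Parzen-based measure, this formulation has no kernel width to tune, only the optional regularization constant introduced in this sketch.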
0:11:37	so these are the two correlation measures that i use
0:11:40	so
0:11:42	now let me explain some experimental results
0:11:46	so i tested the algorithm on three audio-visual sequences; two are
0:11:52	speech and the other one is non-speech
0:11:55	and i simulated the asynchrony between zero and plus minus one second
0:12:01	and for
0:12:02	features i use quite a simple method
0:12:04	because i found this
0:12:06	to work very well, but
0:12:08	of course more complex ones can also be used
0:12:11	for visual features i take the [unclear] and then take the
0:12:17	derivative
0:12:18	along the time dimension, and also for audio features i calculated the energy
0:12:24	and then took the derivative in the temporal dimension
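
[Editor's sketch] A rough illustration of features of this kind follows. Several details are assumptions, not taken from the talk: the visual quantity being differentiated is unclear in the recording and is taken here to be grayscale intensity, each tile is averaged to a single value, audio energy is summed over one video frame period, and all function names are hypothetical.

```python
import numpy as np

def audio_energy_derivative(samples, sample_rate, fps):
    """Energy per video-frame period, then its first temporal difference."""
    hop = int(round(sample_rate / fps))          # audio samples per video frame
    n_frames = len(samples) // hop
    frames = np.asarray(samples[: n_frames * hop], dtype=float).reshape(n_frames, hop)
    energy = (frames ** 2).sum(axis=1)
    # prepend keeps the output length equal to the number of frames
    return np.diff(energy, prepend=energy[0])

def visual_tile_derivative(frames, tile=4):
    """frames: (T, H, W) grayscale video. Returns (T, n_tiles) temporal
    differences of the mean intensity of each tile x tile block."""
    frames = np.asarray(frames, dtype=float)
    T, H, W = frames.shape
    H, W = H - H % tile, W - W % tile            # crop to a multiple of the tile size
    tiles = frames[:, :H, :W].reshape(T, H // tile, tile, W // tile, tile)
    tiles = tiles.mean(axis=(2, 4))
    diff = np.diff(tiles, axis=0, prepend=tiles[:1])
    return diff.reshape(T, -1)
```

Both outputs have one row per video frame, so they can be cut into the same temporal blocks before the correlation analysis.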
0:12:28	and the analysis
0:12:30	unit in time
0:12:31	was fifty video frames, which corresponds to around
0:12:36	two seconds
0:12:37	depending on the sequence
0:12:39	and
0:12:39	as i mentioned, the spatial tile
0:12:41	was four by four pixels
0:12:43	but this is after downsampling the image frames
0:12:47	the image frame was downsampled
0:12:49	then
0:12:49	to
0:12:50	one
0:12:51	sixteenth
0:12:53	so
0:12:54	one fourth in each dimension
0:12:58	so here you can see some results
0:13:02	so these are the three sequences that i used; the first one is a monologue by a guy
0:13:07	in the second one there are two guys but only this guy is speaking
0:13:11	the other guy moves a bit,
0:13:14	he moves his head or his eyes
0:13:18	and the third one is
0:13:21	[unclear], it includes the bumping sound of the pen on the table
0:13:27	and
0:13:27	this is the result
0:13:29	so the x
0:13:30	axis means the
0:13:31	simulated asynchrony
0:13:33	from zero to plus minus one
0:13:36	thousand milliseconds, and the y axis
0:13:38	the estimation error
0:13:40	in milliseconds
0:13:42	and
0:13:44	the
0:13:45	black bars mean the results from using the mutual information and the white bars mean the
0:13:50	results from canonical correlation
0:13:54	so if you see the results
0:13:56	first, if you
0:13:57	see the results of mutual information
0:13:59	normally it is okay but
0:14:01	there are
0:14:01	some cases where the
0:14:03	error is not acceptable; for example, this case is more than a hundred milliseconds, this is
0:14:09	more than
0:14:09	four hundred milliseconds
0:14:10	this is outside
0:14:12	the acceptability threshold
0:14:15	and the main reason for this
0:14:19	was that
0:14:21	as i mentioned, in mutual information we need to set the parameter
0:14:25	of the Gaussian width
0:14:27	and i tried all different kinds of variances but
0:14:30	this was the best
0:14:31	and i couldn't find any better results, so
0:14:35	for some cases the best
0:14:37	width is some value, but on the other hand for the other cases it's
0:14:41	some different value, so that was the main difficulty
0:14:44	in using mutual information
0:14:46	on the other hand, if you look at the canonical correlation results, it's
0:14:50	always
0:14:53	less than one hundred milliseconds
0:14:55	and as i mentioned before, the acceptability threshold is around one hundred milliseconds so
0:15:01	here we can conclude that
0:15:03	for these sequences the canonical correlation
0:15:06	was successful
0:15:12	and this figure shows how the correlation measure changes according to the
0:15:19	hypothetical
0:15:20	time shift
0:15:21	and in this case i use the perfectly synchronized signals
0:15:26	so the
0:15:27	column in the middle
0:15:28	is the correct hypothesis
0:15:31	while this column,
0:15:33	the right side column, is a wrong
0:15:35	hypothesis
0:15:36	so this is the [unclear] thirty one, which means roughly one second
0:15:42	so here you can see that the correlation measure, the canonical correlation measure,
0:15:47	is
0:15:48	larger when they are synchronized
0:15:51	in the middle column
0:15:52	in comparison to
0:15:54	the right side
0:15:55	so
0:15:56	for [unclear]
0:15:58	zero point seven versus zero point eight, and in this case,
0:16:01	the bottom case, zero point six
0:16:03	versus
0:16:04	zero point nine
0:16:06	and one more thing you can see here is
0:16:08	the black
0:16:09	area
0:16:10	when i measure the correlation between different tiles and the audio signal
0:16:16	i took only the tiles which have motion inside
0:16:20	so
0:16:21	when the motion is negligible, then i didn't do the analysis, to
0:16:26	save the computation
0:16:28	and that is shown as the black part
0:16:31	so in this case you can see that
0:16:34	the analyzed
0:16:35	tiles are quite small in comparison to the whole
0:16:37	scene
0:16:42	finally, the conclusion
0:16:44	so to summarize, we proposed an automatic synchronization method
0:16:49	and i tried
0:16:50	different
0:16:52	correlation measures and found that the
0:16:54	quadratic mutual information
0:16:56	implementation
0:16:58	was quite sensitive to the Gaussian parameters
0:17:01	whereas the canonical correlation showed overall quite a robust result
0:17:09	and one thing i'd like to mention here is that
0:17:12	our approach was also applied to 3D
0:17:17	stereoscopic video synchronization
0:17:20	so in stereoscopic video you have two video streams and
0:17:24	if they are not synchronized then you see double
0:17:28	[unclear], for example here
0:17:31	i don't know if you can see clearly, but you see the laptop
0:17:34	the lid of the laptop is
0:17:35	[unclear]
0:17:36	viewed twice
0:17:37	one is here and
0:17:38	the other one is here
0:17:40	so this
0:17:41	synchronization problem may also occur in this case
0:17:44	and we applied a similar technique and we could solve the problem, and this was
0:17:50	presented
0:17:52	the previous year at [unclear]
0:17:56	finally, the future work is:
0:17:59	we have to test the method on more diverse content, because we used only three contents
0:18:04	here
0:18:05	and also we'd like to continue studying this synchronization problem in different
0:18:10	media like mobile, HDTV, or 3D
0:18:14	right
0:18:14	i
0:18:20	we have time just for one question
0:18:22	[unclear]
0:18:24	[unclear]
0:18:31	i think i know that; if i remember correctly, i think it is only for speech
0:18:38	[unclear]
0:18:38	for
0:18:39	so i think they found
0:18:41	they tried to find the
0:18:43	lip area first
0:18:45	and then they used some lip-specific features
0:18:48	to recover the synchronization, that's what i remember
0:18:52	but in
0:18:52	this case the difference is i didn't do that
0:18:56	i think i and
0:18:58	thank you, we move to the
0:19:00	second
0:19:02	paper

AUDIO-VISUAL SYNCHRONIZATION RECOVERY IN MULTIMEDIA CONTENT

Joint Audio Visual Processing

Speaker: Jong-Seok Lee, Authors: Jong-Seok Lee, Touradj Ebrahimi, Swiss Federal Institute of Technology in Lausanne, Switzerland