Speech Transcript - SYNTHESIZING VISUAL SPEECH TRAJECTORY WITH MINIMUM GENERATION ERROR

0:00:13	Good afternoon, everyone.
0:00:16	I am Lijuan Wang, from the speech group at Microsoft Research Asia.
0:00:20	The paper I am going to present is "Synthesizing Visual Speech Trajectory
0:00:26	with Minimum Generation Error".
0:00:28	This is joint work with
0:00:31	Yi-Jian Wu
0:00:33	from Microsoft,
0:00:34	Xiaodan Zhuang from UIUC in the USA, and also Frank Soong,
0:00:40	my for some
0:00:44	so
0:00:45	this work is
0:00:46	part of the project
0:00:48	of creating photo-real avatars at Microsoft.
0:00:51	The goal is to create an avatar that
0:00:55	looks just like you.
0:00:58	Avatars can be roughly divided into two categories, depending on how the avatar
0:01:04	interacts with the outside world.
0:01:07	The first kind of avatar
0:01:10	can be used in mediated human-to-human communication,
0:01:14	such as in telepresence.
0:01:17	and uh in this morning
0:01:19	uh
0:01:20	oh up to field channels
0:01:22	uh
0:01:23	in every talk we mentioned the ducking that a but are actually he's a is going to release very so
0:01:30	Another kind of avatar can be used in human-computer interaction,
0:01:35	for example, as an intelligent agent.
0:01:39	So for the next-generation avatar,
0:01:42	here is
0:01:43	our wish list.
0:01:44	We have some common expectations for the next generation of avatars.
0:01:48	First, we want it
0:01:50	to be easily integrated into the
0:01:53	things that take a word
0:01:55	We also want it to be high-fidelity and more realistic, closer to a human.
0:02:02	And the avatar
0:02:04	should be
0:02:05	personalized to
0:02:07	each unique user,
0:02:08	and
0:02:10	the avatar
0:02:11	should be easily and automatically created.
0:02:15	So
0:02:17	this is the motivation for this project,
0:02:22	and this paper focuses on
0:02:26	photo-realistic lips movement synthesis.
0:02:32	This slide lists
0:02:35	some
0:02:36	related work
0:02:37	in both
0:02:39	visual speech synthesis
0:02:41	and text-to-speech synthesis.
0:02:44	It is
0:02:46	interesting to see the overlap between these two fields.
0:02:51	So
0:02:53	generally speaking, many
0:02:56	methods
0:02:57	used in speech synthesis
0:03:00	have also been successfully applied to the visual speech synthesis field,
0:03:05	for example
0:03:07	unit selection,
0:03:08	the concatenation-based speech synthesis method, or HMM-based speech synthesis,
0:03:14	or the HMM-guided unit selection method,
0:03:17	et cetera.
0:03:19	So last,
0:03:21	last September,
0:03:22	we presented a paper
0:03:24	on HMM trajectory-guided sample selection for a photo-real talking head, at Interspeech.
0:03:30	So now we want to try,
0:03:33	we want to improve the system
0:03:36	by taking
0:03:37	advantage of
0:03:39	recent progress in speech synthesis.
0:03:42	So as a first attempt,
0:03:44	we are trying to improve the visual speech
0:03:47	statistical modeling by
0:03:50	applying the minimum generation error algorithm.
0:03:57	So let's
0:04:01	first do a quick review of the whole system.
0:04:05	Just like
0:04:06	what we do in
0:04:08	speech synthesis, where for a TTS system we first start with a speech database,
0:04:13	for visual speech synthesis we start with a
0:04:16	video database,
0:04:18	in which a speaker
0:04:20	is talking to the camera instead of a microphone,
0:04:24	reading some prompt scripts.
0:04:26	What we get are the video clips; that is
0:04:29	the audio-visual database.
0:04:31	We first do head pose normalization,
0:04:34	since the speaker will
0:04:36	normally, and naturally, change
0:04:39	or move his head during the recording.
0:04:42	So
0:04:43	after the head pose normalization,
0:04:46	every frame in the database is normalized to the fully frontal view, and then we can crop
0:04:52	the mouth images
0:04:54	using a fixed rectangular window.
0:04:58	Once we get all the mouth images, we do principal
0:05:02	component analysis
0:05:03	to get the visual features.
0:05:06	And then we do the audio-visual HMM training to get the HMMs.
0:05:11	That's the training part.
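The feature extraction step described above (crop the mouth, flatten each frame to one vector, apply PCA) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the image size, and the number of components are assumptions, and random arrays stand in for the cropped mouth images.

```python
import numpy as np

def fit_pca(mouth_images, n_components):
    """Fit PCA on cropped mouth images of shape (n_frames, height, width)."""
    n_frames = mouth_images.shape[0]
    # One row vector per frame: all pixels of the cropped mouth image.
    X = mouth_images.reshape(n_frames, -1).astype(np.float64)
    mean = X.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(mouth_images, mean, basis):
    """Map images to PCA feature vectors, one vector per frame."""
    X = mouth_images.reshape(mouth_images.shape[0], -1).astype(np.float64)
    return (X - mean) @ basis.T

def reconstruct(features, mean, basis, image_shape):
    """Restore approximate images from PCA feature vectors."""
    X = features @ basis + mean
    return X.reshape((-1,) + tuple(image_shape))

# Toy data standing in for 50 cropped mouth frames of 8 x 12 pixels.
rng = np.random.default_rng(0)
frames = rng.random((50, 8, 12))
mean, basis = fit_pca(frames, n_components=4)
features = project(frames, mean, basis)      # visual feature trajectory
restored = reconstruct(features, mean, basis, (8, 12))
```

The `reconstruct` function corresponds to restoring images from predicted PCA vectors, which the talk mentions later when showing the predicted lips images.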
0:05:12	In the synthesis part,
0:05:15	the input is some phoneme labels
0:05:18	plus
0:05:18	the alignment, that is,
0:05:20	the starting time and ending time.
0:05:22	First we pass that input through the well-trained HMM model
0:05:28	to generate the visual trajectory, just like what we do in speech synthesis, where we get
0:05:34	the speech parameter trajectory.
0:05:38	And the visual speech trajectory
0:05:40	will be used as a guide to select lips images from a real lips sample library.
0:05:47	Among those candidates we find the best ones
0:05:51	and
0:05:52	stitch them back to the full head
0:05:55	to render the full face animation.
0:05:58	So here is a more illustrative example of this
0:06:03	HMM trajectory-guided lips
0:06:06	image selection.
0:06:07	You can see that the top-line images are actually predicted by the HMM,
0:06:13	the visual trajectories.
0:06:15	Those images are actually restored from the predicted PCA vectors.
0:06:20	And using this visual trajectory as the guidance, we
0:06:24	select the image candidates
0:06:28	from our
0:06:29	lips
0:06:30	image library.
0:06:32	And then a smooth
0:06:36	path can be found
0:06:39	by using Viterbi search among those candidates.
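As an illustration of the selection step, the sketch below runs a Viterbi search over per-frame candidates. The talk does not specify the cost functions, so both costs here are assumptions: a target cost (distance from a candidate to the predicted trajectory point) and a concatenation cost (distance between consecutive candidates), combined with a hypothetical weight `concat_weight`.

```python
import numpy as np

def select_path(targets, candidates, concat_weight=1.0):
    """Pick one candidate per frame by Viterbi search.

    targets: (T, D) predicted trajectory.
    candidates: list of T arrays, each (K_t, D), library samples per frame.
    Returns the list of chosen candidate indices, one per frame.
    """
    # Accumulated cost of the best path ending in each candidate of frame 0.
    cost = np.linalg.norm(candidates[0] - targets[0], axis=1)
    back = []
    for t in range(1, len(candidates)):
        target_cost = np.linalg.norm(candidates[t] - targets[t], axis=1)
        # Pairwise distances between previous and current candidates.
        diff = candidates[t - 1][:, None, :] - candidates[t][None, :, :]
        concat = np.linalg.norm(diff, axis=2)
        total = cost[:, None] + concat_weight * concat
        back.append(total.argmin(axis=0))   # best predecessor per candidate
        cost = total.min(axis=0) + target_cost
    # Trace back from the cheapest final candidate.
    path = [int(cost.argmin())]
    for pointers in reversed(back):
        path.append(int(pointers[path[-1]]))
    return path[::-1]

# Toy example: three frames, the same three 1-D candidates at each frame.
targets = np.array([[0.0], [1.0], [2.0]])
library = np.array([[0.0], [1.0], [2.0]])
candidates = [library, library, library]
follow = select_path(targets, candidates, concat_weight=0.1)
smooth = select_path(targets, candidates, concat_weight=100.0)
```

With a small concatenation weight the path follows the predicted trajectory closely; with a large weight it prefers a path with little frame-to-frame change.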
0:06:46	Okay, so
0:06:49	as we can see,
0:06:51	for either the HMM-based parametric method or this HMM-guided hybrid approach,
0:06:58	the
0:06:59	statistical model is actually very important,
0:07:03	because
0:07:05	the visual trajectory to a large
0:07:08	extent determines how the lips can be rendered.
0:07:11	So that part is very important.
0:07:15	um can really being about pretty
0:07:18	In our previous work we used maximum likelihood
0:07:23	estimation for the HMM parameters,
0:07:26	or in short we call it ML-based training.
0:07:30	One notable observation is that the mouth movement is over-smoothed,
0:07:36	and
0:07:38	it stays in a small band, compressed to a range much smaller than the real dynamic range.
0:07:45	This observation is actually quite similar to what we observe
0:07:51	in HMM-based TTS.
0:07:55	So
0:07:57	to improve the model, we propose to
0:08:01	use a minimum generation error approach
0:08:04	to improve
0:08:07	the visual HMM
0:08:08	parameter
0:08:10	training,
0:08:11	the parameter estimation.
0:08:15	So
0:08:17	in the
0:08:18	minimum generation error criterion,
0:08:20	the first important thing is that we need to define the error, what the error is.
0:08:26	Here we define
0:08:28	the visual generation error
0:08:30	for each
0:08:31	video sentence as the Euclidean distance between the
0:08:36	PCA vectors, the PCA trajectories.
0:08:39	For the whole training set, it is the average over
0:08:43	all the training sentences, the errors of all the training sentences.
0:08:47	So the objective of the MGE
0:08:49	criterion is to
0:08:51	optimize the model parameters so that the total generation error can be minimized.
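The error definition above can be written down in a few lines. This is a sketch under stated assumptions: the distance is taken between the generated trajectory and the original (recorded) one, each stored as an array of shape (frames, PCA dimensions), and the whole trajectory is treated as one long vector. The function names are illustrative only.

```python
import numpy as np

def sentence_error(generated, original):
    """Euclidean distance between two (n_frames, n_dims) PCA trajectories."""
    return float(np.linalg.norm(generated - original))

def total_generation_error(generated_set, original_set):
    """Average of the sentence-level errors over the training set."""
    errors = [sentence_error(g, o) for g, o in zip(generated_set, original_set)]
    return sum(errors) / len(errors)

# Toy example with two one-frame "sentences" in a 2-D PCA space.
original_set = [np.zeros((1, 2)), np.ones((1, 2))]
generated_set = [np.array([[3.0, 4.0]]), np.ones((1, 2))]
total = total_generation_error(generated_set, original_set)
```

In MGE training, the model parameters would be adjusted to reduce this total; the update formulas themselves are in the paper and are not reproduced here.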
0:08:59	We note that
0:09:01	the direct solution for that equation is mathematically intractable, so here we adopt a probabilistic
0:09:09	descent method to re-estimate
0:09:12	the HMM,
0:09:13	the visual HMM parameters.
0:09:15	And
0:09:17	the formulas for updating the means and the variances can be
0:09:21	found in the paper.
0:09:25	So
0:09:26	we
0:09:27	incorporated the MGE-based
0:09:31	training into the whole system.
0:09:34	We want to jointly refine all the visual HMMs.
0:09:38	Here is the whole process.
0:09:41	So
0:09:42	in the first step, we first initialize the model and also the state alignment
0:09:48	using the traditional baseline,
0:09:51	the maximum likelihood training.
0:09:54	And then,
0:09:56	here, we refine the state alignment in a heuristic manner: we just
0:10:03	try to shift each boundary to the left and to the right
0:10:08	and see
0:10:09	the
0:10:10	total generation error before and after the shift.
0:10:14	That is
0:10:15	mainly to find the optimal state alignment
0:10:20	under the MGE criterion.
0:10:24	So after that refinement of the alignment, we re-estimate the model.
0:10:30	I'm sorry,
0:10:32	the next step is:
0:10:33	given the updated
0:10:36	state alignment,
0:10:37	we refine the visual HMM parameters by using the probabilistic descent algorithm.
0:10:46	And we go back to step
0:10:48	two and step three
0:10:50	until there is no further decrease of the total generation error.
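The heuristic alignment refinement in step two can be sketched as a simple greedy search. This is only an illustration of the boundary-shifting idea: `error_fn` is a hypothetical callback that returns the total generation error for a given set of state boundaries, the one-frame shift size is an assumption, and the parameter re-estimation of step three is not included.

```python
def refine_alignment(boundaries, error_fn, n_frames, max_passes=10):
    """Greedily shift state boundaries by one frame while the error decreases.

    boundaries: increasing frame indices at which a new state starts.
    error_fn: callable mapping a boundary list to a total generation error
              (hypothetical; stands in for generating and scoring trajectories).
    """
    best = list(boundaries)
    best_error = error_fn(best)
    for _ in range(max_passes):
        improved = False
        for i in range(len(best)):
            for shift in (-1, 1):           # try left, then right
                trial = list(best)
                trial[i] += shift
                lower = trial[i - 1] if i > 0 else 0
                upper = trial[i + 1] if i + 1 < len(trial) else n_frames
                if not (lower < trial[i] < upper):
                    continue                # every state keeps at least one frame
                trial_error = error_fn(trial)
                if trial_error < best_error:
                    best, best_error = trial, trial_error
                    improved = True
        if not improved:
            break                           # no further decrease of the error
    return best, best_error

# Toy error: distance of the boundaries from a known "ideal" alignment.
ideal = [3, 7]
toy_error = lambda b: sum(abs(x - y) for x, y in zip(b, ideal))
refined, refined_error = refine_alignment([5, 6], toy_error, n_frames=10)
```

In the full procedure described in the talk, each alignment refinement would be followed by a parameter update before the next pass.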
0:10:58	Here is the experiment to evaluate the MGE algorithm.
0:11:03	The audio-visual database we used is the LIPS Challenge 2008 and 2009
0:11:10	challenge database. It
0:11:13	includes
0:11:16	fewer than three hundred video sentences,
0:11:19	with synchronized audio and video tracks, spoken by a single native female speaker in neutral emotion.
0:11:28	So
0:11:28	the experiment is mainly to compare two approaches: the baseline approach, which is the
0:11:35	maximum likelihood based method, and the proposed MGE-based
0:11:40	method.
0:11:40	And both approaches are compared with the ground truth, the
0:11:45	original trajectories spoken by the real person.
0:11:49	In the objective evaluation, since the database is very small,
0:11:53	we used leave-
0:11:55	out, actually leave-twenty-
0:12:00	out cross-validation for the open test.
0:12:03	And
0:12:05	the objective measures we used are mean square error,
0:12:08	average cross-correlation, and we also
0:12:12	measure the global variance.
0:12:15	We also conducted a subjective evaluation
0:12:20	using the so-called MOS, in terms of audio-visual consistency.
0:12:25	Six
0:12:26	subjects attended this evaluation.
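The three objective measures named above can be computed per PCA dimension as in the sketch below. The exact definitions used in the paper are not given in the talk, so these are assumptions: the mean square error is averaged over frames, the cross-correlation is taken as the Pearson correlation per dimension and then averaged, and the global variance is the variance of each dimension over the utterance.

```python
import numpy as np

def per_dimension_mse(generated, original):
    """Mean square error over frames, one value per PCA dimension."""
    return np.mean((generated - original) ** 2, axis=0)

def average_correlation(generated, original):
    """Pearson correlation per PCA dimension, averaged over dimensions."""
    corrs = [np.corrcoef(generated[:, d], original[:, d])[0, 1]
             for d in range(original.shape[1])]
    return float(np.mean(corrs))

def global_variance(trajectory):
    """Variance of each PCA dimension over the whole trajectory."""
    return np.var(trajectory, axis=0)

# Toy example: an "over-smoothed" trajectory with half the true amplitude.
t = np.linspace(0.0, 6.0, 100)
original = np.column_stack([np.sin(t), np.cos(t)])
generated = 0.5 * original
mse = per_dimension_mse(generated, original)
corr = average_correlation(generated, original)
gv_ratio = global_variance(generated) / global_variance(original)
```

In this toy case the correlation stays high while the global variance shrinks to a quarter of the original, which is the kind of compressed dynamic range the talk attributes to ML-trained trajectories.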
0:12:30	So
0:12:32	with this figure
0:12:35	I want to show what the trajectory looks like.
0:12:40	So
0:12:41	in this figure you can see that
0:12:43	the the brain
0:12:44	the way colour line actually is that one choose
0:12:47	and the red colour is the
0:12:50	ML-based approach,
0:12:52	and the blue colour is the proposed MGE-based method.
0:12:59	You can see that,
0:13:02	I highlighted
0:13:04	the peak and valley parts, you can see that,
0:13:06	especially for those critical parts, the peaks and valleys,
0:13:10	the proposed MGE method
0:13:13	generates a trajectory closer to
0:13:17	the ground-truth trajectory, which the real human produced.
0:13:24	And this is the evaluation of the mean square error.
0:13:29	In that figure,
0:13:30	the first part
0:13:32	on the left,
0:13:33	the left is
0:13:35	the MSE
0:13:40	calculated as
0:13:41	the sum over all the PCA dimensions,
0:13:44	and
0:13:47	the rest of the chart bars are actually for the top,
0:13:51	the top four components.
0:13:55	So
0:13:57	there is roughly about five percent
0:14:01	error reduction
0:14:03	using the new proposed method.
0:14:06	And actually, later, after
0:14:09	we submitted this paper, we tested on different corpora,
0:14:14	and the improvement is quite consistent, about five to seven percent across different databases.
0:14:23	And this is about the
0:14:26	cross-correlation. So
0:14:29	especially for the
0:14:33	first component, you can see that
0:14:35	it
0:14:37	improves the correlation, which is very important, as the first PCA component
0:14:44	is really related to the mouth opening,
0:14:47	mouth opening and closing.
0:14:51	Also, this is the result for the global variance.
0:14:57	The proposed MGE method can recover
0:15:01	a lot of the
0:15:04	compressed variance.
0:15:09	This is
0:15:11	for the
0:15:12	subjective evaluation.
0:15:14	We only used the lower face
0:15:17	to do this subjective test,
0:15:19	because we want people to focus only on the lips region.
0:15:26	We generated
0:15:28	twelve test videos for each approach,
0:15:31	and uh this is a party to that is depends
0:15:35	we
0:15:35	a a us score and most a score for for each radius
0:15:40	the mel
0:15:41	and did
0:15:42	uh this one this one that the
0:15:45	then
0:15:46	left
0:15:46	a to why is the original video
0:15:49	that's can to lists sure
0:15:57	that's i two tests show
0:16:05	lists
0:16:06	oh
0:16:13	Okay, so
0:16:14	here I want to show a demo. Actually this is an
0:16:19	online,
0:16:20	this is an online product.
0:16:22	It's called,
0:16:24	it is a vertical search in Bing.
0:16:28	In Bing search, that is the Bing Dictionary, an online dictionary, we actually put the
0:16:34	talking head as a virtual English teacher on that
0:16:37	site.
0:16:38	It will
0:16:39	help
0:16:40	English learners learn how to pronounce each word.
0:16:46	I can play the video.
0:16:50	So this is the Bing Dictionary.
0:16:54	i
0:16:56	i
0:17:03	i
0:17:04	i
0:17:12	so why you uh six
0:17:14	was
0:17:15	what you need to do is find this TV icon,
0:17:19	and you click it,
0:17:20	then the talking head will pop up.
0:17:27	i
0:17:30	this
0:17:38	Okay,
0:17:39	so
0:17:40	here is my conclusion.
0:17:42	So here
0:17:46	we applied the minimum generation error approach to visual speech synthesis.
0:17:52	In objective evaluation, compared with the baseline
0:17:56	maximum likelihood based approach, we get consistent improvement in
0:18:02	mean square error reduction, and also an increase in correlation, and we also recovered the
0:18:08	global variance.
0:18:10	In subjective evaluation, we found that it can increase the mouth movement range and also
0:18:16	make the talking head
0:18:17	more like a real human.
0:18:21	Thank you.
0:18:28	Any questions?
0:18:39	yeah
0:18:39	thank you for two
0:18:41	um
0:18:42	option you know soon as to maybe most to occlusion
0:18:46	yeah use to the do that you P C to some features please
0:18:51	but to region features
0:18:54	uh
0:18:56	Actually,
0:18:58	after head pose normalization you can imagine that all the
0:19:02	face images are fully frontal,
0:19:04	and then we just use
0:19:07	a fixed rectangular window to crop the mouth region.
0:19:11	So
0:19:12	the PCA actually
0:19:14	is done on the mouth
0:19:16	images.
0:19:17	First you crop the mouth images and then do PCA.
0:19:22	Yeah, yeah.
0:19:24	So
0:19:25	for these mouth images, all the pixels
0:19:28	are laid out as a single vector,
0:19:31	so one vector for each frame, and then you can do PCA
0:19:36	like any
0:19:37	see for dimension back
0:19:39	which are backed
0:19:40	you know the the we shouldn't be two
0:19:43	you just one go to my mind
0:19:46	you you do you use
0:19:49	and in a the with
0:19:53	a each
0:19:54	we can uh
0:19:57	so it really true
0:20:00	you
0:20:01	i
0:20:04	with that you i think
0:20:06	i agree
0:20:08	question
0:20:15	or question
0:20:16	hmmm
0:20:16	i
0:20:17	to range
0:20:20	just
0:20:22	and look still
0:20:24	know
0:20:25	we didn't we didn't try to stream
0:20:28	yeah we can we can try
0:20:36	i
0:20:37	a questions
0:20:39	oh
0:20:40	okay
0:20:42	i
0:20:44	i
0:20:47	i
0:20:55	i
0:20:56	i
0:20:57	a
0:21:02	i
0:21:07	oh
0:21:09	oh
0:21:11	question
0:21:13	i
0:21:29	uh
0:21:30	yeah you mean the that the part i
0:21:34	the the tiny girl actually
0:21:36	at the boy it's you you heard actually is them a lady T D N
0:21:42	and uh
0:21:43	i i think uh is
0:21:45	it's uh
0:21:46	that good try but us because firstly we manage a in my imagination with that
0:21:51	we think that maybe they are be a will be some mismatch well we use a mac a ladies T
0:21:57	S with and trying to the ladies
0:21:59	talking head
0:22:00	but after we
0:22:03	do it and show you that
0:22:05	i think
0:22:06	uh
0:22:07	i okay or i is acceptable
0:22:16	a it doesn't sound like best
0:22:20	right
0:22:21	yeah
0:22:24	yeah it may be K common up about that so that
0:22:32	okay

SYNTHESIZING VISUAL SPEECH TRAJECTORY WITH MINIMUM GENERATION ERROR

Speech Synthesis

Presented by: Lijuan Wang, Author(s): Lijuan Wang, Microsoft Research Asia, China; Yi-Jian Wu, Microsoft Corporation, China; Xiaodan Zhuang, Beckman Institute / University of Illinois at Urbana-Champaign, United States; Frank K. Soong, Microsoft Research Asia, China