Good afternoon, everyone. I am from the Speech Group at Microsoft Research Asia, and the paper I am going to present is "Synthesizing Visual Speech Trajectory with Minimum Generation Error." This is joint work with my colleagues at Microsoft and a collaborator from UIUC in the USA, together with Frank Soong.

This work is part of a project on creating photo-real avatars at Microsoft. The goal is to create an avatar that looks just like you. Avatars can be roughly divided into two categories depending on whom the avatar interacts with. The first kind can be used to mediate human-to-human communication, such as telepresence, which was also touched on in one of this morning's talks, where an avatar of this kind was said to be released very soon. The second kind can be used in human-computer interaction, for example as an intelligent agent. For the next generation of avatars we have some common expectations: first, we want them to be easily integrated into existing applications; second, we want them to be high-fidelity and realistic, close to a real human; third, the avatar should be personalized to each unique user; and finally, the avatar should be easy to create automatically. That is the motivation for this project, and this paper focuses on photo-realistic lip movement synthesis.

This slide lists some related work in both visual speech synthesis and text-to-speech synthesis, and it is interesting to see the overlap between the two fields. As you probably know, many methods used in speech synthesis have been successfully applied to visual speech synthesis, for example unit-selection, concatenation-based synthesis, HMM-based synthesis, and HMM-guided unit selection. Last September at Interspeech we presented a paper on HMM trajectory-guided sample selection for a photo-real talking head. Now we want to improve that system by taking advantage of recent progress in speech synthesis, and the first attempt is to improve the visual statistical modeling by applying the minimum generation error (MGE) criterion.

Let me first give a quick review of the whole system. Just as an audio TTS system starts with a speech database, for visual speech synthesis we start with a video database: we have a speaker talking to the camera, instead of into a microphone, reading a prepared script. Once we get the recorded video clips, we first do head pose normalization, since the speaker naturally moves and changes his or her head pose during the recording. After head pose normalization, every frame in the database is normalized to a fully frontal view, and then we can crop the mouth images using a fixed rectangular window. Once we have all the mouth images, we apply principal component analysis (PCA) to obtain the visual features, and then we do audio-visual HMM training to get the HMMs. That is the training part.
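As a rough illustration of the feature extraction step just described, here is a minimal Python sketch, assuming the frames are already head-pose normalized. The crop rectangle coordinates, the number of PCA components, and the use of scikit-learn are illustrative assumptions of mine, not details from the talk.

# Sketch of the training-side visual feature extraction: crop the mouth with a
# fixed rectangle from pose-normalized frontal frames, flatten each crop into one
# pixel vector per frame, and run PCA to get the visual feature trajectory.
import numpy as np
from sklearn.decomposition import PCA

MOUTH_BOX = (200, 320, 140, 260)   # (top, bottom, left, right): hypothetical crop window

def crop_mouth(frame):
    """Cut the mouth region out of a pose-normalized frontal frame."""
    top, bottom, left, right = MOUTH_BOX
    return frame[top:bottom, left:right]

def build_visual_features(frames, n_components=60):
    """One flattened pixel vector per frame, then PCA over the whole set."""
    samples = np.stack([crop_mouth(f).astype(np.float64).ravel() for f in frames])
    pca = PCA(n_components=n_components)
    features = pca.fit_transform(samples)   # one PCA coefficient vector per frame
    return pca, features                    # these features feed the audio-visual HMM training

if __name__ == "__main__":
    # Dummy grayscale video: 500 frames of 360x480 pixels.
    rng = np.random.default_rng(0)
    frames = rng.integers(0, 255, size=(500, 360, 480), dtype=np.uint8)
    pca, feats = build_visual_features(frames)
    print(feats.shape)   # (500, 60): the visual PCA trajectory of one recording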
For the synthesis part, the input is the phoneme labels plus their alignment, that is, the starting and ending times. First we use that input with the well-trained HMMs to generate the visual trajectory, just as in speech synthesis we generate the speech parameter trajectory. The generated visual trajectory is then used as a guide to select lip images from a well-built sample library; among those candidates we find the best ones, and each selected image is stitched back onto the full head to render the full-face animation.

Here is a more concrete example of this HMM trajectory-guided lip image selection. The top row of images is the trajectory predicted by the HMMs; those images are restored from the predicted PCA vectors. Using this predicted trajectory as the guidance, we select image candidates from our real-image library, and then the smoothest path through those candidates can be found with a Viterbi search.

As we can see, for either the HMM-based parametric method or this HMM-guided hybrid approach, the guiding trajectory model is very important, because the accuracy of the trajectory to a large extent determines how well the lips can be rendered. So this part really deserves to be improved. In our previous work we used maximum-likelihood estimation of the HMM parameters, in short ML-based training. One well-known limitation is that the generated mouth movement is over-smoothed: its dynamic range comes out much smaller than the natural dynamic range. This observation is quite similar to what we see in HMM-based TTS. So, to improve the model, we propose to use a minimum generation error approach to improve the visual HMM parameter estimation.

In the minimum generation error criterion, the first important thing is to define what the error is. Here we define the generation error of each visual sentence as the Euclidean distance between the generated and natural PCA vector trajectories, and for the whole training set the total error is the average over all the training sentences. The objective of the MGE criterion is to optimize the model parameters so that this total generation error is minimized. A direct solution to this problem is mathematically intractable, so we adopt a probabilistic descent method to re-estimate the HMM parameters; the formulas for updating the means and variances can be found in the paper.

We incorporated this MGE-based training into the whole system, because we want to jointly refine all the parameters. Here is the overall process. In the first step, we initialize the model and the state alignment using the traditional baseline maximum-likelihood training. Then we refine the state alignment with a heuristic method: we simply try to shift each state boundary to the left and to the right and check the total generation error before and after the shift; this step is mainly there to find the optimal state boundaries under the MGE criterion.
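As a minimal sketch of the generation error just defined, the Euclidean distance between the generated and natural PCA trajectories of a sentence, averaged over the training sentences, could be computed as below. The squared-distance convention and the per-frame summation are my own assumptions about the exact formulation; the precise definition and the probabilistic-descent update are in the paper.

# Sketch of the MGE quantity: per-sentence trajectory error and its average
# over the training set, which the MGE training tries to minimize.
import numpy as np

def sentence_generation_error(gen_traj, nat_traj):
    """Distance between generated and natural PCA trajectories of one sentence.
    Both arrays are (num_frames, num_pca_dims); summing squared per-frame
    distances is an assumed convention."""
    assert gen_traj.shape == nat_traj.shape
    return float(np.sum((gen_traj - nat_traj) ** 2))

def total_generation_error(gen_trajs, nat_trajs):
    """Average generation error over all training sentences."""
    errors = [sentence_generation_error(g, n) for g, n in zip(gen_trajs, nat_trajs)]
    return float(np.mean(errors))

In the boundary-refinement step, a shift of a state boundary would be kept only if this total error goes down after regenerating the affected trajectory.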
In the next step, based on the refined state alignment, we re-estimate the visual HMM parameters using the probabilistic descent algorithm, and then we go back to steps two and three until there is no further reduction of the total generation error.

Here are the experiments to evaluate the proposed method. The audio-visual database we used is the LIPS 2008/2009 challenge database. It includes a little less than three hundred video sentences with the corresponding audio, recorded by a single native female speaker in a neutral emotion. The experiment mainly compares two approaches: the baseline is the maximum-likelihood (ML) based method and the proposed one is the MGE-based method, and both are also compared with the ground truth, the real trajectories produced by the real person. For the objective evaluation, since the database is very small, we used leave-out cross-validation, with twenty folds, for the open test. The objective measures are the mean square error, the average cross-correlation, and also the global variance. We also conducted a subjective evaluation with MOS tests in terms of audio-visual consistency; six subjects attended this evaluation.

This figure shows what the trajectories look like. One curve is the ground truth, the red curve is the ML-based approach, and the blue curve is the proposed MGE-based method. I have highlighted the peaks and valleys, and you can see that especially at those critical parts, the peaks and valleys, the trajectory generated by the proposed MGE method is closer to the ground-truth trajectory that the real human produced.

This is the evaluation of the mean square error. The first bar on the left is the MSE accumulated over all the PCA dimensions, and the rest of the shorter bars are for the top four components. There is roughly a five percent error reduction with the proposed method. Actually, after we submitted this paper we also tested on a different corpus, and the improvement is quite consistent, about five to seven percent across different databases. This is the cross-correlation result: the proposed method improves the correlation, especially for the first PCA component, which is directly related to the mouth opening and closing. And this is the result for the global variance: the proposed MGE method recovers a lot of the compressed variance.
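As a rough illustration of the objective measures just mentioned (mean square error, cross-correlation, and global variance of the PCA trajectories), here is a small Python sketch, under my own assumptions about the exact per-component definitions used in the paper; the dummy data is purely illustrative.

# Sketch of the objective measures, computed per PCA component.
import numpy as np

def mse_per_component(gen, ref):
    """Mean square error per PCA component; gen/ref are (num_frames, num_dims)."""
    return np.mean((gen - ref) ** 2, axis=0)

def correlation_per_component(gen, ref):
    """Pearson cross-correlation of each generated component with the ground truth."""
    return np.array([np.corrcoef(gen[:, d], ref[:, d])[0, 1] for d in range(gen.shape[1])])

def global_variance(traj):
    """Per-component variance; over-smoothing shows up as a smaller variance
    than that of the natural trajectory."""
    return np.var(traj, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    ref = rng.normal(size=(200, 4))                      # 200 frames, top 4 PCA components
    gen = 0.7 * ref + 0.1 * rng.normal(size=(200, 4))    # attenuated, smoothed prediction
    print(mse_per_component(gen, ref))
    print(correlation_per_component(gen, ref))
    print(global_variance(gen), global_variance(ref))    # generated variance is compressed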
For the subjective evaluation, we only used the lower face in the test videos, because we want people to focus only on the lip region, and we generated twelve test animations for each approach. This figure shows the preference score and the MOS score for each method; the one on the far left is for the original video.

Okay, here I also want to show a demo. This is actually an online product: it is the vertical search in Bing, the Bing Dictionary online dictionary. We put the talking head there as a virtual English teacher on the web page, where it helps English learners with how to pronounce each word. Let me play it. So this is the Bing Dictionary: when you search for a word you can find this TV icon, and when you click it the talking head pops up, like this.

Okay, so here is my conclusion. We applied the minimum generation error approach to visual speech synthesis. In the objective evaluation, compared with the baseline maximum-likelihood approach, we get a consistent improvement: a reduction in mean square error, an increase in correlation, and also a recovery of the compressed variance. In the subjective evaluation we found that it can increase the dynamic range of the mouth movement and also make the talking head more like a real human. Thank you. Any questions?

Question: Thank you for the talk. For the visual features, do you do the PCA on pixel features or on mouth-region features? Answer: Actually, after head pose normalization you can imagine that all the face images are fully frontal, and then we just use a fixed rectangular window to crop the mouth region, so the PCA is done on the mouth images: first you crop the mouth images and then you apply PCA. Yes, for these mouth images all the pixels are flattened into one sample vector per frame, and then you can do PCA to get a lower-dimensional feature vector.

Question: (partly inaudible) Answer: No, we didn't try that yet; we can try it. Any other questions?

Question: (about the voice in the demo) Answer: Yes, you mean that part: the talking head is a girl, but the voice you heard is actually a male TTS. I think it was a good trial for us, because at first we imagined there might be some mismatch when we used a male TTS voice to drive the lady's talking head, but after we did it and showed it, I think it is acceptable; it doesn't sound that bad, right? Maybe we will get some comments about that. Okay.
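To make the Q&A answer about the PCA features concrete, here is a tiny sketch of flattening one cropped mouth frame into a pixel vector, projecting it onto the PCA space (one coefficient vector per frame), and restoring an image from a PCA vector, which is how the predicted-trajectory images shown earlier were rendered. The image size, the number of components, and the use of scikit-learn are illustrative assumptions, not details from the talk.

# Sketch of the per-frame pixel-PCA forward projection and image restoration.
import numpy as np
from sklearn.decomposition import PCA

H, W, N_COMPONENTS = 120, 120, 40                 # hypothetical crop size and PCA order

rng = np.random.default_rng(2)
mouth_frames = rng.random((300, H, W))            # dummy pose-normalized mouth crops

# One flattened pixel vector per frame, then PCA over the whole set.
pixel_vectors = mouth_frames.reshape(len(mouth_frames), -1)
pca = PCA(n_components=N_COMPONENTS).fit(pixel_vectors)

# Forward: a single frame -> its PCA coefficient vector (one point of the visual trajectory).
coeffs = pca.transform(pixel_vectors[:1])         # shape (1, 40)

# Inverse: a (predicted) PCA vector -> a restored mouth image, as in the trajectory figure.
restored = pca.inverse_transform(coeffs).reshape(H, W)
print(coeffs.shape, restored.shape)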