Good morning. In this talk we look at subject-independent acoustic-to-articulatory inversion. In acoustic-to-articulatory inversion we work with two domains: the raw speech signal domain, from which we compute acoustic features, and the articulatory domain. The problem we are really trying to solve is to estimate articulatory feature vectors from the time-synchronized acoustic features: given a test utterance, we extract acoustic feature vectors from the speech signal, and from those we estimate the corresponding articulatory feature vectors.

Before getting to the subject-independent part, let me first discuss what subject-dependent inversion means, because that is how the problem is usually addressed in the literature. As I mentioned, we are trying to estimate articulatory features given a test utterance. What makes the traditional approach subject-dependent is that the training and test subjects are identical: you have parallel acoustic and articulatory feature vectors from a single subject, you use these to develop a model of the mapping from acoustic to articulatory feature vectors, and you then use that model to estimate articulatory feature vectors for the same subject. This is the easier setting, because everything is done on the same subject, so the acoustic space is matched and the articulatory space is matched. What we try to look into in this paper is subject-independent acoustic-to-articulatory inversion: given training data from a completely different subject, how do we develop a model that can then be used to estimate articulatory features for a different test subject? It should immediately start becoming obvious what kinds
of challenges you will face. The acoustic spaces of the two subjects are not the same, so you cannot directly apply a Euclidean-type similarity metric between them; and at the same time the articulatory spaces are not similar either, since each subject has their own articulatory range. So both the acoustic and the articulatory spaces differ. At this point I want to make clear how the estimated articulatory trajectories should be interpreted: you can think of them as the articulatory features that the training subject would have produced, had he or she been the one uttering the test utterance.

Why would you even want to do subject-independent inversion rather than stick with the traditional subject-dependent setting? The subject-dependent version is simpler and it works reasonably well. The problem is extending it to an arbitrary number of speakers, because you would need articulatory data from each of those speakers, and articulatory data is nowhere near as simple to collect as acoustic data. If we could have an approach where we collect parallel articulatory and acoustic data from just one talker, and then extend it to any number of speakers for whom we have only acoustic data, that would be highly desirable. You could then think of applications like joint acoustic-articulatory speech recognition: once you have a model from that one subject, you can apply it to as many speakers as you want. That is the motivation for subject-independent inversion, and now I will move to the details of how we propose to do it. We do it by minimization of a generalized smoothness criterion for the articulatory trajectories, which was
initially proposed for subject-dependent inversion, but we found that it can be extended to subject-independent inversion just as well. Let me go into the criterion a bit deeper. What this criterion does is use a different weighting function for each articulatory trajectory: if you have M articulatory features, you have M weighting functions, because each articulator moves with a different speed and therefore needs a different level of smoothness. The first term is the smoothness penalty; you can think of each weighting function as a high-pass filter whose cutoff is optimized on a development set for each particular articulator, and this term heavily penalizes any unrealistically jagged trajectory you might otherwise estimate. The second term is the data term, which ties the estimate to the training data: y_p are the probable values that the articulatory feature could take, and w_p are the corresponding probabilities. We will look at how these can be estimated in a moment. The high-pass cutoff I mentioned is optimized on a separate development set for each articulatory feature. Estimating the probable values and their probabilities is simple in the subject-dependent case, and I will come to the subject-independent case after this. Essentially, since the acoustic space is shared in the subject-dependent case, you can estimate the probable articulatory values by an approximate search in
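To make the two terms concrete, here is a sketch of the kind of objective being described, in hypothetical notation (the exact form, symbols, and trade-off constant in the paper may differ): x[n] is the estimated trajectory of one articulator, g its articulator-specific high-pass filter, and y_p[n], w_p[n] the probable values and probabilities at frame n.

```latex
\hat{x} \;=\; \arg\min_{x}\;
\underbrace{\sum_{n} \big|\,(g * x)[n]\,\big|^{2}}_{\text{smoothness: high-pass energy}}
\;+\; C\,
\underbrace{\sum_{n}\sum_{p} w_{p}[n]\,\big(x[n]-y_{p}[n]\big)^{2}}_{\text{data term}}
```

The first term is small only when the trajectory has little energy above the articulator's cutoff frequency; the second pulls the estimate toward the probable values, weighted by their probabilities.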
the acoustic space. Given the test utterance and one of its acoustic feature vectors, you find the closest acoustic feature vectors in the training data, and their corresponding articulatory features become the probable values. That is, you compute a Euclidean distance from the test frame to each of the acoustic feature vectors in your training data, select the L closest acoustic frames, and take their corresponding articulatory feature vectors as the probable values y_p that the articulatory features could take. You then estimate the probabilities w_p as inversely proportional to the distances, because if a training acoustic frame is very close to the test frame, you want the probability of its articulatory value being the true one to be high. So all we are using in the subject-dependent case is a simple Euclidean distance between acoustic features, and that is the subject-dependent version of the generalized smoothness criterion.

The question is how to extend this when the two subjects are not the same. If the acoustic spaces of the two subjects are different, you cannot compute something like a simple Euclidean distance between their acoustic features. What we propose in this paper is to map the acoustic features to a different space in which you can compare them with a simple distance metric that you know. We achieve this through the concept of a general acoustic space. You can think of the general acoustic space in this fashion: it consists of acoustic feature vectors from a number of different talkers, and it does not contain either of the two talkers of interest; it is basically something you can think of
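As a rough illustration of the subject-dependent estimation step just described, here is a minimal sketch (all names, dimensions, and the neighbor count are hypothetical; the paper's actual settings are not reproduced here):

```python
import numpy as np

def probable_values(test_frame, train_acoustic, train_artic, k=10, eps=1e-8):
    """Find the k training frames whose acoustic features are closest
    (Euclidean) to the test frame; return their articulatory features
    as probable values, weighted inversely to the acoustic distance."""
    # Euclidean distance from the test frame to every training frame
    d = np.linalg.norm(train_acoustic - test_frame, axis=1)
    idx = np.argsort(d)[:k]          # the k nearest acoustic neighbors
    y_p = train_artic[idx]           # probable articulatory values
    w = 1.0 / (d[idx] + eps)         # inverse-distance weights
    w /= w.sum()                     # normalize to probabilities
    return y_p, w

# Toy example: 200 training frames, 12-dim acoustics, 4-dim articulators.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 12))
Y = rng.normal(size=(200, 4))
y_p, w = probable_values(A[0], A, Y, k=5)
```

Since the query here equals training frame 0, that frame gets the largest weight, matching the intuition that nearer acoustic neighbors should dominate the data term.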
as a speaker-independent picture of how speech sounds in general. This general acoustic space, containing acoustic feature vectors from a number of different speakers, is partitioned into different clusters, and each of these clusters is modeled using a Gaussian mixture model. Given this general acoustic space, you can now transform each acoustic feature vector, both the training and the test acoustic feature vectors, into a probability vector containing the posterior probabilities of that acoustic feature vector having come from each of the clusters; after normalization these posteriors sum to one. Now that you have mapped your acoustic feature vectors to probability vectors, you can meaningfully compare them with a simple Euclidean distance, and that is the modified distance metric used for the subject-independent case. So where the subject-dependent case computes a simple Euclidean distance between the acoustic features themselves, the subject-independent case first maps them into the probability space and then computes the Euclidean distance there.

Note, however, that we still have on the order of five hundred thousand training frames to search, and that is for just one test frame and one articulator, so the computational cost was immense. To reduce it, we propose searching only the relevant frames through a binning scheme. Remember that at this point we only deal with probability vectors, not with the acoustic features directly. We sort the probability vectors into a number of different bins: the first bin contains all the probability vectors for which the first component is the highest, the second bin contains all those for which the second component is the highest, and so on. At test time, when you get a test frame,
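The posterior mapping and the modified distance can be sketched as follows. For brevity this uses a single full-covariance Gaussian per cluster, whereas the talk models each cluster with a Gaussian mixture; all names and the toy numbers are hypothetical:

```python
import numpy as np

def posterior_vector(x, means, covs, priors):
    """Map an acoustic feature vector x to the vector of posterior
    probabilities over the clusters of a 'general acoustic space'.
    One Gaussian per cluster here; the talk uses mixture models."""
    K, D = means.shape
    logp = np.empty(K)
    for k in range(K):
        diff = x - means[k]
        _, logdet = np.linalg.slogdet(covs[k])
        maha = diff @ np.linalg.solve(covs[k], diff)
        logp[k] = np.log(priors[k]) - 0.5 * (D * np.log(2 * np.pi) + logdet + maha)
    logp -= logp.max()              # numerical stability before exponentiating
    p = np.exp(logp)
    return p / p.sum()              # posteriors sum to one

def modified_distance(x1, x2, means, covs, priors):
    """Subject-independent distance: Euclidean distance between the
    posterior vectors rather than between the raw acoustic features."""
    return np.linalg.norm(posterior_vector(x1, means, covs, priors)
                          - posterior_vector(x2, means, covs, priors))

# Toy general acoustic space with 3 clusters in 2-D.
means = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
covs = np.stack([np.eye(2)] * 3)
priors = np.full(3, 1 / 3)
p = posterior_vector(np.array([0.1, -0.2]), means, covs, priors)
```

Because both subjects' frames are expressed as posteriors over the same shared clusters, the Euclidean distance between these vectors is comparable across speakers, which is the point of the construction.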
you compute its probability vector from the acoustic feature vector, find the index of its highest component, and then search only the bin for that index. Suppose the third component of the test probability vector is the highest: you compare only against the third bin and do not consider the other bins at all. For an even further reduction in cost, you can compute the distances using only the values at that index, sort on those, and again weight the candidates by inverse distance as before. This provides a massive reduction in computation in general, at the cost of only a small loss in accuracy.

Now let me tell you about the experiments. The articulatory data comes from a corpus with parallel acoustic and articulatory recordings from one male and one female British English speaker, each reading around four hundred and sixty sentences. Standard acoustic features were used, and the articulatory features come from electromagnetic articulography, covering positions such as the upper and lower lip, lower incisor, tongue, and velum. A development set was used to optimize the hyperparameters, in particular the high-pass filter cutoff for each articulator, and the TIMIT database was used for building the general acoustic space. For the evaluation of the proposed subject-independent inversion we consider three schemes: subject-dependent inversion, subject-independent inversion in which the mismatch between the two acoustic spaces is simply ignored, and the proposed subject-independent inversion using the general acoustic space.
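The binning scheme described above can be sketched in a few lines (names and the toy posteriors are hypothetical):

```python
import numpy as np

def build_bins(train_posteriors):
    """Partition training frames into bins keyed by the index of the
    largest component of each frame's posterior vector."""
    bins = {}
    for i, p in enumerate(train_posteriors):
        bins.setdefault(int(p.argmax()), []).append(i)
    return bins

def candidate_frames(test_posterior, bins):
    """At test time, search only the bin matching the test frame's
    dominant cluster instead of all training frames."""
    return bins.get(int(test_posterior.argmax()), [])

# Toy posteriors: 6 training frames over 3 clusters.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.6, 0.3, 0.1],
              [0.2, 0.2, 0.6],
              [0.9, 0.05, 0.05],
              [0.3, 0.6, 0.1]])
bins = build_bins(P)
cands = candidate_frames(np.array([0.5, 0.4, 0.1]), bins)
```

With balanced bins this cuts the nearest-neighbor search roughly by a factor equal to the number of clusters, which is where the large computational saving comes from.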
The correlation coefficient between the estimated and the measured trajectories is used as the evaluation metric. Here are the results for one of the two speakers, with the correlation averaged over all test sentences, shown for each articulatory feature. As you can see, the subject-dependent scheme does the best, which is expected because training and testing are on the same subject. The interesting thing is that the subject-independent scheme that simply ignores the acoustic mismatch does poorly, whereas once we normalize through the general acoustic space, performance improves substantially and approaches the subject-dependent case.

To summarize: we proposed a subject-independent acoustic-to-articulatory inversion scheme in which a model trained on one subject's parallel data can be applied to any number of speakers for whom only acoustic data is available, by mapping the acoustic features into a general acoustic space, at only a small cost in accuracy relative to subject-dependent inversion. One caution is that the general acoustic space needs to cover the sounds you are looking for: if a language or sound occurs in your data, it should also be represented in your general acoustic space. As for future directions, speech recognition using acoustic and articulatory features jointly has been shown to improve recognition accuracy, and we would also like to investigate different articulatory feature sets. In addition, we have recently begun to collect articulatory data, including EMA sequences, for American
English talkers, and you can find out more about that from our group. Thank you for listening; I will be happy to take questions.

Moderator: So we have time for questions. Yes, please.

Question: The main thing is, in this case, do you have exactly the same sentences from the two different subjects? That seems critical to this type of approach.

Answer: No, the test data does not contain exactly the same sentences; the sentences come from a general corpus, and the training and test splits are randomized. We wish we had more, and more varied, corpora, but the results are actually pretty stable: with around four hundred and fifty sentences per speaker the results hardly change, and we tried varying the split just to check.

Question: The correlation score just gives us some feel for how well it works, but phonetic discrimination results would be interesting.

Answer: Yes, that is not yet clear, and it is something worth looking at.

Question: A quick one on the generalized smoothness criterion. You have two terms, one measuring smoothness and one the data accuracy, and you use a high-pass filter for the smoothness. Would it be possible to use something like a low-pass filter instead, so that in practice there is a link between the smoothness definition and the data accuracy, a joint design for each trajectory?

Answer: We have a pretty fast recursive design procedure for the filter, so that optimization itself is very fast, and the residual smoothness error ends up very small.

Moderator: If there are no more questions, we will move on to the next speaker.