0:00:15wanting um
0:00:17my name is that mean are not present
0:00:24and the two
0:00:25we use
0:00:26this in this guy lower
0:00:28subject-independent acoustic-to-articulatory inversion
0:00:31you review these two
0:00:34so acoustic-to-articulatory inversion, which as you saw from the
0:00:37previous uh
0:00:40a who who is that you're wrong
0:00:44and what you are trying to
0:00:46estimate
0:00:47are articulatory features, right
0:00:49but obviously you can't work in the
0:00:51raw speech signal domain or
0:00:53the articulatory domain
0:00:54so you compute
0:00:56articulatory and acoustic features, and then
0:00:58in fact
0:00:59the problem, what we are really trying to do, is
0:01:03estimate the articulatory features from
0:01:08synchronized acoustic features
0:01:10so given a test speech signal, you estimate the acoustic feature vectors, and then
0:01:14you are trying to estimate the articulatory feature vectors
0:01:19and now before moving to the subject-independent part of this, let me first
0:01:24discuss what subject-dependent inversion is, because this is usually how it's addressed in the literature
0:01:29so as I mentioned before, you are trying to estimate
0:01:32articulatory features given this test utterance
0:01:35and what makes it subject-dependent
0:01:39in the traditional approach is that the training and test subjects
0:01:43are identical
0:01:44so you have
0:01:45parallel acoustic and articulatory feature vectors from
0:01:50a subject, and you use this to develop a model of
0:01:54this mapping from the
0:01:55acoustic to the articulatory feature vectors
0:01:57and then you use that to estimate articulatory
0:02:00feature vectors for the same subject
0:02:04and it's easy in some sense because
0:02:05you do it on the same subject, so the acoustic spaces match and the articulatory spaces match
0:02:11but what the authors try to look into in this paper is subject-independent acoustic-to-articulatory inversion
0:02:19given training data from a completely different subject
0:02:21how do we use that to develop a model that you can then
0:02:25use to estimate articulatory features
0:02:27for a different subject
0:02:29and it should immediately start becoming obvious what kinds of challenges you will see
0:02:34so the acoustic spaces of these two subjects are very different, so
0:02:38you can't think of a direct similarity metric for these two subjects
0:02:42at the same time, the articulatory spaces are also not similar
0:02:45right, so each
0:02:46subject will have their own articulatory
0:02:49range
0:02:50so the acoustic and articulatory spaces are different
0:02:52and at this point I would actually want to make it clear
0:02:56what the authors actually estimate as the articulatory
0:03:00trajectories
0:03:02you can interpret them in this fashion
0:03:05they are
0:03:06the articulatory features that the training
0:03:11subject would have produced if he was trying to imitate the test utterance; you can think of
0:03:16the estimated
0:03:17articulatory feature vectors in that fashion
0:03:22but why would you even want to do subject-independent inversion, why not just stick to the traditional
0:03:27subject-dependent inversion
0:03:29so subject-dependent inversion is simple and easy, and it works pretty well
0:03:33but the problem is when you have to extend it to
0:03:36a number of talkers, because you need articulatory data from each of the speakers
0:03:40and as you know, articulatory data is not as simple to collect
0:03:44as compared to acoustic data
0:03:47if we could have an approach where
0:03:49you have parallel articulatory and acoustic data just for one talker or for one subject, and then you could
0:03:54extend it to any
0:03:56arbitrary number of speakers for whom you have only the acoustic data
0:04:00then that would be highly desirable
0:04:03then you could think of applications like
0:04:05joint articulatory-acoustic speech recognition or
0:04:08a simple as is
0:04:10and can just
0:04:12once you have a model from just one subject, you can then adapt it to any
0:04:17number of speakers you want
0:04:18and so that is the motivation for
0:04:20subject-independent inversion
0:04:23and I will now move to the details of how the authors propose to do it
0:04:27so they do it by minimization of the
0:04:30generalized smoothness criterion for the articulatory
0:04:33trajectories
0:04:35and and that was initially proposed a seven
0:04:37we hasn't ten four
0:04:39subject-dependent inversion, but they found that this could be extended to subject-independent inversion just as well
0:04:45and but to and is in the subject independent version is the second but that that's somebody deeper
0:04:54what this criterion is doing is that you
0:04:59essentially have
0:05:01different cost functions for each of these trajectories
0:05:06you would have one of these smoothness criterion functions for each articulatory feature
0:05:10and that is because each articulator moves with a different speed, so you need
0:05:15different levels of smoothness
0:05:17in the criterion, this first term is basically what you can think of as
0:05:22the smoothness penalty
0:05:25the filter in each
0:05:26function is basically a high-pass filter that you optimize on a development set
0:05:31for this particular articulator, and you do this for each articulator one at a time, and you get
0:05:36these objective functions, and the first term basically penalizes
0:05:42non-smooth
0:05:43trajectories that you might estimate otherwise
0:05:46the second term is the one which actually does the estimation from the training data
0:05:50and you have these two parameters, eta and p
0:05:53eta are the probable values that your articulatory
0:05:57features could be
0:05:58and p are the corresponding probabilities
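A rough sketch of the criterion just described, for one articulator. It assumes the cost is the energy of the high-pass-filtered trajectory plus a weight times the probability-weighted squared distance to the candidate values; under that assumption the minimizer solves a linear system. The first-difference filter, the weight `C`, and the name `smooth_trajectory` are illustrative assumptions, not the authors' implementation (the talk describes a filter tuned per articulator on a development set).

```python
import numpy as np

def smooth_trajectory(eta, p, C=1.0):
    """Minimize ||H x||^2 + C * sum_n sum_l p[n, l] * (x[n] - eta[n, l])^2.

    eta, p: arrays of shape (N, L) holding, for each of N frames, L candidate
    articulatory values and their probabilities.
    """
    eta = np.asarray(eta, dtype=float)
    p = np.asarray(p, dtype=float)
    n_frames = eta.shape[0]
    # Assumption: a first-difference operator stands in for the tuned high-pass filter.
    H = np.diff(np.eye(n_frames), axis=0)
    w = p.sum(axis=1)            # total candidate weight per frame
    b = (p * eta).sum(axis=1)    # probability-weighted sum of candidates per frame
    # Setting the gradient to zero gives (H^T H + C diag(w)) x = C b.
    A = H.T @ H + C * np.diag(w)
    return np.linalg.solve(A, C * b)
```

A larger `C` keeps the estimate closer to the candidate values, while a smaller `C` favors a smoother trajectory.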
0:06:02so let's look at how we could estimate this eta and p; the filter that I mentioned is
0:06:07optimized on a
0:06:08separate development set for each of the
0:06:11articulatory features
0:06:13so to estimate this
0:06:15eta and p
0:06:18consider
0:06:20the simple subject-dependent case again, and I will come to the subject-independent case
0:06:25just after this
0:06:26but essentially, since your acoustic spaces are similar for the subject-dependent case
0:06:30what you
0:06:30could do is
0:06:32estimate these articulatory features and the probabilities of the features by proximity in
0:06:37the acoustic space
0:06:39so again, given
0:06:40the test utterance
0:06:42and its acoustic feature vector, what you can do is you can find
0:06:46the closest acoustic features from the training data, and then the corresponding articulatory features
0:06:52now become the eta
0:06:53so you can compute some sort of distance, a Euclidean distance or another distance,
0:06:57from each
0:06:58of the
0:07:00acoustic features in your training data
0:07:02and once you have done that you can
0:07:05take
0:07:05the closest
0:07:07of these
0:07:09acoustic features; the corresponding articulatory features now become the eta, the probable values that your
0:07:14articulatory features can be
0:07:17right, and you can estimate the probabilities as
0:07:21inversely proportional to the distance, because if your acoustic features are
0:07:26closer to the test utterance, then
0:07:29the probability of that being one of the
0:07:32articulatory features you would want is higher
0:07:35and so
0:07:37what
0:07:38we are using in the subject-dependent case is a simple Euclidean distance in the acoustic
0:07:43features, and this is
0:07:44the subject-dependent version of the
0:07:47generalized smoothness criterion
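A minimal sketch of the subject-dependent candidate selection described above: find the closest training acoustic frames, take their articulatory values as the candidates, and weight them inversely to the acoustic distance. The function name, the neighbor count `L`, and the small constant `eps` are hypothetical choices for illustration.

```python
import numpy as np

def candidates_from_neighbors(z_test, Z_train, X_train, L=3, eps=1e-8):
    """Return L candidate articulatory values and their probabilities.

    z_test:  (D,) acoustic feature vector of one test frame.
    Z_train: (M, D) training acoustic feature vectors.
    X_train: (M,) articulatory values parallel to Z_train.
    """
    # Euclidean distance from the test frame to every training frame.
    d = np.linalg.norm(Z_train - z_test, axis=1)
    idx = np.argsort(d)[:L]          # indices of the L closest training frames
    w = 1.0 / (d[idx] + eps)         # weights inversely proportional to distance
    return X_train[idx], w / w.sum() # candidates and normalized probabilities
```

The returned pair plays the role of the candidate values and probabilities for one frame; repeating this over all frames of an utterance gives the inputs to the smoothness criterion.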
0:07:51but the question is how would you extend this when these two
0:07:55subjects are not the same and the acoustic spaces are different
0:08:00you can't compute something like a simple Euclidean distance in this case
0:08:05so what the authors proposed in this work
0:08:07is that you map these
0:08:09acoustic features to a
0:08:10probability space
0:08:13then you are able to compare these
0:08:15using a simple distance metric that you know of
0:08:18so the authors achieve this by the concept of a general acoustic space
0:08:22a general acoustic space you can think of
0:08:25in this fashion: a general acoustic space you can think of as consisting of
0:08:29acoustic feature vectors from a number of different
0:08:34talkers
0:08:35and it may not contain either of these two talkers, so it's basically something you can think of
0:08:41because perception of
0:08:43that was the that
0:08:44sound that on him
0:08:46and this general acoustic space, which contains
0:08:50acoustic
0:08:51feature vectors from a number of different speakers, is partitioned into K different clusters
0:08:56and each of these clusters is modeled using a Gaussian mixture model
0:09:01now given this general acoustic space, what you can do is you can transform each of these acoustic
0:09:06feature vectors, the training acoustic feature vectors and the test acoustic feature vectors,
0:09:11into a probability feature vector
0:09:13which consists of the posterior probabilities of that
0:09:17acoustic feature vector having come from each of these K clusters
0:09:22and then you normalize it so that it sums up to one, so now you have mapped your acoustic
0:09:27feature vectors to a probability feature vector
0:09:30now you can compare them
0:09:32with a simple distance metric like the Euclidean distance metric
0:09:36and so that is the modified distance metric which is used
0:09:38for the subject-independent case
0:09:41so instead of computing a simple Euclidean distance among the acoustic features as in the subject-dependent case,
0:09:46you map them to the probability space and now compute the Euclidean distance; this
0:09:49is the modified distance
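The mapping to the probability space can be sketched roughly as follows. As a simplifying assumption, each cluster of the general acoustic space is modeled here by a single diagonal-covariance Gaussian with equal priors, whereas the talk describes a Gaussian mixture model per cluster; the names `posterior_vector` and `modified_distance` are hypothetical.

```python
import numpy as np

def posterior_vector(z, means, variances):
    """Map an acoustic feature vector z (D,) to a K-dim posterior vector.

    means, variances: (K, D) parameters of one diagonal Gaussian per cluster
    (a stand-in for the per-cluster mixture models).
    """
    # Log-likelihood of z under each cluster's Gaussian.
    ll = -0.5 * np.sum((z - means) ** 2 / variances
                       + np.log(2.0 * np.pi * variances), axis=1)
    ll -= ll.max()                 # subtract the max for numerical stability
    post = np.exp(ll)
    return post / post.sum()       # normalize so the entries sum to one

def modified_distance(p_a, p_b):
    """Euclidean distance between two posterior vectors."""
    return float(np.linalg.norm(p_a - p_b))
```

Because both the training and the test frames are expressed as cluster posteriors, the distance no longer compares raw acoustic features of two different talkers directly.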
0:09:51but the problem was that we still have
0:09:54almost five hundred thousand frames
0:09:56and was just for one frame and one not to a
0:09:59so the computational cost was immense
0:10:02and so what was proposed
0:10:04to reduce the computational cost
0:10:06was
0:10:08comparing against
0:10:09only the relevant
0:10:12probability feature vectors
0:10:14and
0:10:16remember that now we just have probability vectors, because we don't have acoustic features anymore
0:10:21so given all your probability vectors
0:10:24we just bin them into
0:10:26K different bins; the first bin contains all the probability vectors for which the first component
0:10:31is the highest
0:10:32the second bin contains all the probability vectors for which the second component is highest
0:10:36and so on until the K-th bin
0:10:38now when you get a test utterance, from its acoustic feature vector you compute the
0:10:43probability feature vector
0:10:44and then find the index
0:10:46of the component
0:10:49that has the highest value in this probability vector
0:10:53then you would
0:10:54just do the comparison within that bin, and you would not consider the other bins
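The binning step can be sketched as below: training probability vectors are grouped by the index of their largest component, and a test vector is compared only against the bin matching its own largest component. The function names and the neighbor count `L` are hypothetical.

```python
import numpy as np

def build_bins(P_train):
    """Group training rows by the index of their highest component.

    P_train: (M, K) probability feature vectors. Returns {k: row indices}.
    """
    top = np.argmax(P_train, axis=1)
    return {k: np.flatnonzero(top == k) for k in range(P_train.shape[1])}

def nearest_in_bin(p_test, P_train, bins, L=2):
    """Find the L closest training vectors, searching only the matching bin."""
    k = int(np.argmax(p_test))     # index of the highest test component
    idx = bins[k]                  # only this bin is searched
    d = np.linalg.norm(P_train[idx] - p_test, axis=1)
    order = np.argsort(d)[:L]
    return idx[order], d[order]
```

With K bins of roughly equal size, each test frame is compared against about one K-th of the training frames instead of all of them.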
0:10:58and for even further reduction in cost
0:11:01you can just compute the distance using the values at that index
0:11:06and then you would do a certain kind of sad and that you did that yeah
0:11:09and you would just be in the i
0:11:12and the
0:11:13the noses and
0:11:22and then again you get the
0:11:24probabilities as an inverse of this
0:11:26i think that of the of these this
0:11:29the the provide with it and be generalized right
0:11:33and this room by this provides a
0:11:36a a a a is it much as the multiplication
0:11:39computation in general and on a having a the last
0:11:45then it because of
0:11:47having that is let you know what the experiment
0:11:53so uh on there
0:11:55was a it's on a a i which i
0:11:59but was to yeah and uh
0:12:02acoustic and i
0:12:04for one meeting one may dish speaker
0:12:08where is reading for sixty statements
0:12:16acoustic features use there but i'm sure image
0:12:20and the articulatory features are from
0:12:22electromagnetic articulography
0:12:25and very is that you
0:12:28or lower
0:12:29lower in is there a more
0:12:31and the and and will
0:12:33and so we use this for it
0:12:40that that that that is that image idea was used to optimize
0:12:43but high or whatever is that we have one
0:12:46but uh i i a for each
0:12:48and that that was a
0:12:50cost be being this stuff and
0:12:54and the data to we don't
0:12:55and for building an acoustic model
0:12:58the TIMIT database
0:13:00is used
0:13:01but eight thirty two
0:13:06and for for of that evaluation of this proposed rules
0:13:09that that's right for inversion is them i know that
0:13:18and you you
0:13:19you you
0:13:20in for the job
0:13:22the second is is that is that the training and are i
0:13:27ignored five
0:13:34that is that it was a a mismatch
0:13:36no where a
0:13:38like like
0:13:40and so you
0:13:44uh in the problem by a
0:13:54this is uh
0:13:57so what you think that is
0:13:59and by
0:14:00i i for one of
0:14:03a a speaker for for one of the
0:14:06that it is
0:14:08a a is that a one is made by subject
0:14:12the uh
0:14:14but the training and testing think that one done on the me
0:14:19think of a is a one that was that being is to me using a Q D C that you
0:14:24do that
0:14:25you i image
0:14:27you know
0:14:30in that to me in a room at all
0:14:35and what you see where is the thing that made all proposed to you
0:14:42correlation relation should at one or as the evaluation
0:14:46and yet as a result
0:14:48and what
0:14:49i i see that the subject it's of all
0:14:53these are
0:14:54correlation average over all
0:14:57and for each the
0:14:59each of a of a feature
0:15:02this i think that that does the best because you're your thing on the theme
0:15:07uh is the same
0:15:09but i think thing here is
0:15:11that are was that route
0:15:13right try model i is not you know is
0:15:16i mean that thing and the S
0:15:19actually that that then
0:15:21then you ignore
0:15:22so these any more
0:15:25i i i know i am and you
0:15:28so that
0:15:29yeah so that is
0:15:31some i
0:15:32ooh week is normalization by a
0:15:34jen like that
0:15:36and then you a thing by stopping
0:15:39it's a and if that
0:15:43so we can are
0:15:44uh a a a a a is the most
0:15:48is sub in are or should to me any
0:15:53i i i got from that
0:15:56you can be you more is on the
0:15:58you can be a a models on friday i three
0:16:03you just
0:16:04at at any from you just
0:16:07acoustic thinking
0:16:08and this is done by a
0:16:12it's are not like is acoustics
0:16:16there is a
0:16:17uh uh uh uh uh and what a question here is that are generated
0:16:21i see you need to your
0:16:23looking for some
0:16:25because the language
0:16:26in your are
0:16:28in your data
0:16:29you but you should have it in your general
0:16:32so be
0:16:37there's uh
0:16:38ooh a you more things
0:16:41we have a a speech recognition
0:16:45i feature in but our region
0:16:49joint feature
0:16:50and it has a shown some a room and recognition i
0:16:55and the
0:16:56or would also like to investigate
0:16:58for the
0:17:00uh a different of got teacher
0:17:02from the for what he's like yeah i know my you see is which
0:17:06right right and the vision in addition
0:17:10and we V in a lot your back to collect
0:17:13uh are you data might be used sequence is for for american english talkers
0:17:19and you can find more about
0:17:23thank you for listening
0:17:36so we have time for questions
0:17:39yes please
0:17:40uh so i have a uh like the the main thing is uh that in this case uh you have
0:17:46uh sentences well
0:17:47but exactly same sentences from the purpose different subjects
0:17:51so uh
0:17:53that's types of the seneca
0:17:54okay but just an approach uh that to um that have to car
0:17:58oh no but
0:17:59the the question
0:18:01i so that i
0:18:02so exactly the same sample sense
0:18:04i i i i to just no actually but some some so test data
0:18:18or to just P mean it to me or
0:18:20so that "'cause" it's and sentence comes from a general them corpus so
0:18:24and you
0:18:32so those a randomized actually a a a a a a a in a sense
0:18:35so we wish we had a more of roots of corpora
0:18:45so the done actually it's it's pretty stable than seems also use the of
0:18:50a so a if you rooms where five for a lower you know are typically don't to set about four
0:18:55and fifty sentences
0:18:57and the results a pretty stable be done we tried that actually to the point to just to check or
0:19:12and that the point of what the square lotion just kind of give us a some feel for that uh
0:19:17you know
0:19:18how it works but it's still a to do so but the
0:19:21uh phonetic discrimination results are interesting um it's not cheer but the uh
0:19:26those a a creating actually
0:19:34and and the question
0:19:38well actually
0:19:39i would have a quick one in the
0:19:41definition of the generalized smoothness criterion you have two terms
0:19:46to try to measure the smoothness and then the data accuracy
0:19:50and you use a high-pass filter; is it possible that it introduces some delay
0:19:55if it's a causal filter
0:19:57so will that in practice cause some mismatch between the smoothness definition and the data accuracy
0:20:09therefore for each trajectory you and uh joint design
0:20:12um we have a pretty fast recursive but uh a uh uh design for that it's pretty fast so that
0:20:17sometimes lee
0:20:18so so that it is is very small or okay yeah
0:20:26so if there are
0:20:27no more questions, we move on to the next speaker