Speech Transcript - RESOLVING NON-UNIQUENESS IN THE ACOUSTIC-TO-ARTICULATORY MAPPING

0:00:14	a a but not well as you uh
0:00:17	uh
0:00:18	This talk is
0:00:22	by Gopal Ananthakrishnan, that's me, and Olov
0:00:26	Engwall.
0:00:27	The title is resolving non-uniqueness in the
0:00:30	acoustic-to-articulatory mapping, which I will refer to as the A-to-A mapping.
0:00:36	I think I'll skip this slide because the last two presentations were pretty much about
0:00:40	the same thing,
0:00:41	and it's basically just to give an idea of what A-to-A mapping, or inversion,
0:00:47	is.
0:00:49	I'll directly
0:00:51	jump to the main focus of this talk, which is the non-uniqueness in this mapping,
0:00:57	which has been referred to by
0:00:59	a few of the authors
0:01:01	who spoke before me.
0:01:03	So in the literature we have
0:01:06	things like lossless tube models of the vocal tract, which is a parametric
0:01:11	model of speech synthesis, and
0:01:14	you can see that the inverse mapping from acoustics is actually to a class of area functions,
0:01:19	not exactly one.
0:01:21	And you have similar results from other such experiments.
0:01:24	There are also some experiments called bite-block experiments, where
0:01:29	the speaker is
0:01:31	constrained,
0:01:32	but still the speakers can produce perceptually similar sounds
0:01:38	in spite of the perturbation, so this gives an indication of non-uniqueness.
0:01:42	Of course these are artificial situations, and
0:01:45	this may not really occur in natural speech.
0:01:48	So what about continuous speech?
0:01:50	You can have different forms of data to study this,
0:01:55	which I have listed here.
0:01:57	In our case we use the MOCHA-TIMIT database, just like
0:02:00	the previous two talks, so I won't go into that
0:02:03	too much.
0:02:06	So this is an example from the data set, where we have a phoneme,
0:02:11	and the red and the blue lines here indicate
0:02:15	the magnitude spectrum
0:02:20	from two instances,
0:02:21	and
0:02:23	the figure
0:02:27	at the bottom right shows the positions of the articulator coils.
0:02:32	You can see that
0:02:33	even though the acoustics are
0:02:36	quite similar,
0:02:37	the coil positions
0:02:39	are quite different.
0:02:41	But is this non-uniqueness?
0:02:44	You still can't really say that, because there is a difference in the acoustics.
0:02:48	So can this difference in acoustics be explained by
0:02:54	the
0:02:55	variation in the position of the articulators?
0:02:58	That
0:03:00	comes down to the problem that when you have this kind of limited database,
0:03:06	you cannot get exactly the same
0:03:08	acoustics and exactly the same articulation,
0:03:12	or exactly the same acoustics with different articulation, so that's
0:03:16	the difficulty with data.
0:03:18	So
0:03:20	the key questions in this talk are:
0:03:22	how does one estimate non-uniqueness in limited data?
0:03:25	We do that by statistical modeling, based on one of our previous papers.
0:03:29	How do these non-unique
0:03:31	instances occur in
0:03:32	acoustic-articulatory frames?
0:03:35	Does applying continuity constraints help
0:03:38	resolve all non-uniqueness?
0:03:39	These are the main questions.
0:03:41	So we have a toy example here, and you can see that
0:03:44	the figure on the top here is
0:03:48	the acoustic parameter
0:03:50	belonging to, say, one phoneme,
0:03:52	and this is the articulatory parameter belonging to one phoneme. You can see that
0:03:56	the acoustics are unimodal but
0:03:58	the articulatory parameters are bimodal. So is this non-unique?
0:04:04	If you look at the data points here,
0:04:06	I don't know whether the
0:04:07	points are very clear, but you can see that it's not completely true.
0:04:12	You can see that there are some clusters here
0:04:14	when you look at
0:04:16	the joint articulatory-acoustic space.
0:04:20	Therefore what we do is
0:04:23	model this sort of data in the joint
0:04:25	articulatory-acoustic space,
0:04:27	and then we can look at one value of the acoustic parameter, as shown by
0:04:32	the blue line there; that's a test sample.
0:04:34	We can find the conditional probability distribution, and in this case it is bimodal, which says
0:04:40	that
0:04:40	at this value of the acoustic parameter
0:04:43	the mapping is non-unique.
0:04:46	But if you look at another acoustic parameter here, which belongs to
0:04:50	the same
0:04:51	acoustic cluster, you can see that it's unimodal and it's not
0:04:54	non-unique.
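
The joint-space idea in the toy example can be sketched in a few lines. This is only a minimal illustration, assuming a scalar acoustic parameter, a scalar articulatory parameter, and a two-component Gaussian mixture whose parameters are simply written down rather than fitted. The names gauss_pdf, conditional_gmm, count_modes and TOY, and all the numbers, are hypothetical and not taken from the talk.

```python
import math

def gauss_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def conditional_gmm(components, a):
    """Condition a joint 2-D Gaussian mixture on the acoustic value a.

    components: (weight, mean_a, mean_y, var_a, var_y, cov_ay) tuples,
    where a is the acoustic and y the articulatory parameter.
    Returns (weight, mean, var) tuples describing p(y | a).
    """
    cond = []
    for w, ma, my, va, vy, cov in components:
        resp = w * gauss_pdf(a, ma, va)           # how well a fits this component
        mean = my + cov / va * (a - ma)           # conditional mean of y
        var = vy - cov * cov / va                 # conditional variance of y
        cond.append((resp, mean, var))
    total = sum(r for r, _, _ in cond)
    return [(r / total, m, v) for r, m, v in cond]

def count_modes(cond, lo=-5.0, hi=5.0, steps=2000, min_rel_height=0.01):
    """Count local maxima of p(y | a) on a grid, ignoring negligible peaks."""
    xs = [lo + i * (hi - lo) / steps for i in range(steps + 1)]
    dens = [sum(w * gauss_pdf(x, m, v) for w, m, v in cond) for x in xs]
    top = max(dens)
    return sum(
        1
        for i in range(1, steps)
        if dens[i] > dens[i - 1]
        and dens[i] >= dens[i + 1]
        and dens[i] >= min_rel_height * top
    )

# Toy mixture (assumed values): two overlapping acoustic clusters
# that correspond to clearly different articulatory positions.
TOY = [
    (0.5, 0.0, -2.0, 1.0, 0.25, 0.0),
    (0.5, 3.0, 2.0, 1.0, 0.25, 0.0),
]

if __name__ == "__main__":
    for a in (1.5, -1.0):
        print(a, count_modes(conditional_gmm(TOY, a)))
```

With these assumed numbers, an acoustic value halfway between the two clusters gives a bimodal conditional distribution (a non-unique mapping in the sense used above), while a value well inside one cluster gives a unimodal one.
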
0:04:56	Of course there is the question of the variance,
0:04:59	which is also non-uniqueness of some sort, because for one value of the acoustic
0:05:03	parameter you can have different
0:05:05	values of the articulatory parameter,
0:05:07	but we don't look at this sort of non-uniqueness in this paper;
0:05:12	we just look at this bimodal kind of non-uniqueness.
0:05:17	This is the parameterization of the data, and again it's very similar to what has been used
0:05:22	in the state-of-the-art
0:05:24	A-to-A mapping systems, and also similar to the one which was used in the previous paper.
0:05:29	This is
0:05:31	an example of non-uniqueness. What these
0:05:35	plots are is actually
0:05:37	the conditional distributions:
0:05:40	given one vector of acoustics,
0:05:42	these plots are the conditional distributions of the
0:05:47	articulatory vectors.
0:05:49	In this case the blue dots and triangles are
0:05:54	actually the peaks of these different modes,
0:05:56	and the green line,
0:05:58	I hope
0:06:00	that's clear in the presentation, the green line is the recorded positions.
0:06:04	In this case you can see that they are
0:06:06	closer to one of the peaks than the other peak,
0:06:08	and the other peak is actually
0:06:11	the non-unique
0:06:13	estimate for
0:06:17	this particular acoustic vector.
0:06:19	Now we look at this in a trajectory.
0:06:21	In this case you can see that they are all unimodal;
0:06:25	all the conditional distributions are unimodal.
0:06:28	But look at the next frame, and in this case you can start seeing that for the tongue
0:06:32	tip,
0:06:33	which is here,
0:06:34	you can start seeing that there is
0:06:37	another mode,
0:06:39	and you can see the same thing
0:06:41	in the lower lip, which is here,
0:06:43	and the tongue dorsum,
0:06:47	and so on.
0:06:48	But you can see at the same time that the recorded positions
0:06:53	are always closer to
0:06:55	one of the
0:06:58	modes than the other.
0:07:03	But there is another example of it not
0:07:06	following this, and this is
0:07:08	another example.
0:07:09	In this case you can see that this is largely unimodal.
0:07:13	uh
0:07:14	um this is
0:07:15	one frame but to
0:07:16	uh are in the shop
0:07:19	uh
0:07:20	And
0:07:21	you can start seeing that
0:07:25	the second mode starts appearing somewhere here,
0:07:30	and you can see that it's
0:07:31	there,
0:07:32	and
0:07:33	at the next estimate here, the position
0:07:34	shifts on to the new mode.
0:07:37	So
0:07:38	in the first example
0:07:41	the second mode appeared and then sort of disappeared
0:07:45	from the estimates,
0:07:46	and in this case it seems like there's a switch from
0:07:49	the first set of modes to the second set.
0:07:51	So we have new questions here: what is the difference between the two examples?
0:07:55	How often does each type of
0:07:56	non-uniqueness occur?
0:07:58	And what role does it play in the predictability of the
0:08:03	articulation?
0:08:05	What we do now is that
0:08:07	we just shift the view: the previous examples were in the articulatory space, the midsagittal plane,
0:08:13	whereas this one is in space-time.
0:08:17	These are plots in space-time, so
0:08:18	the blue and the pink lines are the peaks of
0:08:22	the modes that you saw, and the black line is the recorded trajectory.
0:08:26	You can see that in type one, which we call along the same path,
0:08:32	the recorded positions sort of stick to one
0:08:36	of the trajectories,
0:08:37	whereas you can see that there are some non-unique estimates
0:08:41	for some part of this trajectory, which we call the non-unique path.
0:08:46	In the second example
0:08:50	you can see that there is a sort of shifting
0:08:54	from one of these
0:08:56	paths to the second path, that is, from the blue to the pink.
0:09:00	We call that with change in path.
0:09:02	So it's obvious that type one is easy to estimate
0:09:07	using
0:09:08	information about the previous frames, but that's not the case with type two.
0:09:13	In this case you also need the succeeding frames; you need to know
0:09:17	in which direction it goes.
0:09:20	But there are some exceptions. For example, you can see here that
0:09:26	the expected type is along the same path,
0:09:28	but in fact the recorded position goes to WCP,
0:09:33	that is, with change in path.
0:09:36	So we just want to see how often
0:09:39	you get these kinds of exceptions.
0:09:41	What we do is that
0:09:44	we have three conditions and we find the RMS error.
0:09:48	In the first one we apply
0:09:50	continuity constraints
0:09:52	based on dynamic programming from the preceding context,
0:09:55	and then we select one of the peaks. In the second one we select the mean between
0:09:59	the two peaks. Actually, this is not really a valid articulatory position, but we do it just
0:10:04	to
0:10:04	see whether it reduces the RMS error.
0:10:09	In the last one
0:10:11	we estimate which of
0:10:14	the peaks actually
0:10:16	gives the lowest error, so we don't apply continuity constraints but we just
0:10:20	see which of the peaks is closest to the recorded position.
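
The three conditions just listed can be illustrated with a small sketch. It is not the implementation used in the paper: the talk only says the continuity constraint is based on dynamic programming from the preceding context, whereas the sketch below assumes a plain whole-sequence search that minimizes squared frame-to-frame jumps over one-dimensional candidate peaks. All function names and numbers are hypothetical.

```python
import math

def dp_continuity_path(candidates):
    """Pick one candidate peak per frame so that the summed squared
    frame-to-frame jump is minimal (Viterbi-style dynamic programming).

    candidates: list of frames, each a non-empty list of peak positions.
    """
    cost = [0.0] * len(candidates[0])
    back = []
    for t in range(1, len(candidates)):
        new_cost, ptr = [], []
        for c in candidates[t]:
            options = [cost[j] + (c - p) ** 2
                       for j, p in enumerate(candidates[t - 1])]
            j_best = min(range(len(options)), key=options.__getitem__)
            new_cost.append(options[j_best])
            ptr.append(j_best)
        cost = new_cost
        back.append(ptr)
    # Trace the cheapest path back from the last frame.
    k = min(range(len(cost)), key=cost.__getitem__)
    path = [k]
    for ptr in reversed(back):
        k = ptr[k]
        path.append(k)
    path.reverse()
    return [candidates[t][k] for t, k in enumerate(path)]

def mean_of_peaks(candidates):
    """Second condition: the mean of the candidate peaks in each frame."""
    return [sum(frame) / len(frame) for frame in candidates]

def closest_peak(candidates, recorded):
    """Third condition: the peak closest to the recorded position."""
    return [min(frame, key=lambda c: abs(c - r))
            for frame, r in zip(candidates, recorded)]

def rmse(estimate, reference):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((e - r) ** 2 for e, r in zip(estimate, reference))
                     / len(reference))

if __name__ == "__main__":
    # Assumed toy data: a second mode appears for two frames, while the
    # recorded trajectory stays on the first path.
    peaks = [[0.0], [0.1], [0.2, 1.5], [0.3, 1.6], [0.4]]
    recorded = [0.0, 0.1, 0.2, 0.3, 0.4]
    for name, est in (("continuity", dp_continuity_path(peaks)),
                      ("mean", mean_of_peaks(peaks)),
                      ("closest", closest_peak(peaks, recorded))):
        print(name, rmse(est, recorded))
```

The closest-peak condition needs the recorded positions, so it serves only as a reference for how well any peak-selection rule could do, not as a usable estimator.
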
0:10:24	So I'll just go to the graph.
0:10:30	The first thing: the x-
0:10:35	axis is actually the length of the path,
0:10:37	that is, how many
0:10:39	successive frames you get
0:10:41	where you have non-unique estimates.
0:10:43	That's the x-axis, and the number of occurrences is on the y-axis.
0:10:48	You can see that it forms a sort of exponential function,
0:10:51	and the number of
0:10:55	non-unique paths
0:10:57	sort of decreases exponentially with length.
0:10:59	and
0:11:01	and that's that's even uh it's more so for the uh uh with change and but case that from yeah
0:11:06	from the long the same part in long the same point you see that
0:11:09	for
0:11:12	two consecutive frames you get more occurrences.
0:11:18	So the frequency of occurrence of along the same path is higher than that of
0:11:22	WCP, but only for shorter paths;
0:11:25	for longer paths
0:11:26	it seems like it's
0:11:27	more with change in path.
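
The histogram described above, path length against number of occurrences, amounts to counting runs of consecutive non-unique frames. A minimal sketch follows, assuming each frame has already been reduced to its number of conditional modes; the function name and the example sequence are hypothetical.

```python
from collections import Counter
from itertools import groupby

def nonunique_run_lengths(mode_counts):
    """Histogram of how many successive frames have more than one mode.

    mode_counts: number of modes of the conditional distribution per frame.
    Returns a Counter mapping run length -> number of occurrences.
    """
    runs = Counter()
    for is_nonunique, group in groupby(mode_counts, key=lambda n: n > 1):
        if is_nonunique:
            runs[len(list(group))] += 1
    return runs

if __name__ == "__main__":
    # Assumed per-frame mode counts for one utterance.
    print(nonunique_run_lengths([1, 2, 2, 1, 1, 2, 1, 2, 2, 2, 1]))
```

Splitting the runs into the two types would additionally require comparing the recorded trajectory with the peaks at the start and end of each run, which is left out here.
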
0:11:30	uh
0:11:31	So fifty-one to
0:11:32	fifty-three percent of the frames are resolved
0:11:36	with continuity constraints for ASP,
0:11:38	but
0:11:39	it's much lower for WCP, as expected;
0:11:42	it's only twenty-nine to thirty-three percent,
0:11:44	and it keeps reducing with
0:11:49	the length of the path.
0:11:51	Selecting the mean
0:11:53	between the two paths actually works pretty well for WCP in many
0:11:57	of the cases,
0:11:58	which is actually not completely intuitive,
0:12:01	but
0:12:02	it seems to work
0:12:03	somehow.
0:12:05	This is probably because you don't know at what point
0:12:07	the trajectory switches from one of these paths
0:12:11	to the other path,
0:12:12	so selecting the mean is kind of a pragmatic
0:12:15	choice in that setting.
0:12:18	But
0:12:23	this method of selecting the mean actually
0:12:25	degrades as the length of the path increases.
0:12:30	Around eight percent for ASP and twenty-two
0:12:34	percent for WCP are unresolved, in the sense that
0:12:40	the mode which actually gives you the best result
0:12:43	cannot be estimated using continuity constraints.
0:12:46	So that's
0:12:48	the other
0:12:49	result from this paper.
0:12:51	The conclusions are that
0:12:54	the non-uniqueness in acoustic-to-articulatory
0:13:00	inversion
0:13:01	can be estimated statistically.
0:13:03	Continuity constraints help, but not for all instances.
0:13:07	We probably need some other information rather than just continuity, for example the emotional state or
0:13:12	the rate of speech, and so on, to improve
0:13:15	the estimate.
0:13:17	There are some semi-definite conclusions:
0:13:20	human beings make use of non-unique articulator positions, so this is clear, but
0:13:26	this cannot be verified unless we have exactly the same
0:13:30	acoustics
0:13:31	with different articulation,
0:13:33	so it's only semi-definite.
0:13:36	There are some rather unanswered questions as well.
0:13:41	The main question here is:
0:13:44	do these non-unique articulator positions
0:13:46	change the area function of the vocal tract?
0:13:49	It might seem intuitive that they do, but
0:13:51	that has yet to be verified, and
0:13:53	we can hope that we get some MRI, dynamic MRI,
0:13:57	results
0:13:58	to verify that.
0:14:00	And what kind of compensation mechanism is used to make these non-unique
0:14:04	articulations,
0:14:07	non-unique articulations for the same acoustics?
0:14:10	And
0:14:11	given that we have non-uniqueness
0:14:13	in this mapping, what is its role for learning? How do infants figure it
0:14:18	out?
0:14:21	And with that I end my speech.
0:14:30	So
0:14:31	the floor is open for discussion.
0:14:39	Any questions?
0:14:42	Okay, over there.
0:14:46	that's to
0:14:47	um
0:14:48	So I have a comment, and in your last slide you had a question about
0:14:52	non-uniqueness in articulation, right?
0:14:56	So
0:14:57	maybe you can share some
0:14:59	comments on this.
0:15:01	The way we measure articulation, these point positions or even a midsagittal
0:15:06	image,
0:15:08	provides a sort of projection of a complex geometry that's also moving in time, so we have
0:15:15	restrictions of spatial and temporal sampling,
0:15:17	and we are again trying to map it to some acoustic feature vector, which is also some sort of projection
0:15:23	of the signal.
0:15:24	So it's really
0:15:26	oftentimes not clear whether we are actually mapping the same things,
0:15:31	or trying to find something that's not there.
0:15:35	well this this just that it results and all all of four or yeah uh has as shown no can
0:15:40	to some extent we can show this
0:15:42	So what are your thoughts on how one would actually
0:15:45	who were there
0:15:46	gaps that one could still
0:15:48	well the
0:15:49	Yeah, that's a very valid question in this field of research, because,
0:15:53	as you said, they are all projections of what the reality is,
0:15:57	and
0:15:58	this is a sort of much larger question
0:16:03	in many senses.
0:16:04	But what I would like to say is that
0:16:07	by looking at these statistical methods,
0:16:09	let's say that we don't use
0:16:12	the acoustic parameters that we use
0:16:14	and we instead use some other acoustic parameters,
0:16:17	or we use articulatory parameters
0:16:20	which are different,
0:16:22	or instead of using the positions of the coils we use area functions, for example.
0:16:27	The thing is that
0:16:29	by using this kind of statistical method we can
0:16:33	find out whether it is non-unique or not
0:16:36	in a reasonable way.
0:16:38	This
0:16:40	paper that we worked on sort of
0:16:44	tells you the problems that come when we try to do
0:16:47	statistically based acoustic-to-articulatory mapping. It is quite clear that you are
0:16:51	going to have these problems when you do statistically based acoustic-
0:16:54	to-articulatory mapping.
0:16:57	That is why we put it in the semi-definite conclusions,
0:17:00	because
0:17:01	we can't be sure
0:17:02	as of now.
0:17:03	So I
0:17:05	don't know how to go ahead
0:17:06	based on this,
0:17:07	unless, suppose, we have 3D MRI
0:17:09	or something.
0:17:14	yeah
0:17:15	yeah
0:17:17	in front of you
0:17:19	uh
0:17:19	it
0:17:21	If I understand it correctly, in this work you're addressing the question of
0:17:25	non-uniqueness within the speaker, is that right? Yeah, it's within the speaker. So
0:17:30	is there a way to extend this to cross-speaker non-uniqueness? Because
0:17:33	that might be important for... Yes, actually that's quite clear. There are
0:17:38	many other pieces of evidence which show that across speakers people use different strategies
0:17:44	to produce the same kind of sound.
0:17:47	But the problem there is not exactly the same, because you would not produce exactly the
0:17:52	same acoustics; the acoustics
0:17:53	vary with other factors also, like the shape of the vocal tract.
0:17:57	So
0:17:58	you can produce the same phonemes,
0:17:59	the same sounds that we classify as the same phonemes,
0:18:02	and different people use different strategies; there are several results which show that.
0:18:06	But
0:18:06	can we produce exactly the same acoustics
0:18:09	by different articulatory configurations?
0:18:12	I think that question is more relevant if you look at a single speaker.
0:18:17	so you would say that this is a big telling instant
0:18:19	i no
0:18:20	oh have was
0:18:21	okay
0:18:22	I mean, they're just
0:18:23	different
0:18:24	questions.
0:18:24	okay
0:18:25	thanks
0:18:28	yeah
0:18:31	so i think this
0:18:32	spring cell session try that can does so as i "'cause" you have those and and the people on the
0:18:38	flap but just

Modeling and Analysis of Speech Production

Presented by: Gopal Ananthakrishnan, Author(s): Ananthakrishnan G, Olov Engwall, KTH - Royal Institute of Technology, Sweden