Speech transcript - ACOUSTIC-TO-ARTICULATORY INVERSION USING AN EPISODIC MEMORY

0:00:21	So, good morning everybody.
0:00:25	My name is Sebastien Demange,
0:00:26	and I am a researcher at INRIA.
0:00:30	The work I am going to present to you has been done with one of my colleagues, Slim Ouni,
0:00:35	who is an associate professor
0:00:36	at University Nancy 2 [unclear].
0:00:45	The problem we are
0:00:47	tackling in this work
0:00:48	is acoustic-to-articulatory inversion,
0:00:51	and we propose to use a new model in this domain,
0:00:54	which is an episodic memory.
0:00:58	So here is the outline of my talk.
0:01:00	In the first part, I am going to briefly present
0:01:05	the problem of acoustic-to-articulatory inversion,
0:01:09	and also give
0:01:11	a brief presentation of episodic modeling
0:01:14	and the motivation for
0:01:15	using it.
0:01:17	Then I will present the proposed approach,
0:01:21	which we call the generative episodic memory,
0:01:25	and this will be followed by an experimental evaluation
0:01:29	before the conclusion.
0:01:35	What is the acoustic-to-articulatory inversion problem? The goal is to recover
0:01:40	the articulatory gestures
0:01:42	from a speech signal.
0:01:45	This is an interesting problem because many applications can take advantage
0:01:49	of
0:01:51	knowledge about the articulators,
0:01:53	such as language learning,
0:01:55	speech therapy, or also speech recognition.
0:01:59	This is an interesting problem but also a very difficult one,
0:02:02	because this problem
0:02:03	is highly nonlinear
0:02:07	and the mapping from the acoustic to the articulatory space
0:02:11	is non-unique.
0:02:15	We think that in fact the dynamics,
0:02:19	the articulatory dynamics, can help us to solve,
0:02:22	at least partially,
0:02:23	the non-uniqueness of the solution,
0:02:26	because
0:02:28	the dynamics
0:02:30	account for
0:02:32	some [unclear] effects
0:02:33	such as coarticulation.
0:02:36	They also account for the physical properties of the articulators,
0:02:40	such as their elasticity
0:02:43	and their degrees of freedom,
0:02:45	and they also account
0:02:48	for the strategies
0:02:49	that the speaker uses
0:02:51	to articulate.
0:02:57	So what about episodic modeling?
0:03:00	In the psycholinguistic literature, many works
0:03:03	agree
0:03:04	on the existence of an episodic memory.
0:03:07	In fact this is a part of the brain
0:03:10	where we encode the
0:03:13	events
0:03:14	we experience in our life,
0:03:16	and these
0:03:18	experiences are stored as episodes that
0:03:23	can be retrieved
0:03:24	at any time.
0:03:26	These episodes may be used in order to perform speech processing:
0:03:31	in fact you can retrieve past episodes in order to interpret present events
0:03:37	and also
0:03:42	to speak [unclear].
0:03:47	So episodic modeling has been used in
0:03:51	two domains of speech processing:
0:03:55	speech recognition,
0:03:57	with template-based speech recognition, and also
0:04:00	speech synthesis,
0:04:02	with unit selection,
0:04:05	which can also be seen as an episodic approach.
0:04:09	These models
0:04:12	are in fact
0:04:15	collections of acoustic realizations of lexical units,
0:04:21	which can be phones, diphones, syllables, or words.
0:04:25	Most of the time these
0:04:28	episodes are described as acoustic sequences
0:04:33	with contextual information.
0:04:39	The results of
0:04:41	these models,
0:04:43	for both speech recognition and speech synthesis, are most of the time expressed
0:04:47	as a concatenation of episodes,
0:04:50	and this concatenation should best explain
0:04:54	the input signal for speech recognition,
0:04:57	that is to say, the input speech signal is described as a sequence of episodes,
0:05:02	and for speech synthesis
0:05:03	the synthesized speech is also expressed as a concatenation
0:05:08	of episodes.
0:05:10	I call these approaches
0:05:14	concatenative episodic memories.
0:05:18	So let us go back
0:05:19	to our problem, which is
0:05:21	acoustic-to-articulatory inversion.
0:05:25	Episodic modeling is attractive for this problem for two reasons.
0:05:29	The first one is that it relies on observed,
0:05:32	synchronized acoustic and articulatory data,
0:05:35	so we do not have to make any assumption about the mapping function.
0:05:39	The second is that the articulatory dynamics are preserved within the episodes,
0:05:45	and this can help to solve
0:05:47	the problem of non-uniqueness.
0:05:51	However, there is
0:05:54	maybe a
0:05:55	more practical problem than
0:05:57	a theoretical problem.
0:05:59	I mean,
0:06:00	if we consider speech recognition and speech synthesis,
0:06:05	the mapping is between a continuous space and a discrete space:
0:06:08	speech recognition tries to map an acoustic signal to a sequence of
0:06:13	lexical units,
0:06:14	and speech synthesis
0:06:15	tries to map
0:06:17	a sequence of lexical units,
0:06:18	let us say phones or diphones,
0:06:20	to an acoustic signal.
0:06:23	But if you consider the acoustic-to-articulatory inversion problem,
0:06:27	the mapping is between two
0:06:29	continuous spaces.
0:06:34	Usually, for speech recognition and speech synthesis, the memories are based on,
0:06:39	let us say, a few hours to tens of hours of speech
0:06:43	to reach reasonable performance.
0:06:47	But
0:06:49	the corpora
0:06:52	with articulatory information are very small for now:
0:06:58	typically they have a few minutes,
0:07:00	or
0:07:01	at most
0:07:02	a few tens of minutes,
0:07:03	and this
0:07:05	small amount of data
0:07:07	does not allow
0:07:08	us to efficiently
0:07:09	model the
0:07:11	variability in
0:07:15	the acoustic and articulatory spaces.
0:07:25	So we propose
0:07:28	a framework
0:07:29	to combine the episodes, and this combination
0:07:34	will be based on the local similarity between these episodes.
0:07:39	This way of combining episodes can produce
0:07:44	unseen articulatory trajectories
0:07:46	and can
0:07:47	better account for the variability;
0:07:50	that is,
0:07:51	the memory will be able to produce variations of episodes.
0:07:57	Here is a very basic example just to illustrate what I mean
0:08:01	by combining episodes.
0:08:02	Just consider a
0:08:04	very simple labyrinth problem.
0:08:06	Imagine that I give you this labyrinth
0:08:10	and
0:08:11	ask you
0:08:12	to try to solve this problem
0:08:14	[unclear],
0:08:19	and imagine that you find two trajectories
0:08:23	within this labyrinth,
0:08:24	the
0:08:26	red one and the blue one.
0:08:28	After that,
0:08:29	I can ask you: could you
0:08:31	give me other solutions to this problem?
0:08:35	And we get,
0:08:38	[unclear] three points,
0:08:41	let us say, some sort of similarity.
0:08:44	So from the two previously found
0:08:47	trajectories
0:08:48	within the labyrinth, we can find a lot of new ones
0:08:52	[unclear].
0:08:59	But this is a very basic problem, and it is only spatial,
0:09:03	and of course
0:09:04	here we do not have to deal with a temporal dimension
0:09:11	[unclear] solution.
0:09:16	Here I explain how I build my memory.
0:09:20	We consider an episode as a sequence of synchronized acoustic and articulatory observations,
0:09:25	and the lexical unit considered is the phone.
0:09:31	So what
0:09:33	do we consider as local similarity?
0:09:35	Local similarity
0:09:37	means
0:09:40	two similar articulatory configurations which appear at
0:09:45	similar times,
0:09:46	similar instants,
0:09:47	during the realization of a given phone.
0:09:51	So we have to deal with two dimensions:
0:09:54	the first one is the temporal dimension
0:09:56	and the second one is the spatial
0:09:57	dimension.
0:09:58	We use DTW to
0:10:03	deal with the temporal dimension,
0:10:05	and we also use [unclear]
0:10:07	in order to
0:10:08	make
0:10:10	the mapping
0:10:11	symmetric
0:10:13	and to be able to compare different
0:10:16	distances
0:10:17	between episodes.
0:10:20	Also, Itakura constraints allow us to control the
0:10:25	time distortion
0:10:27	of the episodes.
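A minimal sketch of the temporal alignment described above, assuming an episode is a list of equal-dimension feature tuples. The slope limit of 2 (an Itakura-style parallelogram), the Euclidean frame distance, and the normalization by the summed lengths are assumptions made for illustration; the talk does not specify them. The name `dtw_distance` is hypothetical.

```python
import math


def dtw_distance(x, y, max_slope=2.0):
    """Length-normalized DTW cost between two frame sequences.

    Returns math.inf when no warping path satisfies the slope constraint.
    """
    n, m = len(x), len(y)

    def allowed(i, j):
        # Itakura-style parallelogram (assumed variant): the path slope must
        # stay within [1/max_slope, max_slope] seen from both end points.
        ri, rj = n - 1 - i, m - 1 - j
        return (j <= max_slope * i and i <= max_slope * j
                and rj <= max_slope * ri and ri <= max_slope * rj)

    inf = math.inf
    cost = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            if not allowed(i, j):
                continue
            d = math.dist(x[i], y[j])  # Euclidean frame distance (assumption)
            if i == 0 and j == 0:
                cost[i][j] = d
                continue
            # Symmetric step pattern: vertical, horizontal, diagonal.
            best = min(
                cost[i - 1][j] if i > 0 else inf,
                cost[i][j - 1] if j > 0 else inf,
                cost[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            cost[i][j] = d + best
    # Normalizing by n + m makes costs of different-length pairs comparable.
    return cost[n - 1][m - 1] / (n + m)
```

Because both the step pattern and the constraint are symmetric in the two sequences, `dtw_distance(x, y)` equals `dtw_distance(y, x)`; an infinite result can be read as "too different in duration to align".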
0:10:30	For spatial similarity, let us consider
0:10:34	the plot in the bottom right corner.
0:10:37	Let us say that it is the trajectory of one articulator,
0:10:41	and consider, at a given time,
0:10:45	the position x_i.
0:10:49	We say that x_{i+1} is the natural
0:10:54	target of x_i,
0:10:57	and we
0:10:58	make
0:10:59	the following assumption:
0:11:01	x_{i+1} could have been reached [unclear]
0:11:05	without a significant impact on the acoustics.
0:11:10	So we define
0:11:12	an interval
0:11:13	centered around x_i,
0:11:17	and we say that any articulatory configuration
0:11:23	within this interval
0:11:24	can be
0:11:26	considered similar
0:11:27	to x_i.
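The spatial similarity test above can be sketched as follows for a single articulator coordinate. The talk only says that the interval is centered on x_i and motivated by the next position x_{i+1}; taking the half-width to be |x_{i+1} - x_i|, and falling back to the previous sample at the end of the trajectory, are assumptions. The function names are hypothetical.

```python
def similarity_interval(trajectory, i):
    """Return the (low, high) bounds of the interval centered on trajectory[i]."""
    # Assumption: the half-width is the distance to the next sample x_{i+1};
    # for the last sample, use the distance to the previous sample instead.
    j = i + 1 if i + 1 < len(trajectory) else i - 1
    half_width = abs(trajectory[j] - trajectory[i])
    return trajectory[i] - half_width, trajectory[i] + half_width


def is_similar(trajectory, i, candidate):
    """True if `candidate` lies inside the similarity interval of sample i."""
    low, high = similarity_interval(trajectory, i)
    return low <= candidate <= high
```

For example, with `trajectory = [0.0, 1.0, 3.0]`, sample 1 gets the interval (-1.0, 3.0), so a configuration at 2.5 from another episode would count as similar to it.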
0:11:35	Let us consider two episodes now
0:11:39	of a given phone,
0:11:41	say, for example, two acoustic and articulatory realizations of the phone [unclear],
0:11:53	and let
0:11:53	us see how to build
0:11:56	the generative memory.
0:11:58	We first check
0:12:01	that X and Y are similar enough,
0:12:04	because
0:12:05	two realizations of
0:12:08	the same phone
0:12:10	can be quite different,
0:12:12	because some articulators are not critical for that phone.
0:12:20	So we first map,
0:12:23	let us say, episode Y onto episode X.
0:12:27	I have represented the aligned observations
0:12:31	with the colored lines,
0:12:33	so the red ones.
0:12:38	Okay.
0:12:44	This is just to illustrate what, from two episodes,
0:12:47	the generative memory can synthesize,
0:12:50	at the bottom of the figure.
0:12:53	As you can see,
0:12:54	there are eight
0:12:55	trajectories, so the memory is able to produce,
0:13:00	from two episodes, eight
0:13:03	episodes which are acceptable from an articulatory
0:13:07	point of view
0:13:10	[unclear].
0:13:14	So what does the inversion consist of? The generative memory
0:13:18	is an oriented graph:
0:13:20	each node is a
0:13:23	synchronized acoustic and articulatory observation,
0:13:26	and the edges are the allowed transitions
0:13:29	derived from the
0:13:31	preceding mapping between episodes.
0:13:36	The inversion consists in finding in this graph
0:13:41	the path which best
0:13:43	matches
0:13:47	the input acoustic observations,
0:13:51	and of course the articulatory gestures
0:13:53	are derived from the articulatory component of each node.
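The graph search described above can be sketched as a Viterbi-style dynamic program. This is an illustration under stated assumptions, not the authors' implementation: one node is consumed per input frame, any node may start the path, the match cost is the summed Euclidean distance between the input frames and the acoustic component of the nodes, and the data layout (`nodes`, `edges`) and the name `invert` are hypothetical.

```python
import math


def invert(nodes, edges, acoustic_frames):
    """Return the articulatory trajectory along the best-matching graph path.

    nodes: dict mapping node id -> (acoustic_vector, articulatory_vector)
    edges: dict mapping node id -> iterable of allowed successor ids
    acoustic_frames: non-empty list of input acoustic vectors
    """
    # Cost of starting at each node (assumption: any node may start a path).
    cost = {n: math.dist(vec[0], acoustic_frames[0]) for n, vec in nodes.items()}
    backpointers = []
    for frame in acoustic_frames[1:]:
        new_cost, pointers = {}, {}
        for u, c in cost.items():
            for v in edges.get(u, ()):
                candidate = c + math.dist(nodes[v][0], frame)
                if candidate < new_cost.get(v, math.inf):
                    new_cost[v] = candidate
                    pointers[v] = u
        if not new_cost:
            raise ValueError("no path in the graph is long enough for the input")
        backpointers.append(pointers)
        cost = new_cost
    # Backtrack from the cheapest final node.
    node = min(cost, key=cost.get)
    path = [node]
    for pointers in reversed(backpointers):
        node = pointers[node]
        path.append(node)
    path.reverse()
    # The estimate is read off the articulatory component of each node.
    return [nodes[n][1] for n in path]
```

The work per frame is proportional to the number of edges, so the search stays linear in the length of the input.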
0:14:00	For the evaluation we have compared
0:14:02	the generative memory
0:14:04	with a concatenative one and with a codebook-based approach
0:14:09	with [unclear] constraints.
0:14:12	Here are the corpora we used: MOCHA,
0:14:15	which contains two speakers, a male and a female;
0:14:19	the language is English, and
0:14:21	MOCHA uses seven coils:
0:14:24	two on the lips,
0:14:26	and on the lower incisor, the tongue tip, the tongue body,
0:14:29	the tongue dorsum, and the velum.
0:14:31	We also use a French corpus we have recorded
0:14:33	[unclear],
0:14:35	where we do not fix a coil
0:14:38	on the velum but on the [unclear].
0:14:46	Okay.
0:14:49	The
0:14:51	evaluations
0:14:52	of the estimated trajectories
0:14:55	are based on the root mean square error and the Pearson correlation, which quantifies the similarity
0:14:59	and synchrony between two
0:15:01	trajectories [unclear].
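A small sketch of the two evaluation measures named above, computed for one articulator coordinate over time. The function names and the plain-list input format are assumptions made for illustration.

```python
import math


def rmse(reference, estimate):
    """Root mean square error between two equal-length trajectories."""
    squared = [(r - e) ** 2 for r, e in zip(reference, estimate)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(reference, estimate):
    """Pearson correlation between two equal-length trajectories."""
    n = len(reference)
    mean_r = sum(reference) / n
    mean_e = sum(estimate) / n
    cov = sum((r - mean_r) * (e - mean_e) for r, e in zip(reference, estimate))
    var_r = sum((r - mean_r) ** 2 for r in reference)
    var_e = sum((e - mean_e) ** 2 for e in estimate)
    # Undefined (division by zero) if either trajectory is constant.
    return cov / math.sqrt(var_r * var_e)
```

The two measures are complementary: a trajectory shifted by a constant offset keeps a correlation of 1 while its RMSE equals the offset, which is why both are usually reported.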
0:15:06	Here are the results.
0:15:10	The red bars are the codebook results,
0:15:13	the blue the concatenative memory, and the green bars
0:15:15	the generative memory, and
0:15:17	we can observe the same improvement trend
0:15:20	over all three corpora,
0:15:22	so over the two languages and over the three speakers:
0:15:27	the generative memory always outperforms
0:15:30	the concatenative memory and the codebook.
0:15:33	The graph quantifies the [unclear] improvement,
0:15:37	so we can expect an improvement
0:15:38	of between five and [unclear] percent,
0:15:40	[unclear],
0:15:43	for the generative memory over the concatenative memory, and
0:15:46	of between ten and fifteen percent
0:15:48	[unclear].
0:15:53	Here is
0:15:55	an example.
0:15:55	As you can see, the codebook provides a very jerky trajectory,
0:16:00	while the
0:16:01	episodic memories
0:16:03	provide us with a smoother
0:16:05	trajectory,
0:16:06	because it is a better model.
0:16:10	It corresponds to the movement of [unclear]
0:16:12	along the [unclear] axis
0:16:15	for the French speaker
0:16:16	[unclear].
0:16:21	Okay,
0:16:22	to compare these results with those of other works,
0:16:25	we can say that
0:16:27	we have
0:16:28	reasonably good performance.
0:16:30	For example, I
0:16:32	compare to some
0:16:34	machine learning algorithms
0:16:37	which have been evaluated on the same database, and we can see that
0:16:40	the root mean square errors are all between
0:16:43	one point four and one [unclear].
0:16:47	But
0:16:48	[unclear] have reported in an article that
0:16:51	the articulatory data acquisition error is about
0:16:56	zero point four millimeters, so
0:16:57	we can just say that, okay,
0:16:59	we have different methods, but
0:17:02	maybe,
0:17:03	as we do not share exactly the same
0:17:06	data processing
0:17:07	and because of the acquisition error,
0:17:10	we are more [unclear].
0:17:16	We proposed a generative episodic memory. This model is interesting because
0:17:21	it does not require any assumption about the mapping function,
0:17:24	the memory is able to account for the dynamics,
0:17:32	and it is also able to produce unseen gestures [unclear].
0:17:39	For future work,
0:17:40	we are focusing on the use of more robust distances, because for
0:17:45	this work we have used the Euclidean distance in the acoustic space, and
0:17:51	this distance is known not to be
0:17:53	robust.
0:17:56	We would also like to take into account
0:17:58	the correlation between the articulators,
0:18:01	because the articulators can compensate for each other,
0:18:04	and
0:18:06	we think this
0:18:08	correlation can help [unclear].
0:18:11	We would also like to move from
0:18:15	a pure phonetic segmentation
0:18:17	during
0:18:18	the building of the memory
0:18:19	to
0:18:20	an articulatory-based
0:18:23	segmentation [unclear].
0:18:27	And finally, we can
0:18:29	[unclear]
0:18:30	to get further improvement [unclear],
0:18:33	because the memory is able to produce new trajectories but fails
0:18:38	to
0:18:40	precisely map an acoustic frame
0:18:44	[unclear].
0:18:52	Thank
0:18:53	you.
0:18:58	We have time for a question.
0:19:08	I had just one thing in mind: it seems to me there is room for combining the codebook
0:19:12	and the generative model, in that the codebook could be some kind of starting trajectory [unclear].
0:19:18	Is it possible to combine the codebook and the generative model, so that the
0:19:22	codebook serves as
0:19:24	an initialization, so to speak,
0:19:28	[unclear]
0:19:31	and guide the search, or would that be computationally too
0:19:34	expensive?
0:19:37	In the memory,
0:19:39	it is
0:19:40	[unclear] as a kind of codebook;
0:19:42	it is
0:19:42	much better than a codebook because
0:19:45	we have temporal information within the memory
0:19:48	[unclear].
0:20:04	Okay, so thank you again.

ACOUSTIC-TO-ARTICULATORY INVERSION USING AN EPISODIC MEMORY

Modeling and Analysis of Speech Production

Speaker: Sebastien Demange, Authors: Sebastien Demange, LORIA / INRIA, France; Slim Ouni, University Nancy 2, France