Speech Transcript - INFORMATIVE DIALECT RECOGNITION USING CONTEXT-DEPENDENT PRONUNCIATION MODELING

0:00:22	oh
0:00:22	can have a
0:00:25	okay good
0:00:27	i okay
0:00:28	mean
0:00:28	um
0:00:29	today i'll be talking about how we generalise and adapt the concept of pronunciation modeling
0:00:36	and use that to design a framework to help analyse dialects
0:00:41	so here is the structure of the talk
0:00:43	and i'll first start from the motivation
0:00:46	um of speech science and engineering
0:00:48	that
0:00:49	model
0:00:53	so
0:00:54	dialect recognition
0:00:55	in dialect research there are different
0:00:58	branches of work
0:01:00	on the one hand there's speech science
0:01:02	so for example
0:01:03	in speech
0:01:04	science
0:01:06	there are sociolinguists
0:01:08	who work on
0:01:10	rules
0:01:11	across dialects to understand why these dialects are different
0:01:15	um this is
0:01:16	very important
0:01:17	um but their analysis is often manual
0:01:21	so it's very time consuming
0:01:23	which limits the amount of data that can be analysed
0:01:26	and without enough data
0:01:30	it is
0:01:31	it's possible that some of these rules might be over
0:01:34	or under specified
0:01:38	on the other
0:01:39	hand we have speech technology
0:01:41	so for example speech engineers
0:01:45	design
0:01:46	automatic dialect recognition systems
0:01:50	i
0:01:51	i
0:01:52	and um
0:01:54	and i to of these not
0:01:57	and so they can process
0:01:58	data very efficiently even if there is a lot of it and can also reach a decent
0:02:03	performance
0:02:06	that we model these two then the commands
0:02:09	i'm do these dialect differences
0:02:13	so
0:02:15	in our work
0:02:16	we decided to combine the strengths of these two research communities
0:02:20	to bridge the gap between speech science and technology
0:02:24	in particular we want to design automatic systems that are able to explicitly learn the phonetic rules
0:02:31	across dialects
0:02:32	and use that to inform human analysts
0:02:35	so because of this informative nature of the
0:02:39	results of the system we term this approach informative dialect recognition
0:02:47	so to give you a taste of what i mean by what our system can do
0:02:54	here is an example
0:02:55	so in the input we have the word transcript and the audio signal
0:03:00	which can be used to generate the reference pronunciation and the dialect specific pronunciation
0:03:06	um in red here
0:03:09	is the model that learns the mapping between this reference pronunciation and the dialect specific pronunciation
0:03:16	um so that in the output
0:03:17	we can get these phonetic transformations, these phonetic rules
0:03:21	um
0:03:22	that tell you how the dialects are different
0:03:24	so for example in this case
0:03:27	we see that [r] is deleted when it's followed by a consonant
0:03:31	and in addition we can quantify the occurrence frequency and know how often this happens
0:03:38	and that kind of information is extremely important for forensic phoneticians
0:03:42	which is one of the big motivations behind our work
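
The idea of quantifying how often a rule applies in a given context can be sketched as a simple count over aligned phone tokens. This is only an illustration under stated assumptions: the toy tokens, the `Token` layout and the function name `deletion_rate` are hypothetical and are not taken from the talk or the paper.

```python
from typing import List, Optional, Tuple

# Hypothetical toy alignment, not data from the talk:
# (reference phone, surface phone or None if deleted, class of the following phone)
Token = Tuple[str, Optional[str], str]

aligned: List[Token] = [
    ("r", None, "consonant"),
    ("r", None, "consonant"),
    ("r", "r", "consonant"),
    ("r", "r", "vowel"),
]


def deletion_rate(tokens: List[Token], phone: str, context: str) -> float:
    """Fraction of `phone` tokens in `context` that have no surface phone."""
    in_context = [surf for ref, surf, ctx in tokens if ref == phone and ctx == context]
    if not in_context:
        return 0.0  # the phone never occurs in this context
    return sum(surf is None for surf in in_context) / len(in_context)


rate_before_consonant = deletion_rate(aligned, "r", "consonant")
print(f"[r] deleted before a consonant in {rate_before_consonant:.0%} of toy tokens")
```

A real system would estimate such frequencies from model statistics over a large corpus rather than from a hand-written list.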
0:03:48	so before i go into more of the details of our proposed model
0:03:52	um i'd like to formally introduce what i mean by phonetic transformation because
0:03:58	we will be characterising dialect differences using phonetic transformations
0:04:03	so um
0:04:05	we represent the pronunciation of a word in the reference dialect as reference phones
0:04:11	and in the dialect of interest we represent the pronunciation as surface phones
0:04:16	and this mapping between the reference phones and the surface phones is what we call a phonetic transformation
0:04:22	so say
0:04:24	we're given the word
0:04:25	bath
0:04:26	um and choose general american english
0:04:29	as the reference dialect
0:04:31	um
0:04:32	and british english as the dialect of interest
0:04:35	now we have the reference phones and surface phones of the word bath
0:04:39	and here you see
0:04:41	[ae] in the reference phones is mapped to [aa] in the surface phones
0:04:45	so this is an example of
0:04:47	a substitution, which is one kind of phonetic transformation
0:04:51	um there are two other kinds
0:04:53	that will be shown later
0:04:56	more about them then
0:04:57	but this is what i mean by phonetic transformation
0:05:02	and
0:05:02	i and to i proposed model
0:05:05	and a
0:05:06	we we it to make
0:05:08	a model any parents to express a woman these have a transformations
0:05:13	so it is called the phonetic pronunciation model
0:05:16	yeah and
0:05:18	we want to answer the following questions using this model
0:05:22	so first um
0:05:23	when comparing a dialect to a reference dialect
0:05:28	what kinds of phonetic transformations occur
0:05:31	are they substitutions
0:05:32	insertions or deletions
0:05:34	and if they occur, do they occur only in certain phonetic contexts
0:05:41	and
0:05:42	how often do they occur
0:05:43	so to answer these questions um we have two parts in
0:05:48	our model, the first is
0:05:50	a hidden markov model
0:05:52	and we use that to help us automatically align the reference phones with the surface phones
0:05:57	um the second part is
0:05:59	decision tree clustering, which helps us generalise the phonetic rules
0:06:06	so here is the slide with
0:06:08	the three
0:06:09	kinds of phonetic transformations, each with an example
0:06:13	and in the examples
0:06:16	american english is the reference dialect and british english is
0:06:21	the dialect of interest
0:06:24	um so the first
0:06:26	case is the substitution of [ae], in american english it's pronounced as bath with [ae] and in
0:06:33	british english it would sound like bath with [aa]
0:06:35	um the second is the deletion example where
0:06:39	[r] is followed by a consonant, so in american english
0:06:43	part
0:06:44	would sound something like part with the [r] deleted
0:06:46	in british english
0:06:47	and
0:06:49	the third example of phonetic transformations is insertion
0:06:52	so here, compared with general american english, when a word ends with a vowel and the
0:06:57	word following it
0:06:59	starts with a vowel
0:07:02	um an [r] might be inserted in between
0:07:06	when it's a british english speaker
0:07:09	so the phrase saw a film would sound more like saw [r] a film
0:07:15	um so these are some of the examples of the phonetic transformations
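
The three kinds of phonetic transformation can be sketched as labels on aligned pairs of reference and surface phones. This is a minimal illustration, assuming the alignment is already given: the pair representation, the ARPAbet-like phone symbols and the names `classify` and `transformations` are hypothetical, and in the talk the alignment itself comes from the HMM rather than being written by hand.

```python
from typing import List, Optional, Tuple

# (reference phone, surface phone); None marks a missing side of the alignment.
Pair = Tuple[Optional[str], Optional[str]]


def classify(ref: Optional[str], surf: Optional[str]) -> str:
    """Label one aligned pair with the kind of phonetic transformation."""
    if ref is None and surf is None:
        raise ValueError("an aligned pair needs at least one phone")
    if ref is None:
        return "insertion"  # surface phone with no reference phone
    if surf is None:
        return "deletion"  # reference phone with no surface phone
    return "match" if ref == surf else "substitution"


def transformations(pairs: List[Pair]) -> List[str]:
    """Label every aligned pair of a word or phrase."""
    return [classify(ref, surf) for ref, surf in pairs]


# Hand-written toy alignments with illustrative phone symbols.
bath = [("b", "b"), ("ae", "aa"), ("th", "th")]
part = [("p", "p"), ("aa", "aa"), ("r", None), ("t", "t")]
saw_a = [("s", "s"), ("ao", "ao"), (None, "r"), ("ah", "ah")]

for name, pairs in [("bath", bath), ("part", part), ("saw a", saw_a)]:
    print(name, transformations(pairs))
```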
0:07:19	and in the following slides i'll illustrate how these examples fit into our proposed hmm network
0:07:28	so here is a traditional hmm network
0:07:32	where the circles represent the states and the squares represent the observations
0:07:36	and the arrows are the state transitions
0:07:40	so this is a trivial case where
0:07:42	the reference phones and the surface phones are the same, so there are no dialect differences
0:07:47	um and this is the case of a substitution
0:07:49	where
0:07:50	i
0:07:51	and in this case the traditional hmm system can handle it adequately
0:07:57	however
0:07:58	what about an insertion, so if we have an insertion of [r] here we see that this surface phone
0:08:04	does not have any corresponding state
0:08:08	to it
0:08:09	so our solution is that now we have a one to two mapping between the reference phones and the states
0:08:17	so each reference phone
0:08:19	is represented
0:08:21	by two
0:08:22	states, the first one is the right circle
0:08:24	which indicates a normal state
0:08:27	and then it's followed by an insertion state, the green circle
0:08:31	and so now you see that um the observation
0:08:36	has a corresponding state to be mapped to
0:08:41	and in addition we also further categorise our state transitions
0:08:46	um according to the phonetic
0:08:49	transformations
0:08:50	so now if
0:08:52	a state transition is entering an insertion state, like the red arrow here in the graph
0:08:58	then we call it an insertion state transition
0:09:03	okay so we can handle the case of insertions
0:09:06	how about deletions then
0:09:09	so here we see the example where um
0:09:12	this state
0:09:13	[r] has no corresponding surface phone or observation
0:09:17	and to solve this problem we introduce a deletion state transition
0:09:22	which skips a normal state
0:09:24	so in this case
0:09:26	the state [r] is skipped
0:09:27	so it no longer needs to be mapped to an observation
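
The state layout described here (a normal state plus an insertion state per reference phone, and a deletion transition that skips a normal state) can be sketched as a list of labelled arcs. This is a rough sketch, not the authors' implementation: the state naming, the function `build_topology`, and the choice of which arcs leave an insertion state are assumptions, and self-loops, start and end states and all probabilities are left out.

```python
from typing import List, Tuple

Arc = Tuple[str, str, str]  # (source state, target state, transition kind)


def build_topology(ref_phones: List[str]) -> List[Arc]:
    """Build labelled arcs for one reference phone sequence (illustrative only)."""
    normal = [f"N{i}:{p}" for i, p in enumerate(ref_phones)]
    insert = [f"I{i}:{p}" for i, p in enumerate(ref_phones)]
    arcs: List[Arc] = []
    for i in range(len(ref_phones)):
        # Entering an insertion state is an insertion state transition.
        arcs.append((normal[i], insert[i], "insertion"))
        if i + 1 < len(ref_phones):
            # Moving on to the next reference phone, with or without an insertion.
            arcs.append((normal[i], normal[i + 1], "normal"))
            arcs.append((insert[i], normal[i + 1], "normal"))
        if i + 2 < len(ref_phones):
            # Skipping the next normal state models a deleted reference phone.
            arcs.append((normal[i], normal[i + 2], "deletion"))
    return arcs


arcs = build_topology(["p", "aa", "r", "t"])
for source, target, kind in arcs:
    print(f"{source} -> {target} ({kind})")
```

In this toy layout the arc from the `aa` state straight to the `t` state is the deletion transition that skips the `r` state. Deleting the first or last phone would need explicit start and end states, which this sketch omits.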
0:09:32	so these are some of the highlights of the differences between our proposed hmm network
0:09:37	and the traditional one, which help us more explicitly model the phonetic transformations in a richer way
0:09:45	so now
0:09:45	after training an hmm system using triphones
0:09:49	we could find rules like these on the right
0:09:53	so for example
0:09:55	[ae] becomes [aa] when it's followed by [th]
0:09:58	so bath becomes [b aa th]
0:10:01	also
0:10:01	[ae] becomes [aa] when it's followed by [s]
0:10:04	so class becomes [k l aa s]
0:10:06	and
0:10:07	another example
0:10:09	hmmm
0:10:11	so laugh becomes [l aa f]
0:10:14	the question we want to ask here is
0:10:17	are these observed rules
0:10:19	um
0:10:19	actually originating from a more general underlying rule
0:10:24	and if so, how can we find that
0:10:27	so here we use decision tree clustering to help us
0:10:31	so from the results of decision tree clustering
0:10:34	um we can find, by clustering
0:10:37	these observed rules, an underlying rule
0:10:40	so here the underlying rule we found was that
0:10:43	when [ae] is followed by a voiceless fricative, the phonetic transformation of [ae] to
0:10:49	[aa] will occur
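
The step from several context-specific rules to one underlying rule can be sketched as choosing the phonetic question that best separates the contexts where the transformation occurs. This is a toy sketch under assumptions: the talk does not state the splitting criterion, so information gain is swapped in here, and the phone classes, the tokens and the names `QUESTIONS`, `entropy` and `gain` are hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple

# Hypothetical phone classes used as candidate decision tree questions.
QUESTIONS: Dict[str, Set[str]] = {
    "voiceless_fricative": {"f", "th", "s", "sh"},
    "voiceless_stop": {"p", "t", "k"},
    "nasal": {"m", "n", "ng"},
}

# Toy tokens: (phone following [ae], whether [ae] -> [aa] was observed).
tokens: List[Tuple[str, bool]] = [
    ("th", True), ("s", True), ("f", True),
    ("t", False), ("k", False), ("p", False), ("n", False), ("m", False),
]


def entropy(labels: List[bool]) -> float:
    """Binary entropy in bits of a list of boolean labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def gain(data: List[Tuple[str, bool]], phone_set: Set[str]) -> float:
    """Information gain of asking 'is the following phone in phone_set?'."""
    yes = [label for ctx, label in data if ctx in phone_set]
    no = [label for ctx, label in data if ctx not in phone_set]
    total = len(data)
    parent = entropy([label for _, label in data])
    return parent - len(yes) / total * entropy(yes) - len(no) / total * entropy(no)


best_question = max(QUESTIONS, key=lambda name: gain(tokens, QUESTIONS[name]))
print("best question for the right context of [ae]:", best_question)
```

On these toy tokens the voiceless fricative question splits the data perfectly, which mirrors the underlying rule described in the talk. A full decision tree would apply such splits recursively over both left and right contexts.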
0:10:54	so i just talked about the highlights of our model and now um
0:10:58	we're going into the evaluation stage
0:11:00	and we've done a series of experiments
0:11:03	um and
0:11:04	because of the time constraint i will not be able to share all of this information
0:11:08	so the dialect recognition task um
0:11:11	will not be talked about but uh you can read a lot of the details in our paper
0:11:18	i'll be focusing on the other two, the first one is the pronunciation generation experiment
0:11:23	where
0:11:25	basically we assess the ability of the model by seeing how well it can convert one dialect's pronunciation into
0:11:31	another dialect's pronunciation
0:11:35	the data we used is um
0:11:38	an arabic database, um it has five different arabic dialect regions
0:11:43	uae
0:11:44	egypt
0:11:45	iraq
0:11:46	palestine and syria
0:11:48	and they are all conversational telephone speech
0:11:51	and here we chose iraqi as the reference dialect
0:11:55	and in this table you can see the data partition for our experiment
0:12:01	so
0:12:03	in this experiment the assumption is that if we trained a
0:12:06	pronunciation model so well that it has learned these phonetic rules across dialects correctly
0:12:12	then the model should be able to convert
0:12:14	um the reference
0:12:16	phones into another dialect's pronunciation
0:12:19	very well
0:12:20	so here, after we
0:12:23	train the phonetic pronunciation model
0:12:26	we give it the
0:12:28	reference phones of the test set
0:12:30	and
0:12:32	it will generate the most likely surface phones of the other arabic dialects
0:12:37	by comparing these surface phones
0:12:40	that were generated
0:12:41	to the ground truth surface phones we can see how well our model is converting
0:12:47	uh one dialect's pronunciation to another
0:12:51	and here are the results, so the orange um bar is the monophone version of the pronunciation model
0:12:58	and the blue one is the decision tree um pronunciation model
0:13:02	and we see here that the decision
0:13:04	tree
0:13:04	helps improve the recovery rate by one point seven percent relative
0:13:10	meaning that the decision tree results help us um
0:13:14	convert these pronunciations better
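
Comparing generated surface phones with the ground truth can be sketched with a phone-level edit distance. This is an assumption for illustration only: the talk reports a recovery rate without defining it, so the `phone_accuracy` score below is a stand-in metric, and the function names and toy phone sequences are hypothetical.

```python
from typing import List


def edit_distance(generated: List[str], truth: List[str]) -> int:
    """Levenshtein distance between two phone sequences."""
    previous = list(range(len(truth) + 1))
    for i, gen_phone in enumerate(generated, start=1):
        current = [i]
        for j, true_phone in enumerate(truth, start=1):
            current.append(min(
                previous[j] + 1,                             # extra generated phone
                current[j - 1] + 1,                          # missing ground truth phone
                previous[j - 1] + (gen_phone != true_phone),  # match or substitution
            ))
        previous = current
    return previous[-1]


def phone_accuracy(generated: List[str], truth: List[str]) -> float:
    """One minus the edit distance normalised by the ground truth length."""
    if not truth:
        raise ValueError("ground truth sequence must not be empty")
    return 1.0 - edit_distance(generated, truth) / len(truth)


# Toy comparison: the generated sequence keeps an [r] that the ground truth lacks.
generated_part = ["p", "aa", "r", "t"]
truth_part = ["p", "aa", "t"]
print(edit_distance(generated_part, truth_part), phone_accuracy(generated_part, truth_part))
```

Note that this score can fall below zero when the generated sequence is much longer than the ground truth, so it is only a rough indicator.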
0:13:18	here i'd like to mention a side note, we also did a lot of further
0:13:23	analysis and found that there are word usage differences across arabic dialects
0:13:29	and this can potentially complicate the evaluation of our
0:13:33	system
0:13:35	so
0:13:36	um we also did the same experiment
0:13:38	using a phonetic pronunciation model on multiple english corpora without these word usage differences that would cause complications
0:13:47	and the results are very good
0:13:49	unfortunately i cannot show these to you today because it will be covered at
0:13:54	interspeech
0:13:55	but um that means you should all come to my talk at interspeech as well
0:14:01	so the second
0:14:02	evaluation is the rule evaluation, where we compare our learned rules against the ones
0:14:09	in the linguistic literature
0:14:11	so here on the left you see the linguistic
0:14:14	descriptions of the four arabic dialects
0:14:16	that are from the literature
0:14:18	on the right you see the learned rules from our proposed system
0:14:22	and
0:14:24	you can see that the learned rules from our proposed system actually
0:14:28	um correspond with these linguistic descriptions
0:14:31	and furthermore they actually sometimes might potentially refine the phonetic context in which these rules occur
0:14:39	and most importantly um
0:14:41	we can also quantify
0:14:43	the occurrence
0:14:44	frequencies of these rules given the phonetic context
0:14:48	and this information is very important
0:14:51	for forensic phoneticians
0:14:54	but is rarely documented
0:14:56	in the literature
0:14:58	i'd like to conclude my talk by talking about the contributions of this work
0:15:03	so here we propose an automatic yet informative approach to analysing dialects
0:15:09	and we call that informative dialect recognition
0:15:13	we use a mathematical framework to characterise phonetic transformations
0:15:17	across dialects
0:15:18	in a very explicit manner to learn these rules
0:15:22	um yeah and our proposed system is able to postulate rules
0:15:26	from large corpora to discover
0:15:28	refine and quantify dialect specific rules
0:15:34	so um
0:15:35	if people have questions or issues that they would like to ask me about the talk i would be happy
0:15:40	to answer them
0:15:42	[audience questions, not intelligible in the recording]
0:18:28	thank you
0:18:29	and so i don't know if i can remember all of them to respond to them but uh
0:18:34	that that's one yes that is the uh we are well yeah that point and it's just a i system
0:18:39	is also able to go to these
0:18:42	tension differences that may not actually be a phonetic rule in the
0:18:46	but existing or not existing know when error is one of them
0:18:49	and um
0:18:51	john wells
0:18:52	has established a lot of very good literature on dialect differences and actually i'll
0:18:59	be using a lot of that in my next talk um so
0:19:03	so that is um what you could
0:19:05	look forward to
0:19:06	and um
0:19:07	you mentioned something else about the reference dialect, um for the selection of the reference dialect there are
0:19:13	both engineering and linguistic considerations, so
0:19:19	from the linguistic um side um
0:19:23	i made some decisions such as i would not want to use egyptian arabic as
0:19:28	um the reference dialect because it seems like the native speakers of arabic that i know usually
0:19:34	know how their dialect is different
0:19:36	from the egyptian dialect and so
0:19:38	since i don't really understand arabic and we have had to ask them to help me assess
0:19:41	whether the model or the system is going in the right direction, it will be easier for them
0:19:46	to tell me uh uh if
0:19:49	these phonetic transformations are occurring if the egyptian one is not the reference
0:19:53	and then for palestinian and syrian
0:19:56	they both belong to the levantine arabic family
0:20:00	so i was more reluctant to use them as references because uh since they are more closely
0:20:07	related, if i use palestinian then i may not be able to see syrian differences very easily
0:20:13	and in the initial um
0:20:15	establishment of
0:20:17	uh the system it might be better to have more dialect differences
0:20:21	and finally from the engineering perspective we actually have a lot more iraqi data that we can train systems
0:20:28	on and so um
0:20:30	that was the reason why we chose iraqi, and this was a very difficult
0:20:35	decision but it worked out okay in this case
0:20:38	um um
0:20:39	and so uh are there any other questions
0:20:42	no no okay
0:20:49	know
0:20:51	hmmm

INFORMATIVE DIALECT RECOGNITION USING CONTEXT-DEPENDENT PRONUNCIATION MODELING

Language Identification

Presented by: Nancy Chen, Author(s): Nancy Chen, Massachusetts Institute of Technology, United States; Wade Shen, Joseph Campbell, Pedro Torres-Carrasquillo, MIT Lincoln Laboratory, United States