0:00:29Today I'll be talking about how we generalise and adapt the concept of pronunciation modeling
0:00:36and use that to design a framework to help analyse dialects.
0:00:41Here is the structure of the talk.
0:00:43I'll first start from the motivation
0:00:46in speech science and engineering,
0:00:48and then move on to the
0:00:49model.
0:00:53So,
0:00:54dialect recognition.
0:00:55In dialect research there are different
0:00:58branches of work.
0:01:00On the one hand there's speech science.
0:01:02So for example,
0:01:03speech
0:01:04scientists,
0:01:06such as sociolinguists,
0:01:08work on
0:01:10rules
0:01:11across dialects to understand why these dialects are different.
0:01:15This is
0:01:16very important,
0:01:17but their analysis is often manual,
0:01:21so it's very time consuming,
0:01:23which limits the amount of data that can be analysed.
0:01:26And without enough data,
0:01:31it's possible that some of these rules might be over-
0:01:34or under-specified.
0:01:38On the other
0:01:39hand we have speech technology.
0:01:41So for example, speech engineers
0:01:45design
0:01:46automatic dialect recognition systems.
0:01:54The advantage of these systems is that they are automatic,
0:01:57so they can process
0:01:58data very efficiently even if there is a lot of it, and can also reach decent
0:02:03performance.
0:02:06But these systems do not
0:02:09tell us much about these dialect differences.
0:02:13Therefore,
0:02:15in our work,
0:02:16we decided to combine the strengths of these two research communities
0:02:20to bridge the gap between speech science and technology.
0:02:24In particular, we want to design automatic systems that are able to explicitly model these differences
0:02:31across dialects
0:02:32and use that to inform human analysts.
0:02:35So because of this informative nature of the
0:02:39results of the system, we term this approach informative dialect recognition.
0:02:47So to give you a taste of what I mean by what our system can do,
0:02:54here is an example.
0:02:55In the input we have the word transcript and the audio signal,
0:03:00which can be used to generate the reference pronunciation and the dialect-specific pronunciation.
0:03:06In red here,
0:03:09the model learns the mapping between this reference pronunciation and the dialect-specific pronunciation,
0:03:16so that in the output
0:03:17we can get these phonetic transformations, these phonetic rules,
0:03:22that tell you how the dialects are different.
0:03:24So for example in this case
0:03:27we see that r is deleted when it's followed by a consonant.
0:03:31In addition, we can quantify the occurrence frequency and know how often this happens.
0:03:38That kind of information is extremely important for forensic phoneticians,
0:03:42which is one of the big motivations behind our work.
0:03:48So before I go into more of the details of our proposed model,
0:03:52I'd like to formally introduce what I mean by phonetic transformation, because
0:03:58we will be characterising dialect differences using phonetic transformations.
0:04:03So we
0:04:05represent a word in the reference dialect as reference phones,
0:04:11and in the dialect of interest we represent the pronunciation as surface phones.
0:04:16The mapping between the reference phones and the surface phones is what we call a phonetic transformation.
0:04:22So for example,
0:04:24if we're given the word
0:04:25"bath"
0:04:26and choose General American English
0:04:29as the reference dialect
0:04:32and British English as the dialect of interest,
0:04:35now we have the reference phones and surface phones of the word "bath".
0:04:39And here you see
0:04:41[ae] in the reference phones is mapped to [aa] in the surface phones.
0:04:45So this is an example of
0:04:47a substitution, which is one kind of phonetic transformation.
0:04:51There are two other kinds,
0:04:53which will be shown in a later slide, so I'll say
0:04:56more about them then.
0:04:57But this is what I mean by phonetic transformation.
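The mapping between reference phones and surface phones can be illustrated with a small alignment sketch. Note the swap: the talk aligns phones with a hidden Markov model, whereas this example uses a plain Levenshtein (edit distance) alignment as a stand-in. The function name `align` and the ARPAbet-like phone labels are assumptions made for illustration only.

```python
def align(ref, surf):
    """Align reference phones to surface phones with unit-cost edit distance.

    Returns a list of (operation, reference_phone, surface_phone) tuples,
    where operation is "match", "substitution", "deletion" or "insertion".
    This is a stand-in for the HMM alignment described in the talk.
    """
    n, m = len(ref), len(surf)
    # cost[i][j] = minimum number of edits turning ref[:i] into surf[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != surf[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the bottom-right corner to recover the operations.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != surf[j - 1])):
            op = "match" if ref[i - 1] == surf[j - 1] else "substitution"
            ops.append((op, ref[i - 1], surf[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("deletion", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("insertion", None, surf[j - 1]))
            j -= 1
    return ops[::-1]


# Hypothetical phone strings for "bath" (reference) and its surface form.
bath = align(["b", "ae", "th"], ["b", "aa", "th"])
```

In this toy case the middle phone comes out as a substitution of [ae] by [aa], which is the kind of transformation described above.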
0:05:02And now
0:05:02on to our proposed model.
0:05:06We want to make
0:05:08a model whose parameters express these phonetic transformations.
0:05:13It is called the phonetic pronunciation model,
0:05:16and
0:05:18we want to answer the following questions using this model.
0:05:22So first,
0:05:23when comparing a dialect to a reference dialect,
0:05:28what kinds of phonetic transformations occur:
0:05:31substitutions,
0:05:32insertions or deletions?
0:05:34And if they occur, do they occur only in certain phonetic contexts?
0:05:41And
0:05:42how often do they occur?
0:05:43So to answer these questions we have two parts in
0:05:48our model. The first is
0:05:50a hidden Markov model,
0:05:52and we use that to help us automatically align the reference phones with the surface phones.
0:05:57The second part is
0:05:59decision tree clustering, which helps us generalise the phonetic rules.
0:06:06So here is the slide where
0:06:08I show the three
0:06:09kinds of phonetic transformations, each with an example.
0:06:13In the examples,
0:06:16American English is the reference dialect and British English is
0:06:21the dialect of interest.
0:06:24The first
0:06:26case is the substitution of [ae]: in American English it's pronounced as "bath", and in
0:06:33British English it would sound like "bahth".
0:06:35The second is the deletion example, where
0:06:39r is followed by a consonant. So the American English
0:06:43"part"
0:06:44would sound something like "paht"
0:06:46in British English.
0:06:47And the third
0:06:49example of phonetic transformations is insertions.
0:06:52Here, compared with General American English, what happens is that when a word ends with a vowel and the
0:06:57word following it
0:06:59starts with a vowel,
0:07:02an r might be inserted in between
0:07:06when it's a British English speaker.
0:07:09So the phrase "saw a film" would sound more like "saw-r a film".
0:07:15So these are some examples of the phonetic transformations,
0:07:19and in the following slides I will illustrate how these examples fit into our proposed HMM network.
0:07:28So here is a traditional HMM network,
0:07:32where the circles represent the states and the squares represent the observations,
0:07:36and the arrows are the state transitions.
0:07:40This is a trivial case where
0:07:42the reference phones and the surface phones are the same, so there are no dialect differences.
0:07:47And this is the case of a substitution,
0:07:49where
0:07:50one phone is replaced by another,
0:07:51and in this case the traditional HMM network can handle it adequately.
0:07:57However,
0:07:58what about an insertion? If we have an insertion of r here, we see that this surface phone
0:08:04does not have any corresponding state
0:08:08to map to.
0:08:09So our solution is that now we have a one-to-two mapping between the reference phones and the states.
0:08:17Each reference phone
0:08:19is represented by
0:08:21two
0:08:22states. The first one is the white circle,
0:08:24which indicates a normal state,
0:08:27and it is followed by an insertion state, the green circle.
0:08:31So now you see that the observation
0:08:36has a corresponding state to be mapped to.
0:08:41In addition, we also further categorise our state transitions
0:08:46according to the types of
0:08:49phonetic transformations.
0:08:50So now, if
0:08:52a state transition is entering an insertion state, like the red arrow here in the graph,
0:08:58then we call it an insertion state transition.
0:09:03Okay, so we can handle the case of insertions.
0:09:06How about deletions then?
0:09:09Here we see the example where
0:09:12this state
0:09:13r has no corresponding surface phone or observation.
0:09:17To solve this problem we introduce a deletion state transition,
0:09:22which skips a normal state.
0:09:24So in this case
0:09:26the state r is skipped,
0:09:27so it no longer needs to be mapped to an observation.
0:09:32So these are some of the highlights of the differences between our proposed HMM network
0:09:37and the traditional one, which help us model the phonetic transformations more explicitly and in a richer way.
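The state layout described here can be sketched as a small data structure. This is a minimal illustration and not the authors' implementation: the function name `build_topology`, the tuple encoding of states, and the exact set of allowed transitions are assumptions. It leaves out start and end states, self-loops, and all probabilities.

```python
def build_topology(ref_phones):
    """Build a left-to-right network with typed state transitions.

    Each reference phone gets two states: a normal state followed by an
    insertion state. Transitions are labelled "normal", "insertion" or
    "deletion". Returns (states, transitions), where transitions maps
    (source_state, destination_state) to the transition type.
    """
    states = []
    for idx, phone in enumerate(ref_phones):
        states.append((idx, phone, "normal"))
        states.append((idx, phone, "insertion"))

    transitions = {}
    last = len(ref_phones) - 1
    for idx, phone in enumerate(ref_phones):
        normal = (idx, phone, "normal")
        inserted = (idx, phone, "insertion")
        # Entering an insertion state is an insertion state transition.
        transitions[(normal, inserted)] = "insertion"
        for source in (normal, inserted):
            if idx < last:
                # Moving on to the next phone's normal state.
                nxt = (idx + 1, ref_phones[idx + 1], "normal")
                transitions[(source, nxt)] = "normal"
            if idx + 1 < last:
                # Skipping the next normal state models a deletion.
                skip = (idx + 2, ref_phones[idx + 2], "normal")
                transitions[(source, skip)] = "deletion"
    return states, transitions


# Hypothetical reference phones for "part"; the r state can be skipped.
states, transitions = build_topology(["p", "aa", "r", "t"])
```

With this layout, the path from the normal state of "aa" straight to the normal state of "t" is labelled as a deletion, so the skipped r never has to be mapped to an observation.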
0:09:45Now,
0:09:45after training our HMM system using triphones,
0:09:49we could find rules like these on the right.
0:09:53So for example,
0:09:55[ae] becomes [aa] when it's followed by [th],
0:09:58so "bath" becomes "bahth".
0:10:01Also, [ae]
0:10:01becomes [aa] when it's followed by [s],
0:10:04so "class" becomes "clahss".
0:10:06And
0:10:07another example:
0:10:09[ae] becomes [aa] when it's followed by [f],
0:10:11so "laugh" becomes "lahf".
0:10:14The question here one asks is:
0:10:17are these observed rules
0:10:19actually originating from a more general underlying rule?
0:10:24And if so, how can we find it?
0:10:27Here we use decision tree clustering to help us.
0:10:31From the results of decision tree clustering,
0:10:34we can find, by clustering
0:10:37these observed rules, an underlying rule.
0:10:40Here the underlying rule we found was that
0:10:43when [ae] is followed by a voiceless fricative, the phonetic transformation of [ae] to [aa]
0:10:49will occur.
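How several context-specific rules collapse into one general rule can be illustrated with a single decision-tree split. This is only a sketch under stated assumptions: it picks the best phonetic-class question by information gain, which is one common splitting criterion and not necessarily the one used in the talk. The phone classes and the toy samples are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels).values()
    return -sum((c / total) * math.log2(c / total) for c in counts)


def best_question(samples, questions):
    """Pick the context question with the highest information gain.

    samples: list of (right_context_phone, surface_phone) pairs for one
             reference phone.
    questions: dict mapping a question name to the set of phones for
               which the answer is "yes".
    Returns (question_name, gain).
    """
    base = entropy([surface for _, surface in samples])
    best = None
    for name, members in questions.items():
        yes = [s for ctx, s in samples if ctx in members]
        no = [s for ctx, s in samples if ctx not in members]
        remainder = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(samples)
        gain = base - remainder
        if best is None or gain > best[1]:
            best = (name, gain)
    return best


# Toy observations for reference phone [ae]: (right context, surface phone).
samples = [("th", "aa"), ("s", "aa"), ("f", "aa"),
           ("t", "ae"), ("k", "ae"), ("n", "ae"), ("z", "ae"), ("d", "ae")]
# Hypothetical phonetic-class questions about the right context.
questions = {
    "voiceless_fricative": {"f", "th", "s", "sh"},
    "voiceless": {"p", "t", "k", "f", "th", "s", "sh"},
    "fricative": {"f", "th", "s", "sh", "v", "dh", "z", "zh"},
    "nasal": {"m", "n", "ng"},
}
name, gain = best_question(samples, questions)
```

On this toy data the "voiceless_fricative" question separates the two outcomes perfectly, which mirrors how the three triphone rules are generalised into one rule about voiceless fricatives.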
0:10:54So I just talked about the highlights of our model, and now
0:10:58we're going into the evaluation stage.
0:11:00We've done a series of experiments,
0:11:04and because of the time constraint I will not be able to share all of this information.
0:11:08So the dialect recognition task
0:11:11will not be talked about, but you can read a lot of the details in our paper.
0:11:18I'll be focusing on the other two. The first one is the pronunciation generation experiment,
0:11:23where
0:11:25basically we assess the ability of the model by seeing how well it can convert one pronunciation into
0:11:31another dialect's pronunciation.
0:11:35The data we used is
0:11:38an Arabic database. It has five different Arabic dialect regions:
0:11:43the UAE,
0:11:44Egypt,
0:11:45Iraq,
0:11:46Palestine and Syria.
0:11:48They are all conversational telephone speech,
0:11:51and here we chose Iraqi as the reference dialect.
0:11:55In this table you can see the data partition for our experiment.
0:12:01So in
0:12:03this experiment the assumption is: if we trained a
0:12:06pronunciation model well, so that it has learned these phonetic rules across dialects correctly,
0:12:12then the model should be able to convert
0:12:14the reference
0:12:16phones into the other dialects' surface phones
0:12:19very well.
0:12:20So here, after we train
0:12:23a phonetic pronunciation model,
0:12:26we give it the
0:12:28reference phones of the test set,
0:12:30and
0:12:32it will generate the most likely surface phones of the other Arabic dialects.
0:12:37By comparing these surface phones
0:12:40that were generated
0:12:41to the ground truth surface phones, we can see how well our model was converting
0:12:47one dialect's pronunciation to another.
0:12:51And here are the results. The orange bar is the monophone version of the pronunciation model
0:12:58and the blue one is the decision tree pronunciation model.
0:13:02We see here that the decision
0:13:04tree
0:13:04helps improve the recovery rate by one point seven percent relative,
0:13:10meaning that the decision tree results help us
0:13:14convert these pronunciations better.
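For readers less used to the term, a relative improvement is measured against the baseline score, not as a difference in percentage points. The symbols below are assumptions introduced for illustration: R_mono is the recovery rate of the monophone model and R_tree is that of the decision tree model. The talk gives only the relative figure, not the two rates themselves.

```latex
\text{relative improvement}
  = \frac{R_{\text{tree}} - R_{\text{mono}}}{R_{\text{mono}}} \times 100\%
```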
0:13:18Here I'd like to mention a side note: we also did a lot of error
0:13:23analysis and found that there are word usage differences across Arabic dialects,
0:13:29and this can potentially complicate the evaluation of our
0:13:33system.
0:13:35Therefore,
0:13:36we also did the same experiment
0:13:38using a phonetic pronunciation model on multiple English corpora, without these word usage differences that would cause complications,
0:13:47and the results are very good.
0:13:49Unfortunately I cannot share these with you today because they will be covered at
0:13:54Interspeech,
0:13:55but that means you should all come to my talk at Interspeech as well.
0:14:01So the second
0:14:02evaluation is an evaluation of the rules, where we compare the learned rules against the ones
0:14:09in the linguistic literature.
0:14:11Here on the left you see the linguistic
0:14:14descriptions of the four Arabic dialects,
0:14:16which are from the literature.
0:14:18On the right you see the learned rules from our proposed system.
0:14:22And
0:14:24you can see that the learned rules from our proposed system actually
0:14:28correspond with these linguistic descriptions,
0:14:31and furthermore they can sometimes refine the phonetic context in which these rules occur.
0:14:39Most importantly,
0:14:41we can also quantify
0:14:43the occurrence
0:14:44frequencies of these rules given the phonetic context.
0:14:48This information is very important
0:14:51for forensic phoneticians,
0:14:54but is rarely documented
0:14:56in the literature.
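Quantifying how often a rule applies in a given phonetic context comes down to a conditional relative frequency. The sketch below assumes aligned events are already available as (reference phone, context, operation) triples. The function name `rule_frequencies` and the toy counts are hypothetical and not taken from the talk.

```python
from collections import Counter


def rule_frequencies(events):
    """Estimate how often each operation occurs in each context.

    events: iterable of (ref_phone, context, operation) triples, for
            example ("r", "before_consonant", "deletion").
    Returns a dict mapping each triple to its relative frequency among
    all events sharing the same (ref_phone, context).
    """
    context_counts = Counter()
    rule_counts = Counter()
    for ref_phone, context, operation in events:
        context_counts[(ref_phone, context)] += 1
        rule_counts[(ref_phone, context, operation)] += 1
    return {
        rule: count / context_counts[rule[:2]]
        for rule, count in rule_counts.items()
    }


# Toy data: r before a consonant is deleted in 3 of 4 observed cases.
events = [("r", "before_consonant", "deletion")] * 3 + [("r", "before_consonant", "match")]
freqs = rule_frequencies(events)
```

Here `freqs[("r", "before_consonant", "deletion")]` is 0.75, the kind of per-context figure the talk says is rarely documented in the literature.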
0:14:58I'd like to conclude my talk by talking about the contributions of this work.
0:15:03Here we propose an automatic yet informative approach to analysing dialects,
0:15:09and we call that informative dialect recognition.
0:15:13We use a mathematical framework to characterise phonetic transformations
0:15:17across dialects
0:15:18in a very explicit manner in order to learn these rules.
0:15:22Our proposed system is able to postulate rules
0:15:26from large corpora to discover,
0:15:28refine and quantify dialect-specific rules.
0:15:34So
0:15:35if people have questions or issues that they would like to ask me about the talk, I would be happy
0:15:40to answer them.
0:15:42[audience question, inaudible]
0:18:28Thank you.
0:18:29So I don't know if I can remember all of them to respond to, but
0:18:34on that first point, yes: our system
0:18:39is also able to pick up these
0:18:42pronunciation differences that may not actually be a phonetic rule,
0:18:46but r existing or not existing is one of them.
0:18:49And
0:18:51John Wells
0:18:52has established a lot of very good literature on dialect differences, and actually I'll
0:18:59be using a lot of that in my next talk.
0:19:03So that is what you could be
0:19:05looking forward to.
0:19:07You mentioned something else about the reference dialect. In the selection of the reference dialect there are
0:19:13both engineering and linguistic considerations.
0:19:19From the linguistic side,
0:19:23I made some decisions, such as that I would not want to use Egyptian Arabic as
0:19:28the reference dialect, because it seems like the native speakers of Arabic that I know usually
0:19:34know how their dialect is different
0:19:36from the Egyptian dialect. And so,
0:19:38since I don't really understand Arabic and have to ask them to help me assess
0:19:41whether the model or the system is going in the right direction, it will be easier for them
0:19:46to tell me if
0:19:49these phonetic transformations are occurring if the Egyptian one is not the reference.
0:19:53And then for Palestine and Syria,
0:19:56they both belong to the Levantine Arabic family,
0:20:00so I was more reluctant to use them as references, because since they are more closely related,
0:20:07if I use Palestinian then I may not be able to see the Syrian differences very easily,
0:20:13and in the initial
0:20:15establishment of
0:20:17the system it might be better to have more dialect differences.
0:20:21And finally, from the engineering perspective, we actually have a lot more Iraqi data that we can train systems
0:20:28on. And so
0:20:30that was the reason why we chose Iraqi. This was a very difficult
0:20:35decision, but I think it worked out okay in this case.
0:20:39So are there any other questions?
0:20:42No? Okay.