0:00:15 | thanks karen it's very nice to be here to talk about some of |
---|
0:00:18 | the low resource |
---|
0:00:20 | work we've been doing in our group |
---|
0:00:23 | this research that i'll talk about today involves the |
---|
0:00:28 | work of several students in our group that i list here |
---|
0:00:34 | there's been a lot of talk just |
---|
0:00:35 | today about the low resource issue with good descriptions so i won't belabour the point |
---|
0:00:42 | i think my perspective on the problem is that |
---|
0:00:46 | current |
---|
0:00:47 | speech recognition technology will benefit by incorporating more unsupervised learning |
---|
0:00:53 | ideas |
---|
0:00:54 | and this schematic shows the |
---|
0:00:59 | sort of a range of increasingly difficult tasks that we could imagine |
---|
0:01:05 | starting in the upper left |
---|
0:01:07 | from the conventional asr approach that has |
---|
0:01:11 | annotated resources a pronunciation dictionary and units |
---|
0:01:14 | to scenarios that have fewer resources associated with them |
---|
0:01:19 | parallel annotated speech and text independent speech and text all the way down to |
---|
0:01:23 | just having speech |
---|
0:01:25 | what can you do with that as would be the case for example for an |
---|
0:01:29 | oral language that we were talking about earlier this morning |
---|
0:01:33 | i think |
---|
0:01:34 | it's a challenging problem but if we start to look at some of these ideas |
---|
0:01:37 | i think there will be a few benefits first of all i think it's a really |
---|
0:01:41 | interesting problem and we will learn a lot just by trying to do it |
---|
0:01:45 | second i think it will ultimately enable more speech recognition for |
---|
0:01:50 | larger numbers of languages in the world and |
---|
0:01:53 | it has the potential ideally to complement existing techniques |
---|
0:01:57 | and so even benefit languages that are |
---|
0:01:59 | already quite successful with conventional techniques |
---|
0:02:03 | so in the time i have today i was gonna talk about two research areas |
---|
0:02:07 | that we've been |
---|
0:02:09 | exploring in our group the first one is the |
---|
0:02:13 | speech pattern discovery method and the various problems we've applied it to |
---|
0:02:18 | and this is an example of the zero resource scenario that could work potentially on |
---|
0:02:24 | any language in the world just with the body of speech |
---|
0:02:27 | so that has a certain appeal to it |
---|
0:02:31 | but there's no specific model that we learn from that so a newer line of |
---|
0:02:36 | research that we've been starting |
---|
0:02:38 | to do is exploring |
---|
0:02:41 | methods to learn models of speech units and pronunciations |
---|
0:02:47 | from either speech or when there are some limited resources available and we're using a |
---|
0:02:53 | joint modeling framework to do this i think even though it's still very early |
---|
0:02:57 | days for this work i believe it's quite promising so these are the two things |
---|
0:03:03 | i'll |
---|
0:03:03 | touch on |
---|
0:03:06 | and it'll be a fairly high level overview because i don't have a lot of time but hopefully |
---|
0:03:09 | you'll get the idea of some of the things we're trying to do |
---|
0:03:12 | so the speech pattern work |
---|
0:03:15 | it was actually motivated in part by humans you know infants people like jenny |
---|
0:03:21 | saffran have shown that |
---|
0:03:23 | just exposing infants to a short amount of nonsense syllables they very quickly learn where |
---|
0:03:30 | the word boundaries are what they've seen before and what they haven't so we thought |
---|
0:03:36 | why can't we try to apply some of these ideas if we have a large |
---|
0:03:40 | body of speech |
---|
0:03:41 | but we don't have |
---|
0:03:43 | say all the conventional paraphernalia that goes with conventional asr |
---|
0:03:48 | but maybe we have a lot of speech so we could throw all that out |
---|
0:03:51 | and look for repeating occurrences |
---|
0:03:55 | instances of things like words and there might be some interesting things that we could |
---|
0:04:00 | do if we achieve that capability |
---|
0:04:04 | so to describe it |
---|
0:04:07 | in terms of some speech the idea was if we had say different chunks |
---|
0:04:12 | of audio can we find the recurring words these are two utterances with the |
---|
0:04:18 | word schizophrenia in them |
---|
0:04:19 | can we develop a method to establish that those two were the same thing |
---|
0:04:25 | and the approach that we took |
---|
0:04:28 | fairly common sense i think is |
---|
0:04:31 | if you have a large body of audio we would compare every piece of |
---|
0:04:34 | audio with every other piece of audio |
---|
0:04:37 | and so for those two utterances we would compute a distance matrix which would represent |
---|
0:04:42 | point by point distances |
---|
0:04:45 | and the idea is that when two spectral frames are the same there will |
---|
0:04:49 | be of course a low distance and when they are very dissimilar a high |
---|
0:04:53 | distance |
---|
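to make the distance-matrix step just described concrete, here is an illustrative sketch by the editor (not the group's actual code; the cosine distance and toy feature vectors are assumptions, and any frame representation such as mfccs or posteriorgrams would do):

```python
# Sketch: point-by-point distances between every frame of two utterances.
import math

def cosine_distance(x, y):
    """1 - cosine similarity; near 0 for similar frames, near 1 for dissimilar."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    if nx == 0 or ny == 0:
        return 1.0
    return 1.0 - dot / (nx * ny)

def distance_matrix(utt1, utt2):
    """Matrix D with D[i][j] = distance between frame i of utt1 and frame j of utt2."""
    return [[cosine_distance(f1, f2) for f2 in utt2] for f1 in utt1]

# two toy "utterances" (3 and 2 frames of 2-dim features):
u1 = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
u2 = [[1.0, 0.0], [0.0, 1.0]]
D = distance_matrix(u1, u2)
# identical frames give distance ~0, orthogonal frames ~1
```

a repeated word then shows up as a diagonal run of low values in D, which is exactly what the search described next looks for.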
0:04:55 | the representations |
---|
0:04:56 | that we use have varied over time we started off just using |
---|
0:05:00 | whitened mfccs which would work fine if it's all the same speaker |
---|
0:05:04 | we then went to unsupervised posteriorgram representations based on both |
---|
0:05:10 | posteriors from gaussian mixture models and also some |
---|
0:05:14 | dnns trained in an unsupervised way we've done some stuff with herb's self-organizing units as |
---|
0:05:19 | well representing them as a posteriorgram it really doesn't matter what you use |
---|
0:05:23 | the interesting thing is when we all look at that picture |
---|
0:05:26 | i think most of us can see right away that oh you know there |
---|
0:05:30 | is a diagonal sort of a |
---|
0:05:32 | bunch of low distances that's where the repeating pattern is so that's |
---|
0:05:36 | what we wanna do try and find that automatically |
---|
0:05:39 | and we developed a |
---|
0:05:41 | variation of |
---|
0:05:42 | dynamic time warping we call segmental dynamic time warping that basically just consisted |
---|
0:05:48 | of striping all the way through the audio corpus so that you would eventually |
---|
0:05:53 | compare every piece with every other piece and the idea was that |
---|
0:05:57 | the warping path you were on would eventually snap onto |
---|
0:06:02 | that alignment as you passed over it |
---|
0:06:05 | we call this little region of the alignment path a fragment and the weight in that |
---|
0:06:11 | region is sort of the point by point distance and that's what we're trying to find |
---|
0:06:16 | so there's different ways to do this but this particular illustration shows the point by |
---|
0:06:21 | point alignment of the two stripes of the two pieces of the utterances here |
---|
0:06:26 | and this is the distortion as a function of the frame-by-frame distances and here of course is |
---|
0:06:32 | where the overlapping |
---|
0:06:34 | word schizophrenia is so we want to look |
---|
0:06:37 | along that warping path and |
---|
0:06:39 | establish some mechanism to try and find a low distortion region |
---|
0:06:43 | we were looking for things that were at least half a second long it turns out the |
---|
0:06:47 | longer you constrain yourself the better this idea works |
---|
0:06:51 | so like boston red sox would work really well as sort of an expression |
---|
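as a rough sketch of that fragment search, here is hypothetical editor-written code: the real method warps time, while this simplified version only scans fixed diagonals of a distance matrix for the lowest-average-distortion run of at least a minimum length (the analogue of the half-second constraint):

```python
# Sketch: find the lowest-distortion diagonal fragment of length >= min_len.
def best_diagonal_fragment(D, min_len):
    """Return (avg_distortion, i_start, j_start, length) of the best fragment."""
    n, m = len(D), len(D[0])
    best = None
    for offset in range(-(n - 1), m):          # each diagonal of D
        cells = [(i, i + offset) for i in range(n) if 0 <= i + offset < m]
        for s in range(len(cells) - min_len + 1):
            for e in range(s + min_len, len(cells) + 1):
                run = cells[s:e]
                avg = sum(D[i][j] for i, j in run) / len(run)
                if best is None or avg < best[0]:
                    i0, j0 = run[0]
                    best = (avg, i0, j0, len(run))
    return best

# toy distance matrix with a zero-distortion diagonal starting at (1, 0)
D = [[1.0, 1.0, 1.0],
     [0.0, 1.0, 1.0],
     [1.0, 0.0, 1.0],
     [1.0, 1.0, 0.0]]
best = best_diagonal_fragment(D, 3)
```

the longer-is-better observation in the talk corresponds to raising `min_len`, which makes spurious low-distortion runs much less likely.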
0:06:57 | and we extended it a little bit and when we do this |
---|
0:07:02 | what we produce are aligned fragments |
---|
0:07:05 | now |
---|
0:07:07 | people have modified this basic idea "'cause" this is computationally fairly intensive we've actually done |
---|
0:07:12 | some stuff to do |
---|
0:07:16 | approximations that are guaranteed to be admissible but other people like aren |
---|
0:07:21 | jansen at jhu have done some really nice work using some |
---|
0:07:25 | image processing concepts to significantly reduce the amount of computation involved |
---|
0:07:32 | although it turns out i think that using the s-dtw idea that i described |
---|
0:07:37 | is not a bad way to refine the initial matches |
---|
0:07:41 | so when you do this what happens is you end up with pairs of utterances |
---|
0:07:47 | and the things in red are example matches low distortion matches that are found in |
---|
0:07:52 | your corpus and you can see that |
---|
0:07:55 | depending on the parameters like the width that you pick we were sort of aiming |
---|
0:07:59 | for word level ideas which was why we picked the half second constraint |
---|
0:08:02 | and sometimes it's a word sometimes it's multiple words |
---|
0:08:06 | sometimes it's a fragment of a word |
---|
0:08:08 | sometimes it's something similar but not the same thing |
---|
0:08:11 | this is the type of thing that you get out |
---|
0:08:17 | the interesting question then is once you have all these pairwise matches for your corpus |
---|
0:08:23 | you'd like to try to establish |
---|
0:08:25 | what things are the same underlying word |
---|
0:08:28 | and so you have to go to some sort of clustering |
---|
0:08:32 | notion this is what we call speech pattern discovery and we try to represent |
---|
0:08:37 | all of these pairwise matches in a graphical structure |
---|
0:08:40 | that then we could do clustering on |
---|
0:08:44 | and so when you do that |
---|
0:08:48 | i'll describe how we define the vertices in the graph in a second but if |
---|
0:08:53 | you do this each region |
---|
0:08:56 | corresponds to a vertex |
---|
0:08:58 | in the graph and each match corresponds to an arc |
---|
0:09:02 | where the edges correspond to connections between the regions and then you can try to do |
---|
0:09:06 | clustering |
---|
0:09:08 | naturally of course in the real world |
---|
0:09:11 | these clusters are inevitably connected in some capacity since the matches are |
---|
0:09:16 | imperfect but then you can apply your favourite clustering algorithm to try and find |
---|
0:09:21 | densely connected regions in the graph and that's in fact what we did |
---|
0:09:26 | i'll just very briefly show you one way that we did it there are |
---|
0:09:31 | many ways to define the vertices in the graph |
---|
0:09:36 | this illustration is sort of showing all the example pairwise matches so each little rectangle |
---|
0:09:43 | corresponds to a match and the colour |
---|
0:09:46 | means it's the same match so the blue rectangle for example is a region where |
---|
0:09:51 | we think the word matches to something else that was said over here |
---|
0:09:54 | okay so different colours mean different matches |
---|
0:09:58 | well if you actually look at what's going on at any point in time |
---|
0:10:03 | it's messier than that because you potentially have a whole lot of matches because each |
---|
0:10:07 | match is done independently the start times and end times are all gonna be probably |
---|
0:10:11 | different |
---|
0:10:13 | but what we did is we summarized |
---|
0:10:15 | that collection of matches by just summing up the similarities as a function of time |
---|
0:10:20 | and so you consider that sum and what you get |
---|
0:10:25 | is something time-varying that has local maxima |
---|
0:10:29 | and we defined the local maxima as places of interest |
---|
0:10:32 | where a lot of similarity matches were occurring |
---|
0:10:35 | and so we use those places to define nodes or vertices in our |
---|
0:10:40 | graph and then |
---|
0:10:41 | once you define the nodes |
---|
0:10:44 | the |
---|
0:10:46 | matched pairs that you have that overlap the nodes define |
---|
0:10:51 | the edges in your graph |
---|
0:10:53 | so for example the blue pair went from node one |
---|
0:10:58 | to node eight so you'd make this connection in your graph |
---|
0:11:02 | and you can do that for all of the |
---|
0:11:05 | matches that you have that are low distortion and so this is how you can |
---|
0:11:07 | construct your graph |
---|
0:11:09 | and then as i said you can apply a clustering algorithm to |
---|
0:11:14 | make chunks out of that to define clusters |
---|
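to make the graph step concrete, here is a hypothetical editor-written sketch; the node numbering echoes the "blue pair from node one to node eight" example, and plain connected components stand in for the densely-connected-subgraph clustering the talk actually describes:

```python
# Sketch: nodes come from similarity peaks, edges from low-distortion
# matched pairs; clusters are found by graph traversal.
from collections import defaultdict

def cluster_matches(edges):
    """edges: iterable of (node_a, node_b) pairs; returns a list of clusters (sets)."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, clusters = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:                      # depth-first traversal
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters

# e.g. the "blue pair" linking node 1 to node 8, plus two more matches:
clusters = cluster_matches([(1, 8), (8, 3), (5, 6)])
# two clusters emerge: {1, 3, 8} and {5, 6}
```

swapping in a community-detection algorithm here would recover denser regions instead of whole components, which matters once imperfect matches weakly connect clusters.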
0:11:17 | so let me show you an example on a lecture that was recorded at mit |
---|
0:11:24 | and this is an example so we had four matches here at different places in |
---|
0:11:28 | the recording |
---|
0:11:30 | and there was this nice little cluster here that i'll show you and i'll hopefully play |
---|
0:11:34 | you some examples it sounded like search engine optimized |
---|
0:11:40 | there were four things that were played at the same time |
---|
0:11:42 | but |
---|
0:11:43 | basically what this was this guy was talking about |
---|
0:11:47 | variations of search engine optimiser search engine optimising and |
---|
0:11:52 | it actually found the word optimiser |
---|
0:11:55 | to get the common acoustics in the cluster |
---|
0:11:59 | so this is an example of the type of thing that you get |
---|
0:12:02 | interestingly all of these words tended to |
---|
0:12:05 | occur near each other in the lecture so you can actually |
---|
0:12:08 | we've done this work i'm not gonna talk about it where you actually do topic |
---|
0:12:12 | segmentation based on the time-varying nature of these clusters over the course of a long audio |
---|
0:12:17 | recording |
---|
0:12:19 | i can show you some other examples we've tried it on different languages this is a |
---|
0:12:22 | lebanese |
---|
0:12:24 | interview that we recorded that actually has two people talking |
---|
0:12:27 | one of them is using a lebanese dialect levantine arabic and one of them is |
---|
0:12:31 | talking in msa |
---|
0:12:33 | the algorithm doesn't care it doesn't know it's oblivious it's just looking for |
---|
0:12:38 | things that look like they're the same |
---|
0:12:40 | and so here's the cluster that |
---|
0:12:42 | it got |
---|
0:12:47 | here's another one for a mandarin lecture that we applied it to |
---|
0:12:53 | and you get the idea it's defining these acoustic chunks and they're sort of |
---|
0:12:58 | the same |
---|
0:12:59 | thing |
---|
0:13:00 | now when you do it over a single large body of audio like |
---|
0:13:05 | a not sure you'll get a bunch of these different clusters |
---|
0:13:08 | and that's interesting when you look underlying lee at what the identity of the cluster |
---|
0:13:12 | is |
---|
0:13:15 | you can see that a lot of the terms are important content words in fact when |
---|
0:13:19 | we did |
---|
0:13:21 | this study |
---|
0:13:22 | we were finding about eighty five percent of the top twenty tf-idf terms |
---|
0:13:27 | on lectures so it's an indication that |
---|
0:13:30 | the clusters are finding potentially useful kinds |
---|
0:13:33 | of information and i guess one of the motivations for this is that if there's a word |
---|
0:13:38 | that's important |
---|
0:13:39 | in a conversation or a lecture it'll probably be said multiple times and that gives |
---|
0:13:44 | us a chance to find it it's not always the case |
---|
0:13:47 | but we need that for this type of technique to work |
---|
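the tf-idf ranking that the discovered clusters were compared against can be sketched as follows (an illustrative, assumption-laden version by the editor; the study's actual tokenization and weighting scheme are not specified in the talk):

```python
# Sketch: rank each document's terms by tf-idf; common words like "the"
# get idf 0 when they appear in every document.
import math
from collections import Counter

def top_tfidf_terms(documents, k):
    """documents: list of token lists; returns the top-k terms per document."""
    df = Counter()
    for doc in documents:
        df.update(set(doc))            # document frequency of each term
    n = len(documents)
    ranked = []
    for doc in documents:
        tf = Counter(doc)
        scores = {t: tf[t] / len(doc) * math.log(n / df[t]) for t in tf}
        ranked.append(sorted(scores, key=scores.get, reverse=True)[:k])
    return ranked

docs = [["wage", "wage", "minimum", "the"],
        ["computer", "school", "the", "the"]]
top = top_tfidf_terms(docs, 2)
# "the" appears in both documents, so its idf is 0 and it never ranks
```

the eighty-five percent figure in the talk would then be the overlap between lists like these and the unsupervised cluster identities.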
0:13:52 | now one of the things that we've done recently is |
---|
0:13:56 | in addition to this was in one particular a document you can look at the |
---|
0:14:00 | relationship between these unsupervised patterns across different documents |
---|
0:14:05 | and across documents like |
---|
0:14:08 | she was talking about with topic id |
---|
0:14:10 | we can do unsupervised topic clustering |
---|
0:14:13 | based on the relationship of these unsupervised words across different documents |
---|
0:14:20 | so |
---|
0:14:21 | just to visualise that a little bit here each of these grey |
---|
0:14:26 | rectangles is a different document and the darker grey rectangles are |
---|
0:14:31 | speech patterns that we found |
---|
0:14:33 | in the unsupervised way and then the connections are just |
---|
0:14:39 | places where they connected to each other with a low distortion match |
---|
0:14:42 | and say for example |
---|
0:14:46 | you know you have this type of distribution of your clusters then you |
---|
0:14:50 | might want to say well these two |
---|
0:14:52 | clusters on the right are in the same class because of the connections between those unsupervised terms |
---|
0:14:57 | that we found |
---|
0:14:58 | and the three on the left are in the same class again this is all done |
---|
0:15:02 | unsupervised |
---|
0:15:04 | so to do this we tried a couple of different methods but the one that was the |
---|
0:15:09 | most successful |
---|
0:15:11 | used a latent model for |
---|
0:15:15 | topics and words |
---|
0:15:17 | and that's just the plate notation on the right |
---|
0:15:20 | but the observed variables of course were the documents and then what we call |
---|
0:15:24 | a link structure |
---|
0:15:26 | which we define |
---|
0:15:30 | a link |
---|
0:15:31 | as the connections for each interval in a document that we found the link structure |
---|
0:15:37 | is just the set of connections to all the other patterns that were made in |
---|
0:15:42 | all the other different documents |
---|
0:15:44 | so |
---|
0:15:44 | the latent variable words has a certain distribution |
---|
0:15:49 | of links |
---|
0:15:50 | and the topics have a certain distribution |
---|
0:15:52 | of words and you can learn this model with an em-style |
---|
0:15:57 | algorithm and the interesting thing is we did some experiments on the fisher corpus |
---|
0:16:04 | sixty conversations |
---|
0:16:07 | spanning six different topics |
---|
0:16:09 | we seeded this with about thirteen hundred initial |
---|
0:16:15 | clusters and we did tell it to define six clusters so that was kind of |
---|
0:16:20 | cheating |
---|
0:16:20 | but |
---|
0:16:22 | these are the resulting clusters that we found |
---|
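the em-style learning can be illustrated with a toy model (hypothetical editor-written code: this is a plain plsa-style document/topic/word model standing in for the link-structure variant described in the talk, and all names and parameters here are the editor's own):

```python
# Sketch: EM for a latent topic model over documents of discovered patterns.
import random

def plsa(docs, vocab, n_topics, iters=50, seed=0):
    rng = random.Random(seed)
    def norm(v):
        s = sum(v)
        return [x / s for x in v]
    # randomly initialised p(z|d) and p(w|z), then normalised
    p_z_d = [norm([rng.random() for _ in range(n_topics)]) for _ in docs]
    p_w_z = [norm([rng.random() for _ in vocab]) for _ in range(n_topics)]
    widx = {w: i for i, w in enumerate(vocab)}
    for _ in range(iters):
        new_zd = [[1e-12] * n_topics for _ in docs]
        new_wz = [[1e-12] * len(vocab) for _ in range(n_topics)]
        for d, doc in enumerate(docs):
            for w in doc:
                i = widx[w]
                # E-step: posterior over topics for this (doc, word) pair
                post = norm([p_z_d[d][z] * p_w_z[z][i] for z in range(n_topics)])
                for z in range(n_topics):
                    new_zd[d][z] += post[z]
                    new_wz[z][i] += post[z]
        # M-step: renormalise the expected counts
        p_z_d = [norm(v) for v in new_zd]
        p_w_z = [norm(v) for v in new_wz]
    return p_z_d, p_w_z

docs = [["wage", "wage", "wage", "pay"],
        ["school", "school", "class", "school"]]
vocab = ["wage", "pay", "school", "class"]
doc_topics, topic_words = plsa(docs, vocab, n_topics=2)
```

in the talk's variant the "words" are themselves latent and the observations are the cross-document link structures, but the alternating expectation and renormalisation steps have the same shape.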
0:16:26 | and the interesting thing is when you look at the underlying speech patterns that are |
---|
0:16:30 | associated with these topics |
---|
0:16:34 | they make a little bit of sense actually which is nice so |
---|
0:16:38 | what you find is that there are relevant words and there are irrelevant words things |
---|
0:16:43 | that you might like to be |
---|
0:16:44 | in the stop word list if you were doing this with text so it'd be nice to get |
---|
0:16:48 | rid of them |
---|
0:16:50 | here let me show you some of the other ones so that's it the green one was |
---|
0:16:53 | on minimum wage |
---|
0:16:56 | the other one is on computers and education |
---|
0:17:02 | the purple one is kind of interesting |
---|
0:17:04 | when you look at the distribution of the true underlying topic labels |
---|
0:17:09 | some of them are pretty good |
---|
0:17:12 | you know the holidays one computers in education but a few in fact were split into two |
---|
0:17:17 | it's kinda |
---|
0:17:18 | intriguing to me that corporate conduct and illness were |
---|
0:17:22 | mapped into the same cluster maybe that's telling us something |
---|
0:17:25 | but you know it's really early days but you know there's a lot of things that |
---|
0:17:28 | you can potentially do |
---|
0:17:30 | with these kinds of unsupervised methods that are not conventional but she was showing some nice examples |
---|
0:17:35 | too and this is another one |
---|
0:17:39 | i wanna move on and talk about |
---|
0:17:42 | some of the newer work that we're doing |
---|
0:17:45 | where we're trying to learn models |
---|
0:17:48 | learn speech units and learn pronunciations |
---|
0:17:52 | really trying to get rid of the dictionary |
---|
0:17:56 | or at least develop methods to learn it |
---|
0:17:58 | you know it's interesting |
---|
0:18:02 | we pride ourselves on our ignorance models that we've developed with hmms modeling things we |
---|
0:18:07 | don't know about speech |
---|
0:18:08 | and yet the dictionary is still our crutch |
---|
0:18:11 | it's typically made by humans and hours and hours are spent |
---|
0:18:16 | tweaking these things getting rid of all the |
---|
0:18:19 | inconsistencies anybody who's worked on one knows it's hard work and it takes a long time mary can |
---|
0:18:24 | tell you the amount of effort that goes into making these dictionaries for the babel |
---|
0:18:29 | program it's not a trivial effort |
---|
0:18:31 | why is it we can't learn |
---|
0:18:33 | the units and learn the pronunciations automatically we do everything else |
---|
0:18:38 | you know i think it's time we look into this |
---|
0:18:40 | so this is the type of thing we're trying to do what can we do from |
---|
0:18:43 | speech |
---|
0:18:43 | or maybe if you have some text |
---|
0:18:46 | you know can that help you learn pronunciations so we're doing |
---|
0:18:50 | we're trying to do both of these things now in our work |
---|
0:18:54 | there's prior work in this area dating |
---|
0:18:56 | all the way back to the eighties when people were |
---|
0:18:59 | trying to do some |
---|
0:19:01 | acoustic-based approaches |
---|
0:19:04 | more recent work herb's is a good example with the self-organizing |
---|
0:19:09 | units |
---|
0:19:09 | and there's been other work that's come out of johns hopkins that is very interesting |
---|
0:19:13 | as well |
---|
0:19:15 | the approach that we've been taking is |
---|
0:19:19 | motivated in fact by |
---|
0:19:22 | more of a machine learning idea something that's becoming more popular in machine learning is |
---|
0:19:26 | the bayesian |
---|
0:19:28 | framework |
---|
0:19:29 | for inference and in particular sharon goldwater |
---|
0:19:33 | who's now at the university of edinburgh had a really nice paper |
---|
0:19:37 | on trying to learn word segmentation from phonetic transcriptions so it was symbolic input |
---|
0:19:43 | and the framework for that in our more recent work we were trying to |
---|
0:19:46 | do it from the waveform rather than |
---|
0:19:48 | phonetic transcriptions so we wanted to try and modify this model so we could learn from |
---|
0:19:53 | speech itself so that's what i'll talk about now |
---|
0:19:57 | and then we've recently extended it to try and learn word pronunciations as well |
---|
0:20:01 | so we all know what the challenges are if you're trying to learn what the speech |
---|
0:20:05 | units are first of all |
---|
0:20:06 | as the last questioner said we don't know how many units there are |
---|
0:20:11 | maybe they're sixty four maybe there's not |
---|
0:20:14 | and we don't know what they are and we don't know where they are |
---|
0:20:19 | so there are a lot of unknowns we're trying to figure out |
---|
0:20:23 | so |
---|
0:20:24 | what we're trying to do is given speech |
---|
0:20:28 | and in this case only speech |
---|
0:20:30 | discover the inventory of units and build a model for each of them |
---|
0:20:35 | as i said we're formulating this in a different kind of mathematical framework for the |
---|
0:20:40 | speech community where we have a set of latent variables |
---|
0:20:43 | that include the boundaries the units the segments and then there's the conventional hmm-gmm |
---|
0:20:50 | model we're using for each unit that i'll describe shortly |
---|
0:20:54 | and in this initial work we actually wanted to learn the number of units |
---|
0:20:58 | and to do this we were |
---|
0:21:01 | representing it with what's known as a chinese restaurant process or dirichlet process |
---|
0:21:08 | prior so there's a finite chance of generating a new unit |
---|
0:21:12 | every time you iterate |
---|
0:21:14 | through |
---|
0:21:15 | so let me walk you through this |
---|
0:21:19 | at a high level sort of channelling the generative story of how we go about |
---|
0:21:25 | generating an utterance |
---|
0:21:28 | with this hmm-gmm mixture |
---|
0:21:31 | that i mentioned so basically underlying this we have a set of K |
---|
0:21:37 | models and then |
---|
0:21:39 | for a particular set of frames one of them is selected |
---|
0:21:42 | and it generates |
---|
0:21:45 | a certain number of speech frames and then you'll transition to another one generates more |
---|
0:21:50 | frames |
---|
0:21:51 | et cetera et cetera as you go through the entire utterance |
---|
0:21:55 | so that sort of just describes our model here but what are the latent |
---|
0:21:59 | variables well first of all |
---|
0:22:01 | we don't know where the |
---|
0:22:03 | transitions are in the speech between one unit and another so the b's will be |
---|
0:22:08 | one set of |
---|
0:22:11 | latent variables |
---|
0:22:12 | we don't know what the labels are |
---|
0:22:15 | inventory of labels and as i have down below these c's in purple will also be |
---|
0:22:20 | a set of unknown variables |
---|
0:22:24 | we don't know of course the parameters of our hmm-gmm model so those will |
---|
0:22:30 | be unknown variables as well |
---|
0:22:32 | and lastly as i mentioned we don't know how many units there are |
---|
0:22:36 | so that will be an unknown as well and this last thing |
---|
0:22:40 | is what we're modeling with the dirichlet process |
---|
0:22:45 | so |
---|
0:22:46 | the learning procedure for this does the inference with gibbs sampling |
---|
0:22:52 | so it's an iterative process |
---|
0:22:55 | where we initially select |
---|
0:22:58 | values for some boundary variables i'll talk about that |
---|
0:23:03 | shortly but for now |
---|
0:23:05 | think of it as an initial segmentation and we have an initial prior distribution |
---|
0:23:10 | for the parameters that we have |
---|
0:23:13 | and then we go through our corpus |
---|
0:23:16 | one segment at a time where a segment is defined as the chunk of frames between |
---|
0:23:21 | boundaries and |
---|
0:23:22 | based on the posterior distribution we'll sample a value so for each segment we'll sample |
---|
0:23:28 | the identity of the unit c for that particular segment |
---|
0:23:32 | let me say something about that |
---|
0:23:35 | here |
---|
0:23:37 | so as i mentioned this is a chinese restaurant process because there's a finite chance |
---|
0:23:41 | of selecting or defining a new unit |
---|
0:23:44 | and for those of you unfamiliar with that |
---|
0:23:47 | the analogy is of people going into a chinese restaurant trying to |
---|
0:23:51 | decide which table to sit down at "'cause" each table can seat multiple customers |
---|
0:23:57 | so |
---|
0:23:59 | in this notation each segment is a customer and they have to decide |
---|
0:24:03 | well which table to sit at |
---|
0:24:05 | and |
---|
0:24:07 | each table has an index |
---|
0:24:11 | which corresponds to the identity of a model so think of each of these tables |
---|
0:24:15 | as belonging to a different unit |
---|
0:24:17 | and |
---|
0:24:18 | what you wanna have is a |
---|
0:24:22 | posterior probability of |
---|
0:24:24 | the likelihood of taking a particular unit label for each segment |
---|
0:24:31 | and that basically is proportional to the likelihood of the customer's that particular segment's |
---|
0:24:39 | data being generated by that particular unit's |
---|
0:24:43 | hmm-gmm and it's weighted by a prior probability |
---|
0:24:47 | which just corresponds to the number of customers that were at the table normalized by the total |
---|
0:24:53 | number of |
---|
0:24:57 | segments that you have and you'll notice that there's also a |
---|
0:25:00 | little bit of probability that's stolen away from each one in order |
---|
0:25:05 | to assign a little bit of probability to the likelihood that you might generate a |
---|
0:25:10 | new unit |
---|
0:25:11 | as well |
---|
0:25:12 | so once you have these posterior distributions set up you sample and that's the value for |
---|
0:25:17 | that particular segment |
---|
0:25:19 | at that particular iteration |
---|
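that per-segment sampling step might look like this in code (a hypothetical editor-written sketch: `likelihood` stands in for the real hmm-gmm marginal likelihood, `alpha` for the concentration parameter, and the exact normalisation in the actual model may differ):

```python
# Sketch: Chinese-restaurant-process choice of a unit label for one segment.
import random

def sample_unit(segment, counts, likelihood, alpha, rng):
    """counts: {unit_id: n_segments}; likelihood(segment, unit) -> float,
    with unit=None meaning 'a brand-new unit under the prior'."""
    total = sum(counts.values()) + alpha
    units = list(counts) + [None]                 # None = open a new table
    weights = []
    for u in units:
        prior = (alpha if u is None else counts[u]) / total
        weights.append(prior * likelihood(segment, u))
    s = sum(weights)                              # normalise and draw
    r, acc = rng.random() * s, 0.0
    for u, w in zip(units, weights):
        acc += w
        if r <= acc:
            return u
    return units[-1]

rng = random.Random(0)
# with 99 segments already at unit 0 and alpha = 1, a new unit is rare
choice = sample_unit("segment-frames", {0: 99}, lambda seg, u: 1.0, 1.0, rng)
```

the reserved `alpha` mass is exactly the "little bit of probability stolen away" that lets the sampler grow the unit inventory.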
0:25:22 | once you have a unit label for that segment |
---|
0:25:25 | you then go through and sample the hmm parameters |
---|
0:25:29 | and this is sort of on our home turf so it's a conventional |
---|
0:25:34 | hmm-gmm we're using an eight-mixture model |
---|
0:25:37 | you know the three-state left-to-right transitions so it's all very familiar |
---|
0:25:42 | and |
---|
0:25:44 | we assume you have to start a segment in state one and end in |
---|
0:25:48 | state three but for any other states we will sample we will draw samples to determine |
---|
0:25:52 | which state you're in and once you have the state sequence |
---|
0:25:56 | we'll draw samples to see which mixture component |
---|
0:25:58 | you're drawing from and then |
---|
0:26:02 | we can update the parameters based on that |
---|
0:26:05 | the last thing i wanna say is we have to also consider different segmentations the choices |
---|
0:26:10 | of the b |
---|
0:26:11 | boundary variables naively every frame could be a boundary |
---|
0:26:18 | and |
---|
0:26:20 | they take binary values either a frame is a boundary or it's not so zero |
---|
0:26:24 | or one |
---|
0:26:25 | in terms of putting this into a probabilistic formulation we have again a prior |
---|
0:26:31 | and a posterior probability the prior is just a bernoulli trial we flip |
---|
0:26:36 | a coin |
---|
0:26:37 | with probability alpha to say it's a boundary and one minus alpha to say it's |
---|
0:26:43 | not |
---|
0:26:45 | to |
---|
0:26:46 | generate the posterior we can generate a new sample for every boundary we go |
---|
0:26:51 | through one boundary at a time |
---|
0:26:54 | and we fix the state of all the other boundaries we generate the posterior distribution |
---|
0:26:58 | and then sample whether it's |
---|
0:27:01 | boundary or not |
---|
0:27:02 | and the posterior |
---|
0:27:04 | sorry for the math but this is it i think |
---|
0:27:09 | but the you know we have the prior here |
---|
0:27:12 | and then |
---|
0:27:13 | we consider all possible units that the segments on either side of this boundary |
---|
0:27:18 | might be so this is sort of the likelihood |
---|
0:27:20 | that you would generate the data |
---|
0:27:23 | given this was a boundary |
---|
0:27:25 | and then you consider the possibility that it's not a boundary and again that's the |
---|
0:27:29 | prior |
---|
0:27:30 | and then you consider the likelihood of generating this entire segment |
---|
0:27:35 | considering all possible models that you have so those are your two posterior distributions |
---|
0:27:41 | and you sample from them to generate a new value for each boundary through the |
---|
0:27:45 | corpus |
---|
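for one candidate boundary, that resampling step could be sketched as follows (hypothetical editor-written code; `seg_likelihood` is a stand-in for the marginal likelihood of a segment under all current unit models, and `alpha` is the bernoulli prior just described):

```python
# Sketch: Gibbs resampling of a single boundary variable.
import random

def sample_boundary(left, right, alpha, seg_likelihood, rng):
    """left/right: frame chunks on either side of the candidate boundary.
    alpha: prior probability that a frame is a boundary."""
    # hypothesis 1: boundary -> two separate segments
    p_yes = alpha * seg_likelihood(left) * seg_likelihood(right)
    # hypothesis 2: no boundary -> one merged segment
    p_no = (1.0 - alpha) * seg_likelihood(left + right)
    p = p_yes / (p_yes + p_no)      # posterior of 'is a boundary'
    return rng.random() < p

# toy likelihood that strongly prefers short segments of <= 2 frames
def toy_likelihood(seg):
    return 1.0 if len(seg) <= 2 else 1e-6

rng = random.Random(0)
is_boundary = sample_boundary([1, 2], [3, 4], 0.5, toy_likelihood, rng)
# with this likelihood the posterior is ~1, so the draw comes out True
```

iterating this over every candidate boundary and every segment label, as the talk describes, is one full sweep of the gibbs sampler.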
0:27:46 | so you iterate through this i think it was twenty thousand iterations that we were |
---|
0:27:51 | doing |
---|
0:27:52 | to generate all these parameters the last thing i'll say is just like she noted |
---|
0:27:57 | this is near and dear to my heart as anybody who's known me for a while knows |
---|
0:28:01 | i am a big believer in landmark-based things and |
---|
0:28:04 | it turns out |
---|
0:28:06 | that they can help save a lot of computation |
---|
0:28:10 | and |
---|
0:28:14 | so we're using some acoustic landmarks we developed that are derived from spectral change |
---|
0:28:19 | and it's nice these are language-independent and they reduce the computation |
---|
0:28:24 | and the thing is |
---|
0:28:26 | as i'll show you later |
---|
0:28:29 | this is just the initialisation once you've learned the units |
---|
0:28:32 | you can then go and train conventional models doing frame-based stuff |
---|
0:28:36 | so this is sort of a heuristic to help you do the learning faster |
---|
0:28:42 | but it seems to be effective |
---|
0:28:45 | so i don't want to dwell on experiments too much in this work we were analysing |
---|
0:28:49 | the timit corpus and |
---|
0:28:52 | we found a hundred twenty three units so i'm gonna have to duke it out with |
---|
0:28:56 | her whether it's sixty four or a hundred twenty three |
---|
0:28:59 | maybe we should split the difference next |
---|
0:29:02 | and i have a similar kind of plot showing the underlying phonetic label
---|
0:29:08 | versus the unit index and you can see we're sort of covering
---|
0:29:14 | the majority of the sounds
---|
0:29:17 | but we're expending a little too much effort modeling silence i think we would benefit
---|
0:29:22 | from a good speech activity |
---|
0:29:24 | detector the interesting thing is when you start looking at some of these sounds that
---|
0:29:27 | have multiple units like ae
---|
0:29:30 | we looked and
---|
0:29:33 | there tends to be a distribution for particular contexts so we are seeing
---|
0:29:36 | some context-dependency here
---|
0:29:38 | okay ae has a raising phenomenon in words like champ
---|
0:29:42 | versus cap
---|
0:29:43 | so we're seeing a lot of these sounds in one particular unit the ae followed
---|
0:29:48 | by a nasal so
---|
0:29:49 | there is some context-dependency stuff going on
---|
0:29:52 | so how am i doing for time
---|
0:29:55 | i'm good |
---|
0:29:56 | okay good |
---|
0:29:58 | telling be here |
---|
0:29:59 | so i wanna move on to the next step which is i mean where we
---|
0:30:02 | really wanna go is learning words
---|
0:30:06 | we're not there yet but what we tried to do is enhance the
---|
0:30:10 | model so we could learn pronunciations from parallel speech text |
---|
0:30:14 | data |
---|
0:30:15 | and |
---|
0:30:16 | ideally do better than a graphone model
---|
0:30:21 | okay i'm gonna go with your first answer |
---|
0:30:25 | now again there's been work done in this area in the past
---|
0:30:29 | by chin
---|
0:30:30 | there were also some more people like
---|
0:30:33 | mari ostendorf and
---|
0:30:36 | in the nineties there was exploration of joint acoustic lexicon discovery
---|
0:30:43 | and our natural baseline for this work is the
---|
0:30:48 | grapheme based recogniser a standard thing people do when there isn't a pronunciation dictionary
---|
0:30:52 | but there is parallel text
---|
0:30:55 | and i'm giving away a little bit of the punchline but the formulation we have
---|
0:30:59 | can reduce to a grapheme based setup if you wanted it to as one particular constrained configuration
---|
0:31:06 | but to go through the intuition |
---|
0:31:10 | again |
---|
0:31:10 | we're setting up an additional latent structure here |
---|
0:31:13 | beyond what we had before |
---|
0:31:17 | so as before
---|
0:31:20 | we have the unit sequence and we need to learn the boundaries
---|
0:31:23 | and we also need to learn the grapheme to sound mappings now the experiments we've
---|
0:31:28 | done are in english so think letter to sound if you want but the framework generalizes
---|
0:31:32 | to different languages
---|
0:31:35 | just to remind you where we started from |
---|
0:31:39 | with just acoustic units alone |
---|
0:31:41 | and some distribution on the likelihood of predicting a particular unit
---|
0:31:48 | you can go through and generate speech frames as a sequence of these units
---|
0:31:53 | now if you have a
---|
0:31:54 | word associated with it like the word fly
---|
0:31:58 | we need to introduce new variables and
---|
0:32:00 | rather than learning word pronunciations directly we're actually learning these grapheme to sound mappings
---|
0:32:05 | because we think they'll generalize better across a corpus
---|
0:32:09 | you might eventually want to |
---|
0:32:12 | do word specific pronunciations but |
---|
0:32:15 | so we are representing that by another set of latent variables that are letter specific
---|
0:32:20 | here |
---|
0:32:21 | where we would have specific distributions for each letter and by the way we'll eventually
---|
0:32:25 | try trigrapheme
---|
0:32:27 | that is context dependent ones but this is monophone or mono grapheme for now
---|
0:32:32 | so you have letter specific distributions so the S might prefer this particular
---|
0:32:37 | acoustic model and the L might prefer this particular one et cetera et cetera
---|
0:32:42 | hopefully get the idea |
---|
0:32:44 | so those are letter specific mappings
---|
0:32:46 | and this is the initial belief and of course you need to couple these together
---|
0:32:51 | so that your general belief is related to the more context specific
---|
0:32:57 | things and since you have an unknown set of units these will also be
---|
0:33:00 | nonparametric bayesian processes
---|
0:33:02 | underlying them as well
---|
0:33:07 | so if you go through you know you have a letter
---|
0:33:12 | you use that particular
---|
0:33:15 | distribution you select a particular unit and that unit would generate
---|
0:33:19 | your frames and when you go to a different letter a different distribution
---|
0:33:24 | so likely a different unit is sampled to generate frames
---|
0:33:29 | et cetera et cetera
---|
0:33:31 | and that's sort of
---|
0:33:33 | how it would work so this model is now
---|
0:33:36 | a joint model for learning units
---|
0:33:39 | and
---|
0:33:40 | grapheme to sound mappings
---|
0:33:43 | and the underlying |
---|
0:33:45 | acoustic models |
---|
0:33:47 | one more wrinkle we have to deal with is that there isn't necessarily a one-to-one
---|
0:33:50 | matching between graphemes and units sometimes there is a one-to-one matching but there isn't necessarily
---|
0:33:55 | so we introduce another variable |
---|
0:33:59 | that |
---|
0:34:00 | gives you some flexibility as to the number of |
---|
0:34:04 | units a particular letter might map to so thinking of english here
---|
0:34:08 | we said zero one or two
---|
0:34:12 | and we have a little categorical distribution |
---|
0:34:15 | which predicts that likelihood for each letter specific or context dependent letter specific
---|
0:34:21 | distribution |
---|
0:34:23 | so like the X for example might be likely to be represented by two models |
---|
0:34:27 | you might learn |
---|
0:34:30 | so let me go through this very quickly i know i'm running out of time
---|
0:34:34 | but once you have a sequence of letters you have to generate a sample for
---|
0:34:41 | the number of units that each letter has and once you have that set of units
---|
0:34:47 | you have to pick
---|
0:34:51 | an hmm acoustic model to sample from and these are position-dependent as
---|
0:34:57 | well so for a letter that has two we have position one and position two
---|
0:35:02 | distributions that you would sample
---|
0:35:04 | from
---|
0:35:06 | and then once you've sampled the particular
---|
0:35:10 | units you would generate speech frames from the appropriate
---|
0:35:14 | hmm-gmm model
---|
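Those generative steps might be sketched like this. All the distributions, unit names, and the single Gaussians standing in for HMM-GMM acoustic models are invented for illustration:

```python
import random

def sample_categorical(dist, rng):
    # draw a key from a {value: probability} dict
    r = rng.random()
    acc = 0.0
    for value, p in dist.items():
        acc += p
        if r < acc:
            return value
    return value  # numerical slack

def generate_word(letters, num_units, unit_dist, unit_gauss, rng=None):
    """For each letter: sample how many units it maps to (0, 1 or 2),
    sample each unit from a position-dependent letter-specific
    distribution, then emit frames from that unit's Gaussian
    (a stand-in here for an HMM-GMM)."""
    rng = rng or random.Random(0)
    units, frames = [], []
    for letter in letters:
        n = sample_categorical(num_units[letter], rng)
        for pos in range(n):  # position one, position two
            u = sample_categorical(unit_dist[(letter, pos)], rng)
            units.append(u)
            mu, var = unit_gauss[u]
            frames.extend(rng.gauss(mu, var ** 0.5) for _ in range(3))
    return units, frames
```

With deterministic toy distributions, the word "fly" comes out as one unit per letter and three frames per unit, mirroring the walk-through above.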
0:35:16 | and so the latent variables that we have to deal with now there's more of |
---|
0:35:20 | them |
---|
0:35:21 | there is the number of units per letter the unit label identity same as before |
---|
0:35:26 | and then there's these letter specific
---|
0:35:30 | distributions |
---|
0:35:31 | as well as the hmm gmm parameters |
---|
0:35:37 | when you do the inference i'm gonna skip that i knew i wouldn't have time |
---|
0:35:41 | to go into that |
---|
0:35:42 | you end up with this mapping all the way from the letters
---|
0:35:46 | down to the segments and now
---|
0:35:49 | if you look at the top part |
---|
0:35:51 | you can generate pronunciations for words in your lexicon in terms of these units that
---|
0:35:55 | you learned
---|
0:35:56 | and of course you can train your hmm gmm models |
---|
0:36:00 | and guess what now you're in business to train a conventional speech recognizer with
---|
0:36:05 | whatever technique |
---|
0:36:07 | that you like |
---|
0:36:11 | this is similar to what was being talked about earlier
---|
0:36:15 | the experiments that we did and i know it's very ironic that here i am talking
---|
0:36:19 | about low resource languages and all the experiments i'm showing you are in english
---|
0:36:23 | it's ongoing work trust me |
---|
0:36:26 | but we've done some experiments on a weather corpus that we had for a while |
---|
0:36:30 | we like working on |
---|
0:36:33 | we used an eight hour subset to try and learn the units and the
---|
0:36:37 | pronunciations and then we retrained
---|
0:36:39 | a conventional recognizer on the entire training set using these units and we compared
---|
0:36:46 | these are our baselines the expert dictionary
---|
0:36:48 | is where we wanna be but certainly we would hope to do better than the grapheme
---|
0:36:53 | system and just to cut to the chase we're in between
---|
0:36:56 | right now
---|
0:36:57 | so |
---|
0:36:59 | graphemes are hard on english everybody knows that so it's nice that we're able to
---|
0:37:03 | beat that
---|
0:37:05 | but we'd like to get to the supervised end and
---|
0:37:08 | i actually think there is |
---|
0:37:11 | there's reasonable expectation that we can do that because |
---|
0:37:17 | you know we've done some stuff with automatic learning
---|
0:37:20 | throwing out expert pronunciations and learning some new ones based on graphone models we can
---|
0:37:26 | get down to about eight point three percent
---|
0:37:29 | so i would hope that we could drive this down below the ten percent
---|
0:37:33 | marker |
---|
0:37:34 | it's still early days |
---|
0:37:37 | it's also hard to interpret what you actually have learned
---|
0:37:44 | because |
---|
0:37:45 | the pronunciations are in terms of these unit numbers which is not very intuitive for me
---|
0:37:49 | but |
---|
0:37:50 | so we can do things like look at these words that have a sh sound in
---|
0:37:53 | them
---|
0:37:54 | and they sort of have the same unit so that's kind of encouraging
---|
0:37:57 | and the other thing that |
---|
0:38:01 | jackie has done recently
---|
0:38:04 | is to use moses to try and translate
---|
0:38:06 | between the two dictionaries the pronunciations from the expert dictionary and the pronunciations from these learned
---|
0:38:12 | units
---|
0:38:13 | and use moses to try and translate between the two so here are six words
---|
0:38:18 | the yellow one on top is the expert pronunciation and the blue one
---|
0:38:23 | is the translation
---|
0:38:25 | of the learned units into phone like units so they are a little more
---|
0:38:31 | interpretable by us and you know
---|
0:38:34 | i think it's on the right track i think there's something there we need to
---|
0:38:37 | do a lot more investigation
---|
0:38:39 | but |
---|
0:38:41 | i'm encouraged so far so that's sort of where we are
---|
0:38:48 | so i can wind up i think |
---|
0:38:50 | you know |
---|
0:38:52 | it would be beneficial and i'm really encouraged to see you know more people
---|
0:38:58 | doing unsupervised things |
---|
0:39:01 | it's challenging but i think |
---|
0:39:04 | we learn a lot by doing this and |
---|
0:39:06 | i truly believe that |
---|
0:39:08 | in the end it will help us develop speech recognition capability for more of the world's
---|
0:39:13 | languages |
---|
0:39:14 | and i'm optimistic that |
---|
0:39:18 | these methods can potentially complement existing approaches one of the things we're doing right now
---|
0:39:22 | in the babel framework is we're looking at these
---|
0:39:27 | acoustic matching mechanisms as a means to rescore keyword hypotheses from conventional recognizers and
---|
0:39:35 | you know maybe that can be helpful |
---|
0:39:38 | so i've shown you some progress we've been making in speech pattern discovery and topic
---|
0:39:42 | clustering and also learning units
---|
0:39:45 | and i mentioned we're looking at other languages and we're augmenting the framework right now
---|
0:39:49 | to try and learn words themselves from the audio
---|
0:39:55 | and i guess i can be speculative at the very end
---|
0:40:02 | somebody this morning said the elephant in the room is
---|
0:40:06 | text data i actually think the elephant in the room is us
---|
0:40:11 | in fact it's all of humanity that's ever been and ever will be
---|
0:40:17 | you know we all learned language
---|
0:40:20 | as toddlers
---|
0:40:22 | no one gives us a dictionary
---|
0:40:25 | no one gives us text
---|
0:40:27 | you know we have other things but we figure it out |
---|
0:40:30 | and |
---|
0:40:32 | i think we'd be
---|
0:40:35 | better off in the long run as a community trying to think about how to
---|
0:40:41 | get some of those capabilities into our systems
---|
0:40:44 | anyways thank you very much |
---|
0:40:45 | and i'm done |
---|
0:40:47 | here are some references for anybody who's interested and i'd be happy to talk to anybody
---|
0:41:02 | sorry i one over |
---|
0:41:04 | thank you i just want to remind people really quickly
---|
0:41:08 | that after
---|
0:41:09 | this break instead of having a full one hour panel session we're gonna have a shortened
---|
0:41:13 | panel session preceded by a talk by the speaker that i mentioned this
---|
0:41:17 | morning who is a cognitive scientist working on human language
---|
0:41:23 | we have time for a few questions |
---|
0:41:40 | so i think this is a kind of pretty nice approach based on the
---|
0:41:45 | generative
---|
0:41:47 | model
---|
0:41:47 | framework
---|
0:41:49 | and the other side
---|
0:41:50 | that was discussed seems kind of more like a discriminative
---|
0:41:54 | approach based on neural networks and i think both
---|
0:41:57 | are pretty important so do you have some kind of idea about integrating these
---|
0:42:03 | two kinds of approaches how to say into some nice framework
---|
0:42:08 | a well i don't know if it's nice so first of all i totally agree |
---|
0:42:11 | with you |
---|
0:42:13 | i sort of view this generative stuff we're trying to do as a way to get
---|
0:42:17 | started |
---|
0:42:19 | and then as you observed i think
---|
0:42:23 | if you can figure out the units and some pronunciations and get going |
---|
0:42:28 | then i think |
---|
0:42:29 | there's potential to bring in |
---|
0:42:31 | discriminative things to sharpen
---|
0:42:34 | the boundaries that you're learning |
---|
0:42:39 | so that's sort of the approach that we're taking try to use this to
---|
0:42:43 | generate an initial speech recognizer and like the landmarks we go away from that once we
---|
0:42:50 | have an initial model
---|
0:42:52 | it's maybe not
---|
0:42:56 | the best idea but it's the best idea i have at the moment
---|
0:43:01 | more questions |
---|
0:43:10 | at the beginning of the talk you said that you were going for the
---|
0:43:15 | places of interest
---|
0:43:18 | by looking at the similarities so actually by detecting the places where the speech signal
---|
0:43:24 | is similar to some other place you say this is a place of interest well
---|
0:43:28 | this is a bit contradictory to the information theory model
---|
0:43:32 | basically when you have something that is unique that is there just once it can
---|
0:43:36 | carry lots of information and that can be a place of interest
---|
0:43:40 | could you comment on this
---|
0:43:43 | that's very true the thing is i mean these patterns are complicated
---|
0:43:48 | you know "'cause" they're sequences of sounds
---|
0:43:51 | they're probably
---|
0:43:53 | a little more reliably detected
---|
0:43:55 | and it could be that you could have a very important word that only occurs |
---|
0:43:59 | once |
---|
0:44:00 | this method would not be able to find it so
---|
0:44:05 | i guess
---|
0:44:09 | it's more about how reliably do we think we can actually find it
---|
0:44:13 | and the longer it is the more reliable it is and it turns out |
---|
0:44:18 | that's why i mentioned the comparisons with tf-idf
---|
0:44:23 | we do seem to be able to find |
---|
0:44:28 | important content words |
---|
0:44:29 | in lectures using these methods |
---|
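For reference, a minimal version of the standard tf-idf score used in such comparisons, where documents are just lists of word tokens:

```python
import math

def tf_idf(term, doc, docs):
    """Term frequency in one document times inverse document
    frequency across the collection: high for words that are
    frequent here but rare elsewhere, i.e. content words."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df) if df else 0.0
```

A term concentrated in one lecture outscores a term spread across all of them, which is why tf-idf is a natural reference point for discovered content words.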
0:44:31 | now |
---|
0:44:32 | the hidden thing i mentioned a little bit was that a lot of the common words are
---|
0:44:37 | short
---|
0:44:38 | and that's why having a threshold for looking for some duration is important like half a
---|
0:44:43 | second |
---|
0:44:44 | you guys might have even used that
---|
0:44:46 | right in the first paper |
---|
0:44:50 | have okay |
---|
0:44:51 | but so half a second you know eliminates a lot of the very commonly occurring
---|
0:44:57 | things that you can |
---|
0:45:01 | one other point about that is that the non parametric bayesian stuff
---|
0:45:05 | is actually really good the chinese restaurant process at
---|
0:45:09 | generating a new category for something that just occurs once or a very small number of
---|
0:45:14 | times
---|
0:45:15 | and so it sort of a really nice framework i think from language generally has |
---|
0:45:18 | a nice |
---|
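A minimal illustration of that property, with an assumed concentration parameter `alpha`: a new table (category) opens with probability proportional to `alpha` no matter how much data has been seen, so rare one-off items can still get their own category.

```python
import random

def crp_assign(tables, alpha, rng):
    """One Chinese restaurant process draw: join an existing table with
    probability proportional to its size, or open a new one with
    probability proportional to alpha."""
    n = sum(tables)
    r = rng.random() * (n + alpha)
    acc = 0.0
    for i, size in enumerate(tables):
        acc += size
        if r < acc:
            return i
    return len(tables)  # new table, i.e. a brand-new category

def simulate(n_customers, alpha, seed=0):
    rng = random.Random(seed)
    tables = []
    for _ in range(n_customers):
        k = crp_assign(tables, alpha, rng)
        if k == len(tables):
            tables.append(1)
        else:
            tables[k] += 1
    return tables
```

The rich-get-richer dynamics produce a few big clusters plus a long tail of small ones, which matches the skewed distributions typical of language.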
0:45:22 | further questions
---|
0:45:29 | so one of the standing problems in this field is what are the right observations
---|
0:45:35 | so have you thought
---|
0:45:38 | about designing
---|
0:45:39 | something that will allow us to
---|
0:45:42 | infer
---|
0:45:43 | the observations we should use based on these kinds of non parametric
---|
0:45:48 | distributions
---|
0:45:50 | i missed the part about what the most important
---|
0:45:54 | observations are what is it that you
---|
0:45:56 | and i don't i don't believe that anybody in this room listens to cepstra
---|
0:46:00 | that's just not the right thing the question is what's the right thing so
---|
0:46:05 | should we try to acquire more and analyze so that we could actually figure out what
---|
0:46:10 | the observations should be
---|
0:46:14 | you're talking about the input representation |
---|
0:46:17 | these were all just based on mfccs |
---|
0:46:23 | it's a very good question i |
---|
0:46:25 | i actually think you know we would benefit from a better representation that sort of
---|
0:46:31 | naturally had the more phone like contrasts in the languages around the world
---|
0:46:37 | i think mfccs are a kind of rather blurry
---|
0:46:40 | description of the signal but i don't have any brilliant answer
---|
0:46:48 | i just have a question |
---|
0:46:51 | we need where are you very natural language i like that it's a side quests |
---|
0:46:58 | okay and it data |
---|
0:47:01 | i don't know about the impression |
---|
0:47:03 | with respect to discovery things the more data better |
---|
0:47:07 | so the question |
---|
0:47:12 | it's a state which you ideally want to have |
---|
0:47:16 | right |
---|
0:47:18 | or is it is the next so i have much while are that it seemed |
---|
0:47:23 | like really well are |
---|
0:47:26 | things that are sort of |
---|
0:47:27 | state |
---|
0:47:31 | thank you might have a lot a tensor thinking about how much data collection things |
---|
0:47:35 | like it's really are really understanding in |
---|
0:47:44 | it is really want |
---|
0:47:51 | almost |
---|
0:47:53 | i don't know the answer to that question
---|
0:47:55 | it seems like a fair question
---|
0:48:31 | so |
---|
0:48:36 | there's a lot of data for english and not for a lot of other languages of
---|
0:48:40 | course and resources like babel
---|
0:48:43 | will be a tremendous legacy that people can evaluate on
---|
0:48:50 | i don't know how much we need overall
---|
0:48:52 | i mean this stuff i was talking about is very early days so
---|
0:48:59 | okay well with that let's once again thank jim
---|