0:00:15 Thanks, Karen. It's very nice to be here to talk about some of the low-resource work we've been doing in our group. The research I'll talk about today involves the work of several students in our group, whom I list here.

There's been a lot of talk just today about the low-resource issue, with good descriptions, so I won't belabour the point. My perspective on the problem is that current speech recognition technology will benefit from incorporating more unsupervised learning ideas.

This schematic shows a range of increasingly difficult tasks we could imagine, starting in the upper left with the conventional ASR approach, which has annotated resources, a pronunciation dictionary, and units, and moving to scenarios with fewer resources: parallel annotated speech and text, then independent speech and text, all the way down to having just speech. What can you do with that? That would be the case, for example, for an oral language like the ones we were talking about earlier this morning.
0:01:33 I think it's a challenging problem, but if we start to look at some of these ideas there will be a few benefits. First, it's a really interesting problem, and we will learn a lot just by trying to do it. Second, I think it will ultimately enable speech recognition for larger numbers of the world's languages. And it has the potential to complement existing techniques, and so even benefit languages that are already quite successful with conventional techniques.

In the time I have today I'm going to talk about two research areas that we've been exploring in our group. The first is speech pattern discovery, a method we've applied to various problems. It's an example of the zero-resource scenario: it could potentially work on any language in the world given just a body of speech, so it has a certain appeal, but no specific models are learned from it. So a newer line of research we've been starting is exploring methods to learn speech units, models of those units, and pronunciations, from either speech alone or from whatever limited resources are available. We're using a joint modeling framework to do this, and even though it's still very early days for this work, I believe it's quite promising. Those are the two things I'll touch on. It will be a fairly high-level overview because I don't have a lot of time, but hopefully you'll get the idea of some of the things we're trying to do.
0:03:12 The speech pattern work was motivated in part by humans: infant-learning researchers like Jenny Saffran have shown that just by exposing infants to a short stream of nonsense syllables, they very quickly learn where the word boundaries are, and what they've heard before versus what they haven't. So we asked: can we apply some of these ideas when we have a large body of speech but we don't have all the conventional paraphernalia that goes with conventional ASR? If we have a lot of speech, we could throw all of that out and simply look for repeating occurrences, instances of things like words, and there might be some interesting things we could do if we achieved that capability.

To describe it in terms of some speech: the idea was, if we have different chunks of audio, can we find the recurring words? These are two utterances with the word 'schizophrenia' in them. Can we develop a method to establish that those two are the same thing?
0:04:25 The approach that we took, which is fairly common sense I think, is that if you have a large body of audio, you compare every piece of audio with every other piece of audio. So for those two utterances we compute a distance matrix, which represents point-by-point distances. The idea is that when two spectral frames are the same, there will of course be a low distance, and when they are very dissimilar, a high distance.

The representations that we use have varied over time. We started off just using whitened MFCCs, which work fine if it's all the same speaker. We then went to unsupervised posteriorgram representations, derived both from Gaussian mixture models and also from DNNs trained in an unsupervised way; we've also done some things with Herb's self-organizing units, representing them as a posteriorgram. It really doesn't matter too much what you use.
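As a concrete illustration of the distance computation just described, here is a minimal sketch in Python with NumPy; the function name and the particular metrics are illustrative assumptions, not code from this work. It takes two utterances already represented as frame-by-feature matrices (whitened MFCCs or posteriorgram rows) and returns the point-by-point distance matrix.

```python
import numpy as np

def distance_matrix(x, y, metric="cosine"):
    """Point-by-point distances between two utterances.

    x, y: (num_frames, dim) arrays, e.g. whitened MFCCs or posteriorgram
    rows that sum to one.  Low values mark frames that sound alike.
    """
    if metric == "cosine":                     # reasonable for MFCCs
        xn = x / np.linalg.norm(x, axis=1, keepdims=True)
        yn = y / np.linalg.norm(y, axis=1, keepdims=True)
        return 1.0 - xn @ yn.T
    if metric == "neglog_dot":                 # a common choice for posteriorgrams
        eps = 1e-12
        return -np.log(np.clip(x @ y.T, eps, None))
    raise ValueError("unknown metric: %s" % metric)
```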
0:05:23 The interesting thing is that when we look at that picture, I think most of us can see right away that there is a diagonal of low distances: that's where the repeating pattern is. So that's what we try to find automatically. We developed a variation of dynamic time warping that we call segmental dynamic time warping, which basically consists of striping all the way through the audio corpus so that you eventually compare every piece with every other piece, and the warping path you are on eventually snaps into that low-distance alignment as you pass over it.

We call the region along the alignment path a fragment, and the weight in that region is the accumulated point-by-point distance; that's what we're trying to minimize. There are different ways to do this, but this particular illustration shows the point-by-point alignment of two stripes from the two utterances, and this is the distortion as a function of the frame-by-frame distances. Here, of course, is where the overlapping word 'schizophrenia' is, so along that warping path we establish some mechanism to find a low-distortion region. We were looking for things that were at least half a second long; it turns out the longer the constraint, the better this idea works, so an expression like 'Boston Red Sox' works really well. We've extended this a little bit, and when we do this, what we produce are aligned fragments.
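The following sketch captures only the intuition, not the segmental DTW algorithm itself: it scans each diagonal stripe of the distance matrix for the lowest-average-distortion run of at least a minimum length (the half-second constraint, expressed in frames). The real method additionally lets the alignment path warp off the diagonal within a band, and keeps every fragment below a distortion threshold rather than a single best one; the names and defaults here are assumptions.

```python
import numpy as np

def best_diagonal_fragment(dist, min_len=50):
    """Find the lowest-average-distortion diagonal run of `min_len` frames.

    dist: the point-by-point distance matrix from distance_matrix().
    Returns (avg_distortion, (row_start, col_start), min_len).
    """
    n, m = dist.shape
    best = (np.inf, None, min_len)
    for offset in range(-(n - 1), m):              # one stripe per diagonal
        diag = np.diagonal(dist, offset=offset)
        if len(diag) < min_len:
            continue
        csum = np.concatenate(([0.0], np.cumsum(diag)))
        win = (csum[min_len:] - csum[:-min_len]) / min_len   # sliding average
        k = int(np.argmin(win))
        if win[k] < best[0]:
            row = max(-offset, 0) + k              # where the fragment starts
            col = max(offset, 0) + k
            best = (float(win[k]), (row, col), min_len)
    return best
```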
0:07:07 People have modified this basic idea, because it is computationally fairly intensive. We've done some work on approximations that are guaranteed to be admissible, but other people, like Aren Jansen at JHU, have done some really nice work using visual processing concepts to significantly reduce the amount of computation involved. Although it turns out, I think, that some variant of the SDTW idea is not a bad way to refine the initial matches.

When you do this, you end up with pairs of utterances, and the things in red are example matches, the low-distortion matches found in your corpus. You can see that, depending on the parameters, like the width that you pick (we were aiming for word-level patterns, which is why we picked the half-second constraint), sometimes it's a word, sometimes it's multiple words, sometimes it's a fragment of a word, and sometimes it's something similar but not the same thing. This is the type of thing that you get out.
0:08:17 The interesting question then is, once you have all these pairwise matches for your corpus, how do you establish which things share the same underlying identity? You have to go to some sort of clustering notion, and this is what we call speech pattern discovery: we represent all of these pairwise matches in a graph structure that we can then do clustering on. I'll describe how we define the vertices in the graph in a second, but when you do, each region corresponds to a vertex, and the edges correspond to connections between the matched regions, and then you can try clustering.

Naturally, in the real world these clusters are inevitably connected to some extent, since the matches aren't perfect, but then you can apply your favourite clustering algorithm to find densely connected regions in the graph, and that's in fact what we did.
0:09:26 Let me very briefly show you one way that we did it; there are many ways to define the vertices in the graph. This illustration shows example pairwise matches: each little rectangle corresponds to a match, and the colour indicates which match it belongs to, so the blue rectangle, for example, is a region where we think a word matches something else said over here. Different colours mean different matches. If you actually look at what's going on at any point in time, though, it's messier than that, because you potentially have a whole lot of matches, and since each match is found independently, the start and end times will all probably be different.

What we did was summarize that collection of matches by summing up the similarities as a function of time. What you get is something time-varying with local maxima, and we define those local maxima as places of interest, where a lot of similarity matches are occurring. We use those places to define the nodes, or vertices, in our graph, and once you've defined the nodes, the matched pairs that overlap them define the edges. So, for example, the blue pair went from node one to node eight, so you'd make that connection in your graph. You can do that for all of the low-distortion matches that you have, and that's how you construct the graph. Then, as I said, you can apply a clustering algorithm to define clusters.
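Here is a rough sketch of that graph construction, assuming the pairwise matches are available as interval pairs with a similarity score; the input format, smoothing width, and edge threshold are made-up illustrations, and connected components after pruning weak edges stand in for "your favourite clustering algorithm".

```python
import numpy as np
import networkx as nx

def cluster_matches(matches, num_frames, smooth=25, min_weight=1.0):
    """matches: list of (utt_a, (s_a, e_a), utt_b, (s_b, e_b), similarity)
    with frame indices; num_frames: dict utterance -> length in frames.
    Returns clusters of (utterance, peak_frame) nodes."""
    # 1. per-utterance profile: sum similarity over all matched intervals
    profile = {u: np.zeros(n) for u, n in num_frames.items()}
    for ua, (sa, ea), ub, (sb, eb), sim in matches:
        profile[ua][sa:ea] += sim
        profile[ub][sb:eb] += sim

    # 2. nodes = local maxima of the (lightly smoothed) profile
    nodes, g = {}, nx.Graph()
    for u, p in profile.items():
        p = np.convolve(p, np.ones(smooth) / smooth, mode="same")
        peaks = [t for t in range(1, len(p) - 1)
                 if p[t] > 0 and p[t] >= p[t - 1] and p[t] > p[t + 1]]
        nodes[u] = peaks
        g.add_nodes_from((u, t) for t in peaks)

    # 3. edges: each match links the nodes it overlaps on either side
    for ua, (sa, ea), ub, (sb, eb), sim in matches:
        for a in ((ua, t) for t in nodes[ua] if sa <= t < ea):
            for b in ((ub, t) for t in nodes[ub] if sb <= t < eb):
                if g.has_edge(a, b):
                    g[a][b]["weight"] += sim
                else:
                    g.add_edge(a, b, weight=sim)

    # 4. drop weak edges, then take densely connected pieces
    g.remove_edges_from([(a, b) for a, b, d in g.edges(data=True)
                         if d["weight"] < min_weight])
    return [c for c in nx.connected_components(g) if len(c) > 1]
```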
0:11:17 Let me show you an example on a lecture that was recorded at MIT. We had four matches here, at different places in the recording, and there was a nice little cluster here that I'll show you; hopefully I can play you some examples. [plays audio clip] Those were four things played at the same time, but basically this speaker was talking about variations of 'search engine optimizer' and 'search engine optimizing', and the cluster actually spanned the word 'optimize', the common acoustics in the cluster.

So this is an example of the type of thing that you get. Interestingly, all of these words tended to occur near each other in the lecture, so you can actually do topic segmentation based on the time-varying nature of these clusters over the course of a long audio recording; we've done that work, though I'm not going to talk about it today.
0:12:19 I can show you some other examples we've tried on different languages. This is a Lebanese interview that we recorded; it actually has two people talking, one of them using Levantine Arabic and the other speaking MSA. The algorithm doesn't care, it doesn't know, it's completely oblivious: it's just looking for things that look like they're the same. Here's the cluster that it got. [plays audio clip] There's another one from a Mandarin lecture that we applied it to. You get the idea: it's finding these acoustic chunks, and they're essentially the same thing.
0:13:00 Now, when you do this over a single large body of audio, like a lecture, you get a bunch of these different clusters, and it's interesting to look at what the underlying identity of each cluster is. You can see that a lot of the terms are important content words; in fact, in the study we did, we were finding about eighty-five percent of the top twenty TF-IDF terms in lectures, which is an indication that the clusters are finding potentially useful pieces of information. I guess one of the motivations for this is that if there's a word that's important in a conversation or a lecture, it will probably be said multiple times, and that gives us a chance to find it. That's not always the case, but we need it to be for this type of technique to work.
0:13:52 One of the things we've done recently, in addition to working within one particular document, is to look at the relationship between these unsupervised patterns across different documents. Like the topic ID work an earlier speaker was talking about, we can do unsupervised topic clustering based on the relationship of these unsupervised words across different documents.

Just to visualise that a little bit: each of these grey rectangles is a different document, and the darker grey rectangles are speech patterns that we found in the unsupervised way; the connections are places where patterns are linked to each other by a low-distortion match. If, for example, you have this kind of distribution of your clusters, then you might want to say that the two clusters on the right belong together because of the connections between the unsupervised terms we found, and the three on the left are in the same class. Again, this is all done unsupervised.
0:15:04 To do this we tried a couple of different methods, but the one that was most successful had a latent model for topics and words; that's the plate notation on the right. The observed variables are the documents and what we call the link structure: for each interval in a document where we found a pattern, the link structure is just the set of connections to all the other patterns found in all the other documents. The latent variables are words, which have a certain distribution over links, and topics, which have a certain distribution over words, and you can learn this model with an EM-style algorithm.
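I won't reproduce the exact link-based latent model; the following is only a simplified, PLSA-style stand-in that captures the spirit: documents get topic mixtures, topics get distributions over the discovered speech patterns (pseudo-terms), and both are learned with EM. The counts matrix and all names here are assumptions for illustration.

```python
import numpy as np

def topic_em(counts, num_topics, iters=100, seed=0):
    """counts: (num_docs, num_terms) occurrences of each discovered
    pattern in each document.  Returns p(topic|doc) and p(term|topic)."""
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    p_t_d = rng.dirichlet(np.ones(num_topics), size=D)      # (D, K)
    p_w_t = rng.dirichlet(np.ones(V), size=num_topics)      # (K, V)
    for _ in range(iters):
        # E-step: responsibility of each topic for each (doc, term) pair
        joint = p_t_d[:, :, None] * p_w_t[None, :, :]        # (D, K, V)
        resp = joint / np.clip(joint.sum(axis=1, keepdims=True), 1e-12, None)
        # M-step: re-estimate both distributions from expected counts
        exp_counts = resp * counts[:, None, :]                # (D, K, V)
        p_t_d = exp_counts.sum(axis=2)
        p_t_d /= np.clip(p_t_d.sum(axis=1, keepdims=True), 1e-12, None)
        p_w_t = exp_counts.sum(axis=0)
        p_w_t /= np.clip(p_w_t.sum(axis=1, keepdims=True), 1e-12, None)
    return p_t_d, p_w_t
```

Each document can then be assigned to its highest-probability topic, which is the unsupervised analogue of the six-way clustering described next.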
0:16:04 The interesting thing is that we did some experiments on the Fisher corpus: sixty conversations spanning six different topics. We seeded this with about thirteen hundred initial clusters, and we told it to find six clusters, so that was kind of cheating, but these are the resulting clusters that we found.

What's interesting is that when you look at the underlying speech patterns associated with these topics, they actually make a fair amount of sense, which is nice. You find that there are relevant words and there are irrelevant words, the kind of things you might put on a stop-word list if you were working with text, so it would be nice to get rid of those.

Let me show you some of the others: the green one was on minimum wage, another one is on computers and education, and the purple one is kind of interesting. When you look at the distribution of the true underlying topic labels, some of them are pretty good: there's holidays, and computers in education, which in fact got split into two. It's kind of intriguing to me that corporate conduct and illness were mapped into the same cluster; maybe that's telling us something. It's still early days, but there are a lot of things you can potentially do with these kinds of unsupervised methods that are not conventional; others have shown some nice examples too, and this is another one.
0:17:39 I want to move on and talk about some of the newer work that we're doing, where we're trying to learn models, learn speech units, and learn pronunciations: really trying to get rid of the dictionary, or at least learn some methods for producing it. It's interesting: we pride ourselves on the ignorance models we've developed with HMMs to model everything we don't know about speech, and yet the dictionary is still our crutch. It's typically made by humans, and hours and hours are spent tweaking these things and getting rid of all the inconsistencies; anybody who has worked on one knows it's hard work and takes a long time. Mary can tell you about the amount of effort that goes into making the dictionaries for the Babel program; it's not a trivial effort. Why is it that we can't learn the units and learn the pronunciations automatically? We do everything else. I think it's time we looked into this.

So this is the type of thing we're trying to do: what can we do from speech alone, or, if you have some text, can that help you with pronunciations? We're trying to do both of these things now in our work.
0:18:54 There's prior work in this area dating all the way back to the eighties, when people were trying acoustic-based approaches; a more recent example is Herb's work with self-organizing units, and there's been other work coming out of Johns Hopkins that is very interesting as well.

The approach that we've been taking is motivated by something that is becoming more popular in machine learning: a Bayesian framework for inference. In particular, Sharon Goldwater, who is now at the University of Edinburgh, had a really nice paper on learning word segmentation from phonetic transcriptions, so from symbolic input, and more recent work has tried to do it from noisier phonetic input. We wanted to modify this kind of model so that we could learn from the speech itself, and that's what I'll talk about now. We've recently extended it to try to learn word pronunciations as well.
0:20:01 We all know what the challenges are when you're trying to learn what the speech units are. First of all, as the last questioner said, we don't know how many units there are; maybe there are sixty-four, maybe there aren't. We don't know what they are, and we don't know where they are. So there are a lot of unknowns we're trying to figure out about the units. What we're trying to do, given speech, and in this work only speech, is discover the inventory of units and build a model for each of them.

As I said, we're formulating this in a kind of mathematical framework that is less familiar to the speech community, where we have a set of latent variables that include the boundaries, the units, and the segments, and then there's a conventional HMM-GMM model for each unit, which I'll describe shortly. In this initial work we didn't want to fix the number of units in advance, so we represented it with what's known as a Chinese restaurant process, a Dirichlet process prior, so that there's a finite chance of generating a new unit every time you iterate through.
0:21:15 Let me walk through this at a high level, sort of channelling the generative story of how we generate an utterance with this mixture of HMM-GMMs. Basically, underlying it we have a set of K models, and for a particular set of frames one of them is selected and generates a certain number of speech frames; then you transition to another one, which generates more frames, et cetera, as you go through the entire utterance.

So that just describes our model, but what are the latent variables? First of all, we don't know where the transitions in the speech are between one unit and another, so the b's, the boundaries, are one set of latent variables. We don't know what the labels are, or the inventory of labels, so the c's shown in purple down below are also a set of unknown variables. We don't know, of course, the parameters of our HMM-GMM models, so those are unknown as well. And lastly, as I mentioned, we don't know how many units there are, so that is an unknown too, and that last piece is what we're modeling with the Dirichlet process.
0:22:45 The learning procedure does the inference with Gibbs sampling, so it's an iterative process. We initially select values for the boundary variables, which I'll talk about shortly; for now, think of it as an initial segmentation, and we have an initial prior distribution for the parameters. Then we go through the corpus one segment at a time, where a segment is defined as the chunk of frames between boundaries, and based on the posterior distribution we sample a value: for each segment we sample the identity of the unit, the c variable for that particular segment. Let me say something about that here.
0:23:37 As I mentioned, this is a Chinese restaurant process, because there's a finite chance of defining a new unit. For those of you unfamiliar with it, the analogy is of people going into a Chinese restaurant and trying to decide which table to sit down at, since each table can seat multiple customers. In this notation each segment is a customer, and it has to decide which table to sit at. Each table has an index, which corresponds to the identity of one of the models; think of each table as belonging to a different unit. What you want is the posterior probability of assigning a particular unit label to each segment, and that is basically proportional to the likelihood of that particular segment's data being generated by that particular unit's HMM-GMM, weighted by a prior probability, which just corresponds to the number of customers already at that table, normalized by the total number of segments that you have. You'll also notice that a little bit of probability is stolen away from each one and set aside for the possibility that you generate a new unit. Once you have these posterior distributions set up, you sample, and that's the value of c for that particular segment at that particular iteration.
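A minimal sketch of that sampling step, assuming the segment's log-likelihood under each existing unit has already been computed; how the new-unit likelihood is obtained (here a base-measure marginal passed in by the caller) is an assumption, and the details differ from the actual model.

```python
import numpy as np

def sample_unit_label(seg_logliks, counts, alpha, new_unit_loglik,
                      rng=np.random.default_rng()):
    """Draw a unit label for one segment under a Chinese-restaurant prior.

    seg_logliks: log p(segment | HMM-GMM of unit k) for the K existing units.
    counts:      segments currently assigned to each unit ("customers per table").
    alpha:       Dirichlet-process concentration, the mass reserved for a new unit.
    new_unit_loglik: log marginal likelihood of the segment under the prior.
    Returns an index in 0..K, where K means "create a new unit".
    """
    log_prior = np.log(np.append(counts, alpha))      # existing tables + new one
    log_post = log_prior + np.append(seg_logliks, new_unit_loglik)
    log_post -= log_post.max()                        # numerical stability
    probs = np.exp(log_post)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```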
0:25:22 Once you have a unit label for that segment, you then go through and resample the HMM parameters, and this is on our home turf: it's a conventional HMM-GMM; we're using an eight-component mixture model and the usual three-state left-to-right topology, so it's all very familiar. We assume a segment has to start in state one and end in state three, but for the frames in between we draw samples to determine which state you're in; once you have the state sequence, we draw samples to see which mixture component you're drawing from, and then we can update the parameters based on that.
0:26:05 The last thing I want to say is that we also have to consider different segmentations, the choices of the b's, the boundary variables. Naively, every frame could be a boundary, and the variables take binary values: either a frame is a boundary or it's not, zero or one. To put this into a probabilistic formulation we again have a prior and a posterior probability. The prior is just a Bernoulli trial: we flip a coin, with probability alpha it's a boundary and with probability one minus alpha it's not. To generate a new sample for each boundary, we go through one boundary at a time, fix the state of all the other boundaries, generate the posterior distribution, and then sample whether it's a boundary or not.

The posterior, and sorry for the math, is this: we have the prior here, and then we consider all possible units that the segments on either side of this boundary might be, so this is the likelihood of the data given that this is a boundary. Then you consider the possibility that it is not a boundary: again there is the prior term, and then the likelihood of generating the entire merged segment, considering all possible models that you have. Those are your two posterior terms, and you sample from them to generate a new value for each boundary in the corpus.
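Here is a simplified sketch of resampling one boundary, assuming the log-likelihood of the two split sub-segments and of the merged segment under every existing unit has been computed. For brevity it ignores the new-unit option and any interaction with neighbouring boundaries, so it is not the full posterior used in the work; all names are illustrative.

```python
import numpy as np

def _logsumexp(x):
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def sample_boundary(left_logliks, right_logliks, merged_logliks,
                    counts, alpha_b, rng=np.random.default_rng()):
    """left/right_logliks: log p(sub-segment | unit k) if the boundary is on;
    merged_logliks: log p(merged segment | unit k) if it is off;
    counts: current segment count per unit; alpha_b: Bernoulli prior.
    Returns True if the boundary is kept on."""
    counts = np.asarray(counts, dtype=float)
    log_w = np.log(counts / counts.sum())        # prior over existing units

    def marginal(seg_logliks):                   # marginalise out the unit label
        return _logsumexp(log_w + np.asarray(seg_logliks))

    log_on = (np.log(alpha_b)
              + marginal(left_logliks) + marginal(right_logliks))
    log_off = np.log(1.0 - alpha_b) + marginal(merged_logliks)
    p_on = 1.0 / (1.0 + np.exp(log_off - log_on))   # sigmoid of the log-odds
    return bool(rng.random() < p_on)
```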
0:27:46 You iterate through this; I think it was twenty thousand iterations that we were running to generate all these parameters. The last thing I'll say, and this is near and dear to my heart, as anybody who has known me for a while knows I'm a big believer in landmark-based things, is that landmarks can help save a lot of computation. We're using some acoustic landmarks we developed, derived from spectral change; they're nice because they're language-independent, and they reduce the computation. And the thing is, as I'll show you later, this is just the initialisation: once you've learned the units, you can then go and train conventional models doing frame-based processing. So this is a heuristic to help you do the learning faster, but it seems to be effective.
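As an illustration of the landmark idea, a simple spectral-change detector can propose candidate boundaries, shrinking the set of boundary variables the sampler has to visit. This is only a generic sketch, not the specific landmark detector used in this work; the feature choice and spacing are assumptions.

```python
import numpy as np

def spectral_change_landmarks(feats, min_gap=3):
    """feats: (num_frames, dim) spectral features, e.g. MFCCs.
    Returns frame indices whose frame-to-frame change is a local maximum,
    spaced at least `min_gap` frames apart: candidate segment boundaries."""
    change = np.linalg.norm(np.diff(feats, axis=0), axis=1)  # change at each frame
    peaks = []
    for t in range(1, len(change) - 1):
        if change[t] >= change[t - 1] and change[t] > change[t + 1]:
            if not peaks or t - peaks[-1] >= min_gap:
                peaks.append(t)
    return peaks
```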
0:28:45 I don't want to dwell on the experiments too much. In this work we were analysing the TIMIT corpus, and we found a hundred and twenty-three units, so I'm going to have to debate that with the earlier speaker who suggested sixty-four; maybe it should be a hundred and twenty-eight next time.
0:29:02 I have a similar kind of plot showing the underlying phonetic label versus the unit index, and you can see we're covering the majority of the sounds, although we're devoting a little too much effort to modeling silence; I think we would benefit from a good speech activity detector. The interesting thing is when you start looking at some of the phones that have multiple units, like /ae/: there tends to be a different unit for particular contexts, so we are seeing some context dependency here. For example, /ae/ has a raising phenomenon in a word like 'champ' versus 'cap', and we're seeing a lot of those map to one particular unit, the /ae/'s followed by nasals, so there is some context-dependent behaviour going on.

How am I doing for time? I'm good? Okay, good.
0:29:59 I want to move on to the next step. Ultimately we really want to go all the way to learning words. We're not there yet, but what we've tried to do is enhance the model so that we can learn pronunciations from parallel speech and text data, and ideally do better than a graphone model. Okay, I'm going to go with your first answer.

Again, there's been work done in this area in the past, from people like Mari Ostendorf, who in the late nineties explored joint acoustic and lexicon discovery. The natural baseline for this work is a grapheme-based recognizer, the standard thing people do when there isn't a pronunciation dictionary but there is parallel text. I'm giving away a little bit of the punchline, but the formulation we have can reduce to a grapheme-based setup if you want; that's one particular constrained application of it.
0:31:06 To go through the intuition: we're setting up an additional latent structure here, beyond what we had before. We still have the unit sequence and the boundaries to learn, and now we also need to learn the grapheme-to-sound mappings. In the experiments we've done in English you can think of it as letter-to-sound if you like, but the framework generalizes to different languages.

Just to remind you where we started from: with the acoustic units alone, and some distribution over the likelihood of predicting a particular unit, you can go through and generate the speech frames as a sequence of these units. Now, if you have a word associated with the audio, like the word 'fly', we need to introduce new variables. We're not learning word pronunciations directly; we're actually learning these grapheme-to-sound mappings, because we think they'll generalize better across a corpus. You might eventually want to do word-specific pronunciations, but for now we represent this with another set of latent variables that are letter-specific, where we have a specific distribution for each letter. By the way, we will eventually try trigraphemes, context-dependent ones as well, but this is mono-grapheme for now. So you have letter-specific distributions: the S might prefer this particular acoustic model, and the L might prefer that particular one, et cetera; hopefully you get the idea. Those are the letter-specific mappings, and of course you need to couple these together, so that your general belief is related to the more context-specific ones. And you have an unknown set of units underlying it all, so these will also be Dirichlet-process-like priors.
0:33:07 So if you go through it: you take a letter, you use its particular distribution to select a particular unit, and that unit generates its frames; then you go to a different letter, with a different distribution, select a different unit, sample it, and generate frames, et cetera. That's roughly how it works, so this model is now a joint model for learning the units, the grapheme-to-sound mappings, and the underlying acoustic models.

One more wrinkle we have to deal with is that there isn't necessarily a one-to-one matching between graphemes and sounds; sometimes there is, but not always. So we introduce another variable that gives some flexibility as to the number of units a particular letter might map to; thinking of English here, we said zero, one, or two. And we have a little categorical distribution that predicts this for each letter-specific, or context-dependent letter-specific, distribution. So the letter X, for example, might be likely to be represented by two models, and you might learn that.
0:34:30 Let me go through this very quickly; I know I'm running out of time. Once you have a sequence of letters, you generate a sample for the number of units for each letter. Once you have that number of units, you have to pick an HMM acoustic model to sample from; these are position-dependent as well, so for an X that has two units we have position-one and position-two distributions that you would sample from. And then, once you've sampled the particular units, you generate speech frames from the appropriate HMM-GMM model.
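The generative walk just described might look like the following sketch, in which the distributions are treated as given even though in the real model they are latent and resampled by the Gibbs sweep. The zero/one/two choice per letter and the position-dependent distributions follow the talk; everything else, including names, is an illustrative assumption.

```python
import numpy as np

def sample_unit_sequence(word, n_units_dist, letter_unit_dist,
                         rng=np.random.default_rng()):
    """word: a string of letters (graphemes).
    n_units_dist: dict letter -> probabilities of emitting 0, 1 or 2 units.
    letter_unit_dist: dict (letter, position) -> probabilities over the K
        discovered acoustic units, with position 0 or 1.
    Returns the unit indices; each would then emit frames from its HMM-GMM."""
    units = []
    for letter in word:
        n = rng.choice(3, p=n_units_dist[letter])       # 0, 1 or 2 units
        for pos in range(n):
            p = letter_unit_dist[(letter, pos)]
            units.append(int(rng.choice(len(p), p=p)))
    return units
```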
0:35:16 So there are more latent variables to deal with now: the number of units per letter, the unit label identities, same as before, these letter-specific distributions, and the HMM-GMM parameters. When you do the inference, which I'm going to skip because I knew I wouldn't have time to go into it, you end up with a mapping all the way down from the letters to the segments. If you look at the top part, you can generate pronunciations for the words in your lexicon in terms of these units that you've learned, and of course you can train your HMM-GMM models, and, guess what, now you're in business to train a conventional speech recognizer with whatever technique you like.
0:36:11 This is similar to what Herb was talking about. As for the experiments we ran, I know it's very ironic that here I am talking about low-resource languages and all the experiments I'm showing you are in English; it's ongoing work, trust me. We've done some experiments on a weather corpus that we've had for a while and like working with. We used an eight-hour subset to learn the units and the pronunciations, and then we retrained a conventional recognizer on the entire training set using these units, and we compared. Our baseline with the expert dictionary is where we'd like to be, but certainly we would hope to do better than graphemes, and, to cut to the chase, we're in between the two right now. Graphemes are hard for English, as everybody knows, so it's nice that we're able to beat that.

We'd like to get to the supervised level, and I actually think there's a reasonable expectation that we can, because we've done some work on automatically learning pronunciations, throwing out the expert pronunciations and learning new ones based on graphone models, and we can get down to about eight point three percent. So I would hope that we could drive this down below the ten percent mark. It's still early days.
0:37:37 It's also hard to interpret exactly what you've actually learned, because the pronunciations are in terms of these unit numbers, which is not very intuitive, at least for me. But we can do things like look at all the words that have a 'sh' in them and see that they tend to share the same unit, so that's kind of encouraging. The other thing that Jackie has done recently is use Moses to translate between the two dictionaries, the pronunciations from the expert dictionary and the pronunciations in terms of these learned units. Here are six words: the yellow one on top is the expert pronunciation, and the blue one is the learned-unit pronunciation translated into phone-like units, so they're a little more interpretable by us. I think it's on the right track, I think there's something there; we need to do a lot more investigation, but I'm encouraged so far. So that's sort of where we are.
0:38:48 So, to wind up: I'm really encouraged to see more people doing unsupervised things. It's challenging, but I think we learn a lot by doing it, and I truly believe that in the end it will help us develop speech recognition capability for more of the world's languages. I'm also optimistic that these methods can potentially complement existing approaches; one of the things we're doing right now in the Babel framework is looking at these acoustic matching mechanisms as a means to rescore keyword hypotheses from conventional recognizers, and maybe that can be helpful. So I've shown you some of the progress we've been making in speech pattern discovery and topic clustering, and also in learning units, and as I mentioned, we're looking at other languages and we're augmenting the framework right now to try to learn the words themselves from the audio.
0:39:55 I guess I'll end by being a bit speculative. One of the speakers this morning said that the elephant in the room is text data; I actually think the elephant in the room is us, in fact all of humanity that has ever been and ever will be. We all learned language as toddlers. No one gives us a dictionary, no one gives us text; we have other things, but we figure it out. I think we would be better off in the long run, as a community, trying to think about how to get some of those capabilities into our systems. Anyway, thank you very much, and I'm done. Here are some references for anybody who's interested, and I'd be happy to talk with anybody.
0:41:02 Sorry, I went one over.

[Session chair] That was actually just about exactly on time. I just want to remind people really quickly that after this break, instead of having a full one-hour panel session, we're going to have a shortened panel session, preceded by a talk by the cognitive scientist I mentioned this morning, who works on human language. We have time for a few questions.
0:41:40 [Audience] I think this speech-unit approach based on the generative modeling framework is pretty nice, but the representations you discussed seem more like a discriminative approach based on neural networks, and I think that part is pretty important. Do you have any ideas about integrating these two kinds of approaches, generative and discriminative, into some nice framework?

Well, I don't know if it's nice, but first of all I totally agree with you. I sort of view this generative stuff we're trying to do as a way to get started, and then, as Herb said, I think if you can figure out the units and some pronunciations and get going, there's potential to bring in discriminative methods to sharpen the boundaries that you're learning. So that's the approach we're taking: use this to generate an initial speech recognizer, and, like the landmarks, we move away from it once we have an initial model. That's maybe not the best idea, but it's the one I have at the moment.
0:43:01 More questions?
0:43:10 [Audience] At the beginning of the talk you said that you were going for the places of interest by looking at similarities: by detecting the places where the speech signal is similar to some other place, you say this is a place of interest. But this is a bit contradictory to the information-theoretic view: when you have something that is unique, that occurs just once, it can carry lots of information, and that can be a place of interest. Could you comment on this?
0:43:43 That's very true. These patterns are complicated, because they're sequences of sounds, so they're probably a little more reliable to detect, but it could be that you have a very important word that only occurs once, and this method would not be able to find it. It's more a question of how reliably we think we can actually find things, and the longer the pattern is, the more reliable the detection is. It turns out, and that's why I mentioned the comparisons with TF-IDF, that we do seem to be able to find important content words in lectures using these methods. Now, the hidden thing, which was mentioned a little bit earlier, is that a lot of the common words are short, and that's why having a duration threshold, like half a second, is important; you guys might have even used a full second in the first paper, right? Okay. So half a second eliminates a lot of the very commonly occurring things that you don't want.

One other point I'd add is that the nonparametric Bayesian stuff is actually really good here: the Chinese restaurant process is happy to generate a new category for something that occurs just once or a very small number of times, so I think it's a really nice framework for language generally.
0:45:22 Other questions?
0:45:29 [Audience] One of the long-standing problems in this field is the choice of the observations themselves. Have you thought about designing something that would allow us to infer what the observations should be, based on these kinds of nonparametric distributions?

I missed the most important part: the observations?

[Audience] I don't believe anybody in this room listens to cepstra; that's just not the right thing. The question is what the right thing is. Should we acquire or analyze more data so that we could actually figure out what the observations should be?

You're talking about the input representation? These were all just based on MFCCs. It's a very good question. I actually think we would benefit from a better representation, one that more naturally captured the phone-like contrasts in the languages of the world; I think MFCCs are a rather blurry description of the signal, but I don't have any brilliant answer.
0:46:48 [Audience, partly inaudible] I just have a question about data. With respect to discovering things, is more data better? How much data do you ideally want to have? It seems like you might need a lot, and I wonder whether you've thought about how much of this is really about data collection versus really understanding what's going on.
0:47:51 I almost don't know the answer to that question. It seems like there's time for one more question.
0:48:02 [Inaudible question from the audience.]
0:48:36 There's a lot of data for English, but not, as you know, for all languages, and resources like Babel will be a tremendous legacy that people can evaluate on. I don't know how much data we need overall; I mean, the stuff I was talking about is at a very early stage.

[Session chair] Okay, well, let's all thank Jim.