0:00:15 Thanks, Karen. It's very nice to be here to talk about some of the low-resource work we've been doing in our group. The research I'll talk about today involves the work of several students in our group, whom I list here.

There's been a lot of talk just today about the low-resource issue, with good descriptions, so I won't belabour the point. My perspective on the problem is that current speech recognition technology will benefit from incorporating more unsupervised learning ideas.

This schematic shows a range of increasingly difficult tasks we could imagine, starting in the upper left with the conventional ASR approach, which has annotated resources, a pronunciation dictionary, and units, and moving to scenarios with fewer resources: parallel annotated speech and text, then independent speech and text, all the way down to having just speech. What can you do with that? That would be the case, for example, for an oral language like the ones we were talking about earlier this morning.
0:01:33 I think it's a challenging problem, but if we start to look at some of these ideas there will be a few benefits. First, it's a really interesting problem, and we will learn a lot just by trying to do it. Second, I think it will ultimately enable speech recognition for larger numbers of the world's languages. And it has the potential to complement existing techniques, and so even benefit languages that are already quite successful with conventional techniques.

In the time I have today I'm going to talk about two research areas that we've been exploring in our group. The first is speech pattern discovery, a method we've applied to various problems. It's an example of the zero-resource scenario: it could potentially work on any language in the world given just a body of speech, so it has a certain appeal, but no specific models are learned from it. So a newer line of research we've been starting is exploring methods to learn speech units, models of those units, and pronunciations, from either speech alone or from whatever limited resources are available. We're using a joint modeling framework to do this, and even though it's still very early days for this work, I believe it's quite promising. Those are the two things I'll touch on. It will be a fairly high-level overview because I don't have a lot of time, but hopefully you'll get the idea of some of the things we're trying to do.
0:03:12 The speech pattern work was motivated in part by humans: infant-learning researchers like Jenny Saffran have shown that just by exposing infants to a short stream of nonsense syllables, they very quickly learn where the word boundaries are, and what they've heard before versus what they haven't. So we asked: can we apply some of these ideas when we have a large body of speech but we don't have all the conventional paraphernalia that goes with conventional ASR? If we have a lot of speech, we could throw all of that out and simply look for repeating occurrences, instances of things like words, and there might be some interesting things we could do if we achieved that capability.

To describe it in terms of some speech: the idea was, if we have different chunks of audio, can we find the recurring words? These are two utterances with the word 'schizophrenia' in them. Can we develop a method to establish that those two are the same thing?
0:04:25 The approach that we took, which is fairly common sense I think, is that if you have a large body of audio, you compare every piece of audio with every other piece of audio. So for those two utterances we compute a distance matrix, which represents point-by-point distances. The idea is that when two spectral frames are the same, there will of course be a low distance, and when they are very dissimilar, a high distance.

The representations that we use have varied over time. We started off just using whitened MFCCs, which work fine if it's all the same speaker. We then went to unsupervised posteriorgram representations, derived both from Gaussian mixture models and also from DNNs trained in an unsupervised way; we've also done some things with Herb's self-organizing units, representing them as a posteriorgram. It really doesn't matter too much what you use.
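As a concrete illustration of the distance computation just described, here is a minimal sketch in Python with NumPy; the function name and the particular metrics are illustrative assumptions, not code from this work. It takes two utterances already represented as frame-by-feature matrices (whitened MFCCs or posteriorgram rows) and returns the point-by-point distance matrix.

```python
import numpy as np

def distance_matrix(x, y, metric="cosine"):
    """Point-by-point distances between two utterances.

    x, y: (num_frames, dim) arrays, e.g. whitened MFCCs or posteriorgram
    rows that sum to one.  Low values mark frames that sound alike.
    """
    if metric == "cosine":                     # reasonable for MFCCs
        xn = x / np.linalg.norm(x, axis=1, keepdims=True)
        yn = y / np.linalg.norm(y, axis=1, keepdims=True)
        return 1.0 - xn @ yn.T
    if metric == "neglog_dot":                 # a common choice for posteriorgrams
        eps = 1e-12
        return -np.log(np.clip(x @ y.T, eps, None))
    raise ValueError("unknown metric: %s" % metric)
```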
0:05:23 The interesting thing is that when we look at that picture, I think most of us can see right away that there is a diagonal of low distances: that's where the repeating pattern is. So that's what we try to find automatically. We developed a variation of dynamic time warping that we call segmental dynamic time warping, which basically consists of striping all the way through the audio corpus so that you eventually compare every piece with every other piece, and the warping path you are on eventually snaps into that low-distance alignment as you pass over it.

We call the region along the alignment path a fragment, and the weight in that region is the accumulated point-by-point distance; that's what we're trying to minimize. There are different ways to do this, but this particular illustration shows the point-by-point alignment of two stripes from the two utterances, and this is the distortion as a function of the frame-by-frame distances. Here, of course, is where the overlapping word 'schizophrenia' is, so along that warping path we establish some mechanism to find a low-distortion region. We were looking for things that were at least half a second long; it turns out the longer the constraint, the better this idea works, so an expression like 'Boston Red Sox' works really well. We've extended this a little bit, and when we do this, what we produce are aligned fragments.
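The following sketch captures only the intuition, not the segmental DTW algorithm itself: it scans each diagonal stripe of the distance matrix for the lowest-average-distortion run of at least a minimum length (the half-second constraint, expressed in frames). The real method additionally lets the alignment path warp off the diagonal within a band, and keeps every fragment below a distortion threshold rather than a single best one; the names and defaults here are assumptions.

```python
import numpy as np

def best_diagonal_fragment(dist, min_len=50):
    """Find the lowest-average-distortion diagonal run of `min_len` frames.

    dist: the point-by-point distance matrix from distance_matrix().
    Returns (avg_distortion, (row_start, col_start), min_len).
    """
    n, m = dist.shape
    best = (np.inf, None, min_len)
    for offset in range(-(n - 1), m):              # one stripe per diagonal
        diag = np.diagonal(dist, offset=offset)
        if len(diag) < min_len:
            continue
        csum = np.concatenate(([0.0], np.cumsum(diag)))
        win = (csum[min_len:] - csum[:-min_len]) / min_len   # sliding average
        k = int(np.argmin(win))
        if win[k] < best[0]:
            row = max(-offset, 0) + k              # where the fragment starts
            col = max(offset, 0) + k
            best = (float(win[k]), (row, col), min_len)
    return best
```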
0:07:07 People have modified this basic idea, because it is computationally fairly intensive. We've done some work on approximations that are guaranteed to be admissible, but other people, like Aren Jansen at JHU, have done some really nice work using visual processing concepts to significantly reduce the amount of computation involved. Although it turns out, I think, that some variant of the SDTW idea is not a bad way to refine the initial matches.

When you do this, you end up with pairs of utterances, and the things in red are example matches, the low-distortion matches found in your corpus. You can see that, depending on the parameters, like the width that you pick (we were aiming for word-level patterns, which is why we picked the half-second constraint), sometimes it's a word, sometimes it's multiple words, sometimes it's a fragment of a word, and sometimes it's something similar but not the same thing. This is the type of thing that you get out.
0:08:17 The interesting question then is, once you have all these pairwise matches for your corpus, how do you establish which things share the same underlying identity? You have to go to some sort of clustering notion, and this is what we call speech pattern discovery: we represent all of these pairwise matches in a graph structure that we can then do clustering on. I'll describe how we define the vertices in the graph in a second, but when you do, each region corresponds to a vertex, and the edges correspond to connections between the matched regions, and then you can try clustering.

Naturally, in the real world these clusters are inevitably connected to some extent, since the matches aren't perfect, but then you can apply your favourite clustering algorithm to find densely connected regions in the graph, and that's in fact what we did.
0:09:26 Let me very briefly show you one way that we did it; there are many ways to define the vertices in the graph. This illustration shows example pairwise matches: each little rectangle corresponds to a match, and the colour indicates which match it belongs to, so the blue rectangle, for example, is a region where we think a word matches something else said over here. Different colours mean different matches. If you actually look at what's going on at any point in time, though, it's messier than that, because you potentially have a whole lot of matches, and since each match is found independently, the start and end times will all probably be different.

What we did was summarize that collection of matches by summing up the similarities as a function of time. What you get is something time-varying with local maxima, and we define those local maxima as places of interest, where a lot of similarity matches are occurring. We use those places to define the nodes, or vertices, in our graph, and once you've defined the nodes, the matched pairs that overlap them define the edges. So, for example, the blue pair went from node one to node eight, so you'd make that connection in your graph. You can do that for all of the low-distortion matches that you have, and that's how you construct the graph. Then, as I said, you can apply a clustering algorithm to define clusters.
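Here is a rough sketch of that graph construction, assuming the pairwise matches are available as interval pairs with a similarity score; the input format, smoothing width, and edge threshold are made-up illustrations, and connected components after pruning weak edges stand in for "your favourite clustering algorithm".

```python
import numpy as np
import networkx as nx

def cluster_matches(matches, num_frames, smooth=25, min_weight=1.0):
    """matches: list of (utt_a, (s_a, e_a), utt_b, (s_b, e_b), similarity)
    with frame indices; num_frames: dict utterance -> length in frames.
    Returns clusters of (utterance, peak_frame) nodes."""
    # 1. per-utterance profile: sum similarity over all matched intervals
    profile = {u: np.zeros(n) for u, n in num_frames.items()}
    for ua, (sa, ea), ub, (sb, eb), sim in matches:
        profile[ua][sa:ea] += sim
        profile[ub][sb:eb] += sim

    # 2. nodes = local maxima of the (lightly smoothed) profile
    nodes, g = {}, nx.Graph()
    for u, p in profile.items():
        p = np.convolve(p, np.ones(smooth) / smooth, mode="same")
        peaks = [t for t in range(1, len(p) - 1)
                 if p[t] > 0 and p[t] >= p[t - 1] and p[t] > p[t + 1]]
        nodes[u] = peaks
        g.add_nodes_from((u, t) for t in peaks)

    # 3. edges: each match links the nodes it overlaps on either side
    for ua, (sa, ea), ub, (sb, eb), sim in matches:
        for a in ((ua, t) for t in nodes[ua] if sa <= t < ea):
            for b in ((ub, t) for t in nodes[ub] if sb <= t < eb):
                if g.has_edge(a, b):
                    g[a][b]["weight"] += sim
                else:
                    g.add_edge(a, b, weight=sim)

    # 4. drop weak edges, then take densely connected pieces
    g.remove_edges_from([(a, b) for a, b, d in g.edges(data=True)
                         if d["weight"] < min_weight])
    return [c for c in nx.connected_components(g) if len(c) > 1]
```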
0:11:17 Let me show you an example on a lecture that was recorded at MIT. We had four matches here, at different places in the recording, and there was a nice little cluster here that I'll show you; hopefully I can play you some examples. [plays audio clip] Those were four things played at the same time, but basically this speaker was talking about variations of 'search engine optimizer' and 'search engine optimizing', and the cluster actually spanned the word 'optimize', the common acoustics in the cluster.

So this is an example of the type of thing that you get. Interestingly, all of these words tended to occur near each other in the lecture, so you can actually do topic segmentation based on the time-varying nature of these clusters over the course of a long audio recording; we've done that work, though I'm not going to talk about it today.
0:12:19 I can show you some other examples we've tried on different languages. This is a Lebanese interview that we recorded; it actually has two people talking, one of them using Levantine Arabic and the other speaking MSA. The algorithm doesn't care, it doesn't know, it's completely oblivious: it's just looking for things that look like they're the same. Here's the cluster that it got. [plays audio clip] There's another one from a Mandarin lecture that we applied it to. You get the idea: it's finding these acoustic chunks, and they're essentially the same thing.
0:13:00 Now, when you do this over a single large body of audio, like a lecture, you get a bunch of these different clusters, and it's interesting to look at what the underlying identity of each cluster is. You can see that a lot of the terms are important content words; in fact, in the study we did, we were finding about eighty-five percent of the top twenty TF-IDF terms in lectures, which is an indication that the clusters are finding potentially useful pieces of information. I guess one of the motivations for this is that if there's a word that's important in a conversation or a lecture, it will probably be said multiple times, and that gives us a chance to find it. That's not always the case, but we need it to be for this type of technique to work.
0:13:52 One of the things we've done recently, in addition to working within one particular document, is to look at the relationship between these unsupervised patterns across different documents. Like the topic ID work an earlier speaker was talking about, we can do unsupervised topic clustering based on the relationship of these unsupervised words across different documents.

Just to visualise that a little bit: each of these grey rectangles is a different document, and the darker grey rectangles are speech patterns that we found in the unsupervised way; the connections are places where patterns are linked to each other by a low-distortion match. If, for example, you have this kind of distribution of your clusters, then you might want to say that the two clusters on the right belong together because of the connections between the unsupervised terms we found, and the three on the left are in the same class. Again, this is all done unsupervised.
0:15:04 To do this we tried a couple of different methods, but the one that was most successful had a latent model for topics and words; that's the plate notation on the right. The observed variables are the documents and what we call the link structure: for each interval in a document where we found a pattern, the link structure is just the set of connections to all the other patterns found in all the other documents. The latent variables are words, which have a certain distribution over links, and topics, which have a certain distribution over words, and you can learn this model with an EM-style algorithm.
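I won't reproduce the exact link-based latent model; the following is only a simplified, PLSA-style stand-in that captures the spirit: documents get topic mixtures, topics get distributions over the discovered speech patterns (pseudo-terms), and both are learned with EM. The counts matrix and all names here are assumptions for illustration.

```python
import numpy as np

def topic_em(counts, num_topics, iters=100, seed=0):
    """counts: (num_docs, num_terms) occurrences of each discovered
    pattern in each document.  Returns p(topic|doc) and p(term|topic)."""
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    p_t_d = rng.dirichlet(np.ones(num_topics), size=D)      # (D, K)
    p_w_t = rng.dirichlet(np.ones(V), size=num_topics)      # (K, V)
    for _ in range(iters):
        # E-step: responsibility of each topic for each (doc, term) pair
        joint = p_t_d[:, :, None] * p_w_t[None, :, :]        # (D, K, V)
        resp = joint / np.clip(joint.sum(axis=1, keepdims=True), 1e-12, None)
        # M-step: re-estimate both distributions from expected counts
        exp_counts = resp * counts[:, None, :]                # (D, K, V)
        p_t_d = exp_counts.sum(axis=2)
        p_t_d /= np.clip(p_t_d.sum(axis=1, keepdims=True), 1e-12, None)
        p_w_t = exp_counts.sum(axis=0)
        p_w_t /= np.clip(p_w_t.sum(axis=1, keepdims=True), 1e-12, None)
    return p_t_d, p_w_t
```

Each document can then be assigned to its highest-probability topic, which is the unsupervised analogue of the six-way clustering described next.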
0:16:04 The interesting thing is that we did some experiments on the Fisher corpus: sixty conversations spanning six different topics. We seeded this with about thirteen hundred initial clusters, and we told it to find six clusters, so that was kind of cheating, but these are the resulting clusters that we found.

What's interesting is that when you look at the underlying speech patterns associated with these topics, they actually make a fair amount of sense, which is nice. You find that there are relevant words and there are irrelevant words, the kind of things you might put on a stop-word list if you were working with text, so it would be nice to get rid of those.

Let me show you some of the others: the green one was on minimum wage, another one is on computers and education, and the purple one is kind of interesting. When you look at the distribution of the true underlying topic labels, some of them are pretty good: there's holidays, and computers in education, which in fact got split into two. It's kind of intriguing to me that corporate conduct and illness were mapped into the same cluster; maybe that's telling us something. It's still early days, but there are a lot of things you can potentially do with these kinds of unsupervised methods that are not conventional; others have shown some nice examples too, and this is another one.
0:17:39 I want to move on and talk about some of the newer work that we're doing, where we're trying to learn models, learn speech units, and learn pronunciations: really trying to get rid of the dictionary, or at least learn some methods for producing it. It's interesting: we pride ourselves on the ignorance models we've developed with HMMs to model everything we don't know about speech, and yet the dictionary is still our crutch. It's typically made by humans, and hours and hours are spent tweaking these things and getting rid of all the inconsistencies; anybody who has worked on one knows it's hard work and takes a long time. Mary can tell you about the amount of effort that goes into making the dictionaries for the Babel program; it's not a trivial effort. Why is it that we can't learn the units and learn the pronunciations automatically? We do everything else. I think it's time we looked into this.

So this is the type of thing we're trying to do: what can we do from speech alone, or, if you have some text, can that help you with pronunciations? We're trying to do both of these things now in our work.
0:18:54 There's prior work in this area dating all the way back to the eighties, when people were trying acoustic-based approaches; a more recent example is Herb's work with self-organizing units, and there's been other work coming out of Johns Hopkins that is very interesting as well.

The approach that we've been taking is motivated by something that is becoming more popular in machine learning: a Bayesian framework for inference. In particular, Sharon Goldwater, who is now at the University of Edinburgh, had a really nice paper on learning word segmentation from phonetic transcriptions, so from symbolic input, and more recent work has tried to do it from noisier phonetic input. We wanted to modify this kind of model so that we could learn from the speech itself, and that's what I'll talk about now. We've recently extended it to try to learn word pronunciations as well.
0:20:01 We all know what the challenges are when you're trying to learn what the speech units are. First of all, as the last questioner said, we don't know how many units there are; maybe there are sixty-four, maybe there aren't. We don't know what they are, and we don't know where they are. So there are a lot of unknowns we're trying to figure out about the units. What we're trying to do, given speech, and in this work only speech, is discover the inventory of units and build a model for each of them.

As I said, we're formulating this in a kind of mathematical framework that is less familiar to the speech community, where we have a set of latent variables that include the boundaries, the units, and the segments, and then there's a conventional HMM-GMM model for each unit, which I'll describe shortly. In this initial work we didn't want to fix the number of units in advance, so we represented it with what's known as a Chinese restaurant process, a Dirichlet process prior, so that there's a finite chance of generating a new unit every time you iterate through.
0:21:15 Let me walk through this at a high level, sort of channelling the generative story of how we generate an utterance with this mixture of HMM-GMMs. Basically, underlying it we have a set of K models, and for a particular set of frames one of them is selected and generates a certain number of speech frames; then you transition to another one, which generates more frames, et cetera, as you go through the entire utterance.

So that just describes our model, but what are the latent variables? First of all, we don't know where the transitions in the speech are between one unit and another, so the b's, the boundaries, are one set of latent variables. We don't know what the labels are, or the inventory of labels, so the c's shown in purple down below are also a set of unknown variables. We don't know, of course, the parameters of our HMM-GMM models, so those are unknown as well. And lastly, as I mentioned, we don't know how many units there are, so that is an unknown too, and that last piece is what we're modeling with the Dirichlet process.
0:22:45 The learning procedure does the inference with Gibbs sampling, so it's an iterative process. We initially select values for the boundary variables, which I'll talk about shortly; for now, think of it as an initial segmentation, and we have an initial prior distribution for the parameters. Then we go through the corpus one segment at a time, where a segment is defined as the chunk of frames between boundaries, and based on the posterior distribution we sample a value: for each segment we sample the identity of the unit, the c variable for that particular segment. Let me say something about that here.
0:23:37 As I mentioned, this is a Chinese restaurant process, because there's a finite chance of defining a new unit. For those of you unfamiliar with it, the analogy is of people going into a Chinese restaurant and trying to decide which table to sit down at, since each table can seat multiple customers. In this notation each segment is a customer, and it has to decide which table to sit at. Each table has an index, which corresponds to the identity of one of the models; think of each table as belonging to a different unit. What you want is the posterior probability of assigning a particular unit label to each segment, and that is basically proportional to the likelihood of that particular segment's data being generated by that particular unit's HMM-GMM, weighted by a prior probability, which just corresponds to the number of customers already at that table, normalized by the total number of segments that you have. You'll also notice that a little bit of probability is stolen away from each one and set aside for the possibility that you generate a new unit. Once you have these posterior distributions set up, you sample, and that's the value of c for that particular segment at that particular iteration.
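A minimal sketch of that sampling step, assuming the segment's log-likelihood under each existing unit has already been computed; how the new-unit likelihood is obtained (here a base-measure marginal passed in by the caller) is an assumption, and the details differ from the actual model.

```python
import numpy as np

def sample_unit_label(seg_logliks, counts, alpha, new_unit_loglik,
                      rng=np.random.default_rng()):
    """Draw a unit label for one segment under a Chinese-restaurant prior.

    seg_logliks: log p(segment | HMM-GMM of unit k) for the K existing units.
    counts:      segments currently assigned to each unit ("customers per table").
    alpha:       Dirichlet-process concentration, the mass reserved for a new unit.
    new_unit_loglik: log marginal likelihood of the segment under the prior.
    Returns an index in 0..K, where K means "create a new unit".
    """
    log_prior = np.log(np.append(counts, alpha))      # existing tables + new one
    log_post = log_prior + np.append(seg_logliks, new_unit_loglik)
    log_post -= log_post.max()                        # numerical stability
    probs = np.exp(log_post)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```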
0:25:22 Once you have a unit label for that segment, you then go through and resample the HMM parameters, and this is on our home turf: it's a conventional HMM-GMM; we're using an eight-component mixture model and the usual three-state left-to-right topology, so it's all very familiar. We assume a segment has to start in state one and end in state three, but for the frames in between we draw samples to determine which state you're in; once you have the state sequence, we draw samples to see which mixture component you're drawing from, and then we can update the parameters based on that.
0:26:05 The last thing I want to say is that we also have to consider different segmentations, the choices of the b's, the boundary variables. Naively, every frame could be a boundary, and the variables take binary values: either a frame is a boundary or it's not, zero or one. To put this into a probabilistic formulation we again have a prior and a posterior probability. The prior is just a Bernoulli trial: we flip a coin, with probability alpha it's a boundary and with probability one minus alpha it's not. To generate a new sample for each boundary, we go through one boundary at a time, fix the state of all the other boundaries, generate the posterior distribution, and then sample whether it's a boundary or not.

The posterior, and sorry for the math, is this: we have the prior here, and then we consider all possible units that the segments on either side of this boundary might be, so this is the likelihood of the data given that this is a boundary. Then you consider the possibility that it is not a boundary: again there is the prior term, and then the likelihood of generating the entire merged segment, considering all possible models that you have. Those are your two posterior terms, and you sample from them to generate a new value for each boundary in the corpus.
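Here is a simplified sketch of resampling one boundary, assuming the log-likelihood of the two split sub-segments and of the merged segment under every existing unit has been computed. For brevity it ignores the new-unit option and any interaction with neighbouring boundaries, so it is not the full posterior used in the work; all names are illustrative.

```python
import numpy as np

def _logsumexp(x):
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def sample_boundary(left_logliks, right_logliks, merged_logliks,
                    counts, alpha_b, rng=np.random.default_rng()):
    """left/right_logliks: log p(sub-segment | unit k) if the boundary is on;
    merged_logliks: log p(merged segment | unit k) if it is off;
    counts: current segment count per unit; alpha_b: Bernoulli prior.
    Returns True if the boundary is kept on."""
    counts = np.asarray(counts, dtype=float)
    log_w = np.log(counts / counts.sum())        # prior over existing units

    def marginal(seg_logliks):                   # marginalise out the unit label
        return _logsumexp(log_w + np.asarray(seg_logliks))

    log_on = (np.log(alpha_b)
              + marginal(left_logliks) + marginal(right_logliks))
    log_off = np.log(1.0 - alpha_b) + marginal(merged_logliks)
    p_on = 1.0 / (1.0 + np.exp(log_off - log_on))   # sigmoid of the log-odds
    return bool(rng.random() < p_on)
```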
0:27:46 You iterate through this; I think it was twenty thousand iterations that we were running to generate all these parameters. The last thing I'll say, and this is near and dear to my heart, as anybody who has known me for a while knows I'm a big believer in landmark-based things, is that landmarks can help save a lot of computation. We're using some acoustic landmarks we developed, derived from spectral change; they're nice because they're language-independent, and they reduce the computation. And the thing is, as I'll show you later, this is just the initialisation: once you've learned the units, you can then go and train conventional models doing frame-based processing. So this is a heuristic to help you do the learning faster, but it seems to be effective.
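As an illustration of the landmark idea, a simple spectral-change detector can propose candidate boundaries, shrinking the set of boundary variables the sampler has to visit. This is only a generic sketch, not the specific landmark detector used in this work; the feature choice and spacing are assumptions.

```python
import numpy as np

def spectral_change_landmarks(feats, min_gap=3):
    """feats: (num_frames, dim) spectral features, e.g. MFCCs.
    Returns frame indices whose frame-to-frame change is a local maximum,
    spaced at least `min_gap` frames apart: candidate segment boundaries."""
    change = np.linalg.norm(np.diff(feats, axis=0), axis=1)  # change at each frame
    peaks = []
    for t in range(1, len(change) - 1):
        if change[t] >= change[t - 1] and change[t] > change[t + 1]:
            if not peaks or t - peaks[-1] >= min_gap:
                peaks.append(t)
    return peaks
```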
0:28:45 I don't want to dwell on the experiments too much. In this work we were analysing the TIMIT corpus, and we found a hundred and twenty-three units, so I'm going to have to debate that with the earlier speaker who suggested sixty-four; maybe it should be a hundred and twenty-eight next time.
0:29:02 I have a similar kind of plot showing the underlying phonetic label versus the unit index, and you can see we're covering the majority of the sounds, although we're devoting a little too much effort to modeling silence; I think we would benefit from a good speech activity detector. The interesting thing is when you start looking at some of the phones that have multiple units, like /ae/: there tends to be a different unit for particular contexts, so we are seeing some context dependency here. For example, /ae/ has a raising phenomenon in a word like 'champ' versus 'cap', and we're seeing a lot of those map to one particular unit, the /ae/'s followed by nasals, so there is some context-dependent behaviour going on.

How am I doing for time? I'm good? Okay, good.
0:29:59 I want to move on to the next step. Ultimately we really want to go all the way to learning words. We're not there yet, but what we've tried to do is enhance the model so that we can learn pronunciations from parallel speech and text data, and ideally do better than a graphone model. Okay, I'm going to go with your first answer.

Again, there's been work done in this area in the past, from people like Mari Ostendorf, who in the late nineties explored joint acoustic and lexicon discovery. The natural baseline for this work is a grapheme-based recognizer, the standard thing people do when there isn't a pronunciation dictionary but there is parallel text. I'm giving away a little bit of the punchline, but the formulation we have can reduce to a grapheme-based setup if you want; that's one particular constrained application of it.
0:31:06 To go through the intuition: we're setting up an additional latent structure here, beyond what we had before. We still have the unit sequence and the boundaries to learn, and now we also need to learn the grapheme-to-sound mappings. In the experiments we've done in English you can think of it as letter-to-sound if you like, but the framework generalizes to different languages.

Just to remind you where we started from: with the acoustic units alone, and some distribution over the likelihood of predicting a particular unit, you can go through and generate the speech frames as a sequence of these units. Now, if you have a word associated with the audio, like the word 'fly', we need to introduce new variables. We're not learning word pronunciations directly; we're actually learning these grapheme-to-sound mappings, because we think they'll generalize better across a corpus. You might eventually want to do word-specific pronunciations, but for now we represent this with another set of latent variables that are letter-specific, where we have a specific distribution for each letter. By the way, we will eventually try trigraphemes, context-dependent ones as well, but this is mono-grapheme for now. So you have letter-specific distributions: the S might prefer this particular acoustic model, and the L might prefer that particular one, et cetera; hopefully you get the idea. Those are the letter-specific mappings, and of course you need to couple these together, so that your general belief is related to the more context-specific ones. And you have an unknown set of units underlying it all, so these will also be Dirichlet-process-like priors.
0:33:07 So if you go through it: you take a letter, you use its particular distribution to select a particular unit, and that unit generates its frames; then you go to a different letter, with a different distribution, select a different unit, sample it, and generate frames, et cetera. That's roughly how it works, so this model is now a joint model for learning the units, the grapheme-to-sound mappings, and the underlying acoustic models.

One more wrinkle we have to deal with is that there isn't necessarily a one-to-one matching between graphemes and sounds; sometimes there is, but not always. So we introduce another variable that gives some flexibility as to the number of units a particular letter might map to; thinking of English here, we said zero, one, or two. And we have a little categorical distribution that predicts this for each letter-specific, or context-dependent letter-specific, distribution. So the letter X, for example, might be likely to be represented by two models, and you might learn that.
0:34:30 Let me go through this very quickly; I know I'm running out of time. Once you have a sequence of letters, you generate a sample for the number of units for each letter. Once you have that number of units, you have to pick an HMM acoustic model to sample from; these are position-dependent as well, so for an X that has two units we have position-one and position-two distributions that you would sample from. And then, once you've sampled the particular units, you generate speech frames from the appropriate HMM-GMM model.
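The generative walk just described might look like the following sketch, in which the distributions are treated as given even though in the real model they are latent and resampled by the Gibbs sweep. The zero/one/two choice per letter and the position-dependent distributions follow the talk; everything else, including names, is an illustrative assumption.

```python
import numpy as np

def sample_unit_sequence(word, n_units_dist, letter_unit_dist,
                         rng=np.random.default_rng()):
    """word: a string of letters (graphemes).
    n_units_dist: dict letter -> probabilities of emitting 0, 1 or 2 units.
    letter_unit_dist: dict (letter, position) -> probabilities over the K
        discovered acoustic units, with position 0 or 1.
    Returns the unit indices; each would then emit frames from its HMM-GMM."""
    units = []
    for letter in word:
        n = rng.choice(3, p=n_units_dist[letter])       # 0, 1 or 2 units
        for pos in range(n):
            p = letter_unit_dist[(letter, pos)]
            units.append(int(rng.choice(len(p), p=p)))
    return units
```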
0:35:16 So there are more latent variables to deal with now: the number of units per letter, the unit label identities, same as before, these letter-specific distributions, and the HMM-GMM parameters. When you do the inference, which I'm going to skip because I knew I wouldn't have time to go into it, you end up with a mapping all the way down from the letters to the segments. If you look at the top part, you can generate pronunciations for the words in your lexicon in terms of these units that you've learned, and of course you can train your HMM-GMM models, and, guess what, now you're in business to train a conventional speech recognizer with whatever technique you like.
0:36:11 This is similar to what Herb was talking about. As for the experiments we ran, I know it's very ironic that here I am talking about low-resource languages and all the experiments I'm showing you are in English; it's ongoing work, trust me. We've done some experiments on a weather corpus that we've had for a while and like working with. We used an eight-hour subset to learn the units and the pronunciations, and then we retrained a conventional recognizer on the entire training set using these units, and we compared. Our baseline with the expert dictionary is where we'd like to be, but certainly we would hope to do better than graphemes, and, to cut to the chase, we're in between the two right now. Graphemes are hard for English, as everybody knows, so it's nice that we're able to beat that.

We'd like to get to the supervised level, and I actually think there's a reasonable expectation that we can, because we've done some work on automatically learning pronunciations, throwing out the expert pronunciations and learning new ones based on graphone models, and we can get down to about eight point three percent. So I would hope that we could drive this down below the ten percent mark. It's still early days.
0:37:37 It's also hard to interpret exactly what you've actually learned, because the pronunciations are in terms of these unit numbers, which is not very intuitive, at least for me. But we can do things like look at all the words that have a 'sh' in them and see that they tend to share the same unit, so that's kind of encouraging. The other thing that Jackie has done recently is use Moses to translate between the two dictionaries, the pronunciations from the expert dictionary and the pronunciations in terms of these learned units. Here are six words: the yellow one on top is the expert pronunciation, and the blue one is the learned-unit pronunciation translated into phone-like units, so they're a little more interpretable by us. I think it's on the right track, I think there's something there; we need to do a lot more investigation, but I'm encouraged so far. So that's sort of where we are.
0:38:48 So, to wind up: I'm really encouraged to see more people doing unsupervised things. It's challenging, but I think we learn a lot by doing it, and I truly believe that in the end it will help us develop speech recognition capability for more of the world's languages. I'm also optimistic that these methods can potentially complement existing approaches; one of the things we're doing right now in the Babel framework is looking at these acoustic matching mechanisms as a means to rescore keyword hypotheses from conventional recognizers, and maybe that can be helpful. So I've shown you some of the progress we've been making in speech pattern discovery and topic clustering, and also in learning units, and as I mentioned, we're looking at other languages and we're augmenting the framework right now to try to learn the words themselves from the audio.
0:39:55 I guess I'll end by being a bit speculative. One of the speakers this morning said that the elephant in the room is text data; I actually think the elephant in the room is us, in fact all of humanity that has ever been and ever will be. We all learned language as toddlers. No one gives us a dictionary, no one gives us text; we have other things, but we figure it out. I think we would be better off in the long run, as a community, trying to think about how to get some of those capabilities into our systems. Anyway, thank you very much, and I'm done. Here are some references for anybody who's interested, and I'd be happy to talk with anybody.
0:41:02 Sorry, I went one over.

[Session chair] That was actually just about exactly on time. I just want to remind people really quickly that after this break, instead of having a full one-hour panel session, we're going to have a shortened panel session, preceded by a talk by the cognitive scientist I mentioned this morning, who works on human language. We have time for a few questions.
0:41:40 [Audience] I think this speech-unit approach based on the generative modeling framework is pretty nice, but the representations you discussed seem more like a discriminative approach based on neural networks, and I think that part is pretty important. Do you have any ideas about integrating these two kinds of approaches, generative and discriminative, into some nice framework?

Well, I don't know if it's nice, but first of all I totally agree with you. I sort of view this generative stuff we're trying to do as a way to get started, and then, as Herb said, I think if you can figure out the units and some pronunciations and get going, there's potential to bring in discriminative methods to sharpen the boundaries that you're learning. So that's the approach we're taking: use this to generate an initial speech recognizer, and, like the landmarks, we move away from it once we have an initial model. That's maybe not the best idea, but it's the one I have at the moment.
0:43:01 More questions?
0:43:10 [Audience] At the beginning of the talk you said that you were going for the places of interest by looking at similarities: by detecting the places where the speech signal is similar to some other place, you say this is a place of interest. But this is a bit contradictory to the information-theoretic view: when you have something that is unique, that occurs just once, it can carry lots of information, and that can be a place of interest. Could you comment on this?
0:43:43 That's very true. These patterns are complicated, because they're sequences of sounds, so they're probably a little more reliable to detect, but it could be that you have a very important word that only occurs once, and this method would not be able to find it. It's more a question of how reliably we think we can actually find things, and the longer the pattern is, the more reliable the detection is. It turns out, and that's why I mentioned the comparisons with TF-IDF, that we do seem to be able to find important content words in lectures using these methods. Now, the hidden thing, which was mentioned a little bit earlier, is that a lot of the common words are short, and that's why having a duration threshold, like half a second, is important; you guys might have even used a full second in the first paper, right? Okay. So half a second eliminates a lot of the very commonly occurring things that you don't want.

One other point I'd add is that the nonparametric Bayesian stuff is actually really good here: the Chinese restaurant process is happy to generate a new category for something that occurs just once or a very small number of times, so I think it's a really nice framework for language generally.
0:45:22 Other questions?
0:45:29 [Audience] One of the long-standing problems in this field is the choice of the observations themselves. Have you thought about designing something that would allow us to infer what the observations should be, based on these kinds of nonparametric distributions?

I missed the most important part: the observations?

[Audience] I don't believe anybody in this room listens to cepstra; that's just not the right thing. The question is what the right thing is. Should we acquire or analyze more data so that we could actually figure out what the observations should be?

You're talking about the input representation? These were all just based on MFCCs. It's a very good question. I actually think we would benefit from a better representation, one that more naturally captured the phone-like contrasts in the languages of the world; I think MFCCs are a rather blurry description of the signal, but I don't have any brilliant answer.
0:46:48 [Audience, partly inaudible] I just have a question about data. With respect to discovering things, is more data better? How much data do you ideally want to have? It seems like you might need a lot, and I wonder whether you've thought about how much of this is really about data collection versus really understanding what's going on.
0:47:51 I almost don't know the answer to that question. It seems like there's time for one more question.
0:48:02 [Inaudible question from the audience.]
0:48:36 There's a lot of data for English, but not, as you know, for all languages, and resources like Babel will be a tremendous legacy that people can evaluate on. I don't know how much data we need overall; I mean, the stuff I was talking about is at a very early stage.

[Session chair] Okay, well, let's all thank Jim.