Speech Transcript - Selected poster summaries, panel discussion

0:00:15	i wish so where
0:00:17	well my paper was acoustic unit discovery and pronunciation generation from a grapheme based lexicon
0:00:23	this is work done with anindya roy lori lamel and jean-luc
0:00:28	gauvain at limsi
0:00:31	so the basic premise was
0:00:34	given a task where you just have parallel text and speech but no pronunciation
0:00:39	dictionary or knowledge of a phonetic set or acoustic units
0:00:42	can you learn the acoustic units and then the pronunciations for words using those discovered
0:00:47	acoustic units
0:00:49	so we broke the task into two parts where the first stage was learning the
0:00:54	acoustic units and the second stage was learning the pronunciations from them
0:00:58	with the hope that doing each one individually would improve performance and then together
0:01:03	would make things even better
0:01:06	so we start from the assumption that you can at least train a grapheme based
0:01:09	speech recognizer that will at least produce some reasonable performance
0:01:16	once you train a context dependent hmm based recognizer we then cluster the hmms using
0:01:24	spectral clustering
0:01:27	into some preset number of acoustic units
0:01:30	and from there we have a direct mapping from a context dependent trigrapheme down
0:01:38	to one of these new acoustic units so using the original grapheme based lexicon you
0:01:44	can automatically map each pronunciation to the new acoustic units
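A minimal python sketch of the lexicon mapping just described, assuming the clustering step has already produced a table from context dependent trigraphemes to unit ids. The table contents, the unit names and the `#` word boundary symbol are made up for illustration and are not taken from the talk.

```python
def trigraphemes(word, boundary="#"):
    """Yield the (left, center, right) grapheme context for each letter."""
    padded = boundary + word + boundary
    for i in range(1, len(padded) - 1):
        yield (padded[i - 1], padded[i], padded[i + 1])


def map_pronunciation(word, cluster_of):
    """Rewrite a grapheme pronunciation as a sequence of discovered units."""
    return [cluster_of[tri] for tri in trigraphemes(word)]


# hypothetical output of the clustering step: trigrapheme -> acoustic unit id
cluster_of = {
    ("#", "c", "a"): "u7",
    ("c", "a", "t"): "u2",
    ("a", "t", "#"): "u5",
}

pron = map_pronunciation("cat", cluster_of)
print(pron)
```

Because the mapping is a plain table lookup, every word in the grapheme lexicon gets exactly one pronunciation in the new units, which is the starting point that the later steps try to improve on.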
0:01:49	and performing recognition with that system does produce some small gain but
0:01:55	it was our belief that
0:01:57	using those acoustic units in that manner might not be the best way to do
0:02:02	it there might be ways of getting
0:02:04	better pronunciations
0:02:07	so we took a machine translation based approach to transform those pronunciations into a new
0:02:13	set
0:02:13	so using the newly discovered acoustic units we decode the training data to generate
0:02:20	a set of pronunciation hypotheses for each word
0:02:24	and from there using moses you can learn a set of phrase translations or
0:02:30	basically rules to translate a set of units into another set of units
0:02:36	so using that phrase translation table you can then transform the original lexicon
0:02:40	into a new one
0:02:42	unfortunately using that directly actually makes performance significantly worse mainly because there's
0:02:49	a lot of noise in the pronunciations that are hypothesized
0:02:53	so we rescore the rules in the phrase table by applying each rule individually
0:02:57	and then through forced alignment see how it affects the log-likelihood of the training data
0:03:05	and we found that if we just prune the phrase table keeping the rules that
0:03:08	improve the likelihood of the data and then transform the final lexicon we end up
0:03:15	with improved performance once we get the final lexicon we start over from the beginning and
0:03:19	train the system up again before recognition
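The pruning step can be sketched in python as follows. This is only an illustration under stated assumptions: rules are plain (source units, target units) pairs rather than entries in the moses phrase table format, the log-likelihood values are invented numbers standing in for forced alignment results, and `apply_rules` is a simple greedy left to right rewrite, not the moses decoder.

```python
def prune_rules(rules, loglik_with_rule, baseline_loglik):
    """Keep only rules that, applied on their own, raise the
    forced-alignment log-likelihood above the baseline."""
    return [r for r in rules if loglik_with_rule[r] > baseline_loglik]


def apply_rules(pron, rules):
    """Greedy left-to-right rewrite; the first matching rule wins at each position.
    Rule sources are assumed to be non-empty."""
    out, i = [], 0
    while i < len(pron):
        for src, tgt in rules:
            if tuple(pron[i:i + len(src)]) == src:
                out.extend(tgt)
                i += len(src)
                break
        else:
            # no rule matched here, copy the unit unchanged
            out.append(pron[i])
            i += 1
    return out


# hypothetical rules and hypothetical log-likelihoods
rules = [(("u2", "u5"), ("u9",)), (("u7",), ("u1",))]
loglik_with_rule = {rules[0]: -1200.0, rules[1]: -1310.5}
baseline_loglik = -1250.0

kept = prune_rules(rules, loglik_with_rule, baseline_loglik)
new_pron = apply_rules(["u7", "u2", "u5"], kept)
print(kept, new_pron)
```

Scoring each rule in isolation keeps the check cheap, at the cost of ignoring interactions between rules.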
0:03:27	okay might be better
0:03:29	to
0:03:32	okay my paper was
0:03:34	a hierarchical system for word discovery exploiting dtw based initialization and what we
0:03:40	basically did was
0:03:42	an unsupervised word discovery task on continuous speech in our case we use a completely
0:03:48	zero-resourced setup meaning we only have the audio data and no other information
0:03:54	so
0:03:55	for the task we used a small vocabulary task in our case just the ti-digits
0:04:00	database with eleven digits and we used a hierarchical system for the word discovery
0:04:06	hierarchical in this case means we have two levels on the
0:04:10	first level we have the acoustic unit discovery so
0:04:15	trying to discover acoustic units as a sequence of feature vectors it's basically the same
0:04:20	what use was presenting today
0:04:23	and on the second level we do the discovery of words as
0:04:26	sequences of the acoustic units which is basically analogous to learning the pronunciation lexicon so
0:04:33	for the first level as i said we're doing the acoustic unit discovery it is
0:04:38	similar to self organizing units and what we basically do is segment our input
0:04:45	cluster all segments to get an initial transcription of the data and then do an
0:04:49	iterative hmm training of acoustic models for the acoustic units
0:04:54	and finally we get a transcription of our audio data as a sequence of acoustic units
0:05:01	and for the second level we do the word discovery which
0:05:04	means that we are trying to discover words in an unsupervised way on our sequence
0:05:10	of acoustic units
0:05:12	so what we do there is we use a probabilistic pronunciation lexicon because obviously our
0:05:18	acoustic unit sequence will be very noisy so we cannot use a one-to-one mapping we
0:05:23	need some probabilistic mapping and to do this mapping we use an hmm for each
0:05:29	word where the hmm
0:05:31	states have discrete emission distributions over the acoustic units
0:05:38	so
0:05:41	additionally the transition probabilities of the hmm are governed by a not the distributions so
0:05:47	that you have kind of things modeling
0:05:49	and
0:05:51	the parameters of these hmms are learned in an unsupervised way we do this by
0:05:57	for example when we want to
0:05:59	search for n words we instantiate n
0:06:04	word hmms connect them by a language model in our case a unigram language model
0:06:08	and then simply estimate the parameters using an em algorithm and we look at what
0:06:15	sequence of hmms it converged to and finally we get
0:06:21	a transcription of the audio in terms of words
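To make the idea of a word hmm with discrete emissions over acoustic units concrete, here is a small python sketch of the forward algorithm scoring a unit sequence against one such hmm. It only shows the scoring, not the em training or the unigram language model from the talk, and the two state model and its probabilities are invented for illustration.

```python
def forward_likelihood(obs, init, trans, emit):
    """Probability of an observed unit sequence under a discrete-emission hmm.

    init[s]     : probability of starting in state s
    trans[p][s] : probability of moving from state p to state s
    emit[s]     : dict mapping an acoustic unit label to its emission probability
    """
    n = len(init)
    alpha = [init[s] * emit[s].get(obs[0], 0.0) for s in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s].get(o, 0.0)
            for s in range(n)
        ]
    return sum(alpha)


# hypothetical two-state left-to-right word model over units "a" and "b"
init = [1.0, 0.0]
trans = [[0.5, 0.5], [0.0, 1.0]]
emit = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}]

clean = forward_likelihood(["a", "b"], init, trans, emit)
noisy = forward_likelihood(["b", "a"], init, trans, emit)
print(clean, noisy)
```

The noisy sequence still gets a non-zero score, which is the point of a probabilistic mapping compared with a one-to-one pronunciation.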
0:06:25	doing this with a
0:06:27	random initialization so a completely unsupervised setup we get sixty
0:06:32	eight percent accuracy
0:06:35	and
0:06:37	well this was done in a speaker independent way meaning we put all the data into one
0:06:44	group and just did the learning and the segmentation
0:06:48	getting sixty eight percent
0:06:51	this was done unsupervised but we can go a step
0:06:56	further using light supervision in our case we use the dtw algorithm of jim glass
0:07:02	to do pattern discovery on a small subset of our input data
0:07:08	using nine percent of the input data we ran the dtw algorithm and discovered
0:07:14	four percent of the segments
0:07:18	so we used these four percent of the segments to initialize our
0:07:23	word hmms with these segments
0:07:26	and
0:07:27	we used light supervision by just labeling
0:07:32	you can do this by listening to the
0:07:34	probably
0:07:36	and
0:07:37	when running this and
0:07:41	learning again we get an eighty two percent word accuracy in the end so using
0:07:46	the light supervision obviously improves the results quite significantly
0:07:51	and what we finally did is an iterative speech recognizer training so just going
0:07:57	back to the standard speech recognizer using hmms and gmms using the transcriptions that
0:08:02	we get from our unsupervised or lightly supervised learning and doing an iterative training of the speech
0:08:08	recognizer
0:08:09	so in the random case we then
0:08:11	go from sixty eight percent to
0:08:15	eighty four percent
0:08:17	and in the lightly supervised case we go from
0:08:22	eighty two percent to basically ninety nine percent which is close to the baseline
0:08:26	when using supervised training for details
0:08:36	okay so far work we are smoking on the low resource model we for a
0:08:42	tool on the continent dictionary but you how can we do not to the dallas
0:08:47	approach any so we still assume that we have a very small initial dictionary to
0:08:52	start with
0:08:53	so that we can bootstrap our system so our task in multimedia so given
0:08:58	that we have a small initial dictionary to start with we can
0:09:03	simply train up a grapheme to phoneme conversion model and then
0:09:10	we can generate
0:09:11	multiple pronunciations for the words that are not covered but appear in the training data
0:09:16	so in our system we assume that our dictionary is actually small and maybe
0:09:23	even noisy but we assume that we have a large amount of word level transcribed
0:09:28	audio data
0:09:31	so even though we do not have the pronunciations for all the words
0:09:35	we want to learn the pronunciations for
0:09:41	the words in the training data so the grapheme to phoneme model will be
0:09:46	very noisy because the training sample is very small
0:09:50	but we can generate all the possible pronunciations for the words in the training data
0:09:54	and then we can use the audio data to fit the model and
0:10:00	then to reweight the pronunciations and so here we have another problem
0:10:07	because if we keep multiple pronunciations for each word then
0:10:12	for an utterance with say n words the number of possible pronunciation sequences could be
0:10:17	very large it grows exponentially with the number of words so to learn
0:10:22	the model precisely can be a problem
0:10:24	so we compare two different ways to train the model one just uses a
0:10:30	viterbi approximation that each time takes the most likely pronunciation sequence for each
0:10:36	utterance and uses that to accumulate the statistics to
0:10:41	update the model
0:10:42	so this is just an approximation so we have to answer the question whether
0:10:47	it is a good approximation so we have to build another system that
0:10:51	tries to learn this
0:10:53	model precisely so we do not do any pruning other people are actually using
0:10:58	n-best lists so this can also
0:11:02	be an approximation
0:11:03	so we have to know whether it is a good approximation so
0:11:07	to work on this knowing that there is
0:11:12	a large number of sequences for each utterance we use
0:11:21	wfst techniques so we represent these pronunciation sequences
0:11:28	as wfsts and then use the existing
0:11:33	composition and determinization algorithms that are already available in
0:11:41	openfst
0:11:42	so we use them and then she we found that is very efficient choose for
0:11:47	us to i don't think model so a given the two algorithm we can we
0:11:53	count as opposed to model but there's a nice to have to mention that because
0:11:59	or grapheme to phoneme conversion model and our work as model and also a one-best
0:12:03	model not depend on each other
0:12:05	so we can see that we have to each batch eight we find that you
0:12:08	know so that we can we can takes the other to have access to one
0:12:15	and go back
0:12:16	so that after a few iterations the system can converge
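The difference between the viterbi style update and the update over all pronunciation sequences can be sketched in python as below. This is a toy version under stated assumptions: each utterance comes with an explicit list of candidate pronunciation sequences and an acoustic likelihood for each, whereas the talk represents the sequences with wfsts; the word, the phone strings and the likelihood values are invented, and every utterance is assumed to have at least one hypothesis with non-zero score.

```python
from collections import defaultdict


def update_weights(utterances, theta, viterbi=False):
    """One re-estimation step for pronunciation weights theta[word][pron].

    utterances: list of hypothesis lists; each hypothesis is
                (list of (word, pron) pairs, acoustic likelihood).
    viterbi=True counts only the best hypothesis per utterance,
    viterbi=False spreads counts by posterior over all hypotheses.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for hyps in utterances:
        scores = []
        for pairs, acoustic in hyps:
            score = acoustic
            for word, pron in pairs:
                score *= theta[word][pron]
            scores.append(score)
        if viterbi:
            best = max(range(len(hyps)), key=scores.__getitem__)
            post = [1.0 if i == best else 0.0 for i in range(len(hyps))]
        else:
            total = sum(scores)
            post = [s / total for s in scores]
        for (pairs, _), gamma in zip(hyps, post):
            for word, pron in pairs:
                counts[word][pron] += gamma
    # normalise the accumulated counts per word
    return {
        word: {pron: c / sum(prons.values()) for pron, c in prons.items()}
        for word, prons in counts.items()
    }


# hypothetical single-word utterance with two candidate pronunciations
theta = {"data": {"d ey t ax": 0.5, "d ae t ax": 0.5}}
utterances = [[([("data", "d ey t ax")], 0.03), ([("data", "d ae t ax")], 0.01)]]

full = update_weights(utterances, theta)
vit = update_weights(utterances, theta, viterbi=True)
print(full, vit)
```

With explicit lists the number of hypotheses grows exponentially with the utterance length, which is the problem the wfst representation is meant to address.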
0:12:21	so we did some a common on the also we bought a i data set
0:12:26	so it becomes large tells that and the expert lexicon has a process of and
0:12:31	word and the total training data has about a three hundred hours so
0:12:38	we i was systems as a snack about fifty percent of the what an concessions
0:12:43	at the initialization which of this random this slide fifteen percent to start always as
0:12:49	an are we compare to use this expert let's go so obviously some can approach
0:12:55	is used for the let's go
0:12:58	there's about a one percent to one point five percent gap there
0:13:02	but i think trustees to from a simple with we cannot speech or i mean
0:13:07	equivalent to that that's good as i think that's all translated just a initial study
0:13:15	offered to work
0:13:27	our paper is titled probabilistic lexical modeling and unsupervised adaptation for unseen cases so
0:13:33	in this paper we propose a zero-resourced asr approach specifically with zero linguistic resources and
0:13:40	zero lexical resources in the framework of kullback-leibler divergence based
0:13:45	hmm so we only use the list of probable words of the target language
0:13:50	and the knowledge of the grapheme-to-phoneme relationship
0:13:55	so the approach uses a posterior model trained on language independent out-of-domain data
0:14:02	and uses graphemes as subword units which avoids the need for a lexicon
0:14:07	so i'll focus on three different points i'll tell you what the kl-hmm approach is
0:14:13	what has been done until now with kl-hmm and what we did differently in this paper
0:14:18	so to briefly explain about kl-hmm approach is
0:14:23	the posterior probabilities estimated by neural network are directly used as feature observations to train
0:14:30	an hmm whose states are modeled by categorical distributions
0:14:40	so the dimension of the categorical distributions is the same as the output of the mlp
0:14:46	and the parameters of the categorical distributions are estimated by minimising the kl divergence between
0:14:52	the
0:14:54	state distributions and the feature vectors belonging to that state
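A small python sketch of that estimation for a single state follows. The talk does not say which direction of the kl divergence is used, so this assumes the cost is the sum over frames of KL(y || z_t), with y the state distribution and z_t the mlp posterior vector of frame t; for that choice the minimiser is the normalised geometric mean of the posteriors (for the reverse direction it would be the arithmetic mean). The posterior values are invented and are assumed to be strictly positive.

```python
import math


def kl(p, q):
    """KL(p || q) for two categorical distributions of equal dimension."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def cost(y, posteriors):
    """Total divergence between state distribution y and the frames aligned to it."""
    return sum(kl(y, z) for z in posteriors)


def estimate_state(posteriors):
    """Minimise sum_t KL(y || z_t): normalised geometric mean of the frames."""
    dim = len(posteriors[0])
    log_mean = [
        sum(math.log(z[k]) for z in posteriors) / len(posteriors)
        for k in range(dim)
    ]
    unnorm = [math.exp(v) for v in log_mean]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# hypothetical mlp posteriors for three frames aligned to one state
frames = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.8, 0.1, 0.1]]
y = estimate_state(frames)
arithmetic = [sum(z[k] for z in frames) / len(frames) for k in range(3)]
print(y, cost(y, frames), cost(arithmetic, frames))
```

The estimated vector has one entry per mlp output, which matches the statement that the categorical distribution has the same dimension as the mlp output.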
0:14:58	and in reality the kl-hmm approach can be seen as a probabilistic lexical modeling approach because the parameters
0:15:06	of the model capture the probabilistic relationship between hmm states and mlp outputs
0:15:13	and as in a normal hmm system the states can represent phonemes or graphemes context
0:15:20	independent units or context dependent subword units
0:15:24	so now what we found until now to explain the benefits of kl-hmm we
0:15:30	have seen that the neural network can be trained on language-independent data and the kl-hmm
0:15:36	parameters can be trained on a small amount of target language data
0:15:40	in the framework of kl-hmm subword units like graphemes can also yield performance
0:15:46	similar to the systems using phones as subword units so the grapheme based asr approach
0:15:52	using kl-hmm has two benefits as mentioned
0:15:55	the first is that it exploits both acoustic and lexical resources available in other languages
0:16:01	because it's reasonable to assume that some languages have lexical resources
0:16:07	and the parameters of the kl-hmm actually model the probabilistic relationship between graphemes and
0:16:14	phonemes so it implicitly learns the grapheme-to-phoneme relationship using acoustics
0:16:22	so what we do in this work normally in kl-hmm the parameters
0:16:28	are estimated using target language data
0:16:31	so in this work we replace the kl-hmm parameters by knowledge based parameters
0:16:37	so we first define the grapheme set of the target language we map each of the graphemes
0:16:43	to one or more mlp outputs or phonemes
0:16:48	and then the kl-hmm parameters are assigned using this knowledge
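One way such a knowledge based assignment could look is sketched below in python. The talk does not give the exact rule, so the scheme here is an assumption: the phonemes mapped to a grapheme share most of the probability mass equally and every other mlp output gets a small floor value. The phone set, the grapheme map and the floor are made up for illustration.

```python
def knowledge_based_state(mapped, phone_set, floor=1e-3):
    """Build a categorical distribution over mlp outputs for one grapheme state.

    mapped    : set of phonemes the grapheme is assumed to map to (non-empty)
    phone_set : ordered list of mlp output labels
    floor     : small probability given to every unmapped output (assumption)
    """
    if not mapped:
        raise ValueError("each grapheme needs at least one mapped phoneme")
    rest = len(phone_set) - len(mapped)
    mass = (1.0 - floor * rest) / len(mapped)
    return [mass if ph in mapped else floor for ph in phone_set]


# hypothetical mlp output labels and grapheme-to-phoneme knowledge
phone_set = ["a", "k", "s", "t"]
g2p = {"c": {"k", "s"}, "a": {"a"}}

states = {g: knowledge_based_state(phs, phone_set) for g, phs in g2p.items()}
print(states)
```

Distributions built this way need no target language speech, and they can serve as the starting point for an unsupervised adaptation pass when untranscribed audio is available.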
0:16:52	if untranscribed speech data from the target language is available we also propose an approach to iteratively
0:16:59	adapt the kl-hmm parameters in an unsupervised fashion so first given the speech data we decode
0:17:07	the grapheme sequence and
0:17:11	then we update the kl-hmm parameters using the decoded grapheme sequence and the process can
0:17:17	be iterated and in this paper the approach was evaluated on greek and
0:17:23	we used five other european languages as out-of-domain resources not greek
0:17:32	and what was not done in this paper and what we plan to do is
0:17:36	in this paper we only adapted the kl-hmm parameters from untranscribed speech data but
0:17:42	in future we also plan to do mlp retraining based on
0:17:47	the inferred grapheme sequence and
0:17:51	also we don't prune the utterances during unsupervised adaptation so in future we plan
0:17:58	to try to prune or weight the utterances based on some
0:18:02	matching
0:18:09	so does anybody have any questions
0:18:26	so this is kind of an open-ended question and this is for
0:18:30	everybody who talked about
0:18:34	automatically learning subword units
0:18:36	so i guess at least to the poster presenters and many of the speakers from
0:18:40	today and the question is very often it's hard to define the boundary between subword
0:18:47	units
0:18:50	and the more conversational and natural the speech gets the less well defined these boundaries
0:18:55	are
0:18:56	and so i was wondering if you found in your work
0:19:01	if you look at
0:19:02	at the types of boundaries that are being hypothesized if you found
0:19:06	that this issue causes a problem for you and if you have any thoughts
0:19:11	on how to deal with it
0:19:19	i can at least say that even though my experiments were on
0:19:23	read speech i did find that a lot of times the pronunciations that are now
0:19:27	being learned were heavily reduced and much more reduced than the canonical pronunciations
0:19:36	i think that probably does cause a decrease in accuracy because it increases the confusability among
0:19:41	the pronunciations in the lexicon
0:19:44	i don't really have a good
0:19:47	a good idea on how to fix that but i think probably maintaining some measure
0:19:52	of reducing or minimizing the amount of confusability in the word units that
0:20:00	you get
0:20:01	similar to the talk that we just saw
0:20:04	saying that you know it's important to at least be able to still discriminate between
0:20:09	the words that you have in the lexicon
0:20:14	so i don't see or anywhere see here
0:20:20	i will hunt you down only left the room
0:20:23	but i can concur with what william said on the pronunciation mixture model stuff where
0:20:27	we can actually see when we throw out pronunciations and learn new ones
0:20:32	we're learning variations that are addressing
0:20:40	you know reductions that you see in conversational speech
0:20:44	so i haven't looked closely enough at the units we're learning
0:20:49	to know that but certainly we see that in the pronunciation stuff so i would assume
0:20:53	something like that to be going on but i can't say definitively
0:21:03	question no i
0:21:12	maybe i can say loudly what we discussed at the break i mean i wonder
0:21:16	why do we need boundaries between the speech sounds
0:21:21	we obviously do need a sequence of the speech sounds but it may be enough
0:21:25	to have a sequence of the speech sounds and accept the fact that speech sounds
0:21:29	last for about a couple of hundred milliseconds each and they are really overlapping
0:21:35	so boundaries are entirely arbitrary
0:21:39	things and i don't think we need them that's my point but correct
0:21:43	me if i'm wrong
0:21:51	any other questions or comments
0:21:59	it's been longer
0:22:01	maybe we should declare success and close the session alright thanks everybody

Limited Resources Day