0:00:15i wish so where
0:00:17well my paper was acoustic unit discovery and pronunciation generation from a grapheme based lexicon
0:00:23this is work done with anindya roy lori lamel and jean-luc gauvain
0:00:28at limsi
0:00:31so the basic premise was
0:00:34given the task where you just have parallel text and speech but no pronunciation
0:00:39dictionary or knowledge of the phonetic or acoustic units
0:00:42can you learn the acoustic units and then the pronunciations for words using those discovered
0:00:47acoustic units
0:00:49so we broke the task into two parts where the first stage was learning the
0:00:54acoustic units and the second stage was learning the pronunciations from them
0:00:58with the hope that doing each one individually would improve performance and then together
0:01:03would make things even better
0:01:06so we start from the assumption that you can at least train a grapheme based
0:01:09speech recognizer that will at least produce some reasonable performance
0:01:16once you train a context-dependent hmm based recognizer we take it and we cluster the hmms using
0:01:24spectral clustering
0:01:27into some preset number of acoustic units
0:01:30and from there we have a direct mapping from a context-dependent trigrapheme down
0:01:38to one of these new acoustic units so using the original grapheme based lexicon you
0:01:44can automatically map each pronunciation to the new acoustic units
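The lexicon mapping step described above can be sketched as follows. This is only an illustration under stated assumptions: the cluster table `cluster_of`, the unit names such as `u1`, the `#` word-boundary padding, and the two toy words are all hypothetical, and the spectral clustering that would produce the table is not shown.

```python
def trigraphemes(word):
    """Return the (left, centre, right) grapheme contexts of a word.

    '#' marks the word boundary (an assumption made for this sketch).
    """
    padded = "#" + word + "#"
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]


def map_lexicon(words, cluster_of):
    """Rewrite each grapheme pronunciation as a sequence of acoustic units."""
    return {w: [cluster_of[t] for t in trigraphemes(w)] for w in words}


# Hypothetical clustering output: trigrapheme -> acoustic unit id.
cluster_of = {
    ("#", "c", "a"): "u1", ("#", "k", "i"): "u1",   # two contexts share a unit
    ("c", "a", "t"): "u2", ("k", "i", "t"): "u3",
    ("a", "t", "#"): "u4", ("i", "t", "#"): "u4",
}

new_lexicon = map_lexicon(["cat", "kit"], cluster_of)
```

Because every context-dependent trigrapheme belongs to exactly one cluster, the mapping is deterministic and needs no extra training data.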
0:01:49and performing recognition with that system does produce some small gain but
0:01:55it was our belief that
0:01:57using those acoustic units in that manner might not be the best way to do
0:02:02it there might be
0:02:04better pronunciations
0:02:07so we took a machine translation based approach to transform those pronunciations into a new set
0:02:13so using the newly discovered acoustic units we decode the training data to generate
0:02:20a set of pronunciation hypotheses for each word
0:02:24and from there using moses you can learn a set of phrase translations or
0:02:30basically rules to translate a set of units into another set of units
0:02:36so using that phrase translation table you can then transform the original lexicon into
0:02:40a new one
0:02:42unfortunately using that directly actually makes performance significantly worse mainly because there's
0:02:49a lot of noise in the pronunciations that are hypothesized
0:02:53so we rescore the rules in the phrase table by applying each rule individually
0:02:57and then through forced alignment see how it affects the log-likelihood of the training data
0:03:05and we found that if we just prune the phrase table keeping the rules that
0:03:08improve the likelihood of the data and then transform the final lexicon we end up
0:03:15with improved performance once we get the final lexicon we start over from the beginning and
0:03:19train the system up as before for recognition
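The rule rescoring and pruning step can be sketched as below. It is a minimal sketch, not the speakers' implementation: `toy_loglik` is a hypothetical stand-in for the forced-alignment log-likelihood of the training data, and the rules, unit names, and lexicon are invented for the example.

```python
def apply_rule(pron, src, tgt):
    """Replace each non-overlapping occurrence of src in pron with tgt."""
    out, i = [], 0
    while i < len(pron):
        if tuple(pron[i:i + len(src)]) == src:
            out.extend(tgt)
            i += len(src)
        else:
            out.append(pron[i])
            i += 1
    return out


def prune_rules(lexicon, rules, loglik):
    """Keep only the rules that, applied alone, raise the data log-likelihood."""
    baseline = loglik(lexicon)
    kept = []
    for src, tgt in rules:
        candidate = {w: apply_rule(p, src, tgt) for w, p in lexicon.items()}
        if loglik(candidate) > baseline:
            kept.append((src, tgt))
    return kept


# Hypothetical data: what the audio "supports" for each word.
supported = {"cat": ["u1", "u5"], "kit": ["u1", "u3", "u4"]}


def toy_loglik(lexicon):
    """Stand-in score: minus the number of words whose pronunciation mismatches."""
    return -sum(lexicon[w] != supported[w] for w in lexicon)


lexicon = {"cat": ["u1", "u2", "u4"], "kit": ["u1", "u3", "u4"]}
rules = [(("u2", "u4"), ("u5",)), (("u1",), ("u9",))]
kept = prune_rules(lexicon, rules, toy_loglik)
```

Scoring each rule in isolation keeps the test cheap and makes the effect of a single rule easy to attribute, at the cost of ignoring interactions between rules.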
0:03:27okay might be better
0:03:32okay my paper was
0:03:34a hierarchical system for word discovery exploiting dtw based initialization and what we
0:03:40basically did was
0:03:42an unsupervised word discovery task from continuous speech in our case we used a completely
0:03:48zero-resource setup meaning we only have the audio data and no other information
0:03:55for the task we used a small vocabulary task in our case just the ti-digits
0:04:00database with eleven digits and we used a hierarchical system for the word discovery
0:04:06hierarchical system in this case means we do have two levels on the
0:04:10first level we have the acoustic unit discovery so
0:04:15trying to discover acoustic units as a sequence of feature vectors it's basically the same
0:04:20what use was presenting today
0:04:23and on the second level we do the discovery of words
0:04:26as sequences of the acoustic units which is basically analogous to learning the pronunciation lexicon so
0:04:33for the first level as i said we're doing the acoustic unit discovery it is
0:04:38similar to self organizing units and what we basically do is segment our input
0:04:45cluster all segments to get an initial transcription for the data and then do an
0:04:49iterative hmm training of acoustic models for the acoustic units
0:04:54and finally we get a transcription of our audio data as a sequence of acoustic units
0:05:01and for the second level we do the word discovery
0:05:04which means that we are trying to discover words in an unsupervised way on our sequence
0:05:10of acoustic units
0:05:12so what we do there is we use a probabilistic pronunciation lexicon because obviously our
0:05:18acoustic unit sequence will be very noisy so we cannot use a one-to-one mapping we
0:05:23need some probabilistic mapping and to do this mapping we use an hmm for each
0:05:29word where the hmm
0:05:31states have discrete emission distributions in terms of the acoustic units
0:05:41additionally the transition probabilities of the hmm are governed by another distribution so
0:05:47that you have a kind of length modeling
0:05:51the parameters of these hmms are learned in an unsupervised way we do this by
0:05:57for example when we want to
0:05:59search for for instance eleven
0:06:04word hmms we connect them by a language model in our case a unigram language model
0:06:08and then simply estimate the parameters using an em algorithm and we look at what
0:06:15sequence of hmms it converged to and finally we get a
0:06:21transcription of the audio in terms of words
0:06:25doing this with a
0:06:27random initialization so a completely unsupervised setup we get a sixty
0:06:32eight percent accuracy
0:06:37this was done in a speaker independent way meaning we put all the data into one
0:06:44group and just do the learning and the segmentation
0:06:48getting sixty eight percent
0:06:51this was done unsupervised but we can go a step
0:06:56further using light supervision in our case we used the dtw algorithm of jim glass
0:07:02to do pattern discovery on a small subset of our input data
0:07:08using nine percent of the input data we ran the dtw algorithm and discovered
0:07:14four percent of the segments
0:07:18so we used these four percent of the discovered segments to initialize our
0:07:23word hmms
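For readers unfamiliar with dynamic time warping, the following is a minimal sketch of the plain DTW distance between two sequences. It is not the pattern discovery algorithm used in the talk, which searches for matching sub-segments inside longer utterances; the scalar "features" and the absolute-difference local distance are assumptions made to keep the example small.

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Cost of the cheapest monotonic alignment between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = dist(a[i - 1], b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances, b repeats
                                     cost[i][j - 1],      # b advances, a repeats
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])   # stretched copy
different = dtw_distance([0, 0], [1, 1])
```

A stretched copy of a sequence aligns at zero cost, which is why DTW is suited to finding repeated words spoken at different rates.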
0:07:27we used light supervision by just labeling them
0:07:32you can do this by listening to them
0:07:37when running this and
0:07:41learning again we get an eighty two percent word accuracy in the end so using
0:07:46the light supervision obviously improves the results quite significantly
0:07:51and what we finally did is an iterative speech recognizer training so just going
0:07:57back to the standard speech recognizer using hmms and gmms using the transcriptions that
0:08:02we get from our unsupervised or lightly supervised learning and doing an iterative training of the speech recognizer
0:08:09so in the random case we then
0:08:11go from sixty eight percent to
0:08:15eighty four percent
0:08:17and in the lightly supervised case we go from the
0:08:22eighty two percent to basically ninety nine percent which is close to the baseline
0:08:26when using supervised training for details
0:08:36okay so far work we are smoking on the low resource model we for a
0:08:42tool on the continent dictionary but you how can we do not to the dallas
0:08:47approach so we still assume that we have a very small initial dictionary to start with
0:08:53so that we can bootstrap our system so our task is much easier so given
0:08:58that we have a small initial dictionary to start with we can
0:09:03simply train up a grapheme to phoneme conversion model and then
0:09:10we can generate
0:09:11the multiple pronunciations for the words not covered that appear in the training data
0:09:16so actually in our system we assume that our dictionary is actually small it's very small maybe
0:09:23or even noisy but we assume that we have a large amount of word level transcriptions
0:09:28i mean audio data with word level transcriptions
0:09:31so even though we do not have the pronunciations for all the words
0:09:35comparable results more usual actually just are ways we want to learn the pronunciations for
0:09:41the words in the training data so since the grapheme to phoneme model can be
0:09:46very noisy because the training sample is very small
0:09:50we can generate all the possible pronunciations for the words in the training data
0:09:54and then we can use the audio data to fit the model and
0:10:00then to weight the pronunciations and so here we have another problem
0:10:07because if we keep multiple pronunciations for each word then
0:10:12for an utterance with say n words the number of possible pronunciation sequences could be
0:10:17very large it grows exponentially with the number of words so if you want to learn
0:10:22the pronunciation weights it can be a problem
0:10:24so we compared two different ways to train the model one just uses a
0:10:30viterbi approximation at every iteration we find the most likely pronunciation sequence for each
0:10:36utterance and use that to collect the statistics to
0:10:41update the model
0:10:42so this is just an approximation so we have to answer the question whether
0:10:47it is a good approximation so we built another system that
0:10:51tries to learn the
0:10:53model precisely so we do not do any pruning other people are actually using
0:10:58n-best lists so this can also
0:11:02be an approximation
0:11:03so we have to know whether it is a good approximation so
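The contrast between a one-best update and an update that keeps every pronunciation can be sketched as below. This is a deliberately simplified illustration, not the system described in the talk: it assumes each word token already comes with a per-pronunciation acoustic score, so tokens can be treated independently, whereas the real problem has to handle whole pronunciation sequences. All words, pronunciations, and scores are hypothetical.

```python
from collections import defaultdict
from math import prod


def sequence_count(utterance):
    """Number of distinct pronunciation sequences for one utterance."""
    return prod(len(scores) for _, scores in utterance)


def estimate_weights(utterances, hard):
    """Re-estimate pronunciation weights from per-token scores.

    hard=True counts only the best-scoring pronunciation of each token;
    hard=False spreads a fractional count over all candidates.
    """
    counts = defaultdict(float)
    for utterance in utterances:
        for word, scores in utterance:
            if hard:
                best = max(scores, key=scores.get)
                counts[(word, best)] += 1.0
            else:
                total = sum(scores.values())
                for pron, score in scores.items():
                    counts[(word, pron)] += score / total
    totals = defaultdict(float)
    for (word, _), c in counts.items():
        totals[word] += c
    return {key: c / totals[key[0]] for key, c in counts.items()}


utterances = [
    [("the", {"dh ah": 0.6, "dh iy": 0.4}),
     ("data", {"d ey t ah": 0.5, "d ae t ah": 0.5})],
    [("the", {"dh ah": 0.3, "dh iy": 0.7})],
]
hard_weights = estimate_weights(utterances, hard=True)
soft_weights = estimate_weights(utterances, hard=False)
```

With two candidates per word the first utterance already has four pronunciation sequences, and the count multiplies with every additional word, which is the growth that makes the exact update expensive.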
0:11:07if you want to work on this you know you have to represent the
0:11:12large number of pronunciation sequences for each utterance so we use the existing
0:11:21techniques you know fst wfst so we represent these pronunciation sequences
0:11:28and all the models as wfsts and then use the existing you
0:11:33know composition and determinization algorithms those algorithms are already there in
0:11:41the openfst toolkit
0:11:42so we use them and we found that it is a very efficient tool for
0:11:47us to train the model so given the two algorithms we can
0:11:53train the pronunciation model but there's one thing i have to mention that because
0:11:59our grapheme to phoneme conversion model and our acoustic model and also the pronunciation
0:12:03model all depend on each other
0:12:05so you can see that we have to update them iteratively you
0:12:08know so that we can fix the others and update one
0:12:15and go back
0:12:16so after a few iterations the system can converge
0:12:21so we did some experiments on the switchboard data set
0:12:26so it becomes large tells that and the expert lexicon has a process of and
0:12:31words and the total training data has about three hundred hours so
0:12:38our systems start with about fifty percent of the word pronunciations
0:12:43at the initialization we just randomly select this fifty percent to start with and
0:12:49then we compare to using the expert lexicon so obviously our approach
0:12:55is worse than the expert lexicon
0:12:58there's about a one percent to one point five percent gap there
0:13:02but i think trustees to from a simple with we cannot speech or i mean
0:13:07equivalent to that that's good i think that's all this is just an initial study
0:13:15of our work
0:13:27my paper is titled probabilistic lexical modeling and unsupervised adaptation for unseen languages so
0:13:33in this paper we propose a zero-resourced asr approach specifically zero linguistic resources and
0:13:40zero lexical resources asr system in the framework of kullback-leibler divergence based
0:13:45hmm so we only use the list of probable words of the target language
0:13:50and knowledge of the grapheme-to-phoneme relationship
0:13:55so the approach uses a posterior estimator trained on language independent out-of-domain data
0:14:02and uses graphemes as subword units which avoids the need for a lexicon
0:14:07so i focus on three different points i'll tell you what the kl-hmm approach is
0:14:13what has been done until now with kl-hmm and what we did differently in this paper
0:14:18so to briefly explain the kl-hmm approach
0:14:23the posterior probabilities estimated by a neural network are directly used as feature observations to train
0:14:30an hmm whose states are modeled by categorical distributions
0:14:40so the dimension of the categorical distributions is the same as the output of the mlp
0:14:46and the parameters of the categorical distributions are estimated by minimising the kl divergence between
0:14:54state distributions and the feature vectors belonging to that state
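The estimation step described above can be sketched as follows, for one state and the frames already assigned to it. This is a sketch under assumptions: the talk does not say which direction of the kl divergence is minimised, so the code picks the direction KL(state || frame), whose minimiser is the normalised geometric mean of the frames; for the opposite direction the minimiser would be the arithmetic mean. The posterior values are invented and must be strictly positive.

```python
from math import exp, log


def kl(p, q):
    """KL divergence between two categorical distributions (q must be > 0)."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def estimate_state(frames):
    """Categorical y minimising sum_t KL(y || z_t): normalised geometric mean."""
    dims = len(frames[0])
    log_mean = [sum(log(z[k]) for z in frames) / len(frames) for k in range(dims)]
    unnormalised = [exp(v) for v in log_mean]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


def total_cost(y, frames):
    """Summed divergence between state distribution y and its frames."""
    return sum(kl(y, z) for z in frames)


# Hypothetical MLP posterior vectors aligned to one hmm state.
frames = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.8, 0.1, 0.1]]
state = estimate_state(frames)
arithmetic = [sum(z[k] for z in frames) / len(frames) for k in range(3)]
```

In a full system this update would alternate with a realignment of frames to states, much like viterbi training of an ordinary hmm.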
0:14:58and in reality the kl-hmm approach can be seen as a probabilistic lexical modeling approach because the parameters
0:15:06of the model capture the probabilistic relationship between hmm states and mlp outputs
0:15:13and as in the normal hmm system the states can represent phonemes or graphemes context
0:15:20independent units or context dependent subword units
0:15:24so on what we found until now to explain the benefits of kl-hmm we
0:15:30have seen that the neural network can be trained on language-independent data and the kl-hmm
0:15:36parameters can be trained on a small amount of target language data
0:15:40in the framework of kl-hmm subword units like graphemes can also give performance
0:15:46similar to the systems using phonemes as subword units so the grapheme based asr approach
0:15:52using kl-hmm has two benefits as mentioned
0:15:55the first is that it exploits both acoustic and lexical resources available in other languages
0:16:01because it's reasonable to assume that some languages have lexical resources
0:16:07and second the parameters of the kl-hmm actually model the probabilistic relationship between graphemes and
0:16:14phonemes so it implicitly learns the grapheme-to-phoneme relationship using acoustic data
0:16:22so what we do in this work is normally the kl-hmm parameters
0:16:28are estimated using target language data
0:16:31so in this work we propose to initialize the kl-hmm parameters with knowledge based parameters
0:16:37so we first define the grapheme set of the target language then we map each of the graphemes
0:16:43to one or more mlp outputs or phonemes
0:16:48and then the kl-hmm parameters are assigned using this knowledge
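One way such a knowledge-based assignment could look is sketched below. The talk does not give the exact values that are assigned, so the scheme here is an assumption: the mapped phoneme outputs share a mass of `1 - eps` equally and the remaining outputs share `eps`. The output inventory, the grapheme map, and the helper name `init_state` are all hypothetical.

```python
def init_state(mapped, outputs, eps=0.01):
    """Build a categorical distribution over mlp outputs for one grapheme.

    Assumes mapped is a non-empty subset of outputs.
    """
    mapped = set(mapped)
    others = len(outputs) - len(mapped)
    # If every output is mapped there is nothing to reserve eps for.
    high = (1.0 - eps) / len(mapped) if others else 1.0 / len(mapped)
    low = eps / others if others else 0.0
    return [high if p in mapped else low for p in outputs]


outputs = ["a", "k", "s", "t"]                       # hypothetical mlp classes
g2p = {"c": ["k", "s"], "a": ["a"], "t": ["t"]}      # hypothetical knowledge
params = {g: init_state(phones, outputs) for g, phones in g2p.items()}
```

Keeping a small non-zero mass on unmapped outputs avoids infinite kl divergence when a frame puts probability on a phoneme the map did not anticipate.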
0:16:52given untranscribed speech data from the target language we also proposed an approach to iteratively
0:16:59adapt the kl-hmm parameters in an unsupervised fashion so first given the speech data we decode
0:17:07the grapheme sequence and
0:17:11then we update the kl-hmm parameters using the decoded grapheme sequence and the process can
0:17:17be iterated and in this paper the approach was evaluated on greek and
0:17:23we used five other european languages as out-of-domain resources not greek
0:17:32and what was not done in this paper and we plan to do is
0:17:36like in this paper we only adapted the kl-hmm parameters from untranscribed speech data but
0:17:42in future we also plan to do the mlp retraining
0:17:47based on the inferred grapheme sequence and
0:17:51also we don't prune the utterances during unsupervised adaptation so in future we plan to
0:17:58try to prune or weight the utterances based on some
0:18:09so does anybody have any questions
0:18:26so this is kind of an open-ended question and this is for
0:18:30everybody who talked about
0:18:34automatically learning subword units
0:18:36so i guess at least to the poster presenters and many of the speakers from
0:18:40today and the question is very often it's hard to define the boundary between subword units
0:18:50and the more conversational and natural the speech gets the less well defined those boundaries are
0:18:56and so i was wondering if you found in your work
0:19:01if you look at
0:19:02at the types of boundaries that are being hypothesized if you found
0:19:06that this issue causes a problem for you and if you have any thoughts
0:19:11on how to deal with it
0:19:19i can at least answer that so even though my experiments were on
0:19:23read speech i did find that a lot of times the pronunciations that were
0:19:27being learned were heavily reduced much more reduced than the canonical pronunciations
0:19:36i think that probably causes a decrease in accuracy because it increases the confusability among
0:19:41the pronunciations in the lexicon
0:19:44i don't really have a good
0:19:47a good idea on how to fix that but i think probably maintaining some measure
0:19:52of reducing or minimizing the amount of confusability in the word units that
0:20:00you get
0:20:01is similar to the talk that we just saw
0:20:04saying that you know it's important to still be able to discriminate between
0:20:09the words that you have in the lexicon
0:20:14so i don't see or anywhere see here
0:20:20i will hunt you down only left the room
0:20:23but i can concur with what william said on the pronunciation mixture model stuff where
0:20:27we can actually see when we throw out pronunciations or learn new ones
0:20:32we're learning variations that are addressing
0:20:40you know reductions that you see in conversational speech
0:20:44so i haven't looked closely enough at the units we're learning
0:20:49to know that but certainly you see that in the pronunciation stuff so i would assume
0:20:53something like that would be going on but i can't say definitively
0:21:03question no i
0:21:12maybe i can say briefly what we discussed at the break i mean i wonder
0:21:16why do we need boundaries between the speech sounds
0:21:21we obviously do need a sequence of the speech sounds but it could be enough
0:21:25to have a sequence of the speech sounds and accept the fact that speech sounds
0:21:29last for about a couple of hundred milliseconds each of them and they are really overlapping
0:21:35so the boundaries are entirely arbitrary
0:21:39things and i don't think we need them that's my point but correct
0:21:43me if i'm wrong
0:21:51any other questions or comments
0:21:59it's been longer
0:22:01maybe we should declare success and close the session alright thanks everybody