Alright, so, well, my paper was "Acoustic unit discovery and pronunciation generation from a grapheme-based lexicon." This is work done with Anindya Roy, Lori Lamel, and Jean-Luc Gauvain at LIMSI.

So the basic premise was: given a task where you just have parallel text and speech, but no pronunciation dictionary or knowledge of a phonetic set of acoustic units, can you learn the acoustic units and then the pronunciations for words using those discovered acoustic units?

So we broke the task into two parts: the first stage was learning the acoustic units, and the second stage was learning the pronunciations from them, with the hope that doing each one individually would improve performance, and that doing them together would make things even better.

So we start from the assumption that you can at least train a grapheme-based speech recognizer that will produce some reasonable performance. Once we train a context-dependent HMM-based recognizer, we cluster the HMMs using spectral clustering into some preset number of acoustic units. From there we have a direct mapping from each context-dependent tri-grapheme down to one of these new acoustic units, so using the original grapheme-based lexicon you can automatically map each pronunciation to the new acoustic units.
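
As a rough illustration of this clustering-and-remapping step (not the exact procedure from the paper), here is a minimal sketch assuming a precomputed similarity matrix between the tri-grapheme HMMs; the similarity values, unit names, and toy lexicon are all made up.

```python
# Minimal sketch: cluster tri-grapheme HMMs into acoustic units and remap a
# grapheme lexicon. The similarity matrix and lexicon below are toy placeholders.
import numpy as np
from sklearn.cluster import SpectralClustering

tri_graphemes = ["a-b+c", "a-b+d", "x-y+z", "w-y+z"]   # context-dependent units
similarity = np.array([                                 # assumed pairwise HMM similarity
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
])

n_units = 2                                             # preset number of acoustic units
clusterer = SpectralClustering(n_clusters=n_units, affinity="precomputed", random_state=0)
labels = clusterer.fit_predict(similarity)
unit_of = {g: f"AU{l}" for g, l in zip(tri_graphemes, labels)}

# Map each grapheme-based pronunciation to the new acoustic units.
lexicon = {"bad": ["a-b+c", "a-b+d"], "yz": ["x-y+z", "w-y+z"]}
new_lexicon = {word: [unit_of[g] for g in pron] for word, pron in lexicon.items()}
print(new_lexicon)
```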

Performing recognition with that system does produce some small gain, but it was our belief that using the acoustic units in that manner might not be the best way to do it, and that there might be better ways of getting better pronunciations.

So we took a machine-translation-based approach to transform those pronunciations into a new set. Using the newly discovered acoustic units, we decode the training data to generate a set of pronunciation hypotheses for each word. From there, using Moses, you can learn a set of phrase translations, basically rules that translate one sequence of units into another sequence of units. Using that phrase translation table you can then transform the original lexicon into a new one.
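
To make the phrase-translation idea concrete, here is a minimal sketch that reads a Moses-style phrase table (lines of the form `source ||| target ||| scores ...`) and greedily applies the rules to a pronunciation; the file name, rules, and the greedy left-to-right application strategy are illustrative assumptions, not the exact procedure from the paper.

```python
# Sketch: apply phrase-translation rules (acoustic-unit n-gram -> n-gram) to a
# pronunciation. Rule contents and the greedy application strategy are illustrative.
def load_phrase_table(path):
    rules = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = [x.strip() for x in line.split("|||")]
            src, tgt = fields[0].split(), fields[1].split()
            score = float(fields[2].split()[0]) if len(fields) > 2 else 1.0
            rules.append((src, tgt, score))
    # prefer longer source phrases, then higher score
    rules.sort(key=lambda r: (len(r[0]), r[2]), reverse=True)
    return rules

def translate(pron, rules):
    out, i = [], 0
    while i < len(pron):
        for src, tgt, _ in rules:
            if pron[i:i + len(src)] == src:
                out.extend(tgt)
                i += len(src)
                break
        else:                      # no rule matched: copy the unit unchanged
            out.append(pron[i])
            i += 1
    return out

rules = load_phrase_table("phrase-table.txt")          # hypothetical path
print(translate(["AU3", "AU7", "AU7", "AU1"], rules))
```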

Unfortunately, using that table directly actually made performance significantly worse, mainly because there is a lot of noise in the pronunciations being hypothesized. So we rescore the rules in the phrase table by applying each rule individually and then, through forced alignment, seeing how it affects the log-likelihood of the training data. We found that if we prune the phrase table, keeping only the rules that improve the likelihood of the data, and then transform the final lexicon, we end up with improved performance. Once we get the final lexicon, we start over from the beginning and train the system up as before for recognition.
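
The pruning step can be sketched as follows, reusing `translate()` from the previous sketch; `forced_alignment_loglik` is a purely hypothetical placeholder for whatever recognizer call computes the training-data log-likelihood under a given lexicon.

```python
# Sketch: keep only the phrase-table rules that improve the forced-alignment
# log-likelihood of the training data when applied individually.
def forced_alignment_loglik(lexicon):
    """Hypothetical wrapper around the ASR toolkit's forced-alignment scoring."""
    raise NotImplementedError

def apply_rule(lexicon, rule):
    # transform every pronunciation with this single rule (reusing translate() above)
    return {w: translate(p, [rule]) for w, p in lexicon.items()}

def prune_rules(base_lexicon, rules):
    base_ll = forced_alignment_loglik(base_lexicon)
    kept = []
    for rule in rules:
        ll = forced_alignment_loglik(apply_rule(base_lexicon, rule))
        if ll > base_ll:           # the rule helps on its own: keep it
            kept.append(rule)
    return kept
```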

Okay, it might be better to move on to the next one.

Okay, my paper was "A practical system for word discovery exploiting DTW-based initialization," and what we basically did was an unsupervised word discovery task on continuous speech. In our case we used a completely zero-resource setup, meaning we only have the audio data and no other information.

For the task we used a small-vocabulary setup, in our case just the TI-digits database with eleven digits, and we used a hierarchical system for the word discovery. Hierarchical in this case means we have two levels: on the first level we have the acoustic unit discovery, so trying to discover acoustic units as sequences of feature vectors, which is basically the same as what was presented today; and on the second level we do the discovery of words as sequences of the acoustic units, which basically amounts to learning the pronunciation lexicon.

For the first level, as I said, we're doing the acoustic unit discovery. It is similar to self-organizing units: what we basically do is segment our input, cluster all segments to get an initial transcription for the data, and then do iterative HMM training of the acoustic models for the acoustic units. Finally we get a transcription of our audio data as a sequence of acoustic units.
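
As a toy illustration of the "segment, cluster, then iterate" idea, here is a minimal sketch that chops a feature sequence into fixed-length segments and clusters their means with k-means to get initial unit labels; the fixed-length segmentation, feature dimensions, and unit count are simplifying assumptions, not the actual algorithm.

```python
# Toy sketch of the initialization: fixed-length segments, k-means on segment
# means, and an initial "transcription" as a sequence of cluster labels.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 13))      # stand-in for MFCC frames of one utterance

seg_len, n_units = 10, 8
segments = [features[i:i + seg_len] for i in range(0, len(features) - seg_len + 1, seg_len)]
seg_means = np.stack([s.mean(axis=0) for s in segments])

labels = KMeans(n_clusters=n_units, n_init=10, random_state=0).fit_predict(seg_means)
initial_transcription = [f"AU{l}" for l in labels]
print(initial_transcription[:10])
# These labels would then seed iterative HMM training of the acoustic unit models.
```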

For the second level we do the word discovery, which means that we are trying to discover words in an unsupervised way on our sequence of acoustic units. What we do there is use a probabilistic pronunciation lexicon, because obviously our acoustic unit sequence will be very noisy, so we cannot use a one-to-one mapping; we need some probabilistic mapping. To do this mapping we use an HMM for each word, where the HMM states have discrete emission distributions over the acoustic units. Additionally, the transition probabilities of the HMM are governed by their own distributions, so that you get a kind of duration modeling.

The parameters of these HMMs are learned in an unsupervised way. For example, when we want to search for word instances, we connect the word HMMs by a language model, in our case a unigram language model, and then simply estimate the parameters using an EM algorithm. We look at which sequence of HMMs it converges to, and finally we get a transcription of the audio in terms of word units.
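
Here is a minimal sketch of fitting one such word HMM with discrete (categorical) emissions over acoustic unit indices, assuming a recent hmmlearn release that provides CategoricalHMM; connecting the word HMMs through the unigram language model is omitted, and the data is synthetic.

```python
# Sketch: one word modeled as an HMM whose states emit discrete acoustic-unit
# symbols; parameters are estimated with EM (Baum-Welch). Data here is synthetic.
import numpy as np
from hmmlearn import hmm

n_units = 20                                   # size of the discovered unit inventory
rng = np.random.default_rng(0)

# pretend these are noisy unit sequences for several spoken instances of one word
sequences = [rng.integers(0, n_units, size=int(rng.integers(8, 15))) for _ in range(30)]
X = np.concatenate(sequences).reshape(-1, 1)   # hmmlearn wants a single column of symbols
lengths = [len(s) for s in sequences]

word_hmm = hmm.CategoricalHMM(n_components=3, n_iter=50, random_state=0)
word_hmm.fit(X, lengths)                       # EM over all instances of the word

# score a unit sequence against this word model
print(word_hmm.score(sequences[0].reshape(-1, 1)))
```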

Doing this with random initialization, so a completely unsupervised setup, we get sixty-eight percent word accuracy. This was done speaker-independently, meaning we put all the data into one group for the learning and the segmentation, and got sixty-eight percent.

This was done unsupervised, but we can go a step further using light supervision. In our case we used the DTW algorithm of Jim Glass's group to do pattern discovery on a small subset of our input data: using nine percent of the input data, we ran the DTW algorithm and discovered about four percent of the segments. We then used these four percent of the segments to initialize our word HMMs.
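
The DTW-based matching behind this pattern discovery can be sketched with a basic dynamic-time-warping cost between two feature sequences; this is the textbook recursion, not the specific segmental DTW procedure used in the referenced work, and the features are random stand-ins.

```python
# Basic DTW alignment cost between two feature sequences (frames x dims),
# used here as a stand-in for the pattern-discovery matching step.
import numpy as np

def dtw_cost(a, b):
    n, m = len(a), len(b)
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # frame-pair distances
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],      # insertion
                                                 acc[i, j - 1],      # deletion
                                                 acc[i - 1, j - 1])  # match
    return acc[n, m] / (n + m)   # length-normalized alignment cost

rng = np.random.default_rng(0)
x, y = rng.normal(size=(40, 13)), rng.normal(size=(55, 13))
print(dtw_cost(x, y))            # low costs flag candidate repeated patterns
```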

We used light supervision by just labeling those discovered clusters; you can do this, for example, by listening to the patterns.

When running the learning again with this initialization, we get an eighty-two percent word accuracy at the end, so using the light supervision obviously improves the results quite significantly. What we finally did is iterative speech recognizer training: going back to a standard speech recognizer using HMMs and GMMs, using the transcriptions that we get from our unsupervised or lightly supervised learning, and doing iterative training of the speech recognizer.

In the random-initialization case we then go from sixty-eight percent to eighty-four percent, and in the lightly supervised case we go from eighty-two percent to basically ninety-nine percent, which is close to the baseline using supervised training. The paper has the details.

Okay, so for our work, we are also working on the low-resource setting, focusing on the pronunciation dictionary, but we do not go all the way to the zero-resource approach: we still assume that we have a very small initial dictionary to start with,

so that we can bootstrap our system. So our task is the following: given that small initial dictionary, we can simply train up a grapheme-to-phoneme conversion model and then generate multiple pronunciations for the words that appear in the training data but are not covered by the dictionary.

So in our system we assume that our initial dictionary is very small, maybe even noisy, but we assume that we have a large amount of audio data with word-level transcriptions. Even though we do not have the pronunciations for all the words, we want to learn the pronunciations for the words in the training data. Since the grapheme-to-phoneme model can be very noisy, because the sample it is trained on is very small,

we generate all the possible pronunciations for the words in the training data, and then we feed the audio data into the model and re-weight the pronunciations. And here we have another problem: because we keep multiple pronunciations for each word, for an utterance with, say, N words, the number of possible pronunciation sequences blows up exponentially with the number of words, so learning directly can be a problem.

So we compared two different ways to estimate the model. One just uses a Viterbi-style approximation: at each iteration, the most likely pronunciation sequence for each training utterance is picked, and only its statistics are used to update the model.

This is just an approximation, so we have to answer the question of whether it is a good one. So we also built another system that learns the model exactly, where we do not do any pruning. Other people have actually used, let's say, only an n-best list, which is also an approximation, so we wanted to know how well the approximations perform.
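
To make the contrast concrete, here is a toy sketch comparing the Viterbi-style update (all the count mass goes to the single best pronunciation sequence) with the exact update (posterior-weighted counts summed over all pronunciation sequences); the candidate pronunciations and scores are made up, and the real system of course does not enumerate sequences explicitly.

```python
# Toy comparison: Viterbi (1-best) counts vs. exact expected counts over all
# pronunciation sequences of one utterance. Scores are arbitrary stand-ins
# for acoustic log-likelihoods.
import itertools
import math
from collections import defaultdict

candidates = {                         # candidate pronunciations with made-up log scores
    "two": {"t uw": -10.0, "t ah": -11.0},
    "the": {"dh ah": -5.0, "dh iy": -5.2},
}
words = ["two", "the"]

sequences = []
for prons in itertools.product(*(candidates[w].items() for w in words)):
    logp = sum(score for _, score in prons)
    sequences.append((tuple(p for p, _ in prons), logp))

# Viterbi approximation: count only the single best pronunciation sequence
best_seq, _ = max(sequences, key=lambda s: s[1])
viterbi_counts = defaultdict(float)
for w, p in zip(words, best_seq):
    viterbi_counts[(w, p)] += 1.0

# Exact update: posterior-weighted (soft) counts over every sequence
total = math.log(sum(math.exp(lp) for _, lp in sequences))
exact_counts = defaultdict(float)
for seq, lp in sequences:
    post = math.exp(lp - total)
    for w, p in zip(words, seq):
        exact_counts[(w, p)] += post

print(dict(viterbi_counts))
print({k: round(v, 3) for k, v in exact_counts.items()})
```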

To learn the exact model we have to represent the very large number of pronunciation sequences for each utterance, so we use WFST techniques: we represent the pronunciations and the word sequence of each utterance as WFSTs and then use the existing composition, epsilon-removal, and determinization algorithms; all of these algorithms are available in the OpenFst toolkit.
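
A minimal sketch of representing the alternative pronunciations of an utterance as a WFST, assuming the pynini Python bindings over OpenFst; the words, pronunciations, and byte-string symbol handling are simplifications, and this only illustrates building and optimizing the pronunciation transducer, not the full training pipeline.

```python
# Sketch with pynini (OpenFst bindings): build a transducer from the word
# sequence of one utterance to all of its candidate pronunciation sequences,
# then optimize it (epsilon removal, determinization, minimization).
import pynini

lexicon = {                       # toy candidate pronunciations per word
    "two":   ["t uw", "t ah"],
    "three": ["th r iy"],
}

def word_fst(word):
    # word -> union of its candidate pronunciations
    prons = pynini.union(*lexicon[word])
    return pynini.cross(pynini.accep(word + " "), prons + pynini.accep(" "))

utterance = ["two", "three"]
utt_fst = word_fst(utterance[0])
for w in utterance[1:]:
    utt_fst = utt_fst + word_fst(w)
utt_fst = utt_fst.optimize()      # compact representation of all pronunciation sequences

# enumerate the licensed pronunciation sequences (the real system keeps the FST)
for pron in utt_fst.paths().ostrings():
    print(pron)
```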

We used them and found that they are very efficient tools for training the model. So given these two approaches we can train the proposed model. One more thing worth mentioning is that, because our grapheme-to-phoneme conversion model and our pronunciation model depend on each other, we can iterate: we fix one, update the other, and then switch, so that after a few iterations the system converges.

We did some experiments on a large data set: it is a large-vocabulary task, the expert lexicon covers a large number of words, and the total training data is about three hundred hours. Our system sees only about fifty percent of the word pronunciations at initialization, randomly selected, to start with, and we compare against using the expert lexicon. Obviously the setup using the expert lexicon is better; there is about a one to one-and-a-half percent gap. But I think it shows that with a simple setup we can get close to, even if not quite equivalent to, the expert lexicon, and this is just an initial study for further work.

My paper is titled "Probabilistic lexical modeling and unsupervised adaptation for zero-resourced ASR." In this paper we propose a zero-resourced ASR approach, specifically with zero linguistic resources and zero lexical resources from the target language, in the framework of KL-HMM with posterior-based features. We only use the list of probable words of the target language and knowledge of the grapheme-to-phoneme relationship. The system uses posterior features from a model trained on language-independent, out-of-domain data and uses graphemes as subword units, which avoids the need for a lexicon.

I will focus on three different points: first I will tell you what the KL-HMM approach is, then what has been done with KL-HMM until now, and then what we did differently in this paper.

To briefly explain the KL-HMM approach: the posterior probabilities estimated by a neural network are directly used as feature observations to train an HMM whose states are modeled by categorical distributions. The dimension of the categorical distributions is the same as the output of the MLP, and the parameters of the categorical distributions are estimated by minimizing the KL divergence between the state distributions and the feature vectors belonging to that state.
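
As a small numerical sketch of this estimation, assuming the variant where the divergence KL(posterior feature || state distribution) is minimized: in that case the optimal categorical distribution for a state is simply the average of the posterior vectors assigned to it. The posteriors below are synthetic.

```python
# Sketch: estimate a KL-HMM state's categorical distribution from the MLP
# posterior vectors assigned to that state, for the KL(posterior || state)
# variant, whose minimizer is the arithmetic mean of the posteriors.
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_phones = 50, 10                      # MLP output dimension = #phone classes

posteriors = rng.dirichlet(np.ones(n_phones), size=n_frames)   # stand-in MLP outputs

# state parameter that minimizes sum_t KL(y_t || c): the arithmetic mean
c = posteriors.mean(axis=0)

def kl(p, q, eps=1e-12):
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

avg_div = np.mean([kl(y, c) for y in posteriors])
print("state distribution:", np.round(c, 3))
print("average KL(y_t || c):", round(avg_div, 4))
```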

In reality, the KL-HMM approach can be seen as a probabilistic lexical modeling approach, because the parameters of the model capture the probabilistic relationship between HMM states and MLP outputs. As in a normal HMM system, the states can represent phonemes or graphemes, context-independent units or context-dependent subword units.

Now, regarding what has been found until now, to explain the benefits of KL-HMM: we have seen that the neural network can be trained on language-independent data while the KL-HMM parameters are trained on a small amount of target language data, and that in the KL-HMM framework subword units like graphemes can give performance similar to systems using phonemes as subword units. So the grapheme-based ASR approach using KL-HMM has two benefits worth mentioning.

The first is that it exploits both acoustic and lexical resources available in other languages, because it is reasonable to assume that some languages have lexical resources. The second is that the parameters of the KL-HMM actually model the probabilistic relationship between graphemes and phonemes, so it implicitly learns the grapheme-to-phoneme relationship from the posteriors.

What we do in this work is the following. Normally the KL-HMM parameters are estimated using target language data; in this work we replace the KL-HMM parameters with knowledge-based parameters. We first define the grapheme set of the target language, map each of the graphemes to one or more MLP outputs, that is phonemes, and then the KL-HMM parameters are assigned using that knowledge.
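
A minimal sketch of this knowledge-based assignment: each target-language grapheme is mapped to one or more MLP output phones, and the corresponding categorical distribution puts its mass on those phones. The phone set, mapping, uniform weighting, and floor value here are illustrative choices, not the paper's exact settings.

```python
# Sketch: assign KL-HMM categorical parameters from a grapheme -> phoneme map
# instead of estimating them from target-language data.
import numpy as np

mlp_phones = ["a", "e", "i", "k", "s", "t"]          # MLP output classes (toy set)
grapheme_to_phones = {                               # knowledge-based mapping
    "a": ["a"],
    "c": ["k", "s"],                                 # ambiguous grapheme -> two phones
    "t": ["t"],
}

def init_state_distribution(grapheme, floor=1e-3):
    dist = np.full(len(mlp_phones), floor)           # small floor on unmapped phones
    for ph in grapheme_to_phones[grapheme]:
        dist[mlp_phones.index(ph)] = 1.0
    return dist / dist.sum()                         # normalize to a categorical

kl_hmm_params = {g: init_state_distribution(g) for g in grapheme_to_phones}
print({g: np.round(d, 3) for g, d in kl_hmm_params.items()})
```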

If untranscribed speech data from the target language is available, we also propose an approach to iteratively adapt the KL-HMM parameters in an unsupervised fashion: first, given the speech data, we decode the grapheme sequence, and then we update the KL-HMM parameters using the decoded grapheme sequence; the process can be iterated. In this paper the approach was evaluated on Greek, and we used five other European languages, not Greek, as out-of-domain resources.

As for what was not done in this paper and what we plan to do: here we only adapted the KL-HMM parameters from untranscribed speech data, but in the future we also plan to retrain the MLP based on the inferred grapheme sequences. Also, we do not prune the utterances during unsupervised adaptation, so in the future we plan to try to prune or weight the utterances based on some matching score.

So, does anybody have any questions?

So this is kind of an open-ended question, and it is for everybody who talked about automatically learning subword units, so I guess at least the poster presenters and many of the speakers from today. The question is: very often it is hard to define the boundary between subword units, and the more conversational and natural the speech gets, the less well defined those boundaries are. So I was wondering, if you look at the types of boundaries that are being hypothesized, whether you found that this issue causes a problem for you, and whether you have any thoughts on how to deal with it.

I can at least say that, even though my experiments were on read speech, I did find that a lot of the pronunciations that were being learned were heavily reduced, much more reduced than the canonical pronunciations. I think that probably causes a decrease in accuracy, because it increases the confusability among the pronunciations in the lexicon. I don't really have a good idea of how to fix that, but I think it probably involves maintaining some measure that reduces or minimizes the amount of confusability in the word units that you get, similar to the talk that we just saw, saying that it is important to still be able to discriminate between the words that you have in the lexicon.

So, I don't see him... is he anywhere here? I will hunt you down if you've left the room. But I can agree with what William said on the pronunciation mixture model work, where we can actually see, when we throw out pronunciations or learn new ones, that we're learning variants that address the reductions you see in conversational speech. I haven't looked closely enough at the units we're learning to know whether you'd observe the same thing on the pronunciation side here, so I would assume something like that is going on, but I can't say it definitively.

Maybe I can say out loud what we discussed at the break. I mean, I wonder why we need boundaries between the speech sounds at all. We obviously do need a sequence of the speech sounds, but it could be enough to have that sequence and accept the fact that each of them is spread out over a couple of hundred milliseconds and that they really overlap, so the boundaries are more or less arbitrary things, and I don't think we need them. That's my point, but correct me if I'm wrong.

Any other questions or comments? It's been a long day, so maybe we should declare success and close the session. Alright, thanks everybody.