thanks karen, it's very nice to be here to talk about some of the low resource work we've been doing in our group. the research that i'll talk about today involves the work of several students in our group that i list here.
there's been a lot of talk just today about the low resource issue, with good descriptions, so i won't belabour the point. i think my perspective on the problem is that current speech recognition technology will benefit by incorporating more unsupervised learning ideas.
this schematic shows a range of increasingly difficult tasks that we could imagine, starting in the upper left from the conventional asr approach that has annotated resources, a pronunciation dictionary and units, to scenarios that have fewer resources associated with them, parallel annotated speech and text, independent speech and text, all the way down to just having speech. what can you do with that, as would be the case for example for an oral language like the ones we were talking about earlier this morning?
i think it's a challenging problem, but if we start to look at some of these ideas i think there will be a few benefits. first of all, i think it's a really interesting problem and we'll learn a lot just by trying to do it. second, i think it will ultimately enable speech recognition for larger numbers of languages in the world. and it has the potential, ideally, to complement existing techniques and so even benefit languages that are already quite successful with conventional techniques.
so in the time i have today i'm going to talk about two research areas that we've been exploring in our group. the first one is a speech pattern discovery method that we've applied to various problems, and this is an example of the zero resource scenario that could work potentially on any language in the world with just a body of speech, so that has a certain appeal to it, but there are no specific models that we learn from it. a newer line of research that we've been starting to do is exploring methods to learn speech units, models of those units, and pronunciations from either speech alone or when there are some limited resources available, and we're using a joint modeling framework to do this. even though it's still very early days for this work, i believe it's quite promising. so those are the two things i'll touch on, and it'll be a fairly high-level overview because i don't have a lot of time, but hopefully you'll get the idea of some of the things we're trying to do.
so the speech pattern work was actually motivated in part by humans, in the sense that people like jenny saffran have shown that just by exposing infants to a short amount of nonsense syllables, they very quickly learn where the word boundaries are and what they've seen before versus what they haven't. so we asked, why can't we try to apply some of these ideas if we have a large body of speech but we don't have, say, all the conventional paraphernalia that goes with conventional asr? maybe we have a lot of speech, so we could throw all that out and look for repeating occurrences, instances of things like words, and there might be some interesting things that we could do if we achieved that capability.
so to describe it in terms of some speech: the idea was, if we had different chunks of audio, could we find the recurring words? these are two utterances with the word schizophrenia in them; can we develop a method to establish that those two are the same thing? the approach that we took, fairly common sense i think, is that if you have a large body of audio, we would compare every piece of audio with every other piece of audio. so for those two utterances we would compute a distance matrix which would represent point-by-point distances, and the idea is that when two spectral frames are the same, you of course get a low distance, and when they are very dissimilar, a high distance.
the representations that we use have varied over time. we started off just using whitened mfccs, which work fine if it's all the same speaker. we then went to unsupervised posteriorgram representations derived from gaussian mixture models and also some dnns trained in an unsupervised way; we've done some stuff with herb's self-organizing units as well, representing them as a posteriorgram. it really doesn't matter what you use. the interesting thing is, when we all look at that picture, i think most of us can see right away that there is a diagonal, a sort of stretch of low distances, and that's where the repeating pattern is, so that's what we want to do: try and find that automatically.
and we developed a variation on dynamic time warping we call segmental dynamic time warping, which basically just consisted of striping all the way through the audio corpus so that you would eventually compare every piece with every other piece, and the warping path you were on would eventually snap onto that diagonal as you passed over that possible alignment. we call this little region of the alignment path a fragment, and the value in that region is sort of the point-by-point distance, and that's what we're trying to find.
so there are different ways to do this, but this particular illustration shows the point-by-point alignment of the two stripes, the two pieces of the utterances here, and this is the distortion as a function of the frame-by-frame distances, and here of course is where the overlapping word schizophrenia is. so you want to look along that warping path and establish some mechanism to try and find a low-distortion region. we were looking for things that were at least half a second long; it turns out the longer the constraint, the better this idea works, so something like boston red sox would work really well as an expression.
we extended it a little bit, and when we do this we produce our aligned fragments. people have modified this basic idea because it's computationally fairly intensive. we've actually done some work on approximations that are guaranteed to be admissible, but other people, like aren jansen at jhu, have done some really nice work using visual processing concepts to significantly reduce the amount of computation involved, although it turns out i think that using the segmental dtw idea is not a bad way to refine the initial matches.
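just to make the basic computation concrete, here is a minimal sketch (my own toy illustration, not our actual implementation) of computing a frame-by-frame posteriorgram distance matrix between two utterances and scanning constant-slope diagonals for low-distortion stretches of at least half a second; the real segmental dtw additionally lets the path warp around the diagonal, and the speed-ups just mentioned avoid the full quadratic comparison:

```python
import numpy as np

def posteriorgram_distance_matrix(P1, P2, eps=1e-12):
    """pointwise distance between two posteriorgram sequences.

    P1: (T1, D), P2: (T2, D); each row is a per-frame posterior distribution.
    uses -log of the inner product, a common posteriorgram distance.
    """
    return -np.log(np.clip(P1 @ P2.T, eps, None))

def diagonal_fragments(D, min_len=50, max_avg_dist=4.0):
    """scan constant-slope diagonals of the distance matrix D for stretches of
    at least min_len frames (e.g. 0.5 s at a 10 ms frame rate) whose average
    distortion is below max_avg_dist.  this is only a first pass; segmental
    dtw would additionally let the path warp around each diagonal.
    returns tuples (row_start, col_start, length, avg_dist).
    """
    T1, T2 = D.shape
    fragments = []
    for offset in range(-T1 + min_len, T2 - min_len):
        diag = np.diagonal(D, offset=offset)
        csum = np.concatenate([[0.0], np.cumsum(diag)])
        avg = (csum[min_len:] - csum[:-min_len]) / min_len    # windowed average
        for k in np.where(avg < max_avg_dist)[0]:
            i, j = (k, k + offset) if offset >= 0 else (k - offset, k)
            fragments.append((int(i), int(j), min_len, float(avg[k])))
    return fragments

# toy usage with random "posteriorgrams" (3 s and 2.5 s of 10 ms frames)
rng = np.random.default_rng(0)
P1 = rng.dirichlet(np.ones(50), size=300)
P2 = rng.dirichlet(np.ones(50), size=250)
D = posteriorgram_distance_matrix(P1, P2)
print(len(diagonal_fragments(D, min_len=50)))   # number of candidate fragments
```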
so when you do this, what happens is you end up with pairs of utterances, and the things in red are example matches, low-distortion matches that are found in your corpus. you can see that, depending on the parameters like the width that you pick (we were sort of aiming for word-level units, which is why we picked the half-second constraint), sometimes it's a word, sometimes it's multiple words, sometimes it's a fragment of a word, and sometimes it's something similar but not the same thing. this is the type of thing that you get out.
the interesting question then is, once you have all these pairwise matches for your corpus, you'd like to try to establish which things have the same underlying identity, and so you have to go to some sort of clustering notion. this is what we call speech pattern discovery: we try to represent all of these pairwise matches in a graphical structure that we can then do clustering on.
i'll describe how we defined the vertices in the graph in a second, but when you do it, each region corresponds to a vertex in the graph and the edges correspond to connections between the regions, and then you can try to do clustering.
naturally, of course, in the real world these clusters are inevitably connected in some capacity since the matches aren't perfect, but you can then apply your favourite clustering algorithm to try and find densely connected regions in the graph, and that's in fact what we did.
let me just very briefly show you one way that we did it; there are many ways to define the vertices in the graph. this illustration is sort of showing all the example pairwise matches, so each little rectangular bar corresponds to a match and the colour means it's the same match. the blue rectangle, for example, is a region where we think a word matches to something else that was said over here. okay, so different colours mean different matches.
well, if you actually look at what's going on at any point in time, it's messier than that, because you potentially have a whole lot of matches, and because each match is done independently the start times and end times are all probably going to be different. but what we did is we summarized that collection of matches by just summing up the similarities as a function of time, and what you get is something time-varying that has local maxima, and we defined those local maxima as places of interest where a lot of similarity matches were occurring.
and so we use those places to define nodes, or vertices, in our graph. once you've defined the nodes, the matched pairs that you have that overlap the nodes define the edges in your graph. so for example the blue pair went from node one to node eight, so you'd make this connection in your graph, and you can do that for all of the matches that you have that are low distortion, and so this is how you can construct your graph.
and then, as i said, you can apply a clustering algorithm to make chunks out of that, to define clusters, as sketched below.
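as a rough illustration with hypothetical match data (not our code), the graph construction might look like this: sum the match similarities over time, take local maxima of the profile as nodes, connect the nodes spanned by the two sides of each matched pair, and then cluster; here simple connected components stand in for the denser-subgraph clustering we actually used:

```python
import numpy as np
from collections import defaultdict

# hypothetical matches within one long recording: (start_a, end_a, start_b, end_b, similarity)
matches = [(100, 160, 900, 955, 0.8),
           (105, 150, 2000, 2050, 0.7),
           (905, 950, 2005, 2045, 0.9),
           (3000, 3070, 4000, 4065, 0.6)]
T = 5000                                   # total frames in the recording

# 1) sum match similarity over time to get a profile
profile = np.zeros(T)
for sa, ea, sb, eb, sim in matches:
    profile[sa:ea] += sim
    profile[sb:eb] += sim

# 2) local maxima of a smoothed profile become the graph nodes (vertices)
smooth = np.convolve(profile, np.ones(25) / 25.0, mode="same")
peaks = [t for t in range(1, T - 1)
         if smooth[t] > 0 and smooth[t] >= smooth[t - 1] and smooth[t] > smooth[t + 1]]

# 3) each low-distortion match connects the nodes its two sides overlap
def nodes_in(start, end):
    return [i for i, p in enumerate(peaks) if start <= p < end]

edges = set()
for sa, ea, sb, eb, _ in matches:
    for u in nodes_in(sa, ea):
        for v in nodes_in(sb, eb):
            edges.add((u, v))

# 4) cluster; union-find connected components stand in for the denser-subgraph
#    clustering used in practice
parent = list(range(len(peaks)))
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x
for u, v in edges:
    parent[find(u)] = find(v)

clusters = defaultdict(list)
for i, p in enumerate(peaks):
    clusters[find(i)].append(p)
print(dict(clusters))                      # cluster id -> node times (frames)
```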
so let me show you an example on a lecture that was recorded at mit. we had four matches here at different places in the recording, and there was a nice little cluster here that i'll show you. i was hoping to play you some examples, but the audio isn't really set up, and there were four things that would have played at the same time. basically this lecturer was talking about variations of search engine optimizer, search engine optimizing, and the cluster actually spanned the word optimize, picking up the common acoustics. so this is an example of the type of thing that you get. interestingly, all of these words tended to occur near each other in the lecture, so you can actually (we've done this work, i'm not going to talk about it) do topic segmentation based on the time-varying nature of these clusters over the course of a long audio recording.
i can show you some other examples; we've tried it on different languages. this is a lebanese interview that we recorded, and it actually has two people talking: one of them is using lebanese levantine arabic and one of them is talking in msa. the algorithm doesn't care, it doesn't know, it's oblivious; it's just looking for things that look like they're the same, and so here's the cluster that it got. there's another one from a mandarin lecture that we applied it to; you get the idea, it's defining these acoustic chunks and they're sort of the same thing.
now, when you do it over a single large body of audio like a lecture, you'll get a bunch of these different clusters, and it's interesting when you look at what the underlying identity of each cluster is: you can see that a lot of the terms are important content words. in fact, in one study we were finding about eighty five percent of the top twenty tf-idf terms in lectures, so it's an indication that the clusters are finding potentially useful kinds of information. and i guess one of the motivations for this is that if there's a word that's important in a conversation or a lecture, it'll probably be said multiple times, and that gives us a chance to find it. it's not always the case, but we need that for this type of technique to work.
now, one of the things that we've done recently is, in addition to doing this within one particular document, you can look at the relationships between these unsupervised patterns across different documents, and, as was discussed earlier with topic id, we can do unsupervised topic clustering based on the relationships of these unsupervised words across different documents.
just to visualize that a little bit: each of these grey rectangles is a different document and the darker grey rectangles are speech patterns that we found in the unsupervised way, and the connections are just places where they connected to each other with a low-distortion match. say, for example, you have this type of distribution of your clusters; then you might want to say that these two documents on the right belong together because of the connections between those unsupervised terms that we found, and the three on the left are in the same class. again, this is all done unsupervised.
so to do this we tried a couple of different methods, but the one that was most successful had a latent model for topics and words, and that's just the plate notation on the right. the observed variables of course were the documents, and then there's what we call a link structure: we define a link as the connections for each interval in a document that we found, and the link structure is just the set of connections to all the other patterns in all the other documents. the latent variable for words has a certain distribution over links, and the topics have a certain distribution over words, and you can learn this model with an em-style algorithm. the interesting thing is, we did some experiments on the fisher corpus, sixty conversations spanning six different topics. we seeded this with about thirteen hundred initial clusters, and we did tell it to define six clusters, so that was kind of cheating.
but these are the resulting clusters that we found, and the interesting thing is, when you look at the underlying speech patterns that are associated with these topics, they actually make a little bit of sense, which is nice. what you find is that there are relevant words and there are irrelevant words, things that you might like to have in a stop list if you were doing this with text, so it would be nice to get rid of those. let me show you some of the other ones: the green one was on minimum wage, another one is on computers and education, and the purple one is kind of interesting. when you look at the distribution of the true underlying topic labels, some of them are pretty good, you know, the holidays, computers in education, although the latter got split into two. it's kind of intriguing to me that corporate conduct and illness were mapped into the same cluster; maybe that's telling us something. but, you know, it's really early days, and there are a lot of things that you can potentially do with these kinds of unsupervised methods that are not conventional; earlier talks showed some nice examples too, and this is another one.
i want to move on and talk about some of the newer work that we're doing, where we're trying to learn models of speech units and learn pronunciations, really trying to get rid of the dictionary, or at least learn some methods for doing without it. you know, it's interesting: we pride ourselves on the ignorance models that we've developed with hmms, modeling things we don't know about speech, and yet the dictionary is still our crutch. it's typically made by humans, and hours and hours are spent tweaking these things, getting rid of all the inconsistencies; anybody who has worked on one knows it's hard work and it takes a long time. mary can tell you the amount of effort that goes into making these dictionaries for the babel program; it's not a trivial effort. why is it we can't learn the units and learn the pronunciations automatically? we do everything else. i think it's time we looked into this.
so this is the type of thing we're trying to do: what can we do from speech alone, or maybe, if you have some text, can that help you learn pronunciations? we're trying to do both of these things now in our work. there's prior work in this area dating all the way back to the eighties, with people trying acoustic-based approaches; among more recent work, herb's self-organizing units are a good example, and there's been other work that's come out of johns hopkins that is very interesting as well.
the approach that we've been taking is motivated by something that is becoming more popular in machine learning, a bayesian framework for inference. in particular, sharon goldwater, who's now at the university of edinburgh, had a really nice paper on trying to learn word segmentation from phonetic transcriptions, so the input was symbolic, and more recent work has tried to do it from noisier phonetic transcriptions. we wanted to try and modify this model so we could learn from the speech itself, so that's what i'll talk about now, and then we've recently extended it to try and learn word pronunciations as well.
so we all know what the challenges are when you're trying to learn what the speech units are. first of all, as the last questioner said, we don't know how many units there are, maybe there are sixty-four, maybe not, and we don't know what they are and we don't know where they are. so these are a lot of unknowns we're trying to figure out. what we're trying to do is, given speech, and in this case only speech, discover the inventory of units and build a model for each of them. as i said, we're formulating this in a kind of mathematical framework that's different for the speech community, where we have a set of latent variables that include the boundaries, the units and the segments, and then there's the conventional hmm-gmm model we're using for each unit, which i'll describe shortly. in this initial work we actually wanted to learn the number of units, and to do this we represented it with what's known as a chinese restaurant process, or dirichlet process, as a prior, so there's a finite chance of generating a new unit every time you iterate through.
so let me walk through this at a high level, sort of channelling the generative story of how we're generating an utterance with this hmm-gmm mixture. basically, underlying it we have a set of K models, and for a particular set of frames one of them is selected and it generates a certain number of speech frames, and then you transition to another one which generates more frames, et cetera, as you go through the entire utterance.
so that sort of describes our model, but what are the latent variables? first of all, we don't know where the transitions are in the speech between one unit and another, so the b's will be one set of latent variables. we don't know what the labels are, the inventory of labels i have down below, so these c's in purple will also be a set of unknown variables. we don't know, of course, the parameters of our hmm-gmm model, so those will be unknown variables as well. and lastly, as i mentioned, we don't know how many units there are, so that will be an unknown too, and this last piece is what we're modeling with the dirichlet process.
so the learning procedure for this is inference via gibbs sampling. it's an iterative process where we initially select values for the boundary variables (i'll talk about that shortly, but for now think of it as an initial segmentation) and we have an initial prior distribution for the parameters. then we go through our corpus one segment at a time, where a segment is defined as the chunk of frames between boundaries, and based on the posterior distribution we sample a value: for each segment we sample the identity of the unit, the c, for that particular segment. let me say a little more about that here.
as i mentioned, this is a chinese restaurant process because there's a finite chance of defining a new unit. for those of you not familiar with that, the analogy is of people going into a chinese restaurant and trying to decide which table to sit down at, because each table can seat multiple customers. in this notation, each segment is a customer and it has to decide which table to sit at, and each table has an index which corresponds to the identity of one of the models; think of each of these tables as belonging to a different unit. what you want to have is a posterior probability of taking a particular unit label for each segment, and that is basically proportional to the likelihood that that particular segment's data was generated by that particular unit's hmm-gmm, weighted by a prior probability which just corresponds to the number of customers that were at the table, normalized by the total number of segments that you have. you'll notice that there's also a little bit of probability that is stolen away from each one in order to assign some probability to the chance that you might generate a new unit as well.
so once you have these posterior distributions set up, you sample, and that's the value of c for that particular segment at that particular iteration.
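here is a hedged sketch of that sampling step for a single segment; the per-unit log-likelihoods are assumed to come from the existing unit hmm-gmms, and the new-unit likelihood is a stand-in for the prior predictive term in the full model:

```python
import numpy as np

def sample_unit_label(unit_logliks, counts, new_unit_loglik, gamma, rng):
    """one gibbs step: sample the unit index for one segment under a CRP prior.

    unit_logliks    : log p(segment | unit k) from each existing unit's hmm-gmm
    counts          : number of segments currently seated at each unit (table)
    new_unit_loglik : log-likelihood of the segment under a brand-new unit
                      (the prior predictive in the full model)
    gamma           : concentration parameter; the mass "stolen away" for a new table
    returns an index in [0, K]; K means "create a new unit".
    """
    counts = np.asarray(counts, dtype=float)
    log_prior = np.log(np.append(counts, gamma)) - np.log(counts.sum() + gamma)
    log_post = log_prior + np.append(np.asarray(unit_logliks, dtype=float), new_unit_loglik)
    probs = np.exp(log_post - log_post.max())      # normalize in a stable way
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# toy usage: three existing units; the segment's data strongly favours unit 1
rng = np.random.default_rng(0)
idx = sample_unit_label(unit_logliks=[-310.0, -290.0, -305.0],
                        counts=[40, 25, 10],
                        new_unit_loglik=-300.0,
                        gamma=1.0,
                        rng=rng)
print("sampled unit index:", idx)
```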
once you have a unit label for that segment, you then go through and sample the hmm parameters, and this is sort of on our home turf: it's a conventional hmm-gmm, we're using an eight-mixture model with the familiar three-state left-to-right transition structure. we assume you have to start a segment in state one and end it in state three, but for the other frames we draw samples to determine which state you're in, and once you have the state sequence we draw samples to see which mixture component you're drawing from, and then we can update the parameters based on that.
in passing, i want to say that we also have to consider different segmentations, the choices of the b's, these boundary variables. naively, every frame could be a boundary, and they take binary values: either a frame is a boundary or it's not, so zero or one. in terms of putting this into a probabilistic formulation, we again have a prior and a posterior probability. the prior is just a bernoulli trial: we flip a coin with probability alpha that it's a boundary and one minus alpha that it's not. to generate the posterior we draw a new sample for every boundary; we go through one boundary at a time, fix the state of all the other boundaries, generate the posterior distribution, and then sample whether it's a boundary or not.
and the posterior, sorry for the math, is just this: we have the prior here, and then we consider all possible units that the segments on either side of this boundary might be, so this is the likelihood that you would generate the data given that this was a boundary. then you consider the possibility that it is not a boundary, and again that's the prior, and you consider the likelihood of generating the entire merged segment, considering all possible models that you have. so those are your two posterior terms, and you sample from them to generate a new value for each boundary through the corpus.
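a minimal sketch of that boundary sampling step, in my own simplified notation, where the segment log-likelihoods are assumed to already be marginalized over all candidate units as just described:

```python
import numpy as np

def sample_boundary(loglik_left, loglik_right, loglik_merged, alpha, rng):
    """one gibbs step for a single candidate boundary b, all other boundaries fixed.

    loglik_left / loglik_right : log p(data | models) of the two segments that
                                 exist if b is a boundary, each already summed
                                 (marginalized) over all candidate unit models
    loglik_merged              : the same quantity for the single merged segment
                                 that exists if b is not a boundary
    alpha                      : bernoulli prior probability that a frame is a boundary
    returns 1 (boundary) or 0 (not a boundary).
    """
    log_p1 = np.log(alpha) + loglik_left + loglik_right        # b = 1
    log_p0 = np.log(1.0 - alpha) + loglik_merged               # b = 0
    m = max(log_p0, log_p1)                                    # stable normalization
    p1 = np.exp(log_p1 - m) / (np.exp(log_p0 - m) + np.exp(log_p1 - m))
    return int(rng.random() < p1)

# toy usage: splitting explains the data slightly better than merging here
rng = np.random.default_rng(1)
print(sample_boundary(loglik_left=-120.0, loglik_right=-95.0,
                      loglik_merged=-230.0, alpha=0.2, rng=rng))
```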
so you iterate through this, i think it was twenty thousand iterations we were doing, to generate all these parameters. the last thing i'll say, and this is near and dear to my heart as anybody who has known me for a while knows, is that i am a big believer in landmark-based things, and it turns out they can help save a lot of computation. so we're using some acoustic landmarks we developed, derived from spectral change, and the nice thing is that they are language-independent and they reduce the computation. and as i'll show you later, this is just for the initialization: once you've learned the units you can then go and train conventional models doing frame-based stuff, so this is sort of a heuristic to help you do the learning faster, but it seems to be effective.
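as an illustration of the kind of spectral-change landmark detector that could be used to propose candidate boundaries (a generic sketch, not the specific detector we built):

```python
import numpy as np

def spectral_change_landmarks(features, win=2, min_gap=3, threshold=None):
    """pick frames of large local spectral change as candidate segment boundaries.

    features : (T, D) array of per-frame features (e.g. mfccs)
    win      : half-window used to compare the past and future local averages
    min_gap  : minimum spacing, in frames, between landmarks
    """
    T = len(features)
    change = np.zeros(T)
    for t in range(win, T - win):
        past = features[t - win:t].mean(axis=0)
        future = features[t:t + win].mean(axis=0)
        change[t] = np.linalg.norm(future - past)
    if threshold is None:
        threshold = change.mean() + change.std()
    peaks = [t for t in range(1, T - 1)
             if change[t] >= threshold
             and change[t] >= change[t - 1] and change[t] > change[t + 1]]
    landmarks, last = [], -min_gap          # enforce a minimum gap between landmarks
    for t in peaks:
        if t - last >= min_gap:
            landmarks.append(t)
            last = t
    return landmarks

# toy usage: a synthetic feature stream with an abrupt spectral change at frame 50
feats = np.vstack([np.zeros((50, 13)), np.ones((50, 13))])
print(spectral_change_landmarks(feats))     # expect a landmark near frame 50
```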
i don't want to dwell on the experiments too much. in this work we were analysing the timit corpus, and we found a hundred and twenty-three units, so i'm going to have to argue that out with the earlier questioner who suggested sixty-four.
i have a similar kind of plot showing the underlying phonetic label versus the unit index, and you can see we're covering the majority of the sounds, though we're spending a little too much effort modeling silence; i think we would benefit from a good speech activity detector. the interesting thing is when you start looking at some of the sounds that have multiple units, like 'ae': we looked, and there tends to be a separate unit for particular contexts, so we are seeing some context-dependency here. 'ae' has a raised variant in a word like 'champ' versus 'cap', and we were seeing a lot of those cases where one particular unit covered the 'ae' followed by a nasal, so there is some context-dependency stuff going on.
how am i doing for time? i'm good? okay, good.
so i want to move on to the next step: ultimately we really want to get all the way to learning words. we're not there yet, but what we've tried to do is enhance the model so we could learn pronunciations from parallel speech and text data, and ideally do better than the graphone model, one hopes. okay, i'm going to go with your first answer.
now again, there's been work done in this area in the past; people like mari ostendorf in the nineties explored joint acoustic and lexicon discovery. our natural baseline for this work is the grapheme-based recognizer, the standard thing people do when there isn't a pronunciation dictionary but there is parallel text. and i'm giving away a little bit of the punchline, but the formulation we have can reduce to the grapheme-based setup if you want it to; it's one particular constrained application of it.
we're setting up an additional latent structure here
beyond what we had before
so we have all
we had the unit sequence we need to learn the boundaries
and we also need to learn the graphing the sound mappings now the experiments we
don't in english will think letter to sound if you want but the framework generalizes
to do two different languages
just to remind you where we started from: with just acoustic units alone and some distribution on the likelihood of predicting a particular unit, you can go through and generate speech frames as a sequence of these units. now, if you have a word associated with it, like the word fly, we need to introduce new variables. we're not learning word pronunciations directly; we're actually learning these grapheme-to-sound mappings, because we think they'll generalize better across a corpus. you might eventually want to do word-specific pronunciations too. so we represent that by another set of latent variables that are letter-specific, where we have specific distributions for each letter (by the way, we'll eventually try tri-grapheme, context-dependent ones, but this is mono-grapheme for now). so you have letter-specific distributions: the f might prefer this particular acoustic model and the l might prefer that particular one, et cetera; hopefully you get the idea.
so those are the letter-specific mappings, and this is the initial belief, and of course you need to couple these together so that your general belief is related to the more context-specific ones. again you have an unknown set of units, and these will also have dirichlet-process-like priors underlying them as well. so if you go through it: you have a letter, you use that particular distribution, you select a particular unit, and that unit generates your frames; when you go to a different letter there's a different distribution, so you'd likely sample a different unit and generate more frames, et cetera.
and that's sort of how it would work. so this model is now a joint model for learning the units, the grapheme-to-sound mappings, and the underlying acoustic models.
one more wrinkle we have to deal with is that there isn't necessarily a one-to-one matching between letters and units; with graphemes there is a one-to-one matching by definition, but with sounds there isn't necessarily. so we introduce another variable that gives you some flexibility as to the number of units a particular letter might map to; thinking of english here, we said zero, one, or two, and we have a little categorical distribution which predicts that likelihood, a letter-specific (or context-dependent letter-specific) distribution. so the letter x, for example, might be likely to be represented by two models that you learn.
so let me go through this very quickly, i know i'm running out of time. once you have a sequence of letters, you have to generate a sample for the number of units for each one; once you have that set of units, you have to pick an hmm acoustic model to sample from, and these are position-dependent as well, so for x, which has two, we have position-one and position-two distributions that you sample from. then, once you've sampled the particular units, you generate speech frames from the appropriate hmm-gmm model.
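to pull the generative story together, here is a toy sketch of generating an utterance for a written word under this model; every distribution here is a made-up placeholder, whereas the real model puts dirichlet-process priors on them and learns everything jointly by gibbs sampling:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical letter-specific parameters over a tiny inventory of 5 units
letters = "fly"
n_units_probs = {c: [0.05, 0.80, 0.15] for c in letters}     # p(0, 1 or 2 units | letter)
unit_probs = {                                                # p(unit | letter, position)
    "f": [[0.70, 0.10, 0.10, 0.05, 0.05], [0.2] * 5],
    "l": [[0.10, 0.70, 0.10, 0.05, 0.05], [0.2] * 5],
    "y": [[0.05, 0.05, 0.10, 0.70, 0.10], [0.2] * 5],
}

def emit_frames(unit, rng):
    """stand-in for the per-unit 3-state hmm-gmm: emit a random-length block
    of 13-dimensional frames centred on a unit-specific mean."""
    length = int(rng.integers(5, 15))
    return rng.normal(loc=float(unit), scale=0.5, size=(length, 13))

def generate(word, rng):
    units, frames = [], []
    for letter in word:
        k = int(rng.choice(3, p=n_units_probs[letter]))       # 0, 1 or 2 units for this letter
        for pos in range(k):
            unit = int(rng.choice(5, p=unit_probs[letter][pos]))
            units.append(unit)
            frames.append(emit_frames(unit, rng))
    speech = np.vstack(frames) if frames else np.empty((0, 13))
    return units, speech

units, speech = generate("fly", rng)
print("unit sequence:", units, "| frames:", speech.shape)
```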
so the latent variables that we have to deal with, and now there are more of them, are the number of units per letter, the unit label identities as before, these letter-specific distributions, and the hmm-gmm parameters. as for the inference, i'm going to skip that; i knew i wouldn't have time to go into it.
you end up with this mapping all the way down from the letters to the segments, and now, if you look at the top part, you can generate pronunciations for the words in your lexicon in terms of these units that you learned, and of course you can train your hmm-gmm models, and guess what, now you're in business to train a conventional speech recognizer with whatever technique you like.
this is similar to what herb was talking about. for the experiments, and i know it's very ironic that here i am talking about low-resource languages and all the experiments i'm showing you are in english, it's ongoing work, trust me, we've done some experiments on a weather corpus that we've had for a while and like working on. we used an eight-hour subset to try and learn the units and the pronunciations, and then we retrained a conventional recognizer on the entire training set using these units, and we compared against our baselines: the expert dictionary is where we want to be, but certainly we would hope to do better than the graphemes. just to cut to the chase, we're in between the two right now.
so, graphemes are hard for english, everybody knows that, so it's nice that we were able to beat that, but we'd like to get to the supervised level, and i actually think there's a reasonable expectation that we can do that, because we've done some work on automatically throwing out expert pronunciations and learning new ones based on graphone models, and we can get down to about eight point three percent. so i would hope that we could drive this down below the ten percent mark.
it's still early days. it's also hard to interpret exactly what you've actually learned, because the pronunciations are in terms of these unit numbers, which is not very intuitive for me. but we can do things like look at all of these words that have a 'sh' in them and see that they tend to have the same unit, so that's kind of encouraging.
the other thing that jackie has done recently is to use moses to try and translate between the two dictionaries, the pronunciations from the expert dictionary and the pronunciations from these learned units. so here are six words; the yellow one on top is the expert pronunciation and the blue one is the learned-unit pronunciation translated into phone-like units, so they're a little more interpretable by us. i think it's on the right track, i think there's something there, and we need to do a lot more investigation, but i'm encouraged so far. so that's sort of where we are.
so let me wind up. i think it would be beneficial, and i'm really encouraged, to see more people doing unsupervised things. it's challenging, but i think we learn a lot by doing this, and i truly believe that in the end it will help us develop speech recognition capability for more of the world's languages. i'm also optimistic that these methods can potentially complement existing approaches; one of the things we're doing right now in the babel framework is looking at these acoustic matching mechanisms as a means to rescore keyword hypotheses from conventional recognizers, and maybe that can be helpful.
so i've shown you some progress we've been making in speech pattern discovery and topic clustering, and also in learning units, and i mentioned that we're looking at other languages and augmenting the framework right now to try and learn words themselves from the audio. and i guess, if i can be speculative at the very end: somebody this morning said the elephant in the room is text data. i actually think the elephant in the room is us, in fact all of humanity that ever has been and ever will be. you know, we all learned language as toddlers; no one gives us a dictionary, no one gives us text. we have other things, but we figure it out, and i think we'd be better off in the long run as a community trying to think about how to get some of those capabilities into our systems. anyway, thank you very much, and i'm done.
here are some references for anybody who's interested, and i'd be happy to talk to anybody. sorry i went over.
so that was actually just about exactly on time. i want to remind people really quickly that after this break, instead of having a full-hour panel session, we're going to have a shortened panel session preceded by a talk by the cognitive scientist i mentioned this morning who is working on human language. we have time for a few questions.
so i think this is a pretty nice approach based on the generative model framework, and the others discussed today seem more like discriminative approaches based on neural networks, and i think both are pretty important. so do you have some kind of idea about how to integrate these two kinds of approaches, how to say, fuse them into some nice framework?
well, i don't know if it's nice, but first of all i totally agree with you. i sort of view this generative stuff we're trying to do as a way to get started, and then, as herb said, i think if you can figure out the units and some pronunciations and get going, then there's potential to bring in discriminative things to sharpen the boundaries that you're learning. so that's sort of the approach we're taking: use this to generate an initial speech recognizer and then, like the landmarks, go away from it once we have an initial model. it's maybe not the best idea, but it's the best one i have at the moment.
more questions
at the beginning of the talk you said that you go for places of interest by looking at similarities, so by detecting places where the speech signal is similar to some other place, you say this is a place of interest. well, this is a bit contradictory to the information-theoretic view: when you have something that is unique, said just once, it can carry lots of information, and that could be a place of interest. could you comment on this please?
it's very true. i mean, these patterns are complicated, because they're sequences of sounds, so they're probably a little more reliable to detect, but it could be that you have a very important word that only occurs once, and this method would not be able to find it. so it's more about how reliably we think we can actually find things, and the longer a pattern is the more reliable it is, and it turns out, and that's why i mentioned the comparisons with tf-idf, that we do seem to be able to find important content words in lectures using these methods. now, a hidden thing i only mentioned a little bit is that a lot of the common words are short, and that's why having a threshold on duration is important, like half a second; you guys might have even used a second in the first paper, right? okay. a half second eliminates a lot of the very commonly occurring things. one other point to add to that is that the non-parametric bayesian stuff, the chinese restaurant process, is actually really good at generating a new category for something that occurs just once or a very small number of times, and so it's sort of a really nice framework, i think, for language generally.
are there more questions?
so one of the long-standing problems in this field is the choice of the observations themselves. have you thought about designing something that would allow us to infer better observations based on these kinds of non-parametric distributions?
i missed the part about the observations; what is it you're asking?
i don't believe that anybody in this room listens to cepstra, and that's just not the right thing. the question is what the right thing is, so could we acquire more data and analyze it so that we could actually figure out what the observations should be?
you're talking about the input representation. these were all just based on mfccs. it's a very good question; i actually think we would benefit from a better representation that naturally had more of the phone-like contrasts in the languages around the world. i think mfccs are a rather blurry description of the signal, but i don't have any brilliant answer.
i just have a question about data. i don't know what your impression is: with respect to discovering things, is more data always better? what is the state you would ideally want to have? have you thought about how much data collection is needed to really understand this, how much data you really want?
i almost don't know the answer to that question; it seems like it depends on what you're asking.
so there's a lot of data for english but not, of course, for all languages, and resources like babel will be a tremendous legacy that people can evaluate on. i don't know how much we need overall; i mean, the stuff i was talking about is at a very early stage.
okay, well, on that note let's thank jim.