So I'm gonna talk about Project OUCH, but first, thank you for having me here. I've enjoyed my time in the Czech Republic and I've learned a little Czech, which is going well. So thank you.
So, Project OUCH stands for Outing Unfortunate CHaracteristics of HMMs. Truthfully, there were three phases. The sort of initial work we did on this was a project that Larry Gillick and I started when we were at Nuance, and truthfully it also had its antecedents in work that we were doing earlier. But that was a very small funded pilot study, and we later got funding for a larger but still small effort. The people who worked with me were Dan Gillick and a postdoc who is currently at Berkeley; Larry Gillick, Jordan Cohen, Morgan, and myself were the senior people.
So, Project OUCH: what we're trying to do, our goal, is to sort of develop a quantitative understanding of how the current formalism behaves. And, you know, surprisingly there has been very little work in this direction in the forty-year history of speech recognition. There's been some, but it's been isolated and sporadic.
And, you know, progress in speech recognition has been very erratic, in my view largely because we've been proceeding via trial and error. And so the claim is that by gaining a deeper understanding of how our algorithms succeed and fail, rather than just measuring word error (if we get an improvement in word error we keep it, if it doesn't improve we throw it out), we should enable more efficient and steady progress. And I claim that this should be embedded in our standard research methodology: not necessarily the techniques that I'm gonna talk about, okay, but just this notion that when you have a model that, you know, doesn't fit the data, you should try to gain some understanding of how the model differs from the data and how that data-model residual impacts the classification errors.
So the main questions that Project OUCH was interested in, the main way you could think about it, are these: what do the models find surprising about the data, what is it about speech data that the models find surprising, and how does that surprise translate into errors?
So I'm gonna talk today about quantifying the two major HMM assumptions and their impact on error rates. Of course, the two major assumptions are the very strong independence assumptions the model makes, and also an equally strong assumption about the form of the marginal distribution of the frames: typically we assume they're Gaussian mixture models, and of course nowadays people are using multi-layer perceptrons, but either way you make some sort of formal assumption about what it looks like. Also, which of these incorrect assumptions is discriminative training, MPE or MMI, compensating for, relative to maximum likelihood? And do these results change when you move from matched training and test, which is our usual formalism, to the mismatched case?
So the early work that we did was on the Switchboard and Wall Street Journal corpora; later on we moved to the ICSI corpus. You can recast this question about how the results change in the mismatched case as a form of: why is ASR so brittle? Any time you bring up a new recognizer on a problem, whether in the same language or across languages, you always have to start, it seems, almost from scratch. You always have to collect a bunch of data that's closely related to the task that you have, and it hardly ever works the first time you try it. It's the reason that most of us in this room have jobs, so it's sort of a good thing, but it's incredibly frustrating, right? It's like, it's a miracle when anything works the first time.
So this later project was mainly interested in studying these questions on the ICSI meeting corpus, where there's a near-field channel and a far-field channel; I'll talk a little bit more about that. We wanted to understand, when you train models on the near-field condition, what happens when you recognize the far-field data. And so in this context: is the brittleness of ASR solely due to the model's inability to account for the statistical dependence that occurs in real data?
And you know, when I started this particular project, I thought that it was just gonna be the independence assumptions, and I was very surprised, when we actually started doing the work, that in fact it wasn't quite like that. And so, I say it sort of funny, but in the matched case, basically, the inability of the model to account for statistical dependence that occurs in real data is basically the whole problem. But when you move to the mismatched case, all of a sudden something else rears its head, and it's a big problem. And so I'll describe what this problem is: it has to do with the lack of invariance of the front end.
So I'm gonna spend a little bit of time talking about the methodology we use. The way we explore this question is we create, we fabricate, data: we use simulation and a novel resampling process that uses real data to probe the models. The data that we create is either completely simulated, so that it satisfies all the model assumptions, or it's real data that we resample in a way that gives it properties that we understand. And so by feeding in this data we can sort of probe the models and see their response to it, and what we observe is recognition accuracy.
So here's an example. [plays a real Wall Street Journal utterance about a capital markets report] So this is an example of what we expect speech to sound like; this is from Wall Street Journal. And this is a fabricated version of it that essentially agrees with all the model assumptions. [plays the fabricated version] So, you know, it's highly amusing, but it's intelligible, obviously, and obviously, you know, it's from a model that was constructed from a hundred different speakers, and it reflects that sort of structure.
So what we're trying to quantify is the difference between these two extremes in terms of recognition accuracy.
So the basic idea of data fabrication is simple: we follow the HMM's generative mechanism. To do that, we first generate an underlying state sequence consistent with the transcript, the dictionary, and the state transitions of the underlying hidden Markov model. Then we walk down this sequence and we emit a frame at each point.
So here's a picture, a nice picture, that describes this structure; parts of it are actually a graphical model, and this of course is an HMM. But basically we unpack: if we have a transcript, we unpack the words, we get the corresponding pronunciations, the phones in context, and then determine which HMM we use. So this is the hidden state, and each of these states emits observations according to whatever mixture model we're actually using, right? And if you're not so familiar with HMMs, and I assume pretty much everyone in the room is, this sort of highlights the independence assumptions. Well, it highlights two things. One, the frames are emitted according to a rule, and the rule is the form that we assume for the marginal distribution of the frames. And then, of course, this also says that these frames are independent: every time I emit a frame from, say, state three, it is independent from the previous frame that was emitted from state three, so that's a very strong assumption. But in addition, it is also independent from any of the frames that were emitted previously from the other states, so these are very strong assumptions. But okay, again, to generate observations we just follow this rule: basically, once I have a sequence of states laid out, I just walk down that sequence of states and I do a draw from a distribution, whether it be empirical or parametric.
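Just to make that generation rule concrete, here's a minimal sketch in Python of the fabrication step as I've described it: given a state sequence from an alignment, emit one frame per state by sampling that state's mixture model. The state labels, dimensions, and GMM parameters below are made up for illustration; this is not the project's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gmm(weights, means, variances):
    """Draw one frame from a diagonal-covariance Gaussian mixture."""
    k = rng.choice(len(weights), p=weights)              # pick a mixture component
    return rng.normal(means[k], np.sqrt(variances[k]))   # then sample that Gaussian

def simulate_utterance(state_sequence, gmms):
    """Walk down the state sequence, emitting an independent frame per state."""
    return np.stack([sample_gmm(*gmms[s]) for s in state_sequence])

# toy setup: three states, two-component mixtures, 39-dimensional frames
dim = 39
gmms = {s: (np.array([0.5, 0.5]),              # mixture weights
            rng.normal(size=(2, dim)),          # component means
            np.ones((2, dim)))                  # diagonal variances
        for s in range(3)}

frames = simulate_utterance([0, 0, 1, 1, 1, 2, 2], gmms)
print(frames.shape)  # (7, 39): every frame drawn independently, exactly as the HMM assumes
```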
So, for simulation, you know, it's easy to simulate from a mixture model, not a big deal, right? But what about this novel sampling process that'll allow us to get at the independence assumptions? Well, for this we adapted a formalism from Efron's bootstrap. I talked a little bit about the bootstrap in the paper and the poster; people in the field don't seem to be terribly familiar with it, I'm not sure it's taught very much, but I will try.
So the basic idea is: suppose you have an unknown population, right? You've got some population distribution, and you compute a statistic that's meant to summarize this population. Then you want to know how good that statistic is, so I want to construct a confidence interval for the statistic, to give me a sense of how well I've estimated it. So how do I do that if I don't know what the population is? I mean, I'm trying to derive properties of this population, and in particular I don't know anything about it, really, except the sample I've drawn from it. Before Efron's bootstrap procedure, people would usually make some parametric assumptions about the population, typically you'd assume it's normal, or Gaussian, and then compute a confidence interval using that structure.
Well, of course that's sort of crazy, you know, why would you do that? Especially if you're trying to ask, is this population distribution Gaussian or not? Well, it's crazy to stipulate that the population distribution is Gaussian in order to compute this confidence interval. So this was a big problem in the late seventies, when computers became sort of usable by statisticians, and Efron came up with this formalism. And so the name comes from pulling oneself up by the bootstraps. Lots of people use 'bootstrap' for various sorts of terminology, and it allegedly comes from, everyone attributes it to, the story in the Adventures of Baron Munchausen, where he's stuck in a swamp and needs to get out, so he pulls himself up out of the swamp by his bootstraps. But of course, if you read the original Adventures of Baron Munchausen, that's not what happened: in fact, he's stuck in a swamp, on horseback, trying to get out, and instead he pulled himself out by his own hair. So maybe we should have called it something else; I thought that was pretty funny.
So the way the bootstrap works is you take the empirical distribution. You have the sample, and this sample is representative of the true population distribution, so if it's big enough it should be a pretty good representative. And so instead of fitting a parametric model to this, you treat it as an empirical distribution and you sample from that empirical distribution. Sampling from the empirical distribution turns out to be equivalent to just doing a random draw with replacement from the sample itself, hence the name 'resampling'.
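As an aside, here's a minimal sketch in Python of that plain bootstrap idea: treat the sample as the population, draw with replacement, and read a confidence interval off the resampled statistics. The sample and statistic here are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.lognormal(size=200)   # an observed sample; the true population is "unknown"

def bootstrap_ci(data, statistic, n_boot=2000, alpha=0.05):
    """Percentile bootstrap interval: resample with replacement, no Gaussian assumption."""
    estimates = [statistic(rng.choice(data, size=len(data), replace=True))
                 for _ in range(n_boot)]
    return tuple(np.percentile(estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)]))

print(bootstrap_ci(sample, np.mean))   # e.g. a 95% confidence interval for the mean
```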
So we're gonna adapt this formalism to the problem at hand. So when we train our models, right, imagine we're doing Viterbi training. I'll have another picture, but basically we're gonna resample the frames that are assigned to a particular state during training, and that's how it works. And we can do this for various types of segments.
So here, it's a really crappy picture, and I have to do a better job, but here again: we have the true population distribution, and if we fit, say, a Gaussian to it, that's not a particularly good representative; instead, if we've drawn enough data from it, this histogram estimates the distribution.
So basically, the important part of this slide is this: resampling is gonna fabricate data that satisfies the independence assumptions of the HMM, because I'm gonna do a random draw with replacement from the distribution. But the data we create are gonna deviate from the HMM's parametric distributional assumptions to exactly the same degree that real data do, because it is real data: it's the data drawn from the training set.
So here's an already better picture, which I can use to describe a little bit about what we do. Imagine we have training data and we're actually doing Viterbi training. If we're doing Viterbi training, we get a forced alignment, and for each state we just accumulate all the frames for that state and then we fit a GMM to them, right? Instead of doing that, in the bootstrap formalism we accumulate the frames that are labeled with that state and we stick them in urns. So training is just like ordinary Viterbi training: you just accumulate all the frames associated with the state, but instead of forgetting about them once you've used them to compute the parameters, you keep track of what they are. And so when it comes time to generate pseudo-data, you have an alignment, or some state sequence that you've obtained however, and you walk down it to generate the frames: if I were generating the frames by simulation I would do a random draw from a distribution; now instead I do a random draw with replacement from a bucket, an urn, of frames, okay?
So the frames, again, are independent, because I'm doing random draws with replacement, and they deviate from the distributional assumptions to the same degree that real data do, because they are real data. And then I can also do this for sequences: I can resample whole trajectories, phone trajectories and word trajectories. So here, this is a sequence of frames associated with states, and I can stick that whole sequence into the urn. Likewise I can take a whole phone sequence and put it in there, and when I draw from the urns, instead of getting individual frames I get segments. So the important thing is, no matter how I've divided the utterance into segments, when I draw the segments, between segments things are independent, but they inherit the dependence that exists in real data within each segment. So we have between-segment independence and within-segment dependence, and this is the way that we can control the degree of statistical dependence that's in the data.
So this slide just sort of summarizes this, and you can see you could even stick a whole utterance in an urn, but the point is that segment-level resampling relaxes frame-level independence to segment-level independence.
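To make the urn idea concrete, here's a minimal sketch in Python of what frame-level versus segment-level resampling might look like. The alignment format, labels, and 'frames' below are toy stand-ins rather than the project's actual tools; the point is just that drawing whole segments with replacement keeps the real within-segment dependence while forcing independence between segments.

```python
import random
from collections import defaultdict

random.seed(0)

def build_urns(train_alignments):
    """train_alignments: list of (unit_label, frames) segments from a forced alignment."""
    urns = defaultdict(list)
    for label, frames in train_alignments:
        urns[label].append(list(frames))       # keep each real segment intact
    return urns

def resample_utterance(target_alignment, urns, frame_level=False):
    """Fabricate pseudo-data by drawing, with replacement, from the urns."""
    output = []
    for label, frames in target_alignment:
        if frame_level:
            # frame-level: every frame is an independent draw from that unit's pooled frames
            pool = [f for seg in urns[label] for f in seg]
            output.extend(random.choice(pool) for _ in frames)
        else:
            # segment-level: one draw returns a whole real segment, dependence and all
            output.extend(random.choice(urns[label]))
    return output

# toy example: plain numbers stand in for 39-dimensional feature vectors
train = [("ih_1", [1.0, 1.1]), ("ih_1", [0.9, 1.2, 1.0]), ("ih_2", [2.0, 2.1])]
urns = build_urns(train)
print(resample_utterance([("ih_1", [0, 0]), ("ih_2", [0])], urns))                    # segment-level
print(resample_utterance([("ih_1", [0, 0]), ("ih_2", [0])], urns, frame_level=True))  # frame-level
```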
So here's a picture of the model's response to fabricated data. Okay, I don't know how much time I want to spend on this, but here what we have is simulated data versus the real error rate, and as I gradually reintroduce dependence into the data, the word error rate starts to increase rather dramatically. So the point is, let's look at the simulated word error rate. You can think of this as having some sort of knob where you're reintroducing dependence into the data, and as I reintroduce dependence, the error rate becomes quite high. This is ICSI meeting data, with unimodal models.
The same sort of phenomenon happens when you use mixture models, say eight-component mixtures. So here the simulated error rate is around two percent, a little bit less than two percent. When I do frame-level resampling, the error rate increases just a little bit; it does increase, but by very little. Now, when I reintroduce within-state dependence, all of a sudden the error rate becomes around twelve percent, so the error rate has increased by a factor of six. When I introduce within-phone dependence, the error rate increases again by about a factor of two. And then when I go to words, it increases again by almost a factor of two. The jump from frame to state is typically the largest one on the corpora that we've worked with: when you move from frame to state, the error rate typically increases by about a factor of six.
So you think about this and you can make an argument, and the argument is that the distributional assumption that we make with GMMs is not such a big deal. I mean, it's important, but it's not such a big deal. The biggest single factors are these reintroductions of dependence: it's the dependence in the data that the models are finding surprising. I mean, you know, everybody knew the independence assumptions were wrong, I'm not saying that's surprising, but I personally was really surprised, and it took a long time to come around to the fact that the errors really are rooted in the independence assumption, and we tend to work around this with other sorts of things.
So this is a summary of the matched-case results. The conclusion is that when we have matched training and test, it's the independence assumptions that are the big deal: it's the model's inability to account for dependence in the data that is derailing things, not the marginal distributions so much.
So, also surprisingly, in a different, later study we adapted this formalism to ask the question: what is discriminative training doing? You know, you start with the maximum likelihood model, you apply MMI, what's happening here? So you apply this formalism and you see that, in fact, MMI is actually compensating for these independence assumptions, in a way that I don't completely understand; I have hypotheses about how this might work.
So here you have a really complicated procedure that's a little hokey, that took people, many people in this room, twenty years to get to work, right? And once it was shown to work on large vocabulary, it took many labs additional years to get it to work in their own lab. You know, now it's pretty routine to do this, but it was a struggle to get it to work, and my point is that what it's doing is compensating for the independence assumptions. We know the independence assumptions are a problem. I'm not saying that it's gonna be easy to find a model that relaxes the independence assumptions, but perhaps that twenty years of effort would have been better spent attacking that problem.
So what about mismatched training? So, the ICSI meeting corpus: we have near-field data collected from, you know, head-mounted microphones, and there was a microphone array of some sort, but the meeting room was quiet, it was small, and it had a normal amount of reverb, the kind of reverberation you'd expect in a room. If you listen to these two channels you can tell that they're different, but it's not like the far-field channel is radically different when you listen to it: it sounds a little different, but it's perfectly intelligible.
So we explored training and testing with near-field, training and testing with far-field, and this mismatched condition where we train on near-field data and test on far-field. I'll just say that it's harder than it looks: you have to be careful and you have to think about what you're trying to do when you run these types of experiments. In particular, there were a lot of issues that we went through to get the near-field channel and the far-field channel exactly parallel, so that we were actually measuring what we wanted to; it's a somewhat intricate lab setup. So the paper that we wrote for ICASSP, I don't know how well it describes it, but it attempted to, and on the ICSI website there's a technical report that's reasonably good and describes a lot of this stuff, so I'm not gonna belabor it, but there was a lot of effort that we had to go through.
So here's the bottom line. First let's look at the green and the red curves: these are the matched near-field and far-field conditions, and notice that they track each other pretty well. They're different, and the far-field real data is obviously harder, but interestingly, look down here at the simulated and frame-resampled error rates: they're still really low. The matched far-field one is higher, it's worse, but it's still really low, and in particular these error rates are around two percent, right? So let's think about that. But before we do, notice the mismatched simulation error rate: this is where we want to concentrate, this is what we want to think about, right? We don't need to worry about this other stuff; it's the simulated case that we're gonna concentrate on.
So, when you simulate data from the near-field models and you recognize it with the near-field models, the error rate is essentially nil. So that means that problem is essentially solved. Again, when I take the far-field models, I simulate data from the far-field models and I recognize it with the far-field models, I get essentially no errors. Again, that means that problem is essentially solved. So in these two individual spaces, in the signal processing, the MFCCs that are generated in the matched cases, they're essentially separable problems. But all of a sudden, when I take the near-field models and look at the far-field data, it's dramatically not solved. So that means that under the transformation that takes place between the near-field data and the far-field data, the front end is not invariant, and that lack of invariance is what's causing this huge increase in error. Again, it's not surprising that the front end is not invariant to this transformation; there's a little bit of reverb, there's a little bit of noise. But what's remarkable is that that is solely the problem that causes this huge degradation in error, and that is actually fairly remarkable.
So there are many more results involving mixture models: we reran all of these experiments with, I think, eight-component mixture models and we see the same sort of behaviour. We've reproduced all the discriminative training results: we asked, can discriminative training somehow magically alleviate the mismatched case, and the answer is no. We did, and I think Morgan pushed this, really, a natural question: how does MLLR work in this setting? We talked about that, and with MLLR you can reduce some of the degradation, as you would expect, but MLLR is a simple linear transformation, and whatever transformation is happening between these two channels, it's some peculiar nonlinear transformation, right? So it's unreasonable to expect MLLR to do that well. But this test harness is a really good test harness for evaluating, you know, how invariant to these transformations our front ends are, and so we've explored that a little bit, and it's not so encouraging.
Alright, well, I think I'll end there; I've sort of blathered on long enough. I'll turn it over to Jordan, and he will give a higher-level view of the whole idea, and then we'll have questions.
So, following on from what was just presented... okay, one, two, three.
Alright, so it turns out there were two parts of this project. Steve told you about the technical stuff, but we also thought that we'd like to figure something out. You've been hearing a lot about how wonderful speech recognition is during this meeting, and we thought we would actually like to understand what the community thought speech recognition was like. So we ran a survey, and I called a bunch of people; many of you got called by me. And what we wanted to do was just see what people thought about how speech recognition really worked. We were hoping that we would find some evidence to persuade the government maybe to put in some money and fund some speech recognition research, which we haven't seen in a long time, but really we just wanted to find out what was going on.
but we really we just one the final was going on
and so we put together a little survey team
jen into jamieson worked with me she's a alice that's been in speech for very
long time and we engage frederick okay and he's a specialist at doing service
and we design a snowball start by
it's normal surveys very interesting it
it says you start with a small group of people that you know and you
have some the questions and then you apps them who else task
and you just follow that for your nose and what that means is although it's
not entirely unbiased it's as unbiased as you can do if you don't know the
sampling populations going to be
so we want to low what was going on what the people think or the
failures and what remedies of people try and how do they were
so we did this novel sampling
Here's the questionnaire. I don't want to spend a lot of time on this, but just take a look. The interesting questions are the last one on the slide, where has the current technology failed, and the first one on the slide, what do you think is broken, and then questions about what you did about what was going on, and whether there's other stuff.
The survey participants tended to be old; I think that's sort of how our snowball worked. Not terribly old, but there are not a lot of young people in this field, so ages ran from thirty-five to seventy. We spoke to about eighty-five people, and they had an interesting mix of jobs: most of them were in research, some were in development, some were in both. There were a small number of management people, and then people who described their jobs as something more detailed. But mostly these are R&D people or managers doing speech research or language research of one sort or another.
So here's what you told us. Natural language is the real problem, and acoustic modeling is a real problem, and everything else that we do is broken, more or less. So I think the community sort of has this feeling, not the people trying to sell speech recognition to the management, but the people trying to make it work have a feeling that all is not really well with the technology. So lots of people, when you point fingers, are pointing at the language itself and at acoustic modeling, and there's a third category, which says 'not robust', which is what Steve was talking about. So there's something going on with this technology that makes it not work very well. And when we asked people what they've tried in order to fix things, the answer is: everything.
People have mucked around with the training, some people have tried all kinds of different pieces of their system... alright, anyway.
One of the interesting things that people have tried to do: many of us have tried to fix pronunciations, either in dictionaries or in pronunciation rules, and to a person, everyone has found that this is a waste. It's pretty interesting; so that's not a way to fix the systems that we currently build. So we've tried all kinds of stuff.
And so I think our takeaway from the survey is that people actually don't believe the technology is very solid, and we try a lot of things to fix it. And then we looked a little at the literature; the literature survey is in the ICSI report, which you can go read. We found a quote that looks sort of like this, from a review by Furui, and it says: LVCSR is far from being solved; background noise, channel distortion, foreign accents, casual disfluent speech, and unexpected topic changes cause automatic systems to make egregious errors. And that's what everybody said: anybody who's looked at the field says, well, this technology is okay sometimes, but it fails a lot.
So what we concluded was: the technology is old. I'd point out that the models most of us use, hidden Markov models, are the thing that was written down by Baum and colleagues around nineteen sixty-nine, so maybe that's the kernel of one of our issues here. When these systems fail, they degrade not gracefully, like you or I do, but catastrophically and quickly. Speech recognition performance is substantially behind how humans do in almost every circumstance, and the systems are not robust.
So I wanted to give that sort of overall overview of what the survey was, and it's available on the ICSI website. But I wanted to add a couple of personal comments, my own analysis of what's happening. These are not, I'm not representing the government here; I actually want to talk to you about my own personal analysis.
So here there are three points. First point: if you have a model and you spend a lot of time hill climbing to its optimum performance, and it doesn't perform optimally at that spot, you've got the wrong model. Hidden Markov models were proved to converge by Baum and his colleagues around nineteen sixty-nine. That proof has two parts. One, it says you can always make a better model. Two, it says you get the optimal parameters if the data came from the model. That second part is absolutely not true in our speech recognition systems: we're climbing on data that doesn't match the model, and we're not gonna find the answer that way. So we've spent a lot of time trying to adapt around the problem, but we've got the wrong model.
This is a personal bone to pick: if you use sixty-four Gaussians to fit some distribution, you have no idea what the distribution is. The original Gaussian distributions were defined with a single mean, and I understand why we do it, but that's not what this is. And so my corollary, I think, speaks for itself. And finally, if the system you build fails for fifty percent of the population entirely, and then, for the people it works for, fails the moment they walk into a reverberant environment or a noisy place, it's broken. And I believe speech recognition is terribly broken.
So I think what we really wanted to do, and I want to draw an analogy here, is an analogy between transcription and transportation. For transportation, man, this is what I want: something that's sleek and speedy and easy to use and doesn't break. And what we build is this: it runs on two wheels, it will get you there eventually, and you spend almost all your time dealing with problems that have nothing to do with the transportation part. And I believe that that's what we've done with speech recognition. It's time for new models, and I urge you to think about the models, and not so much about the data, okay?
I assume that this is going to generate a lot of discussion and a lot of questions. If it doesn't, then something is wrong with us; this SDS community would be truly broken. Okay, who's first? Over there.
A question about the resampling. As I think about this, you have a sort of sequence of random variables, and you're turning a knob on the independence between them. And one of the things that turning that knob does is that, as things become more dependent, there's less information. What I'm wondering is how much of the word error rate degradation you see might be associated simply with the fact that there's just less information in streams that are more dependent.
So I guess I don't understand the question... I mean... So, you're right; here is an answer, and you can tell me if I'm close to understanding. The model assumes that each frame carries an independent amount of information, but we know that the frames do not carry independent amounts of information; the amount of information going from frame to frame varies enormously. But the model treats every single one of those frames as independent, and that's an egregious violation.
So what I guess I was thinking about was this: if I ask you to say a word ten times, versus asking ten people to say the word once, and I'm trying to figure out what the word is, the ten people saying it might actually provide more information in the data itself. And I'm just wondering if that might at all contribute to why there's more information as you sample from more disparate parts of the training database.
Well, I think what you're actually saying is, you're working towards explaining why. So the model... I think many people have this question. When you have all the frames and they're independent, when you do frame resampling, the frames come from all sorts of different speakers; when you line them up, you know, like the example I played, they come from all sorts of different speakers. But then, as soon as I start increasing the segment size, each one of those segments is gonna come from one speaker, right? Is this sort of along the lines of what you're thinking? Well, the notion of speaker is part of the dependence in the data, right? The fact that each one of these frames came from a single speaker, that's the dependence. And that inter-frame dependence, well, the model knows nothing about it. And so whether that's causing a problem or not, that's a question about your data.
Of course, all of us, as you said, all of us have been aware of this for a long time, and I think there has been a lot of effort at trying to undo it. When we say the model has this independence assumption, that's only sort of half true, because the features that we use span several frames, so of course they're not actually independent. You know, when you synthesize, it's not clear what you really synthesize, because you have to synthesize something that may have an independent value but has to have a derivative that matches the previous thing, and so on. But we've all tried things like segmental models, which don't have that independence assumption, right? We take a segment, a whole phoneme, so you're skipping the state independence assumption and the frame independence assumption and going straight to the context-dependent phoneme, and now you're picking a sample from the one distribution for that context-dependent phoneme. And that always works worse. Maybe you can do something with it, or combine it with the hidden Markov model and gain a bit, but by itself it always works a lot worse, unless you cripple the hidden Markov model by saying you're only gonna use context-independent models; then this one might work better. So the point is, it's not that we haven't tried: people have tried to make models that avoid those things, and almost all of those things got worse. The flip side of that is, you said MPE or MMI and all these things are meant to avoid that assumption, but they don't; they just reduce the error by ten or fifteen percent relative, basically a small amount, similar to any of the other tricks we do. So do you have any comment on those two observations?
Well, I mean, I'm not sure. So a natural question, which I think is the first part of what you're saying, is: why have many people tried and failed to beat HMMs with models that take into account the dependence structure in the data? Why hasn't that worked? Well, I would say that I do not believe that anyone has any quantitative notion of why these things fail on the data. I'm not saying that we should go back to those methods, maybe we should, but I will give you an example of something: you know, twenty years ago people gave up on neural networks, and all of a sudden, you know, neural networks are the new, the new... I don't know what the right biblical phrase is, but hallelujah. And what it takes is somebody who believes in something and starts to do it, and I think that is the problem here. I don't know what the solution is, I honestly don't know what the solution is, but I will say also, about the MMI thing, I don't believe anyone would say MMI was designed to overcome the independence assumptions. You know, we knew that the maximum likelihood solution to this problem was not the right solution, so we found an alternative model selection procedure that puts us in a different place. Again, if the model were correct, we wouldn't have to do that.
Coming back to the results, these simulation results you presented: I think these are highly suggestive, because by changing the data to fulfil your assumptions, the error rates you get are not the error rates we would expect from the real data, because you fit the problem to your assumptions. But we have to go the other way around, so what error rates we can really expect if we improve the modeling is still an open question.

Exactly, that's absolutely right. In no way am I claiming that if we could model the dependence in the data we would see these error rates, the frame resampling error rates; that's absolutely correct. I mean, presumably we could do better. The other point, though, is I think that a lot of this brittleness that we experience in our models, and this is a conjecture, is due to this very poor fit to the temporal structure. One way of thinking about these results, you know, the frame resampling results, is that if you forget about the temporal structure in the data, the models work really well, but as soon as you introduce real temporal structure into the data, the models start falling apart. And for speech, I think temporal structure is important.
I think... I don't think it really follows, when you violate the independence assumptions, that you can't extract the information; speech doesn't necessarily work that way. I mean, I can build a system that satisfies the independence assumptions. So I don't think it really follows that that's why the models fail. I think you don't want to be thinking about extracting, about getting the right amount of information; the problem is not the amount of information, it's a question of how you represent the information. And if you misrepresent it, you get more or fewer errors in the process, and it's the misrepresentation that gives the false alarms.
One thing that works really poorly is when you have a mismatched representation. So think about some model as representing text, okay? You can represent it as raster-scanned text, or you can represent it as fonts, and if you change the size of the image, the two behave very differently: for the font it's an almost trivial representation change, and for the raster scan it just breaks the whole thing. So you have to ask yourself: is the problem that we're seeing the fact that we have a representation for the problem that doesn't match it? That, I think, is the realization.
This tells us something; a comment: as you go from the top, from states to phones and phones to segments, the data is becoming more and more speaker-dependent. So maybe the problem is your models; I mean, if you made your models more speaker-dependent, would we have seen such a difference? It may have nothing to do with the frame-dependent sampling.

Well, like I was trying to say before, that is a form of dependence that the model knows nothing about, this form of dependence. You know, there are many forms of dependence in data, and knowing what independence is, is a hard thing for a human to understand, right? But that form of dependence is precisely there, and it may be causing the problem. There were a number of speakers, there are relatively few speakers in this corpus, and so we had to sort of cap them so that there wasn't a single dominant speaker, which, I mean, I think would be the worst case.
So let me sort of continue with what was being asked. We know the model is wrong; models are always wrong. And you can argue that the model is wrong mathematically, or you can argue that it's wrong because it doesn't match human performance, what we think of as human performance; I think we may overestimate human performance a little bit, but it clearly doesn't match it. But in fact, you know, if you look at all the research that all of us do, we use what feel like patches for those problems. So we say, to use your analogy, we allow our models to be scaled like fonts, right? We say we're going to estimate a scale factor, and that scale factor is not necessarily a simple one, it can be a simple one or it can be a matrix, much more complicated than what you do with a font, and we constrain it to be the same, we say the speaker is the same for the whole sentence. We do speaker adaptive training, so we try to remove the differences, we try to normalize all the speakers to the same place and then insert the properties of the new speaker again, right? That's sort of like the analogy of a font. We try to do all of these things, we certainly try to model channels, we do all of these with linear models and nonlinear models, and we get small improvements. So my question, let me turn the question around: the model is wrong, so what's the right model? Not what does it do, but what is the right model?
So, I think none of us know the answer to that question, but let me tell you about another phenomenon that I'd like to offer as an analogy. I don't know if you've been following particle physics, but in particle physics, when you measure particle interactions, the probabilities of the interactions are governed basically by Feynman diagrams. And so to compute a particle interaction, like at the supercollider, to compute a cross-sectional area for one of the interactions takes a computer about a week to look at all the Feynman diagrams. A couple of the physics guys just discovered a geometric object, and in this geometric object it turns out that each little facet has an area that is exactly the solution to that problem of computing the cross-sectional area, and you can do the computation in about five minutes with pencil and paper. So there's a place where a difference in the model has a huge effect on making things work. So I don't believe the right model lies in the kinds of things that we've always been doing. I think we need some radical reinterpretation of the way we look at the data, the way we look at the words. Maybe we should listen to the linguists in some places, maybe. I took a degree in linguistics, as I thought speech wasn't an easy problem from a theory point of view, and I learned to distrust everything a linguist said, and maybe most of them do too. But maybe there's something different that we should be doing. So I would love us to just go and look outside this place that we've been exploring.