So I'm gonna talk about Project OUCH, but first, thank you for having me here. I've enjoyed my time in the Czech Republic and I've learned a little Czech, which is going well. So thank you.
So, Project OUCH stands for Outing Unfortunate CHaracteristics of HMMs. Truthfully, there were three phases. The sort of initial work we did on this was a project that Larry Gillick and I started when we were at Nuance, and truthfully it also had its antecedents in work that we were doing earlier. But that was a very small funded pilot study, and we later got funding for a larger but still small effort. The people who worked with me were Dan Gillick and a postdoc who is currently at Berkeley; Larry Gillick, Jordan Cohen, Morgan, and myself were the senior people.
So, Project OUCH: what we're trying to do, our goal, is to sort of develop a quantitative understanding of how the current formalism behaves. And, you know, surprisingly there has been very little work in this direction in the forty-year history of speech recognition. There's been some, but it's been isolated and sporadic.
And, you know, progress in speech recognition has been very erratic, in my view largely because we've been proceeding via trial and error. And so the claim is that by gaining a deeper understanding of how our algorithms succeed and fail, rather than just measuring word error (if we get an improvement in word error we keep it, if it doesn't improve we throw it out), we should enable more efficient and steady progress. And I claim that this should be embedded in our standard research methodology: not necessarily the techniques that I'm gonna talk about, okay, but just this notion that when you have a model that, you know, doesn't fit the data, you should try to gain some understanding of how the model differs from the data and how that data-model residual impacts the classification errors.
So the main questions that Project OUCH was interested in, the main way you could think about it, are these: what do the models find surprising about the data, what is it about speech data that the models find surprising, and how does that surprise translate into errors?
So I'm gonna talk today about quantifying the two major HMM assumptions and their impact on error rates. Of course, the two major assumptions are the very strong independence assumptions the model makes, and also an equally strong assumption about the form of the marginal distribution of the frames: typically we assume they're Gaussian mixture models, and of course nowadays people are using multi-layer perceptrons, but either way you make some sort of formal assumption about what it looks like. Also, which of these incorrect assumptions is discriminative training, MPE or MMI, compensating for, relative to maximum likelihood? And do these results change when you move from matched training and test, which is our usual formalism, to the mismatched case?
So the early work that we did was on the Switchboard and Wall Street Journal corpora; later on we moved to the ICSI corpus. You can recast this question about how the results change in the mismatched case as a form of: why is ASR so brittle? Any time you bring up a new recognizer on a problem, whether in the same language or across languages, you always have to start, it seems, almost from scratch. You always have to collect a bunch of data that's closely related to the task that you have, and it hardly ever works the first time you try it. It's the reason that most of us in this room have jobs, so it's sort of a good thing, but it's incredibly frustrating, right? It's like, it's a miracle when anything works the first time.
So this later project was mainly interested in studying these questions on the ICSI meeting corpus, where there's a near-field channel and a far-field channel; I'll talk a little bit more about that. We wanted to understand, when you train models on the near-field condition, what happens when you recognize the far-field data. And so in this context: is the brittleness of ASR solely due to the model's inability to account for the statistical dependence that occurs in real data?
And you know, when I started this particular project, I thought that it was just gonna be the independence assumptions, and I was very surprised, when we actually started doing the work, that in fact it wasn't quite like that. And so, I say it sort of funny, but in the matched case, basically, the inability of the model to account for statistical dependence that occurs in real data is basically the whole problem. But when you move to the mismatched case, all of a sudden something else rears its head, and it's a big problem. And so I'll describe what this problem is: it has to do with the lack of invariance of the front end.
So I'm gonna spend a little bit of time talking about the methodology we use. The way we explore this question is we create, we fabricate, data: we use simulation and a novel resampling process that uses real data to probe the models. The data that we create is either completely simulated, so that it satisfies all the model assumptions, or it's real data that we resample in a way that gives it properties that we understand. And so by feeding in this data we can sort of probe the models and see their response to it, and what we observe is recognition accuracy.
So here's an example. [plays a real Wall Street Journal utterance about a capital markets report] So this is an example of what we expect speech to sound like; this is from Wall Street Journal. And this is a fabricated version of it that essentially agrees with all the model assumptions. [plays the fabricated version] So, you know, it's highly amusing, but it's intelligible, obviously, and obviously, you know, it's from a model that was constructed from a hundred different speakers, and it reflects that sort of structure.
So what we're trying to quantify is the difference between these two extremes in terms of recognition accuracy.
So the basic idea of data fabrication is simple: we follow the HMM's generative mechanism. To do that, we first generate an underlying state sequence consistent with the transcript, the dictionary, and the state transitions of the underlying hidden Markov model. Then we walk down this sequence and we emit a frame at each point.
So here's a picture, a nice picture, that describes this structure; parts of it are actually a graphical model, and this of course is an HMM. But basically we unpack: if we have a transcript, we unpack the words, we get the corresponding pronunciations, the phones in context, and then determine which HMM we use. So this is the hidden state, and each of these states emits observations according to whatever mixture model we're actually using, right? And if you're not so familiar with HMMs, and I assume pretty much everyone in the room is, this sort of highlights the independence assumptions. Well, it highlights two things. One, the frames are emitted according to a rule, and the rule is the form that we assume for the marginal distribution of the frames. And then, of course, this also says that these frames are independent: every time I emit a frame from, say, state three, it is independent from the previous frame that was emitted from state three, so that's a very strong assumption. But in addition, it is also independent from any of the frames that were emitted previously from the other states, so these are very strong assumptions. But okay, again, to generate observations we just follow this rule: basically, once I have a sequence of states laid out, I just walk down that sequence of states and I do a draw from a distribution, whether it be empirical or parametric.
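Just to make that generation rule concrete, here's a minimal sketch in Python of the fabrication step as I've described it: given a state sequence from an alignment, emit one frame per state by sampling that state's mixture model. The state labels, dimensions, and GMM parameters below are made up for illustration; this is not the project's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gmm(weights, means, variances):
    """Draw one frame from a diagonal-covariance Gaussian mixture."""
    k = rng.choice(len(weights), p=weights)              # pick a mixture component
    return rng.normal(means[k], np.sqrt(variances[k]))   # then sample that Gaussian

def simulate_utterance(state_sequence, gmms):
    """Walk down the state sequence, emitting an independent frame per state."""
    return np.stack([sample_gmm(*gmms[s]) for s in state_sequence])

# toy setup: three states, two-component mixtures, 39-dimensional frames
dim = 39
gmms = {s: (np.array([0.5, 0.5]),              # mixture weights
            rng.normal(size=(2, dim)),          # component means
            np.ones((2, dim)))                  # diagonal variances
        for s in range(3)}

frames = simulate_utterance([0, 0, 1, 1, 1, 2, 2], gmms)
print(frames.shape)  # (7, 39): every frame drawn independently, exactly as the HMM assumes
```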
So, for simulation, you know, it's easy to simulate from a mixture model, not a big deal, right? But what about this novel sampling process that'll allow us to get at the independence assumptions? Well, for this we adapted a formalism from Efron's bootstrap. I talked a little bit about the bootstrap in the paper and the poster; people in the field don't seem to be terribly familiar with it, I'm not sure it's taught very much, but I will try.
So the basic idea is: suppose you have an unknown population, right? You've got some population distribution, and you compute a statistic that's meant to summarize this population. Then you want to know how good that statistic is, so I want to construct a confidence interval for the statistic, to give me a sense of how well I've estimated it. So how do I do that if I don't know what the population is? I mean, I'm trying to derive properties of this population, and in particular I don't know anything about it, really, except the sample I've drawn from it. Before Efron's bootstrap procedure, people would usually make some parametric assumptions about the population, typically you'd assume it's normal, or Gaussian, and then compute a confidence interval using that structure.
Well, of course that's sort of crazy, you know, why would you do that? Especially if you're trying to ask, is this population distribution Gaussian or not? Well, it's crazy to stipulate that the population distribution is Gaussian in order to compute this confidence interval. So this was a big problem in the late seventies, when computers became sort of usable by statisticians, and Efron came up with this formalism. And so the name comes from pulling oneself up by the bootstraps. Lots of people use 'bootstrap' for various sorts of terminology, and it allegedly comes from, everyone attributes it to, the story in the Adventures of Baron Munchausen, where he's stuck in a swamp and needs to get out, so he pulls himself up out of the swamp by his bootstraps. But of course, if you read the original Adventures of Baron Munchausen, that's not what happened: in fact, he's stuck in a swamp, on horseback, trying to get out, and instead he pulled himself out by his own hair. So maybe we should have called it something else; I thought that was pretty funny.
So the way the bootstrap works is you take the empirical distribution. You have the sample, and this sample is representative of the true population distribution, so if it's big enough it should be a pretty good representative. And so instead of fitting a parametric model to this, you treat it as an empirical distribution and you sample from that empirical distribution. Sampling from the empirical distribution turns out to be equivalent to just doing a random draw with replacement from the sample itself, hence the name 'resampling'.
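As an aside, here's a minimal sketch in Python of that plain bootstrap idea: treat the sample as the population, draw with replacement, and read a confidence interval off the resampled statistics. The sample and statistic here are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.lognormal(size=200)   # an observed sample; the true population is "unknown"

def bootstrap_ci(data, statistic, n_boot=2000, alpha=0.05):
    """Percentile bootstrap interval: resample with replacement, no Gaussian assumption."""
    estimates = [statistic(rng.choice(data, size=len(data), replace=True))
                 for _ in range(n_boot)]
    return tuple(np.percentile(estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)]))

print(bootstrap_ci(sample, np.mean))   # e.g. a 95% confidence interval for the mean
```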
So we're gonna adapt this formalism to the problem at hand. So when we train our models, right, imagine we're doing Viterbi training. I'll have another picture, but basically we're gonna resample the frames that are assigned to a particular state during training, and that's how it works. And we can do this for various types of segments.
So here, it's a really crappy picture, and I have to do a better job, but here again: we have the true population distribution, and if we fit, say, a Gaussian to it, that's not a particularly good representative; instead, if we've drawn enough data from it, this histogram estimates the distribution.
So basically, the important part of this slide is this: resampling is gonna fabricate data that satisfies the independence assumptions of the HMM, because I'm gonna do a random draw with replacement from the distribution. But the data we create are gonna deviate from the HMM's parametric distributional assumptions to exactly the same degree that real data do, because it is real data: it's the data drawn from the training set.
So here's an already better picture, which I can use to describe a little bit about what we do. Imagine we have training data and we're actually doing Viterbi training. If we're doing Viterbi training, we get a forced alignment, and for each state we just accumulate all the frames for that state and then we fit a GMM to them, right? Instead of doing that, in the bootstrap formalism we accumulate the frames that are labeled with that state and we stick them in urns. So training is just like ordinary Viterbi training: you just accumulate all the frames associated with the state, but instead of forgetting about them once you've used them to compute the parameters, you keep track of what they are. And so when it comes time to generate pseudo-data, you have an alignment, or some state sequence that you've obtained however, and you walk down it to generate the frames: if I were generating the frames by simulation I would do a random draw from a distribution; now instead I do a random draw with replacement from a bucket, an urn, of frames, okay?
So the frames, again, are independent, because I'm doing random draws with replacement, and they deviate from the distributional assumptions to the same degree that real data do, because they are real data. And then I can also do this for sequences: I can resample whole trajectories, phone trajectories and word trajectories. So here, this is a sequence of frames associated with states, and I can stick that whole sequence into the urn. Likewise I can take a whole phone sequence and put it in there, and when I draw from the urns, instead of getting individual frames I get segments. So the important thing is, no matter how I've divided the utterance into segments, when I draw the segments, between segments things are independent, but they inherit the dependence that exists in real data within each segment. So we have between-segment independence and within-segment dependence, and this is the way that we can control the degree of statistical dependence that's in the data.
So this slide just sort of summarizes this, and you can see you could even stick a whole utterance in an urn, but the point is that segment-level resampling relaxes frame-level independence to segment-level independence.
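To make the urn idea concrete, here's a minimal sketch in Python of what frame-level versus segment-level resampling might look like. The alignment format, labels, and 'frames' below are toy stand-ins rather than the project's actual tools; the point is just that drawing whole segments with replacement keeps the real within-segment dependence while forcing independence between segments.

```python
import random
from collections import defaultdict

random.seed(0)

def build_urns(train_alignments):
    """train_alignments: list of (unit_label, frames) segments from a forced alignment."""
    urns = defaultdict(list)
    for label, frames in train_alignments:
        urns[label].append(list(frames))       # keep each real segment intact
    return urns

def resample_utterance(target_alignment, urns, frame_level=False):
    """Fabricate pseudo-data by drawing, with replacement, from the urns."""
    output = []
    for label, frames in target_alignment:
        if frame_level:
            # frame-level: every frame is an independent draw from that unit's pooled frames
            pool = [f for seg in urns[label] for f in seg]
            output.extend(random.choice(pool) for _ in frames)
        else:
            # segment-level: one draw returns a whole real segment, dependence and all
            output.extend(random.choice(urns[label]))
    return output

# toy example: plain numbers stand in for 39-dimensional feature vectors
train = [("ih_1", [1.0, 1.1]), ("ih_1", [0.9, 1.2, 1.0]), ("ih_2", [2.0, 2.1])]
urns = build_urns(train)
print(resample_utterance([("ih_1", [0, 0]), ("ih_2", [0])], urns))                    # segment-level
print(resample_utterance([("ih_1", [0, 0]), ("ih_2", [0])], urns, frame_level=True))  # frame-level
```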
So here's a picture of the model's response to fabricated data. Okay, I don't know how much time I want to spend on this, but here what we have is simulated data versus the real error rate, and as I gradually reintroduce dependence into the data, the word error rate starts to increase rather dramatically. So the point is, let's look at the simulated word error rate. You can think of this as having some sort of knob where you're reintroducing dependence into the data, and as I reintroduce dependence, the error rate becomes quite high. This is ICSI meeting data, with unimodal models.
The same sort of phenomenon happens when you use mixture models, say eight-component mixtures. So here the simulated error rate is around two percent, a little bit less than two percent. When I do frame-level resampling, the error rate increases just a little bit; it does increase, but by very little. Now, when I reintroduce within-state dependence, all of a sudden the error rate becomes around twelve percent, so the error rate has increased by a factor of six. When I introduce within-phone dependence, the error rate increases again by about a factor of two. And then when I go to words, it increases again by almost a factor of two. The jump from frame to state is typically the largest one on the corpora that we've worked with: when you move from frame to state, the error rate typically increases by about a factor of six.
So you think about this and you can make an argument, and the argument is that the distributional assumption that we make with GMMs is not such a big deal. I mean, it's important, but it's not such a big deal. The biggest single factors are these reintroductions of dependence: it's the dependence in the data that the models are finding surprising. I mean, you know, everybody knew the independence assumptions were wrong, I'm not saying that's surprising, but I personally was really surprised, and it took a long time to come around to the fact that the errors really are rooted in the independence assumption, and we tend to work around this with other sorts of things.
So this is a summary of the matched-case results. The conclusion is that when we have matched training and test, it's the independence assumptions that are the big deal: it's the model's inability to account for dependence in the data that is derailing things, not the marginal distributions so much.
So, also surprisingly, in a different, later study we adapted this formalism to ask the question: what is discriminative training doing? You know, you start with the maximum likelihood model, you apply MMI, what's happening here? So you apply this formalism and you see that, in fact, MMI is actually compensating for these independence assumptions, in a way that I don't completely understand; I have hypotheses about how this might work.
So here you have a really complicated procedure that's a little hokey, that took people, many people in this room, twenty years to get to work, right? And once it was shown to work on large vocabulary, it took many labs additional years to get it to work in their own lab. You know, now it's pretty routine to do this, but it was a struggle to get it to work, and my point is that what it's doing is compensating for the independence assumptions. We know the independence assumptions are a problem. I'm not saying that it's gonna be easy to find a model that relaxes the independence assumptions, but perhaps that twenty years of effort would have been better spent attacking that problem.
So what about mismatched training? So, the ICSI meeting corpus: we have near-field data collected from, you know, head-mounted microphones, and there was a microphone array of some sort, but the meeting room was quiet, it was small, and it had a normal amount of reverb, the kind of reverberation you'd expect in a room. If you listen to these two channels you can tell that they're different, but it's not like the far-field channel is radically different when you listen to it: it sounds a little different, but it's perfectly intelligible.
So we explored training and testing with near-field, training and testing with far-field, and this mismatched condition where we train on near-field data and test on far-field. I'll just say that it's harder than it looks: you have to be careful and you have to think about what you're trying to do when you run these types of experiments. In particular, there were a lot of issues that we went through to get the near-field channel and the far-field channel exactly parallel, so that we were actually measuring what we wanted to; it's a somewhat intricate lab setup. So the paper that we wrote for ICASSP, I don't know how well it describes it, but it attempted to, and on the ICSI website there's a technical report that's reasonably good and describes a lot of this stuff, so I'm not gonna belabor it, but there was a lot of effort that we had to go through.
So here's the bottom line. First let's look at the green and the red curves: these are the matched near-field and far-field conditions, and notice that they track each other pretty well. They're different, and the far-field real data is obviously harder, but interestingly, look down here at the simulated and frame-resampled error rates: they're still really low. The matched far-field one is higher, it's worse, but it's still really low, and in particular these error rates are around two percent, right? So let's think about that. But before we do, notice the mismatched simulation error rate: this is where we want to concentrate, this is what we want to think about, right? We don't need to worry about this other stuff; it's the simulated case that we're gonna concentrate on.
So, when you simulate data from the near-field models and you recognize it with the near-field models, the error rate is essentially nil. So that means that problem is essentially solved. Again, when I take the far-field models, I simulate data from the far-field models and I recognize it with the far-field models, I get essentially no errors. Again, that means that problem is essentially solved. So in these two individual spaces, in the signal processing, the MFCCs that are generated in the matched cases, they're essentially separable problems. But all of a sudden, when I take the near-field models and look at the far-field data, it's dramatically not solved. So that means that under the transformation that takes place between the near-field data and the far-field data, the front end is not invariant, and that lack of invariance is what's causing this huge increase in error. Again, it's not surprising that the front end is not invariant to this transformation; there's a little bit of reverb, there's a little bit of noise. But what's remarkable is that that is solely the problem that causes this huge degradation in error, and that is actually fairly remarkable.
So there are many more results involving mixture models: we reran all of these experiments with, I think, eight-component mixture models and we see the same sort of behaviour. We've reproduced all the discriminative training results: we asked, can discriminative training somehow magically alleviate the mismatched case, and the answer is no. We did, and I think Morgan pushed this, really, a natural question: how does MLLR work in this setting? We talked about that, and with MLLR you can reduce some of the degradation, as you would expect, but MLLR is a simple linear transformation, and whatever transformation is happening between these two channels, it's some peculiar nonlinear transformation, right? So it's unreasonable to expect MLLR to do that well. But this test harness is a really good test harness for evaluating, you know, how invariant to these transformations our front ends are, and so we've explored that a little bit, and it's not so encouraging.
Alright, well, I think I'll end there; I've sort of blathered on long enough. I'll turn it over to Jordan, and he will give a higher-level view of the whole idea, and then we'll have questions.
So, following on from what was just presented... okay, one, two, three.
Alright, so it turns out there were two parts of this project. Steve told you about the technical stuff, but we also thought that we'd like to figure something out. You've been hearing a lot about how wonderful speech recognition is during this meeting, and we thought we would actually like to understand what the community thought speech recognition was like. So we ran a survey, and I called a bunch of people; many of you got called by me. And what we wanted to do was just see what people thought about how speech recognition really worked. We were hoping that we would find some evidence to persuade the government maybe to put in some money and fund some speech recognition research, which we haven't seen in a long time, but really we just wanted to find out what was going on.
but we really we just one the final was going on
and so we put together a little survey team
jen into jamieson worked with me she's a alice that's been in speech for very
long time and we engage frederick okay and he's a specialist at doing service
and we design a snowball start by
it's normal surveys very interesting it
it says you start with a small group of people that you know and you
have some the questions and then you apps them who else task
and you just follow that for your nose and what that means is although it's
not entirely unbiased it's as unbiased as you can do if you don't know the
sampling populations going to be
so we want to low what was going on what the people think or the
failures and what remedies of people try and how do they were
so we did this novel sampling
Here's the questionnaire. I don't want to spend a lot of time on this, but just take a look. The interesting questions are the last one on the slide, where has the current technology failed, and the first one on the slide, what do you think is broken, and then questions about what you did about what was going on, and whether there's other stuff.
The survey participants tended to be old; I think that's sort of how our snowball worked. Not terribly old, but there are not a lot of young people in this field, so ages ran from thirty-five to seventy. We spoke to about eighty-five people, and they had an interesting mix of jobs: most of them were in research, some were in development, some were in both. There were a small number of management people, and then people who described their jobs as something more detailed. But mostly these are R&D people or managers doing speech research or language research of one sort or another.
So here's what you told us. Natural language is the real problem, and acoustic modeling is a real problem, and everything else that we do is broken, more or less. So I think the community sort of has this feeling, not the people trying to sell speech recognition to the management, but the people trying to make it work have a feeling that all is not really well with the technology. So lots of people, when you point fingers, are pointing at the language itself and at acoustic modeling, and there's a third category, which says 'not robust', which is what Steve was talking about. So there's something going on with this technology that makes it not work very well. And when we asked people what they've tried in order to fix things, the answer is: everything.
People have mucked around with the training, some people have tried all kinds of different pieces of their system... alright, anyway.
One of the interesting things that people have tried to do: many of us have tried to fix pronunciations, either in dictionaries or in pronunciation rules, and to a person, everyone has found that this is a waste. It's pretty interesting; so that's not a way to fix the systems that we currently build. So we've tried all kinds of stuff.
And so I think our takeaway from the survey is that people actually don't believe the technology is very solid, and we try a lot of things to fix it. And then we looked a little at the literature; the literature survey is in the ICSI report, which you can go read. We found a quote that looks sort of like this, from a review by Furui, and it says: LVCSR is far from being solved; background noise, channel distortion, foreign accents, casual disfluent speech, and unexpected topic changes cause automatic systems to make egregious errors. And that's what everybody said: anybody who's looked at the field says, well, this technology is okay sometimes, but it fails a lot.
So what we concluded was: the technology is old. I'd point out that the models most of us use, hidden Markov models, are the thing that was written down by Baum and colleagues around nineteen sixty-nine, so maybe that's the kernel of one of our issues here. When these systems fail, they degrade not gracefully, like you or I do, but catastrophically and quickly. Speech recognition performance is substantially behind how humans do in almost every circumstance, and the systems are not robust.
So I wanted to give that sort of overall overview of what the survey was, and it's available on the ICSI website. But I wanted to add a couple of personal comments, my own analysis of what's happening. These are not, I'm not representing the government here; I actually want to talk to you about my own personal analysis.
So here there are three points. First point: if you have a model and you spend a lot of time hill climbing to its optimum performance, and it doesn't perform optimally at that spot, you've got the wrong model. Hidden Markov models were proved to converge by Baum and his colleagues around nineteen sixty-nine. That proof has two parts. One, it says you can always make a better model. Two, it says you get the optimal parameters if the data came from the model. That second part is absolutely not true in our speech recognition systems: we're climbing on data that doesn't match the model, and we're not gonna find the answer that way. So we've spent a lot of time trying to adapt around the problem, but we've got the wrong model.
This is a personal bone to pick: if you use sixty-four Gaussians to fit some distribution, you have no idea what the distribution is. The original Gaussian distributions were defined with a single mean, and I understand why we do it, but that's not what this is. And so my corollary, I think, speaks for itself. And finally, if the system you build fails for fifty percent of the population entirely, and then, for the people it works for, fails the moment they walk into a reverberant environment or a noisy place, it's broken. And I believe speech recognition is terribly broken.
So I think what we really wanted to do, and I want to draw an analogy here, is an analogy between transcription and transportation. For transportation, man, this is what I want: something that's sleek and speedy and easy to use and doesn't break. And what we build is this: it runs on two wheels, it will get you there eventually, and you spend almost all your time dealing with problems that have nothing to do with the transportation part. And I believe that that's what we've done with speech recognition. It's time for new models, and I urge you to think about the models, and not so much about the data, okay?
I assume that this is going to generate a lot of discussion and a lot of questions. If it doesn't, then something is wrong with us; this SDS community would be truly broken. Okay, who's first? Over there.
A question about the resampling. As I think about this, you have a sort of sequence of random variables, and you're turning a knob on the independence between them. And one of the things that turning that knob does is that, as things become more dependent, there's less information. What I'm wondering is how much of the word error rate degradation you see might be associated simply with the fact that there's just less information in streams that are more dependent.
So I guess I don't understand the question... I mean... So, you're right; here is an answer, and you can tell me if I'm close to understanding. The model assumes that each frame carries an independent amount of information, but we know that the frames do not carry independent amounts of information; the amount of information going from frame to frame varies enormously. But the model treats every single one of those frames as independent, and that's an egregious violation.
So what I guess I was thinking about was this: if I ask you to say a word ten times, versus asking ten people to say the word once, and I'm trying to figure out what the word is, the ten people saying it might actually provide more information in the data itself. And I'm just wondering if that might at all contribute to why there's more information as you sample from more disparate parts of the training database.
Well, I think what you're actually saying is, you're working towards explaining why. So the model... I think many people have this question. When you have all the frames and they're independent, when you do frame resampling, the frames come from all sorts of different speakers; when you line them up, you know, like the example I played, they come from all sorts of different speakers. But then, as soon as I start increasing the segment size, each one of those segments is gonna come from one speaker, right? Is this sort of along the lines of what you're thinking? Well, the notion of speaker is part of the dependence in the data, right? The fact that each one of these frames came from a single speaker, that's the dependence. And that inter-frame dependence, well, the model knows nothing about it. And so whether that's causing a problem or not, that's a question about your data.
Of course, all of us, as you said, all of us have been aware of this for a long time, and I think there has been a lot of effort at trying to undo it. When we say the model has this independence assumption, that's only sort of half true, because the features that we use span several frames, so of course they're not actually independent. You know, when you synthesize, it's not clear what you really synthesize, because you have to synthesize something that may have an independent value but has to have a derivative that matches the previous thing, and so on. But we've all tried things like segmental models, which don't have that independence assumption, right? We take a segment, a whole phoneme, so you're skipping the state independence assumption and the frame independence assumption and going straight to the context-dependent phoneme, and now you're picking a sample from the one distribution for that context-dependent phoneme. And that always works worse. Maybe you can do something with it, or combine it with the hidden Markov model and gain a bit, but by itself it always works a lot worse, unless you cripple the hidden Markov model by saying you're only gonna use context-independent models; then this one might work better. So the point is, it's not that we haven't tried: people have tried to make models that avoid those things, and almost all of those things got worse. The flip side of that is, you said MPE or MMI and all these things are meant to avoid that assumption, but they don't; they just reduce the error by ten or fifteen percent relative, basically a small amount, similar to any of the other tricks we do. So do you have any comment on those two observations?
Well, I mean, I'm not sure. So a natural question, which I think is the first part of what you're saying, is: why have many people tried and failed to beat HMMs with models that take into account the dependence structure in the data? Why hasn't that worked? Well, I would say that I do not believe that anyone has any quantitative notion of why these things fail on the data. I'm not saying that we should go back to those methods, maybe we should, but I will give you an example of something: you know, twenty years ago people gave up on neural networks, and all of a sudden, you know, neural networks are the new, the new... I don't know what the right biblical phrase is, but hallelujah. And what it takes is somebody who believes in something and starts to do it, and I think that is the problem here. I don't know what the solution is, I honestly don't know what the solution is, but I will say also, about the MMI thing, I don't believe anyone would say MMI was designed to overcome the independence assumptions. You know, we knew that the maximum likelihood solution to this problem was not the right solution, so we found an alternative model selection procedure that puts us in a different place. Again, if the model were correct, we wouldn't have to do that.
Coming back to the results, these simulation results you presented: I think these are highly suggestive, because by changing the data to fulfil your assumptions, the error rates you get are not the error rates we would expect from the real data, because you fit the problem to your assumptions. But we have to go the other way around, so what error rates we can really expect if we improve the modeling is still an open question.

Exactly, that's absolutely right. In no way am I claiming that if we could model the dependence in the data we would see these error rates, the frame resampling error rates; that's absolutely correct. I mean, presumably we could do better. The other point, though, is I think that a lot of this brittleness that we experience in our models, and this is a conjecture, is due to this very poor fit to the temporal structure. One way of thinking about these results, you know, the frame resampling results, is that if you forget about the temporal structure in the data, the models work really well, but as soon as you introduce real temporal structure into the data, the models start falling apart. And for speech, I think temporal structure is important.
I think... I don't think it really follows, when you violate the independence assumptions, that you can't extract the information; speech doesn't necessarily work that way. I mean, I can build a system that satisfies the independence assumptions. So I don't think it really follows that that's why the models fail. I think you don't want to be thinking about extracting, about getting the right amount of information; the problem is not the amount of information, it's a question of how you represent the information. And if you misrepresent it, you get more or fewer errors in the process, and it's the misrepresentation that gives the false alarms.
One thing that works really poorly is when you have a mismatched representation. So think about some model as representing text, okay? You can represent it as raster-scanned text, or you can represent it as fonts, and if you change the size of the image, the two behave very differently: for the font it's an almost trivial representation change, and for the raster scan it just breaks the whole thing. So you have to ask yourself: is the problem that we're seeing the fact that we have a representation for the problem that doesn't match it? That, I think, is the realization.
This tells us something; a comment: as you go from the top, from states to phones and phones to segments, the data is becoming more and more speaker-dependent. So maybe the problem is your models; I mean, if you made your models more speaker-dependent, would we have seen such a difference? It may have nothing to do with the frame-dependent sampling.

Well, like I was trying to say before, that is a form of dependence that the model knows nothing about, this form of dependence. You know, there are many forms of dependence in data, and knowing what independence is, is a hard thing for a human to understand, right? But that form of dependence is precisely there, and it may be causing the problem. There were a number of speakers, there are relatively few speakers in this corpus, and so we had to sort of cap them so that there wasn't a single dominant speaker, which, I mean, I think would be the worst case.
So let me sort of continue with what was being asked. We know the model is wrong; models are always wrong. And you can argue that the model is wrong mathematically, or you can argue that it's wrong because it doesn't match human performance, what we think of as human performance; I think we may overestimate human performance a little bit, but it clearly doesn't match it. But in fact, you know, if you look at all the research that all of us do, we use what feel like patches for those problems. So we say, to use your analogy, we allow our models to be scaled like fonts, right? We say we're going to estimate a scale factor, and that scale factor is not necessarily a simple one, it can be a simple one or it can be a matrix, much more complicated than what you do with a font, and we constrain it to be the same, we say the speaker is the same for the whole sentence. We do speaker adaptive training, so we try to remove the differences, we try to normalize all the speakers to the same place and then insert the properties of the new speaker again, right? That's sort of like the analogy of a font. We try to do all of these things, we certainly try to model channels, we do all of these with linear models and nonlinear models, and we get small improvements. So my question, let me turn the question around: the model is wrong, so what's the right model? Not what does it do, but what is the right model?
So, I think none of us know the answer to that question, but let me tell you about another phenomenon that I'd like to offer as an analogy. I don't know if you've been following particle physics, but in particle physics, when you measure particle interactions, the probabilities of the interactions are governed basically by Feynman diagrams. And so to compute a particle interaction, like at the supercollider, to compute a cross-sectional area for one of the interactions takes a computer about a week to look at all the Feynman diagrams. A couple of the physics guys just discovered a geometric object, and in this geometric object it turns out that each little facet has an area that is exactly the solution to that problem of computing the cross-sectional area, and you can do the computation in about five minutes with pencil and paper. So there's a place where a difference in the model has a huge effect on making things work. So I don't believe the right model lies in the kinds of things that we've always been doing. I think we need some radical reinterpretation of the way we look at the data, the way we look at the words. Maybe we should listen to the linguists in some places, maybe. I took a degree in linguistics, as I thought speech wasn't an easy problem from a theory point of view, and I learned to distrust everything a linguist said, and maybe most of them do too. But maybe there's something different that we should be doing. So I would love us to just go and look outside this place that we've been exploring.