0:00:15 So I'm going to talk about Project OUCH. But first, thank you for having me here. 0:00:23 I've enjoyed my time in the Czech Republic, and I've learned a few Czech words. So thank you.
0:00:33 Project OUCH stands for Outing Unfortunate Characteristics of the HMMs.
0:00:39 There were, truthfully, three phases. The initial work we did on this was a project that Larry Gillick and I started when we were at Nuance, 0:00:54 and truthfully it also had its antecedents in work that we had been doing before that. 0:01:01 The first phase was funded as a very small pilot study, and then a larger, but still small, effort was funded. The students who worked with me were Dan Gillick and another student, who was actually a postdoc and is currently at Berkeley. 0:01:24 Larry Gillick, Jordan, Morgan, and myself were the senior people.
0:01:29 So, Project OUCH: what we're trying to do, our goal, is to develop a quantitative understanding of how the current formalism fails. 0:01:44 And, you know, surprisingly there has been very little work in this direction in the forty-year history of speech recognition. 0:01:54 There's been some, but it has been isolated and sporadic.
0:02:00 Progress in speech recognition has been very expensive, in my view largely because we have been proceeding via trial and error. So the claim is that by gaining a deeper understanding of how our algorithms succeed and fail, rather than just measuring word error (if we get an improvement in word error we keep it; if it doesn't improve, we discard it), we should enable more efficient and steady progress. 0:02:34 And I claim that this should be embedded in our standard research methodology: not necessarily the particular techniques that I'm going to talk about, but just this notion that when you have a model that doesn't fit the data, you should try to gain some understanding of how the model differs from the data, and of how that data-model residual drives the classification errors.
0:03:01 So the main questions that Project OUCH was interested in, the main way you could think about this, are: what do the models find surprising about the data? What is it about speech data that the models find surprising, and how does that surprise translate into errors?
0:03:23 I'm going to talk today about quantifying the two major HMM assumptions and their impact on error rates. The two major assumptions are, of course, the very strong independence assumptions that the model makes, 0:03:38 and also an equally strong assumption about the form of the marginal distribution of the frames. Typically we assume that it is a Gaussian mixture model; nowadays, of course, people are using multilayer perceptrons, but in any case you make some formal assumption about what that distribution looks like.
0:04:00 Also: which of these incorrect assumptions is it that discriminative training, MPE or MMI, is compensating for, relative to maximum likelihood? 0:04:17 And do these results change when you move from matched training and test, the usual formalism, to the mismatched case?
0:04:26 The early work we did was on the Switchboard and Wall Street Journal corpora; later on we moved to the ICSI corpus. 0:04:36 This question about how the results change in the mismatched case is another form of the question: why is ASR so brittle? 0:04:48 Any time we bring up a new recognizer on a problem, whether in the same language or across languages, you always have to start, it seems, almost from scratch. You always have to collect a bunch of data that's closely related to the task that you have, and it hardly ever works the first time you try it. 0:05:08 It's the reason that most of us in this room have jobs, so it's sort of a good thing, but it's incredibly frustrating. It's a miracle when anything works the first time.
0:05:27 This project was mainly interested in studying these questions on the ICSI meeting corpus, where there's a near-field channel and a far-field channel; I'll talk a little bit more about that. We wanted to understand: when you train models on the near-field condition, what happens when you recognize the far-field data?
0:05:51 And so, in this context, is the brittleness of ASR solely due to the models' inability to account for the statistical dependence that occurs in real data? 0:06:04 When I started this particular project, I thought it was just going to be the independence assumptions, and I was very surprised when we actually started doing the work, because it wasn't quite so. 0:06:23 And I say this just sort of wryly, but in the matched case, the inability of the model to account for the statistical dependence that occurs in real data is basically the whole problem. 0:06:37 But when you move to the mismatched case, all of a sudden something else rears its head, and it's a big problem. I'll describe what it is: it has to do with the lack of invariance of the front end.
0:06:55 I'm going to spend a little time talking about the methodology we use. The way we explore this question is that we fabricate data: we use simulation, and a novel sampling process that uses real data, to probe the models. The data we create is either completely simulated, so that it satisfies all the model assumptions, or it is real data that we resample in a way that gives it properties that we understand. 0:07:34 By feeding this data in, we can probe the models and see their response to it, and the response we observe is recognition accuracy.
0:07:47 So here's an example. [Plays a real utterance from the Wall Street Journal corpus: "according to different estimates ... capital markets report".] This is what we expect speech to sound like. 0:07:59 And this is a fabricated version of the same utterance, one that essentially agrees with all the model assumptions. [Plays the fabricated version.] 0:08:26 You know, it's highly amusing, but it's obviously intelligible, and it obviously comes from a model that was constructed from a hundred different speakers, and it reflects that sort of structure.
0:08:39 So what we're trying to do is quantify the difference between these two extremes in terms of recognition accuracy.
0:08:50 The basic idea of data fabrication is simple: we follow the HMM's generative mechanism. To do that, we first generate an underlying state sequence consistent with the transcript, the dictionary, and the state transitions, that is, with the underlying machinery of the hidden Markov model. 0:09:15 Then we walk down this state sequence and emit a frame at each point.
0:09:22 Here's a picture that describes this structure; parts of it are actually a graphical model, and of course this is an HMM. Basically, given a transcript, we unpack the words, we get the corresponding pronunciations and the phones in context, and we then determine which HMMs to use. So this is the hidden state sequence, and each of these states emits observations according to whatever mixture model we're actually using.
0:09:59 If you're not so familiar with HMMs (I assume pretty much everyone in the room is), this highlights the independence assumptions. It actually highlights two things. One: the frames are emitted according to a rule, and the rule is the form that we assume for the marginal distribution of the frames. 0:10:21 Two: these frames are independent. So every time I emit a frame from, say, state three, it is independent of the previous frame that was emitted from state three; that's a very strong assumption. 0:10:37 But in addition, it is also independent of any of the frames that were emitted previously from any other state. So these are very strong assumptions.
0:10:46 Again, to generate observations we just follow this rule. Once I know the sequence of states for an utterance, I just walk down that sequence of states and do a draw at each state, whether from an empirical or a parametric distribution.
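This generative walk can be sketched in a few lines of Python. Everything concrete here is made up for illustration: a toy model with three states, one-dimensional frames, and single-Gaussian emissions rather than the GMMs a real recognizer would use.

```python
import random

# Toy emission model: one Gaussian (mean, std dev) per state.
# These states and parameters are illustrative, not from the talk.
STATE_PARAMS = {
    "s1": (0.0, 1.0),
    "s2": (5.0, 1.0),
    "s3": (-3.0, 0.5),
}

def fabricate(state_sequence, rng=random):
    """Walk down a fixed state sequence and emit one frame per state.

    Every draw is independent of every other draw, which is exactly
    the HMM independence assumption described above.
    """
    frames = []
    for state in state_sequence:
        mean, std = STATE_PARAMS[state]
        frames.append(rng.gauss(mean, std))  # parametric draw
    return frames

# A state sequence consistent with some transcript/dictionary/alignment:
frames = fabricate(["s1", "s1", "s2", "s2", "s3"])
print(frames)  # five mutually independent frames
```

Swapping the `rng.gauss` call for a draw from a fitted GMM, or for the with-replacement draw from real frames described later, changes only the emission step; the walk down the state sequence stays the same.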
0:11:14 For simulation, it's easy to simulate from a mixture model; that's not a big deal. 0:11:23 But what about this novel sampling process that will allow us to keep the independence assumptions while using real data? For this, we adapted a formalism from statistical inference: the bootstrap. 0:11:36 I talked a little bit about the bootstrap in the paper and at the poster. 0:11:45 People in the field don't seem to be terribly familiar with it; I'm not sure why it isn't more topical, but I will try to explain it.
0:11:54 The basic idea is this. Suppose you have an unknown population: you've got some population distribution, and you compute a statistic that's meant to summarize this population. 0:12:08 Then you want to know how good that statistic is, so you want to construct a confidence interval for it, to give a sense of how well it has been estimated. 0:12:20 But how do you do that if you don't know the population? You're trying to derive properties of this population, and you don't know anything about it, really, except the sample you've drawn from it. 0:12:37 Before Efron's bootstrap procedure, people would usually make some parametric assumption about the population (typically you'd assume it's normal, that is, Gaussian) and then compute a confidence interval using that structure. 0:12:54 Of course, that's sort of crazy. Why would you do that? Especially if the question you're trying to answer is whether the population distribution is Gaussian or not, it's crazy to assume that the population distribution is Gaussian in order to compute the confidence interval. 0:13:09 This was a big problem in the late seventies, when computers first became usable by statisticians, and Efron came up with this formalism.
0:13:21 The name comes from pulling oneself up by one's bootstraps; lots of people use "bootstrap" in various sorts of terminology. Everyone attributes it to the story in the Adventures of Baron Munchausen, in which the Baron, stuck in a swamp, pulls himself out by his bootstraps. 0:13:44 But of course, if you read the Adventures of Baron Munchausen, that's not what happened: in fact, he was trying to get out of a swamp, and he pulled himself out by his own hair. 0:14:01 So maybe the method should have been named something else. I always thought that was funny.
0:14:12 The way the bootstrap works is that you use the empirical distribution. You have the sample, and the sample is a representative of the true population distribution; if it's big enough, it should be a pretty good representative. 0:14:29 So instead of fitting a parametric model, you treat the sample as an empirical distribution, and you sample from that empirical distribution. 0:14:39 Sampling from the empirical distribution turns out to be equivalent to just doing a random draw with replacement from the sample itself; hence the name "resampling".
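As a concrete illustration of the procedure just described, here is a minimal percentile-bootstrap confidence interval in Python. The choice of statistic (the mean), the sample, and the number of bootstrap replicates are arbitrary choices for the sketch, not details from the talk.

```python
import random
import statistics

def bootstrap_ci(sample, stat=statistics.mean, n_boot=2000, alpha=0.05, rng=random):
    """Percentile bootstrap confidence interval for a statistic.

    No parametric assumption about the population: the sample itself
    is treated as the empirical distribution, and we resample from it
    with replacement.
    """
    n = len(sample)
    replicates = sorted(
        stat(rng.choices(sample, k=n))  # draw n items WITH replacement
        for _ in range(n_boot)
    )
    lo = replicates[int((alpha / 2) * n_boot)]
    hi = replicates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

random.seed(1)
data = [random.gauss(10.0, 2.0) for _ in range(200)]
print(bootstrap_ci(data))  # an interval that should bracket the true mean, 10
```

The same with-replacement draw is what the talk reuses below, except that instead of scalar observations the "sample" will be frames, or whole segments of frames, pooled per HMM state.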
0:14:50 We're going to adapt this formalism to the problem at hand. When we train our models (imagine we're doing Viterbi training), we're going to resample the frames that are assigned to a particular state during training; that's how this works. 0:15:16 And we can do this for various types of segments.
0:15:19 So here (it's a really crappy picture, I need to do a better job, but the idea is here again) we have the true population distribution. If we fit, say, a Gaussian to it, that's not a particularly good representative; whereas, if we've drawn enough data, this histogram estimates the distribution well.
0:15:52 The important part of this slide is this: resampling is going to fabricate data that satisfies the independence assumptions of the HMM, because I'm going to do random draws with replacement from the distribution. 0:16:11 But the data we create are going to deviate from the HMM's parametric distributional assumptions to exactly the same degree that real data do, because it is real data: it's the data collected for training.
0:16:30 Here's a somewhat better picture, which I can use to describe a little bit of what we do. 0:16:39 Imagine we have training data and we're actually doing Viterbi training. In Viterbi training we get a forced alignment, and for each state we accumulate all the frames for that state and then fit a GMM to them. 0:16:57 In the bootstrap formalism, instead of doing only that, we accumulate the frames that are labeled with a given state and stick them in urns. 0:17:09 So training is just like ordinary Viterbi training: you accumulate all the frames associated with a state. But instead of forgetting them once you've computed the parameters, you keep track of what they are. 0:17:23 Then, when it comes time to generate pseudo-data, you have an alignment, or some state sequence that you've obtained however you like. 0:17:32 You walk down that state sequence to generate the frames, and where in simulation I would do a random draw from a distribution, here instead I do a random draw with replacement from an urn, a bucket of frames. 0:17:49 So the frames again are independent, because I'm doing random draws with replacement, and they deviate from the distributional assumptions to exactly the same degree that real data do, because they are real data.
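A sketch of that urn construction and frame-level resampling in Python. The helper names (`build_urns`, `resample_frames`) and the representation of a frame as a single number are assumptions made for the example; in a real system a frame would be a cepstral feature vector, and the (state, frame) pairs would come from a forced alignment.

```python
import random
from collections import defaultdict

def build_urns(aligned_frames):
    """Training side: keep the frames, not just the fitted parameters.

    `aligned_frames` is a list of (state, frame) pairs, e.g. from a
    Viterbi forced alignment. Each state gets an urn holding every
    frame that was labeled with it.
    """
    urns = defaultdict(list)
    for state, frame in aligned_frames:
        urns[state].append(frame)
    return urns

def resample_frames(state_sequence, urns, rng=random):
    """Generation side: replace each parametric draw with a random
    draw WITH replacement from the matching urn of real frames.

    The drawn frames are mutually independent (the HMM assumption),
    but each one is real data, so collectively they deviate from the
    Gaussian/GMM form exactly as much as real data do.
    """
    return [rng.choice(urns[state]) for state in state_sequence]

alignment = [("a", 0.1), ("a", 0.2), ("b", 9.0), ("b", 9.5)]
urns = build_urns(alignment)
print(resample_frames(["a", "b", "a"], urns))
```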
0:18:03 Sorry, I believe I've covered this, but I can also do this with whole sequences: I can resample trajectories, phone trajectories and word trajectories. 0:18:19 Here, this is a sequence of frames associated with a run of states; I can stick that whole sequence into an urn. 0:18:29 Likewise, I can take a whole phone sequence and put it in, and when I draw from the urns, instead of getting individual frames, I get segments. 0:18:39 The important thing is this: no matter what, say I have five segments in the utterance, when I draw the segments, the segments are independent of one another, but they inherit the dependence that exists in real data within each segment. 0:18:59 So we have between-segment independence and within-segment dependence. 0:19:02 This is the way that we can control the degree of statistical dependence that's in the data, 0:19:12 and this is quite powerful.
0:19:15 This slide just summarizes it. The point is that segment-level resampling relaxes frame-level independence to segment-level independence.
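The same urn idea extends to segments. In the sketch below (again with made-up names and scalar frames), each urn holds intact runs of frames for one state, phone, or word occurrence, so a generated utterance is independent between segments but keeps real-data dependence within each drawn segment.

```python
import random
from collections import defaultdict

def build_segment_urns(segments):
    """`segments` is a list of (label, frames) pairs, where `frames`
    is the whole run of frames for one occurrence of a state, phone,
    or word. Each urn holds intact segments, not single frames.
    """
    urns = defaultdict(list)
    for label, frames in segments:
        urns[label].append(list(frames))
    return urns

def resample_segments(label_sequence, urns, rng=random):
    """Draw one whole segment (with replacement) per label and
    concatenate. Successive draws are independent of each other,
    but the frames inside a drawn segment keep the statistical
    dependence of the real speech they came from.
    """
    utterance = []
    for label in label_sequence:
        utterance.extend(rng.choice(urns[label]))
    return utterance

segs = [("ah", [1.0, 1.1, 1.2]), ("t", [5.0, 5.1])]
urns = build_segment_urns(segs)
print(resample_segments(["ah", "t"], urns))  # -> [1.0, 1.1, 1.2, 5.0, 5.1]
```

Choosing the segment granularity (state, phone, or word) is the knob the talk describes for gradually reintroducing real-data dependence.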
0:19:39 So here's a picture of the models' response to fabricated data. 0:19:54 I don't know how much time I want to spend on this, but: here we have the simulated error rate and the real error rate, and as I gradually reintroduce dependence into the data, the word error rate starts to increase rather dramatically. 0:20:16 The point is, look at the simulated word error rate. You can think of this as a knob with which you reintroduce dependence into the data, and as I reintroduce dependence, the error rate becomes quite high. 0:20:33 This is ICSI meeting data, with unimodal models; the same sort of phenomenon happens when you use mixture models, say eight-component mixtures.
0:20:46 Here the simulated error rate is around two percent, a little bit less. 0:20:52 When I do frame-level resampling, the error rate increases just a little bit; it does increase, but by very little. 0:21:01 Now, when I reintroduce within-state dependence, all of a sudden the error rate becomes around twelve percent: the error rate has increased by a factor of six. 0:21:12 When I introduce within-phone dependence, the error rate increases again by about a factor of two, 0:21:23 and then when I go to words, it increases by almost another factor of two. 0:21:27 The first of these is typically the largest jump on the corpora that we've worked with: when you move from frame-level to state-level resampling, the error rate typically increases by about a factor of six.
0:21:39 So if you think about this, you can make an argument, and the argument is that the distributional assumption we make with GMMs is not such a big deal. It's important, but it's not such a big deal. 0:21:57 The biggest single factors are these reintroductions of dependence: it's the dependence in the data that the models are finding surprising. 0:22:08 I mean, everybody knew the independence assumptions were wrong; I'm not saying that's surprising. But I personally was really surprised, and it took me a long time to come around to the fact that it really is the dependence assumptions that drive the errors, and that we tend to work around them by other sorts of things.
0:22:35 So this is the summary of the matched-case results. The claim is that when we have matched training and test, it's the independence assumptions that are the big deal: it's the model's inability to account for dependence in the data that is derailing things; the marginal distributions, not so much.
0:22:57 Surprisingly also, in a different, later study, we applied this formalism to the question: what is discriminative training doing? You start with the maximum-likelihood model, you apply MMI, and what is happening? 0:23:13 You apply this formalism, and you see that, in fact, MMI is actually compensating for these independence assumptions, in a way that I don't completely understand; I have hypotheses about how this might work.
0:23:36 So here you have a really complicated procedure, one that's a little hokey, and it took people (many of them in this room) twenty years to get it to work. 0:23:49 And once it was shown to work on large vocabulary, it took many labs an additional couple of years to get it to work in their own labs. 0:23:58 It's now pretty routine, but it was a struggle to get it to work. My point is that what it's doing is compensating for the independence assumptions, and we know the independence assumptions are a problem. 0:24:13 I'm not saying it's going to be easy to find a model that relaxes the independence assumptions, but perhaps those twenty years of effort would have been better spent attacking that problem.
0:24:26 So what about mismatched training? In the ICSI meeting corpus, we have near-field models, trained on data collected from head-mounted microphones, and there was also a far-field channel, a microphone array of some sort. 0:24:47 The meeting room was quiet and small, and it had a normal amount of reverb, the kind of reverb humans expect in a room. 0:24:56 If you listen to these two channels, you can tell that they're different, but it's not as if the far-field channel is radically different: it sounds a little different, but it's perfectly intelligible.
0:25:13 So we explored training and testing with near-field, training and testing with far-field, and the mismatched condition, where we train on near-field data and test on far-field. 0:25:26 I'll just say that this is hard. You have to be careful, and you have to think about what you're trying to measure when you run these types of experiments. In particular, there were a lot of issues we went through to get the near-field channel and the far-field channel exactly parallel, so that we were actually measuring what we wanted to measure; it's a somewhat intricate lab setup. 0:26:01 The paper that we wrote at ICASSP attempts to describe it (I don't know how well it succeeds), and there's a technical report on the ICSI website that's reasonably good and describes a lot of this stuff, so I'm not going to belabor it. But there was a lot of effort that went into it.
0:26:23 So here's the bottom line. First, look at the green and the red curves: these are the matched near-field and far-field conditions, and notice that they track each other pretty well. The far-field real data is obviously harder. 0:26:43 But interestingly, look down here at the simulated and frame-resampled error rates: they're still really low. The matched far-field is higher, that is, worse, but it's still really low; in particular, these error rates are around two percent. 0:27:06 Before we think about that, though, notice the mismatched simulation. This is where we want to concentrate; this is what we want to think about. 0:27:21 We don't need to worry about the other conditions; it's the simulated case that we're going to concentrate on.
0:27:29 When you simulate data from the near-field models and you recognize it with the near-field models, the error rate is essentially zero. 0:27:37 That means that the problem is essentially separable. 0:27:43 Again, when I take the far-field models, simulate data from them, and recognize it with the far-field models, I get essentially zero error. 0:27:55 Again, that means that the problem is essentially separable.
0:27:59 So in these two individual spaces, in the signal-processing front end where the MFCCs are generated, the matched cases are essentially separable problems. But all of a sudden, when I take the near-field models and look at the far-field data, it's dramatically not separable. 0:28:26 That means that the transformation that takes place between the near-field data and the far-field data is the issue: the front end is not invariant under this transformation, and that lack of invariance is what's causing this huge increase in error. 0:28:47 Again, it's not surprising that the front end is not invariant to this transformation; there's a little bit of reverb, there's a little bit of noise. But what's remarkable is that this is solely the problem that causes this huge degradation in error rate, 0:29:06 and that is actually fairly remarkable.
0:29:14 There are many more results. We reran all of these experiments with mixture models, I think eight-component mixture models, and we see the same sort of behavior. 0:29:27 We've also reproduced all of the discriminative-training results here: we asked whether discriminative training can somehow magically alleviate the mismatched case, and the answer is no. 0:29:41 We also did (I think thanks to Morgan, as it's a natural question) an experiment asking how MLLR works in this setting. With MLLR you can reduce some of the mismatch, as you would expect, 0:29:54 but MLLR is a simple linear transformation, and whatever the transformation between these two channels is, it's some peculiar nonlinear transformation, so it's unreasonable to expect MLLR to do all that well. 0:30:08 But this test harness is a really good harness for evaluating how invariant a front end is to these kinds of transformations. 0:30:20 We've explored that a little, and the results are not so encouraging.
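For readers unfamiliar with why the talk calls MLLR "a simple linear transformation": in mean-only MLLR adaptation, every Gaussian mean is replaced by one shared affine transform of itself. The toy sketch below uses two-dimensional means and made-up numbers purely for illustration.

```python
def mllr_adapt_means(means, A, b):
    """Apply one shared affine transform (A, b) to every state's mean.

    Mean-only MLLR: mu -> A @ mu + b, with the same (A, b) shared
    across all Gaussians. A nonlinear channel change cannot, in
    general, be undone by a single transform of this form.
    """
    def affine(mu):
        return [sum(A[i][j] * mu[j] for j in range(len(mu))) + b[i]
                for i in range(len(A))]
    return {state: affine(mu) for state, mu in means.items()}

means = {"s1": [1.0, 0.0], "s2": [0.0, 2.0]}   # illustrative means
A = [[0.9, 0.0], [0.0, 1.1]]                   # assumed global transform
b = [0.5, -0.5]
print(mllr_adapt_means(means, A, b))
```

In practice (A, b) is estimated to maximize the likelihood of the adaptation data, and transforms may be tied per regression class rather than fully global; the point here is only the restricted, linear functional form.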
0:30:26 All right. With that, I think I will end; I've blathered on enough. I'll turn it over to Jordan, and he will give you a higher-level view of the whole idea, and then we'll have questions.
0:30:54 [Speaker change; microphone check.] 0:31:18 Okay, one, two, three.
0:31:20 All right. So it turns out that there were two parts to this project. Steve told you about the technical stuff, but we also thought that we'd like to figure something out. 0:31:31 You've been hearing a lot about how wonderful speech recognition is during this meeting, and we thought we would actually like to understand what the community actually thought speech recognition was like. 0:31:42 So we rolled our own survey, and I called a bunch of people; many of you got called by me. 0:31:59 What we wanted to do was just see what people thought about how speech recognition really worked. We were hoping that we would find some evidence to persuade the government to maybe put in some money and fund some speech recognition research, which we haven't seen in a long time. 0:32:17 But really, we just wanted to find out what was going on.
0:32:20 So we put together a little survey team. Jamieson worked with me (she's been in speech for a very long time), and we engaged Frederick, who is a specialist at doing surveys. 0:32:36 And we designed a snowball survey. A snowball survey is very interesting: you start with a small group of people that you know, you ask them the questions, and then you ask them who else to ask, 0:32:51 and you just follow your nose. What that means is that, although it's not entirely unbiased, it's as unbiased as you can be if you don't know what the sampling population is going to be. 0:33:06 We wanted to know what was going on: what did people think the failures were, what remedies had people tried, and how did they work? 0:33:17 So we did this snowball sampling.
0:33:19 Here's the questionnaire. I don't want to spend a lot of time on this, but just take a look. The interesting questions are the last one on the slide (where has the current technology failed?) and the first one (where do you think it broke?), and then questions about what you did about what was going on, and about whether there's anything else. 0:33:45 The survey participants tended to be old; I think that's sort of how our snowball worked. Not terribly old, but there are not a lot of young people in this, so the ages ran from thirty-five to seventy. 0:33:58 We spoke to about eighty-five people. 0:34:03 And they had an interesting mix of jobs: most of them were in research, some were in development, some were both. There was a small number of management people, and then people who self-described their jobs in something more detailed. 0:34:22 But mostly these were R&D people or managers doing speech research or language research of one sort or another.
0:34:35 So here's what you told us. Natural language is a real problem, and acoustic modeling is a real problem, and everything else that we do is broken, more or less. 0:34:51 So I think the community sort of has this feeling (not the people trying to sell speech recognition to the management, but the people trying to make it work) that all is not really well with the technology. 0:35:05 When you ask people to point fingers, they point at the language itself and at acoustic modeling. 0:35:14 And there's a third category, which says "not robust"; that's the stuff Steve was talking about. 0:35:22 So there's something going on with this technology that makes it not work very well.
0:35:27 And when we asked people what they had tried in order to fix things, the answer is: everything. People have mucked around with the training; people have tried all kinds of different adjustments to their systems. 0:35:54 All right, anyway. 0:35:58 One of the interesting things people have tried to do: many of us have tried to fix pronunciations, either in the dictionaries or in the pronunciation rules, and, to a first approximation, everyone has found that this is a waste of time. 0:36:12 That's pretty interesting: that is not a way to fix the systems that we currently build. So we've tried all kinds of stuff.
0:36:21 And so I think our takeaway from the survey is that people don't actually believe the technology is very solid, and that we have tried a lot of things to fix it. Then we looked a little bit at the literature (there are literature surveys in the ICSI report, which you can go read), and we found passages that look sort of like this one, from a review: 0:36:48 "LVCSR is far from being solved: background noise, channel distortion, foreign accents, casual disfluent speech, or unexpected topics can cause automatic systems to make egregious errors." 0:36:57 And that's what everybody said: anybody who has looked at the field says, well, this technology is okay sometimes, but it fails a lot. 0:37:08 So our conclusion was that the technology is not all right. And I'll point out that the models most of us use, hidden Markov models, are essentially the thing that was written down around nineteen sixty-nine. 0:37:22 So maybe that's, I think, kind of one of our issues here.
0:37:29 When these systems fail, they degrade not gracefully, the way you would like, but catastrophically and quickly. 0:37:40 Speech recognition performance is substantially behind how humans do in almost every circumstance. 0:37:49 These systems are not robust.
0:37:51 So that was my quick overall overview of what the survey found; it's available on the ICSI website, in the report. But I wanted to add a couple of personal comments, my own analysis of what's happening. 0:38:08 These are my views alone (I'm not representing the government here); I want to talk to you about my own personal analysis.
0:38:17so here's there's three points first point
0:38:21if you have a model and you spend a lot of time hill
0:38:24climbing to the optimum performance
0:38:26and it doesn't perform optimally at that spot
0:38:29you've got the wrong model
0:38:32hidden markov models were proved to converge by baum petrie soules and weiss
0:38:37in nineteen sixty nine
0:38:39that proof has two parts
0:38:41one it says you can always make a better model
0:38:45two it says you get the optimal parameters if the data came from the model
0:38:51that second part is
0:38:54absolutely not true in our speech recognition systems we're hill climbing on data that doesn't
0:39:00match the model and we're not gonna find the answer that way
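The two-part guarantee just described can be checked on a toy example. This is my own sketch, not anything from the talk: a 2-state, 2-symbol discrete HMM trained with Baum-Welch on deterministic alternating data, which certainly did not come from the model. Part one of the guarantee (likelihood never decreases) still holds; nothing says the fixed point is the "right" model. All parameter values here are arbitrary illustrations.

```python
def forward_backward(obs, pi, A, B):
    """Exact forward-backward for a tiny discrete HMM (no scaling; short obs only)."""
    n, S = len(obs), len(pi)
    alpha = [[0.0] * S for _ in range(n)]
    for s in range(S):
        alpha[0][s] = pi[s] * B[s][obs[0]]
    for t in range(1, n):
        for s in range(S):
            alpha[t][s] = B[s][obs[t]] * sum(alpha[t-1][r] * A[r][s] for r in range(S))
    beta = [[1.0] * S for _ in range(n)]
    for t in range(n - 2, -1, -1):
        for s in range(S):
            beta[t][s] = sum(A[s][r] * B[r][obs[t+1]] * beta[t+1][r] for r in range(S))
    like = sum(alpha[n-1][s] for s in range(S))
    return alpha, beta, like

def baum_welch_step(obs, pi, A, B):
    """One EM re-estimation step; returns new parameters and the OLD likelihood."""
    n, S, V = len(obs), len(pi), len(B[0])
    alpha, beta, like = forward_backward(obs, pi, A, B)
    gamma = [[alpha[t][s] * beta[t][s] / like for s in range(S)] for t in range(n)]
    xi = [[0.0] * S for _ in range(S)]          # expected transition counts
    for t in range(n - 1):
        for r in range(S):
            for s in range(S):
                xi[r][s] += alpha[t][r] * A[r][s] * B[s][obs[t+1]] * beta[t+1][s] / like
    pi2 = gamma[0][:]
    A2 = [[xi[r][s] / sum(gamma[t][r] for t in range(n - 1)) for s in range(S)]
          for r in range(S)]
    B2 = [[sum(g[s] for t, g in enumerate(gamma) if obs[t] == v) /
           sum(g[s] for g in gamma) for v in range(V)] for s in range(S)]
    return pi2, A2, B2, like

# Deterministic alternating data: certainly not generated by this HMM.
obs = [0, 1] * 10
pi, A, B = [0.6, 0.4], [[0.7, 0.3], [0.4, 0.6]], [[0.55, 0.45], [0.35, 0.65]]
likes = []
for _ in range(20):
    pi, A, B, like = baum_welch_step(obs, pi, A, B)
    likes.append(like)
# Part one holds: likelihood never decreases, even on mismatched data.
assert all(likes[i+1] >= likes[i] - 1e-12 for i in range(len(likes) - 1))
```

The hill climbing succeeds in the narrow sense that likelihood goes up every iteration, which is exactly why it is tempting to trust; the speaker's point is that part two, optimality of the resulting parameters, needs the data to actually come from an HMM.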
0:39:04so we spent a lot of time
0:39:06trying to adapt to account for the problem but we've got the wrong model
0:39:13this is a personal one
0:39:15if you use sixty four gaussians to fit some distribution you have no idea what
0:39:19the distribution is
0:39:21the original
0:39:23multi gaussian distributions were done with a single mean and i understand that but that's not what we do now
0:39:30and so my corollary i think speaks for itself
0:39:37and finally if the system you build fails for fifty percent of the population entirely
0:39:43and then for the people it works for as soon as they walk into a reverberant
0:39:46environment or a noisy place it fails
0:39:48it's broken
0:39:51and i believe speech recognition has exactly this problem
0:39:55so i think what we really wanted to do i'm i want to draw an
0:39:59analogy so i want to draw an analogy between
0:40:03transcription and transportation
0:40:06for transportation man this is what i want something that's sleek and speedy and
0:40:12easy to use and doesn't break
0:40:15and what we built is this
0:40:20it runs on two wheels it will get somewhere eventually you spend almost all your
0:40:24time dealing with problems that have nothing to do with the transportation part
0:40:28and so i believe that that's what we've done with speech recognition
0:40:32and it's time for new models and
0:40:35i urge you to think about models
0:40:38and not so much about the data
0:40:54and with that
0:40:56i'll take questions okay
0:40:58i assume that this is going to generate a lot of discussion and a lot of questions
0:41:02if it doesn't then something is wrong with us
0:41:06this sds community would be broken
0:41:10okay who's first over there
0:41:20a question about the resampling
0:41:24as i think about this you have a sort of sequence of random variables and
0:41:27you're turning a knob on the independence between them
0:41:31one of the things that turning that knob does is
0:41:35as things become more dependent there's
0:41:37less information
0:41:40what i'm wondering is how much of the word error rate degradation you see
0:41:44might be associated simply with the fact that there's just less information
0:41:48in streams that are more dependent
0:41:54is this working
0:41:56so i guess i don't understand the question
0:41:59ah i mean i
0:42:02so you're right so here is an answer and you can tell me if i'm
0:42:07close to understanding the model assumes that each frame has an independent amount of information
0:42:15but we know that the frames do not have independent amounts of information the
0:42:20amount of information
0:42:22going from frame to frame varies enormously
0:42:25but the model treats every single one of those frames as independent and that's
0:42:31an egregious violation of the assumptions
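The violation being described can be put in one toy number. This is my own illustration, not from the talk: for a "sticky" two-symbol Markov chain, a frame-independent model that even has the per-frame marginals exactly right assigns exponentially less probability to a typical run than the true joint distribution does.

```python
import math

p_stay = 0.9                    # highly dependent frames: next frame usually repeats
seq = [0] * 5 + [1] * 5         # a typical "sticky" run of ten frames

# True chain probability: uniform start, then stay/switch transitions.
true_logp = math.log(0.5)
for a, b in zip(seq, seq[1:]):
    true_logp += math.log(p_stay if a == b else 1.0 - p_stay)

# A frame-independent model with the correct marginals (0.5 per symbol).
indep_logp = len(seq) * math.log(0.5)

# The independence model badly underestimates typical sticky sequences.
assert indep_logp < true_logp
```

Here `true_logp` is about -3.84 nats while `indep_logp` is about -6.93, and the gap grows linearly with sequence length, which is one way to see why treating dependent frames as independent distorts the scores the recognizer compares.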
0:42:34so that
0:42:37i guess what i was thinking about was
0:42:39if i ask you to say a word ten times versus i ask ten people to
0:42:42say the word once
0:42:43and i'm trying to figure out what's the word
0:42:45then the ten people saying it might actually provide more information in the data
0:42:51and i'm just wondering if that might at all
0:42:53contribute to why there's more
0:42:57information as you sample from
0:42:59more and more disparate parts of the training database
0:43:07well i think i think what you're actually saying is you're right
0:43:18so the model
0:43:20i think
0:43:21many people this is a question they have so when you when you have
0:43:26all the frames and they're independent when you do frame resampling the frames come from
0:43:31all sorts of different speakers and when you when you line them up you know
0:43:35like the ones i played they come from all sorts of different speakers but then
0:43:40as soon as i start
0:43:43increasing the segment size then each one of those segments is gonna come from one
0:43:49speaker right is this sort of along the lines of what you're thinking well the
0:43:53notion of speaker is part of the dependence in the data right the fact
0:43:59that each one of these frames came
0:44:01from a single speaker that's dependence
0:44:05and so that interframe dependence
0:44:07well the model knows nothing about it
0:44:09and so whether that's causing a problem or not that's exactly your question
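The frame-versus-segment resampling being discussed can be sketched in a few lines. This is a hypothetical reconstruction, not the actual experiment code, and all names (`resample_state`, the toy frame labels) are mine: with block size 1 the synthetic utterance mixes frames from many training tokens and speakers, which matches the model's independence assumption; larger blocks keep runs of consecutive frames, and hence real temporal and speaker dependence, intact.

```python
import random

def resample_state(pool, length, block, rng):
    """Draw `length` frames for one HMM state by concatenating blocks of
    `block` consecutive frames, each block taken from a randomly chosen
    training token aligned to that state. block=1 destroys within-token
    dependence; a large block preserves it."""
    frames = []
    while len(frames) < length:
        token = rng.choice(pool)                       # one aligned token's frames
        start = rng.randrange(max(1, len(token) - block + 1))
        frames.extend(token[start:start + block])
    return frames[:length]

rng = random.Random(0)
pool = [["a1", "a2", "a3", "a4"], ["b1", "b2", "b3"]]  # toy aligned tokens
indep = resample_state(pool, 6, 1, rng)   # frames mixed across tokens/speakers
dep = resample_state(pool, 3, 3, rng)     # one contiguous run from one token
assert len(indep) == 6 and len(dep) == 3
```

With `block=3` the whole drawn segment comes from a single token (here, a single hypothetical speaker), which is exactly the dependence the answer above says the model knows nothing about.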
0:44:22of course all of us
0:44:23you know as you said all of us have been aware of this for
0:44:26a long time and i think there has been a lot of effort at trying
0:44:29to undo it
0:44:31it's kind of when we say the model has this independence assumption that's
0:44:37sort of half true
0:44:39because the features that we use
0:44:42go over several frames so of course they're not actually independent you know when you
0:44:47synthesise it's not clear what you really synthesise "'cause" you have to synthesise something that
0:44:52may have an independent value but it has to have a derivative that matches the
0:44:56previous thing and so on but
0:44:58but we've all tried things like segmental models
0:45:02which don't have that independence assumption
0:45:04right we take a segment
0:45:07a whole phoneme so you're
0:45:09skipping the state independence assumption and the frame independence assumption and going straight
0:45:15to the context-dependent phoneme
0:45:18and now you're picking a sample from the one distribution for that context-dependent phoneme
0:45:24and that always works worse
0:45:28maybe you can do something with that or combine it with the hidden markov model
0:45:32and gain half a point but by itself it always works a lot worse
0:45:38and unless you unless you cripple the hidden markov model by saying we're only gonna
0:45:43use context-independent models then this one might work better but
0:45:48so the question is
0:45:49it's not that we haven't tried
0:45:51people have tried to make models that avoid those things and almost all of those
0:45:56things got worse the flip side of that is you said mpe or mmi
0:46:00and all these things run up against
0:46:04that assumption but they don't remove it they just reduce the error and therefore
0:46:08they reduce the error by
0:46:10ten percent fifteen percent relative
0:46:13basically a small amount it's similar to any of the other
0:46:18tricks we do so do you have any comment on those two observations
0:46:21well i mean
0:46:23i'm not sure what
0:46:25so a natural question which i think is the first part of what
0:46:29you're saying is so why have many people tried and failed to beat hmms with
0:46:36models that take into account
0:46:40the dependence structure in the data so why hasn't that worked
0:46:47i would say that
0:46:50i do not believe that anyone has any quantitative notion of why these things fail
0:46:57on the data
0:46:59i'm not saying that we should go back to these methods maybe we should but
0:47:04well i will give you an example of something you know twenty years ago people
0:47:09gave up on neural networks
0:47:11and all of a sudden you know neural networks are
0:47:16are the new
0:47:18the new
0:47:21i don't know what the right biblical phrase is but hallelujah so what it
0:47:28takes is somebody who believes in something and tries to start to do it and i
0:47:35think that here is the problem
0:47:37we should be i don't know what the solution is i honestly don't know what
0:47:41the solution is but i will say also about the mmi thing no i
0:47:46don't believe anyone would say mmi was designed to overcome independence
0:47:53you know we knew that the maximum likelihood solution to this problem was not the
0:47:58right solution so we found an alternative model selection procedure that fits in a
0:48:04different place
0:48:05again if the model were correct we wouldn't have to do that
0:48:16coming back to the results the simulation results you presented
0:48:20i think these are highly suggestive because
0:48:24by changing the data to fulfil your assumptions
0:48:29the error rates you get are not the error rates we'd
0:48:32expect from the real data
0:48:35because you fit
0:48:36the problem to your assumptions but we have to go the other way around so
0:48:40what error rates we can really expect if we
0:48:45improve our modeling is still an open question i think
0:48:48exactly that's absolutely right in no way am i claiming
0:48:54that if we could model dependence in the data we would be seeing these
0:48:58error rates the frame resampling error rates that's absolutely correct
0:49:04i mean so
0:49:05presumably we would do better the other point though is i think that
0:49:12a lot of the
0:49:17this sort of brittleness that we experience
0:49:20in our models this is a conjecture is due to this very
0:49:25poor fit to the temporal structure
0:49:31and you know one way of thinking
0:49:35of these results you know the frame resampling results is that if you
0:49:40forget about the temporal structure in the data the models work really well but as soon
0:49:46as you introduce real temporal structure into the data the models start falling apart
0:49:51and for real speech i think temporal structure is important
0:49:57i think
0:50:04here is the my
0:50:10by a shock i see how a
0:50:16or thai interested party
0:50:19yes the line
0:50:25i don't think
0:50:28when you break independence assumptions it's not
0:50:34that the mistake is in not extracting information from speech it doesn't necessarily follow
0:50:41you know that it won't work
0:50:44i mean i can build a plausible system that satisfies the
0:50:49independence assumption
0:50:51so i don't think
0:50:52you know
0:50:53it really follows that
0:50:55from my models really see
0:50:58the models and so
0:51:01i think you don't want to be thinking about extracting
0:51:06getting the right information the problem is not the amount of information
0:51:10it's a question of how you represent the information
0:51:15and so if you misrepresent it what do you lose in the process
0:51:19i would say the misrepresentation
0:51:21shows up as false alarms
0:51:28something like
0:51:29some work
0:51:31have you might have
0:51:34but works if that's not right
0:51:38work land farm
0:51:41i rate is
0:51:44just done the same tendency
0:51:47these days
0:52:34i like
0:52:45when you know all
0:52:55one thing that works really poorly
0:52:58is if you have a mismatched representation
0:53:01so think about a model that is representing text okay
0:53:07you can represent it as raster scanned text
0:53:09or you could represent it as fonts
0:53:13and if you change the size of the image
0:53:16the two things are very different the font
0:53:20representation has an actual easy representation change and the raster scan just messes up
0:53:25the whole thing
0:53:27so you have to ask yourself is the problem that we're seeing
0:53:31the fact that we have a representation for the problem that doesn't match
0:53:37that i think is the real question
0:53:40mm this tells us something a comment
0:53:43as you go further down from the top from states to phones and phones to
0:53:49data it's becoming more and more speaker dependent so maybe the problem is your
0:53:54models are not speaker dependent i mean
0:53:57if you made your models more speaker dependent would we have seen
0:54:02such a difference
0:54:03but it has nothing to do with the frame dependent sampling well what
0:54:08i was trying to say before is that that is a form of dependence
0:54:13the that
0:54:15that the model knows nothing about
0:54:17this form of dependence
0:54:19you know there are many forms of dependence in data knowing what independence
0:54:24is is a hard thing for a human to understand right
0:54:28but that form of dependence is precisely there and it may be causing the problem
0:54:36so there were there were a number of speakers there are relatively few speakers
0:54:42in this corpus and so we had to sort of cap them so that there
0:54:46wasn't a single dominant speaker
0:54:50which i mean i think that would be the worst
0:54:56so let me sort of continue with what he was asking again
0:55:02we know the model is wrong
0:55:05models are always wrong
0:55:08and so
0:55:11the way you
0:55:13you can argue that the model is wrong mathematically or you can argue that it's
0:55:17wrong because it doesn't match human performance what we think
0:55:22of as human performance i think we may overestimate human performance a little bit but
0:55:26it clearly doesn't match it
0:55:29but in fact you know if you look at all the research that all of
0:55:32us do
0:55:34we at least try to attack those problems so we say we're gonna use
0:55:39font models to use your analogy we allow our models to scale
0:55:45like fonts right we put in we say we're going to estimate a scale
0:55:49factor and that scale factor is not a simple one
0:55:52it can be a simple one or it can be a matrix you know much
0:55:54more complicated than what you do with a font and we constrain it to be
0:55:58the same we say the speaker is the same for the whole sentence
0:56:01we do speaker adaptive training so we try to remove the differences
0:56:07we try to normalize all the speakers to the same place and then insert the
0:56:11properties of the new speaker again right
0:56:14that's sort of like the analogy of a font
0:56:16we try to do all of these things we're certainly trying to model channels
0:56:23we do all of these with linear models and nonlinear models
0:56:29we get small improvements
0:56:31so my question let me turn the question around
0:56:34the model is wrong
0:56:36what's the right model
0:56:38not what does it do but what is the right model
0:56:43i think we all don't know the answer to that question but let me tell
0:56:47you about another phenomenon that i would like to use as an analogy
0:56:52unless you've been following particle physics
0:56:56in particle physics
0:56:58when you measure particle interactions the probabilities of the interactions are governed
0:57:03basically by feynman diagrams
0:57:05and so to compute a particle interaction like in the super collider to compute
0:57:11a cross sectional area for one of the interactions takes a big computer about
0:57:15a week to look at all the feynman diagrams
0:57:19but a couple of the physics guys just discovered a geometric object
0:57:24and in this geometric object it turns out that each
0:57:28little piece has
0:57:32an area that is exactly the solution
0:57:34to that problem of computing the cross sectional area
0:57:39and you can now do the computations
0:57:43in about five minutes with a pencil and paper
0:57:48there's a place where a difference in the model has a huge effect
0:57:54on making things work so i don't think i don't believe the right model lies in
0:57:59the kinds of things that we've always been doing
0:58:02i think we need to have some radical reinterpretation of the way we look
0:58:06at the data the way we look at the words
0:58:09maybe something along those lines
0:58:14i took a degree in linguistics because i thought speech wasn't an easy problem from a
0:58:18theory point of view and i learned to distrust everything a linguist said
0:58:24maybe we should distrust most of it too but
0:58:26maybe there's something different that we should be doing
0:58:28so i would love to just go look outside this place that we've been exploring