| 0:00:15 | so i'm gonna talk about project ouch but thank you for having me here |
|---|
| 0:00:22 | i | 
|---|
| 0:00:23 | i enjoyed my time in the czech republic that learn many check we're concluding well | 
|---|
| 0:00:29 | so thank you | 
|---|
| 0:00:31 | so i | 
|---|
| 0:00:33 | project ouch ouch stands for outing unfortunate characteristics of hmms |
|---|
| 0:00:39 | there are three | 
|---|
| 0:00:41 | truthfully there were three phases the sort of initial work that we did |
|---|
| 0:00:47 | on this was a project that larry gillick and i did when we |
|---|
| 0:00:52 | were at nuance |
|---|
| 0:00:54 | and i truthfully it also had its antecedents in work that we were doing it | 
|---|
| 0:00:59 | for signal | 
|---|
| 0:01:01 | but that's a funded a very small pilot study and i our funded the a | 
|---|
| 0:01:09 | larger but still small off a lot the people the students to work with me | 
|---|
| 0:01:14 | were dan gillick |
|---|
| 0:01:17 | hari |
|---|
| 0:01:18 | parthasarathi who was actually a postdoc and shuo-yiin chang who is currently a student at berkeley |
|---|
| 0:01:24 | and larry gillick nelson morgan |
|---|
| 0:01:26 | and myself were the senior people |
|---|
| 0:01:29 | so project out | 
|---|
| 0:01:31 | what we're trying to do | 
|---|
| 0:01:33 | is our goal is to sort of develop a quantitative understanding of | 
|---|
| 0:01:41 | how the current formalism thing | 
|---|
| 0:01:44 | and you know surprisingly there's been very little work |
|---|
| 0:01:48 | in this direction in the forty year history |
|---|
| 0:01:51 | of speech recognition | 
|---|
| 0:01:54 | there's been some but it's been isolated and sporadic |
|---|
| 0:01:59 | and | 
|---|
| 0:02:00 | you know progress in speech recognition has been very exercise and | 
|---|
| 0:02:05 | in my view largely because we've been proceeding |
|---|
| 0:02:09 | via trial-and-error and so the claim is |
|---|
| 0:02:12 | that by gaining a deeper understanding | 
|---|
| 0:02:16 | of how our algorithms succeed and fail |
|---|
| 0:02:19 | rather than just measuring word error and if we get an improvement |
|---|
| 0:02:23 | in word error keep it | 
|---|
| 0:02:26 | if it doesn't improve we |
|---|
| 0:02:28 | we just it should enable more efficient and steady progress and i claim that this | 
|---|
| 0:02:34 | should be embedded in our standard sort of research maybe not necessarily the techniques that i'm |
|---|
| 0:02:41 | gonna talk about okay but just this | 
|---|
| 0:02:43 | notion that when you have a model | 
|---|
| 0:02:46 | you know it doesn't fit the data you should try to gain some |
|---|
| 0:02:50 | understanding of how a model differs from the data and how that data model residual | 
|---|
| 0:02:57 | impacts | 
|---|
| 0:02:58 | the classification errors | 
|---|
| 0:03:01 | so the main questions that a project ouch was interested in is you could be | 
|---|
| 0:03:08 | the main way you could think about this is what do the models |
|---|
| 0:03:12 | find surprising about data what is it about speech data that the models find surprising | 
|---|
| 0:03:17 | and how does that surprise translate into error |
|---|
| 0:03:22 | so | 
|---|
| 0:03:23 | i'm gonna talk today about quantifying the two major | 
|---|
| 0:03:28 | hmm assumptions and their impact on the error rates of course the two major assumptions |
|---|
| 0:03:33 | are the very strong independence assumptions the model makes |
|---|
| 0:03:38 | and also | 
|---|
| 0:03:40 | and equally strong assumption about what the form of the marginal distribution of the frames | 
|---|
| 0:03:45 | are typically we assume that they are gaussian mixture models of course nowadays people |
|---|
| 0:03:50 | are using multi layer perceptrons but again you make some sort of formal |
|---|
| 0:03:56 | assumption about what it looks like |
|---|
| 0:04:00 | also which of these incorrect assumptions is and key your discriminative training mpe or mmi | 
|---|
| 0:04:08 | which of these assumptions is |
|---|
| 0:04:11 | is this process are compensating for the maximum | 
|---|
| 0:04:16 | and | 
|---|
| 0:04:17 | do these results change when you move from a matched training and |
|---|
| 0:04:22 | test | 
|---|
| 0:04:22 | us we're formalism the mismatched case | 
|---|
| 0:04:26 | so the early sort of work that we did was on the switchboard and the |
|---|
| 0:04:30 | wall street journal corpora later on we move to the icsi corpus | 
|---|
| 0:04:35 | you can recast |
|---|
| 0:04:36 | this sort of question about how do these results change in the mismatched case |
|---|
| 0:04:43 | in the form of why is asr so brittle |
|---|
| 0:04:47 | we go | 
|---|
| 0:04:48 | at any time bring up | 
|---|
| 0:04:51 | a new recognizer on a problem whether | 
|---|
| 0:04:54 | the same language or across languages you always have to start it seems almost from |
|---|
| 0:04:59 | scratch you always have to collect a bunch of data that's closely related to the | 
|---|
| 0:05:05 | to the task that you | 
|---|
| 0:05:06 | you have and | 
|---|
| 0:05:08 | it hardly ever works the first time you try it it's the reason that most | 
|---|
| 0:05:12 | of us in this room have | 
|---|
| 0:05:14 | have jobs so it's sort of a good thing but it's incredibly |
|---|
| 0:05:19 | frustrating right it's like | 
|---|
| 0:05:23 | it's a miracle when anything works the first time |
|---|
| 0:05:27 | so the iarpa project mainly was interested in studying |
|---|
| 0:05:32 | these | 
|---|
| 0:05:33 | these questions on the icsi meeting corpus where there's a near field channel and |
|---|
| 0:05:38 | a far-field channel i'll talk a little bit more about that and we wanted to |
|---|
| 0:05:43 | understand when you trained models on the near field condition | 
|---|
| 0:05:47 | what happens when you recognise far-field data |
|---|
| 0:05:51 | and so in this context | 
|---|
| 0:05:54 | is the brittleness of asr solely due to the model's inability to account for |
|---|
| 0:06:00 | the statistical dependence that occurs in real data | 
|---|
| 0:06:04 | and you know when i started this particular project |
|---|
| 0:06:07 | i thought | 
|---|
| 0:06:08 | that it was just gonna be the independence assumptions so |
|---|
| 0:06:12 | and i was very surprised | 
|---|
| 0:06:15 | when we actually started doing the work | 
|---|
| 0:06:19 | and in fact it once like so | 
|---|
| 0:06:23 | and so i say i just sort of funny but | 
|---|
| 0:06:26 | but in the matched case basically | 
|---|
| 0:06:29 | this the inability of the model to account for statistical dependence that occurs in real | 
|---|
| 0:06:34 | data is basically the whole problem | 
|---|
| 0:06:37 | but when you move to the mismatched case | 
|---|
| 0:06:39 | all of a sudden something else rears its head |
|---|
| 0:06:43 | and it's a big problem and so i'll describe what this |
|---|
| 0:06:47 | problem | 
|---|
| 0:06:49 | it has to do with the lack of invariance of the front end |
|---|
| 0:06:53 | so | 
|---|
| 0:06:55 | i'm gonna spend a little bit of time |
|---|
| 0:06:57 | talking about the sort of methodology we use so the way we explore this |
|---|
| 0:07:03 | question is we create | 
|---|
| 0:07:07 | we fabricate data | 
|---|
| 0:07:09 | we use simulation and a novel sampling process |
|---|
| 0:07:15 | that uses real data | 
|---|
| 0:07:17 | to probe the models and the data that we create | 
|---|
| 0:07:21 | is either completely simulated so that it satisfies all the model assumptions |
|---|
| 0:07:26 | or it's real data | 
|---|
| 0:07:28 | that we sample in a way that gives it properties that we understand and so |
|---|
| 0:07:34 | by feeding this data we can sort of probe the models and see their response |
|---|
| 0:07:41 | to this data and we observe recognition accuracy |
|---|
| 0:07:47 | so here's an example | 
|---|
| 0:07:49 | so this is an example of what of course according to the average estimate seventy | 
|---|
| 0:07:55 | high miss rate by counts capital markets report | 
|---|
| 0:07:59 | so this is an example of course what we expect speech to sound like this | 
|---|
| 0:08:03 | is from wall street journal so this is a fabricated version of this that essentially | 
|---|
| 0:08:09 | agrees with all the model assumptions | 
|---|
| 0:08:13 | according to different estimates to construct the attachments capital markets report | 
|---|
| 0:08:19 | you can speculate syllable rhymes two point five percent that's model | 
|---|
| 0:08:25 | so | 
|---|
| 0:08:26 | so you know it's highly amusing but it's intelligible obviously and it obviously you know | 
|---|
| 0:08:32 | it's from a model that was constructed from a hundred different speakers and it reflects | 
|---|
| 0:08:37 | the sort of structure | 
|---|
| 0:08:39 | so what we're trying to quantify | 
|---|
| 0:08:41 | is | 
|---|
| 0:08:43 | what's the difference between these two extremes in terms of recognition accuracy |
|---|
| 0:08:50 | so the basic idea of data fabrication is simple |
|---|
| 0:08:56 | we follow the hmm's generation mechanism so to do that first we generate |
|---|
| 0:09:04 | an underlying state sequence consistent with the transcript the dictionary and the state transitions |
|---|
| 0:09:12 | of the underlying hmm you know the hidden markov model |
|---|
| 0:09:15 | then we walk down this |
|---|
| 0:09:19 | this sequence and we emit a frame at each point |
|---|
| 0:09:22 | so here's a picture a nice picture that describes the sort of structure it's a |
|---|
| 0:09:27 | sort of a parts of it are actually a graphical model | 
|---|
| 0:09:32 | of course this is an hmm |
|---|
| 0:09:34 | but basically we unpack so if we have a transcript we unpack words | 
|---|
| 0:09:41 | we get the corresponding pronunciations | 
|---|
| 0:09:45 | the phones in context | 
|---|
| 0:09:47 | then determine which hmm we use so this is the hidden state each of these | 
|---|
| 0:09:51 | states emits observations according to whatever mixture model we're actually using right |
|---|
| 0:09:59 | and so if you're not so familiar with the hmms i assume pretty much everyone | 
|---|
| 0:10:04 | in the room is but this sort of highlights the independence assumptions right well it | 
|---|
| 0:10:10 | highlights two things one | 
|---|
| 0:10:12 | the frames are emitted according to a rule and the rule is the marginal the |
|---|
| 0:10:17 | form that we get for the marginal distribution of frames | 
|---|
| 0:10:21 | and then of course then this also says that these frames are independent so every | 
|---|
| 0:10:26 | time i emit |
|---|
| 0:10:28 | a frame from state three it is independent from the previous frame that was |
|---|
| 0:10:33 | emitted from state three so that's a very strong assumption | 
|---|
| 0:10:37 | but in addition | 
|---|
| 0:10:38 | it is also independent from any of the frames that were emitted previously from |
|---|
| 0:10:43 | the other states so these are very strong assumptions |
|---|
| 0:10:46 | but okay again to generate observations we just follow this rule and basically once |
|---|
| 0:10:54 | we know the sequence of states | 
|---|
| 0:10:57 | i have a sequence of states once i have that i just walk down that |
|---|
| 0:11:02 | sequence of states and i do a draw |
|---|
| 0:11:04 | from |
|---|
| 0:11:06 | whatever the distribution is |
|---|
| 0:11:08 | whether it be empirical or parametric |
|---|
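
A minimal sketch of the generation procedure just described, assuming a toy left-to-right model with one-dimensional gaussian emissions. The state names, self-loop probabilities, and gaussian parameters in `TOY_HMM` are hypothetical and only illustrate the two steps: fix a state sequence, then emit one independent frame per state.

```python
import random

# Hypothetical toy model: each state has a self-loop probability and a
# one-dimensional gaussian (mean, std) as its emission distribution.
TOY_HMM = {
    "s1": {"stay": 0.6, "mean": 0.0, "std": 1.0},
    "s2": {"stay": 0.7, "mean": 3.0, "std": 0.5},
    "s3": {"stay": 0.5, "mean": -2.0, "std": 1.5},
}

def sample_state_sequence(states, hmm, rng):
    """Walk left to right through `states`, repeating each state while its self-loop fires."""
    seq = []
    for s in states:
        seq.append(s)
        while rng.random() < hmm[s]["stay"]:
            seq.append(s)
    return seq

def simulate_frames(state_seq, hmm, rng):
    """Emit one frame per state; each draw depends only on the current state."""
    return [rng.gauss(hmm[s]["mean"], hmm[s]["std"]) for s in state_seq]

rng = random.Random(0)
state_seq = sample_state_sequence(["s1", "s2", "s3"], TOY_HMM, rng)
frames = simulate_frames(state_seq, TOY_HMM, rng)
```

In a real system the state list would come from unpacking the transcript through the dictionary into phones in context, and the emissions would be multivariate mixture models rather than a single gaussian.
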
| 0:11:13 | so | 
|---|
| 0:11:14 | so for simulation | 
|---|
| 0:11:16 | you know it's easy to simulate from mixture models not a big |
|---|
| 0:11:21 | deal right | 
|---|
| 0:11:23 | but what about this sort of novel sampling process that'll allow us to get a | 
|---|
| 0:11:30 | at the independence assumptions well for this |
|---|
| 0:11:33 | we adapt a formalism |
|---|
| 0:11:36 | from efron's bootstrap so i talked a little bit about the bootstrap in the |
|---|
| 0:11:41 | paper the poster | 
|---|
| 0:11:45 | people in the field don't seem to be terribly familiar with that i'm not |
|---|
| 0:11:51 | sure is topical very much but i will try | 
|---|
| 0:11:54 | so the basic idea is |
|---|
| 0:11:57 | a suppose you have an unknown population right so you've got some population distribution and | 
|---|
| 0:12:02 | you compute the statistic that's meant to summarize this population itself | 
|---|
| 0:12:08 | then you want to know how good is the statistic so i want to construct |
|---|
| 0:12:12 | a confidence interval for the statistic to give me a sense how well i've estimated |
|---|
| 0:12:17 | from | 
|---|
| 0:12:18 | a place | 
|---|
| 0:12:20 | so how do i do that if i don't know what the population is |
|---|
| 0:12:23 | i mean i'm trying to | 
|---|
| 0:12:25 | you know i'm trying to derive properties of a of this population | 
|---|
| 0:12:29 | and so and so in particular i don't know anything about really except the sample | 
|---|
| 0:12:34 | i've drawn from this population |
|---|
| 0:12:37 | so | 
|---|
| 0:12:37 | before efron's bootstrap procedure people would usually make some parametric assumptions about |
|---|
| 0:12:44 | the population typically you'd assume it's normal or gaussian |
|---|
| 0:12:49 | and then compute | 
|---|
| 0:12:51 | a confidence interval using that structure |
|---|
| 0:12:54 | well of course that's sort of crazy you know why would you do that you know |
|---|
| 0:12:58 | especially if you're trying to say |
|---|
| 0:12:59 | is this population distribution gaussian or not well it's crazy to assume |
|---|
| 0:13:04 | then that the population distribution is gaussian to compute this confidence interval |
|---|
| 0:13:09 | so this was a big problem in the late seventies when computers became sort of | 
|---|
| 0:13:14 | usable |
|---|
| 0:13:15 | by statisticians |
|---|
| 0:13:18 | efron came up with this sort of formalism and |
|---|
| 0:13:21 | and so the name comes from pulling oneself up by the bootstraps lots of |
|---|
| 0:13:26 | people use the bootstrap for various sorts of terminology it allegedly comes everyone attributes this | 
|---|
| 0:13:33 | to the to the to the story in the | 
|---|
| 0:13:36 | adventures of baron munchausen where he |
|---|
| 0:13:40 | is in a swamp and needs to get out so he pulls himself up |
|---|
| 0:13:44 | by his bootstraps out of the swamp but of course if you read the |
|---|
| 0:13:49 | adventures of baron |
|---|
| 0:13:50 | munchausen that's not what happens |
|---|
| 0:13:52 | in fact you within a small | 
|---|
| 0:13:55 | on forcing use trying to get out of this one | 
|---|
| 0:13:57 | so instead pulled himself out what is okay | 
|---|
| 0:14:01 | so maybe instead we collected daily | 
|---|
| 0:14:06 | similarly a little bit whiter i thought that was very | 
|---|
| 0:14:10 | so | 
|---|
| 0:14:12 | so the way the bootstrap works |
|---|
| 0:14:16 | is you take the empirical distribution so you treat |
|---|
| 0:14:19 | so you have the sample |
|---|
| 0:14:21 | so this sample is a representative of the true population distribution so if it's big | 
|---|
| 0:14:26 | enough it should be a pretty good representative |
|---|
| 0:14:29 | and so you since | 
|---|
| 0:14:30 | instead of fitting a parametric model to this you treat this as an empirical distribution |
|---|
| 0:14:35 | and you sample from that empirical distribution | 
|---|
| 0:14:39 | sampling from the empirical distribution turns out to be equivalent to just doing a random | 
|---|
| 0:14:45 | draw with replacement from the sample itself | 
|---|
| 0:14:48 | hence the name resampling |
|---|
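
A minimal sketch of the classical bootstrap described above, assuming a percentile confidence interval for the mean. The data values, the function name `bootstrap_ci`, and its parameters are hypothetical; the only essential step is the random draw with replacement from the sample itself.

```python
import random
import statistics

def bootstrap_ci(sample, statistic, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample with replacement, recompute, take quantiles."""
    rng = random.Random(seed)
    n = len(sample)
    # rng.choices draws with replacement, i.e. samples from the empirical distribution
    estimates = sorted(
        statistic(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical sample drawn from an unknown population
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 6.1, 3.9, 4.4]
low, high = bootstrap_ci(data, statistics.mean)
```

No gaussian assumption is made about the population here; the spread of the resampled means stands in for the unknown sampling distribution.
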
| 0:14:50 | so we're gonna adapt this | 
|---|
| 0:14:52 | this formalism to the problem at hand so you know so |
|---|
| 0:14:57 | when we train our models right so imagine we're viterbi training |
|---|
| 0:15:02 | here here's a | 
|---|
| 0:15:04 | you know | 
|---|
| 0:15:05 | well i'll have another picture but basically we're gonna resample the frames that are |
|---|
| 0:15:11 | assigned to a particular state during training and that's how it works |
|---|
| 0:15:16 | and we can do this for various types of segments |
|---|
| 0:15:21 | so here | 
|---|
| 0:15:22 | it's a really crappy picture but which i have to do a better job but | 
|---|
| 0:15:26 | this is that i here again that | 
|---|
| 0:15:29 | so the you know these again see | 
|---|
| 0:15:32 | but so we have the true population distribution this you know we fit a say | 
|---|
| 0:15:40 | gaussian to this it's not a particularly good representative and instead if we have if we've |
|---|
| 0:15:45 | drawn enough data from it this histogram estimates the distribution |
|---|
| 0:15:52 | so basically | 
|---|
| 0:15:55 | but the important part of this slide is |
|---|
| 0:15:58 | resampling is gonna fabricate data | 
|---|
| 0:16:02 | that satisfies independence assumptions of the hmm because i'm gonna do random draw with replacement | 
|---|
| 0:16:08 | from the distribution | 
|---|
| 0:16:10 | but | 
|---|
| 0:16:11 | the data we create are gonna deviate from the hmm's parametric |
|---|
| 0:16:19 | distributional assumptions that we make to exactly the same degree that real data |
|---|
| 0:16:24 | do because it's real data | 
|---|
| 0:16:26 | and it's the data at all | 
|---|
| 0:16:28 | from the training | 
|---|
| 0:16:30 | so here's it's already good picture which can lead in sort | 
|---|
| 0:16:34 | describe a little bit | 
|---|
| 0:16:36 | about what we do | 
|---|
| 0:16:38 | and a | 
|---|
| 0:16:39 | so imagine if we have training data and we're actually doing viterbi training so if | 
|---|
| 0:16:44 | we're doing viterbi training we get a forced alignment that for all the states | 
|---|
| 0:16:49 | we just accumulate all the frames |
|---|
| 0:16:52 | for that state and then we fit a gmm to it right and so that |
|---|
| 0:16:57 | but instead of doing that in the bootstrap formalism we accumulate |
|---|
| 0:17:03 | frames and we stick 'em in urns |
|---|
| 0:17:06 | that are that are labeled with that state | 
|---|
| 0:17:09 | so training is just like you know or even here training you know | 
|---|
| 0:17:14 | you just accumulate all the frames associated with the state | 
|---|
| 0:17:17 | but instead of forgetting about them you keep track of what they are you |
|---|
| 0:17:22 | put 'em in a bucket or an urn |
|---|
| 0:17:23 | and so when it comes time to generate pseudo data you have an |
|---|
| 0:17:27 | alignment or some state sequence that you've got however | 
|---|
| 0:17:32 | you have a state sequence and when you walk down it to generate the frames if |
|---|
| 0:17:36 | i was generating the frames in simulation i would simulate i'd do a random draw |
|---|
| 0:17:41 | from a distribution now instead i do a random draw with replacement from a bucket |
|---|
| 0:17:47 | or an urn of frames okay |
|---|
| 0:17:49 | so the frames again are independent because i'm doing random draws with replacement |
|---|
| 0:17:55 | and they deviate from the distributional assumptions to the same degree |
|---|
| 0:18:00 | that real data do 'cause they are real data |
|---|
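
A minimal sketch of frame level resampling from urns, assuming a forced alignment is available as a list of (state, frame) pairs. The alignment values and the names `build_urns` and `resample_frames` are hypothetical, and frames are scalars here rather than real feature vectors.

```python
import random
from collections import defaultdict

def build_urns(aligned_frames):
    """Collect every real frame into the urn of the state it was aligned to."""
    urns = defaultdict(list)
    for state, frame in aligned_frames:
        urns[state].append(frame)
    return urns

def resample_frames(state_seq, urns, rng):
    """For each state in the sequence, draw one real frame with replacement from its urn."""
    return [rng.choice(urns[s]) for s in state_seq]

# Hypothetical forced alignment of training data: (state label, frame value)
alignment = [("s1", 0.2), ("s1", 0.3), ("s2", 2.9),
             ("s2", 3.1), ("s2", 3.0), ("s3", -1.8)]
urns = build_urns(alignment)
rng = random.Random(0)
pseudo = resample_frames(["s1", "s1", "s2", "s3", "s3"], urns, rng)
```

Successive draws are independent of one another, so the pseudo data satisfy the independence assumption, while every frame is a real observation and so keeps whatever non-gaussian shape the real data have.
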
| 0:18:03 | so sorry i'm belaboring this but and then i can also |
|---|
| 0:18:08 | i can i can | 
|---|
| 0:18:10 | do you | 
|---|
| 0:18:11 | sequences so i can sample state trajectories phone trajectories and word trajectories |
|---|
| 0:18:18 | because | 
|---|
| 0:18:19 | so here | 
|---|
| 0:18:20 | this is the sequence of frames associated to a state |
|---|
| 0:18:25 | so i can stick that whole sequence into the urn |
|---|
| 0:18:29 | likewise i can take a whole phone sequence and put it in here and when i |
|---|
| 0:18:34 | draw from the urns |
|---|
| 0:18:35 | instead of getting individual frames i get segments | 
|---|
| 0:18:39 | so that the important thing is | 
|---|
| 0:18:41 | no matter what so if i have segments in the urns |
|---|
| 0:18:45 | when i draw the segments between segments the frames are independent but they inherit |
|---|
| 0:18:53 | dependence that exists in real data within that segment so we have |
|---|
| 0:18:59 | between segment independence |
|---|
| 0:19:02 | within segment dependence so this is the way that we can control the sort of |
|---|
| 0:19:07 | degree of statistical dependence that's in the data |
|---|
| 0:19:12 | this is quite powerful |
|---|
| 0:19:15 | so this sort of just | 
|---|
| 0:19:17 | sort of summarises this | 
|---|
| 0:19:19 | but the and you can see | 
|---|
| 0:19:21 | could even stickler hundred and your | 
|---|
| 0:19:24 | but so the point is that segment level resampling |
|---|
| 0:19:31 | relaxes frame level independence to segment level independence |
|---|
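
A minimal sketch of segment level resampling, assuming the same kind of (label, frame) alignment as for frames. The labels may be states, phones, or words, and that choice sets how much real dependence is kept. The names and values are hypothetical, and grouping by consecutive equal labels is a simplification that would merge two adjacent instances of the same label.

```python
import random
from collections import defaultdict
from itertools import groupby

def build_segment_urns(aligned_frames):
    """Store each run of consecutive frames sharing a label as one whole segment."""
    urns = defaultdict(list)
    for label, run in groupby(aligned_frames, key=lambda pair: pair[0]):
        urns[label].append([frame for _, frame in run])
    return urns

def resample_segments(label_seq, urns, rng):
    """Draw one whole segment per label; frames inside a segment stay in their real order."""
    frames = []
    for label in label_seq:
        frames.extend(rng.choice(urns[label]))
    return frames

# Hypothetical alignment: (label, frame value)
seg_alignment = [("a", 1), ("a", 2), ("b", 3), ("a", 4), ("b", 5), ("b", 6)]
seg_urns = build_segment_urns(seg_alignment)
rng = random.Random(0)
pseudo_segments = resample_segments(["a", "b"], seg_urns, rng)
```

Draws of different segments are independent of each other, while the frames within a drawn segment carry the dependence of the real data.
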
| 0:19:39 | so here's a sort of picture | 
|---|
| 0:19:43 | of the model's response to fabricated data |
|---|
| 0:19:49 | okay so | 
|---|
| 0:19:54 | i don't know how much i wanna spend on this but | 
|---|
| 0:19:59 | so here what we have is simulated |
|---|
| 0:20:04 | simulated data and the real error rate and as i gradually reintroduce dependence in |
|---|
| 0:20:10 | the data the word error rate starts to increase rather dramatically |
|---|
| 0:20:16 | so the point is |
|---|
| 0:20:18 | let's look at the simulated word error rate so you can think of this as |
|---|
| 0:20:21 | i think of this as you've got some sort of knob and you're re |
|---|
| 0:20:24 | introducing dependence in the data and as i reintroduce dependence in the data the error |
|---|
| 0:20:30 | rate |
|---|
| 0:20:32 | becomes quite high this is |
|---|
| 0:20:33 | this is icsi meeting data this is |
|---|
| 0:20:37 | with unimodal models | 
|---|
| 0:20:39 | the same sort of phenomenon happens when you use mixture models with you know like |
|---|
| 0:20:43 | say eight component mixtures |
|---|
| 0:20:46 | so that here the simulated error rate is around two percent little bit less than | 
|---|
| 0:20:51 | two percent | 
|---|
| 0:20:52 | when i do frame level resampling error rate increases just a little bit it's a | 
|---|
| 0:20:56 | very small increase it does increase but by very little |
|---|
| 0:21:01 | now when i reintroduce | 
|---|
| 0:21:04 | within state dependence |
|---|
| 0:21:06 | all of a sudden the error rate becomes around twelve percent so the error rate is |
|---|
| 0:21:10 | increased by a factor of six | 
|---|
| 0:21:12 | when i introduce | 
|---|
| 0:21:14 | within phone dependence |
|---|
| 0:21:17 | the error rate increases again by about a factor of two |
|---|
| 0:21:23 | and then when i go to words it increases by | 
|---|
| 0:21:27 | again almost by a factor of two typically the largest jump on |
|---|
| 0:21:31 | the corpora that we've worked with |
|---|
| 0:21:33 | is when you move from frame |
|---|
| 0:21:35 | to state it typically increases by about a factor of six |
|---|
| 0:21:39 | so you think about this you make an argument and the argument is that | 
|---|
| 0:21:44 | that the distributional assumption that we make with gmms |
|---|
| 0:21:53 | it's not such a big deal i mean it's important but it's not such a | 
|---|
| 0:21:57 | big deal | 
|---|
| 0:21:57 | the biggest single factor is this reintroduction of dependence so it's the dependence in the data |
|---|
| 0:22:04 | that the models are finding surprising i mean you know it's a |
|---|
| 0:22:08 | it's a you know everybody knew the independence assumptions were wrong i mean i'm not |
|---|
| 0:22:13 | saying that's surprising but i personally was |
|---|
| 0:22:18 | was really surprised and it took a long time to come around |
|---|
| 0:22:24 | to the fact that really the errors are due to the independence |
|---|
| 0:22:29 | assumption and we tend to work around this by other sorts of things | 
|---|
| 0:22:35 | so that this is a summary of the matched case result so we came the | 
|---|
| 0:22:40 | statistic when we have matched training and test | 
|---|
| 0:22:43 | it's the independence assumptions that are the big deal |
|---|
| 0:22:46 | it's the model's inability to account for dependence in the data that is |
|---|
| 0:22:52 | derailing things |
|---|
| 0:22:53 | the marginal distributions |
|---|
| 0:22:55 | not so much |
|---|
| 0:22:57 | so surprisingly also so in a different you know a later study |
|---|
| 0:23:01 | we sort of |
|---|
| 0:23:03 | adapted this formalism to ask the question so what is discriminative training doing you |
|---|
| 0:23:08 | know so you start with the maximum likelihood model you apply mmi |
|---|
| 0:23:13 | what's happening here so you apply this formalism and you see that in fact |
|---|
| 0:23:20 | mmi is actually compensating for these independence assumptions in a |
|---|
| 0:23:28 | way that i don't completely understand i have hypotheses about how this might work | 
|---|
| 0:23:34 | but | 
|---|
| 0:23:36 | so here you have a |
|---|
| 0:23:39 | really complicated procedure that's a little hokey |
|---|
| 0:23:42 | that took people twenty years many people in this room it took twenty years to |
|---|
| 0:23:48 | get to work right |
|---|
| 0:23:49 | and once it was shown to work on large vocabulary it took many |
|---|
| 0:23:54 | labs an additional five years to get it to work in their lab |
|---|
| 0:23:58 | it's you know now it's pretty routine to do this but you know it's a | 
|---|
| 0:24:02 | lot it was a struggle to get this to work and my point is that what it's |
|---|
| 0:24:07 | doing is compensating for the independence assumptions we know the independence assumptions are a problem | 
|---|
| 0:24:13 | i'm not saying that it's gonna be easy to find a model that relaxes |
|---|
| 0:24:17 | the independence assumptions | 
|---|
| 0:24:19 | but perhaps that twenty years of effort | 
|---|
| 0:24:21 | would be better spent | 
|---|
| 0:24:23 | attacking that problem | 
|---|
| 0:24:26 | so what about mismatched training |
|---|
| 0:24:30 | so the icsi meeting corpus | 
|---|
| 0:24:32 | on a we have near field models | 
|---|
| 0:24:37 | collected from on a solo | 
|---|
| 0:24:40 | you know head mounted microphones there was a some microphone array of some sort | 
|---|
| 0:24:47 | but the meeting room was quiet it was small it had a normal amount of |
|---|
| 0:24:52 | reverb the kind of reverb you might expect |
|---|
| 0:24:55 | in a room | 
|---|
| 0:24:56 | if you listen to these two channels you can tell that they're different | 
|---|
| 0:25:01 | but it's not like the far-field channel is radically different when you listen to | 
|---|
| 0:25:07 | it sounds a little different but it's perfectly intelligible |
|---|
| 0:25:13 | so we explore | 
|---|
| 0:25:15 | train and test with near field train and test with far-field and this mismatched condition where i |
|---|
| 0:25:20 | train on near field data and test on far-field |
|---|
| 0:25:24 | so | 
|---|
| 0:25:26 | i'll just say that it's hard or it's not |
|---|
| 0:25:30 | hard but you have to be careful and you have to think about what you're trying |
|---|
| 0:25:33 | to do when you when you when you run these types of experiments in particular | 
|---|
| 0:25:38 | there were a lot of issues that we went through |
|---|
| 0:25:43 | to get the near field channel and the far-field channel exactly parallel so that |
|---|
| 0:25:49 | we were actually measuring |
|---|
| 0:25:51 | what we wanted to measure it's like a somewhat |
|---|
| 0:25:55 | intricate lab setup |
|---|
| 0:25:57 | and so it's | 
|---|
| 0:26:01 | so the paper that we wrote for icassp i don't know how well |
|---|
| 0:26:05 | it describes it but it attempted to describe it and we have a on | 
|---|
| 0:26:10 | the icsi website there's a technical report that's reasonably good | 
|---|
| 0:26:14 | that describes a lot of this stuff so i'm not gonna belabor this but there was |
|---|
| 0:26:19 | a lot of effort that we had to go through |
|---|
| 0:26:23 | so here's the bottom line |
|---|
| 0:26:26 | so first let's look at the green and the red curve satanic again i'm almost | 
|---|
| 0:26:31 | so | 
|---|
| 0:26:33 | the green and the red curves are the matched near field and far-field and notice |
|---|
| 0:26:38 | that they track each other pretty well they're different |
|---|
| 0:26:40 | the far-field data is obviously harder |
|---|
| 0:26:43 | but interestingly look down here at the simulated and the frame |
|---|
| 0:26:48 | error rates |
|---|
| 0:26:49 | they're still really low you know |
|---|
| 0:26:52 | the matched far-field is higher it's worse but it's still really low and in |
|---|
| 0:26:58 | particular these error rates are around two percent right so i wanted so |
|---|
| 0:27:06 | let's think about that now notice before we think about that the mismatched simulation |
|---|
| 0:27:12 | rate | 
|---|
| 0:27:12 | it's you know so this is where we want to concentrate so this is what | 
|---|
| 0:27:17 | we want to think about that right | 
|---|
| 0:27:19 | so the simulated | 
|---|
| 0:27:21 | we don't need to worry about this other stuff it's the simulated thing that we're | 
|---|
| 0:27:25 | gonna concentrate | 
|---|
| 0:27:26 | so | 
|---|
| 0:27:29 | when you simulate data from near field models and you recognise it with near |
|---|
| 0:27:34 | field models the error rate is essentially nothing |
|---|
| 0:27:37 | so that means that problem is essentially separable |
|---|
| 0:27:43 | again when i take the far field models and i simulate data from the far-field | 
|---|
| 0:27:48 | models | 
|---|
| 0:27:49 | and i and i will | 
|---|
| 0:27:51 | and i recognise it with the far-field models | 
|---|
| 0:27:53 | i get essentially no errors |
|---|
| 0:27:55 | again that means that problem is essentially separable |
|---|
| 0:27:59 | so in these two individual spaces where we you know so the frames so in |
|---|
| 0:28:06 | the signal processing the mfccs that are generated in the matched cases they're essentially separable |
|---|
| 0:28:13 | problems but all of a sudden when i take |
|---|
| 0:28:18 | the near field models and look at the far field data it's |
|---|
| 0:28:23 | dramatically not separable |
|---|
| 0:28:26 | so that means that the transformation that takes place between the near field data and | 
|---|
| 0:28:32 | the far field data is not | 
|---|
| 0:28:35 | the front end is not invariant under this transformation and |
|---|
| 0:28:41 | that lack of invariance |
|---|
| 0:28:43 | is what's causing this huge increase in error |
|---|
| 0:28:47 | so again it's not surprising that the front end is not invariant to this |
|---|
| 0:28:52 | transformation there's a little bit of reverb there's a little bit of noise but what's |
|---|
| 0:28:57 | remarkable is |
|---|
| 0:28:58 | that it's |
|---|
| 0:29:00 | solely that problem that causes |
|---|
| 0:29:03 | this huge degradation in error rate |
|---|
| 0:29:06 | and that is actually fairly remarkable |
|---|
| 0:29:10 | so | 
|---|
| 0:29:13 | a | 
|---|
| 0:29:14 | so there are many more results | 
|---|
| 0:29:17 | involving mixture models so we reran all of these results with i think eight |
|---|
| 0:29:23 | component mixture models we see the same sort of behaviour |
|---|
| 0:29:27 | we've reproduced all the discriminative training results we asked |
|---|
| 0:29:32 | does discriminative training somehow magically sort of alleviate |
|---|
| 0:29:38 | the mismatched case and the answer is no |
|---|
| 0:29:41 | we did i think morgan alluded to this earlier a natural question is how |
|---|
| 0:29:46 | does mllr work in this thing we looked at that and with mllr you can you |
|---|
| 0:29:51 | can reduce |
|---|
| 0:29:52 | some of the errors as you would expect |
|---|
| 0:29:54 | but mllr is a simple linear transformation and whatever transformation between these two channels is | 
|---|
| 0:30:00 | happening | 
|---|
| 0:30:01 | it's some peculiar nonlinear transformation right so it's unreasonable | 
|---|
| 0:30:06 | to expect mllr to do |
|---|
| 0:30:08 | as well but it's a good this test harness is a really good test harness |
|---|
| 0:30:13 | for evaluating | 
|---|
| 0:30:14 | you know how resistant to these type how invariant to these transformations a front end is |
|---|
| 0:30:20 | and so we've explored that a little bit |
|---|
| 0:30:23 | and it's not so encouraging |
|---|
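
The remark that mllr is a linear transformation while the near field to far field change is probably nonlinear can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the features are scalars, the "far field" distortions are invented functions, and a least-squares affine fit stands in for the mllr transform. It only shows that an affine map can undo an affine distortion exactly but leaves a residual when the distortion is nonlinear.

```python
import math
import random


def fit_affine(xs, ys):
    """Least-squares fit of y ~ a * x + b for scalar data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


def residual_mse(xs, ys, a, b):
    """Mean squared error left after applying the affine map to xs."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)
# Hypothetical scalar "near field" feature values.
near = [rng.gauss(0.0, 1.0) for _ in range(2000)]

# Two invented channel distortions: one affine, one nonlinear.
linear_far = [0.5 * x + 2.0 for x in near]
nonlinear_far = [math.tanh(2.0 * x) + 0.3 * x * x for x in near]

# Fit an affine map from the distorted data back to the near field data.
a_lin, b_lin = fit_affine(linear_far, near)
a_non, b_non = fit_affine(nonlinear_far, near)

mse_linear = residual_mse(linear_far, near, a_lin, b_lin)
mse_nonlinear = residual_mse(nonlinear_far, near, a_non, b_non)
print("residual after affine fit, affine distortion:", mse_linear)
print("residual after affine fit, nonlinear distortion:", mse_nonlinear)
```

The affine case is recovered to within rounding error, while the nonlinear case keeps a sizeable residual however the affine parameters are chosen.
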
| 0:30:26 | alright well so with that i think i will end i've |
|---|
| 0:30:31 | sort of blathered on long enough i think i'll turn it over to jordan and |
|---|
| 0:30:37 | you will | 
|---|
| 0:30:37 | he will | 
|---|
| 0:30:38 | have a higher level you a role idea and the not and then we'll have | 
|---|
| 0:30:42 | questions that | 
|---|
| 0:30:54 | so what you what presented in | 
|---|
| 0:31:02 | i | 
|---|
| 0:31:18 | okay one two three | 
|---|
| 0:31:20 | alright so it turns out the there were two parts of this project | 
|---|
| 0:31:26 | steven told you about the technical stuff but we also thought that we'd like to |
|---|
| 0:31:30 | figure out | 
|---|
| 0:31:31 | you've been hearing a lot about how wonderful speech recognition is during this meeting and | 
|---|
| 0:31:35 | we thought we would actually like to understand what the community actually thought about what |
|---|
| 0:31:40 | speech recognition was like | 
|---|
| 0:31:42 | so we also ran a survey and i called a bunch of people many of you |
|---|
| 0:31:48 | what called me | 
|---|
| 0:31:50 | and this is called the rats right | 
|---|
| 0:31:59 | and what we wanted to do is just see what people thought about how speech recognition |
|---|
| 0:32:03 | really worked we were we were hoping that we would find some evidence to persuade | 
|---|
| 0:32:09 | the government maybe to put in some money and fund some speech recognition research which |
|---|
| 0:32:14 | we haven't seen in a long time | 
|---|
| 0:32:17 | but really we just wanted to find out what was going on |
|---|
| 0:32:20 | and so we put together a little survey team | 
|---|
| 0:32:24 | jen into jamieson worked with me she's a alice that's been in speech for a very |
|---|
| 0:32:29 | long time and we engaged frederick okay and he's a specialist at doing surveys |
|---|
| 0:32:36 | and we designed a snowball survey |
|---|
| 0:32:40 | a snowball survey is very interesting it |
|---|
| 0:32:44 | it says you start with a small group of people that you know and you |
|---|
| 0:32:47 | ask them some questions and then you ask them who else to ask |
|---|
| 0:32:51 | and you just follow your nose and what that means is although it's |
|---|
| 0:32:56 | not entirely unbiased it's as unbiased as you can do if you don't know the | 
|---|
| 0:33:00 | sampling population is going to be |
|---|
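
The snowball procedure described here (start from people you know, ask each one who else to ask, and follow the referrals) can be sketched as a breadth-first walk over a referral graph. The names and the `referrals` table below are hypothetical, made up for illustration; they are not survey data. The sketch also shows why the sample is "not entirely unbiased": anyone whom nobody names is never reached.

```python
from collections import deque


def snowball_sample(referrals, seeds, max_people):
    """Interview the seeds first, then everyone they name, and so on."""
    interviewed = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue and len(interviewed) < max_people:
        person = queue.popleft()
        interviewed.append(person)
        # Ask this person who else should be interviewed.
        for named in referrals.get(person, []):
            if named not in seen:
                seen.add(named)
                queue.append(named)
    return interviewed


# Hypothetical referral table: who each person names when asked.
referrals = {
    "ann": ["bob", "cy"],
    "bob": ["cy", "dee"],
    "cy": ["ann"],
    "dee": ["eve"],
    "eve": [],
    "zed": ["ann"],  # zed names ann, but nobody names zed
}

sample = snowball_sample(referrals, ["ann"], max_people=10)
print(sample)
```

In this toy graph the walk reaches everyone except `zed`, who would only be sampled if included among the seeds.
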
| 0:33:06 | so we wanted to know what was going on what do people think are the |
|---|
| 0:33:10 | failures and what remedies have people tried and how did they work |
|---|
| 0:33:17 | so we did this snowball sampling |
|---|
| 0:33:19 | here's the questionnaire i don't wanna spend a lot of time on this but just |
|---|
| 0:33:23 | take a | 
|---|
| 0:33:25 | the interesting questions are | 
|---|
| 0:33:28 | the last one on the slide where has the current technology failed |
|---|
| 0:33:33 | and the first one on the slide what do you think broke |
|---|
| 0:33:36 | and then questions about sort of what you do about what was going on and | 
|---|
| 0:33:41 | then if there's other stuff | 
|---|
| 0:33:45 | the survey participants tended to be old |
|---|
| 0:33:49 | i think | 
|---|
| 0:33:50 | that's sort of how our snowball worked not terribly old but there's not a lot |
|---|
| 0:33:54 | of young people in this so ages were thirty five to seventy |
|---|
| 0:33:58 | we spoke to about eighty five people |
|---|
| 0:34:03 | and they have an interesting mix of jobs most of them were in research some were |
|---|
| 0:34:09 | in development some were both |
|---|
| 0:34:11 | there were a small number of management people and then people self reported |
|---|
| 0:34:17 | their jobs as something more detailed |
|---|
| 0:34:22 | but mostly these are r and d people or managers doing speech research or language of one |
|---|
| 0:34:30 | sort or another |
|---|
| 0:34:35 | so here's what you told us | 
|---|
| 0:34:39 | there's a | 
|---|
| 0:34:42 | natural language is the real problem and acoustic modeling is a real problem | 
|---|
| 0:34:47 | and everything else that we do was broken more or less | 
|---|
| 0:34:51 | so i think the community sort of had this feeling not the people trying to |
|---|
| 0:34:55 | sell speech recognition to the management but the people trying to make it work have | 
|---|
| 0:35:00 | a feeling that all is not really well in the technology | 
|---|
| 0:35:05 | so lots of people and when you point fingers they're pointing fingers at the language |
|---|
| 0:35:11 | itself and to acoustic modeling | 
|---|
| 0:35:14 | and there's the third guy which this says not robust let's say this what steven | 
|---|
| 0:35:20 | and stuff | 
|---|
| 0:35:21 | we were able | 
|---|
| 0:35:22 | so there's something going on with this technology that makes it not work very well | 
|---|
| 0:35:27 | and when we asked people what they tried |
|---|
| 0:35:30 | to fix things the answer is everything |
|---|
| 0:35:34 | people have mucked around with the training some people have tried all kinds of different |
|---|
| 0:35:38 | pieces of their system |
|---|
| 0:35:40 | a | 
|---|
| 0:35:42 | i | 
|---|
| 0:35:43 | i know | 
|---|
| 0:35:50 | some piece trying to calm | 
|---|
| 0:35:54 | alright anyway | 
|---|
| 0:35:58 | one of the interesting things that people tried to do |
|---|
| 0:36:02 | many of us have tried to fix pronunciations either in dictionaries or in rules of |
|---|
| 0:36:07 | pronunciation and to a man everyone has found that this is a waste |
|---|
| 0:36:12 | it's pretty interesting that so that's not a way to fix the systems that we |
|---|
| 0:36:16 | currently build so we've tried all kinds of stuff |
|---|
| 0:36:21 | and so i think | 
|---|
| 0:36:22 | our takeaway from the survey is that people |
|---|
| 0:36:27 | actually don't believe that the technology is very solid and we try a lot of things |
|---|
| 0:36:31 | to fix it and then we looked a little bit at the literature there's a |
|---|
| 0:36:35 | literature survey in the icsi report which you can go read but |
|---|
| 0:36:40 | we found the literature looks sort of like this this is from a review by |
|---|
| 0:36:43 | fruity | 
|---|
| 0:36:45 | and it's a | 
|---|
| 0:36:48 | lvcsr is far from being solved background noise channel distortion foreign accents |
|---|
| 0:36:52 | casual disfluent speech unexpected topic changes cause automatic systems to make egregious |
|---|
| 0:36:57 | errors and that's what everybody said anybody who's looked at the field says well this |
|---|
| 0:37:02 | technology is okay sometimes but it fails a lot |
|---|
| 0:37:08 | so what we concluded was |
|---|
| 0:37:10 | the technology is old i'll point out that the models that most of us use |
|---|
| 0:37:14 | the hidden markov models that most of us use i know as the thing that |
|---|
| 0:37:18 | was written down apply my for john a canadian sixty nine | 
|---|
| 0:37:22 | so maybe that's i think the kernel of one of our issues here |
|---|
| 0:37:29 | so when these systems fail they degrade not gracefully like your for your role but | 
|---|
| 0:37:35 | catastrophically and quickly |
|---|
| 0:37:40 | speech recognition performance is substantially behind how humans do in almost every circumstance | 
|---|
| 0:37:48 | and | 
|---|
| 0:37:49 | they're not robust | 
|---|
| 0:37:51 | so that's sort of my overall overview of what the survey was |
|---|
| 0:37:57 | and it's available on the icsi website in the in the program but i wanted | 
|---|
| 0:38:03 | to add a couple of personal comments about my analysis of what's happening |
|---|
| 0:38:08 | so these are not i'm not representing the government are actually i want to talk | 
|---|
| 0:38:13 | to you about my own personal analysis |
|---|
| 0:38:17 | so here's i there's three points first point | 
|---|
| 0:38:21 | if you have a model and you spend a lot of time hill |
|---|
| 0:38:24 | climbing to the optimum performance | 
|---|
| 0:38:26 | and it doesn't perform optimally at that spot | 
|---|
| 0:38:29 | you got the wrong model | 
|---|
| 0:38:32 | hidden markov models were proved to converge by baum petrie soules and weiss at ida |
|---|
| 0:38:37 | in nineteen sixty nine |
|---|
| 0:38:39 | that proof has two parts |
|---|
| 0:38:41 | one is it says you can always make a better model | 
|---|
| 0:38:45 | two it says you get the optimal parameters if the data came from the model | 
|---|
| 0:38:51 | that second part is | 
|---|
| 0:38:54 | absolutely not true in our speech recognition systems and we're climbing on data that doesn't | 
|---|
| 0:39:00 | match the model and we're not gonna find the answer that way | 
|---|
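
The two parts of the convergence argument can be made concrete with a small sketch. As an assumption for illustration it uses a two-component Gaussian mixture trained with em rather than a full hidden markov model, and the data are deliberately drawn from a uniform distribution, so they do not come from the model. The first part still holds: each iteration never lowers the likelihood. Nothing in that guarantee says the parameters it climbs to describe the data well.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_log_likelihoods(data, iterations=15):
    """Run EM for a two-component 1-D Gaussian mixture.

    Returns the log-likelihood of the data under the parameters held at the
    start of each iteration.
    """
    weights = [0.5, 0.5]
    means = [min(data), max(data)]
    variances = [1.0, 1.0]
    history = []
    for _ in range(iterations):
        # E-step: posterior probability of each component for each point.
        posteriors = []
        log_likelihood = 0.0
        for x in data:
            joint = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = joint[0] + joint[1]
            log_likelihood += math.log(total)
            posteriors.append((joint[0] / total, joint[1] / total))
        history.append(log_likelihood)
        # M-step: re-estimate weights, means and variances from the posteriors.
        for k in range(2):
            mass = sum(p[k] for p in posteriors)
            weights[k] = mass / len(data)
            means[k] = sum(p[k] * x for p, x in zip(posteriors, data)) / mass
            variances[k] = sum(
                p[k] * (x - means[k]) ** 2 for p, x in zip(posteriors, data)
            ) / mass
    return history


rng = random.Random(0)
# Mismatched data on purpose: uniform, not a mixture of two Gaussians.
data = [rng.uniform(0.0, 10.0) for _ in range(400)]
history = em_log_likelihoods(data)
print("first and last log-likelihood:", history[0], history[-1])
```

The hill climbing works as advertised on mismatched data, which is the point being made: a rising likelihood is evidence about the optimiser, not about whether the model is right.
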
| 0:39:04 | so we spent a lot of time | 
|---|
| 0:39:06 | trying to account trying to adapt for the problem that we've got the wrong model |
|---|
| 0:39:13 | this is a personal bond | 
|---|
| 0:39:15 | if you use sixty four gaussians to fit some distribution you have no idea what |
|---|
| 0:39:19 | the distribution is |
|---|
| 0:39:21 | the original | 
|---|
| 0:39:23 | multi gaussian distributions were done with a single mean and i understand but that's not |
|---|
| 0:39:29 | weird | 
|---|
| 0:39:30 | and so my corollary i think speaks for itself | 
|---|
| 0:39:37 | and finally if the system you build fails for fifty percent of the population entirely |
|---|
| 0:39:43 | and then for the people it works for as soon as they walk into a reverberant |
|---|
| 0:39:46 | environment or noisy place it fails | 
|---|
| 0:39:48 | it's broken | 
|---|
| 0:39:51 | and i believe speech recognition is terribly broken |
|---|
| 0:39:55 | so i think what we really wanted to do i'm i want to draw an |
|---|
| 0:39:59 | analogy so i want to draw an analogy between |
|---|
| 0:40:03 | transcription and transportation | 
|---|
| 0:40:06 | and for transportation man this is what i want something that's slick and slowly and |
|---|
| 0:40:12 | easy to use and doesn't break |
|---|
| 0:40:15 | and what we built is this |
|---|
| 0:40:20 | it runs on two wheels it will get you there eventually you spend almost all your |
|---|
| 0:40:24 | time dealing with problems that have nothing to do with the transportation part |
|---|
| 0:40:28 | and so i believe that that's what we've done with speech recognition | 
|---|
| 0:40:32 | and it's time for new models and | 
|---|
| 0:40:35 | i urge you to think about models |
|---|
| 0:40:38 | and not so much about the data | 
|---|
| 0:40:54 | and tape | 
|---|
| 0:40:56 | generate okay | 
|---|
| 0:40:58 | i assume that is going to generate a lot of discussion and a lot of questions |
|---|
| 0:41:02 | if it doesn't then something is wrong with us | 
|---|
| 0:41:06 | this sds community would be done broken | 
|---|
| 0:41:10 | okay who's the first over there |
|---|
| 0:41:20 | a question about the resampling | 
|---|
| 0:41:24 | as i think about this you have a sort of sequence of random variables and |
|---|
| 0:41:27 | you're turning a knob on the independence between them |
|---|
| 0:41:30 | and |
|---|
| 0:41:31 | one of the things that turning that knob does is it |
|---|
| 0:41:35 | as things become more dependent there's | 
|---|
| 0:41:37 | less information | 
|---|
| 0:41:40 | what i'm wondering is how much of the word error rate degradation you see | 
|---|
| 0:41:44 | might be associated simply with the fact that there's just less information | 
|---|
| 0:41:48 | in streams that are more dependent |
|---|
| 0:41:54 | this working | 
|---|
| 0:41:56 | so i guess i don't understand the question |
|---|
| 0:41:59 | a that i mean i | 
|---|
| 0:42:02 | so i you're right so here is an answer and you can tell me if i'm |
|---|
| 0:42:07 | close to understanding the model assumes that each frame has an independent amount of information |
|---|
| 0:42:15 | but we know that the frames do not have independent amounts of information the |
|---|
| 0:42:20 | amount of information | 
|---|
| 0:42:22 | going from frame to frame varies enormously | 
|---|
| 0:42:25 | but the model treats every single one of those frames as independent and that's |
|---|
| 0:42:31 | an egregious violation of this |
|---|
| 0:42:34 | so that | 
|---|
| 0:42:37 | i guess i was thinking about was | 
|---|
| 0:42:39 | if i ask you to say a word ten times or i ask ten people to |
|---|
| 0:42:42 | say the word once |
|---|
| 0:42:43 | and i'm trying to figure out what's the word |
|---|
| 0:42:45 | having the ten people say it might actually provide more information in the data |
|---|
| 0:42:49 | itself | 
|---|
| 0:42:51 | and i'm just wondering if that might at all |
|---|
| 0:42:53 | contribute to why there's more | 
|---|
| 0:42:57 | information as you sample from | 
|---|
| 0:42:59 | from more and more disparate parts of the training database |
|---|
| 0:43:07 | well i think i think what you're actually saying is the you your works | 
|---|
| 0:43:15 | explaining | 
|---|
| 0:43:17 | why | 
|---|
| 0:43:18 | so the model | 
|---|
| 0:43:20 | i think | 
|---|
| 0:43:21 | many people this is a question they have so the when you when you have | 
|---|
| 0:43:26 | all the frames and they're independent when you do frame resampling the frames come from |
|---|
| 0:43:31 | all sorts of different speakers and when you when you line them up you know | 
|---|
| 0:43:35 | like the one i played they come from all sorts of different speakers but then |
|---|
| 0:43:40 | as soon as i start | 
|---|
| 0:43:43 | increasing the segment size then each one of those segments is gonna come from one | 
|---|
| 0:43:49 | speaker right is this sort of along the lines of what you're thinking well |
|---|
| 0:43:53 | the notion of speaker is part of the dependence in the data right the fact |
|---|
| 0:43:59 | that each one of these frames came |
|---|
| 0:44:01 | from a single speaker that's dependence |
|---|
| 0:44:05 | and so that's interframe dependence |
|---|
| 0:44:07 | that the model knows nothing about |
|---|
| 0:44:09 | and so whether that's causing a problem or not that's there in your |
|---|
| 0:44:14 | data | 
|---|
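
The contrast between frame resampling and resampling with a larger segment size can be sketched as follows. This is a toy illustration under stated assumptions, not the project's actual procedure: `segments`, the state labels, the speaker labels and the frame values are all invented. At the frame level every frame is drawn independently, so neighbouring frames usually come from different speakers; at the segment level a whole run is copied from one training segment, so speaker identity (one form of dependence) is preserved inside the run.

```python
import random


def resample_utterance(states, segments, rng, level):
    """Build a pseudo utterance for a state sequence from stored segments.

    segments[state] is a list of segments; each segment is a list of
    (speaker, value) frames taken in order from one training utterance.
    """
    frames = []
    for state in states:
        # A real segment is chosen first; it fixes the length of the run.
        template = rng.choice(segments[state])
        if level == "segment":
            # Copy the whole run from one speaker, keeping frame order.
            frames.extend(template)
        else:
            # Draw every frame independently from all frames of this state.
            pool = [frame for segment in segments[state] for frame in segment]
            frames.extend(rng.choice(pool) for _ in template)
    return frames


def speaker_switches(frames):
    """Count adjacent frame pairs that come from different speakers."""
    return sum(1 for a, b in zip(frames, frames[1:]) if a[0] != b[0])


# Hypothetical toy data: 3 states, 4 speakers, 5 frames per segment.
segments = {
    state: [[(speaker, float(10 * state + t)) for t in range(5)] for speaker in "abcd"]
    for state in range(3)
}
states = [0, 1, 2]
rng = random.Random(0)

frame_level = resample_utterance(states, segments, rng, "frame")
segment_level = resample_utterance(states, segments, rng, "segment")
print("speaker switches, frame level:", speaker_switches(frame_level))
print("speaker switches, segment level:", speaker_switches(segment_level))
```

With segment-level resampling the speaker can only change at the boundary between states, so the count is bounded by the number of state boundaries; frame-level resampling has no such bound.
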
| 0:44:22 | of course all of us | 
|---|
| 0:44:23 | you know as you said all of us have been aware of this for |
|---|
| 0:44:26 | a long time and i think there has been a lot of effort at trying | 
|---|
| 0:44:29 | to undo it | 
|---|
| 0:44:31 | it's kind of when we say the model has there's an independence assumption that's |
|---|
| 0:44:37 | sort of half true |
|---|
| 0:44:39 | because the features that we use | 
|---|
| 0:44:42 | go over several frames so of course they're not actually independent you know when you | 
|---|
| 0:44:47 | synthesise it's not clear what you really synthesise "'cause" you have to synthesise something that | 
|---|
| 0:44:51 | has | 
|---|
| 0:44:52 | may have an independent value but it has to have a derivative that matches the | 
|---|
| 0:44:56 | previous thing and so on |
|---|
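
The point that the features "go over several frames" can be illustrated with delta coefficients. The sketch below uses the common regression formula for deltas, d_t = sum_n n (c_{t+n} - c_{t-n}) / (2 sum_n n^2), with a window of two frames on each side; the window size and the edge handling (repeating the first and last frame) are assumptions, since the talk does not say how the front end was configured. Changing a single static frame changes the deltas of its neighbours, so the full feature vectors of adjacent frames cannot be independent.

```python
def delta(features, window=2):
    """Regression-based delta coefficients for a list of scalar features."""
    denom = 2 * sum(n * n for n in range(1, window + 1))
    last = len(features) - 1
    out = []
    for t in range(len(features)):
        acc = 0.0
        for n in range(1, window + 1):
            ahead = features[min(t + n, last)]   # repeat the last frame at the edge
            behind = features[max(t - n, 0)]     # repeat the first frame at the edge
            acc += n * (ahead - behind)
        out.append(acc / denom)
    return out


# A ramp of static features, and a copy with one frame perturbed.
static = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
perturbed = list(static)
perturbed[3] += 10.0

base = delta(static)
moved = delta(perturbed)

# Indices whose delta changed even though only frame 3 was touched.
changed = [t for t in range(len(static)) if base[t] != moved[t]]
print(changed)
```

Here the deltas at frames 1, 2, 4 and 5 all move when frame 3 is perturbed, which is the overlap a frame-independent model ignores and a synthesiser would have to respect.
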
| 0:44:58 | but we've all tried things like segmental models | 
|---|
| 0:45:02 | which don't have that independence assumption | 
|---|
| 0:45:04 | right we take a segment | 
|---|
| 0:45:07 | a whole phoneme so you're | 
|---|
| 0:45:09 | skipping the state independence assumption and the frame independence assumption and just going straight |
|---|
| 0:45:15 | to the context dependent phoneme |
|---|
| 0:45:18 | and now you're picking a sample from the one distribution for that context dependent phoneme | 
|---|
| 0:45:24 | and that always works worse | 
|---|
| 0:45:28 | maybe you can do something with that or combine it with the hidden markov model |
|---|
| 0:45:32 | and gain half a point but by itself it always works a lot |
|---|
| 0:45:36 | worse | 
|---|
| 0:45:38 | and unless you unless you cripple the hidden markov model by saying i'm only gonna |
|---|
| 0:45:43 | use context independent models then this one might work better but | 
|---|
| 0:45:48 | so the question is | 
|---|
| 0:45:49 | it's not that we haven't tried | 
|---|
| 0:45:51 | people have tried to make models that avoid those things and almost all of those |
|---|
| 0:45:56 | things got worse the flip side of that is you said mpe or mmi |
|---|
| 0:46:00 | and all these things are an attempt |
|---|
| 0:46:01 | to |
|---|
| 0:46:02 | avoid |
|---|
| 0:46:04 | that assumption but they don't reduce the error by a factor of four |
|---|
| 0:46:08 | they reduce the error by | 
|---|
| 0:46:10 | ten percent fifteen percent relative | 
|---|
| 0:46:13 | basically a small it's similar to any of the other |
|---|
| 0:46:18 | tricks we do so do you have any comment on those two observations |
|---|
| 0:46:21 | well i mean | 
|---|
| 0:46:23 | i i'm not sure what | 
|---|
| 0:46:25 | so a natural question which i think is the first part of what |
|---|
| 0:46:29 | you're saying is so why have many people tried and failed to beat hmms with |
|---|
| 0:46:36 | models that take into account |
|---|
| 0:46:40 | the dependence structure in the data so why hasn't that worked |
|---|
| 0:46:45 | well | 
|---|
| 0:46:47 | i would say that | 
|---|
| 0:46:49 | that | 
|---|
| 0:46:50 | i do not believe that anyone has any quantitative notion of why these things here | 
|---|
| 0:46:57 | in the data | 
|---|
| 0:46:59 | i'm not saying that we should go back to these methods maybe we should but | 
|---|
| 0:47:04 | well i will give you an example of something you know twenty years ago people | 
|---|
| 0:47:09 | gave up on neural networks |
|---|
| 0:47:11 | and all of a sudden you know neural networks are |
|---|
| 0:47:16 | are |
|---|
| 0:47:16 | are the new | 
|---|
| 0:47:18 | the new | 
|---|
| 0:47:20 | come | 
|---|
| 0:47:21 | i don't know what the right biblical phrase is but hallelujah so and what it |
|---|
| 0:47:28 | takes is somebody who believes in something and tries hard to do it and i |
|---|
| 0:47:35 | think that here is the problem | 
|---|
| 0:47:37 | we should be i don't know what the solution is i honestly don't know what | 
|---|
| 0:47:41 | the solution is but i will say also that the mmi thing no and i | 
|---|
| 0:47:46 | don't believe anyone would be the mmi it was not designed to overcome independence | 
|---|
| 0:47:53 | you know we knew that the maximum likelihood solution to this problem was not the |
|---|
| 0:47:58 | right solution so we found an alternative model selection procedure that puts us in a |
|---|
| 0:48:04 | different place | 
|---|
| 0:48:05 | again if the model were correct we wouldn't have to do that | 
|---|
| 0:48:16 | coming back to the results the simulation results you presented |
|---|
| 0:48:20 | i think these are highly suggestive because | 
|---|
| 0:48:24 | by changing the data to fulfil your assumptions | 
|---|
| 0:48:29 | the error rates you get are not the error rates we |
|---|
| 0:48:32 | expect from the real data | 
|---|
| 0:48:35 | because you fit | 
|---|
| 0:48:36 | the problem to your assumptions but we have to go the other way around so | 
|---|
| 0:48:40 | what error rates we really can expect if we | 
|---|
| 0:48:45 | improve our modeling is still that's an open question |
|---|
| 0:48:48 | exactly that's absolutely right in no way am i claiming |
|---|
| 0:48:54 | that if we could model dependence in the data that we would be seeing these |
|---|
| 0:48:58 | error rates the frame resampling error rates that that's absolutely correct | 
|---|
| 0:49:04 | i mean so | 
|---|
| 0:49:05 | presumably we do we repeat do better the other point though is i think that | 
|---|
| 0:49:12 | a lot of the | 
|---|
| 0:49:17 | this sort of brittleness that we experience |
|---|
| 0:49:20 | in our models this is a conjecture is due to this very |
|---|
| 0:49:25 | sort of poor fit to the temporal structure |
|---|
| 0:49:31 | and temper you know temporal we have a we have what one way of thinking | 
|---|
| 0:49:35 | of what these results say you know the frame resampling results is that if you |
|---|
| 0:49:40 | forget about the temporal structure in the data the models work really well but as soon |
|---|
| 0:49:46 | as you introduce real temporal structure in the data the models start failing |
|---|
| 0:49:51 | and so in real speech i think temporal structure is important |
|---|
| 0:49:57 | i think | 
|---|
| 0:50:04 | here is the my | 
|---|
| 0:50:10 | by a shock i see how a | 
|---|
| 0:50:15 | speechless | 
|---|
| 0:50:16 | or thai interested party | 
|---|
| 0:50:19 | yes the line | 
|---|
| 0:50:25 | i don't think | 
|---|
| 0:50:27 | a | 
|---|
| 0:50:28 | i when you please independence assumptions is not | 
|---|
| 0:50:34 | in the sticks more mixing to not extract information you can speech doesn't necessarily track | 
|---|
| 0:50:41 | you know to work | 
|---|
| 0:50:44 | i mean i can build the proposed system that satisfy | 
|---|
| 0:50:49 | independence assumption | 
|---|
| 0:50:51 | so i don't think | 
|---|
| 0:50:52 | you know | 
|---|
| 0:50:53 | really follows that | 
|---|
| 0:50:55 | for my models really see | 
|---|
| 0:50:58 | the models and so | 
|---|
| 0:51:01 | i think you don't want thinking about extracting | 
|---|
| 0:51:06 | getting the right information the problem this over account the information | 
|---|
| 0:51:10 | it's a question of misrepresenting information |
|---|
| 0:51:15 | and so if you misrepresented what are more or less than in the process | 
|---|
| 0:51:19 | i was the misrepresentation | 
|---|
| 0:51:21 | so that the false alarms | 
|---|
| 0:51:25 | three | 
|---|
| 0:51:28 | something like | 
|---|
| 0:51:29 | some work | 
|---|
| 0:51:31 | have you might have | 
|---|
| 0:51:34 | but works if that's not right | 
|---|
| 0:51:38 | work land farm | 
|---|
| 0:51:41 | i rate is | 
|---|
| 0:51:44 | just done the same tendency | 
|---|
| 0:51:47 | these days | 
|---|
| 0:52:26 | but | 
|---|
| 0:52:27 | but | 
|---|
| 0:52:34 | i like | 
|---|
| 0:52:45 | when you know all | 
|---|
| 0:52:55 | one thing that works really poorly |
|---|
| 0:52:58 | is if you have a mismatched representation | 
|---|
| 0:53:01 | so think about a model that is representing text okay |
|---|
| 0:53:07 | you can represent it as raster scan text |
|---|
| 0:53:09 | or you could represent it as fonts |
|---|
| 0:53:13 | and if you change the size of the image | 
|---|
| 0:53:16 | the to the two things a very different of that the five | 
|---|
| 0:53:20 | five test of an actual easy representation change and the rest just and it's just | 
|---|
| 0:53:25 | the whole thing | 
|---|
| 0:53:27 | so you have to ask yourself is the problem that we're seeing |
|---|
| 0:53:31 | the fact that we have a representation for the problem that doesn't match | 
|---|
| 0:53:37 | that i think is the realisation | 
|---|
| 0:53:40 | mm this tell us something a common | 
|---|
| 0:53:43 | as you go further from state to phones and phones to |
|---|
| 0:53:48 | segments |
|---|
| 0:53:49 | the data is becoming more and more speaker-dependent so maybe the problem is your |
|---|
| 0:53:54 | models and not there don't i mean are |
|---|
| 0:53:57 | morse i mean if you made your models more speaker-dependent would we have seen the |
|---|
| 0:54:02 | such difference | 
|---|
| 0:54:03 | but it has nothing to do with a frame dependent sampling but well like what | 
|---|
| 0:54:08 | i was trying to say before is that is a form of dependence | 
|---|
| 0:54:13 | the that | 
|---|
| 0:54:14 | that | 
|---|
| 0:54:15 | and the model knows nothing about | 
|---|
| 0:54:17 | this form of dependence |
|---|
| 0:54:19 | you know that there are many forms of dependence in data knowing what independence |
|---|
| 0:54:24 | is a hard thing for a human to understand right |
|---|
| 0:54:28 | but that form of dependence is precisely there and it may be causing the problem | 
|---|
| 0:54:36 | so there were there were a number of speakers so there are relatively few speakers | 
|---|
| 0:54:42 | in this corpus and so we have to sort of cap them so that there |
|---|
| 0:54:46 | wasn't a single dominant speaker | 
|---|
| 0:54:50 | which i mean i think that would be the last | 
|---|
| 0:54:56 | so let me you sort of continue with work was asking again | 
|---|
| 0:55:02 | we know the model is wrong | 
|---|
| 0:55:05 | models are always wrong | 
|---|
| 0:55:08 | and so | 
|---|
| 0:55:11 | the way your | 
|---|
| 0:55:13 | you can argue that the model is wrong mathematically or you can argue that it's | 
|---|
| 0:55:17 | wrong because it doesn't match human performance what we think |
|---|
| 0:55:22 | of as human performance i think we may overestimate human performance a little bit but | 
|---|
| 0:55:26 | it clearly doesn't match it | 
|---|
| 0:55:29 | but in fact you know if you look at all the research that all of | 
|---|
| 0:55:32 | us do | 
|---|
| 0:55:34 | we at least feel like we're attacking those problems so we say we're gonna use |
|---|
| 0:55:39 | font models to use your analogy we allow our models to we scale |
|---|
| 0:55:45 | them like fonts right we put in we say we're going to estimate a scale |
|---|
| 0:55:49 | factor and that scale factor is not a simple |
|---|
| 0:55:52 | it can be a simple one or it can be a matrix you know much |
|---|
| 0:55:54 | more complicated than what you do with the font and we constrain it to be | 
|---|
| 0:55:58 | the same we say the speakers the same for the whole sentence | 
|---|
| 0:56:01 | we do speaker adaptive training so we try to remove the differences | 
|---|
| 0:56:07 | we try to normalize all the speakers to the same place and then insert the |
|---|
| 0:56:11 | properties of the new speaker again right |
|---|
| 0:56:14 | that's sort of like the analogy of a font |
|---|
| 0:56:16 | we try to do all of these things we certainly try to model channels |
|---|
| 0:56:23 | we do all of these with linear models and nonlinear models |
|---|
| 0:56:28 | and | 
|---|
| 0:56:29 | we get small improvements | 
|---|
| 0:56:31 | so my question let me turn the question around | 
|---|
| 0:56:34 | the model is wrong | 
|---|
| 0:56:36 | what's the right model | 
|---|
| 0:56:38 | not what is the do but what is the right model | 
|---|
| 0:56:42 | so | 
|---|
| 0:56:43 | i think we all don't know the answer to that question but let me tell | 
|---|
| 0:56:47 | you something other phenomena that i would like to see as making | 
|---|
| 0:56:52 | unless you've been following particle physics but | 
|---|
| 0:56:56 | in particle physics | 
|---|
| 0:56:58 | when you measure particle interactions the statistics of the interactions are governed by |
|---|
| 0:57:03 | basically by feynman diagrams | 
|---|
| 0:57:05 | and so to compute a for particle interaction like using the super collider to compute | 
|---|
| 0:57:11 | a cross sectional area for one of the interactions takes a super computer about |
|---|
| 0:57:15 | a week to look at all the feynman diagrams |
|---|
| 0:57:19 | a couple of the physics guys just discovered a geometric object |
|---|
| 0:57:24 | enforce days and in the geometric object it turns out that each |
|---|
| 0:57:28 | little area has |
|---|
| 0:57:32 | an area that is exactly the solution | 
|---|
| 0:57:34 | to that problem of computing the cross sectional area |
|---|
| 0:57:39 | and you can now do the computations |
|---|
| 0:57:43 | in about five minutes with a pencil and paper |
|---|
| 0:57:47 | so | 
|---|
| 0:57:48 | there's a place where the difference in the model has a huge effect | 
|---|
| 0:57:54 | on making things work so i don't think i don't believe the model lies in |
|---|
| 0:57:59 | the realm of the kinds of things that we've always been doing |
|---|
| 0:58:02 | i think we need to have some radical re interpretation of the way we look | 
|---|
| 0:58:06 | at the data that we look at the word | 
|---|
| 0:58:09 | maybe which on the lines in one place | 
|---|
| 0:58:11 | maybe | 
|---|
| 0:58:14 | i took a degree in linguistics as i thought speech wasn't an easy problem from |
|---|
| 0:58:18 | an engineering point of view and i learned to distrust everything a linguist said |
|---|
| 0:58:24 | maybe which most of them to but | 
|---|
| 0:58:26 | maybe there's something different that we should be doing |
|---|
| 0:58:28 | so i would love for us again to look outside this place that we've been exploring |
|---|