0:00:15 | So I'm going to talk about Project OUCH, but first, thank you for having me here.

0:00:22 | I enjoyed my time in the Czech Republic, and I learned a few Czech words while I was here, so thank you.

0:00:31 | So, Project OUCH stands for Outing Unfortunate CHaracteristics of HMMs.

0:00:39 | There were, truthfully, three phases. The initial work that we did was on a project that Larry Gillick and I started when we were at Nuance, and truthfully it also had its antecedents in work that we were doing even earlier.

0:01:01 | That was a very small funded pilot study, and IARPA funded a larger, but still small, effort. The students who worked with me were Dan Gillick and another who was actually a postdoc and is currently at Berkeley, and Larry Gillick, Jordan, Morgan, and myself were the senior people.

0:01:29 | So, Project OUCH: what we're trying to do — our goal — is to develop a quantitative understanding of how the current formalism fails. And, you know, surprisingly there has been very little work in this direction in the forty-year history of speech recognition.

0:01:54 | There has been some, but it was isolated and sporadic.

0:01:59 | And, you know, progress in speech recognition has been very erratic, in my view largely because we have been proceeding via trial and error. So the claim is that gaining a deeper understanding of how our algorithms succeed and fail — rather than just measuring word error, where if we get an improvement in word error we keep it, and if it doesn't improve we discard it — should enable more efficient and steady progress. And I claim that this

0:02:34 | should be embedded in our standard research methodology — not necessarily the techniques that I'm going to talk about, okay, but just this notion that when you have a model that doesn't fit the data, you should try to gain some understanding of how the model differs from the data and how that data-model residual impacts the classification errors.

0:03:01 | So the main questions that Project OUCH was interested in — the main way you could think about this is: what do the models find surprising about the data? What is it about speech data that the models find surprising, and how does that surprise translate into errors?

0:03:22 | So I'm going to talk today about quantifying the two major HMM assumptions and their impact on error rates. Of course, the two major assumptions are the very strong independence assumptions the model makes, and an equally strong assumption about the form of the marginal distribution of the frames. Typically we assume they are Gaussian mixture models — of course, nowadays people are using multi-layer perceptrons — but either way you make some sort of formal assumption about what that distribution looks like.

0:04:00 | Also: which of these incorrect assumptions is discriminative training — MPE or MMI — compensating for when you start from the maximum likelihood model? And do these results change when you move from matched training and test, our usual formalism, to the mismatched case?

0:04:26 | The early work that we did was on the Switchboard and Wall Street Journal corpora; later on we moved to the ICSI meeting corpus. You can rephrase this question about how the results change in the mismatched case in a different form: why is ASR so brittle?

0:04:48 | Any time you bring up a new recognizer on a problem, whether in the same language or across languages, you always have to start, it seems, almost from scratch: you always have to collect a bunch of data that's closely related to the task that you have, and it hardly ever works the first time you try it. It's the reason that most of us in this room have jobs — so it's sort of a good thing — but it's incredibly frustrating, right? It's like a miracle when anything works the first time.

0:05:27 | So the IARPA project mainly was interested in studying these questions on the ICSI meeting corpus, where there is a near-field channel and a far-field channel — I'll talk a little bit more about that. We wanted to understand, when you train models on the near-field condition, what happens when you recognize far-field data.

0:05:51 | And so, in this context: is the brittleness of ASR solely due to the model's inability to account for the statistical dependence that occurs in real data? When I started this particular project, I thought that it was just going to be the independence assumptions, and I was very surprised, when we actually started doing the work, that in fact it wasn't.

0:06:23 | And so — I say this half in jest — in the matched case, the inability of the model to account for the statistical dependence that occurs in real data is basically the whole problem. But when you move to the mismatched case, all of a sudden something else rears its head, and it's a big problem. I'll describe what this problem is: it has to do with the lack of invariance of the front end.

0:06:53 | So I'm going to spend a little time talking about the methodology we use. The way we explore this question is that we fabricate data: we use simulation, and a novel resampling process that uses real data, to probe the models. The data that we create is either completely simulated, so that it satisfies all the model assumptions, or it is real data that we resample in a way that gives it properties that we understand. By feeding in this data we can probe the models, see their response to it, and observe recognition accuracy.

0:07:49 | So here's an example. [plays a real utterance from a Wall Street Journal capital markets report] This is an example of what we expect speech to sound like; it's from Wall Street Journal. And this is a fabricated version of it that essentially agrees with all the model assumptions: [plays the simulated version of the same utterance]

0:08:25 | So it's highly amusing, but it's intelligible, obviously, and obviously it comes from a model that was constructed from a hundred different speakers, and it reflects that sort of structure.

0:08:39 | So what we're trying to quantify is the difference between these two extremes in terms of recognition performance.

0:08:50 | The basic idea of data fabrication is simple: we follow the HMM's generative mechanism. To do that, we first generate an underlying state sequence consistent with the transcript, the dictionary, and the state transitions of the underlying hidden Markov model. Then we walk down this sequence and emit a frame at each point.

0:09:22 | So here's a nice picture that describes this structure — parts of it are actually a graphical model; this, of course, is an HMM. Basically, given a transcript, we unpack the words, we get the corresponding pronunciations, then the phones in context, which determine which HMMs we use. So this is the hidden state sequence, and each of these states emits observations according to whatever mixture model we're actually using, right?

0:09:59 | If you're not so familiar with HMMs — I assume pretty much everyone in the room is — this highlights the independence assumptions. Well, it highlights two things. One, the frames are emitted according to a rule, and the rule is the form that we assume for the marginal distribution of the frames. And of course this also says that the frames are independent: every time I emit a frame from, say, state three, it is independent of the previous frame that was emitted from state three — that's a very strong assumption — but in addition it is also independent of any of the frames that were emitted previously from any other state. So these are very strong assumptions.

0:10:46 | Okay, so to generate observations we just follow this rule: once I know the sequence of states — and I've laid out how we get one — I just walk down that sequence of states and do a draw from a distribution, whether it be empirical or parametric.
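The generative recipe described here — fix a state sequence, then emit one frame per state, each drawn independently from that state's distribution — can be sketched in a few lines. This is an illustrative sketch only: the state IDs, the frame dimension, and the Gaussians below are invented, not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def fabricate(state_seq, emit, rng):
    """Follow the HMM's generative rule: walk the fixed state sequence and
    emit one frame per state, independently of every other frame."""
    return np.stack([emit[s](rng) for s in state_seq])

# Hypothetical 2-state setup with 3-dimensional "cepstral" frames.
means = {0: np.zeros(3), 1: np.ones(3)}
covs = {0: np.eye(3), 1: 0.5 * np.eye(3)}

# Parametric emitters: one Gaussian per state (a 1-component GMM).
emit = {s: (lambda rng, s=s: rng.multivariate_normal(means[s], covs[s]))
        for s in (0, 1)}

state_seq = [0, 0, 1, 1, 1]        # e.g. from a forced alignment
frames = fabricate(state_seq, emit, rng)
print(frames.shape)                 # (5, 3): one frame per state visit
```

Swapping the parametric emitters for draws from stored real frames gives the empirical variant the talk turns to next.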

0:11:13 | So, for simulation: it's easy to simulate from a mixture model, not a big deal, right? But what about this novel sampling process that will allow us to get at the independence assumptions? Well, for this we adapted a formalism from Efron's bootstrap. I talked a little bit about the bootstrap in the paper and at the poster.

0:11:45 | People in the field don't seem to be terribly familiar with it — I'm not sure it's taught very much — but I will try to explain it.

0:11:54 | The basic idea is this: suppose you have an unknown population — you've got some population distribution — and you compute a statistic that's meant to summarize this population. Then you want to know how good that statistic is, so I want to construct a confidence interval for the statistic to give me a sense of how well I've estimated it. But how am I going to do that if I don't know the population? I'm trying to derive properties of this population, and in particular I don't know anything about it, really, except the sample I've drawn from it.

0:12:37 | Before Efron's bootstrap procedure, people would usually make some parametric assumption about the population — typically you'd assume it's normal, or Gaussian — and then compute a confidence interval using that structure.

0:12:54 | Well, of course, that's sort of crazy — why would you do that? Especially if the question you're trying to answer is whether the population distribution is Gaussian or not: it's crazy to assume that the population distribution is Gaussian in order to compute that confidence interval.

0:13:09 | So this was a big problem in the late seventies, when computers first became usable by statisticians, and Efron came up with this formalism.

0:13:18 | The name comes from pulling oneself up by one's bootstraps — lots of people use "bootstrap" for all sorts of things — and everyone attributes this to the story in The Adventures of Baron Munchausen, where the Baron is stuck in a swamp and pulls himself up by his bootstraps to get out. But of course, if you read The Adventures of Baron Munchausen, that's not what happens: in fact, he was stuck in a swamp, trying to get out of it, and instead he pulled himself out by his own hair. So maybe we should have called it something else — anyway, I thought that was funny.

0:14:12 | The way the bootstrap works is that you take the empirical distribution. You have the sample, and this sample is a representative of the true population distribution — if it's big enough, it should be a pretty good representative. So instead of fitting a parametric model to it, you treat it as an empirical distribution and you sample from that empirical distribution. Sampling from the empirical distribution turns out to be equivalent to just doing a random draw, with replacement, from the sample itself — hence the name "resampling."
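As a concrete sketch of that last point — a percentile bootstrap confidence interval built purely from draws with replacement, with no Gaussian assumption anywhere — on invented data:

```python
import numpy as np

rng = np.random.default_rng(1)

# A sample from some unknown population (invented data for illustration).
sample = rng.exponential(scale=2.0, size=500)
stat = sample.mean()   # the statistic meant to summarize the population

# The bootstrap: sampling from the empirical distribution is the same as
# drawing with replacement from the sample itself.
B = 2000
boot_stats = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(B)
])

# Percentile confidence interval -- no parametric assumption anywhere.
lo, hi = np.percentile(boot_stats, [2.5, 97.5])
print(f"mean={stat:.2f}, 95% bootstrap CI=({lo:.2f}, {hi:.2f})")
```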

0:14:50 | So we're going to adapt this formalism to the problem at hand. When we train our models — imagine we're doing Viterbi training; I'll have another picture in a moment — basically we're going to resample the frames that are assigned to a particular state during training, and that's how it works. And we can do this for various types of segments.

0:15:21 | Here — it's a really crappy picture, I have to do a better job — we have the true population distribution. If we fit, say, a Gaussian to it, that's not a particularly good representative; instead, if we've drawn enough data, this histogram estimates the distribution.

0:15:52 | So basically — and this is the important part of this slide — resampling is going to fabricate data that satisfies the independence assumptions of the HMM, because I'm going to do random draws with replacement from the distribution. But the data we create are going to deviate from the HMM's parametric distributional assumptions to exactly the same degree that real data do — because it is real data: it's the data, after all, from the training set.

0:16:30 | So here's a slightly better picture, which describes a little bit of what we do. Imagine we have training data and we're actually doing Viterbi training. If we're doing Viterbi training, we get a forced alignment, and for each state we just accumulate all the frames for that state and then fit a GMM to them, right? But instead of doing only that, in the bootstrap formalism we accumulate the frames and we stick 'em in urns that are labeled with that state. So training is just like ordinary Viterbi training — you just accumulate all the frames associated with a state — but instead of forgetting about them once they've been used to compute the parameters, you keep track of what they are.

0:17:23 | And so, when it comes time to generate pseudo-data, you have an alignment, or some state sequence that you've obtained however you like. You have a state sequence, and you walk down it to generate the frames. If I were generating the frames by simulation, I would do a random draw from a distribution; now, instead, I do a random draw with replacement from an urn of frames, okay? So the frames again are independent, because I'm doing random draws with replacement, and they deviate from the distributional assumptions to the same degree real data do, 'cause they are real data.
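A minimal sketch of the urn scheme just described, with toy stand-ins for the alignment and the frames (nothing here is taken from the actual system):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(2)

def build_urns(alignment, frames):
    """Training side: instead of only fitting a GMM per state, keep the
    frames themselves in an urn labeled by the state that owns them."""
    urns = defaultdict(list)
    for state, frame in zip(alignment, frames):
        urns[state].append(frame)
    return {s: np.stack(f) for s, f in urns.items()}

def resample_frames(state_seq, urns, rng):
    """Fabrication side: walk a state sequence and, at each step, draw one
    real frame with replacement from that state's urn. Draws are
    independent (the HMM's assumption), but each frame deviates from the
    parametric model exactly as real data does -- it is real data."""
    return np.stack([urns[s][rng.integers(len(urns[s]))] for s in state_seq])

# Toy stand-ins for one aligned training utterance: 6 frames, 3 dims.
train_align = [0, 0, 1, 1, 1, 0]
train_frames = rng.normal(size=(6, 3))
urns = build_urns(train_align, train_frames)

pseudo = resample_frames([0, 1, 1, 0], urns, rng)
print(pseudo.shape)   # (4, 3)
```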

0:18:03 | And then I can also do sequences: I can resample state trajectories, phone trajectories, and word trajectories. Here, this is the sequence of frames associated with a state, so I can stick that whole sequence into the urn. Likewise, I can take a whole phone's sequence of frames and put it in, and when I draw from the urns, instead of getting individual frames I get segments.

0:18:39 | The important thing is this: no matter what, if I have segments in the utterance, when I draw the segments, the frames between segments are independent, but they inherit the dependence that exists in real data within each segment. So we have between-segment independence and within-segment dependence, and this is the way we can control the degree of statistical dependence that's in the data. This is quite powerful.

0:19:15 | So this slide just summarizes this — and you can see you could even stick a whole utterance into an urn — but the point is that segment-level resampling relaxes frame-level independence to segment-level independence.
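The segment-level variant can be sketched the same way: the urns now hold whole runs of frames, so each draw preserves the real within-segment dependence while the draws remain independent between segments. The labels and data below are invented for illustration.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(3)

def build_segment_urns(segments):
    """segments: (label, frames) pairs cut from aligned training data,
    where a label might be a state, a phone, or a word."""
    urns = defaultdict(list)
    for label, frames in segments:
        urns[label].append(np.asarray(frames))
    return urns

def resample_segments(label_seq, urns, rng):
    """Draw one whole segment per label, with replacement, and concatenate:
    frames are independent *between* segments but keep the real data's
    dependence *within* each segment."""
    draws = [urns[l][rng.integers(len(urns[l]))] for l in label_seq]
    return np.concatenate(draws)

# Toy segments for two hypothetical phone labels, 3-dim frames.
segs = [("a", rng.normal(size=(4, 3))),
        ("a", rng.normal(size=(6, 3))),
        ("b", rng.normal(size=(5, 3)))]
urns = build_segment_urns(segs)

pseudo = resample_segments(["a", "b", "a"], urns, rng)
print(pseudo.shape[1])   # 3: the frame dimension is preserved
```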

0:19:39 | So here's a picture of the model's response to fabricated data. I don't know how much time I want to spend on this, but —

0:19:59 | Here what we have is simulated data at one end and the real error rate at the other, and as I gradually reintroduce dependence into the data, the word error rate starts to increase rather dramatically. So the point is: let's look at the simulated word error rate. You can think of this as having some sort of knob with which you're reintroducing dependence into the data, and as I reintroduce dependence, the error rate becomes quite high. This is ICSI meeting data, with unimodal models.

0:20:39 | The same sort of phenomenon happens when you use mixture models, say eight-component mixtures. So here the simulated error rate is around two percent — a little bit less than two percent.

0:20:52 | When I do frame-level resampling, the error rate increases just a little bit — it does increase, but by very little. Now, when I reintroduce within-state dependence, all of a sudden the error rate becomes around twelve percent, so the error rate has increased by a factor of six. When I reintroduce within-phone dependence, the error rate increases again by about a factor of two. And then when I go to words, it increases again by almost a factor of two. The largest jump on the corpora that we've worked with is typically when you move from frame to state: it typically increases by about a factor of six.

0:21:39 | So you think about this and you make an argument, and the argument is that the distributional assumption that we make with GMMs is not such a big deal — I mean, it's important, but it's not such a big deal. The biggest single factor is this reintroduction of dependence: it's the dependence in the data that the models are finding surprising. I mean, everybody knew the independence assumptions were wrong — I'm not saying that's surprising — but I personally was really surprised, and it took a long time to come around to the fact that it really is the model's independence assumptions that are driving the errors, and we tend to work around this by other sorts of things.

0:22:35 | So this is a summary of the matched-case results. When we have matched training and test, it's the independence assumptions that are the big deal — it's the model's inability to account for dependence in the data that is derailing things, not the marginal distributions so much.

0:22:57 | Surprisingly also, in a later study we adapted this formalism to ask the question: what is discriminative training doing? You start with the maximum likelihood model, you apply MMI — what's happening here? You apply this formalism and you see that, in fact, MMI is actually compensating for these independence assumptions, in a way that I don't completely understand — I have hypotheses about how this might work.

0:23:36 | So here you have a really complicated procedure — it's a little hokey — that took people, many people in this room, twenty years to get to work, right? And once it was shown to work on large vocabulary, it took many labs additional years to get it to work in their own labs. Now it's pretty routine to do this, but it was a struggle to get it to work, and my point is that what it's doing is compensating for the independence assumptions. We know the independence assumptions are a problem. I'm not saying that it's going to be easy to find a model that relaxes the independence assumptions, but perhaps those twenty years of effort would have been better spent attacking that problem.

0:24:26 | So what about mismatched training? In the ICSI meeting corpus we have near-field data, collected from head-mounted microphones, and there was also a microphone array of some sort. The meeting room was quiet, it was small, and it had a normal amount of reverb — the kind of reverb humans expect in a room. If you listen to these two channels you can tell that they're different, but it's not as if the far-field channel is radically different when you listen to it: it sounds a little different, but it's perfectly intelligible.

0:25:13 | So we explored training and testing with near-field, training and testing with far-field, and the mismatched condition, where we train on near-field data and test on far-field.

0:25:26 | I'll just say that this is harder than it looks: you have to be careful, and you have to think about what you're trying to do when you run these types of experiments. In particular, there were a lot of issues we had to work through to get the near-field channel and the far-field channel exactly parallel, so that we were actually measuring what we wanted to measure; it's a somewhat intricate lab setup.

0:26:01 | The paper that we wrote in ICASSP — I don't know how well it describes this, but it attempts to — and on the ICSI website there's a technical report that's reasonably good and describes a lot of this, so I'm not going to belabor it, but there was a lot of effort that we had to go through.

0:26:23 | So here's the bottom line. First let's look at the green and the red curves: they are the matched near-field and far-field conditions, and notice that they track each other pretty well; the far-field real data is obviously harder. But interestingly, look down here at the simulated and frame-resampled error rates: they're still really low. The matched far-field is higher — it's worse — but it's still really low, and in particular these error rates are around two percent, right?

0:27:06 | So let's think about that — but before we do, notice the mismatched simulation error rate: this is where we want to concentrate, this is what we want to think about. We don't need to worry about the other cases; it's the simulated case that we're going to concentrate on.

0:27:26 | When you simulate data from the near-field models and you recognize it with the near-field models, the error rate is essentially zero, and that means the problem is essentially separable. Likewise, when I take the far-field models, simulate data from them, and recognize it with the far-field models, I get essentially zero error; again, that means the problem is essentially separable.

0:27:59 | So in these two individual spaces — the frames, the MFCCs that the signal processing generates — the matched cases are essentially separable problems. But all of a sudden, when I take the near-field models and look at the far-field data, it is dramatically not separable. That means that the front end is not invariant under the transformation that takes place between the near-field data and the far-field data, and that lack of invariance is what's causing this huge increase in error.

0:28:47 | Again, it's not surprising that the front end is not invariant to this transformation — there's a little bit of reverb, there's a little bit of noise — but what's remarkable is that it is solely that problem that causes this huge degradation in error, and that is actually fairly remarkable.

0:29:10 | There are many more results involving mixture models: we reran all of these experiments with, I think, eight-component mixture models, and we see the same sort of behavior. We've also reproduced all the discriminative training results; we asked, can discriminative training somehow magically alleviate the mismatched case? And the answer is no.

0:29:41 | I think Morgan alluded to this earlier: a natural question is, how does MLLR work in this setting? We talked about that. With MLLR you can recover some of the degradation, as you would expect, but MLLR is a simple linear transformation, and whatever transformation is happening between these two channels, it's some peculiar nonlinear transformation, right? So it's unreasonable to expect MLLR to do all that well. But this test harness is a really good test harness for evaluating how invariant to these transformations our front ends are, and we've explored that a little bit, and it's not so encouraging.

0:30:26 | Alright, I think with that I will end — I've blathered on long enough. I'll turn it over to Jordan, and he will give a higher-level view of the whole idea, and then we'll have questions.

0:30:54 | [new speaker] Okay — one, two, three.

0:31:20 | Alright. So it turns out there were two parts of this project. Steve told you about the technical stuff, but we also thought we'd like to figure something out. You've been hearing a lot about how wonderful speech recognition is during this meeting, and we thought we would actually like to understand what the community thought speech recognition was really like. So we rolled out a survey, and I called a bunch of people — many of you got called by me. What we wanted to do was just see what people thought about how speech recognition really worked. We were hoping that we would find some evidence to persuade the government, maybe, to put in some money and fund some speech recognition research, which we haven't seen in a long time — but really, we just wanted to find out what was going on.

0:32:17 | but we really we just one the final was going on |

0:32:20 | and so we put together a little survey team |

0:32:24 | jen into jamieson worked with me she's a alice that's been in speech for very |

0:32:29 | long time and we engage frederick okay and he's a specialist at doing service |

0:32:36 | and we design a snowball start by |

0:32:40 | it's normal surveys very interesting it |

0:32:44 | it says you start with a small group of people that you know and you |

0:32:47 | have some the questions and then you apps them who else task |

0:32:51 | and you just follow that for your nose and what that means is although it's |

0:32:56 | not entirely unbiased it's as unbiased as you can do if you don't know the |

0:33:00 | sampling populations going to be |

0:33:06 | So we wanted to know what was going on: what did people think the failures were, what remedies had people tried, and how did they work?

0:33:17 | So we did this snowball sampling. Here's the questionnaire — I don't want to spend a lot of time on this, but just take a look. The interesting questions are the last one on the slide, "Where has the current technology failed?", and the first one on the slide, "What do you think broke?", and then questions about what you did about what was going on, and whether there was other stuff.

0:33:45 | The survey participants tended to be old — I think that's sort of how our snowball worked — not terribly old, but there are not a lot of young people in this, so the ages were thirty-five to seventy. We spoke to about eighty-five people.

0:34:03 | and they had an interesting mix of jobs. most of them were in research, some were

0:34:09 | in development, some were both

0:34:11 | there was a small number of management people, and then people who self-described

0:34:17 | their jobs as something more detailed

0:34:22 | but mostly these are r&d people or managers doing speech research or language research of one

0:34:30 | sort or another

0:34:35 | so here's what you told us

0:34:39 | there's a sense that

0:34:42 | natural language is a real problem and acoustic modeling is a real problem

0:34:47 | and everything else that we do is broken, more or less

0:34:51 | so i think the community sort of has this feeling, not the people trying to

0:34:55 | sell speech recognition to the management, but the people trying to make it work have

0:35:00 | a feeling that all is not really well in the technology

0:35:05 | so lots of people, when you ask them to point fingers, are pointing fingers at the language

0:35:11 | itself and at acoustic modeling

0:35:14 | and there's a third category, which just says "not robust"

0:35:22 | so there's something going on with this technology that makes it not work very well

0:35:27 | and when we asked people what they had tried

0:35:30 | to fix things, the answer is: everything

0:35:34 | people have mucked around with the training, some people have tried all kinds of different

0:35:38 | pieces of their systems

0:35:54 | alright, anyway

0:35:58 | one of the interesting things that people tried to do:

0:36:02 | many of us have tried to fix pronunciations, either in dictionaries or in rules of

0:36:07 | pronunciation, and to my knowledge everyone has found that this is a waste

0:36:12 | it's pretty interesting that that's not a way to fix the systems that we

0:36:16 | currently build. so we tried all kinds of stuff

0:36:21 | and so i think

0:36:22 | our takeaway from the survey is that people

0:36:27 | actually don't believe the technology is very solid, and we try a lot of things

0:36:31 | to fix it. and then we looked a little bit at the literature; the

0:36:35 | literature survey is in the icsi report, which you can go read. but anyway,

0:36:40 | we found a little chart that looks sort of like this, from a review by

0:36:43 | furui

0:36:45 | and it says:

0:36:48 | LVCSR is far from being solved. background noise, channel distortion, foreign accents,

0:36:52 | casual disfluent speech, and unexpected topic changes cause automatic systems to make egregious

0:36:57 | errors. and that's what everybody said: anybody who's looked at the field says, well, this

0:37:02 | technology is okay sometimes, but it fails a lot

0:37:08 | so what we concluded was:

0:37:10 | the technology is old. i'd point out that the models most of us use,

0:37:14 | the hidden markov models most of us use, i mean, are the thing that

0:37:18 | was written down and applied around nineteen sixty-nine

0:37:22 | so maybe that's, i think, frankly, one of our issues here

0:37:29 | so when these systems fail, they degrade not gracefully, like you would want, but

0:37:35 | catastrophically and quickly

0:37:40 | speech recognition performance is substantially behind how humans do in almost every circumstance

0:37:49 | and they're not robust

0:37:51 | so that was sort of my quick overall overview of what the survey was

0:37:57 | and it's available on the icsi website, in the report. but i wanted

0:38:03 | to add a couple of personal comments, my analysis of what's happening

0:38:08 | so for these i'm not representing the government; i actually want to talk

0:38:13 | to you about my own personal analysis

0:38:17 | so here there are three points. first point:

0:38:21 | if you have a model, and you spend a lot of time hill

0:38:24 | climbing to the optimum performance

0:38:26 | and it doesn't perform optimally at that spot

0:38:29 | you've got the wrong model

0:38:32 | hidden markov models were proved to converge by baum, petrie, soules and weiss

0:38:37 | around nineteen sixty-nine

0:38:39 | that proof has two parts

0:38:41 | one, it says you can always make a better model

0:38:45 | two, it says you get the optimal parameters if the data came from the model

0:38:51 | that second part is

0:38:54 | absolutely not true in our speech recognition systems. we're hill-climbing on data that doesn't

0:39:00 | match the model, and we're not going to find the answer that way

0:39:04 | so we spend a lot of time

0:39:06 | trying to adapt to account for the problem, but we've got the wrong model
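The second part of that argument can be illustrated with a toy, non-HMM example (hypothetical numbers, much simpler than the speech case): maximum-likelihood fitting happily converges, but when the data did not come from the assumed model family, the "optimal" parameters still describe the data badly.

```python
import random
import statistics

rng = random.Random(0)

# Data from a two-mode source (NOT a single Gaussian): half near -3, half near +3.
data = [rng.gauss(-3, 0.5) for _ in range(500)] + \
       [rng.gauss(3, 0.5) for _ in range(500)]

# The maximum-likelihood single Gaussian has a closed form (sample mean and
# standard deviation); any hill-climbing procedure converges to this point.
mu = statistics.fmean(data)
sigma = statistics.pstdev(data)

# The fit is optimal *within the wrong family*, yet its single mode sits
# near 0, where almost none of the real data actually are.
frac_near_mode = sum(abs(x - mu) < 0.5 for x in data) / len(data)
```

The fitted mean lands between the two clusters and the variance balloons to cover both; nothing in the optimization warns you that the model family itself is wrong.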

0:39:13 | this is a personal bone to pick

0:39:15 | if you use sixty-four gaussians to fit some distribution, you have no idea what

0:39:19 | the distribution is

0:39:21 | the original

0:39:23 | multi-gaussian distributions were done with a single mean, and i understand that, but that's not

0:39:29 | where we are

0:39:30 | and so my corollary, i think, speaks for itself
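That corollary can be made numeric with a small sketch (illustrative numbers only): a mixture of 64 narrow Gaussians reproduces a uniform density, a distribution with no Gaussian structure at all, almost exactly. A good 64-component fit therefore tells you essentially nothing about what the underlying distribution really is.

```python
import math

def gmm_pdf(x, means, sigma):
    """Density of an equal-weight Gaussian mixture with shared sigma."""
    k = len(means)
    norm = sigma * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - m) / sigma) ** 2) / norm for m in means) / k

# 64 narrow Gaussians spread evenly over [0, 1].
K = 64
means = [(i + 0.5) / K for i in range(K)]
sigma = 1.0 / K

# Away from the edges, the mixture is almost exactly the uniform density (1.0).
max_err = max(abs(gmm_pdf(i / 100, means, sigma) - 1.0) for i in range(10, 91))
```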

0:39:37 | and finally, if the system you build fails for fifty percent of the population entirely

0:39:43 | and then, for the people it works for, as soon as they walk into a reverberant

0:39:46 | environment or a noisy place, it fails

0:39:48 | it's broken

0:39:51 | and i believe speech recognition has this terrible problem

0:39:55 | so i think, for what we really wanted to do, i want to draw an

0:39:59 | analogy, i want to draw an analogy between

0:40:03 | transcription and transportation

0:40:06 | and for transportation, man, this is what i want: something that's sleek and speedy and

0:40:12 | easy to use and doesn't break

0:40:15 | and what we built is this

0:40:20 | it runs on two wheels, it will get you there eventually, but you spend almost all your

0:40:24 | time dealing with problems that have nothing to do with the transportation part

0:40:28 | and so i believe that that's what we've done with speech recognition

0:40:32 | and it's time for new models, and

0:40:35 | i urge you to think about the models

0:40:38 | and not so much about the data

0:40:54 | okay

0:40:58 | i assumed that this would generate a lot of discussion and a lot of questions

0:41:02 | if it doesn't, then something is wrong with us

0:41:06 | this community would indeed be broken

0:41:10 | okay, who's first? over there

0:41:20 | a question about the resampling

0:41:24 | as i think about this, you have a sort of sequence of random variables, and

0:41:27 | you're turning a knob on the independence between them

0:41:31 | and one of the things that turning that knob does is that

0:41:35 | as things become more dependent, there's

0:41:37 | less information

0:41:40 | what i'm wondering is how much of the word error rate degradation you see

0:41:44 | might be associated simply with the fact that there's just less information

0:41:48 | in streams that are more dependent

0:41:54 | is this working?

0:41:56 | so i guess i don't understand the question

0:41:59 | well, let me try. i mean,

0:42:02 | you're right. so here is an answer, and you can tell me if i'm

0:42:07 | close to understanding: the model assumes that each frame has an independent amount of information

0:42:15 | but we know that the frames do not have independent amounts of information; the

0:42:20 | amount of information

0:42:22 | going from frame to frame varies enormously

0:42:25 | but the model treats every single one of those frames as independent, and that's

0:42:31 | an egregious violation of the facts
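That violation can be made concrete with a tiny sketch (hypothetical numbers, a fixed state alignment, transition terms ignored): under the conditional-independence assumption, the observation score is a plain sum of per-frame terms, so a real, smoothly evolving utterance and a within-state resampled version of it are indistinguishable to the model.

```python
import math

MEANS = {"a": 0.0, "b": 2.0}  # hypothetical one-dimensional state means

def loglik(x, q):
    """Per-frame log-likelihood under a unit-variance Gaussian for state q."""
    return -0.5 * (x - MEANS[q]) ** 2 - 0.5 * math.log(2 * math.pi)

def framewise_score(frames, states):
    """HMM observation score: frames are assumed conditionally independent
    given the states, so the total is a sum and frame order never matters."""
    return sum(loglik(x, q) for x, q in zip(frames, states))

states    = ["a", "a", "a", "b", "b", "b"]
real      = [0.1, 0.2, 0.3, 1.9, 2.0, 2.1]  # smooth, highly dependent frames
resampled = [0.3, 0.1, 0.2, 2.1, 1.9, 2.0]  # same frames, dependence destroyed
```

Both sequences get exactly the same score, which is precisely why frame resampling leaves the model's view of the data unchanged while destroying the real temporal dependence.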

0:42:34 | so what

0:42:37 | i guess i was thinking about was:

0:42:39 | if i ask you to say a word ten times, versus i ask ten people to

0:42:42 | say the word once

0:42:43 | and i'm trying to figure out what the word is

0:42:45 | the ten people saying it might actually provide more information in the data

0:42:49 | itself

0:42:51 | and i'm just wondering if that might at all

0:42:53 | contribute to why there's more

0:42:57 | information as you sample from

0:42:59 | more and more disparate parts of the training database

0:43:07 | well, i think what you're actually saying is, if i hear your words right,

0:43:17 | so, the model...

0:43:21 | i think many people have this question. so when you have

0:43:26 | all the frames independent, when you do frame resampling, the frames come from

0:43:31 | all sorts of different speakers, and when you line them up, you know,

0:43:35 | like what i played, they come from all sorts of different speakers. but then

0:43:40 | as soon as i start

0:43:43 | increasing the segment size, each one of those segments is going to come from one

0:43:49 | speaker, right? is this sort of along the lines of what you're thinking? well, the

0:43:53 | notion of speaker is part of the dependence in the data, right? the fact

0:43:59 | that each one of these frames came

0:44:01 | from a single speaker, that's dependence

0:44:05 | and so that inter-frame dependence,

0:44:07 | well, the model knows nothing about it

0:44:09 | and so whether that's causing a problem or not, that dependence is certainly in your

0:44:14 | data

0:44:22 | of course, all of us,

0:44:23 | you know, as you said, all of us have been aware of this for

0:44:26 | a long time, and i think there has been a lot of effort at trying

0:44:29 | to undo it

0:44:31 | when we say the model has this independence assumption, that's only

0:44:37 | sort of true

0:44:39 | because the features that we use

0:44:42 | span several frames, so of course they're not actually independent. you know, when you

0:44:47 | synthesize, it's not clear what you really synthesize, because you have to synthesize something that

0:44:52 | may have an independent value, but it has to have a derivative that matches the

0:44:56 | previous frame, and so on. but

0:44:58 | we've all tried things like segmental models

0:45:02 | which don't have that independence assumption

0:45:04 | right, we take a segment,

0:45:07 | a whole phoneme, so you're

0:45:09 | skipping the state independence assumption and the frame independence assumption and going straight

0:45:15 | to the context-dependent phoneme

0:45:18 | and now you're picking a sample from the one distribution for that context-dependent phoneme

0:45:24 | and that always works worse

0:45:28 | maybe you can do something with that, or combine it with the hidden markov model

0:45:32 | and gain a bit, but by itself it always works a lot

0:45:36 | worse

0:45:38 | unless you cripple the hidden markov model by saying we're only going to

0:45:43 | use context-independent models; then this one might work better, but...

0:45:48 | so the point is,

0:45:49 | it's not that we haven't tried

0:45:51 | people have tried to make models that avoid those things, and almost all of those

0:45:56 | things got worse. the flip side of that is, as you said, mpe or mmi

0:46:00 | and all these things, they attempt

0:46:02 | to avoid

0:46:04 | that assumption, but they don't reduce the error by a factor of four

0:46:08 | they reduce the error by

0:46:10 | ten or fifteen percent relative

0:46:13 | basically a small amount; it's similar to any of the other

0:46:18 | tricks we do. so do you have any comment on those two observations?

0:46:21 | well, i mean,

0:46:23 | i'm not sure. well,

0:46:25 | a natural question, which i think is the first part of what

0:46:29 | you're saying, is: why have many people tried and failed to beat hmms with

0:46:36 | models that take into account

0:46:40 | the dependence structure in the data? so why hasn't that worked?

0:46:45 | well,

0:46:47 | i would say that

0:46:50 | i do not believe that anyone has any quantitative notion of what these dependencies

0:46:57 | in the data are

0:46:59 | i'm not saying that we should go back to those methods, maybe we should, but

0:47:04 | i will give you an example of something. you know, twenty years ago people

0:47:09 | gave up on neural networks

0:47:11 | and all of a sudden, you know, neural networks are

0:47:16 | the new...

0:47:21 | i don't know what the right biblical phrase is, but hallelujah. so what it

0:47:28 | takes is somebody who believes in something and tries hard to do it, and i

0:47:35 | think that here is the problem

0:47:37 | we should be... i don't know what the solution is, i honestly don't know what

0:47:41 | the solution is. but i will say also, about the mmi thing, no, i

0:47:46 | don't believe anyone would say mmi was designed to overcome independence

0:47:53 | you know, we knew that the maximum likelihood solution to this problem was not the

0:47:58 | right solution, so we found an alternative model selection procedure that puts us in a

0:48:04 | different place

0:48:05 | again, if the model were correct, we wouldn't have to do that

0:48:16 | coming back to the results, the simulation results you presented:

0:48:20 | i think these are highly suggestive, because

0:48:24 | by changing the data to fulfil your assumptions

0:48:29 | the error rates you get are not the error rates we

0:48:32 | expect from the real data

0:48:35 | because you fit

0:48:36 | the problem to your assumptions. but we have to go the other way around, so

0:48:40 | what error rates we can really expect if we

0:48:45 | improve on modeling is still an open question, isn't it?

0:48:48 | exactly, that's absolutely right. in no way am i claiming

0:48:54 | that if we could model dependence in the data we would be seeing these

0:48:58 | error rates, the frame resampling error rates. that's absolutely correct

0:49:04 | i mean, so,

0:49:05 | presumably we would do better. the other point, though, is, i think, that

0:49:12 | a lot of the

0:49:17 | sort of brittleness that we experience

0:49:20 | in our models, and this is a conjecture, is due to this very

0:49:25 | poor fit to the temporal structure

0:49:31 | you know, one way of thinking

0:49:35 | of these results, the frame resampling results, is it says if you

0:49:40 | forget about the temporal structure in the data, the models work really well, but as soon

0:49:46 | as you introduce the real temporal structure in the data, the models start failing

0:49:51 | and surely for speech, i think, temporal structure is important

0:49:57 | i think

0:50:25 | i don't think

0:50:28 | that violating the independence assumption in the statistics means you cannot extract the

0:50:34 | information; the speech doesn't necessarily care

0:50:44 | i mean, i can build a system that satisfies the

0:50:49 | independence assumption

0:50:51 | so i don't think

0:50:53 | it really follows that

0:50:55 | that's why the models fail

0:51:01 | i think you shouldn't be thinking about extracting

0:51:06 | or getting the right information; the problem is not the amount of information

0:51:10 | it's a question of how you represent the information

0:51:15 | and so if you misrepresent it, something is lost in the process

0:52:55 | one thing that works really poorly

0:52:58 | is if you have a mismatched representation

0:53:01 | so, for instance, think about some model that is representing text, okay

0:53:07 | you can represent it as a raster scan of the text

0:53:09 | or you could represent it as fonts

0:53:13 | and if you change the size of the image

0:53:16 | the two things are very different: the font-based

0:53:20 | version has an easy representation change, and the raster scan just messes up

0:53:25 | the whole thing

0:53:27 | so you have to ask yourself, is the problem that we're seeing

0:53:31 | the fact that we have a representation for the problem that doesn't match?

0:53:37 | that, i think, is the realization

0:53:40 | hmm, this tells us something in common:

0:53:43 | as you go further up, from states to phones and phones to

0:53:48 | segments

0:53:49 | the data is becoming more and more speaker-dependent. so maybe the problem is your

0:53:54 | models, i mean, aren't

0:53:57 | speaker-dependent enough. i mean, if you made your models more speaker-dependent, would we have seen

0:54:02 | such a difference?

0:54:03 | then it has nothing to do with the frame-dependent sampling. well, what

0:54:08 | i was trying to say before is that that is a form of dependence

0:54:15 | that the model knows nothing about

0:54:17 | this form of dependence,

0:54:19 | you know, there are many forms of dependence in data, and knowing what the dependence

0:54:24 | is is a hard thing for a human to understand, right?

0:54:28 | but that form of dependence is precisely there, and it may be causing the problem

0:54:36 | and there were a number of speakers... there are relatively few speakers

0:54:42 | in this corpus, and so we had to sort of cap them so that there

0:54:46 | wasn't a single dominant speaker

0:54:50 | which, i mean, i think would be the worst

0:54:56 | so let me sort of continue with what was asked before

0:55:02 | we know the model is wrong

0:55:05 | models are always wrong

0:55:11 | and the way you...

0:55:13 | you can argue that the model is wrong mathematically, or you can argue that it's

0:55:17 | wrong because it doesn't match human performance, or what we think

0:55:22 | of as human performance; i think we may overestimate human performance a little bit, but

0:55:26 | it clearly doesn't match it

0:55:29 | but in fact, you know, if you look at all the research that all of

0:55:32 | us do

0:55:34 | we at least feel like we're attacking those problems. so we say we're going to use

0:55:39 | font models, to use your analogy: we allow our models to... we scale

0:55:45 | them like fonts, right? we say we're going to estimate a scale

0:55:49 | factor, and that scale factor is not necessarily simple

0:55:52 | it can be a simple one, or it can be a matrix, you know, much

0:55:54 | more complicated than what you do with a font, and we constrain it to be

0:55:58 | the same: we say the speaker is the same for the whole sentence

0:56:01 | we do speaker adaptive training, so we try to remove the differences

0:56:07 | we try to normalize all the speakers to the same place and then insert the

0:56:11 | properties of the new speaker again, right?

0:56:14 | that's sort of like the analogy of a font

0:56:16 | we try to do all of these things; we certainly try to model channels

0:56:23 | we do all of these with linear models and nonlinear models

0:56:29 | and we get small improvements
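The constrained-transform idea in this question (one scale factor, possibly a matrix, shared by every frame of the sentence) can be sketched in one dimension. This is a toy least-squares version in the spirit of fMLLR, not an actual implementation; the frames and model means below are hypothetical.

```python
def shared_transform(frames, targets):
    """Least-squares fit of a single y = a*x + b applied to every frame of
    the utterance -- the 'speaker is the same for the whole sentence'
    constraint, in one dimension."""
    n = len(frames)
    mx = sum(frames) / n
    mt = sum(targets) / n
    a = sum((x - mx) * (t - mt) for x, t in zip(frames, targets)) / \
        sum((x - mx) ** 2 for x in frames)
    b = mt - a * mx
    return a, b

# Hypothetical speaker-independent model means for the aligned states,
# and a speaker whose features are a scaled, shifted version of them.
model_means = [0.0, 1.0, 2.0, 3.0]
speaker = [0.5 + 2.0 * m for m in model_means]

a, b = shared_transform(speaker, model_means)
adapted = [a * x + b for x in speaker]   # normalized back toward the model
```

Because the transform is estimated once per utterance rather than per frame, it removes speaker differences without being able to memorize the individual frames.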

0:56:31 | so my question, let me turn the question around:

0:56:34 | the model is wrong

0:56:36 | what's the right model?

0:56:38 | not what does it do, but what is the right model?

0:56:42 | so,

0:56:43 | i think we all don't know the answer to that question, but let me tell

0:56:47 | you about another phenomenon that i would like to see as an analogy

0:56:52 | i don't know if you've been following particle physics, but

0:56:56 | in particle physics

0:56:58 | when you measure particle interactions, the probabilities of the interactions are governed

0:57:03 | basically by feynman diagrams

0:57:05 | and so to compute, for a particle interaction, like in the supercollider, to compute

0:57:11 | a cross-sectional area for one of the interactions takes a desktop computer about

0:57:15 | a week, to look at all the feynman diagrams

0:57:19 | a couple of the physics guys just discovered a geometric object

0:57:24 | and in the geometric object it turns out that each

0:57:28 | little facet has

0:57:32 | an area that is exactly the solution

0:57:34 | to that problem of computing the cross-sectional area

0:57:39 | and now you can do the computations

0:57:43 | in about five minutes with a pencil and paper

0:57:47 | so

0:57:48 | there's a place where the difference in the model has a huge effect

0:57:54 | on making things work. so i don't believe the answer lies in

0:57:59 | the kinds of things that we've always been doing

0:58:02 | i think we need to have some radical reinterpretation of the way we look

0:58:06 | at the data, of the way we look at the words

0:58:09 | maybe we should listen to the linguists in some places

0:58:11 | maybe

0:58:14 | i took a degree in linguistics because i thought speech wasn't an easy problem from

0:58:18 | an engineering point of view, and i learned to distrust everything a linguist said

0:58:24 | maybe we should trust some of them, though

0:58:26 | maybe there's something different that we should be doing

0:58:28 | so i would love us to just go look outside this place that we've been exploring