0:00:15 | so i'm gonna talk about Project OUCH, but first, thank you for having me here |
---|
0:00:22 | i |
---|
0:00:23 | i enjoyed my time in the Czech Republic and have learned many Czech words, which is going well |
---|
0:00:29 | so thank you |
---|
0:00:31 | so |
---|
0:00:33 | Project OUCH stands for Outing Unfortunate Characteristics of HMMs |
---|
0:00:39 | there were three |
---|
0:00:41 | truthfully there were three phases; the sort of initial work that we did |
---|
0:00:47 | on this was a project that Larry Gillick and i started when we |
---|
0:00:52 | were at Nuance |
---|
0:00:54 | and truthfully it also had its antecedents in work that we were doing |
---|
0:00:59 | before that |
---|
0:01:01 | but that funded a very small pilot study, and IARPA funded a |
---|
0:01:09 | larger but still small effort; the students who worked with me |
---|
0:01:14 | were Dan Gillick |
---|
0:01:17 | and |
---|
0:01:18 | another researcher, who was actually a postdoc and is currently at Berkeley |
---|
0:01:24 | and Larry Gillick, Nelson Morgan |
---|
0:01:26 | and myself were the supervising senior people |
---|
0:01:29 | so Project OUCH |
---|
0:01:31 | what we're trying to do |
---|
0:01:33 | our goal is to sort of develop a quantitative understanding of |
---|
0:01:41 | how the current formalism works |
---|
0:01:44 | and you know, surprisingly, there's been very little work |
---|
0:01:48 | in this direction in the forty-year history |
---|
0:01:51 | of speech recognition |
---|
0:01:54 | there's been some, but it's been isolated and sporadic |
---|
0:01:59 | and |
---|
0:02:00 | you know, progress in speech recognition has been very erratic |
---|
0:02:05 | in my view largely because we've been proceeding |
---|
0:02:09 | via trial and error, and so the claim is |
---|
0:02:12 | that by gaining a deeper understanding |
---|
0:02:16 | of how our algorithms succeed and fail |
---|
0:02:19 | rather than just measuring word error, and if we get an improvement |
---|
0:02:23 | in word error, keeping it |
---|
0:02:26 | and if it doesn't improve, discarding it |
---|
0:02:28 | it should enable more efficient and steady progress, and i claim that this |
---|
0:02:34 | should be embedded in our standard research methodology, not necessarily the techniques that i'm |
---|
0:02:41 | gonna talk about, okay, but just this |
---|
0:02:43 | notion that when you have a model |
---|
0:02:46 | that, you know, doesn't fit the data, you should try to gain some |
---|
0:02:50 | understanding of how the model differs from the data and how that data-model residual |
---|
0:02:57 | impacts |
---|
0:02:58 | the classification errors |
---|
0:03:01 | so the main questions that Project OUCH was interested in, |
---|
0:03:08 | the main way you could think about this, are: what do the models |
---|
0:03:12 | find surprising about the data, what is it about speech data that the models find surprising |
---|
0:03:17 | and how does that surprise translate into errors |
---|
0:03:22 | so |
---|
0:03:23 | i'm gonna talk today about quantifying the two major |
---|
0:03:28 | HMM assumptions and their impact on the error rates; of course the two major assumptions |
---|
0:03:33 | are the very strong independence assumptions the model makes |
---|
0:03:38 | and also |
---|
0:03:40 | an equally strong assumption about the form of the marginal distribution of the frames |
---|
0:03:45 | typically we assume that they are Gaussian mixture models; of course nowadays people |
---|
0:03:50 | are using multi-layer perceptrons, but either way you make some sort of formal |
---|
0:03:56 | assumption about what it looks like |
---|
0:04:00 | also, which of these incorrect assumptions is it that discriminative training, MPE or MMI |
---|
0:04:08 | which of these assumptions is |
---|
0:04:11 | this process compensating for the most |
---|
0:04:16 | and |
---|
0:04:17 | do these results change when you move from matched training and |
---|
0:04:22 | test |
---|
0:04:22 | conditions to the mismatched case |
---|
0:04:26 | so the early work that we did was on the Switchboard and the |
---|
0:04:30 | Wall Street Journal corpora; later on we moved to the ICSI corpus |
---|
0:04:35 | you can recast |
---|
0:04:36 | this sort of question about how these results change in the mismatched case |
---|
0:04:43 | into a form of: why is ASR so brittle |
---|
0:04:47 | whenever we |
---|
0:04:48 | at any time bring up |
---|
0:04:51 | a new recognizer on a problem, whether |
---|
0:04:54 | in the same language or across languages, you always have to start, it seems, almost from |
---|
0:04:59 | scratch; you always have to collect a bunch of data that's closely related |
---|
0:05:05 | to the task that you |
---|
0:05:06 | have, and |
---|
0:05:08 | it hardly ever works the first time you try it; it's the reason that most |
---|
0:05:12 | of us in this room |
---|
0:05:14 | have jobs; it's sort of a good thing, but it's incredibly |
---|
0:05:19 | frustrating, right, it's like |
---|
0:05:23 | it's a miracle when anything works the first time |
---|
0:05:27 | so our project mainly was interested in studying |
---|
0:05:32 | these |
---|
0:05:33 | these questions on the ICSI meeting corpus, where there's a near-field channel and |
---|
0:05:38 | a far-field channel; i'll talk a little bit more about that; we wanted to |
---|
0:05:43 | understand, when you train models on the near-field condition |
---|
0:05:47 | what happens when you recognize far-field data |
---|
0:05:51 | and so in this context |
---|
0:05:54 | is the brittleness of ASR solely due to the model's inability to account for |
---|
0:06:00 | the statistical dependence that occurs in real data |
---|
0:06:04 | and you know, when i started this particular project |
---|
0:06:07 | i thought |
---|
0:06:08 | that it was just gonna be the independence assumptions, so |
---|
0:06:12 | i was very surprised |
---|
0:06:15 | when we actually started doing the work |
---|
0:06:19 | that in fact it wasn't quite so |
---|
0:06:23 | and so, i say this in a sort of funny way, but |
---|
0:06:26 | in the matched case, basically |
---|
0:06:29 | the inability of the model to account for statistical dependence that occurs in real |
---|
0:06:34 | data is basically the whole problem |
---|
0:06:37 | but when you move to the mismatched case |
---|
0:06:39 | all of a sudden something else rears its head |
---|
0:06:43 | and it's a big problem, and so i'll describe what this |
---|
0:06:47 | problem is |
---|
0:06:49 | it has to do with the lack of invariance of the front end |
---|
0:06:53 | so |
---|
0:06:55 | i'm gonna spend a little time |
---|
0:06:57 | talking about the sort of methodology we use; the way we explore this |
---|
0:07:03 | question is we create |
---|
0:07:07 | we fabricate data |
---|
0:07:09 | we use simulation and a novel sampling process |
---|
0:07:15 | that uses real data |
---|
0:07:17 | to probe the models, and the data that we create |
---|
0:07:21 | is either completely simulated, so that it satisfies all the model assumptions |
---|
0:07:26 | or it's real data |
---|
0:07:28 | that we sample in a way that gives it properties that we understand, and so |
---|
0:07:34 | by feeding in this data we can sort of probe the models and see their response |
---|
0:07:41 | to this data, and the response we observe is recognition accuracy |
---|
0:07:47 | so here's an example |
---|
0:07:49 | so this is an example: [plays a real Wall Street Journal utterance about a capital markets report] |
---|
0:07:59 | so this is an example of what we expect speech to sound like; this |
---|
0:08:03 | is from Wall Street Journal; and this is a fabricated version of it that essentially |
---|
0:08:09 | agrees with all the model assumptions |
---|
0:08:13 | [plays the fabricated version of the same capital markets utterance] |
---|
0:08:25 | so |
---|
0:08:26 | so you know, it's highly amusing, but it's intelligible, obviously, and obviously, you know |
---|
0:08:32 | it's from a model that was constructed from a hundred different speakers, and it reflects |
---|
0:08:37 | that sort of structure |
---|
0:08:39 | so what we're trying to quantify |
---|
0:08:41 | is |
---|
0:08:43 | the difference between these two extremes in terms of recognition performance |
---|
0:08:50 | so the basic idea of data fabrication is simple |
---|
0:08:56 | we follow the HMM's generative mechanism; to do that, we first generate |
---|
0:09:04 | an underlying state sequence consistent with the transcript, the dictionary, and the state transitions |
---|
0:09:12 | of the underlying hidden Markov model |
---|
0:09:15 | then we walk down this |
---|
0:09:19 | sequence and we emit a frame at each point |
---|
0:09:22 | so here's a picture, a nice picture, that describes this sort of structure |
---|
0:09:27 | parts of it are actually a graphical model |
---|
0:09:32 | this of course is an HMM |
---|
0:09:34 | but basically, if we have a transcript, we unpack it into words |
---|
0:09:41 | we get the corresponding pronunciations |
---|
0:09:45 | then the phones in context |
---|
0:09:47 | which determine which HMMs we use; so this is the hidden state, and each of these |
---|
0:09:51 | states emits observations according to whatever mixture model we're actually using, right |
---|
0:09:59 | and so if you're not so familiar with HMMs, i assume pretty much everyone |
---|
0:10:04 | in the room is, but this sort of highlights the independence assumptions, right; well it |
---|
0:10:10 | highlights two things; one |
---|
0:10:12 | the frames are emitted according to a rule, and the rule is the |
---|
0:10:17 | form that we assume for the marginal distribution of the frames |
---|
0:10:21 | and then of course this also says that these frames are independent; so every |
---|
0:10:26 | time i emit |
---|
0:10:28 | a frame from, say, state three, it is independent from the previous frame that was |
---|
0:10:33 | emitted from state three, so that's a very strong assumption |
---|
0:10:37 | but in addition |
---|
0:10:38 | it is also independent from any of the frames that were emitted previously from |
---|
0:10:43 | the other states, so these are very strong assumptions |
---|
0:10:46 | but okay, again, to generate observations we just follow this rule, and basically once |
---|
0:10:54 | we know the sequence of states |
---|
0:10:57 | once i have a sequence of states, i just walk down that |
---|
0:11:02 | sequence of states and i do a draw |
---|
0:11:04 | from |
---|
0:11:06 | a distribution |
---|
0:11:08 | whether it be empirical or parametric |
---|
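the generative recipe just described (pick a state path from the transitions, then emit one frame per state, independently) can be sketched in a few lines of Python; this is a toy illustration with made-up 1-D two-state parameters, not the actual project code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-state emission GMMs: (weights, means, variances).
# A real system would use trained triphone states and MFCC vectors.
gmms = {
    "s1": ([0.6, 0.4], [0.0, 2.0], [1.0, 0.5]),
    "s2": ([1.0], [5.0], [1.0]),
}
# Self-loop / exit probabilities for a tiny left-to-right HMM.
transitions = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s2": 1.0}}

def sample_state_sequence(start, n_frames):
    """Walk the Markov chain, choosing each next state from the transitions."""
    seq, state = [], start
    for _ in range(n_frames):
        seq.append(state)
        nxt = list(transitions[state])
        state = rng.choice(nxt, p=[transitions[state][s] for s in nxt])
    return seq

def emit_frames(states):
    """Emit one frame per state, i.i.d. given the state -- the HMM assumption."""
    out = []
    for s in states:
        w, mu, var = gmms[s]
        k = rng.choice(len(w), p=w)              # pick a mixture component
        out.append(rng.normal(mu[k], np.sqrt(var[k])))
    return out

states = sample_state_sequence("s1", 10)
frames = emit_frames(states)
```

in the real setup the state sequence comes from the transcript, dictionary, and forced alignment rather than a random walk, but the emission step is exactly this draw-per-state rule.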
0:11:13 | so |
---|
0:11:14 | so for simulation |
---|
0:11:16 | you know, it's easy to simulate from a mixture model, not a big |
---|
0:11:21 | deal right |
---|
0:11:23 | but what about this sort of novel sampling process that'll allow us to target |
---|
0:11:30 | the independence assumptions? well, for this |
---|
0:11:33 | we adapt a formalism |
---|
0:11:36 | from inference, the bootstrap; so i talked a little bit about the bootstrap in the |
---|
0:11:41 | paper and the poster |
---|
0:11:45 | people in the field don't seem to be terribly familiar with it, and i'm not |
---|
0:11:51 | sure it's taught very much, but i will try |
---|
0:11:54 | so the basic idea is |
---|
0:11:57 | suppose you have an unknown population, right, so you've got some population distribution and |
---|
0:12:02 | you compute a statistic that's meant to summarize this population |
---|
0:12:08 | then you want to know how good that statistic is, so i want to construct |
---|
0:12:12 | a confidence interval for the statistic to give me a sense of how well i've estimated it |
---|
0:12:17 | from |
---|
0:12:18 | a sample |
---|
0:12:20 | so how am i gonna do that if i don't know the population |
---|
0:12:23 | i mean i'm trying to |
---|
0:12:25 | you know, i'm trying to derive properties of this population |
---|
0:12:29 | and in particular i don't know anything about it, really, except the sample |
---|
0:12:34 | i've drawn from this population |
---|
0:12:37 | so |
---|
0:12:37 | before Efron's bootstrap procedure, people would usually make some parametric assumptions about the |
---|
0:12:44 | population; typically you'd assume it's normal, or Gaussian |
---|
0:12:49 | and then compute |
---|
0:12:51 | a confidence interval using that structure |
---|
0:12:54 | well of course that's sort of crazy, you know, why would you do that |
---|
0:12:58 | especially if you're trying to ask |
---|
0:12:59 | is this population distribution Gaussian or not; well, it's crazy to stipulate |
---|
0:13:04 | that the population distribution is Gaussian to compute this confidence interval |
---|
0:13:09 | so this was a big problem in the late seventies, when computers became |
---|
0:13:14 | usable |
---|
0:13:15 | by statisticians |
---|
0:13:18 | and Efron came up with this sort of formalism |
---|
0:13:21 | and so the name comes from pulling oneself up by the bootstraps; lots of |
---|
0:13:26 | people use "bootstrap" in various sorts of terminology; everyone attributes this |
---|
0:13:33 | to a story in the |
---|
0:13:36 | Adventures of Baron Munchausen, in one of which the Baron |
---|
0:13:40 | is stuck and, yes, to get out, pulls himself up |
---|
0:13:44 | by his bootstraps out of the bog; but of course if you read the original |
---|
0:13:49 | Adventures of Baron Munchausen |
---|
0:13:50 | that's not what happened |
---|
0:13:52 | in fact he was in a swamp |
---|
0:13:55 | on horseback, trying to get out of the swamp |
---|
0:13:57 | and instead pulled himself out by his own hair |
---|
0:14:01 | so maybe we should have called it something |
---|
0:14:06 | a little bit different; i thought that was very funny |
---|
0:14:10 | so |
---|
0:14:12 | so the way the bootstrap works |
---|
0:14:16 | is you take the empirical distribution; so you treat |
---|
0:14:19 | so you have the sample |
---|
0:14:21 | and this sample is a representative of the true population distribution, so if it's big |
---|
0:14:26 | enough it should be a pretty good representative |
---|
0:14:29 | and so |
---|
0:14:30 | instead of fitting a parametric model to this, you treat it as an empirical distribution |
---|
0:14:35 | and you sample from that empirical distribution |
---|
0:14:39 | sampling from the empirical distribution turns out to be equivalent to just doing a random |
---|
0:14:45 | draw with replacement from the sample itself |
---|
0:14:48 | hence the name "resampling" |
---|
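the bootstrap idea just described can be written down directly; here is a minimal percentile-bootstrap sketch in Python, with a synthetic non-Gaussian sample standing in for the unknown population (the sample itself, the statistic, and the replicate count are all illustrative choices, not values from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)

# A sample from some unknown, decidedly non-Gaussian population.
sample = rng.exponential(scale=2.0, size=500)

def bootstrap_ci(data, stat, n_boot=2000, alpha=0.05):
    """Percentile bootstrap: draw with replacement from the sample itself,
    recompute the statistic each time, and read off empirical quantiles."""
    stats = np.array([
        stat(rng.choice(data, size=len(data), replace=True))
        for _ in range(n_boot)
    ])
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

lo, hi = bootstrap_ci(sample, np.mean)
```

the point of the construction is exactly the one made above: no Gaussian assumption is stipulated anywhere; the confidence interval comes from the empirical distribution alone.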
0:14:50 | so we're gonna adapt this |
---|
0:14:52 | formalism to the problem at hand; so, you know |
---|
0:14:57 | when we train our models, right, imagine we're doing Viterbi training |
---|
0:15:02 | here, here's |
---|
0:15:04 | you know |
---|
0:15:05 | well, i'll have another picture, but basically we're gonna resample the frames that are |
---|
0:15:11 | assigned to a particular state during training, and that's the idea |
---|
0:15:16 | and we can do this for various types of segments |
---|
0:15:21 | so here |
---|
0:15:22 | it's a really crappy picture, and i'll have to do a better job, but |
---|
0:15:26 | the idea here again is |
---|
0:15:29 | you know, the same as before |
---|
0:15:32 | we have the true population distribution; if we fit, say, a |
---|
0:15:40 | Gaussian to this, it's not a particularly good representative, whereas if we've |
---|
0:15:45 | drawn enough data from it, this histogram is a good estimate of the distribution |
---|
0:15:52 | so basically |
---|
0:15:55 | the important part of this slide is |
---|
0:15:58 | resampling is gonna fabricate data |
---|
0:16:02 | that satisfies the independence assumptions of the HMM, because i'm gonna do random draws with replacement |
---|
0:16:08 | from the distribution |
---|
0:16:10 | but |
---|
0:16:11 | the data we create are gonna deviate from the HMM's parametric |
---|
0:16:19 | distributional assumptions to exactly the same degree that real data |
---|
0:16:24 | do, because it is real data |
---|
0:16:26 | it's the actual data |
---|
0:16:28 | from the training set |
---|
0:16:30 | so here's a somewhat better picture which will let me |
---|
0:16:34 | describe a little bit |
---|
0:16:36 | about what we do |
---|
0:16:38 | and a |
---|
0:16:39 | so imagine we have training data and we're actually doing Viterbi training; if |
---|
0:16:44 | we're doing Viterbi training we get a forced alignment, and for all the states |
---|
0:16:49 | we just accumulate all the frames |
---|
0:16:52 | for that state, and then we fit a GMM to them, right; and so |
---|
0:16:57 | instead of doing that, in the bootstrap formalism we accumulate |
---|
0:17:03 | frames and we stick them in urns |
---|
0:17:06 | that are labeled with that state |
---|
0:17:09 | so training is just like, you know, ordinary training |
---|
0:17:14 | you just accumulate all the frames associated with the state |
---|
0:17:17 | but instead of forgetting about them, you keep track of what they are once they've been used to |
---|
0:17:22 | compute the parameters |
---|
0:17:23 | and so when it comes time to generate pseudo-data, you have an |
---|
0:17:27 | alignment, or some state sequence that you've gotten somehow |
---|
0:17:32 | you have a state sequence, and when you walk down it to generate the frames, if |
---|
0:17:36 | i were generating the frames in simulation, i would do a random draw |
---|
0:17:41 | from a distribution; now instead i do a random draw with replacement from a bucket |
---|
0:17:47 | an urn, of frames, okay |
---|
0:17:49 | so the frames again are independent, because i'm doing random draws with replacement |
---|
0:17:55 | and they deviate from the distributional assumptions to the same degree |
---|
0:18:00 | as real data, 'cause they are real data |
---|
0:18:03 | so, sorry to belabor this, but i can also |
---|
0:18:08 | i can |
---|
0:18:10 | do |
---|
0:18:11 | sequences, so i can sample state trajectories, phone trajectories and word trajectories |
---|
0:18:18 | because |
---|
0:18:19 | so here |
---|
0:18:20 | here's the sequence of frames associated with the states |
---|
0:18:25 | so i can stick that whole sequence into an urn |
---|
0:18:29 | likewise i can take a whole phone sequence and put it in there, and when i |
---|
0:18:34 | draw from the urns |
---|
0:18:35 | instead of getting individual frames i get segments |
---|
0:18:39 | so the important thing is |
---|
0:18:41 | no matter what, if i have segments in the utterance |
---|
0:18:45 | when i draw the segments, between segments the frames are independent, but they inherit |
---|
0:18:53 | the dependence that exists in real data within each segment; so we have |
---|
0:18:59 | between-segment independence |
---|
0:19:02 | and within-segment dependence; so this is the way that we can control the |
---|
0:19:07 | degree of statistical dependence that's in the data |
---|
0:19:12 | this is quite powerful |
---|
0:19:15 | so this slide just |
---|
0:19:17 | sort of summarizes this |
---|
0:19:19 | and you can see |
---|
0:19:21 | you could even stick whole utterances in the urns |
---|
0:19:24 | but the point is that segment-level resampling |
---|
0:19:31 | relaxes frame-level independence to segment-level independence |
---|
0:19:39 | so here's a sort of picture of |
---|
0:19:43 | the model's response to fabricated data |
---|
0:19:49 | okay so |
---|
0:19:54 | i don't know how much time i wanna spend on this, but |
---|
0:19:59 | so here what we have is simulated data |
---|
0:20:04 | the simulated and the real error rates, and as i gradually reintroduce dependence into |
---|
0:20:10 | the data, the word error rate starts to increase rather dramatically |
---|
0:20:16 | so the point is |
---|
0:20:18 | let's look at the simulated word error rate; so you can think of this as |
---|
0:20:21 | i think of this as: you've got some sort of knob where you're |
---|
0:20:24 | reintroducing dependence into the data, and as i reintroduce dependence into the data, the error |
---|
0:20:30 | rate |
---|
0:20:32 | becomes quite high; this is |
---|
0:20:33 | ICSI meeting data, this is |
---|
0:20:37 | with unimodal models |
---|
0:20:39 | the same sort of phenomenon happens when you use mixture models with, you know |
---|
0:20:43 | say, eight components per state |
---|
0:20:46 | so here the simulated error rate is around two percent, a little bit less than |
---|
0:20:51 | two percent |
---|
0:20:52 | when i do frame-level resampling the error rate increases just a little bit; it's a |
---|
0:20:56 | very small increase; it does increase, but by very little |
---|
0:21:01 | now when i reintroduce |
---|
0:21:04 | within-state dependence |
---|
0:21:06 | all of a sudden the error rate becomes around twelve percent, so the error rate has |
---|
0:21:10 | increased by a factor of six |
---|
0:21:12 | when i introduce |
---|
0:21:14 | within-phone dependence |
---|
0:21:17 | the error rate increases again by about a factor of two |
---|
0:21:23 | and then when i go to words it increases |
---|
0:21:27 | again, almost by a factor of two; this typically is the largest jump on |
---|
0:21:31 | the corpora that we've worked with |
---|
0:21:33 | when you move from frame |
---|
0:21:35 | to state it typically increases by about a factor of six |
---|
0:21:39 | so you think about this and you can make an argument, and the argument is that |
---|
0:21:44 | the distributional assumption that we make with GMMs |
---|
0:21:53 | is not such a big deal; i mean it's important, but it's not such a |
---|
0:21:57 | big deal |
---|
0:21:57 | the biggest single factor is this reintroduction of dependence; it's the dependence in the data |
---|
0:22:04 | that the models are finding surprising; i mean, you know |
---|
0:22:08 | everybody knew the independence assumptions were false, so i'm not |
---|
0:22:13 | saying that's surprising, but i personally |
---|
0:22:18 | was really surprised, and it took a long time to come around |
---|
0:22:24 | to the fact that it really is the independence |
---|
0:22:29 | assumptions that drive the errors, and we tend to work around this by other sorts of things |
---|
0:22:35 | so this is a summary of the matched-case results; the claim is that |
---|
0:22:40 | when we have matched training and test |
---|
0:22:43 | it's the independence assumptions that are the big deal |
---|
0:22:46 | it's the model's inability to account for dependence in the data that is |
---|
0:22:52 | derailing things |
---|
0:22:53 | the marginal distributions |
---|
0:22:55 | not so much |
---|
0:22:57 | so, surprisingly, also, in a different, later study |
---|
0:23:01 | we |
---|
0:23:03 | applied this formalism to ask the question: what is discriminative training doing; you |
---|
0:23:08 | know, you start with the maximum likelihood model, you apply MMI |
---|
0:23:13 | what's happening there; so you apply this formalism and you see that in fact |
---|
0:23:20 | MMI is actually compensating for these independence assumptions, in a |
---|
0:23:28 | way that i don't completely understand; i have hypotheses about how this might work |
---|
0:23:34 | but |
---|
0:23:36 | so here you have a |
---|
0:23:39 | really complicated procedure that's a little hokey |
---|
0:23:42 | that took people, many people in this room, twenty years to |
---|
0:23:48 | get to work, right |
---|
0:23:49 | and once it was shown to work on large vocabulary, it took many |
---|
0:23:54 | labs an additional few years to get it to work in their lab |
---|
0:23:58 | you know, now it's pretty routine to do this, but it was a |
---|
0:24:02 | struggle to get this to work, and my point is that what it's |
---|
0:24:07 | doing is compensating for the independence assumptions; we know the independence assumptions are a problem |
---|
0:24:13 | i'm not saying that it's gonna be easy to find a model that relaxes |
---|
0:24:17 | the independence assumptions |
---|
0:24:19 | but perhaps those twenty years of effort |
---|
0:24:21 | would have been better spent |
---|
0:24:23 | attacking that problem |
---|
0:24:26 | so what about mismatched training |
---|
0:24:30 | so for the ICSI meeting corpus |
---|
0:24:32 | we have near-field audio |
---|
0:24:37 | collected from, you know |
---|
0:24:40 | head-mounted microphones, and there was a microphone array of some sort |
---|
0:24:47 | but the meeting room was quiet, it was small, and it had a normal amount of |
---|
0:24:52 | reverb, the kind of reverb humans expect |
---|
0:24:55 | in a room |
---|
0:24:56 | if you listen to these two channels you can tell that they're different |
---|
0:25:01 | but it's not like the far-field channel is radically different when you listen to it |
---|
0:25:07 | it sounds a little different, but it's perfectly intelligible |
---|
0:25:13 | so we explored |
---|
0:25:15 | training and testing with near-field, training and testing with far-field, and the mismatched condition where we |
---|
0:25:20 | train on near-field data and test on far-field |
---|
0:25:24 | so |
---|
0:25:26 | i'll just say that it's harder than it looks |
---|
0:25:30 | you have to be careful, and you have to think about what you're trying |
---|
0:25:33 | to do when you run these types of experiments; in particular |
---|
0:25:38 | there were a lot of issues that we went through |
---|
0:25:43 | to get the near-field channel and the far-field channel exactly parallel, so that |
---|
0:25:49 | we were actually measuring |
---|
0:25:51 | what we wanted to measure; it's a somewhat |
---|
0:25:55 | intricate lab setup |
---|
0:26:01 | so the paper that we wrote in ICASSP, i don't know how well |
---|
0:26:05 | it describes it, but it attempted to describe it, and on |
---|
0:26:10 | the ICSI website there's a technical report that's reasonably good |
---|
0:26:14 | that describes a lot of this stuff, so i'm not gonna belabor this, but there was |
---|
0:26:19 | a lot of effort that we had to go through |
---|
0:26:23 | so here's the bottom line |
---|
0:26:26 | so first let's look at the green and the red curves |
---|
0:26:31 | so |
---|
0:26:33 | the green and the red curves are the matched near-field and far-field, and notice |
---|
0:26:38 | that they track each other pretty well; they're different |
---|
0:26:40 | the far-field real data is obviously harder |
---|
0:26:43 | but interestingly, look down here at the simulated and the frame-resampled |
---|
0:26:48 | error rates |
---|
0:26:49 | they're still really low, you know |
---|
0:26:52 | the matched far-field one is higher, it's worse, but it's still really low, and in |
---|
0:26:58 | particular these error rates are around two percent, right, so |
---|
0:27:06 | let's think about that; but before we think about that, notice the mismatched simulation |
---|
0:27:12 | error rate |
---|
0:27:12 | you know, this is where we wanna concentrate, so this is what |
---|
0:27:17 | we wanna think about, right |
---|
0:27:19 | so the simulated case |
---|
0:27:21 | we don't need to worry about this other stuff; it's the simulated case that we're |
---|
0:27:25 | gonna concentrate on |
---|
0:27:26 | so |
---|
0:27:29 | when you simulate data from near-field models and you recognize it with near-field |
---|
0:27:34 | models, the error rate is essentially zero |
---|
0:27:37 | so that means the problem is essentially separable |
---|
0:27:43 | likewise, when i take the far-field models and i simulate data from the far-field |
---|
0:27:48 | models |
---|
0:27:49 | and i |
---|
0:27:51 | recognize it with the far-field models |
---|
0:27:53 | i get essentially no errors |
---|
0:27:55 | again that means the problem is essentially separable |
---|
0:27:59 | so in these two individual spaces, you know, where the frames, in |
---|
0:28:06 | the signal processing, the MFCCs, are generated, in the matched cases these are essentially separable |
---|
0:28:13 | problems; but all of a sudden when i take |
---|
0:28:18 | the near-field models and look at the far-field data, it's |
---|
0:28:23 | dramatically not separable |
---|
0:28:26 | so that means that the transformation that takes place between the near-field data and |
---|
0:28:32 | the far-field data is not |
---|
0:28:35 | well, the front end is not invariant under this transformation, and |
---|
0:28:41 | that lack of invariance |
---|
0:28:43 | is what's causing this huge increase in error |
---|
0:28:47 | again, it's not surprising that the front end is not invariant to this |
---|
0:28:52 | transformation; there's a little bit of reverb, there's a little bit of noise; but what's |
---|
0:28:57 | remarkable is |
---|
0:28:58 | that it's |
---|
0:29:00 | solely that problem that causes |
---|
0:29:03 | this huge degradation in error |
---|
0:29:06 | and that is actually fairly remarkable |
---|
0:29:10 | so |
---|
0:29:13 | a |
---|
0:29:14 | so there are many more results |
---|
0:29:17 | involving mixture models; we reran all of these results with, i think |
---|
0:29:23 | eight-component mixture models, and we see the same sort of behavior |
---|
0:29:27 | we've reproduced all the discriminative training results; we asked |
---|
0:29:32 | does discriminative training somehow magically alleviate |
---|
0:29:38 | the mismatched case, and the answer is no |
---|
0:29:41 | and, i think thanks to Morgan really early on, a natural question is how |
---|
0:29:46 | does MLLR work in this setting; we looked at that, and with MLLR you |
---|
0:29:51 | can reduce |
---|
0:29:52 | some of the discrepancies, as you would expect |
---|
0:29:54 | but MLLR is a simple linear transformation, and whatever transformation between these two channels is |
---|
0:30:00 | happening |
---|
0:30:01 | it's some peculiar nonlinear transformation, right, so it's unreasonable |
---|
0:30:06 | to expect MLLR to do |
---|
0:30:08 | very well; but this task harness is a really good test harness |
---|
0:30:13 | for evaluating |
---|
0:30:14 | you know, how invariant to these transformations our front ends are |
---|
0:30:20 | and so we've explored that a little bit |
---|
0:30:23 | and it's not so encouraging |
---|
0:30:26 | alright, well, i think i'll end there; i've |
---|
0:30:31 | sort of blathered on long enough, i think, so i'll turn it over to Jordan, and |
---|
0:30:37 | and he |
---|
0:30:37 | he will |
---|
0:30:38 | give a higher-level view of the whole idea, and then we'll have |
---|
0:30:42 | questions |
---|
0:30:54 | [speaker change; microphone check] |
---|
0:31:18 | okay, one, two, three |
---|
0:31:20 | alright, so it turns out there were two parts to this project |
---|
0:31:26 | Steve told you about the technical stuff, but we also thought that we'd like to |
---|
0:31:30 | figure out |
---|
0:31:31 | you've been hearing a lot about how wonderful speech recognition is during this meeting, and |
---|
0:31:35 | we thought we would actually like to understand what the community actually thought about what |
---|
0:31:40 | speech recognition was like |
---|
0:31:42 | so we rolled out a survey, and i called a bunch of people; many of you |
---|
0:31:48 | were called by me |
---|
0:31:50 | and this is called the rats right |
---|
0:31:59 | and what we wanted to do is just see what people thought about how speech recognition
---|
0:32:03 | really worked. we were hoping that we would find some evidence to persuade
---|
0:32:09 | the government maybe to put in some money and fund some speech recognition research, which
---|
0:32:14 | we haven't seen in a long time
---|
0:32:17 | but really we just wanted to find out what was going on
---|
0:32:20 | and so we put together a little survey team |
---|
0:32:24 | jamieson worked with me; she's an analyst who's been in speech for a very
---|
0:32:29 | long time, and we engaged frederick, who is a specialist at doing surveys
---|
0:32:36 | and we designed a snowball survey
---|
0:32:40 | a snowball survey is very interesting:
---|
0:32:44 | you start with a small group of people that you know, and you
---|
0:32:47 | ask them the questions, and then you ask them who else to ask
---|
0:32:51 | and you just follow your nose, and what that means is, although it's
---|
0:32:56 | not entirely unbiased, it's as unbiased as you can do if you don't know what the
---|
0:33:00 | sampling population is going to be
---|
0:33:06 | so we wanted to know what was going on: what do people think are the
---|
0:33:10 | failures, and what remedies have people tried, and how did they work
---|
0:33:17 | so we did this snowball sampling
---|
0:33:19 | here's the questionnaire. i don't want to spend a lot of time on this, but just
---|
0:33:23 | take a look
---|
0:33:25 | the interesting questions are
---|
0:33:28 | the last one on the slide, where has the current technology failed,
---|
0:33:33 | and the first one on the slide, what do you think is broken,
---|
0:33:36 | and then questions about sort of what you did about what was going on, and
---|
0:33:41 | then if there's other stuff
---|
0:33:45 | the survey participants tended to be old,
---|
0:33:49 | i think
---|
0:33:50 | that's sort of how our snowball worked. not terribly old, but there's not a lot
---|
0:33:54 | of young people in this, so ages were thirty-five to seventy
---|
0:33:58 | we spoke to about eighty-five people
---|
0:34:03 | and they had an interesting mix of jobs. most of them were in research, some were
---|
0:34:09 | in development, some were both
---|
0:34:11 | there was a small number of management people, and then people who self-described
---|
0:34:17 | their jobs as something more detailed
---|
0:34:22 | but mostly these are r&d people or managers doing speech research or language work of one
---|
0:34:30 | sort or another
---|
0:34:35 | so here's what you told us
---|
0:34:39 | well,
---|
0:34:42 | natural language is a real problem, and acoustic modeling is a real problem,
---|
0:34:47 | and everything else that we do is broken, more or less
---|
0:34:51 | so i think the community sort of has this feeling, not the people trying to
---|
0:34:55 | sell speech recognition to the management, but the people trying to make it work have
---|
0:35:00 | a feeling that all is not really well in the technology
---|
0:35:05 | so lots of people, when they point fingers, are pointing fingers at the language
---|
0:35:11 | itself and at acoustic modeling
---|
0:35:14 | and there's a third category, which says "not robust"; let's say this is what steve's
---|
0:35:20 | stuff
---|
0:35:21 | was about
---|
0:35:22 | so there's something going on with this technology that makes it not work very well |
---|
0:35:27 | and when we asked people what they had tried
---|
0:35:30 | to fix things, the answer was: everything
---|
0:35:34 | people have mucked around with the training, some people have tried all kinds of different
---|
0:35:38 | adjustments to their systems
---|
0:35:54 | alright anyway |
---|
0:35:58 | one of the interesting things that people tried to do
---|
0:36:02 | many of us have tried to fix pronunciations, either in dictionaries or in rules for
---|
0:36:07 | pronunciation, and to my knowledge everyone has found that this is a waste
---|
0:36:12 | it's pretty interesting: so that's not a way to fix the systems that we
---|
0:36:16 | currently build. so we've tried all kinds of stuff
---|
0:36:21 | and so i think
---|
0:36:22 | our takeaway from the survey is that people
---|
0:36:27 | actually don't believe the technology is very solid, and we try a lot of things
---|
0:36:31 | to fix it. and then we looked a little bit at the literature; the
---|
0:36:35 | literature survey is in the icsi report, which you can go read. but
---|
0:36:40 | we found a little chart that looks sort of like this, from a review by
---|
0:36:43 | furui
---|
0:36:45 | and it says
---|
0:36:48 | lvcsr is far from being solved: background noise, channel distortion, foreign accents,
---|
0:36:52 | casual disfluent speech, and unexpected topic shifts all cause automatic systems to make egregious
---|
0:36:57 | errors. and that's what everybody said; anybody who's looked at the field says, well, this
---|
0:37:02 | technology is okay sometimes, but it fails a lot
---|
0:37:08 | so our conclusion was:
---|
0:37:10 | the technology is old. i'd point out that the models most of us use,
---|
0:37:14 | hidden markov models, are, as far as i know, the thing that
---|
0:37:18 | was written down by baum and colleagues around nineteen sixty-nine
---|
0:37:22 | so maybe that's, i think, kind of one of our issues here
---|
0:37:29 | so when these systems fail, they degrade not gracefully, as you would prefer, but
---|
0:37:35 | catastrophically and quickly
---|
0:37:40 | speech recognition performance is substantially behind how humans do in almost every circumstance |
---|
0:37:48 | and |
---|
0:37:49 | they're not robust |
---|
0:37:51 | so i wanted that to be sort of my quick overall overview of what the survey was,
---|
0:37:57 | and it's available on the icsi website, in the program. but i wanted
---|
0:38:03 | to add a couple of personal comments about my analysis of what's happening
---|
0:38:08 | so in these i'm not representing the government; actually i want to talk
---|
0:38:13 | to you about my own personal analysis
---|
0:38:17 | so here there are three points. first point:
---|
0:38:21 | if you have a model and you spend a lot of time hill
---|
0:38:24 | climbing to the optimum performance,
---|
0:38:26 | and it doesn't perform optimally at that spot,
---|
0:38:29 | you've got the wrong model
---|
0:38:32 | hidden markov models were proved to converge by baum and his co-workers, i think the idea
---|
0:38:37 | was in nineteen sixty-nine
---|
0:38:39 | that proof has two parts
---|
0:38:41 | one, it says you can always make a better model
---|
0:38:45 | two, it says you get the optimal parameters if the data came from the model
---|
0:38:51 | that second part is
---|
0:38:54 | absolutely not true in our speech recognition systems. we're hill climbing on data that doesn't
---|
0:39:00 | match the model, and we're not going to find the answer that way
---|
0:39:04 | so we've spent a lot of time
---|
0:39:06 | trying to account, trying to adapt, for the problem, but we've got the wrong model
---|
0:39:13 | this is a personal one
---|
0:39:15 | if you use sixty-four gaussians to fit some distribution, you have no idea what
---|
0:39:19 | the distribution is
---|
0:39:21 | the original
---|
0:39:23 | multi-gaussian distributions were done with a single mean, and i understand that, but that's not
---|
0:39:29 | where we are
---|
0:39:30 | and so my corollary i think speaks for itself |
---|
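One way to see the point: a many-component Gaussian mixture can fit almost any smooth density, so a good fit tells you essentially nothing about the generating distribution. A toy 1-D EM sketch (eight components rather than sixty-four, for speed; the data, seed, and iteration count are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Data from a *Laplace* distribution -- not from any Gaussian mixture.
x = rng.laplace(0.0, 1.0, 4000)

# Plain 1-D EM for a K-component Gaussian mixture.
K = 8
mu = np.quantile(x, np.linspace(0.05, 0.95, K))  # spread the means out
var = np.full(K, 1.0)
w = np.full(K, 1.0 / K)

def loglik():
    comp = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return float(np.log(comp.sum(axis=1)).sum())

before = loglik()
for _ in range(40):
    comp = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    r = comp / comp.sum(axis=1, keepdims=True)           # E-step: responsibilities
    nk = r.sum(axis=0)
    w = nk / len(x)                                      # M-step: weights,
    mu = (r * x[:, None]).sum(axis=0) / nk               # means,
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk  # variances
    var = np.maximum(var, 1e-2)                          # guard against collapse
after = loglik()

# The likelihood climbs and the mixture fits the Laplace data well --
# but nothing in (w, mu, var) reveals that the data was never a mixture.
print(before, after)
```

The fitted parameters are a fine density approximation and a useless description of the underlying process, which is the corollary in miniature.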
0:39:37 | and finally, if the system you build fails for fifty percent of the population entirely,
---|
0:39:43 | and then, for the people it works for, as soon as they walk into a reverberant
---|
0:39:46 | environment or a noisy place it fails,
---|
0:39:48 | it's broken
---|
0:39:51 | and i believe speech recognition is terribly broken
---|
0:39:55 | so, for what we really wanted to do, i want to draw an
---|
0:39:59 | analogy, so i'm going to draw an analogy between
---|
0:40:03 | transcription and transportation
---|
0:40:06 | and for transportation, man, this is what i want: something that's sleek and lovely and
---|
0:40:12 | easy to use and doesn't break
---|
0:40:15 | and what we built is this
---|
0:40:20 | it runs on two wheels, it will get you there eventually, and you spend almost all your
---|
0:40:24 | time dealing with problems that have nothing to do with the transportation part
---|
0:40:28 | and so i believe that that's what we've done with speech recognition |
---|
0:40:32 | and it's time for new models, and
---|
0:40:35 | i urge you to think about the models
---|
0:40:38 | and not so much about the data
---|
0:40:54 | and with that,
---|
0:40:56 | we're done, okay
---|
0:40:58 | i assume that this is going to generate a lot of discussion and a lot of questions
---|
0:41:02 | if it doesn't, then something is wrong with us
---|
0:41:06 | this community would be truly broken
---|
0:41:10 | okay, who's first? over there
---|
0:41:20 | a question about the resampling |
---|
0:41:24 | as i think about this, you have a sort of sequence of random variables, and
---|
0:41:27 | you're turning a knob on the independence between them
---|
0:41:30 | and
---|
0:41:31 | one of the things that turning that knob does is,
---|
0:41:35 | as things become more dependent, there's
---|
0:41:37 | less information
---|
0:41:40 | what i'm wondering is how much of the word error rate degradation you see
---|
0:41:44 | might be associated simply with the fact that there's just less information
---|
0:41:48 | in streams that are more dependent
---|
0:41:54 | is this working?
---|
0:41:56 | so i guess i don't understand the question
---|
0:41:59 | ah, i mean, i,
---|
0:42:02 | so you're right, so here is an answer, and you can tell me if i'm
---|
0:42:07 | close to understanding: the model assumes that each frame has an independent amount of information
---|
0:42:15 | but we know that the frames do not have independent amounts of information; the
---|
0:42:20 | amount of information
---|
0:42:22 | going from frame to frame varies enormously
---|
0:42:25 | but the model treats every single one of those frames as independent, and that's
---|
0:42:31 | an egregious violation of the assumption
---|
0:42:34 | so what
---|
0:42:37 | i guess i was thinking about was,
---|
0:42:39 | if i ask you to say a word ten times, versus i ask ten people to
---|
0:42:42 | say the word once,
---|
0:42:43 | and i'm trying to figure out what's the word,
---|
0:42:45 | what the ten people say might actually provide more information in the data
---|
0:42:49 | itself
---|
0:42:51 | and i'm just wondering if that might at all
---|
0:42:53 | contribute to why there's more
---|
0:42:57 | information as you sample
---|
0:42:59 | from more disparate parts of the training database
---|
0:43:07 | well, i think, i think what you're actually saying is, your remark is
---|
0:43:15 | explaining
---|
0:43:17 | why |
---|
0:43:18 | so the model |
---|
0:43:20 | i think |
---|
0:43:21 | many people, this is a question they have: so when you when you have
---|
0:43:26 | all the frames and they're independent, when you do frame resampling, the frames come from
---|
0:43:31 | all sorts of different speakers, and when you when you line them up, you know,
---|
0:43:35 | like the ones i played, they come from all sorts of different speakers. but then,
---|
0:43:40 | as soon as i start
---|
0:43:43 | increasing the segment size, each one of those segments is going to come from one
---|
0:43:49 | speaker, right? is this sort of along the lines of what you're thinking? well, the
---|
0:43:53 | notion of speaker is part of the dependence in the data, right? the fact
---|
0:43:59 | that each one of these frames came
---|
0:44:01 | from a single speaker, that's dependence
---|
0:44:05 | and so that interframe dependence,
---|
0:44:07 | well, the model knows nothing about it
---|
0:44:09 | and so whether that's causing a problem or not, that's obscured in your
---|
0:44:14 | data
---|
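The "segment size as a dependence knob" idea in this exchange can be mimicked on synthetic data: resampling in length-k segments destroys between-frame dependence entirely at k = 1 and preserves more of it as k grows. A sketch, with an AR(1) sequence invented as a stand-in for real frames:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "utterance": an AR(1) sequence, so adjacent frames are dependent.
n = 20000
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.9 * x[t - 1] + rng.normal()

def resample(seq, k, rng):
    """Rebuild a sequence from randomly chosen length-k segments of seq."""
    starts = rng.integers(0, len(seq) - k, size=len(seq) // k)
    return np.concatenate([seq[s:s + k] for s in starts])

def lag1_corr(seq):
    """Correlation between each frame and the next: a dependence measure."""
    return float(np.corrcoef(seq[:-1], seq[1:])[0, 1])

# k = 1 shuffles single frames: the dependence the model ignores is gone.
# Growing k restores within-segment structure, approaching the real data.
for k in (1, 10, 100):
    print(k, lag1_corr(resample(x, k, rng)))
```

At k = 1 the lag-1 correlation is near zero (the situation where the HMM's independence assumption actually holds); by k = 100 it is close to the original sequence's, mirroring how larger resampled segments reintroduce the speaker- and time-dependence the model never sees.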
0:44:22 | of course all of us |
---|
0:44:23 | you know, as you said, all of us have been aware of this for
---|
0:44:26 | a long time, and i think there has been a lot of effort at trying
---|
0:44:29 | to undo it
---|
0:44:31 | it's kind of, when we say the model has this independence assumption, that's
---|
0:44:37 | sort of half true
---|
0:44:39 | because the features that we use |
---|
0:44:42 | go over several frames, so of course they're not actually independent. you know, when you
---|
0:44:47 | synthesise, it's not clear what you really synthesise, 'cause you have to synthesise something that
---|
0:44:51 | has,
---|
0:44:52 | may have an independent value, but it has to have a derivative that matches the
---|
0:44:56 | previous thing, and so on, but
---|
0:44:58 | but we've all tried things like segmental models,
---|
0:45:02 | which don't have that independence assumption
---|
0:45:04 | right, we take a segment,
---|
0:45:07 | a whole phoneme, so you're
---|
0:45:09 | skipping the state independence assumption and the frame independence assumption and just going straight
---|
0:45:15 | to the context-dependent phoneme
---|
0:45:18 | and now you're picking a sample from the one distribution for that context-dependent phoneme
---|
0:45:24 | and that always works worse
---|
0:45:28 | maybe you can do something with that, or combine it with the hidden markov model
---|
0:45:32 | and gain half a point, but by itself it always works a lot
---|
0:45:36 | worse
---|
0:45:38 | and unless you, unless you cripple the hidden markov model by saying we're only going to
---|
0:45:43 | use context-independent models, then this one might work better, but
---|
0:45:48 | so the question is,
---|
0:45:49 | it's not that we haven't tried
---|
0:45:51 | people have tried to make models that avoid those things, and almost all of those
---|
0:45:56 | things got worse. the flip side of that is, you said mpe or mmi,
---|
0:46:00 | and all these things are attempts
---|
0:46:01 | to
---|
0:46:02 | avoid
---|
0:46:04 | that assumption, but they don't; they just reduce the error a bit,
---|
0:46:08 | they reduce the error by
---|
0:46:10 | ten percent, fifteen percent relative,
---|
0:46:13 | basically a small bit; it's similar to any of the other
---|
0:46:18 | tricks we do. so do you have any comment on those two observations?
---|
0:46:21 | well, i mean,
---|
0:46:23 | i'm not sure what,
---|
0:46:25 | so a natural question, which i think is the first part of what
---|
0:46:29 | you're saying, is: why have many people tried and failed to beat hmms with
---|
0:46:36 | models that take into account
---|
0:46:40 | the dependence structure in the data? so why hasn't that worked?
---|
0:46:45 | well,
---|
0:46:47 | i would say that,
---|
0:46:49 | that,
---|
0:46:50 | i do not believe that anyone has any quantitative notion of why these things fail
---|
0:46:57 | on the data
---|
0:46:59 | i'm not saying that we should go back to these methods, maybe we should, but,
---|
0:47:04 | well, i will give you an example of something. you know, twenty years ago people
---|
0:47:09 | gave up on neural networks
---|
0:47:11 | and all of a sudden, you know, neural networks are,
---|
0:47:16 | are
---|
0:47:16 | the new,
---|
0:47:18 | the new,
---|
0:47:20 | coming,
---|
0:47:21 | i don't know what the right biblical phrase is, but hallelujah. so what it
---|
0:47:28 | takes is somebody who believes in something and tries hard to do it, and i
---|
0:47:35 | think that here is the problem
---|
0:47:37 | we should be, i don't know what the solution is, i honestly don't know what
---|
0:47:41 | the solution is, but i will say also that the mmi thing, no, and i
---|
0:47:46 | don't believe anyone would deny this: mmi was not designed to overcome independence
---|
0:47:53 | you know, it's as if we knew that the maximum likelihood solution to this problem was not the
---|
0:47:58 | right solution, so we found an alternative model selection procedure that leaves us in a
---|
0:48:04 | different place
---|
0:48:05 | again, if the model were correct, we wouldn't have to do that
---|
0:48:16 | coming back to the results, the simulation results you presented,
---|
0:48:20 | i think these are highly suggestive, because
---|
0:48:24 | by changing the data to fulfil your assumptions,
---|
0:48:29 | the error rates you get are not the error rates we
---|
0:48:32 | expect from the real data
---|
0:48:35 | because you fit
---|
0:48:36 | the problem to your assumptions, but we have to go the other way around. so
---|
0:48:40 | what error rates we really can expect if we
---|
0:48:45 | improve on modeling is still, that's an open question, isn't it?
---|
0:48:48 | exactly, and that's absolutely right. in no way am i claiming
---|
0:48:54 | that if we could model dependence in the data, we would be seeing these
---|
0:48:58 | error rates, the frame resampling error rates; that's absolutely correct
---|
0:49:04 | i mean, so
---|
0:49:05 | presumably we could do better. the other point, though, is i think that
---|
0:49:12 | a lot of the,
---|
0:49:17 | this sort of brittleness that we experience
---|
0:49:20 | in our models, this is a conjecture, is due to this very
---|
0:49:25 | poor fit to the temporal structure
---|
0:49:31 | and, you know, we have one way of thinking
---|
0:49:35 | of these results, you know, the frame resampling results, that says if you
---|
0:49:40 | forget about the temporal structure in the data, the models work really well, but as soon
---|
0:49:46 | as you introduce real temporal structure in the data, the models start failing
---|
0:49:51 | and surely with speech, i think, temporal structure is important
---|
0:49:57 | i think |
---|
0:50:04 | here is the my |
---|
0:50:10 | by a shock i see how a |
---|
0:50:15 | speechless |
---|
0:50:16 | or thai interested party |
---|
0:50:19 | yes the line |
---|
0:50:25 | i don't think |
---|
0:50:27 | a |
---|
0:50:28 | i, when you violate independence assumptions, it's not
---|
0:50:34 | that it stops you from extracting information; speech doesn't necessarily fail
---|
0:50:41 | you know, to work
---|
0:50:44 | i mean, i can build a purpose-built system that satisfies the
---|
0:50:49 | independence assumption
---|
0:50:51 | so i don't think
---|
0:50:52 | you know,
---|
0:50:53 | it really follows that
---|
0:50:55 | from "my models are wrong,"
---|
0:50:58 | "the models fail," and so
---|
0:51:01 | i think you don't want to be thinking about extracting,
---|
0:51:06 | getting the right information; the problem is how you account for the information
---|
0:51:10 | it's a question of how you represent the information
---|
0:51:15 | and so if you misrepresent it, you lose more or less in the process
---|
0:51:19 | the cost is the misrepresentation
---|
0:51:21 | so that the false alarms |
---|
0:51:25 | three |
---|
0:51:28 | something like |
---|
0:51:29 | some work |
---|
0:51:31 | have you might have |
---|
0:51:34 | but works if that's not right |
---|
0:51:38 | work land farm |
---|
0:51:41 | i rate is |
---|
0:51:44 | just done the same tendency |
---|
0:51:47 | these days |
---|
0:52:26 | but |
---|
0:52:27 | but |
---|
0:52:34 | i like |
---|
0:52:45 | when you know all |
---|
0:52:55 | one thing that works really poorly
---|
0:52:58 | is if you have a mismatched representation
---|
0:53:01 | so, think about some model that's representing text, okay
---|
0:53:07 | you can represent it as raster-scanned text,
---|
0:53:09 | or you could represent it as fonts
---|
0:53:13 | and if you change the size of the image,
---|
0:53:16 | the two things are very different: with the font,
---|
0:53:20 | it's an easy representation change, and with the raster, it just
---|
0:53:25 | breaks the whole thing
---|
0:53:27 | so you have to ask yourself, is the problem that we're seeing
---|
0:53:31 | the fact that we have a representation for the problem that doesn't match?
---|
0:53:37 | that, i think, is the realisation
---|
0:53:40 | mm, this tells us something. a comment:
---|
0:53:43 | as you go further, from states to phones and phones to
---|
0:53:48 | segments,
---|
0:53:49 | the data is becoming more and more speaker-dependent. maybe the problem is your
---|
0:53:54 | models, i mean, that your models aren't,
---|
0:53:57 | i mean, if you made your models more speaker-dependent, would we have seen the
---|
0:54:02 | same difference?
---|
0:54:03 | well, it has nothing to do with the frame-dependent sampling, but, well, what
---|
0:54:08 | i was trying to say before is that that is a form of dependence
---|
0:54:13 | that,
---|
0:54:14 | that,
---|
0:54:15 | the model knows nothing about,
---|
0:54:17 | this form of dependence
---|
0:54:19 | you know, there are many forms of dependence in data, and knowing what dependence
---|
0:54:24 | is is a hard thing for a human to understand, right
---|
0:54:28 | but that form of dependence is precisely there, and it may be causing the problem
---|
0:54:36 | so there were a number of speakers, there are relatively few speakers
---|
0:54:42 | in this corpus, and so we had to sort of cap them so that there
---|
0:54:46 | wasn't a single dominant speaker
---|
0:54:50 | which i mean i think that would be the last |
---|
0:54:56 | so let me sort of continue with what was being asked again
---|
0:55:02 | we know the model is wrong
---|
0:55:05 | models are always wrong
---|
0:55:08 | and so
---|
0:55:11 | the way you're,
---|
0:55:13 | you can argue that the model is wrong mathematically, or you can argue that it's
---|
0:55:17 | wrong because it doesn't match human performance, or what we think
---|
0:55:22 | of as human performance. i think we may overestimate human performance a little bit, but
---|
0:55:26 | it clearly doesn't match it
---|
0:55:29 | but in fact, you know, if you look at all the research that all of
---|
0:55:32 | us do,
---|
0:55:34 | we do at least feel like we're attacking those problems. so we say we're going to use
---|
0:55:39 | font models, to use your analogy; we allow our models to have, we scale
---|
0:55:45 | them like fonts, right, we put in, we say we're going to estimate a scale
---|
0:55:49 | factor, and that scale factor is not a simple,
---|
0:55:52 | it can be a simple one, or it can be a matrix, you know, much
---|
0:55:54 | more complicated than what you do with a font, and we constrain it to be
---|
0:55:58 | the same, we say the speaker is the same for the whole sentence
---|
0:56:01 | we do speaker adaptive training, so we try to remove the differences
---|
0:56:07 | we try to normalize all the speakers to the same place and then insert the
---|
0:56:11 | properties of the new speaker again, right,
---|
0:56:14 | which is sort of like the analogy of a font
---|
0:56:16 | we try to do all of these things, we certainly try to model channels
---|
0:56:23 | we do all of these with linear models and nonlinear models
---|
0:56:28 | and
---|
0:56:29 | we get small improvements
---|
0:56:31 | so my question, let me turn the question around:
---|
0:56:34 | the model is wrong
---|
0:56:36 | what's the right model?
---|
0:56:38 | not what does it do, but what is the right model?
---|
0:56:42 | so,
---|
0:56:43 | i think we all don't know the answer to that question, but let me tell
---|
0:56:47 | you about another phenomenon that i would like to see us emulating
---|
0:56:52 | unless you've been following particle physics, but,
---|
0:56:56 | in particle physics,
---|
0:56:58 | when you measure particle interactions, the probabilities of the interactions are governed,
---|
0:57:03 | basically, by feynman diagrams
---|
0:57:05 | and so to compute, for a particle interaction, like using the supercollider, to compute
---|
0:57:11 | a cross-sectional area for one of the interactions takes a big computer about
---|
0:57:15 | a week to look at all the feynman diagrams
---|
0:57:19 | one of the physics guys has just discovered a geometric object
---|
0:57:24 | in four-space, and in the geometric object it turns out that each
---|
0:57:28 | little facet has
---|
0:57:32 | an area that is exactly the solution
---|
0:57:34 | to that problem of computing the cross-sectional area
---|
0:57:39 | and you can now do the computations
---|
0:57:43 | in about five minutes with a pencil and paper
---|
0:57:47 | so
---|
0:57:48 | there's a place where a difference in the model has a huge effect
---|
0:57:54 | on making things work. so i don't think, i don't believe the right model lies in
---|
0:57:59 | the vein of the kinds of things that we've always been doing
---|
0:58:02 | i think we need to have some radical reinterpretation of the way we look
---|
0:58:06 | at the data, the way we look at the words
---|
0:58:09 | maybe the answer lies in one place,
---|
0:58:11 | maybe
---|
0:58:14 | i took a degree in linguistics, as i thought speech wasn't an easy problem from a
---|
0:58:18 | theory point of view, and i learned to distrust everything a linguist said
---|
0:58:24 | maybe we should distrust most of them, but
---|
0:58:26 | maybe there's something different that we should be doing
---|
0:58:28 | so i would love us to just go look outside this place that we've been exploring
---|