0:00:16 | thank you so welcome back after the lunch |

0:00:19 | my name's frank seide i'm from microsoft research in beijing and this is joint |

0:00:24 | work with my colleague dong yu who happens to be chinese but is actually based in redmond |

0:00:29 | and of course there are a lot of contributors to this work inside the company and |

0:00:33 | outside and also thank you very much to the people sharing slide material |

0:00:38 | okay let me start with like a personal story of how i got into this because |

0:00:42 | i'm sort of an unlikely expert at this because until two thousand eleven i had |

0:00:47 | no i mean until two thousand ten i had no idea what neural networks were deep ones or not |

0:00:51 | so in two thousand ten |

0:00:52 | my colleague dong yu who cannot be here today came to visit us in beijing and told |

0:00:58 | us about this new speech recognition result that they had |

0:01:02 | and he told me about a technology that i had never heard about called dbn |

0:01:07 | and said |

0:01:08 | this was sort of invented by some professor in toronto that i also had never heard |

0:01:12 | about |

0:01:14 | so he and his manager at the time had invited geoffrey hinton |

0:01:19 | this professor to come to redmond with a few students and work on applying |

0:01:23 | this to speech recognition |

0:01:25 | and at the time he got a |

0:01:26 | sixteen percent relative error reduction |

0:01:29 | out of applying deep neural networks |

0:01:31 | and this was for a voice search task a relatively small number of hours of training |

0:01:36 | you know sixteen percent is really big a lot of people spend ten years |

0:01:40 | to get a sixteen percent error reduction |

0:01:42 | so my first thought about this was |

0:01:44 | sixteen percent wow what's wrong with the baseline |

0:01:55 | so we said well why don't we collaborate on this and try how this carries over into |

0:01:59 | a large-scale task that is switchboard |

0:02:02 | and the key thing that was actually invented here well we'll talk about the classic ann hmm |

0:02:07 | i think this reference is probably covered |

0:02:10 | by what we heard this morning from nelson |

0:02:12 | a little bit |

0:02:13 | too late |

0:02:15 | so the classic ann hmm and then the deep network the dbn |

0:02:19 | which actually does not stand for dynamic bayesian networks as i learned |

0:02:23 | at that point |

0:02:24 | and then dong yu put in this idea of |

0:02:26 | just using tied triphones as modeling targets like we did in gmm based systems |

0:02:32 | okay so |

0:02:34 | then fast forward like half a year i was reading papers and tutorials to start and |

0:02:38 | finally we got to the point where we got the first |

0:02:41 | results so this is our gmm baseline and i started the training and the next day i had |

0:02:47 | the first iteration |

0:02:48 | it was like twenty two percent so okay it seems to not be completely off |

0:02:53 | the next day i come back |

0:02:55 | twenty percent |

0:02:56 | so fourteen percent relative and i sent a congratulations email to my colleague right |

0:03:00 | then it ran on the next day i came back |

0:03:03 | eighteen percent |

0:03:04 | and really from that one moment i was just sitting at the computer waiting |

0:03:07 | for the next result to come out and plotting it and saw it getting better |

0:03:11 | we got seventeen point three |

0:03:13 | then seventeen point one |

0:03:15 | then we redid the alignment that's one thing dong yu had already determined on the |

0:03:20 | smaller setup we got it down to sixteen point four then we looked at sparseness |

0:03:24 | sixteen point one so in total we got thirty two percent error reduction |

0:03:27 | that's a very large reduction |

0:03:29 | out of a single technology |

0:03:33 | we also ran this over different test sets with the same model and you could see |

0:03:37 | the error rate reductions were all sort of in a similar range |

0:03:40 | for the harder data the gains were slightly worse |

0:03:44 | we also looked at other tasks for example at some point we finally trained the two |

0:03:48 | thousand hour model the kind you'd use for a product like the windows phone system that you |

0:03:54 | have right now we got something like fifteen percent error reduction |

0:03:58 | and also other companies started publishing for example ibm on broadcast news i think the |

0:04:02 | total gain is thirteen to eighteen percent that's i think in an up to date paper |

0:04:07 | and then on youtube i think it was about nineteen percent so the gains were |

0:04:11 | really convincing across the board |

0:04:14 | okay so that was our work so what is this actually |

0:04:17 | now i thought asru has attendees from different areas of understanding people might not |

0:04:22 | you know know the dnn in all the detail so i thought i would like to |

0:04:26 | go through and explain |

0:04:27 | a little bit more of the basics of how this works i don't know how many |

0:04:31 | understanding people are really here today i hope it's not gonna be too boring |

0:04:34 | so the basic idea is |

0:04:36 | the dnn looks at for example a spectrogram |

0:04:40 | a rectangular patch out of that a range of vectors |

0:04:44 | and feeds this into this processing chain where it basically multiplies this input vector this rectangle |

0:04:49 | here with a matrix adds some bias and applies a nonlinearity and then you get |

0:04:54 | something like two thousand values and then you do this several times |

0:04:58 | and the top layer does the same thing except the nonlinearity is a softmax |
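
the processing chain just described (affine transform, nonlinearity, repeated, softmax on top) can be sketched in a few lines of numpy; the layer sizes here are toy illustration values, not the two-thousand-unit layers from the talk:

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability before exponentiating
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dnn_forward(x, layers):
    """layers: list of (W, b) pairs; sigmoid hidden layers, softmax output."""
    h = x
    for W, b in layers[:-1]:
        h = sigmoid(W @ h + b)          # z = W x + b, then the nonlinearity
    W, b = layers[-1]
    return softmax(W @ h + b)           # top layer: posterior over states

# toy sizes: 40-dim input patch, two small hidden layers, 10 output states
rng = np.random.default_rng(0)
dims = [40, 64, 64, 10]
layers = [(0.1 * rng.standard_normal((dims[i + 1], dims[i])),
           np.zeros(dims[i + 1])) for i in range(len(dims) - 1)]
p = dnn_forward(rng.standard_normal(40), layers)
```

the output `p` is a proper posterior distribution over the (here ten) states.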

0:05:02 | so |

0:05:04 | these are the formulas for that so what is this actually well a softmax |

0:05:08 | is this form here |

0:05:09 | that is essentially nothing else but sort of a linear classifier and it is linear |

0:05:13 | because if you look at the class boundaries between two classes they are linear so it's actually a |

0:05:17 | relatively weak classifier we have there |

0:05:20 | the hidden layer is actually very similar it has the same form the only difference |

0:05:25 | is that there are sort of only two classes |

0:05:28 | instead of n or however many different speech states here and the second class |

0:05:32 | has parameters zero |
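
the "second class has parameters zero" remark can be made concrete (my notation, hedged): a two-class softmax with the second class's weights and bias pinned to zero collapses to the logistic sigmoid, which is why a hidden unit looks like a tiny two-class version of the output layer,

$$
\frac{e^{w^\top x+b}}{e^{w^\top x+b}+e^{0^\top x+0}}
=\frac{1}{1+e^{-(w^\top x+b)}}
=\sigma(w^\top x+b).
$$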

0:05:34 | so what is this really this is sort of a classifier that classifies class |

0:05:38 | membership or non membership in some class but we don't know what those classes are |

0:05:42 | actually |

0:05:43 | and this representation is actually also kind of sparse typically you get only |

0:05:48 | maybe ten percent of the activations five to ten percent |

0:05:52 | to be active in any given frame |

0:05:54 | so these are really sort of class membership features kind of descriptive features |

0:05:58 | of your input |

0:06:00 | so another way of looking at it is |

0:06:03 | basically what it does is take an input vector and project it onto something like a basis vector |

0:06:07 | one column |

0:06:09 | this would be like a direction vector you project onto it there's a bias term we |

0:06:13 | add on it and then you run it through this nonlinearity which is sort of a soft |

0:06:16 | binarization |

0:06:18 | so what this does is give you sort of a soft you know like |

0:06:21 | a coordinate system for your inputs |

0:06:25 | and yet another |

0:06:27 | way of looking at it is |

0:06:28 | well |

0:06:30 | this one here is actually a correlation so here the parameters have the same sort |

0:06:36 | of physical meaning as the inputs you put in there |

0:06:40 | so for example for the first layer the model parameters are also of the nature |

0:06:44 | of being a rectangular patch |

0:06:45 | of spectrogram |

0:06:46 | so and this is what they look like i think there was a little bit |

0:06:49 | of discussion earlier in nelson's talk |

0:06:52 | so what does this mean each of these |

0:06:55 | is in this case thirty two or twenty three frames wide |

0:06:59 | this is the frequency |

0:07:01 | axis here |

0:07:02 | and what happens is that these things are basically overlaid over here and then the |

0:07:05 | correlation is made and wherever it detects this particular pattern this is sort of a |

0:07:09 | peak detector here that is sliding over time |

0:07:13 | then you get a high output |

0:07:14 | okay |

0:07:15 | you can see all these different patterns here many of them really look |

0:07:18 | like gabor filters |

0:07:20 | but these are automatically learned by the system there's no knowledge that was put in there |

0:07:24 | you have these edge detectors you have peak detectors you have some sliding detectors you |

0:07:29 | have a lot of noise in there actually i don't know what that's for i think |

0:07:32 | the later stages probably just ignore them |

0:07:36 | the harder problem is how to interpret the hidden layers |

0:07:39 | the hidden layers don't have any sort of spatial relationship to the input |

0:07:44 | or something so the only thing that i could think of is that |

0:07:47 | they were representing something like |

0:07:49 | logical operations so think of this again this is the direction vector this is the |

0:07:53 | hyperplane that is described by the bias right so if your inputs for example are |

0:07:58 | one one this is obviously a |

0:08:01 | two dimensional vector or one zero |

0:08:04 | it could be this one or this one you could put a plane here that is an or operation |

0:08:09 | okay kind of a soft or because it's not strictly binary |

0:08:12 | or you put it here and it is like an and operation |

0:08:14 | so i think my personal intuition of what the dnn actually does |

0:08:18 | is |

0:08:19 | on the lower layers it extracts these landmarks |

0:08:22 | and on the higher layers it assembles them into more complicated classes |

0:08:27 | and it can do interesting things you can imagine |

0:08:30 | that for example one node in a layer discovers say a female version of an a and |

0:08:34 | then another node would give you a male version of a |

0:08:37 | then the next layer would say it's an a if it's |

0:08:40 | a female or a male a |

0:08:42 | so this gives an idea of the modeling power of this of this |

0:08:45 | thing |

0:08:47 | okay so take away |

0:08:49 | the lowest layer matches landmarks higher layers i think are sort of soft logical operators |

0:08:54 | and the top layer is just a really primitive linear classifier |
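
the soft and / soft or intuition can be demonstrated with a single sigmoid unit; the weights and biases below are hand-picked illustration values, not anything learned by a real network:

```python
import numpy as np

def unit(w, b, x):
    # one sigmoid unit: a soft threshold of w.x + b
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

# with binary-ish inputs, moving the hyperplane (the bias) turns the
# same unit into a soft OR or a soft AND, as described in the talk
w = np.array([10.0, 10.0])
soft_or  = lambda x: unit(w, -5.0,  np.array(x, dtype=float))   # fires if either input is on
soft_and = lambda x: unit(w, -15.0, np.array(x, dtype=float))   # fires only if both are on
```

so the "soft" part is literal: the outputs are near zero or near one, never exactly binary.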

0:08:57 | okay so how do we do this in speech how is this used in speech |

0:09:02 | you take those outputs these probabilities posterior probabilities of speech segments |

0:09:08 | senones you know |

0:09:10 | you turn them into |

0:09:12 | likelihoods using bayes rule and these are directly used in the hidden markov model |

0:09:16 | decoder |

0:09:19 | and the key thing here is that these classes are tied triphone states and not |

0:09:23 | monophone states that is the thing that really made a big difference |
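
the bayes-rule step just described, dividing the posterior by the state prior to get a scaled likelihood for the hmm, might look like this in the log domain (toy numbers; in practice the priors would come from alignment counts):

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, log_priors):
    """Convert DNN state posteriors p(s|x) into scaled likelihoods
    p(x|s)/p(x) = p(s|x)/p(s), usable as HMM emission scores."""
    return log_posteriors - log_priors

# toy example with 3 states
post  = np.log(np.array([0.7, 0.2, 0.1]))   # DNN output for one frame
prior = np.log(np.array([0.5, 0.3, 0.2]))   # state priors
scores = scaled_log_likelihoods(post, prior)
```

the constant p(x) cancels in the decoder's argmax, which is why the division by the prior is all that is needed.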

0:09:26 | okay so just before we move on just to give a rough idea of like |

0:09:30 | what these error rates actually mean i wanna play a little video clip |

0:09:36 | where our executive vice president of research gave an on stage demo |

0:09:41 | and you can see what accuracies come out of a speaker independent |

0:09:45 | dnn it has not been adapted to his voice |

0:09:53 | still far error rate for our work we have the one point five |

0:10:04 | what you hear research my research university |

0:10:10 | okay together with the other in your recognition so |

0:10:19 | i use i tell you know what i weight given red color your |

0:10:31 | so this is this is basically perfect right and this is really a speaker independent |

0:10:35 | system |

0:10:36 | and you can i think do interesting things with that just for the fun of it |

0:10:39 | i'm gonna play a later part of the video where we actually use this |

0:10:42 | input to drive translation |

0:10:46 | translated into chinese you and vocal here we see i am i know |

0:11:05 | i |

0:11:07 | there i here |

0:11:09 | you people one |

0:11:17 | that is there |

0:11:21 | side |

0:11:31 | for this is a very |

0:11:35 | you do initial values you well |

0:11:41 | if you hear that right down by various people |

0:11:48 | so what we see |

0:11:54 | so that's the kind of fun you can have with a model like that |

0:11:58 | okay so |

0:11:59 | now in this talk |

0:12:02 | i would like to |

0:12:03 | you know people have been giving invited talks about the dnn |

0:12:08 | at basically every one of those conferences like a one hour talk |

0:12:12 | on this very topic for example last year's slt conference with andrew senior |

0:12:16 | and i think at icassp as well so when i prepared |

0:12:20 | this talk i found that it ended up |

0:12:23 | duplicating andrew's talk |

0:12:26 | so i thought that's maybe not a good idea i wanna do it slightly differently |

0:12:29 | so what i wanted to do is more focused |

0:12:31 | i'm not gonna give you an exhaustive overview of everything but i will focus |

0:12:35 | on |

0:12:36 | what is needed to build real life systems large-scale systems so for example you will |

0:12:40 | not see a timit result |

0:12:42 | and it is structured along three areas training features and run-time training is the biggest one i'm |

0:12:47 | gonna start with that |

0:12:50 | so |

0:12:51 | how do you train this model i think we're pretty much all familiar with back-propagation |

0:12:55 | you give it |

0:12:56 | a sample vector run it through the network get a posterior distribution compare it against what it |

0:13:00 | should be |

0:13:01 | and then basically nudge the system a little bit in the direction to do a |

0:13:05 | better job next time |
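
that "nudge in the right direction" is, for the top softmax layer, just the posterior-minus-target gradient; here is a minimal sketch with toy sizes and plain sgd (assumptions mine, not the talk's actual recipe):

```python
import numpy as np

def sgd_step(W, b, x, target, lr=0.1):
    """One back-propagation step for a softmax classifier: the gradient
    of cross-entropy w.r.t. the pre-activation is (posterior - target)."""
    z = W @ x + b
    p = np.exp(z - z.max()); p /= p.sum()   # forward: posterior distribution
    err = p.copy(); err[target] -= 1.0      # compare against what it should be
    W -= lr * np.outer(err, x)              # nudge the parameters downhill
    b -= lr * err
    return W, b, p

rng = np.random.default_rng(1)
W, b, x = np.zeros((5, 8)), np.zeros(5), rng.standard_normal(8)
for _ in range(50):                         # repeat the nudge on one sample
    W, b, p = sgd_step(W, b, x, target=2)
```

after a few dozen nudges the posterior mass concentrates on the target class.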

0:13:07 | and so the problem is when you do this with a deep network often the |

0:13:11 | system does not converge or will get stuck in a local optimum |

0:13:14 | so the thing that this whole revolution started with from geoffrey hinton |

0:13:19 | the thing that |

0:13:19 | sorry the thing that he proposed is the restricted boltzmann machine |

0:13:24 | and the idea is basically you train |

0:13:26 | layer-wise so here we extend the network sort of in a way that it |

0:13:30 | can run backwards |

0:13:31 | so you can run the sample through |

0:13:34 | you get a representation you run it backwards and then you can see okay how |

0:13:37 | well does the thing that comes out actually match my input |

0:13:40 | then you can tune that system so that it matches the input as closely as possible |

0:13:45 | if you can do that and don't forget this is sort of a binary representation |

0:13:48 | that means you have a representation of the data that is meaningful this thing extracts something |

0:13:53 | meaningful about the data and that's sort of the idea |

0:13:56 | so now you do the same thing with the next layer you freeze this it is |

0:13:59 | taken as a feature extractor |

0:14:00 | you do this with the next layer and so on |

0:14:02 | then you put a |

0:14:04 | softmax on top and then train towards your classification targets |

0:14:08 | now i had no idea about |

0:14:10 | deep neural networks or anything when i started this so i thought why would we do |

0:14:13 | this so complicated i mean we had already run experiments on how many layers you |

0:14:18 | need and so on so we already had |

0:14:20 | a network that had like a single hidden layer |

0:14:23 | so why not just take that one as initialization |

0:14:25 | rip out its softmax layer and then put another |

0:14:30 | hidden layer and another softmax on top of it |

0:14:32 | and then iterate the entire stack here |

0:14:34 | and then after that again rip this guy off and do it again and so |

0:14:38 | on and once you are at the top iterate this thing |

0:14:41 | so we call this greedy layer-wise discriminative pre-training |
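
the rip-off-the-softmax loop just described might be sketched like this; `train_briefly` stands in for a few epochs of back-propagation and is a placeholder of mine, not code from the actual system:

```python
import numpy as np

def new_layer(n_in, n_out, rng):
    # random init; the 0.1 scale is an arbitrary choice for this sketch
    return 0.1 * rng.standard_normal((n_out, n_in)), np.zeros(n_out)

def discriminative_pretrain(n_in, hidden_dims, n_states, train_briefly, rng):
    """Greedy layer-wise discriminative pre-training: train one hidden
    layer plus softmax briefly ("into the ballpark"), rip the softmax
    off, add a new hidden layer and a fresh softmax, train briefly
    again, and so on; full training happens only at the very end."""
    hidden, prev = [], n_in
    for h in hidden_dims:
        hidden.append(new_layer(prev, h, rng))   # new hidden layer on top
        top = new_layer(h, n_states, rng)        # fresh softmax each round
        train_briefly(hidden + [top])            # only a partial iteration
        prev = h
    return hidden + [top]

rng = np.random.default_rng(0)
net = discriminative_pretrain(40, [64, 64, 64], 10, lambda layers: None, rng)
```

the point of the sketch is the growth schedule: each round the old softmax is discarded and only the hidden stack survives.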

0:14:44 | and it turns out that actually works really well so if we look at this |

0:14:48 | the dbn pretraining of geoffrey hinton this is the green curve here |

0:14:53 | if you do what i just described you get the red one they are essentially |

0:14:58 | the same word error rate |

0:15:00 | and this is for different numbers of layers this is not progression over training it's the accuracy |

0:15:05 | for different numbers of layers right |

0:15:07 | so the more layers you get the better it gets and |

0:15:09 | you see basically both curves |

0:15:11 | track each other |

0:15:12 | the layer-wise pretraining is slightly worse but then dong yu who understands neural networks much better |

0:15:17 | than i do |

0:15:18 | said you maybe shouldn't iterate the model all the way to the end you should |

0:15:22 | just let it iterate a little bit until it's in the ballpark then move on it |

0:15:25 | turns out that made the system slightly better and actually the sixteen point eight here |

0:15:29 | this is with this layer-wise pre-training method |

0:15:34 | now you might think it's expensive |

0:15:35 | because every time you have this full nine thousand senone top layer there but |

0:15:39 | it turns out you don't need to do that you can actually use monophones |

0:15:42 | and it actually works equally well and is much cheaper |

0:15:46 | okay so take away pre-training still seems to help |

0:15:50 | but greedy discriminative pre-training is sufficient and much simpler than the rbm pre-training because |

0:15:55 | we just use the existing code don't need new coding |

0:15:59 | okay another important topic is |

0:16:02 | sequence training |

0:16:03 | so the question here is |

0:16:06 | we have actually trained this network to classify these signals into those segments of |

0:16:11 | speech independently of each other but in speech recognition |

0:16:14 | we have dictionaries of course language models we have the hidden markov model that gives you |

0:16:18 | sequences and so on |

0:16:19 | so if we want to integrate that into the system and we do that |

0:16:23 | we should actually get a better result right |

0:16:25 | so |

0:16:27 | the frame-classification criterion is written this way you maximise the log posterior of every single |

0:16:32 | you know correct state |

0:16:36 | if you write down sequence training you actually find |

0:16:40 | that it has exactly the same form |

0:16:42 | except this here is not the state posterior derived from the dnn but it is the state |

0:16:47 | posterior taking all the additional knowledge into account |

0:16:51 | so this one takes into account hmms the dictionary and language models |

0:16:55 | so the way to run this is you run your data through and you have |

0:16:59 | here the full machinery from speech recognition |

0:17:01 | to compute these posteriors |

0:17:02 | in practical terms you would do this with word lattices |

0:17:05 | and then you do back-propagation |
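
in symbols (my notation, hedged; the slide may write it slightly differently) the two criteria share the same outer form and differ only in which posterior appears,

$$
\mathcal{F}_{\mathrm{CE}}=\sum_{t}\log p_{\mathrm{dnn}}(s_t\mid x_t),
\qquad
\mathcal{F}_{\mathrm{SEQ}}=\sum_{t}\log p(s_t\mid x_1,\ldots,x_T),
$$

where $p(s_t\mid x_1,\ldots,x_T)$ is computed by forward-backward over the word lattice and therefore folds in the hmm, the dictionary, and the language model.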

0:17:08 | so we did that |

0:17:10 | we started with the baseline fifteen point six percent |

0:17:13 | we did the first iteration of this sequence training |

0:17:16 | and it went to |

0:17:17 | twenty one |

0:17:18 | point four |

0:17:19 | so that kind of didn't work |

0:17:22 | so |

0:17:24 | well we observed that it sort of diverged |

0:17:27 | it didn't look like it was training |

0:17:30 | so we tried to find out what is the problem here so there are four |

0:17:33 | hypotheses |

0:17:34 | are we actually using the right models for lattice generation are there problems with lattice sparseness |

0:17:39 | randomization of data and the objective function there are multiple objective functions to choose from and today |

0:17:44 | i will talk about the lattice sparseness |

0:17:46 | so the first thing we found was that |

0:17:49 | there was an increasing |

0:17:51 | sort of |

0:17:52 | problem of speech getting replaced by silence |

0:17:57 | a deletion problem we saw that the silence scores kept growing |

0:18:01 | and the other scores were not |

0:18:03 | so basically what happens is that |

0:18:05 | the lattice is very biased the lattice typically doesn't have negative hypotheses for silence because |

0:18:11 | it's so far away from speech but it has a lot a lot of positive |

0:18:15 | examples of silence |

0:18:16 | so this thing was just biasing the system towards recognizing silence giving it a |

0:18:21 | high bias |

0:18:22 | so what we did is we said okay why don't we just |

0:18:24 | not update |

0:18:26 | silence states and also skip all silence frames |

0:18:29 | so that already gave us something much better |

0:18:31 | it already looked like it's converging |

0:18:34 | we could also do this slightly more systematically we could actually explicitly add silence arcs |

0:18:39 | into the lattice |

0:18:41 | right those that should have been there in the first place |

0:18:44 | so once you do that |

0:18:46 | it actually gets even slightly better so that kind of confirms the missing silence hypothesis |

0:18:50 | after all |

0:18:52 | but then |

0:18:53 | another problem is that the lattices are rather sparse |

0:18:56 | so we find that at any given frame |

0:18:58 | we only have like three hundred out of nine thousand senones in the lattice and |

0:19:02 | that |

0:19:03 | the others are not there because they basically had zero probability |

0:19:07 | but as the model moves along maybe they at some point no longer have zero |

0:19:11 | probability so they should be there in the lattice but they're not |

0:19:14 | so the system cannot train properly |

0:19:16 | so we thought why don't we just regenerate lattices after one iteration |

0:19:20 | and we see it helps a little bit the difference is it at least keeps stable here |

0:19:25 | now we thought can we do this slightly better so basically we take this idea |

0:19:28 | of adding silence arcs |

0:19:30 | and sort of add speech arcs but you can't really do that |

0:19:33 | but a similar effect can be achieved by interpolating your sequence criterion |

0:19:38 | with the frame criterion |

0:19:40 | so and when we do that we get |

0:19:43 | a very good convergence |
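
the interpolation trick can be stated in one line on the gradients; the weighting `h` below is an arbitrary illustration value, not the talk's actual setting:

```python
import numpy as np

def smoothed_gradient(g_seq, g_frame, h=0.1):
    """Frame smoothing: interpolate the sequence-training gradient with
    the frame (cross-entropy) gradient; h is an assumed weight."""
    return (1.0 - h) * g_seq + h * g_frame

# toy two-parameter gradients
g = smoothed_gradient(np.array([1.0, -2.0]), np.array([0.0, 1.0]), h=0.1)
```

the frame term acts like the missing competitors: it keeps pulling toward the frame-level targets even where the lattice has no arcs.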

0:19:46 | so |

0:19:47 | now we're not the only people that observed that problem who ran into this |

0:19:51 | issue with the training so for example karel vesely |

0:19:55 | and his coworkers |

0:19:57 | observed that |

0:19:58 | if you look at the posterior probability of the ground truth path |

0:20:02 | over time you sometimes find that it's very low it's not always zero but sometimes it is zero |

0:20:07 | and that matters a lot |

0:20:09 | so |

0:20:09 | what they found is that |

0:20:11 | if you just skip those frames they called it frame rejection you get a much better |

0:20:15 | convergence behavior so the red curve is without and the blue curve is |

0:20:19 | with frame rejection |

0:20:23 | and of course |

0:20:25 | brian kingsbury also observed exactly the same thing but he said no i'm gonna do the |

0:20:28 | smart thing |

0:20:29 | i'm gonna do something much better i'm gonna use a second order method |

0:20:33 | so with a second order method you approximate the objective function as a second |

0:20:37 | order function then you can like hop right to the optimum theoretically |

0:20:41 | and this can be done without explicitly computing the hessian and this is |

0:20:44 | the hessian-free method that martens a student of hinton |

0:20:48 | sort of optimized |

0:20:49 | and the nice thing is it's actually a batch method |

0:20:52 | so it doesn't |

0:20:54 | suffer from these previous issues of like lattice sparseness and the need for randomization and |

0:20:59 | all of that |

0:21:01 | and also i think at this conference there's a paper that says that it |

0:21:04 | works with a partially iterated ce model you don't even have to do a full ce |

0:21:08 | iteration that's also very nice |

0:21:11 | and |

0:21:12 | i need to say that brian actually did this first he was the first to show the |

0:21:16 | effectiveness of sequence training |

0:21:18 | for switchboard |

0:21:19 | okay so here are some results |

0:21:22 | so this is the gmm system a ce trained cd-dnn and the |

0:21:27 | sequence trained one |

0:21:28 | so this is all on switchboard on hub5 00 and rt03 |

0:21:32 | so we get like twelve percent |

0:21:35 | basically and others got eleven percent and brian on the rt03 set |

0:21:39 | also fourteen percent it's all in a similar range |

0:21:42 | we also |

0:21:43 | well i wanna point out one thing |

0:21:46 | going from here to here |

0:21:47 | now the dnn has given us forty two percent relative |

0:21:51 | and that's a fair comparison because this is also a sequence trained baseline |

0:21:55 | right so the only difference is that the gmm is replaced by the dnn |

0:22:01 | also it works on a larger dataset |

0:22:05 | okay so take away sequence training gives us gains of nine to thirty percent |

0:22:10 | sgd works but you need some tricks there |

0:22:13 | those are smoothing and rejection of bad frames |

0:22:16 | and the hessian-free method requires no tricks but is actually much more complicated so to |

0:22:20 | start with i would probably start with the sgd method |

0:22:27 | so another big question is parallelizing the training |

0:22:30 | so just to give an idea of the scale the model we used in this demo video |

0:22:34 | was trained on two thousand hours |

0:22:37 | it took sixty days |

0:22:40 | now |

0:22:41 | most of you probably don't work with windows |

0:22:44 | we do and that causes a very specific problem because you've probably heard of something called |

0:22:49 | patch tuesday |

0:22:51 | so basically |

0:22:52 | every two to four weeks microsoft it forces us to update some virus scanners |

0:22:57 | or something like that |

0:22:58 | and so basically those machines have to be rebooted |

0:23:02 | so running a job for sixty days is actually a problem |

0:23:06 | so |

0:23:07 | we were running this on a gpu so we had a very strong motivation to look |

0:23:11 | at that |

0:23:12 | but don't get your hopes up |

0:23:14 | so |

0:23:15 | one way of trying to parallelize the training is to use batch methods |

0:23:20 | brian had already shown hessian-free works very well for this problem |

0:23:24 | so actually a colleague who was an intern at microsoft |

0:23:29 | tried to use hessian-free also for the ce training |

0:23:34 | but the take away was basically it takes a lot of iterations to get |

0:23:38 | there so it was actually not faster |

0:23:41 | so back to sgd |

0:23:42 | sgd is also a problem because if we do mini-batches of say one thousand twenty |

0:23:47 | four frames every one thousand twenty four frames you have to exchange a lot of data |

0:23:51 | so that's a big challenge so the first group actually a company that did |

0:23:55 | this successfully was google with asynchronous sgd |

0:24:00 | so the way that works is |

0:24:02 | you have your machines you group them into groups first you group them together each |

0:24:06 | group takes a part of the model and then you split your data and |

0:24:08 | each chunk computes a different gradient |

0:24:11 | so that at any given time |

0:24:13 | whenever one of them has a gradient computed |

0:24:16 | it sends that to a |

0:24:18 | parameter server or set of parameter servers and those parameter servers aggregate it into |

0:24:23 | the model they update it |

0:24:25 | and then |

0:24:26 | whenever they feel like it and the bandwidth allows they send |

0:24:31 | the model back |

0:24:32 | now that's a completely asynchronous process the way to think of this is just independent |

0:24:36 | threads one thread is just computing with whatever's in memory |

0:24:39 | another thread is just sharing and exchanging data in whatever way with no synchronisation |
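
a toy version of that parameter-server picture, with worker threads pushing gradients computed from possibly stale model copies and no synchronisation barrier; the quadratic "loss" is a stand-in for the network, and all constants are illustration values:

```python
import threading
import numpy as np

# toy objective: pull the shared parameter vector toward `target`
target = np.array([3.0, -1.0])
server_w = np.zeros(2)              # the "parameter server" state
lock = threading.Lock()             # protects the shared vector itself

def worker(steps, lr=0.1):
    global server_w
    for _ in range(steps):
        w = server_w.copy()         # fetch a (possibly stale) model copy
        grad = 2.0 * (w - target)   # local gradient from the stale copy
        with lock:
            server_w -= lr * grad   # push the update, no barrier

threads = [threading.Thread(target=worker, args=(200,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

despite the staleness, the shared model still converges, which is the point of the "each update contributes independently" argument that follows.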

0:24:45 | so why would that work |

0:24:47 | well it's very simple because |

0:24:50 | sgd implies sort of an assumption you know that we are making very small steps |

0:24:55 | so basically |

0:24:57 | every parameter update contributes independently to the objective function |

0:25:01 | so it's okay to miss some of them |

0:25:05 | and also there is something that we call delayed update i'm going to quickly explain |

0:25:08 | that |

0:25:08 | so in the simplest form of the training as explained in the beginning you take at every point |

0:25:12 | in time a sample x you take the model |

0:25:16 | compute the gradient update the model with the gradient |

0:25:20 | and then do it again after one frame you do it again do it again |

0:25:24 | and then basically right |

0:25:26 | your new model is equal to the old model plus the gradient |

0:25:29 | we can also do this differently you can also not advance |

0:25:33 | the model that is you use the same model multiple times |

0:25:36 | and update with it for example in this example four |

0:25:39 | times you do four model updates the frames are still these frames right but the |

0:25:43 | model is the same model |

0:25:45 | then you do this again and so on |

0:25:47 | so that's actually what we call mini-batch based update right |

0:25:51 | mini-batch training |
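
the equivalence claimed here, that reusing the same stale model for n frames amounts to one mini-batch update, is easy to check on a toy loss (my example, not the talk's):

```python
import numpy as np

def grad(w, x):
    # toy per-frame gradient: quadratic loss (w - x)^2
    return 2.0 * (w - x)

def sgd_per_frame(w, frames, lr):
    for x in frames:
        w = w - lr * grad(w, x)          # model advances after every frame
    return w

def sgd_delayed(w, frames, lr):
    g = sum(grad(w, x) for x in frames)  # same (stale) model for all frames
    return w - lr * g                    # one combined update = a mini-batch

frames = [np.array([1.0]), np.array([2.0]), np.array([3.0]), np.array([4.0])]
a = sgd_per_frame(np.array([0.0]), frames, 0.05)
b = sgd_delayed(np.array([0.0]), frames, 0.05)
```

`b` is exactly the mini-batch update, and `a` lands close to it, which is the "stay in the linear regime" condition mentioned a moment later.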

0:25:53 | so now if you want to do parallelization you need to deal with the problem that |

0:25:56 | we need to do computation and data exchange in parallel so you would do something like |

0:26:00 | this you know you would have a model and you would start sending that into |

0:26:04 | the network so at some point it can do the model update while it keeps computing |

0:26:09 | the next one |

0:26:11 | and then |

0:26:11 | you do that in an overlapped fashion once these are computed you send the result over while |

0:26:15 | these are being received and updated so you get this sort of overlapped processing we |

0:26:20 | call it the double buffered update |

0:26:22 | it has exactly the same form so with this formula you can write it in exactly |

0:26:25 | the same form |

0:26:27 | and asgd is basically just sort of a random version of this where you have |

0:26:31 | no fixed delay just a |

0:26:34 | delay somewhere jumping between one or two or whatever |

0:26:38 | so why am i telling you this |

0:26:40 | well why would this work because the delay is not different from a mini-batch |

0:26:44 | and to make it work the only thing you need to make sure is that we |

0:26:47 | still stay in this |

0:26:48 | sort of linear regime |

0:26:50 | it also means that as your training progresses you can increase your mini-batch size |

0:26:54 | that is well observed and it also means you can increase |

0:26:57 | your delay |

0:26:59 | which means you can use more machines |

0:27:00 | the more machines you use the more delay you incur because of network latency right |

0:27:06 | okay |

0:27:07 | so |

0:27:09 | okay so but then |

0:27:11 | actually |

0:27:13 | there were three times |

0:27:15 | that colleagues told me |

0:27:17 | like look at this paper on asynchronous sgd |

0:27:19 | and then |

0:27:20 | like three months later i asked them so how did this work out and what |

0:27:23 | they said was it didn't scale well |

0:27:24 | that actually happened three times so why does it not work |

0:27:27 | so let's look at this what are the different ways of parallelizing something model parallelism |

0:27:31 | data parallelism or layer parallelism |

0:27:34 | model parallelism means you're splitting the model over different nodes |

0:27:37 | then after each computation step well they each only compute part of the output |

0:27:41 | vector |

0:27:43 | each computes a different sub range of your dimensions so after every computation they have to |

0:27:47 | exchange |

0:27:48 | the output with all the others |

0:27:50 | the same thing has to happen on the way back |

0:27:53 | now data parallelism means |

0:27:56 | you break your mini-batch into sub batches |

0:27:59 | so each node computes a subgradient |

0:28:02 | and then sorry |

0:28:03 | after every batch they have to exchange these subgradients each has to send its |

0:28:08 | gradient to all the other nodes |

0:28:10 | so you can already see that has a lot of communication going on |

0:28:13 | the third train a something that we tried called and they are powerless |

0:28:17 | work something like this you distribute layers |

0:28:21 | so maybe the first batch comes in |

0:28:23 | and then when it's done it sends |

0:28:25 | its output to the next node, and we compute the next batch here, but

0:28:29 | this is actually not correct, because we haven't updated the model

0:28:33 | so what do we do? we just keep going and ignore the problem

0:28:36 | then in this case, after four steps,

0:28:37 | this guy has finally come back with an update to the model

0:28:41 | so |

0:28:42 | why would that work? it's just a delayed update; it's exactly the same formula as

0:28:45 | before, except the delay is kind of different in different layers, but there's nothing

0:28:48 | fundamentally strange about this
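the delayed update described here, new weights computed from a gradient evaluated at stale parameters, can be simulated on a toy quadratic; the learning rate, delay, and objective below are illustrative numbers only:

```python
# delayed-update SGD on the toy quadratic f(w) = 0.5 * (w - 3)^2;
# the gradient is computed from parameters that are `delay` steps old,
# as in layer parallelism where updates arrive late
def delayed_sgd(delay, steps=200, lr=0.1):
    w_hist = [0.0]
    for t in range(steps):
        w_stale = w_hist[max(0, t - delay)]  # stale parameters
        g = w_stale - 3.0                    # gradient at the stale point
        w_hist.append(w_hist[-1] - lr * g)
    return w_hist[-1]

# with a small enough learning rate, a modest delay still converges
assert abs(delayed_sgd(delay=0) - 3.0) < 1e-3
assert abs(delayed_sgd(delay=4) - 3.0) < 1e-3
```

the point of the sketch: delay alone does not break convergence if the learning rate is small enough relative to the delay, which is why the "just ignore the problem" approach can work at all.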

0:28:51 | so |

0:28:52 | no |

0:28:54 | a very interesting question is how far you can actually go: what is the optimal number

0:28:58 | of nodes that you can

0:29:01 | parallelize over

0:29:02 | so my colleague wrote down a very simple idea

0:29:05 | he simply said

0:29:06 | you are optimal when you max out all the resources,

0:29:10 | using all your computation and all your network

0:29:14 | capacity; that basically means that the time that it takes

0:29:17 | to compute a mini-batch

0:29:19 | is equal to the time that it takes to transfer the result to all

0:29:23 | the others

0:29:25 | and you would do this in an overlapped fashion, so you would compute one

0:29:28 | batch, then you start the transfer while you do the next one

0:29:31 | and you are at the ideal point when, the moment

0:29:35 | the transfer is completed, you are ready to compute the next

0:29:38 | batch

0:29:39 | so then you can write down, okay, what's the optimal

0:29:42 | number of nodes here; well, the formula is a bit more complicated, but the basic

0:29:46 | idea is that it is proportional to the model size, so bigger models allow better parallelization, but

0:29:51 | the faster the nodes get, the less you can parallelize

0:29:53 | so a gpu can parallelize less

0:29:57 | and of course it also has to do with how much data you have to exchange

0:29:59 | and what your bandwidth is

0:30:01 | for data parallelization the mini-batch size is also a factor, because with a larger mini-batch size you

0:30:06 | have to exchange less often

0:30:09 | and for layer parallelism it's not really that interesting, because it's limited by the number of layers
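a back-of-the-envelope version of that balance condition, compute time per mini-batch equal to gradient-transfer time, can be written down; all constants below are made-up illustrative numbers, not the formula from the slide:

```python
# a node count is usable when the gradient transfer still hides behind
# the compute of a mini-batch; all numbers are made-up illustrations

def compute_time(params, minibatch, flops, nodes):
    # each of `nodes` workers processes minibatch/nodes frames;
    # roughly 2 flops per parameter per frame (multiply-add)
    return 2.0 * params * (minibatch / nodes) / flops

def transfer_time(params, bandwidth_words_per_s, nodes):
    # each worker sends its full gradient to the other nodes
    return params * (nodes - 1) / bandwidth_words_per_s

def best_nodes(params, minibatch, flops, bandwidth, max_nodes=64):
    best = 1
    for k in range(2, max_nodes + 1):
        if transfer_time(params, bandwidth, k) <= compute_time(params, minibatch, flops, k):
            best = k
    return best

# bigger mini-batches amortize communication, so more usable nodes
assert best_nodes(3e7, 1024, 1e12, 1e9) >= best_nodes(3e7, 256, 1e12, 1e9)
# a faster node (higher flops) finishes sooner, so fewer usable peers
assert best_nodes(3e7, 1024, 4e12, 1e9) <= best_nodes(3e7, 1024, 1e12, 1e9)
```

note how the model size cancels out of the balance here; what survives is exactly the dependence the talk names: mini-batch size and bandwidth push the node count up, per-node speed pushes it down.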

0:30:14 | so |

0:30:16 | so let me ask you:

0:30:17 | what do you think model parallelism would get us here

0:30:20 | so just

0:30:22 | consider that google was doing imagenet with like sixteen thousand cores

0:30:26 | so give me a number

0:30:31 | i'm going to tell you:

0:30:36 | not sixteen thousand

0:30:39 | so i implemented that, with a lot of

0:30:43 | care, on three gpus

0:30:45 | this is the best you can do: you get a 1.8 times speedup

0:30:47 | instead of a three times speedup, because gpus get less efficient the smaller the chunks

0:30:51 | of data they process

0:30:52 | and once i went to four, it was actually much worse than this

0:30:58 | now, is data parallelism much better? so what do you think

0:31:07 | for a mini-batch size of 1024; now of course, if

0:31:11 | you can use bigger mini-batches as you progress in training,

0:31:14 | this becomes a bigger number

0:31:16 | and in reality, what you get, well, there is google's asgd system,

0:31:20 | parallelized over eighty nodes,

0:31:23 | and each node is a twenty-four-core intel machine

0:31:27 | so if you look at what you get compared to using

0:31:29 | a single twenty-four-core intel machine:

0:31:34 | eighty times the nodes, but you only get a speedup of 5.8

0:31:38 | that's what you can actually get out of the paper there, and about 2.2

0:31:42 | of that comes out of model parallelism and 2.6 comes out of

0:31:46 | data parallelism

0:31:48 | so of course, not that much

0:31:49 | then there's another group, at the academy of sciences;

0:31:53 | they parallelized over nvidia k20x gpus, which are sort of the state of the art,

0:31:58 | and they got a 3.2 times

0:31:59 | speedup also

0:32:02 | okay not that great |

0:32:05 | i'm not going to give a better answer here, but i just want to make the point

0:32:08 | okay |

0:32:09 | so the last thing is layer parallelism; okay, so in this experiment we found

0:32:14 | that if you do it the right way, you can use more gpus and you get

0:32:17 | a 3.2 or three times speedup, but we already had to use model

0:32:20 | parallelism as well

0:32:22 | and if you don't do that, you have a load-balancing problem, because the layer sizes are so

0:32:26 | different

0:32:27 | and so this is actually the reason why i do not recommend layer parallelism

0:32:31 | okay so the take-away:

0:32:33 | parallelizing sgd is actually really hard, and if your colleagues come to you and say they can

0:32:38 | implement parallel sgd, then maybe show them this

0:32:41 | okay |

0:32:43 | so |

0:32:45 | so much about parallelization

0:32:51 | okay, now let me talk about adaptation; so adaptation can be done,

0:32:56 | as you heard this morning, for example by sticking in a linear transform at the bottom, called

0:33:01 | the lin transform; we call it fdlr, to match

0:33:05 | mllr

0:33:06 | it can also be things like vtln

0:33:09 | another thing we can do is, as nelson explained, just retrain the whole stack just

0:33:13 | a little bit, or you can do this with regularization

0:33:17 | so |

0:33:18 | so what we observed is this:

0:33:20 | we did this approach with the fdlr on switchboard

0:33:23 | applied to the gmm system, we get a thirteen percent error reduction

0:33:29 | applied to a shallow neural network, that's one layer only,

0:33:33 | you get something very similar to that

0:33:35 | if we do it on the deep network,

0:33:40 | the

0:33:41 | gain is much smaller

0:33:44 | so this is not such a great example; but then, on the

0:33:48 | other hand, let me tell you an anecdote i forgot to put on the slide: when we

0:33:51 | prepared this on-stage demo

0:33:54 | for our vice president, we tried to actually adapt the models to him

0:33:58 | so we took something like four hours of his internal talks

0:34:01 | and did adaptation on that

0:34:04 | and tested on another two talks of his, and we got like a thirty percent gain

0:34:11 | but then we moved on and actually did an actual dry run with him;

0:34:15 | it turns out

0:34:16 | on that one, it apparently didn't work

0:34:20 | so i think what happened there is that the dnn actually did not learn the

0:34:22 | voice

0:34:23 | but the channel

0:34:25 | of those particular recordings; so basically there's a

0:34:29 | couple of other numbers here, but let me just cut this short; what we

0:34:31 | seem to be observing is that

0:34:34 | the gain of adaptation diminishes with larger amounts of training data; that's what we

0:34:37 | have seen so far, except if the adaptation is done for the purpose

0:34:42 | of domain adaptation

0:34:45 | so maybe the reason for this is that the dnn is already

0:34:48 | very good at learning invariant representations, especially across speakers, which also means maybe there's a

0:34:54 | limit on what is achievable by adaptation; so keep this in mind if you're considering

0:34:57 | doing research on this

0:35:00 | on the other hand, i think some groups got very good results on that, with george

0:35:03 | and colleagues, so maybe what i'm saying is not correct, so you'd better check

0:35:06 | out their papers in the session

0:35:11 | okay, so moving on from training: what about alternative architectures?

0:35:16 | so one of these,

0:35:18 | the relus, are very popular

0:35:21 | you basically replace the sigmoid nonlinearity with

0:35:25 | something like this

0:35:27 | and that also came a lot out of geoffrey hinton's school

0:35:31 | and it turns out that on vision tasks

0:35:34 | it works really well: it converges very fast,

0:35:36 | you get,

0:35:37 | basically, that you don't need to do pre-training,

0:35:39 | and it seems to outperform the sigmoid version on basically everything

0:35:44 | on speech there was a really, you know,

0:35:48 | encouraging paper

0:35:49 | by interning students, "rectifier nonlinearities improve neural network acoustic models"

0:35:54 | and they were able to reduce the error rate from nineteen point five to seventeen

0:35:58 | so, great; i started implementing it, it is actually two lines of code
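those "two lines" are essentially swapping the hidden nonlinearity and its derivative; a sketch in NumPy, not the actual toolkit code:

```python
import numpy as np

# the sigmoid version and its derivative, as used in the baseline
sigmoid   = lambda z: 1.0 / (1.0 + np.exp(-z))
d_sigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

# the "two lines of code" that turn it into a ReLU network:
relu   = lambda z: np.maximum(z, 0.0)       # line 1: forward pass
d_relu = lambda z: (z > 0).astype(z.dtype)  # line 2: backward pass

z = np.array([-2.0, 0.5, 3.0])
assert np.allclose(relu(z), [0.0, 0.5, 3.0])
assert np.allclose(d_relu(z), [0.0, 1.0, 1.0])
```

the swap itself really is that small; as the talk goes on to say, making it actually train well on a large set is the hard part.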

0:36:02 | and i didn't get anywhere;

0:36:04 | i was not able to reproduce these results

0:36:07 | so i read the paper again

0:36:08 | and i noticed one

0:36:10 | sentence:

0:36:11 | network training stops after two complete passes

0:36:15 | if we only do two passes, our system is at nineteen point two, and we normally do

0:36:19 | as many passes as we can

0:36:22 | so actually, there's something wrong with the baseline

0:36:25 | so it turns out that when i talk to people,

0:36:28 | on the large switchboard set it seems to be very difficult to get relus to

0:36:33 | work

0:36:33 | so one group that actually did get it to work is ibm, together with

0:36:38 | george dahl, but with a rather complicated method: they used

0:36:40 | bayesian optimization to tune

0:36:44 | the hyper-parameters of the training; this is the way they got

0:36:47 | something like five percent relative gain

0:36:49 | i don't know if they're still doing that or if it's a bit easier

0:36:52 | now, but

0:36:54 | so |

0:36:55 | the point is |

0:36:57 | the point is that it looks easy, but it actually isn't,

0:37:00 | for large setups

0:37:02 | the other one is convolutional networks

0:37:04 | and the idea is basically this: look at these filters here; these are tracking some

0:37:08 | sort of formant, right, but the formant positions, the resonance frequencies,

0:37:13 | depend on your body height

0:37:14 | for example, for women they are typically at slightly different positions compared

0:37:18 | to men, so

0:37:19 | why can't we share these filters across that? at the moment the system wouldn't do that

0:37:24 | so the idea would be to apply these filters shifted slightly, apply them

0:37:28 | over a range of shifts, and that's basically represented by this picture here

0:37:33 | and then the next layer reduces that: you pick the maximum

0:37:36 | over all these different results there, right; and so it turns out that actually you

0:37:41 | can get something like four to seven percent word error reduction, i think even a

0:37:45 | little bit more if you read the papers

0:37:49 | so the take-away for those alternative architectures:

0:37:52 | relus are definitely not easy to get to work

0:37:55 | they seem to work for smaller setups;

0:37:57 | some people tell me they get really good results on twenty-four-hour

0:38:01 | datasets, but on the big set, three hundred hours, it's very difficult and expensive

0:38:06 | on the other hand, the cnns are much simpler, and the gains are sort of in the range of

0:38:09 | what we get

0:38:10 | with the feature adaptation

0:38:14 | okay |

0:38:15 | that's the end of the training section;

0:38:17 | now let me talk a little bit about features

0:38:23 | so for features for gmms,

0:38:27 | a lot of work has been done,

0:38:29 | because gmms are typically used with diagonal covariances;

0:38:33 | a lot of work was done to decorrelate features

0:38:36 | do we actually need to do this for the dnn?

0:38:38 | well, how do you decorrelate? with a linear transform; and the first thing the dnn does is

0:38:42 | a linear transform,

0:38:44 | so it can kind of do this just by itself; well, let's see

0:38:48 | so we start with a gmm baseline, 23.6; if you put in

0:38:51 | fmpe, to be fair, 22.6

0:38:54 | and then you do a cd-dnn, just a normal dnn, using those features here,

0:38:59 | the fmpe features, and you get to seventeen

0:39:02 | now get rid of the fmpe, simply; so this minus means take it out;

0:39:06 | now it's just a plp system:

0:39:08 | seventeen

0:39:08 | that kind of makes sense, because the fmpe was basically trained specifically for this gmm

0:39:16 | structure

0:39:18 | then you can also take out the hlda; it gets even a bit better

0:39:21 | hlda obviously captures correlation over a longer range, and the dnn already handles that

0:39:29 | you can also take out the dct that's part of the plp or mfcc process

0:39:34 | and now we have a slightly different dimension,

0:39:37 | you have more features here, and so on;

0:39:41 | i think a lot of people are now using this particular setup, the log filterbank features

0:39:44 | you can even take out the deltas,

0:39:46 | but you have to account for them: you have to make the window wider,

0:39:49 | so we still see the same frames; and in our case it still works

0:39:54 | and you can go really extreme and completely eliminate the filterbank: you just look at the fft

0:39:59 | features directly;

0:40:00 | now it gets somewhat worse, but it's still in the ballpark here, right

0:40:03 | so |

0:40:05 | so actually, what we just did basically undid thirty years of feature research

0:40:10 | so |

0:40:13 | that |

0:40:13 | there is also something kind of really cool: if you really care about the filterbank,

0:40:16 | you can actually learn it; this is another poster tomorrow, so

0:40:20 | you see the blue bars and the red curves there: the blue are the mel filters

0:40:24 | and the red curves are basically

0:40:26 | learned versions of that

0:40:34 | so the dnn can kind of learn this as well

0:40:38 | so, take-away: dnns greatly simplify feature extraction; just use the log filterbank with a wider

0:40:43 | window

0:40:44 | one thing i didn't mention: you still need to do the mean normalization;

0:40:47 | that cannot be eliminated
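as a sketch of how simple that front end becomes, here random numbers stand in for real mel filterbank energies, and the window radius is an illustrative choice:

```python
import numpy as np

# simplified DNN front end: log filterbank energies, per-utterance mean
# normalization, and a wide context window; no DCT, no deltas, no HLDA.
# `fbank` stands in for real mel filterbank energies (random here).
rng = np.random.default_rng(1)
fbank = np.abs(rng.standard_normal((100, 40))) + 1e-3  # 100 frames x 40 bands

logf = np.log(fbank)
logf -= logf.mean(axis=0)            # the mean normalization you still need

def stack_context(feats, radius=5):  # +/-5 frames -> 11-frame input window
    padded = np.pad(feats, ((radius, radius), (0, 0)), mode='edge')
    return np.hstack([padded[i:i + len(feats)] for i in range(2 * radius + 1)])

x = stack_context(logf)
assert x.shape == (100, 11 * 40)     # 440-dimensional DNN input per frame
assert np.allclose(logf.mean(axis=0), 0.0)
```

the wider window replaces what deltas used to provide, and the network's first linear layer takes over the decorrelation jobs of the DCT and HLDA.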

0:40:49 | now |

0:40:50 | now, we talked about features for dnns; we can also turn it around, right; basically,

0:40:54 | you know, ask not what the features can do for the dnn but what the

0:40:57 | dnn can do for the features

0:40:59 | i think that was

0:41:01 | said by some famous speech researcher

0:41:05 | so we can use dnns as feature extractors; so the idea is basically, these are

0:41:09 | the factors that contributed to the success:

0:41:12 | long-span features,

0:41:13 | discriminative training,

0:41:15 | and the hierarchical nonlinear feature mapping

0:41:18 | right, so,

0:41:19 | and it turns out that the feature mapping is actually the major contributor; so why not use this combined

0:41:24 | with the gmm? so we go really back to what nelson talked about,

0:41:27 | right

0:41:28 | so there are many ways of doing this; tandem,

0:41:31 | as we heard this morning; you can also do tandem with a

0:41:34 | bigger layer, and there is work on that, basically using the posteriors here;

0:41:39 | you can do a bottleneck, where you take an intermediate layer that has a much

0:41:43 | smaller dimension

0:41:44 | or you can also

0:41:46 | use the top hidden layer

0:41:49 | as sort of the bottleneck, but not make it smaller, just take it; in each

0:41:52 | of those cases you would typically do something like a pca to reduce your dimensionality
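the "top hidden layer plus PCA" recipe can be sketched like this; the weights here are random and untrained, and all sizes are hypothetical, since in practice you would take the trained DNN:

```python
import numpy as np

# use a (random, untrained) "top hidden layer" as a feature extractor and
# reduce it with PCA, as you would before training a GMM on top
rng = np.random.default_rng(2)
frames = rng.standard_normal((500, 440))       # DNN input features
W = rng.standard_normal((440, 2048)) * 0.01    # hypothetical hidden weights
b = np.zeros(2048)

hidden = 1.0 / (1.0 + np.exp(-(frames @ W + b)))  # top hidden activations

def pca(x, dims):
    xc = x - x.mean(axis=0)
    # principal directions via SVD of the centered data
    _, _, vt = np.linalg.svd(xc, full_matrices=False)
    return xc @ vt[:dims].T

feats = pca(hidden, 39)            # 39-dim features for the GMM system
assert feats.shape == (500, 39)
assert np.allclose(feats.mean(axis=0), 0.0, atol=1e-8)
```

the GMM then never sees the raw input, only the decorrelated projection of the network's learned representation.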

0:41:56 | so does that work |

0:41:58 | well, okay, so if you take

0:42:00 | a dnn,

0:42:01 | and this is the hybrid system here, and then you compare it with this gmm system

0:42:05 | where we take the top layer,

0:42:07 | pca, and then apply the gmm:

0:42:09 | well, it's not really that good

0:42:12 | but now we have one really big advantage: we are back in the world of gmms,

0:42:16 | and we can capitalize on anything that worked in the gmm world, right

0:42:20 | so for example, we are able to use region-dependent linear transforms, a little bit like

0:42:24 | fmpe

0:42:26 | so once you apply that,

0:42:27 | it's already better

0:42:29 | you can also just do mmi training very easily; okay, in this case it's not really

0:42:33 | as good, but at least you can do it out of the box without any

0:42:36 | of these problems with, you know, silence and so on; and you can apply adaptation just

0:42:41 | as you always would

0:42:42 | you can also do something more interesting: you can say, what if i train my dnn

0:42:47 | feature extractor on a smaller set

0:42:49 | and then do the gmm training on a larger set,

0:42:52 | because we have the scalability problem

0:42:54 | so this can really help with the scalability problem, and you can see, well,

0:43:00 | it's close; not quite as good, but we're able to do that

0:43:04 | i mean, imagine the situation where this is like a ten-thousand-hour production database

0:43:07 | that we couldn't train the dnn on,

0:43:10 | whereas if on the dnn side we also used the same data, we would definitely get

0:43:13 | better results

0:43:14 | and then it might make sense if we combine this, for example,

0:43:18 | with the idea of building the gmm model only partially, and then see if

0:43:23 | that works; we don't know that, actually,

0:43:24 | so this deserves more attention

0:43:26 | another idea for using dnns as feature extractors

0:43:31 | is to transfer learning from one language

0:43:35 | to another; so the idea is to feed the network a training set of multiple

0:43:40 | languages

0:43:41 | and the output layer

0:43:43 | for every frame is chosen based on what that frame's language is, right; and this way you

0:43:47 | can train

0:43:48 | these shared hidden representations; and it turns out that if you do that,

0:43:51 | you can improve each individual language, and it even works for another language that has

0:43:56 | not been part of this set here
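a minimal sketch of such a shared-hidden-layer multilingual network, with made-up sizes and random weights, just to show the wiring of one shared stack and per-language output layers:

```python
import numpy as np

# shared-hidden-layer multilingual net: one hidden stack, one softmax
# output layer per language; each frame uses only its language's output.
# sizes and data are made up for illustration.
rng = np.random.default_rng(3)
W_shared = rng.standard_normal((40, 64)) * 0.1        # shared hidden layer
W_out = {"en": rng.standard_normal((64, 100)) * 0.1,  # 100 "senones" for en
         "fr": rng.standard_normal((64, 120)) * 0.1}  # 120 "senones" for fr

def forward(frame, lang):
    h = np.maximum(frame @ W_shared, 0.0)  # shared representation
    z = h @ W_out[lang]                    # language-specific output layer
    e = np.exp(z - z.max())
    return e / e.sum()                     # softmax posteriors

p_en = forward(rng.standard_normal(40), "en")
p_fr = forward(rng.standard_normal(40), "fr")
assert p_en.shape == (100,) and p_fr.shape == (120,)
assert abs(p_en.sum() - 1.0) < 1e-9
```

during training, each frame's gradient flows through its own language's output layer but updates the shared hidden weights, which is where the cross-language transfer comes from; a new language gets a fresh output layer on top of the already-trained shared stack.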

0:43:58 | the only thing is that this is typically something that works for low-resource languages

0:44:03 | but if you go larger, so for example there

0:44:08 | is a paper here that shows that if you

0:44:11 | go up to something like two hundred seventy hours of training,

0:44:14 | then your gain really is reduced, to something like three percent

0:44:18 | so this is actually something that does not seem to work very well for large

0:44:21 | settings

0:44:26 | okay so take away |

0:44:28 | the dnn acts as a hierarchical nonlinear feature transform;

0:44:31 | that's really the key to the success of dnns, and you can use this directly

0:44:36 | and put the gmm on top of that as the classification layer

0:44:40 | and it brings us back into the gmm world, with all the techniques, including parallelization and

0:44:45 | scalability and so on

0:44:47 | and on the transfer-learning side: it works for small setups,

0:44:52 | but not so much for large ones

0:44:55 | okay |

0:44:58 | last topic runtime |

0:45:00 | runtime is an issue

0:45:02 | it was less of a problem for gmms:

0:45:05 | there you can actually do on-demand computation

0:45:08 | for dnns,

0:45:09 | a large amount of the parameters is actually in the shared layers, which you cannot do on demand

0:45:14 | so

0:45:15 | for dnns,

0:45:16 | you have to compute everything

0:45:18 | and so it's important to look at how we can speed this up; so for example, the

0:45:22 | demo video that i showed you in the beginning, that was run with

0:45:25 | my gpu doing the likelihood evaluation; if you don't

0:45:30 | do that, it would run at like three times real time,

0:45:32 | which would be infeasible

0:45:34 | so |

0:45:35 | the way to approach this, and that was done both by some colleagues at microsoft

0:45:38 | and also by ibm,

0:45:40 | is to ask: do we actually need those full weight matrices

0:45:44 | and so this question is based on two observations

0:45:48 | one is that we saw early on that you can actually set something like two

0:45:52 | thirds of the parameters to zero

0:45:55 | and still get the same error rate

0:45:57 | and what ibm observed is that in the top hidden layer,

0:46:02 | the number of

0:46:03 | nodes that are actually active is relatively limited

0:46:07 | so the idea is basically to decompose, via singular value decomposition,

0:46:12 | those weight matrices

0:46:14 | and the idea is, basically, this is your network layer,

0:46:17 | the weight matrix and a nonlinearity; replace this by two matrices, and in the middle you have

0:46:23 | a low-rank bottleneck
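the factorization step can be sketched in NumPy; the layer sizes and rank are hypothetical, and the real system of course applies this to the trained weight matrix:

```python
import numpy as np

# low-rank factorization of one layer's weight matrix via SVD:
# replace W (n_in x n_out) by A (n_in x r) times B (r x n_out)
rng = np.random.default_rng(4)
W = rng.standard_normal((512, 2048))  # stand-in for a big output layer

U, s, Vt = np.linalg.svd(W, full_matrices=False)
r = 64                                # keep only the top-r singular values
A = U[:, :r] * s[:r]                  # n_in x r, absorbs the singular values
B = Vt[:r]                            # r x n_out

# big parameter reduction; A @ B is the best rank-r approximation of W
# (here W is random, so the fit is loose; after this step you would
# fine-tune with back-propagation, as described in the talk)
assert A.shape == (512, r) and B.shape == (r, 2048)
assert A.size + B.size < W.size / 6
assert (A @ B).shape == W.shape
```

at runtime the one matrix-vector product through W becomes two much cheaper products through A and B, which is where the speedup comes from.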

0:46:26 | so does that work?

0:46:27 | well

0:46:28 | so this is the gmm baseline, just for reference; the dnn,

0:46:32 | with thirty million parameters on a microsoft-internal task,

0:46:35 | starts with a word error rate of 25.6

0:46:38 | now we apply the singular value decomposition;

0:46:41 | if you just do it straight off, it gets much worse,

0:46:44 | but you can then do back-propagation again,

0:46:47 | and then you will get back to exactly the same number

0:46:50 | and you gain like a one-third parameter reduction

0:46:53 | you can actually also do that with

0:46:55 | all the layers, not just the top one; if you do that, you can bring it

0:46:58 | down

0:47:00 | by a factor of four

0:47:02 | and that is actually a very good result, so this basically brings the runtime back into range

0:47:08 | so let me just show you one more example, to again give you a very rough idea

0:47:12 | of what this means

0:47:21 | so it's only a very short example, just to give an idea; this is an apples-to-apples

0:47:25 | comparison between the old gmm system and the dnn system

0:47:29 | for speech recognition; so let's look at something that you

0:47:34 | know well; there are two devices: the one on the left runs what

0:47:37 | we previously shipped, the one on the right uses the dnn

0:47:42 | we're gonna find a good pizza and |

0:47:50 | the results are very similar; specifically, if you're interested, look here down at the latency, which is

0:47:56 | counted from when i stop talking to when we see the recognition result, about a second

0:48:01 | or so

0:48:02 | so i just wanted to give you proof that this actually works

0:48:06 | okay so |

0:48:08 | i think we've covered the whole range; i would like to recap

0:48:13 | all the take-aways

0:48:14 | okay, so we went through the

0:48:16 | cd-dnn-hmm; it's actually, as was

0:48:18 | already said, nothing else than an mlp whose outputs are the triphone states, and that's

0:48:24 | important

0:48:25 | they're not really that hard to train, we know now, but doing it fast

0:48:29 | is still sort of a frustrating enterprise, and at the moment i would recommend: just get

0:48:33 | a gpu, and if you have multiple gpus, just run multiple trainings rather than trying

0:48:37 | to parallelize a single training

0:48:40 | pre-training:

0:48:41 | it helps, but the greedy layer-wise variant is simpler, and it seems to be sufficient

0:48:48 | sequence training gives us regularly good improvements, up to thirty percent, but if you use

0:48:52 | sgd then you have to use these little tricks: smoothing

0:48:56 | and rejection

0:48:57 | adaptation helps much less than for gmms,

0:49:00 | which might be because the dnn possibly learns

0:49:04 | very good invariant representations already, so there might be a limit to what

0:49:07 | you can actually achieve

0:49:09 | relus are definitely not as easy as changing two lines of code, especially for large

0:49:14 | datasets

0:49:16 | but on the other hand, the cnns

0:49:17 | give us like five percent; that's not really that hard to get, and they make

0:49:20 | good sense

0:49:23 | dnns really simplify the feature extraction; we were able to eliminate thirty years of feature-extraction

0:49:27 | research

0:49:30 | but you can also turn it around and use dnns as feature extractors

0:49:35 | and dnns are definitely not slowing down decoding, if you use these speed-up techniques

0:49:40 | so |

0:49:40 | to conclude, where do i see the challenges going forward

0:49:44 | there are of course open issues with training;

0:49:46 | i mean, when we talk to people in the company, we are always thinking what

0:49:51 | kind of computers we'll buy in the future, and whether we optimize them for sgd, but

0:49:55 | we always think, you know what, in one year we'll laugh,

0:49:57 | laugh about this whole mini-batch method, and we will just not need all of

0:50:01 | this; but so far this has not happened, and i think it's fair to say there's not

0:50:03 | a method like this on the rise that would immediately allow parallelization

0:50:08 | and what we found is that automatic learning-rate control is not solved; this is really

0:50:11 | important, because if you don't do this right, you might run into unreliable results, and

0:50:15 | i have a hunch that the relu result we saw there was a little bit like that

0:50:19 | and it also has to do with parallelizability, because the smaller the learning rate, the bigger

0:50:23 | your mini-batch can be, and the more you can parallelize

0:50:30 | dnns still have an issue with robustness to real-life situations;

0:50:35 | they may sort of not have solved speech, but they got very close to

0:50:39 | solving speech under perfect recording conditions; but it still fails if you do speech

0:50:44 | recognition over like one meter of distance, in a meeting room with two microphones or something like

0:50:48 | that; so dnns are not

0:50:49 | inherently, automatically robust to noise:

0:50:52 | they learn seen variability, but not unseen variability

0:50:57 | then, personally, i wonder: can we go to more of a machine-learning view?

0:51:00 | so for example, there's already work that tries to eliminate the hmm and replace it

0:51:04 | by an rnn, and i think that's very interesting; and the same thing has already been

0:51:08 | very successfully done with language models

0:51:11 | and there's the question of:

0:51:13 | can we jointly train everything in one big step? but on the other hand,

0:51:16 | the problem with that is that different

0:51:19 | aspects of the model need different kinds of data that have different costs attached to

0:51:24 | them, so it might actually never be possible, or needed, to do a joint training

0:51:28 | and the final question that i sort of have is: what do dnns teach us about

0:51:32 | how humans process speech,

0:51:35 | and will we also get

0:51:36 | more ideas

0:51:38 | from that

0:51:40 | so that concludes my talk thank you very much |

0:51:51 | i think we have like six minutes for questions |

0:52:12 | i'm not an expert on neural networks, but i was wondering: if i train a

0:52:19 | neural network on conventional speech data and i try to recognize data which is

0:52:26 | much cleaner, will it therefore not be as good, because we don't model the noise

0:52:31 | so what was the configuration you want? you want to train on what?

0:52:34 | the idea is that they train their neural nets on the noisy data and then run them

0:52:39 | on the clean data

0:52:41 | so they don't know exactly; that's my question

0:52:44 | okay, so i actually did skip

0:52:46 | one slide; let me show this one

0:52:50 | so |

0:52:51 | the dnn is actually |

0:52:56 | way |

0:53:04 | so you get like |

0:53:10 | so this table here shows results on aurora, so basically doing in this case multi-style training

0:53:19 | so the idea was not to train on noisy and test on clean,

0:53:22 | but this is basically training and testing on the same

0:53:26 | set of noise conditions

0:53:28 | and so there are a lot of numbers here; this is the gmm baseline, if you look

0:53:31 | at this line here, thirteen point four

0:53:34 | i'm not a specialist on robustness, but i think this is about the best you can

0:53:38 | do with the gmm,

0:53:39 | pooling all the tricks that you could possibly put in

0:53:42 | and the dnn,

0:53:43 | it's just,

0:53:44 | without any tricks, just training on the data, you get,

0:53:48 | you know what, you get just exactly the same

0:53:51 | so what this means, i think, is that the dnn is very good at learning

0:53:55 | variability in the input, also noise, that it sees in the training data

0:54:02 | but we have other experiments where we see that if the

0:54:06 | variability is not covered in the training data,

0:54:09 | the dnn is not very robust against it

0:54:12 | so i don't know what happens if you train on noisy and test on clean,

0:54:15 | and clean is not one of the conditions that you have in your training; i could imagine

0:54:18 | that it will hurt, but i haven't tried it on that data

0:54:25 | i don't think you can really claim to get away with thirty years; thirty years maybe

0:54:30 | overstates it a bit

0:54:33 | you're obviously talking tongue in cheek, right; what you're talking about is going back before

0:54:38 | some of the developments of the eighties, right, and most of the effort on feature

0:54:43 | extraction in the last twenty years of conferences has actually been more on robustness, on dealing with unseen variability,

0:54:51 | and this doesn't address that situation

0:54:59 | some more questions or comments |

0:55:02 | what do you think about features, ideas for future

0:55:08 | research

0:55:10 | is it and use a large temporal context this is also be one it's was |

0:55:16 | coming but for |

0:55:19 | in contrast |

0:55:21 | it's something |

0:55:24 | okay what exactly i don't have to sell the embassy okay |

0:55:33 | any more comments

0:55:36 | kind of a personal question: you said that you didn't know anything about neural nets until like

0:55:40 | two, three years back, something like that; so do you see this as rather an

0:55:45 | advantage or a drawback, maybe being less sentimental

0:55:48 | in throwing away something that, you know, the guys who have been in the field for

0:55:54 | many years considered sort of untouchable, or the other way round

0:55:58 | i think so; i think it helps to come in with sort of a little bit

0:56:02 | of an outsider's mind; so i think, for example, it helped me to understand this

0:56:06 | parallelization thing, right, that basically what you do in sgd, you do plain

0:56:11 | mini-batch training

0:56:13 | and normally the regular definition of mini-batches is that you take the average over the

0:56:18 | batch;

0:56:18 | maybe you might have noticed that i didn't actually divide by the number of frames

0:56:23 | when i used this formula, right

0:56:27 | so that, for example, is something where, for me as an engineer coming in, looking at

0:56:30 | that, i wondered, you know, why do you do mini-batches as an average? that doesn't seem to

0:56:33 | make sense, you're just accumulating multiple frames over time; and that helped me understand those kinds of

0:56:38 | parallelization questions in a different way

0:56:41 | but these are probably details

0:56:49 | okay, any other questions

0:56:54 | okay, so let's thank the speaker again