0:00:16 | thank you so welcome back after the lunch |

0:00:19 | my name's frank seide i'm from microsoft research in beijing and this is joint |

0:00:24 | work with my colleague dong yu who happens to be chinese but is actually based in redmond |

0:00:29 | and of course there are a lot of contributors to this work inside the company and |

0:00:33 | outside and also thank you very much to the people sharing slide material |

0:00:38 | okay let me start with like a personal story of how i got into this because |

0:00:42 | i'm sort of an unlikely expert at this because until two thousand eleven i had |

0:00:47 | no i mean until two thousand ten i had no idea what neural networks were deep ones or not |

0:00:51 | so in two thousand ten |

0:00:52 | my colleague dong yu who cannot be here today came to visit us in beijing and told |

0:00:58 | us about this new speech recognition result that they had |

0:01:02 | and he told me about a technology that i had never heard about called dbn |

0:01:07 | and said |

0:01:08 | this was sort of invented by some professor in toronto that i also had never heard |

0:01:12 | about |

0:01:14 | so he and his manager at the time had invited geoffrey hinton |

0:01:19 | this professor to come to redmond with a few students and work on applying |

0:01:23 | this to speech recognition |

0:01:25 | and at the time he got a |

0:01:26 | sixteen percent relative error reduction |

0:01:29 | out of applying deep neural networks |

0:01:31 | and this was for a voice search task a relatively small number of hours of training |

0:01:36 | you know sixteen percent is really big a lot of people spend ten years |

0:01:40 | to get a sixteen percent error reduction |

0:01:42 | so my first thought about this was |

0:01:44 | sixteen percent wow what's wrong with the baseline |

0:01:55 | so we said well why don't we collaborate on this and try how this carries over into |

0:01:59 | a large-scale task that is switchboard |

0:02:02 | and the key thing that was actually invented here well we'll talk about the classic ann hmm |

0:02:07 | i think this reference is probably covered |

0:02:10 | by what we heard this morning from nelson |

0:02:12 | a little bit |

0:02:13 | too late |

0:02:15 | so the classic ann hmm and then the deep network the dbn |

0:02:19 | which actually does not stand for dynamic bayesian networks as i learned |

0:02:23 | at that point |

0:02:24 | and then dong yu put in this idea of |

0:02:26 | just using tied triphones as modeling targets like we did in gmm based systems |

0:02:32 | okay so |

0:02:34 | then fast forward like half a year i was reading papers and tutorials to start and |

0:02:38 | finally we got to the point where we got the first |

0:02:41 | results so this is our gmm baseline and i started the training and the next day i had |

0:02:47 | the first iteration |

0:02:48 | it was like twenty two percent so okay it seems to not be completely off |

0:02:53 | the next day i come back |

0:02:55 | twenty percent |

0:02:56 | so fourteen percent relative and i sent a congratulations email to my colleague right |

0:03:00 | then it ran on the next day i came back |

0:03:03 | eighteen percent |

0:03:04 | and really from that one moment i was just sitting at the computer waiting |

0:03:07 | for the next result to come out and plotting it and saw it getting better |

0:03:11 | we got seventeen point three |

0:03:13 | then seventeen point one |

0:03:15 | then we redid the alignment that's one thing dong yu had already determined on the |

0:03:20 | smaller setup we got it down to sixteen point four then we looked at sparseness |

0:03:24 | sixteen point one so in total we got thirty two percent error reduction |

0:03:27 | that's a very large reduction |

0:03:29 | out of a single technology |

0:03:33 | we also ran this over different test sets with the same model and you could see |

0:03:37 | the error rate reductions were all sort of in a similar range |

0:03:40 | for the harder data the gains were slightly worse |

0:03:44 | we also looked at other tasks for example at some point we finally trained the two |

0:03:48 | thousand hour model the kind you'd use for a product like the windows phone system that you |

0:03:54 | have right now we got something like fifteen percent error reduction |

0:03:58 | and also other companies started publishing for example ibm on broadcast news i think the |

0:04:02 | total gain is thirteen to eighteen percent that's i think in an up to date paper |

0:04:07 | and then on youtube i think it was about nineteen percent so the gains were |

0:04:11 | really convincing across the board |

0:04:14 | okay so that was our work so what is this actually |

0:04:17 | now i thought asru has attendees from different areas of understanding people might not |

0:04:22 | you know know the dnn in all the detail so i thought i would like to |

0:04:26 | go through and explain |

0:04:27 | a little bit more of the basics of how this works i don't know how many |

0:04:31 | understanding people are really here today i hope it's not gonna be too boring |

0:04:34 | so the basic idea is |

0:04:36 | the dnn looks at for example a spectrogram |

0:04:40 | a rectangular patch out of that a range of vectors |

0:04:44 | and feeds this into this processing chain where it basically multiplies this input vector this rectangle |

0:04:49 | here with a matrix adds some bias and applies a nonlinearity and then you get |

0:04:54 | something like two thousand values and then you do this several times |

0:04:58 | and the top layer does the same thing except the nonlinearity is a softmax |
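
the processing chain just described (affine transform, nonlinearity, repeated, softmax on top) can be sketched in a few lines of numpy; the layer sizes here are toy illustration values, not the two-thousand-unit layers from the talk:

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability before exponentiating
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dnn_forward(x, layers):
    """layers: list of (W, b) pairs; sigmoid hidden layers, softmax output."""
    h = x
    for W, b in layers[:-1]:
        h = sigmoid(W @ h + b)          # z = W x + b, then the nonlinearity
    W, b = layers[-1]
    return softmax(W @ h + b)           # top layer: posterior over states

# toy sizes: 40-dim input patch, two small hidden layers, 10 output states
rng = np.random.default_rng(0)
dims = [40, 64, 64, 10]
layers = [(0.1 * rng.standard_normal((dims[i + 1], dims[i])),
           np.zeros(dims[i + 1])) for i in range(len(dims) - 1)]
p = dnn_forward(rng.standard_normal(40), layers)
```

the output `p` is a proper posterior distribution over the (here ten) states.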

0:05:02 | so |

0:05:04 | these are the formulas for that so what is this actually well a softmax |

0:05:08 | is this form here |

0:05:09 | that is essentially nothing else but sort of a linear classifier and it is linear |

0:05:13 | because if you look at the class boundaries between two classes they are linear so it's actually a |

0:05:17 | relatively weak classifier we have there |

0:05:20 | the hidden layer is actually very similar it has the same form the only difference |

0:05:25 | is that there are sort of only two classes |

0:05:28 | instead of n or however many different speech states here and the second class |

0:05:32 | has parameters zero |
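
the "second class has parameters zero" remark can be made concrete (my notation, hedged): a two-class softmax with the second class's weights and bias pinned to zero collapses to the logistic sigmoid, which is why a hidden unit looks like a tiny two-class version of the output layer,

$$
\frac{e^{w^\top x+b}}{e^{w^\top x+b}+e^{0^\top x+0}}
=\frac{1}{1+e^{-(w^\top x+b)}}
=\sigma(w^\top x+b).
$$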

0:05:34 | so what is this really this is sort of a classifier that classifies class |

0:05:38 | membership or non membership in some class but we don't know what those classes are |

0:05:42 | actually |

0:05:43 | and this representation is actually also kind of sparse typically you get only |

0:05:48 | maybe ten percent of the activations five to ten percent |

0:05:52 | to be active in any given frame |

0:05:54 | so these are really sort of class membership features kind of descriptive features |

0:05:58 | of your input |

0:06:00 | so another way of looking at it is |

0:06:03 | basically what it does is take an input vector and project it onto something like a basis vector |

0:06:07 | one column |

0:06:09 | this would be like a direction vector you project onto it there's a bias term we |

0:06:13 | add on it and then you run it through this nonlinearity which is sort of a soft |

0:06:16 | binarization |

0:06:18 | so what this does is give you sort of a soft you know like |

0:06:21 | a coordinate system for your inputs |

0:06:25 | and yet another |

0:06:27 | way of looking at it is |

0:06:28 | well |

0:06:30 | this one here is actually a correlation so here the parameters have the same sort |

0:06:36 | of physical meaning as the inputs you put in there |

0:06:40 | so for example for the first layer the model parameters are also of the nature |

0:06:44 | of being a rectangular patch |

0:06:45 | of spectrogram |

0:06:46 | so and this is what they look like i think there was a little bit |

0:06:49 | of discussion earlier in nelson's talk |

0:06:52 | so what does this mean each of these |

0:06:55 | is in this case thirty two or twenty three frames wide |

0:06:59 | this is the frequency |

0:07:01 | axis here |

0:07:02 | and what happens is that these things are basically overlaid over here and then the |

0:07:05 | correlation is made and wherever it detects this particular pattern this is sort of a |

0:07:09 | peak detector here that is sliding over time |

0:07:13 | then you get a high output |

0:07:14 | okay |

0:07:15 | you can see all these different patterns here many of them really look |

0:07:18 | like gabor filters |

0:07:20 | but these are automatically learned by the system there's no knowledge that was put in there |

0:07:24 | you have these edge detectors you have peak detectors you have some sliding detectors you |

0:07:29 | have a lot of noise in there actually i don't know what that's for i think |

0:07:32 | the later stages probably just ignore them |

0:07:36 | the harder problem is how to interpret the hidden layers |

0:07:39 | the hidden layers don't have any sort of spatial relationship to the input |

0:07:44 | or something so the only thing that i could think of is that |

0:07:47 | they were representing something like |

0:07:49 | logical operations so think of this again this is the direction vector this is the |

0:07:53 | hyperplane that is described by the bias right so if your inputs for example are |

0:07:58 | one one this is obviously a |

0:08:01 | two dimensional vector or one zero |

0:08:04 | it could be this one or this one you could put a plane here that is an or operation |

0:08:09 | okay kind of a soft or because it's not strictly binary |

0:08:12 | or you put it here and it is like an and operation |

0:08:14 | so i think my personal intuition of what the dnn actually does |

0:08:18 | is |

0:08:19 | on the lower layers it extracts these landmarks |

0:08:22 | and on the higher layers it assembles them into more complicated classes |

0:08:27 | and it can do interesting things you can imagine |

0:08:30 | that for example one node in a layer discovers say a female version of an a and |

0:08:34 | then another node would give you a male version of a |

0:08:37 | then the next layer would say it's an a if it's |

0:08:40 | a female or a male a |

0:08:42 | so this gives an idea of the modeling power of this of this |

0:08:45 | thing |

0:08:47 | okay so take away |

0:08:49 | the lowest layer matches landmarks higher layers i think are sort of soft logical operators |

0:08:54 | and the top layer is just a really primitive linear classifier |
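
the soft and / soft or intuition can be demonstrated with a single sigmoid unit; the weights and biases below are hand-picked illustration values, not anything learned by a real network:

```python
import numpy as np

def unit(w, b, x):
    # one sigmoid unit: a soft threshold of w.x + b
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

# with binary-ish inputs, moving the hyperplane (the bias) turns the
# same unit into a soft OR or a soft AND, as described in the talk
w = np.array([10.0, 10.0])
soft_or  = lambda x: unit(w, -5.0,  np.array(x, dtype=float))   # fires if either input is on
soft_and = lambda x: unit(w, -15.0, np.array(x, dtype=float))   # fires only if both are on
```

so the "soft" part is literal: the outputs are near zero or near one, never exactly binary.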

0:08:57 | okay so how do we do this in speech how is this used in speech |

0:09:02 | you take those outputs these probabilities posterior probabilities of speech segments |

0:09:08 | senones you know |

0:09:10 | you turn them into |

0:09:12 | likelihoods using bayes rule and these are directly used in the hidden markov model |

0:09:16 | decoder |

0:09:19 | and the key thing here is that these classes are tied triphone states and not |

0:09:23 | monophone states that is the thing that really made a big difference |
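
the bayes-rule step just described, dividing the posterior by the state prior to get a scaled likelihood for the hmm, might look like this in the log domain (toy numbers; in practice the priors would come from alignment counts):

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, log_priors):
    """Convert DNN state posteriors p(s|x) into scaled likelihoods
    p(x|s)/p(x) = p(s|x)/p(s), usable as HMM emission scores."""
    return log_posteriors - log_priors

# toy example with 3 states
post  = np.log(np.array([0.7, 0.2, 0.1]))   # DNN output for one frame
prior = np.log(np.array([0.5, 0.3, 0.2]))   # state priors
scores = scaled_log_likelihoods(post, prior)
```

the constant p(x) cancels in the decoder's argmax, which is why the division by the prior is all that is needed.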

0:09:26 | okay so just before we move on just to give a rough idea of like |

0:09:30 | what these error rates actually mean i wanna play a little video clip |

0:09:36 | where our executive vice president of research gave an on stage demo |

0:09:41 | and you can see what accuracies come out of a speaker independent |

0:09:45 | dnn it has not been adapted to his voice |

0:09:53 | still far error rate for our work we have the one point five |

0:10:04 | what you hear research my research university |

0:10:10 | okay together with the other in your recognition so |

0:10:19 | i use i tell you know what i weight given red color your |

0:10:31 | so this is this is basically perfect right and this is really a speaker independent |

0:10:35 | system |

0:10:36 | and you can i think do interesting things with that just for the fun of it |

0:10:39 | i'm gonna play a later part of the video where we actually use this |

0:10:42 | input to drive translation |

0:10:46 | translated into chinese you and vocal here we see i am i know |

0:11:05 | i |

0:11:07 | there i here |

0:11:09 | you people one |

0:11:17 | that is there |

0:11:21 | side |

0:11:31 | for this is a very |

0:11:35 | you do initial values you well |

0:11:41 | if you hear that right down by various people |

0:11:48 | so what we see |

0:11:54 | so that's the kind of fun you can have with a model like that |

0:11:58 | okay so |

0:11:59 | now in this talk |

0:12:02 | i would like to |

0:12:03 | you know people have been giving invited talks about the dnn |

0:12:08 | at basically every one of those conferences like a one hour talk |

0:12:12 | on this very topic for example last year's slt conference with andrew senior |

0:12:16 | and i think at icassp as well so when i prepared |

0:12:20 | this talk i found that it ended up |

0:12:23 | duplicating andrew's talk |

0:12:26 | so i thought that's maybe not a good idea i wanna do it slightly differently |

0:12:29 | so what i wanted to do is more focused |

0:12:31 | i'm not gonna give you an exhaustive overview of everything but i will focus |

0:12:35 | on |

0:12:36 | what is needed to build real life systems large-scale systems so for example you will |

0:12:40 | not see a timit result |

0:12:42 | and it is structured along three areas training features and run-time training is the biggest one i'm |

0:12:47 | gonna start with that |

0:12:50 | so |

0:12:51 | how do you train this model i think we're pretty much all familiar with back-propagation |

0:12:55 | you give it |

0:12:56 | a sample vector run it through the network get a posterior distribution compare it against what it |

0:13:00 | should be |

0:13:01 | and then basically nudge the system a little bit in the direction to do a |

0:13:05 | better job next time |
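
that "nudge in the right direction" is, for the top softmax layer, just the posterior-minus-target gradient; here is a minimal sketch with toy sizes and plain sgd (assumptions mine, not the talk's actual recipe):

```python
import numpy as np

def sgd_step(W, b, x, target, lr=0.1):
    """One back-propagation step for a softmax classifier: the gradient
    of cross-entropy w.r.t. the pre-activation is (posterior - target)."""
    z = W @ x + b
    p = np.exp(z - z.max()); p /= p.sum()   # forward: posterior distribution
    err = p.copy(); err[target] -= 1.0      # compare against what it should be
    W -= lr * np.outer(err, x)              # nudge the parameters downhill
    b -= lr * err
    return W, b, p

rng = np.random.default_rng(1)
W, b, x = np.zeros((5, 8)), np.zeros(5), rng.standard_normal(8)
for _ in range(50):                         # repeat the nudge on one sample
    W, b, p = sgd_step(W, b, x, target=2)
```

after a few dozen nudges the posterior mass concentrates on the target class.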

0:13:07 | and so the problem is when you do this with a deep network often the |

0:13:11 | system does not converge or will get stuck in a local optimum |

0:13:14 | so the thing that this whole revolution started with from geoffrey hinton |

0:13:19 | the thing that |

0:13:19 | sorry the thing that he proposed is the restricted boltzmann machine |

0:13:24 | and the idea is basically you train |

0:13:26 | layer-wise so here we extend the network sort of in a way that it |

0:13:30 | can run backwards |

0:13:31 | so you can run the sample through |

0:13:34 | you get a representation you run it backwards and then you can see okay how |

0:13:37 | well does the thing that comes out actually match my input |

0:13:40 | then you can tune that system so that it matches the input as closely as possible |

0:13:45 | if you can do that and don't forget this is sort of a binary representation |

0:13:48 | that means you have a representation of the data that is meaningful this thing extracts something |

0:13:53 | meaningful about the data and that's sort of the idea |

0:13:56 | so now you do the same thing with the next layer you freeze this it is |

0:13:59 | taken as a feature extractor |

0:14:00 | you do this with the next layer and so on |

0:14:02 | then you put a |

0:14:04 | softmax on top and then train towards your classification targets |

0:14:08 | now i had no idea about |

0:14:10 | deep neural networks or anything when i started this so i thought why would we do |

0:14:13 | this so complicated i mean we had already run experiments on how many layers you |

0:14:18 | need and so on so we already had |

0:14:20 | a network that had like a single hidden layer |

0:14:23 | so why not just take that one as initialization |

0:14:25 | rip out its softmax layer and then put another |

0:14:30 | hidden layer and another softmax on top of it |

0:14:32 | and then iterate the entire stack here |

0:14:34 | and then after that again rip this guy off and do it again and so |

0:14:38 | on and once you are at the top iterate this thing |

0:14:41 | so we call this greedy layer-wise discriminative pre-training |
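
the rip-off-the-softmax loop just described might be sketched like this; `train_briefly` stands in for a few epochs of back-propagation and is a placeholder of mine, not code from the actual system:

```python
import numpy as np

def new_layer(n_in, n_out, rng):
    # random init; the 0.1 scale is an arbitrary choice for this sketch
    return 0.1 * rng.standard_normal((n_out, n_in)), np.zeros(n_out)

def discriminative_pretrain(n_in, hidden_dims, n_states, train_briefly, rng):
    """Greedy layer-wise discriminative pre-training: train one hidden
    layer plus softmax briefly ("into the ballpark"), rip the softmax
    off, add a new hidden layer and a fresh softmax, train briefly
    again, and so on; full training happens only at the very end."""
    hidden, prev = [], n_in
    for h in hidden_dims:
        hidden.append(new_layer(prev, h, rng))   # new hidden layer on top
        top = new_layer(h, n_states, rng)        # fresh softmax each round
        train_briefly(hidden + [top])            # only a partial iteration
        prev = h
    return hidden + [top]

rng = np.random.default_rng(0)
net = discriminative_pretrain(40, [64, 64, 64], 10, lambda layers: None, rng)
```

the point of the sketch is the growth schedule: each round the old softmax is discarded and only the hidden stack survives.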

0:14:44 | and it turns out that actually works really well so if we look at this |

0:14:48 | the dbn pretraining of geoffrey hinton this is the green curve here |

0:14:53 | if you do what i just described you get the red one they are essentially |

0:14:58 | the same word error rate |

0:15:00 | and this is for different numbers of layers this is not progression over training it's the accuracy |

0:15:05 | for different numbers of layers right |

0:15:07 | so the more layers you get the better it gets and |

0:15:09 | you see basically both curves |

0:15:11 | track each other |

0:15:12 | the layer-wise pretraining is slightly worse but then dong yu who understands neural networks much better |

0:15:17 | than i do |

0:15:18 | said you maybe shouldn't iterate the model all the way to the end you should |

0:15:22 | just let it iterate a little bit until it's in the ballpark then move on it |

0:15:25 | turns out that made the system slightly better and actually the sixteen point eight here |

0:15:29 | this is with this layer-wise pre-training method |

0:15:34 | now you might think it's expensive |

0:15:35 | because every time you have this full nine thousand senone top layer there but |

0:15:39 | it turns out you don't need to do that you can actually use monophones |

0:15:42 | and it actually works equally well and is much cheaper |

0:15:46 | okay so take away pre-training still seems to help |

0:15:50 | but greedy discriminative pre-training is sufficient and much simpler than the rbm pre-training because |

0:15:55 | we just use the existing code don't need new coding |

0:15:59 | okay another important topic is |

0:16:02 | sequence training |

0:16:03 | so the question here is |

0:16:06 | we have actually trained this network to classify these signals into those segments of |

0:16:11 | speech independently of each other but in speech recognition |

0:16:14 | we have dictionaries of course language models we have the hidden markov model that gives you |

0:16:18 | sequences and so on |

0:16:19 | so if we want to integrate that into the system and we do that |

0:16:23 | we should actually get a better result right |

0:16:25 | so |

0:16:27 | the frame-classification criterion is written this way you maximise the log posterior of every single |

0:16:32 | you know correct state |

0:16:36 | if you write down sequence training you actually find |

0:16:40 | that it has exactly the same form |

0:16:42 | except this here is not the state posterior derived from the dnn but it is the state |

0:16:47 | posterior taking all the additional knowledge into account |

0:16:51 | so this one takes into account hmms the dictionary and language models |

0:16:55 | so the way to run this is you run your data through and you have |

0:16:59 | here the full machinery from speech recognition |

0:17:01 | to compute these posteriors |

0:17:02 | in practical terms you would do this with word lattices |

0:17:05 | and then you do back-propagation |
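
in symbols (my notation, hedged; the slide may write it slightly differently) the two criteria share the same outer form and differ only in which posterior appears,

$$
\mathcal{F}_{\mathrm{CE}}=\sum_{t}\log p_{\mathrm{dnn}}(s_t\mid x_t),
\qquad
\mathcal{F}_{\mathrm{SEQ}}=\sum_{t}\log p(s_t\mid x_1,\ldots,x_T),
$$

where $p(s_t\mid x_1,\ldots,x_T)$ is computed by forward-backward over the word lattice and therefore folds in the hmm, the dictionary, and the language model.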

0:17:08 | so we did that |

0:17:10 | we started with the baseline fifteen point six percent |

0:17:13 | we did the first iteration of this sequence training |

0:17:16 | and it went to |

0:17:17 | twenty one |

0:17:18 | point four |

0:17:19 | so that kind of didn't work |

0:17:22 | so |

0:17:24 | well we observed that it sort of diverged |

0:17:27 | it didn't look like it was training |

0:17:30 | so we tried to find out what is the problem here so there are four |

0:17:33 | hypotheses |

0:17:34 | are we actually using the right models for lattice generation are there problems with lattice sparseness |

0:17:39 | randomization of data and the objective function there are multiple objective functions to choose from and today |

0:17:44 | i will talk about the lattice sparseness |

0:17:46 | so the first thing we found was that |

0:17:49 | there was an increasing |

0:17:51 | sort of |

0:17:52 | problem of speech getting replaced by silence |

0:17:57 | a deletion problem we saw that the silence scores kept growing |

0:18:01 | and the other scores were not |

0:18:03 | so basically what happens is that |

0:18:05 | the lattice is very biased the lattice typically doesn't have negative hypotheses for silence because |

0:18:11 | it's so far away from speech but it has a lot a lot of positive |

0:18:15 | examples of silence |

0:18:16 | so this thing was just biasing the system towards recognizing silence giving it a |

0:18:21 | high bias |

0:18:22 | so what we did is we said okay why don't we just |

0:18:24 | not update |

0:18:26 | silence states and also skip all silence frames |

0:18:29 | so that already gave us something much better |

0:18:31 | it already looked like it's converging |

0:18:34 | we could also do this slightly more systematically we could actually explicitly add silence arcs |

0:18:39 | into the lattice |

0:18:41 | right those that should have been there in the first place |

0:18:44 | so once you do that |

0:18:46 | it actually gets even slightly better so that kind of confirms the missing silence hypothesis |

0:18:50 | after all |

0:18:52 | but then |

0:18:53 | another problem is that the lattices are rather sparse |

0:18:56 | so we find that at any given frame |

0:18:58 | we only have like three hundred out of nine thousand senones in the lattice and |

0:19:02 | that |

0:19:03 | the others are not there because they basically had zero probability |

0:19:07 | but as the model moves along maybe they at some point no longer have zero |

0:19:11 | probability so they should be there in the lattice but they're not |

0:19:14 | so the system cannot train properly |

0:19:16 | so we thought why don't we just regenerate lattices after one iteration |

0:19:20 | and we see it helps a little bit the difference is it at least keeps stable here |

0:19:25 | now we thought can we do this slightly better so basically we take this idea |

0:19:28 | of adding silence arcs |

0:19:30 | and sort of add speech arcs but you can't really do that |

0:19:33 | but a similar effect can be achieved by interpolating your sequence criterion |

0:19:38 | with the frame criterion |

0:19:40 | so and when we do that we get |

0:19:43 | a very good convergence |
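
the interpolation trick can be stated in one line on the gradients; the weighting `h` below is an arbitrary illustration value, not the talk's actual setting:

```python
import numpy as np

def smoothed_gradient(g_seq, g_frame, h=0.1):
    """Frame smoothing: interpolate the sequence-training gradient with
    the frame (cross-entropy) gradient; h is an assumed weight."""
    return (1.0 - h) * g_seq + h * g_frame

# toy two-parameter gradients
g = smoothed_gradient(np.array([1.0, -2.0]), np.array([0.0, 1.0]), h=0.1)
```

the frame term acts like the missing competitors: it keeps pulling toward the frame-level targets even where the lattice has no arcs.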

0:19:46 | so |

0:19:47 | now we're not the only people that observed that problem who ran into this |

0:19:51 | issue with the training so for example karel vesely |

0:19:55 | and his coworkers |

0:19:57 | observed that |

0:19:58 | if you look at the posterior probability of the ground truth path |

0:20:02 | over time you sometimes find that it's very low it's not always zero but sometimes it is zero |

0:20:07 | and that matters a lot |

0:20:09 | so |

0:20:09 | what they found is that |

0:20:11 | if you just skip those frames they called it frame rejection you get a much better |

0:20:15 | convergence behavior so the red curve is without and the blue curve is |

0:20:19 | with frame rejection |

0:20:23 | and of course |

0:20:25 | brian kingsbury also observed exactly the same thing but he said no i'm gonna do the |

0:20:28 | smart thing |

0:20:29 | i'm gonna do something much better i'm gonna use a second order method |

0:20:33 | so with a second order method you approximate the objective function as a second |

0:20:37 | order function then you can like hop right to the optimum theoretically |

0:20:41 | and this can be done without explicitly computing the hessian and this is |

0:20:44 | the hessian-free method that martens a student of hinton |

0:20:48 | sort of optimized |

0:20:49 | and the nice thing is it's actually a batch method |

0:20:52 | so it doesn't |

0:20:54 | suffer from these previous issues of like lattice sparseness and the need for randomization and |

0:20:59 | all of that |

0:21:01 | and also i think at this conference there's a paper that says that it |

0:21:04 | works with a partially iterated ce model you don't even have to do a full ce |

0:21:08 | iteration that's also very nice |

0:21:11 | and |

0:21:12 | i need to say that brian actually did this first he was the first to show the |

0:21:16 | effectiveness of sequence training |

0:21:18 | for switchboard |

0:21:19 | okay so here are some results |

0:21:22 | so this is the gmm system a ce trained cd-dnn and the |

0:21:27 | sequence trained one |

0:21:28 | so this is all on switchboard on hub5 00 and rt03 |

0:21:32 | so we get like twelve percent |

0:21:35 | basically and others got eleven percent and brian on the rt03 set |

0:21:39 | also fourteen percent it's all in a similar range |

0:21:42 | we also |

0:21:43 | well i wanna point out one thing |

0:21:46 | going from here to here |

0:21:47 | now the dnn has given us forty two percent relative |

0:21:51 | and that's a fair comparison because this is also a sequence trained baseline |

0:21:55 | right so the only difference is that the gmm is replaced by the dnn |

0:22:01 | also it works on a larger dataset |

0:22:05 | okay so take away sequence training gives us gains of nine to thirty percent |

0:22:10 | sgd works but you need some tricks there |

0:22:13 | those are smoothing and rejection of bad frames |

0:22:16 | and the hessian-free method requires no tricks but is actually much more complicated so to |

0:22:20 | start with i would probably start with the sgd method |

0:22:27 | so another big question is parallelizing the training |

0:22:30 | so just to give an idea of the scale the model we used in this demo video |

0:22:34 | was trained on two thousand hours |

0:22:37 | it took sixty days |

0:22:40 | now |

0:22:41 | most of you probably don't work with windows |

0:22:44 | we do and that causes a very specific problem because you've probably heard of something called |

0:22:49 | patch tuesday |

0:22:51 | so basically |

0:22:52 | every two to four weeks microsoft it forces us to update some virus scanners |

0:22:57 | or something like that |

0:22:58 | and so basically those machines have to be rebooted |

0:23:02 | so running a job for sixty days is actually a problem |

0:23:06 | so |

0:23:07 | we were running this on a gpu so we had a very strong motivation to look |

0:23:11 | at that |

0:23:12 | but don't get your hopes up |

0:23:14 | so |

0:23:15 | one way of trying to parallelize the training is to use batch methods |

0:23:20 | brian had already shown hessian-free works very well for this problem |

0:23:24 | so actually a colleague who was an intern at microsoft |

0:23:29 | tried to use hessian-free also for the ce training |

0:23:34 | but the take away was basically it takes a lot of iterations to get |

0:23:38 | there so it was actually not faster |

0:23:41 | so back to sgd |

0:23:42 | sgd is also a problem because if we do mini-batches of say one thousand twenty |

0:23:47 | four frames every one thousand twenty four frames you have to exchange a lot of data |

0:23:51 | so that's a big challenge so the first group actually a company that did |

0:23:55 | this successfully was google with asynchronous sgd |

0:24:00 | so the way that works is |

0:24:02 | you have your machines you group them into groups first you group them together each |

0:24:06 | group takes a part of the model and then you split your data and |

0:24:08 | each chunk computes a different gradient |

0:24:11 | so that at any given time |

0:24:13 | whenever one of them has a gradient computed |

0:24:16 | it sends that to a |

0:24:18 | parameter server or set of parameter servers and those parameter servers aggregate it into |

0:24:23 | the model they update it |

0:24:25 | and then |

0:24:26 | whenever they feel like it and the bandwidth allows they send |

0:24:31 | the model back |

0:24:32 | now that's a completely asynchronous process the way to think of this is just independent |

0:24:36 | threads one thread is just computing with whatever's in memory |

0:24:39 | another thread is just sharing and exchanging data in whatever way with no synchronisation |
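
a toy version of that parameter-server picture, with worker threads pushing gradients computed from possibly stale model copies and no synchronisation barrier; the quadratic "loss" is a stand-in for the network, and all constants are illustration values:

```python
import threading
import numpy as np

# toy objective: pull the shared parameter vector toward `target`
target = np.array([3.0, -1.0])
server_w = np.zeros(2)              # the "parameter server" state
lock = threading.Lock()             # protects the shared vector itself

def worker(steps, lr=0.1):
    global server_w
    for _ in range(steps):
        w = server_w.copy()         # fetch a (possibly stale) model copy
        grad = 2.0 * (w - target)   # local gradient from the stale copy
        with lock:
            server_w -= lr * grad   # push the update, no barrier

threads = [threading.Thread(target=worker, args=(200,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

despite the staleness, the shared model still converges, which is the point of the "each update contributes independently" argument that follows.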

0:24:45 | so why would that work |

0:24:47 | well it's very simple because |

0:24:50 | sgd implies sort of an assumption you know that we are making very small steps |

0:24:55 | so basically |

0:24:57 | every parameter update contributes independently to the objective function |

0:25:01 | so it's okay to miss some of them |

0:25:05 | and also there is something that we call delayed update i'm going to quickly explain |

0:25:08 | that |

0:25:08 | so in the simplest form of the training as explained in the beginning you take at every point |

0:25:12 | in time a sample x you take the model |

0:25:16 | compute the gradient update the model with the gradient |

0:25:20 | and then do it again after one frame you do it again do it again |

0:25:24 | and then basically right |

0:25:26 | your new model is equal to the old model plus the gradient |

0:25:29 | we can also do this differently you can also not advance |

0:25:33 | the model that is you use the same model multiple times |

0:25:36 | and update with it for example in this example four |

0:25:39 | times you do four model updates the frames are still these frames right but the |

0:25:43 | model is the same model |

0:25:45 | then you do this again and so on |

0:25:47 | so that's actually what we call mini-batch based update right |

0:25:51 | mini-batch training |
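
the equivalence claimed here, that reusing the same stale model for n frames amounts to one mini-batch update, is easy to check on a toy loss (my example, not the talk's):

```python
import numpy as np

def grad(w, x):
    # toy per-frame gradient: quadratic loss (w - x)^2
    return 2.0 * (w - x)

def sgd_per_frame(w, frames, lr):
    for x in frames:
        w = w - lr * grad(w, x)          # model advances after every frame
    return w

def sgd_delayed(w, frames, lr):
    g = sum(grad(w, x) for x in frames)  # same (stale) model for all frames
    return w - lr * g                    # one combined update = a mini-batch

frames = [np.array([1.0]), np.array([2.0]), np.array([3.0]), np.array([4.0])]
a = sgd_per_frame(np.array([0.0]), frames, 0.05)
b = sgd_delayed(np.array([0.0]), frames, 0.05)
```

`b` is exactly the mini-batch update, and `a` lands close to it, which is the "stay in the linear regime" condition mentioned a moment later.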

0:25:53 | so now if you want to do parallelization you need to deal with the problem that |

0:25:56 | we need to do computation and data exchange in parallel so you would do something like |

0:26:00 | this you know you would have a model and you would start sending that into |

0:26:04 | the network so at some point it can do the model update while it keeps computing |

0:26:09 | the next one |

0:26:11 | and then |

0:26:11 | you do that in an overlapped fashion once these are computed you send the result over while |

0:26:15 | these are being received and updated so you get this sort of overlapped processing we |

0:26:20 | call it the double buffered update |

0:26:22 | it has exactly the same form so with this formula you can write it in exactly |

0:26:25 | the same form |

0:26:27 | and asgd is basically just sort of a random version of this where you have |

0:26:31 | no fixed delay just a |

0:26:34 | delay somewhere jumping between one or two or whatever |

0:26:38 | so why am i telling you this |

0:26:40 | well why would this work because the delay is not different from a mini-batch |

0:26:44 | and to make it work the only thing you need to make sure is that we |

0:26:47 | still stay in this |

0:26:48 | sort of linear regime |

0:26:50 | it also means that as your training progresses you can increase your mini-batch size |

0:26:54 | that is well observed and it also means you can increase |

0:26:57 | your delay |

0:26:59 | which means you can use more machines |

0:27:00 | the more machines you use the more delay you incur because of network latency right |

0:27:06 | okay |

0:27:07 | so |

0:27:09 | okay so but then |

0:27:11 | actually |

0:27:13 | there were three times |

0:27:15 | that colleagues told me |

0:27:17 | like look at this paper on asynchronous sgd |

0:27:19 | and then |

0:27:20 | like three months later i asked them so how did this work out and what |

0:27:23 | they said was it didn't scale well |

0:27:24 | that actually happened three times so why does it not work |

0:27:27 | so let's look at this what are the different ways of parallelizing something model parallelism |

0:27:31 | data parallelism or layer parallelism |

0:27:34 | model parallelism means you're splitting the model over different nodes |

0:27:37 | then after each computation step well they each only compute part of the output |

0:27:41 | vector |

0:27:43 | each computes a different sub range of your dimensions so after every computation they have to |

0:27:47 | exchange |

0:27:48 | the output with all the others |

0:27:50 | the same thing has to happen on the way back |

0:27:53 | now data parallelism means |

0:27:56 | you break your mini-batch into sub batches |

0:27:59 | so each node computes a subgradient |

0:28:02 | and then sorry |

0:28:03 | after every batch they have to exchange these subgradients each has to send its |

0:28:08 | gradient to all the other nodes |

0:28:10 | so you can already see that has a lot of communication going on |

0:28:13 | the third train a something that we tried called and they are powerless |

0:28:17 | work something like this you distribute layers |

0:28:21 | so maybe the first batch comes in |

0:28:23 | and then when it's done it sends |

0:28:25 | its output to the next node, and we compute the next batch here, but

0:28:29 | this is actually not correct, because we haven't updated the model

0:28:33 | so what do we do? we just keep going and ignore the problem

0:28:36 | then in this case, after four steps,

0:28:37 | this guy has finally come back with an update to the model

0:28:41 | so |

0:28:42 | why would that work? it's just a delayed update; it's exactly the same formula as

0:28:45 | before, except the delay is kind of different in different layers, but there's nothing

0:28:48 | fundamentally strange about this
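the delayed update described here, new weights computed from a gradient evaluated at stale parameters, can be simulated on a toy quadratic; the learning rate, delay, and objective below are illustrative numbers only:

```python
# delayed-update SGD on the toy quadratic f(w) = 0.5 * (w - 3)^2;
# the gradient is computed from parameters that are `delay` steps old,
# as in layer parallelism where updates arrive late
def delayed_sgd(delay, steps=200, lr=0.1):
    w_hist = [0.0]
    for t in range(steps):
        w_stale = w_hist[max(0, t - delay)]  # stale parameters
        g = w_stale - 3.0                    # gradient at the stale point
        w_hist.append(w_hist[-1] - lr * g)
    return w_hist[-1]

# with a small enough learning rate, a modest delay still converges
assert abs(delayed_sgd(delay=0) - 3.0) < 1e-3
assert abs(delayed_sgd(delay=4) - 3.0) < 1e-3
```

the point of the sketch: delay alone does not break convergence if the learning rate is small enough relative to the delay, which is why the "just ignore the problem" approach can work at all.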

0:28:51 | so |

0:28:52 | no |

0:28:54 | a very interesting question is how far you can actually go: what is the optimal number

0:28:58 | of nodes that you can

0:29:01 | parallelize over

0:29:02 | so my colleague wrote down a very simple idea

0:29:05 | he simply said

0:29:06 | you are optimal when you max out all the resources,

0:29:10 | using all your computation and all your network

0:29:14 | capacity; that basically means that the time that it takes

0:29:17 | to compute a mini-batch

0:29:19 | is equal to the time that it takes to transfer the result to all

0:29:23 | the others

0:29:25 | and you would do this in an overlapped fashion, so you would compute one

0:29:28 | batch, then you start the transfer while you do the next one

0:29:31 | and you are at the ideal point when, the moment

0:29:35 | the transfer is completed, you are ready to compute the next

0:29:38 | batch

0:29:39 | so then you can write down, okay, what's the optimal

0:29:42 | number of nodes here; well, the formula is a bit more complicated, but the basic

0:29:46 | idea is that it is proportional to the model size, so bigger models allow better parallelization, but

0:29:51 | the faster the nodes get, the less you can parallelize

0:29:53 | so a gpu can parallelize less

0:29:57 | and of course it also has to do with how much data you have to exchange

0:29:59 | and what your bandwidth is

0:30:01 | for data parallelization the mini-batch size is also a factor, because with a larger mini-batch size you

0:30:06 | have to exchange less often

0:30:09 | and for layer parallelism it's not really that interesting, because it's limited by the number of layers
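a back-of-the-envelope version of that balance condition, compute time per mini-batch equal to gradient-transfer time, can be written down; all constants below are made-up illustrative numbers, not the formula from the slide:

```python
# a node count is usable when the gradient transfer still hides behind
# the compute of a mini-batch; all numbers are made-up illustrations

def compute_time(params, minibatch, flops, nodes):
    # each of `nodes` workers processes minibatch/nodes frames;
    # roughly 2 flops per parameter per frame (multiply-add)
    return 2.0 * params * (minibatch / nodes) / flops

def transfer_time(params, bandwidth_words_per_s, nodes):
    # each worker sends its full gradient to the other nodes
    return params * (nodes - 1) / bandwidth_words_per_s

def best_nodes(params, minibatch, flops, bandwidth, max_nodes=64):
    best = 1
    for k in range(2, max_nodes + 1):
        if transfer_time(params, bandwidth, k) <= compute_time(params, minibatch, flops, k):
            best = k
    return best

# bigger mini-batches amortize communication, so more usable nodes
assert best_nodes(3e7, 1024, 1e12, 1e9) >= best_nodes(3e7, 256, 1e12, 1e9)
# a faster node (higher flops) finishes sooner, so fewer usable peers
assert best_nodes(3e7, 1024, 4e12, 1e9) <= best_nodes(3e7, 1024, 1e12, 1e9)
```

note how the model size cancels out of the balance here; what survives is exactly the dependence the talk names: mini-batch size and bandwidth push the node count up, per-node speed pushes it down.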

0:30:14 | so |

0:30:16 | so let me ask you:

0:30:17 | what do you think model parallelism would get us here

0:30:20 | so just

0:30:22 | consider that google was doing imagenet with like sixteen thousand cores

0:30:26 | so give me a number

0:30:31 | i'm going to tell you:

0:30:36 | not sixteen thousand

0:30:39 | so i implemented that, with a lot of

0:30:43 | care, on three gpus

0:30:45 | this is the best you can do: you get a 1.8 times speedup

0:30:47 | instead of a three times speedup, because gpus get less efficient the smaller the chunks

0:30:51 | of data they process

0:30:52 | and once i went to four, it was actually much worse than this

0:30:58 | now, is data parallelism much better? so what do you think

0:31:07 | for a mini-batch size of 1024; now of course, if

0:31:11 | you can use bigger mini-batches as you progress in training,

0:31:14 | this becomes a bigger number

0:31:16 | and in reality, what you get, well, there is google's asgd system,

0:31:20 | parallelized over eighty nodes,

0:31:23 | and each node is a twenty-four-core intel machine

0:31:27 | so if you look at what you get compared to using

0:31:29 | a single twenty-four-core intel machine:

0:31:34 | eighty times the nodes, but you only get a speedup of 5.8

0:31:38 | that's what you can actually get out of the paper there, and about 2.2

0:31:42 | of that comes out of model parallelism and 2.6 comes out of

0:31:46 | data parallelism

0:31:48 | so of course, not that much

0:31:49 | then there's another group, at the academy of sciences;

0:31:53 | they parallelized over nvidia k20x gpus, which are sort of the state of the art,

0:31:58 | and they got a 3.2 times

0:31:59 | speedup also

0:32:02 | okay not that great |

0:32:05 | i'm not going to give a better answer here, but i just want to make the point

0:32:08 | okay |

0:32:09 | so the last thing is layer parallelism; okay, so in this experiment we found

0:32:14 | that if you do it the right way, you can use more gpus and you get

0:32:17 | a 3.2 or three times speedup, but we already had to use model

0:32:20 | parallelism as well

0:32:22 | and if you don't do that, you have a load-balancing problem, because the layer sizes are so

0:32:26 | different

0:32:27 | and so this is actually the reason why i do not recommend layer parallelism

0:32:31 | okay so the take-away:

0:32:33 | parallelizing sgd is actually really hard, and if your colleagues come to you and say they can

0:32:38 | implement parallel sgd, then maybe show them this

0:32:41 | okay |

0:32:43 | so |

0:32:45 | so much about parallelization

0:32:51 | okay, now let me talk about adaptation; so adaptation can be done,

0:32:56 | as you heard this morning, for example by sticking in a linear transform at the bottom, called

0:33:01 | the lin transform; we call it fdlr, to match

0:33:05 | mllr

0:33:06 | it can also be things like vtln

0:33:09 | another thing we can do is, as nelson explained, just retrain the whole stack just

0:33:13 | a little bit, or you can do this with regularization

0:33:17 | so |

0:33:18 | so what we observed is this:

0:33:20 | we did this approach with the fdlr on switchboard

0:33:23 | applied to the gmm system, we get a thirteen percent error reduction

0:33:29 | applied to a shallow neural network, that's one layer only,

0:33:33 | you get something very similar to that

0:33:35 | if we do it on the deep network,

0:33:40 | the

0:33:41 | gain is much smaller

0:33:44 | so this is not such a great example; but then, on the

0:33:48 | other hand, let me tell you an anecdote i forgot to put on the slide: when we

0:33:51 | prepared this on-stage demo

0:33:54 | for our vice president, we tried to actually adapt the models to him

0:33:58 | so we took something like four hours of his internal talks

0:34:01 | and did adaptation on that

0:34:04 | and tested on another two talks of his, and we got like a thirty percent gain

0:34:11 | but then we moved on and actually did an actual dry run with him;

0:34:15 | it turns out

0:34:16 | on that one, it apparently didn't work

0:34:20 | so i think what happened there is that the dnn actually did not learn the

0:34:22 | voice

0:34:23 | but the channel

0:34:25 | of those particular recordings; so basically there's a

0:34:29 | couple of other numbers here, but let me just cut this short; what we

0:34:31 | seem to be observing is that

0:34:34 | the gain of adaptation diminishes with larger amounts of training data; that's what we

0:34:37 | have seen so far, except if the adaptation is done for the purpose

0:34:42 | of domain adaptation

0:34:45 | so maybe the reason for this is that the dnn is already

0:34:48 | very good at learning invariant representations, especially across speakers, which also means maybe there's a

0:34:54 | limit on what is achievable by adaptation; so keep this in mind if you're considering

0:34:57 | doing research on this

0:35:00 | on the other hand, i think some groups got very good results on that, with george

0:35:03 | and colleagues, so maybe what i'm saying is not correct, so you'd better check

0:35:06 | out their papers in the session

0:35:11 | okay, so moving on from training: what about alternative architectures?

0:35:16 | so one of these,

0:35:18 | the relus, are very popular

0:35:21 | you basically replace the sigmoid nonlinearity with

0:35:25 | something like this

0:35:27 | and that also came a lot out of geoffrey hinton's school

0:35:31 | and it turns out that on vision tasks

0:35:34 | it works really well: it converges very fast,

0:35:36 | you get,

0:35:37 | basically, that you don't need to do pre-training,

0:35:39 | and it seems to outperform the sigmoid version on basically everything

0:35:44 | on speech there was a really, you know,

0:35:48 | encouraging paper

0:35:49 | by interning students, "rectifier nonlinearities improve neural network acoustic models"

0:35:54 | and they were able to reduce the error rate from nineteen point five to seventeen

0:35:58 | so, great; i started implementing it, it is actually two lines of code
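those "two lines" are essentially swapping the hidden nonlinearity and its derivative; a sketch in NumPy, not the actual toolkit code:

```python
import numpy as np

# the sigmoid version and its derivative, as used in the baseline
sigmoid   = lambda z: 1.0 / (1.0 + np.exp(-z))
d_sigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

# the "two lines of code" that turn it into a ReLU network:
relu   = lambda z: np.maximum(z, 0.0)       # line 1: forward pass
d_relu = lambda z: (z > 0).astype(z.dtype)  # line 2: backward pass

z = np.array([-2.0, 0.5, 3.0])
assert np.allclose(relu(z), [0.0, 0.5, 3.0])
assert np.allclose(d_relu(z), [0.0, 1.0, 1.0])
```

the swap itself really is that small; as the talk goes on to say, making it actually train well on a large set is the hard part.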

0:36:02 | and i didn't get anywhere;

0:36:04 | i was not able to reproduce these results

0:36:07 | so i read the paper again

0:36:08 | and i noticed one

0:36:10 | sentence:

0:36:11 | network training stops after two complete passes

0:36:15 | if we only do two passes, our system is at nineteen point two, and we normally do

0:36:19 | as many passes as we can

0:36:22 | so actually, there's something wrong with the baseline

0:36:25 | so it turns out that when i talk to people,

0:36:28 | on the large switchboard set it seems to be very difficult to get relus to

0:36:33 | work

0:36:33 | so one group that actually did get it to work is ibm, together with

0:36:38 | george dahl, but with a rather complicated method: they used

0:36:40 | bayesian optimization to tune

0:36:44 | the hyper-parameters of the training; this is the way they got

0:36:47 | something like five percent relative gain

0:36:49 | i don't know if they're still doing that or if it's a bit easier

0:36:52 | now, but

0:36:54 | so |

0:36:55 | the point is |

0:36:57 | the point is that it looks easy, but it actually isn't,

0:37:00 | for large setups

0:37:02 | the other one is convolutional networks

0:37:04 | and the idea is basically this: look at these filters here; these are tracking some

0:37:08 | sort of formant, right, but the formant positions, the resonance frequencies,

0:37:13 | depend on your body height

0:37:14 | for example, for women they are typically at slightly different positions compared

0:37:18 | to men, so

0:37:19 | why can't we share these filters across that? at the moment the system wouldn't do that

0:37:24 | so the idea would be to apply these filters shifted slightly, apply them

0:37:28 | over a range of shifts, and that's basically represented by this picture here

0:37:33 | and then the next layer reduces that: you pick the maximum

0:37:36 | over all these different results there, right; and so it turns out that actually you

0:37:41 | can get something like four to seven percent word error reduction, i think even a

0:37:45 | little bit more if you read the papers

0:37:49 | so the take-away for those alternative architectures:

0:37:52 | relus are definitely not easy to get to work

0:37:55 | they seem to work for smaller setups;

0:37:57 | some people tell me they get really good results on twenty-four-hour

0:38:01 | datasets, but on the big set, three hundred hours, it's very difficult and expensive

0:38:06 | on the other hand, the cnns are much simpler, and the gains are sort of in the range of

0:38:09 | what we get

0:38:10 | with the feature adaptation

0:38:14 | okay |

0:38:15 | that's the end of the training section;

0:38:17 | now let me talk a little bit about features

0:38:23 | so for features for gmms,

0:38:27 | a lot of work has been done,

0:38:29 | because gmms are typically used with diagonal covariances;

0:38:33 | a lot of work was done to decorrelate features

0:38:36 | do we actually need to do this for the dnn?

0:38:38 | well, how do you decorrelate? with a linear transform; and the first thing the dnn does is

0:38:42 | a linear transform,

0:38:44 | so it can kind of do this just by itself; well, let's see

0:38:48 | so we start with a gmm baseline, 23.6; if you put in

0:38:51 | fmpe, to be fair, 22.6

0:38:54 | and then you do a cd-dnn, just a normal dnn, using those features here,

0:38:59 | the fmpe features, and you get to seventeen

0:39:02 | now get rid of the fmpe, simply; so this minus means take it out;

0:39:06 | now it's just a plp system:

0:39:08 | seventeen

0:39:08 | that kind of makes sense, because the fmpe was basically trained specifically for this gmm

0:39:16 | structure

0:39:18 | then you can also take out the hlda; it gets even a bit better

0:39:21 | hlda obviously captures correlation over a longer range, and the dnn already handles that

0:39:29 | you can also take out the dct that's part of the plp or mfcc process

0:39:34 | and now we have a slightly different dimension,

0:39:37 | you have more features here, and so on;

0:39:41 | i think a lot of people are now using this particular setup, the log filterbank features

0:39:44 | you can even take out the deltas,

0:39:46 | but you have to account for them: you have to make the window wider,

0:39:49 | so we still see the same frames; and in our case it still works

0:39:54 | and you can go really extreme and completely eliminate the filterbank: you just look at the fft

0:39:59 | features directly;

0:40:00 | now it gets somewhat worse, but it's still in the ballpark here, right

0:40:03 | so |

0:40:05 | so actually, what we just did basically undid thirty years of feature research

0:40:10 | so |

0:40:13 | that |

0:40:13 | there is also something kind of really cool: if you really care about the filterbank,

0:40:16 | you can actually learn it; this is another poster tomorrow, so

0:40:20 | you see the blue bars and the red curves there: the blue are the mel filters

0:40:24 | and the red curves are basically

0:40:26 | learned versions of that

0:40:34 | so the dnn can kind of learn this as well

0:40:38 | so, take-away: dnns greatly simplify feature extraction; just use the log filterbank with a wider

0:40:43 | window

0:40:44 | one thing i didn't mention: you still need to do the mean normalization;

0:40:47 | that cannot be eliminated
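as a sketch of how simple that front end becomes, here random numbers stand in for real mel filterbank energies, and the window radius is an illustrative choice:

```python
import numpy as np

# simplified DNN front end: log filterbank energies, per-utterance mean
# normalization, and a wide context window; no DCT, no deltas, no HLDA.
# `fbank` stands in for real mel filterbank energies (random here).
rng = np.random.default_rng(1)
fbank = np.abs(rng.standard_normal((100, 40))) + 1e-3  # 100 frames x 40 bands

logf = np.log(fbank)
logf -= logf.mean(axis=0)            # the mean normalization you still need

def stack_context(feats, radius=5):  # +/-5 frames -> 11-frame input window
    padded = np.pad(feats, ((radius, radius), (0, 0)), mode='edge')
    return np.hstack([padded[i:i + len(feats)] for i in range(2 * radius + 1)])

x = stack_context(logf)
assert x.shape == (100, 11 * 40)     # 440-dimensional DNN input per frame
assert np.allclose(logf.mean(axis=0), 0.0)
```

the wider window replaces what deltas used to provide, and the network's first linear layer takes over the decorrelation jobs of the DCT and HLDA.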

0:40:49 | now |

0:40:50 | now, we talked about features for dnns; we can also turn it around, right; basically,

0:40:54 | you know, ask not what the features can do for the dnn but what the

0:40:57 | dnn can do for the features

0:40:59 | i think that was

0:41:01 | said by some famous speech researcher

0:41:05 | so we can use dnns as feature extractors; so the idea is basically, these are

0:41:09 | the factors that contributed to the success:

0:41:12 | long-span features,

0:41:13 | discriminative training,

0:41:15 | and the hierarchical nonlinear feature mapping

0:41:18 | right, so,

0:41:19 | and it turns out that the feature mapping is actually the major contributor; so why not use this combined

0:41:24 | with the gmm? so we go really back to what nelson talked about,

0:41:27 | right

0:41:28 | so there are many ways of doing this; tandem,

0:41:31 | as we heard this morning; you can also do tandem with a

0:41:34 | bigger layer, and there is work on that, basically using the posteriors here;

0:41:39 | you can do a bottleneck, where you take an intermediate layer that has a much

0:41:43 | smaller dimension

0:41:44 | or you can also

0:41:46 | use the top hidden layer

0:41:49 | as sort of the bottleneck, but not make it smaller, just take it; in each

0:41:52 | of those cases you would typically do something like a pca to reduce your dimensionality
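the "top hidden layer plus PCA" recipe can be sketched like this; the weights here are random and untrained, and all sizes are hypothetical, since in practice you would take the trained DNN:

```python
import numpy as np

# use a (random, untrained) "top hidden layer" as a feature extractor and
# reduce it with PCA, as you would before training a GMM on top
rng = np.random.default_rng(2)
frames = rng.standard_normal((500, 440))       # DNN input features
W = rng.standard_normal((440, 2048)) * 0.01    # hypothetical hidden weights
b = np.zeros(2048)

hidden = 1.0 / (1.0 + np.exp(-(frames @ W + b)))  # top hidden activations

def pca(x, dims):
    xc = x - x.mean(axis=0)
    # principal directions via SVD of the centered data
    _, _, vt = np.linalg.svd(xc, full_matrices=False)
    return xc @ vt[:dims].T

feats = pca(hidden, 39)            # 39-dim features for the GMM system
assert feats.shape == (500, 39)
assert np.allclose(feats.mean(axis=0), 0.0, atol=1e-8)
```

the GMM then never sees the raw input, only the decorrelated projection of the network's learned representation.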

0:41:56 | so does that work |

0:41:58 | well, okay, so if you take

0:42:00 | a dnn,

0:42:01 | and this is the hybrid system here, and then you compare it with this gmm system

0:42:05 | where we take the top layer,

0:42:07 | pca, and then apply the gmm:

0:42:09 | well, it's not really that good

0:42:12 | but now we have one really big advantage: we are back in the world of gmms,

0:42:16 | and we can capitalize on anything that worked in the gmm world, right

0:42:20 | so for example, we are able to use region-dependent linear transforms, a little bit like

0:42:24 | fmpe

0:42:26 | so once you apply that,

0:42:27 | it's already better

0:42:29 | you can also just do mmi training very easily; okay, in this case it's not really

0:42:33 | as good, but at least you can do it out of the box without any

0:42:36 | of these problems with, you know, silence and so on; and you can apply adaptation just

0:42:41 | as you always would

0:42:42 | you can also do something more interesting: you can say, what if i train my dnn

0:42:47 | feature extractor on a smaller set

0:42:49 | and then do the gmm training on a larger set,

0:42:52 | because we have the scalability problem

0:42:54 | so this can really help with the scalability problem, and you can see, well,

0:43:00 | it's close; not quite as good, but we're able to do that

0:43:04 | i mean, imagine the situation where this is like a ten-thousand-hour production database

0:43:07 | that we couldn't train the dnn on,

0:43:10 | whereas if on the dnn side we also used the same data, we would definitely get

0:43:13 | better results

0:43:14 | and then it might make sense if we combine this, for example,

0:43:18 | with the idea of building the gmm model only partially, and then see if

0:43:23 | that works; we don't know that, actually,

0:43:24 | so this deserves more attention

0:43:26 | another idea for using dnns as feature extractors

0:43:31 | is to transfer learning from one language

0:43:35 | to another; so the idea is to feed the network a training set of multiple

0:43:40 | languages

0:43:41 | and the output layer

0:43:43 | for every frame is chosen based on what that frame's language is, right; and this way you

0:43:47 | can train

0:43:48 | these shared hidden representations; and it turns out that if you do that,

0:43:51 | you can improve each individual language, and it even works for another language that has

0:43:56 | not been part of this set here
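a minimal sketch of such a shared-hidden-layer multilingual network, with made-up sizes and random weights, just to show the wiring of one shared stack and per-language output layers:

```python
import numpy as np

# shared-hidden-layer multilingual net: one hidden stack, one softmax
# output layer per language; each frame uses only its language's output.
# sizes and data are made up for illustration.
rng = np.random.default_rng(3)
W_shared = rng.standard_normal((40, 64)) * 0.1        # shared hidden layer
W_out = {"en": rng.standard_normal((64, 100)) * 0.1,  # 100 "senones" for en
         "fr": rng.standard_normal((64, 120)) * 0.1}  # 120 "senones" for fr

def forward(frame, lang):
    h = np.maximum(frame @ W_shared, 0.0)  # shared representation
    z = h @ W_out[lang]                    # language-specific output layer
    e = np.exp(z - z.max())
    return e / e.sum()                     # softmax posteriors

p_en = forward(rng.standard_normal(40), "en")
p_fr = forward(rng.standard_normal(40), "fr")
assert p_en.shape == (100,) and p_fr.shape == (120,)
assert abs(p_en.sum() - 1.0) < 1e-9
```

during training, each frame's gradient flows through its own language's output layer but updates the shared hidden weights, which is where the cross-language transfer comes from; a new language gets a fresh output layer on top of the already-trained shared stack.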

0:43:58 | the only thing is that this is typically something that works for low-resource languages

0:44:03 | but if you go larger, so for example there

0:44:08 | is a paper here that shows that if you

0:44:11 | go up to something like two hundred seventy hours of training,

0:44:14 | then your gain really is reduced, to something like three percent

0:44:18 | so this is actually something that does not seem to work very well for large

0:44:21 | settings

0:44:26 | okay so take away |

0:44:28 | the dnn acts as a hierarchical nonlinear feature transform;

0:44:31 | that's really the key to the success of dnns, and you can use this directly

0:44:36 | and put the gmm on top of that as the classification layer

0:44:40 | and it brings us back into the gmm world, with all the techniques, including parallelization and

0:44:45 | scalability and so on

0:44:47 | and on the transfer-learning side: it works for small setups,

0:44:52 | but not so much for large ones

0:44:55 | okay |

0:44:58 | last topic runtime |

0:45:00 | runtime is an issue

0:45:02 | it was less of a problem for gmms:

0:45:05 | there you can actually do on-demand computation

0:45:08 | for dnns,

0:45:09 | a large amount of the parameters is actually in the shared layers, which you cannot do on demand

0:45:14 | so

0:45:15 | for dnns,

0:45:16 | you have to compute everything

0:45:18 | and so it's important to look at how we can speed this up; so for example, the

0:45:22 | demo video that i showed you in the beginning, that was run with

0:45:25 | my gpu doing the likelihood evaluation; if you don't

0:45:30 | do that, it would run at like three times real time,

0:45:32 | which would be infeasible

0:45:34 | so |

0:45:35 | the way to approach this, and that was done both by some colleagues at microsoft

0:45:38 | and also by ibm,

0:45:40 | is to ask: do we actually need those full weight matrices

0:45:44 | and so this question is based on two observations

0:45:48 | one is that we saw early on that you can actually set something like two

0:45:52 | thirds of the parameters to zero

0:45:55 | and still get the same error rate

0:45:57 | and what ibm observed is that in the top hidden layer,

0:46:02 | the number of

0:46:03 | nodes that are actually active is relatively limited

0:46:07 | so the idea is basically to decompose, via singular value decomposition,

0:46:12 | those weight matrices

0:46:14 | and the idea is, basically, this is your network layer,

0:46:17 | the weight matrix and a nonlinearity; replace this by two matrices, and in the middle you have

0:46:23 | a low-rank bottleneck
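the factorization step can be sketched in NumPy; the layer sizes and rank are hypothetical, and the real system of course applies this to the trained weight matrix:

```python
import numpy as np

# low-rank factorization of one layer's weight matrix via SVD:
# replace W (n_in x n_out) by A (n_in x r) times B (r x n_out)
rng = np.random.default_rng(4)
W = rng.standard_normal((512, 2048))  # stand-in for a big output layer

U, s, Vt = np.linalg.svd(W, full_matrices=False)
r = 64                                # keep only the top-r singular values
A = U[:, :r] * s[:r]                  # n_in x r, absorbs the singular values
B = Vt[:r]                            # r x n_out

# big parameter reduction; A @ B is the best rank-r approximation of W
# (here W is random, so the fit is loose; after this step you would
# fine-tune with back-propagation, as described in the talk)
assert A.shape == (512, r) and B.shape == (r, 2048)
assert A.size + B.size < W.size / 6
assert (A @ B).shape == W.shape
```

at runtime the one matrix-vector product through W becomes two much cheaper products through A and B, which is where the speedup comes from.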

0:46:26 | so does that work?

0:46:27 | well

0:46:28 | so this is the gmm baseline, just for reference; the dnn,

0:46:32 | with thirty million parameters on a microsoft-internal task,

0:46:35 | starts with a word error rate of 25.6

0:46:38 | now we apply the singular value decomposition;

0:46:41 | if you just do it straight off, it gets much worse,

0:46:44 | but you can then do back-propagation again,

0:46:47 | and then you will get back to exactly the same number

0:46:50 | and you gain like a one-third parameter reduction

0:46:53 | you can actually also do that with

0:46:55 | all the layers, not just the top one; if you do that, you can bring it

0:46:58 | down

0:47:00 | by a factor of four

0:47:02 | and that is actually a very good result, so this basically brings the runtime back into range

0:47:08 | so let me just show you one more example, to again give you a very rough idea

0:47:12 | of what this means

0:47:21 | so it's only a very short example, just to give an idea; this is an apples-to-apples

0:47:25 | comparison between the old gmm system and the dnn system

0:47:29 | for speech recognition; so let's look at something that you

0:47:34 | know well; there are two devices: the one on the left runs what

0:47:37 | we previously shipped, the one on the right uses the dnn

0:47:42 | we're gonna find a good pizza and |

0:47:50 | the results are very similar; specifically, if you're interested, look here down at the latency, which is

0:47:56 | counted from when i stop talking to when we see the recognition result, about a second

0:48:01 | or so

0:48:02 | so i just wanted to give you proof that this actually works

0:48:06 | okay so |

0:48:08 | i think we've covered the whole range; i would like to recap

0:48:13 | all the take-aways

0:48:14 | okay, so we went through the

0:48:16 | cd-dnn-hmm; it's actually, as was

0:48:18 | already said, nothing else than an mlp whose outputs are the triphone states, and that's

0:48:24 | important

0:48:25 | they're not really that hard to train, we know now, but doing it fast

0:48:29 | is still sort of a frustrating enterprise, and at the moment i would recommend: just get

0:48:33 | a gpu, and if you have multiple gpus, just run multiple trainings rather than trying

0:48:37 | to parallelize a single training

0:48:40 | pre-training:

0:48:41 | it helps, but the greedy layer-wise variant is simpler, and it seems to be sufficient

0:48:48 | sequence training gives us regularly good improvements, up to thirty percent, but if you use

0:48:52 | sgd then you have to use these little tricks: smoothing

0:48:56 | and rejection

0:48:57 | adaptation helps much less than for gmms,

0:49:00 | which might be because the dnn possibly learns

0:49:04 | very good invariant representations already, so there might be a limit to what

0:49:07 | you can actually achieve

0:49:09 | relus are definitely not as easy as changing two lines of code, especially for large

0:49:14 | datasets

0:49:16 | but on the other hand, the cnns

0:49:17 | give us like five percent; that's not really that hard to get, and they make

0:49:20 | good sense

0:49:23 | dnns really simplify the feature extraction; we were able to eliminate thirty years of feature-extraction

0:49:27 | research

0:49:30 | but you can also turn it around and use dnns as feature extractors

0:49:35 | and dnns are definitely not slowing down decoding, if you use these speed-up techniques

0:49:40 | so |

0:49:40 | to conclude, where do i see the challenges going forward

0:49:44 | there are of course open issues with training;

0:49:46 | i mean, when we talk to people in the company, we are always thinking what

0:49:51 | kind of computers we'll buy in the future, and whether we optimize them for sgd, but

0:49:55 | we always think, you know what, in one year we'll laugh,

0:49:57 | laugh about this whole mini-batch method, and we will just not need all of

0:50:01 | this; but so far this has not happened, and i think it's fair to say there's not

0:50:03 | a method like this on the rise that would immediately allow parallelization

0:50:08 | and what we found is that automatic learning-rate control is not solved; this is really

0:50:11 | important, because if you don't do this right, you might run into unreliable results, and

0:50:15 | i have a hunch that the relu result we saw there was a little bit like that

0:50:19 | and it also has to do with parallelizability, because the smaller the learning rate, the bigger

0:50:23 | your mini-batch can be, and the more you can parallelize

0:50:30 | dnns still have an issue with robustness to real-life situations;

0:50:35 | they may sort of not have solved speech, but they got very close to

0:50:39 | solving speech under perfect recording conditions; but it still fails if you do speech

0:50:44 | recognition over like one meter of distance, in a meeting room with two microphones or something like

0:50:48 | that; so dnns are not

0:50:49 | inherently, automatically robust to noise:

0:50:52 | they learn seen variability, but not unseen variability

0:50:57 | then, personally, i wonder: can we go to more of a machine-learning view?

0:51:00 | so for example, there's already work that tries to eliminate the hmm and replace it

0:51:04 | by an rnn, and i think that's very interesting; and the same thing has already been

0:51:08 | very successfully done with language models

0:51:11 | and there's the question of:

0:51:13 | can we jointly train everything in one big step? but on the other hand,

0:51:16 | the problem with that is that different

0:51:19 | aspects of the model need different kinds of data that have different costs attached to

0:51:24 | them, so it might actually never be possible, or needed, to do a joint training

0:51:28 | and the final question that i sort of have is: what do dnns teach us about

0:51:32 | how humans process speech,

0:51:35 | and will we also get

0:51:36 | more ideas

0:51:38 | from that

0:51:40 | so that concludes my talk thank you very much |

0:51:51 | i think we have like six minutes for questions |

0:52:12 | i'm not an expert on neural networks, but i was wondering: if i train a

0:52:19 | neural network on conventional speech data and i try to recognize data which is

0:52:26 | much cleaner, will it therefore not be as good, because we don't model the noise

0:52:31 | so what was the configuration you want? you want to train on what?

0:52:34 | the idea is that they train their neural nets on the noisy data and then run them

0:52:39 | on the clean data

0:52:41 | so they don't know exactly; that's my question

0:52:44 | okay, so i actually did skip

0:52:46 | one slide; let me show this one

0:52:50 | so |

0:52:51 | the dnn is actually |

0:52:56 | way |

0:53:04 | so you get like |

0:53:10 | so this table here shows results on aurora, so basically doing in this case multi-style training

0:53:19 | so the idea was not to train on noisy and test on clean,

0:53:22 | but this is basically training and testing on the same

0:53:26 | set of noise conditions

0:53:28 | and so there are a lot of numbers here; this is the gmm baseline, if you look

0:53:31 | at this line here, thirteen point four

0:53:34 | i'm not a specialist on robustness, but i think this is about the best you can

0:53:38 | do with the gmm,

0:53:39 | pooling all the tricks that you could possibly put in

0:53:42 | and the dnn,

0:53:43 | it's just,

0:53:44 | without any tricks, just training on the data, you get,

0:53:48 | you know what, you get just exactly the same

0:53:51 | so what this means, i think, is that the dnn is very good at learning

0:53:55 | variability in the input, also noise, that it sees in the training data

0:54:02 | but we have other experiments where we see that if the

0:54:06 | variability is not covered in the training data,

0:54:09 | the dnn is not very robust against it

0:54:12 | so i don't know what happens if you train on noisy and test on clean,

0:54:15 | and clean is not one of the conditions that you have in your training; i could imagine

0:54:18 | that it will hurt, but i haven't tried it on that data

0:54:25 | i don't think you can really claim to get away with thirty years; thirty years maybe

0:54:30 | overstates it a bit

0:54:33 | you're obviously talking tongue in cheek, right; what you're talking about is going back before

0:54:38 | some of the developments of the eighties, right, and most of the effort on feature

0:54:43 | extraction in the last twenty years of conferences has actually been more on robustness, on dealing with unseen variability,

0:54:51 | and this doesn't address that situation

0:54:59 | some more questions or comments |

0:55:02 | what do you think about features, ideas for future

0:55:08 | research

0:55:10 | is it and use a large temporal context this is also be one it's was |

0:55:16 | coming but for |

0:55:19 | in contrast |

0:55:21 | it's something |

0:55:24 | okay what exactly i don't have to sell the embassy okay |

0:55:33 | any more comments

0:55:36 | kind of a personal question: you said that you didn't know anything about neural nets until like

0:55:40 | two, three years back, something like that; so do you see this as rather an

0:55:45 | advantage or a drawback, maybe being less sentimental

0:55:48 | in throwing away something that, you know, the guys who have been in the field for

0:55:54 | many years considered sort of untouchable, or the other way round

0:55:58 | i think so; i think it helps to come in with sort of a little bit

0:56:02 | of an outsider's mind; so i think, for example, it helped me to understand this

0:56:06 | parallelization thing, right, that basically what you do in sgd, you do plain

0:56:11 | mini-batch training

0:56:13 | and normally the regular definition of mini-batches is that you take the average over the

0:56:18 | batch;

0:56:18 | maybe you might have noticed that i didn't actually divide by the number of frames

0:56:23 | when i used this formula, right

0:56:27 | so that, for example, is something where, for me as an engineer coming in, looking at

0:56:30 | that, i wondered, you know, why do you do mini-batches as an average? that doesn't seem to

0:56:33 | make sense, you're just accumulating multiple frames over time; and that helped me understand those kinds of

0:56:38 | parallelization questions in a different way

0:56:41 | but these are probably details

0:56:49 | okay, any other questions

0:56:54 | okay, so let's thank the speaker again