0:00:16 Thank you, and welcome back after lunch. My name is Frank Seide, I'm from Microsoft Research in Beijing, and this is a joint presentation with my colleague Dong Yu, who happens to be Chinese but is actually based in Redmond. Of course there are a lot of contributors to this work, inside the company and outside, and thank you very much also to the people who shared slide material.
0:00:38 Okay, let me start with a personal story of how I got into this, because I am sort of an unlikely expert on this: until 2010 I had no idea what neural networks were, deep ones or otherwise. So, in 2010,
0:00:52 my colleague Dong Yu, who cannot be here today, came to visit us in Beijing and told us about this new speech recognition result that they had. He told me about a technology that I had never heard of, called a DBN, and said it was invented by some professor in Toronto whom I also had never heard of. He and his manager at the time had invited Geoffrey Hinton, this professor, to come to Redmond with a few students and work on applying this to speech recognition.
0:01:25 At the time, he got a sixteen percent relative error reduction out of applying deep neural networks, and this was for a voice search task with a relatively small number of hours of training data. You know, sixteen percent is really big; a lot of people spend ten years to get a sixteen percent error reduction. So my first thought about this was: sixteen percent, wow, what's wrong with the baseline?
0:01:55 So I said, well, why don't we collaborate on this and try how it carries over to a large-scale task, that is, Switchboard.
0:02:02 And the key thing that was actually invented here: well, you have the classic ANN/HMM; this reference is probably covered by what you heard this morning from Nelson, so I'm a little bit too late. So: the classic ANN/HMM; then the deep network, the DBN, which, as I assumed at that point, actually does not stand for dynamic Bayesian network; and then Dong Yu put in this idea of just using tied triphone states as modeling targets, like we do in GMM-based systems.
0:02:32 Okay, so then fast forward: half a year of reading papers and tutorials to get started, and finally we got to the point where we had first results. This is our GMM baseline, and I started the training; the next day I had the first iteration, which was at something like twenty-two percent, so okay, it seems to not be completely off. The next day I come back: twenty percent.
0:02:56 So that's fourteen percent relative, and I sent a congratulations email to my colleague. Then I let it run, came back the next day: eighteen percent. From that moment on I was just sitting at the computer waiting for the next result to come out and submitting it, and it kept getting better. We got seventeen point three, then something point one. Then we redid the alignment, one thing Dong Yu had already worked out on the smaller setup, and we got it down to sixteen point four; then we looked at sparseness: sixteen point one. So we got a thirty-two percent error reduction. That's a very large reduction out of a single technology.
0:03:33 We also ran this over different test sets with the same model, and you could see that the error rate reductions were all sort of in a similar range; on some of them the gains were slightly worse. We also looked at other setups; for example, at some point we finally trained the two-thousand-hour model that is still used in products like the Windows phone system you have right now, and we got something like a fifteen percent error reduction. Other companies also started publishing; for example IBM on broadcast news, where I think the total gain is thirteen to eighteen percent in an up-to-date paper, and Google, where I think the gain was about nineteen percent. So the gains were really convincing across the board.
0:04:14 Okay, so that was our work. Now, what is this actually? This audience has varying levels of familiarity, and people might not know the DNN in detail, so I would like to go through and explain the basics of how this works a little bit more. I don't know how many people here really understand this already; I hope it's not going to be too boring.
0:04:34 So the basic idea is: the DNN looks at, for example, a spectrogram, and takes a rectangular patch out of it, a range of vectors. It feeds this into a processing chain which basically multiplies this input vector, this rectangle here, with a matrix, adds a bias, and applies a nonlinearity; then you get something like two thousand values out of that. You do this several times; every layer does the same thing, except that at the top the nonlinearity is a softmax.
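To make that processing chain concrete, here is a minimal NumPy sketch of the forward pass just described; all dimensions, layer counts, and random weights are made-up placeholders (a real system would have on the order of 2048 units per hidden layer and thousands of senone outputs):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def dnn_forward(x, hidden_layers, top_layer):
    """Each hidden layer computes sigmoid(W @ v + b); the top layer a softmax."""
    v = x
    for W, b in hidden_layers:
        v = sigmoid(W @ v + b)
    W, b = top_layer
    return softmax(W @ v + b)

# toy setup: an 11-frame x 40-bin spectrogram patch, two small hidden layers,
# 10 output classes (purely illustrative sizes)
rng = np.random.default_rng(0)
dims = [11 * 40, 64, 64]
hidden = [(0.1 * rng.standard_normal((dims[i + 1], dims[i])), np.zeros(dims[i + 1]))
          for i in range(len(dims) - 1)]
top = (0.1 * rng.standard_normal((10, dims[-1])), np.zeros(10))
p = dnn_forward(rng.standard_normal(dims[0]), hidden, top)
assert abs(p.sum() - 1.0) < 1e-6  # the output is a posterior distribution
```

The output vector is a proper probability distribution over the classes, which is what the following discussion of posteriors relies on.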
0:05:04 These are the formulas for that. So what is this actually? Well, a softmax is this form here; that is essentially nothing else but a sort of linear classifier, and it is linear because the class boundary between any two classes is a hyperplane; so it is actually a relatively weak classifier. The hidden layers are actually very similar: they have the same form; the only difference is that there are only two classes, membership or non-membership, instead of all the different speech states here, and the second class has its parameters set to zero.
0:05:34 So what is this, really? It is sort of a classifier that classifies membership or non-membership in some class, but we don't know what those classes are. And this representation is actually also kind of sparse: typically only maybe five to ten percent of the activations are active in any given frame. So these class memberships are really a kind of features, descriptive features of your input.
0:06:00 Another way of looking at it: basically, what it does is take the input vector and project it onto something like a basis vector, one column. This would be like a direction vector we project onto; there is a bias term we add, and then we run it through this nonlinearity. So what this does is give you sort of a, if you will, coordinate system for your inputs.
0:06:25 And yet another way of looking at it: this term here is actually a correlation, so the parameters have the same sort of physical meaning as the inputs you put in there. For example, for the first layer, the model parameters are also of the nature of being a rectangular patch of spectrogram.
0:06:46 And this is what they look like; I think there was a little bit of discussion on this earlier in Nelson's talk. So what does this mean? Each of those is, in this case, a patch of twenty-three frames along time, and this is the frequency axis here. What happens is that these patterns are basically overlaid over the input and the correlation is computed, so whenever the input matches this particular pattern (this one is sort of a peak detector) it fires as it slides over time, and that gives you the hidden activation.
0:07:15 You can see all these different patterns here; many of them really look like Gabor filters, but these were automatically learned by the system; there is no knowledge that was put in there. You have edge detectors, you have peak detectors, you have some sliding detectors, and you also have a lot of noise in there, actually. I don't know what that's for; I think it's probably just ignored by the later stages.
0:07:36 The harder problem is how to interpret the hidden layers. The hidden layers don't have any sort of spatial relationship to the input, so the only interpretation I could think of is that they represent something like logical operations. Think of it again like this: this is the direction vector, and this is the hyperplane whose position is set by the bias. If your inputs are, for example, one and one (this is obviously a two-dimensional input), or one and zero, it could be this point or this point. You could put the plane here, and that indicates an AND operation, a kind of soft AND, because it's not strictly binary; or you put it here, and it is like an OR operation.
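As a toy illustration of that intuition, with hand-picked weights rather than learned ones, a single sigmoid unit turns into a soft AND or a soft OR purely depending on where the bias places the hyperplane:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One sigmoid unit computes sigmoid(w . x + b); with inputs near 0/1, the
# bias decides whether the plane cuts off "both on" (AND) or "either on" (OR).
def soft_and(x1, x2):
    return sigmoid(10.0 * x1 + 10.0 * x2 - 15.0)  # plane beyond (1,0) and (0,1)

def soft_or(x1, x2):
    return sigmoid(10.0 * x1 + 10.0 * x2 - 5.0)   # plane between (0,0) and (1,0)

truth = [(a, b, round(soft_and(a, b)), round(soft_or(a, b)))
         for a in (0, 1) for b in (0, 1)]
```

Rounding the outputs recovers the familiar truth tables, while the raw values stay soft, exactly as described above.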
0:08:14 So this is my personal intuition of what the DNN actually does: on the lower layers it extracts these landmarks, and the higher layers assemble them into more complicated classes. And you can imagine interesting things: for example, one node in one layer might discover, say, a female version of an 'ah', and another node would give you a male version of 'ah'; then the next layer would say: an 'ah' is either a female or a male 'ah'. So this gives an idea of the modeling power of this.
0:08:47 Okay, so the takeaway: the lowest layer matches landmarks, the higher layers I think act as sort of soft logical operators, and the top layer is just a really primitive linear classifier.
0:08:57 Okay, so how do we use this in speech? You take those outputs, these posterior probabilities of speech segments, and turn them into likelihoods using Bayes' rule, and these are used directly in the hidden Markov model in place of the GMM likelihoods. And the key thing here is that these classes are tied triphone states, not monophone states; that is the thing that really made the big difference.
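The conversion from posteriors to the likelihoods the decoder consumes is just Bayes' rule with the per-frame constant p(x) dropped; a small sketch with invented numbers:

```python
import numpy as np

# The DNN gives state posteriors p(s|x); the HMM decoder wants likelihoods
# p(x|s). By Bayes' rule, p(x|s) = p(s|x) * p(x) / p(s); since p(x) is the
# same for all states in a frame, the "scaled likelihood" p(s|x)/p(s) works.
def scaled_log_likelihoods(log_posteriors, state_priors):
    return log_posteriors - np.log(state_priors)

# toy example: 3 states, priors counted from the training alignment
priors = np.array([0.6, 0.3, 0.1])
post = np.array([0.5, 0.3, 0.2])          # DNN output for one frame
loglik = scaled_log_likelihoods(np.log(post), priors)
assert np.argmax(loglik) == 2             # the rare state gets boosted
```

Note how dividing by the prior boosts rare states relative to their raw posteriors, compensating for the class bias of the training data.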
0:09:26 Okay, just before we move on, to give you a rough idea of what this error rate reduction actually means, we want to play a little video clip in which our executive vice president of research gave an on-stage demo. You can see what accuracies come out of a speaker-independent DNN; this has not been adapted to his voice.
0:09:53 [Demo video plays: live on-stage speech recognition.]
0:10:31 So this is basically perfect, right? And this is really speaker-independent. And you can do interesting things with that, just for the fun of it: I'm going to play a later part of the video where we actually use this output to drive translation, translating it into Chinese.
0:11:07 [Demo video continues: the recognized speech is translated into Chinese.]
0:11:54 So that's the kind of fun you can have with a model like that.
0:11:58 Okay, so now, about this talk. You know, everybody is giving talks about DNNs, invited talks at every one of these conferences, likely a one-hour talk telling you similar things; for example, Andrew Senior gave one last year. When I prepared this talk, I found that it ended up being Andrew's talk, and I thought that's maybe not a good idea; I want to do it slightly differently. So what I want to do is focus: I'm not going to give you a complete overview of everything, but I will focus on what is needed to build real-life, large-scale systems. So, for example, you will not see a TIMIT result here. It is structured along three areas: training, features, and run-time. Training is the biggest one, and I'm going to start with it.
0:12:51 So how do you train this model? I think we're all pretty much familiar with back-propagation: you give the network a sample vector, run it through, get a posterior distribution, compare that against what it should be, and then basically nudge the system a little bit in the direction of doing a better job next time.
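For a softmax output layer trained with frame cross-entropy, that nudge has a very simple form: the gradient is (posterior minus one-hot target) times the input. A toy sketch with made-up dimensions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# One back-propagation step for a (linear + softmax) layer under the frame
# cross-entropy criterion: move W so the correct class gets more mass.
def sgd_step(W, x, target, lr=0.1):
    p = softmax(W @ x)
    grad = np.outer(p - np.eye(len(p))[target], x)  # dCE/dW
    return W - lr * grad

rng = np.random.default_rng(1)
W = 0.01 * rng.standard_normal((5, 8))
x = rng.standard_normal(8)
before = softmax(W @ x)[3]
for _ in range(20):
    W = sgd_step(W, x, target=3)
after = softmax(W @ x)[3]
assert after > before  # the correct class's posterior went up
```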
0:13:07 The problem is that when you do this with a deep network, the system often does not converge well, or gets stuck in a local optimum. So the thing that started this whole revolution with Geoffrey Hinton, the thing he proposed, is the restricted Boltzmann machine, and the idea is basically that you train layer by layer. Here we extend the network in a way that it can also run backwards: you run the sample through, you get a representation, you run it backwards, and then you can check how well the thing that comes out actually matches your input. You then tune the system so that it reconstructs the input as closely as possible. If you can do that, and don't forget this is sort of a binary representation, it means you have a representation of the data that is meaningful: this thing extracts something meaningful about the data. That is the idea.
0:13:56 Now you do the same thing with the next layer: you freeze this one, it is taken as a feature extractor, and you do this with the next layer, and so on. Then you put a softmax on top and train the whole thing with back-propagation.
0:14:08 Now, I had no idea about deep neural networks or anything when I started this, so I thought: why do we do this so complicated? We had already run these experiments on how many layers you need and so on, so we already had a network with a single hidden layer. So why not just take that one as the initialization: rip out its softmax layer, put in another hidden layer and a new softmax on top of it, and then iterate the entire stack; after that, again rip this top layer off and do it again, and so on; and once you are at the top, iterate the whole thing. We call this greedy layer-wise discriminative pre-training.
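In rough Python, the procedure I just described might look like the following sketch; `train_bp`, `make_hidden_layer`, and `make_softmax` are hypothetical helpers standing in for real back-propagation and layer construction:

```python
# Greedy layer-wise discriminative pre-training, as a structural sketch:
# grow the stack one hidden layer at a time, each time with a fresh softmax
# on top, train only briefly, rip the softmax off, and repeat.

def discriminative_pretrain(make_hidden_layer, make_softmax, data,
                            num_layers, train_bp):
    hidden_stack = []
    for _ in range(num_layers):
        hidden_stack.append(make_hidden_layer())  # grow by one hidden layer
        net = hidden_stack + [make_softmax()]     # fresh softmax on top
        train_bp(net, data, sweeps=1)             # only into "the ballpark"
        hidden_stack = net[:-1]                   # rip the softmax off again
    final_net = hidden_stack + [make_softmax()]
    train_bp(final_net, data, sweeps=None)        # full training at the end
    return final_net
```

The `sweeps=1` choice reflects the point made below: iterating each intermediate stack only briefly works slightly better than training it to convergence.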
0:14:44 And it turns out that this actually works really well. If we look at this: the DBN pre-training, Geoffrey Hinton's method, is the green curve here; if you do what I just described, you get the red curve, essentially the same word error rate. And this is plotted for different numbers of layers; this is not progression over training, it is the accuracy for different numbers of layers. So the more layers you add, the better it gets, and you see the two basically track each other, with the layer-wise pre-training slightly worse. But then Dong Yu, who understands neural networks much better than I do, said you shouldn't iterate each intermediate model all the way to the end; you should just let it iterate a little bit, get it roughly into the ballpark, then move on. It turns out that made the system slightly better, and the sixteen point eight here is with this modified pre-training method.
0:15:34 Now, this is expensive, because at every stage you have this full nine-thousand-senone top layer there. But it turns out you don't need to do that: you can actually use monophone targets instead, and it works equally well. Okay, so the takeaway: pre-training still sort of matters, it helps; but discriminative pre-training is sufficient, and it is much simpler than the RBM pre-training, because we just reuse the existing code; there is no new code to write.
0:15:59 Okay, another important topic is sequence training. The question here is: we have trained this network to classify the signal into those segments of speech independently of each other, but in speech recognition we have dictionaries, we have language models, we have the hidden Markov model that gives you the sequence structure, and so on. So if we integrate all of that into the training, we should actually get a better result, right?
0:16:27 The frame-classification criterion is written this way: you maximize the log posterior of the correct state for every single frame. If you write down sequence training, you actually find that it has exactly the same form, except that this term is now not the state posterior derived from the DNN, but the state posterior taking all the additional knowledge into account: it takes into account the HMMs, the dictionary, and the language models. The way to run this is: you run your data through, and here you must run full speech recognition and compute these posteriors; in practical terms you do this with word lattices; and then you do back-propagation with that.
0:17:08 So we did that. We started with the baseline, and we did the first iteration of this sequence training, and the error rate went up. So that kind of didn't work. Well, we observed that it sort of improved for a while and then diverged; it didn't look like it was training. So we tried to understand what the problem is, and there are several factors: are we actually using the right model for lattice generation; there are problems of lattice sparseness; the randomization of the data; and the objective function, because there are multiple objective functions to choose from. Today I will talk about the lattice sparseness.
0:17:46 So one thing we found was that there was an increasing problem of speech getting replaced by silence, a deletion problem. We saw that the silence scores kept growing while the other scores did not. Basically, what happens is that the lattice is very biased: the lattice typically doesn't contain negative hypotheses for silence, because silence is so far away from speech, but it has a lot of positive examples of silence. So this was just biasing the system towards silence; we were giving it a high bias.
0:18:22 So what did we do? We said, okay, why don't we just not update the silence states, and also skip all silence frames. That already gave us something much better; it already looks like it's converging. We can also do this slightly more systematically: we can explicitly add silence arcs into the lattice, the arcs that should have been there in the first place. Once you do that, you actually get even slightly better results, so that kind of confirms that the missing-silence hypothesis was right.
0:18:52 But then, another problem is that the lattices are rather sparse. We found that at any given frame, we only have something like three hundred out of nine thousand senones present in the lattice; the others are not there because they basically had zero probability. But as the model moves along, at some point they may no longer have zero probability, so they should be in the lattice, but they are not. So the system cannot train properly.
0:19:16 So we thought, why don't we just regenerate the lattices after one iteration? You see it helps a little bit; at least the difference stays stable here. Then we thought, can we do this slightly better? Basically, we take this idea of adding silence arcs, and we would like to add speech arcs as well; you can't really do that, but a similar effect can be achieved by interpolating your sequence criterion with the frame criterion. And when we do that, we get a very good convergence.
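A minimal sketch of that interpolation; the weight `h` here is a hypothetical choice for illustration, not a value from the talk:

```python
import numpy as np

# Blend the per-frame error signal of the sequence criterion with the frame
# cross-entropy criterion, so that senones missing from the sparse lattice
# still receive a gradient from the frame criterion.
def smoothed_error_signal(seq_error, frame_error, h=10.0):
    """Weighted blend (h * sequence + 1 * frame) / (h + 1)."""
    w = h / (h + 1.0)
    return w * seq_error + (1.0 - w) * frame_error

seq = np.array([0.0, -0.2, 0.0])    # zeros: senones absent from the lattice
frm = np.array([0.1, -0.1, 0.05])   # frame CE covers every senone
err = smoothed_error_signal(seq, frm)
assert err[0] != 0 and err[2] != 0  # absent senones still get nudged
```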
0:19:47 Now, we were not the only people who observed that problem, who ran into these issues with the training. For example, Karel Veselý and his co-workers observed that if you look at the posterior probability of the ground truth path over time, you sometimes find that it is very low; it is not always zero, but sometimes it is zero, and that means a lot. What they found is that if you just skip those frames, which they call frame rejection, you get much better convergence behavior: the red curve is without it, and the blue curve is with that frame removal.
0:20:23 And of course Brian also observed exactly the same thing, but he said: no, I'm going to do the smart thing, I'm going to do something much better, I'm going to use a second-order method. With a second-order method you approximate the objective function by a second-order function, so that you can, theoretically, hop right to the optimum. This can be done without explicitly computing the Hessian, and this is the Hessian-free method that Martens, a student of Hinton's, sort of optimized. The nice thing is that it is actually a batch method, so it does not suffer from the previous issues like lattice sparseness, and the lattice regeneration is also much less of a concern. Also, at this conference there is a paper that says it works from a partially iterated CE model; you don't even have to do full CE training first, and that is also very nice.
0:21:12 I need to say that Brian actually did my homework here: he was the first to show the effectiveness of sequence training for Switchboard.
0:21:19 Okay, so here are some results. This is the GMM system, this is basically the CE-trained DNN, and this is the sequence-trained one. This is all on Switchboard, on the Hub5'00 and RT03 test sets. We get something like twelve percent relative; others got eleven percent; and Brian, on RT03, got I think fourteen percent; all in a similar range. I also want to point out one thing: going from here to here, the DNN has given us forty-two percent relative, and that is a fair comparison, because this is also a sequence-trained baseline; the only difference is that the GMM is replaced by the DNN. And it also works on a larger dataset.
0:22:05 Okay, the takeaway: sequence training gives us gains of roughly nine to thirteen percent. Plain SGD works, but you need some tricks there, namely the smoothing and the rejection of bad frames. The Hessian-free method requires no such tricks, but it is actually much more complicated; so to start with, I would probably begin with the SGD method.
0:22:27 Another big question is parallelizing the training. Just to give you an idea: the model that we used in this demo video was trained on two thousand hours, and it took sixty days. Most of you probably don't work with Windows; we do, and that causes a very specific problem, because you've probably heard of something called Patch Tuesday. Basically, every two to four weeks Microsoft IT forces us to update some virus scanner or the like, and then those machines have to be rebooted. So running a job for sixty days is actually a challenge. We were running this on a GPU, so we had a very strong motivation to look at parallelization. But don't get your hopes up.
0:23:15 One way of trying to parallelize the training is to switch to a batch method. Brian had already shown that Hessian-free works very well and can be parallelized. So an intern at Microsoft tried to use Hessian-free for the CE training as well, but the takeaway was basically that it takes a lot of iterations to get there, so it was actually not faster.
0:23:41 So, back to SGD. It is also a problem, because if you do mini-batches of, say, one thousand twenty-four frames, then every one thousand twenty-four frames you have to exchange a lot of data. That's a big challenge. The first group, actually a company, that did this successfully was Google, with their asynchronous SGD.
0:24:00 The way that works is: you have your machines and you group them; each machine in a group takes a part of the model; and then you split your data, and each chunk computes a different gradient. At any given time, whenever one of them has a gradient computed, it sends it to a parameter server, or a set of parameter servers, and those parameter servers aggregate it, updating the model with it. And then, whenever they feel like it and the bandwidth allows, they send the model back. Now, that is a completely asynchronous process. The right way to think of this is as independent threads: one thread just computes with whatever is in memory; another thread just shares and exchanges data in whatever way, with minimal synchronization.
0:24:45 So why would that work? Well, it's very simple: SGD already implies sort of an approximation; basically, every parameter update contributes independently to the objective function, so it is okay to miss some of them.
0:25:05 And there is also something that we call the delayed update, which I will quickly explain. In the simplest form of the training, as explained at the beginning, at every point in time you take a sample, you take the model, compute the gradient, and update the model with that gradient; then you do it again, after one frame, and again, and again. So basically the new model is equal to the old model plus the gradient. But you can also do this: you can choose not to advance the model, that is, use the same model multiple times, and update anyway. In this example, you do four model updates; the frames are still these frames, but the model stays the same model; then you do this again, and so on. That is actually what we call a mini-batch-based update, mini-batch training.
0:25:53 Now, if you want to do parallelization, you need to deal with the problem that computation and data exchange must happen in parallel. So you would do something like this: you have a model, and you start sending it into the network, so that at some point the model update can happen while the nodes are computing the next gradient. And then you do another overlap: once these gradients are computed, you send the result over while the previous ones are being received and applied. So you get this sort of overlapped processing; we call it the double-buffered update. It has exactly the same form: with this formula, you can write it in exactly the same way. And ASGD is basically just a random version of this, where you have no fixed delay; it jumps between one or two or whatever the network happens to deliver.
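That equivalence can be checked numerically on a toy problem: freezing the model for n gradient computations and then applying their sum is exactly mini-batch training with batch size n. Everything below is illustrative:

```python
import numpy as np

def grad(w, x):
    # toy quadratic loss (w - x)^2 / 2, so the gradient is simply w - x
    return w - x

def sgd_delayed(w, data, delay, lr=0.01):
    """Compute `delay` gradients against the same frozen model, then apply."""
    for i in range(0, len(data), delay):
        frozen = w                               # gradients use the old model
        g = sum(grad(frozen, x) for x in data[i:i + delay])
        w = w - lr * g
    return w

def sgd_minibatch(w, data, batch, lr=0.01):
    """Standard mini-batch SGD."""
    for i in range(0, len(data), batch):
        w = w - lr * sum(grad(w, x) for x in data[i:i + batch])
    return w

data = list(np.linspace(-1.0, 1.0, 32))
assert np.isclose(sgd_delayed(0.0, data, 4), sgd_minibatch(0.0, data, 4))
```

The two trajectories coincide step for step, which is the argument for why a bounded, even random, delay is no more harmful than a correspondingly larger mini-batch.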
0:26:38 So why would this work? Because it is basically no different from mini-batching, and to make it work, the only thing you need to ensure is that you still stay in this sort of linear regime. It also means that as training progresses, you can increase your mini-batch size; and we observed that this also means you can increase your delay, which means you can use more machines; although the more machines you use, the more delay you incur, because the network saturates, right?
0:27:09 Okay, but then: there were three times that colleagues told me, it's like in the paper, we'll just implement it. And then, like three months later, I asked them: so where did we come out, does it scale well? That actually happened three times, and it didn't. So why does it not work?
0:27:27 Let's look at the different ways of parallelizing something: model parallelism, data parallelism, and layer parallelism. Model parallelism means you split the model over different nodes. Then each node computes only part of the output, each a different sub-range of your output dimensions, so after every computation step they have to exchange their outputs with all the others. The same thing has to happen on the way back.
0:27:53 Data parallelism means you break your mini-batch into sub-batches, so each node computes a sub-gradient, and after every batch they have to exchange these sub-gradients; each has to send its gradient to all the other nodes. You can already see that there is a lot of communication going on.
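As a minimal illustration of the data-parallel scheme (pure NumPy, no actual networking; the loss and sizes are invented): each "node" computes a sub-gradient on its sub-batch, and the exchanged sum equals the single-node gradient, which is why the scheme is mathematically equivalent to plain mini-batch SGD:

```python
import numpy as np

def grad_of_batch(W, X):
    # toy gradient: d/dW of 0.5 * ||X W||^2 summed over the rows of X
    return X.T @ (X @ W)

rng = np.random.default_rng(2)
X = rng.standard_normal((32, 5))   # one mini-batch of 32 frames
W = rng.standard_normal((5, 3))

g_full = grad_of_batch(W, X)       # what a single node would compute

# four "nodes", each holding every 4th frame; the all-to-all exchange
# amounts to summing the four sub-gradients
g_parts = sum(grad_of_batch(W, X[i::4]) for i in range(4))
assert np.allclose(g_full, g_parts)
```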
0:28:13 The third variant is something that we tried, called layer parallelism. It works something like this: you distribute the layers over nodes. The first batch comes in, and when this node is done, it sends its output to the next one, and meanwhile we compute the next batch here. That is actually not correct, because we haven't updated the model yet; but we keep going and just ignore the problem. Then, in this case after four steps, this node finally comes back with an updated model. Why would that work? It is just the delayed update again, exactly the same form as before, except the delay is different in different layers; there is nothing fundamentally strange about this.
0:28:54 A very interesting question is how far you can actually go: what is the optimum number of nodes that you can use? My colleague came up with a very simple idea. He said: you are optimal when you max out all the resources, using all your computation and all your network capacity. That basically means that the time it takes to compute a mini-batch is equal to the time it takes to transfer the result to all the others. You would do this in an overlapped fashion: you compute one batch, then you start its transfer while you compute the next one; and you are ideal when, at the moment the transfer is completed, you are just ready to compute the next one.
0:29:39 So then you can write down the optimal number of nodes. The formula is a bit more complicated, but the basic idea is that it is proportional to the model size, so bigger models allow better parallelization, and inversely related to how fast the nodes compute, so a GPU can parallelize less. It also depends, of course, on how much data you have to exchange and what your bandwidth is. For data parallelization, the mini-batch size is also a factor, because with a longer mini-batch you have to exchange less often. And for layer parallelism this is not really that interesting, because it is limited by the number of layers.
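A back-of-the-envelope version of that balance condition for the data-parallel case, where every number is an invented assumption rather than a measurement: set the compute time per mini-batch equal to the gradient transfer time and solve for the node count K:

```python
def optimal_nodes(model_params, flops_per_frame, flops_per_sec,
                  bytes_per_param, bandwidth_bytes_per_sec, minibatch):
    """Balance t_compute(K) = minibatch * flops_per_frame / (K * flops_per_sec)
    against t_transfer = model_params * bytes_per_param / bandwidth,
    then solve for K. Purely illustrative."""
    t_transfer = model_params * bytes_per_param / bandwidth_bytes_per_sec
    k = minibatch * flops_per_frame / (flops_per_sec * t_transfer)
    return max(1, int(k))

# hypothetical numbers: a 40M-parameter DNN, ~2 flops per parameter per frame,
# 1 Tflop/s per node, 4-byte gradient values, 1 GB/s network
small = optimal_nodes(40e6, 8e7, 1e12, 4, 1e9, minibatch=1024)
large = optimal_nodes(40e6, 8e7, 1e12, 4, 1e9, minibatch=8192)
```

With these made-up numbers, a bigger mini-batch raises the optimal node count and a faster node lowers it, matching the qualitative claims above.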
0:30:16 So let me ask you: for model parallelism, what do you think we would get here? Just consider that Google is doing ImageNet with something like sixteen thousand cores. So give me a number. I'm not going to tell you yet, but it is not sixteen thousand. I implemented this, with a lot of care, on three GPUs, and this is the best you can do: we got a one-point-eight-times speed-up, not twice, let alone three times, because GPUs get less efficient the smaller the chunks of data they process. And once I went to four GPUs, it was actually much worse than this.
0:30:58 Now, data parallelism must be much better, right? So what do you think, for a mini-batch size of one thousand twenty-four? That number grows, of course: if you can use bigger mini-batches as you progress in training, this becomes a bigger number. And in reality, what you get is, well, there is Google's ASGD system, parallelized over eighty nodes, where each node is a twenty-four-core Intel server. If you look at what you get compared to using a single twenty-four-core machine: eighty times the nodes, but you only get a speed-up of five point eight. That's what you can actually get, from the numbers in the paper; about a factor of two point two comes out of model parallelism and two point six out of data parallelism. That's of course not that much.
0:31:49 Then there is another group, at the Chinese Academy of Sciences, who parallelized over NVIDIA K20X GPUs, which is sort of the state of the art, and they got a three-point-two-times speed-up. Also not that great.
0:32:05and i'm not gonna give an answer better but i just wanna
0:32:09so the last thing is layer parallelism. okay, so in this experiment we found
0:32:14that if you do it the right way you can use more gpus and you get
0:32:17a three point two or three times speedup, but we already had to use model parallelism,
0:32:22and if you don't do that you have a load-balancing problem as well
0:32:27and so this is actually the reason why i do not recommend layer parallelism
0:32:31okay, so the take-away:
0:32:33parallelizing sgd is actually really hard, and if your colleagues come to you and say, can't you
0:32:38implement sgd in parallel, then maybe show them this
0:32:45so much about parallelization
0:32:51okay, next topic: let me talk about adaptation. so adaptation can be done,
0:32:56as mentioned this morning, for example by sticking in a transform at the bottom, called
0:33:01the linear input network, similar in spirit to fmllr
0:33:06it can also be things like vtln
0:33:09another thing we can do is, as nelson explained, just train the whole stack
0:33:13a little bit, or you can do this with regularization
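The conservative fine-tuning just mentioned can be sketched roughly as follows. This is a toy, hedged version that regularizes the adapted weights back toward the speaker-independent (SI) ones; published approaches often regularize the output distribution instead, and all sizes and names here are illustrative:

```python
import numpy as np

# Toy sketch of regularized adaptation: fine-tune on a little adaptation data,
# but penalize drifting away from the speaker-independent (SI) weights so the
# model is not pulled too far by a few minutes of speech.
rng = np.random.default_rng(0)
W_si = rng.standard_normal((4, 3))   # "speaker-independent" weights (placeholder)
X = rng.standard_normal((32, 4))     # adaptation frames
T = rng.standard_normal((32, 3))     # targets for those frames

def adapt(rho, steps=200, lr=0.01):
    """Gradient descent on squared error + rho/2 * ||W - W_si||^2."""
    W = W_si.copy()
    for _ in range(steps):
        err = X @ W - T
        grad = X.T @ err / len(X) + rho * (W - W_si)  # data term + pull to SI
        W -= lr * grad
    return W

# The regularizer keeps the adapted weights closer to the SI starting point:
drift_reg = np.linalg.norm(adapt(rho=1.0) - W_si)
drift_free = np.linalg.norm(adapt(rho=0.0) - W_si)
```

With `rho = 0` this is plain fine-tuning of the whole stack; raising `rho` interpolates toward leaving the SI model untouched.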
0:33:18so what we observed is this:
0:33:20we tried this approach, the linear input transform, on switchboard
0:33:23applied to the gmm system we get a thirteen percent error reduction
0:33:29applied to a shallow neural network, that's one hidden layer only,
0:33:33you get something very similar to that
0:33:35but if we do it on the deep network, the gain is much smaller
0:33:44so this is sort of not such a great example. but then on the
0:33:48other hand, let me tell you an anecdote i forgot to put on the slide: when we
0:33:51prepared this on-stage demo for
0:33:54our vice president, we tried to actually adapt the models
0:33:58so we took something like four hours of internal talks
0:34:01and did adaptation on that
0:34:04and tested on two other talks of his, and we got something like thirty percent
0:34:11but then we moved on and actually did a dry run with him
0:34:15and it turns out,
0:34:16on that one it barely works
0:34:20so i think what happened there is that the dnn actually learned
0:34:23the channel
0:34:25of that particular recording, and that seems to be it. so basically there's a
0:34:29couple of other numbers here, but let me just cut it short: what we
0:34:31seem to be observing is that
0:34:34the gain of adaptation diminishes with large amounts of training data; this is what we
0:34:37have seen so far, except if the adaptation is done for the purpose
0:34:42of domain adaptation
0:34:45so maybe the reason for this is that the dnn is already
0:34:48very good at learning invariant representations, especially across speakers, which also means maybe there's a
0:34:54limit on what is achievable by adaptation; so keep this in mind if you're considering
0:34:57doing research on this
0:35:00on the other hand, i think very good adaptation results have been reported,
0:35:03so maybe what i'm saying is not correct, so you better check
0:35:06out the papers in the session
0:35:11okay, so that was training; now, what about alternative architectures?
0:35:16so, one of these:
0:35:18rectified linear units (relus) are very popular
0:35:21you basically replace the sigmoid nonlinearity with
0:35:25something like this
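"Something like this" is just the rectifier. A minimal sketch of the two nonlinearities being compared (illustrative only, not any particular toolkit's code):

```python
import numpy as np

def relu(x):
    # rectified linear unit: clip negative activations to zero
    return np.maximum(0.0, x)

def sigmoid(x):
    # the saturating nonlinearity it replaces
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
y = relu(x)   # -> [0., 0., 0., 0.5, 2.]
```

The appeal is exactly that it is two lines of code; as the talk goes on to say, getting it to help on a large task is another matter.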
0:35:27and that came also out of geoffrey hinton's school
0:35:31and it turns out that on vision tasks
0:35:34it works really well: it converges very fast,
0:35:36you get
0:35:37basically no need to do pre-training,
0:35:39and it seems to outperform the sigmoid version on basically everything
0:35:44non-speech. and then there was a really
0:35:48encouraging paper
0:35:49by hinton's students on how rectified nonlinearities improve neural network acoustic models
0:35:54and they were able to reduce the error rate considerably
0:35:58so great, i thought; it is actually two lines of code
0:36:02and i didn't get anywhere;
0:36:04i was not able to reproduce these results
0:36:07so i read the paper again,
0:36:08and i saw
0:36:10a sentence:
0:36:11network training stops after two complete passes
0:36:15if we only do two passes, our system is at nineteen point two, and we do
0:36:19as many passes as we can, as you see
0:36:22so actually there's something wrong with the baseline
0:36:25so it turns out, when i talk to people,
0:36:28that on the large switchboard set it seems to be very difficult to get relus to work
0:36:33so one group that actually did get it to work is ibm, together with
0:36:38george dahl, but with a rather complicated method: they use
0:36:40bayesian optimization, a system that optimizes
0:36:44the hyper-parameters of the network training, and this way they get
0:36:47something like five percent relative gain
0:36:49i don't know if they are still doing that or if it's a bit easier
0:36:52now, but
0:36:55the point is,
0:36:57the point is that it looks easy, but it actually isn't,
0:37:00for large setups
0:37:02the next one is convolutional networks
0:37:04and the idea is basically this: look at these filters here, these are tracking some
0:37:08sort of formant, right? but the formant positions, the resonance frequencies,
0:37:13depend on your body height;
0:37:14for example for women they are typically at slightly different positions compared
0:37:18to men, so
0:37:19why can't we share these filters across shifts? at the moment the system wouldn't do that
0:37:24so the idea would be to take these filters and shift them slightly, apply them
0:37:28over a range of shifts, and that's basically represented by this picture here
0:37:33and then the next layer reduces this: you pick the maximum
0:37:36over all these different results there, right? and it turns out that actually you
0:37:41can get something like four to seven percent word error rate reduction, i think maybe even a
0:37:45little bit more if you read the latest papers
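That shift-and-max idea can be sketched in a few lines: apply one filter at every shift along the frequency axis, then max-pool neighbouring shifts so a slightly moved formant still fires the same detector. Everything here (filter, frame, pooling width) is a toy illustration:

```python
import numpy as np

def conv_maxpool(frame, filt, pool=3):
    """Apply filt at every shift along frame, then max-pool groups of `pool` shifts."""
    n = len(frame) - len(filt) + 1
    responses = np.array([frame[i:i + len(filt)] @ filt for i in range(n)])
    # non-overlapping max over neighbouring shifts: tolerant to small formant moves
    return np.array([responses[i:i + pool].max() for i in range(0, n - pool + 1, pool)])

frame = np.array([0., 1., 3., 1., 0., 0., 2., 5., 2., 0.])  # toy 10-bin "spectrum"
filt = np.array([0.5, 1.0, 0.5])                            # toy 3-tap "formant" filter
pooled = conv_maxpool(frame, filt)                          # -> [4.0, 4.5]
```

Both bumps in the toy spectrum produce a strong pooled response even though they sit at different frequency positions, which is the point of sharing the filter across shifts.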
0:37:49so the take-away for those alternative architectures:
0:37:52relus are definitely not easy to get to work;
0:37:55they seem to work for smaller setups
0:37:57some people tell me they get really good results on something like twenty-four-hour
0:38:01datasets, but on the big three-hundred-hour set it's very difficult and expensive
0:38:06on the other hand, the cnns are much simpler; the gains are sort of in the range of
0:38:09what we get
0:38:10with feature adaptation
0:38:15and that's the end of the training section
0:38:17now let me talk a little bit about features
0:38:23so for features for gmms
0:38:27a lot of work has been done,
0:38:29because the gmms typically used have diagonal covariances,
0:38:33a lot of work was done to decorrelate features
0:38:36do we actually need to do this for the dnn?
0:38:38well, we decorrelate with a linear transform, and the first thing a dnn does is multiply by a matrix,
0:38:44so it can kind of do that just by itself. well, let's see
0:38:48so we start with a gmm baseline, twenty-three point six; if you put in
0:38:51fmpe, to be fair, twenty-two point six
0:38:54and then you do a cd-dnn, just a normal dnn, using those features here,
0:38:59the fmpe features, and you get to seventeen
0:39:02now get rid of that; this minus means take it out,
0:39:06now it's just a plp system
0:39:08that kind of makes sense, because the fmpe was basically trained specifically for this gmm
0:39:18then you can also take out the hlda, and it hardly changes;
0:39:21the hlda obviously decorrelates over a longer range, and the dnn already handles that
0:39:29you can also take out the dct that's part of the plp or mfcc process
0:39:34and now we have a slightly different dimension,
0:39:37you have more features here, and so on
0:39:41i think a lot of sites are now using this particular setup
0:39:44you can even take out the deltas,
0:39:46but you have to compensate: you have to make the window wider,
0:39:49so we still see the same frames, and in our case it still works
0:39:54you can go really extreme and completely eliminate the filterbank, just look at the fft
0:39:59features directly
0:40:00and you get something that's still in the ballpark here, right?
0:40:05so with what we just did, we basically undid thirty years of feature research
0:40:13there's also something really cool: if you really care about the filterbank,
0:40:16you can actually learn it; this is another poster, tomorrow. so
0:40:20you see the blue bars and the red curves there: the blue are the mel filters,
0:40:24and the red curves are basically
0:40:26learned versions of those
0:40:34and dnns can kind of do that too
0:40:38so, take-away: dnns greatly simplify feature extraction; just use the filterbank with a wider
0:40:44window. one thing i didn't mention: you still need to do the mean normalization,
0:40:47that the dnn cannot learn
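The recipe in this take-away, filterbank features with a wider context window instead of DCT and deltas, might look roughly like this; the dimensions and the helper name `stack_context` are illustrative, and the filterbank computation itself is omitted:

```python
import numpy as np

def stack_context(feats, context=5):
    """Stack each frame with `context` neighbours on each side.
    feats: (num_frames, num_bins) -> (num_frames, (2*context+1)*num_bins)."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(feats)] for i in range(2 * context + 1)])

# pretend these are log mel-filterbank energies: 100 frames, 40 bins, no DCT/deltas
feats = np.random.default_rng(0).random((100, 40))
x = stack_context(feats)            # DNN input, shape (100, 440)
```

The mean normalization the talk insists on would be applied to `feats` before stacking; the network is left to learn the temporal dynamics that deltas used to encode.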
0:40:50now, we talked about features for dnns; we can also turn it around, right? basically,
0:40:54you know, ask not what the features can do for the dnn, but what the
0:40:57dnn can do for the features
0:40:59i think that was
0:41:01said by some famous speech researcher
0:41:05so we can use dnns as feature extractors. the idea is basically this: what are the factors
0:41:09that contributed to the success?
0:41:12long-span features,
0:41:13discriminative training,
0:41:15and the hierarchical nonlinear feature mapping,
0:41:18right? so,
0:41:19it turns out that the last one is actually the major contributor, so why not use it combined
0:41:24with the gmm? so we go really back to what nelson talked about
0:41:28there are many ways of doing the tandem approach
0:41:31we heard this morning; you can also do tandem with
0:41:34a bigger layer, there is work on that, basically using the senone posteriors here
0:41:39you can do a bottleneck, where you take an intermediate layer that has a much
0:41:43smaller dimension,
0:41:44or you can also
0:41:46use the top hidden layer
0:41:49as sort of a bottleneck, but not make it smaller, just take it. in each
0:41:52of those cases you would typically do something like a pca to reduce the dimensionality
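A minimal sketch of that last variant: run frames through the hidden layers, take the top hidden activations as features, and reduce them with PCA before handing them to the GMM system. The weights below are random placeholders standing in for a trained network:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((40, 64))   # placeholder "trained" hidden layers
W2 = rng.standard_normal((64, 64))

def top_hidden(x):
    h1 = 1.0 / (1.0 + np.exp(-(x @ W1)))      # sigmoid hidden layer
    return 1.0 / (1.0 + np.exp(-(h1 @ W2)))   # top hidden layer = the "feature"

X = rng.random((200, 40))            # 200 frames of 40-dim input
H = top_hidden(X)

# PCA down to a GMM-friendly dimensionality, e.g. 39
H0 = H - H.mean(axis=0)
_, _, Vt = np.linalg.svd(H0, full_matrices=False)
features_for_gmm = H0 @ Vt[:39].T    # shape (200, 39), fed to the GMM-HMM
```

Because what reaches the GMM is just a feature vector, everything from the GMM world (region-dependent transforms, MMI, adaptation) applies unchanged downstream.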
0:41:56so does that work?
0:41:58well, okay, so if you take
0:42:00a dnn,
0:42:01this is the hybrid system here, and then you compare it with this gmm system:
0:42:05take the top layer,
0:42:07pca, and then apply the gmm
0:42:09well, it's not really that good
0:42:12but now we have one really big advantage: we are back in the world of gmms
0:42:16we can capitalize on anything that worked in the gmm world, right?
0:42:20so for example you are able to use region-dependent linear transforms, a little bit like
0:42:24fmpe
0:42:26so once you apply that,
0:42:27it's already better
0:42:29you can also just do mmi training very easily; okay, in this case it's not really
0:42:33as good, but at least you can do it out of the box without any
0:42:36of these problems with, you know, silence and so on, and you can apply adaptation just
0:42:41as you always would
0:42:42you can also do something more interesting: you can say, what if i train my dnn
0:42:47feature extractor on a smaller set
0:42:49and then do the gmm training on a larger set?
0:42:52because we have the scalability problem
0:42:54so this can really help with the scalability problem, and you can see, well,
0:43:00it's close, not quite as good, but nearly, and we're able to do that
0:43:04i mean, imagine the situation where there's something like a ten-thousand-hour product database
0:43:07that we couldn't train on, and then
0:43:10if on the dnn side we also used the same data, we would definitely gain
0:43:14here, and it might still make sense if we combine this, for example,
0:43:18with the idea of training the model only partially, and then see;
0:43:23but we don't know that yet
0:43:24so that deserves a lot of attention
0:43:26another idea for using dnns as feature extractors
0:43:31is to transfer learning from one language
0:43:35to another. so the idea is to feed the network a training set of multiple languages,
0:43:41and the output layer
0:43:43for every frame is chosen based on what that frame's language is, right? and this way you
0:43:47can train
0:43:48these shared hidden representations. and it turns out, if you do that,
0:43:51you can improve each individual language, and it even works for another language that has
0:43:56not been part of this set here
0:43:58the only thing is, this is typically something that works for low-resource languages
0:44:03but if you go larger, so for example there is
0:44:08a paper here which shows that if you
0:44:11go up to something like two hundred seventy hours of training,
0:44:14then your gain really is reduced to something like three percent
0:44:18so this is actually something that does not seem to work very well for large setups
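The multilingual setup just described can be sketched like this: the hidden stack is shared across languages, each language owns its own output layer, and for every frame only that frame's language's output layer is used. The sizes and the two-language set are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
W_shared = rng.standard_normal((40, 64))         # shared hidden layer(s)
W_out = {"en": rng.standard_normal((64, 100)),   # per-language output layers
         "fr": rng.standard_normal((64, 120))}   # (different senone inventories)

def forward(x, lang):
    h = np.tanh(x @ W_shared)                    # shared representation, trained on all languages
    z = h @ W_out[lang]                          # pick this frame's language's layer
    e = np.exp(z - z.max())
    return e / e.sum()                           # softmax over that language's senones

p = forward(rng.random(40), "fr")                # 120 posteriors summing to 1
```

In training, each frame back-propagates through its own language's output layer into the shared stack; a new language can then reuse `W_shared` and train only a fresh output layer.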
0:44:26okay, so the take-away:
0:44:28the dnn acts as a hierarchical nonlinear feature transform;
0:44:31that's really the key to the success of dnns, and you can use this directly
0:44:36and put the gmm on top of that as the classifier layer
0:44:40and it brings us back to the gmm world with all its techniques, including parallelization and
0:44:45scalability and so on
0:44:47and also, transfer learning works, but it works for small setups,
0:44:52not so much for large ones
0:44:58last topic: runtime
0:45:00runtime is an issue
0:45:02this is less of a problem for gmms:
0:45:05there you can actually do on-demand computation
0:45:08for dnns,
0:45:09a large amount of the parameters is actually in the shared layers, which you cannot compute on demand;
0:45:15for dnns,
0:45:16you have to compute everything
0:45:18and so it's important to look at how we can speed this up. for example, the
0:45:22demo video that i showed you in the beginning, that was run with
0:45:25my gpu doing the likelihood evaluation; if you don't
0:45:30do that, it would be like three times real time,
0:45:32which would be infeasible
0:45:35the way to approach this, and that was done both by some colleagues at microsoft and
0:45:38also at ibm,
0:45:40is to ask: do we actually need those full weight matrices?
0:45:44and this question is based on two observations
0:45:48one is that we saw early on that you can actually set something like two
0:45:52thirds of the parameters to zero
0:45:55and still get the same word error rate
0:45:57and what ibm observed is that in this top hidden layer,
0:46:02the number of
0:46:03nodes that are actually active is relatively limited
0:46:07so can you basically just decompose it? the idea is to use singular value decomposition on
0:46:12those weight matrices
0:46:14and the idea is, basically: this is your network layer,
0:46:17the weight matrix and nonlinearity; replace this by two matrices, and in the middle you have
0:46:23a low-rank bottleneck
0:46:26so does that work?
0:46:28this is the gmm baseline, just for reference, and the dnn
0:46:32with thirty million parameters on a microsoft-internal task:
0:46:35we start with a word error rate of twenty-five point six
0:46:38now we apply the singular value decomposition
0:46:41if you just do it straight out, it gets much worse,
0:46:44but you can then do back-propagation again
0:46:47and then you will get back to exactly the same number,
0:46:50and you gain something like a one-third parameter reduction
0:46:53you can actually also do that with
0:46:55all the layers, not just the top one; if you do that you can bring it down by
0:47:00a factor of four
0:47:02and that is actually a very good result, so this basically brings the runtime back
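The restructuring itself is only a few lines; a sketch with toy dimensions (real layers are larger, the retained rank `k` is a tuning choice, and the back-propagation pass that recovers accuracy is not shown):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))   # one layer's weight matrix (toy size)

U, s, Vt = np.linalg.svd(W, full_matrices=False)
k = 64                                # keep only the top-k singular values
A = U[:, :k] * s[:k]                  # (512, k)
B = Vt[:k]                            # (k, 512): the layer W @ x becomes A @ (B @ x)

params_before = W.size                # 512 * 512
params_after = A.size + B.size        # k * (512 + 512), here a 4x reduction
approx = A @ B                        # low-rank replacement for W
```

After substituting `A` and `B` for `W` in every layer, you retrain the whole network with back-propagation, which as the talk says recovers the original word error rate.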
0:47:08so let me just show you one more thing, to again give you a very rough idea
0:47:12[demo plays]
0:47:21so it's only a very short example, just to give an idea: this is an apples-to-apples
0:47:25comparison between the old gmm system and the dnn system
0:47:29for speech recognition. so let's look at some of these things:
0:47:34two devices, the one on the left is what
0:47:37we shipped previously, the one on the right uses the dnn models
0:47:42"we're gonna find a good pizza"
0:47:50the results are very similar; if you're interested, look here down at the latency, which is
0:47:56counted from when i stop talking to when we see the recognition result, about a second
0:48:02so i just want to give you, well, proof that this stuff actually works
0:48:06okay, so
0:48:08i think i've covered the whole range; i would like to recap
0:48:13all the take-aways
0:48:14okay, so we went through:
0:48:16the cd-dnn is really just an
0:48:18mlp, nelson already said that, nothing else; the outputs are the triphone states, and
0:48:25they're not really that hard to train, we know that now, but doing it fast
0:48:29is still sort of a frustrating enterprise, and at the moment i would recommend: just get
0:48:33a gpu, and if you have multiple gpus, just run multiple trainings rather than trying
0:48:37to parallelize a single training
0:48:40pre-training is
0:48:41beneficial, but the greedy layer-wise version is simpler, and it seems to be sufficient
0:48:48sequence training gives us regularly good improvements, up to thirty percent, but if you use
0:48:52sgd then you have to use these little tricks: smoothing
0:48:56and frame rejection
0:48:57adaptation helps much less than for gmms,
0:49:00which might be because the dnn possibly learns
0:49:04very good invariant representations already, so there might be a limit to what
0:49:07we can actually achieve
0:49:09relus are definitely not as easy as changing two lines of code, especially for large setups
0:49:16but on the other hand, the cnns
0:49:17give us like five percent, which is not really that hard to get, and they make
0:49:20good sense
0:49:23dnns really simplify feature extraction; we were able to eliminate thirty years of feature research
0:49:30but you can also turn it around and use dnns as feature extractors
0:49:35and dnns are definitely not slowing down decoding, if you use these speed-up techniques
0:49:40to conclude, where do we see the challenges going forward?
0:49:44there are of course open issues in training
0:49:46i mean, when we talk to people in the company, we are always thinking about what
0:49:51kind of computers we'll buy in the future, and do we optimize them for sgd; but
0:49:55we always think, you know what, in one year we will laugh,
0:49:57laugh about this whole mini-batch method, and we will just not need all of
0:50:01this. but so far, no; i think it's fair to say there's not
0:50:03a method like that on the horizon that allows easy parallelization
0:50:08and what we found is that this learning-rate control is not sufficient; this is really
0:50:11important, because if you don't do it right you might run into unreliable results, and
0:50:15i have a hunch that the relu result we saw was a little bit like that
0:50:19and it also has to do with parallelizability, because the smaller the learning rate, the bigger
0:50:23your mini-batch can be, and the more parallelization you can get
0:50:30dnns still have an issue with robustness to real-life situations
0:50:35i would say they have sort of not solved speech, but they got very close to
0:50:39solving speech under perfect recording conditions; but it still fails if you do speech
0:50:44recognition over, like, one meter of distance in a room with two microphones, or something like
0:50:48that. so dnns are not
0:50:49inherently, automatically robust to noise:
0:50:52they learn seen variability, but not unseen variability
0:50:57then, personally, i wonder: can we do more machine learning?
0:51:00so for example there's already work that tries to eliminate the hmm and replace it
0:51:04by rnns, and i think that will become very interesting; the same thing has already been
0:51:08done very successfully with language models
0:51:11and there's the question of,
0:51:13can we jointly train everything in one big step? but on the other hand,
0:51:16the problem with that is that different
0:51:19aspects of the model use different kinds of data that have different costs associated with
0:51:24them, so it might actually never be possible to really do a joint training
0:51:28and the final question that i sort of have is: what do dnns teach us about
0:51:32how humans process speech?
0:51:35hopefully we will also get
0:51:36more ideas on that
0:51:40so that concludes my talk thank you very much
0:51:51i think we have like six minutes for questions
0:52:12i'm not an expert on neural networks; i was wondering: if i train a
0:52:19neural network on conventional speech data, and i try to recognize data which is
0:52:26much cleaner, will it therefore not be as good, because we trained with noise?
0:52:31so what is the configuration you have in mind: you want to train on what?
0:52:34the idea is that they train their neural nets on the noisy data, and then they're running
0:52:39on the clean data
0:52:41and they don't know exactly what happens; that's my question
0:52:44okay, so i actually did skip
0:52:46one slide; let me show this one
0:52:51the dnn is actually...
0:53:04so you get, like...
0:53:10so this table here shows results on aurora, so basically doing, in this case, multi-style training
0:53:19so the idea was not to train on noisy and test on clean,
0:53:22but this is basically training and testing on the same
0:53:26set of noise conditions
0:53:28and so there are a lot of numbers here; this is the gmm baseline: if you look
0:53:31at this line here, thirteen point four
0:53:34i'm not a specialist on robustness, but i think this is about the best you can
0:53:38do with the gmm,
0:53:39pooling all the tricks that you could possibly put in
0:53:42and the dnn,
0:53:43it's just,
0:53:44without any tricks, just training on the data you get,
0:53:48you know, without anything from the robustness toolbox, you get just about exactly the same
0:53:51so what this means, i think, is that the dnn is very good at learning
0:53:55variability in the input, including noise, that it sees in the training data
0:54:02but we have other experiments where we see that if the
0:54:06variability is not covered by the training data,
0:54:09the dnn is not very robust against it
0:54:12so i don't know what happens if you train on noisy and test on clean;
0:54:15if clean is not one of the conditions that you have in your training, i could imagine
0:54:18that it will hurt, but it depends on the data
0:54:25i don't think you can really claim to get away with thirty years; thirty years, maybe,
0:54:30if that was ever the case at all
0:54:33you were obviously talking tongue-in-cheek, right? what you're talking about is going back before
0:54:38some of the developments of the eighties, right? and most of the effort on feature
0:54:43extraction in the last twenty years of conferences has actually been more about robustness, dealing with unseen variability,
0:54:51and this doesn't give you an answer to that question
0:54:59some more questions or comments?
0:55:02[question about features, partly inaudible]
0:55:10is it that you use a large temporal context? this could also be one of the
0:55:16contributing factors,
0:55:19in contrast to
0:55:21something else
0:55:24okay, what exactly i don't know; i don't have an answer to that, okay
0:55:33any more comments?
0:55:36kind of a personal question: you said that you didn't know anything about neural nets until,
0:55:40like, two or three years back, something like that. so do you see this as rather an
0:55:45advantage or a drawback; maybe you were less sentimental
0:55:48in throwing away some old wisdom that, you know, the guys who have been in the field for
0:55:54many years considered untouchable; or is it the other way round?
0:55:58i think so; i think it helps to come in with sort of a little bit
0:56:02of an outsider's mind. so i think for example it helped me to understand this
0:56:06parallelization thing, right: that basically in sgd you do
0:56:11mini-batch training
0:56:13and normally the regular definition of mini-batches is that you take the average over the batch;
0:56:18maybe you have noticed that i didn't actually divide by the number of frames
0:56:23when i used this formula, right?
0:56:27so that for example is something where, for me as an engineer coming in, looking at
0:56:30that, i wondered: why do you do mini-batches as an average? it doesn't seem to
0:56:33make sense, you're just accumulating multiple frames over time. that helped me understand those kinds of
0:56:38parallelization questions in a different way
0:56:41but these are probably details
0:56:49okay, any other questions?
0:56:54okay, let's thank the speaker again