0:00:17that's actually a very kind introduction; i hope it didn't promise too much. thank you brian
0:00:24actually, before i get to the talk, i should mention: i had a brief discussion with someone about the posters
0:00:33realising that to some extent the optimum strategy for a poster would be to make it seem like it's really interesting but completely impossible to understand
0:00:41so that everyone will want to come up and have it explained
0:00:47anyway
0:00:49here we are, back again
0:00:50someone else suggested that perhaps the talk should be called "deja vu all over again"
0:00:55from that famous philosopher yogi berra
0:00:58but let me start with a little story
0:01:02for those of you who don't know, arthur conan doyle wrote a series of stories about a detective whose name was sherlock holmes
0:01:12and he had a colleague, dr watson, who really didn't know so much about detection
0:01:19so holmes and watson went on a camping trip
0:01:21they shared a good meal, had a bottle of wine, and retired to their tent for the night
0:01:26at three in the morning holmes nudged watson and said
0:01:29look up at the sky and tell me what you see
0:01:32watson said: i see millions of stars
0:01:34holmes said: and what does that tell you?
0:01:37watson replies: astronomically it tells me there are billions of galaxies and potentially millions of planets
0:01:43astrologically it tells me that saturn is in leo; theologically it tells me that god is great and we are small and insignificant
0:01:50horologically it tells me that it's about three
0:01:53meteorologically it tells me we'll have a beautiful day tomorrow
0:01:56what does it tell you, holmes?
0:01:58holmes says: watson, someone has stolen our tent
0:02:03so the point is that we might be missing the big thing
0:02:08there are some great, really exciting results, and a lot of people are interested now in neural nets for a number of application areas, but in particular in speech recognition, which is of course our focus here
0:02:24but there might be a few things that we're missing in the process
0:02:28and perhaps it might be useful to look at some historical context to help us see that
0:02:35so
0:02:36as brian alluded to earlier in the day
0:02:40there has been a great deal of history of neural networks for speech, and neural networks in general, before this
0:02:48and i think of this as occurring in three waves
0:02:51the first wave was in the fifties and sixties with the development of the perceptron
0:02:57and i think of this as the basic structure, or the bs
0:03:03in the eighties and nineties we had back propagation, which had actually been developed before that but was really applied a lot then
0:03:11and multilayer perceptrons or mlps, which applied more structure to the problem, sort of an ms
0:03:18and now we have things that are piled higher and deeper
0:03:23so it's the phd level
0:03:25now, for asr, speech recognition:
0:03:29we had digits, pretty much, or other very small vocabulary tasks in the fifties and sixties
0:03:36in the eighties and nineties we actually graduated to large vocabulary continuous speech recognition
0:03:42and in this new wave there's really mature use of the technology, on much harder problems
0:03:52now, this talk isn't about the history of speech recognition, but i don't think i can really do a history of neural nets for speech recognition without doing a little bit of that
0:04:01speech recognition also had an early start
0:04:05the best known early paper was a nineteen fifty two paper from bell labs
0:04:08but before that was radio rex
0:04:11now, if you haven't seen or heard about radio rex: radio rex was a toy, a little dog in a dog house
0:04:17and you would say "rex!" and rex would pop out
0:04:21of course if you did that, rex would also probably pop out for just about anything that had enough energy at five, six, seven hundred hertz or so
0:04:29because the doghouse actually resonated with some of those low frequencies
0:04:34and when it resonated and vibrated, it would break a connection from an electromagnet, and a spring would push the dog out
0:04:40so we could think of it as speech recognition with really bad rejection
0:04:45now, the first paper that i know of, anyway, that described real speech recognition was this paper by davis, biddulph and balashek on digit recognition, from bell labs
0:05:00and it approximated energy in the first couple of formants; it was really just how much energy there was over time in different frequency regions
0:05:11that already had some kind of robust estimation; in particular it was quite insensitive to the fundamental frequency
0:05:19and it worked very well under limited circumstances, that is: pristine recording conditions, very quiet, very good signal to noise ratio, in the laboratory, and also for a single speaker; it was tuned to a single speaker
0:05:35and really tuned, because it was a big bunch of resistors and capacitors
0:05:40it also took a fair amount of space
0:05:43that was the nineteen fifty two digit recogniser
0:05:47it wasn't something that you would fit into a nineteen fifty two phone
0:05:54now, i should say that this system had a reported accuracy of ninety seven to ninety eight percent
0:06:02and since every commercial system since then has reported an accuracy of ninety seven to ninety eight percent, you might think there's been no progress
0:06:12but of course there has been; the problems have got much harder
0:06:16and anyway the history of speech recognition isn't the real point of this talk
0:06:21fundamentally, the early asr was based on some kind of templates or examples, and distances between incoming speech and those examples
0:06:30in the last thirty to forty years the systems have pretty much been based on statistical models, especially in the last twenty five
0:06:42the hidden markov model technology, however, is based on mathematics from the late sixties
0:06:49and the biggest source of gain since then (this is a slightly unfair statement that i'll justify in a moment) has been having lots of computing
0:06:58now obviously there are a lot of people, including a lot of people here, who contributed many important engineering ideas since the late sixties
0:07:08but those ideas were enabled by having lots of computing and lots of storage
0:07:14statistical models are trained with examples; this is the basic approach we all know about
0:07:20the examples are represented by some kind of choice of features
0:07:24and the estimators generate likelihoods for what was said, and then there is a model that integrates over time these sort of pointwise-in-time likelihoods that are generated
0:07:37now, artificial neural nets can be used for this too: they can generate the features that are then processed by some kind of probability estimator that isn't a neural net, or they can generate the likelihoods that are actually used in the hidden markov model
0:07:54going back to these three waves: in the first wave (and actually i guess i should say a lot of the things from the early wave carried through to the current one) the idea was the mcculloch-pitts neuron model
0:08:10and there were training algorithms, learning algorithms, that were developed around this model: perceptrons, adaline
0:08:17and other more complex things
0:08:19an example of which is what's called discriminant analysis iterative design, or daid
0:08:25now, going into these a little bit
0:08:28so the mcculloch-pitts model was basically that you had a bunch of inputs coming in from other neurons
0:08:34they were weighted in some way
0:08:36and when the weighted sum exceeded some threshold, the neuron fired
0:08:41now the perceptron algorithm was based on changing what these weights were when the firing was incorrect; in other words, for a classification problem, when it said the input was a particular class and it really wasn't
0:08:56and by the way, i'm going to have almost no equations in this presentation itself
0:09:03if you were craving equations, too bad
0:09:07so the perceptron learning algorithm adjusted these weights using the outputs, using whether the neuron fired or not
0:09:14the adaline approach was actually a linear processing approach, where the weights were adjusted using the weighted sum itself
0:09:23the initial versions of all the experiments with both of these were done with a single layer, so they were single-layer perceptrons and single-layer adalines
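as an aside, the update rule just described can be sketched in a few lines; this is a minimal illustration of the classic perceptron rule, where the names and the toy task are my own, not from the talk:

```python
import numpy as np

def perceptron_train(x, y, epochs=20, lr=1.0):
    """Single-layer perceptron: update the weights only when the unit
    fires incorrectly; no change is made on correct classifications."""
    w = np.zeros(x.shape[1] + 1)               # weights plus a bias/threshold term
    xb = np.hstack([x, np.ones((len(x), 1))])  # append constant 1 for the bias
    for _ in range(epochs):
        for xi, yi in zip(xb, y):              # yi in {0, 1}
            fired = 1 if xi @ w > 0 else 0
            w += lr * (yi - fired) * xi        # zero when the firing was correct
    return w

# Linearly separable toy problem (AND gate), where the rule converges.
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w = perceptron_train(x, y)
pred = [(1 if np.append(xi, 1) @ w > 0 else 0) for xi in x]
print(pred)  # [0, 0, 0, 1]
```

on a linearly separable problem like this the rule converges; on exclusive-or, which comes up next in the talk, a single layer never can.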
0:09:32and in the late sixties there was the famous book by minsky and papert, "perceptrons", that pointed out that such a simple network could not even solve the exclusive-or problem
0:09:45but in fact multiple layers were used as early as the early sixties, and an example of that is this daid algorithm
0:09:54so daid was not a homogeneous neural net like the kinds of nets that we mostly use today
0:10:00it had gaussians at the first layer
0:10:03it had a perceptron at the output layer
0:10:05it was somewhat similar to the later radial basis function networks, which also had some kind of radial basis function, a gaussian-like function, at the first layer
0:10:17and it had a clever weighting scheme
0:10:20when you loaded up the covariance matrices for the gaussians, you would give particular weight to the patterns that had resulted in errors
0:10:31and you used an exponential loss function of the output to do that
0:10:39this wasn't really used for speech, but it was used for a wide variety of problems at mcdonnell douglas and other governmental and commercial organisations
0:10:48a lot of people don't know about it; i happen to know about it because i reviewed it at one point
0:10:56i'm compressing this history terribly, but anyway
0:10:59so
0:11:02moving on to anns for speech recognition
0:11:06in the early sixties at stanford, bernard widrow's students did a system for digit recognition
0:11:15where they had a series of these adalines, these adaptive linear units
0:11:21and it worked quite well within speaker, much as the nineteen fifty two system had
0:11:26except that this was automatic; you didn't have to tune a bunch of resistors
0:11:32and it had terrible error rates across speakers
0:11:35but it was sort of comparable, and it was using this kind of technology
0:11:40moving into the eighties: wave two
0:11:53colleagues did some consonant classification with such systems
0:11:59i had the good fortune to be able to play around with such things for voiced unvoiced classification for a commercial task
0:12:08but competing systems started coming up by the mid to late eighties
0:12:16people at cmu (alex waibel and geoff hinton, who was there at the time, and kevin lang) did this kind of classification for stop consonants using such systems
0:12:29and there were many others; i don't have enough room on one slide to show how many there were
0:12:34but kohonen in finland, groups in germany and the uk, and many others
0:12:42built up these systems and did typically isolated word recognition
0:12:48then, by the end of the eighties, we got to real speech recognition, that is, continuous speech recognition, speaker-independent, et cetera
0:13:00i have had the good fortune to have really clever friends, and together with some of them i did some of this work
0:13:07herve bourlard came to visit icsi in eighty-eight
0:13:11and he and i started a long collaboration where we developed an approach for using feed-forward neural networks for speech recognition
0:13:20and there is a range of other people who did related things, in particular in germany and at cmu
0:13:28also there was work in recurrent networks; with the feed-forward nets you just go straight through from input to output
0:13:35not too many layers at the time
0:13:37and the recurrent nets actually fed back
0:13:41and this was really pioneering; i mean, there were a number of people who worked with recurrent networks
0:13:49but for applying them to large vocabulary continuous speech recognition, the real centre for that was cambridge
0:13:55tony robinson, and frank fallside while he was still alive
0:13:59and what both approaches had in common was that through the proper training they generated posterior probabilities of phone classes
0:14:08and then they derived state emission likelihoods for hidden markov models
0:14:13typically we found it worked better in most cases to divide by the prior probabilities of each phone class
0:14:19and get some scaled likelihoods
0:14:21and the name we attached to this, you may recall, was the hybrid hmm/mlp, or hybrid hmm/ann, system
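the posterior-to-scaled-likelihood trick just described can be written out in a short sketch (an illustration under the stated assumptions, not the original code; the function name is mine):

```python
import numpy as np

def scaled_likelihoods(posteriors, priors, floor=1e-8):
    """Hybrid HMM/MLP trick: p(x|q) is proportional to p(q|x) / p(q),
    since the frame likelihood p(x) is the same for every state at a
    given frame. Done in the log domain for numerical safety."""
    return np.log(np.maximum(posteriors, floor)) - np.log(priors)

# Toy frame with 3 phone classes.
post = np.array([0.7, 0.2, 0.1])    # MLP softmax outputs, p(q|x)
priors = np.array([0.5, 0.3, 0.2])  # class frequencies from the training data
print(scaled_likelihoods(post, priors))
```

these log scaled likelihoods can then be plugged in wherever the hmm expects log emission scores, since the dropped p(x) term does not change the best path.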
0:14:34so with mlps you would use error back propagation
0:14:40using the chain rule to spread the blame or credit back through the layers
0:14:45it was simple to use, simple to train, and gave powerful transformations
0:14:50they were also used for classification and prediction
0:14:53but in the hybrid system the idea was to use them for probability estimation
0:14:58and initially we did this for a limited number of classes, typically monophones
0:15:04the next slide has the only equation in the talk
0:15:10we did understand that having some representation of context could be beneficial
0:15:17but it was kind of hard to deal with twenty-some years ago
0:15:21and the notion of having thousands and thousands of outputs just didn't seem particularly like a good one
0:15:28to a great degree because of the limited amount of training data and computation that we had to work with
0:15:35so we came up with a factored version
0:15:39in this equation Q stands for the states, which in this case were typically monophones
0:15:45C stands for context, and X is the feature input
0:15:50and you can break it up without any assumptions, no independence assumption, into two different factorisations
0:15:58one factorisation is the probability of the state given the context and the input, times the probability of the context given the input
0:16:09the other one, on the right, is the probability of the context given the state and the input, times the monophone probability
0:16:18and the latter one means that you could take the monophone net that you had already trained and just multiply in this other one
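the equation being described is presumably the chain-rule identity (with Q the state, C the context, X the acoustic input), which holds exactly, with no independence assumptions:

```latex
P(Q, C \mid X) \;=\; P(Q \mid C, X)\, P(C \mid X) \;=\; P(C \mid Q, X)\, P(Q \mid X)
```

the right-hand form is the one the talk exploits: the second factor is just the already-trained monophone net.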
0:16:27the thing was, as with other things, a bit ahead of its time, and initially, for the first six months to a year, it didn't work at all
0:16:35and then our colleagues at sri, who were very helpful, came up with some really good smoothing methods
0:16:42which, with the limited amount of data that we were working with, were really necessary to make context work
0:16:50and then, a few years later, fritsch at cmu took this to an extreme, where you actually had a tree of mlps
0:16:58and so you could implement this factorisation over and over and get finer and finer, until down at the leaves you actually had tens of thousands or even a hundred thousand generalized triphones of some sort
0:17:14and it worked very well; it was actually quite comparable to other systems at the time
0:17:19but it was really complicated
0:17:21and most people at this point had really focused in on gaussian mixture systems, so it never really took off
0:17:29now, if you look at where all this was in around two thousand
0:17:33the gaussian mixture approaches had matured
0:17:36people really had learned how to use them
0:17:39many refinements had been developed
0:17:41sometimes think about gaussians: you have means, you have covariances; people typically use diagonal covariance matrices
0:17:49and so there's lots of simple things that you can do with them
0:17:53many of these were developed: not just mllr and sat, but a whole alphabet soup of others
0:18:01i mean all sorts of alphabet soups
0:18:05these didn't come easily, and they didn't come at all easily to the mlp world
0:18:11and since the mlp world, for large vocabulary speech recognition at this point, was really confined to a few places, while almost everybody was working with gaussian mixtures, it was kind of hard to keep up
0:18:26but we still wanted to
0:18:28and we liked the nets, because one important reason for us was that they worked really well with different front ends
0:18:35so if you came up with some really weird thing, you know, you listened to christoph talking about neurons and said let's try that thing
0:18:43you'd feed it to the mlp, and the mlp didn't mind
0:18:47we had experiences with a colleague of ours, for instance john lazzaro, who was doing these funny little chips, implemented in some threshold mos process
0:18:57computing various functions that people had found in the cochlear nuclei and so on
0:19:02and you'd feed those into htk and it would just roll over and die
0:19:08whereas we fed them into our systems and they didn't mind at all, because of the nature of the nonlinearities
0:19:16it really was very agnostic to the kind of inputs
0:19:23so the question is how to take advantage of both
0:19:25well, what happened at this time: we were working with hynek hermansky, who was at ogi, and with dan ellis, who was at icsi
0:19:34and there was this competition happening for a standard for distributed speech recognition
0:19:42the idea being that you would compute the features on the phone, and then somewhere else you would actually do the rest of the recognition
0:19:49and so the idea was to replace mfccs with something better
0:19:54but the models were required to be hmm-gmm; you couldn't change them
0:19:58we still liked the nets
0:20:00so the solution these guys came up with was to use the outputs as features, not as probabilities
0:20:06they weren't the only ones who ever used the outputs of mlps as features
0:20:12but there was a particular way of doing it, later implemented in large vocabulary as well as small vocabulary systems
0:20:20where it really worked first was with the digits
0:20:24and this was called the tandem approach
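the tandem recipe can be sketched roughly as: take the mlp's phone posteriors, take logs to undo the softmax compression, and decorrelate them pca-style so they better suit diagonal-covariance gmms. a minimal sketch under those assumptions (the function name and details are mine, not the original system):

```python
import numpy as np

def tandem_features(posteriors, n_keep=None):
    """Tandem sketch: log of MLP phone posteriors, mean-centered over
    frames, then projected onto principal components to decorrelate."""
    logp = np.log(np.maximum(posteriors, 1e-8))
    logp = logp - logp.mean(axis=0)        # center each dimension over frames
    cov = np.cov(logp, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigval)[::-1]       # strongest components first
    proj = eigvec[:, order[:n_keep]]
    return logp @ proj

# 100 frames of 10-class posteriors from a hypothetical MLP.
rng = np.random.default_rng(0)
raw = rng.random((100, 10))
post = raw / raw.sum(axis=1, keepdims=True)
feats = tandem_features(post, n_keep=8)
print(feats.shape)  # (100, 8)
```

the resulting frame vectors then go to an unmodified gmm-hmm recogniser in place of mfccs, which is the whole point of the approach.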
0:20:28now, this had a sort of socio-cultural advantage for our research
0:20:35the nice thing was, instead of having to convince everybody that hybrid systems were the way to go, we could just say: here are some cool features, you should try them out
0:20:44and we could, and did in fact, collaborate with other people's systems that way
0:20:50and i should give some credit here: there was related work being done along these lines in speaker recognition
0:20:57so there were also other variants; once you get the idea that you have some interesting use of neural nets to generate features
0:21:03you also could focus on temporal approaches, which hynek and his colleagues did with traps
0:21:08where you would have neural nets just looking at parts of the spectrum over a long time
0:21:16and so they would be kind of forced into learning something about the temporal properties
0:21:23that would help you with phonetic identification
0:21:27icsi's version of this was called hats, that is, hidden activation traps
0:21:33and in all of these there was the germ of what people do now with the layer-by-layer stuff
0:21:39because you'd train something up and then you'd feed that into another net
0:21:45in the case of hats, you'd train something up, then you'd throw away the last layer and feed the hidden activations into something else as features
0:21:52then there were a bunch of things that worked with gabor filters, where you had modulation-based inputs
0:22:00you could, again using a tandem approach, end up getting features from that
0:22:06and then a much more recent version is bottleneck features
0:22:11which are kind of tandem-like; it's not exactly the same thing, since it's not coming from posteriors, but it is using an output from the net as the features
0:22:25so: the third wave, and where things went from there
0:22:37there's nothing wrong with the original hybrid approach; i mean, it worked fine
0:22:43the gmm approach sort of won out, because when you get a lot of people moving in the same direction, a lot of things can happen
0:22:51but also, just in terms of computation, storage and so forth
0:22:57it was a lot more straightforward, i think, to make progress with modifications to the gmm based approaches
0:23:05so the fundamental issues with going further with the hybrid approach were how to employ many parameters usefully
0:23:12and how to get these emission probabilities for many phonetic categories
0:23:18and aspects of the solution were already there; as already mentioned, in a number of these approaches we were already generating mlps layer by layer
0:23:27for many phonetic categories there was some work in context dependence, but that needed to be pushed further
0:23:33learning approaches: second-order methods, regularisation and so forth
0:23:38there were many papers on these sorts of things, on variants of conjugate gradient and the like, in the eighties
0:23:47and conjugate gradient of course is much older than the eighties
0:23:51but someone had to do all this, and so
0:23:53when i offer these reflections from an earlier time, i don't want to cast aspersions; people are doing great things now
0:23:59someone actually had to put these things together and push forward
0:24:05and in that kind of discussion you have to start with geoff hinton
0:24:09geoff is kind of an excitable guy
0:24:12he was very excited by back-propagation in the eighties
0:24:14he's been excited about other things since
0:24:17and he is very good at spreading that excitement
0:24:21so he developed particular initialisation techniques
0:24:25and some of these are unsupervised techniques in particular, which he liked because they seemed biologically plausible
0:24:36and this permitted the use of many parameters in all the layers
0:24:40because when you have many layers, back propagation isn't too effective down at the early layers; the credit or blame gets watered down
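that watering down can be illustrated numerically: with sigmoid units the local derivative a(1-a) never exceeds 0.25, so a backpropagated error signal typically shrinks geometrically with depth. a small illustrative sketch, with random weights and arbitrary sizes of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

depth, width = 8, 64
weights = [rng.normal(0.0, 0.3, (width, width)) for _ in range(depth)]

# Forward pass, remembering each layer's activations.
a = rng.normal(size=width)
acts = []
for w in weights:
    a = sigmoid(w @ a)
    acts.append(a)

# Backward pass: push a unit error signal down through the layers.
grad = np.ones(width)
norms = []
for w, a in zip(reversed(weights), reversed(acts)):
    grad = w.T @ (grad * a * (1.0 - a))  # sigmoid' = a(1-a) <= 0.25
    norms.append(np.linalg.norm(grad))

# The signal reaching the earliest layer is far weaker than at the top.
print(norms[0], norms[-1])
```

this is exactly why a good starting point for the lower layers, from pretraining of some sort, was so attractive.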
0:24:52and this excitement spread to microsoft research
0:24:56and they extended what was going on before to many phonetic categories and large vocabulary speech recognition
0:25:05and lots of other very talented people, at google, ibm and elsewhere, followed
0:25:12so: initialisation, having a good starting point for the weights before you start discriminative training of some sort
0:25:22was often used for the limited data case
0:25:26it was often the case, back in the early nineties, when we were going into some situation where we had relatively little data
0:25:32that we would train on something else first and then start with those weights; maybe we wouldn't even train all the way, we'd just do an epoch or two
0:25:40and then we would go to the other language or the other task
0:25:44and we often found that to be very helpful
0:25:49so hinton developed a general unsupervised approach
0:25:53applied to multiple layers, and in general this was called deep learning
0:26:01a lot of this early stuff was, at least sometimes, called deep belief nets
0:26:06and generally these are dnns
0:26:09applied to other applications as well as speech
0:26:11and again, the idea was that it gave reasonable weights for the layers far from the targets
0:26:15because even if the back propagation training doesn't change those weights much, at least the early layers are doing something useful
0:26:24in later speech work, a lot of the things that you see in posters or papers in the last couple of years actually skip this step
0:26:31and do something else, for instance layer by layer training done discriminatively
0:26:39and many approaches use some kind of regularisation to avoid overfitting
0:26:45so the recent work, which you'll hear much more about later today
0:26:53shows significant improvements over comparable gmms
0:26:56and although there's a mixture of approaches, sometimes tandem-like or bottleneck-like, sometimes in hybrid mode, i think they're usually in hybrid mode
0:27:08and i have to say, it's great that they're called deep neural nets, but they're still multilayer perceptrons
0:27:12they're just multilayer perceptrons with, you know, a certain number of layers
0:27:17and you can say, well okay, but is it really different with seven hidden layers than it used to be? you know, maybe
0:27:24but we do have to ask: how deep do they need to be?
0:27:29many experiments show continued improvements with more layers
0:27:33and at some point there's diminishing returns, but the underlying assumption there is that there's no limit on parameters
0:27:39so we started asking the question: what if there was a limit?
0:27:42now, why would you want to limit it?
0:27:44well, because in any practical situation you are actually under some kind of limit; at least there's a cost, right?
0:27:51you could think of the number of parameters as being a proxy for the cost, for the resources in general: for the time it takes to train, the time it takes to run, the amount of storage
0:28:03and, well, there are people from google here, but
0:28:05i have to say, you know, even if you've got a million machines
0:28:09you've probably got a hundred million users, so it still matters how many parameters you use
0:28:14so at interspeech i presented something which i'm just going to present for a minute or two here
0:28:20what we called deep on a budget
0:28:23and we said: suppose we have a fixed but very large number of parameters; we wanted to make sure that nobody thought we didn't use enough parameters
0:28:33and then you compare between narrow and deep versus wide and shallow
0:28:38we often see comparisons where people try, you know, the earlier version that we often used, of one big hidden layer, versus a bunch of hidden layers
0:28:47but we wanted to do it all along the way, step by step: two hidden layers, three hidden layers, more hidden layers
0:28:53and we kept the architecture, and the total number of parameters, the same
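holding the budget fixed while varying depth amounts to solving a small quadratic for the hidden width; an illustrative sketch of that bookkeeping (the budget and layer sizes here are stand-ins of my own, not the paper's actual configuration):

```python
import math

def width_for_budget(budget, n_in, n_out, n_hidden_layers):
    """Solve for hidden width h so that the total weight count
    n_in*h + (L-1)*h^2 + h*n_out stays within the budget
    (biases ignored for brevity; truncation keeps us under budget)."""
    a = n_hidden_layers - 1
    b = n_in + n_out
    if a == 0:
        return budget // b
    return int((-b + math.sqrt(b * b + 4 * a * budget)) / (2 * a))

def total_params(h, n_in, n_out, n_hidden_layers):
    return n_in * h + (n_hidden_layers - 1) * h * h + h * n_out

# Hypothetical setup: ~1M weights, 351 inputs (39-dim features x 9 frames), 56 outputs.
budget, n_in, n_out = 1_000_000, 351, 56
for L in (1, 2, 3, 5, 7):
    h = width_for_budget(budget, n_in, n_out, L)
    print(L, h, total_params(h, n_in, n_out, L))
```

the comparison in the talk is then between these equal-budget configurations, rather than between a small shallow net and a big deep one.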
0:28:55and it was only on one task, and a pretty small task at that, aurora two
0:29:01which allowed us to look at varying signal-to-noise ratios
0:29:04we said: if you did this on a budget, what works best?
0:29:08well, you know, and maybe more to the point, there are different kinds of additive noise: train station, babble and so forth
0:29:16and this was a mismatched case: clean training and noisy test; we didn't do the multi-style training
0:29:24and it turns out that the answer is all over the map
0:29:28and in particular, for the cases that had kind of usable signal-to-noise ratios
0:29:36and by usable i mean it gave you a few percent error on digits, as opposed to twenty or thirty or forty percent, which you just couldn't use for anything
0:29:44actually, two hidden layers was better
0:29:47and then, to deal a little bit with the question of whether maybe we had just picked the wrong number of parameters: we tried it with double the number of parameters and half the number of parameters, and we saw similar results
0:29:59so when i gave the longer version of this at interspeech, some of the comments were along the lines of: why do you think two is better? and so forth
0:30:10i just want to be clear: i'm not saying that two is better, period
0:30:13what i'm saying is that if you were thinking of something actually going into practical use, you should do some experiments where you keep the number of parameters the same
0:30:23you might then expand and so forth, but first you should do some experiments where you keep the number of parameters the same
0:30:27and then you get an idea about what's best, and it's probably going to be task dependent
0:30:35so
0:30:38so we've focused on neural networks, but we do have to be sure we ask the right questions
0:31:17one question is what we feed into the nets; you know, there's all these questions about what's the right data and how many layers we have and so forth
0:31:27some people, and i'm not naming any names, who i'd characterize as true believers, think that features aren't important
0:31:34actually, to clarify that slightly: in a discussion just after interspeech, i think it was
0:31:43i made this comment, and he said no, i think features are important; you should just learn them
0:31:52so anyway, features are important
0:31:55and this goes back to the old general computing axiom: garbage in, garbage out
0:32:01people have done some very interesting experiments with feeding waveforms in
0:32:07and i should say, back in the day hynek and i did some experiments like this, feeding waveforms in and comparing with plp, and the waveforms were way worse
0:32:16they have made some progress there; the recent systems actually are doing better
0:32:19but if you actually look in detail at what these experiments do
0:32:26in one case, for instance, they take the absolute value, they floor it, they take the logarithm, they average over a bunch of samples
0:32:31all sorts of things which actually obscure the phase, and that's kind of the point
0:32:37you can have waveforms of extraordinarily different shape that really sound pretty much the same
0:32:44there are more recent results that use maxout pooling in convolutional neural nets
0:32:49that also had, you know, a nice result
0:32:53and again, this max-style pooling also tends to obscure the phase
0:32:59but in both those cases, and the other cases i've heard of anyway
0:33:04this completely falls apart when you have mismatch, when the testing is different from the training
0:33:10so what is the role of having a front end after all? all the available data is in the waveform
0:33:17there are some assumptions there, and you might get things wrong and so forth, but let's ignore that for the moment
0:33:23in fact, front ends do consistently improve speech recognition
0:33:27and i have this great quote, which i learned from hynek, which is that the goal of front ends is to destroy information
0:33:35that is a little extreme
0:33:37he is like this sometimes, but i think it's true that some information is misleading and some information is not relevant
0:33:45and we want to focus on the discriminative information
0:33:48because the waveform that you receive is not just spoken language
0:33:52it's also noise and reverberation and channel effects and characteristics of the speaker; if you're not doing speaker recognition
0:34:00maybe you don't care so much about that
0:34:02and so the front end can help to focus on the things that you care about for your particular task
0:34:11and a good front end in principle, to carry it to an extreme, can make recognition extremely simple, at least in theory
0:34:20so what about the connection to mlps well as i alluded to earlier mlps have
0:34:26few distributional assumptions
0:34:29mlps can
0:34:30also easily integrate information over time
0:34:34multiple feature streams
0:34:36could provide useful way to incorporate more parameters
0:34:39so yes that's do give you a nice way especially with good realisation initializations and
0:34:44so forth
0:34:45can give you a way to incorporate more features more parameters sorry usefully
0:34:50but multiple streams can do this too
0:34:53by multiple streams i mean
0:34:55having different sets of mlps that look at the signal in different ways
0:35:01and you can really expand out the number of parameters in a way that
0:35:05is often quite useful
0:35:06and so i might as well throw in another acronym
0:35:09if you use this with the depth
0:35:12you can call this a dwn
0:35:13a deep wide net
0:35:17so you can combine these different streams easily, because the outputs are posteriors, and we know
0:35:22how to combine probabilities
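the combination rule alluded to here can be made concrete; below is a minimal numpy sketch of one common rule, a weighted log-domain average of the per-stream posteriors (a product-of-posteriors rule), with the streams, weights and class counts invented purely for illustration:

```python
import numpy as np

def combine_posteriors(streams, weights=None):
    """combine per-stream class posteriors with a weighted geometric mean.

    streams: list of arrays, each of shape (n_classes,), each summing to 1.
    returns a renormalized posterior over the same classes.
    """
    streams = np.asarray(streams, dtype=float)
    if weights is None:
        weights = np.ones(len(streams)) / len(streams)  # equal stream weights
    # average in the log domain, then renormalize back to a distribution
    log_combined = np.einsum('s,sc->c', np.asarray(weights, dtype=float),
                             np.log(streams + 1e-12))
    combined = np.exp(log_combined - log_combined.max())
    return combined / combined.sum()

# two hypothetical streams over three phone classes:
p1 = np.array([0.7, 0.2, 0.1])   # e.g. a spectral-envelope stream
p2 = np.array([0.5, 0.4, 0.1])   # e.g. a modulation stream
p = combine_posteriors([p1, p2])
```

in practice the stream weights would be tuned on held-out data, or tied to an estimate of each stream's reliability.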
0:35:24here's an early example of this, a very ancient one from our place,
0:35:29fifteen or thirteen years ago or something
0:35:32we called it the tonotopic mlp
0:35:34and
0:35:36and the idea is you have a bunch of different
0:35:39sets of layers that are looking at different critical bands; this is like
0:35:43the hats and traps and so forth
0:35:46the difference is that it was just trained all at once
0:35:49and in fact this worked okay
0:35:52a more recent example, and there are dozens of such examples around; i just
0:35:56picked this one because it was done by one of my students, actually,
0:36:00in china
0:36:02in which
0:36:04he had some features
0:36:07coming from
0:36:11high modulation frequencies and low modulation frequencies
0:36:15and the spca here, this is not the society for the prevention of cruelty to animals,
0:36:19this is
0:36:20sparse pca, and it's used to pick out
0:36:26he uses it to pick out particular filters, gabor filters in this case,
0:36:32that are particularly useful
0:36:34for the discrimination
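the selection step can be illustrated with a toy computation; this is a self-contained numpy stand-in for a sparse-pca routine (a simple truncated power iteration), applied to fabricated "filter outputs" in which only the first five of thirty candidate filters carry a shared high-variance component; none of the dimensions or data reflect the actual system described in the talk:

```python
import numpy as np

def sparse_principal_direction(X, k, n_iter=100):
    """leading sparse principal direction via truncated power iteration:
    at each step only the k largest-magnitude loadings survive.
    a crude, self-contained stand-in for a real sparse-pca routine."""
    C = X.T @ X                                  # (scaled) covariance
    v = np.linalg.norm(X, axis=0)                # init with column energies
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = C @ v
        mask = np.zeros_like(v)
        mask[np.argsort(np.abs(v))[-k:]] = 1.0   # keep the top-k loadings
        v = v * mask
        v /= np.linalg.norm(v)
    return v

rng = np.random.default_rng(1)
# stand-in "filter outputs": 300 frames x 30 candidate (e.g. gabor) filters;
# only filters 0..4 share a strong common component, the rest are weak noise
shared = rng.normal(size=(300, 1))
X = 0.1 * rng.normal(size=(300, 30))
X[:, :5] += shared * np.array([1.0, 1.2, 0.9, 1.1, 1.3])

v = sparse_principal_direction(X, k=5)
selected = sorted(np.flatnonzero(v))             # indices of the chosen filters
```

the point is just the mechanism: sparsity in the loadings is what turns a pca-like analysis into a filter selector.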
0:36:36and these then go into deep neural nets, six-layer deep neural nets
0:36:43and the output of one deep neural net goes into another, so it
0:36:46is really deep, but you also have some width in there
0:36:49this was used to some effect on very noisy data, for the rats program, so
0:36:55it's
0:36:56data that's been transmitted through radio channels, and what you
0:37:02get at the other side is really extremely awful
0:37:04so our current dnns are
0:37:10nearly all
0:37:11still based essentially on this mcculloch-pitts model
0:37:15there is some nice work, there's also a poster here, about more complex units
0:37:22and certainly for large vocabulary
0:37:25kinds of tasks
0:37:27for real word error rate measurements
0:37:30they're not particularly better
0:37:33which is a little disappointing
0:37:35but maybe this work has just started
0:37:38the complexity and power is not supplied by having more complex units, as far as we can tell;
0:37:43it is supplied by
0:37:45the depth, and also, as i say with multiple streams, by the width
0:37:49you also can represent signal correlations to some extent by pooling, and again by acoustic context
0:37:56and so far at least the most effective learning methods are not biologically plausible
0:38:01so given all that how can we benefit from biological models
0:38:06why would we want to benefit from biological models? because we want to have stable perception in
0:38:11noise and reverberation, which human hearing can do
0:38:15and our systems certainly can't
0:38:17the cocktail party effect, picking one voice out of many: there are some
0:38:21laboratory demonstrations of such things, but in general they don't really work
0:38:28rapid adjustment to changing conditions: i remember telling someone at one point that
0:38:33if our sponsors
0:38:37wanted us to have the best recognition
0:38:40anyone could have in this room
0:38:42we'd collect a thousand hours in this room
0:38:45then if the sponsors came back next year and said now we want it to work
0:38:49in that conference room down the hall, we'd have to collect another thousand hours
0:38:53okay, i'm exaggerating slightly; there is a set of things, adaptation,
0:38:56but it's really
0:38:58very minor compared to what people can do: we just walk into this room, or walk into
0:39:02another room, and we just hear, pretty much
0:39:05and real speaker independence: we often call our systems speaker independent, the speech recognizers,
0:39:11but when you have a voice that is particularly different, it does badly
0:39:16so can we learn from the brain?
0:39:19these are pictures from the
0:39:22same source as some of those in the first talk:
0:39:29ecog
0:39:30so this is direct cortical measurement, as was explained
0:39:35these are data you get
0:39:38from people who are in the hospital for
0:39:42certain neurosurgery because they had
0:39:45extreme cases of epilepsy which have not been
0:39:50sufficiently well treated by drugs
0:39:53and so surgery is an option, but you have to figure out
0:39:58where the focus of the seizures is
0:40:02and you also want to know where not to cut
0:40:05in terms of language
0:40:08so
0:40:10edward chang, who was mentioned earlier, a neurosurgeon,
0:40:13had a lovely paper in nature a couple of years ago where they were making these kinds of
0:40:19measurements during source separation, and in this experiment they would play two speakers speaking at
0:40:26once
0:40:27and
0:40:28by the design of the experiment they'd get the subject to focus first on one
0:40:32speaker and then on the other, and observe the changes in the signals
0:40:37so this is giving clues about source separation and noise robustness, and what's really exciting
0:40:42about this for me is that this is kind of intermediate: between eeg, which
0:40:47is something i used to work with a long time ago, where on the scalp
0:40:50you have really poor spatial
0:40:54resolution but pretty good temporal resolution
0:40:58and the
0:41:00single or
0:41:02modest number of electrodes stuck directly into cells, there is, on the surface, this intermediate region
0:41:08and it looks like we can get a lot of new kinds of information, and the technology
0:41:12on this is rapidly changing
0:41:15people working on sensors are making these things with the
0:41:20electrodes closer and closer together
0:41:23so the hope is that measurements like these, and like the things that chris presented,
0:41:28will inspire completely new processing steps
0:41:31for instance
0:41:33computational auditory scene analysis is based on psychoacoustics, and we know that there's a range
0:41:40of things that you can do to try to pick out one speaker from some other
0:41:44background; but if we actually had a better handle on what's really going on inside
0:41:48the system, we might be able to better design those things rather than just relying
0:41:53on psychoacoustics
0:41:55and this includes structures and things at the signal level and the computational level
0:42:00and
0:42:02it's
0:42:03work that's been done
0:42:07that will be talked about on thursday night, by steve bregman for instance
0:42:11and understanding what the statistical systems can learn, and what the limitations are:
0:42:17that doesn't come from the brain, it's actually analysis
0:42:22of what's going on
0:42:23but it can give you a handle on how to proceed
0:42:27we need feature stability
0:42:29under different kinds of conditions, noise, room reverberation and so on,
0:42:33and models that can handle dependent variables
0:42:37so in conclusion
0:42:41there has been
0:42:42more than fifty years of effort
0:42:44including
0:42:45some with speech recognition
0:42:47the current methods include tandem and hybrid approaches
0:42:53multiple layers and initialisation do sometimes help,
0:42:57though not always
0:42:59as for automatic speech recognition, the fundamental algorithms
0:43:03of the
0:43:05neural nets used for speech recognition are actually reasonably old as well
0:43:11the engineering efforts to make use of the computational capabilities have helped, of course
0:43:19i would argue that features still matter
0:43:21and that wide is important, not just deep
0:43:24and what might we be missing? okay:
0:43:27asr still performs badly for conditions unseen during training
0:43:31so we have to keep looking
0:43:33and that's it thank you very much
0:43:53okay, we can take questions
0:43:59okay
0:44:04i can't resist commenting on one of those things
0:44:07i liked, you know, the question of architecture, really, because
0:44:11the
0:44:15idea of using hidden units from one task and reusing them again, we
0:44:20used that in eighty-nine, in what we called time-delay neural networks at the time;
0:44:26it was extremely successful work
0:44:29but it was discarded at the time, because people said, okay, the theory says
0:44:33that with one hidden layer you can represent any convex classification function, so we don't
0:44:38need six, and the architecture doesn't need to be multilayer in that way
0:44:42so this discarded a lot of work on multi-layer, deep neural networks, as deep as you
0:44:46want, even though at the time it had already been shown to be useful
0:44:50now what is still missing today, even with the work that's going on right now, is that
0:44:54people really don't look very much at how to do automatic architecture learning; in
0:44:59other words
0:45:00you know, we learn how to improve things by creating another layer, or making one wider
0:45:05or narrower, or creating different delays, but we do all this, you know, by repeating the same experiments
0:45:10over and over again. and think about how humans learn: they do this in developmental stages; we don't,
0:45:16as newborns, sit in the corner and run back propagation for twenty years
0:45:20and then wake up and know speech; we learn to babble, then
0:45:25learn words, et cetera. where does this all come from? there must be some schedule
0:45:29by which we build architectures, in the brain, in a developmental way, and
0:45:35the more we look at low-resource settings, at
0:45:40multiple languages, et cetera, i think having some mechanism of building these architectures while learning
0:45:46is, i think, some fundamental research that is still missing, in my view, but i'd
0:45:51like to hear your comment on that
0:45:52i guess there is another question in there, but
0:45:57the only comment, i mean, sure,
0:46:02the only thing i would add, and i mean i agree with yours,
0:46:06is that one thing i didn't mention about that nineteen sixty-one approach is that the idea
0:46:11was that it actually also built itself up
0:46:16automatically
0:46:16and so in that case it was also a feature selection system as well
0:46:23and so it would look at
0:46:26the different
0:46:28a superset of possible features, and take a bunch of them and build up a unit
0:46:33based on that, and then it would consider another group of features; so it
0:46:36actually did build up
0:46:38not a completely general architecture, but it did a fair amount of automatic learning of structure
0:46:46and that was nineteen sixty-one, at cornell
0:46:53yes, right
0:46:55it would be interesting to compare
0:46:59okay other questions
0:47:02or comments
0:47:08and so
0:47:12so do you worry about this cosine function, right? we're not going
0:47:17down now, we're going up again; so do you think this cosine
0:47:22function is going to keep going, so that we won't have to
0:47:26worry for the rest of our productive lives, or is it going to
0:47:31go down again?
0:47:32i think it depends on to what extent we believe in exaggerated claims
0:47:39so if we push the hype too far, people will get the idea that speech
0:47:43recognition works really well under many circumstances; it fails miserably under others; so if people believe
0:47:50too much that we have already found the holy grail
0:47:54then after a while, when they start using it and having it fail
0:47:59then
0:48:01funding will go down and interest will go down, you know, for the whole field
0:48:05of speech recognition, but in particular for any particular method
0:48:10so i think
0:48:11the way i feel about it is, i mean, obviously i like using
0:48:16artificial neural networks, i've been doing it for a long time; i mean i started
0:48:21using them
0:48:23thirty-three years ago
0:48:25because i had a particular task
0:48:29and tried a whole bunch of methods, and it just so happened, i mean it was just luck,
0:48:33that the neural net i was using was the best
0:48:36of the different things for that particular small voiced-unvoiced speech task
0:48:41and so i like them
0:48:44but i think they're only a part of the solution
0:48:46and this is why i emphasise that what you feed them
0:48:49and, i should also say, what you do with their outputs
0:48:52are both at least as important, probably more important,
0:48:56than the stuff that we're currently mostly excited about
0:48:59and so i think that
0:49:01well, gaussian mixtures had a great run, didn't they
0:49:04you know
0:49:06and i think people will still use them; they're another tool; there are very
0:49:10nice things about gaussians
0:49:11and nice things about sigmoids, and nice things about other kinds of nonlinear units;
0:49:15people have gotten excited about rectified linear units of late
0:49:19but
0:49:22i think
0:49:23the level of excitement will probably go down somewhat, because
0:49:28you know, after a while, being excessively excited, and papers saying very similar things,
0:49:32sort of dies down; but i think if people start
0:49:35using these things in different ways, feeding them different things, making use of the outputs in
0:49:40different ways, et cetera
0:49:43interest can be sustained
0:49:51you mentioned that one of the big advantages of something like this, the posterior
0:49:55estimators, is that they can take a lot of abuse in what you feed
0:50:00them, as long as it carries the right kind of information. i also feel that
0:50:05there is a great potential in the various architectures we build
0:50:10you mentioned that you take temporal trajectories, sample them, select the outputs
0:50:16from that, and combine them in various ways; so i think that there
0:50:20is plenty of opportunity for us to be busy for a long time
0:50:25the one worry that i have, and you mentioned it yourself, is
0:50:29that if you try all kinds of things, then you just report whatever works
0:50:33best; this worry was there before as well
0:50:35and i would somehow like to encourage the community, and i'm thinking slightly aloud here:
0:50:40you know, one hope is that the architecture could actually pop out somehow automatically,
0:50:45but i don't see that happening, so i think we still need
0:50:49to build the models; i don't know if we can do it all automatically, but
0:50:55i see in works like what chris presented here, basically learning from the
0:50:59way the auditory system is working, that there can be plenty of inspiration for the architectures
0:51:05of the neural networks, because neural nets are indeed so simple, and so forgiving in
0:51:09how much abuse they can take in terms of what you feed them
0:51:13well, i mean, i agree, and maybe i didn't emphasize it
0:51:19quite as much as i feel it
0:51:22we have right now this real separation: there's the front end, and somebody
0:51:27works on the front end
0:51:28and then there are the neural nets, and then, you know, there are the hmms and there are the language
0:51:32models and so forth; these are all really quite separate
0:51:35but they really need, in the long run, to be very integrated
0:51:38and in fact
0:51:40the particular example i showed
0:51:43already was kind of mixed together: you had some of the signal processing
0:51:47stuff going later on, and some of the nets going earlier, and all of that; and
0:51:52once we start opening that up
0:51:54and if you say, you know, it's not just adding a unit or something like that, like
0:51:58the nineteen sixty-one approach
0:52:01but you say it can be anything, then i think you're really lost unless you
0:52:06have some example to work from
0:52:09so for me it's not just that; i mean i have no problem, and i
0:52:12think hynek doesn't either,
0:52:14with the idea that if we come up with a purely engineering approach that has nothing to
0:52:20do with brains and just works better, fine: we're engineers, that's okay
0:52:25the problem is that the design space is infinite
0:52:29and so how do you figure out what direction to even go in
0:52:33and so that's, i think, the appeal that
0:52:37the brain-related, biologically inspired stuff has for us:
0:52:41it's a working system
0:52:44it's something that already works, and so it really does reduce the space
0:52:48that you have to consider
0:52:50if someone else comes up with some information-theoretic approach that ends
0:52:54up being better, you know, that's
0:52:55fine, you know,
0:52:57my hat's off to them
0:52:59but this is what occurs to us
0:53:04questions
0:53:13so you mentioned that hmm-gmm systems at some point got much better
0:53:19and one of the aspects is that they could be adapted well
0:53:25so one would think about adapting neural networks in some sort of similar manner
0:53:32and is that one of the reasons why neural networks, i mean, if you have a speech
0:53:37recognition task, you want it to be adapted to the speaker, and from my limited knowledge
0:53:43i think that
0:53:45the adaptation methods are still being figured out
0:53:50but all the intuition for doing adaptation methods comes from, you know,
0:53:57the experience that we have with hmm-gmm systems, at least for
0:54:04me. so, okay, if you talk about something like speaker adaptive training
0:54:12could you think of a neural network
0:54:16sort of becoming speaker independent through speaker adaptive training
0:54:20i mean, to put it more pointedly,
0:54:25do you think that there is a direction to build a speaker independent,
0:54:30truly speaker independent dnn;
0:54:35i guess i mean speaker independent by being very speaker-dependent and adaptive, right
0:54:41actually, if you do a little literature search, there's a bunch of work on
0:54:47adapting neural nets for speech recognition from the early nineties
0:54:51and this work was largely done at cambridge, by tony robinson's group,
0:55:02and we were actually in a collaboration with
0:55:09them
0:55:10and there were four methods that i recall that we used. one was to
0:55:14have a linear
0:55:17input transformation
0:55:20so if you had, you know, thirteen plp coefficients
0:55:24you'd just have a thirteen-by-thirteen matrix coming in
0:55:28another was at the output, so maybe, you know, if you
0:55:32were doing monophones, it was like fifty by fifty or something
0:55:37a third was to have a hidden layer off to the side that you just sort
0:55:41of added, and
0:55:43trained up with the limited data that you had for the new speaker
0:55:47these were all
0:55:49supervised adaptation
0:55:52and my favourite,
0:55:56the one i proposed, was
0:55:58to heck with it, just train everything
0:56:01and so, you know,
0:56:03the original objection to that was that you might have millions of parameters, but
0:56:08my feeling was, you just move everything a little bit
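the first of those four methods can be sketched in a few lines; here is a toy numpy illustration of the linear-input-transformation idea, where a small square matrix in front of a frozen net is trained on a handful of adaptation frames. the network, data, and dimensions are invented stand-ins (thirteen inputs only to echo the plp example), not any actual system:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, C = 13, 32, 4                      # e.g. 13 plp coefficients, 4 classes

# frozen "speaker-independent" mlp (random stand-in for a trained net)
W1 = 0.5 * rng.normal(size=(D, H)); b1 = np.zeros(H)
W2 = 0.5 * rng.normal(size=(H, C)); b2 = np.zeros(C)

def forward(X, A):
    """apply the adaptation matrix A, then the frozen mlp; softmax output."""
    Hid = np.tanh((X @ A) @ W1 + b1)
    logits = Hid @ W2 + b2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True), Hid

def xent(X, y, A):
    P, _ = forward(X, A)
    return -np.log(P[np.arange(len(y)), y] + 1e-12).mean()

# tiny supervised adaptation set for the "new speaker"
X = rng.normal(size=(64, D))
y = rng.integers(0, C, size=64)

A = np.eye(D)                            # start from the identity: no change
loss_before = xent(X, y, A)
for _ in range(200):                     # a few steps of plain gradient descent
    P, Hid = forward(X, A)
    G = P.copy(); G[np.arange(len(y)), y] -= 1.0; G /= len(y)
    dA = X.T @ (((G @ W2.T) * (1.0 - Hid**2)) @ W1.T)   # only A gets a gradient
    A -= 0.05 * dA
loss_after = xent(X, y, A)
# cross-entropy on the adaptation data drops while W1, W2 stay untouched
```

the other methods amount to putting a similar trainable transform at the output, adding a small extra hidden layer, or, as in the "train everything" option, letting all the weights move a little.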
0:56:11and
0:56:13they all worked to varying degrees, i think it's fair to say, but neither
0:56:19the hmm-gmm adaptations nor those neural net adaptations really solved the problem
0:56:25they all move you a little bit. we did some experimentation as part of the
0:56:29ouch project that steven is going to
0:56:32talk about thursday
0:56:35where
0:56:35we used mllr, for instance, to try to adapt to just my recordings, given
0:56:41close-miked training
0:56:44and it helps
0:56:45but it's not like it fixes everything
0:56:51so i'd say that
0:56:53you can use any of these methods
0:56:56for both neural nets and for gaussians, and there are methods for both,
0:57:02but none of them really solve the problem
0:57:10any other questions? yes, that one there
0:57:18there are a couple back here
0:57:23wait a moment for the microphone
0:57:27thank you for the very interesting talk
0:57:31i was just curious whether anyone in this
0:57:36kind of area, where we look at things like adaptation in speech recognition, has attended
0:57:41to human speech recognition
0:57:44and the reason i ask is that, at least i am
0:57:47inspired by it, as you mentioned: if we look at the places where human
0:57:52recognition breaks down, i was on a call with a really bad connection
0:57:58and i just couldn't understand the other person in any way,
0:58:01and then look at how our systems do
0:58:06in exactly the same conditions where a human would be able to understand,
0:58:11then maybe our systems should be evaluated against where humans excel; that is
0:58:16my question
0:58:18well
0:58:20when i'm on a bad connection i often don't understand it at all either
0:58:26so i think a machine could do better
0:58:30i think in general we're pretty far from that
0:58:34there are individual examples that you could think of; i think my favourite is anything
0:58:40involving attention
0:58:41so actually my wife used to work with these
0:58:46large american express call centers
0:58:49and when we first got together i was always telling her, humans are so
0:58:54good at speech recognition and, you know, machines are so bad, and she said, well, not
0:58:58the humans i deal with
0:59:01and it turned out that the people at the call centres really are great, definitely
0:59:07much better than anything we do with a machine,
0:59:10on simple tasks like a string of numbers,
0:59:15right after they have coffee
0:59:16and they're terrible after lunch
0:59:20now they do, however, have, i mean, i didn't talk
0:59:25about recovery mechanisms, but the saving grace for people is that they can say could
0:59:30you repeat that please, and although we have some of that in our systems,
0:59:33humans are better at that
0:59:36so i think
0:59:37i think there are other tasks
0:59:40for which
0:59:42machines can clearly be much better, because people
0:59:46are not trained, or there is no evolutionary
0:59:51guidance
0:59:52towards their being better at it; so for instance
0:59:57doing speaker recognition with speakers that you don't know very well
1:00:02i think machines can be better
1:00:05i used to do some work with eeg, and eeg analysis isn't something we,
1:00:10you know, grow up with
1:00:11and so machines can do that classification much better than people, okay
1:00:16but i think for sort of straight-out typical speech recognition
1:00:20you take that noisy example
1:00:22you alluded to and give it to any of our recognizers, and you just
1:00:28saw some of the signal-to-noise ratios that were shown earlier,
1:00:32basically zero db signal-to-noise ratio
1:00:35if human beings are paying attention, listening to strings of digits, they just get
1:00:40them
1:00:42and our systems, you look at any of them, even with the best
1:00:47sort of noise-robust front ends people have in papers,
1:00:50you look at their performance at zero db signal-to-noise ratio, it's a disaster
1:00:55and that's with the best systems that we have
1:00:58so i think we're just so far from that for straight-out speech recognition
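for concreteness, zero db signal-to-noise ratio means the signal and the noise have equal power; here is a small numpy sketch, with a synthetic tone standing in for speech, of measuring an snr and scaling noise to hit a target:

```python
import numpy as np

def snr_db(signal, noise):
    """snr in decibels: 10*log10(signal power / noise power)."""
    return 10.0 * np.log10(np.mean(signal**2) / np.mean(noise**2))

def mix_at_snr(signal, noise, target_db):
    """scale the noise so that signal + scaled noise sits at target_db snr."""
    scale = np.sqrt(np.mean(signal**2) /
                    (np.mean(noise**2) * 10.0**(target_db / 10.0)))
    return signal + scale * noise, scale * noise

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0                 # one second at 16 khz
clean = np.sin(2 * np.pi * 200.0 * t)          # stand-in for a speech signal
noisy, scaled_noise = mix_at_snr(clean, rng.normal(size=t.shape), 0.0)
# snr_db(clean, scaled_noise) is now 0: equal signal and noise power
```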
1:01:02but maybe someday we'll be saying, wow, look at what this automatic system can figure out
1:01:15hi, so in computer vision, deep networks are very appealing because you can
1:01:20visualise what is being learned at the hidden layers, so you can see that
1:01:23they're explaining, say, specific parts of the face
1:01:27so in speech, do you have an intuition about what is being learned in those hidden
1:01:32layers
1:01:34well, i mean, there have been some experiments where people have looked at some of
1:01:38these things; again, i made reference to this before:
1:01:42there was the one that was just the tonotopic multilayer perceptron
1:01:47and he found that
1:01:50it was attempting to mimic what was happening with the nets that were
1:01:58trained on individual critical bands
1:02:00and he did another one where he just threw the whole spectrum in
1:02:06and what was learned at the early layers, in fact it did learn
1:02:11interesting shapes, interesting gabor-like shapes and so forth
1:02:15and there have been a number of experiments where people have looked at
1:02:20some of those early layers
1:02:22once you get pretty deep
1:02:24especially six or seven layers,
1:02:27i think it'd be pretty hard to do
1:02:29but i wouldn't say it's impossible
1:02:31i know there's been some work