0:00:17that's actually a very kind introduction; i hope it didn't promise too much. thank you brian
0:00:24actually, before i get to the talk, i should mention: i had a brief discussion with someone about the posters
0:00:33realising that to some extent the optimum strategy for a poster would be to make it seem like it's really interesting but completely impossible to understand
0:00:41so that everyone will want to come up and have it explained
0:00:47anyway
0:00:49here we are, back again
0:00:50someone else suggested that perhaps the talk should be called "deja vu all over again"
0:00:55from that famous philosopher yogi berra
0:00:58but let me start with a little story
0:01:02for those of you who don't know, arthur conan doyle wrote a series of stories about a detective whose name was sherlock holmes
0:01:12and he had a colleague, dr watson, who really didn't know so much about detection
0:01:19so holmes and watson went on a camping trip
0:01:21they shared a good meal, had a bottle of wine, and retired to their tent for the night
0:01:26at three in the morning holmes nudged watson and said
0:01:29look up at the sky and tell me what you see
0:01:32watson said: i see millions of stars
0:01:34holmes said: and what does that tell you?
0:01:37watson replies: astronomically it tells me there are billions of galaxies and potentially millions of planets
0:01:43astrologically it tells me that saturn is in leo; theologically it tells me that god is great and we are small and insignificant
0:01:50horologically it tells me that it's about three
0:01:53meteorologically it tells me we'll have a beautiful day tomorrow
0:01:56what does it tell you, holmes?
0:01:58holmes says: watson, someone has stolen our tent
0:02:03so the point is that we might be missing the big thing
0:02:08there are some great, really exciting results, and a lot of people are interested now in neural nets for a number of application areas, but in particular in speech recognition, which is of course our focus here
0:02:24but there might be a few things that we're missing in the process
0:02:28and perhaps it might be useful to look at some historical context to help us see that
0:02:35so
0:02:36as brian alluded to earlier in the day
0:02:40there has been a great deal of history of neural networks for speech, and neural networks in general, before this
0:02:48and i think of this as occurring in three waves
0:02:51the first wave was in the fifties and sixties with the development of the perceptron
0:02:57and i think of this as the basic structure, or the bs
0:03:03in the eighties and nineties we had back propagation, which had actually been developed before that but was really applied a lot then
0:03:11and multilayer perceptrons or mlps, which applied more structure to the problem, sort of an ms
0:03:18and now we have things that are piled higher and deeper
0:03:23so it's the phd level
0:03:25now, for asr, speech recognition:
0:03:29we had digits, pretty much, or other very small vocabulary tasks in the fifties and sixties
0:03:36in the eighties and nineties we actually graduated to large vocabulary continuous speech recognition
0:03:42and in this new wave there's really mature use of the technology, on much harder problems
0:03:52now, this talk isn't about the history of speech recognition, but i don't think i can really do a history of neural nets for speech recognition without doing a little bit of that
0:04:01speech recognition also had an early start
0:04:05the best known early paper was a nineteen fifty two paper from bell labs
0:04:08but before that was radio rex
0:04:11now, if you haven't seen or heard about radio rex: radio rex was a toy, a little dog in a dog house
0:04:17and you would say "rex!" and rex would pop out
0:04:21of course if you did that, rex would also probably pop out for just about anything that had enough energy at five, six, seven hundred hertz or so
0:04:29because the doghouse actually resonated with some of those low frequencies
0:04:34and when it resonated and vibrated, it would break a connection from an electromagnet, and a spring would push the dog out
0:04:40so we could think of it as speech recognition with really bad rejection
0:04:45now, the first paper that i know of, anyway, that described real speech recognition was this paper by davis, biddulph and balashek on digit recognition, from bell labs
0:05:00and it approximated energy in the first couple of formants; it was really just how much energy there was over time in different frequency regions
0:05:11that already had some kind of robust estimation; in particular it was quite insensitive to the fundamental frequency
0:05:19and it worked very well under limited circumstances, that is: pristine recording conditions, very quiet, very good signal to noise ratio, in the laboratory, and also for a single speaker; it was tuned to a single speaker
0:05:35and really tuned, because it was a big bunch of resistors and capacitors
0:05:40it also took a fair amount of space
0:05:43that was the nineteen fifty two digit recogniser
0:05:47it wasn't something that you would fit into a nineteen fifty two phone
0:05:54now, i should say that this system had a reported accuracy of ninety seven to ninety eight percent
0:06:02and since every commercial system since then has reported an accuracy of ninety seven to ninety eight percent, you might think there's been no progress
0:06:12but of course there has been; the problems have got much harder
0:06:16and anyway the history of speech recognition isn't the real point of this talk
0:06:21fundamentally, the early asr was based on some kind of templates or examples, and distances between incoming speech and those examples
0:06:30in the last thirty to forty years the systems have pretty much been based on statistical models, especially in the last twenty five
0:06:42the hidden markov model technology, however, is based on mathematics from the late sixties
0:06:49and the biggest source of gain since then (this is a slightly unfair statement that i'll justify in a moment) has been having lots of computing
0:06:58now obviously there are a lot of people, including a lot of people here, who contributed many important engineering ideas since the late sixties
0:07:08but those ideas were enabled by having lots of computing and lots of storage
0:07:14statistical models are trained with examples; this is the basic approach we all know about
0:07:20the examples are represented by some kind of choice of features
0:07:24and the estimators generate likelihoods for what was said, and then there is a model that integrates over time these sort of pointwise-in-time likelihoods that are generated
0:07:37now, artificial neural nets can be used for this too: they can generate the features that are then processed by some kind of probability estimator that isn't a neural net, or they can generate the likelihoods that are actually used in the hidden markov model
0:07:54going back to these three waves: in the first wave (and actually i guess i should say a lot of the things from the early wave carried through to the current one) the idea was the mcculloch-pitts neuron model
0:08:10and there were training algorithms, learning algorithms, that were developed around this model: perceptrons, adaline
0:08:17and other more complex things
0:08:19an example of which is what's called discriminant analysis iterative design, or daid
0:08:25now, going into these a little bit
0:08:28so the mcculloch-pitts model was basically that you had a bunch of inputs coming in from other neurons
0:08:34they were weighted in some way
0:08:36and when the weighted sum exceeded some threshold, the neuron fired
0:08:41now the perceptron algorithm was based on changing what these weights were when the firing was incorrect; in other words, for a classification problem, when it said the input was a particular class and it really wasn't
0:08:56and by the way, i'm going to have almost no equations in this presentation itself
0:09:03if you were craving equations, too bad
0:09:07so the perceptron learning algorithm adjusted these weights using the outputs, using whether the neuron fired or not
0:09:14the adaline approach was actually a linear processing approach, where the weights were adjusted using the weighted sum itself
0:09:23the initial versions of all the experiments with both of these were done with a single layer, so they were single-layer perceptrons and single-layer adalines
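as an aside, the update rule just described can be sketched in a few lines; this is a minimal illustration of the classic perceptron rule, where the names and the toy task are my own, not from the talk:

```python
import numpy as np

def perceptron_train(x, y, epochs=20, lr=1.0):
    """Single-layer perceptron: update the weights only when the unit
    fires incorrectly; no change is made on correct classifications."""
    w = np.zeros(x.shape[1] + 1)               # weights plus a bias/threshold term
    xb = np.hstack([x, np.ones((len(x), 1))])  # append constant 1 for the bias
    for _ in range(epochs):
        for xi, yi in zip(xb, y):              # yi in {0, 1}
            fired = 1 if xi @ w > 0 else 0
            w += lr * (yi - fired) * xi        # zero when the firing was correct
    return w

# Linearly separable toy problem (AND gate), where the rule converges.
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w = perceptron_train(x, y)
pred = [(1 if np.append(xi, 1) @ w > 0 else 0) for xi in x]
print(pred)  # [0, 0, 0, 1]
```

on a linearly separable problem like this the rule converges; on exclusive-or, which comes up next in the talk, a single layer never can.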
0:09:32and in the late sixties there was the famous book by minsky and papert, "perceptrons", that pointed out that such a simple network could not even solve the exclusive-or problem
0:09:45but in fact multiple layers were used as early as the early sixties, and an example of that is this daid algorithm
0:09:54so daid was not a homogeneous neural net like the kinds of nets that we mostly use today
0:10:00it had gaussians at the first layer
0:10:03it had a perceptron at the output layer
0:10:05it was somewhat similar to the later radial basis function networks, which also had some kind of radial basis function, a gaussian-like function, at the first layer
0:10:17and it had a clever weighting scheme
0:10:20when you loaded up the covariance matrices for the gaussians, you would give particular weight to the patterns that had resulted in errors
0:10:31and you used an exponential loss function of the output to do that
0:10:39this wasn't really used for speech, but it was used for a wide variety of problems at mcdonnell douglas and other governmental and commercial organisations
0:10:48a lot of people don't know about it; i happen to know about it because i reviewed it at one point
0:10:56i'm compressing this history terribly, but anyway
0:10:59so
0:11:02moving on to anns for speech recognition
0:11:06in the early sixties at stanford, bernard widrow's students did a system for digit recognition
0:11:15where they had a series of these adalines, these adaptive linear units
0:11:21and it worked quite well within speaker, much as the nineteen fifty two system had
0:11:26except that this was automatic; you didn't have to tune a bunch of resistors
0:11:32and it had terrible error rates across speakers
0:11:35but it was sort of comparable, and it was using this kind of technology
0:11:40moving into the eighties: wave two
0:11:53colleagues did some consonant classification with such systems
0:11:59i had the good fortune to be able to play around with such things for voiced unvoiced classification for a commercial task
0:12:08but competing systems started coming up by the mid to late eighties
0:12:16people at cmu (alex waibel and geoff hinton, who was there at the time, and kevin lang) did this kind of classification for stop consonants using such systems
0:12:29and there were many others; i don't have enough room on one slide to show how many there were
0:12:34but kohonen in finland, groups in germany and the uk, and many others
0:12:42built up these systems and did typically isolated word recognition
0:12:48then, by the end of the eighties, we got to real speech recognition, that is, continuous speech recognition, speaker-independent, et cetera
0:13:00i have had the good fortune to have really clever friends, and together with some of them i did some of this work
0:13:07herve bourlard came to visit icsi in eighty-eight
0:13:11and he and i started a long collaboration where we developed an approach for using feed-forward neural networks for speech recognition
0:13:20and there is a range of other people who did related things, in particular in germany and at cmu
0:13:28also there was work in recurrent networks; with the feed-forward nets you just go straight through from input to output
0:13:35not too many layers at the time
0:13:37and the recurrent nets actually fed back
0:13:41and this was really pioneering; i mean, there were a number of people who worked with recurrent networks
0:13:49but for applying them to large vocabulary continuous speech recognition, the real centre for that was cambridge
0:13:55tony robinson, and frank fallside while he was still alive
0:13:59and what both approaches had in common was that through the proper training they generated posterior probabilities of phone classes
0:14:08and then they derived state emission likelihoods for hidden markov models
0:14:13typically we found it worked better in most cases to divide by the prior probabilities of each phone class
0:14:19and get some scaled likelihoods
0:14:21and the name we attached to this, you may recall, was the hybrid hmm/mlp, or hybrid hmm/ann, system
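the posterior-to-scaled-likelihood trick just described can be written out in a short sketch (an illustration under the stated assumptions, not the original code; the function name is mine):

```python
import numpy as np

def scaled_likelihoods(posteriors, priors, floor=1e-8):
    """Hybrid HMM/MLP trick: p(x|q) is proportional to p(q|x) / p(q),
    since the frame likelihood p(x) is the same for every state at a
    given frame. Done in the log domain for numerical safety."""
    return np.log(np.maximum(posteriors, floor)) - np.log(priors)

# Toy frame with 3 phone classes.
post = np.array([0.7, 0.2, 0.1])    # MLP softmax outputs, p(q|x)
priors = np.array([0.5, 0.3, 0.2])  # class frequencies from the training data
print(scaled_likelihoods(post, priors))
```

these log scaled likelihoods can then be plugged in wherever the hmm expects log emission scores, since the dropped p(x) term does not change the best path.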
0:14:34so with mlps you would use error back propagation
0:14:40using the chain rule to spread the blame or credit back through the layers
0:14:45it was simple to use, simple to train, and gave powerful transformations
0:14:50they were also used for classification and prediction
0:14:53but in the hybrid system the idea was to use them for probability estimation
0:14:58and initially we did this for a limited number of classes, typically monophones
0:15:04the next slide has the only equation in the talk
0:15:10we did understand that having some representation of context could be beneficial
0:15:17but it was kind of hard to deal with twenty-some years ago
0:15:21and the notion of having thousands and thousands of outputs just didn't seem particularly like a good one
0:15:28to a great degree because of the limited amount of training data and computation that we had to work with
0:15:35so we came up with a factored version
0:15:39in this equation Q stands for the states, which in this case were typically monophones
0:15:45C stands for context, and X is the feature input
0:15:50and you can break it up without any assumptions, no independence assumption, into two different factorisations
0:15:58one factorisation is the probability of the state given the context and the input, times the probability of the context given the input
0:16:09the other one, on the right, is the probability of the context given the state and the input, times the monophone probability
0:16:18and the latter one means that you could take the monophone net that you had already trained and just multiply in this other one
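the equation being described is presumably the chain-rule identity (with Q the state, C the context, X the acoustic input), which holds exactly, with no independence assumptions:

```latex
P(Q, C \mid X) \;=\; P(Q \mid C, X)\, P(C \mid X) \;=\; P(C \mid Q, X)\, P(Q \mid X)
```

the right-hand form is the one the talk exploits: the second factor is just the already-trained monophone net.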
0:16:27the thing was, as with other things, a bit ahead of its time, and initially, for the first six months to a year, it didn't work at all
0:16:35and then our colleagues at sri, who were very helpful, came up with some really good smoothing methods
0:16:42which, with the limited amount of data that we were working with, were really necessary to make context work
0:16:50and then, a few years later, fritsch at cmu took this to an extreme, where you actually had a tree of mlps
0:16:58and so you could implement this factorisation over and over and get finer and finer, until down at the leaves you actually had tens of thousands or even a hundred thousand generalized triphones of some sort
0:17:14and it worked very well; it was actually quite comparable to other systems at the time
0:17:19but it was really complicated
0:17:21and most people at this point had really focused in on gaussian mixture systems, so it never really took off
0:17:29now, if you look at where all this was in around two thousand
0:17:33the gaussian mixture approaches had matured
0:17:36people really had learned how to use them
0:17:39many refinements had been developed
0:17:41sometimes think about gaussians: you have means, you have covariances; people typically use diagonal covariance matrices
0:17:49and so there's lots of simple things that you can do with them
0:17:53many of these were developed: not just mllr and sat, but a whole alphabet soup of others
0:18:01i mean all sorts of alphabet soups
0:18:05these didn't come easily, and they didn't come at all easily to the mlp world
0:18:11and since the mlp world, for large vocabulary speech recognition at this point, was really confined to a few places, while almost everybody was working with gaussian mixtures, it was kind of hard to keep up
0:18:26but we still wanted to
0:18:28and we liked the nets, because one important reason for us was that they worked really well with different front ends
0:18:35so if you came up with some really weird thing, you know, you listened to christoph talking about neurons and said let's try that thing
0:18:43you'd feed it to the mlp, and the mlp didn't mind
0:18:47we had experiences with a colleague of ours, for instance john lazzaro, who was doing these funny little chips, implemented in some threshold mos process
0:18:57computing various functions that people had found in the cochlear nuclei and so on
0:19:02and you'd feed those into htk and it would just roll over and die
0:19:08whereas we fed them into our systems and they didn't mind at all, because of the nature of the nonlinearities
0:19:16it really was very agnostic to the kind of inputs
0:19:23so the question is how to take advantage of both
0:19:25well, what happened at this time: we were working with hynek hermansky, who was at ogi, and with dan ellis, who was at icsi
0:19:34and there was this competition happening for a standard for distributed speech recognition
0:19:42the idea being that you would compute the features on the phone, and then somewhere else you would actually do the rest of the recognition
0:19:49and so the idea was to replace mfccs with something better
0:19:54but the models were required to be hmm-gmm; you couldn't change them
0:19:58we still liked the nets
0:20:00so the solution these guys came up with was to use the outputs as features, not as probabilities
0:20:06they weren't the only ones who ever used the outputs of mlps as features
0:20:12but there was a particular way of doing it, later implemented in large vocabulary as well as small vocabulary systems
0:20:20where it really worked first was with the digits
0:20:24and this was called the tandem approach
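the tandem recipe can be sketched roughly as: take the mlp's phone posteriors, take logs to undo the softmax compression, and decorrelate them pca-style so they better suit diagonal-covariance gmms. a minimal sketch under those assumptions (the function name and details are mine, not the original system):

```python
import numpy as np

def tandem_features(posteriors, n_keep=None):
    """Tandem sketch: log of MLP phone posteriors, mean-centered over
    frames, then projected onto principal components to decorrelate."""
    logp = np.log(np.maximum(posteriors, 1e-8))
    logp = logp - logp.mean(axis=0)        # center each dimension over frames
    cov = np.cov(logp, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigval)[::-1]       # strongest components first
    proj = eigvec[:, order[:n_keep]]
    return logp @ proj

# 100 frames of 10-class posteriors from a hypothetical MLP.
rng = np.random.default_rng(0)
raw = rng.random((100, 10))
post = raw / raw.sum(axis=1, keepdims=True)
feats = tandem_features(post, n_keep=8)
print(feats.shape)  # (100, 8)
```

the resulting frame vectors then go to an unmodified gmm-hmm recogniser in place of mfccs, which is the whole point of the approach.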
0:20:28now, this had a sort of socio-cultural advantage for our research
0:20:35the nice thing was, instead of having to convince everybody that hybrid systems were the way to go, we could just say: here are some cool features, you should try them out
0:20:44and we could, and did in fact, collaborate with other people's systems that way
0:20:50and i should give some credit here: there was related work being done along these lines in speaker recognition
0:20:57so there were also other variants; once you get the idea that you have some interesting use of neural nets to generate features
0:21:03you also could focus on temporal approaches, which hynek and his colleagues did with traps
0:21:08where you would have neural nets just looking at parts of the spectrum over a long time
0:21:16and so they would be kind of forced into learning something about the temporal properties
0:21:23that would help you with phonetic identification
0:21:27icsi's version of this was called hats, that is, hidden activation traps
0:21:33and in all of these there was the germ of what people do now with the layer-by-layer stuff
0:21:39because you'd train something up and then you'd feed that into another net
0:21:45in the case of hats, you'd train something up, then you'd throw away the last layer and feed the hidden activations into something else as features
0:21:52then there were a bunch of things that worked with gabor filters, where you had modulation-based inputs
0:22:00you could, again using a tandem approach, end up getting features from that
0:22:06and then a much more recent version is bottleneck features
0:22:11which are kind of tandem-like; it's not exactly the same thing, since it's not coming from posteriors, but it is using an output from the net as the features
0:22:25so: the third wave, and where things went from there
0:22:37there's nothing wrong with the original hybrid approach; i mean, it worked fine
0:22:43the gmm approach sort of won out, because when you get a lot of people moving in the same direction, a lot of things can happen
0:22:51but also, just in terms of computation, storage and so forth
0:22:57it was a lot more straightforward, i think, to make progress with modifications to the gmm based approaches
0:23:05so the fundamental issues with going further with the hybrid approach were how to employ many parameters usefully
0:23:12and how to get these emission probabilities for many phonetic categories
0:23:18and aspects of the solution were already there; as already mentioned, in a number of these approaches we were already generating mlps layer by layer
0:23:27for many phonetic categories there was some work in context dependence, but that needed to be pushed further
0:23:33learning approaches: second-order methods, regularisation and so forth
0:23:38there were many papers on these sorts of things, on variants of conjugate gradient and the like, in the eighties
0:23:47and conjugate gradient of course is much older than the eighties
0:23:51but someone had to do all this, and so
0:23:53when i offer these reflections from an earlier time, i don't want to cast aspersions; people are doing great things now
0:23:59someone actually had to put these things together and push forward
0:24:05and in that kind of discussion you have to start with geoff hinton
0:24:09geoff is kind of an excitable guy
0:24:12he was very excited by back-propagation in the eighties
0:24:14he's been excited about other things since
0:24:17and he is very good at spreading that excitement
0:24:21so he developed particular initialisation techniques
0:24:25and some of these are unsupervised techniques in particular, which he liked because they seemed biologically plausible
0:24:36and this permitted the use of many parameters in all the layers
0:24:40because when you have many layers, back propagation isn't too effective down at the early layers; the credit or blame gets watered down
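that watering down can be illustrated numerically: with sigmoid units the local derivative a(1-a) never exceeds 0.25, so a backpropagated error signal typically shrinks geometrically with depth. a small illustrative sketch, with random weights and arbitrary sizes of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

depth, width = 8, 64
weights = [rng.normal(0.0, 0.3, (width, width)) for _ in range(depth)]

# Forward pass, remembering each layer's activations.
a = rng.normal(size=width)
acts = []
for w in weights:
    a = sigmoid(w @ a)
    acts.append(a)

# Backward pass: push a unit error signal down through the layers.
grad = np.ones(width)
norms = []
for w, a in zip(reversed(weights), reversed(acts)):
    grad = w.T @ (grad * a * (1.0 - a))  # sigmoid' = a(1-a) <= 0.25
    norms.append(np.linalg.norm(grad))

# The signal reaching the earliest layer is far weaker than at the top.
print(norms[0], norms[-1])
```

this is exactly why a good starting point for the lower layers, from pretraining of some sort, was so attractive.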
0:24:52and this excitement spread to microsoft research
0:24:56and they extended what was going on before to many phonetic categories and large vocabulary speech recognition
0:25:05and lots of other very talented people, at google, ibm and elsewhere, followed
0:25:12so: initialisation, having a good starting point for the weights before you start discriminative training of some sort
0:25:22was often used for the limited data case
0:25:26it was often the case, back in the early nineties, when we were going into some situation where we had relatively little data
0:25:32that we would train on something else first and then start with those weights; maybe we wouldn't even train all the way, we'd just do an epoch or two
0:25:40and then we would go to the other language or the other task
0:25:44and we often found that to be very helpful
0:25:49so hinton developed a general unsupervised approach
0:25:53applied to multiple layers, and in general this was called deep learning
0:26:01a lot of this early stuff was, at least sometimes, called deep belief nets
0:26:06and generally these are dnns
0:26:09applied to other applications as well as speech
0:26:11and again, the idea was that it gave reasonable weights for the layers far from the targets
0:26:15because even if the back propagation training doesn't change those weights much, at least the early layers are doing something useful
0:26:24in later speech work, a lot of the things that you see in posters or papers in the last couple of years actually skip this step
0:26:31and do something else, for instance layer by layer training done discriminatively
0:26:39and many approaches use some kind of regularisation to avoid overfitting
0:26:45so the recent work, which you'll hear much more about later today
0:26:53shows significant improvements over comparable gmms
0:26:56and although there's a mixture of approaches, sometimes tandem-like or bottleneck-like, sometimes in hybrid mode, i think they're usually in hybrid mode
0:27:08and i have to say, it's great that they're called deep neural nets, but they're still multilayer perceptrons
0:27:12they're just multilayer perceptrons with, you know, a certain number of layers
0:27:17and you can say, well okay, but is it really different with seven hidden layers than it used to be? you know, maybe
0:27:24but we do have to ask: how deep do they need to be?
0:27:29many experiments show continued improvements with more layers
0:27:33and at some point there's diminishing returns, but the underlying assumption there is that there's no limit on parameters
0:27:39so we started asking the question: what if there was a limit?
0:27:42now, why would you want to limit it?
0:27:44well, because in any practical situation you are actually under some kind of limit; at least there's a cost, right?
0:27:51you could think of the number of parameters as being a proxy for the cost, for the resources in general: for the time it takes to train, the time it takes to run, the amount of storage
0:28:03and, well, there are people from google here, but
0:28:05i have to say, you know, even if you've got a million machines
0:28:09you've probably got a hundred million users, so it still matters how many parameters you use
0:28:14so at interspeech i presented something which i'm just going to present for a minute or two here
0:28:20what we called deep on a budget
0:28:23and we said: suppose we have a fixed but very large number of parameters; we wanted to make sure that nobody thought we didn't use enough parameters
0:28:33and then you compare between narrow and deep versus wide and shallow
0:28:38we often see comparisons where people try, you know, the earlier version that we often used, of one big hidden layer, versus a bunch of hidden layers
0:28:47but we wanted to do it all along the way, step by step: two hidden layers, three hidden layers, more hidden layers
0:28:53and we kept the architecture, and the total number of parameters, the same
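holding the budget fixed while varying depth amounts to solving a small quadratic for the hidden width; an illustrative sketch of that bookkeeping (the budget and layer sizes here are stand-ins of my own, not the paper's actual configuration):

```python
import math

def width_for_budget(budget, n_in, n_out, n_hidden_layers):
    """Solve for hidden width h so that the total weight count
    n_in*h + (L-1)*h^2 + h*n_out stays within the budget
    (biases ignored for brevity; truncation keeps us under budget)."""
    a = n_hidden_layers - 1
    b = n_in + n_out
    if a == 0:
        return budget // b
    return int((-b + math.sqrt(b * b + 4 * a * budget)) / (2 * a))

def total_params(h, n_in, n_out, n_hidden_layers):
    return n_in * h + (n_hidden_layers - 1) * h * h + h * n_out

# Hypothetical setup: ~1M weights, 351 inputs (39-dim features x 9 frames), 56 outputs.
budget, n_in, n_out = 1_000_000, 351, 56
for L in (1, 2, 3, 5, 7):
    h = width_for_budget(budget, n_in, n_out, L)
    print(L, h, total_params(h, n_in, n_out, L))
```

the comparison in the talk is then between these equal-budget configurations, rather than between a small shallow net and a big deep one.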
0:28:55and it was only on one task, and a pretty small task at that, aurora two
0:29:01which allowed us to look at varying signal-to-noise ratios
0:29:04we said: if you did this on a budget, what works best?
0:29:08well, you know, and maybe more to the point, there are different kinds of additive noise: train station, babble and so forth
0:29:16and this was a mismatched case: clean training and noisy test; we didn't do the multi-style training
0:29:24and it turns out that the answer is all over the map
0:29:28and in particular, for the cases that had kind of usable signal-to-noise ratios
0:29:36and by usable i mean it gave you a few percent error on digits, as opposed to twenty or thirty or forty percent, which you just couldn't use for anything
0:29:44actually, two hidden layers was better
0:29:47and then, to deal a little bit with the question of whether maybe we had just picked the wrong number of parameters: we tried it with double the number of parameters and half the number of parameters, and we saw similar results
0:29:59so when i gave the longer version of this at interspeech, some of the comments were along the lines of: why do you think two is better? and so forth
0:30:10i just want to be clear: i'm not saying that two is better, period
0:30:13what i'm saying is that if you were thinking of something actually going into practical use, you should do some experiments where you keep the number of parameters the same
0:30:23you might then expand and so forth, but first you should do some experiments where you keep the number of parameters the same
0:30:27and then you get an idea about what's best, and it's probably going to be task dependent
0:30:35so
0:30:38so we've focused on neural networks, but we do have to be sure we ask the right questions
0:31:17one question is what we feed into the nets; you know, there's all these questions about what's the right data and how many layers we have and so forth
0:31:27some people, and i'm not naming any names, who i'd characterize as true believers, think that features aren't important
0:31:34actually, to clarify that slightly: in a discussion just after interspeech, i think it was
0:31:43i made this comment, and he said no, i think features are important; you should just learn them
0:31:52so anyway, features are important
0:31:55and this goes back to the old general computing axiom: garbage in, garbage out
0:32:01people have done some very interesting experiments with feeding waveforms in
0:32:07and i should say, back in the day hynek and i did some experiments like this, feeding waveforms in and comparing with plp, and the waveforms were way worse
0:32:16they have made some progress there; the recent systems actually are doing better
0:32:19but if you actually look in detail at what these experiments do
0:32:26in one case, for instance, they take the absolute value, they floor it, they take the logarithm, they average over a bunch of samples
0:32:31all sorts of things which actually obscure the phase, and that's kind of the point
0:32:37you can have waveforms of extraordinarily different shape that really sound pretty much the same
0:32:44there are more recent results that use maxout pooling in convolutional neural nets
0:32:49that also had, you know, a nice result
0:32:53and again, this max-style pooling also tends to obscure the phase
0:32:59but in both those cases, and the other cases i've heard of anyway
0:33:04this completely falls apart when you have mismatch, when the testing is different from the training
0:33:10so what is the role of having a front end after all? all the available data is in the waveform
0:33:17there are some assumptions there, and you might get things wrong and so forth, but let's ignore that for the moment
0:33:23in fact, front ends do consistently improve speech recognition
0:33:27and i have this great quote, which i learned from hynek, which is that the goal of front ends is to destroy information
0:33:35that is a little extreme
0:33:37he is like this sometimes, but i think it's true that some information is misleading and some information is not relevant
0:33:45and we want to focus on the discriminative information
0:33:48because the waveform that you receive is not just spoken language
0:33:52it's also noise and reverberation and channel effects and characteristics of the speaker; if you're not doing speaker recognition
0:34:00maybe you don't care so much about that
0:34:02and so the front end can help to focus on the things that you care about for your particular task
0:34:11and a good front end in principle, to carry it to an extreme, can make recognition extremely simple, at least in theory
0:34:20so what about the connection to mlps well as i alluded to earlier mlps have
0:34:26few distributional assumptions
0:34:29mlps can
0:34:30also easily integrate information over time
0:34:34multiple feature streams
0:34:36could provide useful way to incorporate more parameters
0:34:39so yes that's do give you a nice way especially with good realisation initializations and
0:34:44so forth
0:34:45can give you a way to incorporate more features more parameters sorry usefully
0:34:50but multiple streams can do this too
0:34:53by multiple streams i mean
0:34:55having different sets of mlps that look at the signal in different ways
0:35:01and you can really expand out the number of parameters in a way that
0:35:05is often quite useful
0:35:06and so i might as well throw in another acronym
0:35:09if you use this with the depth
0:35:12you can call this a dwn
0:35:13a deep wide net
0:35:17so you can combine these different streams easily, because the outputs are posteriors, and we know
0:35:22how to combine probabilities
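the combination rule alluded to here can be made concrete; below is a minimal numpy sketch of one common rule, a weighted log-domain average of the per-stream posteriors (a product-of-posteriors rule), with the streams, weights and class counts invented purely for illustration:

```python
import numpy as np

def combine_posteriors(streams, weights=None):
    """combine per-stream class posteriors with a weighted geometric mean.

    streams: list of arrays, each of shape (n_classes,), each summing to 1.
    returns a renormalized posterior over the same classes.
    """
    streams = np.asarray(streams, dtype=float)
    if weights is None:
        weights = np.ones(len(streams)) / len(streams)  # equal stream weights
    # average in the log domain, then renormalize back to a distribution
    log_combined = np.einsum('s,sc->c', np.asarray(weights, dtype=float),
                             np.log(streams + 1e-12))
    combined = np.exp(log_combined - log_combined.max())
    return combined / combined.sum()

# two hypothetical streams over three phone classes:
p1 = np.array([0.7, 0.2, 0.1])   # e.g. a spectral-envelope stream
p2 = np.array([0.5, 0.4, 0.1])   # e.g. a modulation stream
p = combine_posteriors([p1, p2])
```

in practice the stream weights would be tuned on held-out data, or tied to an estimate of each stream's reliability.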
0:35:24here's an early example of this, a very ancient one from our place,
0:35:29fifteen or thirteen years ago or something
0:35:32we called it the tonotopic mlp
0:35:34and
0:35:36and the idea is you have a bunch of different
0:35:39sets of layers that are looking at different critical bands; this is like
0:35:43the hats and traps and so forth
0:35:46the difference is that it was just trained all at once
0:35:49and in fact this worked okay
0:35:52a more recent example, and there are dozens of such examples around; i just
0:35:56picked this one because it was done by one of my students, actually,
0:36:00in china
0:36:02in which
0:36:04he had some features
0:36:07coming from
0:36:11high modulation frequencies and low modulation frequencies
0:36:15and the spca here, this is not the society for the prevention of cruelty to animals,
0:36:19this is
0:36:20sparse pca, and it's used to pick out
0:36:26he uses it to pick out particular filters, gabor filters in this case,
0:36:32that are particularly useful
0:36:34for the discrimination
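the selection step can be illustrated with a toy computation; this is a self-contained numpy stand-in for a sparse-pca routine (a simple truncated power iteration), applied to fabricated "filter outputs" in which only the first five of thirty candidate filters carry a shared high-variance component; none of the dimensions or data reflect the actual system described in the talk:

```python
import numpy as np

def sparse_principal_direction(X, k, n_iter=100):
    """leading sparse principal direction via truncated power iteration:
    at each step only the k largest-magnitude loadings survive.
    a crude, self-contained stand-in for a real sparse-pca routine."""
    C = X.T @ X                                  # (scaled) covariance
    v = np.linalg.norm(X, axis=0)                # init with column energies
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = C @ v
        mask = np.zeros_like(v)
        mask[np.argsort(np.abs(v))[-k:]] = 1.0   # keep the top-k loadings
        v = v * mask
        v /= np.linalg.norm(v)
    return v

rng = np.random.default_rng(1)
# stand-in "filter outputs": 300 frames x 30 candidate (e.g. gabor) filters;
# only filters 0..4 share a strong common component, the rest are weak noise
shared = rng.normal(size=(300, 1))
X = 0.1 * rng.normal(size=(300, 30))
X[:, :5] += shared * np.array([1.0, 1.2, 0.9, 1.1, 1.3])

v = sparse_principal_direction(X, k=5)
selected = sorted(np.flatnonzero(v))             # indices of the chosen filters
```

the point is just the mechanism: sparsity in the loadings is what turns a pca-like analysis into a filter selector.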
0:36:36and these then go into deep neural nets, six-layer deep neural nets
0:36:43and the output of one deep neural net goes into another, so it
0:36:46is really deep, but you also have some width in there
0:36:49this was used to some effect on very noisy data, for the rats program, so
0:36:55it's
0:36:56data that's been transmitted through radio channels, and what you
0:37:02get at the other side is really extremely awful
0:37:04so our current dnns are
0:37:10nearly all
0:37:11still based essentially on this mcculloch-pitts model
0:37:15there is some nice work, there's also a poster here, about more complex units
0:37:22and certainly for large vocabulary
0:37:25kinds of tasks
0:37:27for real word error rate measurements
0:37:30they're not particularly better
0:37:33which is a little disappointing
0:37:35but maybe this work has just started
0:37:38the complexity and power is not supplied by having more complex units, as far as we can tell;
0:37:43it is supplied by
0:37:45the depth, and also, as i say with multiple streams, by the width
0:37:49you also can represent signal correlations to some extent by pooling, and again by acoustic context
0:37:56and so far at least the most effective learning methods are not biologically plausible
0:38:01so given all that how can we benefit from biological models
0:38:06why would we want to benefit from biological models? because we want to have stable perception in
0:38:11noise and reverberation, which human hearing can do
0:38:15and our systems certainly can't
0:38:17the cocktail party effect, picking one voice out of many: there are some
0:38:21laboratory demonstrations of such things, but in general they don't really work
0:38:28rapid adjustment to changing conditions: i remember telling someone at one point that
0:38:33if our sponsors
0:38:37wanted us to have the best recognition
0:38:40anyone could have in this room
0:38:42we'd collect a thousand hours in this room
0:38:45then if the sponsors came back next year and said now we want it to work
0:38:49in that conference room down the hall, we'd have to collect another thousand hours
0:38:53okay, i'm exaggerating slightly; there is a set of things, adaptation,
0:38:56but it's really
0:38:58very minor compared to what people can do: we just walk into this room, or walk into
0:39:02another room, and we just hear, pretty much
0:39:05and real speaker independence: we often call our systems speaker independent, the speech recognizers,
0:39:11but when you have a voice that is particularly different, it does badly
0:39:16so can we learn from the brain?
0:39:19these are pictures from the
0:39:22same source as some of those in the first talk:
0:39:29ecog
0:39:30so this is direct cortical measurement, as was explained
0:39:35these are data you get
0:39:38from people who are in the hospital for
0:39:42certain neurosurgery because they had
0:39:45extreme cases of epilepsy which have not been
0:39:50sufficiently well treated by drugs
0:39:53and so surgery is an option, but you have to figure out
0:39:58where the focus of the seizures is
0:40:02and you also want to know where not to cut
0:40:05in terms of language
0:40:08so
0:40:10edward chang, who was mentioned earlier, a neurosurgeon,
0:40:13had a lovely paper in nature a couple of years ago where they were making these kinds of
0:40:19measurements during source separation, and in this experiment they would play two speakers speaking at
0:40:26once
0:40:27and
0:40:28by the design of the experiment they'd get the subject to focus first on one
0:40:32speaker and then on the other, and observe the changes in the signals
0:40:37so this is giving clues about source separation and noise robustness, and what's really exciting
0:40:42about this for me is that this is kind of intermediate: between eeg, which
0:40:47is something i used to work with a long time ago, where on the scalp
0:40:50you have really poor spatial
0:40:54resolution but pretty good temporal resolution
0:40:58and the
0:41:00single or
0:41:02modest number of electrodes stuck directly into cells, there is, on the surface, this intermediate region
0:41:08and it looks like we can get a lot of new kinds of information, and the technology
0:41:12on this is rapidly changing
0:41:15people working on sensors are making these things with the
0:41:20electrodes closer and closer together
0:41:23so the hope is that measurements like these, and like the things that chris presented,
0:41:28will inspire completely new processing steps
0:41:31for instance
0:41:33computational auditory scene analysis is based on psychoacoustics, and we know that there's a range
0:41:40of things that you can do to try to pick out one speaker from some other
0:41:44background; but if we actually had a better handle on what's really going on inside
0:41:48the system, we might be able to better design those things rather than just relying
0:41:53on psychoacoustics
0:41:55and this includes structures and things at the signal level and the computational level
0:42:00and
0:42:02it's
0:42:03work that's been done
0:42:07that will be talked about on thursday night, by steve bregman for instance
0:42:11and understanding what the statistical systems can learn, and what the limitations are:
0:42:17that doesn't come from the brain, it's actually analysis
0:42:22of what's going on
0:42:23but it can give you a handle on how to proceed
0:42:27we need feature stability
0:42:29under different kinds of conditions, noise, room reverberation and so on,
0:42:33and models that can handle dependent variables
0:42:37so in conclusion
0:42:41there has been
0:42:42more than fifty years of effort
0:42:44including
0:42:45some with speech recognition
0:42:47the current methods include tandem and hybrid approaches
0:42:53multiple layers and initialisation do sometimes help,
0:42:57though not always
0:42:59as for automatic speech recognition, the fundamental algorithms
0:43:03of the
0:43:05neural nets used for speech recognition are actually reasonably old as well
0:43:11the engineering efforts to make use of the computational capabilities have helped, of course
0:43:19i would argue that features still matter
0:43:21and that wide is important, not just deep
0:43:24and what might we be missing? okay:
0:43:27asr still performs badly for conditions unseen during training
0:43:31so we have to keep looking
0:43:33and that's it thank you very much
0:43:53okay, we can take questions
0:43:59okay
0:44:04i can't resist commenting on one of those things
0:44:07i liked, you know, the question of architecture, really, because
0:44:11the
0:44:15idea of using hidden units from one task and reusing them again, we
0:44:20used that in eighty-nine, in what we called time-delay neural networks at the time;
0:44:26it was extremely successful work
0:44:29but it was discarded at the time, because people said, okay, the theory says
0:44:33that with one hidden layer you can represent any convex classification function, so we don't
0:44:38need six, and the architecture doesn't need to be multilayer in that way
0:44:42so this discarded a lot of work on multi-layer, deep neural networks, as deep as you
0:44:46want, even though at the time it had already been shown to be useful
0:44:50now what is still missing today, even with the work that's going on right now, is that
0:44:54people really don't look very much at how to do automatic architecture learning; in
0:44:59other words
0:45:00you know, we learn how to improve things by creating another layer, or making one wider
0:45:05or narrower, or creating different delays, but we do all this, you know, by repeating the same experiments
0:45:10over and over again. and think about how humans learn: they do this in developmental stages; we don't,
0:45:16as newborns, sit in the corner and run back propagation for twenty years
0:45:20and then wake up and know speech; we learn to babble, then
0:45:25learn words, et cetera. where does this all come from? there must be some schedule
0:45:29by which we build architectures, in the brain, in a developmental way, and
0:45:35the more we look at low-resource settings, at
0:45:40multiple languages, et cetera, i think having some mechanism of building these architectures while learning
0:45:46is, i think, some fundamental research that is still missing, in my view, but i'd
0:45:51like to hear your comment on that
0:45:52i guess there is another question in there, but
0:45:57the only comment, i mean, sure,
0:46:02the only thing i would add, and i mean i agree with yours,
0:46:06is that one thing i didn't mention about that nineteen sixty-one approach is that the idea
0:46:11was that it actually also built itself up
0:46:16automatically
0:46:16and so in that case it was also a feature selection system as well
0:46:23and so it would look at
0:46:26the different
0:46:28a superset of possible features, and take a bunch of them and build up a unit
0:46:33based on that, and then it would consider another group of features; so it
0:46:36actually did build up
0:46:38not a completely general architecture, but it did a fair amount of automatic learning of structure
0:46:46and that was nineteen sixty-one, at cornell
0:46:53yes, right
0:46:55it would be interesting to compare
0:46:59okay other questions
0:47:02or comments
0:47:08and so
0:47:12so do you worry about this cosine function, right? we're not going
0:47:17down now, we're going up again; so do you think this cosine
0:47:22function is going to keep going, so that we won't have to
0:47:26worry for the rest of our productive lives, or is it going to
0:47:31go down again?
0:47:32i think it depends on to what extent we believe in exaggerated claims
0:47:39so if we push the hype too far, people will get the idea that speech
0:47:43recognition works really well under many circumstances; it fails miserably under others; so if people believe
0:47:50too much that we have already found the holy grail
0:47:54then after a while, when they start using it and having it fail
0:47:59then
0:48:01funding will go down and interest will go down, you know, for the whole field
0:48:05of speech recognition, but in particular for any particular method
0:48:10so i think
0:48:11the way i feel about it is, i mean, obviously i like using
0:48:16artificial neural networks, i've been doing it for a long time; i mean i started
0:48:21using them
0:48:23thirty-three years ago
0:48:25because i had a particular task
0:48:29and tried a whole bunch of methods, and it just so happened, i mean it was just luck,
0:48:33that the neural net i was using was the best
0:48:36of the different things for that particular small voiced-unvoiced speech task
0:48:41and so i like them
0:48:44but i think they're only a part of the solution
0:48:46and this is why i emphasise that what you feed them
0:48:49and, i should also say, what you do with their outputs
0:48:52are both at least as important, probably more important,
0:48:56than the stuff that we're currently mostly excited about
0:48:59and so i think that
0:49:01well, gaussian mixtures had a great run, didn't they
0:49:04you know
0:49:06and i think people will still use them; they're another tool; there are very
0:49:10nice things about gaussians
0:49:11and nice things about sigmoids, and nice things about other kinds of nonlinear units;
0:49:15people have gotten excited about rectified linear units of late
0:49:19but
0:49:22i think
0:49:23the level of excitement will probably go down somewhat, because
0:49:28you know, after a while, being excessively excited, and papers saying very similar things,
0:49:32sort of dies down; but i think if people start
0:49:35using these things in different ways, feeding them different things, making use of the outputs in
0:49:40different ways, et cetera
0:49:43interest can be sustained
0:49:51you mentioned that one of the big advantages of something like this, the posterior
0:49:55estimators, is that they can take a lot of abuse in what you feed
0:50:00them, as long as it carries the right kind of information. i also feel that
0:50:05there is a great potential in the various architectures we build
0:50:10you mentioned that you take temporal trajectories, sample them, select the outputs
0:50:16from that, and combine them in various ways; so i think that there
0:50:20is plenty of opportunity for us to be busy for a long time
0:50:25the one worry that i have, and you mentioned it yourself, is
0:50:29that if you try all kinds of things, then you just report whatever works
0:50:33best; this worry was there before as well
0:50:35and i would somehow like to encourage the community, and i'm thinking slightly aloud here:
0:50:40you know, one hope is that the architecture could actually pop out somehow automatically,
0:50:45but i don't see that happening, so i think we still need
0:50:49to build the models; i don't know if we can do it all automatically, but
0:50:55i see in works like what chris presented here, basically learning from the
0:50:59way the auditory system is working, that there can be plenty of inspiration for the architectures
0:51:05of the neural networks, because neural nets are indeed so simple, and so forgiving in
0:51:09how much abuse they can take in terms of what you feed them
0:51:13well, i mean, i agree, and maybe i didn't emphasize it
0:51:19quite as much as i feel it
0:51:22we have right now this real separation: there's the front end, and somebody
0:51:27works on the front end
0:51:28and then there are the neural nets, and then, you know, there are the hmms and there are the language
0:51:32models and so forth; these are all really quite separate
0:51:35but they really need, in the long run, to be very integrated
0:51:38and in fact
0:51:40the particular example i showed
0:51:43already was kind of mixed together: you had some of the signal processing
0:51:47stuff going later on, and some of the nets going earlier, and all of that; and
0:51:52once we start opening that up
0:51:54and if you say, you know, it's not just adding a unit or something like that, like
0:51:58the nineteen sixty-one approach
0:52:01but you say it can be anything, then i think you're really lost unless you
0:52:06have some example to work from
0:52:09so for me it's not just that; i mean i have no problem, and i
0:52:12think hynek doesn't either,
0:52:14with the idea that if we come up with a purely engineering approach that has nothing to
0:52:20do with brains and just works better, fine: we're engineers, that's okay
0:52:25the problem is that the design space is infinite
0:52:29and so how do you figure out what direction to even go in
0:52:33and so that's, i think, the appeal that
0:52:37the brain-related, biologically inspired stuff has for us:
0:52:41it's a working system
0:52:44it's something that already works, and so it really does reduce the space
0:52:48that you have to consider
0:52:50if someone else comes up with some information-theoretic approach that ends
0:52:54up being better, you know, that's
0:52:55fine, you know,
0:52:57my hat's off to them
0:52:59but this is what occurs to us
0:53:04questions
0:53:13so you mentioned that hmm-gmm systems at some point got much better
0:53:19and one of the aspects is that they could be adapted well
0:53:25so one would think about adapting neural networks in some sort of similar manner
0:53:32and is that one of the reasons why neural networks, i mean, if you have a speech
0:53:37recognition task, you want it to be adapted to the speaker, and from my limited knowledge
0:53:43i think that
0:53:45the adaptation methods are still being figured out
0:53:50but all the intuition for doing adaptation methods comes from, you know,
0:53:57the experience that we have with hmm-gmm systems, at least for
0:54:04me. so, okay, if you talk about something like speaker adaptive training
0:54:12could you think of a neural network
0:54:16sort of becoming speaker independent through speaker adaptive training
0:54:20i mean, to put it more pointedly,
0:54:25do you think that there is a direction to build a speaker independent,
0:54:30truly speaker independent dnn;
0:54:35i guess i mean speaker independent by being very speaker-dependent and adaptive, right
0:54:41actually, if you do a little literature search, there's a bunch of work on
0:54:47adapting neural nets for speech recognition from the early nineties
0:54:51and this work was largely done at cambridge, by tony robinson's group,
0:55:02and we were actually in a collaboration with
0:55:09them
0:55:10and there were four methods that i recall that we used. one was to
0:55:14have a linear
0:55:17input transformation
0:55:20so if you had, you know, thirteen plp coefficients
0:55:24you'd just have a thirteen-by-thirteen matrix coming in
0:55:28another was at the output, so maybe, you know, if you
0:55:32were doing monophones, it was like fifty by fifty or something
0:55:37a third was to have a hidden layer off to the side that you just sort
0:55:41of added, and
0:55:43trained up with the limited data that you had for the new speaker
0:55:47these were all
0:55:49supervised adaptation
0:55:52and my favourite,
0:55:56the one i proposed, was
0:55:58to heck with it, just train everything
0:56:01and so, you know,
0:56:03the original objection to that was that you might have millions of parameters, but
0:56:08my feeling was, you just move everything a little bit
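the first of those four methods can be sketched in a few lines; here is a toy numpy illustration of the linear-input-transformation idea, where a small square matrix in front of a frozen net is trained on a handful of adaptation frames. the network, data, and dimensions are invented stand-ins (thirteen inputs only to echo the plp example), not any actual system:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, C = 13, 32, 4                      # e.g. 13 plp coefficients, 4 classes

# frozen "speaker-independent" mlp (random stand-in for a trained net)
W1 = 0.5 * rng.normal(size=(D, H)); b1 = np.zeros(H)
W2 = 0.5 * rng.normal(size=(H, C)); b2 = np.zeros(C)

def forward(X, A):
    """apply the adaptation matrix A, then the frozen mlp; softmax output."""
    Hid = np.tanh((X @ A) @ W1 + b1)
    logits = Hid @ W2 + b2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True), Hid

def xent(X, y, A):
    P, _ = forward(X, A)
    return -np.log(P[np.arange(len(y)), y] + 1e-12).mean()

# tiny supervised adaptation set for the "new speaker"
X = rng.normal(size=(64, D))
y = rng.integers(0, C, size=64)

A = np.eye(D)                            # start from the identity: no change
loss_before = xent(X, y, A)
for _ in range(200):                     # a few steps of plain gradient descent
    P, Hid = forward(X, A)
    G = P.copy(); G[np.arange(len(y)), y] -= 1.0; G /= len(y)
    dA = X.T @ (((G @ W2.T) * (1.0 - Hid**2)) @ W1.T)   # only A gets a gradient
    A -= 0.05 * dA
loss_after = xent(X, y, A)
# cross-entropy on the adaptation data drops while W1, W2 stay untouched
```

the other methods amount to putting a similar trainable transform at the output, adding a small extra hidden layer, or, as in the "train everything" option, letting all the weights move a little.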
0:56:11and
0:56:13they all worked to varying degrees, i think it's fair to say, but neither
0:56:19the hmm-gmm adaptations nor those neural net adaptations really solved the problem
0:56:25they all move you a little bit. we did some experimentation as part of the
0:56:29ouch project that steven is going to
0:56:32talk about thursday
0:56:35where
0:56:35we used mllr, for instance, to try to adapt to just my recordings, given
0:56:41close-miked training
0:56:44and it helps
0:56:45but it's not like it fixes everything
0:56:51so i'd say that
0:56:53you can use any of these methods
0:56:56for both neural nets and for gaussians, and there are methods for both,
0:57:02but none of them really solve the problem
0:57:10any other questions? yes, that one there
0:57:18there are a couple back here
0:57:23wait a moment for the microphone
0:57:27thank you for the very interesting talk
0:57:31i was just curious whether anyone in this
0:57:36kind of area, where we look at things like adaptation in speech recognition, has attended
0:57:41to human speech recognition
0:57:44and the reason i ask is that, at least i am
0:57:47inspired by it, as you mentioned: if we look at the places where human
0:57:52recognition breaks down, i was on a call with a really bad connection
0:57:58and i just couldn't understand the other person in any way,
0:58:01and then look at how our systems do
0:58:06in exactly the same conditions where a human would be able to understand,
0:58:11then maybe our systems should be evaluated against where humans excel; that is
0:58:16my question
0:58:18well
0:58:20when i'm on a bad connection i often don't understand it at all either
0:58:26so i think a machine could do better
0:58:30i think in general we're pretty far from that
0:58:34there are individual examples that you could think of; i think my favourite is anything
0:58:40involving attention
0:58:41so actually my wife used to work with these
0:58:46large american express call centers
0:58:49and when we first got together i was always telling her, humans are so
0:58:54good at speech recognition and, you know, machines are so bad, and she said, well, not
0:58:58the humans i deal with
0:59:01and it turned out that the people at the call centres really are great, definitely
0:59:07much better than anything we do with a machine,
0:59:10on simple tasks like a string of numbers,
0:59:15right after they have coffee
0:59:16and they're terrible after lunch
0:59:20now they do, however, have, i mean, i didn't talk
0:59:25about recovery mechanisms, but the saving grace for people is that they can say could
0:59:30you repeat that please, and although we have some of that in our systems,
0:59:33humans are better at that
0:59:36so i think
0:59:37i think there are other tasks
0:59:40for which
0:59:42machines can clearly be much better, because people
0:59:46are not trained, or there is no evolutionary
0:59:51guidance
0:59:52towards their being better at it; so for instance
0:59:57doing speaker recognition with speakers that you don't know very well
1:00:02i think machines can be better
1:00:05i used to do some work with eeg, and eeg analysis isn't something we,
1:00:10you know, grow up with
1:00:11and so machines can do that classification much better than people, okay
1:00:16but i think for sort of straight-out typical speech recognition
1:00:20you take that noisy example
1:00:22you alluded to and give it to any of our recognizers, and you just
1:00:28saw some of the signal-to-noise ratios that were shown earlier,
1:00:32basically zero db signal-to-noise ratio
1:00:35if human beings are paying attention, listening to strings of digits, they just get
1:00:40them
1:00:42and our systems, you look at any of them, even with the best
1:00:47sort of noise-robust front ends people have in papers,
1:00:50you look at their performance at zero db signal-to-noise ratio, it's a disaster
1:00:55and that's with the best systems that we have
1:00:58so i think we're just so far from that for straight-out speech recognition
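for concreteness, zero db signal-to-noise ratio means the signal and the noise have equal power; here is a small numpy sketch, with a synthetic tone standing in for speech, of measuring an snr and scaling noise to hit a target:

```python
import numpy as np

def snr_db(signal, noise):
    """snr in decibels: 10*log10(signal power / noise power)."""
    return 10.0 * np.log10(np.mean(signal**2) / np.mean(noise**2))

def mix_at_snr(signal, noise, target_db):
    """scale the noise so that signal + scaled noise sits at target_db snr."""
    scale = np.sqrt(np.mean(signal**2) /
                    (np.mean(noise**2) * 10.0**(target_db / 10.0)))
    return signal + scale * noise, scale * noise

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0                 # one second at 16 khz
clean = np.sin(2 * np.pi * 200.0 * t)          # stand-in for a speech signal
noisy, scaled_noise = mix_at_snr(clean, rng.normal(size=t.shape), 0.0)
# snr_db(clean, scaled_noise) is now 0: equal signal and noise power
```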
1:01:02but maybe someday we'll be saying, wow, look at what this automatic system can figure out
1:01:15hi, so in computer vision, deep networks are very appealing because you can
1:01:20visualise what is being learned at the hidden layers, so you can see that
1:01:23they're explaining, say, specific parts of the face
1:01:27so in speech, do you have an intuition about what is being learned in those hidden
1:01:32layers
1:01:34well, i mean, there have been some experiments where people have looked at some of
1:01:38these things; again, i made reference to this before:
1:01:42there was the one that was just the tonotopic multilayer perceptron
1:01:47and he found that
1:01:50it was attempting to mimic what was happening with the nets that were
1:01:58trained on individual critical bands
1:02:00and he did another one where he just threw the whole spectrum in
1:02:06and what was learned at the early layers, in fact it did learn
1:02:11interesting shapes, interesting gabor-like shapes and so forth
1:02:15and there have been a number of experiments where people have looked at
1:02:20some of those early layers
1:02:22once you get pretty deep
1:02:24especially six or seven layers,
1:02:27i think it'd be pretty hard to do
1:02:29but i wouldn't say it's impossible
1:02:31i know there's been some work