0:00:15 uh, well, welcome. i guess i would say good morning, everyone.
0:00:20 first, a couple of practical matters. we have a change of room: you know that club B was really small, and we were afraid that people would not fit in.
0:00:32 so we moved everything from club B, and the expert sessions from club E, to the north hall. that's the hall on the second floor, and we should have more space there.
0:00:49 uh, actually, it should be closer than the old room; signs will be there.
0:00:55 then, about the internet: really sorry for the trouble yesterday. it was caused by the provider; there were range problems at the hotspot, so it should be available again.
0:01:13 but please note we have just five hundred twelve addresses available, and there is no way to get more.
0:01:20 so please disconnect when you do not need to be connected, especially devices that stay registered the whole time.
0:01:33 then, for the banquet: you need a ticket. i'm sorry for that, but if you don't have one you will not be allowed to get on the bus.
0:01:45 there is a very limited number of tickets available, uh, at the registration desk, and the buses depart right at the entrance.
0:01:59 transportation back from the banquet is not provided, so you can walk back or continue your evening there.
0:02:12 and, uh, i'm pretty much done, so there will be a short introduction of the keynote.
0:02:52 yeah, true.
0:03:08 and, uh, now
0:03:09 it's time for the second keynote
0:03:14 it's going to be given by
0:03:16 nelson morgan
0:03:17 from icsi berkeley
0:03:19 and, uh, a colleague,
0:03:21 pardon the pronunciation of the name,
0:03:23 will introduce the speaker and chair the discussion
0:03:31 thank you very much for coming
0:03:36 it is my great pleasure to introduce our speaker
0:03:52 for those of you who have been working in speech for a very long time
0:03:58 he is behind a number of techniques
0:04:02 and, uh, i see a number of you in the audience
0:04:09 for those people
0:04:12 he needs no more than that much of an introduction
0:04:15 for those of you who don't know him
0:04:19 he also wrote,
0:04:20 together with a co-author, one of the well-known books on
0:04:22 speech and audio signal processing
0:04:26 and there is now a new edition out
0:04:33 what else can i say? well, i think that whatever he will tell you
0:04:37 here will all be better than
0:04:40 listening to me
0:04:54 thank you.
0:04:55well i thought it was time for a little bit of a reality check
0:04:59and uh speech recognition
0:05:02it's been around for a long time as i think everybody here knows
0:05:06very long research history
0:05:08 uh, lots of publications for decades, many projects,
0:05:12 and many sponsored projects
0:05:14systems have continually gotten better
0:05:17 they've actually tended to converge, so that there is,
0:05:20 in some sense, a standard
0:05:22 automatic speech recognition system now
0:05:24 uh, it's made it into a lot of commercial products
0:05:28 it's actually been used
0:05:29 it actually works from time to time
0:05:32 and so in some sense
0:05:33 it seems to have graduated
0:05:39 yet it fails where humans don't
0:05:41and by the way those of you who have your P H Ds
0:05:44know that your education hopefully was not done at that point
0:05:49and there's probably a lot more to do here
0:05:51 uh, some would argue
0:05:53 that there is little basic science that's been developed in quite some time
0:05:58lots of good engineering methods though
0:06:00but they often require a great amount of data
0:06:03uh as we learned yesterday there is a great deal of data
0:06:07but not all of it is
0:06:08available for use in the way that you like
0:06:10 and there are many tasks where you don't have that much
0:06:13and each new task requires
0:06:15uh essentially the same amount of effort you sort of have to start over again
0:06:20 so how did we get to this point
0:06:21 this is not gonna be anything like a complete history, but
0:06:25 enough to make my point, hopefully
0:06:28 i'm gonna talk about the current status and the standard methods
0:06:31 very briefly
0:06:33 uh, talk about some of the alternatives that people have worked with over the years
0:06:37and where could we go from here
0:06:41as i mentioned
0:06:42speech recognition research has been around for a very long time
0:06:46uh a significant papers for sixty years
0:06:50by the nineteen seventies
0:06:52 in some sense the major advances in modeling had happened
0:06:56 that is, the basic
0:06:57 mathematics behind hidden markov models
0:07:00 was done by then
0:07:02 there have been lots of improvements
0:07:03 that happened, uh, over the next twenty years or so
0:07:06and also in the features
0:07:08which became
0:07:09 more or less standard by nineteen ninety or so
0:07:12 there were some really important methodology improvements by nineteen ninety; in earlier days
0:07:17 people did many experiments, but it was very hard to compare them
0:07:20 and the notions of standard evaluations and standard datasets really took hold by nineteen ninety or so
0:07:27 and over all of these years
0:07:29 uh, especially the last twenty or thirty years, there have been continuous improvements
0:07:33 which were, to some extent, closely related to moore's-law
0:07:36 movements in the technology
0:07:38 that is
0:07:39 um, more and more computational capability
0:07:42 more and more storage capability
0:07:44 letting people work with very large datasets
0:07:46 and develop very large models to represent those large datasets well
0:07:51so on
0:07:54 there's an elephant in the room, which is that things
0:07:57 are still not entirely working
0:08:00 the fact that these systems have converged
0:08:02 was kind of a byproduct of all of these standard evaluations, which were
0:08:06 very good in many ways
0:08:09 when people found out that another group
0:08:11 had something that they didn't, they would copy it, and very soon the systems would become very much the same
0:08:19 what are some of the remaining problems
0:08:22 systems still perform pretty poorly, despite a large amount of work on this,
0:08:27 in the presence of significant amounts of acoustic noise
0:08:30 also reverberation
0:08:32 which is natural for
0:08:34 just about any situation
0:08:37 uh, unexpected speaking rate or accent
0:08:39 and by unexpected i mean something that is not well represented in the training set
0:08:45 uh, or unfamiliar topics
0:08:47 uh, the language models bring us a lot of the performance that we have, and if you
0:08:51 don't have a particular topic represented in the language model, it can do poorly
0:08:57 apart from the recognition performance per se, how many words you get right,
0:09:01another thing that's important is knowing whether you're right or wrong
0:09:05and that's very important for practical applications
0:09:08 and that still needs some work as well
0:09:12 so it turns out that even some fairly simple speech recognition tasks can still fail under some of these conditions
0:09:17 yielding some strange results
0:09:20 [a video clip plays: a comedy sketch in which a voice recognition system repeatedly misunderstands its user]
0:10:45 so that was funny
0:10:47 i hope you think it was funny, but
0:10:49 this kind of thing has happened in real life too, as opposed to just in jokes
0:10:56 so, uh, let me start off with
0:10:59 some results from some of these standard evaluations i referred to
0:11:03 this is a graph that people in speech have seen a million times
0:11:07 or this other one
0:11:10 for those of you who aren't familiar with this, the main thing to note is that W E
0:11:14 R stands for word error rate
0:11:16 a high word error rate is obviously bad; this is time on the axis
0:11:20 and each of these lines represents a series of tests
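As an aside, the word error rate used in these evaluations is the edit distance (substitutions, insertions, deletions) between the reference and the hypothesis, divided by the number of reference words. A minimal sketch, a generic textbook computation rather than the official scoring tool used in those evaluations:

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + insertions + deletions) / reference length."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = minimum edits turning the first i ref words into the first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i                      # deleting i reference words
    for j in range(len(h) + 1):
        dp[0][j] = j                      # inserting j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)
```

For example, one substituted word out of three gives a WER of one third.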
0:11:23 oh, this is a kind of messy graph, so let's clean it up a little
0:11:27 uh, this is a task done in the early nineties, uh, called ATIS
0:11:32 and the main thing to see here, as with a lot of these, is that it starts off at a
0:11:35 pretty high error rate, people work for a while,
0:11:38 and after a while it gets down to, uh, a pretty reasonable error rate
0:11:43 let's go to another one; this was, uh,
0:11:45 conversational telephone speech
0:11:47 you see the same sort of effect, and do remember that this is, um,
0:11:52 a logarithmic scale here
0:11:53 so even though it looks like it hasn't come down very far, it really did come down pretty far, but after a
0:11:58 while it sort of levels off
0:12:00 uh, more recently there's been a bunch of work on speech from meetings, which is also conversational
0:12:05 these are from the, uh, individual head-mounted microphones
0:12:09 so we still didn't have the huge effects of background noise or reverberation or anything
0:12:14 and there wasn't actually a huge amount of progress after some of the initial, uh, initial work
0:12:20 uh, now, those were
0:12:21 research evaluations
0:12:23 what about, uh, commercial products
0:12:26 i think,
0:12:27 uh, a lot of the information is proprietary
0:12:29 but what i think we can say is that
0:12:31 commercial products work some of the time, for some people
0:12:34 and they often don't work
0:12:35 for others
0:12:37 so what is the state of things
0:12:39 well, recognition systems will either
0:12:42 work really well for somebody
0:12:44 or they'll be terribly brittle and unreliable
0:12:47 uh, i know that when my wife and i both tried a, uh, dictation system, it worked wonderfully for
0:12:52 her and terribly for me; i think i mumble my words or something
0:12:57 so here's an abbreviated review
0:12:59 of what was standard
0:13:01 by nineteen ninety-one
0:13:03we had
0:13:05uh feature extraction
0:13:06basically being based on frames every ten milliseconds or so
0:13:11 something computed from a short-time spectrum
0:13:14 uh, things called mel-frequency cepstral coefficients;
0:13:18 i'll mention a bit more about those in a second
0:13:21 P L P is another common method developed by then
0:13:25delta cepstra
0:13:26uh essentially temporal derivatives of the cepstra
0:13:30and on the statistical side
0:13:32 uh, for acoustic modeling, hidden markov models were quite standard
0:13:36 which by this point typically represented
0:13:38 context-dependent phonemes or phoneme-like units
0:13:42 uh, the language models were pretty much all statistical by this time
0:13:46 and they represented context-dependent words
0:13:50 so all this was there by nineteen ninety-one
0:13:52 now let's move to two thousand eleven
0:13:56there it is
0:13:59notice all the changes
0:14:02okay that's a little unfair
0:14:04 people have actually done a lot of work in the last twenty years
0:14:07 and this is
0:14:09 a representation of a lot of it, i think
0:14:11 and these have had big effects
0:14:13 i don't mean to minimize them
0:14:15 uh, various kinds of normalization, such as mean and variance normalization
0:14:20 uh, an online version of that that we called rasta
0:14:23 uh, vocal tract length normalization, which
0:14:26 compresses or expands the spectrum
0:14:29 in such a way as to match the models better
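The mean-and-variance normalization just mentioned can be sketched roughly as follows; this is a generic per-utterance version, where the array layout and the variance floor are assumptions of the sketch rather than any particular system's choices:

```python
import numpy as np

def mean_variance_normalize(features):
    """Per-utterance cepstral mean and variance normalization.

    features: (num_frames, num_coeffs) array, e.g. MFCCs over one utterance.
    Subtracting the per-coefficient mean removes stationary channel effects;
    dividing by the standard deviation equalizes the dynamic range.
    """
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    return (features - mu) / np.maximum(sigma, 1e-8)  # floor avoids divide-by-zero
```

After normalization each coefficient has approximately zero mean and unit variance over the utterance.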
0:14:34 and, uh, then
0:14:35 adaptation and feature transformation
0:14:38 uh, either adapting better to a test set that is somewhat different from the training set
0:14:42 or, uh,
0:14:44 various changes to make the features more discriminative
0:14:49 discriminative training:
0:14:52 uh, changing the statistical models
0:14:55 in such a way as to make them discriminate better between different speech sounds
0:14:59 we had more and more data over the years, and that required
0:15:03 lots of work to figure out how to handle
0:15:05 but aside from handling it, there was also taking advantage of lots of data
0:15:09 which didn't come for free, so there was lots of engineering work there
0:15:14uh people found that
0:15:15combining systems helped and sometimes combining
0:15:18pieces of systems helped
0:15:20 and that's been an important thing in improving, uh, performance
0:15:24and because
0:15:25uh speech recognition was starting to go into applications you had to be concerned about speed
0:15:30 and there's been a lot of work on that
0:15:33 well, a bit more on some of this:
0:15:35 the main point, uh, about mel cepstrum and PLP i wanna make is that
0:15:40 each of 'em uses this kind of warped frequency scale
0:15:43 uh, in which you have better resolution at low frequencies than at high frequencies
0:15:47 'cause our perception of different, uh,
0:15:50 speech sounds is very different at low frequencies versus high frequencies
0:15:53 mel cepstrum and PLP use different mechanisms
0:15:57 for getting a smooth spectrum, uh
0:16:00 delta cepstrum, uh,
0:16:02 as i said, is basically
0:16:05 the time derivatives, uh, of the cepstrum
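A rough sketch of the two ideas just described, the warped (mel) frequency scale and delta features as local time derivatives; the warping constants are the commonly cited ones, and the regression window size here is an illustrative choice:

```python
import numpy as np

def hz_to_mel(f):
    """Standard mel warping: near-linear below roughly 1 kHz, logarithmic
    above, giving finer resolution at low frequencies as described."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def delta(cepstra, window=2):
    """Delta (first-derivative) features via the usual regression formula
    over +/- `window` frames; edges are padded by repeating the end frames."""
    n = len(cepstra)
    padded = np.pad(cepstra, ((window, window), (0, 0)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, window + 1))
    return sum(
        k * (padded[window + k: n + window + k] - padded[window - k: n + window - k])
        for k in range(1, window + 1)
    ) / denom
```

For a linearly rising cepstral trajectory, the interior delta values come out equal to the slope.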
0:16:10 the hidden markov model; this is a graphical form of it
0:16:13 and the main thing to see here is that this is a
0:16:16 statistical dependency graph
0:16:18 uh, and
0:16:20say X three is only dependent on the current state
0:16:24each of these
0:16:25time steps
0:16:27are represented here
0:16:29and if you know Q three
0:16:31 uh, then q two, q one, x one, x two tell you nothing about x three
0:16:35so that's a very very strong statistical conditional independence model
0:16:40 and that's pretty much what people have used in these
0:16:43 now-standard systems
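That conditional-independence structure is exactly what makes HMM computations cheap: only a current-state quantity has to be carried from frame to frame. A minimal sketch of the forward algorithm with plain probabilities (toy scale; real systems work in the log domain):

```python
import numpy as np

def forward(A, pi, B):
    """p(x_1..x_T) under an HMM, using the Markov conditional-independence
    assumption described above.

    A:  (S, S) transition probabilities, A[i, j] = p(q_t = j | q_{t-1} = i)
    pi: (S,)   initial state probabilities
    B:  (T, S) emission likelihoods, B[t, s] = p(x_t | q_t = s)
    """
    alpha = pi * B[0]                 # joint of first state and first observation
    for t in range(1, len(B)):
        # only alpha is carried forward: given the current state,
        # earlier states and observations are conditionally irrelevant
        alpha = (alpha @ A) * B[t]
    return alpha.sum()
```

For a small model this agrees with brute-force summation over all state sequences.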
0:16:45this is my only equation
0:16:47 and, uh, those of you in speech will go "oh yeah"; in fact probably
0:16:50 most people will say "oh yeah"
0:16:53 it's basically bayes' rule
0:16:55the idea is that
0:16:56 in a statistical system
0:16:58you want to pick the model
0:16:59that is most probable given the data
0:17:02 and bayes' rule says you can expand it in this way
0:17:05and then you can get rid of the P of X because there's no dependence on the model
0:17:13 you realize these,
0:17:14 uh, likelihoods,
0:17:16 the probability of the acoustics given the model, with mixtures of gaussians, typically
0:17:21 typically each gaussian is just represented by means and variances; there's no covariance represented between the features
0:17:29 and there are the weights of each of the gaussians
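A sketch of such a diagonal-covariance mixture likelihood; the function name and array layout are illustrative, not taken from any particular toolkit:

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of one feature vector under a diagonal-covariance
    Gaussian mixture, as used for HMM emission probabilities: each
    component is described only by per-dimension means and variances
    (no cross-feature covariance), plus a mixture weight.

    x: (D,); weights: (K,); means, variances: (K, D)
    """
    # per-component log density of a diagonal Gaussian
    log_comp = -0.5 * np.sum(
        np.log(2.0 * np.pi * variances) + (x - means) ** 2 / variances, axis=1
    )
    log_weighted = np.log(weights) + log_comp
    # log-sum-exp over components for numerical stability
    m = log_weighted.max()
    return m + np.log(np.exp(log_weighted - m).sum())
```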
0:17:31 the language priors,
0:17:32 P of M,
0:17:34 are, uh,
0:17:35 implemented with an n-gram
0:17:37 you do a bunch of counting, you do some smoothing
0:17:40 and it's basically the probability of a word given some word history, such as the recent
0:17:46 n minus one words
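The counting-plus-smoothing recipe can be sketched as below; add-alpha smoothing is used here only because it fits in a few lines, whereas real systems use more refined schemes such as Kneser-Ney:

```python
from collections import Counter

def bigram_probs(corpus, alpha=1.0):
    """Bigram language model p(w | w_prev) estimated by counting,
    with add-alpha smoothing so unseen bigrams get nonzero probability."""
    tokens = corpus.split()
    vocab = set(tokens)
    history = Counter(tokens[:-1])            # times each word starts a bigram
    bigram = Counter(zip(tokens, tokens[1:])) # bigram counts
    def p(w, w_prev):
        return (bigram[(w_prev, w)] + alpha) / (history[w_prev] + alpha * len(vocab))
    return p
```

For each history, the smoothed probabilities still sum to one over the vocabulary.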
0:17:49 now,
0:17:50 the math is lovely, but in practice we actually raise each of these things to some kind of power
0:17:55 this is to compensate for the fact that the models are wrong
0:17:58 and that, uh,
0:18:00 there really are other dependences
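In the log domain, raising the language prior to a power is just multiplying its log probability by a scale factor; a sketch, with the scale value purely illustrative (in practice it is tuned per system):

```python
def hypothesis_score(log_p_x_given_m, log_p_m, lm_scale=12.0):
    """Practical decoding score: log p(x|m) + lm_scale * log p(m).

    The scale compensates for the acoustic model's overconfidence, since
    its independence assumptions make p(x|m) too peaked. lm_scale=12.0 is
    an arbitrary example value, not a recommendation.
    """
    return log_p_x_given_m + lm_scale * log_p_m
```

Note that the scale can change which hypothesis wins, which is why it matters in practice.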
0:18:04 this is a picture of the acoustic likelihood
0:18:07 uh, estimator
0:18:09 there are a few steps in here; each of these boxes can actually be fairly complicated, but
0:18:14 just generally speaking
0:18:15 there's some kind of short-time spectral estimation
0:18:19 there's this vocal tract length normalization i mentioned, which compresses or expands the spectrum
0:18:24 then some kind of smoothing, either by
0:18:26 uh, throwing away some of the upper cepstral coefficients or by autoregressive modeling, as is done in P L P
0:18:33 there are various kinds of linear transformations, for instance for dimensionality reduction
0:18:38 uh, and for better discrimination
0:18:41 then there's the statistical engine
0:18:43 that i mentioned before, with this funny scaling, um,
0:18:46 in the log domain, or raising to a power,
0:18:49 in order to mix it with the
0:18:50 uh, language model
0:18:52okay well that seems simple enough but
0:18:54actual systems that get the very best scores are a bit more complicated than this
0:18:58 uh, there's, well,
0:18:59 first off there's the decoder, and the language priors coming in
0:19:05 and you might have two of these front ends
0:19:09 people found that this is very helpful for getting the best performance
0:19:13 but you don't just put 'em in in a very simple way
0:19:17 it's very often the case that you have all sorts of stages,
0:19:20 with, ugh,
0:19:22 CW here is crossword,
0:19:24 or non-crossword models, and you produce graphs or lattices and you combine them at different points and you cross-
0:19:30 adapt them
0:19:33 this kind of reminds me of some work
0:19:36 by, uh,
0:19:37 a berkeley grad from about a century ago named rube goldberg
0:19:41 and this is his self-operating napkin
0:19:44 the self-operating napkin is activated when the soup spoon A is raised to mouth
0:19:50 uh, pulling string B and thereby jerking ladle C
0:19:54 which throws cracker D past parrot E
0:19:57 uh, the parrot jumps after the cracker, and perch F tilts
0:20:01 which, uh, upsets the seeds G into pail H
0:20:06 the extra weight in the pail pulls the cord I, which opens and
0:20:10 lights the cigarette lighter J
0:20:13 and this, uh,
0:20:14 in turn lights the rocket, which pulls the sickle, which cuts the string,
0:20:20 allowing the pendulum to swing back and forth,
0:20:22 thereby wiping the chin
0:20:25 uh, and this,
0:20:26 to me, is a view of current speech recognition systems:
0:20:32 successful at wiping the chin, sometimes
0:20:35 so i wanna talk a little bit about alternatives
0:20:37 and i wanna say at the outset
0:20:40 that these are just some of the alternatives
0:20:42 a conference like this has, uh, a lot of work
0:20:46 in many different directions
0:20:48 these are just the ones i wanted to give as examples
0:20:52 but first i wanna say
0:20:54 a little bit about
0:20:57 what else is there
0:20:58 besides the mainstream
0:21:02 the great sage
0:21:04 was tracked down by a seeker
0:21:06 and the seeker asked the sage,
0:21:09 what is the secret to happiness
0:21:12 the sage answered,
0:21:13 good judgement
0:21:16 well, the seeker said,
0:21:17 that's all very well,
0:21:18 master, but
0:21:20 how does one obtain good judgement
0:21:23 and the master said,
0:21:24 from experience
0:21:27 so the seeker said, okay, experience
0:21:31 how does one obtain this experience
0:21:34 and the master said,
0:21:35 bad judgement
0:21:40 here are some of the exercises in bad judgement that we and many other people have done
0:21:44 we've pursued
0:21:46 different signal representations
0:21:48 uh, some of them are related to perception,
0:21:50 to auditory models, for instance
0:21:53 mean rate and synchrony, from seneff's model from some time ago
0:21:57 uh, and the ensemble interval histogram
0:22:00 from, uh, ghitza
0:22:02 each of these is
0:22:04 related to models of neural firing
0:22:08 uh, how
0:22:09 fast the neurons fire, how much they synchronize with one another,
0:22:12 what, uh, timing there was between the firings
0:22:15 and they had some interesting performance in noise, uh, but they
0:22:19 have not been adopted in any serious way
0:22:23 there's interesting technology there and interesting scientific models
0:22:27 then there's stuff that's more on the psychological side; the previous ones were sort of based on models of physiology
0:22:33 uh, then there are, uh, models
0:22:36 really from the psychological side, multi-band systems based on critical bands, going all the way back to
0:22:42 fletcher's work and the work of others
0:22:44 uh, and
0:22:46 uh, the idea here is that if you have a system that's just looking at part of the spectrum
0:22:50 and the disturbance is in that part of the spectrum
0:22:53 uh, then you can deal with that separately
0:22:56 those have had some successes
0:22:58 and then something that, uh,
0:23:00 you can observe at both the physiological and psychological level
0:23:04 is the importance of different, um, modulations,
0:23:08 particularly temporal, but also spectral modulations, in the signal
0:23:13 uh, then on the production side there's been a bunch of work by people on the idea that,
0:23:17 given the fact that there are only a few articulatory, uh, mechanisms,
0:23:22 uh, maybe you can represent things that way, and it would be more parsimonious and
0:23:26 a better,
0:23:27 better representation of the signal; you want to
0:23:29 represent this over time, and there have been
0:23:31 hidden dynamic,
0:23:32 uh, models that attempt to do this, and
0:23:35 trajectory models; sometimes the trajectory models had nothing to do with the physiological models, but
0:23:40 uh, sometimes they did
0:23:43 and articulatory features, which you could think of as a quantized version of the articulator positions and so forth
0:23:51 then another direction was artificial neural networks, which have been around for a very long time,
0:23:58 actually from before nineteen sixty-one, but
0:24:00 i picked out this one, discriminant analysis iterative design
0:24:04 i picked that out 'cause a lot of people don't know about it; a lot of people think that
0:24:07 multilayer networks began in the eighties
0:24:10 but actually, back in sixty-one they had a multilayer network that worked very well for some problems and was actually
0:24:15 used industrially
0:24:16 for a while after that
0:24:19 um, in which the first, uh, layer of units was, uh, a bunch of gaussians, and after that you had
0:24:24 a linear perceptron
0:24:27 a couple of years later, uh, there was work at stanford
0:24:30 in which they actually did apply some of this kind of stuff to speech; these were actually linear adaptive units,
0:24:35 actually called adalines
0:24:37 uh, bernie widrow sent me, uh,
0:24:39 a technical report
0:24:40 here's the cover of the real technical report, from nineteen sixty-three
0:24:46 and here's a page from it that shows a,
0:24:48 uh, a block diagram; i blew it up here for
0:24:51 visibility
0:24:52 it starts off with some band filters, basically getting some power measures in each band
0:24:57 and then here are these adalines, which, uh, give you some sets of outputs,
0:25:02 which went to a typewriter apparatus
0:25:07 the nineteen eighties saw an explosion of interest in the neural network
0:25:11 area
0:25:13 uh, part of this
0:25:14 was sparked by
0:25:16 a rediscovery, let's say, of error back-propagation,
0:25:20 just basically propagating the effect of errors from the output of the system
0:25:24 back to the individual weights
0:25:27 uh, in the late eighties a number of us worked on hybrid hmm / artificial neural network systems
0:25:34 where the neural networks were used as probability estimators to get the emission, uh, probabilities for the hmm
0:25:41 in the last decade or so, uh, quite a few people have taken off on the tandem idea
0:25:46 which is a particular way of using artificial neural networks
0:25:50 as feature extractors
0:25:52 and i will just mention, uh, briefly
0:25:55 a fairly recent development, deep networks
0:26:00 how, uh,
0:26:02 how innovative it is is a question
0:26:04 but there's definitely some new things going on there which i think are interesting
0:26:10 the obvious difference between these and the previous networks tends to be more layers; that's the "deep"
0:26:15 there's also sometimes, uh, unsupervised pre-training
0:26:21 there are actually several papers at this conference; there's also a special issue,
0:26:24 uh, in, uh, november, of the transactions
0:26:28 um, here's a couple of papers at this conference, and i think there are a few others, as well as one from
0:26:32 another group
0:26:34 they had a lot of different numbers in the paper, but, uh, i picked one out
0:26:38 and just,
0:26:40 indeed, most of the numbers had the same general trend
0:26:45 deep mlp: good
0:26:47 uh, and the old mlp somewhere in between
0:26:50 these are error rates, so again, uh, low is good
0:26:54 and, uh, there is a large-vocabulary, um,
0:26:58 voice search,
0:26:59 uh, paper, which, uh,
0:27:01 is at the poster session today
0:27:03 uh, it had a sixteen percent reduction; their metric was sentence error
0:27:08 and they had a nice improvement compared to
0:27:10 a system that used, uh, MPE, which is a very common discriminative training method
0:27:20 so that's it for some of the alternatives; again, i'm sure
0:27:24 many people in this audience could think of many others
0:27:29 where could we go from here
0:27:31 or, in my opinion, where should we go from here
0:27:36better features and models
0:27:40i've suggested
0:27:41 better models of hearing and production
0:27:44 uh, could perhaps lead to better features
0:27:48uh better models of these features
0:27:50better acoustic models
0:27:53models of understanding better language models dialogue models pragmatics and so on
0:27:58 all of these are likely to be important
0:28:01the other thing which i'm gonna go into a bit especially at the end is understanding the errors
0:28:06understanding what the assumptions are
0:28:08that are going into our models
0:28:10 and how to get past them
0:28:15 so let's start with models of hearing
0:28:17 so, there are
0:28:19 useful approximations to the action of the auditory periphery, that is,
0:28:23 uh, from the ear
0:28:25 to the auditory nerve
0:28:27 and when i say useful approximations, i mean that there are a number of people who've worked on
0:28:34 simplifying the models that were used earlier,
0:28:39 crafting them more towards
0:28:41 uh, good engineering
0:28:44 some of those are looking kind of promising
0:28:47 uh, there's new information about the auditory cortex, which i'm gonna briefly refer to in the
0:28:51 next few slides
0:28:57it's good to learn from a biological examples because uh you know humans are pretty good in many situations that
0:29:03at recognizing speech
0:29:06probably good also not to be purist
0:29:08and to mix
0:29:09in size that you get from these things with good engineering approaches
0:29:13and i i i think there's some
0:29:15uh good possibilities there
0:29:17uh this bottom bullet it is just to note that
0:29:20as with many things in this talk a money talking about some of the field
0:29:24and a mostly talking about single channel
0:29:26but uh people have to ears they make pretty good use of them when they were
0:29:31uh and that's
0:29:33something to keep in mind
0:29:34and of course you can go to many years in some situations with microphone arrays and that's a good thing
0:29:39think about
0:29:40that's not a topics and i'm expanding on and the stock
0:29:44and the same thing with visual information visual information is used by people whenever they can
0:29:49uh and i'm not gonna talk about that but it's obviously imp or
0:29:53okay a a is gonna talk about this a cortical stuff
0:29:58uh the slightest courtesy of uh she she shah it's not just the slide but also the idea
0:30:03uh and the idea that which comes from experiments that uh he in it's guys
0:30:09and gals
0:30:11uh done with a small mammals
0:30:14uh a that have
0:30:16pretty similar
0:30:17really part of the cortex X
0:30:19uh a primary auditory cortex
0:30:21to what people have
0:30:23also been some other work with people
0:30:25uh and
0:30:28this if you mention this is being the kind of spectrogram that's received that this primary auditory cortex
0:30:35what they've observed is that there's a bunch of what are called split spectro-temporal receptive fields S T R apps
0:30:41which are little filters
0:30:42that process it in time and frequency
0:30:46and you could think of them as processing temporal modulations which you called rate and spectral modulations which called scale
0:30:53and you imagine there being a cube
0:30:55at each time point
0:30:57with auditory frequency
0:30:59and uh
0:31:00rate and scale
0:31:02and much as you would like to be able to in in and a regular spectrogram
0:31:08de emphasise the areas where the signals noise was poor
0:31:11and emphasise areas with the sings noise was good
0:31:14you have perhaps an even greater chance
0:31:16to do this kind of emphasis you have a as
0:31:19uh if you're expanded out to this cube
0:31:22that's the general idea
0:31:23so you could end up with a lot of these different spectrotemporal receptive fields
0:31:27you could implement them and you could try to do something good with them pick out a good
0:31:32uh if limitation that we and and number of people have been trying
0:31:38uh what we would call T many stream
0:31:41uh implementation
0:31:43as opposed to multi-stream which uh was what i we shown before you we'd have two or three streams just
0:31:49refers to the quantity
0:31:50but what's in each stream is
0:31:53one of the representation one these spectro-temporal receptive fields implemented by a gabor filter
0:31:57and by a multilayer perceptron
0:31:59that's a discriminatively trained discriminant between different speech sounds
0:32:04you get a whole lot of these and some of implementations we at three hundred
0:32:07uh and then you have to figure out how to combine them or select them
0:32:11hopefully again to de emphasise the ones that are uh bad indicators of what was set
0:32:20another interesting side light of this kind of approach
0:32:23is that it's a good fit to modern high speed computing that it's
0:32:27as i think a lot of you know
0:32:29the clock rates and or long going up the way they used to other cpus use
0:32:33and so the way that manufacturers are trying to give us more performances by having many more core
0:32:38the graphics processors are an extreme example of this
0:32:41this kind of structure is a really good match to that
0:32:44uh because it's it's what they call an embarrassingly parallel
0:32:48um we found that this room this kind of approach does remove a significant number of errors particularly and noise
0:32:54but also a as it turns out in the clean condition
0:32:59it combines well with pure engineering not auditory
0:33:02kind of methods
0:33:03uh such as wiener filter based methods
0:33:06and we'd like to think that it could combine well with other auditory models all we haven't really done that
0:33:11work yet
0:33:16acoustic models
0:33:19uh we currently use these critical assumption
0:33:22and one of things about using very different kinds of features is that this can really change their statistical properties
0:33:27from what the ones we have now
0:33:29and so these assumptions
0:33:31i could be violated in yet different way
0:33:35uh there have been all turn models that were propose that allow you to bypass these typical assumptions
0:33:41but part of the problem is the figure out
0:33:43which statistical dependencies to put in
0:33:48um models of language an understanding
0:33:51i think it's probably pretty clear those you don't know me that this isn't a research area
0:33:55but it's of obvious import
0:33:58one of the things that uh
0:34:01has been frustrating to a lot of people in fact a member fred jelinek being physically frustrated about this
0:34:07is that
0:34:08it's very very tough to get much improvement
0:34:10over simple n-grams that is a probability of word given some number of previous work
0:34:16it can be very important
0:34:19get further information
0:34:21and we know this for sure for people
0:34:25me tell you little story
0:34:27uh one day
0:34:29i was walking out of i csi
0:34:31and i had on one of these catch this is a cap for the oakland athletics to local
0:34:37make league baseball club
0:34:39i also had on a jacket
0:34:41that had the same insignia on it
0:34:44and i had a radio
0:34:45hell to my head i was walking down the street
0:34:49and a guy across the street
0:34:50moderately noisy street
0:34:55and i said
0:34:56oh can five to three
0:35:01we'd like to be able to do that with a machine
0:35:06so where we go from here
0:35:10research what continue to get good ideas
0:35:15every time you get the shower or maybe you have a have a good idea coming out
0:35:20what's the best methodology
0:35:22what's the best way to proceed along this path
0:35:25so maybe we can learn from some other disciplines
0:35:29and let me give
0:35:30uh a kind of stretched analogy to
0:35:33the search for a cure for cancer
0:35:35and again i'm gonna tell you a little story
0:35:38uh it's a personal one uh it's about an uncle of mine named sidney farber
0:35:44my uncle sidney in the forties
0:35:46uh was
0:35:48a pathologist
0:35:49at harvard medical school
0:35:52and at children's hospital boston
0:35:57unfortunately he got to see lots of little children
0:36:00with leukemia
0:36:01uh once they were diagnosed they only had a few weeks
0:36:05as a pathologist he mostly dealt with petri dishes and so forth he wasn't really a clinician
0:36:11but he got this thought
0:36:13that maybe if you could come up with chemicals
0:36:16that were more poisonous to the cancer cells than they were to the normal cells
0:36:20maybe he could extend the lives of these kids
0:36:23and he experimented with this in the petri dishes of course for the most part for a while
0:36:27and then he came up with something that he thought would work
0:36:31and he tried it out
0:36:32with everybody's permission
0:36:34and some of these kids
0:36:35and lo and behold
0:36:36it actually did extend their lives for a while
0:36:39this was
0:36:40the first
0:36:41case of chemotherapy
0:36:45this was just
0:36:46great and it started a whole revolution he ended up starting a big center the national cancer institute stuff uh
0:36:52there's now the dana-farber named in reverence to him
0:36:59the key point i wanna make about it
0:37:01is that
0:37:02there's this quandary
0:37:03between curing patients
0:37:05you have these patients who are coming through
0:37:07who are in terrible straits
0:37:10but on the other hand
0:37:12you don't have any time
0:37:14to figure out what's really going on
0:37:17and there were
0:37:18important early successes based on hunches that my uncle and many others had
0:37:24and there wasn't time to learn the real cause of things
0:37:27and by the way there are stories like this
0:37:29for surgical interventions and for radiation as well
0:37:35so there's some success
0:37:37but they still
0:37:38didn't find a general cure and uh as you know to this day there still is no general cure
0:37:42for cancer
0:37:43but things are a lot better remissions are longer and so forth
0:37:47and now there's
0:37:48starting to be some understanding of the biological mechanisms and one hopes that this will lead to a complete
0:37:54uh solution
0:37:56so there's this wonderful book i strongly recommend the emperor of all maladies
0:38:01about uh the history of cancer
0:38:05and i'll just read this
0:38:06it says the devising of remedies
0:38:09in such time as we have considered of the cause
0:38:12must be imperfect lame and to no purpose
0:38:15wherein the cause hath not first been searched
0:38:18this again doesn't belie the fact that it can be very useful
0:38:22to uh go ahead and try to fix something along the way
0:38:26but in the long term you need to understand what's going on
0:38:30so as opposed to just
0:38:32trying our bright ideas which we all do
0:38:35how about finding out what's wrong
0:38:39the statistical approach
0:38:40to speech recognition requires
0:38:42assumptions that i made reference to
0:38:44that are known literally to be false
0:38:47this may or may not be a problem
0:38:49maybe it's just handled by uh say raising these
0:38:52uh likelihoods to a power
0:38:55how can we learn
0:38:57so there's some work that's been started i wanted to call your attention to
0:39:01from steve wegmann and larry gillick
0:39:03starting a couple years ago
0:39:05where what they did was to consider each assumption separately
0:39:09and then rather than trying to fix the models
0:39:12they modified the data
0:39:13by some resampling um
0:39:15some uh
0:39:16bootstrapping kind of approaches
0:39:18to match the models
0:39:20observe the improvement
0:39:22and use that to inspire more bright ideas
0:39:26but at this point
0:39:27they really just focused on the diagnosis part and not on the
0:39:30new bright ideas frankly
0:39:32so this is being pursued also at icsi in the ouch project which is outing unfortunate characteristics of hmms
0:39:40uh i'm gonna give you just a couple results from a more recent version i should add by the way that
0:39:45this is a different guy this is dan
0:39:48who just finished his phd with us
0:39:53first this is a
0:39:55very simplified system so the error rate for wall street journal is pretty high here
0:40:01and uh it's
0:40:03the output
0:40:04uh demonstrably does not really fit the gmm distribution that you got from the training set
0:40:10and it definitely doesn't satisfy the independence assumptions and you get this thirteen percent
0:40:17now if you use simulated data that is data generated from the models
0:40:21you should do pretty well and in fact you do basically
0:40:23uh virtually all of the errors go away
0:40:27but here's the interesting one i think
0:40:29if you
0:40:30use resampled data so this is the actual speech data
0:40:34but you're just resampling it in such a way
0:40:37as to assure the statistical conditional independence
0:40:41it also gets rid of nearly all of the errors
0:40:45now their studies are a lot more detailed there's a lot of
0:40:48a lot of things that they're looking at
0:40:50a lot of things that they're trying out
0:40:51but i think this gives the flavour
0:40:53of what they're doing
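A minimal sketch of the resampling idea just described, assuming frames have already been aligned to HMM states; the frame values and state labels are made up for illustration, not taken from the study:

```python
import random

def resample_frames(frames, states, rng=None):
    """Resample real frames within each HMM state so that, given the
    state sequence, frames are drawn independently of their neighbors --
    matching the model's conditional-independence assumption while
    still using only real data."""
    rng = rng or random.Random(0)
    # Pool the frames observed under each state label.
    pools = {}
    for frame, state in zip(frames, states):
        pools.setdefault(state, []).append(frame)
    # Rebuild the utterance: same state sequence, frames drawn from pools.
    return [rng.choice(pools[state]) for state in states]

frames = [0.1, 0.2, 0.9, 1.1, 0.15]  # stand-in acoustic frames
states = ["a", "a", "b", "b", "a"]   # stand-in state alignment
resampled = resample_frames(frames, states)
# Every resampled frame still comes from the pool of its own state,
# but any dependence between neighboring frames has been destroyed.
print(resampled)
```

Decoding such resampled speech with the same models isolates how much of the error comes from the independence assumption rather than from the data itself.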
0:40:59in summary
0:41:02uh speech recognition is mature
0:41:05in some sense it has an advanced degree
0:41:07that's because it's been around a long time and there are commercial systems and so forth
0:41:12and yet we still find it to be brittle
0:41:15and uh we essentially have to start over again with each new task
0:41:20uh the recent improvements
0:41:21have been really quite incremental and a lot of things have sort of levelled off
0:41:26we need to rethink
0:41:28kind of like going back to school
0:41:30kind of like continuing education
0:41:33uh we may need more basic models
0:41:35uh we may need more
0:41:37basic features
0:41:39we may need more study of errors
0:41:44the other thing i wanna briefly mention is that
0:41:48we do live in an era where there is a huge amount of computation available
0:41:52and even though the clock rates don't continue to go up as they have
0:41:56uh due to uh manycore systems
0:42:00cloud computing and so forth
0:42:01there is gonna continue to be
0:42:03an increased availability of lots of computation
0:42:06and this
0:42:07should make it possible for us to consider
0:42:10huge numbers of models
0:42:12uh and methods
0:42:13that we wouldn't consider before
0:42:15for instance on the front end side
0:42:17these uh auditory-based or cortical-based things can really blow up the computation
0:42:22from the simple kind of stuff that you have with mfccs or plp
0:42:30it's good to do that it's good to try things
0:42:34that might take a lot of computation even if they might not work yet in your iphone just now
0:42:39um um so
0:42:43you also have to know and i'm sure you all do that just having more computation is not a panacea
0:42:48doesn't actually solve things
0:42:49but it can potentially
0:42:51uh give you a lot more possibilities
0:42:54that's pretty much what i want to say
0:42:57i do wanna acknowledge that the stuff i have talked about is not particularly from me it's from many people including
0:43:03people outside our lab
0:43:05uh but i do want to thank
0:43:07the many current and former students and postdocs visitors icsi staff
0:43:12and particularly give a shout out to
0:43:14hynek hermansky and others including steve wegmann and jordan cohen
0:43:18here's my shameless plug for a book
0:43:21uh which was already mentioned
0:43:23that is gonna be out this fall thanks to tons of work from dan ellis
0:43:27and other contributors i should say
0:43:29uh like uh
0:43:37simon king for instance
0:43:41thank you for your attention
0:43:54having time i'm
0:43:58what is only a lot not of time bringing up
0:44:05you feel
0:44:16i promised i put on that
0:44:18yes are you thing to remind you about why
0:44:21i if know what is a question
0:44:28you mike
0:44:36what are you in the remote
0:44:38i know that
0:44:43speak Q mine
0:44:44oh don't hold back um yeah okay
0:44:47right at a time
0:44:51well there's still a chance
0:44:52there's still a chance
0:44:53get the courage
0:44:57i i think that the right answer is
0:44:59i don't know
0:45:03for instance
0:45:04well what i used to say when people talked to me about this is that
0:45:07okay i think of
0:45:08speech recognition as being in three pieces there's
0:45:12the representations that you have
0:45:14there's at the statistical models and the search and so forth in the middle
0:45:19and then there's
0:45:20uh all of the things that you could imagine doing with speech understanding and pragmatics
0:45:24syntax and so forth
0:45:26and i used to think that okay the first one i know a little bit about
0:45:30uh and i and i feel very strongly and you know many results to back this up that that's
0:45:35very important for improving
0:45:37the last one is not my area of expertise but what i have seen in others certainly and in humans
0:45:44i believe that's very important
0:45:45so i sort of thought the middle part
0:45:47you know works well enough
0:45:50uh but then there's this
0:45:51this study
0:45:53and now i'm not so sure
0:45:54so i actually think that you should
0:45:56uh pursue whatever it is that you
0:45:59feel
0:46:00yeah feel is of greatest interest
0:46:01i actually think the key thing
0:46:03is to have interesting friends
0:46:07a for nine now i see
0:46:11uh you like or know it's what i actually think i and if here and here right
0:46:21my is louder the
0:46:24all of these uh a technique used right or
0:46:27since we
0:46:29i'm spectral analysis
0:46:32pretty much uh uh everything just right
0:46:36in almost all
0:46:38but from now on
0:46:41spectral techniques like
0:46:43mfcc and plp have been
0:46:45treating some aspects of
0:46:48course things
0:46:49you guys most
0:46:51but the big problems it seems to me are still interference
0:46:55interference from other sources
0:47:01reverberation and so forth where these are
0:47:04not of much help
0:47:06distinguishing them
0:47:10the uh the other dimension
0:47:12uh uh uh
0:47:15temporal information something that has been explored a lot
0:47:18psychologically and
0:47:20is about you
0:47:23few steps
0:47:25so ensemble interval histogram that
0:47:28drop another
0:47:30mention an entirely
0:47:32a kind of a
0:47:39and source you
0:47:41same time
0:47:42you didn't say much about that
0:47:45and that's a direction of course
0:47:46so that's
0:47:48what do you think about that
0:47:49direction and should we get
0:47:51people working on it
0:47:52to pay more attention to
0:47:55things beyond
0:47:58why spectral i guess by which you mean short-term spectral right
0:48:02and uh i i may not have done this as clearly as i could but i think the shamma
0:48:07stuff that i was making reference to
0:48:10certainly can be long-term their spectro-temporal representations
0:48:15what you feed
0:48:17uh the the different
0:48:18stages of the cortical model
0:48:21can be a very different kind of spectrogram one that takes advantage of that sort of stuff and i think
0:48:26that's absolutely what we should do
0:48:28and as for these disturbances the multiple sources the reverberation et cetera
0:48:33uh uh i agree that's
0:48:34that's the biggest challenge that we see
0:48:36if someone talks about the performance of humans versus
0:48:40uh speech recognition systems in the current generation of systems that's the easiest explanation of the difference
0:48:46so uh
0:48:48i completely agree
0:48:49sorry am
0:48:50i'm not being a politician i actually do agree
0:49:04i modeling
0:49:17so go
0:49:19the most
0:49:29a more attention that
0:49:32K i didn't pay yeah
0:49:35but you were certainly reinforcing my biases
0:49:39uh oh go it is getting up but
0:49:43i i'm mostly a front-end person these days have been for a while and i agree that there's a lot
0:49:49to be done there
0:49:50i didn't mean to say at all that the language modeling and so forth was
0:49:54was the bulk of it
0:49:55even that study at the end was just saying for a fairly simple case with essentially matched training and test
0:50:01uh that
0:50:03you could
0:50:04fiddle with the data in such a way
0:50:07as to match the model's assumptions and you could do much better
0:50:10but uh one of the things that we're gonna be trying to do in follow-ups to that study is looking
0:50:15at mismatched conditions
0:50:17looking at
0:50:18cases with noise and reverberation and so forth
0:50:21in which case i don't think the effect will be quite as big
0:50:25you know it's garbage in garbage out if basically you feed in representations
0:50:30that are not
0:50:31uh giving you the information you need how are you gonna get it at the end so
0:50:36i i i agree with you but i was trying to be fair not only to those people
0:50:40but also because
0:50:42i feel that uh in if you cover the space
0:50:45of all these different cases
0:50:47there are many cases where these other areas are in fact very important
0:50:51and human beings as with my baseball example human beings do make use of higher level information
0:50:56uh often
0:50:57in order to figure out what was said and what was important about what was said
0:51:01which leads me to george's question
0:51:03as you were talking
0:51:05i was
0:51:05constantly struck with the
0:51:09um analogies in speech recognition almost
0:51:12you know
0:51:12irresistibly to
0:51:15uh things in optical character recognition
0:51:18and so
0:51:20almost every slide had irresistible analogies uh from the current successes to future directions to problems that
0:51:28are being experienced
0:51:29uh_huh and i i'm just wondering is there
0:51:31cross-disciplinary knowledge that can be leveraged and is it being leveraged
0:51:37to speech recognition except in the sense that some of these alternatives uh
0:51:43uh they have tried looking at uh the spectrogram as an image
0:51:47uh and so forth some of the neural network techniques that were developed uh in optical character recognition
0:51:54sort of came back the other way but a lot of it's gone
0:51:57gone the other way
0:51:59you know we tend to be a fairly fragmented community and and not listen to each other quite as much
0:52:04as we should
0:52:06who's next
0:52:12no i think he's of the dog
0:52:17oh i'm sorry i was drawing
0:52:19to stay in was C
0:52:23but the climbs
0:52:24a plug for the for a go on a tour
0:52:32i i have some exposure probably most people do have some exposure to modern
0:52:37speech recognition technology
0:52:39in real applications
0:52:40yeah i think you know of um
0:52:43i've been exposed to google voice
0:52:45perhaps many people have
0:52:47yeah and
0:52:48uh it's not only
0:52:50google voice but
0:52:52i think
0:52:52modern uh deployed speech recognition technology to me seems amazingly good
0:53:01considering that
0:53:02uh the systems
0:53:04these are systems that have no
0:53:06really great
0:53:10semantic content
0:53:13and to see what the systems can do acoustically
0:53:17is amazing
0:53:18to me
0:53:21yeah and the so
0:53:23where i see the challenge uh and i don't know how to do that
0:53:28where i see the challenge is
0:53:31is um
0:53:34creating models of the semantic context to give the kind of support
0:53:39to uh speech recognition that
0:53:41we've seen from uh the
0:53:44usual
0:53:45language models
0:53:52okay well
0:53:53if that was a question uh
0:53:56i i know it wasn't
0:53:57um but i'll say something anyway which is that
0:54:00uh i i am really taking the middle position
0:54:03there are plenty of tasks
0:54:04uh where in fact
0:54:06uh recognition does fail particularly in noise and reverberation and so on
0:54:10google voice search is very impressive
0:54:13but you know there's a lot of
0:54:13a lot of cases where things do fail
0:54:17we can see significant improvements
0:54:20in a number of tasks
0:54:21by changing the front end so i think there is something important there
0:54:25but in your statement you weren't really attacking the front end what you're saying is we have to pay
0:54:29attention to the back end and i completely agree
0:54:34one more and then it's probably time
0:54:36i want to change the subject a little bit um yeah given that can you say something about the
0:54:41role in this of you know industry versus academia in this research
0:54:43you you have got a big view of both sides
0:54:47what is good what is bad what could be good for speech
0:54:49right now and
0:54:52where should we go
0:54:53i actually a pretty small for
0:54:55and just re
0:54:57but uh
0:54:58uh well i think industry should fund academia
0:55:07i to
0:55:14thanks for the actual