| 0:00:15 | uh, welcome to, uh, i guess |
|---|
| 0:00:18 | good morning everyone |
|---|
| 0:00:20 | first a couple of practical announcements |
|---|
| 0:00:24 | we have a change of room |
|---|
| 0:00:26 | you know that club B was really small and we were afraid that people would not |
|---|
| 0:00:31 | fit in |
|---|
| 0:00:32 | so uh we moved everything from club B, and the expert sessions from club E, |
|---|
| 0:00:38 | to the north hall |
|---|
| 0:00:39 | it's actually the uh hall on the second floor, next to the registration |
|---|
| 0:00:45 | and we should have more space there, so it should be fine |
|---|
| 0:00:49 | uh actually |
|---|
| 0:00:50 | club B |
|---|
| 0:00:51 | should be closed then |
|---|
| 0:00:53 | signs |
|---|
| 0:00:54 | will be there |
|---|
| 0:00:55 | then uh for the internet, really sorry for the trouble yesterday |
|---|
| 0:00:59 | that was caused by the provider |
|---|
| 0:01:05 | the range problems are all sorted |
|---|
| 0:01:08 | so it should be available again |
|---|
| 0:01:10 | but please uh |
|---|
| 0:01:13 | we have just |
|---|
| 0:01:15 | five hundred twelve addresses available |
|---|
| 0:01:17 | there is no |
|---|
| 0:01:18 | way |
|---|
| 0:01:19 | to get more |
|---|
| 0:01:20 | so please disconnect when you |
|---|
| 0:01:22 | do not need to be connected |
|---|
| 0:01:29 | [a few words unintelligible] |
|---|
| 0:01:33 | then uh for the banquet tonight |
|---|
| 0:01:35 | we have uh |
|---|
| 0:01:37 | you need a ticket |
|---|
| 0:01:38 | i'm sorry for that, but if you don't have it you will not be allowed to get on the |
|---|
| 0:01:43 | bus |
|---|
| 0:01:45 | there is a very limited number of tickets still available at the registration desk |
|---|
| 0:01:51 | the departure will be right after the uh session |
|---|
| 0:01:56 | at seven, from entrance number ten |
|---|
| 0:01:59 | and the transportation back from the venue |
|---|
| 0:02:01 | is not provided, so |
|---|
| 0:02:03 | you may |
|---|
| 0:02:05 | continue your evening there and make your own way back |
|---|
| 0:02:12 | and uh i'm pretty much done, so uh there will be a short introduction |
|---|
| 0:02:17 | of our keynote speaker |
|---|
| 0:02:27 | [applause and pause; largely unintelligible] |
|---|
| 0:03:08 | and uh and uh |
|---|
| 0:03:09 | it's time for the second keynote |
|---|
| 0:03:12 | so uh |
|---|
| 0:03:14 | it's going to be given by |
|---|
| 0:03:16 | nelson morgan |
|---|
| 0:03:17 | from icsi berkeley |
|---|
| 0:03:19 | and uh, and hynek, |
|---|
| 0:03:21 | pardon the pronunciation of the name, |
|---|
| 0:03:23 | will introduce the speaker and chair the session |
|---|
| 0:03:31 | thank you very much for coming so early |
|---|
| 0:03:36 | it is my great pleasure |
|---|
| 0:03:40 | to introduce |
|---|
| 0:03:45 | nelson morgan |
|---|
| 0:03:52 | for those of you |
|---|
| 0:03:54 | who have been working in speech for a very long time |
|---|
| 0:03:58 | he is the author of a number of techniques |
|---|
| 0:04:02 | and is also known to a number of |
|---|
| 0:04:05 | you in the audience as the director of icsi |
|---|
| 0:04:09 | so |
|---|
| 0:04:09 | for those people |
|---|
| 0:04:12 | he doesn't need much of an introduction |
|---|
| 0:04:15 | for those of you who don't know him |
|---|
| 0:04:17 | he is also the co-author |
|---|
| 0:04:19 | of a textbook |
|---|
| 0:04:20 | together with one of the fathers of |
|---|
| 0:04:22 | signal processing |
|---|
| 0:04:24 | ben gold |
|---|
| 0:04:26 | and i hear a new edition |
|---|
| 0:04:28 | is coming out |
|---|
| 0:04:33 | what else can i say, well i think i will keep it short, as |
|---|
| 0:04:37 | hearing him will be better than |
|---|
| 0:04:40 | looking at me |
|---|
| 0:04:54 | thanks, hynek |
|---|
| 0:04:55 | well i thought it was time for a little bit of a reality check |
|---|
| 0:04:59 | and uh speech recognition |
|---|
| 0:05:01 | and |
|---|
| 0:05:02 | it's been around for a long time as i think everybody here knows |
|---|
| 0:05:06 | very long research history |
|---|
| 0:05:08 | uh lots of publications for decades many projects |
|---|
| 0:05:12 | and many sponsored projects |
|---|
| 0:05:14 | systems have continually gotten better |
|---|
| 0:05:17 | it actually tended to converge so that there is |
|---|
| 0:05:20 | in some sense a a standard |
|---|
| 0:05:22 | automatic speech recognition system now |
|---|
| 0:05:24 | uh it's made it to a lot of commercial products |
|---|
| 0:05:28 | actually been used |
|---|
| 0:05:29 | actually works from time to time |
|---|
| 0:05:32 | and so in some sense |
|---|
| 0:05:33 | it seems to have graduated |
|---|
| 0:05:36 | but |
|---|
| 0:05:39 | yet it fails where humans don't |
|---|
| 0:05:41 | and by the way those of you who have your P H Ds |
|---|
| 0:05:44 | know that your education hopefully was not done at that point |
|---|
| 0:05:49 | and there's probably a lot more to do here |
|---|
| 0:05:51 | uh some would argue |
|---|
| 0:05:53 | that there is little basic science that's been developed in quite a bit of time |
|---|
| 0:05:58 | lots of good engineering methods though |
|---|
| 0:06:00 | but they often require a great amount of data |
|---|
| 0:06:03 | uh as we learned yesterday there is a great deal of data |
|---|
| 0:06:07 | but not all of it is |
|---|
| 0:06:08 | available for use in the way that you like |
|---|
| 0:06:10 | and there are many tasks where you don't have that much |
|---|
| 0:06:13 | and each new task requires |
|---|
| 0:06:15 | uh essentially the same amount of effort you sort of have to start over again |
|---|
| 0:06:20 | so how do we get to this point |
|---|
| 0:06:21 | this is not gonna be anything like a complete history but |
|---|
| 0:06:25 | enough to make my point, hopefully |
|---|
| 0:06:27 | so |
|---|
| 0:06:28 | i'm gonna talk about the current status and the standard methods |
|---|
| 0:06:31 | a very briefly |
|---|
| 0:06:33 | uh talk about some of the alternatives that people have worked with over the years |
|---|
| 0:06:37 | and where could we go from here |
|---|
| 0:06:41 | so |
|---|
| 0:06:41 | as i mentioned |
|---|
| 0:06:42 | speech recognition research has been around for a very long time |
|---|
| 0:06:46 | uh a significant papers for sixty years |
|---|
| 0:06:50 | by the nineteen seventies |
|---|
| 0:06:52 | in some sense the major advances in modeling had happened |
|---|
| 0:06:56 | that is the basic |
|---|
| 0:06:57 | mathematics behind hidden markov models |
|---|
| 0:07:00 | was done by then |
|---|
| 0:07:02 | there have been lots of improvements |
|---|
| 0:07:03 | that happened uh for the next twenty years or so |
|---|
| 0:07:06 | and also in the features |
|---|
| 0:07:08 | which became |
|---|
| 0:07:09 | more or less standard by nineteen ninety or so |
|---|
| 0:07:12 | there were some really important methodology improvements by nineteen ninety; in earlier days |
|---|
| 0:07:17 | people did many experiments but it was very hard to compare them |
|---|
| 0:07:20 | and the notions of standard evaluations and standard datasets really took hold by nineteen ninety or so |
|---|
| 0:07:27 | and over all of these years |
|---|
| 0:07:29 | uh especially the last twenty or thirty years there have been continuous improvements |
|---|
| 0:07:33 | which were to some extent really closely related to moore's law |
|---|
| 0:07:36 | improvements in the technology |
|---|
| 0:07:38 | that is |
|---|
| 0:07:39 | um more and more computational capability |
|---|
| 0:07:42 | more and more storage capability |
|---|
| 0:07:44 | allowing people to work with very large datasets |
|---|
| 0:07:46 | and develop very large models to well represent those large datasets |
|---|
| 0:07:51 | so on |
|---|
| 0:07:53 | so |
|---|
| 0:07:54 | there's an elephant in the room, which is that things |
|---|
| 0:07:57 | are not entirely working still |
|---|
| 0:08:00 | and these systems in fact have converged |
|---|
| 0:08:02 | which was kind of a byproduct of all of these standard evaluations, which were |
|---|
| 0:08:06 | very good in many ways |
|---|
| 0:08:08 | but |
|---|
| 0:08:09 | when people found out that the other group |
|---|
| 0:08:11 | had something that they didn't, they would copy it, and very soon the systems would become very much the same |
|---|
| 0:08:18 | so |
|---|
| 0:08:19 | what are some of the remaining problems |
|---|
| 0:08:22 | well |
|---|
| 0:08:22 | systems still perform pretty poorly, despite a large amount of work on this, |
|---|
| 0:08:27 | in the presence of significant amounts of acoustic noise |
|---|
| 0:08:30 | also reverberation |
|---|
| 0:08:32 | which is natural for |
|---|
| 0:08:34 | just about any situation |
|---|
| 0:08:37 | uh unexpected speaking rate or accent |
|---|
| 0:08:39 | that is, by unexpected i mean something that is not well represented in the training set |
|---|
| 0:08:45 | uh unfamiliar topics |
|---|
| 0:08:47 | uh the language models bring us a lot of the performance that we have, and if you |
|---|
| 0:08:51 | don't have a particular topic represented in the language model, it can do poorly |
|---|
| 0:08:57 | and |
|---|
| 0:09:01 | apart from the recognition performance per se, how many words you get right, |
|---|
| 0:09:01 | another thing that's important is knowing whether you're right or wrong |
|---|
| 0:09:05 | and that's very important for practical applications |
|---|
| 0:09:08 | and that still needs some work as well |
|---|
| 0:09:12 | so it turns out that even some fairly simple speech recognition tasks can still fail under some of these conditions |
|---|
| 0:09:17 | yielding some strange results |
|---|
| 0:09:20 | [video clip: a comedy sketch about voice recognition technology; the audio is largely unintelligible in this transcript] |
|---|
| 0:10:45 | so that was funny |
|---|
| 0:10:47 | i hope you think it was funny but |
|---|
| 0:10:49 | what |
|---|
| 0:10:49 | hasn't worked in real life, as opposed to just the jokes, |
|---|
| 0:10:53 | and what has |
|---|
| 0:10:56 | so uh let me start off with |
|---|
| 0:10:58 | uh |
|---|
| 0:10:59 | some results from some of these standard evaluations i referred to |
|---|
| 0:11:03 | this is a graph that people in speech have seen a million times |
|---|
| 0:11:06 | uh |
|---|
| 0:11:07 | is this other one |
|---|
| 0:11:09 | um |
|---|
| 0:11:10 | for those of you who aren't familiar with this, the main thing to note is that W E |
|---|
| 0:11:14 | R stands for word error rate |
|---|
| 0:11:16 | a high word error rate is obviously bad; this is time on the axis |
|---|
| 0:11:20 | and each of these lines represents a series of tests |
|---|
| 0:11:23 | oh this is a kind of messy graph so let's clean it up a little |
|---|
| 0:11:26 | and |
|---|
| 0:11:27 | uh this is a task done in the early nineties uh called ATIS |
|---|
| 0:11:32 | and the main thing to see here, as with a lot of these, is that it starts off at a |
|---|
| 0:11:35 | pretty high error rate, people work for a while, |
|---|
| 0:11:38 | and after a while it gets down to a pretty reasonable error rate |
|---|
| 0:11:43 | let's go to another one, this was uh |
|---|
| 0:11:45 | conversational telephone speech |
|---|
| 0:11:47 | you have the same sort of effect, and do remember that this is a um |
|---|
| 0:11:52 | logarithmic scale here |
|---|
| 0:11:53 | so even though it looks like it hasn't come down very far, it really did come down pretty far, but after a |
|---|
| 0:11:58 | while it sort of levels off |
|---|
| 0:12:00 | uh more recently there's been a bunch of work on speech from meetings, which is also conversational |
|---|
| 0:12:05 | these are from the uh individual head-mounted microphones |
|---|
| 0:12:09 | so we still didn't have huge effects of background noise or reverberation or anything |
|---|
| 0:12:14 | and there wasn't actually a huge amount of progress after some of the initial work |
|---|
| 0:12:20 | uh now those were |
|---|
| 0:12:21 | research evaluations |
|---|
| 0:12:23 | uh for commercial products |
|---|
| 0:12:25 | i think |
|---|
| 0:12:26 | uh you know |
|---|
| 0:12:27 | a lot of information is proprietary |
|---|
| 0:12:29 | but i think what we can say is that |
|---|
| 0:12:31 | commercial products work some of the time for some people |
|---|
| 0:12:34 | and they often don't work |
|---|
| 0:12:35 | for others |
|---|
| 0:12:37 | so what is the state |
|---|
| 0:12:39 | well, the recognition systems will either |
|---|
| 0:12:42 | work really well for somebody |
|---|
| 0:12:44 | or they'll be terribly brittle and unreliable |
|---|
| 0:12:47 | uh i know that when my wife and i both tried uh dictation systems, they worked wonderfully for |
|---|
| 0:12:52 | her and terribly for me; i think i swallow my words or something |
|---|
| 0:12:57 | so here's an abbreviated review |
|---|
| 0:12:59 | of what was standard |
|---|
| 0:13:01 | by nineteen ninety-one |
|---|
| 0:13:03 | we had |
|---|
| 0:13:05 | uh feature extraction |
|---|
| 0:13:06 | basically being based on frames every ten milliseconds or so |
|---|
| 0:13:10 | computing |
|---|
| 0:13:11 | something from a short-term spectrum |
|---|
| 0:13:14 | uh things called mel-frequency cepstral coefficients |
|---|
| 0:13:17 | i'll |
|---|
| 0:13:18 | mention a bit more about that in a second |
|---|
| 0:13:20 | uh |
|---|
| 0:13:21 | P L P is another common method developed by then |
|---|
| 0:13:25 | delta cepstra |
|---|
| 0:13:26 | uh |
|---|
| 0:13:26 | uh essentially temporal derivatives of the cepstra |
|---|
| 0:13:30 | and on the statistical side |
|---|
| 0:13:32 | uh acoustic modeling hidden markov models were quite standard |
|---|
| 0:13:36 | they typically by this point represented |
|---|
| 0:13:38 | context-dependent phone units, or phoneme-like units |
|---|
| 0:13:42 | uh the language models were pretty much by this time all statistical |
|---|
| 0:13:46 | and they represented context-dependent words |
|---|
| 0:13:50 | so all this was there by nineteen ninety-one |
|---|
| 0:13:52 | now let's move to two thousand eleven |
|---|
| 0:13:56 | there it is |
|---|
| 0:13:58 | uh |
|---|
| 0:13:59 | notice all the changes |
|---|
| 0:14:02 | okay that's a little unfair |
|---|
| 0:14:04 | uh people have actually done work in the last twenty years |
|---|
| 0:14:07 | and this is |
|---|
| 0:14:09 | a representation of a lot of it, i think |
|---|
| 0:14:11 | and these have had big effects |
|---|
| 0:14:13 | i don't mean to minimize |
|---|
| 0:14:14 | them |
|---|
| 0:14:15 | uh various kinds of normalisation: uh mean and variance kinds of normalisation |
|---|
| 0:14:20 | uh an online version of that that we call rasta |
|---|
| 0:14:23 | uh vocal tract length normalisation which |
|---|
| 0:14:26 | compresses or expands the spectrum and |
|---|
| 0:14:29 | in such a way is as to match the models better |
|---|
| 0:14:33 | um |
|---|
| 0:14:34 | and uh then |
|---|
| 0:14:35 | adaptation and feature transformation |
|---|
| 0:14:38 | uh either adapting better to a test set that's somewhat different from the training set |
|---|
| 0:14:42 | uh or uh |
|---|
| 0:14:44 | various changes to make the features more discriminative |
|---|
| 0:14:49 | discriminative training |
|---|
| 0:14:51 | actually |
|---|
| 0:14:52 | uh changing the statistical models |
|---|
| 0:14:55 | in such a way as to make them more discriminant between different speech sounds |
|---|
| 0:14:59 | we did have more and more data over the years, and that required |
|---|
| 0:15:03 | lots of work to figure out how to handle that |
|---|
| 0:15:05 | but aside from handling it was also taking advantage of lots of data |
|---|
| 0:15:09 | which didn't come for free, so there was lots of engineering work there |
|---|
| 0:15:14 | uh people found that |
|---|
| 0:15:15 | combining systems helped and sometimes combining |
|---|
| 0:15:18 | pieces of systems helped |
|---|
| 0:15:20 | and that's been an important thing in improving uh performance |
|---|
| 0:15:24 | and because |
|---|
| 0:15:25 | uh speech recognition was starting to go into applications you had to be concerned about speed |
|---|
| 0:15:30 | and there's been a lot of work on that |
|---|
| 0:15:33 | well, a bit more on some of this uh |
|---|
| 0:15:35 | the main point uh about mel cepstrum and plp i wanna make is that |
|---|
| 0:15:40 | each of them uses this kind of warped frequency scale |
|---|
| 0:15:43 | uh in which you have better resolution at low frequencies than at high frequencies |
|---|
| 0:15:47 | "'cause" our perception of different uh |
|---|
| 0:15:50 | speech sounds is very different at low frequencies versus high frequencies |
|---|
| 0:15:53 | mel cepstrum and plp use different mechanisms |
|---|
| 0:15:57 | for getting a smooth spectrum uh |
|---|
| 0:16:00 | delta cepstrum uh |
|---|
| 0:16:02 | uh as i said is basically |
|---|
| 0:16:05 | uh time derivatives uh of the cepstrum |
|---|
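To make the front end just described concrete, here is a minimal sketch of mel-warped cepstral features with delta (temporal-derivative) features. It is an illustration only: the filter shapes, frame length, sampling rate, and filter counts below are assumptions chosen for the example, not values from the talk, and production front ends differ in many details.

```python
# Minimal sketch: frame -> power spectrum -> mel filterbank -> log -> DCT
# (cepstrum) -> delta features. Constants are illustrative assumptions.
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters spaced evenly on the warped (mel) axis, giving finer
    # resolution at low frequencies than at high frequencies.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc_frame(frame, fb, n_ceps=13):
    spectrum = np.abs(np.fft.rfft(frame)) ** 2             # short-term power spectrum
    log_energies = np.log(fb @ spectrum + 1e-10)            # warped, smoothed spectrum
    n = len(log_energies)
    dct = np.cos(np.pi / n * (np.arange(n) + 0.5)[None, :] * np.arange(n_ceps)[:, None])
    return dct @ log_energies                               # cepstral coefficients

def deltas(cepstra, window=2):
    # Simple regression-based temporal derivative over +/- `window` frames.
    denom = 2 * sum(d * d for d in range(1, window + 1))
    padded = np.pad(cepstra, ((window, window), (0, 0)), mode="edge")
    t = len(cepstra)
    return sum(d * (padded[window + d:t + window + d] - padded[window - d:t + window - d])
               for d in range(1, window + 1)) / denom

fb = mel_filterbank(n_filters=26, n_fft=400, sample_rate=16000)
frames = np.random.randn(50, 400) * np.hamming(400)         # 50 illustrative 25 ms frames
ceps = np.array([mfcc_frame(f, fb) for f in frames])
features = np.hstack([ceps, deltas(ceps)])                  # static + delta features
print(features.shape)
```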
| 0:16:09 | um |
|---|
| 0:16:10 | hidden markov model this is a graphical form of it |
|---|
| 0:16:13 | and main thing to see here this is a a |
|---|
| 0:16:16 | a statistical dependency graph |
|---|
| 0:16:18 | uh and |
|---|
| 0:16:20 | say X three is only dependent on the current state |
|---|
| 0:16:24 | each of these |
|---|
| 0:16:25 | time steps |
|---|
| 0:16:26 | uh |
|---|
| 0:16:27 | are represented here |
|---|
| 0:16:29 | and if you know Q three |
|---|
| 0:16:31 | uh then Q two, Q one, X one, X two tell you nothing about X three |
|---|
| 0:16:35 | so that's a very very strong statistical conditional independence model |
|---|
| 0:16:40 | and that's pretty much what people have used in these |
|---|
| 0:16:43 | are now standard systems |
|---|
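In equation form, the conditional-independence assumptions in that dependency graph can be written as follows (a standard statement of the HMM assumptions, not an equation taken from the slides):

```latex
P(x_t \mid q_1,\dots,q_t,\; x_1,\dots,x_{t-1}) \;=\; P(x_t \mid q_t),
\qquad
P(q_t \mid q_1,\dots,q_{t-1}) \;=\; P(q_t \mid q_{t-1})
```

so, given q3, the earlier states and observations tell you nothing more about x3.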
| 0:16:45 | this is my only equation |
|---|
| 0:16:47 | and uh those of you in speech will go oh yeah, in fact probably |
|---|
| 0:16:50 | most people say oh yeah |
|---|
| 0:16:52 | this |
|---|
| 0:16:53 | basically bayes rule |
|---|
| 0:16:55 | the idea is that |
|---|
| 0:16:56 | in a statistical system |
|---|
| 0:16:58 | you want to pick the model |
|---|
| 0:16:59 | that is most probable given the data |
|---|
| 0:17:02 | and bayes' rule says you can expand it in this way |
|---|
| 0:17:05 | and then you can get rid of the P of X because there's no dependence on the model |
|---|
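Spelled out, that one equation is the usual Bayes decision rule: pick the model M that is most probable given the acoustics X,

```latex
\hat{M} \;=\; \arg\max_{M} P(M \mid X)
        \;=\; \arg\max_{M} \frac{P(X \mid M)\,P(M)}{P(X)}
        \;=\; \arg\max_{M} P(X \mid M)\,P(M)
```

where the denominator P(X) can be dropped because it does not depend on the model.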
| 0:17:12 | um |
|---|
| 0:17:12 | so |
|---|
| 0:17:13 | you realise these |
|---|
| 0:17:14 | uh likelihoods, |
|---|
| 0:17:16 | the probability of the acoustics given the model, with mixtures of gaussians typically |
|---|
| 0:17:21 | you typically have each gaussian just represented by means and variances; there's no covariance represented between the features |
|---|
| 0:17:29 | and there's the weights of each of the gaussians |
|---|
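As an illustration of that kind of emission model, here is a minimal sketch of a diagonal-covariance Gaussian mixture log-likelihood; the parameter values are placeholders for the example, not trained numbers.

```python
# Diagonal-covariance GMM emission likelihood: log sum_k w_k N(x; mu_k, diag(var_k)).
import numpy as np

def log_gmm_likelihood(x, weights, means, variances):
    x = np.asarray(x, dtype=float)
    log_components = []
    for w, mu, var in zip(weights, means, variances):
        # Diagonal covariance: no correlations between features are modeled.
        log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * var))
        log_quad = -0.5 * np.sum((x - mu) ** 2 / var)
        log_components.append(np.log(w) + log_norm + log_quad)
    m = max(log_components)                  # log-sum-exp for numerical stability
    return m + np.log(sum(np.exp(c - m) for c in log_components))

# Two mixture components over a 3-dimensional feature vector (illustrative values).
weights = [0.6, 0.4]
means = [np.zeros(3), np.ones(3)]
variances = [np.ones(3), 2.0 * np.ones(3)]
print(log_gmm_likelihood([0.1, -0.2, 0.3], weights, means, variances))
```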
| 0:17:31 | the language priors |
|---|
| 0:17:32 | P of M |
|---|
| 0:17:34 | are uh |
|---|
| 0:17:35 | implemented with an n-gram |
|---|
| 0:17:37 | you do a bunch of counting, you do some smoothing |
|---|
| 0:17:40 | and it's basically a probability of a word given some word history, such as the recent |
|---|
| 0:17:46 | n minus one words |
|---|
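As a minimal sketch of that counting-and-smoothing idea, here is a bigram model with add-one smoothing; real systems use higher-order n-grams, far more data, and better smoothing methods, so this is only an illustration.

```python
# Bigram language model estimated by counting, with add-one smoothing.
from collections import Counter

def train_bigram(sentences):
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        unigrams.update(padded[:-1])             # history counts
        bigrams.update(zip(padded[:-1], padded[1:]))
    vocab_size = len(set(w for s in sentences for w in s)) + 2

    def prob(word, prev):
        # P(word | previous word), add-one smoothed
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

    return prob

prob = train_bigram([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(prob("cat", "the"), prob("dog", "the"))
```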
| 0:17:49 | now i |
|---|
| 0:17:50 | the math is lovely but in practice we actually raise each of these things to some kind of power |
|---|
| 0:17:55 | this is to compensate for the fact that the models are not quite right |
|---|
| 0:17:58 | and that uh |
|---|
| 0:18:00 | there really are other dependencies |
|---|
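In practice, then, what gets maximized looks more like the following, where the language model weight (written here as lambda, just an illustrative name) and often a word insertion penalty are tuned empirically to compensate for those broken assumptions:

```latex
\hat{M} \;=\; \arg\max_{M}\; \log P(X \mid M) \;+\; \lambda\,\log P(M)
```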
| 0:18:04 | um |
|---|
| 0:18:04 | this is a picture of the acoustic likelihood |
|---|
| 0:18:07 | uh uh uh uh estimator |
|---|
| 0:18:09 | there's a few steps in here each of these boxes can actually be fairly complicated but |
|---|
| 0:18:14 | just generally speaking |
|---|
| 0:18:15 | there's some kind of short-term spectral estimation |
|---|
| 0:18:19 | there's this vocal tract length normalisation i mentioned, which compresses or expands the spectrum |
|---|
| 0:18:24 | then some kind of smoothing, either by |
|---|
| 0:18:26 | uh throwing away some of the upper cepstral coefficients or by autoregressive modeling as is done in P L P |
|---|
| 0:18:33 | there's various kinds of linear transformations for instance for dimensionality reduction |
|---|
| 0:18:38 | uh and for discrimination better discrimination |
|---|
| 0:18:41 | then there's the statistical engine |
|---|
| 0:18:43 | that i mentioned before with this funny scaling um |
|---|
| 0:18:46 | in the log domain or raising to a power |
|---|
| 0:18:49 | in order to mix it with the |
|---|
| 0:18:50 | uh language model |
|---|
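A schematic sketch of that front-end chain, with each stage reduced to its simplest possible form, might look like the following; the warp factor, the number of retained cepstra, and the projection matrix are illustrative assumptions rather than values from any real system.

```python
# Schematic front end: short-term spectral estimation -> VTLN-style frequency
# warping -> smoothing by keeping low-order cepstra -> linear transform.
import numpy as np

def short_term_spectrum(frame):
    return np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2

def vtln(spectrum, alpha=1.1):
    # Compress (alpha > 1) or expand (alpha < 1) the frequency axis so a
    # speaker's spectrum better matches the models.
    n = len(spectrum)
    warped_bins = np.clip(np.arange(n) * alpha, 0, n - 1)
    return np.interp(warped_bins, np.arange(n), spectrum)

def smooth_cepstra(spectrum, n_keep=13):
    # Smoothing by discarding the higher-order cepstral coefficients.
    cepstrum = np.fft.irfft(np.log(spectrum + 1e-10))
    return cepstrum[:n_keep]

def linear_transform(features, projection):
    # Stand-in for LDA/HLDA-style transforms for reduction or discrimination.
    return projection @ features

frame = np.random.randn(400)                     # one 25 ms frame at 16 kHz
feats = smooth_cepstra(vtln(short_term_spectrum(frame)))
projection = np.random.randn(10, len(feats))     # placeholder transform
print(linear_transform(feats, projection).shape)
```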
| 0:18:52 | okay well that seems simple enough but |
|---|
| 0:18:54 | actual systems that get the very best scores are a bit more complicated than this |
|---|
| 0:18:58 | uh there's well |
|---|
| 0:18:59 | first off there's the decoder and the language priors coming in |
|---|
| 0:19:03 | um |
|---|
| 0:19:05 | well you might have two of these front ends |
|---|
| 0:19:08 | and |
|---|
| 0:19:09 | people found that this is very helpful for getting the best performance |
|---|
| 0:19:13 | but you don't just put "'em" in in a very simple way |
|---|
| 0:19:17 | it's very often the case that you have all sorts of stages |
|---|
| 0:19:20 | with uh |
|---|
| 0:19:22 | C W here is crossword |
|---|
| 0:19:24 | or non-crossword models, and you produce graphs or lattices and you combine them at different points and you cross- |
|---|
| 0:19:30 | adapt |
|---|
| 0:19:32 | well |
|---|
| 0:19:33 | this kind of reminds me of some work |
|---|
| 0:19:36 | by a uh |
|---|
| 0:19:37 | a berkeley grad of uh about a century ago named rube goldberg |
|---|
| 0:19:41 | and this is the self-operating napkin |
|---|
| 0:19:44 | the self-operating napkin is activated when the soup spoon A is raised to mouth |
|---|
| 0:19:50 | uh pulling string B and thereby jerking ladle C |
|---|
| 0:19:54 | which throws cracker D past parrot E |
|---|
| 0:19:57 | uh parrot jumps after cracker and perch F tilts |
|---|
| 0:20:01 | which uh upsets seeds G into pail H |
|---|
| 0:20:06 | the extra weight in the pail pulls the cord I which opens and |
|---|
| 0:20:10 | uh which lights the cigarette lighter J |
|---|
| 0:20:13 | and this uh |
|---|
| 0:20:14 | in turn lights the rocket, which pulls the sickle, which cuts the string |
|---|
| 0:20:19 | which |
|---|
| 0:20:20 | causes the pendulum to swing back and forth |
|---|
| 0:20:22 | thereby wiping the chin |
|---|
| 0:20:25 | and this, for me, |
|---|
| 0:20:26 | typifies my view of current speech recognition systems |
|---|
| 0:20:32 | it's successful at wiping the chin sometimes |
|---|
| 0:20:35 | so i wanna talk a little bit about alternatives |
|---|
| 0:20:37 | and i wanna say at the outset |
|---|
| 0:20:40 | that these are just some of the alternatives |
|---|
| 0:20:42 | a conference like this has uh a lot of work |
|---|
| 0:20:45 | happily |
|---|
| 0:20:46 | uh in in many different directions |
|---|
| 0:20:48 | these are the ones i wanted to give as examples |
|---|
| 0:20:52 | but first i wanna say |
|---|
| 0:20:54 | a little bit |
|---|
| 0:20:55 | about |
|---|
| 0:20:57 | what else is there |
|---|
| 0:20:58 | besides the mainstream |
|---|
| 0:21:02 | the great sage nasrudin |
|---|
| 0:21:04 | was tracked down by a seeker |
|---|
| 0:21:06 | and the seeker asked the sage |
|---|
| 0:21:09 | what is the secret to happiness |
|---|
| 0:21:12 | sage answered |
|---|
| 0:21:13 | good judgement |
|---|
| 0:21:16 | well the seeker said that's |
|---|
| 0:21:17 | that's all very well |
|---|
| 0:21:18 | master, but |
|---|
| 0:21:20 | how does one obtain good judgement |
|---|
| 0:21:23 | and the master said |
|---|
| 0:21:24 | from experience |
|---|
| 0:21:27 | so the seeker said okay, experience |
|---|
| 0:21:30 | but |
|---|
| 0:21:31 | how does one obtain this experience |
|---|
| 0:21:34 | and the master said |
|---|
| 0:21:35 | bad judgement |
|---|
| 0:21:39 | so |
|---|
| 0:21:40 | here's some of the exercises that we and many other people have done in bad judgement |
|---|
| 0:21:44 | we've pursued |
|---|
| 0:21:46 | different signal representations |
|---|
| 0:21:48 | uh some of them are related to perception |
|---|
| 0:21:50 | to auditory models, for instance |
|---|
| 0:21:53 | mean rate and synchrony, that's seneff's model from some time ago |
|---|
| 0:21:57 | uh and the ensemble interval histogram |
|---|
| 0:22:00 | from uh ghitza |
|---|
| 0:22:02 | i each of these |
|---|
| 0:22:03 | were |
|---|
| 0:22:04 | related to models of neural firing |
|---|
| 0:22:08 | uh how |
|---|
| 0:22:09 | how fast they fire, how much they synchronise with one another |
|---|
| 0:22:12 | what uh timing there was between the firings |
|---|
| 0:22:15 | and they had some interesting performance in noise uh they |
|---|
| 0:22:19 | have not been adopted in any serious way |
|---|
| 0:22:22 | but |
|---|
| 0:22:23 | there's interesting technology there and interesting scientific models |
|---|
| 0:22:27 | then there's stuff that's more on the psychological side; these were sort of based on models of |
|---|
| 0:22:32 | physiology |
|---|
| 0:22:33 | uh then there are models uh |
|---|
| 0:22:36 | really from the psychological side, multi-band systems based on critical bands, going all the way back to |
|---|
| 0:22:42 | fletcher's work and work of others |
|---|
| 0:22:44 | uh and |
|---|
| 0:22:46 | uh the idea here is that if you have a system that's just looking at part of the spectrum |
|---|
| 0:22:50 | if the disturbance is in that part of the spectrum |
|---|
| 0:22:53 | uh then you can deal with that separately |
|---|
| 0:22:56 | those have had some successes |
|---|
| 0:22:58 | and then something that uh |
|---|
| 0:23:00 | you can observe both at the physiological and psychological level |
|---|
| 0:23:04 | is the importance of different um modulations |
|---|
| 0:23:08 | particularly temporal but also spectral modulations in the signal |
|---|
| 0:23:13 | uh then on the production side there's been a bunch of work by people on, |
|---|
| 0:23:17 | uh given the fact that there are only a few articulatory uh mechanisms, |
|---|
| 0:23:22 | uh maybe you can represent things that way and it would be more parsimonious and |
|---|
| 0:23:26 | a better |
|---|
| 0:23:27 | representation of the signal; you want to |
|---|
| 0:23:29 | represent this over time and there have been |
|---|
| 0:23:31 | hidden dynamic |
|---|
| 0:23:32 | uh models that attempt to do this and |
|---|
| 0:23:35 | trajectory models sometimes the trajectory models had nothing to do with the physiological models but |
|---|
| 0:23:40 | uh sometimes they did |
|---|
| 0:23:43 | and articulatory features, which you could think of as a quantized version of the articulator positions and so forth |
|---|
| 0:23:51 | then another direction was artificial neural networks, which have been around for a very long time |
|---|
| 0:23:57 | um |
|---|
| 0:23:58 | actually before nineteen sixty one but |
|---|
| 0:24:00 | i picked out this one discriminant analysis iterative design |
|---|
| 0:24:04 | i picked that out "'cause" a lot of people don't know about it; a lot of people think that |
|---|
| 0:24:07 | multilayer networks began in the eighties |
|---|
| 0:24:10 | but actually back in sixty-one they had a multilayer network that worked very well for some problems; it was actually |
|---|
| 0:24:15 | used industrially |
|---|
| 0:24:16 | for a while after that |
|---|
| 0:24:19 | um in which the first uh layer of units was a bunch of gaussians, and after that you had |
|---|
| 0:24:24 | a linear perceptron |
|---|
| 0:24:27 | a couple years later there was work at stanford |
|---|
| 0:24:30 | in which they actually did apply some of this kind of stuff to speech; these were actually linear adaptive units |
|---|
| 0:24:35 | actually called adalines |
|---|
| 0:24:37 | uh bernie widrow sent me uh |
|---|
| 0:24:39 | a technical report |
|---|
| 0:24:40 | it's of historical interest, it's the cover of a real technical report from nineteen sixty-three |
|---|
| 0:24:46 | here's a page from it that shows a |
|---|
| 0:24:48 | uh a block diagram that i blew up here |
|---|
| 0:24:51 | a bit |
|---|
| 0:24:52 | it starts off with some band filters, basically getting some power measures in each band |
|---|
| 0:24:57 | and then here are these adalines, which uh give you some sets of outputs |
|---|
| 0:25:02 | which went to a typewriter |
|---|
| 0:25:06 | um |
|---|
| 0:25:07 | the nineteen eighties saw an explosion of interest in the neural network |
|---|
| 0:25:11 | uh |
|---|
| 0:25:11 | area |
|---|
| 0:25:13 | uh part of this |
|---|
| 0:25:14 | was sparked by |
|---|
| 0:25:16 | a rediscovery, say, of error back propagation |
|---|
| 0:25:20 | just basically propagating the effect of errors from the output of the system |
|---|
| 0:25:24 | back to the individual weights |
|---|
| 0:25:27 | uh in the late eighties uh number of us worked on hybrid hmm artificial neural network systems |
|---|
| 0:25:34 | where the neural networks were used as probability estimators to get the emission uh probabilities for the hmm |
|---|
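The usual way this hybrid is formulated: the network estimates state posteriors, which are converted to scaled likelihoods for the HMM by dividing by the state priors,

```latex
P(x_t \mid q_t) \;\propto\; \frac{P(q_t \mid x_t)}{P(q_t)}
```

where P(q_t | x_t) is the network output for state (or phone) q_t and P(q_t) is its prior, typically estimated from the training alignments.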
| 0:25:41 | um |
|---|
| 0:25:41 | last decade or so uh quite a few people have taken off on the tandem idea |
|---|
| 0:25:46 | which is a particular way of using artificial neural networks |
|---|
| 0:25:50 | as feature extractors |
|---|
| 0:25:52 | and i will just mention uh briefly |
|---|
| 0:25:55 | uh a fairly recent development, deep networks |
|---|
| 0:25:59 | and |
|---|
| 0:26:00 | how uh |
|---|
| 0:26:02 | how innovative it is, is the question |
|---|
| 0:26:04 | but there's definitely some new things going on there which i think are interesting |
|---|
| 0:26:09 | uh |
|---|
| 0:26:10 | the obvious difference between these and the previous networks tends to be more layers, that they're deep |
|---|
| 0:26:15 | there's also sometimes an unsupervised pre-training |
|---|
| 0:26:20 | uh |
|---|
| 0:26:21 | there's actually several papers at this conference there's also a special issue |
|---|
| 0:26:24 | uh in uh november of the transactions |
|---|
| 0:26:28 | um here's a couple papers at this conference, i think there's a few others, as well as one from the |
|---|
| 0:26:32 | [affiliation unintelligible] |
|---|
| 0:26:34 | they had a lot of different numbers in the paper, but uh i picked one out |
|---|
| 0:26:38 | and just |
|---|
| 0:26:40 | well, most of the numbers had the same general trend |
|---|
| 0:26:43 | mfcc |
|---|
| 0:26:44 | bad |
|---|
| 0:26:45 | deep mlp good |
|---|
| 0:26:47 | uh and the old mlp somewhere in between |
|---|
| 0:26:50 | these are error rates, so again uh low is good |
|---|
| 0:26:54 | and uh there is a large vocabulary um |
|---|
| 0:26:58 | voice search |
|---|
| 0:26:59 | uh paper which uh |
|---|
| 0:27:01 | is at the poster session today |
|---|
| 0:27:03 | uh it had a sixteen percent, uh, their metric was sentence error reduction |
|---|
| 0:27:08 | and they had a nice improvement compared to |
|---|
| 0:27:10 | a system that used uh MPE, which is a very common discriminative training |
|---|
| 0:27:15 | approach |
|---|
| 0:27:20 | okay |
|---|
| 0:27:20 | so those were some of the alternatives; again, i'm sure |
|---|
| 0:27:24 | many people in this audience could think of many more |
|---|
| 0:27:29 | where could we go from here |
|---|
| 0:27:30 | or |
|---|
| 0:27:31 | in my opinion, where should we go from here |
|---|
| 0:27:35 | well |
|---|
| 0:27:36 | better features and models |
|---|
| 0:27:39 | um |
|---|
| 0:27:40 | i've suggested |
|---|
| 0:27:41 | better models of hearing in production |
|---|
| 0:27:44 | uh could perhaps lead to better features |
|---|
| 0:27:48 | uh better models of these features |
|---|
| 0:27:50 | better acoustic models |
|---|
| 0:27:53 | models of understanding better language models dialogue models pragmatics and so on |
|---|
| 0:27:58 | all these are likely to be important |
|---|
| 0:28:01 | the other thing which i'm gonna go into a bit especially at the end is understanding the errors |
|---|
| 0:28:06 | understanding what the assumptions are |
|---|
| 0:28:08 | that are going into our models |
|---|
| 0:28:10 | and how to get past them |
|---|
| 0:28:15 | so we start with models of hearing |
|---|
| 0:28:17 | so there are |
|---|
| 0:28:19 | useful approximations to the action of the periphery, that is uh |
|---|
| 0:28:23 | uh from the ear |
|---|
| 0:28:25 | to the auditory nerve |
|---|
| 0:28:27 | and when i say useful approximations i mean that there are a number of people who've worked |
|---|
| 0:28:33 | on |
|---|
| 0:28:34 | simplifying the models that were used earlier |
|---|
| 0:28:38 | and |
|---|
| 0:28:39 | crafting them more towards |
|---|
| 0:28:41 | uh good engineering |
|---|
| 0:28:43 | tools |
|---|
| 0:28:44 | some of those are looking kind of promising |
|---|
| 0:28:47 | uh there's new information about the auditory cortex, which i'm gonna briefly refer to in the |
|---|
| 0:28:51 | next few slides |
|---|
| 0:28:53 | including some results with noise |
|---|
| 0:28:56 | um |
|---|
| 0:28:57 | it's good to learn from a biological examples because uh you know humans are pretty good in many situations that |
|---|
| 0:29:03 | at recognizing speech |
|---|
| 0:29:05 | but |
|---|
| 0:29:06 | it's |
|---|
| 0:29:06 | probably good also not to be purist |
|---|
| 0:29:08 | and to mix |
|---|
| 0:29:09 | insights that you get from these things with good engineering approaches |
|---|
| 0:29:13 | and i i i think there's some |
|---|
| 0:29:15 | uh good possibilities there |
|---|
| 0:29:17 | uh this bottom bullet it is just to note that |
|---|
| 0:29:20 | as with many things in this talk, i'm only talking about some of the field |
|---|
| 0:29:24 | and i'm mostly talking about single channel |
|---|
| 0:29:26 | but uh people have two ears and they make pretty good use of them when they work |
|---|
| 0:29:31 | uh and that's |
|---|
| 0:29:33 | something to keep in mind |
|---|
| 0:29:34 | and of course you can go to many ears in some situations with microphone arrays, and that's a good thing |
|---|
| 0:29:39 | to |
|---|
| 0:29:39 | think about |
|---|
| 0:29:40 | that's not a topic i'm expanding on in this talk |
|---|
| 0:29:44 | and the same thing with visual information visual information is used by people whenever they can |
|---|
| 0:29:49 | uh and i'm not gonna talk about that, but it's obviously important |
|---|
| 0:29:53 | okay, i'm gonna talk about this uh cortical stuff |
|---|
| 0:29:58 | uh the slide is courtesy of uh shihab shamma; it's not just the slide but also the idea |
|---|
| 0:30:03 | uh and the idea comes from experiments that uh he and his guys |
|---|
| 0:30:09 | and gals |
|---|
| 0:30:10 | have |
|---|
| 0:30:11 | uh done with a small mammals |
|---|
| 0:30:14 | uh that have |
|---|
| 0:30:16 | a pretty similar |
|---|
| 0:30:17 | early part of the cortex |
|---|
| 0:30:19 | uh the primary auditory cortex |
|---|
| 0:30:21 | to what people have |
|---|
| 0:30:23 | there's also been some other work with people |
|---|
| 0:30:25 | uh and |
|---|
| 0:30:27 | these |
|---|
| 0:30:27 | uh |
|---|
| 0:30:28 | you can envision this as being the kind of spectrogram that's received at this primary auditory cortex |
|---|
| 0:30:35 | what they've observed is that there's a bunch of what are called spectro-temporal receptive fields, S T R Fs |
|---|
| 0:30:41 | which are little filters |
|---|
| 0:30:42 | that process it in time and frequency |
|---|
| 0:30:46 | and you could think of them as processing temporal modulations, which are called rate, and spectral modulations, which are called scale |
|---|
| 0:30:53 | and you imagine there being a cube |
|---|
| 0:30:55 | at each time point |
|---|
| 0:30:57 | with auditory frequency |
|---|
| 0:30:59 | and uh |
|---|
| 0:31:00 | rate and scale |
|---|
| 0:31:02 | and much as you would like to be able to, in a regular spectrogram, |
|---|
| 0:31:07 | uh |
|---|
| 0:31:08 | de-emphasise the areas where the signal-to-noise was poor |
|---|
| 0:31:11 | and emphasise areas where the signal-to-noise was good |
|---|
| 0:31:14 | you have perhaps an even greater chance |
|---|
| 0:31:16 | to do this kind of emphasis |
|---|
| 0:31:19 | uh if you've expanded out to this cube |
|---|
| 0:31:22 | that's the general idea |
|---|
| 0:31:23 | so you could end up with a lot of these different spectrotemporal receptive fields |
|---|
| 0:31:27 | you could implement them and you could try to do something good with them pick out a good |
|---|
| 0:31:32 | an implementation that we and a number of people have been trying |
|---|
| 0:31:36 | is |
|---|
| 0:31:37 | a |
|---|
| 0:31:38 | uh what we would call a many-stream |
|---|
| 0:31:41 | uh implementation |
|---|
| 0:31:43 | as opposed to multi-stream, which uh was what was shown before where we'd have two or three streams; it just |
|---|
| 0:31:49 | refers to the quantity |
|---|
| 0:31:50 | but what's in each stream is |
|---|
| 0:31:53 | one of these representations, one of these spectro-temporal receptive fields, implemented by a gabor filter |
|---|
| 0:31:57 | and by a multilayer perceptron |
|---|
| 0:31:59 | that's discriminatively trained to discriminate between different speech sounds |
|---|
| 0:32:04 | you get a whole lot of these; in some of our implementations we had three hundred |
|---|
| 0:32:07 | uh and then you have to figure out how to combine them or select them |
|---|
| 0:32:11 | hopefully again to de-emphasise the ones that are uh bad indicators of what was said |
|---|
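As a minimal sketch of one such stream, here is a 2-D Gabor filter used as a spectro-temporal receptive field and correlated with a spectrogram-like input; the rate, scale, and size parameters are illustrative assumptions, and a real many-stream system would use a whole bank of these, each feeding its own discriminatively trained MLP.

```python
# One spectro-temporal "receptive field" as a 2-D Gabor filter applied to a
# (frequency x time) representation. Parameters are illustrative only.
import numpy as np
from scipy.signal import correlate2d

def gabor_strf(n_freq=15, n_time=15, rate=0.25, scale=0.25):
    # rate  ~ temporal modulation (cycles per frame step)
    # scale ~ spectral modulation (cycles per frequency channel)
    t = np.arange(n_time) - n_time // 2
    f = np.arange(n_freq) - n_freq // 2
    T, F = np.meshgrid(t, f)
    envelope = np.exp(-(T ** 2) / (2 * (n_time / 4) ** 2)
                      - (F ** 2) / (2 * (n_freq / 4) ** 2))
    carrier = np.cos(2 * np.pi * (rate * T + scale * F))
    g = envelope * carrier
    return g - g.mean()             # zero mean: responds to modulations, not level

def apply_strf(spectrogram, strf):
    # Valid 2-D correlation; the output of each filter is one "stream".
    return correlate2d(spectrogram, strf, mode="valid")

spectrogram = np.random.rand(40, 200)    # 40 frequency channels x 200 frames
print(apply_strf(spectrogram, gabor_strf()).shape)
```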
| 0:32:19 | so |
|---|
| 0:32:20 | another interesting side light of this kind of approach |
|---|
| 0:32:23 | is that it's a good fit to modern high speed computing that it's |
|---|
| 0:32:27 | as i think a lot of you know |
|---|
| 0:32:29 | the clock rates and or long going up the way they used to other cpus use |
|---|
| 0:32:33 | and so the way that manufacturers are trying to give us more performances by having many more core |
|---|
| 0:32:38 | the graphics processors are an extreme example of this |
|---|
| 0:32:41 | this kind of structure is a really good match to that |
|---|
| 0:32:44 | uh because it's it's what they call an embarrassingly parallel |
|---|
| 0:32:48 | um we found that this room this kind of approach does remove a significant number of errors particularly and noise |
|---|
| 0:32:54 | but also a as it turns out in the clean condition |
|---|
| 0:32:58 | um |
|---|
| 0:32:59 | it combines well with pure engineering not auditory |
|---|
| 0:33:02 | kind of methods |
|---|
| 0:33:03 | uh such as wiener filter based methods |
|---|
| 0:33:06 | and we'd like to think that it could combine well with other auditory models, although we haven't really done that |
|---|
| 0:33:11 | work yet |
|---|
| 0:33:14 | um |
|---|
| 0:33:15 | statistical |
|---|
| 0:33:16 | acoustic models |
|---|
| 0:33:19 | uh we currently use these critical assumptions |
|---|
| 0:33:22 | and one of things about using very different kinds of features is that this can really change their statistical properties |
|---|
| 0:33:27 | from the ones we have now |
|---|
| 0:33:29 | and so these assumptions |
|---|
| 0:33:31 | could be violated in yet different ways |
|---|
| 0:33:35 | uh there have been alternative models proposed that allow you to bypass these typical assumptions |
|---|
| 0:33:41 | but part of the problem is to figure out |
|---|
| 0:33:43 | which statistical dependencies to put in |
|---|
| 0:33:48 | um models of language an understanding |
|---|
| 0:33:51 | i think it's probably pretty clear to those of you who know me that this isn't my research area |
|---|
| 0:33:55 | but it's of obvious importance |
|---|
| 0:33:57 | and |
|---|
| 0:33:58 | one of the things that uh |
|---|
| 0:34:01 | has been frustrating to a lot of people; in fact i remember fred jelinek being visibly frustrated about this |
|---|
| 0:34:07 | is that |
|---|
| 0:34:08 | it's very very tough to get much improvement |
|---|
| 0:34:10 | over simple n-grams, that is, a probability of a word given some number of previous words |
|---|
| 0:34:16 | but |
|---|
| 0:34:16 | it can be very important |
|---|
| 0:34:18 | to |
|---|
| 0:34:19 | get further information |
|---|
| 0:34:21 | and we know this for sure for people |
|---|
| 0:34:25 | let me tell you a little story |
|---|
| 0:34:27 | uh one day |
|---|
| 0:34:29 | i was walking out of icsi |
|---|
| 0:34:31 | and i had on one of these caps; this is a cap for the oakland athletics, the local |
|---|
| 0:34:37 | major league baseball club |
|---|
| 0:34:39 | i also had on a jacket |
|---|
| 0:34:41 | that had the same insignia on it |
|---|
| 0:34:44 | and i had a radio |
|---|
| 0:34:45 | held to my head as i was walking down the street |
|---|
| 0:34:49 | and a guy across the street |
|---|
| 0:34:50 | moderately noisy street |
|---|
| 0:34:51 | yelled |
|---|
| 0:34:52 | score? |
|---|
| 0:34:55 | and i said |
|---|
| 0:34:56 | oakland, five to three |
|---|
| 0:35:00 | anyway |
|---|
| 0:35:01 | we'd like to be able to do that with a machine |
|---|
| 0:35:06 | so where we go from here |
|---|
| 0:35:09 | well |
|---|
| 0:35:10 | researchers will continue to get good ideas |
|---|
| 0:35:13 | uh |
|---|
| 0:35:15 | every time you get in the shower, maybe you have a good idea coming out |
|---|
| 0:35:20 | but |
|---|
| 0:35:20 | what's the best methodology |
|---|
| 0:35:22 | what's the best way to proceed along this path |
|---|
| 0:35:25 | so maybe we can learn from some other disciplines |
|---|
| 0:35:29 | and let me give |
|---|
| 0:35:30 | uh a kind of stretched analogy to |
|---|
| 0:35:33 | the search for a cure for cancer |
|---|
| 0:35:35 | and again i'm gonna tell you a little story |
|---|
| 0:35:38 | it's a personal one, about an uncle of mine named sidney farber |
|---|
| 0:35:43 | um |
|---|
| 0:35:44 | now |
|---|
| 0:35:44 | my uncle sid, in the forties, |
|---|
| 0:35:46 | uh was |
|---|
| 0:35:48 | a pathologist |
|---|
| 0:35:49 | at harvard medical school |
|---|
| 0:35:50 | uh |
|---|
| 0:35:52 | and at children's hospital boston |
|---|
| 0:35:55 | and |
|---|
| 0:35:56 | he |
|---|
| 0:36:00 | unfortunately got to see lots of little children |
|---|
| 0:36:01 | with leukemia |
|---|
| 0:36:05 | uh once they were diagnosed they only had a few weeks |
|---|
| 0:36:11 | as a pathologist he mostly dealt with petri dishes and so forth; he wasn't really a clinician |
|---|
| 0:36:11 | but he got this thought |
|---|
| 0:36:13 | that maybe if he could come up with chemicals |
|---|
| 0:36:16 | that were more poisonous to the cancer cells than they were to the normal cells |
|---|
| 0:36:20 | maybe he could extend the lives of these kids |
|---|
| 0:36:23 | and he experimented with this in the petri dishes of course, for the most part, for a while |
|---|
| 0:36:27 | and then he came up with something that he thought would work |
|---|
| 0:36:31 | and he tried it out |
|---|
| 0:36:32 | with everybody's permission |
|---|
| 0:36:34 | on some of these kids |
|---|
| 0:36:35 | and low and behold |
|---|
| 0:36:36 | it actually did extend their lives for a while |
|---|
| 0:36:39 | this was |
|---|
| 0:36:40 | the first |
|---|
| 0:36:41 | known |
|---|
| 0:36:41 | case of chemotherapy |
|---|
| 0:36:45 | this was just |
|---|
| 0:36:46 | great, and it started a whole revolution; he ended up starting a big center, national cancer institute stuff uh |
|---|
| 0:36:52 | there's now the dana-farber, in reverence to him |
|---|
| 0:36:55 | and |
|---|
| 0:36:57 | um |
|---|
| 0:36:59 | the key point i wanna make about it |
|---|
| 0:37:01 | is that |
|---|
| 0:37:02 | there's this quandary |
|---|
| 0:37:03 | between curing patients |
|---|
| 0:37:05 | you have these patients are coming through |
|---|
| 0:37:07 | who are in terrible straits |
|---|
| 0:37:10 | but on the other hand |
|---|
| 0:37:12 | you don't have any time |
|---|
| 0:37:14 | to figure out what's really going on |
|---|
| 0:37:17 | and there were |
|---|
| 0:37:18 | important early successes based on hunches that my uncle and many others had |
|---|
| 0:37:24 | and there wasn't time to work out the real cause of things |
|---|
| 0:37:27 | and by the way there are stories like this |
|---|
| 0:37:29 | for surgical interventions and for radiation as well |
|---|
| 0:37:34 | uh |
|---|
| 0:37:35 | so there's some success |
|---|
| 0:37:37 | but they still |
|---|
| 0:37:38 | didn't find a general cure, and uh as you know to this day there still is no general cure |
|---|
| 0:37:42 | for cancer |
|---|
| 0:37:43 | but things are a lot better, remissions are longer and so forth |
|---|
| 0:37:47 | and now there's |
|---|
| 0:37:48 | starting to be some understanding of the biological mechanisms, and one hopes that this will lead to |
|---|
| 0:37:54 | uh a solution |
|---|
| 0:37:56 | so there's a wonderful book i strongly recommend, the emperor of all maladies |
|---|
| 0:38:00 | uh |
|---|
| 0:38:01 | about uh the history of cancer |
|---|
| 0:38:05 | and i'll just read this |
|---|
| 0:38:06 | it's interesting, the writer here says: in thinking of remedies, |
|---|
| 0:38:09 | in such time as we have considered of the cause, |
|---|
| 0:38:12 | they must be imperfect, lame, and to no purpose, |
|---|
| 0:38:15 | wherein the cause hath not first been searched |
|---|
| 0:38:18 | this again doesn't belie the fact that it can be very useful |
|---|
| 0:38:22 | to uh go ahead and try to fix something along the way |
|---|
| 0:38:26 | but in the long term you need to understand what's going on |
|---|
| 0:38:30 | so as opposed to just |
|---|
| 0:38:32 | trying our bright ideas which we all do |
|---|
| 0:38:35 | how about finding out what's wrong |
|---|
| 0:38:39 | the statistical approach |
|---|
| 0:38:40 | to speech recognition requires |
|---|
| 0:38:42 | assumptions that i made reference to |
|---|
| 0:38:44 | they're known literally to be false |
|---|
| 0:38:47 | this may or may not be a problem |
|---|
| 0:38:49 | maybe it's just handled by uh, say, raising these |
|---|
| 0:38:52 | uh likelihoods to a power |
|---|
| 0:38:55 | how can we learn |
|---|
| 0:38:57 | so there's some work that's been started that i wanted to call your attention to |
|---|
| 0:39:01 | from steve wegmann and larry gillick |
|---|
| 0:39:03 | starting a couple years ago |
|---|
| 0:39:05 | where what they did was to consider each assumption separately |
|---|
| 0:39:12 | and then, rather than trying to fix the models, |
|---|
| 0:39:13 | modify the data |
|---|
| 0:39:15 | by some resampling, um |
|---|
| 0:39:16 | some uh |
|---|
| 0:39:18 | bootstrapping kind of approaches |
|---|
| 0:39:18 | to match the models |
|---|
| 0:39:20 | observe the improvement |
|---|
| 0:39:22 | and use that to inspire more bright ideas |
|---|
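A minimal sketch of the resampling step at the heart of that diagnosis might look like the following: keep the real frames, but redraw them within each state-aligned segment so that the data actually satisfies the conditional-independence assumption, then decode again and compare error rates. The alignment format here is a placeholder; this only illustrates the data manipulation, not their full methodology.

```python
# Resample real acoustic frames within each aligned state, destroying
# frame-to-frame dependence while keeping each state's marginal distribution.
import numpy as np

rng = np.random.default_rng(0)

def resample_within_states(frames, state_alignment):
    frames = np.asarray(frames, dtype=float)
    states = np.asarray(state_alignment)
    out = frames.copy()
    for s in np.unique(states):
        idx = np.where(states == s)[0]
        # Draw replacement frames (with replacement) from the same state's pool.
        out[idx] = frames[rng.choice(idx, size=len(idx), replace=True)]
    return out

# Toy usage: 10 frames of 3-dimensional features aligned to two states.
frames = np.arange(30, dtype=float).reshape(10, 3)
alignment = [0, 0, 0, 1, 1, 1, 1, 0, 0, 1]
print(resample_within_states(frames, alignment))
```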
| 0:39:26 | but at this point |
|---|
| 0:39:27 | they've really just focused on the diagnosis part and not on the |
|---|
| 0:39:30 | new bright ideas, frankly |
|---|
| 0:39:32 | so this is being pursued also at icsi in the OUCH project, which is outing unfortunate characteristics of hmms |
|---|
| 0:39:39 | and |
|---|
| 0:39:40 | uh i'm gonna give you just a couple results from a more recent version; i should add by the way that |
|---|
| 0:39:44 | uh |
|---|
| 0:39:45 | this is a different gillick; this is larry's son dan |
|---|
| 0:39:48 | who just did his PhD with us |
|---|
| 0:39:51 | um |
|---|
| 0:39:52 | but |
|---|
| 0:39:53 | first this is a |
|---|
| 0:39:55 | very simplified system, so the error rate for wall street journal is pretty high here |
|---|
| 0:40:01 | and uh it's |
|---|
| 0:40:03 | the output |
|---|
| 0:40:04 | uh demonstrably does not really fit the GMM distribution that you got from the training set |
|---|
| 0:40:10 | and it definitely doesn't satisfy the independence assumptions and you get this thirteen percent |
|---|
| 0:40:16 | uh |
|---|
| 0:40:17 | now if you use simulated data, really just generated from the models, |
|---|
| 0:40:21 | you should do pretty well, and in fact you do; basically |
|---|
| 0:40:23 | uh virtually all of the errors go away |
|---|
| 0:40:27 | but here's the interesting one i think |
|---|
| 0:40:29 | if you |
|---|
| 0:40:30 | use resampled data, so this is the actual speech data, |
|---|
| 0:40:34 | but you're just resampling it in such a way |
|---|
| 0:40:37 | to assure the statistical conditional independence |
|---|
| 0:40:41 | it also gets rid of nearly all of the errors |
|---|
| 0:40:45 | now their studies are a lot more detailed than this; there's a lot of |
|---|
| 0:40:48 | a lot of things that they're looking at |
|---|
| 0:40:50 | a lot of things they're trying out |
|---|
| 0:40:51 | but i think this gives the flavour |
|---|
| 0:40:53 | of what they're doing |
|---|
| 0:40:58 | so |
|---|
| 0:40:59 | in summary |
|---|
| 0:41:02 | uh speech recognition is mature |
|---|
| 0:41:05 | in some sense it has an advanced degree |
|---|
| 0:41:07 | this because it's been around a long time and their commercial systems and so forth |
|---|
| 0:41:12 | and yet we still find it to be brittle |
|---|
| 0:41:15 | and uh we essentially have to start over again with each new task |
|---|
| 0:41:20 | uh the recent improvements |
|---|
| 0:41:21 | have been really quite incremental, and a lot of things have sort of levelled off |
|---|
| 0:41:26 | we need to rethink |
|---|
| 0:41:28 | kind of like going back to school |
|---|
| 0:41:30 | kind of like continuing education |
|---|
| 0:41:33 | uh we may need more basic models |
|---|
| 0:41:35 | uh we may need more |
|---|
| 0:41:37 | basic features |
|---|
| 0:41:39 | we may need more study of errors |
|---|
| 0:41:43 | and |
|---|
| 0:41:44 | the other thing i wanna briefly mention is that |
|---|
| 0:41:48 | we do live in an era where there is a huge amount of computation available |
|---|
| 0:41:52 | and even though the clock rates don't continue to go up as they have, |
|---|
| 0:41:56 | uh due to uh many-core systems |
|---|
| 0:41:59 | and |
|---|
| 0:42:00 | cloud computing and so forth |
|---|
| 0:42:01 | there is gonna continue to be |
|---|
| 0:42:03 | an increased availability of lots of computation |
|---|
| 0:42:06 | and this |
|---|
| 0:42:07 | should make it possible for us to consider |
|---|
| 0:42:10 | huge numbers of models |
|---|
| 0:42:12 | uh and methods |
|---|
| 0:42:13 | that we wouldn't consider before |
|---|
| 0:42:15 | for instance on the front end side |
|---|
| 0:42:17 | these uh auditory-based or cortical-based things can really blow up the computation |
|---|
| 0:42:22 | from the simple kind of stuff that you have with mfccs or P L P |
|---|
| 0:42:28 | uh |
|---|
| 0:42:28 | so |
|---|
| 0:42:30 | it's good to do that it's good to try things |
|---|
| 0:42:35 | that might take a lot of computation, even if they might not work yet on your iphone just now |
|---|
| 0:42:39 | um um so |
|---|
| 0:42:43 | you also have to know, and i'm sure you all do, that just having more computation is not a panacea |
|---|
| 0:42:48 | doesn't actually solve things |
|---|
| 0:42:49 | but it can potentially |
|---|
| 0:42:51 | give you a lot more possibilities |
|---|
| 0:42:54 | that's pretty much what i want to say |
|---|
| 0:42:56 | uh |
|---|
| 0:42:57 | i do wanna acknowledge that the stuff i have talked about is not particularly from me but from many people, including |
|---|
| 0:43:03 | people outside our lab |
|---|
| 0:43:05 | uh but i do want to thank |
|---|
| 0:43:07 | the many current and former students and postdocs, visitors, and icsi staff |
|---|
| 0:43:12 | and particularly give a shout out to |
|---|
| 0:43:14 | hynek hermansky, hervé bourlard, shihab shamma, steve wegmann, jordan cohen |
|---|
| 0:43:18 | here's my shameless plug for a book |
|---|
| 0:43:21 | uh which he already mentioned |
|---|
| 0:43:23 | that is gonna be out this fall thanks to tons of work from dan ellis |
|---|
| 0:43:27 | and other contributors i should say |
|---|
| 0:43:29 | uh like uh |
|---|
| 0:43:31 | [names unintelligible] |
|---|
| 0:43:33 | and |
|---|
| 0:43:37 | simon king for instance |
|---|
| 0:43:39 | and |
|---|
| 0:43:41 | thank you for your attention |
|---|
| 0:43:50 | [applause] |
|---|
| 0:43:56 | [the session chair thanks the speaker and invites questions; the first audience question is asked largely off-microphone and is unintelligible in this transcript] |
|---|
| 0:44:55 | um |
|---|
| 0:44:57 | i i think that the right answer is |
|---|
| 0:44:59 | i don't know |
|---|
| 0:45:02 | because |
|---|
| 0:45:03 | for instance |
|---|
| 0:45:04 | what i used to say when people talked to me about this is that, |
|---|
| 0:45:07 | okay, i think of |
|---|
| 0:45:08 | speech recognition as being in three pieces: there's |
|---|
| 0:45:12 | the representations that you have |
|---|
| 0:45:14 | there's the statistical models and the search and so forth in the middle |
|---|
| 0:45:19 | and then there's |
|---|
| 0:45:20 | uh all of the things that you could imagine doing with speech understanding and pragmatics |
|---|
| 0:45:24 | et cetera and so forth |
|---|
| 0:45:26 | and i used the think that okay the first one i know a little bit about |
|---|
| 0:45:30 | uh and i feel very strongly, and you know there are plenty of results to back this up, that that's |
|---|
| 0:45:35 | very important for improving |
|---|
| 0:45:37 | the last one is not my area of expertise, but from what i have seen in others, and certainly in the human |
|---|
| 0:45:42 | case |
|---|
| 0:45:44 | i believe that's very important |
|---|
| 0:45:45 | so i sort of thought the middle part, |
|---|
| 0:45:47 | you know, works well enough |
|---|
| 0:45:50 | uh but then there's this |
|---|
| 0:45:51 | this study |
|---|
| 0:45:53 | and i'm not so sure |
|---|
| 0:45:54 | no, i actually think that you should |
|---|
| 0:45:56 | pursue whatever it is that you |
|---|
| 0:46:00 | feel is of greatest interest |
|---|
| 0:46:01 | i actually think the key thing |
|---|
| 0:46:03 | is to have interesting friends |
|---|
| 0:46:07 | that's what has worked for me, anyway |
|---|
| 0:46:10 | now |
|---|
| 0:46:11 | you may like it or not, but it's what i actually think |
|---|
| 0:46:16 | [the next questioner takes the microphone] can you hear me? okay |
|---|
| 0:46:21 | is mine loud enough? |
|---|
| 0:46:24 | all of the techniques you described rely on spectral analysis approaches; pretty much everything does, in almost all cases |
|---|
| 0:46:41 | spectral techniques like plp do capture some aspects, the coarse things, quite well |
|---|
| 0:46:51 | but the big problems, it seems to me, are still interference from other sources, |
|---|
| 0:46:57 | reverberation, |
|---|
| 0:47:00 | spatial hearing and so forth, where the spectrum does not give |
|---|
| 0:47:04 | you much help |
|---|
| 0:47:06 | distinguishing multiple sources |
|---|
| 0:47:08 | by direction |
|---|
| 0:47:08 | direction |
|---|
| 0:47:10 | the other dimension is |
|---|
| 0:47:13 | fine temporal |
|---|
| 0:47:15 | information, which is something that has been explored a lot |
|---|
| 0:47:18 | in the psychoacoustic and |
|---|
| 0:47:20 | physiological literature |
|---|
| 0:47:24 | and in a few attempts, |
|---|
| 0:47:25 | such as the ensemble interval histogram, |
|---|
| 0:47:30 | which point to an entirely |
|---|
| 0:47:32 | different kind of representation, |
|---|
| 0:47:36 | one that might |
|---|
| 0:47:39 | get at separating the sources |
|---|
| 0:47:41 | at the same time |
|---|
| 0:47:42 | and you didn't say much about that |
|---|
| 0:47:45 | that's a direction, of course |
|---|
| 0:47:46 | so |
|---|
| 0:47:48 | what do you think about that |
|---|
| 0:47:49 | direction, and should we get |
|---|
| 0:47:51 | people working on it |
|---|
| 0:47:52 | and pay more attention |
|---|
| 0:47:55 | to things beyond the short-term spectrum |
|---|
| 0:47:58 | why spectral? i guess by which you mean short-term spectral, right? |
|---|
| 0:48:02 | and i may not have done this as clearly as i could have, but i think the shamma |
|---|
| 0:48:07 | stuff that i was making reference to |
|---|
| 0:48:10 | certainly can be long-term; their spectro-temporal representation, |
|---|
| 0:48:15 | what you feed |
|---|
| 0:48:17 | to the different |
|---|
| 0:48:18 | cortical |
|---|
| 0:48:20 | filters, |
|---|
| 0:48:21 | can be a very different kind of spectrogram, one that takes advantage of that sort of stuff, and i think that's |
|---|
| 0:48:26 | absolutely what we should do |
|---|
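The answer above points at spectro-temporal ("cortical") filtering without spelling it out, so here is a minimal sketch of the idea. This is not the speaker's code; the library choices (numpy/scipy), the 100 frames-per-second rate, the kernel sizes, and the rate/scale values are all assumptions made for illustration, and 2-D Gabor kernels are only one common stand-in for Shamma-style spectro-temporal filters.

```python
# Hedged sketch, not from the talk: 2-D Gabor ("cortical") filtering of a
# log-mel spectrogram, one crude stand-in for a spectro-temporal representation.
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(rate_hz, scale_cyc_per_chan, frame_rate_hz=100.0,
                 size_t=41, size_f=15):
    # A temporal modulation ("rate") times a spectral modulation ("scale"),
    # each windowed by a Hann envelope; returns a (freq, time) kernel.
    t = (np.arange(size_t) - size_t // 2) / frame_rate_hz   # seconds
    f = np.arange(size_f) - size_f // 2                     # mel channels
    carrier_t = np.cos(2 * np.pi * rate_hz * t) * np.hanning(size_t)
    carrier_f = np.cos(2 * np.pi * scale_cyc_per_chan * f) * np.hanning(size_f)
    return np.outer(carrier_f, carrier_t)

def spectro_temporal_features(log_mel, frame_rate_hz=100.0):
    # log_mel: (n_mel_channels, n_frames). Returns one filtered "spectrogram"
    # per (rate, scale) pair: a long-time view, not a single short-term frame.
    rates = [2.0, 4.0, 8.0, 16.0]   # temporal modulations in Hz (assumed values)
    scales = [0.06, 0.12, 0.25]     # spectral modulations in cycles per channel
    return np.stack([
        convolve2d(log_mel, gabor_kernel(r, s, frame_rate_hz), mode="same")
        for r in rates for s in scales
    ])
```

Each filter pools information over hundreds of milliseconds and several channels, which is the sense in which what reaches the classifier "can be a very different kind of spectrogram" from the frame-by-frame short-term spectrum.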
| 0:48:28 | and as for these disturbances, the multiple sources, the reverberation, et cetera |
|---|
| 0:48:33 | i agree that's |
|---|
| 0:48:34 | that's the biggest challenge that we see |
|---|
| 0:48:36 | if someone talks about the performance of humans versus |
|---|
| 0:48:40 | speech recognition systems, in the current generation of systems that's the easiest explanation of the difference |
|---|
| 0:48:46 | so uh |
|---|
| 0:48:48 | i completely agree |
|---|
| 0:48:49 | sorry, um |
|---|
| 0:48:50 | i'm not being a politician i actually do agree |
|---|
| 0:48:54 | [question, largely inaudible, about results with hmm modeling and whether more attention should be paid there] |
|---|
| 0:49:32 | okay; i didn't, yeah |
|---|
| 0:49:34 | but |
|---|
| 0:49:35 | but you are certainly reinforcing my biases |
|---|
| 0:49:39 | and my ego is getting pumped up, but |
|---|
| 0:49:42 | um |
|---|
| 0:49:43 | i'm mostly a front-end person these days, have been for a while, and i agree that there's a lot |
|---|
| 0:49:49 | to be done there |
|---|
| 0:49:50 | i didn't mean to say at all that the language modeling and so forth was |
|---|
| 0:49:54 | was the bulk of it |
|---|
| 0:49:55 | even that study at the end was just saying that for a fairly simple case with essentially matched training and test |
|---|
| 0:50:01 | uh that |
|---|
| 0:50:02 | uh |
|---|
| 0:50:03 | you could |
|---|
| 0:50:04 | jimmy with the data in such a way |
|---|
| 0:50:07 | to match the model's assumptions and you could do much better |
|---|
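One possible reading of "jimmy with the data ... to match the model's assumptions", offered only as an assumed illustration since the study itself is not described in detail here: given a forced state alignment, replace every frame with an independent draw from the training frames aligned to the same state, so the resulting data satisfies the HMM's frame-independence assumption by construction. The function names and the alignment format below are invented for the sketch.

```python
# Hypothetical illustration (not the cited study's code): resample frames so the
# data satisfies the HMM assumption that frames are independent given the states.
import numpy as np
from collections import defaultdict

def pool_frames_by_state(train_frames, train_states):
    # train_frames: (n_frames, feat_dim); train_states: (n_frames,) state ids
    # from a forced alignment. Groups the training frames by HMM state.
    pools = defaultdict(list)
    for frame, state in zip(train_frames, train_states):
        pools[state].append(frame)
    return {s: np.asarray(frames) for s, frames in pools.items()}

def resample_to_match_model(test_states, pools, seed=0):
    # Replace each test frame with a random draw from its state's training pool;
    # the resampled sequence matches the model's independence assumptions exactly.
    rng = np.random.default_rng(seed)
    return np.stack([pools[s][rng.integers(len(pools[s]))] for s in test_states])
```

Decoding such resampled frames in place of the real ones is the kind of controlled manipulation that can make matched-condition results look much better, which is why the follow-up work mentioned next turns to mismatched conditions.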
| 0:50:10 | but one of the things that we're going to be trying to do in follow-ups to that study is looking |
|---|
| 0:50:15 | at mismatched conditions |
|---|
| 0:50:17 | you know |
|---|
| 0:50:18 | cases with noise and reverberation and so forth |
|---|
| 0:50:21 | in which case i don't think the effect will be quite as big |
|---|
| 0:50:24 | and |
|---|
| 0:50:25 | you know it's garbage in, garbage out: if basically you feed in representations |
|---|
| 0:50:30 | that are not |
|---|
| 0:50:31 | giving you the information you need, how are you going to get it out at the end, yeah, so |
|---|
| 0:50:36 | i agree with you, but i was trying to be fair, not only to other people's work |
|---|
| 0:50:40 | but also because |
|---|
| 0:50:42 | i feel that if you cover the space |
|---|
| 0:50:45 | of all these different cases |
|---|
| 0:50:47 | there are many cases where these other areas are in fact very poor |
|---|
| 0:50:51 | and human beings, as with my earlier example, do make use of higher-level information |
|---|
| 0:50:56 | uh often |
|---|
| 0:50:57 | in order to figure out what was said and what was important about it |
|---|
| 0:51:01 | which leads me to george's question |
|---|
| 0:51:03 | as you were talking |
|---|
| 0:51:05 | i was |
|---|
| 0:51:05 | constantly struck by the |
|---|
| 0:51:07 | analogy |
|---|
| 0:51:09 | between speech recognition and, almost |
|---|
| 0:51:12 | you know |
|---|
| 0:51:12 | irresistibly, |
|---|
| 0:51:15 | things in optical character recognition |
|---|
| 0:51:18 | and so |
|---|
| 0:51:19 | uh |
|---|
| 0:51:20 | almost every slide had irresistible analogies, from the current successes to future directions to problems that |
|---|
| 0:51:28 | are being experienced |
|---|
| 0:51:29 | uh_huh and i i'm just wondering is there |
|---|
| 0:51:31 | cross-disciplinary knowledge that can be leveraged, and is it being leveraged |
|---|
| 0:51:37 | to speech recognition? well, except in the sense that some of these alternative, uh |
|---|
| 0:51:42 | approaches |
|---|
| 0:51:43 | have tried looking at, uh, the spectrogram as an image |
|---|
| 0:51:47 | uh and so forth some of the neural network techniques that were developed uh in optical character recognition |
|---|
| 0:51:54 | sort of came back the other way but a lot of it's gone |
|---|
| 0:51:57 | gone the other way |
|---|
| 0:51:58 | but |
|---|
| 0:51:59 | you know, we tend to be a fairly fragmented community and not listen to each other quite as much |
|---|
| 0:52:04 | as we should |
|---|
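For concreteness, "looking at the spectrogram as an image" with network machinery borrowed from optical character recognition might look roughly like the sketch below. This is purely illustrative: the framework (PyTorch), the 40-mel by 100-frame patch size, the layer sizes, and the class count are assumptions, not anything described in the talk.

```python
# Illustrative only: a small OCR-style convolutional net applied to a
# log-mel spectrogram patch treated as a single-channel image.
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    def __init__(self, n_classes=40):  # e.g. phone classes (assumed number)
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        )
        # assumes input patches of 40 mel channels x 100 frames -> 10 x 25 after pooling
        self.classifier = nn.Linear(32 * 10 * 25, n_classes)

    def forward(self, x):  # x: (batch, 1, 40, 100)
        h = self.features(x)
        return self.classifier(h.flatten(start_dim=1))

# usage sketch: a batch of 8 random "spectrogram images" -> (8, n_classes) logits
logits = SpectrogramCNN()(torch.randn(8, 1, 40, 100))
```

The design choice being echoed is exactly the cross-pollination the question asks about: the convolution and pooling stages are the same local-pattern machinery that OCR networks used on character bitmaps, applied here along time and frequency.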
| 0:52:06 | who's next |
|---|
| 0:52:11 | [inaudible exchange] |
|---|
| 0:52:17 | oh, i'm sorry, i was drawing |
|---|
| 0:52:21 | [inaudible remarks as the next questioner begins] |
|---|
| 0:52:27 | um |
|---|
| 0:52:32 | i have some exposure, and probably most people also have some exposure, to modern |
|---|
| 0:52:37 | speech recognition technology |
|---|
| 0:52:39 | in real applications |
|---|
| 0:52:40 | yeah i think you know of um |
|---|
| 0:52:43 | i've been exposed to google voice |
|---|
| 0:52:45 | perhaps many people have |
|---|
| 0:52:47 | yeah and |
|---|
| 0:52:48 | and this is not a |
|---|
| 0:52:49 | plug |
|---|
| 0:52:50 | for google voice, but |
|---|
| 0:52:52 | i think |
|---|
| 0:52:52 | modern, deployed speech recognition technology seems to me amazingly |
|---|
| 0:52:58 | good |
|---|
| 0:53:01 | considering that the |
|---|
| 0:53:02 | systems |
|---|
| 0:53:04 | have no |
|---|
| 0:53:06 | okay, no really great |
|---|
| 0:53:10 | semantic context |
|---|
| 0:53:13 | it is interesting to see what the systems can do with acoustic |
|---|
| 0:53:17 | cues alone |
|---|
| 0:53:18 | it amazes me |
|---|
| 0:53:21 | yeah, and so |
|---|
| 0:53:23 | where i see the challenge, though i don't know how to do it, |
|---|
| 0:53:28 | where i see the challenge is |
|---|
| 0:53:31 | in, um |
|---|
| 0:53:34 | creating models of the semantic context that give the kind of support |
|---|
| 0:53:39 | to speech recognition that |
|---|
| 0:53:41 | we have seen from the |
|---|
| 0:53:44 | usual |
|---|
| 0:53:45 | language models |
|---|
| 0:53:46 | which |
|---|
| 0:53:47 | don't |
|---|
| 0:53:48 | model |
|---|
| 0:53:49 | that |
|---|
| 0:53:52 | okay, well |
|---|
| 0:53:53 | was that a question? |
|---|
| 0:53:56 | i know it wasn't |
|---|
| 0:53:57 | um, but i'll say something anyway, which is that |
|---|
| 0:54:00 | uh i i am really taking the middle position |
|---|
| 0:54:03 | there are plenty of tasks |
|---|
| 0:54:04 | uh where in fact |
|---|
| 0:54:06 | recognition does fail, particularly in noise and reverberation and so on |
|---|
| 0:54:10 | google voice search is is very impressive |
|---|
| 0:54:12 | but |
|---|
| 0:54:13 | you know there's a lot of |
|---|
| 0:54:13 | a lot of cases where things do fail |
|---|
| 0:54:16 | and |
|---|
| 0:54:17 | uh |
|---|
| 0:54:17 | we can see significant improvements |
|---|
| 0:54:20 | in a number of tasks |
|---|
| 0:54:21 | by changing the front end, so i think there is something important there |
|---|
| 0:54:25 | but in your statement you weren't really attacking the front end; what you're saying is we have to pay |
|---|
| 0:54:29 | attention to the back end, and i completely agree |
|---|
| 0:54:34 | one more, and then it's probably time |
|---|
| 0:54:36 | i want to change the subject a little bit, um; given that, can you say something about the |
|---|
| 0:54:40 | role |
|---|
| 0:54:41 | of industry versus academia in this research |
|---|
| 0:54:43 | you have seen both sides |
|---|
| 0:54:47 | what is good and what is bad in each for speech |
|---|
| 0:54:49 | right now, and |
|---|
| 0:54:52 | where should we go |
|---|
| 0:54:53 | i'm actually at a pretty small form of |
|---|
| 0:54:55 | industry |
|---|
| 0:54:57 | but uh |
|---|
| 0:54:58 | well, i think industry should fund academia |
|---|
| 0:55:12 | exactly |
|---|
| 0:55:14 | thanks for the lecture |
|---|