Well, welcome, and good morning everyone. First, a couple of practical notes. We have a change of rooms: Club B was really small and we were afraid that people would not fit in, so we moved everything from Club B, and the expert sessions from Club E, to the North Hall. That's the hall on the second floor, and we should have more space there. Club B will be closed; there will be signs pointing the way. As for the internet, I'm really sorry for the trouble; the bandwidth problems should be resolved and it should be usable again, but please note that we have only five hundred twelve addresses available and there is no way to get more, so please disconnect when you do not need to be connected. As for the banquet, transportation is provided; a very limited number of tickets is still available at the registration desk, and transportation back is not provided, so you may need to make your own arrangements in the evening. With that I'm pretty much done, so there will now be a short introduction of our keynote speaker. The talk is going to be given by Nelson Morgan from ICSI Berkeley, and the session chair will introduce the speaker. Thank you very much for coming.

Thank you. It is my great pleasure to introduce Nelson Morgan. For those of you
who know him, he has been working in speech for a very long time and is known for a number of techniques, and I am sure he is known to a good number of you in the audience, so for those people there is no need for much of an introduction. For those of you who don't know him, he is, among other things, co-author of one of the well-known signal processing textbooks, with a new edition on the way. What else can I say? I will keep it short, since listening to him will be better than looking at me. Morgan, the floor is yours.

Well, I thought it was time for a little bit of a reality check on speech recognition. It's been around for a long time, as I think everybody here knows: a very long research history, lots of publications for decades, many projects and sponsored programs. Systems have continually gotten better, and they have actually tended to converge, so that there is in some sense a standard automatic speech recognition system now. It's made it into a lot of commercial products; it's actually been used, and it actually works from time to time. So in some sense it seems to have graduated. But it still fails where humans don't, and by the way, those of you who have your PhDs know that your education hopefully was not done at that point; there's probably a lot more to do here. One could argue that there is little basic science that's been developed in quite a bit of time. There are lots of good engineering methods, but they often require a great amount of data; as we learned yesterday, there is a great deal of data, but not all of it is available for use in the way that you would like, and there are many tasks where you don't have that much. And each new task requires essentially the same amount of effort: you sort of have to start over again. So how did we get to this point? This is not going to be anything like a complete history, but enough to make my point, I hope. I'm going to talk about the current status and the standard methods, very briefly, and then about some of the alternatives that people
have worked with over the years, and where we could go from here. So, as I mentioned, speech recognition research has been around for a very long time: there have been significant papers for sixty years. By the nineteen seventies, in some sense, the major advances in modeling had happened; that is, the basic mathematics behind hidden Markov models was done by then, with lots of improvements over the next twenty years or so, and also in the features, which became more or less standard by nineteen ninety or so. There were also some really important methodology improvements by nineteen ninety. In earlier days people did many experiments, but it was very hard to compare them, and the notions of standard evaluations and standard datasets really took hold by nineteen ninety or so. Over all of these years, especially the last twenty or thirty, there have been continuous improvements, which were to some extent closely related to Moore's-law movements in the technology: more and more computational capability, more and more storage capability, letting people work with very large datasets and develop very large models to represent those large datasets well. But there's an elephant in the room, which is that things are not entirely working. These systems have in fact converged, which was kind of a byproduct of all of these standard evaluations; the evaluations were very good in many ways, but when people found out that another group had something that they didn't, they would copy it, and very soon the systems would become very much the same. So what are some of the remaining problems? Well, systems still perform pretty poorly, despite a large amount of work on this, in the presence of significant amounts of acoustic noise, and also reverberation, which is natural for just about any situation; with unexpected speaking rate or accent, where by unexpected I mean something that is not well represented in the training set; and on unfamiliar topics. The language models bring us a lot of the performance that we have, and if you
don't have a particular topic represented in the language model, you can do poorly. And apart from the recognition performance per se, how many words you get right, another thing that's important is knowing whether you're right or wrong; that's very important for practical applications, and it still needs some work as well. So it turns out that even some fairly simple speech recognition tasks can still fail under some of these conditions, yielding some strange results.

[A video clip was played of a voice recognition system failing comically; the dialogue was not captured clearly in the recording.]

Anyway, that was funny, and I hope you thought it was funny, but what hasn't worked in real life, as opposed to just the jokes, and what has? Let me start off with some results from some of these standard evaluations. This is a graph that people in speech have seen a million times. For those of you who are not familiar with it, the main thing to note is that WER stands for word error rate, and a high word error rate is obviously bad; this is time on the horizontal axis, and each of these lines represents a series of tests. This is a kind of messy graph, so let's clean it up a little. This is a task done in the early nineties called ATIS, and the main thing to see here, as with a lot of these, is that it starts off at a pretty high error rate, people work on it for a while, and after a while it gets down to a pretty reasonable error rate. Let's go to another one. This was conversational telephone speech; you see the same sort of effect, and do remember that this is a logarithmic scale, so even though it looks like it hasn't come down very far, it really did come down pretty far. But after
a while, it sort of levels off. More recently there's been a bunch of work on speech from meetings, which is also conversational; these results are from the individual head-mounted microphones, so we still didn't have huge effects of background noise or reverberation or anything, and there wasn't actually a huge amount of progress after some of the initial work. Now, these evaluations aside, there are the commercial products. A lot of that information is proprietary, but I think what one can say is that commercial products work some of the time, for some people, and they often don't work for others. So what is the state? Well, the recognition systems either work really well for somebody, or they'll be terribly brittle and unreliable. I know that when my wife and I both tried a dictation system, it worked wonderfully for her and terribly for me; I think I swallow my words or something.

So here's an abbreviated review of what was standard by nineteen ninety-one. We had feature extraction basically based on frames every ten milliseconds or so, computing something from a short-term spectrum: things called mel-frequency cepstral coefficients, which I'll say a bit more about in a second; PLP, another common method developed by then; and delta cepstra, essentially temporal derivatives of the cepstra. On the statistical side, for acoustic modeling, hidden Markov models were quite standard, by this point typically representing context-dependent phonemes or phoneme-like units; and the language models were by this time pretty much all statistical, representing context-dependent words. So all of this was in place by nineteen ninety-one. Now let's move to two thousand eleven. There it is; notice all the changes. Okay, that's a little unfair. People have actually done a lot of work in the last twenty years, and this is a representation of a lot of it, I think, and these things have had big effects, I don't mean to minimize them: various kinds of normalisation.
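As a concrete aside, the classic front end just reviewed, ten-millisecond frames, a mel-warped filterbank, cepstra via a cosine transform, and delta features, can be sketched in a few lines. This is a minimal NumPy illustration, not anything from the talk itself; the filter count, coefficient count, and window choices are common textbook defaults, not prescribed values.

```python
import numpy as np

def mfcc_like(frame, sample_rate=16000, n_filters=24, n_ceps=13):
    """Toy MFCC-style features for one windowed frame (illustration only)."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2            # short-term power spectrum
    n_bins = len(spectrum)
    # Mel warping: finer resolution at low frequencies than at high ones.
    mel_max = 2595.0 * np.log10(1.0 + (sample_rate / 2.0) / 700.0)
    centers = 700.0 * (10.0 ** (np.linspace(0.0, mel_max, n_filters + 2) / 2595.0) - 1.0)
    bins = np.floor(centers / (sample_rate / 2.0) * (n_bins - 1)).astype(int)
    energies = np.zeros(n_filters)
    for i in range(n_filters):                            # triangular mel filters
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        for b in range(lo, hi):
            w = (b - lo) / max(c - lo, 1) if b < c else (hi - b) / max(hi - c, 1)
            energies[i] += w * spectrum[b]
    log_e = np.log(energies + 1e-10)
    # A cosine transform gives a smooth cepstral summary; keep the low-order terms.
    n = np.arange(n_filters)
    return np.array([np.sum(log_e * np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters)))
                     for k in range(n_ceps)])

def deltas(ceps, width=2):
    """Delta cepstra: regression-style temporal derivative of a (T, D) feature
    array over +/- width frames (edges wrap here, purely for brevity)."""
    num = sum(d * (np.roll(ceps, -d, axis=0) - np.roll(ceps, d, axis=0))
              for d in range(1, width + 1))
    return num / (2.0 * sum(d * d for d in range(1, width + 1)))
```

Everything downstream in the standard system consumes frame sequences of exactly this kind, typically the cepstra stacked with their deltas.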
Mean and variance normalisation; an online version of that that we call RASTA; vocal tract length normalisation, which compresses or expands the spectrum in such a way as to match the models better; and then adaptation and feature transformation, either adapting to a test set that is somewhat different from the training set, or various changes to make the features more discriminative. Discriminative training means actually changing the statistical models in such a way as to make them discriminate better between different speech sounds. We got more and more data over the years, and that required lots of work to figure out how to handle; but aside from handling it, there was also taking advantage of lots of data, which didn't come for free, so there was lots of engineering work there. People found that combining systems helped, and sometimes combining pieces of systems helped, and that's been an important thing in improving performance. And because speech recognition was starting to go into applications, you had to be concerned about speed, and there's been a lot of work on that.

A bit more on some of this. The main point I want to make about the mel cepstrum and PLP is that each of them uses this kind of warped frequency scale, in which you have better resolution at low frequencies than at high frequencies, because our perception of different speech sounds is very different at low frequencies and high frequencies; the mel cepstrum and PLP use different mechanisms for getting a smooth spectrum. Delta cepstra, as I said, are basically time derivatives of the cepstra. The hidden Markov model: this is a graphical form of it, and the main thing to see is that this is a statistical dependency graph, in which, say, X three is dependent only on the current state. Each of the time steps is represented here, and if you know Q three, then Q two, Q one, X one, and X two tell you nothing about X three. So that's a very, very strong statistical conditional
independence assumption, and that's pretty much what people have used in these now-standard systems. This is my only equation, and those of you in speech will go "oh yeah"; in fact, probably most people will say "oh yeah". This is basically Bayes' rule. The idea is that in a statistical system you want to pick the model that is most probable given the data, and with Bayes' rule you can expand it this way; then you can get rid of the P of X, because there's no dependence on the model. You realise these likelihoods, the probability of the acoustics given the model, with mixtures of Gaussians. Typically each Gaussian is represented just by means and variances, with no covariance represented between the features, plus the weights of each of the Gaussians. The language priors, P of M, are implemented with an n-gram: you do a bunch of counting, you do some smoothing, and you get basically the probability of a word given some word history, such as the most recent n minus one words. Now, the math is lovely, but in practice we actually raise each of these things to some kind of power. This is to compensate for the fact that the models are wrong, and that there really are other dependencies. This is a picture of the acoustic likelihood estimator. There are a few steps in here, and each of these boxes can actually be fairly complicated, but generally speaking there's some kind of short-term spectral estimation; there's the vocal tract length normalisation I mentioned, which compresses or expands the spectrum; there's some kind of smoothing, either by throwing away some of the upper cepstral coefficients or by autoregressive modeling, as is done in PLP; there are various kinds of linear transformations, for instance for dimensionality reduction and for better discrimination; and then there's the statistical engine that I mentioned before, with this funny scaling in the log domain, raising to a power, in order to mix with the language model.
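In code, that whole decision rule boils down to something like the following toy sketch: diagonal-covariance Gaussian mixtures for the acoustic likelihoods, a dictionary of language-model log probabilities standing in for the n-gram, and the scaling exponent just mentioned. All the numbers, model shapes, and the `lm_scale` value are made up for illustration; real systems decode over word sequences with HMM state alignments rather than whole-utterance word models.

```python
import math

def log_gmm(x, mixtures):
    """log p(x | model) for one frame: diagonal-covariance Gaussian mixture."""
    logs = []
    for weight, means, variances in mixtures:
        ll = math.log(weight)
        for xi, m, v in zip(x, means, variances):   # no covariance between features
            ll += -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        logs.append(ll)
    top = max(logs)                                  # log-sum-exp over mixture components
    return top + math.log(sum(math.exp(l - top) for l in logs))

def decode(frames, models, lm_logprob, lm_scale=12.0):
    """argmax over models M of [ sum_t log p(x_t | M) + lm_scale * log P(M) ].

    Frames are treated as conditionally independent given the model (the strong
    HMM-style assumption), and the language prior is raised to a power
    (lm_scale) to compensate for how badly that assumption is violated."""
    best, best_score = None, -math.inf
    for word, mixtures in models.items():
        score = sum(log_gmm(x, mixtures) for x in frames)
        score += lm_scale * lm_logprob[word]
        if score > best_score:
            best, best_score = word, score
    return best
```

With uniform priors the acoustics decide; cranking `lm_scale` up lets the prior override them, which is exactly the compensation trick described above.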
Well, that seems simple enough, but the actual systems that get the very best scores are a bit more complicated than that. First off, there's the decoder, with the language priors coming in. Then you might have two of these front ends, and people have found that this is very helpful for getting the best performance, but you don't just put them in in a simple way: very often you have all sorts of stages, with crossword or non-crossword models, and you produce graphs or lattices and combine them at different points and pass them across. Well, this kind of reminds me of some work by a Berkeley grad of about a century ago named Rube Goldberg: the self-operating napkin. The self-operating napkin is activated when the soup spoon A is raised to mouth, pulling string B and thereby jerking ladle C, which throws cracker D past parrot E; the parrot jumps after the cracker, and perch F tilts, upsetting seeds G into pail H; the extra weight in the pail pulls cord I, which opens and lights the cigarette lighter J, which in turn sets off the skyrocket K, which pulls the sickle L, which cuts the string M and lets the pendulum with the attached napkin swing back and forth, thereby wiping the chin. For some time this has been my view of current speech recognition systems: successful at wiping the chin, some of the time.

So I want to talk a little bit about alternatives, and I want to say at the outset that these are just some of the alternatives; a conference like this has a lot of work heading in many different directions, and these are simply the ones I wanted to give as examples. But first I want to say a little about why to look at what else is there besides the mainstream. The great sage was tracked down by a seeker, and the seeker asked the sage, what is the secret to happiness? The sage answered, good judgement. Well, said the seeker, that's all very well, master, but how does one obtain good judgement? And the master said, from experience. So the seeker said, okay,
experience, but how does one obtain this experience? And the master said, bad judgement. So here are some of the exercises that we and many other people have done in bad judgement. We've pursued different signal representations. Some of them are related to perception, to auditory models: for instance, mean rate and synchrony, from Seneff's model of some time ago, and the ensemble interval histogram from Oded Ghitza. Each of these was related to models of neural firing: how fast the neurons fire, how much they synchronise with one another, what timing there is between the firings. They had some interesting performance in noise; they have not been adopted in any serious way, but there's interesting technology there, and interesting scientific models. Then there's work that's more on the psychological side, where those first ones were based on models of physiology: multi-band systems based on critical bands, going all the way back to Fletcher's work and the work of others. The idea here is that if you have a system that's just looking at part of the spectrum, and the disturbance is in another part of the spectrum, then you can deal with that separately; these have had some successes. Then there's something you can observe at both the physiological and the psychological level: the importance of different modulations in the signal, particularly temporal but also spectral modulations. Then, on the production side, there's been a bunch of work based on the fact that there is only one articulatory mechanism: maybe you can represent things that way, be more parsimonious, and get a better representation of how the signal evolves over time. There have been hidden dynamic models that attempt to do this, and trajectory models; sometimes the trajectory models had nothing to do with the physiological models, but sometimes they did. And there are articulatory features, which you could think
of as a quantized version of the articulator positions, and so forth. Then another direction was artificial neural networks, which have been around for a very long time, actually since before nineteen sixty-one. I picked out this one, discriminant analysis iterative design, because a lot of people don't know about it. A lot of people think that multilayer networks began in the eighties, but back in sixty-one they had a multilayer network that worked very well for some problems and was actually used industrially for a decade after that, in which the first layer of units was a bunch of Gaussians, and after that you had a linear perceptron. A couple of years later there was work at Stanford in which they actually did apply some of this kind of stuff to speech; these were linear adaptive units, called adalines. Bernie Widrow sent me the technical report; here is the cover of the real technical report from nineteen sixty-three, and here is a page from it that shows a block diagram, part of which I've blown up here. It starts off with some bandpass filters, basically getting power measures in each band, and then there are these adalines, which give you some sets of outputs, which went to a typewriter, apparently. The nineteen eighties saw an explosion of interest in the neural network area; part of this was sparked by a rediscovery, let's say, of backpropagation, which is basically propagating the effect of errors from the output of the system back to the individual weights. In the late eighties a number of us worked on hybrid HMM / artificial neural network systems, where the neural networks were used as probability estimators, to get the emission probabilities for the HMM. In the last decade or so, quite a few people have taken off on the tandem idea, which is a particular way of using artificial neural networks as feature extractors.
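The tandem configuration can be sketched roughly as follows: a small MLP classifies phones, and its outputs, typically log posteriors with some normalisation, become the features for a conventional GMM-HMM back end. This is a hypothetical toy, with made-up dimensions and an untrained network; in practice the network is trained with backpropagation on phone targets, and the outputs are often further decorrelated before use.

```python
import numpy as np

rng = np.random.default_rng(0)

class TinyMLP:
    """Toy one-hidden-layer phone classifier (untrained here; in practice
    trained by backpropagation on phone labels)."""
    def __init__(self, n_in, n_hidden, n_phones):
        self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_phones))

    def posteriors(self, x):
        h = np.tanh(x @ self.W1)             # hidden layer
        z = h @ self.W2
        e = np.exp(z - z.max())              # softmax over phone classes
        return e / e.sum()

def tandem_features(mlp, frames):
    """Tandem idea: the network's outputs serve as features for a conventional
    GMM-HMM back end -- here, log posteriors with per-utterance mean removal."""
    logp = np.log(np.array([mlp.posteriors(x) for x in frames]) + 1e-8)
    return logp - logp.mean(axis=0)
```

The point of the design is that the discriminatively trained front end and the generative back end each do what they are good at, without rebuilding the decoder.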
And I will just mention briefly a fairly recent development, deep neural networks. How innovative they are is a question, but there are definitely some new things going on there which I think are interesting. The obvious difference from the previous networks is that there tend to be more layers; there's also sometimes unsupervised pre-training. There are actually several papers at this conference, and there's also a special issue of the Transactions coming in November. Here are a couple of papers from this conference, and I think there are others as well; this one is from an industrial research group. They had a lot of different numbers in the paper, but I picked one out, and most of the numbers had the same general trend: MFCC bad, deep MLP good, and the old shallow MLP somewhere in between. These are error rates, so again, low is good. And there was a large-vocabulary voice search paper, which is at a poster today; they had a sixteen percent reduction, their metric was sentence error reduction, compared to a system that used MPE, which is a very common discriminative training approach.

Okay, so that gets at some of the alternatives, and again, I'm sure many people in this audience could think of many more. Where could we go from here, or, in my opinion, where should we go? Well, better features and models. I've suggested that better models of hearing and production could perhaps lead to better features; then better models of those features, better acoustic models, models of understanding, better language models, dialogue models, pragmatics and so on; all of these are likely to be important. The other thing, which I'm going to go into a bit especially at the end, is understanding the errors: understanding what the assumptions are that go into our models and how to get past them. So let's start with models of hearing. There are useful approximations to the action of the periphery, that is, from here to the auditory nerve, and when I say useful approximations I mean
that there are a number of people who've worked on simplifying the models that were used earlier and crafting them more towards good engineering tools, and some of those are looking kind of promising. There's also new information about the auditory cortex, which I'm going to refer to briefly in the next few slides, including some results in noise. It's good to learn from the biological examples, because humans are pretty good in many situations at recognizing speech; but it's probably good also not to be a purist, and to mix the insights you get from these things with good engineering approaches, and I think there are some good possibilities there. This bottom bullet is just to note that, as with many things in this talk, I'm only covering some of the field, and mostly single channel; people have two ears and make pretty good use of them when they work, and that's something to keep in mind, and of course you can go to many "ears" in some situations with microphone arrays. That's a good thing to think about, but it's not a topic I'm expanding on in this talk; the same goes for visual information, which people use whenever they can, and which I won't talk about either, but it's obviously important.

Okay, now I'm going to talk about this cortical stuff. The slide is courtesy of Shihab Shamma, and not just the slide but also the idea, which comes from experiments that he and his colleagues have done with small mammals that have a pretty similar early part of the cortex, the primary auditory cortex, to what people have; there has also been some other work with people. What's shown here is the kind of spectrogram-like representation that's received at this primary auditory cortex. What they've observed is a bunch of what are called spectro-temporal receptive fields, STRFs, which are little filters that process it in time and frequency, and you could
think of them as processing temporal modulations, which they call rate, and spectral modulations, which they call scale. You can imagine a cube at each time point, with auditory frequency, rate, and scale as the dimensions. Much as you would like, in a regular spectrogram, to de-emphasise the areas where the signal-to-noise ratio is poor and emphasise the areas where it is good, you have perhaps an even greater chance to do that kind of emphasis once you've expanded out to this cube. That's the general idea. So you could end up with a lot of these different spectro-temporal receptive fields, you could implement them, and you could try to do something good with them. A particular implementation that we and a number of people have been trying is what we would call a many-stream implementation, as opposed to the multi-stream I showed before, where we'd have two or three streams; the name just refers to the quantity. What's in each stream is one of these representations: one spectro-temporal receptive field, implemented by a Gabor filter and by a multilayer perceptron that is discriminatively trained to distinguish different speech sounds. You get a whole lot of these, in some of our implementations three hundred, and then you have to figure out how to combine them or select them, hopefully again de-emphasising the ones that are bad indicators of what was said. Another interesting sidelight of this kind of approach is that it's a good fit to modern high-speed computing. As I think a lot of you know, the clock rates aren't going up the way they used to, so the way manufacturers are trying to give us more performance is by providing many more cores; the graphics processors are an extreme example. This kind of structure is a really good match to that, because it's what they call embarrassingly parallel.
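Concretely, one stream of such a front end might look like the following: a single spectro-temporal Gabor patch, one particular rate/scale tuning, correlated with a (time x frequency) spectrogram. This is a rough sketch under assumed sizes and modulation values; the discriminatively trained MLP stage that follows each filter in the systems described above is omitted here.

```python
import numpy as np

def gabor_strf(n_t=11, n_f=11, rate=0.25, scale=0.25):
    """A 2-D Gabor patch: a sinusoid in (time, frequency) under a Gaussian
    envelope. 'rate' is the temporal and 'scale' the spectral modulation
    frequency, in cycles per sample; all values here are illustrative."""
    t = np.arange(n_t) - n_t // 2
    f = np.arange(n_f) - n_f // 2
    T, F = np.meshgrid(t, f, indexing="ij")
    envelope = np.exp(-(T ** 2) / (2 * (n_t / 4) ** 2)
                      - (F ** 2) / (2 * (n_f / 4) ** 2))
    carrier = np.cos(2 * np.pi * (rate * T + scale * F))
    g = envelope * carrier
    return g - g.mean()                  # zero mean: ignore overall energy level

def stream_output(spectrogram, strf):
    """One 'stream': valid-mode correlation of the spectrogram with one STRF."""
    nt, nf = strf.shape
    T, F = spectrogram.shape
    out = np.zeros((T - nt + 1, F - nf + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(spectrogram[i:i + nt, j:j + nf] * strf)
    return out
```

Since each of the hundreds of streams is computed independently, they can be farmed out across cores or GPU threads with no communication, which is the embarrassingly parallel structure just mentioned.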
We found that this kind of approach does remove a significant number of errors, particularly in noise, but also, as it turns out, in the clean condition. It combines well with pure engineering, non-auditory methods, such as Wiener-filter-based methods, and we'd like to think it could combine well with other auditory models, though we haven't really done that work yet. Statistical acoustic models: we currently rely on these critical assumptions, and one of the things about using very different kinds of features is that this can really change their statistical properties from the ones we have now, so the assumptions could be violated in yet different ways. There have been alternative models proposed that let you bypass these typical assumptions, but part of the problem is figuring out which statistical dependencies to put in. Models of language and understanding: it's probably pretty clear, to those of you who know me, that this isn't my research area, but it's of obvious importance, and one thing that has been frustrating to a lot of people, in fact I remember Fred Jelinek being visibly frustrated about this, is that it's very, very tough to get much improvement over simple n-grams, that is, the probability of a word given some number of previous words. But it can be very important to get further information, and we know this for sure for people. Let me tell you a little story. One day I was walking out of ICSI wearing one of these caps; this is a cap for the Oakland Athletics, a local major league baseball club. I also had on a jacket with the same insignia, and I had a radio held to my head. I was walking down the street, and a guy across the street, a moderately noisy street, yelled something over, and I answered, oh, Oakland, five to three. Anyway, we'd like to be able to do that with a machine. So where do we go from here? Well, researchers will continue to get good ideas; every time you get in the shower, maybe you have a good idea coming out. But what's the best methodology, what's the best way to proceed along
this path? Maybe we can learn from some other disciplines, so let me give a kind of stretched analogy to the search for a cure for cancer. And again I'm going to tell you a little story, a personal one, about an uncle of mine named Sidney Farber. Now, my uncle Sid, in the forties, was a pathologist at Harvard Medical School and at Children's Hospital Boston, and he unfortunately got to see lots of little children who came in with leukemia; once they were diagnosed, they only had a few weeks. As a pathologist he mostly dealt with petri dishes and so forth, he wasn't really a clinician, but he got this thought: maybe, if you could come up with chemicals that were more poisonous to the cancer cells than to the normal cells, you could extend the lives of these kids. He experimented with this, in the petri dishes for the most part, for a while, and then he came up with something he thought would work, and he tried it out, with everybody's permission, on some of these kids, and lo and behold, it actually did extend their lives for a while. This was the first known case of chemotherapy. It was just great, and it started a whole revolution; it ended up leading to a big centre, to the National Cancer Institute and so forth, and there's now the Dana-Farber, named in reverence to him. The key point I want to make is that there's this quandary: you have these patients coming through who are in terrible straits, and on the other hand you don't have any time to figure out what's really going on. There were important early successes based on hunches that my uncle and many others had, and there wasn't time to work out the real causes of things; and by the way, there are stories like this for surgery, for surgical interventions, and for radiation as well. So there was some success, but they still didn't find a general cure, and as you know, to this day there is still no general cure for cancer. But things are a
lot better: remissions are longer, and so forth, and now there's starting to be some understanding of the biological mechanisms, which one hopes will lead to a complete solution. There's a wonderful book I strongly recommend, The Emperor of All Maladies, about the history of cancer and its treatment, and I'll just read this one thing from it: remedies in such cases as we have considered must be imperfect, lame, and to no purpose, wherein the causes have not first been searched. This doesn't belie the fact that it can be very useful to go ahead and try to fix something along the way, but in the long term you need to understand what's going on. So, as opposed to just trying our bright ideas, which we all do, how about finding out what's wrong? The statistical approach to speech recognition requires assumptions, the ones I made reference to, that are known, literally, to be false. This may or may not be a problem; maybe it's just handled by, say, raising these likelihoods to a power. How can we learn more? There's some work that's been started that I want to call your attention to, from Steve Wegmann and Larry Gillick, beginning a couple of years ago. What they did was to consider each assumption separately, and then, rather than trying to fix the models, modify the data, by some resampling, some bootstrapping kinds of approaches, to match the models; observe the improvement; and use that to inspire more bright ideas. At this point they've really just focused on the diagnosis part, and not on the new bright ideas, frankly. This is being pursued at ICSI as well, in a project aimed at outing unfortunate characteristics of HMMs, and I'm going to give you just a couple of results from a more recent version. I should add, by the way, that this is a different Gillick: this is Larry's son Dan, who just did his PhD on this. But first, note that this is a very simplified system, so the error rate for Wall Street Journal here is pretty high.
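The resampling trick at the heart of this diagnosis can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: given some frame-to-state alignment, every frame is replaced by a random draw from the pool of frames aligned to the same state, so the per-state marginal statistics are preserved while the frame-to-frame dependence the HMM ignores is genuinely destroyed.

```python
import random

def resample_frames(frames, state_alignment, seed=0):
    """Make real data satisfy the HMM's conditional-independence assumption:
    within each state, replace every frame by one drawn at random from the
    pool of frames aligned to that same state. Marginal per-state statistics
    survive; temporal dependence between frames does not."""
    rng = random.Random(seed)
    pools = {}
    for x, q in zip(frames, state_alignment):       # pool frames by state
        pools.setdefault(q, []).append(x)
    return [rng.choice(pools[q]) for q in state_alignment]
```

Recognising such resampled data with the same models then shows how much of the error is attributable to the independence assumption being false, rather than to the models fitting the marginals badly.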
The output demonstrably does not really fit the Gaussian mixture distributions you got from the training set, and it definitely doesn't satisfy the independence assumptions, and you get this thirteen percent error rate. Now, if you use simulated data, literally generated from the models, you should do pretty well, and in fact you do: virtually all of the errors go away. But here's the interesting one, I think: if you use resampled data, so this is the actual speech data, but resampled in such a way as to assure the statistical conditional independence, it also gets rid of nearly all of the errors. Their studies are a lot more detailed than this; there are a lot of things they're looking at and trying out, but I think this gives the flavour of what they're doing.

So, in summary, speech recognition is mature, in some sense; it has an advanced degree, because it's been around a long time, and there are commercial systems and so forth. And yet we still find it to be brittle, and we essentially have to start over again with each new task. The recent improvements have been really quite incremental, and a lot of things have sort of levelled off. We need to rethink, kind of like going back to school, kind of like continuing education: we may need more basic models, we may need more basic features, we may need more study of errors. The other thing I want to briefly mention is that we do live in an era where there is a huge amount of computation available, and even though the clock rates don't continue to go up as they did, thanks to many-core systems and cloud computing and so forth there is going to continue to be increased availability of lots of computation. This should make it possible for us to consider huge numbers of models and methods that we wouldn't have considered before. For instance, on the front end side, these auditory-based or cortical-based things can really blow up the computation compared with the simple kind of stuff that you
have with MFCCs or PLP. So it's good to do that; it's good to try things that might take a lot of computation, even if they might not work in your iPhone just now. You also have to know, and I'm sure you all do, that just having more computation is not a panacea; it doesn't actually solve things by itself, but it can potentially give you a lot more possibilities. That's pretty much what I want to say. I do want to acknowledge that the work I've talked about is not particularly mine; it comes from many people, including people outside our lab. I want to thank the many current and former students, postdocs, visitors, and ICSI staff, and particularly to give a shout-out to Hynek Hermansky, Hervé Bourlard, Steve Wegmann, and Jordan Cohen. And here's my shameless plug for a book, which was already mentioned, coming out this fall thanks to tons of work from Dan Ellis and other contributors (Simon King, for instance). Thank you for your attention.

Chair: We have a little time for questions; don't hold back.

Audience: (Question largely inaudible.)

Morgan: I think the right answer is "I don't know," because, for instance, what I used to say when people talked to me about this was that I think of speech recognition as being in three pieces: there are the representations that you have; there are the statistical models and the search and so forth in the middle; and then there is everything you could imagine doing with speech understanding, pragmatics, and so forth. I used to think that the first one I know a little bit about, and I feel very
strongly, and there are plenty of results to back this up, that it is very important for improving things. The last one is not my area of expertise, but from what I have seen elsewhere, and certainly in the human case, I believe it is very important. And I used to think the middle part works well enough, but after this study I'm not so sure. I actually think that you should pursue whatever it is that you feel is of greatest interest; the key thing is to have interesting problems.

Audience: All of these techniques, pretty much everything since we moved to short-term spectral analysis, rely on spectral representations like MFCCs or PLP. They capture some aspects, of course, but the big problems, it seems to me, are still interference from other sources, reverberation, spatial hearing and so forth, where these representations don't give much help in distinguishing sources or directions. The other dimension is fine timing information, something that has been explored a lot in the psychological and physiological literature, for instance the ensemble interval histogram, an entirely different kind of representation in which you get rid of the source and the filter at the same time. You didn't say much about that direction. What do you think about it, and should we get people paying more attention to things beyond the short-term spectrum?

Morgan: By "spectral" I guess you mean the short-term spectrum. I may not have said this as clearly as I could have, but the Shamma-style stuff I was making reference to certainly can be long-term: the spectro-temporal representation that you feed the different cortical filters can be a very different kind of spectrogram, one that takes advantage
of exactly that sort of structure, and I think that's absolutely what we should do. As for the disturbances, the multiple sources, the reverberation, et cetera: I agree that's the biggest challenge. If someone talks about the performance of humans versus the current generation of speech recognition systems, that's the easiest explanation of the difference. So I completely agree; sorry, I'm not being a politician, I actually do agree.

Audience: (Comment, partly inaudible, suggesting that HMMs and the modeling itself deserve more attention.)

Morgan: You're certainly reinforcing my bias. I'm mostly a front-end person these days, and have been for a while, and I agree that there's a lot to be done there. I didn't mean to say at all that the language modeling and so forth was the bulk of it. Even that study at the end was just saying that, for a fairly simple case with essentially matched training and test, you could manipulate the data in such a way as to match the models' assumptions and do much better. One of the things we're going to try in follow-ups to that study is looking at mismatched conditions: cases with noise and reverberation and so forth, where I don't think the effect will be quite as big. And you know, it's garbage in, garbage out: if you feed in representations that are not giving you the information you need, how are you going to get it out at the end? So I agree with you, but I was trying to be fair, not only to the people who work on those other areas, but also because I feel that if you cover the space of all these different cases, there are many cases where those areas are in fact very important; human beings, as in my earlier example, make use of higher-level information, often in order to figure out what was said.

Chair: Which leads me to George's question.

Audience: As you
were talking, I was constantly, almost irresistibly, struck by the analogy between speech recognition and optical character recognition; almost every slide invited it, from the current successes to the future directions to the problems being experienced. I'm just wondering: is there cross-disciplinary knowledge that can be leveraged, and is it being leveraged?

Morgan: Not much, except in the sense that some of these alternative approaches have tried looking at the spectrogram as an image, and so forth. Some of the neural network techniques that were developed in optical character recognition sort of came back the other way, and a lot has gone in the other direction. But, you know, we can be a fairly fragmented community and not listen to each other quite as much as we should.

Audience: I have some exposure, as probably most people here do, to modern speech recognition technology in real applications; I've been exposed to Google Voice, as perhaps many people have. This is not a plug for Google Voice, but modern deployed speech recognition seems to me amazingly good, considering that these systems have no real semantic understanding; it's amazing to see what they can do on acoustics and language models alone. So it seems to me the challenge is creating models of semantic context that give the kind of support to speech recognition that we have so far seen only from relatively weak language models, which don't model that. Okay, well, that wasn't really a question.

Morgan: I know it wasn't, but I'll say something anyway, which is that I
am really taking a middle position. There are plenty of tasks where recognition does fail, particularly in noise and reverberation; Google voice search is very impressive, but there are a lot of cases where things do fail, and we can see significant improvements on a number of tasks by changing the front end. So I think there is something important there. But your statement wasn't really attacking the front end; what you're saying is that we also have to pay attention to the back end, and I completely agree.

Chair: One more; it's probably time for just one.

Audience: A change of subject a little bit. Given that you have seen both sides, can you say something about the roles of industry and academia in this research: what is good and what is bad in each for speech research now, and where should we go?

Morgan: Well, I actually run a pretty small operation, so I'm not sure I have the big picture, but I think industry should fund academia.

Chair: Thanks again to the speaker.
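The within-state resampling diagnostic Morgan describes (following Wegmann and Gillick) can be sketched as below. This is a minimal illustration under assumed inputs: the function name and the representation of a forced alignment as a list of (state, frame) pairs are assumptions for exposition, not the authors' actual tooling.

```python
import random

def resample_within_states(aligned_frames):
    """Sketch of the resampling diagnostic: keep the HMM state sequence
    from a forced alignment, but redraw each feature frame independently
    (with replacement) from the pool of real frames aligned to that state.
    This preserves each state's marginal frame distribution while enforcing
    the conditional-independence assumption that real speech violates."""
    # Pool all real frames by the state they were aligned to.
    pools = {}
    for state, frame in aligned_frames:
        pools.setdefault(state, []).append(frame)
    # Rebuild the utterance: same state sequence, independently drawn frames.
    return [(state, random.choice(pools[state]))
            for state, _ in aligned_frames]

# Toy usage: states are ints, "frames" are stand-in feature labels.
alignment = [(0, "f1"), (0, "f2"), (1, "f3"), (1, "f4"), (0, "f5")]
resampled = resample_within_states(alignment)
```

Decoding such resampled test data with the original models, as in the study cited in the talk, shows how much of the error is attributable to the independence assumptions rather than to the frame-level distributions.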