0:00:15 | That's actually a Morgan kind of introduction; it didn't say too much. Thank you, Brian. |
0:00:24 | Actually, before I get to the talk, I should mention that I had a brief discussion with someone about the posters, and we realized that to some extent the optimum strategy for a poster would be to make it seem really interesting but completely impossible to understand, so that people would want to come up and have it explained. |
0:00:47 | Anyway: "there and back again." Someone else suggested that perhaps the talk should be called "deja vu all over again," from that same philosopher, Yogi Berra. |
0:00:58 | But let me start with a little story. For those of you who don't know, Arthur Conan Doyle wrote a series of stories about a detective, a fictional character named Sherlock Holmes, who had a colleague named Watson who really didn't know so much about detection. |
0:01:19 | So Holmes and Watson went on a camping trip. They shared a good meal, had a bottle of wine, and retired to their tent for the night. At three in the morning Holmes nudged Watson and said, "Look up at the sky and tell me what you see." Watson said, "I see millions of stars." Holmes: "And what does that tell you?" |
0:01:37 | Watson replies: "Astronomically, it tells me there are billions of galaxies and potentially millions of planets. Astrologically, it tells me that Saturn is in Leo. Theologically, it tells me that God is great and we are small and insignificant. Horologically, it tells me it's about three. Meteorologically, it tells me we'll have a beautiful day tomorrow. What does it tell you, Holmes?" |
0:01:58 | Holmes: "Someone has stolen our tent." |
0:02:03 | So: what we might be missing, if you think about it. There are some great, really exciting results, and there are a lot of people who are interested now in neural nets for a number of application areas, but in particular in speech recognition, which is what a lot of us here do. But there might be a few things that we're missing on the journey, and perhaps it might be useful to look at some historical context to help us see that. |
0:02:36 | So, as Brian alluded to earlier in the day, there has been a great deal of history of neural networks for speech, and of neural networks in general, before this. And I think of this as occurring in three waves. |
0:02:51 | The first wave was in the fifties and sixties, with the development of the perceptrons and adalines; I think of this as the basic structure, or the BS. In the eighties and nineties we had back-propagation, which had actually been developed before that but was really applied a lot then, and multilayer perceptrons or MLPs, which were applying more structure to the problem: sort of the MS. And now we have things that are piled higher and deeper, so it's the PhD level. |
0:03:25 | Now, ASR, speech recognition: we had digits, pretty much, or other very small vocabulary tasks, in the fifties and sixties. In the eighties and nineties we actually graduated to large vocabulary continuous speech recognition. And in this new wave there's really quite serious use of the technology, and the problems are compounded. |
0:03:51 | Now, this talk isn't about the history of speech recognition, but I don't think I can really do a history of neural nets for speech recognition without doing a little bit of that. That also had an early start. |
0:04:03 | The best-known early paper was a nineteen fifty-two paper from Bell Labs. But before that was Radio Rex. |
0:04:11 | Now, if you haven't seen or heard about Radio Rex: Radio Rex was a little toy dog in a dog house. You'd say "Rex!" and Rex would pop out. Of course, if you did that, Rex would also probably pop out for just about anything that had enough energy at five, six, seven hundred hertz or so, because the dog house actually resonated with some of those low frequencies. And when it resonated and vibrated, it would break a connection from an electromagnet, and a spring would push the dog out. So we could think of it as speech recognition with really bad rejection. |
0:04:45 | Now, the first paper that I know of, anyway, that described real speech recognition was this paper by Davis and colleagues on digit recognition, from Bell Labs. It approximated the energy in the first couple of formants; it was really just how much energy there was, over time, in different frequency regions. |
0:05:11 | It already had some kind of robust estimation; in particular, it was quite insensitive to the amplitude. And it worked very well under limited circumstances: that is, pristine recording conditions, very quiet, very good signal-to-noise ratio, in the laboratory, and also for a single speaker. It was tuned to a single speaker, and really tuned, because it was a big bunch of resistors and capacitors. It also took a fair amount of space, that nineteen fifty-two digit recognizer. It wasn't something you would fit into a nineteen fifty-two phone. |
0:05:53 | Now, I should say that this system had a reported accuracy of ninety-seven, ninety-eight percent. And since every commercial system since then has reported an accuracy of ninety-seven to ninety-eight percent, you might think there's been no progress. But of course there has been: the problems have gotten much harder. |
0:06:16 | But the accuracy numbers aren't the real point; this was just a little bit of history. Fundamentally, the early ASR was based on some kind of templates or examples, and distances between the incoming speech and those examples. In the last thirty to forty years the systems have pretty much been based on statistical models, especially in the last twenty-five. |
0:06:42 | The hidden Markov model technology, however, is based on mathematics from the late sixties. And the biggest source of gain since then (this is a slightly unfair statement that I'll justify in a moment) has been having lots of computing. Now, obviously there are a lot of people, including a lot of people here, who contributed many important engineering ideas since the late sixties. But those ideas were enabled by having lots of computing and lots of storage. |
0:07:14 | Statistical models are trained with examples; this is the basic approach we all know about. The examples are represented by some kind of choice of features, the estimators generate likelihoods for what was said, and then there is a model that integrates over time these sort of pointwise-in-time likelihoods that are generated. |
0:07:37 | Now, artificial neural nets can be used for this too: either to generate the features that are then processed by some kind of probability estimator that isn't just a neural net, or to generate the likelihoods that are actually used in the hidden Markov model. |
0:07:54 | Going back to these three waves: in the first wave (and actually, I should say, a lot of the things from the early wave carried through to the current one) the idea was the McCulloch-Pitts neuron model. And there were training algorithms, learning algorithms, that were developed around this model: perceptrons, adalines, and other more complex things, an example of which is what's called Discriminant Analysis Iterative Design, or DAID. |
0:08:25 | Now, going into these a little bit: the McCulloch-Pitts model was basically that you had a bunch of inputs coming in from other neurons, they were weighted in some way, and when the weighted sum exceeded some threshold, the neuron fired. The perceptron algorithm was based on changing those weights when the firing was incorrect; in other words, for a classification problem, when it said it was a particular class and it really wasn't. |
0:08:56 | By the way, I'm going to have almost no equations in this presentation itself. If you really want equations: too bad. |
0:09:07 | So the perceptron learning algorithm adjusted these weights using the outputs, using whether the neuron fired or not. The adaline approach was actually a linear processing approach, where the weights were adjusted using the weighted sum itself. |
0:09:23 | The initial versions, and all the experiments with both of these, were done with a single layer, so they were single-layer perceptrons, single-layer adalines. |
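A minimal sketch of the two single-layer update rules just described (an editorial illustration, not the historical code; all names and values here are hypothetical):

```python
import numpy as np

def perceptron_update(w, x, target, lr=0.1):
    # Perceptron rule: adjust the weights only in response to the
    # thresholded output, i.e., whether the unit fired incorrectly.
    fired = 1.0 if np.dot(w, x) > 0 else 0.0
    return w + lr * (target - fired) * x

def adaline_update(w, x, target, lr=0.1):
    # Adaline (Widrow-Hoff / LMS) rule: adjust the weights using the
    # linear weighted sum itself, before any thresholding.
    return w + lr * (target - np.dot(w, x)) * x

# Toy usage: one training example, with a bias input appended.
w = np.zeros(3)
x = np.array([0.5, -1.0, 1.0])  # the last element serves as the bias input
w = perceptron_update(w, x, target=1.0)
```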
0:09:32 | And in the late sixties there was the famous book by Minsky and Papert, "Perceptrons," that pointed out that such a simple network could not even solve the exclusive-OR problem. |
0:09:45 | But in fact, multiple layers were used as early as the early sixties; an example of that is this DAID algorithm. |
0:09:54 | Now, DAID was not a homogeneous neural net like the kinds of nets that we mostly use today. It had Gaussians at the input layer and a perceptron at the output layer. It was somewhat similar to the later radial basis function networks, which also had some kind of radial basis function, a Gaussian-like function, at the first layer. |
0:10:17 | It had a really rather clever weighting scheme: when you loaded up the covariance matrices for the Gaussians, you would give particular weight to the patterns that had resulted in errors, and it used an exponential loss function of the output to do that. |
0:10:39 | This wasn't really used for speech, but it was used for a wide variety of problems by McDonnell Douglas and other governmental and commercial organizations. A lot of people don't know about it; I happen to know about it because I reported on it at one point. This is pre-history, but anyway. |
0:10:59 | So, moving on to ANNs for speech recognition: in the early sixties at Stanford, Bernard Widrow's students built a system for digit recognition, where they had a series of these adalines, these adaptive linear units. And it worked quite well within speaker, much as the nineteen fifty-two system had, except that this was automatic; you didn't have to tune a bunch of resistors. And it had terrible error rates across speakers. But it was sort of comparable, and it was using this kind of technology. |
0:11:40 | Moving into the eighties: wave two. There was consonant classification work with such systems, and I had the good fortune to be able to play around with such things for voiced/unvoiced classification for a commercial task. And competing systems started coming up by the mid to late eighties. |
0:12:16 | People at CMU (Alex Waibel and Geoff Hinton, who was there at the time, and Kevin Lang) did this kind of classification for stop consonants using such systems. |
0:12:29 | And there were many others; I don't have enough room on one slide to show how many there were. But Kohonen in Finland, groups in Germany, Peeling and Moore in the UK, and many others built up these systems and typically did isolated word recognition. |
0:12:48 | Then, by the end of the eighties, we got to real speech recognition: that is, continuous speech recognition, speaker-independent, et cetera. |
0:12:58 | so |
---|
0:13:00 | have the good fortune to have really |
---|
0:13:02 | clever friends |
---|
0:13:03 | and together with some of them include some of this work |
---|
0:13:07 | i ever bourlard can visit a dixie and eighty eight |
---|
0:13:11 | and he and i started one collaboration where we developed in approach |
---|
0:13:16 | for using feed-forward |
---|
0:13:17 | neural networks for speech recognition |
---|
0:13:20 | And there's a range of other people who did related things, in particular in Germany and at CMU. |
0:13:28 | Also, there was work on recurrent networks. The feed-forward nets just go straight from input to output; there aren't any feedback paths in them. The recurrent nets actually fed back. And this was really pioneering; there were a number of people who worked with recurrent networks. |
0:13:49 | But for applying them to large vocabulary continuous speech recognition, the real center for that was Cambridge: Tony Robinson, and Frank Fallside while he was still alive. |
0:13:59 | What both approaches had in common was that, through the proper training, they'd generate probabilities of the phone classes, and then they'd derive state emission likelihoods for hidden Markov models. Typically we found it worked better, in most cases, to divide by the prior probabilities of each of the phone classes and get some scaled likelihoods. |
0:14:21 | And we attached a moniker to this; the name we call it is the hybrid HMM/MLP, or hybrid HMM/ANN, system. |
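For reference, the arithmetic behind those scaled likelihoods, standard in the hybrid literature (a reconstruction from the description above, since the slide's equation isn't in the transcript): by Bayes' rule, for a state (phone class) q and acoustic frame x_t,

```latex
\[
  p(x_t \mid q) \;=\; \frac{P(q \mid x_t)\, p(x_t)}{P(q)}
  \;\propto\; \frac{P(q \mid x_t)}{P(q)},
\]
```

where the net supplies the posterior P(q | x_t), the prior P(q) is counted from the training alignments, and p(x_t) is dropped because it is identical for every state during decoding.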
0:14:35 | So, with MLPs you would use error back-propagation, using the chain rule to spread the blame or credit back through the layers. It was simple to use, simple to train, and gave powerful transformations. |
0:14:50 | MLPs were also used for classification and prediction, but in the hybrid system the idea was to use them for probability estimation. And initially we did this for a limited number of classes, typically monophones. (This slide has the only equation in the whole talk.) We did understand that having some representation of context could be beneficial, but it was kind of hard to deal with twenty-some years ago, and the notion of having thousands and thousands of outputs just didn't seem like a particularly good one, at least with the limited amount of training data and computation that we had to work with. |
0:15:35 | So we came up with a factored version. In this equation, Q stands for the states, which in this case were typically monophones; C stands for context; and X is the feature vector. And you can break it up without any assumptions (no independence assumptions) into two different factorizations. One is the probability of the state given the context and the input, times the probability of the context given the input. The other one, on the right, is the probability of the context given the state and the input, times the monophone probability. And the latter one means that you could take the monophone net that you had already trained and just multiply in this other term. |
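Written out (a reconstruction of the slide's equation from the description above), with q a state, c a context class, and x the acoustic input:

```latex
\[
  P(q, c \mid x) \;=\; P(q \mid c, x)\, P(c \mid x)
               \;=\; P(c \mid q, x)\, P(q \mid x).
\]
```

Both factorizations are exact; the right-hand form is the one that reuses an already-trained monophone net P(q | x), multiplied by a context net P(c | q, x).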
0:16:28 | As with other things, we hit roadblocks initially, so for the first six months to a year it didn't work at all. And then our colleagues at SRI, who were very helpful, came up with some really good smoothing methods, which, given the limited amount of data we were working with, were really necessary to make context work. |
0:16:50 | And then, a few years later, Fritsch at CMU took this to an extreme, where you actually had a tree of such MLPs, so you could implement this factorization over and over, getting finer and finer, and down at the leaves you actually had tens of thousands or even a hundred thousand generalized triphones of some sort. And it worked very well; it was actually quite comparable to other systems at the time. But it was really complicated, and most people at this point had really focused in on Gaussian mixture systems, so it never really took off. |
0:17:29 | Now, if you look at where all this was in around two thousand: the Gaussian mixture approaches had matured. People really had learned how to use them, and many refinements had been developed. Think about Gaussians: you have means, you have covariances (people typically use diagonal-only covariance matrices), and so there are lots of simple things that you can do with them. Many of these were developed: MLLR, SAT, MMI, and later MPE; all sorts of alphabet soups. |
0:18:05 | These refinements didn't come easily, and equivalents didn't come easily to the MLP world. And since the MLP world, at least for large vocabulary continuous speech recognition, was at this point really confined to a few places, while almost everybody was working with Gaussian mixtures, it was kind of hard to keep up. |
0:18:26 | But we still wanted to. And we liked the nets, because one important reason for us was that they worked really well with different front ends. So if you came up with some really weird thing (you know, you'd listen to Christoph talking about neurons, and we'd say "let's try that thing") you'd feed it to the MLP, and the MLP didn't mind. |
0:18:47 | We had experiences with a colleague of ours, for instance John Lazzaro, who was building these funny little chips that implemented, in some threshold MOS process, various functions that people had found in the cochlear nuclei and so on. You'd feed those into HTK and it would just roll over and die. So we fed them into our systems, and they didn't mind at all. Because of the nature of the nonlinearities, it really was very agnostic to the kind of inputs. |
0:19:23 | So the question is how to take advantage of both. Well, what happened at this time: we were working with Hynek Hermansky, who was at OGI, and with Dan Ellis, who was at ICSI. And there was this competition happening for a standard for distributed speech recognition, the idea being that you would compute the features on the phone and then somewhere else you would actually do the rest of the recognition. And so the idea was to replace MFCCs with something better. |
0:19:54 | The models were required to be HMM/GMM; you couldn't change that. We still liked the nets. So the solution these guys came up with was to use the outputs as features, not as probabilities. They weren't the only ones who ever used the outputs of MLPs as features, but there was a particular way of doing it, and it was implemented in large vocabulary, or small vocabulary, systems; what it really worked on was the digits. And this was called the tandem approach. |
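A minimal sketch of the tandem recipe as described (an editorial illustration; the stub below stands in for a trained phone-classifying MLP, and all names and sizes are hypothetical):

```python
import numpy as np

def mlp_posteriors(frames):
    # Stand-in for a trained phone-classifying MLP: one posterior
    # distribution over phone classes per input frame.
    logits = np.random.randn(len(frames), 40)           # 40 phone classes
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def tandem_features(frames, n_components=13):
    # Take log posteriors (to make them more Gaussian), then decorrelate
    # with PCA so they suit a diagonal-covariance HMM/GMM back end.
    logp = np.log(mlp_posteriors(frames) + 1e-10)
    logp -= logp.mean(axis=0)                           # center per dimension
    _, _, vt = np.linalg.svd(logp, full_matrices=False)
    return logp @ vt[:n_components].T                   # project

feats = tandem_features(np.zeros((100, 39)))            # 100 frames in
print(feats.shape)                                      # -> (100, 13)
```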
0:20:28 | Now, there was sort of a social and cultural advantage in this for our research. The nice thing was that instead of having to convince everybody that hybrid systems were the way to go, we could just say: here are some cool features, you should try them out. And we could, and did in fact, collaborate with other people's systems that way. And I should give some credit here to related work being done around speaker recognition. |
0:20:57 | So there were also other variants. Once you get the idea that you have some interesting use of neural nets to generate features, you can also focus on a temporal approach, which Hynek and his guys did with TRAPs, where you would have neural nets just looking at parts of the spectrum over a long span of time. And so they would be kind of forced into learning something about the temporal properties that would help you with phonetic identification. |
0:21:27 | ICSI's version of this was called HATs: hidden activation TRAPs. And in all of these there was the germ of what people do now with the layer-by-layer stuff, because you'd train something up and then you'd feed that into another net. In the case of HATs, you'd train something up, then throw away the last layer and feed what remained into something else as features. |
0:21:52 | Then there were a bunch of things we worked on with Gabor filters and modulation spectra, where you had modulation-based inputs; you could feed them in using a tandem approach and end up getting features from that. |
0:22:06 | And then a much more recent version is bottleneck features, which are kind of tandem-ish. It's not exactly the same thing, since it's not coming from posteriors, but it is using an output from the net as features. |
0:22:25 | So: the third wave. |
0:22:37 | There's nothing wrong with the original hybrid theory; I mean, it worked fine. The GMM approach sort of won out, because when you get a lot of people moving in the same direction, a lot of things can happen. But also, just given the computation and storage and so forth of the time, I think it was a lot more straightforward to make progress with modifications to the GMM-based approaches. |
0:23:05 | So the fundamental issues in going further with the hybrid approach were how to apply many parameters usefully, and how to get these emission probabilities for many phonetic categories. |
0:23:18 | And aspects of the solution were already there. As already mentioned, in a number of these approaches we were already generating MLPs layer by layer. For many phonetic categories, there was some work on context dependence, but that needed to be pushed further. Learning approaches: second-order methods, weight initializations, and so forth; there were many papers on that sort of thing, on variants of conjugate gradient and the like, in the eighties, and conjugate gradient of course is much older than the eighties. |
0:23:51 | But someone had to do all this. And, as with the reflections from an earlier time, I don't want to cast aspersions; people were doing great things then. But someone actually had to put these things together and push them forward. |
0:24:05 | And in that kind of discussion you have to start with Geoff Hinton. Geoff is kind of an excitable guy. He was very excited by back-propagation in the eighties; he's been excited about these newer things since. And he's very good at spreading that excitement. |
0:24:20 | So he developed particular initialization techniques, and some of these were unsupervised techniques in particular, which he liked because they seemed biologically plausible. And this permitted the use of many parameters across all the layers, because when you have many layers, back-propagation isn't too effective down at the early layers: the credit or blame gets watered down. |
0:24:52 | And this excitement spread to Microsoft Research, and they extended what was going on before to many phonetic categories and large vocabulary speech recognition. And lots of other very talented people, at Google, IBM, and elsewhere, followed. |
0:25:14 | So: initialization, having a good starting point for the weights before you start discriminative training of some sort. This was often used for the limited-data case. Back in the early nineties, when we were going into some situation where we had relatively little data, it was often the case that we'd train on something else first and then start from those weights; maybe we wouldn't even train all the way, we'd just do an epoch or two. And then we would go to the other language or the other task. And we often found that to be very helpful. |
0:25:49 | So Hinton developed a general unsupervised approach, applied it to multiple layers, and in general called it deep learning. A lot of the early systems were called deep belief nets; more generally they're DNNs, applied to other applications as well as speech. And again, this gave reasonable weights for the layers far from the targets, because even if those weights aren't changed much by the back-propagation training, at least the early layers are doing something useful. |
0:26:24 | Later speech work (a lot of the things that you see in posters and papers in the last couple of years) actually skips this step and does something else, for instance layer-by-layer training done discriminatively. And many approaches use some kind of regularization to avoid overfitting. |
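A minimal sketch of that greedy discriminative layer-by-layer idea (an editorial illustration; `train_layer_on_targets` is a hypothetical stand-in for an ordinary supervised backprop run):

```python
import numpy as np

def train_layer_on_targets(features, labels, n_hidden=512):
    # Placeholder for training one new hidden layer plus a temporary
    # softmax output against the real targets; returns the hidden mapping.
    rng = np.random.default_rng(0)
    W = 0.01 * rng.standard_normal((features.shape[1], n_hidden))
    return lambda f: np.tanh(f @ W)   # pretend W came from that training

def grow_deep_net(x, labels, n_layers=5):
    # Each round: train a layer discriminatively, discard its temporary
    # output layer, and stack the next hidden layer on its activations.
    stack, feats = [], x
    for _ in range(n_layers):
        layer = train_layer_on_targets(feats, labels)
        stack.append(layer)
        feats = layer(feats)
    return stack   # ready for a final output layer and fine-tuning

stack = grow_deep_net(np.zeros((10, 39)), labels=np.zeros(10), n_layers=3)
```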
0:26:45 | So the recent work, which you'll hear much more about here, shows significant improvements over comparable GMMs. And there's a mixture of approaches: sometimes tandem-like or bottleneck-like, sometimes a hybrid mode; I think it's usually a hybrid mode. |
0:27:07 | And I have to say: it's great to call them deep neural nets, but they're still multilayer perceptrons. They're just multilayer perceptrons with, you know, a certain number of layers. And you can say, well, okay, but it's really different with seven hidden layers than it used to be. You know... maybe. |
0:27:24 | But we do have to ask: how deep do they need to be? Many experiments show continued improvements with more layers, and at some point there are diminishing returns. But the underlying assumption there is that there's no limit on the parameters. |
0:27:39 | So we started asking the question: what if there were a limit? Now, why would you want a limit? Well, because in any practical situation you actually have some kind of limit; at the least, there's a cost. You could think of the number of parameters as a proxy for the cost, for the resources in general: the time it takes to train, the time it takes to run, the amount of storage. There are people who'll say "go ahead," but I have to say, even if you've got a million machines, you've probably got a hundred million users, so it still matters how many parameters you use. |
0:28:14 | So at Interspeech we presented something, which I'm just going to present for a minute or two here, that we called "deep on a budget." And we said: suppose we have a fixed but very large number of parameters; we wanted to make sure that nobody thought we didn't use enough parameters. |
0:28:31 | And then you compare narrow-and-deep versus wide-and-shallow. We often see comparisons where people try the earlier version that we often used, one big hidden layer, versus a bunch of hidden layers. But we wanted to do it all along the way, step by step: two hidden layers, three hidden layers, more layers, keeping the architecture otherwise the same. |
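A minimal sketch of the bookkeeping implied by holding the architecture to a fixed budget (an editorial illustration; the input and output sizes below are arbitrary placeholders, not the paper's numbers): given a total parameter budget, solve for the hidden-layer width a net with a given number of equal hidden layers can afford.

```python
def width_for_budget(budget, n_in, n_out, depth):
    # Ignoring biases, an MLP with `depth` equal hidden layers of width h has
    #   params = n_in*h + (depth-1)*h*h + h*n_out.
    # Solve the quadratic (depth-1)*h^2 + (n_in+n_out)*h - budget = 0 for h.
    if depth == 1:
        return budget // (n_in + n_out)
    a, b = depth - 1, n_in + n_out
    return int(((b * b + 4 * a * budget) ** 0.5 - b) / (2 * a))

budget = 1_000_000                       # fixed, deliberately large
for depth in range(1, 6):                # compare 1..5 hidden layers fairly
    print(depth, width_for_budget(budget, n_in=351, n_out=56, depth=depth))
```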
0:28:55 | It was only on one task, and a pretty small task: Aurora 2. That allowed us to look at varying signal-to-noise ratios. We asked: if you did this on a budget, what works best? Aurora 2 has different kinds of additive noise (train station, babble, and so forth), and this was the mismatched case: clean training and noisy test. We didn't do the multi-style training. |
0:29:23 | and |
---|
0:29:24 | it turns out that the answer is all over the map |
---|
0:29:28 | and in particular |
---|
0:29:30 | for the cases that were kind of usable |
---|
0:29:33 | signal-to-noise ratios and by usable i mean |
---|
0:29:36 | if we gave you a few percent error and digits |
---|
0:29:39 | as opposed to twenty or thirty or forty which you just couldn't used for anything |
---|
0:29:44 | actually to was better |
---|
0:29:47 | and then to yield little bit with the question of will maybe you just pick |
---|
0:29:51 | the we were number of parameters we tried with double number of parameters have the |
---|
0:29:56 | number of parameters we saw similar for now |
---|
0:29:59 | So when I gave the longer version of this at Interspeech, some of the comments were along the lines of "why do you think two is better?" and so forth. I just want to be clear: I'm not saying that two is better, period. What I'm saying is that if you were thinking of something actually going into practical use, you should do some experiments where you keep the number of parameters the same. You might then expand and so forth, but you should first do some experiments keeping the number of parameters the same; then you get an idea about what's best, and it's probably going to be task-dependent. |
0:30:35 | So: we focus on neural networks, but we do have to be sure we ask the right questions. |
0:31:17 | One question is: what do we feed into the nets? You know, there are all these questions about what's the right data and how many layers we should have and so forth. Some people (naming no names), whom I'd characterize as true believers, think that features aren't important. Actually, to qualify that slightly: in a discussion, at Interspeech I think it was, I made this comment, and he said no, he thinks features are important. So, anyway: features are important. |
0:31:55 | And this goes back to the old general computing axiom: garbage in, garbage out. |
0:32:01 | People have done some very interesting experiments with feeding waveforms in. And I should say, back in the day we did some experiments (Hynek led experiments like this) feeding waveforms in and comparing to PLP, and the waveforms were way worse. People have made some progress there; they're actually doing better now. |
0:32:19 | But if you actually look in detail at what these experiments do: in one case, for instance, they take the absolute value, they floor it, they take the logarithm, they average over a bunch of samples; all sorts of things which actually obscure the phase. And that's kind of the point: you can have waveforms of extraordinarily different shape that really sound pretty much the same. |
0:32:44 | There are more recent results that use maxout pooling in convolutional neural nets; those also had, you know, a nice result. And again, this maximum-style pooling also tends to obscure the phase. |
0:32:59 | But in both those cases, and in every other case I've heard of anyway, this completely falls apart when you have mismatch: when the testing is different from the training. |
0:33:10 | So what is the role of having a front end? After all, all the available data is in the waveform. (There are some assumptions in that statement, but let's ignore that for the moment.) |
0:33:23 | In fact, front ends do consistently improve speech recognition. And I have this great quote that I learned from Hynek, which is that the goal of front ends is to destroy information. That's a little extreme (he can be like that sometimes), but I think it's true that some information is misleading and some information is not relevant, and we want to focus on the discriminative information. |
0:33:48 | Because the waveform that you receive is not just the spoken language: it's also noise, and reverberation, and channel effects, and characteristics of the speaker. If you're not doing speaker recognition, maybe you don't care so much about that last one. |
0:34:02 | And so the front end can help you focus on the things that you care about for your particular task. And a good front end, in principle, carried to an extreme, can make recognition extremely simple. At least in principle. |
0:34:20 | So what about the connection to MLPs? Well, as I alluded to earlier, MLPs have few distributional assumptions. MLPs can also easily integrate information over time, and multiple feature streams could provide a useful way to incorporate more parameters. So yes, depth does give you a nice way, especially with good regularization, initializations, and so forth, to incorporate more features (more parameters, sorry) usefully. |
0:34:50 | But multiple streams can do this too. By multiple streams I mean having different sets of MLPs that look at the signal in different ways. And you can really expand out the number of parameters that way, in a way that is often quite useful. So I might as well throw in another acronym: if you use this together with depth, you can call it a deep-and-wide net. |
0:35:17 | So you can combine these different streams easily, because the outputs are posteriors, and we know how to combine probabilities. |
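A minimal sketch of one common way to merge posterior streams frame by frame (an editorial illustration; a weighted log-linear combination is just one choice among several):

```python
import numpy as np

def combine_streams(posterior_list, weights=None):
    # posterior_list: one (frames, classes) array per stream, where each
    # stream's rows are posterior distributions over the same phone classes.
    n = len(posterior_list)
    weights = weights if weights is not None else [1.0 / n] * n
    log_combo = sum(w * np.log(p + 1e-10)
                    for w, p in zip(weights, posterior_list))
    combo = np.exp(log_combo)
    return combo / combo.sum(axis=1, keepdims=True)   # renormalize per frame

# Toy usage: two streams of 5 frames over 4 classes.
streams = [np.full((5, 4), 0.25), np.eye(4)[[0, 1, 2, 3, 0]]]
print(combine_streams(streams).shape)                 # -> (5, 4)
```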
0:35:24 | Here's an early example that Barry Chen did at our place, thirteen or fifteen years ago or something, called the tonotopic MLP. The idea is that you have a bunch of different sets of layers looking at different critical bands; this is like the HATs and TRAPs and so forth. The difference is that it was just trained all at once. And in fact this worked okay. |
0:35:52 | A recent example (and there are lots of such examples around; I just picked this one because it was done by one of my students, actually, who is in China now) is one in which he had inputs coming from high modulation frequencies and low modulation frequencies. |
0:36:15 | And SPCA here is not the Society for the Prevention of Cruelty to Animals, but sparse PCA, and it's used to pick out particular filters (Gabor filters in this case) that are particularly useful for the discrimination. |
0:36:36 | And these then go into deep neural nets, six-layer deep neural nets, and the output of one deep neural net goes into another, so I guess it's really deep; but you also have some width in there. This was used to some effect on very noisy data for the RATS program; that's data that's been transmitted through radio channels and is really extremely awful by the time you get it at the other side. |
0:37:04 | So, whatever they're called, DNNs or TRAPs: nearly all are still based, essentially, on this McCulloch-Pitts model. |
0:37:15 | There is some nice work (there's also a poster here) on more complex units. But certainly for large vocabulary tasks, for real word error rate measurements, they're not particularly better yet, which is a little disappointing. But maybe that's because this work has just started. |
0:37:38 | The complexity and power is not supplied by having more complex units, as it arguably is in the brain; it is supplied by the depth, and also, as I said, with multiple streams, by the width. You can also represent signal correlations to some extent by pooling, and again by acoustic context. And so far, at least, the most effective learning methods are not biologically plausible. |
0:38:01 | So given all that, how can we benefit from biological models? Why would we want to benefit from biological models? Because we want stable perception in noise and reverberation, which human hearing can do and our systems certainly cannot. The cocktail party effect, picking one voice out of many: there are some laboratory demonstrations of such things, but in general they don't really work. |
0:38:28 | Rapid adjustment to changing conditions: I remember telling someone at one point that if our sponsors wanted us to have the best recognition anyone could have in this room, we'd collect a thousand hours in this room. Then if the sponsors came back next year and said, now we want it to work in that conference room down the hall, we'd have to collect another thousand hours. Okay, I'm exaggerating slightly; there is a set of things called adaptation. But it's really very minor compared to what people can do: we just walk into this room, then walk into the other room, and we just keep hearing pretty much the same. |
0:39:05 | And real speaker independence: we often call our systems speaker-independent speech recognizers, but when you have a voice that's particularly different, it does badly. |
0:39:16 | So can we learn from the brain? These are pictures from the same source as one of the pictures in the first talk: ECoG. This is direct cortical measurement, as was explained earlier. You get these data from people who are in the hospital for certain neurosurgery, because they have extreme cases of epilepsy which have not been sufficiently well treated by drugs. |
0:39:53 | And so surgery is an option, but you have to figure out where the focus of the seizures is, and you also want to know where not to cut, in terms of language. |
0:40:08 | So: Edward Chang, who was mentioned earlier, and Nima Mesgarani had a lovely paper in Nature a couple of years ago, where they made these kinds of measurements during source separation. In this experiment they would play two speakers speaking at once, and by the design of the experiment they'd get the subject to focus first on one speaker and then on the other, and they'd observe the changes in the recorded signals. |
0:40:37 | So this is giving clues about source separation and noise robustness. And what's really exciting about this, for me, is that it's a kind of intermediate scale: between EEG (which is something I used to work with a long time ago, where you're at a scale with really poor spatial resolution, though pretty good temporal resolution) and single units or modest numbers of electrodes placed directly in the tissue. These electrodes are on the surface, an intermediate region, and it looks like we can get a lot of new kinds of information. And the technology for this is rapidly changing: the people working on the sensors are making them with the electrodes closer and closer together. |
0:41:23 | So the hope is that measurements like these, and like the things that Christoph talked about, will reveal completely new processing steps. For instance, computational auditory scene analysis is based on psychoacoustics, and we know there's a range of things you can do to try to pick out one speaker from some other background. But if we actually had a better handle on what's really going on inside the system, we might be able to design those things better, rather than just relying on psychoacoustics. |
0:41:55 | And this includes structures and things at the signal level and the computational level. It's the kind of work that will be talked about on Thursday night, by Steve for instance. And there's understanding what the statistical systems can learn, and what their limitations are. |
0:42:17 | What that last one has in common with the others: it's not from the brain, but it is actual analysis of what's going on, and it can give you a handle on how to proceed. We need feature stability under different kinds of conditions (noise, room reverberation, and so on), and models that can handle dependent variables. |
0:42:37 | So, in conclusion: there is more than fifty years of effort here, including some with speech recognition. The current methods include tandem and hybrid approaches; multiple layers and initialization do sometimes help. As with automatic speech recognition overall, the fundamental algorithms of the neural nets used for speech recognition are actually reasonably mature as well. The engineering efforts to make use of the computational capabilities have helped, of course. |
0:43:19 | I would argue that features still matter, and that wide is important, not just deep. And what's still missing? ASR still performs badly for conditions unseen during training. So we have to keep looking. And that's it; thank you very much. |
0:43:53 | (Session chair) Okay, we can take questions. |
0:44:04 | (Audience) I can't resist commenting on one of the things. I liked the question of architecture, really, because the idea of using hidden units from one task and using them again: we did use that in eighty-nine. We called them modular neural networks at the time, and it was extremely successful work. But it was discarded, because people said: okay, the theory says that with one hidden layer you can represent any convex classification function, so we don't need this architectural, multilayer way. So that discarded a lot of work on what were actually multi-layer, deep neural networks if you want, even though at the time it had already been shown to work. |
0:44:50 | Now, what is still missing today, even in the work that's going on right now, is that people really don't look very much at how to do automatic architecture learning. In other words, whether we should go deeper by creating another layer, or make it wider, or create different delays: we decide all this by repeating the same experiments over and over again. Think about how humans learn; they do it in developmental stages. We don't, as babies, sit in the corner running back-propagation for twenty years and then wake up and know speech; we learn to babble, then words, et cetera. Where does this all come from? There must be some schedule by which we build architectures, in that developmental way. And the more we look at low-resource settings, multiple languages, et cetera, the more I think that having some mechanism for building these architectures within the learning approach is fundamental research that is still missing, in my view. But I'd like to hear your comment on that. |
0:45:52 | (Morgan) I guess that's more a comment than a question, but the only thing I have to add (I mean, I agree with yours) is one thing I didn't mention about that nineteen sixty-one approach: DAID actually also built itself up automatically. In that case it was also a feature-selection system as well, so it would look at a superset of possible features, take a bunch of them, build up a unit based on that, and then consider what other group of features to use. So it actually did build itself up. It wasn't a completely general architecture, but it did a fair amount of automatic learning of structure. And that was nineteen sixty-one, at Cornell. |
0:46:53 | (Audience) Yes, right. [inaudible] (Session chair) Okay, other questions or comments? |
0:47:08 | (Audience) So, you've worked through a good part of this cosine function, right? Up, then going down, now being up again. Do you think this cosine function is going to keep us working, so we won't have to be unemployed for the rest of our productive lives, or is it going to go down again? |
0:47:32 | (Morgan) I think it depends on to what extent we believe the exaggerated claims. If we push the hype too high... you know, speech recognition works really well under many circumstances and fails miserably under others. So if people believe too strongly that we have already found the holy grail, then after a while, when they start using it and having it fail, funding will go down and interest will go down, for the whole field of speech recognition, but in particular for any particular method. |
0:48:10 | So I think... look, obviously I like using artificial neural networks; I've been doing it for a long time. I mean, I started using them thirty-three years ago, because I had a particular task and tried a whole bunch of methods, and it just so happened (it was just luck) that the neural net I was using was the best of the different things on that particular small voiced/unvoiced speech task. So I like them. |
0:48:44 | But I think they're only a part of the solution, and this is why I emphasized that what you feed them (and, I should also say, what you do with their outputs) are both at least as important, probably more important, than the stuff that we're currently mostly excited about. |
0:48:59 | And so I think that... well, Gaussian mixtures had a great run, didn't they? And I think people will still use them; they're another tool. There are very nice things about Gaussians; there are nice things about sigmoids; there are nice things about other kinds of nonlinear units. People have used rectified linear units of late. |
0:49:19 | But I think the level of excitement will probably go down somewhat, because, you know, after a while of being excited, with papers saying very similar things, it sort of dies down. But if people start using these things in different ways (feeding them different things, making use of the outputs in different ways, et cetera), the interest can be sustained. |
0:49:51 | (Audience) You mentioned that one of the big advantages of the MLP is that it can take a lot of abuse in what you feed it, as long as what you feed it carries the right kind of information. I also feel that there is great potential in the various architectures you can build. You mentioned that you can take the temporal trajectory, sample it, select the outputs from that, and combine it with a spectral slice or more trajectories. So I think there is plenty of opportunity for us to be busy for some time. |
0:50:25 | The one worry I have is with the argument that you can just try all kinds of things and report what works best. I would somehow like to encourage the community, and here I'm thinking slightly beyond what you said: one hope is that architectures could pop out somehow automatically, but I think we still need to build the models; I don't know if it can all be done automatically. And I see works like what Christoph presented here, basically learning from the way the auditory system works: there can be plenty of inspiration there for various architectures of the neural nets, mainly because neural nets are indeed so simple, and in how much abuse they can take in terms of what you feed them. |
0:51:13 | (Morgan) I agree, and maybe I didn't emphasize it quite as much as I feel it. We have, right now, this real separation: there's the front end, and somebody works on the front end; and then there are the neural nets; and then there are the HMMs, and the language models, and so forth. These are really quite separate. But in the long run they really need to be very integrated. |
0:51:38 | And the particular example I showed already had things kind of mixed together: you had some of the signal processing stuff going on later and some of it going on earlier, and all of that. And when we start opening that up, and you say it's not just adding a unit or something, like the nineteen sixty-one approach, but it can be anything, then I think you're really lost unless you have some working example to go from. |
0:52:09 | So for me... I mean, I have no problem, and I think Hynek doesn't either, if we come up with a purely engineering approach that has nothing to do with brains and it just works better: fine. We're engineers; that's okay. The problem is that the design space is infinite, and so how do you figure out which direction to even go? The appeal that the brain-related, biologically-inspired stuff has for us is that it's a working system; it's something that already works; and so it really does reduce the space that you have to consider. |
0:52:50 | is someone else gonna come up with some information theoretic approach that is the ends |
---|
0:52:54 | of being better know this |
---|
0:52:55 | fine you know i |
---|
0:52:57 | microsoft |
---|
0:52:59 | but this is where it occurs to us |
---|
0:53:04 | questions?
---|
0:53:13 | so you mentioned that hmm-gmm systems at some point got much stronger,
---|
0:53:19 | and one of the aspects is that they could be adapted well.
---|
0:53:25 | so one would think about adapting neural networks in some sort of similar manner.
---|
0:53:32 | and is that one of the reasons why... i mean, in a speech
---|
0:53:37 | recognition task you want it to be adapted to the speaker, and from my limited knowledge
---|
0:53:43 | i think that
---|
0:53:45 | adaptation methods for neural networks are still being figured out,
---|
0:53:50 | but all the intuition for designing adaptation methods comes from, you know,
---|
0:53:57 | the experience that we have with hmm-gmm systems, at least for
---|
0:54:04 | me. so, okay, if you talk about something like speaker adaptive training,
---|
0:54:12 | could you think of a neural network
---|
0:54:16 | sort of becoming speaker independent through speaker adaptive training?
---|
0:54:20 | i mean, to put it to a point:
---|
0:54:25 | what do you think, is there a direction to build a speaker independent,
---|
0:54:30 | truly speaker independent, dnn?
---|
0:54:33 | a direction to...
---|
0:54:35 | i guess i mean speaker independent by being very speaker-dependent and adaptive. right.
---|
0:54:41 | actually, if you do a little literature search, there's a bunch of work on
---|
0:54:47 | adapting neural nets for speech recognition from the early nineties.
---|
0:54:50 | and
---|
0:54:51 | this work was largely done at cambridge, by tony robinson and his crew,
---|
0:54:57 | and in our group as well,
---|
0:55:01 | you know,
---|
0:55:02 | and we were actually in a collaboration with
---|
0:55:09 | them.
---|
0:55:10 | and there were four methods that i recall that we used. so one was to
---|
0:55:14 | have like a linear
---|
0:55:17 | input transformation:
---|
0:55:20 | so if you had, you know, thirteen plp coefficients,
---|
0:55:24 | you'd just have a thirteen-by-thirteen matrix coming in.
---|
0:55:28 | another was at the output, so maybe, you know, if you
---|
0:55:32 | were doing monophones it was like fifty by fifty or something.
---|
0:55:37 | a third was to have a hidden layer off to the side that you just sort
---|
0:55:41 | of added to it and
---|
0:55:43 | trained up with the limited data that you had for the new speaker.
---|
0:55:47 | these were all
---|
0:55:49 | supervised adaptation.
---|
0:55:52 | and my favourite,
---|
0:55:56 | the one i proposed, was
---|
0:55:58 | to just train everything,
---|
0:56:01 | so, you know, retrain all the weights.
---|
0:56:03 | the original objection to that was that you might have millions of parameters, but
---|
0:56:08 | my feeling was you'd just move them all a little bit,
---|
0:56:11 | and that worked as well.
---|
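To make the first of those four methods concrete, here is a rough sketch (not the original early-nineties code) of the linear-input-network idea: a trainable thirteen-by-thirteen matrix over the PLP features is placed in front of a frozen speaker-independent net and trained on the new speaker's labelled data. The layer sizes, learning rate, and PyTorch framing are illustrative assumptions.

```python
# Sketch of supervised speaker adaptation via a linear input transform.
import torch
import torch.nn as nn

feat_dim, n_phones = 13, 50           # assumed: 13 PLP coefficients, ~50 monophones

# Stand-in for the pretrained speaker-independent MLP (weights would be loaded).
mlp = nn.Sequential(
    nn.Linear(feat_dim, 500), nn.Sigmoid(),
    nn.Linear(500, n_phones),
)
for p in mlp.parameters():
    p.requires_grad = False           # freeze the speaker-independent net

# Speaker-specific linear input network, initialized to the identity
# so that adaptation starts from the unadapted features.
lin = nn.Linear(feat_dim, feat_dim)
with torch.no_grad():
    lin.weight.copy_(torch.eye(feat_dim))
    lin.bias.zero_()

opt = torch.optim.SGD(lin.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def adapt_step(feats, labels):
    """One supervised step on the new speaker's data.

    feats: (batch, feat_dim) float tensor; labels: (batch,) long tensor.
    """
    opt.zero_grad()
    loss = loss_fn(mlp(lin(feats)), labels)
    loss.backward()                   # gradients flow only into `lin`
    opt.step()
    return loss.item()
```

The output-side variant described above is the same idea with a roughly fifty-by-fifty transform appended after the network, and the train-everything variant simply leaves every parameter trainable (with a small learning rate, so the weights only move a little).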
0:56:13 | they all worked to varying degrees, i think it's fair to say, but neither
---|
0:56:18 | the
---|
0:56:19 | hmm-gmm adaptations nor those neural net adaptations really solved the problem.
---|
0:56:25 | they all move you a little bit. we did some experimentation as part of the
---|
0:56:29 | ouch project that steve is going to
---|
0:56:32 | talk about thursday,
---|
0:56:35 | where
---|
0:56:35 | we used mllr, for instance, to try to adapt to just my recording, given
---|
0:56:41 | close-miked training,
---|
0:56:44 | and it helps,
---|
0:56:45 | but it's not like it solves everything.
---|
0:56:51 | so i'd say that
---|
0:56:53 | you can use any of these methods,
---|
0:56:56 | for both neural nets and for gaussians; there are methods for both,
---|
0:57:02 | but none of them really solve the problem.
---|
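As a very rough sketch of the MLLR idea mentioned above: a single global affine transform of the Gaussian means is estimated from adaptation data. Real MLLR solves a row-wise system using the inverse variances, usually with multiple regression classes; this simplified version assumes identity covariances and one global class, and all names are illustrative.

```python
# Simplified global MLLR mean transform (identity covariances assumed).
import numpy as np

def mllr_global_transform(obs, gamma, mu):
    """Estimate W (d x (d+1)) so the adapted mean of Gaussian m is W @ [1, mu_m].

    obs:   (T, d) adaptation frames
    gamma: (T, M) Gaussian posteriors from a first recognition pass
    mu:    (M, d) speaker-independent means
    """
    M = mu.shape[0]
    xi = np.hstack([np.ones((M, 1)), mu])    # extended means [1; mu_m]
    occ = gamma.sum(axis=0)                  # total occupancy per Gaussian
    G = (xi * occ[:, None]).T @ xi           # sum_m occ_m * xi_m xi_m^T
    K = obs.T @ gamma @ xi                   # sum_{t,m} gamma_tm * o_t xi_m^T
    return K @ np.linalg.inv(G)              # least-squares solution for W

# usage: W = mllr_global_transform(obs, gamma, mu)
#        adapted mean for Gaussian m: W @ np.concatenate(([1.0], mu[m]))
```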
0:57:10 | are there other questions? there's one there.
---|
0:57:18 | yes, a couple of rows back here.
---|
0:57:23 | about the talk:
---|
0:57:27 | thank you for that very interesting talk.
---|
0:57:31 | i was just curious whether there is any relation between the
---|
0:57:36 | way that we look at adaptation in speech recognition and
---|
0:57:41 | adaptation in human speech recognition.
---|
0:57:44 | and the reason i ask is that, if we are
---|
0:57:47 | inspired by humans, it seems that we should look at the places where human
---|
0:57:52 | recognition breaks down. i was on a call with someone with a really bad connection
---|
0:57:58 | and i just couldn't understand them in any way.
---|
0:58:01 | and we don't, i think, look at whether a system could do well
---|
0:58:06 | in exactly the same conditions where a human wouldn't be able to understand; i would
---|
0:58:11 | hope systems could be better than humans there, and shouldn't that really be the
---|
0:58:16 | aim, a check on how we are doing?
---|
0:58:18 | well |
---|
0:58:20 | when i find an example that i don't understand at all,
---|
0:58:26 | ah, so could a machine do better?
---|
0:58:30 | i think in general we're pretty far from that.
---|
0:58:34 | there are individual examples that you could think of. i think my favourite is anything
---|
0:58:40 | involving attention.
---|
0:58:41 | so actually my wife used to work with these
---|
0:58:46 | large american express call centers,
---|
0:58:49 | and when we first got together i was always telling her, humans are so
---|
0:58:54 | good at speech recognition and, you know, machines are so bad. she said, well, not
---|
0:58:58 | quite, the humans aren't ideal either.
---|
0:59:01 | and it turned out that the people at the call centers are really great, definitely
---|
0:59:07 | much better than anything you do with a machine,
---|
0:59:10 | for simple tasks like a string of numbers,
---|
0:59:15 | right after they have coffee.
---|
0:59:16 | and they're terrible after lunch.
---|
0:59:20 | now, they do however have... i mean, you don't talk about, and i certainly didn't talk
---|
0:59:25 | about, recovery mechanisms, but the saving grace for people is that they can say, could
---|
0:59:30 | you repeat that please? and although we have some of that in our systems,
---|
0:59:33 | humans are better at that.
---|
0:59:36 | so i think
---|
0:59:37 | there are other tasks
---|
0:59:40 | for which
---|
0:59:42 | machines can clearly be much better, because people
---|
0:59:46 | are not trained, or there is no evolutionary
---|
0:59:51 | guidance
---|
0:59:52 | towards their being better at it. so for instance,
---|
0:59:57 | doing speaker recognition with speakers that you don't know very well,
---|
1:00:02 | i think machines can be better.
---|
1:00:05 | i used to do some work with eeg, and eeg analysis isn't something we,
---|
1:00:10 | you know, grow up with,
---|
1:00:11 | and so you can do classification with a machine that is much better than people. okay,
---|
1:00:16 | but i think for sort of straight-out typical speech recognition,
---|
1:00:20 | you take that noisy example,
---|
1:00:22 | play it to any of our recognizers, and you're stuck. you
---|
1:00:28 | saw some of the signal-to-noise ratios in what i was showing earlier:
---|
1:00:32 | basically zero db signal-to-noise ratio.
---|
1:00:35 | if human beings who are paying attention listen to strings of digits there, they just get
---|
1:00:40 | them.
---|
1:00:42 | and our systems? you look at any of them, even with the
---|
1:00:47 | quite noise-robust front ends people have in papers:
---|
1:00:50 | you look at their performance at zero db signal-to-noise ratio and it's a mess.
---|
1:00:55 | and that's with the best that we have.
---|
1:00:58 | so i think we're just so far from that for straight-out speech recognition.
---|
1:01:02 | but maybe someday we'll be saying, well, we have this automatic system that can figure it out.
---|
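For reference, the zero-dB condition above just means equal speech and noise power. A minimal sketch of how such test material is typically mixed (the function and names are illustrative, not from any specific toolkit):

```python
# Mix noise into a clean utterance at a target signal-to-noise ratio.
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_noise) == snr_db, then add."""
    noise = noise[: len(speech)]                    # trim to utterance length
    p_speech = np.mean(speech.astype(float) ** 2)   # signal power
    p_noise = np.mean(noise.astype(float) ** 2)     # noise power before scaling
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

# noisy = mix_at_snr(clean_digits, babble, snr_db=0.0)   # the 0 dB case
```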
1:01:15 | hi. so if you look at computer vision, neural networks are very appealing: you can inspect
---|
1:01:20 | and visualise what is being learned at the hidden layers, so you can see that
---|
1:01:23 | units are explaining stuff like specific parts of the face. and
---|
1:01:27 | so in speech, do you have an intuition about what is being learned in those hidden
---|
1:01:32 | layers?
---|
1:01:34 | well, i mean, there have been some experiments where people have looked at some of
---|
1:01:38 | these things. again, i made reference to some of this research before,
---|
1:01:42 | and there was a study done with just a standard multilayer perceptron,
---|
1:01:47 | and it was found that
---|
1:01:50 | it was attempting to mimic what was happening with the nets that were
---|
1:01:58 | trained on individual critical bands.
---|
1:02:00 | and there was another one where they just threw the whole spectrum in,
---|
1:02:06 | and what was learned at the layers... in fact they did learn
---|
1:02:11 | interesting shapes, interesting gabor-like shapes, and so forth.
---|
1:02:15 | and there have been a number of experiments where people have looked at
---|
1:02:20 | some of those early layers.
---|
1:02:22 | once you get pretty deep,
---|
1:02:24 | especially seven or so layers in,
---|
1:02:27 | i think it'd be much harder to do,
---|
1:02:29 | but i wouldn't say it's impossible.
---|
1:02:31 | i know there's been some work.
---|
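As a minimal sketch of the kind of inspection just described: for a hypothetical net whose input is a block of spectral frames, each first-layer weight vector can be reshaped back into a frequency-by-time patch and plotted, which is where Gabor-like spectro-temporal patterns tend to show up. The dimensions and plotting choices are assumptions, not a specific published recipe.

```python
# Visualize first hidden-layer weights as spectro-temporal patterns.
import numpy as np
import matplotlib.pyplot as plt

def plot_hidden_units(W, n_bands, n_frames, n_show=16):
    """W: (n_hidden, n_bands * n_frames) input-to-hidden weight matrix."""
    cols = 4
    rows = int(np.ceil(n_show / cols))
    fig, axes = plt.subplots(rows, cols, figsize=(2 * cols, 2 * rows),
                             squeeze=False)
    for i, ax in enumerate(axes.flat):
        if i < n_show and i < W.shape[0]:
            # Undo the flattening used at the input layer.
            ax.imshow(W[i].reshape(n_bands, n_frames),
                      aspect="auto", origin="lower", cmap="coolwarm")
        ax.set_xticks([])
        ax.set_yticks([])
    fig.suptitle("first-layer weights as (frequency x time) patches")
    plt.show()

# e.g. plot_hidden_units(W1, n_bands=23, n_frames=9) for 23 bands x 9 frames
```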