0:00:15 Okay, so this session is not intended to be particularly formal; as you can see, we have put away the screen and there will be no slides. What I was encouraging everybody on the panel to do is to take about five minutes and give a sort of oral summary of their poster, to encourage people to come see it, because it's going to be up for the rest of the session. Then we can open up the floor for questions; Morgan and I might have a few, and we'll see where the discussion goes. So why don't we get started, and since you're sitting closest, perhaps you can go first.
0:00:55 Okay. So I'm the odd one out here, because I'm basically advocating the good old GMMs. What I did is I looked at the neural networks, tried to figure out why they work so well, and tried to port that back to a GMM. So, why GMMs? We've been using them for years; we have lots of techniques, model-based techniques: model-based adaptation, speaker adaptation, noise adaptation, uncertainty decoding, all kinds of techniques that are based on maximum-likelihood trained HMM-GMM systems. If we just put DNNs in at the front, you basically lose a lot of that. Another reason is that they're fast and very efficient: with few parameters you can make a speech recognizer with something like ten times fewer parameters, and it decodes very fast. The final and last reason: when you do speech recognition research, you're trying to understand how it works, so if you replace the model of what goes on in your head with a black-box method like a deep neural network, what have you learned in the end? A somewhat more modular system, where you have building blocks that are at least doing something you understand, is nice to have.
0:02:16 So the second part is: what are we going to port from the DNN world to the GMM world? If you look at the DNNs, they take a very large window of frames and they map that to context-dependent states, which are basically long-span symbolic units. Going from long-span temporal patterns to long-span symbolic units is a fairly complex mapping; that's probably why they need lots of layers. They also go wide: they have something like two thousand to four thousand nodes in each layer, so it's a pretty big pipeline. So we have two important properties of a neural network: it is deep, and it is wide, with a long window of frames at the input. Another thing is that neural networks are advertised as being a product of experts: basically every node sees all of the input and is trained on all of the outputs, so there's lots of training data for every weight.
0:03:25 Okay, so the next step is: let's try to port all these ideas to the HMM-GMM world. Basically I didn't invent anything new; I used existing techniques. If you want to handle large windows of frames, you have to do feature reduction, because GMMs don't like two-hundred-dimensional input features, so we use something like LDA, linear discriminant analysis, to do the feature reduction. But that loses lots of information, so in parallel with that you can, for example, use multiple streams. Multiple streams are not new: in the old discrete-HMM world you had static features, delta features and double-delta features as multiple parallel streams, fused at the end; you can still do that today. So that's how we cope with a large input window of frames.
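As an illustration of the frame-stacking-plus-LDA idea just described, here is a minimal numpy sketch; the dimensions are invented and the projection matrix is a random stand-in for a real LDA transform estimated on labelled data, not anything from the talk.

    import numpy as np

    def stack_frames(feats, context=15):
        # feats: (T, D) frame features -> (T, (2*context+1)*D) spliced features
        T, D = feats.shape
        padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
        return np.stack([padded[t:t + 2 * context + 1].reshape(-1) for t in range(T)])

    T, D = 200, 13                      # e.g. 13-dim static features
    feats = np.random.randn(T, D)
    spliced = stack_frames(feats)       # ~400-dimensional, too wide for a GMM
    # In a real system this matrix comes from LDA estimated on state-labelled data;
    # here a random projection stands in for it.
    lda = np.random.randn(spliced.shape[1], 40)
    reduced = spliced @ lda             # 40-dim stream a GMM can model
    print(spliced.shape, reduced.shape)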
0:04:23 Going wider we already had as well: we have multiple streams in parallel, which you can see either as a way of coping with a large-dimensional input feature stream, or as a set of parallel little models. Then going deeper: that's basically done by adding a log-linear layer on top of the other layers, and there's nothing new or special there. Conditional random fields, maximum entropy models — they go around under lots of names; it's just the softmax in the neural networks. So it's nothing special, but it's more or less the simplest extra layer you can add. It is in essence a product-of-experts model: it combines values in a sum, which in the log domain is basically a product, and makes new values, so it's very good at fusing things.
0:05:17 I added frame stacking in front of it, just to increase the feature dimension. So basically it's all existing, very simple techniques. I forgot one, parameter tying, but that's also very simple: we use tied states, as in the systems we have had for a long time. That basically means that every Gaussian is trained not on one output and one input but on a lot of them; basically every Gaussian is reused over a hundred times across the output states, so it gets lots of data — it sees every frame anyhow. And if you combine all these things, you end up with results that are competitive with last year's DNN results. This year's DNN results add things like segmental or sequence training, convolutional neural networks, dropout training; those new techniques I don't know yet how I'm going to map onto my system, except that sequence training is very simple to add and will probably improve the system. So the end message is: the GMMs and HMMs are not dead yet.
0:06:31 Okay, thank you, Kris. Hank?
0:06:35 So I work on voice search, and I've also done some work on YouTube. We've actually published results on this, and I thought it would be great to share some of the things we do with YouTube. If you don't know YouTube: it's a video sharing site where you can share all sorts of things; I think the most popular videos are things like dogs or cats running around. But there's actually some useful data there: a billion users visit YouTube every month, they watch six billion videos, and over a hundred hours of video are uploaded every minute. So there's a lot of content and a lot of people watching. One thing we'd like to do with YouTube is provide captions, to make it more accessible for those who are hard of hearing or don't speak the language. Also, imagine if we could provide automatic captions on YouTube: that would help with searching for videos, or with navigating within a video if you want to find particular instances of words — people have actually used this kind of indexing technology to find the places where particular words are said in speeches. So there is some point to applying it.
0:08:07 I looked at this from a couple of aspects. The first is the data aspect: we have a lot of data, so what are some of the ways we can leverage that data? For example, users have uploaded about twenty-six thousand hours' worth of captions — just online text captions — for these videos; they take the time to do it because they find it useful to have them. But some of those captions don't in any fashion match the video; they're just advertising things.
0:08:44 People have looked at how to use this sort of found data for training, and we do much the same thing that everyone else does: we try to figure out what aligns and what doesn't align, and we have this "islands of confidence" technique. Basically, in areas where a lot of agreement happens between the recognition result and the user-provided captions, we use those islands of coherence as training ground truth.
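A hedged sketch of that islands-of-confidence idea: align the recognizer hypothesis with the user-uploaded caption and keep only stretches where several words in a row agree, using those as training ground truth. This toy version uses difflib word alignment; the production system is certainly more involved, and the function name and threshold are made up.

    import difflib

    def confidence_islands(hyp_words, caption_words, min_run=4):
        matcher = difflib.SequenceMatcher(a=hyp_words, b=caption_words, autojunk=False)
        islands = []
        for block in matcher.get_matching_blocks():
            if block.size >= min_run:                  # long enough agreement
                islands.append(hyp_words[block.a:block.a + block.size])
        return islands

    hyp = "the cat sat on the mat and then it ran away".split()
    cap = "a cat sat on the mat it quickly ran away".split()
    print(confidence_islands(hyp, cap, min_run=3))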
0:09:13 So, after filtering the data down to what actually aligns well, we got an initial corpus of about a thousand hours, compared to some hundred and fifty hours of supervised, actually hand-transcribed data, so we were able to do some comparisons on that.
0:09:34 The other aspect is: well, we have so much data, can we improve the modelling techniques in different ways? People have talked about having a thousand CD state units, and typically we all work with our own several thousand CD state units; I think Frank's systems go up to thirty-three thousand. We really do run large, with around twenty thousand or more CD states, and with more data and a larger model we got better results. But the output layer gets really large that way: with a softmax over forty-odd thousand output nodes and a couple of thousand hidden nodes before it, that's tens of millions of parameters in just that one layer. So there was a nice bit of work by Tara Sainath at ICASSP on low-rank factorization, and I wanted to try that on this data and see how it goes. In the paper we looked at using various ranks for this factorization, which amounts to inserting a lower-dimensional linear layer just before the outputs.
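A minimal sketch of that low-rank softmax factorization: replace the single hidden-to-output matrix with two smaller matrices, i.e. insert a linear bottleneck before the output layer. The sizes below are illustrative only, not the actual system's.

    import numpy as np

    hidden, outputs, rank = 2000, 40000, 256
    full_params = hidden * outputs                     # 80M weights in one matrix
    low_rank_params = hidden * rank + rank * outputs   # ~10.7M weights

    W1 = np.random.randn(hidden, rank) * 0.01          # hidden -> bottleneck
    W2 = np.random.randn(rank, outputs) * 0.01         # bottleneck -> CD states
    h = np.random.randn(hidden)
    logits = h @ W1 @ W2                               # same output shape as before
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    print(full_params, low_rank_params, probs.shape)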
0:10:41 Basically, our results were as follows. Using purely the semi-supervised data, where we use the islands from the captions, we can build a model that's better than our GMM system by about ten percent relative. Our GMM system initially was at over fifty percent word error rate, and I think there are some issues with that GMM system; I believe Cambridge built a comparable system for this data and got below fifty percent, but not by much. So with the same semi-supervised data and no supervised training we did pretty well. When we actually used the supervised data, we got better results with less data than with the semi-supervised models, but that's expected; and combining the two doesn't hurt. With the low-rank factorization we found that with fewer parameters we were able to get results that were actually slightly better; maybe it's just regularization. We found that overall, by having all this extra data, we got better results on general YouTube test sets; but when we looked at a domain-specific test set — for example YouTube news material that's similar to broadcast news — we actually got a degradation by adding all the semi-supervised data. That was interesting, because the neural-network mantra is bigger, better, more data; but it seems we still have some issues with cross-domain training. So there are still things to look at. That's it.
0:12:12 Okay, thanks. Tara?
0:12:17 Okay. So, as Frank showed earlier today, one of the first DNN results on LVCSR was on Switchboard, showing about thirty percent relative improvement on a speaker-independent system. Microsoft, as well as IBM and others, have shown that if you use speaker-adapted features for the DNN, the results are better. Then earlier this year we showed that using very simple log-mel features with a convolutional neural network, you can actually improve performance by between four and seven percent relative over a DNN trained with speaker-adapted features. One of the reasons, we think, is that you're learning this sort of speaker adaptation jointly with the rest of the network, for the actual objective function at hand, either cross entropy or sequence.
0:13:03 So the idea of this filter learning work we did is: why have we been starting from log-mel? Let's start with a much simpler feature, such as the power spectrum, and have the network learn a filterbank which is appropriate for the speech recognition task at hand, rather than using a filterbank which is perceptually motivated. If you think about how log-mel is computed: you take the power spectrum, you multiply by a filterbank, and then you take the log, which is effectively one layer of a neural net — a weight multiplication followed by a nonlinearity. So the idea in this filter learning work was to start with the power spectrum and learn the filterbank layer jointly with the rest of the convolutional neural network.
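A minimal sketch of the point being made: log-mel is a power spectrum times a filterbank followed by a log, i.e. one layer of a network (linear weights plus a nonlinearity), so the filterbank weights can in principle be learned jointly with the rest of the model. The triangular-filter construction below is a crude illustration, not the exact filterbank or learning setup used in the work.

    import numpy as np

    def mel_filterbank(n_fft_bins=257, n_filters=40, sr=16000):
        mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
        inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
        pts = inv(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
        bins = np.floor(pts / (sr / 2) * (n_fft_bins - 1)).astype(int)
        fb = np.zeros((n_filters, n_fft_bins))
        for i in range(n_filters):
            l, c, r = bins[i], bins[i + 1], bins[i + 2]
            fb[i, l:c + 1] = np.linspace(0.0, 1.0, c - l + 1)
            fb[i, c:r + 1] = np.linspace(1.0, 0.0, r - c + 1)
        return fb

    power = np.abs(np.random.randn(257)) ** 2   # stand-in for one power-spectrum frame
    W = mel_filterbank()                        # initialise the "layer" at mel; these
    log_mel = np.log(W @ power + 1e-10)         # weights W could then be updated by backprop
    print(log_mel.shape)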
0:13:44 When we tried this idea initially, we got very modest improvements, and one of the reasons is that you have to normalize not the input to the convolutional network but the input to the filter learning layer. We know there's a lot of work showing that you should normalize the input features going into a network. So we found that by normalizing the input into the filterbank layer, and by using a trick very similar to one used in RASTA processing to ensure that the input into the filter learning layer stays positive, we were able to get about a four percent relative improvement over using a fixed filterbank on a broadcast news task. We then showed that the filterbank layer can basically be seen as a convolutional layer with limited weight sharing, so you can apply tricks such as pooling; if you pool, you can get about a five percent relative improvement over the baseline with the fixed mel filterbank. We then tried other things, like increasing the filterbank size to give a lot more freedom to the filters; that didn't seem to help much, probably because there's a lot of correlation between the different filters. We also found the filter weights were very peaky, probably picking up harmonics in the signal; we tried smoothing that out, and that didn't seem to help much either, so it seems that the extra peaks that are learned in the filterbank layer are actually beneficial. Finally, instead of enforcing positive weights with a log nonlinearity, we tried letting the weights be negative and using something like a sigmoid or ReLU nonlinearity, and that also didn't seem to help, so it seems like using the log nonlinearity, which is perceptually motivated, actually does make sense. So in summary, we looked at filterbank learning, and compared to using a fixed mel filterbank we were able to get about a five percent relative improvement.
0:15:59 Thank you, Tara. Karel?
0:16:14 Okay. So in principle I was trying to solve a problem similar to the one Hank was solving, but with one difference: they had probably several thousands or even tens of thousands of hours of training data that could possibly be leveraged to improve the word error rates, and in our case the dataset was much more modest in scale, which is very nice to play with. In our case we had ten hours of transcribed data and seventy-four hours of untranscribed audio data; this was from the IARPA Babel program, in one of its conditions, the limited language pack condition.
0:17:07 I tried to find some heuristics for how to leverage that data best. The idea is that I used two different confidence measures at two different levels: one was the sentence level and the other was the frame level, so that we could select the data for training.
0:17:35 The sentence-level confidence was computed as basically the average posterior of the best words from the confusion network. For the frame-level confidence measure: imagine you have lattices. The way the semi-supervised training is done is that at the beginning, from the transcribed data, we build some system, and with this system we can decode the data for which we don't have transcripts, so we can take the best path from the lattices as if it were the reference. Once we have the lattices, we take the best path, compute the posterior probabilities, and then read off the posteriors which lie under the best path and use those as the frame-level confidence measures.
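A rough sketch of the two confidence measures described: given per-frame state posteriors and the best path, the frame confidence is the posterior mass under the best-path state at that frame, and the sentence confidence is roughly their average. Real systems read these from lattices and confusion networks; here plain random arrays stand in for them, and the threshold is invented.

    import numpy as np

    T, S = 100, 500                                          # frames, states
    posteriors = np.random.dirichlet(np.ones(S), size=T)     # (T, S) frame posteriors
    best_path = posteriors.argmax(axis=1)                    # stand-in for the 1-best alignment

    frame_conf = posteriors[np.arange(T), best_path]         # frame-level confidences
    sent_conf = frame_conf.mean()                            # crude sentence-level score

    keep = frame_conf > 0.7                                  # frame selection by threshold
    print(sent_conf, keep.sum(), "of", T, "frames kept")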
0:18:41 So then we started the experiments, first with frame cross-entropy training, and I tried to take systematic steps: first start at the larger granularity and then go to the smaller one. At the beginning I was sorting the sentences according to the confidence, and surprisingly, I kept adding more and more of them until I had added all of them, and the system kept improving; there was no degradation, which was very surprising. This gave a one point one percent absolute improvement. Then there was still the situation that there was roughly ten hours of transcribed speech and seventy hours of untranscribed speech, so there was an imbalance, and we tried to multiply the amount of transcribed speech by different factors; we tried different multiplication numbers, three was the good one, and that gave an improvement of zero point three percent absolute. Finally we went down to the lower level, the frame level, and found out that frame-level selection with an appropriately tuned threshold gave another zero point eight to zero point nine percent. So the overall improvement was over two point two percent absolute.
0:20:38 As the full recipe also includes sequence-discriminative training, I did some experiments with the sMBR criterion to improve the results at that stage, and I tried to use a similar data selection framework. But it turned out that the safest option was to take just the transcribed data and use sMBR on that alone, and a large part of the improvement that we obtained at the frame cross-entropy level persisted in the sequence-trained systems. That's pretty much the experiments we did. So I'd like to invite you to come and see the poster, and I would also like to thank the team of colleagues who worked on this together with me.
0:22:07 Thanks, Karel. Next we have Pawel.
0:22:12 So our poster paper is about how to learn a speech representation from multiple or single distant channels; we did distant speech recognition, which, as we know, is much more difficult to cope with because of many aspects, for example poor signal-to-noise ratios or the interference effects of other acoustic sources. What people usually do in distant speech recognition is to capture the acoustics using multiple distant microphones, which we know how to handle: basically you apply on top some sort of combining algorithm, like beamforming, which enhances the signal into a single channel, and then you build an acoustic model on top — whatever acoustic model you want. We were interested in how to use multiple distant microphones without the beamformer, so instead of doing the actual signal processing we tried to let the network learn the way to combine the channels.
0:23:27 We used neural networks for that, and there are two obvious ways to follow. The first one is simple concatenation: you take the acoustics captured by the multiple channels, feed them as one large spliced input to the network, and train it with a single set of targets, as you usually would. The other way to do it is multi-style training, and multi-style training allows you to actually use multiple distant microphones while you're training, while you can still recognise with a single distant microphone.
0:24:11 Getting back to concatenation: with just the simple concatenation we were able to recover around fifty percent of the beamforming gain. So we weren't able to beat our best DNN model trained on the beamformed channels, but we were able to recover around fifty percent of that gain — relative to the gain of the beamformed DNN, of course. With multi-style training, we train the network in a multi-task fashion where the representation is shared across the channels: we present random batches of data from random channels and update the network on those, and that apparently forces the network to account for some of the variability in the channels. In the end, multi-style training gave us the same gains as the simple concatenation. So it's basically a very attractive way to go, because you do not need multiple distant microphones in the test scenario, which is a nice finding.
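A sketch contrasting the two ways of using multiple distant microphones described above (array shapes are invented): (a) concatenate the channels into one wide input vector; (b) "multi-style" training, where each minibatch is drawn from one randomly chosen channel but all channels share the same network, so test time needs only a single microphone.

    import numpy as np

    rng = np.random.default_rng(0)
    n_channels, T, D = 8, 1000, 40
    channels = rng.standard_normal((n_channels, T, D))      # per-channel features
    targets = rng.integers(0, 100, size=T)                  # shared frame targets

    # (a) concatenation: one (T, n_channels*D) input, single set of targets
    concat_input = np.concatenate([channels[c] for c in range(n_channels)], axis=1)

    # (b) multi-style: every minibatch comes from one randomly picked channel
    def multistyle_batches(batch_size=256):
        for start in range(0, T, batch_size):
            c = rng.integers(n_channels)                     # pick a channel at random
            yield channels[c, start:start + batch_size], targets[start:start + batch_size]

    print(concat_input.shape, sum(1 for _ in multistyle_batches()))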
0:25:28 In the paper we also point out some open challenges; for example, overlapping speech is still a huge issue, and not many researchers actually try to address it — the simplest thing is just to ignore it. We also present a complete set of numbers for the AMI dataset, and all these numbers should be easy to reproduce if someone is interested. So I invite anyone who's interested to come by the poster and we can discuss some more. Thank you.
0:26:14 Okay, thanks, Pawel. And finally, Alex.
0:26:17 Thank you. So, just to start with a little bit of motivation: I've had a kind of longstanding ambition to do speech recognition with a recurrent neural network — to have one network do the acoustic modelling, the language modelling, the state transitions, and have it all combined in a single network. That turns out to be difficult, which probably won't surprise anyone here. So I was eventually persuaded, mostly by my co-workers, that maybe I should just try taking one of these hybrid systems and replacing the feedforward neural network with a recurrent one, and that's basically what we did. It's really fairly straightforward; it's a standard hybrid system, and the only thing that will be novel to the people here is the network architecture, so that's probably what I should talk about.
0:27:37 One thing is that you can take just an ordinary recurrent neural network with a single frame of input features, and it already brings something; you can then improve on that, for example with multiple layers. There are various other kinds of improvements to the basic recurrent network architecture that have been accumulating. I guess the two main ones are, first, making it bidirectional: instead of having a single network that starts at the beginning of the sequence and runs forward, you have two recurrent networks, one going forward and one going backward, so you get both past and future context. You can stack that same structure to make the network deep as well as bidirectional, and what you actually find is that the network's use of context spreads out as it goes deeper.
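A toy illustration of the bidirectional idea: one vanilla RNN reads the utterance forwards, another reads it backwards, and their hidden states are concatenated at every frame, so each output sees both past and future context. A real system would use LSTM cells and trained weights; here the weights are random and the sizes are made up.

    import numpy as np

    def run_rnn(x, Wxh, Whh):
        h = np.zeros(Whh.shape[0])
        outs = []
        for frame in x:                               # simple tanh recurrence
            h = np.tanh(frame @ Wxh + h @ Whh)
            outs.append(h)
        return np.stack(outs)

    T, D, H = 50, 40, 128
    x = np.random.randn(T, D)
    Wxh_f, Whh_f = np.random.randn(D, H) * 0.1, np.random.randn(H, H) * 0.1
    Wxh_b, Whh_b = np.random.randn(D, H) * 0.1, np.random.randn(H, H) * 0.1

    fwd = run_rnn(x, Wxh_f, Whh_f)                    # past context
    bwd = run_rnn(x[::-1], Wxh_b, Whh_b)[::-1]        # future context, re-reversed
    bidir = np.concatenate([fwd, bwd], axis=1)        # (T, 2H) per-frame representation
    print(bidir.shape)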
0:28:48 The other novel thing, I guess, is the use of this long short-term memory architecture, which I won't try to describe in detail; the basic idea is that it's better at storing information over time, so it gives you access to longer-range context. A common problem everyone finds when they try recurrent networks for speech is that vanishing gradients make it difficult to store information. Otherwise we used a standard recipe for the training, about fifteen hours of data, because we wanted to compare the system with the kind of more conventional approaches, and we built a comparable baseline system. We then went to the Wall Street Journal corpus, and the gains from using these bidirectional RNNs with cross-entropy, frame-level training were pretty small.
0:30:08 One possible reason is that Wall Street Journal is maybe not the most challenging corpus, and it would be interesting to try something like Switchboard. But my feeling is that what we really need is to go beyond this cross-entropy training, towards training for the word error rate we actually care about — something like sequence training.
0:30:53 Thanks. So at this point we can open up the floor for questions or comments from the audience, either directed at the panel or at anybody else in the room. Any takers?
0:31:12 Following up on what Tara was just presenting: for your input you actually used the power spectrum, so do you think the networks would be capable of going even further backwards, if you want, all the way to the waveform?
0:31:36and nobody has done some more can actually
0:31:40i think you're right i think there's then a little bit of were but been
0:31:43by not do you might know alex on using convolution
0:31:47neural network like approaches are do you remember right
0:31:52i mention this is more has been some work with this but the generally do
0:31:57something on top of it like to take the law yep take the actual value
0:32:02florida log and so on the in there there's and things that are kind of
0:32:05heart to reproduce just by pretending you don't know these things are any good
0:32:09 Actually, I was going to ask you — I was trying to recall — did you still end up taking a log? — Yes, we take the log right inside the neural network, I think twice actually. — Right, so that's interesting: we've got these powerful learning machines and we still have to take the log for them. — I don't know what to say about that.
0:32:45 Okay, I have a question which can actually be directed more to Morgan and Hynek, and Alex Waibel if he's in the room. One of the themes that came up earlier in the day was that some of this stuff was done back in the nineties, and due to limitations on the amount of data we had to work with and the amount of computation available, there were things that couldn't really be explored, or couldn't viably be explored. So the question now is: are there papers from the nineties that current practitioners should be going back to, rereading, and borrowing ideas from, that they can improve on now? And if so, which ones?
0:33:32 There are — well, there's a lot, I mean. It depends on what people are interested in, right? Like this morning there were questions about adaptation, and I don't recall off the top of my head which papers, but there were a bunch of papers on neural-net adaptation, some from Cambridge, if you're interested in adaptation. There's a large number of papers on the basic methods. On the sequence training we were talking about at lunch, there are papers from ICSI where we did sequence training, I think around ninety-five or something; what we were doing at the time was using the gammas as the targets for the net training. And it isn't just the computation and the storage and the amount of data; it's also that oftentimes these things are cyclic: you try some things out — for instance, we did the sequence training and it helped a tiny little bit in the examples we were looking at, and it was a lot more hassle, so we didn't pursue it further. We had a couple of years where we were really looking into it, but it wasn't so great, so there were probably some things that we weren't doing quite right, and now it's coming back. Also, people tend to see what they want to see: when you're enthusiastic about stuff, you look at a point two percent increase a lot differently than when you're not.
0:35:30 How about some other questions for the panel? They had lots of interesting things they were talking about.
0:35:48 I have a question for Pawel about your multi-microphone experiment — I guess that was with the AMI corpus? — Yes. — So you got this, I guess nowadays predictable, result that if you just concatenate the features from the different channels, you would perform better than any beamforming, Wiener filtering or whatever else you're doing. Is that correct? — No. — Okay.
0:36:24 When you concatenate, you get some improvement over a single distant microphone, but the message from the paper is that if you can beamform, you probably should beamform.
0:36:36 Yes, but okay — with the concatenated features going into the neural network, is that assuming that the speaker is sort of static? I mean, if my speaker were to walk around, I can imagine, actually —
0:36:52 Our observation is that the network isn't learning beamforming; it's more like adapting to the most meaningful signal, to the strongest signal. So basically, if you have multiple distant microphones, one of the speakers is always, in some way, closer to a given microphone than to the others, and that's something the network actually can exploit in the scenario we applied it to. Also, because when you put multiple frames in the input you have a very coarse time resolution, you cannot really learn any time delays in this setup; it effectively just picks the strongest channels. You can do that in a more explicit way too: for example, you can apply convolutional acoustic models with max-pooling across channels on top, and that also gives some gains, but that's follow-up work.
0:38:14 It took me a little bit of courage to decide to respond to what Brian was asking, because, you know, I'm pretty bad at reading other people's papers, so I only have examples of papers which I wrote, or which my colleagues and students wrote, and which people should read very critically. I don't mean that they are wonderful, but I still think they are interesting. And this is the work on TRAPs, which we started at a time when it looked pretty crazy, because we just took the temporal trajectory of spectral energy at a given frequency, one second long, and we said: can you estimate what's happening in the centre of this trajectory? The first result, of course, was that you got about twenty percent correct at best. But of course you do this at a number of frequencies, so after that you take all these posteriors, feed them into another net, and then use that to estimate the phoneme in the centre. So it was kind of a deep neural net, I would say, and it was also kind of wide, because it had trajectories at different frequencies.
0:39:26 And it worked surprisingly well. So people could look at it and see what we should have done better: of course we never retrained the whole thing end to end, which we probably should have done, and we used context-independent phonemes, which maybe we shouldn't have, and a number of other things were different at the time, so the results are not entirely comparable to the mainstream systems. But I still think people should look at it and tell us what was wrong, or why it is that it works: you try to recognize a context-independent phoneme out of one second of context, and you actually do very well; if you look at the posteriorgrams they are amazingly good. It seems like something somebody else should look at critically. So, sorry for promoting my own work, but as I said, I'm bad with other people's work.
0:40:53 This question is mostly for Hank, but others may want to address it; it's about spoken content retrieval. For example, imagine a program where videos that are being recorded are going to be searchable for keywords online. I think the keywords that people will type into such a system are going to be new words, and they're going to be names. So what I'm asking is: with deep neural networks, are we optimising our acoustic models for the really frequent words and leaving out the infrequent words? And the other thing is, you're analysing with word error rates over your entire vocabulary — is that really getting at the performance we want to understand? It would be interesting to look at it from the standpoint of spoken content retrieval.
0:42:25 Maybe I can address some of that. I don't think the neural networks are just focused on the head; I think they do pretty well on the tails as well. But there are two aspects here: there's the vocabulary, and then there are the words that are out of vocabulary, that we don't have in the model at test time, and that's kind of orthogonal — I see you shake your head, but I think if we can incorporate, for the searches we do, a dynamic vocabulary into the decoder graph, then we can actually recognise out-of-vocabulary words that we haven't seen at training time. For example, I worked on voicemail years ago, and people's names come up all the time; our program manager, for some reason, always had his name misrecognised as something else. But once we switched in a dynamic vocabulary, with his name checked into the decoder graph, his name got recognised, and the same happened for lots of other names. So I think the issue right now is that the system doesn't actually incorporate a dynamic vocabulary. And I think the metrics you talk about are also devised to work over a sort of broad range, which makes it more difficult: if we introduce a technique that handles that tail, it will only give us a point one percent improvement or even less, and that's a shame; I think we really do need metrics and techniques that look at the long tail. But I think this isn't really just about recognition: there's lots of work that can be done in language modelling and dynamic vocabularies to make these words useful.
0:44:34 I'll chime in a little bit on this one too, since I can speak from experience doing keyword search in lots of languages through the Babel program, which we'll be hearing about tomorrow from Mary Harper. What we found is that word error rate actually is a pretty good basic metric, even when we're doing search for words that are out of vocabulary in the training. There's not a perfect correlation between word error rate and retrieval performance on this kind of task, but at least to first order, large improvements in word error rate, like the ones we see using neural networks instead of GMMs, definitely lead to better retrieval performance, even on out-of-vocabulary terms. So it's not a perfect metric, but it's one that we've used for many years and it works pretty well. It is interesting that for those rare words you often find pronunciation problems, which hurts their recognition, and you can see work here where we're trying to address that. So I'm not dismissing the point, just qualifying it.
0:45:40 I actually want to argue a little bit in favour of the direction of what the questioner was saying, because I think you have to sort of separate out the decoding and so forth from what's happening in whatever your acoustic model is — whether it's GMMs or DNNs or MLPs with many layers or whatever. It's true that you simply do better on things for which you see lots of examples, and this is also true if you're looking at particular units, triphones or whatever: those triphones that occur less often you are not going to estimate as well. But what you're saying is true too: it doesn't completely kill you.
0:46:31 I agree. I mean, there are issues where we have some queries that just don't get recognised by the recognizer, and when you go back and look at the ones that didn't get recognised, you find there were only five instances of that context in training; the systems are trained to do well on what they see. So something there does need to be addressed.
0:46:52 One technical comment on the out-of-vocabulary words, from our lecture transcription experience: we take a very pragmatic engineering approach, and basically the recognizer vocabulary is fed by the proceedings and related documents, so the new words are not that new anymore. But I had another question for the colleague from Google, and maybe Brian can comment as well, about sequence-discriminative training on the lightly transcribed or untranscribed portion. We found that basically the sMBR needed to be done on the transcribed portion of the data, and not on the data loosely transcribed by the recognizer. What is your experience on the YouTube videos? And maybe others can comment on this as well.
0:47:55well let's see on
0:47:58i personally don't have a lot of experience i think when we report numbers or
0:48:03three hundred hours broadcast news
0:48:05there about half of it is manually transcribed have but it slightly transcribed and so
0:48:12i'm pretty sure we see some nice gains on that chart can for
0:48:16ten percent relative though
0:48:18likings be cer fifty hour broadcast news from cross entropy sequence are more then we
0:48:24sent four hundred i don't know that's amount of data or data ones you know
0:48:29transcribers is like this
0:48:31that's a and which what the reasonably good baseline but again with a pretty good
0:48:36proportion of the training data being lightly supervised
0:48:41 Anybody else have comments on that? Karel? — My comment would be that this should be investigated deeper, because I truly believe that there is more improvement to be achieved there if we use that data right.
0:49:02 Okay, other comments or questions? Okay, Thomas.
0:49:13 This is a very general question: how much training data will we really need in the future? Where do you think we're going with the DNNs?
0:49:23 Well, I guess that's what I was trying to motivate with my work: we just initialise our system with a lot of data and big networks, and it takes a while — I don't know, we train pretty big networks. But I think it's a good sort of challenge question: if we have, say, ten thousand or a hundred thousand hours of data to use for training, and maybe we increase the number of context-dependent outputs to a hundred thousand, what do we get? It would be interesting just to know; if we did start down that path, we'd have to change quite a bit about how we train models of those sorts of sizes. I would also say that more is simply better, if the transcriptions are good enough.
0:50:07 That sounds like a great intro to the next comment.
0:50:13 I just wanted to mention some results where we actually did somewhat careful selection of data for the acoustic modelling, and the word error rate was better with the careful selection than with simply piling more data in; the performance was better when we were more thoughtful about what was going into the model.
0:51:03 I'm blanking on the name, but there was a visitor from Google who actually gave a talk at ICSI showing us what looked like a definite asymptoting of performance when going up to a hundred thousand, two hundred thousand hours and so on. So I think more data helps, but after a while not that much, I think, was the message.
0:51:24 And I'm surprised that you've been quiet all day — so there, you're making me happy.
0:51:34 So, on the issue of selection: I think you can certainly argue that selection cannot be the right thing to do; instead you should always do weighting. Because whatever data you have — and I certainly agree that there's good data and bad data — the bad data is not worthless, it's just less good than the good data. For example, we have a paper here on semi-supervised training, of the kind that's been done for a long time in the past: you make a model, recognise some untranscribed data, and then use it for training. When the error rates are relatively low — where low is fifty percent or below — you can do that with your eyes closed. When the error rate gets really high, like seventy percent, that does break down, but that doesn't mean you should discard the data: you should just give it a lower weight, and you can show that you always get better performance if you include the data, the weight just gets lower. Yes, in principle the weight could go to zero, but you let the system decide that, and the weights don't really go to zero, they just get smaller; weights like one third or one half, at error rates of eighty percent, are still giving gains. That's been our experience, at least.
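A sketch of that weighting-instead-of-selection point: give every automatically transcribed utterance a weight (for example derived from its confidence) and scale its contribution to the cross-entropy, rather than discarding low-confidence data outright. The numbers and the confidence-to-weight mapping here are purely illustrative.

    import numpy as np

    def weighted_cross_entropy(log_probs, targets, utt_weight):
        # log_probs: (T, S) frame log-posteriors, targets: (T,) state labels
        ce = -log_probs[np.arange(len(targets)), targets]
        return utt_weight * ce.sum()

    rng = np.random.default_rng(1)
    log_probs = np.log(rng.dirichlet(np.ones(100), size=50) + 1e-12)
    targets = rng.integers(0, 100, size=50)

    confidence = 0.35                                  # e.g. a high-error-rate utterance
    weight = max(confidence, 0.0)                      # simple mapping; never a hard zero
    print(weighted_cross_entropy(log_probs, targets, weight))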
0:53:14 So more data, weighted, may not necessarily be the whole answer. I agree with what you're saying, that there's always value in data, but whoever is deciding which utterances to use should also pay some attention to the distributional properties of what's in there — things like names. So this is one part of the problem: are we sampling the space correctly? That's really the question.
0:53:49 I think that's shown with my paper: on general YouTube data we got gains, but when we looked at a particular vertical, like news, where we were already getting much better error rates, adding all that data to training and using a bigger neural network with more parameters actually gave us losses on that specific domain. So there are some issues of generalization there.
0:54:15 I'd like to add a little bit on data; it will be a bit different from what others are saying. I agree that of course more data is always better, but I think we may also end up using less and less data. So to the question of how much data we will need, I would say less and less, because we are learning more and more about speech, and we are actually learning now how to train the nets on one language and use them on another and so on. And maybe this is also partly what Babel — which I call "bobble" — is about: I think that we are going to learn how to use the knowledge from existing databases on new tasks. This is at least my hope, so I'd like to end on this positive note: less and less data is what I see.
0:54:59 Just to follow up on what you're saying: I think the lower parts of the network are learning language-independent or task-independent information, so if you feed a lot of data into those layers and less data into the upper parts, that might be an approach to get there; I think it's very promising.
0:55:21 Actually, when we started working on GALE we had a bunch of things trained on English, and we were working on this with SRI, trying to move to Arabic. We didn't have much Arabic data yet, so we just used the nets from English to begin with, and they still did something good.
0:55:38 One point I'd like to make about recognition: if you've already got a lot of data, I think you might need something like ten times as much before the network learns much more — I think it's limited. But that's just an intuition, not a precise number.
0:56:09 I think if we don't have any other pressing questions — actually, there is still time. — So, on what was just said about weighting the data: I actually did a contrastive experiment where in one case I used frame selection and in the other case frame weighting, and I obtained identical word error rates for both systems. So maybe, if what was said about weighting is right, there should be some post-processing of the confidence scores; or it may simply be that those confidences are not at all uniform — the distribution looks more like an exponential, with groups of values.
0:57:15 To bring something else into the more-data discussion: there are several kinds of variability you want to handle. For speaker variability, okay, then you need more speakers. But if you want to be, let's say, robust against reverberation, you can just make the data: you present the same data with variation — added noise, or, for reverberation, you just train the system on simulated room acoustics — and that makes it very robust against distant microphones. That's a very cheap trick, and it works.
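A sketch of that "cheap trick": multi-condition training data made by convolving clean speech with simulated room impulse responses and adding noise. The exponentially decaying random impulse response below is a crude stand-in for a proper room simulator, and the SNR value is arbitrary.

    import numpy as np

    def reverberate(clean, rir, noise_snr_db=20.0):
        wet = np.convolve(clean, rir)[:len(clean)]
        noise = np.random.randn(len(wet))
        scale = np.linalg.norm(wet) / (np.linalg.norm(noise) * 10 ** (noise_snr_db / 20))
        return wet + scale * noise

    sr = 16000
    clean = np.random.randn(sr)                              # stand-in for 1 s of speech
    rir = np.random.randn(int(0.3 * sr)) * np.exp(-np.linspace(0, 8, int(0.3 * sr)))
    augmented = reverberate(clean, rir)
    print(augmented.shape)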
0:57:54 And something else about more data: if you look at the very good neural networks we all have in our heads, they're not trained with that much data. Google already has much more data than that, so that's a strong argument that better neural networks must be possible — so why can't we do it? — Because we don't know how yet, but we'd like to.
0:58:19 I think we are out of time, in principle, so I think we should turn this over to the conference organisers — and thank the panelists again.
0:58:37 Thank you, Morgan. This will be short. First, before we go, a couple of practical things. For the people that signed up for the micro-brewery tour: it is not a one-way trip, and it is very important to meet at seven. We begin tomorrow morning with the limited-resources session. Just one last practical comment: there is a car-pooling table on the message board, for whoever needs a ride to the airport or other places; there's free space, so just write yourself in, and maybe we'll have some nice encounters. Well, I would like to give thanks — I don't know which order is more or less important, but let's first thank the audience, because almost everyone is still here: thank you very much. Then thanks to the panelists and to all the speakers, and of course my greatest thanks go to today's organisers. And I still have one point left for Brian, because he has one more announcement. So here it is.