0:00:15 Okay, so this session is not intended to be particularly formal; as you can see, we have put away the screen and there will be no slides. What I was encouraging everybody on the panel to do is to take about five minutes and give a sort of oral summary of their poster, to encourage people to come see it, because it's going to be up for the rest of the session. Then we can open up the floor for questions; Morgan and I might have a few, and we'll see where the discussion goes. So why don't we get started, and since you're sitting closest, perhaps you can go first.
0:00:55 Okay. So I'm the odd one out here, because I'm basically advocating the good old GMMs. What I did is I looked at the neural networks, tried to figure out why they work so well, and tried to port that back to a GMM. So, why GMMs? We've been using them for years; we have lots of techniques, model-based techniques: model-based adaptation, speaker adaptation, noise adaptation, uncertainty decoding, all kinds of techniques that are based on maximum-likelihood trained HMM-GMM systems. If we just put DNNs in at the front, you basically lose a lot of that. Another reason is that they're fast and very efficient: with few parameters you can make a speech recognizer with something like ten times fewer parameters, and it decodes very fast. The final and last reason: when you do speech recognition research, you're trying to understand how it works, so if you replace the model of what goes on in your head with a black-box method like a deep neural network, what have you learned in the end? A somewhat more modular system, where you have building blocks that are at least doing something you understand, is nice to have.
0:02:16 So the second part is: what are we going to port from the DNN world to the GMM world? If you look at the DNNs, they take a very large window of frames and they map that to context-dependent states, which are basically long-span symbolic units. Going from long-span temporal patterns to long-span symbolic units is a fairly complex mapping; that's probably why they need lots of layers. They also go wide: they have something like two thousand to four thousand nodes in each layer, so it's a pretty big pipeline. So we have two important properties of a neural network: it is deep, and it is wide, with a long window of frames at the input. Another thing is that neural networks are advertised as being a product of experts: basically every node sees all of the input and is trained on all of the outputs, so there's lots of training data for every weight.
0:03:25 Okay, so the next step is: let's try to port all these ideas to the HMM-GMM world. Basically I didn't invent anything new; I used existing techniques. If you want to handle large windows of frames, you have to do feature reduction, because GMMs don't like two-hundred-dimensional input features, so we use something like LDA, linear discriminant analysis, to do the feature reduction. But that loses lots of information, so in parallel with that you can, for example, use multiple streams. Multiple streams are not new: in the old discrete-HMM world you had static features, delta features and double-delta features as multiple parallel streams, fused at the end; you can still do that today. So that's how we cope with a large input window of frames.
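As an illustration of the frame-stacking-plus-LDA idea just described, here is a minimal numpy sketch; the dimensions are invented and the projection matrix is a random stand-in for a real LDA transform estimated on labelled data, not anything from the talk.

    import numpy as np

    def stack_frames(feats, context=15):
        # feats: (T, D) frame features -> (T, (2*context+1)*D) spliced features
        T, D = feats.shape
        padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
        return np.stack([padded[t:t + 2 * context + 1].reshape(-1) for t in range(T)])

    T, D = 200, 13                      # e.g. 13-dim static features
    feats = np.random.randn(T, D)
    spliced = stack_frames(feats)       # ~400-dimensional, too wide for a GMM
    # In a real system this matrix comes from LDA estimated on state-labelled data;
    # here a random projection stands in for it.
    lda = np.random.randn(spliced.shape[1], 40)
    reduced = spliced @ lda             # 40-dim stream a GMM can model
    print(spliced.shape, reduced.shape)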
0:04:23 Going wider we already had as well: we have multiple streams in parallel, which you can see either as a way of coping with a large-dimensional input feature stream, or as a set of parallel little models. Then going deeper: that's basically done by adding a log-linear layer on top of the other layers, and there's nothing new or special there. Conditional random fields, maximum entropy models — they go around under lots of names; it's just the softmax in the neural networks. So it's nothing special, but it's more or less the simplest extra layer you can add. It is in essence a product-of-experts model: it combines values in a sum, which in the log domain is basically a product, and makes new values, so it's very good at fusing things.
0:05:17 I added frame stacking in front of it, just to increase the feature dimension. So basically it's all existing, very simple techniques. I forgot one, parameter tying, but that's also very simple: we use tied states, as in the systems we have had for a long time. That basically means that every Gaussian is trained not on one output and one input but on a lot of them; basically every Gaussian is reused over a hundred times across the output states, so it gets lots of data — it sees every frame anyhow. And if you combine all these things, you end up with results that are competitive with last year's DNN results. This year's DNN results add things like segmental or sequence training, convolutional neural networks, dropout training; those new techniques I don't know yet how I'm going to map onto my system, except that sequence training is very simple to add and will probably improve the system. So the end message is: the GMMs and HMMs are not dead yet.
0:06:31 Okay, thank you, Kris. Hank?
0:06:35 So I work on voice search, and I've also done some work on YouTube. We've actually published results on this, and I thought it would be great to share some of the things we do with YouTube. If you don't know YouTube: it's a video sharing site where you can share all sorts of things; I think the most popular videos are things like dogs or cats running around. But there's actually some useful data there: a billion users visit YouTube every month, they watch six billion videos, and over a hundred hours of video are uploaded every minute. So there's a lot of content and a lot of people watching. One thing we'd like to do with YouTube is provide captions, to make it more accessible for those who are hard of hearing or don't speak the language. Also, imagine if we could provide automatic captions on YouTube: that would help with searching for videos, or with navigating within a video if you want to find particular instances of words — people have actually used this kind of indexing technology to find the places where particular words are said in speeches. So there is some point to applying it.
0:08:07 I looked at this from a couple of aspects. The first is the data aspect: we have a lot of data, so what are some of the ways we can leverage that data? For example, users have uploaded about twenty-six thousand hours' worth of captions — just online text captions — for these videos; they take the time to do it because they find it useful to have them. But some of those captions don't in any fashion match the video; they're just advertising things.
0:08:44 People have looked at how to use this sort of found data for training, and we do much the same thing that everyone else does: we try to figure out what aligns and what doesn't align, and we have this "islands of confidence" technique. Basically, in areas where a lot of agreement happens between the recognition result and the user-provided captions, we use those islands of coherence as training ground truth.
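A hedged sketch of that islands-of-confidence idea: align the recognizer hypothesis with the user-uploaded caption and keep only stretches where several words in a row agree, using those as training ground truth. This toy version uses difflib word alignment; the production system is certainly more involved, and the function name and threshold are made up.

    import difflib

    def confidence_islands(hyp_words, caption_words, min_run=4):
        matcher = difflib.SequenceMatcher(a=hyp_words, b=caption_words, autojunk=False)
        islands = []
        for block in matcher.get_matching_blocks():
            if block.size >= min_run:                  # long enough agreement
                islands.append(hyp_words[block.a:block.a + block.size])
        return islands

    hyp = "the cat sat on the mat and then it ran away".split()
    cap = "a cat sat on the mat it quickly ran away".split()
    print(confidence_islands(hyp, cap, min_run=3))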
0:09:13 So, after filtering the data down to what actually aligns well, we got an initial corpus of about a thousand hours, compared to some hundred and fifty hours of supervised, actually hand-transcribed data, so we were able to do some comparisons on that.
0:09:34 The other aspect is: well, we have so much data, can we improve the modelling techniques in different ways? People have talked about having a thousand CD state units, and typically we all work with our own several thousand CD state units; I think Frank's systems go up to thirty-three thousand. We really do run large, with around twenty thousand or more CD states, and with more data and a larger model we got better results. But the output layer gets really large that way: with a softmax over forty-odd thousand output nodes and a couple of thousand hidden nodes before it, that's tens of millions of parameters in just that one layer. So there was a nice bit of work by Tara Sainath at ICASSP on low-rank factorization, and I wanted to try that on this data and see how it goes. In the paper we looked at using various ranks for this factorization, which amounts to inserting a lower-dimensional linear layer just before the outputs.
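A minimal sketch of that low-rank softmax factorization: replace the single hidden-to-output matrix with two smaller matrices, i.e. insert a linear bottleneck before the output layer. The sizes below are illustrative only, not the actual system's.

    import numpy as np

    hidden, outputs, rank = 2000, 40000, 256
    full_params = hidden * outputs                     # 80M weights in one matrix
    low_rank_params = hidden * rank + rank * outputs   # ~10.7M weights

    W1 = np.random.randn(hidden, rank) * 0.01          # hidden -> bottleneck
    W2 = np.random.randn(rank, outputs) * 0.01         # bottleneck -> CD states
    h = np.random.randn(hidden)
    logits = h @ W1 @ W2                               # same output shape as before
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    print(full_params, low_rank_params, probs.shape)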
0:10:41 Basically, our results were as follows. Using purely the semi-supervised data, where we use the islands from the captions, we can build a model that's better than our GMM system by about ten percent relative. Our GMM system initially was at over fifty percent word error rate, and I think there are some issues with that GMM system; I believe Cambridge built a comparable system for this data and got below fifty percent, but not by much. So with the same semi-supervised data and no supervised training we did pretty well. When we actually used the supervised data, we got better results with less data than with the semi-supervised models, but that's expected; and combining the two doesn't hurt. With the low-rank factorization we found that with fewer parameters we were able to get results that were actually slightly better; maybe it's just regularization. We found that overall, by having all this extra data, we got better results on general YouTube test sets; but when we looked at a domain-specific test set — for example YouTube news material that's similar to broadcast news — we actually got a degradation by adding all the semi-supervised data. That was interesting, because the neural-network mantra is bigger, better, more data; but it seems we still have some issues with cross-domain training. So there are still things to look at. That's it.
0:12:12 Okay, thanks. Tara?
0:12:17 Okay. So, as Frank showed earlier today, one of the first DNN results on LVCSR was on Switchboard, showing about thirty percent relative improvement on a speaker-independent system. Microsoft, as well as IBM and others, have shown that if you use speaker-adapted features for the DNN, the results are better. Then earlier this year we showed that using very simple log-mel features with a convolutional neural network, you can actually improve performance by between four and seven percent relative over a DNN trained with speaker-adapted features. One of the reasons, we think, is that you're learning this sort of speaker adaptation jointly with the rest of the network, for the actual objective function at hand, either cross entropy or sequence.
0:13:03 So the idea of this filter learning work we did is: why have we been starting from log-mel? Let's start with a much simpler feature, such as the power spectrum, and have the network learn a filterbank which is appropriate for the speech recognition task at hand, rather than using a filterbank which is perceptually motivated. If you think about how log-mel is computed: you take the power spectrum, you multiply by a filterbank, and then you take the log, which is effectively one layer of a neural net — a weight multiplication followed by a nonlinearity. So the idea in this filter learning work was to start with the power spectrum and learn the filterbank layer jointly with the rest of the convolutional neural network.
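A minimal sketch of the point being made: log-mel is a power spectrum times a filterbank followed by a log, i.e. one layer of a network (linear weights plus a nonlinearity), so the filterbank weights can in principle be learned jointly with the rest of the model. The triangular-filter construction below is a crude illustration, not the exact filterbank or learning setup used in the work.

    import numpy as np

    def mel_filterbank(n_fft_bins=257, n_filters=40, sr=16000):
        mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
        inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
        pts = inv(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
        bins = np.floor(pts / (sr / 2) * (n_fft_bins - 1)).astype(int)
        fb = np.zeros((n_filters, n_fft_bins))
        for i in range(n_filters):
            l, c, r = bins[i], bins[i + 1], bins[i + 2]
            fb[i, l:c + 1] = np.linspace(0.0, 1.0, c - l + 1)
            fb[i, c:r + 1] = np.linspace(1.0, 0.0, r - c + 1)
        return fb

    power = np.abs(np.random.randn(257)) ** 2   # stand-in for one power-spectrum frame
    W = mel_filterbank()                        # initialise the "layer" at mel; these
    log_mel = np.log(W @ power + 1e-10)         # weights W could then be updated by backprop
    print(log_mel.shape)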
0:13:44 When we tried this idea initially, we got very modest improvements, and one of the reasons is that you have to normalize not the input to the convolutional network but the input to the filter learning layer. We know there's a lot of work showing that you should normalize the input features going into a network. So we found that by normalizing the input into the filterbank layer, and by using a trick very similar to one used in RASTA processing to ensure that the input into the filter learning layer stays positive, we were able to get about a four percent relative improvement over using a fixed filterbank on a broadcast news task. We then showed that the filterbank layer can basically be seen as a convolutional layer with limited weight sharing, so you can apply tricks such as pooling; if you pool, you can get about a five percent relative improvement over the baseline with the fixed mel filterbank. We then tried other things, like increasing the filterbank size to give a lot more freedom to the filters; that didn't seem to help much, probably because there's a lot of correlation between the different filters. We also found the filter weights were very peaky, probably picking up harmonics in the signal; we tried smoothing that out, and that didn't seem to help much either, so it seems that the extra peaks that are learned in the filterbank layer are actually beneficial. Finally, instead of enforcing positive weights with a log nonlinearity, we tried letting the weights be negative and using something like a sigmoid or ReLU nonlinearity, and that also didn't seem to help, so it seems like using the log nonlinearity, which is perceptually motivated, actually does make sense. So in summary, we looked at filterbank learning, and compared to using a fixed mel filterbank we were able to get about a five percent relative improvement.
0:15:59 Thank you, Tara. Karel?
0:16:14 Okay. So in principle I was trying to solve a problem similar to the one Hank was solving, but with one difference: they had probably several thousands or even tens of thousands of hours of training data that could possibly be leveraged to improve the word error rates, and in our case the dataset was much more modest in scale, which is very nice to play with. In our case we had ten hours of transcribed data and seventy-four hours of untranscribed audio data; this was from the IARPA Babel program, in one of its conditions, the limited language pack condition.
0:17:07 I tried to find some heuristics for how to leverage that data best. The idea is that I used two different confidence measures at two different levels: one was the sentence level and the other was the frame level, so that we could select the data for training.
0:17:35 The sentence-level confidence was computed as basically the average posterior of the best words from the confusion network. For the frame-level confidence measure: imagine you have lattices. The way the semi-supervised training is done is that at the beginning, from the transcribed data, we build some system, and with this system we can decode the data for which we don't have transcripts, so we can take the best path from the lattices as if it were the reference. Once we have the lattices, we take the best path, compute the posterior probabilities, and then read off the posteriors which lie under the best path and use those as the frame-level confidence measures.
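A rough sketch of the two confidence measures described: given per-frame state posteriors and the best path, the frame confidence is the posterior mass under the best-path state at that frame, and the sentence confidence is roughly their average. Real systems read these from lattices and confusion networks; here plain random arrays stand in for them, and the threshold is invented.

    import numpy as np

    T, S = 100, 500                                          # frames, states
    posteriors = np.random.dirichlet(np.ones(S), size=T)     # (T, S) frame posteriors
    best_path = posteriors.argmax(axis=1)                    # stand-in for the 1-best alignment

    frame_conf = posteriors[np.arange(T), best_path]         # frame-level confidences
    sent_conf = frame_conf.mean()                            # crude sentence-level score

    keep = frame_conf > 0.7                                  # frame selection by threshold
    print(sent_conf, keep.sum(), "of", T, "frames kept")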
0:18:41 So then we started the experiments, first with frame cross-entropy training, and I tried to take systematic steps: first start at the larger granularity and then go to the smaller one. At the beginning I was sorting the sentences according to the confidence, and surprisingly, I kept adding more and more of them until I had added all of them, and the system kept improving; there was no degradation, which was very surprising. This gave a one point one percent absolute improvement. Then there was still the situation that there was roughly ten hours of transcribed speech and seventy hours of untranscribed speech, so there was an imbalance, and we tried to multiply the amount of transcribed speech by different factors; we tried different multiplication numbers, three was the good one, and that gave an improvement of zero point three percent absolute. Finally we went down to the lower level, the frame level, and found out that frame-level selection with an appropriately tuned threshold gave another zero point eight to zero point nine percent. So the overall improvement was over two point two percent absolute.
0:20:38 As the full recipe also includes sequence-discriminative training, I did some experiments with the sMBR criterion to improve the results at that stage, and I tried to use a similar data selection framework. But it turned out that the safest option was to take just the transcribed data and use sMBR on that alone, and a large part of the improvement that we obtained at the frame cross-entropy level persisted in the sequence-trained systems. That's pretty much the experiments we did. So I'd like to invite you to come and see the poster, and I would also like to thank the team of colleagues who worked on this together with me.
0:22:07 Thanks, Karel. Next we have Pawel.
0:22:12 So our poster paper is about how to learn a speech representation from multiple or single distant channels; we did distant speech recognition, which, as we know, is much more difficult to cope with because of many aspects, for example poor signal-to-noise ratios or the interference effects of other acoustic sources. What people usually do in distant speech recognition is to capture the acoustics using multiple distant microphones, which we know how to handle: basically you apply on top some sort of combining algorithm, like beamforming, which enhances the signal into a single channel, and then you build an acoustic model on top — whatever acoustic model you want. We were interested in how to use multiple distant microphones without the beamformer, so instead of doing the actual signal processing we tried to let the network learn the way to combine the channels.
0:23:27 We used neural networks for that, and there are two obvious ways to follow. The first one is simple concatenation: you take the acoustics captured by the multiple channels, feed them as one large spliced input to the network, and train it with a single set of targets, as you usually would. The other way to do it is multi-style training, and multi-style training allows you to actually use multiple distant microphones while you're training, while you can still recognise with a single distant microphone.
0:24:11 Getting back to concatenation: with just the simple concatenation we were able to recover around fifty percent of the beamforming gain. So we weren't able to beat our best DNN model trained on the beamformed channels, but we were able to recover around fifty percent of that gain — relative to the gain of the beamformed DNN, of course. With multi-style training, we train the network in a multi-task fashion where the representation is shared across the channels: we present random batches of data from random channels and update the network on those, and that apparently forces the network to account for some of the variability in the channels. In the end, multi-style training gave us the same gains as the simple concatenation. So it's basically a very attractive way to go, because you do not need multiple distant microphones in the test scenario, which is a nice finding.
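A sketch contrasting the two ways of using multiple distant microphones described above (array shapes are invented): (a) concatenate the channels into one wide input vector; (b) "multi-style" training, where each minibatch is drawn from one randomly chosen channel but all channels share the same network, so test time needs only a single microphone.

    import numpy as np

    rng = np.random.default_rng(0)
    n_channels, T, D = 8, 1000, 40
    channels = rng.standard_normal((n_channels, T, D))      # per-channel features
    targets = rng.integers(0, 100, size=T)                  # shared frame targets

    # (a) concatenation: one (T, n_channels*D) input, single set of targets
    concat_input = np.concatenate([channels[c] for c in range(n_channels)], axis=1)

    # (b) multi-style: every minibatch comes from one randomly picked channel
    def multistyle_batches(batch_size=256):
        for start in range(0, T, batch_size):
            c = rng.integers(n_channels)                     # pick a channel at random
            yield channels[c, start:start + batch_size], targets[start:start + batch_size]

    print(concat_input.shape, sum(1 for _ in multistyle_batches()))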
0:25:28 In the paper we also point out some open challenges; for example, overlapping speech is still a huge issue, and not many researchers actually try to address it — the simplest thing is just to ignore it. We also present a complete set of numbers for the AMI dataset, and all these numbers should be easy to reproduce if someone is interested. So I invite anyone who's interested to come by the poster and we can discuss some more. Thank you.
0:26:14 Okay, thanks, Pawel. And finally, Alex.
0:26:17 Thank you. So, just to start with a little bit of motivation: I've had a kind of longstanding ambition to do speech recognition with a recurrent neural network — to have one network do the acoustic modelling, the language modelling, the state transitions, and have it all combined in a single network. That turns out to be difficult, which probably won't surprise anyone here. So I was eventually persuaded, mostly by my co-workers, that maybe I should just try taking one of these hybrid systems and replacing the feedforward neural network with a recurrent one, and that's basically what we did. It's really fairly straightforward; it's a standard hybrid system, and the only thing that will be novel to the people here is the network architecture, so that's probably what I should talk about.
0:27:37 One thing is that you can take just an ordinary recurrent neural network with a single frame of input features, and it already brings something; you can then improve on that, for example with multiple layers. There are various other kinds of improvements to the basic recurrent network architecture that have been accumulating. I guess the two main ones are, first, making it bidirectional: instead of having a single network that starts at the beginning of the sequence and runs forward, you have two recurrent networks, one going forward and one going backward, so you get both past and future context. You can stack that same structure to make the network deep as well as bidirectional, and what you actually find is that the network's use of context spreads out as it goes deeper.
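A toy illustration of the bidirectional idea: one vanilla RNN reads the utterance forwards, another reads it backwards, and their hidden states are concatenated at every frame, so each output sees both past and future context. A real system would use LSTM cells and trained weights; here the weights are random and the sizes are made up.

    import numpy as np

    def run_rnn(x, Wxh, Whh):
        h = np.zeros(Whh.shape[0])
        outs = []
        for frame in x:                               # simple tanh recurrence
            h = np.tanh(frame @ Wxh + h @ Whh)
            outs.append(h)
        return np.stack(outs)

    T, D, H = 50, 40, 128
    x = np.random.randn(T, D)
    Wxh_f, Whh_f = np.random.randn(D, H) * 0.1, np.random.randn(H, H) * 0.1
    Wxh_b, Whh_b = np.random.randn(D, H) * 0.1, np.random.randn(H, H) * 0.1

    fwd = run_rnn(x, Wxh_f, Whh_f)                    # past context
    bwd = run_rnn(x[::-1], Wxh_b, Whh_b)[::-1]        # future context, re-reversed
    bidir = np.concatenate([fwd, bwd], axis=1)        # (T, 2H) per-frame representation
    print(bidir.shape)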
0:28:48 The other novel thing, I guess, is the use of this long short-term memory architecture, which I won't try to describe in detail; the basic idea is that it's better at storing information over time, so it gives you access to longer-range context. A common problem everyone finds when they try recurrent networks for speech is that vanishing gradients make it difficult to store information. Otherwise we used a standard recipe for the training, about fifteen hours of data, because we wanted to compare the system with the kind of more conventional approaches, and we built a comparable baseline system. We then went to the Wall Street Journal corpus, and the gains from using these bidirectional RNNs with cross-entropy, frame-level training were pretty small.
0:30:08 One possible reason is that Wall Street Journal is maybe not the most challenging corpus, and it would be interesting to try something like Switchboard. But my feeling is that what we really need is to go beyond this cross-entropy training, towards training for the word error rate we actually care about — something like sequence training.
0:30:53 Thanks. So at this point we can open up the floor for questions or comments from the audience, either directed at the panel or at anybody else in the room. Any takers?
0:31:12 Following up on what Tara was just presenting: for your input you actually used the power spectrum, so do you think the networks would be capable of going even further backwards, if you want, all the way to the waveform?
0:31:36and nobody has done some more can actually
0:31:40i think you're right i think there's then a little bit of were but been
0:31:43by not do you might know alex on using convolution
0:31:47neural network like approaches are do you remember right
0:31:52i mention this is more has been some work with this but the generally do
0:31:57something on top of it like to take the law yep take the actual value
0:32:02florida log and so on the in there there's and things that are kind of
0:32:05heart to reproduce just by pretending you don't know these things are any good
0:32:09 Actually, I was going to ask you — I was trying to recall — did you still end up taking a log? — Yes, we take the log right inside the neural network, I think twice actually. — Right, so that's interesting: we've got these powerful learning machines and we still have to take the log for them. — I don't know what to say about that.
0:32:45 Okay, I have a question which can actually be directed more to Morgan and Hynek, and Alex Waibel if he's in the room. One of the themes that came up earlier in the day was that some of this stuff was done back in the nineties, and due to limitations on the amount of data we had to work with and the amount of computation available, there were things that couldn't really be explored, or couldn't viably be explored. So the question now is: are there papers from the nineties that current practitioners should be going back to, rereading, and borrowing ideas from, that they can improve on now? And if so, which ones?
0:33:32 There are — well, there's a lot, I mean. It depends on what people are interested in, right? Like this morning there were questions about adaptation, and I don't recall off the top of my head which papers, but there were a bunch of papers on neural-net adaptation, some from Cambridge, if you're interested in adaptation. There's a large number of papers on the basic methods. On the sequence training we were talking about at lunch, there are papers from ICSI where we did sequence training, I think around ninety-five or something; what we were doing at the time was using the gammas as the targets for the net training. And it isn't just the computation and the storage and the amount of data; it's also that oftentimes these things are cyclic: you try some things out — for instance, we did the sequence training and it helped a tiny little bit in the examples we were looking at, and it was a lot more hassle, so we didn't pursue it further. We had a couple of years where we were really looking into it, but it wasn't so great, so there were probably some things that we weren't doing quite right, and now it's coming back. Also, people tend to see what they want to see: when you're enthusiastic about stuff, you look at a point two percent increase a lot differently than when you're not.
0:35:30 How about some other questions for the panel? They had lots of interesting things they were talking about.
0:35:48 I have a question for Pawel about your multi-microphone experiment — I guess that was with the AMI corpus? — Yes. — So you got this, I guess nowadays predictable, result that if you just concatenate the features from the different channels, you would perform better than any beamforming, Wiener filtering or whatever else you're doing. Is that correct? — No. — Okay.
0:36:24 When you concatenate, you get some improvement over a single distant microphone, but the message from the paper is that if you can beamform, you probably should beamform.
0:36:36 Yes, but okay — with the concatenated features going into the neural network, is that assuming that the speaker is sort of static? I mean, if my speaker were to walk around, I can imagine, actually —
0:36:52 Our observation is that the network isn't learning beamforming; it's more like adapting to the most meaningful signal, to the strongest signal. So basically, if you have multiple distant microphones, one of the speakers is always, in some way, closer to a given microphone than to the others, and that's something the network actually can exploit in the scenario we applied it to. Also, because when you put multiple frames in the input you have a very coarse time resolution, you cannot really learn any time delays in this setup; it effectively just picks the strongest channels. You can do that in a more explicit way too: for example, you can apply convolutional acoustic models with max-pooling across channels on top, and that also gives some gains, but that's follow-up work.
0:38:14 It took me a little bit of courage to decide to respond to what Brian was asking, because, you know, I'm pretty bad at reading other people's papers, so I only have examples of papers which I wrote, or which my colleagues and students wrote, and which people should read very critically. I don't mean that they are wonderful, but I still think they are interesting. And this is the work on TRAPs, which we started at a time when it looked pretty crazy, because we just took the temporal trajectory of spectral energy at a given frequency, one second long, and we said: can you estimate what's happening in the centre of this trajectory? The first result, of course, was that you got about twenty percent correct at best. But of course you do this at a number of frequencies, so after that you take all these posteriors, feed them into another net, and then use that to estimate the phoneme in the centre. So it was kind of a deep neural net, I would say, and it was also kind of wide, because it had trajectories at different frequencies.
0:39:26 And it worked surprisingly well. So people could look at it and see what we should have done better: of course we never retrained the whole thing end to end, which we probably should have done, and we used context-independent phonemes, which maybe we shouldn't have, and a number of other things were different at the time, so the results are not entirely comparable to the mainstream systems. But I still think people should look at it and tell us what was wrong, or why it is that it works: you try to recognize a context-independent phoneme out of one second of context, and you actually do very well; if you look at the posteriorgrams they are amazingly good. It seems like something somebody else should look at critically. So, sorry for promoting my own work, but as I said, I'm bad with other people's work.
0:40:53 This question is mostly for Hank, but others may want to address it; it's about spoken content retrieval. For example, imagine a program where videos that are being recorded are going to be searchable for keywords online. I think the keywords that people will type into such a system are going to be new words, and they're going to be names. So what I'm asking is: with deep neural networks, are we optimising our acoustic models for the really frequent words and leaving out the infrequent words? And the other thing is, you're analysing with word error rates over your entire vocabulary — is that really getting at the performance we want to understand? It would be interesting to look at it from the standpoint of spoken content retrieval.
0:42:25 Maybe I can address some of that. I don't think the neural networks are just focused on the head; I think they do pretty well on the tails as well. But there are two aspects here: there's the vocabulary, and then there are the words that are out of vocabulary, that we don't have in the model at test time, and that's kind of orthogonal — I see you shake your head, but I think if we can incorporate, for the searches we do, a dynamic vocabulary into the decoder graph, then we can actually recognise out-of-vocabulary words that we haven't seen at training time. For example, I worked on voicemail years ago, and people's names come up all the time; our program manager, for some reason, always had his name misrecognised as something else. But once we switched in a dynamic vocabulary, with his name checked into the decoder graph, his name got recognised, and the same happened for lots of other names. So I think the issue right now is that the system doesn't actually incorporate a dynamic vocabulary. And I think the metrics you talk about are also devised to work over a sort of broad range, which makes it more difficult: if we introduce a technique that handles that tail, it will only give us a point one percent improvement or even less, and that's a shame; I think we really do need metrics and techniques that look at the long tail. But I think this isn't really just about recognition: there's lots of work that can be done in language modelling and dynamic vocabularies to make these words useful.
0:44:34 I'll chime in a little bit on this one too, since I can speak from experience doing keyword search in lots of languages through the Babel program, which we'll be hearing about tomorrow from Mary Harper. What we found is that word error rate actually is a pretty good basic metric, even when we're doing search for words that are out of vocabulary in the training. There's not a perfect correlation between word error rate and retrieval performance on this kind of task, but at least to first order, large improvements in word error rate, like the ones we see using neural networks instead of GMMs, definitely lead to better retrieval performance, even on out-of-vocabulary terms. So it's not a perfect metric, but it's one that we've used for many years and it works pretty well. It is interesting that for those rare words you often find pronunciation problems, which hurts their recognition, and you can see work here where we're trying to address that. So I'm not dismissing the point, just qualifying it.
0:45:40 I actually want to argue a little bit in favour of the direction of what the questioner was saying, because I think you have to sort of separate out the decoding and so forth from what's happening in whatever your acoustic model is — whether it's GMMs or DNNs or MLPs with many layers or whatever. It's true that you simply do better on things for which you see lots of examples, and this is also true if you're looking at particular units, triphones or whatever: those triphones that occur less often you are not going to estimate as well. But what you're saying is true too: it doesn't completely kill you.
0:46:31 I agree. I mean, there are issues where we have some queries that just don't get recognised by the recognizer, and when you go back and look at the ones that didn't get recognised, you find there were only five instances of that context in training; the systems are trained to do well on what they see. So something there does need to be addressed.
0:46:52 One technical comment on the out-of-vocabulary words, from our lecture transcription experience: we take a very pragmatic engineering approach, and basically the recognizer vocabulary is fed by the proceedings and related documents, so the new words are not that new anymore. But I had another question for the colleague from Google, and maybe Brian can comment as well, about sequence-discriminative training on the lightly transcribed or untranscribed portion. We found that basically the sMBR needed to be done on the transcribed portion of the data, and not on the data loosely transcribed by the recognizer. What is your experience on the YouTube videos? And maybe others can comment on this as well.
0:47:55well let's see on
0:47:58i personally don't have a lot of experience i think when we report numbers or
0:48:03three hundred hours broadcast news
0:48:05there about half of it is manually transcribed have but it slightly transcribed and so
0:48:12i'm pretty sure we see some nice gains on that chart can for
0:48:16ten percent relative though
0:48:18likings be cer fifty hour broadcast news from cross entropy sequence are more then we
0:48:24sent four hundred i don't know that's amount of data or data ones you know
0:48:29transcribers is like this
0:48:31that's a and which what the reasonably good baseline but again with a pretty good
0:48:36proportion of the training data being lightly supervised
0:48:41 Anybody else have comments on that? Karel? — My comment would be that this should be investigated deeper, because I truly believe that there is more improvement to be achieved there if we use that data right.
0:49:02 Okay, other comments or questions? Okay, Thomas.
0:49:13 This is a very general question: how much training data will we really need in the future? Where do you think we're going with the DNNs?
0:49:23 Well, I guess that's what I was trying to motivate with my work: we just initialise our system with a lot of data and big networks, and it takes a while — I don't know, we train pretty big networks. But I think it's a good sort of challenge question: if we have, say, ten thousand or a hundred thousand hours of data to use for training, and maybe we increase the number of context-dependent outputs to a hundred thousand, what do we get? It would be interesting just to know; if we did start down that path, we'd have to change quite a bit about how we train models of those sorts of sizes. I would also say that more is simply better, if the transcriptions are good enough.
0:50:07 That sounds like a great intro to the next comment.
0:50:13 I just wanted to mention some results where we actually did somewhat careful selection of data for the acoustic modelling, and the word error rate was better with the careful selection than with simply piling more data in; the performance was better when we were more thoughtful about what was going into the model.
0:51:03 I'm blanking on the name, but there was a visitor from Google who actually gave a talk at ICSI showing us what looked like a definite asymptoting of performance when going up to a hundred thousand, two hundred thousand hours and so on. So I think more data helps, but after a while not that much, I think, was the message.
0:51:24 And I'm surprised that you've been quiet all day — so there, you're making me happy.
0:51:34 So, on the issue of selection: I think you can certainly argue that selection cannot be the right thing to do; instead you should always do weighting. Because whatever data you have — and I certainly agree that there's good data and bad data — the bad data is not worthless, it's just less good than the good data. For example, we have a paper here on semi-supervised training, of the kind that's been done for a long time in the past: you make a model, recognise some untranscribed data, and then use it for training. When the error rates are relatively low — where low is fifty percent or below — you can do that with your eyes closed. When the error rate gets really high, like seventy percent, that does break down, but that doesn't mean you should discard the data: you should just give it a lower weight, and you can show that you always get better performance if you include the data, the weight just gets lower. Yes, in principle the weight could go to zero, but you let the system decide that, and the weights don't really go to zero, they just get smaller; weights like one third or one half, at error rates of eighty percent, are still giving gains. That's been our experience, at least.
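A sketch of that weighting-instead-of-selection point: give every automatically transcribed utterance a weight (for example derived from its confidence) and scale its contribution to the cross-entropy, rather than discarding low-confidence data outright. The numbers and the confidence-to-weight mapping here are purely illustrative.

    import numpy as np

    def weighted_cross_entropy(log_probs, targets, utt_weight):
        # log_probs: (T, S) frame log-posteriors, targets: (T,) state labels
        ce = -log_probs[np.arange(len(targets)), targets]
        return utt_weight * ce.sum()

    rng = np.random.default_rng(1)
    log_probs = np.log(rng.dirichlet(np.ones(100), size=50) + 1e-12)
    targets = rng.integers(0, 100, size=50)

    confidence = 0.35                                  # e.g. a high-error-rate utterance
    weight = max(confidence, 0.0)                      # simple mapping; never a hard zero
    print(weighted_cross_entropy(log_probs, targets, weight))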
0:53:14 So more data, weighted, may not necessarily be the whole answer. I agree with what you're saying, that there's always value in data, but whoever is deciding which utterances to use should also pay some attention to the distributional properties of what's in there — things like names. So this is one part of the problem: are we sampling the space correctly? That's really the question.
0:53:49 I think that's shown with my paper: on general YouTube data we got gains, but when we looked at a particular vertical, like news, where we were already getting much better error rates, adding all that data to training and using a bigger neural network with more parameters actually gave us losses on that specific domain. So there are some issues of generalization there.
0:54:15 I'd like to add a little bit on data; it will be a bit different from what others are saying. I agree that of course more data is always better, but I think we may also end up using less and less data. So to the question of how much data we will need, I would say less and less, because we are learning more and more about speech, and we are actually learning now how to train the nets on one language and use them on another and so on. And maybe this is also partly what Babel — which I call "bobble" — is about: I think that we are going to learn how to use the knowledge from existing databases on new tasks. This is at least my hope, so I'd like to end on this positive note: less and less data is what I see.
0:54:59 Just to follow up on what you're saying: I think the lower parts of the network are learning language-independent or task-independent information, so if you feed a lot of data into those layers and less data into the upper parts, that might be an approach to get there; I think it's very promising.
0:55:21 Actually, when we started working on GALE we had a bunch of things trained on English, and we were working on this with SRI, trying to move to Arabic. We didn't have much Arabic data yet, so we just used the nets from English to begin with, and they still did something good.
0:55:38 One point I'd like to make about recognition: if you've already got a lot of data, I think you might need something like ten times as much before the network learns much more — I think it's limited. But that's just an intuition, not a precise number.
0:56:09 I think if we don't have any other pressing questions — actually, there is still time. — So, on what was just said about weighting the data: I actually did a contrastive experiment where in one case I used frame selection and in the other case frame weighting, and I obtained identical word error rates for both systems. So maybe, if what was said about weighting is right, there should be some post-processing of the confidence scores; or it may simply be that those confidences are not at all uniform — the distribution looks more like an exponential, with groups of values.
0:57:15 To bring something else into the more-data discussion: there are several kinds of variability you want to handle. For speaker variability, okay, then you need more speakers. But if you want to be, let's say, robust against reverberation, you can just make the data: you present the same data with variation — added noise, or, for reverberation, you just train the system on simulated room acoustics — and that makes it very robust against distant microphones. That's a very cheap trick, and it works.
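A sketch of that "cheap trick": multi-condition training data made by convolving clean speech with simulated room impulse responses and adding noise. The exponentially decaying random impulse response below is a crude stand-in for a proper room simulator, and the SNR value is arbitrary.

    import numpy as np

    def reverberate(clean, rir, noise_snr_db=20.0):
        wet = np.convolve(clean, rir)[:len(clean)]
        noise = np.random.randn(len(wet))
        scale = np.linalg.norm(wet) / (np.linalg.norm(noise) * 10 ** (noise_snr_db / 20))
        return wet + scale * noise

    sr = 16000
    clean = np.random.randn(sr)                              # stand-in for 1 s of speech
    rir = np.random.randn(int(0.3 * sr)) * np.exp(-np.linspace(0, 8, int(0.3 * sr)))
    augmented = reverberate(clean, rir)
    print(augmented.shape)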
0:57:54 And something else about more data: if you look at the very good neural networks we all have in our heads, they're not trained with that much data. Google already has much more data than that, so that's a strong argument that better neural networks must be possible — so why can't we do it? — Because we don't know how yet, but we'd like to.
0:58:19 I think we are out of time, in principle, so I think we should turn this over to the conference organisers — and thank the panelists again.
0:58:37 Thank you, Morgan. This will be short. First, before we go, a couple of practical things. For the people that signed up for the micro-brewery tour: it is not a one-way trip, and it is very important to meet at seven. We begin tomorrow morning with the limited-resources session. Just one last practical comment: there is a car-pooling table on the message board, for whoever needs a ride to the airport or other places; there's free space, so just write yourself in, and maybe we'll have some nice encounters. Well, I would like to give thanks — I don't know which order is more or less important, but let's first thank the audience, because almost everyone is still here: thank you very much. Then thanks to the panelists and to all the speakers, and of course my greatest thanks go to today's organisers. And I still have one point left for Brian, because he has one more announcement. So here it is.