okay so this session is not intended to be particularly formal you know we've put away the screen there will be no slides

so what i was encouraging everybody who is an author to do is take about five minutes to just give sort of an oral summary of the poster to encourage people to come see it because it's going to be up for the rest of the session and then we can open up the floor for questions i might have a few and we'll see where the discussion goes

why don't we get started and since you're closest to me presently you can go first

okay so

the reason i'm here is because i basically still have gmms

so what i did is i looked at the neural networks and i tried to figure out why they work so well and tried to port that back to the gmms

why gmms so for years we have built up lots of techniques model based techniques model based adaptation speaker adaptation noise adaptation uncertainty decoding all kinds of techniques that are based on maximum-likelihood trained hmm gmm systems if we put in dnns at the front you basically lose all of that

and the other reason is they're fast

they're very efficient with a few parameters you can make a speech recognizer with ten times fewer parameters and it decodes very fast

and the final reason when you do speech recognition we kind of try to understand how it works so if you put a neural network at the top of your system

it's a black box method like a deep neural network what have you learned in the end so maybe it's better to have a modular system where you have building blocks that are

at least doing something you understand

it's nice to have that

so the second part is what are we going to port from the dnn world to the gmm world

so basically if you look at dnns they take a very large window of frames

and they map that to context dependent states which are basically long span symbolic units to go from long span temporal patterns to long span symbolic units is a fairly complex mapping that's why they need lots of layers

probably and also they go wide they have something like two thousand to four thousand nodes per layer in between so that's a pretty big pipeline

so the deep and the wide we already had two other important properties of a neural network a long window of frames and another thing is neural networks are advertised as being a product of experts

so basically every node uses the full input and is trained on all outputs

so there's lots of training data for every weight

okay so the next step is let's try to port all these ideas to the hmm gmm world

so basically i didn't invent anything new i used existing techniques

so if you want to handle large windows of frames you have to do feature reduction because gmms don't like two hundred dimensional input features

so we use something like lda linear discriminant analysis to do feature reduction

but that loses lots of information so in parallel with that we use for example multiple streams multiple streams are not new in the old discrete hmm world you had static features delta features and double delta features multiple parallel streams fused at the end

you can still do that today

so that already copes with a large input window of frames
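The pipeline just described, splicing a window of frames and then reducing it with LDA to a dimension a GMM can model, can be sketched roughly as follows. This is a toy NumPy illustration of my own; the frame counts, dimensions, and state labels are all made up:

```python
import numpy as np

def stack_frames(feats, context=4):
    """Splice each frame with +/- `context` neighbours (edge frames repeated)."""
    T, D = feats.shape
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

def lda_projection(X, labels, out_dim):
    """Classic Fisher LDA: directions maximising between- over within-class scatter."""
    mean = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))
    Sb = np.zeros_like(Sw)
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Generalised eigenproblem Sb v = lambda Sw v (small ridge keeps Sw invertible)
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(len(Sw)), Sb))
    order = np.argsort(-evals.real)[:out_dim]
    return evecs.real[:, order]

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 13))        # 200 frames of 13-dim features
labels = rng.integers(0, 5, size=200)     # fake state labels
stacked = stack_frames(feats, context=4)  # 9 x 13 = 117-dim spliced input
W = lda_projection(stacked, labels, out_dim=40)
reduced = stacked @ W                     # 40-dim features a GMM can handle
print(stacked.shape, reduced.shape)
```

The point of the sketch is only the shape of the pipeline: a GMM that would choke on the 117-dimensional spliced vector sees a 40-dimensional projection instead.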

going wider we already had as well we have multiple streams in parallel

you can see that as a

way of coping with a large dimensional input feature stream or you can see it as a committee of little models

then going deeper that's basically done by adding a log-linear layer on top of the other layers nothing new nothing special

conditional random fields or maximum entropy models the same thing goes around under lots of names it's just the softmax of the neural networks

so that's nothing special but it's the simplest extra layer you can add more or less

it is a product of experts model so it combines values in a sum which is basically a product in the log domain and

makes new values so it's very good at fusing things so

i added frame stacking in front of it just to increase the feature dimension

so basically these are all existing very simple techniques i forgot one parameter tying but that's also very simple we use tied states as our systems have for twenty years or so

that basically means that every gaussian is trained not on one output but on many so basically every gaussian is used in over a hundred of the output states so it gets lots of data it sees every frame anyhow

and if you combine all these things you end up with results that are competitive with last year's dnn results

this year's dnn results add things like segmental training or sequence training convolutional neural networks dropout training

those are new techniques that i don't know yet how i'm going to map to my system except that sequence training is very simple to add and will probably improve the systems

so the end message is the gmms and hmms are not dead

okay thank you kris hank

i'll switch gears slightly

i work on voice search and i've done some work on youtube

we actually published these results and i thought it would be great to share some of them with you

so if you don't know youtube it's a video sharing site where you can share all kinds of things i think the most popular videos are dogs or cats running around or something like that but there's actually some useful data there too a billion users visit youtube every month they watch six billion videos on youtube

and over a hundred hours of video are being uploaded every minute

so there's a lot of content there and a lot of people watching

one thing we'd like to do with youtube is provide captions you know to make youtube more accessible for those that are hard of hearing or don't speak the language

also imagine if we could provide automatic captions on youtube

that would help searching for videos on youtube or

navigating within a video if you want one particular instance of a word people could use this indexing technology to find instances of particular words in speech

so you know there are some useful applications

so i looked at this from a couple of aspects the first is the data aspect we have a lot of data what are some of the ways that we can leverage this data

for example

users have uploaded twenty six thousand hours of videos with

aligned text captions

these videos tend to have captions because you know people find it useful to have them

but some of those captions don't in any fashion match the video they're just advertising


i don't think many people have looked at how to use this sort of found data for training so what we do is much like what everyone else does we try to figure out what sort of aligns and what doesn't align and we have this islands of confidence technique so basically areas where a lot of agreement happens between the recognition result and the actual user provided captions

we use those islands of coherence as the training ground truth
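A minimal sketch of the islands-of-confidence idea: find runs where the recognizer's hypothesis and the user caption agree, and trust only those spans as training labels. This is my own toy illustration using `difflib`, not the actual implementation, and the minimum run length is an arbitrary choice:

```python
import difflib

def islands_of_confidence(hypothesis, caption, min_run=3):
    """Return caption spans where recogniser output and user caption agree
    for at least `min_run` consecutive words; only these 'islands' are
    trusted as ground truth for semi-supervised training."""
    matcher = difflib.SequenceMatcher(a=hypothesis, b=caption, autojunk=False)
    islands = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_run:
            islands.append(caption[block.b:block.b + block.size])
    return islands

hyp = "the cat sat on a mat while it rained outside".split()
cap = "the cat sat on the mat while it rained buy our product".split()
print(islands_of_confidence(hyp, cap))
# -> [['the', 'cat', 'sat', 'on'], ['mat', 'while', 'it', 'rained']]
```

The disagreeing regions (including the advertising tail of the caption) are simply discarded rather than risked as noisy labels.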

and so after filtering out things like non-native data and keeping what actually aligns well we got an initial corpus of about a thousand hours

compared to some hundred and fifty hours of supervised

actually hand-transcribed data

so we were able to do some comparisons on that


the other aspect is well we have so much data can we improve the modeling techniques in different ways and not just use brute force people talk about

having a few thousand cd state units and i think typically we all work on around seven thousand cd state units

i think frank's systems go up to thirty three thousand so we really do run

around twenty thousand to forty five thousand cd state units and with more data the bigger models got better

and so but you know that makes the softmax layer really large

with around forty thousand output nodes and a couple of thousand hidden nodes that's tens of millions of parameters in just that one layer so actually this borrows a bit from tara's low-rank factorization work at icassp we wanted to try this on our data and see how it goes

and in the paper we looked at using various levels of this

essentially you insert a low-rank linear layer before the softmax to cut the parameter count
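The parameter arithmetic behind the low-rank trick is easy to sketch. The layer sizes below are hypothetical, chosen only to show the scale of the savings when a rank-r linear bottleneck replaces the full hidden-to-softmax matrix:

```python
def softmax_params(hidden, outputs, rank=None):
    """Output-layer parameter count: full softmax vs a low-rank
    factorisation (hidden -> rank linear bottleneck -> outputs)."""
    if rank is None:
        return hidden * outputs
    return hidden * rank + rank * outputs

hidden, outputs = 2048, 40000              # hypothetical layer sizes
full = softmax_params(hidden, outputs)     # 81,920,000 parameters
low = softmax_params(hidden, outputs, rank=256)  # 10,764,288 parameters
print(full, low, low / full)               # roughly an 8x reduction
```

Because the bottleneck is linear, the factorised layer computes a rank-limited version of the same mapping, trading a little expressiveness for a much smaller (and faster to train) output layer.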

and basically our results were

that using just the semi-supervised data from the captions we can build a model that's better than our gmm system by about ten percent relative

so our gmm system initially was above fifty percent error rate

and i think there are you know some issues with the gmm system i think cambridge has better numbers for this task

they got below fifty percent but not by much but with the same semi-supervised data and no supervised training we did pretty well

when we added the supervised data we actually got better results with less data than with the semi-supervised models but that's expected and when we combined them combining didn't hurt

and with the low-rank factorization we found that with fewer parameters we were able to get comparable actually slightly better results maybe it's just regularization

we found that overall by adding

all this extra data we got better results on general youtube test sets but

when we tested on a domain-specific test set

for example youtube videos resembling broadcast news we actually got a degradation by adding all the semi-supervised data so that was interesting as neural networks people say bigger better more data

but we still have some issues with cross-domain training

so there are still you know some things to look into

so that's it

okay thanks tara

okay so frank showed earlier today

one of the first dnn results on lvcsr it was on switchboard and showed about thirty percent relative improvement over a speaker independent system

and you know microsoft as well as ibm and others have shown that if you use speaker adapted features for the dnn the results are better

and then earlier this year we showed that using very simple log-mel features with a convolutional neural network you can actually improve performance by between four and seven percent relative over a dnn trained with speaker adapted features

and one of the reasons we think is that it's sort of learning this speaker adaptation jointly with the rest of the network for the actual objective function at hand be it cross entropy or sequence

so the idea of this filter learning work we did is we said well why have we been starting from log-mel let's start with a much simpler feature such as the power spectrum

and have the network learn a filterbank which is appropriate for the speech recognition task at hand rather than using a filterbank which is perceptually motivated

if you think about how the log-mel is computed you take the power spectrum you multiply by a filterbank and then take the log which is effectively one layer of a neural net a weight multiplication followed by a nonlinearity

so the idea in this filter learning work was to start with the power spectrum and learn the filterbank layer jointly with the rest of the convolutional neural network
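The "log-mel is one neural layer" observation can be made concrete: a mel filterbank is just a fixed weight matrix applied to the power spectrum, followed by a log nonlinearity. Here is a rough NumPy sketch using the standard textbook mel construction, not the paper's exact code; the filter count and FFT size are arbitrary:

```python
import numpy as np

def mel_filterbank(n_filters=40, n_fft=512, sr=16000):
    """Triangular mel filterbank matrix, shape (n_filters, n_fft//2 + 1)."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for b in range(lo, mid):
            fb[i, b] = (b - lo) / max(mid - lo, 1)   # rising edge
        for b in range(mid, hi):
            fb[i, b] = (hi - b) / max(hi - mid, 1)   # falling edge
    return fb

# Log-mel as one "network layer": weight multiplication + log nonlinearity.
W = mel_filterbank()                              # fixed here; learnable in the paper
power_spectrum = np.abs(np.fft.rfft(np.random.randn(512))) ** 2
log_mel = np.log(W @ power_spectrum + 1e-10)      # floor keeps the log input positive
print(log_mel.shape)                              # (40,)
```

In the filter-learning setup the rows of `W` become trainable parameters, so the "filterbank" is whatever shape gradient descent finds useful rather than the perceptually motivated triangles above.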


so when we first tried this idea we got very modest improvements

and one of the reasons is because you have to normalize the input not into the convolutional network but into the filter learning layer

we know there's a lot of work that shows you should normalize the input features going into a network

so we found that by normalizing the input into the filterbank layer and by using a trick very similar to what's done in rasta processing to ensure that the input into the filter learning layer is positive we were able to get about a four percent relative improvement

over using a

fixed filterbank on a broadcast news task

we then showed that basically the filterbank layer can be seen as a convolutional layer with limited weight sharing so you can apply tricks such as pooling

so if you pool you can get about five percent relative improvement over the baseline with the fixed mel-filterbank

and then we tried other things like increasing the filterbank size to allow a lot more freedom for the filters that didn't seem to help much probably because there's a lot of correlation between the different filters

we also found the filter weights were very peaky probably picking up the many harmonics in the signal we tried smoothing that out and that didn't seem to help either

so it seems that the extra peaks that are learned in the filterbank layer are actually beneficial

finally instead of enforcing positive weights with the log nonlinearity we tried letting the weights be negative and using something like a sigmoid or relu nonlinearity and that also didn't seem to help

so it seems that using a log nonlinearity which is perceptually motivated actually matters

so in summary we looked at filterbank learning and as opposed to using a fixed mel-filterbank we were able to get about five percent relative improvement

thank you karel




in principle i was trying to

solve a similar problem to the one hank was describing

but there was one difference they could use probably several thousands or even tens of thousands of hours of training data that could be leveraged to improve the word error rates and in our case the dataset was much more modest though

very nice to play with

in our case we had ten hours of transcribed data and seventy four hours of un-transcribed data

from the iarpa babel program and

this is one of its conditions the limited language pack



i tried to find some heuristics for how to leverage the un-transcribed data best

the idea is that i used two different confidence measures on

different levels

one level was the sentence level

and the other was the frame level

so that we can select the data for training

the way the sentence-level confidence was computed was

basically as the average posterior of the best word path in the confusion network

and for the frame level

confidence measure

imagine you have some lattices

the way the semi supervised training is done at the beginning we use the transcribed data to build some system

and with this system we can decode the

data for which we don't have transcripts and so we can

take the best path from the lattices as if it were the reference

so when we have the lattices we can take the best path and

we can compute the posterior probabilities and we can then read off the posteriors which lie along the best path and use those as the confidence measures

to use
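Both confidence measures reduce to very little code once you have the posteriors along the decoded best path. A toy sketch; the posterior values and the 0.7 threshold are invented for illustration:

```python
import numpy as np

def select_frames(best_path_posteriors, threshold=0.7):
    """Frame-level selection for semi-supervised training: keep only frames
    whose posterior along the decoded best path exceeds the threshold."""
    post = np.asarray(best_path_posteriors)
    return np.flatnonzero(post >= threshold)

def sentence_confidence(best_path_posteriors):
    """Sentence-level confidence: average posterior along the best path."""
    return float(np.mean(best_path_posteriors))

post = [0.95, 0.91, 0.40, 0.88, 0.30, 0.99]   # toy best-path posteriors
print(select_frames(post))                     # frames 0, 1, 3, 5 survive
print(round(sentence_confidence(post), 3))     # 0.738
```

The sentence-level score gates whole utterances into the training set; the frame-level score then masks out the individual low-confidence frames within them.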

so then we started the experiments the first experiment was with frame cross entropy training

and i tried to make a systematic sequence of steps

starting with the larger granularity and then going to the smaller one so at the beginning i was adding sentences according to the confidence

and surprisingly i could keep adding

more and more sentences and still the system

kept improving and there was no degradation

which was surprising

and so

this gave a one point one percent improvement in absolute and then there was

still the situation that there was

roughly ten hours of transcribed speech against seventy hours of untranscribed speech so there was an imbalance so we tried to multiply the

amount of transcribed speech we tried

different multiplication factors for the system

three was

the good one and it gave an improvement of zero point three

percent absolute

and finally we went down to the

frame level and found that

frame level selection with an appropriately tuned threshold gave another zero point eight

or zero point nine percent

so the overall improvement was over two percent absolute


as the full recipe also includes sequence-discriminative training

i did some experiments

with the smbr criterion to improve the

results at that stage and i tried to use a similar

data selection framework

but it turned out

that the safest option was to take just the transcribed data and

use smbr on just the transcribed data and

a large part of the improvement that we obtained at the frame cross entropy level of training

persisted in the sequence-trained systems

that's pretty much all

the experiments we did
the experiments we did

so i'd like to invite you to

come see the poster and i would also like to thank the whole team of colleagues who

worked on developing the toolkit

thanks karel next we have pawel okay

so our poster paper is about how to learn a speech representation from multiple or single distant channels so we did distant speech recognition

which as we know is much more difficult to cope with because of

many aspects like for example

poor signal-to-noise ratio or

interference effects from other acoustic sources so what people usually do with distant speech recognition is to

capture the acoustics using multiple distant microphones

and then basically apply on top some sort of combining algorithm like beamforming which turns the signals into a single channel and then you build an acoustic model on top

whatever acoustic model you want

we were interested in how to use multiple distant microphones without a beamformer

so that the network in addition to the actual acoustic modeling

also tries to learn a way to combine channels so we used

neural networks for that and there are two obvious ways to follow

the first one is

simple concatenation so you take the acoustics captured by the multiple channels and present them as

just one large spliced input to the network and you train it with

a single set of targets like you usually do

and the other way to do it is

multi-style training and multi-style training

allows you to actually use multiple distant microphones while you're training and you can then

recognise with a single distant microphone
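The two training setups can be contrasted in a few lines. This is a toy sketch of the input preparation only, not the actual training code; the channel count, feature dimensions, and batch size are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data: 4 distant microphone channels, 100 frames, 40-dim features each.
channels = [rng.normal(size=(100, 40)) for _ in range(4)]
targets = rng.integers(0, 10, size=100)   # one shared label per frame

# Option 1: channel concatenation -- one wide input per frame, shared targets.
concat_input = np.hstack(channels)        # (100, 160): needs all mics at test time

# Option 2: multi-style training -- each minibatch draws frames from a random
# channel, so the network sees every channel but keeps a single-channel input.
def multistyle_batch(channels, targets, batch_size=16):
    ch = channels[rng.integers(len(channels))]   # pick one channel at random
    idx = rng.integers(len(ch), size=batch_size)
    return ch[idx], targets[idx]                 # (16, 40) inputs, (16,) labels

x, y = multistyle_batch(channels, targets)
print(concat_input.shape, x.shape, y.shape)
```

The practical difference is in the last line: the concatenated model needs all four microphones at test time, while the multi-style model keeps a single-channel input and can therefore decode from one distant microphone.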

so getting back to concatenation with just simple concatenation we were able to recover around fifty percent of the beamforming gain so we weren't able to beat our best

dnn model

trained on the beamformed channels but we were able to recover around fifty percent

relative to the gain of the beamformed dnn of course

and with multi-style training we train the network in a multi-task fashion where

we actually share the representation across channels and we presented

random batches of data from random channels

and that apparently forces the network to factor out some of the variability

that is in the channels and in the end

multi-style training

gave us the same gains as the simple concatenation

so basically it's a very attractive approach because you do not need multiple distant

microphones in the test scenario

which is a nice finding


right so

in the poster we also point out some open challenges like for example

overlapping speech is still a huge issue

and not many researchers actually try to address it

and the simplest thing is just to ignore it

and we also present a complete set of numbers for the ami dataset all these numbers should be easy to reproduce if someone is interested

so i invite

anyone who's interested

to come by and we can discuss some more

thank you

okay thanks pawel finally alex

thank you

so just to start with a little bit of motivation

for more than ten years i've had

a kind of longstanding ambition for

speech recognition to be performed by a single recurrent network

to have one network do the acoustic modeling

the language modeling

the state transitions

and have it all combined in a single network

that turns out to be difficult

you probably won't be surprised to hear

and i was eventually convinced mostly by my coworkers that maybe i

should just try

plugging one of these networks into

a standard system

replacing the

feedforward neural network with a recurrent one

and so that's basically what we did

and it's really

fairly straightforward

you know it's a

standard hybrid system

the only thing that's different is the network architecture

one thing to note

an ordinary recurrent neural network takes

a single

input feature frame at a time

and builds up context

as it goes along the sequence

there are various other kinds of

improvements to the basic recurrent network architecture that have been accumulating over the years

and i guess the two main ones are first making it bidirectional so instead of having a single

network that starts at the beginning of the sequence and goes forward

you have

two recurrent networks one going forward and one going backward

so you know you get both past and

future context

and you can stack that same structure just as you would a normal deep network

so it stays bidirectional

and what you actually find is that

the network's use of context spreads out

as it goes deeper

and the other novel thing i guess is the use of this long short term memory architecture which i won't try to describe in detail but the basic idea is it's better at

storing information over time which gives you access to longer range context

a common problem everyone hits when they try ordinary recurrent networks for speech

is the vanishing gradient which makes it difficult to store information
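The bidirectional idea is simple to sketch even with plain tanh cells instead of LSTM: run one recurrence forward and one backward over the sequence, and concatenate the two state vectors per frame. This is a toy NumPy forward pass with random weights and arbitrary sizes, not a trained model:

```python
import numpy as np

def rnn_pass(x, Wx, Wh, reverse=False):
    """One tanh recurrent layer over a (T, D) sequence; returns (T, H) states."""
    if reverse:
        x = x[::-1]                       # walk the sequence from the end
    h = np.zeros(Wh.shape[0])
    states = []
    for frame in x:
        h = np.tanh(Wx @ frame + Wh @ h)  # state carries context frame to frame
        states.append(h)
    states = np.array(states)
    return states[::-1] if reverse else states  # re-align to original time order

rng = np.random.default_rng(0)
T, D, H = 50, 40, 32                      # frames, feature dim, hidden size
x = rng.normal(size=(T, D))
Wxf, Whf = rng.normal(size=(H, D)) * 0.1, rng.normal(size=(H, H)) * 0.1
Wxb, Whb = rng.normal(size=(H, D)) * 0.1, rng.normal(size=(H, H)) * 0.1

# Bidirectional layer: every output frame sees both past and future context.
fwd = rnn_pass(x, Wxf, Whf)
bwd = rnn_pass(x, Wxb, Whb, reverse=True)
bidir = np.concatenate([fwd, bwd], axis=1)   # (50, 64); stackable like any layer
print(bidir.shape)
```

Because the concatenated output is just another (T, features) sequence, the same forward/backward pair can be stacked to build a deep bidirectional network, as described above.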


otherwise it's

a standard recipe for the training

we used about fifteen hours

because we wanted to compare the system with

the kind of work other places were doing

and using that we actually got a comparable system

and then we moved to the wall street journal corpus and the results comparing

these

bidirectional rnns trained with cross entropy at the frame level came out pretty close

one possible reason is

wall street journal is

maybe not the most challenging corpus and you know it would be essential to try the

model on something like switchboard

but my feeling is

what we really have

to do is go beyond cross entropy training

to training on the word error rate we actually care about

something like

sequence training

thanks so at this point we can open up the floor for questions or comments

from the audience either directed at the panel or anybody else in the room

so do i have any takers

so following up on the filter learning work a question for tara since

your actual input was the power spectrum do you think that

the networks would be capable of

going even

further backward if you wanted

to the waveform

i think that's something to try definitely

and i believe somebody has done some work on that actually

i think you're right i think there's been a little bit of work

maybe you know alex about using convolutional

neural network like approaches on the waveform do you remember

i believe there has been some work on this but generally they do

something on top of it like taking the log take the absolute value

followed by the log and so on so there are things in there that are kind of

hard to reproduce just by pretending you don't know these things are any good

actually i was going to ask you

i was trying to recall

did you end up taking the log still

yes we take the log right before going into the neural network

right so that's interesting right i mean you know we've got these

machines that learn stuff but we still have to take the log for them

i wonder about that

okay i have a question which can actually be directed at morgan and hynek and

alex waibel if he's in the room

so one of the themes that came up earlier in the day was that some

of this stuff was done back in the nineties and due to limitations on the amount of data

we had to work with and the amount of computation available

there were things that people couldn't explore or couldn't viably be

explored and so the question now is are there papers from the nineties that current

practitioners should be going back to rereading and trying to plagiarize ideas from that

we can improve on now

and if so which ones

there is a lot i mean

i won't say i mean it depends on what people are interested in right like this morning

there were questions about adaptation and i can't recall off the top of my head which papers but

there were a bunch of papers on neural net adaptation some of them from cambridge

if you're interested in adaptation

there is a

large number of papers on the basic methods and on the sequence training we were talking about at

lunch

there are papers where people

did sequence training i think around ninety five or something

what we were doing at the time was

using the alignments as the targets for the net training

i mean it isn't just the computation and the

storage and the amount of data it's also just that

oftentimes you know these things are cyclic you try some things out and

for example we did the sequence training

and it helped a tiny little bit

in the examples we were looking at and it was a lot more work

so we didn't pursue it more

we had a couple of years where we were really looking into it but it wasn't so promising

so there were probably some things that we weren't doing quite right and

now it's coming back and

also people's enthusiasm plays a role when you're enthusiastic about stuff

you look at a point two percent improvement a lot differently than when you're not

how about some other questions for the panel they had lots of interesting things they were

talking about so

a question for pawel on your multi-microphone

experiment you did i guess that was with the ami corpus

yes

so you got this

i guess nowadays predictable result that if you just concatenate the features from the

three or four different channels

you perform better than any beamforming

wiener filtering or whatever else it is you were doing

is that correct

no okay

when you concatenate you get some improvement over a single distant microphone but

the message from the paper is that if you can beamform you probably should

okay but with the concatenated features going into the neural network is

that assuming that the speaker is sort of stationary

i mean if my speaker were to walk around

i can imagine actually

our observation is that the network isn't learning beamforming it's

more like adapting to

the most meaningful signal

to the strongest signal so basically

if you have multiple distant microphones and one of the speakers is always like

in some way

closer to a given microphone then that is

something the network can exploit and that held

in the scenario we applied it to


because the when you like put multiple frames

in the input you have like a very small time resolution so you actually can

not there any time delays in this setup so it's just the it's just take

eigenmaps really and you can do it like in a more obvious way for example

you can apply

convolutional that'll

the acoustic models and the max-pooling

and tops the also give some gains

but that's like a followup work

let me be a little bit courageous and respond to what brian was asking

because you know i'm pretty bad at reading other people's papers

and so i only have examples of papers which i wrote with my colleagues and

students which people should read very critically i mean i don't mean that

they are wonderful but i still think they are interesting and this is

the work on traps which we started to work on at a time when it was considered pretty crazy

because we just took the temporal trajectory of spectral energy at a given frequency

one second long and we said can you estimate what's happening in the center

of this trajectory

and so the first result of course was that you got only about twenty percent of phonemes correct

but of course you get this at each of the frequencies so after that you take

all these posteriors and feed them into another neural net and it

then

estimates the phoneme in the center so it was like a kind of primitive deep

neural net i would say and it was also hierarchical

because it looked at trajectories at different frequencies

and it worked surprisingly well i mean so people could look at

it and probably we should have done better of course we never

retrained the whole thing which probably we should have done and we used

context independent phonemes which maybe we shouldn't have a number of things happened

at the time which make it not entirely comparable to current systems

but i still say that people should look at it

and tell us what was wrong or how it is that it works

because you try to recognize a context independent phoneme out of one second of context

you know and you actually do very well if you look

at the posteriorgram it's amazingly good

so

somebody else should look at it critically

so sorry for promoting my own work but as i said

i don't read other people's work so

so the other people's work so

so this question is mostly for hank and others can address it if they have something to

say

i actually work on spoken content retrieval

and my concern is this

for example imagine a program where videos are being recorded and you want

to be able to search them for keywords

online so the keywords that people are typing into that system are going to

be

new words names rare words

so

what are the deep neural networks optimising

are our acoustic models doing really well on frequent words and falling down on infrequent words

and the other thing is you're analysing with word error rates over your entire test set

is that really getting at the performance

we want wouldn't it be interesting to look

at the error rate from the standpoint of spoken content retrieval

restricted to the rare words

maybe i can address some of that i don't think the neural networks

are just focused on

the head of the distribution i think they do pretty well on the tails as well

but i mean there are two aspects here there's the vocabulary and there are those words that

are out-of-vocabulary which we don't have in the model at

test time and that's a different kind of orthogonal issue

i see you shake your head but i think

i think if we can incorporate rare words when we do

our searches we have a stack decoder graph and we can actually incorporate a dynamic vocabulary

into that graph

i think when we do that we can actually recognise out-of-vocabulary words words that

we haven't seen during training time for example i worked on voicemail years

ago and you know people's names come up all the time and our program manager

for some reason

the recognizer would always recognise his name as mister ten cents

but once we switched on the dynamic vocabulary we had his

name checked into the stack decoder graph and his name was recognized and that fixed lots of other

names too so i think right now the system doesn't actually incorporate a dynamic vocabulary

The metrics you talk about also push us to work on the broad average, and that makes it more difficult: if a technique aimed at rare words only gives us a point-one improvement, or even nothing, on overall word error rate, we won't adopt it — and that's a shame. I think we really need metrics that look at the long tail. But beyond recognition I think there's still lots of work that can be done in language modeling, and in analysing whether these rare words are actually useful.

I'll chime in a little on this one too, since I can speak from experience doing keyword search in lots of languages, thanks to the Babel program, which we'll be hearing about tomorrow. What we found is that word error rate actually is a pretty good basic metric, even when we're searching for words that are out-of-vocabulary in the training data. The correlation between word error rate and retrieval performance on those tail terms isn't perfect, but at least to first order, large improvements in word error rate — like the ones we see using neural networks instead of GMMs — definitely lead to better retrieval performance, even on out-of-vocabulary terms. So it's not a perfect metric, but it's one we've used for many years, and it works pretty well.
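For reference, the word error rate discussed here is just the word-level edit distance (substitutions, insertions, deletions) between reference and hypothesis, normalised by the reference length. A minimal sketch (the function name and interface are illustrative, not any panelist's actual scoring tool):

```python
def wer(ref, hyp):
    """Word error rate: minimum edit distance between the reference
    and hypothesis word sequences, divided by the reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub,           # substitution or match
                          d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1)   # insertion
    return d[len(r)][len(h)] / len(r)
```

Note that one overall number like this is exactly the "broad average" the questioner is worried about: every word counts equally, so rare query terms contribute almost nothing to the score.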

With the interesting rare words you'll also find pronunciation problems, and that's very hard on recognition — which is part of what the work here on deriving pronunciations is trying to address. I'm not dismissing the concern, just noting that aspect.

I actually want to be a bit contrarian, in favour of the direction the questioner is suggesting. Because I think it's useful to separate out the decoding and so forth from what's happening in whatever your acoustic model is — whether it's GMMs or DNNs or MLPs with many layers, or whatever. It's true that you're simply going to do better on things you see lots of examples of. And this is also true even if you're looking at particular units — senones, say, or triphones: if those triphones occur less often, then you're not going to estimate them as well. But what you're saying is true too: it doesn't completely kill you.

I agree. I mean, there are issues where some queries just don't get recognised by the recogniser, and even among the ones that do, you find there are only five instances of that context — that's just what these systems are trained on. But it's something that does need to be addressed.

One technical comment first, on the new-words issue: we take a very pragmatic engineering approach, and the recogniser's vocabulary is fed by the proceedings and similar text, so the generated "new" words are not that new anymore. But I had another question for the panel, and maybe also for the previous speaker, about sequence discriminative training on the lightly transcribed or untranscribed portion of the data. We are cautious there: basically we do that training only on the well-transcribed portion, and not on the part loosely transcribed by the recogniser. What's your experience on the YouTube videos, for example — and maybe others can comment on this as well?

I'll comment on this first, because we've actually done sequence training experiments on this setup. I personally don't have a lot of experience with it, but when we report numbers on three hundred hours of broadcast news, about half of it is manually transcribed and half is lightly transcribed, and I'm pretty sure we see some nice gains on that — around ten percent relative. We likewise see gains going from cross-entropy to sequence training on the fifty-hour broadcast news setup, and then on four hundred hours — I don't know exactly how much of that data has transcripts of which quality. So that's with a reasonably good baseline, but again with a pretty good proportion of the training data being only lightly supervised.

Anybody else have comments on that? My own comment would be that it's something we should investigate more deeply — I truly believe there is more to be gained there than just getting the words right.

Okay — other comments or questions? Okay, Thomas.

So this is a very general question: how much training data will we really need in the future, given the progress you're making with the DNNs?

Well, I guess what I was trying to motivate with my work is that we're just at the initial stages: with a lot of data and big networks, training takes a long time. I don't know how we'd train really big networks, but I think it's a good challenge question: if we had ten thousand hours, or a hundred thousand hours, of data for training, and maybe increased the number of context-dependent outputs to a hundred thousand, what would we get? It would be interesting to know, if we did start with that much, what we'd have to change to train models at that sort of size. I would also say more is simply better — if the transcriptions are good.


That's a good segue to a remark I wanted to make. I just wanted to mention some results that put numbers on this: we actually did a somewhat random selection of data for acoustic modeling, and the word error rate held up well for some words but not for others. So rather than just piling on data, we found it paid to be more careful about what we trained on — more of the improvement was coming from being thoughtful about the data than from the sheer amount going into the model.

I'm blanking on the name, but there was a visitor from Google who gave a talk at ICSI showing what looked like a definite asymptoting of performance as they went up to a hundred thousand, two hundred thousand hours and so on. So I think more data helps — but after a while, not that much.

And I'm surprised that you've been quiet all day, so —

There, now you're making me happy.

So on the issue of selection: I think you can certainly argue that hard selection cannot be the right thing to do — instead you should always do weighting. Because whatever data you have — and I certainly agree that there's good data and bad data — the bad data is not worthless, it's just less good than the good data. So for example we have a paper here on semi-supervised training, which people have done for a long time: you make a model, recognise some untranscribed data, and then use it for training. When the error rates are relatively low — "low" meaning fifty percent or below — you can do that with your eyes closed. When the error rate gets really high, like seventy percent, it does break down; but that doesn't mean you should discard the data. You should just give it a lower weight, and you can show that you always get better performance if you include the data — the weight just gets lower. Yes, in principle the weight could go to zero, but you let the system decide that, and in practice the weights don't really go to zero, they just get smaller: weights like one third or one half, at error rates of eighty percent, are still giving gains. That's been our experience, at least.
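The contrast between hard selection and soft weighting of automatically transcribed data can be sketched as follows. This is a toy illustration: the threshold value and the choice of confidence itself as the weight are assumptions for the example, not the actual system described by the panelist.

```python
def selection_weight(confidence, threshold=0.5):
    """Hard selection: keep the utterance only if the recogniser's
    confidence clears a threshold (hypothetical cutoff); otherwise
    the data is discarded entirely."""
    return 1.0 if confidence >= threshold else 0.0

def soft_weight(confidence):
    """Soft weighting: every utterance contributes, scaled by its
    confidence, so 'bad' data is down-weighted rather than thrown away."""
    return confidence

def effective_data(utterances, weight_fn):
    """Effective amount of training data under a weighting scheme.
    In real training, each utterance's statistics (or gradient) would
    be scaled by its weight instead of being summed like this."""
    return sum(weight_fn(conf) for _, conf in utterances)
```

With hard selection, the low-confidence utterance below contributes nothing; with soft weighting it still adds a small amount of signal:

```python
utts = [("utt1", 0.9), ("utt2", 0.6), ("utt3", 0.3)]
effective_data(utts, selection_weight)  # drops utt3 entirely
effective_data(utts, soft_weight)       # keeps utt3 at weight 0.3
```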

So I'd buy that, but "more data, weighted" may not automatically be the right thing either. I agree with what you're saying — the data always has value — but whatever weight you give the utterances, you should also pay some attention to the distributional properties: names, for instance.

Sure — that's one part of the problem: sampling the space correctly. That's really the point.

I think that fits with my paper, where I showed that on general YouTube data, when we trained on a particular vertical, we were getting much better error rates; but adding all the data to training, and building a bigger neural network with more parameters, we actually saw losses on that specific domain. So there are some issues of generalization there.

I'd like to add a little bit on data too. It will be a bit different from what was just said, though I agree that of course more data is always better. But I think we can also be using less and less data. So if the question is how much data we will need, I would say less and less — because we are learning more and more about speech, and we're actually learning now how to train the nets on one language and use them on another, and so on. And maybe that's partly what Babel — which I call "bobble" — is about: I think we are going to learn how to use knowledge from existing databases on new tasks. That is at least my hope, so I'd like to end on this positive note: less and less data, that's what I see.

Just to follow up on what you're saying: I think the lower parts of the network are learning language-independent or task-independent information, so if you feed a lot of data into those layers and less data into the upper parts, that might be an approach to getting there. I think that's very promising.

Actually, when we started working on GALE, we had a bunch of nets trained on English, and as we were working on this with SRI and trying to move to Arabic, we didn't have much Arabic data yet — so we just used the nets from English to begin with. And it still did something good.
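The reuse described here — keep the lower, presumably language-independent layers and replace only the output layer for the new language — can be sketched structurally. This is a minimal illustration with plain lists as weight matrices; the layer sizes, initialisation, and function names are all assumptions for the example, not the GALE recipe:

```python
import random

def init_layer(n_in, n_out, rng):
    """Randomly initialised weight matrix (n_out rows of n_in weights)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]

def transfer_to_new_language(source_net, n_new_targets, rng):
    """Cross-lingual transfer sketch: keep the lower layers from the
    source-language net unchanged, and attach a freshly initialised
    output layer sized for the new language's targets."""
    lower_layers = source_net[:-1]        # reused as-is (language-independent features)
    n_in = len(source_net[-1][0])         # width feeding the old output layer
    new_output = init_layer(n_in, n_new_targets, rng)
    return lower_layers + [new_output]
```

In practice one would then fine-tune, either only the new top layer or the whole stack, on whatever target-language data is available.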

One point I'd like to make, about recognition and learning: if you give a system ten times more data than it can absorb, it doesn't learn ten times more — I think the capacity is limited. That's not a precise number, just an intuition.


I think we don't have any other pressing questions — actually, there is still time.

So, on what Andreas was saying about weighting the data: I actually did the contrastive experiment — in one case using frame selection, in the other frame weighting — and I obtained identical word error rates for both systems. So maybe, if what Andreas says is true, there should be some post-processing of the confidence scores; or it may be that the scores are not so uniform at all — the distribution looks more like an exponential, that kind of shape.

Something else on more data: there are several axes along which you can want more. If you want to handle speaker variability, then you need more speakers. But if you want to be robust against, say, reverberation, you can just make the data: you present the same data with variation — added noise, or, for reverberation, you just train the system on a range of room acoustics. That makes it very robust against distant microphones. It's a very cheap trick, and it works.
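The "cheap trick" of making reverberant training data amounts to convolving clean speech with room impulse responses, one per simulated room. A minimal sketch — the impulse responses here are placeholders standing in for measured or simulated RIRs:

```python
def convolve(signal, rir):
    """Simulate a room by convolving a clean waveform with a room
    impulse response (RIR); direct-form convolution for clarity."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, r in enumerate(rir):
            out[i + j] += s * r
    return out

def augment(signal, rirs):
    """Multiply the training data: one reverberant copy per room.
    Each copy is the same utterance heard through different acoustics."""
    return [convolve(signal, rir) for rir in rirs]
```

For example, a two-tap RIR of `[1.0, 0.5]` adds a half-amplitude echo one sample later; training on many such copies exposes the model to the room variation it will meet at test time.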

And something else about more data: if you look at the very good neural networks inside everybody's head, they're not trained with that much data. Google already has more data, so if that were the decisive point in making better networks, they could just do it. So why don't we have that yet? Because we don't know how.

I think we are out of time, in principle, so we should turn this over to the conference organisers — and thank the panelists.

Thank you, Morgan. This will be short. Before we go, a couple of practical things. For the people who subscribed to the microbrewery tour: it is not a one-way trip, so it's very important to be on time for the departure. And we begin tomorrow at eight forty with the limited-resources session. One last practical comment: there is a carpooling table on the message board — whoever is going somewhere and has free space, just write yourself in; maybe we'll have some nice shared rides.


I would like to say thanks. I don't know what order is more or less important, but let's first thank the audience, because almost everyone is still here — thank you very much. Then the panelists, and all the speakers, and of course my greatest thanks go to the organisers. And I still have one more point, for Brian — so, this is a...