I'm going to be presenting on behalf of the University of Science and Technology of China, the National Engineering Laboratory of Speech and Language Information Processing.

This is a paper by Ma Jin, my master's student, and some other collaborators. We asked him to build his own CNN, which he did, and then we asked him to do something using it, which he did. So what I'm going to do is present what came out when he tried that.

We've got four stages: an introduction, really how this works for language ID and the structure involved; the proposed method; some experiments and analysis; and then maybe ending with a bit of thought on some future work.

Well, the first thing to ask is: what is language identification? It's just the task of taking a piece of speech and extracting language identity information from it. That information comes at different levels, as we know, and we can say that it's acoustic information or phonetic information, and we'd like to disassociate it from the characteristics of the speaker, as we'll see in a little while; otherwise we're fighting the tendency to do speaker recognition.

State of the art? Well, maybe this will change shortly, I don't know, but the state of the art is really GMM i-vectors. We've seen great gains there, but everybody is trying to find what's next.

Deep learning in particular allows us to take some of the advantages of supervised training to extract discriminative information out of the data that we have. Especially when we have small amounts of training data, we can use transfer learning methods to train something which may well be discriminative on the related task of inferring language identity.

Some of these we've seen recently. There's the bottleneck-network-based i-vector representation of Yan Song, my collaborator; I think that was last year at Interspeech. There's also a poster yesterday, which you may have missed; the paper should be in the proceedings. We've seen DNN-based approaches here doing great things; that's in Transactions on ASLP.

Then there are some approaches which are end-to-end methods, and we can look at some of the state of the art that has flowed through that: deep neural networks here, and, I guess, long short-term memory RNNs here, also at Interspeech.

So this is really extracting information at the frame level and gathering sufficient statistics over an utterance in order to pull out language-specific identifiers. There's a recent approach using a convolutional neural network to deal with short utterances; it's using the power of a CNN to pull out the information from these short utterances, and it seems to get over some of the problems in terms of utterance length.

We have a different method. We also think that using, say, MFCCs with a large context may introduce too much information that a CNN or DNN then has to remove. So what we would be doing is using some of our precious training data to remove information that, if we had a magic wand, probably shouldn't have been included in the input features in the first place.

So what we're doing is slightly different. We do use a convolutional neural network, but we're not using the CNN to extract frame-level information per se. What we're actually doing, in this very wide, long, end-to-end type of system, is starting off with PLP input features and training a DNN, a standard bottleneck network. We take the bottleneck features here, add what could be quite a lot of context to them, and then feed that into a CNN, here with three layers, and finally a fully connected output layer. What we're getting is a language label directly at the output.
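
To make the shape of that pipeline concrete, here is a minimal sketch in PyTorch, assuming the bottleneck DNN is already trained and frozen; the layer sizes, pyramid levels and names are illustrative choices of ours, not the exact configuration from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LIDNet(nn.Module):
    """Sketch of the pipeline: frozen bottleneck DNN front end, a small CNN
    over the bottleneck-feature "image", spatial pyramid pooling, and a
    fully connected language classifier. Dimensions are illustrative."""
    def __init__(self, bottleneck_dnn, n_langs=6):
        super().__init__()
        self.bottleneck = bottleneck_dnn              # trained on senones, kept fixed
        self.conv = nn.Sequential(                    # three convolutional layers
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
        )
        self.spp_levels = (1, 2, 4)                   # power-of-two pyramid
        n_bins = sum(l * l for l in self.spp_levels)  # 21 bins per channel
        self.fc = nn.Linear(128 * n_bins, n_langs)    # direct language output

    def forward(self, plp_frames):                    # (T, plp_dim) for one utterance
        with torch.no_grad():                         # front end stays frozen
            bn = self.bottleneck(plp_frames)          # (T, bn_dim) bottleneck features
        x = bn.t()[None, None]                        # (1, 1, bn_dim, T) "image"
        x = self.conv(x)                              # context grows with conv depth
        pooled = [F.adaptive_max_pool2d(x, (l, l)).flatten(1)
                  for l in self.spp_levels]           # fixed size for any T
        return self.fc(torch.cat(pooled, dim=1))      # language logits
```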

So you can see why this is sort of attractive in terms of a system-level implementation, but to me it's kind of counterintuitive, because we tend to use CNNs to extract front-end information; I mean, in the related tasks that we've been trying, they tend to work well for that. We did try things like stacks of MFCCs as input features to a CNN directly, and it doesn't seem to work that well; maybe somebody else can do better than us. So what we did was to have a DNN followed by a CNN, and see how that works.

And to sum up what that does: it transforms acoustic features into a compact representation. We do that frame by frame, then feed a context of multiple bottleneck frames into the CNN, and we come out with something which should be discriminative in terms of language. Okay, so this is what we call the LID features. We think that the general acoustic features at the input, like I said, contain too much information, so we're trying to reduce the load on the trained system that follows.

And given the limited amount of training data, we don't really want to waste it. We know that we can have a deep neural network which is trained on senones, and that it will embody phonetic information: the beginning of it is acoustic information, and somewhere in the middle of that network there is, effectively, a transformation from the acoustic to the phonetic. We take the bottleneck features, which we hope are a compact representation of the relevant information. I'm not sure that's entirely true, because there are plenty of approaches that take information from both the middle and the end of the DNN and seem to work well, especially with fusion.

Anyway, what we're doing that is kind of different is using spatial pyramid pooling at the output of the CNN. This allows us to take the front-end information and span it to the utterance level, which provides us with an utterance-length-invariant, fixed-dimension vector at this point.

So it deals with arbitrary input sizes. We take the spatial pyramid pooling method from the paper by Kaiming He; that's ECCV 2014. It's designed to solve the problem of making the feature dimension invariant to the input size, which is a problem we face often, and a problem certain areas of image processing also face.

I think what's happened is we've got a kind of feedback loop, where speech technology goes into image processing, then comes back to the speech field, and it cycles around.

So this is really inspired by a bag-of-words approach, and it comes through into the spatial pyramid pooling, which uses a power-of-two stack of max-pooled features. So it changes resolution in powers of two, and we can control quite finely how many features we want at the output. It's attractive, it works well, and we like it. The detail is in the paper.
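
For concreteness, a minimal sketch of that pooling step, assuming a CNN feature map of shape channels by height by width; this is our illustration, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a (C, H, W) feature map at each pyramid level and
    concatenate the results. With power-of-two levels (1, 2, 4), every
    input, whatever its width, comes out as a vector of C * (1 + 4 + 16)
    values: the utterance-length invariance described above."""
    x = feature_map[None]                 # add a batch dimension
    pooled = [F.adaptive_max_pool2d(x, (l, l)).flatten(1) for l in levels]
    return torch.cat(pooled, dim=1).squeeze(0)
```

A 3-second and a 30-second utterance then yield vectors of identical dimension, so the fully connected layer always sees a fixed-size input.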

So how do we put all this together? Well, as shown in the diagram on the right here, what we're doing is taking a six-layer DNN which is trained with large-scale Switchboard data, taking the half of the network up to the bottleneck layer, and feeding that into a system which is then trained for language ID using LID training data. Now, if we take that information and feed it directly into a CNN, then given the training data that we're using, it will not converge to anything sensible, if it converges at all; it just doesn't work. So what we had to do instead was build the CNN layer by layer. The DNN is already trained, and that stays fixed. Then you start to build the CNN by training first one convolutional layer, then a second, then a third, each one topped with spatial pyramid pooling and a fully connected layer at the output to give us the language labels directly.
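
A sketch of that greedy, layer-by-layer schedule; `make_conv_layer`, `make_spp_fc_head` and `train_lid` are hypothetical helpers, the point being what is frozen and what is trained at each stage:

```python
# The bottleneck DNN is frozen throughout; earlier conv layers are frozen
# once trained, and each stage gets a fresh SPP + fully connected head.
convs = []
for stage in range(3):                         # three convolutional layers
    for layer in convs:                        # freeze what's already trained
        layer.requires_grad_(False)
    convs.append(make_conv_layer(stage))       # new layer for this stage
    head = make_spp_fc_head(n_langs=6)         # fresh SPP + FC head
    train_lid(convs, head, lid_training_data)  # updates only the new layer + head
```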

And that works. We can see, when we look at the results layer by layer, how the accuracy improves with the number of layers and with the size of the layers.

It's quite interesting to see. The DNN is pretty standard: 48 input features, 15 PLPs plus deltas and delta-deltas, plus pitch, with a context size of 21 frames, and a 1024-1024-50-1024-1024 topology with 3,020 senones at the output.

We'll look at the structure of the CNN in a little while.

It is worth mentioning at this point, because it's a limitation, that we create separate networks for the 30-second, 10-second and 3-second tasks. We would like to combine these, and we're trying to manage that, but for the moment they are separately trained.

The baselines are a bottleneck-feature GMM i-vector system and a bottleneck DNN i-vector system with LDA and WCCN, pretty much as we published previously.

So let's look at how this works, and just try to visualise some of these layers. What we have here is the post-pooling information feeding the fully connected layer.

Note that this diagram comes from the paper. What we've done is take these activations for test utterances and compare them for different languages, just visually. So what we've done is plot 35 randomly selected features from that stack, here for two languages. On the left it's Dari; on the right it's Farsi, which I'm told are very similar languages. The top and the bottom are different segments from the utterances.

So what we're looking at on the left is the intra-language difference, and what we're looking at between left and right is the inter-language difference: top versus bottom is intra, left versus right is inter. We would hope to see large variability between languages and small variability within languages, and that's what we get. It gives us visual evidence to think that these statistics might well be discriminative for languages.
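
One way to put a number on what the figure shows visually would be to compare mean distances within and across languages over those pooled vectors; a hypothetical sketch, not something from the paper:

```python
import numpy as np

def mean_pairwise_dist(vecs_a, vecs_b):
    """Mean Euclidean distance between two sets of pooled feature vectors."""
    return float(np.mean([np.linalg.norm(a - b) for a in vecs_a for b in vecs_b]))

# Hypothetical check: pooled vectors from different segments of the same
# language should sit much closer together than Dari-vs-Farsi vectors.
# intra = mean_pairwise_dist(dari_segs_a, dari_segs_b)
# inter = mean_pairwise_dist(dari_segs_a, farsi_segs)
# Discriminative features should give intra << inter.
```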

Moving along a bit further, what we're getting down here is frame-level information, and we like to call these LID senones; maybe that's not the best terminology, but let me explain how we got to that sort of conclusion. If we look at this information, and notice the scales, we can see the LID senones coming out of the system at the frame level, with context: one piece of speech here, another piece of speech there, a transition region between two parts of speech here, and a non-speech region just here. So what we tend to see when we visualise this is different LID senones activating and deactivating as we go through an utterance, or go between utterances.

And we believe there is language discrimination information in this. If you look at the scale of the y-axis, you can see that in the non-speech regions, around here, we get all sorts of things activating, but the amplitude of the activation is quite low. That gives evidence that, hopefully, we have something which is language-specific, at least.

We also do something called hybrid sample evaluation. We feed 30-second, 10-second and 3-second data into separate networks and train them independently, and while we don't do quite the same degree of augmentation as others, we do try to augment by cutting the 30-second speech into 10-second and 3-second regions. What we're doing there is trying to make up for the fact that the 3-second information is probably woefully inadequate in terms of statistics. We'll see how that works in terms of the performance of each.
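
A minimal sketch of that cutting-based augmentation, assuming 100 frames per second and illustrative names:

```python
def cut_segments(frames, seg_len_s, frames_per_s=100):
    """Cut a long utterance into non-overlapping shorter segments, e.g. a
    30 s utterance into ten 3 s pieces, to augment the short-duration
    training sets."""
    seg = seg_len_s * frames_per_s
    return [frames[i:i + seg] for i in range(0, len(frames) - seg + 1, seg)]

# e.g. train_3s.extend(cut_segments(utt_30s, seg_len_s=3))
```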

Unfortunately, we only have results here for NIST LRE 2009, and for that we use only the six most confusable languages. It's a subset, a much quicker subset to do analysis on and to run experiments on, and if you look at our papers over the last few years, we tend to publish with these six languages first and then extend later. That seems worthwhile.

It's about 150 hours of training data, Voice of America radio broadcasts and conversational telephone speech, and we split it up into the three different durations. We're looking at two baseline systems and our proposed network. No fusion yet; more on that later. Everybody wants to do fusion at the end.

So let's look at three of the ways that this structure can be adapted, because there are so many different parameters that we could change in here. The first thing we wanted to do was look at the size of the context at the output of the DNN layers. We're changing n, if you can make it out just here, lowercase n. So what we're doing is keeping the same bottleneck network but stacking more of its output frames.

We can see from the results for 30 seconds, 10 seconds and 3 seconds, in EER, that the bigger the context, in general, the better the results. Now bear in mind that we already have some context at the input here: that's also got context, 21 frames to be precise. So we're adding more context at this end, and we're seeing a benefit. It turns out that for the 10-second and 3-second tasks a context of 21, just here, tends to work better; for the 30-second task an even longer context works better, probably because the data is longer. I think the problem is that the 3-second and 10-second data tends to saturate: we just cannot physically get enough information out of that data, no matter how much context we introduce.
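
For concreteness, the context stacking whose size n is being varied here might look like this; an illustrative sketch, not the paper's code:

```python
import numpy as np

def stack_context(bn_feats, n=21):
    """Splice n consecutive bottleneck frames into overlapping context
    windows: (T, bn_dim) -> (T - n + 1, n, bn_dim). This n is the
    lowercase n being varied in the experiment above."""
    T = len(bn_feats)
    return np.stack([bn_feats[t:t + n] for t in range(T - n + 1)])
```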

Moving on a little bit further, we can also experiment with how deep and how wide the CNN is, and we do that down here with basically three different experiments. One of them is the LID-net with a single 1024 convolutional input layer, feeding into the spatial pyramid pooling and the fully connected system. We train that system up and we get about nine to sixteen percent on the three different duration scales. If we add another layer, so we have two convolutional layers, we bring that down by a reasonable amount for the 3 seconds, not quite so much for the 30 seconds; there we're looking at sizes of 128, 256 or 512 on the second layer of the CNN. For a third layer we checked 64 and 128, and we can see that basically, with increasing complexity, the results tend to improve, less so for the 30 seconds, more for the others.

For the hybrid sample evaluation, what we're actually doing is using the 30-second network to evaluate 30-second data, the 10-second network to evaluate 30-second and 10-second data, and the 3-second network to evaluate everything. Unsurprisingly, the 3-second network is better for the 3-second data, which is all you can use it on, and the 10-second network is better for the 10-second data. However, it's better to use the 10-second network for the 30-second data as well, so this suggests that perhaps these networks are each homing in on a different scale of information. So we fuse them together to get the results at the bottom, and we see a slight improvement there. But you will notice that we only improve on the baseline system for the 30-second result.
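
The fusion here is, in effect, a score-level combination of the three duration-specific networks; a minimal sketch with uniform weights, since the talk doesn't specify the exact scheme:

```python
import numpy as np

def fuse_scores(score_vectors, weights=None):
    """Weighted average of per-language score vectors from the 30 s, 10 s
    and 3 s networks; treat this as an illustrative sketch."""
    if weights is None:
        weights = [1.0 / len(score_vectors)] * len(score_vectors)
    return sum(w * np.asarray(s) for w, s in zip(weights, score_vectors))
```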

One more thing before we conclude: the i-vector system uses both zeroth- and first-order statistics, but this system effectively uses only zeroth-order statistics. So pretty much our future work will be looking at whether we can incorporate more statistics, and whether we can build a comprehensive network that uses all scales and handles all durations simultaneously. So that's it: a weird and wonderful DNN-CNN hybrid. Thank you.

We have time for questions.

Thanks very much, that was very interesting. A question about the structure of the network: as far as I understood, you did some incremental training, so once you've trained one part of the network and you then extend the network, the parameters of the first part stay fixed; you don't adapt them?

Yes, they stay fixed, so we fix that and we build on it. Again, this is what you get when you ask a student to try different things; I probably wouldn't have done this myself, but it tends to work quite well.

So when they're fixed, do you mean the network is trained and you just change the topmost layer? And do you then retrain the whole system?

No, we don't retrain the whole system: we keep everything up to the backend fixed, and we just train the last layer.

Thank you.

I think we have another question; we've got lots of time. You spoke a lot about the information flow through the neural network. If you've read some of Geoff Hinton's work on neural networks, he will tell you again and again that there is more information, in our case, in the speech than in the labels, so he's advocating the use of generative models rather than discriminative ones, whereas as far as I can see your approach is purely discriminative. I'd just like to hear any thoughts you have on that matter.

So actually it's interesting that you bring that up, because I was looking at some of the comments Hinton has been making recently, and he was talking about the benefits of having a two-stage process, where we have a front end which is very good at picking out the most useful data from a large-scale dataset, and then a backend which is very good at using it, and that these two tasks are complementary: seldom can we use one system that excels at both. He believes that both of them can be trained, and we seem to have done that, but we've done it the opposite way around to the way I would have imagined.

Okay, thank you very much.