Thank you for the kind introduction, and I would also like to thank the organizers for inviting me to give this presentation and to present my latest work, and for bringing us to such a different location. It was an amazing week: the weather was very good, and the social events kept us busy, so I even got some exercise as a good part of it.

It was really enjoyable to spend a week talking to people, meeting old friends and new colleagues, and exchanging ideas, and it also gave us a chance to see this winter version of the Basque Country. Hopefully we will come back to visit as tourists if we get the chance.

Today I will be presenting some of my latest work on using i-vectors, or rather a kind of i-vector, to model the hidden layers of a DNN and to see how the DNN is encoding information in those hidden layers. Usually, the way we use DNNs now is that we either look at the output of the DNN to make some decisions, or we take one of the hidden layers and use it as bottleneck features for some classifier. Unfortunately, not a lot of work has been proposed that looks at the whole path through the DNN. I believe there is information we are not exploiting: the pattern of activations, how the information is propagated through the DNN. That is what I am going to talk about today, and I will show some results.

This is the outline of my talk. I will start with an introduction and then move slowly toward my latest work. Before that I will briefly review i-vectors, which I probably do not need to do, because many of you know them, sometimes better than me.

As you know, i-vectors are based on the GMM, so the first part will be GMM-based: how i-vectors are used for GMM mean adaptation. I will show two case studies, speaker recognition and language recognition. I am not going to tell you how to build your language or speaker recognition system; I just want to show that with i-vectors we can do some visualization and see some very interesting behavior of the data: how the channels, the rooms, and the recording conditions can affect a speaker recognition system if you do not do any channel compensation, and, for language recognition, how close the languages are to each other in a purely data-driven visualization.

Then I will describe how we can use discrete i-vectors to model the GMM weight adaptation. This is work that started with one of Hugo Van hamme's students, Hasan, during his PhD in Belgium; he visited me at MIT for six months, and we started working on this GMM weight adaptation for language ID. Then, as DNNs were progressing and taking over the field, I started thinking that maybe these discrete i-vectors could also be used to model the posterior distributions of the DNNs.

That is the second part of the talk: I started looking at how the DNN represents information in its hidden layers, because a lot of the buzz in the vision community was about showing, for example, that a neuron in a deep model had actually learned a cat's face from YouTube videos. So, can we do something similar for speech? That is how I started thinking about using an i-vector representation to model the hidden layers. I will show how, for example, the accuracy improves as we go deeper in the DNN for a language ID task, and also how we can model the path of activations, the progression of the information through the DNN.

If you feel that one hour is too much for you to sit in your chair, you can skip the first part, the GMM part; the second part may be more interesting for you. I will not be offended if you want to step out. At the end I will finish by giving some conclusions of the work.

As you know, i-vectors have been widely used. They are a nice, compact representation that summarizes and describes what is happening in a given recording, and they have been used for many different tasks: speaker recognition, language recognition, speaker diarization, even speech recognition. Originally, i-vectors were related to the GMM adaptation of the means. As I just said, lately I have also been interested in GMM weight adaptation using i-vectors, and after that we moved on to use these models for DNN-based i-vectors, for modeling the DNN activations. So let me slowly take you toward my latest work.

In speech processing, what you usually have is a recording, and you transform it to get some features. Then, depending on the complexity of the feature distribution, you build a GMM on top of these features and train it to maximize the likelihood of the data. A GMM is defined by its Gaussian components, and each component is described by a weight, a mean, and a covariance matrix.

Let me put i-vectors in the context of speaker recognition, the way we were doing it in the early two thousands, which is how it all started. We take a lot of non-target speakers and train a large Gaussian mixture model, the universal background model. Then, because we sometimes do not have many recordings from the target speaker, we do MAP adaptation: we move this universal background model, which is a kind of prior describing how all the sounds look, in the direction of the target speaker's speech. People found that adapting only the means is enough, so the shift of the means from this universal background model, the large GMM trained on a lot of data, toward the target speaker can be seen as a characterization of something that happened in the recording and caused that shift.

A lot of people then started to model this shift. For example, Patrick Kenny, with joint factor analysis, tried to split the GMM supervector shift into a speaker part and a channel part, and Bill Campbell, with SVMs, used the GMM supervector as input to an SVM to model the separation between speakers.

It is in this spirit that i-vectors came out as well. In the i-vector approach you have the GMM supervector space, and the UBM is one point in that space; for each new recording we shift from the UBM toward that recording. If you have several recordings, the i-vector subspace is trained to model all the variability between these recordings in a low-dimensional space, anchored at the UBM. Every new recording can then be mapped into this space, so each recording is represented by a vector of fixed length. This can be modeled by this equation: the GMM supervector of each recording can be explained by the UBM supervector plus an offset, where the offset describes what happened in that recording and is given by the i-vector in the total variability space.
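
As a minimal sketch of the equation on the slide, in the usual total-variability notation:

```latex
M = m + T\,w
```

where $M$ is the GMM mean supervector of the recording, $m$ is the UBM mean supervector, $T$ is the low-rank total-variability matrix, and $w$ is the i-vector, the posterior estimate of a standard-normal latent variable.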

Once you have trained this model, whenever you get a new recording or utterance you extract the features and then map them into your subspace; I am sure you are all familiar with that. Again, I am not going to tell you how to do speaker recognition; you have seen a lot of good talks on that during this wonderful conference. Instead, I will show you how we can do visualization with it.

For speaker modeling, i-vectors have been applied to different kinds of tasks: speaker identification, where you have a set of speakers and, given a recording, you want to identify who spoke in that segment; speaker verification, where you want to verify that two recordings come from the same speaker; and diarization, where you want to know who spoke when.

For the speaker recognition task, I would like to show some visualizations that explain what is happening in the data if you do not do any channel compensation. For this I would like to acknowledge the work of a PhD student who was working with Bill Campbell and me at MIT at the time.

We took our NIST 2010 speaker recognition evaluation system, which was based on i-vectors; at the time, the system we built was a single system trained to deal with the telephone and the microphone data in the same subspace. We took about five thousand recordings from that data and computed the cosine similarity between all the recordings, which gives a similarity matrix. From it he built a ten-nearest-neighbor graph, so each node is connected to its ten nearest neighbors, and then used the graph-visualization software called GUESS.
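
A minimal sketch of that preprocessing step, assuming the i-vectors are stored as rows of a NumPy array (the function and variable names are mine, not from the talk):

```python
import numpy as np

def knn_edges(ivectors, k=10):
    """Cosine-similarity matrix and k-nearest-neighbor edge list for graph visualization."""
    X = ivectors / np.linalg.norm(ivectors, axis=1, keepdims=True)  # length-normalize rows
    S = X @ X.T                                                     # cosine similarity matrix
    np.fill_diagonal(S, -np.inf)                                    # ignore self-similarity
    edges = []
    for i, row in enumerate(S):
        for j in np.argsort(row)[-k:]:                              # k most similar recordings
            edges.append((i, int(j), float(row[j])))
    return S, edges
```

Each edge (recording, neighbor, similarity) can then be exported to a graph tool such as GUESS for the layout.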

In this kind of graph, the absolute location of a node is not important; what matters is the relative distance between the nodes and the clusters, because that reflects how close they are and how your data is structured.

Here is the female part of the data, with inter-session (channel) compensation applied. The colors correspond to speakers: each point is a recording, and the clusters correspond to the speakers. For the people who went to the museum earlier this week, this may look a bit like a piece of modern art.

Then we said, okay, now let us remove the channel compensation and see what happens. Well, we lost the speaker clustering, and some other clusters appeared instead. We asked ourselves what was going on, so we went together and looked at the labels. For example, we checked which microphone had been used to record each of the recordings, and we found that the clusters actually corresponded to the microphones that had been used. That was pretty surprising. For example, the telephone data forms one island here, and these are the microphone data. We also found that, for the same microphone, there were actually two clusters, and that was because of the room: the LDC at that time used two rooms to collect the data, so the two rooms were also reflected in the data.

This is a very simple visualization to show that. I do not want to give you numbers, an error rate going from two to one point five or whatever; I just want to tell you that if you do not do anything about the microphone and channel compensation, it may be a big issue. The data can be affected by the microphone, by the channel, and also by the room in which it was recorded.

This is what happens when we apply the channel compensation: the clustering is by speaker again, and here the coloring is by channel, so you can see that the channel compensation is doing a good job of normalizing this. Unfortunately we still see male and female as two different clusters, but at the time that was fine. The male data shows the same behavior. And this one is the microphone data, which is the most interesting: you can still see a split between room one and room two that the LDC used to collect the data.

This kind of visualization has been very helpful for us to understand, and to show people, that what we are doing makes sense, and that there is still work to do on microphone and channel compensation.

After that, around 2011, when we were doing language ID, I started looking at the language recognition task and tried to do the same kind of visualization. In language recognition you have verification and identification flavors, but there is no need to spend much time on that. What I did is take the NIST 2009 data; I had an i-vector extractor trained on the training data, and I took about two hundred recordings for each language, for I think twenty-three languages. Then I did the same thing: I built the cosine similarity matrix, built the nearest-neighbor graph, and tried to visualize it.

This is what happened for the language recognition task. For example, American English and Indian English are close together; Indian English, Hindi, and Urdu are very close together; Mandarin, Cantonese, and Korean are almost in the same cluster; Russian, Ukrainian, and the related Slavic languages fall in the same cluster; and also French and Creole. So it is really a data-driven visualization that shows you how close the languages are in the acoustics, the features we are using to build the i-vector representation.

This is the kind of thing i-vectors allow us to do: because you have a simple cosine distance between them, and you can apply LDA to them as well, you can use i-vectors to represent the data, see what is happening in it, and interpret what phenomena are going on. They are a good tool for that.

Now let me move on, because I know you are all familiar with i-vectors and I do not want to spend too much more time on them; you probably prefer to get to the more interesting topic of this talk. After that work I started looking at GMM weight adaptation, as I said, with Hugo Van hamme's student Hasan. There are actually several techniques that have been applied to GMM weight adaptation: maximum likelihood, the simplest way; non-negative matrix factorization, which Hugo Van hamme had been working on; the subspace multinomial model, which is what the BUT people use; and what we proposed, called non-negative factor analysis. GMM weight adaptation is a little bit tricky because you have the non-negativity of the weights, and they should also sum to one, so these are constraints you have to deal with during the optimization when you are training your subspace.

The way it works: for a given recording you have a set of feature frames and you have a UBM. You compute the posterior distribution of each Gaussian component for each frame given the UBM, and then you accumulate these posteriors over time to get counts. To do the GMM weight adaptation you try to maximize the objective function given here. If you want the maximum-likelihood solution, you simply accumulate these posteriors over time and divide by the number of frames.
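
As a quick sketch of that maximum-likelihood step, in standard GMM notation:

```latex
\gamma_t(c) = \frac{\omega_c\,\mathcal{N}(x_t;\mu_c,\Sigma_c)}
                   {\sum_{k=1}^{C}\omega_k\,\mathcal{N}(x_t;\mu_k,\Sigma_k)},
\qquad
\hat{\omega}_c = \frac{1}{T}\sum_{t=1}^{T}\gamma_t(c)
```

where $\gamma_t(c)$ is the posterior (count) of component $c$ for frame $x_t$ computed with the UBM parameters, and $\hat{\omega}_c$ is the maximum-likelihood weight estimate for the recording.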

Or you can, for example, do non-negative matrix factorization, which consists of splitting the weight adaptation into a product of small non-negative matrices, a set of basis vectors and their coefficients, again maximizing the objective function given here. The input is the counts, and you estimate both the subspace (the basis) and the low-dimensional representation of each recording in that subspace, which characterizes the weight adaptation.

That is non-negative matrix factorization; Hugo Van hamme's student has a paper that describes it. What BUT implemented is the subspace multinomial model: you have a multinomial distribution, and a subspace that describes the i-vector-like representation in the weight space, again a UBM part plus a shift, with a parameterization that makes sure the weights obtained are normalized to sum to one.
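
A minimal sketch of that parameterization, in the usual subspace multinomial model notation:

```latex
\omega_c(r) = \frac{\exp\!\big(b_c + t_c\, r\big)}
                   {\sum_{k=1}^{C}\exp\!\big(b_k + t_k\, r\big)}
```

where $b_c$ is the UBM log-weight term for component $c$, $t_c$ is the corresponding row of the subspace matrix, and $r$ is the low-dimensional latent vector for the recording; the softmax guarantees non-negative weights that sum to one.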

The good part of this model is that it is very good when you have non-linear data to fit. Here is an example; I would especially like to thank the colleagues who gave me these slides and this picture. In this example you have a GMM of two Gaussians, and each point corresponds to the weights of one recording, estimated by maximum likelihood. We tried to simulate what happens when you have a large GMM, where there is some sparsity and not all the Gaussians appear, which is why the points pile up in the corners; this is just a simulation of what happens with a large UBM. You can see how the data looks in this case, and the subspace multinomial model is very good at fitting that data. But it has a drawback: it can overfit, which is why the BUT guys use regularization to keep it from overfitting too much.

Hasan's work at the time was trying to do the same thing as an i-vector: you have the UBM weights, and the weights of a new recording are modeled as the UBM weights plus an offset. The constraints are that the weights should sum to one and should be non-negative. We developed an EM-like approach to maximize the likelihood of the objective function: in one step you compute the latent vectors for all recordings given the subspace, and in the other step you update the subspace, and you iterate until convergence. We maximize the likelihood of the data subject to the constraints that the weights sum to one and are non-negative, and projected gradient ascent can be used to do that. If you go to the references you can find all the details; I do not want to spend too much of this talk on them.
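
As a sketch of the model just described, in the usual non-negative factor analysis notation:

```latex
\omega(r) = b + L\,r,
\qquad \text{subject to}\quad
\omega_c(r) \ge 0,\;\; \sum_{c=1}^{C}\omega_c(r) = 1
```

where $b$ holds the UBM weights, $L$ is the low-rank weight subspace, and $r$ is the latent vector for the recording; $L$ and $r$ are estimated alternately, in an EM-like loop, with projected gradient steps that enforce the two constraints.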

The difference between the non-negative factor analysis and the SMM is shown in this comparison: the NFA tends not to overfit, because its approximation of the data does not touch the corners, compared to the SMM. Sometimes that is good and sometimes bad, depending on which application you are targeting, but we compared them for several applications and in practice the SMM and the non-negative factor analysis behave almost the same.

These discrete i-vectors have been applied to several tasks: for example, modeling of prosody, which is what Marcel did for his PhD; phonotactics, where you model the n-gram counts with the same kind of method; and what we did, GMM weight adaptation for language recognition and dialect recognition, with Hasan's work. In that paper we compared the different techniques, the SMM as well as the non-negative factor analysis, and you can go and check it: they behave almost the same for GMM weight adaptation.

Now, to get to the fun part: how we can use these discrete i-vectors to model the DNN activations. At the time I was motivated by this picture. I was watching a talk, I think by one of the Google people, about unsupervised training, and he was showing that if you train a deep belief network or auto-encoder in an unsupervised way on millions of unlabeled YouTube frames, and then you look at one neuron at the top, you can actually reconstruct a picture from it, and he was saying, look, you can see the cat face. And I thought, okay, can we do something like that for speech? Sure, speech is a continuous time series, but the question stuck with me: can we actually see how the data travels through the DNN hidden layers? That is exactly what motivated me to start this work.

Remember that before, I said we have a recording and we transform it into a set of features; then we give these features to a GMM. Now let us just remove the GMM and give them to a DNN. For example, we can do language recognition directly at the frame level, which is what the Google paper from 2014 did: the input is a stack of frames and the output is the language, and I will show a similar experiment later. Note that when you have a new recording and you want to make a decision, you make a frame-by-frame decision and then average the output posteriors over the recording and take the maximum; that is largely what we compare to. You can also take, for example, a senone DNN and ask how the data is represented for that task.

the before as a set earlier is we to get the n and we take

the output to make a decision

you know like or alignment for example for ubm i-vectors

or we take one hidden layer

and are used to it as a bottleneck features

but whenever and since we only see one level of what we've got the and

only one

one hidden layer or the output we don't see how the d n actually provide

get the information over

all his on fire the end on part of the nn and the reason for

example imagine you have a sparsity coding for each

for example for each hidden layers

and use a for each input only fifty percent of your

of your the foregone or inactive for example but for example drop out

so the way that the data we colour information for example for class one the

one and you will call it here and the one he would call you can

be different

because some randomness the way he would provocative what when coded information so if you

can model you get more that of the battles activation of how the class went

to the nn

and this is an information that's available there but we're not using it

and that's exactly what actually motivate me for doing for doing this work

So, can we look at the whole DNN and see how information progresses through it? What I am going to show is one way to do it; maybe it is not the best way, but it is one way. The idea is this: since we have these discrete i-vectors that are based on counts and posteriors, can I use them to build an i-vector for each hidden layer? For a given DNN, we build an i-vector representation for hidden layer one, for hidden layer two, and so on up to the last hidden layer. To do that I need to have some counts, so that I can apply my GMM weight adaptation techniques. So the question is how to obtain the counts.

when you get a combined counts

for example you can compute the posterior fortyish norm activation foster for each normal then

if we use you don't layer for each input your normalized to sum to one

artificially a common either because the you know was not trying to do that

and then you accumulated over time i became that became counts because here you should

allow us to sum to one

and you can you can use the same gmm to gain you don't change anything

to them

so the second one gonna post softmax for example

similar thing but you ample softmax we generalize to map and sum to one

and the accumulated you can also trained with softmax as well

but what is the most important one which the most understanding of all this ad

hoc

situation

and it compute the probability activation operational wrong and its complement one minus one

so you can consider this to normalize the one gmm of to work

so now we don't you only model that you can use the d n and

have the rest of the response so we don't normalise anything

so here so for example here for example if you have one thousand four neurons

you will have double their doubled that and you would have

thousand of

genments what to bush and you use the subspace model tool to do that what

the constraint that we used to normalize and his company wayne one is complementary sum

to one and in this case you don't do anything go wrong because you're modeling

the same behavior of the nn
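
A minimal sketch of those three ways of turning hidden-layer activations into counts (assuming sigmoid activations in [0, 1]; the array shapes and names are my own):

```python
import numpy as np

def layer_counts(activations, mode="bernoulli"):
    """Accumulate per-recording counts from one hidden layer.

    activations: array of shape (num_frames, num_neurons) with sigmoid outputs in [0, 1].
    Returns the accumulated count vector for the recording.
    """
    if mode == "normalize":                       # option 1: force each frame to sum to 1
        p = activations / activations.sum(axis=1, keepdims=True)
    elif mode == "softmax":                       # option 2: softmax over the neurons
        e = np.exp(activations - activations.max(axis=1, keepdims=True))
        p = e / e.sum(axis=1, keepdims=True)
    elif mode == "bernoulli":                     # option 3: each neuron with its complement
        p = np.concatenate([activations, 1.0 - activations], axis=1)  # each pair sums to 1
    else:
        raise ValueError(mode)
    return p.sum(axis=0)                          # accumulate over frames -> counts
```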

We compared a few of these variants, but I am not going to go into too much detail or throw too many numbers at you; they behave roughly the same. In the experiments, for the first application, dialect recognition, I used the non-negative factor analysis, and for the NIST data I used the subspace multinomial model, not because one is better, but because I wanted to show that both work; there is no real distinction between them.

For the non-negative factor analysis, the weights of a new recording are modeled as the UBM plus an offset, so how do I compute the UBM weights? I take all the training data, extract the counts for each recording, normalize them, and take the average: that is my UBM. So the UBM response for a hidden layer is simply the average response of the neurons of that layer over all the training recordings.

so now

though that resting by is an eigen factor as a scan all support other approaches

can help you also to model all the hidden layers as well one way to

do it for example you can build hit and i-vectors for each subspace then you

can compensate the i-vectors of them

and you would have

or you could have one

that actually model everything with the constraint that uses hidden layers of some to well

and this will allow you to see how

you know how the correlation is happening between all the activation of your hidden layers

and that's exactly what we did

so

in order to do that we extended for example accented to d non-negative factor analysis

so you have a different ubm each one corresponding to issue the layers and it

would have a common

i-vector that control all of all the output for each dollar data sorry you have

a common

i-vectors for all the weights for all data it hidden layers
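
As a sketch of that extension, following the per-layer description above, with $l$ indexing hidden layers:

```latex
\omega^{(l)}(r) = b^{(l)} + L^{(l)} r,
\qquad
\omega^{(l)}_c(r) \ge 0,\;\;
\sum_{c} \omega^{(l)}_c(r) = 1 \quad \text{for every layer } l
```

where each layer has its own UBM weights $b^{(l)}$ and subspace $L^{(l)}$, but the latent vector $r$ is shared across layers, so a single i-vector captures the correlated behavior of all the hidden layers.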

Now let me give some experiments and show some results. The first experiment I would like to show is dialect ID. We have a small setup with five dialects; I do not remember exactly how many recordings we have, but it is about forty hours for training and around ten to fifteen hours, roughly an hour or so per dialect, with separate sets for training, development, and evaluation.

We trained a DNN for this five-class problem with five hidden layers; the first one is a bit larger, around two thousand units, and the remaining hidden layers have five hundred units each. The input is a stack of about twenty-one feature frames, and the output is the five dialect classes, the same recipe as in the Google paper. When we get the i-vectors we use cosine scoring with LDA, as was described earlier today, and the best subspace dimension we found for this task is almost full rank, around one thousand five hundred.
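
A minimal sketch of that scoring backend, LDA followed by length normalization and cosine scoring (scikit-learn is my choice here, not necessarily what was used in the talk):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def train_backend(train_ivectors, train_labels):
    """Fit LDA on the training i-vectors and build length-normalized class models."""
    labels = np.asarray(train_labels)
    lda = LinearDiscriminantAnalysis().fit(train_ivectors, labels)
    proj = lda.transform(train_ivectors)
    proj /= np.linalg.norm(proj, axis=1, keepdims=True)       # length normalization
    classes = np.unique(labels)
    models = np.stack([proj[labels == c].mean(axis=0) for c in classes])
    models /= np.linalg.norm(models, axis=1, keepdims=True)
    return lda, classes, models

def cosine_scores(lda, models, test_ivectors):
    """Cosine scores between length-normalized test i-vectors and the class models."""
    proj = lda.transform(test_ivectors)
    proj /= np.linalg.norm(proj, axis=1, keepdims=True)
    return proj @ models.T                                     # shape: (num_test, num_classes)
```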

The first result shown is the baseline i-vector result, and the i-vectors are actually worse than the DNN with output averaging, which means that for each frame you compute the posteriors for the five classes, average them over the recording, and take the maximum, exactly what the Google paper describes. The DNN is better here because a characteristic of this data is that the recordings are very short, around thirty seconds and sometimes less, and we know that if you use a DNN and average the frame scores it does well on short durations; you have already seen talks on Wednesday afternoon showing that. These numbers are error rates, sorry, so lower is better.

Now I will show what happens when we build the i-vectors on the hidden layers, starting from layer one up to layer five, and how the results evolve: the deeper you go, the better, which matches what we know from other fields like vision, where each layer adds another level of processing; we were able to observe the same thing here. You can see that from layer one to layer four the error rate keeps going down. I kept five layers because I want to show that sometimes there is no need to go too deep: layer five is already saturated and does not add anything, but I kept it on purpose to show that we sometimes try to make networks really deep when it is not necessary. This is one example where you really do not need to do it.

The point is that we were able to measure the accuracy of each hidden layer, and we were also able to show that the deeper you go in the network, the better the results are. So you will probably get more information, and maybe a better representation, by modeling all the hidden layers.

Here is an LDA visualization: with five classes, LDA projects the i-vectors into a low-dimensional space. I remember the first time I presented this work, people looked at the slide and said, well, you probably did not apply LDA, and I said, that is true, I forgot to do it. This time I did not forget.

What I did then was take the raw i-vectors, for example for the last hidden layer, and run t-SNE on them, so here it is just the raw i-vectors visualized with t-SNE, no LDA. You can see that the origin is around here and the scatter spreads out radially, which is a sign that length normalization will be useful again. And indeed, when you do the length normalization it behaves the same way as in the speaker recognition area: length normalization is also useful here. I am not sure I am happy about that; I was honestly hoping to see a different behavior, but it behaves the same way. So this is t-SNE on the raw i-vectors.
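
A minimal sketch of that visualization step (scikit-learn's t-SNE is my assumption here, not necessarily the implementation used in the talk):

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_embedding(ivectors, length_norm=True):
    """2-D t-SNE embedding of (optionally length-normalized) raw i-vectors for plotting."""
    X = np.asarray(ivectors, dtype=float)
    if length_norm:
        X = X / np.linalg.norm(X, axis=1, keepdims=True)   # length normalization
    return TSNE(n_components=2, init="pca").fit_transform(X)
```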

The reason I looked at this is that the DNN is discriminatively trained for the task, so how the data is really represented in each layer is an important thing to check, and this is one thing we tracked.

To summarize what I have shown so far: the baseline i-vector result; the DNN with frame score averaging, which is better than the i-vectors; and then modeling the hidden layers, which is better still. From all my experiments, what I have been seeing is that the last layer, the output layer, is the worst one in terms of information, so do not take your decision there; the real information is in the hidden layers, there is no doubt about that. Here I give the last hidden layer result, and then what happens if you model everything: you get another gain, about two percent, by modeling all the hidden layers, and the same thing happens with the NIST data.

My point here is that it is true that the deeper the hidden layer, the better, but if you also look at the correlations happening across all the hidden layers, it is better still. By analogy, even people who do brain imaging look at the activations, with fMRI or something similar; they can use it at one level, but it is hard to see how the activity propagates, and maybe someone can correct me if I am wrong about that. With this approach we can do both for the DNN: we can tap into one hidden layer, or we can see what is happening across the whole DNN, the path of activation, and then make a decision. So here I am just saying that the DNN has more information than we are currently using, because we are not looking at the path of activations it took to encode its data.

This was dialect ID, which you are probably not familiar with, so I will move on to the NIST data. But before that, I did another experiment, because the classical i-vector was completely unsupervised and I was thinking: the DNN I used here is discriminatively trained for this specific task; can I have a DNN that is just used to encode the data, an encoder? The simplest way to try it, I thought, was greedy layer-wise RBM training, just to see what happens; I am sure people have more sophisticated networks for that. So I trained RBMs with the same architecture as before, on the same data, with speech frames as input, used them as the dimensionality-reduction subspace, and used cosine distance; we stacked five RBM layers.

Here are the results, with the i-vector baseline and the DNN output baseline. I am having some struggle, because I cannot do better than the first layer of this RBM encoder. The first layer gives me the best result; it is not as good as the discriminatively trained subspace, the DNN-based i-vectors, but it is not that bad. That is what I have been seeing: with the unsupervised stack, the first layer you train is actually the best one, and the deeper you go, the worse it gets. My hypothesis, and I am not sure it is true, is that this is because the layers are not jointly trained; if all the layers were jointly trained to maximize the likelihood of the data, it might be a different story. That is what we are trying to investigate now with my students: can we train, for example, a variational autoencoder to maximize the likelihood of the data and see whether all these representations become meaningful? This is one thing we are trying to explore.

Now, for people who are more familiar with the NIST data: as you saw in the Wednesday afternoon session, people have been modeling six languages, and I tried to do the same thing. We selected, with the help of a colleague who gave me this subset of the data, six languages including Korean, Mandarin, Russian, and Vietnamese. The difference between us and the other people doing this is that some of them restricted the data to remove the train/test mismatch; I wanted to know what is going on, so for us everything was put together, and it seems we did not have that issue. That is the difference between our paper and some of the other papers in that Wednesday afternoon session.

So we put everything together and trained a DNN that takes the frames as input, with the six classes as output. To be precise, I trained five hidden layers of about a thousand units each; the input is a stack of twenty-one frames, that is, ten frames of context on each side; the output is the six classes; and I use the same LDA plus cosine scoring as before.

Here are the results on the subset of the 2009 data for the six languages. There are the i-vector results for the thirty-second, ten-second, and three-second conditions, and the average-of-scores DNN, which is what everyone is doing, the direct approach. The characteristic of this setup, as has been said before, is that the DNN score averaging only beats the i-vectors on the three-second condition; on the thirty-second and ten-second conditions it does not.

But what happens when you model the hidden layers is a little bit different. Again, the deeper you go in the DNN, the better; that part is the same story. The interesting thing is that, for the thirty-second condition, nobody had been able to beat the i-vector baseline, but if you model the hidden layers, for example hidden layer five, you obtain the best result everywhere: for thirty seconds, for ten seconds, and even for three seconds. And this is interesting too: hidden layer five is the one just preceding the output, which again suggests that the output layer is the one you really do not need to look at, based on my experience. Here again you can see that the last hidden layers are actually much better than the i-vectors as well as the DNN output averaging. So the hidden-layer i-vector representation in this case seems to do an interesting job of aggregating and pooling the frame data into a single representation of the recording that you can do classification with. This is an interesting finding, and it was actually surprising to see what is in the data.

Now, what happens when you model everything, all the hidden layers together? Here I show the i-vector representation, the DNN average score, the last hidden layer (layer five), and then what happens when you model all the hidden layers: you gain again, almost another 0.8, and sorry, I forgot that the average scores are also in there. For thirty seconds the error is already low, so I do not take that improvement too seriously, we win only a little bit there, but for ten seconds we were able to win, and for three seconds we were also able to win. So it is the same behavior: all the hidden layers together carry better information than a single layer at a time, the last hidden layer is better than the first one, and the last output layer is not that interesting for making decisions.

One honest explanation is that these DNNs tend to overfit, which they certainly do. But even when they overfit like that, if you use them to build a representation, to discretize your space, they still work fine; using them to make decisions directly when they are overfitting is a different story. That is one thing to note here. This is what I have been finding this last year, trying to use these models to understand what is going on.

So let me try to conclude; we have about five minutes, and I still have a few things I want to say.

The i-vector representation is an elegant way to represent speech of different lengths; a lot of people already use it, and it deals with the variable duration of the recordings, where you have long segments and short segments. The GMM mean adaptation and GMM weight adaptation subspaces can also be applied, as you have seen in this talk, to model the DNN activations in the hidden layers, and they do a good job there.

The take-home message, the thing I want to focus on, is that the information in the DNN is not only in the output; it is also in the hidden layers, so do not try to make a decision directly from the output alone. Also, looking at one layer at a time and not seeing what is going on across all the hidden layers may be a mistake; it is worth looking at all of them, because that tells you how the information travels through the DNN and how each class gets modeled. That seems to be very useful.

The subspace approaches I have been trying are one way to do this, especially for modeling all the hidden layers; they seem to do a good job of pooling and aggregating all the frames and giving you a single representation with the maximum information you can use for your classification task. This works well even though the DNN was trained frame by frame and we use it to make a sequence-level classification; the i-vector representation seems to be doing a really good job of that.

Let me take two more minutes, since we still have time. For future work, the tracks that my students and colleagues and I have been exploring are the following. As I said earlier, the DNNs we have been using are frame-based, with a context of twenty-one frames or something like that, and that is not ideal, so we are trying to shift to networks with memory, for example TDNNs, or LSTMs, which are a special case of recurrent networks; that is what Rubén, my intern, is doing. Instead of frame-by-frame modeling, we are trying to capture more of the dynamics of speech. We are also exploring these representations for speaker recognition, to make them more useful for speaker; we are still working on that as well.

And as I said earlier, I would be very interested to talk to people after this: maybe there is a better way to do autoencoders, to really encode the speech data. My hope is that at some point we will be able to get a speech coder DNN: you just encode the speech, and after that I use it to discretize my space and use it for a task. For example, I give you a bunch of thousands of recordings, you encode your data, and after that you say, I want to use it for speaker, I want to use it for language; can I do that from the same model that just encodes speech? If anyone has any ideas, please come talk to me.

Also, to make the activations themselves more interesting, I am interested in exploring the sparsity of the activations in each layer. I am not doing anything special in the DNN training right now, but one thing I am trying, and we did not have time to compare the results yet, is dropout: for each input, only fifty percent of the neurons in each hidden layer are active. That introduces some randomness between the hidden layers, and I find that two consecutive hidden layers of a DNN are sometimes redundant because they are so close together, whereas pushing them apart, making them complementary, can actually give a better separation. So if you enforce sparse activations, for example with dropout, the simplest way to do it, you make the layers more complementary, because some randomness happens in the middle and forces the DNN to take a different path through each hidden layer. That is something I am really interested in: making the information between two consecutive hidden layers more powerful, more complementary rather than redundant.

There is also the option of alternating activation functions, say sigmoid, then rectified linear, then sigmoid: putting something different in between two consecutive sigmoid layers changes things a little bit, so the behavior of the consecutive sigmoid layers changes. When you then model all the layers, there is hopefully a way to get more information into the subspace, and the way the DNN encodes the information can be more useful for classification.

To conclude: I am helping organize the 2016 workshop in Puerto Rico, so hopefully you will submit your papers there. Please come and see us, and if you come to the workshop you can also stay for the rest of the week, enjoy the beach and the cocktails, and recharge your neurons so they converge to the right objective function. And that is it, thank you.

Thank you. I have a comment, just about a point that is not the main point of your talk: it is about the visualization, in particular the t-SNE, the stochastic neighbor embedding. This kind of dimensionality-reduction technique is useful and satisfying for thinking about and understanding the distributions, but we have noticed that when presenting high-dimensional data with these techniques, in particular speaker classes that are distributed along particular directions, t-SNE does not respect the initial distribution. It separates the speaker classes, but it does not respect the original directions of the speaker classes. So it is useful for seeing the separation between the classes of speakers, but not for getting a view of the original distribution. I think it is a very good tool, but one has to be careful not to over-interpret it or to propose it as something it is not.

So you are saying that here I just want to show how the data is structured, but I should not take into account how the distribution is modeled by the t-SNE; that is what you are saying, yes.

I did not write down all the numbers, but I saw you had results for the dialect ID task, for the five Arabic dialects, and the numbers I wrote down here were, I think, twelve point two percent for the fourth layer and twelve point five percent for the next configuration; I apologize if I missed a slide. My question is this: as you move forward you are actually getting improvements, but in dialect ID the differences between dialects are a lot more subtle, and a lot of times it is interesting to figure out what exactly is differentiating the dialects. So I am wondering whether you went back and looked at the individual test files behind each improvement. My assumption would be that you are getting a few more files accepted correctly, but you are also likely to get a few more files rejected incorrectly, and it would be nice to see what the balance is: are you gaining more than you lose, or are you not losing anything and only gaining? That is what I would like to see as you move down here: is it a purely positive movement forward, or are some files falling backwards while the net gain stays positive?

No, I agree with that, and I did not do it. At the time I was actually more interested in seeing what happens between the hidden layers: I was hoping to look at the recordings, maybe with a linguist, and try to understand why a recording like this is classified correctly at hidden layer five but not at layer four, three, or two, and what makes it change. I want to know which information in the fifth layer made this one better than another one. But you are right, we did think about that at the end.

Not so much a question; I just want to thank you very much for proposing a new solution to a very hard problem, and I would like to put the difficulty of the problem into context, because we have been banging our heads against the same kind of difficulty. To summarize the problem: it is to get a low-dimensional representation of the information in a sequence. You have lots of speech frames, and you want to distill the information in all the speech frames into a single smallish vector. The reason this is difficult: look at the classical i-vector. You can write down the generative model for the i-vectors in one equation, and it is very easy for most of us to look at it and immediately understand it; that is the generative route. But what you are doing is the inference route, from the data back to the hidden information: you have to share the information from all the frames and accumulate it back into the single vector. If you look at the i-vector solution, the formula for calculating the i-vector posterior is a lot more complex than the generative formula, and it took some effort to derive it; I believe it is similarly difficult for a neural network to learn that. You mentioned variational Bayes autoencoders; we have been looking at those quite a lot, and in the papers published thus far there is always a one-to-one relationship between the hidden variable and the observation, and everything is i.i.d., so the machine learning people have so far been solving a much easier problem. Accumulating all that information is a harder problem, and it is also computationally heavy; if you think of the i-vector posterior, lots of papers have been published on how to make it computationally lighter. So that is why your solution is quite exciting to us.

Something else: one of the machine learning guys asked me, okay, you have a DNN and you have your i-vector representation on top of it; can you propagate the errors from the i-vector space back into the DNN to make it more powerful for your specific task with this i-vector representation? That is an interesting PhD topic: a way to combine the subspace and the DNN, like what people do in ASR with sequence training. Can we do a similar thing, taking the error coming from the i-vector space and backpropagating it down into the DNN? That may be interesting as well; that is a question I got from the machine learning community.

Nice presentation, Najim. I had a couple of questions; one was this: when people moved from GMM-based i-vectors to DNN-based i-vectors using senones as classes, as I understood it, the improvement came from the fact that the senones quantize the space much better than the GMM does, whether it is phones as classes or languages as classes. The autoencoder that you are proposing to use has no information about any classes, so what is your intuition behind why something like that would work better than using senones or languages as classes?

That is actually a good question. My intuition, and it is just my feeling, is that in speech processing we tend to throw information away from the signal too early. For example, if I work frame by frame with the language as the class, the DNN is normalizing the speakers and doing all of that for me. I am hoping not to do that, but to keep as much information as I can. For example, I give you four, six, or ten thousand hours of speech without giving you any labels; you can still train on that speech in a completely unsupervised way, and that could be helpful because you have thousands, even hundreds of thousands, of hours of speech. Maybe in industry it is different and they have more labeled data than us, but can we do that? That is what I hope, and it is the same point that has been made about unsupervised data: can you use it in your training? I am hoping to have a kind of speech coder, a model of speech where what goes in can be reproduced on the other side, so the information is preserved; the question is just how to use it. That is exactly my feeling, and I am not saying it would beat discriminative training or something like that. I am just saying that if I had a speech coder like that, something like a vocoder-style model that can produce the speech again, then the information is there; we just need to extract it. I do not know if that was clear.