We also pool the information for each speaker embedding using attention, so that the speaker information can be encoded robustly.


So to review the contributions: first, we show that the model, specifically a 121-layer DenseNet, produces more discriminative speaker embeddings than the x-vector model, while using fewer parameters at the same time.

Secondly, we show that the proposed mixture representation pooling improves the speaker embeddings on a noisy data set.

Okay, so let's start with the baseline. The network takes a speech feature, an MFCC or a filter bank, and then it passes through several layers of convolution.

Because speech is variable-length in nature, we need to convert it into a single vector, and we do it as follows: a statistics pooling layer. Specifically, over the duration we compute the mean and the standard deviation, and these go through fully-connected layers, and the network ends with a softmax layer.


The statistics pooling computes the mean and the standard deviation over the variable-length utterance.

Researchers have found that using the mean and the standard deviation together is better than using the mean alone. This shows that a second-order description of the frame-level features is very helpful for producing discriminative speaker embeddings.


Let me describe the statistics pooling in more detail. It is a very simple operation: we just compute the mean of the frame-level features, and then we compute the standard deviation of the frame-level features.

So we can see that the mean and the standard deviation act as a kind of summary of the frame-level features: they summarize the distribution of the frame-level features.
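The statistics-pooling step described above can be sketched in a few lines. This is a minimal pure-Python illustration, not the talk's actual implementation, and the feature values below are made up:

```python
import math

def statistics_pooling(frames):
    """Collapse a variable-length sequence of frame-level feature
    vectors into one fixed-size vector: the per-dimension mean
    concatenated with the per-dimension standard deviation."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    std = [math.sqrt(sum((f[d] - mean[d]) ** 2 for f in frames) / n)
           for d in range(dim)]
    return mean + std  # utterance-level vector of size 2 * dim

# Toy example: 4 frames of 2-dimensional features.
frames = [[1.0, 2.0], [3.0, 2.0], [1.0, 4.0], [3.0, 4.0]]
print(statistics_pooling(frames))  # [2.0, 3.0, 1.0, 1.0]
```

Whatever the utterance length, the output size is fixed, which is exactly why the pooling makes the variable-length input usable by the fully-connected layers.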


The limitation is that the mean and the standard deviation can only characterize a single, Gaussian-like distribution. So even if the frame-level features follow a multimodal distribution, the mean and the standard deviation can only summarize that distribution coarsely.

So what we propose here is a mixture representation pooling.

The idea is inspired by the expectation-maximization algorithm for the Gaussian mixture model. The difference is that, where the Gaussian mixture model uses something like the Euclidean distance to produce the alignments, we use an attention mechanism to produce the alignment scores. Specifically, we have the frame-level features, and we have multiple attention heads. Each attention head computes a set of weights, and these weights are normalized to sum to one across the heads at each frame.

We use these attention weights to compute a mean and a standard deviation for each head, and then the multiple means and standard deviations are concatenated together and used to compute the speaker embedding.

The important point here is that we have multiple heads and the attention weights are normalized across the heads at each frame, so each head behaves like a mixture component, and the way we compute the statistics is exactly as in a Gaussian mixture model.
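The pooling just described can be sketched as follows. This is a simplified illustration under my own assumptions: the scoring functions are hypothetical stand-ins for the learned attention heads, and the real system would train them jointly with the network:

```python
import math

def mixture_attention_pooling(frames, scorers):
    """Each head scores every frame; the scores are soft-maxed
    ACROSS HEADS at each frame (like GMM posteriors), and each head
    then computes a weighted mean and standard deviation. The
    per-head statistics are concatenated into one vector."""
    T, H = len(frames), len(scorers)
    dim = len(frames[0])
    # Raw scores s[h][t], then normalize over h for every frame t.
    s = [[scorers[h](f) for f in frames] for h in range(H)]
    w = [[0.0] * T for _ in range(H)]
    for t in range(T):
        z = sum(math.exp(s[h][t]) for h in range(H))
        for h in range(H):
            w[h][t] = math.exp(s[h][t]) / z  # sums to 1 across heads
    pooled = []
    for h in range(H):
        tot = sum(w[h]) or 1e-8  # soft count of frames for this head
        mean = [sum(w[h][t] * frames[t][d] for t in range(T)) / tot
                for d in range(dim)]
        var = [sum(w[h][t] * (frames[t][d] - mean[d]) ** 2
                   for t in range(T)) / tot for d in range(dim)]
        pooled += mean + [math.sqrt(v) for v in var]
    return pooled  # length H * 2 * dim

# Toy example with two made-up linear scorers.
frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
scorers = [lambda f: f[0], lambda f: -f[0]]
emb = mixture_attention_pooling(frames, scorers)
print(len(emb))  # 2 heads * 2 stats * 2 dims = 8
```

Because the weights at each frame sum to one over the heads, each head effectively claims a soft share of every frame, which is the mixture-component behavior described above.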

The most closely related method is multi-head attentive pooling, which was proposed by other researchers. It is very close to ours, but it normalizes the attention weights in a different way: for each head, a score is computed for every frame, and the scores are normalized across the frames.

This leads to a quite different behavior. With that normalization, each head in the multi-head attention model assigns a weight to each frame, and since it is trained on the loss only, each head ideally acts more like a soft VAD: it decides which frames to fuse and how much each frame contributes.

Another related method is the learnable dictionary encoding, which other researchers have applied to similar tasks. Like our method, it is also based on a mixture model with learnable means, but unlike ours it computes the alignments in a different way, using a Euclidean-style distance, whereas we use attention. Our scores are therefore produced by a neural network, which is more powerful than the Euclidean distance.

So let's look again at the difference between our method and the ones I talked about before. Those methods compute an alignment that is normalized across the frames, so the distribution is over the frames and each head is quite independent of the others. In our case the distribution is over the heads at each frame, so the heads compete with each other, like the components of a mixture model.
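The distinction above is really just the axis of the softmax. A toy illustration with made-up attention scores:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    z = sum(e)
    return [v / z for v in e]

# Made-up attention scores: rows = heads, columns = frames.
scores = [[2.0, 0.0, 1.0],
          [0.0, 2.0, 1.0]]
n_heads, n_frames = len(scores), len(scores[0])

# Multi-head attentive pooling: normalize each head OVER FRAMES,
# so every head is a soft VAD choosing which frames to fuse.
per_head = [softmax(row) for row in scores]

# Our mixture pooling: normalize each frame OVER HEADS, so the
# heads compete like the components of a mixture model.
per_frame = [softmax([scores[h][t] for h in range(n_heads)])
             for t in range(n_frames)]

assert all(abs(sum(row) - 1.0) < 1e-9 for row in per_head)
assert all(abs(sum(col) - 1.0) < 1e-9 for col in per_frame)
```

Same scores, two normalization axes, and hence the two quite different interpretations of what a head is doing.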



For the network we use DenseNet-121, a 121-layer DenseNet. DenseNet is popular in computer vision as a kind of parameter-efficient architecture, because of its growth-rate design and its dense skip connections.

The original DenseNet is built on 2D convolution, and we just make a small modification to make it work with speech: we use 1D convolution instead of 2D convolution. For the transition layers, where the resolution is lowered, we also use pooling for downsampling; specifically, we use kernels of size two with a stride of two.

At the last layer we use the AM-softmax, which we find pretty effective.
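The additive-margin softmax mentioned above can be sketched for a single example. This is a minimal illustration, and the margin and scale values here are placeholders I chose, not the ones used in the talk:

```python
import math

def am_softmax_loss(cosines, target, margin=0.35, scale=30.0):
    """Additive-margin softmax: subtract a margin m from the cosine
    similarity of the target class before the scaled softmax, which
    forces the embedding closer to its speaker's weight vector."""
    logits = [scale * (c - margin if i == target else c)
              for i, c in enumerate(cosines)]
    # Stable log-sum-exp, then cross-entropy for the target class.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

# Cosine similarities between one embedding and 3 speaker weights.
loss = am_softmax_loss([0.8, 0.1, -0.2], target=0)
print(loss)
```

With the margin active, a correctly classified example still incurs some loss until its cosine clears the margin, so the loss with margin is always at least as large as without it.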

Now for the experimental setup. The training data is VoxCeleb 1 and 2, which has about seven thousand three hundred speakers. The evaluation data is the VOiCES 19 evaluation set, and we also evaluate on the VoxCeleb1 test set.

For the features, we use 40-dimensional features with mean normalization, and we use an energy-based voice activity detection. We also use data augmentation.

For the baseline we use the x-vector network, but with the channel size of the layers doubled to scale it up.

Here is a summary of the models, specifically the number of model parameters and the amount of computation. We can see that for the DenseNet the number of parameters is quite low, roughly half that of the x-vector network, even though it is quite deep with its one hundred and twenty-one layers; this is because its layers are quite narrow.

However, we can also see that because the DenseNet is a very deep network, on a GPU device it will be a little bit slower.

So here are our results. First, let's talk about the network structure. We find that the DenseNet outperforms the x-vector baseline on all three data sets; although it uses roughly half the model parameters, it takes more time to run. In terms of performance, though, it obviously performs better.

Then, on whether the mixture pooling is important: we found a clear improvement on the VOiCES 19 evaluation set, and on VoxCeleb1 we also obtained a small improvement. Generally speaking, it outperforms the baseline pooling on all conditions.

Here is an ablation study on the number of attention heads. We want to study whether increasing the number of heads is significant or not. The issue is that if we increase the number of heads without controlling the concatenated dimension, we end up with a larger model, so the improvement might not come from the attention mechanism itself; it could be getting the benefit from the larger model. So we fix the concatenated dimension to make a fairer comparison.

Here we see that if we increase the number of heads from one, to two, to four, the performance keeps improving, so a reasonable number of heads gives the best overall result. But when we only keep increasing the number of heads, the performance actually goes down, so the error rate has a kind of U-shape as the number of heads gets high.

So to conclude: we introduced the concept of mixture representation pooling, which is inspired by the expectation-maximization algorithm for training a Gaussian mixture model, like the GMM. The difference from the multi-head attention mechanism is that the attention weights are normalized across the heads at each frame. We also proposed to use a 121-layer DenseNet, and we showed that it outperforms the x-vector on the VOiCES 19 evaluation set.

So that is all of my presentation. Thank you very much for listening. If you have any questions about my presentation, please feel free to leave a comment.