0:00:13 ...a 121-layer DenseNet. We
0:00:17 also propose a new way of pooling the frame-level information for the speaker embedding using attention,
0:00:23 and we show that
0:00:27 it can robustly incorporate that information, even in noisy conditions.
0:00:36 So, to review the contributions: first, we show that the model,
0:00:41 specifically a 121-layer DenseNet, will produce more discriminative speaker embeddings
0:00:48 than the well-known x-vector model,
0:00:51 with a comparable amount of parameters.
0:00:54 Secondly, we show that the proposed attention mechanism improves
0:00:58 speaker verification performance on noisy data sets.
0:01:03 Okay, so
0:01:05 let's start with the baseline.
0:01:07 The network takes speech features, such as MFCCs or filter banks,
0:01:12 and passes them through several layers of convolution.
0:01:16 Then, because
0:01:20 speech is variable-length in nature,
0:01:22 we need to convert it into a single vector, and we do that with
0:01:26 statistics pooling: specifically, we compute the mean over the duration and the
0:01:32 standard deviation, concatenate them, feed them through fully connected layers, and end
0:01:37 the network with a softmax layer.
0:01:39 So an utterance is summarized by its mean and
0:01:45 standard deviation.
0:01:47 Previous research found that using the mean and standard deviation is better than using
0:01:52 the mean alone. This shows that
0:01:56 a second-order
0:01:57 description of the frame-level features is very helpful for
0:02:02 producing discriminative speaker embeddings.
0:02:07 This slide shows statistics pooling in more detail.
0:02:11 It is a very simple operation: we just compute the
0:02:14 mean of the frame-level features, and then we compute the standard deviation of the frame-level
0:02:19 features.
0:02:21 So we can see that statistics pooling is a kind of
0:02:24 summary of the frame-level features:
0:02:26 we use the mean and standard deviation as a summary of the frame-level
0:02:30 features, that is, of their distribution.
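The statistics pooling described here can be sketched as follows; this is a minimal NumPy illustration (the feature dimensions are made up for the example), not the actual network code:

```python
import numpy as np

def statistics_pooling(frames):
    """Summarize variable-length frame-level features (T, D) as one
    fixed-size vector: the per-dimension mean and standard deviation
    over time, concatenated."""
    mean = frames.mean(axis=0)          # (D,)
    std = frames.std(axis=0)            # (D,)
    return np.concatenate([mean, std])  # (2*D,)

# Two utterances of different lengths map to the same embedding size.
a = statistics_pooling(np.random.randn(200, 64))
b = statistics_pooling(np.random.randn(57, 64))
assert a.shape == b.shape == (128,)
```

The key property is that the output size depends only on the feature dimension, not on the utterance length.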
0:02:32 The limitation is that the mean and standard deviation can only characterize a single unimodal distribution, like a Gaussian.
0:02:38 For a multimodal distribution this is a limitation: even if the frame-level features
0:02:44 follow a multimodal distribution,
0:02:49 the mean and standard deviation will
0:02:52 summarize it as if it were a single mode.
0:02:55 So what we propose here is a new representation.
0:02:59 The idea is inspired by the
0:03:04 expectation-maximization algorithm
0:03:06 for the Gaussian mixture model.
0:03:07 The difference is that a Gaussian mixture model
0:03:12 essentially uses
0:03:14 the Euclidean distance to produce the alignments, whereas we use
0:03:20 an attention mechanism
0:03:22 to produce the alignment scores. Specifically, we have the frame-level features, and
0:03:27 we have multiple attention heads,
0:03:28 and each head computes a set of weights.
0:03:34 These weights are normalized to sum to one
0:03:38 across the heads,
0:03:39 and we use these weights
0:03:41 to compute a weighted mean and standard deviation.
0:03:43 We then have multiple means and standard deviations, one pair per head,
0:03:48 which are concatenated together
0:03:50 and used to compute the speaker embedding.
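The pooling just described can be sketched in NumPy as below. This is an illustrative assumption, not the paper's exact implementation: the scoring matrix `W` is a hypothetical stand-in for the learned attention parameters, and a single linear scorer is used where the real model may use a deeper scoring network:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mixture_attention_pooling(frames, W):
    """frames: (T, D) frame-level features; W: (D, H) scoring weights.
    Scores are normalized across HEADS for each frame (like GMM
    posteriors); each head's soft counts then give a weighted mean
    and standard deviation, concatenated into one vector."""
    scores = frames @ W                  # (T, H) raw alignment scores
    post = softmax(scores, axis=1)       # sums to 1 over heads, per frame
    counts = post.sum(axis=0)            # (H,) soft occupancy per head
    mean = (post.T @ frames) / counts[:, None]           # (H, D)
    sq = (post.T @ frames ** 2) / counts[:, None]        # E[x^2] per head
    std = np.sqrt(np.maximum(sq - mean ** 2, 1e-8))      # (H, D)
    return np.concatenate([mean.ravel(), std.ravel()])   # (2*H*D,)

emb = mixture_attention_pooling(np.random.randn(120, 32), np.random.randn(32, 4))
assert emb.shape == (2 * 4 * 32,)
```

Note that the softmax over the head axis is exactly what makes each frame behave like a data point softly assigned to mixture components.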
0:03:53 The important point here is that we have multiple heads, and the
0:03:58 attention weights are normalized across the heads, so each frame is softly
0:04:03 assigned to the heads.
0:04:06 This way of computing the statistics is exactly like
0:04:10 a Gaussian mixture model.
0:04:13 Note that our method is not the same as multi-head attention pooling,
0:04:19 which was proposed by other researchers.
0:04:23 It is very close, but in that work
0:04:28 the attention weights are normalized in a different way.
0:04:30 In multi-head attention,
0:04:33 for
0:04:35 each head we compute a score for each frame, and the
0:04:41 scores are normalized across the frames.
0:04:43 This leads to quite different behavior. Because
0:04:47 the weights are normalized across the frames,
0:04:53 each head in multi-head attention mostly learns to assign a weight to each frame;
0:04:58 when it is trained with the loss, it behaves more
0:05:01 like a soft VAD,
0:05:03 choosing which frames to use
0:05:06 and letting them contribute to the embedding.
0:05:10 Note that this is not the first time attention has been applied here;
0:05:17 some other researchers have also applied
0:05:23 attention
0:05:25 to similar tasks,
0:05:27 but mostly in the multi-head-attention style.
0:05:31 There is also a related method that
0:05:34 models the pooling as a mixture,
0:05:36 but unlike ours it computes the alignments with the Euclidean distance,
0:05:42 whereas we use attention, so our score function
0:05:46 can be very general: it can be
0:05:48 any scoring function that we can model
0:05:51 with a neural network,
0:05:53 which is more powerful than the Euclidean distance.
0:05:57 So let's take a look at the difference between the two normalization schemes,
0:06:00 visually.
0:06:02 As I talked about before,
0:06:05 multi-head attention
0:06:07 computes weights normalized over the frames, so
0:06:12 each head's distribution is spread over the whole utterance, and each head is largely
0:06:17 independent of the others. In our case the distribution is over the heads, so each
0:06:24 head only covers a small region, like a component of a mixture model.
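The difference between the two schemes comes down to which axis the softmax is applied over; a small sketch (random scores stand in for the learned ones):

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

scores = np.random.randn(120, 4)  # (T frames, H heads)

# Multi-head attention pooling: normalize over FRAMES, so each head
# independently spreads its weight over the utterance (soft-VAD-like).
w_frames = softmax(scores, axis=0)
assert np.allclose(w_frames.sum(axis=0), 1.0)

# Mixture-style pooling: normalize over HEADS, so each frame is
# softly assigned to the heads, like GMM posteriors.
w_heads = softmax(scores, axis=1)
assert np.allclose(w_heads.sum(axis=1), 1.0)
```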
0:06:31 For the encoder, we use a DenseNet,
0:06:36 specifically a 121-layer DenseNet.
0:06:40 DenseNet was proposed in computer vision; its
0:06:45 key property is the dense connection:
0:06:49 each layer takes the output of all the previous layers as its input.
0:06:52 The original DenseNet uses 2-D convolution, and we make a small modification
0:06:57 to make it work with speech:
0:07:01 we use 1-D convolution
0:07:03 instead of 2-D convolution.
0:07:05 For the transition layers, where the resolution is reduced, we also
0:07:11 use convolution;
0:07:12 specifically, we use a kernel with a stride of two to downsample, as in the
0:07:17 example here.
0:07:19 And for the loss, at the last layer we use the
0:07:23 AM-softmax, which we find pretty effective.
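The 1-D dense connectivity can be sketched as below. This is only an illustration of the connectivity pattern under assumed settings (kernel size 3, ReLU, made-up growth rate, no batch norm or transition layers), not the actual 121-layer model:

```python
import numpy as np

def conv1d(x, k):
    """'Same'-padded 1-D convolution: x is (T, Cin), k is (ksize, Cin, Cout)."""
    ksize = k.shape[0]
    pad = ksize // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    T = x.shape[0]
    out = np.zeros((T, k.shape[2]))
    for i in range(ksize):
        out += xp[i:i + T] @ k[i]
    return out

def dense_block(x, kernels):
    """Dense connectivity on a 1-D sequence: each layer sees the
    channel-wise concatenation of the input and all previous layers."""
    feats = [x]
    for k in kernels:
        h = np.concatenate(feats, axis=1)            # (T, sum of channels)
        feats.append(np.maximum(conv1d(h, k), 0.0))  # conv + ReLU
    return np.concatenate(feats, axis=1)

# Two layers with growth rate 8 on 40-dim features.
x = np.random.randn(100, 40)
k1 = np.random.randn(3, 40, 8) * 0.1
k2 = np.random.randn(3, 48, 8) * 0.1   # input channels grow: 40 + 8
y = dense_block(x, [k1, k2])
assert y.shape == (100, 40 + 8 + 8)
```

The channel count grows by the growth rate at every layer, which is why the parameter count can stay small even at large depth.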
0:07:28 Now the experimental setup. For the training data, we use VoxCeleb 1 and
0:07:34 VoxCeleb 2;
0:07:36 that is about 7,300 speakers in total.
0:07:41 The test data include the VoxCeleb 1 test set and the VOiCES 19
0:07:45 evaluation set.
0:07:48 For the features, we use 40-dimensional filter banks with mean
0:07:53 normalization, and we apply additive noise and reverberation for data augmentation, as well as
0:07:57 energy-based voice activity detection.
0:08:00 For the baselines, we use the x-vector network,
0:08:04 and we also
0:08:05 use a larger variant of our model, in which we double
0:08:11 the channel size of the DenseNet.
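An energy-based VAD of the kind mentioned above can be sketched like this; the threshold value and the max-relative criterion are illustrative assumptions, not necessarily the recipe used in the paper:

```python
import numpy as np

def energy_vad(frames, threshold_db=-40.0):
    """Keep frames whose log energy is within `threshold_db` of the
    utterance's loudest frame. frames: (T, D) feature matrix."""
    energy = np.maximum((frames ** 2).mean(axis=1), 1e-12)  # avoid log(0)
    energy_db = 10.0 * np.log10(energy)
    keep = energy_db > energy_db.max() + threshold_db
    return frames[keep]

# Loud frames survive; near-silent frames are dropped.
loud = np.ones((5, 4))
quiet = np.zeros((5, 4))
out = energy_vad(np.vstack([loud, quiet]))
assert out.shape[0] == 5
```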
0:08:15 This slide summarizes the models, specifically their
0:08:20 computational cost
0:08:21 and their number of parameters.
0:08:25 We can see that the DenseNet's parameter count is quite low,
0:08:28 lower than the x-vector network, because it does not use
0:08:33 very wide channels.
0:08:39 On the other hand,
0:08:40 in terms of computation time, the DenseNet is somewhat slower:
0:08:45 although its overall cost is modest, the one hundred and twenty-one
0:08:50 layers
0:08:51 have to be computed sequentially, layer by layer,
0:08:56 unlike the wider but shallower x-vector network.
0:09:00 So because the DenseNet is a very deep network,
0:09:04 on devices like GPUs it will run a little bit slower.
0:09:11 All right.
0:09:13 So here are our results.
0:09:17 First, let's talk about the network structure.
0:09:21 We find that the DenseNet outperforms the x-vector baseline, and
0:09:25 it does so on all three data sets.
0:09:29 The DenseNet achieves this with
0:09:34 fewer
0:09:35 model parameters, although it takes more computation time;
0:09:42 in terms of performance, the DenseNet obviously performs better.
0:09:47 Second, the attentive pooling. We found that it improves
0:09:52 performance on the noisy VOiCES 19 evaluation set,
0:09:55 while on VoxCeleb 1 we only obtain a small improvement;
0:09:58 generally speaking, it works better under noisy conditions.
0:10:04 Here is an ablation on the number of attention heads.
0:10:09 Here we need to be careful in how we design the study:
0:10:14 the heads feed into a fully connected layer, so whether the comparison is fair or not
0:10:22 depends on the dimensions. If we increase the number of heads without controlling the
0:10:28 concatenated dimension, we end up with a larger model, so an improvement could
0:10:32 not be attributed to the attention mechanism;
0:10:35 it could simply be the benefit of a larger model.
0:10:39 So instead we fix the total dimension while varying the number of heads, which is more fair.
0:10:44 Here we see that if we increase the number of heads from one, to two,
0:10:49 to four,
0:10:51 the performance actually keeps going up,
0:10:54 and the largest number of heads shown here gives the best result;
0:10:59 overall, performance improves as we increase the number of heads.
0:11:03 However,
0:11:05 if we increase the number of heads much further, performance actually goes down. The reason may be
0:11:09 that each head covers too small a region when the number of heads is too high.
0:11:14 To conclude, we introduced a mixture-of-attention-heads pooling for speaker embedding.
0:11:19 The key point is that the way we normalize the attention weights
0:11:24 is inspired by the expectation-maximization algorithm for the Gaussian
0:11:30 mixture model,
0:11:32 and the difference from multi-head attention is that the normalization is done
0:11:37 across the heads instead of across the frames. We also propose to use a
0:11:42 121-layer DenseNet encoder, and we show that it performs
0:11:47 well on the VoxCeleb and VOiCES 19 evaluation sets.
0:11:52 So that is all for my presentation. Thank you very much for listening. If you have
0:11:58 any questions about my presentation, please feel free to leave a comment.