0:00:13 | a layer one point eight we really |
---|---|

0:00:17 | and i'm an author of the paper learning mixture representation for deep speaker embedding using attention |

0:00:23 | and the other two already optimum a mac and there will be |

0:00:27 | no drama hong kong or it can be robustly income working s one the information |

0:00:31 | engineer |

0:00:36 | so to review the contributions first we show that the model |

0:00:41 | specifically a hundred and twenty one layer DenseNet will produce more discriminative speaker embeddings |

0:00:48 | and the showroom show model as vector |

0:00:51 | with the referee as an amount of time |

0:00:54 | secondly we show that mixture representation pooling will improve the |

0:00:58 | speaker embeddings on a noisy data set |

0:01:03 | so okay so |

0:01:05 | okay well as five and that's well |

0:01:07 | as a baseline so the network takes a speech feature an MFCC or filter bank |

0:01:11 | feature |

0:01:12 | and then passes it through several layers of convolution |

0:01:16 | and then we because is a real and then us |

0:01:20 | since speech is variable length in nature |

0:01:22 | so we need to convert it into a single vector so we do it as follows |

0:01:26 | that is statistics pooling specifically we compute the mean and standard deviation and concatenate |

0:01:32 | the mean and standard deviation and then it goes through fully connected layers and it produces |

0:01:37 | the network output at the softmax layer |

0:01:39 | so |

0:01:39 | this is really is really on its own mean of variable an utterance will be |

0:01:45 | standard edition |

0:01:47 | and researchers found that using mean and standard deviation is better than using |

0:01:52 | the mean only so this shows that |

0:01:56 | more actually this |

0:01:57 | the distribution description of the frame level features is very helpful for |

0:02:02 | producing discriminative speaker embeddings |

0:02:06 | so |

0:02:07 | so this class that is really more detail |

0:02:11 | it's a very easy operation we just compute the |

0:02:14 | mean of the frame level features and then we compute the standard deviation of the frame |

0:02:18 | level features |

0:02:19 | have a |
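
The statistics pooling step described here can be sketched in a few lines of NumPy. This is a minimal illustration, not the speaker's implementation: the function name `statistics_pooling`, the `eps` floor, and the toy feature sizes are assumptions made for the example.

```python
import numpy as np

def statistics_pooling(frames, eps=1e-8):
    """Map (T, D) frame-level features to one (2*D,) utterance-level vector.

    eps is an assumed small floor that keeps the square root stable.
    """
    mean = frames.mean(axis=0)                    # per-dimension mean over frames
    std = np.sqrt(frames.var(axis=0) + eps)       # per-dimension standard deviation
    return np.concatenate([mean, std])

# Utterances of different lengths map to vectors of the same size.
rng = np.random.default_rng(0)
short_utt = rng.normal(size=(50, 4))    # 50 frames, 4-dimensional features (toy sizes)
long_utt = rng.normal(size=(300, 4))    # 300 frames
print(statistics_pooling(short_utt).shape, statistics_pooling(long_utt).shape)
```

Whatever the number of frames, the output has a fixed length, which is what lets the following fully connected layers operate on a single vector.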

0:02:21 | so we can see no bonus that is still a as the kind of the |

0:02:24 | summary also no feature |

0:02:26 | so we use mean and standard deviation as a summary of the frame level |

0:02:30 | features' distribution |

0:02:32 | however |

0:02:32 | mean and standard deviation can only characterize a single mode distribution such as a gaussian |

0:02:37 | distribution |

0:02:38 | so multimodal distribution yes but alas decision so if even if the frame level feature |

0:02:44 | a kind of distribution recognition custom |

0:02:49 | lately this yes and deviation we'll kind of |

0:02:52 | some right a distribution will |

0:02:55 | so here we propose mixture representation pooling |

0:02:59 | so |

0:02:59 | the idea is inspired by the expectation |

0:03:04 | maximization algorithm for |

0:03:06 | gaussian mixture models |

0:03:07 | so the difference here is that unlike in the gaussian mixture model where we |

0:03:12 | actually use |

0:03:14 | the euclidean distance to produce alignments here we |

0:03:20 | use the attention mechanism |

0:03:22 | to produce the attention scores so specifically you have frame level features and |

0:03:27 | we have multiple attention heads |

0:03:28 | and then each attention head computes a set of weights |

0:03:34 | the weights are normalized to sum to |

0:03:37 | one |

0:03:38 | across heads |

0:03:39 | and we use this set of weights |

0:03:41 | to compute the mean and standard deviation |

0:03:43 | and then we have multiple means and standard deviations |

0:03:48 | that are all concatenated together |

0:03:50 | and that we use to compute the speaker embedding |
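
The pooling just described can be sketched as follows. This is a hedged illustration under stated assumptions, not the paper's code: the linear scoring matrix `W`, the function name `mixture_representation_pooling`, the `eps` floor, and all sizes are hypothetical. The one property taken from the talk is that, for each frame, the weights sum to one across heads, much like component posteriors in a Gaussian mixture model.

```python
import numpy as np

def mixture_representation_pooling(frames, W, eps=1e-8):
    """frames: (T, D) frame-level features.
    W: (D, H) hypothetical linear scoring weights, one column per attention head.
    Returns the concatenated per-head means and standard deviations, plus the weights.
    """
    scores = frames @ W                                   # (T, H) one score per frame and head
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # sum to one ACROSS HEADS per frame

    occupancy = weights.sum(axis=0) + eps                 # (H,) soft frame count per head
    means = (weights.T @ frames) / occupancy[:, None]     # (H, D) weighted means
    second = (weights.T @ frames ** 2) / occupancy[:, None]
    stds = np.sqrt(np.maximum(second - means ** 2, 0.0) + eps)  # (H, D) weighted std
    return np.concatenate([means.ravel(), stds.ravel()]), weights

rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 8))   # 200 frames, 8-dimensional features (toy sizes)
W = rng.normal(size=(8, 4))          # 4 heads
pooled, weights = mixture_representation_pooling(frames, W)
print(pooled.shape)
```

With `H` heads and `D`-dimensional features the pooled vector has `2 * H * D` entries, so adding heads grows the concatenated dimension unless the per-head dimension is reduced.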

0:03:53 | so the imbalance in here is that e and we have multiple okay and then |

0:03:58 | addition we is not right across each had so is only sees the kind of |

0:04:03 | is just that because wishable actually |

0:04:06 | you know how to compute yes that the user is exactly as a |

0:04:10 | as in a gaussian mixture model |

0:04:13 | so be so still not allowed us that is supporting map plays the car content |

0:04:19 | in it is only is a proposal by another researcher |

0:04:23 | so is right is very close to it but its enrollment network |

0:04:28 | we use several times about being a different way |

0:04:30 | so as this is a computer |

0:04:33 | on the other setups location away |

0:04:35 | the attention weights are normalized across frames so for each frame we compute a score and |

0:04:41 | the scores are normalized across frames |

0:04:43 | so nist twenty that all the different arabic a in the with real state acacia |

0:04:47 | mechanism |
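
For contrast, attentive pooling with weights normalized across frames might look like the sketch below. It is an assumption-laden illustration: the scoring vector `v`, the function name, and the `eps` floor are hypothetical, and only a single head is shown.

```python
import numpy as np

def attentive_statistics_pooling(frames, v, eps=1e-8):
    """frames: (T, D) frame-level features; v: (D,) hypothetical scoring vector.
    The weights sum to one ACROSS FRAMES, so each frame gets a relative importance.
    """
    scores = frames @ v                       # (T,) one score per frame
    scores = scores - scores.max()            # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()         # softmax over the frame axis
    mean = weights @ frames                   # (D,) weighted mean
    second = weights @ frames ** 2
    std = np.sqrt(np.maximum(second - mean ** 2, 0.0) + eps)
    return np.concatenate([mean, std]), weights

rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 8))
pooled, weights = attentive_statistics_pooling(frames, rng.normal(size=8))
print(pooled.shape, weights.sum())
```

The only structural difference from normalizing across heads is the axis of the softmax: here every frame competes with every other frame, which is closer to a soft frame selection than to a mixture assignment.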

0:04:47 | so in our case we think the attention is like a gaussian mixture |

0:04:52 | model |

0:04:53 | in a attention model is more kind of a way to each frame up to |

0:04:58 | design a way that we use it is trained on the laws only idealise more |

0:05:01 | like a cell vad |

0:05:03 | two to three to fuse i'll some |

0:05:06 | and the contribute to three |

0:05:10 | so that's not that the landau wasn't that we might be a teacher forty so |

0:05:17 | actually that's not as in them some other researcher also have okay we will where |

0:05:23 | is internet |

0:05:25 | to latin like task |

0:05:27 | but now map place marginally case that |

0:05:31 | that's now that's undecidable was additionally could be |

0:05:34 | also use a model me |

0:05:36 | maximum mean but unlike visionary as us analyze computing a different way use euclidean system |

0:05:42 | as far as we use attention so we can have a score files are discovered |

0:05:46 | that can be very channels covering it can be |

0:05:48 | i just lock scores on the we use these remote |

0:05:51 | well various channel neural network |

0:05:53 | so is more powerful than euclidean distance |

0:05:57 | so let's take a look and other different between that and that you would disagree |

0:06:00 | and we shouldn't worry |

0:06:02 | unlike i talk about before |

0:06:05 | and i can't is it is probably |

0:06:07 | have a kind of computer location when normalized cost of frames so |

0:06:12 | so the distribution is a is the is distributed over a state and each has |

0:06:17 | kind of |

0:06:17 | you is very independent of yellow case of the distribution is this will work at |

0:06:24 | had is only the small where it kind of the mission recognition may be sure |

0:06:29 | execution |

0:06:30 | so |

0:06:31 | so here we use a very deep network specifically a |

0:06:36 | one hundred and twenty one layer DenseNet |

0:06:40 | DenseNet is used in computer vision as a kind of |

0:06:45 | cancun oklahoma for the guy around a ski condition because the |

0:06:49 | okay ross given that can as the use of test condition |

0:06:52 | the original that slated the collusion and then we do we just a moment modification |

0:06:57 | to make you what we with the all pass thus because the very we use |

0:07:01 | the one to compensate you is not be |

0:07:03 | and to the to the for convolution |

0:07:05 | and then for the transition i a low precision here we use can we also |

0:07:11 | use clues as a sample |

0:07:12 | as specifically we use the kernels that a twister to lose in an example here |

0:07:17 | the data symbol |

0:07:19 | and as for the loss we use the |

0:07:23 | additive margin softmax we find it pretty effective |
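
As a rough illustration of an additive margin softmax loss, the sketch below subtracts a margin from the target-class cosine before scaling. The talk does not give the hyperparameters, so `scale=30.0` and `margin=0.35` are assumed values chosen only for the example, and the function name and toy sizes are hypothetical.

```python
import numpy as np

def am_softmax_loss(embeddings, class_weights, labels, scale=30.0, margin=0.35):
    """embeddings: (N, D); class_weights: (D, C); labels: (N,) integer class ids.
    scale and margin are assumed values, not taken from the talk.
    """
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=0, keepdims=True)
    cos = e @ w                                            # (N, C) cosine similarities
    rows = np.arange(len(labels))
    logits = scale * cos
    logits[rows, labels] = scale * (cos[rows, labels] - margin)  # margin on the target class only
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[rows, labels].mean()

rng = np.random.default_rng(0)
emb = rng.normal(size=(16, 32))          # 16 toy embeddings
cls = rng.normal(size=(32, 10))          # 10 toy speaker classes
lab = rng.integers(0, 10, size=16)
print(am_softmax_loss(emb, cls, lab))
```

Because the margin lowers only the target logit, the loss with a positive margin is never smaller than the plain scaled-cosine softmax loss on the same inputs, which pushes the network toward larger angular separation between speakers.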

0:07:28 | so the only information for always the training data that the which idea and you |

0:07:34 | lda we use |

0:07:36 | that is seven thousand and three hundred speakers always the rewind with thirty two |

0:07:41 | it has data a always night maybe we maybe a voice ninety |

0:07:45 | we of so you very that weighs thirty one has a |

0:07:48 | the feature we use is a forty dimensional feature with the mean |

0:07:53 | and then the weights and additive educational use where is a while use of these |

0:07:57 | energy based voice activity detection |
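
A very simplified energy-based voice activity detector is sketched below to show the idea. It is not the detector used in the talk: the frame length, hop, and the threshold relative to the loudest frame are all assumed values, and the function name is hypothetical.

```python
import numpy as np

def energy_vad(signal, frame_len=400, hop=160, threshold_db=-40.0):
    """Return a boolean mask with one entry per frame (True = keep as speech).

    A frame is kept when its log energy is within |threshold_db| of the loudest frame.
    frame_len, hop, and threshold_db are assumptions for this sketch.
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx]                                         # (n_frames, frame_len)
    energy_db = 10.0 * np.log10((frames ** 2).mean(axis=1) + 1e-12)
    return energy_db > energy_db.max() + threshold_db

# Toy signal: 0.1 s of silence followed by 0.1 s of a 440 Hz tone at 16 kHz.
t = np.arange(1600) / 16000.0
signal = np.concatenate([np.zeros(1600), np.sin(2 * np.pi * 440.0 * t)])
mask = energy_vad(signal)
print(mask.astype(int))
```

Frames flagged `False` would be dropped before pooling, so silence does not contribute to the utterance-level statistics.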

0:08:00 | and the neon and we use your addition to the |

0:08:04 | is now we also use |

0:08:05 | us to use a as well and wise but in ways that are we double |

0:08:09 | up i mean |

0:08:11 | it double in and the channel size of the listeners that you scroll down there |

0:08:15 | so this is somebody else the model use a specific law me is a real |

0:08:20 | time operation |

0:08:21 | and then model parameter and use it on hold the number and in the model |

0:08:25 | we can see as well as well and that the work flow is quite low |

0:08:28 | having a is i is a low otherwise but the network because we don't models |

0:08:33 | i don't know the multichannel |

0:08:35 | solar for helpful plastic |

0:08:39 | a powerful |

0:08:40 | about referee of all time all and the models and also quite able but that |

0:08:45 | is that with as the although is are quite enough that will hundred and you |

0:08:50 | want layer |

0:08:51 | because the actually have a weighted loaf localities roughly is there almost every as i |

0:08:56 | z s where the network |

0:08:57 | and then i mean is also only all of the tuple |

0:09:00 | and but we can see that because of this as nice very even networks |

0:09:04 | so we've the you know device like that you will be a little bit so |

0:09:11 | that's right |

0:09:13 | so it is there is a well or in our results |

0:09:17 | so first let's talk about network structure |

0:09:21 | we find that does not for all of our last record and when wiseman than |

0:09:25 | the i-th user can you has if you all three data is that |

0:09:29 | our has never phone a fast and i and that although and why do as |

0:09:34 | well were used |

0:09:35 | a rough is a model parameter and take more time interval as |

0:09:40 | ieee |

0:09:42 | in the performance case can be our guys that obviously perform better |

0:09:47 | and then follows that is important maslow we found then be sure of that and |

0:09:51 | we |

0:09:52 | on the VOiCES 19 evaluation set |

0:09:55 | and i've always that anyone we have been a small improvement |

0:09:58 | and generally speaking way out of all conditions that is we |

0:10:04 | so here is that was totally an application had |

0:10:09 | so here we to acquire it we because study |

0:10:14 | are we face the known ola layer after recognition so increase number of half will |

0:10:20 | not be sign quiz or not but i mean |

0:10:22 | because he to achieve increase the number of without controlling the concatenated that dimension you |

0:10:28 | use of where like to model so the number of the times that no i |

0:10:32 | can not be penetrated problem as a mechanism |

0:10:35 | it could be getting the benefit for like to model so as to the telly |

0:10:39 | aside |

0:10:39 | so a reasonable how will i reason and stories they will be a more fair |

0:10:44 | comparison |

0:10:44 | so as to here we see that if we present will had four one two |

0:10:49 | to four |

0:10:51 | avoid and it's to the us is probably actually that i scheme going a |

0:10:54 | so we show that as you one highest ask volunteers as that is only |

0:10:59 | overall image relevant between c reasonable huh |

0:11:03 | okay queries the |

0:11:05 | only the increase the number that young we actually going not so reason is that |

0:11:09 | this kind of you shape at a when the number of buttons at high rates |

0:11:14 | so to conclude we introduce the concept of mixture representation pooling that is |

0:11:19 | i was that is the point i using only had training and all way or |

0:11:24 | policies is i about is imitation maximization verifying cost initial model i am on like |

0:11:30 | gmm model |

0:11:32 | images time on a given by this mechanism is that the fusion this that we |

0:11:37 | do nothing levels each index pieces and so i know propose a mechanism to one |

0:11:42 | hundred and twenty one data s now but it should be for one was that |

0:11:47 | everyone for on several ways night evaluation set |

0:11:52 | so that is all of my presentation thank you very much for listening if you have |

0:11:58 | any questions about my presentation please leave a comment |