| 0:00:13 | Hello. I am glad to be presenting at this workshop. |
|---|
| 0:00:19 | I will present our paper on selectively compensating speaker embeddings of distant utterances. |
|---|
| 0:00:27 | These are the contents of this presentation. We will start with an introduction and the motivation. |
|---|
| 0:00:32 | Next, we will go over the VOiCES dataset. |
|---|
| 0:00:36 | Then we will introduce the baseline system, |
|---|
| 0:00:38 | which uses RawNet, and the proposed models. |
|---|
| 0:00:44 | Experiments and the corresponding results will then be presented, followed by our conclusion. |
|---|
| 0:00:49 | Let's move on to the introduction. |
|---|
| 0:00:53 | Recently, |
|---|
| 0:00:54 | deep neural networks have achieved state-of-the-art performance in speaker verification. |
|---|
| 0:01:01 | However, distant utterances are well known to degrade performance because they contain environmental factors |
|---|
| 0:01:08 | such as reverberation and noise. |
|---|
| 0:01:11 | To address such cases, the VOiCES (Voices Obscured in Complex Environmental Settings) |
|---|
| 0:01:16 | from a Distance Challenge was held, |
|---|
| 0:01:19 | along with its corresponding dataset. |
|---|
| 0:01:24 | Previously, |
|---|
| 0:01:25 | several studies have performed compensation for the performance degradation caused by distant environments. |
|---|
| 0:01:33 | However, two problems remain in existing compensation methods. |
|---|
| 0:01:39 | First, |
|---|
| 0:01:39 | they address the degradation of only one kind of utterance. |
|---|
| 0:01:44 | Applying the compensation brought good improvement in the recognition of distant utterances; |
|---|
| 0:01:51 | however, when the distance compensation technique was applied to close-talk utterances, the performance |
|---|
| 0:01:58 | deteriorated. |
|---|
| 0:02:01 | Therefore, it is not easy to use such a compensation system when the recordings come |
|---|
| 0:02:05 | from various distances. |
|---|
| 0:02:08 | Second, |
|---|
| 0:02:08 | there is a dependency on the speaker verification system. |
|---|
| 0:02:12 | When a new speaker embedding structure is proposed, |
|---|
| 0:02:15 | the corresponding compensation system must be redesigned and retrained as well. |
|---|
| 0:02:23 | To address these |
|---|
| 0:02:24 | previous problems, |
|---|
| 0:02:26 | we wanted to build a system with the following properties. |
|---|
| 0:02:31 | First, |
|---|
| 0:02:32 | it should be independent of the front-end speaker embedding extractor. |
|---|
| 0:02:36 | Second, |
|---|
| 0:02:37 | the proposed system should perform selective enhancement |
|---|
| 0:02:41 | while considering the distance between the speech source and the microphone. |
|---|
| 0:02:45 | Third, |
|---|
| 0:02:46 | both close-talk and distant utterances can be input |
|---|
| 0:02:50 | into the proposed system. |
|---|
| 0:02:53 | Finally, |
|---|
| 0:02:53 | the proposed system should comprise a relatively simple architecture |
|---|
| 0:02:58 | so that the additional cost on top of the overall verification pipeline remains minimal. |
|---|
| 0:03:05 | We propose two distant utterance compensation systems. |
|---|
| 0:03:10 | The first proposed system decides the condition of the utterance and, according to the result, |
|---|
| 0:03:15 | selectively applies compensation. |
|---|
| 0:03:18 | We designed the system to determine the level of noise and reverberation in the input |
|---|
| 0:03:23 | and apply compensation accordingly. |
|---|
| 0:03:26 | Our second proposed system is based on the auto-encoder framework, |
|---|
| 0:03:31 | dividing the hidden representation |
|---|
| 0:03:34 | into two subspaces so that the system can separately extract the speaker information |
|---|
| 0:03:40 | included in the embedding and the recording-quality information. |
|---|
| 0:03:44 | One subspace is targeted to contain clean speaker information only, by applying an objective |
|---|
| 0:03:50 | function to this hidden layer, |
|---|
| 0:03:52 | and the other subspace is targeted to |
|---|
| 0:03:54 | contain the remaining information, such as reverberation and noise. |
|---|
| 0:04:01 | Next, the dataset used in this study will be described. |
|---|
| 0:04:06 | The VOiCES dataset was collected by playing the LibriSpeech dataset |
|---|
| 0:04:10 | through a loudspeaker |
|---|
| 0:04:12 | and re-recording it with distant microphones under various distances and acoustic conditions. |
|---|
| 0:04:18 | The acoustic conditions vary according to the room, |
|---|
| 0:04:21 | the recording microphone and its placement, |
|---|
| 0:04:23 | the loudspeaker angle, and the distractor noises. |
|---|
| 0:04:25 | In the VOiCES dataset, |
|---|
| 0:04:27 | there are three hundred speakers. |
|---|
| 0:04:30 | The development set comprises utterances from a total of |
|---|
| 0:04:34 | two hundred speakers, and the evaluation set comprises utterances from the other |
|---|
| 0:04:40 | one hundred speakers. |
|---|
| 0:04:44 | Next, we introduce RawNet, which is used as our baseline. |
|---|
| 0:04:48 | RawNet is used as the front-end speaker embedding extractor; |
|---|
| 0:04:52 | it takes raw waveforms as input directly. |
|---|
| 0:04:56 | Conventionally, acoustic features are used to extract speaker embeddings: |
|---|
| 0:04:59 | mel-frequency cepstral coefficients |
|---|
| 0:05:02 | and log mel spectrograms are the most widely used. |
|---|
| 0:05:05 | These acoustic features incorporate human knowledge to emphasize discriminative information. |
|---|
| 0:05:12 | Convolutional neural networks, which are frequently used as embedding extractors, |
|---|
| 0:05:18 | have a receptive field that gradually increases with depth. |
|---|
| 0:05:21 | Thus, when a spectrogram is input to a CNN, the network can |
|---|
| 0:05:26 | consider only limited time and frequency regions |
|---|
| 0:05:30 | in the layers |
|---|
| 0:05:31 | that are close to the input layer. |
|---|
| 0:05:35 | Although |
|---|
| 0:05:36 | these conventional acoustic features are widely used, |
|---|
| 0:05:39 | recent studies also explore modeling raw waveforms directly with deep neural networks. |
|---|
| 0:05:45 | It is expected that end-to-end learning can better extract discriminative information in the hidden layers. |
|---|
| 0:05:53 | When raw waveforms are processed by CNNs, |
|---|
| 0:05:56 | the frequency response |
|---|
| 0:05:58 | can also be learned in a data-driven manner. |
|---|
| 0:06:02 | In addition, the learned features adapt to the given data and task. |
|---|
| 0:06:09 | RawNet adopts an architecture in which residual-block-based CNNs |
|---|
| 0:06:14 | extract frame-level representations, |
|---|
| 0:06:17 | as illustrated here. |
|---|
| 0:06:19 | The residual blocks are similar to those of the original ResNet, |
|---|
| 0:06:23 | with minor modifications. |
|---|
| 0:06:27 | These frame-level representations are then fed to a uni-directional gated recurrent unit layer |
|---|
| 0:06:33 | to aggregate them into a single utterance-level representation. |
|---|
| 0:06:37 | A fully connected layer with one thousand twenty-four nodes |
|---|
| 0:06:41 | then conducts an affine transformation, and the output of this layer is used as the speaker embedding. |
|---|
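To make the baseline concrete, below is a minimal PyTorch-style sketch of a RawNet-like extractor as just described: residual 1-D convolution blocks over the raw waveform, a uni-directional GRU that aggregates frame-level features into an utterance-level vector, and a 1,024-node fully connected layer whose output serves as the speaker embedding. The channel counts, block count, and pooling choices are illustrative assumptions, not the exact configuration used in the talk.

```python
import torch
import torch.nn as nn

class ResBlock1d(nn.Module):
    """1-D residual block operating on raw-waveform feature maps."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU()
        self.pool = nn.MaxPool1d(3)  # shrink the time axis between blocks

    def forward(self, x):
        out = self.conv2(self.act(self.conv1(x)))
        return self.pool(self.act(out + x))

class RawNetLikeExtractor(nn.Module):
    """Sketch of a RawNet-style speaker embedding extractor (sizes are assumptions)."""
    def __init__(self, channels=128, emb_dim=1024, num_blocks=4):
        super().__init__()
        self.front = nn.Conv1d(1, channels, kernel_size=3, padding=1)
        self.blocks = nn.Sequential(*[ResBlock1d(channels) for _ in range(num_blocks)])
        self.gru = nn.GRU(channels, 1024, batch_first=True)  # uni-directional GRU
        self.fc = nn.Linear(1024, emb_dim)                    # affine layer -> speaker embedding

    def forward(self, wav):                  # wav: (batch, samples)
        x = self.front(wav.unsqueeze(1))     # (batch, channels, time)
        x = self.blocks(x)
        x = x.transpose(1, 2)                # (batch, time, channels) for the GRU
        _, h = self.gru(x)                   # last hidden state = utterance-level vector
        return self.fc(h.squeeze(0))         # (batch, emb_dim)

emb = RawNetLikeExtractor()(torch.randn(2, 16000))  # two one-second 16 kHz waveforms
print(emb.shape)  # torch.Size([2, 1024])
```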
| 0:06:49 | In this section, we introduce our two proposed speaker embedding enhancement systems. |
|---|
| 0:06:57 | The first proposed system conducts clean-condition-based selective enhancement; we refer to it as SC. |
|---|
| 0:07:03 | The figure on the right shows the framework of the proposed SC. |
|---|
| 0:07:08 | This system comprises a DNN that compensates the input speaker embedding according to its clean condition, |
|---|
| 0:07:14 | and another DNN that estimates that condition. |
|---|
| 0:07:18 | The compensation network is analogous to an auto-encoder, |
|---|
| 0:07:20 | and the detection network decides the clean condition of the utterance, similar to a binary classifier. |
|---|
| 0:07:29 | During the training phase, |
|---|
| 0:07:31 | the compensation DNN is trained to minimize a mean squared error objective function |
|---|
| 0:07:35 | between its output and the target clean speaker embedding. |
|---|
| 0:07:39 | When a source utterance is input, |
|---|
| 0:07:41 | the compensation network learns to reconstruct the input embedding itself. |
|---|
| 0:07:46 | On the other hand, when a distant utterance is input, |
|---|
| 0:07:49 | the target is the embedding |
|---|
| 0:07:52 | of the source utterance |
|---|
| 0:07:54 | that was used to make the distant utterance. |
|---|
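A minimal sketch of the reconstruction objective just described, assuming the compensation network is a simple embedding-to-embedding mapper: for a source (close-talk) input the target is the input embedding itself, while for a distant input the target is the embedding of the paired source utterance. The network shape and tensor names are illustrative assumptions.

```python
import torch
import torch.nn as nn

emb_dim = 1024
comp_net = nn.Sequential(                     # compensation network (illustrative MLP)
    nn.Linear(emb_dim, emb_dim), nn.LeakyReLU(), nn.Linear(emb_dim, emb_dim))

def reconstruction_loss(input_emb, source_emb, is_distant):
    """MSE with the target chosen by the input condition.

    input_emb : embedding fed to the network (source or distant utterance)
    source_emb: embedding of the paired close-talk (source) utterance
    is_distant: bool tensor, True when the input is a distant re-recording
    """
    target = torch.where(is_distant.unsqueeze(1), source_emb, input_emb)
    return nn.functional.mse_loss(comp_net(input_emb), target)

x = torch.randn(8, emb_dim)                   # input embeddings
src = torch.randn(8, emb_dim)                 # matching source embeddings
flag = torch.tensor([True, False] * 4)
print(reconstruction_loss(x, src, flag).item())
```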
| 0:07:58 | The detection DNN is trained to minimize a binary cross-entropy objective |
|---|
| 0:08:02 | function. |
|---|
| 0:08:04 | When a source utterance is input, the binary label is one, representing the clean |
|---|
| 0:08:09 | condition, |
|---|
| 0:08:11 | and when a distant utterance is input, the binary label is zero, representing the |
|---|
| 0:08:16 | degraded condition. |
|---|
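A small sketch of the detection objective just described: a network maps an embedding to a single logit, and the binary label is one for a source utterance and zero for a distant one, trained with binary cross-entropy. The layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

emb_dim = 1024
det_net = nn.Sequential(nn.Linear(emb_dim, 256), nn.LeakyReLU(), nn.Linear(256, 1))

emb = torch.randn(8, emb_dim)                            # speaker embeddings
label = torch.tensor([1., 0., 1., 0., 0., 0., 1., 0.])   # 1 = source, 0 = distant

# BCEWithLogitsLoss applies the sigmoid internally, matching the sigmoid output
# used as the clean condition at test time.
loss = nn.BCEWithLogitsLoss()(det_net(emb).squeeze(1), label)
print(loss.item())
```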
| 0:08:18 | In the figure below, |
|---|
| 0:08:20 | the top panel presents the training phase of our proposed |
|---|
| 0:08:24 | SC system. |
|---|
| 0:08:27 | According to a previous study, |
|---|
| 0:08:29 | when compensation is conducted on speaker embeddings, |
|---|
| 0:08:33 | the compensation alone may not be beneficial for the evaluation pairs. |
|---|
| 0:08:38 | This phenomenon is analyzed as the compensation losing the discriminative power |
|---|
| 0:08:43 | of the speaker embedding by changing its values |
|---|
| 0:08:46 | in the high-dimensional embedding space. |
|---|
| 0:08:50 | Based on this knowledge, we add a speaker identification component to the proposed system, |
|---|
| 0:08:54 | for which a categorical cross-entropy loss function is |
|---|
| 0:08:59 | used. |
|---|
| 0:09:01 | So the final loss function used to train the SC system |
|---|
| 0:09:05 | is |
|---|
| 0:09:06 | described here. |
|---|
| 0:09:09 | The first term is the reconstruction error, |
|---|
| 0:09:12 | the second term measures the distance detection error, |
|---|
| 0:09:15 | and the last term measures the speaker identification error |
|---|
| 0:09:19 | of the compensated speaker embedding. |
|---|
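Putting the three terms together, a hedged sketch of the overall SC training loss as described: reconstruction error, clean-condition detection error, and speaker identification error on the compensated embedding. Equal weighting of the terms and the shapes of the placeholder networks are assumptions; the talk does not give the exact coefficients.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

emb_dim, n_speakers = 1024, 200
comp_net = nn.Linear(emb_dim, emb_dim)      # compensation network (placeholder)
det_net = nn.Linear(emb_dim, 1)             # clean-condition detection network (placeholder)
spk_head = nn.Linear(emb_dim, n_speakers)   # speaker identification head (placeholder)

def sc_loss(input_emb, target_emb, clean_label, spk_label):
    comp = comp_net(input_emb)
    recon = F.mse_loss(comp, target_emb)                                        # reconstruction error
    detect = F.binary_cross_entropy_with_logits(det_net(input_emb).squeeze(1),  # detection error
                                                clean_label)
    ident = F.cross_entropy(spk_head(comp), spk_label)                          # speaker ID error
    return recon + detect + ident        # equal weights assumed for illustration

loss = sc_loss(torch.randn(8, emb_dim), torch.randn(8, emb_dim),
               torch.randint(0, 2, (8,)).float(), torch.randint(0, n_speakers, (8,)))
print(loss.item())
```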
| 0:09:23 | In the test phase, the speaker embedding is input to both the compensation network |
|---|
| 0:09:27 | and the detection network. |
|---|
| 0:09:30 | The clean condition that connects the input and output of the compensation network is not a binary |
|---|
| 0:09:35 | label; rather, it is the output of the detection network |
|---|
| 0:09:37 | with a sigmoid activation function. |
|---|
| 0:09:41 | This value lies between zero and one and represents how close the input is to the source (clean) condition. |
|---|
| 0:09:48 | The compensated speaker embedding is finally obtained by adding the output of the |
|---|
| 0:09:52 | compensation network |
|---|
| 0:09:54 | according to its estimated clean condition. |
|---|
| 0:09:56 | In the figure below, the bottom panel represents the test process of our proposed |
|---|
| 0:10:02 | SC system. |
|---|
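A sketch of the test-time combination just described: the detection network's sigmoid output, a value between zero and one, acts as the clean condition, and the compensated embedding is formed from the input embedding and the compensation network's output. The exact mixing rule is not spelled out in the talk, so the residual, condition-weighted combination below is an assumption.

```python
import torch
import torch.nn as nn

emb_dim = 1024
comp_net = nn.Linear(emb_dim, emb_dim)   # trained compensation network (placeholder)
det_net = nn.Linear(emb_dim, 1)          # trained detection network (placeholder)

def compensate(emb):
    """Assumed gating rule: keep the input embedding in proportion to the clean
    condition and add the compensation output for the remaining proportion."""
    cond = torch.sigmoid(det_net(emb))            # clean condition in (0, 1)
    return cond * emb + (1.0 - cond) * comp_net(emb)

print(compensate(torch.randn(2, emb_dim)).shape)  # torch.Size([2, 1024])
```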
| 0:10:05 | The second proposed system utilizes prior knowledge |
|---|
| 0:10:20 | and is designed as a |
|---|
| 0:10:23 | discriminative auto-encoder. |
|---|
| 0:10:30 | It is composed of an encoder, a decoder, and two intermediate hidden layers, |
|---|
| 0:10:37 | as illustrated in the figure here. |
|---|
| 0:10:41 | The architecture design follows a denoising auto-encoder structure. |
|---|
| 0:10:46 | Inspired by PCA, we set the sizes of the two intermediate hidden layers |
|---|
| 0:10:51 | to collect the reverberation and noise in one layer |
|---|
| 0:10:54 | and to contain |
|---|
| 0:10:55 | the clean speaker information in the other layer. |
|---|
| 0:11:01 | The intermediate hidden layer in which the clean speaker information is ideally |
|---|
| 0:11:06 | isolated |
|---|
| 0:11:07 | is used as the enhanced embedding. |
|---|
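A minimal sketch of the auto-encoder just described: an encoder, two intermediate hidden layers (one targeted to hold clean speaker information, the other reverberation and noise), and a decoder that reconstructs the embedding from the concatenation of the two. Layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SplitBottleneckAE(nn.Module):
    """Auto-encoder whose bottleneck is split into a speaker subspace and a
    reverberation/noise subspace (layer sizes are assumptions)."""
    def __init__(self, emb_dim=1024, spk_dim=512, env_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(emb_dim, 1024), nn.LeakyReLU())
        self.to_spk = nn.Linear(1024, spk_dim)   # targeted to hold clean speaker information
        self.to_env = nn.Linear(1024, env_dim)   # targeted to hold reverberation / noise
        self.decoder = nn.Linear(spk_dim + env_dim, emb_dim)

    def forward(self, emb):
        h = self.encoder(emb)
        spk, env = self.to_spk(h), self.to_env(h)
        recon = self.decoder(torch.cat([spk, env], dim=1))
        return recon, spk, env   # spk is used as the enhanced embedding

recon, spk, env = SplitBottleneckAE()(torch.randn(4, 1024))
print(recon.shape, spk.shape, env.shape)
```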
| 0:11:09 | In the training setup, |
|---|
| 0:11:11 | additional loss functions are adopted to minimize the intra-class variance and maximize the |
|---|
| 0:11:16 | inter-class variance: |
|---|
| 0:11:18 | we utilize the center loss and an inter-class distance margin loss. |
|---|
| 0:11:23 | The center loss is reported to minimize the intra-class variance while keeping the embeddings |
|---|
| 0:11:28 | discriminative. |
|---|
| 0:11:31 | An inter-class loss function is additionally used to maximize the inter-class |
|---|
| 0:11:36 | variance. |
|---|
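A compact sketch of the two auxiliary objectives mentioned: a center loss that pulls each speaker's embeddings toward its class center, reducing intra-class variance, and an inter-class term that pushes the centers apart. The exact inter-class formulation is not given in the talk, so the pairwise hinge on center distances below is an assumption.

```python
import torch
import torch.nn as nn

class CenterInterLoss(nn.Module):
    """Center loss plus a simple pairwise inter-class separation term (assumed form)."""
    def __init__(self, n_speakers, dim, margin=10.0):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_speakers, dim))
        self.margin = margin

    def forward(self, emb, labels):
        # Intra-class: squared distance from each embedding to its speaker's center.
        center_loss = ((emb - self.centers[labels]) ** 2).sum(dim=1).mean()
        # Inter-class: hinge penalty on pairwise distances between distinct centers.
        dist = torch.cdist(self.centers, self.centers)
        off_diag = dist[~torch.eye(len(self.centers), dtype=torch.bool)]
        inter_loss = torch.clamp(self.margin - off_diag, min=0).mean()
        return center_loss, inter_loss

c_loss, i_loss = CenterInterLoss(n_speakers=200, dim=512)(
    torch.randn(8, 512), torch.randint(0, 200, (8,)))
print(c_loss.item(), i_loss.item())
```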
| 0:11:40 | In the same manner as the previous SC system, a weighted loss function is used to train |
|---|
| 0:11:46 | the encoder and decoder. |
|---|
| 0:11:48 | To address the imbalance between the number of source utterances |
|---|
| 0:11:52 | and distant utterances in the training set, |
|---|
| 0:11:54 | a different sample weight |
|---|
| 0:11:58 | is given according to the input. |
|---|
| 0:12:01 | The categorical cross-entropy loss function is also used, as in the SC system, for |
|---|
| 0:12:05 | speaker identification. |
|---|
| 0:12:08 | The final loss function of the proposed system |
|---|
| 0:12:12 | is described below. |
|---|
| 0:12:14 | Here, gamma is a hyperparameter that scales the weighting of the reconstruction |
|---|
| 0:12:19 | term, |
|---|
| 0:12:20 | and the alpha terms are hyperparameters that combine the overall loss function with the center loss and the inter-class |
|---|
| 0:12:27 | loss. |
|---|
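A hedged sketch of combining the loss terms just listed: a sample-weighted reconstruction term, the speaker identification cross-entropy, the center loss, and the inter-class loss, mixed with scalar hyperparameters. The gamma and alpha symbols follow the talk, but the way they are attached to particular terms here, and the example values, are assumptions.

```python
import torch

def dae_loss(recon_err, id_err, center_err, inter_err, sample_w,
             gamma=1.0, alpha_center=0.01, alpha_inter=0.01):
    """Combine the loss terms of the proposed auto-encoder (weights are assumptions).

    recon_err is a per-sample reconstruction error; sample_w counters the imbalance
    between source and distant utterances in the training set."""
    weighted_recon = (sample_w * recon_err).mean()
    return gamma * weighted_recon + id_err + alpha_center * center_err + alpha_inter * inter_err

loss = dae_loss(recon_err=torch.rand(8),
                id_err=torch.tensor(2.3),
                center_err=torch.tensor(5.0),
                inter_err=torch.tensor(1.0),
                sample_w=torch.tensor([2., 1., 2., 1., 1., 1., 2., 1.]))
print(loss.item())
```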
| 0:12:29 | Now let's move on to the experiments and results. |
|---|
| 0:12:34 | The training set comprises the VOiCES development set |
|---|
| 0:12:38 | and the VoxCeleb 1 and 2 datasets. |
|---|
| 0:12:42 | For the baseline RawNet system, |
|---|
| 0:12:43 | input waveforms are cropped |
|---|
| 0:12:46 | to a fixed number of samples, |
|---|
| 0:12:48 | which corresponds to a few seconds of speech, |
|---|
| 0:12:51 | for mini-batch construction. |
|---|
| 0:12:54 | To do so, |
|---|
| 0:12:55 | we duplicate short utterances and crop long ones. |
|---|
| 0:12:59 | All the details are presented in the paper. |
|---|
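The mini-batch construction mentioned above requires fixed-length inputs; below is a small sketch of the usual way to do this (tile utterances that are too short and randomly crop ones that are too long). The target length of 48,000 samples, three seconds at 16 kHz, is an illustrative assumption, not the exact figure from the talk.

```python
import numpy as np

def fix_length(wav, target_len=48000):
    """Tile short waveforms and randomly crop long ones to a fixed sample count."""
    if len(wav) < target_len:                        # duplicate (tile) short utterances
        wav = np.tile(wav, target_len // len(wav) + 1)
    start = np.random.randint(0, len(wav) - target_len + 1)
    return wav[start:start + target_len]             # random crop to the target length

batch = np.stack([fix_length(np.random.randn(np.random.randint(8000, 80000)))
                  for _ in range(4)])
print(batch.shape)  # (4, 48000)
```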
| 0:13:04 | The baseline system used the RawNet architecture |
|---|
| 0:13:07 | with some modifications. |
|---|
| 0:13:10 | First, the number of output nodes was |
|---|
| 0:13:15 | increased |
|---|
| 0:13:17 | to consider more speakers. |
|---|
| 0:13:20 | Secondly, |
|---|
| 0:13:21 | we increased the dimensionality of the speaker embedding to one thousand twenty-four. |
|---|
| 0:13:28 | "'kay" the glow described here top on it in a single system o'connor's from the | 
|---|
| 0:13:33 | always the challenge | 
|---|
| 0:13:34 | and our baseline system with various congregation | 
|---|
| 0:13:38 | target comparison between the current system in our baseline | 
|---|
| 0:13:43 | The compared configurations differ in the |
|---|
| 0:13:46 | input features, |
|---|
| 0:13:47 | the training configuration, |
|---|
| 0:13:49 | and the back-end classifiers. |
|---|
| 0:13:52 | We first describe the results of different strategies for using the VOiCES dataset for training. |
|---|
| 0:13:58 | In one strategy, |
|---|
| 0:14:00 | we first trained the network using VoxCeleb 2, |
|---|
| 0:14:03 | and then, |
|---|
| 0:14:04 | on top of that, |
|---|
| 0:14:06 | conducted fine-tuning with the VOiCES set. |
|---|
| 0:14:09 | However, our results showed that, among the training schemes considered, |
|---|
| 0:14:15 | training on all three datasets simultaneously provides the best performance. |
|---|
| 0:14:23 | For the proposed SC system, we explored the learning rate scheduler and optimizer. |
|---|
| 0:14:29 | The best performance was obtained when using a cosine annealing learning rate |
|---|
| 0:14:34 | scheduler. |
|---|
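For reference, a cosine annealing learning-rate schedule of the kind mentioned can be set up in PyTorch as below; the model, optimizer choice, and step count are placeholders.

```python
import torch

model = torch.nn.Linear(1024, 1024)                  # placeholder model
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # placeholder optimizer
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)

for step in range(100):
    opt.step()     # a real training step (forward/backward) would go here
    sched.step()   # decay the learning rate along a cosine curve
print(opt.param_groups[0]["lr"])
```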
| 0:14:36 | The SC system showed an EER |
|---|
| 0:14:38 | in the six percent range |
|---|
| 0:14:40 | on the test set, a relative error reduction of roughly eleven percent |
|---|
| 0:14:46 | compared to the baseline. |
|---|
| 0:14:50 | We experimented with the proposed auto-encoder system, varying the batch size and the margin. |
|---|
| 0:14:56 | The best performance was obtained |
|---|
| 0:14:59 | with the batch size set to ten thousand. |
|---|
| 0:15:03 | This system showed an EER in the six percent range on the test set |
|---|
| 0:15:08 | and a fifteen point nine seven percent relative reduction |
|---|
| 0:15:11 | compared to the baseline. |
|---|
| 0:15:16 | Score normalization techniques are frequently applied under various acoustic mismatch conditions. |
|---|
| 0:15:22 | Most of the participants in the VOiCES 2019 challenge also |
|---|
| 0:15:27 | used score normalization techniques |
|---|
| 0:15:31 | such as z-norm and s-norm. |
|---|
| 0:15:36 | We experimented with these techniques on our baseline and our two proposed systems, |
|---|
| 0:15:43 | the SC and the auto-encoder, |
|---|
| 0:15:45 | and the results are reported in the table below. |
|---|
| 0:15:48 | The results show that z-norm demonstrated the best performance in most cases in our experiments. |
|---|
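Score normalization such as the z-norm mentioned here standardizes each trial score using statistics gathered from an impostor cohort scored against the enrollment model; a minimal NumPy sketch with synthetic cohort scores.

```python
import numpy as np

def z_norm(raw_score, cohort_scores):
    """Z-norm: standardize a trial score with the enrollment model's cohort statistics."""
    mu, sigma = cohort_scores.mean(), cohort_scores.std()
    return (raw_score - mu) / sigma

cohort = np.random.randn(500) * 0.1 - 0.2   # synthetic scores of the enrollment model vs. a cohort
print(z_norm(0.35, cohort))
```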
| 0:15:55 | In addition, a score-level fusion of the two proposed systems |
|---|
| 0:15:59 | brought additional performance improvement, |
|---|
| 0:16:02 | with an EER of |
|---|
| 0:16:03 | six point one nine percent with z-norm. |
|---|
| 0:16:08 | Finally, we present the conclusion. |
|---|
| 0:16:13 | In this study, we proposed two speaker embedding enhancement systems. |
|---|
| 0:16:18 | Both proposed systems are independent of the front-end speaker embedding extraction, |
|---|
| 0:16:23 | and they can process not only distant utterances but also close-talk utterances. |
|---|
| 0:16:29 | This property can prevent the performance degradation that occurs |
|---|
| 0:16:33 | when close-talk utterances are input into a speaker embedding enhancement system |
|---|
| 0:16:37 | designed only for distant utterances. |
|---|
| 0:16:41 | Compared to the baseline system, the two proposed systems, the SC and the auto-encoder, |
|---|
| 0:16:47 | improved performance by a relative eleven point two three percent |
|---|
| 0:16:51 | and fourteen point nine three percent, respectively. |
|---|
| 0:16:55 | These results show the effectiveness of the proposed systems on inputs of both close-talk and distant utterances. |
|---|
| 0:17:01 | In our future work, we are considering integrating the two proposed systems into a single speaker |
|---|
| 0:17:07 | embedding enhancement system. |
|---|
| 0:17:12 | Thank you for listening. |
|---|