| 0:00:14 | ... in computer science from Avignon University, France. | 
|---|
| 0:00:26 | The title of our work is "A Stacked Denoising Auto-Encoder for Robust Speaker Recognition". | 
|---|
| 0:00:33 | In this work we focus on additive noise, and on compensating for additive noise in speaker recognition systems that use x-vector embeddings. | 
|---|
| 0:00:51 | Firstly, we discuss the problem of additive noise and its effect on speaker recognition, specifically in the x-vector framework. | 
|---|
| 0:01:04 | After that, we give a summary of previous works that are known to compensate for additive noise at different levels. | 
|---|
| 0:01:16 | After that, we discuss the different denoising techniques that we used to compensate for additive noise in the x-vector domain. | 
|---|
| 0:01:30 | Here you can see the names of the denoising techniques that we use: i-MAP and the denoising auto-encoder, which are older techniques, and the Gaussian denoising auto-encoder and the stacked denoising auto-encoders, which are new architectures that we introduce in this paper. | 
|---|
| 0:01:54 | After that, I speak about the experimental protocol and the results achieved by the denoising auto-encoders in noisy environments. | 
|---|
| 0:02:08 | Here you can see the problem of additive noise. There are new techniques used for speaker modeling, like deep learning techniques, that use data augmentation to add information about noise, reverberation and so on, in order to create a system that is robust in noisy environments. | 
|---|
| 0:02:36 | But even when we use a state-of-the-art speaker modeling system like the x-vector, if we see new noises that were not seen during data augmentation, the results become dramatically worse. This problem motivates us to do compensation for additive noise in the x-vector framework. | 
|---|
| 0:03:06 | In a speaker recognition system we are not looking for the clean signal itself; we just want to keep the performance of recognizing speakers. | 
|---|
| 0:03:21 | We can do additive noise compensation at different levels: at the signal level, at the feature level (for example, doing noise compensation on MFCCs), or at higher levels like the x-vector, i.e. the speaker modeling level. In our research we do the compensation at the x-vector level, because at this level the vectors have a Gaussian distribution, and working at this level, with lower dimensions, is easier. | 
|---|
| 0:04:03 | In previous works we can see some research at the signal level. For example, in the first row you can see a paper from 2019 in which different techniques, a convolutional network and a BLSTM, are used to denoise features like the log-magnitude spectrum and the STFT. In another paper you can see that the denoising is done on raw speech. | 
|---|
| 0:04:43 | In previous research in the i-vector domain, several statistical and neural techniques were proposed for denoising. For example, Ben Kheder et al. proposed i-MAP to map from noisy to clean i-vectors. There are also other techniques, like denoising auto-encoders, that are trained in the same manner and try to map from noisy i-vectors to clean i-vectors. | 
|---|
| 0:05:25 | based on that | 
|---|
| 0:05:26 | because the noisy techniques you would do you lose results and i-vector domain we can | 
|---|
| 0:05:33 | propose the previous techniques | 
|---|
| 0:05:36 | in extractors is also or | 
|---|
| 0:05:39 | we can make the proposed and you techniques would you noise in it "'cause" i-vector | 
|---|
| 0:05:43 | space | 
|---|
| 0:05:48 | The first technique that is used for denoising is i-MAP, a statistical technique that was used for denoising in the i-vector space. In i-MAP we assume that the clean and noisy x-vectors follow a Gaussian distribution, and the noise random variable is defined as the difference between the clean and noisy vectors. | 
|---|
| 0:06:20 | Here you can see the probability of the clean x-vector x given the noisy x-vector x0. We use MAP estimation: given x0, we estimate the clean, i.e. denoised, version of the x-vector. | 
|---|
| 0:06:51 | Here you can see the final solution of this formula. Σn is the covariance of the noise computed from the noisy x-vectors used for training and μn is its average; Σx is the covariance of the clean x-vectors used for training and μx is the average of the clean x-vectors. | 
|---|
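As a reference for the reader, under the stated assumptions the closed-form solution is the standard Gaussian MAP (posterior-mean) estimate. This is a reconstruction from the definitions given in the talk, interpreting μn and Σn as the statistics of the noise term n = x0 − x, so the exact formulation in the paper may differ:

```latex
% Model: x_0 = x + n, with x ~ N(mu_x, Sigma_x) and n ~ N(mu_n, Sigma_n)
\hat{x} = \arg\max_x \, p(x \mid x_0)
        = \left(\Sigma_x^{-1} + \Sigma_n^{-1}\right)^{-1}
          \left(\Sigma_x^{-1}\mu_x + \Sigma_n^{-1}\,(x_0 - \mu_n)\right)
```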
| 0:07:25 | The second technique that is used in our paper for denoising is the conventional denoising auto-encoder. A conventional denoising auto-encoder tries to minimize L(x, f(y)), where L is the loss function, y is the distorted (noisy) x-vector, and f(y) is the output of the denoising auto-encoder, i.e. the denoised x-vector. | 
|---|
| 0:07:54 | Briefly, a denoising auto-encoder tries to minimize the distance between the noisy x-vectors and the clean x-vectors. | 
|---|
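Written out, with y_i a noisy training x-vector, x_i its clean counterpart, and f_θ the auto-encoder, the objective is the usual reconstruction loss (shown here with the mean squared error that is used later in the talk):

```latex
\hat{\theta} = \arg\min_{\theta} \; \frac{1}{N} \sum_{i=1}^{N} \left\| x_i - f_{\theta}(y_i) \right\|^2
```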
| 0:08:04 | We use this architecture in our research. Here you can see that in the input and output layers we use five hundred and twelve nodes with a linear activation function. The number of nodes in these layers is the same because we want to map the noisy x-vectors exactly: the output layer must have exactly the same dimension as the noisy x-vectors. | 
|---|
| 0:08:37 | In the hidden layer we use one thousand and twenty-four nodes with the non-linear hyperbolic tangent (tanh) activation function. | 
|---|
| 0:08:50 | The loss function that is used for denoising in this paper is the mean squared error. Our denoising auto-encoder is trained with stochastic gradient descent. | 
|---|
| 0:09:04 | It should be mentioned that we used one thousand and twenty-four nodes in the hidden layer because if you use a small number of nodes in this layer you may lose information; it is better to use a larger number of nodes in the hidden layer. | 
|---|
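A minimal PyTorch sketch of the architecture just described: 512-dimensional input and output with linear activations, a 1024-node tanh hidden layer, MSE loss, plain SGD. The learning rate and batch size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class XVectorDAE(nn.Module):
    def __init__(self, dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),   # 512 -> 1024
            nn.Tanh(),                # non-linear hidden activation
            nn.Linear(hidden, dim),   # 1024 -> 512, linear output activation
        )

    def forward(self, noisy):
        return self.net(noisy)

model = XVectorDAE()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# one SGD step on a mini-batch of (noisy, clean) x-vector pairs;
# random tensors stand in for real pairs here
noisy = torch.randn(64, 512)
clean = torch.randn(64, 512)
loss = criterion(model(noisy), clean)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```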
| 0:09:27 | Another technique that is used in our paper is the combination of the denoising auto-encoder and i-MAP. We call it i-MAP here as before, because it was originally used in the i-vector system. In this architecture we have noisy x-vectors: we first denoise these vectors with the denoising auto-encoder, then we feed the output of the denoising auto-encoder to i-MAP. By doing this step, we push the system towards x-vectors that have a known statistical distribution. | 
|---|
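A sketch of the cascade, with `dae` standing for the trained auto-encoder above and `imap_denoise` implementing the Gaussian MAP estimate from the earlier slide; the function names and the statistics arguments are placeholders for illustration:

```python
import numpy as np

def imap_denoise(x0, mu_x, sigma_x, mu_n, sigma_n):
    """Gaussian MAP estimate of the clean x-vector (the i-MAP step)."""
    sx_inv = np.linalg.inv(sigma_x)
    sn_inv = np.linalg.inv(sigma_n)
    # (Sigma_x^-1 + Sigma_n^-1)^-1 (Sigma_x^-1 mu_x + Sigma_n^-1 (x0 - mu_n))
    return np.linalg.solve(sx_inv + sn_inv,
                           sx_inv @ mu_x + sn_inv @ (x0 - mu_n))

def dae_plus_imap(x_noisy, dae, mu_x, sigma_x, mu_n, sigma_n):
    x_dae = dae(x_noisy)  # first-pass neural denoising
    # statistical refinement: i-MAP's Gaussian assumption is closer to
    # being satisfied on the auto-encoder's output
    return imap_denoise(x_dae, mu_x, sigma_x, mu_n, sigma_n)
```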
| 0:10:07 | In another technique that we introduce, which we call the Gaussian denoising auto-encoder, we are given noisy x-vectors and we impose on the denoising auto-encoder to give a Gaussian distribution for the denoised x-vectors. Here you can see the loss function that imposes these restrictions on the output of the denoising auto-encoder. Again, μn is the average of the noisy x-vectors and Σn is their covariance; μx is the average of the clean x-vectors and Σx is the covariance of the clean x-vectors. | 
|---|
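The exact loss is given in the paper; a plausible sketch of such a distribution-matching objective, where μ̂ and Σ̂ denote the mean and covariance of the denoised outputs f_θ(y) over a batch and λ is an assumed trade-off weight, is:

```latex
\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| x_i - f_{\theta}(y_i) \right\|^2
  + \lambda \left( \left\| \hat{\mu} - \mu_x \right\|^2
  + \left\| \hat{\Sigma} - \Sigma_x \right\|_F^2 \right)
```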
| 0:11:04 | The final technique that we used is the stacked denoising auto-encoder. This type of denoising auto-encoder tries to find an estimation of the noise. By estimating the noise we can achieve better results: we did an experiment in which we gave the exact information about the noise to the system, and we achieved very good results, close to the clean environment. | 
|---|
| 0:11:37 | We use this architecture: firstly we feed the noisy x-vectors to the first denoising auto-encoder, and we get a first estimation of the denoised x-vector. By calculating the difference between the noisy x-vectors and the output of the first block, we find an estimation of the noise, and we give this information to the second block. We repeat this in the same manner to get a better estimation of the noise, which is used in the next block to achieve better results at the output. We train all these blocks jointly. | 
|---|
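A minimal sketch of a two-block version. The exact wiring is an assumption: here each block after the first receives the noisy x-vector concatenated with the current noise estimate (the difference between the noisy input and the previous block's output), and the whole stack is trained jointly with a single MSE loss:

```python
import torch
import torch.nn as nn

class DAEBlock(nn.Module):
    def __init__(self, in_dim, dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Tanh(), nn.Linear(hidden, dim)
        )

    def forward(self, x):
        return self.net(x)

class StackedDAE(nn.Module):
    def __init__(self, dim=512, n_blocks=2):
        super().__init__()
        # the first block sees only the noisy x-vector; later blocks also
        # see the running noise estimate, hence the doubled input size
        self.blocks = nn.ModuleList(
            [DAEBlock(dim)] + [DAEBlock(2 * dim) for _ in range(n_blocks - 1)]
        )

    def forward(self, noisy):
        denoised = self.blocks[0](noisy)
        for block in self.blocks[1:]:
            noise_est = noisy - denoised  # estimate of the additive noise
            denoised = block(torch.cat([noisy, noise_est], dim=-1))
        return denoised

model = StackedDAE(n_blocks=2)
noisy = torch.randn(8, 512)
clean = torch.randn(8, 512)
loss = nn.functional.mse_loss(model(noisy), clean)  # joint training loss
```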
| 0:12:27 | We use several datasets in our paper. MUSAN is used for data augmentation to train the x-vector extraction network, and separate noise recordings are used to create the noisy x-vectors for the training and the test of the denoising techniques. Two VoxCeleb corpora are also used for training: VoxCeleb 1 with data augmentation is used to train the x-vector network, and a combination of VoxCeleb 1 and VoxCeleb 2 is used to create the noisy x-vectors to train the denoising techniques. FABIOLE is a French corpus that is used for test and enrollment in our experiments. We divide the FABIOLE corpus into subsets based on the duration of the files, to calculate the results for different durations. | 
|---|
| 0:13:33 | Here you can see the steps that we followed in our experiments. Firstly, we trained the x-vector network with the Kaldi recipe; to train this network we used the augmented VoxCeleb 1 data. Then we used this network to create the training data for the denoising techniques: we created about four million noisy/clean pairs of x-vectors from VoxCeleb 1 and VoxCeleb 2. We also extracted the enrollment and test x-vectors from the FABIOLE speech corpus, and we added noise to our test data to create a noisy version. | 
|---|
| 0:14:26 | We used disjoint noise sets because we want to make our system robust against unseen noises. We used MUSAN to perform the data augmentation when training the x-vector network, but in this step the noise files that are used to create the noisy x-vectors for training are different from the noises that are used for the test. So the noises that appear in the test are unseen. | 
|---|
| 0:15:10 | After that we train the PLDA, which is used as the backend scoring technique, and we do the scoring. But before scoring we do the denoising, to reduce the effect of noise on our test files. | 
|---|
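Putting the protocol together, the scoring side of a single trial looks roughly like this; `extract_xvector`, `denoise`, and `plda_score` are hypothetical stand-ins for the x-vector extractor, any of the denoising techniques above, and the PLDA backend:

```python
def score_trial(enroll_wav, test_wav, extract_xvector, denoise, plda_score):
    """Sketch of one verification trial: extract x-vectors, denoise the
    (possibly noisy) test side, then score with the PLDA backend."""
    x_enroll = extract_xvector(enroll_wav)        # enrollment side
    x_test = denoise(extract_xvector(test_wav))   # denoise before scoring
    return plda_score(x_enroll, x_test)
```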
| 0:15:31 | Here you can see the results. We use the equal error rate metric for the different experiments. In the first row you can see the results for different durations: for example, in the first column, for utterances shorter than two seconds, we achieve 11.59% equal error rate when we don't have noise, and for utterances longer than twelve seconds we have 0.8%. | 
|---|
| 0:16:05 | In the second row we can see the impact of noise: for example, for short utterances the equal error rate increases from eleven to fifteen, and for utterances longer than twelve seconds it increases from 0.8 to 5.1. | 
|---|
| 0:16:25 | These results show that it is important to do denoising before scoring: when the noise is unseen by the x-vector network, our assumption holds, and using a denoising component before scoring is very important. | 
|---|
| 0:16:42 | Here you can see the results obtained by the statistical i-MAP technique: for utterances longer than twelve seconds, the equal error rate is reduced from 5.1 to 2.6. | 
|---|
| 0:16:58 | In the next row you see the results obtained after applying the denoising auto-encoder on the x-vectors. In the row after that, you see the results obtained by the combination of the denoising auto-encoder and i-MAP. | 
|---|
| 0:17:19 | In the next row you see the results obtained with the Gaussian denoising auto-encoder, i.e. with the new loss function that we used in our experiments to train the denoising auto-encoder, imposing that the denoised x-vectors belong to a Gaussian distribution. | 
|---|
| 0:17:43 | Here you see the results for the stacked denoising auto-encoder. In the fourth row you can see the results when we use just two blocks, the first and the second block. As you can see, in both cases the results are better than the previous techniques for utterances between eight and ten seconds, between ten and twelve seconds, and longer than twelve seconds. | 
|---|
| 0:18:11 | In the last row you see the results for the case where we use the stacked denoising auto-encoder with exactly the same architecture that was shown earlier in this talk. In this case, in almost all cases, we have better results than the previous techniques. | 
|---|
| 0:18:38 | In our paper we showed that data augmentation and deep learning techniques are important for achieving a noise-robust speaker recognition system, but when the noise is unseen, we can obtain better results in the x-vector space if we use denoising techniques. We showed that a simple statistical method like i-MAP, which was used in the i-vector space, can also be used for x-vector denoising. After that, we showed that merging the advantages of the statistical method and the denoising auto-encoder can give better results. | 
|---|
| 0:19:20 | Finally, we introduced a new technique called the stacked denoising auto-encoder, which tries to find information about the noise and to use this information in the deeper blocks of the stack. With this technique, in almost all cases we achieved better results than the statistical i-MAP technique or the conventional denoising auto-encoders. | 
|---|
| 0:19:55 | Thanks for your attention. | 
|---|