0:00:14 ...and she is in computer science at Avignon University, France.
0:00:26 The title of our work is "Stacked denoising autoencoders for robust speaker recognition".
0:00:33 In this work we focus on additive noise: we try to compensate for additive noise in speaker recognition systems that use x-vector embeddings.
0:00:51 Firstly, we discuss the problem of additive noise and its effect on speaker recognition, specifically in the x-vector framework.
0:01:04 After that, we give an overview of previous works that compensate for additive noise at different levels.
0:01:16 Then we discuss the different denoising techniques that we used to compensate for additive noise in the x-vector domain.
0:01:30 Here you can see the names of the denoising techniques that we used: i-MAP and the denoising autoencoder, which are existing techniques, and the Gaussian denoising autoencoder and the stacked denoising autoencoder, which are new architectures that we introduce in this paper.
0:01:54 After that, I will present the experimental protocol and the results achieved by the denoising autoencoders in noisy environments.
0:02:08 Here you can see the problem of additive noise. There are recent techniques for speaker modeling, like deep learning techniques, that use data augmentation to add information about noise, reverberation and so on, in order to create a system that is robust in noisy environments.
0:02:36 But even when we use a state-of-the-art speaker modeling system like the x-vector, if we face new noises that were not seen during data augmentation, we can see that the results degrade dramatically.
0:02:56 This problem motivates us to do compensation for additive noise in the x-vector framework: in a speaker recognition system we are not looking for the clean signal itself, we just want to keep the performance of recognizing speakers.
0:03:21 We can do noise compensation at different levels, I mean low levels like the signal or the features, for example doing noise compensation on MFCCs, or we can work at higher levels like the x-vector, that is, the speaker modeling level. In our research we try to do compensation at the x-vector level, because at this level the vectors have a Gaussian distribution and the dimensionality is lower, so working at this level is easier.
0:04:03 In previous works we can see that some researchers work at the signal level. For example, in the first row you can see a paper from 2019 in which different techniques, one-dimensional convolutional networks and BLSTMs, are used to denoise features like the log-magnitude spectrum and the STFT. In another paper you can see that the denoising is done on the raw speech.
0:04:43 In previous research done in the i-vector domain, several statistical and neural techniques were proposed for denoising. For example, Ben Kheder et al. proposed the i-MAP technique to map from noisy to clean i-vectors, and there are also other techniques, like denoising autoencoders, that are trained in the same manner and try to map from noisy i-vectors to clean i-vectors.
0:05:25 Based on that, since these denoising techniques gave good results in the i-vector domain, we can apply the previous techniques to x-vectors as well, or we can propose new techniques to do denoising in the x-vector space.
0:05:48 The first technique that we used for denoising is i-MAP, a statistical technique that was used for denoising in the i-vector space. In i-MAP we assume that clean and noisy x-vectors follow a Gaussian distribution, and that the noise random variable is the difference between the clean and the noisy vectors.
0:06:20 Here you can see the probability of the clean x-vector given the noisy x-vector. We use MAP estimation to estimate the clean, denoised version of the x-vector from the noisy one. Here you can see the final solution of this formula.
0:06:57 Sigma_N is the covariance of the noisy x-vectors that are used for training, mu_N is the average of the noisy x-vectors, Sigma_X is the covariance of the clean x-vectors that are used for training, and mu_X is the average of the clean x-vectors.
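A hedged reconstruction of the MAP solution under these Gaussian assumptions (the exact expression is the one on the slide; taking mu_N and Sigma_N as the mean and covariance of the additive noise, estimated from the noisy/clean training pairs, is an assumption):

```latex
\hat{x} \;=\; \arg\max_{x}\, p\!\left(x \mid x_{\nu}\right)
        \;=\; \left(\Sigma_X^{-1} + \Sigma_N^{-1}\right)^{-1}
              \left(\Sigma_X^{-1}\,\mu_X \;+\; \Sigma_N^{-1}\!\left(x_{\nu} - \mu_N\right)\right)
```

where x_nu is the observed noisy x-vector, the clean x-vector follows N(mu_X, Sigma_X) and the additive noise follows N(mu_N, Sigma_N); the denoised vector is a precision-weighted combination of the clean prior and the noise-compensated observation.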
0:07:25 The second technique that is used in our paper for denoising is the conventional denoising autoencoder. A conventional denoising autoencoder tries to minimize L(x, f(y)), where L is the loss function, y is the distorted (noisy) x-vector, and f(y) is the output of the denoising autoencoder for the noisy x-vector. Briefly, the denoising autoencoder is trained to minimize the distance between the denoised noisy x-vectors and the clean x-vectors.
0:08:04 We use this architecture in our research. Here you can see that in the input and output layers we use 512 nodes with a linear activation function; the number of nodes in these layers is the same because we want to exactly map the noisy x-vectors, so we want to have exactly the same dimension as the noisy x-vector at the output layer. In the hidden layer we use 1024 nodes with the non-linear hyperbolic tangent activation function.
0:08:50 The loss function that is used for denoising in this paper is the mean squared error, and the denoising autoencoder is trained with stochastic gradient descent. It should be mentioned that we used 1024 nodes in the hidden layer because if you use a small number of nodes in this layer you may lose information, and it is better to use more nodes in the hidden layer.
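A minimal sketch of the conventional denoising autoencoder described here (not the authors' code; the 512-dimensional x-vector size and the learning rate are assumptions, the 1024-node tanh hidden layer, linear output, MSE loss and SGD follow the talk):

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, xvec_dim=512, hidden_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(xvec_dim, hidden_dim),
            nn.Tanh(),                        # non-linear hidden layer
            nn.Linear(hidden_dim, xvec_dim),  # linear output, same size as the input
        )

    def forward(self, noisy_xvec):
        return self.net(noisy_xvec)

dae = DenoisingAutoencoder()
optimizer = torch.optim.SGD(dae.parameters(), lr=0.01)  # learning rate is an assumption
mse = nn.MSELoss()

def train_step(noisy_batch, clean_batch):
    """One SGD step: map noisy x-vectors toward their clean counterparts."""
    optimizer.zero_grad()
    loss = mse(dae(noisy_batch), clean_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```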
0:09:27 Another technique that is used in our paper is the combination of the denoising autoencoder and i-MAP (we still call it i-MAP here even though it was originally used for the i-vector system). In this architecture we have noisy x-vectors; we first try to denoise these vectors with the denoising autoencoder, and then we give the output of the denoising autoencoder to i-MAP. By doing this step we push our system toward x-vectors that have a known, Gaussian statistical distribution.
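A minimal sketch of this cascade, assuming a trained `dae` as above and the i-MAP closed form with statistics mu_x, sigma_x, mu_n, sigma_n estimated on training data; the helper names are hypothetical:

```python
import numpy as np
import torch

def imap_denoise(x_noisy, mu_x, sigma_x, mu_n, sigma_n):
    """MAP estimate of the clean x-vector under the Gaussian assumptions above."""
    prec_x, prec_n = np.linalg.inv(sigma_x), np.linalg.inv(sigma_n)
    return np.linalg.inv(prec_x + prec_n) @ (prec_x @ mu_x + prec_n @ (x_noisy - mu_n))

def dae_then_imap(noisy_xvec, dae, mu_x, sigma_x, mu_n, sigma_n):
    """First denoise with the autoencoder, then refine with i-MAP."""
    with torch.no_grad():
        first_pass = dae(torch.as_tensor(noisy_xvec, dtype=torch.float32)).numpy()
    return imap_denoise(first_pass, mu_x, sigma_x, mu_n, sigma_n)
```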
0:10:07 In another technique that we introduce, called the Gaussian denoising autoencoder, we give noisy x-vectors as input and we push the denoising autoencoder to give a Gaussian distribution for the denoised x-vectors. Here you can see the loss function that imposes this restriction on the output of the denoising autoencoder. Here again, mu_N is the average of the noisy x-vectors, Sigma_N is the covariance of the noisy x-vectors, mu_X is the average of the clean x-vectors, and Sigma_X is the covariance of the clean x-vectors.
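The exact loss is the one shown on the slide; as an illustrative assumption of how such a constraint could be written, one option is to add moment-matching terms that pull the batch statistics of the denoised x-vectors toward the clean mean and covariance:

```python
import torch

def gaussian_dae_loss(denoised, clean, mu_x, sigma_x, alpha=1.0, beta=1.0):
    """MSE plus (assumed) penalties pushing the output statistics toward the clean Gaussian."""
    mse = torch.mean((denoised - clean) ** 2)
    batch_mu = denoised.mean(dim=0)
    centered = denoised - batch_mu
    batch_cov = centered.T @ centered / (denoised.shape[0] - 1)
    return (mse
            + alpha * torch.mean((batch_mu - mu_x) ** 2)
            + beta * torch.mean((batch_cov - sigma_x) ** 2))
```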
0:11:04 The final technique that we used is the stacked denoising autoencoder. This type of denoising autoencoder tries to find an estimation of the noise, because by estimating the noise we can achieve better results: we did an experiment in which we gave the exact information about the noise, and we obtained very good results, close to the clean environment. We use this architecture in which, firstly, we give the noisy x-vectors to the first denoising autoencoder and we obtain a first estimation of the denoised x-vector. Then, by calculating the difference between the noisy x-vectors and the output of the first block, we try to find an estimation of the noise, and we give this information to the second block. We repeat this in the same manner to have a better estimation of the noise and use this information in the next block to achieve better results at the output. We train all these blocks jointly.
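A minimal sketch of the stacked architecture described here (how exactly the noise estimate is fed to the next block, and the 512/1024 layer sizes, are assumptions consistent with the single-block autoencoder above):

```python
import torch
import torch.nn as nn

class StackedDAE(nn.Module):
    def __init__(self, xvec_dim=512, hidden_dim=1024, n_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList()
        for i in range(n_blocks):
            in_dim = xvec_dim if i == 0 else 2 * xvec_dim  # noisy x-vector (+ noise estimate)
            self.blocks.append(nn.Sequential(
                nn.Linear(in_dim, hidden_dim), nn.Tanh(),
                nn.Linear(hidden_dim, xvec_dim)))

    def forward(self, noisy):
        outputs = []
        inp = noisy
        for block in self.blocks:
            denoised = block(inp)
            noise_estimate = noisy - denoised            # current estimate of the additive noise
            inp = torch.cat([noisy, noise_estimate], dim=-1)
            outputs.append(denoised)
        return outputs  # all blocks are trained jointly against the clean target
```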
0:12:27 We have several datasets in our paper. MUSAN is used for the data augmentation to train the x-vector extraction network, and additional noise sets are used to create noisy x-vectors for training and testing the denoising techniques. VoxCeleb 1 and 2 are also used: VoxCeleb 2 with data augmentation is used to train the x-vector network, and a combination of VoxCeleb 1 and VoxCeleb 2 is used to create the noisy x-vectors to train the denoising techniques. FABIOLE is a French corpus that is used for the test and enrollment in our experiments; we divide this corpus into subsets based on the duration of the files, to compute the results for different durations.
0:13:33 Here you can see the steps that we followed in our experiments. Firstly, we trained the x-vector extractor; to train this network we used VoxCeleb 2 with data augmentation. Then we used this network to create the training data for the denoising techniques: we created about four million noisy/clean pairs of x-vectors from VoxCeleb 1 and VoxCeleb 2. We also extracted the enrollment and test x-vectors from the FABIOLE speech corpus.
0:14:19 We also add noise to our test data to create a noisy version. We used a separate noise set because we want to make our system robust against unseen noises: we used MUSAN to perform the data augmentation when training the x-vector network, but in this step we use different noises. The noise files that are used to create the noisy x-vectors for training the denoising techniques are also different from the noises that are used for the test, so the noises that are used in the test are unseen.
0:15:10 After that, we train the PLDA and we do the scoring; PLDA is used as the back-end scoring technique. But before scoring we do the denoising, to reduce the effect of noise on our test files.
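A minimal sketch of this evaluation step, with `denoiser`, `plda_score` and `load_xvector` as hypothetical stand-ins for the trained denoising model and the PLDA back-end actually used:

```python
def score_trials(trials, denoiser, plda_score, load_xvector):
    """Score enrollment/test trial pairs, denoising the test x-vector before PLDA scoring."""
    scores = []
    for enroll_id, test_id in trials:
        enroll_xvec = load_xvector(enroll_id)
        test_xvec = denoiser(load_xvector(test_id))  # denoising applied before scoring
        scores.append(plda_score(enroll_xvec, test_xvec))
    return scores
```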
0:15:31 Here you can see the results; we use the equal error rate as the metric for the different experiments. In the first row you can see the results for different durations. For example, in the first column, for utterances shorter than two seconds we have 11.59% equal error rate when we don't have noise, and for utterances longer than twelve seconds we have 0.8%. In the second row we can see the impact of noise: for example, for short utterances the equal error rate increases from 11 to 15, and for utterances longer than twelve seconds it increases from 0.8 to 5.1. These results show that it is important to do denoising before scoring: when the system faces unseen noise our assumption holds, and using a denoising component before scoring is very important. Here you can see the results obtained by the statistical x-vector denoising technique: for utterances longer than twelve seconds, the equal error rate is reduced from 5.1 to 2.6.
0:16:58 In the next row you see the results obtained after applying the denoising autoencoder on the x-vectors. In the next row you see the results obtained by the combination of the denoising autoencoder and i-MAP. In the following row you see the results obtained with the Gaussian denoising autoencoder, that is, the new loss function that we used in our experiments to train the denoising autoencoder and impose that the denoised x-vectors belong to a Gaussian distribution.
0:17:43 Here you see the results for the stacked denoising autoencoder. In this row you can see the results when we use just two blocks, the first and the second block; as you can see, in both cases the results are better than the previous techniques for utterances between eight and ten seconds, between ten and twelve seconds, and longer than twelve seconds. In the last row you see the results for the case where we use the stacked denoising autoencoder with exactly the same architecture that is shown in this figure. In this case, in almost all conditions, we have better results than the previous techniques.
0:18:38 In our paper we showed that data augmentation and deep learning techniques are important to achieve a noise-robust speaker recognition system, but even so, when we are in the x-vector space we can obtain better results if we use denoising techniques. We showed that a simple statistical method like i-MAP, which was used in the i-vector space, can be used for x-vector denoising as well. After that, we showed that merging the advantages of the statistical method and the denoising autoencoder can give better results. Finally, we introduced a new technique called the stacked denoising autoencoder, which tries to find information about the noise and use this information in the deeper blocks of the stacked denoising autoencoder. With this technique we achieved, in almost all cases, better results than the statistical technique i-MAP or the conventional denoising autoencoders.
0:19:55 Thanks for your attention.