| 0:00:14 | ... in computer science from Avignon University, France. | 
|---|
| 0:00:26 | The title of our work is "A Stacked Denoising Auto-Encoder for Robust Speaker Recognition". | 
|---|
| 0:00:33 | In this work we focus on additive noise, and on compensating for additive noise in speaker recognition systems that use x-vector embeddings. | 
|---|
| 0:00:51 | Firstly, we discuss the problem of additive noise and its effect on speaker recognition, specifically in the x-vector framework. | 
|---|
| 0:01:04 | After that, we give a summary of previous works that are known to compensate for additive noise at different levels. | 
|---|
| 0:01:16 | After that, we discuss the different denoising techniques that we used to compensate for additive noise in the x-vector domain. | 
|---|
| 0:01:30 | Here you can see the names of the denoising techniques that we use: i-MAP and the denoising auto-encoder, which are older techniques, and the Gaussian denoising auto-encoder and the stacked denoising auto-encoders, which are new architectures that we introduce in this paper. | 
|---|
| 0:01:54 | After that, I speak about the experimental protocol and the results achieved by the denoising auto-encoders in noisy environments. | 
|---|
| 0:02:08 | Here you can see the problem of additive noise. There are new techniques used for speaker modeling, like deep learning techniques, that use data augmentation to add information about noise, reverberation and so on, in order to create a system that is robust in noisy environments. | 
|---|
| 0:02:36 | But even when we use a state-of-the-art speaker modeling system like the x-vector, if we see new noises that were not seen during data augmentation, the results become dramatically worse. This problem motivates us to do compensation for additive noise in the x-vector framework. | 
|---|
| 0:03:06 | In a speaker recognition system we are not looking for the clean signal itself; we just want to keep the performance of recognizing speakers. | 
|---|
| 0:03:21 | We can do additive noise compensation at different levels: at the signal level, at the feature level (for example, doing noise compensation on MFCCs), or at higher levels like the x-vector, i.e. the speaker modeling level. In our research we do the compensation at the x-vector level, because at this level the vectors have a Gaussian distribution, and working at this level, with lower dimensions, is easier. | 
|---|
| 0:04:03 | In previous works we can see some research at the signal level. For example, in the first row you can see a paper from 2019 in which different techniques, a convolutional network and a BLSTM, are used to denoise features like the log-magnitude spectrum and the STFT. In another paper you can see that the denoising is done on raw speech. | 
|---|
| 0:04:43 | In previous research in the i-vector domain, several statistical and neural techniques were proposed for denoising. For example, Ben Kheder et al. proposed i-MAP to map from noisy to clean i-vectors. There are also other techniques, like denoising auto-encoders, that are trained in the same manner and try to map from noisy i-vectors to clean i-vectors. | 
|---|
| 0:05:25 | based on that | 
|---|
| 0:05:26 | because the noisy techniques you would do you lose results and i-vector domain we can | 
|---|
| 0:05:33 | propose the previous techniques | 
|---|
| 0:05:36 | in extractors is also or | 
|---|
| 0:05:39 | we can make the proposed and you techniques would you noise in it "'cause" i-vector | 
|---|
| 0:05:43 | space | 
|---|
| 0:05:48 | The first technique that is used for denoising is i-MAP, a statistical technique that was used for denoising in the i-vector space. In i-MAP we assume that the clean and noisy x-vectors follow a Gaussian distribution, and the noise random variable is defined as the difference between the clean and noisy vectors. | 
|---|
| 0:06:20 | Here you can see the probability of the clean x-vector x given the noisy x-vector x0. We use MAP estimation: given x0, we estimate the clean, i.e. denoised, version of the x-vector. | 
|---|
| 0:06:51 | Here you can see the final solution of this formula. Σn is the covariance of the noise computed from the noisy x-vectors used for training and μn is its average; Σx is the covariance of the clean x-vectors used for training and μx is the average of the clean x-vectors. | 
|---|
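As a reference for the reader, under the stated assumptions the closed-form solution is the standard Gaussian MAP (posterior-mean) estimate. This is a reconstruction from the definitions given in the talk, interpreting μn and Σn as the statistics of the noise term n = x0 − x, so the exact formulation in the paper may differ:

```latex
% Model: x_0 = x + n, with x ~ N(mu_x, Sigma_x) and n ~ N(mu_n, Sigma_n)
\hat{x} = \arg\max_x \, p(x \mid x_0)
        = \left(\Sigma_x^{-1} + \Sigma_n^{-1}\right)^{-1}
          \left(\Sigma_x^{-1}\mu_x + \Sigma_n^{-1}\,(x_0 - \mu_n)\right)
```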
| 0:07:25 | The second technique that is used in our paper for denoising is the conventional denoising auto-encoder. A conventional denoising auto-encoder tries to minimize L(x, f(y)), where L is the loss function, y is the distorted (noisy) x-vector, and f(y) is the output of the denoising auto-encoder, i.e. the denoised x-vector. | 
|---|
| 0:07:54 | Briefly, a denoising auto-encoder tries to minimize the distance between the noisy x-vectors and the clean x-vectors. | 
|---|
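Written out, with y_i a noisy training x-vector, x_i its clean counterpart, and f_θ the auto-encoder, the objective is the usual reconstruction loss (shown here with the mean squared error that is used later in the talk):

```latex
\hat{\theta} = \arg\min_{\theta} \; \frac{1}{N} \sum_{i=1}^{N} \left\| x_i - f_{\theta}(y_i) \right\|^2
```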
| 0:08:04 | We use this architecture in our research. Here you can see that in the input and output layers we use five hundred and twelve nodes with a linear activation function. The number of nodes in these layers is the same because we want to map the noisy x-vectors exactly: the output layer must have exactly the same dimension as the noisy x-vectors. | 
|---|
| 0:08:37 | In the hidden layer we use one thousand and twenty-four nodes with the non-linear hyperbolic tangent (tanh) activation function. | 
|---|
| 0:08:50 | The loss function that is used for denoising in this paper is the mean squared error. Our denoising auto-encoder is trained with stochastic gradient descent. | 
|---|
| 0:09:04 | It should be mentioned that we used one thousand and twenty-four nodes in the hidden layer because if you use a small number of nodes in this layer you may lose information; it is better to use a larger number of nodes in the hidden layer. | 
|---|
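A minimal PyTorch sketch of the architecture just described: 512-dimensional input and output with linear activations, a 1024-node tanh hidden layer, MSE loss, plain SGD. The learning rate and batch size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class XVectorDAE(nn.Module):
    def __init__(self, dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),   # 512 -> 1024
            nn.Tanh(),                # non-linear hidden activation
            nn.Linear(hidden, dim),   # 1024 -> 512, linear output activation
        )

    def forward(self, noisy):
        return self.net(noisy)

model = XVectorDAE()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# one SGD step on a mini-batch of (noisy, clean) x-vector pairs;
# random tensors stand in for real pairs here
noisy = torch.randn(64, 512)
clean = torch.randn(64, 512)
loss = criterion(model(noisy), clean)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```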
| 0:09:27 | Another technique that is used in our paper is the combination of the denoising auto-encoder and i-MAP. We call it i-MAP here as before, because it was originally used in the i-vector system. In this architecture we have noisy x-vectors: we first denoise these vectors with the denoising auto-encoder, then we feed the output of the denoising auto-encoder to i-MAP. By doing this step, we push the system towards x-vectors that have a known statistical distribution. | 
|---|
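A sketch of the cascade, with `dae` standing for the trained auto-encoder above and `imap_denoise` implementing the Gaussian MAP estimate from the earlier slide; the function names and the statistics arguments are placeholders for illustration:

```python
import numpy as np

def imap_denoise(x0, mu_x, sigma_x, mu_n, sigma_n):
    """Gaussian MAP estimate of the clean x-vector (the i-MAP step)."""
    sx_inv = np.linalg.inv(sigma_x)
    sn_inv = np.linalg.inv(sigma_n)
    # (Sigma_x^-1 + Sigma_n^-1)^-1 (Sigma_x^-1 mu_x + Sigma_n^-1 (x0 - mu_n))
    return np.linalg.solve(sx_inv + sn_inv,
                           sx_inv @ mu_x + sn_inv @ (x0 - mu_n))

def dae_plus_imap(x_noisy, dae, mu_x, sigma_x, mu_n, sigma_n):
    x_dae = dae(x_noisy)  # first-pass neural denoising
    # statistical refinement: i-MAP's Gaussian assumption is closer to
    # being satisfied on the auto-encoder's output
    return imap_denoise(x_dae, mu_x, sigma_x, mu_n, sigma_n)
```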
| 0:10:07 | In another technique that we introduce, which we call the Gaussian denoising auto-encoder, we are given noisy x-vectors and we impose on the denoising auto-encoder to give a Gaussian distribution for the denoised x-vectors. Here you can see the loss function that imposes these restrictions on the output of the denoising auto-encoder. Again, μn is the average of the noisy x-vectors and Σn is their covariance; μx is the average of the clean x-vectors and Σx is the covariance of the clean x-vectors. | 
|---|
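The exact loss is given in the paper; a plausible sketch of such a distribution-matching objective, where μ̂ and Σ̂ denote the mean and covariance of the denoised outputs f_θ(y) over a batch and λ is an assumed trade-off weight, is:

```latex
\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| x_i - f_{\theta}(y_i) \right\|^2
  + \lambda \left( \left\| \hat{\mu} - \mu_x \right\|^2
  + \left\| \hat{\Sigma} - \Sigma_x \right\|_F^2 \right)
```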
| 0:11:04 | The final technique that we used is the stacked denoising auto-encoder. This type of denoising auto-encoder tries to find an estimation of the noise. By estimating the noise we can achieve better results: we did an experiment in which we gave the exact information about the noise to the system, and we achieved very good results, close to the clean environment. | 
|---|
| 0:11:37 | We use this architecture: firstly we feed the noisy x-vectors to the first denoising auto-encoder, and we get a first estimation of the denoised x-vector. By calculating the difference between the noisy x-vectors and the output of the first block, we find an estimation of the noise, and we give this information to the second block. We repeat this in the same manner to get a better estimation of the noise, which is used in the next block to achieve better results at the output. We train all these blocks jointly. | 
|---|
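A minimal sketch of a two-block version. The exact wiring is an assumption: here each block after the first receives the noisy x-vector concatenated with the current noise estimate (the difference between the noisy input and the previous block's output), and the whole stack is trained jointly with a single MSE loss:

```python
import torch
import torch.nn as nn

class DAEBlock(nn.Module):
    def __init__(self, in_dim, dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Tanh(), nn.Linear(hidden, dim)
        )

    def forward(self, x):
        return self.net(x)

class StackedDAE(nn.Module):
    def __init__(self, dim=512, n_blocks=2):
        super().__init__()
        # the first block sees only the noisy x-vector; later blocks also
        # see the running noise estimate, hence the doubled input size
        self.blocks = nn.ModuleList(
            [DAEBlock(dim)] + [DAEBlock(2 * dim) for _ in range(n_blocks - 1)]
        )

    def forward(self, noisy):
        denoised = self.blocks[0](noisy)
        for block in self.blocks[1:]:
            noise_est = noisy - denoised  # estimate of the additive noise
            denoised = block(torch.cat([noisy, noise_est], dim=-1))
        return denoised

model = StackedDAE(n_blocks=2)
noisy = torch.randn(8, 512)
clean = torch.randn(8, 512)
loss = nn.functional.mse_loss(model(noisy), clean)  # joint training loss
```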
| 0:12:27 | We use several datasets in our paper. MUSAN is used for data augmentation to train the x-vector extraction network, and separate noise recordings are used to create the noisy x-vectors for the training and the test of the denoising techniques. Two VoxCeleb corpora are also used for training: VoxCeleb 1 with data augmentation is used to train the x-vector network, and a combination of VoxCeleb 1 and VoxCeleb 2 is used to create the noisy x-vectors to train the denoising techniques. FABIOLE is a French corpus that is used for test and enrollment in our experiments. We divide the FABIOLE corpus into subsets based on the duration of the files, to calculate the results for different durations. | 
|---|
| 0:13:33 | Here you can see the steps that we followed in our experiments. Firstly, we trained the x-vector network with the Kaldi recipe; to train this network we used the augmented VoxCeleb 1 data. Then we used this network to create the training data for the denoising techniques: we created about four million noisy/clean pairs of x-vectors from VoxCeleb 1 and VoxCeleb 2. We also extracted the enrollment and test x-vectors from the FABIOLE speech corpus, and we added noise to our test data to create a noisy version. | 
|---|
| 0:14:26 | We used disjoint noise sets because we want to make our system robust against unseen noises. We used MUSAN to perform the data augmentation when training the x-vector network, but in this step the noise files that are used to create the noisy x-vectors for training are different from the noises that are used for the test. So the noises that appear in the test are unseen. | 
|---|
| 0:15:10 | After that we train the PLDA, which is used as the backend scoring technique, and we do the scoring. But before scoring we do the denoising, to reduce the effect of noise on our test files. | 
|---|
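Putting the protocol together, the scoring side of a single trial looks roughly like this; `extract_xvector`, `denoise`, and `plda_score` are hypothetical stand-ins for the x-vector extractor, any of the denoising techniques above, and the PLDA backend:

```python
def score_trial(enroll_wav, test_wav, extract_xvector, denoise, plda_score):
    """Sketch of one verification trial: extract x-vectors, denoise the
    (possibly noisy) test side, then score with the PLDA backend."""
    x_enroll = extract_xvector(enroll_wav)        # enrollment side
    x_test = denoise(extract_xvector(test_wav))   # denoise before scoring
    return plda_score(x_enroll, x_test)
```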
| 0:15:31 | Here you can see the results. We use the equal error rate metric for the different experiments. In the first row you can see the results for different durations: for example, in the first column, for utterances shorter than two seconds, we achieve 11.59% equal error rate when we don't have noise, and for utterances longer than twelve seconds we have 0.8%. | 
|---|
| 0:16:05 | In the second row we can see the impact of noise: for example, for short utterances the equal error rate increases from eleven to fifteen, and for utterances longer than twelve seconds it increases from 0.8 to 5.1. | 
|---|
| 0:16:25 | These results show that it is important to do denoising before scoring: when the noise is unseen by the x-vector network, our assumption holds, and using a denoising component before scoring is very important. | 
|---|
| 0:16:42 | Here you can see the results obtained by the statistical i-MAP technique: for utterances longer than twelve seconds, the equal error rate is reduced from 5.1 to 2.6. | 
|---|
| 0:16:58 | In the next row you see the results obtained after applying the denoising auto-encoder on the x-vectors. In the row after that, you see the results obtained by the combination of the denoising auto-encoder and i-MAP. | 
|---|
| 0:17:19 | In the next row you see the results obtained with the Gaussian denoising auto-encoder, i.e. with the new loss function that we used in our experiments to train the denoising auto-encoder, imposing that the denoised x-vectors belong to a Gaussian distribution. | 
|---|
| 0:17:43 | Here you see the results for the stacked denoising auto-encoder. In the fourth row you can see the results when we use just two blocks, the first and the second block. As you can see, in both cases the results are better than the previous techniques for utterances between eight and ten seconds, between ten and twelve seconds, and longer than twelve seconds. | 
|---|
| 0:18:11 | In the last row you see the results for the case where we use the stacked denoising auto-encoder with exactly the same architecture that was shown earlier in this talk. In this case, in almost all cases, we have better results than the previous techniques. | 
|---|
| 0:18:38 | In our paper we showed that data augmentation and deep learning techniques are important for achieving a noise-robust speaker recognition system, but when the noise is unseen, we can obtain better results in the x-vector space if we use denoising techniques. We showed that a simple statistical method like i-MAP, which was used in the i-vector space, can also be used for x-vector denoising. After that, we showed that merging the advantages of the statistical method and the denoising auto-encoder can give better results. | 
|---|
| 0:19:20 | Finally, we introduced a new technique called the stacked denoising auto-encoder, which tries to find information about the noise and to use this information in the deeper blocks of the stack. With this technique, in almost all cases we achieved better results than the statistical i-MAP technique or the conventional denoising auto-encoders. | 
|---|
| 0:19:55 | Thanks for your attention. | 
|---|