0:00:15 So hi everyone, I'll present the iterative Bayesian and MMSE-based noise compensation techniques for speaker recognition in the i-vector space.
0:00:28 So let's start by setting up the problem.
0:00:32 Here we are working on noise; noise is one of the biggest problems in speaker recognition.
0:00:41 A lot of techniques have been proposed in the past years to deal with it in different domains, such as speech enhancement techniques, feature compensation, model compensation and robust scoring, and in the last years DNN-based techniques for robust feature extraction, robust computation of statistics, or i-vector-like representations of speech.
0:01:12 So what we are proposing here is a combination of two algorithms in order to clean up noisy i-vectors. We are using a clean front end, so a system trained using clean data, and a clean back end, so a clean scoring model.
0:01:36 So the first algorithm: in previous work we presented I-MAP. It's an additive noise model operating in the i-vector space. It's based on two hypotheses: the Gaussianity of the i-vector distribution and the Gaussianity of the noise distribution in the i-vector space. Here I'm not saying that noise is additive in the i-vector space; I just use this model to represent the relationship between clean and noisy i-vectors, just to be clear.
0:02:14 So using a MAP criterion we derive this equation, and we end up with a model where, given a noisy i-vector y0, we can denoise it, clean it up, using the clean i-vector distribution hyperparameters and the noise distribution hyperparameters.
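(A minimal sketch of this kind of Gaussian MAP denoising under the two Gaussianity hypotheses above; the variable names mu_x, cov_x, mu_n, cov_n are illustrative and the exact formulation in the paper may differ.)

```python
import numpy as np

def imap_denoise(y0, mu_x, cov_x, mu_n, cov_n):
    """MAP estimate of the clean i-vector x given the noisy i-vector y0 = x + n,
    assuming x ~ N(mu_x, cov_x) and n ~ N(mu_n, cov_n) in the i-vector space."""
    inv_cov_x = np.linalg.inv(cov_x)   # precision of the clean i-vector prior
    inv_cov_n = np.linalg.inv(cov_n)   # precision of the noise model
    posterior_precision = inv_cov_x + inv_cov_n
    rhs = inv_cov_x @ mu_x + inv_cov_n @ (y0 - mu_n)
    return np.linalg.solve(posterior_precision, rhs)
```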
0:02:46 So in practice this algorithm is implemented like this: given a test segment, we start by checking its SNR level. If the segment is clean, we are okay. If it's not, we extract the noisy version of the i-vector, y0, and then, using a voice activity detection system, we extract noise from the signal using the silence intervals. 0:03:22 Then we inject this noise into clean training utterances. This way we have clean i-vectors and their noisy versions corrupted with the test noise, so we can build the noise model using a Gaussian distribution, and then we can use the previous equation to clean up the noisy i-vectors.
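(A possible sketch of this noise-model construction, assuming hypothetical helpers add_noise and extract_ivector standing in for the actual noise injection and i-vector extractor; not the authors' exact implementation.)

```python
import numpy as np

def estimate_noise_model(clean_utterances, test_noise, extract_ivector, add_noise):
    """Fit a Gaussian noise model in the i-vector space from paired clean /
    noise-corrupted versions of the same clean training utterances."""
    residuals = []
    for utt in clean_utterances:
        x = extract_ivector(utt)                          # clean i-vector
        y = extract_ivector(add_noise(utt, test_noise))   # same utterance with the test noise
        residuals.append(y - x)                           # "noise" seen in the i-vector space
    residuals = np.vstack(residuals)
    mu_n = residuals.mean(axis=0)
    cov_n = np.cov(residuals, rowvar=False)
    return mu_n, cov_n                                    # hyperparameters used by imap_denoise
```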
0:03:49 So the novelty of this paper is: how can we improve I-MAP? The problem is that we cannot apply I-MAP many times successively, iteratively, because we cannot guarantee the Gaussian hypothesis on the residual noise. So the solution that we came up with is to use another algorithm and to iterate between these two algorithms in order to achieve a better cleaning of the i-vectors.
0:04:28 This second algorithm is called the Kabsch algorithm. It's used mainly in chemistry to align different molecules. Here we're applying it to i-vectors: we're starting from noisy i-vectors and we want to estimate the best translation vector and rotation matrix in order to go to the clean version.
0:04:58 So formally, the formulation of the problem is called the orthogonal Procrustes problem. It starts with two data matrices: the noisy i-vectors, represented as a matrix, and their clean versions. This way we can estimate the best rotation matrix R that relates the two.
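(In symbols, with Y the matrix of noisy i-vectors and X the matrix of their clean versions, one i-vector per row, both centered as described next; this notation is mine, for illustration:)

\hat{R} = \arg\min_{R \,:\, R^{\top} R = I} \; \lVert Y R^{\top} - X \rVert_{F}^{2}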
0:05:28 So in training, we said that we are estimating a translation vector and a rotation matrix. To get rid of the translation we start by centering the data: we compute the centroids of the clean data and of the noisy data, and then we center the clean and noisy i-vectors. 0:05:58 Then we can compute the best rotation matrix between the noisy i-vectors and their clean versions using an SVD decomposition.
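(A minimal sketch of this Kabsch-style estimation, assuming paired matrices of noisy and clean i-vectors with one utterance per row; the function name and structure are illustrative.)

```python
import numpy as np

def kabsch_fit(Y_noisy, X_clean):
    """Estimate the two centroids (translation) and the rotation matrix R that
    best maps centered noisy i-vectors onto their centered clean versions."""
    mu_y = Y_noisy.mean(axis=0)
    mu_x = X_clean.mean(axis=0)
    Yc = Y_noisy - mu_y                       # remove the translation
    Xc = X_clean - mu_x
    H = Yc.T @ Xc                             # cross-covariance between the two sets
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against an improper rotation (reflection)
    D = np.diag([1.0] * (H.shape[0] - 1) + [float(d)])
    R = Vt.T @ D @ U.T                        # rotation mapping centered noisy -> centered clean
    return mu_y, mu_x, R
```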
0:06:14 Once we've done this, we have the best translation and rotation for a given noise. At test time, we can extract the test i-vector; we start by applying the translation, here we subtract the centroid of the noisy i-vectors, then we apply the rotation, and then the other translation, to end up with its clean version.
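(A hedged sketch of this test-time application and of the alternation between the two algorithms, reusing the illustrative functions above; the number of iterations and the exact ordering are assumptions, not necessarily the authors' recipe.)

```python
def kabsch_apply(y_noisy, mu_y, mu_x, R):
    """Subtract the noisy centroid, rotate, then add the clean centroid."""
    return R @ (y_noisy - mu_y) + mu_x

def iterative_cleanup(y0, imap_params, kabsch_params, n_iter=2):
    """Alternate between I-MAP denoising and the Kabsch-style mapping."""
    y = y0
    for _ in range(n_iter):
        y = imap_denoise(y, *imap_params)      # Bayesian / MAP compensation
        y = kabsch_apply(y, *kabsch_params)    # translation + rotation
    return y
```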
0:06:51 So we use NIST and Switchboard data for training, and NIST 2008 data for test, on condition 7. We are using 19 MFCC coefficients plus energy, plus their first and second derivatives, and a 512-component GMM; our i-vectors have 400 components, and we are using two-covariance scoring.
0:07:24 So here we are applying each algorithm independently and then combining the two.
0:07:33 With the first algorithm, I-MAP, we can achieve from 40 to 60 percent of equal error rate improvement for each noise. With the second algorithm we can achieve up to 45 percent of equal error rate improvement, but when we combine the two, for one iteration or two, we can end up with up to 85 percent of equal error rate improvement.
0:08:08 Here I presented the results for male data; for female data the error rates are a little bit higher, but the technique is efficient for both.
0:08:29 And here we compare the two algorithms and their combination on a heterogeneous setup, where we use a lot of data, noisy and clean, for enrollment and test, with different SNR levels on the target and test sides, and we can see that it remains efficient in this context.
0:08:57 So as a summary: using I-MAP or the Kabsch algorithm we can improve the equal error rate by 40 to 60 percent, but the interesting part is that combining the two can achieve far better gains. Thank you.
0:09:30 So, do we have questions?
0:09:42 Is the rotation matrix noise-dependent, or noise-independent?
0:09:55 Yes, here we're estimating, for each different noise, a different translation and rotation matrix.
0:10:02 We just wanted to show the efficiency of this technique, but in the future, in another paper that will be published in Interspeech, I guess, well, it's accepted, we propose another approach that does not assume a certain model of noise in the i-vector space, and that can be trained using many noises and used efficiently on test data with different noises. 0:10:40 So here it is just to show how far we can go in the best-case scenario, but in the other paper we show how we can extend this to deal with many noises.
0:11:03 Nice presentation. If you go back many years ago, Lim and Oppenheim had a sequential MAP estimation that was used for speech enhancement; it iterated back and forth between noise suppression filters and speech parameterization. So you're iterating back and forth between two algorithms here, and you show results with one iteration, two iterations. Maybe two questions here: is there any way to come up with some form of convergence criterion that you can assess? And second, is there any way to look at the i-vectors as you go through the two iterations to see which i-vectors are actually changing the most? That might tell you a little bit more about which vectors are more sensitive to the type of noise.
0:11:54 So the first question... the first question was: is there any way to look at a convergence criterion, because when you iterate once or twice you need to know whether you have converged. Okay, so what we did here is just to iterate many times and see at which level we start making the results worse. So it's not really... we haven't gone that deep into it.
0:12:34 So if you look at the two noise types, you had fan noise and I think you had car noise, so both are low-frequency-type noises. Can you see if you have similar changes in the i-vectors in both those noise types?
0:12:50 Yes. Maybe I can't comment on that, because I haven't done the full analysis, but just from the results, what I can tell you for sure is that the efficiency depends on which noise you are applying it on. It is still efficient, but it can vary, in a way that makes it more or less efficient when we have different noises between enrollment and test.
0:13:40 Thank you for the nice presentation. A while ago I tried to read the original I-MAP paper, so if you don't mind I'll just ask a question about the original I-MAP, not the iterative one.
0:13:51 Sorry, I didn't understand: the original I-MAP?
0:13:54 Yes, not the iterative one.
0:13:58 Okay, so, I mean, in the block diagram that you have... can you go back to the block diagram of this, of I-MAP?
0:14:11 Yes.
0:14:11 So you're extracting noise from the signal, or somehow estimating the noise in the signal. And then when you go up to very noisy conditions, at 0 dB, the speech and noise are of similar or the same strength there; can you tell us how you are extracting noise from the signal at 0 dB?
0:14:34 So here we are using an energy-based voice activity detection system, but we are just making the threshold more strict in order to avoid ending up with speech confused as noise. So it's not that we designed a sophisticated voice activity detection system for this specific task; we are avoiding, as much as possible, ending up with speech, by using a very strict threshold on the energy.
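(A minimal sketch of what such a strict energy-based selection of noise frames could look like; the frame length, hop and percentile threshold are illustrative assumptions, not the values used by the authors.)

```python
import numpy as np

def noise_frames_by_energy(signal, frame_len=400, hop=160, percentile=20.0):
    """Select low-energy frames of a 1-D signal as noise candidates, with a
    strict threshold so that speech frames are unlikely to be picked up."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    energies = np.array([float(np.sum(np.asarray(f, dtype=float) ** 2)) for f in frames])
    threshold = np.percentile(energies, percentile)   # strict: keep only the quietest frames
    return [f for f, e in zip(frames, energies) if e <= threshold]
```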
0:15:10 It's just quite amazing, the level of improvement you gain, from twenty-something percent down to what you presented; it is quite something. It feels that you have a very good model of the noise here, and if you have such a thing, then it would make sense to also check with speech enhancement, I mean an MMSE-based approach like Wiener filtering. If you have a good model to counteract the noise, then it would be good to also do feature enhancement, noise reduction, and compare with that as well. Just a comment.
0:15:42 Yes, okay.
0:15:54 Okay, there don't seem to be any more questions, so let's thank the speaker.