0:00:15 So hi everyone, I'll present the iterative Bayesian and MMSE-based noise compensation techniques for speaker recognition in the i-vector space.
0:00:28 So let's start by setting up the problem.
0:00:32 Here we are working on noise; noise is one of the biggest problems in speaker recognition.
0:00:41 A lot of techniques have been proposed in the past years to deal with it in different domains, such as speech enhancement techniques, feature compensation, model compensation and robust scoring, and in the last years DNN-based techniques for robust feature extraction, robust computation of statistics, or i-vector-like representations of speech.
0:01:12 So what we are proposing here is a combination of two algorithms in order to clean up noisy i-vectors. We are using a clean front end, so a system trained using clean data, and a clean back end, so a clean scoring model.
0:01:36 So the first algorithm: in previous work we presented I-MAP. It's an additive noise model operating in the i-vector space. It's based on two hypotheses: the Gaussianity of the i-vector distribution and the Gaussianity of the noise distribution in the i-vector space. Here I'm not saying that noise is additive in the i-vector space; I just use this model to represent the relationship between clean and noisy i-vectors, just to be clear.
0:02:14 So using a MAP criterion we derive this equation, and we end up with a model where, given a noisy i-vector y0, we can denoise it, clean it up, using the clean i-vector distribution hyperparameters and the noise distribution hyperparameters.
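(A minimal sketch of this kind of Gaussian MAP denoising under the two Gaussianity hypotheses above; the variable names mu_x, cov_x, mu_n, cov_n are illustrative and the exact formulation in the paper may differ.)

```python
import numpy as np

def imap_denoise(y0, mu_x, cov_x, mu_n, cov_n):
    """MAP estimate of the clean i-vector x given the noisy i-vector y0 = x + n,
    assuming x ~ N(mu_x, cov_x) and n ~ N(mu_n, cov_n) in the i-vector space."""
    inv_cov_x = np.linalg.inv(cov_x)   # precision of the clean i-vector prior
    inv_cov_n = np.linalg.inv(cov_n)   # precision of the noise model
    posterior_precision = inv_cov_x + inv_cov_n
    rhs = inv_cov_x @ mu_x + inv_cov_n @ (y0 - mu_n)
    return np.linalg.solve(posterior_precision, rhs)
```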
0:02:46 So in practice this algorithm is implemented like this: given a test segment, we start by checking its SNR level. If the segment is clean, we are okay. If it's not, we extract the noisy version of the i-vector, y0, and then, using a voice activity detection system, we extract noise from the signal using the silence intervals. 0:03:22 Then we inject this noise into clean training utterances. This way we have clean i-vectors and their noisy versions corrupted with the test noise, so we can build the noise model using a Gaussian distribution, and then we can use the previous equation to clean up the noisy i-vectors.
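(A possible sketch of this noise-model construction, assuming hypothetical helpers add_noise and extract_ivector standing in for the actual noise injection and i-vector extractor; not the authors' exact implementation.)

```python
import numpy as np

def estimate_noise_model(clean_utterances, test_noise, extract_ivector, add_noise):
    """Fit a Gaussian noise model in the i-vector space from paired clean /
    noise-corrupted versions of the same clean training utterances."""
    residuals = []
    for utt in clean_utterances:
        x = extract_ivector(utt)                          # clean i-vector
        y = extract_ivector(add_noise(utt, test_noise))   # same utterance with the test noise
        residuals.append(y - x)                           # "noise" seen in the i-vector space
    residuals = np.vstack(residuals)
    mu_n = residuals.mean(axis=0)
    cov_n = np.cov(residuals, rowvar=False)
    return mu_n, cov_n                                    # hyperparameters used by imap_denoise
```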
0:03:49 So the novelty of this paper is: how can we improve I-MAP? The problem is that we cannot apply I-MAP many times successively, iteratively, because we cannot guarantee the Gaussian hypothesis on the residual noise. So the solution that we came up with is to use another algorithm and to iterate between these two algorithms in order to achieve a better cleaning of the i-vectors.
0:04:28 This second algorithm is called the Kabsch algorithm. It's used mainly in chemistry to align different molecules. Here we're applying it to i-vectors: we're starting from noisy i-vectors and we want to estimate the best translation vector and rotation matrix in order to go to the clean version.
0:04:58 So formally, the formulation of the problem is called the orthogonal Procrustes problem. It starts with two data matrices: the noisy i-vectors, represented as a matrix, and their clean versions. This way we can estimate the best rotation matrix R that relates the two.
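(In symbols, with Y the matrix of noisy i-vectors and X the matrix of their clean versions, one i-vector per row, both centered as described next; this notation is mine, for illustration:)

\hat{R} = \arg\min_{R \,:\, R^{\top} R = I} \; \lVert Y R^{\top} - X \rVert_{F}^{2}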
0:05:28 So in training, we said that we are estimating a translation vector and a rotation matrix. To get rid of the translation we start by centering the data: we compute the centroids of the clean data and of the noisy data, and then we center the clean and noisy i-vectors. 0:05:58 Then we can compute the best rotation matrix between the noisy i-vectors and their clean versions using an SVD decomposition.
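(A minimal sketch of this Kabsch-style estimation, assuming paired matrices of noisy and clean i-vectors with one utterance per row; the function name and structure are illustrative.)

```python
import numpy as np

def kabsch_fit(Y_noisy, X_clean):
    """Estimate the two centroids (translation) and the rotation matrix R that
    best maps centered noisy i-vectors onto their centered clean versions."""
    mu_y = Y_noisy.mean(axis=0)
    mu_x = X_clean.mean(axis=0)
    Yc = Y_noisy - mu_y                       # remove the translation
    Xc = X_clean - mu_x
    H = Yc.T @ Xc                             # cross-covariance between the two sets
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against an improper rotation (reflection)
    D = np.diag([1.0] * (H.shape[0] - 1) + [float(d)])
    R = Vt.T @ D @ U.T                        # rotation mapping centered noisy -> centered clean
    return mu_y, mu_x, R
```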
0:06:14 Once we've done this, we have the best translation and rotation for a given noise. At test time, we can extract the test i-vector; we start by applying the translation, here we subtract the centroid of the noisy i-vectors, then we apply the rotation, and then the other translation, to end up with its clean version.
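(A hedged sketch of this test-time application and of the alternation between the two algorithms, reusing the illustrative functions above; the number of iterations and the exact ordering are assumptions, not necessarily the authors' recipe.)

```python
def kabsch_apply(y_noisy, mu_y, mu_x, R):
    """Subtract the noisy centroid, rotate, then add the clean centroid."""
    return R @ (y_noisy - mu_y) + mu_x

def iterative_cleanup(y0, imap_params, kabsch_params, n_iter=2):
    """Alternate between I-MAP denoising and the Kabsch-style mapping."""
    y = y0
    for _ in range(n_iter):
        y = imap_denoise(y, *imap_params)      # Bayesian / MAP compensation
        y = kabsch_apply(y, *kabsch_params)    # translation + rotation
    return y
```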
0:06:51 So we use NIST and Switchboard data for training, and NIST 2008 data for test, on condition 7. We are using 19 MFCC coefficients plus energy, plus their first and second derivatives, and a 512-component GMM; our i-vectors have 400 components, and we are using two-covariance scoring.
0:07:24 So here we are applying each algorithm independently and then combining the two.
0:07:33 With the first algorithm, I-MAP, we can achieve from 40 to 60 percent of equal error rate improvement for each noise. With the second algorithm we can achieve up to 45 percent of equal error rate improvement, but when we combine the two, for one iteration or two, we can end up with up to 85 percent of equal error rate improvement.
0:08:08 Here I presented the results for male data; for female data the error rates are a little bit higher, but the technique is efficient for both.
0:08:29 And here we compare the two algorithms and their combination on a heterogeneous setup, where we use a lot of data, noisy and clean, for enrollment and test, with different SNR levels on the target and test sides, and we can see that it remains efficient in this context.
0:08:57 So as a summary: using I-MAP or the Kabsch algorithm we can improve the equal error rate by 40 to 60 percent, but the interesting part is that combining the two can achieve far better gains. Thank you.
0:09:30 So, do we have questions?
0:09:42 Is the rotation matrix noise-dependent, or noise-independent?
0:09:55 Yes, here we're estimating, for each different noise, a different translation and rotation matrix.
0:10:02 We just wanted to show the efficiency of this technique, but in the future, in another paper that will be published in Interspeech, I guess, well, it's accepted, we propose another approach that does not assume a certain model of noise in the i-vector space, and that can be trained using many noises and used efficiently on test data with different noises. 0:10:40 So here it is just to show how far we can go in the best-case scenario, but in the other paper we show how we can extend this to deal with many noises.
0:11:03 Nice presentation. If you go back many years ago, Lim and Oppenheim had a sequential MAP estimation that was used for speech enhancement; it iterated back and forth between noise suppression filters and speech parameterization. So you're iterating back and forth between two algorithms here, and you show results with one iteration, two iterations. Maybe two questions here: is there any way to come up with some form of convergence criterion that you can assess? And second, is there any way to look at the i-vectors as you go through the two iterations to see which i-vectors are actually changing the most? That might tell you a little bit more about which vectors are more sensitive to the type of noise.
0:11:54 So the first question... the first question was: is there any way to look at a convergence criterion, because when you iterate once or twice you need to know whether you have converged. Okay, so what we did here is just to iterate many times and see at which level we start making the results worse. So it's not really... we haven't gone that deep into it.
0:12:34 So if you look at the two noise types, you had fan noise and I think you had car noise, so both are low-frequency-type noises. Can you see if you have similar changes in the i-vectors in both those noise types?
0:12:50 Yes. Maybe I can't comment on that, because I haven't done the full analysis, but just from the results, what I can tell you for sure is that the efficiency depends on which noise you are applying it on. It is still efficient, but it can vary, in a way that makes it more or less efficient when we have different noises between enrollment and test.
0:13:40 Thank you for the nice presentation. A while ago I tried to read the original I-MAP paper, so if you don't mind I'll just ask a question about the original I-MAP, not the iterative one.
0:13:51 Sorry, I didn't understand: the original I-MAP?
0:13:54 Yes, not the iterative one.
0:13:58 Okay, so, I mean, in the block diagram that you have... can you go back to the block diagram of this, of I-MAP?
0:14:11 Yes.
0:14:11 So you're extracting noise from the signal, or somehow estimating the noise in the signal. And then when you go up to very noisy conditions, at 0 dB, the speech and noise are of similar or the same strength there; can you tell us how you are extracting noise from the signal at 0 dB?
0:14:34 So here we are using an energy-based voice activity detection system, but we are just making the threshold more strict in order to avoid ending up with speech confused as noise. So it's not that we designed a sophisticated voice activity detection system for this specific task; we are avoiding, as much as possible, ending up with speech, by using a very strict threshold on the energy.
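(A minimal sketch of what such a strict energy-based selection of noise frames could look like; the frame length, hop and percentile threshold are illustrative assumptions, not the values used by the authors.)

```python
import numpy as np

def noise_frames_by_energy(signal, frame_len=400, hop=160, percentile=20.0):
    """Select low-energy frames of a 1-D signal as noise candidates, with a
    strict threshold so that speech frames are unlikely to be picked up."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    energies = np.array([float(np.sum(np.asarray(f, dtype=float) ** 2)) for f in frames])
    threshold = np.percentile(energies, percentile)   # strict: keep only the quietest frames
    return [f for f, e in zip(frames, energies) if e <= threshold]
```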
0:15:10 It's just quite amazing, the level of improvement you gain, from twenty-something percent down to what you presented; it is quite something. It feels that you have a very good model of the noise here, and if you have such a thing, then it would make sense to also check with speech enhancement, I mean an MMSE-based approach like Wiener filtering. If you have a good model to counteract the noise, then it would be good to also do feature enhancement, noise reduction, and compare with that as well. Just a comment.
0:15:42 Yes, okay.
0:15:54 Okay, there don't seem to be any more questions, so let's thank the speaker.