0:00:15 | So hi everyone, I'll present the iterative Bayesian and MMSE-based noise compensation

0:00:22 | techniques for speaker recognition in the i-vector space |

0:00:28 | so let's |

0:00:29 | start by setting up the problem |

0:00:32 | Here we are working on noise. Noise is one of the biggest problems in

0:00:38 | speaker recognition |

0:00:41 | and a lot of techniques have been proposed in the past

0:00:45 | years to deal with it in different domains |

0:00:48 | such as speech enhancement techniques |

0:00:51 | feature compensation, model compensation, and robust scoring, and in recent years DNN-based

0:00:57 | techniques |

0:00:58 | for robust feature extraction, robust computation of statistics, or

0:01:08 | i-vector like representation of speech |

0:01:12 | So what we are proposing here is a combination of two algorithms

0:01:19 | in order to clean up noisy i-vectors.

0:01:23 | so we are using a |

0:01:25 | clean front end, so a system trained using clean data, and a clean back end, so

0:01:33 | a clean scoring model.

0:01:36 | so the first algorithm |

0:01:39 | In previous work we presented I-MAP:

0:01:45 | it's an additive noise model operating in the i-vector space |

0:01:49 | It's based on two hypotheses:

0:01:53 | the gaussianity of |

0:01:55 | the i-vector distribution and the Gaussianity of the noise distribution

0:02:00 | in the i-vector space |

0:02:02 | Here I'm not saying that noise is additive in the i-vector space, I'm just using

0:02:06 | this model to represent the relationship between clean and noisy i-vectors,

0:02:11 | just to be clear.

0:02:14 | So using the MAP criterion we can

0:02:19 | derive this equation,

0:02:22 | and we end up with a model where, given a noisy i-vector y0,

0:02:31 | we can |

0:02:33 | denoise it,

0:02:34 | clean it up using |

0:02:37 | the i-vector distribution hyper-parameters and the noise distribution hyper-parameters.
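The clean-up rule referred to here can be sketched under the stated assumptions: an additive model y = x + n in the i-vector space, with Gaussian i-vector and noise distributions. This is an illustrative reconstruction, not the paper's code; all names are made up.

```python
import numpy as np

def imap_denoise(y, mu_x, cov_x, mu_n, cov_n):
    """MAP estimate of the clean i-vector x from a noisy observation y,
    assuming y = x + n with x ~ N(mu_x, cov_x) and n ~ N(mu_n, cov_n).
    With both Gaussian, the posterior of x given y is Gaussian, and its
    mean (= the MAP estimate) has a closed form."""
    prec_x = np.linalg.inv(cov_x)          # prior (i-vector) precision
    prec_n = np.linalg.inv(cov_n)          # noise precision
    post_cov = np.linalg.inv(prec_x + prec_n)
    # Weighted combination of the prior mean and the noise-compensated observation.
    return post_cov @ (prec_x @ mu_x + prec_n @ (y - mu_n))
```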

0:02:46 | So in practice this algorithm is implemented like this: given a test segment, we start

0:02:54 | by checking its SNR level. If the segment is clean, we

0:03:00 | are okay.

0:03:02 | if it's not |

0:03:04 | we extract the noisy version of the i-vector, y0, and then using a voice

0:03:12 | activity detection system we extract |

0:03:15 | noise from the signal using the silence intervals |

0:03:19 | and then we inject |

0:03:22 | this noise |

0:03:25 | into clean training utterances |

0:03:28 | This way we have clean i-vectors and their noisy versions using the test noise,

0:03:36 | so we can build the noise model |

0:03:39 | using the gaussian distribution and then we can use the previous equation to clean up |

0:03:44 | the noisy i-vectors |
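As a rough sketch of how the noise distribution hyper-parameters could be estimated from these paired i-vectors (an illustration assuming the additive model; array names are made up):

```python
import numpy as np

def estimate_noise_model(clean_ivectors, noisy_ivectors):
    """Fit a Gaussian noise model in the i-vector space from paired data:
    clean training i-vectors and their noisy versions obtained by injecting
    the extracted test noise.  Both arrays are (n_utterances, dim)."""
    residual = noisy_ivectors - clean_ivectors   # noise samples under y = x + n
    mu_n = residual.mean(axis=0)                 # noise mean
    cov_n = np.cov(residual, rowvar=False)       # noise covariance
    return mu_n, cov_n
```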

0:03:49 | so |

0:03:52 | the novelty of this paper is: how can we improve I-MAP?

0:03:59 | The problem is that we cannot apply I-MAP many times

0:04:05 | successively,

0:04:08 | iteratively, because we cannot guarantee the Gaussian hypothesis on the residual noise.

0:04:15 | so the solution that we came up with is to use another algorithm and to |

0:04:20 | iterate between these two algorithms in order to achieve better denoising of the i-vectors.
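The alternation itself is simple; a sketch with placeholder step functions (the two steps stand in for the I-MAP clean-up and the Kabsch translation-plus-rotation; nothing here is from the paper):

```python
def iterative_denoise(y, n_iter, imap_step, kabsch_step):
    """Alternate the two denoising steps n_iter times, feeding the output
    of one algorithm into the other."""
    x = y
    for _ in range(n_iter):
        x = imap_step(x)      # I-MAP-style clean-up
        x = kabsch_step(x)    # learned translation + rotation
    return x
```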

0:04:28 | So this second algorithm is called the Kabsch algorithm. It's used mainly in chemistry to

0:04:39 | align different molecules. Here we're applying it to i-vectors: we're starting from

0:04:46 | noisy i-vectors |

0:04:48 | and we want to estimate the best translation and rotation matrix |

0:04:53 | in order to go to the clean version |

0:04:58 | So formally, for the formulation of the problem,

0:05:04 | it's called the

0:05:07 | orthogonal Procrustes

0:05:09 | problem. We start with two data matrices: the noisy i-vectors

0:05:16 | represented as a matrix, and the clean version.

0:05:20 | This way we can estimate the best rotation matrix R here

0:05:25 | that relates the two |

0:05:28 | So in the training:

0:05:34 | we said that we are estimating a translation vector and a rotation matrix, so

0:05:38 | to get rid of the translation we start by centering the data. We

0:05:44 | compute the centroids of the clean data and the noisy data, and then

0:05:50 | we center

0:05:52 | the clean and noisy i-vectors.

0:05:56 | then |

0:05:58 | now we can compute

0:06:01 | the best rotation matrix between the noisy i-vectors and their clean versions using SVD

0:06:09 | decomposition.
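The centering-plus-SVD training just described is the standard Kabsch construction; a sketch of it (names are illustrative, not from the paper):

```python
import numpy as np

def kabsch_rotation(noisy, clean):
    """Estimate the rotation that best maps centered noisy i-vectors onto
    their centered clean versions (rows are i-vectors), as in the Kabsch
    algorithm.  Returns the rotation matrix and both centroids."""
    c_noisy = noisy.mean(axis=0)
    c_clean = clean.mean(axis=0)
    A = noisy - c_noisy                  # centering removes the translation
    B = clean - c_clean
    # SVD of the cross-covariance gives the optimal rotation.
    U, _, Vt = np.linalg.svd(A.T @ B)
    # Sign correction so the result is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0] * (U.shape[0] - 1) + [d])
    return Vt.T @ D @ U.T, c_noisy, c_clean
```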

0:06:14 | Once we've done this, we have the best translation and rotation for a

0:06:21 | given noise.

0:06:23 | On the test side,

0:06:24 | we can

0:06:27 | extract the test i-vector;

0:06:29 | we start by applying the translation, a minus:

0:06:34 | here we subtract the centroid of

0:06:38 | the noisy i-vectors, and then we apply the rotation and then the other translation to end

0:06:45 | up with its clean version.
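The three test-time operations just described (subtract the noisy centroid, rotate, translate to the clean centroid) amount to one line; a sketch with assumed names:

```python
import numpy as np

def apply_kabsch(y, R, c_noisy, c_clean):
    """Clean up a test i-vector: subtract the noisy-data centroid, apply
    the learned rotation R, then translate to the clean-data centroid."""
    return R @ (y - c_noisy) + c_clean
```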

0:06:51 | So we use NIST and Switchboard data for training

0:06:56 | and NIST 2008 for test, condition 7. We are using

0:07:03 | 19 MFCC coefficients plus energy plus their first and second derivatives,

0:07:11 | a 512-component GMM.

0:07:16 | Our i-vectors have 400 components, and we are using two-covariance scoring.

0:07:24 | so here we are applying |

0:07:26 | each algorithm independently, and then we're combining the two.

0:07:33 | so |

0:07:33 | With the first algorithm, I-MAP, we can achieve from forty to sixty percent

0:07:39 | of equal error rate improvement

0:07:43 | for each noise |

0:07:45 | For the second algorithm we can achieve up to forty five percent of equal error

0:07:50 | rate improvement, but

0:07:53 | when we combine the two |

0:07:55 | for one iteration or two, we can end up with up to

0:08:01 | eighty five percent of equal error rate improvement.

0:08:08 | Here I presented the results

0:08:14 | for male data, but for female data

0:08:21 | the error rates are a little bit higher, but it's efficient for both.

0:08:29 | And here we compare the two algorithms and their combination

0:08:34 | on a heterogeneous setup, that is, when we use a lot of noisy and

0:08:42 | clean data for enrollment and test, with different SNR levels on the target and test,

0:08:49 | and we can see that it remains efficient in this context.

0:08:57 | so as a summary |

0:09:00 | using |

0:09:03 | I-MAP or the Kabsch algorithm, we can improve the equal error rate from

0:09:09 | forty to sixty percent, but the interesting part is that combining the two

0:09:15 | can achieve

0:09:18 | far better gains.

0:09:22 | thank you |

0:09:30 | So, do we have questions?

0:09:42 | Is the rotation matrix noise-dependent,

0:09:47 | or noise-independent? Yes, it's noise-dependent, sorry.

0:09:55 | Yes, here we're estimating, for each different noise, a different translation and rotation matrix.

0:10:02 | We just want to show the efficiency of this technique, but in the future,

0:10:08 | in another paper, which will be published at Interspeech I guess, well, it's accepted,

0:10:16 | so |

0:10:17 | it will |

0:10:20 | we propose another approach that does not

0:10:26 | assume a certain model of noise in the i-vector space

0:10:29 | and that can be used for many noises,

0:10:33 | that can be trained using many noises and used efficiently

0:10:38 | on tests with different noises.

0:10:40 | So here it is just to show how far we can go in the

0:10:46 | best case scenario,

0:10:48 | but in another paper we show how we can extend this to deal with many

0:10:53 | noises.

0:11:03 | Nice presentation. So,

0:11:06 | if you go back many years ago, Lim and Oppenheim had a sequential MAP estimation

0:11:13 | that was used for speech enhancement. It iterated back and forth between noise suppression filters and

0:11:19 | speech parameterization, so you're iterating back and forth between two algorithms here.

0:11:25 | You show results with one iteration, two iterations. Is there any way to come

0:11:29 | up with, well, maybe two questions here: any way to come up with some form

0:11:34 | of convergence criteria that you can assess and second is there any way to look |

0:11:39 | at the i-vectors as you go through the two iterations to see |

0:11:44 | which i-vectors are actually changing the most that might tell you a little bit more |

0:11:47 | about which vectors are more sensitive to the type of noise |

0:11:54 | so the first question |

0:11:56 | So the first question was: is there any way to look at a convergence criterion,

0:12:01 | because when you iterate, once or twice, you need to know whether you've converged?

0:12:06 | okay |

0:12:07 | So, well, here what we've done is just to iterate many times and see

0:12:13 | at which,

0:12:15 | from which level we

0:12:17 | start

0:12:20 | making the results worse. So it's not really,

0:12:26 | it's not that,

0:12:29 | we haven't gone there yet.

0:12:34 | So if you look at the two noise types, you're citing fan noise, and I

0:12:38 | think you had

0:12:40 | car noise, so both are low-frequency type noises. Can you see if you have

0:12:45 | similar changes in the i-vectors in both those noise types?

0:12:50 | yes |

0:12:53 | Maybe I can't comment on that because I haven't done the full analysis, but

0:12:59 | just from the results,

0:13:03 | what I can tell you for sure is that the efficiency depends on

0:13:11 | which noise you're applying it on. So

0:13:15 | it's efficient, but it can be

0:13:21 | that there is something that makes it more efficient if we have different noises

0:13:26 | between enrollment and test.

0:13:40 | thank you for the nice presentation |

0:13:43 | A while ago I tried to read the original I-MAP paper, so if you don't

0:13:47 | mind, I'll just ask a question about the original I-MAP, not the iterative one.

0:13:51 | Sorry, the original I-MAP?

0:13:54 | Yes, not the iterated one.

0:13:58 | Okay, so, I mean, in the block diagram that you have,

0:14:06 | can you go back to the block diagram of this,

0:14:08 | of I-MAP?

0:14:11 | yes |

0:14:11 | So you're extracting noise from the signal, or somehow estimating the noise in

0:14:18 | the signal.

0:14:19 | And then you go up to the noisy end, to zero dB, where

0:14:24 | the speech and noise are of similar or the same strength. Can you

0:14:28 | tell us how you go about extracting noise from the signal at zero dB?

0:14:34 | So here we're using an energy-based voice activity detection system, but we're just

0:14:40 | making the threshold more strict in order to avoid ending up with speech

0:14:48 | confused as noise. So it's not that we

0:14:53 | built a sophisticated voice activity detection system for this specific task;

0:14:59 | we're avoiding, as much as possible, ending up

0:15:03 | with speech, by using a very strict threshold on the energy.
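The strict energy threshold described here might look like this; the margin value and the dB convention are assumptions for illustration only:

```python
import numpy as np

def extract_noise_frames(frame_energies_db, margin_db=10.0):
    """Select noise-only frames with a deliberately strict energy threshold:
    keep only frames well below the mean energy, so that at low SNR we
    avoid labelling speech frames as noise."""
    threshold = frame_energies_db.mean() - margin_db
    return np.where(frame_energies_db < threshold)[0]
```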

0:15:10 | It's just quite amazing, the level of improvement you gain, from

0:15:14 | twenty-something to eight percent; it is quite something. It feels

0:15:19 | like you have a very good model of the noise here, and if you have such a

0:15:24 | thing, then it would make sense also to just check with speech enhancement, I

0:15:29 | mean, you have this

0:15:30 | MMSE-based approach like Wiener filtering. If you have a good model to counteract

0:15:34 | the noise, then it is good to also compare with that, or to do

0:15:38 | like feature enhancement, noise reduction, and compare with that as well. Just a comment.

0:15:42 | yes okay |

0:15:54 | Okay, there don't seem to be any more questions, so let's thank the speaker.