0:00:15 | The presentation is "Regularization of All-Pole Models for Speaker Verification under Additive Noise Conditions", and Cemal will present it.

0:00:34 | Hello everyone. I am Cemal from Turkey, and today I am going to talk about regularization of linear prediction methods for speaker verification.

0:00:46 | This work was joint work with Tomi Kinnunen from the University of Eastern Finland, Rahim Saeidi from Radboud University, and Professor Paavo Alku from Aalto University.

0:01:04 | As we all know, speaker recognition systems usually achieve quite high recognition accuracy if the speech samples used were collected under clean, controlled conditions.

0:01:19 | However, speaker recognition performance decreases a lot in the case of channel mismatch or additive noise.

0:01:31 | It was previously shown that spectrum estimation has a big impact on speaker recognition results under additive noise.

0:01:40 | So, what do I mean by spectrum estimation?

0:01:43 | Here is a speech frame: the left side shows the FFT spectrum of the clean speech, and the right side shows its noisy version at a 0 dB SNR level.

0:02:01 | If you look at the spectra, the spectrum is distorted in the noisy condition, and this distortion causes degradation in speaker recognition performance.

0:02:17 | So if we had a spectrum estimation method that was not affected that much by the noise, we would not see so much degradation in performance; unfortunately, we do not have such a spectrum estimation method, which is why we were looking for a better way of estimating the spectrum.

0:02:38 | So yesterday ... not to touch MFCC because it is the best, but I am sorry to tell you that we do need to touch it. Basically, we are still using MFCC, and nothing is wrong with it; we just simply replace the FFT spectrum estimation with a new spectrum estimation method. That is what we are doing.
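The swap just described, keeping the MFCC pipeline and replacing only the FFT spectrum estimator, can be sketched roughly as follows. This is an illustrative sketch, not the authors' code; `allpole_power_spectrum` is a hypothetical helper name, and the mel filterbank and DCT stages are assumed to stay standard.

```python
import numpy as np

def allpole_power_spectrum(a, nfft=512):
    """Power spectrum of an all-pole model 1/A(z).

    `a` holds predictor coefficients alpha_1..alpha_p, so that
    A(z) = 1 - sum_k alpha_k z^-k.  The returned spectrum can be
    dropped into an MFCC pipeline in place of the FFT power spectrum,
    leaving the mel filterbank and DCT stages unchanged.
    """
    A = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    H = 1.0 / np.fft.rfft(A, nfft)   # all-pole frequency response
    return np.abs(H) ** 2            # nfft//2 + 1 power values
```

For a second-order model with a resonance at normalized frequency 0.3 rad, the spectrum peaks at the corresponding FFT bin, just as the FFT spectrum of that sinusoid would.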

0:03:10 | The first spectrum estimation method that I am going to talk about is conventional linear prediction. As you know, linear prediction assumes that a speech sample can be estimated from its previous samples.

0:03:27 | The objective is to compute the predictor coefficients alpha by minimizing the energy of the residual error; the optimum alpha values are given by the inverse of the Toeplitz autocorrelation matrix multiplied by the autocorrelation sequence.
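The autocorrelation solution just described might look like this in Python; a minimal sketch using a biased autocorrelation estimate, with `lp_coefficients` as a hypothetical name:

```python
import numpy as np

def lp_coefficients(frame, p):
    """Conventional linear prediction via the autocorrelation method.

    Solves R a = r, where R is the p x p Toeplitz autocorrelation
    matrix and r the autocorrelation sequence at lags 1..p.
    """
    x = np.asarray(frame, dtype=float)
    n = len(x)
    # Biased autocorrelation estimates at lags 0..p
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(p + 1)])
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:])   # predictor coefficients alpha
```

For a pure sinusoid sin(0.3 n), the exact second-order predictor is x[n] = 2 cos(0.3) x[n-1] - x[n-2], and the estimate lands close to those values.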

0:03:45 | Another all-pole modeling method is temporally weighted linear prediction. It uses the same idea as the linear prediction I just explained:

0:04:00 | again, it assumes that we can estimate a speech sample from its previous samples, but this time we compute the optimum predictor coefficients by minimizing the weighted energy of the residual error.

0:04:16 | Here, psi_n is the weighting function, and the short-time energy of the speech signal is used as the weighting function.

0:04:24 | Again, the optimum predictor coefficients are computed as an inverse matrix multiplied by an autocorrelation sequence, but this time both need to be modified because of the weighting function: the weighted autocovariance matrix multiplied by the weighted autocorrelation sequence.
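A rough sketch of weighted linear prediction under these assumptions (short-time-energy weighting and covariance-style normal equations; the exact weighting in the paper may differ):

```python
import numpy as np

def wlp_coefficients(frame, p, ste_lag=None):
    """Temporally weighted linear prediction (WLP) sketch.

    Assumes psi_n is the short-time energy (STE) of the previous
    samples, and minimizes sum_n psi_n * (x[n] - sum_k a_k x[n-k])^2
    over the interior samples n = p..len(frame)-1.
    """
    if ste_lag is None:
        ste_lag = p
    x = np.asarray(frame, dtype=float)
    n = len(x)
    # STE weighting function for each predicted sample
    psi = np.array([np.sum(x[t - ste_lag:t] ** 2) for t in range(p, n)])
    # Delayed-sample matrix: column k-1 holds x[t-k]
    X = np.column_stack([x[p - k:n - k] for k in range(1, p + 1)])
    y = x[p:n]
    # Weighted normal equations: (X^T Psi X) a = X^T Psi y
    C = X.T @ (psi[:, None] * X)
    c = X.T @ (psi * y)
    return np.linalg.solve(C, c)
```

For a noiseless sinusoid the recursion x[n] = 2 cos(w) x[n-1] - x[n-2] holds exactly, so any positive weighting recovers the same predictor.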

0:04:45 | However, this LP spectrum estimation can sometimes be a problem, especially from the speech coding point of view: if we have a speech sample produced by a high-pitched speaker, it will eventually cause sharp peaks in the spectrum. This needed to be solved from the speech coding perspective, and recently, to solve this problem, regularized linear prediction was proposed by Ekman.

0:05:25 | To smooth these sharp peaks in the spectra produced by high-pitched speakers, they modified conventional linear prediction by adding a penalty function phi(a), which is a function of the predictor coefficients, weighted by a regularization factor lambda.

0:05:45 | The penalty function in the original paper was selected as the formula given here, where A' is the frequency derivative, over omega, of the RLP spectrum envelope estimate.

0:06:16 | The reason for selecting this kind of penalty function is that it leads to a closed-form, non-iterative solution.

0:06:31 | The details can be found in the paper: the penalty function can be written in the matrix form given here, where F is the Toeplitz matrix of the windowed autocorrelation sequence.

0:06:50 | In the paper they use the boxcar window, which is a special case: if we choose v(m) as the boxcar window, the F matrix corresponds to the autocorrelation matrix used in conventional linear prediction.

0:07:09 | In this study we also consider the Blackman and Hamming windows; the difference appears in the next slide, when we compute the F matrix.

0:07:23 | And D is just a diagonal matrix in which each diagonal element equals its row (or column) index.

0:07:32 | Given the penalty function, the optimum predictor coefficients can be computed by the equation given here.
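Under the assumption that the penalty takes the matrix form a^T D F D a described above, so that the closed-form solution is a = (R + lambda D F D)^{-1} r, a sketch could be (again illustrative, not the authors' code):

```python
import numpy as np

def rlp_coefficients(frame, p, lam=1e-7, lag_window=None):
    """Regularized linear prediction (RLP) sketch.

    Assumed closed form based on the description in the talk:
        a = (R + lam * D F D)^{-1} r,
    with R the Toeplitz autocorrelation matrix, F the Toeplitz matrix
    of the (optionally lag-windowed) autocorrelation sequence, and
    D = diag(1, ..., p).  With lam = 0 this reduces to conventional
    linear prediction.
    """
    x = np.asarray(frame, dtype=float)
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(p + 1)])
    rw = r if lag_window is None else r * lag_window  # boxcar by default
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    F = np.array([[rw[abs(i - j)] for j in range(p)] for i in range(p)])
    D = np.diag(np.arange(1.0, p + 1.0))
    return np.linalg.solve(R + lam * D @ F @ D, r[1:])
```

With the boxcar lag window, F equals R, matching the special case mentioned on the previous slide.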

0:07:42 | So this is what Ekman proposed in his original work. We wanted to build on it because the method had been applied to clean speech, from the speech coding point of view; we, however, wanted to see the performance of this method in speaker recognition, especially under additive noise conditions.

0:08:06 | In the literature there exist some works which use the double autocorrelation sequence to estimate the spectrum of given speech samples. The most recent one is Shimamura's, from Interspeech 2010: they used the double autocorrelation sequence to estimate the speech spectrum in the presence of additive noise, and they analyzed it in noisy word recognition experiments.

0:08:41 | In this work we propose to use this double autocorrelation to compute the F matrix, which I explained on the previous slide, when computing the penalty function.

0:08:54 | So basically, besides the Blackman and Hamming windows, we also use this double autocorrelation sequence in the penalty function to see its effect on speaker verification.
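A possible sketch of the double autocorrelation sequence, i.e. the autocorrelation of the frame's own autocorrelation sequence; the exact normalization used in the paper is an assumption here:

```python
import numpy as np

def double_autocorrelation(frame, p):
    """Double autocorrelation sketch.

    Computes the one-sided autocorrelation r of the frame, then
    autocorrelates r itself; the result can stand in for the windowed
    autocorrelation sequence when building the F matrix of the
    regularization penalty.
    """
    x = np.asarray(frame, dtype=float)
    n = len(x)
    # One-sided autocorrelation of the frame, lags 0..n-1
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(n)])
    # Autocorrelation of the autocorrelation sequence, lags 0..p
    return np.array([np.dot(r[:n - k], r[k:]) for k in range(p + 1)])
```

Like any autocorrelation, the result is dominated by its lag-0 term, which is part of why the resulting spectrum estimates come out smoother.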

0:09:11 | If you look at the average residual error and the penalty function as a function of the predictor order p: as expected, the error decreases as the predictor order increases, and, again as expected, the penalty function increases with the predictor order, because, as stated at the beginning, the main idea of the regularization is to smooth the spectrum, so it should penalize the spectrum.

0:09:51 | Now look at the different spectrum estimation methods: this is again a speech sample, the clean version on the left-hand side and its noisy version, again at a 0 dB SNR level, on the right-hand side. We have here the FFT spectrum and the regularized linear prediction spectra for the different window functions.

0:10:18 | As we see, the Blackman and Hamming windows do not affect the spectra much; they look very similar to each other. However, when we use the double autocorrelation sequence in the regularization, we get a much smoother spectrum.

0:10:37 | Another point, and I think this is the main problem in the additive noise condition: if we look at the maximum and minimum values of a given spectrum for both the clean and the noisy condition, there is a lot of variation between the clean and noisy cases, and I think this mismatch causes the main problem for the performance of regularization.

0:11:12 | With the double autocorrelation sequence, however, the variation becomes smaller in comparison with the two other methods. So, according to this figure, I expected to see little difference in recognition performance between the Blackman and Hamming windows, because they produce almost the same spectra, but some difference with the double autocorrelation sequence.

0:11:43 | So, Ekman proposed regularization for conventional autocorrelation-based linear prediction. We also apply regularization to the other all-pole models, weighted linear prediction and its stabilized version, stabilized weighted linear prediction, because the regularization is independent of the all-pole model we use: we just need to compute the autocorrelation matrix and the autocorrelation sequence, and once we have these two, we can regularize whichever method we are using.

0:12:35 | If we look at the experimental setup: we use the NIST 2002 corpus with GMM-UBM modeling. For the features, we first apply spectral subtraction to the noisy speech samples, then extract 12 MFCCs and their first- and second-order derivatives, with cepstral mean and variance normalization.

0:13:00 | We also apply T-norm normalization to the log-likelihood scores.

0:13:06 | We use two different types of noise in the experiments, factory and babble noise from the NOISEX-92 database, at five different SNR levels. I would like to point out that we added noise only to the test-set speech samples; we did not touch the training samples, which remain the original NIST samples.

0:13:35 | "Clean" is maybe not the correct term, because this is telephone speech, so it already includes some noise, convolutive rather than additive, but I refer to the original NIST samples as clean.

0:13:50 | The first thing we need to do when using regularization is to optimize the lambda parameter, because it has a big impact: if you look at the regularization formula here in the dark box, when lambda equals zero it reduces to conventional linear prediction. So we need to optimize it first.

0:14:14 | In our experiments, to optimize it, we ran the speaker recognition experiments on the original case, I mean with the original training and the original test data.

0:14:28 | We ran the experiments for different values of lambda, and when we look at the equal error rate as a function of lambda, we see that lambda equal to 10^-7 gives the smallest equal error rate.

0:14:45 | So in the further experiments we will use this lambda value in regularized linear prediction. For regularized weighted linear prediction and its stabilized version, we optimized the lambda value in the same way, so each method got its own separately optimized lambda.
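The lambda optimization described above amounts to a grid search over the equal error rate. A hypothetical sketch, where `run_experiment` stands in for the full GMM-UBM pipeline and is an assumption, not the authors' code:

```python
# Grid of candidate regularization factors; 1e-7 was the best value
# reported in the talk for regularized LP.
LAMBDA_GRID = [1e-10, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4]

def pick_lambda(run_experiment, grid=LAMBDA_GRID):
    """Run the verification experiment once per lambda value and keep
    the value giving the smallest equal error rate (EER)."""
    eers = {lam: run_experiment(lam) for lam in grid}
    return min(eers, key=eers.get)  # lambda with the smallest EER
```

In the talk this search is repeated separately for each all-pole model (RLP, regularized WLP, regularized SWLP), on the original training/test data only.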

0:15:10 | In the first experiment we just want to see the effect of the autocorrelation windowing on the recognition performance. As you can see from the table, the boldface numbers show the best value in each row.

0:15:33 | When we look at the different windows, as I mentioned when we looked at the spectra, the different window functions do not have a big effect on the recognition performance; however, using the double autocorrelation sequence for regularization reduces the error rate significantly.

0:15:58 | So in the remaining experiments we are going to use the double autocorrelation sequence for regularization.

0:16:08 | Now for the regularization of the other all-pole modeling techniques, I mean weighted linear prediction and stabilized weighted linear prediction: in the table, FFT is our baseline, the one we normally use in MFCC extraction.

0:16:24 | As we can see, in the clean case regularization does not improve, but also does not harm, the performance. In the noisy cases, however, especially at 0 dB and -10 dB, the regularization improves the recognition accuracy compared to the unregularized version in each pair, for example LP vs. RLP.

0:16:50 | Looking at the numbers, for example at -10 dB babble noise, the EER reduces from 20% to 16%, and it is the same for the other all-pole models, weighted linear prediction and stabilized weighted linear prediction (RSWLP).

0:17:08 | To show some DET curves: this is babble noise at the -10 dB SNR level, with the FFT baseline, conventional LP, and regularized LP using the double autocorrelation sequence. We can again see the large improvement in the case of regularized LP.

0:17:31 | It is the same for weighted linear prediction: we cannot see much difference between the conventional FFT and weighted linear prediction, but when we regularize it, the recognition performance improves. The same holds for stabilized weighted linear prediction: if we regularize it, the recognition accuracy also improves.

0:17:56 | So, to summarize our observations: first, regularization does not harm performance in the clean condition. Second, the different window functions do not affect the recognition performance a lot, but using the double autocorrelation sequence to compute the F matrix in the regularization, i.e., in the spectrum envelope estimate, improves the recognition accuracy. Finally, we also applied regularization to other all-pole modeling techniques, such as weighted linear prediction and stabilized weighted linear prediction.

0:18:37 | And thank you.

0:19:10 | This regularization can help us improve the recognition performance because the distortion of the spectrum is the main problem in the case of additive noise; if we can penalize this distortion somehow, we can improve our recognition performance. That was the main point.

0:19:51 | Actually not in the slides, but in the paper we have some deeper analysis of the regularization, in terms of spectral distortion and so on; we have more experiments on that.

0:20:49 | ...to achieve the smallest MinDCF.

0:21:01 | ...to get the smallest EER when we are optimizing lambda.