0:00:15 The next presentation is "Regularization of All-Pole Models for Speaker Verification Under Additive Noise", and Cemal will present it.
0:00:34 Hello everyone. I am Cemal from Turkey, and today I am going to talk about regularization of linear prediction methods for speaker verification. This was joint work with Tomi Kinnunen from the University of Eastern Finland, Rahim Saeidi from Radboud University, and Professor Paavo Alku from Aalto University.
0:01:04 As we all know, speaker recognition systems usually achieve quite high recognition accuracy if every speech sample used was collected under clean, controlled conditions. However, speaker recognition performance degrades a lot in the case of channel mismatch or additive noise. It has previously been shown that spectrum estimation has a big impact on speaker recognition results under additive noise.
0:01:40 So what do I mean by spectrum estimation? Here is a speech frame: the left side shows the FFT spectrum of the clean speech, and the right-hand side shows its noisy version at 0 dB SNR. If you look at the spectra, the spectrum is clearly distorted in the noisy condition, and this distortion causes degradation in speaker recognition performance.
0:02:17 If we had a spectrum estimation method that was not affected that much by the noise, we would not see so much degradation in performance. Unfortunately, no such spectrum estimation method exists, which is why we were looking for a better way of estimating the spectrum.
0:02:38 Yesterday we were advised not to touch MFCC, that it is the best; but I am sorry to tell you that we do need to touch it. Basically, we are still using MFCC, and nothing is wrong with it; we simply replace the FFT spectrum estimation step with a new spectrum estimation method. That is what we are doing.
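The pipeline just described, standard MFCC extraction with only the spectrum estimation step swapped out, might be sketched as follows. This is a rough illustration, not the authors' code; the filterbank size, sampling rate, and other parameters are assumptions.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_spectrum(power_spectrum, sr=8000, n_mel=27, n_ceps=12):
    """MFCCs from ANY power spectrum estimate (FFT, LP, WLP, RLP, ...):
    mel filterbank -> log -> DCT. Only the spectrum estimate changes."""
    n_bins = len(power_spectrum)          # nfft/2 + 1 bins
    nfft = 2 * (n_bins - 1)
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Triangular filters equally spaced on the mel scale (standard construction)
    pts = imel(np.linspace(0.0, mel(sr / 2.0), n_mel + 2))
    bins = np.floor((nfft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mel, n_bins))
    for i in range(n_mel):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(fbank @ power_spectrum + 1e-12)
    return dct(logmel, norm='ortho')[1:n_ceps + 1]   # drop the 0th coefficient

# Baseline case: FFT power spectrum of one windowed frame
rng = np.random.default_rng(1)
frame = rng.standard_normal(512) * np.hanning(512)
ceps = mfcc_from_spectrum(np.abs(np.fft.rfft(frame)) ** 2)
```

Swapping in an LP-based estimator only means passing a different `power_spectrum` vector; the rest of the feature extraction stays fixed.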
0:03:10 The first spectrum estimation method I am going to talk about is conventional linear prediction. As you know, linear prediction assumes that a speech sample can be predicted from its previous samples. The objective is to compute the predictor coefficients alpha by minimizing the energy of the residual. The optimum alpha values are computed by multiplying the inverse of the Toeplitz autocorrelation matrix by the autocorrelation sequence.
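The autocorrelation method just described can be sketched in a few lines; this is my own minimal illustration, not the authors' implementation, and the frame length and model order are arbitrary.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lp_coefficients(frame, order):
    """Autocorrelation-method LP: alpha = R^{-1} r, with R the Toeplitz
    autocorrelation matrix and r the autocorrelation sequence r[1..p]."""
    x = np.asarray(frame, dtype=float)
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)])
    # solve_toeplitz takes the first column of the symmetric Toeplitz R
    return solve_toeplitz(r[:order], r[1:order + 1])

# Sanity check on a synthetic AR(2) signal x[n] = 0.75 x[n-1] - 0.5 x[n-2] + e[n]
rng = np.random.default_rng(0)
x = np.zeros(4096)
e = rng.standard_normal(4096)
for n in range(2, 4096):
    x[n] = 0.75 * x[n - 1] - 0.5 * x[n - 2] + e[n]
alpha = lp_coefficients(x, order=2)   # should land near [0.75, -0.5]
```

The all-pole power spectrum then follows as g / |A(omega)|^2, with A(z) = 1 - sum_k alpha_k z^{-k}.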
0:03:45 Another all-pole modeling method is temporally weighted linear prediction. It is based on the same idea as I explained for linear prediction: again we assume that a speech sample can be predicted from its previous samples, but this time we compute the optimum predictor coefficients by minimizing the weighted energy of the residual. Here Psi_n is the weighting function, and the short-time energy of the signal is used as the weighting function. Again the optimum predictor coefficients are obtained by multiplying an inverse matrix by the autocorrelation sequence, but this time the matrix has to be modified because of the weighting function: it becomes a weighted autocovariance matrix.
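A minimal sketch of this weighted formulation follows, assuming short-time energy over a small window of past samples as the weight Psi_n; the exact lag and any delay offsets the authors used are assumptions here.

```python
import numpy as np

def wlp_coefficients(frame, order, ste_lag=12):
    """Temporally weighted LP: minimize sum_n psi[n] * e[n]^2, where
    psi[n] is the short-time energy (STE) of the preceding samples."""
    x = np.asarray(frame, dtype=float)
    n = len(x)
    psi = np.array([np.sum(x[max(0, i - ste_lag):i] ** 2) + 1e-9
                    for i in range(n)])
    # Weighted normal equations: (sum psi_n y_n y_n^T) a = sum psi_n x[n] y_n,
    # with y_n = [x[n-1], ..., x[n-p]]
    R = np.zeros((order, order))
    r = np.zeros(order)
    for i in range(order, n):
        y = x[i - order:i][::-1]          # past samples, most recent first
        R += psi[i] * np.outer(y, y)
        r += psi[i] * x[i] * y
    return np.linalg.solve(R, r)

# Same synthetic AR(2) sanity check as for plain LP
rng = np.random.default_rng(0)
x = np.zeros(2048)
e = rng.standard_normal(2048)
for n in range(2, 2048):
    x[n] = 0.75 * x[n - 1] - 0.5 * x[n - 2] + e[n]
alpha = wlp_coefficients(x, order=2)
```

The STE weighting emphasizes high-energy regions of the frame, which are less corrupted by additive noise.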
0:04:45 However, linear predictive spectrum estimation can be problematic, especially from the speech coding point of view: for speech produced by a high-pitched speaker, it tends to produce sharp peaks in the spectrum. This needed to be solved from the speech coding perspective, and recently, to smooth these sharp peaks in the spectra of high-pitched speakers, regularized linear prediction (RLP) was proposed by Ekman and colleagues.
0:05:33 They modified conventional linear prediction by adding a penalty function phi(a), which is a function of the predictor coefficients, weighted by a regularization factor lambda. The penalty function in the original paper was selected as the formula shown here, where A'(omega) is the frequency derivative of the RLP inverse filter spectrum and the other term in the integrand is a spectrum envelope estimate.
0:06:16 The reason for selecting this kind of penalty function is that it admits a closed-form, non-iterative solution; that is why they used it. As detailed in the paper, the penalty function can be written in the matrix form given here, where F is the Toeplitz matrix of the windowed autocorrelation sequence. In the original paper they use the boxcar window, which is a special case: if we take the boxcar window as v(m), the F matrix corresponds exactly to the autocorrelation matrix used in conventional linear prediction.
0:07:09 In this study we also consider the Blackman and Hamming windows, which give a different F matrix, as I will show on the next slide. D is simply a diagonal matrix in which each diagonal element equals its row (or column) index. Given the penalty function, the optimum predictor coefficients can be computed by the equation given here.
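Putting the pieces together, the closed-form RLP solution a = (R + lambda * D F D)^{-1} r might look like the sketch below. This is my reading of the talk, not the reference implementation; the default window and the scaling convention for lambda are assumptions.

```python
import numpy as np
from scipy.linalg import toeplitz

def rlp_coefficients(frame, order, lam, acf_window=None):
    """Regularized LP: a = (R + lam * D F D)^{-1} r, where F is the
    Toeplitz matrix of the windowed autocorrelation sequence and
    D = diag(1, 2, ..., p)."""
    x = np.asarray(frame, dtype=float)
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)])
    R = toeplitz(r[:order])
    if acf_window is None:                # boxcar window: F reduces to R itself
        acf_window = np.ones(order)
    F = toeplitz(acf_window * r[:order])
    D = np.diag(np.arange(1.0, order + 1.0))
    return np.linalg.solve(R + lam * D @ F @ D, r[1:order + 1])

# Synthetic AR(2) check: lam = 0 must reduce to conventional LP
rng = np.random.default_rng(0)
x = np.zeros(4096)
e = rng.standard_normal(4096)
for n in range(2, 4096):
    x[n] = 0.75 * x[n - 1] - 0.5 * x[n - 2] + e[n]
a_plain = rlp_coefficients(x, order=2, lam=0.0)
a_reg = rlp_coefficients(x, order=2, lam=1e-7)   # lambda value from the talk
```

Passing a Blackman, Hamming, or double-autocorrelation-based sequence as `acf_window * r` changes only how F is built; the solve stays the same.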
0:07:42 So this is what Ekman and colleagues proposed in the original work. We wanted to build on it because that method was applied to clean speech, from a speech coding point of view, whereas we wanted to see the performance of this method on speaker recognition, especially under additive noise.
0:08:08 In the literature there exist some works which use the double autocorrelation sequence to estimate the spectrum of given speech samples. The most recent one is from Interspeech 2010; they use the double autocorrelation sequence to estimate the speech spectrum in the presence of additive noise, and they analyzed it for noisy word recognition. In this work, besides the Blackman and Hamming windows, we propose to use this double autocorrelation sequence to compute the F matrix in the penalty function, which I explained on the previous slide, to see its effect on speaker verification.
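The double autocorrelation sequence itself is just the autocorrelation operator applied twice. A minimal sketch follows; the normalization and lag range are my assumptions, not taken from the cited work.

```python
import numpy as np

def autocorrelation(x, lags):
    """Biased autocorrelation estimate r[k] = sum_i x[i] * x[i+k]."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return np.array([np.dot(x[:n - k], x[k:]) for k in range(lags)])

def double_autocorrelation(frame, lags):
    """Autocorrelation of the autocorrelation sequence; attractive under
    additive noise because white noise mainly perturbs r[0]."""
    r = autocorrelation(frame, len(frame))
    r = r / (r[0] + 1e-12)                # normalize (assumption)
    return autocorrelation(r, lags)

rng = np.random.default_rng(0)
r2 = double_autocorrelation(rng.standard_normal(512), lags=21)
```

The resulting sequence can then be used wherever a windowed autocorrelation sequence is expected when building the F matrix.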
0:09:11 If you look at the average residual error and the penalty function as functions of the predictor order p: as expected, as the predictor order increases, the residual error decreases, but the penalty function phi increases. This is also as expected, because the main idea of the regularization is to smooth the spectrum, so the penalty should grow as the model starts to fit sharp spectral peaks.
0:09:51 Now let us look at the different spectrum estimation methods. This is again a speech sample, clean on the left-hand side and its noisy version, again at 0 dB SNR, on the right-hand side. Shown are the FFT spectrum and the regularized linear prediction spectra for the different window functions. As we can see, the Blackman and Hamming windows do not affect the spectra much; they look very similar to each other. However, when we use the double autocorrelation sequence in the regularization, we get a much smoother spectrum.
0:10:37 And I think this is the main problem in the additive noise condition: if we look at the dynamic range of a given spectrum, its maximum and minimum values, in both the clean and the noisy condition, there is a large variation between the clean and noisy cases. I think this mismatch is the main cause of the performance problems of the regularization. However, with the double autocorrelation sequence, the variation becomes smaller in comparison with the two other methods. So, based on this figure, we expect to see little difference in recognition performance between the Blackman and Hamming windows, because they produce almost the same spectra, but we should see some differences with the double autocorrelation sequence.
0:11:43 Ekman and colleagues proposed regularization for conventional autocorrelation-based linear prediction. We also apply regularization to the other all-pole models, weighted linear prediction and its stabilized version, stabilized weighted linear prediction, because the regularization is independent of the all-pole model being used: we just need to compute the autocorrelation matrix and the autocorrelation sequence. Once we have these two, we can regularize regardless of the method; whichever method we are using, we simply regularize it.
0:12:35 Looking at the experimental setup: we use the NIST 2002 corpus with GMM-UBM modeling. For the features, we first apply spectral subtraction to the noisy speech samples, and then we extract 12 MFCCs and their first- and second-order derivatives, with cepstral mean and variance normalization. We also apply T-norm normalization to the log-likelihood scores. We use two different noise types in the experiments, factory and babble noise from the NOISEX-92 database, at five different SNR levels. I would like to point out that we added noise only to the test-set speech samples; we did not touch the training samples, which are the original NIST samples. "Clean" is maybe not the correct term, because this is telephone speech, so it already includes some convolutive, though not additive, noise; by "clean" I mean the original NIST samples.
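The talk does not specify which spectral-subtraction variant was used in the front end; a generic magnitude-subtraction sketch, with the noise spectrum estimated from the first few frames and a spectral floor to avoid negative magnitudes, would be something like this (all parameter values are assumptions):

```python
import numpy as np

def spectral_subtraction(noisy, frame_len=256, hop=128,
                         noise_frames=5, floor=0.01):
    """Per-frame magnitude spectral subtraction: estimate the noise
    magnitude spectrum from the first frames, subtract, floor the
    result, and reuse the noisy phase."""
    win = np.hanning(frame_len)
    starts = range(0, len(noisy) - frame_len + 1, hop)
    specs = [np.fft.rfft(noisy[s:s + frame_len] * win) for s in starts]
    noise_mag = np.mean([np.abs(s) for s in specs[:noise_frames]], axis=0)
    enhanced = []
    for s in specs:
        mag = np.maximum(np.abs(s) - noise_mag, floor * np.abs(s))
        enhanced.append(mag * np.exp(1j * np.angle(s)))   # keep noisy phase
    return enhanced

rng = np.random.default_rng(0)
frames = spectral_subtraction(rng.standard_normal(4000))
```

In a real front end the noise estimate would come from a voice activity detector rather than blindly from the first frames.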
0:13:50 The first thing we need to do when using regularization is to optimize the lambda parameter, because it has a big impact: if you look at the regularization formula here in the dark box, when lambda equals zero it reduces to conventional linear prediction, so we need to optimize it first. In our experiments we optimized it by running the speaker recognition experiments on the original training and original test data for different values of lambda. Looking at the equal error rate as a function of lambda, we get the smallest equal error rate when lambda is 10^-7, so in the remaining experiments we use this lambda value for regularized linear prediction. For regularized weighted linear prediction and its stabilized version, we optimized the lambda value in the same way, separately for each method.
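The equal-error-rate criterion used for tuning lambda can be computed directly from trial scores. A minimal helper is sketched below; it is my own illustration, not the evaluation tooling actually used, and `run_trials` in the comment is a hypothetical function.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """EER: operating point where the false-acceptance rate (FAR)
    equals the false-rejection rate (FRR)."""
    ths = np.sort(np.concatenate([target_scores, nontarget_scores]))
    rates = [(np.mean(nontarget_scores >= t), np.mean(target_scores < t))
             for t in ths]
    far, frr = min(rates, key=lambda p: abs(p[0] - p[1]))
    return (far + frr) / 2.0

# Hypothetical lambda sweep; run_trials(lam) would run the full GMM-UBM
# experiment and return (target_scores, nontarget_scores):
# best_lam = min(lambdas, key=lambda lam: equal_error_rate(*run_trials(lam)))

well_separated = equal_error_rate(np.array([2.0, 3.0, 4.0]),
                                  np.array([-1.0, 0.0, 1.0]))
```

Perfectly separated scores give an EER of 0, while fully overlapping score distributions give 0.5.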
0:15:11 In the first experiment, we just want to see the effect of the autocorrelation windowing on the recognition performance. In the table, the boldface numbers show the best value in each row. As I mentioned when we looked at the spectra, the different window functions do not have a big effect on the recognition performance; however, using the double autocorrelation sequence for the regularization reduces the error rates. In the remaining experiments we therefore use the double autocorrelation sequence for the regularization.
0:16:08 Next, the regularization of the other all-pole modeling techniques, weighted linear prediction and stabilized weighted linear prediction. On this page, FFT is our baseline, as normally used in MFCC extraction. As we can see, in the clean case the regularization does not improve, but also does not harm, the performance. But in the noisy cases, especially at 0 dB and -10 dB, the regularization improves the recognition accuracy compared to the unregularized version of each pair, for example LP versus RLP. If we look at the numbers, for example at -10 dB babble noise, the EER reduces from 20% to 16%, and it is the same for the other all-pole models, regularized weighted linear prediction and regularized stabilized weighted linear prediction (RSWLP).
0:17:08 To show some DET curves: this is babble noise at the -10 dB SNR level, with conventional LP and its regularization using the double autocorrelation sequence. We can again see the large improvement given by regularized LP. The same holds for weighted linear prediction: we cannot see much difference between the conventional FFT baseline and weighted linear prediction, but when we regularize it, the recognition performance is improved. And the same goes for stabilized weighted linear prediction: if we regularize it, the recognition accuracy also improves.
0:17:56 To summarize our observations: first, the regularization does not harm the clean-condition performance. The different window functions do not affect the recognition performance much, but using the double autocorrelation sequence, as the spectrum envelope estimate, to compute the F matrix in the regularization improves the recognition accuracy. We also applied the regularization to other kinds of all-pole modeling techniques, such as weighted linear prediction and stabilized weighted linear prediction. Thank you.
0:19:10 This regularization can help us improve the recognition performance because spectral distortion is the main problem in the case of additive noise; if we can penalize this distortion, we can improve our recognition performance. That was the main point.
0:19:51 Not in the slides, no, but in the paper we have some deeper analysis of the regularization, in terms of spectral distortion and the like; we have more experiments on that.
0:20:49 ...to achieve the smallest MinDCF.
0:21:01 ...to get the smallest EER when we are optimizing lambda.