Okay, this is going to be somewhat technical. The topic is uncertainty modeling in text-dependent speaker recognition. One of the issues that I'm concerned with here is what to do in a speaker recognition context where you have very little data, so that the features you extract are necessarily going to be noisy in the statistical sense. This comes straight to the fore in text-dependent speaker recognition, where you may have just two seconds of data. It's also an important problem in text-independent speaker recognition, because of the need to be able to set a uniform threshold even in cases where your test utterances are of variable duration. It will be interesting to see what happens with that particular problem in the forthcoming NIST evaluation.
Some progress has been made with subspace methods: with i-vectors, you try to quantify the statistical noise in the i-vector extraction process and feed that into the PLDA model. But I've taken that possibility off the table for present purposes and said, look, subspace methods in general are not going to work in text-dependent speaker recognition because of the data distribution. Okay, so what I attempted to do was to tackle this problem of modeling speaker variability in the case where one is not able to characterize it by subspace methods.
I realized while preparing the presentation that the paper in the proceedings is very dense; it's rather difficult to read. But the idea, although it's a bit tricky, is fairly simple. So I have made an effort in the slides to communicate the core idea, and if you are interested then I recommend that you look at the slides rather than the paper; I posted the slides on my web page.
For this task we took RSR2015 Part III, that's the random digit portion of the RSR data. I'll just mention two things about this. Because of the design, you have five random digits at test time, while all ten digits were repeated three times at enrollment in random order, so you only see half of the digits at test time. And it actually turns out that under those conditions GMM methods have an advantage, because you can use all of your enrollment data no matter what the test utterance is, whereas if you pre-segment the data into digits you are constraining yourself to using only the enrollment data that corresponds to the digits that actually occur in the test utterance.
One other thing I should mention is that this paper is about the back end. We used a standard sixty-dimensional PLP front end, which is maybe not ideal; I think that will come up in the next talk. You can get much better results on female speakers if you use a low-dimensional front end, which I think others were the first to discover.
So the model that I was using here is a JFA model which uses low-dimensional hidden variables to characterize channel effects, but does not attempt to characterize speakers using subspace methods. The z vector that characterizes the speaker is used as a feature vector for speaker recognition.
And the problem I wanted to address was to design a back end that would take account of the fact that the number of observations available to estimate the components of this vector is very small: in general you have less than one frame per mixture component. If you have a two-second utterance and a UBM with five hundred and twelve Gaussians, a quick calculation shows that you have extremely sparse data.
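To make the sparsity concrete, here is the rough calculation, assuming the usual 10 ms frame shift (the frame rate is my assumption, not stated in the talk):

```python
# Back-of-the-envelope sparsity check (assuming a 10 ms frame shift)
frames = 2.0 / 0.010           # ~200 frames in a 2-second utterance
ubm_size = 512                 # number of UBM Gaussians
print(frames / ubm_size)       # ~0.39 frames per Gaussian on average
```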
So there are two back ends that I'll present. One, the joint density back end, uses point estimates of the features that are extracted at enrollment time and at test time, and models the correlation between the two in order to construct a likelihood ratio. The innovation in this paper, the hidden supervector back end, treats those two feature vectors as hidden variables, as in the original formulation of JFA. The key ingredient is to supply a prior distribution on the correlations between those hidden variables.
How much time do I have left? Sorry, I didn't catch that. How much? Okay, good. So let me just digress for a minute.
The way uncertainty modeling is usually tackled in text-independent speaker recognition is that you try to characterize the uncertainty in a point estimate of an i-vector using a posterior covariance matrix that is calculated from the zero-order statistics, and you do this on the enrollment side and on the test side independently. If you think about it, you realize that this isn't quite the right way to do it. The reason is that if you are hypothesizing a target trial, then what you see on the test side has to be highly correlated with what you see on the enrollment side; they are not statistically independent. And there has to be a benefit from using those correlations to quantify the uncertainty in the feature that comes out of the test utterance. There is something called the law of total variance that says, on average, when you condition one random variable on another you reduce the variance.
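For reference, the law of total variance being appealed to can be written as follows; the conditional variance is smaller on average than the unconditional one:

```latex
\operatorname{Var}(X)
= \underbrace{\mathbb{E}\!\left[\operatorname{Var}(X \mid Y)\right]}_{\text{expected residual uncertainty}}
+ \operatorname{Var}\!\left(\mathbb{E}[X \mid Y]\right)
\;\;\ge\;\; \mathbb{E}\!\left[\operatorname{Var}(X \mid Y)\right].
```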
Okay, so the critical thing that I introduced in this paper is this correlation between the enrollment side and the test side.
Okay, so here are the mechanics of how the joint density back end works; it's pretty straightforward. The features are point estimates. This was inspired by Cumani's work at the last Odyssey: he implemented this at the level of i-vectors, and there's nothing to stop us from doing it at the level of supervectors as well. You obviously can't train correlation matrices of supervector dimension, but you can implement the idea at the level of individual mixture components. So that gives you a trainable back end for text-dependent speaker recognition even if you can't use subspace methods. That's our best back end, and it's the one that I used as a benchmark for our experiments.
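A minimal sketch of per-component joint density scoring in this spirit; the variable names and the Gaussian parametrization below are illustrative assumptions, not the paper's notation:

```python
import numpy as np
from scipy.stats import multivariate_normal

def joint_density_llr(e, t, mu, C_tar, C_non):
    """Joint-density score for one mixture component.

    e, t   : point estimates of the enrollment and test features (dim d each)
    mu     : mean of the stacked vector [e; t] (dim 2d), trained on target trials
    C_tar  : 2d x 2d covariance with trained enrollment/test cross-correlations
    C_non  : the same covariance with the cross blocks zeroed (independence)
    """
    x = np.concatenate([e, t])
    return (multivariate_normal.logpdf(x, mu, C_tar)
            - multivariate_normal.logpdf(x, mu, C_non))

# The trial score is the sum of the per-component log likelihood ratios.
```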
So the hidden supervector back end is the hidden version of this. It says: you are in a position to observe Baum-Welch statistics, but you are not in a position to observe these z factors; you have to make inferences about the posterior distribution of those features and base your likelihood ratio on that calculation. Now, it turns out that the probability calculations are formally, mathematically equivalent to calculations with an i-vector extractor that has just two Gaussians in it. Take a mixture component from the UBM: you observe that mixture component once on the enrollment side and once on the test side, so you have two hidden Gaussians. You have a variable number of observations on the enrollment side and a variable number of observations on the test side, and that's exactly the type of situation that we model with an i-vector extractor. So there is an i-vector extractor here, but it's only being used to do probability calculations; it is not going to be used to extract features.
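In symbols, the per-component construction for a trial looks like the following; the notation is illustrative rather than the paper's, with N and F the occupation counts and centered first-order statistics and the prior on the stacked pair carrying the enrollment/test correlations:

```latex
z_c = \begin{pmatrix} z_{e,c} \\ z_{t,c} \end{pmatrix} \sim \mathcal{N}(\mu_{0,c}, \Sigma_{0,c}),
\qquad
F_{e,c} \mid z_c \sim \mathcal{N}\!\left(N_{e,c}\, z_{e,c},\; N_{e,c}\, \Sigma_c\right),
\qquad
F_{t,c} \mid z_c \sim \mathcal{N}\!\left(N_{t,c}\, z_{t,c},\; N_{t,c}\, \Sigma_c\right).
```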
Now, one thing about this i-vector extractor is that you are not going to use it to impose subspace constraints, because it has just the two Gaussians. You don't need to say that those two Gaussians lie in a low-dimensional subspace of the supervector space. So you might as well take the total variability matrix to be the identity matrix and shift all of the burden of modeling the data onto the prior distribution.
In i-vector modeling we always take a standard normal prior, zero mean and identity covariance matrix. That's because there is in fact no generality to be gained by using a non-standard prior: you can always compensate for a non-standard prior by fiddling with the total variability matrix. Here we take the total variability matrix to be the identity, but we have to train the prior. That involves doing posterior calculations: if you look at those formulas you'll see they look just like the standard ones, except that I now have a mean and a precision matrix, which would be zero and the identity matrix in the case of the standard normal prior.
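For concreteness, this is the standard i-vector posterior algebra with a general Gaussian prior N(mu_0, P_0^{-1}) plugged in (my notation); setting mu_0 = 0 and P_0 = I recovers the usual expressions, and in the present setting the total variability blocks T_c are the identity:

```latex
L = P_0 + \sum_{c} N_c\, T_c^{\top} \Sigma_c^{-1} T_c,
\qquad
\langle z \rangle = L^{-1}\Big( P_0\, \mu_0 + \sum_{c} T_c^{\top} \Sigma_c^{-1} F_c \Big),
\qquad
\operatorname{Cov}(z \mid \text{stats}) = L^{-1}.
```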
And you can do minimum divergence estimation, which is in effect a way of training the prior: if you think about what minimum divergence estimation actually does, you see that what you are doing is estimating a prior. Normally we then say, well, there's no gain in using a non-standard prior, so we standardize the prior and modify the total variability matrix instead. Here we just estimate the prior, and I put "estimate" in inverted commas, because estimating a prior is not something you're really supposed to do, but we do it all the time and it works.
So how would you train this? You have to organize your training data into target trials. For each trial and for each mixture component in the UBM, you would have observations on the enrollment side and on the test side, possibly multiple observations, so you have Baum-Welch statistics. Then you just run this minimum divergence estimation procedure, and you get a prior distribution that tells you what correlations to expect between the enrollment data and the test data in the case of a target trial.
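A sketch of the minimum divergence update applied to the stacked enrollment/test hidden variable, under the assumption that the per-trial posteriors have already been computed with the current prior (in practice one alternates this step with the posterior computation):

```python
import numpy as np

def minimum_divergence_prior(post_means, post_covs):
    """Re-estimate the prior N(mu0, Sigma0) from per-trial posteriors.

    post_means : (n_trials, d) posterior means of the stacked
                 enrollment/test hidden variables for target trials
    post_covs  : (n_trials, d, d) corresponding posterior covariances
    """
    mu0 = post_means.mean(axis=0)
    centered = post_means - mu0
    # Prior covariance = average posterior covariance + scatter of posterior means
    sigma0 = post_covs.mean(axis=0) + (centered.T @ centered) / len(post_means)
    return mu0, sigma0
```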
If you want to handle non-target trials, you just impose a statistical independence assumption: you zero out the correlations.
Okay, so the way you would use this machinery to calculate a likelihood ratio is that, given enrollment data and test data, you calculate the evidence. This is just the likelihood of the data that you get when you integrate out the hidden variables. It's not usually done, but I think everybody who has an implementation of i-vectors should always calculate the evidence, because it's a very good diagnostic that tells you whether your implementation is correct. You have to evaluate an integral, a Gaussian integral, and the answer can be expressed in closed form in terms of the Baum-Welch statistics, as in the paper. In order to use this for speaker recognition, you evaluate the evidence in two different ways, once with the prior for target trials and once with the prior for non-target trials; you take the ratio of the two, and that gives you your likelihood ratio for speaker recognition.
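A minimal per-component sketch of this evidence-based score, assuming the two-Gaussian construction above with the total variability matrix fixed to the identity; the paper works in closed form directly with the Baum-Welch statistics and sums the per-component scores over the UBM, whereas this just spells the Gaussian marginal out with numpy/scipy:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_evidence(F_e, F_t, N_e, N_t, Sigma, mu0, Sigma0):
    """Log evidence for one UBM component (assumes both counts are nonzero).

    F_e, F_t    : centered first-order statistics (dim d) on each side
    N_e, N_t    : occupation counts on each side
    Sigma       : UBM covariance of the component (d x d)
    mu0, Sigma0 : prior mean (2d) and covariance (2d x 2d) of [z_e; z_t]
    """
    d = len(F_e)
    A = np.block([[N_e * np.eye(d), np.zeros((d, d))],
                  [np.zeros((d, d)), N_t * np.eye(d)]])
    noise = np.block([[N_e * Sigma, np.zeros((d, d))],
                      [np.zeros((d, d)), N_t * Sigma]])
    # Marginal of [F_e; F_t] after integrating out the hidden variable
    return multivariate_normal.logpdf(np.concatenate([F_e, F_t]),
                                      A @ mu0, A @ Sigma0 @ A.T + noise)

def trial_llr(F_e, F_t, N_e, N_t, Sigma, mu0, Sigma0_target):
    # Non-target prior: same marginals, cross-correlations zeroed out
    d = len(F_e)
    Sigma0_non = Sigma0_target.copy()
    Sigma0_non[:d, d:] = 0.0
    Sigma0_non[d:, :d] = 0.0
    return (log_evidence(F_e, F_t, N_e, N_t, Sigma, mu0, Sigma0_target)
            - log_evidence(F_e, F_t, N_e, N_t, Sigma, mu0, Sigma0_non))
```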
So the mechanics of getting this to work depend critically on how you prepare the Baum-Welch statistics that summarize the enrollment data and the test data. The first thing you need to do concerns the enrollment utterances: each of those is potentially contaminated by channel effects, so you take the raw Baum-Welch statistics and filter out the channel effects, just using the JFA model. In that way you get a set of synthetic Baum-Welch statistics which characterizes the speaker: you pool the Baum-Welch statistics together after you have filtered out the channel effects. You do that on the enrollment side, you do the same thing on the test side, and in a trial you end up having to compare one set of Baum-Welch statistics with another using this hidden supervector back end.
And here's a new wrinkle that really makes this work. The Achilles heel of JFA models is the Gaussian assumption, and the reason why we do length normalization in between extracting i-vectors and feeding them to PLDA is in order to patch up the Gaussian assumptions. We have to do a similar trick here, but the normalization is a bit tricky, because you have to normalize Baum-Welch statistics; you are not normalizing a vector. Obviously the magnitude of the first-order statistics is going to depend on the zero-order statistics, so it's not immediately obvious what to normalize. The recipe that I used comes from going back to the JFA model and seeing how the JFA model is trained: these z vectors are treated as hidden variables that come with both a point estimate and an uncertainty, a posterior covariance matrix that tells you how much the observations tell you about the underlying hidden vector. The thing that turns out to be convenient to normalize is the expected norm of that hidden variable: instead of making the norm equal to one, you make the expected norm equal to one.
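A sketch of the quantity being normalized, under an assumed relevance-MAP style parametrization of the hidden z-vector (D squared equal to the UBM variances divided by the relevance factor); this parametrization and the variable names are my assumptions, and the exact rescaling of the statistics is the one given in the paper:

```python
import numpy as np

def expected_sq_norm(N, F, sigma, r):
    """E[||z||^2] of the hidden z-vector given Baum-Welch statistics.

    N     : (C,) occupation counts per UBM component
    F     : (C, d) centered first-order statistics
    sigma : (C, d) diagonal UBM covariances
    r     : relevance factor (assumed parametrization: D^2 = sigma / r)
    """
    prec = 1.0 + N / r                                # diagonal posterior precision per component
    mean = (F / np.sqrt(r * sigma)) / prec[:, None]   # posterior mean of z
    trace = (F.shape[1] / prec).sum()                 # trace of the posterior covariance (the dominant term)
    return (mean ** 2).sum() + trace

# The statistics are then rescaled so that this quantity equals one;
# the precise rescaling of the first-order statistics is spelled out in the paper.
```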
A curious thing is that the second term, the trace of the posterior covariance matrix, is actually the dominant term. That's because the uncertainty is so large, and there is an experiment in the paper that shows that you had better not neglect that term. As for the role of the relevance factor in the experiments reported in the paper: as you fiddle with the relevance factor you are actually fiddling with the relative magnitude of this term, so you do have to sweep over a range of relevance factors in order to get this thing working properly.
Okay, so here are some results using what I call global feature vectors; that's where we don't bother to pre-segment the data into digits. Remember, I said at the beginning that there was an advantage on this task to not segmenting, in other words to just ignoring the left-to-right structure that you're given in the problem.
So there is a GMM/UBM benchmark, a joint density benchmark, and two versions of the hidden supervector back end, one without length normalization applied to the Baum-Welch statistics and the other with it. You can see that the length normalization really is the key to getting this thing to work. I should also mention the reduction in error rate on the female side, from eight percent to six percent: that appears to have to do with the front end. We fixed a standard front end for these experiments, but it appears that if you use lower-dimensional feature vectors for female speakers you get better results, and I think that's the explanation.
There's actually a fairly big improvement if you go from one hundred and twenty-eight Gaussians to five hundred and twelve, even though the uncertainty in the case of five hundred and twelve is necessarily going to be greater. It was this phenomenon that originally motivated us to look at the uncertainty modeling problem.
You can also implement this if you pre-segment the data into digits and extract what I call local z-vectors in the paper, and it works in that case as well. There is a trick that we use here which I refer to as component fusion: you can break the likelihood ratio up into contributions from the individual Gaussians and weight them, where the weights are calculated using logistic regression. That helps quite a lot.
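A sketch of that component fusion step, using scikit-learn's logistic regression as an illustrative choice of trainer; the per-component log likelihood ratios of each trial are treated as a feature vector and the fusion weights are learned on a development set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_fusion(dev_scores, dev_labels, C=1.0):
    """dev_scores: (n_trials, n_components) per-component LLRs on the dev set.
    dev_labels: (n_trials,) 1 for target trials, 0 for non-target trials."""
    return LogisticRegression(C=C).fit(dev_scores, dev_labels)  # regularized

def fused_score(model, eval_scores):
    # Weighted combination of the per-component scores (log-odds of the target class)
    return model.decision_function(eval_scores)
```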
It requires, however, that you have a development set in order to choose the fusion weights. So in fact you'll see in the paper that with these local z-vectors, although we did obtain an improvement on the evaluation set, it was not as big an improvement as we obtained on the development set. You need data if you're going to use regularized logistic regression.
We found a way around that. Instead of pre-segmenting the data into individual digits, we used a speech recognition system to collect the Baum-Welch statistics. I mean, if in text-independent speaker recognition you can use a senone-discriminant neural network to collect Baum-Welch statistics, the obvious thing to do in text-dependent speaker recognition, where you know the phonetic transcription, is just to use a speech recognizer to collect the Baum-Welch statistics. Because individual senones are very unlikely to occur more than once in a digit string, you are implicitly imposing a left-to-right structure, so you don't have to do it explicitly. And that works just as well.
Okay, so there are some fusion results for the two approaches, with and without paying attention to the left-to-right structure. If you fuse them, you do get better results.
Okay, so just to summarize, one thing that I didn't dwell on is that this can be implemented very efficiently: you can basically set things up in such a way that the linear algebra that needs to be performed at runtime involves only diagonal matrices. So it's nothing like the i-vector back end that I presented at the last Interspeech conference, which was just a trial run for this and wasn't intended to be a realistic solution to the problem; that one involves essentially extracting an i-vector per trial, which is not something you would normally do. This, on the other hand, is computationally very reasonable, so it is practical. Okay, that's all I have to say. Thank you.
Okay, we have time for questions.
You are normalizing the channel effects in the Baum-Welch statistics; do you also normalize for the phoneme variability there as well?
Well, as I said, this is future work that I intend to do something about, and I think it is a problem we should pay attention to. We have done some preliminary work on it, but it's not something we can report here. The phonetics is really nailed down for you in text-dependent speaker recognition, so it's not so much of an issue as it is in text-independent speaker recognition, where it's really going to come from a neural network that's trained to discriminate senones.
Can I ask one more question: are you using a single channel point estimate?
That's right; that's what the recipe calls for, although I think it could somehow be refined. It's what the JFA recipe calls for: even though the channel variables are treated as hidden variables that have a posterior expectation and a posterior covariance matrix, if you look at the role that they play in the likelihood, and if you are merely interested in filtering out the channel effects, it turns out that all you need is the posterior expectation.
That's just what the model says; the model is very simple. Okay, thank you.