0:00:17would you just
0:00:19bear with me for a couple of minutes to set up some background and then i
0:00:23will try to explain
0:00:25in some detail what the technical problem is that we're trying to solve
0:00:30so for the jfa model here i formulated that in terms of gmm mean vectors
0:00:35rather than supervectors
0:00:38the first term m is the mean vector that comes from the universal background model
0:00:47the second term involves a hidden variable x
0:00:50which is independent of the channel
0:00:53excuse me independent of the mixture component and intended to model the channel effects across
0:00:58recordings
0:01:01and the third term in that formulation has a local hidden variable
0:01:06to characterize the
0:01:09speaker phrase variability within a particular mixture component
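The three-term formulation just described can be sketched as a single equation for the mean vector of mixture component c in recording r. The symbols m_c, U_c and D_c and the subscripts are assumed notation for illustration; the talk itself only names the hidden variables x and z and a matrix u.

```latex
\mu_{c,r} \;=\; m_c \;+\; U_c\, x_r \;+\; D_c\, z_c
```

Here m_c would be the universal background model mean for component c, x_r a hidden vector shared by all mixture components and modelling channel effects in recording r, and z_c a hidden vector local to component c characterizing speaker phrase variability.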
0:01:18the typical approach would be to estimate the matrix u
0:01:23using the maximum likelihood criterion which is exactly the criterion that is used to
0:01:31train an i-vector extractor
0:01:35in practice
0:01:37rather than use maximum likelihood you usually end up using relevance map as
0:01:42an empirical estimate of those matrices
0:01:47the relation between the two you can find explained in the paper by robbie vogt
0:01:53going back to two thousand and eight
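As a reminder of what relevance map does for a single mixture component, here is a minimal sketch. The function name, the argument names and the default relevance factor of 16 are assumptions for illustration and are not taken from the talk.

```python
def relevance_map_mean(ubm_mean, zero_order, first_order, relevance=16.0):
    """Relevance-MAP update of one mixture component's mean vector.

    ubm_mean    : prior mean from the background model (list of floats)
    zero_order  : soft count of frames aligned with this component
    first_order : posterior-weighted sum of the frames (list of floats)
    relevance   : relevance factor (assumed value, hypothetical default)
    """
    return [
        (f_i + relevance * m_i) / (zero_order + relevance)
        for f_i, m_i in zip(first_order, ubm_mean)
    ]


# With no data the estimate stays at the background model mean;
# with more data it moves toward the data mean first_order / zero_order.
no_data = relevance_map_mean([0.0, 1.0], 0.0, [0.0, 0.0])
some_data = relevance_map_mean([0.0, 1.0], 16.0, [16.0, 48.0])
```

The relevance factor controls how many frames are needed before the data outweighs the prior mean.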
0:01:56the point i would stress here is that the z vector is high dimensional we're not
0:02:03trying to explain the
0:02:05speaker phrase variability by a low dimensional vector of hidden variables
0:02:12it's a factorial prior in the sense that the
0:02:17explanations for the different mixture components are statistically independent
0:02:22which really is a weakness
0:02:24we're not actually in a position with a prior like this
0:02:27to exploit the correlations between mixture components
0:02:37so to do calculations with this type of model the standard method is an algorithm
0:02:43by robbie vogt which alternates between updating the two
0:02:48hidden variables x and z
0:02:51he didn't present it this way but it's actually a variational bayes algorithm which means
0:02:55that it comes with variational lower bounds that you can use to do
0:02:59likelihood or evidence calculations
0:03:03that means that you can for example
0:03:06formulate the
0:03:08speaker recognition problem in exactly the same way as it's done in
0:03:14plda
0:03:16as a bayesian model selection
0:03:19problem the question is whether
0:03:21if you're given enrollment utterances and test utterances and you want to
0:03:25account for that ensemble of data
0:03:29whether you are better off
0:03:32positing a single z vector
0:03:34or two z vectors one for the enrollment data one for the test
0:03:42there's something basically unsatisfactory about this namely
0:03:46it doesn't take account of the fact that
0:03:50what jfa is is a model for how
0:03:54the ubm moves under speaker and channel effects
0:03:59when we do these calculations we use the universal background model to collect baum-welch statistics
0:04:06and ignore the fact
0:04:07that according to our model
0:04:10the ubm is actually shifted as a result of these hidden variables
0:04:19there is an important paper by zhao and dong that attempts to remedy this and i was
0:04:24particularly interested in looking into this for the reason that i mentioned at the beginning
0:04:29i believe that
0:04:31the ubm does have to be adapted
0:04:34in text dependent speaker recognition
0:04:38this is a principled way of doing that it introduces an extra set of
0:04:43hidden variables
0:04:45indicators which
0:04:47show you how the frames are aligned with mixture components
0:04:53that can be interleaved into the
0:04:58variational bayes updates in vogt's algorithm
0:05:02so that you get
0:05:04a
0:05:04coherent framework for handling that adaptation
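To make the role of the alignment indicators concrete, here is a toy sketch of how frame alignment posteriors and baum-welch statistics depend on the mean vectors used to collect them. It uses a one dimensional gmm, and every name in it is hypothetical; it is not the variational bayes algorithm from the talk, only the alignment step that such an algorithm would interleave with the updates of x and z.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def baum_welch_stats(frames, weights, means, variances):
    """Zero- and first-order statistics of a 1-D GMM for a list of frames.

    The posterior gamma plays the role of the expected alignment indicator:
    the probability that a frame is aligned with a given mixture component.
    """
    zero_order = [0.0] * len(means)
    first_order = [0.0] * len(means)
    for x in frames:
        likes = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(likes)
        for c, like in enumerate(likes):
            gamma = like / total
            zero_order[c] += gamma
            first_order[c] += gamma * x
    return zero_order, first_order


frames = [-1.0, 1.0]
# Statistics collected with the unadapted background model means ...
n_ubm, f_ubm = baum_welch_stats(frames, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
# ... and with means shifted as the hidden variables would shift them.
n_adapted, f_adapted = baum_welch_stats(frames, [0.5, 0.5], [0.0, 2.0], [1.0, 1.0])
```

Collecting statistics with the shifted means changes the alignments, which is the effect that is ignored when the unadapted background model is always used.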
0:05:11there's just
0:05:14there's just one caveat
0:05:16that i think is worth pointing out about this algorithm
0:05:21it requires that you take account of all of the hidden variables when you're doing
0:05:26ubm adaptation and the evidence calculations
0:05:30now of course that's what you should do if the model is to be believed
0:05:35if you take the model at face value
0:05:37you should take account of all of the hidden variables
0:05:40however what's going on here is that this factorial prior is actually so weak that
0:05:50doing things by the book
0:05:52can lead you into problems
0:05:54so that's why i have flagged this here as a kind of
0:06:00so in the paper i presented results on the rsr data using
0:06:05three types of classifier
0:06:08that come out of these calculations
0:06:13the first one there is simply to use the z vectors that come either from
0:06:19vogt's calculation or from the zhao and dong calculation
0:06:23as features which
0:06:26if they are extracted properly should be purged of channel effects
0:06:31okay and then just feeding those into a simple backend like the cosine distance classifier
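A cosine distance backend of the kind mentioned here can be sketched in a few lines. The function name and the toy vectors are assumptions for illustration; in practice the inputs would be the z vectors extracted from the enrollment and test utterances.

```python
import math


def cosine_score(enroll_vec, test_vec):
    """Cosine similarity between an enrollment and a test feature vector."""
    dot = sum(a * b for a, b in zip(enroll_vec, test_vec))
    norm = (math.sqrt(sum(a * a for a in enroll_vec))
            * math.sqrt(sum(b * b for b in test_vec)))
    if norm == 0.0:
        raise ValueError("cosine score is undefined for a zero vector")
    return dot / norm


# A trial is accepted when the score exceeds a threshold tuned on held-out data.
same_direction = cosine_score([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_score([1.0, 0.0], [0.0, 1.0])
```

The score depends only on the direction of the vectors, not their length, so no further model is needed in the back end.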
0:06:37jfa as it was originally construed
0:06:42was intended not only to be a feature extractor but also to act as a classifier
0:06:47that's the two additional ones
0:06:54in order to understand
0:06:56this problem of ubm adaptation
0:07:01it's necessary also to look into what's going on
0:07:05with those bayesian model selection algorithms
0:07:09okay
0:07:10what happens when you apply them
0:07:12without ubm adaptation as in
0:07:16vogt's algorithm
0:07:18or with ubm adaptation as in zhao and dong's algorithm
0:07:22and also to compare it with
0:07:26the likelihood ratio calculation
0:07:32that was traditional about two thousand and eight
0:07:36it turns out that
0:07:38when you look into these questions there's a whole bunch of anomalies that
0:07:43arise
0:07:47if you're using jfa as a feature extractor ubm adaptation hurts
0:07:55performance
0:07:57this is true
0:07:58for z vectors it's not true for i-vectors it's not true for speaker factors
0:08:03there it behaves reasonably but not for z vectors that's in
0:08:09this year's icassp
0:08:13on the other hand
0:08:14if you look at the problem of maximum likelihood estimation
0:08:18of the jfa model parameters
0:08:23what you find is that it doesn't work at all
0:08:26without ubm adaptation you do need
0:08:29ubm adaptation in order to get that to behave
0:08:34if you look at bayesian model selection you find that there are some cases
0:08:39where zhao and dong's algorithm
0:08:42works better than vogt's
0:08:44and other cases where exactly the opposite happens
0:08:49the traditional jfa likelihood ratio is actually very simplistic it just uses plug in estimates
0:08:56rather than attempting to integrate over hidden variables and no ubm adaptation at all
0:09:03what i will show in this paper is that it can be made to work
0:09:07very well
0:09:08with very careful
0:09:10ubm adaptation
0:09:12okay so this business of ubm adaptation turns out to be very tricky
0:09:17anyone who has been around in the field long enough has probably
0:09:23been bitten by this problem at some stage
0:09:28certainly in my own experience
0:09:31i couldn't get jfa working at all
0:09:34until i stopped doing the ubm adaptation
0:09:38but it doesn't really make a lot of sense because if you look at the history
0:09:41of subspace methods eigenvoices eigenchannels
0:09:45they were all implemented originally with ubm adaptation
0:09:49if you speak to
0:09:51guys in speech recognition they will be surprised
0:09:55if you tell them that you're not doing ubm adaptation
0:09:57it is essential for instance in
0:10:01say subspace gaussian mixture models
0:10:08so here's an example these are just some examples of the anomalous results that arise
0:10:15okay these are the
0:10:17bayesian model selection results
0:10:21on the left hand side
0:10:24is with five hundred and twelve
0:10:27gaussians in the ubm
0:10:30on the right hand side with sixty four
0:10:34in the case of the small ubm
0:10:36zhao and dong's algorithm
0:10:38does more
0:10:39gives you a small improvement
0:10:41but it doesn't help with five twelve gaussians
0:10:48the results in the first two lines are the same as in
0:10:51the last slide the
0:10:53third line there is the traditional jfa likelihood ratio
0:10:57and that the it's model selection and style with or without
0:11:04ubm adaptation
0:11:07so this then is what the paper is about what i want
0:11:11to show is that
0:11:14if you start with the traditional jfa likelihood ratio
0:11:19maybe just recall briefly
0:11:21how that goes
0:11:23you have a numerator and denominator
0:11:26in the numerator
0:11:28okay you plug in
0:11:30the target speaker's
0:11:33supervector and you use that to center the baum-welch statistics and you integrate over the
0:11:38channel factors
0:11:40in the
0:11:43in the denominator you plug in
0:11:46the ubm supervector and you do exactly the same
0:11:50calculation and you compare
0:11:52those two probabilities
0:11:55no ubm adaptation going on at all and a plug in estimate
0:11:59which is not serious in the numerator but in the denominator it really is problematic
0:12:06theory says you should be integrating over the entire speaker population
0:12:11rather than plugging in
0:12:14the mean value the value that comes from the ubm supervector
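The traditional likelihood ratio just recalled can be sketched as follows. The notation is assumed for illustration: \mathcal{X} stands for the test utterance, s for the plug in estimate of the target speaker's supervector, m for the ubm supervector, U for the channel matrix and x for the channel factors, with a standard normal prior on x taken as an assumption.

```latex
\mathrm{LLR}(\mathcal{X})
  \;=\; \log \int P(\mathcal{X} \mid s + U x)\, \mathcal{N}(x \mid 0, I)\, dx
  \;-\; \log \int P(\mathcal{X} \mid m + U x)\, \mathcal{N}(x \mid 0, I)\, dx
```

Both terms integrate over the channel factors only. In the denominator m is a plug in value standing in for an integral over the whole speaker population, which is where the approximation is most questionable.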
0:12:22what i will show is that if you do the adaptation very carefully
0:12:28adapt
0:12:30the ubm to some of the hidden variables but not all of them
0:12:34then everything will work properly
0:12:39this is as long as you are
0:12:41using jfa as a classifier and you're calculating likelihood ratios
0:12:49if you're using it as a feature extractor and this turns out to give the
0:12:53best results
0:12:55it turns out that you're better off
0:12:57avoiding ubm adaptation altogether
0:13:00i'll give you an explanation for this
0:13:03it has to do with the fact that the factorial prior is too weak this phenomenon
0:13:08is related to factorial priors not to subspace priors
0:13:17well really for this problem the first type of adaptation that you want to consider
0:13:23is the lexical mismatch between your
0:13:28enrollment and test utterances on the one hand
0:13:31and the ubm that might have been trained
0:13:33on some other
0:13:36some other data
0:13:39in the jfa likelihood ratio you're actually comparing the test speaker to the
0:13:45ubm speaker
0:13:47but if you consider what's going on here if you have
0:13:50known lexical content in the trial
0:13:53that is the thing which will most determine what the data looks like
0:14:00not the ubm so you would be much better off
0:14:03comparing to a phrase adapted
0:14:05background model rather than
0:14:08to the universal background model so
0:14:10if you simply adapt the ubm
0:14:13to the lexical content of the phrase that is used in a particular trial
0:14:18that will lead to a substantial improvement
0:14:22in performance
0:14:24so what's going on here is that
0:14:26in the
0:14:28in the rsr data
0:14:34in the rsr data there are thirty different phrases
0:14:38okay the mean supervector of jfa is adapted to each of the phrases
0:14:43but all of the other parameters are shared across phrases
0:14:51if you adapt to the
0:14:55channel effects in the test data
0:14:57this will work fine
0:15:01this refers back to the sort of early history of eigenchannel
0:15:07modeling
0:15:09there are two alternative ways of going about that you can combine the two together
0:15:14and you will get a slight improvement there's
0:15:17there's no problem there
0:15:18if you
0:15:21if you adapt
0:15:22to the speaker effects in the enrollment data it will work fine
0:15:26okay so what i mean here is that you
0:15:29collect the baum-welch statistics from the test utterance with
0:15:34a gmm that has been
0:15:37adapted to the target speaker
0:15:40you get an improvement
0:15:42if you
0:15:45perform multiple
0:15:46iterations of map to adapt to the
0:15:51lexical content things work even better
0:15:53so at this stage if you look through those lines you see that
0:15:57we've already got a forty percent improvement in error rates
0:16:03just through doing
0:16:06ubm adaptation carefully
0:16:11this slide unfortunately we're going to have to skip because of the time constraints
0:16:17it's interesting but i just don't have time to deal with that
0:16:25here are results with five hundred and twelve gaussians
0:16:31it turns out that doing careful adaptation with a ubm with sixty four gaussians
0:16:36will get you about the same performance as
0:16:41working with five hundred and twelve gaussians and no adaptation
0:16:46if you try adaptation with five twelve gaussians
0:16:50things will not behave so well this is a rather extreme case where you have
0:16:55many more gaussians than you actually have frames in your test utterances
0:17:01and the remaining two lines are results that are obtained
0:17:07with z vectors as features rather than
0:17:11using likelihood computations
0:17:15likelihood ratio computations
0:17:17the difference between the two is that nap is used in one case but
0:17:21not the other
0:17:22the
0:17:24point there is that you don't need nap
0:17:27okay because you've already suppressed
0:17:30the channel effects
0:17:32in extracting the z vectors
0:17:38and these then are results on the full test set
0:17:42the full rsr test set
0:17:44just to compare
0:17:49the z vector classifier
0:17:51using vogt's algorithm that's to say no ubm adaptation
0:17:55and zhao and dong's algorithm with ubm adaptation
0:18:00and you can see that you're better off using vogt's algorithm i'll explain that
0:18:05in a minute it will only take a second
0:18:08okay so these are the
0:18:10these are the conclusions
0:18:13you can adapt to everything in sight and it will work
0:18:17but there's one thing you should not do
0:18:19and that is adapt to speaker effects in the test utterance
0:18:28the reason for that is actually
0:18:31this i believe is what's going on
0:18:33the factorial prior is extremely weak if you have a single test utterance
0:18:39okay and you're doing ubm adaptation
0:18:43then you're allowing
0:18:46different mean vectors in the gmm
0:18:49to be displaced in statistically independent ways that gives you an awful lot of freedom
0:18:54to align
0:18:56the data with the gaussians too much freedom
0:19:01see what happens if you
0:19:05if you have multiple enrollment utterances which is normally the case in text dependent speaker recognition
0:19:12you still have a very weak prior
0:19:14but you have a strong extra constraint
0:19:18if you go across the enrollment utterances the gaussians cannot move in statistically independent
0:19:24ways they have to move in lockstep
0:19:27okay and that means that the
0:19:29adaptation algorithm will behave sensibly
0:19:34if you do
0:19:37adaptation to the channel effects in the test utterance again things will behave sensibly
0:19:42and the reason for that
0:19:45is the subspace prior
0:19:47channel effects are assumed to be confined to a low dimensional subspace
0:19:51that imposes a strong constraint
0:19:54on the way
0:19:57the gaussians can move
0:20:02so final slide the
0:20:07if you're using jfa as a feature extractor
0:20:11which is my recommendation
0:20:14then the upshot of all this
0:20:17is that
0:20:19in the case of the test utterance when you extract the feature vector you cannot
0:20:23use ubm adaptation
0:20:25if you cannot use that
0:20:27in extracting a feature from the test utterance you cannot use it in extracting
0:20:32a feature from the enrollment utterance either otherwise the features
0:20:38would not be comparable
0:20:40okay so in other words you have to use vogt's algorithm
0:20:43rather than zhao and dong's
0:20:48adaptation of the ubm to the lexical content still works very well there's a fifty
0:20:54percent error rate reduction compared with
0:20:58the icassp paper
0:21:02there's a follow on paper at interspeech which shows how this idea of adaptation to
0:21:08phrases can be extended to give a simple
0:21:14procedure for domain adaptation
0:21:17so you can train
0:21:19on some dataset like nist data and use it on say a text-dependent
0:21:26task domain
0:21:28and finally these z vectors at least on the rsr data
0:21:33they are very good features there is no residual
0:21:39channel variability left to model in the back end
0:21:43okay thank you