Would you just bear with me for a couple of minutes while I set out some background, and then I will try to explain in some detail what the technical problem is that we're trying to solve.

So for the JFA model, I have formulated it here in terms of GMM mean vectors and supervectors. The first term is the mean vector that comes from the universal background model. The second term involves a hidden variable x which is independent of the mixture component and is intended to model the channel effects across recordings. And the third term in that formulation has a local hidden variable z to characterize the speaker-phrase variability within a particular mixture component.
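
Just to make that concrete, the decomposition I have in mind looks roughly like this, written per mixture component; the symbols are my own shorthand rather than the exact notation on the slide:

```latex
% JFA decomposition of the GMM mean supervector, per mixture component c
% (shorthand, not the slide's exact notation):
%   m_c : UBM mean vector of component c
%   U_c : block of the channel loading matrix for component c
%   x   : channel factors, shared across all mixture components of a recording
%   z_c : local hidden variable for the speaker-phrase offset of component c
\[
  s_c \;=\; m_c \;+\; U_c\,x \;+\; d_c\, z_c ,
  \qquad x \sim \mathcal{N}(0, I), \quad z_c \sim \mathcal{N}(0, I).
\]
```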

So the typical approach would be to estimate the factor loading matrices using the maximum likelihood criterion, which is exactly the criterion that is used to train an i-vector extractor. In practice, rather than use maximum likelihood, you usually end up using relevance MAP as an empirical estimate of the matrix D. The relation between the two is explained in the paper by Robbie Vogt going back to 2008.
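
As a reminder, and this is my recollection of the standard result rather than a quotation from that paper, relevance MAP with relevance factor tau amounts to fixing the diagonal matrix D from the UBM covariances:

```latex
% Relevance MAP as an empirical estimate of D (my recollection of the
% standard result; tau is the relevance factor, Sigma the diagonal UBM
% covariance supervector):
\[
  I \;=\; \tau\, D^{\mathsf T} \Sigma^{-1} D
  \quad\Longleftrightarrow\quad
  D \;=\; (\Sigma/\tau)^{1/2},
\]
% so that MAP point estimation of z under the prior z ~ N(0, I) reproduces
% classical relevance MAP adaptation of the UBM means.
```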

The point I want to stress here is that the z vector is high dimensional: we are not trying to explain the speaker-phrase variability by a low-dimensional vector of hidden variables. It is a factorial prior in the sense that the explanations for the different mixture components are statistically independent, which really is a weakness: with a prior like this we are not in a position to exploit the correlations between mixture components.

To do calculations with this type of model, the standard method is an algorithm by Robbie Vogt which alternates between updating the two hidden variables x and z. It wasn't presented that way, but it is actually a variational Bayes algorithm, which means that it comes with variational lower bounds that you can use for likelihood or evidence calculations.
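
To give a rough idea of how those alternating updates go, here is a minimal sketch in Python; the variable names, the statistics layout, and the diagonal-covariance assumption are mine, not the paper's:

```python
import numpy as np

def alternating_jfa_updates(N, F, U, d, Sigma, n_iters=5):
    """Sketch of Vogt-style alternating updates of the JFA hidden variables
    x (channel) and z (speaker-phrase), given Baum-Welch statistics.
    Assumptions (mine, for illustration): diagonal UBM covariances, and
    first-order stats F already centered on the UBM means.

    N     : (C,)      zeroth-order stats per mixture component
    F     : (C*Fd,)   centered first-order stats in supervector layout
    U     : (C*Fd, R) channel loading matrix
    d     : (C*Fd,)   diagonal of D
    Sigma : (C*Fd,)   diagonal UBM covariance supervector
    """
    CF, R = U.shape
    NN = np.repeat(N, CF // len(N))          # expand counts to supervector layout
    x, z = np.zeros(R), np.zeros(CF)
    for _ in range(n_iters):
        # update the posterior mean of x, holding z fixed
        res = F - NN * (d * z)
        prec_x = np.eye(R) + U.T @ (NN[:, None] * U / Sigma[:, None])
        x = np.linalg.solve(prec_x, U.T @ (res / Sigma))
        # update the posterior mean of z, holding x fixed (diagonal case)
        res = F - NN * (U @ x)
        prec_z = 1.0 + NN * d * d / Sigma
        z = (d * res / Sigma) / prec_z
    return x, z
```

The full variational Bayes treatment also tracks the lower bound after each of these updates, which I have left out here.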

That means that you can, for example, formulate the speaker recognition problem in exactly the same way as it is done in PLDA, as a Bayesian model selection problem. The question is, if you are given enrollment utterances and a test utterance and you want to account for that ensemble of data, whether you are better off positing a single z vector, or two vectors, one for the enrollment data and one for the test.
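
Schematically, the score is a ratio of evidences, each of which is approximated by the variational lower bound coming out of the algorithm; this is my shorthand, with E and T standing for the enrollment and test statistics:

```latex
% Bayesian model selection score (shorthand): one z for all the data versus
% one z for enrollment and one for test, with the channel factors and the
% prior on z integrated out inside each evidence term.
\[
  \text{score}
  \;=\;
  \log\frac{\displaystyle\int P(\mathcal{E},\mathcal{T}\mid z)\,P(z)\,dz}
           {\displaystyle\int P(\mathcal{E}\mid z)\,P(z)\,dz\;
            \int P(\mathcal{T}\mid z)\,P(z)\,dz}.
\]
```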

There is something basically unsatisfactory about this, namely that it doesn't take account of the fact that what JFA is, is a model for how the UBM moves under speaker and channel effects. Traditionally, when we do these calculations, we use the universal background model to collect Baum-Welch statistics and ignore the fact that, according to our model, the UBM is actually shifted as a result of these hidden variables.

There is an important variant by Zhao and Dong that attempts to remedy this, and I was particularly interested in looking into it for the reason that I mentioned at the beginning: I believe that the UBM does have to be adapted in text-dependent speaker recognition. This is a principled way of doing that. It introduces an extra set of hidden variables, indicators which show how the frames are aligned with mixture components, and these can be interleaved into the variational Bayes updates in Vogt's algorithm, so that you get a coherent framework for handling that adaptation problem.
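
In shorthand, and this is my notation rather than theirs, the variational posterior is factorized over the three sets of hidden variables, and the indicator update is just one more step in the coordinate ascent:

```latex
% Factorized variational posterior (my shorthand): x = channel factors,
% z = speaker-phrase variables, I = frame-to-component indicators.
\[
  q(x, z, \mathcal{I}) \;=\; q(x)\,q(z)\,q(\mathcal{I}),
\]
% where updating q(I) re-aligns the frames with the adapted GMM and can be
% interleaved with the updates of q(x) and q(z) in Vogt's algorithm.
```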

There is just one caveat that I think is worth pointing out about this algorithm: it requires that you take account of all of the hidden variables when you are doing UBM adaptation and the evidence calculations. Now, of course, that is what you should do if the model is to be believed; if you take the model at face value, we should take account of all of the hidden variables. However, what is going on here is that this factorial prior is actually so weak that doing things by the book does lead you into problems. That is why I have flagged this here as a kind of warning.

In the paper I presented results on the RSR2015 data using three types of classifier that come out of these calculations. The first one is simply to use the z vectors, which can come either from Vogt's calculation or from Zhao and Dong's calculation, as features which, if extracted properly, should be purged of channel effects, and then just feed those into a simple back end like the cosine distance classifier.
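
By a cosine distance back end I just mean something like this (an illustrative sketch, nothing more):

```python
import numpy as np

def cosine_score(z_enroll, z_test):
    """Cosine similarity between the z-vectors extracted from the
    enrollment and test utterances (illustrative back end)."""
    return float(z_enroll @ z_test /
                 (np.linalg.norm(z_enroll) * np.linalg.norm(z_test)))
```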

JFA as it was originally construed was intended to serve not only as a feature extractor but also as a classifier in its own right. However, in order to understand this problem of UBM adaptation, it is necessary also to look into what is going on with those Bayesian model selection algorithms.

What happens when you apply them, without UBM adaptation in the case of Vogt's algorithm, or with UBM adaptation in the case of Zhao and Dong's algorithm? And it is also necessary to compare them with the likelihood ratio calculation which was traditional around 2008. It turns out, when you look into these questions, that there is a whole bunch of anomalies that arise.

If you are using JFA as a feature extractor, UBM adaptation hurts performance. This is true for these z vectors; it is not true for i-vectors and it is not true for speaker factors, where it behaves reasonably, but not for z vectors. That was in this year's ICASSP paper.

On the other hand, if you look at the problem of maximum likelihood estimation of the JFA model parameters, what you find is that it doesn't work at all without UBM adaptation; you do need UBM adaptation in order to get it to behave sensibly.

If you look at Bayesian model selection, you find that there are some cases where Zhao and Dong's algorithm works better than Vogt's, and other cases where exactly the opposite happens. The traditional JFA likelihood ratio is actually very simplistic: it just uses plug-in estimates rather than attempting to integrate over hidden variables, and no UBM adaptation at all.

What I will show in this paper is that it can be made to work very well with very careful UBM adaptation. So this business of UBM adaptation turns out to be very tricky, and anyone who has been around in this field long enough has probably been tripped up by this problem at some stage.

In my own experience, I couldn't get JFA working at all until I stopped doing UBM adaptation. But that doesn't really make a lot of sense, because if you look at the history of subspace methods, eigenvoices and eigenchannels, they were originally implemented with UBM adaptation. If you speak to people in speech recognition, they will be surprised if you tell them that you are not doing UBM adaptation; it is essential, for instance, in subspace Gaussian mixture models.

So here are just some examples of the anomalous results that arise. These are the Bayesian model selection results: on the left-hand side with 512 Gaussians in the UBM, on the right-hand side with 64. In the case of the small UBM, Zhao and Dong's algorithm gives you a small improvement; that doesn't help with 512 Gaussians.

Here are the results: the first two lines are the same as on the last slide, that is, Bayesian model selection with and without UBM adaptation, and the third line is the traditional JFA likelihood ratio.

So this, then, is what the paper is about. What I want to show is what happens if you start with the traditional JFA likelihood ratio; maybe I should just recall briefly how that goes. You have a numerator and a denominator. In the numerator, you plug in the target speaker's supervector, you use that to center the Baum-Welch statistics, and you integrate over the channel factors. In the denominator, you plug in the UBM supervector and do exactly the same calculation, and you compare those two probabilities.
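
In shorthand (mine, not the paper's), with X standing for the Baum-Welch statistics of the test utterance, s for the target speaker's plug-in supervector and m for the UBM supervector, the ratio is:

```latex
% Traditional JFA likelihood ratio: plug-in supervectors in both terms,
% with the channel factors x integrated out.
\[
  \Lambda
  \;=\;
  \log\frac{\displaystyle\int P(\mathcal{X}\mid s + Ux)\,\mathcal{N}(x\mid 0,I)\,dx}
           {\displaystyle\int P(\mathcal{X}\mid m + Ux)\,\mathcal{N}(x\mid 0,I)\,dx}.
\]
```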

There is no UBM adaptation going on at all, and a plug-in estimate is used, which is not serious in the numerator, but in the denominator it really is problematic, because theory says you should be integrating over the entire speaker population rather than plugging in the mean value, the value that comes from the UBM supervector.

What I will show is that if you do the adaptation very carefully, adapting the UBM to some of the hidden variables but not all of them, then everything will work properly, as long as you are using JFA as a classifier, that is, calculating likelihood ratios. However, if you are using it as a feature extractor, and this turns out to give the best results, it turns out that you are better off avoiding UBM adaptation altogether. I will give you an explanation for this: it has to do with the fact that the factorial prior is too weak; this phenomenon is tied to factorial priors, not to subspace priors.

Really, for this problem, the first type of adaptation that you want to consider addresses the lexical mismatch between your enrollment and test utterances on the one hand, and the UBM, which might have been trained on some other data, on the other. In the JFA likelihood ratio you are effectively comparing the test speaker's data against the UBM "speaker". But if you consider what is going on here, when you have known lexical content in the trial, that is the thing which will most determine what the data looks like, not the UBM, so you would be much better off comparing to a phrase-adapted background model than to the universal background model. If you simply adapt the UBM to the lexical content of the phrase that is used in a particular trial, that will lead to a substantial improvement in performance.
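
By adapting to the lexical content I mean ordinary relevance MAP of the UBM means using the statistics of the phrase; a minimal sketch, with the layout and the relevance factor as my own assumptions, would be:

```python
import numpy as np

def phrase_adapt_means(ubm_means, N, F, relevance=16.0):
    """Relevance-MAP update of the UBM component means towards the lexical
    content of one phrase (illustrative sketch).

    ubm_means : (C, Fd) UBM component means
    N         : (C,)    zeroth-order stats pooled over utterances of the phrase
    F         : (C, Fd) first-order stats (not centered)
    relevance : assumed relevance factor
    """
    alpha = (N / (N + relevance))[:, None]        # per-component adaptation weight
    ml_means = F / np.maximum(N, 1e-10)[:, None]  # per-component ML mean estimates
    return alpha * ml_means + (1.0 - alpha) * ubm_means
```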

What is going on here is that in the RSR2015 data there are thirty different phrases; the mean supervector of JFA is adapted to each of the phrases, but all of the other parameters are shared across phrases. If you adapt to the channel effects in the test data, that will also work fine.

These remarks refer back to the early history of eigenchannel modeling: there are two alternative ways of going about it, you can combine the two together, and you will get a slight improvement; there is no problem there.

If you adapt to the speaker effects in the enrollment data, that will also work fine. What I mean here is that you collect the Baum-Welch statistics from the test utterance with a GMM that has been adapted to the target speaker, and you get an improvement. If you perform multiple iterations of MAP to adapt to the lexical content, things work even better.

So at this stage, if you look through those lines, you see that we have already got a forty percent improvement in error rates just through doing UBM adaptation carefully. This slide, unfortunately, we are going to have to skip because of the time constraints; it's interesting, but I just don't have time to deal with it.

Here are results with 512 Gaussians. It turns out that doing careful adaptation with a UBM of 64 Gaussians gives you about the same performance as working with 512 Gaussians and no adaptation. If you try adaptation with 512 Gaussians, things will not behave so well; this is a rather extreme case where you have many more Gaussians than you actually have frames in your test utterances.

The remaining two lines present results that are obtained with z vectors as features rather than with likelihood ratio computations. The difference between the two is that NAP is used in one case but not the other. The point there is that you don't need NAP, because you have already suppressed the channel effects in extracting the z vectors.

And these, then, are results on the full RSR2015 test set, just to compare the z vector classifier using Vogt's algorithm, that is to say no UBM adaptation, and Zhao and Dong's algorithm with UBM adaptation. You can see that you are better off using Vogt's algorithm; I will explain that in a minute, it will only take a second.

So these are the conclusions. You can adapt to everything in sight and it will work, but there is one thing you should not do, and that is adapt to the speaker effects in the test utterance. The reason for that, I believe, is this: the factorial prior is extremely weak. If you have a single test utterance and you are doing UBM adaptation, then you are allowing the different mean vectors in the GMM to be displaced in statistically independent ways; that gives you an awful lot of freedom to align the data with the Gaussians, too much freedom.

Now see what happens if you have multiple enrollment utterances, which is normally the case in text-dependent speaker recognition. You still have a very weak prior, but you have a strong extra constraint: across the enrollment utterances the Gaussians cannot move in statistically independent ways, they have to move in lockstep, and that means the adaptation algorithm will behave sensibly.

If you do adaptation to the channel effects in the test utterance, things will also behave sensibly, and the reason for that is the subspace prior: channel effects are assumed to be confined to a low-dimensional subspace, and that imposes a strong constraint on the way the Gaussians can move.

So, final slide. If you are using JFA as a feature extractor, which is my recommendation, then the upshot of all this is that when you extract the feature vector from the test utterance you cannot use UBM adaptation. And if you cannot use it in extracting a feature from the test utterance, you cannot use it in extracting a feature from the enrollment utterances either, otherwise the features would not be comparable. In other words, you have to use Vogt's algorithm rather than Zhao and Dong's.

Adaptation of the UBM to the lexical content still works very well: there is a fifty percent error rate reduction compared with the ICASSP paper. There is a follow-on paper at Interspeech which shows how this idea of adaptation to phrases can be extended to give a simple procedure for domain adaptation, so that you can train JFA on data from a different domain and use it on, say, a text-dependent task. And finally, these z vectors, at least on the RSR2015 data, are very good features: there is no residual channel variability left to model in the back end. Thank you.