Speech Transcript - Unsupervised Domain Adaptation for I-Vector Speaker Recognition

so as niko mentioned this is work mainly from last summer and continuing on after

the end of last summer

primarily by daniel my colleague at johns hopkins

but also with stephen shum who will be talking next about a little bit different flavour

and niko and carlos

so bear with me daniel likes to put a lot of animation into slides

and i usually take them out but there was so much in these that i

really couldn't get them all

so i will try and do it with an animation style which is not natural

to me so we're trying to build a speaker recognition system which is state-of-the-art how

are we gonna do that well it depends what kind of an evaluation we're gonna

run we wanna know what the data looks like that we're actually gonna be working

with and since normally for example in sre we know what that data is going

to look like we go to our big pile of previous data that the ldc

has kindly given us generated for us

we use this development data we typically use very many speakers and very many labeled

cuts

to learn our system parameters

in particular what we call the across class and within class covariance matrices which are the key

things we need to make the plda work correctly

and then we are ready

to score our system and see what happens
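
A minimal sketch of how within-class and across-class covariance matrices can be estimated from labeled cuts. This uses simple moment estimates, which is an assumption made for illustration; the talk does not say how the PLDA parameters are actually trained, and the function name and inputs are hypothetical.

```python
import numpy as np


def class_covariances(ivectors, labels):
    """Return (within_class, across_class) covariance estimates.

    ivectors : (n_cuts, dim) array, one row per labeled cut (hypothetical input)
    labels   : (n_cuts,) array of speaker ids
    """
    ivectors = np.asarray(ivectors, dtype=float)
    labels = np.asarray(labels)
    dim = ivectors.shape[1]
    global_mean = ivectors.mean(axis=0)
    within = np.zeros((dim, dim))
    across = np.zeros((dim, dim))
    speakers = np.unique(labels)
    for spk in speakers:
        cuts = ivectors[labels == spk]
        spk_mean = cuts.mean(axis=0)
        centered = cuts - spk_mean
        # Scatter of each cut around its own speaker mean.
        within += centered.T @ centered
        # Scatter of the speaker mean around the global mean.
        diff = spk_mean - global_mean
        across += np.outer(diff, diff)
    # Normalize by number of cuts and number of speakers respectively.
    return within / len(ivectors), across / len(speakers)
```

With only one cut per speaker the within-class estimate collapses to zero, which is why the labeled set needs several cuts from each speaker.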

so the thought here for this workshop was

what if we have this state-of-the-art system which we have built for our sre

ten or sre twelve

and someone comes to us with a pile of data which doesn't look like an

sre what are we going to do

and the first thing in this corpus that doug also put together which is available

there are links to the lists from the jhu website

we found that there is in fact a big performance gap with the plda

system even with what seems like a fairly simple mismatch namely you train your parameters

on switchboard and you test it on mixer or sre ten

and you can see that the green line

which is a pure sre system designed for sre ten works extremely well and

the same algorithm trained only on switchboard has three times the error rate

so in the supervised domain adaptation that we attacked first which daniel presented at icassp

we are given an additional data set so we have the out-of-domain switchboard

data we have an in-domain mixer set and it's labeled but it may

not be very big so how can we combine these two datasets

to accomplish good performance on sre data

the setup that we have used for these experiments is a typical i-vector system

i think some people may do different things in this in this back part but

daniel has convinced me that length norm with total covariance which is just whitening is in

fact the best most consistent way to do it

a typical system parameters the

the lda is typically four hundred or six hundred in our experiments

and the important point i want to emphasize here is that the i-vector extractor doesn't

need any labeled data so we call that unsupervised training

the

length norm also is unsupervised

and the plda parameters are the ones where we need the speaker labels that's

the harder data to find

and

in these experiments we found that we can always use switchboard for the i-vector extractor

itself we don't need to retrain that every time we go to a new domain

which is a tremendous practical advantage

the whitening parameters can be trained specifically for whatever domain you're working in which is

not so hard to do either because you only need an unlabeled pile of data

to accomplish that
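
A minimal sketch of whitening with the total covariance followed by length normalization, trained on an unlabeled pile of in-domain i-vectors. The function names and the eigenvalue floor are assumptions for illustration, not details given in the talk.

```python
import numpy as np


def train_whitener(unlabeled):
    """Estimate mean and whitening transform from unlabeled i-vectors."""
    unlabeled = np.asarray(unlabeled, dtype=float)
    mean = unlabeled.mean(axis=0)
    total_cov = np.cov(unlabeled - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(total_cov)
    eigvals = np.maximum(eigvals, 1e-10)  # floor to avoid dividing by zero
    # Scale each eigenvector (column) by 1/sqrt(eigenvalue).
    transform = eigvecs / np.sqrt(eigvals)
    return mean, transform


def whiten_and_length_norm(ivectors, mean, transform):
    """Whiten, then project every i-vector onto the unit sphere."""
    whitened = (np.asarray(ivectors, dtype=float) - mean) @ transform
    norms = np.linalg.norm(whitened, axis=1, keepdims=True)
    return whitened / np.maximum(norms, 1e-12)
```

No speaker labels appear anywhere in this step, which is why it can be retrained for each new domain from unlabeled data alone.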

and then i wanna focus on the adaptation part of the covariance matrices

that was the biggest challenge for us

in principle

at least in a little bit of simplistic math if we have known

covariance matrices we can do this map adaptation that doug has been doing in gmms

for a long time

the original math behind that is a conjugate prior for a covariance matrix

and you end up with a sort of count-based regularization if you configure your

prior in a certain tricky way

you end up with a very simple formula which is a count-based regularization back to

an initial matrix and towards a new data sample covariance matrix so that's what's

shown here

this is the in domain covariance matrix

and we're smoothing it back towards the out-of-domain

covariance
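
A minimal sketch of the smoothing just described, written as a weighted combination of the in-domain and out-of-domain covariance matrices. The exact form of the formula on the slide is not in the transcript, so the linear interpolation and the weight name `alpha` are assumptions, chosen so that a weight of zero gives the out-of-domain matrix and a weight of one gives the in-domain matrix.

```python
import numpy as np


def adapt_covariance(cov_in, cov_out, alpha):
    """Interpolate between in-domain and out-of-domain covariance matrices.

    alpha = 0 returns the out-of-domain matrix unchanged,
    alpha = 1 returns the in-domain sample covariance unchanged.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    cov_in = np.asarray(cov_in, dtype=float)
    cov_out = np.asarray(cov_out, dtype=float)
    return alpha * cov_in + (1.0 - alpha) * cov_out
```

The same function would be applied separately to the within-class and the across-class matrices.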

and what we showed earlier in this first supervised adaptation is that we can get very good performance

let's get used to this graph i'm gonna show a couple more the red line

at the top is the out-of-domain system which has the bad performance which is trained

purely on switchboard the green line at the bottom

is the matched in domain system that's our target if we had all of the

in domain data

and what we're doing is taking various amounts

of in domain data

to see how well we can exploit it and even with a hundred speakers we

can cut seventy percent of that gap

with this adaptation process and if we use the entire data we get the same

performance actually slightly better by using both sets than just the in domain set

one of the questions with this is how do we set this alpha parameter i

mean in theory if we knew the if we knew the prior exactly it would tell

us theoretically what it should be but empirically

the main point of this graph is we're not very sensitive to it if alpha

is zero

we're entirely the out-of-domain system and it's always pretty bad

if alpha is one we're entirely trying to do an in domain system if we

have almost no data in domain we have a very bad performance

but as soon as we start to have data that system is pretty good but

we're always better by staying in the middle somewhere and using both datasets

using a combination

now this work the theme is an unsupervised adaptation what that means is we no

longer have labels for this pile of in domain data

so it's the same setup

but now we don't have labels

this means we wanna do some kind of clustering

and we found empirically as i think people in the i-vector challenge seem to have

found as well that ahc

is a particularly good algorithm for this task for whatever reason

and you can measure clustering performance

with

if you actually have the truth labels you can evaluate a clustering algorithm by purity

and fragmentation purity being how pure your clusters are and fragmentation being how much a speaker

was accidentally distributed into other clusters
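
A minimal sketch of the two clustering measures, assuming truth labels are available. The talk gives only the informal descriptions, so both definitions here are assumptions: purity is taken as the fraction of cuts that belong to the dominant speaker of their cluster, and fragmentation as the average number of clusters that one speaker's cuts are spread over.

```python
from collections import Counter


def cluster_purity(true_labels, cluster_ids):
    """Fraction of cuts assigned to a cluster dominated by their own speaker."""
    by_cluster = {}
    for spk, cluster in zip(true_labels, cluster_ids):
        by_cluster.setdefault(cluster, Counter())[spk] += 1
    dominant = sum(max(counts.values()) for counts in by_cluster.values())
    return dominant / len(true_labels)


def speaker_fragmentation(true_labels, cluster_ids):
    """Average number of distinct clusters containing each speaker's cuts."""
    clusters_per_speaker = {}
    for spk, cluster in zip(true_labels, cluster_ids):
        clusters_per_speaker.setdefault(spk, set()).add(cluster)
    total = sum(len(found) for found in clusters_per_speaker.values())
    return total / len(clusters_per_speaker)
```

Under these definitions a perfect clustering has purity 1.0 and fragmentation 1.0; over-clustering keeps purity high but raises fragmentation.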

one of the things we spent quite a bit of time on in fact daniel

spent a lot of time on making an i-vector averaging system

is what metric to use for the clustering you're gonna do hierarchical clustering you're gonna

work your way up from the bottom but what's the definition of whether

two

clusters should be merged

plda theory gives

an answer for

a speaker hypothesis test that these two are the same speaker

that's something that we worked with in the past

and then you know as soon as we started up this year we saw that really

doesn't work well at all which is a little disappointing from a theoretical point of

view but we found that in the sres as well when we have multiple cuts

using the correct formula doesn't always work as well as we would like

what we traditionally do in sre is i-vector averaging which is to pretend we have a

single cut

daniel spent a lot of time on that this summer then we found out that

in fact the simplest thing to do which is to compute the score between

every pair of cuts get a matrix of scores and then never recompute any metrics

just average the scores is in fact

the best performing system and it's also much easier because you don't have to get

into your algorithm at all you just precompute this distance matrix and feed it into

an off-the-shelf

clustering software
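
A minimal sketch of score-averaging hierarchical clustering with an off-the-shelf package, here SciPy, which is an assumption since the talk does not name the software. The pairwise score matrix is assumed to be given (for example from a PLDA scorer), scores are turned into distances by subtracting them from the largest off-diagonal score, and `score_threshold` is a hypothetical stopping value.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def scores_to_condensed_distances(scores):
    """Turn a pairwise similarity matrix into SciPy's condensed distances."""
    scores = np.asarray(scores, dtype=float)
    sym = 0.5 * (scores + scores.T)  # enforce symmetry
    off_diag = ~np.eye(len(sym), dtype=bool)
    offset = sym[off_diag].max()
    dist = offset - sym  # high score -> small distance
    np.fill_diagonal(dist, 0.0)
    return squareform(dist, checks=False), offset


def cluster_by_average_score(scores, score_threshold):
    """Merge clusters while their average pairwise score stays above threshold."""
    condensed, offset = scores_to_condensed_distances(scores)
    # Average linkage merges on the mean of the precomputed pairwise values,
    # so no score is ever recomputed during clustering.
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=offset - score_threshold, criterion="distance")
```

Because the matrix is computed once up front, swapping in a different scorer only changes how `scores` is filled, not the clustering code.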

so just as a as a baseline we compared against k-means for clustering with this

purity and fragmentation and the main point is this ahc with this scoring metric

was in fact quite a bit better than k-means so we're comfortable that it seems to

be clustering in an intelligent way

now we wanna move towards doing it for adaptation but the other thing we need

to know is how do we decide when to stop clustering how do we decide

how many speakers are really there because nobody has told us

to do this you have to eventually make a decision that you're gonna stop

merging and basically you look at the two most similar clusters and you

gotta decide are these from a different speaker or are they the same and you

can make a hard decision

and this is one of the

the nice contributions of this work that was really done after the summer i think

where we just treat these scores as speaker recognition scores we do calibration in

the way that we do and in particular

this unsupervised calibration method that niko and daniel presented at icassp

can be used exactly in this situation we can take

our unlabeled pile of data and look at all the scores across it to learn a

calibration from that we can actually know a threshold and we can make a decision

about when to stop

so how well does that work

this is across our unlabeled pile as we introduce bigger and bigger piles

this is the correct number of clusters the dashed line

this is five random draws where we draw random subsets and we've averaged the

performance and the blue is the average which is the easiest one to see and

you can see in general this technique works pretty well it always underestimates typically by about

twenty percent so you think there are fewer speakers than there really are but

you're pretty close and getting

an automated and reliable way to actually figure out how many speakers are there is

actually we're

we're pretty excited to even do this well at it that's a very hard task

so to actually do the adaptation then

the recipe is we use our out-of-domain p lda

to compute the similarity matrix of all pairs

we then cluster the data using that distance metric

estimate from this how many speakers there are and the speaker labels

generate another set of covariance matrices from this labeled data

and then we apply our adaptation formulas

on this data
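
A minimal end-to-end sketch of the recipe above: cluster the unlabeled in-domain data, treat the cluster ids as speaker labels, estimate covariances from them, and interpolate with the out-of-domain matrices. Several pieces are stand-ins and should be read as assumptions: cosine distance replaces the out-of-domain PLDA score, the covariances are simple moment estimates, and `alpha` and `distance_threshold` are hypothetical parameters.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def self_label(ivectors, distance_threshold):
    """Cluster unlabeled i-vectors and return integer pseudo speaker labels.

    Stand-in metric: cosine distance, used only to keep the sketch
    self-contained; the talk uses out-of-domain PLDA scores instead.
    """
    tree = linkage(pdist(ivectors, metric="cosine"), method="average")
    return fcluster(tree, t=distance_threshold, criterion="distance")


def within_across(ivectors, labels):
    """Simple moment estimates of within- and across-class covariance."""
    dim = ivectors.shape[1]
    global_mean = ivectors.mean(axis=0)
    within = np.zeros((dim, dim))
    across = np.zeros((dim, dim))
    speakers = np.unique(labels)
    for spk in speakers:
        cuts = ivectors[labels == spk]
        spk_mean = cuts.mean(axis=0)
        centered = cuts - spk_mean
        within += centered.T @ centered
        diff = spk_mean - global_mean
        across += np.outer(diff, diff)
    return within / len(ivectors), across / len(speakers)


def unsupervised_adapt(ivectors, within_out, across_out, alpha, distance_threshold):
    """Self-label the in-domain pile, then smooth towards out-of-domain."""
    ivectors = np.asarray(ivectors, dtype=float)
    labels = self_label(ivectors, distance_threshold)
    within_in, across_in = within_across(ivectors, labels)
    within = alpha * within_in + (1.0 - alpha) * within_out
    across = alpha * across_in + (1.0 - alpha) * across_out
    return within, across, labels
```

The number of distinct values in the returned labels is the estimated number of speakers, so the stopping threshold and the speaker count come out of the same step.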

so here's a similar curve as i showed before here is

the out-of-domain system and the in domain system in green at the bottom

and

what we're showing here

is the ahc

adaptation

performance and the supervised

adaptation

which means the number of speakers

no sorry supervised adaptation is what i showed you before

excuse me

so that's if you have the labels

for all of the data that's what i presented the first time now by

self labeling

of course we're not as good

but we are in fact much better than we ever thought we could be because

when we first set up this task we really didn't think

in fact daniel and i had a little bet and he was convinced that this was

never gonna work because how are you gonna learn your parameters from your system that

doesn't know what your parameters are but in fact you can

so we've done surprisingly well by self labeling

and we're still able to get eighty-five percent of the performance gap if

we have all the data but it's unlabeled we're still able to recover

almost all the performance

now what if we did know the number of clusters

so if we had an oracle that told us there are exactly this many speakers

would that make our system perform better so that's the additional

bar here and in fact

our estimation of the number of speakers is good enough because even had we known

it exactly we're gonna get

almost the same performance

so even though we didn't get exactly the correct number of speakers the hyper parameters that

we have estimated still work just as well

and that's illustrated in this way

which is the sensitivity to knowing the number of clusters so here we're using all

the data the actual number of speakers is here and this is what we estimated

with our stopping criterion

and you can see that as we sweep across all of our if we had

stopped at all of these different points and decided that was how many speakers there

were

there's not a tremendous sensitivity if we massively over cluster then we have a big

hit in performance and if we massively under cluster it is bad but there's a

pretty big flat region

where we get almost the same kind of performance with our hyper parameters if we had

stopped our clustering at that point

so in conclusion then

domain mismatch can be a surprisingly difficult problem in state-of-the-art systems using plda

and

we already knew that supervised adaptation could work quite well but in fact

unsupervised adaptation also works extremely well

we can close eighty-five percent of the performance gap due to the domain mismatch

in order to do that we need to do this adaptation we need to use

both the out-of-domain parameters and the in domain parameters not just self-label the in

domain

and this unsupervised calibration trick

in fact gives us a useful and meaningful stopping criterion for figuring out how many

speakers are in our data

thank you

i four questions

i was wondering i can imagine that the distribution of speakers

so basically the number of segments per speaker

of your unsupervised set

will make a difference right i guess you get this from the sre or

whatever switchboard data so the

will be relatively homogeneous is that correct or

i think yes classes i one has these are not homogeneous but this is a

good pile of unlabeled data because in fact it's the same pile that we used

as a labeled data set

so it's pretty much everything we could find from these speakers some of them have

very many phone calls some of them have fewer

but all of them have quite a few in order to be in this pile

obviously for example you couldn't learn any within class covariance if you only had one

example from each speaker

hidden in that pile so you're absolutely right it's not just that we do the

labelling it's also that the pile itself has some richness in order for us to

discover

before we give a microphone image i have a related question

when you train the i-vector extractor the nice thing is that you can do it

unsupervised

but again how many cuts per speaker so if we had only one speaker with

many cuts obviously that's not good because we don't get the speaker variability

the converse situation is where you have every speaker only once

have you done any experiments that would give a good idea of that

i don't think that's something we looked at but i

i completely agree that would make me uncomfortable as i said in this effort we

just were able to show that the out-of-domain data which we assume we do have

a good labeled set somewhere in some domain that we can use we were able

to use that when the rest of the time so we're comfortable where it came

from i don't think i've ever run an experiment with

with what you say and that is interesting i suspect it would not work so

well

what get both kinds of variability comes from a variety of channels

the variability speaker and the channels in the not quite the same proportions as you

get in the state

if you collect data in the wild

in a situation where there are very many speakers

you might have data like that so i think that's an interesting question too

thank you

very impressive work and a great set of results so thank you for that

so the question i have is this is all telephone speech and it works very

well with that have you considered what would happen if the out-of-domain data was

a different channel such as microphone

and is that even realistic in the sense would you have a pre-trained microphone system

that you try and adapt

right so yes we have looked at microphone the very first work a few

years ago on this task was adapting from telephone to microphone and daniel revisited it early

in the summer when we were debating working with doug on this dataset whether we

trusted it he did a similar experiment

with the sre telephone and microphone and actually got similar results

it is

that does sound a bit surprising but we have seen in the sres that

telephone to microphone is not nearly as hard as it ought to be i don't

know the reason for that but yes we have worked with telephone and microphone sres

and it's not shockingly different in the scheme of things

i just isn't that answer i'm because question

we trained i-vector start on unofficial database which there is no one speaker a per

utterance

one and it's about the same as you two thousand four and five or so

first

okay thank you

i knew that no i think about

thank you

so i have a really stupid question yesterday people were mentioning how the mean shift clustering

algorithm is working well

is that i mean you don't seem to use that you use the

agglomerative hierarchical clustering so

is that a reason why

i believe over the course of the summer we and other people looked at quite a

few different algorithms i know that's used in diarisation i know we have looked at

it in diarisation i cannot remember if we looked at it for this task

we did look at others stephen is gonna talk about some other clustering algorithms

as well but i don't think he's gonna talk about mean shift and so i'm

not sure i don't know how that compares it clearly is also a useful algorithm

i just want to know if the splits and the protocols are available

yes they are on the jhu website the link is in the paper okay

it thanks you wanna get the speech type of error but the lists

i encourage you to work on this task

which

one question

let's suppose that you are asked not to do speaker clustering but gender clustering

and you don't have any prior on how many genders there are

input and they

the stopping criterion would be the same i mean you have a file sign genders

i'm not sure and if there is saying with the clustering accidently fine gender well

let me say one thing first is we did i think i forgot to mention

this is a gender independent system

well gender suppose that they classes to sip to cluster and not that the and

the speakers that any of the end of clusters either kind of

correctly well this is why daniel thought this wouldn't work

who knows what you're gonna cluster by we're just using the metric we are hoping

that the out of out-of-domain p lda metric is encouraging the clustering to focus on

speaker differences

but we cannot guarantee that except

with the results

i think more so than gender if for example languages were different which there might

be in some different data and you might think you would cluster the same speaker

speaking multiple languages you might think that would confuse our clustering for us

you saw nick like to one aspect i think is very important especially in the

forensic framework could you

shows slide five

probably

so all

what you've neglected here is the decision threshold

yes we have neglected calibration of the final task that's

so it could possibly be that a factor of three becomes a factor of

one hundred

not to three degradation could actually

the factor one take it is yes which we simply neglected

you are right george when you collected that has we would think the with the

unsupervised calibration that we could accomplish calibration is what i would like you to do

when you get home yes

in two up

annotate this slide with the

decision

points

and all these systems are not even calibrated so

well we always have to run a separate calibration process to get onto single somewhat

easily to a when you go home

but go ahead and do that work

and it's only your only gonna have to do this for the in domain system

and then you

applying a threshold

but the dots on those two curves in zambia copy

thank you very well for your assignments are then

that question is already partially answered by our unsupervised score calibration paper which was

published at icassp so it's true

that

okay so we so we thank the speaker

Unsupervised Domain Adaptation for I-Vector Speaker Recognition

Speaker Modeling II

Daniel Garcia-Romero, Alan McCree, Stephen Shum, Niko Brummer and Carlos Vaquero