Hello, I'm presenting work from SRI International on adaptive mean normalization for unsupervised adaptation of speaker embeddings.

First we'll cover the problem statement: what are we actually trying to tackle with this work? We'll then look at the role of mean normalization as it applies to state-of-the-art speaker recognition systems. From there we'll go to the proposed technique, adaptive mean normalization, and have a look at some experiments to see how it performs.

The problem statement. Variability is well known as one of the biggest challenges to the practical use of speaker recognition systems, and it refers to changes in the audio characteristics between training and successive detection attempts.

A system is typically confronted with two types of variability. One is extrinsic, that is, something separate from the speaker: this includes things like microphones, the acoustic environment, and the transmission channel. The other is intrinsic, which has to do with the speaker. Speakers naturally vary over time, and the things that introduce variability here include health, stress, overall state, and speaking style.

These differences are collectively referred to as domain mismatch when we look at the differences between the system's training data and the detection attempts.

Now, many of us know that domain mismatch typically results in a performance loss for the system. This is a loss with respect to the expected performance: once the system is trained, we have a certain estimate of how it should perform, but if the domain then changes in deployment, we see a loss due to this domain mismatch.

Now, this loss is due to two different things. One is discrimination loss, which means the system is less capable of separating speakers. The other is miscalibration: when a system is miscalibrated, it gives a score that may mislead the user into believing something was detected that shouldn't have been.

Domain adaptation can be used to cope with this problem, and there are two different ways of dealing with it. One is supervised: this is where we have labeled data, and where we often get reliable improvement, but there is the high cost of the human-labeled data that you end up needing to improve the system. The alternative is unsupervised adaptation. It has a very low cost, since there is no end-user labeling at all, there is plenty of data available, and that data is ideally matched to the deployed conditions. The downside here is that we have no ground-truth labels to rely on. The focus of this work is the unsupervised adaptation scenario.

There are some shortcomings of unsupervised adaptation. One is a lack of generalization: quite often, decisions have to be made that would be simple in a supervised approach. For instance, if we are going to retrain the PLDA of our best system, we end up needing to make some kind of assumptions about which clusters the different audio segments are going to map to with respect to different speakers. The adaptation can also be overfit to the data being trained with. Trustworthiness is another shortcoming: guarantees of improvement from unsupervised adaptation are limited, as we'll see.

Then there is complexity: some approaches have a high computational load, and that makes it a little bit more difficult to hand over to clients or users once the system goes out the door. So the question we turned our attention to was: where is the best place to apply adaptation, particularly in the unsupervised scenario, so that it can be fast and reliable once deployed?

So on screen here we have a diagram of the different stages of a speaker recognition pipeline, and we can look at what would happen if we applied adaptation at each of these stages. Take the feature extraction, the MFCCs or power-normalized cepstral coefficients, or the speaker embeddings network: if someone was to tune those, the catch is that this requires full retraining of both the DNN and the backend modules, and you need to have a lot of data on hand to do that process, so that's hard to explore.

What about speech activity detection? There are approaches that mix and match SAD stages for different scenarios. This is useful when SAD is actually sensitive to the domain, but it is a piecemeal solution: it doesn't really help the discrimination in the rest of the pipeline.

What about PLDA or calibration? These are some of the core components of the backend process, but they tend to require labels, or label predictions via clustering, and errors can be carried in by those predictions. Length normalization? There is no actual adaptation to be done there, so it is not applicable.

Which leads us to mean normalization. This is simple to adapt: the parameters are, in general, typically about two hundred numbers, the dimensionality of the PLDA input, and adapting it is low-cost.

So what is the role of mean normalization in a system? PLDA is a strong model when the assumptions of the PLDA model hold, that is, when the distribution of the data going into it fits a standard normal distribution, as it did in training. Mean normalization and length normalization together achieve this, so the assumptions of PLDA are fulfilled when the system is trained.

Length normalization actually projects the embeddings onto the unit hypersphere, and there is a diagram right here that demonstrates this: the data is evenly spread around the hypersphere. This is why a zero mean works well during training.
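To make those two steps concrete, here is a minimal sketch, assuming embeddings as numpy vectors; this is the standard recipe rather than our exact code:

```python
import numpy as np

def normalize_embedding(x, mean):
    """Mean normalization followed by length normalization.

    Subtracting the (training-data) mean centers the embeddings at zero,
    and dividing by the L2 norm projects them onto the unit hypersphere,
    which is what the Gaussian assumptions of PLDA rely on.
    """
    centered = x - mean                         # mean normalization
    return centered / np.linalg.norm(centered)  # length normalization
```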

The mean is shifted when the domain differs from that of training, such as with evaluation data. This is shown in the diagram here: the shifted mean produces a projection onto the unit hypersphere whose distribution is no longer even. The assumptions of the PLDA model are therefore not fulfilled anymore, and we actually reduce the discrimination ability of the model.

So let's see the actual performance. Here we look at the difference between using the system-based mean, where we have taken the mean from the actual training data, and the impact of adapting the mean of the system to a held-out dataset of conditions relevant to the data we benchmark with.

There are more details on the evaluation protocol and the datasets used later in the presentation, but for now this is a quick snapshot of what happens if you simply update the mean of the PLDA input using a relevant held-out dataset. We actually see the equal error rate improve by up to nineteen percent, really helping the discrimination ability.

But even more impressive is the fact that the Cllr, the cost of the log-likelihood ratio, which is an indication of both discrimination and calibration performance, improves by up to sixty percent.

And this is despite holding the calibration model fixed and mismatched to the other conditions. In particular, the calibration model here is trained on the RATS source data, which is clean telephone data, and yet on the SRE datasets and Speakers in the Wild it is dramatically helping calibration. So having a relevant mean really is crucial.

So what is the catch with adapting mean normalization? Well, an adapted mean is suitable when the evaluation conditions are homogeneous: if we deploy a system and we generally know what the audio is going to look like, and it is not going to vary much from that, then this approach is fine.

The problem comes in when conditions can vary over time or between trials. For instance, when dealing with radio broadcasts, conditions change over time depending on the signal and the time of day, or maybe a system is being used for both telephone and microphone-style calls. Then we end up with the distribution shown here on the bottom right, where we have different means, and you can see how that projects onto the unit hypersphere.

This means that, ideally, what we would love to be able to do is adapt the mean depending on the conditions of the trial at hand. That is, we want to dynamically define the mean as trials come into the system, and that is what we can do with the proposed method of adaptive mean normalization.

So what is it? Well, this process actually stemmed from prior work on trial-based calibration. What trial-based calibration does is apply a dynamic process to define the system parameters, in particular the calibration model parameters. It looks at the conditions at hand for the trial coming in, on both the enrollment and test sides, defines a subset of held-out data for those conditions, and then trains a calibration model on the fly using that held-out data.
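As a rough sketch of that idea, with hypothetical helper names rather than the actual trial-based calibration implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def trial_based_calibration(enroll_emb, test_emb, heldout_trials, scorer):
    """Train a calibration model on the fly for a single trial.

    heldout_trials is labeled held-out data; select_matching is a
    hypothetical helper that picks the subset whose conditions match
    the enrollment and test sides of the incoming trial.
    """
    subset = select_matching(heldout_trials, enroll_emb, test_emb)
    scores = np.array([scorer(t.enroll, t.test) for t in subset])
    labels = [t.is_target for t in subset]
    cal = LogisticRegression().fit(scores.reshape(-1, 1), labels)
    return cal  # its linear decision function maps scores to calibrated LLRs
```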

The goal of the approach is to make the system model as general and reliable as possible.

One extra advantage here is that, over time, as the system is seeing more and more conditions and more relevant data, it can accumulate knowledge about the new conditions.

Here is the process. Taking the pipeline we showed earlier, with the embeddings, mean normalization, length normalization, PLDA, and calibration on the left-hand side, what we have done is replace the mean normalization, where there used to be one consistent mean, with the adaptive process, borrowing the good ideas from trial-based calibration. In fact, this is an embedding-specific process, not a trial-specific process, which is a bit of a benefit here in terms of computation.

For each embedding, what we do is compare it for condition similarity against a pool of candidate embeddings. What we want is to find those embeddings from the candidate pool that are similar in condition to the embedding coming in to be adaptively mean-normalized. We make a selection of that subset and then compute the condition mean based on it. Depending on how many samples we found and how many we would like to find, we do a weighting process, and then we use that mean as the adapted mean for that embedding, which then follows on through the rest of the pipeline.

What we are trying to do here is make this happen on the fly, and in fact it has very little overhead.

There are some ingredients that we need for adaptive mean normalization. First, to make the comparison between an embedding and a candidate, we need something that can tell us whether the conditions of those embeddings are similar or not. For this we use condition embeddings: these come from the same kind of network we use for speaker recognition, except that instead of discriminating speakers, it is trained to discriminate different conditions. The conditions include compression type, bit rate, noise type, language, and gender; when we combine those factors, we end up with around eleven thousand unique conditions, so each one is a very thin slice of the condition space we are dealing with.
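To make that comparison concrete, a hedged sketch: assume such a condition network gives us one condition embedding per segment, and compare two segments with cosine similarity (the actual similarity measure and its score scale in our system may differ):

```python
import numpy as np

def condition_similarity(cond_a, cond_b):
    """Cosine similarity between two condition embeddings.

    The embeddings are assumed to come from a DNN trained to separate
    conditions (compression type, bit rate, noise type, language,
    gender) rather than speakers.
    """
    a = cond_a / np.linalg.norm(cond_a)
    b = cond_b / np.linalg.norm(cond_b)
    return float(a @ b)
```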

Second, we need a pool of candidate embeddings. This can just be a mixture of conditions, anything you can gather, really, and ideally it includes some examples of the evaluation conditions. If that is not the case, one thing the system could actually do after it is deployed is harvest incoming test data along the way to populate the candidate pool, so that it becomes better suited to the conditions. This pool is what is used to dynamically estimate the mean for new conditions.

Finally, there are two parameters. One is the condition similarity threshold: we do not want everything from the candidate pool coming through, so we want to determine how similar each candidate is and make sure it is similar enough to be passed to the next stage of mean estimation. The other thing we want to set is the maximum number of candidates to select. If everything in the candidate pool were above the threshold, everything would be passed through, and maybe we would get no benefit over the non-adapted system. We want to make sure the mean stays relevant, so we just select the top N of the candidates.

If we then go back to our picture, we can fill in a few different things. For instance, the comparison is now done by condition similarity. Then we do the selection process, where n is the number of candidates whose similarity is above the threshold. However, if n is more than the maximum we allow, which is N, we keep the N candidates with the highest similarity, making sure we keep the most relevant ones for our mean estimate.

Once we estimate the mean, we go on to a weighted average with the system mean. That weighted average means that the closer we get to N samples, the more we rely on the dynamically estimated mean, whereas we fall back to the system mean in the case that no relevant samples could be found.
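Putting the pieces together, a minimal sketch of the per-embedding procedure might look like this, reusing the illustrative condition_similarity above; the names and structure are assumptions rather than our exact implementation:

```python
import numpy as np

def adaptive_mean(cond_emb, candidates, system_mean, threshold, max_n):
    """Estimate the mean to subtract for one incoming embedding.

    candidates:  list of (condition_embedding, speaker_embedding) pairs
                 from the candidate pool.
    threshold:   minimum condition similarity for a candidate to count.
    max_n:       maximum number of candidates used for the mean (N).
    """
    # Score every candidate for condition similarity to this embedding,
    # keep those above the threshold, then cap at the max_n most similar.
    scored = [(condition_similarity(cond_emb, c), x) for c, x in candidates]
    selected = sorted((sx for sx in scored if sx[0] >= threshold),
                      key=lambda sx: sx[0], reverse=True)[:max_n]

    if not selected:
        return system_mean  # no relevant samples: fall back to the system mean

    dynamic_mean = np.mean([x for _, x in selected], axis=0)
    # The closer we get to max_n relevant samples, the more we trust the
    # dynamically estimated mean over the static system mean.
    w = len(selected) / max_n
    return w * dynamic_mean + (1.0 - w) * system_mean
```

The adapted mean then replaces the static mean in the normalization step shown earlier, and everything downstream, length normalization, PLDA, and calibration, stays untouched.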

Let's review a few of the benefits of adaptive mean normalization. As I said, it has very minimal overhead, and that overhead is essentially defined by the number of candidate examples that each embedding has to be compared against. It is also applied per embedding instead of per trial, which is how it is done in trial-based calibration, and that in itself does a lot in terms of reducing computation. It copes with the case of no relevant examples, where it simply reverts to the system mean through the weighted average. And enrollment or test audio could actually be collected over time into the candidate pool, which allows the most relevant data to be at hand as the system is being used.

It is a simple process, with the parameters being changed here just the two hundred or so numbers of the mean. It is also weighted against the system mean, which makes it a little bit more difficult to overfit, and that is a real benefit.

And finally, what we find quite impressive is that it allows a single static calibration model to be applied across domains. That is exactly the problem that trial-based calibration was trying to solve by adapting the calibration model. Here we have gone a step further and adapted the input of the system, the mean, which means the calibration model can just stay static. The cost is that we need to make sure the mean normalization allows the PLDA assumptions to be fulfilled, so that the calibration model applied after PLDA scoring is also suitable.

Let's take a look at the experiments. First of all, the baseline system: what we adapt is from SRI's submission to the NIST SRE18 evaluation. This involves sixteen-kilohertz power-normalized cepstral coefficients and a multi-band embeddings network. Multi-band means that we trained the embedding system with both eight-kilohertz and sixteen-kilohertz data: any time we had sixteen-kilohertz audio, we also downsampled it to eight, so that the DNN was exposed to both the eight- and sixteen-kilohertz versions of the same audio segments. That tended to help bridge the gap between eight-kilohertz and sixteen-kilohertz evaluation data.
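As a small illustration of that multi-band trick, assuming 16 kHz waveforms as numpy arrays (scipy's polyphase resampler is just one way to do the downsampling):

```python
from scipy.signal import resample_poly

def both_bandwidths(wave_16k):
    """Return the original 16 kHz audio plus an 8 kHz copy, so the
    embeddings DNN sees both bandwidths of the same segment."""
    wave_8k = resample_poly(wave_16k, up=1, down=2)  # 16 kHz -> 8 kHz
    return wave_16k, wave_8k
```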

We trained on the standard datasets; the references for those datasets are in the paper, and we applied the standard augmentation process. As mentioned before, the calibration model here is trained on the RATS source data, from the DARPA RATS program. That is the telephone-only clean data, not the transmission data, which is heavily degraded.

In terms of evaluation, we split our datasets into evaluation and normalization portions. The NIST SRE corpora, from 2016 through 2019, have their own enrollment and test sets available, along with what is known as the unlabeled data, as you can see in the table. For Speakers in the Wild, we used the eval portion for evaluation and the dev portion for normalization; again, the speakers for these are disjoint. For the RATS source data, we actually split it to have two different speaker pools: one for evaluation and one for the normalization step.

In terms of the adaptive mean normalization parameters, we set a condition similarity threshold of ten, and the maximum number of candidates N to half the number of candidate samples available for the dataset. These were set by searching on the RATS source data. In the table you can see how many segments were available for each norm set, including the pools, which we use initially, and the N value for each candidate set that we are trying to reach. Remember, that value of N also drives the weighted average with the dynamic mean estimate: the closer we can get to it, the more we rely on the adapted mean.

Let's look at the out-of-the-box performance. We have here four different datasets: SRE16, SRE18, Speakers in the Wild, and the RATS clean telephone data. The baseline system here, what we consider the default mean norm, simply uses the mean that was estimated during the training of the system, and the calibration model is trained on RATS.

Now, what happens if we adapt the calibration model to the actual eval set? This is a cheating experiment, on the right-hand side: essentially, we are replacing the RATS calibration model with an eval-set calibration model. What we can see is that we get much better calibration performance, because for some datasets the acoustics are not modeled by the RATS data. The equal error rates tend to vary widely between these different datasets, but the calibration is considerably better compared to the baseline. The exception is RATS itself, where we see a low baseline Cllr value of 0.147, because it is matched to the RATS data used in the calibration model.

Let's look at the impact of relevant mean normalization. Earlier in this presentation we showed the first two columns here, the baseline and the condition-based mean normalization. Now we add the third column, adaptive mean normalization. What we have done is run adaptive mean normalization with a pooled candidate set: the held-out data from SRE16, SRE18, Speakers in the Wild, and RATS were pooled together as one big candidate set. The adaptive mean normalization was able to outperform the condition-specific mean normalization in the heterogeneous conditions, in particular on Speakers in the Wild and the SRE datasets. The calibration performance there improves quite significantly in some cases, and for the 2018 set I think it is a nice improvement; the EER of SRE16 also improves quite reasonably. What is interesting is that the adaptive process did not really rely on a direct condition match, so there is a benefit there as well.

But now, what about the data requirements? How much data do we actually need in the candidate set for adaptive mean normalization to work? What we have done on this slide is look at the Cllr, which, remember, measures both discrimination performance and calibration performance. The dashed lines are the baseline performance across the four different datasets; the solid lines are what happens as we vary the number of candidate segments. Remember, the full candidate sets had at least one thousand two hundred samples. We are doing this in a dataset-specific scenario where, for instance, for SRE16 the candidate pool is the actual unlabeled data from SRE16, so it is suitable for the conditions, and we randomly select from that held-out set. What we see is that quite rapidly, with thirty-two relevant segments in the candidate pool, we are already in front of the baseline Cllr. So thirty-two is already sufficient for a significant Cllr improvement, as we saw before. The same happens for equal error rate in terms of the trend, though with not quite so much of a relative gain. Again, thirty-two relevant segments from the target domain was enough for this adaptive process to get a good gain.

Now, importantly, what happens when the adaptive mean normalization we deploy has a candidate pool that is mismatched to the conditions being evaluated? We wanted to see what happens in this case, so for each dataset we benchmarked, we excluded the relevant data from the candidate pool. For instance, with the RATS dataset, down the bottom of the table, we actually excluded RATS from the pool and just retained Speakers in the Wild and the two SRE datasets as the candidate pool, and that is all it had to select from in order to estimate the mean, with the same hyperparameters in the system. Remember, when it cannot actually find anything that it believes is relevant, it falls back to the system mean, so we would expect that the performance would be the same as the baseline system, or better.

What we can see is that Speakers in the Wild and RATS actually perform reasonably well: there was still an improvement with Speakers in the Wild, which was surprising, while RATS degraded just a little bit. SRE16, though, degraded with respect to the baseline without any relevant data for the mean. We tried varying the selection threshold here, in the hope that using a higher threshold would restrict the subset selection to only the closest candidates possible, but this did not help. What this indicates is that there was a problem with the condition similarity here: the candidate audio was not quite optimal for selection in this mismatched scenario.

So, in summary, we proposed adaptive mean normalization. It is simple and effective, leveraging the test data the system sees where possible. It is useful with just a handful of samples; in fact, we showed that thirty-two samples of speech are enough. For discrimination, we saw improvements of up to twenty-six percent, and in terms of calibration, measured through the Cllr, we saw improvements of up to sixty-six percent relative over the baseline system. What is important here is that it actually allows a single static calibration model to remain suitable for varying conditions, and that is a real benefit once your system goes out the door.

In terms of future work, we identified a couple of things. We want to enhance the selection method to be robust when relevant data is lacking, as in that very last experiment. We also want to do experiments on how active learning over time can improve the candidate pool, sorry, not the calibration model but the candidate pool, by collecting incoming test data over time that is relevant and retaining a recent history.

That wraps up the presentation. I would be happy to hear any remarks or questions from anyone. Thank you.