Hello. I am a researcher at the Computer Science Institute, which is affiliated with CONICET and the University of Buenos Aires.


The work I'm going to talk about today was done in collaboration with colleagues from the STAR Lab at SRI International.

So let me start by describing one of the most standard speaker verification pipelines these days. The pipeline is composed of three stages. First we have the speaker embedding extractor, which is meant to transform the waveforms in the two sides of a trial into fixed-length vectors, x1 and x2 here. Then we have a stage that does LDA followed by mean and variance normalization, and then length normalization. The resulting vectors x1 and x2 are then processed by a PLDA stage, which computes a score for the trial; that score can then be thresholded to make the final decision.
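The three stages just described can be sketched numerically as follows. This is only an illustration with made-up names and random toy parameters, not the actual system from the talk:

```python
import numpy as np

def length_norm(x):
    # Project the vector onto the unit sphere (standard length normalization).
    return x / np.linalg.norm(x)

def preprocess(e, lda, mean, std):
    # Stage 2: LDA projection, mean/variance normalization, length normalization.
    x = lda @ e
    x = (x - mean) / std
    return length_norm(x)

def plda_llr(x1, x2, Lam, Gam, c, k):
    # Stage 3: closed-form PLDA score, a second-order polynomial in x1 and x2.
    return (x1 @ Lam @ x2 + x1 @ Gam @ x1 + x2 @ Gam @ x2
            + c @ (x1 + x2) + k)

# Toy example: random "embeddings" standing in for the extractor output.
rng = np.random.default_rng(0)
d, r = 16, 4                                  # embedding and reduced dimensions
lda = rng.normal(size=(r, d))
mean, std = np.zeros(r), np.ones(r)
Lam, Gam = np.eye(r), -0.1 * np.eye(r)        # arbitrary PLDA-like parameters
c, k = np.zeros(r), 0.0
e1, e2 = rng.normal(size=d), rng.normal(size=d)
x1 = preprocess(e1, lda, mean, std)
x2 = preprocess(e2, lda, mean, std)
score = plda_llr(x1, x2, Lam, Gam, c, k)
decision = score > 0.0                        # threshold the score for the decision
```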

The PLDA scores are computed as if they were log-likelihood ratios, LLRs, under certain Gaussian assumptions. The form of the LLR is this: it is the logarithm of the ratio between two probabilities, the probability of the two inputs given that the speakers are the same, and the probability of the inputs given that the speakers are different. And this LLR, given the Gaussian assumptions in PLDA, can be computed in closed form, as a second-order polynomial in x1 and x2. You can find the formula in the paper.
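In standard notation, that closed form looks like this (here Λ, Γ, c and k are generic names for the PLDA-derived parameters; the paper gives their exact expressions):

```latex
\mathrm{LLR}(x_1, x_2)
  = \log \frac{p(x_1, x_2 \mid \text{same speaker})}
              {p(x_1, x_2 \mid \text{different speakers})}
  = x_1^\top \Lambda\, x_2 + x_1^\top \Gamma\, x_1 + x_2^\top \Gamma\, x_2
    + c^\top (x_1 + x_2) + k
```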


The problem is that, in most cases, what comes out of PLDA are scores that are miscalibrated. This means that the values we computed as if they were LLRs are really not LLRs. The cause of this mismatch is that the assumptions made by PLDA do not really match the real data.


Miscalibrated scores have the problem that they have no probabilistic interpretation. In consequence, we cannot use them as absolute values; we can only use them relative to each other. So we could rank trials, but we cannot interpret the scores themselves. Let's say, for example, that you get a score of minus one from a certain system for a certain trial. You would only be able to tell what this minus one means after you have seen a distribution of scores from some development data that has gone through the system. Once you see that distribution, you can interpret this minus one properly, and you can actually threshold the score and decide whether this is a target trial.

Okay, so we would like scores to be well calibrated, because then they have this nice property that they are LLRs: we can interpret their values, and we can also use Bayes' rule to set the decision threshold without having to see any development data.
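That Bayes decision rule for a calibrated LLR can be written in a few lines. This is the standard textbook rule, not code from the talk; the prior and costs are inputs the user chooses:

```python
import math

def bayes_threshold(p_target, c_miss=1.0, c_fa=1.0):
    # Bayes-optimal threshold for a calibrated LLR score:
    # decide "same speaker" when llr > t, given the target prior
    # and the costs of misses and false alarms.
    return math.log((c_fa * (1.0 - p_target)) / (c_miss * p_target))

def decide(llr, p_target, c_miss=1.0, c_fa=1.0):
    return llr > bayes_threshold(p_target, c_miss, c_fa)

# With equal costs and a 50% target prior, the threshold is exactly 0,
# which is why well-calibrated LLRs can be thresholded at zero.
t = bayes_threshold(0.5)
```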


Calibration is generally done with an affine transformation that is trained using logistic regression. So let's say your raw score s is miscalibrated. What you do is train alpha and beta, the two parameters of the affine transformation, so that they minimize the binary cross-entropy, which is the logistic regression objective function, and then you get properly calibrated LLRs at the output.
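A minimal sketch of that logistic-regression calibration, on synthetic scores (plain gradient descent here; real toolkits typically use a second-order optimizer, and the data is invented for illustration):

```python
import math
import random

def train_calibration(scores, labels, lr=0.1, steps=2000):
    # Fit s' = alpha*s + beta by minimizing binary cross-entropy,
    # the logistic-regression objective, with plain gradient descent.
    alpha, beta = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        ga = gb = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(alpha * s + beta)))  # sigmoid
            ga += (p - y) * s / n
            gb += (p - y) / n
        alpha -= lr * ga
        beta -= lr * gb
    return alpha, beta

# Toy data: raw scores whose scale is too small relative to true LLRs.
random.seed(0)
tar = [2.0 + random.gauss(0, 1) for _ in range(200)]   # target trials
non = [-2.0 + random.gauss(0, 1) for _ in range(200)]  # non-target trials
alpha, beta = train_calibration(tar + non, [1] * 200 + [0] * 200)
```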

Okay, so basically what this means is that we take the pipeline we had and we just add one stage: the global calibration.

Now, the problem is that this doesn't really solve the problem in general. With this global calibration, we are only solving the problem for the exact set on which we trained the calibration parameters. If the calibration set doesn't match our test set, then we will still have a calibration problem.

These results illustrate this. I'll explain the test sets later; for now, what's important is that I'm showing three different PLDA systems that are identical up to the calibration stage. What differs is the training data that was used to train the calibration parameters; those are the red bars. What's important here is to compare, for each system, the height of the bar, which is the actual Cllr, with the black line, which is the minimum Cllr for that system. If the difference between the two is small, it means the system is well calibrated; if it's big, it means it is not well calibrated. So what we see here is that the performance, the actual Cllr, is very sensitive to which set was used to train the calibration.


So, for example, the calibration set that is mostly VoxCeleb data matches the Speakers in the Wild dataset very well, so it gives very good calibration there, but horrible calibration for SRE. And similarly, the RATS data is a very good match for LASRS, but is not so good for SRE16.

So basically this means we cannot get a single global calibration model that works well across the board.
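For reference, the actual Cllr shown by those bars can be computed directly from the scored trials. This is the standard definition, a minimal sketch; the minimum Cllr additionally requires an optimal monotonic remapping of the scores (e.g. via the PAV algorithm), which I omit here:

```python
import math

def cllr(target_scores, nontarget_scores):
    # Actual Cllr: average cost (in bits) of taking the LLR scores at
    # face value. Well-calibrated systems have actual Cllr close to
    # the minimum Cllr achievable by any monotonic rescoring.
    ct = sum(math.log2(1 + math.exp(-s)) for s in target_scores) / len(target_scores)
    cn = sum(math.log2(1 + math.exp(s)) for s in nontarget_scores) / len(nontarget_scores)
    return 0.5 * (ct + cn)

# A system that outputs 0 for every trial carries no information and
# scores exactly 1 bit; confident, correct LLRs approach 0.
uninformative = cllr([0.0], [0.0])
confident = cllr([10.0], [-10.0])
```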

All right, so the goal of this work is to develop a system that doesn't require this recalibration for every new condition. It's quite an ambitious goal: we basically want a speaker verification system that can be used out of the box, without needing a development dataset for every new condition.

Okay, so going back to the pipeline: the standard approach, in the pipeline I showed, is to train each of the stages separately, freezing the previous stages and using the data that comes out of each stage to train the next one, with different objectives. The first stage, the speaker embedding extractor, is trained with a speaker classification objective; the LDA and the PLDA are trained to maximize likelihood; and finally, the calibration stage is trained to optimize binary cross-entropy, which is a speaker verification objective.


One simple thing we can do is just integrate the three stages into a single model. We may think this is a solution to the calibration problem, and that it may actually solve our initial issue of needing recalibration across conditions.

What we do is basically keep the exact same functional form as in the standard pipeline, but instead of training the stages with different objectives, we just train them jointly using stochastic gradient descent. For this, of course, we need to create mini-batches that are made of trials rather than samples. What we do is randomly select speakers, select two samples for each speaker, and then, from that list of samples, create all the possible trials across those samples. With that we can compute the binary cross-entropy, and that is what we optimize.
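The batch construction just described can be sketched like this (function and variable names are my own; the talk does not specify an implementation):

```python
import random
from itertools import combinations

def make_trial_batch(samples_by_speaker, n_speakers, seed=0):
    # Build a mini-batch of trials: pick speakers at random, take two
    # samples per speaker, then form all pairs across the selected samples.
    rng = random.Random(seed)
    speakers = rng.sample(sorted(samples_by_speaker), n_speakers)
    pool = []  # (speaker, sample) pairs selected for this batch
    for spk in speakers:
        pool.extend((spk, s) for s in rng.sample(samples_by_speaker[spk], 2))
    trials = []
    for (spk_a, a), (spk_b, b) in combinations(pool, 2):
        trials.append((a, b, 1 if spk_a == spk_b else 0))  # label: same speaker?
    return trials

# Hypothetical toy data: 3 speakers with 3 recordings each.
data = {f"spk{i}": [f"spk{i}_utt{j}" for j in range(3)] for i in range(3)}
batch = make_trial_batch(data, n_speakers=3)
# 6 selected samples yield 15 trials: 3 target and 12 non-target.
```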

This is not the first time that something like this has been proposed, of course. About ten years ago, Burget and others proposed something very similar; at the time, they actually trained the backend with an SVM or with linear logistic regression instead of stochastic gradient descent, but the concept is basically the same.
More recently, there have been a few papers on end-to-end speaker verification that use some flavor of this idea, where they train a backend, which usually has a form very similar to this standard one, discriminatively. You can find the citations for these papers in ours. These papers actually report improved discrimination performance, but they don't usually report calibration performance, which is what we care about in this work.

And what we actually found in our previous paper is that this approach of just training the PLDA backend discriminatively is not sufficient to get good calibration across conditions. We know that from our previous papers, so it means that taking this architecture and training it jointly is not enough.

So what is the problem? In this basic form, as we showed before, the calibration stage is global, the same as in the standard pipeline. And it seems that this does not give the model enough flexibility to adapt to the different conditions in the data. Even if you train the model with a lot of different conditions, it will just adapt to the majority condition.

So what we propose to do is to add a branch to this model. We keep the speaker verification branch the same, and we add a branch that is in charge of computing the calibration parameters as a function of the two input vectors, x1 and x2. The form of this branch starts the same as the top one: an affine transformation followed by length normalization; of course, the parameters of this affine transformation are different from the top one's. Then we do dimensionality reduction: we go down to a very low dimension to compute vectors that we call side-information vectors. We then use these vectors to compute an alpha and a beta, using a very simple form that is basically similar to the PLDA form in the top branch. So when we are done, we have two branches: one is in charge of computing the score, and the other one is in charge of computing the calibration parameters for each trial.
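Roughly, the side-information branch might look like the sketch below. To be clear, the parameter names and the exact functional forms here are my guesses at the shape being described (an affine map plus length norm, a low-dimensional projection, and a symmetric second-order form for alpha and beta), not the paper's actual formulas:

```python
import numpy as np

def side_info(x, W, mu):
    # Side-info branch front end: affine transform then length normalization,
    # projecting down to a very low dimension.
    z = W @ (x - mu)
    return z / np.linalg.norm(z)

def calib_params(s1, s2, A, b, a0, C, d, b0):
    # alpha and beta computed from the pair of side-info vectors with
    # symmetric second-order forms, loosely mirroring the PLDA score shape.
    alpha = a0 + s1 @ A @ s2 + b @ (s1 + s2)
    beta = b0 + s1 @ C @ s2 + d @ (s1 + s2)
    return alpha, beta

rng = np.random.default_rng(1)
dim, sdim = 8, 2                     # toy embedding and side-info dimensions
W, mu = rng.normal(size=(sdim, dim)), np.zeros(dim)
A, b, a0 = np.eye(sdim), np.zeros(sdim), 1.0
C, d, b0 = np.eye(sdim), np.zeros(sdim), 0.0
x1, x2 = rng.normal(size=dim), rng.normal(size=dim)
s1, s2 = side_info(x1, W, mu), side_info(x2, W, mu)
alpha, beta = calib_params(s1, s2, A, b, a0, C, d, b0)
raw_score = 3.0                       # stand-in for the speaker-branch score
calibrated = alpha * raw_score + beta # trial-dependent calibration
```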

So I'll show the results now, but first let me talk about the data. We had a whole lot of training data: we used VoxCeleb 1 and 2; SRE data, that is, speaker recognition evaluation data from 2005 to 2012, plus Mixer 6; and Switchboard. All of that is actually shared with the embedding extractor training data; we used just half of what we used for embedding extractor training, for expediency of experimentation.

Then we have two more sets: RATS, which is telephone data in several non-English languages, and FVC Australian, which is forensic voice comparison data. That one is a very clean dataset, with studio microphone recordings in Australian English.

Then for testing we used SRE16, SRE18, Speakers in the Wild, LASRS, which is a bilingual set recorded over several different microphones, and the forensic voice comparison data in Chinese. The recording conditions of the two FVC sets are very similar, but the language is different. We set aside the dev parts of three of these sets, SRE16, SRE18, and Speakers in the Wild, and with those we do all the parameter tuning: we choose the best iteration for each of the models, things like that.

Okay, so here we see the results. The red bars are the same ones as in the previous figure, and I've added the blue bar, which is the system we propose. As you can see, the blue bar is, in most cases, as good as or better than the best global calibration model. We basically achieved what we wanted, which is to have a single model that kind of adapts to the test conditions without us telling it what the test conditions are. The only exception is this FVC Chinese case, which is not well calibrated at all. And in fact, there is one global PLDA model that is better than the one we propose; it is still bad, but it is better than ours.


The problem with that set is basically that it is a condition that is not seen, in combination, during training: we have clean data in training, but it is not in Chinese, and we have Chinese data in training, but it is not clean. So the model doesn't seem to be able to figure out how to properly calibrate that data, unfortunately. This just means there is still work to be done; we haven't fully achieved that ambitious goal I mentioned before, which was to have a completely out-of-the-box system.

Okay, so before finishing, I'd like to describe a few details of how this model is trained, because they are essential for getting good performance. One important thing is to do a non-random initialization. What we do, and many of the end-to-end training papers do similar things, is initialize the speaker branch with the parameters of a standard PLDA baseline. That's the first thing.

Then, for the side-information branch, we initialize the first stage with the bottom components of the LDA transform that we trained for the speaker branch. That means that what comes out of here is basically the worst you could do for speaker ID, which should be close to the best you can do for condition ID: we are trying to extract the condition information from the input.
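That initialization trick amounts to splitting the rows of the trained LDA matrix. A minimal sketch, assuming the rows are sorted from most to least speaker-discriminative (names are mine):

```python
import numpy as np

def split_lda_directions(lda_full, r_speaker, r_side):
    # The speaker branch is initialized with the top (most
    # speaker-discriminative) rows; the side-info branch with the bottom
    # rows, which carry little speaker information and hence mostly
    # condition information.
    top = lda_full[:r_speaker]
    bottom = lda_full[-r_side:]
    return top, bottom

lda_full = np.arange(20.0).reshape(5, 4)  # toy 5x4 "LDA" matrix
top, bottom = split_lda_directions(lda_full, r_speaker=3, r_side=2)
```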

Then this matrix here, for which we don't have any reasonable prior value, we just initialize randomly. And these two components here we initialize so that what comes out of here, at the first iteration of training, are the global calibration parameters. So basically, at initialization, the scores that come out of here are the same ones that would come out of a standard PLDA pipeline.

Here are results comparing three different initialization approaches: a random one; a partial one, which means what I described before but without initializing this stage with the bottom LDA components, using random initialization for it instead; and the full one, in blue. The blue one is the best of the three, so it means it is worth the trouble to take the time to find good initial parameters for this model.


Another important thing is that we train the model in two stages. The first stage uses all the training data to train the full model. Then, in the second stage, we freeze the LDA and PLDA blocks and train only the rest of the parameters, using domain-balanced batches. This is important because, if the data is not balanced, most of the trials in any one batch would come from one domain, the one that has more samples, and we would just be optimizing for that domain.
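Domain balancing can be as simple as drawing the same number of speakers from every domain when forming a batch. A sketch under that assumption (names and counts are invented):

```python
import random

def balanced_speaker_choice(speakers_by_domain, n_per_domain, seed=0):
    # Draw the same number of speakers from every domain so that no single
    # large domain dominates the trials within a batch.
    rng = random.Random(seed)
    chosen = []
    for domain in sorted(speakers_by_domain):
        chosen.extend(rng.sample(speakers_by_domain[domain], n_per_domain))
    return chosen

# Toy data: one large domain and one small one.
domains = {"tel": [f"t{i}" for i in range(50)], "mic": [f"m{i}" for i in range(5)]}
batch_speakers = balanced_speaker_choice(domains, n_per_domain=4)
```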

Finally, the convergence of this model is kind of a big issue: validation performance jumps around a lot from one epoch to the next, so the optimization curve can change significantly between checkpoints. What we do is simply choose the best iteration using the validation sets I mentioned before. The good thing is that this approach seems to generalize well to other sets, even sets that are not very well matched to the validation ones. We tried a bunch of tricks aimed at smoothing out the validation performance, like regularization and slower learning rates, but they actually made the minimum worse, so we keep the wild curves and just choose the minimum.
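Checkpoint selection on a noisy validation curve then reduces to taking the minimum. A trivial sketch with an invented curve:

```python
def best_iteration(val_curve):
    # val_curve: {iteration: validation Cllr averaged over the dev sets}.
    # The curve is noisy, so we simply keep the checkpoint with the
    # minimum validation cost rather than trying to smooth the curve.
    return min(val_curve, key=val_curve.get)

curve = {100: 0.41, 200: 0.35, 300: 0.52, 400: 0.33, 500: 0.60}
chosen = best_iteration(curve)
```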

And one more thing: we have a GitHub repository with exactly this model implemented, for training and for evaluation. You just need to have pre-computed embeddings, and there is an example with embeddings that we provide. Feel free to use it and modify it, and let me know if you find bugs; I'll be happy to respond to questions and comments.

Okay, so, in conclusion, we developed a model that achieves excellent performance across a wide variety of conditions. It integrates the different stages of a speaker verification pipeline into one stage and trains the whole thing jointly. It also integrates an automatic extractor of side-information, which it then uses to condition the calibration. And this achieves our goal of getting good performance across different conditions.

Of course, there are many open issues, like, for example, training convergence; I don't think we are done with that, and I would like to see an easier-to-optimize model. And of course, we'd like to plug this model in with the embedding extractor and train everything end to end. Okay, thank you very much.

If you have any questions, please write to me, and we can also discuss them in more detail through the conference platform. Thank you.