Hello. I will be presenting joint work with my colleagues from the Human Language Technology Center of Excellence at Johns Hopkins University.

The title of our work is X-Vector Magnitude Estimation Network plus Offset for Improving Speaker Recognition.

Current state-of-the-art text-independent speaker recognition is based on DNN embeddings trained with a classification loss, for example multiclass cross-entropy. If there is no severe mismatch between the DNN training data and the deployment environment, the simple cosine similarity between embeddings from a system trained with an angular-margin softmax provides very good speaker discrimination.

For example, in the most recent NIST SRE evaluation, which used audio extracted from videos, the top-performing single system on the audio track was based on this approach.

Unfortunately, even though cosine similarity provides good speaker discrimination, directly using those scores does not allow us to make automatic decisions with a theoretically optimal threshold, because the scores are not calibrated.

The typical way to address this problem is to use an affine mapping to transform the scores into log-likelihood ratios that are well calibrated. This is typically done using linear logistic regression, where we learn two numbers: a scale and an offset.

Looking at the top equation, the raw scores are denoted by s_ij, which is the cosine similarity between two embeddings. This can be expressed as the inner product of the unit-length embeddings, s_ij = x̃_i^T x̃_j, so it is nothing more than the inner product between unit-length vectors. Once we learn a calibration mapping with parameters a and b, we can transform this score into a log-likelihood ratio, llr_ij = a·s_ij + b, and then we can use the Bayes threshold to make optimal decisions.
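As a minimal illustration of this standard recipe, here is a short Python sketch of cosine scoring, affine calibration, and the Bayes decision; the scale, offset, and target prior are made-up values, not the ones from our systems.

```python
import numpy as np

def cosine_score(x_i, x_j):
    """Inner product of length-normalized embeddings."""
    x_i = x_i / np.linalg.norm(x_i)
    x_j = x_j / np.linalg.norm(x_j)
    return float(np.dot(x_i, x_j))

def calibrate(score, a, b):
    """Affine mapping s -> a*s + b, interpreted as a log-likelihood ratio."""
    return a * score + b

def bayes_decision(llr, p_target):
    """Accept the trial if the LLR exceeds the Bayes threshold -logit(prior)."""
    threshold = -np.log(p_target / (1.0 - p_target))
    return llr > threshold

# Illustrative usage with random embeddings and made-up calibration parameters.
rng = np.random.default_rng(0)
x_i, x_j = rng.normal(size=256), rng.normal(size=256)
llr = calibrate(cosine_score(x_i, x_j), a=12.0, b=-3.0)
print(llr, bayes_decision(llr, p_target=0.05))
```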

In this work we propose a generalization. One way to look at it is to think of the scale a as simply assigning the same constant magnitude to every unit-length embedding, so that every inner product gets the same scale. Instead, we suggest that it is probably better to let each embedding have its own magnitude, and we want to use a neural network to estimate the optimal value of those magnitudes.

We also use a global offset to complete the mapping to log-likelihood ratios. Note that this new approach may result in a non-monotonic mapping, which means that it has the potential not only to produce calibrated scores but also to improve discrimination, by increasing the separation between target and non-target scores.
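As a rough sketch of the difference between the two ideas, assuming hypothetical per-embedding magnitudes m_i and m_j have already been estimated for a pair of unit-length embeddings:

```python
import numpy as np

# Global linear calibration: every trial shares the same scale a and offset b.
def global_calibrated_score(x_i, x_j, a, b):
    return a * np.dot(x_i, x_j) + b

# Proposed idea: each embedding carries its own magnitude, so the effective
# scale m_i * m_j varies from trial to trial; b is the shared global offset.
def magnitude_score(x_i, x_j, m_i, m_j, b):
    return m_i * m_j * np.dot(x_i, x_j) + b
```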

To train this magnitude network we use a binary classification task. We draw target and non-target trials from a training set, and the loss function is a weighted binary cross-entropy, where pi is the prior of a target trial and l_ij is the log posterior odds, which can be decomposed in terms of the log-likelihood ratio and the log prior odds.
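As a sketch of such a prior-weighted binary cross-entropy (the exact formulation in the paper may differ in details), assuming a batch of trial scores interpreted as log-likelihood ratios, binary labels, and a target prior pi:

```python
import torch
import torch.nn.functional as F

def weighted_bce(llr: torch.Tensor, labels: torch.Tensor, prior: float) -> torch.Tensor:
    """Weighted binary cross-entropy over target (1) and non-target (0) trials."""
    logit_prior = torch.log(torch.tensor(prior / (1.0 - prior)))
    log_odds = llr + logit_prior                  # log posterior odds = llr + log prior odds
    tar, non = labels == 1, labels == 0
    loss_tar = F.softplus(-log_odds[tar]).mean()  # -log sigmoid(log_odds) for targets
    loss_non = F.softplus(log_odds[non]).mean()   # -log(1 - sigmoid(log_odds)) for non-targets
    return prior * loss_tar + (1.0 - prior) * loss_non
```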

The overall system architecture that we are going to use is trained in three steps. On the left is a block diagram of our baseline architecture. We use 2D convolutions in a ResNet architecture, followed by temporal pooling, which gives a high-dimensional pooled activation. We then use an affine layer as a bottleneck to obtain the embedding; the embeddings are 256-dimensional, and the star denotes the node where the embedding is extracted from the network. The network is trained using multiclass cross-entropy with a softmax classification head that uses an additive angular margin.
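As a rough sketch of that pipeline in PyTorch, with the ResNet trunk left as a placeholder; the pooling choice (mean plus standard deviation) and the layer sizes here are my assumptions, not the exact configuration of our system:

```python
import torch
import torch.nn as nn

class EmbeddingExtractor(nn.Module):
    def __init__(self, trunk: nn.Module, pooled_dim: int, emb_dim: int = 256, n_spk: int = 6000):
        super().__init__()
        self.trunk = trunk                                    # 2D-conv ResNet over the spectrogram
        self.bottleneck = nn.Linear(2 * pooled_dim, emb_dim)  # affine bottleneck: the embedding node
        self.head = nn.Linear(emb_dim, n_spk, bias=False)     # classification head (margin applied in the loss)

    def forward(self, feats):
        h = self.trunk(feats)                                 # (batch, pooled_dim, time)
        stats = torch.cat([h.mean(-1), h.std(-1)], dim=-1)    # temporal pooling
        emb = self.bottleneck(stats)                          # embedding extracted here at test time
        return emb, self.head(emb)                            # logits for the angular-margin softmax loss
```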

The first step of the training process is to use short segments to train the network. In the past we have seen this to be a good compromise, because the short sequences allow for good use of GPU memory with large batches, and at the same time they make the task harder, so even though we have a very powerful classification head we still make errors and therefore have useful gradients to back-propagate.

As the second step, we propose to freeze the most memory-intensive layers, which are typically the early layers operating at the frame level, and then fine-tune the post-pooling layers with full recordings, using all the frames of the audio recording, which might be up to two minutes of speech. By freezing the pre-pooling layers we reduce the memory demands, so we can use the long sequences, and we also avoid the overfitting that could arise when training on the long sequences.
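A minimal sketch of this freezing step, assuming a model organized like the extractor sketch above; the optimizer and its hyperparameters are just for illustration:

```python
import torch

# Freeze the memory-heavy frame-level trunk; only the post-pooling layers get updated.
for p in model.trunk.parameters():
    p.requires_grad = False

post_pool = list(model.bottleneck.parameters()) + list(model.head.parameters())
optimizer = torch.optim.SGD(post_pool, lr=1e-3, momentum=0.9)
```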

Finally, in the third step we train the magnitude estimation network. The first thing we do is discard the multiclass classification head and use a binary classification task instead. We use a Siamese structure, which is depicted here by copying the network twice, but the parameters are shared; this is just for illustration purposes. Notice that we also freeze the affine layer corresponding to the embedding, which is denoted by the grey colour. So at this stage we are fixing the embeddings, and we are adding a magnitude estimation network that takes the pooled activations, which are very high-dimensional, and tries to learn a scalar magnitude that goes along with the unit-length x-vector. It is optimized to minimize the binary cross-entropy. We also keep the global offset as part of the optimization.
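Here is a small sketch of what such a magnitude head plus global offset could look like, assuming the frozen network supplies the pooled activations and the unit-length embeddings; the hidden size and the softplus positivity constraint are my own assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MagnitudeNet(nn.Module):
    def __init__(self, pooled_dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pooled_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),     # keep the estimated magnitude positive
        )
        self.offset = nn.Parameter(torch.zeros(1))   # global offset shared by all trials

    def score(self, pooled_i, pooled_j, emb_i, emb_j):
        m_i = self.net(pooled_i).squeeze(-1)         # magnitude for each side of the trial
        m_j = self.net(pooled_j).squeeze(-1)
        cos = F.cosine_similarity(emb_i, emb_j, dim=-1)
        return m_i * m_j * cos + self.offset         # scores fed to the weighted binary cross-entropy
```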

To validate our ideas we use the following setup. As our baseline system we use a modification of the ResNet-34 x-vector architecture proposed in prior work. The modification is that we allocate more channels to the earlier layers, because we have seen that this improves performance. At the same time, to control the number of parameters, we change the expansion rates of the different layers so that we do not increase the number of channels as much in the deeper layers, and with that we are able to control the number of parameters without degrading performance. To train the DNN we use the VoxCeleb2 dev data, which comprises about six thousand speakers and a million utterances, and this is wideband audio at 16 kHz. Note that we process the data differently for the short segments and for the full-length refinement in terms of how we apply augmentations; I refer you to the paper for the details, which are very important for good performance and also for generalization.

To make sure that we do not overfit to a single evaluation set, we benchmark against four different sets. Speakers in the Wild and VoxCeleb1 are actually a good fit to VoxCeleb2: there is not much domain mismatch between those two evaluation sets and the training data. The SRE19 audio-from-video portion and CHiME-5 have some domain mismatch compared to the training data, and I will discuss this when presenting the results. In the case of SRE19 this is mostly because the test audio comprises multiple speakers, so there is a need for diarization. In the CHiME-5 case there are far-field microphone recordings with a lot of overlapped speech and higher levels of reverberation, so it is a very challenging setup. Also, the CHiME-5 results will be split between a close-talking microphone and a far-field microphone.

Let us start by looking at the baseline system that we are proposing. We present results in terms of equal error rate and two other operating points; we do this to facilitate the comparison with prior work. If you look at the right of the table, we list the best single-system, no-fusion numbers that we were able to find in the literature for each of the benchmarks; not all of the costs were reported in prior work. Our baseline seems to do a good job compared to the prior work, outperforming it at most of the operating points. Note that we are not doing any particular tuning for each evaluation set.

There is one small caveat: as I said, SRE19 requires diarization, so we diarize the test segments, and then for each detected speaker cluster we extract an x-vector. We then score the enrollment against all of the test x-vectors and take the maximum as the trial score.
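For illustration, a small sketch of that scoring rule, assuming one test x-vector per detected speaker cluster and taking the maximum as the trial score; the function and array names are hypothetical:

```python
import numpy as np

def score_trial(enroll_xvector, test_xvectors):
    """Cosine-score the enrollment against every diarized cluster; keep the best match."""
    e = enroll_xvector / np.linalg.norm(enroll_xvector)
    t = test_xvectors / np.linalg.norm(test_xvectors, axis=1, keepdims=True)
    return float(np.max(t @ e))
```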

To check the improvements that the full-length refinement of the second stage brings, we can compare against the baseline in this table. Overall we see positive trends across all the data sets and operating points, but the gains are larger for Speakers in the Wild, and this makes sense because that is the set where the evaluation data has a longer duration compared to the four-second segments that were used to train the DNN. This validates the recent findings of our Interspeech paper, in which we saw that full-length refinement is a good way to mitigate the duration mismatch between the training phase and the test phase.

Regarding the magnitude estimation network, we explored multiple topologies. All of them were feed-forward architectures, and here we present three representative cases that differ in the number of layers and the width of the layers; the number of parameters goes from 1.5 million to 20 million. When we compare performance for these three architectures across all the tasks, we do not see large changes, so the performance is quite stable across networks, which is probably a strength. To find a good trade-off between the number of parameters and performance, we are going to use the second of these architectures for the remaining experiments.

Next, I present the overall gains in discrimination due to the three stages. In the graphs, the horizontal axis shows the different benchmarks; we split the far-field microphone results into a separate plot just to facilitate the visualisation, because they are in a different dynamic range. On the vertical axis we are depicting one of the cost operating points.

The colour coding indicates the system: one colour is the baseline system, the next is the result of applying the full-length refinement to that baseline system, and the grey indicates the application of the magnitude estimation on top of the full-length refinement.

Overall we can see that the two later training stages, the full-length refinement and the magnitude estimation, both produce gains, and we see that across all data sets. In terms of equal error rate we are getting about a twelve percent gain, and for the other two operating points we are also getting sizable average gains. Even though I am only showing one operating point here, in the paper you can see the results for the other two operating points.

Finally, let us look at the calibration results. Both the global calibration and the magnitude network are trained on the VoxCeleb2 dev dataset. This is a good match for the VoxCeleb1 and Speakers in the Wild evaluation sets, but not such a good match for CHiME-5 and SRE19, where the test segments differ, as discussed before. For the global calibration we can see that we obtain good performance, in terms of the actual costs relative to the minimum costs, for both VoxCeleb1 and Speakers in the Wild.

But when we move to the other datasets, we struggle to obtain good calibration with the global calibration. Looking at the magnitude estimation network, we see a similar trend: for VoxCeleb1 and Speakers in the Wild we obtain very good calibration, but the system also struggles on the other sets. I think a fair statement is that the magnitude estimation network does not solve the domain shift, but it outperforms the global linear calibration at all operating points and for all data sets.

To gain some understanding of what the magnitude estimation is doing, we did some analysis. The bottom plot on the right shows the histogram of the cosine scores for the non-target and target distributions; the red colour indicates the non-target scores and the blue colour indicates the target scores. The top two panels show the cosine score plotted against the product of the magnitudes of the two embeddings involved in the trial, and the horizontal line indicates the global scale, or magnitude, that the global calibration assigns to every embedding. The scores used for this analysis are from the Speakers in the Wild evaluation. Since the magnitude estimation network improves discrimination, we expect two trends. For the low-cosine-score targets, we expect the product of the magnitudes to be bigger than the global scale. On the other hand, for the high-cosine-score non-target trials, we expect the opposite, that is, the product of the magnitudes to be smaller than the global scale.

The expected trends are indeed present in these plots. If we look at the top plot, we see that there is an upward tilt, and the magnitudes for the low-cosine-score targets tend to be above the constant magnitude that would be assigned by the global calibration. On the other hand, we see that a large portion of the non-targets are below the global scale, and the ones that get very high cosine scores are also quite attenuated. This is consistent with the observation that the magnitude estimation network is improving discrimination.
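For anyone who wants to reproduce this kind of analysis, here is a rough matplotlib sketch; it merges the two top panels into a single scatter, and it assumes NumPy arrays of per-trial cosine scores, magnitude products, target labels, and the global scale from the linear calibration:

```python
import matplotlib.pyplot as plt

def plot_magnitude_analysis(cos_scores, mag_products, is_target, global_scale):
    fig, (top, bottom) = plt.subplots(2, 1, sharex=True)
    top.scatter(cos_scores[is_target], mag_products[is_target], s=2, c="blue", label="target")
    top.scatter(cos_scores[~is_target], mag_products[~is_target], s=2, c="red", label="non-target")
    top.axhline(global_scale, ls="--", c="k", label="global scale")  # constant-magnitude baseline
    top.set_ylabel("product of magnitudes")
    top.legend()
    bottom.hist([cos_scores[~is_target], cos_scores[is_target]], bins=100, density=True,
                color=["red", "blue"], label=["non-target", "target"])
    bottom.set_xlabel("cosine score")
    bottom.legend()
    plt.show()
```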

So to conclude, we have introduced a magnitude estimation network with a global offset. The idea is to assign a magnitude to each of the unit-length x-vectors that were trained with an angular-margin softmax. The resulting scaled x-vectors can be directly compared using inner products to produce calibrated scores, and we have also seen that this increases the discrimination between speakers. Although the domain shift still remains a challenge, there are significant improvements: the proposed system outperforms a very strong baseline on the four benchmarks that we showed. We also validated the use of full-recording refinement to help with the duration mismatch between the training and test phases.

If you found this work interesting, I suggest that you also take a look at the related work that my colleagues are presenting at this conference, since it is closely related. If you have any questions, you can reach me at my email, and I look forward to talking with you during the Q&A sessions. Thanks for your time.