Hello, this is a presentation from the Indian Institute of Science. I will be presenting our work on the neural PLDA (NPLDA) model for speaker verification.

Let's look at the roadmap of this presentation. First, we will look into what a speaker verification task consists of, and then move on to the motivation behind our work. I will then talk about the front-end model that we used, and discuss past approaches to backend modeling, before describing the proposed neural PLDA (NPLDA) model, and then some experiments and results before concluding the presentation.

Let's look at what a speaker verification task consists of. We are given an enrollment recording of a particular target speaker and a test segment. The main objective of the speaker verification system is to determine whether the target speaker is speaking in the test segment, which is the alternative hypothesis, or is not speaking, which is the null hypothesis.

As you can see here, the enrollment recording, denoted by x_e, and the test recording, denoted by x_t, are given as input to the speaker verification system. The system outputs a log-likelihood ratio score, which is used to decide whether the test segment belongs to the target speaker or a non-target speaker.

The most popular state-of-the-art systems for speaker verification consist of a neural embedding extractor; the most popular ones in the last few years have been the x-vector models. This is followed by a generative backend model such as probabilistic linear discriminant analysis, or PLDA. There are also some discriminative backend approaches, like the discriminative PLDA and the SVM. What we propose is a neural network approach, which is discriminative as opposed to generative, for backend modeling in speaker recognition and speaker verification tasks.

Let's look at the front-end model that we used. As I mentioned, the most popular models in the last few years have been the x-vector extractors. We trained our x-vector extractor on the VoxCeleb corpus, which consisted of 7,323 speakers. It was trained using 13-dimensional MFCCs extracted from 25-millisecond frames shifted every 10 milliseconds, using a 20-channel mel-scale filterbank spanning the frequency range 20 Hz to 7600 Hz.
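As a quick sketch of the framing arithmetic behind these MFCC settings (the 16 kHz sample rate is my assumption for illustration; it is not stated in the talk):

```python
# Sketch of the frame arithmetic for 25 ms frames shifted every 10 ms.
# The 16 kHz sample rate is an assumed value for illustration.
sample_rate = 16000
frame_len = int(0.025 * sample_rate)    # 400 samples per 25 ms frame
frame_shift = int(0.010 * sample_rate)  # 160 samples per 10 ms shift

def num_frames(num_samples: int) -> int:
    """Number of full frames that fit in a signal of num_samples samples."""
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // frame_shift

# One second of audio yields 98 full frames at these settings.
print(num_frames(sample_rate))
```

Each such frame would then be mapped through the mel filterbank to produce one 13-dimensional MFCC vector.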

A five-fold augmentation strategy, which included augmenting the data with things like babble, noise, and music, was used to generate about 6.3 million training segments.

The architecture that we used to train the x-vector model was the extended TDNN architecture. This consists of twelve hidden layers with ReLU nonlinearities, and the model is trained to discriminate among the speakers. The first ten hidden layers operate at the frame level, while the last two layers operate at the segment level. After training, the embeddings are extracted from the 512-dimensional affine component of the eleventh layer, that is, the first segment-level layer after the statistics pooling. The embeddings extracted here are the x-vectors.

Let's look at a few approaches to backend modeling. The most popular one in speaker verification systems is the generative Gaussian PLDA, or GPLDA. Once the x-vectors are extracted, a few steps of processing are done on them: they are centered (their mean is removed), transformed using LDA, and then unit-length normalized. The PLDA model on this processed x-vector for a particular recording is given in Equation 1.

η_r = Φ ω + ε_r      (1)

Here, η_r is the x-vector for a particular recording r, ω describes a latent speaker factor, which is Gaussian, Φ characterizes the speaker subspace matrix, and ε_r is a Gaussian residual.
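A minimal numerical sketch of this generative model; the dimensions and random parameters below are illustrative stand-ins, not the actual trained values:

```python
import numpy as np

rng = np.random.default_rng(0)

emb_dim, spk_dim = 8, 3  # illustrative sizes, not the real x-vector dimensions
Phi = rng.standard_normal((emb_dim, spk_dim))  # speaker subspace matrix
omega = rng.standard_normal(spk_dim)           # latent speaker factor ~ N(0, I)

def sample_xvector(omega: np.ndarray) -> np.ndarray:
    """Draw one recording's processed x-vector: eta_r = Phi @ omega + eps_r."""
    eps = rng.standard_normal(emb_dim)  # Gaussian residual
    return Phi @ omega + eps

# Two recordings of the same speaker share omega but have different residuals.
eta_1 = sample_xvector(omega)
eta_2 = sample_xvector(omega)
```

The shared latent factor ω is what ties together recordings of the same speaker, while the residual ε_r accounts for within-speaker variability.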

Now we form a pair of these x-vectors: one from the enrollment recording, denoted by η_e, and one from the test recording, denoted by η_t. These are used with the PLDA model in order to compute the log-likelihood ratio score given in Equation 2:

s(η_e, η_t) = η_e' Q η_e + η_t' Q η_t + 2 η_e' P η_t + const      (2)

Equation 2 is derived from the model in Equation 1, and P and Q are matrices obtained from the PLDA model parameters.
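The quadratic scoring in Equation 2 can be sketched as follows; P, Q, and the constant here are random symmetric placeholders, whereas in a real system they are derived from the trained PLDA parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 8  # illustrative embedding dimension

# Placeholder symmetric matrices standing in for the PLDA-derived P and Q.
A = rng.standard_normal((dim, dim)); P = (A + A.T) / 2
B = rng.standard_normal((dim, dim)); Q = (B + B.T) / 2
const = 0.0

def plda_llr(eta_e: np.ndarray, eta_t: np.ndarray) -> float:
    """Quadratic PLDA log-likelihood ratio of Equation 2."""
    return float(eta_e @ Q @ eta_e + eta_t @ Q @ eta_t
                 + 2.0 * eta_e @ P @ eta_t + const)

eta_e = rng.standard_normal(dim)
eta_t = rng.standard_normal(dim)
score = plda_llr(eta_e, eta_t)
```

Because P and Q are symmetric, the score is unchanged if the enrollment and test embeddings are swapped.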

Two other approaches to backend modeling are the discriminative PLDA and the pairwise Gaussian backend. The discriminative PLDA, or DPLDA, uses an expanded vector representation of the enrollment and test pair, denoted by φ(η_e, η_t). This is computed using a quadratic kernel, which is given in Equation 3. The final DPLDA log-likelihood ratio score is computed as the dot product of a weight vector w and this expanded vector φ(η_e, η_t).
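One common form of quadratic expansion used in discriminative PLDA is sketched below; the exact kernel in Equation 3 may differ in detail, so treat this as an illustrative variant with a random weight vector in place of the learned one:

```python
import numpy as np

def expand(eta_e: np.ndarray, eta_t: np.ndarray) -> np.ndarray:
    """One quadratic expansion of an (enrollment, test) pair into phi(eta_e, eta_t)."""
    outer_cross = np.outer(eta_e, eta_t) + np.outer(eta_t, eta_e)
    outer_self = np.outer(eta_e, eta_e) + np.outer(eta_t, eta_t)
    return np.concatenate([outer_cross.ravel(),
                           outer_self.ravel(),
                           eta_e + eta_t,
                           [1.0]])

rng = np.random.default_rng(2)
dim = 4
w = rng.standard_normal(2 * dim * dim + dim + 1)  # learned weight vector (random here)
eta_e, eta_t = rng.standard_normal(dim), rng.standard_normal(dim)

# DPLDA score: dot product of the weight vector and the expanded vector.
score = float(w @ expand(eta_e, eta_t))
```

Since the expansion is symmetric in η_e and η_t, the score does not depend on which side of the trial is enrollment and which is test.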

The pairwise Gaussian backend models the pairs of enrollment and test x-vectors using Gaussian distribution parameters. These parameters are estimated by computing the sample means and covariance matrices of the target and non-target trials in the training data.
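A small sketch of the pairwise Gaussian backend idea, using synthetic 2-D stacked pairs for brevity (a real system would stack the full enrollment and test x-vectors):

```python
import numpy as np

rng = np.random.default_rng(3)

def log_gauss(x, mean, cov):
    """Log-density of a multivariate Gaussian."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet
                   + d @ np.linalg.solve(cov, d))

# Synthetic training pairs: stacked [eta_e; eta_t] vectors (2-D here for brevity).
tar = rng.multivariate_normal([1.0, 1.0], np.eye(2) * 0.5, size=500)   # same speaker
non = rng.multivariate_normal([1.0, -1.0], np.eye(2) * 0.5, size=500)  # different speakers

# Estimate the sample means and covariances of target and non-target trials.
mu_tar, cov_tar = tar.mean(axis=0), np.cov(tar.T)
mu_non, cov_non = non.mean(axis=0), np.cov(non.T)

def pairwise_llr(pair: np.ndarray) -> float:
    """Log-likelihood ratio of the target Gaussian vs. the non-target Gaussian."""
    return float(log_gauss(pair, mu_tar, cov_tar) - log_gauss(pair, mu_non, cov_non))
```

A pair drawn near the target-trial mean then receives a higher log-likelihood ratio than one near the non-target mean, which is exactly the scoring behavior the backend needs.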

Along with the NPLDA model that we propose, we report results on the generative Gaussian PLDA, the discriminative PLDA, and the pairwise Gaussian backend.

Now, let's look at the proposed neural PLDA backend architecture. What we have here is a pairwise, Siamese-style discriminative network. As you can see, the green portion of the network corresponds to the enrollment embeddings, and the blue portion of the network corresponds to the test embeddings. We construct the preprocessing steps of the generative approach as layers in the neural network: the LDA as the first affine layer, the unit-length normalization as a nonlinear activation, and then the PLDA centering and diagonalization as another affine transformation. The final PLDA pairwise scoring, which is given in Equation 2, is implemented as a quadratic layer.
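The layer structure just described can be sketched as a plain forward pass; all of the weights below are random stand-ins for the parameters that the NPLDA network would learn, and the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
in_dim, lda_dim = 16, 8  # illustrative sizes; real systems use larger dimensions

# Random stand-ins for the learned parameters of each layer.
W_lda = rng.standard_normal((lda_dim, in_dim))    # LDA as the first affine layer
b_lda = rng.standard_normal(lda_dim)
W_diag = rng.standard_normal((lda_dim, lda_dim))  # centering/diagonalization affine
b_diag = rng.standard_normal(lda_dim)
A = rng.standard_normal((lda_dim, lda_dim)); P = (A + A.T) / 2
B = rng.standard_normal((lda_dim, lda_dim)); Q = (B + B.T) / 2

def preprocess(x: np.ndarray) -> np.ndarray:
    """Affine (LDA) -> unit-length normalization -> affine (centering/diagonalization)."""
    y = W_lda @ x + b_lda
    y = y / np.linalg.norm(y)  # length normalization as the nonlinearity
    return W_diag @ y + b_diag

def nplda_score(x_e: np.ndarray, x_t: np.ndarray) -> float:
    """Quadratic scoring layer implementing Equation 2 on the processed pair."""
    e, t = preprocess(x_e), preprocess(x_t)
    return float(e @ Q @ e + t @ Q @ t + 2.0 * e @ P @ t)

score = nplda_score(rng.standard_normal(in_dim), rng.standard_normal(in_dim))
```

Because every step is a differentiable layer, the whole pipeline, including P and Q, can be trained end to end with backpropagation instead of being estimated generatively.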

The parameters of this model are optimized using an approximation of the minimum detection cost function, which is known as the minDCF or C_min. As the model is optimized to minimize the detection cost function, we report results on the minDCF metric as well as the EER metric.

The normalized detection cost function, or DCF, is defined as C_Norm(β, θ), which is equal to P_Miss(θ) + β · P_FA(θ), where β is an application-based weight, and P_Miss and P_FA are the probabilities of miss and false alarm respectively.

A miss is when the model predicts a target trial to be a non-target one, that is, the model believes that the enrollment and test come from different speakers, whereas a false alarm is when a non-target trial is wrongly predicted as a target one. P_Miss and P_FA are computed by applying a detection threshold θ to the log-likelihood ratios.

How P_Miss and P_FA are computed is given in Equation 5:

P_Miss(θ) = [ Σ_i (1 − t_i) · 𝟙(s_i < θ) ] / Σ_i (1 − t_i)
P_FA(θ) = [ Σ_i t_i · 𝟙(s_i ≥ θ) ] / Σ_i t_i      (5)

Here, s_i is the score, or the log-likelihood ratio, output by the model, and t_i is the ground-truth variable for trial i; that is, t_i is equal to zero if trial i is a target trial, and t_i is equal to one if it is a non-target trial. 𝟙 is the indicator function.
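Equation 5 amounts to the following computation; the toy scores and labels are made up for illustration, with t_i = 0 marking a target trial as in the talk:

```python
import numpy as np

def miss_and_fa(scores: np.ndarray, t: np.ndarray, theta: float):
    """P_Miss and P_FA at threshold theta; t[i] = 0 for target, 1 for non-target."""
    target = (t == 0)
    p_miss = np.mean(scores[target] < theta)  # fraction of targets rejected
    p_fa = np.mean(scores[~target] >= theta)  # fraction of non-targets accepted
    return float(p_miss), float(p_fa)

# Toy trial scores (log-likelihood ratios) and ground-truth labels.
scores = np.array([2.0, 1.5, -0.5, -2.0, 0.5, -1.0])
labels = np.array([0,   0,    0,    1,   1,    1])
p_miss, p_fa = miss_and_fa(scores, labels, theta=0.0)
```

At threshold θ = 0, one of the three target trials falls below the threshold and one of the three non-target trials falls above it, so both error rates come out to one third.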

The normalized detection cost function is not a smooth function of the model parameters, due to the discontinuity introduced by the indicator function, and hence it cannot be used directly as an objective function in a neural network. What we propose in our work to counter this is a differentiable approximation of the normalized detection cost, obtained by approximating the indicator function with a sigmoid function. This gives Equation 6.

Here, the approximations of the normalized detection cost are given in terms of the soft detection costs P_Miss(soft) and P_FA(soft). Again, t_i is the ground truth for trial i, s_i is the system output score, or the log-likelihood ratio, and σ denotes the sigmoid function. By choosing a large enough value for the warping factor α, the approximation can be made arbitrarily close to the actual detection cost function for a wide range of thresholds.
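A sketch of this sigmoid relaxation, reusing the toy scores from before: as the warping factor α grows, the soft cost approaches the hard indicator-based cost (β = 1 here purely for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_dcf(scores, t, theta, beta=1.0, alpha=15.0):
    """Differentiable detection cost: the indicator is replaced by a warped sigmoid."""
    target = (t == 0)
    p_miss = np.mean(sigmoid(alpha * (theta - scores[target])))  # soft "s_i < theta"
    p_fa = np.mean(sigmoid(alpha * (scores[~target] - theta)))   # soft "s_i >= theta"
    return float(p_miss + beta * p_fa)

def hard_dcf(scores, t, theta, beta=1.0):
    """The non-differentiable cost with the true indicator function."""
    target = (t == 0)
    return float(np.mean(scores[target] < theta)
                 + beta * np.mean(scores[~target] >= theta))

scores = np.array([2.0, 1.5, -0.5, -2.0, 0.5, -1.0])
labels = np.array([0,   0,    0,    1,   1,    1])

# With a large alpha, the soft cost is numerically close to the hard cost.
gap = abs(soft_dcf(scores, labels, theta=0.1, alpha=50.0)
          - hard_dcf(scores, labels, theta=0.1))
```

Unlike the hard cost, the soft version has a usable gradient with respect to the scores, which is what lets the NPLDA parameters be trained against it.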

Before we dive into the results, let's look at the datasets used in training and testing the backend model. We sampled about 6.6 million trials from the clean VoxCeleb set, and an equal number of trials from the augmented VoxCeleb set. For testing, we report results on three datasets: the Speakers in the Wild (SITW) eval core test condition, which consists of around 800,000 trials; the VOICES development set, which consists of about 4 million trials; and the VOICES evaluation set, which consisted of roughly three and a half million trials.

Now let me move on to the results on the SITW eval core, VOICES development, and VOICES evaluation sets for the various models: the Gaussian PLDA backend, the discriminative PLDA, the pairwise Gaussian backend, and the proposed NPLDA. Along with the soft detection cost, we also ran our experiments with binary cross entropy as the loss, which is denoted in the table as the BCE loss. We observe relative improvements in terms of minDCF of around 31 percent, 20 percent, and 11 percent for SITW, VOICES development, and VOICES evaluation respectively.

The best scores for SITW eval core are an EER of 2.05 percent and a minDCF of 0.2. For the VOICES development set, we get a best EER of 1.91 percent and 0.2 as the best minDCF. For the VOICES evaluation set, we get 6.01 percent as the best EER score, and 0.49 as the minDCF.

The improvements observed with the neural backend are consistent across data augmentation conditions as well as on the EER metric. Note that training with the soft detection cost performs even better than the binary cross entropy, or BCE, loss.

To summarize, the proposed model is a step in exploring discriminative neural network models for the task of speaker verification. Using a single elegant backend model that is targeted to optimize the speaker verification loss, the NPLDA model uses the extracted embeddings directly to generate the speaker verification score. This model shows significant performance gains on the SITW and VOICES datasets. We have also observed considerable improvements on other datasets, like the NIST SRE datasets. A step further is an end-to-end model, where the model is optimized not just from the extracted embeddings but directly from acoustic features like MFCCs. This work was accepted at Interspeech 2020. These are some of the references. Thank you.