Hi. This is a presentation from the LEAP lab at the Indian Institute of Science, Bangalore, and I will be presenting the paper "LEAP System for the SRE 2019 CTS Challenge: Improvements and Error Analysis".

This paper is joint work with my colleagues at the LEAP lab. Let's get started.

Here is the outline of this presentation. I will first give a brief overview of how speaker recognition systems work, then discuss the SRE19 challenge performance metrics, talk about the front-end and back-end modeling in our systems, discuss the results of these systems, and then present some analysis of the post-evaluation results before concluding the presentation.

This is a brief overview of how speaker verification, or speaker recognition, systems work. In the first phase, we take the raw speech and extract features like MFCCs from it. These features are then processed with some voice activity detection and normalization. These features are then given as input to train the parameters of a deep neural network model.

The most popular neural-network-based embedding extractors in the last few years have been the x-vectors.

Once the extractor training phase is done, we enter the backend (PLDA) training phase. The extracted x-vectors have some processing done on them, like centering and LDA, and are then unit-length normalized before training the PLDA model. The most popular state-of-the-art systems use a generative Gaussian PLDA model for the backend system.

In the verification phase, we have a trial, which consists of an enrollment utterance and a test utterance, and the objective of the speaker recognition system is to determine whether the test utterance belongs to the target speaker or to a non-target speaker. Thus, once we extract x-vector embeddings for the enrollment and test utterances, we compute a log-likelihood ratio score using the PLDA backend model, and then, using this score, we decide if the trial is a target one or a non-target one.
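As a minimal sketch of this verification flow (the function names and the stand-in cosine scorer below are illustrative, not the actual system code):

```python
import numpy as np

def length_norm(x):
    """Unit-length normalization of an embedding."""
    return x / np.linalg.norm(x)

def verify(enroll_emb, test_emb, score_fn, threshold=0.0):
    """Return True for a target decision, False for non-target."""
    e = length_norm(enroll_emb)
    t = length_norm(test_emb)
    llr = score_fn(e, t)  # e.g., a PLDA log-likelihood ratio
    return llr > threshold

# Example with a stand-in cosine scorer in place of the PLDA backend:
rng = np.random.default_rng(0)
e, t = rng.normal(size=512), rng.normal(size=512)
print(verify(e, t, score_fn=np.dot))
```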

Let's look at the SRE19 performance metrics. The NIST SRE challenge in 2019 consisted of two tracks: the first one was speaker detection on conversational telephone speech, or CTS, and the second was multimedia speaker recognition. Our work was on the first track, the CTS challenge. The normalized detection cost function, or DCF, is defined as in Equation 1:

$$C_{\text{Norm}}(\beta, \theta) = P_{\text{Miss}}(\theta) + \beta \cdot P_{\text{FA}}(\theta) \qquad (1)$$

Here, $P_{\text{Miss}}$ and $P_{\text{FA}}$ are the probabilities of miss and false alarm, respectively.

A miss is when the speaker recognition system declares a target trial to be a non-target one, that is, the system wrongly judges the enrollment and test utterances not to belong to the same speaker. A false alarm is when a non-target trial is erroneously detected as a target trial. $P_{\text{Miss}}$ and $P_{\text{FA}}$ are computed by applying a detection threshold $\theta$ to the log-likelihood ratios.

The primary cost metric for the NIST SRE19 conversational telephone speech challenge is given by Equation 2, where $\beta_1 = 99$ and $\beta_2 = 199$:

$$C_{\text{Primary}} = \frac{1}{2}\left[C_{\text{Norm}}(\beta_1, \log\beta_1) + C_{\text{Norm}}(\beta_2, \log\beta_2)\right] \qquad (2)$$

The minimum detection cost, known as minDCF or $C_{\text{min}}$, is computed using the detection thresholds that minimize the detection cost. Equation 3 minimizes Equation 2 over the thresholds $\theta_1$ and $\theta_2$:

$$C_{\text{min}} = \min_{\theta_1, \theta_2} \frac{1}{2}\left[C_{\text{Norm}}(\beta_1, \theta_1) + C_{\text{Norm}}(\beta_2, \theta_2)\right] \qquad (3)$$

The equal error rate, or EER, is the common value of $P_{\text{FA}}$ and $P_{\text{Miss}}$ computed at the threshold at which $P_{\text{FA}}$ and $P_{\text{Miss}}$ are equal. We report the results in terms of EER, $C_{\text{min}}$, and $C_{\text{Primary}}$ for all of our systems.
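As a rough illustration of these metrics, here is a minimal NumPy sketch (a brute-force threshold sweep for clarity, not the official NIST scoring tool; `scores` and 0/1 `labels` arrays are assumed):

```python
import numpy as np

def pmiss_pfa(scores, labels, theta):
    """P_Miss and P_FA at detection threshold theta (labels: 1 = target)."""
    tgt, non = scores[labels == 1], scores[labels == 0]
    return np.mean(tgt < theta), np.mean(non >= theta)

def c_norm(scores, labels, beta, theta):
    """Equation 1: C_Norm(beta, theta) = P_Miss(theta) + beta * P_FA(theta)."""
    p_miss, p_fa = pmiss_pfa(scores, labels, theta)
    return p_miss + beta * p_fa

def c_primary(scores, labels, betas=(99.0, 199.0)):
    """Equation 2: average of C_Norm at the thresholds theta = log(beta)."""
    return np.mean([c_norm(scores, labels, b, np.log(b)) for b in betas])

def c_min(scores, labels, betas=(99.0, 199.0)):
    """Equation 3: minimize each C_Norm term over the detection threshold."""
    thetas = np.sort(scores)
    return np.mean([min(c_norm(scores, labels, b, th) for th in thetas)
                    for b in betas])

def eer(scores, labels):
    """EER: the operating point where P_Miss and P_FA are equal."""
    curve = [pmiss_pfa(scores, labels, th) for th in np.sort(scores)]
    pm, pf = np.array(curve).T
    i = int(np.argmin(np.abs(pm - pf)))
    return (pm[i] + pf[i]) / 2
```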

The SRE19 evaluation set consisted of around 2.5 million trials from 14,561 segments.

Let's look at the front-end modeling in our systems. We trained three x-vector models with different subsets of the training data, which are described in the next slide. We used the extended time-delay neural network (E-TDNN) architecture. The extended TDNN architecture consists of twelve hidden layers with ReLU nonlinearities. The model is trained to discriminate among the speakers in the training set. The first ten hidden layers operate at the frame level, while the last two operate at the segment level. There is a 1500-dimensional statistics pooling layer between the frame-level and segment-level layers, which computes the mean and standard deviation. After training, embeddings are extracted from the 512-dimensional affine component of the eleventh layer, which is the first segment-level layer. These embeddings are the x-vectors we use.
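A minimal PyTorch sketch of the statistics pooling step described above (the dimensions are illustrative):

```python
import torch

class StatsPooling(torch.nn.Module):
    """Statistics pooling: map frame-level features (batch, T, D) to a
    segment-level vector by concatenating the mean and standard deviation
    over the time axis."""
    def forward(self, x):
        return torch.cat([x.mean(dim=1), x.std(dim=1)], dim=1)

# Dimensions here are illustrative: with D = 750 per frame, the pooled
# statistics vector is 1500-dimensional; a following affine layer would
# produce the 512-dimensional x-vector embedding.
frames = torch.randn(8, 200, 750)    # 8 segments, 200 frames each
print(StatsPooling()(frames).shape)  # torch.Size([8, 1500])
```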

This table describes the details of the training and development datasets used in the SRE19 evaluation systems. XVEC-1, the first x-vector model, was trained only on the VoxCeleb dataset. XVEC-2 used Mixer 6 and past SRE datasets. XVEC-3 was the full x-vector system, which was trained on both VoxCeleb and previous SRE datasets. The data partitions used in the backend models of the individual submitted systems are indicated in Table 2.

Now let's look at the backend model. The most popular systems in speaker verification use the generative Gaussian PLDA (G-PLDA) as the backend modeling approach. Once the x-vectors are extracted, there is some preprocessing done on them: they are centered, that is, the mean is removed, transformed using LDA, and then unit-length normalized. The PLDA model for the processed x-vector of a particular recording is given by Equation 4:

$$\eta_r = \Phi\,\omega + \epsilon_r \qquad (4)$$

Here, $\eta_r$ is the x-vector for the particular recording $r$, $\omega$ is the latent speaker factor with a standard Gaussian prior, $\Phi$ characterizes the speaker subspace matrix, and $\epsilon_r$ is the Gaussian residual.

Now for the scoring: a pair of x-vectors, one from the enrollment recording, denoted $\eta_e$, and one from the test recording, denoted $\eta_t$, is used with the G-PLDA model to compute the log-likelihood ratio score given in Equation 5:

$$s(\eta_e, \eta_t) = \eta_e^{T} Q\, \eta_e + \eta_t^{T} Q\, \eta_t + 2\,\eta_e^{T} P\, \eta_t + \text{const} \qquad (5)$$

Equation 5 is a quadratic function involving the matrices $P$ and $Q$, which are derived from the PLDA model parameters.
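A minimal sketch of this scoring step, assuming the matrices $P$ and $Q$ have already been derived from the trained G-PLDA parameters:

```python
import numpy as np

def plda_llr(eta_e, eta_t, P, Q, const=0.0):
    """Equation 5: the pairwise G-PLDA log-likelihood ratio score.
    P and Q are matrices derived from the trained PLDA parameters
    (speaker subspace and residual covariance); they are taken as
    given here."""
    return (eta_e @ Q @ eta_e + eta_t @ Q @ eta_t
            + 2.0 * eta_e @ P @ eta_t + const)
```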

Along with the G-PLDA approach, we propose a neural PLDA model, or NPLDA, for backend modeling. What we have here is a pairwise discriminative network. The beige portion of the network corresponds to the enrollment embeddings, and the pink portion of the network corresponds to the test embeddings.

We construct the preprocessing steps of the generative G-PLDA as layers in the neural network: LDA as the first affine layer, unit-length normalization as a nonlinear activation, and then PLDA centering and diagonalization as another affine transformation. The final pairwise scoring, which is given in Equation 5 on the previous slide, is implemented as a quadratic layer. The parameters of this model are optimized using a differentiable approximation of the minimum detection cost function, $C_{\text{min}}$.
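A minimal PyTorch sketch of this NPLDA idea, with illustrative dimensions and a sigmoid-based soft detection cost (a sketch of the approach, not the authors' exact implementation):

```python
import torch
import torch.nn as nn

class NPLDA(nn.Module):
    """Pairwise discriminative backend mirroring the G-PLDA pipeline:
    affine (LDA) -> unit-length norm -> affine (centering/diagonalization)
    -> quadratic scoring layer implementing Equation 5."""
    def __init__(self, in_dim=512, lda_dim=170):   # illustrative dimensions
        super().__init__()
        self.lda = nn.Linear(in_dim, lda_dim)      # first affine layer
        self.diag = nn.Linear(lda_dim, lda_dim)    # centering + diagonalization
        self.P = nn.Parameter(torch.eye(lda_dim))  # quadratic-layer parameters
        self.Q = nn.Parameter(torch.eye(lda_dim))

    def preprocess(self, x):
        x = self.lda(x)
        x = x / x.norm(dim=-1, keepdim=True)       # length-norm as activation
        return self.diag(x)

    def forward(self, enroll, test):
        e, t = self.preprocess(enroll), self.preprocess(test)
        return ((e * (e @ self.Q)).sum(-1) + (t * (t @ self.Q)).sum(-1)
                + 2 * (e * (t @ self.P)).sum(-1))

def soft_detection_cost(scores, labels, beta=99.0, theta=0.0, alpha=10.0):
    """Differentiable approximation of C_Norm: the hard threshold is
    replaced by a sigmoid with slope alpha."""
    p_miss = torch.sigmoid(alpha * (theta - scores[labels == 1])).mean()
    p_fa = torch.sigmoid(alpha * (scores[labels == 0] - theta)).mean()
    return p_miss + beta * p_fa
```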

Now let's look at our submitted systems and the results. The table here shows the details of the seven individual models that we submitted, along with a couple of fusion systems. The best individual system was the combination of XVEC-3, which is the full x-vector extractor, with the proposed NPLDA model. On the SRE18 development set it scored 5.31% EER and 0.32 $C_{\text{min}}$, and its best scores for the SRE19 evaluation set were 4.197% EER and 0.42 $C_{\text{min}}$.

The fusion systems offered some gains over the individual systems.

Overall, the full x-vector system XVEC-3 performs significantly better than the VoxCeleb-only XVEC-1 and the XVEC-2 systems, for any choice of backend. The systems trained with the NPLDA backend show gains over their G-PLDA counterparts, and it is observed that NPLDA models the in-domain and out-of-domain data better than the Gaussian PLDA.

Let's talk about some post-evaluation experiments and analysis. One of the factors that we found we did not handle optimally was calibration. In our previous work for SRE18, we proposed an alternative approach to calibration, where the target and non-target scores were modeled as Gaussian distributions with a shared variance.
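A minimal sketch of that Gaussian calibration idea, fit on a development set of scored trials (a hypothetical helper, not the paper's code); with a shared variance, the resulting score-to-LLR map is linear:

```python
import numpy as np

def fit_gaussian_calibration(scores, labels):
    """Model target and non-target scores as Gaussians with a shared
    variance and return a score -> calibrated-LLR mapping (linear when
    the variance is shared)."""
    tgt, non = scores[labels == 1], scores[labels == 0]
    mu_t, mu_n = tgt.mean(), non.mean()
    var = (((tgt - mu_t) ** 2).sum() + ((non - mu_n) ** 2).sum()) / len(scores)
    a = (mu_t - mu_n) / var
    b = (mu_n ** 2 - mu_t ** 2) / (2 * var)
    return lambda s: a * s + b  # log N(s|mu_t,var) - log N(s|mu_n,var)
```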

As SRE19 did not have an explicitly matched development dataset provided, the aforementioned calibration using the SRE18 development dataset, when applied to SRE19, turned out to be ineffective. This was done for all of our submitted systems, and thus the calibration was not as optimal as we wanted it to be. The graph on the right shows how the SRE18 development and SRE19 evaluation datasets are not matched, and the thresholds chosen accordingly were not optimal for our selected systems.

We performed some normalization techniques to improve the scores. We applied adaptive symmetric score normalization, or AS-norm, using the SRE18 development unlabeled set as the cohort. We achieved a 24% relative improvement for XVEC-1, which is the VoxCeleb x-vector system, and a 21% relative improvement for the full x-vector system XVEC-3, on the SRE18 development set. We got a comparatively lower but consistent improvement of about 14% on average across all of our systems on the SRE19 evaluation set.
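A minimal sketch of the AS-norm computation for one trial, assuming the enrollment-side and test-side cohort scores have already been computed (`top_k` is an illustrative choice):

```python
import numpy as np

def as_norm(raw_score, enroll_cohort, test_cohort, top_k=200):
    """Adaptive symmetric score normalization: normalize each side of the
    trial by the mean/std of its top-k most competitive cohort scores,
    then average the two normalized scores."""
    def top_stats(c):
        top = np.sort(c)[-top_k:]   # k highest cohort scores
        return top.mean(), top.std()
    mu_e, sd_e = top_stats(enroll_cohort)
    mu_t, sd_t = top_stats(test_cohort)
    return 0.5 * ((raw_score - mu_e) / sd_e + (raw_score - mu_t) / sd_t)
```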

The table shows the best values that we obtained for the SRE18 development and SRE19 evaluation sets. We got an EER of 4.7% and a $C_{\text{min}}$ of 0.27 as the best scores for the SRE18 development set, and an EER of 4.51%, a $C_{\text{min}}$ of 0.36, and a $C_{\text{Primary}}$ of 0.49 for the SRE19 evaluation set.

To summarize, we trained x-vector extractors and backend models on different partitions of the available datasets. We also explored a novel discriminative backend model called NPLDA, which is inspired by neural network architectures and the generative Gaussian PLDA model. We observed that the NPLDA consistently outperforms the G-PLDA across these datasets. We discussed the errors that were caused by calibration with mismatched development datasets, and also the significant performance gains that were achieved by using the cohort-based AS-norm adaptive score normalization technique for various systems.

These are some of the references that we used. Thank you.