I am going to present our work on denoising autoencoders in the i-vector space for speaker recognition.

Let me start with the outline of the presentation. First I will cover the motivation and goals of our work. Then I would like to go into the details of the system; in particular I will focus on the denoising autoencoder, and a few words will be said about the back-end and scoring. The next section is dedicated to improvements of the denoising autoencoder, I mean dropout regularisation, which we tried to apply to this technique, and a deep architecture, which will also be considered in this section. Next, the denoising autoencoder based system in the domain mismatch scenario will be presented, and finally I will conclude my presentation.

Okay, let me start with our motivation and goals. Last year we published our work on the application of a denoising autoencoder for the speaker verification task, and this system based on the DAE showed some improvements compared to the commonly used baseline system, I mean PLDA on the raw i-vectors. This motivated us to go further with a detailed investigation. Our goals were to study the proposed denoising autoencoder in the i-vector space; to analyse different strategies for initialisation and training of the PLDA back-end and its parameters; to investigate dropout and to explore the different deep architectures we propose; and to investigate the DAE-based system in domain mismatch conditions.

Now to the dataset and the experimental setup we used in our work. As you can see, as training data we used telephone channel recordings from the NIST SRE corpora. For evaluation we used the NIST SRE 2010 protocol, condition 5 extended. Our results are presented in terms of equal error rate and minimum detection cost function.

And now to our front end and i-vector extractor. As you can see, we used MFCCs together with their first and second derivatives as acoustic features. Our UBM structure was based on DNN posteriors with an eleven-frame context window. We used two thousand one hundred and three phone states, with twenty non-speech states. Instead of using a hard VAD decision, we tried to use a soft one based on the DNN outputs. You can see this in the formula: we tried to apply cepstral mean and variance normalisation in this way, in the statistics space. And you can see that only the triphone states corresponding to speech are used to calculate the sufficient statistics. Finally, four-hundred-dimensional i-vectors were extracted for our first experiments.
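As a rough illustration of this kind of soft decision in the statistics space (the exact formulation is on the slide, not reproduced here): with $\gamma_t(c)$ the DNN posterior of triphone state $c$ at frame $t$, and the statistics accumulated only over the speech states, one could write

$$N_c = \sum_t \gamma_t(c), \qquad \mathbf{F}_c = \sum_t \gamma_t(c)\,\mathbf{x}_t, \qquad c \in \text{speech states},$$

where $\mathbf{x}_t$ is the acoustic feature vector, so that frames the DNN assigns mostly to non-speech states contribute very little weight instead of being removed by a hard VAD threshold.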

A few words about the DAE system and the training procedure. For the denoising transform we used generative pre-training, an RBM trained with the contrastive divergence algorithm. To train the denoising transform we used speaker- and session-dependent i-vectors together with the mean of all i-vectors of the same speaker. We modelled the joint distribution of these i-vectors, and after training we unfolded the RBM and fine-tuned it to obtain a denoising autoencoder.
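A minimal sketch of how such unfolding and fine-tuning could look, assuming PyTorch; the dimensions, hidden size, optimiser and all names here are illustrative placeholders, not the exact configuration used in the work:

import torch
import torch.nn as nn

dim, hidden = 400, 1000                  # i-vector and hidden sizes (assumed)
W   = torch.randn(hidden, dim) * 0.01    # placeholder for RBM weights
b_h = torch.zeros(hidden)                # placeholder for RBM hidden bias
b_v = torch.zeros(dim)                   # placeholder for RBM visible bias

class DAE(nn.Module):
    def __init__(self, W, b_h, b_v):
        super().__init__()
        self.enc = nn.Linear(dim, hidden)
        self.dec = nn.Linear(hidden, dim)
        # "unfolding": encoder and decoder are initialised from the RBM parameters
        self.enc.weight.data.copy_(W)
        self.enc.bias.data.copy_(b_h)
        self.dec.weight.data.copy_(W.t())
        self.dec.bias.data.copy_(b_v)

    def forward(self, x):
        return self.dec(torch.sigmoid(self.enc(x)))

def finetune(dae, session_ivecs, speaker_mean_ivecs, epochs=10, lr=1e-3):
    """Fine-tune the unfolded network so that each session i-vector is mapped
    to the mean i-vector of its speaker (the denoising target)."""
    opt = torch.optim.Adam(dae.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(dae(session_ivecs), speaker_mean_ivecs)
        loss.backward()
        opt.step()
    return dae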

On the next slide I present the systems under consideration. As you can see, we used the conventional PLDA-based system as our baseline, with whitening and length normalisation as pre-processing.
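As a rough sketch of what this whitening plus length-normalisation pre-processing typically looks like (numpy; the function names and details are illustrative, not taken from the paper):

import numpy as np

def estimate_whitening(ivectors):
    """ivectors: (N, d) array. Returns the mean and a whitening matrix."""
    mu = ivectors.mean(axis=0)
    cov = np.cov(ivectors - mu, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    W = eigvec @ np.diag(1.0 / np.sqrt(eigval + 1e-10)) @ eigvec.T
    return mu, W

def whiten_length_norm(ivectors, mu, W):
    """Whiten the i-vectors and project them onto the unit sphere."""
    x = (ivectors - mu) @ W.T
    return x / np.linalg.norm(x, axis=1, keepdims=True)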

The next system is based on the RBM output, also with whitening and length normalisation as pre-processing. And finally, the next system is the DAE-based one; it is the autoencoder fine-tuned from the RBM by the standard fine-tuning procedure.

As for the parameter substitution mentioned here, I will focus on it on my next slides; it turned out to be very important in our system.

For scoring we used the two-covariance model. It can be viewed as a simple case of PLDA, and the score can be expressed in terms of the between-speaker and within-speaker covariance matrices.
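For reference, the two-covariance model score is usually written as a likelihood ratio of the form below, with $\mathbf{B}$ the between-speaker and $\mathbf{W}$ the within-speaker covariance, $\boldsymbol{\mu}$ the global mean and $\mathbf{y}$ the latent speaker mean; this is the standard formulation, not necessarily the exact notation of the paper:

$$s(\mathbf{w}_1,\mathbf{w}_2)=\log\frac{\displaystyle\int \mathcal{N}(\mathbf{w}_1\mid\mathbf{y},\mathbf{W})\,\mathcal{N}(\mathbf{w}_2\mid\mathbf{y},\mathbf{W})\,\mathcal{N}(\mathbf{y}\mid\boldsymbol{\mu},\mathbf{B})\,d\mathbf{y}}{\displaystyle\prod_{i=1,2}\int \mathcal{N}(\mathbf{w}_i\mid\mathbf{y}_i,\mathbf{W})\,\mathcal{N}(\mathbf{y}_i\mid\boldsymbol{\mu},\mathbf{B})\,d\mathbf{y}_i}.$$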

A few words about the parameter substitution. During our experiments we found that the best performance of the DAE-based system is obtained when we substitute the whitening and PLDA back-end parameters estimated from the RBM system into the DAE-based system, the denoising autoencoder based one. It is an empirical finding, but it is very important for this system.
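A minimal sketch of this substitution as a pipeline; all helper functions here (rbm_project, dae_project, train_plda, score_plda and the whitening helpers sketched earlier) are assumed rather than taken from the paper's code:

def build_substituted_backend(train_ivecs, train_labels,
                              rbm_project, dae_project,
                              estimate_whitening, whiten_length_norm,
                              train_plda, score_plda):
    """Estimate whitening and PLDA on the RBM projections, then score trials
    that are projected with the DAE."""
    rbm_out = rbm_project(train_ivecs)
    mu, W = estimate_whitening(rbm_out)          # whitening taken from RBM outputs
    plda = train_plda(whiten_length_norm(rbm_out, mu, W), train_labels)

    def score(enroll_ivec, test_ivec):
        # at test time the i-vectors go through the DAE, but the back-end
        # parameters come from the RBM projections
        e = whiten_length_norm(dae_project(enroll_ivec), mu, W)
        t = whiten_length_norm(dae_project(test_ivec), mu, W)
        return score_plda(plda, e, t)

    return score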

Let me show you our first results with these systems on the NIST SRE 2010 protocol. As you can see, we observed a gain over the baseline system when we applied our DAE-based system with parameter replacement, both for the commonly used NIST SRE 2010 protocol and for our second corpus called RusTelecom; take a look at the results. Some information about the RusTelecom corpus is given on the slide.

For the analysis of the DAE-based system we decided to use a cluster variability criterion, also known as the Fisher criterion. It is based on the within-speaker and between-speaker covariance matrices.
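One common form of such a criterion, assuming $\Sigma_{\mathrm{b}}$ and $\Sigma_{\mathrm{w}}$ denote the between- and within-speaker covariance matrices of the projected i-vectors, is

$$J = \operatorname{tr}\!\left(\Sigma_{\mathrm{w}}^{-1}\,\Sigma_{\mathrm{b}}\right),$$

with larger values indicating stronger speaker-cluster separability; the exact criterion used in the paper may differ in detail.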

If you take a look at this figure, you can see that the autoencoder-based projections have stronger cluster variability. Note that in this case we did not apply any normalisation to our RBM- and DAE-based projections; by normalisation I mean that no whitening was applied to the RBM and DAE outputs.

Additionally, we decided to use cosine scoring as an independent way to assess the properties of our projections.
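Cosine scoring here is simply the cosine similarity between two projected i-vectors,

$$s(\mathbf{w}_1,\mathbf{w}_2)=\frac{\mathbf{w}_1^{\top}\mathbf{w}_2}{\lVert\mathbf{w}_1\rVert\,\lVert\mathbf{w}_2\rVert},$$

which makes it a parameter-free check of how separable the projections are on their own.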

You can see from these results that, without whitening, the DAE-based system achieves the best performance among all the systems. By the way, we also tried a simple autoencoder, to test it in speaker recognition, but it turned out to be not as good as the DAE-based system.

And now to whitening and length normalisation. When we applied these parameters to the RBM- and DAE-based projections, we obtained these results, and you can see that the lines are very similar and close to each other. In this situation we applied the DAE-based whitening, the one estimated for the DAE-based system, and it turned out to be not so good for that system.

Now, on the next slide, we applied the parameter substitution, so we decided to use the whitening parameters from our RBM system, and in this situation we achieved good performance; as you can see, it wins over the baseline. In the figure you can also see that the discriminative properties in this case are stronger for the DAE-based projections.

To summarise all of this, I prepared a table with the combined results, and among the systems the DAE-based system with parameter substitution, I mean the whitening, achieves the best performance.

And now to the PLDA-based scoring. In this table you can see the results we obtained over different experiments, in different configurations of our system, and again, in the last line of the table, you can see that a good improvement can be achieved by using parameter substitution from the RBM system. But the question of why this happens is still open for us; we did not manage to answer it.

Now I will discuss some improvements of the DAE-based system. First, we decided to apply dropout regularisation, both for the RBM training and for the fine-tuning. As you can see, dropout helps to improve the system when we use it in the RBM training stage, but unfortunately applying it to the stage of discriminative fine-tuning was not helpful for us.

For the deep architecture we tried two schemes. You can see the first one on the slide; it is a stacking of autoencoders: after training the first autoencoder, its output is used as the input for the next autoencoder, and then we try to fine-tune everything jointly. But it did not help to improve the system. The second scheme, in which whitening and length normalisation are injected between the layers, did manage to obtain good results, but in this scenario we again need to substitute the whitening parameters from the RBM, the generative pre-trained system, and we got a small improvement from that.
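A minimal sketch of the stacked projection with whitening and length normalisation injected between the layers; dae1, dae2 and the whitening helpers are assumed, following the earlier sketches:

def stacked_projection(ivecs, stats_ivecs, dae1, dae2,
                       estimate_whitening, whiten_length_norm):
    """Pass i-vectors through two DAEs with whitening + length normalisation
    injected between the layers; stats_ivecs are training i-vectors used to
    estimate the intermediate whitening parameters."""
    mu, W = estimate_whitening(dae1(stats_ivecs))
    return dae2(whiten_length_norm(dae1(ivecs), mu, W))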

The next question I would like to focus on is domain mismatch. We investigated our DAE-based system in domain mismatch conditions. We used the Domain Adaptation Challenge dataset and setup. As the back end we used cosine scoring, the two-covariance model, referred to as PLDA, and simplified PLDA with a four-hundred-dimensional speaker subspace, referred to as SPLDA. It should be noted that in our experiments we completely ignore the labels of the in-domain data; we use it only to estimate the whitening parameters of the systems.

Now to the results. For the baseline system, when we use in-domain data for training, we obtain these results for cosine scoring, and you can see the gain from applying the DAE-based system. But when we use out-of-domain data to train our systems, you can see the degradation, both for cosine and for PLDA scoring. And finally, in this table you can see the improvement when we use whitening parameters estimated on the in-domain data. The same results, but for the simplified PLDA scoring, are just a little bit better than PLDA.

Now to conclude. We presented a study of the denoising autoencoder in the i-vector space. We found that the best performance of the DAE-based system is achieved by employing back-end parameters derived directly from the RBM output; the question of why the RBM transform provides better back-end parameters for this setup is still open. Dropout helps to improve the results when applied to the RBM training stage, but it did not help when we used it in the fine-tuning. A deeper architecture in the form of a stacked denoising autoencoder provides further improvements. All our findings regarding the speaker verification system in matched conditions also hold true in the mismatched-condition case. And the last one: using whitening parameters estimated on the target domain along with the system trained on the out-of-domain set allows us to avoid a significant performance gap caused by the domain mismatch.

That's it. Any questions? Michael?

On the slide where you showed the stacked denoising autoencoder, did you try more than two layers?

Yes, but in this case we need to inject whitening and length normalisation between the layers; this one has five layers, with the whitening and length normalisation injection, I mean.

And when you use, let's say, two stacked denoising autoencoders, you improve the results. So did you try a third one? Does stacking more than one on top of another give a better result overall?

I see. Well, we decided not to go deeper in this, because we found that the result was very similar to our first one, based on only a single autoencoder.

Although we have probably already discussed this issue, about your question of why copying the PLDA and length normalisation variables from the RBM rather than from the final stage gives better performance: it could be a sign of overfitting during the back-propagation, since you are using the same set. Maybe, let's say, the residual matrix used in the PLDA becomes artificially small, in terms of its trace, let's say. So you could check the traces of the two matrices, the one you estimate from the RBM and the one you estimate afterwards, to see whether the covariance matrices become too small; that might be a result of overfitting.

Well, this was also our assumption, and we tried to check it; we did the calculations after our paper was submitted, but we found that it was not the reason, because we tried to split our dataset into two parts and use separate data to train the PLDA and the back-end parameters, and the results show that it is not overfitting caused by training the system on the same data. We also tried to explain the situation through a Gaussianity assumption; I mean, after the DAE projection we obtain a more or less Gaussian distribution, and it seemed to us that this could be the explanation, but that is also not the answer.

There is time for another question. Just a question on the first step of your system, which I think is important: you said that you are using twenty non-speech states. I am quite surprised by this large number; could you say something about that?

You mean the large number of non-speech states? We used the standard Kaldi recipe from our speech recognition department; they advised us to use this configuration for the system, and we trained our DNN in this way. It provides good voice activity detection for our system. And, as I mentioned, we also used this capability to apply a soft VAD decision in the statistics space; I mean, we performed cepstral mean and variance normalisation in the statistics space by excluding the non-speech, well, the non-speech states from our consideration.

Let's thank the speaker again, thank you.