Hello, and welcome to my presentation. This is the recorded video for Odyssey 2020.

I am from the Hong Kong University of Science and Technology.

In this video, I would like to introduce our work on orthogonality regularization for end-to-end speaker verification.

For speaker verification tasks, hybrid systems have been the dominant solutions for a long time, for example the i-vector based systems or the x-vector based systems.

A hybrid system usually consists of multiple modules: the speaker embeddings can be extracted with i-vectors or deep neural networks, and a separate scoring function is commonly built with a PLDA classifier.

In the hybrid systems, each module is optimized with respect to its own objective function, and these objectives are usually not consistent with each other.

Moreover, speaker verification is an open-set problem: the system has to handle unknown speakers in the evaluation stage, so the generalization ability of the system is very important.

Recently, more and more speaker verification systems are trained in an end-to-end manner. An end-to-end system maps the test utterance and the enrollment utterances directly to a single score.

This simplifies the training pipeline, the whole system is optimized in a more consistent manner, and it also enables learning a task-specific metric during the training stage.

Various loss functions have been proposed for end-to-end systems, for example the triplet loss and the generalized end-to-end loss.

The core idea of the loss functions in end-to-end systems is to minimize the distance between embeddings from the same speaker and maximize the distance between embeddings from different speakers. Most of these loss functions use the cosine similarity, that is, the cosine of the angle between two embedding vectors, to measure the distance between the two embeddings.

The major underlying assumption for the effectiveness of the cosine similarity measurement is that the embedding space is orthogonal, which is not guaranteed during training.

In this work, we explore orthogonality regularization in end-to-end speaker verification systems.

Our systems are trained with the generalized end-to-end loss, and we propose two regularizers. The first one is called soft orthogonality regularization; the second one is called spectral restricted isometry property regularization. The two proposed regularizers are evaluated on two different end-to-end network structures: an LSTM-based one and a time-delay neural network (TDNN) based one.

First, I would like to briefly introduce the generalized end-to-end (GE2E) loss used in our end-to-end systems. One mini-batch consists of N speakers and M utterances from each speaker; that means we have N x M utterances in total in one mini-batch.

Here x_ij represents the acoustic features computed from utterance j of speaker i. For each input feature x_ij, the network produces a corresponding embedding vector e_ij.

We can then compute the centroid c_i of the embedding vectors from speaker i by averaging its embedding vectors.

Then we define the similarity matrix S_ij,k as the scaled cosine similarity between each embedding vector and each of the centroids, that is, S_ij,k = w * cos(e_ij, c_k) + b, the similarity of speaker embedding e_ij to the speaker centroid c_k. Here w and b are trainable parameters, and w is constrained to be positive so that the similarity is larger when the cosine similarity is larger.

During training, we want each utterance embedding e_ij to be close to its own speaker's centroid and far away from the other speakers' centroids, so we apply a softmax on S_ij,k over all possible k and obtain this loss function.

The final generalized end-to-end loss is the summation of the losses over all embedding vectors. In brief, the generalized end-to-end loss pushes each embedding towards the centroid of its true speaker and away from the centroid of the most similar different speaker.
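As a rough illustration, here is a minimal PyTorch sketch of the GE2E loss as described above. The function name and the use of a single shared centroid per speaker are my own simplifications (the original GE2E formulation excludes e_ij from its own speaker's centroid for the positive term), so this is a sketch rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def ge2e_loss(embeddings, w, b):
    """GE2E loss sketch. embeddings: (N, M, D) = N speakers x M utterances x D dims;
    w, b: trainable scalar tensors, with w kept positive."""
    N, M, _ = embeddings.shape
    e = F.normalize(embeddings, dim=-1)              # unit-length embeddings e_ij
    centroids = F.normalize(e.mean(dim=1), dim=-1)   # speaker centroids c_k
    cos = torch.einsum('nmd,kd->nmk', e, centroids)  # cosine similarity to every centroid
    S = torch.clamp(w, min=1e-6) * cos + b           # scaled similarity matrix S_ij,k
    target = torch.arange(N).unsqueeze(1).expand(N, M)  # true speaker index for each e_ij
    # softmax over the centroids, then averaged over all N*M embeddings
    return F.cross_entropy(S.reshape(N * M, N), target.reshape(N * M))
```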

We introduce two orthogonality regularizers into the end-to-end systems. The first one is called soft orthogonality (SO) regularization. Suppose we have a fully connected layer with a weight matrix W; the soft orthogonality regularization is defined as L_SO(W) = lambda * ||W^T W - I||_F^2.

Here lambda is a regularization coefficient, and the norm is the Frobenius norm. This soft orthogonality regularization term requires the Gram matrix of W to be close to the identity.

Since the gradient of this soft orthogonality regularization term with respect to the weight matrix can be computed in a stable form, the regularization term can be directly added to the end-to-end loss and optimized together with it.
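A minimal sketch of this soft orthogonality term, assuming the weight matrix is stored in the torch.nn.Linear convention (out_features x in_features); the names are illustrative, not the authors' code.

```python
import torch

def soft_orthogonality(W, lam):
    """SO penalty: lam * ||W^T W - I||_F^2, pushing the Gram matrix towards identity."""
    gram = W.t() @ W                                   # Gram matrix of the weight matrix
    eye = torch.eye(gram.size(0), device=W.device, dtype=W.dtype)
    return lam * torch.norm(gram - eye, p='fro') ** 2  # squared Frobenius norm penalty
```

Because this penalty is differentiable everywhere, it can simply be added to the end-to-end loss before backpropagation, as noted above.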

The second one is called the spectral restricted isometry property (SRIP) regularization. The restricted isometry property characterizes matrices that are nearly orthonormal, and this regularization term is derived from that property. For a weight matrix W, the SRIP regularization is formulated as L_SRIP(W) = lambda * sigma(W^T W - I).

Here lambda is again a regularization coefficient, and sigma denotes the spectral norm, that is, the largest singular value of the matrix.

This SRIP regularization term requires the largest singular value of W^T W - I to be small, i.e., the Gram matrix to be close to the identity; this in turn requires all the singular values of W to be close to one.

Evaluated in this exact form, the SRIP regularization term requires an eigenvalue decomposition, which would result in unstable gradients.

So in practice we use a technique called power iteration to approximate the spectral norm computation. In our experiments, we simply initialize the starting vector randomly and repeat the above iterative process two times.
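Here is a sketch of the SRIP term with the power-iteration approximation of the spectral norm, under the same weight-matrix convention as the previous sketch. The two-iteration count follows the talk; the random initialization details and the small epsilon are assumptions on my part.

```python
import torch

def srip(W, lam, n_iter=2):
    """SRIP penalty: lam * sigma(W^T W - I), where sigma is the spectral norm
    (largest singular value), approximated with a few power-iteration steps."""
    A = W.t() @ W
    A = A - torch.eye(A.size(0), device=W.device, dtype=W.dtype)
    v = torch.randn(A.size(1), device=W.device, dtype=W.dtype)  # random start vector
    for _ in range(n_iter):                                     # two iterations, as in the talk
        v = A @ v
        v = v / (v.norm() + 1e-12)
    return lam * (A @ v).norm()   # ||A v|| estimates the largest singular value of A
```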

Lambda is the regularization coefficient for both regularization terms. The choice of the regularization coefficient plays an important role in the training process as well as in the final system performance, so we investigated two different schedules. The first schedule keeps a constant coefficient throughout the training stage. The second schedule starts with lambda equal to 0.2 and then gradually reduces it to zero during the training stage.
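A small sketch of the two coefficient schedules. The exact constant value and the linear decay shape are assumptions on my part; the talk only specifies that the decreasing scheme starts at 0.2 and ends at zero.

```python
def reg_coefficient(epoch, total_epochs, scheme="decreasing",
                    constant_value=0.1, start_value=0.2):
    """Return the regularization coefficient lambda for a given epoch.
    'constant' keeps one value throughout training (value assumed here);
    'decreasing' decays from start_value down to zero (linear decay assumed)."""
    if scheme == "constant":
        return constant_value
    return max(0.0, start_value * (1.0 - epoch / float(total_epochs)))
```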

We explored two different types of neural networks: the first one is an LSTM-based system, and the second one is a TDNN-based system. The LSTM system contains three LSTM layers with projection; each LSTM layer has 768 hidden nodes, and the projection size is set to 256.

After processing the whole input utterance, the last-frame output of the last LSTM layer is taken as the representation of the whole utterance. For the TDNN system, we use the same structure as the Kaldi x-vector model. All the embedding vectors are computed as the L2 normalization of the network output.
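For concreteness, a minimal PyTorch sketch of the LSTM branch as described: three LSTM layers with 768 hidden units and a 256-dimensional projection, with the last-frame output L2-normalized to form the utterance embedding. The 40-dimensional input feature size and all names are assumptions, not the authors' exact configuration; the TDNN branch would follow the Kaldi x-vector topology and is not sketched here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LstmSpeakerEncoder(nn.Module):
    """Sketch of the LSTM-based embedding network described in the talk."""
    def __init__(self, feat_dim=40):
        super().__init__()
        # 3 LSTM layers, 768 hidden units, projected down to 256 dimensions per layer
        self.lstm = nn.LSTM(input_size=feat_dim, hidden_size=768,
                            num_layers=3, proj_size=256, batch_first=True)

    def forward(self, x):                      # x: (batch, frames, feat_dim)
        out, _ = self.lstm(x)                  # out: (batch, frames, 256)
        last_frame = out[:, -1, :]             # last-frame output of the top layer
        return F.normalize(last_frame, dim=-1) # L2-normalized speaker embedding
```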

Our experiments are conducted on the VoxCeleb 1 corpus. In each mini-batch we use 64 speakers and several segments per speaker, a setting chosen with our GPU memory capacity in mind, and the segment lengths are randomly sampled from 140 to 180 frames.

Although the orthogonality regularization can be applied to all the layers, in this work we only apply the orthogonality constraint to the weight matrix of the speaker embedding projection layer.

Here are the results of the LSTM-based systems. In the condition with no regularizer, we do not add any orthogonality regularization term during the training stage; this serves as the baseline.

From the results, we can see that both regularization terms improve the system performance, and the SRIP regularization outperforms the soft orthogonality regularization as well as the baseline with remarkable gains: around 20% improvement in EER, and the same holds for the minDCF metrics. The decreasing lambda schedule outperforms the constant schedule for both regularizers.

We also show the DET curves for the baseline and for the best LSTM system, which is trained with the SRIP regularization and the decreasing schedule. In this figure, we can see that the orthogonality regularization really helps to improve the system.

Here are the results of the TDNN-based systems. In this case, the two regularization terms are actually comparable in performance.

For the soft orthogonality regularization term, the best system is around 9% better in EER and 16% better in minDCF than the baseline, and the SO regularization is beneficial only when trained with the decreasing lambda schedule. The best SRIP system is 12% better in EER and 18% better in minDCF.

So the SRIP regularization is consistent in performance when trained with the different lambda schemes.

Here we plot the DET curves for the baseline and for the TDNN systems trained with the two regularizers and the decreasing schedule.

To explore the effect of the orthogonality regularization during training, we plot the validation loss curves. Here is an example of the validation loss curves during the training of the LSTM-based systems.

Just notice that the actual number of training epochs is different for the different systems. This is because we set the maximum number of training epochs to 100 and stop training if the validation loss does not decrease for six consecutive epochs.

From the loss curves we can see that all the regularizers accelerate the training process in the early training stage and attain an overall lower loss throughout the training compared to the baseline.

In general, the SRIP regularization achieves a lower validation loss than the soft orthogonality regularization, and this finding is consistent with the system performance, where in general the SRIP regularization is better than the SO regularization.

For both regularizers, training with a constant lambda results in more training epochs and also a lower final loss. This is different from the findings on the system performance, where, according to the final results, training with the decreasing schedule always gives better performance.

One possible reason is that in the final training stage, the model parameters are more likely already near a good operating point, so keeping a constant regularization strength throughout the training would be over-restrictive at this stage.

By decreasing the coefficient, we loosen the orthogonality constraint, and the model parameters have more flexibility in the final stage, thus leading to better system performance.

In conclusion, we introduced two orthogonality regularizers into end-to-end text-independent speaker verification systems. The first one is the soft orthogonality regularization; it requires the Gram matrix of the weight matrix to be close to the identity. The second one is the SRIP regularization; it minimizes the largest singular value of the Gram matrix minus the identity, based on the restricted isometry property.

Two different neural network architectures, the LSTM and the TDNN, were investigated. We also tried different regularization coefficient schedules and investigated their effect on the training process as well as on the evaluation performance.

We find that the spectral restricted isometry property regularization performs the best in all the cases, achieving in the best case around 20% improvement on all the criteria. Both regularizers can be combined with the original training loss and optimized together with little computational overhead.

That is all of my presentation. Thank you for listening, and you are welcome to ask any questions about our work. Thank you.