
Hello, now I'm going to present information preservation pooling for speaker embedding. My name is [inaudible] and I'm from [inaudible].

"'kay" designed the contents of my

presentation first

introduce you briefly about the speaker recognition task

and then

I will explain the preliminary works which are related to my research.

Then I will explain my proposed method,

then I will show you the experimental settings and the results,

and finally I will conclude my presentation.

okay

As you can see, this is a general embedding-based speaker recognition system.

The first component is the frame-level network. It is usually implemented with, for example, a time-delay neural network.

so

It takes frame-level acoustic features, which can be MFCCs or spectrograms, as input, and outputs frame-level representations.

The next component is the pooling layer.

It aggregates the frame-level outputs; the most widely used method is statistics pooling, which computes the mean and standard deviation vectors of the frame-level features.

The pooling layer takes the frame-level outputs from the frame-level network and produces a single fixed-dimensional vector.

This is important because it makes a fixed-dimensional vector from the variable-length frame-level outputs.
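
As a concrete illustration of this pooling step, here is a minimal sketch of statistics pooling in NumPy; the array shapes and the 512-dimensional feature size are placeholders of mine, not values from the talk.

```python
import numpy as np

def statistics_pooling(frame_features):
    """Statistics pooling: collapse a variable-length frame sequence into one vector.

    frame_features: array of shape (num_frames, feat_dim), e.g. the frame-level
    network outputs for one utterance. Returns a fixed vector of shape
    (2 * feat_dim,) holding the per-dimension mean and standard deviation.
    """
    mean = frame_features.mean(axis=0)
    std = frame_features.std(axis=0)
    return np.concatenate([mean, std])

# Utterances of different lengths map to vectors of the same, fixed size.
short_utt = np.random.randn(120, 512)   # 120 frames
long_utt = np.random.randn(750, 512)    # 750 frames
assert statistics_pooling(short_utt).shape == statistics_pooling(long_utt).shape
```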

The last component is the speaker classifier. What it does is classify the speakers from the speaker embedding.

well

It helps the network to learn speaker-discriminative features.

It is only used for training, because in the verification scenario the test set contains unseen speakers.

So when testing the system, you use another scoring metric, like cosine similarity or PLDA scoring.
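
To make the test-time scoring concrete, here is a minimal sketch of cosine-similarity scoring between two speaker embeddings; the decision threshold is a placeholder I chose for illustration and would be tuned on a development set.

```python
import numpy as np

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(emb_a, emb_b) /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b) + 1e-12))

def same_speaker(emb_a, emb_b, threshold=0.5):
    """Accept the verification trial if the score exceeds the threshold.

    The 0.5 default is only a placeholder; in practice the threshold is
    tuned on a development set (PLDA scoring would replace this function).
    """
    return cosine_score(emb_a, emb_b) >= threshold
```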

This is the x-vector baseline system.

As you can see, it is made up of a frame-level network, a pooling layer, and segment-level networks.

so

MFCCs are usually used as the input features of the network,

and the first five layers are a time-delay neural network, which works at the frame level.

Then the pooling layer aggregates the frame-level representations.

There are additional hidden layers which operate at the segment level, and the last layer is a softmax output layer, which predicts the speakers.
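
As an illustration only, here is a rough TensorFlow/Keras sketch of an x-vector-style network of this shape (the talk later mentions a TensorFlow implementation); the layer widths, kernel sizes, and number of speakers are assumptions of mine and not the exact configuration from the talk.

```python
import tensorflow as tf

def stats_pool(frames):
    # Statistics pooling: per-dimension mean and standard deviation over time.
    mean = tf.reduce_mean(frames, axis=1)
    std = tf.math.reduce_std(frames, axis=1)
    return tf.concat([mean, std], axis=-1)

def build_xvector_like(num_speakers=1000, feat_dim=30):
    inp = tf.keras.Input(shape=(None, feat_dim))        # variable-length MFCC sequence
    x = inp
    # Five frame-level TDNN layers, realised here as dilated 1-D convolutions.
    for filters, kernel, dilation in [(512, 5, 1), (512, 3, 2), (512, 3, 3),
                                      (512, 1, 1), (1536, 1, 1)]:
        x = tf.keras.layers.Conv1D(filters, kernel, dilation_rate=dilation,
                                   activation="relu", padding="same")(x)
    pooled = tf.keras.layers.Lambda(stats_pool)(x)      # utterance-level vector
    # Segment-level hidden layers; their outputs can serve as speaker embeddings.
    seg1 = tf.keras.layers.Dense(512, activation="relu")(pooled)
    seg2 = tf.keras.layers.Dense(512, activation="relu")(seg1)
    out = tf.keras.layers.Dense(num_speakers, activation="softmax")(seg2)
    return tf.keras.Model(inp, out)
```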

Now I'm going to introduce the mutual information estimation technique.

Mutual information is a measure of the mutual dependency between two random variables.

so

Mutual information can be viewed as the Kullback-Leibler divergence between the joint distribution and the product of the marginals of the two random variables.

The Donsker-Varadhan representation of the KL divergence is the key element of the mutual information neural estimator, which I will explain later.

so

The following theorem gives a useful representation, which is called the Donsker-Varadhan representation. It gives a lower bound on the mutual information.
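
Written out, the definition and the bound look as follows (these are the standard formulations, reproduced here for reference since the slide equations are not in the transcript):

```latex
% Mutual information as the KL divergence between the joint and the product of marginals
I(X;Z) = D_{\mathrm{KL}}\!\left(P_{XZ}\,\|\,P_X \otimes P_Z\right)

% Donsker--Varadhan representation: any function T gives a lower bound
I(X;Z) \ge \sup_{T}\; \mathbb{E}_{P_{XZ}}\!\left[T(x,z)\right]
          - \log \mathbb{E}_{P_X \otimes P_Z}\!\left[e^{T(x,z)}\right]
```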

Next is the mutual information neural estimator, which is called MINE.

so

The idea of MINE is to model the function T, I mean to parameterize it, with a deep neural network with parameters omega.

so

This network estimates the mutual information using the Donsker-Varadhan representation of the mutual information.

Using MINE, you can do the mutual information estimation and maximization together.

The goal is to maximize and estimate the mutual information between the input and output pairs of the encoder E_phi, which is a neural network with parameters phi.

so

The key is to rely on a sampling strategy, making positive and negative examples drawn from the joint distribution and the product of the marginal distributions, respectively.

In general, positive samples are collected from the same utterance, in this case the same utterance, or in the image domain it can be the same image, while the negative samples are obtained from another randomly sampled utterance or image.

so

and is

we can optimize the richer information

so i mean though estimate and

maximize together because

the don't score does care about and reputation was the lower bound

sorry when you maximize the

the top line so it can estimate and

maximize the richer information at the same time

So this is the MINE objective. It is derived directly from the Donsker-Varadhan representation.
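
A minimal TensorFlow sketch of this estimator might look like the following; the statistics network size and the in-batch shuffling used to form the negative (marginal) pairs are my own illustrative assumptions, not the exact setup of the talk.

```python
import tensorflow as tf

class StatisticsNetwork(tf.keras.Model):
    """T_omega(x, z): scores a pair; joint pairs should get higher scores
    than pairs built from the product of marginals."""
    def __init__(self, hidden=256):
        super().__init__()
        self.hidden = tf.keras.layers.Dense(hidden, activation="relu")
        self.score = tf.keras.layers.Dense(1)

    def call(self, inputs):
        x, z = inputs
        return self.score(self.hidden(tf.concat([x, z], axis=-1)))

def mine_lower_bound(t_net, x, z):
    """Donsker-Varadhan lower bound on I(X; Z) for one mini-batch.

    Joint (positive) pairs: (x_i, z_i) from the same example.
    Marginal (negative) pairs: z re-matched by shuffling within the batch.
    Maximizing this quantity estimates and increases the mutual information.
    """
    idx = tf.random.shuffle(tf.range(tf.shape(z)[0]))
    z_neg = tf.gather(z, idx)
    t_joint = t_net([x, z])
    t_marginal = t_net([x, z_neg])
    return tf.reduce_mean(t_joint) - tf.math.log(
        tf.reduce_mean(tf.exp(t_marginal)) + 1e-8)
```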

so

Now I want to explain my proposed method.

so

This is information preservation pooling.

The idea of information preservation pooling, which I will call IPP, is to prevent information loss in the pooling stage.

To achieve this, I use MINE to regularize the utterance-level features to have high mutual information with the frame-level features.

so

What I emphasize here is this: when the frame-level features and the utterance-level features are extracted from the same input utterance, the pair is treated as a sample from the joint distribution. On the other hand, when a pair of frame-level features and utterance-level features is extracted from different utterances, it is treated as a sample from the product of the marginals.

so

In information preservation pooling, I suggest two different ways to use MINE. One is global mutual information maximization, which I call GIM, and the second one is local mutual information maximization, which is LIM.

so

The difference between them is what is used as the input to MINE. In GIM, to model the information shared across the frame-level features, I apply MINE to maximize the mutual information between all the frame-level features and the utterance-level feature. So the two random variables for MINE are the sequence of frame-level features, which I denote H, that is, the whole frame-level sequence, and the utterance-level feature w, which is the output of the pooling module.

In local mutual information maximization, the difference is the following: in GIM, only a few frame-level features may already be enough to decide whether a pair comes from the positive or the negative samples, so some useful information in the individual frame-level features can be ignored. I suggest LIM to prevent this. LIM maximizes the mutual information between a single frame-level feature and the utterance-level feature, so the total loss is, for every frame, the mutual information between that single frame-level feature and the utterance-level feature.
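
In formula form, with H = (h_1, ..., h_T) denoting the frame-level sequence and w the utterance-level feature (the symbols, and the averaging over frames in the local term, are my reconstruction of the slides), the two objectives are:

```latex
% Global MI maximization: the whole frame-level sequence vs. the utterance-level feature
\mathcal{L}_{\mathrm{GIM}} = \widehat{I}_{\omega}\!\left(H;\, w\right),
\qquad H = (h_1, \ldots, h_T)

% Local MI maximization: every single frame-level feature vs. the utterance-level feature
\mathcal{L}_{\mathrm{LIM}} = \frac{1}{T}\sum_{t=1}^{T} \widehat{I}_{\omega}\!\left(h_t;\, w\right)
```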

So this is the whole information preservation pooling architecture.

so

GIM and LIM can be applied together when training the speaker embedding system, so the IPP losses are optimized jointly with the conventional speaker classification loss during training. In this case I used the softmax cross-entropy loss as the speaker classification loss.

So the first term is the speaker classification loss, which is the softmax cross-entropy, and the second and third terms are the global and local MINE objectives.

You can see this figure to understand my architecture.
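
Putting the three terms together, the joint training objective has roughly the following shape; the weights lambda_G and lambda_L correspond to the hyperparameters swept later in the experiments, and the exact form is my reconstruction of the slide:

```latex
% Joint objective: speaker classification loss plus the two (maximized) MINE terms
\mathcal{L}_{\mathrm{total}}
  = \mathcal{L}_{\mathrm{softmax}}
  - \lambda_{G}\,\mathcal{L}_{\mathrm{GIM}}
  - \lambda_{L}\,\mathcal{L}_{\mathrm{LIM}}
```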

"'kay" this is so experiment is settings

so i used a

most commonly is dataset me too so the

one and two

The input features were 30-dimensional MFCCs, extracted with a 25-millisecond Hamming window and a 10-millisecond shift.

so

During training, each utterance was cropped to a 2.5-second segment, which was done to give the input batch a fixed dimension.

Mean and variance normalization was applied to the extracted MFCCs, and I used no voice activity detection or automatic silence removal of any kind. Data augmentation was not used either.
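
For illustration, here is roughly how such features could be prepared with librosa; the function and parameter names follow librosa's public API, but treat this as a sketch of the described configuration (30 MFCCs, 25 ms Hamming window, 10 ms shift, per-utterance mean/variance normalization, 2.5 s crops), not the code actually used.

```python
import numpy as np
import librosa

def extract_mfcc(wav_path, sr=16000, num_ceps=30):
    """30-dim MFCCs with a 25 ms Hamming window and a 10 ms shift, then CMVN."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=num_ceps,
        win_length=int(0.025 * sr),   # 25 ms window
        hop_length=int(0.010 * sr),   # 10 ms shift
        window="hamming")
    mfcc = mfcc.T                     # (num_frames, num_ceps)
    # Per-utterance mean and variance normalization; no VAD or augmentation.
    return (mfcc - mfcc.mean(axis=0)) / (mfcc.std(axis=0) + 1e-8)

def random_crop(mfcc, seconds=2.5, frame_shift=0.010):
    """Crop a fixed 2.5-second chunk so batches have a fixed time dimension.

    Assumes the utterance is longer than the crop length.
    """
    num_frames = int(seconds / frame_shift)
    start = np.random.randint(0, mfcc.shape[0] - num_frames + 1)
    return mfcc[start:start + num_frames]
```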

so

The network configuration is like this.

For the pooling layer I used statistics pooling, which is the most commonly used one.

The feature dimension was too large for the MINE network, because the last frame-level output is 1,536-dimensional, so I added a dimension-reduction layer to the network to make the dimension lower.

These are the training details. The batch size was 128, and to make the input for the MINE network, the segment-level feature was concatenated to the frame-level feature along the feature dimension.

For the optimization, the initial learning rate was 10^-3 and it was exponentially decayed at every epoch, so that the final learning rate was 10^-5.
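
A sketch of that schedule with tf.keras; only the initial and final learning rates come from the talk, while the optimizer choice (Adam), the number of epochs, and the steps per epoch are placeholders of mine.

```python
import tensorflow as tf

STEPS_PER_EPOCH = 1000   # placeholder: depends on dataset size and batch size
NUM_EPOCHS = 100         # placeholder: not stated in the talk

# Exponential decay applied once per epoch, going from 1e-3 down to about 1e-5 overall.
decay_rate = (1e-5 / 1e-3) ** (1.0 / NUM_EPOCHS)
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3,
    decay_steps=STEPS_PER_EPOCH,   # decay once per epoch
    decay_rate=decay_rate,
    staircase=True)

# The talk does not name the optimizer; Adam is used here purely as an example.
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)
```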

The whole neural network implementation was done using TensorFlow.

For the back-end scoring metric, I used cosine similarity and PLDA.

This is for the cosine similarity and this is for PLDA. When using cosine similarity, the output of the last hidden layer was used as the speaker embedding, because its performance was higher, while when using PLDA the output of the second-to-last hidden layer was used.

so

Before the PLDA training, LDA was applied to reduce the speaker embedding dimension to 200, and it was followed by length normalization and whitening.
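
A rough sketch of that back-end preprocessing chain with scikit-learn; the class and function names are scikit-learn's and NumPy's, but whether this library was used, and details such as the whitening estimate, are my assumptions.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def length_normalize(x):
    """Project embeddings onto the unit sphere."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)

def train_backend_preprocessing(embeddings, speaker_labels, lda_dim=200):
    """LDA to 200 dims, then length normalization and whitening, before PLDA."""
    lda = LinearDiscriminantAnalysis(n_components=lda_dim)
    lda.fit(embeddings, speaker_labels)

    normed = length_normalize(lda.transform(embeddings))
    mean = normed.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(normed - mean, rowvar=False))
    whiten = eigvecs / np.sqrt(eigvals + 1e-8)   # per-direction variance scaling

    def transform(x):
        x = length_normalize(lda.transform(np.atleast_2d(x)))
        return (x - mean) @ whiten               # centered, decorrelated embeddings
    return transform
```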

okay

These are the experimental results. First, I examined the GIM-only case and the LIM-only case.

The left table is for the GIM-only case.

The best performance, with PLDA, was five point [inaudible] percent EER, while in the x-vector baseline system it was 5.66 percent, so it showed a better performance than the baseline system.

so

The right table is for the LIM-only case.

The best performance, with PLDA, was 5.18 percent, while in the baseline system it was 5.66 percent, so it also showed a better performance than the baseline system.

This is the result when applying the whole IPP.

I examined various hyperparameter settings, I mean the weights for the GIM and LIM objectives, and the best case was when the weight for GIM was 0.01 and the weight for LIM was 0.1.

This shows the results when using the cosine similarity: the best was 6.14 percent, so it showed a better performance than the x-vector baseline system, which was six point seven [inaudible] percent.

Okay, so I took the best-case hyperparameter settings and applied them to the VoxCeleb 2 dataset. I trained the system with VoxCeleb 2 and evaluated it on the same test set as before, which was the VoxCeleb 1 test set.

so

The performance was much better. In the best case, using PLDA, it was 3.09 percent EER, so it showed a better performance than the baseline, which was 3.62 percent, roughly a fifteen percent relative improvement. Overall, the performance was better

in terms of the EER.

So, with this measure, the new methods showed a better performance in every case.

It shows that MINE is very helpful for learning the right features, preserving more speaker-relevant information, when training the speaker embedding system.

so

In our future research, we should experiment more with other pooling methods besides the statistics pooling which I used, and another future research direction may be to combine the proposed method with other losses.

so

Thank you for listening to my presentation, and if you have any questions you can just email me; my email address is in the paper.
