Speech Transcript - Bottleneck Features for Speaker Recognition

OK so

my name is Sibel Yaman

today I will talk about my work

with my colleagues

this is the outline of my talk on speaker recognition

I'll start with an introduction

I'll give you a big picture of speaker recognition

I'm also presenting an overview of our

methodology then I'll talk about the details

the next picture shows an example

there are two key ideas and the first one is conversation level training

and the second one is incorporation of a separate system in discriminative training

I will report the experimental results and conclude with a summary

OK we listened to a keynote speaker this morning in the speech recognition literature

these networks are reported to outperform HMMs

however in speaker recognition we have not got there yet but

we are improving our results every day

here is the bottleneck

architecture that we used we have an input layer

from l norm at speeching point five long of speech

these are raw features that means there are no deltas and delta-deltas

fourteen dimensional MFCC features

they are fully connected to a layer of

about a thousand hidden nodes

which are again connected to a narrow layer of hidden nodes which acts as a bottleneck layer

they are connected to another hidden layer and finally they are connected to the speaker layer

the input feature statistics

are fed into the network

and they are passed to the bottleneck layer

in the back-propagation mode the

speaker information

is propagated to the lower layers

and that is how we create the road map

this kind of network strategy

was studied

in 1998

by Konig and Heck

let me tell you what happened before

if we just use this bottleneck to do that

it doesn't work

this uses some standard

we compare

what we call traditional

bottleneck features

shown in red color

the blue color shows

the MFCC baseline system

let's just keep these numbers in mind

and when we compare

performance in terms of equal error rate

we see

a 40% degradation

on the same microphone test set in NIST 2010

and about 45%

degradation on the different microphone task
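
For readers less familiar with the metric, the equal error rate (EER) and a relative change between two systems can be computed roughly as follows. This is a minimal sketch, not the scoring tool used in the talk. The function names are hypothetical, and the talk does not give absolute EER values, so any numbers passed in are illustrative only.

```python
import numpy as np


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    # Miss rate: fraction of target trials scored below the threshold.
    miss = np.array([(target_scores < t).mean() for t in thresholds])
    # False-alarm rate: fraction of non-target trials scored at or above it.
    false_alarm = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    # The EER is taken where the two error rates are closest to each other.
    best = np.argmin(np.abs(miss - false_alarm))
    return 0.5 * (miss[best] + false_alarm[best])


def relative_change(baseline_eer, system_eer):
    """Relative change of a system's EER with respect to a baseline EER."""
    return (system_eer - baseline_eer) / baseline_eer
```

With made-up values, `relative_change(5.0, 7.0)` evaluates to 0.4, which is how a "40%" relative figure of the kind quoted in the talk would be obtained.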

so we need techniques that will improve the performance from where it is

here is an overview of our methodology

we have MFCC features

one way is we could do some linear transformation

and obtain deltas and delta-deltas

as higher order features

but what we want to do is

we want to perform a nonlinear transformation on these

features so that when we

choose some sort of data

we just transform

MFCC features into robust features

in other words

we identified

two ways of using a deep belief network to do that

the first one is

we can change the training algorithm in such a way that

it suits the speaker recognition application better

the second one is that almost all speaker

recognition systems deal with system combination

so

we explored if there is a way

to use a separate system in training

next I will talk about some details of these

ideas

first of all the problem in frame level training is that

learning the speaker information is constrained to the short

context around the current frame

even if you increase the context

in our system the conversation level training algorithm

is the solution to these problems

first of all if the training is conversation level

it will be making one single decision per recording

which should be the case

and another advantage could be that

there are several ways of doing this to explore

I will explain that later

so the first key idea is

using a speaker recognition training criterion

so the speaker recognition training criterion is

the log-likelihood-ratio based training criterion

developed by Brümmer

it is a weighted sum of costs

over the target trials plus the non-target trials

we use this kind of objective

to find the targets

and I will remind you that

the upper layer is a speaker layer
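
The log-likelihood-ratio cost usually attributed to Brümmer (often written Cllr) averages a logarithmic penalty over target trials and over non-target trials and weights the two halves equally. The sketch below shows that standard definition. Whether the talk used exactly this form, or a differently weighted variant, is not recoverable from the transcript, so treat it as an assumption; the function name `cllr` is hypothetical.

```python
import numpy as np


def cllr(target_llrs, nontarget_llrs):
    """Log-likelihood-ratio cost in bits.

    0 means perfect scores; 1 is what a system that always outputs
    a log-likelihood ratio of 0 would obtain.
    """
    target_llrs = np.asarray(target_llrs, dtype=float)
    nontarget_llrs = np.asarray(nontarget_llrs, dtype=float)
    # np.logaddexp(0, x) evaluates log(1 + exp(x)) in a numerically stable way.
    target_cost = np.mean(np.logaddexp(0.0, -target_llrs))
    nontarget_cost = np.mean(np.logaddexp(0.0, nontarget_llrs))
    # Equal weight on both trial types, converted from nats to bits.
    return 0.5 * (target_cost + nontarget_cost) / np.log(2.0)
```

Target trials are penalized when their score is low and non-target trials when their score is high, which is why the objective rewards scores that behave like calibrated log-likelihood ratios.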

as I mentioned earlier for each recording we should make one decision

there are several ways of doing that

make a decision on one frame

make a decision on another frame of the recording

we took another path here what we do is

the scores are averaged at the output layer before the nonlinearity

which means that for each frame of the recording we take the statistics and

sum them and average them
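
Averaging at the output layer before the nonlinearity can be sketched as follows. This is an illustration under assumptions: the output nonlinearity is taken to be a softmax over speakers, which the transcript does not state, and the function names are hypothetical.

```python
import numpy as np


def softmax(activations):
    """Numerically stable softmax over a 1-D vector."""
    shifted = activations - np.max(activations)
    exponentials = np.exp(shifted)
    return exponentials / exponentials.sum()


def conversation_posteriors(frame_activations):
    """Turn per-frame output activations into one decision per recording.

    frame_activations has shape (num_frames, num_speakers) and holds the
    output-layer values before the nonlinearity.
    """
    # Sum over frames and divide by the frame count, before the nonlinearity.
    mean_activation = np.mean(frame_activations, axis=0)
    # Apply the (assumed) softmax once, to the averaged activations.
    return softmax(mean_activation)
```

The result is a single vector per recording, so the training criterion sees one decision per recording rather than one per frame.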

as I mentioned earlier the second key idea of my methodology is

using a separate system in training

this diagram here shows the top layer of the network

as before we have a BN score generation scheme here

and we have a standard MFCC system

but since we are using a likelihood ratio based

training criterion these scores must be weighted

together with the scores based on

bottleneck features

so we do that with a linear combination

of these two types of scores and use them in the training criterion

so one question is how the calibration is achieved

as we see here we have three parameters in the bottom equation

w1 and w2 and kappa are estimated

by minimizing our training objective while the network weights are fixed

after these parameters are estimated

they are held fixed and the network is estimated
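
The equation on the slide is not in the transcript, so the sketch below assumes the simplest form consistent with the description: a fused score `w1 * s_bn + w2 * s_mfcc + kappa`, with the three parameters fitted by gradient descent on a log-likelihood-ratio style cost while both sets of scores are held fixed. The function names, the optimizer, the learning rate and the step count are all hypothetical choices, not details from the talk.

```python
import numpy as np


def fuse(bn_scores, mfcc_scores, w1, w2, kappa):
    """Assumed linear combination of bottleneck and MFCC system scores."""
    return w1 * np.asarray(bn_scores, dtype=float) \
        + w2 * np.asarray(mfcc_scores, dtype=float) + kappa


def fusion_cost(scores, is_target):
    """Cllr-style cost (in bits) of fused scores, equal weight per trial type."""
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(is_target, dtype=bool)
    target_cost = np.mean(np.logaddexp(0.0, -scores[y]))
    nontarget_cost = np.mean(np.logaddexp(0.0, scores[~y]))
    return 0.5 * (target_cost + nontarget_cost) / np.log(2.0)


def fit_fusion(bn_scores, mfcc_scores, is_target, steps=500, lr=0.1):
    """Estimate w1, w2 and kappa with the input scores held fixed."""
    bn = np.asarray(bn_scores, dtype=float)
    mf = np.asarray(mfcc_scores, dtype=float)
    y = np.asarray(is_target, dtype=bool)
    sign = np.where(y, 1.0, -1.0)
    # Each trial type contributes half of the total cost.
    trial_weight = np.where(y, 0.5 / y.sum(), 0.5 / (~y).sum())
    w1, w2, kappa = 0.0, 0.0, 0.0
    for _ in range(steps):
        s = fuse(bn, mf, w1, w2, kappa)
        # Derivative of log(1 + exp(-sign * s)) with respect to s (in nats).
        grad_s = -sign * trial_weight / (1.0 + np.exp(sign * s))
        w1 -= lr * np.sum(grad_s * bn)
        w2 -= lr * np.sum(grad_s * mf)
        kappa -= lr * np.sum(grad_s)
    return w1, w2, kappa
```

In an alternating scheme of the kind described, a step like `fit_fusion` would be followed by a network update with `w1`, `w2` and `kappa` frozen.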

as many of our speakers have mentioned we used

a speaker recognition system after we extracted the features

namely we use a UBM we have i-vectors

we have PLDA

I will skip this part next I will report experimental results

we ran experiments on the same and different microphone tasks of NIST SRE 2010

this is our main interest our target

we use microphone recordings only in the bottleneck network

we use all of the SRE 2004 5 6 and Switchboard

data

in our experiments

we used microphone recordings for bottleneck network training

we have 173 speakers

in the training and validation sets

giving us about 4341 recordings in training

and 865 recordings in validation in terms of the number of

input samples

we have about a few million samples in training

and two million in calibration

the network structure was

like this we have

a 294 dimensional input 14 times 21 we use

plus and minus ten frames as context frames
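
The input dimension follows from stacking context frames: assuming 14-dimensional MFCCs and a window of 21 frames (the current frame plus ten on each side), 14 x 21 = 294. A minimal sketch of that stacking is below. Padding the edges by repeating the first and last frame is an assumption, and `stack_context` is a hypothetical name.

```python
import numpy as np


def stack_context(features, context=10):
    """Stack each frame with its neighbors.

    features has shape (num_frames, dim); the result has shape
    (num_frames, dim * (2 * context + 1)).
    """
    num_frames, _ = features.shape
    # Repeat the edge frames so every frame has a full window (assumption).
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    # Window i holds, for every frame t, the frame at offset i - context.
    windows = [padded[i:i + num_frames] for i in range(2 * context + 1)]
    return np.concatenate(windows, axis=1)
```

For a recording with 14-dimensional features, `stack_context(features).shape[1]` is 294, and the middle 14 columns of each row are the current frame itself.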

the network has 1000 hidden nodes

followed by 42 bottleneck

nodes these are fully connected to another 500 nodes

and 173 speakers
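
The layer sizes quoted here (294 inputs, 1000 hidden nodes, 42 bottleneck nodes, 500 hidden nodes, 173 speakers) can be laid out as a small feed-forward pass. This is only a shape-level sketch with random, untrained weights. The sigmoid hidden units and the choice to read the bottleneck features before the nonlinearity are assumptions, since the transcript specifies neither, and all names are hypothetical.

```python
import numpy as np

# Layer sizes as quoted in the talk.
LAYER_SIZES = [294, 1000, 42, 500, 173]


def init_network(rng, sizes=LAYER_SIZES):
    """Create small random weights and zero biases for each layer."""
    return [(rng.normal(0.0, 0.01, (n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def forward(params, x):
    """Return (bottleneck features, speaker-layer activations).

    x has shape (num_frames, 294). Both outputs are pre-nonlinearity values.
    """
    hidden = x
    bottleneck = None
    for index, (weights, bias) in enumerate(params):
        pre_activation = hidden @ weights + bias
        if index == 1:
            # Second layer is the 42-node bottleneck (read-out point assumed).
            bottleneck = pre_activation
        hidden = sigmoid(pre_activation)
    return bottleneck, pre_activation
```

After training, only the layers up to the bottleneck are needed at feature-extraction time; the 42-dimensional output then plays the role that MFCCs play in the baseline system.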

I would like to mention that

as for the processing of the input features and bottleneck features

the input features are mean and variance normalized

conditional network

the mean and variance are estimated in a window length of three seconds of speech

and we also decorrelated the bottleneck features

to make them compatible with the diagonal covariance GMM assumption
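
Both processing steps can be sketched briefly. The sketch assumes a 10 ms frame shift, so that three seconds corresponds to a 300-frame window, and it assumes that the post-processing of the bottleneck features is a decorrelating rotation (here PCA), which is a common way to suit diagonal-covariance GMMs. Neither detail is confirmed by the transcript, and the function names are hypothetical.

```python
import numpy as np


def sliding_mvn(features, window=300):
    """Mean and variance normalize each frame over a window centered on it.

    window=300 frames is about 3 s at an assumed 10 ms frame shift.
    """
    num_frames = features.shape[0]
    half = window // 2
    normalized = np.empty_like(features, dtype=float)
    for t in range(num_frames):
        # The window is truncated at the start and end of the recording.
        chunk = features[max(0, t - half):min(num_frames, t + half + 1)]
        normalized[t] = (features[t] - chunk.mean(axis=0)) / (chunk.std(axis=0) + 1e-8)
    return normalized


def decorrelate(features):
    """Rotate features onto the eigenvectors of their covariance (PCA)."""
    centered = features - features.mean(axis=0)
    covariance = np.cov(centered, rowvar=False)
    _, eigenvectors = np.linalg.eigh(covariance)
    # After the rotation the covariance matrix is diagonal.
    return centered @ eigenvectors
```

In practice the rotation would be estimated once on training data and then applied unchanged to test data; the function above estimates and applies it on the same array only to keep the example short.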

this one shows the effect of the training criterion

the blue and red columns here were presented earlier

the red color is the traditionally

trained network and now we have the green color

where we use conversation level training

as I just described

you may remember that the degradation was 40%

on the same microphone test and it was 45%

on the different microphone test

it became 30% and 35% 34% respectively

the difference is more

observable but we still observe a

difference in performance in terms of

when we train the network in the traditional way right now it is 30%

we also explored the effect of the bottleneck layer in the next slide

the effect is as

observed on the train set

as we increase the bottleneck feature vector size we get an improvement

we did explore that

this slide shows the combination strategy I just mentioned

the blue column shows the MFCC baseline

the red column shows the linear combination of the two scores using a toolkit

the green column shows the

network trained with a separate system

yes we also get an improvement of 18%

with this strategy

I would like to

conclude we showed

one way to train a neural network using a full

speaker recognition system

and we also showed how to use

a separate system in training the network

and thank you questions are welcome

these are just features yes

that is just the same so instead of MFCC features we use bottleneck features

yes we do use GMMs

so what we said is MFCC deltas and delta-deltas are the product of a linear transformation

so our question is if we perform a nonlinear transformation can we do better

we did we started with two thousand

we actually first started with the baseline system it has 34

the baseline features were obtained the same way

actually this combination

was a two step combination

first we use the MFCC scores in training the network

after that we get the PLDA scores and we combine these scores

oh I have those results actually

the linear combination the red column shows that

so we got perhaps 30%

we don't have any deltas

we actually wanted to avoid that

we test on 2010

the training data are from 2004 5 and 6

SESSION 04: Neural Network for Speaker Recognition

Sibel Yaman