OK, so my name is Sibel Yaman, and today I will talk about work I did with my colleagues.
This picture illustrates speaker recognition. Here is my outline: in the introduction I'll give you the big picture of speaker recognition, and I'm also presenting an overview of our approach. In the methodology part I'll talk about the details.
The next slide shows an example. There are two key ideas: the first is to develop a conversation-level training criterion, and the second is to incorporate a separate system into the discriminative training. I will then report the experimental results and conclude with a summary.
OK, we listened to a keynote speaker this morning: in the speech recognition literature, deep networks have been reported to outperform HMM-based systems. In speaker recognition we have not gotten there yet, but we are improving our results every day.
Here is the bottleneck architecture that we used. We have an input layer that takes a short span of speech. These are raw features, meaning there are no deltas and delta-deltas: 14-dimensional MFCC features. They are fully connected to a layer of about a thousand hidden nodes, which are connected to a narrow hidden layer that acts as the bottleneck layer. These are connected to another hidden layer, and finally to the output layer. The input features are fed into the network and passed through the bottleneck layer.
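The layer structure just described can be sketched in a few lines. This is a minimal NumPy sketch: the layer sizes (294-dim input, 1000 hidden, 42 bottleneck, 500 hidden, 173 speaker outputs) are the ones from the talk, while the random weights and the sigmoid hidden units are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from the talk; weights are random placeholders, not trained.
sizes = [294, 1000, 42, 500, 173]
weights = [rng.standard_normal((m, n)) * 0.01 for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    """Forward pass; returns all layer activations (sigmoid hidden layers,
    linear output layer, since softmax is applied during training)."""
    acts = [x]
    for i, (w, b) in enumerate(zip(weights, biases)):
        z = acts[-1] @ w + b
        acts.append(z if i == len(weights) - 1 else 1.0 / (1.0 + np.exp(-z)))
    return acts

x = rng.standard_normal(294)          # one stacked-MFCC input frame
acts = forward(x)
bottleneck_features = acts[2]         # the 42-dim bottleneck activations
```

After training, the 42-dimensional bottleneck activations are what gets extracted and handed to the speaker recognition back end in place of (or alongside) the MFCCs.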
In the back-propagation mode, the speaker label information propagates back down to the lower layers, and that is how the network learns the mapping. This is the overall network strategy.
Bottleneck features for speaker recognition were first studied in 1998 by Konig and Heck.
Let me tell you what happened first: if you just use this bottleneck network as is, with standard frame-level training, it doesn't work well.
Here we compare the traditionally trained bottleneck features, shown in red, against the MFCC baseline system, shown in blue. Let's keep these numbers in mind. When we compare performance in terms of equal error rate, we see a 40% decrease on the same-microphone test set of NIST SRE 2010, and about a 45% decrease on the different-microphone task.
So the question is: what techniques will take the performance further from where it is?
Here is an overview of the methodology. Starting from MFCC features, one option is to apply a linear transformation and obtain deltas and delta-deltas, that is, higher-order features. But what we want to do is perform a nonlinear transformation on these features, so that we transform the MFCC features into more robust features.
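For contrast, the linear delta transformation mentioned here can be sketched as follows. This uses the standard regression formula for deltas; the window size N=2 is an assumption for illustration, not a number from the talk.

```python
import numpy as np

def deltas(feats, N=2):
    """First-order delta features via the standard regression formula
    d_t = sum_n n * (c_{t+n} - c_{t-n}) / (2 * sum_n n^2).
    feats: (T, D) array of frame-level features, e.g. MFCCs."""
    T, _ = feats.shape
    padded = np.pad(feats, ((N, N), (0, 0)), mode="edge")  # repeat edge frames
    d = np.zeros_like(feats, dtype=float)
    for n in range(1, N + 1):
        d += n * (padded[N + n:N + n + T] - padded[N - n:N - n + T])
    return d / (2 * sum(n * n for n in range(1, N + 1)))

# A linear ramp in every dimension has slope 1, so interior deltas are 1.
mfcc = np.tile(np.arange(20.0)[:, None], (1, 14))
d1 = deltas(mfcc)
```

The point of the contrast: deltas are a fixed linear map of neighboring frames, while the bottleneck network learns a nonlinear, data-driven map of the same context.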
In other words, we identified two ways of using a deep belief network to do that. The first is to change the training algorithm in such a way that it fits the speaker recognition application better. The second is system combination for speaker recognition: we explore whether there is a way to use a separate system in the training.
Next I will talk about the details of these ideas. First of all, with frame-level training, learning the speaker information is constrained to the context of the current frame. If you want to increase the context, our conversation-level training algorithm offers a solution to this problem. First, if the training is at the conversation level, the network makes one single decision per recording, which is how the task should be evaluated. Another advantage is that there are several ways of doing this that we can explore; I will explain that later.
So the first key idea is using a speaker recognition training criterion, namely the log-likelihood-ratio-based training criterion due to Brümmer. It is a weighted sum of a cost over the target trials plus a cost over the non-target trials.
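As a rough illustration, Brümmer's log-likelihood-ratio cost (Cllr) for scores that are meant to be log-likelihood ratios can be written like this. This is a sketch of the commonly published formula; the exact weighting used in the talk's criterion may differ.

```python
import numpy as np

def cllr(target_scores, nontarget_scores):
    """Log-likelihood-ratio cost in bits: a soft, calibration-sensitive
    cost that sums a term over target trials and a term over non-target
    trials, then averages the two."""
    tar = np.asarray(target_scores, dtype=float)
    non = np.asarray(nontarget_scores, dtype=float)
    c_tar = np.mean(np.log2(1.0 + np.exp(-tar)))   # penalize low target scores
    c_non = np.mean(np.log2(1.0 + np.exp(non)))    # penalize high non-target scores
    return 0.5 * (c_tar + c_non)
```

A system that always outputs a log-likelihood ratio of 0 gets Cllr = 1 bit; well-separated, well-calibrated scores drive the cost toward 0.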
And I will remind you that the upper layer is the speaker layer. As I mentioned earlier, for each recording we should make one decision. There are several ways of doing that: for example, make a decision on one frame, make a decision on another frame, and combine them over the recording. We took a different approach here: the scores are averaged at the output layer before the nonlinearity, which means that for each frame of the recording we compute the statistics, sum them, and average them.
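The frame-score averaging just described might look like this. Random numbers stand in for the per-frame linear outputs of the 173-speaker output layer; the point is where the averaging happens relative to the nonlinearity.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical pre-nonlinearity outputs for one conversation:
# T = 500 frames, 173 speaker output nodes.
frame_logits = rng.standard_normal((500, 173))

# One decision per conversation: average the linear outputs over all
# frames *before* the output nonlinearity, then apply softmax once.
conv_logits = frame_logits.mean(axis=0)
conv_post = np.exp(conv_logits - conv_logits.max())
conv_post /= conv_post.sum()
decision = int(np.argmax(conv_post))
```

Averaging before the nonlinearity makes the conversation-level statistic a simple sum over frames, which keeps the gradient of the conversation-level criterion easy to distribute back to every frame.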
As I mentioned earlier, the second key idea of my methodology is using a separate system in training. In this diagram, at the top is the output layer of the network; as before we have the bottleneck (BN) score generation scheme here, and we also have a standard MFCC system. But since we are using the likelihood-ratio-based training criterion, the MFCC scores must be weighted against the scores based on the bottleneck features. So we do that with a linear combination of these two types of scores, and use it in the training criterion.
So one question is how the calibration is achieved. As you see here, we have three parameters in the bottom equation: w1, w2, and kappa. They are estimated by minimizing the training objective while the network weights are held fixed. After these parameters are estimated, they are in turn held fixed and the network weights are estimated.
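A sketch of that calibration step, assuming the bottom equation is the linear fusion s = w1*s_bn + w2*s_mfcc + kappa and that the objective is the logistic log-likelihood-ratio cost; the synthetic calibration trials below are illustrative only.

```python
import numpy as np

def fuse_params(s_bn, s_mfcc, labels, iters=2000, lr=0.1):
    """Estimate w1, w2, kappa for s = w1*s_bn + w2*s_mfcc + kappa by
    gradient descent on the logistic cost, with network weights fixed.
    labels: 1.0 for target trials, 0.0 for non-target trials."""
    X = np.column_stack([s_bn, s_mfcc, np.ones_like(s_bn)])
    w = np.zeros(3)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))             # sigmoid of fused score
        w -= lr * X.T @ (p - labels) / len(labels)   # average gradient step
    return w

# Synthetic calibration trials: targets score higher on both systems.
rng = np.random.default_rng(2)
labels = np.concatenate([np.ones(200), np.zeros(200)])
s_bn = 2.0 * labels + 0.5 * rng.standard_normal(400)
s_mfcc = 1.5 * labels + 0.5 * rng.standard_normal(400)
w1, w2, kappa = fuse_params(s_bn, s_mfcc, labels)
```

Holding the network fixed while fitting (w1, w2, kappa), then holding those fixed while updating the network, gives the alternating scheme described above.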
As many of the previous speakers mentioned, we use a standard speaker recognition back end after we extract the features: we use a UBM, we have i-vectors, we have PLDA. I will skip this part. Next I will report the experimental results.
We ran experiments on the same- and different-microphone tasks of NIST SRE 2010; this is our main interest, our target condition. For the bottleneck network we use microphone recordings only. We use all of the SRE 2004, 2005, and 2006 data plus Switchboard data in our experiments, and microphone recordings for the bottleneck network training. We have 173 speakers in the training and validation sets, which gives us about 4341 recordings in training and 865 recordings in validation. In terms of the number of input samples, we have a few million samples in training and two million in calibration.
The network setup was as follows: we have a 294-dimensional input, 14 times 21, using plus and minus 10 context frames around the current frame. The network has 1000 hidden nodes, then a 42-node bottleneck layer; these are fully connected to another 500 nodes, and the output layer has 173 speakers.
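The 294-dimensional input works out as 21 stacked frames of the 14-dimensional raw MFCCs (the current frame plus 10 on each side). A sketch of that context stacking:

```python
import numpy as np

def stack_context(feats, left=10, right=10):
    """Stack left/right context frames around every frame:
    14-dim MFCCs x 21 frames -> 294-dim network input."""
    T, _ = feats.shape
    padded = np.pad(feats, ((left, right), (0, 0)), mode="edge")
    width = left + right + 1
    return np.stack([padded[t:t + width].ravel() for t in range(T)])

mfcc = np.random.default_rng(3).standard_normal((100, 14))  # (T, 14) MFCCs
x = stack_context(mfcc)                                     # (T, 294) inputs
```

Each row of `x` is one network input; the center 14 values of a row are the original frame it was built around.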
I would like to mention how the input features and the bottleneck features are processed. The input features are mean- and variance-normalized before going into the network; the mean and variance are estimated over a window of three seconds of speech. The same correction is also applied to the bottleneck features, to make them compatible with the diagonal-covariance assumption of the GMMs.
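That sliding-window normalization could look like this. This sketch assumes a 10 ms frame shift, so three seconds is about 300 frames; the exact window handling in the talk's system may differ.

```python
import numpy as np

def windowed_cmvn(feats, win=300):
    """Mean/variance-normalize each frame with statistics estimated from
    a sliding window of ~3 s of speech (300 frames at a 10 ms shift)."""
    T, _ = feats.shape
    out = np.empty_like(feats, dtype=float)
    half = win // 2
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        seg = feats[lo:hi]                       # local window, clipped at edges
        out[t] = (feats[t] - seg.mean(axis=0)) / (seg.std(axis=0) + 1e-8)
    return out

feats = np.random.default_rng(4).standard_normal((400, 14))
normed = windowed_cmvn(feats)
```

Applying the same normalization to the bottleneck activations keeps their per-dimension scale compatible with diagonal-covariance GMM modeling.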
This slide shows the effect of the training criterion, presented in the blue, red, and green columns here. The red color is the traditionally trained network, and now we have the green color, where we use the conversation-level training I just described. You may remember that the change was 40% on the same-microphone test and 45% on the different-microphone test; here it became 30% and 34% respectively. The difference is more observable in the different-microphone performance: when the network is trained in the traditional way, it is 30%.
We also explored the effect of the bottleneck layer size; this is in the next slide. As we observed, increasing the bottleneck feature dimension gives an improvement.
This slide shows the combination strategy I just mentioned. The blue column shows the MFCC baseline, the red column shows the linear combination of the two scores using a toolkit, and the green column shows the network trained with the separate system. Yes, we also get an improvement, of 18%, with this strategy.
I would like to conclude. We showed one way to train the network using a criterion tailored to the speaker recognition task, and we also showed how to use a separate system in training the network. Thank you; questions are welcome.
These are just features, yes. It is just the same pipeline, so instead of MFCC features we use bottleneck features.
Yes, we do use GMMs.
So what we said is that MFCC deltas and delta-deltas are obtained by a linear transformation; our question is whether, if we perform a nonlinear transformation, we can do better.
We did; we started with two thousand. We actually first started with the baseline system; it has 34, with the baseline features obtained the same way.
Actually, this combination was a two-step combination: first we used the MFCC scores in training the network; after that we got the PLDA scores and combined those scores. Oh, I have those results, actually: the red column shows the linear combination, so we got perhaps 30%.
We don't have any deltas; we actually wanted to avoid that.
We test on the 2010 data; the training data are from 2004, 2005, and 2006.