0:00:36 OK, so
0:00:38 my name is Sibel Yaman,
0:00:40 and today I will talk about my work
0:00:42 with my colleagues.
0:00:47 This is a picture to illustrate speaker recognition.
0:01:08 Here is the outline of my talk. In the introduction,
0:01:13 I'll give you the big picture of speaker recognition,
0:01:20 and I'm also presenting an overview of speaker recognition
0:01:22 methodology. Then I'll talk about the details,
0:01:26 for which the next slide shows an example.
0:01:32 There are two key ideas: the first one is developing a conversation-level training criterion,
0:01:39 and the second one is incorporating a separate system in discriminative training.
0:01:46 I will then report the experimental results and conclude with a summary.
0:01:54 OK. We were listening to a keynote speaker this morning: in the speech recognition literature,
0:02:01 a mature platform built on HMMs has been reported.
0:02:10 In speaker recognition, however, we are not there yet, but
0:02:14 we are improving our results every day.
0:02:23 Here is the bottleneck
0:02:26 architecture that we use. We have an input layer
0:02:32 formed from a context window of speech frames.
0:02:38 These are raw features, meaning there are no deltas and delta-deltas:
0:02:43 fourteen-dimensional MFCC features.
0:02:47 They are fully connected to a layer of
0:02:51 about a thousand hidden nodes,
0:02:54 which are fully connected to a narrow layer of hidden nodes that serves as the bottleneck layer.
0:03:02 They are connected to another hidden layer, and finally they are connected to the output layer.
0:03:16 The input feature statistics
0:03:18 are fed into the network
0:03:20 and passed forward through the bottleneck layer.
0:03:23 In the backpropagation mode, the speaker
0:03:27 information at the output
0:03:30 is propagated back to the lower layers,
0:03:32 and that is how the network is trained.
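[Editor's note] The architecture described above can be sketched as a forward pass. This is a minimal illustration, not the authors' code: the layer sizes (294-dim stacked input, 1000 hidden, 42 bottleneck, 500 hidden, 173 speaker outputs) follow the numbers given later in the talk, while the sigmoid nonlinearities and random initialization are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(n_in, n_out):
    # Small random weights and zero biases (illustrative initialization).
    return rng.standard_normal((n_in, n_out)) * 0.01, np.zeros(n_out)

W1, b1 = layer(294, 1000)   # stacked raw MFCCs -> first hidden layer
W2, b2 = layer(1000, 42)    # first hidden -> narrow bottleneck layer
W3, b3 = layer(42, 500)     # bottleneck -> second hidden layer
W4, b4 = layer(500, 173)    # second hidden -> speaker output layer

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x):
    """Forward pass; the bottleneck activations h2 become the new features."""
    h1 = sigmoid(x @ W1 + b1)
    h2 = h1 @ W2 + b2                  # bottleneck outputs (pre-nonlinearity)
    h3 = sigmoid(sigmoid(h2) @ W3 + b3)
    logits = h3 @ W4 + b4              # pre-softmax speaker scores
    return h2, logits

x = rng.standard_normal(294)           # one stacked-MFCC input frame
bn, logits = forward(x)
```

In training, errors at the 173-way speaker output are backpropagated through all four weight matrices; at feature-extraction time only the path up to `h2` is used.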
0:03:39 This kind of network strategy
0:03:43 was first studied
0:03:46 in 1998
0:03:48 by Konig and Heck.
0:03:54 Let me tell you what happens if you
0:03:56 just use this bottleneck network as is:
0:04:03 it doesn't work.
0:04:08 Here we use a standard setup and
0:04:14 compare what I call
0:04:17 traditionally trained
0:04:22 bottleneck features,
0:04:25 shown in red,
0:04:27 while the blue color shows
0:04:30 the MFCC baseline systems.
0:04:33 Let's keep these numbers in mind.
0:04:39 When we compare
0:04:42 performance in terms of equal error rate,
0:04:46 we see a
0:04:48 40% relative degradation
0:04:52 on the same-microphone test set of NIST 2010
0:04:57 and about a 45%
0:05:00 degradation on the different-microphone task.
0:05:08 So we need techniques that will bring the performance back to where the baseline is.
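[Editor's note] The comparisons above are in equal error rate (EER): the operating point where the false-accept and false-reject rates coincide. A simple sketch of how EER is computed from trial scores; the scores below are synthetic, not the NIST 2010 results quoted in the talk.

```python
import numpy as np

def eer(target_scores, nontarget_scores):
    """Equal error rate from target and non-target trial scores."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    order = np.argsort(scores)           # sweep the threshold over all scores
    labels = labels[order]
    # After sorting, false rejects accumulate from the left,
    # false accepts shrink from the right.
    fr = np.cumsum(labels) / labels.sum()                   # P(miss)
    fa = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()   # P(false accept)
    idx = np.argmin(np.abs(fr - fa))     # point where the two rates cross
    return (fr[idx] + fa[idx]) / 2.0

rng = np.random.default_rng(1)
tgt = rng.normal(2.0, 1.0, 1000)   # target trials score higher on average
non = rng.normal(0.0, 1.0, 1000)
rate = eer(tgt, non)
```

A "40% degradation" in this metric means the bottleneck system's EER is 40% higher, relative, than the MFCC baseline's.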
0:05:20 Here is an overview of the methodology. We start from
0:05:25 MFCC features.
0:05:27 One way is to apply some linear transformation
0:05:31 and obtain deltas and delta-deltas as
0:05:34 higher-order features.
0:05:37 But what we want to do is
0:05:41 perform a nonlinear transformation on these
0:05:45 features, so that when we
0:05:51 see new data,
0:05:54 we just transform the
0:05:56 MFCC features into more robust features.
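[Editor's note] The point that deltas are only a *linear* transformation can be seen directly in code: each delta frame is a fixed weighted sum of neighboring MFCC frames. The ±2-frame regression window below is a common choice, assumed here; the talk only states that the transform is linear.

```python
import numpy as np

def deltas(feats, N=2):
    """Regression-based deltas; feats has shape (num_frames, dim)."""
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(feats, ((N, N), (0, 0)), mode="edge")  # repeat edge frames
    out = np.zeros_like(feats, dtype=float)
    for n in range(1, N + 1):
        # Weighted difference of frames n steps ahead and n steps behind.
        out += n * (padded[N + n:len(feats) + N + n]
                    - padded[N - n:len(feats) + N - n])
    return out / denom

mfcc = np.random.default_rng(2).standard_normal((100, 14))
d = deltas(mfcc)
```

Because every output is a fixed linear combination of inputs, scaling the MFCCs scales the deltas identically; the bottleneck network replaces this with a learned nonlinear map.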
0:06:01 In other words,
0:06:02 we identified
0:06:05 two ways of using a deep belief network to do that.
0:06:09 The first one is that
0:06:12 we can change the training algorithm in a way that
0:06:16 fits the speaker recognition application better.
0:06:21 The second one is that almost every speaker recognition
0:06:24 system uses system combination to deal with speaker
0:06:27 recognition, so
0:06:29 we explore whether there is a way to incorporate
0:06:32 a separate system in training.
0:06:42 Next I will talk about some details.
0:06:49 First of all, the problem with frame-level training is that
0:06:54 learning the speaker information is constrained to the
0:07:00 context of the current frame,
0:07:02 even if you increase the context window.
0:07:11 Our conversation-level training algorithm
0:07:17 offers a solution to these problems.
0:07:21 First of all, if the training is at the conversation level,
0:07:28 the network makes one single decision per recording,
0:07:33 which is what should be the case.
0:07:35 Another advantage is that
0:07:43 there are several ways of doing this that we can explore;
0:07:47 I will explain that later.
0:07:52 So the first key idea is
0:07:56 using a speaker recognition training criterion.
0:07:59 The speaker recognition training criterion is a
0:08:02 log-likelihood-ratio-based training criterion
0:08:07 developed by Brummer.
0:08:09 It is a weighted sum of costs:
0:08:13 one over the target trials plus
0:08:20 one over the non-target trials, and minimizing this objective
0:08:27 teaches the network to separate target and non-target trials.
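[Editor's note] A sketch of the log-likelihood-ratio cost (Cllr) in its commonly published form: the average logistic cost over target trials plus the average over non-target trials, each weighted one half. This shows the objective only, not how the talk attaches it to the network.

```python
import numpy as np

def cllr(target_llrs, nontarget_llrs):
    """Cllr in bits: small when target LLRs are high and non-target LLRs low."""
    c_tar = np.mean(np.log2(1.0 + np.exp(-np.asarray(target_llrs))))
    c_non = np.mean(np.log2(1.0 + np.exp(np.asarray(nontarget_llrs))))
    return 0.5 * (c_tar + c_non)

# Well-separated scores give a low cost; all-zero (uninformative) LLRs
# give exactly 1 bit, the cost of a system that says nothing.
good = cllr(target_llrs=[4.0, 5.0, 6.0], nontarget_llrs=[-4.0, -5.0, -6.0])
bad = cllr(target_llrs=[0.0, 0.0], nontarget_llrs=[0.0, 0.0])
```

Because the cost is differentiable in the scores, its gradient can be backpropagated into the network during training.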
0:08:38 I will remind you that the
0:08:40 upper layer is the speaker layer.
0:08:48 As I mentioned earlier, for each recording we should make one decision.
0:08:55 There are several ways of doing that: for example, feeding back
0:09:00 a decision made on one frame,
0:09:03 or making a decision on each frame of the recording.
0:09:08 We took another approach. What we do is
0:09:12 average the scores at the output layer before the nonlinearity,
0:09:18 which means that for each frame of the recording we compute the statistics,
0:09:24 sum them, and average them.
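[Editor's note] The averaging step above can be sketched as follows: per-frame pre-nonlinearity scores are averaged over the whole recording, and the output nonlinearity is applied once, yielding a single decision per recording. The softmax output and the shapes are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def recording_posterior(frame_logits):
    """frame_logits: (num_frames, num_speakers) pre-nonlinearity scores."""
    avg = frame_logits.mean(axis=0)   # sum per-frame statistics, then average
    return softmax(avg)               # one nonlinearity, one decision

rng = np.random.default_rng(3)
logits = rng.standard_normal((500, 173))   # 500 frames, 173 speakers
p = recording_posterior(logits)
```

Averaging before the nonlinearity (rather than averaging per-frame posteriors) keeps the recording-level objective a simple function of the summed frame statistics.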
0:09:32 As I mentioned earlier, the second key idea of my methodology is
0:09:38 using a separate system in training.
0:09:43 In this diagram, here is the top layer of the network.
0:09:49 As before, we have the bottleneck (BN) score generation scheme here, and
0:09:54 we have a standard MFCC system.
0:09:58 But since we are using the log-likelihood-ratio-based
0:10:01 training criterion, the scores must be calibrated
0:10:05 against those based on the
0:10:09 bottleneck features.
0:10:12 So we do that with a linear combination
0:10:16 of these two types of scores, and use the combined score in the training criterion.
0:10:25 So one question is how the calibration is achieved.
0:10:30 As we see here, we have three parameters in the bottom equation:
0:10:36 w1, w2, and kappa. They are estimated
0:10:44 by minimizing the training objective while the network weights are held fixed.
0:10:53 After these parameters are estimated,
0:10:55 they are fixed in turn, and the network weights are estimated.
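[Editor's note] A sketch of that calibration step: the fused score is s = w1·s_bn + w2·s_mfcc + kappa, and (w1, w2, kappa) are fit by minimizing a logistic, LLR-style loss with everything else frozen. Plain gradient descent on synthetic subsystem scores; the optimizer and data are illustrative assumptions, not the authors' setup.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
labels = rng.integers(0, 2, n)                 # 1 = target trial
s_bn = labels * 2.0 + rng.normal(0, 1, n)      # synthetic bottleneck scores
s_mfcc = labels * 1.5 + rng.normal(0, 1, n)    # synthetic MFCC scores

w1 = w2 = kappa = 0.0
lr = 0.1
y = 2.0 * labels - 1.0                          # +1 targets, -1 non-targets
for _ in range(500):
    s = w1 * s_bn + w2 * s_mfcc + kappa
    g = -y / (1.0 + np.exp(y * s))              # d/ds of log(1 + exp(-y*s))
    w1 -= lr * np.mean(g * s_bn)
    w2 -= lr * np.mean(g * s_mfcc)
    kappa -= lr * np.mean(g)

final_loss = np.mean(np.log1p(np.exp(-y * (w1 * s_bn + w2 * s_mfcc + kappa))))
```

In the talk's alternating scheme, this fit is followed by a phase where (w1, w2, kappa) are frozen and the network weights are updated instead.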
0:11:03 As many of the previous speakers mentioned, we use a standard
0:11:09 speaker recognition system after we extract the features:
0:11:15 namely, we use a UBM, we have i-vectors,
0:11:19 and we have PLDA.
0:11:23 I will skip this part. Next I will report the experimental results.
0:11:32 We ran experiments on the same- and different-microphone tasks of NIST SRE 2010.
0:11:41 This is our main interest, our target condition.
0:11:47 For training the bottleneck network we use microphone recordings only.
0:11:55 We use all of SRE 2004, 2005, 2006 and Switchboard
0:12:05 data in our experiments,
0:12:09 and use the microphone recordings for bottleneck network training.
0:12:14 We have 173
0:12:17 speakers in the training and validation sets,
0:12:21 which gives us about 4341 recordings in training and
0:12:26 865 recordings in validation. In terms of the number of
0:12:32 input samples,
0:12:34 we have about a few million samples in training
0:12:39 and two million in calibration.
0:12:46 The network architecture was
0:12:48 as follows: we have a
0:12:51 294-dimensional input, 14 times 21, since we use
0:12:59 plus and minus ten context frames.
0:13:02 The network has 1000 hidden nodes,
0:13:08 connected to 42 bottleneck
0:13:12 nodes; these are fully connected to another 500 nodes,
0:13:18 and the output layer has 173 speakers.
0:13:24 I would like to mention
0:13:27 the processing of the input features and bottleneck features.
0:13:35 The input features are mean- and variance-normalized
0:13:41 before entering the network;
0:13:43 the mean and variance are estimated over a window of three seconds of speech.
0:13:50 The same correction is also applied to the bottleneck features
0:13:54 to make them compatible with the diagonal-covariance GMM assumption.
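[Editor's note] The short-time normalization described above can be sketched as follows: each frame is normalized by the mean and standard deviation estimated over a surrounding window. Three seconds corresponds to about 300 frames at a 10 ms frame shift; the frame shift is an assumption, as the talk only states the window length.

```python
import numpy as np

def window_mvn(feats, win=300):
    """feats: (num_frames, dim); normalize each frame within a local window."""
    half = win // 2
    out = np.empty_like(feats, dtype=float)
    for t in range(len(feats)):
        # Window is truncated at the utterance boundaries.
        seg = feats[max(0, t - half):t + half + 1]
        out[t] = (feats[t] - seg.mean(axis=0)) / (seg.std(axis=0) + 1e-8)
    return out

feats = np.random.default_rng(5).standard_normal((1000, 14)) * 3.0 + 7.0
norm = window_mvn(feats)
```

Applying the same normalization to the bottleneck outputs keeps their per-dimension statistics close to zero mean and unit variance, which suits a diagonal-covariance GMM.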
0:14:02 This slide shows the effect of the training criterion,
0:14:08 presented in the blue, red, and green columns here.
0:14:13 The red color is the traditionally
0:14:17 trained network, and now we have the green color, where
0:14:21 we use the conversation-level training
0:14:24 that I just described.
0:14:26 You may remember that the degradation was 40%
0:14:31 on the same-microphone test and 45%
0:14:37 on the different-microphone test;
0:14:41 it became 30% and 34%, respectively.
0:14:46 The difference is more
0:14:50 clearly observed in the
0:14:53 different-microphone performance:
0:15:01 compared with the traditionally trained network, the gap is now 30%.
0:15:09 We also explored the effect of the bottleneck layer size, shown in the next slide.
0:15:15 The effect
0:15:17 we observed on the test set is that
0:15:21 as we increase the bottleneck feature vector size, we get an improvement
0:15:27 over the range we explored.
0:15:34 This slide shows the combination strategy I just mentioned.
0:15:40 The blue column shows the MFCC baseline,
0:15:45 the red column shows the linear combination of the two scores using a toolkit,
0:15:52 and the green column shows
0:15:55 training the network with the separate system.
0:16:01 Yes, we also get an improvement, of 18%,
0:16:06 with this strategy.
0:16:10 I would like to
0:16:16 conclude. We showed
0:16:19 one way to train the network using the full recording for
0:16:23 speaker recognition,
0:16:26 and we also showed how to use
0:16:28 a separate system in training the network.
0:16:32 Thank you; questions are welcome.
0:16:47 These are just features? Yes,
0:16:52 it's just the same: instead of MFCC features we use bottleneck features.
0:17:23 Yes, we do use GMMs.
0:17:26 So what we said is that MFCC deltas and delta-deltas are the product of a linear transformation;
0:17:41 our question is, if we perform a nonlinear transformation, can we do better?
0:18:35 We did; we started with two thousand.
0:18:49 We actually first started with the baseline system; it has 34
0:19:00 baseline features, obtained the same way.
0:19:46 Actually, this combination
0:19:48 was a two-step combination:
0:19:51 first we use these MFCC scores in training the network;
0:19:54 after that, we get the PLDA scores and combine these scores.
0:20:13 Oh, I have those results, actually:
0:20:18 for the linear combination, the red column shows that;
0:20:26 so we got perhaps 30%.
0:20:38 We don't have any deltas;
0:20:42 we actually want to avoid that.
0:21:21 We test on 2010;
0:21:25 the training data are from 2004, 2005, and 2006.