0:00:36 | OK so |
---|
0:00:38 | my name is Sibel Yaman |
---|
0:00:40 | today I will talk about my work |
---|
0:00:42 | with my colleagues |
---|
0:00:47 | this is a picture to illustrate speaker recognition |
---|
0:01:08 | this is the outline; I'll start with the introduction |
---|
0:01:13 | I'll give you a big picture of speaker recognition |
---|
0:01:20 | I'll also present an overview of speaker recognition |
---|
0:01:22 | methodology, and I'll talk about the details |
---|
0:01:26 | the next picture shows an example |
---|
0:01:32 | there are two key ideas: the first one is the development of a conversation-level training criterion |
---|
0:01:39 | and the second one is the incorporation of a separate system in discriminative training |
---|
0:01:46 | I will report the experimental results and conclude with a summary |
---|
0:01:54 | OK, we listened to the keynote speaker this morning; in the speech recognition literature |
---|
0:02:01 | deep networks have been reported as the new platform over HMMs |
---|
0:02:10 | however, in speaker recognition we have not got there yet, but |
---|
0:02:14 | we are improving our results every day |
---|
0:02:23 | here is the bottleneck |
---|
0:02:26 | architecture that we used: we have an input layer |
---|
0:02:32 | fed with normalized features from a short window of speech |
---|
0:02:38 | these are raw features, which means there are no deltas and delta-deltas |
---|
0:02:43 | just fourteen-dimensional MFCC features |
---|
0:02:47 | they are fully connected to a layer of |
---|
0:02:51 | about a thousand hidden nodes |
---|
0:02:54 | which are in turn connected to a narrow layer of hidden nodes, which acts as the bottleneck layer |
---|
0:03:02 | they are connected to another hidden layer, and finally they are connected to the output layer |
---|
0:03:16 | input feature statistics |
---|
0:03:18 | are fed into the network |
---|
0:03:20 | and they are passed to the bottleneck layers |
---|
0:03:23 | in the back-propagation mode, the speaker |
---|
0:03:27 | information |
---|
0:03:30 | propagates down to the lower layers |
---|
0:03:32 | and that is how we train the network |
---|
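To make the architecture concrete, here is a minimal numpy sketch of a forward pass through such a bottleneck network. The layer sizes (294-dimensional input, 1000 hidden nodes, a 42-node bottleneck, 500 further nodes, 173 speaker outputs) are taken from the numbers quoted later in the talk; the sigmoid nonlinearity, the linear output layer, and the random initialization are illustrative assumptions rather than the speaker's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Layer sizes quoted later in the talk: 294 -> 1000 -> 42 (bottleneck) -> 500 -> 173 speakers.
sizes = [294, 1000, 42, 500, 173]
weights = [rng.normal(0.0, 0.01, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(frames):
    """Forward pass; returns speaker-layer pre-activations and bottleneck activations."""
    h = frames
    bottleneck = None
    for i, (W, b) in enumerate(zip(weights, biases)):
        pre = h @ W + b
        h = pre if i == len(weights) - 1 else sigmoid(pre)  # no squashing at the output
        if i == 1:                      # the second hidden layer is the 42-node bottleneck
            bottleneck = h
    return pre, bottleneck

# One utterance worth of stacked input frames (hypothetical).
frames = rng.normal(size=(300, 294))
speaker_scores, bn_features = forward(frames)
print(speaker_scores.shape, bn_features.shape)   # (300, 173) (300, 42)
```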
0:03:39 | this kind of network strategy |
---|
0:03:43 | was first studied |
---|
0:03:46 | in 1998 |
---|
0:03:48 | by Konig and Heck |
---|
0:03:54 | let me tell you what happened before |
---|
0:03:56 | if you just use this bottleneck directly to do that |
---|
0:04:03 | it doesn't work |
---|
0:04:08 | here we use a standard setup |
---|
0:04:14 | we compare |
---|
0:04:17 | the traditionally trained |
---|
0:04:22 | bottleneck features |
---|
0:04:25 | shown in red |
---|
0:04:27 | while the blue color shows |
---|
0:04:30 | the MFCC baseline systems |
---|
0:04:33 | let's just keep this number in our minds |
---|
0:04:39 | and when we compare |
---|
0:04:42 | performance in terms of equal error rate |
---|
0:04:46 | we see |
---|
0:04:48 | a 40% decrease in performance |
---|
0:04:52 | on the same-microphone test set of NIST 2010 |
---|
0:04:57 | and about a 45% |
---|
0:05:00 | decrease on the different-microphone task |
---|
0:05:08 | so we need techniques that will improve this performance; that is where we are |
---|
0:05:20 | here is an overview of the methodology |
---|
0:05:25 | we start with MFCC features |
---|
0:05:27 | one way is that we could apply some linear transformation |
---|
0:05:31 | and obtain deltas and delta-deltas |
---|
0:05:34 | as higher-order features |
---|
0:05:37 | but what we want to do is |
---|
0:05:41 | we want to perform a nonlinear transformation on these |
---|
0:05:45 | features so that when we |
---|
0:05:51 | choose the right sort of data |
---|
0:05:54 | we can transform the |
---|
0:05:56 | MFCC features into more robust features |
---|
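As a contrast between the two options, here is a small sketch: deltas are a fixed linear combination of neighbouring MFCC frames, whereas the bottleneck network applies a learned nonlinear mapping. The delta window of 2 and the 14-dimensional frames are assumptions for illustration, not the speaker's exact settings.

```python
import numpy as np

def deltas(mfcc, window=2):
    """Delta features: a fixed linear combination of neighbouring frames."""
    padded = np.pad(mfcc, ((window, window), (0, 0)), mode="edge")
    num = sum(k * (padded[window + k:window + k + len(mfcc)] -
                   padded[window - k:window - k + len(mfcc)])
              for k in range(1, window + 1))
    return num / (2 * sum(k * k for k in range(1, window + 1)))

mfcc = np.random.randn(300, 14)                   # 14-dim MFCC frames, as in the talk
linear_feats = np.hstack([mfcc, deltas(mfcc)])    # linear transform: append deltas

# The nonlinear alternative discussed here: pass context-stacked MFCC frames through
# a trained bottleneck network and keep its 42-dim bottleneck activations instead.
```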
0:06:01 | in other words |
---|
0:06:02 | we identified two |
---|
0:06:05 | ways of using a deep belief network to do that |
---|
0:06:09 | the first one is |
---|
0:06:12 | we can change the training algorithm in such a way that |
---|
0:06:16 | it fits the speaker recognition application better |
---|
0:06:21 | the second one is that most speaker |
---|
0:06:24 | recognition systems deal with system combination |
---|
0:06:27 | so |
---|
0:06:29 | we explored whether there is a way to incorporate |
---|
0:06:32 | a separate system in the training |
---|
0:06:42 | next I will talk about these |
---|
0:06:45 | ideas in some detail |
---|
0:06:49 | first of all, there is the problem with frame-level training |
---|
0:06:54 | learning the speaker information is constrained to |
---|
0:07:00 | the context of the current frame |
---|
0:07:02 | what if you want to increase the context |
---|
0:07:11 | our conversation-level training algorithm |
---|
0:07:17 | is the solution to these problems |
---|
0:07:21 | first of all, if the training is at the conversation level |
---|
0:07:28 | it will be making one single decision per recording |
---|
0:07:33 | which should be the case |
---|
0:07:35 | and another advantage could be |
---|
0:07:43 | there are several ways of doing this to explore |
---|
0:07:47 | I will explain that later |
---|
0:07:52 | so the first key idea is |
---|
0:07:56 | using a speaker recognition training criterion |
---|
0:07:59 | the speaker recognition training criterion is |
---|
0:08:02 | a log-likelihood-ratio-based training criterion |
---|
0:08:07 | developed by Brummer |
---|
0:08:09 | it is a weighted sum of costs |
---|
0:08:13 | over the target trials plus the non-target trials |
---|
0:08:20 | this uses this kind of objective to |
---|
0:08:27 | try to find the target |
---|
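For reference, the log-likelihood-ratio cost introduced by Brummer is commonly written as below, where s_t is the log-likelihood-ratio score of trial t and N_tar, N_non count the target and non-target trials; the transcript does not show the speaker's exact slide, so take this as the standard form rather than their precise objective.

```latex
C_{\mathrm{llr}} \;=\; \frac{1}{2 N_{\mathrm{tar}}} \sum_{t \in \mathrm{tar}} \log_2\!\left(1 + e^{-s_t}\right)
\;+\; \frac{1}{2 N_{\mathrm{non}}} \sum_{t \in \mathrm{non}} \log_2\!\left(1 + e^{s_t}\right)
```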
0:08:38 | and I will remind you that the |
---|
0:08:40 | upper layer is the speaker layer |
---|
0:08:48 | as I mentioned earlier, for each recording we should make one decision |
---|
0:08:55 | there are several ways of doing that, of how to feed back |
---|
0:09:00 | make a decision on one frame |
---|
0:09:03 | make a decision on another frame of the recording |
---|
0:09:08 | we took another approach; here is what we do |
---|
0:09:12 | the scores are averaged at the output layer before the nonlinearity |
---|
0:09:18 | which means that for each frame over the recording we compute the statistics and |
---|
0:09:24 | sum them and average them |
---|
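A minimal sketch of that pooling rule, assuming the per-frame speaker-layer pre-activations come from a forward pass like the earlier sketch: the pre-nonlinearity outputs of all frames in a recording are averaged, and the output nonlinearity (a softmax here, which is an assumption) is applied only once, giving a single decision per recording.

```python
import numpy as np

def conversation_level_scores(frame_preactivations):
    """Average speaker-layer pre-activations over all frames of one recording,
    then apply the output nonlinearity once -> one decision per recording."""
    pooled = frame_preactivations.mean(axis=0)        # (num_speakers,)
    pooled = pooled - pooled.max()                    # numerical stability
    return np.exp(pooled) / np.exp(pooled).sum()      # softmax (assumed nonlinearity)

frame_scores = np.random.randn(4500, 173)   # e.g. 45 s of frames x 173 speakers
recording_probs = conversation_level_scores(frame_scores)
decision = recording_probs.argmax()          # single speaker decision for the recording
```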
0:09:32 | as I mentioned earlier, the second key idea of my methodology is |
---|
0:09:38 | using a separate system in the training |
---|
0:09:43 | in this diagram, here is the top layer of the network |
---|
0:09:49 | as before, we have a BN score generation scheme here |
---|
0:09:54 | we have a standard MFCC system |
---|
0:09:58 | but since we are using a likelihood-ratio-based |
---|
0:10:01 | training criterion, this score must be weighted |
---|
0:10:05 | against the score based on the |
---|
0:10:09 | bottleneck features |
---|
0:10:12 | so we do that with a linear combination |
---|
0:10:16 | of these two types of scores and use that in the training criterion |
---|
0:10:25 | so one question is how the calibration is achieved |
---|
0:10:30 | as we see here we have three parameters in the bottom equation |
---|
0:10:36 | w1, w2, and kappa are estimated |
---|
0:10:44 | by minimizing our training objective while the network weights are fixed |
---|
0:10:53 | after these parameters are estimated |
---|
0:10:55 | they are held fixed and the network weights are estimated |
---|
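A hedged sketch of that calibration step: with the network weights frozen, only w1, w2, and kappa are fit by minimizing a Cllr-style objective over a set of calibration trials. The synthetic scores, the use of scipy's Nelder-Mead optimizer, and the exact form of the cost are illustrative assumptions, not the speaker's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def cllr(scores, labels):
    """Log-likelihood-ratio cost; labels are 1 for target trials, 0 otherwise."""
    tar, non = scores[labels == 1], scores[labels == 0]
    return 0.5 * (np.mean(np.log2(1 + np.exp(-tar))) + np.mean(np.log2(1 + np.exp(non))))

def fused(params, s_bn, s_mfcc):
    w1, w2, kappa = params
    return w1 * s_bn + w2 * s_mfcc + kappa   # the linear combination described on the slide

# Hypothetical calibration trials: bottleneck scores, MFCC scores, and trial labels.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=2000)
s_bn = rng.normal(labels * 2.0, 1.0)
s_mfcc = rng.normal(labels * 1.5, 1.0)

# Network weights stay fixed; only (w1, w2, kappa) are optimized in this step.
res = minimize(lambda p: cllr(fused(p, s_bn, s_mfcc), labels),
               x0=[1.0, 1.0, 0.0], method="Nelder-Mead")
print(res.x)   # estimated w1, w2, kappa
```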
0:11:03 | as many of the speakers have mentioned, we use a |
---|
0:11:09 | standard speaker recognition system after we extract the features |
---|
0:11:15 | namely we use a UBM, we have i-vectors |
---|
0:11:19 | we have PLDA |
---|
0:11:23 | I will skip this part; next I will report the experimental results |
---|
0:11:32 | we ran experiments on the same- and different-microphone tasks of NIST SRE 2010 |
---|
0:11:41 | this is our main interest, our target |
---|
0:11:47 | for the bottleneck network training we use microphone recordings only |
---|
0:11:55 | we use all SRE 2004, 2005, 2006 and Switchboard |
---|
0:12:02 | data |
---|
0:12:05 | in our experiments |
---|
0:12:09 | we used microphone recordings for the bottleneck network training |
---|
0:12:14 | we have 173 |
---|
0:12:17 | speakers in the training and validation sets |
---|
0:12:21 | this gives us about 4341 recordings in training |
---|
0:12:26 | and 865 recordings; in terms of number of |
---|
0:12:32 | input samples |
---|
0:12:34 | we have about a few million samples in training |
---|
0:12:39 | and two million in calibration |
---|
0:12:46 | the exact numbers were |
---|
0:12:48 | like this: we have a |
---|
0:12:51 | 294-dimensional input, 14 times 21; we use |
---|
0:12:59 | plus and minus ten frames as context frames |
---|
0:13:02 | the network has 1000 hidden nodes |
---|
0:13:08 | followed by a 42-node bottleneck |
---|
0:13:12 | layer; these are fully connected to another 500 nodes |
---|
0:13:18 | and an output layer of 173 speakers |
---|
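As a sanity check on those numbers, a small sketch of how the 294-dimensional input can be formed: 14-dimensional MFCC frames stacked over the current frame plus and minus ten neighbours, i.e. 21 frames in total. The edge padding at utterance boundaries is an assumption.

```python
import numpy as np

def stack_context(mfcc, left=10, right=10):
    """Stack each 14-dim frame with +/-10 neighbours -> 14 * 21 = 294-dim inputs."""
    padded = np.pad(mfcc, ((left, right), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(mfcc)] for i in range(left + right + 1)])

mfcc = np.random.randn(500, 14)
inputs = stack_context(mfcc)
print(inputs.shape)   # (500, 294)
```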
0:13:24 | I would like to mention |
---|
0:13:27 | the processing of the input features and bottleneck features |
---|
0:13:35 | the input features are mean and variance normalized |
---|
0:13:41 | before entering the network |
---|
0:13:43 | the mean and variance are estimated over a window of three seconds of speech |
---|
0:13:50 | and this is also applied to the bottleneck features |
---|
0:13:54 | to make them compatible with the diagonal-covariance GMM assumption |
---|
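A sketch of that normalization under stated assumptions: a window of roughly three seconds centred on each frame (300 frames at an assumed 10 ms frame shift), with the same routine applied to the bottleneck features before the diagonal-covariance GMM stage.

```python
import numpy as np

def window_mvn(features, win=300):
    """Mean/variance-normalize each frame using statistics from a window
    of `win` frames (~3 s at a 10 ms frame shift) centred on it."""
    half = win // 2
    out = np.empty_like(features)
    for t in range(len(features)):
        seg = features[max(0, t - half):t + half + 1]
        out[t] = (features[t] - seg.mean(axis=0)) / (seg.std(axis=0) + 1e-8)
    return out

feats = np.random.randn(1000, 42)     # e.g. 42-dim bottleneck features
normalized = window_mvn(feats)
```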
0:14:02 | this one shows the effect of the training criteria |
---|
0:14:08 | presented in the blue and red columns here |
---|
0:14:13 | the red color is the traditionally |
---|
0:14:17 | trained network, and now we have the green color |
---|
0:14:21 | where we use the conversation-level training |
---|
0:14:24 | that I just described |
---|
0:14:26 | you may remember that the decrease was 40% |
---|
0:14:31 | on the same-microphone test and it was 45% |
---|
0:14:37 | on the different-microphone test |
---|
0:14:41 | it became 30% and 34% respectively |
---|
0:14:46 | the difference is more |
---|
0:14:50 | clearly observed in the |
---|
0:14:53 | different-microphone performance in terms of |
---|
0:15:01 | equal error rate; relative to the traditionally trained network, it is now 30% |
---|
0:15:09 | we also explore the effect of the bottleneck layer size in the next slide |
---|
0:15:15 | the effect of the layer size, yes |
---|
0:15:17 | is observed on the training set |
---|
0:15:21 | as we increase the bottleneck feature vector size we got an improvement |
---|
0:15:27 | which we did explore |
---|
0:15:34 | this slide shows the combination strategy I just mentioned |
---|
0:15:40 | the blue column shows the MFCC baseline |
---|
0:15:45 | the red column shows the linear combination of the two scores using a toolkit |
---|
0:15:52 | the green column shows |
---|
0:15:55 | training the network with the separate system |
---|
0:16:01 | yes, we also get an improvement of 18% |
---|
0:16:06 | with this strategy |
---|
0:16:10 | I would like to |
---|
0:16:16 | conclude: we showed |
---|
0:16:19 | one way to train the bottleneck network using a conversation-level |
---|
0:16:23 | speaker recognition training criterion |
---|
0:16:26 | and we also showed how to use |
---|
0:16:28 | a separate system in training the network |
---|
0:16:32 | and thank you, questions are welcome |
---|
0:16:47 | these are just features, yes |
---|
0:16:52 | it's just the same; so instead of MFCC features we use bottleneck features |
---|
0:17:23 | yes, we do use a GMM |
---|
0:17:26 | so what we said is that MFCC deltas and delta-deltas are a product of a linear transformation |
---|
0:17:41 | so our question is, if we perform a nonlinear transformation, can we do better |
---|
0:18:35 | we did; we started with two thousand |
---|
0:18:49 | we actually first started with the baseline system; it has 34 |
---|
0:19:00 | the baseline features are obtained the same way |
---|
0:19:46 | actually this combination |
---|
0:19:48 | was a two-step combination |
---|
0:19:51 | first we use these MFCC scores in training the network |
---|
0:19:54 | after that we get the PLDA scores and we combine these scores |
---|
0:20:13 | oh, I have those results actually |
---|
0:20:18 | the linear combination, this red column, shows that |
---|
0:20:26 | so we got perhaps 30% |
---|
0:20:38 | we don't have any deltas |
---|
0:20:42 | we actually wanted to avoid that |
---|
0:21:21 | we test on 2010 |
---|
0:21:25 | the training data are from 2004, 2005, and 2006 |
---|