0:00:14 | Okay, the last paper of this session is on speaker recognition, and it is going to be presented now.

0:00:29 | Thanks. So this is some collaborative work with the SRI International speech lab. The title is "Recent progress in prosodic speaker verification".

0:00:40 | So it's about prosodic speaker verification, but more generally it is a follow-up work on what we call the subspace multinomial model, which we presented at the last Interspeech.

0:00:53 | This is mainly a modeling technique, similar to the total variability modeling used to obtain i-vectors, but it is not used to model the Gaussian mean parameters; rather, it models multinomial parameters.

0:01:09 | So what are the claims of this work? The first one is to illustrate the quite complex system building process with a simple toy example and some figures.

0:01:25 | The main claim is to introduce a new way to model the i-vectors we obtain, namely probabilistic linear discriminant analysis.

0:01:38 | We also wanted to compare this approach to two of the main prosodic systems that are out there in the field.

0:01:47 | And finally, because prosody is always just a higher level of speech, we want to combine it with a state-of-the-art cepstral baseline system.

0:01:58 | "'kay" now come see by example to explain the system building process and |

0:02:02 | just uh imagine we have some conversational speech utterance so |

0:02:06 | yeah it's just |

0:02:07 | the some |

0:02:08 | except from the nist or |

0:02:10 | so that's S and uh where are you where which state are you in |

0:02:15 | As we work with prosodic features, we first extract some fundamental frequency measures, so we extract the pitch.

0:02:28 | Second, we have some energy measurement; we just use a normalized energy for that.

0:02:34 | And finally we have some duration measures, and these come from an ASR system, so they are quite accurate.

0:02:42 | In this case we have ten syllables, all one-syllable words, so the syllable segmentation is the same as the word segmentation.

0:02:52 | Now we have these three measurements, and from them we obtain SNERFs, that is, syllable-based non-uniform extraction region features.

0:03:06 | We take many measurements of pitch, energy, and duration: for instance, we cut out segments based on the syllables and measure the duration of, say, the onset or the coda, or we measure the rise and fall of the pitch and the energy, the mean, the maximum.

0:03:24 | I illustrate this with just a few measurements: we take the mean of each segment, here for the pitch (the squares) and for the energy (the crosses), and as a duration measurement in this example simply the number of frames for each syllable.
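The per-syllable measurements described here can be sketched in a few lines. This is only a minimal illustration of the idea, not the actual SNERF extraction; the boundary format, the unvoiced-frame convention, and all toy values are assumptions of mine.

```python
import numpy as np

def syllable_prosodic_features(pitch, energy, boundaries):
    """For each syllable segment, take the mean pitch (over voiced frames),
    the mean normalized energy, and the duration in frames."""
    feats = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        seg_pitch = pitch[start:end]
        voiced = seg_pitch[seg_pitch > 0]          # pitch 0 marks unvoiced
        feats.append((voiced.mean() if voiced.size else 0.0,
                      energy[start:end].mean(),
                      end - start))                # duration = frame count
    return feats

# toy frame-level tracks and three syllable boundaries (frame indices)
pitch = np.array([120., 125., 0., 130., 140., 135., 0., 0., 110., 115.])
energy = np.array([.5, .6, .2, .7, .8, .75, .1, .1, .4, .5])
feats = syllable_prosodic_features(pitch, energy, [0, 4, 8, 10])
# three syllables, each giving a (mean pitch, mean energy, duration) triple
```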

0:03:46 | Now we have a combination of discrete values and continuous values, so how do we model these? The solution was to bin them into some discrete classes, and the best way we found was to use a soft binning, that is, to train small Gaussian mixture models.

0:04:09 | So we take, say, the mean measurement, train a model on it, and so on. We extract these measurements from audio background data, and then, in our example, we train a two-component mixture on this one-dimensional duration measure, as you see here.

0:04:26 | We do the same for the pitch; in this example we use three mixture components. And for the energy we use four mixtures.

0:04:37 | So now we have this background model for each of the SNERFs, and we can start parameterising our speech utterance with the SNERFs we extract from it. In the example, I just plot the ten values from the top against the GMMs.

0:05:01 | Now you can see where they hit the GMMs, and we can compute the posterior probabilities that each Gaussian generated each frame. So this is like a soft binning, and we can generate soft counts: you see for the duration GMM the first Gaussian gets a count of six point three, the other three point seven, and so on. As you can see, this is the same thing you would compute for the weights of a GMM.

0:05:31 | So we don't compute any means or anything like that; the background model is thus a multinomial model. If you plot the multinomial model space for the duration GMM, you see that you get just a line, because it is always restricted to sum to one.
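The soft-binning step, per-frame GMM posteriors summed into soft counts (the same zeroth-order statistics used for GMM weights), can be sketched as follows; the toy GMM parameters and data are made up for illustration.

```python
import numpy as np

def soft_counts(x, means, variances, weights):
    """Soft-bin 1-D measurements with a GMM: per-frame posterior
    responsibilities, summed over frames, give the soft counts."""
    x = np.asarray(x)[:, None]                      # shape (T, 1)
    # Gaussian log-likelihood of each frame under each component
    ll = (-0.5 * np.log(2 * np.pi * variances)
          - 0.5 * (x - means) ** 2 / variances)     # shape (T, K)
    ll += np.log(weights)
    # normalise per frame -> posterior P(component | frame)
    post = np.exp(ll - ll.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    return post.sum(axis=0)                          # (K,) soft counts

# toy: 10 syllable-duration values, 2-component background GMM
durations = np.array([8., 9., 7., 10., 22., 25., 8., 9., 24., 7.])
counts = soft_counts(durations,
                     means=np.array([8., 24.]),
                     variances=np.array([4., 9.]),
                     weights=np.array([0.5, 0.5]))
# the soft counts always sum to the number of frames (here 10),
# which is exactly the sum-to-one restriction on the multinomial weights
```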

0:05:52 | For the pitch you would get a two-dimensional simplex; it can move around, but it is always restricted. And for the energy in this example you would get a three-dimensional simplex. So this would be the full multinomial model space.

0:06:06 | What we try to do with the subspace multinomial model is to learn some low-dimensional structure in the data, that is, how the parameters move together as they change.

0:06:21 | Furthermore, these are so far really independent features, but we also try to learn correlations between the distributions; then we can move inside all of these subspaces at once with the same parameters.

0:06:39 | What is shown now is a single-dimensional subspace plotted in these spaces; one parameter controls all these spaces at once. You see that we have to project back into the multinomial space to really get the one-dimensional lines.

0:07:00 | So if you move just one number of the i-vector, you move along the colours, towards red as you increase the number. The black dot is just one extracted i-vector.

0:07:13 | So then we use this model to extract the i-vectors, like the standard i-vector extractor, to obtain a low-dimensional representation of the whole utterance.
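A minimal sketch of the subspace multinomial idea: a low-dimensional vector (the i-vector analogue) is mapped through a subspace matrix and a softmax onto the probability simplex, so moving one coordinate moves all bin probabilities along a curve inside the simplex. The dimensions and parameter values below are arbitrary, and real training of m and T is of course omitted.

```python
import numpy as np

def smm_probs(m, T, w):
    """Subspace multinomial model: a low-dimensional vector w is mapped
    through an affine subspace (m + T w) and a softmax onto the simplex."""
    eta = m + T @ w                  # natural (log-weight) parameters
    e = np.exp(eta - eta.max())      # stabilised softmax
    return e / e.sum()

rng = np.random.default_rng(0)
m = rng.normal(size=4)               # background log-weights (4 energy bins)
T = rng.normal(size=(4, 2))          # a 2-D subspace inside the 3-D simplex
probs = smm_probs(m, T, np.array([0.5, -1.0]))
# probs lies on the simplex: strictly positive and summing to one
```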

0:07:28 | Let's zoom into this four-dimensional energy part and look at a two-dimensional subspace. Now we see that we get a nonlinear plane in the three-dimensional space, where the movement is restricted; look at the black dots you see here.

0:07:46 | These are really real data, just projected back from the data, and if we zoom in there, the colours show the different speakers: ten speakers with ten utterances each.

0:08:02 | The funny thing is that although we have a two-dimensional space here, we already see that it is somehow a one-dimensional space. The subspace seems to be even smaller, so probably even one dimension would be enough for this example. And you can already see that you can distinguish the speakers quite well here.

0:08:20 | But of course this is the multinomial model space; in the end we use the i-vectors and model them in the i-vector space. If we plot the i-vectors themselves, we really get these nice clusters for the speakers, and this is without any compensation or anything.

0:08:41 | And that's where the PLDA model comes in. Let's go to another artificial data set, also in a two-dimensional space: again i-vectors in 2-D, for four speakers. The big dots are the means of the speakers, and then we have several utterances each.

0:09:03 | Now we use a linear-Gaussian model assumption. We have some across-class variability (these are the solid lines, where you really see the variability between the classes) and then a shared common within-class covariance matrix: all the speakers, that is, the individual utterances, share the same covariance matrix.

0:09:37 | From this you can really see that even though we have these distant dots from the red speaker, they are recognised as being from the same speaker, because of the big variability in this dimension; and if you go from here to there, you see these are different speakers, they don't belong together.

0:09:56 | This is what the model explains. What we use is a probabilistic model, and the nice thing is that we can really train the parameters, the covariance matrices, with the EM algorithm.

0:10:10 | And secondly, we can use the PLDA model to directly evaluate likelihoods, instead of the likelihood ratio you would compute with the UBM; we can even compute a proper likelihood ratio for whether two i-vectors were generated by the same speaker or not.

0:10:30 | If you look at it, it looks quite complicated, but if you look at the numerator you really see what happens: we have the i-vectors w and the prior for a given speaker; the two i-vectors share the same prior P(y), and we just integrate over all speakers. So we don't really care which speaker it is; we just ask whether they are from the same one or not.

0:10:57 | And in the denominator we have the marginal probabilities, as if they came from different speakers. The nice thing is that this can be evaluated analytically, we can solve it, and the scoring can be performed very efficiently.
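As a hedged sketch of such a closed-form score, here is the two-covariance flavour of PLDA (a simplification, not necessarily the exact model of the paper): the same-speaker versus different-speaker log-likelihood ratio of two i-vectors under a Gaussian across-class covariance B and within-class covariance W. All matrices and vectors below are toy values.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def plda_llr(w1, w2, B, W):
    """Two-covariance PLDA verification score: log-likelihood ratio of
    'same speaker' vs 'different speakers' for two i-vectors.
    B = across-class (speaker) covariance, W = within-class covariance."""
    d = len(w1)
    tot = B + W                         # marginal covariance of one i-vector
    # joint covariance of the stacked pair under the same-speaker hypothesis
    same = np.block([[tot, B], [B, tot]])
    num = mvn.logpdf(np.concatenate([w1, w2]), np.zeros(2 * d), same)
    den = mvn.logpdf(w1, np.zeros(d), tot) + mvn.logpdf(w2, np.zeros(d), tot)
    return num - den

B = np.eye(2) * 4.0    # large speaker variability
W = np.eye(2) * 1.0    # small within-speaker variability
close = plda_llr(np.array([2., 2.]), np.array([2.2, 1.9]), B, W)
far = plda_llr(np.array([2., 2.]), np.array([-2., -2.]), B, W)
# nearby i-vectors score higher (same-speaker) than distant ones
```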

0:11:20 | So, to the experiments. I present them on the NIST SRE 2008 task, that is, on what we used for the NIST 2010 development, so it is mainly the telephone condition.

0:11:38 | The target samples stay the same, but the number of impostor samples was increased a lot because of the new measurements, the new DCF, which emphasizes the very low false-alarm rates.

0:11:55 | The UBM, the subspace model, the PLDA: everything is trained on SRE 04, 05 and Switchboard data.

0:12:04 | We evaluate three different systems. The first two are state-of-the-art systems that are out there, and the third is the one we propose here. The first is the polynomial JFA system: joint factor analysis modeling of Gaussian mean parameters, using quite simple polynomial contour features. These are a subset of the SNERF features, but they are just thirteen-dimensional and really just approximate the contour over each syllable.

0:12:37 | The second is the SNERF-SVM system. Up to the point where we have the soft counts of the SNERFs, it is exactly the same system as ours, but the modeling is then done by taking these really high-dimensional vectors, about thirty thousand dimensions, putting them into an SVM and training it, so it's quite demanding.

0:12:57 | We instead use the i-vector extractor to go down to a dimensionality of about two hundred from the thirty thousand, and in this low-dimensional space we can use really sophisticated machine learning; the PLDA model seems to be a very nice way to do this.

0:13:17 | And finally we have the baseline system, the SRI cepstral system for the 2010 evaluation, and we fuse with it.

0:13:27 | Here is the DET plot showing the single prosodic systems. The red line is the polynomial system, the blue is the SNERF-SVM system, and the green is the PLDA system; the dots mark the new DCF, the old DCF and the equal error rate, from left to right.

0:13:48 | We see that we get a big improvement over both systems on the equal error rate: we reach about six point nine percent with the PLDA modeling, while all the others are around ten percent. And also on the old DCF we get a big improvement.

0:14:04 | What is quite strange, however, is that the SNERF-SVM system always somehow performs slightly better if you go to the very low false-alarm region; that is some behaviour we really can't explain right now.

0:14:20 | The next results are on the fusion. We have the baseline system with an equal error rate of one point six percent, and the corresponding new DCF. We did a score-level fusion by logistic regression, with a jackknifing approach.

0:14:39 | You see the fusions of the polynomial and the SNERF-SVM systems with the baseline; with our system included, we get better results on the new DCF and on the equal error rate, and with the PLDA system we even get the best system overall.

0:14:55 | What is quite confusing: we get the best result on the equal error rate, one point four seven, but we also get the best improvement on the new DCF for our system. If you remember the single systems, it was the other way around: we were better on the equal error rate and the other system was better on the new DCF, yet in the fusion it is somehow reversed.

0:15:18 | We really want to try this again: the fusion here was trained on the 2008 data, so we could train it on 2008 and apply it to the 2010 data; the fusion weights would probably change in that case.
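Score-level fusion by logistic regression can be sketched like this. The scores are synthetic stand-ins for the real system outputs, scikit-learn's LogisticRegression stands in for whatever calibration tool was actually used, and the jackknifing is omitted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy scores from two systems (cepstral baseline + prosodic) on dev trials
rng = np.random.default_rng(1)
n = 500
labels = rng.integers(0, 2, n)               # 1 = target (same-speaker) trial
s1 = labels * 2.0 + rng.normal(0, 1.0, n)    # baseline scores (stronger)
s2 = labels * 1.0 + rng.normal(0, 1.5, n)    # prosodic scores (weaker)
X = np.stack([s1, s2], axis=1)

# linear logistic-regression fusion: the fused score is a learned
# weighted sum of the per-system scores plus an offset
fuser = LogisticRegression().fit(X, labels)
fused = fuser.decision_function(X)
# target trials receive higher fused scores than impostor trials
```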

0:15:33 | So, to the conclusion. We can say that the PLDA clearly outperformed the alternatives. Okay, I didn't mention it here because it was in the last paper: we previously did cosine distance scoring with LDA and WCCN, and with the PLDA we get about twenty percent relative improvement over that.
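For reference, the cosine-distance scoring that the PLDA back-end is compared against reduces to the following; the LDA and WCCN projections that would be applied to the i-vectors first are omitted here, and the vectors are toy values.

```python
import numpy as np

def cosine_score(w1, w2):
    """Cosine similarity between two i-vectors: the simpler scoring
    back-end that the PLDA model is compared against."""
    return float(w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2)))

same_dir = cosine_score(np.array([1., 2.]), np.array([2., 4.]))  # -> 1.0
orth = cosine_score(np.array([1., 0.]), np.array([0., 1.]))      # -> 0.0
```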

0:15:54 | Generally, the i-vector PLDA system gives the best performance, six point nine percent equal error rate, and that is, to our knowledge, the best score for a single prosodic system.

0:16:08 | We still have to investigate the decrease in the low false-alarm region. And the fusion gives around ten percent relative improvement on the new DCF measure, which is quite nice.

0:16:20 | For future work, we want to investigate other channels and speech styles; this is just telephone, and in the NIST evaluations there is otherwise only microphone data and no other conversational speech.

0:16:34 | Another thing we are already trying is the i-vector modeling with PLDA for the simple polynomial features that were used in the JFA modeling before, and then to combine both systems: one based on Gaussian modeling, the other on multinomial modeling; one based on the SNERFs, the other on the polynomial features. Hopefully you can see that at Interspeech.

0:16:58 | That's it, thank you.

0:17:05 | Okay, we have time for questions.

0:17:19 | What was your baseline result without adding any prosodics? How much did any of the prosodic systems add on top of your baseline?

0:17:27 | You mean the i-vector baseline? That was one point six percent equal error rate; it's the first line at the top of the table.

0:17:41 | Oh sorry, I didn't see that. Okay, thanks.

0:17:47 | In your talk, I can't remember, did you actually do any score normalization?

0:17:52 | That's a nice thing I didn't mention here: usually for PLDA we don't need any score normalization. What about the SVM system?

0:18:01 | The SVM system has that built in, in a way. This is just a bit of speculation, but sometimes having the background data set in the SVM training can act as score normalization as well.

0:18:15 | Perhaps I can point you to some work on that; it is dependent on the background set, and it can actually rotate the DET curve a little bit.

0:18:24 | Particularly in the region we are talking about. Perhaps what is happening is still just a kind of score normalization through the background selection: you might be seeing the improved min DCF in the SVM system for this reason, and perhaps the fusion is counteracting something that score normalization would otherwise give. The SVM system is not really that good on its own, but probably helps in combination; perhaps the normalization effect is being counteracted in the fusion.

0:19:00 | Okay, thanks.

0:19:12 | One more from the back; yeah, go ahead.

0:19:19 | I wanted to ask whether you normalize the i-vectors, or use them as-is without any of these fancy tricks, because I was wondering, since you have this nonlinearity, whether we would see the same behaviour here.

0:19:35 | When you look at the i-vectors, in the end they really are Gaussian distributed, if you look at the distribution.

0:19:48 | So no normalization: in this case length normalization and so on never helped. I always tried all these tricks and they don't help me at all for this system; I don't know why, they just don't work for me.

0:20:02 | Thanks, good to know.

0:20:09 | Okay, if there are no further questions:

0:20:20 | that's the end of the session; let's thank the speakers.