0:00:14 | Okay, the last paper of this session is on speaker recognition, and it is going to be presented now.

0:00:29 | Thanks. So this is some collaborative work with the SRI International speech lab. The title is "Recent progress in prosodic speaker verification".

0:00:40 | So it's about prosodic speaker verification, but more generally it is a follow-up work on what we call the subspace multinomial model, which we presented at the last Interspeech.

0:00:53 | This is mainly a modeling technique, similar to the total variability modeling used to obtain i-vectors, but it is not used to model the Gaussian mean parameters; rather, it models multinomial parameters.

0:01:09 | So what are the claims of this work? The first one is to illustrate the quite complex system building process with a simple toy example and some figures.

0:01:25 | The main claim is to introduce a new way to model the i-vectors we obtain, namely probabilistic linear discriminant analysis.

0:01:38 | We also wanted to compare this approach to two of the main prosodic systems that are out there in the field.

0:01:47 | And finally, because prosody is always just a higher level of speech, we want to combine it with a state-of-the-art cepstral baseline system.

0:01:58 | "'kay" now come see by example to explain the system building process and |

0:02:02 | just uh imagine we have some conversational speech utterance so |

0:02:06 | yeah it's just |

0:02:07 | the some |

0:02:08 | except from the nist or |

0:02:10 | so that's S and uh where are you where which state are you in |

0:02:15 | As we work with prosodic features, we first extract some fundamental frequency measures, so we extract the pitch.

0:02:28 | Second, we have some energy measurement; we just use a normalized energy for that.

0:02:34 | And finally we have some duration measures, and these come from an ASR system, so they are quite accurate.

0:02:42 | In this case we have ten syllables, all one-syllable words, so the syllable segmentation is the same as the word segmentation.

0:02:52 | Now we have these three measurements, and from them we obtain SNERFs, that is, syllable-based non-uniform extraction region features.

0:03:06 | We take many measurements of pitch, energy, and duration: for instance, we cut out segments based on the syllables and measure the duration of, say, the onset or the coda, or we measure the rise and fall of the pitch and the energy, the mean, the maximum.

0:03:24 | I illustrate this with just a few measurements: we take the mean of each segment, here for the pitch (the squares) and for the energy (the crosses), and as a duration measurement in this example simply the number of frames for each syllable.
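The per-syllable measurements described here can be sketched in a few lines. This is only a minimal illustration of the idea, not the actual SNERF extraction; the boundary format, the unvoiced-frame convention, and all toy values are assumptions of mine.

```python
import numpy as np

def syllable_prosodic_features(pitch, energy, boundaries):
    """For each syllable segment, take the mean pitch (over voiced frames),
    the mean normalized energy, and the duration in frames."""
    feats = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        seg_pitch = pitch[start:end]
        voiced = seg_pitch[seg_pitch > 0]          # pitch 0 marks unvoiced
        feats.append((voiced.mean() if voiced.size else 0.0,
                      energy[start:end].mean(),
                      end - start))                # duration = frame count
    return feats

# toy frame-level tracks and three syllable boundaries (frame indices)
pitch = np.array([120., 125., 0., 130., 140., 135., 0., 0., 110., 115.])
energy = np.array([.5, .6, .2, .7, .8, .75, .1, .1, .4, .5])
feats = syllable_prosodic_features(pitch, energy, [0, 4, 8, 10])
# three syllables, each giving a (mean pitch, mean energy, duration) triple
```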

0:03:46 | Now we have a combination of discrete values and continuous values, so how do we model these? The solution was to bin them into some discrete classes, and the best way we found was to use a soft binning, that is, to train small Gaussian mixture models.

0:04:09 | So we take, say, the mean measurement, train a model on it, and so on. We extract these measurements from audio background data, and then, in our example, we train a two-component mixture on this one-dimensional duration measure, as you see here.

0:04:26 | We do the same for the pitch; in this example we use three mixture components. And for the energy we use four mixtures.

0:04:37 | So now we have this background model for each of the SNERFs, and we can start parameterising our speech utterance with the SNERFs we extract from it. In the example, I just plot the ten values from the top against the GMMs.

0:05:01 | Now you can see where they hit the GMMs, and we can compute the posterior probabilities that each Gaussian generated each frame. So this is like a soft binning, and we can generate soft counts: you see for the duration GMM the first Gaussian gets a count of six point three, the other three point seven, and so on. As you can see, this is the same thing you would compute for the weights of a GMM.

0:05:31 | So we don't compute any means or anything like that; the background model is thus a multinomial model. If you plot the multinomial model space for the duration GMM, you see that you get just a line, because it is always restricted to sum to one.
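The soft-binning step, per-frame GMM posteriors summed into soft counts (the same zeroth-order statistics used for GMM weights), can be sketched as follows; the toy GMM parameters and data are made up for illustration.

```python
import numpy as np

def soft_counts(x, means, variances, weights):
    """Soft-bin 1-D measurements with a GMM: per-frame posterior
    responsibilities, summed over frames, give the soft counts."""
    x = np.asarray(x)[:, None]                      # shape (T, 1)
    # Gaussian log-likelihood of each frame under each component
    ll = (-0.5 * np.log(2 * np.pi * variances)
          - 0.5 * (x - means) ** 2 / variances)     # shape (T, K)
    ll += np.log(weights)
    # normalise per frame -> posterior P(component | frame)
    post = np.exp(ll - ll.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    return post.sum(axis=0)                          # (K,) soft counts

# toy: 10 syllable-duration values, 2-component background GMM
durations = np.array([8., 9., 7., 10., 22., 25., 8., 9., 24., 7.])
counts = soft_counts(durations,
                     means=np.array([8., 24.]),
                     variances=np.array([4., 9.]),
                     weights=np.array([0.5, 0.5]))
# the soft counts always sum to the number of frames (here 10),
# which is exactly the sum-to-one restriction on the multinomial weights
```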

0:05:52 | For the pitch you would get a two-dimensional simplex; it can move around, but it is always restricted. And for the energy in this example you would get a three-dimensional simplex. So this would be the full multinomial model space.

0:06:06 | What we try to do with the subspace multinomial model is to learn some low-dimensional structure in the data, that is, how the parameters move together as they change.

0:06:21 | Furthermore, these are so far really independent features, but we also try to learn correlations between the distributions; then we can move inside all of these subspaces at once with the same parameters.

0:06:39 | What is shown now is a single-dimensional subspace plotted in these spaces; one parameter controls all these spaces at once. You see that we have to project back into the multinomial space to really get the one-dimensional lines.

0:07:00 | So if you move just one number of the i-vector, you move along the colours, towards red as you increase the number. The black dot is just one extracted i-vector.

0:07:13 | So then we use this model to extract the i-vectors, like the standard i-vector extractor, to obtain a low-dimensional representation of the whole utterance.
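A minimal sketch of the subspace multinomial idea: a low-dimensional vector (the i-vector analogue) is mapped through a subspace matrix and a softmax onto the probability simplex, so moving one coordinate moves all bin probabilities along a curve inside the simplex. The dimensions and parameter values below are arbitrary, and real training of m and T is of course omitted.

```python
import numpy as np

def smm_probs(m, T, w):
    """Subspace multinomial model: a low-dimensional vector w is mapped
    through an affine subspace (m + T w) and a softmax onto the simplex."""
    eta = m + T @ w                  # natural (log-weight) parameters
    e = np.exp(eta - eta.max())      # stabilised softmax
    return e / e.sum()

rng = np.random.default_rng(0)
m = rng.normal(size=4)               # background log-weights (4 energy bins)
T = rng.normal(size=(4, 2))          # a 2-D subspace inside the 3-D simplex
probs = smm_probs(m, T, np.array([0.5, -1.0]))
# probs lies on the simplex: strictly positive and summing to one
```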

0:07:28 | Let's zoom into this four-dimensional energy part and look at a two-dimensional subspace. Now we see that we get a nonlinear plane in the three-dimensional space, where the movement is restricted; look at the black dots you see here.

0:07:46 | These are really real data, just projected back from the data, and if we zoom in there, the colours show the different speakers: ten speakers with ten utterances each.

0:08:02 | The funny thing is that although we have a two-dimensional space here, we already see that it is somehow a one-dimensional space. The subspace seems to be even smaller, so probably even one dimension would be enough for this example. And you can already see that you can distinguish the speakers quite well here.

0:08:20 | But of course this is the multinomial model space; in the end we use the i-vectors and model them in the i-vector space. If we plot the i-vectors themselves, we really get these nice clusters for the speakers, and this is without any compensation or anything.

0:08:41 | And that's where the PLDA model comes in. Let's go to another artificial data set, also in a two-dimensional space: again i-vectors in 2-D, for four speakers. The big dots are the means of the speakers, and then we have several utterances each.

0:09:03 | Now we use a linear-Gaussian model assumption. We have some across-class variability (these are the solid lines, where you really see the variability between the classes) and then a shared common within-class covariance matrix: all the speakers, that is, the individual utterances, share the same covariance matrix.

0:09:37 | From this you can really see that even though we have these distant dots from the red speaker, they are recognised as being from the same speaker, because of the big variability in this dimension; and if you go from here to there, you see these are different speakers, they don't belong together.

0:09:56 | This is what the model explains. What we use is a probabilistic model, and the nice thing is that we can really train the parameters, the covariance matrices, with the EM algorithm.

0:10:10 | And secondly, we can use the PLDA model to directly evaluate likelihoods, instead of the likelihood ratio you would compute with the UBM; we can even compute a proper likelihood ratio for whether two i-vectors were generated by the same speaker or not.

0:10:30 | If you look at it, it looks quite complicated, but if you look at the numerator you really see what happens: we have the i-vectors w and the prior for a given speaker; the two i-vectors share the same prior P(y), and we just integrate over all speakers. So we don't really care which speaker it is; we just ask whether they are from the same one or not.

0:10:57 | And in the denominator we have the marginal probabilities, as if they came from different speakers. The nice thing is that this can be evaluated analytically, we can solve it, and the scoring can be performed very efficiently.
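As a hedged sketch of such a closed-form score, here is the two-covariance flavour of PLDA (a simplification, not necessarily the exact model of the paper): the same-speaker versus different-speaker log-likelihood ratio of two i-vectors under a Gaussian across-class covariance B and within-class covariance W. All matrices and vectors below are toy values.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def plda_llr(w1, w2, B, W):
    """Two-covariance PLDA verification score: log-likelihood ratio of
    'same speaker' vs 'different speakers' for two i-vectors.
    B = across-class (speaker) covariance, W = within-class covariance."""
    d = len(w1)
    tot = B + W                         # marginal covariance of one i-vector
    # joint covariance of the stacked pair under the same-speaker hypothesis
    same = np.block([[tot, B], [B, tot]])
    num = mvn.logpdf(np.concatenate([w1, w2]), np.zeros(2 * d), same)
    den = mvn.logpdf(w1, np.zeros(d), tot) + mvn.logpdf(w2, np.zeros(d), tot)
    return num - den

B = np.eye(2) * 4.0    # large speaker variability
W = np.eye(2) * 1.0    # small within-speaker variability
close = plda_llr(np.array([2., 2.]), np.array([2.2, 1.9]), B, W)
far = plda_llr(np.array([2., 2.]), np.array([-2., -2.]), B, W)
# nearby i-vectors score higher (same-speaker) than distant ones
```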

0:11:20 | So, to the experiments. I present them on the NIST SRE 2008 task, that is, on what we used for the NIST 2010 development, so it is mainly the telephone condition.

0:11:38 | The target samples stay the same, but the number of impostor samples was increased a lot because of the new measurements, the new DCF, which emphasizes the very low false-alarm rates.

0:11:55 | The UBM, the subspace model, the PLDA: everything is trained on SRE 04, 05 and Switchboard data.

0:12:04 | We evaluate three different systems. The first two are state-of-the-art systems that are out there, and the third is the one we propose here. The first is the polynomial JFA system: joint factor analysis modeling of Gaussian mean parameters, using quite simple polynomial contour features. These are a subset of the SNERF features, but they are just thirteen-dimensional and really just approximate the contour over each syllable.

0:12:37 | The second is the SNERF-SVM system. Up to the point where we have the soft counts of the SNERFs, it is exactly the same system as ours, but the modeling is then done by taking these really high-dimensional vectors, about thirty thousand dimensions, putting them into an SVM and training it, so it's quite demanding.

0:12:57 | We instead use the i-vector extractor to go down to a dimensionality of about two hundred from the thirty thousand, and in this low-dimensional space we can use really sophisticated machine learning; the PLDA model seems to be a very nice way to do this.

0:13:17 | And finally we have the baseline system, the SRI cepstral system for the 2010 evaluation, and we fuse with it.

0:13:27 | Here is the DET plot showing the single prosodic systems. The red line is the polynomial system, the blue is the SNERF-SVM system, and the green is the PLDA system; the dots mark the new DCF, the old DCF and the equal error rate, from left to right.

0:13:48 | We see that we get a big improvement over both systems on the equal error rate: we reach about six point nine percent with the PLDA modeling, while all the others are around ten percent. And also on the old DCF we get a big improvement.

0:14:04 | What is quite strange, however, is that the SNERF-SVM system always somehow performs slightly better if you go to the very low false-alarm region; that is some behaviour we really can't explain right now.

0:14:20 | The next results are on the fusion. We have the baseline system with an equal error rate of one point six percent, and the corresponding new DCF. We did a score-level fusion by logistic regression, with a jackknifing approach.

0:14:39 | You see the fusions of the polynomial and the SNERF-SVM systems with the baseline; with our system included, we get better results on the new DCF and on the equal error rate, and with the PLDA system we even get the best system overall.

0:14:55 | What is quite confusing: we get the best result on the equal error rate, one point four seven, but we also get the best improvement on the new DCF for our system. If you remember the single systems, it was the other way around: we were better on the equal error rate and the other system was better on the new DCF, yet in the fusion it is somehow reversed.

0:15:18 | We really want to try this again: the fusion here was trained on the 2008 data, so we could train it on 2008 and apply it to the 2010 data; the fusion weights would probably change in that case.
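Score-level fusion by logistic regression can be sketched like this. The scores are synthetic stand-ins for the real system outputs, scikit-learn's LogisticRegression stands in for whatever calibration tool was actually used, and the jackknifing is omitted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy scores from two systems (cepstral baseline + prosodic) on dev trials
rng = np.random.default_rng(1)
n = 500
labels = rng.integers(0, 2, n)               # 1 = target (same-speaker) trial
s1 = labels * 2.0 + rng.normal(0, 1.0, n)    # baseline scores (stronger)
s2 = labels * 1.0 + rng.normal(0, 1.5, n)    # prosodic scores (weaker)
X = np.stack([s1, s2], axis=1)

# linear logistic-regression fusion: the fused score is a learned
# weighted sum of the per-system scores plus an offset
fuser = LogisticRegression().fit(X, labels)
fused = fuser.decision_function(X)
# target trials receive higher fused scores than impostor trials
```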

0:15:33 | So, to the conclusion. We can say that the PLDA clearly outperformed the alternatives. Okay, I didn't mention it here because it was in the last paper: we previously did cosine distance scoring with LDA and WCCN, and with the PLDA we get about twenty percent relative improvement over that.
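For reference, the cosine-distance scoring that the PLDA back-end is compared against reduces to the following; the LDA and WCCN projections that would be applied to the i-vectors first are omitted here, and the vectors are toy values.

```python
import numpy as np

def cosine_score(w1, w2):
    """Cosine similarity between two i-vectors: the simpler scoring
    back-end that the PLDA model is compared against."""
    return float(w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2)))

same_dir = cosine_score(np.array([1., 2.]), np.array([2., 4.]))  # -> 1.0
orth = cosine_score(np.array([1., 0.]), np.array([0., 1.]))      # -> 0.0
```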

0:15:54 | Generally, the i-vector PLDA system gives the best performance, six point nine percent equal error rate, and that is, to our knowledge, the best score for a single prosodic system.

0:16:08 | We still have to investigate the decrease in the low false-alarm region. And the fusion gives around ten percent relative improvement on the new DCF measure, which is quite nice.

0:16:20 | For future work, we want to investigate other channels and speech styles; this is just telephone, and in the NIST evaluations there is otherwise only microphone data and no other conversational speech.

0:16:34 | Another thing we are already trying is the i-vector modeling with PLDA for the simple polynomial features that were used in the JFA modeling before, and then to combine both systems: one based on Gaussian modeling, the other on multinomial modeling; one based on the SNERFs, the other on the polynomial features. Hopefully you can see that at Interspeech.

0:16:58 | That's it, thank you.

0:17:05 | Okay, we have time for questions.

0:17:19 | What was your baseline result without adding any prosodics? How much did any of the prosodic systems add on top of your baseline?

0:17:27 | You mean the i-vector baseline? That was one point six percent equal error rate; it's the first line at the top of the table.

0:17:41 | Oh sorry, I didn't see that. Okay, thanks.

0:17:47 | In your talk, I can't remember, did you actually do any score normalization?

0:17:52 | That's a nice thing I didn't mention here: usually for PLDA we don't need any score normalization. What about the SVM system?

0:18:01 | The SVM system has that built in, in a way. This is just a bit of speculation, but sometimes having the background data set in the SVM training can act as score normalization as well.

0:18:15 | Perhaps I can point you to some work on that; it is dependent on the background set, and it can actually rotate the DET curve a little bit.

0:18:24 | Particularly in the region we are talking about. Perhaps what is happening is still just a kind of score normalization through the background selection: you might be seeing the improved min DCF in the SVM system for this reason, and perhaps the fusion is counteracting something that score normalization would otherwise give. The SVM system is not really that good on its own, but probably helps in combination; perhaps the normalization effect is being counteracted in the fusion.

0:19:00 | Okay, thanks.

0:19:12 | One more from the back; yeah, go ahead.

0:19:19 | I wanted to ask whether you normalize the i-vectors, or use them as-is without any of these fancy tricks, because I was wondering, since you have this nonlinearity, whether we would see the same behaviour here.

0:19:35 | When you look at the i-vectors, in the end they really are Gaussian distributed, if you look at the distribution.

0:19:48 | So no normalization: in this case length normalization and so on never helped. I always tried all these tricks and they don't help me at all for this system; I don't know why, they just don't work for me.

0:20:02 | Thanks, good to know.

0:20:09 | Okay, if there are no further questions:

0:20:20 | that's the end of the session; let's thank the speakers.