The last paper of this session is on prosodic speaker recognition. Thanks for coming to my talk. This is collaborative work with the SRI International speech lab, and the title is "Recent progress in prosodic speaker verification". It builds on earlier work on what we call the subspace multinomial model, which we presented at the last Interspeech. This is mainly a modeling technique, similar to total variability modeling, to obtain i-vectors, but it is not used to model Gaussian mean parameters; rather, it models multinomial parameters.

What are the claims of this work? The first is to illustrate the quite complex system building process with a simple toy example and some figures. The main claim is to introduce a new modeling approach in which we obtain i-vectors and model them with probabilistic linear discriminant analysis (PLDA). We wanted to compare this approach to the two main prosodic systems that are out there in the field, and finally, because prosody always captures just a higher level of speech, we want to combine it with a state-of-the-art cepstral baseline system.

Now to the toy example to explain the system building process. Imagine we have a conversational speech utterance; this is just an excerpt from NIST data: "Where are you, which state are you in?" Since we use prosodic features, we only extract some fundamental frequency measures, so first we extract the pitch. Second, we have an energy measure; we simply use a normalized energy for that. Finally we have a duration measure, which comes from an LVCSR system, so it is quite accurate. In this case we have ten syllables, all one-syllable
words, so the syllable segmentation is the same as the word segmentation. Now we have these three measurements, and from them we obtain SNERFs; SNERF stands for syllable-based non-uniform extraction region features. We derive many measurements of pitch, energy and duration: for instance, we cut out segments based on the syllables and measure the duration of the onset or the coda, or we measure the rise and fall of the pitch and the energy, the mean, the maximum, and so on. I will illustrate this with just a few measurements: we take the mean of each segment for pitch (the blue squares) and for energy (the crosses), and for duration in this example we simply use the number of frames in each syllable.

So now we have a combination of discrete values and continuous values; how do we model these? The idea was to turn them into discrete classes, and the best way we found was a soft binning based on small Gaussian mixture models. We extract these measurements from audio background data and train a small GMM for each: in our example we train two mixture components for this one-dimensional duration measure you see here, three components for the pitch, and four for the energy. Now we have a background model for each of the SNERFs, and we can start parameterizing our speech utterance. In the example I just plot the ten values against the GMMs, and you can see where they hit the GMMs; then we compute the posterior probability that each component generated each value. This is the soft binning, and from these posteriors we can generate soft counts.
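As a rough sketch of this soft binning, here is what computing soft counts from a one-dimensional background GMM could look like. Only the two-component duration model from the toy example is shown, and all GMM parameters and duration values below are made up for illustration.

```python
import numpy as np

# Hypothetical background GMM for the 1-D duration feature:
# two components, as in the toy example (all numbers are invented).
means   = np.array([8.0, 15.0])   # component means (frames per syllable)
stds    = np.array([2.0, 4.0])    # component standard deviations
weights = np.array([0.5, 0.5])    # component weights

def soft_counts(values, means, stds, weights):
    """Soft-bin each value by its GMM posterior and sum into soft counts."""
    v = np.asarray(values, dtype=float)[:, None]        # (n_syll, 1)
    log_lik = (np.log(weights)
               - np.log(stds) - 0.5 * np.log(2 * np.pi)
               - 0.5 * ((v - means) / stds) ** 2)       # (n_syll, n_comp)
    # Normalize per syllable to get posterior responsibilities.
    log_post = log_lik - np.logaddexp.reduce(log_lik, axis=1, keepdims=True)
    post = np.exp(log_post)
    return post.sum(axis=0)                             # soft count per component

durations = [7, 9, 8, 14, 16, 10, 8, 13, 9, 7]          # ten syllables
counts = soft_counts(durations, means, stds, weights)
print(counts)   # the counts always sum to the number of syllables
```

Because each row of posteriors sums to one, the soft counts always sum to the number of syllables, which is exactly the zeroth-order statistic a multinomial (GMM-weight) model needs.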
You can see that for the duration, the first Gaussian gets a soft count of 6.3 and the other one 3.7, and so on. This is the same thing you compute when you estimate the weights of a GMM; we do not compute any means or covariances, so this is really just a multinomial model. If you plot the multinomial model space for the duration Gaussians, you see that you get just a line, because the counts are always restricted to sum up to one. For the pitch you get a 2-D simplex you can move around in, but it is also restricted, and for the energy in this example you get a 3-D simplex. These are the full multinomial model spaces.

What we try to do with the subspace multinomial model is to learn some low-dimensional structure in the data: how the parameters move inside these spaces when the speaker changes. Furthermore, these are really independent features, but we would also like to learn correlations between them, which we can do by tying all of these subspaces together with one set of parameters. What is shown here is a single-dimensional subspace: once you project back into the multinomial spaces you really get low-dimensional curves. By moving just one number, you move the i-vector, and you move along the colors; if you increase the number it goes toward red, if you decrease it toward black, but it is still just one extracted i-vector. We then use this model to extract i-vectors, like a standard i-vector extractor, to get a low-dimensional representation of the whole utterance. Let's zoom into the four-dimensional energy part and look at a two-dimensional subspace.
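The projection from the low-dimensional subspace back into the multinomial simplex can be sketched as follows. The origin m and subspace matrix T below are invented toy values, and the softmax mapping is one standard way to enforce the sum-to-one constraint described above.

```python
import numpy as np

# Toy subspace multinomial model for the 4-component "energy" GMM:
# m is the origin (log of the background weights), T a 1-D subspace.
# Both are made up purely for illustration.
m = np.log(np.array([0.25, 0.25, 0.25, 0.25]))
T = np.array([[1.0], [-0.5], [0.3], [-0.8]])   # 4 bins x 1 subspace dim

def multinomial_params(w):
    """Project a low-dimensional vector w back into the multinomial simplex."""
    eta = m + T @ w                 # log-domain parameters
    phi = np.exp(eta - eta.max())   # subtract max for numerical stability
    return phi / phi.sum()          # softmax: guarantees a valid multinomial

# Moving w along the one-dimensional subspace traces a curve inside the
# 3-D simplex, just like the colored curves in the plots.
for w in (np.array([-2.0]), np.array([0.0]), np.array([2.0])):
    phi = multinomial_params(w)
    print(w, phi.round(3))          # always non-negative and sums to one
```

At w = 0 the model sits exactly at the background weights; every other w still lands on the simplex, which is the structural constraint the subspace model exploits.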
We get a nonlinear surface embedded in the three-dimensional space, and the movement is restricted to it. The black dots you see here are real data, just projected back, and if we zoom in, the colors show the different speakers: ten speakers with ten utterances each. The funny thing is that although we have a two-dimensional subspace here, the data somehow lie on a one-dimensional manifold, so an even smaller subspace, probably one dimension, would be enough for this example. And you can already see that you can distinguish the speakers quite well here. Of course this is still the multinomial model space; in the end we take the i-vectors and model them in the i-vector space. If we plot the data in the i-vector space, we really get these nice clusters for the speakers, and this is without any compensation or anything.

And that is where the PLDA model comes in. Let's look at other artificial data, also in a 2-D space. We again have i-vectors for four speakers; the big dots are the speaker means, and then we have several utterances each. We use a linear-Gaussian model assumption: we have some across-class variability, shown by the solid lines, so you can really see the variability between the classes, and then we have a shared within-class covariance matrix, so the individual utterances of every speaker share the same covariance. From this you can see that even though two dots from the red speaker lie far apart, they can be recognized as coming from the same speaker because of the big variability in this direction, while if you go from here to there you would see that these are different speakers;
they don't belong together. What we use is a probabilistic model, and the nice thing is that we can train its parameters, the covariance matrices, with the EM algorithm. Secondly, we can use the PLDA model to directly evaluate likelihoods, unlike the scores you compute with a UBM: we can compute a proper likelihood ratio for whether two i-vectors were generated by the same speaker or not. It looks quite complicated, but in the numerator you see the latent speaker variable y, and the two i-vectors share the same prior, so it is P(y), and we integrate over all speakers: we do not care which speaker it is, we only ask whether both come from the same one. In the denominator we have the marginal probabilities, as if they came from different speakers. The nice thing is that this can be evaluated analytically, so the scoring can be performed very efficiently.

Now to the experiments. I present them on the NIST SRE 2008 task, more precisely on the extended trial set that was used for NIST 2010 development. It is mainly the telephone condition. The number of target trials stays about the same, but the number of impostor trials was increased a lot because of the new metrics, the new DCF, which emphasizes the very low false alarm region. The UBM, the subspace model, the PLDA, everything is trained on SRE 2004 and 2005 and Switchboard data. We evaluate three different systems: the first two are state-of-the-art systems that are out there, and the third is the one we propose here.
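A minimal sketch of the same-versus-different-speaker likelihood ratio, using a two-covariance flavor of PLDA on toy 2-D i-vectors; the covariance matrices and test vectors here are made up, and in practice the covariances would come from the EM training mentioned above.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy two-covariance PLDA in 2-D i-vector space (parameters invented).
B = np.array([[4.0, 0.0], [0.0, 1.0]])   # across-class (speaker) covariance
W = np.array([[0.3, 0.0], [0.0, 0.3]])   # shared within-class covariance

def plda_llr(x1, x2):
    """log P(x1, x2 | same speaker) - log P(x1, x2 | different speakers)."""
    d = len(x1)
    # Same speaker: x1 and x2 share one latent speaker mean, so the joint
    # Gaussian has cross-covariance B between the two vectors.
    joint = np.block([[B + W, B], [B, B + W]])
    num = multivariate_normal.logpdf(np.concatenate([x1, x2]),
                                     mean=np.zeros(2 * d), cov=joint)
    # Different speakers: the two vectors are independent marginals.
    den = (multivariate_normal.logpdf(x1, mean=np.zeros(d), cov=B + W)
           + multivariate_normal.logpdf(x2, mean=np.zeros(d), cov=B + W))
    return num - den

same = plda_llr(np.array([2.0, 0.5]), np.array([2.1, 0.4]))
diff = plda_llr(np.array([2.0, 0.5]), np.array([-2.0, -0.4]))
print(same, diff)   # the close pair scores higher than the distant pair
```

Both hypotheses reduce to Gaussian densities, which is why the ratio has the closed form mentioned in the talk and scoring stays cheap.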
The first is the JFA system, joint factor analysis, which models Gaussian mean parameters and uses quite simple polynomial contour features. These are a subset of the SNERF features, but they are just thirteen-dimensional and basically approximate the pitch and energy contours over the syllable. The second is the SNERF-SVM system: up to the point where we have the soft counts of the SNERFs it is exactly the same as our system, but the modeling is then done by putting these really high-dimensional vectors, about thirty thousand dimensions, into an SVM and training it, which is quite demanding. We, instead, use the i-vector extractor to go down from about thirty thousand to a dimensionality of about two hundred, and in this low-dimensional space we can use really nice machine learning techniques; the PLDA model seems to be a very good way to do this. Finally, we have the baseline, the SRI cepstral system for NIST 2010.

This plot shows the three single prosodic systems: the red line is the polynomial JFA system, the blue is the SNERF-SVM system, and the green is our PLDA system, with the new DCF, the old DCF, and the equal error rate from left to right. We see that we get a big improvement over both systems at the equal error rate: we reach about 6.9 percent with the PLDA modeling, while the others are around ten percent, and on the old DCF we also get a big improvement. Quite strangely, though, the SNERF-SVM system always somehow performs slightly better in the very low false alarm region, a behavior that we really cannot explain at the moment.

The next results are on fusion. We have the baseline system with an equal error rate of 1.6 percent, and we do score-level fusion by logistic regression with
a jackknifing approach. We see that the fusion of the baseline with the polynomial system already helps, and on the new DCF we get better results than at the equal error rate. With the SNERF-SVM system we even get the best system at the equal error rate, 1.47 percent, but, a bit confusingly, we get the best improvement on the new DCF with our own system. If you remember the single-system results, there it was the other way around: we were better at the equal error rate and the other system was better on the new DCF, yet in the fusion it somehow flips. We really want to investigate this; everything here was done on 2008 data, so one could train the fusion on 2008 and apply it to 2010 data, and maybe the fusion behavior would change in that case.

In conclusion, we can say that PLDA performed best. I did not mention it, but in the last paper we did cosine distance scoring with LDA and WCCN, and we get about twenty percent relative improvement by moving to PLDA. Generally, the i-vector PLDA system gives the best performance of 6.9 percent equal error rate, which is to our knowledge the best score for a single prosodic system. We still have to investigate the reasons for the degradation in the low false alarm region. The fusion gives around ten percent relative improvement on the new DCF measure, which is quite nice. For future work, we want to investigate variation in channel and speech style; this was only telephone data, but the NIST evaluations also contain microphone and interview speech, not just conversational telephone speech. Another thing we are already trying is i-vector modeling with PLDA for the simple polynomial features that were used in the JFA modeling before.
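The score-level fusion by logistic regression can be sketched like this on synthetic scores; the score distributions below are invented, and a real setup would use jackknifing or a separate development set as discussed above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Synthetic scores from two systems (say, cepstral baseline and prosodic),
# purely illustrative: target trials score higher on average.
n = 500
tgt = np.column_stack([rng.normal(2.0, 1.0, n), rng.normal(1.0, 1.5, n)])
imp = np.column_stack([rng.normal(-2.0, 1.0, n), rng.normal(-1.0, 1.5, n)])
X = np.vstack([tgt, imp])
y = np.concatenate([np.ones(n), np.zeros(n)])

# Linear logistic regression learns one weight per system plus a bias;
# the fused score is the resulting log-odds (decision_function).
fusion = LogisticRegression().fit(X, y)
fused = fusion.decision_function(X)

# On this toy data the fused score separates the two classes well.
acc = ((fused > 0) == y.astype(bool)).mean()
print(fusion.coef_, acc)
```

The learned weights indicate how much each system contributes, which is one reason logistic regression is a popular choice for this kind of calibration-aware fusion.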
Then we would combine both systems, one based on Gaussian mean modeling and the other on multinomial modeling, one on the SNERF features and one on the polynomial features, and hopefully you can see that at Interspeech. So that's it, thank you.

Question: What was your baseline result without adding anything prosodic, and how much did the prosodic features add to your baseline? You mean the i-vector baseline? Yes. Answer: That is the 1.6 percent equal error rate, the first line at the top of the table. Sorry, I didn't see that, okay.

Question: I can't remember, did you actually do any score normalization? Answer: That is a nice point I didn't mention here: usually for PLDA we don't need any score normalization at all, while the SVM system has t-norm in it. Comment from the audience: This is just a bit of speculation, but sometimes having the background dataset in the SVM training can act as score normalization as well, and depending on the background set it can actually rotate the DET curve a little bit and take care of the low false alarm region. Perhaps that is what is happening, though this is still just speculation: you might be seeing the improved min DCF of the SVM system for this reason, and perhaps the fusion is counteracting something that score normalization would also counteract, so that the SVM system is not really that good in combination anymore. Answer: Okay, thanks.

Question: Did you length-normalize the i-vectors, or do you use them as they are, without any of these fancy tricks? I was wondering because of the distribution you have there.
Answer: The i-vectors, in the end, really are Gaussian distributed if you look at their distribution. We also have score normalization for that, but in this case, no: it never helped. I have always tried all these tricks and they don't help me at all for this system; they just don't work for me. That's good to know, thanks.

Okay, that is the end of the session.