Thank you. I guess it's the end of a long day, so thanks for staying. This talk has a lot of overlap with what the first speaker presented in this session. The basic question is whether we can have different background models for different sets of speakers, and what we are proposing, at least in this paper, is that we can cluster speakers according to their vocal tract length; another way of doing it is to use the similarity between their MLLR matrices. We show that using a few such speaker clusters we get some improvement in performance over using a single UBM.

The overview of the talk is as follows. First I will review conventional speaker verification, where we very often use a single background model, and then the reasons why we might want speaker-cluster-wise background models instead. There are two ways you could do the clustering, at least that is what we suggest in this paper: one is to use the vocal tract length parameter itself, and the other is to use a speaker-dependent MLLR supervector. I will then show how we build a background model for each of these individual speaker clusters, and we compare the performance first against a single gender-independent UBM, and then against a gender-dependent UBM and gender-dependent speaker-cluster models.

Some of this overlaps with what the first speaker pointed out: speaker verification is basically a binary decision problem. Given the features and a claimed identity, we compare the log-likelihood ratio between the claimed speaker's model and an alternate model against a threshold, and we either accept or reject the claim. The question, then, is what the alternate hypothesis should be.
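As a minimal sketch of that accept/reject rule: the diagonal-covariance GMM layout, the `verify` helper, and the zero threshold below are my own illustrative assumptions, not the paper's code.

```python
import numpy as np

def gmm_avg_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM.

    frames: (T, D); weights: (M,); means, variances: (M, D).
    """
    diff = frames[:, None, :] - means[None, :, :]                 # (T, M, D)
    comp = -0.5 * (np.log(2.0 * np.pi * variances).sum(axis=1)    # per-mixture constant
                   + (diff ** 2 / variances[None]).sum(axis=2))   # (T, M)
    a = np.log(weights)[None, :] + comp
    m = a.max(axis=1, keepdims=True)                              # log-sum-exp for stability
    return float(np.mean(m[:, 0] + np.log(np.exp(a - m).sum(axis=1))))

def verify(frames, speaker_gmm, background_gmm, threshold=0.0):
    """Accept the claimed identity iff the log-likelihood ratio exceeds the threshold."""
    llr = gmm_avg_loglik(frames, *speaker_gmm) - gmm_avg_loglik(frames, *background_gmm)
    return llr > threshold
```

In the systems described in this talk, only the `background_gmm` argument changes: a single UBM in the conventional case, or a cluster-specific background model otherwise.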
One answer is to say that the alternate hypothesis is a universal background model, a single model that is used for all speakers in the database. There are also approaches in which we have a set of speaker models, cohorts that are close to a particular speaker, and we take a linear combination of their scores, or we could build a background model for that particular speaker from the cohorts themselves. So one extreme has one background model for all speakers, and the other has a background model for each speaker. A third way of doing it is a compromise between the two, which is to say: I will have a background model for a group of speakers. The question then becomes how to group the speakers, and we are proposing two different ways: one is basically to use the vocal tract length parameter, and the other is to use a speaker-specific MLLR matrix.

So this is the basic idea. Instead of using one background model and comparing its likelihood against the corresponding speaker's model, we have different sets of background models for different speaker clusters. How we build these speaker-cluster background models is what I will talk about on the next slide; the speaker clustering itself was done using either the vocal tract length parameter or a maximum-likelihood MLLR supervector.

The motivation for using the vocal tract length parameter for speaker clustering is that physiological differences in the vocal tract give rise to differences in the speech spectrum. What is shown here, in the dark line, is a male speaker, and in the solid line a female speaker, and there are differences in the spectra for the same sound.
The reason is simple: the physiology of the speech production system is very different between a male and a female speaker in terms of size. We therefore assume that if a group of speakers have a similar vocal tract length parameter, that is, a similar physiology, they probably produce a very similar set of spectral characteristics for a given sound, so we can group those speakers together and assume they have very similar characteristics in terms of the features they produce for a particular sound. Obviously, to do VTLN we do not have a reference speaker, so one has to use some sort of reference model, and we use the background model itself as the reference against which we score the features for different warp parameters and choose the one that does best. Each speaker's vocal tract length parameter is thus estimated with respect to the background model; this is similar to what we do in speech recognition, except that here we use the UBM, the speaker-independent model.

The other way we could classify speakers into groups is to use the MLLR matrix itself. There is a lot of evidence that MLLR captures quite a bit of information about a particular speaker. We stack the columns of the MLLR matrix to form a supervector, and then we do a very simple clustering of these MLLR supervectors over the speakers in the database, using the k-means algorithm with the plain Euclidean distance. So, given the UBM and a speaker's training data, we get an MLLR matrix for that speaker and stack its columns to form a supervector; this supervector identifies or characterizes the speaker, and we then group the speakers according to the clusters formed by the k-means algorithm.
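The clustering step just described might be sketched as follows; the matrix shapes, the bias term, and the plain k-means loop are my assumptions for illustration, since the paper gives no code.

```python
import numpy as np

def mllr_supervector(A, b):
    """Stack the columns of the D x D MLLR matrix A, then the bias b, into one vector."""
    return np.concatenate([A.flatten(order="F"), b])

def kmeans(vectors, k, iters=20, seed=0):
    """Plain k-means with Euclidean distance over the per-speaker supervectors."""
    rng = np.random.default_rng(seed)
    centers = vectors[rng.choice(len(vectors), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # assign every supervector to its nearest center
        dist = np.linalg.norm(vectors[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # recompute each center as the mean of its assigned supervectors
        for j in range(k):
            if np.any(labels == j):
                centers[j] = vectors[labels == j].mean(axis=0)
    return labels
```

Speakers sharing a label would then share one cluster background model, exactly as the talk describes.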
Now that we have grouped the speakers into classes, we build a different background model for each group of speakers. What we have done here is simply an MLLR adaptation of the UBM to get a new set of means for each speaker-cluster background model: each cluster model is obtained from the UBM by a transformation, and the transformation matrix is estimated using all the data from that particular speaker cluster. So, given the UBM, you form for each cluster its own background model, where the clusters are based either on the VTLN parameter, so one group of speakers is clustered into one model and another group into another, or on a set of MLLR-clustered speakers.

On the implementation side: given the UBM, I first identify each of the speakers in the database and find the corresponding VTLN parameters. Say I am looking at the VTLN parameter 1.20 and I find that speakers 3, 4, and 6 all have this parameter; I group them together. If I am looking at the VTLN parameter 0.82, then speaker IDs 2, 8, and 9 possibly belong to it, so I group those together. Using these groups of speakers I transform the GMM-UBM to form a background model, which is basically an MLLR adaptation towards that particular group of speakers. Then I do the individual speaker modelling by MAP adaptation: starting from the background model of each speaker's cluster, I use the corresponding speaker's data to do MAP adaptation. The same procedure can be used if I had instead clustered the speakers based on MLLR.

If you look at the test phase, it is almost the same as the conventional case, except for two small differences.
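Before the test phase, the two adaptation steps just described, an MLLR move of the UBM means for a cluster followed by relevance-MAP for each speaker, can be sketched roughly as below. The function names, the mean-only updates, and the relevance factor `tau` are my simplifying assumptions, not the paper's exact recipe.

```python
import numpy as np

def mllr_adapt_means(ubm_means, A, b):
    """Cluster background model: move every UBM mean as mu' = A @ mu + b."""
    return ubm_means @ A.T + b

def map_adapt_means(bg_means, posteriors, frames, tau=10.0):
    """Relevance-MAP update of the means using one speaker's own frames.

    posteriors: (T, M) frame-level mixture posteriors under the background model.
    """
    n = posteriors.sum(axis=0)                     # soft counts per mixture
    ex = posteriors.T @ frames                     # first-order statistics, (M, D)
    alpha = (n / (n + tau))[:, None]               # adaptation weight per mixture
    safe_n = np.where(n > 0.0, n, 1.0)[:, None]    # avoid dividing by empty counts
    return alpha * (ex / safe_n) + (1.0 - alpha) * bg_means
```

A mixture with few assigned frames keeps a mean close to its cluster background model, which is the usual behaviour of relevance MAP.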
Given a test utterance, I compute the log-likelihood ratio by comparing the speaker model and the background model. In the conventional case there is still one single UBM, the speaker model is obtained by adapting it to the particular speaker, and I make a threshold-based decision on whether to accept or reject. Here the procedure is exactly the same, but slightly different models are used: the background model is now the one built specifically for that particular speaker's cluster, and the speaker model is obtained by adapting this cluster UBM, so it too is slightly different; again I compute a log-likelihood ratio and compare it to a threshold. So the two systems have identical computational cost at test time; only the models used for the background differ.

This is just the standard database: we used the NIST 2002 data for background modelling, and for evaluation the one-side train, one-side test condition of NIST 2004. What we notice is that, depending on the number of VTLN clusters we form, as the number of clusters increases you do see some decrease in the EER. This is the EER you get with a single gender-independent UBM, this is the EER with VTLN-based clustering, and this is the EER with MLLR-based speaker clustering. We find that MLLR is slightly better than VTLN, but both of them give significantly better performance than the single UBM, and the same holds true for the minimum DCF. So a couple of things to notice: both VTLN and MLLR clustering give some improvement in performance over a single UBM, and MLLR performs slightly, sometimes quite a bit, better than VTLN.
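For reference, the EER quoted in these results is the operating point where the miss rate and the false-alarm rate are equal; a minimal scoring sketch (my own helper, not the evaluation tooling used in the paper):

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """EER: the threshold where false rejection and false acceptance meet."""
    target_scores = np.asarray(target_scores, dtype=float)
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    best_gap, eer = np.inf, 1.0
    for t in np.sort(np.concatenate([target_scores, impostor_scores])):
        frr = np.mean(target_scores < t)     # true speakers rejected
        far = np.mean(impostor_scores >= t)  # impostors accepted
        if abs(frr - far) < best_gap:
            best_gap, eer = abs(frr - far), (frr + far) / 2.0
    return float(eer)
```

The minimum DCF mentioned alongside it is similar but weights the two error types with the costs and priors fixed by the evaluation plan.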
For the number of clusters that gave the best performance for each method, this is the corresponding DET curve, which again shows MLLR doing a little better than the black curve obtained with VTLN clustering; the blue curve is the regular single-UBM system.

So the question to ask is why MLLR is performing better than VTLN. There is other information available: if you look at the black and the white bars at the bottom, the black corresponds to female speakers and the white corresponds to male speakers. Here we have chosen fourteen clusters, which was the number that gave the best performance for VTLN, and you can see that many of the VTLN clusters contain both male and female speakers. If you look at the warp factor around 1.0, there are both male and female speakers in that particular cluster, and similarly for 0.90 and 0.96 you see some overlap between the male and female speakers. On the other hand, when you look at the MLLR supervector clusters, the black and white are very distinct: some clusters pick up only female speakers, and the other MLLR clusters pick up only male speakers. So there is a very nice purity of the clustering with respect to gender when you use MLLR supervectors, and we think that is possibly one of the reasons why MLLR seems to consistently perform better than VTLN.

We wanted to go one step further: if that was indeed the case, then if we separate the clusters according to gender, the gap between MLLR and VTLN should disappear and we should get very similar performance from both. That is what the next set of experiments indicates. Here what we have done is to use gender-wise UBMs, so I build two UBMs, one for males and one for females.
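Before those results, the gender-purity observation made a moment ago can be quantified with a simple average-purity measure; this is a sketch of my own, since the talk only shows the cluster compositions visually.

```python
import numpy as np

def mean_cluster_purity(labels, genders):
    """Average, over clusters, of the fraction held by the majority gender."""
    labels = np.asarray(labels)
    genders = np.asarray(genders)
    purities = [max(np.mean(genders[labels == c] == "m"),
                    np.mean(genders[labels == c] == "f"))
                for c in np.unique(labels)]
    return float(np.mean(purities))
```

Under this measure, the MLLR clusters described above would score at or near 1.0, while the mixed VTLN clusters would score lower.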
Obviously you see some improvement in performance compared to the gender-independent UBM, but our conjecture also seems to hold: once we do a gender-wise splitting of the clusters, VTLN and MLLR give almost comparable performance. MLLR is still slightly better, but the performance is nevertheless almost comparable, and the same holds true for the minimum DCF. So the point we want to make is that VTLN, if you use it just as it is for clustering, sometimes gives worse performance for the simple reason that it picks up both male and female speakers for the same warp factor; but if you do gender-wise clustering, then MLLR and VTLN give almost the same, comparable performance, and in any case both of these clustering methods still outperform the gender-wise single UBM, one per gender. That is also reflected in the DET curves: you can see that both the MLLR-clustered and the VTLN-clustered models, once they are split gender-wise, have very similar performance, and they always do better than a gender-wise UBM.

So the bottom line is that if you are willing to increase the number of background models, and not by much, we find that with something like two male and two female clusters you already get some gain in performance, both in the gender-independent and in the gender-dependent case. The computational cost at test time stays the same as with a single UBM, because we still compare just two models. The MLLR supervector performs better than VTLN in most cases, but the gap narrows down if you are willing to use gender-wise speaker clustering. Thank you.

Chair: We have time for one last question.

Q: You cluster the speakers, so you use a different UBM depending on the training speakers, right? You have your training samples, and you cluster the training speakers.
Q: So one training speaker is associated with only one UBM, did I get that right?

A: Right.

Q: Okay, but now when you have a new sample...

A: Are you asking about speaker identification or speaker verification?

Q: Oh, I see, it is a speaker verification task, so there is only one particular claimed person, and he is scored against that UBM. It is not speaker identification, which would be much more expensive.

A: Yes. And here each cluster of speakers has one associated background model; it is not each speaker having its own.

Q: Okay, but the reason it is not more expensive is that you are only considering one training speaker, right?

A: Right.

Chair: I think there is no more time, so thank you very much, and thank you all for attending this session.