So, as I mentioned, we are going to use MLLR adaptation statistics for the speaker identification problem, but we are not building any speech recognition system as such in this paper. The idea is that we are looking specifically at the case where there is a large number of speakers and we want to identify one of them, and we want to do it in a computationally efficient way. This work was actually done by my students.

Just to give a brief overview of the talk: I will go briefly over the speaker identification problem, that is, identifying one out of a set of L speakers, and I will describe the commonly used technique of MAP adaptation followed by top-C mixture based likelihood estimation. We then show that if you have a large number of speakers, evaluating the likelihood across all of them and choosing the best one is obviously very computationally expensive, and the number of speakers in the population can be very large. So we propose to use MLLR matrices for the adaptation of the speaker models. The reason is that then we only need to store the MLLR matrices, and we show that once you have the MLLR matrices, estimating the likelihood of the different speakers is a very fast step: it is just a matrix multiplication with the MLLR row vectors. We give a comparison with the conventional GMM-UBM based system and show that the MLLR system does give some degradation in performance. Therefore, finally, we propose a cascade system in which the MLLR system reduces the search space from the huge population, and the GMM-UBM system then only has to look at a small set of speakers and identify the best speaker from that set. That is the basic flow of the talk.

As I said, the idea is that I am doing speaker identification, so assume there are L speakers in the population. Given a test utterance, we compute its likelihood with respect to all the speaker models and choose the one that maximises the likelihood. Obviously, when the number of speakers in the population is large, I have to evaluate the likelihood for each and every speaker, so the computational complexity keeps growing as the number of speakers in the population becomes large.

So what is the conventional method? The most popular method used for speaker identification, and pretty much the same thing is used for speaker verification, is that given a universal background model, for each speaker we do a MAP adaptation to get the speaker model from the universal background model. These are the speaker-adapted models. Then, as Doug Reynolds pointed out, there is a fast way to do the scoring: given the test data and such models, we first align the test data with respect to the UBM and find the top-C mixtures for that particular test data. So when you want to evaluate a speaker's likelihood, you do not have to compute the likelihood of all the mixtures, say two thousand of them, for each speaker model, assuming that many mixtures in the background model; instead, you first do one full evaluation with respect to all the mixtures of the UBM, and then for each speaker model you only need those C mixtures to be evaluated.
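To make the conventional setup concrete, the identification rule and the top-C approximation just described can be written roughly as follows. The notation is shorthand introduced here, not taken from the slides: X is the test utterance of T frames, λ_s the MAP-adapted model of speaker s, C_t the set of top-C UBM mixtures for frame x_t, and it assumes, as is common, that only the means are MAP-adapted.

```latex
% Speaker identification: pick the model with the highest likelihood
\[
\hat{s} \;=\; \arg\max_{s\in\{1,\dots,L\}} \; p(X \mid \lambda_s),
\qquad X = \{x_1,\dots,x_T\}
\]
% Top-C fast scoring: per frame, keep the C best-scoring UBM mixtures
% (the set \mathcal{C}_t) and evaluate each speaker model only on those
\[
\log p(X \mid \lambda_s) \;\approx\; \sum_{t=1}^{T} \log
\sum_{m\in\mathcal{C}_t} w_m\,\mathcal{N}\!\big(x_t;\ \mu_{s,m},\ \Sigma_m\big)
\]
```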
Nevertheless, as L becomes large there is a large increase in computation, so, as we will show, this is still expensive, especially when L becomes large.

What we are proposing is again adaptation, but instead of doing MAP adaptation, why not build the speaker models using MLLR adaptation? The idea is that for each speaker, given that we already have the UBM, instead of MAP adaptation we obtain a speaker model that has gone through MLLR adaptation. This is where I think the earlier confusion came from: we are using MLLR adaptation exactly as borrowed from the speech recognition literature. For each speaker, the means of the speaker model are nothing but a matrix transformation of the means of the universal background model. So the idea is that I only need this MLLR matrix to characterise a speaker. In essence we are not storing individual speaker models; each speaker is now codified by his or her speaker-specific MLLR matrix. In the training stage we compute the speaker-specific MLLR matrix for each speaker, and the identification problem then becomes one where, instead of L full models, we have L such matrices. Here the likelihood calculation boils down to evaluating the test utterance against the background model transformed by each of these L matrices, which are already stored, since the MLLR adaptation for each individual speaker was done offline.

At this point it still looks as if we need to compute all L likelihoods, so it seems we have not solved anything yet. But the advantage is that computing these individual likelihoods is now very simple: all I need to do is a few matrix multiplications to get the likelihood for each individual speaker.

The idea is again borrowed from the speech recognition literature, because we are essentially using the equations from MLLR matrix estimation; we use the same auxiliary function. In conventional speech recognition, if you are doing MLLR estimation, you try to estimate the matrix W_s given the adaptation data: given the adaptation utterance X, what are the elements of the matrix that maximise the likelihood, or rather the auxiliary function? We now pose the same problem in a speaker identification framework. I already know the L speaker matrices; for each individual speaker the MLLR matrix has already been computed and stored. So the problem becomes one of finding which of those L matrices maximises the likelihood. In this case I am not estimating the MLLR matrices; the only thing I am doing is picking the one among the L stored MLLR matrices that maximises the likelihood, and this can be done very efficiently, as I will show, again using machinery borrowed from speech recognition.
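In standard MLLR notation, the speaker model and the resulting identification rule just described amount to roughly the following; the extended mean vector ξ_m and the auxiliary function Q are the usual speech-recognition quantities, and the exact form here is a reconstruction rather than a formula taken from the slides.

```latex
% Each speaker s is stored only as a transform W_s of the UBM means
\[
\mu_{s,m} \;=\; A_s\,\mu_m + b_s \;=\; W_s\,\xi_m,
\qquad \xi_m = \big[\,1,\ \mu_m^{\top}\big]^{\top},
\quad W_s = \big[\,b_s \ \ A_s\,\big]
\]
% In ASR adaptation one estimates W from the data; here the W_s are
% precomputed, and identification picks the stored transform that
% maximises the MLLR auxiliary function for the test utterance X
\[
\hat{s} \;=\; \arg\max_{s\in\{1,\dots,L\}} \; Q\big(X;\,W_s\big)
\]
```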
What we do is the following. We already have these L matrices, each of which is represented by its row vectors w_1 through w_D; these are row vectors. In MLLR these are exactly the row vectors that get estimated when you actually do speaker adaptation; here they are already precomputed and stored, and we are only computing the likelihood.

Why is this efficient? I said I do need to compute all L likelihoods, but I can do that very efficiently. Why? Because I only need one alignment of the data with respect to the UBM, and that is exactly the same thing that is normally done in MAP plus top-C likelihood estimation: there, too, you have to do an alignment to find out which mixtures are dominant for the test data. So that step is exactly the same as in MAP plus top-C. What we do in addition, which is again borrowed from speech recognition, is to compute, for the given test utterance, the corresponding sufficient statistics k_i and G_i. These sufficient statistics depend on the alignment and on the data: the alignment gives the occupation probabilities, and then the data comes in. Then, for each of the L speakers, I only need one matrix multiplication using these k_i and G_i statistics. The k_i and G_i are computed only once, irrespective of the number of speakers, but the likelihood calculation uses the individual row vectors from the corresponding speaker's MLLR matrix. Each row vector w_i essentially models that speaker along dimension i, and for speaker s the likelihood is obtained by combining the rows of W_s with k_i and G_i. In a sense this is the most crucial step: the likelihood can easily be computed for each of the L speakers by taking the corresponding MLLR hypothesis and doing a matrix multiplication with k_i and G_i, and that is where we get the main gain in computation time.

Just to go through the whole flow: given the feature vectors, I assume I have already taken each speaker's training data and computed the MLLR matrices for all L speakers. Given a test utterance, I first do an alignment with the background model and compute the k_i and G_i statistics; this is done only once, using the test features and the UBM. Then, for each of the L matrices, I just multiply the matrix with the statistics to get that speaker's likelihood. So this is computationally very efficient, because it only involves matrix multiplications. Please stop me if you have any questions.

The proof of the pudding is in the timing and complexity analysis. What we are doing now is comparing the conventional MAP plus top-C approach, the standard GMM-UBM, with the fast MLLR system, the one where MLLR matrices capture the speaker characteristics. What is shown on the left is performance on the NIST 2004 data. We have two different test conditions, one with ten seconds of speech and the other with one side of a conversation, and there are a few hundred speakers in this identification set. What we are trying to do, given the test data, is to identify the speaker from among these models.
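A minimal numpy sketch of this scoring step, assuming diagonal-covariance UBM mixtures and the standard per-dimension MLLR statistics; the function and variable names are invented for illustration and are not from the paper.

```python
import numpy as np

def sufficient_stats(X, ubm_means, ubm_vars, gammas):
    """Accumulate the per-dimension MLLR statistics k_i, G_i once per test
    utterance, from the frame/mixture occupation probabilities (gammas)
    obtained by aligning the test frames X against the UBM.
    X: (T, D) frames; ubm_means/ubm_vars: (M, D); gammas: (T, M)."""
    T, D = X.shape
    M = ubm_means.shape[0]
    # extended mean vectors xi_m = [1, mu_m] of dimension D+1
    xi = np.hstack([np.ones((M, 1)), ubm_means])            # (M, D+1)
    occ = gammas.sum(axis=0)                                 # (M,) total occupancies
    k = np.zeros((D, D + 1))
    G = np.zeros((D, D + 1, D + 1))
    for i in range(D):
        inv_var = 1.0 / ubm_vars[:, i]                       # (M,)
        # k_i = sum_{t,m} gamma_m(t) * x_t[i] * xi_m / sigma_{m,i}^2
        k[i] = (gammas * X[:, i:i + 1]).sum(axis=0) @ (xi * inv_var[:, None])
        # G_i = sum_m (sum_t gamma_m(t)) * xi_m xi_m^T / sigma_{m,i}^2
        G[i] = (xi * (occ * inv_var)[:, None]).T @ xi
    return k, G

def mllr_score(W, k, G):
    """Score one speaker from its stored (D, D+1) MLLR transform W:
    sum_i [ w_i k_i - 0.5 w_i G_i w_i^T ], with w_i the i-th row of W.
    Terms independent of W are dropped; they cancel in the argmax."""
    return sum(W[i] @ k[i] - 0.5 * W[i] @ G[i] @ W[i] for i in range(W.shape[0]))

def identify(speaker_transforms, k, G):
    """Pick the speaker whose stored transform maximises the score."""
    scores = [mllr_score(W, k, G) for W in speaker_transforms]
    return int(np.argmax(scores)), scores
```

The point of the sketch is that `sufficient_stats` runs once per test utterance, while the per-speaker `mllr_score` is only a handful of small matrix-vector products, so the per-speaker cost does not grow with the utterance length.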
These are the ten-second and one-side cases. The blue bars show the conventional approach, where we have taken C to be fifteen, the top fifteen mixtures. You can see that there is obviously a degradation in performance in the MLLR case; I will analyse that in a bit more detail shortly. For the one-side speech the GMM-UBM obviously does better, and there is a corresponding improvement for the MLLR case as well, but again there is a gap between the performance of the conventional approach and the proposed approach.

The advantage comes in the right half of the figure, which shows, for a fixed computer configuration, the average time taken to identify the best speaker. You can see that there is a big difference in computation time: the conventional system takes about 10.3 seconds on average on the ten-second data for this set of speakers, while the MLLR system takes about a second on average. When the test utterance becomes longer it obviously takes much more time to compute, and the conventional system takes about 44 seconds versus a few seconds for the MLLR system. So the bottom line is that you get a large gain, roughly one to seven or one to ten, with fast MLLR. This is useful if you have, say, two thousand speakers in your set and you want to identify which one of them spoke the utterance. But there is a downside: you give up some performance. And obviously, since the GMM-UBM takes a lot more time on longer utterances, you gain more when the utterances are longer.

This slide gives a little more analysis of what is happening between the proposed fast MLLR system and the GMM-UBM. Since the likelihood has to be computed for every speaker, the left-hand figure shows the computation time as the number of speakers in the database increases. The blue lines are the conventional approach; with ten-second utterances it obviously takes less time than with one-side speech, but you can see a roughly linear relationship with the number of speakers: as the number of speakers in the database increases, the computation time grows more or less linearly. On the other hand, if you look at the MLLR system, which is the dark, brownish line, it is almost flat as the number of speakers increases. That is because the main complexity is in doing the alignment and computing the statistics; the actual likelihood evaluation does not vary significantly with the number of speakers, since it is just matrix multiplications with the MLLR matrices. So for a population of, say, two thousand speakers there is going to be a huge gain in computation time.

The other interesting thing is the N-best performance of the two systems, that is, if I look at the top N speakers, how often do I find the correct speaker among them. We see that as N increases, the two systems start to converge.
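A rough cost model consistent with that description (T frames, M UBM mixtures, C retained mixtures, D feature dimensions, L speakers; the expressions are order-of-magnitude sketches introduced here, not measurements from the paper):

```latex
% Conventional GMM-UBM with top-C scoring: the per-speaker term scales with L and T
\[
\text{cost}_{\text{GMM--UBM}} \;\approx\;
\underbrace{T\,M\,D}_{\text{UBM alignment}}
\;+\;
\underbrace{L\,T\,C\,D}_{\text{per-speaker top-}C\text{ scoring}}
\]
% Fast MLLR scoring: the per-speaker term is only small matrix products
\[
\text{cost}_{\text{MLLR}} \;\approx\;
\underbrace{T\,M\,D}_{\text{UBM alignment}}
\;+\;
\underbrace{M\,D^{3}}_{\text{statistics } k_i,\,G_i}
\;+\;
\underbrace{L\,D^{3}}_{\text{per-speaker scoring}}
\]
```

The term that scales with L no longer grows with the utterance length T, which is why the MLLR curve stays nearly flat as either the population or the duration grows.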
The blue curve is the GMM-UBM and the reddish-brown curve is the MLLR system. The top-N performance, that is, the chance of the correct speaker appearing at least somewhere in the top hundred or so, is similar for the two, and that is what prompted the cascade architecture.

We thought we could exploit the advantage of the GMM-UBM, which is obviously superior to MLLR in terms of accuracy, and still get some computational gain by using the MLLR system to reduce the population of a thousand or two thousand speakers down to the top hundred or so, and then use only that reduced set of speakers in the final GMM-UBM system. That is the cascade. The idea is that the fast MLLR system first prunes the candidate list and reduces the search space of speakers: we identify the top hundred, or whatever N you choose, and the choice of N obviously has an impact on performance; then we let the conventional GMM-UBM operate only on this reduced set of speakers to identify the best one.

In terms of implementation, this slide basically shows that we do not lose much in additional computation. The conventional approach would take the test features, do an alignment with the UBM to find the top-C mixtures, and use the GMM-UBM based system to identify the speaker. We are doing exactly the same thing: the same alignment step goes on here, but in addition we compute the sufficient statistics, which is done only once, and we have the MLLR matrices, which were built in the training phase for each of the individual speakers. Using the statistics, the features and the MLLR hypotheses we identify the N most probable speakers, and once we have the N most probable speakers we feed them to the GMM-UBM system to get the final identified speaker. In both cases the alignment is shared.

So that is the compromise between complexity and performance. If I look at the N-best results, that is, if I let the back-end use only the reduced set of speakers, then for the ten-second case there is a degradation in performance, but as N increases this degradation decreases; if I go to around the top thirty there is still some hit in performance, but it is not very significant. On the other hand, even at the top thirty I still get a significant gain in computational complexity. As N increases, the back-end GMM-UBM has to work on more speakers, so its computation time grows and the speed-up comes down, but it is still significant: you still get about a five-times gain in computation.

The same thing is repeated for the one-side condition. The issue with the one-side condition is that there is a huge amount of data, around five minutes of speech. Again, if you look at the top-N results, with the top ten there is a big hit in performance, about 2.5 percent absolute loss, but if I increase N the degradation comes down to only about 0.7 percent. The price of a larger top-N list, though, is that even though I have reduced the number of speakers to N, the back-end GMM-UBM still has to operate on all of those N speakers with the long utterance, so compared to the ten-second case the gains are not as significant; I still get about a three-times speed-up in computation.
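A minimal sketch of the cascade just described, reusing the `sufficient_stats` and `mllr_score` helpers from the earlier sketch; `align_fn` and `gmm_ubm_score` are placeholders standing in for the shared UBM alignment and the conventional MAP/top-C scorer, and are assumptions rather than actual code from the system.

```python
import numpy as np

def cascade_identify(X, ubm, speaker_transforms, speaker_gmms,
                     align_fn, gmm_ubm_score, n_best=30):
    """Two-stage identification (sketch only): stage 1 prunes the speaker
    list with the fast MLLR scores, stage 2 runs the conventional GMM-UBM
    scoring on the surviving candidates. sufficient_stats / mllr_score are
    the helpers from the earlier sketch."""
    # One alignment of the test frames against the UBM, shared by both stages
    gammas, top_c = align_fn(X, ubm)

    # Stage 1: fast MLLR scoring (sufficient statistics computed once)
    k, G = sufficient_stats(X, ubm.means, ubm.vars, gammas)
    mllr_scores = np.array([mllr_score(W, k, G) for W in speaker_transforms])
    candidates = np.argsort(mllr_scores)[::-1][:n_best]   # keep the top-N speakers

    # Stage 2: conventional GMM-UBM top-C scoring, only over the candidates
    final_scores = {s: gmm_ubm_score(X, speaker_gmms[s], top_c) for s in candidates}
    return int(max(final_scores, key=final_scores.get))
```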
So this is the basic idea of our proposed method. It is a compromise: you can pick the operating point at whatever top-N you need, and you trade a small loss in accuracy for a gain in computation time. To summarise, we exploit MLLR matrices to do fast likelihood calculation for the speaker models. Using MLLR adaptation alone decreases the performance, slightly or significantly depending on whether you use ten-second or one-side utterances, so we combine it with the GMM-UBM back-end and use the MLLR stage to reduce the search space; that way you get better accuracy than MLLR alone while still gaining in computation time. For the ten-second condition on this database, if you choose the top ten you get the performance degradation and speed-up shown; for the one-side condition, with the top twenty, you get about a 3.1-times speed-up. So this is basically it. Thank you very much.

Q: If you want to achieve the same performance as the baseline, how much larger does the N-best list have to be?

A: I do have some results on that. MLLR adaptation obviously takes some hit compared to MAP; that is generally true, and that is what we noticed. So you may go to the top one hundred or two hundred and you will get closer and closer to the conventional GMM-UBM, but you will never get exactly the same; you are always going to give up something in performance, and the closer you get to the complete set, the more of the gain in computation time you lose. So what we think is that there will always be some hit in performance; how much of a hit is in your hands, and depending on how much you are willing to give up, you can get that much more gain in speed. If your question is whether I can have the full GMM-UBM performance and still get a speed-up, I am not sure about that; I think you will always lose something.

Q: In speech recognition I noticed that with an MLLR system I need more adaptation data than with MAP adaptation; no, wait, this morning we heard the opposite is true, right? The more data you have, the better MAP is compared to MLLR.

A: Right; here we use MLLR because we only estimate a transform from each speaker's data, not a fully adapted model.

Q: But in the simplest case, I mean constrained MLLR...

A: In most conventional cases, if you have enough data, you should obviously go back to MAP.

Q: If I understand it well, in the case of MLLR you use sufficient statistics, but in the case of the GMM-UBM you evaluate things frame by frame, you have the full evaluation, right? But you could use the same trick: you could compute sufficient statistics even for the original, MAP-adapted models.

A: OK, but what exactly is your question?

Q: I am just saying that the speed-up comes from collecting the sufficient statistics, not from the MLLR itself; you could use the same trick with the MAP-adapted models.
You can apply the auxiliary function with the sufficient statistics, instead of evaluating the GMM frame by frame, and use that as the score.

A: So you are saying I could do a similar thing for MAP, that is, collect the sufficient statistics once?

Q: Exactly. This is what we do, and it leads to much faster scoring; it would probably be even faster than the frame-by-frame scoring without losing any accuracy.

A: OK, so you think I could do it for MAP the same way as for MLLR; is that the question?

Q: I just think you are basically comparing two different things. If you want a fair comparison, you should also score the MAP-adapted models with the sufficient statistics, and I guess that would be about as fast.

A: I am not very familiar with that, so maybe I should have tried it; we always evaluate the top-C mixtures instead, and we did not do that. OK, maybe I should have.

Q: Going back to your original premise: you said you are primarily dealing with a large population set, but I also got the impression it is about durations. It was not just a large population; it was also the duration of the test utterance, scoring large populations at ten seconds versus one side; that was one of the comparisons you had. The MLLR approach you have is done essentially independently of the duration of the test utterance, except for the UBM statistics, right? But another approach people have taken, from speech recognition, is beam pruning: that is a well-known thing, and within a few frames you can drop a lot of hypotheses very quickly, so you do not necessarily have to keep a hundred-plus models active the whole time; if a speaker really is not a contender you can bail out early.

A: Yes, we actually mention in the paper that there are other methods you can use to speed this up, for example pruning or frame downsampling and things like that. We are not saying this is the only way of doing fast computation; it is one of the ways, and the others have always existed.

Q: Right, but the question, treating this as a research paper, is: you chose this method, and your baseline was full frame-by-frame scoring without the classical ways of speeding it up. Why was that the right comparison?

A: Even in the case of pruning, I am sure you would take some hit in performance; I do not think you can get absolutely the same performance as the full GMM-UBM, because there is the possibility that while pruning you throw some correct speakers out. So the full system is the ultimate performance that one would try to achieve.

Q: Do you know whether the errors introduced by pruning would be more than with your cascade?

A: I would have to look at the performance as a function of N; I have not compared that directly, but one could certainly use those techniques as well.
Q: I would also like to ask about applications: this is obviously more of a method; what kind of application do you have in mind here?

A: In this case we simply felt the need to compute the likelihood in an efficient manner, and I am sure there are many applications, maybe in audio indexing, where you have large populations and you want to identify somebody in a big database. We have not looked at any particular application; we just felt that wherever there are large databases and one is interested in identification, something like this might work. So it is a method rather than something driven by a specific application where we want to find a particular person.

Q: Fair enough; I would just like to know more about the application space.

Q: Did you try to use more than one MLLR transform per speaker?

A: We could do that; I think it is something we have been thinking of doing, but we have not done it yet. It should hopefully improve things, but we have not tried it.

Q: It would be interesting to compare with another type of scoring where, once you have the sufficient statistics for the test utterance, you actually estimate an MLLR transform for the test utterance as well, and then compare the MLLR transforms of the model and the test utterance, either by taking an inner product or with an SVM.

A: Yes, we are just using likelihoods; you are saying that, given the test utterance, I could estimate the test utterance's MLLR transform and compare it with each speaker's MLLR transform.

Q: And it would probably be more efficient, because once you have the MLLR matrix its dimension is lower than that of your sufficient statistics.

A: Note that the matrices are just of the dimension of the feature vectors, so per speaker this is very, very small.