0:00:06 So, as I mentioned, we are going to use MLLR adaptation statistics for the speaker identification problem, but we are not building any speech recognition system as such in this paper. The idea is that we are looking specifically at the case where there is a large number of speakers and we want to identify one of them, and we want to do it in a computationally efficient way. This work was actually done by my students.
0:00:39 Just to give you a brief overview of the talk: I will go briefly through the speaker identification problem, that is, identifying one out of a set of L speakers, and I will talk about the commonly used technique of MAP adaptation followed by top-C mixture based likelihood estimation. Then we show that if you have a large number of speakers, evaluating the likelihood across all the speakers and choosing the best one is obviously very computationally expensive, since the number of speakers in the population can be very large. So we propose to use MLLR matrices for the adaptation of the speaker models. The reason is that we then only need to store the MLLR matrices, and we show that once you have the MLLR matrices, estimating the likelihood of the different speakers is a very fast step: it is just a matrix multiplication with the MLLR row vectors. We then give a comparison with the performance of the conventional GMM-UBM based system, and we show that although the MLLR system is fast, it gives some degradation in performance. Therefore, finally, we propose a cascade system where the MLLR system reduces the search space from the huge population, and then the final GMM-UBM system only has to look at a small set of speakers and identify the best speaker from that set. So this is the basic flow of the talk.
0:02:10 As I said, the idea is that we are doing speaker identification, so there are L speakers and the task is closed-set: we assume there are L speakers in the population. Given a test feature sequence, we are going to evaluate its likelihood with respect to all the speaker models and choose the model that maximises it. Obviously, when the number of speakers in the population is large, I have to evaluate the likelihood for each and every speaker in the population, and therefore the computational complexity keeps growing as the number of speakers becomes large.
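A minimal sketch of this brute-force baseline (not the authors' code), with a hypothetical layout where each enrolled speaker is a diagonal-covariance GMM stored as a dict; the point is simply that the cost grows linearly with L:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_loglike(X, gmm):
    """Total log-likelihood of frames X (T, D) under one diagonal-covariance GMM,
    stored here as a dict with keys "weights" (M,), "means" (M, D), "vars" (M, D)."""
    per_comp = np.stack([
        np.log(w) + multivariate_normal.logpdf(X, mean=mu, cov=np.diag(var))
        for w, mu, var in zip(gmm["weights"], gmm["means"], gmm["vars"])
    ])                                               # shape (M, T)
    return np.logaddexp.reduce(per_comp, axis=0).sum()

def identify(X, speaker_models):
    """Closed-set identification: score every enrolled model and keep the best.
    The cost is linear in the number of speakers L."""
    scores = [gmm_loglike(X, gmm) for gmm in speaker_models]
    return int(np.argmax(scores))
```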
0:02:47 So what is the conventional method, the most popular method used for speaker identification (pretty much the same thing is used for speaker verification)? We are given a universal background model (UBM), and for each of the speakers we basically do a MAP adaptation of the UBM to get the speaker models, so these are speaker-adapted models. Then, as Doug Reynolds pointed out, the scoring can still be done efficiently: given the test data and such models, we first align the test data with respect to the UBM and find the top-C mixtures for that particular test data. So when you want to evaluate the likelihood, you do not have to compute all of the two-thousand-odd mixtures (assuming that many mixtures in the background model) for each speaker model; instead, you do the full evaluation over all the mixtures only once, for the UBM, and then for each of the speaker models you just need those C mixtures to be evaluated. But nevertheless, as L becomes large there is a large increase in the computation, so, as we will show, it is still expensive, especially when L becomes large.
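A sketch of this top-C trick under the same hypothetical GMM layout as above; for clarity the speaker model is evaluated in full here, whereas in practice only the C selected components per frame would be computed:

```python
import numpy as np
from scipy.stats import multivariate_normal

def component_loglikes(X, gmm):
    """(T, M) matrix of log(w_m) + log N(x_t; mu_m, diag(var_m))."""
    return np.stack([
        np.log(w) + multivariate_normal.logpdf(X, mean=mu, cov=np.diag(var))
        for w, mu, var in zip(gmm["weights"], gmm["means"], gmm["vars"])
    ], axis=-1)

def top_c_score(X, ubm, speaker_gmm, C=15):
    """Reynolds-style fast scoring: rank the UBM components once per frame,
    then sum the speaker model only over those C components per frame."""
    ubm_ll = component_loglikes(X, ubm)              # full pass over the UBM, done once
    top = np.argsort(-ubm_ll, axis=1)[:, :C]         # top-C component indices per frame
    spk_ll = component_loglikes(X, speaker_gmm)      # in practice only these C entries are computed
    rows = np.arange(len(X))[:, None]
    return np.logaddexp.reduce(spk_ll[rows, top], axis=1).sum()
```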
0:04:09 What we are proposing is again adaptation, but instead of doing MAP adaptation, why don't we build the speaker models using just MLLR adaptation? The idea is that for each speaker, given that we already have the UBM, instead of MAP adaptation we end up with a speaker model that has gone through MLLR speaker adaptation. This is where, I think, the confusion came from: we are using MLLR adaptation, but we are borrowing it from the speech recognition literature. For each speaker, the means of the speaker model are nothing but a matrix transformation of the means of the universal background model. So the idea is that I just need this matrix, the MLLR matrix, to characterise a speaker. In essence we are not forming individual speaker models; instead, each speaker is now codified by his or her speaker-specific MLLR matrix.
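In equations this is the usual MLLR mean transform: the adapted mean of mixture m is A_s mu_m + b_s, compactly W_s applied to the extended mean [1, mu_m]. A small sketch, with assumed array shapes, of how the adapted means would be generated from the stored matrix alone:

```python
import numpy as np

def mllr_adapt_means(ubm_means, W):
    """Speaker means as an affine transform of the UBM means:
    mu_hat_m = A mu_m + b, written compactly as W [1, mu_m] with W = [b | A].
    ubm_means: (M, D); W: (D, D+1). Only W is stored per speaker."""
    extended = np.hstack([np.ones((len(ubm_means), 1)), ubm_means])  # (M, D+1) extended means
    return extended @ W.T                                            # (M, D) adapted means
```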
0:05:12 So in the training stage we actually build the speaker-specific MLLR matrix, and the identification problem then becomes one where we have L such matrices, and of course these matrices are what characterise the L speakers. The likelihood calculation essentially boils down to finding, for the test utterance, its likelihood with respect to the background model transformed by each of these L matrices, which are already stored since we have done MLLR adaptation for each of the individual speakers. At this point it still looks like we need to compute all the L likelihoods, and therefore it still looks like we have not solved anything. But the advantage is that if I want to compute these individual likelihoods, it is now very simple: all that I need to do is a few matrix multiplications to get the likelihood for each individual speaker.
0:06:19 The idea here is again borrowed from the speech recognition literature, because we are even using the equations from MLLR matrix estimation: we use the same auxiliary function. In conventional speech recognition, when you do MLLR estimation, you are trying to estimate the speaker matrix W_s given the adaptation data. The idea is that, given the adaptation utterance X, what are the elements of the matrix that maximise the likelihood, or rather the auxiliary function you are looking at? Now we pose the same problem in a speaker identification framework. I already know the L speaker matrices: for each individual speaker the MLLR matrix is already known, and the problem is now one of finding out which of those L matrices maximises the likelihood. So in this case I am not estimating the MLLR matrices; I have already computed and stored them for each of the individual speakers, and the only thing I am doing here is finding the one of those L MLLR matrices that maximises the likelihood. This is done very efficiently, with machinery again borrowed from speech recognition. We already have the L matrices, each of which is represented by its row vectors w_1 ... w_D; these are all row vectors. In MLLR these row vectors are what is estimated when you actually do speaker adaptation; here they are already precomputed and stored, and we only compute the likelihood.
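For reference, my reconstruction in the standard notation (following the MLLR mean-transform derivation of Leggetter and Woodland, and Gales) of the auxiliary function and of the quantities the talk calls k_i and G_i, assuming diagonal covariances:

```latex
\begin{align*}
\xi_m &= [\,1,\ \mu_m^{\top}\,]^{\top}, \qquad \hat{\mu}_m = W\,\xi_m,\\
Q(W) &= \sum_{i=1}^{D}\Big( w_i\,k^{(i)} \;-\; \tfrac12\, w_i\, G^{(i)}\, w_i^{\top}\Big) + \text{const},\\
k^{(i)} &= \sum_{m}\frac{1}{\sigma_{m,i}^{2}}\Big(\sum_{t}\gamma_m(t)\,x_{t,i}\Big)\,\xi_m,
\qquad
G^{(i)} = \sum_{m}\frac{1}{\sigma_{m,i}^{2}}\Big(\sum_{t}\gamma_m(t)\Big)\,\xi_m\,\xi_m^{\top}.
\end{align*}
```

Here w_i is the i-th row of W, gamma_m(t) is the UBM occupation probability of frame x_t, and sigma^2_{m,i} is the i-th diagonal variance of mixture m. In adaptation one solves w_i G^(i) = (k^(i))^T for each row; in identification the stored rows of each W_s are simply plugged in and Q(W_s) is compared across speakers.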
0:08:07 So why is it efficient? I said I do need to compute all the likelihoods, but I can do that very efficiently. Why? Because I just need one alignment of the data with respect to the UBM, and that is exactly the same thing that is normally done in MAP plus top-C likelihood estimation: there too you need an alignment to find out which mixtures are dominant for the data. So that step is exactly the same as in MAP plus top-C. The extra step that we do, again borrowed from speech recognition, is to compute, for the given test utterance, the corresponding sufficient statistics k_i and G_i. These sufficient statistics are computed from the alignment and from the data: the alignment gives the occupancies and then the data comes in. Then, for each of the L speakers, I need just one matrix multiplication using these k_i and G_i statistics. The k_i and G_i are computed only once, irrespective of the number of speakers, but the likelihood calculation uses the individual row vectors from the corresponding speaker's MLLR matrix. Each matrix has D rows, one row vector per feature dimension, so if I score speaker s, I take that speaker's i-th row w_i and the computation is just a matrix multiplication with k_i and G_i. In a sense this is the most crucial step: the likelihood can be easily computed for each of those L speakers using the corresponding MLLR matrix and a matrix multiplication, and that is where we get the maximum gain, in terms of computation time.
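A sketch of these two pieces, under the same hypothetical GMM layout as the earlier sketches: one pass over the test frames accumulates k_i and G_i from the UBM alignment, and the per-speaker score then touches only the stored rows of that speaker's matrix:

```python
import numpy as np
from scipy.stats import multivariate_normal

def collect_mllr_stats(X, ubm):
    """One UBM alignment of the test frames X (T, D), then accumulate the
    per-dimension statistics k (D, D+1) and G (D, D+1, D+1); done once per utterance."""
    means, varis, weights = ubm["means"], ubm["vars"], ubm["weights"]
    M, D = means.shape
    log_post = np.stack([
        np.log(w) + multivariate_normal.logpdf(X, mean=mu, cov=np.diag(var))
        for w, mu, var in zip(weights, means, varis)
    ], axis=-1)                                                    # (T, M)
    gamma = np.exp(log_post - np.logaddexp.reduce(log_post, axis=1, keepdims=True))
    k = np.zeros((D, D + 1))
    G = np.zeros((D, D + 1, D + 1))
    for m in range(M):
        xi = np.append(1.0, means[m])                              # extended mean [1, mu_m]
        occ = gamma[:, m]                                          # frame occupancies of mixture m
        outer = np.outer(xi, xi)
        for i in range(D):
            inv_var = 1.0 / varis[m, i]
            k[i] += inv_var * (occ @ X[:, i]) * xi
            G[i] += inv_var * occ.sum() * outer
    return k, G

def mllr_score(W, k, G):
    """Score one stored speaker transform W (D, D+1) against the shared statistics:
    D small vector-matrix products per speaker, independent of the utterance length."""
    return float(sum(w_i @ k[i] - 0.5 * w_i @ G[i] @ w_i for i, w_i in enumerate(W)))
```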
0:09:57 Just to go through the whole flow: given the feature vectors, I am assuming that I have already taken each individual speaker's training data and computed the MLLR matrices for all L speakers. Given a test feature sequence, I first do an alignment with the background model and also compute the k_i and G_i statistics; this is done only once, using the test features and the UBM. Then, with respect to each of those L matrices, I just need to multiply the matrix with the statistics to get that speaker's likelihood. So this is computationally very efficient, because it only involves matrix multiplications. Please stop me if you have any questions.
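Putting the flow together, a minimal sketch that reuses collect_mllr_stats and mllr_score from the previous sketch; mllr_transforms is a placeholder for the L precomputed speaker matrices:

```python
import numpy as np

# mllr_transforms: hypothetical list of the L precomputed (D, D+1) speaker matrices,
# estimated offline from each speaker's training data against the same UBM.
def fast_mllr_identify(X, ubm, mllr_transforms):
    k, G = collect_mllr_stats(X, ubm)              # alignment + statistics: done once per test utterance
    scores = np.array([mllr_score(W, k, G) for W in mllr_transforms])
    return int(np.argmax(scores))                  # per-speaker cost does not depend on utterance length
```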
0:10:42 The proof of the pudding is in the timing and complexity analysis. What we are doing here is comparing the conventional MAP plus top-C approach, that is, the conventional GMM-UBM, with the fast MLLR system, where we only have the MLLR matrices that capture the speaker characteristics. What is shown on the left is again from the 2004 data, and we have two different test conditions: one with ten seconds of speech and the other with one conversation side. Given the test data, we are trying to identify one of the enrolled speaker models, and results are shown for both the ten-second and the one-side cases. The blue bars are the conventional approach; here we have taken C to be the top fifteen mixtures. You can see that there is obviously a degradation in performance in the MLLR case. For the one-side speech the GMM-UBM obviously does better, and there is a corresponding improvement for the MLLR case as well, but again there is a gap in performance between the conventional approach and the proposed approach.

The advantage comes in the right half of the figure, which shows, for a fixed computer configuration, the average time taken to identify the best speaker. You can see that there is a big gain in terms of computation time: the conventional system takes about 10.3 seconds on average, while the proposed system takes about one second on average on the ten-second data, over the full set of speakers. When the test utterance becomes longer it obviously takes much more time to compute: the conventional system takes about 44 seconds, versus a few seconds for the MLLR system. So the bottom line is that you get a big gain, roughly a factor of seven to ten, with the fast MLLR system; this is useful if you have, say, two thousand speakers in your set and you want to identify which one of them produced the utterance. But there is a downside: you lose something in terms of performance. And obviously, when the test utterances are longer, the GMM-UBM takes a lot more time, and that is where you gain more with the proposed approach.
0:13:37 This slide gives a little more detailed analysis of what is happening between the proposed fast MLLR system and the GMM-UBM. Since the likelihood has to be computed for every speaker, the left-hand figure shows the computation time as the number of speakers in the database increases. The blue lines are the conventional approach; for ten-second speech it obviously takes less time than for one-side speech, but you can see that there is roughly a linear relationship with the number of speakers in the database: as the number of speakers increases, the computation time essentially increases linearly. On the other hand, if you look at the MLLR system, which is the darker brown line, it is almost flat as the number of speakers increases. That is because the main complexity lies in the alignment and in collecting the statistics; the actual likelihood evaluation does not grow significantly with the number of speakers, since it is just matrix multiplications with the MLLR matrices. So with a population of, say, two thousand speakers there will be a huge gain in terms of computation time.

The other interesting thing is the N-best performance of the two systems, that is, if I look at the top N speakers, how often is the correct speaker present in that list. We see from the figure that as N increases, the two systems start converging: the blue curve is the GMM-UBM and the red-brown curve is the MLLR. The top-N performance, that is, having the correct speaker at least somewhere in the top hundred, is similar for the two systems. So we thought we could exploit this: keep the advantage of the GMM-UBM, which is obviously superior to MLLR in terms of accuracy, and still get some computational gain, by using the MLLR system to pick, from the population of a couple of thousand, the top hundred or two hundred speakers, and then use only that reduced set of speakers in the final GMM-UBM system. That is the idea of the cascade.
0:16:01 So the idea of the cascade is that the fast MLLR system first prunes the candidate list and reduces the search space of speakers: we identify the top N probable speakers, where N is a tuning parameter that has an impact on performance, and then we let the conventional GMM-UBM operate only on this reduced set of speakers to identify the best one. This slide shows the same thing in terms of implementation, and it basically shows that we do not lose much in terms of additional computational cost. The conventional approach would have taken the test features, done an alignment with the UBM, found the top-C mixtures, and used the GMM-UBM based system to identify the speaker. We do exactly the same thing: there is an alignment step here, but we add an additional computation of the sufficient statistics, which is done only once. Then we have the MLLR system, whose matrices were built in the training phase, so the MLLR matrices for each individual speaker are already available. Using the statistics, the features, and the MLLR matrices, we identify the N most probable speakers, and once we have the N most probable speakers we feed them to the GMM-UBM system to get the final identified speaker. In both cases the alignment step is shared.
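A compact sketch of the cascade, assuming the helpers from the earlier sketches (collect_mllr_stats and mllr_score for the fast stage, top_c_score for the GMM-UBM back end); n_best is the operating point discussed on the next slides:

```python
import numpy as np

def cascade_identify(X, ubm, mllr_transforms, speaker_gmms, n_best=30):
    """Two-stage identification: the fast MLLR scores prune the population to
    n_best candidates, then the GMM-UBM top-C back end makes the final decision."""
    k, G = collect_mllr_stats(X, ubm)                        # shared UBM alignment + statistics
    coarse = np.array([mllr_score(W, k, G) for W in mllr_transforms])
    shortlist = np.argsort(coarse)[-n_best:]                 # top-N candidates from the fast stage
    fine = {int(s): top_c_score(X, ubm, speaker_gmms[s]) for s in shortlist}
    return max(fine, key=fine.get)                           # final GMM-UBM decision
```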
0:17:32 So this is the compromise between complexity and performance. If I look at the N-best performance, that is, if I use a reduced set of N speakers in the back end: for the ten-second case there is a degradation in performance for small N, but the degradation decreases as N grows, and if we go up to about the top thirty there is still some hit in performance, but it is not very significant. On the other hand, even for the top thirty I still get a significant gain in terms of computational complexity. As the number of shortlisted speakers increases, the back-end GMM-UBM system has to work on more speakers, so its computation time grows and the speed-up is reduced, but it is still significant: you still get about a five-times gain in computation.

The same thing is repeated for the one-side case. The issue with the one-side condition is that there is a huge amount of data, about five minutes of speech. Again, looking at the top-N results: if I take only the top ten, there is a big hit in performance, about 2.5 percent absolute loss, but if I go to the top forty I get only about 0.7 percent degradation. Because the back end has to operate on the whole shortlist, even after reducing the number of speakers to forty the back-end GMM still has to score all forty speakers on a long utterance, so compared with the ten-second case the gains are not as large; still, we get about a three-times reduction in computation.
0:19:22 So this is the basic idea of our proposed method. There is a compromise, and you can actually pick the operating point at any of these N-best values; you trade a small loss in performance for a gain in computation. Basically, we are exploiting the MLLR matrices to do fast likelihood calculation for the speaker models, but using MLLR adaptation degrades the performance, slightly or significantly depending on where you operate, and therefore we combine it with the conventional GMM-UBM back end, using the MLLR scores to reduce the search space, so that you keep most of the accuracy but still gain in terms of computation time. For the ten-second task on this database, if you choose the top ten you get the performance degradation and speed-up shown here, and for the one-side case with the top twenty you get a speed-up of about 3.1. So that is basically it.
0:20:41 Thank you very much.
0:20:51 [Audience, partly inaudible] ...to achieve the same result... if you want the same performance, how much more would you need? Do you have some results where you achieve the same performance, not just nearly the same?
0:21:26 [Speaker] If I understand the question: MLLR adaptation always takes some hit compared to MAP; that is generally true, and that is what we noticed. So even if N were a hundred or two hundred, you would get closer and closer to the conventional GMM-UBM, but you will never get exactly the same; you are always going to lose something in performance. And as N gets closer to the complete set, obviously whatever gain in computation time you had goes away, so you would lose the speed-up if you insist on comparable performance. What we think is that you will always have some hit in performance; how much of a hit is in your hands, and depending on how much degradation you are willing to accept, you get that much more gain in computation. So if your question is whether I can achieve the GMM-UBM performance and still get a speed-up, I am not sure about that; I think you will always lose something.
0:22:31i listen to a in speech recognition i notice that using it so you system
0:22:38i need more adaptation data done and map adaptation
0:22:41note
0:22:41this morning
0:22:42since my
0:22:43the opposite is true right i mean yeah the better more data you have not always better than
0:22:48mllr right
0:22:49this is what i
0:22:50this alignment so that we do mllr because i do
0:22:55estimating and and not
0:22:57yeah but the most simple i mean the constrained mllr see
0:22:59so i sit in my
0:23:01no matter what is normally most conventional cases
0:23:04uh if you have enough data obviously we should go back to that
0:23:12 [Audience] Okay, if I understand it well, in the case of MLLR you use sufficient statistics, but in the case of the GMM-UBM you only score things frame by frame, with this top-C evaluation, right? [Speaker] That's right. [Audience] But you could use the same trick, right? You could actually collect sufficient statistics even for the original MAP-adapted model. [Speaker] Yes... so what is your question exactly? [Audience] I am just saying that your speed-up comes from collecting the sufficient statistics and then evaluating the MLLR system quickly; I don't know, but you could use the same trick with the MAP-adapted model: apply the auxiliary function over the sufficient statistics instead of evaluating the GMM frame by frame and using that as the score. [Speaker] So you are saying I could do a similar thing for MAP, I mean collect sufficient statistics? [Audience] Exactly, yes. Well, this is what we do, and it leads to something much faster; it would probably be even faster than this frame-by-frame scoring, without losing any performance. [Speaker] Okay... so you think I could do this for MAP the way I do it for MLLR, is that the question? [Audience] I just think you are basically comparing two different things; if you want a fair comparison, you should evaluate both models with sufficient statistics, and I guess the cost would be about the same. [Speaker] I am not very familiar with that, so maybe I should look at it: why do we always evaluate all the top-C mixtures then, and not do that? Okay, maybe I should look into it.
0:25:25 [Audience] Going back to your original premise: you were primarily focused on speed, right? You said you are dealing with a large population set, but I also got the sense you were talking about durations; it was not just the large population, it was also the duration of the test utterance, scoring large populations at ten seconds versus one side, and that was one of the comparisons you had. The MLLR approach you have is done kind of independently of the duration: except for the UBM statistics, it is independent of the duration of the test data, right? [Speaker] Right. [Audience] But another approach people have taken, coming from speech recognition, is: why not use the notion of beam pruning? That is a well-known thing; you prune as frames come in, and you can drop a lot of hypotheses very quickly, so you do not necessarily have to keep all the candidates active at any time, and if the speech is really long you can ultimately bail out early. [Speaker] Yes, we actually mention in the paper that there are other methods you can use to speed things up, for example pruning or downsampling and things like that. We are not saying this is the only way of doing fast computation; it is one of the ways we could possibly do it. [Audience] Right, but the question, since this is a research paper, is: you chose this method, and your baseline was full frames without the classical other ways of speeding up. Why was this the right way to do it? [Speaker] Even in the case of pruning, I am sure you would get some hit in performance; I do not think you can get absolutely the same performance as the full GMM-UBM, because there is the possibility that while pruning you throw some speakers out. So the full GMM-UBM would be the ultimate performance that one would try to achieve. [Audience] Do you know whether the errors introduced by pruning would be more than those introduced by your approach? [Speaker] Okay, I do not know.
0:27:48 [Audience, partly inaudible] ...the performance of the system as a function of the number of speakers... what kind of application do you have in mind for this? [Speaker] In this case we just thought that there is a need to compute the likelihood in an efficient manner, and I am sure there are a lot of applications, maybe in audio indexing or something similar, where you might have a large population of speakers and you might want to identify somebody from a big database. We have not specifically looked at any particular application; we just felt that there are, at least possibly, a lot of applications where there are large databases and one is interested in identification, and something like this might work. So it is the other way around: rather than having an application for which we want to find a solution, we just looked at the computational problem. [Audience] ...I would just like to know more about the application space... [Speaker] Sure.
0:29:24 [Audience] Did you try to use more than one MLLR transformation per speaker? [Speaker] We could do that, yes; I think that is something we have been thinking of doing, but we have not tried it yet. It should hopefully improve things, but we have not checked.
0:29:45 [Audience] It would be interesting to compare the type of scoring you do with another type of scoring where, once you have the sufficient statistics for the test utterance, you actually estimate an MLLR transform for the test utterance as well, and then compare the MLLR transforms of the model and of the test utterance, either by taking an inner product or with an SVM. [Speaker] Yes, we are just using the likelihood here; you are saying that, given the test utterance, I could estimate the test utterance's own MLLR transform and compare it with the speaker's MLLR transform. [Audience] It would probably be even more efficient, because once you get the MLLR matrix, its dimension is lower than that of your sufficient statistics; you only have to compare things whose dimension is essentially that of the feature vectors, so the comparison is very, very small.
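A small sketch of what the questioner suggests, reusing the k and G statistics from the earlier sketches: solve the standard MLLR row equations on the test-utterance statistics to get a test transform, then compare vectorised transforms directly (a cosine score here; an SVM over the same vectors is the alternative mentioned):

```python
import numpy as np

def estimate_mllr(k, G):
    """Closed-form MLLR mean transform from the test-utterance statistics:
    row i maximises w_i k_i - 0.5 w_i G_i w_i^T, i.e. w_i = G_i^{-1} k_i."""
    return np.stack([np.linalg.solve(G[i], k[i]) for i in range(len(k))])

def transform_similarity(W_test, W_speaker):
    """Compare two vectorised MLLR transforms with a cosine score; an SVM over
    the same vectors is the other option raised in the discussion."""
    a, b = W_test.ravel(), W_speaker.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```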
0:31:06 [Session chair] Let's thank the speaker again.