0:00:23 The talk I am going to give is about how to make uncertainty propagation run fast and also consume less memory.
0:00:32 My name is Man-Wai Mak, from The Hong Kong Polytechnic University.
0:00:37 So here is the outline of my presentation. I will first give an overview of i-vector PLDA, and then explain how uncertainty propagation can model the uncertainty of the i-vector, and how to make uncertainty propagation run faster and possibly use less memory.
0:00:59 Then we evaluate the proposed approach on the NIST 2012 SRE.
0:01:08 Finally, we conclude. Okay, so here is the i-vector PLDA framework. Probably you all already know this, so I will go through it very quickly here.
0:01:22 We use the posterior mean of the latent factor as a low-dimensional representation of the speaker. Given the MFCC vectors of an utterance, we compute the posterior mean of the latent factor, and we call this the i-vector.
0:01:47 Okay, and T is the total variability matrix that defines the channel and speaker subspace, or rather the subspace where the i-vectors lie.
0:02:00 So here is the procedure for i-vector extraction. Given a sequence of MFCC vectors, we extract the i-vector as the posterior mean of the latent factor.
0:02:11 Because we would like to use Gaussian PLDA, we first need to suppress the non-Gaussian behavior of the i-vectors through some preprocessing, for example whitening and also length normalization.
0:02:29 After this preprocessing step, we get the preprocessed i-vectors, and these preprocessed i-vectors can be modeled by PLDA.
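The preprocessing step just mentioned, whitening followed by length normalization, can be sketched as follows. This is a minimal illustration with my own function names and a simple eigendecomposition-based whitening; it is not necessarily the exact recipe used in the talk.

```python
import numpy as np

def preprocess_ivectors(ivecs, dev_ivecs):
    """Whiten i-vectors using development data, then length-normalize.

    `ivecs` and `dev_ivecs` are (N, D) arrays of raw i-vectors. The
    whitening transform and the function name are illustrative choices.
    """
    mu = dev_ivecs.mean(axis=0)
    cov = np.cov(dev_ivecs, rowvar=False)
    # Whitening transform W = C^{-1/2}, via eigendecomposition of the
    # development-set covariance.
    evals, evecs = np.linalg.eigh(cov)
    W = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
    x = (ivecs - mu) @ W.T
    # Length normalization: project every i-vector onto the unit sphere,
    # which suppresses the non-Gaussian (heavy-tailed) behavior.
    return x / np.linalg.norm(x, axis=1, keepdims=True)
```

After this step, every preprocessed i-vector has unit length, which is what makes the Gaussian PLDA assumption reasonable.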
0:02:44 So the idea is that the preprocessed i-vector is modeled as w_ij = m + V h_i + epsilon_ij, where V represents the speaker subspace and h_i is the speaker factor.
0:02:55 As you can see, for all J sessions of the i-th speaker, we only have one latent factor, h_i, okay?
0:03:07 And epsilon_ij represents the variability that cannot be represented by the speaker subspace.
0:03:14 So now, in the scoring, at test time we have the test i-vector w_t, and we also have the target speaker's i-vector w_s.
0:03:30 And we compute the likelihood assuming that w_s and w_t come from the same speaker, and we also have the alternative hypothesis, where w_s and w_t come from different speakers.
0:03:45 Then, after some mathematical manipulation, we get this very nice equation. In this equation we only have matrix and vector multiplications, and the nice thing is that the matrices here can all be precomputed, as you can see from this set of equations at the bottom.
0:04:06 All these terms, Sigma_ac, Sigma_tot, and so on, can be precomputed from the PLDA model parameters. That explains why PLDA scoring is very fast.
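To illustrate why the precomputation makes conventional PLDA scoring so cheap, here is a minimal sketch in the common two-covariance formulation. The symbols P and Q and the function names are my own reconstruction; the slide's exact equations may use different letters.

```python
import numpy as np

def precompute_plda(V, Sigma):
    """Precompute the PLDA scoring matrices from model parameters.

    V is the speaker loading matrix and Sigma the residual covariance.
    P and Q follow the usual two-covariance Gaussian PLDA formulation;
    these names are assumptions, not the talk's notation.
    """
    Sig_ac = V @ V.T                      # across-speaker covariance
    Sig_tot = Sig_ac + Sigma              # total covariance
    tot_inv = np.linalg.inv(Sig_tot)
    aux = np.linalg.inv(Sig_tot - Sig_ac @ tot_inv @ Sig_ac)
    Q = tot_inv - aux
    P = tot_inv @ Sig_ac @ aux
    return P, Q

def plda_score(ws, wt, P, Q):
    # Only matrix-vector products remain at scoring time; all inverses
    # were done once, before any trial is scored.
    return ws @ Q @ ws + wt @ Q @ wt + 2.0 * ws @ P @ wt
```

Because P and Q depend only on the model, millions of trials can be scored with nothing but these small matrix-vector products.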
0:04:26 But one problem of this conventional i-vector PLDA is that it does not have the ability to represent the reliability of the i-vector.
0:04:39 Whether the utterance is very long or very short, we still use a low-dimensional i-vector to represent the speaker characteristics of the whole utterance.
0:04:52 This poses a problem for short-utterance speaker verification. It is not a problem for very long utterances, say when we have three minutes or sixteen minutes of speech.
0:05:04 But if the utterance is only about ten seconds, or three seconds, then the variability, or uncertainty, of the i-vector will be so high that the PLDA score will favor the same-speaker hypothesis, even if the test utterance is given by an impostor.
0:05:24 The reason is that if the utterance is very short, we will not have enough acoustic vectors for the MAP estimation; that is, we do not have enough acoustic vectors to compute the posterior mean of the latent factor in the factor analysis model.
0:05:44 So the idea of uncertainty propagation is that we not only extract the i-vectors, but also the posterior covariance matrices.
0:05:55 This diagram illustrates the idea. This Gaussian represents the posterior density of the latent factor, and the i-vector is its mean, so it is a point estimate. This equation shows the procedure for computing it.
0:06:16 Okay, so T_c is the c-th partition of the total variability matrix. But as you can see, if the variance of this Gaussian is very large, then the point estimate will not be very accurate, and this happens when the utterance is very short.
0:06:36 If the utterance is very short, N_c, which is the zeroth-order sufficient statistic, will be very small.
0:06:43 Since this part of the precision is weighted by N_c, the whole posterior covariance matrix, L inverse, will be very big, which means the variance will be large. As a result, the point estimate will be very unreliable.
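A minimal sketch of the i-vector posterior computation from sufficient statistics, showing why a small zeroth-order statistic N_c inflates the posterior covariance. The function and variable names are mine; the formulas are the standard factor analysis posterior.

```python
import numpy as np

def ivector_posterior(T_blocks, Sigma_blocks, N, F):
    """Posterior mean (the i-vector) and covariance of the latent factor.

    T_blocks[c]: (feat_dim, ivec_dim) partition T_c of the total
    variability matrix; Sigma_blocks[c]: diagonal UBM covariance of
    mixture c; N[c]: zeroth-order statistic; F[c]: centred first-order
    statistic. A sketch of the textbook formulas, not the authors' code.
    """
    D = T_blocks[0].shape[1]
    L = np.eye(D)                 # posterior precision, L
    b = np.zeros(D)
    for Tc, sc, Nc, fc in zip(T_blocks, Sigma_blocks, N, F):
        TS = Tc.T / sc            # T_c^T Sigma_c^{-1} (diagonal Sigma_c)
        L += Nc * TS @ Tc         # precision grows with N_c ...
        b += TS @ fc
    cov = np.linalg.inv(L)        # ... so small N_c -> large covariance
    return cov @ b, cov
```

The test below confirms the point made on the slide: with fewer frames (smaller N_c) the posterior covariance is larger, so the point estimate is less reliable.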
0:07:00 That is why, in 2013, Kenny proposed PLDA with uncertainty propagation.
0:07:06 The idea is that, in addition to extracting the i-vector, we also extract the posterior covariance of the latent factor, and that represents the uncertainty of the i-vector.
0:07:19 Then, with some preprocessing, as I have mentioned, because we want to use Gaussian PLDA as the final stage of the modeling and for the scoring, we also need to preprocess the covariance, obtaining the preprocessed version of the posterior covariance matrix, and after that we can do the PLDA modeling.
0:07:49 Now, where does the propagation of this covariance come in?
0:07:52 The uncertainty propagation comes from this generative model. In the generative model we have w_ij = m + V h_i + U_ij z_ij + epsilon_ij. This U_ij is like the eigenchannel matrix in the conventional PLDA model with eigenchannels, but instead of being fixed, it depends on the session: it depends on the j-th session of the i-th speaker. As a result, z also depends on i and j.
0:08:26 Now, the trouble with this is that for every test utterance we also need to compute this U_ij. So unlike the eigenchannel matrix, which we only need to precompute once and then use during scoring, in uncertainty propagation this U_ij has to be computed during scoring time, because it is session dependent.
0:08:49 To compute this U_ij, we perform a Cholesky decomposition of the posterior covariance matrix, and that is why the intra-speaker covariance matrix has this form.
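The Cholesky step can be shown in a couple of lines. This assumes the (preprocessed) posterior covariance is symmetric positive definite, so that a lower-triangular factor exists.

```python
import numpy as np

def loading_from_covariance(cov):
    """Session-dependent loading matrix U such that U @ U.T == cov.

    cov is the (preprocessed) posterior covariance of an i-vector; the
    Cholesky factor is one valid choice of U (any square root works).
    """
    return np.linalg.cholesky(cov)
```

This factorization is cheap for one utterance, but having to do it, and everything that depends on it, for every trial at scoring time is exactly the cost the talk is trying to remove.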
0:09:05 So finally, during the scoring with PLDA-UP, we have this equation, which is very similar to the scoring equation of the conventional PLDA; as you can see, it is also just matrix and vector multiplications.
0:09:26 But the difference is that this time the matrices A, B, C, and D all depend on the test utterance: as you can see from this set of equations, A_st, C_st, D_st, and so on all carry the test-utterance index.
0:09:42 That means they cannot be precomputed; only a very small number of matrices can be precomputed. This set has to be computed during scoring time, and only that set can be computed before scoring time.
0:09:59 So we lose much of the computational saving because of the use of the covariance matrices.
0:10:07 So this slide summarizes the computation that needs to be performed. For the conventional PLDA, we have almost nothing to compute: we only need these matrix and vector multiplications.
0:10:22 But for the PLDA with UP, we have to compute this whole set of matrices on the right. As you can see, this increases the computational complexity a lot.
0:10:34 It also increases the memory requirement, because for every target speaker we need to store these matrices A, B, C, and D.
0:10:47 So we propose a way of speeding up the computation and, at the same time, of reducing the memory consumption.
0:10:57 The whole idea comes from this equation. From this equation you can see that the posterior covariance matrix only depends on N_c, and at testing time N_c is the zeroth-order sufficient statistic of the test utterance.
0:11:16 Okay, so if two i-vectors are of similar duration, we assume that their posterior covariance matrices are similar, because, as you can see, the MFCCs, or the acoustic vectors, enter only through the zeroth-order sufficient statistics.
0:11:36 Having this hypothesis, we can group the i-vectors according to their reliability.
0:11:50 We use a scalar to quantify the reliability: for each group, the i-vectors' reliability is modeled by one representative covariance matrix, and we obtain the posterior covariance matrices from the development data.
0:12:06 Okay, so here the subscript k stands for the group, and this U_k is independent of the session.
0:12:18 If you look at the bottom of the slide, we have U_ij, which depends on the session; but now, if you look up here, we have successfully made U_ij, which was session dependent, become session independent.
0:12:40 Now, with this U_k being session independent, we can do a lot of precomputation.
0:12:48 So the way of doing this is to group the i-vectors, using one of three approaches. The first is based on the utterance duration; it is intuitive to group the i-vectors by duration, because we believe that the duration is related to the uncertainty, or the reliability, of the i-vector.
0:13:19 We have also tried using the mean of the diagonal elements of the posterior covariance matrix. This is a nice criterion because the mean of the diagonal elements is a scalar, so the grouping becomes very easy.
0:13:32 Okay, and the last one we have tried is the largest eigenvalue of the posterior covariance matrix.
0:13:38 This slide basically shows how to perform the grouping. For example, if we use the duration axis, then this group corresponds to the extremely short utterances, this one to the medium-length utterances, and the last one to the very long utterances, and for each group we find one representative covariance for the whole group.
0:14:03 So U_1, or U_1 times its transpose, represents the posterior covariance matrix of the extremely short utterances, and U_K, or U_K times its transpose, corresponds to the posterior covariance matrix, or the uncertainty, of the very long utterances.
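One way to realize this grouping, sketched under my own assumptions rather than following the paper's exact clustering, is to bin the development utterances by a scalar reliability proxy, here the utterance duration, and take the Cholesky factor of each bin's average posterior covariance as the representative U_k.

```python
import numpy as np

def group_covariances(covs, durations, K):
    """Quantize posterior covariances into K groups by a scalar proxy.

    Here the proxy is utterance duration (the talk's system 1); the mean
    of the covariance diagonal, or the largest eigenvalue, would be used
    the same way. Bin edges come from simple quantiles; the hypothetical
    representative U_k is the Cholesky factor of each group's average
    covariance.
    """
    durations = np.asarray(durations)
    # Quantile boundaries split the scalar axis into K bins.
    edges = np.quantile(durations, np.linspace(0, 1, K + 1)[1:-1])
    labels = np.searchsorted(edges, durations)
    reps = []
    for k in range(K):
        group = [c for c, l in zip(covs, labels) if l == k]
        reps.append(np.linalg.cholesky(np.mean(group, axis=0)))
    return edges, reps
```

At test time, an utterance's group is just `np.searchsorted(edges, duration)`, a trivially cheap operation compared with extracting and factorizing its own posterior covariance.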
0:14:22 So now, all we need to do during scoring time is to determine the reliability of the utterance. By using one of the three approaches to quantify the reliability, we will be able to find which group the utterance belongs to, so that we can replace all the session-dependent matrices by group-dependent ones.
0:14:51 Compare this with the conventional, original PLDA with UP: there, everything is session dependent, because t stands for the test utterance and s stands for the target speaker. Now these matrices become A_mn, C_mn, and so on, where m and n index the groups, and all of them have been precomputed already using the development data.
0:15:20 So, as you can see, we obtain a big computational saving by using the precomputed matrices rather than computing the covariance matrices on the fly.
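To make the group-pair precomputation concrete, here is a sketch of building a lookup table of scoring matrices for every pair of groups. The two-Gaussian log-likelihood-ratio form is my reconstruction of PLDA-UP scoring, and all symbols and function names are mine, not the paper's.

```python
import numpy as np

def logdet(M):
    # M is positive definite here, so the sign from slogdet is +1.
    return np.linalg.slogdet(M)[1]

def pair_matrices(V, Sigma, Um, Un):
    """Scoring matrices for one (enrol group m, test group n) pair.

    The within covariance of group k is Sigma + U_k U_k^T; the score is
    the Gaussian log-likelihood ratio between the same-speaker joint
    model and the independent model. A sketch, not the paper's code.
    """
    B = V @ V.T                      # between-speaker covariance
    Wm = Sigma + Um @ Um.T           # within covariance, enrol group
    Wn = Sigma + Un @ Un.T           # within covariance, test group
    # Joint covariance of [ws; wt] under the same-speaker hypothesis.
    joint = np.block([[B + Wm, B], [B, B + Wn]])
    const = (-0.5 * logdet(joint)
             + 0.5 * logdet(B + Wm) + 0.5 * logdet(B + Wn))
    return (np.linalg.inv(joint),
            np.linalg.inv(B + Wm), np.linalg.inv(B + Wn), const)

def score(ws, wt, mats):
    Ji, Mi, Ni, const = mats
    x = np.concatenate([ws, wt])
    return (-0.5 * x @ Ji @ x
            + 0.5 * ws @ Mi @ ws + 0.5 * wt @ Ni @ wt + const)

# Built once at development time, e.g.:
# table = {(m, n): pair_matrices(V, Sigma, U[m], U[n])
#          for m in range(K) for n in range(K)}
```

At scoring time one only determines the group indices (m, n) and fetches `table[(m, n)]`; with K groups, only K squared small sets of matrices are ever inverted, instead of one set per trial.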
0:15:32 Again, on this slide there is some more analysis of the computational saving that we can achieve.
0:15:38 This is the PLDA with UP using our proposed fast scoring: we only need to determine the group indices m and n.
0:15:49 But for the conventional PLDA with uncertainty propagation, we have to compute all of these matrices during the scoring.
0:15:58 We performed experiments on the NIST SRE 2012, common condition 2, using classical 60-dimensional MFCC vectors, 1024 Gaussians, and 500 total factors in the total variability matrix.
0:16:16 And we tried the three different ways of grouping the i-vectors, that is, of grouping the posterior covariance matrices.
0:16:28 Okay, so this diagram summarizes the results. The conventional PLDA is of course ultra fast; the bars represent the scoring time, or in fact the total time, for the whole evaluation on common condition 2.
0:16:47 But unfortunately its performance is not very good. The reason is that we use utterances of arbitrary duration: we performed segmentation, cutting the utterances into short, medium, and long segments.
0:17:10 That is, we do not use the original data for training and testing. Instead, some of the utterances are very short, some are of medium length, and some are very long, so we create a situation with arbitrary durations in both the training and the test utterances.
0:17:29 Now, the PLDA with UP performs extremely well; unfortunately, its scoring time is also very high. With our fast scoring approach, we successfully reduce the scoring time from here to here, with only a very small increase in the EER.
0:17:48 And if we use more groups, that is, if the number of groups is larger, we can make the EER almost the same as the one achieved by the PLDA with uncertainty propagation. So what happens is that we successfully reduce the computation time almost without increasing the EER.
0:18:11 The same situation occurs for the minimum DCF; the details are in the paper.
0:18:18 We do not show system 3 here, because the performance of system 2 and system 3 is very similar, so I only show system 2.
0:18:29 System 1 is based on the utterance duration; we show this one because it is the most intuitive way of doing the grouping.
0:18:36 For the memory consumption, we observe a similar trend.
0:18:44 The conventional PLDA uses a very small amount of memory, while the PLDA with UP uses a much larger amount of memory, because we need to store all the posterior covariance matrices of the utterances, and here we are talking about gigabytes.
0:19:08 Our method reduces the memory consumption almost by half. Systems 1 and 2 have about the same memory consumption.
0:19:21 And if we increase the number of groups, obviously the memory requirement will increase; but even if the number of groups is as large as forty-five, it still uses less memory than the original PLDA with uncertainty propagation.
0:19:41 Here is the DET curve. As you can see, the conventional PLDA gives the poorest performance of all; the other systems, systems 1, 2, and 3, and also the one with UP, are much better, because with the uncertainty propagation you can handle utterances of arbitrary duration.
0:20:11 We also find that system 1 is slightly poorer than systems 2 and 3, but system 1 has the largest saving in terms of computation time.
0:20:27 So, in conclusion, we proposed a very fast scoring method for PLDA with uncertainty propagation.
0:20:35 The whole idea is to decompose the posterior covariance matrices into the loading matrices that represent the reliability of the i-vectors, and to precompute all of them as much as possible.
0:20:49 To do this precomputation, we need to do the grouping first, during development time, and we found three ways of performing the grouping.
0:21:02 All of these groupings are based on a scalar: just like the k-means algorithm, where you need a distance as the criterion for forming the clusters, we need a criterion for forming the groups.
0:21:18 What do we mean by that? We use the mean of the diagonal elements of the posterior covariance matrix, or the maximum eigenvalue of the posterior covariance matrix, or the utterance duration, as the criterion for the grouping.
0:21:47 And all of these criteria are computationally light. As a result, the proposed fast scoring performs very similarly to the standard UP, but it needs only 2.3 percent of the scoring time.
0:22:03 Thank you.
0:22:12 [Chair] We have time for questions. Yes?
0:22:17 [Speaker] We do not truncate them randomly; rather, the durations step in one-second intervals, so for three seconds, four seconds, five seconds, and so on, we randomly extract a segment of that length from the speech data. The starting point of each segment is also randomly selected.
0:22:40 [Audience] So the durations range between three seconds and how much?
0:22:45 [Speaker] Up to the length of the long test utterances. So, for different utterances, we will have different durations.
0:22:58 [Audience] I wonder if you could just comment on this. My experience with this method is that I found it works well in situations other than the specific problem for which it was intended.
0:23:15 If there is a gross mismatch between enrollment and test, such as telephone enrollment and microphone test channels, or a huge mismatch in the durations, then I found that this works well; but I was a bit disappointed with its performance on the specific problem that you are addressing here, which is the problem of just duration variability.
0:23:43 [Speaker] In fact, in our experiments we also have duration mismatch, because we deliberately generated the duration mismatch in order to create a situation where the utterances have arbitrary durations; therefore, the test utterance and the target-speaker utterance will have different durations.
0:24:05 Of course, within each group, say group U_1 or group U_2, the durations of the utterances vary only a little.
0:24:22 But because everything is random, there will be a lot of utterances with various durations: utterances which are short and also utterances which are very long. So there will be duration mismatch between enrollment and test.
0:24:40 [Audience] I would be very interested to see what happens in the upcoming NIST evaluation, where this problem is going to be in the forefront. Excellent work, thank you.
0:24:53 [Speaker] As I understand it, with the truncation, the durations will be truncated to between ten seconds and sixty seconds.
0:25:02 So I think we are all looking at up to five percent equal error rate, you know, before we even move to the Chinese and Tagalog verification trials.
0:25:17 [Chair] Okay, then let's thank the speaker.