0:00:22 | okay |

0:00:23 | the talk i'm going to give is about how to make uncertainty propagation run fast |

0:00:29 | and also consume less memory |

0:00:32 | my name is man-wai mak, from the hong kong polytechnic university |

0:00:37 | so here is the outline i will first give an overview of i-vector |

0:00:42 | p lda and explain how uncertainty propagation can |

0:00:49 | model the uncertainty of the i-vector |

0:00:52 | and how to make the uncertainty propagation run faster and also use less |

0:00:58 | memory |

0:00:59 | and we evaluate the proposed approach on nist sre 2012 |

0:01:08 | and finally we give a conclusion okay so here is an overview of the i-vector p lda |

0:01:13 | framework probably you all already know this so i will quickly go |

0:01:17 | through these slides here |

0:01:22 | we use the posterior mean of the latent factor |

0:01:31 | as a low dimensional representation of the speaker so given the mfcc vectors of |

0:01:39 | an utterance we compute the posterior mean of the |

0:01:44 | latent factor and we call this the i-vector |

0:01:47 | okay and t is the total variability matrix that defines the channel and speaker subspaces |

0:01:55 | or you can say it represents the subspace where the i-vectors vary |

0:02:00 | so here's the procedure for the i-vector extraction given a sequence of mfcc vectors |

0:02:05 | we extract the |

0:02:07 | i-vector using the posterior mean of the latent factor |

0:02:11 | and because we would like to use the gaussian p lda we need |

0:02:17 | to suppress the non-gaussian behavior of the i-vectors through some preprocessing |

0:02:24 | okay for example whitening and also length normalization |
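As a concrete illustration of this extraction step, here is a minimal numpy sketch. All quantities (the total variability matrix `T`, the diagonal UBM precisions, and the Baum-Welch statistics) are synthetic placeholders with toy dimensions, not the speaker's actual system:

```python
import numpy as np

rng = np.random.default_rng(0)

C, F, D = 8, 10, 5                         # Gaussians, feature dim, i-vector dim (toy)
T = 0.1 * rng.standard_normal((C * F, D))  # total variability matrix (synthetic)
Sigma_inv = np.ones(C * F)                 # diagonal UBM precisions (synthetic)

# Baum-Welch statistics of one utterance (synthetic placeholders)
N = rng.uniform(1.0, 50.0, size=C)         # zeroth-order stats (soft frame counts)
f = rng.standard_normal(C * F)             # centred first-order stats

# Posterior precision of the latent factor: L = I + sum_c N_c T_c' S_c^{-1} T_c
L = np.eye(D)
for c in range(C):
    Tc = T[c * F:(c + 1) * F]
    L += N[c] * Tc.T @ (Sigma_inv[c * F:(c + 1) * F, None] * Tc)

cov = np.linalg.inv(L)                     # posterior covariance (the uncertainty)
w = cov @ (T.T @ (Sigma_inv * f))          # posterior mean = the i-vector

# length normalization, part of the preprocessing mentioned above
w_ln = w / np.linalg.norm(w)
```

Note that the posterior covariance `cov` falls out of the same computation for free; it is exactly the quantity that uncertainty propagation, discussed later in the talk, keeps rather than discards.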

0:02:29 | and after this preprocessing step we get the preprocessed i-vector pairs |

0:02:38 | and then this preprocessed i-vector can be modeled by the p lda |

0:02:44 | so the idea is that in the p lda modeling v |

0:02:47 | represents the speaker subspace |

0:02:50 | and h i is the speaker factor |

0:02:55 | and as you can see for the j-th session of the i-th speaker we |

0:03:01 | only have one |

0:03:04 | latent factor h i okay |

0:03:07 | and epsilon i j represents the variability that cannot be represented by the |

0:03:12 | speaker subspace |
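In symbols, the generative model just described can be written as follows (I am supplying `m` for the global mean of the preprocessed i-vectors and `Sigma` for the residual covariance, for readability):

```latex
\mathbf{w}_{ij} = \mathbf{m} + \mathbf{V}\mathbf{h}_i + \boldsymbol{\epsilon}_{ij},
\qquad
\mathbf{h}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),
\quad
\boldsymbol{\epsilon}_{ij} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma})
```

Here the single speaker factor $\mathbf{h}_i$ is shared across all sessions $j$ of speaker $i$, while $\boldsymbol{\epsilon}_{ij}$ absorbs the per-session variability outside the speaker subspace spanned by $\mathbf{V}$.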

0:03:14 | so now let's turn to the scoring |

0:03:17 | at test time we have the |

0:03:20 | test i-vector |

0:03:21 | we have a test i-vector w t |

0:03:24 | and also we have this target speaker i-vector w s |

0:03:30 | and we compute the likelihood assuming that w s and w t come |

0:03:36 | from the same speaker |

0:03:38 | and we also have the alternative hypothesis where w s and w t |

0:03:43 | come from different speakers |

0:03:45 | therefore after some mathematical manipulation we get this very nice equation so |

0:03:51 | in this equation we only have matrix and vector |

0:03:55 | multiplications and |

0:03:57 | the nice thing is that the matrices |

0:04:00 | can all be precomputed as you can see from this set of equations here |

0:04:05 | at the bottom |

0:04:06 | and all these sigma ac sigma tot |

0:04:11 | and so on can be precomputed from the |

0:04:16 | p lda model parameters and that explains why the |

0:04:20 | p lda scoring is very fast |
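For intuition, here is a minimal numpy sketch of this style of precomputed scoring. The closed-form log-likelihood ratio for gaussian plda can be arranged so that only two symmetric matrices (named `P` and `Q` here, my names rather than the slide's, both precomputable from the model) ever touch the two i-vectors; the model parameters are synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)
D, q = 6, 3
V = rng.standard_normal((D, q))          # speaker loading matrix (toy)
Sigma = 0.5 * np.eye(D)                  # within-speaker residual covariance

Sigma_ac  = V @ V.T                      # across-speaker covariance
Sigma_tot = Sigma_ac + Sigma             # total covariance of an i-vector

# Everything below depends on model parameters only, so it is precomputable
tot_inv = np.linalg.inv(Sigma_tot)
K = np.linalg.inv(Sigma_tot - Sigma_ac @ tot_inv @ Sigma_ac)
Q = tot_inv - K                          # quadratic term for each i-vector
P = tot_inv @ Sigma_ac @ K               # cross term between the two i-vectors

def plda_llr(ws, wt):
    """Twice the verification log-likelihood ratio, up to a constant:
    scoring needs only matrix-vector products."""
    return ws @ Q @ ws + wt @ Q @ wt + 2.0 * ws @ P @ wt

ws, wt = rng.standard_normal(D), rng.standard_normal(D)
```

Because `P` and `Q` never change between trials, the per-trial cost is a handful of matrix-vector products, which is the point being made on the slide.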

0:04:26 | but one problem of this conventional i-vector p lda framework |

0:04:34 | is that it does not have the ability to represent the reliability of the i-vector |

0:04:39 | so whether the utterance is very long or very short we still use a |

0:04:45 | low dimensional i-vector to represent |

0:04:48 | the speaker characteristics |

0:04:50 | of the whole utterance |

0:04:52 | so this poses a problem for short utterance speaker verification |

0:04:57 | it is not a problem for very long utterances say when we have three minutes |

0:05:02 | or more of speech |

0:05:04 | but if the utterance is only about ten seconds or three seconds |

0:05:09 | then the variability or uncertainty of the i-vector will be so high |

0:05:14 | that the p lda score will favor the same speaker hypothesis |

0:05:19 | even if the test utterance is given by an impostor |

0:05:24 | and the reason is that if the utterance is very short we will not have enough |

0:05:30 | acoustic vectors for the map estimation or in other words we do not have enough acoustic vectors |

0:05:36 | to compute the posterior mean of the latent factor in the factor analysis model |

0:05:44 | so in the idea of uncertainty propagation |

0:05:48 | we not only extract i-vectors but also quantify the |

0:05:52 | posterior covariance matrix |

0:05:55 | so this diagram here illustrates the idea |

0:06:00 | this gaussian represents the posterior density of the latent factor |

0:06:05 | and the i-vector is its mean so it is a point estimate |

0:06:12 | and this equation shows the procedure for computing it |

0:06:16 | okay so t c is the c-th partition of the total variability matrix but |

0:06:22 | as you can see |

0:06:24 | if the variance of this gaussian is very large |

0:06:28 | then the point estimate will not be very accurate |

0:06:32 | and this happens when the utterance is very short and the reason is |

0:06:36 | if the utterance is very short then n c which is the zeroth order sufficient statistic |

0:06:42 | will be very small |

0:06:43 | so this term will be very small and the whole posterior covariance matrix l inverse will be |

0:06:48 | very big |

0:06:49 | and that means these variances will be large and as a result the point |

0:06:54 | estimate might not be very reliable |

0:06:57 | so |

0:07:00 | that's why in two thousand and thirteen |

0:07:02 | kenny proposed the p lda with uncertainty propagation |

0:07:06 | so the idea is that in addition to extracting the i-vector we also extract the posterior covariance |

0:07:11 | matrix of |

0:07:14 | the latent factor |

0:07:15 | and that represents the uncertainty of the i-vector |

0:07:19 | and with some preprocessing as i have mentioned because we want to use the |

0:07:24 | gaussian p lda |

0:07:27 | as the final stage of the modeling for the scoring |

0:07:32 | therefore we also need to preprocess this |

0:07:35 | matrix |

0:07:38 | which is the preprocessed version of the posterior covariance matrix and after |

0:07:45 | that we can do the p lda modeling now how does |

0:07:49 | the uncertainty |

0:07:52 | propagation come about the propagation comes from this generative model |

0:07:56 | in the generative model we have v h i plus u i j z i j and as you can see |

0:08:02 | this u |

0:08:03 | is unlike that in the conventional |

0:08:06 | p lda model with the eigenchannels |

0:08:09 | so this is like my eigenchannel but instead this u depends on |

0:08:13 | the session |

0:08:15 | so it depends on the i-th |

0:08:17 | speaker |

0:08:18 | and the j-th session of the i-th speaker |

0:08:20 | and as a result the z also depends on the i and j |
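In symbols, the uncertainty-propagation model being described is (again with `m` and `Sigma` supplied by me for readability; $\mathbf{U}_{ij}$ is the session-dependent loading matrix):

```latex
\mathbf{w}_{ij} = \mathbf{m} + \mathbf{V}\mathbf{h}_i + \mathbf{U}_{ij}\mathbf{z}_{ij} + \boldsymbol{\epsilon}_{ij},
\qquad
\mathbf{z}_{ij} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})
```

Compared with the eigenchannel model, the only structural change is that the channel loading matrix carries the session indices $i,j$, because it is derived from that session's posterior covariance.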

0:08:26 | now the trouble with this is |

0:08:28 | for every test utterance |

0:08:30 | we also need to compute the u i j so unlike the eigenchannel |

0:08:34 | situation |

0:08:35 | where we only need to precompute the matrices and make use of them during scoring now in |

0:08:40 | uncertainty propagation this u i j has to be computed during |

0:08:45 | the scoring time |

0:08:47 | because it is session dependent |

0:08:49 | and to compute this u i j we perform a cholesky decomposition |

0:08:53 | of the posterior covariance matrix |

0:08:56 | and that's why we have this intra speaker covariance matrix like this |
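A minimal numpy sketch of this step, with a synthetic positive-definite matrix standing in for the posterior covariance that the i-vector extractor would produce:

```python
import numpy as np

rng = np.random.default_rng(2)
D = 6
Sigma_wc = 0.4 * np.eye(D)        # within-speaker residual covariance (toy)

# Session-dependent posterior covariance from the i-vector extractor;
# here a synthetic positive-definite matrix stands in for it.
A = rng.standard_normal((D, D))
post_cov = A @ A.T / D + 1e-3 * np.eye(D)

# Cholesky decomposition: U is the session-dependent loading matrix, so
# U z with z ~ N(0, I) has exactly the posterior covariance U U'.
U = np.linalg.cholesky(post_cov)

# Expanded intra-speaker covariance seen by the PLDA back-end this session
Sigma_session = Sigma_wc + U @ U.T
```

The cost problem the talk goes on to describe is visible here: `U` (and everything downstream of it) must be recomputed for every session, because `post_cov` changes with each utterance.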

0:09:05 | so now finally during the scoring |

0:09:08 | with the p lda u p we have this equation |

0:09:13 | which is very similar to the equation |

0:09:18 | of the conventional p lda right as you can see this is just |

0:09:23 | matrix and vector multiplication |

0:09:26 | but the difference is |

0:09:28 | this time the a b c and d all depend on the test utterance |

0:09:33 | as you can see from this set of equations the a s t b s |

0:09:38 | t c s t and the d s t all depend on the test utterance |

0:09:42 | and that means they cannot be precomputed |

0:09:46 | so only a very small number of matrices can be precomputed this set has to be |

0:09:51 | computed during scoring time |

0:09:55 | and this set can be computed before the scoring time |

0:09:59 | so we will not save much computation because we must compute the covariance matrices |

0:10:07 | so this slide summarizes the computation that needs to be performed for the conventional p |

0:10:13 | lda we almost have nothing |

0:10:15 | to compute all we need to compute is this |

0:10:19 | matrix and vector multiplication |

0:10:22 | but for the p lda with u p |

0:10:24 | we have to compute all these sets of matrices on the right |

0:10:28 | so as you can see this will increase the computational complexity a lot |

0:10:34 | and it will also increase the memory consumption |

0:10:39 | because we need to store |

0:10:43 | these matrices a b c d for every target speaker |

0:10:47 | so we propose a way of speeding up the computation and |

0:10:53 | also |

0:10:54 | a way to reduce the memory consumption |

0:10:57 | the whole idea comes from |

0:10:59 | this equation |

0:11:01 | okay from this equation the posterior covariance matrix only depends on |

0:11:06 | n c |

0:11:09 | and at testing time n c will be the zeroth order sufficient |

0:11:12 | statistics of the test utterance |

0:11:16 | okay so if two i-vectors or two utterances have similar durations |

0:11:21 | we assume that |

0:11:25 | their posterior covariance matrices are similar because as you can see the covariance depends |

0:11:30 | on the mfcc or the acoustic vectors only through |

0:11:34 | the zeroth order sufficient statistics |
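This is easy to check in code: in a factor-analysis extractor the posterior covariance is a function of the zeroth-order statistics alone, so the first-order stats never enter. A toy numpy sketch (synthetic `T` and precisions) that contrasts a short and a long utterance:

```python
import numpy as np

rng = np.random.default_rng(5)
C, F, D = 4, 8, 5
T = 0.1 * rng.standard_normal((C * F, D))    # total variability matrix (toy)
Sigma_inv = np.ones(C * F)                   # diagonal UBM precisions (toy)

def posterior_cov(N):
    """Posterior covariance of the latent factor: it depends on the
    zeroth-order statistics N only -- first-order stats never appear."""
    L = np.eye(D)
    for c in range(C):
        Tc = T[c * F:(c + 1) * F]
        L += N[c] * Tc.T @ (Sigma_inv[c * F:(c + 1) * F, None] * Tc)
    return np.linalg.inv(L)

short_cov = posterior_cov(np.full(C, 2.0))     # a few frames per Gaussian
long_cov  = posterior_cov(np.full(C, 500.0))   # a long utterance
```

Since duration roughly fixes the total frame count, two utterances of similar duration give similar `N` and hence similar posterior covariances, which is exactly the hypothesis behind the grouping.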

0:11:36 | so having this hypothesis |

0:11:39 | we |

0:11:42 | propose |

0:11:43 | to group the i-vectors according to their reliability |

0:11:50 | and we use a scalar to define the reliability |

0:11:55 | basically for each group the i-vector reliability is modeled by one representative covariance |

0:12:00 | matrix |

0:12:01 | and we obtain the posterior covariance matrices from the development data |

0:12:06 | okay so here |

0:12:08 | the index k stands for the |

0:12:12 | group |

0:12:13 | and this u k |

0:12:16 | is independent of the session |

0:12:18 | if you look |

0:12:19 | at the bottom of the slide |

0:12:22 | we have u i j which depends on the |

0:12:26 | session |

0:12:27 | but now if you look at this here |

0:12:30 | we have successfully |

0:12:33 | made the u i j which is session dependent |

0:12:37 | become session independent |

0:12:40 | now with this u k being session independent we can |

0:12:45 | do a lot of precomputation |

0:12:48 | so the first step |

0:12:50 | is to |

0:13:00 | group the i-vectors |

0:13:01 | using these three approaches one is based on the utterance duration which is intuitive |

0:13:07 | we group the i-vectors based on the |

0:13:09 | utterance duration because we believe that duration is related to |

0:13:14 | the uncertainty or related to the reliability of the i-vector |

0:13:19 | we have also tried using the mean of the diagonal elements of the posterior covariance matrix |

0:13:23 | and this is a nice thing to do because |

0:13:27 | the mean of the diagonal elements is a scalar so grouping will become |

0:13:31 | very easy |

0:13:32 | okay and the last one we have tried is the largest eigenvalue of the posterior covariance |

0:13:37 | matrix |

0:13:38 | so this slide basically tells us how to perform the grouping if for example you |

0:13:43 | use the duration axis |

0:13:45 | then this end corresponds to extremely short utterances |

0:13:49 | this goes to medium length utterances |

0:13:52 | and the other end is the |

0:13:54 | long utterances and for each group we find one representative |

0:13:58 | okay from the k groups |

0:14:00 | that represents the whole group |

0:14:02 | so this |

0:14:03 | u one or u one tilde will represent |

0:14:06 | the posterior covariance matrix of the extremely short utterances |

0:14:12 | and u k or u k tilde corresponds to the posterior covariance matrix |

0:14:17 | or the uncertainty of the very long utterances |
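A toy numpy sketch of this grouping step using the duration criterion. All data here is synthetic, and the covariance model `I/(1+d)` is only an illustration of "longer utterance, smaller uncertainty", not the real extractor:

```python
import numpy as np

rng = np.random.default_rng(6)
D, K = 6, 4

# Development data: durations (seconds) and a posterior covariance per
# utterance that shrinks as the utterance gets longer (synthetic model).
durations = rng.uniform(2.0, 60.0, size=200)
post_covs = np.stack([np.eye(D) / (1.0 + d) for d in durations])

# Group 0 holds the shortest (least reliable) utterances, group K-1 the longest
edges = np.quantile(durations, np.linspace(0.0, 1.0, K + 1))
group = np.clip(np.searchsorted(edges, durations, side="right") - 1, 0, K - 1)

# One representative covariance per group; its Cholesky factor U_k is what
# replaces the session-dependent U_ij at scoring time.
U = [np.linalg.cholesky(post_covs[group == k].mean(axis=0)) for k in range(K)]
```

The mean-of-diagonal and largest-eigenvalue criteria work the same way; only the scalar fed to the quantizer changes.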

0:14:22 | so now all we need to do |

0:14:25 | during the scoring time is to find |

0:14:28 | the reliability |

0:14:30 | so by using the three approaches to quantify the reliability that i just gave |

0:14:36 | we will be able to find the group indices m and n so that we can turn |

0:14:42 | all the session dependent |

0:14:45 | matrices into |

0:14:47 | a m n b m n |

0:14:49 | c m n and d m n |

0:14:51 | now compared with the conventional original p lda u p |

0:14:56 | these a s t and so on are all session dependent because |

0:15:00 | t is the test utterance |

0:15:03 | so t stands for the test utterance and s stands for the target speaker |

0:15:08 | utterance |

0:15:09 | but now these a m n |

0:15:12 | b m n c m n and d m n have all been precomputed already |

0:15:17 | using the development data |

0:15:20 | so as you can see the reason |

0:15:22 | for this computation saving is that we are |

0:15:25 | using the precomputed matrices rather than computing the covariance matrices on the fly |

0:15:32 | so again in this slide there is some more analysis of |

0:15:35 | the computation saving that we can achieve |

0:15:38 | so this is the p lda with u p using our proposed fast |

0:15:43 | scoring okay so we only need to determine the group identities m and n |

0:15:49 | but for the conventional p lda with |

0:15:51 | uncertainty propagation we have to compute all these matrices during the scoring |
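To make the saving concrete, here is a toy numpy sketch of the lookup-style scoring. The dictionaries of per-group-pair matrices are random placeholders for what would be precomputed offline from the plda parameters and the K representative covariances; the names are mine:

```python
import numpy as np

rng = np.random.default_rng(7)
D, K = 6, 3

def sym(m):
    return (m + m.T) / 2.0

# Hypothetical precomputed score matrices, one set per pair of reliability
# groups (m = target utterance's group, n = test utterance's group).
# Random placeholders stand in for the offline-built matrices.
A  = {(m, n): sym(rng.standard_normal((D, D))) for m in range(K) for n in range(K)}
B  = {(m, n): sym(rng.standard_normal((D, D))) for m in range(K) for n in range(K)}
C  = {(m, n): rng.standard_normal((D, D)) for m in range(K) for n in range(K)}
k0 = {(m, n): rng.standard_normal() for m in range(K) for n in range(K)}

def fast_score(ws, m, wt, n):
    """Scoring collapses to a table lookup plus matrix-vector products."""
    key = (m, n)
    return ws @ A[key] @ ws + wt @ B[key] @ wt + 2.0 * ws @ C[key] @ wt + k0[key]

ws, wt = rng.standard_normal(D), rng.standard_normal(D)
score = fast_score(ws, 0, wt, 2)
```

At test time the only session-dependent work left is determining `m` and `n` from the reliability scalar, which is exactly the point of the slide.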

0:15:58 | so we performed experiments on |

0:16:02 | sre 2012 common condition two |

0:16:06 | using the classical sixty dimensional mfcc vectors one thousand and twenty four gaussians |

0:16:11 | and five hundred total factors in the total variability matrix |

0:16:16 | and we tried these three different ways of grouping the i-vectors |

0:16:21 | that is of quantizing the |

0:16:25 | posterior covariance matrices |

0:16:28 | okay so this diagram summarizes the results as you can see the conventional p lda is ultra fast |

0:16:34 | the vertical axis here represents the |

0:16:38 | scoring time or in fact the total time for the whole evaluation |

0:16:42 | on this common condition two |

0:16:47 | but unfortunately the performance is not very good |

0:16:50 | well the reason that the performance is not very good is because |

0:16:55 | we use utterances of arbitrary durations so we did some segmentation |

0:17:01 | cutting the utterances into short medium and long |

0:17:12 | that is we do not use the original data for training and testing but instead |

0:17:16 | we make some of the utterances very short some of the utterances |

0:17:19 | medium length and some of the utterances very long so we create a situation |

0:17:23 | with a big variety |

0:17:25 | of durations in both the training and test utterances |

0:17:29 | now the p lda with u p performed extremely well |

0:17:33 | unfortunately the scoring time is also very high |

0:17:37 | and with our fast scoring approach we successfully reduced the scoring time from here |

0:17:43 | to here |

0:17:45 | with only a very small increase in the eer |

0:17:48 | if we use more groups okay so if we make |

0:17:54 | the number of groups larger we can make the eer almost the same |

0:17:59 | as the one achieved by the |

0:18:01 | p lda with uncertainty propagation so what happens is |

0:18:05 | we successfully reduce the computation time without increasing |

0:18:10 | the eer |

0:18:11 | and the same situation occurs in the mindcf the details are |

0:18:17 | in the paper |

0:18:18 | and also we only show system two here because the performance of system two and system |

0:18:23 | three are very similar so i only show system two |

0:18:29 | and system one is based on utterance duration we want to show this because it is |

0:18:33 | the most intuitive way of doing |

0:18:36 | the grouping |

0:18:39 | and the memory consumption has a similar trend |

0:18:44 | the p lda uses a very small amount of memory |

0:18:49 | and the p lda with u p uses a much larger amount of memory |

0:18:54 | because we need to store all of the posterior covariance matrices of the utterances |

0:19:01 | and we are talking about |

0:19:03 | gigabytes here |

0:19:08 | and with system one we reduce the memory consumption almost by half |

0:19:14 | okay and system two and three have about the same |

0:19:19 | memory consumption |

0:19:21 | and if we increase the number of groups |

0:19:24 | obviously the memory requirement will increase |

0:19:27 | but even if the |

0:19:29 | number of groups is as large as forty five |

0:19:33 | it still uses less memory than |

0:19:36 | the original p lda with uncertainty propagation |

0:19:41 | so this is the det curve and |

0:19:45 | as you can see |

0:19:51 | this curve here corresponds to the conventional p lda whose performance is the poorest |

0:19:56 | and all the other systems one two three and also the one with u p are |

0:20:02 | much better |

0:20:04 | because with the uncertainty propagation you can model the utterances of arbitrary durations |

0:20:11 | and what we have observed is that system one is slightly poorer |

0:20:16 | than systems two and three |

0:20:19 | but system one has the largest |

0:20:22 | reduction in terms of computation time |

0:20:27 | so in conclusion |

0:20:29 | we proposed a very fast scoring method for p lda with uncertainty propagation |

0:20:35 | and the whole idea is to group the |

0:20:39 | posterior covariance matrices |

0:20:42 | or the loading matrices representing the reliability of the i-vectors |

0:20:46 | and to precompute |

0:20:47 | all of them |

0:20:49 | as much as possible and in order to do this precomputation we need |

0:20:55 | to do the grouping first during the development time |

0:20:58 | and we found three ways of performing the grouping |

0:21:02 | and all these groupings |

0:21:04 | are based on some scalar just like the k-means algorithm where you need |

0:21:10 | to use the distance |

0:21:12 | as a |

0:21:14 | criterion for finding |

0:21:18 | the clusters so what do we mean by the scalar here we use the |

0:21:26 | mean of the diagonal elements of the posterior covariance matrix |

0:21:32 | or the maximum |

0:21:35 | eigenvalue of the posterior covariance matrix or the duration as the |

0:21:44 | criterion for the grouping |

0:21:47 | and all of these are computationally light |

0:21:53 | and as a result |

0:21:54 | the proposed method performs similarly to the standard u p but |

0:22:00 | requires only two point three percent of the scoring time |

0:22:03 | thank you |

0:22:12 | we have time for questions yes |

0:22:17 | we do not truncate them randomly but instead for every one second interval |

0:22:23 | we have a duration |

0:22:27 | so three seconds four seconds five seconds and we randomly extract segments from the |

0:22:33 | speech data |

0:22:36 | also where we extract from is chosen randomly |

0:22:40 | so the durations range between three seconds and as much |

0:22:45 | as the whole test utterance so some utterances will belong to the group of |

0:22:51 | five seconds and some to another group therefore for different utterances we will have a different grouping |

0:22:58 | my experience i wonder if you could just comment on this my experience with this |

0:23:02 | with this method |

0:23:04 | is that i found that it works well in situations other than the specific |

0:23:13 | problem for which it was intended |

0:23:15 | okay if there is a gross mismatch between enrollment and test such as telephone |

0:23:22 | enrollment and microphone channels |

0:23:24 | or a huge mismatch in the duration |

0:23:28 | then i found that this works well but i was a bit disappointed with the |

0:23:33 | with the performance on the specific problem that you're addressing here which is the problem of just |

0:23:39 | the duration variability |

0:23:43 | in fact in our experiments we also have duration mismatch because |

0:23:50 | we deliberately generated duration mismatch in order to create a situation having |

0:23:56 | utterances of arbitrary durations therefore the test utterance and the target speaker utterance will |

0:24:03 | have a different k |

0:24:05 | of course if both of them are very short |

0:24:10 | they will use u one or u two |

0:24:14 | but if the utterances have various |

0:24:18 | durations then |

0:24:21 | they will use different groups |

0:24:22 | so because everything is random there will be a lot of utterances which are very short |

0:24:28 | utterances which are medium and also utterances which are very long so there will be a |

0:24:33 | duration mismatch |

0:24:35 | between the enrollment and the test |

0:24:40 | i'd be very interested to see what happens in the in the upcoming |

0:24:44 | nist evaluation where this problem is going to be in the |

0:24:49 | forefront that would be excellent thank you |

0:24:53 | as you know with the truncation the durations will be truncated to between ten seconds |

0:24:59 | and sixty seconds |

0:25:02 | so i think we're all looking at up to five percent equal error rate you know |

0:25:08 | before we even move to the |

0:25:12 | more difficult |

0:25:14 | verification trials |

0:25:17 | okay then let's thank the speaker |