0:00:15 | thanks for the introduction

0:00:17 | i am going to present this work on behalf of

0:00:22 | my phd student

0:00:24 | together with

0:00:29 | our colleagues

0:00:30 | so to put it into the right context this talk is about one

0:00:36 | way of speeding up

0:00:38 | the use of i-vectors with plda

0:00:41 | so this paper is a standalone presentation but the intention is to reduce

0:00:46 | the computations

0:00:47 | in i-vector extraction so we call it rapid computation of i-vectors

0:00:53 | okay before going into the details let me use a couple of slides

0:00:57 | to present the background as well as the motivations of the work

0:01:02 | so the i-vector extraction process can be seen as a compression process

0:01:07 | whereby we compress

0:01:09 | across the time frames

0:01:11 | and the supervector space

0:01:13 | the output is a low and fixed dimensional vector which we call the i-vector which captures

0:01:18 | not only

0:01:19 | the speaker information but also the characteristics of the recording devices

0:01:25 | the microphones used

0:01:27 | the transmission channel characteristics which include the encoding methods used

0:01:32 | in the transmission

0:01:34 | of the speech signals as well as the acoustic environment

0:01:38 | so

0:01:39 | putting it into a mathematical form this is the i-vector model

0:01:44 | and the i-vector

0:01:47 | is the map estimate of the

0:01:50 | latent variable

0:01:52 | and

0:01:53 | as you can see here we have a single latent variable which is tied across

0:01:57 | gaussians

0:01:58 | and tied across frames and this tying across frames and gaussians is the

0:02:03 | one that gives us the compression process

0:02:06 | compressing across time and across the supervector space

0:02:09 | so |

0:02:11 | we assume that we know the alignment of frames to gaussians

0:02:15 | and in actual implementations this frame alignment to gaussians

0:02:20 | could be given ideally by the gmm posteriors

0:02:23 | or

0:02:24 | most often only the top scoring posteriors are used

0:02:28 | so now if we look at this latent variable

0:02:31 | we

0:02:33 | make the assumption that the

0:02:36 | prior

0:02:37 | of this latent variable is the standard gaussian distribution that is zero mean and

0:02:41 | unit variance

0:02:43 | so given the observation sequence

0:02:46 | we can estimate the posterior which is another gaussian

0:02:50 | with mean phi and covariance l inverse

0:02:54 | and of course this phi

0:02:56 | is what we call the i-vector it is the posterior mean of the latent variable x

0:03:00 | and

0:03:02 | one can see the i-vector is determined by the posterior covariance the total variability

0:03:07 | matrix t

0:03:08 | sigma which is the covariance matrix of the ubm

0:03:12 | and f which is the centred first order statistics

0:03:16 | and

0:03:16 | l inverse which is the posterior covariance is in turn determined by the

0:03:21 | zeroth order statistics

0:03:23 | so one point to note is that

0:03:26 | in order to compute or extract the i-vector

0:03:29 | we have to compute

0:03:31 | the posterior covariance

0:03:33 | because it is part of the equation
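The computation described above can be sketched in a few lines of NumPy (toy sizes and variable names are my own, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)
C, F, M = 4, 3, 5                        # gaussians, feature dim, i-vector dim (toy sizes)
T = rng.standard_normal((C * F, M))      # total variability matrix
Sigma = rng.uniform(0.5, 2.0, C * F)     # diagonal of the UBM covariance
N = np.repeat(rng.uniform(1, 10, C), F)  # zeroth-order stats, one count per gaussian
f = rng.standard_normal(C * F)           # centred first-order stats

# posterior precision: L = I + sum_c N_c T_c' Sigma_c^{-1} T_c
L = np.eye(M) + T.T @ ((N / Sigma)[:, None] * T)
cov = np.linalg.inv(L)                   # the posterior covariance L^{-1} ...
phi = cov @ (T.T @ (f / Sigma))          # ... is needed to get the posterior mean, the i-vector
print(phi.shape)                         # (5,)
```

Note how the inversion of L sits on the path to the posterior mean, which is the cost the talk sets out to remove.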

0:03:36 | okay |

0:03:38 | note that in this paper we use what we call the whitened statistics

0:03:43 | where what we want to do is to absorb sigma into the statistics

0:03:48 | so that it disappears from t and f as shown here

0:03:51 | this gives us the simplified equations

0:03:54 | without having the sigma appearing explicitly
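A quick numerical check of this whitening step (my own toy setup, not the paper's code): absorbing Sigma^{-1/2} into T and f once, offline, leaves the i-vector unchanged while removing Sigma from the per-utterance equations.

```python
import numpy as np

rng = np.random.default_rng(1)
C, F, M = 4, 3, 5
T = rng.standard_normal((C * F, M))      # total variability matrix
Sigma = rng.uniform(0.5, 2.0, C * F)     # diagonal UBM covariance
N = np.repeat(rng.uniform(1, 10, C), F)  # occupancy counts, repeated per feature dim
f = rng.standard_normal(C * F)           # centred first-order stats

# original form, with Sigma appearing explicitly
L = np.eye(M) + T.T @ ((N / Sigma)[:, None] * T)
phi = np.linalg.solve(L, T.T @ (f / Sigma))

# whitened statistics: absorb Sigma^{-1/2} into T and f once, offline
Tw = T / np.sqrt(Sigma)[:, None]
fw = f / np.sqrt(Sigma)
Lw = np.eye(M) + Tw.T @ (N[:, None] * Tw)
phi_w = np.linalg.solve(Lw, Tw.T @ fw)

print(np.allclose(phi, phi_w))  # True: Sigma drops out of the runtime equations
```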

0:04:04 | okay so now we have only one

0:04:07 | objective in this paper that is to reduce the computational complexity of i-vector extraction

0:04:13 | while keeping the memory cost low

0:04:16 | and hopefully with no degradation in the performance

0:04:21 | okay so why is this important

0:04:25 | it is important because a very fast implementation of

0:04:30 | i-vector extraction could be

0:04:32 | performed on handheld devices

0:04:34 | or for large scale cloud based applications where a single server may have to

0:04:41 | serve requests

0:04:42 | from hundreds or thousands of clients at the same time

0:04:46 | okay and

0:04:48 | also recently we have seen an increase in

0:04:52 | the number of gaussians used in the systems for example in the papers that are

0:04:56 | going to be presented in the coming

0:04:58 | sessions

0:05:00 | we will see

0:05:01 | numbers from one thousand up to ten thousand so direct computation would be

0:05:06 | somewhat unwieldy for these

0:05:08 | scenarios

0:05:10 | okay and

0:05:11 | another note on the motivation is that

0:05:13 | the emphasis here is on the rapid computation of i-vectors

0:05:17 | rather than on a fast estimation of the t matrix because the t matrix is estimated once and usually

0:05:22 | offline

0:05:23 | where we can use a huge amount of computational resources

0:05:26 | and then keep it fixed

0:05:32 | okay so

0:05:34 | here is the

0:05:35 | problem statement

0:05:39 | the computational bottleneck of i-vector extraction

0:05:43 | lies in the estimation of the posterior mean

0:05:46 | which requires us to

0:05:49 | estimate first the posterior covariance

0:05:52 | so there are a

0:05:54 | couple of existing solutions to this problem

0:05:57 | and

0:05:58 | these include the eigen decomposition method and also covariance modelling where we

0:06:04 | approximate the posterior covariance by a

0:06:07 | factorised subspace

0:06:08 | of lower rank

0:06:09 | and there is also work on sparse coding to simplify

0:06:15 | the posterior covariance estimation

0:06:18 | so in this paper what we propose is to

0:06:23 | compute directly the posterior mean without the need to evaluate the posterior

0:06:28 | covariance

0:06:29 | we do this by two means the first one is what we call an informative

0:06:34 | prior

0:06:35 | which i am going to show later

0:06:37 | and the second is a uniform occupancy assumption so with the combination of these two

0:06:41 | we can do a fast extraction of i-vectors

0:06:44 | without the need to estimate the posterior covariance

0:06:52 | okay so

0:06:53 | in the conventional formulation of

0:06:55 | the

0:06:57 | i-vector extraction we assume a standard gaussian prior

0:07:03 | and

0:07:03 | now if we consider a

0:07:07 | non-standard prior

0:07:08 | with

0:07:09 | mean given by mu p and the covariance given by sigma p then the i-vector extraction

0:07:15 | is given by this equation where we have two additional terms here

0:07:20 | which are determined by the

0:07:23 | covariance of the prior

0:07:25 | and the mean mu p

0:07:28 | so now if we consider the case where this mean is zero and this sigma p

0:07:32 | is the identity matrix then this term will disappear

0:07:36 | and this reduces to the identity matrix so we get back to the

0:07:39 | standard form

0:07:42 | so in this paper we propose to use this

0:07:47 | form of informative prior

0:07:49 | where the mean is zero but the

0:07:52 | covariance is given by this

0:07:54 | t is the total variability matrix so we take the inner product

0:07:59 | of the total variability matrix with itself and its inverse

0:08:01 | as the prior covariance
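A hedged sketch of the modified extraction rule just described, assuming whitened statistics (`posterior_mean` is my own helper, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(2)
C, F, M = 4, 3, 5
T = rng.standard_normal((C * F, M))      # total variability matrix (whitened)
N = np.repeat(rng.uniform(1, 10, C), F)  # occupancy counts, repeated per feature dim
f = rng.standard_normal(C * F)           # whitened, centred first-order stats

def posterior_mean(mu_p, Sigma_p):
    # under a prior N(mu_p, Sigma_p):
    #   L = Sigma_p^{-1} + T' N T,  phi = L^{-1} (T' f + Sigma_p^{-1} mu_p)
    P = np.linalg.inv(Sigma_p)
    L = P + T.T @ (N[:, None] * T)
    return np.linalg.solve(L, T.T @ f + P @ mu_p)

# standard prior (zero mean, identity covariance) recovers the usual formula
phi_std = posterior_mean(np.zeros(M), np.eye(M))

# informative prior: zero mean, covariance (T' T)^{-1}
phi_inf = posterior_mean(np.zeros(M), np.linalg.inv(T.T @ T))
```

With the informative prior the posterior precision collapses to `T' (I + N) T`, which is what the derivation below exploits.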

0:08:03 | so okay now we are able to derive this

0:08:07 | what happens is that in the i-vector extraction formula we have an additional term here

0:08:12 | due to the prior right so now if we plug this into the i-vector extraction formula

0:08:17 | we will then get this right so we can always assume that t

0:08:23 | transpose t has an inverse because it is always full rank

0:08:27 | given a sufficient amount of training data

0:08:30 | then we can take this t transpose t

0:08:32 | out in front

0:08:34 | and again take its inverse and then we will get

0:08:38 | this

0:08:41 | and then we use this matrix inversion identity which

0:08:46 | i copied from the matrix cookbook

0:08:49 | okay so the idea is that if you have matrices p and q then

0:08:53 | we can reorder

0:08:56 | p and q by putting this one in the front
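The identity being invoked here is, I believe, the push-through identity, (I + PQ)^{-1} P = P (I + QP)^{-1}; a quick numerical check (toy matrices of my own):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 6, 3
P = rng.standard_normal((n, m))
Q = rng.standard_normal((m, n))

# push-through: (I + PQ)^{-1} P = P (I + QP)^{-1}
lhs = np.linalg.inv(np.eye(n) + P @ Q) @ P
rhs = P @ np.linalg.inv(np.eye(m) + Q @ P)
print(np.allclose(lhs, rhs))  # True
```

The practical point is that the right-hand side inverts an m-by-m matrix instead of an n-by-n one.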

0:09:00 | so if you look at this formula it is the same as

0:09:07 | this one

0:09:10 | right so we can say this is the p and this is

0:09:15 | the q then we can put this

0:09:17 | in front

0:09:18 | and so on right so now if you look at this formula

0:09:22 | from linear algebra this is a projection matrix right and a projection matrix

0:09:27 | you know can be written in this form where u one is an

0:09:32 | orthonormal matrix meaning that

0:09:33 | each column of this

0:09:35 | u one

0:09:37 | is

0:09:39 | orthogonal to each of the other columns

0:09:41 | and has unit norm

0:09:43 | and u one spans the same subspace as the t matrix

0:09:47 | okay and this

0:09:49 | orthonormality property is actually introduced by the prior

0:09:54 | right and that is why we call it

0:09:56 | the prior we use

0:10:00 | the subspace orthonormalizing prior
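A quick numerical check of the projection-matrix fact mentioned above: T (T'T)^{-1} T' equals U1 U1', where U1 is an orthonormal basis for the column space of T (a toy illustration of my own, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(4)
T = rng.standard_normal((20, 4))             # tall full-column-rank matrix

Pproj = T @ np.linalg.inv(T.T @ T) @ T.T     # projection onto col(T)
U1, _, _ = np.linalg.svd(T, full_matrices=False)  # orthonormal basis for col(T)

print(np.allclose(Pproj, U1 @ U1.T))         # True: same projection
print(np.allclose(U1.T @ U1, np.eye(4)))     # True: orthonormal columns
```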

0:10:04 | okay so

0:10:06 | with this what we achieve

0:10:09 | is avoiding the estimation of the posterior covariance

0:10:13 | since you know we can directly estimate the posterior mean

0:10:16 | but the thing is that if you use this formula it is going to incur more

0:10:21 | computation because we are dealing with t

0:10:26 | t transpose which is a very big matrix

0:10:30 | so that is the reason why we have to introduce another assumption we call the uniform occupancy assumption

0:10:35 | which speeds up the computation

0:10:39 | okay so to do so

0:10:40 | we first of all perform a singular value decomposition of t

0:10:45 | into

0:10:47 | u s and v where s holds the singular values and u and v are the singular

0:10:52 | vector matrices

0:10:55 | okay and then u

0:10:57 | is split

0:10:58 | into two parts u one and u two as shown in this matrix

0:11:11 | okay so

0:11:13 | one thing to note is that this u one which is the u one in the previous slide

0:11:17 | is the part that

0:11:19 | spans the same subspace as t

0:11:22 | and u two

0:11:23 | is orthogonal to u one okay then we use this property to simplify this

0:11:29 | formula

0:11:31 | right so we can express t times t transpose t inverse times t transpose in this form because this

0:11:39 | is equal to this right

0:11:41 | and then this can be expressed in this form

0:11:45 | okay because of this property

0:11:48 | then we can multiply the n in so that we have i plus n here okay

0:11:54 | next

0:11:56 | we set i plus n equal to a

0:12:01 | and then apply

0:12:02 | the matrix inversion lemma

0:12:05 | in this form this is what we get

0:12:07 | and then we apply again the

0:12:10 | matrix inversion identity that we used before here we have this

0:12:17 | p

0:12:18 | q and p right now we can put this p in the front

0:12:24 | and then

0:12:26 | we have q and p

0:12:28 | so the result is that we can express this term

0:12:32 | on the left

0:12:34 | in terms of

0:12:35 | a inverse and a term

0:12:39 | expressed in terms of u two

0:12:41 | which is orthogonal to u one

0:12:48 | right

0:12:49 | so here is the uniform occupancy assumption

0:12:53 | okay

0:12:54 | a is

0:12:56 | i plus n

0:12:59 | and n itself is a diagonal matrix

0:13:02 | so if you look into the individual elements of this

0:13:06 | matrix here what we get is this thing here what we get is n c

0:13:11 | divided by

0:13:14 | one plus n c

0:13:15 | right so the uniform occupancy assumption says that

0:13:21 | for all the gaussian components

0:13:24 | the occupancy count divided by one plus the occupancy count is the same for all the

0:13:30 | components right here we do not need to know what the appropriate

0:13:34 | value of alpha is

0:13:36 | what we assume is that the same alpha

0:13:40 | applies to all the components

0:13:43 | so

0:13:44 | by doing so we can simplify this

0:13:46 | into this form

0:13:48 | and this here is the i-vector extractor so

0:13:53 | if you multiply this t

0:13:56 | into

0:13:57 | this u two then these two terms cancel

0:14:00 | so we end up with this formula for i-vector extraction this is very fast because

0:14:05 | we can pre-compute these terms

0:14:09 | and this

0:14:10 | is a diagonal matrix right so taking the inverse

0:14:14 | is very simple

0:14:17 | right
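Putting the pieces together, here is my own sketch of the resulting fast extractor (the matrix `W` below is the precomputable part; names are assumptions, not the paper's code). When the occupancy counts really are the same for every component, the uniform occupancy assumption holds exactly, so the fast formula matches the exact posterior mean under the informative prior:

```python
import numpy as np

rng = np.random.default_rng(0)
C, F, M = 8, 5, 6                     # toy sizes: components, feature dim, i-vector dim
T = rng.standard_normal((C * F, M))   # total variability matrix (whitened)
f = rng.standard_normal(C * F)        # whitened, centred first-order stats
N = np.repeat(np.full(C, 3.0), F)     # uniform occupancy counts: assumption holds exactly

# exact posterior mean under the informative prior Sigma_p = (T' T)^{-1}:
#   phi = [T' (I + N) T]^{-1} T' f
L = T.T @ ((1.0 + N)[:, None] * T)
phi_exact = np.linalg.solve(L, T.T @ f)

# fast formula: thin SVD T = U1 S1 V', then phi ~ V S1^{-1} U1' diag(1/(1+N)) f
U1, S1, Vt = np.linalg.svd(T, full_matrices=False)
W = Vt.T @ np.diag(1.0 / S1) @ U1.T   # M x CF, can be precomputed offline
phi_fast = W @ (f / (1.0 + N))        # scale stats, then one matrix-vector product

print(np.allclose(phi_exact, phi_fast))  # True when occupancy is uniform
```

With non-uniform counts the fast formula becomes an approximation, which is where the accuracy trade-off reported later comes from.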

0:14:21 | okay now let us look at the computational complexity

0:14:25 | so we have a

0:14:29 | comparison of four different algorithms so we have the baseline i-vector extraction which is

0:14:34 | the standard form

0:14:36 | where you know we have to compute the inner products of

0:14:40 | the

0:14:41 | sub-matrices

0:14:43 | t c transpose t c

0:14:45 | for all the c components so this is of order c f m squared

0:14:52 | and

0:14:53 | the m cubed term is due to the matrix inversion

0:14:57 | also in terms of memory cost we have to store the entire t matrix

0:15:02 | so this is c f m

0:15:05 | okay so now for the fast baseline we can actually pre-compute these t c transpose

0:15:11 | t c

0:15:13 | and store them while the computational cost for this part

0:15:17 | is c m squared

0:15:18 | so effectively we reduce the computational cost from this to this

0:15:25 | okay and then for

0:15:27 | the proposed method using the informative prior

0:15:31 | without the uniform occupancy assumption

0:15:34 | the computational complexity and the memory cost are essentially the same as for

0:15:40 | the fast baseline

0:15:41 | okay

0:15:42 | because we can pre-compute these terms and store them

0:15:46 | and as for the fast

0:15:48 | proposed method

0:15:50 | we have

0:15:51 | the computational complexity reduced to what is shown in the table

0:15:56 | and we can pre-compute these terms and store them in memory so in terms of computational complexity

0:16:02 | the proposed

0:16:03 | fast method is

0:16:05 | twelve times faster

0:16:07 | than the fast baseline

0:16:08 | and about a hundred times faster than the slow baseline
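As a rough sanity check on these factors, one can count the dominant per-utterance multiplications (the sizes below are illustrative assumptions on my part, and constant factors are ignored, so the ratios are only ballpark):

```python
# rough per-utterance multiply counts (illustrative assumptions, constants ignored)
C, F, M = 1024, 57, 400   # components, feature dim, i-vector dim

baseline      = C * F * M**2 + M**3 + C * F * M  # build L from T, invert, project
fast_baseline = C * M**2     + M**3 + C * F * M  # with T_c' T_c precomputed
proposed_fast = C * F * M + C * F                # one scaled matrix-vector product

print(f"fast baseline vs proposed: {fast_baseline / proposed_fast:.1f}x")
print(f"slow baseline vs proposed: {baseline / proposed_fast:.1f}x")
```

Under these crude counts the proposed method lands in the same ballpark as the factors quoted in the talk.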

0:16:18 | okay so

0:16:20 | there are two further points to mention the first is about

0:16:23 | uncertainty propagation

0:16:25 | which needs the posterior covariance

0:16:29 | so for any application that needs the posterior covariance it could actually

0:16:34 | be computed using the same fast method

0:16:37 | as given by this equation here

0:16:40 | using the same informative prior

0:16:42 | as well as the uniform occupancy assumption and with the same computational complexity

0:16:48 | also

0:16:51 | we can actually use this same informative prior

0:16:53 | given by t transpose t

0:16:55 | and its inverse

0:16:57 | in the em estimation of the t matrix

0:17:02 | okay of course we only change the e step in the sense that

0:17:05 | we now

0:17:07 | associate the latent variable with this prior which

0:17:11 | then

0:17:12 | leads to equations of this form

0:17:20 | okay

0:17:21 | experiments the experiments were conducted on the nist sre ten extended core task

0:17:27 | common conditions one to nine

0:17:29 | we use a gender dependent ubm

0:17:34 | with fifty seven dimensional mfcc features and the ubm is trained on switchboard and sre oh four

0:17:39 | oh five and oh six and we use the same data to

0:17:42 | train the t matrix

0:17:43 | which has a rank of four hundred

0:17:47 | scoring is based on plda so before

0:17:52 | passing to the plda we reduce the i-vector dimension to two hundred using lda

0:17:57 | followed by length normalization

0:17:58 | and for the plda we have the speaker factors and then we

0:18:02 | use a full

0:18:04 | covariance residual to

0:18:05 | model the session variability

0:18:10 | okay so this table shows the

0:18:15 | results for the baseline

0:18:18 | the proposed method and the proposed fast method

0:18:22 | so the first row

0:18:23 | shows the eer and the second row is the min dcf so now if we

0:18:29 | compare these

0:18:31 | results with these

0:18:33 | well we can see that the results are not really much different so we can

0:18:38 | say that

0:18:40 | using the informative prior that we propose

0:18:43 | does not seem to degrade the performance

0:18:46 | okay then if we look at common condition five

0:18:54 | which is the telephone condition

0:18:56 | for the proposed fast method the degradation is actually

0:19:01 | about ten percent in eer and four point five percent in min dcf

0:19:06 | okay and if we look across all the nine common conditions

0:19:10 | the relative degradation in eer is ranging from ten to sixteen percent and

0:19:16 | for the min dcf it ranges from about six or seven percent

0:19:20 | up to twenty point four percent

0:19:27 | okay so this is

0:19:33 | the other system that we compare with

0:19:37 | it consists of

0:19:38 | taking the whitened and centred first order statistics

0:19:41 | normalized by the occupancy counts

0:19:45 | so we use these as supervectors

0:19:47 | on which we perform pca

0:19:49 | right

0:19:50 | and then we do the projection of all the test and training utterances

0:19:55 | and enrolment utterances

0:19:56 | into the low dimensional subspace

0:19:58 | and use them for the plda as before

0:20:00 | so

0:20:02 | you may ask

0:20:04 | okay why do we do that because

0:20:12 | if you look at this formula

0:20:15 | this can be seen as a transformation matrix

0:20:20 | and this is the input vector

0:20:22 | and this is the projection of this input vector

0:20:25 | into a low dimensional vector
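A sketch of the kind of PCA comparison system being described; the exact statistics normalization in the paper may differ, so treat this as my reading rather than the authors' recipe:

```python
import numpy as np

rng = np.random.default_rng(5)
n_utt, C, F, M = 50, 4, 3, 5                   # utterances, components, feature dim, output dim
f = rng.standard_normal((n_utt, C * F))        # centred first-order stats per utterance
occ = rng.uniform(1, 10, (n_utt, C))           # occupancy counts per utterance
sv = f / np.repeat(occ, F, axis=1)             # occupancy-normalised supervectors

# PCA: the top-M principal directions form a transformation matrix;
# applying it to an input supervector yields a low-dimensional vector
mean = sv.mean(axis=0)
_, _, Vt = np.linalg.svd(sv - mean, full_matrices=False)
proj = (sv - mean) @ Vt[:M].T                  # low-dimensional vectors, e.g. for PLDA
print(proj.shape)                              # (50, 5)
```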

0:20:34 | so by comparing these results with those of the fast method what it shows is that

0:20:38 | using the t matrix trained with em

0:20:41 | in the conventional form gives a better performance

0:20:46 | now

0:20:47 | this result shows the comparison of the t matrix

0:20:50 | trained either with the standard gaussian prior

0:20:56 | or with the informative prior

0:20:58 | so

0:21:00 | comparing these two we can see that training with the informative prior actually gives

0:21:04 | a slightly better result

0:21:11 | okay so in conclusion we introduced two new concepts

0:21:16 | for the rapid computation of i-vectors

0:21:18 | the first one is what we call the subspace orthonormalizing prior

0:21:22 | and

0:21:24 | the use of this prior avoids the need to compute the posterior covariance

0:21:30 | okay before computing the posterior mean

0:21:33 | and then we use the uniform occupancy assumption to further reduce

0:21:37 | the

0:21:38 | computational complexity

0:21:41 | so with the combined use of these two that is the assumption and the informative prior

0:21:46 | we speed up the i-vector extraction process

0:21:50 | but the trade-off is a slight degradation in terms of accuracy

0:21:57 | that is my talk

0:22:03 | we have time for a few questions

0:22:15 | so you use this informative prior of course

0:22:19 | and i have a question about that

0:22:31 | you summarized the performance by saying that with this method you noticed the same

0:22:36 | performance as the baseline have you also evaluated what happens

0:22:45 | without the use of the uniform occupancy assumption

0:22:49 | by just using the subspace orthonormalizing prior

0:22:56 | yes because we wanted to see that by introducing the two steps where we first introduce the

0:23:01 | subspace orthonormalizing prior and then follow with the uniform occupancy assumption so we wanted to

0:23:06 | see individually just

0:23:08 | what the effect is

0:23:10 | that is

0:23:12 | whether when we introduce just the subspace orthonormalizing prior

0:23:16 | we get a better performance or a slightly worse performance