0:00:15 | Given an i-vector, the observed value w can be decomposed into

0:00:18 | a speaker part and a residual part,

0:00:24 | with a matrix V

0:00:29 | whose columns contain the basis of

0:00:32 | the eigenvoice subspace,

0:00:34 | applied to y,

0:00:37 | which is the speaker factor, normally distributed

0:00:42 | with zero mean and identity covariance.
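
A minimal sketch of the generative model just described, w = mu + V y + eps, with y the standard-normal speaker factor and eps the residual; all dimensions and values below are illustrative, not the talk's trained parameters.

```python
import numpy as np

# Hedged sketch of the Gaussian PLDA generative model: w = mu + V @ y + eps,
# with y ~ N(0, I) the speaker factor and eps ~ N(0, Sigma) the residual.
rng = np.random.default_rng(0)
p, r = 20, 5                        # i-vector dimension, eigenvoice rank
mu = np.zeros(p)                    # global mean
V = rng.standard_normal((p, r))     # columns span the eigenvoice subspace
Sigma = 0.1 * np.eye(p)             # residual covariance (diagonal toy choice)

def sample_speaker_sessions(n):
    """Draw n i-vectors of one speaker: shared factor y, fresh residuals."""
    y = rng.standard_normal(r)
    eps = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    return mu + V @ y + eps          # broadcast the speaker part over sessions

w = sample_speaker_sessions(10)
print(w.shape)  # (10, 20)
```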

0:00:43 | What we do is assume

0:00:45 | a full covariance matrix Sigma

0:00:48 | for the residual.

0:00:50 | This is the most commonly used

0:00:53 | PLDA system for i-vectors, in which the channel effect is kept in

0:00:59 | the residual. The decision score,

0:01:01 | proposed by Prince,

0:01:03 | is a log-likelihood ratio,

0:01:06 | in which we can see that

0:01:09 | computing the score depends only on

0:01:13 | the matrix V V-transpose, which models

0:01:16 | the speaker

0:01:18 | variability,

0:01:19 | and on the total covariance matrix,

0:01:23 | which contains the total variability.
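
The log-likelihood-ratio score just mentioned can be sketched as below (a two-covariance formulation of PLDA scoring; B = V V-transpose, and the toy matrices stand in for trained parameters).

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hedged sketch of the PLDA log-likelihood-ratio score: same-speaker vs
# different-speaker hypotheses, using B = V @ V.T (speaker variability)
# and S_tot = B + Sigma (total variability). Toy values, not trained ones.
rng = np.random.default_rng(1)
p = 10
V = rng.standard_normal((p, 3))
B = V @ V.T
Sigma = np.eye(p)
S_tot = B + Sigma

def plda_llr(w1, w2):
    """log p(w1, w2 | same speaker) - log p(w1, w2 | different speakers)."""
    K = np.block([[S_tot, B], [B, S_tot]])   # joint covariance, same-speaker case
    joint = multivariate_normal(np.zeros(2 * p), K).logpdf(np.concatenate([w1, w2]))
    marg = multivariate_normal(np.zeros(p), S_tot)
    return joint - marg.logpdf(w1) - marg.logpdf(w2)

y = rng.standard_normal(3)
same1 = V @ y + rng.standard_normal(p)       # two sessions of one speaker
same2 = V @ y + rng.standard_normal(p)
print(plda_llr(same1, same2))
```

Note the score is symmetric in its two arguments, since the joint covariance has equal diagonal blocks and a symmetric off-diagonal block.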

0:01:30 | This straightforward modeling can provide good performance, but

0:01:34 | it has been shown that best performances are achieved only if a conditioning

0:01:40 | follows the extraction of the i-vectors. This conditioning

0:01:46 | is summarized by the whitening most commonly used:

0:01:51 | whitening as a standardization, followed by length normalisation.

0:01:58 | The matrix

0:01:59 | of variability chosen for the standardisation

0:02:02 | can be the total covariance matrix

0:02:05 | or the within-speaker covariance matrix;

0:02:09 | eventually, we iterate this process.

0:02:13 | Parameters are computed for the i-vectors present in the training corpus and applied to the test

0:02:18 | i-vectors.
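
The conditioning step just described, standardization then length normalization, can be sketched as follows; the training and test vectors below are synthetic stand-ins, and the total covariance matrix is used here (the within-speaker matrix being the other option).

```python
import numpy as np

# Hedged sketch of i-vector conditioning: whitening (standardization) with a
# chosen variability matrix, then length normalization. Parameters (mean,
# covariance) are computed on the training i-vectors only.
rng = np.random.default_rng(2)
train = rng.standard_normal((500, 20)) * 3.0 + 1.0   # toy "training" i-vectors
test = rng.standard_normal((50, 20)) * 3.0 + 1.0     # toy "test" i-vectors

mu = train.mean(axis=0)
cov = np.cov(train, rowvar=False)                    # total covariance here
vals, vecs = np.linalg.eigh(cov)
inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T     # cov^(-1/2)

def condition(w):
    z = (w - mu) @ inv_sqrt                          # standardization (whitening)
    return z / np.linalg.norm(z, axis=-1, keepdims=True)  # length normalization

cw = condition(test)
print(np.allclose(np.linalg.norm(cw, axis=1), 1.0))  # True: unit norm afterwards
```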

0:02:22 | The assumptions of the Gaussian PLDA are, firstly, the Gaussianity;

0:02:27 | then the linearity of eigenvoices: it means that

0:02:31 | the speaker part can be constrained to a linear subspace;

0:02:36 | and the homoscedasticity of the residual:

0:02:39 | it means that the model assumes

0:02:42 | that all speaker classes

0:02:46 | share the same statistics, meaning that channel effects can be modeled

0:02:50 | in a speaker-independent way,

0:02:53 | so that the class distributions share the same covariance matrix;

0:02:58 | then the independence between the residual

0:03:01 | and the speaker factor,

0:03:03 | and the equality of covariance.

0:03:06 | Regarding this equality of covariance: it means that the deviations of the residual between the actual

0:03:12 | mean of a class

0:03:13 | and the model parameter

0:03:14 | computed

0:03:16 | for the PLDA are assumed to be uncorrelated,

0:03:20 | normally distributed, and explained by

0:03:24 | the sampling

0:03:27 | of the development corpus,

0:03:30 | assumed random,

0:03:31 | and that they do not vary with the effects being modeled.

0:03:38 | On the left, the graph

0:03:41 | is a simple simulation of the PLDA model, with a speaker factor of one dimension,

0:03:47 | a one-dimensional subspace,

0:03:52 | a standard normal prior for the speaker factor,

0:03:56 | and some classes with the same

0:03:59 | variability matrix.

0:04:01 | The problem is that i-vectors do not lie on

0:04:04 | a linear subspace, but on a nonlinear and finite connected subset of a hypersphere,

0:04:10 | so the distribution of the i-vectors is

0:04:13 | what is referred to as a spherical distribution.

0:04:19 | We think that nothing ensures that there exists a relevant speaker-independent parameter

0:04:24 | of within-speaker variability: this assumption is questionable,

0:04:27 | that is, whether channel effects can be modeled in a speaker-independent way.

0:04:31 | It is difficult to show that such an assumption is right or wrong;

0:04:37 | for example, if we find a significant correlation between

0:04:42 | the position

0:04:44 | of a class and the class parameter,

0:04:46 | this effect would invalidate the estimation of the residual as a single random variable.

0:04:54 | First, we present the deterministic approach.

0:04:58 | Why prefer a deterministic approach to compute the PLDA parameters? First,

0:05:03 | because we first tried some

0:05:07 | deterministic approaches, and we remarked that such approaches are

0:05:11 | not

0:05:12 | always considered relevant, sometimes not suited to the task.

0:05:17 | Second, if the model is not optimal for the i-vector spherical distribution,

0:05:22 | can we replace the sophistication of the expectation-maximization maximum-likelihood

0:05:28 | estimation of the

0:05:30 | parameters

0:05:32 | by a simple and straightforward deterministic approach?

0:05:37 | So we want to know if

0:05:40 | the application of the maximum-likelihood

0:05:44 | approach to compute the parameters of the PLDA

0:05:49 | brings a significant improvement of performance.

0:05:54 | We diagonalize the between-speaker covariance

0:05:59 | matrix, computed

0:06:00 | on our development corpus.

0:06:04 | A singular value decomposition of the between-speaker covariance matrix

0:06:08 | gives a matrix

0:06:10 | whose columns are

0:06:12 | the eigenvectors of the between-speaker

0:06:16 | variability, and a diagonal matrix of eigenvalues

0:06:21 | sorted in decreasing order.

0:06:24 | Given a rank r less than p,

0:06:27 | we can

0:06:30 | compute

0:06:31 | the rank-r principal between-speaker variability

0:06:36 | and summarize it in a p-times-p matrix

0:06:40 | defined by equation 4.

0:06:45 | The first matrix, P one-to-r, is the p-times-r

0:06:49 | matrix composed of the first r columns of P,

0:06:52 | and the diagonal matrix D one-to-r

0:06:57 | only comprises the

0:06:59 | r highest

0:07:01 | eigenvalues.
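
The construction just described can be sketched as follows: eigendecompose a between-speaker covariance, keep the r leading eigenvectors and eigenvalues, and rebuild a p-by-p rank-r matrix. The matrix B below is a synthetic stand-in.

```python
import numpy as np

# Hedged sketch of the rank-r between-speaker variability:
# B_r = P_{1..r} D_{1..r} P_{1..r}^T with eigenvalues in decreasing order.
rng = np.random.default_rng(3)
p, r = 20, 5
A = rng.standard_normal((p, p))
B = A @ A.T                        # stand-in for a between-speaker covariance

vals, vecs = np.linalg.eigh(B)     # eigh returns ascending eigenvalues
order = np.argsort(vals)[::-1]     # re-sort in decreasing order
P = vecs[:, order]
D = vals[order]

P_r = P[:, :r]                     # p-by-r matrix of the r leading eigenvectors
D_r = np.diag(D[:r])               # diagonal matrix of the r highest eigenvalues
B_r = P_r @ D_r @ P_r.T            # rank-r principal between-speaker variability

print(np.linalg.matrix_rank(B_r))  # 5
```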

0:07:03 | And so we propose to

0:07:07 | carry out

0:07:09 | experiments with

0:07:11 | the W conditioning,

0:07:14 | that is, a standardization according to the

0:07:19 | within-class covariance matrix

0:07:21 | followed by length normalization,

0:07:23 | and with a direct estimation of the parameters of the PLDA,

0:07:28 | that is, without the EM maximum-likelihood algorithm,

0:07:31 | on the development corpus.

0:07:35 | In the scoring formula, the total covariance matrix

0:07:42 | is estimated by

0:07:44 | the total covariance of the development corpus,

0:07:47 | and the speaker variability matrix V V-transpose is replaced by B one-to-r.
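
The direct estimation just described can be sketched as below: instead of EM, the total covariance and the between-speaker covariance are computed straight from labeled development i-vectors. The data, labels and sizes are synthetic stand-ins for the real corpus.

```python
import numpy as np

# Hedged sketch of deterministic PLDA parameter estimation: S_tot and a
# between-speaker covariance B computed directly from labeled data, to be
# used in place of the EM-ML estimates in the scoring formula.
rng = np.random.default_rng(4)
p, n_spk, n_sess = 10, 30, 8
means = rng.standard_normal((n_spk, p)) * 2.0
ivecs = np.repeat(means, n_sess, axis=0) + rng.standard_normal((n_spk * n_sess, p))
labels = np.repeat(np.arange(n_spk), n_sess)          # session -> speaker id

mu = ivecs.mean(axis=0)
S_tot = np.cov(ivecs, rowvar=False)                   # total covariance

spk_means = np.stack([ivecs[labels == s].mean(axis=0) for s in range(n_spk)])
centered = spk_means - mu
B = centered.T @ centered / n_spk                     # between-speaker covariance

print(S_tot.shape, B.shape)  # (10, 10) (10, 10)
```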

0:07:55 | This can be justified if we consider, for the labeled data of the development corpus,

0:08:02 | that we can express the factors of the model, the

0:08:06 | speaker and residual

0:08:08 | factors,

0:08:10 | where w-bar-s is the mean vector of speaker s.

0:08:15 | We show in the article that the covariance matrix of this speaker factor is the identity, as

0:08:19 | desired: the speaker factor is standardised, with mean zero and identity matrix of

0:08:26 | variability,

0:08:27 | and we look at the dependence between the latent variables.

0:08:32 | Remark that only the nullity of the covariance, which is a necessary condition

0:08:36 | of independence,

0:08:39 | is achieved,

0:08:40 | and in this way we obtain the PLDA scoring.

0:08:48 | Length normalisation is known to improve the Gaussianity, so we compute the density

0:08:53 | of the squared norm of the speaker and residual factors of the development corpus,

0:08:57 | before and after length normalisation.

0:09:00 | The top graphs show

0:09:03 | the distribution of the squared norms of the standardised factors:

0:09:09 | on the left the speaker factors, on the right the residuals.

0:09:14 | The dashed line is

0:09:17 | the distribution that the squared norm of

0:09:20 | the speaker factor must follow:

0:09:25 | a chi-squared with p degrees of freedom,

0:09:31 | where p is the dimension of the i-vector space.
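
The check just described can be sketched as follows: for standardized p-dimensional Gaussian vectors, the squared norms should follow a chi-squared law with p degrees of freedom. The samples below are synthetic, not the talk's factors.

```python
import numpy as np
from scipy.stats import chi2

# Hedged sketch of the chi-squared diagnostic: squared norms of standardized
# Gaussian vectors vs the theoretical chi2(p) distribution (mean p, variance 2p).
rng = np.random.default_rng(5)
p, n = 40, 20000
sq_norms = (rng.standard_normal((n, p)) ** 2).sum(axis=1)

theory = chi2(df=p)
print(abs(sq_norms.mean() - theory.mean()) < 1.0)   # True: empirical mean near p
# theory.pdf can be overlaid on a histogram of sq_norms, as in the talk's graphs.
```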

0:09:37 | We show in this slide that,

0:09:39 | for the

0:09:40 | development corpus

0:09:43 | and also for the evaluation

0:09:46 | datasets,

0:09:48 | there is a mismatch

0:09:50 | between

0:09:54 | the empirical distribution of the norms and the theoretical chi-squared distribution.

0:10:01 | Remark also the severe dataset shift between the

0:10:05 | development and evaluation datasets

0:10:09 | after length normalization:

0:10:12 | the shift remains.

0:10:14 | We carried out experiments with

0:10:18 | ML-computed parameters and with the deterministic approach.

0:10:23 | In both cases,

0:10:24 | we can see that

0:10:26 | the mismatch with the theoretical distribution is

0:10:29 | partially reduced,

0:10:31 | as is the shift

0:10:34 | between the development and evaluation datasets.

0:10:38 | Remark that the deterministic approach

0:10:42 | improves the Gaussianity

0:10:43 | in a similar manner to the ML technique.

0:10:47 | Here are the results on the speaker detection task,

0:10:53 | always with

0:10:54 | three systems.

0:10:58 | We evaluate on the telephone conditions of the NIST speaker recognition evaluations of 2008, 2010

0:11:05 | and 2012,

0:11:08 | the latter in a noisy environment,

0:11:11 | with a system

0:11:13 | using length normalization

0:11:15 | following a standardization by the total covariance matrix,

0:11:19 | and

0:11:21 | two W-conditioning cases:

0:11:24 | one with the ML

0:11:26 | estimate of the parameters, and one with the deterministic

0:11:29 | estimate of the parameters.

0:11:32 | We can see

0:11:34 | that the results are the same in terms of

0:11:39 | equal error rate

0:11:40 | between the last two techniques;

0:11:43 | in terms of DCF, the probabilistic approach remains superior.

0:11:48 | And we remark that the W conditioning performed better

0:11:53 | than the total-covariance conditioning,

0:11:56 | even with the deterministic approach.

0:12:04 | So now, we consider that the fact that

0:12:09 | the EM-ML approach does not bring the expected

0:12:13 | improvement of performance

0:12:15 | is maybe due to the fact that the Gaussian PLDA model is

0:12:20 | not optimal for the i-vector spherical distribution.

0:12:26 | So we compute

0:12:28 | two series for the development corpus:

0:12:32 | first, the average log-likelihood of the residuals of the observations of a class,

0:12:36 | given the model,

0:12:38 | given

0:12:39 | mu, V

0:12:40 | and Sigma,

0:12:45 | which we consider as a likelihood

0:12:48 | of the

0:12:49 | class, or a likelihood

0:12:51 | of the residuals given the class.

0:12:55 | Then we compare this

0:12:57 | likelihood to a parameter of position of the class: we consider as an indicator of the probabilistic class

0:13:04 | position the posterior likelihood of the speaker factor of

0:13:09 | the class.

0:13:12 | And we display

0:13:16 | the two series:

0:13:19 | on the horizontal

0:13:21 | axis the parameter of class position, and on the vertical axis

0:13:26 | the likelihood

0:13:28 | of the residuals,

0:13:30 | according to

0:13:31 | the model.

0:13:35 | The first graph

0:13:37 | shows the results without length normalisation,

0:13:41 | with the i-vectors as provided by the extractor,

0:13:44 | and we remark here that

0:13:47 | no correlation occurs between the position of the class

0:13:53 | and the likelihood

0:13:55 | of the residuals.

0:13:58 | Each time, we display the coefficient of determination R-squared,

0:14:02 | which goes from zero to one,

0:14:05 | and which indicates how well the data points fit a line.

0:14:10 | Here R-squared is equal to zero point zero four,

0:14:13 | close to zero.
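
The coefficient of determination used here can be sketched as below; the series names are illustrative, not the talk's actual likelihood data.

```python
import numpy as np

# Hedged sketch of R^2: it scores, from 0 to 1, how well the points (x_i, y_i)
# fit a least-squares line (0 = no linear relation, 1 = perfect fit).
def r_squared(x, y):
    slope, intercept = np.polyfit(x, y, 1)   # least-squares line fit
    resid = y - (slope * x + intercept)
    return 1.0 - resid.var() / y.var()

rng = np.random.default_rng(6)
position = rng.standard_normal(1000)          # e.g. class-position likelihoods
linear = 2.0 * position + 1.0                 # perfectly linear relation
independent = rng.standard_normal(1000)       # unrelated series

print(round(r_squared(position, linear), 3))      # 1.0
print(r_squared(position, independent) < 0.05)    # True: near zero
```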

0:14:16 | After length normalisation,

0:14:19 | a significant correlation

0:14:21 | appears between the likelihoods of the class factors and the likelihoods of the residuals:

0:14:27 | the R-squared values are equal to zero point five nine and zero point six four.

0:14:35 | So there is a dependency between

0:14:37 | the actual variability

0:14:41 | matrix of a class and the probabilistic position of this class, expressed by the likelihood of

0:14:46 | its factor.

0:14:48 | So we can say that length normalisation hurts the homoscedasticity

0:14:51 | of the residual.

0:14:58 | We computed the previous

0:15:02 | results on the whole training set,

0:15:05 | in which data are not evenly distributed across speakers.

0:15:10 | So one can object that the correlations are due to the quantity of information per speaker:

0:15:15 | some speakers have few sessions.

0:15:17 | So we compute the same graphs as before,

0:15:21 | but only for the speaker

0:15:25 | training classes

0:15:27 | with at least a minimum number of sessions per training speaker.

0:15:33 | You can see that the minimal number of sessions per speaker

0:15:36 | runs from two to sixty-two,

0:15:41 | and each time, for only the segments of the speakers which have

0:15:46 | more than this minimum number,

0:15:48 | we compute the R-squared score.
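
The control experiment just described can be sketched as follows: keep only the speakers with at least a given number of sessions before recomputing the statistic. The labels and values below are synthetic stand-ins for the real training corpus.

```python
import numpy as np

# Hedged sketch of filtering speakers by a minimum session count, as done
# before recomputing the R^2 score on each filtered subset.
rng = np.random.default_rng(7)
labels = rng.integers(0, 50, size=2000)      # session index -> speaker id
values = rng.standard_normal(2000)           # per-session quantity of interest

counts = np.bincount(labels, minlength=50)   # sessions per speaker
for min_sessions in (2, 10, 30):
    keep = np.isin(labels, np.flatnonzero(counts >= min_sessions))
    # The talk recomputes an R^2 score on values[keep]; here we just show
    # how the retained subset shrinks as the threshold grows.
    print(min_sessions, int(keep.sum()))
```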

0:15:50 | We see that before length normalisation there is no problem,

0:15:54 | because the

0:15:55 | two series are independent,

0:15:57 | and after normalisation we see

0:16:00 | that, even for

0:16:04 | the subsets of speaker classes with the

0:16:07 | maximum number of sessions,

0:16:10 | the same

0:16:11 | correlation occurs,

0:16:14 | with R-squared values which are higher than zero point six.

0:16:24 | So we remark

0:16:27 | that the Gaussian PLDA modelling is a good model,

0:16:32 | but if we are obliged to

0:16:34 | project the data onto a nonlinear surface, the

0:16:38 | problem is to be sure that a homoscedastic model, with equality of

0:16:44 | covariance,

0:16:46 | will fit the samples.

0:16:51 | A way to take this into account would be to replace the overall within-class

0:16:55 | variability parameter by a class-dependent parameter,

0:16:58 | taking into account the local position of the class, to fit the

0:17:01 | actual distortions.

0:17:05 | Such an adaptation is difficult to carry out,

0:17:11 | because it induces a complex density,

0:17:14 | by passing the within-class variability parameters through a nonlinear function.

0:17:19 | Other options are giving up length normalization and

0:17:22 | adopting approaches which preserve the i-vectors as they are:

0:17:27 | attempting to find adequate priors, such as heavy-tailed PLDA, or

0:17:31 | discriminative classifiers, pairwise discriminative training.

0:17:36 | Otherwise, we are obliged to ignore nonlinearities which maybe

0:17:41 | contain unexpected

0:17:43 | variabilities,

0:17:46 | maybe related to some acoustic

0:17:49 | parameters.

0:17:51 | Just a remark:

0:17:54 | the W conditioning

0:17:59 | transforms the within-class variability into the identity matrix,

0:18:03 | and an identity matrix has no

0:18:06 | privileged principal components;

0:18:08 | maybe this alleviates the constraint of

0:18:13 | homoscedasticity.
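
The remark above can be verified in a few lines: standardizing by the within-class covariance W maps W to the identity. The matrix W below is a synthetic stand-in.

```python
import numpy as np

# Hedged sketch of the W-conditioning remark: after standardization by
# W^(-1/2), the within-class covariance becomes the identity matrix,
# which has no privileged principal components.
rng = np.random.default_rng(8)
p = 8
A = rng.standard_normal((p, p))
W = A @ A.T + np.eye(p)                       # toy within-class covariance (SPD)

vals, vecs = np.linalg.eigh(W)
W_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T

W_conditioned = W_inv_sqrt @ W @ W_inv_sqrt   # covariance after the transform
print(np.allclose(W_conditioned, np.eye(p)))  # True
```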

0:18:16 | Thank you.


0:18:33 | [Question] A question on something I see in the experiments: you replaced the probabilistic approach of estimating the

0:18:42 | parameters with the one on the screen,

0:18:47 | the deterministic one.

0:18:49 | I think that,

0:18:50 | in the limit, if your training set has many speakers,

0:18:53 | these two should do exactly the same thing; the only difference is that you're putting a

0:18:58 | prior in the one case.

0:19:00 | [Answer] Okay. So we presented results with this number of speakers, an average number

0:19:06 | of speakers. [Question] And I guess that this matters when you go to a small number of

0:19:10 | speakers when you train the model.

0:19:12 | [Answer] Yes. And a difference is that the deterministic approach is not intended to compete

0:19:19 | with the ML method, which remains the best way;

0:19:22 | I was just surprised by the slight gap of performance,

0:19:28 | and so I

0:19:31 | assume that maybe it is because the EM-ML cannot

0:19:35 | be optimal, because there is a problem of sphericity of the data,

0:19:40 | but the deterministic approach is not affected.

0:19:43 | And that is exactly this topic, where we try to show that the norm

0:19:49 | of the speaker factors

0:19:52 | does not follow the expected distribution.

0:19:56 | [Question] Yes. I guess that, because you have to treat them as random variables, as they're

0:20:01 | not simply points,

0:20:03 | under the PLDA scheme they have a posterior distribution.

0:20:07 | Okay, so a better way

0:20:09 | to consider whether they follow the distribution

0:20:13 | would be

0:20:14 | probably to add the trace of

0:20:16 | the posterior covariance matrix: it should also be added when you compute

0:20:20 | the norm, in order to see the overall distribution, rather than

0:20:25 | just dot products.

0:20:30 | [Answer] A remark on that:

0:20:31 | we did the same with the evaluation data, the test i-vectors

0:20:37 | compared with the development corpus used as training vectors.

0:20:42 | The same effect occurs, but the difference is that, before length normalization, the R-squared score is

0:20:47 | not close to zero:

0:20:49 | the R-squared score of the test

0:20:52 | i-vectors before length normalization, as provided by the extractor,

0:20:58 | is equal to zero point three,

0:21:02 | whereas the i-vectors of the test were not used for training the PLDA factor analysis. So

0:21:08 | there is a shift not only for the mean,

0:21:11 | but also for

0:21:13 | this problem of homoscedasticity.

0:21:19 | [Question] Just one quick one: I just missed your point when you said,

0:21:23 | I think you were saying that

0:21:27 | trying to make the data spherically distributed, you thought, was inconsistent with being Gaussian.

0:21:32 | Why's that?

0:21:36 | [Answer] It is empirical. [Question] But Gaussians in a high-dimensional space are on a sphere.

0:21:41 | [Answer] Yes,

0:21:45 | but

0:21:47 | we constrain the speaker factors to fall

0:21:52 | on a sphere,

0:21:53 | and the problem is

0:21:55 | to assume that the within-class variability will not be affected, because it

0:22:03 | will be affected

0:22:05 | by the position.

0:22:07 | [Question] Right, but the posterior,

0:22:09 | the prior distribution of the i-vectors, is

0:22:11 | zero mean, identity covariance, and in a high-dimensional space that will be approximately spherical.

0:22:11 | zero mean unit identity rate both in high dimensional space that will be approximate |

0:22:19 | so that that's i mean that's care what happened i not as mathematically what a |

0:22:23 | high dimensional space so why's it in its just |

0:22:28 | here we actually a lot of what's a spherical distribution for phase as well |

0:22:38 | and applying model with the quality of correlators is a difficult the surface |

0:22:44 | maybe a see that length normalization is a whole technique projects on the sphere |

0:22:50 | instead of adjusting the tanks taking the information i think |

0:22:59 | Good, but we should

0:23:02 | perhaps close

0:23:05 | the discussion there.