0:00:16 | okay, let's start

0:00:18 | i'm going to present our work on i-vector transformation and scaling for PLDA based

0:00:23 | recognition

0:00:24 | and the goal of this work

0:00:26 | is

0:00:27 | to present a way to transform our i-vectors so that they better fit the PLDA

0:00:33 | assumptions

0:00:34 | and at the same time introduce a way

0:00:37 | to perform some sort of dataset mismatch compensation, similar to what length normalization is

0:00:43 | performing on the PLDA side

0:00:46 | so |

0:00:47 | as we all know, the PLDA model assumes that the latent variables are Gaussian, which

0:00:54 | means that the resulting i-vectors, if we assume they are independently sampled, would

0:01:00 | follow a Gaussian distribution

0:01:02 | now we all know this is not really the case |

0:01:06 | indeed |

0:01:07 | we have two main problems with this model

0:01:10 | our

0:01:11 | i-vectors do not really look like they should if they were samples from a Gaussian

0:01:16 | distribution

0:01:17 | for example here on the right |

0:01:19 | i'm plotting one dimension of the i-vectors, the dimension with the highest skewness

0:01:26 | i plot its histogram, and it's quite clear that

0:01:29 | the histogram doesn't really resemble anything like a Gaussian distribution; it's even almost multimodal

0:01:37 | then the other problem is that we have

0:01:39 | a quite evident mismatch between development and evaluation

0:01:43 | i-vectors

0:01:45 | for example if we look at the left |

0:01:49 | there is a plot of the histogram of the squared i-vector norms for both

0:01:53 | our development set, which is a female telephone set

0:01:57 | and the evaluation set, which is the condition five female telephone subset of SRE10

0:02:01 | and we can see two things, first of all

0:02:05 | the distributions for the evaluation and development sets are

0:02:10 | quite different from each other

0:02:12 | and none of them resembles what we should expect

0:02:16 | if these i-vectors were actually sampled from a standard normal distribution

0:02:21 | now |

0:02:22 | up to now we have had

0:02:24 | mainly two ways to approach

0:02:27 | the issues i've presented

0:02:29 | the first one was heavy-tailed PLDA, presented yesterday by patrick kenny, which mainly deals with the

0:02:34 | non-Gaussian assumption

0:02:36 | what it does with the Gaussian assumption is that it removes the Gaussian priors

0:02:40 | and assumes that i-vector distributions are heavy-tailed

0:02:44 | and the second one is length normalization

0:02:47 | which in our opinion is not really making things more Gaussian, but is mainly

0:02:53 | dealing with the dataset mismatch that we have between evaluation and development i-vectors

0:03:00 | indeed, here i'm showing the same plot i was showing for the most skewed

0:03:04 | dimension of the i-vectors, before and after length normalization, and we can see that even if we apply

0:03:09 | length normalization, it cannot compensate for things like the

0:03:12 | multimodal distribution of the original i-vectors

0:03:15 | it might actually compensate for heavy-tailed behaviour, that's for sure, but still we

0:03:19 | don't get things which are really

0:03:21 | Gaussian-like

0:03:24 | now in this work we want to address

0:03:27 | jointly the problem of transforming i-vectors so that they better fit the

0:03:33 | PLDA assumptions, so we try to Gaussianize somehow our i-vectors

0:03:37 | and at the same time we propose a

0:03:40 | way to perform dataset compensation, similar to length normalization, the difference being that

0:03:46 | this dataset compensation is tailored

0:03:49 | to our transformation

0:03:52 | and we estimate both at the same time

0:03:55 | okay so

0:03:57 | how do we perform this

0:04:00 | let's first focus on how we

0:04:03 | can transform i-vectors so that they better fit the Gaussian assumption

0:04:07 | to do that, we assume that i-vectors are sampled from a random variable phi

0:04:13 | whose

0:04:14 | pdf we don't know; however, we assume that we can express this random variable phi as

0:04:19 | a function f

0:04:20 | of a standard normal random variable

0:04:23 | now if we do like this, then we can express the pdf of this random

0:04:28 | variable phi as

0:04:30 | the standard normal pdf

0:04:32 | evaluated at

0:04:34 | the samples transformed through the inverse of f

0:04:40 | times, in log terms, the log-determinant of the Jacobian of the inverse transformation

0:04:45 | now the good thing is that we can

0:04:47 | do two things with this model: first of all, we can estimate the function f

0:04:52 | so as to maximize the likelihood of our i-vectors

0:04:56 | and in that way we would obtain something which

0:05:00 | is also the pdf of the i-vectors, which is not anymore standard Gaussian but depends

0:05:06 | on the transformation

0:05:08 | and the other thing is that we can also employ this function to transform

0:05:12 | i-vectors, so that the samples which follow the distribution of phi

0:05:17 | become transformed into samples which follow a

0:05:21 | standard normal distribution
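As an illustration of this change-of-variables machinery, here is a small sketch; the map f below is a hypothetical invertible 1-D function chosen for the example, not the transformation estimated in this work, and its inverse is computed numerically:

```python
import numpy as np

# Hypothetical invertible map f (for illustration only):
# f(z) = z + 0.5 z^3 is strictly increasing, hence invertible on R.
def f(z):
    return z + 0.5 * z ** 3

def f_inv(x, iters=60):
    # Invert the monotone f by bisection.
    lo = np.full_like(x, -100.0)
    hi = np.full_like(x, 100.0)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo = np.where(f(mid) < x, mid, lo)
        hi = np.where(f(mid) >= x, mid, hi)
    return 0.5 * (lo + hi)

def log_pdf_phi(x):
    # Change of variables:
    # log p(x) = log N(f^{-1}(x); 0, 1) + log |d f^{-1}(x) / dx|
    z = f_inv(x)
    log_normal = -0.5 * (z ** 2 + np.log(2 * np.pi))
    log_det = -np.log(1.0 + 1.5 * z ** 2)   # since d f(z)/dz = 1 + 1.5 z^2
    return log_normal + log_det

# Samples of phi = f(Z) with Z ~ N(0, 1), mapped back through f^{-1},
# follow a standard normal distribution again.
rng = np.random.default_rng(0)
x = f(rng.standard_normal(100000))
z = f_inv(x)
print(z.mean(), z.std())   # close to 0 and 1
```

Estimating f by maximum likelihood, as described next, amounts to maximizing the average of `log_pdf_phi` over the training i-vectors with respect to the parameters of f.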

0:05:25 | now

0:05:26 | to

0:05:27 | model this unknown function, we decided to follow a

0:05:33 | framework which is quite similar to the neural network framework

0:05:37 | that is, we assume that we can express this transformation function as a composition of

0:05:42 | several simple functions

0:05:46 | which can be interpreted as layers of a neural network

0:05:50 | now |

0:05:51 | the only constraint that we have with respect to a standard neural network here is

0:05:55 | that we want to work with functions which are invertible, so all our layers have the

0:06:00 | same size, and the transformation they

0:06:02 | produce needs to be invertible
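A minimal sketch of such a stack of same-size invertible layers and its inverse, assuming a leaky-ReLU-style non-linearity purely as a stand-in for the one actually used in this work:

```python
import numpy as np

class Affine:
    """Square, invertible linear layer: y = W x + b."""
    def __init__(self, W, b):
        self.W, self.b = W, b
    def forward(self, x):
        return x @ self.W.T + self.b
    def inverse(self, y):
        # Solve W x^T = (y - b)^T for x.
        return np.linalg.solve(self.W, (y - self.b).T).T

class LeakyReLU:
    """Element-wise monotone non-linearity (a stand-in for the talk's
    non-linear layer): y = x for x >= 0, y = a x otherwise."""
    def __init__(self, a=0.25):
        self.a = a
    def forward(self, x):
        return np.where(x >= 0, x, self.a * x)
    def inverse(self, y):
        return np.where(y >= 0, y, y / self.a)

def forward(layers, x):
    for layer in layers:
        x = layer.forward(x)
    return x

def inverse(layers, y):
    for layer in reversed(layers):
        y = layer.inverse(y)
    return y

rng = np.random.default_rng(1)
d = 4
layers = [Affine(rng.standard_normal((d, d)) + 3 * np.eye(d),
                 rng.standard_normal(d)),
          LeakyReLU(),
          Affine(rng.standard_normal((d, d)) + 3 * np.eye(d),
                 rng.standard_normal(d))]
x = rng.standard_normal((5, d))
y = forward(layers, x)
print(np.allclose(inverse(layers, y), x))   # round trip recovers x
```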

0:06:05 | as we said, we perform maximum likelihood estimation of the parameters of the transformation

0:06:10 | and then, instead of using the pdf directly, we use the transformation function to map

0:06:15 | back

0:06:16 | our i-vectors to

0:06:18 | let's say, Gaussian-distributed i-vectors

0:06:21 | here i have a small example on one-dimensional data; this is again

0:06:28 | the most skewed component of our training i-vectors

0:06:36 | on the top left is the original histogram, and on the right i report the transformation

0:06:41 | that we estimated

0:06:43 | so as you can see from the top left

0:06:45 | if we directly use the transformation

0:06:48 | to evaluate the log pdf of the

0:06:51 | original

0:06:53 | i-vectors, we actually obtain a pdf which very closely matches the histogram of our

0:06:58 | i-vectors

0:07:00 | then if we apply the inverse transformation to these data points, we obtain what we

0:07:05 | see in the bottom view here

0:07:08 | and what

0:07:09 | does that show? it shows that we managed to obtain a histogram of i-vectors which

0:07:13 | very closely matches the Gaussian

0:07:16 | pdf, which is plotted as well; i don't know if it's visible, but there is the pdf

0:07:20 | of the standard normal distribution, which is pretty much on top of the histogram of

0:07:25 | the transformed vectors

0:07:29 | now

0:07:30 | in this work

0:07:32 | we decided to use a simple selection for our layers; in particular we have

0:07:37 | one kind of layer which does just an affine transformation, that is, we can interpret

0:07:42 | it just as the weights

0:07:44 | of a neural network

0:07:45 | and then another kind

0:07:48 | of layer

0:07:49 | which performs the non-linearity

0:07:51 | now the reason we chose this particular kind of non-linearity is that it has

0:07:56 | nice properties; for example, with a single layer we can already

0:08:00 | represent pdfs

0:08:02 | of random variables which are at the same time heavy-tailed and

0:08:07 | skewed, with just a single layer, and

0:08:09 | if we add more layers we increase the

0:08:12 | modelling capabilities of the approach, although this creates some problems of overfitting, as i will

0:08:16 | say

0:08:18 | later

0:08:20 | now, on the other side, we use a maximum likelihood criterion to estimate the transformation, and

0:08:25 | the nice thing

0:08:27 | is that we can use a general optimizer to which we provide

0:08:31 | the objective function and the gradients, and these gradients

0:08:34 | can be computed with

0:08:36 | an algorithm which resembles quite closely that of back-propagation with mean squared error in

0:08:42 | a neural network

0:08:44 | the main difference is that we need to take into account also the contribution of the

0:08:48 | log-determinant, which

0:08:50 | increases the complexity of the training, but the training time is pretty much the same

0:08:54 | as what we would have with a standard neural network
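As a toy illustration of the training criterion (a deliberately tiny sketch, not the paper's implementation): on 1-D data, with a single affine layer g(x) = w x + b playing the role of the inverse transformation, the objective is the standard normal log-likelihood of g(x) plus the log-determinant term, here simply log|w|, and plain gradient ascent recovers the whitening solution:

```python
import numpy as np

rng = np.random.default_rng(2)
x = 2.0 + 0.5 * rng.standard_normal(10000)   # 1-D stand-in "i-vectors"

# Mean log-likelihood: -0.5 g(x)^2 - 0.5 log(2 pi) + log|w|,
# a Gaussian term plus the log-determinant of the (1-D) Jacobian.
w, b = 1.0, 0.0
lr = 0.05
for _ in range(5000):
    g = w * x + b
    dw = np.mean(-g * x) + 1.0 / w   # the log-det term contributes 1/w
    db = np.mean(-g)
    w += lr * dw
    b += lr * db

# The ML solution whitens the data: w -> 1/std(x), b -> -mean(x)/std(x),
# so the transformed data are approximately standard normal.
z = w * x + b
print(w, b, z.mean(), z.std())
```

With a deeper stack the gradients of the log-determinant flow through every layer, which is the back-propagation-like algorithm mentioned above.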

0:08:58 | now, this is a first set of experiments; here we still didn't apply length

0:09:03 | normalization or any other kind of

0:09:06 | compensation approach; what i'm showing here is what happens when we estimate

0:09:11 | this transformation on our

0:09:12 | training data and we apply it to transform evaluation i-vectors

0:09:17 | as you can see on the left, these are the same histograms of the

0:09:21 | squared norms i was presenting before, and on the right are the squared norms of the

0:09:25 | transformed i-vectors

0:09:27 | note that

0:09:28 | here i'm using a transformation with just one non-linear layer

0:09:33 | now of course, as we can see, the squared norm is still not exactly what

0:09:37 | we would expect from

0:09:39 | standard normally distributed samples, but it

0:09:43 | matches more closely our expectation, and more importantly we also somehow

0:09:49 | reduce the mismatch between evaluation and development squared norms, which means that our i-vectors are

0:09:55 | more similar

0:09:57 | and this gets reflected in the results: on the first and second lines you

0:10:01 | see the PLDA and

0:10:03 | the same PLDA but trained with the transformed i-vectors

0:10:07 | and since here we are not

0:10:08 | using any kind of length normalization, we can see that our model allows us to achieve

0:10:13 | much better performance compared to standard PLDA

0:10:16 | on the last line, however

0:10:18 | we can still see that length normalization is compensating for the dataset mismatch

0:10:23 | better, which allows the PLDA with length-normalized i-vectors to perform better than our model

0:10:29 | right

0:10:31 | so

0:10:31 | the next part is how can we

0:10:35 | incorporate this kind of preprocessing in our model; of course we could try to length-normalize the

0:10:39 | transformed i-vectors, but we can do better by

0:10:42 | casting this

0:10:44 | kind of compensation directly into our model

0:10:47 | to this extent

0:10:49 | we first need to give a different interpretation to length normalization; in particular we

0:10:54 | need to see

0:10:55 | it

0:10:57 | as the maximum likelihood solution of a quite simple model

0:11:01 | where our i-vectors are not i.i.d. anymore, in the sense that

0:11:05 | we assume that each i-vector is sampled from a different random variable whose distribution

0:11:10 | is normal

0:11:12 | and all these random variables share a common sigma, which is the

0:11:17 | covariance matrix of the model

0:11:18 | but this covariance matrix is scaled for each i-vector by a scalar

0:11:23 | alpha

0:11:24 | this is quite similar to a heavy-tailed distribution, but instead of putting priors

0:11:29 | over these scaling terms

0:11:30 | we just optimize them by their maximum likelihood solution

0:11:34 | now if we perform a two-step optimization, where we first estimate sigma assuming that

0:11:39 | the alpha terms are one

0:11:41 | and then we fix that sigma and estimate the optimal alpha terms, we would

0:11:46 | end up with something which is

0:11:49 | very similar to length norm; indeed the optimal alpha

0:11:53 | is the norm of the whitened i-vector divided by the

0:11:57 | square root of the dimensionality of the i-vectors
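In other words (in my notation): for a whitened M-dimensional i-vector w, the per-vector ML scale comes out as alpha = ||w|| / sqrt(M), so dividing by alpha is length normalization up to the constant sqrt(M). A quick sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
M = 400
# Whitened development i-vectors with a mismatched (too large) scale:
W = 1.7 * rng.standard_normal((1000, M))

# Per-i-vector ML scale under the shared-covariance, per-vector-scaled
# model described above: alpha_i = ||w_i|| / sqrt(M).
alpha = np.linalg.norm(W, axis=1) / np.sqrt(M)
W_norm = W / alpha[:, None]

# After scaling, every vector has squared norm exactly M, the expected
# squared norm of a standard normal sample, i.e. this is length
# normalization up to the sqrt(M) constant.
print(np.allclose(np.sum(W_norm ** 2, axis=1), M))   # True
```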

0:12:01 | now why is this interesting? because this

0:12:03 | random variable can be represented as a transformation of a standard normal random variable, where the transformation

0:12:10 | has a parameter which is i-vector dependent

0:12:13 | now if we estimate this

0:12:15 | parameter using an iterative strategy, where we first estimate

0:12:20 | sigma and the alphas, and then we

0:12:23 | apply the inverse transformation, we would recover exactly what we're doing right

0:12:27 | now with length normalization

0:12:30 | so this tells us

0:12:32 | how to implement a similar strategy in our model

0:12:37 | we introduce what we call a scaling layer, which has a

0:12:41 | single parameter, and this parameter is i-vector dependent: for each i-vector we want to estimate

0:12:46 | its maximum likelihood solution

0:12:48 | now our transformation is the cascade of this

0:12:52 | scaling layer and what we were proposing before, so

0:12:56 | the

0:12:57 | composition of affine and non-linear layers

0:13:01 | there is one comment here

0:13:03 | in order to

0:13:04 | efficiently train this thing, we

0:13:06 | still have to resort to a sort of adaptive training; that is, we first fix the alphas and estimate

0:13:12 | the shared parameters, then we fix the shared parameters and optimize the

0:13:15 | alphas

0:13:16 | and one more thing that we need to take into account is that at test

0:13:20 | time

0:13:21 | while with the original model we don't need to do anything other than transform the

0:13:24 | i-vectors, with this model at this point we also need to estimate, for each i-vector,

0:13:29 | the optimal scaling factor
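A hedged sketch of this alternating estimation on synthetic data, using a diagonal shared covariance for simplicity (the actual model estimates the scaling jointly with the layered transformation):

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 2000, 200
d_true = rng.uniform(0.5, 2.0, size=M)    # shared (here diagonal) covariance
a_true = rng.uniform(0.5, 2.0, size=N)    # per-i-vector scale factors
X = a_true[:, None] * np.sqrt(d_true) * rng.standard_normal((N, M))

# Alternating ML estimation, as described in the talk: with the alphas
# fixed (initially 1) estimate the shared parameters, then with the
# shared parameters fixed re-estimate each alpha.
alpha = np.ones(N)
for _ in range(3):                         # the "three iterations" variant
    d = np.mean((X / alpha[:, None]) ** 2, axis=0)     # shared step
    alpha = np.sqrt(np.mean(X ** 2 / d, axis=1))       # per-vector step

# The scales are identifiable only up to a global factor traded off
# against d, so compare after fixing the geometric mean to one.
a_hat = alpha / np.exp(np.mean(np.log(alpha)))
a_ref = a_true / np.exp(np.mean(np.log(a_true)))
print(np.corrcoef(a_hat, a_ref)[0, 1])    # close to 1
```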

0:13:32 | however this

0:13:34 | gives us a great improvement, as you can see: the first line is the

0:13:38 | same i was presenting before

0:13:41 | and then the last three lines are PLDA with length normalization

0:13:45 | then our transformation with the alpha scaling, with one iteration

0:13:49 | of

0:13:51 | alpha estimation, and with three iterations of alpha estimation

0:13:55 | and as you can see |

0:13:57 | the model with three iterations clearly outperforms PLDA with length norm in all conditions

0:14:03 | on the SRE10 female dataset

0:14:08 | now

0:14:10 | i guess we can get to the conclusions: we

0:14:14 | investigated here an approach to estimate a transformation which allows us to modify our i-vectors

0:14:20 | so that they better fit the PLDA assumptions

0:14:22 | so if we apply this transformation we obtain i-vectors which are more Gaussian-like, and

0:14:28 | we incorporated into the model a

0:14:30 | proper way to perform mismatch compensation, which is similar in spirit to length

0:14:35 | norm

0:14:36 | but is

0:14:37 | tailored to the particular layers that we are using in the transformation

0:14:41 | this transformation is estimated using a maximum likelihood criterion, and the transformation function itself

0:14:47 | is implemented using a framework which is very similar to that

0:14:51 | of neural networks

0:14:53 | with, as we said, some constraints, because we want our layers to be invertible, and in this case

0:14:57 | such that we can compute

0:14:59 | and guarantee that the log-determinant of our Jacobians exists as well

0:15:06 | now this approach allows us to

0:15:09 | improve the results, as i was showing, on the SRE10

0:15:13 | data; we also have experiments in the paper that

0:15:17 | i don't report here, where we show that it also works on NIST two

0:15:21 | thousand twelve data

0:15:23 | there is one caveat: as i said before, here we are using a single-layer

0:15:27 | transformation; the reason is that this kind of model tends to

0:15:31 | overfit quite easily

0:15:33 | so our first experiments with more than one non-linear layer

0:15:38 | were not very satisfactory, as they were decreasing the performance

0:15:43 | now we are managing to get interesting results by changing

0:15:47 | things in two ways: the first one is changing the kind of non-linearity

0:15:51 | by adding

0:15:52 | some constraints inside the function itself which limit this

0:15:57 | overfitting behaviour

0:15:59 | and on the other hand we also devised some structures where we impose constraints on

0:16:03 | the parameters of the transformation, which again

0:16:06 | reduce the overfitting behaviour, and this allows us to train networks which have more layers

0:16:11 | although up to now we obtained mixed results, in the sense that we managed

0:16:15 | to

0:16:16 | train transformations which behave much better

0:16:19 | if we don't

0:16:20 | use the scaling term, but after we insert the scaling term into

0:16:24 | the

0:16:26 | whole

0:16:27 | framework, in the end we more or less converge to what was shown

0:16:30 | here; so we are still working to try to understand why we have this strange behaviour

0:16:36 | and whether we can

0:16:37 | improve the performance of the transformation itself, since we cannot improve

0:16:42 | anymore when we add the scaling term

0:16:46 | so, that's all

0:16:52 | now, if there are some questions, we have time

0:17:05 | how does this compare to just straight gaussianization?

0:17:10 | okay, the

0:17:11 | thing is, how would we implement gaussianization with one-hundred-fifty-dimensional vectors? i mean

0:17:17 | you would gaussianize each dimension on its own

0:17:20 | well, if you gaussianize each dimension on its own; we tried

0:17:24 | something like that with this model, where if we constrain the transformation, well, the function itself

0:17:29 | can

0:17:30 | indeed

0:17:31 | produce that kind of behaviour; and by the way, when working with one-dimensional synthetic

0:17:36 | data this scheme works with many kinds of distributions, but applied per dimension the results were already much

0:17:42 | worse

0:17:43 | so my guess is that it would not be sufficient to independently

0:17:47 | gaussianize each dimension on its own

0:17:50 | but excuse me, i'm sorry, you mean you tried it and it didn't work?

0:17:54 | no, i didn't try exactly that; i tried the same model i'm presenting here with a

0:17:59 | transformation which is applied independently to each component, and in my experience, when i'm working on

0:18:06 | single, one-dimensional data points

0:18:09 | it gaussianizes very well

0:18:11 | it does not present overfitting problems; but then if i model multidimensional data

0:18:16 | with several kinds of distributions it fails; the only difference is that gaussianization

0:18:20 | computes exactly the inverse function, it's not an approximation to it

0:18:24 | no, but maximum likelihood gives a close approximation to it, and what i

0:18:28 | did here doesn't work; so my guess is that replacing the approximation with

0:18:31 | exact per-dimension gaussianization would still not work

0:18:39 | i have a question about the non-linearity

0:18:43 | this approach does not use a common activation function for DNNs

0:18:47 | so what is the justification for the one you have chosen?

0:18:55 | first of all, the original transformation i was using is the last

0:19:00 | one shown, which, it can be shown, we can split into several layers; and

0:19:05 | it has different nice properties: first of all it can represent the identity transformation

0:19:10 | so if our data are already Gaussian

0:19:13 | they are kept like that

0:19:15 | then it has some nice properties which can be shown; there are some references in

0:19:19 | our paper where you can find that

0:19:22 | this kind of

0:19:24 | single-layer scheme can already represent a whole set of distributions which are both

0:19:29 | skewed and heavy-tailed

0:19:32 | so the reason we chose this

0:19:34 | kind of layer is essentially because it was already shown that a single layer

0:19:39 | can model quite a broad family of distributions

0:19:44 | well, that's all

0:19:49 | i have two strange questions

0:19:52 | first: is it possible to look at the estimated parameters and try to understand what are

0:19:58 | the characteristics

0:20:00 | of your training set

0:20:02 | in terms of, let's say, the mismatch

0:20:05 | session effects or channel effects?

0:20:08 | what do you mean exactly? i mean

0:20:11 | look at your transformation and try to understand, say, whether you lose some information

0:20:18 | when the

0:20:19 | mismatch between, or inside, your training set is due to the presence

0:20:25 | of

0:20:26 | say, telephone data

0:20:27 | okay, maybe this could be applied separately on different sets

0:20:33 | if you have some way to

0:20:36 | model, to see what is the difference in your distributions before and after the transformation

0:20:41 | you could apply the same technique, so you might

0:20:44 | as well

0:20:46 | transform independently two different sets and see if this removes the differences or not

0:20:52 | what i can say here is that

0:20:54 | pretty much

0:20:56 | it looks like, at least if we consider that evaluation and development are two

0:21:00 | different sets with different distributions, it is somehow able to

0:21:04 | partly compensate for that

0:21:06 | now the transformation itself is partly responsible for this, because it

0:21:11 | handles, say, heavy-tailed behaviour: it allows us to stretch the norms which are far

0:21:18 | from what we would expect

0:21:20 | so it can also move some of the mass of these distributions

0:21:24 | but on the other hand

0:21:25 | the thing which really does this processing is the scaling; anyway, that scaling

0:21:30 | is very similar to length norm, but it's not a transformation that i'm applying

0:21:34 | blindly

0:21:36 | rather, i'm learning a transformation of my i-vectors, and i'm estimating at the same time the

0:21:41 | transformation and the scaling

0:21:44 | okay, and that is the part which in my opinion is really responsible for compensating

0:21:49 | the mismatch in the datasets that we are using

0:21:51 | then another thing that i can note is

0:21:54 | that what would be much

0:21:57 | better than

0:21:58 | what we're using is a model with really the speaker factors and the channel factors, PLDA-like

0:22:03 | for example

0:22:05 | the problem is that

0:22:06 | already like this it takes

0:22:08 | several hours, if not days, to train the transformation function; at test time it's

0:22:14 | very fast, but training is quite slow, and if we moved to

0:22:18 | something PLDA-style, or if we modeled the terms differently, the training time

0:22:23 | would really explode, and so would the computational time at test time

0:22:27 | because we would need to consider

0:22:29 | the cases where the i-vectors are from the same speaker or not, and in that

0:22:33 | case the cost would grow

0:22:35 | you would have

0:22:36 | something

0:22:38 | similar to what we have with uncertainty propagation, where you have to redo

0:22:43 | this kind of computation for everything, but much worse

0:22:48 | okay, it's just

0:22:49 | that, since the training is expensive, i would want to try to

0:22:55 | exploit the parameters as much as possible; and my second question, which is related to the first

0:23:00 | one

0:23:02 | is it possible somehow to use this approach to

0:23:07 | determine if one thing, let's say one i-vector

0:23:11 | is in-domain or out-of-domain?

0:23:15 | so you use it to detect, say, okay

0:23:20 | my operating condition is...

0:23:21 | probably not, really; i mean, length normalization is not affected too much by this

0:23:25 | but this model is

0:23:27 | and the problem with this thing is that if i have a really huge mismatch

0:23:31 | then it gets amplified by the transformation itself

0:23:35 | because the data point i'm transforming will not be where it should be, so the way it

0:23:40 | passes through the non-linear function

0:23:42 | is probably going to increase my mismatch instead of reducing it

0:23:46 | so up to some point this should still work better than standard

0:23:50 | length norm, but after some point it might actually behave worse

0:23:57 | with mismatched datasets

0:23:59 | thanks, that answered it

0:24:03 | okay, let's thank the speaker again