0:00:16 Okay, let's start. I'm going to present our work on i-vector transformation and scaling for PLDA-based speaker recognition. The goal of this work is to present a way to transform our i-vectors so that they better fit the PLDA assumptions, and at the same time to introduce a way to perform some sort of dataset mismatch compensation, similar to what length normalization does, enforced on the PLDA.
0:00:47 As we all know, the PLDA model assumes that the latent variables are Gaussian, and that the resulting i-vectors, if we assume they are independently sampled, follow a Gaussian distribution.
0:01:02 Now, we all know this is not really the case. We have two main problems with this model. The first: i-vectors do not really look like they should if they were samples from a Gaussian. For example, here on the right I am plotting one dimension of the i-vectors, the dimension with the highest skewness. I plot its histogram, and it is quite clear that the histogram doesn't resemble anything like a Gaussian distribution; it even looks almost multimodal.
0:01:37 The other problem is that there is a quite evident mismatch between development and evaluation data. For example, if we look at the left, there is a plot of the histogram of the squared i-vector norms for both our development set, which is the SRE'10 female set, and the evaluation set, which is the condition 5 female subset of SRE'10. We can see two things. First of all, the distributions of the evaluation and development sets are quite different from each other, and neither of them resembles what we should expect if these i-vectors had been sampled from a standard normal distribution.
0:02:22 Up to now, we have had mainly two ways to approach the issues I have presented. The first one is the heavy-tailed PLDA, presented yesterday by Patrick Kenny, which mainly tries to deal with the non-Gaussian behaviour: what it does is remove the Gaussian assumptions and assume instead that i-vector distributions are heavy-tailed. The second one is length normalization, which in our opinion is not really making things more Gaussian; it is mainly dealing with the dataset mismatch that we have between evaluation and development i-vectors. Indeed, here I am showing the same plot I was showing before for the most skewed dimension of the i-vectors, before and after length normalization, and we can see that even if we apply length normalization, it cannot compensate for things like the multimodal distribution of the original i-vectors. It might compensate for heavy-tailed behaviour, that's for sure, but still we don't get things which are really Gaussian-like.
0:03:24 Now, in this work we want to address both problems. On one side, we transform the i-vectors so that they better fit the PLDA assumptions, so we try to Gaussianize, somehow, our i-vectors. At the same time, we propose a way to perform dataset compensation similar to length normalization, the difference being that this dataset compensation is suited to our transformation, and we estimate both at the same time.
0:03:55 Okay, so how do we perform this? Let's first focus on how we transform i-vectors so that they better fit the Gaussian assumption. To do that, we assume that i-vectors are sampled from a random variable whose pdf we don't know; however, we assume that we can express this random variable as a function f of a standard normal random variable. Now, if we do this, then we can express the log-pdf of this random variable as the standard normal log-pdf of the samples mapped through the inverse of f, plus the log-determinant of the Jacobian of the transformation.
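To make the change-of-variables relation concrete, here is a minimal sketch (a toy example of my own, not the transformation estimated in the talk): take f(y) = sinh(y) applied to a standard normal y, so f⁻¹(x) = arcsinh(x), and the log-pdf of x is the standard normal log-density evaluated at arcsinh(x) plus the log-derivative of the inverse map:

```python
import numpy as np

def log_pdf_via_change_of_variables(x):
    """Log-density of x = sinh(y) with y ~ N(0, 1).

    log p(x) = log N(arcsinh(x); 0, 1) + log |d arcsinh(x) / dx|,
    where d arcsinh(x) / dx = 1 / sqrt(1 + x^2).
    """
    y = np.arcsinh(x)                              # inverse transformation f^{-1}
    log_normal = -0.5 * y**2 - 0.5 * np.log(2.0 * np.pi)
    log_det_jacobian = -0.5 * np.log1p(x**2)       # log of the (1-D) Jacobian
    return log_normal + log_det_jacobian
```

The same two-term structure, base log-density plus log-determinant, carries over unchanged to the multivariate layered transformations discussed next.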
0:04:45 Now, the good thing is that we can do two things with this model. First of all, we can estimate the function f so as to maximize the likelihood of our i-vectors, and in that way we obtain the pdf of the i-vectors, which is not a standard Gaussian anymore but depends on the transformation. The other thing is that we can also employ this function to transform i-vectors, so that samples which follow the distribution of the original random variable are transformed into samples which follow a standard normal distribution.
0:05:27 To model this unknown function, we decided to follow a framework which is quite similar to the neural network framework. That is, we assume that we can express the transformation function as a composition of several simple functions, which can be interpreted as layers of a neural network. The only constraint that we have, with respect to a standard neural network, is that we want to work with invertible functions: all our layers have the same input and output size, and the transformation they produce needs to be invertible.
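A minimal sketch of such an invertible layer stack (hypothetical class names and layer choices of my own, assuming numpy; not the authors' implementation): each layer maps vectors to vectors of the same size, exposes an exact inverse, and reports its log |det Jacobian| so the composition can accumulate the total:

```python
import numpy as np

rng = np.random.default_rng(0)

class AffineLayer:
    """Square, invertible linear map y = W x + b; log|det J| = log|det W|."""
    def __init__(self, dim):
        # Near-identity initialization keeps W comfortably invertible.
        self.W = np.eye(dim) + 0.1 * rng.standard_normal((dim, dim))
        self.b = rng.standard_normal(dim)
    def forward(self, x):
        return self.W @ x + self.b, np.linalg.slogdet(self.W)[1]
    def inverse(self, y):
        return np.linalg.solve(self.W, y - self.b)

class SinhLayer:
    """Elementwise invertible nonlinearity y = sinh(x); Jacobian is diagonal."""
    def forward(self, x):
        return np.sinh(x), np.sum(np.log(np.cosh(x)))  # d sinh/dx = cosh > 0
    def inverse(self, y):
        return np.arcsinh(y)

def forward_flow(layers, x):
    """Apply layers in order, accumulating the total log|det Jacobian|."""
    log_det = 0.0
    for layer in layers:
        x, ld = layer.forward(x)
        log_det += ld
    return x, log_det

def inverse_flow(layers, y):
    """Undo the composition by inverting the layers in reverse order."""
    for layer in reversed(layers):
        y = layer.inverse(y)
    return y
```

With the accumulated log-determinant, the log-likelihood of a data point is just the standard normal log-density of its inverse image plus that term, exactly as in the one-dimensional case.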
0:06:05 As we said, we perform maximum-likelihood estimation of the parameters of the transformation, and then, instead of using the pdf directly, we use the transformation function to map our i-vectors to, let's say, Gaussian-distributed i-vectors.
0:06:21 Here I have a small example on one-dimensional data. This is again the most skewed component of our training i-vectors. On the top left is the original histogram, and on the right I plot the transformation that we estimated. As you can see from the top left, if we directly use the transformation to evaluate the log-pdf of the original i-vectors, we actually obtain a pdf which very closely matches the histogram of our data. Then, if we apply the inverse transformation to these data points, we obtain what you see in the bottom view here. And what does that show? It shows that we managed to obtain a histogram of i-vectors which very closely matches the Gaussian pdf, which is also plotted: I don't know if it's visible, but there is the pdf of the standard normal, which is pretty much on top of the histogram of the transformed vectors.
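This inverse-direction step can be sketched on synthetic one-dimensional data (a toy exponential map of my own standing in for the estimated transformation): samples drawn through f are strongly skewed, but mapping them back through f⁻¹ recovers Gaussian-looking samples:

```python
import numpy as np

rng = np.random.default_rng(1)

# Draw skewed, non-Gaussian samples x = f(y) = exp(y), y ~ N(0, 1) (lognormal)
y = rng.standard_normal(100_000)
x = np.exp(y)

def skewness(v):
    """Sample skewness: third central moment over cubed standard deviation."""
    return np.mean((v - v.mean()) ** 3) / v.std() ** 3

# Mapping back through the inverse transformation f^{-1} = log Gaussianizes them
z = np.log(x)
```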
0:07:30 In this work, we decided to use a simple selection for our layers. In particular, we have one kind of layer which does just an affine transformation, that is, we can interpret it just as the weights of a neural network, and another kind of layer which performs the nonlinearity. Now, the reason we chose this particular kind of nonlinearity is that it has nice properties: for example, with a single layer we can already represent pdfs of random variables which are heavy-tailed and skewed, and if we add more layers we increase the modelling capabilities of the model, although this creates some problems of overfitting, as I will say later.
0:08:20 On the other side, we use a maximum-likelihood criterion to estimate the transformation, and the nice thing is that we can use a general-purpose optimizer, to which we provide the objective function and the gradients; these gradients can be computed with an algorithm which resembles quite closely that of back-propagation with mean squared error for a neural network. The main difference is that we also need to take into account the contribution of the log-determinant, which increases the complexity of the training, but the training time is pretty much the same as what we would have with a standard neural network.
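As a toy version of this maximum-likelihood training (a one-dimensional affine "flow" fitted by plain gradient ascent, standing in for the general-purpose optimizer mentioned here; my own sketch, not the authors' code): the objective is the mean log-likelihood under the change of variables, and the log-determinant term, here -log s, contributes the extra -1 in the gradient with respect to log s:

```python
import numpy as np

rng = np.random.default_rng(2)
data = 3.0 + 2.0 * rng.standard_normal(5_000)   # true location 3, true scale 2

# Model: x = m + s * y with y ~ N(0, 1); optimize log(s) so that s stays > 0.
m, log_s = 0.0, 0.0
lr = 0.1
for _ in range(1_000):
    s = np.exp(log_s)
    z = (data - m) / s                          # inverse transformation of each point
    # Mean log-likelihood is E[-z^2/2] - log s + const; its gradients are:
    grad_m = np.mean(z) / s
    grad_log_s = np.mean(z**2) - 1.0            # the -1 comes from the log-det term
    m += lr * grad_m
    log_s += lr * grad_log_s
```

At convergence this recovers the closed-form ML solution (sample mean and standard deviation), which is a useful sanity check for the gradient of the log-determinant term.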
0:08:58 This is a first set of experiments. Here we still didn't couple it with length normalization or any other kind of compensation approach; what I am showing here is what happens when we estimate this transformation on our training data and apply it to transform our i-vectors. As you can see, on the left are the same histograms of the squared norms I was presenting before, and on the right are the squared norms of the transformed i-vectors. Here I am using a transformation with just one nonlinear layer. Of course, as we can see, the squared norm is still not exactly what we would expect from standard-normally distributed samples, but it matches our expectation more closely and, more importantly, we also somehow reduce the mismatch between evaluation and development squared norms, which means that our i-vectors are more similar across the two sets.
0:09:57 This gets reflected in the results. On the first and second lines you have the PLDA, and the same PLDA but trained with the transformed i-vectors; in both cases we are not using any kind of length normalization. We can see that our model allows us to achieve much better performance compared to standard PLDA. On the last line, however, we can still see that length normalization compensates for the dataset mismatch better, which allows PLDA with length-normalized i-vectors to perform better than our model.
0:10:31 The next part is how we can incorporate this kind of preprocessing into our model. Of course, we could simply length-normalize the transformed i-vectors, but we can do better by adding this kind of compensation directly to our model. To this extent, we first need to give a different interpretation to length normalization. In particular, we can see length normalization as the maximum-likelihood solution of a quite simple model where our i-vectors are not i.i.d. anymore, in the sense that we assume that each i-vector is sampled from a different random variable whose distribution is normal; all these random variables share the same matrix Σ, which is the model covariance matrix, but this covariance matrix is scaled for each i-vector by a scalar α. This is quite similar to the heavy-tailed distribution, but instead of putting priors on these scaling terms, we just optimize them by maximum likelihood.
0:11:34 Now, if we perform a two-step optimization, where we first estimate Σ assuming that the α terms are equal to one, and then we fix Σ and estimate the optimal α terms, we end up with something which is very similar to length normalization: indeed, the optimal α is the norm of the whitened i-vector divided by the square root of the dimensionality of the i-vectors.
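In code, the two-step recipe described here reduces to a one-line scaling (my reading of the talk; the whitening itself is assumed already done): for a whitened i-vector w of dimension d, the per-vector scale is α = ‖w‖ / √d, and dividing by α is, up to the constant √d, ordinary length normalization:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 400
w = 1.7 * rng.standard_normal(d)        # a (whitened) i-vector with mismatched norm

alpha = np.linalg.norm(w) / np.sqrt(d)  # ML covariance scale for this i-vector
w_scaled = w / alpha                    # squared norm becomes exactly d

# Equivalent, up to the sqrt(d) constant, to ordinary length normalization:
w_lennorm = w / np.linalg.norm(w)
```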
0:12:01 Why is this interesting? Because this random variable can be represented as a transformation of a standard normal random variable, where the transformation has a parameter which is i-vector dependent. Now, if we estimate these parameters with an iterative strategy, where we first estimate Σ and then the α terms, and then we apply the inverse transformation, we recover exactly what we are doing right now with length normalization.
0:12:30 This tells us how to implement a similar strategy in our model. We introduce what we call a scaling layer, which has a single parameter; this parameter is i-vector dependent, and for each i-vector we estimate its maximum-likelihood solution. Now our transformation is the cascade of this scaling layer and what we were proposing before, the composition of affine and nonlinear layers.
0:13:01 There is one comment here: in order to train this, we still have to resort to a sort of two-step training, that is, we first estimate the shared parameters, then we fix the shared parameters and optimize the α terms. One more thing that we need to take into account is that, at test time, while with the original model we don't need to do anything other than transform the i-vectors, with this model we also need to estimate, for each test i-vector, its optimal scaling factor.
0:13:32 However, this gives a great improvement. As you can see, the first line is the same I was presenting before, and the last three lines are PLDA with length normalization, then PLDA on our transformation with the scaling layer, with one iteration of the α estimates and with three iterations of the α estimates. As you can see, the model with three iterations clearly outperforms PLDA with length normalization in all conditions on the SRE'10 female dataset.
0:14:10 So, I guess we can get to the conclusions. We investigated here an approach to estimate a transformation which allows us to modify our i-vectors so that they better fit the PLDA assumptions. If we apply this transformation, we obtain i-vectors which are more Gaussian-like, and we incorporated into the model a proper way to perform length compensation, which is similar in spirit to length normalization but is suited to the particular layers that we are using in the transformation. The transformation is estimated using a maximum-likelihood criterion, and the transformation function itself is implemented using a framework which is very similar to that of neural networks, with, as we said, some constraints, because we want our layers to be invertible, in this case so that we can compute, and can guarantee the existence of, the log-determinant of our Jacobians.
0:15:06 This approach allows us to improve the results, in the terms I presented, on the SRE'10 data. With other experiments in the paper, which I don't report here, we show that this also works on NIST SRE 2012 data. There is one caveat: as I said before, here we are using a single-layer transformation. The reason is that this kind of model tends to overfit quite easily, so our first experiments with more than one nonlinear layer were not very satisfactory, in the sense that they were decreasing the performance.
0:15:43 Now we are managing to get interesting results by changing two things. The first one is changing the kind of nonlinearity, adding some constraints inside the function itself which limit this overfitting behaviour. On the other hand, we are also finding some structures where we impose constraints on the parameters of the transformation, which again reduce the overfitting behaviour, and this allows us to train models which have more layers. Although, up to now, we obtained mixed results, in the sense that we managed to train transformations which behave much better if we don't use the scaling term, but after we add the scaling term into the framework, in the end we more or less converge to the results that were shown here. So we are still working to understand why we have this strange behaviour, where we can improve the performance of the transformation itself, but we cannot improve anymore when we add the scaling term.
0:16:46 So, that's all. Thank you.
0:16:52 If there are some questions, we have time for a few.
0:17:05 How does this compare to just straight Gaussianization?
0:17:10 Okay, the question is how we would implement Gaussianization with 150-dimensional vectors; you mean Gaussianizing each dimension on its own?
0:17:20 Well, if you Gaussianize each dimension on its own: we tried something like that with this model, where we force the transformation, the function itself, to produce that kind of behaviour, and by the way, when working with one-dimensional synthetic data this technique Gaussianizes many kinds of different distributions, and the results are already quite good.
0:17:43 So my guess is that it would not be sufficient to independently Gaussianize each dimension on its own.
0:17:50 But, sorry, have you tried it? It didn't work?
0:17:54 No, I didn't try exactly that. I tried the same model I am presenting here, with the transformation applied independently to each component, and my experience is that, when working on single-dimensional data points, it Gaussianizes very well and does not present overfitting, even if I model data with several kinds of distributions.
0:18:16 The only difference is that rank Gaussianization computes exactly the inverse function; it's not an approximation to it.
0:18:24 No, but mine is, let's say, a maximum-likelihood approximation to it; if the approximation that I get here doesn't work, then my guess is that the real thing with rank Gaussianization would still not work.
0:18:39 I have a question about the nonlinearity: this is not a common activation function for DNNs, so what is the justification for choosing it? Has it been shown to work particularly well?
0:18:55 First of all, the original transformation I was using is the one on the last slide, which, it can be shown, can be split into several layers. It has different properties. First of all, it can represent the identity transformation, so if our data are already Gaussian, they are kept like that. Then it has some nice properties which can be shown; there are some references in our paper where you can find that this kind of single-layer function can already represent a whole set of distributions which are both skewed and heavy-tailed. So the reason we chose this kind of layer is essentially that it was already shown that it can model quite a broad family of distributions.
0:19:44 Well, that's all.
0:19:49 Thanks, I have two strange questions. The first one: is it possible to look at the estimated parameters and try to understand what are the characteristics of your training set, in terms of, for example, the most important session or channel effects?
0:20:08 What do you mean exactly?
0:20:11 I mean, look at your transformation and try to understand whether it tells you something, for example about the mismatch inside your training set due to the presence of data coming from different sources.
0:20:27 Okay, maybe the same technique could be applied separately on different sets: if you have some way to model what the difference in your distribution is before and after the transformation, you could apply the same technique separately, transform two different sets independently, and see if this reflects their differences or not.
0:20:52 What I can say here is that, pretty much, it looks like, at least if we consider that evaluation and development are two different sets with different distributions, the model is somehow able to partly compensate for that. The transformation itself is partly responsible for this because, let's say, if the data have heavy-tailed behaviour, it allows us to shrink the samples which are far from what we would expect, so it can move them towards the middle of the distribution. On the other hand, the other thing which does this compensation is the scaling: that scaling is very similar to length normalization, but it is tuned to the transformation that I am applying. This is all done blindly, in the sense that I am learning a transformation of all my i-vectors, but I am estimating at the same time the transformation and the scaling. That is the part which, in my opinion, is really responsible for compensating the mismatch between development and evaluation.
0:21:51 Then, another thing that I cannot do at the moment, and which would be much better, would be using a real model with speaker factors and channel factors, PLDA-like, for example. The problem is that already like this it takes several hours, if not days, to train the transformation function. Test time is very fast, but training is quite slow, and if we moved to a PLDA-style model, where we tie factors across i-vectors, the training time would really explode, in terms of computational cost, because we would need to consider the cases where the i-vectors are from the same speaker or not, and in that case the cost would grow. You would have something similar to what we have with uncertainty propagation, where you have to redo this kind of computation every time, but much worse.
0:22:48 Okay, it's just that the training is expensive, but you would want to try to exploit the parameters as much as possible. My second question, which is related to the first: is it possible somehow to use this approach to determine whether one single i-vector is in-domain or out-of-domain? So you could use it to detect, say, okay, my operational data is...
0:23:21probably not really i mean length normalization that is not affect you start with this
0:23:25but this is not and i
0:23:27and the problem with this thing is that if i of a really huge mismatch
0:23:31then gets amplified by transformation itself
0:23:35because the data point and transforming arnold will be should be so the weight to
0:23:40as well like the non linear function
0:23:42is probably going to increase my mismatch instead of using it
0:23:46so i'll to some point the with respect to still work better than start up
0:23:50you after some point with this but it does not been worse
0:23:57mismatches datasets
0:23:59 Thanks.
0:24:03 Okay, let's thank the speaker again.