0:00:16 | okay, let's start

0:00:18 | i'm going to present our work on i-vector transformation and scaling for PLDA based

0:00:23 | recognition

0:00:24 | and the goal of this work

0:00:26 | is

0:00:27 | to present a way to transform our i-vectors so that they better fit the PLDA

0:00:33 | assumptions

0:00:34 | and at the same time introduce a way

0:00:37 | to perform some sort of dataset mismatch compensation, similar to what length normalization is

0:00:43 | performing on the PLDA side

0:00:46 | so |

0:00:47 | as we all know, the PLDA model assumes that the latent variables are Gaussian, which

0:00:54 | means that the resulting i-vectors, if we assume they are independently sampled, would

0:01:00 | follow a Gaussian distribution

0:01:02 | now we all know this is not really the case |

0:01:06 | indeed |

0:01:07 | we have two main problems with this model

0:01:10 | our

0:01:11 | i-vectors do not really look like they should if they were samples from a Gaussian

0:01:16 | distribution

0:01:17 | for example here on the right |

0:01:19 | i'm plotting one dimension of the i-vectors, the dimension with the highest skewness

0:01:26 | i plot its histogram, and it's quite clear that

0:01:29 | the histogram doesn't really resemble anything like a Gaussian distribution; it's even almost multimodal

0:01:37 | then the other problem is that we have

0:01:39 | a quite evident mismatch between development and evaluation

0:01:43 | i-vectors

0:01:45 | for example if we look at the left |

0:01:49 | there is a plot of the histogram of the squared i-vector norms for both

0:01:53 | our development set, which is a female telephone set

0:01:57 | and the evaluation set, which is the condition five female telephone subset of SRE10

0:02:01 | and we can see two things, first of all

0:02:05 | the distributions for the evaluation and development sets are

0:02:10 | quite different from each other

0:02:12 | and none of them resembles what we should expect

0:02:16 | if these i-vectors were actually sampled from a standard normal distribution

0:02:21 | now |

0:02:22 | up to now we have had

0:02:24 | mainly two ways to approach

0:02:27 | the issues i've presented

0:02:29 | the first one was heavy-tailed PLDA, presented yesterday by patrick kenny, which mainly deals with the

0:02:34 | non-Gaussian assumption

0:02:36 | what it does with the Gaussian assumption is that it removes the Gaussian priors

0:02:40 | and assumes that i-vector distributions are heavy-tailed

0:02:44 | and the second one is length normalization

0:02:47 | which in our opinion is not really making things more Gaussian, but is mainly

0:02:53 | dealing with the dataset mismatch that we have between evaluation and development i-vectors

0:03:00 | indeed, here i'm showing the same plot i was showing for the most skewed

0:03:04 | dimension of the i-vectors, before and after length normalization, and we can see that even if we apply

0:03:09 | length normalization, it cannot compensate for things like the

0:03:12 | multimodal distribution of the original i-vectors

0:03:15 | it might actually compensate for heavy-tailed behaviour, that's for sure, but still we

0:03:19 | don't get things which are really

0:03:21 | Gaussian-like

0:03:24 | now in this work we want to address

0:03:27 | jointly the problem of transforming i-vectors so that they better fit the

0:03:33 | PLDA assumptions, so we try to Gaussianize somehow our i-vectors

0:03:37 | and at the same time we propose a

0:03:40 | way to perform dataset compensation, similar to length normalization, the difference being that

0:03:46 | this dataset compensation is tailored

0:03:49 | to our transformation

0:03:52 | and we estimate both at the same time

0:03:55 | okay so

0:03:57 | how do we perform this

0:04:00 | let's first focus on how we

0:04:03 | can transform i-vectors so that they better fit the Gaussian assumption

0:04:07 | to do that, we assume that i-vectors are sampled from a random variable phi

0:04:13 | whose

0:04:14 | pdf we don't know; however, we assume that we can express this random variable phi as

0:04:19 | a function f

0:04:20 | of a standard normal random variable

0:04:23 | now if we do like this, then we can express the pdf of this random

0:04:28 | variable phi as

0:04:30 | the standard normal pdf

0:04:32 | evaluated at

0:04:34 | the samples transformed through the inverse of f

0:04:40 | times, in log terms, the log-determinant of the Jacobian of the inverse transformation

0:04:45 | now the good thing is that we can

0:04:47 | do two things with this model: first of all, we can estimate the function f

0:04:52 | so as to maximize the likelihood of our i-vectors

0:04:56 | and in that way we would obtain something which

0:05:00 | is also the pdf of the i-vectors, which is not anymore standard Gaussian but depends

0:05:06 | on the transformation

0:05:08 | and the other thing is that we can also employ this function to transform

0:05:12 | i-vectors, so that the samples which follow the distribution of phi

0:05:17 | become transformed into samples which follow a

0:05:21 | standard normal distribution
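As an illustration of this change-of-variables machinery, here is a small sketch; the map f below is a hypothetical invertible 1-D function chosen for the example, not the transformation estimated in this work, and its inverse is computed numerically:

```python
import numpy as np

# Hypothetical invertible map f (for illustration only):
# f(z) = z + 0.5 z^3 is strictly increasing, hence invertible on R.
def f(z):
    return z + 0.5 * z ** 3

def f_inv(x, iters=60):
    # Invert the monotone f by bisection.
    lo = np.full_like(x, -100.0)
    hi = np.full_like(x, 100.0)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo = np.where(f(mid) < x, mid, lo)
        hi = np.where(f(mid) >= x, mid, hi)
    return 0.5 * (lo + hi)

def log_pdf_phi(x):
    # Change of variables:
    # log p(x) = log N(f^{-1}(x); 0, 1) + log |d f^{-1}(x) / dx|
    z = f_inv(x)
    log_normal = -0.5 * (z ** 2 + np.log(2 * np.pi))
    log_det = -np.log(1.0 + 1.5 * z ** 2)   # since d f(z)/dz = 1 + 1.5 z^2
    return log_normal + log_det

# Samples of phi = f(Z) with Z ~ N(0, 1), mapped back through f^{-1},
# follow a standard normal distribution again.
rng = np.random.default_rng(0)
x = f(rng.standard_normal(100000))
z = f_inv(x)
print(z.mean(), z.std())   # close to 0 and 1
```

Estimating f by maximum likelihood, as described next, amounts to maximizing the average of `log_pdf_phi` over the training i-vectors with respect to the parameters of f.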

0:05:25 | now

0:05:26 | to

0:05:27 | model this unknown function, we decided to follow a

0:05:33 | framework which is quite similar to the neural network framework

0:05:37 | that is, we assume that we can express this transformation function as a composition of

0:05:42 | several simple functions

0:05:46 | which can be interpreted as layers of a neural network

0:05:50 | now |

0:05:51 | the only constraint that we have with respect to a standard neural network here is

0:05:55 | that we want to work with functions which are invertible, so all our layers have the

0:06:00 | same size, and the transformation they

0:06:02 | produce needs to be invertible
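A minimal sketch of such a stack of same-size invertible layers and its inverse, assuming a leaky-ReLU-style non-linearity purely as a stand-in for the one actually used in this work:

```python
import numpy as np

class Affine:
    """Square, invertible linear layer: y = W x + b."""
    def __init__(self, W, b):
        self.W, self.b = W, b
    def forward(self, x):
        return x @ self.W.T + self.b
    def inverse(self, y):
        # Solve W x^T = (y - b)^T for x.
        return np.linalg.solve(self.W, (y - self.b).T).T

class LeakyReLU:
    """Element-wise monotone non-linearity (a stand-in for the talk's
    non-linear layer): y = x for x >= 0, y = a x otherwise."""
    def __init__(self, a=0.25):
        self.a = a
    def forward(self, x):
        return np.where(x >= 0, x, self.a * x)
    def inverse(self, y):
        return np.where(y >= 0, y, y / self.a)

def forward(layers, x):
    for layer in layers:
        x = layer.forward(x)
    return x

def inverse(layers, y):
    for layer in reversed(layers):
        y = layer.inverse(y)
    return y

rng = np.random.default_rng(1)
d = 4
layers = [Affine(rng.standard_normal((d, d)) + 3 * np.eye(d),
                 rng.standard_normal(d)),
          LeakyReLU(),
          Affine(rng.standard_normal((d, d)) + 3 * np.eye(d),
                 rng.standard_normal(d))]
x = rng.standard_normal((5, d))
y = forward(layers, x)
print(np.allclose(inverse(layers, y), x))   # round trip recovers x
```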

0:06:05 | as we said, we perform maximum likelihood estimation of the parameters of the transformation

0:06:10 | and then, instead of using the pdf directly, we use the transformation function to map

0:06:15 | back

0:06:16 | our i-vectors to

0:06:18 | let's say, Gaussian-distributed i-vectors

0:06:21 | here i have a small example on one-dimensional data; this is again

0:06:28 | the most skewed component of our training i-vectors

0:06:36 | on the top left is the original histogram, and on the right i report the transformation

0:06:41 | that we estimated

0:06:43 | so as you can see from the top left

0:06:45 | if we directly use the transformation

0:06:48 | to evaluate the log pdf of the

0:06:51 | original

0:06:53 | i-vectors, we actually obtain a pdf which very closely matches the histogram of our

0:06:58 | i-vectors

0:07:00 | then if we apply the inverse transformation to these data points, we obtain what we

0:07:05 | see in the bottom view here

0:07:08 | and what

0:07:09 | does that show? it shows that we managed to obtain a histogram of i-vectors which

0:07:13 | very closely matches the Gaussian

0:07:16 | pdf, which is plotted as well; i don't know if it's visible, but there is the pdf

0:07:20 | of the standard normal distribution, which is pretty much on top of the histogram of

0:07:25 | the transformed vectors

0:07:29 | now

0:07:30 | in this work

0:07:32 | we decided to use a simple selection for our layers; in particular we have

0:07:37 | one kind of layer which does just an affine transformation, that is, we can interpret

0:07:42 | it just as the weights

0:07:44 | of a neural network

0:07:45 | and then another kind

0:07:48 | of layer

0:07:49 | which performs the non-linearity

0:07:51 | now the reason we chose this particular kind of non-linearity is that it has

0:07:56 | nice properties; for example, with a single layer we can already

0:08:00 | represent pdfs

0:08:02 | of random variables which are at the same time heavy-tailed and

0:08:07 | skewed, with just a single layer, and

0:08:09 | if we add more layers we increase the

0:08:12 | modelling capabilities of the approach, although this creates some problems of overfitting, as i will

0:08:16 | say

0:08:18 | later

0:08:20 | now, on the other side, we use a maximum likelihood criterion to estimate the transformation, and

0:08:25 | the nice thing

0:08:27 | is that we can use a general optimizer to which we provide

0:08:31 | the objective function and the gradients, and these gradients

0:08:34 | can be computed with

0:08:36 | an algorithm which resembles quite closely that of back-propagation with mean squared error in

0:08:42 | a neural network

0:08:44 | the main difference is that we need to take into account also the contribution of the

0:08:48 | log-determinant, which

0:08:50 | increases the complexity of the training, but the training time is pretty much the same

0:08:54 | as what we would have with a standard neural network
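As a toy illustration of the training criterion (a deliberately tiny sketch, not the paper's implementation): on 1-D data, with a single affine layer g(x) = w x + b playing the role of the inverse transformation, the objective is the standard normal log-likelihood of g(x) plus the log-determinant term, here simply log|w|, and plain gradient ascent recovers the whitening solution:

```python
import numpy as np

rng = np.random.default_rng(2)
x = 2.0 + 0.5 * rng.standard_normal(10000)   # 1-D stand-in "i-vectors"

# Mean log-likelihood: -0.5 g(x)^2 - 0.5 log(2 pi) + log|w|,
# a Gaussian term plus the log-determinant of the (1-D) Jacobian.
w, b = 1.0, 0.0
lr = 0.05
for _ in range(5000):
    g = w * x + b
    dw = np.mean(-g * x) + 1.0 / w   # the log-det term contributes 1/w
    db = np.mean(-g)
    w += lr * dw
    b += lr * db

# The ML solution whitens the data: w -> 1/std(x), b -> -mean(x)/std(x),
# so the transformed data are approximately standard normal.
z = w * x + b
print(w, b, z.mean(), z.std())
```

With a deeper stack the gradients of the log-determinant flow through every layer, which is the back-propagation-like algorithm mentioned above.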

0:08:58 | now, this is a first set of experiments; here we still didn't apply length

0:09:03 | normalization or any other kind of

0:09:06 | compensation approach; what i'm showing here is what happens when we estimate

0:09:11 | this transformation on our

0:09:12 | training data and we apply it to transform evaluation i-vectors

0:09:17 | as you can see on the left, these are the same histograms of the

0:09:21 | squared norms i was presenting before, and on the right are the squared norms of the

0:09:25 | transformed i-vectors

0:09:27 | note that

0:09:28 | here i'm using a transformation with just one non-linear layer

0:09:33 | now of course, as we can see, the squared norm is still not exactly what

0:09:37 | we would expect from

0:09:39 | standard normally distributed samples, but it

0:09:43 | matches more closely our expectation, and more importantly we also somehow

0:09:49 | reduce the mismatch between evaluation and development squared norms, which means that our i-vectors are

0:09:55 | more similar

0:09:57 | and this gets reflected in the results: on the first and second lines you

0:10:01 | see the PLDA and

0:10:03 | the same PLDA but trained with the transformed i-vectors

0:10:07 | and since here we are not

0:10:08 | using any kind of length normalization, we can see that our model allows us to achieve

0:10:13 | much better performance compared to standard PLDA

0:10:16 | on the last line, however

0:10:18 | we can still see that length normalization is compensating for the dataset mismatch

0:10:23 | better, which allows the PLDA with length-normalized i-vectors to perform better than our model

0:10:29 | right

0:10:31 | so

0:10:31 | the next part is how can we

0:10:35 | incorporate this kind of preprocessing in our model; of course we could try to length-normalize the

0:10:39 | transformed i-vectors, but we can do better by

0:10:42 | casting this

0:10:44 | kind of compensation directly into our model

0:10:47 | to this extent

0:10:49 | we first need to give a different interpretation to length normalization; in particular we

0:10:54 | need to see

0:10:55 | it

0:10:57 | as the maximum likelihood solution of a quite simple model

0:11:01 | where our i-vectors are not i.i.d. anymore, in the sense that

0:11:05 | we assume that each i-vector is sampled from a different random variable whose distribution

0:11:10 | is normal

0:11:12 | and all these random variables share a common sigma, which is the

0:11:17 | covariance matrix of the model

0:11:18 | but this covariance matrix is scaled for each i-vector by a scalar

0:11:23 | alpha

0:11:24 | this is quite similar to a heavy-tailed distribution, but instead of putting priors

0:11:29 | over these scaling terms

0:11:30 | we just optimize them by their maximum likelihood solution

0:11:34 | now if we perform a two-step optimization, where we first estimate sigma assuming that

0:11:39 | the alpha terms are one

0:11:41 | and then we fix that sigma and estimate the optimal alpha terms, we would

0:11:46 | end up with something which is

0:11:49 | very similar to length norm; indeed the optimal alpha

0:11:53 | is the norm of the whitened i-vector divided by the

0:11:57 | square root of the dimensionality of the i-vectors
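In other words (in my notation): for a whitened M-dimensional i-vector w, the per-vector ML scale comes out as alpha = ||w|| / sqrt(M), so dividing by alpha is length normalization up to the constant sqrt(M). A quick sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
M = 400
# Whitened development i-vectors with a mismatched (too large) scale:
W = 1.7 * rng.standard_normal((1000, M))

# Per-i-vector ML scale under the shared-covariance, per-vector-scaled
# model described above: alpha_i = ||w_i|| / sqrt(M).
alpha = np.linalg.norm(W, axis=1) / np.sqrt(M)
W_norm = W / alpha[:, None]

# After scaling, every vector has squared norm exactly M, the expected
# squared norm of a standard normal sample, i.e. this is length
# normalization up to the sqrt(M) constant.
print(np.allclose(np.sum(W_norm ** 2, axis=1), M))   # True
```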

0:12:01 | now why is this interesting? because this

0:12:03 | random variable can be represented as a transformation of a standard normal random variable, where the transformation

0:12:10 | has a parameter which is i-vector dependent

0:12:13 | now if we estimate this

0:12:15 | parameter using an iterative strategy, where we first estimate

0:12:20 | sigma and the alphas, and then we

0:12:23 | apply the inverse transformation, we would recover exactly what we're doing right

0:12:27 | now with length normalization

0:12:30 | so this tells us

0:12:32 | how to implement a similar strategy in our model

0:12:37 | we introduce what we call a scaling layer, which has a

0:12:41 | single parameter, and this parameter is i-vector dependent: for each i-vector we want to estimate

0:12:46 | its maximum likelihood solution

0:12:48 | now our transformation is the cascade of this

0:12:52 | scaling layer and what we were proposing before, so

0:12:56 | the

0:12:57 | composition of affine and non-linear layers

0:13:01 | there is one comment here

0:13:03 | in order to

0:13:04 | efficiently train this thing, we

0:13:06 | still have to resort to a sort of adaptive training; that is, we first fix the alphas and estimate

0:13:12 | the shared parameters, then we fix the shared parameters and optimize the

0:13:15 | alphas

0:13:16 | and one more thing that we need to take into account is that at test

0:13:20 | time

0:13:21 | while with the original model we don't need to do anything other than transform the

0:13:24 | i-vectors, with this model at this point we also need to estimate, for each i-vector,

0:13:29 | the optimal scaling factor
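A hedged sketch of this alternating estimation on synthetic data, using a diagonal shared covariance for simplicity (the actual model estimates the scaling jointly with the layered transformation):

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 2000, 200
d_true = rng.uniform(0.5, 2.0, size=M)    # shared (here diagonal) covariance
a_true = rng.uniform(0.5, 2.0, size=N)    # per-i-vector scale factors
X = a_true[:, None] * np.sqrt(d_true) * rng.standard_normal((N, M))

# Alternating ML estimation, as described in the talk: with the alphas
# fixed (initially 1) estimate the shared parameters, then with the
# shared parameters fixed re-estimate each alpha.
alpha = np.ones(N)
for _ in range(3):                         # the "three iterations" variant
    d = np.mean((X / alpha[:, None]) ** 2, axis=0)     # shared step
    alpha = np.sqrt(np.mean(X ** 2 / d, axis=1))       # per-vector step

# The scales are identifiable only up to a global factor traded off
# against d, so compare after fixing the geometric mean to one.
a_hat = alpha / np.exp(np.mean(np.log(alpha)))
a_ref = a_true / np.exp(np.mean(np.log(a_true)))
print(np.corrcoef(a_hat, a_ref)[0, 1])    # close to 1
```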

0:13:32 | however this

0:13:34 | gives us a great improvement, as you can see: the first line is the

0:13:38 | same i was presenting before

0:13:41 | and then the last three lines are PLDA with length normalization

0:13:45 | then our transformation with the alpha scaling, with one iteration

0:13:49 | of

0:13:51 | alpha estimation, and with three iterations of alpha estimation

0:13:55 | and as you can see |

0:13:57 | the model with three iterations clearly outperforms PLDA with length norm in all conditions

0:14:03 | on the SRE10 female dataset

0:14:08 | now

0:14:10 | i guess we can get to the conclusions: we

0:14:14 | investigated here an approach to estimate a transformation which allows us to modify our i-vectors

0:14:20 | so that they better fit the PLDA assumptions

0:14:22 | so if we apply this transformation we obtain i-vectors which are more Gaussian-like, and

0:14:28 | we incorporated into the model a

0:14:30 | proper way to perform mismatch compensation, which is similar in spirit to length

0:14:35 | norm

0:14:36 | but is

0:14:37 | tailored to the particular layers that we are using in the transformation

0:14:41 | this transformation is estimated using a maximum likelihood criterion, and the transformation function itself

0:14:47 | is implemented using a framework which is very similar to that

0:14:51 | of neural networks

0:14:53 | with, as we said, some constraints, because we want our layers to be invertible, and in this case

0:14:57 | such that we can compute

0:14:59 | and guarantee that the log-determinant of our Jacobians exists as well

0:15:06 | now this approach allows us to

0:15:09 | improve the results, as i was showing, on the SRE10

0:15:13 | data; we also have experiments in the paper that

0:15:17 | i don't report here, where we show that it also works on NIST two

0:15:21 | thousand twelve data

0:15:23 | there is one caveat: as i said before, here we are using a single-layer

0:15:27 | transformation; the reason is that this kind of model tends to

0:15:31 | overfit quite easily

0:15:33 | so our first experiments with more than one non-linear layer

0:15:38 | were not very satisfactory, as they were decreasing the performance

0:15:43 | now we are managing to get interesting results by changing

0:15:47 | things in two ways: the first one is changing the kind of non-linearity

0:15:51 | by adding

0:15:52 | some constraints inside the function itself which limit this

0:15:57 | overfitting behaviour

0:15:59 | and on the other hand we also devised some structures where we impose constraints on

0:16:03 | the parameters of the transformation, which again

0:16:06 | reduce the overfitting behaviour, and this allows us to train networks which have more layers

0:16:11 | although up to now we obtained mixed results, in the sense that we managed

0:16:15 | to

0:16:16 | train transformations which behave much better

0:16:19 | if we don't

0:16:20 | use the scaling term, but after we insert the scaling term into

0:16:24 | the

0:16:26 | whole

0:16:27 | framework, in the end we more or less converge to what was shown

0:16:30 | here; so we are still working to try to understand why we have this strange behaviour

0:16:36 | and whether we can

0:16:37 | improve the performance of the transformation itself, since we cannot improve

0:16:42 | anymore when we add the scaling term

0:16:46 | so, that's all

0:16:52 | now, if there are some questions, we have time

0:17:05 | how does this compare to just straight gaussianization?

0:17:10 | okay, the

0:17:11 | thing is, how would we implement gaussianization with one-hundred-fifty-dimensional vectors? i mean

0:17:17 | you would gaussianize each dimension on its own

0:17:20 | well, if you gaussianize each dimension on its own; we tried

0:17:24 | something like that with this model, where if we constrain the transformation, well, the function itself

0:17:29 | can

0:17:30 | indeed

0:17:31 | produce that kind of behaviour; and by the way, when working with one-dimensional synthetic

0:17:36 | data this scheme works with many kinds of distributions, but applied per dimension the results were already much

0:17:42 | worse

0:17:43 | so my guess is that it would not be sufficient to independently

0:17:47 | gaussianize each dimension on its own

0:17:50 | but excuse me, i'm sorry, you mean you tried it and it didn't work?

0:17:54 | no, i didn't try exactly that; i tried the same model i'm presenting here with a

0:17:59 | transformation which is applied independently to each component, and in my experience, when i'm working on

0:18:06 | single, one-dimensional data points

0:18:09 | it gaussianizes very well

0:18:11 | it does not present overfitting problems; but then if i model multidimensional data

0:18:16 | with several kinds of distributions it fails; the only difference is that gaussianization

0:18:20 | computes exactly the inverse function, it's not an approximation to it

0:18:24 | no, but maximum likelihood gives a close approximation to it, and what i

0:18:28 | did here doesn't work; so my guess is that replacing the approximation with

0:18:31 | exact per-dimension gaussianization would still not work

0:18:39 | i have a question about the non-linearity

0:18:43 | this approach does not use a common activation function for DNNs

0:18:47 | so what is the justification for the one you have chosen?

0:18:55 | first of all, the original transformation i was using is the last

0:19:00 | one shown, which, it can be shown, we can split into several layers; and

0:19:05 | it has different nice properties: first of all it can represent the identity transformation

0:19:10 | so if our data are already Gaussian

0:19:13 | they are kept like that

0:19:15 | then it has some nice properties which can be shown; there are some references in

0:19:19 | our paper where you can find that

0:19:22 | this kind of

0:19:24 | single-layer scheme can already represent a whole set of distributions which are both

0:19:29 | skewed and heavy-tailed

0:19:32 | so the reason we chose this

0:19:34 | kind of layer is essentially because it was already shown that a single layer

0:19:39 | can model quite a broad family of distributions

0:19:44 | well, that's all

0:19:49 | i have two strange questions

0:19:52 | first: is it possible to look at the estimated parameters and try to understand what are

0:19:58 | the characteristics

0:20:00 | of your training set

0:20:02 | in terms of, let's say, the mismatch

0:20:05 | session effects or channel effects?

0:20:08 | what do you mean exactly? i mean

0:20:11 | look at your transformation and try to understand, say, whether you lose some information

0:20:18 | when the

0:20:19 | mismatch between, or inside, your training set is due to the presence

0:20:25 | of

0:20:26 | say, telephone data

0:20:27 | okay, maybe this could be applied separately on different sets

0:20:33 | if you have some way to

0:20:36 | model, to see what is the difference in your distributions before and after the transformation

0:20:41 | you could apply the same technique, so you might

0:20:44 | as well

0:20:46 | transform independently two different sets and see if this removes the differences or not

0:20:52 | what i can say here is that

0:20:54 | pretty much

0:20:56 | it looks like, at least if we consider that evaluation and development are two

0:21:00 | different sets with different distributions, it is somehow able to

0:21:04 | partly compensate for that

0:21:06 | now the transformation itself is partly responsible for this, because it

0:21:11 | handles, say, heavy-tailed behaviour: it allows us to stretch the norms which are far

0:21:18 | from what we would expect

0:21:20 | so it can also move some of the mass of these distributions

0:21:24 | but on the other hand

0:21:25 | the thing which really does this processing is the scaling; anyway, that scaling

0:21:30 | is very similar to length norm, but it's not a transformation that i'm applying

0:21:34 | blindly

0:21:36 | rather, i'm learning a transformation of my i-vectors, and i'm estimating at the same time the

0:21:41 | transformation and the scaling

0:21:44 | okay, and that is the part which in my opinion is really responsible for compensating

0:21:49 | the mismatch in the datasets that we are using

0:21:51 | then another thing that i can note is

0:21:54 | that what would be much

0:21:57 | better than

0:21:58 | what we're using is a model with really the speaker factors and the channel factors, PLDA-like

0:22:03 | for example

0:22:05 | the problem is that

0:22:06 | already like this it takes

0:22:08 | several hours, if not days, to train the transformation function; at test time it's

0:22:14 | very fast, but training is quite slow, and if we moved to

0:22:18 | something PLDA-style, or if we modeled the terms differently, the training time

0:22:23 | would really explode, and so would the computational time at test time

0:22:27 | because we would need to consider

0:22:29 | the cases where the i-vectors are from the same speaker or not, and in that

0:22:33 | case the cost would grow

0:22:35 | you would have

0:22:36 | something

0:22:38 | similar to what we have with uncertainty propagation, where you have to redo

0:22:43 | this kind of computation for everything, but much worse

0:22:48 | okay, it's just

0:22:49 | that, since the training is expensive, i would want to try to

0:22:55 | exploit the parameters as much as possible; and my second question, which is related to the first

0:23:00 | one

0:23:02 | is it possible somehow to use this approach to

0:23:07 | determine if one thing, let's say one i-vector

0:23:11 | is in-domain or out-of-domain?

0:23:15 | so you use it to detect, say, okay

0:23:20 | my operating condition is...

0:23:21 | probably not, really; i mean, length normalization is not affected too much by this

0:23:25 | but this model is

0:23:27 | and the problem with this thing is that if i have a really huge mismatch

0:23:31 | then it gets amplified by the transformation itself

0:23:35 | because the data point i'm transforming will not be where it should be, so the way it

0:23:40 | passes through the non-linear function

0:23:42 | is probably going to increase my mismatch instead of reducing it

0:23:46 | so up to some point this should still work better than standard

0:23:50 | length norm, but after some point it might actually behave worse

0:23:57 | with mismatched datasets

0:23:59 | thanks, that answered it

0:24:03 | okay, let's thank the speaker again