Speech Transcript - Discriminative PLDA training with application-specific loss functions for speaker verification

0:00:15	this is a very short introduction to the topic
0:00:18	we are working with probabilistic linear discriminant analysis, and it has
0:00:25	previously been improved by discriminative training
0:00:30	previous studies used loss functions that essentially focus on a very broad
0:00:34	range of applications, so in this work we are trying to
0:00:39	train the PLDA in a way that it becomes
0:00:42	more suitable for
0:00:44	a narrow range of applications
0:00:47	and we observe a small improvement in the minimum detection cost by doing so
0:00:55	so as background
0:00:57	when we use a speaker verification system we would like to minimize the expected
0:01:02	cost
0:01:04	from our decisions
0:01:06	and this is very much reflected in the detection cost function of
0:01:10	NIST
0:01:12	so we have a cost for false rejection and for false alarm, and also a
0:01:16	prior, which we can say together constitute the operating point of our system
0:01:22	and which of course depends on the application
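
A minimal sketch of the detection cost described here, assuming the usual weighted sum of miss and false-alarm rates. The function names, the toy scores and the operating point in the usage note are illustrative and not taken from the talk.

```python
import math


def error_rates(target_scores, nontarget_scores, threshold):
    """Miss and false-alarm rates when trials with score >= threshold are accepted."""
    p_miss = sum(s < threshold for s in target_scores) / len(target_scores)
    p_fa = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    return p_miss, p_fa


def detection_cost(p_miss, p_fa, p_tar, c_miss, c_fa):
    """Expected cost at the operating point (p_tar, c_miss, c_fa)."""
    return c_miss * p_tar * p_miss + c_fa * (1.0 - p_tar) * p_fa


def bayes_threshold(p_tar, c_miss, c_fa):
    """Threshold on a log-likelihood-ratio score that minimizes the expected cost."""
    return math.log((c_fa * (1.0 - p_tar)) / (c_miss * p_tar))


def actual_detection_cost(target_scores, nontarget_scores, p_tar, c_miss, c_fa):
    """Cost when the scores are trusted as LLRs and the Bayes threshold is used."""
    t = bayes_threshold(p_tar, c_miss, c_fa)
    p_miss, p_fa = error_rates(target_scores, nontarget_scores, t)
    return detection_cost(p_miss, p_fa, p_tar, c_miss, c_fa)


def min_detection_cost(target_scores, nontarget_scores, p_tar, c_miss, c_fa):
    """Cost at the best possible threshold for these scores."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores)) + [float("inf")]
    return min(
        detection_cost(*error_rates(target_scores, nontarget_scores, t), p_tar, c_miss, c_fa)
        for t in thresholds
    )
```

With the toy scores `[2.0, 1.0, -0.5]` for targets and `[-2.0, -1.0, 0.5]` for non-targets at a hypothetical operating point `p_tar=0.5, c_miss=c_fa=1`, the actual cost is larger than the minimum cost. That gap is what calibration can close, while the minimum cost itself depends on the scores.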
0:01:25	so the target here is to build an application-specific system that is optimal for
0:01:31	one or several
0:01:34	operating points rather than for all of them
0:01:37	it is more specifics of the same
0:01:41	and this kind of idea has already been explored for score calibration
0:01:48	in the interspeech paper mentioned there
0:01:51	however, with score calibration we can reduce the gap between the actual
0:01:57	detection cost and the minimum detection cost
0:02:00	but we cannot reduce the minimum detection cost
0:02:04	but
0:02:05	by applying this kind of idea at some earlier stage of the speaker verification system we
0:02:10	could hope to reduce also the minimum detection cost
0:02:14	so we will apply it to
0:02:17	discriminative PLDA training
0:02:25	we use a method that has previously been developed for
0:02:31	discriminative PLDA training
0:02:33	and the only thing we need to know here is that
0:02:37	the log-likelihood ratio score of the PLDA model is
0:02:44	given by this kind of form here
0:02:47	and we can apply some discriminative training criterion to these
0:02:52	parameters
0:02:53	here
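
The form on the slide is not in the transcript. A minimal sketch, assuming the quadratic form that is standard in the discriminative PLDA literature, where the score of a pair of i-vectors is a dot product between a parameter vector and an expansion of the pair. The names `Lambda`, `Gamma`, `c`, `k` and the helper functions are assumptions for illustration.

```python
def outer(a, b):
    """Outer product of two vectors as a list of rows."""
    return [[ai * bj for bj in b] for ai in a]


def add(m1, m2):
    """Element-wise sum of two matrices."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]


def flatten(m):
    """Row-major vectorization of a matrix."""
    return [v for row in m for v in row]


def expand_pair(x1, x2):
    """Expansion phi(x1, x2) so that the score is linear in the parameters."""
    cross = add(outer(x1, x2), outer(x2, x1))   # pairs with Lambda
    within = add(outer(x1, x1), outer(x2, x2))  # pairs with Gamma
    linear = [a + b for a, b in zip(x1, x2)]    # pairs with c
    return flatten(cross) + flatten(within) + linear + [1.0]  # 1.0 pairs with k


def pack_parameters(Lambda, Gamma, c, k):
    """Parameter vector w in the same order as expand_pair."""
    return flatten(Lambda) + flatten(Gamma) + list(c) + [k]


def plda_score(w, x1, x2):
    """Score w . phi(x1, x2), equal to
    x1'Lambda x2 + x2'Lambda x1 + x1'Gamma x1 + x2'Gamma x2 + c'(x1 + x2) + k."""
    return sum(wi * pi for wi, pi in zip(w, expand_pair(x1, x2)))
```

Because the score is linear in `w`, any loss defined on the score can be minimized with respect to `w` directly, which is what makes the discriminative criteria applicable to these parameters.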
0:02:55	well only you should be in of the i-vectors
0:03:00	which means that we basically take all possible pairs of
0:03:05	i-vectors in the training database and minimize some loss function, possibly with some
0:03:11	weight
0:03:12	and we also apply some
0:03:14	regularization term
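
A minimal sketch of such an objective, assuming a squared-distance regularizer towards a reference parameter vector and one weight per trial type. The argument names (`score_fn`, `loss_fn`, `beta_tar`, `beta_non`, `reg`, `w_ref`) are hypothetical and not taken from the talk.

```python
import math
from itertools import combinations


def logistic_loss(score, is_target):
    """Numerically stable log(1 + exp(-margin)), with margin = +score for targets."""
    margin = score if is_target else -score
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def training_objective(w, w_ref, ivectors, speakers, score_fn, loss_fn,
                       beta_tar, beta_non, reg):
    """Weighted loss over all i-vector pairs plus a regularization term."""
    total = 0.0
    for i, j in combinations(range(len(ivectors)), 2):
        is_target = speakers[i] == speakers[j]       # same speaker gives a target trial
        beta = beta_tar if is_target else beta_non   # trial-dependent weight
        score = score_fn(w, ivectors[i], ivectors[j])
        total += beta * loss_fn(score, is_target)
    penalty = reg * sum((a - b) ** 2 for a, b in zip(w, w_ref))
    return total + penalty
```

Setting `w_ref` to all zeros gives regularization towards zero, and setting it to the parameters of a generatively trained model gives regularization towards that model.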
0:03:23	but this
0:03:25	have been
0:03:26	so
0:03:28	when we consider
0:03:30	which operating point we should target,
0:03:34	that is, how we should target the system to be
0:03:38	suitable for certain operating points, we need to consider
0:03:42	the weights beta that we have here, which
0:03:46	depend on the trial
0:03:48	in essence they will be different for target and non-target trials
0:03:52	and we also have a loss function, and to put it very simply,
0:03:57	beta will depend on which operating point we are targeting, whereas the choice of loss function will
0:04:04	decide how much emphasis we put on surrounding operating points
0:04:17	so
0:04:18	well, just briefly about the weights beta
0:04:24	as probably several of you know, an application with
0:04:29	these three parameters, the probability of a target trial and the two costs, we can rewrite
0:04:34	so that
0:04:36	we have an equivalent application which will have
0:04:40	a loss
0:04:41	in training or evaluation that is proportional to that of the
0:04:44	first application, so
0:04:46	we can as well consider this
0:04:50	well
0:04:51	kind of application instead and minimize its loss
0:04:56	and as such we make sure that
0:05:00	we have a
0:05:02	system that will be optimal also
0:05:05	for the operating point we are looking at
0:05:08	so essentially we need to rescale
0:05:16	every trial so that we get the right
0:05:21	percentage of target trials, that of the
0:05:24	evaluation
0:05:25	database we consider,
0:05:29	as compared to
0:05:33	the training database
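
A minimal sketch of the two steps described here, assuming the standard effective-prior rewrite of an operating point and weights defined as the ratio between the targeted proportion and the proportion found in the training trials. The exact formula on the slide is not in the transcript, so `trial_weights` is one plausible reading rather than the authors' definition.

```python
def effective_prior(p_tar, c_miss, c_fa):
    """Prior of the equivalent application with unit costs.

    Its expected cost is proportional to that of (p_tar, c_miss, c_fa)."""
    return (p_tar * c_miss) / (p_tar * c_miss + (1.0 - p_tar) * c_fa)


def trial_weights(p_eff, n_target, n_nontarget):
    """Weights (beta_tar, beta_non) that rescale the training trials so that the
    weighted proportion of target trials equals p_eff."""
    p_train = n_target / (n_target + n_nontarget)
    beta_tar = p_eff / p_train
    beta_non = (1.0 - p_eff) / (1.0 - p_train)
    return beta_tar, beta_non
```

For example, with hypothetical values `p_tar=0.01, c_miss=10, c_fa=1` the effective prior is about 0.092. With 100 target and 900 non-target training trials and `p_eff=0.5`, the weights come out as 5 for targets and about 0.56 for non-targets, so both trial types carry the same total weight.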
0:05:38	so regarding the choice of loss function
0:05:42	previous studies on discriminative PLDA training used the logistic regression loss or the
0:05:47	SVM hinge loss
0:05:49	and the logistic regression loss, which is essentially the same as the Cllr loss, is
0:05:54	just like the EER an application-independent
0:05:58	
0:05:59	evaluation metric, so it could be suitable as a loss function if we want to
0:06:03	target a very broad range of applications
0:06:07	well, what we want to consider here is to
0:06:10	see, by targeting a more narrow range of operating points, if
0:06:16	we can get better performance for such operating points
0:06:20	and
0:06:22	well, the loss
0:06:24	that would correspond exactly to one detection cost would be the
0:06:29	zero-one loss
0:06:31	and we will also consider one loss
0:06:35	function which is a little bit broader than the
0:06:38	zero-one loss but a bit more narrow than the logistic regression loss, which is the Brier
0:06:43	loss
0:06:44	and well
0:06:47	to
0:06:49	explain why this is the case i can refer to the interspeech paper, which is
0:06:53	very interesting
0:07:00	so
0:07:01	i'm showing here a picture of how these different losses look
0:07:06	and the blue one would be the logistic regression loss, which is
0:07:11	convex, but
0:07:14	
0:07:16	because of that it could also be sensitive to outliers, because
0:07:21	for some
0:07:22	maybe unusual trial, so this would be, by the way, the loss
0:07:26	for a target trial, for example,
0:07:29	for the zero-one loss we have
0:07:33	some cost here, and then after we pass the threshold, which is here, we have
0:07:37	no cost
0:07:38	but basically this one can be very large for some
0:07:43	unusual data point in our database, so
0:07:51	our system may be very much adjusted to one of my
0:07:56	degree one
0:07:57	then we have the Brier loss, and the zero-one loss here, as i said,
0:08:02	with a couple of approximations that we will later use
0:08:07	we use this sigmoid approximation
0:08:10	in order to do optimization, and it includes a parameter alpha that makes it
0:08:16	more and more similar to the zero-one loss when you increase it, and we
0:08:20	have that for
0:08:22	alpha equal to
0:08:24	one, ten and a hundred
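
A minimal sketch of the losses compared in the figure, written as functions of the margin `m`, that is, the score relative to the threshold with the sign flipped for non-target trials. The exact parameterizations used in the talk are not in the transcript. The sigmoid form with steepness `alpha` and the squared-error (Brier-type) form used here for the intermediate loss are common choices and should be read as assumptions.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def logistic_loss(m):
    """Convex but unbounded: grows linearly for very wrong trials (outliers)."""
    return max(-m, 0.0) + math.log1p(math.exp(-abs(m)))


def zero_one_loss(m):
    """Cost 1 on the wrong side of the threshold, no cost once it is passed."""
    return 1.0 if m < 0 else 0.0


def sigmoid_loss(m, alpha):
    """Differentiable approximation of the zero-one loss; sharper as alpha grows."""
    return sigmoid(-alpha * m)


def brier_loss(m):
    """Squared error of the posterior implied by the margin; bounded by 1."""
    return (1.0 - sigmoid(m)) ** 2
```

Evaluating `sigmoid_loss(-0.5, alpha)` for `alpha` in `(1, 10, 100)` gives values that approach 1, the zero-one loss at that margin, while `logistic_loss(-50.0)` is about 50, which illustrates why a single badly scored trial can dominate a logistic regression objective.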
0:08:37	so
0:08:39	there are a couple of problems though: the real zero-one loss is not differentiable
0:08:43	so that's why we use this sigmoid function
0:08:47	and
0:08:50	also the Brier loss and the sigmoid loss are non-convex, so we
0:08:55	use an approach here where we kind of gradually increase the non-convexity, and
0:09:02	for the sigmoid loss it means we start from the logistic regression model
0:09:06	we also tried starting from the ML model but it's better to start from the logistic
0:09:10	regression model
0:09:12	and then increase alpha gradually; there are other papers doing that for
0:09:17	other applications
0:09:20	we do something similar for the Brier loss, where we start from the logistic regression
0:09:25	model and then train with the sigmoid loss with alpha equal to one, and then the Brier loss
0:09:30	finally
0:09:37	so
0:09:39	regarding the experiments, we use the main telephone trials and
0:09:46	we use
0:09:47	a couple of different databases, and we used as development set
0:09:51	SRE zero six which is which using one
0:09:54	the regular session but i
0:09:56	and then we use SRE zero eight and SRE ten for testing
0:10:02	and these kind of standard datasets for PLDA training
0:10:07	and
0:10:09	here is the number of i-vectors and speakers with or without including SRE
0:10:14	zero six, because we used it first as development set but sometimes we included it in
0:10:19	the training set after we had decided on the parameters, to get a little bit
0:10:23	better performance
0:10:26	and we conducted four experiments
0:10:30	okay, i should also say that we target the operating point mentioned here, which
0:10:35	has been the standard operating point in several NIST evaluations
0:10:40	and
0:10:42	of the four experiments, one is just considering a couple of different normalization and regularization
0:10:47	techniques, because we were not sure about what is the best, although it's not really
0:10:52	related to the topic of this paper
0:10:56	in the second experiment we just compare the different loss functions that we have
0:11:00	discussed
0:11:01	and then we analyze also the effect of calibration; finally we tried to
0:11:06	investigate a little bit
0:11:08	whether
0:11:09	the choice of beta according to the formula i showed before is actually suitable or
0:11:14	not
0:11:18	so, for regularization there are two options that are popular, i guess,
0:11:24	in this kind of
0:11:27	topic: we can do regularization towards zero, which would be
0:11:32	the most common, and regularization towards ML, and by ML i mean the normal generatively trained model
0:11:40	because logistic regression is also in that sense
0:11:44	a maximum likelihood approach
0:11:48	and we compare also whitening with the within-class covariance with just whitening with
0:11:54	the full covariance, the total covariance
0:11:58	and basically we found that in terms of minDCF and EER
0:12:04	using just the
0:12:06	total covariance and regularization towards maximum likelihood leads to better
0:12:12	performance, so we use that
0:12:14	in the remaining experiments
0:12:19	so comparing loss functions
0:12:31	well, first we should say that the discriminative training schemes had better actual
0:12:36	detection cost than the standard maximum likelihood training, but that is kind of expected
0:12:42	because
0:12:43	they at the same time do calibration
0:12:47	however not great calibration
0:12:50	which we will discuss
0:12:52	make the wrong
0:12:54	and
0:12:56	but the for calibration it's is that the matrix they're model is very competitive
0:13:04	but we can see some improvement by
0:13:08	these application-specific loss functions compared to logistic regression in minimum detection cost any all
0:13:15	three and
0:13:17	for sre silly there is no such that
0:13:24	so
0:13:29	the standard maximum likelihood model had a bit worse calibration, but
0:13:34	since we can
0:13:36	take care of that by doing calibration
0:13:38	we will
0:13:40	in order to do a fair comparison we will
0:13:43	also consider that here
0:13:45	so to do it we need to use some portion of
0:13:50	the training data, so here you see that we use fifty, seventy-five, ninety
0:13:54	or ninety-five percent of the
0:13:57	training data for PLDA training and the rest for calibration
0:14:01	and we use the Cllr loss here, which is essentially the same as
0:14:05	logistic regression
0:14:07	and we use the operating point that we are targeting
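
A minimal sketch of a Cllr-type calibration loss evaluated at a chosen operating point, assuming the common form in which the prior log-odds is added to the log-likelihood ratio and the result is normalized by the cost of a system that outputs only the prior. With `p_eff=0.5` it reduces to the standard Cllr. The affine `calibrate` function and its parameter names are illustrative, and the fitting step (minimizing `cllr` over `scale` and `offset` on the held-out portion) is left out.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def calibrate(score, scale, offset):
    """Affine score-to-LLR mapping, fitted on the held-out calibration data."""
    return scale * score + offset


def cllr(target_llrs, nontarget_llrs, p_eff=0.5):
    """Prior-weighted logistic regression loss, normalized to 1 for LLRs of zero."""
    prior_logit = math.log(p_eff / (1.0 - p_eff))
    tar = sum(softplus(-(s + prior_logit)) for s in target_llrs) / len(target_llrs)
    non = sum(softplus(s + prior_logit) for s in nontarget_llrs) / len(nontarget_llrs)
    reference = -p_eff * math.log(p_eff) - (1.0 - p_eff) * math.log(1.0 - p_eff)
    return (p_eff * tar + (1.0 - p_eff) * non) / reference
```

A value of 1 corresponds to a system that carries no information beyond the prior, and well-separated, well-calibrated scores give values close to 0.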
0:14:11	and in these experiments SRE zero six is not included
0:14:15	so the results look like this, and the first thing to say is that
0:14:19	applying calibration to the ML model
0:14:23	will give better results than discriminative training without calibration
0:14:29	the second thing is that the
0:14:32	discriminative training here also
0:14:35	benefited from calibration, which must be explained by
0:14:39	the fact that we are using a regularization
0:14:43	term
0:14:52	and overall maybe we can say that using seventy-five percent of the data for training
0:14:58	and
0:15:00	using the rest for calibration was optimal
0:15:09	and also
0:15:13	we notice that logistic regression
0:15:16	performs quite badly for the very small amount of training data, that is, using fifty
0:15:21	percent
0:15:22	whereas the Brier loss and the zero-one loss
0:15:26	perform better
0:15:28	and this is probably because
0:15:33	of how those two loss functions
0:15:35	look
0:15:38	if i can go back to this
0:15:42	figure here
0:15:43	for example the zero-one loss
0:15:46	will not make so much use of the data that has a
0:15:50	score like this
0:15:51	whereas the logistic regression would, and
0:15:56	that means that the logistic regression loss will use more of the data
0:16:00	so what happens here,
0:16:05	i think, is that
0:16:08	since we do regularization towards the ML model
0:16:11	the Brier loss and the zero-one loss will not
0:16:15	change the model so much, they stay close to the ML model when
0:16:20	we use very little data
0:16:26	so also, is the
0:16:35	choice of beta
0:16:37	we use
0:16:40	optimal? that assumes that the
0:16:43	trials in the database are all independent, which is clearly not the case because we have
0:16:47	made up the training data by
0:16:49	pairing all the i-vectors
0:16:52	and of course it also assumes that the
0:16:56	training database and the evaluation database have kind of similar properties, which probably is
0:17:01	also not really the case
0:17:03	so the optimal beta could be different from
0:17:09	the one according to the formula
0:17:12	so i know that this looks a bit strange, but basically we want to
0:17:15	check a couple of different values for that, which means for the
0:17:21	effective prior
0:17:22	so we just try different values; we make some kind of parameterization which makes sure that,
0:17:27	with
0:17:28	this parameter gamma, when gamma is zero point five
0:17:33	we use the standard
0:17:36	calculated
0:17:37	effective prior according to the formula, and when gamma is equal to one
0:17:41	we will use
0:17:43	an effective prior of one, which means we give all weight to the target trials
0:17:48	and when it's zero we will use
0:17:51	an effective prior of zero, which means that we give all weight to the non-target trials
0:17:58	we used the Brier loss in this experiment
0:18:01	and also did not include
0:18:04	SRE zero six
0:18:09	so this figure is a little bit interesting, i think
0:18:14	and it seems first like it's much more important for the actual detection cost
0:18:19	than for the minimum detection cost, but remember also that here we didn't apply calibration, by the
0:18:24	way
0:18:25	and
0:18:29	it is clear that the best choice is not
0:18:32	the one that was calculated by the formula, i think
0:18:37	that's very interesting and an
0:18:39	area that should be more explored
0:18:47	and i should probably also say that
0:18:50	it actually goes up a little bit there, which is very noticeable, and i'm
0:18:53	not sure why
0:18:58	because that is actually
0:19:01	the
0:19:03	right
0:19:05	the prior
0:19:06	the effective prior according to the formula, which we used in the other experiments
0:19:11	but anyway we can see that pattern with regularization towards the ML model
0:19:16	this is relaxation ones
0:19:23	and
0:19:27	a very interesting
0:19:30	thing is that it seems that the minimum detection cost actually goes down a
0:19:35	little bit here
0:19:36	which means
0:19:39	in the cases where we
0:19:45	give all weight to target trials or we give all weight to non-target trials
0:19:50	and i think the reason for this is that
0:19:55	well
0:19:57	this should really not work, but
0:20:00	because we do regularization towards the ML model, it just stays very close to
0:20:05	the ML model, and for such kind of a system that would actually be something
0:20:08	good
0:20:09	so we also included
0:20:13	the results for regularization towards zero here, where we can see that this is not
0:20:17	the case
0:20:20	so in conclusion
0:20:23	we can see that it sometimes can improve the performance, but quite often there is not
0:20:28	so much difference
0:20:30	and we tried different optimization strategies
0:20:33	and
0:20:35	what i should say about that is that starting from the
0:20:38	logistic regression model is important, so the starting point is important, but this kind of
0:20:43	gradually increasing
0:20:47	the non-convexity was not so
0:20:50	effective actually, but i didn't discuss the details about it
0:20:57	so the optimization is something to consider, and also
0:21:01	since it seems that the weight beta has
0:21:06	some kind of importance, what is optimal is not so clear
0:21:10	well, what we should do, we probably should consider a better estimate of it,
0:21:14	maybe something that depends on other factors than just whether it's a target or
0:21:19	non-target trial
0:21:21	and
0:21:22	since
0:21:23	also
0:21:24	the discriminative training
0:21:28	that is, the discriminatively trained models
0:21:32	needed calibration, i think it could be interesting to
0:21:35	make the regularization towards a
0:21:42	parameter vector where we have built in the calibration parameters, so we do calibration of
0:21:47	the ML model and then kind of
0:21:50	put in the parameters from the regularization into that
0:21:54	eh, from the calibration into the regularization, so we actually do
0:21:59	regularization towards something that's calibrated
0:22:02	okay so
0:22:03	or something
0:22:29	might be opposed to
0:22:31	what optimizer did you use? i used the BFGS algorithm
0:22:36	a little attention
0:22:38	okay so
0:22:40	you mentioned there were some issues with the non-convexity of your objective, so
0:22:47	in the work that i
0:22:50	presented this morning
0:22:53	i also had issues with non-convexity
0:22:58	and
0:22:59	BFGS was a problem
0:23:02	because basically
0:23:04	BFGS forms a rough approximation to the inverse of the Hessian, right, and
0:23:10	the way you do BFGS, that Hessian approximation is going to
0:23:15	be positive definite
0:23:18	so
0:23:19	BFGS can't see
0:23:21	the non-convexity, right, and it can't do anything about it
0:23:25	i think that's a good point and we should consider some
0:23:30	better optimization algorithm
0:23:32	we can confirm that we have reduced the value of the objective function
0:23:36	quite significantly, but maybe
0:23:39	we
0:23:40	could have done much better in that aspect using something more suitable, i
0:23:44	think so
0:23:45	in my case there was a simple solution: i could calculate the full Hessian and
0:23:50	invert it without problems because i had very few parameters, and i could do an eigenvalue
0:23:56	analysis and then go down the steepest negative eigenvectors to get out of
0:24:02	the non-convex regions, right; for you the dimension is
0:24:05	very high also
0:24:08	you could perhaps do some other things, but it's more tricky, i'm afraid
0:24:13	but thank you
0:24:21	which does
0:24:31	i
0:25:04	okay, well, basically because we are not doing calibration, we are doing discriminative training of
0:25:09	the PLDA
0:25:12	okay

Discriminative PLDA training with application-specific loss functions for speaker verification

Speaker Modeling I

Johan Rohdin, Sangeeta Biswas and Koichi Shinoda