0:00:15 So as Niko mentioned, this is work mainly from last summer, continuing on after the end of last summer, primarily by Daniel, my colleague at Johns Hopkins, but also with Stephen Shum, who will be talking next about a little bit different flavour, and with Niko and Carlos.

0:00:35 So bear with me: Daniel put a lot of animation into these slides, and I usually take it out, but there's so much in them that I really couldn't, so I will try to present with an animation style which is not natural to me.
0:00:47 So we're trying to build a speaker recognition system which is state-of-the-art. How are we going to do that? Well, it depends what kind of an evaluation we're going to run: we want to know what the data looks like that we're actually going to be working on.

0:00:59 And since normally, for example in an SRE, we know what that data is going to look like, we go to our big pile of previous data that the LDC has kindly generated for us. We use this development data, typically very many speakers and very many labeled cuts, to learn our system parameters, in particular what we call the across-class and within-class covariance matrices, which are the key things we need to make PLDA work correctly. And then we are ready to score our system and see what happens.
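To make that training step concrete, here is a deliberately simplified two-covariance sketch of estimating those matrices from labeled development data; the function name and the details are illustrative assumptions, not the exact estimator used in this work.

    import numpy as np

    def estimate_plda_covariances(ivectors, labels):
        # ivectors: (n_cuts, dim) array; labels: (n_cuts,) speaker ids.
        speakers = np.unique(labels)
        means = np.vstack([ivectors[labels == s].mean(axis=0)
                           for s in speakers])
        # Within-class: scatter of each cut around its own speaker mean.
        resid = np.vstack([ivectors[labels == s] - means[i]
                           for i, s in enumerate(speakers)])
        within = resid.T @ resid / len(ivectors)
        # Across-class: scatter of the speaker means around their mean.
        across = np.cov(means, rowvar=False)
        return across, within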
0:01:36 So the thought here for this workshop was: what if we have this state-of-the-art system, which we have built for our SRE10 or SRE12, and someone comes to us with a pile of data which doesn't look like an SRE? What are we going to do?

0:01:53 And the first thing, in this corpus that Doug put together, which is available (there are links to the lists on the JHU website), we found that there is in fact a big performance gap with the PLDA system, even with what seems like a fairly simple mismatch: namely, you train your parameters on Switchboard and you test on MIXER, that is, on SRE10.
0:02:17 And you can see that the green line, which is a pure SRE system designed for SRE10, works extremely well, while the same algorithm trained only on Switchboard has three times the error rate.
0:02:34 So in the supervised domain adaptation that we attacked first, which Daniel presented at ICASSP, we are given an additional dataset: we have the out-of-domain Switchboard data, and we have an in-domain MIXER set which is labeled but may not be very big. So how can we combine these two datasets to accomplish good performance on SRE data?
0:02:57 The setup that we have used for these experiments is a typical i-vector system. I think some people may do different things in this back part, but Daniel has convinced me that length norm with total-covariance whitening is in fact the best, most consistent way to do it.

0:03:20 These are typical system parameters; the dimension is typically four hundred or six hundred in our experiments. The important point to emphasize here is that the i-vector extractor doesn't need any labeled data, so we call that unsupervised training. The length norm also is unsupervised. The PLDA parameters are the ones for which we need the speaker labels, and that's the harder data to find.
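As a minimal sketch of the unsupervised part of that backend, assuming plain numpy and nothing about the actual JHU code: whitening with the total covariance of an unlabeled pile, followed by length normalization.

    import numpy as np

    def train_whitener(ivectors):
        # Center on the pile's mean, then build the inverse square root
        # of the total covariance; no speaker labels are needed.
        mu = ivectors.mean(axis=0)
        cov = np.cov(ivectors, rowvar=False)
        w, V = np.linalg.eigh(cov)
        W = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
        return mu, W

    def length_norm(ivectors, mu, W):
        # Whiten, then project each i-vector onto the unit sphere.
        x = (ivectors - mu) @ W
        return x / np.linalg.norm(x, axis=1, keepdims=True)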
0:03:56 And in these experiments we found that we can always use Switchboard for the i-vector extractor itself; we don't need to retrain that every time we go to a new domain, which is a tremendous practical advantage. The whitening parameters can be trained specifically for whatever domain you're working in, which is not so hard to do either, because you only need an unlabeled pile of data to accomplish that.
0:04:18 And then I want to focus on the adaptation of the covariance matrices; that was the biggest challenge for us.
0:04:28 In principle, at least with a little bit of simplistic math, if we have known covariance matrices we can do the MAP adaptation that Doug has been doing in GMMs for a long time. The original MAP idea behind that is a conjugate prior for a covariance matrix, and you end up with a sort of count-based regularization: if you configure your prior in a certain tricky way, you end up with a very simple formula, a count-based regularization back towards an initial matrix and towards the new-data sample covariance matrix. So that's what's shown here: this is the in-domain covariance matrix, and we're smoothing it back towards the out-of-domain covariance.
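In symbols, this is just a convex interpolation, applied separately to the across-class and within-class matrices; a one-function sketch (the names are mine, and alpha is the weight discussed next):

    def adapt_covariance(C_in, C_out, alpha):
        # Count-based MAP smoothing: alpha = 0 keeps the out-of-domain
        # matrix, alpha = 1 trusts only the in-domain sample covariance.
        return alpha * C_in + (1.0 - alpha) * C_out

    # applied to both PLDA matrices:
    # across = adapt_covariance(across_in, across_out, alpha)
    # within = adapt_covariance(within_in, within_out, alpha)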
0:05:10 And what we showed earlier in the supervised adaptation is that we can get very good performance. Let's get used to this graph; I'm going to show a couple more like it. The red line at the top is the out-of-domain system, the one with the bad performance, trained purely on Switchboard. The green line at the bottom is the matched in-domain system; that's our target if we had all of the in-domain data. And what we're doing is taking various amounts of in-domain data to see how well we can exploit it. Even with a hundred speakers we can cut seventy percent of that gap with this adaptation process, and if we use the entire dataset we get the same performance, actually slightly better by using both sets than just the in-domain set.
0:05:53 One of the questions with this is how do we set this alpha parameter. In theory, if we knew the prior exactly, it would tell us what alpha should be, but empirically the main point of this graph is that we're not very sensitive to it. If alpha is zero, we are entirely the out-of-domain system, and it's always pretty bad. If alpha is one, we are entirely trying to build an in-domain system: if we have almost no in-domain data we get very bad performance, but as soon as we start to have data that system is pretty good. But we're always better off staying somewhere in the middle and using both datasets, using a combination.
0:06:29 Now, in this work the theme is unsupervised adaptation. What that means is that we no longer have labels for this pile of in-domain data. So it's the same setup, but now we don't have labels.
0:06:46 This means we want to do some kind of clustering, and we found empirically, as I think people in the i-vector challenge seem to have found as well, that AHC, agglomerative hierarchical clustering, is a particularly good algorithm for this task, for whatever reason. And you can measure clustering performance: if you actually have the truth labels, you can evaluate a clustering algorithm by purity and fragmentation, purity being how pure your clusters are, and fragmentation being how much a speaker was accidentally distributed into other clusters.
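The talk doesn't give the exact formulas, but one plausible way to compute such metrics is sketched below; these particular definitions are my assumption, not necessarily the ones used in the experiments.

    from collections import Counter

    def purity(true_spk, cluster_id):
        # Fraction of all cuts that belong to the dominant speaker of
        # their cluster (a common definition of cluster purity).
        hits = 0
        for c in set(cluster_id):
            members = [s for s, k in zip(true_spk, cluster_id) if k == c]
            hits += Counter(members).most_common(1)[0][1]
        return hits / len(true_spk)

    def fragmentation(true_spk, cluster_id):
        # Average number of extra clusters each speaker is split
        # across; zero means no speaker was fragmented.
        spk2clusters = {}
        for s, k in zip(true_spk, cluster_id):
            spk2clusters.setdefault(s, set()).add(k)
        extras = [len(ks) - 1 for ks in spk2clusters.values()]
        return sum(extras) / len(extras)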
0:07:20 One of the things we spent quite a bit of time on, and in fact Daniel spent a lot of time making an i-vector averaging system, is what metric to use for the clustering. You've got to do hierarchical clustering, you're going to work your way up from the bottom, but what's the definition of whether two clusters should be merged?
0:07:38 PLDA theory gives an answer for that: a speaker hypothesis test that these two clusters are the same speaker. That's something that we've worked with in the past, and as soon as we started it up this year we saw that it really doesn't work well at all, which is a little disappointing from a theoretical point of view. But we found that in the SREs as well: when we have multiple cuts, using the correct formula doesn't always work as well as we would like.
0:08:05 What we traditionally do in an SRE is i-vector averaging, which is pretending we have a single cut. Daniel spent a lot of time on that this summer, and then we found out that in fact the simplest thing to do is the best performing system: compute the score between every pair of cuts, get a matrix of scores, and then never recompute any metrics, just average the scores. It's also much easier, because you don't have to get inside your algorithm at all: you just precompute this distance matrix and feed it into off-the-shelf clustering software.
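Averaging precomputed pairwise scores between clusters is exactly average-linkage AHC on a fixed similarity matrix, so off-the-shelf tooling applies. Here is a minimal scipy sketch, assuming a symmetric PLDA score matrix and a calibrated same-speaker threshold (from the stopping criterion described below); the details are mine, not the exact JHU recipe.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    def ahc_from_scores(scores, threshold):
        # Flip PLDA similarities into distances (larger score = closer),
        # then run average-linkage AHC without ever rescoring clusters.
        dist = scores.max() - scores
        np.fill_diagonal(dist, 0.0)
        Z = linkage(squareform(dist, checks=False), method='average')
        # Stop merging once the most similar pair of clusters falls
        # below the calibrated same-speaker threshold.
        return fcluster(Z, t=scores.max() - threshold, criterion='distance')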
0:08:38 So just as a baseline, we compared against k-means clustering with this purity and fragmentation, and the main point is that AHC with this scoring metric was in fact quite a bit better than k-means, so we're comfortable that it seems to be clustering in an intelligent way.
0:08:56 Now we want to move towards doing it for adaptation, but the other thing we need to know is how to decide when to stop clustering: how do we decide how many speakers are really there, because nobody has told us. To do this you have to at some point make a decision that you're going to stop merging, and basically you look at the two most similar clusters and you've got to decide, are these from different speakers or are they the same, and make a hard decision.
0:09:19 And this is one of the nice contributions of this work, which was really done after the summer, I think: we just treat these scores as speaker recognition scores, and we do calibration in the way that we always do. In particular, the unsupervised calibration method that Niko and Daniel presented at ICASSP can be used exactly in this situation: we can take our unlabeled pile of data and look at all the scores across it to learn a calibration, so we actually know the threshold and we can make a decision about when to stop.
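As a rough illustration of that idea, and not the actual ICASSP method, which is a proper generative model: fit a two-component mixture to the pooled pairwise scores without any labels, and read off a same-speaker threshold where the higher-scoring component takes over.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def unsupervised_threshold(scores):
        # Pool all pairwise scores (upper triangle, no self-scores).
        s = scores[np.triu_indices_from(scores, k=1)].reshape(-1, 1)
        gmm = GaussianMixture(n_components=2, random_state=0).fit(s)
        same = np.argmax(gmm.means_.ravel())  # "same-speaker" component
        # Scan for the score where that component's posterior hits 0.5.
        grid = np.linspace(s.min(), s.max(), 1000).reshape(-1, 1)
        post = gmm.predict_proba(grid)[:, same]
        return float(grid[np.argmax(post > 0.5)])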
0:09:53 So how well does that work? This is across our unlabeled pile as we introduce bigger and bigger piles. The dashed line is the correct number of clusters. These are five random draws, where we take random subsets and average the performance, and the blue is the average, which is the easiest one to see. You can see that in general this technique works pretty well. It always underestimates, typically by about twenty percent, so you think there are fewer speakers than there really are, but you're pretty close. And as for having an automated and reliable way to figure out how many speakers are there: we're pretty excited to even do this well at it; that's a very hard task.
0:10:36 So to actually do the adaptation, then, the recipe is: we use our out-of-domain PLDA to compute the similarity matrix of all pairs; we then cluster the data using that distance metric; we estimate from this how many speakers there are, along with the speaker labels; we generate another set of covariance matrices from this self-labeled data; and then we apply our adaptation formulas to this data.
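Stitching the earlier sketches together, the whole recipe might look like the following; score_all_pairs, across_class, and within_class are hypothetical stand-ins for the PLDA scoring function and its two matrices, and the other helpers are the ones sketched above.

    def unsupervised_adapt(out_plda, in_ivectors, alpha):
        # 1. Score all pairs of unlabeled in-domain cuts with the
        #    out-of-domain PLDA (score_all_pairs is hypothetical).
        scores = out_plda.score_all_pairs(in_ivectors)
        # 2. Learn a stopping threshold from the unlabeled score pool.
        threshold = unsupervised_threshold(scores)
        # 3. Self-label the pile with average-linkage AHC.
        labels = ahc_from_scores(scores, threshold)
        # 4. Re-estimate covariances from the self-labels, then smooth
        #    them back towards the out-of-domain matrices.
        ac_in, wc_in = estimate_plda_covariances(in_ivectors, labels)
        return (adapt_covariance(ac_in, out_plda.across_class, alpha),
                adapt_covariance(wc_in, out_plda.within_class, alpha))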
0:11:11 So here's a similar curve to before: the red here is the out-of-domain system, and the in-domain system is in green at the bottom. What we're also showing here is the AHC adaptation performance, where we estimate the number of speakers, and the supervised adaptation. No, sorry, the supervised adaptation is the one I showed before, excuse me. That's what you get if you have the labels for all of the data; that's what we accomplished the first time. Now we are self-labeling.
0:11:46 Of course we're not as good, but we are in fact much better than we ever thought we could be, because when we first set up this task we really didn't think it would work. In fact Daniel and I had a little bet, and he was convinced that this was never going to work, because how are you going to learn your parameters from a system that doesn't know what your parameters are? But in fact you can. So we've done surprisingly well with self-labeling, and we're still able to close eighty-five percent of the performance gap if we have all the data but it's unlabeled; we're still able to recover almost all the performance.
0:12:16 Now, what if we did know the number of clusters? If we had an oracle that told us there are exactly this many speakers, would that make our system perform better? That's the additional bar here, and in fact our estimation of the number of speakers is good enough, because even had we known it exactly, we would get almost the same performance. So even though we didn't get exactly the correct number of speakers, the hyperparameters that we estimated still work just as well.
0:12:48 And that's illustrated in this way, which is the sensitivity to knowing the number of clusters. Here we're using all the data; the actual number of speakers is here, and this is what we estimated with our stopping criterion. And you can see, as we sweep across all of these different points where we could have stopped and decided that was how many speakers there were, there's not a tremendous sensitivity. If we massively over-cluster then we take a big hit in performance, and if we massively under-cluster it is bad, but there's a pretty big fat region where we get almost the same performance with our hyperparameters had we stopped our clustering at that point.
0:13:28 So in conclusion, then: domain mismatch can be a surprisingly difficult problem in state-of-the-art systems using PLDA. We had already noted that supervised adaptation can work quite well, but in fact unsupervised adaptation also works extremely well: we can close eighty-five percent of the performance gap due to the domain mismatch. In order to do that we need to do this adaptation, using both the out-of-domain parameters and the in-domain parameters, not just self-labeling the in-domain data. And this unsupervised calibration trick in fact gives us a useful and meaningful stopping criterion for figuring out how many speakers are in our data. Thank you.
0:14:21 Time for questions.
0:14:30 I was wondering, I can imagine that the distribution of speakers, basically the number of segments per speaker in your unsupervised set, will make a difference, right? I guess you get this from these SRE or Switchboard data or whatever, so it will be relatively homogeneous. Is that correct, or...?
0:14:56 I think, yes. These classes are not all one size, they are not homogeneous, but this is a good pile of unlabeled data, because in fact it's the same pile that we used as the labeled dataset. So it's pretty much everything we could find from these speakers: some of them have very many phone calls, some of them have fewer, but all of them have quite a few in order to be in this pile. Obviously, for example, you couldn't learn any within-class covariance if you only had one example from each speaker hidden in that pile. So you're absolutely right: it's not just that we do the labeling, it's also that the pile itself has some richness for us to discover.
0:15:34 Before we hand over the microphone, I have a related question. When you train the i-vector extractor, the nice thing is that you can do it unsupervised, but again, how many cuts per speaker? If we had only one speaker with many cuts, obviously that's not good, because we don't get the speaker variability. The converse situation is where you have every speaker only once. Have you done any investigation of whether that would give a good extractor?
0:16:07 I don't think that's something we looked at, but I completely agree it would make me uncomfortable. As I said, in this effort we were just able to show that the out-of-domain data, where we assume we do have a good labeled set somewhere in some domain, could be used the rest of the time, so we're comfortable where it came from. I don't think I've ever run an experiment with what you describe, and that is interesting; I suspect it would not work so well. You'd want both kinds of variability, speaker and channel, coming from a variety of channels, though not quite in the same proportions as you get in the standard setup. If you collect data in the wild, in a situation where there are very many speakers, you might have data like that, so I think that's an interesting question too.
0:16:59thank you
0:17:01 Very impressive work and a nice set of results, so thank you for that. The question I have is: this is all telephone speech, and it works very well with that. Have you considered what would happen if the out-of-domain data was from different channels, such as microphone? And is that even realistic, in the sense that you would have a pre-trained microphone system that you try to adapt?
0:17:25 Right, so yes, we have looked at microphone. The very first work a few years ago on this task was adapting from telephone to microphone, and Daniel revisited it early in the summer, when we were debating with Doug, working on this dataset, whether we trusted it. He did a similar experiment with the SRE telephone and microphone data and actually got similar results. That does sound a bit surprising, but we have seen in the SREs that telephone to microphone is not nearly as hard as it ought to be. I don't know the reason for that, but yes, we have worked with telephone and microphone in SREs, and it's not shockingly different from these results.
0:18:06 I can just add to that answer, on the previous question: we trained an i-vector extractor on a database where there is only one speaker per utterance, and it works about the same as using the 2004 and 2005 data or so.

0:18:29 I didn't know that; that's good to think about. Thank you.
0:18:32 So maybe a really stupid question: yesterday people were mentioning how well the mean shift clustering algorithm is working. You don't seem to use that; you use agglomerative clustering. Is there a reason why?
0:18:52 I believe over the course of the summer we and other people looked at quite a few different algorithms. I know mean shift is used in diarization, and we have looked at it in diarization; I cannot remember if we looked at it for this task. We did look at others, and Stephen is going to talk about some other clustering algorithms, but I don't think he's going to talk about mean shift, so I'm not sure how it compares. It clearly is also useful.
0:19:28 I just want to know if the splits and the protocols are available.

0:19:33 Yes, they are on the JHU website; the link is in the paper.

0:19:38 Okay, thanks.

You have to get the speech data itself elsewhere, but the lists are there. I encourage you to work on this task.
0:19:49 One more question. Let's suppose that you were asked to do not speaker clustering but gender clustering, and you don't have any prior on how many genders there are in the input data. Would the stopping criterion be the same? I mean, would you find the genders? I'm not sure. Or would the clustering accidentally find gender?

0:20:14 Well, let me say one thing first: we did, I think I forgot to mention, make this a gender-independent system.

0:20:20 Well, suppose gender is the classes it tries to cluster, and not the speakers; would any of the resulting clusters be correct?

0:20:27 Well, this is why Daniel thought this wouldn't work: who knows what you're going to cluster by. We're just using the metric, and we are hoping that the out-of-domain PLDA metric is encouraging the clustering to focus on speaker differences, but we cannot guarantee that except with the results.
0:20:49 I think more so than gender: if for example languages were different, and there might be some in this data, you might think that the same speaker speaking multiple languages would confuse our clustering.
0:21:02 So, I'd like to point to one aspect that I think is very important, especially in the forensic framework. Could you show slide five?

0:21:15 Probably.
0:21:24 So, what you've neglected here is the decision threshold.

0:21:30 Yes, we have neglected calibration of the final task, that's true.

0:21:34 So it could possibly be that a factor of three becomes a factor of one hundred, or the factor-of-three degradation could actually become a factor of one.

0:21:46 It could, yes; that is something we simply neglected. You are right, George, although we would think that with the unsupervised calibration we could accomplish that calibration as well.

That is what I would like you to do when you get home, yes.
0:22:02 And to annotate this slide with the decision points. All these systems are not even calibrated, so...

0:22:14 Well, we always have to run a separate calibration process to get to a single operating point.

0:22:21 But go ahead and do that work when you go home. You're only going to have to do it for the in-domain system, and then you apply a threshold, put the dots on those two curves, and send me a copy.
0:22:38 Thank you very much; we will work on our assignment then.

0:22:42 That question is already partially answered by our unsupervised score calibration paper, which was published at ICASSP, so that's true.

0:22:56 Okay, so we thank the speaker.