Speech Transcript - Unsupervised Clustering Approaches for Domain Adaptation in Speaker Recognition Systems

0:00:15	and so what i'm gonna talk about is also what we did last
0:00:20	summer at hopkins and again this is work between myself
0:00:26	doug reynolds daniel and alan
0:00:29	so
0:00:31	actually
0:00:32	i really hope
0:00:35	that
0:00:37	well okay
0:00:38	four years ago
0:00:39	was my first odyssey
0:00:41	ever and was my first conference presentation i made a joke
0:00:45	at the start of the slides that turned out to be far more memorable
0:00:48	than the presentation itself
0:00:50	i wanted to do the same this time but i couldn't come up
0:00:54	with any good stories so
0:00:55	i was gonna give you this picture in which i hope none of you will
0:00:59	look like by the end of my presentation
0:01:03	okay
0:01:04	all right now
0:01:06	so what i'm going to talk about is unsupervised clustering approaches
0:01:09	for domain adaptation in speaker recognition systems
0:01:13	and first off i guess the title's a bit of a handful so i'm gonna
0:01:17	break it down and explain kinda each piece one at a time
0:01:21	so domain adaptation
0:01:23	is
0:01:24	one in which
0:01:27	where
0:01:28	most
0:01:29	current statistical learning techniques assume somewhat incorrectly rather that
0:01:33	the training and test data come from the same underlying distribution right
0:01:38	and
0:01:38	so what we know in general is that labeled data may exist in one domain
0:01:42	but what we want is a model that can also perform well in a related
0:01:47	but say not necessarily identical domain
0:01:51	and labeling data
0:01:53	in this particular new domain may be difficult and or expensive
0:01:57	and so what can we do
0:01:59	to leverage the original labeled out-of-domain data when building a model to work with
0:02:04	this in domain data
0:02:07	so there's nothing new here everything we've heard before in the previous presentation so for speaker
0:02:12	recognition systems once again we're all rather familiar with the i-vector
0:02:17	approach
0:02:19	and that's
0:02:20	really just you know your standard segment length independent
0:02:25	low dimensional vector based representation of the audio
0:02:30	and what the i-vector allows us to do is to use large
0:02:35	amounts of previously collected and labeled audio to characterize and exploit speaker and channel variability
0:02:42	that's right and usually that entails the use of you know thousands of speakers making
0:02:47	tens of calls each
0:02:49	so
0:02:51	unfortunately it is a bit unrealistic to expect that most applications will have access to
0:02:56	such a large set of labeled data from matched conditions
0:03:01	and so
0:03:02	well here's you know the anatomy of the standard i-vector system that's very similar and
0:03:07	almost actually identical to what alan had shown and here yet again the thing
0:03:13	to note is that
0:03:15	you know your ubm your i-vector extractor and your resulting mean subtraction and length
0:03:21	normalisation do not require the use of labels
0:03:25	what does require some labels are
0:03:27	your within class and across class covariance matrices
0:03:31	and that's where the labels come in so
0:03:34	that's what we've got now the first thing i'd like to do is sort of just
0:03:38	paint
0:03:40	a larger picture of what we've done
0:03:42	and to demonstrate that mismatch right
0:03:45	between our two domains
0:03:47	similar to the det curve plot that alan had shown what we start with
0:03:51	is we enroll and score on sre two thousand ten
0:03:55	and what we denote as the in domain set is that of the sre data
0:03:59	i mean that's all the telephone calls from mixer o four five six and two
0:04:04	thousand eight collections
0:04:06	now the mismatched out-of-domain data is all the switchboard data
0:04:11	which are all the calls from those collections
0:04:15	so in general
0:04:17	some summary statistics there
0:04:19	what we're basically looking at is that the
0:04:21	number of speakers
0:04:24	the number of calls the average number of calls per speaker and the number of channels
0:04:28	that each speaker spoke on are relatively the same and to help with that as
0:04:32	a visualization
0:04:35	that's kind of the normalized histogram of the distribution of
0:04:39	the number of utterances per speaker between the two
0:04:44	between the two sets of data well they pretty much all overlap
0:04:49	but in blue is the switchboard and in red is that of the sre
0:04:54	so what we can say is that we would not expect a large performance gap
0:04:59	between these two sets of data if indeed
0:05:03	our
0:05:04	our ability you know our training were
0:05:07	dataset independent and robust across datasets
0:05:11	so
0:05:12	what we found obviously is that this is not the case which is why
0:05:15	we ended up having a summer workshop
0:05:18	on it and so just
0:05:19	to give a summary of
0:05:21	equal error rate results for the rest of the talk i'll just be using equal error rate to provide
0:05:25	a summary set of results and
0:05:30	what we have is
0:05:33	in red i've denoted just
0:05:36	the portion of the system that actually requires
0:05:40	labels and i have also
0:05:42	shown what we had at hopkins over the summer and what
0:05:46	we
0:05:46	replicated at mit
0:05:49	as well so you can see that if we use all switchboard to train everything
0:05:53	we'll get a set of results around seven percent equal error rate
0:05:58	and if we just use all of the sre
0:06:01	we will get around two and a half percent
0:06:04	so now if we start varying the ingredients that we used to actually train these
0:06:07	systems
0:06:09	in particular let's say we just switch this one here we go from switchboard
0:06:14	for your whitening parameters that's the mean subtraction et cetera
0:06:20	and you switch it to use sre
0:06:22	you get a little bit of a gain you go down from seven percent
0:06:25	to five
0:06:27	and subsequently if you
0:06:30	stick with
0:06:32	a switchboard to do your
0:06:34	ubm
0:06:35	and i-vector extraction
0:06:39	then
0:06:40	and also
0:06:42	keep the sre for our whitening and use the sre labels
0:06:46	basically here then you get down to
0:06:50	under two and a half percent which is actually better than the last row here
0:06:53	not gonna try and explain
0:06:55	what happens there but
0:06:57	what we decided from then on is that we were obviously going to focus
0:07:01	on the performance gap
0:07:02	between the use of sre and the use of switchboard labels for our within and
0:07:10	across class covariance matrices
0:07:12	so
0:07:13	that's what we'll continue on and
0:07:16	so basically this will be the baseline that we've got and this will be the
0:07:21	benchmark that we're trying to hit or even to better
0:07:26	so
0:07:27	the rules for this what we call the domain adaptation challenge task are that we're
0:07:34	allowed to use switchboard all the data and all of the labels
0:07:39	and
0:07:40	we're allowed to use the
0:07:42	sre data but none of the labels
0:07:45	and obviously we're gonna evaluate on the twenty ten sre
0:07:50	so before we actually jump into that though what we'd like to do perhaps
0:07:54	is to demystify the domain mismatch i got a lot of questions
0:07:57	like what actually is the difference between these two datasets that might cause such a
0:08:01	gap
0:08:02	in there
0:08:03	and so we
0:08:04	began and did a little bit of a rudimentary analysis of actually what was
0:08:09	going on
0:08:09	and well some of the
0:08:12	clear questions that you might wanna
0:08:14	or might think of are well is it the speaker age right
0:08:18	or is it perhaps the languages spoken in particular switchboard contains only english and it's
0:08:24	collected over a decade and
0:08:29	it's a decade that preceded that of the sre and the
0:08:33	sre contains more than twenty different languages right so the question is whether or
0:08:38	not
0:08:39	that might have caused some of the shift in variabilities that we see
0:08:45	or the difference in performance and some of this
0:08:47	work
0:08:48	was previously also explored i believe by columns back in twenty twelve
0:08:54	and what we found however was that there was really no
0:08:59	effect of either
0:09:02	age
0:09:03	or language spoken
0:09:05	and so
0:09:07	with that
0:09:08	well
0:09:08	one
0:09:09	the next step then was to look at something else
0:09:12	which was
0:09:14	that of the switchboard
0:09:16	itself and what we found was well we realised first off that switchboard
0:09:21	was collected in different phases over approximately a decade and so
0:09:27	what happens when we use different subsets
0:09:31	we just use different subsets to build our models
0:09:35	and so
0:09:36	well we ended up finding
0:09:37	was
0:09:39	the following if you take switchboard cellular both parts and those are
0:09:45	the most recent ones
0:09:47	you actually get a starting baseline so the previous starting baseline was five and a
0:09:51	half percent
0:09:53	you actually get a starting baseline percentage of four point six which is a little
0:09:57	bit better and now if you also add in switchboard phase three you can
0:10:03	actually start all the way down at three and a half percent
0:10:07	and then but then as you keep adding these i guess you could say maybe
0:10:11	older
0:10:12	the older portions of switchboard on you might you'd start actually doing a bit worse
0:10:19	and
0:10:19	and that's what we found and i think similar work was also
0:10:23	done and presented by hagai during the summer and over at icassp
0:10:28	ours is a slightly different take on it but
0:10:32	that's kind of what we noticed as we're trying to analyze the
0:10:36	mismatch that
0:10:37	basically with the differences within switchboard itself selecting out some of those particular subsets
0:10:45	might actually
0:10:47	affect the baseline performance
0:10:49	so then the next question then is alright so should we actually just
0:10:54	continue with that three and a half
0:10:56	and
0:10:57	also secondly can you actually
0:11:00	just
0:11:01	find some automatic way of selecting out the out-of-domain data that you actually wanna end
0:11:08	up using okay
0:11:09	to do your initial domain adaptation
0:11:12	or even just select the labeled data that you want to
0:11:17	use that best matches the in domain data that you have right
0:11:21	and so what we
0:11:22	did again was and this is just a couple of naive exploratory experiments
0:11:26	we said alright
0:11:28	if we
0:11:29	did an automatic subset selection so
0:11:32	in particular
0:11:34	first off this is the three and a half percent equal error rate
0:11:38	that you get from the cellular and
0:11:40	the phase three that's the best we did
0:11:42	and this here the five and a half percent is approximately what
0:11:46	you get if you use all of the data all the switchboard and started off
0:11:51	there so instead if you
0:11:53	these two lines let's focus on the blue for a second that's if you select
0:11:59	the proportion of scores
0:12:02	or proportion of i-vectors that are
0:12:05	at the highest probability density function value
0:12:12	with respect to the sre so you select
0:12:18	a subset of the switchboard automatically that were closest in likelihood to the sre
0:12:24	marginal
0:12:25	and you increase the proportion how would you do in terms of the baseline
0:12:30	performance
0:12:31	and similarly the lda
0:12:33	one is if you took switchboard and
0:12:36	and
0:12:37	sre and you try to
0:12:39	learn just a simple
0:12:41	one dimensional linear separator between the two and i take the ones that
0:12:46	are closest to
0:12:49	the sre data and i rank them that way so
0:12:52	how well can i do and basically what we can see is obviously
0:12:55	if you use all of the scores
0:12:58	you've done nothing different
0:13:00	but you know as you use just some proportion of
0:13:03	the likelihood
0:13:04	or proportion of these top ranking scores
0:13:07	you can actually do a little bit better than our baseline however
0:13:10	you never approach
0:13:12	this three and a half that seems to be set by this particular this magical subset
0:13:17	that was not
0:13:19	so that was the initial exploration of the domain mismatch that we did
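A minimal sketch of the likelihood-based subset selection described above, written in plain Python. It assumes a single diagonal-covariance Gaussian as the model of the in-domain (sre) marginal, which is a simplification chosen here for illustration; the talk does not say which density model was used. The names `fit_diag_gaussian`, `log_density` and `select_closest` are hypothetical.

```python
import math

def fit_diag_gaussian(vectors):
    """Fit a diagonal-covariance Gaussian (assumed model) to a list of vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    # Floor the variance so a constant dimension cannot cause division by zero.
    var = [max(sum((v[d] - mean[d]) ** 2 for v in vectors) / n, 1e-6)
           for d in range(dim)]
    return mean, var

def log_density(x, mean, var):
    """Log probability density of x under the diagonal Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * s) + (xi - m) ** 2 / s)
               for xi, m, s in zip(x, mean, var))

def select_closest(out_domain, in_domain, proportion):
    """Indices of the out-of-domain vectors with the highest in-domain likelihood."""
    mean, var = fit_diag_gaussian(in_domain)
    ranked = sorted(range(len(out_domain)),
                    key=lambda i: log_density(out_domain[i], mean, var),
                    reverse=True)
    keep = max(1, round(proportion * len(out_domain)))
    return sorted(ranked[:keep])
```

Sweeping `proportion` from a small value up to 1.0 and retraining on the selected subset each time would give a curve of the kind described for the blue line; at 1.0 every out-of-domain vector is kept, so nothing changes relative to the baseline.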
0:13:23	now
0:13:24	covered most of the set up most of the problem
0:13:27	and
0:13:29	now i can continue on with the rest of the work
0:13:32	so
0:13:33	the bootstrap framework that i'm gonna go over one more time it's pretty standard
0:13:37	for the domain adaptation we begin with our prior across class and within class hyper
0:13:43	parameters
0:13:44	and then we use
0:13:45	plda to compute a pairwise affinity matrix
0:13:49	on the sre data
0:13:51	subsequently we'll do some form of clustering on that pairwise affinity matrix to obtain
0:13:56	some hypothesized cluster labels we'll use these labels to obtain another set
0:14:01	of hyper parameters
0:14:03	and then we linearly interpolate
0:14:08	as alan showed and then potentially we iterate let me
0:14:12	just make this look better
0:14:14	between mac and windows so that's actually how that slide's supposed to look
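The re-estimation and interpolation steps of this recipe can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names are hypothetical, the covariance estimates are simple sample averages, and the convention that alpha weights the in-domain (cluster-derived) estimate while 1 - alpha weights the out-of-domain prior is an assumption.

```python
def class_covariances(vectors, labels):
    """Estimate (within_class, across_class) covariances from labeled vectors.

    labels may be true speaker labels (out-of-domain prior) or hypothesized
    cluster labels (in-domain estimate).
    """
    dim = len(vectors[0])
    groups = {}
    for v, lab in zip(vectors, labels):
        groups.setdefault(lab, []).append(v)
    means = {lab: [sum(v[d] for v in vs) / len(vs) for d in range(dim)]
             for lab, vs in groups.items()}
    global_mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

    within = [[0.0] * dim for _ in range(dim)]
    across = [[0.0] * dim for _ in range(dim)]
    # Within class: spread of each vector around its own class mean.
    for v, lab in zip(vectors, labels):
        m = means[lab]
        for i in range(dim):
            for j in range(dim):
                within[i][j] += (v[i] - m[i]) * (v[j] - m[j]) / len(vectors)
    # Across class: spread of the class means around the global mean.
    for m in means.values():
        for i in range(dim):
            for j in range(dim):
                across[i][j] += ((m[i] - global_mean[i])
                                 * (m[j] - global_mean[j]) / len(means))
    return within, across

def interpolate(in_domain, out_domain, alpha):
    """Linear interpolation: alpha * in_domain + (1 - alpha) * out_domain."""
    return [[alpha * a + (1 - alpha) * b for a, b in zip(row_in, row_out)]
            for row_in, row_out in zip(in_domain, out_domain)]
```

Calling `interpolate` separately on the within class and across class matrices allows a different alpha for each, and the interpolated matrices can be fed back into the scoring step if the procedure is iterated.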
0:14:21	so
0:14:22	basically that's the set up and we'll just run through some clustering algorithms and i put
0:14:27	unsupervised in parentheses because you know all clustering algorithms have at least some parameter
0:14:33	that you can tune right
0:14:35	so to start off what we'll find later on is that hierarchical clustering really
0:14:39	does do the best
0:14:41	however
0:14:43	in light of you know the stopping criterion that you choose or the cluster merging
0:14:46	criterion those are kind of up to the user to choose but we find that
0:14:50	with some reasonably appropriate choice hierarchical clustering does do the best the two algorithms
0:14:56	that we also explored pretty extensively were some graph based random walk algorithms
0:15:02	and those are known as infomap and markov clustering i'm not gonna
0:15:06	go into the details about those but feel free to ask me offline or
0:15:09	at the end of the presentation
0:15:12	and those you know you basically have a graph where each node is an
0:15:17	i-vector and then you have some edges on that a contain
0:15:21	perhaps and edges and then you do some clustering on those edges
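A small sketch of hierarchical clustering on a pairwise affinity matrix, to make the role of the merging and stopping criteria concrete. Average linkage and a fixed score threshold are assumptions made for this illustration, since the talk leaves both choices to the user; `agglomerative_cluster` is a hypothetical name, and the affinity values would in practice be plda scores between i-vectors.

```python
def agglomerative_cluster(affinity, stop_threshold):
    """Greedy bottom-up clustering on a symmetric affinity matrix.

    Repeatedly merges the two clusters with the highest average pairwise
    affinity and stops when that best score drops below stop_threshold.
    Returns one integer cluster label per item.
    """
    clusters = [[i] for i in range(len(affinity))]

    def average_affinity(a, b):
        return sum(affinity[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                score = average_affinity(clusters[x], clusters[y])
                if best is None or score > best[0]:
                    best = (score, x, y)
        if best[0] < stop_threshold:
            break  # stopping criterion: no remaining pair is similar enough
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    labels = [0] * len(affinity)
    for cluster_id, members in enumerate(clusters):
        for i in members:
            labels[i] = cluster_id
    return labels
```

Raising or lowering `stop_threshold` changes the number of clusters that come out, which corresponds to stopping at different points of the hierarchical tree.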
0:15:25	so our initial findings this is really no different from what alan had shown
0:15:30	previously but what's mainly true is that in the presence of
0:15:35	interpolation
0:15:37	an imperfect clustering is in fact forgivable
0:15:41	this here
0:15:41	is just the plot that says we took a thousand speaker subset
0:15:46	and this shows a cluster error just some kind of cluster error
0:15:51	and
0:15:53	these solid lines in green and red
0:15:57	are if you
0:15:58	knew
0:15:58	the
0:15:59	the cluster labels
0:16:04	if you knew the cluster labels were pure and didn't have to do any automatic clustering
0:16:06	and then the rest of these lines here in dotted lines are
0:16:12	basically
0:16:13	what you would do if you
0:16:16	clustered or stopped your clustering at different points of a
0:16:21	at different points of the hierarchical tree okay and basically the thing is that
0:16:26	this bowl is incredibly flat okay
0:16:29	and also the last thing is that
0:16:32	alpha star itself is basically the best adaptation parameter similar to what alan just talked about
0:16:40	so
0:16:41	however
0:16:42	one thing that we kinda glossed over so far is that alpha
0:16:46	itself needs to be estimated you can do it in a more principled
0:16:52	way via the counts of
0:16:55	the relative dataset sizes or you can look at it empirically and you
0:16:59	can separate you know you can do your alpha for within class differently from
0:17:03	the alpha of your across class and
0:17:05	and that's
0:17:06	that seems to be empirically the case the better ones seem to be this
0:17:10	way and so you can see we range across the alphas on both sides
0:17:15	for the within class and the across class and find that this is approximately the
0:17:20	best for one particular subset of a thousand speakers however
0:17:25	and it seems like
0:17:27	alpha star itself is an open and unsolved problem but actually it's not so bad
0:17:31	because if we rescale this plot to within ten percent of this optimal equal
0:17:36	error rate we can actually find that
0:17:40	there's actually a range of values
0:17:44	a range of values for alpha that would actually yield pretty
0:17:48	good
0:17:49	results
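The "within ten percent of the optimum" check can be written down in a few lines. This is only an illustrative sketch: `acceptable_alphas` is a hypothetical helper, and the equal error rates passed to it would come from an actual sweep over alpha, which is not reproduced here.

```python
def acceptable_alphas(eer_by_alpha, tolerance=0.10):
    """Return the alphas whose equal error rate is within a relative
    tolerance of the best (lowest) equal error rate in the sweep."""
    best = min(eer_by_alpha.values())
    limit = best * (1 + tolerance)
    return sorted(alpha for alpha, eer in eer_by_alpha.items() if eer <= limit)
```

If the returned list covers a wide, contiguous stretch of alpha values, the system is not very sensitive to the exact choice, which is the point being made about the rescaled plot.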
0:17:52	so results so far i'm running a bit out of time but
0:17:58	basically the best you can do is roughly within fifteen percent of the absolute
0:18:04	best the best we can do with automatic methods is
0:18:08	close that gap by about eighty five percent
0:18:12	so the take-home ideas for now are that given interpolation an imprecise estimate of
0:18:18	the number of clusters is okay
0:18:21	there is a range of adaptation parameters that would yield reasonable results and the best
0:18:25	automatic system gives us within fifteen percent of a system that has access to
0:18:29	all speaker labels
0:18:31	now going forward between alan's talk and mine
0:18:35	we wonder well
0:18:36	i mean for this telephone to telephone domain mismatch simple solutions work already
0:18:41	and
0:18:42	and we'd like to
0:18:44	and what we've been working on is to explicitly identify the sources of this mismatch
0:18:49	and that's kinda ongoing work at the moment but the question just like mitch brought
0:18:53	up a couple seconds ago at the end of alan's talk
0:18:57	what can we do about telephone to microphone domain mismatch i did the work independently
0:19:01	actually i did not know that
0:19:05	alan and daniel had done this and what i'm about to show
0:19:09	is not in the paper itself but
0:19:11	it's just a little add-on
0:19:13	and lastly what else we can talk about is out of domain detection like
0:19:17	when
0:19:19	when does a system know that it actually needs
0:19:25	some additional
0:19:27	albeit unlabeled data but you know that it cannot perform at the level it
0:19:32	usually does so that's perhaps an instance of like outlier detection or something like that
0:19:38	that we will also look into that's something sort of a future
0:19:42	work kind of thing
0:19:43	so
0:19:45	what i will really quickly show is a quick visualization using some low dimensional embeddings
0:19:51	actually
0:19:54	and basically what we're gonna start with is
0:19:57	if you have switchboard
0:19:59	and sre and those are these are all the i-vectors in there and i'm gonna
0:20:03	collapse
0:20:04	a lot of i-vectors into a very low dimensional space which is why it just looks very
0:20:07	cloudy at the moment
0:20:09	it's harder to
0:20:10	fit a lot of points into
0:20:12	a small space and still have them preserve their relative
0:20:17	distances
0:20:18	however this is
0:20:19	if i try to learn first off an unsupervised
0:20:24	embedding that
0:20:25	just takes all the data and learns some low dimensional visualization here
0:20:29	and then i apply the colouring to this plot
0:20:32	so what it shows here is that we have switchboard
0:20:35	in blue and we have the sre data in red and you can kinda see
0:20:39	that there is a little bit of separation
0:20:42	you
0:20:43	perhaps right but they're also a little bit on top of each other
0:20:47	now just one other point that i talked about earlier
0:20:53	if we just took that subset
0:20:57	that magical subset that gave us that three and a half percent that magical subset
0:21:01	of switchboard we get this in green and we have the sre in red
0:21:06	as well and so they're pretty uniformly distributed around the sre data itself
0:21:10	right
0:21:12	on the other hand
0:21:14	if you just
0:21:15	if you just take the remaining amount of data
0:21:17	and we leave it in blue the old switchboard stuff
0:21:20	they're actually like a little farther away than the rest of the sre itself so
0:21:25	that kind of maybe gives some idea of how things work or why
0:21:31	performance was as it was and
0:21:34	however if you take a look at
0:21:37	telephone and microphone
0:21:38	if you do the same
0:21:39	it's the same kind of an embedding
0:21:42	then
0:21:44	you will
0:21:45	get a completely different a much more separate sort of
0:21:51	visualization and that sort of just illustrates that i think telephone and microphone
0:21:56	can be a harder problem however i guess initial results have also shown that it's
0:22:00	actually not as bad as maybe this visualization shows so i'll stop there
0:22:05	and take any questions
0:22:22	you said that you have found that the language is not the cause of this
0:22:27	domain mismatch how did you find that
0:22:31	let me think so
0:22:32	but like basically
0:22:36	well i basically held
0:22:38	like the different languages out
0:22:40	held the various different languages out
0:22:43	of the sre data and just tried to basically see whether
0:22:48	that was
0:22:51	that would be like distinctly different from that of the
0:22:56	sorry
0:22:56	sorry no what i did was i basically looked at it and saw
0:23:02	whether or not
0:23:03	the different languages clustered together in a sense
0:23:09	in general that's how we went about trying to tease apart whether or
0:23:13	not the languages
0:23:14	were
0:23:16	a source of domain mismatch
0:23:21	so you look at t s and you can just like that's on now
0:23:25	no
0:23:29	let's talk offline about that i'm actually forgetting some of the details of
0:23:33	the language experiment exactly at the moment but
0:23:38	let's talk offline about that
0:23:40	sorry
0:23:46	this is in no the beginning of the two you have table issue the did
0:23:52	you know if you the
0:23:53	of what you used for training the ubm and also
0:23:57	it did you try this which is
0:24:00	put in the training switchboard in a city
0:24:03	yes we did originally and there was nothing terribly different
0:24:08	it was just about the same okay thanks there's really no difference
0:24:18	so switchboard and mixer were collected over a wide range of years
0:24:24	so maybe the year is the dependent variable that shows the evolution of the telephone network and
0:24:31	how
0:24:32	speech is transmitted over the telephone network now compared to in nineteen
0:24:38	ninety nine
0:24:39	absolutely no i totally agree that's actually one of the that's almost exactly a sentence
0:24:44	like that we wrote in the paper yes
0:24:46	and that's a
0:24:48	a potential like a hypothesis that
0:24:50	i'm certainly willing to believe thanks
0:24:54	even a related question
0:24:56	the p lda has the within and between speaker covariance parameters so
0:25:03	which of those most needs to be adapted when moving from switchboard to
0:25:07	the mixer i think that's shown
0:25:13	go with
0:25:16	this one right
0:25:18	so
0:25:19	the one that most needs to be adapted would be that within class
0:25:25	variability relative to
0:25:27	the across class at the it shown in so that we just
0:25:32	the speakers the speaker distribution
0:25:34	so i more this constant exactly but the left channels
0:25:39	so that we which is what you need more weight within
0:25:44	it's very even and

Unsupervised Clustering Approaches for Domain Adaptation in Speaker Recognition Systems

Speaker Modeling II

Stephen Shum, Douglas Reynolds, Daniel Garcia-Romero and Alan McCree