0:00:15and so what i'm gonna talk about is also what we did last
0:00:20summer at hopkins and again this is work between myself and steven
0:00:26doug reynolds daniel and alan
0:00:32i really hope
0:00:37well okay
0:00:38four years ago
0:00:39was my first odyssey
0:00:41ever and it was my first conference presentation i made a joke
0:00:45at the start of the slides that turned out to be far more memorable
0:00:48than the presentation itself
0:00:50i wanted to do the same this time but i couldn't come up
0:00:54with any good stories so
0:00:55i'm just gonna give you this picture which i hope none of you will
0:00:59look like by the end of my presentation
0:01:04all right now
0:01:06so what i'm gonna talk about is unsupervised clustering approaches
0:01:09for domain adaptation in speaker recognition systems
0:01:13and first off i guess the title's a bit of a handful so i'm gonna
0:01:17break it down and explain kinda each piece one at a time
0:01:21so domain adaptation
0:01:24one in which
0:01:29current statistical learning techniques assume somewhat incorrectly rather that
0:01:33the training and test data come from the same underlying distribution right
0:01:38so what we know in general is that labeled data may exist in one domain
0:01:42but what we want is a model that can also perform well in a related
0:01:47but say not necessarily identical domain
0:01:51and labeling data
0:01:53in this particular new domain may be difficult and or expensive
0:01:57and so what can we do
0:01:59to leverage the original labeled out-of-domain data when building a model to work with
0:02:04this in domain data
0:02:07so there's nothing new here everything we've heard before in the previous presentation so speaker
0:02:12recognition systems i'll blast through this once again since we're all rather familiar with the i-vector
0:02:19and that's
0:02:20clearly just you know your standard segment length independent
0:02:25low dimensional vector based representation of the audio
0:02:30and what the i-vector allows us to do is to use large
0:02:35amounts of previously collected and labeled audio to characterize and exploit speaker and channel variability
0:02:42that's right and usually that entails the use of you know thousands of speakers making
0:02:47tens of calls each
0:02:51unfortunately it is a bit unrealistic to expect that most applications will have access to
0:02:56such a large set of labeled data from matched conditions
0:03:01and so
0:03:02well here's you know the anatomy of the standard i-vector system that's very similar and
0:03:07almost actually identical to what alan had shown and so yet again the thing
0:03:13to note is that
0:03:15you know your ubm your i-vector extractor and your resulting mean subtraction and length
0:03:21normalisation do not require the use of labels
0:03:25what does require some labels are
0:03:27your within class and across class covariance matrices
0:03:31and that's where the labels come in so
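To make the label dependence concrete, here is a minimal sketch in plain Python (not the speakers' code). It assumes i-vectors are lists of floats; `center_and_length_normalize` and `within_class_covariance` are hypothetical helper names, and the whitening transform is left out for brevity. The first function needs no speaker labels, while the second cannot be computed without them.

```python
import math


def center_and_length_normalize(vectors):
    """Subtract the global mean, then scale each vector to unit length (no labels needed)."""
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    normalized = []
    for v in vectors:
        centered = [v[d] - mean[d] for d in range(dim)]
        norm = math.sqrt(sum(x * x for x in centered)) or 1.0  # guard against a zero vector
        normalized.append([x / norm for x in centered])
    return normalized


def within_class_covariance(vectors, labels):
    """Average outer product of deviations from each speaker's own mean (labels required)."""
    dim = len(vectors[0])
    by_speaker = {}
    for v, label in zip(vectors, labels):
        by_speaker.setdefault(label, []).append(v)
    cov = [[0.0] * dim for _ in range(dim)]
    for speaker_vectors in by_speaker.values():
        mu = [sum(v[d] for v in speaker_vectors) / len(speaker_vectors) for d in range(dim)]
        for v in speaker_vectors:
            dev = [v[d] - mu[d] for d in range(dim)]
            for i in range(dim):
                for j in range(dim):
                    cov[i][j] += dev[i] * dev[j]
    total = len(vectors)
    return [[c / total for c in row] for row in cov]
```

In the domain adaptation task described later, the first step can be run directly on the unlabeled in-domain data, whereas the second is exactly what the hypothesized cluster labels are meant to stand in for.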
0:03:34that's what we've got now the first thing i'd like to do is sort of just
0:03:38paint
0:03:40a larger picture of what we've done
0:03:42and to demonstrate that mismatch right
0:03:45between our two domains
0:03:47similar to the det curve plot that alan had shown what we start with
0:03:51is we enroll and score on sre two thousand ten
0:03:55and what we denote as the in domain set is that of the sre data
0:03:59i mean that's all the telephone calls from mixer o four five six and two
0:04:04thousand eight collections
0:04:06now the mismatched out-of-domain data is all the switchboard data
0:04:11which are all the calls from those collections
0:04:15so in general
0:04:17some summary statistics there
0:04:19what we're basically looking at is that the
0:04:21number of speakers
0:04:24the number of calls the average number of calls per speaker and the number of channels
0:04:28that each speaker spoke on are relatively the same and to help with that as
0:04:32a visualization
0:04:35that's kind of the normalized histogram of the distribution of
0:04:39the number of utterances per speaker between the two
0:04:44sets of data well they pretty much all overlap
0:04:49but in blue is the switchboard and red is that of the sre
0:04:54so what we can say is that we would not expect a large performance gap
0:04:59between these two sets of data if indeed
0:05:04our training were
0:05:07dataset independent and robust across datasets
0:05:12what we found obviously is that this is not the case which is why
0:05:15we ended up having a summer workshop
0:05:18on it and so just
0:05:19to give a summary of
0:05:21equal error rate results for the rest of the talk i'll just be using equal error rate to provide
0:05:25a summary set of results and
0:05:30what we have is
0:05:33in red i've denoted just
0:05:36the portion of the system that actually requires
0:05:40labels and i've also
0:05:42shown what we had at hopkins over the summer and what i
0:05:46replicated at mit
0:05:49as well so you can see that if we use all switchboard to train everything
0:05:53we'll get a set of results around seven percent equal error rate
0:05:58and if we just use all of the sre
0:06:01we will get around two and a half percent
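For readers less familiar with the metric, the equal error rate quoted throughout can be sketched as a simple threshold sweep. This is an illustrative plain-Python version, assuming that higher scores mean "same speaker"; `equal_error_rate` is a hypothetical helper name, and real evaluations usually interpolate the DET curve rather than picking the nearest threshold.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Sweep thresholds and return the error rate where misses and false alarms are closest."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, best_eer = float("inf"), None
    for t in thresholds:
        # A target trial scoring below the threshold is a miss.
        miss = sum(s < t for s in target_scores) / len(target_scores)
        # A non-target trial scoring at or above the threshold is a false alarm.
        false_alarm = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            best_gap, best_eer = gap, (miss + false_alarm) / 2
    return best_eer
```

A value of 0.07 from this function would correspond to the "seven percent" figure style used in the talk.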
0:06:04so now if we start varying the ingredients that we use to actually train these
0:06:09in particular let's say we just switch these two here we go from switchboard
0:06:14for your whitening parameters that's the mean subtraction et cetera
0:06:20and you switch it to use sre
0:06:22you get a little bit of a gain you go down from seven percent
0:06:25to five
0:06:27and subsequently if you
0:06:30stick with
0:06:32switchboard to do your
0:06:35ubm and i-vector extraction
0:06:40and also
0:06:42keep the sre as our whitening and use the sre labels
0:06:46basically here then you get down to
0:06:50under two and a half percent which is actually better than the last row here
0:06:53not gonna try and explain
0:06:55what happens there but
0:06:57what we decided from then on is that we were obviously gonna focus
0:07:01on the performance gap
0:07:02between the use of sre and the use of switchboard labels for our within and
0:07:10across class covariance matrices
0:07:13that's what we'll continue on and
0:07:16so basically this will be the baseline that we've got and this will be the
0:07:21benchmark that we're trying to hit or even do better than
0:07:27the rules for this what we call the domain adaptation challenge task are that we're
0:07:34allowed to use switchboard all the data and all of the labels
0:07:40we're allowed to use the
0:07:42sre data but none of the labels
0:07:45and obviously we're gonna evaluate on the twenty ten sre
0:07:50so before we actually jump into that though what we'd like to do perhaps
0:07:54is to explore the domain mismatch i got a lot of questions
0:07:57like what actually is the difference between these two datasets that might cause such a
0:08:02gap in there
0:08:03and so we
0:08:04began and did a little bit of a rudimentary analysis of actually what was
0:08:09going on
0:08:09and well some of the
0:08:12clear questions that you might wanna
0:08:14or might think of are well is it the speaker age right
0:08:18or is it perhaps the languages spoken in particular switchboard contains only english and it's
0:08:24collected over a decade and
0:08:29it's a decade that preceded that of the sre and the
0:08:33sre contains more than twenty different languages right so the question is whether or
0:08:39not that might have caused some of the shift and variabilities that we see
0:08:45or the difference in performance and some of this
0:08:47work
0:08:48was previously also explored i believe by carlos back in twenty twelve
0:08:54and what we found however was that there was really no
0:08:59effect of either
0:09:03age or language spoken
0:09:05and so
0:09:07with that
0:09:09the next step then was to look at something else
0:09:12which was
0:09:14that of the switchboard
0:09:16itself and what we found was well we realised first off that switchboard
0:09:21was collected in different phases over approximately a decade and so
0:09:27what happens when we use different subsets
0:09:31we just use different subsets to build our models
0:09:35and so
0:09:36well we ended up finding
0:09:39the following if you take switchboard cellular both parts and those are
0:09:45the most recent ones
0:09:47you actually get a starting baseline so the previous starting baseline was five and a
0:09:51half percent
0:09:53you actually get a starting baseline percentage of four point six which is a little
0:09:57bit better and now if you also add in switchboard phase three you can
0:10:03actually start all the way down at three and a half percent
0:10:07and then but then as you keep adding these i guess you could say maybe
0:10:12the older portions of switchboard you'd start actually doing a bit worse
0:10:19and that's what we found and i think similar work was also
0:10:23done and presented by hagai during the summer and over it i can
0:10:28ours is a slightly different take on it but
0:10:32that's kind of what we noticed as we were trying to analyze the
0:10:36mismatch that
0:10:37basically the differences within switchboard itself and selecting out some of those particular subsets
0:10:45might actually
0:10:47affect the baseline performance
0:10:49so then the next question then is alright so should we actually just
0:10:54continue with that three and a half
0:10:57also secondly can you actually
0:11:01find some automatic way of selecting out the out-of-domain data that you actually wanna end
0:11:08up using okay
0:11:09to do your initial domain adaptation
0:11:12or not even that to just select the labeled data that you want to
0:11:17use that best matches the in domain data that you have right
0:11:21and so what we
0:11:22did again was and it's just a couple of naive exploratory experiments where
0:11:26we said alright
0:11:28if we
0:11:29did an automatic subset selection so
0:11:32in particular
0:11:34first off this is the three and a half percent equal error rate
0:11:38that you get from the cellular and
0:11:40the phases that's the best we did
0:11:42and this here the five and a half percent is approximately what
0:11:46you get if you use all of the data all the switchboard and started off
0:11:51there so instead if you
0:11:53these two lines let's focus on the blue for a second that's if you select
0:11:59the proportion of scores
0:12:02or proportion of i-vectors that
0:12:05have the highest probability density function value
0:12:12with respect to the sre so you select
0:12:18a subset of the switchboard automatically that were closest in likelihood to the sre
0:12:25and you increase the proportion how would you do in terms of the baseline
0:12:31and similarly the lda
0:12:33one is if you took switchboard and
0:12:37sre and you try to
0:12:39learn just a simple
0:12:41one dimensional linear separator between the two and i take the ones that
0:12:46are closest to
0:12:49the sre data and i rank them that way so
0:12:52how well can i do and basically what we can see is obviously
0:12:55if you use all of the scores
0:12:58you've done nothing different
0:13:00but you know as you use just some proportion of
0:13:03the likelihood
0:13:04or proportion of these top ranking scores
0:13:07you can actually do a little bit better than our baseline however
0:13:10you never approach
0:13:12this three and a half that seemed to be set by this particular this magical subset
0:13:17that was not
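The likelihood-based subset selection described above can be sketched roughly as follows. This is an assumption-laden illustration in plain Python: the talk does not say which density model was used, so a diagonal-covariance Gaussian fitted to the in-domain vectors is assumed here, and `fit_diagonal_gaussian`, `log_density` and `select_closest` are hypothetical names.

```python
import math


def fit_diagonal_gaussian(vectors):
    """Per-dimension mean and variance (assumed model; variance floored to avoid division by zero)."""
    dim, n = len(vectors[0]), len(vectors)
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    var = [max(sum((v[d] - mean[d]) ** 2 for v in vectors) / n, 1e-6) for d in range(dim)]
    return mean, var


def log_density(v, mean, var):
    """Log of the diagonal Gaussian density at v."""
    return sum(-0.5 * (math.log(2 * math.pi * s) + (x - m) ** 2 / s)
               for x, m, s in zip(v, mean, var))


def select_closest(out_domain, in_domain, proportion):
    """Indices of the out-of-domain vectors with the highest density under the in-domain model."""
    mean, var = fit_diagonal_gaussian(in_domain)
    ranked = sorted(range(len(out_domain)),
                    key=lambda i: log_density(out_domain[i], mean, var),
                    reverse=True)
    keep = max(1, round(proportion * len(out_domain)))
    return sorted(ranked[:keep])
```

Sweeping `proportion` from a small value up to 1.0 mirrors the horizontal axis of the plot being discussed: at 1.0 every out-of-domain vector is kept, which is the "done nothing different" end of the curve.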
0:13:19so that was the initial exploration of the domain mismatch that we did
0:13:24covered most of the set up most of the problem
0:13:29now i can continue on with the rest of the work
0:13:33the bootstrap framework that i'm gonna go over one more time it's pretty standard
0:13:37for the domain adaptation we begin with our prior across class and within class hyper parameters
0:13:44and then we use
0:13:45p lda to compute a pairwise affinity matrix
0:13:49on the sre data
0:13:51subsequently we'll do some form of clustering on that pairwise affinity matrix to obtain
0:13:56some hypothesized cluster labels we'll use these labels to obtain another set
0:14:01of hyper parameters
0:14:03and then we linearly interpolate
0:14:08as alan showed and then potentially we iterate let me
0:14:12just make this look better
0:14:14between mac and windows so that's actually how that slide's supposed to look
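The linear interpolation step of the recipe above can be sketched in a few lines. This is a hedged illustration, not the authors' implementation: matrices are plain nested lists, `interpolate` and `adapt_hyperparameters` are hypothetical names, and the convention that alpha weights the in-domain estimate (with one minus alpha on the out-of-domain prior) is an assumption.

```python
def interpolate(in_domain, out_domain, alpha):
    """Return alpha * in_domain + (1 - alpha) * out_domain for two same-sized matrices."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [[alpha * a + (1.0 - alpha) * b for a, b in zip(row_in, row_out)]
            for row_in, row_out in zip(in_domain, out_domain)]


def adapt_hyperparameters(in_within, in_across, out_within, out_across,
                          alpha_within, alpha_across):
    """Interpolate within-class and across-class covariances with separate weights."""
    return (interpolate(in_within, out_within, alpha_within),
            interpolate(in_across, out_across, alpha_across))
```

Here the in-domain matrices would be estimated from the hypothesized cluster labels and the out-of-domain ones from the labeled prior data; the adapted pair could then be used to recompute the affinity matrix if the procedure is iterated.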
0:14:22basically that's the set up and we'll just run some clustering algorithms and i put
0:14:27unsupervised in parentheses 'cause you know all clustering algorithms have at least some parameter
0:14:33that you can tune right
0:14:35so to start off what we'll find later on is that hierarchical clustering really
0:14:39does do the best
0:14:43in light of you know the stopping criterion that you choose or the cluster merging
0:14:46criterion those are kind of up to the user to choose but we find that
0:14:50with some reasonably appropriate choice hierarchical clustering does do the best the two algorithms
0:14:56that we also explored pretty extensively were some graph based random walk algorithms
0:15:02and those are known as infomap and markov clustering i'm not gonna
0:15:06go into the details about those but feel free to ask me offline or
0:15:09at the end of the presentation
0:15:12and those you know you basically have a graph where each node is an
0:15:17i-vector and then you have some edges that are
0:15:21perhaps weighted edges and then you do some clustering on those edges
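As a rough illustration of hierarchical clustering on a pairwise affinity matrix, the sketch below uses average linkage with an affinity threshold as the stopping criterion. Both choices are assumptions made for the example (the talk leaves the merging and stopping criteria to the user), `agglomerative_cluster` is a hypothetical name, and the graph-based Infomap and Markov clustering algorithms are not shown.

```python
def agglomerative_cluster(affinity, stop_threshold):
    """Average-linkage merging; stops once no pair of clusters has mean affinity above the threshold."""
    clusters = [[i] for i in range(len(affinity))]

    def mean_affinity(a, b):
        return sum(affinity[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best_score, best_pair = None, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                score = mean_affinity(clusters[x], clusters[y])
                if best_score is None or score > best_score:
                    best_score, best_pair = score, (x, y)
        if best_score <= stop_threshold:
            break  # the most similar remaining pair is not similar enough to merge
        x, y = best_pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    # Convert the list of clusters into one hypothesized label per item.
    labels = [0] * len(affinity)
    for cluster_id, members in enumerate(clusters):
        for i in members:
            labels[i] = cluster_id
    return labels
```

Raising or lowering `stop_threshold` corresponds to stopping at different points of the hierarchical tree, which is the knob varied in the plots discussed next.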
0:15:25so our initial findings this is really no different from what alan had shown
0:15:30previously but mainly what's true is that in the presence of
0:15:37interpolation an imperfect clustering is in fact forgivable
0:15:41this here
0:15:41is just the plot that says we took a thousand speaker subset
0:15:46and this shows cluster error just some notion of cluster error
0:15:53the solid lines in green and red
0:15:57are if you
0:15:59knew the cluster labels
0:16:04if you knew the cluster labels a priori and didn't have to do any automatic clustering
0:16:06and then the rest of these lines here in dotted lines are
0:16:13what you would do if you
0:16:16clustered or stopped your clustering at different points of
0:16:21the hierarchical tree okay and basically the thing is that
0:16:26this bowl is incredibly flat okay
0:16:29and also the last thing is that
0:16:32alpha star itself is basically the best adaptation parameter much like what was just talked about
0:16:42one thing that we kinda glossed over so far is that alpha
0:16:46itself needs to be estimated you can do it in a more principled
0:16:52way via the counts of
0:16:55the relative dataset sizes or you can look at it empirically and you
0:16:59can separate you know you can do your alpha for within class differently from
0:17:03the alpha of your across class and
0:17:05and that's
0:17:06that seems to be empirically the case the better ones seem to be this
0:17:10way and so you can see we range across the alphas on both sides
0:17:15for the within class and the across class and find that this is approximately the
0:17:20best for one particular subset of a thousand speakers however
0:17:25like and it seems like
0:17:27alpha star itself is an open and unsolved problem but actually it's not so bad
0:17:31because if we rescale this plot to within ten percent of this optimal equal
0:17:36error rate we can actually find that
0:17:40there's actually a range of values
0:17:44a range of values for alpha that would actually yield pretty
0:17:49good results
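The two ways of choosing alpha mentioned above can be sketched as follows. Both helpers are hypothetical: `count_based_alpha` shows one possible reading of "counts of the relative dataset sizes" (the talk does not give the formula), and `acceptable_alphas` shows the "within ten percent of the optimal equal error rate" check on a one-dimensional sweep, whereas the talk sweeps the within-class and across-class alphas jointly.

```python
def count_based_alpha(n_in_domain, n_out_domain):
    """Assumed heuristic: weight in-domain statistics by their share of the total data."""
    return n_in_domain / (n_in_domain + n_out_domain)


def acceptable_alphas(eer_by_alpha, tolerance=0.10):
    """Alphas whose equal error rate is within a relative tolerance of the best one in the sweep."""
    best = min(eer_by_alpha.values())
    return sorted(a for a, eer in eer_by_alpha.items() if eer <= best * (1.0 + tolerance))


# Made-up sweep values for illustration only; these are not results from the talk.
toy_sweep = {0.0: 5.5, 0.25: 3.2, 0.5: 2.9, 0.75: 3.0, 1.0: 3.6}
good_range = acceptable_alphas(toy_sweep)
```

If `good_range` spans several neighbouring alphas, as in the flat region the speaker describes, a rough estimate of alpha is likely to be good enough.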
0:17:52so results so far without parsing drum i'm running a bit out of time but
0:17:58basically the best we can do is roughly within fifteen percent of the absolute
0:18:04best that is the best we can do with automatic methods is
0:18:08close that gap by about eighty five percent
0:18:12so the take-home ideas for now are that given interpolation an imprecise estimate of
0:18:18the number of clusters is okay
0:18:21there is a range of adaptation parameters that would yield reasonable results and the best
0:18:25automatic system gives us within fifteen percent of a system that has access to
0:18:29all speaker labels
0:18:31now going forward between alan's talk and mine
0:18:35we wonder well
0:18:36i mean this telephone to telephone domain mismatch simple solutions work already
0:18:42and we'd like to
0:18:44and what we've been working on is to explicitly identify the sources of this mismatch
0:18:49and that's kinda ongoing work at the moment but the question just like mitch brought
0:18:53up a couple seconds ago at the end of alan's talk
0:18:57what can we do about telephone to microphone domain mismatch i did the work independently
0:19:01actually did not know that
0:19:05alan and daniel had done this and what i'm about to show
0:19:09is not in the paper itself but
0:19:11it's just a little add-on
0:19:13and lastly what else we can talk about is out of domain detection like
0:19:19how does a system know that it actually needs
0:19:25some additional
0:19:27albeit unlabeled data but you know that it cannot perform at the level it
0:19:32usually does so that's perhaps an instance of like outlier detection or something like that
0:19:38that we can also look into that's something sort of a future
0:19:42work kind of thing
0:19:45what i will really quickly show is a quick visualization using some low dimensional embeddings
0:19:51actually
0:19:54and basically what we're gonna start with is
0:19:57if you have switchboard
0:19:59and sre and these are all the i-vectors in there and i'm gonna
0:20:04put a lot of i-vectors into a very low dimensional space which is why it just looks very
0:20:07cloudy at the moment
0:20:09it's harder to
0:20:10fit a lot of points into
0:20:12a small space and still have them preserve their relative distances
0:20:18however this is
0:20:19if i try to learn first off an unsupervised
0:20:24embedding that
0:20:25just takes all the data and learns some low dimensional visualization here
0:20:29and then i apply the colouring to this
0:20:32so what it shows here is that we have switchboard
0:20:35in blue and we have the sre data in red and you can kinda see
0:20:39that there is a little bit of separation
0:20:43perhaps right but they're kind of also a little bit on top of each other
0:20:47now to revisit just one other point that i talked about earlier
0:20:53if we just took that subset
0:20:57that magical subset that gave us that three and a half percent that magical subset
0:21:01of switchboard we get this in green and we have the sre in the red
0:21:06as well and so they're pretty uniformly distributed around the sre data itself
0:21:12on the other hand
0:21:14if you just
0:21:15take the remaining amount of data
0:21:17and we leave it in blue the old switchboard stuff
0:21:20they're actually like a little farther away than the rest of the sre itself so
0:21:25that kind of maybe gives some idea of how things work or why
0:21:31the performance was as it was and
0:21:34however if you take a look at
0:21:37telephone and microphone
0:21:38if you do the same
0:21:39if you use the same kind of an embedding
0:21:44you will
0:21:45get a completely different a much more separate sort of
0:21:51visualization and that sort of just illustrates that i think telephone and microphone
0:21:56can be a harder problem however i guess initial results have also shown that it's
0:22:00actually not as bad as maybe this visualization shows so i'll stop there
0:22:05and take any questions
0:22:22you said that you have found that the language is not the cause of this
0:22:27domain mismatch how did you find that
0:22:31let me think so
0:22:32but like basically
0:22:36well i basically held
0:22:38like the different languages out
0:22:40held the various different languages out
0:22:43of the sre data and just tried to basically see whether
0:22:48that was
0:22:51that would be like distinctly different from that of the
0:22:56sorry no the way that i basically looked at it and saw
0:23:02whether or not
0:23:03the different languages clustered together in a sense
0:23:09that's in general how we went about trying to tease apart whether or
0:23:13not the language is
0:23:16a source of domain mismatch
0:23:21so you look at t s and you can just like that's on now
0:23:29let's talk offline about that i'm actually forgetting some of the details
0:23:33of the language experiment exactly at the moment but
0:23:38let's talk offline about that
0:23:46this is in no the beginning of the two you have table issue the did
0:23:52you know if you the
0:23:53of what's used for training the ubm and also
0:23:57it did you try this which is
0:24:00put in the training switchboard in a city
0:24:03yes we did originally and it wasn't terribly different
0:24:08they're just about the same okay thanks there's really no difference
0:24:18so sweet little then mix to zero were collected over a wide range will use
0:24:24so maybe the your easy dependent variable shows the evolution of the telephone network and
0:24:32speech is transmitted of the telephone that the now compared to the in nine and
0:24:38ninety nine
0:24:39absolutely no it totally on that's actually one of the that's almost exactly a sentence
0:24:44at it like that we wrote in a and yes
0:24:46and that's a
0:24:48a potential like a hypothesis that
0:24:50i'm certainly willing to leave thanks
0:24:54even a related question
0:24:56the p lda has the within and between speaker covariance parameters so
0:25:03which of those most need to be adapted when moving from switchboard to
0:25:07the mixer i think that's shown
0:25:13go with
0:25:16this one right
0:25:19the one that most needs to be adapted would be that within class
0:25:25variability relative to
0:25:27the across class at the it shown in so that we just
0:25:32the speakers the speaker distribution
0:25:34so i more this constant exactly but the left channels
0:25:39so that we which is what you need more weight within
0:25:44it's very even and