0:00:15 | and so what i'm gonna talk about is also what we did last

0:00:20 | summer at hopkins, and again this is work between myself and doug

0:00:26 | reynolds, daniel, and alan

0:00:29 | so |

0:00:31 | actually |

0:00:32 | i really hope |

0:00:35 | that |

0:00:37 | well okay |

0:00:38 | four years ago |

0:00:39 | was my first odyssey

0:00:41 | ever, and it was my first conference presentation, and i made a joke

0:00:45 | at the start of the slides that turned out to be far more memorable

0:00:48 | than the presentation itself

0:00:50 | i wanted to do the same this time but i couldn't come up

0:00:54 | with any good stories so |

0:00:55 | i was gonna give you this picture, which i hope none of you will

0:00:59 | look like by the end of my presentation |

0:01:03 | okay |

0:01:04 | all right now |

0:01:06 | so what i'm gonna talk about is unsupervised clustering approaches

0:01:09 | for domain adaptation in speaker recognition systems |

0:01:13 | and first off i guess the title's a bit of a handful so i'm gonna

0:01:17 | break it down and explain kinda each piece one at a time |

0:01:21 | so domain adaptation

0:01:24 | is a problem that arises because

0:01:29 | most current statistical learning techniques assume, somewhat incorrectly, that

0:01:33 | the training and test data come from the same underlying distribution right

0:01:38 | and |

0:01:38 | so what we know in general is that labeled data may exist in one domain |

0:01:42 | but what we want is a model that can also perform well in a related |

0:01:47 | but say not necessarily identical domain |

0:01:51 | and labeling data |

0:01:53 | in this new domain in particular may be difficult and/or expensive

0:01:57 | and so what can we do |

0:01:59 | to leverage the original labeled out-of-domain data when building a model to work with

0:02:04 | this in domain data |

0:02:07 | so there's nothing new here, everything we've heard before in the previous presentations. so speaker

0:02:12 | recognition systems, and i'm sure once again we're all rather familiar with the i-vector

0:02:17 | approach

0:02:19 | and that's |

0:02:20 | clearly just, you know, your standard segment-length-independent

0:02:25 | low dimensional vector representation of the audio

0:02:30 | and what the i-vector allows us to do is to use large

0:02:35 | amounts of previously collected and labeled audio to characterize and exploit speaker and channel variability

0:02:42 | that's right and usually that entails the use of you know thousands of speakers making |

0:02:47 | tens of calls each |

0:02:49 | so |

0:02:51 | unfortunately it is a bit unrealistic to expect that most applications will have access to |

0:02:56 | such a large set of labeled data from a matched condition

0:03:01 | and so |

0:03:02 | well here's, you know, the anatomy of the standard i-vector system, very similar and

0:03:07 | almost actually identical to what was shown earlier. again the thing

0:03:13 | to note is that

0:03:15 | your ubm, your i-vector extractor, and your resulting mean subtraction and length

0:03:21 | normalisation do not require the use of labels

0:03:25 | what does require some labels are |

0:03:27 | your within class and across class covariance matrices

0:03:31 | and that's where the labels come in so |
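the unlabeled half of that pipeline (mean subtraction, whitening, length normalisation) can be sketched in a few lines of numpy. this is an illustrative sketch only, not the actual system code; the function name is ours:

```python
import numpy as np

def whiten_and_length_norm(ivectors, mean=None, W=None):
    """Unsupervised i-vector preprocessing: center, whiten, length-normalize.

    No speaker labels are needed: the mean and whitening transform are
    estimated from the pooled (unlabeled) data.
    """
    if mean is None:
        mean = ivectors.mean(axis=0)
    centered = ivectors - mean
    if W is None:
        # Whitening transform from the total covariance (eigendecomposition).
        cov = np.cov(centered, rowvar=False)
        vals, vecs = np.linalg.eigh(cov)
        W = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    whitened = centered @ W
    # Length normalization: project every vector onto the unit sphere.
    return whitened / np.linalg.norm(whitened, axis=1, keepdims=True)
```

the labeled part of the system, the within- and across-class covariances, is what the rest of the talk is about.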

0:03:34 | that's what we've got. now the first thing i'd like to do is sort of just

0:03:38 | paint a larger picture of what we've done

0:03:42 | to demonstrate that mismatch right

0:03:45 | between our two domains

0:03:47 | similar to the det curve plot that alan had shown, we start with

0:03:51 | this one row: enroll and score on sre two thousand ten

0:03:55 | and what we denote as the in-domain set is that of the sre data

0:03:59 | that's all the telephone calls from the mixer 04, 05, 06, and two

0:04:04 | thousand eight collections

0:04:06 | now the mismatched out-of-domain data is all the switchboard data |

0:04:11 | which is all the calls from those collections

0:04:15 | so in general

0:04:17 | some summary statistics there

0:04:19 | what we're basically looking at is that the

0:04:21 | number of speakers

0:04:24 | number of calls, average number of calls per speaker, and number of channels

0:04:28 | that each speaker spoke on are relatively the same. and to help with that as

0:04:32 | a visualization

0:04:35 | here's kind of the normalized histogram of the distribution of

0:04:39 | the number of utterances per speaker between the two

0:04:44 | between the two sets of data. they pretty much all overlap,

0:04:49 | but in blue is the switchboard and in red is that of the sre

0:04:54 | so what we can say is that we would not expect a large performance gap

0:04:59 | between these two sets of data if indeed

0:05:04 | our training were

0:05:07 | dataset independent and robust across datasets

0:05:11 | so |

0:05:12 | what we found obviously is that this is not the case which is why

0:05:15 | we ended up having a summer workshop |

0:05:18 | on it. and so just

0:05:19 | to give a summary of

0:05:21 | equal error rate results: for the rest of the talk i'll just be using equal error rate to provide

0:05:25 | a summary set of results and

0:05:30 | what we have is, in red i believe, denoted just

0:05:36 | the portion of the system that actually requires

0:05:40 | labels, and i've also

0:05:42 | shown what we had at hopkins over the summer and what we

0:05:46 | replicated at mit

0:05:49 | as well. so you can see that if we use all switchboard to train everything

0:05:53 | we'll get a set of results around seven percent equal error rate |

0:05:58 | and if we just use all of the sre |

0:06:01 | we will get around two and a half percent
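equal error rate, the summary metric used throughout, can be computed from lists of target and nontarget trial scores. a minimal sketch of our own (not the actual scoring tool used in the workshop):

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Equal error rate: the operating point where the false-reject
    rate equals the false-accept rate."""
    tgt = np.asarray(target_scores, dtype=float)
    non = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([tgt, non]))
    frr = np.array([np.mean(tgt < t) for t in thresholds])   # false rejects
    far = np.array([np.mean(non >= t) for t in thresholds])  # false accepts
    i = int(np.argmin(np.abs(frr - far)))  # threshold where the rates cross
    return 0.5 * (frr[i] + far[i])
```

well-separated score distributions give an eer near zero; fully overlapping ones give fifty percent.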

0:06:04 | so now if we start varying the ingredients that we used to actually train these |

0:06:07 | systems |

0:06:09 | in particular, say we just switch this one here: we go from switchboard

0:06:14 | for your whitening parameters, that's the mean subtraction et cetera,

0:06:20 | and you switch it to use sre

0:06:22 | you get a little bit of a gain: you go down from seven percent

0:06:25 | to five

0:06:27 | and subsequently if you

0:06:30 | stick with

0:06:32 | switchboard to do your

0:06:34 | ubm

0:06:35 | and i-vector extraction

0:06:40 | and also

0:06:42 | keep the sre as your whitening and use the sre labels

0:06:46 | basically here then you get down to |

0:06:50 | under two and a half percent which is actually better than the last row here |

0:06:53 | not gonna try and explain |

0:06:55 | what happens there but |

0:06:57 | what we decided from then on is that we were obviously gonna focus

0:07:01 | on the performance gap

0:07:02 | between the use of sre and switchboard labels for our within and

0:07:10 | across class covariance matrices

0:07:12 | so |

0:07:13 | that's what we'll continue on with

0:07:16 | so basically this sort of is the baseline that we've got and this will be

0:07:21 | the benchmark that we're trying to hit or even to do better than

0:07:26 | so |

0:07:27 | the rules for this, what we call the domain adaptation challenge task, are that we're

0:07:34 | allowed to use switchboard, all the data and all of the labels

0:07:40 | we're allowed to use the

0:07:42 | sre data but not its labels

0:07:45 | and obviously we're gonna evaluate on the twenty ten sre

0:07:50 | so before we actually jump into that though, what we'd like to do perhaps

0:07:54 | is to examine the domain mismatch. i got a lot of questions

0:07:57 | like what actually is the difference between these two datasets that might cause such a

0:08:01 | gap

0:08:02 | in there |

0:08:03 | and so we |

0:08:04 | began and did a little bit of a rudimentary analysis of actually what was

0:08:09 | going on |

0:08:09 | and well some of the

0:08:12 | clear questions that you might

0:08:14 | think of are: is it the speaker age right

0:08:18 | or is it perhaps the languages spoken? in particular switchboard contains only english and it's

0:08:24 | collected over a decade

0:08:29 | and it's a decade that preceded that of the sre, and the

0:08:33 | sre contains more than twenty different languages right so the question is whether or

0:08:38 | not |

0:08:39 | that might have caused some of the shift in variabilities that we see

0:08:45 | or the difference in performance. some of this

0:08:47 | work

0:08:48 | was previously also explored, i believe, in earlier work as well

0:08:54 | and what we found however was that there was really no

0:08:59 | effect of either

0:09:03 | age or language spoken

0:09:05 | and so

0:09:08 | with that

0:09:09 | the next step then was to look at something else

0:09:12 | which was

0:09:14 | that of the switchboard

0:09:16 | itself. what we found was, well, we realised first off that switchboard

0:09:21 | was collected in different phases over approximately a decade, and so what

0:09:27 | happens when we use different subsets

0:09:31 | we just use different subsets to build our models

0:09:35 | and so

0:09:36 | what we ended up finding

0:09:37 | was

0:09:39 | the following: if you take switchboard cellular, both parts, and those are

0:09:45 | the most recent ones

0:09:47 | you actually get a starting baseline, so the previous starting baseline was five and a

0:09:51 | half percent

0:09:53 | you actually get a starting baseline of four point six percent which is a little

0:09:57 | bit better, and now if you also add in switchboard phase three you can

0:10:03 | actually start all the way down at three and a half percent

0:10:07 | but then as you keep adding these, i guess you could say maybe

0:10:12 | older portions of switchboard, you'd start actually doing a bit worse

0:10:19 | and |

0:10:19 | and that's what we found, and i think similar work was also

0:10:23 | done and presented by hagai during the summer, and i think

0:10:28 | ours is a slightly different take on it but

0:10:32 | that's kind of what we noticed as we were trying to analyze the

0:10:36 | mismatch:

0:10:37 | basically the differences within switchboard itself, and selecting out some of those particular subsets,

0:10:45 | might actually

0:10:47 | affect the baseline performance

0:10:49 | so then the next question is alright, should we actually just

0:10:54 | continue with the rest of the graph

0:10:57 | and also secondly can you actually just

0:11:01 | find some automatic way of selecting out the out-of-domain data that you actually wanna end

0:11:08 | up using okay

0:11:09 | to do your initial domain adaptation

0:11:12 | or even to just select the labeled data that you want to

0:11:17 | use that best matches the in domain data that you have right

0:11:21 | and so what we

0:11:22 | did again, with just a couple of naive exploratory experiments,

0:11:26 | was to ask alright

0:11:28 | what if we

0:11:29 | did an automatic subset selection so

0:11:32 | in particular

0:11:34 | first, this is the three and a half percent equal error rate

0:11:38 | that you get from the cellular

0:11:40 | and the phases, that's the best we did

0:11:42 | and this here, the five and a half percent, is approximately what

0:11:46 | you get if you use all of the data, all the switchboard, and started off

0:11:51 | there. so instead if you take

0:11:53 | these two lines, let's focus on the blue for a second: that's if you select

0:11:59 | the proportion of scores

0:12:02 | or proportion of i-vectors that are

0:12:05 | at the highest probability density function value

0:12:12 | with respect to the sre. so you select

0:12:18 | a subset of the switchboard automatically that was closest in likelihood to the sre

0:12:24 | marginal

0:12:25 | and as you increase the proportion, how would you do in terms of the baseline

0:12:30 | performance

0:12:31 | and similarly the lda one

0:12:33 | that is if you took switchboard and

0:12:37 | sre and you tried to

0:12:39 | learn just a simple

0:12:41 | one dimensional linear separator between the two, and i take the ones that

0:12:46 | are closest to

0:12:49 | the sre data and i rank them that way so
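the likelihood-based selection just described can be sketched as follows. this is our own reconstruction: a single gaussian fit to the unlabeled in-domain i-vectors stands in for the "sre marginal", which is an assumption of this sketch; the lda variant would rank by a learned linear separator instead:

```python
import numpy as np

def select_closest_subset(out_domain, in_domain, proportion=0.5):
    """Rank out-of-domain i-vectors by log-likelihood under a single
    Gaussian fit to the (unlabeled) in-domain data, and keep the top
    `proportion` of them."""
    mu = in_domain.mean(axis=0)
    cov = np.cov(in_domain, rowvar=False)
    cov += 1e-6 * np.eye(cov.shape[0])      # regularize for invertibility
    inv = np.linalg.inv(cov)
    diff = out_domain - mu
    # Log-density up to a constant = -0.5 * squared Mahalanobis distance.
    loglik = -0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff)
    k = max(1, int(round(proportion * len(out_domain))))
    return np.argsort(loglik)[::-1][:k]     # indices of the closest vectors
```

the selected indices would then be the out-of-domain subset used to train the hyperparameters.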

0:12:52 | and how well can i do then. basically what we can see is obviously

0:12:55 | if you use all of the scores

0:12:58 | you've done nothing different

0:13:00 | but you know as you use just some proportion of

0:13:03 | the likelihood

0:13:04 | or proportion of these top ranking scores

0:13:07 | you can actually do a little bit better than our baseline however

0:13:10 | you never approach

0:13:12 | this three and a half that seemed to be set by this particular, this magical subset

0:13:19 | so that was the initial exploration of the domain mismatch that we did |

0:13:23 | now |

0:13:24 | i've covered most of the setup, most of the problem

0:13:29 | and now i can continue on with the rest of the work

0:13:32 | so |

0:13:33 | the bootstrap framework that i'm gonna go over one more time, it's pretty standard

0:13:37 | for the domain adaptation: we begin with our prior across class and within class hyper

0:13:43 | parameters

0:13:44 | and then we use

0:13:45 | plda to compute a pairwise affinity matrix

0:13:49 | on the sre data

0:13:51 | subsequently we'll do some form of clustering on that pairwise affinity matrix to obtain

0:13:56 | some hypothesized cluster labels; we'll use these labels to obtain another set

0:14:01 | of hyper parameters

0:14:03 | and then we linearly interpolate

0:14:08 | as alan showed, and then potentially we iterate

0:14:12 | just to make this look better, it got garbled

0:14:14 | between mac and windows, so that's actually how that slide is supposed to look
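one pass of that recipe can be sketched as follows. this is our own sketch, with two assumptions flagged up front: cosine similarity stands in for the plda affinity matrix, and scipy's hierarchical clustering supplies the tree cut; function names are ours:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def hypothesize_labels(ivectors, cut_distance):
    """One clustering pass: score all pairs (cosine here, standing in
    for PLDA), build a hierarchical-clustering tree, and cut it at
    `cut_distance` to produce hypothesized speaker labels."""
    Z = linkage(ivectors, method='average', metric='cosine')
    return fcluster(Z, t=cut_distance, criterion='distance')

def scatter_matrices(ivectors, labels):
    """Within-class and across-class covariance from hypothesized
    labels; these feed the interpolation step."""
    dim = ivectors.shape[1]
    mu = ivectors.mean(axis=0)
    Sw, Sb = np.zeros((dim, dim)), np.zeros((dim, dim))
    for lab in np.unique(labels):
        grp = ivectors[labels == lab]
        gm = grp.mean(axis=0)
        centered = grp - gm
        Sw += centered.T @ centered                     # within-class scatter
        Sb += len(grp) * np.outer(gm - mu, gm - mu)     # across-class scatter
    return Sw / len(ivectors), Sb / len(ivectors)
```

iterating would mean re-scoring with the newly interpolated hyperparameters and clustering again.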

0:14:21 | so |

0:14:22 | basically that's the set up and we'll just run through some clustering algorithms, and i put

0:14:27 | unsupervised in parentheses 'cause you know all clustering algorithms have at least some parameter

0:14:33 | that you can tune right

0:14:35 | so to start off, what we'll find later on is that hierarchical clustering really

0:14:39 | does do the best

0:14:41 | however

0:14:43 | in light of, you know, the stopping criterion that you choose or the cluster merging

0:14:46 | criterion, those are kind of up to the user to choose, but we find that

0:14:50 | with some reasonably appropriate choice hierarchical clustering does do the best. the two algorithms

0:14:56 | that we also explored pretty extensively were some graph based random walk algorithms

0:15:02 | known as infomap and markov clustering. i'm not gonna

0:15:06 | go into the details about those, but feel free to ask me offline or

0:15:09 | at the end of the presentation

0:15:12 | for those, you know, you basically have a graph where each node is an

0:15:17 | i-vector and then you have some weighted edges perhaps

0:15:21 | and then you do some clustering on those edges

0:15:25 | so our initial findings, this is really no different from what alan had shown

0:15:30 | previously, but what's mainly true is that in the presence of

0:15:35 | interpolation

0:15:37 | an imperfect clustering is in fact forgivable

0:15:41 | this here is just a plot that says we took a thousand speaker subset

0:15:46 | and this shows cluster error, just some measure of cluster error

0:15:53 | and the solid lines in green and red

0:15:57 | are if you knew

0:15:59 | the cluster labels

0:16:04 | if your cluster labels were pure and you didn't have to do any automatic clustering

0:16:06 | and then these two lines here in dotted lines are

0:16:13 | basically

0:16:16 | what you would get if you clustered, or stopped your clustering, at different points

0:16:21 | of the hierarchical tree okay, and basically the thing is that

0:16:26 | this bowl is incredibly flat okay

0:16:29 | and also the last thing is that

0:16:32 | alpha star itself is basically the best adaptation parameter, matching what was just talked about

0:16:40 | so |

0:16:41 | however |

0:16:42 | one thing that we kinda glossed over so far is that alpha

0:16:46 | itself needs to be estimated. you can do it via a more principled

0:16:52 | way, based on the counts of

0:16:55 | the relative dataset sizes, or you can look at it empirically, and you

0:16:59 | can separate, you know, you can do your alpha for the within class differently from

0:17:03 | the alpha of your across class and

0:17:05 | and that's

0:17:06 | that seems to be empirically the case, the better ones seem to be this

0:17:10 | way, and so you can see we range across the alphas on both sides

0:17:15 | for the within class and the across class and find that this is approximately the

0:17:20 | best for one particular subset of a thousand speakers however

0:17:25 | it seems like

0:17:27 | alpha star itself is an open, unsolved problem, but actually it's not so bad

0:17:31 | because if we rescale this plot to within ten percent of the optimal equal

0:17:36 | error rate we can actually find that

0:17:40 | there's actually a range of values

0:17:44 | a range of values for alpha that would actually yield pretty

0:17:48 | good results
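the interpolation step itself, with a separate alpha for the within-class and across-class covariances as just described, is one line per matrix. a minimal sketch; the variable names are ours, not the workshop's:

```python
import numpy as np

def interpolate_hyperparams(Sw_in, Sb_in, Sw_out, Sb_out, alpha_w, alpha_b):
    """Linearly interpolate between in-domain (hypothesized-label) and
    out-of-domain PLDA hyperparameters, with a separate weight for the
    within-class (Sw) and across-class (Sb) covariances."""
    Sw = alpha_w * Sw_in + (1.0 - alpha_w) * Sw_out
    Sb = alpha_b * Sb_in + (1.0 - alpha_b) * Sb_out
    return Sw, Sb
```

an empirical search would sweep `alpha_w` and `alpha_b` over a grid and score each pair on a development set.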

0:17:52 | so, results so far. i'm running a bit out of time but

0:17:58 | basically the best we can do with automatic methods is get roughly within fifteen percent of the absolute

0:18:04 | best; that is, we can

0:18:08 | close that gap by about eighty five percent

0:18:12 | so the take-home ideas for now are that given interpolation an imprecise estimate of

0:18:18 | the number of clusters is okay

0:18:21 | there is a range of adaptation parameters that would yield reasonable results, and the best

0:18:25 | automatic system gives us within fifteen percent of a system that has access to

0:18:29 | all speaker labels

0:18:31 | now, between alan's talk and mine

0:18:35 | we wonder, well

0:18:36 | i mean for this telephone to telephone domain mismatch, simple solutions work already

0:18:41 | and |

0:18:42 | and what we'd like to do

0:18:44 | and what we've been working on, is to explicitly identify the sources of this mismatch

0:18:49 | and that's kinda ongoing work at the moment, but the question, just like mitch brought

0:18:53 | up a couple seconds ago at the end of alan's talk, is

0:18:57 | what can we do about telephone to microphone domain mismatch. i did the work independently

0:19:01 | actually did not know that

0:19:05 | alan and daniel had done this, and what i'm about to show

0:19:09 | is not in the paper itself but

0:19:11 | it's just a little addendum

0:19:13 | and lastly what else we can talk about is out of domain detection, like

0:19:19 | when does the system know that it actually needs

0:19:25 | some additional

0:19:27 | albeit unlabeled data, that it cannot perform at the level it

0:19:32 | usually does. so that's perhaps an instance of like outlier detection or something like that

0:19:38 | that we will also look into; that's sort of a future

0:19:42 | work kind of thing

0:19:43 | so |

0:19:45 | what i will really quickly show is a quick visualization using some low dimensional embeddings

0:19:54 | and basically what we're gonna start with is

0:19:57 | if you have switchboard

0:19:59 | and sre, these are all the i-vectors in there, and i'm gonna

0:20:03 | collapse

0:20:04 | a lot of i-vectors into a very low dimensional space which is why it just looks very

0:20:07 | cloudy at the moment

0:20:09 | it's hard to

0:20:10 | fit a lot of points into

0:20:12 | a small space and still have them preserve their relative

0:20:17 | distances

0:20:18 | however this is

0:20:19 | if i first try to learn an unsupervised

0:20:24 | embedding that

0:20:25 | just takes all the data and learns some low dimensional visualization here

0:20:29 | and then i apply the colourings to this plot
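the setup for a plot like this can be sketched as follows. the talk does not name its embedding method, so a plain pca projection is used here as a stand-in for whatever unsupervised embedding was actually used; the function name is ours:

```python
import numpy as np

def embed_2d(ivectors):
    """Collapse pooled i-vectors into two dimensions without using any
    dataset or speaker labels; the dataset coloring is applied only
    afterwards, for visualization."""
    centered = ivectors - ivectors.mean(axis=0)
    # Top-2 principal directions via SVD of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T
```

the switchboard and sre i-vectors would be stacked into one matrix before calling this, then colored by dataset in the scatter plot.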

0:20:32 | so what it shows here is that we have switchboard |

0:20:35 | in blue and we have the sre data in red and you can kinda see |

0:20:39 | that there is a little bit of separation

0:20:43 | perhaps, right, but they're also a little bit on top of each other

0:20:47 | now to illustrate just one other point that i talked about earlier

0:20:53 | if we just took that subset

0:20:57 | that magical subset that gave us that three and a half percent, that magical subset

0:21:01 | of switchboard, we get this in green and we have the sre in red

0:21:06 | as well, and so they're pretty uniformly distributed around the sre data itself

0:21:10 | right |

0:21:12 | on the other hand |

0:21:14 | if you just take

0:21:15 | the remaining amount of data

0:21:17 | and we leave it in blue, the old switchboard stuff

0:21:20 | they're actually a little farther away from the rest of the sre itself so

0:21:25 | that kind of maybe gives some idea of why

0:21:31 | performance was as it was

0:21:34 | however if you take a look at

0:21:37 | telephone and microphone

0:21:39 | if you do the same kind of an embedding

0:21:44 | then

0:21:45 | you will

0:21:45 | get a completely different, a much more separated sort of

0:21:51 | visualization, and that sort of just illustrates that i think telephone and microphone

0:21:56 | can be a harder problem. however i guess initial results have also shown that it's

0:22:00 | actually not as bad as maybe this visualization shows, so i'll stop there

0:22:05 | and take any questions |

0:22:22 | you said that you have found that the language is not the cause of this

0:22:27 | domain mismatch. how did you find that

0:22:31 | let me think, so

0:22:32 | but like basically

0:22:36 | well i basically held

0:22:38 | the different languages out

0:22:40 | held the various different languages out of

0:22:43 | the sre data and just tried to basically see whether

0:22:48 | that was

0:22:51 | that would be like distinctly different from that of the

0:22:56 | sorry |

0:22:56 | sorry, no, the way i did it was i basically looked at it and saw

0:23:02 | whether or not

0:23:03 | the different languages were clustered together in a sense

0:23:09 | in general that's how we went about trying to tease apart whether or

0:23:13 | not the languages

0:23:16 | were a source of the domain mismatch

0:23:21 | so you looked at the embedding and you could just see that

0:23:25 | no |

0:23:29 | let's talk offline about that, i'm actually forgetting some of the details

0:23:33 | of the language experiment exactly at the moment but

0:23:38 | we can sort that out offline

0:23:40 | sorry |

0:23:46 | in the beginning you have the table, where you showed

0:23:52 | what you

0:23:53 | used for training the ubm and so on

0:23:57 | did you try this, which is

0:24:00 | putting in the training both switchboard and sre

0:24:03 | ah yes, we did originally, and it wasn't terribly different

0:24:08 | it was just about the same. okay, thanks. there's really no difference

0:24:18 | so switchboard and the mixer data were collected over a wide range of years

0:24:24 | so maybe your dataset dependent variability shows the evolution of the telephone network and

0:24:31 | how

0:24:32 | speech is transmitted over the telephone now compared to nineteen

0:24:38 | ninety nine

0:24:39 | absolutely, totally, that's actually almost exactly a sentence

0:24:44 | that we wrote in the paper, yes

0:24:46 | and that's a

0:24:48 | potential hypothesis that

0:24:50 | i'm certainly willing to believe. thanks

0:24:54 | i have a related question

0:24:56 | plda has the within and between speaker covariance parameters, so

0:25:03 | which of those most needs to be adapted when moving from switchboard to

0:25:07 | the mixer? i think that was shown

0:25:13 | go with |

0:25:16 | this one right |

0:25:18 | so |

0:25:19 | the one that most needs to be adapted would be the within class

0:25:25 | variability, relative to

0:25:27 | the across class, as shown here

0:25:32 | the speaker distribution

0:25:34 | stays more or less constant, but it's the telephone channels

0:25:39 | that shift, which is why you need more weight on the within

0:25:44 | class