0:00:15 So as Niko mentioned, this is work mainly from last summer, continuing on after the end of last summer, primarily by Daniel, my colleague at Johns Hopkins, but also with Stephen Shum, who will be talking next about a little bit different flavour, and with Niko and Carlos.

0:00:35 So bear with me: Daniel put a lot of animation into these slides, and I usually take it out, but there's so much in them that I really couldn't, so I will try to present with an animation style which is not natural to me.
0:00:47 So we're trying to build a speaker recognition system which is state-of-the-art. How are we going to do that? Well, it depends what kind of an evaluation we're going to run: we want to know what the data looks like that we're actually going to be working on.

0:00:59 And since normally, for example in an SRE, we know what that data is going to look like, we go to our big pile of previous data that the LDC has kindly generated for us. We use this development data, typically very many speakers and very many labeled cuts, to learn our system parameters, in particular what we call the across-class and within-class covariance matrices, which are the key things we need to make PLDA work correctly. And then we are ready to score our system and see what happens.
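To make that training step concrete, here is a deliberately simplified two-covariance sketch of estimating those matrices from labeled development data; the function name and the details are illustrative assumptions, not the exact estimator used in this work.

    import numpy as np

    def estimate_plda_covariances(ivectors, labels):
        # ivectors: (n_cuts, dim) array; labels: (n_cuts,) speaker ids.
        speakers = np.unique(labels)
        means = np.vstack([ivectors[labels == s].mean(axis=0)
                           for s in speakers])
        # Within-class: scatter of each cut around its own speaker mean.
        resid = np.vstack([ivectors[labels == s] - means[i]
                           for i, s in enumerate(speakers)])
        within = resid.T @ resid / len(ivectors)
        # Across-class: scatter of the speaker means around their mean.
        across = np.cov(means, rowvar=False)
        return across, within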
0:01:36 So the thought here for this workshop was: what if we have this state-of-the-art system, which we have built for our SRE10 or SRE12, and someone comes to us with a pile of data which doesn't look like an SRE? What are we going to do?

0:01:53 And the first thing, in this corpus that Doug put together, which is available (there are links to the lists on the JHU website), we found that there is in fact a big performance gap with the PLDA system, even with what seems like a fairly simple mismatch: namely, you train your parameters on Switchboard and you test on MIXER, that is, on SRE10.
0:02:17 And you can see that the green line, which is a pure SRE system designed for SRE10, works extremely well, while the same algorithm trained only on Switchboard has three times the error rate.
0:02:34 So in the supervised domain adaptation that we attacked first, which Daniel presented at ICASSP, we are given an additional dataset: we have the out-of-domain Switchboard data, and we have an in-domain MIXER set which is labeled but may not be very big. So how can we combine these two datasets to accomplish good performance on SRE data?
0:02:57 The setup that we have used for these experiments is a typical i-vector system. I think some people may do different things in this back part, but Daniel has convinced me that length norm with total-covariance whitening is in fact the best, most consistent way to do it.

0:03:20 These are typical system parameters; the dimension is typically four hundred or six hundred in our experiments. The important point to emphasize here is that the i-vector extractor doesn't need any labeled data, so we call that unsupervised training. The length norm also is unsupervised. The PLDA parameters are the ones for which we need the speaker labels, and that's the harder data to find.
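As a minimal sketch of the unsupervised part of that backend, assuming plain numpy and nothing about the actual JHU code: whitening with the total covariance of an unlabeled pile, followed by length normalization.

    import numpy as np

    def train_whitener(ivectors):
        # Center on the pile's mean, then build the inverse square root
        # of the total covariance; no speaker labels are needed.
        mu = ivectors.mean(axis=0)
        cov = np.cov(ivectors, rowvar=False)
        w, V = np.linalg.eigh(cov)
        W = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
        return mu, W

    def length_norm(ivectors, mu, W):
        # Whiten, then project each i-vector onto the unit sphere.
        x = (ivectors - mu) @ W
        return x / np.linalg.norm(x, axis=1, keepdims=True)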
0:03:56 And in these experiments we found that we can always use Switchboard for the i-vector extractor itself; we don't need to retrain that every time we go to a new domain, which is a tremendous practical advantage. The whitening parameters can be trained specifically for whatever domain you're working in, which is not so hard to do either, because you only need an unlabeled pile of data to accomplish that.
0:04:18 And then I want to focus on the adaptation of the covariance matrices; that was the biggest challenge for us.
0:04:28 In principle, at least with a little bit of simplistic math, if we have known covariance matrices we can do the MAP adaptation that Doug has been doing in GMMs for a long time. The original MAP idea behind that is a conjugate prior for a covariance matrix, and you end up with a sort of count-based regularization: if you configure your prior in a certain tricky way, you end up with a very simple formula, a count-based regularization back towards an initial matrix and towards the new-data sample covariance matrix. So that's what's shown here: this is the in-domain covariance matrix, and we're smoothing it back towards the out-of-domain covariance.
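In symbols, this is just a convex interpolation, applied separately to the across-class and within-class matrices; a one-function sketch (the names are mine, and alpha is the weight discussed next):

    def adapt_covariance(C_in, C_out, alpha):
        # Count-based MAP smoothing: alpha = 0 keeps the out-of-domain
        # matrix, alpha = 1 trusts only the in-domain sample covariance.
        return alpha * C_in + (1.0 - alpha) * C_out

    # applied to both PLDA matrices:
    # across = adapt_covariance(across_in, across_out, alpha)
    # within = adapt_covariance(within_in, within_out, alpha)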
0:05:10 And what we showed earlier in the supervised adaptation is that we can get very good performance. Let's get used to this graph; I'm going to show a couple more like it. The red line at the top is the out-of-domain system, the one with the bad performance, trained purely on Switchboard. The green line at the bottom is the matched in-domain system; that's our target if we had all of the in-domain data. And what we're doing is taking various amounts of in-domain data to see how well we can exploit it. Even with a hundred speakers we can cut seventy percent of that gap with this adaptation process, and if we use the entire dataset we get the same performance, actually slightly better by using both sets than just the in-domain set.
0:05:53 One of the questions with this is how do we set this alpha parameter. In theory, if we knew the prior exactly, it would tell us what alpha should be, but empirically the main point of this graph is that we're not very sensitive to it. If alpha is zero, we are entirely the out-of-domain system, and it's always pretty bad. If alpha is one, we are entirely trying to build an in-domain system: if we have almost no in-domain data we get very bad performance, but as soon as we start to have data that system is pretty good. But we're always better off staying somewhere in the middle and using both datasets, using a combination.
0:06:29 Now, in this work the theme is unsupervised adaptation. What that means is that we no longer have labels for this pile of in-domain data. So it's the same setup, but now we don't have labels.
0:06:46 This means we want to do some kind of clustering, and we found empirically, as I think people in the i-vector challenge seem to have found as well, that AHC, agglomerative hierarchical clustering, is a particularly good algorithm for this task, for whatever reason. And you can measure clustering performance: if you actually have the truth labels, you can evaluate a clustering algorithm by purity and fragmentation, purity being how pure your clusters are, and fragmentation being how much a speaker was accidentally distributed into other clusters.
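The talk doesn't give the exact formulas, but one plausible way to compute such metrics is sketched below; these particular definitions are my assumption, not necessarily the ones used in the experiments.

    from collections import Counter

    def purity(true_spk, cluster_id):
        # Fraction of all cuts that belong to the dominant speaker of
        # their cluster (a common definition of cluster purity).
        hits = 0
        for c in set(cluster_id):
            members = [s for s, k in zip(true_spk, cluster_id) if k == c]
            hits += Counter(members).most_common(1)[0][1]
        return hits / len(true_spk)

    def fragmentation(true_spk, cluster_id):
        # Average number of extra clusters each speaker is split
        # across; zero means no speaker was fragmented.
        spk2clusters = {}
        for s, k in zip(true_spk, cluster_id):
            spk2clusters.setdefault(s, set()).add(k)
        extras = [len(ks) - 1 for ks in spk2clusters.values()]
        return sum(extras) / len(extras)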
0:07:20 One of the things we spent quite a bit of time on, and in fact Daniel spent a lot of time making an i-vector averaging system, is what metric to use for the clustering. You've got to do hierarchical clustering, you're going to work your way up from the bottom, but what's the definition of whether two clusters should be merged?
0:07:38 PLDA theory gives an answer for that: a speaker hypothesis test that these two clusters are the same speaker. That's something that we've worked with in the past, and as soon as we started it up this year we saw that it really doesn't work well at all, which is a little disappointing from a theoretical point of view. But we found that in the SREs as well: when we have multiple cuts, using the correct formula doesn't always work as well as we would like.
0:08:05 What we traditionally do in an SRE is i-vector averaging, which is pretending we have a single cut. Daniel spent a lot of time on that this summer, and then we found out that in fact the simplest thing to do is the best performing system: compute the score between every pair of cuts, get a matrix of scores, and then never recompute any metrics, just average the scores. It's also much easier, because you don't have to get inside your algorithm at all: you just precompute this distance matrix and feed it into off-the-shelf clustering software.
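Averaging precomputed pairwise scores between clusters is exactly average-linkage AHC on a fixed similarity matrix, so off-the-shelf tooling applies. Here is a minimal scipy sketch, assuming a symmetric PLDA score matrix and a calibrated same-speaker threshold (from the stopping criterion described below); the details are mine, not the exact JHU recipe.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    def ahc_from_scores(scores, threshold):
        # Flip PLDA similarities into distances (larger score = closer),
        # then run average-linkage AHC without ever rescoring clusters.
        dist = scores.max() - scores
        np.fill_diagonal(dist, 0.0)
        Z = linkage(squareform(dist, checks=False), method='average')
        # Stop merging once the most similar pair of clusters falls
        # below the calibrated same-speaker threshold.
        return fcluster(Z, t=scores.max() - threshold, criterion='distance')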
0:08:38 So just as a baseline, we compared against k-means clustering with this purity and fragmentation, and the main point is that AHC with this scoring metric was in fact quite a bit better than k-means, so we're comfortable that it seems to be clustering in an intelligent way.
0:08:56 Now we want to move towards doing it for adaptation, but the other thing we need to know is how to decide when to stop clustering: how do we decide how many speakers are really there, because nobody has told us. To do this you have to at some point make a decision that you're going to stop merging, and basically you look at the two most similar clusters and you've got to decide, are these from different speakers or are they the same, and make a hard decision.
0:09:19 And this is one of the nice contributions of this work, which was really done after the summer, I think: we just treat these scores as speaker recognition scores, and we do calibration in the way that we always do. In particular, the unsupervised calibration method that Niko and Daniel presented at ICASSP can be used exactly in this situation: we can take our unlabeled pile of data and look at all the scores across it to learn a calibration, so we actually know the threshold and we can make a decision about when to stop.
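As a rough illustration of that idea, and not the actual ICASSP method, which is a proper generative model: fit a two-component mixture to the pooled pairwise scores without any labels, and read off a same-speaker threshold where the higher-scoring component takes over.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def unsupervised_threshold(scores):
        # Pool all pairwise scores (upper triangle, no self-scores).
        s = scores[np.triu_indices_from(scores, k=1)].reshape(-1, 1)
        gmm = GaussianMixture(n_components=2, random_state=0).fit(s)
        same = np.argmax(gmm.means_.ravel())  # "same-speaker" component
        # Scan for the score where that component's posterior hits 0.5.
        grid = np.linspace(s.min(), s.max(), 1000).reshape(-1, 1)
        post = gmm.predict_proba(grid)[:, same]
        return float(grid[np.argmax(post > 0.5)])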
0:09:53 So how well does that work? This is across our unlabeled pile as we introduce bigger and bigger piles. The dashed line is the correct number of clusters. These are five random draws, where we take random subsets and average the performance, and the blue is the average, which is the easiest one to see. You can see that in general this technique works pretty well. It always underestimates, typically by about twenty percent, so you think there are fewer speakers than there really are, but you're pretty close. And as for having an automated and reliable way to figure out how many speakers are there: we're pretty excited to even do this well at it; that's a very hard task.
0:10:36 So to actually do the adaptation, then, the recipe is: we use our out-of-domain PLDA to compute the similarity matrix of all pairs; we then cluster the data using that distance metric; we estimate from this how many speakers there are, along with the speaker labels; we generate another set of covariance matrices from this self-labeled data; and then we apply our adaptation formulas to this data.
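Stitching the earlier sketches together, the whole recipe might look like the following; score_all_pairs, across_class, and within_class are hypothetical stand-ins for the PLDA scoring function and its two matrices, and the other helpers are the ones sketched above.

    def unsupervised_adapt(out_plda, in_ivectors, alpha):
        # 1. Score all pairs of unlabeled in-domain cuts with the
        #    out-of-domain PLDA (score_all_pairs is hypothetical).
        scores = out_plda.score_all_pairs(in_ivectors)
        # 2. Learn a stopping threshold from the unlabeled score pool.
        threshold = unsupervised_threshold(scores)
        # 3. Self-label the pile with average-linkage AHC.
        labels = ahc_from_scores(scores, threshold)
        # 4. Re-estimate covariances from the self-labels, then smooth
        #    them back towards the out-of-domain matrices.
        ac_in, wc_in = estimate_plda_covariances(in_ivectors, labels)
        return (adapt_covariance(ac_in, out_plda.across_class, alpha),
                adapt_covariance(wc_in, out_plda.within_class, alpha))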
0:11:11 So here's a similar curve to before: the red here is the out-of-domain system, and the in-domain system is in green at the bottom. What we're also showing here is the AHC adaptation performance, where we estimate the number of speakers, and the supervised adaptation. No, sorry, the supervised adaptation is the one I showed before, excuse me. That's what you get if you have the labels for all of the data; that's what we accomplished the first time. Now we are self-labeling.
0:11:46 Of course we're not as good, but we are in fact much better than we ever thought we could be, because when we first set up this task we really didn't think it would work. In fact Daniel and I had a little bet, and he was convinced that this was never going to work, because how are you going to learn your parameters from a system that doesn't know what your parameters are? But in fact you can. So we've done surprisingly well with self-labeling, and we're still able to close eighty-five percent of the performance gap if we have all the data but it's unlabeled; we're still able to recover almost all the performance.
0:12:16 Now, what if we did know the number of clusters? If we had an oracle that told us there are exactly this many speakers, would that make our system perform better? That's the additional bar here, and in fact our estimation of the number of speakers is good enough, because even had we known it exactly, we would get almost the same performance. So even though we didn't get exactly the correct number of speakers, the hyperparameters that we estimated still work just as well.
0:12:48 And that's illustrated in this way, which is the sensitivity to knowing the number of clusters. Here we're using all the data; the actual number of speakers is here, and this is what we estimated with our stopping criterion. And you can see, as we sweep across all of these different points where we could have stopped and decided that was how many speakers there were, there's not a tremendous sensitivity. If we massively over-cluster then we take a big hit in performance, and if we massively under-cluster it is bad, but there's a pretty big fat region where we get almost the same performance with our hyperparameters had we stopped our clustering at that point.
0:13:28 So in conclusion, then: domain mismatch can be a surprisingly difficult problem in state-of-the-art systems using PLDA. We had already noted that supervised adaptation can work quite well, but in fact unsupervised adaptation also works extremely well: we can close eighty-five percent of the performance gap due to the domain mismatch. In order to do that we need to do this adaptation, using both the out-of-domain parameters and the in-domain parameters, not just self-labeling the in-domain data. And this unsupervised calibration trick in fact gives us a useful and meaningful stopping criterion for figuring out how many speakers are in our data. Thank you.
0:14:21 Time for questions.
0:14:30 I was wondering, I can imagine that the distribution of speakers, basically the number of segments per speaker in your unsupervised set, will make a difference, right? I guess you get this from these SRE or Switchboard data or whatever, so it will be relatively homogeneous. Is that correct, or...?
0:14:56 I think, yes. These classes are not all one size, they are not homogeneous, but this is a good pile of unlabeled data, because in fact it's the same pile that we used as the labeled dataset. So it's pretty much everything we could find from these speakers: some of them have very many phone calls, some of them have fewer, but all of them have quite a few in order to be in this pile. Obviously, for example, you couldn't learn any within-class covariance if you only had one example from each speaker hidden in that pile. So you're absolutely right: it's not just that we do the labeling, it's also that the pile itself has some richness for us to discover.
0:15:34 Before we hand over the microphone, I have a related question. When you train the i-vector extractor, the nice thing is that you can do it unsupervised, but again, how many cuts per speaker? If we had only one speaker with many cuts, obviously that's not good, because we don't get the speaker variability. The converse situation is where you have every speaker only once. Have you done any investigation of whether that would give a good extractor?
0:16:07 I don't think that's something we looked at, but I completely agree it would make me uncomfortable. As I said, in this effort we were just able to show that the out-of-domain data, where we assume we do have a good labeled set somewhere in some domain, could be used the rest of the time, so we're comfortable where it came from. I don't think I've ever run an experiment with what you describe, and that is interesting; I suspect it would not work so well. You'd want both kinds of variability, speaker and channel, coming from a variety of channels, though not quite in the same proportions as you get in the standard setup. If you collect data in the wild, in a situation where there are very many speakers, you might have data like that, so I think that's an interesting question too.
0:16:59thank you
0:17:01 Very impressive work and a nice set of results, so thank you for that. The question I have is: this is all telephone speech, and it works very well with that. Have you considered what would happen if the out-of-domain data was from different channels, such as microphone? And is that even realistic, in the sense that you would have a pre-trained microphone system that you try to adapt?
0:17:25 Right, so yes, we have looked at microphone. The very first work a few years ago on this task was adapting from telephone to microphone, and Daniel revisited it early in the summer, when we were debating with Doug, working on this dataset, whether we trusted it. He did a similar experiment with the SRE telephone and microphone data and actually got similar results. That does sound a bit surprising, but we have seen in the SREs that telephone to microphone is not nearly as hard as it ought to be. I don't know the reason for that, but yes, we have worked with telephone and microphone in SREs, and it's not shockingly different from these results.
0:18:06 I can just add to that answer, on the previous question: we trained an i-vector extractor on a database where there is only one speaker per utterance, and it works about the same as using the 2004 and 2005 data or so.

0:18:29 I didn't know that; that's good to think about. Thank you.
0:18:32 So maybe a really stupid question: yesterday people were mentioning how well the mean shift clustering algorithm is working. You don't seem to use that; you use agglomerative clustering. Is there a reason why?
0:18:52 I believe over the course of the summer we and other people looked at quite a few different algorithms. I know mean shift is used in diarization, and we have looked at it in diarization; I cannot remember if we looked at it for this task. We did look at others, and Stephen is going to talk about some other clustering algorithms, but I don't think he's going to talk about mean shift, so I'm not sure how it compares. It clearly is also useful.
0:19:28 I just want to know if the splits and the protocols are available.

0:19:33 Yes, they are on the JHU website; the link is in the paper.

0:19:38 Okay, thanks.

You have to get the speech data itself elsewhere, but the lists are there. I encourage you to work on this task.
0:19:49 One more question. Let's suppose that you were asked to do not speaker clustering but gender clustering, and you don't have any prior on how many genders there are in the input data. Would the stopping criterion be the same? I mean, would you find the genders? I'm not sure. Or would the clustering accidentally find gender?

0:20:14 Well, let me say one thing first: we did, I think I forgot to mention, make this a gender-independent system.

0:20:20 Well, suppose gender is the classes it tries to cluster, and not the speakers; would any of the resulting clusters be correct?

0:20:27 Well, this is why Daniel thought this wouldn't work: who knows what you're going to cluster by. We're just using the metric, and we are hoping that the out-of-domain PLDA metric is encouraging the clustering to focus on speaker differences, but we cannot guarantee that except with the results.
0:20:49 I think more so than gender: if for example languages were different, and there might be some in this data, you might think that the same speaker speaking multiple languages would confuse our clustering.
0:21:02 So, I'd like to point to one aspect that I think is very important, especially in the forensic framework. Could you show slide five?

0:21:15 Probably.
0:21:24 So, what you've neglected here is the decision threshold.

0:21:30 Yes, we have neglected calibration of the final task, that's true.

0:21:34 So it could possibly be that a factor of three becomes a factor of one hundred, or the factor-of-three degradation could actually become a factor of one.

0:21:46 It could, yes; that is something we simply neglected. You are right, George, although we would think that with the unsupervised calibration we could accomplish that calibration as well.

That is what I would like you to do when you get home, yes.
0:22:02 And to annotate this slide with the decision points. All these systems are not even calibrated, so...

0:22:14 Well, we always have to run a separate calibration process to get to a single operating point.

0:22:21 But go ahead and do that work when you go home. You're only going to have to do it for the in-domain system, and then you apply a threshold, put the dots on those two curves, and send me a copy.
0:22:38 Thank you very much; we will work on our assignment then.

0:22:42 That question is already partially answered by our unsupervised score calibration paper, which was published at ICASSP, so that's true.

0:22:56 Okay, so we thank the speaker.