0:00:07 Yeah, I guess so; it's the end of a long day, so thanks for staying.
0:00:15 This talk has a lot of overlap with what the first speaker presented in this session. We are basically trying to find out whether we can have different background models for different sets of speakers. What we propose, at least in this paper, is that speakers can be clustered according to their vocal tract length; another way of doing it is to use the similarity between their MLLR matrices. We show that using a few of these speaker clusters, we get some improvement in performance as opposed to using a single UBM.
0:00:59 The overview of the talk is pretty much as indicated. First, a brief overview of conventional speaker verification, where we very often use a single background model, and then the reason why we might want to use speaker-cluster-wise background models. There are two ways you could do the clustering, at least that is what we suggest in this paper: one is to use the vocal tract length parameter itself, and the other is to use a speaker-dependent MLLR matrix supervector. We then show how we can build a background model for each of these individual speaker clusters, and we compare the performance first with a single gender-independent UBM, and then with a gender-dependent UBM and the gender-dependent speaker cluster models.
0:01:53 Some of this overlaps with what the first speaker presented here. As he pointed out, verification is basically a binary decision problem: given the features and some claimed identity, we compare the log-likelihood ratio between the claimed-speaker model and an alternate model, and if it is beyond a certain threshold we accept the claim; otherwise we reject it.
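As a sketch, the decision rule being described is the standard log-likelihood-ratio test; the notation below is generic rather than taken from the slides:

$$
\Lambda(X) = \log p(X \mid \lambda_{\text{claimed}}) - \log p(X \mid \lambda_{\text{alt}})
\;\;
\begin{cases}
\ge \theta & \text{accept the claim} \\
< \theta & \text{reject the claim}
\end{cases}
$$

where $X$ is the sequence of feature vectors, $\lambda_{\text{claimed}}$ is the claimed speaker's model, $\lambda_{\text{alt}}$ is the alternate (background) model, and $\theta$ is the decision threshold.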
0:02:21 The question is what the alternate hypothesis should be. One option is to say that the alternate hypothesis is a universal background model, a single model that is used for all speakers in the database. Then there are other approaches where we have a set of speaker models, cohorts that are close to a particular speaker, and we take a linear combination of their scores, or we build a background model from those cohorts for that particular speaker. So one approach has one background model for all speakers; the other has a background model for each speaker. A third way is a compromise between the two, which is to have a background model for a group of speakers; and then the question becomes how to group the speakers.
0:03:17 We propose two different ways to group the speakers: one is basically to use the vocal tract length parameter, and the other is to use a speaker-specific MLLR matrix.
0:03:30 This is the basic idea: instead of using one background model and comparing the likelihood under that background model against the corresponding claimed-speaker model, we have different sets of models for different speaker clusters. How we build these speaker-cluster background models is what we talk about in the next slides. The speaker clustering itself was done using either the vocal tract length parameter or the MLLR supervector.
0:04:04 The motivation for using the vocal tract length parameter for speaker clustering is that physiological differences in the vocal tract give rise to differences in the speech spectra. What is shown here is a male speaker in the dark line and a female speaker in the solid line: there are differences in the spectra for the same vowel, for the simple reason that the physiology of the vocal tract system, in terms of size, is very different between the male and the female speaker. We therefore assume that if a group of speakers has a similar vocal tract length parameter, and hence similar physiology, they probably produce a very similar set of spectral characteristics for the same sound, so we can group these speakers together and assume that they have very similar characteristics in terms of the features they produce for a particular sound.
0:05:00 Obviously we need to estimate the VTLN parameter, and we do not have a reference speaker, so one has to use some sort of reference model, if you will. That is where we use the background model itself: it is the reference against which we score the features under different warp parameters, and we choose the one that does best. Each speaker's vocal tract length parameter is thus estimated with respect to the background model. This is similar to what we do in speech recognition too, except that the UBM takes the place of the speaker-independent recognition model.
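A minimal sketch of that warp-factor search, assuming a UBM already trained with scikit-learn; the `warp_spectra` helper is a toy stand-in, since real VTLN warps the filterbank frequency axis before cepstral analysis:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

ALPHAS = np.arange(0.80, 1.21, 0.02)  # typical VTLN warp-factor grid

def warp_spectra(frames, alpha):
    """Toy linear warp f -> alpha * f of the bin axis of spectral-envelope
    frames (n_frames x n_bins), done by resampling each frame."""
    bins = np.arange(frames.shape[1], dtype=float)
    return np.stack([np.interp(bins * alpha, bins, frame) for frame in frames])

def estimate_warp(frames, ubm: GaussianMixture):
    """Pick the warp factor whose warped features score best against the
    UBM, which plays the role of the reference model."""
    scores = [ubm.score(warp_spectra(frames, a)) for a in ALPHAS]
    return float(ALPHAS[int(np.argmax(scores))])
```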
0:05:41 The other way that we could classify speakers into groups is to use the MLLR matrix itself. There is a lot of evidence that MLLR captures quite a bit of information about a particular speaker. So we stack the columns of the MLLR matrix to form a supervector, and then we do a very simple clustering of these MLLR supervectors over the speakers in the database using the k-means algorithm with a plain Euclidean distance. So given the UBM and the speaker's training data, we get an MLLR matrix for each speaker, we stack its columns to form a supervector that characterizes that speaker, and then we group the speakers according to the clusters formed by the k-means algorithm.
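A sketch of that clustering step; the random matrices below stand in for the speaker-specific MLLR transforms estimated against the UBM, and the dimensions and cluster count are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def mllr_supervector(W):
    """Stack the columns of a d x (d+1) MLLR transform W = [A | b]
    into a single supervector."""
    return W.flatten(order="F")  # column-major, i.e. column stacking

# One MLLR transform per training speaker in the real system; random here.
d, n_speakers = 39, 200
rng = np.random.default_rng(0)
supervectors = np.stack([mllr_supervector(rng.normal(size=(d, d + 1)))
                         for _ in range(n_speakers)])

# Plain k-means with Euclidean distance, as described in the talk.
kmeans = KMeans(n_clusters=14, n_init=10, random_state=0).fit(supervectors)
cluster_of_speaker = kmeans.labels_  # maps each speaker index to a cluster
```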
0:06:36 So now that we have grouped the speakers into different classes, we build a different background model for each of these groups. What we have done here is basically a simple MLLR adaptation of the UBM to get a new set of means for each of these speaker-cluster background models. Each of these cluster-adapted models is obtained from the UBM by just a transformation of the means, and the transformation matrices are estimated using all the data from a particular speaker cluster; that is what is written here. Given the UBM, you form for each cluster its own background model. The clusters can be based either on VTLN as a parameter, so one cluster per warp factor, or on a set of MLLR-clustered speakers.
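As a sketch, the cluster background models described here amount to a mean-only MLLR transform of the UBM (generic notation, not the paper's):

$$
\hat{\mu}_{c,k} = A_c\,\mu_k + b_c ,
$$

where $\mu_k$ is the $k$-th UBM mean and $(A_c, b_c)$ is the transform estimated on the pooled data of cluster $c$; mixture weights and covariances are carried over from the UBM unchanged.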
0:07:35 These are the implementation aspects. Given the UBM, I first estimate, for each speaker in the database, the corresponding VTLN parameter. Say I am looking at the VTLN parameter 1.20 and I find that speakers 3, 4, and 6 all have this parameter; I group them together. Then if I am looking at the VTLN parameter 0.82, speaker IDs 2, 8, and 9 possibly belong to it, so I group those together. Using these groups of speakers, I transform the GMM-UBM to form a background model, which basically is an MLLR adaptation for that particular group of speakers. Then I do the individual speaker modelling by MAP adaptation of the corresponding background model: for each individual speaker, I use that speaker's data to do the MAP adaptation. The same procedure applies if I had instead clustered the speakers based on MLLR.
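The MAP adaptation step is presumably the standard relevance-MAP update of the mixture means; the relevance factor $\tau$ below is the usual free parameter, not a value from the paper:

$$
\hat{\mu}_k = \alpha_k\,\bar{x}_k + (1-\alpha_k)\,\mu_k ,
\qquad
\alpha_k = \frac{n_k}{n_k + \tau},
$$

where $n_k$ is the soft count of the speaker's frames aligned to mixture $k$ and $\bar{x}_k$ is their posterior-weighted mean, so components that see little speaker data stay close to the background model.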
0:08:41 If you look at the test phase, it is almost identical to the conventional case, apart from two small differences. Given the test utterance, I compute the log-likelihood ratio between the speaker model and the background model. In the conventional case there is one single UBM, the speaker model is obtained by adapting it, and then I apply a threshold to decide whether to accept or reject. Here exactly the same thing is done, but with slightly different models: the background model is now the one built specifically for that particular speaker's cluster, and the speaker model is obtained by adapting this cluster UBM, so the speaker model is slightly different too, and then again I apply a log-likelihood ratio test. So the two systems have identical computational cost; they differ only in which models we use for the background.
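A minimal sketch of that test-phase scoring; only the choice of `background_gmm` distinguishes the two systems, which is why the per-trial cost is identical:

```python
from sklearn.mixture import GaussianMixture

def verify(frames, speaker_gmm: GaussianMixture,
           background_gmm: GaussianMixture, threshold: float) -> bool:
    """Accept the identity claim if the average per-frame log-likelihood
    ratio clears the threshold. Pass the single UBM for the conventional
    system, or the claimed speaker's cluster UBM for the cluster-wise one."""
    llr = speaker_gmm.score(frames) - background_gmm.score(frames)
    return llr >= threshold
```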
0:09:40 This is just the standard database that we used: NIST 2002 for background modelling, and the evaluation is one-side train, one-side test on NIST 2004.
0:09:58 What we notice is that, depending on the number of VTLN clusters we form, as the number of clusters increases you do see some decrease in the EER. This is what you get with a single gender-independent UBM, this is the EER you get with VTLN-based clustering, and this is the EER you get with MLLR-based speaker clustering. We find that MLLR does slightly better than VTLN, but both of them give significantly better performance than the single-UBM baseline. The same holds true for the minimum DCF as well.
0:10:43 A couple of things to notice: one, both VTLN and MLLR clustering give some improvement in performance over a single UBM, and MLLR performs slightly, sometimes quite a bit, better than VTLN. For each method we pick the number of clusters that gives the best performance, and this is the corresponding DET curve, which again shows MLLR doing a little better than the black curve obtained by VTLN clustering, while the blue one is the regular single-UBM baseline.
0:11:22 So the question to ask is why MLLR performs better than VTLN. There is a lot of other information available, but if you look at the black and the white at the bottom, the black corresponds to female speakers and the white to male speakers. Here we have chosen fourteen clusters, the number that gave the best performance for VTLN, and you see that there are a lot of clusters in which VTLN mixes male and female speakers: for this warp factor there are both male and female speakers, and similarly for 0.90 and 0.96 you see some overlap between the male and female speakers. On the other hand, when you look at the MLLR supervectors, the black and white are very distinct: some clusters pick up only the female speakers, and the two other MLLR clusters pick up only the male speakers. So there is quite a nice purity in terms of gender when you cluster with MLLR supervectors, and we think that is possibly one of the reasons why MLLR seems to consistently perform better than VTLN.
0:12:39 We wanted to go one step further and see: if that was indeed the case, then if we separate the clusters according to gender, would the gap between MLLR and VTLN disappear, so that we get very similar performance from both? That is what the next set of experiments indicates.
0:12:56 Here we now have gender-wise UBMs, one for males and one for females, and obviously you see some improvement in performance compared to the gender-independent UBM. But what we conjectured also seems to hold: once we do a gender-wise split of the clusters, VTLN and MLLR give almost comparable performance. MLLR is still slightly better, but nevertheless the performance is almost comparable. The same holds for the minimum DCF as well.
0:13:33 So the point we want to make is that if you use VTLN by itself as the clustering parameter, it sometimes gives not-so-good performance, for the simple reason that it puts male and female speakers with the same alpha into the same cluster. But with gender-wise clustering, MLLR and VTLN give almost the same, comparable performance. And in any case, both of these clustering methods obviously outperform the gender-wise single UBM.
0:14:04 That is reflected in the DET curves too: you can see that both the MLLR-clustered and the VTLN-clustered UBMs, both of which are gender-wise clustered now, have very similar performance, and they always do better than a gender-wise single UBM.
0:14:22 So the bottom line is that if you are willing to increase the number of background models, and not by much, you get some gain in performance: we find that something like two male and two female clusters is enough for reasonably good performance, in both the gender-independent and the gender-dependent case. The computational cost at test time is the same as with a single UBM, because we are still just comparing two models. The MLLR supervector performs better than VTLN in most cases, but the gap narrows down if you are willing to use gender-wise speaker clustering. That's it.
0:15:09 We have time for one last question.
0:15:20 [Audience] You cluster the speakers, so you use a different UBM depending on the training speaker, right? You cluster the training speakers, so one training speaker is associated with only one UBM. Did I get that right?
0:15:40 [Speaker] Yes.
0:15:41 [Audience] OK, but now when you have a new test sample...
0:15:48 [Speaker] Are you talking about speaker identification or speaker verification? Oh, I see; this is a speaker verification task, so there is one particular claimed speaker only, and the test is scored against that speaker's cluster UBM. It is not speaker identification; identification would be much more expensive. So here each cluster of speakers, not each individual speaker, is associated with one background model.
0:16:10 [Audience] Right, OK, but the reason it is not more expensive is that you are only considering one training speaker.
0:16:15 [Speaker] Right.
0:16:23 OK, I think there is no more time, so thank you very much, and thanks for attending this session.