0:00:16 Hello. I am a researcher at the Computer Science Institute, which is affiliated with CONICET and with the University of Buenos Aires. The work I'm going to talk about today was done in collaboration with Mitchell McLaren from the STAR Lab at SRI International.
0:00:35 So let me start by describing one of the most standard speaker verification pipelines these days. The pipeline is composed of three stages. First we have the speaker embedding extractor, which is meant to transform the two input signals of a trial into fixed-length vectors, x_1 and x_2 here. Then we have a stage that does LDA, followed by mean and variance normalization, and then length normalization. The resulting vectors x_1 and x_2 are then processed by a PLDA stage, which computes a score for the trial, which can then be thresholded to make the final decision.
0:01:14 The PLDA scores are computed as LLRs, log-likelihood ratios, under certain Gaussian assumptions. The form of the LLR is this: it is the logarithm of the ratio between two probabilities, the probability of the two inputs given that the speakers are the same, and the probability of the inputs given that the speakers are different. Given the Gaussian assumptions in PLDA, this LLR can be computed with a closed form, which is a polynomial in x_1 and x_2; you can find the formula in the paper.
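For reference, the quantity in question is, in standard notation (the exact parameterization in the paper may differ):

    LLR(x_1, x_2) = log [ p(x_1, x_2 | same speaker) / p(x_1, x_2 | different speakers) ]

and under Gaussian PLDA it reduces to the well-known second-order polynomial form

    LLR(x_1, x_2) = 2 x_1^T Λ x_2 + x_1^T Γ x_1 + x_2^T Γ x_2 + c^T (x_1 + x_2) + k,

where the matrices Λ and Γ, the vector c, and the constant k are all derived from the PLDA model parameters.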
0:01:51 Now, the problem is that, in most cases, what comes out of PLDA are scores that are very miscalibrated. This means that the values we compute as LLRs actually are not LLRs. The cause of this mismatch is that the assumptions made by PLDA do not really match the real data.
0:02:16 Miscalibrated scores have the problem that they have no probabilistic interpretation. This means that we cannot use their absolute values; we can only use them relative to each other. So we could rank trials, for example, but we cannot interpret the scores themselves. Let's say that you get a score of minus one from a certain system for a certain trial. You would only be able to tell what this minus one means after you have seen a distribution of scores from some development data that has gone through the system. Once you see this distribution, you can interpret this minus one properly, and you could actually threshold the score and decide whether this is a target trial.
0:03:10 Okay, so we would like scores to be well calibrated, because then they have this nice property that they are LLRs, so we can interpret their values, and we can also use Bayes' rule to make a decision on the threshold without having to see any development data.
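For reference, this is the standard detection-theory result being invoked: for a well-calibrated LLR, the Bayes-optimal decision threshold depends only on the prior and the costs, not on any development data:

    θ = log [ C_fa (1 - P_tar) / (C_miss P_tar) ],

where P_tar is the prior probability of a target trial and C_miss and C_fa are the costs assigned to misses and false alarms.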
0:03:30 Calibration is generally done with an affine transformation that is trained using logistic regression. Let's say you know that your scores are miscalibrated. What you do then is train this alpha and beta, which are the two parameters of the affine transformation, so that they minimize the cross-entropy, which is the logistic regression objective function, and then you get properly calibrated LLRs at the output.
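As a minimal sketch of this step (my own illustration, not the paper's code; it assumes the scores s and labels y are available as NumPy arrays, with y = 1 for target trials and 0 for non-target trials):

    import numpy as np
    from scipy.optimize import minimize

    def cross_entropy(params, s, y):
        alpha, beta = params
        llr = alpha * s + beta
        # log(1 + exp(-llr)) for targets, log(1 + exp(llr)) for non-targets
        return np.mean(y * np.logaddexp(0.0, -llr) + (1 - y) * np.logaddexp(0.0, llr))

    def train_calibration(s, y):
        # fit alpha and beta by minimizing the binary cross-entropy
        result = minimize(cross_entropy, x0=[1.0, 0.0], args=(s, y))
        return result.x  # alpha, beta

In practice, a prior-weighted version of this cross-entropy is often used so that the calibration targets a particular effective prior.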
0:04:02 Okay, so basically what this means is that we take the pipeline we had and we just add one stage: the global calibration.
0:04:12 Now, the problem is that this doesn't really solve the problem in general. We are only solving the calibration problem for the exact set on which we trained the calibration parameters. If the calibration set doesn't match our test set, then we will still have a calibration problem.
0:04:36 These results illustrate this. I will explain the test sets later; for now, what is important is that I am showing three different PLDA systems that are identical up to the calibration stage. What differs is which training data was used to train the calibration parameters; those are the red bars. What is important here is to compare the height of each bar, which is the actual Cllr for that system, with the black line, which is the min Cllr for that system. If the difference between the two is small, it means the system is well calibrated; if it is big, it means it is not well calibrated.
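For reference, the metric shown is the standard Cllr, the cross-entropy between the labels and the scores interpreted as LLRs:

    Cllr = 1/2 [ (1/N_tar) Σ_targets log2(1 + e^{-s_i}) + (1/N_non) Σ_non-targets log2(1 + e^{s_j}) ],

and min Cllr is the value obtained after optimally recalibrating the scores with a monotonic transformation, so the gap between the two measures the calibration loss.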
0:05:28 What we see here is that the performance, the actual Cllr, is very sensitive to which set was used to train the calibration model. For example, the Vox plus SRE plus Switchboard set, which is mostly Vox in this case, matches the Speakers in the Wild dataset very well, so it gives very good calibration there, but horrible calibration for SRE. Similarly, the RATS data is a very good match for LASRS, but is not so good for SRE16. So basically this means we cannot get a single global calibration model that will work well across the board.
0:06:14 Alright, so the goal of this work is to design a system that does not require this recalibration for every new condition. It is quite an ambitious goal: we basically want a speaker verification system that can be used out of the box, without needing a development dataset for the conditions of interest.
0:06:34 Okay, so going back to the pipeline: the standard approach in the pipeline I showed is to train each of the stages separately, each one on the output data that comes out of the previous stage, and with different objectives. The first one, the speaker embedding extractor, is trained with a speaker classification objective; the LDA and the PLDA are trained to maximize the likelihood; and finally, the calibration stage is trained to optimize binary cross-entropy, which is a speaker verification objective.
0:07:19 Now, one simple thing we can do is just integrate the three stages into one model. We may think this is some solution to the calibration problem, and that it may actually solve our issue of needing calibration across conditions. What we do is basically keep the exact same functional form as in the standard pipeline, but instead of training the stages separately with different objectives, we train them jointly using stochastic gradient descent. For this, of course, we need mini-batches that are made of trials rather than samples. What we do is randomly select speakers, select two samples for each speaker, and then from that list of samples create all the possible trials, all the pairs among all the samples. With that we can compute the binary cross-entropy, and we optimize that.
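A minimal sketch of this batch construction (my own illustration under the assumptions above, not the paper's code; embeddings_by_speaker is assumed to map each speaker id to a list of embedding vectors):

    import itertools
    import random
    import numpy as np

    def make_trial_batch(embeddings_by_speaker, num_speakers):
        # randomly select speakers and take two samples from each
        speakers = random.sample(list(embeddings_by_speaker), num_speakers)
        samples = []
        for spk in speakers:
            samples += [(spk, e) for e in random.sample(embeddings_by_speaker[spk], 2)]
        # form all possible trials among the selected samples
        x1, x2, labels = [], [], []
        for (spk_a, emb_a), (spk_b, emb_b) in itertools.combinations(samples, 2):
            x1.append(emb_a)
            x2.append(emb_b)
            labels.append(1 if spk_a == spk_b else 0)  # 1 = same-speaker trial
        return np.stack(x1), np.stack(x2), np.array(labels)

The binary cross-entropy from the calibration sketch above can then be computed over the scores of all trials in the batch.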
0:08:31 This is not the first time something like this has been proposed, of course. Burget and others did something very similar years ago; at the time, they actually trained the backend with an SVM or with linear logistic regression instead of stochastic gradient descent, but the concept is basically the same. More recently, there have been a few papers on end-to-end speaker verification that use some flavor of this idea, where they train a backend, which usually has a very similar form to this one, discriminatively; the references are in the paper. These papers actually report improved discrimination performance, but they don't usually report calibration performance, which is what we care about in this work. And what we actually found in our previous paper is that this approach of just training the PLDA backend discriminatively is not sufficient to get good calibration across conditions. We know that from our previous papers, so this architecture trained jointly is not enough.
0:09:55 So what is the problem with this basic form? As we showed before, the calibration stage is a global one, the same as in the standard pipeline, and it seems that this does not give the model enough flexibility to adapt to the different conditions in the data. Even if you train the model with a lot of different conditions, it will just adapt to the majority condition.
0:10:27 So what we propose is to add a branch to this model. We keep the speaker verification branch the same, and we add a branch that is in charge of computing the calibration parameters as a function of both input vectors, x_1 and x_2. The form of this branch starts the same as the top one: an affine transformation followed by length normalization; of course, the parameters of this affine transformation are different from the ones in the top branch. Then we do dimensionality reduction; we go to a very low dimension (in the paper we use dimension five) to compute these vectors, which we call side-information vectors. Then we use these vectors to compute an alpha and a beta, using a very simple form which is quite similar to the PLDA form above. So, in the end, we have two branches: one is in charge of computing the score, and the other one is in charge of computing the calibration parameters for each trial.
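Schematically, and using my own notation rather than the paper's, the output of the combined model has the form

    llr(x_1, x_2) = α(q_1, q_2) · s(x_1, x_2) + β(q_1, q_2),

where s is the raw score from the speaker branch, q_1 and q_2 are the low-dimensional side-information vectors extracted by the new branch, and α and β are computed from q_1 and q_2 with a simple second-order form similar to the PLDA score.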
0:11:40 I'll show the results now, so let me talk about the data. We have a whole lot of training data. We used VoxCeleb 1 and 2; SRE data, that is, Speaker Recognition Evaluation data from 2005 to 2012, plus Mixer 6; and Switchboard. All of that data is actually shared with the embedding extractor training data; we just used half of what we used for embedding extractor training, simply for expediency in the experimentation. Then we have two more sets: RATS, which is telephone data in several non-English languages, and FVC Australian, which is forensic voice comparison data, a very clean dataset, studio microphone, in Australian English.

0:12:34 For testing we use SRE16, SRE18, Speakers in the Wild, LASRS, which is a bilingual set recorded over several different microphones, and the Chinese version of the forensic voice comparison data. The recording conditions of these last two sets are very similar, but the language is different. For three of these sets, SRE16, SRE18, and Speakers in the Wild, we also use the development part; with that we do all the parameter tuning, and we choose the best iteration for each of the models, things like that.
0:13:18 Okay, so here are the results. The red bars are the same ones as in the previous figure I showed, and I added the blue bar, which is the system we propose. As you can see, in most cases it is as good as or better than the best global calibration model. So we basically achieved what we wanted, which is to have a single model that adapts to the test conditions without us telling it what the test conditions are.
0:13:56 The only exception is this FVC Chinese case, which is not well calibrated at all. In fact, there is one global PLDA model that is better than the one we propose; it is still bad, but it is better than ours. The problem with that set is basically that it is a condition that is not seen in combination during training: we have clean data in training, but it is not in Chinese, and we have Chinese data in training, but it is not clean. So the model does not seem to be able to learn how to properly calibrate that data, unfortunately. This just means there is still work to be done; we have not fully achieved that ambitious goal I mentioned before, which was to have a completely general, out-of-the-box system.
0:14:54 Okay, so before finishing, I'd like to describe a few details of how this model is trained, because they are essential to getting good performance. One important thing is to do a non-random initialization. What we do, and many of the end-to-end training papers do similar things, is initialize the speaker branch with the parameters of a standard PLDA baseline; that is very standard. Then, for the side-information branch, we initialize its first stage with the bottom components of the LDA transform that we trained for the speaker branch. That means that what comes out of here is basically the worst you could do for speaker ID, which should be close to the best you can do for condition ID; we are trying to extract the condition information from the input. Then this matrix here, which does not have any reasonable default value, we just initialize randomly. And these two components here we initialize so that what comes out of here are the global calibration parameters at the first iteration of training. So basically, at initialization, the scores that come out of this model are the same ones that would come out of a standard PLDA pipeline.
0:16:28 Here are the results comparing three different initialization approaches: random; then one called partial, which means what I described before but without initializing this stage with the LDA bottom components, using random values there instead; and then the blue one, which is the full initialization. The blue one is the best of the three, so it means it is worth the trouble to take the time to find initial parameters in this smart way.
0:17:04 Another important thing is that we train the model in two stages. The first stage uses all the training data to train all the parameters. Then, in the second stage, we freeze the LDA and PLDA blocks and train only the rest of the parameters using domain-balanced data. This is important because, if the data is not balanced, most of the trials in any given batch would come from one domain, and we would just be optimizing things for that domain, the one that has more samples.
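A minimal sketch of one way to do this balancing (my own assumption of an implementation, not the paper's code):

    import random

    def sample_speakers_balanced(speakers_by_domain, speakers_per_domain):
        # speakers_by_domain: dict mapping domain name -> list of speaker ids
        # draw the same number of speakers from every domain so that no
        # single domain dominates the trials in the batch
        batch = []
        for domain, speakers in speakers_by_domain.items():
            batch += random.sample(speakers, speakers_per_domain)
        return batch

The resulting speaker list can then be fed to the trial-batch construction sketched earlier.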
0:17:42 Finally, the convergence of the model is kind of a big issue. The validation performance jumps around a lot from batch to batch; if you look at the optimization curve, it can change significantly from one batch to the next. So what we do is basically choose the best iteration using the validation sets that I mentioned before, and the good thing is that this approach seems to generalize well to other sets, even to sets that are not very well matched to the validation sets. We tried a bunch of tricks to smooth out the validation performance, like regularization and a slower learning rate, and they do succeed in smoothing out the validation curve, but they actually make the minimum worse. So we keep the wild original curves and just choose the minimum.
0:18:38 And, well, there is a GitHub repository with exactly this model implemented, for training. For evaluation, you just need to have pre-computed embeddings, and we provide an example with embeddings. Feel free to use it or modify it, and let me know if you find bugs. I will be happy to respond to questions and comments.
0:19:05 Okay, so, in conclusion: we developed a model that achieves excellent performance across a wide variety of conditions. It integrates the different stages of a speaker verification pipeline into one stage and trains the whole thing jointly. It also integrates an automatic extractor of side information, which it then uses to condition the calibration parameters, and this achieves our goal of getting good performance across different conditions. Of course, there are many open issues, like, for example, the training convergence; I do not think we are done with that, and I would like to see an easier-to-optimize model. And of course, we would like to plug this model in with the embedding extractor and train everything jointly.
0:19:56 Okay, thank you very much. If you have any questions, please write to me, and we can discuss in more detail through the conference platform. Thank you.