0:00:17 | I am a researcher at the Computer Science Institute, which is affiliated with CONICET and the University of Buenos Aires in Argentina.

0:00:26 | The work I'm going to talk about today was done in collaboration with Mitchell McLaren from the STAR Lab at SRI International.

0:00:35 | So let me start by describing one of the most standard speaker verification pipelines used these days.

0:00:40 | The pipeline is composed of three stages.

0:00:44 | First we have the speaker embedding extractor, which is meant to transform the audio sequences in the two sides of a trial into fixed-length vectors, x1 and x2.

0:00:54 | Then we have a stage that does LDA, followed by mean and variance normalization, and then length normalization.

0:01:02 | The resulting vectors x1 and x2 are then processed by a PLDA stage, which computes a score for the trial.

0:01:10 | That score can then be thresholded to make the final decision.

0:01:14 | The PLDA scores are computed as if they were log-likelihood ratios (LLRs), under the Gaussian assumptions of the model.

0:01:23 | The form of the LLR is this: it is the logarithm of the ratio between two probabilities, the probability of the two inputs given that the speakers are the same, and the probability of the inputs given that the speakers are different.

0:01:39 | This LLR, given the Gaussian assumptions in PLDA, can be computed in closed form, as a polynomial in x1 and x2; you can find the formula in the paper.
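As a sketch of what such a closed form looks like: the two-covariance PLDA score is a second-order polynomial in x1 and x2. The matrices below are random placeholders, not parameters derived from a trained model.

```python
import numpy as np

def plda_llr(x1, x2, Lam, Gam, c, k):
    """Closed-form PLDA LLR: a second-order polynomial in x1 and x2.

    In a real system, Lam, Gam, c, and k are derived from the within- and
    between-speaker covariances of a trained PLDA model; here they are
    free placeholders just to show the functional form."""
    return (x1 @ Lam @ x2 + x2 @ Lam @ x1
            + x1 @ Gam @ x1 + x2 @ Gam @ x2
            + c @ (x1 + x2) + k)

rng = np.random.default_rng(0)
d = 4
A = rng.standard_normal((d, d))
Lam = (A + A.T) / 2            # symmetric, as in the true closed form
B = rng.standard_normal((d, d))
Gam = (B + B.T) / 2
c = rng.standard_normal(d)
k = -0.5
x1, x2 = rng.standard_normal(d), rng.standard_normal(d)

# The score is symmetric in the two sides of the trial
s12 = plda_llr(x1, x2, Lam, Gam, c, k)
s21 = plda_llr(x2, x1, Lam, Gam, c, k)
```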

0:01:51 | So the problem is that, in most cases, what comes out of PLDA are scores that are miscalibrated. This means that the values we compute as LLRs are not really LLRs.

0:02:04 | The cause of this mismatch is that the assumptions we make in PLDA do not really match the real data.

0:02:18 | Miscalibrated scores have the problem that they have no probabilistic interpretation. In consequence, we cannot use their absolute values; we can only use them relative to each other.

0:02:32 | So we could rank trials, but we cannot interpret the scores themselves.

0:02:37 | Say, for example, that you get a score of minus one from a certain system for a certain trial. You would only be able to tell what this minus one means after you have seen the distribution of scores on some development data that has gone through the system.

0:02:55 | Once you see that distribution, you can interpret this minus one properly, and you can actually threshold the score and decide whether the trial is a target trial.

0:03:10 | Okay, so we would like scores to be well calibrated, because then they have the nice property that they are LLRs: we can interpret their values, and we can use Bayes' rule to make decisions with a theoretical threshold, without having to see any development data.
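To make that concrete: the Bayes-optimal threshold on a true LLR depends only on the target prior and the decision costs, so no development data is needed. A minimal sketch:

```python
import math

def bayes_threshold(p_target, c_miss=1.0, c_fa=1.0):
    """Theoretical LLR threshold from Bayes decision theory.

    Decide "same speaker" when llr >= threshold. Only the target
    prior and the two error costs are needed, no development data."""
    return math.log((c_fa * (1.0 - p_target)) / (c_miss * p_target))

# With a flat prior and equal costs the threshold is exactly 0
t = bayes_threshold(0.5)

# A score of -1 now has meaning on its own: at this operating point
# it falls below the threshold, so the trial is rejected.
decision = -1.0 >= t
```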

0:03:31 | Calibration is generally done with an affine transformation trained using logistic regression. So, say you know your scores are miscalibrated.

0:03:42 | What you do is train an alpha and a beta, the two parameters of the affine transformation, so that they minimize the cross-entropy, which is the logistic regression objective function. At the output you then get properly calibrated LLRs.
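A minimal sketch of this calibration step, using plain gradient descent on synthetic scores; the toy data and hyperparameters are illustrative only.

```python
import numpy as np

def train_calibration(scores, labels, lr=0.01, steps=2000):
    """Fit s' = alpha * s + beta by minimizing the binary
    cross-entropy (the logistic regression objective)."""
    alpha, beta = 1.0, 0.0
    for _ in range(steps):
        z = alpha * scores + beta
        p = 1.0 / (1.0 + np.exp(-z))   # sigmoid of the calibrated score
        g = p - labels                  # d(BCE)/dz
        alpha -= lr * np.mean(g * scores)
        beta -= lr * np.mean(g)
    return alpha, beta

# Toy miscalibrated scores: Gaussians at +/-1 with std 2, so the true
# posterior log-odds is 0.5 * score; a well-trained alpha should be ~0.5.
rng = np.random.default_rng(0)
tar = 1.0 + 2.0 * rng.standard_normal(2000)
non = -1.0 + 2.0 * rng.standard_normal(2000)
scores = np.concatenate([tar, non])
labels = np.concatenate([np.ones(2000), np.zeros(2000)])

alpha, beta = train_calibration(scores, labels)
```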

0:04:02 | Basically, what this means is that we take the pipeline we had and just add one more stage: the global calibration.

0:04:12 | Now, the problem is that this does not really solve the problem in general: we are only solving it for the exact set on which we trained the calibration parameters.

0:04:27 | If the calibration set does not match our test set, then we will still have a calibration problem.

0:04:36 | These results illustrate this. I will explain what the evaluation sets are later; for now, what is important is that I am showing three different PLDA systems that are identical up to the calibration stage. What differs is which training data was used to train the calibration parameters; those are the red bars.

0:05:08 | What is important here is to compare the height of each bar, which is the actual Cllr for each system, with the black line, which is the minimum Cllr for that system.

0:05:19 | If the difference between the two is small, the system is well calibrated; if it is big, the system is not well calibrated.
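The metric behind these bars can be sketched as follows. This implements actual Cllr, which penalizes miscalibration; min Cllr (the black line) is the same cost after an optimal monotonic recalibration of the scores.

```python
import numpy as np

def cllr(tar_llrs, non_llrs):
    """Cost of the log-likelihood ratio, a calibration-sensitive metric.
    A useless system that always outputs 0 gets exactly 1 bit; confident,
    correctly signed LLRs drive the cost toward 0."""
    c_tar = np.mean(np.log2(1.0 + np.exp(-np.asarray(tar_llrs))))
    c_non = np.mean(np.log2(1.0 + np.exp(np.asarray(non_llrs))))
    return 0.5 * (c_tar + c_non)

# An all-zero system has Cllr of exactly 1 bit
neutral = cllr(np.zeros(10), np.zeros(10))

# Confident, correctly signed LLRs give a much lower cost
good = cllr(np.full(10, 6.0), np.full(10, -6.0))

# Shifting every score by a constant (a calibration error) raises the
# cost even though discrimination is completely unchanged
shifted = cllr(np.full(10, 6.0) + 5.0, np.full(10, -6.0) + 5.0)
```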

0:05:28 | What we see here is that the performance, the actual Cllr, is very sensitive to which set was used to train the calibration model.

0:05:39 | So, for example, Vox, which is VoxCeleb plus Switchboard but mostly VoxCeleb in this case, matches the Speakers in the Wild dataset very well, so it gives very good calibration there but horrible calibration for SRE.

0:05:56 | Similarly, the RATS data is very well matched to LASRS but is not so good for SRE16.

0:06:04 | So basically this means we cannot get a single global calibration model that will work well across the board.

0:06:14 | All right, so the goal of this work is to build a system that does not require this recalibration for every new condition. It is quite an ambitious goal.

0:06:24 | We basically want a speaker verification system that can be used out of the box, without needing a development dataset.

0:06:34 | Okay, so going back to the pipeline: the standard approach, in the pipeline I showed, is to train each of the stages separately. You freeze the previous stage and, with the data that comes out of that stage, you train the next stage.

0:06:54 | The stages have different objectives. The first one, the speaker embedding extractor, is trained with a speaker classification objective; the LDA and the PLDA are trained to maximize the likelihood; and finally the calibration stage is trained to optimize binary cross-entropy, which is a speaker verification objective.

0:07:21 | Now, one simple thing we can do is just integrate the three stages and train them jointly. We may think this is a solution to the calibration problem, and that it may actually solve our issue of needing recalibration across conditions.

0:07:38 | What we do is keep the exact same functional form as in the standard pipeline, but instead of training the stages separately with different objectives, we train them jointly using stochastic gradient descent.

0:07:54 | For this, of course, we need mini-batches that are made of trials rather than samples.

0:08:02 | What we do is randomly select speakers, select two samples for each speaker, and then, from that list of samples, create all the possible trials across those samples.

0:08:24 | With that we can compute the binary cross-entropy, and that is what we optimize.
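The batch-construction recipe just described can be sketched as below; the speaker and utterance names are hypothetical, and the real training code operates on embeddings rather than string IDs.

```python
import itertools
import random

def make_trial_batch(spk_to_samples, n_speakers, rng):
    """Build one mini-batch of trials: pick speakers at random, take two
    samples from each, then form every pairwise trial among the pooled
    samples. A trial is (sample_a, sample_b, is_same_speaker)."""
    speakers = rng.sample(sorted(spk_to_samples), n_speakers)
    pool = []                            # (speaker, sample) pairs
    for spk in speakers:
        pool.extend((spk, s) for s in rng.sample(spk_to_samples[spk], 2))
    return [(sa, sb, spk_a == spk_b)
            for (spk_a, sa), (spk_b, sb) in itertools.combinations(pool, 2)]

rng = random.Random(0)
data = {f"spk{i}": [f"spk{i}_utt{j}" for j in range(5)] for i in range(20)}
batch = make_trial_batch(data, n_speakers=8, rng=rng)

# 8 speakers x 2 samples = 16 samples -> 16*15/2 = 120 trials,
# of which exactly 8 (one per speaker) are same-speaker trials
n_total = len(batch)
n_target = sum(1 for _, _, same in batch if same)
```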

0:08:31 | This is not the first time that something like this has been proposed, of course. A decade ago, Burget and others proposed something very similar; at the time they trained the backend with an SVM or with linear logistic regression instead of stochastic gradient descent, but the concept is the same.

0:08:55 | More recently, there have been a few papers on end-to-end speaker verification that use some flavor of this idea, where they discriminatively train a backend whose functional form is usually very similar to this one. You can find the citations in the paper.

0:09:21 | These papers actually report improved discrimination performance, but they do not usually report calibration performance, which is what we care about in this work.

0:09:32 | What we found in our previous paper is that this approach of just training the PLDA backend discriminatively is not sufficient to get good calibration across conditions.

0:09:45 | We know that from our previous papers, so this architecture, trained jointly, is not enough.

0:09:57 | So what is the problem? In this basic form, as we showed before, the calibration stage is global, the same as in the standard pipeline.

0:10:11 | It seems that this does not give the model enough flexibility to adapt to the different conditions in the data. Even if you train the model with many different conditions, it will just adapt to the majority condition.

0:10:27 | So what we propose to do is add a branch to this model. We keep the speaker verification branch the same, and we add a branch that is in charge of computing the calibration parameters as a function of both input vectors, x1 and x2.

0:10:46 | The form of this branch starts out the same as the top one: an affine transformation followed by length normalization. Of course, the parameters of this affine transformation are different from the top one's.

0:11:00 | Then we do dimensionality reduction, going to a very low dimension; in the paper we use dimension five. This gives us what we call the side-information vectors.

0:11:13 | We then use these vectors to compute an alpha and a beta, using a very simple form which is quite similar to the PLDA form in the top branch.

0:11:26 | So, when we are done, we have two branches: one in charge of computing the score, and the other in charge of computing the calibration parameters for each pair of samples.
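The side-information branch can be sketched roughly as follows. The shapes, initial values, and the exact bilinear form mapping side-info vectors to alpha and beta are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def length_norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

class SideInfoBranch:
    """Sketch of the condition branch: affine transform + length norm,
    projection to a tiny side-info space (dimension 5 in the paper),
    then a simple bilinear form mapping the two side-info vectors of a
    trial to per-trial calibration parameters alpha and beta."""

    def __init__(self, dim_in, dim_side=5, rng=None):
        rng = rng if rng is not None else np.random.default_rng(0)
        self.W = rng.standard_normal((dim_in, dim_in)) * 0.1  # affine stage
        self.b = np.zeros(dim_in)
        self.P = rng.standard_normal((dim_in, dim_side)) * 0.1  # dim reduction
        # one small symmetric form per calibration parameter (illustrative)
        self.A_alpha = np.eye(dim_side) * 0.01
        self.A_beta = np.eye(dim_side) * 0.01
        self.alpha0, self.beta0 = 1.0, 0.0  # global params at initialization

    def side_vector(self, x):
        return length_norm(x @ self.W + self.b) @ self.P

    def calib_params(self, x1, x2):
        s1, s2 = self.side_vector(x1), self.side_vector(x2)
        alpha = self.alpha0 + s1 @ self.A_alpha @ s2
        beta = self.beta0 + s1 @ self.A_beta @ s2
        return alpha, beta

branch = SideInfoBranch(dim_in=16)
x1 = np.ones(16)
x2 = np.arange(16.0) + 1.0
alpha, beta = branch.calib_params(x1, x2)
calibrated = alpha * 3.2 + beta   # raw score 3.2 -> per-trial calibrated score
```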

0:11:40 | I will show the results now, but first let me talk about the data.

0:11:44 | We have a whole lot of training data: VoxCeleb 1 and 2, SRE data (speaker recognition evaluation data) from 2005 to 2012, plus Mixer 6 and Switchboard.

0:12:00 | All of that data is actually shared with the embedding extractor training; we used half of what we used for embedding extractor training, just to speed up experimentation.

0:12:14 | Then we have two more sets: one of telephone data, mostly non-English, in several different languages, and FVC Australian, which is forensic voice comparison data, a very clean dataset recorded with studio microphones in Australian English.

0:12:34 | For testing we use SRE16, SRE18, Speakers in the Wild, LASRS, which is a bilingual set recorded over several different microphones, and the Chinese version of the forensic voice comparison data.

0:12:50 | The recording conditions of these last two sets are very similar, but the language is different, and also different from the other sets.

0:13:01 | We use the development part of three of these sets, SRE16, SRE18, and Speakers in the Wild, to do all the parameter tuning; we choose the best iteration for each of the models, things like that.

0:13:18 | Okay, so here we go to the results. The red bars are the same ones as in the previous figure I showed, and I added the blue bar, which is the system we propose.

0:13:34 | As you can see, in most cases it is similar to or better than the best of the global calibration models.

0:13:44 | So we basically achieved what we wanted, which is to have a single model that adapts to the test conditions without us telling it what the test conditions are.

0:13:56 | The only exception is the FVC-CMN case, which is not well calibrated at all.

0:14:03 | In fact, there is one global PLDA model that is better than the one we propose; it is still bad, but it is better than ours.

0:14:14 | The problem with that set is basically that it is a condition that is never seen, in combination, during training: we have clean data in training, but it is not in Chinese, and we have Chinese data in training, but it is not clean.

0:14:32 | So the model does not seem to be able to learn how to properly calibrate that data.

0:14:39 | Unfortunately, this just means there is still work to be done. We have not really achieved the ambitious goal I mentioned before, which was to have a completely general, out-of-the-box system.

0:14:54 | Okay, so before I finish I would like to describe a few details of how this model is trained, because they are essential to get good performance.

0:15:04 | One important thing is to do a non-random initialization.

0:15:10 | What we do, and many of the papers on end-to-end training do similar things, is initialize the speaker branch with the parameters of a standard PLDA baseline. That is the first thing.

0:15:25 | Then, for the side-information branch, we initialize its first stage with the bottom components of the LDA transform that we trained for the speaker branch.

0:15:41 | That means that what comes out of there is basically the worst you could do for speaker ID, which we hope should be close to the best you can do for condition ID: we are trying to extract the condition information from the input.

0:15:56 | Then this matrix here, which does not have any reasonable default value, we just initialize randomly.

0:16:05 | And these two components here we initialize so that what comes out of them are the global calibration parameters at the first iteration of training.

0:16:16 | So, basically, at initialization the scores that come out of the model are the same ones that would come out of a standard PLDA pipeline.

0:16:28 | Here are the results comparing three different initialization approaches.

0:16:36 | Random; then what we call partial, which is what I described before but without initializing the side-information stage with the bottom LDA components, which are instead initialized randomly; and then, in blue, the full approach.

0:16:52 | The blue one is the best of the three, so it is worth the trouble to take the time to find good initial parameters for this model.

0:17:04 | Another important thing is that we train the model in two stages.

0:17:10 | The first stage uses all the training data to train all of the parameters. In the second stage, we freeze the LDA and PLDA blocks and train only the rest of the parameters, using domain-balanced data.

0:17:26 | This is important because, if the data is not balanced, most of the trials in any one batch would come from the dominant domain, and we would just be optimizing for that domain, whichever one has more samples.
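A minimal sketch of domain-balanced speaker sampling; the sampler below is one assumed way to balance domains for illustration, not necessarily the paper's exact scheme.

```python
import random
from collections import Counter

def balanced_speaker_sample(spk_to_domain, n_per_domain, rng):
    """Pick the same number of speakers from every domain, so no
    single domain dominates the trials in a batch."""
    by_domain = {}
    for spk, dom in spk_to_domain.items():
        by_domain.setdefault(dom, []).append(spk)
    chosen = []
    for dom in sorted(by_domain):
        chosen.extend(rng.sample(sorted(by_domain[dom]), n_per_domain))
    return chosen

rng = random.Random(0)
# Heavily imbalanced pool: 900 telephone speakers, 100 microphone speakers
pool = {f"t{i}": "tel" for i in range(900)}
pool.update({f"m{i}": "mic" for i in range(100)})

speakers = balanced_speaker_sample(pool, n_per_domain=16, rng=rng)
counts = Counter(pool[s] for s in speakers)  # 16 from each domain
```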

0:17:42 | Finally, the convergence of the model is kind of a big issue: validation performance jumps around a lot from batch to batch.

0:17:52 | If you look at the optimization curve, from one batch to the next it can change significantly.

0:18:00 | So what we do is choose the best iteration using the validation sets that I mentioned before.

0:18:06 | The good thing is that this approach seems to generalize well to other sets, even to sets that are not very well matched to the validation sets.

0:18:15 | We tried a bunch of tricks to smooth out the validation curves, like regularization and slower learning rates, but they actually make the minimum worse, so we keep the wild-looking curves and just choose the minimum.

0:18:38 | One more thing: there is a GitHub repository with exactly this model implemented, for training and for evaluation. You just need to have pre-computed embeddings, and we provide an example with embeddings.

0:18:56 | Feel free to use it, and let me know if you find bugs.

0:19:01 | I will be happy to respond to questions and comments.

0:19:05 | Okay, so, in conclusion: we developed a model that achieves excellent performance across a wide variety of conditions.

0:19:13 | It integrates the different stages of a speaker verification pipeline into one stage and trains the whole thing jointly and discriminatively.

0:19:21 | It also integrates an automatic extractor of side-information, which it then uses to condition the calibration parameters.

0:19:28 | And this achieves our goal of getting good performance across different conditions.

0:19:36 | Of course, there are many open issues, like, for example, the training convergence; I do not think we are done with that, and I would like to see an easier-to-optimize model.

0:19:49 | And, of course, we would like to plug this model in with the embedding extractor and train them jointly.

0:19:56 | Okay, thank you very much.

0:19:59 | If you have any questions, please feel free to write to me, or we can discuss them in more detail on the conference platform.

0:20:05 | Thank you.