Alright, okay, good. So today I'm going to talk a little bit about unsupervised adaptation with respect to total variability and cosine scoring, like Najim discussed previously. But first off: I've only known Najim for a couple of months, and I think a lot of you probably know a lot more about him than I do, but I just want to summarize a couple of things I observed over the last couple of weeks. I had to get a little bit of prior approval for this, because I wasn't sure this introduction slide was going to be appropriate at all, but it seemed okay, so we're going to go with it. [Introduction-slide joke about Najim, what happened during the SRE, and the arrival in Brno; largely inaudible.] Anyway, down to business. The whole idea of my talk is, again, unsupervised adaptation, and the whole motivation behind it is that capturing and characterizing every source of variability is pretty difficult, especially with only one enrollment session. If we were able to have multiple enrollments of the same speaker, this would help average out more sources of inconsistency and provide a better representation of the speaker in a speaker model. And this brings us to the problem of speaker adaptation in general.
In unsupervised speaker adaptation, we update our speaker models without a priori knowledge that the utterance we are updating a model with actually belongs to the target speaker, and we do so based on utterances processed during testing. This was incorporated into the NIST SRE, I think in 2004 or 2005 or so. Now, in previous work using joint factor analysis, before we began working with total variability, what we noticed was that there were indeed highly variable scores produced by JFA that required normalization, in particular ZT-norm, and applying these score normalizations in the unsupervised adaptation domain requires a significant amount of additional computation with each adaptation update; I'll go a little bit more into that in just a little bit. When we began this work with total variability, we were hoping for a certain number of improvements: we could do unsupervised adaptation with less computation if we took advantage of total variability's low-dimensional total factor vectors, or i-vectors (we can debate who gets to name them later), and of the cosine similarity scoring, which is very quick. There is also a set of new score normalization strategies that we wanted to play with, namely the symmetric normalization (s-norm) and the normalized cosine distance that Najim just talked about. So, a little bit of an outline for this talk: I'm going to go over really quickly what total variability is, since Najim did a very good job explaining some of the ideas behind it; then the unsupervised adaptation algorithm that we came up with, which has gotten decent results; and then we can proceed onward to the score normalization experiments and a little bit of discussion. So the total variability system has all the components that were shown in the previous talk, and that we've probably all seen in the past.
We use factor analysis as a feature extractor; you have a speaker- and channel-dependent supervector; there is intersession compensation with LDA and WCCN; and there is cosine scoring. At the end of the day we're just going to use w', which is what remains after everything has been applied, such that the scoring is really just the dot product between two of these vectors. Previous work on the topic of unsupervised adaptation in joint factor analysis was done by Kenny and colleagues. Given some new adaptation data, they would compute the posterior distribution of the speaker-dependent hyperparameters using the current estimates as priors, set a fixed and predefined adaptation threshold, and use log-likelihood ratio scoring. In that paper they also introduced an adaptive T-norm score normalization technique, because what they had observed was a drift in the distribution of normalized scores as more adaptation data was used, and in order to be able to keep using a fixed decision threshold in their decision process, they had to do a new type of normalization. That was met with a good amount of success, and the results were very promising. That said, implementing the adaptive T-norm requires a good bit of computation, and there was also a computational cost with every adaptation update. Lastly, success also depended on the choice of adaptation threshold, which had to be tuned. So, in order to try to improve upon this work in the context of total variability, what we wanted was to satisfy the following criteria.
First, we wanted a simple and robust method for setting an adaptation threshold θ, and what we decided was to set it to be the optimal decision threshold on development data; what we used was the NIST 2006 SRE data. Basically, what we do is carry out a run without adaptation, find the threshold at the point that minimizes the DCF, and set that as the threshold for the adaptation test; I'll get into details about that in a little bit. Next, we wanted to minimize the amount of computation carried out during each unsupervised adaptation update, and it helps that in total variability we are able to use low-dimensional total factor vectors and cosine similarity scoring. Lastly, our hope was to simplify the score normalization procedures wherever possible. So basically, if we're using our total factor vectors, or i-vectors, as point estimates in our speaker space, then given limited training data we might not have a perfect estimate of where our speaker really lies. This is just a cartoon, so it's not anything rigorous at all, but suppose we had our true speaker identity s, given by the little circle there, and our estimated one-utterance speaker identity w: this estimate might not be exactly on the spot where the true speaker identity is, but if we had a good number of these utterances, it would make sense that we should converge towards a better representation of the speaker. Of course, this is also assuming a priori that the additional data we have really is from speaker s. As such, we decided to propose the following algorithm, and in an effort to match the technical rigour of the previous two presentations before me, I decided to pack as much math as I could into this one slide.
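As an aside, the threshold-setting criterion just described (sweep the scores of a no-adaptation run on development data and keep the minimum-DCF operating point) can be sketched roughly as follows. This is my own illustration, and the cost parameters are placeholders, not necessarily the NIST settings used in the talk:

```python
import numpy as np

def min_dcf_threshold(tgt_scores, imp_scores,
                      c_miss=10.0, c_fa=1.0, p_tar=0.01):
    """Sweep candidate thresholds over all observed scores and return
    the one minimizing the detection cost function on development data."""
    candidates = np.sort(np.concatenate([tgt_scores, imp_scores]))
    best_theta, best_cost = None, np.inf
    for theta in candidates:
        p_miss = np.mean(tgt_scores < theta)   # targets rejected
        p_fa = np.mean(imp_scores >= theta)    # impostors accepted
        cost = c_miss * p_miss * p_tar + c_fa * p_fa * (1 - p_tar)
        if cost < best_cost:
            best_theta, best_cost = theta, cost
    return best_theta, best_cost
```

The returned threshold is then frozen and reused, unchanged, for every adaptation decision on the evaluation data.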
Basically, we're saying that we have a set of total factor vectors that are assumed to pertain to the identity of a known speaker s, and then we have total factor vectors w_t that are extracted from each test utterance. With a defined decision threshold θ, we have a new equation for the score which, since this notation is just the cardinality of the set, is simply the mean of the cosine scores against all the vectors in the set. We then compare it to the threshold, and if the score exceeds θ, you decide that the current test utterance w_t belongs to the identity of the speaker s: you say yes to that trial, and you admit that new utterance into the speaker's set W_s. The symmetry of the scoring is what allows for this, and later we will have more discussion of ideas on the design of this function and how it could conceivably be made better. But to reiterate what I just said, it's actually quite easy: you have an initial enrollment utterance, the estimated speaker identity w_{s,1}, and you have an incoming test utterance w_{t,1}; this is assuming that a single utterance is all you have for the speaker identity. You compute a score, and if your single score s_1 is greater than θ, then you just take the test utterance and place it in with the set, so what you have now is your test utterance and your training utterance together, and that is how you simply admit more training vectors into your set.
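A minimal sketch of this update rule, assuming the i-vectors have already been extracted and channel-compensated (the function names are mine, not from the talk):

```python
import numpy as np

def cosine_score(w1, w2):
    # Cosine similarity between two (already compensated) i-vectors.
    return float(np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2)))

def adapt_step(speaker_set, w_test, theta):
    """One unsupervised adaptation trial: average the cosine scores of
    w_test against every vector currently in the speaker's set, and
    admit the utterance only if the mean exceeds the fixed threshold."""
    score = np.mean([cosine_score(w, w_test) for w in speaker_set])
    accept = score > theta
    if accept:
        speaker_set.append(w_test)  # the test utterance becomes training data
    return accept, score
```

Note that nothing about the vectors themselves is modified; the set only grows, and θ never changes.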
Now, if you had a second test utterance, you can compute two scores: one against your initially estimated speaker identity and one against the newly admitted training utterance w_{t,1}. You combine those two scores, and if that function of the two scores is again greater than your fixed threshold θ, then you do the same thing and admit another training utterance. That's all there is to it. The emphasis behind this approach, again, is that we do not need to change the decision threshold θ. In the past, in related work with adapted utterances in the text-dependent setting, what some work did was to increase the decision threshold with each adaptation utterance; we wanted to keep things as simple as possible, so we decided that we did not want to change the decision threshold θ. Also, for now there is no modification of the total factor vectors; all we're doing is combining scores. So that summarizes total variability and unsupervised adaptation, and now I'll go really quickly through score normalization, which I am aware is a very well-known topic, so I'm only going to give it a pretty brief review, with a couple of emphases in the wording. The idea behind score normalization is that we assume the distributions of target-speaker and impostor scores follow two distinct normal distributions; however, the parameters of these two distributions are not speaker-independent but target-speaker-dependent, and as such we need to normalize to allow for a universal decision threshold. In zero normalization, or z-norm, which is well known, we scale the distribution of scores produced by a target speaker model against a set of impostor utterances to a standard normal distribution.
Test normalization, or t-norm, is much the same thing, except that, in order to adjust for intersession variability, we scale the distribution of scores produced by a test utterance against a set of impostor models to a standard normal distribution. The thing to keep in mind here is the italicized words, utterance and model; I'll discuss how they are related in the context of total variability in just a little bit. As for ZT-norm, we've already seen that it achieves the best results in a factor-analysis-based system, and that's what's currently being used there. Now, regarding ZT-norm parameter updates during this model adaptation, what we have is the lack of any need for new normalization parameters, because when test utterances w_{t,1} and w_{t,2} become training utterances, they were previously test utterances, so they already have T-norm parameters associated with them; all we need additionally is for their z-norm parameters to be computed, but that's actually it. That means we can simply precompute the z-norm parameters for each test utterance the same way we do for each target speaker model. That's all we need to do: in the past, we had to compute adapted T-norm parameters after each adaptation update, whereas here it's very simple and should be much quicker. The next thing is total variability in the context of the difference between utterances and models. Total variability uses factor analysis as a front end, so the extraction of total factors from an enrollment or a test utterance follows exactly the same process; as such, there's really no difference between an utterance and a model, and with the symmetry of the cosine similarity, there is no distinction to be made: we can think of them all as the same thing.
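In code, the two cohort-based normalizations might look like the following sketch. In a total variability system both "models" and "utterances" are just i-vectors, so the two functions differ only in which impostor cohort they score against (all names here are my own, not from the talk):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def znorm_params(model, impostor_utts):
    # z-norm: score the target model against a cohort of impostor
    # *utterances*; the resulting mean/std standardize future raw scores.
    s = np.array([cosine(model, u) for u in impostor_utts])
    return s.mean(), s.std()

def tnorm_params(test_utt, impostor_models):
    # t-norm: score the test utterance against a cohort of impostor
    # *models*; with i-vectors this is the same computation as z-norm.
    s = np.array([cosine(test_utt, m) for m in impostor_models])
    return s.mean(), s.std()

def normalize(raw, mu, sigma):
    return (raw - mu) / sigma
```

Because the parameters depend only on the vector and a fixed cohort, they can be precomputed once per utterance and reused across every adaptation update, which is the computational point made above.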
This brings us to an even more simplified method of score normalization, the s-norm, which follows an idea that Kenny had proposed. Really, all we do in implementing the s-norm here is define a new set of impostors, which is simply the union of the z-norm impostors and the t-norm impostors, and what we get is a new scoring function that looks pretty similar to any other normalization function we have, except that we simply add the two normalized scores, and this becomes our s-norm. The first term refers to the use of the normalization parameters associated with your model w_s, and the second term uses the parameters associated with your test utterance w_t. So what this gives us is a universal procedure for computing normalization parameters and a correspondingly simple method for score normalization. The last type of normalization that we explored here, the normalized cosine distance, was previously discussed in Najim's talk; this is what it looks like, just as a quick reminder. So now we have some experiments that were run. The system I used here was really the same system that Najim presented previously, given that we're working together, so you have your standard parameters for the system setup, and then the table of the corpora that we used, which we can take a more detailed look at some other time. The protocol of our experiments was that we wanted to test our results on the female part of the 2008 NIST SRE data set, and we fixed our decision and adaptation threshold θ as the optimal min-DCF a posteriori decision threshold on development data, NIST 2006.
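Before the results, the s-norm as described above can be sketched as follows; here `cohort` stands for the pooled z-norm and t-norm impostor set, and the function names are mine:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def snorm(w_s, w_t, cohort):
    """Symmetric normalization: normalize the raw cosine score once with
    statistics from scoring w_s against the impostor cohort, once with
    statistics from scoring w_t against the same cohort, and sum the two
    normalized scores."""
    raw = cosine(w_s, w_t)
    s_scores = np.array([cosine(w_s, x) for x in cohort])
    t_scores = np.array([cosine(w_t, x) for x in cohort])
    return ((raw - s_scores.mean()) / s_scores.std()
            + (raw - t_scores.mean()) / t_scores.std())
```

The symmetry is visible in the code: swapping `w_s` and `w_t` gives the same score, which is exactly why the model/utterance distinction dissolves.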
These are the ten-second condition results; we cared mostly about the ten-second results here. What we can see is that, in terms of minimizing the detection cost function, the adaptation-based ZT-norm achieves the best min-DCF, whereas the normalized cosine distance that Najim previously discussed also did very well on everything else: the English trials for equal error rate, and the min-DCF over all trials. At the same time, we can also notice that the s-norm achieves good, very competitive results, in some cases even better than ZT-norm, at least for the English trials. In order to validate our results, we also tried our method on the long conversation condition, the longer utterances, and in this case Najim's normalized cosine distance actually swept the results across the board, achieving the best results everywhere. At the same time, there are a couple of things to take note of. What we can see here is that our proposed adaptation algorithm is successful in improving performance regardless of the normalization procedure; this is obviously consistent with the notion that unsupervised adaptation, with, of course, an appropriately chosen threshold, should be at least as good as, and hopefully better than, the baseline method without adaptation. The next thing is that our simplified s-norm approach performs competitively with the more complicated, traditional ZT-norm. And of course, what we've seen is that the best result is ultimately obtained using the normalized cosine distance. As a result, I think one of the cooler things is that we seem to have come full circle in the story of score normalization techniques.
In the beginning, we needed normalization techniques in order to better calibrate our scores for a fixed decision threshold; then, however, as we got to the most complicated one, ZT-norm, we actually started going backwards and simplifying things into the s-norm, which is much easier to calculate; and now the parameters that we need are not speaker-dependent at all. There is no need in the normalized cosine distance to have the parameters of each speaker's score distribution calculated per speaker; it's a pretty universal set of parameters that is needed. Now, there's the bit of work I brought up earlier, where we decided that maybe there's a better way to improve our score combination function. These are some basic ideas that we're currently working on, but we don't have any significant improvement in our results just yet. The idea is weighted averaging, because our currently proposed method for combining scores treats every vector in the set of total factor vectors as equally important; however, at the end of the day, the only vector that unequivocally belongs to the speaker s is the initial enrollment vector. As such, maybe it makes more sense to weight that vector a little bit higher than the rest of the training utterances that we admit, because the presence of false-alarm adaptation updates, in which impostor utterances are incorrectly admitted, will have an adverse effect on all subsequent tests. So maybe our score combination function should take the following into account, where we weight each score by a coefficient α, which is the unnormalized score itself, something like the cosine similarity.
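One possible reading of this weighted-averaging idea, as a sketch (the exact weighting is still open in the talk, and the function below is my own illustration, not an established result): keep weight 1 on the enrollment vector, and weight each admitted vector's contribution by the cosine score α it received when it was admitted.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def weighted_combined_score(enroll, admitted, w_test):
    # admitted: list of (vector, alpha) pairs, where alpha in [-1, 1] is
    # the unnormalized cosine score the vector got when it was admitted.
    num = cosine(enroll, w_test)  # enrollment vector gets weight 1.0
    den = 1.0
    for w, alpha in admitted:
        num += alpha * cosine(w, w_test)
        den += alpha
    return num / den
```

With an empty admitted set this reduces to plain cosine scoring against the enrollment vector, and low-confidence admissions pull the combined score less than confident ones.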
Since the cosine similarity ranges between negative one and one, it can be seen as a weight on each score. A quick visualization, which is all I have time for right now: we can simply look at it this way, where your initial speaker identity vector w_{s,1} is the most important, and the next few vectors might be erroneous, because based on your threshold you're only allowing in vectors in the region of the green circle; so if your true speaker identity is s, then you may incorrectly allow some vectors in. However, as you get more and more vectors, we can be more and more certain that they belong to the speaker's identity. So what happens is: you add a training vector and it shifts the space a little bit; then maybe you add in an incorrect one, but then you add more correct ones, and as a result, after a couple more of these utterances, you can see that finally you get to the right place; but that false-alarm vector, maybe you should take it out, or something like that. That's the sort of thing we're looking at working on in the future, and we're still looking to improve the score combination function, so we welcome any ideas. One of the ideas is that, though it's not allowed in the evaluation protocol, say, since we're most error-prone in the beginning, after a while maybe we can go take another look at the training vectors and correct any errors from the beginning; this is easy to do because we don't actually modify the vectors at all. And so, as a final summary: we have proposed a method for unsupervised speaker adaptation using total variability and cosine scoring, with a simple and efficient method for score combination and a fixed a priori decision threshold.
This method can also easily accommodate all of the score normalization procedures. And with respect to score normalization, we discussed some of the newer, non-ZT-norm ideas, like the s-norm and the normalized cosine distance that Najim talked about. Thanks. [Question, partly inaudible: the questioner refers to a paper proposing, instead of a fixed threshold for deciding whether to update the speaker model, evaluating a confidence for each trial and weighting the update by that confidence, and asks whether a fixed threshold was compared against such confidence-based weighting.] [Answer:] If I'm hearing the question correctly, I think the use of a fixed decision threshold, at the end of the day, makes things simple; and then, as opposed to that, using a fixed decision threshold while also using a varying score combination function that weights the scores of each training utterance we have, I think that's a pretty good way to do it, but I guess we could talk more about it later. [Follow-up question, partly inaudible, about how the weighted score would be computed across trials.]
[The follow-up exchange is largely inaudible; it concerns whether the weighted score would be computed over every single trial.] [Another participant:] I also have an answer to your question about the prior; I have a comment on this mixed into my presentation tomorrow. [Session chair:] That might be the last question for Steven. Okay, thank you.