So I think today you will be happy to see two different points of view: a mathematical point of view, which is maybe more general, and an engineering solution, which is maybe faster, simpler, and simplifies life for us. The topic today is how we can make cosine distance scoring work without score normalization, because, as Patrick said this morning, PLDA doesn't need any score normalization. So we try to understand what score normalization is doing in cosine scoring, and how we can move it from the score space to the total variability space, the i-vector space.

The presentation is organized as follows. First I give an introduction on the contributions of this paper. Then I define the total variability space, what cosine distance scoring is, and how channel compensation works in this space. After that I will show what score normalization, like zt-norm or s-norm, is doing inside the cosine distance scoring, and how we developed a new scoring that doesn't need any score normalization; the normalization is still there, but we just move it to the total variability space. Then I give some experiments and results, and finally a conclusion.

So this low-dimensional representation of a speaker recording makes a lot of things easier for us, because now we work in a small space; you can apply PLDA to it, for example, and everything can be done in this new, lower-dimensional space. Also, with cosine scoring we don't need any target model enrollment: we just extract the i-vectors, the total factor vectors, for the target and the test, compute the cosine distance, and compare it to a threshold. It is very easy, there is no complication there, which makes the decision fast and simple. But the problem is that we still need score normalization: what helped was zt-norm, and in the new version of the system I use s-norm, so we did still need that. So in this paper I try to understand what score normalization is doing in the cosine distance scoring, and how I can simulate this kind of scoring in the total variability space without going to the score space. That is what I will talk about in this part. We also did some speaker adaptation using cosine distance; I will not talk about that part of the paper, I will just flash some results, because Stephen, the next presenter, will talk about it. So if you have any question about speaker adaptation, you can talk to him, not to me.

OK, so as everyone here knows by now, JFA tries to split the GMM supervector into two parts: the first part is the speaker space, and the second part is the channel space. Two years ago, when we were at the JHU workshop in 2008, we tried to see how much speaker information there is in every latent variable: the speaker factors, the common factors, and the channel factors. We took every component of JFA, put it in a support vector machine, and used the cosine kernel to see the performance. What was surprising is that for the eigenchannels, the channel factors, we did not get an error rate of fifty percent, which is what you would expect because normally the channel factors should not contain speaker information; we found an error rate of around twenty percent.
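Since the decomposition and its latent variables are only described in words here, a hedged sketch in standard JFA notation (my notation, not the slides'):

```latex
% JFA (standard notation, assumed): the GMM supervector M of an
% utterance splits into a speaker part and a channel part,
%
%   M = m + Vy + Ux + Dz
%
% m : UBM supervector             V : eigenvoice matrix
% y : speaker factors             U : eigenchannel matrix
% x : channel factors             D : diagonal residual matrix
% z : speaker-specific residual factors
M = m + Vy + Ux + Dz
```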
So it means there is speaker information that we are losing in these channel factors. In order to recover it, or maybe a better way to say it, to minimize the impact of this information that we are losing from the speaker factors, the idea of total factors came up; the total variability space was born at the workshop at Johns Hopkins University. What we did is the following: instead of having separate speaker and channel spaces, we have one single space, the total variability space, which models both speaker and channel variability together. So when we have a target and a test, both get projected into this total variability space, and we can just compute the cosine distance there.

So what is different between the speaker space, the eigenvoices, and the total variability space in our case? For the eigenvoices in JFA, all recordings of the same speaker are seen as the same speaker, so we put all the recordings of that speaker together. For the total variability space it is the opposite: each recording of the same speaker is seen as coming from a different speaker, because we want to model both speaker and channel variability. So if you have the eigenvoice training algorithm, it is the same thing: you use the same algorithm, just the data list is different. For the eigenvoice space we put the data from the same speaker in the same file, and for the total variability space each file is one recording, treated as a different speaker.

There are different ways to estimate the eigenvoices. One is relevance MAP: for each recording you estimate the GMM supervector by MAP adaptation and then compute a PCA. The other is the eigenvoice training itself, where the supervector is not observed and everything is estimated from the Baum-Welch statistics. Why are we using the eigenvoice training? Because, if I'm not wrong, at the JHU workshop some people tried different kinds of estimation for the speaker factors, like relevance MAP plus PCA against eigenvoice training, and found that the best was the eigenvoice training; maybe I'm wrong, you can confirm after. Also the eigenvoices are known to be more powerful for short durations, so maybe that explains why the total variability space in this case gives a better result than relevance MAP.

So what do we have? A target recording and a test recording. We estimate the total factor vectors for both, then we just compute the cosine distance between the two vectors and compare it to a threshold.
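Again, this is only described in words in the talk, so here is a hedged sketch of the total variability model and the scoring on top of it, in my own notation:

```latex
% Total variability model (assumed standard notation): a single
% matrix T models speaker and channel variability together,
%   M = m + Tw
% where w is the vector of total factors (the i-vector).
M = m + Tw
%
% Cosine distance scoring between target and test i-vectors,
% compared directly to a decision threshold \theta:
\mathrm{score}(w_{\mathrm{tgt}}, w_{\mathrm{tst}}) =
  \frac{w_{\mathrm{tgt}}^{t}\, w_{\mathrm{tst}}}
       {\lVert w_{\mathrm{tgt}} \rVert \, \lVert w_{\mathrm{tst}} \rVert}
  \gtrless \theta
```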
In the scoring itself there is no explicit channel compensation; instead, I first do LDA, to do some dimensionality reduction and to maximize the between-speaker variability while minimizing the within-speaker variability, sorry, the within-class variability, and after that I apply WCCN to do some kind of normalization in the LDA-projected space. The LDA projection matrix is defined by solving the generalized eigenvalue problem between the between-speaker covariance and the within-speaker covariance. There is one remark I need to make here: in the first version of the cosine distance scoring I said that the mean of all the speakers is equal to zero, because the total factors have a standard normal prior; but in this work I estimated it, and I think I need to, because I found some problems with the new scoring when I don't estimate it.

For the WCCN, what we do is this: after estimating the LDA, we project all our background data into this lower-dimensional space, so we move from four hundred to two hundred dimensions, and then we use the same background data, now in the two-hundred-dimensional space, to estimate the WCCN there. So the WCCN is applied in the projected space, not in the original space.

Here is some kind of visualization of all these steps. These are five speakers; each color is one speaker, and each point is one recording of that speaker; they are five female speakers. This is after the LDA projection into two dimensions. After the WCCN you have almost the same scatter, but we are minimizing the intra-speaker variability. And when you do the length normalization, which is what cosine scoring does, you are going onto the spherical surface here: here is speaker one, here speaker two, and so on. It is all the same data on the same figure, and it connects to what Patrick explained this morning about the length normalization.

This is the diagram of the total variability system. First we use a lot of non-target speakers, a lot of speakers with several recordings each. We do MFCC extraction, we train the UBM, and after that we extract the Baum-Welch statistics for all these recordings; then we train the total variability matrix, extract the i-vectors for all these recordings, and estimate the LDA and the WCCN, with the WCCN estimated in the LDA-projected space. Then, when we have a target or a test recording, we just extract the MFCCs, use the UBM to extract the Baum-Welch statistics, use the total variability matrix to extract the total factor vector, and then apply the LDA and the WCCN to normalize the new vector. So for target and test we extract the total factors the same way, project them with the LDA and the WCCN, compute the cosine distance, and make the final decision.

So now I will explain what score normalization is doing in this space, inside this cosine distance scoring. Let me simplify some equations first. This is the cosine distance scoring. Let us define what I call the normalized total factor vector: it is the projection by the LDA matrix and by the Cholesky decomposition of the within-class covariance from the WCCN, normalized by the length. In this case the cosine distance scoring becomes just a dot product; I just wanted to simplify it down to a dot product. You can see all of this, like we said in the first paper, as feature extraction: the LDA and the WCCN do the channel compensation, and the scoring becomes just a dot product.
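A minimal runnable sketch of this compensation-plus-scoring chain, assuming NumPy; the function names and the Cholesky-based form of the WCCN whitening are my assumptions about the details, not code from the talk:

```python
import numpy as np

def normalize(w, A, B):
    """Channel-compensate and length-normalize a raw i-vector w.

    A: LDA projection matrix (e.g. 400 -> 200 dimensions),
       trained to maximize between-speaker variability.
    B: Cholesky factor of the inverse within-class covariance
       (WCCN), estimated on the LDA-projected background data.
    """
    v = B.T @ (A.T @ w)           # LDA projection, then WCCN whitening
    return v / np.linalg.norm(v)  # length normalization

def cosine_score(w_target, w_test, A, B):
    # After normalization, the cosine distance is a plain dot product.
    return normalize(w_target, A, B) @ normalize(w_test, A, B)
```

The verification decision is then just a comparison of `cosine_score` against a threshold.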
Now, if you want to see what z-norm is doing: we have a target speaker and a set of impostor utterances. For all the impostor utterances we extract the normalized total factor vectors, and we compute the mean of the scores and the standard deviation of the scores against the target. So I tried to write out what this mean and this standard deviation are. For every impostor I write the score, which is just the dot product between the target and the impostor, and I divide the sum by the number of impostors; this is the mean. If you simplify that, it is just the dot product between the normalized target vector and the mean of the normalized impostor vectors. For the standard deviation you do the same process: you take the scores between the target and the impostors, subtract the mean, which is exactly the dot product we just saw, and when you work it out, what appears is the covariance matrix of the normalized impostor vectors.

So z-norm, if you plug this into the equation of the score normalization, is just shifting the score by the mean over the impostors and then doing a kind of length normalization; but this length normalization is based only on the between-impostor covariance. It means that the direction of this normalization is based on maximizing the distance between impostors. In a similar way you can work out t-norm: t-norm is shifting by the impostor mean on the test side, and doing the length normalization of the test with the same kind of between-impostor covariance.

So the new scoring we propose is the following, and it is not exactly z-norm or t-norm, it is symmetric. We take some background of impostors and compute the mean of their normalized total factor vectors; we shift the target by this impostor mean and normalize the shifted target, and we do the same for the test, shifting it by the impostor mean and normalizing it, where the normalization is based on the between-impostor covariance. This is like s-norm, which I think was used in the heavy-tailed priors paper: if you write down what s-norm is doing with this kind of scoring, it is shifting and normalizing the target with the statistics from the test side, and shifting and normalizing the test with the statistics from the target side, symmetrically. So what we are doing is exactly s-norm, but without any score-space parameter estimation; everything is computed directly in the total variability space. This kind of scoring lets us speed up the process a lot, because at test time we just compute a cosine-style dot product.
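Here is a minimal sketch of that proposed scoring as I read this description: both normalized i-vectors are shifted by the impostor mean, and each is scaled by a norm built from the impostor covariance (diagonal, as in the experiments below). The exact placement of the shift and the scaling is my reading of the talk, not a verbatim formula from the paper:

```python
import numpy as np

def normfree_score(w1, w2, impostors):
    """Score with the s-norm-style normalization folded into i-vector space.

    w1, w2:    channel-compensated, length-normalized i-vectors
               (as produced by normalize() above).
    impostors: matrix of normalized background impostor i-vectors,
               one vector per row.
    """
    m = impostors.mean(axis=0)   # impostor mean: the shift
    c = impostors.var(axis=0)    # diagonal impostor covariance
    # Each side is shifted by the impostor mean; the denominators are
    # covariance-weighted norms playing the role of the z-norm and
    # t-norm standard deviations.
    s1 = (w1 - m) / np.sqrt(np.dot(c * w1, w1))
    s2 = (w2 - m) / np.sqrt(np.dot(c * w2, w2))
    return np.dot(s1, s2)
```

Since the mean and the diagonal covariance are computed once on the background, a trial costs only a few vector operations, which is the speed-up mentioned above.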
So then we did some experiments. We used a UBM of 2048 Gaussians with sixty-dimensional features; this is an older system that we had, and sorry that it is not fully up to date. We use four hundred total factors, the LDA reduces them to two hundred, and the WCCN is applied in the two-hundred-dimensional space. We used around one thousand utterances for z-norm and two hundred for t-norm. For s-norm we used all the impostors together, and for the mean and the covariance of the new scoring we also used all the impostors together, but with a diagonal covariance matrix for the impostors, just to speed up the process.

A lot of people ask me how to build a total variability system, so I tried to build this table to show which database you can use to train each part. For the UBM we used Switchboard, cellular and landline, and the NIST 2004 and 2005 data. For the total variability matrix we used all the data, since the more data you have the better it is, keeping speakers that have a minimum of two recordings. This was the first time we used Fisher data in the factor analysis training, because Patrick tried it in the past with JFA and did not have success with it. For the LDA I used Switchboard and NIST 2004 and 2005, because I am trying to model the between-speaker variability, so we need more speakers for that. What was surprising is that I found the best result when using only NIST 2004 and 2005, maybe because in the NIST data we have speakers speaking on different phone numbers and telephones compared to Switchboard; maybe this is why we kept only NIST 2004 and 2005.

So here are the results. This is the core condition of NIST 2008, female part only. I just want to compare how the score normalization is working here; I forgot to put the score without score normalization, sorry. This row is the original scoring with zt-norm, as we published it in the past, and this is the new scoring, with the mean estimated or not. On the equal error rate we improve by around 0.5 absolute, but on the DCF it is about the same, there is no big improvement. However, for all trials, which include the English trials and the trials where we have different languages, both the error rate and the DCF are very good with this new scoring. S-norm also gives quite competitive results, and better results on all trials compared to the original scoring. So it seems we can do the score normalization in the total variability space; there is no problem with that.

This is again the same kind of comparison. Here, on the core condition, we find that it helps a lot: it is improving the performance on the error rate, though not on the DCF, and also for all trials, where s-norm was again doing very well compared to the core condition.

So, the conclusion. In this paper I tried to simplify life again, by moving the score normalization into the total variability space, which makes the process simpler and faster if you want to optimize the cosine distance scoring. And we did it for the purpose of doing some speaker adaptation, where we don't want to have to update the z-norm and t-norm parameters.
Stephen will talk more about the adaptation in the next presentation. Thank you.

[Question, largely inaudible.] ...So one point: you are selecting, or emphasizing, directions in the space based on different criteria, and I'm wondering if you could modify the normalization approach such that it is more loosely coupled to the score function?

That can be a good point here, because of this length normalization. When I do the cosine distance, after doing the LDA and the WCCN, I am removing some of the within-class variability, and with the LDA I am keeping the directions that maximize the between-speaker variability, which can be seen as a between-speaker metric. So it seems like with the length normalization I am losing some information between speakers, and with the WCCN too; it's true that when I see this kind of thing, it seems like I am doing something that hurts me. But this is a good point about the WCCN, yes or no: at the end of the day it is a projection, and to do it the way you suggest I would need to know the directions of the speakers, and I don't know how to do that yet.

Okay, and I have a comment regarding the WCCN question: I tried to do the length normalization before the WCCN, so before the division; I just did the length normalization first and then the WCCN. I tried it, but it didn't help. And I think more than one other person also tried it.

[Question, largely inaudible, about where the mean and variance in the new scoring come from, whether they denote the impostor distribution, and whether the resulting scores are calibrated.]

When I say the mean and the variance, they are computed from the normalized impostor vectors, so yes, they denote the impostor distribution, and you don't actually need to estimate anything in the score space for this new scoring. About calibration: that's a good question. I tried to understand what zt-norm is doing there, but I never succeeded. I did try to see whether my system is calibrated: if you compare the results, they are not exactly the same as with zt-norm, only the equal error rate changes a little bit, but it seems that what I have is not well calibrated.
If you have any comment about how we can handle that calibration part, I would be happy to hear it, because I did use s-norm, since I needed it in the new version of the system, but that part I don't know how to do yet. And here is one more comment: if you are doing, for example, a mixed condition, where we have training on the telephone and the test on the microphone, you can do this normalization differently based on which database you are using, and that can help you in the cross-channel conditions. Thank you very much.