Good morning everyone, I'm Mitchell McLaren and I'll be presenting this work. It was done here at QUT, although I've since relocated, in case anyone's wondering, and I'm presenting on behalf of my co-authors as well: Robbie Vogt, Brendan Baker and Sridha Sridharan. The work today is basically an experimental study of how SVMs perform when you decrease the amount of speech available to them for speaker verification.

There's a brief outline of the motivation for why we did this study, and then we'll do some experiments looking at how each of the components of a standard GMM-SVM system responds to limited speech being available to it. This includes the background dataset, session compensation, particularly NAP, a bit of an analysis of the variation in the kernel space with short utterances, and the score normalisation dataset, and then I'll present some conclusions.

So, motivation. It's quite well known that as you reduce the amount of speech available to a system, you're going to have a reduction in performance. There have been some previous studies, which generally focus on the GMM-UBM approach and, more recently, on joint factor analysis, but nothing really targeted at the SVM case, and this is why we're doing this work here. One thing to mention is that QUT participated in the Evalita evaluation, which is almost a miniature NIST evaluation, I guess you'd say, in 2009, and one of the observations from that evaluation was that the SVM outperformed the JFA system when we had an ample amount of speech, six minutes, whereas the opposite was true for the twenty-second condition, where JFA performed better. So there was a distinct difference between the generative and discriminative approaches depending on the duration of each utterance. Another observation was that JFA was more effective when estimating the session and speaker subspaces on a duration of speech that was similar to the evaluation conditions, so we'll look at that a bit throughout this work as well.

Of course, SVMs are quite prevalent in the speaker verification community; we only have to look at the presentations last week for the NIST 2010 evaluation, where almost all submissions had a GMM-SVM configuration in there somewhere. So we're looking now at how to select the development data when we have mismatched training and test segment durations in the SVM configuration. The main questions here for the SVM systems are: to what degree does limited speech affect SVM-based classification, and which system components are most sensitive to the quantity of speech? We're presenting these results with the hope of pointing out directions in which to counteract these effects.

Most of you know about the GMM-SVM system, I would suppose, where we use the stacked GMM component means of each utterance as the features for SVM classification. We know we can get good performance when there is plenty of speech available, and in this work we're looking at the importance of matching the development datasets to the evaluation conditions for each of the individual components.

Let's take a look at the flow diagram of the system. Basically we have three main datasets that go into development. First of all, we want to train a transform matrix for session compensation, particularly NAP, so we have transform training data. We also have a background dataset to provide negative information during SVM training, and lastly we have a score normalisation dataset, should we choose to apply score normalisation.
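For readers, here is a minimal sketch, not the authors' code, of how such a stacked-mean supervector can be formed: the UBM means are relevance-MAP adapted towards an utterance and then scaled and concatenated. The function names, the diagonal-covariance UBM layout and the default relevance factor tau are illustrative assumptions; tau becomes relevant again later when the kernel-space variation is discussed.

```python
# Sketch: relevance-MAP mean adaptation of a UBM and stacking into a supervector.
import numpy as np

def adapt_means(features, ubm_means, ubm_covs, ubm_weights, tau=16.0):
    """features: (T, D) frames; ubm_means/ubm_covs: (C, D); ubm_weights: (C,)."""
    C, D = ubm_means.shape
    # Frame-level posteriors for each mixture component (diagonal covariances assumed).
    log_post = np.empty((features.shape[0], C))
    for c in range(C):
        diff = features - ubm_means[c]
        log_post[:, c] = (np.log(ubm_weights[c])
                          - 0.5 * np.sum(np.log(2 * np.pi * ubm_covs[c]))
                          - 0.5 * np.sum(diff ** 2 / ubm_covs[c], axis=1))
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    # Zero- and first-order Baum-Welch statistics.
    n = post.sum(axis=0)                          # (C,) soft counts
    f = post.T @ features                         # (C, D) weighted sums

    # Relevance-MAP mean update: alpha_c = n_c / (n_c + tau).
    alpha = (n / (n + tau))[:, None]
    ex = f / np.maximum(n, 1e-10)[:, None]
    return alpha * ex + (1.0 - alpha) * ubm_means

def make_supervector(adapted_means, ubm_covs, ubm_weights):
    # KL-style scaling sqrt(w_c) * Sigma_c^{-1/2} * mu_c before stacking, as commonly
    # used with the GMM-supervector linear kernel.
    scaled = np.sqrt(ubm_weights)[:, None] * adapted_means / np.sqrt(ubm_covs)
    return scaled.reshape(-1)
```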
The approach for this study is to start from a baseline SVM system, that's one without score normalisation and without session compensation, and build onto that progressively, looking at how each of the additional components is affected by the duration of speech. So these three sets, as I mentioned, are the background dataset, the transform training dataset for session compensation, and lastly the score normalisation dataset.

Maybe a quick look at the system we're working with. It's a GMM-SVM system: a 512-component UBM and twelve-dimensional MFCCs with appended deltas. Impostor data was taken from previous SRE corpora, and we used this both for the background dataset and for ZT score normalisation. For NAP we remove the dimensions of greatest session variation, with the transform trained on NIST SRE data. The evaluations we perform here are on the NIST 2008 corpora, particularly the short2-short3 condition. This typically has two and a half minutes of conversational speech per utterance, and the way we introduce reduced durations is through two focus conditions: the full-short condition and the short-short condition. In the full-short condition, the training segment is left at its full length and we progressively truncate the test utterance to the desired duration. In the short-short case we truncate both train and test to the same duration, so there is essentially no duration mismatch in that evaluation.

So let's look at the baseline SVM performance and, in particular, how it compares to the GMM, just as a point of reference for what we will look at in detail later. Here we're using a baseline and what we're terming a state-of-the-art configuration, which is now not so true with the i-vector approach coming out. We're looking at baseline and state-of-the-art GMM and SVM systems, all developed using the full two and a half minutes of speech in training and test, so we're not yet explicitly dealing with the low durations. The first thing we notice, the solid lines are the baselines, is that the baseline SVM gives us better performance than the GMM baseline; neither baseline has session compensation or score normalisation. But as we reduce the duration of speech, the SVM quickly deteriorates in performance compared to the GMM system. It's not quite as noticeable in the state-of-the-art systems, but the GMM is in front the whole way.

Now if we look at the short-short conditions, where both train and test have been reduced, we actually see the SVM baselines fall behind once we reduce below about the eighty-second mark. Having developed the system on full two-and-a-half-minute speech might be the reason for this, but we'll have to look into that. In the case of the GMM system, however, at less than ten seconds we see the baseline jump in front of the compensated system. So there are some significant differences and issues we need to look into here, and hopefully the development datasets that we examine will help us out with that.

Let's start with the background dataset. Here we're going to look at the SVM system and how changing the speech duration in the background dataset affects performance, without score normalisation and without session compensation. As we know, the background dataset gives us the negative information in SVM training; we generally have many more negative examples than positive examples in the NIST SREs.
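To make that concrete, here is a minimal sketch, under assumed helper names rather than the paper's exact configuration, of one-versus-all target model training: a single target supervector is the positive class and the background supervectors are the negatives, with the linear kernel that is usual for GMM-supervector SVMs.

```python
# Sketch: training a target SVM against a background (impostor) supervector set.
import numpy as np
from sklearn.svm import SVC

def train_target_svm(target_sv, background_svs, C=1.0):
    """target_sv: (K,) supervector; background_svs: (N, K) impostor supervectors."""
    X = np.vstack([target_sv[None, :], background_svs])
    y = np.array([1] + [0] * len(background_svs))
    # Up-weighting the lone positive example is a common practical choice here.
    clf = SVC(kernel="linear", C=C, class_weight={1: len(background_svs), 0: 1})
    clf.fit(X, y)
    return clf

def score_trial(clf, test_sv):
    # Signed distance from the separating hyperplane is the verification score.
    return float(clf.decision_function(test_sv[None, :])[0])
```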
We previously found that the choice of this dataset greatly affects model quality. A real question that comes up with the SVM is how we should select this dataset under mismatched train and test durations: should we be matching the duration to the training utterance, to the test utterance, or to the shorter of the two?

There are three plots here to present. Firstly, we've got the short-short condition, with matched training and testing durations, and it's quite obvious that it's better to match the background to the evaluation conditions here. In the full-short condition, that's full training and short testing, it's actually better to match the background dataset to the shorter test utterance. In the last condition, which we have introduced as short-full, so short training and full testing, we again don't see as large a discrepancy at the shorter durations, but we do see that matching to the shorter training utterance gives a little bit of an improvement towards the longer durations.

So what conclusions can we draw from this? Let's look at the equal error rate as well to give us a bit more information, focusing in particular on the ten-second condition. The first thing we can see is that matching the background dataset to the training segment does not always maximise performance. However, if we match to the test segment, our results always gave the best DCF performance, and in contrast, if we want the best equal error rate, we match to the shortest utterance. So there's a choice to be made depending on what operating point you want to address. In the following experiments we chose to use the shorter test utterance as the duration to which we match the background dataset.

Let's look now at session compensation. Nuisance attribute projection, or NAP, removes the directions of greatest session variation; as the small formula on the slide shows, the dimensions captured in the U transform matrix are projected out of the kernel space. Of course, the transform U has to be learned from a training dataset. Now, what should we be using in this transform training dataset when we've got limited test speech, or when both train and test speech are limited?

First we're looking at the full-short condition. The system here has no score normalisation, but the background has been matched to the shorter test utterance in each of these cases. It's quite clear that using matched NAP training, that is, matching to the short test utterance, gives us the best performance, and in fact, if we use full-length NAP training, the reference system, the one without NAP, jumps in front at the lower durations. So here we really want to match the NAP training data to the shorter test duration, and that advantage was greatest at the most challenging, shortest durations.

Now let's look at the short-short condition. This is an interesting case, because we observe that even though we match the NAP training dataset to the ten-second duration, we still find that the best performance comes from the baseline system, the one without NAP. So why is this? We point out that full-length NAP training degrades performance quite significantly, but matched NAP training still isn't jumping in front of the baseline. So there seems to be a point somewhere at which NAP fails to provide a benefit under limited training and testing speech.
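For reference, here is a minimal sketch, an assumed and simplified recipe rather than the paper's exact one, of how the NAP transform can be estimated from supervectors of the transform-training set and then applied: the nuisance directions U are the leading eigenvectors of the within-speaker scatter, and each supervector is projected with (I - U U^T). The function names and the rank value are assumptions.

```python
# Sketch: estimating and applying a NAP projection in supervector (kernel) space.
import numpy as np

def train_nap(supervectors, speaker_ids, rank=40):
    """supervectors: (N, K); speaker_ids: length-N labels. Returns U of shape (K, rank)."""
    X = np.array(supervectors, dtype=float)
    # Remove each speaker's mean so only within-speaker (session) variation remains.
    for spk in set(speaker_ids):
        idx = [i for i, s in enumerate(speaker_ids) if s == spk]
        X[idx] -= X[idx].mean(axis=0)
    # Leading right singular vectors of the centred data = nuisance directions
    # (avoids forming the K x K within-speaker covariance explicitly).
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:rank].T                            # (K, rank)

def apply_nap(supervector, U):
    # Projection s_hat = (I - U U^T) s removes the nuisance subspace.
    return supervector - U @ (U.T @ supervector)
```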
So at what point? Well, here's a plot where we've matched the NAP training data to the evaluation duration in the short-short condition; remember, this is the short-short condition, whereas in the full-short condition we actually got a benefit out of NAP. We can see that just below the forty-second mark is where the reference system jumps in front of the NAP-compensated one.

So why is this happening? Let's look at the variability we can observe, and whether NAP is robust to limited training and testing speech. In the context of JFA systems, the session subspace variation was found to increase as the length of the training and testing utterances is reduced, so we're going to see whether the same trend appears in the SVM kernel space. On the slide we have a table with a number of durations for the short-short condition, with the MAP relevance factor tau as a reference, and we're presenting the total variability in the speaker space and in the session space of the SVM kernel. We actually see that, in contrast to what was observed with JFA, we're getting a reduction in both of these spaces as the duration is reduced.

So we wondered why this is the case and what the difference is. What we did was take an inconsequential tau, close to zero, so that the supervectors have more room to move, and we then find that we do in fact agree with the JFA observations: we get a greater magnitude of variation in each of these cases if we change the relevance factor to be close to zero. So here we conclude that the MAP adaptation relevance factor has a significant influence on the observable variation in the SVM kernel space; that's just something to be aware of. What's interesting, though, is that irrespective of the tau that we use, we get a very similar session-to-speaker ratio, so the session variation that comes out is more dominant as the duration is reduced, and of course this is why speaker verification is more difficult with shorter speech segments.

So why then, if we're getting more session variation, is NAP struggling to estimate it as we reduce the duration? Looking at this figure, we have the magnitude of session variability and speaker variability in the top one hundred eigenvectors estimated by NAP, for durations of eighty seconds and ten seconds. The solid lines are eighty seconds, the dashed ones are ten seconds, and session variability is the black line. The first thing we notice is that when we have longer durations of speech, the magnitude of the leading session variation is greater, so we're getting more session variation that can be represented in a lower dimensionality, whereas as the duration reduces we flatten out and become a bit more isotropic in our session variation. In contrast, the speaker variation curve is quite similar in both cases, which aligns with the table we just saw. NAP was developed on the assumption that the majority of session variation lies in a low-dimensional space, so our understanding is that, because of the more isotropic session variation that comes about with these reduced utterances, that assumption no longer holds, and this is why NAP is unable to offer a benefit in the short-short condition. How we can overcome this problem is something we're still working on.

Next let's move on to score normalisation. I'll go through this quite quickly, because everyone knows score normalisation, I think, from the last few presentations. Basically, it corrects for statistical variation in the classification scores of a given trial, and we apply it here as ZT-norm, combining Z-norm and T-norm, which are train-centric and test-centric approaches respectively.
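Since the combination can be confusing, here is a minimal sketch of ZT-norm as it is commonly applied, with assumed argument names: the Z-norm statistics come from scoring the target model against a Z-norm impostor cohort, and the T-norm statistics from scoring a cohort of impostor models against the (Z-normalised) test segment.

```python
# Sketch: Z-norm followed by T-norm on raw SVM scores.
import numpy as np

def z_norm(score, target_vs_znorm_cohort):
    mu, sigma = np.mean(target_vs_znorm_cohort), np.std(target_vs_znorm_cohort)
    return (score - mu) / sigma

def zt_norm(score, target_vs_znorm_cohort, tnorm_models_vs_test,
            tnorm_models_vs_znorm_cohort):
    # Z-normalise the trial score and each T-norm model's score for this test segment,
    # then T-normalise using the statistics of those Z-normalised cohort scores.
    s_z = z_norm(score, target_vs_znorm_cohort)
    cohort_z = np.array([z_norm(s, c) for s, c in
                         zip(tnorm_models_vs_test, tnorm_models_vs_znorm_cohort)])
    return (s_z - cohort_z.mean()) / cohort_z.std()
```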
In both cases we use an impostor cohort, which is something we need to select. Typically, score normalisation cohorts should match the evaluation conditions; in the context of the SVM we want to know how important it is to match these conditions, and how much score normalisation actually benefits us when we have limited speech.

In this table we've got the full-short condition on the second row and the short-short condition down the bottom, looking at the ten-second condition in particular. We have three different cohort selection methods: none, in which no scores are normalised; full, which means both the Z-norm and T-norm cohorts use two and a half minutes of speech; and then matched. In the case of the full-ten-second condition, matched simply means the cohort utterances are truncated to match the corresponding train and test durations, whereas in the ten-second-ten-second case both are truncated to ten seconds. It's quite obvious that the full cohort gives us the worst performance, as we can see, and that matched cohorts offer the best, so that's quite elementary. The interesting observation here is that the relative performance gain from applying score normalisation seems quite minimal. So the question is: at what point is it worth the trouble of choosing a good score normalisation set?

To try and help answer that question, we looked at the relative gain in min DCF that score normalisation provides as we reduce the duration of speech. We see that with the full and eighty-second conditions we're gaining around ten percent, which is quite reasonable, but at the lower durations of speech, five and ten seconds, we've got less than two percent relative gain. Is that really worth the effort of trying to choose a good normalisation set, with the risk that the normalisation set isn't chosen well and actually reduces performance? That's another question.

In conclusion, we've investigated the sensitivity of the popular SVM system to reduced training and testing segments. We found that the best performance came from selecting a background dataset matched to the shorter test duration or to the shortest duration, depending on whether you want to optimise the DCF or the equal error rate. NAP transforms trained on data matching the shortest duration gave the best performance, and score normalisation cohorts matched to the evaluation conditions were also the best. We highlighted an issue with NAP when dealing with limited speech, which is due to the session variability becoming more isotropic as the speech duration is reduced, and score normalisation provided only a small gain in the short-duration conditions. Thank you.

Thank you for that. That was a systematic investigation into the effects of duration, as far as I can see, but Patrick gave a presentation this morning which I'm not sure you were able to take into account. I think Patrick's observations this morning give a nice explanation of what you see. The short explanation is that if you are using relevance MAP, then you are introducing speaker-dependent within-speaker variability; that's what Patrick called it in the original description. So do you agree with me that that perhaps explains what you see?

I'll have to talk to him about that afterwards, to be honest.

Any other questions?

I have a question about your U matrix for the NAP: you do relevance MAP, and then maybe PCA on that information? Sorry.
What I'm saying, well, my question is regarding how you learn the U matrix that you use to project away the nuisance directions. You're doing relevance MAP, and then are you computing PCA on your adapted supervectors?

I know that to estimate the U matrix we do some kind of PCA, going to a lower dimension for computational reasons, but then we go back to the original space.

So my question is the following: when you learn that U matrix, if you're just doing a regular PCA, which computes a low-rank approximation of your data when you put all your supervectors in as vectors, then what PCA does not take into account is the counts. When you do a proper factor analysis, you use the counts somehow to weight the portions of information in different parts of the supervector. So my question is mostly: are you somehow incorporating the information that, when you have a lot of Gaussians and very few data points, not all of the Gaussians get assigned points, and when you train your subspace, the subspace does not know that? Maybe that accounts for a lot of these observations. Do you see my point?

Actually, I don't believe we're explicitly taking into account the fact that some Gaussians might miss out on observations, and I can understand that that might have an effect on the estimates.

I'm not so sure about that. I mean, I like these kinds of studies, because you want to see what works best, but you also want to understand why it works best. What you described, essentially, is: you do this MAP process to get Gaussians, and then you're comparing the means of some training Gaussians you got with MAP with some test Gaussians you got with MAP, using the same UBM, and if it's not the same amount of data, things go wrong, basically. And the solution you're applying is to simply make them the same length. But you did that study without normalisation. Of course, all kinds of normalisation are there, as you said, to deal with differences like these, amongst other differences. I'm wondering whether, by doing it without normalisation, you were creating the worst possible condition, one that would otherwise have been fixed, and your solution ended up being to discard data. So the first question, I guess, is: when you truncated the training samples, did you literally just discard the rest of the data, or did you create additional short training utterances out of it?

We just discarded it.

OK, so one obvious thing is that if you take a thirty-second utterance and truncate it to ten seconds, it would be wasteful not to use the other twenty seconds as two more ten-second T-norm utterances. But besides that observation, I'm worried that if you had used normalisation, you might have fixed the problem to begin with.

We did actually run the evaluations with score normalisation as well, and we found very similar trends, but we wanted to get back to a very basic system, just to help, I guess you'd say, the reader's understanding of the core behaviour.
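As an aside on the questioner's suggestion about reusing the discarded speech, here is a toy sketch; the helper name and the 100 frames-per-second rate are assumptions rather than anything from the paper.

```python
# Sketch: cut an utterance's feature frames into several short segments instead of
# keeping only the first truncation and discarding the rest.
def split_into_short_segments(frames, seconds=10, frame_rate=100):
    """frames: (T, D) feature array. Returns a list of (seconds * frame_rate)-frame chunks."""
    n = seconds * frame_rate
    return [frames[i:i + n] for i in range(0, len(frames) - n + 1, n)]
```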
I'm hearing in many papers, especially today, a strong desire on everyone's part to find a way to do things without normalisation, as if normalisation were somehow a bad thing, when it seems to me that, beyond the obvious thing that you have to model the speech, normalisation is almost the only other thing, in a very high-level sense. After all, we're doing some kind of hypothesis test, verification, and that inherently requires knowing how to set a threshold, which requires some kind of normalisation, and to the extent that we try to get away from that, we're tying our hands behind our backs. I mean, it's good to look for methods that are inherently better, but I guess I would say we should still do normalisation when it can be done properly.

Well, my claim was that it's good to look for better models. I don't understand the desire to do away with normalisation if normalisation is at the crux of the problem and ultimately fixes whatever else you do wrong.

Yes, normalisation does exactly that, so what we are unhappy with is that we did do something wrong. We're trying to fix that a bit, and if we then find it's still not perfect, then I'm sure we will keep normalising.

The other way to look at it is that normalisation is just another modelling stage: extracting the MFCC features is modelling the acoustic signal, then the GMMs model the MFCCs, the i-vectors, again from this morning, model the GMM supervectors, and in the end there's a score modelling stage. So at the end you're just adding more stages; it might be nice to reduce the number of stages, but then we'd probably just go on modelling forever. Can we have the next speaker?