0:00:06uh good morning everyone, i'm Mitchell McLaren and i'll be presenting some work that i did
0:00:12while i was at QUT back in
0:00:14Australia
0:00:15i've now relocated to another one
0:00:18in case anyone's wondering
0:00:19i'm presenting on behalf of my co-authors as well, Robbie Vogt, Brendan Baker and Sridha Sridharan
0:00:25the work today is basically an experimental study on how svms perform
0:00:30when you decrease the amount of
0:00:32speech that is available to them for speaker verification
0:00:36so a brief outline
0:00:37first the motivation, why we did this study
0:00:40and then we'll do some experiments looking at how each of the components of a standard
0:00:46gmm svm
0:00:47system
0:00:48responds to limited
0:00:51speech being available to it
0:00:53this includes the background dataset
0:00:55session compensation, particularly NAP
0:00:57uh we look at a bit of an analysis of the variation in the kernel space with short utterances
0:01:04and the score normalisation dataset
0:01:06then i'll present some conclusions
0:01:09so motivation
0:01:11uh it's quite well known that as you reduce the amount of speech available to a system
0:01:15you're going to have a reduction in performance
0:01:18now there have been some previous studies uh which generally focus on the gmm ubm approach and even more recently
0:01:25the uh joint factor analysis
0:01:27uh but nothing really targeted at the svm case and this is why
0:01:32uh we're doing this work here
0:01:34uh one of the things to mention here is QUT participated in EVALITA, which is almost a miniature nist
0:01:40evaluation i guess you'd say
0:01:41in two thousand nine
0:01:43and some of the observations we got from this uh evaluation
0:01:48were that
0:01:48the svm outperformed our JFA system
0:01:52when we had an ample amount of speech
0:01:54six minutes
0:01:55uh whereas the opposite
0:01:56was true for the twenty second
0:01:58condition, so JFA performed better
0:02:01there was a distinct difference between the generative and discriminative
0:02:04approaches
0:02:06that was depending on the duration of each
0:02:09condition
0:02:10another observation here was also that JFA was more effective when
0:02:13estimating the session and
0:02:15speaker subspaces
0:02:16on a duration of speech that was similar to the evaluation condition
0:02:22so we're going to look at that a bit in this work
0:02:26of course svms are quite widespread
0:02:28in the speaker verification community, we just have to look at the presentations last week
0:02:32at nist two thousand ten where almost all
0:02:35submissions had uh the gmm svm
0:02:38configuration in there somehow
0:02:41uh so we're looking now at
0:02:44how to select development data
0:02:48uh when we have mismatched
0:02:51training and test segment durations
0:02:53in the svm configuration
0:02:56so the main questions here for the svm systems are
0:03:00to what degree
0:03:01does limited speech affect
0:03:02svm-based
0:03:03classification
0:03:05and also which system components are most sensitive
0:03:08to speech quantity
0:03:09uh so we're presenting these results
0:03:12with the hope of
0:03:13pointing in a direction in which to uh counteract
0:03:17these issues i should say
0:03:19most of you know about the gmm svm system i would suppose
0:03:23uh where we're using stacked gmm component means as features for the svm classification
0:03:28now we know we can get good
0:03:29performance when you have plenty of speech available
0:03:33in this work we're looking at uh the importance
0:03:36of matching the development dataset
0:03:38to the evaluation conditions
0:03:39for each of the individual components
0:03:43let's take a look at uh
0:03:44the flow diagram of the system
0:03:47and basically we have three main datasets that uh go into development
0:03:52first of all we want to train a transform matrix
0:03:55for session compensation
0:03:57particularly NAP
0:03:58uh so we have a transform training dataset
0:04:00we also have a background dataset
0:04:02to
0:04:03provide negative information during
0:04:05svm training
0:04:08and lastly we have a score normalisation dataset should we
0:04:10choose to apply score normalisation
0:04:15the approach for this
0:04:18is that we're going to go from a baseline svm system, that's one without
0:04:22score normalisation and no session compensation
0:04:24and build onto that progressively
0:04:26looking at how the additional components
0:04:29um are affected by the duration
0:04:33uh so these three sets as i mentioned were the background dataset
0:04:37the transform training dataset
0:04:38for session compensation and lastly score normalisation
0:04:42so maybe a quick look at the uh system we're working with here, it's the gmm svm system with a five hundred twelve
0:04:48component UBM
0:04:49twelve dimensional
0:04:51mfccs with appended deltas
0:04:53impostor data was selected from sre 04
0:04:56and we use this for both the background dataset and uh ZT score normalisation
0:05:02with NAP we use uh only
0:05:04the dimensions
0:05:06of greatest variation
0:05:07and then one from sre lance
0:05:09which boarding
0:05:11the
0:05:12evaluations we perform here are from the nist two thousand
0:05:15eight corpora
0:05:17particularly the short2-short3 condition
0:05:19now this usually has two and a half minutes of conversational speech per utterance
0:05:24and the way we're looking at reduced duration
0:05:27is uh in two focus conditions
0:05:30the full-short condition and short-short
0:05:32in the full-short condition we leave the training segment as is
0:05:37and we uh
0:05:40truncate the test utterance
0:05:41to the desired duration
0:05:43in the short-short
0:05:45we truncate both train and test
0:05:47to the same duration so there's essentially no mismatch in
0:05:49uh duration in this evaluation
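The two truncation conditions described here can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the `condition` labels and the idea of cutting a raw list of samples are hypothetical, and the talk does not say whether truncation was applied before or after silence removal.

```python
def truncate(samples, sample_rate, seconds):
    """Keep only the first `seconds` of an utterance (hypothetical helper)."""
    return samples[: int(sample_rate * seconds)]


def make_trial(train, test, sample_rate, seconds, condition):
    """Return the (train, test) pair for one reduced-duration condition.

    "full-short":  training utterance left as is, test utterance truncated.
    "short-short": both utterances truncated to the same duration.
    """
    if condition == "full-short":
        return train, truncate(test, sample_rate, seconds)
    if condition == "short-short":
        return (truncate(train, sample_rate, seconds),
                truncate(test, sample_rate, seconds))
    raise ValueError(f"unknown condition: {condition}")


# Toy example: 10 "seconds" of audio at an assumed 8 samples per second.
train = list(range(80))
test = list(range(80))
full_train, short_test = make_trial(train, test, 8, 2, "full-short")
print(len(full_train), len(short_test))
```

In the full-short case only the test side shrinks, which is what creates the train/test duration mismatch discussed in the rest of the talk.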
0:05:53so let's look at the baseline svm performance
0:05:56and in particular, before we go into
0:05:58uh the detail later, see how it performs compared to the gmm
0:06:03it's just i guess a
0:06:03point of reference for
0:06:05what we will see
0:06:07so here we're using uh
0:06:10a baseline and what we're terming state of the art
0:06:14which is now not so true
0:06:16with the uh i-vector work coming out
0:06:19we're looking at the baseline and state of the art for both gmm and svm systems
0:06:24four systems that were developed using the full two and a half minutes of speech in training
0:06:30so we're not
0:06:30uh explicitly dealing with the
0:06:33short durations as yet
0:06:36the first thing we notice here
0:06:38with the solid lines
0:06:40for the baselines
0:06:41we see that the baseline svm
0:06:44uh gives us
0:06:45better performance than the gmm baseline
0:06:49just as a note, the gmm baseline here has no session compensation
0:06:53and no score normalisation which might
0:06:56you what
0:06:56being conservative
0:06:58as we reduce the duration of speech the svm
0:07:02quickly deteriorates in performance compared to the gmm system
0:07:08it's not quite as noticeable in the state of the art
0:07:11um but the gmm is
0:07:13in front of the svm the whole way
0:07:15now if we look at the short-short
0:07:17uh conditions, this is where both train and test are being reduced
0:07:21we actually see that the svm baseline
0:07:24jumps out
0:07:25in front of the
0:07:26svm state of the art
0:07:28once we reduce below the eighty second duration
0:07:31uh having had
0:07:33the development of the system on the full two and a half minutes of
0:07:36speech here
0:07:37might be the reason for this but we're going to look into that
0:07:40in the case of the gmm system however
0:07:43it's at less than ten seconds that we're seeing the baseline jump in front of
0:07:47the state of the art
0:07:50so there's some significant differences and issues we need to look into here
0:07:54and hopefully
0:07:55uh the development datasets that we look into here will help us out with that
0:08:00let's start with the background dataset
0:08:02and here we're going to look at the svm system and
0:08:06how changing the speech duration in the background dataset affects performance
0:08:10without score normalisation
0:08:11and without session compensation
0:08:15so as we know the background dataset gives us the negative information in svm training
0:08:20we generally have
0:08:21many more negative examples than positive examples in the nist sres
0:08:26and we've previously shown uh that the choice of this dataset greatly affects model quality
0:08:32the real question that comes up with the svm is how we select this dataset
0:08:37with mismatched train test durations
0:08:40should we be matching the duration to the training utterance
0:08:43the test utterance
0:08:44or the shorter of the two
0:08:48so there are three slides here to present
0:08:52firstly we've got the short-short conditions, that's matched
0:08:55training and testing durations
0:08:57and it's quite obvious that it's better to match
0:08:59the background to the uh evaluation conditions here
0:09:02in the full-short, that's full training
0:09:05short testing
0:09:06we actually see it's better to match
0:09:08the background dataset to the test
0:09:10the shorter
0:09:11test utterance
0:09:15in the last condition, which we have introduced, short-full, that's short training
0:09:20full test
0:09:22again we don't see
0:09:23uh as large a discrepancy in the shorter durations
0:09:28we're actually seeing that matching to the shorter
0:09:31training utterance gives us a little bit of an improvement towards the uh larger durations
0:09:37so what conclusions can we draw from this, well let's look at the equal error rate as well in this
0:09:41table here to give us a bit more
0:09:43of a view
0:09:44and we're
0:09:44particularly focusing on the ten second condition here
0:09:49first thing we can see here is that matching the background dataset to the training segment
0:09:54does not always maximise performance
0:09:58however if we match to the test segment
0:10:01in our results we're always getting the best
0:10:03dcf performance
0:10:05and in contrast
0:10:06if we want the best equal error rate performance
0:10:08we match to the shortest duration
0:10:11so there's a bit of a choice to be made depending on what you want to optimise
0:10:15uh what operating point you want
0:10:20so in the following
0:10:22experiments we chose uh to use
0:10:24the shorter test utterance
0:10:26as the duration that we're matching our
0:10:29background dataset to
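The background matching rule reported above can be written down as a small selection function. This is a sketch of the rule of thumb only, not the authors' code: the function name and the `optimise` labels are hypothetical, and the rule simply restates the talk's observation that matching to the test segment gave the best dcf while matching to the shorter segment gave the best equal error rate.

```python
def background_duration(train_seconds, test_seconds, optimise="dcf"):
    """Pick the speech duration for the svm background dataset.

    optimise="dcf": match the background to the test segment duration.
    optimise="eer": match the background to the shorter of train and test.
    """
    if optimise == "dcf":
        return test_seconds
    if optimise == "eer":
        return min(train_seconds, test_seconds)
    raise ValueError(f"unknown operating point: {optimise}")


# Full-short trial: 150 s training, 10 s test. Both rules agree.
print(background_duration(150, 10, "dcf"), background_duration(150, 10, "eer"))
# Short-full trial: 10 s training, 150 s test. The two rules differ.
print(background_duration(10, 150, "dcf"), background_duration(10, 150, "eer"))
```

The two rules only disagree when the training utterance is the shorter one, which is the short-full condition introduced in the talk.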
0:10:31let's look now at session compensation
0:10:34nuisance attribute projection
0:10:37NAP aims to remove from the kernel space the directions of greatest uh session variation
0:10:42and it's formulated as shown here, such that uh
0:10:45the dimensions captured in the U
0:10:47transform matrix are projected out of the kernel space
0:10:50this transform U has to be learned from a training dataset
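As a reminder of what projecting the U dimensions out of the kernel space means, here is a minimal sketch of the standard nuisance attribute projection, v' = v - U(U^T v). It assumes the columns of U are orthonormal; the function name and the toy vectors are hypothetical, and learning U from session differences is not shown.

```python
def nap_project(v, U):
    """Remove from supervector `v` its components along the nuisance directions.

    `U` is a list of orthonormal direction vectors, each the same length as `v`.
    Implements v' = v - U (U^T v), one direction at a time.
    """
    out = list(v)
    for u in U:
        # Coefficient of v along this nuisance direction.
        coeff = sum(ui * vi for ui, vi in zip(u, v))
        out = [o - coeff * ui for o, ui in zip(out, u)]
    return out


# Toy example: one nuisance direction along the first axis.
v = [3.0, 4.0, 5.0]
U = [[1.0, 0.0, 0.0]]
print(nap_project(v, U))
```

After projection the supervector has no component left along any direction in U, so whatever session variation those directions captured no longer contributes to the svm kernel.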
0:10:55now what should we be using in this transform training dataset when we've got limited test speech
0:11:00or when both
0:11:01train and test speech are limited
0:11:05on this plot first up we're looking at the full-short condition
0:11:10our system here has no score normalisation but the background has been matched to the shorter test
0:11:15utterance in each of these cases
0:11:18and it's quite clear that using matched
0:11:20NAP training in this case
0:11:22that's matching to the short test utterance
0:11:25gives us the best performance
0:11:27and in fact if we use
0:11:29full NAP
0:11:31the reference
0:11:31system, that's one without NAP
0:11:33jumps in front in the longer duration
0:11:36here we really want to match the NAP
0:11:38uh to the
0:11:40test duration in the NAP transform training
0:11:45and in that i was tied to the mice
0:11:47challenging trust
0:11:48so the short
0:11:52now let's look at the short-short, this is an interesting case
0:11:56we actually observe that even though we match
0:11:59the NAP training dataset to the ten second duration
0:12:04we're still finding the best
0:12:05performance comes from the baseline system, so the one without NAP
0:12:09so why is this, we're improving upon the full NAP training which is great
0:12:15uh but matched NAP just isn't jumping in front of the baseline
0:12:19so NAP has a
0:12:20point somewhere at which it
0:12:22uh fails to provide benefits
0:12:24uh in the limited training and testing condition
0:12:29so what point is that
0:12:30well here's a plot where we've matched the NAP
0:12:33based on the uh duration
0:12:35in the short-short
0:12:36remember this is the short-short condition whereas
0:12:39for full-short we actually
0:12:40got more
0:12:41benefit out of NAP
0:12:43we actually see that
0:12:45just below the forty second mark
0:12:47uh is where the reference system jumps in front
0:12:50of the NAP compensated system
0:12:53so then
0:12:54why is this happening
0:12:56let's look at the uh variability in the kernel space
0:13:00so we found that NAP wasn't quite robust to limited
0:13:02training and testing speech
0:13:05in the context of JFA
0:13:07uh systems
0:13:09the session subspace
0:13:10variation was found to
0:13:13uh increase as the
0:13:15length of
0:13:16training and testing utterances
0:13:17was reduced
0:13:18so we're going to see if that's the same case in the svm kernel space
0:13:25on the slide we have a table with a number of durations
0:13:29for the short-short
0:13:30uh trial condition
0:13:32and we've
0:13:33also got
0:13:34tau, which refers to the map
0:13:36relevance factor all night
0:13:38uh and we're
0:13:39presenting the total variability
0:13:42uh in the
0:13:44speaker space and session space
0:13:47of the svm kernel
0:13:49and we actually see that
0:13:50in contrast to what was observed with JFA
0:13:53we're getting a reduction in both of these spaces as duration is reduced
0:13:58now we wondered why is this the case, what is the difference here
0:14:01and so what we did
0:14:02was actually take an inconsequential tau, close to zero
0:14:06uh so that
0:14:08the supervectors have more room to move
0:14:11we actually find that we do in fact agree with the JFA
0:14:15observations in that we are getting
0:14:18uh a greater magnitude of variation in each of these cases
0:14:22if we uh
0:14:23change the relevance
0:14:24factor
0:14:25to close to zero
0:14:27so here we conclude the map adaptation relevance factor has a significant influence on the observable variation in the svm
0:14:33kernel space
0:14:34that's just something to be aware of
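The effect of the relevance factor can be seen from the usual relevance map update for a single gaussian mean, mean = alpha * data_mean + (1 - alpha) * ubm_mean with alpha = n / (n + tau). The sketch below assumes that standard form; the function name and the numbers (including tau = 10) are illustrative only and are not taken from the talk.

```python
def map_adapt_mean(ubm_mean, frame_sum, n, tau):
    """Relevance map adaptation of one gaussian mean.

    ubm_mean:  prior mean from the UBM.
    frame_sum: sum of the frames softly assigned to this gaussian.
    n:         soft count of frames assigned to this gaussian.
    tau:       relevance factor; larger values keep the mean nearer the UBM.
    """
    if n == 0:
        # No data for this component: the mean stays at the UBM value.
        return list(ubm_mean)
    alpha = n / (n + tau)
    return [alpha * (s / n) + (1.0 - alpha) * m
            for s, m in zip(frame_sum, ubm_mean)]


ubm_mean = [0.0, 0.0]
frame_sum = [4.0, -2.0]   # two frames whose average is [2.0, -1.0]
print(map_adapt_mean(ubm_mean, frame_sum, 2.0, 10.0))   # shrunk towards the UBM
print(map_adapt_mean(ubm_mean, frame_sum, 2.0, 1e-6))   # close to the data mean
```

With little speech the counts n are small, so any sizeable tau pulls every mean back towards the UBM and compresses the spread of supervectors, which is consistent with the reduced variation reported for short utterances until tau is set close to zero.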
0:14:37now what's interesting is that irrespective of the tau that we use
0:14:41we're getting a very similar
0:14:44session to speaker ratio here, so the
0:14:47session variation that's coming out is more dominant
0:14:51uh as the duration is reduced
0:14:53and of course this is why speaker verification
0:14:55is more difficult with
0:14:58shorter speech segments
0:15:01so why then
0:15:02if we're getting more session variation
0:15:04is NAP struggling to estimate that
0:15:07as we reduce the duration
0:15:10let's look at this uh figure
0:15:12we have
0:15:13the magnitude of session variability and speaker variability
0:15:18in the top one hundred eigenvectors estimated by NAP
0:15:23for durations of eighty seconds and ten seconds
0:15:26now the
0:15:27solid lines are eighty seconds, the other ones are ten seconds
0:15:30and session variability is the black line
0:15:33first thing we notice here is that
0:15:35when we have longer durations
0:15:38the slope
0:15:39for the session variation is greater
0:15:41so we're getting more
0:15:43session variation
0:15:44that can be represented in a lower number of dimensions
0:15:48uh whereas as the duration
0:15:50reduces we're
0:15:51flattening out and becoming a bit more isotropic in our session
0:15:55uh variation
0:15:57in contrast our speaker variation
0:15:59slope is actually quite similar
0:16:03this aligns with the uh table we just saw
0:16:06where the session variation is uh
0:16:09becoming more dominant
0:16:12now NAP was developed on the assumption that the majority of session variation lies in a low dimensional space
0:16:19that's our understanding of it
0:16:21because of the
0:16:25uh more isotropic session variation that's
0:16:28coming about on these reduced utterances
0:16:31the assumption no longer holds and this is why it's unable to offer benefit
0:16:36in the short-short condition
0:16:38so how do we overcome this problem
0:16:40we're still working on that
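One simple way to put a number on how isotropic the session variation has become is the share of total variance held by the leading eigenvalues. This is an illustrative measure chosen here, not one used in the talk, and the two eigenvalue lists are made-up toy spectra rather than results.

```python
def fraction_in_top_k(eigenvalues, k):
    """Share of the total variance captured by the k largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical session eigen-spectra, for illustration only.
steep_spectrum = [8.0, 4.0, 2.0, 1.0, 1.0]   # stands in for long utterances
flat_spectrum = [4.0, 3.0, 3.0, 3.0, 3.0]    # stands in for short utterances

print(fraction_in_top_k(steep_spectrum, 2))
print(fraction_in_top_k(flat_spectrum, 2))
```

A steep spectrum means a few projected-out directions remove most of the session variation, whereas a flat spectrum leaves most of it behind whichever few directions are removed.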
0:16:45next we move on to score normalisation
0:16:48i won't say a lot because everyone knows
0:16:50what score normalisation is here
0:16:52i think of the last you
0:16:55uh basically it can correct statistical variation in classification
0:16:58scores
0:16:59and attempts to
0:17:00scale scores from
0:17:02uh a given trial
0:17:04using Z-norm and T-norm, train-centric and test-centric approaches respectively
0:17:10and again we're using an impostor cohort, something we need to
0:17:13select somehow
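For readers less familiar with Z-norm and T-norm, both apply the same shift and scale to a raw score and differ only in where the impostor cohort scores come from. The sketch below assumes the common definitions; the function names are hypothetical and the population standard deviation is an arbitrary choice.

```python
from statistics import mean, pstdev


def normalise(raw_score, cohort_scores):
    """Shift and scale a raw score by the cohort score statistics."""
    return (raw_score - mean(cohort_scores)) / pstdev(cohort_scores)


def znorm(raw_score, model_vs_impostor_utterances):
    """Z-norm: the cohort scores come from scoring impostor utterances
    against the target model, so they depend on the training side."""
    return normalise(raw_score, model_vs_impostor_utterances)


def tnorm(raw_score, impostor_models_vs_test):
    """T-norm: the cohort scores come from scoring the test utterance
    against impostor models, so they depend on the test side."""
    return normalise(raw_score, impostor_models_vs_test)


# Toy cohort with mean 0 and standard deviation 1.
print(znorm(3.0, [-1.0, 1.0]))
```

Because the cohort utterances and cohort models stand in for the real test and training data, their durations are what has to be matched to the evaluation condition.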
0:17:16now typically
0:17:17score normalisation cohorts should match the evaluation conditions
0:17:21in the context of
0:17:22svms we want to know
0:17:24how important is it to match these
0:17:26uh conditions
0:17:27and how much does score normalisation actually
0:17:29benefit us when we have limited speech
0:17:34in this table here we've got the uh
0:17:36full-short condition on the second row
0:17:39and the short-short condition down the bottom, we're looking at the ten second
0:17:43condition in particular
0:17:44we have three different cohort selection methods: none, in which no scores are normalised
0:17:50full, which means both the Z-norm and T-norm
0:17:52uh cohorts are using two and a half minutes
0:17:55and then matched
0:17:57in the case of the full ten second
0:18:00condition, matched
0:18:01simply means that the Z-norm data
0:18:03is truncated to ten seconds
0:18:05whereas in the ten second ten second case
0:18:07both the Z-norm and T-norm data are truncated
0:18:12it's quite obvious that the full uh cohorts are going to give us the worst performance, which
0:18:17we can see
0:18:18and that the matched cohorts offer the best
0:18:22so uh quite elementary but
0:18:24the uh interesting observation here is that
0:18:29the relative performance gain from applying score normalisation
0:18:32seems quite minimal, so
0:18:34the question is
0:18:37at what point are we willing to
0:18:39go about choosing a score normalisation set to try and help
0:18:45so to try and help answer that question we looked at the
0:18:48relative gain in min dcf that score normalisation provides
0:18:52as we reduce the duration of speech
0:18:56we see that with
0:18:56the full eighty seconds we're getting a ten percent gain which is
0:19:00quite reasonable
0:19:01it's in the lower durations of speech, five and ten seconds, we've got less than two percent relative gain
0:19:07are these really worth the effort of
0:19:08trying to choose a good normalisation set
0:19:11uh and the risk
0:19:13that the normalisation set
0:19:15is not actually chosen well
0:19:19that's another question
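The relative gain in min dcf referred to here is simply the reduction in min dcf expressed as a percentage of the unnormalised value. A short sketch, with a hypothetical function name and made-up dcf values that are not results from the talk:

```python
def relative_gain_percent(dcf_without_norm, dcf_with_norm):
    """Relative reduction in min dcf from score normalisation, in percent."""
    return 100.0 * (dcf_without_norm - dcf_with_norm) / dcf_without_norm


# Hypothetical min dcf values, for illustration only.
print(relative_gain_percent(0.050, 0.045))
print(relative_gain_percent(0.080, 0.079))
```

A negative value would indicate that normalisation made min dcf worse, which is the risk raised about a poorly chosen normalisation set.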
0:19:22in conclusion we've investigated the
0:19:24sensitivity of the popular svm system
0:19:27uh to reduced training and testing segments
0:19:29and we found the best performance from selecting a background
0:19:33uh that matched the test or shortest duration depending on
0:19:37whether you want to optimise the dcf or equal error rate
0:19:40NAP transforms trained on data matching
0:19:43the shortest
0:19:44duration gave the best performance
0:19:46and score normalisation
0:19:48cohorts matched to the evaluation
0:19:49conditions were also the best
0:19:51we highlighted an issue in NAP
0:19:53when dealing with limited speech and this is due to session variability
0:19:57becoming more isotropic as the speech duration was reduced
0:20:01score normalisation provided uh little benefit
0:20:04in the
0:20:06uh limited speech conditions
0:20:08thank you
0:20:17thank you for the
0:20:18that's a systematic
0:20:21investigation into the effects of uh
0:20:27as far as i can see
0:20:29but trick
0:20:30uh there's a patient this morning which i'm not sure
0:20:33you will you have no impact at the sleeping well that had a uh right
0:20:37we're not going on you know that
0:20:39i i think
0:20:43observations this morning
0:20:46a nice
0:20:49of what you see
0:20:51the short
0:20:53uh if you're using relevance map
0:20:55uh_huh then
0:21:01speaker dependent
0:21:03within speaker
0:21:06uh that's what
0:21:07but recall uh the uh
0:21:09the original script
0:21:14you agree with me that explains
0:21:16perhaps explains
0:21:18what you see
0:21:24i'll i'll have to talk for the other ones are honest representation
0:21:29so any any others
0:21:31any other questions
0:21:40your name uh you're you're matrix for the
0:21:43and to do and relevance map and maybe we pca on
0:21:47that information
0:21:49sorry, i'm saying it
0:21:50not quite well
0:21:52um my question is regarding the
0:21:54uh how you learn the U matrix uh
0:21:57to project away
0:21:58so you're doing relevance map
0:22:00uh a man on bad
0:22:02you're not P C
0:22:04computing it
0:22:05pca pca on uh
0:22:07your uh um centre
0:22:09real time at that or or
0:22:13i know that uh to estimate the U matrix we are doing some kind of pca to go to a lower size
0:22:19for computational reasons
0:22:21but then we go back to the original
0:22:23so that would
0:22:24but no, so my question is uh
0:22:27basically when you learn that U matrix
0:22:30is that uh if you're just doing a regular pca which is uh
0:22:34computing a low dimensional approximation of your uh
0:22:38if you put all your supervectors
0:22:40i mean you do uh a low rank approximation of that matrix, basically what pca does
0:22:45you're not taking into account
0:22:47the counts
0:22:48uh that when you do your factor analysis
0:22:53you're using the counts somehow
0:22:55to uh
0:22:57weight the amount
0:22:58of uh
0:23:00information in in different parts of the
0:23:03supervector
0:23:04so um my question is mostly regarding
0:23:08are you somehow incorporating
0:23:10the information that
0:23:11when you have a lot of gaussians and very few points
0:23:15not all the gaussians get assigned points
0:23:18and then when you
0:23:19train your subspace
0:23:20your subspace
0:23:21does not know that
0:23:23so maybe that accounts for a lot of these uh
0:23:26observations are you happier
0:23:28that's an interesting point actually i think
0:23:30i think either
0:23:31uh i don't believe we're actually explicitly taking into account
0:23:37the fact that some gaussians might miss out on observations
0:23:42and yeah i think i can understand
0:23:44you saying that it might have an effect on the
0:23:46but on the united
0:24:02you mean yeah
0:24:04i'm a little
0:24:06unsure about the
0:24:08so what
0:24:08i i mean i'm all
0:24:09for empirical
0:24:10studies because you want to see what works best
0:24:13but you also want to understand why it works best
0:24:16so what you said sort of
0:24:18from a mathematical standpoint was
0:24:19you're doing this
0:24:22map to get gaussians
0:24:24and then you're comparing the means of some training
0:24:27gaussians you got with map
0:24:29with some test gaussians you got with map, using an svm
0:24:32and if it's not the same amount of data
0:24:35things go wrong
0:24:37and so
0:24:40the solution you're applying is you simply make it the same length
0:24:45it would seem like
0:24:49yeah but you did that study without normalisation
0:24:52okay so of course when the noise
0:24:54uh you dicks
0:24:56all kinds of normalisations are there as you said
0:25:00to deal with
0:25:01differences like duration and other differences
0:25:03um i'm wondering whether
0:25:05by doing it without normalisation you were truly
0:25:08making the worst possible condition that
0:25:11would otherwise have been fixed
0:25:12but your solution ended up being discard data
0:25:15so the first question i guess is
0:25:18when you truncated the training samples did you literally just discard the rest of the data or did you
0:25:23create additional short training utterances out of those
0:25:26and i would discover that i
0:25:29one obvious thing is if you take a thirty second utterance
0:25:32truncated to ten seconds it would be wasteful not to use the other twenty
0:25:35seconds as two more ten second utterances
0:25:39besides that
0:25:40that observation
0:25:42i'm worried about the
0:25:46if you had used normalisation uh_huh
0:25:48you might
0:25:49fix the problem
0:25:50to begin with, we did actually run these also with score
0:25:53normalisation but we found basically
0:25:57but we wanted to
0:25:58uh try and get back to a very basic system just to help
0:26:02i guess you'd say the reader's understanding and flow
0:26:05of that
0:26:06i i'm hearing in many papers especially today
0:26:10a strong desire on everyone's part
0:26:14to find a way to do things without normalisation, as if
0:26:17somehow normalisation were a bad thing
0:26:20when it seems to me that normalisation is
0:26:25beyond the obvious thing that you have to model the speech hmmm
0:26:29it seems like the only other thing
0:26:31you know in a very high level sense
0:26:33is normalisation
0:26:34after all we're doing
0:26:36we're doing some kind of hypothesis test
0:26:40which inherently requires
0:26:43knowing how to set a threshold which requires
0:26:45some kind of normalisation
0:26:48to the extent that we try to get away from that
0:26:51we're tying our hands behind our back
0:26:55i mean it's good it's good to look for methods that are
0:26:58inherently better
0:27:00i guess i would
0:27:02you know what
0:27:03we should still do normalisation, it can never hurt if
0:27:10done properly
0:27:18oh what is
0:27:19well my my claim was
0:27:23while it's good to look for better models
0:27:27i i don't see it
0:27:29i don't
0:27:29understand the desire to do away with normalisation
0:27:33it seems like normalisation
0:27:36is at the crux of the problem
0:27:37and ultimately
0:27:40fixes whatever else you do wrong
0:27:42and it never hurts
0:27:53yes normalisation does exactly that so
0:27:57what we are unhappy with is
0:27:59that we did do something wrong so
0:28:02uh we we're trying to do
0:28:04that's a bit of
0:28:08and if then we find
0:28:09it's still not perfect
0:28:11then i'm sure we will keep normalising
0:28:14so the other way to look at it is
0:28:16that the
0:28:17normalisation is just another
0:28:19modelling stage
0:28:22extracting the mfcc features is modelling the acoustic signal
0:28:26and then
0:28:28the gmm is
0:28:29modelling the mfccs and
0:28:32uh i-vectors, again this morning, are modelling
0:28:35the gmm supervectors and then in the end
0:28:38there's a score modelling stage
0:28:43at the end you're just stacking more modelling stages, it might be nice just to reduce
0:28:49the number of
0:28:49modelling stages but the
0:28:51probably probably
0:28:54we might just go on
0:28:55normalising forever
0:28:59can we uh
0:29:00uh have the next speaker