| 0:00:06 | Good morning everyone, I'm Mitchell McLaren. I'll be presenting some work that was done here at QUT, though I've now relocated to another institution, in case anyone's wondering. |
|---|
| 0:00:19 | I'm presenting on behalf of my co-authors as well: Robbie Vogt, Brendan Baker and Sridha Sridharan. |
|---|
| 0:00:25 | The work today is basically an experimental study of how SVMs perform when you decrease the amount of speech that is available to them for speaker verification. |
|---|
| 0:00:36 | A brief outline: first, the motivation for why we did this study. |
|---|
| 0:00:40 | Then we'll go through some experiments looking at how each of the components of a standard GMM-SVM system responds to limited speech being available to it: this includes the background dataset and session compensation, particularly NAP. |
|---|
| 0:00:57 | We'll also look at a bit of an analysis of the variation in the kernel space with short utterances, and at the score normalisation dataset, and then I'll present some conclusions. |
|---|
| 0:01:09 | So, the motivation. It's quite well known that as you reduce the amount of speech available to a system, you're going to have a reduction in performance. |
|---|
| 0:01:18 | Now, there have been some previous studies, which generally focus on the GMM-UBM approach and, more recently, joint factor analysis, but nothing really targeted at the SVM case, and this is why we're doing this work here. |
|---|
| 0:01:34 | One thing to mention here is that QUT participated in EVALITA, which is almost a miniature NIST evaluation, I guess you'd say, in 2009. |
|---|
| 0:01:43 | Some of the observations we took from this evaluation were that the SVM outperformed the JFA system when we had an ample amount of speech, six minutes, whereas the opposite was true for the twenty-second condition, where JFA performed better. |
|---|
| 0:02:01 | So there was a distinct difference between the generative and discriminative approaches that depended on the duration of speech. |
|---|
| 0:02:10 | Another observation was that JFA was more effective when estimating the session and speaker subspaces on a duration of speech that was similar to the evaluation conditions. So we're going to look at that a bit over this presentation. |
|---|
| 0:02:26 | Of course, SVMs are quite widespread in the speaker verification community; we just have to look at the presentations last week on NIST SRE 2010, where almost all submissions had the GMM-SVM configuration in there somehow. |
|---|
| 0:02:41 | So we're looking now at how best to select and utilise the development data when we have mismatched training and test segment durations in the SVM configuration. |
|---|
| 0:02:56 | So the main questions here for the SVM systems are: to what degree does limited speech affect SVM-based classification, and which system components are most sensitive to speech quantity? |
|---|
| 0:03:09 | We're presenting these results with the hope of pointing out directions in which to counteract these effects, I should say. |
|---|
| 0:03:19 | Most of you know about the GMM-SVM system, I would suppose, where we use stacked GMM component means as the features for the SVM classification; we know we can get good performance when there's plenty of speech available. |
|---|
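To make the stacked-means idea concrete, here is a minimal sketch of forming a mean supervector for the SVM; the shapes and stand-in values are hypothetical, and a real system would obtain the adapted means by MAP-adapting a UBM to the utterance:

```python
import numpy as np

def mean_supervector(adapted_means):
    """Stack the adapted GMM component means into a single SVM feature vector.

    adapted_means: array of shape (C, F), one row per GMM component.
    """
    return adapted_means.reshape(-1)  # shape (C * F,)

# toy usage: a 512-component UBM over 24-dimensional MFCC + delta features
C, F = 512, 24
rng = np.random.default_rng(0)
ubm_means = rng.standard_normal((C, F))                  # stand-in UBM means
adapted = ubm_means + 0.1 * rng.standard_normal((C, F))  # stand-in MAP-adapted means
sv = mean_supervector(adapted)
assert sv.shape == (C * F,)
```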
| 0:03:32 | In this work we're looking at the importance of matching the development datasets to the evaluation conditions for each of the individual components. |
|---|
| 0:03:43 | Let's take a look at the flow diagram of the system. Basically, we have three main datasets that go into development. First of all, we want to train a transform matrix for session compensation, particularly NAP, so we have transform training data. |
|---|
| 0:04:00 | We also have a background dataset, whose role is to provide the negative information during SVM training. |
|---|
| 0:04:08 | And lastly, we have the score normalisation dataset, should we choose to apply score normalisation. |
|---|
| 0:04:15 | The approach for this study is that we're going to start from a baseline SVM system, that's one without score normalisation and without session compensation, and build onto that progressively, looking at how each of the additional components is affected by the duration of speech. |
|---|
| 0:04:33 | So these three sets, as I mentioned, are the background dataset, the transform training dataset for session compensation, and lastly the score normalisation dataset. |
|---|
| 0:04:42 | So maybe a quick look at the system we're working with. It's a GMM-SVM system: a 512-component UBM and 12-dimensional MFCCs with appended deltas. |
|---|
| 0:04:53 | Impostor data was selected from SRE 04, and we use this both for the background dataset and for T-norm score normalisation. |
|---|
| 0:05:02 | With NAP we use only the dimensions of greatest variation, and the Z-norm data came from SRE 05. |
|---|
| 0:05:12 | The evaluations we perform here are from the NIST SRE corpora, particularly the short2-short3 condition. This usually has two and a half minutes of conversational speech per utterance. |
|---|
| 0:05:24 | The way we're looking at reduced durations is in two focus conditions: the full-short condition and the short-short condition. |
|---|
| 0:05:32 | In the full-short condition we leave the training segment as is, and we progressively truncate the test utterance to the desired length. |
|---|
| 0:05:43 | In the short-short case we truncate both train and test to the same duration, so there's essentially no mismatched duration in this evaluation. |
|---|
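A minimal sketch of the truncation protocol just described, assuming utterances are arrays of active-speech feature frames at a hypothetical 100 frames per second:

```python
import numpy as np

def truncate(frames, seconds, frames_per_second=100):
    """Keep only the first `seconds` of active speech."""
    return frames[: int(seconds * frames_per_second)]

# stand-in utterances: 150 s of 24-dimensional features each
train_frames = np.zeros((15000, 24))
test_frames = np.zeros((15000, 24))

test_10s = truncate(test_frames, 10)    # full-short: only the test side truncated
train_10s = truncate(train_frames, 10)  # short-short: both sides truncated
assert len(test_10s) == len(train_10s) == 1000
```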
| 0:05:53 | So let's look at the baseline SVM performance, and in particular, before we go into the details later, see how it compares to the GMM, just to give a point of reference for what we're working with. |
|---|
| 0:06:07 | Here we're using what we're terming baseline and state-of-the-art configurations, which is now not so true with the i-vector work coming out; we're looking at baseline and state-of-the-art versions of both the GMM and SVM systems. |
|---|
| 0:06:24 | All four systems were developed using the full two and a half minutes of speech in training and testing, so we're not explicitly dealing with the reduced durations as yet. |
|---|
| 0:06:36 | The first thing we notice here, with the solid lines being the baseline systems, is that the baseline SVM gives us better performance than the GMM baseline. |
|---|
| 0:06:49 | Mind you, the GMM baseline here has no session compensation and no score normalisation, which you might say is being conservative. |
|---|
| 0:06:58 | But as we reduce the duration of speech, the SVM quickly deteriorates in performance compared to the GMM system. |
|---|
| 0:07:08 | It's not quite as noticeable in the state-of-the-art systems, but the GMM is in front of the SVM all the way. |
|---|
| 0:07:15 | Now, if we look at the short-short conditions, where both train and test have been reduced, we actually see that the SVM baseline levels out once we reduce below about the eighty-second mark. |
|---|
| 0:07:31 | Having developed the system on the full two and a half minutes of speech might be the reason for this, but we've got to look into that. |
|---|
| 0:07:40 | In the case of the GMM system, however, at less than ten seconds we're seeing the baseline jump in front of the state-of-the-art configuration. |
|---|
| 0:07:50 | So there are some significant differences and issues we need to look into here, and hopefully the development datasets that we look into will help us out with that. |
|---|
| 0:08:00 | Let's start with the background dataset. Here we're going to look at the SVM system and how changing the speech duration in the background dataset affects performance, without score normalisation and without session compensation. |
|---|
| 0:08:15 | As we know, the background dataset gives us the negative information in SVM training; we generally have many more negative examples than positive examples in the NIST SREs, and we previously found that the choice of this dataset greatly affects model quality. |
|---|
| 0:08:32 | The real question that comes up with the SVMs here is how we select this dataset under mismatched train and test durations: should we be matching the duration to the training utterance, the test utterance, or the shorter of the two? |
|---|
| 0:08:48 | To colour this, there are three slides here to present. Firstly, we've got the short-short condition, with matched training and testing durations, and it's quite obvious that it's better to match the background to the evaluation conditions here. |
|---|
| 0:09:02 | In the full-short condition, that's full training and short testing, we're actually seeing that it's better to match the background dataset to the shorter test utterance. |
|---|
| 0:09:15 | In the last condition, where we have introduced short-full, that's short training and full testing, we again don't see as large a discrepancy at the shorter durations, but we're actually seeing that matching to the shorter training utterance gives us a little bit of an improvement towards the larger durations. |
|---|
| 0:09:37 | So what conclusions can we draw from this? Let's look at the equal error rates as well, to give us a bit more information, particularly focusing on the ten-second condition here. |
|---|
| 0:09:49 | The first thing we can see is that matching the background dataset to the training segment does not always maximise performance. |
|---|
| 0:09:58 | However, if we match to the test segment, in our results we're always getting the best DCF performance; in contrast, if we want the best equal error rate, we match to the shortest duration. |
|---|
| 0:10:11 | So there's a bit of a choice that can be made depending on what operating point you want to address. |
|---|
| 0:10:20 | So in the following experiments we chose to use the shorter test utterance as the duration that we're matching our background dataset to. |
|---|
| 0:10:31 | Let's look now at session compensation: nuisance attribute projection. NAP aims to capture in a subspace the directions of greatest session variation, and, as the formula here shows, the dimensions captured in the U transform matrix are projected out of the kernel space. |
|---|
| 0:10:50 | Of course, the transform U has to be learned from a training dataset. |
|---|
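For reference, a minimal sketch of the NAP projection on supervectors, assuming U has orthonormal columns spanning the estimated session subspace (the standard (I - UU^T) formulation):

```python
import numpy as np

def nap_project(w, U):
    """Project the nuisance (session) subspace out of a supervector.

    w: supervector of shape (D,); U: (D, k) orthonormal session directions.
    Returns (I - U U^T) w.
    """
    return w - U @ (U.T @ w)

# toy usage with a random orthonormal basis
D, k = 1000, 50
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((D, k)))  # orthonormal columns
w = rng.standard_normal(D)
w_comp = nap_project(w, U)
# the compensated vector is orthogonal to every nuisance direction
assert np.allclose(U.T @ w_comp, 0.0, atol=1e-10)
```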
| 0:10:55 | Now, what should we be using in this transform training dataset when we've got limited test speech, or when both train and test speech are limited? |
|---|
| 0:11:05 | On this plot, first we're looking at the full-short condition. The system here has no score normalisation, but the background has been matched to the shorter test utterance in each of these cases. |
|---|
| 0:11:18 | It's quite clear that using matched NAP training, that's matching to the short test utterance, gives us the best performance. |
|---|
| 0:11:27 | In fact, if we use full NAP training, the reference system, that's the one without NAP, jumps in front at the longer durations. |
|---|
| 0:11:36 | So here we really want to match the NAP training to the shorter test duration rather than the train utterance; in that sense NAP was tied to the most challenging utterance, so the shorter one. |
|---|
| 0:11:52 | Now let's look at the short-short condition. This is an interesting case, because we actually observe that, even though we matched the NAP training dataset to the ten-second duration, we're still finding the best performance comes from the baseline system, so the one without NAP. |
|---|
| 0:12:09 | So why is this? Well, we're pointing out that full NAP training performs worse, quite significantly, but matched NAP just isn't jumping in front of the baseline. |
|---|
| 0:12:19 | So there's a point somewhere at which NAP fails to provide benefits in the limited training and testing conditions. |
|---|
| 0:12:29 | So what is that point? Well, here's a plot where we matched the NAP training based on the evaluation duration in the short-short condition; remember this is the short-short condition, whereas in full-short we actually got more of a benefit out of NAP. |
|---|
| 0:12:43 | We actually see that just below the forty-second mark is where the reference system jumps in front of the NAP-compensated one. |
|---|
| 0:12:53 | So then, why is this happening? Let's look at the variability in the kernel space; we found that NAP wasn't quite robust to limited training and testing speech. |
|---|
| 0:13:05 | In the context of JFA systems, the session subspace variation was found to increase as the length of the training and testing utterances is reduced, so we're going to see if the same happens in the SVM kernel. |
|---|
| 0:13:25 | On this slide we have a table with a number of durations for the short-short trial condition, and we've also got a tau there, the MAP relevance factor. We're presenting the total variability in the speaker space and session space of the SVM kernel. |
|---|
| 0:13:49 | We actually see that, in contrast to what was observed with JFA, we're getting a reduction in both of these spaces as duration is decreased. |
|---|
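For context, a minimal sketch of the kind of measurement behind such a table: within-speaker (session) and between-speaker (speaker) variability magnitudes computed from speaker-labelled supervectors. This is a generic scatter decomposition under our own naming, not necessarily the paper's exact procedure:

```python
import numpy as np

def variability_magnitudes(vectors, speakers):
    """Return (session_var, speaker_var) for labelled supervectors.

    vectors: array (N, D); speakers: length-N sequence of labels.
    session_var sums variance around each speaker's mean; speaker_var
    sums variance of speaker means around the global mean.
    """
    vectors = np.asarray(vectors, dtype=float)
    global_mean = vectors.mean(axis=0)
    session_var = speaker_var = 0.0
    for spk in set(speakers):
        group = vectors[np.array([s == spk for s in speakers])]
        mean = group.mean(axis=0)
        session_var += ((group - mean) ** 2).sum()
        speaker_var += len(group) * ((mean - global_mean) ** 2).sum()
    n = len(vectors)
    return session_var / n, speaker_var / n
```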
| 0:13:58 | Now, why is this the case? What is the difference here? So what we did was actually take a tau close to zero, so that the supervectors have more room to move. |
|---|
| 0:14:11 | We actually find that we do in fact agree with the JFA observations, in that we are getting a greater magnitude of variation in each of these cases if we change the relevance factor to something close to zero. |
|---|
| 0:14:27 | So here we can see that the MAP adaptation relevance factor has a significant influence on the observable variation in the SVM kernel space; that's just something to be aware of. |
|---|
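Since the relevance factor turns out to matter here, a minimal sketch of relevance-MAP mean adaptation; this is the standard formula, where alpha_c = n_c / (n_c + tau), so a tau near zero lets the adapted means, and hence the supervectors, follow the data almost entirely:

```python
import numpy as np

def map_adapt_means(ubm_means, counts, first_order, tau=16.0):
    """Relevance-MAP adaptation of GMM means.

    ubm_means: (C, F) UBM means; counts: (C,) soft occupation counts n_c;
    first_order: (C, F) statistics sum_t gamma_c(t) x_t; tau: relevance factor.
    """
    alpha = counts / (counts + tau)                        # per-component weight
    ml_means = first_order / np.maximum(counts, 1e-10)[:, None]
    return alpha[:, None] * ml_means + (1.0 - alpha)[:, None] * ubm_means
```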
| 0:14:37 | Now, what's interesting is that, irrespective of the tau that we use, we're getting a very similar session-to-speaker variation ratio here, so the session variation that's coming out is more dominant as the duration is reduced. |
|---|
| 0:14:53 | And of course, this is why speaker verification is more difficult with shorter speech segments. |
|---|
| 0:15:01 | So why, then, if we're getting more session variation, is NAP struggling to estimate it as we reduce the duration? |
|---|
| 0:15:10 | Let's look at this figure. We have the magnitude of session variability and speaker variability in the top one hundred eigenvectors estimated by NAP, for durations of eighty seconds and ten seconds. |
|---|
| 0:15:26 | Now, the solid lines are the eighty-second case, the dotted ones are ten seconds, and session variability is the black line. |
|---|
| 0:15:33 | The first thing we notice is that when we have longer durations of speech, the slope of the session variation is greater, so we're getting more session variation that can be represented in a lower-dimensional subspace. |
|---|
| 0:15:48 | Whereas, as the duration reduces, we're flattening out and becoming a bit more isotropic in our session variation; in contrast, the speaker variation slope is actually quite similar. |
|---|
| 0:16:03 | This aligns with the table we just saw, where the session variation was becoming more dominant. NAP was developed on the assumption that the majority of session variation lies in a low-dimensional space. |
|---|
| 0:16:19 | So it's our understanding that, because of the more isotropic session variation coming about in these reduced durations, the assumption no longer holds, and this is why NAP is unable to offer a benefit in the short-short condition. |
|---|
| 0:16:38 | As for how we can overcome this problem, we're still working on that. |
|---|
| 0:16:45 | Next, to move on to score normalisation. I won't go into it a lot, because everyone knows score normalisation, I think, from the last few presentations. |
|---|
| 0:16:55 | Basically, it can correct statistical variation in classification scores: it centres and scales the scores from a given trial using Z-norm and T-norm, the claimant-centric and test-centric approaches respectively. |
|---|
| 0:17:10 | And again, we're using an impostor cohort, something we need to select as well. |
|---|
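For reference, a minimal sketch of the standard Z-norm and T-norm; the `score` argument is a hypothetical stand-in for the verification scoring function:

```python
import numpy as np

def z_norm(raw, model, impostor_utterances, score):
    """Z-norm: normalise by the model's scores against impostor utterances."""
    cohort = np.array([score(model, u) for u in impostor_utterances])
    return (raw - cohort.mean()) / cohort.std()

def t_norm(raw, test_utterance, impostor_models, score):
    """T-norm: normalise by impostor models' scores against the test utterance."""
    cohort = np.array([score(m, test_utterance) for m in impostor_models])
    return (raw - cohort.mean()) / cohort.std()
```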
| 0:17:16 | Now, typically, score normalisation cohorts should match the evaluation conditions. In the context of the SVMs, we want to know how important it is to match these conditions, and how much score normalisation actually benefits us when we have limited speech. |
|---|
| 0:17:34 | On this table here we've got the full-short condition on the second row, and the short-short condition down the bottom; we're looking at the ten-second conditions in particular. |
|---|
| 0:17:44 | We have three different cohort selection methods: none, in which none of the scores are normalised; full, which means that both the Z-norm and T-norm cohorts are using two and a half minutes of speech; and then matched. |
|---|
| 0:17:57 | In the case of the full ten-second condition, matched simply means that the Z-norm data were truncated to that duration, whereas in the ten-second ten-second case, both sets were truncated. |
|---|
| 0:18:12 | It's quite obvious that the full cohorts are going to give us the worst performance, as we can see, and that the matched cohorts offer the best, so that's quite elementary. |
|---|
| 0:18:24 | But the interesting observation here is that the relative performance gain from applying score normalisation seems quite minimal. So the question is: at what point is it worth going about choosing good score normalisation sets to try and help performance? |
|---|
| 0:18:45 | So to try and help answer that question, we looked at the relative gain in min DCF that score normalisation provides as we reduce the duration of speech. |
|---|
| 0:18:56 | We see that at the full and eighty-second durations we're getting around a ten percent gain, which is quite reasonable, but in the lower durations of speech, five and ten seconds, we've got less than two percent relative gain. |
|---|
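The relative gain quoted here is presumably the usual measure (an assumption on our part, since the talk doesn't spell out the formula):

```latex
\text{relative gain (\%)} = 100 \times
  \frac{\mathrm{minDCF}_{\text{no norm}} - \mathrm{minDCF}_{\text{norm}}}
       {\mathrm{minDCF}_{\text{no norm}}}
```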
| 0:19:07 | Is that really worth the effort of trying to choose a good normalisation set, and the risk that the normalisation set might not actually be chosen well and might reduce performance? That's an open question right now. |
|---|
| 0:19:22 | In conclusion, we've investigated the sensitivity of the popular SVM system to reduced training and testing segments. We found the best performance came from selecting a background dataset that matched the test or the shortest duration, depending on whether you want to optimise the DCF or the equal error rate. |
|---|
| 0:19:40 | NAP transforms trained on data matching the shortest duration gave the best performance, and score normalisation cohorts matched to the evaluation conditions were also the best. |
|---|
| 0:19:51 | We highlighted an issue with NAP when dealing with limited speech, and this is due to session variability becoming more isotropic as the speech duration was reduced. And score normalisation provided little value in the shorter conditions. |
|---|
| 0:20:08 | Thank you. |
|---|
| 0:20:17 | Thank you for a systematic investigation into the effects of duration. As far as I can see, Patrick gave a presentation this morning, which I'm not sure you were able to see, and I think Patrick's observations this morning give a nice explanation of what you see. |
|---|
| 0:20:50 | The short explanation: if you're using relevance MAP, then you're introducing speaker-dependent within-speaker variability; that's what Patrick referred to this morning. |
|---|
| 0:21:11 | So, would you agree with me that that explains, or perhaps explains, what you see? |
|---|
| 0:21:24 | I'll have to look further at that presentation, to be honest, but it sounds like a reasonable one. |
|---|
| 0:21:29 | So, any other questions? |
|---|
| 0:21:37 | A question about how you obtained your U matrix for the NAP: do you do relevance MAP, and then maybe a PCA on that information? |
|---|
| 0:21:49 | Sorry, say that again? I didn't quite hear. |
|---|
| 0:21:52 | My question is regarding how you learn the U matrix that's used to project away the nuisance directions. So you're doing relevance MAP, and then on that you're computing a PCA on your centred supervectors, or...? |
|---|
| 0:22:13 | I know that to estimate the U matrix we do some kind of PCA, going to a lower dimensionality for computational reasons, but then we go back to the original space, so that would be right. |
|---|
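For readers following this exchange, a minimal sketch of a common recipe for estimating the NAP transform: take the top eigenvectors of the within-speaker scatter of background supervectors (an SVD of speaker-mean-centred vectors). Note that it weights all supervector dimensions equally, which is the questioner's point about ignoring occupation counts:

```python
import numpy as np

def train_nap_transform(vectors, speakers, k):
    """Estimate U as the top-k eigenvectors of within-speaker scatter.

    vectors: (N, D) background supervectors; speakers: length-N labels.
    Returns U of shape (D, k) with orthonormal columns.
    """
    vectors = np.asarray(vectors, dtype=float)
    centred = vectors.copy()
    for spk in set(speakers):
        idx = np.array([s == spk for s in speakers])
        centred[idx] -= vectors[idx].mean(axis=0)   # remove each speaker's mean
    # right singular vectors of the centred matrix = eigenvectors of the scatter
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return vt[:k].T
```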
| 0:22:24 | Right, but my question is about what happens when you learn that U matrix. If you're just doing a regular PCA, which computes a low-dimensional approximation of your pooled supervectors, I mean a low-rank approximation of that matrix, which is basically what PCA is, then you're not taking into account the counts. |
|---|
| 0:22:48 | Whereas when you do factor analysis, you use the counts somehow to weight the amounts of information in different parts of the supervector. |
|---|
| 0:23:04 | So my question is mostly: are you somehow incorporating the information that, when you have a lot of Gaussians and very few data points, not all the Gaussians get assigned points? When you then train your subspace, your subspace does not know that, so maybe that accounts for a lot of these observations you're seeing. |
|---|
| 0:23:28 | That's an interesting point, actually. I don't believe we're explicitly taking into account the fact that some Gaussians might miss out on occupation, and yeah, I can understand what you're saying, that it might have an effect on the results, but we haven't looked into it. |
|---|
| 0:24:01 | I'm a little unsure about the... So, I mean, I'm all for empirical studies, because you want to see what works best, but you also want to understand why it works best. |
|---|
| 0:24:16 | What you said, from a high-level standpoint, was: you're doing this process of MAP to get Gaussians, and then you're comparing the means of some training Gaussians you got with MAP with some test Gaussians you got with MAP using the UBM, and if it's not the same amount of data, things go wrong, basically. |
|---|
| 0:24:40 | And so the solution you're applying is that you simply make it the same length. It would seem like... well, you did that study without normalisation. |
|---|
| 0:24:52 | Okay, so of course, all kinds of normalisation are there, as you said, to deal with differences, like duration differences. I'm wondering whether, by doing it without normalisation, you were sort of making the worst possible condition, one that might otherwise be fixed, and your solution ended up being to discard data. |
|---|
| 0:25:15 | So the first question, I guess, is: when you truncated the training samples, did you literally just discard the rest of the data, or did you create additional short training utterances out of those? |
|---|
| 0:25:26 | We would discard that data. |
|---|
| 0:25:28 | Okay. So one obvious thing is, if you take a thirty-second utterance and truncate it to ten seconds, it would be wasteful not to use the other twenty seconds as two more ten-second utterances. But besides that observation, I'm worried that if you had used normalisation, you might have fixed the problem. |
|---|
| 0:25:50 | To begin with, we did actually run these evaluations with score normalisation, but we found basically similar trends. We wanted to try and get back to a very basic system, just to help, I guess you'd say, the reader's understanding and the flow of the analysis. |
|---|
| 0:26:06 | I'm hearing in many papers, especially today, a strong desire on everyone's part to find a way to do things without normalisation, as if somehow normalisation were a bad thing. |
|---|
| 0:26:20 | When it seems to me that normalisation is almost, beyond the obvious thing that you have to model the speech, the only other thing, in a very high-level sense. After all, we're doing some kind of hypothesis test, verification, and that inherently requires knowing how to set a threshold, which requires some kind of normalisation. |
|---|
| 0:26:47 | To the extent that we try to get away from that, we're tying our hands behind our backs. I mean, it's good to look for methods that are inherently better, but I guess I would say we should still do normalisation; it can never hurt, done properly. |
|---|
| 0:27:18 | Well, my claim was that it's good to look for better models; I just don't understand the desire to do away with normalisation. It seems like normalisation is at the crux of the problem, and ultimately it can fix whatever else you do wrong, and it can never hurt. |
|---|
| 0:27:53 | Yes, normalisation does exactly that. What we are unhappy with is that we did do something wrong, so we're trying to do that a bit better, and then, if we find it's still not perfect, I'm sure we will keep normalising. |
|---|
| 0:28:14 | So the other way to look at it is that normalisation is just another modelling stage: extracting the MFCC features is modelling the acoustic signal; then GMMs are modelling the MFCCs; i-vectors, again from this morning, are modelling the GMM supervectors; and then at the end there's a score modelling stage. |
|---|
| 0:28:42 | So in the end you're just stacking more modelling stages. It might be nice just to reduce the number of those stages, but probably we might just go on normalising forever. |
|---|
| 0:28:59 | Can we have the next speaker, please? |
|---|