0:00:17 | would you just |

0:00:19 | bear with me for a couple of minutes to set out some background and then i |

0:00:23 | will try to explain |

0:00:25 | in some detail what the technical problem is that we're trying to solve |

0:00:30 | so for the jfa model here i formulated that in terms of gmm mean vectors |

0:00:35 | in other words supervectors |

0:00:38 | the first term m is the mean vector that comes from the universal background model |

0:00:47 | the second term involves a hidden variable x |

0:00:50 | which is independent of the channel |

0:00:53 | excuse me independent of the mixture component and intended to model the channel effects across |

0:00:58 | recordings |

0:01:01 | and the third term in that formulation involves a local hidden variable |

0:01:06 | to characterize the |

0:01:09 | speaker phrase variability within a particular mixture component |
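The model just described can be sketched in code. This is a toy illustration: the dimensions and all numerical values are made up for the example, not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

C, F, R = 8, 4, 2                   # toy sizes: mixture components, feature dim, channel factors
m = rng.normal(size=C * F)          # first term: UBM mean supervector
U = rng.normal(size=(C * F, R))     # second term: low-rank channel loading matrix
d = np.abs(rng.normal(size=C * F))  # third term: diagonal of the matrix D

x = rng.normal(size=R)              # channel factors, shared across mixture components
z = rng.normal(size=C * F)          # z vector: high dimensional, one entry per supervector dimension

# speaker- and channel-adapted mean supervector: M = m + U x + D z
M = m + U @ x + d * z
```

The factorial prior discussed in the talk corresponds to the entries of z being statistically independent across mixture components.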

0:01:16 | so |

0:01:18 | the typical approach would be to estimate the factor loading matrix u |

0:01:23 | using the maximum likelihood criterion which is exactly the criterion that is used to |

0:01:31 | train an i-vector extractor |

0:01:35 | in practice |

0:01:37 | rather than use maximum likelihood you usually end up using relevance map as |

0:01:42 | an empirical estimate of the matrix d |

0:01:47 | the relation between the two you can find explained in the paper by Robbie Vogt |

0:01:53 | going back to two thousand and eight |

0:01:56 | the point i want to stress here is that the z vector is high dimensional we're not |

0:02:03 | trying to explain the |

0:02:05 | speaker phrase variability by a low dimensional vector of hidden variables |

0:02:12 | it's a factorial prior in the sense that the |

0:02:17 | explanations for the different mixture components are statistically independent |

0:02:22 | which really is a weakness |

0:02:24 | we're not actually in a position with a prior like this |

0:02:27 | to exploit the correlations between mixture components |

0:02:37 | so to do calculations with this type of model the standard method is an algorithm |

0:02:43 | by Robbie Vogt which alternates between updating the two |

0:02:48 | hidden variables x and z |

0:02:51 | it wasn't presented this way but it's actually a variational bayes algorithm which means |

0:02:55 | that it comes with variational lower bounds that you can use to do |

0:02:59 | likelihood or evidence calculations |
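Under strong simplifying assumptions (unit covariances, fixed frame alignments, toy dimensions chosen for the example) the alternating updates can be sketched as mean-field coordinate updates on the posterior means of x and z. This illustrates the idea, not Vogt's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
C, F, R = 8, 4, 2
dim = C * F
m = rng.normal(size=dim)             # UBM mean supervector
U = rng.normal(size=(dim, R)) * 0.1  # channel loading matrix
d = np.full(dim, 0.2)                # diagonal of D

# Baum-Welch statistics for one recording (toy values)
N = rng.integers(5, 20, size=C).astype(float)   # zeroth-order stats per component
n = np.repeat(N, F)                              # expanded to supervector dimensions
f = n * m + rng.normal(size=dim) * np.sqrt(n)    # first-order stats

mu_x = np.zeros(R)
mu_z = np.zeros(dim)
for _ in range(50):  # alternate the two variational posterior updates
    # update q(x): Gaussian with precision I + U' diag(n) U
    Lx = np.eye(R) + U.T @ (n[:, None] * U)
    mu_x = np.linalg.solve(Lx, U.T @ (f - n * (m + d * mu_z)))
    # update q(z): fully factorized across dimensions (the factorial prior)
    prec_z = 1.0 + n * d**2
    mu_z = d * (f - n * (m + U @ mu_x)) / prec_z
```

Because each step exactly maximizes a concave quadratic in one block of variables, the alternation converges to the joint posterior mean.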

0:03:03 | that means that you can for example |

0:03:06 | formulate the |

0:03:08 | speaker recognition problem in exactly the same way as it's done in plda as |

0:03:16 | a bayesian model selection |

0:03:19 | problem the question is whether |

0:03:21 | if you're given enrollment utterances and test utterances and you want to |

0:03:25 | account for that ensemble of data |

0:03:29 | whether you are better off |

0:03:32 | positing a single z vector |

0:03:34 | or two vectors one for the enrollment data one for the test |
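A one-dimensional toy version of that model selection question (all numbers invented for the example): with Gaussian priors and Gaussian noise, both hypotheses have closed-form evidences, and their log ratio is the verification score.

```python
import numpy as np

def log_gauss(y, cov):
    """Log density of a zero-mean Gaussian with covariance cov at y."""
    k = len(y)
    sign, logdet = np.linalg.slogdet(cov)
    return -0.5 * (k * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(cov, y))

tau2, sig2 = 4.0, 1.0      # prior variance of the hidden variable, observation noise
y = np.array([2.1, 1.9])   # enrollment and test statistics (one-dimensional toy)

cov_same = tau2 * np.ones((2, 2)) + sig2 * np.eye(2)   # one shared hidden variable
cov_diff = (tau2 + sig2) * np.eye(2)                   # two independent hidden variables
llr = log_gauss(y, cov_same) - log_gauss(y, cov_diff)  # positive favours same speaker
```

For these two similar statistics the shared-variable hypothesis wins; moving the two values apart flips the sign.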

0:03:41 | there is something |

0:03:42 | basically unsatisfactory about this namely |

0:03:46 | it doesn't take account of the fact that |

0:03:50 | jfa is a model for how |

0:03:54 | the ubm moves under speaker and channel effects |

0:03:58 | traditionally |

0:03:59 | when we do these calculations we use the universal background model to collect Baum-Welch statistics |

0:04:06 | and ignore the fact |

0:04:07 | that according to our model |

0:04:10 | the ubm actually shifts as a result of these hidden variables |

0:04:17 | and |

0:04:19 | there is an important paper by Zhao and Dong that attempts to remedy this and i was |

0:04:24 | particularly interested in looking into this for the reason that i mentioned at the beginning |

0:04:29 | i believe that |

0:04:31 | the ubm does have to be adapted |

0:04:34 | in text dependent speaker recognition |

0:04:38 | and |

0:04:38 | this is a principled way of doing that it introduces an extra set of |

0:04:43 | hidden variables |

0:04:45 | indicators which |

0:04:47 | show how the frames are aligned with mixture components |

0:04:52 | and |

0:04:53 | that can be interleaved into the |

0:04:58 | variational bayes updates in Vogt's algorithm |

0:05:02 | so that you get |

0:05:04 | a coherent framework for handling that adaptation |

0:05:09 | problem |
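The indicator variables amount to recomputing frame-to-component alignments against adapted means instead of the fixed ubm means, and that alignment step can then be interleaved with the hidden-variable updates. A toy sketch with unit-variance Gaussians and invented data:

```python
import numpy as np

def responsibilities(frames, means, weights):
    """Indicator posteriors: frame-to-component alignment under unit-variance Gaussians."""
    d2 = ((frames[:, None, :] - means[None, :, :]) ** 2).sum(-1)
    logp = np.log(weights)[None, :] - 0.5 * d2
    logp -= logp.max(axis=1, keepdims=True)   # stabilize before exponentiating
    g = np.exp(logp)
    return g / g.sum(axis=1, keepdims=True)

rng = np.random.default_rng(2)
C, F = 4, 2
ubm_means = rng.normal(size=(C, F)) * 3.0
weights = np.full(C, 1.0 / C)
frames = ubm_means[rng.integers(0, C, 50)] + rng.normal(size=(50, F))

# alignment against the fixed UBM, as in the plain algorithm
gamma_ubm = responsibilities(frames, ubm_means, weights)

# interleaved step: realign against means shifted by the current hidden-variable
# estimates (here an invented offset, standing in for U x + D z)
offset = rng.normal(size=(C, F)) * 0.1
gamma_adapted = responsibilities(frames, ubm_means + offset, weights)
```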

0:05:11 | there's just |

0:05:14 | there's just one caveat |

0:05:16 | that i think is worth pointing out about this algorithm |

0:05:21 | it requires that you take account of all of the hidden variables when you're doing |

0:05:26 | ubm adaptation and the evidence calculations |

0:05:30 | now of course that's what you should do if the model is to be believed |

0:05:35 | if you take the model at face value |

0:05:37 | we should take account of all of the hidden variables |

0:05:40 | however what's going on here is that this factorial prior is actually so weak |

0:05:48 | that |

0:05:50 | doing things by the book |

0:05:52 | does lead you into problems |

0:05:54 | so that's why i have flagged this here as a kind of caveat |

0:06:00 | so in the paper i presented results on the RSR2015 data using |

0:06:05 | three types of classifier |

0:06:08 | that come out of these calculations |

0:06:13 | the first one there is simply to use the z vectors that come either from |

0:06:19 | Vogt's calculation or from the Zhao and Dong calculation |

0:06:23 | as features which |

0:06:26 | if extracted properly should be purged of channel effects |

0:06:31 | okay and then just feeding those into a simple backend like the cosine distance classifier |
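That backend is just a normalized dot product; a minimal sketch:

```python
import numpy as np

def cosine_score(z_enroll, z_test):
    """Cosine-distance backend: scale-invariant similarity of two z vectors."""
    z1 = z_enroll / np.linalg.norm(z_enroll)
    z2 = z_test / np.linalg.norm(z_test)
    return float(z1 @ z2)
```

Since the score is scale invariant, only the direction of the z vectors matters; higher scores favour the same-speaker hypothesis.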

0:06:37 | a jfa as it was originally construed |

0:06:42 | it was intended not only to be a feature extractor but also to act as a classifier |

0:06:47 | that's the tradition |

0:06:50 | it |

0:06:52 | however |

0:06:54 | in order to understand |

0:06:56 | this problem of ubm adaptation |

0:07:01 | it's necessary also to look into what's going on |

0:07:05 | with those bayesian model selection algorithms |

0:07:09 | okay when you |

0:07:10 | what happens when you apply it |

0:07:12 | without ubm adaptation in |

0:07:16 | Vogt's algorithm |

0:07:18 | or with ubm adaptation in Zhao and Dong's algorithm |

0:07:22 | and also to compare it with |

0:07:26 | the likelihood ratio calculation |

0:07:30 | which |

0:07:32 | which was traditional around two thousand and eight |

0:07:36 | it turns out that |

0:07:38 | when you look into these questions there's a whole bunch of anomalies that |

0:07:43 | that arise |

0:07:45 | the |

0:07:47 | ubm adaptation call if you're using jfa as a feature extractor ubm adaptation hz point |

0:07:55 | five point |

0:07:56 | okay |

0:07:57 | this is true |

0:07:58 | for these sent vectors that's not true for i-vectors is not true for speaker factors |

0:08:03 | it behaves a reasonably but not present factors that's and |

0:08:09 | this year's icassp |

0:08:11 | paper |

0:08:13 | on the other hand |

0:08:14 | if you look at the problem of maximum likelihood estimation |

0:08:18 | of all the jfa model parameters |

0:08:23 | what you find is that it doesn't work at all |

0:08:26 | without ubm adaptation you do need |

0:08:29 | ubm adaptation in order to get that to behave |

0:08:32 | sensibly |

0:08:34 | if you look at bayesian model selection you find that there are some |

0:08:39 | cases |

0:08:39 | where Zhao and Dong's algorithm |

0:08:42 | works better than Vogt's |

0:08:44 | and other cases where exactly the opposite happens |

0:08:49 | the traditional jfa likelihood ratio is actually very simplistic it just uses plug in estimates |

0:08:56 | rather than attempt to integrate over hidden variables and no ubm adaptation at all |

0:09:02 | and |

0:09:03 | what i will show in this paper is that it can be made to work |

0:09:07 | very well |

0:09:08 | with very careful |

0:09:10 | ubm adaptation |

0:09:12 | okay so this business of ubm adaptation turns out to be very tricky |

0:09:16 | and |

0:09:17 | anyone who has been around in this field long enough has probably |

0:09:23 | been bitten by this problem at some stage |

0:09:28 | certainly in my own experience |

0:09:31 | i couldn't get jfa working at all |

0:09:34 | until i stopped doing the ubm adaptation |

0:09:38 | but it doesn't really make a lot of sense because if you look at the history |

0:09:41 | of subspace methods eigenvoices eigen channels |

0:09:45 | they were originally implemented with ubm adaptation |

0:09:49 | if you speak to |

0:09:51 | guys in speech recognition they will be surprised |

0:09:55 | if you tell them that you're not doing ubm adaptation |

0:09:57 | it is essential for instance in |

0:10:01 | subspace gaussian mixture models |

0:10:06 | okay |

0:10:08 | so here's an example these are just some examples of the anomalous results that |

0:10:15 | arise |

0:10:15 | okay these are the |

0:10:17 | a bayesian model selection results |

0:10:21 | on the left hand side |

0:10:24 | is with five hundred and twelve |

0:10:27 | gaussians in the ubm |

0:10:30 | on the right hand side with sixty four |

0:10:34 | in the case of the small ubm |

0:10:36 | Zhao and Dong's algorithm |

0:10:39 | gives you a small improvement |

0:10:41 | that doesn't help with five hundred and twelve gaussians |

0:10:47 | here's |

0:10:48 | the results in the third line the first two lines are the same as in |

0:10:51 | the last slide the |

0:10:53 | third line there is the traditional jfa likelihood ratio |

0:10:57 | evaluated in the model selection style with or without |

0:11:04 | ubm adaptation |

0:11:04 | so this then is what the paper is about what i want |

0:11:11 | to show is that |

0:11:14 | if you start with the traditional jfa likelihood ratio |

0:11:19 | maybe just recall briefly |

0:11:21 | how that goes |

0:11:23 | you have a numerator and denominator |

0:11:26 | in the numerator |

0:11:28 | okay you plug in |

0:11:30 | the target speaker's |

0:11:33 | supervector and you use that to center the baum-welch statistics and you integrate over the |

0:11:38 | channel factors |

0:11:40 | in the |

0:11:43 | in the denominator you plug in |

0:11:46 | the ubm supervector and you do exactly the same |

0:11:50 | calculation and you compare |

0:11:52 | those two those two probabilities |

0:11:55 | no ubm adaptation going on at all and a plug in estimate |

0:11:59 | which is not serious in the numerator but in the denominator it really is problematic |

0:12:05 | because |

0:12:06 | theory says you should be integrating over the entire speaker population |

0:12:11 | rather than plugging in the |

0:12:14 | mean value that is the ubm supervector |
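A toy version of that plug-in likelihood ratio (unit covariances, invented dimensions and values): the channel factors are integrated out in closed form, and the only difference between numerator and denominator is which supervector gets plugged in.

```python
import numpy as np

def log_marginal(f, n, s, U):
    """log p(f) with channel factors x ~ N(0, I) integrated out.
    Unit covariances assumed, so f | x ~ N(n * (s + U x), diag(n))."""
    A = n[:, None] * U                 # diag(n) @ U
    cov = np.diag(n) + A @ A.T         # marginal covariance of the first-order stats
    r = f - n * s                      # center on the plugged-in supervector
    sign, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(f) * np.log(2 * np.pi) + logdet + r @ np.linalg.solve(cov, r))

rng = np.random.default_rng(3)
dim, R = 8, 2
m_ubm = rng.normal(size=dim)           # UBM supervector (denominator plug-in)
s_spk = m_ubm + rng.normal(size=dim)   # plug-in estimate of the target speaker (numerator)
U = rng.normal(size=(dim, R)) * 0.3
n = np.full(dim, 10.0)                 # occupancies
f = n * s_spk + rng.normal(size=dim)   # first-order statistics from a target trial

llr = log_marginal(f, n, s_spk, U) - log_marginal(f, n, m_ubm, U)
```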

0:12:20 | so |

0:12:22 | what i will show is that if you do the adaptation very carefully |

0:12:28 | adapt the |

0:12:30 | ubm to some of the hidden variables but not all of them |

0:12:34 | then everything will work properly |

0:12:39 | this is as long as you are |

0:12:41 | using jfa as a classifier and calculating likelihood ratios |

0:12:48 | however |

0:12:49 | if you're using it as a feature extractor and this turns out to give the |

0:12:53 | best results |

0:12:55 | it turns out that you're better off |

0:12:57 | avoiding ubm adaptation altogether |

0:13:00 | i'll give you an explanation for this |

0:13:03 | it has to do with the fact that the factorial prior is too weak this phenomenon |

0:13:08 | is related to factorial priors not just subspace priors |

0:13:16 | okay |

0:13:17 | well really for this problem the first type of adaptation that you want to consider |

0:13:23 | is the lexical mismatch between your |

0:13:28 | enrollment and test utterances on the one hand |

0:13:31 | and the ubm that might have been trained |

0:13:33 | on some other |

0:13:36 | sort of data |

0:13:38 | in the jfa likelihood ratio you're actually comparing the target speaker to the |

0:13:45 | ubm speaker |

0:13:47 | but if you consider what's going on here if you have |

0:13:50 | known lexical content in the trial |

0:13:53 | that is the thing which will most determine what the data looks like |

0:14:00 | not the ubm you would be much better off |

0:14:03 | comparing to a phrase adapted |

0:14:05 | background model than |

0:14:08 | to the universal background model |

0:14:10 | if you simply adapt the ubm |

0:14:13 | to the lexical content of the phrase that is used in a particular trial |

0:14:18 | that will lead to a substantial improvement |

0:14:22 | in performance |
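The phrase adaptation here is relevance MAP of the GMM means from Baum-Welch statistics; a minimal sketch (the relevance factor r = 16 is a conventional default, not a value from the talk):

```python
import numpy as np

def relevance_map_means(ubm_means, N, F, r=16.0):
    """Relevance MAP update of GMM means.
    N: per-component occupancies; F: per-component first-order sums."""
    alpha = N / (N + r)                           # interpolation weight grows with the data count
    ml_means = F / np.maximum(N, 1e-10)[:, None]  # maximum likelihood means (guard empty components)
    return alpha[:, None] * ml_means + (1 - alpha)[:, None] * ubm_means
```

Components with no data stay at the ubm means; well-observed components move toward the data. Realigning frames, recomputing statistics, and re-adapting gives the multiple MAP iterations mentioned later in the talk.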

0:14:24 | so what's going on here is that |

0:14:26 | in the |

0:14:28 | RSR2015 data |

0:14:34 | there are thirty different phrases |

0:14:38 | okay the mean supervector of jfa is adapted to each of the phrases |

0:14:43 | but all of the other parameters are shared across phrases |

0:14:51 | if you adapt to the |

0:14:55 | channel effects in the test data |

0:14:57 | this will work fine |

0:14:59 | okay |

0:15:01 | these remarks refer to the sort of early history of eigenvoice |

0:15:07 | and eigenchannel modeling |

0:15:09 | there are two alternative ways of going about that you can combine the two together |

0:15:14 | and you will get a slight improvement there's |

0:15:17 | there's no problem there |

0:15:18 | if you |

0:15:21 | if you adapt |

0:15:22 | to the speaker effects in the enrollment data it will work fine |

0:15:26 | okay so what i mean here is that you |

0:15:29 | collect the Baum-Welch statistics from the test utterance with |

0:15:34 | a gmm that has been |

0:15:37 | adapted to the target speaker |

0:15:40 | you get an improvement |

0:15:42 | if you |

0:15:45 | perform multiple |

0:15:46 | iterations of map to adapt to the |

0:15:51 | lexical content things work even better |

0:15:53 | so at this stage if you look through those lines you see that |

0:15:57 | we've already got forty percent improvement in error rates |

0:16:03 | just through doing |

0:16:06 | ubm adaptation carefully |

0:16:11 | this slide unfortunately we're going to have to skip because of the time constraints |

0:16:17 | it's interesting but i just don't have time to deal with it |

0:16:25 | here are results with a five hundred and twelve gaussians |

0:16:31 | it turns out that doing careful adaptation with the ubm and sixty four gaussians |

0:16:36 | can achieve about the same performance as |

0:16:41 | working with five hundred and twelve gaussians and no adaptation |

0:16:46 | if you try adaptation with five twelve gaussians |

0:16:50 | things will not behave so well this is a rather extreme case where you have |

0:16:55 | many more gaussians than you actually have frames in your test utterances |

0:17:01 | and the remaining two lines present results that are obtained |

0:17:07 | with z vectors as features rather than |

0:17:15 | likelihood ratio computations |

0:17:17 | the difference between the two is that nap is used in one case but |

0:17:21 | not the other |

0:17:22 | the |

0:17:24 | point there is that you don't need nap |

0:17:27 | okay because you've already suppressed |

0:17:30 | the channel effects |

0:17:32 | in extracting the z vectors |

0:17:38 | and these then are results on the full test set |

0:17:42 | the full RSR2015 test set |

0:17:44 | just to compare |

0:17:48 | the |

0:17:49 | z vector classifier |

0:17:51 | using Vogt's algorithm that's to say no ubm adaptation |

0:17:55 | and Zhao and Dong's algorithm with ubm adaptation |

0:18:00 | and you can see that you're better off using Vogt's algorithm i'll explain that in |

0:18:05 | a minute it will only take a second |

0:18:08 | okay so these are the |

0:18:10 | these are the conclusions |

0:18:13 | you can adapt to everything in sight and it will work |

0:18:17 | but there's one thing you should not do |

0:18:19 | and that is adapt to the speaker effects in the test utterance |

0:18:25 | the |

0:18:28 | the reason for that is actually |

0:18:31 | this i believe is what's going on |

0:18:33 | the factorial prior is extremely weak if you have a single test utterance |

0:18:39 | okay and you're doing ubm adaptation |

0:18:43 | then you're allowing |

0:18:45 | the |

0:18:46 | different mean vectors in the gmm |

0:18:49 | to be displaced in statistically independent ways that gives you an awful lot of freedom |

0:18:54 | to align |

0:18:56 | the data with the gaussians too much freedom |

0:19:01 | see what happens if you |

0:19:05 | if you had multiple enrollment utterances which is normally the case in text dependent speaker |

0:19:11 | recognition |

0:19:12 | you still have a very weak prior |

0:19:14 | but you have a strong extra constraint |

0:19:18 | if you go across the enrollment utterances the gaussians cannot move in statistically independent |

0:19:24 | ways they have to move in lockstep |

0:19:27 | okay and that means that the |

0:19:29 | adaptation algorithm will behave sensibly |
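That lockstep effect is visible in a toy z update: because a single z is shared across the enrollment utterances, its posterior pools the statistics from all of them, so the per-component shifts are jointly constrained (unit covariances and invented values again):

```python
import numpy as np

rng = np.random.default_rng(4)
dim = 8
m = rng.normal(size=dim)   # UBM mean supervector (toy)
d = np.full(dim, 0.5)      # diagonal of D

# Baum-Welch statistics from three enrollment utterances of one speaker (true z = 1)
n_utts = [np.full(dim, 10.0) for _ in range(3)]
f_utts = [n * (m + d * 1.0) + rng.normal(size=dim) for n in n_utts]

# the shared z pools the statistics over all enrollment utterances
n_tot = sum(n_utts)
f_tot = sum(f_utts)
prec = 1.0 + n_tot * d**2              # sharper posterior than any single utterance gives
mu_z = d * (f_tot - n_tot * m) / prec  # posterior mean, shrunk toward zero by the prior
```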

0:19:34 | if you do |

0:19:37 | adaptation to the channel effects in the test utterance then things will behave sensibly |

0:19:42 | and the reason for that |

0:19:45 | is because of the subspace prior |

0:19:47 | channel effects are assumed to be confined to a low dimensional subspace |

0:19:51 | that imposes a strong constraint |

0:19:54 | on the way the |

0:19:57 | the gaussians can move |

0:20:02 | so final slide the |

0:20:07 | if you're using jfa as a feature extractor |

0:20:11 | which is my recommendation |

0:20:14 | then the upshot of all this |

0:20:17 | is that |

0:20:17 | in the case of the test utterance when you extract the feature vector you cannot |

0:20:23 | use ubm adaptation |

0:20:25 | and if you cannot use it |

0:20:27 | in extracting a feature from the test utterance you cannot use it in extracting |

0:20:32 | a feature from the enrollment utterance either or otherwise the features |

0:20:38 | would not be comparable |

0:20:40 | okay so in other words you have to use Vogt's algorithm |

0:20:43 | rather than Zhao and Dong's |

0:20:48 | adaptation of the ubm to the lexical content still works very well there's a fifty |

0:20:54 | percent error rate reduction compared with the |

0:20:58 | icassp paper |

0:21:02 | there's a follow on paper at interspeech which shows how this idea of adaptation to |

0:21:08 | phrases can be extended to give a simple |

0:21:14 | procedure for domain adaptation |

0:21:17 | so you can train |

0:21:18 | jfa |

0:21:19 | on say text-independent data and use it on say a text-dependent |

0:21:26 | task domain |

0:21:28 | and finally these z vectors at least on the RSR2015 data |

0:21:33 | are very good features there is no residual |

0:21:39 | channel variability left to model in the back end |

0:21:43 | okay thank you |