0:00:06 | the title of my talk is

0:00:08 | bayesian speaker verification with

0:00:10 | heavy-tailed priors

0:00:30 | in a nutshell, what this talk is about is

0:00:34 | applying joint factor analysis with

0:00:37 | i-vectors

0:00:38 | as

0:00:39 | features

0:00:41 | so i'll be assuming that you have some familiarity with

0:00:45 | joint factor analysis |

0:00:47 | i-vectors

0:00:49 | and |

0:00:49 | cosine distance

0:00:51 | scoring, all right

0:00:54 | the key fact

0:00:55 | about i-vectors is that they provide a representation of speech segments of

0:01:00 | arbitrary durations by

0:01:03 | vectors of

0:01:05 | fixed dimension

0:01:08 | these vectors seem to contain most of the information needed to distinguish between speakers

0:01:15 | and as a bonus they are of relatively low dimension |

0:01:20 | typically four hundred rather than |

0:01:23 | a hundred thousand |

0:01:24 | as in the case of gmm supervectors

0:01:29 | this means that

0:01:31 | it's

0:01:31 | possible to

0:01:33 | apply

0:01:34 | modern

0:01:36 | bayesian methods of pattern recognition

0:01:39 | to the speaker recognition problem

0:01:41 | we've banished |

0:01:42 | the |

0:01:43 | time dimension altogether |

0:01:45 | and we're in a situation which is quite analogous to |

0:01:48 | other |

0:01:49 | pattern recognition problems

0:01:57 | i think i should

0:01:58 | at the outset explain what i mean by "bayesian"

0:02:01 | because it's open to several interpretations |

0:02:06 | what i intend is that

0:02:08 | in my mind

0:02:10 | the terms bayesian

0:02:11 | and probabilistic

0:02:13 | are synonymous with each other

0:02:16 | the idea is |

0:02:17 | to

0:02:18 | as far as possible |

0:02:21 | do everything within the framework of the calculus of probability

0:02:27 | it doesn't |

0:02:29 | really matter whether you prefer |

0:02:31 | to interpret probabilities in frequentist terms

0:02:35 | or in bayesian terms

0:02:38 | the

0:02:39 | rules of probability are the same, and there are only two

0:02:42 | the sum rule

0:02:43 | and the product rule

0:02:45 | anyway, they give you the same results in both cases

0:02:51 | and the advantage of this is that you have a

0:02:53 | logically coherent way of

0:02:57 | reasoning in the face of uncertainty
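the two rules just mentioned can be made concrete with a tiny sketch. the joint distribution below is invented for illustration, as are the event names; it shows the sum rule (marginalisation) and the product rule (from which bayes' rule follows):

```python
# Invented discrete joint distribution p(h, d) over a hypothesis h
# ("same"/"diff" speaker) and a datum d (a score bucket).
p_joint = {
    ("same", "high_score"): 0.30,
    ("same", "low_score"): 0.10,
    ("diff", "high_score"): 0.15,
    ("diff", "low_score"): 0.45,
}

# Sum rule: marginalise out the hypothesis to get p(d).
p_d = {}
for (h, d), p in p_joint.items():
    p_d[d] = p_d.get(d, 0.0) + p

# Product rule: p(h, d) = p(h | d) p(d), rearranged to give p(h | d).
p_h_given_d = {(h, d): p / p_d[d] for (h, d), p in p_joint.items()}
```

every probabilistic manipulation in the rest of the talk (evidence integrals, posteriors) is, in principle, just repeated application of these two rules.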

0:03:01 | the disadvantage |

0:03:03 | is that in practice

0:03:04 | you usually |

0:03:06 | run into a computational brick wall in pretty short order |

0:03:11 | if you try to to follow these rules |

0:03:14 | consistently |

0:03:16 | so in fact |

0:03:17 | it's really only been in the past ten years |

0:03:21 | that

0:03:22 | this

0:03:22 | field of

0:03:24 | bayesian pattern recognition has really taken off

0:03:27 | and that's thanks to the

0:03:30 | introduction of

0:03:32 | fast

0:03:32 | approximate

0:03:34 | methods

0:03:35 | of

0:03:36 | bayesian inference

0:03:38 | in particular variational bayes

0:03:43 | uh which makes it possible to treat |

0:03:45 | probabilistic models which are |

0:03:47 | far more sophisticated

0:03:50 | than

0:03:50 | was possible in the case of

0:03:53 | traditional statistics

0:03:55 | so the unifying theme in my

0:03:57 | talk will be the application of variational bayes methods

0:04:00 | to the

0:04:02 | speaker recognition problem

0:04:06 | um |

0:04:07 | i start out with the |

0:04:09 | traditional assumptions in joint factor analysis that |

0:04:13 | speaker and channel effects |

0:04:15 | are

0:04:18 | statistically independent

0:04:20 | and

0:04:20 | gaussian distributed

0:04:23 | and in the first part of my talk

0:04:26 | i will simply aim to show

0:04:29 | how joint factor analysis

0:04:31 | can be done

0:04:32 | under these assumptions

0:04:34 | using i-vectors as

0:04:36 | features, in

0:04:38 | a bayesian way

0:04:42 | um |

0:04:42 | this already works very well |

0:04:44 | in my experience it gives better results than joint factor analysis

0:04:49 | uh the second part of my talk will be |

0:04:53 | concerned with how

0:04:54 | variational bayes

0:04:57 | can be used

0:04:58 | to

0:04:59 | model non-gaussian behaviour in the data

0:05:03 | i found that this

0:05:05 | leads to a substantial

0:05:07 | improvement in performance

0:05:10 | and as an added bonus it seems to be possible to do away with the need for

0:05:15 | score normalisation altogether

0:05:22 | the final part of my talk

0:05:25 | is concerned with the problem

0:05:27 | of

0:05:28 | how to

0:05:29 | integrate the assumptions of

0:05:32 | joint factor analysis and cosine distance scoring into a

0:05:35 | coherent framework

0:05:38 | um |

0:05:40 | on the face of it this looks like a hopeless exercise

0:05:43 | the assumptions appear to be completely different

0:05:47 | however

0:05:48 | it is possible to do something about this

0:05:51 | thanks to the flexibility

0:05:53 | provided by variational bayes. so even though this is speculative, i think it's worth

0:05:58 | talking about because it's a real object lesson in how powerful

0:06:03 | these bayesian methods are

0:06:05 | at least potentially

0:06:08 | um |

0:06:10 | before getting down to business let me just say

0:06:12 | something about the way i've organised this presentation

0:06:16 | in preparing the slides i tried to ensure that they were

0:06:20 | reasonably complete and self contained |

0:06:22 | the idea i have in mind is that

0:06:25 | if anyone was interested in reading through the slides afterwards |

0:06:28 | they should tell a fairly complete story |

0:06:31 | okay but |

0:06:32 | because of time constraints i'm going to have to gloss over

0:06:36 | uh |

0:06:37 | some |

0:06:37 | points in the oral presentation

0:06:41 | for the same reason i'm going to have

0:06:44 | in the slides

0:06:46 | to do some hand waving here and there

0:06:48 | um |

0:06:49 | i found

0:06:50 | that by focusing on the gaussian and

0:06:54 | statistical independence assumptions

0:06:57 | i could explain the variational bayes ideas

0:07:00 | with a minimal

0:07:02 | amount of technicalities, so i will spend almost half

0:07:07 | my

0:07:07 | time

0:07:08 | on the first part

0:07:09 | of the

0:07:10 | talk

0:07:11 | uh on the other hand the last part of the talk |

0:07:15 | is

0:07:16 | technical; it is addressed

0:07:18 | primarily

0:07:20 | to

0:07:20 | members of the audience who will have read

0:07:23 | say the chapter on variational bayes

0:07:26 | in bishop's book

0:07:30 | okay |

0:07:35 | okay so here are the

0:07:37 | basic assumptions of factor analysis with

0:07:40 | i-vectors

0:07:41 | as

0:07:42 | features

0:07:43 | um |

0:07:45 | we have used

0:07:46 | D for data, S for speaker, C for channel

0:07:49 | or |

0:07:49 | recording |

0:07:50 | okay we have a collection of recordings per speaker |

0:07:54 | um |

0:07:56 | we assume that the data can be decomposed

0:07:58 | into two statistically independent parts, a speaker part

0:08:01 | and a

0:08:02 | channel part

0:08:04 | these assumptions are questionable but i'm going to stick with them for the

0:08:09 | first part of the talk

0:08:15 | um |

0:08:16 | this

0:08:18 | model

0:08:19 | where we have replaced

0:08:21 | the hidden supervector

0:08:22 | by

0:08:23 | an observable i-vector, already has a name

0:08:26 | it's known in

0:08:28 | face recognition

0:08:30 | as |

0:08:30 | probabilistic

0:08:33 | linear discriminant

0:08:34 | analysis

0:08:36 | you might think of this as

0:08:37 | the true covariance model

0:08:41 | but the other variant is the one that you will find described

0:08:45 | in the literature

0:08:49 | it's not

0:08:50 | perhaps quite as straightforward as it appears

0:08:52 | because |

0:08:54 | if you're dealing with high dimensional features, for example

0:08:57 | mllr features |

0:08:59 | you can't treat these covariance matrices

0:09:01 | as being of full rank

0:09:04 | and you need a hidden variable

0:09:07 | representation of the model which is exactly

0:09:10 | analogous to the

0:09:13 | hidden variable description of |

0:09:15 | joint factor analysis |

0:09:19 | so here on the left hand side, D, that's an observable i-vector, not a

0:09:24 | a hidden supervector |

0:09:26 | it turns out to be convenient for the heavy-tailed stuff to refer to the

0:09:31 | eigenvoice matrix

0:09:32 | and the eigenchannel

0:09:34 | matrix using subscripts, U1 and U2

0:09:37 | rather than the traditional names V, U and D

0:09:41 | same thing for the

0:09:43 | hidden variables: the speaker factors are labelled x1

0:09:46 | the channel factors i label x2r

0:09:48 | where r indicates the dependence on the recording

0:09:53 | or the channel

0:09:55 | uh there's one difference here from the um |

0:09:59 | conventional formulation of joint factor analysis: in plda this residual term

0:10:05 | the epsilon |

0:10:08 | which |

0:10:09 | is in general modelled

0:10:11 | by a diagonal covariance or precision matrix

0:10:15 | and it's associated

0:10:16 | traditionally with the channel

0:10:18 | rather than with the speaker

0:10:20 | in jfa i formulated it slightly differently but

0:10:24 | i'm just going to follow

0:10:26 | this model

0:10:28 | in this presentation

0:10:30 | so because the residual epsilon is associated with the channel there are

0:10:35 | two

0:10:36 | noise terms

0:10:37 | there's the contribution of the eigenchannels

0:10:41 | which contributes

0:10:44 | this term to

0:10:47 | the channel variance

0:10:48 | and the contribution of the residual, with its precision matrix, that is to say the inverse

0:10:52 | of the covariance matrix; you can add the two because

0:10:56 | you have

0:10:57 | statistical independence
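the decomposition just described can be sketched as a generative model. this is a minimal sketch, not the talk's implementation; all dimensions and parameter values below are assumed for illustration:

```python
import numpy as np

# Sketch of the Gaussian PLDA generative model for i-vectors:
#   d_r = m + U1 @ x1 + U2 @ x2_r + eps_r
# x1: speaker factors (shared across a speaker's recordings),
# x2_r: channel factors (fresh for each recording),
# eps_r: residual with a diagonal precision matrix.
rng = np.random.default_rng(0)
D, P1, P2 = 400, 120, 50      # i-vector dim, # speaker / channel factors

m = rng.standard_normal(D)                 # global mean
U1 = rng.standard_normal((D, P1)) * 0.1    # eigenvoice matrix
U2 = rng.standard_normal((D, P2)) * 0.1    # eigenchannel matrix
prec_diag = np.full(D, 4.0)                # diagonal residual precision

def sample_speaker(n_recordings):
    """Draw n_recordings i-vectors for one synthetic speaker."""
    x1 = rng.standard_normal(P1)           # one set of speaker factors
    ivectors = []
    for _ in range(n_recordings):
        x2 = rng.standard_normal(P2)       # channel factors per recording
        eps = rng.standard_normal(D) / np.sqrt(prec_diag)
        ivectors.append(m + U1 @ x1 + U2 @ x2 + eps)
    return np.stack(ivectors)

ivecs = sample_speaker(3)
```

note how the statistical independence assumptions appear directly in the code: x1, x2 and eps are drawn independently, and only x1 is held fixed across a speaker's recordings.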

0:11:04 | this is the graphical model that goes

0:11:08 | with that equation

0:11:09 | if you're not familiar with these, let me just take a minute to explain how to read

0:11:14 | these diagrams

0:11:19 | a shaded node like that

0:11:22 | indicates an observable variable

0:11:25 | the black nodes

0:11:27 | indicate

0:11:28 | hidden variables

0:11:30 | the

0:11:31 | dot nodes

0:11:32 | indicate model parameters

0:11:35 | and

0:11:36 | the arrows

0:11:37 | indicate

0:11:38 | conditional dependency

0:11:40 | okay so the |

0:11:43 | the i-vector is assumed to depend on the speaker factors

0:11:46 | the channel factors |

0:11:47 | and the

0:11:48 | residual

0:11:50 | this plate notation indicates that something is

0:11:53 | replicated several

0:11:54 | times

0:11:55 | okay there are several sets of channel factors |

0:11:58 | one for each recording |

0:12:00 | but there's only one set of speaker factors |

0:12:02 | so that's

0:12:03 | outside

0:12:04 | of the plate

0:12:07 | here i've specified

0:12:09 | the parameter lambda

0:12:12 | but i didn't bother

0:12:13 | specifying the distribution

0:12:16 | of the speaker factors because it's understood

0:12:18 | to be

0:12:18 | standard normal

0:12:21 | um |

0:12:24 | so |

0:12:25 | as i mentioned, including the channel factors enables this decomposition here

0:12:31 | but it's not always necessary

0:12:33 | if you have i-vectors of dimension four hundred it's actually possible to model

0:12:39 | full rank |

0:12:40 | or

0:12:41 | rather, full

0:12:42 | precision matrices |

0:12:44 | instead of diagonal ones

0:12:46 | okay and in that case |

0:12:48 | this term doesn't actually contribute anything

0:12:51 | um |

0:12:52 | i have found it useful

0:12:54 | in experimental work to use this term |

0:12:56 | to estimate |

0:12:57 | eigenchannels on microphone data |

0:12:59 | so it's useful to keep it

0:13:02 | and in fact it turns out that these channel factors can always be eliminated at recognition time; that's a

0:13:07 | technical point, i'll come back to it later

0:13:09 | if i can

0:13:15 | okay so how do you do |

0:13:17 | speaker recognition with the plda model

0:13:19 | okay i'm gonna make some |

0:13:21 | provisional assumptions here |

0:13:23 | one is that you've already succeeded in estimating the model parameters |

0:13:27 | the eigenvoices, the eigenchannels, et cetera

0:13:30 | and the other is that you know how to evaluate

0:13:33 | this thing known as the evidence integral |

0:13:36 | okay you have a collection of ivectors associated with each speaker |

0:13:39 | you also have a collection of hidden variables |

0:13:42 | to evaluate the marginal likelihood you have to integrate over the hidden variables

0:13:48 | so |

0:13:49 | let's assume that

0:13:50 | we've tackled these two problems

0:13:53 | uh it turns out that the key to solving both problems in general |

0:13:58 | is to evaluate the posterior distribution of the hidden variables |

0:14:01 | and |

0:14:02 | i'll return

0:14:03 | to that in a minute

0:14:04 | but first i just want to show you how to do speaker recognition

0:14:10 | okay we take the simplest case |

0:14:12 | the

0:14:13 | core condition in the nist evaluation

0:14:16 | one recording which is usually

0:14:18 | designated as test

0:14:19 | another

0:14:20 | designated as

0:14:21 | train, and you're interested

0:14:24 | in the question whether

0:14:26 | the two speakers are the same |

0:14:28 | or different |

0:14:30 | so if the two speakers are the same |

0:14:34 | okay |

0:14:34 | i think it's natural to call that the alternative hypothesis, but there doesn't seem to be a universal convention

0:14:39 | about that

0:14:41 | um |

0:14:43 | then |

0:14:44 | the likelihood

0:14:45 | of the data

0:14:46 | is calculated

0:14:48 | on the assumption that there is a

0:14:50 | common set of speaker factors

0:14:52 | but |

0:14:52 | different channel factors |

0:14:54 | for the two recordings

0:14:58 | on the other hand |

0:14:59 | if the two speakers are different

0:15:02 | then the calculation of these two likelihoods can be done independently, because the speaker factors

0:15:08 | and the channel factors

0:15:09 | are untied

0:15:11 | for the two recordings

0:15:12 | so the point is that everything here is an evidence integral

0:15:16 | okay |

0:15:17 | if you can evaluate the evidence integral |

0:15:19 | you're in business

0:15:22 | a few things to note

0:15:24 | unlike traditional likelihood ratios this is symmetric

0:15:27 | in D1 and D2

0:15:30 | uh it also |

0:15:31 | has |

0:15:33 | an unusual |

0:15:35 | denominator here |

0:15:36 | okay |

0:15:37 | you don't see anything like this |

0:15:39 | in joint factor analysis

0:15:42 | this is something that comes out of

0:15:45 | following

0:15:46 | the bayesian

0:15:49 | party line

0:15:51 | and it's actually

0:15:53 | as we'll see later

0:15:55 | potentially

0:15:57 | an effective method of score normalisation

0:16:01 | and the other |

0:16:02 | point i would like to stress |

0:16:04 | is |

0:16:04 | that you can write down the likelihood ratio for any type of

0:16:08 | speaker recognition problem in the same way

0:16:10 | for instance

0:16:11 | you might have eight conversations

0:16:13 | in training and one conversation in test

0:16:16 | or you might have three conversations in train and two conversations in test

0:16:20 | in all cases |

0:16:21 | it's just a matter of |

0:16:23 | following the rules of probability consistently |

0:16:26 | and you can write down the likelihood

0:16:27 | ratio

0:16:28 | or bayes factor

0:16:29 | as it is

0:16:30 | usually called in this field |
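under the gaussian assumptions the evidence integrals in the bayes factor just described have closed forms, which can be sketched directly: marginally each i-vector is gaussian, and under the same-speaker hypothesis the two i-vectors share the speaker term, giving a cross-covariance. this is a hedged sketch, not the talk's implementation; all dimensions and parameter values are invented:

```python
import numpy as np

rng = np.random.default_rng(1)
D, P1, P2 = 20, 8, 4                      # small dims so the demo runs fast

m = np.zeros(D)
U1 = rng.standard_normal((D, P1)) * 0.5   # eigenvoices
U2 = rng.standard_normal((D, P2)) * 0.3   # eigenchannels
noise_cov = 0.5 * np.eye(D)               # residual covariance

S = U1 @ U1.T                 # between-speaker covariance
C = U2 @ U2.T + noise_cov     # within-speaker (channel + residual)

def gaussian_logpdf(x, mean, cov):
    r = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet
                   + r @ np.linalg.solve(cov, r))

def log_bayes_factor(d1, d2):
    # Numerator: both recordings generated with one set of speaker factors,
    # so the joint covariance has off-diagonal blocks S.
    joint_cov = np.block([[S + C, S], [S, S + C]])
    num = gaussian_logpdf(np.concatenate([d1, d2]),
                          np.concatenate([m, m]), joint_cov)
    # Denominator: the two recordings are scored as independent marginals.
    den = gaussian_logpdf(d1, m, S + C) + gaussian_logpdf(d2, m, S + C)
    return num - den   # symmetric in d1 and d2, as noted in the talk
```

the symmetry in d1 and d2, and the unusual denominator, fall straight out of this construction; extending it to several enrollment or test recordings just means stacking more blocks.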

0:16:36 | the evidence integral

0:16:38 | can be evaluated exactly under gaussian assumptions

0:16:42 | but the calculation

0:16:43 | is rather convoluted

0:16:45 | and if you do |

0:16:46 | relax |

0:16:46 | the gaussian assumptions you can't do it |

0:16:50 | i believe that even in the gaussian case you're better off using variational bayes

0:16:54 | a colleague disagrees

0:16:56 | with this, but i decided to let it stand

0:16:58 | and we can

0:17:00 | go into it later

0:17:03 | if there's time

0:17:06 | the

0:17:07 | key insight

0:17:08 | here

0:17:09 | is |

0:17:09 | that |

0:17:10 | this |

0:17:10 | inequality:

0:17:13 | you can always

0:17:14 | find a lower bound on the evidence with |

0:17:16 | any

0:17:17 | distribution over the hidden factors

0:17:21 | um |

0:17:23 | and i grant you it's not obvious just by looking at it, but the derivation

0:17:27 | turns out to be just a consequence

0:17:28 | of the fact

0:17:29 | that kullback-leibler

0:17:30 | divergences

0:17:31 | are

0:17:33 | nonnegative

0:17:36 | um |

0:17:37 | and what i'll be focusing on is |

0:17:40 | the use of the |

0:17:42 | variational bayes method |

0:17:44 | so |

0:17:45 | um |

0:17:46 | to find a principled

0:17:47 | approximation to the |

0:17:49 | true posterior

0:17:56 | let me just digress a minute to explain why posteriors are the problem

0:18:02 | there's nothing mysterious about this posterior distribution you you just apply bayes' rule this is what you get |

0:18:08 | you can read off this term here from the graphical model

0:18:11 | this is the prior |

0:18:13 | this is the evidence |

0:18:15 | okay |

0:18:16 | it's perfectly straightforward

0:18:17 | the only problem in practice

0:18:19 | is that you can't evaluate it

0:18:21 | exactly |

0:18:22 | evaluating the evidence and evaluating the posterior |

0:18:25 | are |

0:18:25 | two sides of the same problem |

0:18:29 | you can't do it just by numerical integration because these uh |

0:18:33 | these integrals |

0:18:34 | are in hundreds of dimensions |

0:18:38 | um |

0:18:39 | another way of stating the difficulty, which i think is a useful way of thinking about it

0:18:43 | is that |

0:18:45 | whatever factorisations you have in the prior

0:18:47 | get destroyed when you multiply by the likelihood

0:18:51 | factorisations in the prior are

0:18:53 | statistical independence assumptions

0:18:56 | and statistical independence assumptions get destroyed in the posterior

0:19:01 | it's easy

0:19:04 | to see

0:19:05 | why this is

0:19:05 | the case in terms of the graphical model, but as i said i'm going to gloss over

0:19:10 | a few things

0:19:13 | and |

0:19:14 | return to this question of variational bayes

0:19:17 | the um |

0:19:20 | the idea in the variational bayes approximation

0:19:23 | is that

0:19:24 | you acknowledge that

0:19:27 | independence has been destroyed

0:19:29 | in the posterior

0:19:30 | but you go ahead and impose it

0:19:32 | on the posterior anyway

0:19:33 | and you look for

0:19:34 | what's called a variational approximation of the posterior

0:19:38 | variational because it's actually free form |

0:19:40 | as in the calculus of variations: you don't impose any restriction

0:19:45 | on the functional form |

0:19:47 | of

0:19:48 | the approximate posterior

0:19:49 | and there's a standard set of coupled update formulas

0:19:54 | that you can apply here

0:19:56 | they're coupled because this expectation is calculated with the posterior on x2, and

0:20:01 | this

0:20:02 | expectation is calculated with the posterior on x1

0:20:05 | so you have to iterate between the two

0:20:08 | um |

0:20:10 | the nice thing is that this iteration comes with EM-like convergence guarantees

0:20:16 | and |

0:20:17 | it avoids

0:20:19 | altogether the need

0:20:19 | to invert |

0:20:20 | um |

0:20:22 | large sparse block matrices which is the only way you can evaluate the |

0:20:26 | evidence exactly |

0:20:28 | and then

0:20:28 | only in the gaussian case
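the coupled updates just described can be sketched for a small linear-gaussian model. this is a minimal sketch under assumed dimensions and parameter values: each mean update holds the other factor's current posterior mean fixed, and the two are iterated:

```python
import numpy as np

# Mean-field VB for the hidden factors of d_r = U1 x1 + U2 x2_r + eps_r
# (global mean removed). q(x1) and q(x2_r) are each Gaussian; since the
# posterior precisions don't depend on the data, we precompute them and
# iterate only the coupled mean updates.
rng = np.random.default_rng(2)
D, P1, P2, R = 30, 6, 3, 4

U1 = rng.standard_normal((D, P1)) * 0.4
U2 = rng.standard_normal((D, P2)) * 0.4
Lam = 2.0 * np.eye(D)                       # residual precision
data = rng.standard_normal((R, D))          # R recordings of one speaker

Phi1 = np.eye(P1) + R * U1.T @ Lam @ U1     # posterior precision of x1
Phi2 = np.eye(P2) + U2.T @ Lam @ U2         # posterior precision of each x2_r

mean1 = np.zeros(P1)
mean2 = np.zeros((R, P2))
for _ in range(20):                         # VB iterations
    # Update q(x1) using the current expectations of the x2_r:
    resid = data - mean2 @ U2.T
    mean1 = np.linalg.solve(Phi1, U1.T @ Lam @ resid.sum(axis=0))
    # Update each q(x2_r) using the current expectation of x1:
    resid = data - mean1 @ U1.T
    mean2 = np.linalg.solve(Phi2, U2.T @ Lam @ resid.T).T
```

note what is avoided: the exact joint posterior would require inverting one large block matrix coupling x1 with all the x2_r, whereas here only the small P1 x P1 and P2 x P2 systems are ever solved.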

0:20:35 | this posterior distribution, or the variational

0:20:39 | approximation of the posterior distribution, is also

0:20:43 | the key

0:20:44 | to estimating the model parameters

0:20:47 | okay you use a lower bound |

0:20:49 | as a proxy |

0:20:50 | for the likelihood of the evidence |

0:20:53 | and you seek to

0:20:54 | optimise the lower bound

0:20:56 | calculated |

0:20:57 | over |

0:20:58 | uh a collection of training speakers |

0:21:01 | here i've just

0:21:02 | taken the definition and

0:21:04 | rewritten it this way

0:21:06 | it's convenient to do this because this term here doesn't involve the model

0:21:11 | parameters at all |

0:21:13 | so the |

0:21:14 | first |

0:21:15 | approach to the

0:21:16 | problem would be just to

0:21:18 | optimise

0:21:19 | this term here

0:21:21 | you get the contribution

0:21:24 | to the

0:21:25 | evidence criterion by summing this over all speakers

0:21:32 | okay um |

0:21:33 | this |

0:21:33 | when you work it out

0:21:35 | turns out to be formally identical

0:21:39 | to

0:21:41 | um |

0:21:41 | probabilistic principal components analysis |

0:21:44 | it's just a least squares problem |


0:21:51 | and it's actually the E M auxiliary function for probabilistic principal components analysis |

0:21:58 | the only difference is that you have to use the variational posterior

0:22:02 | rather than

0:22:03 | the exact

0:22:04 | posterior

0:22:07 | there is another way of

0:22:10 | doing the estimation

0:22:11 | which

0:22:13 | i call minimum divergence

0:22:15 | estimation; this has created a good deal of confusion over the years so i'll

0:22:20 | try and explain it briefly

0:22:23 | the idea is to concentrate on this term here

0:22:27 | it's independent of the model parameters |

0:22:29 | okay |

0:22:30 | but you can

0:22:32 | make

0:22:33 | certain changes of variables here

0:22:36 | which

0:22:37 | minimise this divergence but are constrained in such a way as to preserve the value of the auxiliary

0:22:44 | function |

0:22:46 | and if you minimise

0:22:48 | these divergences while

0:22:50 | keeping this term fixed

0:22:51 | you will then

0:22:52 | increase

0:22:52 | the

0:22:54 | value of the evidence

0:22:56 | criterion

0:22:56 | criterion |

0:23:00 | the way this works

0:23:02 | say in the case of speaker factors |

0:23:04 | to minimise the divergence |

0:23:06 | what you do is you look for |

0:23:08 | affine transformations of the speaker factors such that the first and second order

0:23:13 | moments

0:23:16 | of the speaker factors

0:23:17 | agree on average

0:23:19 | over the

0:23:21 | speakers in the training set

0:23:21 | uh speakers in the training set |

0:23:23 | with |

0:23:23 | the |

0:23:24 | first order moment of the prior and the second order moment |

0:23:27 | of the prior

0:23:27 | that's just a matter of

0:23:30 | finding an affine transformation

0:23:32 | that satisfies

0:23:33 | this condition; you then apply

0:23:35 | the inverse transformation |

0:23:37 | to update the model parameters |

0:23:39 | in such a way as to keep the value of the |

0:23:43 | EM auxiliary function fixed

0:23:46 | and it turns out that if you |

0:23:49 | interleave these two steps

0:23:52 | you will be able to accelerate

0:23:56 | the convergence |
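the minimum-divergence step just described can be sketched as moment matching. this is a hedged sketch; the posterior statistics below are invented stand-ins (not real E-step output), and a single shared posterior covariance is assumed for simplicity:

```python
import numpy as np

# After an E-step, the per-speaker posteriors of the speaker factors have
# some empirical first and second moments. Find the affine change of
# variables that makes them standard normal on average, and push its
# inverse into the model parameters so m + U1 @ x1 is unchanged.
rng = np.random.default_rng(3)
P1, D, S = 5, 12, 200

m = rng.standard_normal(D)
U1 = rng.standard_normal((D, P1))
post_means = rng.standard_normal((S, P1)) * 1.7 + 0.3  # stand-in posteriors
post_cov = 0.2 * np.eye(P1)        # shared posterior covariance (simplified)

mu = post_means.mean(axis=0)                            # empirical mean
second = post_cov + (post_means - mu).T @ (post_means - mu) / S
T = np.linalg.cholesky(second)     # second = T @ T.T

# Transformed factors x1' = T^{-1} (x1 - mu) have zero mean and identity
# second moment on average; compensate in the model parameters:
m_new = m + U1 @ mu
U1_new = U1 @ T
# m_new + U1_new @ x1' equals m + U1 @ x1 for every x1, so the value of
# the auxiliary function is preserved while the prior divergence drops.
```

interleaving this with the least-squares update of U1 is what gives the acceleration mentioned in the talk.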

0:23:59 | so

0:24:01 | just one comment

0:24:03 | about this

0:24:04 | what i set out to do here is to produce point estimates

0:24:08 | of the

0:24:10 | eigenvoice matrix and the eigenchannel matrix

0:24:15 | if you are a really hardcore bayesian you don't allow point estimates

0:24:20 | into your |

0:24:22 | model you have to do everything in terms of |

0:24:24 | prior probabilities |

0:24:26 | and

0:24:26 | posterior probabilities |

0:24:29 | so a true blue bayesian would put a prior

0:24:32 | on the eigenvoices and calculate the posterior |

0:24:35 | again by |

0:24:36 | variational bayes |

0:24:38 | even the |

0:24:39 | number of speaker factors |

0:24:40 | could be treated as a hidden random variable |

0:24:43 | okay and the posterior distribution could be calculated |

0:24:46 | again by |

0:24:47 | variational

0:24:47 | bayes

0:24:49 | so there is |

0:24:50 | an extensive literature |

0:24:52 | on this |

0:24:53 | on this subject |

0:24:54 | uh |

0:24:55 | and i'd say that if there's one problem with variational bayes

0:24:59 | it provides too much flexibility |

0:25:01 | you have to |

0:25:01 | exercise good judgement |

0:25:03 | as to which things |

0:25:05 | you should try |

0:25:07 | and which things are probably not

0:25:09 | going to help

0:25:10 | in other words don't lose sight of

0:25:12 | your engineering objective

0:25:15 | and the particular thing i chose |

0:25:17 | to focus on was

0:25:19 | the |

0:25:20 | gaussian assumption |

0:25:21 | okay |

0:25:22 | uh as far as i can see |

0:25:25 | the gaussian assumption is just not realistic |

0:25:28 | for the

0:25:30 | kind of data that

0:25:31 | we're dealing with

0:25:34 | and what i set out to do using variational bayes |

0:25:37 | was to replace |

0:25:39 | the |

0:25:41 | gaussian assumption, with its

0:25:41 | exponentially decreasing tails, by

0:25:44 | a power-law distribution

0:25:46 | which allows

0:25:48 | for

0:25:50 | outliers:

0:25:51 | exceptional

0:25:52 | speaker effects or

0:25:53 | severe channel distortions |

0:25:55 | uh in the data |

0:25:57 | and this term black swan is amusing |

0:26:01 | the

0:26:02 | romans had a phrase, "a rare bird, much like a black

0:26:06 | swan"

0:26:07 | intended to convey the notion of something impossible or inconceivable

0:26:12 | and they were in no position to know that black swans actually do exist

0:26:17 | in australia

0:26:21 | a financial forecaster by the name of

0:26:23 | taleb

0:26:25 | a few years ago wrote a polemic

0:26:28 | against the gaussian distribution called |

0:26:30 | the black swan |

0:26:33 | it was actually written before the market

0:26:36 | crashed in two thousand and eight, which of course is the

0:26:39 | mother of all black swans

0:26:41 | and |

0:26:42 | as a result

0:26:43 | it made

0:26:44 | quite a big

0:26:45 | media splash

0:26:50 | okay it turns out that the

0:26:53 | textbook definition of

0:26:56 | the student's t distribution, the one which i'm

0:26:59 | going to use in place of the gaussian distribution, is not workable

0:27:03 | with variational bayes

0:27:06 | there is another construction that represents

0:27:09 | the student's t distribution

0:27:12 | as a continuous mixture of

0:27:15 | normal random variables

0:27:17 | it's based on the gamma distribution, a unimodal distribution

0:27:21 | on the positive reals which has two parameters that enable you to adjust the

0:27:26 | mean and the variance independently of each other

0:27:31 | the way it works is

0:27:31 | this

0:27:32 | okay in order to |

0:27:34 | sample from a student's T distribution |

0:27:40 | you start with a gaussian distribution with precision matrix lambda |

0:27:45 | you then

0:27:46 | scale

0:27:47 | the covariance matrix by a random scale factor drawn from the

0:27:53 | gamma distribution

0:27:55 | and then you sample from the |

0:27:57 | normal distribution with the modified covariance matrix |

0:28:00 | it's that random scale factor that

0:28:04 | introduces the heavy-tailed

0:28:06 | behaviour |

0:28:08 | um |

0:28:09 | the parameters of the

0:28:11 | gamma distribution

0:28:14 | determine

0:28:15 | the extent to which this thing

0:28:17 | is heavy-tailed: you have the gaussian at one extreme

0:28:21 | at the other extreme you have something called the cauchy distribution, which is

0:28:25 | so heavy-tailed that the

0:28:27 | variance is infinite

0:28:29 | this term "degrees of freedom" comes from classical statistics but it doesn't have any particular meaning

0:28:36 | in this context
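the scale-mixture construction just described can be sketched directly; the degrees of freedom nu and the dimensions below are illustrative:

```python
import numpy as np

# Sample a heavy-tailed (Student's t) variable: draw a random precision
# scaling u from Gamma(nu/2, nu/2) (which has mean 1), then sample from a
# Gaussian whose covariance is inflated by 1/u. Occasionally u is small,
# the covariance blows up, and an extreme draw ("black swan") appears.
rng = np.random.default_rng(4)

def sample_heavy_tailed(mean, cov, nu, n):
    draws = []
    for _ in range(n):
        u = rng.gamma(shape=nu / 2.0, scale=2.0 / nu)   # random scale factor
        draws.append(rng.multivariate_normal(mean, cov / u))
    return np.array(draws)

x = sample_heavy_tailed(np.zeros(2), np.eye(2), nu=3.0, n=5000)
```

as nu grows the gamma draws concentrate at 1 and the gaussian is recovered; small nu gives cauchy-like tails.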

0:28:39 | okay |

0:28:40 | so for example |

0:28:42 | suppose you want to make the |

0:28:44 | channel factors heavy-tailed

0:28:47 | in order to model |

0:28:48 | outlying

0:28:49 | channel distortions

0:28:53 | what you have to do here is

0:28:55 | so |

0:28:56 | remember |

0:28:57 | there's one set of channel factors

0:28:58 | for each recording so this is inside the plate |

0:29:02 | you associate a random scale factor |

0:29:05 | okay with that |

0:29:07 | hidden random variable |

0:29:09 | and that random scale factor

0:29:12 | is |

0:29:12 | sampled |

0:29:13 | from |

0:29:14 | a gamma distribution |

0:29:16 | controlled by a degrees-of-freedom parameter

0:29:19 | so heavy-tailed plda does this

0:29:22 | for all of the

0:29:24 | hidden variables

0:29:25 | in the

0:29:27 | gaussian plda model

0:29:29 | the speaker factors

0:29:31 | have an associated

0:29:32 | random scale factor

0:29:35 | the channel factors

0:29:37 | also have a random scale factor

0:29:39 | and the residual

0:29:40 | has an associated random scale

0:29:42 | factor

0:29:43 | so |

0:29:44 | in fact |

0:29:45 | all i've introduced here are just three extra

0:29:48 | parameters |

0:29:49 | three extra degrees of freedom |

0:29:51 | in order to |

0:29:53 | model |

0:29:53 | the |

0:29:54 | the heavy-tailed

0:29:55 | behaviour |

0:29:58 | yeah |

0:29:59 | these are some technical points

0:30:02 | about how

0:30:04 | you can

0:30:06 | carry over variational bayes from the gaussian case to the heavy-tailed case and do so

0:30:11 | in a computationally efficient way

0:30:14 | um |

0:30:16 | i refer you to the paper for these |

0:30:18 | the |

0:30:19 | key point that i would like to draw your attention to |

0:30:22 | is that these numbers of degrees of freedom

0:30:25 | can actually be estimated |

0:30:27 | using the same evidence criterion |

0:30:30 | as the eigenvoices |

0:30:32 | and the eigenchannels |

0:30:38 | okay, here are some results: this is a comparison of gaussian PLDA and heavy-tailed PLDA on several conditions of the NIST 2008 evaluation

0:30:55 | this is the equal error rate and the 2008 detection cost function

0:31:02 | it's clear that in all three conditions there's a very dramatic reduction in errors, at both the DCF operating point and the equal error rate

0:31:15 | this was done without score normalisation; if you do score normalisation, what happens is this

0:31:22 | you get a uniform improvement in all cases with gaussian PLDA, and a uniform degradation with the Student's t distribution

0:31:33 | so in the Student's t case, not only does score normalisation not help you, it's a nuisance

0:31:46 | let me just say a word about score normalisation

0:31:50 | it's usually needed in order to set the decision threshold in speaker verification in a trial-dependent way

0:32:01 | it's typically computationally expensive, and it complicates life if you ever have to do cross-gender trials

0:32:11 | on the other hand, if you have a good generative model for speech, in other words if you insist on the probabilistic way of thinking, there should be no need for score normalisation, just as there should be no need for calibration; but we're not there yet

0:32:29 | in practice it's needed because of outlying recordings, which tend to produce exceptionally low scores for all of the trials in which they are involved
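for concreteness, a standard normalisation of this kind is z-norm, which standardises a raw score by the model's score distribution against an impostor cohort; a hypothetical sketch (the cohort, the toy cosine scorer, and all sizes are made up for illustration):

```python
import numpy as np

def znorm(raw_score, model, cohort, score_fn):
    """Z-norm: standardise a raw score using the mean and standard
    deviation of the model's scores against a cohort of impostors."""
    cohort_scores = np.array([score_fn(model, c) for c in cohort])
    return (raw_score - cohort_scores.mean()) / cohort_scores.std()

# toy example with cosine scoring on random "i-vectors"
rng = np.random.default_rng(1)
cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
model = rng.standard_normal(400)
cohort = rng.standard_normal((50, 400))
target = model + 0.1 * rng.standard_normal(400)   # same-speaker test segment
s = znorm(cos(model, target), model, cohort, cos)
print(s > 3.0)  # a genuine trial stands far above the impostor distribution
```

an outlying recording shifts its whole score distribution downward, which is exactly what this per-model standardisation compensates for.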

0:32:43 | and what the Student's t distribution appears to be doing is this: the extra hidden variables, these scale factors that i introduced, appear to be capable of modelling this outlier behaviour adequately, thus doing away with the need for score normalisation

0:33:08 | i should say a couple of words about microphone speech

0:33:13 | the situation with telephone speech seems to be quite clear: gaussian PLDA with score normalisation gives results which are comparable to cosine distance scoring

0:33:24 | heavy-tailed PLDA gives better results, at least on the 2008 data, and in general about twenty-five percent better than traditional joint factor analysis

0:33:36 | but it turns out to break down, in an interesting way, on microphone speech

0:33:46 | now, Najim yesterday described an i-vector extractor of dimension six hundred which could be used for recognition on both microphone and telephone speech

0:33:59 | so we started out by training a model using only telephone speech: speaker factors, with the residual modelled by a full precision matrix

0:34:09 | then we augmented that with eigenchannels, and everything was treated in the heavy-tailed way

0:34:17 | what turned out, unfortunately, is that we ran straight into the Cauchy distribution for the microphone transducer effect

0:34:30 | what that means is that the variance of the channel effects for the microphone data is infinite

0:34:37 | and it's a short step to realise that if you have infinite variance for channel effects, you're not able to do speaker recognition

0:34:46 | so i haven't been able to fix this; at present the best strategy would seem to be to project away the troublesome dimensions using some type of LDA

0:35:00 | that's Najim's strategy, which i believe will be talked about in the next presentation

0:35:09 | okay, now i come to the third part of my talk, which concerns the question of how it would be possible to integrate joint factor analysis or PLDA and cosine distance scoring, or something resembling it, in a coherent probabilistic framework

0:35:36 | if you haven't seen these types of scatter plots, they are very interesting: each colour here represents a speaker, and each point represents an utterance of speech

0:35:56 | this is a plot of supervectors projected onto what is essentially the first two i-vector components

0:36:07 | so you see what's going on here; this is the real motivation for cosine distance scoring

0:36:13 | cosine distance scoring ignores the magnitudes of the vectors and uses only the angle between them as the similarity measure
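the scoring rule just described can be written in a few lines; a minimal sketch (the two-dimensional vectors are toy values chosen to make the point):

```python
import numpy as np

def cosine_score(x, y):
    """Cosine distance scoring: compare two i-vectors by the angle
    between them, ignoring their magnitudes entirely."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

v = np.array([3.0, 4.0])
w = np.array([6.0, 8.0])     # same direction, twice the magnitude
print(cosine_score(v, w))    # 1.0: magnitude plays no role
print(cosine_score(v, np.array([-4.0, 3.0])))  # 0.0: orthogonal vectors
```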

0:36:27 | and this is completely inconsistent with the assumptions of joint factor analysis, because there seems to be, for each speaker, a principal axis of variability that passes through the speaker's mean

0:36:42 | the session variability for a speaker is augmented in a particular direction, namely the direction of the mean vector

0:36:48 | whereas JFA or PLDA assumes that you can model session variability for all speakers in the same way; that's the statistical independence assumption in JFA

0:37:11 | i have to add a caveat in interpreting these plots: you have to be careful, because they are not indifferent to the way you estimate the supervectors and so on

0:37:25 | we do find these plots with i-vectors, but we have to cherry-pick the results in order to get nice pictures like the one i showed you

0:37:34 | but the principal evidence for this type of behaviour, which i call directional scattering, is the effectiveness of the cosine distance measure in speaker recognition

0:37:51 | i don't know how to account for it, and i'm not concerned with that question; the only question i would like to answer is how to model this type of behaviour probabilistically

0:38:05 | okay, as i said, this part is going to get a bit technical; it's addressed to people who have read the chapter on variational bayes in Bishop's book

0:38:18 | in order to get a handle on this problem there seems to be a natural strategy

0:38:23 | instead of representing each speaker by a single point, a mean vector in the speaker factor space, represent each speaker by a distribution which is specified by a mean vector mu and a precision matrix lambda

0:38:42 | the i-vectors are then generated by sampling "speaker factors" from this distribution

0:38:46 | i put that in inverted commas because the speaker factors vary from one recording to another, just as the channel factors do, but the mechanism by which they are generated is quite different; that's the point i'm coming to in a moment

0:39:04 | the trick is to choose the prior on the mean and precision matrix of each speaker in such a way that mu and lambda are not statistically independent

0:39:15 | because what you want is a precision matrix for each speaker which varies with the location of the speaker's mean vector

0:39:28 | and of course, once you set this up, you're immediately going to run into problems: there's no hope of doing point estimation of the precision matrix if you only have one or two observations of the speaker

0:39:42 | you have to follow the rules of probability, that is to say, integrate over the prior, and the way to do that of course is with variational bayes

0:39:56 | okay, so here's how it's done

0:39:58 | there seems to be only one natural prior on precision matrices, namely the Wishart prior

0:40:08 | i won't talk about this in detail; i've just put it down there so that if you're interested you'll be able to recognise that it's a generalisation of the gamma distribution

0:40:18 | if you take the dimension equal to one, this reduces to the gamma distribution; in higher dimensions it's concentrated on positive definite matrices
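a Wishart-distributed precision matrix can be sampled with the Bartlett decomposition; a numpy-only sketch (the dimension and degrees of freedom are arbitrary choices for illustration, and in one dimension with unit scale the draw is just a chi-squared, i.e. gamma, variate):

```python
import numpy as np

def sample_wishart(df, scale, rng):
    """Bartlett decomposition: A is lower triangular with chi-distributed
    diagonal entries and standard-normal entries below the diagonal; then
    (L A)(L A)^T ~ Wishart(df, scale), where scale = L L^T."""
    d = scale.shape[0]
    L = np.linalg.cholesky(scale)
    A = np.zeros((d, d))
    for i in range(d):
        A[i, i] = np.sqrt(rng.chisquare(df - i))
        A[i, :i] = rng.standard_normal(i)
    LA = L @ A
    return LA @ LA.T

rng = np.random.default_rng(2)
P = sample_wishart(df=10.0, scale=np.eye(3), rng=rng)
print(np.allclose(P, P.T))                  # symmetric
print(np.all(np.linalg.eigvalsh(P) > 0))    # positive definite
# with d = 1 and unit scale, a draw is chi-squared with df degrees of
# freedom, i.e. gamma(df/2, scale=2): the one-dimensional special case
```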

0:40:32 | there is a parameter called the number of degrees of freedom again, which determines how peaked this distribution is

0:40:42 | this point i think is worth mentioning: there's no loss of generality in assuming that W, the scale matrix here, is equal to the identity

0:40:51 | the reason this is worth mentioning is that it turns out to correspond exactly to something that Najim does in his processing

0:41:01 | if you're familiar with his work, you know that he estimates a WCCN matrix in the speaker space and then whitens the data with that matrix before evaluating the cosine distance
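the WCCN-style preprocessing just mentioned can be sketched on synthetic data (the class structure, dimensions, and stretched noise axis below are all made up): estimate the within-class covariance, then whiten with its inverse Cholesky factor so the within-class covariance becomes the identity.

```python
import numpy as np

rng = np.random.default_rng(3)
# synthetic i-vectors: 20 speakers x 10 recordings, with the within-class
# scatter deliberately stretched along the first axis
means = rng.standard_normal((20, 4))
noise = rng.standard_normal((20, 10, 4)) * np.array([3.0, 1.0, 1.0, 1.0])
data = means[:, None, :] + noise

# within-class covariance: covariance of recordings about speaker means
centered = data - data.mean(axis=1, keepdims=True)
W = np.einsum('sri,srj->ij', centered, centered) / (20 * 10)

# WCCN-style whitening: apply the inverse Cholesky factor of W, so the
# within-class covariance of the transformed data is the identity
B = np.linalg.inv(np.linalg.cholesky(W))
white = centered @ B.T
W_white = np.einsum('sri,srj->ij', white, white) / (20 * 10)
print(np.allclose(W_white, np.eye(4), atol=1e-8))
```

after this transformation, assuming the scale matrix of the Wishart prior is the identity costs nothing, which is the correspondence the talk points out.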

0:41:23 | okay, the first thing then: we have generated the precision matrix for the speaker, and the next step is to generate the mean vector for the speaker

0:41:34 | you do that using a Student's t distribution

0:41:39 | once you have a precision matrix, that's all you need: if you just add in a gamma distribution, you can sample the mean vector according to a Student's t distribution

0:41:51 | it's explained in the paper why you need to use the Student's t distribution

0:41:59 | the point i would just like to draw your attention to at this stage is that because the distribution of mu depends on lambda, the conditional distribution of lambda depends on mu

0:42:15 | that means that the precision matrix for a speaker depends on the location of the speaker in the speaker factor space, so you have some hope of modelling this directional scattering

0:42:35 | i'll skip that and go to the graphical model

0:42:42 | i think it's clear from this; remember, when you're confronted with something like this, everything inside the plate is replicated for each of the recordings of a speaker, and everything outside the plate is done once per speaker

0:42:58 | okay, so the first step is to generate the precision matrix

0:43:04 | you then generate the mean for the speaker by sampling from a Student's t distribution; i call the hidden scale factor w, and the parameters of the gamma distribution alpha and beta

0:43:16 | once you have the mean and the precision matrix, you generate the speaker factors for each recording (remember, we're making the speaker factors depend on the recording) by sampling from another Student's t distribution
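the generative story in the plate diagram can be sketched end to end; everything below follows the steps just described, but the dimensions and the hyperparameter values (the Wishart degrees of freedom, alpha, beta, and the Student's t degrees of freedom) are illustrative, not the values used in the talk:

```python
import numpy as np

def sample_speaker(n_recordings, dim, wishart_df, alpha, beta, t_df, rng):
    """One pass through the graphical model: precision matrix, then the
    speaker mean via a Student's t, then per-recording speaker factors."""
    # 1) precision matrix Lambda ~ Wishart(wishart_df, I), built as a sum
    #    of outer products of Gaussians (valid for integer df >= dim)
    G = rng.standard_normal((wishart_df, dim))
    Lam = G.T @ G
    cov = np.linalg.inv(Lam)
    # 2) speaker mean mu: hidden scale w ~ Gamma(alpha, rate beta), then
    #    mu ~ N(0, cov / w), which is marginally a Student's t
    w = rng.gamma(alpha, 1.0 / beta)
    mu = rng.multivariate_normal(np.zeros(dim), cov / w)
    # 3) speaker factors, one per recording: scale u_r ~ Gamma(t_df/2,
    #    rate t_df/2), then y_r ~ N(mu, cov / u_r), again heavy-tailed
    u = rng.gamma(t_df / 2.0, 2.0 / t_df, size=n_recordings)
    y = np.stack([rng.multivariate_normal(mu, cov / ur) for ur in u])
    return mu, Lam, y

rng = np.random.default_rng(4)
mu, Lam, y = sample_speaker(5, 3, wishart_df=10, alpha=2.0, beta=2.0,
                            t_df=4.0, rng=rng)
print(mu.shape, Lam.shape, y.shape)  # (3,) (3, 3) (5, 3)
```

note that the mean and the speaker factors share the same precision matrix, which is how the location and the scatter of a speaker become coupled.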

0:43:34 | the interesting thing is that these three parameters, alpha, beta, and the number of degrees of freedom, determine whether or not this whole business is going to exhibit directional scattering

0:43:51 | okay, sorry, this can't be explained without a little calculation

0:43:59 | remember lambda is the precision matrix, so lambda inverse is the covariance matrix, and what i'm comparing here is the distribution of the covariance matrix given the speaker-dependent parameters, and the prior distribution of the covariance

0:44:16 | you see what you have is a weighted average of the prior expectation and another term

0:44:25 | now, this second term here depends on the speaker's mean: it's a rank-one covariance matrix, so the only variability that's allowed is in the direction of the mean vector

0:44:37 | this is, so to speak, variability along a single direction, which is exactly what the doctor ordered for directional scattering
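that weighted-average structure can be checked numerically; a sketch (the mixing weights, dimension, and mean vector are arbitrary) showing that adding a rank-one term built from the mean to an isotropic prior expectation puts the dominant eigenvector of the resulting covariance along the mean direction:

```python
import numpy as np

mu = np.array([2.0, 1.0, 0.0])         # speaker mean, direction of interest
prior = np.eye(3)                      # prior expectation of the covariance
rank_one = np.outer(mu, mu)            # variability only along mu
cov = 0.5 * prior + 0.5 * rank_one     # weighted average, weights made up

vals, vecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
top = vecs[:, -1]                      # eigenvector of the largest eigenvalue
unit_mu = mu / np.linalg.norm(mu)
# the principal axis of variability passes through the speaker's mean
print(np.allclose(np.abs(top @ unit_mu), 1.0))
```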

0:44:46 | i'd draw your attention to the fact that this term here is multiplied by this factor, so it depends on the number of degrees of freedom and on this random scale factor

0:45:03 | so the extent of the directional scattering is going to depend on the behaviour of this term

0:45:17 | it depends in fact on the parameters which govern the distribution of the random scale factor w

0:45:25 | if w has a large mean and a small variance, you can say that this term dominates, so that most of the variability is in the direction of the mean vector

0:45:40 | so in that case directional scattering would be present, to a large extent, for most speakers in the data

0:45:50 | on the other hand, there's another limiting case where you can show that the model reduces to heavy-tailed PLDA again, and there's no directional scattering at all

0:46:00 | so the key question would be to see how this model trains up; to be frank, that is going to take a couple of months, so i don't have any results to report yet

0:46:13 | okay, so in conclusion

0:46:16 | gaussian PLDA is an effective model for speaker recognition, and it's just joint factor analysis with i-vectors as features

0:46:25 | my experience has been that it works better than traditional joint factor analysis, even though the basic assumptions are open to question

0:46:37 | variational bayes allows you to go a long way in relaxing these assumptions: you can model outliers by adding these hidden scale variables, and you can model directional scattering by making the hidden variables depend on each other

0:46:54 | the derivation of the variational bayes update formulas is mechanical; i'm not saying it's always easy, but it is mechanical

0:47:04 | and it comes with rigorous convergence guarantees, so that you have some hope of debugging your implementation

0:47:15 | one caveat is that in practice you have to stay inside the exponential family in order to make it work

0:47:23 | i'm also personally of the opinion that in order to get the full benefit of these methods we need what i would call informative priors

0:47:33 | that is to say, prior distributions on the hidden variables whose parameters can be learned; i use the word "learned" because "estimated" isn't really appropriate here

0:47:44 | and this is a strong argument for larger training sets

0:47:48 | the example is that all of the hidden variables that i've just described are controlled by a handful of scalar degrees of freedom, and these can all be estimated using the evidence criterion from training data
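as a one-dimensional illustration of picking a degrees-of-freedom parameter by an evidence-style criterion (the data, the grid of candidate values, and the zero-mean unit-scale assumption are all synthetic choices for this sketch): choose the nu that maximises the log likelihood of a Student's t fit to heavy-tailed data.

```python
import numpy as np
from math import lgamma, log, pi

def t_loglik(x, nu):
    """Log likelihood of a zero-mean, unit-scale Student's t with nu
    degrees of freedom, summed over the data."""
    c = lgamma((nu + 1) / 2) - lgamma(nu / 2) - 0.5 * log(nu * pi)
    return float(np.sum(c - (nu + 1) / 2 * np.log1p(x ** 2 / nu)))

rng = np.random.default_rng(5)
x = rng.standard_t(df=3.0, size=20000)            # heavy-tailed data
grid = [1.0, 2.0, 3.0, 5.0, 10.0, 30.0, 100.0]    # candidate nu values
best = max(grid, key=lambda nu: t_loglik(x, nu))
print(best)  # the criterion picks a small nu, i.e. heavy tails
```

on genuinely gaussian data the same search drifts to the large-nu end of the grid, which is the sense in which the tail weight is learned rather than hand-tuned.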

0:48:09 | now, to be fair: the advantage of probabilistic methods is that you have a logically coherent way of reasoning in the face of uncertainty

0:48:19 | the disadvantage is that it takes time and effort to master the techniques and to program them

0:48:29 | if your principal concern is to get a good system up and running quickly, i would recommend something like cosine distance scoring

0:48:42 | on the other hand, if you're interested in mastering this family of methods, i think there are really only three things you need to look at

0:48:51 | there's the original paper by Prince and Elder on probabilistic linear discriminant analysis in face recognition; that's the gaussian case

0:49:04 | everything you need to know about probability is in Bishop's book, which i highly recommend; it's very well written and it starts from first principles

0:49:15 | and this is the paper; i don't believe it has actually found its way into the proceedings, but it is available online

0:49:28 | okay, thank you very much

0:49:43 | right, the talk is now open for questions


0:50:02 | thanks for the presentation, which was very enlightening

0:50:10 | as you said, if you want a quick solution you can do it one way, and if you want a more principled solution, the other

0:50:23 | but one thing i noticed is that the starting point of your algorithm is this: you have a speech utterance, you use factor analysis to summarise it as an i-vector, and you completely ignore the uncertainty in that process; only from that point on do you keep track of the uncertainty

0:50:41 | so why do you do it like that?

0:50:44 | why do i do it like that? it's an entirely empirical decision, based on the effectiveness of Najim's cosine distance scoring; it just works really well

0:50:56 | attempts, on my part at least, to incorporate the uncertainty in the i-vector estimation procedure don't seem to have paid off; they complicate life

0:51:08 | it's really empirical; it's not dictated by dogma

0:51:24 | i have one question regarding the results you presented: one of the categories was the short conversation condition, the ten-second data

0:51:36 | when you ran your heavy-tailed setup on the ten-second data, how did you handle score normalisation?

0:51:56 | well, the best results were obtained without score normalisation, so there was no question of introducing a cohort

0:52:06 | your question, maybe, is whether in the gaussian case, where you do need score normalisation, you should estimate the normalisation statistics on ten-second utterances

0:52:30 | my experience has been, and this isn't black or white, that it's better not to use the ten-second data for that

0:52:40 | this points to an interesting aspect of i-vectors: they perform very well on the ten-second condition

0:52:53 | in other words, the estimation of i-vectors is much less sensitive to short durations than, say, relevance MAP

0:53:11 | i have a question about the math: you make an assumption that the latent variables exhibit a gaussian, or Student's t, distribution at the last stage

0:53:26 | is there a nonparametric way to do this, without making such assumptions?

0:53:32 | so, i think i was careful to use Student's t distributions everywhere; it's that which gives me the flexibility to model outliers and directional scattering

0:53:44 | does that answer your question?

0:53:46 | yes, but you still need some parametric assumption to model it at the last stage

0:53:55 | variational bayes does require that, and in fact there's an extra restriction: you have to stay inside the exponential family, unfortunately

0:54:07 | the art consists in achieving what you want to do subject to those constraints

0:54:15 | is that an adequate response?

0:54:18 | yeah |

0:54:34 | i have a question about the priors: how did you set their parameters? did you have to tune them by hand?

0:54:48 | well, in fact we used the evidence criterion; exactly the same criterion for estimating these numbers of degrees of freedom as we used for estimating the eigenvoices and the eigenchannels

0:55:02 | so it's completely consistent; there was no manual tuning

0:55:07 | thank you |

0:55:21 | so, is there another question? let me think; okay

0:55:32 | because |