0:00:06 | the title of my talk is

0:00:08 | bayesian speaker verification with

0:00:10 | heavy-tailed priors

0:00:30 | in a nutshell, what this talk is about is

0:00:34 | applying joint factor analysis with

0:00:37 | i-vectors

0:00:38 | as

0:00:39 | features

0:00:41 | so i'll be assuming that you have some familiarity with

0:00:45 | joint factor analysis |

0:00:47 | i-vectors

0:00:49 | and |

0:00:49 | cosine distance

0:00:51 | scoring, all right

0:00:54 | the key fact

0:00:55 | about i-vectors is that they provide a representation of speech segments of

0:01:00 | arbitrary durations by

0:01:03 | vectors of

0:01:05 | fixed dimension

0:01:08 | these vectors seem to contain most of the information needed to distinguish between speakers

0:01:15 | and as a bonus they are of relatively low dimension |

0:01:20 | typically four hundred rather than |

0:01:23 | a hundred thousand |

0:01:24 | as in the case of gmm supervectors

0:01:29 | this means that

0:01:31 | it's

0:01:31 | possible to

0:01:33 | apply

0:01:34 | modern

0:01:36 | bayesian methods of pattern recognition

0:01:39 | to the speaker recognition problem

0:01:41 | we've banished |

0:01:42 | the |

0:01:43 | time dimension altogether |

0:01:45 | and we're in a situation which is quite analogous to |

0:01:48 | other |

0:01:49 | pattern recognition problems

0:01:57 | i think i should

0:01:58 | at the outset explain what i mean by "bayesian"

0:02:01 | because it's open to several interpretations |

0:02:06 | what i intend is that

0:02:08 | in my mind

0:02:10 | the terms bayesian

0:02:11 | and probabilistic

0:02:13 | are synonymous with each other

0:02:16 | the idea is |

0:02:17 | to

0:02:18 | as far as possible |

0:02:21 | do everything within the framework of the calculus of probability

0:02:27 | it doesn't |

0:02:29 | really matter whether you prefer |

0:02:31 | to interpret probabilities in frequentist terms

0:02:35 | or in bayesian terms

0:02:38 | the

0:02:39 | rules of probability are the same, and there are only two

0:02:42 | the sum rule

0:02:43 | and the product rule

0:02:45 | anyway, they give you the same results in both cases

0:02:51 | and the advantage of this is that you have a

0:02:53 | logically coherent way of

0:02:57 | reasoning in the face of uncertainty
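the two rules just mentioned can be made concrete with a tiny sketch. the joint distribution below is invented for illustration, as are the event names; it shows the sum rule (marginalisation) and the product rule (from which bayes' rule follows):

```python
# Invented discrete joint distribution p(h, d) over a hypothesis h
# ("same"/"diff" speaker) and a datum d (a score bucket).
p_joint = {
    ("same", "high_score"): 0.30,
    ("same", "low_score"): 0.10,
    ("diff", "high_score"): 0.15,
    ("diff", "low_score"): 0.45,
}

# Sum rule: marginalise out the hypothesis to get p(d).
p_d = {}
for (h, d), p in p_joint.items():
    p_d[d] = p_d.get(d, 0.0) + p

# Product rule: p(h, d) = p(h | d) p(d), rearranged to give p(h | d).
p_h_given_d = {(h, d): p / p_d[d] for (h, d), p in p_joint.items()}
```

every probabilistic manipulation in the rest of the talk (evidence integrals, posteriors) is, in principle, just repeated application of these two rules.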

0:03:01 | the disadvantage |

0:03:03 | is that in practice

0:03:04 | you usually |

0:03:06 | run into a computational brick wall in pretty short order |

0:03:11 | if you try to to follow these rules |

0:03:14 | consistently |

0:03:16 | so in fact |

0:03:17 | it's really only been in the past ten years |

0:03:21 | that

0:03:22 | this

0:03:22 | field of

0:03:24 | bayesian pattern recognition has really taken off

0:03:27 | and that's thanks to the

0:03:30 | introduction of

0:03:32 | fast

0:03:32 | approximate

0:03:34 | methods

0:03:35 | of

0:03:36 | bayesian inference

0:03:38 | in particular variational bayes

0:03:43 | uh which makes it possible to treat |

0:03:45 | probabilistic models which are |

0:03:47 | far more sophisticated

0:03:50 | than

0:03:50 | was possible in the case of

0:03:53 | traditional statistics

0:03:55 | so the unifying theme in my

0:03:57 | talk will be the application of variational bayes methods

0:04:00 | to the

0:04:02 | speaker recognition problem

0:04:06 | um |

0:04:07 | i start out with the |

0:04:09 | traditional assumptions in joint factor analysis that |

0:04:13 | speaker and channel effects |

0:04:15 | are

0:04:18 | statistically independent

0:04:20 | and

0:04:20 | gaussian distributed

0:04:23 | and in the first part of my talk

0:04:26 | i will simply aim to show

0:04:29 | how joint factor analysis

0:04:31 | can be done

0:04:32 | under these assumptions

0:04:34 | using i-vectors as

0:04:36 | features, in

0:04:38 | a bayesian way

0:04:42 | um |

0:04:42 | this already works very well |

0:04:44 | in my experience it gives better results than joint factor analysis

0:04:49 | uh the second part of my talk will be |

0:04:53 | concerned with how

0:04:54 | variational bayes

0:04:57 | can be used

0:04:58 | to

0:04:59 | model non-gaussian behaviour in the data

0:05:03 | i found that this

0:05:05 | leads to a substantial

0:05:07 | improvement in performance

0:05:10 | and as an added bonus it seems to be possible to do away with the need for

0:05:15 | score normalisation altogether

0:05:22 | the final part of my talk

0:05:25 | is concerned with the problem

0:05:27 | of

0:05:28 | how to

0:05:29 | integrate the assumptions of

0:05:32 | joint factor analysis and cosine distance scoring into a

0:05:35 | coherent framework

0:05:38 | um |

0:05:40 | on the face of it this looks like a hopeless exercise

0:05:43 | the assumptions appear to be completely different

0:05:47 | however

0:05:48 | it is possible to do something about this

0:05:51 | thanks to the flexibility

0:05:53 | provided by variational bayes. so even though this is speculative, i think it's worth

0:05:58 | talking about because it's a real object lesson in how powerful

0:06:03 | these bayesian methods are

0:06:05 | at least potentially

0:06:08 | um |

0:06:10 | before getting down to business let me just say

0:06:12 | something about the way i've organised this presentation

0:06:16 | in preparing the slides i tried to ensure that they were

0:06:20 | reasonably complete and self contained |

0:06:22 | the idea i have in mind is that

0:06:25 | if anyone was interested in reading through the slides afterwards |

0:06:28 | they should tell a fairly complete story |

0:06:31 | okay but |

0:06:32 | because of time constraints i'm going to have to gloss over

0:06:36 | uh |

0:06:37 | some |

0:06:37 | points in the oral presentation

0:06:41 | for the same reason i'm going to have

0:06:44 | in the slides

0:06:46 | to do some hand waving here and there

0:06:48 | um |

0:06:49 | i found

0:06:50 | that by focusing on the gaussian and

0:06:54 | statistical independence assumptions

0:06:57 | i could explain the variational bayes ideas

0:07:00 | with a minimal

0:07:02 | amount of technicalities, so i will spend almost half

0:07:07 | my

0:07:07 | time

0:07:08 | on the first part

0:07:09 | of the

0:07:10 | talk

0:07:11 | uh on the other hand the last part of the talk |

0:07:15 | is

0:07:16 | technical; it is addressed

0:07:18 | primarily

0:07:20 | to

0:07:20 | members of the audience who will have read

0:07:23 | say the chapter on variational bayes

0:07:26 | in bishop's book

0:07:30 | okay |

0:07:35 | okay so here are the

0:07:37 | basic assumptions of factor analysis with

0:07:40 | i-vectors

0:07:41 | as

0:07:42 | features

0:07:43 | um |

0:07:45 | we have used

0:07:46 | D for data, S for speaker, C for channel

0:07:49 | or |

0:07:49 | recording |

0:07:50 | okay we have a collection of recordings per speaker |

0:07:54 | um |

0:07:56 | we assume that the data can be decomposed

0:07:58 | into two statistically independent parts, a speaker part

0:08:01 | and a

0:08:02 | channel part

0:08:04 | these assumptions are questionable but i'm going to stick with them for the

0:08:09 | first part of the talk

0:08:15 | um |

0:08:16 | this

0:08:18 | model

0:08:19 | where we have replaced

0:08:21 | the hidden supervector

0:08:22 | by

0:08:23 | an observable i-vector, already has a name

0:08:26 | it's known in

0:08:28 | face recognition

0:08:30 | as |

0:08:30 | probabilistic

0:08:33 | linear discriminant

0:08:34 | analysis

0:08:36 | you might think of this as

0:08:37 | the true covariance model

0:08:41 | but the other variant is the one that you will find described

0:08:45 | in the literature

0:08:49 | it's not

0:08:50 | perhaps quite as straightforward as it appears

0:08:52 | because |

0:08:54 | if you're dealing with high dimensional features, for example

0:08:57 | mllr features |

0:08:59 | you can't treat these covariance matrices

0:09:01 | as being of full rank

0:09:04 | and you need a hidden variable

0:09:07 | representation of the model which is exactly

0:09:10 | analogous to the

0:09:13 | hidden variable description of |

0:09:15 | joint factor analysis |

0:09:19 | so here on the left hand side, D, that's an observable i-vector, not a

0:09:24 | a hidden supervector |

0:09:26 | it turns out to be convenient for the heavy-tailed stuff to refer to the

0:09:31 | eigenvoice matrix

0:09:32 | and the eigenchannel

0:09:34 | matrix using subscripts, U1 and U2

0:09:37 | rather than the traditional names V, U and D

0:09:41 | same thing for the

0:09:43 | hidden variables: the speaker factors are labelled x1

0:09:46 | the channel factors i label x2r

0:09:48 | where r indicates the dependence on the recording

0:09:53 | or the channel

0:09:55 | uh there's one difference here from the um |

0:09:59 | conventional formulation of joint factor analysis: in plda this residual term

0:10:05 | the epsilon |

0:10:08 | which |

0:10:09 | is in general modelled

0:10:11 | by a diagonal covariance or precision matrix

0:10:15 | and it's associated

0:10:16 | traditionally with the channel

0:10:18 | rather than with the speaker

0:10:20 | in jfa i formulated it slightly differently but

0:10:24 | i'm just going to follow

0:10:26 | this model

0:10:28 | in this presentation

0:10:30 | so because the residual epsilon is associated with the channel there are

0:10:35 | two

0:10:36 | noise terms

0:10:37 | there's the contribution of the eigenchannels

0:10:41 | which contributes

0:10:44 | this term to

0:10:47 | the channel variance

0:10:48 | and the contribution of the residual, with its precision matrix, that is to say the inverse

0:10:52 | of the covariance matrix; you can add the two because

0:10:56 | you have

0:10:57 | statistical independence
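the decomposition just described can be sketched as a generative model. this is a minimal sketch, not the talk's implementation; all dimensions and parameter values below are assumed for illustration:

```python
import numpy as np

# Sketch of the Gaussian PLDA generative model for i-vectors:
#   d_r = m + U1 @ x1 + U2 @ x2_r + eps_r
# x1: speaker factors (shared across a speaker's recordings),
# x2_r: channel factors (fresh for each recording),
# eps_r: residual with a diagonal precision matrix.
rng = np.random.default_rng(0)
D, P1, P2 = 400, 120, 50      # i-vector dim, # speaker / channel factors

m = rng.standard_normal(D)                 # global mean
U1 = rng.standard_normal((D, P1)) * 0.1    # eigenvoice matrix
U2 = rng.standard_normal((D, P2)) * 0.1    # eigenchannel matrix
prec_diag = np.full(D, 4.0)                # diagonal residual precision

def sample_speaker(n_recordings):
    """Draw n_recordings i-vectors for one synthetic speaker."""
    x1 = rng.standard_normal(P1)           # one set of speaker factors
    ivectors = []
    for _ in range(n_recordings):
        x2 = rng.standard_normal(P2)       # channel factors per recording
        eps = rng.standard_normal(D) / np.sqrt(prec_diag)
        ivectors.append(m + U1 @ x1 + U2 @ x2 + eps)
    return np.stack(ivectors)

ivecs = sample_speaker(3)
```

note how the statistical independence assumptions appear directly in the code: x1, x2 and eps are drawn independently, and only x1 is held fixed across a speaker's recordings.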

0:11:04 | this is the graphical model that goes

0:11:08 | with that equation

0:11:09 | if you're not familiar with these, let me just take a minute to explain how to read

0:11:14 | these diagrams

0:11:19 | a shaded node like that

0:11:22 | indicates an observable variable

0:11:25 | the black nodes

0:11:27 | indicate

0:11:28 | hidden variables

0:11:30 | the

0:11:31 | dot nodes

0:11:32 | indicate model parameters

0:11:35 | and

0:11:36 | the arrows

0:11:37 | indicate

0:11:38 | conditional dependency

0:11:40 | okay so the |

0:11:43 | the i-vector is assumed to depend on the speaker factors

0:11:46 | the channel factors |

0:11:47 | and the

0:11:48 | residual

0:11:50 | this plate notation indicates that something is

0:11:53 | replicated several

0:11:54 | times

0:11:55 | okay there are several sets of channel factors |

0:11:58 | one for each recording |

0:12:00 | but there's only one set of speaker factors |

0:12:02 | so that's

0:12:03 | outside

0:12:04 | of the plate

0:12:07 | here i've specified

0:12:09 | the parameter lambda

0:12:12 | but i didn't bother

0:12:13 | specifying the distribution

0:12:16 | of the speaker factors because it's understood

0:12:18 | to be

0:12:18 | standard normal

0:12:21 | um |

0:12:24 | so |

0:12:25 | as i mentioned, including the channel factors enables this decomposition here

0:12:31 | but it's not always necessary

0:12:33 | if you have i-vectors of dimension four hundred it's actually possible to model

0:12:39 | full rank |

0:12:40 | or

0:12:41 | rather, full

0:12:42 | precision matrices |

0:12:44 | instead of diagonal ones

0:12:46 | okay and in that case |

0:12:48 | this term doesn't actually contribute anything

0:12:51 | um |

0:12:52 | i have found it useful

0:12:54 | in experimental work to use this term |

0:12:56 | to estimate |

0:12:57 | eigenchannels on microphone data |

0:12:59 | so it's useful to keep it

0:13:02 | and in fact it turns out that these channel factors can always be eliminated at recognition time; that's a

0:13:07 | technical point, i'll come back to it later

0:13:09 | if i can

0:13:15 | okay so how do you do |

0:13:17 | speaker recognition with the plda model

0:13:19 | okay i'm gonna make some |

0:13:21 | provisional assumptions here |

0:13:23 | one is that you've already succeeded in estimating the model parameters |

0:13:27 | the eigenvoices, the eigenchannels, et cetera

0:13:30 | and the other is that you know how to evaluate

0:13:33 | this thing known as the evidence integral |

0:13:36 | okay you have a collection of ivectors associated with each speaker |

0:13:39 | you also have a collection of hidden variables |

0:13:42 | to evaluate the marginal likelihood you have to integrate over the hidden variables

0:13:48 | so |

0:13:49 | let's assume that

0:13:50 | we've tackled these two problems

0:13:53 | uh it turns out that the key to solving both problems in general |

0:13:58 | is to evaluate the posterior distribution of the hidden variables |

0:14:01 | and |

0:14:02 | i'll return

0:14:03 | to that in a minute

0:14:04 | but first i just want to show you how to do speaker recognition

0:14:10 | okay we take the simplest case |

0:14:12 | the

0:14:13 | core condition in the nist evaluation

0:14:16 | one recording which is usually

0:14:18 | designated as test

0:14:19 | another

0:14:20 | designated as

0:14:21 | train, and you're interested

0:14:24 | in the question whether

0:14:26 | the two speakers are the same |

0:14:28 | or different |

0:14:30 | so if the two speakers are the same |

0:14:34 | okay |

0:14:34 | i think it's natural to call that the alternative hypothesis, but there doesn't seem to be a universal convention

0:14:39 | about that

0:14:41 | um |

0:14:43 | then |

0:14:44 | the likelihood

0:14:45 | of the data

0:14:46 | is calculated

0:14:48 | on the assumption that there is a

0:14:50 | common set of speaker factors

0:14:52 | but |

0:14:52 | different channel factors |

0:14:54 | for the two recordings

0:14:58 | on the other hand |

0:14:59 | if the two speakers are different

0:15:02 | then the calculation of these two likelihoods can be done independently, because the speaker factors

0:15:08 | and the channel factors

0:15:09 | are untied

0:15:11 | for the two recordings

0:15:12 | so the point is that everything here is an evidence integral

0:15:16 | okay |

0:15:17 | if you can evaluate the evidence integral |

0:15:19 | you're in business

0:15:22 | a few things to note

0:15:24 | unlike traditional likelihood ratios this is symmetric

0:15:27 | in D1 and D2

0:15:30 | uh it also |

0:15:31 | has |

0:15:33 | an unusual |

0:15:35 | denominator here |

0:15:36 | okay |

0:15:37 | you don't see anything like this |

0:15:39 | in joint factor analysis

0:15:42 | this is something that comes out of

0:15:45 | following

0:15:46 | the bayesian

0:15:49 | party line

0:15:51 | and it's actually

0:15:53 | as we'll see later

0:15:55 | potentially

0:15:57 | an effective method of score normalisation

0:16:01 | and the other |

0:16:02 | point i would like to stress |

0:16:04 | is |

0:16:04 | that you can write down the likelihood ratio for any type of

0:16:08 | speaker recognition problem in the same way

0:16:10 | for instance

0:16:11 | you might have eight conversations

0:16:13 | in training and one conversation in test

0:16:16 | or you might have three conversations in train and two conversations in test

0:16:20 | in all cases |

0:16:21 | it's just a matter of |

0:16:23 | following the rules of probability consistently |

0:16:26 | and you can write down the likelihood

0:16:27 | ratio

0:16:28 | or bayes factor

0:16:29 | as it is

0:16:30 | usually called in this field |
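under the gaussian assumptions the evidence integrals in the bayes factor just described have closed forms, which can be sketched directly: marginally each i-vector is gaussian, and under the same-speaker hypothesis the two i-vectors share the speaker term, giving a cross-covariance. this is a hedged sketch, not the talk's implementation; all dimensions and parameter values are invented:

```python
import numpy as np

rng = np.random.default_rng(1)
D, P1, P2 = 20, 8, 4                      # small dims so the demo runs fast

m = np.zeros(D)
U1 = rng.standard_normal((D, P1)) * 0.5   # eigenvoices
U2 = rng.standard_normal((D, P2)) * 0.3   # eigenchannels
noise_cov = 0.5 * np.eye(D)               # residual covariance

S = U1 @ U1.T                 # between-speaker covariance
C = U2 @ U2.T + noise_cov     # within-speaker (channel + residual)

def gaussian_logpdf(x, mean, cov):
    r = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet
                   + r @ np.linalg.solve(cov, r))

def log_bayes_factor(d1, d2):
    # Numerator: both recordings generated with one set of speaker factors,
    # so the joint covariance has off-diagonal blocks S.
    joint_cov = np.block([[S + C, S], [S, S + C]])
    num = gaussian_logpdf(np.concatenate([d1, d2]),
                          np.concatenate([m, m]), joint_cov)
    # Denominator: the two recordings are scored as independent marginals.
    den = gaussian_logpdf(d1, m, S + C) + gaussian_logpdf(d2, m, S + C)
    return num - den   # symmetric in d1 and d2, as noted in the talk
```

the symmetry in d1 and d2, and the unusual denominator, fall straight out of this construction; extending it to several enrollment or test recordings just means stacking more blocks.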

0:16:36 | the evidence integral

0:16:38 | can be evaluated exactly under gaussian assumptions

0:16:42 | but the calculation

0:16:43 | is rather convoluted

0:16:45 | and if you do |

0:16:46 | relax |

0:16:46 | the gaussian assumptions you can't do it |

0:16:50 | i believe that even in the gaussian case you're better off using variational bayes

0:16:54 | a colleague disagrees

0:16:56 | with this, but i decided to let it stand

0:16:58 | and we can

0:17:00 | go into it later

0:17:03 | if there's time

0:17:06 | the

0:17:07 | key insight

0:17:08 | here

0:17:09 | is |

0:17:09 | that |

0:17:10 | this |

0:17:10 | inequality:

0:17:13 | you can always

0:17:14 | find a lower bound on the evidence with |

0:17:16 | any

0:17:17 | distribution over the hidden factors

0:17:21 | um |

0:17:23 | and i grant you it's not obvious just by looking at it, but the derivation

0:17:27 | turns out to be just a consequence

0:17:28 | of the fact

0:17:29 | that kullback-leibler

0:17:30 | divergences

0:17:31 | are

0:17:33 | nonnegative

0:17:36 | um |

0:17:37 | and what i'll be focusing on is |

0:17:40 | the use of the |

0:17:42 | variational bayes method |

0:17:44 | so |

0:17:45 | um |

0:17:46 | to find a principled

0:17:47 | approximation to the |

0:17:49 | true posterior

0:17:56 | let me just digress a minute to explain why posteriors are the problem

0:18:02 | there's nothing mysterious about this posterior distribution you you just apply bayes' rule this is what you get |

0:18:08 | you can read off this term here from the graphical model

0:18:11 | this is the prior |

0:18:13 | this is the evidence |

0:18:15 | okay |

0:18:16 | it's perfectly straightforward

0:18:17 | the only problem in practice

0:18:19 | is that you can't evaluate it

0:18:21 | exactly |

0:18:22 | evaluating the evidence and evaluating the posterior |

0:18:25 | are |

0:18:25 | two sides of the same problem |

0:18:29 | you can't do it just by numerical integration because these uh |

0:18:33 | these integrals |

0:18:34 | are in hundreds of dimensions |

0:18:38 | um |

0:18:39 | another way of stating the difficulty, which i think is a useful way of thinking about it

0:18:43 | is that |

0:18:45 | whatever factorisations you have in the prior

0:18:47 | get destroyed when you multiply by the likelihood

0:18:51 | factorisations in the prior are

0:18:53 | statistical independence assumptions

0:18:56 | and statistical independence assumptions get destroyed in the posterior

0:19:01 | it's easy

0:19:04 | to see

0:19:05 | why this is

0:19:05 | the case in terms of the graphical model, but as i said i'm going to gloss over

0:19:10 | a few things

0:19:13 | and |

0:19:14 | return to this question of variational bayes

0:19:17 | the um |

0:19:20 | the idea in the variational bayes approximation

0:19:23 | is that

0:19:24 | you acknowledge that

0:19:27 | independence has been destroyed

0:19:29 | in the posterior

0:19:30 | but you go ahead and impose it

0:19:32 | on the posterior anyway

0:19:33 | and you look for

0:19:34 | what's called a variational approximation of the posterior

0:19:38 | variational because it's actually free form |

0:19:40 | as in the calculus of variations: you don't impose any restriction

0:19:45 | on the functional form |

0:19:47 | of

0:19:48 | the approximate posterior

0:19:49 | and there's a standard set of coupled update formulas

0:19:54 | that you can apply here

0:19:56 | they're coupled because this expectation is calculated with the posterior on x2, and

0:20:01 | this

0:20:02 | expectation is calculated with the posterior on x1

0:20:05 | so you have to iterate between the two

0:20:08 | um |

0:20:10 | the nice thing is that this iteration comes with EM-like convergence guarantees

0:20:16 | and |

0:20:17 | it avoids

0:20:19 | altogether the need

0:20:19 | to invert |

0:20:20 | um |

0:20:22 | large sparse block matrices which is the only way you can evaluate the |

0:20:26 | evidence exactly |

0:20:28 | and then

0:20:28 | only in the gaussian case
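the coupled updates just described can be sketched for a small linear-gaussian model. this is a minimal sketch under assumed dimensions and parameter values: each mean update holds the other factor's current posterior mean fixed, and the two are iterated:

```python
import numpy as np

# Mean-field VB for the hidden factors of d_r = U1 x1 + U2 x2_r + eps_r
# (global mean removed). q(x1) and q(x2_r) are each Gaussian; since the
# posterior precisions don't depend on the data, we precompute them and
# iterate only the coupled mean updates.
rng = np.random.default_rng(2)
D, P1, P2, R = 30, 6, 3, 4

U1 = rng.standard_normal((D, P1)) * 0.4
U2 = rng.standard_normal((D, P2)) * 0.4
Lam = 2.0 * np.eye(D)                       # residual precision
data = rng.standard_normal((R, D))          # R recordings of one speaker

Phi1 = np.eye(P1) + R * U1.T @ Lam @ U1     # posterior precision of x1
Phi2 = np.eye(P2) + U2.T @ Lam @ U2         # posterior precision of each x2_r

mean1 = np.zeros(P1)
mean2 = np.zeros((R, P2))
for _ in range(20):                         # VB iterations
    # Update q(x1) using the current expectations of the x2_r:
    resid = data - mean2 @ U2.T
    mean1 = np.linalg.solve(Phi1, U1.T @ Lam @ resid.sum(axis=0))
    # Update each q(x2_r) using the current expectation of x1:
    resid = data - mean1 @ U1.T
    mean2 = np.linalg.solve(Phi2, U2.T @ Lam @ resid.T).T
```

note what is avoided: the exact joint posterior would require inverting one large block matrix coupling x1 with all the x2_r, whereas here only the small P1 x P1 and P2 x P2 systems are ever solved.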

0:20:35 | this posterior distribution, or the variational

0:20:39 | approximation of the posterior distribution, is also

0:20:43 | the key

0:20:44 | to estimating the model parameters

0:20:47 | okay you use a lower bound |

0:20:49 | as a proxy |

0:20:50 | for the likelihood of the evidence |

0:20:53 | and you seek to

0:20:54 | optimise the lower bound

0:20:56 | calculated |

0:20:57 | over |

0:20:58 | uh a collection of training speakers |

0:21:01 | here i've just

0:21:02 | taken the definition and

0:21:04 | rewritten it this way

0:21:06 | it's convenient to do this because this term here doesn't involve the model

0:21:11 | parameters at all |

0:21:13 | so the |

0:21:14 | first |

0:21:15 | approach to the

0:21:16 | problem would be just to

0:21:18 | optimise

0:21:19 | this term here

0:21:21 | you get the contribution

0:21:24 | to the

0:21:25 | evidence criterion by summing this over all speakers

0:21:32 | okay um |

0:21:33 | this |

0:21:33 | when you work it out

0:21:35 | turns out to be formally identical

0:21:39 | to

0:21:41 | um |

0:21:41 | probabilistic principal components analysis |

0:21:44 | it's just a least squares problem |


0:21:51 | and it's actually the E M auxiliary function for probabilistic principal components analysis |

0:21:58 | the only difference is that you have to use the variational posterior

0:22:02 | rather than

0:22:03 | the exact

0:22:04 | posterior

0:22:07 | there is another way of

0:22:10 | doing the estimation

0:22:11 | which

0:22:13 | i call minimum divergence

0:22:15 | estimation; this has created a good deal of confusion over the years so i'll

0:22:20 | try and explain it briefly

0:22:23 | the idea is to concentrate on this term here

0:22:27 | it's independent of the model parameters |

0:22:29 | okay |

0:22:30 | but you can

0:22:32 | make

0:22:33 | certain changes of variables here

0:22:36 | which

0:22:37 | minimise this divergence but are constrained in such a way as to preserve the value of the auxiliary

0:22:44 | function |

0:22:46 | and if you minimise

0:22:48 | these divergences while

0:22:50 | keeping this term fixed

0:22:51 | you will then

0:22:52 | increase

0:22:52 | the

0:22:54 | value of the evidence

0:22:56 | criterion

0:22:56 | criterion |

0:23:00 | the way this works

0:23:02 | say in the case of speaker factors |

0:23:04 | to minimise the divergence |

0:23:06 | what you do is you look for |

0:23:08 | affine transformations of the speaker factors such that the first and second order

0:23:13 | moments

0:23:16 | of the speaker factors

0:23:17 | agree on average

0:23:19 | over the

0:23:21 | speakers in the training set

0:23:21 | uh speakers in the training set |

0:23:23 | with |

0:23:23 | the |

0:23:24 | first order moment of the prior and the second order moment |

0:23:27 | of the prior

0:23:27 | that's just a matter of

0:23:30 | finding an affine transformation

0:23:32 | that satisfies

0:23:33 | this condition; you then apply

0:23:35 | the inverse transformation |

0:23:37 | to update the model parameters |

0:23:39 | in such a way as to keep the value of the |

0:23:43 | EM auxiliary function fixed

0:23:46 | and it turns out that if you |

0:23:49 | interleave these two steps

0:23:52 | you will be able to accelerate

0:23:56 | the convergence |
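the minimum-divergence step just described can be sketched as moment matching. this is a hedged sketch; the posterior statistics below are invented stand-ins (not real E-step output), and a single shared posterior covariance is assumed for simplicity:

```python
import numpy as np

# After an E-step, the per-speaker posteriors of the speaker factors have
# some empirical first and second moments. Find the affine change of
# variables that makes them standard normal on average, and push its
# inverse into the model parameters so m + U1 @ x1 is unchanged.
rng = np.random.default_rng(3)
P1, D, S = 5, 12, 200

m = rng.standard_normal(D)
U1 = rng.standard_normal((D, P1))
post_means = rng.standard_normal((S, P1)) * 1.7 + 0.3  # stand-in posteriors
post_cov = 0.2 * np.eye(P1)        # shared posterior covariance (simplified)

mu = post_means.mean(axis=0)                            # empirical mean
second = post_cov + (post_means - mu).T @ (post_means - mu) / S
T = np.linalg.cholesky(second)     # second = T @ T.T

# Transformed factors x1' = T^{-1} (x1 - mu) have zero mean and identity
# second moment on average; compensate in the model parameters:
m_new = m + U1 @ mu
U1_new = U1 @ T
# m_new + U1_new @ x1' equals m + U1 @ x1 for every x1, so the value of
# the auxiliary function is preserved while the prior divergence drops.
```

interleaving this with the least-squares update of U1 is what gives the acceleration mentioned in the talk.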

0:23:59 | so

0:24:01 | just one comment

0:24:03 | about this

0:24:04 | what i set out to do here is to produce point estimates

0:24:08 | of the

0:24:10 | eigenvoice matrix and the eigenchannel matrix

0:24:15 | if you are a really hardcore bayesian you don't allow point estimates

0:24:20 | into your |

0:24:22 | model you have to do everything in terms of |

0:24:24 | prior probabilities |

0:24:26 | and

0:24:26 | posterior probabilities |

0:24:29 | so a true blue bayesian would put a prior

0:24:32 | on the eigenvoices and calculate the posterior |

0:24:35 | again by |

0:24:36 | variational bayes |

0:24:38 | even the |

0:24:39 | number of speaker factors |

0:24:40 | could be treated as a hidden random variable |

0:24:43 | okay and the posterior distribution could be calculated |

0:24:46 | again by |

0:24:47 | variational

0:24:47 | bayes

0:24:49 | so there is |

0:24:50 | an extensive literature |

0:24:52 | on this |

0:24:53 | on this subject |

0:24:54 | uh |

0:24:55 | and i'd say that if there's one problem with variational bayes

0:24:59 | it provides too much flexibility |

0:25:01 | you have to |

0:25:01 | exercise good judgement |

0:25:03 | as to which things |

0:25:05 | you should try |

0:25:07 | and which things are probably not

0:25:09 | going to help

0:25:10 | in other words don't lose sight of

0:25:12 | your engineering objective

0:25:15 | and the particular thing i chose |

0:25:17 | to focus on was

0:25:19 | the |

0:25:20 | gaussian assumption |

0:25:21 | okay |

0:25:22 | uh as far as i can see |

0:25:25 | the gaussian assumption is just not realistic |

0:25:28 | for the

0:25:30 | kind of data that

0:25:31 | we're dealing with

0:25:34 | and what i set out to do using variational bayes |

0:25:37 | was to replace |

0:25:39 | the |

0:25:41 | gaussian assumption, with its

0:25:41 | exponentially decreasing tails, by

0:25:44 | a power-law distribution

0:25:46 | which allows

0:25:48 | for

0:25:50 | outliers:

0:25:51 | exceptional

0:25:52 | speaker effects or

0:25:53 | severe channel distortions |

0:25:55 | uh in the data |

0:25:57 | and this term black swan is amusing |

0:26:01 | the

0:26:02 | romans had a phrase, "a rare bird, much like a black

0:26:06 | swan"

0:26:07 | intended to convey the notion of something impossible or inconceivable

0:26:12 | and they were in no position to know that black swans actually do exist

0:26:17 | in australia

0:26:21 | a financial forecaster by the name of

0:26:23 | taleb

0:26:25 | a few years ago wrote a polemic

0:26:28 | against the gaussian distribution called |

0:26:30 | the black swan |

0:26:33 | it was actually written before the market

0:26:36 | crashed in two thousand and eight, which of course is the

0:26:39 | mother of all black swans

0:26:41 | and |

0:26:42 | as a result

0:26:43 | it made

0:26:44 | quite a big

0:26:45 | media splash

0:26:50 | okay it turns out that the

0:26:53 | textbook definition of

0:26:56 | the student's t distribution, the one which i'm

0:26:59 | going to use in place of the gaussian distribution, is not workable

0:27:03 | with variational bayes

0:27:06 | there is another construction that represents

0:27:09 | the student's t distribution

0:27:12 | as a continuous mixture of

0:27:15 | normal random variables

0:27:17 | it's based on the gamma distribution, a unimodal distribution

0:27:21 | on the positive reals which has two parameters that enable you to adjust the

0:27:26 | mean and the variance independently of each other

0:27:31 | the way it works is

0:27:31 | this

0:27:32 | okay in order to |

0:27:34 | sample from a student's T distribution |

0:27:40 | you start with a gaussian distribution with precision matrix lambda |

0:27:45 | you then

0:27:46 | scale

0:27:47 | the covariance matrix by a random scale factor drawn from the

0:27:53 | gamma distribution

0:27:55 | and then you sample from the |

0:27:57 | normal distribution with the modified covariance matrix |

0:28:00 | it's that random scale factor that

0:28:04 | introduces the heavy-tailed

0:28:06 | behaviour |

0:28:08 | um |

0:28:09 | the parameters of the

0:28:11 | gamma distribution

0:28:14 | determine

0:28:15 | the extent to which this thing

0:28:17 | is heavy-tailed: you have the gaussian at one extreme

0:28:21 | at the other extreme you have something called the cauchy distribution, which is

0:28:25 | so heavy-tailed that the

0:28:27 | variance is infinite

0:28:29 | this term "degrees of freedom" comes from classical statistics but it doesn't have any particular meaning

0:28:36 | in this context
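the scale-mixture construction just described can be sketched directly; the degrees of freedom nu and the dimensions below are illustrative:

```python
import numpy as np

# Sample a heavy-tailed (Student's t) variable: draw a random precision
# scaling u from Gamma(nu/2, nu/2) (which has mean 1), then sample from a
# Gaussian whose covariance is inflated by 1/u. Occasionally u is small,
# the covariance blows up, and an extreme draw ("black swan") appears.
rng = np.random.default_rng(4)

def sample_heavy_tailed(mean, cov, nu, n):
    draws = []
    for _ in range(n):
        u = rng.gamma(shape=nu / 2.0, scale=2.0 / nu)   # random scale factor
        draws.append(rng.multivariate_normal(mean, cov / u))
    return np.array(draws)

x = sample_heavy_tailed(np.zeros(2), np.eye(2), nu=3.0, n=5000)
```

as nu grows the gamma draws concentrate at 1 and the gaussian is recovered; small nu gives cauchy-like tails.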

0:28:39 | okay |

0:28:40 | so for example |

0:28:42 | suppose you want to make the |

0:28:44 | channel factors heavy-tailed

0:28:47 | in order to model |

0:28:48 | outlying

0:28:49 | channel distortions

0:28:53 | what you have to do here is

0:28:55 | so |

0:28:56 | remember |

0:28:57 | there's one set of channel factors

0:28:58 | for each recording so this is inside the plate |

0:29:02 | you associate a random scale factor |

0:29:05 | okay with that |

0:29:07 | hidden random variable |

0:29:09 | and that random scale factor

0:29:12 | is |

0:29:12 | sampled |

0:29:13 | from |

0:29:14 | a gamma distribution |

0:29:16 | controlled by a degrees-of-freedom parameter

0:29:19 | so heavy-tailed plda does this

0:29:22 | for all of the

0:29:24 | hidden variables

0:29:25 | in the

0:29:27 | gaussian plda model

0:29:29 | the speaker factors

0:29:31 | have an associated

0:29:32 | random scale factor

0:29:35 | the channel factors

0:29:37 | also have a random scale factor

0:29:39 | and the residual

0:29:40 | has an associated random scale

0:29:42 | factor

0:29:43 | so |

0:29:44 | in fact |

0:29:45 | all i've introduced here are just three extra

0:29:48 | parameters |

0:29:49 | three extra degrees of freedom |

0:29:51 | in order to |

0:29:53 | model |

0:29:53 | the |

0:29:54 | the heavy-tailed

0:29:55 | behaviour |

0:29:58 | yeah |

0:29:59 | these are some technical points

0:30:02 | about how

0:30:04 | you can

0:30:06 | carry over variational bayes from the gaussian case to the heavy-tailed case and do so

0:30:11 | in a computationally efficient way

0:30:14 | um |

0:30:16 | i refer you to the paper for these |

0:30:18 | the |

0:30:19 | key point that i would like to draw your attention to |

0:30:22 | is that these numbers of degrees of freedom

0:30:25 | can actually be estimated |

0:30:27 | using the same evidence criterion |

0:30:30 | as the eigenvoices |

0:30:32 | and the eigenchannels |

0:30:38 | okay, here are some results: this is a comparison of gaussian PLDA and heavy-tailed PLDA on several conditions of the NIST 2008 evaluation

0:30:55 | this is the equal error rate and the 2008 detection cost function

0:31:02 | it's clear that in all three conditions there's a very dramatic reduction in errors, at both the DCF operating point and the equal error rate

0:31:15 | this was done without score normalisation; if you do score normalisation, what happens is this

0:31:22 | you get a uniform improvement in all cases with gaussian PLDA, and a uniform degradation with the Student's t distribution

0:31:33 | so in the Student's t case, not only does score normalisation not help you, it's a nuisance

0:31:46 | let me just say a word about score normalisation

0:31:50 | it's usually needed in order to set the decision threshold in speaker verification in a trial-dependent way

0:32:01 | it's typically computationally expensive, and it complicates life if you ever have to do cross-gender trials

0:32:11 | on the other hand, if you have a good generative model for speech, in other words if you insist on the probabilistic way of thinking, there should be no need for score normalisation, just as there should be no need for calibration; but we're not there yet

0:32:29 | in practice it's needed because of outlying recordings, which tend to produce exceptionally low scores for all of the trials in which they are involved
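for concreteness, a standard normalisation of this kind is z-norm, which standardises a raw score by the model's score distribution against an impostor cohort; a hypothetical sketch (the cohort, the toy cosine scorer, and all sizes are made up for illustration):

```python
import numpy as np

def znorm(raw_score, model, cohort, score_fn):
    """Z-norm: standardise a raw score using the mean and standard
    deviation of the model's scores against a cohort of impostors."""
    cohort_scores = np.array([score_fn(model, c) for c in cohort])
    return (raw_score - cohort_scores.mean()) / cohort_scores.std()

# toy example with cosine scoring on random "i-vectors"
rng = np.random.default_rng(1)
cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
model = rng.standard_normal(400)
cohort = rng.standard_normal((50, 400))
target = model + 0.1 * rng.standard_normal(400)   # same-speaker test segment
s = znorm(cos(model, target), model, cohort, cos)
print(s > 3.0)  # a genuine trial stands far above the impostor distribution
```

an outlying recording shifts its whole score distribution downward, which is exactly what this per-model standardisation compensates for.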

0:32:43 | and what the Student's t distribution appears to be doing is this: the extra hidden variables, these scale factors that i introduced, appear to be capable of modelling this outlier behaviour adequately, thus doing away with the need for score normalisation

0:33:08 | i should say a couple of words about microphone speech

0:33:13 | the situation with telephone speech seems to be quite clear: gaussian PLDA with score normalisation gives results which are comparable to cosine distance scoring

0:33:24 | heavy-tailed PLDA gives better results, at least on the 2008 data, and in general about twenty-five percent better than traditional joint factor analysis

0:33:36 | but it turns out to break down, in an interesting way, on microphone speech

0:33:46 | now, Najim yesterday described an i-vector extractor of dimension six hundred which could be used for recognition on both microphone and telephone speech

0:33:59 | so we started out by training a model using only telephone speech: speaker factors, with the residual modelled by a full precision matrix

0:34:09 | then we augmented that with eigenchannels, and everything was treated in the heavy-tailed way

0:34:17 | what turned out, unfortunately, is that we ran straight into the Cauchy distribution for the microphone transducer effect

0:34:30 | what that means is that the variance of the channel effects for the microphone data is infinite

0:34:37 | and it's a short step to realise that if you have infinite variance for channel effects, you're not able to do speaker recognition

0:34:46 | so i haven't been able to fix this; at present the best strategy would seem to be to project away the troublesome dimensions using some type of LDA

0:35:00 | that's Najim's strategy, which i believe will be talked about in the next presentation

0:35:09 | okay, now i come to the third part of my talk, which concerns the question of how it would be possible to integrate joint factor analysis or PLDA and cosine distance scoring, or something resembling it, in a coherent probabilistic framework

0:35:36 | if you haven't seen these types of scatter plots, they are very interesting: each colour here represents a speaker, and each point represents an utterance of speech

0:35:56 | this is a plot of supervectors projected onto what is essentially the first two i-vector components

0:36:07 | so you see what's going on here; this is the real motivation for cosine distance scoring

0:36:13 | cosine distance scoring ignores the magnitudes of the vectors and uses only the angle between them as the similarity measure
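the scoring rule just described can be written in a few lines; a minimal sketch (the two-dimensional vectors are toy values chosen to make the point):

```python
import numpy as np

def cosine_score(x, y):
    """Cosine distance scoring: compare two i-vectors by the angle
    between them, ignoring their magnitudes entirely."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

v = np.array([3.0, 4.0])
w = np.array([6.0, 8.0])     # same direction, twice the magnitude
print(cosine_score(v, w))    # 1.0: magnitude plays no role
print(cosine_score(v, np.array([-4.0, 3.0])))  # 0.0: orthogonal vectors
```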

0:36:27 | and this is completely inconsistent with the assumptions of joint factor analysis, because there seems to be, for each speaker, a principal axis of variability that passes through the speaker's mean

0:36:42 | the session variability for a speaker is augmented in a particular direction, namely the direction of the mean vector

0:36:48 | whereas JFA or PLDA assumes that you can model session variability for all speakers in the same way; that's the statistical independence assumption in JFA

0:37:11 | i have to add a caveat in interpreting these plots: you have to be careful, because they are not indifferent to the way you estimate the supervectors and so on

0:37:25 | we do find these plots with i-vectors, but we have to cherry-pick the results in order to get nice pictures like the one i showed you

0:37:34 | but the principal evidence for this type of behaviour, which i call directional scattering, is the effectiveness of the cosine distance measure in speaker recognition

0:37:51 | i don't know how to account for it, and i'm not concerned with that question; the only question i would like to answer is how to model this type of behaviour probabilistically

0:38:05 | okay, as i said, this part is going to get a bit technical; it's addressed to people who have read the chapter on variational bayes in Bishop's book

0:38:18 | in order to get a handle on this problem there seems to be a natural strategy

0:38:23 | instead of representing each speaker by a single point, a mean vector in the speaker factor space, represent each speaker by a distribution which is specified by a mean vector mu and a precision matrix lambda

0:38:42 | the i-vectors are then generated by sampling "speaker factors" from this distribution

0:38:46 | i put that in inverted commas because the speaker factors vary from one recording to another, just as the channel factors do, but the mechanism by which they are generated is quite different; that's the point i'm coming to in a moment

0:39:04 | the trick is to choose the prior on the mean and precision matrix of each speaker in such a way that mu and lambda are not statistically independent

0:39:15 | because what you want is a precision matrix for each speaker which varies with the location of the speaker's mean vector

0:39:28 | and of course, once you set this up, you're immediately going to run into problems: there's no hope of doing point estimation of the precision matrix if you only have one or two observations of the speaker

0:39:42 | you have to follow the rules of probability, that is to say, integrate over the prior, and the way to do that of course is with variational bayes

0:39:56 | okay, so here's how it's done

0:39:58 | there seems to be only one natural prior on precision matrices, namely the Wishart prior

0:40:08 | i won't talk about this in detail; i've just put it down there so that if you're interested you'll be able to recognise that it's a generalisation of the gamma distribution

0:40:18 | if you take the dimension equal to one, this reduces to the gamma distribution; in higher dimensions it's concentrated on positive definite matrices
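a Wishart-distributed precision matrix can be sampled with the Bartlett decomposition; a numpy-only sketch (the dimension and degrees of freedom are arbitrary choices for illustration, and in one dimension with unit scale the draw is just a chi-squared, i.e. gamma, variate):

```python
import numpy as np

def sample_wishart(df, scale, rng):
    """Bartlett decomposition: A is lower triangular with chi-distributed
    diagonal entries and standard-normal entries below the diagonal; then
    (L A)(L A)^T ~ Wishart(df, scale), where scale = L L^T."""
    d = scale.shape[0]
    L = np.linalg.cholesky(scale)
    A = np.zeros((d, d))
    for i in range(d):
        A[i, i] = np.sqrt(rng.chisquare(df - i))
        A[i, :i] = rng.standard_normal(i)
    LA = L @ A
    return LA @ LA.T

rng = np.random.default_rng(2)
P = sample_wishart(df=10.0, scale=np.eye(3), rng=rng)
print(np.allclose(P, P.T))                  # symmetric
print(np.all(np.linalg.eigvalsh(P) > 0))    # positive definite
# with d = 1 and unit scale, a draw is chi-squared with df degrees of
# freedom, i.e. gamma(df/2, scale=2): the one-dimensional special case
```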

0:40:32 | there is a parameter called the number of degrees of freedom again, which determines how peaked this distribution is

0:40:42 | this point i think is worth mentioning: there's no loss of generality in assuming that W, the scale matrix here, is equal to the identity

0:40:51 | the reason this is worth mentioning is that it turns out to correspond exactly to something that Najim does in his processing

0:41:01 | if you're familiar with his work, you know that he estimates a WCCN matrix in the speaker space and then whitens the data with that matrix before evaluating the cosine distance
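the WCCN-style preprocessing just mentioned can be sketched on synthetic data (the class structure, dimensions, and stretched noise axis below are all made up): estimate the within-class covariance, then whiten with its inverse Cholesky factor so the within-class covariance becomes the identity.

```python
import numpy as np

rng = np.random.default_rng(3)
# synthetic i-vectors: 20 speakers x 10 recordings, with the within-class
# scatter deliberately stretched along the first axis
means = rng.standard_normal((20, 4))
noise = rng.standard_normal((20, 10, 4)) * np.array([3.0, 1.0, 1.0, 1.0])
data = means[:, None, :] + noise

# within-class covariance: covariance of recordings about speaker means
centered = data - data.mean(axis=1, keepdims=True)
W = np.einsum('sri,srj->ij', centered, centered) / (20 * 10)

# WCCN-style whitening: apply the inverse Cholesky factor of W, so the
# within-class covariance of the transformed data is the identity
B = np.linalg.inv(np.linalg.cholesky(W))
white = centered @ B.T
W_white = np.einsum('sri,srj->ij', white, white) / (20 * 10)
print(np.allclose(W_white, np.eye(4), atol=1e-8))
```

after this transformation, assuming the scale matrix of the Wishart prior is the identity costs nothing, which is the correspondence the talk points out.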

0:41:23 | okay, the first thing then: we have generated the precision matrix for the speaker, and the next step is to generate the mean vector for the speaker

0:41:34 | you do that using a Student's t distribution

0:41:39 | once you have a precision matrix, that's all you need: if you just add in a gamma distribution, you can sample the mean vector according to a Student's t distribution

0:41:51 | it's explained in the paper why you need to use the Student's t distribution

0:41:59 | the point i would just like to draw your attention to at this stage is that because the distribution of mu depends on lambda, the conditional distribution of lambda depends on mu

0:42:15 | that means that the precision matrix for a speaker depends on the location of the speaker in the speaker factor space, so you have some hope of modelling this directional scattering

0:42:35 | i'll skip that and go to the graphical model

0:42:42 | i think it's clear from this; remember, when you're confronted with something like this, everything inside the plate is replicated for each of the recordings of a speaker, and everything outside the plate is done once per speaker

0:42:58 | okay, so the first step is to generate the precision matrix

0:43:04 | you then generate the mean for the speaker by sampling from a Student's t distribution; i call the hidden scale factor w, and the parameters of the gamma distribution alpha and beta

0:43:16 | once you have the mean and the precision matrix, you generate the speaker factors for each recording (remember, we're making the speaker factors depend on the recording) by sampling from another Student's t distribution
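the generative story in the plate diagram can be sketched end to end; everything below follows the steps just described, but the dimensions and the hyperparameter values (the Wishart degrees of freedom, alpha, beta, and the Student's t degrees of freedom) are illustrative, not the values used in the talk:

```python
import numpy as np

def sample_speaker(n_recordings, dim, wishart_df, alpha, beta, t_df, rng):
    """One pass through the graphical model: precision matrix, then the
    speaker mean via a Student's t, then per-recording speaker factors."""
    # 1) precision matrix Lambda ~ Wishart(wishart_df, I), built as a sum
    #    of outer products of Gaussians (valid for integer df >= dim)
    G = rng.standard_normal((wishart_df, dim))
    Lam = G.T @ G
    cov = np.linalg.inv(Lam)
    # 2) speaker mean mu: hidden scale w ~ Gamma(alpha, rate beta), then
    #    mu ~ N(0, cov / w), which is marginally a Student's t
    w = rng.gamma(alpha, 1.0 / beta)
    mu = rng.multivariate_normal(np.zeros(dim), cov / w)
    # 3) speaker factors, one per recording: scale u_r ~ Gamma(t_df/2,
    #    rate t_df/2), then y_r ~ N(mu, cov / u_r), again heavy-tailed
    u = rng.gamma(t_df / 2.0, 2.0 / t_df, size=n_recordings)
    y = np.stack([rng.multivariate_normal(mu, cov / ur) for ur in u])
    return mu, Lam, y

rng = np.random.default_rng(4)
mu, Lam, y = sample_speaker(5, 3, wishart_df=10, alpha=2.0, beta=2.0,
                            t_df=4.0, rng=rng)
print(mu.shape, Lam.shape, y.shape)  # (3,) (3, 3) (5, 3)
```

note that the mean and the speaker factors share the same precision matrix, which is how the location and the scatter of a speaker become coupled.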

0:43:34 | the interesting thing is that these three parameters, alpha, beta, and the number of degrees of freedom, determine whether or not this whole business is going to exhibit directional scattering

0:43:51 | okay, sorry, this can't be explained without a little calculation

0:43:59 | remember lambda is the precision matrix, so lambda inverse is the covariance matrix, and what i'm comparing here is the distribution of the covariance matrix given the speaker-dependent parameters, and the prior distribution of the covariance

0:44:16 | you see what you have is a weighted average of the prior expectation and another term

0:44:25 | now, this second term here depends on the speaker's mean: it's a rank-one covariance matrix, so the only variability that's allowed is in the direction of the mean vector

0:44:37 | this is, so to speak, variability along a single direction, which is exactly what the doctor ordered for directional scattering
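that weighted-average structure can be checked numerically; a sketch (the mixing weights, dimension, and mean vector are arbitrary) showing that adding a rank-one term built from the mean to an isotropic prior expectation puts the dominant eigenvector of the resulting covariance along the mean direction:

```python
import numpy as np

mu = np.array([2.0, 1.0, 0.0])         # speaker mean, direction of interest
prior = np.eye(3)                      # prior expectation of the covariance
rank_one = np.outer(mu, mu)            # variability only along mu
cov = 0.5 * prior + 0.5 * rank_one     # weighted average, weights made up

vals, vecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
top = vecs[:, -1]                      # eigenvector of the largest eigenvalue
unit_mu = mu / np.linalg.norm(mu)
# the principal axis of variability passes through the speaker's mean
print(np.allclose(np.abs(top @ unit_mu), 1.0))
```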

0:44:46 | i'd draw your attention to the fact that this term here is multiplied by this factor, so it depends on the number of degrees of freedom and on this random scale factor

0:45:03 | so the extent of the directional scattering is going to depend on the behaviour of this term

0:45:17 | it depends in fact on the parameters which govern the distribution of the random scale factor w

0:45:25 | if w has a large mean and a small variance, you can say that this term dominates, so that most of the variability is in the direction of the mean vector

0:45:40 | so in that case directional scattering would be present, to a large extent, for most speakers in the data

0:45:50 | on the other hand, there's another limiting case where you can show that the model reduces to heavy-tailed PLDA again, and there's no directional scattering at all

0:46:00 | so the key question would be to see how this model trains up; to be frank, that is going to take a couple of months, so i don't have any results to report yet

0:46:13 | okay, so in conclusion

0:46:16 | gaussian PLDA is an effective model for speaker recognition, and it's just joint factor analysis with i-vectors as features

0:46:25 | my experience has been that it works better than traditional joint factor analysis, even though the basic assumptions are open to question

0:46:37 | variational bayes allows you to go a long way in relaxing these assumptions: you can model outliers by adding these hidden scale variables, and you can model directional scattering by making the hidden variables depend on each other

0:46:54 | the derivation of the variational bayes update formulas is mechanical; i'm not saying it's always easy, but it is mechanical

0:47:04 | and it comes with rigorous convergence guarantees, so that you have some hope of debugging your implementation

0:47:15 | one caveat is that in practice you have to stay inside the exponential family in order to make it work

0:47:23 | i'm also personally of the opinion that in order to get the full benefit of these methods we need what i would call informative priors

0:47:33 | that is to say, prior distributions on the hidden variables whose parameters can be learned; i use the word "learned" because "estimated" isn't really appropriate here

0:47:44 | and this is a strong argument for larger training sets

0:47:48 | the example is that all of the hidden variables that i've just described are controlled by a handful of scalar degrees of freedom, and these can all be estimated using the evidence criterion from training data
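as a one-dimensional illustration of picking a degrees-of-freedom parameter by an evidence-style criterion (the data, the grid of candidate values, and the zero-mean unit-scale assumption are all synthetic choices for this sketch): choose the nu that maximises the log likelihood of a Student's t fit to heavy-tailed data.

```python
import numpy as np
from math import lgamma, log, pi

def t_loglik(x, nu):
    """Log likelihood of a zero-mean, unit-scale Student's t with nu
    degrees of freedom, summed over the data."""
    c = lgamma((nu + 1) / 2) - lgamma(nu / 2) - 0.5 * log(nu * pi)
    return float(np.sum(c - (nu + 1) / 2 * np.log1p(x ** 2 / nu)))

rng = np.random.default_rng(5)
x = rng.standard_t(df=3.0, size=20000)            # heavy-tailed data
grid = [1.0, 2.0, 3.0, 5.0, 10.0, 30.0, 100.0]    # candidate nu values
best = max(grid, key=lambda nu: t_loglik(x, nu))
print(best)  # the criterion picks a small nu, i.e. heavy tails
```

on genuinely gaussian data the same search drifts to the large-nu end of the grid, which is the sense in which the tail weight is learned rather than hand-tuned.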

0:48:09 | now, to be fair: the advantage of probabilistic methods is that you have a logically coherent way of reasoning in the face of uncertainty

0:48:19 | the disadvantage is that it takes time and effort to master the techniques and to program them

0:48:29 | if your principal concern is to get a good system up and running quickly, i would recommend something like cosine distance scoring

0:48:42 | on the other hand, if you're interested in mastering this family of methods, i think there are really only three things you need to look at

0:48:51 | there's the original paper by Prince and Elder on probabilistic linear discriminant analysis in face recognition; that's the gaussian case

0:49:04 | everything you need to know about probability is in Bishop's book, which i highly recommend; it's very well written and it starts from first principles

0:49:15 | and this is the paper; i don't believe it has actually found its way into the proceedings, but it is available online

0:49:28 | okay, thank you very much

0:49:43 | right, the talk is now open for questions


0:50:02 | thanks for the presentation, which was very enlightening

0:50:10 | as you said, if you want a quick solution you can do it one way, and if you want a more principled solution, the other

0:50:23 | but one thing i noticed is that the starting point of your algorithm is this: you have a speech utterance, you use factor analysis to summarise it as an i-vector, and you completely ignore the uncertainty in that process; only from that point on do you keep track of the uncertainty

0:50:41 | so why do you do it like that?

0:50:44 | why do i do it like that? it's an entirely empirical decision, based on the effectiveness of Najim's cosine distance scoring; it just works really well

0:50:56 | attempts, on my part at least, to incorporate the uncertainty in the i-vector estimation procedure don't seem to have paid off; they complicate life

0:51:08 | it's really empirical; it's not dictated by dogma

0:51:24 | i have one question regarding the results you presented: one of the categories was the short conversation condition, the ten-second data

0:51:36 | when you ran your heavy-tailed setup on the ten-second data, how did you handle score normalisation?

0:51:56 | well, the best results were obtained without score normalisation, so there was no question of introducing a cohort

0:52:06 | your question, maybe, is whether in the gaussian case, where you do need score normalisation, you should estimate the normalisation statistics on ten-second utterances

0:52:30 | my experience has been, and this isn't black or white, that it's better not to use the ten-second data for that

0:52:40 | this points to an interesting aspect of i-vectors: they perform very well on the ten-second condition

0:52:53 | in other words, the estimation of i-vectors is much less sensitive to short durations than, say, relevance MAP

0:53:11 | i have a question about the math: you make an assumption that the latent variables exhibit a gaussian, or Student's t, distribution at the last stage

0:53:26 | is there a nonparametric way to do this, without making such assumptions?

0:53:32 | so, i think i was careful to use Student's t distributions everywhere; it's that which gives me the flexibility to model outliers and directional scattering

0:53:44 | does that answer your question?

0:53:46 | yes, but you still need some parametric assumption to model it at the last stage

0:53:55 | variational bayes does require that, and in fact there's an extra restriction: you have to stay inside the exponential family, unfortunately

0:54:07 | the art consists in achieving what you want to do subject to those constraints

0:54:15 | is that an adequate response?

0:54:18 | yeah |

0:54:34 | i have a question about the priors: how did you set their parameters? did you have to tune them by hand?

0:54:48 | well, in fact we used the evidence criterion; exactly the same criterion for estimating these numbers of degrees of freedom as we used for estimating the eigenvoices and the eigenchannels

0:55:02 | so it's completely consistent; there was no manual tuning

0:55:07 | thank you |

0:55:21 | so, is there another question? let me think; okay

0:55:32 | because |