0:00:14 Hi, everyone. I'm not going to apologize for the quality of my slides, because I don't think fancy slides really matter; if you make an effort to make fancy slides, people seem to think you're trying to compensate for something. That's my excuse, anyway.
0:00:33 Okay, so this talk is actually very similar to some of the others in this session, and it's a strange session in that almost all of these talks are about some kind of faster fMLLR, with some kind of factorization.

0:00:53 Okay, so first: should we call it CMLLR, or should we call it fMLLR? For some reason I've gone over recently to the CMLLR side, but I'm hedging my bets in the actual talk. I think everyone really understands that they're the same thing.
0:01:12 I'm not going to go through this slide in detail, because it's the same as many slides that have been presented already. The notation is a little bit different: this notation with a little plus is kind of my personal notation, but I use it because I'm not comfortable with the Greek letters; it's hard for me to remember that ξ is supposed to be the same as x. I just think it's easier to remember that the little plus means the appended one. In some people's work it's written with a Greek letter, and that's another slightly confusing difference between people's notations: sometimes people put the one at the end and sometimes at the beginning. I think the IBM convention is to put it at the end, and that's what I'm doing.
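To make the notation concrete, here is a minimal numpy sketch of the appended-one convention (the dimension and the matrices are made up for illustration): x-plus is x with a 1 appended at the end, so the single matrix W = [A b] applies the whole affine transform.

```python
import numpy as np

# Sketch of the "little plus" notation (illustrative sizes, not a real system):
# x_plus is x with a 1 appended at the END, so W = [A  b] acts as A @ x + b.
d = 3                               # toy feature dimension (39 in practice)
rng = np.random.default_rng(0)

x = rng.standard_normal(d)          # feature vector x
x_plus = np.append(x, 1.0)          # x^+ : the appended one goes at the end

A = rng.standard_normal((d, d))     # square part of the transform
b = rng.standard_normal(d)          # offset
W = np.hstack([A, b[:, None]])      # W is d x (d+1): [A  b]

y = W @ x_plus                      # one matrix-vector product
assert np.allclose(y, A @ x + b)    # same as the affine transform A x + b
```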
0:02:08 This is another kind of introductory slide, but since we've had so many introductions to CMLLR, or fMLLR, in this session, I don't really think I need to go through it.
0:02:24 One point I did put in here is that fMLLR works with relatively little adaptation data. Now, here, "a little adaptation data" means about thirty seconds or so; someone else in this session actually mentioned that figure. After about thirty seconds or so you get almost all of the improvement that you're going to get. But thirty seconds is a little bit too much for many practical applications.
0:02:53 If you have some telephone service where someone is going to, say, request a stock quote or do a web search or something, you might only have two or three seconds, or maybe five seconds, of audio, and that's not really enough for fMLLR to work. In fact, below about five seconds, in my experience, it's not going to give you an improvement, and it can actually make things worse, so you might as well turn it off. So that's the problem this talk is addressing, and actually it's the same problem that many previous talks in the session have been addressing; I think all of the previous talks in the session have been addressing this problem.
0:03:36 Okay, this slide summarizes some of the prior approaches, and I should emphasize that I'm talking about prior approaches to somehow regularizing CMLLR. Obviously there are many other things you can do, like eigenvoices, the stuff that the previous speaker mentioned, or methods involving other parameters, but I'm talking about CMLLR regularization.
0:04:02 So, simple things you can do: you can just make the A matrix diagonal. That's an option in HTK, and it's a reasonable approach; you get a lot of improvement, but it's very ad hoc. You can also make A block diagonal. This approach had its origins in the cepstra plus delta and delta-delta type features, so you have three blocks of thirteen by thirteen. Now, nobody really uses those features unprocessed anymore; I don't think any of the serious sites use them anymore without some kind of transformation. But you can still use the block-diagonal structure; in fact, one of the baselines that we'll be presenting uses these blocks, even though they've lost their original meaning, and it still seems to work.
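A small sketch of what that block-diagonal constraint looks like (purely illustrative; the block sizes follow the thirteen-dimensional cepstra plus deltas plus delta-deltas layout just described):

```python
import numpy as np

# Illustrative block-diagonal constraint on A: three 13x13 blocks for the
# static, delta and delta-delta parts of a 39-dimensional feature vector.
block, n_blocks = 13, 3
d = block * n_blocks                           # 39

block_mask = np.zeros((d, d), dtype=bool)      # True where A may be nonzero
for k in range(n_blocks):
    s = k * block
    block_mask[s:s + block, s:s + block] = True

A_full = np.ones((d, d))                       # stand-in for a full transform
A_blockdiag = np.where(block_mask, A_full, 0.0)

# 3 * 13 * 13 = 507 free parameters instead of 39 * 39 = 1521:
print(int(block_mask.sum()))
```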
0:04:51 Another set of approaches is the Bayesian approaches. There have been a couple of different papers, both called fMAPLR. fMAPLR means we do fMLLR, but we have a prior, and we pick the MAP estimate. There was a paper from Microsoft and one from IBM; they had slightly different priors, but it was the same basic idea. This is one of the baselines that we're going to be using in our experiments.
0:05:21 An issue with these approaches is that you would probably like to have a prior that tells you how all of the rows of the transform correlate with each other, but in practice that's not really doable. People generally use priors that are either row-by-row or even completely diagonal, so it's a prior over each individual parameter.
0:05:47 The approach that we are using is parameter reduction using a basis, where the fMLLR matrix is some kind of weighted sum of basis matrices, or prototype matrices.
0:06:06 The basic idea is similar to some of the previous talks. Sometimes people have a factorization and then a basis on, say, the upper or lower half or something like that, but we're talking about a basis expansion of just the raw transform. The basic form of it is given here: the W subscript n are the prototype, or basis, fMLLR matrices, and they're computed in advance somehow. For a given speaker, you have to estimate these coefficients. Well, this is not really a convex problem, but it's solvable in a kind of local sense, and I don't think it's a practical issue.
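The expansion on the slide can be sketched as follows (a toy with made-up sizes; the default matrix, the basis matrices and the coefficients here are random stand-ins):

```python
import numpy as np

# Toy sketch of the basis expansion: a speaker's transform is a default
# matrix plus a weighted sum of precomputed basis matrices,
#   W(s) = W0 + sum_n d_n W_n.
rows, cols = 39, 40                    # fMLLR transform size
N = 4                                  # tiny basis (a few hundred in practice)
rng = np.random.default_rng(0)

W0 = np.hstack([np.eye(rows), np.zeros((rows, 1))])  # default: identity, zero offset
basis = rng.standard_normal((N, rows, cols))         # W_1 ... W_N, trained offline
coeffs = rng.standard_normal(N)                      # d_n, estimated per speaker

W_spk = W0 + np.tensordot(coeffs, basis, axes=1)     # weighted sum over n
assert W_spk.shape == (rows, cols)
```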
0:06:57 In the previous work in this area, you decide the basis size in advance. So you say, let's make it two hundred. The number of parameters in the actual matrix, which is thirty-nine by forty, is a couple of thousand. If you decide in advance to use two hundred coefficients, that does pretty well for typical configurations, if you have, you know, between ten and thirty seconds of speech. But you're going to get a degradation once you have a lot of data, because you're not really estimating all of the parameters that you could estimate; eventually it gets a bit worse when you have a lot of adaptation data.
0:07:38 So that's the closest prior work. There are a couple of differences between what we're describing here and that prior work, which was done at IBM. One is that we allow the basis size to vary per speaker, and we have a very simple rule: we just say that the more data we have, the more coefficients we can estimate, and we make the number of coefficients proportional to the amount of data. Of course, you could do all kinds of fancy stuff with information criteria and so on, but I think this technique is easily complicated enough already without introducing new aspects, so we just picked a very simple rule.
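The simple rule just described might look like this in code (the proportionality constant and the frame rate are assumptions made for the sketch, not values from the talk):

```python
# Sketch of the per-speaker rule: the number of basis coefficients grows
# linearly with the amount of adaptation data, capped at the full parameter
# count. ETA is an assumed constant, not a value from the talk.
ETA = 2.0                    # assumed coefficients per frame of speech
FULL = 39 * 40               # free parameters in a full 39 x 40 transform

def num_coeffs(num_frames: int) -> int:
    """More data, more coefficients, up to the full parameter count."""
    return min(int(ETA * num_frames), FULL)

# At 100 frames per second, 3 s / 10 s / 60 s of audio give:
print([num_coeffs(t * 100) for t in (3, 10, 60)])   # [600, 1560, 1560]
```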
0:08:20 The other aspect is that we have a way of estimating these basis matrices W_n that is a little bit more clever than just doing PCA. And finally, we're just trying to popularize this type of method: we have a journal version of this paper in which we tried to explain very clearly how to implement it, because it really does work, and it's very robust, so I think it's something that I do recommend to you.
0:09:01 This slide probably covers material that I've already discussed, so I'm going to go through it quickly. The point is that tricks like making the transform diagonal or block diagonal are all very well and good, but they're just a bit ad hoc, and they don't give you that much improvement. Also, you sort of have to decide: how much data do we have? Can we afford to do the full transform, or are we going to make it diagonal? There's this trade-off, and you can get into having count cutoffs and stuff, but it's a bit of a mess.
0:09:35 As for the Bayesian methods: if we could have done this in a Bayesian way, that would probably be more optimal, because by picking a basis size you're making a hard decision, whereas with a Bayesian method you could do that in a soft way, and I think making a soft decision is almost always better than making a hard decision. The problem with the Bayesian approach is, first, that it's very hard to estimate the prior. The whole reason we're in this situation is that we don't have a ton of data per speaker, right? And assuming the training data is matched to the testing condition, you're not going to have a lot of data per speaker in your training data either, to estimate the W matrices. So how are you going to estimate a prior, when you don't have good estimates of the things you're trying to put a prior on? Of course, you can do all of these fully Bayesian schemes where you integrate things out and so on, but it just becomes a big headache, plus there are always arbitrary choices to make. What I like about this basis type of method is that you just say we're going to use a basis, and everything just falls out, and it's obvious what to do.
0:10:46 Now I want to talk about how to estimate the basis. Because we're going to decide the number of coefficients at test time, the basis has to be ordered, so that you have the most important elements first and the least important elements last. If we were going to say we'll have exactly two hundred coefficients and that's it, it wouldn't really matter whether they were all mixed up, whatever order they were in. But because we're going to decide the number of coefficients at test time, we need to have this ordering. Now, things like PCA, or SVD, those kinds of methods, do actually give you this ordering. But I'm not very comfortable just saying we're going to do PCA, because, you know, who's to say that that makes sense? One obvious argument why it doesn't make sense is that if you were to scale the different dimensions of your feature vector, that's going to change the solution that PCA gives you. And I mean it's going to change it in a way that's going to affect your decoding, and to me that shows it's not the right thing to do.
0:12:06 The framework that I think is most natural is maximum likelihood: we're going to try to pick the basis that maximizes the likelihood on test data. I don't think I have time to go through the whole argument about what we're doing, but basically we end up doing PCA, but in a slightly preconditioned space. The W is typically a thirty-nine by forty matrix. We want to consider the correlations between the rows, so it's not really convenient to think of it as a matrix; let's think of it as one big vector of size thirty-nine times forty, by concatenating the rows.
0:12:58 I don't know if I can easily explain how well this argument works, but think about the objective function for each speaker. If that objective function were a quadratic function, and if we could somehow do a change of variables so that the quadratic part of that function is just proportional to the unit matrix, then it is possible to show that the right solution is just doing weighted PCA. There's some kind of derivation there; it might be obvious to some people how you derive that, but let's just take it as given for now that that's true. So what we'd like to do is this change of variables, so that the objective function is quadratic for each speaker. Unfortunately, it's not quite possible to do that.
0:13:47 It's not possible for a couple of reasons. First, the objective function is not really quadratic: there's this log determinant, and it's fully nonlinear. You can take a Taylor series approximation around the default matrix, the one that's the identity with a zero offset, and look at the quadratic term in that Taylor series. But remember, W is a big vector here, so the quadratic term is a big matrix, about two thousand by two thousand. And that quadratic term depends on the speaker, so it's not just a constant.
0:14:27 I don't think this matters in a very important way, though. It's possible to make an argument that, once we work out the average of these quadratic terms and then precondition so that that average becomes the unit matrix, the per-speaker matrices are approximately unit. And that's the situation where we'd like them to be unit; but it's not going to make a big difference if they're not quite unit, because all this is doing is the following: we're going to pre-transform everything and then do PCA. And if you don't get the pre-transform exactly right, if the rotations and scalings in the pre-transform are not quite accurate, it's not going to totally change the result of the PCA. What's going to happen is that, let's say, the first eigenvector is going to be mixed up a little bit with the second, and so on. It will still be pretty close to maximum likelihood.
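As a sketch of the idea as described here: concatenate the rows of each speaker's transform into a vector, precondition by the square root of the average quadratic term so that the average becomes the unit matrix, and run ordinary PCA/SVD in that space. Everything below is synthetic stand-in data, and this is not the exact computation from the paper.

```python
import numpy as np

# Sketch of preconditioned PCA (synthetic stand-ins throughout). Each speaker
# transform is vectorized by concatenating its rows; H stands in for the
# average quadratic (Hessian) term.
rng = np.random.default_rng(0)
rows, cols = 4, 5                     # tiny stand-in for 39 x 40
dim = rows * cols

M = rng.standard_normal((dim, dim))
H = M @ M.T + dim * np.eye(dim)       # symmetric positive definite average term

S = 50                                # number of training speakers
vecs = rng.standard_normal((S, dim))  # vectorized per-speaker transforms

# Change of variables v' = H^{1/2} v makes the average quadratic term unit:
evals, evecs = np.linalg.eigh(H)
H_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
H_neg_half = evecs @ np.diag(evals ** -0.5) @ evecs.T

precond = vecs @ H_half                                  # preconditioned space
_, _, Vt = np.linalg.svd(precond, full_matrices=False)   # ordered directions

# Undo the change of variables and reshape each direction into a basis matrix:
basis = [(H_neg_half @ Vt[n]).reshape(rows, cols) for n in range(3)]
```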
0:15:36 Now, I think I've already covered this material. The training-time computation basically involves computing big matrices like this one and doing an SVD and so on; it's all described in the journal paper. At test time there's an iterative procedure to estimate the coefficients, the d subscript n. It's not a convex problem, but it's pretty easy to get a local optimum: we just use steepest ascent, and we do a pretty exact line search. Again, it's all described in the paper.
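To illustrate steepest ascent with an exact line search, here is a toy on a synthetic concave quadratic stand-in, not the real fMLLR auxiliary function; for a quadratic, the exact step size along the gradient has a closed form.

```python
import numpy as np

# Toy steepest ascent with exact line search. The objective
# Q(d) = g.d - 0.5 d'Pd is a synthetic concave stand-in, NOT the real
# fMLLR auxiliary function.
rng = np.random.default_rng(0)
N = 5
M = rng.standard_normal((N, N))
P = M @ M.T + N * np.eye(N)           # positive definite
g = rng.standard_normal(N)

def gradient(d):
    # Gradient of Q(d) = g.d - 0.5 d'Pd
    return g - P @ d

d = np.zeros(N)
for _ in range(200):
    direction = gradient(d)
    if np.linalg.norm(direction) < 1e-12:
        break
    # Exact line search along the gradient (closed form for a quadratic):
    step = (direction @ direction) / (direction @ P @ direction)
    d = d + step * direction

# For this toy objective the unique optimum is d* = P^{-1} g:
assert np.allclose(d, np.linalg.solve(P, g), atol=1e-8)
```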
0:16:17 So, these are the results; this slide requires a little bit of explanation. We have two test sets: one is short utterances and one is long. One is a digits type of task, digits and stocks and stuff, and one is voice search. We divide each of those into four subsets based on the length, and the x axis is the length of the utterance: ten to the zero is one second, ten to the one is ten seconds. Each of the points on these lines corresponds to a bin of test data; we divide it up, and the bins on the left-hand side are the short utterances, and on the right are the long ones. And each point is a relative improvement.
0:17:13 The triangles, that line on the bottom, that's just regular fMLLR, and it's actually making things worse for the first three bins, and then it helps a bit. The absolute word error rate kind of jumps up and down a bit because it's different types of data; this is maybe not the ideal data to test this on, but you've got to look at the relative improvements of these things. The very top line is our method, and it's doing the best, so it's giving a lot more improvement than the other methods. The fMAPLR, the three-block, and the diagonal ones don't differ a lot; they're a bit better than doing regular CMLLR, and you get some improvement for the shorter amounts of data, but we get a lot more improvement from our method.
0:18:05 And I mean, the story is that if you have, let's say, between about three and ten seconds of data, I think this method will be a big improvement versus doing fMAPLR or the diagonal approach or whatever. But if you have, say, more than thirty seconds, it really doesn't make a difference.
0:18:26 So, I think I have pretty much covered all of this; I'm being asked to wrap up. I think we've covered all of this.
0:18:36 So, I recommend that journal paper if you want to implement this, because I do describe very clearly how to do it, and I think this stuff really does work.
0:18:49 Are there any questions?
0:18:57 Pretty stunned.
0:19:03 Okay, well, that closes the session. Okay, thanks. Right.