Hi everyone. I'm not going to apologize for the quality of my slides, because my philosophy is that if you make an effort to make your slides look fancy, you seem to be trying to compensate for something.

This talk is actually very similar to some of the others in this session, and it's a strange session in that sense, because almost all of these talks are about some kind of faster fMLLR, fMLLR with some kind of factorization.

First, a question of naming: should we call it CMLLR or should we call it fMLLR? For some reason I've recently gone over to the CMLLR side, but I'm hedging my bets in the actual talk. I think everyone understands that they're the same thing.

I'm not going to go through this slide in detail, because it's the same as many slides that have been presented already. The notation is a little bit different: I use this notation with a little plus, which is kind of my personal notation, but I use it because I'm not comfortable with the Greek letters; it's hard for me to remember that xi is supposed to be the same as x. I just find this easier to remember: x-plus means x with a one appended. And in some people's work the one goes at the beginning, which is another slightly confusing difference between people's notations; sometimes people put the one at the end and sometimes at the beginning. I believe the IBM convention is to put it at the end, and that's what I'm doing.

This is another kind of introductory slide, but since we've had so many introductions to CMLLR and fMLLR in this session, I don't really think I need to go through it. One point I will make is about the relative improvement with a little adaptation data, where "a little" means after about thirty seconds or so, and someone else in this session actually mentioned this
figure too: after about thirty seconds or so, you get almost all of the improvement that you're going to get. But thirty seconds is a little too much for many practical applications. If you have some telephone service where someone requests a stock quote or does a web search or something, you might only have two or three seconds, or maybe five seconds, of audio, and that's really not enough for fMLLR to work. In fact, below about five seconds, in my experience it's not going to give you anything; it's actually going to make things worse, so you might as well turn it off. That's the problem this talk is addressing, and it's the same problem that many of the previous talks in this session, in fact I think all of the previous talks in this session, have been addressing.

Okay, this slide summarizes some of the prior approaches, and I should emphasize that I'm talking about prior approaches to somehow regularizing CMLLR. Obviously there are many other things you can do, like eigenvoices, the stuff that the previous speaker mentioned, or adapting other parameters, but I'm talking about CMLLR regularization.

A simple thing you can do is just make the A matrix diagonal. That's an option in HTK, and it's a reasonable approach, you get a lot of improvement, but it's very ad hoc. You can also make it block diagonal. That approach has its origins in the static, delta, and delta-delta type features, so you have three blocks of thirteen by thirteen. Nobody really uses those raw features anymore; I don't think any serious site uses them anymore without some kind of transformation. But you can still use the block-diagonal structure; in fact, one of the baselines we'll be presenting uses these blocks even though they've lost their original meaning, and it still seems to work.

Another class of approaches is Bayesian approaches. There have been a couple of different papers, both called fMAP
LR, and what fMAPLR means is: we do fMLLR, but we have a prior and we pick the MAP estimate. There was a paper from Microsoft and one from IBM; they had slightly different priors, but it was the same basic idea. This is one of the baselines we're going to use in our experiments. An issue with these approaches is that you would probably like to have a prior that tells you how all of the rows of the transform correlate with each other, but in practice that's not really doable, so the priors people generally see are either row by row or even completely diagonal, meaning a prior over each individual parameter.

The approach we're using is parameter reduction using a basis, where the fMLLR matrix is some kind of weighted sum of basis matrices, or prototype matrices. The basic idea is similar to some of the previous talks; sometimes people have a factorization and then put a basis on, you know, the upper and lower triangular factors or something like that, but we're talking about a basis expansion of just the raw transform. The basic form is given here on the slide: W is a weighted sum of the W subscript n, which are kind of like prototype fMLLR matrices, computed in advance somehow, and for a given speaker you have to estimate the coefficients. Now, this is not really a convex problem, but it's solvable in a kind of local sense, and I don't think that's a practical issue.

In the previous work in this area, you decide the basis size in advance. So you decide, let's say, that we're going to make the number of parameters two hundred, whereas the actual matrix, which is thirty-nine by forty, has a couple of thousand parameters. If you decide in advance to use two hundred coefficients, that does pretty well for typical configurations, if you have between ten and thirty
seconds of speech. But you're going to get a degradation once you have a lot of data, because you're not really estimating all of the parameters that you could estimate, so it eventually gets a bit worse when you have a lot of adaptation data. That's the closest prior work, which was done at IBM.

There are a couple of differences between what we're describing here and that prior work. Firstly, we allow the basis size to vary per speaker, and we have a very simple rule: we just say the more data we have, the more coefficients we can estimate, and we make the number of coefficients proportional to the amount of data. Of course you could do all kinds of fancy stuff with information criteria and so on, but I think this technique is easily complicated enough already without introducing new aspects, so we just picked a very simple rule. The other aspect is that we have a clever way of estimating these basis matrices W, a little bit more clever than just doing PCA.

And lastly, we're just trying to popularize this type of method. We have a journal version of this paper in which we tried to explain very clearly how to implement it, because it really does work, and it's very robust and everything, so it's something I do recommend to you.

This next slide mostly covers material that I've already gone over, so I'm going to skip to this one. The point is that making the transform diagonal or block diagonal is all very well and good, but it's just a bit ad hoc, and it doesn't give you that much improvement. Also, you have to decide how much data you have: can we afford to do the full transform, or are we going to make it diagonal? There's this trade-off, and you can get into having count cutoffs and stuff, but it's a bit of a mess.

As for Bayesian methods: if we could have done this in a Bayesian way, that would probably be more optimal, I think, because by picking a basis size you're kind of
making a hard decision, whereas with a Bayesian method you could do it in a soft way, and I think making a soft decision is always better than making a hard decision. But the problem with the Bayesian approach is, first, that it's very hard to estimate the prior, because the whole reason we're in this situation is that we don't have a ton of data per speaker, right? And assuming the training data is matched to the testing condition, you're not going to have a lot of data per training speaker to estimate the W matrices either. So how are you going to estimate a prior, when you don't have good estimates of the very things you're trying to put a prior on? Of course you can do all these empirical Bayesian schemes where you integrate things out, but it just becomes a big headache, and there are always somewhat arbitrary choices to make. What I like about this basis type of method is that you just say we're going to use a basis, and everything just falls out; it's obvious what to do.

So now I want to talk about how we estimate the basis. Because we're going to decide the number of coefficients at test time, the basis has to be ordered: you want the most important elements first and the least important elements last. If we were just going to say n equals two hundred and that's it, then it wouldn't really matter whether they were all mixed up in whatever order; but because we're deciding the number of coefficients at test time, we need this ordering. Things like PCA or SVD, those kinds of approaches, do actually give you this, but I'm not very comfortable just saying we're going to do PCA, because who's to say that that makes sense? One obvious argument why it doesn't make sense is that if you were to rescale the different dimensions of your feature vector, that's going to change the solution that PCA gives you, and that in turn is going to affect your
decoding, and to me that's a sign that it's not the right thing to do. The framework that I think is most natural is maximum likelihood: we're going to try to pick the basis that maximizes the likelihood on test data. I don't think I have time to go through the whole argument about what we're doing, but basically we end up doing PCA, just in a slightly preconditioned space.

W is a thirty-nine by forty matrix, typically, but we want to consider the correlations between the rows, so let's not think of it as a matrix; let's think of it as one big vector of size thirty-nine times forty, by concatenating the rows. I don't know how well this argument is going to come across, but think about the objective function for each speaker. If that objective function were a quadratic function, and if we could somehow do a change of variables so that the quadratic part of that function is just proportional to the unit matrix, then it's possible to show that the right solution is just doing weighted PCA. There's some kind of derivation behind that; it might be obvious to some people how you derive it, but let's just take it as given for now.

So the thing to do is this change of variables, so that the objective function is quadratic for each speaker. Now, it's not quite possible to do that, for a couple of reasons. Firstly, the objective function is not really quadratic: there's this log determinant in it, so it's nonlinear. Secondly, you can take a Taylor series approximation around the default matrix, the identity transform with zero offset, and look at the quadratic term in that Taylor series. But remember, W is a big vector here, so the quadratic term is a big matrix, about two thousand by two thousand, and that quadratic term depends on the speaker, so
it's not just a constant. But I don't think it varies in a very important way. It's possible to make a reasonable argument that once we work out the average of these quadratic terms, and then precondition so that the average is the unit matrix, each speaker's matrix is approximately unit, and that's the situation we'd like to be in. And it's not going to make a big difference if it's not quite unit, because all this is doing is preconditioning everything and then doing PCA, and if you don't precondition quite correctly, if the rotations are not quite right, it's not going to totally change the result of the PCA. What will happen is that, say, the first eigenvector gets mixed up a little bit with the second, and so on, so it will still be pretty close to maximum likelihood.

I think I've covered this material, so: the training-time computation basically involves computing big matrices like this and doing an SVD and so on; it's all described in the journal paper. At test time there's an iterative method to compute the coefficients, the a subscript n. It's a non-convex problem, but it's pretty easy to get a local optimum: you just do something like steepest ascent with a pretty exact line search, and again it's all described in the paper.

Now the results. This slide requires a little bit of explanation. We have two sets of test data: one is short utterances, a digits type of task, stock quotes and things like that, and one is voicemail. We divide each of those two sets into four subsets based on length, and the x-axis is the length of the utterance, so ten to the zero is one second, ten to the one is ten
seconds, and each of the points on these lines corresponds to a bin of test data. The bins on the left-hand side are the short utterances, the ones on the right are the long ones, and each point is a relative improvement. The triangles at the bottom are regular fMLLR, and it's actually making things worse for the first three bins, and then it helps a bit. The absolute word error rate jumps up and down a bit because it's different types of data, so this is maybe not the ideal data to test this on; you have to look at the relative improvement. The very top line is our method, and it's doing the best; it's giving a lot more improvement than the other methods. As for the fMAPLR and the three-block diagonal baselines, it didn't matter a lot which one you used; they're a bit better than doing regular CMLLR, they get some improvement for the shorter amounts of data, but we get more improvement from our method. So the story is: if you have, let's say, between about three and ten seconds of data, I think this method will be a big improvement versus doing fMAPLR or a diagonal transform or whatever; but if you have, let's say, more than thirty seconds, it really doesn't make a difference.

I think I've pretty much covered all of this, and I'm being asked to wrap up, so: I recommend the journal paper if you want to implement this, because I think it describes very clearly how to do it, and this stuff does work. Okay, any questions? Everyone looks pretty stunned. Okay, well, we'll close the session there. Thanks.
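The appended-one notation from the start of the talk can be sketched in a few lines of code. This is a minimal illustration, not from any particular toolkit: the transform W is an n by (n+1) matrix whose last column is the bias b, acting on x-plus, which is x with a one appended at the end (the IBM convention the speaker mentions).

```python
def apply_fmllr(W, x):
    """Apply an n x (n+1) fMLLR/CMLLR transform W to a length-n feature vector x.

    Following the appended-one convention from the talk: the trailing column
    of W is the bias b, and the one is appended at the *end* of x, so that a
    single matrix-vector product computes A @ x + b.
    """
    x_plus = list(x) + [1.0]  # x-plus: x with a one appended at the end
    return [sum(w_ij * xj for w_ij, xj in zip(row, x_plus)) for row in W]

# Tiny 2-dimensional example: A = identity, b = (0.5, -1.0),
# so the transform reduces to x + b.
W = [[1.0, 0.0, 0.5],
     [0.0, 1.0, -1.0]]
print(apply_fmllr(W, [2.0, 3.0]))  # -> [2.5, 2.0]
```

The point of the convention is purely mechanical: the affine map A x + b becomes one multiplication by the extended matrix [A b].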
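The basis expansion and the per-speaker basis-size rule described in the talk can be sketched as follows. This is a hedged illustration, not the paper's implementation: W0 stands for the default transform, the W_n are the ordered basis matrices (most important first), and the proportionality constant of 0.2 coefficients per frame is an invented placeholder, since the talk only says the count is proportional to the amount of adaptation data.

```python
def num_coeffs(num_frames, per_frame=0.2, max_coeffs=None):
    """Simple rule from the talk: number of estimated coefficients is
    proportional to the amount of adaptation data. The per_frame constant
    here is illustrative, not a value from the paper."""
    n = int(per_frame * num_frames)
    return min(n, max_coeffs) if max_coeffs is not None else n

def speaker_transform(W0, basis, coeffs):
    """W(s) = W0 + sum_n a_n * W_n, keeping only the leading len(coeffs)
    basis matrices (the basis is assumed ordered by importance)."""
    rows, cols = len(W0), len(W0[0])
    W = [row[:] for row in W0]
    for a_n, W_n in zip(coeffs, basis):
        for i in range(rows):
            for j in range(cols):
                W[i][j] += a_n * W_n[i][j]
    return W

# Toy 1-D example: W0 = [A=1, b=0], one basis matrix that only shifts the bias.
W0 = [[1.0, 0.0]]
basis = [[[0.0, 1.0]]]
print(speaker_transform(W0, basis, [0.5]))  # -> [[1.0, 0.5]]
print(num_coeffs(100))                      # 100 frames (about 1 s) -> 20
```

This is why the ordering of the basis matters: truncating the sum after the first few terms should discard the least important directions, which is what the preconditioned PCA is meant to guarantee.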
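The test-time coefficient estimation the talk describes, steepest ascent with an exact line search, can be illustrated on a toy objective. To keep this runnable and self-contained, the real fMLLR auxiliary function (which has a log-determinant term and is only locally solvable) is replaced here with a concave quadratic surrogate f(a) = g·a - 0.5 a'Ha with diagonal positive H, for which the exact step size along the gradient has a closed form; the optimizer structure, not the objective, is the point.

```python
def grad(a, g, h):
    # Gradient of f(a) = g.a - 0.5 * sum_i h_i * a_i^2
    return [gi - hi * ai for gi, hi, ai in zip(g, h, a)]

def steepest_ascent(g, h, iters=50):
    """Steepest ascent with exact line search on a diagonal concave quadratic.

    Mirrors the shape of the test-time coefficient update from the talk:
    start from zero coefficients, repeatedly step along the gradient with
    an (here, closed-form) line search.
    """
    a = [0.0] * len(g)                  # start from zero coefficients
    for _ in range(iters):
        d = grad(a, g, h)               # ascent direction
        num = sum(di * di for di in d)
        den = sum(hi * di * di for hi, di in zip(h, d))
        if den <= 1e-12:                # gradient vanished: at the optimum
            break
        step = num / den                # exact line search for a quadratic
        a = [ai + step * di for ai, di in zip(a, d)]
    return a

# With g = (2, 3) and H = diag(1, 3), the maximizer is a* = H^-1 g = (2, 1).
print([round(v, 6) for v in steepest_ascent([2.0, 3.0], [1.0, 3.0])])  # -> [2.0, 1.0]
```

For the real objective the step size would come from a one-dimensional line search rather than a formula, but the loop structure is the same, and since the true problem is non-convex this only finds a local optimum, which the talk says is not a practical issue.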