0:00:14 | hi everyone

0:00:16 | i'm not going to apologise for the quality of my slides because

0:00:20 | i just don't think it really matters

0:00:23 | if you make fancy slides people seem to be trying to compensate for something

0:00:29 | that that's my

0:00:32 | philosophy

0:00:33 | okay so this talk is actually very similar to some of the others

0:00:37 | in this session and it's

0:00:39 | it's a strange session in a way because almost all of these talks are about some kind of

0:00:45 | fMLLR

0:00:46 | with some kind of factorization

0:00:49 | now

0:00:51 | before i start the talk proper

0:00:53 | okay should we call it CMLLR or should we call it fMLLR

0:00:58 | for some reason i've gone over recently to the CMLLR side but i'm hedging my bets in the

0:01:02 | actual talk

0:01:06 | i think everyone really understands that it's the same thing

0:01:11 | yeah |

0:01:12 | and i'm not gonna go through this slide in detail because it's the same as many slides that have been |

0:01:17 | presented already |

0:01:19 | yeah |

0:01:19 | notation is a little bit different

0:01:22 | this notation with a little plus

0:01:25 | is kind of my personal notation

0:01:28 | but i use it because i'm not comfortable with the greek letters it's

0:01:32 | hard for me to remember

0:01:33 | that

0:01:34 | a xi is supposed to be the same as x

0:01:37 | so i just think it's easier to

0:01:39 | remember that the little plus means append a one

0:01:43 | and in some

0:01:45 | in some people's work it's a xi

0:01:52 | and that's also another little confusing

0:01:55 | difference between people's notation that sometimes people put the one on the end and sometimes at the beginning

0:02:00 | i think the IBM habit is to put it on the end and that's what i'm doing
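A minimal sketch of the append-a-one convention the speaker describes, with made-up small dimensions (3-dim features rather than 39; variable names are mine):

```python
import numpy as np

# Illustrative only: a 3-dim feature, so the transform is 3x4.
rng = np.random.default_rng(0)
x = rng.standard_normal(3)              # original feature vector
A = rng.standard_normal((3, 3))         # linear part of the transform
b = rng.standard_normal(3)              # bias part

# x-plus: the feature vector with a 1 appended at the END (the IBM habit).
x_plus = np.append(x, 1.0)

# With the 1 at the end, the bias goes in the LAST column of W,
# so the affine transform A x + b becomes a single product W x-plus.
W = np.hstack([A, b[:, None]])          # shape (3, 4)
assert np.allclose(W @ x_plus, A @ x + b)

# The other convention puts the 1 at the BEGINNING, and the bias first.
x_plus_front = np.insert(x, 0, 1.0)
W_front = np.hstack([b[:, None], A])
assert np.allclose(W_front @ x_plus_front, A @ x + b)
```

Both conventions compute the same affine transform; only the column carrying the bias moves.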

0:02:05 | yeah |

0:02:08 | this is another kind of introductory slide but |

0:02:12 | since we've had so many introductions to CMLLR

0:02:16 | or fMLLR

0:02:17 | in this session i don't really think i need to go through this

0:02:22 | no |

0:02:24 | a point i did put in here

0:02:26 | is that

0:02:29 | fMLLR

0:02:30 | works with relatively little adaptation data

0:02:33 | now here

0:02:35 | what a little adaptation data means is after about thirty seconds or so

0:02:39 | and someone else actually in this

0:02:41 | session did mention that figure

0:02:43 | after about thirty seconds or so you get almost all of the improvement that you're going to get

0:02:48 | but thirty seconds is a little bit too much for many practical applications

0:02:53 | if you have some telephone service where someone's going to like request a stock quote or

0:02:58 | do web search or something

0:03:00 | then you might only have

0:03:02 | two or three

0:03:03 | seconds or maybe five seconds of audio

0:03:06 | and that's not really enough for fMLLR to work

0:03:09 | in fact below about five seconds in my experience

0:03:13 | it's not going to give you an improvement and it's going to actually make things worse

0:03:17 | so you might as well turn it off

0:03:20 | so that's the problem that this

0:03:21 | talk is addressing and actually it's the same problem that many previous talks in the session have been addressing

0:03:27 | i think all of the previous talks in the session have been addressing this problem

0:03:32 | so |

0:03:36 | okay this

0:03:37 | slide is summarising some of the prior

0:03:40 | approaches and i should emphasise that

0:03:43 | i'm talking about the prior approaches to

0:03:46 | somehow regularising CMLLR

0:03:49 | obviously there's many other things that you can do like eigenvoices the stuff that the previous speaker mentioned

0:03:56 | with other parameters but i'm talking about

0:03:58 | CMLLR

0:04:00 | regularization

0:04:02 | so a simple thing you can do you can just

0:04:05 | make the A matrix diagonal

0:04:08 | and that's

0:04:08 | kind of an option in HTK and it

0:04:11 | you know it's a good approach you get a lot of improvement but it's very ad hoc

0:04:16 | you can also make it block diagonal

0:04:20 | this approach had its origins in the cepstral and the delta

0:04:24 | and delta-delta type features

0:04:26 | so you have three blocks that are thirteen by thirteen

0:04:29 | now nobody really uses those features unprocessed anymore no

0:04:35 | serious site i don't think uses them anymore

0:04:37 | without some kind of transformation but you can still use the block diagonal in fact one of the baselines that

0:04:42 | we'll be presenting

0:04:43 | is we use these blocks even though they've lost

0:04:46 | their original meaning

0:04:48 | and you know it still seems to work
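The block-diagonal constraint just described can be sketched like this (my own toy code, not from the talk): three 13x13 blocks for a 39-dim static/delta/delta-delta feature, which cuts the free parameters in the linear part roughly by a factor of three.

```python
import numpy as np

# Sketch of the block-diagonal constraint on the linear part A:
# three 13x13 blocks, one per static/delta/delta-delta group.
dim, nblocks = 39, 3
bs = dim // nblocks                              # 13
rng = np.random.default_rng(1)

A = np.zeros((dim, dim))
for i in range(nblocks):
    # each block is estimated independently; cross-block entries stay zero
    A[i*bs:(i+1)*bs, i*bs:(i+1)*bs] = rng.standard_normal((bs, bs))

# 3 * 13 * 13 = 507 free parameters instead of 39 * 39 = 1521 for a full A.
assert np.count_nonzero(A) <= nblocks * bs * bs
```

As the speaker notes, after a feature transformation the blocks no longer correspond to static/delta/delta-delta, but the same zero pattern can still be imposed.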

0:04:51 | so another

0:04:53 | class of approaches is bayesian approaches

0:04:55 | there have been a couple of different papers both called

0:04:58 | fMAPLR

0:05:00 | by fMAPLR it means we do fMLLR but we have a

0:05:04 | prior

0:05:05 | and we pick the MAP estimate

0:05:08 | there's been a paper from Microsoft and one from IBM and they had slightly different priors but it was

0:05:13 | the same basic idea

0:05:14 | this is one of the baselines that we're going to

0:05:17 | be using in our experiments

0:05:19 | yeah |

0:05:21 | an issue with these approaches is that

0:05:24 | you'd probably like to have

0:05:27 | a prior that tells you how all of the rows of the transform correlate with each other

0:05:32 | but in practice that's not really doable

0:05:35 | so

0:05:36 | people generally use priors

0:05:39 | that are either row by row or even completely diagonal

0:05:42 | so it's a prior over each individual parameter

0:05:46 | yeah |

0:05:47 | the approach that we are using is

0:05:49 | parameter

0:05:51 | reduction using a basis

0:05:53 | where the

0:05:54 | fMLLR matrix is some kind of weighted

0:05:57 | sum of

0:05:59 | basis matrices or

0:06:02 | prototype matrices

0:06:04 | so |

0:06:06 | the basic idea of this is similar to some of the previous talks

0:06:10 | sometimes people have

0:06:12 | there's a factorization and then a basis on you know the upper and lower half or something like

0:06:17 | that

0:06:18 | but we're talking about a basis expansion of just the raw

0:06:23 | the raw transform

0:06:24 | and the basic form of it is given here

0:06:27 | where

0:06:29 | these W subscript n

0:06:31 | are the kind of like prototype or

0:06:34 | basis fMLLR matrices and

0:06:36 | they're kind of computed in advance somehow

0:06:39 | for a given speaker

0:06:41 | you have to estimate these coefficients

0:06:44 | now

0:06:45 | this is not a really convex problem but

0:06:48 | uh it's

0:06:50 | it's solvable in a kind of local sense and it's not really

0:06:53 | i don't think it's a practical issue
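The basis expansion on the slide can be sketched like this (toy basis size and random matrices, variable names are mine): the speaker's transform is the default transform plus a weighted sum of precomputed basis matrices, and the only per-speaker parameters are the weights.

```python
import numpy as np

dim = 39
rng = np.random.default_rng(2)

# Default transform [A = I, b = 0], a 39 x 40 matrix.
W0 = np.hstack([np.eye(dim), np.zeros((dim, 1))])
N = 5                                            # basis size (illustrative)
basis = rng.standard_normal((N, dim, dim + 1))   # W_1 .. W_N, fixed in advance

def speaker_transform(coeffs, W0=W0, basis=basis):
    """W(s) = W_0 + sum_n d_n(s) * W_n: the per-speaker parameters are
    just the N coefficients, not the full 39x40 matrix."""
    return W0 + np.tensordot(coeffs, basis, axes=1)

d_s = rng.standard_normal(N)        # coefficients estimated for one speaker
W_s = speaker_transform(d_s)
assert W_s.shape == (39, 40)
# With zero coefficients we recover the default (identity) transform.
assert np.allclose(speaker_transform(np.zeros(N)), W0)
```

This is why very little adaptation data suffices: with N small, only N numbers need estimating per speaker.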

0:06:56 | so |

0:06:57 | the previous work in this area

0:06:59 | you kind of decide the basis size in advance

0:07:03 | so they

0:07:04 | decide let's say we're going to make it two hundred say

0:07:08 | the number of parameters in the actual matrix which is thirty-nine by forty is like

0:07:13 | a couple of thousand so

0:07:15 | if

0:07:15 | you decide in advance we're going to make it two hundred coefficients

0:07:19 | that you know does pretty well for typical configurations if you have you know between ten and thirty seconds

0:07:25 | of speech

0:07:26 | but you're going to get a degradation

0:07:28 | once you have a lot of data because you're not really estimating all of the parameters that you could estimate

0:07:34 | it eventually gets a bit worse

0:07:36 | when you have a lot of adaptation data

0:07:38 | so that's

0:07:40 | that's the closest prior work

0:07:45 | there are a couple of differences

0:07:46 | in what we're describing here from this prior work which was done at IBM

0:07:51 | firstly we

0:07:53 | allow the basis

0:07:54 | size to vary per speaker

0:07:57 | and we have a very simple rule we just say

0:07:59 | the more data that we have the more coefficients we can estimate

0:08:03 | and we just make the number of coefficients proportional to the amount of data

0:08:08 | of course you could do all kinds of fancy stuff with information criteria and so on but

0:08:13 | i think this technique is you know easily complicated enough already without

0:08:17 | introducing new aspects so we just picked a very simple rule
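That simple rule might look like the following sketch (the proportionality constant and the cap are my guesses, just to show the shape of the rule, not the talk's tuned setting):

```python
import math

def num_coefficients(num_frames, per_frame=0.2, max_basis=39 * 40):
    """More adaptation data -> more basis coefficients, capped at the
    full parameter count of the 39x40 transform (1560). The value of
    0.2 coefficients per frame is purely illustrative."""
    return min(max_basis, max(1, math.floor(per_frame * num_frames)))

# About 3 seconds of speech at 100 frames per second:
assert num_coefficients(300) == 60
# A speaker with lots of data eventually uses the whole basis:
assert num_coefficients(10**6) == 1560
```

The appeal of a linear rule is that it has one tunable constant, versus an information-criterion scheme with more knobs.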

0:08:20 | but the other aspect is we have a

0:08:23 | we have a clever way of estimating these basis matrices W subscript n

0:08:27 | which is a little bit more clever than just doing PCA

0:08:32 | yeah |

0:08:33 | and finally

0:08:36 | we're just trying to popularise this type of method

0:08:38 | with this

0:08:40 | we have

0:08:41 | we have a journal version of this paper in which we tried to explain very clearly how to implement it

0:08:46 | because it really does

0:08:48 | work

0:08:49 | and it's very robust and everything so i think it's something that i do recommend

0:08:54 | to you

0:08:57 | yeah |

0:09:00 | this

0:09:01 | slide probably covers material that i've

0:09:05 | oh i am going to go through this slide

0:09:06 | so

0:09:09 | the problem with things like making it diagonal or block diagonal it's all very well and good but

0:09:14 | it just doesn't it's just a bit

0:09:16 | ad hoc and it doesn't

0:09:18 | give you that much improvement

0:09:20 | also you sort of have to decide like

0:09:23 | how much data do we have can we afford to do the full transform are we going to make it diagonal

0:09:27 | there's this trade-off and you can get into

0:09:30 | having count cutoffs and stuff but it's a bit of a mess

0:09:34 | yeah |

0:09:35 | as to bayesian methods you know if we could have done this in a bayesian way

0:09:40 | that's probably

0:09:42 | i think that would be more optimal because by picking a basis

0:09:45 | size you're kind of making a hard decision

0:09:48 | whereas with a bayesian method you could do that in a soft way

0:09:52 | and you know i think making a soft decision is always better than a hard decision

0:09:56 | but

0:09:57 | the problem with the bayesian approach is first it's very hard to estimate the prior

0:10:02 | because

0:10:04 | the whole reason we're in this situation is we don't have a ton of data per speaker

0:10:08 | right and

0:10:09 | assuming the training data is matched to the testing condition

0:10:13 | you're not going to have a lot of data in your training data to estimate the

0:10:17 | the W matrices

0:10:19 | so how are you going to estimate a prior when you don't have good estimates of the

0:10:23 | uh

0:10:24 | the things you're trying to get a prior on of course you can do all of these

0:10:28 | empirical bayesian schemes where you integrate and stuff but it just becomes a big headache

0:10:32 | plus there's always somewhat arbitrary choices to make

0:10:35 | but what i like about this basis type of method is you just say we're going to use a basis

0:10:40 | and everything just falls out and it's obvious what to do

0:10:44 | so |

0:10:46 | now i want to talk about how to estimate the

0:10:48 | basis

0:10:49 | so

0:10:51 | because we're going to decide the number of coefficients at test time

0:10:56 | we need a kind of ordered basis that is one where you have the most important elements first

0:11:00 | and the least important elements last

0:11:03 | if we were going to say we're going to have n equals two hundred and just that's it

0:11:07 | then

0:11:09 | it wouldn't really matter whether they were all mixed up what order they were in

0:11:12 | but because

0:11:14 | we're going to decide the number of coefficients at test time we need to have this ordering

0:11:18 | property

0:11:19 | now it turns out that

0:11:21 | PCA or LDA or you know SVD those those kinds of

0:11:25 | approaches

0:11:27 | they actually give you this ordering

0:11:29 | but

0:11:30 | i'm not very comfortable just

0:11:32 | saying we're going to do PCA because you know

0:11:35 | who's to say that that makes sense

0:11:38 | one obvious argument why it doesn't make sense

0:11:41 | is that if you were to

0:11:45 | scale the different dimensions of your

0:11:48 | feature vector

0:11:49 | that's going to change the solution that PCA gives you

0:11:53 | i mean it's going to change it in a way

0:11:55 | that is going to change the basis so basically it's going to affect your decoding

0:11:59 | and to me that's a sign that

0:12:02 | it's not the right thing to do

0:12:04 | so |

0:12:06 | so the framework that i think is most natural is maximum likelihood

0:12:09 | where we're going to try to pick the basis that maximises the likelihood on test data

0:12:15 | and uh

0:12:17 | i don't think i have time to go through the whole

0:12:22 | and i wasn't planning to go through the whole you know

0:12:25 | argument about what we're doing

0:12:27 | but basically

0:12:29 | we end up doing PCA but in a slightly preconditioned

0:12:34 | space

0:12:35 | so |

0:12:36 | now W is a thirty-nine by forty matrix typically

0:12:40 | but

0:12:41 | we want to consider the correlations between the rows and it's not really convenient to think of it as a matrix

0:12:45 | so let's think of it as one big vector

0:12:48 | of size thirty-nine by forty we concatenate the rows
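Concatenating the rows is just a reshape; a minimal sketch (numpy's C-order `ravel` does exactly this row concatenation):

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.standard_normal((39, 40))   # the transform as a matrix

w = W.ravel()                       # rows laid end to end, one long vector
assert w.shape == (39 * 40,)        # 1560 entries ("a couple of thousand")

# The mapping is invertible, so after working in the vectorized space
# (e.g. doing PCA there) we can go back to matrix form.
assert np.array_equal(w.reshape(39, 40), W)
```

Working with the 1560-dim vector lets a single covariance or quadratic term capture correlations between rows, which a row-by-row treatment cannot.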

0:12:52 | no |

0:12:53 | uh |

0:12:55 | if |

0:12:58 | i don't know if i can easily

0:12:59 | convey how well this argument works but

0:13:01 | uh

0:13:02 | think about the objective function for each speaker

0:13:05 | if that objective function were a quadratic function

0:13:09 | and if we could somehow do a change of variable so that the

0:13:12 | quadratic part of that function is just proportional to the unit matrix

0:13:17 | then it is possible to show that

0:13:19 | the kind of right solution is just doing weighted PCA

0:13:23 | there's some kind of derivation i mean it might be obvious to some people how you derive that

0:13:29 | uh

0:13:29 | but you know

0:13:31 | let's just take it as given for now

0:13:33 | that that's true

0:13:34 | so the

0:13:36 | thing to do is do this change of variable so that the objective function is quadratic for each speaker

0:13:41 | and in fact it's not quite possible to do that

0:13:45 | because

0:13:46 | we

0:13:47 | okay we can't quite do it for a couple of reasons

0:13:50 | firstly the objective function is not really quadratic there's this log term and it's

0:13:54 | fairly nonlinear

0:13:55 | secondly

0:13:56 | you can take a taylor series approximation around the kind of default matrix

0:14:01 | A equals I and b equals zero

0:14:03 | take a taylor series around there

0:14:05 | and look at the quadratic

0:14:09 | the quadratic term in that taylor series

0:14:11 | but remember it is a big vector right so the quadratic term is like

0:14:15 | a big matrix

0:14:18 | you know about two thousand by two thousand

0:14:20 | and that quadratic term depends on the speaker

0:14:24 | so it's not just a constant

0:14:26 | but

0:14:27 | i don't think this matters very much in an important way

0:14:31 | so |

0:14:32 | it is possible to kind of do a

0:14:35 | it's possible to make an argument that

0:14:37 | once we

0:14:39 | once we work out the average of these quadratic terms

0:14:43 | and then precondition so that that average is the unit matrix

0:14:47 | it's possible to make a reasonable argument that

0:14:50 | for each speaker

0:14:51 | the

0:14:53 | the speaker's matrices are approximately unit

0:14:56 | and it's a situation where

0:14:59 | we'd like them to be unit you know but it's not going to make a big difference if they're not quite unit

0:15:03 | because all this is doing is

0:15:06 | we're going to precondition

0:15:07 | we're going to precondition everything and then do PCA

0:15:11 | and if you don't quite precondition correctly

0:15:14 | if the rotations are not correct because the preconditioning wasn't

0:15:18 | accurate it's not going to like totally change the result of PCA

0:15:22 | what's going to happen is that let's say the first eigenvector

0:15:27 | is going to be mixed up a little bit with the second and so on

0:15:30 | so

0:15:32 | it's pretty close to maximum likelihood

0:15:36 | now i think i've

0:15:38 | covered this material

0:15:39 | so

0:15:40 | the training time computation this is how you do it

0:15:44 | the training time computation basically involves computing

0:15:48 | big matrices like this and doing like an SVD and stuff

0:15:51 | it's all described in the journal paper
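A toy sketch of that training-time computation as I understand it from the talk (small dimensions, random stand-in statistics; in the real system the quadratic terms come from the fMLLR auxiliary function and the vectors have 1560 entries): average the per-speaker quadratic terms, precondition with a Cholesky factor so the average becomes unit, then get the ordered basis from an SVD.

```python
import numpy as np

rng = np.random.default_rng(4)
D, S = 20, 100                      # vectorized-transform dim, num speakers

# 1. Average quadratic (Hessian-like) term over training speakers;
#    here just a random positive-definite stand-in.
M = rng.standard_normal((D, D))
H = M @ M.T + D * np.eye(D)
L = np.linalg.cholesky(H)           # preconditioner: H = L L^T

# 2. Per-speaker vectorized transforms (toy data), moved into the
#    preconditioned space where the average quadratic term is ~unit.
w = rng.standard_normal((S, D))
w_pre = w @ L                       # row-wise change of variable v = L^T w

# 3. PCA via SVD: rows of Vt are the basis, ordered most important first.
_, s, Vt = np.linalg.svd(w_pre, full_matrices=False)
assert np.all(s[:-1] >= s[1:])      # singular values come out sorted

# 4. Map each basis vector back to the original coordinates; in the real
#    system each row would be reshaped to a 39x40 basis matrix W_n.
basis = np.linalg.solve(L.T, Vt.T).T
assert basis.shape == (D, D)
```

The ordering from the SVD is what makes the per-speaker basis truncation meaningful.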

0:15:54 | so at test time there's an iterative optimisation to estimate the coefficients those are the

0:16:01 | d subscript n

0:16:03 | so

0:16:04 | it's

0:16:04 | it's not a convex problem but it's pretty easy to get a local optimum

0:16:08 | you just do like steepest ascent and you do a pretty exact line search

0:16:13 | and it's all described in the

0:16:15 | paper
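The test-time optimisation can be sketched with a toy concave quadratic standing in for the real fMLLR auxiliary function (for a quadratic the "exact line search" has a closed form; the real objective needs an actual search, and is only locally concave):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 8                                   # number of coefficients in use
A = rng.standard_normal((N, N))
P = A @ A.T + N * np.eye(N)             # positive-definite curvature
g0 = rng.standard_normal(N)

def gradient(d):
    # gradient of the toy concave objective  g0 . d - 0.5 d' P d
    return g0 - P @ d

d = np.zeros(N)                         # start at the default transform
for _ in range(100):
    g = gradient(d)                     # steepest-ascent direction
    curv = g @ P @ g
    if curv < 1e-12:
        break
    t = (g @ g) / curv                  # exact line-search step (quadratic case)
    d = d + t * g

# For this toy quadratic the iterates converge to the unique optimum P^{-1} g0.
assert np.allclose(d, np.linalg.solve(P, g0), atol=1e-8)
```

Each iteration only needs a gradient and a one-dimensional search, which keeps the per-speaker cost low even when the number of coefficients varies.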

0:16:17 | so this is the results slide it requires a little bit of explanation

0:16:21 | so we have two

0:16:23 | sets of

0:16:24 | test data one is

0:16:25 | short utterances one is long

0:16:28 | it's like one is a digits type of task stocks and stuff and one is a voice

0:16:32 | search task

0:16:34 | and

0:16:34 | we divide each of those two

0:16:36 | into four subsets based on the length

0:16:39 | and the x axis

0:16:41 | is

0:16:42 | is the length of the utterance so ten to the zero is one second ten to the one

0:16:46 | is ten seconds

0:16:48 | and each of these

0:16:51 | each of the

0:16:56 | each of the

0:16:57 | points on the lines corresponds to a bin of test data

0:17:02 | so

0:17:03 | we divide it up and the buckets on the left hand side are short

0:17:08 | utterances and on the right are long

0:17:10 | and each point is a relative improvement

0:17:13 | yeah |

0:17:13 | the triangles the triangle on the bottom that's just regular

0:17:18 | fMLLR

0:17:20 | and it's actually making things worse for the first three bins

0:17:23 | and then it helps a bit

0:17:26 | now the absolute word error rate kind of jumps up and down a bit

0:17:30 | because it's different types of data

0:17:32 | so

0:17:33 | this is maybe not the ideal data to test this on but

0:17:37 | you've got to look at the relative improvements of these things

0:17:41 | uh |

0:17:42 | the very top line is our method

0:17:44 | which is doing the best so it's giving a lot more improvement than the other methods

0:17:49 | the fMAPLR the three-block and the diagonal ones

0:17:53 | those

0:17:54 | do a bit better than doing regular fMLLR

0:17:57 | they get some improvement for the shorter amounts of data

0:18:01 | but we get more improvement from our method

0:18:04 | so |

0:18:05 | and i mean the story is that

0:18:07 | if you have let's say between about three and ten seconds of data

0:18:12 | i think this method will be a big improvement versus

0:18:15 | doing fMAPLR or diagonal

0:18:17 | or whatever

0:18:19 | so uh

0:18:21 | but if you have you know let's say more than thirty seconds it really doesn't make a difference

0:18:26 | so i think

0:18:27 | i have pretty much covered all of this

0:18:30 | i'm being asked to wrap up

0:18:33 | yeah |

0:18:34 | i think we've covered all of this |

0:18:36 | so i recommend that journal paper if you want to implement this because i do describe very clearly how to

0:18:41 | do it and i think this stuff really does work

0:18:44 | okay |

0:18:49 | any questions

0:18:57 | pretty stunned

0:19:03 | okay okay well we'll close the session then okay thanks a lot