[Session chair] I am the session chair. One quick advertisement before we start: if you see people wearing a pin, you may ask them about ASR. Okay, we're going to start. Our first paper is "Front-End Feature Transforms with Context Filtering for Speaker Adaptation."

[Presenter] So the topic is front-end feature transforms with context filtering for speaker adaptation. Here's the outline of the talk: first I'll briefly motivate the work and explain what we're trying to accomplish, then I'll give an overview of the new technique, called maximum likelihood context filtering, and then we'll move straight into some experiments and results to see how it works.

So the topic is front-end speaker adaptation. In terms of front-end transforms, we usually do linear transforms or conditionally linear transforms. Perhaps the most popular technique is feature-space MLLR, more popularly named constrained MLLR, and of course there are also discriminative techniques that have been developed, as well as nonlinear transformations. Some variants of fMLLR have been worked on in recent years, including quick fMLLR and full-covariance fMLLR, and today I'll tell you about another variant of fMLLR.

But first, let's review fMLLR. The idea is that you're given a set of adaptation data and you want to estimate a linear transformation A and a bias b, which can be concatenated into the linear matrix W. The key point about fMLLR for the purposes of this talk is that the A matrix is square, D by D in the notation used here, and that makes it particularly easy to learn. Of course, the main thing you need to deal with when you apply these transforms is the volume-change compensation; in the case of a linear transformation, it's just the log determinant of A.
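As a small illustration of the bookkeeping just described, here is a minimal numpy sketch (the dimensions and values are made up, and this is not the paper's code): the transform A and bias b are concatenated into W = [A b], and applying W to the extended frame [x; 1] reproduces y = A x + b.

```python
import numpy as np

# Toy dimensions; in the talk, D would be the acoustic feature dimension.
D = 4
rng = np.random.default_rng(0)

A = np.eye(D) + 0.01 * rng.standard_normal((D, D))  # square D x D transform
b = rng.standard_normal(D)                          # bias vector
W = np.hstack([A, b[:, None]])                      # W = [A  b], shape D x (D+1)

x = rng.standard_normal(D)   # one feature frame
xi = np.append(x, 1.0)       # extended frame [x; 1]
y = W @ xi                   # applying W gives y = A x + b

assert np.allclose(y, A @ x + b)
```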
You can see that log-determinant term in red in our objective function Q. The second term there is the typical term you always see: it has the posterior probability of each component of the acoustic model you're evaluating, that's gamma, subscripted by j for each Gaussian.

Okay, so when we start to think about the non-square case, what do we need to do? First, let's set up the notation. We use x-hat of t to denote a vector with context; in this case the context size is one, so x_{t-1}, x_t, and x_{t+1} are concatenated to make x-hat of t. The model is y = A x-hat + b, and we can condense this notation to form W. The main difference here is that A is not square: it is D by 3D, because the output y has the original dimension D of the input x, but x-hat is 3D-dimensional.

So how do we estimate a non-square matrix by maximum likelihood? An important point is that there's no direct, obvious way to do this, because if you change the dimension of the space, there is no determinant-based volume term that you can use in a straightforward manner. So let's go back and look at how we get that term in the square case. Basically, you say that the log likelihood of the transformed variable y is equal to the log likelihood of the input variable up to a constant; that constant is your Jacobian term. In the case where A is square, you can readily confirm that the term is half the log ratio of the determinants of the input and output model covariances, assuming Gaussians. This slide is just showing how you would derive that: there's L_X for the Gaussian on the input, L_Y for the Gaussian on the transformed data, and essentially you equate them to find what C is, and you find that it is the log ratio, as we stated before. On the bottom line, in red, if you break down that ratio, you see the covariance of the variable y, which follows from a known identity: Sigma_Y = A Sigma_X A-transpose.
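Since the slides themselves are not in the transcript, here is a reconstruction of that derivation in LaTeX, in the talk's notation (input x with covariance Sigma_X, output y = A x with covariance Sigma_Y); the exact presentation on the slide may differ.

```latex
% Square, invertible A: change of variables gives
\log p_Y(y) = \log p_X(x) + C, \qquad C = -\log\lvert\det A\rvert .
% Equating the two Gaussian log-likelihoods instead, and using the
% identity \Sigma_Y = A \Sigma_X A^{\top}, gives the same constant as a
% log ratio of determinants:
C = \tfrac{1}{2}\log\frac{\det\Sigma_X}{\det\Sigma_Y}
  = -\log\lvert\det A\rvert .
% Non-square case (A is D x 3D, acting on the context-expanded \hat{x}):
% \det A is undefined, but the covariance form still makes sense:
\Sigma_Y = A\,\Sigma_{\hat{X}}\,A^{\top},
\qquad
C = \tfrac{1}{2}\log\det\Sigma_{\hat{X}}
  - \tfrac{1}{2}\log\det\!\bigl(A\,\Sigma_{\hat{X}}\,A^{\top}\bigr) .
% Since \log\det\Sigma_{\hat{X}} does not depend on A, only the
% \tfrac{1}{2}\log\det(A\,\Sigma_{\hat{X}}\,A^{\top}) term enters the
% modified objective. This generalizes the square case, where
% \log\lvert\det A\rvert
%   = \tfrac{1}{2}\log\det(A\,\Sigma_X A^{\top})
%     - \tfrac{1}{2}\log\det\Sigma_X .
```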
The compensation term then ends up being log det(A Sigma_X-hat A-transpose). In our case, we're going to assume that the compensation term keeps this same form: we drop the log det Sigma_X-hat term, because it does not depend on A, and we're left with the A Sigma_X-hat A-transpose term that we also had in the case where A was square. So the modified objective becomes the following. One point is: what is this Sigma_X-hat that is used? What they did is use a full-covariance approximation over all the speech features to come up with a full-covariance Sigma_X-hat, and use it in this objective to learn A.

Okay, so in terms of optimizing this modified objective: the statistics that you need are of the same form as in the square case, although of course the sizes are different, and the two main quantities you need in order to optimize are the ability to evaluate the objective Q and the derivative of the objective. The row-by-row iterative update that people normally use cannot be applied here, at least it's not obvious how to do it; we're looking at it now, and there are some ways to do something very similar, but for the purposes of this paper a general gradient-based optimization package was used, the HCL package. As I mentioned before, it just needs the function and its gradient to be available at any point the optimizer wants to evaluate.

Okay, so that's essentially the method; I'll try to leave some time for questions at the end in case any more details are requested. Moving right on to training data and models. The training data for the task we evaluated this technique on was collected in stationary noise; there's about eight hundred hours of it. A word-internal model with quinphone context and eight hundred thirty context-dependent states was trained.
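The paper's optimizer is the HCL package; as an illustration only, here is a toy numpy sketch of the same idea: gradient ascent on the modified objective, with a single zero-mean, unit-covariance Gaussian standing in for the acoustic model. A real implementation would accumulate posterior-weighted statistics over all Gaussians as described above; all data, dimensions, and the backtracking step rule here are made-up simplifications.

```python
import numpy as np

# Modified objective for a non-square context-filtering transform A
# (shape D x 3D), with log|det A| replaced by
# (beta/2) * log det(A Sigma_xhat A^T), as in the talk.

rng = np.random.default_rng(1)
D = 3                                   # output feature dimension (toy)
T = 200                                 # number of adaptation frames
xhat = rng.standard_normal((T, 3 * D))  # context-expanded frames (toy data)
Sigma = np.cov(xhat, rowvar=False)      # full-covariance estimate of x-hat
mu = np.zeros(D)                        # single Gaussian mean, unit covariance
beta = float(T)                         # total occupancy count

def objective(A):
    logdet = np.linalg.slogdet(A @ Sigma @ A.T)[1]
    resid = xhat @ A.T - mu
    return 0.5 * beta * logdet - 0.5 * np.sum(resid ** 2)

def gradient(A):
    M_inv = np.linalg.inv(A @ Sigma @ A.T)
    resid = xhat @ A.T - mu
    return beta * M_inv @ A @ Sigma - resid.T @ xhat

# Initialize as in the talk: identity on the current frame, zeros on the sides.
A = np.hstack([np.zeros((D, D)), np.eye(D), np.zeros((D, D))])
q0 = objective(A)

step = 1e-3
q = q0
for _ in range(100):                    # simple backtracking gradient ascent
    g = gradient(A)
    while objective(A + step * g) < q and step > 1e-12:
        step *= 0.5                     # shrink step until the objective improves
    if objective(A + step * g) > q:
        A = A + step * g
        q = objective(A)

assert q > q0                           # the modified objective increased
```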
The model had about ten thousand Gaussians. The technique was tested on an LDA 40-dimensional feature build, on models built using maximum likelihood and MMI, and on a model with an fMMI transformation applied before we apply context filtering.

In terms of test data, it was recorded in a car at three different speeds: zero, thirty, and sixty miles per hour. There were four tasks (addresses, digits, commands, and radio control), about twenty-six thousand utterances, and a total of about a hundred and thirty thousand words. Here is the SNR distribution of this data by speed: you can see that most of the noise is in the sixty-mile-per-hour data, and for that data basically half of the utterances are below, say, twelve and a half dB. We estimate the SNR using a forced alignment.

Okay, so for the experiments, context filtering was tried for speaker adaptation, with the speaker-independent model as the canonical model. A little nomenclature here: MLCF-n is maximum likelihood context filtering with context size n, so n = 1 means plus or minus one frame is included in the context when computing the transform. For all the experiments, the transform was initialized with the identity with respect to the current frame's parameters, and the side frames were initialized to zeros. Just for reference, they also tried using, for the center part of the matrix, the fMLLR that was estimated using the usual technique.

Okay, so in terms of results, let me skip ahead to that. Clearly fMLLR brings a lot over the baseline on this data, and when you turn on context filtering you actually get some significant gains in the sixty-mile-per-hour column; you can see them highlighted in red. This is actually a twenty-three percent relative gain in word error rate, and thirty percent in sentence error rate, over fMLLR.
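The context expansion and the initialization just described (identity on the current frame's block, zeros on the side frames) can be sketched as follows. This is a toy numpy illustration, not the paper's code, and the edge handling (replicating the first and last frames) is an assumption.

```python
import numpy as np

D = 40   # feature dimension, matching the 40-dimensional LDA features
T = 5    # a handful of toy frames
rng = np.random.default_rng(0)
x = rng.standard_normal((T, D))

def expand_context(x, n=1):
    """Stack each frame with n frames of left and right context
    (edge frames replicated), giving T x (2n+1)*D expanded features."""
    padded = np.vstack([x[:1]] * n + [x] + [x[-1:]] * n)
    return np.hstack([padded[i:i + len(x)] for i in range(2 * n + 1)])

xhat = expand_context(x, n=1)           # MLCF-1: context size one
assert xhat.shape == (T, 3 * D)

# Initialization: identity for the current frame's block, zeros for the
# side frames, so the starting transform is a no-op that ignores context.
A0 = np.hstack([np.zeros((D, D)), np.eye(D), np.zeros((D, D))])
assert np.allclose(xhat @ A0.T, x)      # at initialization, y_t = x_t
```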
The other point here is that starting with fMLLR and then adapting with context filtering actually doesn't give you any advantage over starting from the identity initialization.

This next plot is just showing how performance varies with the amount of data you provide to the transform estimation. We can see that the relative degradation in performance when you have less data (in this case ten utterances versus all utterances; all is a hundred here, I believe) is smaller for context filtering. I think the argument here is that because you're using context, you can do some averaging of the data you see, and that effectively regularizes the estimation to some extent, even though there are more parameters to estimate; so it's kind of counterintuitive, I think.

Okay, this is just a picture of a typical fMLLR transform estimated using our system; it's for the most part diagonal. And this is the corresponding context-filtering transform with one frame of context. You can see that, interestingly, it's not symmetric: the mapping from the previous frame to the current frame is almost diagonal, as is the current-to-current mapping, but the part corresponding to the future frame looks kind of random. A thing to keep in mind is that there is actually a whole subspace of maximum-likelihood solutions to this problem, so it's not clear whether this is an artifact of the optimization package, perhaps of the order in which it optimizes within that subspace.

Okay, so here are more results, collected using an MMI model, and again we're seeing significant gains: about a ten percent relative improvement over fMLLR on the sixty-mile-per-hour data. Once again, when we train an fMMI transform and then apply context filtering, we're still actually getting some gains; it's about a nine percent relative sentence error rate reduction over fMLLR.

Okay, so in summary: maximum likelihood context filtering extends the full-rank square-matrix technique called fMLLR to a non-square matrix.
We see some very nice gains on some pretty good systems (ML, MMI, and fMMI) when we apply this technique. In terms of future work, of course, one thing is to try a discriminative objective function, which is something I think they're looking at. Another question is how this technique interacts with traditional noise robustness methods like spectral subtraction, dynamic noise adaptation, et cetera. Okay, so that's all I have; hopefully this leaves some time for questions.

Question: For the plot you showed on improvement, you had ten utterances for each speed. How would you do that in a practical sense? Are you going to keep track?

Answer: Let me just go back to that slide. This is just investigating the amount of data needed for the transform to be effective. It's useful in the sense that if you need to enroll a speaker, for example on a cell phone, he only needs to speak ten utterances. By the way, that's a good point: each utterance is only about three seconds, so we're talking about thirty seconds of data and we're already almost completely adapted to the speaker, as opposed to fMLLR, which actually seems to need about thirty utterances to get to that stage.

[Session chair] Microphone for the third one, right there.

Question: From this chart, you're working on utterances collected at sixty miles per hour, but in a real scenario people drive slow or fast, so the SNR sequence you would see is not this scenario. Have you tested that scenario?

Answer: They didn't consider that in this work. This is a block optimization of the matrix: we just take a section of a speaker's data and see how many utterances are required to get decent gains. But that's certainly an important problem.

[Session chair] More questions? I have a quick one. I was actually kind of interested, when you were looking at the results, in context sizes one and two.
You showed the visualization of the context transforms; did the two context sizes look different? I found the visuals interesting and was wondering about that.

Answer: I see, right. I think this is one of the only ones they actually looked at. I was very curious about that myself, so I put the figures in at the last moment, actually. It's certainly worth examining; for the experiments they did, they found that performance saturated at about a left and right context of two. I think that asymmetry is something for future investigation and understanding.

[Session chair] Let's thank the speaker. We're going to need a minute to set up the next speaker.