0:00:15i i america of the list term the uh session chair and um a one second an advertisement if you
0:00:21see people wearing a lei ask them about asru
0:00:26um we're gonna start off
0:00:28uh the first uh paper is uh
0:00:31front-end feature transforms with context filtering for speaker adaptation
0:00:35um papers by ageing one
0:00:38i i take a
0:00:40as well uh
0:00:41as as well as raw were yeah
0:00:43a a all and and
0:00:45by how a go go all
0:00:47um and it will be presented by you any
0:00:56okay so uh the topic is front-end feature transforms with context filtering for speaker adaptation
0:01:03so uh this is an outline of the talk
0:01:06first i'll briefly motivate
0:01:08uh the work and explain uh
0:01:10basically what's
0:01:12uh what we're trying to accomplish
0:01:14uh then next i'll give an overview of
0:01:16uh the new technique called maximum likelihood context filtering
0:01:21and then we'll move uh straight into some experiments and results to see how it works
0:01:26okay so uh the topic is front-end
0:01:29speaker adaptation
0:01:31in terms of front-end transforms we usually uh do linear transforms
0:01:35or conditionally linear transforms
0:01:38uh perhaps the most popular technique is feature space mllr or maybe more popularly named constrained
0:01:44uh mllr
0:01:46and of course there's discriminative uh techniques that have been developed and nonlinear transformations
0:01:53uh some variants of fmllr have additionally been
0:01:55uh worked on in recent years quick fmllr
0:01:59and uh full covariance
0:02:01uh fmllr
0:02:03so today i'll tell you about another
0:02:06variant of fmllr
0:02:07and but first let's review fmllr so the idea is you're given
0:02:12a set of adaptation data
0:02:14and you want to estimate uh a linear transformation A
0:02:17and a bias
0:02:18B uh which can be
0:02:21combined into the linear
0:02:23uh matrix W
0:02:25so uh the key point about
0:02:28uh fmllr for the purposes of this talk is that the A matrix is square
0:02:32uh D by D in the notation used here
0:02:35and uh
0:02:36that makes it particularly easy to learn
0:02:42of course
0:02:43the main thing you need to deal with when you apply these transforms is
0:02:47uh the volume
0:02:48uh change compensation
0:02:50in the case of a linear transformation it's just the log determinant of A in red you see
0:02:55in our objective function Q
0:02:57uh the second term there is just your
0:03:00uh typical term you see it has the posterior
0:03:03probability of all the components of the acoustic model you're evaluating that's
0:03:09uh subscripted by J for each gaussian
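For reference, the square-case transform and objective described in this passage are usually written as below. This is the standard textbook form of the fmllr auxiliary function, not a copy of the slide; the symbols \gamma_j(t) (posterior of gaussian j at frame t), \mu_j, \Sigma_j (mean and covariance of gaussian j) and the frame count T are notation assumed here, with the posteriors taken to sum to one per frame.

```latex
y_t = A x_t + b = W \begin{bmatrix} x_t \\ 1 \end{bmatrix},
\qquad W = [\, A \;\; b \,], \quad A \in \mathbb{R}^{D \times D}

Q(A, b) = T \log \lvert \det A \rvert
  - \frac{1}{2} \sum_{t=1}^{T} \sum_{j} \gamma_j(t)\,
    (y_t - \mu_j)^{\top} \Sigma_j^{-1} (y_t - \mu_j)
```

The first term is the volume change compensation the speaker points to in red; the second is the usual posterior-weighted gaussian term.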
0:03:13okay so when we
0:03:16start to think about the non-square case
0:03:20what do we need to do so first let's set up the notation
0:03:23so we use the notation X hat of T
0:03:25to denote
0:03:27a vector with context in this case there's context size one
0:03:30so there's X T minus one X T and X T plus one being concatenated to make X hat of T
0:03:36so the model is
0:03:38Y
0:03:39equals A X hat
0:03:42we can uh condense this notation to
0:03:46form W
0:03:47and here the main difference is that A is not square in this case it is D by
0:03:52three D
0:03:53because the output Y
0:03:54is the original dimension of the input X
0:03:58but X hat is three D
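The notation set up here (stacking each frame with its neighbours, then mapping the 3D-dimensional stacked vector back to D dimensions) can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function name `stack_context`, the random toy data, and the choice to pad the edges by repeating the first and last frame are assumptions, since the talk does not say how utterance boundaries are handled.

```python
import numpy as np

def stack_context(X, n=1):
    """Concatenate each frame with its n left and n right neighbours.

    X has shape (T, D); the result has shape (T, (2n+1)*D).
    Edge frames are padded by repeating the first/last frame (an assumption).
    """
    T, D = X.shape
    padded = np.concatenate(
        [np.repeat(X[:1], n, axis=0), X, np.repeat(X[-1:], n, axis=0)], axis=0
    )
    # Column block k holds the frame at offset k - n relative to the current frame.
    return np.concatenate([padded[k:k + T] for k in range(2 * n + 1)], axis=1)

rng = np.random.default_rng(0)
T, D = 5, 4
X = rng.standard_normal((T, D))

X_hat = stack_context(X, n=1)          # shape (T, 3D)
A = rng.standard_normal((D, 3 * D))    # non-square transform, D x 3D
b = rng.standard_normal(D)

Y = X_hat @ A.T + b                    # y_t = A x_hat_t + b, shape (T, D)

# Condensed form: W = [A b] applied to the stacked vector extended with a 1.
W = np.hstack([A, b[:, None]])
Y_w = np.hstack([X_hat, np.ones((T, 1))]) @ W.T
```

The output keeps the original dimension D even though each input vector is 3D-dimensional, which is why A cannot be square.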
0:04:01so how do we estimate
0:04:03uh a non-square matrix with maximum likelihood
0:04:08uh an important point is there's no direct obvious way to do this and that's because
0:04:15you're changing the dimension of the space so there is no determinant volume
0:04:21that you can use
0:04:22uh in a straightforward manner to
0:04:24accomplish this
0:04:25so let's go back and look at how we get that term
0:04:28so basically what you say is that the
0:04:30log likelihood under the transformation
0:04:33of Y
0:04:36is equal to the log likelihood of the input variable up to a constant
0:04:40that is your
0:04:41compensation term
0:04:43so in the case that uh you assume that A is square you can readily
0:04:48confirm that the term is
0:04:51half the log
0:04:52ratio of the determinants
0:04:54of the input and output models assuming they're gaussian
0:04:57so this slide is just showing how you would write that
0:05:00there's L X
0:05:02a gaussian
0:05:03L Y
0:05:04uh when you uh
0:05:05uh get uh you know the linearly transformed data
0:05:08and essentially you equate them to find what C is and you find that
0:05:11it is the log ratio
0:05:13uh as we stated before
0:05:15so on the bottom line in red
0:05:18if you break down that ratio you see that uh the covariance of
0:05:22the variable Y
0:05:25this is just a known uh identity it's uh
0:05:29A sigma X
0:05:32A transpose
0:05:34uh sigma X
0:05:37um so the compensation term ends up being log determinant of A
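The argument in this passage can be written out as follows. This is a reconstruction from the spoken description rather than the slide itself; \mu_x and \Sigma_x denote the mean and covariance of a single gaussian model of the input, which is the assumption the speaker states.

```latex
L_x = -\tfrac{1}{2} \log \lvert 2\pi \Sigma_x \rvert
      - \tfrac{1}{2} (x - \mu_x)^{\top} \Sigma_x^{-1} (x - \mu_x)

L_y = -\tfrac{1}{2} \log \lvert 2\pi A \Sigma_x A^{\top} \rvert
      - \tfrac{1}{2} (y - A\mu_x - b)^{\top} (A \Sigma_x A^{\top})^{-1} (y - A\mu_x - b)

% With y = A x + b and A square and invertible, the quadratic terms are equal, so
L_x = L_y + C, \qquad
C = \tfrac{1}{2} \log \frac{\lvert A \Sigma_x A^{\top} \rvert}{\lvert \Sigma_x \rvert}
  = \log \lvert \det A \rvert
```

The last equality uses \lvert A \Sigma_x A^{\top} \rvert = (\det A)^2 \lvert \Sigma_x \rvert, which only holds when A is square. That is the step that has no direct counterpart in the non-square case.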
0:05:42um so in our case we're gonna assume that the compensation term
0:05:45remains the same
0:05:47uh we'll drop the
0:05:49log determinant of sigma X hat term
0:05:52because it does not depend on A
0:05:54and we're left
0:05:55with the
0:05:56A sigma X hat A transpose
0:05:58term uh that we had in the case that
0:06:01A was square
0:06:04so the modified objective becomes
0:06:07uh the following
0:06:08and the one point is that well what is this
0:06:11sigma X hat that was used well what they did is they used
0:06:15a full covariance approximation of all the speech features
0:06:19to come up with uh that
0:06:21uh full covariance sigma X hat
0:06:23and used it in this objective to
0:06:25learn A
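One way to write the modified objective described here is sketched below. The slide is not available in the transcript, so this form is an assumption: it keeps the half log-determinant of A \Sigma_{\hat{x}} A^{\top} from the square-case compensation term, drops the term in \Sigma_{\hat{x}} alone, and carries over the scaling by the frame count T and the posterior notation \gamma_j(t) from the standard fmllr objective.

```latex
Q(A, b) = \frac{T}{2} \log \bigl\lvert A\, \Sigma_{\hat{x}}\, A^{\top} \bigr\rvert
  - \frac{1}{2} \sum_{t=1}^{T} \sum_{j} \gamma_j(t)\,
    (A \hat{x}_t + b - \mu_j)^{\top} \Sigma_j^{-1} (A \hat{x}_t + b - \mu_j),
\qquad A \in \mathbb{R}^{D \times 3D}
```

When A is square this reduces to the usual fmllr objective up to a constant that does not depend on A.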
0:06:30okay so in terms of optimising
0:06:32uh this modified objective the statistics that you need are of
0:06:36the same
0:06:37form as in the square case of course the sizes are different
0:06:41but the two uh main quantities that you need to optimize
0:06:45the objective are
0:06:47to be able to evaluate the objective Q and the derivative of the objective
0:06:52um and so the row by row iterative update
0:06:55rule that people normally use
0:06:58cannot be applied here at least it's not obvious how to do it
0:07:03uh we're looking at it now but there are some ways to do something
0:07:06very similar
0:07:07uh but uh for the purposes of this paper just an uh optimization uh
0:07:14package was used the H C L package
0:07:17and uh as i mentioned before you just have to make the function and its gradient available
0:07:22at any point that the optimiser wants to evaluate
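The two quantities the speaker mentions, the objective and its gradient, can be sketched as a single function that a gradient-based optimizer would call. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the objective is (T/2) log det(A Sigma_hat A^T) minus the posterior-weighted gaussian term, assumes diagonal-covariance gaussians, works directly on frames rather than accumulated statistics, and uses random toy data. All names (`objective_and_grad`, `Sigma_hat`, `gamma`) are hypothetical, and nothing here reflects the interface of the package used in the paper.

```python
import numpy as np

def objective_and_grad(A, b, X_hat, gamma, means, variances, Sigma_hat):
    """Evaluate the assumed objective Q and its gradients w.r.t. A and b.

    A: (D, K) with K = 3D, b: (D,), X_hat: (T, K) stacked frames,
    gamma: (T, J) posteriors, means/variances: (J, D) diagonal gaussians,
    Sigma_hat: (K, K) full covariance of the stacked features.
    """
    T = X_hat.shape[0]
    Y = X_hat @ A.T + b                          # transformed frames, (T, D)
    R = Y[:, None, :] - means[None, :, :]        # residuals y_t - mu_j, (T, J, D)
    G = gamma[:, :, None] * R / variances[None]  # gamma * Sigma_j^{-1} r, (T, J, D)
    M = A @ Sigma_hat @ A.T                      # (D, D)
    _, logdet = np.linalg.slogdet(M)
    Q = 0.5 * T * logdet - 0.5 * np.sum(G * R)
    g_sum = G.sum(axis=1)                        # summed over gaussians, (T, D)
    # d/dA of (T/2) log det(A S A^T) is T (A S A^T)^{-1} A S for symmetric S.
    grad_A = T * np.linalg.solve(M, A @ Sigma_hat) - g_sum.T @ X_hat
    grad_b = -g_sum.sum(axis=0)
    return Q, grad_A, grad_b

# Toy problem (random data, for illustration only).
rng = np.random.default_rng(0)
T, D, J = 50, 2, 3
K = 3 * D
X_hat = rng.standard_normal((T, K))
gamma = rng.random((T, J))
gamma /= gamma.sum(axis=1, keepdims=True)
means = rng.standard_normal((J, D))
variances = rng.random((J, D)) + 0.5
Sigma_hat = np.cov(X_hat, rowvar=False)
A = np.hstack([np.zeros((D, D)), np.eye(D), np.zeros((D, D))])
A = A + 0.1 * rng.standard_normal((D, K))
b = np.zeros(D)

Q, grad_A, grad_b = objective_and_grad(A, b, X_hat, gamma, means, variances, Sigma_hat)

# Central finite difference on one entry of A as a sanity check of the gradient.
eps = 1e-6
E = np.zeros_like(A)
E[0, 1] = eps
Q_plus = objective_and_grad(A + E, b, X_hat, gamma, means, variances, Sigma_hat)[0]
Q_minus = objective_and_grad(A - E, b, X_hat, gamma, means, variances, Sigma_hat)[0]
numeric_grad = (Q_plus - Q_minus) / (2 * eps)
```

A general-purpose optimizer would then maximize Q (or minimize its negative) by calling this function at whatever points it chooses, which matches the "function and its gradient" interface described in the talk.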
0:07:27okay so uh
0:07:30that's essentially the math i'll try and leave some time for questions at the end
0:07:34if there's any more uh
0:07:35details that are
0:07:38um so moving right on to training data and models so
0:07:41uh the training data
0:07:43for the task we evaluated this
0:07:44technique on was collected in stationary noise
0:07:47there's about eight hundred hours of it
0:07:49uh a word internal uh
0:07:52a word internal model with kind of phone context
0:07:55eight hundred thirty
0:07:57context dependent states and uh
0:07:58ten K gaussians was trained
0:08:01and uh the technique was tested on
0:08:04uh L D A forty dimensional features and on models
0:08:08built using maximum likelihood
0:08:10uh mmi
0:08:12and a model uh with an fmmi transformation applied before
0:08:17uh we apply context filtering
0:08:21uh in terms of test data uh it was recorded in car at three different speeds zero thirty and sixty miles
0:08:27per hour
0:08:28uh there were four tasks addresses
0:08:30digits commands and radio control
0:08:33and that's about uh
0:08:34twenty six K utterances and uh
0:08:37a total of a hundred and thirty thousand words
0:08:40here is the distribution of the
0:08:42the snr distribution of this data
0:08:45in terms of
0:08:47uh you can see that most noise is obtained for the sixty
0:08:52mile per hour data and for that data about
0:08:54basically half of the data is below
0:08:57say twelve and a half db
0:08:59uh we estimate it using a forced alignment
0:09:05okay so for experiments
0:09:07uh context filtering was tried for speaker adaptation
0:09:11training speaker dependent uh
0:09:13uh that being uh
0:09:15the canonical model
0:09:16so uh mlcf N just a little uh nomenclature here
0:09:21it is uh
0:09:22maximum likelihood context filtering with context size N
0:09:26so one would be plus or minus one
0:09:29frame is included in the context
0:09:31when computing the transform
0:09:36so for all the experiments
0:09:38uh we
0:09:39the transform was initialized with identity uh with respect to the current
0:09:43frame's parameters
0:09:44and the side frames were
0:09:46uh initialized to
0:09:48have zero
0:09:49to zeros
0:09:51just for reference they also tried using
0:09:53for the centre
0:09:54uh part of the matrix the fmllr that was estimated
0:09:59uh
0:10:00using the usual technique
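The initialization described here (identity on the current frame, zeros on the side frames) can be sketched as follows. The function name `init_context_transform` and the toy vectors are assumptions made for illustration; only the block structure comes from the talk.

```python
import numpy as np

def init_context_transform(D, n=1):
    """Return A0 of shape (D, (2n+1)*D): identity on the current frame,
    zeros on the n left and n right context frames."""
    A0 = np.zeros((D, (2 * n + 1) * D))
    A0[:, n * D:(n + 1) * D] = np.eye(D)
    return A0

D = 3
A0 = init_context_transform(D, n=1)

# With this start point the transform simply passes the current frame through.
x_prev = np.arange(0.0, 3.0)
x_cur = np.arange(3.0, 6.0)
x_next = np.arange(6.0, 9.0)
x_hat = np.concatenate([x_prev, x_cur, x_next])
y = A0 @ x_hat
```

The alternative start point mentioned in the talk would replace the centre identity block with a previously estimated fmllr matrix, leaving the side blocks at zero.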
0:10:03okay so in terms of results
0:10:05let's skip ahead to that so
0:10:08clearly fmllr
0:10:10uh brings a lot over the baseline on this data
0:10:13and when uh you turn on context filtering
0:10:16you actually get some significant gains
0:10:18in the sixty mile per hour column and you can see that they're highlighted in red
0:10:22so this is actually twenty three percent
0:10:25uh relative gain in word error rate thirty percent in sentence error rate over fmllr
0:10:30um
0:10:31the other point here is that starting with fmllr
0:10:33and then adapting actually doesn't give you any advantage over
0:10:37uh starting with an identity matrix
0:10:41uh this plot is just showing how
0:10:44uh performance varies with the um with the amount of data you provide
0:10:48uh to the transform estimation
0:10:50so uh
0:10:52here we can see that actually
0:10:54the relative uh degradation in performance when you have less data as in this case ten utterances
0:10:59ten utterances versus
0:11:01all utterances i believe all is a hundred in this case
0:11:06is less and i think the argument here is that
0:11:10uh you're using context so you can do some averaging
0:11:14of the data you see and that's
0:11:16effectively regularizing the
0:11:18estimation to some
0:11:19extent although there's more parameters to estimate so
0:11:23kind of counterintuitive i think
0:11:28okay uh this is just a picture of a typical fmllr transform estimated uh using our
0:11:33system it's
0:11:34it's uh
0:11:35for the most part diagonal
0:11:38and uh this is the corresponding
0:11:40uh one frame of context
0:11:42uh context filtering transform so you can see
0:11:44interestingly it's not symmetric the
0:11:47the uh
0:11:49the mapping from previous to current frame is almost diagonal so is the current to current frame
0:11:58uh but the
0:12:00uh current to the future looks
0:12:01kind of random
0:12:05another thing to keep in mind is that this is
0:12:08not uh it's
0:12:09there's actually
0:12:11a whole subspace of optimal solutions to this problem
0:12:15so it's not clear if this is an artifact of the optimization package
0:12:18perhaps the order in which uh it optimizes the subspace and whatnot
0:12:28okay so here's more results
0:12:31uh collected uh using an mmi model
0:12:34and again uh we're seeing some significant gains
0:12:39uh over fmllr about ten percent relative improvement
0:12:43on the sixty mile per hour data
0:12:47uh once again when we uh train an fmmi transform and then apply context filtering
0:12:53we're still actually getting some gains
0:12:56it's about
0:12:58nine percent
0:12:59relative sentence error rate reduction over fmllr
0:13:06okay so in summary uh mlcf extends the full rank square matrix
0:13:12of fmllr to non-square matrices
0:13:15and there's some very nice gains
0:13:18on uh
0:13:19some pretty good systems
0:13:21that use both mmi and fmmi
0:13:24uh when we apply this technique
0:13:26so in terms of future work
0:13:28of course as usual
0:13:30we should uh
0:13:32try a discriminative objective function that is something that i think they're looking at of course
0:13:38and another question is how this technique interacts with traditional noise robustness methods like spectral subtraction
0:13:45dynamic noise adaptation et cetera
0:13:49that's all i have hopefully there's some time for questions
0:13:55i is use my
0:14:00so the plot you have for improvement so that you had to have ten utterances
0:14:04for each speed how do you do that
0:14:07in a practical sense are you going to keep track
0:14:12uh let me just go to the us you
0:14:15i mean this is just uh investigating the amount of data that is needed for the transform to be effective
0:14:21so uh this is useful
0:14:24in the sense that if you need to enroll a speaker for example on a cell phone
0:14:28he only needs to talk ten utterances by the way that's a good point uh each utterance is only about
0:14:33three seconds
0:14:34so we're talking about
0:14:36you know uh thirty seconds of uh data uh we're already
0:14:40almost
0:14:42uh completely adapted to the speaker as opposed to
0:14:45fmllr that
0:14:46actually seems to need about thirty
0:14:48to be at that stage
0:14:58oh microphone
0:14:59for the third one right there
0:15:03so from this chart you're working on
0:15:05excuse me
0:15:06uh you know
0:15:08utterances collected at sixty
0:15:10miles per hour speed
0:15:12in a real scenario in people drive store high we
0:15:16so the you know sequence
0:15:18uh uh you know uh as made you that with this same are is not this scenery
0:15:22uh have you tested that scenario
0:15:25yeah they didn't consider that in this work but uh so this is a block
0:15:30this block optimization of the matrix actually
0:15:33just take a section of
0:15:34speaker data and
0:15:35see how many utterances are required
0:15:38to get uh decent gains
0:15:42but that's certainly an important problem
0:15:49two more questions
0:15:52have a quick one
0:15:53um i was actually kind of interested when you were looking at
0:15:57the results for one and two
0:15:59like to be two different um when you know
0:16:02the they actually do visualisation of the two
0:16:05um context
0:16:07was it's great and i found that the visual interesting and i was wondering at that
0:16:12i think but different
0:16:16oh i see
0:16:18right i i think that uh
0:16:21i think this is one of the only ones they actually looked at okay
0:16:25i was very curious about that myself
0:16:28ready put them in at the last moment
0:16:32yeah that's very true choosing certainly uh
0:16:37but uh for the experiments they did they found that performance was saturating at about
0:16:43but and right context
0:16:45so uh i think that uh asymmetry and that is
0:16:50for future uh
0:16:52investigation and understanding
0:16:55let's thank the speaker
0:17:00maybe we're gonna need a minute to set up the next speaker so