Um, I am [unintelligible], the session chair. And, one second, an advertisement: if you see people wearing a [unintelligible], you may ask them about [unintelligible]. Okay, um, we're going to start off. The first paper is "Front-End Feature Transforms with Context Filtering for Speaker Adaptation". The paper is by [authors' names unintelligible], and it will be presented by [unintelligible].
Okay, so the topic is front-end feature transforms with context filtering for speaker adaptation. So here's an outline of the talk. First I'll briefly motivate the work and explain basically what we're trying to accomplish. Then I'll give an overview of the new technique, called maximum likelihood context filtering. And then we'll move straight into some experiments and results to see how it works.
Okay, so the topic is front-end transforms for speaker adaptation. In terms of front-end transforms, we usually do linear transforms or conditionally linear transforms. Perhaps the most popular technique is feature-space MLLR, maybe more popularly named constrained MLLR. And of course there are discriminative techniques that have been developed, and nonlinear transformations. Some variants of fMLLR that have been worked on in recent years are quick fMLLR and full-covariance fMLLR.
So today I'll tell you about another variant of fMLLR. But first let's review fMLLR. The idea is: you're given a set of adaptation data, and you want to estimate a linear transformation A and a bias b, which can be concatenated into the linear matrix W. The key point about fMLLR, for the purposes of this talk, is that the A matrix is square, D by D in the notation used here, and that makes it particularly easy to learn.
Of course, the main thing you need to deal with when you apply these transforms is the volume-change compensation. In the case of a linear transformation, it's just the log determinant of A, in red here in our objective function Q. The second term there is just the typical term you see: it has the posterior probability of all the components of the acoustic model you're evaluating, that's gamma, subscripted by j for each Gaussian.
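To make that objective concrete, here is a minimal sketch of my own (not the authors' code); it assumes diagonal model covariances, a frame-by-Gaussian posterior matrix `gammas`, and drops terms constant in W:

```python
import numpy as np

def fmllr_Q(W, X, gammas, means, variances):
    """fMLLR objective, up to terms constant in W:
    Q(W) = beta * log|det A| + sum_{t,j} gamma_{t,j} * log N(A x_t + b; mu_j, Sigma_j),
    with W = [A b] and beta the total posterior count."""
    A, b = W[:, :-1], W[:, -1]
    Y = X @ A.T + b                         # transformed features, one row per frame
    _, logdet = np.linalg.slogdet(A)        # volume-change compensation log|det A|
    Q = gammas.sum() * logdet
    for mu, var, g in zip(means, variances, gammas.T):
        diff = Y - mu                       # residuals against Gaussian j
        Q -= 0.5 * g @ np.sum(diff * diff / var, axis=1)
    return Q
```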
Okay, so when we start to think about the non-square case, what do we need to do? First, let's set up the notation. We use the notation x-hat of t to denote a vector with context; in this case the context size is one, so x at t minus one, t, and t plus one are concatenated to make x-hat of t. So the model is y equals A times x-hat. We can condense this notation to form W, and the main difference here is that A is not square: in this case it is D by 3D, because the output y has the original dimension D of the input x, but x-hat is 3D-dimensional.
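Just to illustrate the stacking and the non-square transform, here is a sketch of my own (the edge handling, repeating the first and last frame, is one arbitrary choice):

```python
import numpy as np

def stack_context(X, context=1):
    """Build x_hat_t = [x_{t-context}; ...; x_t; ...; x_{t+context}] for every
    frame, repeating the first/last frame at the sequence edges."""
    T = X.shape[0]
    padded = np.vstack([X[:1]] * context + [X] + [X[-1:]] * context)
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

X = np.random.randn(100, 40)        # 100 frames of 40-dimensional features
X_hat = stack_context(X)            # shape (100, 120): x_hat is 3D-dimensional
A = np.random.randn(40, 120)        # non-square: D by 3D
b = np.zeros(40)
Y = X_hat @ A.T + b                 # output y is back in the original D dimensions
```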
So how do we estimate a non-square matrix A with maximum likelihood? An important point is that there's no direct, obvious way to do this, and that's because you're changing the dimension of the space, so there is no determinant volume term that you can use in a straightforward manner to accomplish this.
So let's go back and look at how we get that term. Basically, what you say is that the log likelihood under the transformation, of y, is equal to the log likelihood of the input variable up to a constant; that constant is your compensation term. In the case where you assume that A is square, you can readily confirm that the term is half the log ratio of the determinants of the input and output models, assuming Gaussians. This slide is just showing how you would derive that: there's L of X, a Gaussian, and L of Y, the likelihood you get if you model the linearly transformed data, and essentially you equate them to find what C is, and you find that it is the log ratio of determinants, as we stated before. On the bottom line, in red, if you break down that ratio, you see that the covariance of the variable Y is, by a known identity, A times sigma X times A transpose. So the compensation term ends up being the log determinant of A sigma X A transpose.
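In symbols, my reconstruction of that derivation (from the spoken description, not the slide itself) reads:

```latex
\Sigma_Y = A\,\Sigma_X\,A^{\top},
\qquad
C \;=\; \tfrac{1}{2}\log\frac{\det\Sigma_X}{\det\Sigma_Y}
  \;=\; \tfrac{1}{2}\log\det\Sigma_X \;-\; \tfrac{1}{2}\log\det\!\bigl(A\,\Sigma_X\,A^{\top}\bigr)
```

When A is square and invertible, det(A Sigma_X A-transpose) equals (det A) squared times det Sigma_X, so the A-dependent part collapses to the familiar log-determinant-of-A term of fMLLR.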
So in our case, we're going to assume that the compensation term remains the same form. We will drop the log determinant of sigma x-hat term, because it does not depend on A, and we're left with the A sigma x-hat A transpose term, analogous to what we had in the case where A was square. So the modified objective becomes the following. And one point is: well, what is this sigma x-hat that is used? What they did is they used a full-covariance approximation over all the speech features to come up with that full-covariance sigma x-hat, and used it in this objective to learn A.
Okay, so in terms of optimizing this modified objective, the statistics that you need are of the same form as in the square case; of course the sizes are different. The two main quantities that you need to optimize the objective are the ability to evaluate the objective Q and the derivative of the objective. The row-by-row iterative update that people normally use cannot be applied here, at least it's not obvious how to do it; we're looking at it now, and there are some ways to do something very similar. But for the purposes of this paper, a gradient-based optimization package was used, the HCL package. As I mentioned before, you just hand it the function and its gradient, available at any point that the optimizer wants to evaluate.
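As a sketch of that setup, here is the same idea with scipy.optimize standing in for HCL (my substitution, not what the paper used; diagonal model covariances are assumed, the bias is omitted for brevity, and the analytic gradient is left as a hypothetical stub):

```python
import numpy as np
from scipy.optimize import minimize

def neg_modified_Q(a_flat, X_hat, gammas, means, variances, sigma_xhat, D):
    """Negative of the modified objective: the usual Gaussian auxiliary term
    plus (beta/2) * log det(A Sigma_xhat A^T) as the compensation."""
    A = a_flat.reshape(D, -1)                          # D x (2n+1)D, non-square
    Y = X_hat @ A.T
    _, logdet = np.linalg.slogdet(A @ sigma_xhat @ A.T)
    Q = 0.5 * gammas.sum() * logdet                    # compensation term
    for mu, var, g in zip(means, variances, gammas.T):
        diff = Y - mu
        Q -= 0.5 * g @ np.sum(diff * diff / var, axis=1)
    return -Q

# The optimizer only needs the function and its gradient at any requested point.
# A0 is the initial transform (see the initialization discussed later in the talk):
# result = minimize(neg_modified_Q, A0.ravel(),
#                   args=(X_hat, gammas, means, variances, sigma_xhat, D),
#                   jac=neg_grad_Q,          # hypothetical analytic gradient
#                   method="L-BFGS-B")
```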
Okay, so that's essentially the method. I'll try and leave some time for questions at the end, if there are any more details that are requested.
So, moving right on to training data and models. The training data for the task we evaluated this technique on was collected in stationary noise; there's about eight hundred hours of it. A word-internal model with quinphone context, eight hundred thirty context-dependent states, and ten K Gaussians was trained. The technique was tested on LDA forty-dimensional features, on models built using maximum likelihood and MMI, and on a model with an fMMI transformation applied before we apply context filtering. In terms of test data, it was recorded in a car at three different speeds: zero, thirty, and sixty miles per hour. There were four tasks: addresses, digits, commands, and radio control. That's about twenty-six K utterances, and a total of a hundred and thirty thousand words.
Here is the SNR distribution of this data in terms of speed. You can see that most of the noise is in the sixty-mile-per-hour data; for that data, basically half of the data is below, say, twelve and a half dB. We estimated the SNR using a forced alignment.
Okay, so for the experiments, context filtering was tried for speaker adaptation, training speaker-dependent transforms against the canonical model. And MLCF-n, just a little nomenclature here, is maximum likelihood context filtering with context size n; so n equal to one would be plus or minus one frame included in the context when computing the transform.
For all the experiments, the transform was initialized with identity with respect to the current frame's parameters, and the side frames were initialized to zeros. Just for reference, they also tried using, for the centre part of the matrix, the fMLLR that was estimated using the usual technique.
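A small sketch of that initialization, again my own illustration (`A_fmllr` is a hypothetical pre-estimated square fMLLR matrix):

```python
import numpy as np

def init_mlcf(D, n, center=None):
    """D x (2n+1)D starting point for MLCF-n: identity (or a supplied square
    matrix) on the current frame, zeros on the +/- n side frames."""
    blocks = [np.zeros((D, D))] * n
    middle = np.eye(D) if center is None else center
    return np.hstack(blocks + [middle] + blocks)

A0 = init_mlcf(40, 1)                     # identity centre, zero side frames
# A0 = init_mlcf(40, 1, center=A_fmllr)   # the reference variant: fMLLR in the centre
```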
Okay, so in terms of results, let me skip ahead to that. Clearly, fMLLR brings a lot over the baseline on this data, and when you turn on context filtering, you actually get some significant gains in the sixty-mile-per-hour condition; you can see them highlighted in red. This is actually a twenty-three percent relative gain in word error rate, thirty percent in sentence error rate, over fMLLR. The other point here is that starting with fMLLR and then adapting actually doesn't give you any advantage over starting with an identity matrix.
This plot is just showing how performance varies with the amount of data you provide to the transform estimation. We can see that the relative degradation in performance when you have less data, in this case ten utterances versus all utterances (I believe "all" is a hundred in this case), is smaller. I think the argument here is that you're using context, so you can do some averaging of the data you see, and that effectively regularizes the estimation to some extent, although there are more parameters to estimate, so it's kind of counterintuitive, I think.
Okay, this is just a picture of a typical fMLLR transform estimated using our system; it's, for the most part, diagonal. And this is the corresponding one-frame-of-context, context-filtering transform. You can see, interestingly, that it's not symmetric: the mapping from the previous frame to the current frame is almost diagonal, and so is the current-frame-to-current-frame mapping, as seen here, but the mapping from the future frame looks kind of random. One thing to keep in mind is that there's actually a whole subspace of solutions to this problem, so it's not clear whether this is an artifact of the optimization package, perhaps of the order in which it optimizes the subspace, and whatnot.
Okay, so here are more results, collected using an MMI model. And again, we're seeing significant gains over fMLLR: about ten percent relative improvement on the sixty-mile-per-hour data. Once again, when we train an fMMI transform and then apply context filtering, we're still actually getting some gains; it's about nine percent relative sentence-error-rate reduction over fMLLR.
Okay, so to summarize: MLCF extends the full-rank square-matrix technique, fMLLR, to a non-square matrix, and there are some very nice gains on some pretty good systems, built using ML, MMI, and fMMI, when we apply this technique. In terms of future work: trying a discriminative objective function is something that I think they're looking at, of course. Another question is how this technique interacts with traditional noise robustness methods like spectral subtraction, dynamic noise adaptation, et cetera.
Okay, so that's all I have; hopefully this leaves some time for questions.
So, for the plot you have for improvement, where you had ten utterances for each speed: how do you do that in a practical sense? Are you going to keep track?
hi
uh let me just go to the us you
i mean this is just a investigating the amount of data is needed for the transform to be effective
so uh the this is useful
in sense that if you need to roll a speaker for example on a cell phone
he only needs to talk ten utterances by the way that's a good point uh each utterance is only about
three seconds
so we're talking about
you know uh at thirty seconds of a of data uh were already
almost that
a completely adapted to do the speaker as opposed to
that from more are that's
actually seems the need about thirty
to be at that stage
Uh, there's a microphone for the third one, right there.
So, from this chart, you're working on, excuse me, utterances collected at sixty miles per hour. In a real scenario, people drive in town or on the highway, so the sequence of utterances you adapt with does not all have the same SNR; it's not this scenario. Have you tested that scenario?
Yeah, they didn't consider that in this work; this is a block optimization of the matrix, actually. So we just take a section of a speaker's data and see how many utterances are required to get decent gains. But that's certainly an important problem.
Two more questions? I have a quick one.
Um, I was actually kind of interested: when you were looking at the results for one and two frames of context, did they do a visualization of the two different contexts? I found the visual interesting, and I was wondering whether they look different.
Oh, I see. Right, I think this is one of the only ones they actually looked at, okay? I was very curious about that myself; they were put in at the last moment, actually. Um, yeah, that's very true. But for the experiments they did, they found that performance was saturating at about a left and right context of two. So I think that asymmetry is something for future investigation and understanding.
Let's thank the speaker. We're going to need a minute to set up the next speaker, so...