Thank you for the introduction. This is joint work with my adviser, and the title of our work is "Log-Spectral Enhancement Using Speaker-Dependent Priors for Speaker Verification."

The key idea behind this work is how we can use Bayesian parameter estimation techniques to improve the robustness of speaker verification systems to noise and mismatch.

The reason we want to use a Bayesian technique is that a Bayesian approach gives us a principled way of accounting for parameter uncertainty in the noise estimation task.

As in most pattern recognition systems, feature extraction is a key component: it is used to extract the parameters of interest from the raw signal. In this case we have speech which is corrupted by noise, and we want to extract features of interest that we then use in a classification algorithm.

The noise makes our parameter estimates erroneous in some cases, and the severity of the noise determines how much of an effect there is on the parameter estimates. If we can account for this uncertainty in a Bayesian estimate, we can probably enhance our speaker verification system.

Here we can see the two main causes of performance degradation: noise, which we have just discussed, and mismatch. In a speaker verification system we need a model of each speaker's distribution, and the acoustic environment in which the training data was collected may not be the same environment in which the system is used. This results in mismatch, and hence performance degradation.

So the aim of our work, as the title suggests, is enhancement using speaker-dependent priors in the log-spectral domain. The key idea is that we want to link two systems which we feel are closely matched: the speech enhancement system and the recognition system.

The intuition is that when you are doing speech enhancement, you are enhancing features, and with speaker-dependent priors, if you have a better idea of who is speaking and a good prior in that domain, then you can do a better job of enhancing; and with the enhanced signal, you can do a better job of recognition. So there is an interplay between these two systems.

The way we capture this interplay is as message passing along the nodes of our graphical model, and this will fall out of our formulation.

Just a brief outline of what the rest of the talk will be like: I will briefly go over speaker verification for any members of the audience who may need it, then Bayesian inference, and then variational Bayesian inference, which is the framework we work in. Then I will discuss our model, and finally go into the experimental results.

In speaker verification the task is this: you are given an utterance and a claimed identity, and you must decide whether the speech segment X is from speaker S or not. This is a hypothesis test.

What we do is model our target speakers using speaker-specific GMMs, and then use a universal background model (UBM) to score the alternative hypothesis. This is the usual baseline GMM-UBM system, which is the starting point for most verification systems; it is the most basic, and it is where we will try our proposed enhancement in the log-spectral domain to see if we can obtain improvements.

For the classification decision, when we have the test utterance we compute a score, the log-likelihood ratio, and then compare it to a threshold to decide which hypothesis is correct. As a performance metric we can plot a DET curve and also compute the equal error rate, to characterize the trade-off between missed detections and false alarms.
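The scoring and equal-error-rate computation described above can be sketched as follows; this is a minimal numpy sketch that assumes per-frame log-likelihoods are already available from the target GMM and the UBM, and the function names are my own:

```python
import numpy as np

def utterance_score(loglik_target, loglik_ubm):
    """Average per-frame log-likelihood ratio between target GMM and UBM."""
    return float(np.mean(np.asarray(loglik_target) - np.asarray(loglik_ubm)))

def equal_error_rate(true_scores, impostor_scores):
    """Sweep thresholds over all observed scores; return the operating point
    where the miss rate and false-alarm rate are (nearly) equal."""
    true_scores = np.asarray(true_scores)
    impostor_scores = np.asarray(impostor_scores)
    thresholds = np.sort(np.concatenate([true_scores, impostor_scores]))
    best_gap, eer = np.inf, None
    for t in thresholds:
        p_miss = np.mean(true_scores < t)      # true trials rejected
        p_fa = np.mean(impostor_scores >= t)   # impostor trials accepted
        if abs(p_miss - p_fa) < best_gap:
            best_gap, eer = abs(p_miss - p_fa), (p_miss + p_fa) / 2.0
    return eer
```

A trial is accepted when the averaged log-likelihood ratio exceeds the threshold; sweeping the threshold traces out the DET curve, and the EER is the point where the two error rates cross.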

So that is the speaker verification part; now just a little bit of Bayesian inference. We can say there are two main approaches to parameter estimation: you can go the maximum likelihood route, or the Bayesian inference route.

Here we have data X, represented in this figure by the node X, whose generative model is governed by a parameter theta. In the maximum likelihood paradigm we assume that this parameter is an unknown constant; the quantity of interest is then the likelihood, and we estimate theta based on the maximum likelihood criterion.

In the Bayesian paradigm, on the other hand, we assume that theta is a random variable governed by a prior, and this is where the robustness to parameter uncertainty comes in: the fact that we have a prior over the parameter of interest. The key quantity in this case is the posterior, which by Bayes' rule is proportional to the product of the likelihood and the prior.

The issue is then how we obtain estimates: we obtain Bayesian estimates that minimize expected costs. For instance, if the cost is the squared norm of the difference between the estimate and the true value, as in the expression here, then it is well known that the resulting estimate, the minimum mean square error (MMSE) estimate, is just the posterior mean.

Note that this is easy to write down, but in most practical cases, including the one we consider here, it is almost impossible to compute exactly.
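For intuition, in simple conjugate cases the posterior mean is available in closed form. Here is a toy sketch, a hypothetical Gaussian-mean example rather than the model in this talk: estimating a Gaussian mean with known noise variance under a Gaussian prior.

```python
import numpy as np

def gaussian_posterior_mean(x, prior_mean, prior_var, noise_var):
    """MMSE estimate of a Gaussian mean: a precision-weighted combination of
    the prior mean and the sample evidence (conjugate Gaussian-Gaussian case)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mean = post_var * (prior_mean / prior_var + x.sum() / noise_var)
    return post_mean, post_var
```

With no data the estimate falls back to the prior mean, and with abundant data it approaches the sample mean; this is exactly the "principled accounting for uncertainty" that motivates the Bayesian route.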

So now what do we do? If the problem lies in the intractability of the posterior, then we can apply approximate Bayesian techniques; for instance, we can use VB, or variational Bayes, where we approximate our true posterior by one constrained to lie in a tractable family. That is, we need a mapping between two families of distributions: an intractable one and a tractable one.

And we need a metric so that we know which member of the tractable family is the closest approximation to the true posterior; we obtain the approximation that minimizes the KL divergence between our approximation and the true posterior.

In cases where our parameter set consists of a number of parameters, as it does here, we can ensure tractability by assuming that the posterior factorizes into a product, as shown in this expression.

The question then boils down to computing the forms of the approximate posterior for each of the factors, and then iteratively updating the sufficient statistics. It can be shown that the expression for the approximate form of each factor is computed by taking an expectation of the logarithm of the joint distribution between the observations and the parameters of interest.
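That update rule can be made concrete on the standard textbook toy model, a univariate Gaussian with unknown mean and precision and a mean-field factorization q(mu)q(tau); this is not our speech model, just an illustration of cycling the factor updates, each one an expectation of the log joint under the other factor.

```python
import numpy as np

def cavi_gaussian(x, mu0=0.0, lam0=1.0, a0=1e-3, b0=1e-3, iters=50):
    """Mean-field VB for x_i ~ N(mu, 1/tau), with priors
    mu ~ N(mu0, 1/(lam0*tau)) and tau ~ Gamma(a0, b0)."""
    x = np.asarray(x, dtype=float)
    n, xbar = x.size, x.mean()
    e_tau = a0 / b0                   # initial guess for E[tau]
    a_n = a0 + (n + 1) / 2.0          # fixed by the model
    for _ in range(iters):
        # q(mu) = N(mu_n, 1/lam_n): expectation of log joint over q(tau)
        mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)
        lam_n = (lam0 + n) * e_tau
        # q(tau) = Gamma(a_n, b_n): expectation of log joint over q(mu)
        e_mu, e_mu2 = mu_n, mu_n**2 + 1.0 / lam_n
        b_n = b0 + 0.5 * (np.sum(x**2) - 2 * e_mu * np.sum(x) + n * e_mu2
                          + lam0 * (e_mu2 - 2 * mu0 * e_mu + mu0**2))
        e_tau = a_n / b_n
    return mu_n, lam_n, a_n, b_n
```

Each update only needs the current sufficient statistics of the other factor, which is exactly the cycling-to-convergence scheme used in the model presented later in the talk.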

So now let us get back to our speaker verification context, and in particular let us discuss the probabilistic model.

Here we work in the log-spectral domain, and we assume that our observed signal y(t) is the clean signal corrupted by additive noise. Taking the DFT, we can compute the log spectrum as shown, and it can be shown that there is a nice approximate relationship between the log spectrum of the observed signal, the clean log spectrum, and the log spectrum of the noise.

This approximate relationship serves as our likelihood; recall that in the Bayesian paradigm we need both the likelihood and the prior.
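Under the common assumption that speech and noise powers add with cross-terms ignored, the relationship referred to here is y = log(e^x + e^n) for log power spectra, which exceeds max(x, n) by at most log 2. A sketch under those assumptions (the exact likelihood in the talk may differ):

```python
import numpy as np

def log_power_spectrum(frame, n_fft=512, eps=1e-12):
    """Log power spectrum of one windowed frame."""
    return np.log(np.abs(np.fft.rfft(frame, n_fft)) ** 2 + eps)

# If clean (x) and noise (n) log power spectra add in the power domain, the
# observed log spectrum y = log(exp(x) + exp(n)) is tightly approximated by
# the elementwise max: max(x, n) <= y <= max(x, n) + log(2).
x = np.array([1.0, 5.0, -3.0])
n = np.array([0.5, -2.0, 2.0])
y = np.logaddexp(x, n)
assert np.all(y >= np.maximum(x, n))
assert np.all(y - np.maximum(x, n) <= np.log(2.0) + 1e-12)
```

The bound follows from y = max(x, n) + log(1 + e^{-|x - n|}), where the correction term lies between 0 and log 2.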

Now we need to write out our joint distribution and how it factorizes, because this will help us when we compute the approximate distributions: recall that the expression for each of the optimal factors depends on an expectation of the logarithm of the joint distribution. So this is how the joint distribution factorizes in this context.

In the graph we have the observed log spectrum, the clean log spectrum, an indicator variable (which I will explain shortly) that we introduce to model uncertainty, and the noise. So here you have the likelihood, and the prior over the clean speech log spectrum, which we assume is speaker dependent.

Because the prior is speaker dependent, in a speaker ID context this would mean that we need models for each speaker. In a verification context, what we do instead is approximate the library of possible speakers by just the target speaker and the UBM, so the library is dynamic for each utterance you are testing. And the indicator variable tells us who is speaking: in other words, whether the target or the UBM is active, and which mixture component.

This slide just shows the forms of the factors that we compute, and we can see that they are well-known forms. The VB algorithm boils down to iteratively updating the sufficient statistics, in our case the mean and the covariance, which are a function of the observations and the prior, and cycling through until some convergence criterion is attained.

What is nice is that once you obtain an estimate of the clean posterior, you can easily derive MFCCs from it for verification.
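The step from an enhanced log (mel) spectrum to cepstral features is essentially a DCT. A sketch, assuming the enhanced posterior mean is already on a mel scale; filterbank and liftering details are omitted, and the function names are my own:

```python
import numpy as np

def dct2_ortho(v):
    """Orthonormal type-II DCT, the transform used to go from a log mel
    spectrum to the cepstrum."""
    v = np.asarray(v, dtype=float)
    n = v.size
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    out = 2.0 * basis @ v
    out[0] *= np.sqrt(1.0 / (4 * n))    # orthonormal scaling, k = 0
    out[1:] *= np.sqrt(1.0 / (2 * n))   # orthonormal scaling, k > 0
    return out

def mfcc_from_log_mel(log_mel, n_ceps=13):
    """Keep the low-order cepstral coefficients as MFCCs."""
    return dct2_ortho(log_mel)[:n_ceps]
```

Because the transform is linear, it can be applied directly to the posterior mean of the enhanced log spectrum; the orthonormal scaling makes the transform energy-preserving.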

Now some experimental results. We used three datasets: initially we used TIMIT, then the MIT mobile device speaker verification corpus, and then we also tried it on the NIST SRE 2004 corpus.

The initial results here are for TIMIT. We trained a UBM using training data from a subset of the 630 speakers, and then we corrupted the speech using additive white Gaussian noise; I will present results for realistic noise later. We used two test utterances per speaker.

From the 630 speakers we can generate 1260 true trials; we then select a random subset of ten speakers as impostors and compute scores for each trial.

We also compared against our own implementation of FDIC, a feature-domain intersession compensation technique, which entails learning a projection matrix to project the features into a session-independent subspace, with the aim that verification would be more robust. The details are in the paper, so I will not go through them.
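The FDIC details are in the paper, but the general idea of projecting features away from session-dependent directions can be sketched generically; this assumes an orthonormal nuisance basis U has already been estimated, and is not the exact FDIC algorithm:

```python
import numpy as np

def remove_session_subspace(features, U):
    """Project feature vectors (rows) onto the orthogonal complement of the
    session/nuisance subspace spanned by the orthonormal columns of U."""
    features = np.asarray(features, dtype=float)
    return features - (features @ U) @ U.T
```

The projection is idempotent, so applying it a second time changes nothing; the compensated features carry no component along the nuisance directions.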

Here is a brief table of results for the TIMIT case. We add additive white Gaussian noise and sweep through several SNRs, and we compute MFCCs from the data. The top line shows what happens if we just obtain the MFCCs without first applying anything, i.e. just the raw features.

The second line shows what happens if we obtain MFCCs after we have enhanced the log spectra using the VB technique. Our implementation of FDIC was able to work in the low-SNR cases; in the high-SNR case it seemed to break down in our implementation.

We can look at this more closely: this is a DET plot for one of the SNR cases for TIMIT, and we see that the equal error rate dropped by about half; that held across the SNRs we investigated.

We also looked at other types of noise; here we added factory noise, obtained from the NOISEX-92 dataset. The results are similar; only the figures differ, because this type of noise yields different effective SNRs. You can also see that the gain in this regime is not as large, but that is because this is a very nearly clean condition.

Then, when we applied this to the MIT dataset, we wanted to show the difference, that is, what happens when we have mismatch. We trained with data obtained in an office and tested with data from a noisy street intersection. We observe the mismatch clearly: the equal error rate jumps up sharply when the test data is from the intersection while the models were trained on office data, and when we apply the VB technique it comes down to twenty-four percent.

For the SRE experiments we used the SRE 2004 corpus. To give the details: we used a UBM with 512 mixture components, and nineteen-dimensional MFCCs with cepstral mean normalization. The upshot is that we only obtained modest gains when we applied our method: the baseline system had an EER of 13.8 percent, and we were only able to get it to 13.4 percent. We think this may be due to the fact that our formulation relies on models trained on clean speech, and that is reflected in the small gain compared to what we obtained on TIMIT and the MIT dataset.

And that's it. Thank you.

We have time for one quick question.

I have a question: did you try to use a standard type of speech enhancement, such as Wiener filtering, to obtain enhanced speech, and then use the enhanced speech to do speaker verification?

No, we did not here, but we did try a related enhancement front end; we were evaluating that in a speaker ID context, not in this verification context. That is something we should do.

Okay, yes, thank you.