Okay.

Okay, so the paper I am going to present is on clustering of full covariance acoustic models. I will follow this outline.

First, I will discuss the overview of bootstrap and the restructuring in the bootstrap-based (BS) acoustic modeling framework, and discuss the motivation: why we do clustering, and why we do it on full covariance models.

Then I will discuss how to do the clustering, in two parts: the distance measurements investigated, including entropy, KL, Bhattacharyya, S0, and Chernoff, and some clustering algorithms that we proposed and investigated.

Then I will discuss the experimental results of the proposed clustering algorithms, and the experimental results of BS restructuring with full covariance models.

Finally, the conclusion and future extensions.

Okay, let's have some background on bootstrap-based acoustic modeling. Basically, we randomly sample the training data into N subsets, where each subset covers a fraction of the original data. We combine all the data together to train the decision tree, with LDA and semi-tied covariance, and then for each subset we perform EM training, in parallel on the N subsets. In the end we have N models, and we aggregate them together.

Obviously the aggregated model is very large, but it performs very well. The problem is that it is too large, so restructuring is needed.
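To make the aggregation concrete, here is a minimal sketch of the standard bagging construction (my own notation, not a formula from the slides): if model i, trained on subset i, gives some state the mixture f_i, the aggregated model for that state is the equally weighted combination

```latex
f(x) \;=\; \frac{1}{N}\sum_{i=1}^{N} f_i(x),
\qquad
f_i(x) \;=\; \sum_{j} w_{ij}\,\mathcal{N}\!\big(x;\,\mu_{ij},\,\Sigma_{ij}\big),
```

so each state ends up with roughly N times as many Gaussians as a single model, which is why the aggregate is accurate but too large.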

There are two strategies here for the restructuring. The first one trains diagonal covariance models in all the steps. The second one trains full covariance models in all the steps and does the full-to-diagonal conversion only at the last step. So here a full covariance clustering is needed, and as you can see from the framework, clustering is a critical step.

By doing this clustering, we can remove the redundancy and scale down the model, so that we can put it on a mobile device. It is also flexible, and this is an advantage of the clustering strategy: you can train one large model and scale it down to whatever size you want without any new training. And here the full covariance clustering is needed for the BS plus full-to-diagonal strategy.

Okay, so let's take a look at the distance measurements for clustering.

We investigated several distance measurements. The first is entropy, which measures the change of entropy after two distributions merge. The second is the KL divergence; we use the symmetric KL divergence, defined in this form. Then there is the Bhattacharyya distance, defined in this form. And there is S0, which measures the overlap of two distributions, but there is no closed form for it even for multivariate Gaussians, so a variational approach is applied based on the Chernoff distance. The Chernoff function can be viewed as an upper bound of S0, and it is defined in this form. The Bhattacharyya distance is a special case of the Chernoff function with s equal to 0.5.

As for how to optimize the Chernoff distance, the details are elaborated in another paper, reference number two. You can apply Newton's algorithm, but then you have to obtain the first and second order derivatives, or you can use a derivative-free approach based on the analytical form of the Chernoff function.
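The transcript refers to formulas on the slides ("defined in this form"); for reference, the standard forms of these measures for two densities f and g, which I assume are what the slides show, are:

```latex
D_{\mathrm{sKL}}(f,g) = \tfrac{1}{2}\big(D_{\mathrm{KL}}(f\,\|\,g) + D_{\mathrm{KL}}(g\,\|\,f)\big),
\qquad
D_{B}(f,g) = -\ln\!\int\!\sqrt{f(x)\,g(x)}\,dx,

C(f,g) = \max_{s\in(0,1)}\,-\ln\!\int f(x)^{s}\,g(x)^{1-s}\,dx,
\qquad
\Delta H = \tfrac{1}{2}\big((w_1{+}w_2)\ln|\Sigma| - w_1\ln|\Sigma_1| - w_2\ln|\Sigma_2|\big),
```

where the Bhattacharyya distance is the Chernoff integrand evaluated at s = 0.5 rather than maximized over s, and the entropy cost ΔH is the increase in weighted mixture entropy caused by merging two weighted Gaussians into their moment-matched merge with covariance Σ.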

Okay, just now we discussed the distance measurements, and now here is an outline of the investigated algorithms. The first clustering algorithm is based on bottom-up clustering, which can also be called agglomerative clustering; it is greedy, and a distance refinement is proposed to improve the speed. Some non-greedy approaches are also proposed for global optimization, including K-step lookahead and searching for the best path. Finally, a two-pass strategy is proposed to improve the model structure.

Let's review the problem again. We have a Gaussian mixture model F with T Gaussians, and we want to compress it to a model G with N Gaussians. If we formulate it in entropy, we want to minimize the entropy change between F and G, which is our target; this is a global optimization target. However, this target is extremely hard to obtain directly, so the conventional method is: each time, estimate the two most similar Gaussians according to some criterion and combine them into one. So in this sense it is actually a greedy approach, and a good global approach is supposed to be better.
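As a minimal sketch of this greedy baseline (my own illustration in Python, using the entropy-change cost above as the merge criterion; all names are mine, not the paper's):

```python
import numpy as np

def merge_gaussians(w1, m1, S1, w2, m2, S2):
    """Moment-matched merge of two weighted Gaussians."""
    w = w1 + w2
    m = (w1 * m1 + w2 * m2) / w
    # The merged covariance also absorbs the spread between the two means.
    S = (w1 * (S1 + np.outer(m1 - m, m1 - m)) +
         w2 * (S2 + np.outer(m2 - m, m2 - m))) / w
    return w, m, S

def entropy_change(w1, m1, S1, w2, m2, S2):
    """Weighted entropy increase caused by merging the pair."""
    w, _, S = merge_gaussians(w1, m1, S1, w2, m2, S2)
    logdet = lambda A: np.linalg.slogdet(A)[1]
    return 0.5 * (w * logdet(S) - w1 * logdet(S1) - w2 * logdet(S2))

def greedy_cluster(gaussians, n_target):
    """Greedily merge the closest pair until n_target components remain."""
    comps = list(gaussians)  # each item is a tuple (weight, mean, cov)
    while len(comps) > n_target:
        _, i, j = min(
            (entropy_change(*comps[p], *comps[q]), p, q)
            for p in range(len(comps)) for q in range(p + 1, len(comps)))
        merged = merge_gaussians(*comps[i], *comps[j])
        comps = [c for k, c in enumerate(comps) if k not in (i, j)] + [merged]
    return comps
```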

Here is an example of K-step lookahead. The greedy approach will always choose the first-ranked combination. However, if you take a look two steps further, you may find that the best combining candidate is from the second-best order at the current step, shown here as the red path. So this is a gentle way to get closer to the global optimum without searching the whole space.
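A rough sketch of the lookahead rescoring, reusing the helpers from the previous snippet (again my own illustration; the paper's exact candidate-selection details may differ): among the top-ranked candidate merges, simulate k further greedy steps and keep the candidate with the lowest cumulative cost.

```python
def lookahead_cost(comps, i, j, k):
    """Cost of merging (i, j) plus the cost of k further greedy merges."""
    cost = entropy_change(*comps[i], *comps[j])
    merged = merge_gaussians(*comps[i], *comps[j])
    rest = [c for idx, c in enumerate(comps) if idx not in (i, j)] + [merged]
    for _ in range(k):
        if len(rest) < 2:
            break
        d, a, b = min(
            (entropy_change(*rest[p], *rest[q]), p, q)
            for p in range(len(rest)) for q in range(p + 1, len(rest)))
        cost += d
        m = merge_gaussians(*rest[a], *rest[b])
        rest = [c for idx, c in enumerate(rest) if idx not in (a, b)] + [m]
    return cost

def best_pair_with_lookahead(comps, k=2, n_candidates=5):
    """Rescore the n_candidates cheapest pairs by k-step lookahead."""
    pairs = sorted(
        (entropy_change(*comps[p], *comps[q]), p, q)
        for p in range(len(comps))
        for q in range(p + 1, len(comps)))[:n_candidates]
    return min(pairs, key=lambda t: lookahead_cost(comps, t[1], t[2], k))
```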

Another idea is to search for the optimized path, which employs the breadth-first search idea, which is dynamic programming. If the beam is set to n, at each layer you keep n candidates, and you extend each of them to the next layer, so you have n squared possibilities, and pruning brings them back to n. After this searching process, you find the close-to-globally-optimized point at the last layer. If the beam were unlimited, the result would be truly globally optimized; however, this is an NP-hard problem, so we have to set a beam to make the job feasible.
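Here is a compact sketch of that path search (my own illustration; the beam width and per-hypothesis expansion count are hypothetical parameters): each hypothesis is a partially merged mixture with its accumulated entropy cost, and pruning keeps the beam best at every layer.

```python
def beam_search_cluster(gaussians, n_target, beam=8, expand=8):
    """Breadth-first search over merge sequences with beam pruning."""
    hyps = [(0.0, list(gaussians))]  # (accumulated cost, components)
    while len(hyps[0][1]) > n_target:
        extended = []
        for cost, comps in hyps:
            # Expand each hypothesis with its 'expand' cheapest merges.
            pairs = sorted(
                (entropy_change(*comps[p], *comps[q]), p, q)
                for p in range(len(comps))
                for q in range(p + 1, len(comps)))[:expand]
            for d, p, q in pairs:
                merged = merge_gaussians(*comps[p], *comps[q])
                rest = [c for k, c in enumerate(comps)
                        if k not in (p, q)] + [merged]
                extended.append((cost + d, rest))
        # Prune the beam*expand candidates back to the beam width.
        hyps = sorted(extended, key=lambda t: t[0])[:beam]
    return hyps[0]  # the lowest-cost hypothesis at the final layer
```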

The conventional strategy is that all states have the same compression rate, which is not very optimized, because letting each state have a variable compression rate makes more sense. So the Bayesian information criterion (BIC) is employed here, together with a two-pass idea. In the first pass, we try to keep 2K+1 compression-rate candidates for each state, each with its BIC value. In the second pass, we fix the BIC value for all the states, and therefore different compression rates are assigned to different states. This is applied to our clustering algorithm.
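The slide formula is not in the transcript; the standard BIC form, which I assume is what is being fixed across states here, trades likelihood against model size:

```latex
\mathrm{BIC}(\Theta) \;=\; \log \mathcal{L}(X;\Theta) \;-\; \frac{\lambda}{2}\,\#(\Theta)\,\log N_X ,
```

where L is the likelihood of the state's data X under the compressed mixture Θ, #(Θ) is the number of free parameters, N_X is the number of frames, and λ is a penalty weight. Fixing one BIC operating point across all states then lets states with more data, or more complex distributions, keep more Gaussians.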

So that comes to the experimental setup. We did the experiments on a Pashto dataset, with one hundred and thirty-five hours of training data and ten hours of testing data. The model is speaker independent, and both the training and the testing data are spontaneous speech. The model we cluster from is combined from fourteen bootstrap models, with 6K states and 1.8 million Gaussians, and this big model has a word error rate of 35.46% with full covariance.

So now we come to a problem: the Chernoff and KL distance measurements are just very slow. From this figure you can see that KL is something like six to ten times slower than entropy, and Chernoff is like twenty or thirty times slower than entropy. So a simple idea here: since entropy is fast and effective, why don't we use entropy to find the N-best candidate pairs, and then use Chernoff or KL to recalculate the distance only on those pairs, to speed up the process? After applying this idea, the speed improvement is significant, and the word error rate also improves.
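A sketch of this two-stage refinement, reusing the helpers above (my own illustration; expensive_dist stands in for whichever slow measure, Chernoff or KL, is being used to rerank):

```python
def refined_best_pair(comps, expensive_dist, n_best=10):
    """Shortlist pairs with the cheap entropy cost, then rerank the
    shortlist with the expensive distance and return the winner."""
    shortlist = sorted(
        (entropy_change(*comps[p], *comps[q]), p, q)
        for p in range(len(comps)) for q in range(p + 1, len(comps)))[:n_best]
    return min(shortlist,
               key=lambda t: expensive_dist(comps[t[1]], comps[t[2]]))
```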

Let's take KL as an example. The baseline word error rate is 36.23, and after using entropy to select the ten best candidates, there is an improvement to 36.04.

The reason behind this, we believe, is that entropy explicitly takes into account the weighting between the mixtures. So I tried the weighted Bhattacharyya distance, compared it with the unweighted Bhattacharyya distance, and compared both with the entropy approach, when compressing to one hundred K Gaussians and to fifty K Gaussians. From this figure we can see that the weighted distance is better than the unweighted distance, which means the weighting is very important, and the entropy approach is better than the weighted distance. Another observation is that fifty K shows a larger improvement, which also makes sense, because the weighting becomes more and more important when the compression rate is high.

Here are some experimental results for the global optimization. Let's first take a look using the entropy criterion, where we measure the overall entropy change between F, before compression, and G, after compression. The two-step lookahead has a tiny improvement, like 0.04, but the search approach has a much larger improvement, which means our approach is effective. The speed, however, is slow, because you have to search all the paths, and it is about twenty times slower than the baseline.

When we evaluate the word error rate, at one hundred K and fifty K the proposed approach is better; there is a positive improvement, although the improvement is small. At a higher compression rate, the difference between our proposed approach and the baseline approach is larger, which means this work is effective.

Here are the experimental results on the two-pass structure optimization. Comparing one pass against two passes, the two-pass scheme is always better than the one-pass scheme, although the improvement is small.

So here is the final figure, comparing three approaches: the baseline bootstrapping strategy with diagonal covariance, the BS plus diagonal strategy, and the BS plus full-to-diagonal conversion strategy. We evaluated on both maximum likelihood training and discriminative training, and the result is pretty interesting: the improvement is quite large if we compare the full-to-diagonal conversion with training the whole process using diagonal covariance, like one percent for maximum likelihood and like 0.7 percent for discriminative training.

So, for future extensions: in the search-based approach, the beam can be made adaptive. The beam we are using now is fixed; at the beginning the beam can be small, but toward the end the beam should be large, because you want to capture more candidates, so we can use an adaptive idea to optimize the beam. Also, K-step lookahead and searching for the optimized path can be seen as general approaches in optimization and can be applied to other tasks, such as decision tree building. And for the two-pass model structure optimization, we can try different criteria, such as MDL instead of BIC.

So these are the references. Any questions? Thank you.

We have a question here; please wait for the mic. Thanks.

I have two questions. The first one is: how do you divide the training set into the different subsets at the very beginning? And the second question is: if I understand correctly, each model would have its own decision tree structure, so if this is true, how can you decide which two states, for example, can be merged?

Okay. For the first question: each subset is sampled by random sampling without replacement, with a sampling rate R, for example seventy percent. For the second question: the models actually all share the same decision tree. We combine all the bootstrap data together to train the LDA and the decision tree, so there is no such problem as you mentioned.

Thanks.

You are doing Gaussian clustering in this case. According to my experience, maybe some of the resulting clusters will have a very small number of components. So do you have any measure for these small clusters?

Well, actually, the agglomerative clustering combines the two most similar Gaussians together, so after each step you have N minus one Gaussians, and I think the weight is very important here to avoid the case you mentioned.

So in your method you are not using any explicit measure, and you just do not have that problem of small clusters with a small number of components?

The measure of the small cluster is the weight. I mean, for a small cluster, the weight is what represents whether it is small, right? The mixture weight.

Yeah, but if you have, for example, just one component in one class, so that it is isolated from all the others, how do you deal with this? Or do you not need to deal with it at all?

So you mean cross-state clustering?

No, I mean when you do the clustering, some of the states just have a very small number of components. Sometimes, for example, you end up with a small cluster, and then when you are training some models on it, that will later create problems.

Right, but so far I don't see this small-cluster problem, and I think the weight is very important here: as I showed, the weighted distance is better than the unweighted distance. The weight is the representation of whether a cluster is small or large. So that is my perspective.

Okay, thank you.
