0:00:18 Okay.
0:00:33 The paper I am going to present is on clustering of full-covariance acoustic models. I will follow this outline.
0:00:44 First, I will give an overview of bootstrap and restructuring in the bootstrap-based acoustic modeling framework, and discuss the motivation: why we do the clustering and why we do it on full covariance.
0:00:58 Then I will discuss how to do the clustering, in two parts: the distance measurements investigated, including entropy, KL, Bhattacharyya, S0, and Chernoff, and the clustering algorithms proposed and investigated.
0:01:17 Then I will discuss the experimental results of the proposed clustering algorithms and the results of the bootstrap system with the full-covariance strategy.
0:01:30 Finally, the conclusion and future extensions.
0:01:37 Okay, let's go over some background on bootstrap-based acoustic modeling.
0:01:43 Basically, we sample the training data into N subsets, where each subset covers a fraction of the original data.
0:01:56 We combine all the data together to train the decision tree, with LDA and semi-tied covariance, and then we perform EM training in parallel on the N subsets.
0:02:13 So we have N models, and we aggregate them together.
0:02:19 Obviously, the aggregated model is very large, but it performs well. The problem is exactly that it is too large, so restructuring is needed.
0:02:31 There are two strategies for the restructuring. The first one is the BS plus diagonal strategy, which trains diagonal covariance models in all the steps.
0:02:48 The second one is to train full covariance models in all the steps and only convert to diagonal at the last step.
0:02:58 So here full-covariance clustering is needed, and as you can see from the framework, clustering is a critical step.
0:03:09 By doing this we can remove the redundancy and scale down the model, so that we can put it on a mobile device, and it is flexible.
0:03:22 That is an advantage of the clustering: you can train a large model and scale it down to the desired size without new training.
0:03:31 And here full-covariance clustering is needed for the BS plus full-to-diagonal strategy.
0:03:41 Okay, so let's take a look at the distance measurements for clustering.
0:03:47 We investigated several distance measurements, including the entropy distance, which measures the change of entropy after two distributions are merged, and the KL divergence; here we use the symmetric KL divergence, defined in this form.
0:04:02 We also use the Bhattacharyya distance,
0:04:07 and the S0 distance, which measures the overlap of two distributions; but there is no closed form for it even for multivariate Gaussians, so a variational approach is applied, based on the Chernoff distance.
0:04:23 The Chernoff function can be viewed as an upper bound of S0; it is defined in this form, and the Bhattacharyya distance is a special case of the Chernoff function with s equal to 0.5.
0:04:42 How to obtain the Chernoff distance is elaborated in another paper, reference number two. You can apply Newton's algorithm, but then you have to obtain the first- and second-order derivatives, or you can use a derivative-free approach based on the analytical form of the Chernoff function.
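To make these distances concrete, here is a minimal numpy sketch of the symmetric KL divergence and the Bhattacharyya distance between two full-covariance Gaussians; the Chernoff function replaces the fixed one-half mixing in the Bhattacharyya term with a parameter s that has to be optimized numerically (by Newton's method or the derivative-free approach of reference two), which is omitted here. Function names are illustrative, not the system's actual code.

```python
import numpy as np

def kl_gaussian(mu0, S0, mu1, S1):
    """KL(N0 || N1) for two full-covariance Gaussians."""
    d = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - d
                  + np.linalg.slogdet(S1)[1] - np.linalg.slogdet(S0)[1])

def symmetric_kl(mu0, S0, mu1, S1):
    """Symmetric KL divergence: KL(N0||N1) + KL(N1||N0)."""
    return kl_gaussian(mu0, S0, mu1, S1) + kl_gaussian(mu1, S1, mu0, S0)

def bhattacharyya(mu0, S0, mu1, S1):
    """Bhattacharyya distance; the Chernoff function would mix the
    two covariances with weights s and 1-s instead of 1/2 and 1/2."""
    S = 0.5 * (S0 + S1)
    diff = mu1 - mu0
    return (0.125 * diff @ np.linalg.inv(S) @ diff
            + 0.5 * (np.linalg.slogdet(S)[1]
                     - 0.5 * (np.linalg.slogdet(S0)[1] + np.linalg.slogdet(S1)[1])))
```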
0:05:08 Okay, we just discussed the distance measurements; now here is an outline of the investigated clustering algorithms.
0:05:18 The clustering is based on bottom-up, also called agglomerative, clustering, which is greedy. A distance refinement is proposed to improve the speed, and some non-greedy approaches are also proposed for global optimisation, including K-step look-ahead and searching for the best path. Finally, a two-pass strategy is used to improve the model structure.
0:05:47 Let's review the problem again. We have a Gaussian mixture model, aggregated from the bootstrapped models, and we want to compress it to a model with N Gaussians.
0:06:04 Formulated in terms of entropy, we want to minimize the entropy change between F and G; that is our target. This is a global optimisation target, but it is extremely hard to obtain, so the conventional method is, at each step, to find the two most similar Gaussians and combine them into one based on some criterion.
0:06:37 Following this idea, it is actually a greedy approach.
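To make the greedy step concrete, here is a minimal sketch under the standard moment-matching assumption: a merged pair keeps the pair's total weight, mean, and covariance, and at each step the pair with the smallest weighted-entropy increase is merged. The helper names and the brute-force pair scan are illustrative, not the paper's exact implementation.

```python
import numpy as np

def merge_moment_match(w1, mu1, S1, w2, mu2, S2):
    """Merge two weighted Gaussians into one that preserves the pair's
    total weight, mean and covariance (moment matching)."""
    w = w1 + w2
    mu = (w1 * mu1 + w2 * mu2) / w
    d1, d2 = mu1 - mu, mu2 - mu
    S = (w1 * (S1 + np.outer(d1, d1)) + w2 * (S2 + np.outer(d2, d2))) / w
    return w, mu, S

def entropy_change(w1, mu1, S1, w2, mu2, S2):
    """Increase of weighted Gaussian entropy caused by merging the pair;
    the constant terms cancel, so only log-determinants remain."""
    w, _, S = merge_moment_match(w1, mu1, S1, w2, mu2, S2)
    ld = lambda A: np.linalg.slogdet(A)[1]
    return 0.5 * (w * ld(S) - w1 * ld(S1) - w2 * ld(S2))

def greedy_cluster(components, n_target):
    """components: list of (weight, mean, covariance) tuples.
    Repeatedly merge the closest pair until n_target components remain."""
    g = list(components)
    while len(g) > n_target:
        i, j = min(((a, b) for a in range(len(g)) for b in range(a + 1, len(g))),
                   key=lambda p: entropy_change(*g[p[0]], *g[p[1]]))
        g = [c for k, c in enumerate(g) if k not in (i, j)] + [merge_moment_match(*g[i], *g[j])]
    return g
```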
0:06:44 So a global approach is supposed to be better.
0:06:53 Here is an example of K-step look-ahead. The greedy approach will always choose the first-ranked combination. However, if you look two steps further ahead, you find that the best combining candidate comes from the second-best order, from here, the red path.
0:07:20 So this is a general way to obtain a more globally optimized result.
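A minimal sketch of the K-step look-ahead, reusing merge_moment_match and entropy_change from the sketch above (the number of top candidates to examine is an illustrative assumption): each of the few greedily best pairs is scored by the cost of merging it plus K further greedy merges, and the pair with the lowest accumulated cost wins, which need not be the first-ranked one.

```python
def lookahead_cost(components, pair, k):
    """Cost of merging `pair` now plus k further greedy merges."""
    g, (i, j) = list(components), pair
    cost = entropy_change(*g[i], *g[j])
    g = [c for n, c in enumerate(g) if n not in (i, j)] + [merge_moment_match(*g[i], *g[j])]
    for _ in range(k):
        if len(g) < 2:
            break
        a, b = min(((x, y) for x in range(len(g)) for y in range(x + 1, len(g))),
                   key=lambda p: entropy_change(*g[p[0]], *g[p[1]]))
        cost += entropy_change(*g[a], *g[b])
        g = [c for n, c in enumerate(g) if n not in (a, b)] + [merge_moment_match(*g[a], *g[b])]
    return cost

def best_merge_with_lookahead(components, k=2, n_candidates=5):
    """Rank pairs greedily, then re-rank the top few by k-step look-ahead."""
    pairs = sorted(((i, j) for i in range(len(components))
                    for j in range(i + 1, len(components))),
                   key=lambda p: entropy_change(*components[p[0]], *components[p[1]]))
    return min(pairs[:n_candidates], key=lambda p: lookahead_cost(components, p, k))
```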
0:07:29 Another idea is to search for the best optimized path, which employs the breadth-first search idea, that is, dynamic programming.
0:07:39 If the beam is set to N, at each layer you keep N candidates, and you extend to the next layer from the N candidates, so you have N-squared possibilities, and then you use pruning to cut the layer back to N.
0:07:58 After this searching process, you find a close-to-globally-optimized point at the last layer.
0:08:10 If the beam were unlimited, the result would be truly globally optimized; however, that is an NP-like problem, so we have to set a beam to do this job.
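A minimal sketch of the path-search idea with a beam, again reusing the helpers from the earlier sketch (the per-hypothesis expansion limit is an assumption to keep the example small): each hypothesis is a partially merged mixture with its accumulated cost; every layer expands the hypotheses by one merge, and pruning cuts the layer back to the beam size.

```python
import heapq

def beam_search_cluster(components, n_target, beam=4, expand=4):
    """Breadth-first search over merge sequences: keep `beam` hypotheses
    per layer, expand each by its `expand` cheapest merges, prune,
    and stop when n_target components remain."""
    layer = [(0.0, list(components))]          # (accumulated cost, mixture)
    while len(layer[0][1]) > n_target:
        candidates = []
        for cost, g in layer:
            pairs = sorted(((i, j) for i in range(len(g)) for j in range(i + 1, len(g))),
                           key=lambda p: entropy_change(*g[p[0]], *g[p[1]]))
            for i, j in pairs[:expand]:
                new_g = ([c for n, c in enumerate(g) if n not in (i, j)]
                         + [merge_moment_match(*g[i], *g[j])])
                candidates.append((cost + entropy_change(*g[i], *g[j]), new_g))
        layer = heapq.nsmallest(beam, candidates, key=lambda h: h[0])
    return min(layer, key=lambda h: h[0])[1]
```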
0:08:30 The conventional setup is that every state has the same compression rate. That may not be very optimized, because some states can afford a larger compression rate, and that makes more sense.
0:08:53 So the Bayesian information criterion (BIC) is employed here, together with a two-pass idea. In the first pass we try to keep, for each state, 2K+1 compression-rate candidates with their BIC values.
0:09:13 In the second pass we fix the BIC value for all the states, and therefore different compression rates are obtained for different states.
0:09:25 This is applied to our clustering algorithm.
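A minimal sketch of how such a two-pass selection could look (the exact BIC bookkeeping of the paper may differ; names are illustrative): the first pass stores, per state, a few candidate component counts with their data log-likelihoods, and the second pass applies one shared BIC penalty to every state, so each state ends up with its own compression rate.

```python
import math

def bic(log_likelihood, n_components, dim, n_frames, penalty=1.0):
    """BIC of a full-covariance GMM for one state: -2 logL plus a
    penalised parameter count (weight, mean, covariance per component)."""
    n_params = n_components * (1 + dim + dim * (dim + 1) // 2)
    return -2.0 * log_likelihood + penalty * n_params * math.log(max(n_frames, 2))

def second_pass_select(first_pass, dim, penalty):
    """first_pass: {state: [(n_components, log_likelihood, n_frames), ...]}.
    With one shared penalty, each state keeps the candidate with the
    lowest BIC, which yields state-dependent compression rates."""
    return {state: min(cands, key=lambda c: bic(c[1], c[0], dim, c[2], penalty))[0]
            for state, cands in first_pass.items()}
```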
0:09:33 That brings us to the experimental setup. We ran the experiments on a Pashto dataset, with one hundred and thirty-five hours of training data and ten hours of testing data.
0:09:47 The model is speaker independent, and both the training and the testing data are spontaneous speech.
0:09:57 The model we cluster from is combined from fourteen bootstrap models; it has 6K states and about 1.8 million Gaussians, and this big model has a word error rate of 35.46 percent with full covariance.
0:10:16 Now comes a problem: distances like the Chernoff and the KL distance are very slow to obtain. From this figure you can see that KL is roughly six to ten times slower than entropy, and Chernoff is like twenty or thirty times slower than entropy.
0:10:35 So a simple idea here: since entropy is fast and effective, why don't we use entropy to find the N-best candidate pairs, and then use Chernoff or KL to recalculate the distance only on those, to speed up the process.
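A minimal sketch of this refinement, reusing the helpers from the earlier sketches (the ten-best cut-off is the value mentioned here; the rest is illustrative): the cheap entropy criterion nominates the N best candidate pairs, and only those are re-scored with the expensive distance.

```python
def refined_best_pair(components, expensive_distance, n_best=10):
    """Pick the pair to merge: rank all pairs with the cheap entropy
    criterion, then re-score only the n_best shortlist with an
    expensive distance such as symmetric KL or Chernoff."""
    pairs = sorted(((i, j) for i in range(len(components))
                    for j in range(i + 1, len(components))),
                   key=lambda p: entropy_change(*components[p[0]], *components[p[1]]))
    return min(pairs[:n_best],
               key=lambda p: expensive_distance(components[p[0]][1], components[p[0]][2],
                                                components[p[1]][1], components[p[1]][2]))
```

For example, refined_best_pair(g, symmetric_kl) plugs in the symmetric KL sketch from earlier; the chosen pair is then merged exactly as in the greedy loop.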
0:10:51 After applying this idea, the speed improvement is significant, and the word error rate also improves.
0:11:02 Take KL as an example: the baseline is 36.23, and after using entropy to preselect the ten best pairs there is a further improvement in the word error rate.
0:11:21 The next idea: a weighted distance may be expected to show improvement, because the entropy distance implicitly considers the weighting between the mixtures.
0:11:38 So I tried the weighted Bhattacharyya distance, compared with the unweighted Bhattacharyya distance, and compared with the MBR approach, when compressing to one hundred K Gaussians and to fifty K Gaussians.
0:11:54 From this figure we can see that the weighted distance is better than the unweighted distance, which means the weighting is very important, and the MBR approach is better than the weighted distance.
0:12:09 Another observation is that fifty K has a larger improvement, which makes sense, because the weighting becomes more and more important when the compression rate is high.
0:12:28 Here are some experimental results on global optimisation. Let's first look at using the entropy criterion and measuring the overall entropy change between F, before compression, and G, after compression.
0:12:46 The two-step look-ahead gives a tiny improvement, around 0.04, but the search approach gives a considerably larger improvement, which means our approach is effective.
0:13:04 The speed, though, is slow, because you have to search over the paths; it is about twenty times slower than the baseline.
0:13:17 When we evaluate the word error rate, at one hundred K and at fifty K the proposed approach is better, a positive improvement, though the improvement is small.
0:13:32 At a higher compression rate, the difference between our proposed approach and the baseline approach is larger, which means this approach is effective.
0:13:51 Here are the experimental results on the two-pass structure optimisation. Comparing one pass and two passes, the two-pass is always better than the one-pass, though the improvement is small.
0:14:11 Here is the final figure, comparing the three approaches: the baseline, the BS plus diagonal covariance strategy, and the BS plus full-to-diagonal conversion strategy.
0:14:32 We evaluated on both maximum likelihood and discriminative training, and the result is pretty interesting: the improvement is quite large if we compare the full-to-diagonal conversion with training the whole process using diagonal covariance.
0:15:04 The gain is around one percent with maximum likelihood training, and around 0.7 percent with discriminative training.
0:15:18 Finally, for future extensions: in the search-based approach, the beam can be made adaptive. The beam we are using is fixed; for example, at the beginning the beam can be small, but toward the end the beam should be larger, because you want to capture more candidates, so we can use an adaptive idea to optimize the beam.
0:15:52 The K-step look-ahead and the search for the optimized path can be seen as general approaches to optimisation, and they can be applied to other tasks, such as decision tree building.
0:16:04 For the two-pass model structure optimisation, we can try different criteria, such as MDL instead of BIC.
0:16:13 These are the references. Are there any questions? Thank you.
0:16:28 We have a question over here; please wait for the microphone.
0:16:37 Thanks. I have two questions. The first one is: how do you divide the training set into the different subsets at the very beginning? The second question is: if I understand correctly, each model will have its own tree structure; if this is true, how can you decide which two states, for example, can be merged?
0:17:01 Okay. For the first question: each subset is drawn by random sampling without replacement, with a sampling rate of around seventy percent.
0:17:16 For the second question: the models actually share the same decision tree. We combine all the bootstrap data together to train one decision tree, so the problem you mentioned does not arise. Thanks.
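A minimal sketch of the subset sampling just described (the function name is illustrative; fourteen subsets matches the number of bootstrap models mentioned earlier):

```python
import numpy as np

def bootstrap_subsets(n_utterances, n_subsets=14, rate=0.7, seed=0):
    """Draw each subset by random sampling WITHOUT replacement at roughly
    a 70 percent rate; the subsets differ because each draw is independent."""
    rng = np.random.default_rng(seed)
    size = int(rate * n_utterances)
    return [rng.choice(n_utterances, size=size, replace=False) for _ in range(n_subsets)]
```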
0:17:37 You are doing full Gaussian clustering in this case. According to my experience, maybe some of the clusters will end up with a very small number of components. So do you have any measure for these small clusters?
0:18:01 Actually, in the agglomerative clustering you combine the two most similar Gaussians together, so after each step you have N minus one Gaussians, and I think the weight is very important here for avoiding the case you mentioned.
0:18:23 So, in other words, you are not using any explicit measure; you just do not end up with clusters that have a very small number of components?
0:18:38 I would say the measure of a small cluster is the weight: for a small cluster, the mixture weight is what represents whether it is small, right?
0:18:58 Yeah, but if you have, for example, just one component in one cluster, so that it is isolated from all the others, how do you handle this?
0:19:14 If it is isolated, we do not need to combine it with the others. So do you mean cross-state clustering?
0:19:37 No, I just mean that when you do the clustering, some of the clusters end up with a very small number of components; if you then, for example, retrain those models on a small amount of data, that will later create a problem.
0:20:04 Right; so far I have not had this small-cluster problem, and I think the weight is very important here: as I showed, the weighted distance is better than the unweighted distance, and the weight is what represents whether a cluster is small or large, right? So that is my perspective.
0:20:24 Okay, thank you, thank you.