Okay. The paper I am going to present is on the clustering of bootstrapped acoustic models with full covariance. I will follow this outline: first I will give an overview of the bootstrap and restructuring (BSRS) acoustic modeling framework and discuss the motivation, that is, why we do clustering and why we do it on full covariance. Then I will discuss how to do the clustering, in two parts: the distance measurements investigated, including the entropy change, KL divergence, Bhattacharyya distance, Bayes error, and Chernoff distance, and the clustering algorithms proposed and investigated. After that I will discuss the experimental results on the proposed clustering algorithms and the experimental results of BSRS with full covariance models. Finally, the conclusion and future extensions.

Let me give some background on bootstrap-based acoustic modeling. Basically, we sample the training data into N subsets, where each subset covers a fraction of the original data. We combine all the data together to train the decision tree, with LDA and semi-tied covariance, and then we perform EM training in parallel on the N subsets, one model per subset. So we have N models, and we aggregate them together. Obviously the aggregated model is very large, but it performs very well; the problem is that it is too large, so restructuring is needed.

There are two strategies for the restructuring. The first one uses diagonal covariance modeling in all the steps. The second one trains full covariance models in all the steps and converts to diagonal only at the last step, so here a full covariance clustering is needed. As you can see from the framework, clustering is a critical step: it removes the redundancy and scales down the model so that we can put it on a mobile device, and it is flexible. This is an advantage of the clustering, because you can train a large model and scale it down to any size without retraining. Full covariance clustering is needed for the BSRS plus full-to-diagonal conversion strategy.

Let's take a look at the distance measurements for clustering. We investigated several distance measurements: the entropy distance, which measures the change of entropy after two distributions are merged; the KL divergence, where we use a symmetric KL divergence defined in this form; and the Bayes error, which measures the overlap of two distributions. There is no closed form of the Bayes error even for multivariate Gaussians, so a variational approximation is applied based on the Chernoff distance. The Chernoff distance can be viewed as an upper bound of the Bayes error; it is defined in this form, and the Bhattacharyya distance is a special case of the Chernoff function with the weight s equal to 0.5. How to optimize the Chernoff distance is elaborated in another paper, reference number two: you can apply a Newton algorithm, where you have to obtain the first- and second-order derivatives, or you can use a derivative-free approach based on the analytical form of the Chernoff function.

Just now we discussed the distance measurements; now here is an outline of the investigated algorithms. One is the bottom-up, also called agglomerative, clustering, which is greedy, and a distance refinement is proposed to improve its speed.
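To make the entropy distance and the greedy bottom-up step concrete, here is a minimal sketch (my own illustration, not code from the paper) of the merge cost an agglomerative step like the one just described could minimize: two weighted full-covariance Gaussians are moment-matched into one, and the cost is the increase in weighted entropy, which reduces to a difference of log-determinants. The function names and the use of numpy are my assumptions.

```python
import numpy as np

def moment_match(w1, mu1, cov1, w2, mu2, cov2):
    """Merge two weighted Gaussians into one by moment matching."""
    w = w1 + w2
    mu = (w1 * mu1 + w2 * mu2) / w
    # Combined second moment minus the outer product of the merged mean.
    cov = (w1 * (cov1 + np.outer(mu1, mu1)) +
           w2 * (cov2 + np.outer(mu2, mu2))) / w - np.outer(mu, mu)
    return w, mu, cov

def entropy_merge_cost(w1, mu1, cov1, w2, mu2, cov2):
    """Increase in weighted entropy caused by merging the pair.

    For Gaussians the (2*pi*e)^d terms cancel because the merged weight
    equals w1 + w2, so only the log-determinants remain.
    """
    w, _, cov = moment_match(w1, mu1, cov1, w2, mu2, cov2)
    logdet = lambda c: np.linalg.slogdet(c)[1]
    return 0.5 * (w * logdet(cov) - w1 * logdet(cov1) - w2 * logdet(cov2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((3, 3))
    b = rng.standard_normal((3, 3))
    cov1, cov2 = a @ a.T + np.eye(3), b @ b.T + np.eye(3)
    print(entropy_merge_cost(0.6, np.zeros(3), cov1,
                             0.4, np.ones(3), cov2))
```

A greedy bottom-up pass would evaluate this cost for candidate pairs and merge the cheapest pair at each step.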
Some non-greedy approaches are also proposed for global optimization, including a K-step look-ahead and a search for the best merge path, and finally a two-pass strategy is used to improve the model structure.

Let's review the problem again. We have a Gaussian mixture model F, which comes from the aggregated models, and we want to compress it to a model G with N Gaussians. If we formulate it in terms of entropy, we want to minimize the entropy change between F and G; this is a global optimization target. However, it is extremely hard to obtain, so the conventional method is to take, each time, the two most similar Gaussians according to some criterion and combine them into one. This is actually a greedy approach, and a good global approach is supposed to be better.

Here is an example of the K-step look-ahead. The greedy approach will always choose the first-ranked combination; however, if you look two steps further, you may find that the best combination comes from the second-best candidate, the red path here. So this is a gentle way to obtain a more globally optimized result. Another idea is to search for the best merge path, which employs a breadth-first search, or dynamic programming, idea. The beam is set to N: at each layer you keep N candidates, you extend each of them to the next layer, so you have N-squared possibilities, and you prune them back to N. After this search process you obtain the best path found. If the beam were unlimited, the result would be truly globally optimized, but that is an NP problem, so we have to set a beam to do this job.

The conventional setup also gives every state the same compression rate, which may not be optimal, because some states could take a larger compression rate; that makes more sense. So the Bayesian information criterion (BIC) is employed here, with a two-pass idea. In the first pass we keep 2K+1 compression-rate candidates per state with their BIC values, and in the second pass we fix the BIC value for all the states, so different states end up with different compression rates. This is applied to our clustering algorithm.

That brings us to the experimental setup. We did the experiments on a Pashto dataset with one hundred and thirty-five hours of training data and ten hours of testing data. The models are speaker independent, and both the training and the testing data are spontaneous speech. The model we cluster from is combined from fourteen bootstrap models, with about six thousand states and 1.8 million Gaussians, and this big model has a word error rate of 35.46 percent with full covariance.

Now we come to a problem: the Chernoff and KL distance measurements are very slow. From this figure you can see that KL is roughly six to ten times slower than entropy, and Chernoff is roughly twenty to thirty times slower than entropy. So a simple idea here is: entropy is fast and effective, so why don't we use entropy to find the N best candidate pairs and then use Chernoff or KL to recalculate the distance only for those, to speed up the process? After applying this idea, the speed improvement is significant, and the word error rate also improves.
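The two-stage idea just described — rank all pairs with the cheap entropy distance, then re-score only a short list with the expensive Chernoff or KL distance — might look roughly like this sketch. Treating the two distances as generic callables, and the helper name best_pair_two_stage, are my assumptions rather than the paper's implementation.

```python
import heapq
import itertools

def best_pair_two_stage(gaussians, cheap_dist, expensive_dist, shortlist=10):
    """Pick the best merge pair with a two-stage distance computation.

    gaussians:      list of components, e.g. (weight, mean, covariance) tuples
    cheap_dist:     fast distance, e.g. the entropy merge cost
    expensive_dist: accurate distance, e.g. Chernoff or symmetric KL
    shortlist:      how many cheaply-ranked candidate pairs to re-score
    """
    pairs = itertools.combinations(range(len(gaussians)), 2)
    # Stage 1: rank all pairs by the cheap distance and keep the best few.
    candidates = heapq.nsmallest(
        shortlist,
        pairs,
        key=lambda ij: cheap_dist(gaussians[ij[0]], gaussians[ij[1]]))
    # Stage 2: re-score only the shortlist with the expensive distance.
    return min(
        candidates,
        key=lambda ij: expensive_dist(gaussians[ij[0]], gaussians[ij[1]]))
```

With the shortlist set to ten, as in the experiment reported next, only ten expensive distances are computed per merge step instead of one per pair.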
Taking the KL distance as a quick example, the baseline was 36.23 percent, and after using entropy to preselect the ten best candidate pairs, the word error rate improved as well.

The next observation is that a weighted distance should give a further improvement, because the entropy distance, which worked well, explicitly takes the weighting between the mixture components into account. So I compared the distance weighted by the mixture weights with the unweighted distance, and also with the MBR-style approach, when compressing to one hundred thousand Gaussians and to fifty thousand Gaussians. From this figure we can see that the weighted distance is better than the unweighted distance, which means the weighting is very important, and the MBR-style approach is better than the weighted distance. Another observation is that the fifty-thousand-Gaussian case shows the larger improvement, which makes sense, because the weighting becomes more and more important when the compression rate is high.

Here are some results for the global optimization. Let's first look at the entropy criterion, measuring the entropy change between F before compression and G after compression. The two-step look-ahead gives a tiny improvement, around 0.04, while the search-based approach gives a larger improvement, which means our approach is effective. The speed is slow, though, because you have to search over the paths; it is about twenty times slower than the baseline. When we evaluate the word error rate at one hundred thousand and fifty thousand Gaussians, the proposed approach is better, a positive improvement, although the improvement is small. At the higher compression rate the difference between our proposed approach and the baseline is larger, which again means that this works effectively.

Here are the experimental results on the two-pass structure optimization, one pass versus two passes. The two-pass scheme is always better than the one-pass scheme, although the improvement is small.

Here is a figure of the three approaches: the baseline, the BSRS plus diagonal covariance strategy, and the BSRS plus full-to-diagonal conversion strategy. We evaluated both maximum likelihood and discriminative training, and the result is quite interesting: the improvement is fairly large if we compare the full-to-diagonal conversion with running the whole process using diagonal covariance — about one percent for maximum likelihood and about 0.7 percent for discriminative training.

For future extensions, in the search-based approach the beam can be made adaptive. The beam we are using is fixed; at the beginning the beam can be small, but toward the end the beam should be larger, because you want to capture more candidates, so an adaptive scheme could be used to optimize the beam. The K-step look-ahead and the search for the optimized path can also be seen as a general optimization approach and applied to other tasks, such as decision tree building. For the two-pass model structure optimization, we could try different criteria, such as MDL instead of BIC.
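As a rough illustration of the search-based global optimization discussed above, here is a sketch of a beam search over merge sequences. The hypothesis representation, the fixed beam width, and scoring by accumulated merge cost are my own simplifications; an adaptive beam as proposed above would simply vary beam_width from layer to layer.

```python
import heapq
import itertools

def beam_search_clustering(gaussians, merge, merge_cost,
                           target_size, beam_width=8):
    """Beam search over sequences of pairwise merges.

    gaussians:   list of components, e.g. (weight, mean, covariance) tuples
    merge:       merges two components into one (e.g. moment matching)
    merge_cost:  cost of merging two components (e.g. entropy change)
    target_size: number of components to keep
    beam_width:  how many partial merge sequences survive at each layer
    """
    # Each hypothesis is (accumulated_cost, current_component_list).
    beam = [(0.0, list(gaussians))]
    while len(beam[0][1]) > target_size:
        expansions = []
        for cost, comps in beam:
            # Expand every hypothesis by every possible next merge.
            for i, j in itertools.combinations(range(len(comps)), 2):
                new_comps = [c for k, c in enumerate(comps) if k not in (i, j)]
                new_comps.append(merge(comps[i], comps[j]))
                expansions.append((cost + merge_cost(comps[i], comps[j]),
                                   new_comps))
        # Prune the layer back down to the beam width.
        beam = heapq.nsmallest(beam_width, expansions, key=lambda h: h[0])
    return beam[0]  # best (cost, components) hypothesis found
```

Keeping several partial merge sequences alive is what makes this much slower than the greedy baseline, which is why the beam width matters.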
These are the references. Are there any questions?

Chair: Thank you. If you have questions, please come up to the mic.

Audience: Thanks. I have two questions. The first one is: how do you divide the training set into the different subsets at the very beginning? The second question: if I understand correctly, each model would have its own decision tree structure; if this is true, how can you decide which two states, for example, can be merged?

Presenter: For the first question, each subset is obtained by random sampling without replacement, with a sampling rate equal to seventy percent. For the second question, the models actually share the same decision tree: we combine all the bootstrap data together to train the decision tree, so the problem you mentioned does not arise.

Audience: Thanks. You are doing a combined clustering in this case, and in my experience there may be clusters that end up with a very small number of components. Do you have any measure for these small clusters?

Presenter: In the agglomerative clustering you combine the two most similar Gaussians together, so after each step you have N minus one Gaussians. I think the weighting is very important here to avoid the case you mentioned.

Audience: So you are not using an explicit measure, and you simply do not get such small clusters?

Presenter: What would the measure of a small cluster be? The weight is what represents whether it is small, right — the mixture weight.

Audience: Yes, but suppose you have, for example, just one component in one cluster, so that it is isolated from the others. How do you deal with such an isolated one?

Presenter: If it is isolated, we do not need to combine it with the others. Or do you mean cross-state clustering?

Audience: No, I just mean that when you do clustering, sometimes a cluster ends up with a very small number of components, and when you then estimate each of them as a model from only a few examples, that creates problems later, right?

Presenter: I do not have this small-cluster problem. I think the weight is very important here; as I showed, the weighted distance is better than the unweighted distance, and the weight is what represents a small or a large cluster. So from my perspective...

Audience: Okay, thank you.

Presenter: Thank you.