0:00:18 Okay.
0:00:33 The paper I am going to present is on clustering of full-covariance acoustic models. I will follow this outline.
0:00:44 First, I will give an overview of bootstrap and restructuring in the bootstrap-based acoustic modeling framework, and discuss the motivation: why we do the clustering and why we do it on full covariance.
0:00:58 Then I will discuss how to do the clustering, in two parts: the distance measurements investigated, including entropy, KL, Bhattacharyya, S0, and Chernoff, and the clustering algorithms proposed and investigated.
0:01:17 Then I will discuss the experimental results of the proposed clustering algorithms and the results of the bootstrap system with the full-covariance strategy.
0:01:30 Finally, the conclusion and future extensions.
0:01:37 Okay, let's go over some background on bootstrap-based acoustic modeling.
0:01:43 Basically, we sample the training data into N subsets, where each subset covers a fraction of the original data.
0:01:56 We combine all the data together to train the decision tree, with LDA and semi-tied covariance, and then we perform EM training in parallel on the N subsets.
0:02:13 So we have N models, and we aggregate them together.
0:02:19 Obviously, the aggregated model is very large, but it performs well. The problem is exactly that it is too large, so restructuring is needed.
0:02:31 There are two strategies for the restructuring. The first one is the BS plus diagonal strategy, which trains diagonal covariance models in all the steps.
0:02:48 The second one is to train full covariance models in all the steps and only convert to diagonal at the last step.
0:02:58 So here full-covariance clustering is needed, and as you can see from the framework, clustering is a critical step.
0:03:09 By doing this we can remove the redundancy and scale down the model, so that we can put it on a mobile device, and it is flexible.
0:03:22 That is an advantage of the clustering: you can train a large model and scale it down to the desired size without new training.
0:03:31 And here full-covariance clustering is needed for the BS plus full-to-diagonal strategy.
0:03:41 Okay, so let's take a look at the distance measurements for clustering.
0:03:47 We investigated several distance measurements, including the entropy distance, which measures the change of entropy after two distributions are merged, and the KL divergence; here we use the symmetric KL divergence, defined in this form.
0:04:02 We also use the Bhattacharyya distance,
0:04:07 and the S0 distance, which measures the overlap of two distributions; but there is no closed form for it even for multivariate Gaussians, so a variational approach is applied, based on the Chernoff distance.
0:04:23 The Chernoff function can be viewed as an upper bound of S0; it is defined in this form, and the Bhattacharyya distance is a special case of the Chernoff function with s equal to 0.5.
0:04:42 How to obtain the Chernoff distance is elaborated in another paper, reference number two. You can apply Newton's algorithm, but then you have to obtain the first- and second-order derivatives, or you can use a derivative-free approach based on the analytical form of the Chernoff function.
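To make these distances concrete, here is a minimal numpy sketch of the symmetric KL divergence and the Bhattacharyya distance between two full-covariance Gaussians; the Chernoff function replaces the fixed one-half mixing in the Bhattacharyya term with a parameter s that has to be optimized numerically (by Newton's method or the derivative-free approach of reference two), which is omitted here. Function names are illustrative, not the system's actual code.

```python
import numpy as np

def kl_gaussian(mu0, S0, mu1, S1):
    """KL(N0 || N1) for two full-covariance Gaussians."""
    d = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - d
                  + np.linalg.slogdet(S1)[1] - np.linalg.slogdet(S0)[1])

def symmetric_kl(mu0, S0, mu1, S1):
    """Symmetric KL divergence: KL(N0||N1) + KL(N1||N0)."""
    return kl_gaussian(mu0, S0, mu1, S1) + kl_gaussian(mu1, S1, mu0, S0)

def bhattacharyya(mu0, S0, mu1, S1):
    """Bhattacharyya distance; the Chernoff function would mix the
    two covariances with weights s and 1-s instead of 1/2 and 1/2."""
    S = 0.5 * (S0 + S1)
    diff = mu1 - mu0
    return (0.125 * diff @ np.linalg.inv(S) @ diff
            + 0.5 * (np.linalg.slogdet(S)[1]
                     - 0.5 * (np.linalg.slogdet(S0)[1] + np.linalg.slogdet(S1)[1])))
```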
0:05:08 Okay, we just discussed the distance measurements; now here is an outline of the investigated clustering algorithms.
0:05:18 The clustering is based on bottom-up, also called agglomerative, clustering, which is greedy. A distance refinement is proposed to improve the speed, and some non-greedy approaches are also proposed for global optimisation, including K-step look-ahead and searching for the best path. Finally, a two-pass strategy is used to improve the model structure.
0:05:47 Let's review the problem again. We have a Gaussian mixture model, aggregated from the bootstrapped models, and we want to compress it to a model with N Gaussians.
0:06:04 Formulated in terms of entropy, we want to minimize the entropy change between F and G; that is our target. This is a global optimisation target, but it is extremely hard to obtain, so the conventional method is, at each step, to find the two most similar Gaussians and combine them into one based on some criterion.
0:06:37 Following this idea, it is actually a greedy approach.
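To make the greedy step concrete, here is a minimal sketch under the standard moment-matching assumption: a merged pair keeps the pair's total weight, mean, and covariance, and at each step the pair with the smallest weighted-entropy increase is merged. The helper names and the brute-force pair scan are illustrative, not the paper's exact implementation.

```python
import numpy as np

def merge_moment_match(w1, mu1, S1, w2, mu2, S2):
    """Merge two weighted Gaussians into one that preserves the pair's
    total weight, mean and covariance (moment matching)."""
    w = w1 + w2
    mu = (w1 * mu1 + w2 * mu2) / w
    d1, d2 = mu1 - mu, mu2 - mu
    S = (w1 * (S1 + np.outer(d1, d1)) + w2 * (S2 + np.outer(d2, d2))) / w
    return w, mu, S

def entropy_change(w1, mu1, S1, w2, mu2, S2):
    """Increase of weighted Gaussian entropy caused by merging the pair;
    the constant terms cancel, so only log-determinants remain."""
    w, _, S = merge_moment_match(w1, mu1, S1, w2, mu2, S2)
    ld = lambda A: np.linalg.slogdet(A)[1]
    return 0.5 * (w * ld(S) - w1 * ld(S1) - w2 * ld(S2))

def greedy_cluster(components, n_target):
    """components: list of (weight, mean, covariance) tuples.
    Repeatedly merge the closest pair until n_target components remain."""
    g = list(components)
    while len(g) > n_target:
        i, j = min(((a, b) for a in range(len(g)) for b in range(a + 1, len(g))),
                   key=lambda p: entropy_change(*g[p[0]], *g[p[1]]))
        g = [c for k, c in enumerate(g) if k not in (i, j)] + [merge_moment_match(*g[i], *g[j])]
    return g
```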
0:06:44 So a global approach is supposed to be better.
0:06:53 Here is an example of K-step look-ahead. The greedy approach will always choose the first-ranked combination. However, if you look two steps further ahead, you find that the best combining candidate comes from the second-best order, from here, the red path.
0:07:20 So this is a general way to obtain a more globally optimized result.
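A minimal sketch of the K-step look-ahead, reusing merge_moment_match and entropy_change from the sketch above (the number of top candidates to examine is an illustrative assumption): each of the few greedily best pairs is scored by the cost of merging it plus K further greedy merges, and the pair with the lowest accumulated cost wins, which need not be the first-ranked one.

```python
def lookahead_cost(components, pair, k):
    """Cost of merging `pair` now plus k further greedy merges."""
    g, (i, j) = list(components), pair
    cost = entropy_change(*g[i], *g[j])
    g = [c for n, c in enumerate(g) if n not in (i, j)] + [merge_moment_match(*g[i], *g[j])]
    for _ in range(k):
        if len(g) < 2:
            break
        a, b = min(((x, y) for x in range(len(g)) for y in range(x + 1, len(g))),
                   key=lambda p: entropy_change(*g[p[0]], *g[p[1]]))
        cost += entropy_change(*g[a], *g[b])
        g = [c for n, c in enumerate(g) if n not in (a, b)] + [merge_moment_match(*g[a], *g[b])]
    return cost

def best_merge_with_lookahead(components, k=2, n_candidates=5):
    """Rank pairs greedily, then re-rank the top few by k-step look-ahead."""
    pairs = sorted(((i, j) for i in range(len(components))
                    for j in range(i + 1, len(components))),
                   key=lambda p: entropy_change(*components[p[0]], *components[p[1]]))
    return min(pairs[:n_candidates], key=lambda p: lookahead_cost(components, p, k))
```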
0:07:29 Another idea is to search for the best optimized path, which employs the breadth-first search idea, that is, dynamic programming.
0:07:39 If the beam is set to N, at each layer you keep N candidates, and you extend to the next layer from the N candidates, so you have N-squared possibilities, and then you use pruning to cut the layer back to N.
0:07:58 After this searching process, you find a close-to-globally-optimized point at the last layer.
0:08:10 If the beam were unlimited, the result would be truly globally optimized; however, that is an NP-like problem, so we have to set a beam to do this job.
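A minimal sketch of the path-search idea with a beam, again reusing the helpers from the earlier sketch (the per-hypothesis expansion limit is an assumption to keep the example small): each hypothesis is a partially merged mixture with its accumulated cost; every layer expands the hypotheses by one merge, and pruning cuts the layer back to the beam size.

```python
import heapq

def beam_search_cluster(components, n_target, beam=4, expand=4):
    """Breadth-first search over merge sequences: keep `beam` hypotheses
    per layer, expand each by its `expand` cheapest merges, prune,
    and stop when n_target components remain."""
    layer = [(0.0, list(components))]          # (accumulated cost, mixture)
    while len(layer[0][1]) > n_target:
        candidates = []
        for cost, g in layer:
            pairs = sorted(((i, j) for i in range(len(g)) for j in range(i + 1, len(g))),
                           key=lambda p: entropy_change(*g[p[0]], *g[p[1]]))
            for i, j in pairs[:expand]:
                new_g = ([c for n, c in enumerate(g) if n not in (i, j)]
                         + [merge_moment_match(*g[i], *g[j])])
                candidates.append((cost + entropy_change(*g[i], *g[j]), new_g))
        layer = heapq.nsmallest(beam, candidates, key=lambda h: h[0])
    return min(layer, key=lambda h: h[0])[1]
```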
0:08:30 The conventional setup is that every state has the same compression rate. That may not be very optimized, because some states can afford a larger compression rate, and that makes more sense.
0:08:53 So the Bayesian information criterion (BIC) is employed here, together with a two-pass idea. In the first pass we try to keep, for each state, 2K+1 compression-rate candidates with their BIC values.
0:09:13 In the second pass we fix the BIC value for all the states, and therefore different compression rates are obtained for different states.
0:09:25 This is applied to our clustering algorithm.
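A minimal sketch of how such a two-pass selection could look (the exact BIC bookkeeping of the paper may differ; names are illustrative): the first pass stores, per state, a few candidate component counts with their data log-likelihoods, and the second pass applies one shared BIC penalty to every state, so each state ends up with its own compression rate.

```python
import math

def bic(log_likelihood, n_components, dim, n_frames, penalty=1.0):
    """BIC of a full-covariance GMM for one state: -2 logL plus a
    penalised parameter count (weight, mean, covariance per component)."""
    n_params = n_components * (1 + dim + dim * (dim + 1) // 2)
    return -2.0 * log_likelihood + penalty * n_params * math.log(max(n_frames, 2))

def second_pass_select(first_pass, dim, penalty):
    """first_pass: {state: [(n_components, log_likelihood, n_frames), ...]}.
    With one shared penalty, each state keeps the candidate with the
    lowest BIC, which yields state-dependent compression rates."""
    return {state: min(cands, key=lambda c: bic(c[1], c[0], dim, c[2], penalty))[0]
            for state, cands in first_pass.items()}
```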
0:09:33 That brings us to the experimental setup. We ran the experiments on a Pashto dataset, with one hundred and thirty-five hours of training data and ten hours of testing data.
0:09:47 The model is speaker independent, and both the training and the testing data are spontaneous speech.
0:09:57 The model we cluster from is combined from fourteen bootstrap models; it has 6K states and about 1.8 million Gaussians, and this big model has a word error rate of 35.46 percent with full covariance.
0:10:16 Now comes a problem: distances like the Chernoff and the KL distance are very slow to obtain. From this figure you can see that KL is roughly six to ten times slower than entropy, and Chernoff is like twenty or thirty times slower than entropy.
0:10:35 So a simple idea here: since entropy is fast and effective, why don't we use entropy to find the N-best candidate pairs, and then use Chernoff or KL to recalculate the distance only on those, to speed up the process.
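A minimal sketch of this refinement, reusing the helpers from the earlier sketches (the ten-best cut-off is the value mentioned here; the rest is illustrative): the cheap entropy criterion nominates the N best candidate pairs, and only those are re-scored with the expensive distance.

```python
def refined_best_pair(components, expensive_distance, n_best=10):
    """Pick the pair to merge: rank all pairs with the cheap entropy
    criterion, then re-score only the n_best shortlist with an
    expensive distance such as symmetric KL or Chernoff."""
    pairs = sorted(((i, j) for i in range(len(components))
                    for j in range(i + 1, len(components))),
                   key=lambda p: entropy_change(*components[p[0]], *components[p[1]]))
    return min(pairs[:n_best],
               key=lambda p: expensive_distance(components[p[0]][1], components[p[0]][2],
                                                components[p[1]][1], components[p[1]][2]))
```

For example, refined_best_pair(g, symmetric_kl) plugs in the symmetric KL sketch from earlier; the chosen pair is then merged exactly as in the greedy loop.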
0:10:51 After applying this idea, the speed improvement is significant, and the word error rate also improves.
0:11:02 Take KL as an example: the baseline is 36.23, and after using entropy to preselect the ten best pairs there is a further improvement in the word error rate.
0:11:21 The next idea: a weighted distance may be expected to show improvement, because the entropy distance implicitly considers the weighting between the mixtures.
0:11:38 So I tried the weighted Bhattacharyya distance, compared with the unweighted Bhattacharyya distance, and compared with the MBR approach, when compressing to one hundred K Gaussians and to fifty K Gaussians.
0:11:54 From this figure we can see that the weighted distance is better than the unweighted distance, which means the weighting is very important, and the MBR approach is better than the weighted distance.
0:12:09 Another observation is that fifty K has a larger improvement, which makes sense, because the weighting becomes more and more important when the compression rate is high.
0:12:28 Here are some experimental results on global optimisation. Let's first look at using the entropy criterion and measuring the overall entropy change between F, before compression, and G, after compression.
0:12:46 The two-step look-ahead gives a tiny improvement, around 0.04, but the search approach gives a considerably larger improvement, which means our approach is effective.
0:13:04 The speed, though, is slow, because you have to search over the paths; it is about twenty times slower than the baseline.
0:13:17 When we evaluate the word error rate, at one hundred K and at fifty K the proposed approach is better, a positive improvement, though the improvement is small.
0:13:32 At a higher compression rate, the difference between our proposed approach and the baseline approach is larger, which means this approach is effective.
0:13:51 Here are the experimental results on the two-pass structure optimisation. Comparing one pass and two passes, the two-pass is always better than the one-pass, though the improvement is small.
0:14:11 Here is the final figure, comparing the three approaches: the baseline, the BS plus diagonal covariance strategy, and the BS plus full-to-diagonal conversion strategy.
0:14:32 We evaluated on both maximum likelihood and discriminative training, and the result is pretty interesting: the improvement is quite large if we compare the full-to-diagonal conversion with training the whole process using diagonal covariance.
0:15:04 The gain is around one percent with maximum likelihood training, and around 0.7 percent with discriminative training.
0:15:18 Finally, for future extensions: in the search-based approach, the beam can be made adaptive. The beam we are using is fixed; for example, at the beginning the beam can be small, but toward the end the beam should be larger, because you want to capture more candidates, so we can use an adaptive idea to optimize the beam.
0:15:52 The K-step look-ahead and the search for the optimized path can be seen as general approaches to optimisation, and they can be applied to other tasks, such as decision tree building.
0:16:04 For the two-pass model structure optimisation, we can try different criteria, such as MDL instead of BIC.
0:16:13 These are the references. Are there any questions? Thank you.
0:16:28 We have a question over here; please wait for the microphone.
0:16:37 Thanks. I have two questions. The first one is: how do you divide the training set into the different subsets at the very beginning? The second question is: if I understand correctly, each model will have its own tree structure; if this is true, how can you decide which two states, for example, can be merged?
0:17:01 Okay. For the first question: each subset is drawn by random sampling without replacement, with a sampling rate of around seventy percent.
0:17:16 For the second question: the models actually share the same decision tree. We combine all the bootstrap data together to train one decision tree, so the problem you mentioned does not arise. Thanks.
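A minimal sketch of the subset sampling just described (the function name is illustrative; fourteen subsets matches the number of bootstrap models mentioned earlier):

```python
import numpy as np

def bootstrap_subsets(n_utterances, n_subsets=14, rate=0.7, seed=0):
    """Draw each subset by random sampling WITHOUT replacement at roughly
    a 70 percent rate; the subsets differ because each draw is independent."""
    rng = np.random.default_rng(seed)
    size = int(rate * n_utterances)
    return [rng.choice(n_utterances, size=size, replace=False) for _ in range(n_subsets)]
```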
0:17:37 You are doing full Gaussian clustering in this case. According to my experience, maybe some of the clusters will end up with a very small number of components. So do you have any measure for these small clusters?
0:18:01 Actually, in the agglomerative clustering you combine the two most similar Gaussians together, so after each step you have N minus one Gaussians, and I think the weight is very important here for avoiding the case you mentioned.
0:18:23 So, in other words, you are not using any explicit measure; you just do not end up with clusters that have a very small number of components?
0:18:38 I would say the measure of a small cluster is the weight: for a small cluster, the mixture weight is what represents whether it is small, right?
0:18:58 Yeah, but if you have, for example, just one component in one cluster, so that it is isolated from all the others, how do you handle this?
0:19:14 If it is isolated, we do not need to combine it with the others. So do you mean cross-state clustering?
0:19:37 No, I just mean that when you do the clustering, some of the clusters end up with a very small number of components; if you then, for example, retrain those models on a small amount of data, that will later create a problem.
0:20:04 Right; so far I have not had this small-cluster problem, and I think the weight is very important here: as I showed, the weighted distance is better than the unweighted distance, and the weight is what represents whether a cluster is small or large, right? So that is my perspective.
0:20:24 Okay, thank you, thank you.