Speech Transcript - ACOUSTIC MODELS IN KALDI

0:00:13	So Dan gave you an overview of the whole toolkit, and I'm just going to give a brief description of how the acoustic model classes are organised,
0:00:23	just to give you a flavour of what is meant by "the code is modular": parts that don't need to know about each other do not.
0:00:34	So, just to reiterate,
0:00:40	the thing that we support currently is mainly the standard maximum likelihood training of acoustic models with GMMs, in the maximum likelihood framework.
0:00:50	We have the usual linear transforms like LDA and STC.
0:00:56	We also support speaker adaptation.
0:00:59	Currently fMLLR is there; we have tested it in the recipes.
0:01:04	For MLLR the code is there and it's unit tested,
0:01:07	but somebody still needs to write the [executables?] [unclear].
0:01:14	So MLLR is not in the recipes; almost done.
0:01:22	And MLLR obviously is with regression trees, and fMLLR has two variations: it's either just a global transform, or with regression trees.
0:01:34	And this is the point at which [unclear] mention that
0:01:39	we had some discussion whether to support things like [unclear]-type systems or [unclear] models.
0:01:46	For now, things are fairly simple.
0:01:51	We decided not to do it now; maybe if the need is felt in the future.
0:01:54	And sometimes also, over the course of this development, a couple of times I thought it might be good to have a system like that.
0:02:04	But currently a GMM is a very specific thing with means and covariances,
0:02:11	and in just a few slides you will see how the GMMs are implemented.
0:02:16	And the same thing with SGMMs: we also have the fMLLR adaptation code for SGMMs, and a little bit [unclear].
0:02:25	There are a few results we had previously published which are still not in this new code base, but they are going to be added.
0:02:34	So this has already been talked about: we have a GMM class, and it knows about really nothing else other than what it contains, that is, the parameters.
0:02:48	And there is the acoustic model class, which is just a vector of GMMs,
0:02:51	and for implementation reasons they are pointers, but that's not that interesting a thing.
0:02:56	The green arrow in these slides signifies this technical term called "knows about".
0:03:07	So we have avoided inheritance as much as we could.
0:03:15	Most of the time things are not inherited.
0:03:19	If an object needs to keep track of another object, it's either by keeping a const reference to it, in that case; otherwise there is a specific method that will just take pointers and modify that.
0:03:36	So "knows about" is in that sense: you can think that if you have to write the code, you have to include the header for this other thing, right.
0:03:47	So the GMMs are parametrized using the natural parameters,
0:03:52	natural parameters in the sense of the set of parameters of an exponential family distribution,
0:04:00	where, if you write out the likelihood, you get these two things:
0:04:09	the mean times the inverse of the covariance, and the inverse of the covariance, are the natural parameters of the GMM.
0:04:15	And the reason for doing that is that you can then do the likelihood calculation using just two matrix-vector multiplications. If you have a diagonal covariance system,
0:04:25	you have your [unclear], and you have the mean times inverse covariance as a vector, and, say, you have I components,
0:04:33	and you have your data vector, and you just do these two matrix-vector products.
0:04:41	And there are BLAS routines for doing that, obviously.
0:04:43	ATLAS BLAS is not the most optimized thing, but it's still a nice way of doing things.
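
[Illustration, not part of the talk] The two matrix-vector products described above can be sketched for a diagonal-covariance GMM as follows. This is a minimal pure-Python sketch under stated assumptions, not Kaldi's actual code: the function names (`natural_params`, `component_loglikes`, `log_likelihood`) and the per-component constant `gconsts` are hypothetical names chosen here, and a real implementation would call BLAS routines instead of Python loops.

```python
import math


def natural_params(weights, means, variances):
    """Precompute, per component, a constant term, mean * inverse variance,
    and inverse variance (hypothetical layout, one row per component)."""
    gconsts, means_invvars, inv_vars = [], [], []
    for w, mu, var in zip(weights, means, variances):
        inv = [1.0 / v for v in var]
        m_inv = [m * iv for m, iv in zip(mu, inv)]
        # Everything in the log-density that does not depend on the data vector.
        const = math.log(w) - 0.5 * (
            len(mu) * math.log(2.0 * math.pi)
            + sum(math.log(v) for v in var)
            + sum(m * mi for m, mi in zip(mu, m_inv))
        )
        gconsts.append(const)
        means_invvars.append(m_inv)
        inv_vars.append(inv)
    return gconsts, means_invvars, inv_vars


def mat_vec(matrix, vector):
    """Plain matrix-vector product; stands in for a BLAS call."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def component_loglikes(gconsts, means_invvars, inv_vars, x):
    """Log of (weight * Gaussian density) for every component at once."""
    x_squared = [v * v for v in x]
    linear = mat_vec(means_invvars, x)        # first matrix-vector product
    quadratic = mat_vec(inv_vars, x_squared)  # second matrix-vector product
    return [g + l - 0.5 * q for g, l, q in zip(gconsts, linear, quadratic)]


def log_likelihood(gconsts, means_invvars, inv_vars, x):
    """Log-sum-exp over the per-component values."""
    values = component_loglikes(gconsts, means_invvars, inv_vars, x)
    largest = max(values)
    return largest + math.log(sum(math.exp(v - largest) for v in values))
```

The precomputation is done once per model, so evaluating a frame costs only the two products plus a log-sum-exp over the I components.
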
0:04:56	So, a graphical overview of what Dan has already said:
0:05:01	we have this acoustic model class, but when it goes into the decoder it interacts with this decodable object.
0:05:09	The decoder knows only about this decodable interface, and for each type of acoustic model we need to implement an object that satisfies the decodable interface for that model, right.
0:05:22	And the decodable object is the one which also knows about features, and [unclear] of the likelihood computation.
0:05:30	And this is exactly what the decodable interface looks like.
0:05:34	So when I said we avoid using inheritance, this is the only exception: when we have interfaces, of which we have a few, for features, for decodable, and a few other things.
0:05:49	And these are actually pure interfaces,
0:05:55	so that's the only case where we inherit.
0:05:58	As you can see, it's simple: the main function is the likelihood computation,
0:06:04	and the decoder can find out whether there are no more frames, and how many states, essentially, you have.
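
[Illustration, not part of the talk] The pure interface described above, with one scoring function, a check for the last frame, and a count of states, might look roughly like this. It is a hedged Python sketch rather than the slide's C++ code: the method names `log_likelihood`, `is_last_frame` and `num_indices`, the toy `DecodableFromTable` class, and the toy `best_frame_scores` "decoder" are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class DecodableInterface(ABC):
    """Pure interface: the only thing a decoder is allowed to know about."""

    @abstractmethod
    def log_likelihood(self, frame, index):
        """Score of state `index` on frame `frame`."""

    @abstractmethod
    def is_last_frame(self, frame):
        """True if there are no more frames after `frame`."""

    @abstractmethod
    def num_indices(self):
        """How many states can be scored."""


class DecodableFromTable(DecodableInterface):
    """Toy model type: scores are looked up in a precomputed table.
    A real implementation would hold features and an acoustic model."""

    def __init__(self, scores):
        self._scores = scores  # scores[frame][index]

    def log_likelihood(self, frame, index):
        return self._scores[frame][index]

    def is_last_frame(self, frame):
        return frame == len(self._scores) - 1

    def num_indices(self):
        return len(self._scores[0])


def best_frame_scores(decodable):
    """Toy stand-in for a decoder: sums the best score on each frame,
    using nothing but the interface. Assumes at least one frame."""
    total, frame = 0.0, 0
    while True:
        total += max(
            decodable.log_likelihood(frame, i)
            for i in range(decodable.num_indices())
        )
        if decodable.is_last_frame(frame):
            return total
        frame += 1
```

Because `best_frame_scores` touches only the three interface methods, a different model type can be plugged in without changing it, which is the point of the design described in the talk.
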
0:06:17	So for every other model type you then inherit from this and [unclear].
0:06:24	That was the decoding. For training we similarly have an object for [accumulating statistics?] for the GMMs,
0:06:33	and in the same way that the acoustic model is just a vector of GMMs, the acoustic model trainer is just a vector of objects which [accumulate statistics?].
0:06:51	Okay, yes, sure. My slides are not compatible.
0:07:02	And the red arrow means that this class modifies those classes.
0:07:08	Obviously, "modifies" implies it also "knows about", and typically for modification it doesn't keep any reference to, or an object of, the other class;
0:07:18	it has a method which will take that object and do the modification.
0:07:25	So how do you do adaptation with that?
0:07:28	Say, for feature-space MLLR:
0:07:34	if it's global, it's implemented as a simple matrix,
0:07:40	and the matrix doesn't need to know what it is; it's only the estimation which makes it an fMLLR transform.
0:07:45	So the estimator knows about the acoustic model, and knows about the regression tree too if you're using a regression tree.
0:07:51	And if you're using a regression tree, the transform object just has multiple transforms.
0:07:58	The fMLLR object, however, doesn't know about the regression tree concept;
0:08:04	it just has a bunch of transforms. It's the decodable object which knows how to read this thing.
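
[Illustration, not part of the talk] The division of knowledge described above, where the transform object only holds matrices and the decodable object decides which one applies, can be sketched like this. All names (`apply_affine`, `TransformSet`, `DecodableWithTransforms`, `class_of_index`, `score_fn`) are hypothetical, the affine convention of appending a 1 to the feature vector is an assumption, and the log-determinant term that a real fMLLR likelihood would include is left out for brevity.

```python
def apply_affine(transform, feature):
    """Apply a D x (D+1) affine transform [A b] to a D-dimensional feature."""
    extended = list(feature) + [1.0]
    return [sum(w * v for w, v in zip(row, extended)) for row in transform]


class TransformSet:
    """Just a bunch of transforms; knows nothing about regression trees."""

    def __init__(self, transforms):
        self.transforms = transforms


class DecodableWithTransforms:
    """The one place that knows which transform goes with which state."""

    def __init__(self, transform_set, class_of_index, score_fn, features):
        self._transform_set = transform_set
        self._class_of_index = class_of_index  # state index -> regression class
        self._score_fn = score_fn              # (index, feature) -> score
        self._features = features              # features[frame]

    def log_likelihood(self, frame, index):
        regression_class = self._class_of_index[index]
        transform = self._transform_set.transforms[regression_class]
        adapted = apply_affine(transform, self._features[frame])
        return self._score_fn(index, adapted)
```

With a single global transform, `class_of_index` would map every state to class 0 and the same code path applies.
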
0:08:14	Similarly with MLLR: obviously that has to know the acoustic model, and with MLLR
0:08:20	you can either give it an acoustic model and tell it "give me an adapted model", so it just adapts all the means and gives you a new model,
0:08:28	or it can do it lazily.
0:08:33	So the decoder will ask the decodable to get the likelihood from an adapted model; the decodable will query the MLLR object, which will then see if it has already computed this, I mean, it caches the mean.
0:08:49	If not, then it will get the mean from the acoustic model and [unclear], then convert it, right.
0:08:56	Which is how you would use it in a practical situation.
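
[Illustration, not part of the talk] The lazy variant described above, where a mean is adapted only when it is first asked for and then cached, could be sketched as below. The class name `LazyMllrMeans`, its methods, and the `num_computed` counter are hypothetical; a single global transform in the form [A b] is assumed, whereas the talk says MLLR is used with regression trees.

```python
class LazyMllrMeans:
    """Adapts Gaussian means on demand and caches the results."""

    def __init__(self, means, transform):
        self._means = means          # unadapted means, one list per Gaussian
        self._transform = transform  # D x (D+1) affine transform [A b]
        self._cache = {}             # Gaussian index -> adapted mean
        self.num_computed = 0        # how many means were actually adapted

    def adapted_mean(self, index):
        """Return the adapted mean, computing it only on the first request."""
        if index not in self._cache:
            extended = list(self._means[index]) + [1.0]
            self._cache[index] = [
                sum(w * v for w, v in zip(row, extended))
                for row in self._transform
            ]
            self.num_computed += 1
        return self._cache[index]

    def adapt_all(self):
        """Eager variant: adapt every mean and return the full set."""
        return [self.adapted_mean(i) for i in range(len(self._means))]
```

In decoding, only the Gaussians that are actually evaluated get adapted, which is why the lazy route is the one described as practical in the talk.
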
0:09:05	SGMMs have a very similar structure.
0:09:08	Again, there is the decodable on the SGMM.
0:09:12	Oh, that should say SGMM.
0:09:18	And the GMM class: the SGMM model has this [unclear], that's why it needs to know about the GMM classes as well, right.
0:09:30	And just for the convenience of coding, for the GMM classes the accumulator and updating class is the same; for SGMMs they are different, because there are many [unclear] methods used in SGMMs.
0:09:47	Yeah, and things [unclear].
0:09:51	So the first bullet point there, fMLLR bases for [SGMMs?], is already published [unclear];
0:09:58	it's in the old code base, and we need to put it in the new one. It is partially done, actually.
0:10:08	Then [unclear] presented the symmetric extension of SGMMs.
0:10:14	People keep asking what "symmetric" means.
0:10:19	So that's also partially done.
0:10:24	And then, as Dan mentioned, we are waiting for lattice generation to be finished, and then we can add the [unclear] things.
0:10:34	Yes, there were discussions and debates on supporting multiple feature transforms.
0:10:42	Currently you only have global transforms, and they are just put into one chain.
0:10:53	Regression classes, yeah: you can have regression classes for fMLLRs,
0:10:58	but then you can compose it with any other transform which has multiple transforms as well.
0:11:03	So yeah, so that's what I say [unclear].
0:11:12	No, yeah, no.
0:11:16	So that's the thing with multiple feature transforms. There are two "multiples" here:
0:11:23	first, for each type there are multiple transforms, and then multiple types composed together.
0:11:29	So far we have not felt the need for it, but when we do feel the need for it we will think about how to do this.
0:11:34	It will probably be handled in something like the decodable object level, because nothing else needs to know about how they compose.
0:11:45	So that's the end of the [overview?] of the acoustic models.

ACOUSTIC MODELS IN KALDI

Kaldi Workshop

Presented by: Arnab Ghoshal, Author(s): Arnab Ghoshal