Speech Transcript - Parallel Acoustic Model Adaptation for Improving Phonotactic Language Recognition

0:00:06	you have to do
0:00:07	um that well yeah i eyepiece entity
0:00:10	it's uh so parallel acoustic model adaptation for improving phonotactic language recognition
0:00:18	um
0:00:19	in general
0:00:20	um a phonotactic um language recognition system um has two components
0:00:25	the first one is the phone recognition front-end
0:00:28	in which there may be
0:00:29	a single phone recogniser or multiple phone recognisers in parallel
0:00:34	which uh we use for the
0:00:36	uh phonotactic information extraction
0:00:39	and the second one is
0:00:40	a classifier
0:00:42	that
0:00:43	uh uses the extracted
0:00:45	uh phonotactic information
0:00:47	to
0:00:47	to distinguish between target languages
0:00:50	um
0:00:51	in phonotactic uh language recognition
0:00:54	the idea of feature
0:00:56	uh diversification
0:00:58	is that well i don't
0:01:00	uh examples include
0:01:02	using parallel recogniser front-ends
0:01:05	and the second example is
0:01:06	using multiple high level
0:01:08	in the phone lattice decoding
0:01:13	to reduce the speaker
0:01:14	and um session
0:01:16	induced variation
0:01:17	in the uh speech data
0:01:19	generally involving telephone speech
0:01:22	something like that
0:01:23	um
0:01:24	mllr adaptation and
0:01:26	speaker adaptive training S A T
0:01:28	prior to uh phone lattice decoding
0:01:31	is used
0:01:32	has gone and it must be posted
0:01:34	seriously
0:01:35	so in this piece of our work
0:01:38	um we would like to investigate
0:01:40	different types of uh
0:01:41	adaptation techniques
0:01:43	and we
0:01:44	will um
0:01:45	quantitatively measure
0:01:47	the diversity
0:01:48	between two sets of
0:01:50	phonotactic features
0:01:51	and finally
0:01:52	uh we investigate
0:01:54	whether
0:01:54	parallel acoustic model adaptation
0:01:57	can provide
0:01:58	uh feature diversification
0:02:01	and in particular
0:02:02	we work on the mean only mllr adaptation
0:02:05	and the variance only mllr adaptation
0:02:13	this slide shows
0:02:14	the general structure
0:02:16	of a
0:02:17	all three
0:02:18	um
0:02:18	phonotactic
0:02:19	language recognition system
0:02:23	it has the two components
0:02:25	that i mentioned before
0:02:26	the parallel phone recognisers
0:02:29	and also the backend
0:02:30	uh in the backend we can use
0:02:32	uh vector
0:02:33	space modelling
0:02:35	or n-gram modelling
0:02:36	uh you know about
0:02:38	in our experiment we use the
0:02:39	uh V S M
0:02:41	the vector space modelling
0:02:47	i'm sorry
0:02:48	the
0:02:48	the reason there's some problem
0:02:50	on it i don't know
0:02:52	the
0:02:52	but uh anyway
0:02:53	oh it was so so
0:02:55	there is a lot um a value
0:02:57	score
0:02:58	from the f-th svm uh model
0:03:00	and then we would like to combine them
0:03:02	to get the final score
0:03:05	and
0:03:06	that and in fact they say at here
0:03:08	and the F represents the different
0:03:10	different phone recognisers
0:03:12	and we combine the scores
0:03:15	and also we have
0:03:16	F phone recognisers so we combine the F scores
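The score combination across phone recognisers can be illustrated with a small sketch. The talk does not say how the per-recogniser scores are combined, so the equal-weight average below is an assumption, and the function name `fuse_scores` is hypothetical.

```python
from typing import List, Sequence


def fuse_scores(scores_per_recogniser: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-language scores over all phone recognisers.

    scores_per_recogniser[f][k] is the score of target language k produced
    by the backend of recogniser f. Equal weights are an assumption.
    """
    n_recognisers = len(scores_per_recogniser)
    n_languages = len(scores_per_recogniser[0])
    return [
        sum(scores[k] for scores in scores_per_recogniser) / n_recognisers
        for k in range(n_languages)
    ]


# Two recognisers, two target languages (made-up scores).
fused = fuse_scores([[1.0, -2.0], [3.0, 0.0]])
```

A trained fusion (for example with learned weights per recogniser) would replace the plain average; the structure of the loop stays the same.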
0:03:27	and you know
0:03:28	our work
0:03:28	we were uh
0:03:30	uh we use more
0:03:32	uh features for
0:03:34	diversification
0:03:35	using a different uh model adaptation
0:03:38	you can see that at each phone recogniser
0:03:41	oh
0:03:42	and for each phone recogniser uh we have
0:03:44	if uh
0:03:45	model adaptation
0:03:46	so for yeah yeah
0:03:49	um organiser and
0:03:50	maybe is that we use that eight he was
0:03:53	two
0:03:53	and then they are to have well
0:03:55	score from the reassembled
0:04:00	and in our experiment
0:04:02	we try to set
0:04:04	the
0:04:04	F equal to one that means we
0:04:07	we use a
0:04:08	single phone recogniser
0:04:10	in our experiments
0:04:13	but all of this and see we
0:04:15	we were uh for the whole experiment we find that
0:04:18	uh using the other from a fellow that
0:04:21	uh well know that the patient
0:04:24	yeah i can still get into
0:04:26	when we use parallel phone recognisers
0:04:38	um to further reduce the speaker and the session induced variation
0:04:42	uh we use the mllr uh
0:04:45	um uh
0:04:47	uh adaptation
0:04:48	uh
0:04:49	in in in the phone lattice decoding
0:04:52	um
0:04:53	the transformation can be
0:04:56	formulated
0:04:57	by these two equations
0:04:58	yeah
0:04:59	A b and H are the transforms to be computed
0:05:03	and the mu and uh
0:05:05	sigma
0:05:06	are the
0:05:07	gaussian mean and covariance matrix
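The two equations on the slide are not captured in the transcript. For reference, a common way of writing mllr mean and variance transforms with the symbols named here is sketched below. This is an assumption about the form used in the talk; in particular, other formulations of the variance transform exist.

```latex
\hat{\mu} = A\mu + b, \qquad \hat{\Sigma} = H \, \Sigma \, H^{\top}
```

Under this reading, mean only adaptation estimates A and b and leaves the covariance untouched, while variance only adaptation estimates H and leaves the mean untouched.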
0:05:13	yeah so well
0:05:15	the different types of adaptation technique we test
0:05:18	by the way we also test the each radius and not the patient
0:05:21	and also
0:05:22	oh adaptation with multiple
0:05:24	regression classes
0:05:26	but we found that no obvious uh improvement can be found
0:05:29	so we did not report the results
0:05:31	in detail in the paper
0:05:33	you know
0:05:39	well this shows the model adaptation in two pass
0:05:42	uh decoding using wap
0:05:43	and the whole process is
0:05:45	first of all we generate a single best phone
0:05:48	sequence
0:05:48	and then we estimate the transform A b and or H
0:05:52	and then based on the transformed
0:05:54	acoustic model we generate
0:05:56	the phone lattice
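The two-pass procedure described here can be sketched as plain control flow. All four callables are hypothetical placeholders for whatever decoder and adaptation toolkit is used; none of them is a real library API, and the talk does not name one.

```python
def two_pass_decode(utterance, model, decode_one_best, estimate_transform,
                    apply_transform, decode_lattice):
    """Unsupervised adaptation followed by lattice decoding.

    The decoder and adaptation steps are passed in as functions because
    the actual toolkit is not stated in the talk (assumption).
    """
    # Pass 1: single best phone sequence with the unadapted model.
    one_best = decode_one_best(utterance, model)
    # Estimate the transform using the first-pass hypothesis as supervision.
    transform = estimate_transform(utterance, one_best, model)
    # Apply the transform to the acoustic model.
    adapted_model = apply_transform(model, transform)
    # Pass 2: phone lattice with the adapted model.
    return decode_lattice(utterance, adapted_model)
```

Running this once per adaptation type (for example mean only and variance only) gives one lattice per type for the same utterance, which is the parallel arrangement the talk studies.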
0:06:08	apart from the uh acoustic model adaptation in the test
0:06:11	uh test data
0:06:12	we can apply uh
0:06:14	speaker adaptive training
0:06:16	in the training data
0:06:17	uh
0:06:18	of the uh phone recogniser
0:06:20	oh
0:06:22	and
0:06:23	in which a feature level transform
0:06:26	is applied to each
0:06:29	um
0:06:30	uh training utterance
0:06:31	in the phone recogniser
0:06:33	and
0:06:34	so we test
0:06:35	in our experiment
0:06:36	oh
0:06:37	three types of adaptation technique
0:06:40	we have right
0:06:48	in the V S M um vector space um modelling
0:06:51	uh
0:06:53	the phone lattice
0:06:55	is uh
0:06:57	is converted
0:06:58	to
0:07:00	to to expected n-gram counts
0:07:02	and he's expert you are
0:07:04	very much
0:07:04	um we use n-grams up to
0:07:07	uh n-gram order three
0:07:09	and then
0:07:10	it is converted to a high dimensional
0:07:13	phonotactic feature
0:07:15	that contains uh unigram bigram and
0:07:18	trigram phone
0:07:19	uh uh statistics
0:07:22	and
0:07:23	this
0:07:24	the size of this high dimensional phonotactic feature
0:07:28	the the uh that the dimension S
0:07:31	is determined by the n-gram
0:07:33	uh order n
0:07:34	and also the phone set size
0:07:36	she
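The dependence of the feature dimension on the n-gram order and the phone set size can be made concrete with a short sketch. It assumes the feature holds one entry per possible phone n-gram of each order from one up to the maximum order. The counting function works on a single phone sequence for simplicity, whereas the talk uses statistics from a phone lattice; both function names are hypothetical.

```python
from collections import Counter


def feature_dimension(phone_set_size, max_order):
    """Number of possible phone n-grams for orders 1..max_order."""
    return sum(phone_set_size ** n for n in range(1, max_order + 1))


def ngram_counts(phones, max_order):
    """Count n-grams of orders 1..max_order in one phone sequence.

    Simplification: a 1-best sequence stands in for the lattice.
    """
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(phones) - n + 1):
            counts[tuple(phones[i:i + n])] += 1
    return counts


# Toy phone set of size 3 with unigrams, bigrams and trigrams.
dim = feature_dimension(3, 3)          # 3 + 9 + 27
counts = ngram_counts(["a", "b", "a"], 2)
```

Most entries of such a vector are zero for any single utterance, which is why the nonzero n-grams are a natural basis for comparing two feature sets.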
0:07:39	after we generate uh
0:07:41	the high dimensional
0:07:42	phonotactic feature
0:07:45	we put it into the
0:07:47	svm training for the S R O the reassemble inside
0:07:53	moreover we also define
0:07:54	the diversity measure
0:07:58	between two uh phonotactic features
0:08:01	oh
0:08:02	using at that you uh you could be
0:08:05	yeah idea is that
0:08:06	um
0:08:08	between uh
0:08:10	that
0:08:11	the the features
0:08:12	C A and C
0:08:13	B
0:08:14	based on their nonzero uh n-gram
0:08:17	uh statistics
0:08:19	and
0:08:20	you are
0:08:21	but you have to use
0:08:23	U uh means the set of n-gram statistics
0:08:26	which is nonzero in
0:08:28	both C A and C B
0:08:30	and and this
0:08:31	denotes the
0:08:32	uh size
0:08:33	of the set U
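The exact diversity formula is not recoverable from the transcript. As an illustration only, the sketch below compares two feature sets through their nonzero n-gram statistics, using one minus the ratio of shared nonzero entries to all nonzero entries. This particular ratio is an assumption and may differ from the definition in the paper; the function names are hypothetical.

```python
def nonzero_ngrams(feature):
    """Set of n-grams whose statistic is nonzero in a feature dict."""
    return {ngram for ngram, value in feature.items() if value != 0}


def diversity(feature_a, feature_b):
    """Share of nonzero n-grams that the two features do NOT have in common.

    0.0 means the same n-grams are nonzero in both, 1.0 means no overlap.
    Illustrative measure only (assumption, not the paper's formula).
    """
    set_a = nonzero_ngrams(feature_a)
    set_b = nonzero_ngrams(feature_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


# Made-up statistics for two features of the same utterance.
d = diversity({"a": 1, "b": 2, "c": 0}, {"b": 1, "d": 3})
```

Averaging such a value over utterances gives one diversity number per pair of systems, which can then be set against the error rate of the fused pair.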
0:08:41	our system
0:08:41	has been evaluated
0:08:43	uh using the thirty second task
0:08:45	in the two thousand seven nist uh language recognition evaluation
0:08:50	in which fourteen target languages are involved
0:08:53	in the detection task
0:08:55	um the system
0:08:56	determines whether the
0:08:58	target language is spoken
0:09:00	in the speech
0:09:01	uh huh
0:09:02	and
0:09:03	the average equal error rate
0:09:05	which is
0:09:06	calculated from the eer of each target
0:09:09	target language is reported
0:09:11	uh we use this average uh
0:09:14	eer to ensure that
0:09:16	uh each
0:09:16	target language has very
0:09:18	has an equal contribution to the metric
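The averaging of the equal error rate over target languages can be sketched as follows. The eer routine is a simple threshold sweep that picks the point where miss rate and false alarm rate are closest, which is an approximation and not necessarily the scoring tool used in the evaluation. The function names and the scores in the example are made up.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping a threshold over all observed scores."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    thresholds.append(thresholds[-1] + 1.0)  # threshold that rejects everything
    best_gap, eer = None, None
    for t in thresholds:
        # A trial is accepted when its score is at or above the threshold.
        miss = sum(s < t for s in target_scores) / len(target_scores)
        false_alarm = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - false_alarm)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2
    return eer


def average_eer(trials_per_language):
    """Mean of the per-language EERs, so every language weighs the same.

    trials_per_language maps a language name to a pair
    (target_scores, nontarget_scores).
    """
    eers = [equal_error_rate(t, n) for t, n in trials_per_language.values()]
    return sum(eers) / len(eers)


avg = average_eer({
    "lang_x": ([2.0, 3.0], [0.0, 1.0]),  # perfectly separable
    "lang_y": ([1.0, 3.0], [0.0, 2.0]),  # overlapping scores
})
```

Pooling all trials before computing a single eer would let languages with more trials dominate; averaging per-language values avoids that.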
0:09:24	in our experiments a single phone recogniser
0:09:28	is used
0:09:29	in the front-end
0:09:31	um
0:09:32	forty nine uh dimensional mfcc features
0:09:35	a standard three state
0:09:37	left to right hmm
0:09:38	with thirty two gaussian components per state is used
0:09:41	in all acoustic models
0:09:45	um
0:09:46	for the training data
0:09:47	fifteen hours of uh
0:09:49	switchboard one uh english uh data
0:09:52	is used to train the phone recogniser
0:09:55	and
0:09:56	a full
0:09:57	um phone loop grammar is used in the decoding
0:10:03	for the training data of the target languages
0:10:06	we use the callfriend
0:10:08	uh corpora and also the training data set of uh
0:10:12	this uh L R E two thousand and seven training data
0:10:18	in those
0:10:19	in the first experiment
0:10:21	we compare
0:10:22	different adaptation techniques
0:10:25	and with this for uh
0:10:27	uh what model but
0:10:28	and these uh
0:10:31	the
0:10:32	S I
0:10:32	uh speaker independent and S A T phone models
0:10:38	um so first of all
0:10:40	uh we found that all adaptation techniques
0:10:42	provide improvement
0:10:44	oh
0:10:45	you can see that in system A1 we didn't do any adaptation technique
0:10:50	and all the others
0:10:50	we use different kinds of adaptation technique
0:10:53	and
0:10:54	B is using the S A T model
0:10:56	and A the
0:10:58	S I
0:11:00	S I phone model
0:11:02	yeah now adaptation and and
0:11:04	mean only mllr
0:11:06	adaptation performed the best
0:11:09	and also you can find that
0:11:13	a further improvement can be
0:11:15	can be obtained
0:11:16	when we use a
0:11:17	uh S A T phone model
0:11:24	secondly uh we test whether
0:11:26	um two phonotactic systems with different types of
0:11:30	adapted uh acoustic uh
0:11:33	models
0:11:34	provide complementary information to the uh to each other
0:11:38	and whether
0:11:38	the corresponding system fusion
0:11:41	um could provide a further system uh
0:11:44	improvement
0:11:46	by considering
0:11:47	oh
0:11:48	curacy whistle at the table that eight
0:11:50	sis
0:11:51	phonotactic system
0:11:52	uh we can combine them
0:11:54	oh
0:11:55	and can generate twenty eight possible uh two system fusions
0:12:00	and then we plot
0:12:02	uh their corresponding
0:12:03	average uh
0:12:05	feature diversity
0:12:06	and also the
0:12:07	uh eer of the fused system
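The count of twenty eight follows from choosing two systems out of eight. A quick check, with placeholder system labels that are not taken from the talk:

```python
from itertools import combinations

# Eight phonotactic systems; the labels are placeholders.
systems = ["sys%d" % i for i in range(1, 9)]

# Every unordered pair of distinct systems is one candidate fusion.
pairs = list(combinations(systems, 2))
n_pairs = len(pairs)  # 8 * 7 / 2 = 28
```

Each pair would then contribute one point to the plot: its average feature diversity against the eer of the fused pair.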
0:12:09	and you can find that
0:12:13	that's a
0:12:15	you can also see that
0:12:16	systems using mean only
0:12:19	mean only adaptation and variance only adaptation
0:12:22	i i my
0:12:23	here
0:12:24	both of them
0:12:25	um they can provide relatively higher uh
0:12:28	oh
0:12:29	diversity
0:12:30	and also
0:12:31	you can see the trend over all
0:12:33	twenty eight possible combinations
0:12:36	you can see that when you are
0:12:37	uh when you obtain
0:12:39	uh higher
0:12:40	oh
0:12:41	feature diversity then you can obtain
0:12:44	uh
0:12:44	lower uh eer
0:12:53	in the last experiment
0:12:54	we fuse two systems using mean only and variance only adaptation
0:12:59	that means
0:13:00	system at a cheaper so
0:13:02	eighty and eighty four and B
0:13:04	she too and petri
0:13:07	you can see the result
0:13:08	you need only
0:13:10	here
0:13:10	and then the fusion result
0:13:18	and we also that just use a lot so
0:13:21	system with
0:13:24	uh can provide uh obvious improvement
0:13:26	for example
0:13:27	uh when
0:13:28	when S I model is used
0:13:30	and
0:13:31	A3 and A4 are fused
0:13:34	it can outperform
0:13:36	the system B1 in which
0:13:38	S A T model is used
0:13:41	and also
0:13:42	when we use A S A T model
0:13:45	um p2p plus P V
0:13:47	we can provide
0:13:48	a further
0:13:50	um improvement
0:13:52	and overall
0:13:53	when you compare
0:13:54	this result using S A T model
0:13:57	and comparing with uh A1 before any
0:14:00	uh adaptation techniques
0:14:02	we can provide overall uh around forty percent relative improvement
0:14:12	to conclude
0:14:13	uh we have studied
0:14:14	different types of uh adaptation techniques
0:14:19	for the phonotactic language
0:14:20	recognition
0:14:22	oh yeah yeah that's true
0:14:23	uh illustrate
0:14:25	oh yeah that a mistake model adaptation
0:14:28	and we found that
0:14:30	um variance only mllr adaptation which diversifies uh the phonotactic feature
0:14:35	so it
0:14:35	can provide a complementary information to the one using
0:14:39	mean only mllr
0:14:40	adaptation
0:14:41	and our ongoing work includes
0:14:43	uh
0:14:44	to see the interaction with the recogniser front-end
0:14:48	and also we will investigate more sophisticated
0:14:51	adaptation techniques
0:14:54	and that's all of my presentation
0:14:56	fig
0:15:03	let's see
0:15:11	you could use
0:15:13	hmmm
0:15:16	you mean for a second all test data
0:15:18	yeah
0:15:19	yes
0:15:19	text
0:15:20	we used
0:15:23	hmmm
0:15:25	fig no i didn't do it
0:15:26	but uh
0:15:27	in a room
0:15:28	where motif on the first exactly
0:15:30	you know
0:15:31	but that that would be a problem if you we test it on the feedback and all kinds i control
0:15:37	yeah but is likely to be no
0:15:41	and in this movie also
0:15:42	sure
0:15:43	to think about this
0:15:44	moving paul
0:15:45	so that i i thought about the most of you
0:15:47	data that
0:15:49	yeah
0:15:50	you see this
0:15:54	no
0:15:55	hmmm
0:15:57	with extreme
0:16:00	hmmm
0:16:01	yeah yeah sure sure
0:16:02	mixture
0:16:02	sure
0:16:03	sure
0:16:04	exactly
0:16:05	yeah but i
0:16:05	it is in this moment
0:16:07	in all
0:16:08	in the very study we found that even using the simple most convenient
0:16:12	uh
0:16:13	commas a new method we can still get some improvement
0:16:15	but of course you are right
0:16:17	we can do some more in public uh interpolation we have some
0:16:21	some uh
0:16:22	like that
0:16:23	a universal
0:16:25	adaptation trans
0:16:31	is one
0:16:33	yeah
0:16:34	and your your
0:16:36	you
0:16:39	sorry
0:16:40	you
0:16:40	hmmm
0:16:41	hmmm
0:16:42	i
0:16:42	oh
0:16:46	you you mean i using
0:16:48	from a practical
0:16:49	acoustic or
0:16:50	well as well
0:16:52	uh
0:16:52	to
0:16:55	yeah
0:16:56	hmmm
0:16:58	oh oh you mean a and five test
0:17:00	diffusion with a
0:17:02	system no i didn't
0:17:05	yes
0:17:08	hmmm
0:17:11	yeah sure sure sure
0:17:13	but i didn't make a number so that depends on it
0:17:16	yeah
0:17:22	questions
0:17:29	okay

Parallel Acoustic Model Adaptation for Improving Phonotactic Language Recognition

SESSION 10: Language recognition – phonotactics

Added: 14. 7. 2010 11:08, Author: Cheung Chi Leung, Bin Ma, Haizhou Li (Institute for Infocomm Research), Length: 0:17:37