Speech Transcript - Analyzing the Effect of Channel Mismatch on the SRI Language Recognition Evaluation 2015 System

0:00:15	thank you very much
0:00:17	thanks to the organisation for the chance to present our work
0:00:24	which is still trying to complement well
0:00:28	so with some post analysis of the SRI LRE system
0:00:34	due to some difficulties mitch couldn't come here so i'm
0:00:40	gonna
0:00:42	try to present
0:00:44	so
0:00:45	i am going to present somewhat of an overview of the LRE submission
0:00:52	our system
0:00:55	we have some hypotheses and objectives that i would like to show you
0:01:00	how we worked with the development dataset and the main architectures that we
0:01:05	have
0:01:07	the evaluation results and some analysis and conclusions and the lessons that we
0:01:13	learned from this
0:01:16	okay so
0:01:18	very briefly the LRE evaluation was focused on the development
0:01:23	of language recognition systems
0:01:26	for very closely related languages
0:01:30	so we have twenty target languages split across
0:01:35	six different clusters and the participants have to devise their own development set
0:01:42	so
0:01:43	there were two main channels telephone speech and broadcast speech
0:01:50	and here we have the six different clusters arabic chinese english french slavic and
0:01:56	iberian
0:01:58	then the performance metric was the average of the performance within each cluster so
0:02:04	this allowed the
0:02:06	development of six different separate systems for
0:02:11	each cluster
0:02:13	since we have to detect the language within each cluster
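
The cluster-averaged metric described here can be sketched as below. This is only a minimal illustration of averaging a per-cluster figure. How each per-cluster cost is computed is not covered in the talk, so the function name and the numbers in `example_costs` are hypothetical placeholders, not results.

```python
from statistics import mean


def cluster_average(per_cluster_cost):
    """Average a per-cluster performance figure over all clusters."""
    if not per_cluster_cost:
        raise ValueError("need at least one cluster")
    return mean(per_cluster_cost.values())


# Hypothetical per-cluster costs, for illustration only (not from the talk).
example_costs = {
    "arabic": 0.20,
    "chinese": 0.10,
    "english": 0.10,
    "french": 0.30,
    "slavic": 0.05,
    "iberian": 0.15,
}
overall = cluster_average(example_costs)
```

Because every cluster carries the same weight in the average, a separate system tuned per cluster is a natural design.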
0:02:18	okay so
0:02:20	before the LRE we had some hypotheses the first one was that
0:02:27	there would be limited mismatch between the training data and the
0:02:33	test set
0:02:36	as we have seen in the previous LREs but of course we were wrong
0:02:41	as i will show you
0:02:43	the second one is that the bottleneck features were
0:02:47	good features for this kind of task
0:02:50	and i will also show you that
0:02:52	we were right about this hypothesis
0:02:55	later
0:02:57	another hypothesis here was that the fusion of multiple systems
0:03:02	was a nice approach to increase the
0:03:06	robustness
0:03:07	and we were wrong
0:03:10	finally
0:03:12	having a good development dataset design would be crucial
0:03:15	and we were right
0:03:17	so
0:03:19	we had mainly three objectives here the first one was to design a
0:03:23	development dataset
0:03:25	the second to develop innovative approaches to dialect ID
0:03:31	and the third one to select a robust fusion coming from a variety of complementary
0:03:36	bottleneck features
0:03:40	that we were also developing under the
0:03:43	DARPA RATS program
0:03:44	and also
0:03:46	fusion with different backend classifiers
0:03:51	okay
0:03:52	so first we split the data into eighty percent for training and twenty percent
0:03:56	for dev
0:03:58	as was mentioned in the last question it was a bad decision
0:04:04	as i will show you
0:04:05	or it could be better
0:04:09	and we have ten audio files per language in each split
0:04:17	we avoided having the same telephone conversation callers in training and dev
0:04:23	and here we include an equal proportion of telephone speech and
0:04:29	broadcast speech in each split
0:04:33	and we excluded switchboard one and two basically because
0:04:38	our first experiments didn't show a great impact from that
0:04:43	probably because we
0:04:45	didn't expect this huge mismatch
0:04:48	so
0:04:50	and so
0:04:53	with that data we chunked the audio into
0:04:58	the
0:04:59	different segments of three seconds to assist with short durations
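
As a rough illustration of the chunking step, the sketch below cuts a recording into non-overlapping fixed-length segments. The function name, the non-overlapping layout, the dropping of the trailing remainder, and the 8 kHz rate in the usage line are all assumptions made for the example. The talk only mentions three seconds, and the real system may well have measured speech duration rather than raw audio length.

```python
def chunk_bounds(num_samples, sample_rate, chunk_seconds=3.0):
    """Return (start, end) sample indices of non-overlapping fixed-length chunks.

    Any trailing audio shorter than one chunk is dropped (an assumption).
    """
    chunk_len = int(round(chunk_seconds * sample_rate))
    if chunk_len <= 0:
        raise ValueError("chunk length must be positive")
    return [
        (start, start + chunk_len)
        for start in range(0, num_samples - chunk_len + 1, chunk_len)
    ]


# Ten seconds of hypothetical 8 kHz telephone audio gives three full chunks.
bounds = chunk_bounds(num_samples=80000, sample_rate=8000)
```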
0:05:04	so
0:05:06	at the end we have around a hundred k used for the UBM and
0:05:10	i-vector training which is the training data used for the back
0:05:17	end classifiers
0:05:21	we contextualized the features with different methods like SDC
0:05:26	and deltas and double deltas and rank DCT or PCA-DCT and also
0:05:32	we fused different i-vector systems from traditional features and at the end the
0:05:40	bottlenecks were trained with this combination of different
0:05:44	traditional features with different contextualizations
0:05:52	for the backend classifiers we used the gaussian backend and neural networks
0:05:59	both are
0:06:00	methods that are very well known in the community
0:06:05	and two methods for the adaptive gaussian backend which aims to better
0:06:10	cope with mismatched conditions
0:06:13	basically it is based on the test i-vector we try to select some i-vectors
0:06:19	from the training data to train the gaussian backends
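
For readers less familiar with it, a gaussian backend models each language with its own mean and a covariance shared across languages, then scores a test vector by log-likelihood. The sketch below is a simplified pure-Python version that assumes a diagonal shared covariance (real systems normally use a full matrix), and it does not implement the adaptive selection of training i-vectors mentioned in the talk. Function names and the toy two-dimensional data are hypothetical.

```python
import math


def train_gaussian_backend(vectors, labels):
    """Fit one mean per class and one variance per dimension shared by all classes."""
    dim = len(vectors[0])
    means = {}
    for c in sorted(set(labels)):
        rows = [v for v, l in zip(vectors, labels) if l == c]
        means[c] = [sum(r[d] for r in rows) / len(rows) for d in range(dim)]
    # Pooled within-class variance, floored to avoid division by zero.
    var = []
    for d in range(dim):
        ss = sum((v[d] - means[l][d]) ** 2 for v, l in zip(vectors, labels))
        var.append(max(ss / len(vectors), 1e-6))
    return means, var


def log_likelihoods(x, means, var):
    """Log-likelihood of vector x under each class Gaussian."""
    return {
        c: -0.5 * sum(
            math.log(2 * math.pi * s) + (xi - m) ** 2 / s
            for xi, m, s in zip(x, mu, var)
        )
        for c, mu in means.items()
    }


# Toy two-class data standing in for i-vectors (illustrative only).
train_x = [[0.0, 0.0], [0.2, -0.1], [-0.1, 0.1], [2.0, 2.0], [2.1, 1.8], [1.9, 2.2]]
train_y = ["a", "a", "a", "b", "b", "b"]
means, var = train_gaussian_backend(train_x, train_y)
scores = log_likelihoods([1.9, 2.0], means, var)
predicted = max(scores, key=scores.get)
```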
0:06:24	and also the multi-resolution neural network
0:06:29	it was a new method that we propose here
0:06:32	and it aims to exploit the short dialect differences that we capture
0:06:39	with the phonetic information
0:06:42	so we have different chunk durations from short durations to thirty two second
0:06:51	duration chunks and the phone segment and we have different weights for each
0:06:56	for each
0:06:59	for each chunk
0:07:01	okay and here we have a comparison
0:07:05	for all these five
0:07:07	backend systems that we had
0:07:10	the multi-resolution neural network was the best performing solution we are using the best
0:07:20	single bottleneck features and the non bottleneck features in the case of the multi-resolution
0:07:25	neural network we were using just the bottleneck features because
0:07:29	we need phonetic information so it makes sense to use the bottleneck features
0:07:37	since our bottleneck features were trained with the senones
0:07:42	and another thing is that the adaptive gaussian backend approaches were more complementary
0:07:49	with the non bottleneck i-vectors
0:07:54	for all these systems as we can see here for our data
0:07:59	and here
0:08:00	what i would like to show you is that the bottleneck features clearly work much better
0:08:04	than the non bottleneck features
0:08:07	for
0:08:10	the backends
0:08:14	okay so this is
0:08:15	in general the pipeline of our system
0:08:20	at the end for the submissions we used a fusion of some of these
0:08:26	systems a fusion of like six or five or six of them
0:08:34	with a cluster specific fusion or an overall data fusion and
0:08:41	with the scores we do the log-likelihood ratio conversion also within the
0:08:45	cluster or with a global
0:08:47	log-likelihood ratio and at the end these are the four
0:08:51	systems that we presented
0:08:55	so for our primary system we used a five way cluster based fusion
0:09:02	cluster based log-likelihood conversion
0:09:05	the second one was a two system fusion with cluster based conversion the third
0:09:10	one was using the gaussian backend only with a five way cluster based fusion
0:09:16	and the fourth one was the same as the second one
0:09:20	but with global conversion to log-likelihood ratios
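
The difference between cluster based and global conversion to log-likelihood ratios can be sketched as follows. This uses one common definition, the target log-likelihood minus the log of the average likelihood of the competing languages, with a flat prior over competitors. The talk does not give the exact formula used, so treat the formula, the function names, and the toy scores as assumptions.

```python
import math


def _logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def to_llrs(loglikes):
    """Convert per-language log-likelihoods to log-likelihood ratios."""
    if len(loglikes) < 2:
        raise ValueError("need at least two languages")
    llrs = {}
    for lang, ll in loglikes.items():
        others = [v for other, v in loglikes.items() if other != lang]
        # Log of the mean competitor likelihood (flat prior assumed).
        llrs[lang] = ll - (_logsumexp(others) - math.log(len(others)))
    return llrs


def cluster_llrs(loglikes, clusters):
    """Cluster based conversion: competitors are only the languages of the same cluster."""
    out = {}
    for members in clusters.values():
        out.update(to_llrs({m: loglikes[m] for m in members}))
    return out


# Hypothetical scores for four languages in two clusters.
scores = {"a": 0.0, "b": -1.0, "c": 5.0, "d": 5.0}
clusters = {"c1": ["a", "b"], "c2": ["c", "d"]}
global_llrs = to_llrs(scores)                  # competitors: all other languages
within_llrs = cluster_llrs(scores, clusters)   # competitors: same cluster only
```

In the cluster based variant, a language is only compared with the closely related languages it has to be separated from, which matches how the evaluation is scored.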
0:09:24	okay so some evaluation analysis
0:09:29	here
0:09:30	after
0:09:32	we got the
0:09:33	test data we could see that we have the difference between the
0:09:38	dev
0:09:39	and the test we went from well
0:09:41	three percent to twenty three percent
0:09:45	it is huge
0:09:47	and of course we have the question what happened right
0:09:51	so this is also to compare the dev and the
0:09:56	test
0:09:58	as we can see here this is our primary system
0:10:01	so i think it's relevant to say that there is a three
0:10:06	five percent relative gain over the best single system on the dev
0:10:12	but
0:10:13	on the test
0:10:14	we got an eight percent loss on the evaluation
0:10:19	okay so
0:10:22	for us the question was what is more important okay
0:10:25	between the different
0:10:27	algorithms that we have to develop and the use of a good development set
0:10:38	due to this severe mismatch what is more important the algorithms or the use of
0:10:42	the dev data
0:10:44	and we ran some analyses to try to get some answers to these
0:10:50	questions
0:10:51	using MFCC
0:10:53	plus deltas and double deltas with a DNN and a gaussian backend
0:10:58	classifier
0:11:00	is that sixty nine twenty here
0:11:04	so after
0:11:07	some good discussions with some sites at the evaluation workshop there were several factors
0:11:13	in the development list
0:11:15	so
0:11:16	of course
0:11:17	the chunking didn't help at all
0:11:21	so we are going to do some experiments just removing the chunks of
0:11:27	the audio
0:11:30	also the different split
0:11:34	most of the teams were using sixty forty or sixty percent for
0:11:39	training and forty percent for development
0:11:42	we
0:11:44	would like to thank the MIT guys for providing the list that
0:11:48	we were using
0:11:51	and also using all the data for the final backend training and calibration
0:11:56	was also a key
0:11:58	thing to do
0:12:01	and using a uniform speech duration for the dev segments
0:12:06	and also we ran some augmentation of the data and some other algorithms that we
0:12:11	liked
0:12:13	okay so here are the post evaluation results as we can see we
0:12:20	went from our primary system at twenty three point three
0:12:25	to say fusion system to twenty one point nine within the fusion just that one
0:12:31	and we keep
0:12:35	improving if we modify the training and dev splitting we are using
0:12:40	all the data for training the UBM and the backend systems and
0:12:46	the fusions and also
0:12:49	if we are not chunking we are also improving
0:12:53	the performance so in the end we could have a fifteen percent relative gain
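
Relative gain, as used for the figures quoted here, is the reduction in cost divided by the baseline cost. A minimal sketch, with a hypothetical function name, applied to the two numbers mentioned in the talk (twenty three point three and twenty one point nine):

```python
def relative_gain(baseline, improved):
    """Fractional reduction of a cost (lower is better) relative to the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# First post-evaluation step quoted in the talk: 23.3 down to 21.9.
gain = relative_gain(23.3, 21.9)
gain_percent = 100.0 * gain  # roughly six percent relative
```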
0:13:01	so
0:13:03	that shows that the development data was crucial in this evaluation
0:13:09	also since
0:13:12	others said they were using a different UBM system for each cluster
0:13:17	we wanted to also
0:13:19	use this solution and we also
0:13:22	could see some improvement
0:13:25	thanks to guys from prior for that
0:13:30	so we wanted to study how sensitive the different
0:13:36	blocks in our pipeline are to this mismatch so we get
0:13:42	some data from the test and put it in the development we create a four
0:13:46	fold division of the test data to put some data in the different parts of
0:13:51	our pipeline
0:13:54	so
0:13:55	easily we can say that the backend and the i-vector extractor
0:14:02	significantly impact the mismatch because we can see there is a few
0:14:07	percent of relative gain and a sixty percent relative gain seen in
0:14:13	both
0:14:16	steps respectively
0:14:18	so some messages to take home are that
0:14:23	for us the fusion and chunking the training data did not work for
0:14:30	the classification
0:14:32	and what worked
0:14:34	and also worked for the rest of the groups i guess was the bottleneck features
0:14:39	and the gaussian and neural network backends
0:14:45	and also
0:14:48	having a good development set was
0:14:54	something very important for this
0:14:57	okay something top
0:15:05	we have time for questions
0:15:12	the chunking is taking the segments that we have and splitting each segment into
0:15:20	very short segments
0:15:22	from the second two seconds
0:15:27	for the backend was used for the work
0:15:29	between
0:15:37	and the question
0:15:41	al
0:15:46	just i guess this is a comment more than anything but we did find in fact that
0:15:51	we could be successful with an eighty twenty split and with doing segment durations
0:15:58	for all classifier training
0:16:02	really
0:16:03	figure two no
0:16:04	so we are
0:16:06	is not the ones for this okay good to know
0:16:09	could you share the split the list
0:16:12	yes i think we could we had augmentations in it too so we have
0:16:17	to talk about that part of this
0:16:19	okay
0:16:23	could you put up the slide again where you did the eighty twenty
0:16:27	the eighty twenty and then went down to the sixty forty split
0:16:33	so it was really nice to see that because i think most groups we
0:16:37	saw were using sixty forty and then did a retrain right we didn't have any
0:16:43	spare cycles CPU cycles to do a retraining so we actually started
0:16:47	at sixty which was what hurt us
0:16:50	but i think most folks if they started with the eighty if they didn't do
0:16:54	a retrain probably
0:16:56	did okay
0:16:58	but i think that actually showed a really nice improvement so when you
0:17:03	do all
0:17:05	you did is then all test
0:17:09	that is the you that is the and
0:17:11	okay
0:17:16	any other questions
0:17:23	okay well let's thank the speaker again

Speaker & Language Recognition Systems

Mitchell McLaren, Diego Castán, Luciana Ferrer