Speech Transcript - Application of Convolutional Neural Networks to Language Identification in Noisy Conditions

0:00:15	I'll be presenting the work of Yun Lei at SRI, with a
0:00:19	little from this one,
0:00:22	and this is looking at applying convolutional deep neural networks to language ID in noisy conditions,
0:00:26	in particular the conditions of the RATS program.
0:00:31	So I'll start with a bit of background on why we might want to do this, and
0:00:35	the main
0:00:36	motivation, the DNN/i-vector framework that we recently proposed for speaker ID.
0:00:41	Then, how do we handle the noisy conditions? We started looking at convolutional neural
0:00:46	networks for that purpose.
0:00:47	And then we also present a simple system called the CNN posterior system
0:00:52	for language ID, and show results on that.
0:00:54	Then the experimental setup, and I'll walk through some results.
0:00:58	So for a bit of background on language ID: the UBM/i-vector framework is
0:01:02	pretty widely used in language ID, and phone recognizers are also a good option.
0:01:07	And if you use these two together, that's when you get a really nice
0:01:12	improvement; they're quite complementary. So in our books, one of the challenges has always
0:01:18	been how do we get speech information, the way someone pronounces something, into a single
0:01:23	system that outperforms the individual systems. That's what we call the challenge.
0:01:28	So we want one phonetically aware system that can produce scores that outperform fused
0:01:33	scores.
0:01:34	So we recently solved this for speaker ID,
0:01:37	at least in the telephone case.
0:01:40	So,
0:01:41	just some background on the DNN/i-vector framework.
0:01:44	What we're doing here is combining a deep neural network that's trained for automatic speech
0:01:49	recognition with the popular i-vector model.
0:01:52	The way we use it is to generate our first-order stats and zeroth-order stats;
0:01:56	in particular, we're using the DNN in place of the UBM.
0:02:01	And if you look at the comparison down the bottom here,
0:02:04	the UBM versus the DNN: the UBM is trained in an unsupervised manner.
0:02:08	It's trying to
0:02:10	represent classes with Gaussians,
0:02:12	and it's assumed, generally,
0:02:14	to map to different phonetic classes. However, if someone pronounces the phone "a" one way and someone else
0:02:20	pronounces it
0:02:20	phonetically completely differently,
0:02:22	the UBM is going to model that in different components.
0:02:26	Whereas the DNN,
0:02:27	well, it's trained in a supervised manner.
0:02:30	That means it's trying to map those same phones to what we call a senone,
0:02:34	that is, a
0:02:35	tied triphone state. So two different people pronouncing, in different ways, the phone
0:02:42	"a"
0:02:43	would be activating the same senone,
0:02:46	and that
0:02:47	hopefully should capture different speaker traits.
0:02:52	So how effective is it for speaker ID?
0:02:54	It's very powerful for speaker ID. In the initial publication at ICASSP this
0:02:59	year,
0:03:01	we got thirty percent relative improvement on telephone conditions, particularly C2 and C5
0:03:06	of the NIST SRE'12.
0:03:09	What I'm showing on this
0:03:11	slide here is actually three different systems: the SRI SRE'12 submission, which was
0:03:17	the fusion of six different features plus side information, quite a conglomeration;
0:03:23	and then we show some recent work that, instead of the MFCCs with deltas and
0:03:27	double deltas, uses what we're calling PCA-DCT.
0:03:30	We have a publication on that in
0:03:33	Interspeech, I mean ICASSP, next year.
0:03:36	Just to give you a reference, that gives about twenty percent relative improvement over MFCCs on
0:03:41	all conditions of SRE'12.
0:03:44	But what's really nice is the DNN/i-vector can still bring twenty percent improvement
0:03:48	on these two conditions, C2 and C5.
0:03:51	So it's very powerful. There's still work to be done on microphone trials; there's mismatch
0:03:55	happening. We have made progress on that, and in fact we should be able to publish
0:03:59	on that very soon.
0:04:01	So what I want to conclude here is that we've now got a single system
0:04:05	that beats the SRE'12 submission.
0:04:08	So how
0:04:09	can this be useful for language ID?
0:04:11	That's the question to get to.
0:04:13	So the context here is that the output of the DNN should include language-related information
0:04:19	and ideally be more robust to speaker and noise variations.
0:04:22	The reason I say that is, when you're training the DNN for
0:04:26	ASR, you want to remove speaker variability.
0:04:32	But how suitable is it for channel-degraded language ID?
0:04:36	In the RATS program we actually, and I think it was IBM or
0:04:39	BBN,
0:04:41	saw that the CNN was particularly good for the RATS noisy conditions. We validated that
0:04:46	on our keyword spotting trials; you can see the dramatic difference it makes on the channel-degraded
0:04:50	speech. So we said, let's use the CNN, which should be better
0:04:54	than the DNN.
0:04:56	And that's something that's still open; we got a few review comments on
0:04:59	that, actually, in that we need to validate the difference between them to show the actual improvement
0:05:04	in LID performance.
0:05:06	So we'll do that in future.
0:05:08	So, moving along to the CNN:
0:05:11	this is essentially the same training procedure as the DNN.
0:05:15	You can see we've got acoustic features that go into an HMM-GMM
0:05:18	that provides alignments for the DNN training. Once you've trained the
0:05:21	DNN you no longer need the alignments, so you don't need to generate those for test
0:05:25	files coming in.
0:05:27	And we've got acoustic features, the forty-dimensional log mel filterbank energies, that are used
0:05:31	for training the neural net.
0:05:34	In our work we're stacking fifteen frames together as the input for
0:05:38	the training.
0:05:39	And we use a decision tree to define the senones.
0:05:43	And as I said, we generate training alignments with the pre-trained
0:05:47	HMM-GMM, which we don't need afterwards.
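
The frame stacking described above can be sketched as follows. This is a minimal illustration, not the speakers' implementation: the function name `stack_frames`, the centred window, and the edge padding at the start and end of the utterance are all assumptions; only the forty filterbank dimensions and the fifteen-frame context come from the talk.

```python
import numpy as np

def stack_frames(feats, context=15):
    """Stack `context` consecutive frames around each frame.

    feats: array of shape (T, D), e.g. D = 40 log mel filterbank energies.
    Returns an array of shape (T, context * D); row t holds frames
    t - context//2 ... t + context//2, oldest first.
    Edge frames are repeated as padding (an assumption for this sketch).
    """
    if context % 2 == 0:
        raise ValueError("context must be odd so the window can be centred")
    half = context // 2
    padded = np.pad(feats, ((half, half), (0, 0)), mode="edge")
    n_frames = feats.shape[0]
    # Slice i holds, for every t, the frame at offset (i - half) from t.
    windows = [padded[i:i + n_frames] for i in range(context)]
    return np.concatenate(windows, axis=1)
```

With forty filterbank energies and a fifteen-frame context, each network input has 600 values.
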
0:05:51	Just as an illustration:
0:05:53	the CNN
0:05:54	basically, in front of the DNN, appends this layer, all this process here, where
0:06:00	you've got your
0:06:03	filterbank energies with the fifteen-frame context.
0:06:05	You
0:06:07	pass it through the convolutional filter;
0:06:09	I think we're using a
0:06:11	size of i which is in but i
0:06:14	And then what we're doing is the max pooling operation.
0:06:17	That means that each of the three,
0:06:19	for each three blocks that come out, we take the maximum one,
0:06:22	and that just helps with the noise.
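
As a rough sketch of the convolution and max pooling step just described, assuming the filter slides along the frequency axis and pooling is over non-overlapping groups of three outputs: the filter height of 8 bands in the usage note, the single filter, and the function names are illustrative assumptions (the filter size is not recoverable from the transcript).

```python
import numpy as np

def conv_along_frequency(window, filt):
    """Slide one filter along the frequency axis of a stacked input.

    window: (n_bands, n_frames), e.g. (40, 15) filterbank energies.
    filt:   (filt_bands, n_frames), a single convolutional filter.
    Returns (n_bands - filt_bands + 1,) filter responses (no padding).
    """
    n_bands = window.shape[0]
    k = filt.shape[0]
    return np.array([np.sum(window[b:b + k] * filt)
                     for b in range(n_bands - k + 1)])

def max_pool(responses, pool=3):
    """Keep the maximum of each non-overlapping block of `pool` responses.

    Trailing responses that do not fill a whole block are dropped.
    """
    n = (len(responses) // pool) * pool
    return responses[:n].reshape(-1, pool).max(axis=1)
```

For a 40-band input and a hypothetical 8-band filter, the convolution yields 33 responses and pooling by three leaves 11. Taking the maximum over neighbouring bands makes the output less sensitive to small shifts in where energy falls along the frequency axis.
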
0:06:27	How does the CNN/i-vector system go with this?
0:06:30	You can see that we simply plug in the CNN instead of the UBM;
0:06:34	quite straightforward.
0:06:35	What's interesting here is that we've got two different acoustic features. The first is used for
0:06:40	the CNN to get the posteriors for each of the senones,
0:06:43	and then we're multiplying those posteriors with the
0:06:46	acoustic features for language ID, that is, to discriminate languages;
0:06:51	that is the second set of features. So one aspect of this that
0:06:55	might be thought of as a negative is that you've got to extract two features. If you choose to, you
0:06:58	can use the same features.
0:07:00	But if you want to use multiple features in a fusion system,
0:07:04	then you need to extract posteriors using just that one set of features.
0:07:07	This is in contrast to when you've got multiple features for fusing with UBM systems:
0:07:11	you've got to extract, for instance, five different sets of posteriors, one for each feature, if you
0:07:16	had a five-way fusion.
0:07:18	Another aspect is you're able to optimize the language ID features independently
0:07:22	of those providing the posteriors,
0:07:26	whereas currently with UBM systems it's a bit of a balancing act: you want
0:07:29	stable posteriors, but you also want good language discriminability on the other side of
0:07:35	it,
0:07:37	in the statistics.
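
The two-feature arrangement above amounts to the usual zeroth- and first-order statistics, with the posteriors coming from one feature stream and the accumulated features from another. A minimal sketch, assuming the senone posteriors have already been computed per frame; the function and argument names are hypothetical.

```python
import numpy as np

def sufficient_stats(posteriors, lid_feats):
    """Zeroth- and first-order statistics for i-vector extraction.

    posteriors: (T, K) per-frame senone posteriors, computed from the
                first feature stream (the one fed to the CNN).
    lid_feats:  (T, D) second feature stream, chosen for language ID.
    Returns N of shape (K,) and F of shape (K, D).
    """
    N = posteriors.sum(axis=0)       # soft frame count per senone
    F = posteriors.T @ lid_feats     # posterior-weighted feature sums
    return N, F
```

Because the posteriors are computed once, swapping in a different `lid_feats` stream for fusion only requires recomputing `F`, which is the saving over extracting a new set of posteriors per feature.
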
0:07:40	So can we go even simpler? An alternative system here is a simple system
0:07:44	which takes the CNN, and we get the frame posteriors.
0:07:48	We forget about first-order statistics.
0:07:51	What we're doing here is normalizing the zeroth-order statistics in the log domain,
0:07:55	and then we just use a simple backend; for instance, here we're using
0:07:58	a neural network, or you can use a Gaussian backend. As I said, one thing that distinguishes it from a
0:08:02	phonotactic system:
0:08:04	you can use standard language ID backends, which is nice.
0:08:08	So here we're using counts of the tied context-dependent triphone states,
0:08:14	and that's at the state level instead of phone labels.
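
One way to read the posterior system's front end is sketched below. The talk does not spell out the normalization, so this assumes the zeroth-order statistics are scaled to sum to one and then passed through a floored logarithm; the function name, the floor value and the optional `silence_idx` argument are assumptions for illustration.

```python
import numpy as np

def log_normalized_counts(posteriors, silence_idx=(), floor=1e-10):
    """Utterance-level vector from frame-level senone posteriors.

    posteriors:  (T, K) per-frame senone posteriors from the CNN.
    silence_idx: senone indexes to drop before normalizing (optional).
    Returns a vector of log relative senone counts.
    """
    counts = posteriors.sum(axis=0)                     # zeroth-order stats
    keep = np.setdiff1d(np.arange(counts.shape[0]),
                        np.asarray(silence_idx, dtype=int))
    counts = counts[keep]
    probs = counts / counts.sum()                       # normalize to sum to 1
    return np.log(np.maximum(probs, floor))             # floor avoids log(0)
```

The result is a fixed-length vector per utterance, so it can be fed to the same kind of backend as an i-vector.
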
0:08:18	Let's look at the experimental setup.
0:08:21	The DARPA RATS program: I'm sure many of you have heard how noisy these samples are;
0:08:26	I think John was talking about them earlier on. There are five target languages,
0:08:29	tend online
0:08:31	You can see those on the screen.
0:08:33	There are a few channel degradations: seven different channels, with SNRs between zero and thirty.
0:08:39	The transcriptions that we used to train the CNN came from the keyword spotting task
0:08:43	and have only two languages. That's an unusual aspect: we're trying to distinguish five
0:08:47	languages, but we're training on only two.
0:08:51	Test durations are three, ten, thirty seconds and one-twenty seconds, and the metric we use
0:08:55	here is the average equal error rate across the target languages.
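
The metric can be sketched as below: a detection equal error rate per target language, averaged over the target languages. This is an illustrative approximation (the error rate at the threshold where miss and false-alarm rates are closest), not the official RATS scoring tool; the function names and the score-matrix layout are assumptions.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: the point where miss and false-alarm rates meet."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    miss = np.array([(target_scores < t).mean() for t in thresholds])
    fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(miss - fa))
    return 0.5 * (miss[i] + fa[i])

def average_eer(scores, labels, target_languages):
    """Mean of the per-language detection EERs.

    scores: (n_trials, n_classes) detection scores, one column per class.
    labels: (n_trials,) index of the true class of each trial.
    target_languages: column indexes of the target languages.
    """
    eers = []
    for lang in target_languages:
        is_target = labels == lang
        eers.append(equal_error_rate(scores[is_target, lang],
                                     scores[~is_target, lang]))
    return float(np.mean(eers))
```
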
0:08:59	In terms of the models: the one used to generate the training alignments for the
0:09:03	CNN,
0:09:04	the HMM-GMM setup: here we were producing around three thousand senones
0:09:10	with two hundred thousand Gaussians,
0:09:13	and this was multilingual training on both Farsi and Levantine Arabic.
0:09:17	The CNN model was also trained the same way, with the multilingual training set.
0:09:22	We've got hidden layers with twelve hundred
0:09:25	nodes each, and we've got forty filterbanks with fifteen frames.
0:09:29	You can see the pooling size and the filter size as well
0:09:33	for the CNN convolutional stuff.
0:09:37	For the UBM model, for comparison,
0:09:39	we're training a two thousand forty-eight component UBM,
0:09:42	and the features were directly optimized on the SID task, the speaker ID task, for RATS; they
0:09:47	tended to work well carried over to language ID. They're actually forty-dimensional 2D
0:09:51	DCT log mel spectral features, and this is similar to the zigzag DCT
0:09:58	work that we proposed at ICASSP this year. The PCA-DCT I showed for
0:10:02	speaker ID earlier, that's an extension that further improves on that.
0:10:08	What about the i-vectors and the backend? So the CNN and UBM
0:10:13	i-vectors were all trained on the same data for the i-vector subspace,
0:10:17	and they're four hundred dimensional.
0:10:18	For the posterior system, we're collecting the three thousand average posteriors, removing the silence indexes,
0:10:25	three of those, and reducing to four hundred dimensions, the same as the i-vector subspace,
0:10:31	using probabilistic PCA.
0:10:33	For the backend, we're training a simple neural network, an MLP,
0:10:36	with cross-entropy.
0:10:38	What we do with the data to enlarge our training dataset is to chunk
0:10:42	the data into thirty-second chunks and eight-second chunks with fifty percent overlap; I
0:10:46	think we end up with around two million
0:10:49	i-vectors to train on.
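
The chunking used to enlarge the backend training set can be sketched as below. The function name is hypothetical, lengths are in frames, and the 100 frames per second rate in the usage note is an assumption; the talk does not say how a partial final chunk was handled, so this sketch simply drops it.

```python
def chunk_bounds(n_frames, chunk_len, overlap=0.5):
    """Start/end frame indexes of fixed-length, overlapping chunks.

    n_frames:  number of frames in the recording.
    chunk_len: chunk length in frames.
    overlap:   fraction of each chunk shared with the next one.
    A final partial chunk is dropped (an assumption for this sketch).
    """
    step = max(1, int(round(chunk_len * (1.0 - overlap))))
    return [(start, start + chunk_len)
            for start in range(0, n_frames - chunk_len + 1, step)]
```

At an assumed 100 frames per second, `chunk_bounds(6000, 3000)` cuts a one-minute recording into three thirty-second chunks, and `chunk_bounds(6000, 800)` gives the eight-second chunks; each chunk then yields its own i-vector for backend training.
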
0:10:51	The output is the five target languages and the one out-of-set class as well.
0:10:55	So how was performance?
0:10:58	Well, first of all, the UBM/i-vector approach: the UBM isn't being trained in a
0:11:02	supervised way. So what we said was, well, let's take the senones from the
0:11:06	CNN, where we know we've got three thousand,
0:11:09	and let's align the frames for each of those senones and train each of
0:11:13	the UBM components with that.
0:11:14	So the idea here was to try to give a fair comparison between the UBM and CNN
0:11:18	systems.
0:11:20	we don't nice improvement across all of those the
0:11:24	The CNN approaches:
0:11:25	it's nice to see that for ten seconds or more we're
0:11:28	getting thirty percent or more relative improvement over the UBM approach. For the three
0:11:34	second
0:11:35	timeframe testing,
0:11:36	twenty percent relative improvement.
0:11:39	What was interesting between the posterior system and the i-vector system for the CNN:
0:11:43	LID performance is actually quite similar.
0:11:48	But if we fuse those two,
0:11:49	we get a nice gain, again twenty percent relative improvement,
0:11:52	when we just combine the two different CNN approaches,
0:11:56	particularly for less than one-twenty seconds is where we see it; for one-twenty it wasn't
0:12:00	pretty.
0:12:02	When we added in the UBM/i-vector system to that, so a different modeling approach, we
0:12:06	actually get no
0:12:08	gain from the fusion, except in the one-twenty case.
0:12:10	That's also another interesting problem.
0:12:14	In conclusion, we looked at the robustness of the CNN in the noisy conditions,
0:12:18	in particular taking the DNN/i-vector framework and making it effective on the RATS language
0:12:23	ID task.
0:12:25	We also proposed, in that sense, a new kind of phonotactic system, the CNN posterior
0:12:29	system,
0:12:30	which is quite a simple system, and showed high complementarity between these two proposed systems.
0:12:36	In terms of extensions, where do we go from here? We can improve performance
0:12:39	a little by not doing probabilistic PCA before the backend classification,
0:12:45	and fusion of different language-dependent CNNs,
0:12:49	the scores from them, also provides a gain.
0:12:54	some the bottleneck features which i think a local might be talking about so
0:12:58	are also good alternatives
0:13:00	to the direct usage of the DNN or CNN for language ID.
0:13:04	Thank you.
0:13:12	We have time for some questions.
0:13:21	like what
0:13:26	thanks for my start we're expecting request from right
0:13:30	possibly
0:13:31	So for the CNN posterior system, you said you used the PCA to four
0:13:36	hundred prior to the neural net, right? Yep. Did you also try to put in the
0:13:41	full vector?
0:13:42	I imagine it is.
0:13:44	Yes, the extensions: the first one on the extensions says if we don't do that,
0:13:48	then we do get a slight improvement.
0:13:50	I think the motivation for reducing it to four hundred dimensions was for comparability
0:13:54	with the i-vector space: to see what we can get in those four hundred dimensions.
0:14:16	but
0:14:18	So my question is to do with the data that was used to train the ASR
0:14:22	and the CNN: was it the multiple channels of the Arabic and
0:14:27	the farsi data abilities are so you trained in channel
0:14:31	no channel condition less channel
0:14:32	i believe that the ubm
0:14:35	the ubm yes so we use the channel degraded data for the ubm the like
0:14:39	outlook same data
0:14:41	Did you use the Arabic and the Farsi data to train the UBM, like the alignment? I believe the
0:14:46	UBM was exposed to all five languages across all channel conditions.
0:14:50	But that was one thing: you said you took the alignment of the senones and
0:14:53	then trained the states of the UBM.
0:14:56	The second one there, I guess that's what it was used for, that
0:15:01	supervised UBM: to get the alignment with the senones we're running through the
0:15:05	CNN, and that was trained with keyword spotting data. But the UBM itself, I
0:15:09	believe, and I'd have to check this with the colleagues, was trained with the data
0:15:13	which is across five languages.
0:15:15	So how much do you think that's an impact, of having the
0:15:19	changed datasets and
0:15:21	classifiers? For that first question, I would have to say you would
0:15:26	think that having five languages in the UBM,
0:15:30	plus the other out-of-set languages, would give it an advantage to some degree.
0:15:34	But as you said, the datasets are changing.
0:15:37	I think so, that'd be a good point.
0:15:45	So Mitch, if you're going to do a very wide set of languages, do you
0:15:50	have a hope of having sort of a single master universal one, like the Hungarian TRAPs
0:15:55	has been so successful in the past, or do you think you're going to have to
0:15:57	build many different language DNNs?
0:16:00	So what we're seeing so far is basically,
0:16:04	the more language-dependent CNNs that you put together, that you fuse together,
0:16:08	the improvement reduces. So perhaps if you had five that cover a good space
0:16:15	of the phones across different languages, that might be what you might call a universal collection. That's
0:16:20	appealing.
0:16:27	right

Neural Nets for Speaker and Language Modeling

Yun Lei, Luciana Ferrer, Aaron Lawson, Mitchell McLaren and Nicolas Scheffer