0:00:15 My name is Pavel Matejka, and I will be talking about neural network bottleneck features for language identification. I did this work during my postdoc at BBN.
0:00:29 First I will talk about the DARPA RATS program, which is what I tested on, so it is a noisy condition. Then I will talk about the neural network bottleneck features, and then about their application to language identification.
0:00:48 So, the DARPA RATS program. I think it has already been introduced, so I would just like to give you a taste of the RATS data; unfortunately there are not enough rats for all of you to taste, so instead I will play some audio samples.

[audio samples played]

0:01:24 So you get some impression of how noisy it is.
0:01:29 So, the bottleneck features. The term bottleneck refers to a neural network topology where one hidden layer has a significantly lower dimension than the surrounding layers. In my case I used 80 dimensions for the bottleneck and 1500 for the surrounding layers. What it actually does is a kind of compression, and this compressed information can then be used in other ways than just inside the neural network.
0:01:59 The idea comes from speech recognition, where these features are usually used either alone or in conjunction with the baseline features, which would be, for example, MFCCs. What I actually used is called the stacked bottleneck, where I have two neural networks trained in a row, both with bottlenecks. The second neural network takes its input from the bottleneck of the first one, stacked in time: five frames with a five-frame shift. This was shown by the BUT guys to be very good for speech recognition, so I used the same setup; they had already tried different numbers of frames, different shifts and so on, so we did not have to tune this.
0:02:52 Here is the topology of the bottleneck network. For the first net I used frequency-domain linear prediction coefficients plus fundamental frequency as input; actually, if we use a block of log mel-filterbank outputs instead, it gives about the same results. Then there are hidden layers of 1500, 1500, 80 in the bottleneck, and 1500, followed by the targets. The targets for me were states of context-dependent clustered quinphones. Usually, like the BUT guys, you would use triphones; I used quinphones because BBN is using quinphones.
0:03:34 The second net has about the same topology, just the input is different: I have five frames of the bottleneck stacked in time, so it is five times 80, which is 400 dimensions; otherwise it is qualitatively the same.
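To make the data flow concrete, here is a minimal numpy sketch of the stacked-bottleneck cascade just described. The 1500-unit hidden layers, the 80-dimensional linear bottleneck, and the stacking of five bottleneck frames with a five-frame shift follow the talk; the input dimensionality, the number of targets, and the weights themselves are random placeholders, so this only illustrates shapes and data flow, not a trained system.

import numpy as np

rng = np.random.default_rng(0)

def make_net(dims):
    # Random placeholder weights for a feed-forward net with the given layer sizes
    # (no training is done here; this only illustrates shapes and data flow).
    return [rng.standard_normal((d_in, d_out)) * 0.01
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def forward_to_bottleneck(net, x, bn_index):
    # Propagate frame-level input x (frames x dim) up to the bottleneck layer.
    # Hidden layers use sigmoids; the bottleneck itself is kept linear, as in the talk.
    h = x
    for i, w in enumerate(net[:bn_index + 1]):
        h = h @ w
        if i < bn_index:
            h = 1.0 / (1.0 + np.exp(-h))
    return h

n_targets = 3000   # assumed number of context-dependent state targets (not from the talk)
n_input = 360      # assumed dimensionality of the stacked input features (not from the talk)

# First net: input -> 1500 -> 1500 -> 80 (linear bottleneck) -> 1500 -> targets
net1 = make_net([n_input, 1500, 1500, 80, 1500, n_targets])
# Second net: 5 bottleneck frames taken with a 5-frame shift -> 400-dim input
net2 = make_net([5 * 80, 1500, 1500, 80, 1500, n_targets])

def stacked_bottleneck(features):
    # features: frames x n_input block; returns the 80-dim stacked-bottleneck outputs.
    bn1 = forward_to_bottleneck(net1, features, bn_index=2)           # frames x 80
    n = bn1.shape[0]
    offsets = np.array([-10, -5, 0, 5, 10])                           # 5 frames, 5-frame shift
    idx = np.clip(np.arange(n)[:, None] + offsets, 0, n - 1)
    stacked = bn1[idx].reshape(n, -1)                                 # frames x 400
    return forward_to_bottleneck(net2, stacked, bn_index=2)           # frames x 80

print(stacked_bottleneck(rng.standard_normal((100, n_input))).shape)  # (100, 80)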
0:03:53 For RATS we have two languages that were transcribed, Farsi and Levantine Arabic. You can see the number of hours the nets were trained on and the number of targets used for each system.
0:04:10 So let us move to language recognition. The task is language identification as it was already introduced: five target languages plus an out-of-set class, different durations, and, as you heard, it is quite noisy, so I will just skip this slide.
0:04:28 Here is my baseline system description. I use PLPs, nine PLP coefficients, with short-time Gaussianization; usually you do not see a benefit from this for language ID, but for these noisy conditions it actually helps. We take a block of eleven frames, stack them together, and project them to sixty dimensions with HLDA.
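A small numpy sketch of that stacking-and-projection step follows; the feature dimensionality is illustrative, and PCA is used only as a stand-in for the HLDA projection, since a real HLDA estimate would also need per-frame class labels.

import numpy as np

def stack_frames(feats, context=5):
    # Stack +/- `context` neighbouring frames: frames x d -> frames x d * (2 * context + 1).
    n = feats.shape[0]
    offsets = np.arange(-context, context + 1)
    idx = np.clip(np.arange(n)[:, None] + offsets, 0, n - 1)
    return feats[idx].reshape(n, -1)

def project(feats, out_dim=60):
    # Reduce to `out_dim` dimensions. PCA is only a stand-in here: estimating a real
    # HLDA (or LDA) transform would additionally need per-frame class alignments.
    centered = feats - feats.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:out_dim].T

rng = np.random.default_rng(0)
plp = rng.standard_normal((200, 9))               # 200 frames of stand-in "PLP" features
reduced = project(stack_frames(plp, context=5))   # 11 stacked frames -> 60 dimensions
print(reduced.shape)                              # (200, 60)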
0:04:54 As you will see in the next slide, I tried different coefficients to compare. I used a UBM with 1024 Gaussians, the i-vector was 400-dimensional, and the final classifier was a neural network; we found it was the best for this kind of task, but you have to do some tricks, which are described in the paper.
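To make the i-vector step concrete, here is a minimal numpy sketch of MAP i-vector extraction from a diagonal-covariance UBM. The UBM and the total-variability matrix are random placeholders and the sizes are scaled down from the 1024-Gaussian, 400-dimensional setup mentioned in the talk, so it only illustrates the computation, not the trained system or the neural-network classifier on top.

import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; the talk used 1024 Gaussians and 400-dimensional i-vectors.
C, D, R = 8, 60, 20          # Gaussians, feature dimension, i-vector dimension

# Placeholder UBM parameters and total-variability matrix (random, untrained).
weights = np.full(C, 1.0 / C)
means = rng.standard_normal((C, D))
variances = np.ones((C, D))                   # diagonal covariances
tv = rng.standard_normal((C, D, R)) * 0.1     # per-Gaussian slices of the T matrix

def extract_ivector(x):
    # MAP point estimate of the i-vector for one utterance x (frames x D).
    # Frame posteriors under the diagonal-covariance UBM.
    logp = (-0.5 * (((x[:, None, :] - means) ** 2) / variances).sum(-1)
            - 0.5 * np.log(2 * np.pi * variances).sum(-1)
            + np.log(weights))
    logp -= logp.max(axis=1, keepdims=True)
    gamma = np.exp(logp)
    gamma /= gamma.sum(axis=1, keepdims=True)              # frames x C

    # Baum-Welch statistics.
    n_c = gamma.sum(axis=0)                                # zero-order stats
    f_c = gamma.T @ x - n_c[:, None] * means               # centred first-order stats

    # Posterior precision and mean of the latent factor.
    precision = np.eye(R)
    rhs = np.zeros(R)
    for c in range(C):
        t_scaled = tv[c] / variances[c][:, None]           # Sigma_c^{-1} T_c
        precision += n_c[c] * (tv[c].T @ t_scaled)
        rhs += t_scaled.T @ f_c[c]
    return np.linalg.solve(precision, rhs)

print(extract_ivector(rng.standard_normal((300, D))).shape)   # (20,)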
0:05:27 Here is the slide with the baseline results for four different feature extractions. We focused on the three-second and ten-second conditions, because the 120-second condition was so good that it did not make sense to look at it, and the thirty-second one was also good after the fusion, so we mainly focused on these two conditions.
0:05:50 As you can see, the MHEC coefficients from UT Dallas were the best for the ten-second condition and PLPs were the best for the three-second condition. The rest were the MFCC features which we had been using for the NIST evaluations, and the features which were the best for us in the 2013 RATS evaluation. So these are the baseline features, the conventional acoustic features.
0:06:21 Before I present the results with the bottleneck features, let me talk about the prior work. The prior approaches mainly used context-independent phonemes, which makes quite a lot of difference, as we will see later. In 2013, for the RATS evaluation, Jeff Ma from BBN used context-independent phonemes, trained on Levantine Arabic, with dimensionality 39. He took the log of these posteriors and simply stacked it onto the block of PLPs from the baseline, and then all of this was projected back to sixty dimensions with HLDA. He got pretty good results; it is essentially a feature-level fusion.
0:07:11 Another idea, from Mireia Diez, is the so-called phone log-likelihood ratio (PLLR) features. She takes the phone posteriors, takes the log, and computes the likelihood ratio between them. Usually she appends deltas, sometimes uses PCA to reduce the dimensionality, and then fuses it with the PLPs. Before Christmas she was at BUT and she was working on RATS as well, so we could compare these features.
0:07:43 These features were indeed better than the baseline features, and they were also better than the phonotactic system: they built a conventional phonotactic system as well, and the PLLR features were much better, so the conventional phonotactic system did not make it into the fusion, while these features did.
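A minimal sketch of the PLLR computation described above, assuming the common formulation in which each dimension is the log-ratio of one phone posterior against the average of the competing posteriors; the posteriors are random stand-ins and the delta computation is a crude approximation.

import numpy as np

def pllr(posteriors, eps=1e-10, add_deltas=True):
    # Phone log-likelihood ratio features from frame-level phone posteriors (frames x P):
    # each dimension is log of p_i against the average of the competing posteriors.
    p = np.clip(posteriors, eps, 1.0 - eps)
    n_phones = p.shape[1]
    llr = np.log(p * (n_phones - 1) / (1.0 - p))
    if add_deltas:
        llr = np.hstack([llr, np.gradient(llr, axis=0)])   # crude delta approximation
    return llr

# Toy usage: random stand-in posteriors, 100 frames over 40 phones.
rng = np.random.default_rng(0)
raw = rng.random((100, 40))
post = raw / raw.sum(axis=1, keepdims=True)
print(pllr(post).shape)          # (100, 80)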
0:08:07 During the review process, one of the reviewers told us that there was a very similar work submitted to IEEE Electronics Letters at the end of 2013. It was by Song, and it was applied to clean data, the NIST 2009 data. Then, in a presentation in 2014, I guess (it is actually not in the paper, only in the presentation), Ignacio Lopez-Moreno from Google presented bottleneck-style features, but his neural network, his DNN, is actually trained to produce the posterior probabilities of the target languages, not of phonemes. So it might open a new field of data-driven features.
0:09:03 So let us go to the results. Here are again the four baseline feature sets. Then, if I take the log posteriors that come out of the neural network, just one frame this time, and simply build the i-vector system on them, you can see that it is already better than any of the baselines. Then what I did was take a block of eleven frames of these posteriors, stack them together, and project them with HLDA to sixty dimensions, and you can see that this is quite a bit better than just one frame, which means the context is very important. And then this is what Jeff Ma did: the baseline features plus one frame of the log posteriors, projected with HLDA to sixty dimensions. You can see that this is also good, but it is already something like a fusion of two systems.
0:10:10 So how do the bottleneck features perform? Again, this is just one frame; I tried more frames as well, but it did not help for me. So, one frame of bottleneck features, with dimensionality 80. This row is when I take the bottleneck from the first neural network, and this one is the stacked bottleneck, the bottleneck from the second neural network. You can see that both of these systems are quite a bit better than any of the baselines, and it actually makes sense to use the stacked bottleneck architecture, because you gain something from it. Why am I taking just one frame here? It might be that for the stacked bottleneck features, where the stacking is already done between the nets, the context is already inside.
0:11:05 Then I have some analysis slides. The first thing was obviously to try to tune the bottleneck size. For speech recognition they usually use 80, so I took 80 as the baseline and then tried to vary the bottleneck size, but as you see, 80 was the best. If you move away from it, for example to 60, it stays about the same or starts to degrade, so I stuck with 80, because it was the baseline for me.
0:11:40 The other thing I was interested in was what the targets for the neural network should be. We trained it with context-dependent phonemes, but how does it work with context-independent ones? It is much easier to train the system with context-independent phonemes than with context-dependent ones, because we do not need to build the LVCSR system, the training of the neural network is much faster, and so on. But if you look at the results, it is clearly better to use the context-dependent phones. I think it is because of the finer modeling of the phonetic space in this feature space.
0:12:27 Then, as I said at the beginning, we have two languages that we have transcriptions for, Farsi and Levantine Arabic. So I produced two sets of features, one trained on Farsi and one on Levantine Arabic, and you can see that they perform about the same. Actually, as you will see on the final slide, for the evaluation we needed to choose just one, and I chose the Levantine one, because it is just slightly better. You would hardly see a difference on the test side, but in training, Farsi has a much higher number of targets, many more context-dependent phones, so the training took more time; so for training convenience the Levantine one won.
0:13:26 Then, in 2013, what Jeff did for the RATS evaluation was a kind of fusion of several systems, language-dependent systems, as explained in this picture. What is the language dependency? Usually we have just one UBM and one i-vector extractor, and they are trained on the same data, which is usually all the data we have. What we did instead was to train the GMM on a single language, let us say just on Dari, or Farsi, or Pashto, or Urdu, while the i-vector extractor and everything after it was trained on all of the data. At the end we took just a simple average of the scores; we did not want to train a fusion here, because it means more parameters, and because the final fusion was then trained on top of these systems anyway.
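A minimal sketch of that score-level combination: each language-dependent subsystem produces its own score matrix and the result is just their unweighted average, with no trained fusion. The numbers of subsystems, utterances and languages below are illustrative.

import numpy as np

def average_scores(score_matrices):
    # score_matrices: list of (utterances x languages) arrays, one per
    # language-dependent subsystem; the combination is a plain unweighted average.
    return np.mean(np.stack(score_matrices, axis=0), axis=0)

# Toy usage: six language-dependent subsystems, 4 utterances, 5 target languages.
rng = np.random.default_rng(0)
subsystem_scores = [rng.standard_normal((4, 5)) for _ in range(6)]
print(average_scores(subsystem_scores).shape)    # (4, 5)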
0:14:27 Personally I do not like that structure very much, because the complexity of the system grows quite a lot, but I think it takes advantage of the different alignments of the different UBMs. So how do the results look? The first line is the baseline, where we train the UBM on all languages, on all the data.
0:14:53 The next six lines are the separate systems, where we train the UBM only on one particular language. If you look at the results, none of these beats the baseline, which is kind of expected. But when you take the average of these scores and score with that, you can see that there is a very nice benefit from doing this: something like 15 percent for the three-second condition and 25 percent for the ten-second condition.
0:15:21 We also did the same on the channel level: in RATS there are eight (it should be nine) channels plus the source channel, so I built the same kind of separate systems per channel. They perform about the same, and when I then take the average over all of them, it is also about the same, so some separation helps up to a certain point. It would be good to try the DNN-based alignment we heard about, which might give a similarly different alignment, and look at this there as well.
0:16:01 So let us look at the final slide, with the fusion. The first line is the PLP system. Then I have a fusion of three systems: the stacked bottleneck systems trained on Farsi and on Levantine Arabic, plus the feature-level fusion with the acoustic system, and you can see that there is about a 30 percent improvement. Then I wanted to compare what happens if we make all the systems language dependent. We saw something like a 25 or 30 percent improvement from the language-dependent systems, and here we can see that the fusion still gains the same amount as when we do not use the language dependency, which is very nice. So there is about 30 percent from the fusion over the single best system, whether you do the language dependency or not.
0:17:00 Then one of the reviewers, maybe it was a reviewer from SRI, suggested doing something with the direct language posteriors as well. Also, after the RATS evaluation I exchanged some emails about this and was encouraged to try it, so what we added is the blue stream in the picture. It was quite easy for me to try: I did not take the bottleneck here, I just used the entire network, took the posteriors at its output, fed them directly to another MLP to produce the scores, and then I could fuse them.
0:17:43 You can see that, for me, this posterior system was worse than the stacked bottleneck with i-vectors. But yesterday I compared the results with Mitch, and their CNN posterior system is actually a little bit better than my system here. We talked about it a little; it might be because the CNN behaves much better than the DNN in noisy conditions, which we would need to try. In any case, the fusion of these two approaches is very nice.
0:18:21 To conclude: the bottleneck features provide a very nice gain. They compete very nicely with the conventional phonotactic system that we built before; actually, they are much better. As I said, for the RATS evaluation this year we also had phonotactic systems, and none of them made it into the final fusion. There are much bigger gains for longer audio files. And, as I mentioned, what Ignacio Lopez-Moreno showed, training the net directly on the task with the languages as targets instead of training it for a bottleneck, might open a new space for data-driven feature extraction. Thank you.
0:19:26 Thank you, Pavel. Do we have any questions?
0:19:31 How did you train the neural networks, and how long did it take?
0:19:35 For this task I used the BBN training tools, which is stochastic gradient descent on GPUs, and each net was trained for about three days; I have two nets, so it took about a week to train both.
0:19:55 What activation function was it?
0:20:01 I do not remember exactly which activation function is in the hidden layers, but I know that for the bottleneck it is a linear one; there was a linear activation function in the bottleneck layer, which was also shown to give better results for speech recognition. The exact configuration is in the paper; I am sorry if I deleted it from the slides.
0:20:25 The same question Mitch asked: can you tell us what was used to train your ASR and your DNN? The same data? All the channels?
0:20:34 Yes, all channels. And the DNN for the bottleneck features was trained on the keyword-spotting data, so it is different data from what the UBM and the rest of the system were trained on.
0:20:48 OK, so you also used different datasets there. So the question is: what is your sense of the sensitivity of this? For the DNNs it seems like you start with a good ASR system, label your data with it, and then train the DNN. Maybe people in other places have experience with this: how sensitive do people think it is to starting off with a very good alignment when you train the DNN? Do you have a sense of that, from this work or otherwise?
0:21:22 That is hard to say; I do not know exactly how good the LVCSR system really needs to be. What I like is what Ignacio Lopez-Moreno is doing: you do not actually need the ASR system at all, you can use the language ID data directly and train the neural network on the language posteriors, so you use the same data as for the rest of the system.
0:21:52 I played with that a little bit as well, with this kind of bottleneck, because if you do what he was doing at the JHU workshop, you train the net, the DNN, to produce the languages directly as targets, and then, since you have the posterior probability of each language for every frame, you need to do some averaging over time. What he did was simply average them, which is good for three seconds but is not good for ten seconds. So what I did was take exactly these posteriors, or rather the output of the layer before, as features, and build i-vectors on them, and then it helps; and for that I would have a much smaller i-vector system. So that might be one way to avoid building the LVCSR.
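A minimal sketch of the frame-averaging step discussed here, assuming a network that outputs per-frame language posteriors; the posteriors are random stand-ins, and the point is only that the utterance-level decision comes from a plain average over frames, which is the part that works for short segments but less well for longer ones.

import numpy as np

def utterance_language_scores(frame_log_posteriors):
    # Average frame-level language log-posteriors into one utterance-level score vector.
    return frame_log_posteriors.mean(axis=0)

# Toy usage: 300 frames, 5 target languages, random stand-in posteriors.
rng = np.random.default_rng(0)
raw = rng.random((300, 5))
post = raw / raw.sum(axis=1, keepdims=True)
scores = utterance_language_scores(np.log(post))
print(scores.argmax())           # index of the language picked for this utterance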
0:22:55 Just to follow on, in response to Doug's question about the keyword-spotting data, which was transmitted at a different time from the language ID data: one thing I observed in the speaker ID data was that a retransmission at a different time changes things, because the atmosphere and the transmission effects change, so the channel is varying over time. So in one regard this keyword-spotting data kind of has different channels from the language ID data, even though it is theoretically the same equipment doing the transmission; there is a different effect coming through. So it is nice to see that it still works despite that difference.
0:23:32 A similar question: for instance, on the clean SRE data we are seeing a difference, a problem, when trying to classify microphone trials when most of the data you trained your network on is telephone speech. One of your last statements was that the bottleneck features are great even in noisy conditions, but of course you had very matched data here. Do you have any theories about how the bottleneck features might behave in mismatched conditions? I ask because our system appears sensitive to it, and I wonder if the bottlenecks might be a little more resistant just because of the compression factor.
0:24:12 I think it will depend on the training data for the DNN. On another task we had only clean data for training the DNN, so we asked ourselves: what do we do if the test data is noisy? We took thirty percent of the training data and artificially added noise to it, and that helped a lot; the DNN then saw something of the noisy conditions.
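A minimal sketch of that kind of noise augmentation, assuming raw waveforms stored as numpy arrays; the 30 percent fraction follows the talk, while the noise source and the 10 dB SNR are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def add_noise(signal, noise, snr_db):
    # Mix `noise` into `signal` at the requested signal-to-noise ratio (1-D float arrays).
    noise = np.resize(noise, signal.shape)
    sig_pow = np.mean(signal ** 2) + 1e-12
    noise_pow = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10.0)))
    return signal + scale * noise

def augment_corpus(utterances, noise, fraction=0.3, snr_db=10.0):
    # Corrupt a random `fraction` of the training utterances and keep the rest clean.
    n_noisy = int(round(fraction * len(utterances)))
    noisy_idx = set(rng.choice(len(utterances), size=n_noisy, replace=False).tolist())
    return [add_noise(u, noise, snr_db) if i in noisy_idx else u
            for i, u in enumerate(utterances)]

# Toy usage: white-noise arrays as stand-ins for real utterances and for the noise source.
corpus = [rng.standard_normal(16000) for _ in range(10)]
noise = rng.standard_normal(16000)
print(len(augment_corpus(corpus, noise)))    # 10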
0:24:46 A somewhat related question: if you have to handle very many languages, could you imagine having one universal recognizer system for the DNN, or do you think you would have to build many?
0:25:01 I think people would need to build at least a few DNNs. I think, Mitch, you said that you tried the Farsi one and the Levantine Arabic one and then a universal one, right? You might comment on whether it was better than the two separate ones or than the fusion of those two.
0:25:23 We had someone in our lab construct a multilingual dictionary between these two languages, and that was the best of the three systems that we tried. But we also found that the fusion of all three was best; in fact, our primary system was the fusion of those three CNN systems, together with three CNN i-vector systems for those languages, all with one language ID feature set.
0:25:47 If you remember the distinction, the DNN is tied to a certain language while the language ID feature set stays the same, so we just maintained one language ID feature set while the CNN changed, and that was a very good fusion. In terms of the SRE or LRE data, we found that having the multiple languages helps: if you get good scores across the different phone sets, that is when it starts to converge.
0:26:16 So I think that you would need a few systems, not many, maybe three or four, and it would be better than having one universal one.

0:26:30 Okay.