Speech Transcript - LID-senone Extraction via Deep Neural Networks for End-to-End Language Identification

0:00:15	i'm going to be representing ustc the university of science and technology of china
0:00:19	the national engineering laboratory for speech and language
0:00:24	information processing
0:00:26	this is a paper by ma jin who is my master's student and some other
0:00:32	collaborators we asked him to
0:00:34	build his own cnn which he did and then we asked him to try using
0:00:38	it for something which he did
0:00:40	so what i'm gonna do is present
0:00:42	what came out when he tried that
0:00:46	we've got four stages an introduction really to how this works for language id the structure
0:00:51	of the
0:00:52	proposed method some experiments and analysis and then
0:00:56	maybe end with a bit of thought on some future work
0:01:00	well the first thing to ask is what is language identification and
0:01:06	it's just the task of taking a piece of speech and extracting language identification information
0:01:12	from that it comes at different levels as we know
0:01:14	and we can say that that's acoustic information or phonetic information and we try
0:01:21	to disassociate that from the characteristics of the speaker as we'll see
0:01:26	in a little while
0:01:27	and was finding the tendency to do
0:01:30	speaker recognition
0:01:32	state-of-the-art as well probably
0:01:36	maybe this will change shortly i don't know but state-of-the-art is really gmm i-vectors
0:01:42	and we're seeing great gains but everybody's you know trying to find what's next
0:01:48	deep learning in particular allows us to take some of the advantages of supervised
0:01:55	training to be able to extract
0:01:58	discriminative information out of the
0:02:01	data that we have especially when we have small amounts of training
0:02:04	data we can use transfer learning methods
0:02:08	to train something which may well be discriminative
0:02:11	on a related task and then transfer it to language id
0:02:16	some of these that we're seeing
0:02:18	recently there's the bottleneck network based i-vector representation from
0:02:24	yan song a collaborator
0:02:26	this was
0:02:27	i think it was last year in interspeech there's also a poster yesterday which if you
0:02:31	missed it the paper should be in the proceedings we see dnn based neural network
0:02:37	approaches here
0:02:40	doing great things that's in transactions on aslp
0:02:44	then there's some approaches which are end-to-end methods and we can look
0:02:50	at some of the
0:02:53	the way the
0:02:55	state-of-the-art has flowed through that
0:02:58	deep neural networks
0:03:00	here
0:03:01	and that was i guess
0:03:05	long short-term memory
0:03:07	rnns here
0:03:11	also in interspeech
0:03:13	so this is really extracting at a frame level
0:03:17	and gathering sufficient statistics over an utterance in order to
0:03:22	pull out
0:03:22	language specific
0:03:25	identifiers then a recent approach
0:03:28	using a convolutional neural network to deal with short
0:03:34	utterances and it's using the power of a cnn
0:03:37	to pull out
0:03:38	the information from these short utterances
0:03:41	and seems to get over some of the problems in terms
0:03:45	of utterance length
0:03:47	we have a different method
0:03:48	we also think that using say mfccs with a large context may be
0:03:56	introducing
0:03:57	too much information that a cnn or dnn then has to remove
0:04:00	so what we have to do is
0:04:03	use some of our precious training data to remove information that probably
0:04:08	shouldn't have been included in the first place if we had a magic wand
0:04:11	in terms of input features
0:04:15	so what we're doing
0:04:16	is slightly different
0:04:20	we're using a convolutional neural network and
0:04:24	when using the cnn to extract frame level information per se what we're actually
0:04:31	doing
0:04:32	in this very
0:04:34	wide long
0:04:36	end-to-end type system is starting off with plp input features
0:04:43	and we're using a bottleneck
0:04:46	dnn just a bottleneck
0:04:48	network
0:04:49	taking the bottleneck features here
0:04:52	adding what could be quite a lot of context to the bottleneck features and
0:04:57	then feeding that into a cnn
0:05:00	and here it's three layers
0:05:03	and finally a fully connected output and what we're getting is a language label
0:05:07	directly at the output from this
0:05:09	so you can see why this is sort of attractive in terms of a system
0:05:14	level implementation but to me it's
0:05:17	kind of counterintuitive
0:05:19	because we tend to use cnns
0:05:21	to extract
0:05:22	front end information i mean in the related tasks that we've been trying
0:05:27	they tend to work well for that
0:05:30	i mean we did try things like stacks of mfccs as input features to
0:05:37	a cnn directly and it doesn't seem to maybe somebody else can do a better job of
0:05:41	that but it doesn't seem to work that well
0:05:43	so what we did was we have a dnn
0:05:46	followed by a cnn and we'll see how that works
0:05:50	and this
0:05:52	sums up what it does it transforms acoustic features to a compact representation
0:05:56	we do that frame-by-frame and add a context of multiple frames feeding the bottleneck features with
0:06:02	context into the cnn
0:06:03	and we come out with something which should be discriminative in terms of language
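the step of adding frame context to the bottleneck features before the cnn can be sketched roughly as below. this is only an illustrative sketch: the function name splice_context, the symmetric window and the repetition of edge frames are assumptions, not details given in the talk. the 50 dimensional bottleneck and the context of 21 frames are figures quoted later in the talk.

```python
def splice_context(frames, context):
    """Stack `context` neighbouring frames around each frame.

    Assumption: the window is symmetric and the first/last frames are
    repeated at the utterance edges (the talk does not specify this).
    """
    if context % 2 != 1:
        raise ValueError("context must be odd so the window is symmetric")
    half = context // 2
    last = len(frames) - 1
    spliced = []
    for t in range(len(frames)):
        window = []
        for k in range(t - half, t + half + 1):
            # Clamp the index so edge frames are repeated rather than dropped.
            window.extend(frames[min(max(k, 0), last)])
        spliced.append(window)
    return spliced


# Toy bottleneck features: 5 frames, 50 values per frame.
bottleneck = [[float(t)] * 50 for t in range(5)]
spliced = splice_context(bottleneck, context=21)
print(len(spliced), len(spliced[0]))
```

each output row holds context times the bottleneck dimension values, so the frame count is unchanged while the per frame input to the cnn grows with the context size.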
0:06:10	okay so this is what we call the lid features
0:06:15	i mean we think that the general acoustic features at the input like i
0:06:19	said they do contain too much information
0:06:21	so we're trying to reduce the amount of
0:06:24	information about an on the
0:06:26	on the train system
0:06:28	follows
0:06:31	and given the limited amount of training data we don't really want to waste that
0:06:36	we know that we can have a deep neural network which is trained on senones
0:06:43	and the output of that will be phonetic information
0:06:47	the beginning of it is acoustic information
0:06:49	somewhere in the middle of that network is a transformation
0:06:53	effectively from the acoustic to the phonetic we take the
0:06:57	bottleneck features which
0:06:59	we hope are a compact
0:07:03	representation of the required information
0:07:06	i'm not sure that's true because there's plenty of approaches that take
0:07:10	information from
0:07:12	both the centre and the end of the dnn and
0:07:15	seem to work well especially with fusion
0:07:18	anyway
0:07:19	what we're doing which is
0:07:20	kind of different is we're using spatial pyramid pooling
0:07:24	at the output of the cnn and
0:07:29	what this allows us to do is it allows us to take the front
0:07:33	end information and to span to the utterance level
0:07:38	which
0:07:38	provides us with an
0:07:41	utterance length invariant
0:07:44	fixed dimension vector at this point
0:07:50	so it's there to deal with arbitrary input so we just take the method
0:07:53	spatial pyramid pooling is from the paper by kaiming he
0:07:56	that's eccv computer vision two thousand and fourteen and it's designed to solve
0:08:01	the problem of making the feature dimension invariant to the input size which is a problem
0:08:07	we face often and is a problem certain
0:08:10	areas of image processing also face
0:08:14	i think what's happened is we've got
0:08:16	a kind of feedback where the speech technology goes into the image processing and
0:08:20	then
0:08:20	comes back to the speech field and then it cycles around
0:08:24	so this is really inspired by a bag of words approach
0:08:26	and it comes through
0:08:31	into the spatial pyramid pooling which uses a power of two
0:08:34	stack of max pooled features
0:08:37	okay so it changes resolution by a power of two
0:08:41	and we can control quite finely how many
0:08:44	features
0:08:45	we
0:08:45	want at the output of that
0:08:47	so it's attractive in that it works well like that
0:08:50	the information on that is actually in the paper
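the pooling idea described above, max pooling over a power of two stack of resolutions so that any utterance length gives the same output size, can be sketched as follows. this is a minimal one dimensional (time axis) sketch and not the authors' implementation: the function name pyramid_max_pool, the levels (1, 2, 4) and the way bin boundaries are rounded are all assumptions.

```python
def pyramid_max_pool(frames, levels=(1, 2, 4)):
    """Max-pool a (time x channels) list of frames over a pyramid of time bins.

    Each level splits the time axis into `bins` roughly equal segments and
    keeps the per-channel maximum of each segment, so the output length is
    channels * sum(levels) regardless of the number of frames.
    """
    if not frames:
        raise ValueError("need at least one frame")
    num_frames = len(frames)
    num_channels = len(frames[0])
    pooled = []
    for bins in levels:
        for i in range(bins):
            # Floor for the start and ceiling for the end, so no bin is empty.
            start = (i * num_frames) // bins
            end = ((i + 1) * num_frames + bins - 1) // bins
            segment = frames[start:end]
            pooled.extend(max(f[c] for f in segment) for c in range(num_channels))
    return pooled


short_utt = [[float(t), float(-t)] for t in range(3)]   # 3 frames, 2 channels
long_utt = [[float(t), float(-t)] for t in range(40)]   # 40 frames, 2 channels
print(len(pyramid_max_pool(short_utt)), len(pyramid_max_pool(long_utt)))
```

both calls return vectors of the same length, which is the property that lets a fully connected layer follow the pooling directly.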
0:08:54	so how do we do that how do we put all this stuff together
0:08:57	well as shown in the diagram on the right here what we're doing is we're
0:09:01	taking a
0:09:02	six layer dnn which is trained with large scale
0:09:07	switchboard
0:09:09	information
0:09:10	and we're taking the half of the network up to the bottleneck layer and feeding
0:09:14	that into a system that is then trained for language id using lid training data
0:09:21	and
0:09:23	now we found that if we take that information and we feed it directly into
0:09:27	a cnn
0:09:28	given the training data that we're using well it will not converge to
0:09:32	anything sensible
0:09:33	if at all
0:09:35	it just doesn't work so what we have to do is we have to build
0:09:37	the network
0:09:38	the cnn layer by layer
0:09:41	so the dnn is already trained that's fixed that's great
0:09:43	and then you start to build the cnn by having the first convolutional layer and
0:09:47	then the second and then the third each one takes a spatial pyramid pooling and
0:09:52	the fully connected layer at the output
0:09:55	to give us the direct language labels
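the layer by layer build described here can be written down as a simple training schedule. this is only a sketch of the bookkeeping, under the assumption (consistent with the answers in the question session, but not spelled out in detail) that earlier layers are frozen and that each stage trains the newest convolutional layer together with a fully connected output layer. the layer names and the function layerwise_schedule are hypothetical.

```python
def layerwise_schedule(num_conv_layers):
    """List, for each training stage, which blocks are frozen and which are trained.

    Assumptions: the bottleneck DNN is always frozen, previously trained
    convolutional layers stay frozen, and the fully connected output layer
    is trained at every stage. Spatial pyramid pooling sits between the
    newest conv layer and the output layer but has no parameters to train.
    """
    stages = []
    for k in range(1, num_conv_layers + 1):
        stages.append({
            "frozen": ["dnn_bottleneck"] + [f"conv{i}" for i in range(1, k)],
            "trained": [f"conv{k}", "fc"],
        })
    return stages


stages = layerwise_schedule(3)
for number, stage in enumerate(stages, start=1):
    print(number, stage)
```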
0:09:59	and that works right and we can see that later when we look at the results
0:10:03	layer by layer
0:10:05	as to
0:10:05	how the
0:10:08	how the accuracy improves with
0:10:10	the number of layers and with the size of the layers
0:10:14	it's quite interesting to see that
0:10:17	the dnn is pretty standard it's
0:10:20	forty eight features fifteen plps plus delta and delta-delta
0:10:24	sorry plus pitch
0:10:25	and with a context size of
0:10:28	twenty one frames
0:10:30	one zero two four
0:10:32	one zero two four fifty one zero two four one zero two four and three
0:10:35	zero two zero senones at the output
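the figures quoted for the dnn can be checked with a little arithmetic. the sketch below assumes that the forty eight features are (15 plp + 1 pitch) times 3 for the static, delta and delta-delta streams, and that the layer sizes are read as 1024, 1024, 50, 1024, 1024 with 3020 senone outputs; both readings are interpretations of the spoken numbers, and the parameter count assumes plain fully connected layers with biases.

```python
# Assumption: 15 PLP coefficients plus 1 pitch value, each with delta and delta-delta.
static_features = 15 + 1
features_per_frame = static_features * 3          # static + delta + delta-delta
context_frames = 21
input_dim = features_per_frame * context_frames   # size of the spliced DNN input

# Assumed reading of the layer sizes given in the talk (50 is the bottleneck).
layer_sizes = [input_dim, 1024, 1024, 50, 1024, 1024, 3020]

# Weights plus biases for each fully connected layer.
num_parameters = sum(
    fan_in * fan_out + fan_out
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:])
)

print(features_per_frame, input_dim, num_parameters)
```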
0:10:38	and we look at the structure of the
0:10:40	cnn in a little while
0:10:42	it is worth mentioning at this point because it's a problem
0:10:46	that we create three separate networks for the tasks of thirty seconds ten seconds and
0:10:52	three seconds data
0:10:55	i mean we would like to combine these we're trying to
0:10:58	but they're separately trained
0:10:59	at the moment
0:11:02	the baseline is bottleneck gmm i-vector and bottleneck dnn i-vector with lda and
0:11:09	wccn
0:11:10	pretty much as we published previously
0:11:14	so we look at how this works
0:11:16	just try to visualise some of these layers
0:11:19	what we have here is we've got the
0:11:23	post
0:11:24	pooling three
0:11:26	fully connected layer
0:11:28	information
0:11:31	note this diagram comes from the paper what we've done is we've taken the
0:11:37	these
0:11:38	the test set
0:11:41	over some utterances and we've compared for different languages just visually
0:11:47	so what we've done is just thirty five randomly selected features from that stack
0:11:53	plotted here for two languages
0:11:57	because right
0:11:58	on the left it's dari
0:12:00	on the right it's farsi
0:12:02	which i'm told are very similar languages
0:12:06	the top and the bottom are different
0:12:09	segments
0:12:10	from utterances
0:12:12	so what we're looking at on the left is intra language difference what we're looking at
0:12:16	between left and right is inter language difference so top and bottom
0:12:21	is intra
0:12:22	left and right is inter
0:12:23	so we should see that there is a large variability between languages small variability within
0:12:29	languages that's what we get
0:12:31	it gives us
0:12:32	visual evidence
0:12:33	to think that
0:12:35	these statistics might well be discriminative for languages
0:12:40	just moving along a bit further
0:12:43	down here what we're getting here is frame level information
0:12:48	and we like to call this lid senones maybe this is not the best terminology
0:12:54	but
0:12:55	just to
0:12:56	to explain how we get to that sort of conclusion
0:13:01	if we look at this information i e bay saying and a right so i
0:13:05	and be noticed the scales on some of the one
0:13:11	one five a low lid senones coming out of the of the system out for
0:13:18	frame level with context for
0:13:22	speech
0:13:25	another piece of speech there
0:13:27	a transition region between
0:13:29	two parts of speech here
0:13:32	and a non-speech region just here
0:13:35	so what we tend to see when we visualise this is we see different
0:13:40	lid senones activating and deactivating
0:13:44	as we go through an utterance or go between utterances
0:13:48	and we believe that there's language discrimination information in them
0:13:54	if you look at the scale
0:13:56	the y-axis scale of these
0:13:58	we can see that when there are non speech regions around here we get
0:14:01	all sorts of things activating but the level
0:14:05	the amplitude of activation is quite low
0:14:08	and it gives
0:14:09	evidence to the fact that probably you have something which is language specific at
0:14:12	least
0:14:18	so we also did something called a
0:14:22	hybrid temporal evaluation so we split thirty seconds ten seconds and three seconds into
0:14:26	separate networks
0:14:28	we train them independently and well we don't do quite the same degree
0:14:32	of augmentation as a hundred but we do try to augment by cutting the thirty
0:14:37	second speech into ten second
0:14:38	and three second regions
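cutting a thirty second utterance into ten second and three second regions can be sketched as below. the frame rate of 100 frames per second and the use of non overlapping cuts are assumptions made for illustration; the talk does not say whether the regions overlap. the name cut_segments is hypothetical.

```python
FRAMES_PER_SECOND = 100  # assumption: 10 ms frame shift


def cut_segments(num_frames, seconds):
    """Return (start, end) frame indices of consecutive non-overlapping segments."""
    segment_frames = seconds * FRAMES_PER_SECOND
    return [
        (start, start + segment_frames)
        for start in range(0, num_frames - segment_frames + 1, segment_frames)
    ]


utterance_frames = 30 * FRAMES_PER_SECOND
ten_second_cuts = cut_segments(utterance_frames, 10)
three_second_cuts = cut_segments(utterance_frames, 3)
print(len(ten_second_cuts), len(three_second_cuts))
```

under these assumptions one thirty second utterance yields three ten second and ten three second training examples, which is how the shorter duration networks end up with more examples.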
0:14:40	so what we're doing is
0:14:41	we're trying to make up for the fact that the three second information is woefully
0:14:45	inadequate in terms of statistics probably
0:14:48	by having a lot more of it
0:14:50	and we'll see how that works
0:14:52	in terms of the
0:14:54	performance of each
0:14:58	unfortunately we only have data here from
0:15:01	nist lre zero nine and for that we only have the six
0:15:05	most confusable languages
0:15:07	it's a subset a much quicker subset to do analysis on and to run experiments on
0:15:13	and if you look at papers over the last few years
0:15:16	we tend to publish with these six languages first
0:15:20	and then extend later
0:15:22	seems worthwhile
0:15:24	it's about a hundred and fifty hours of
0:15:26	training data voice of america radio broadcast and cts telephone speech
0:15:31	and we split it up into the three different
0:15:35	levels we're looking at two baseline systems and our proposed network
0:15:39	normal
0:15:40	the fusion on that later
0:15:42	everybody wants to do fusion
0:15:44	the end
0:15:47	so let's look at the way that this
0:15:52	this structure can be adapted because there's so many different parameters that we could change
0:15:57	in here
0:15:58	the first one we wanted to do was look at the
0:16:00	the size of the context
0:16:02	at the output of the
0:16:04	dnn layers
0:16:06	and we're changing n
0:16:08	if you can make it out just here
0:16:12	lower case n
0:16:13	so what we're doing is we're
0:16:15	keeping the same
0:16:16	bottleneck
0:16:17	network
0:16:18	but we're stacking more of them
0:16:22	and we can see from the results for thirty seconds ten seconds and three seconds
0:16:25	in eer
0:16:27	the bigger the context in general the better the results
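for reference, the equal error rate used to report these results is the operating point where the false acceptance rate equals the false rejection rate. a minimal sketch of how it can be estimated from trial scores is given below; the function name, the simple threshold sweep and the toy scores are illustrative assumptions and not the evaluation code used for the paper.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score is >= the threshold. The EER is taken
    at the threshold where false acceptance and false rejection rates are
    closest, as the average of the two rates.
    """
    best_gap = None
    eer = None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        false_reject = sum(s < threshold for s in target_scores) / len(target_scores)
        false_accept = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(false_accept - false_reject)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            eer = (false_accept + false_reject) / 2
    return eer


# Toy scores: targets are trials where the hypothesised language is correct.
separable = equal_error_rate([0.9, 0.8, 0.7], [0.1, 0.2, 0.3])
overlapping = equal_error_rate([0.9, 0.4], [0.6, 0.1])
print(separable, overlapping)
```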
0:16:31	now bear in mind that we already have some context at the input here
0:16:37	right that's also got context twenty one frames to be precise so we're adding more
0:16:43	context at this end and we're seeing benefit
0:16:48	and it turns out that for the ten seconds and three seconds tasks context of
0:16:52	twenty one
0:16:53	just here
0:16:54	tends to work better
0:16:55	for the
0:16:56	thirty second task an even longer context is much better probably because the data is longer
0:17:02	i think that the
0:17:04	the problem is the three seconds and ten seconds data tends to saturate i mean
0:17:09	we just cannot physically get enough information out of that data
0:17:12	no matter how much context
0:17:14	we introduce
0:17:17	and moving on a little bit further
0:17:19	we can also experiment with
0:17:21	how
0:17:22	deep and how wide the cnn is
0:17:26	and we do that down here with
0:17:31	basically three different experiments one of which is the lid net with
0:17:36	a one zero two four
0:17:40	size convolutional input layer
0:17:44	single-layer
0:17:45	then feeding into the spatial pyramid pooling and the fully connected system
0:17:50	we train the system up we get about nine
0:17:54	nine percent to sixteen percent
0:17:57	performance on the three different scales if we add another layer so we have a
0:18:01	two layer cnn then we pull that down by a reasonable amount for the three seconds
0:18:06	not quite so much for the thirty seconds
0:18:09	and we're looking at one two eight two five six or five one two
0:18:14	size on the second layer
0:18:17	in the cnn
0:18:18	third layer
0:18:21	we checked out sixty four and one two eight and we can see that basically with
0:18:24	increasing complexity the results tend to improve less so for the thirty seconds more for the
0:18:29	others
0:18:31	the hybrid temporal evaluation what we're actually doing here is we're using the
0:18:35	the thirty second network
0:18:38	to evaluate thirty second data
0:18:40	the ten second network to evaluate thirty second and ten second data
0:18:43	and the three second network to evaluate everything
0:18:46	and the performance
0:18:49	unsurprisingly of the three second network is better for the three second data you can
0:18:54	only use that the ten second is better for the ten second data but the thirty second
0:18:59	network on thirty second data
0:19:03	however
0:19:04	it's better using the ten second one for thirty second data so this means that
0:19:08	perhaps these networks themselves are operating at different scales of information so we fuse
0:19:13	them together to get the results at the bottom
0:19:16	and we have a slight improvement there
0:19:20	but you will notice that we can only improve on the baseline system for the
0:19:25	thirty second result
0:19:29	one more thing before we conclude the i-vector system uses zeroth and first order statistics
0:19:36	but this effectively only uses zeroth order statistics
0:19:39	so
0:19:41	pretty much in the future what we'd be looking at is how we can incorporate more
0:19:45	statistics
0:19:47	whether we can build a comprehensive network that uses
0:19:50	all scales and handles all scales simultaneously so that's it that's a weird and wonderful
0:19:56	dnn cnn hybrid thank you
0:20:06	we have time for questions
0:20:13	so
0:20:15	thanks very much that was very interesting and as far as i could see a
0:20:20	score of the network so
0:20:24	as far as understood you
0:20:28	did some incremental training and so once you once you train a part of
0:20:34	the network and then you extend the network the parameters of the first part
0:20:38	they stay fixed you don't step them
0:20:41	yes they stay fixed so we fix that and we build on it and
0:20:44	it
0:20:46	again is what you do when you ask students to try different things and i
0:20:50	probably wouldn't have done this myself but it tends to work
0:20:53	quite well
0:21:00	the most
0:21:01	there's a fixed you mean you network trained the mortgage you just change
0:21:07	most layer
0:21:08	and do you retrain the whole system no we don't retrain the whole system we
0:21:12	fix that up to the backend and we just train the topmost layer
0:21:17	thank you
0:21:21	i think we have another question we've got lots of time so you
0:21:25	spoke a lot about the information flow through the neural network so if you've
0:21:32	read some of geoff hinton's stuff on neural networks he will
0:21:39	he will tell you again and again that
0:21:41	that there is more information
0:21:44	in our case in the speech than in the labels so
0:21:48	he's advocating for the use of generative models rather than discriminative ones as far as
0:21:54	i can see yours is purely discriminative so i'd just like to hear any points
0:22:00	you have on that matter
0:22:02	so actually it's interesting that you bring that up because i was looking at some
0:22:05	of the comments hinton was making recently and
0:22:10	he was talking about using
0:22:12	he was talking about the benefits of having a two-stage process where we have one
0:22:18	front end which is very good at picking out
0:22:20	the most useful data from a large-scale dataset and then a backend which is
0:22:26	very good at using that and that these two tasks are complementary it's seldom we
0:22:32	can use one system that excels in both tasks he believes that both of them can
0:22:36	be trained and we seem to do that but we've done it
0:22:40	the opposite way around to the way i would have imagined
0:22:48	okay thank you very much

Speaker & Language Recognition: Deep learning approaches

Ma Jin, Yan Song, Ian Mcloughlin, Lirong Dai, Zhongfu Ye