Speech Transcript - LOCALIZATION OF NON-LINGUISTIC EVENTS IN SPONTANEOUS SPEECH BY NON-NEGATIVE MATRIX FACTORIZATION AND LONG SHORT-TERM MEMORY

0:00:24	right
0:00:25	right
0:00:26	short term memory
0:00:28	no
0:00:29	oh
0:00:29	i
0:00:30	i
0:00:31	i explain each extraction by nonnegative made
0:00:34	i and
0:00:35	and think that's case but um so a memory
0:00:38	and and and and i like
0:00:41	one thing speech
0:00:46	yeah so
0:00:48	as probably most of you know
0:00:51	the localisation of non-linguistic events in spontaneous speech can have multiple applications one of them is of course to gain
0:00:58	some paralinguistic information
0:01:00	for example if you recognise laughter
0:01:03	or sighs or other
0:01:06	yeah vocalisations that have a
0:01:08	semantic meaning
0:01:09	but
0:01:10	another application would uh be to increase the word accuracy of an asr system
0:01:16	so for example if you know where there are lexical items in the speech and where there are no
0:01:21	lexical items
0:01:22	you can perform decoding only on the lexical items and
0:01:25	maybe this can increase your word accuracy
0:01:28	so the crucial question here is whether to do this inside or outside the asr framework
0:01:35	because if you think
0:01:36	uh of doing it inside the asr framework you can just add some more models to your recognizer
0:01:42	and for example for laughter or vocal noise
0:01:46	and include them in the language model and yeah do the standard acoustic modeling of them
0:01:52	but another approach would be to do this outside the asr framework in a
0:01:56	different classifier and this is actually the
0:01:58	approach that i pursue here
0:02:01	so i do a frame-wise context sensitive classification
0:02:05	of the speech into yeah
0:02:07	lexical items and non-linguistic segments
0:02:11	and i do it in a purely data-based way here
0:02:15	that means i just uh train on different non-linguistic segments
0:02:19	and speech and try to discriminate them
0:02:25	so
0:02:27	why i'm confident that this should work is because we already did some work on static classification of speech and
0:02:34	non-linguistic vocalisations
0:02:36	using nmf features
0:02:38	um and an svm classifier
0:02:42	and we could show that uh nmf features
0:02:45	together with svm
0:02:47	outperform standard mfcc classification here
0:02:51	but of course uh static classification means that you already have a presegmentation
0:02:57	into speech and non-linguistic segments
0:02:59	so this is not a realistic application
0:03:02	which is why in this study we now include the segmentation part
0:03:06	and
0:03:08	as classifier we used a long short-term memory recurrent neural network
0:03:11	which has been widely and successfully used for phoneme recognition in speech
0:03:16	also in spontaneous speech
0:03:21	so i i don't know how many of you are familiar with non-negative matrix factorization
0:03:26	there is just
0:03:53	the W matrix
0:03:56	has the basis spectra
0:03:58	i think that
0:03:59	anyway
0:04:00	and the H matrix
0:04:01	gives you the time activations of those spectra
0:04:05	i
0:04:06	and here is the place for a little advertisement
0:04:10	yeah because we have an open-source toolkit for nmf
0:04:13	oh the that which will also present on
0:04:16	first the in the evening the poster session
0:04:18	so all of our experiments can
0:04:20	be reproduced very easily
0:04:26	so the nmf algorithm that we apply
0:04:29	is just the multiplicative update i think it's pretty standard
0:04:33	so it's uh iterative minimisation of a cost function
0:04:37	between the original
0:04:40	spectrogram V and the product of W and H
0:04:44	and in our previous study we could show that the euclidean distance
0:04:48	is not a good measure to minimize here
0:04:51	um so we
0:04:53	on the one hand we evaluate the kullback-leibler divergence
0:04:56	and on the other hand uh yeah
0:04:59	let's say a new cost function that has been proposed
0:05:02	especially for music processing
0:05:04	which is the itakura-saito divergence
0:05:07	and the main difference of those is
0:05:09	that the itakura-saito divergence is scale-invariant
0:05:13	so um low energy components
0:05:16	are weighted uh the same way as high energy components basically
0:05:20	in the calculation of the error
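A minimal sketch of the multiplicative update mentioned above, assuming the standard Lee and Seung update rules for the Kullback-Leibler cost between a spectrogram V and the product WH. The function names (nmf_kl, kl_divergence, is_divergence), the random initialisation and the iteration count are illustrative assumptions and are not taken from the speaker's toolkit. The last function only illustrates the scale invariance of the Itakura-Saito divergence.

```python
import numpy as np

def nmf_kl(V, n_components, n_iter=200, eps=1e-9, seed=0):
    """Factorise a non-negative matrix V (bins x frames) as W @ H
    using multiplicative updates for the KL divergence."""
    rng = np.random.default_rng(seed)
    n_bins, n_frames = V.shape
    W = rng.random((n_bins, n_components)) + eps   # basis spectra (columns)
    H = rng.random((n_components, n_frames)) + eps  # time activations (rows)
    for _ in range(n_iter):
        WH = W @ H + eps
        # H <- H * (W^T (V / WH)) / (W^T 1)
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        # W <- W * ((V / WH) H^T) / (1 H^T)
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H

def kl_divergence(V, WH, eps=1e-9):
    """Generalised Kullback-Leibler divergence D(V || WH)."""
    return float(np.sum(V * np.log((V + eps) / (WH + eps)) - V + WH))

def is_divergence(V, WH):
    """Itakura-Saito divergence; inputs must be strictly positive."""
    R = V / WH
    return float(np.sum(R - np.log(R) - 1.0))
```

Because the Itakura-Saito divergence depends only on the ratio V / WH, multiplying both arguments by the same factor leaves it unchanged, which is the scale invariance referred to in the talk.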
0:05:26	so now we move on to the feature extraction by nmf
0:05:30	and the idea is to follow a supervised nmf approach
0:05:34	which means that the W matrix
0:05:37	is predefined
0:05:38	um so
0:05:40	yeah that is actually an
0:05:41	approach that is also pursued a lot in
0:05:44	source separation
0:05:46	so if you have different sources like speech and noise you pre-initialize the W matrix and can reconstruct
0:05:52	the sources afterwards
0:05:55	and what we did here is
0:05:56	we predefined the
0:05:58	the W matrix with
0:06:00	yeah spectra from different classes
0:06:03	which are on the one hand
0:06:05	normal speech let's say
0:06:07	so
0:06:08	words
0:06:09	and then laughter
0:06:11	and other vocal noise
0:06:13	and
0:06:14	other noise
0:06:15	which is mostly environmental noise or microphone noise as well
0:06:20	yeah
0:06:21	so
0:06:22	in an ideal world
0:06:24	if you do this decomposition
0:06:27	we can just look
0:06:41	yeah
0:06:45	and then the
0:06:46	the activation matrix would exactly give us the temporal location
0:06:51	of those segments but of course this
0:06:53	is not possible or this does not work like that
0:06:56	because of the large spectral overlap between the different spectra
0:07:00	from the different classes
0:07:04	so
0:07:08	the real case
0:07:11	is
0:07:20	so our approach is uh just to normalize each column of the H matrix
0:07:26	to get something like a likelihood
0:07:28	that uh a given spectrum was active at a given time frame
0:07:33	and because those likelihood features do not contain energy information as opposed to the normal H matrix
0:07:40	we also add the energy
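A minimal sketch of the feature extraction described above, assuming a predefined matrix W of class spectra that stays fixed while only H is updated with the Kullback-Leibler multiplicative rule. The column normalisation follows the talk; the exact energy definition is not given there, so the log frame energy used here is an assumption, as are the function name and all parameter values.

```python
import numpy as np

def supervised_nmf_features(V, W, n_iter=100, eps=1e-9, seed=0):
    """V: spectrogram (bins x frames); W: fixed basis spectra (bins x components).
    Returns a (components + 1) x frames feature matrix."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    col_sums = W.sum(axis=0)[:, None] + eps
    for _ in range(n_iter):
        # Only H is updated; W keeps the predefined class spectra.
        H *= (W.T @ (V / (W @ H + eps))) / col_sums
    # Normalise each column of H so that it sums to one ("likelihood" features).
    likelihoods = H / (H.sum(axis=0, keepdims=True) + eps)
    # Assumed energy feature: log of the summed squared magnitudes per frame.
    log_energy = np.log(np.sum(V ** 2, axis=0, keepdims=True) + eps)
    return np.vstack([likelihoods, log_energy])
```

The normalised activations discard the overall level of each frame, which is why a separate energy row is appended.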
0:07:45	okay so now i come to the classification with long short-term memory
0:07:50	so my colleague step us as um after what's presenting
0:07:54	another talk on long short-term memory which is why i
0:07:56	explain the theory a little more in detail
0:07:59	so the
0:08:01	yeah the drawback of a conventional recurrent network is uh
0:08:04	that the context range is quite limited
0:08:07	because the weight of a single input
0:08:11	on the output calculation decreases exponentially over time
0:08:15	and this is also known as the vanishing gradient problem
0:08:19	so the solution
0:08:21	or one solution for this is to use long short-term memory cells
0:08:25	instead of the standard cells of the neural network
0:08:29	which have an internal state that is maintained by a
0:08:32	well a connection with a recurrent weight which is
0:08:35	constant at
0:08:36	uh one point zero
0:08:39	so this means that the network can actually store information over yeah an arbitrarily long time
0:08:46	and of course
0:08:47	to also access that information and to update it and maybe to delete it you need some other units
0:08:53	that control the state of this cell
0:08:56	and these are known as the gate units for input output and memory
0:09:01	and yeah the great advantage of this architecture is that it automatically learns the required amount of context
0:09:08	so all those weights for those gate units that control input output and memory are learned during training by
0:09:14	resilient propagation for example
0:09:18	so you don't have to specify the required amount of context as you would have to do for
0:09:23	example when you
0:09:24	just do feature frame stacking
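A minimal sketch of one forward step of a long short-term memory cell, assuming the common formulation with input, forget (memory) and output gates. The parameter names and the init_params helper are hypothetical, and only the forward pass is shown; training, for example by resilient propagation as mentioned in the talk, is not covered.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(n_in, n_hidden, seed=0):
    """Small random weights for the three gates and the cell input."""
    rng = np.random.default_rng(seed)
    params = {}
    for name in ("i", "f", "o", "g"):
        params["W_" + name] = rng.normal(0.0, 0.1, (n_hidden, n_in + n_hidden))
        params["b_" + name] = np.zeros(n_hidden)
    return params

def lstm_step(x, h_prev, c_prev, params):
    """One time step: returns the new output h and internal state c."""
    z = np.concatenate([x, h_prev])
    i = sigmoid(params["W_i"] @ z + params["b_i"])  # input gate
    f = sigmoid(params["W_f"] @ z + params["b_f"])  # forget (memory) gate
    o = sigmoid(params["W_o"] @ z + params["b_o"])  # output gate
    g = np.tanh(params["W_g"] @ z + params["b_g"])  # candidate cell input
    # The state feeds back onto itself with a fixed weight of 1.0,
    # scaled only by the forget gate, so it can persist over many frames.
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c
```

When the forget gate is open and the input gate is closed, the internal state is carried over unchanged from frame to frame, which is how the cell can hold information over a long context.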
0:09:28	so of course you can ask does this give us any advantage
0:09:31	over just a
0:09:33	normal recurrent network
0:09:35	which is why we investigated several architectures in this study
0:09:40	so
0:09:43	so to just speak in
0:09:45	it's here
0:09:52	oh
0:09:54	so
0:09:54	bidirectional actually means that the network processes the input forward and backward
0:10:00	and
0:10:00	yeah to this end it has
0:10:02	two input layers
0:10:03	and also two hidden layers
0:10:06	and
0:10:08	yeah the dimensionality of the input layer is just the number of input features
0:10:13	which is a three for in or nmf configuration
0:10:16	or thirty nine if you just use normal plp features plus the deltas
0:10:21	yeah and the size of hidden layer was evaluated at at and one hundred twenty
0:10:25	and the output layer
0:10:27	just gives the posterior probabilities of the four different classes that i want to discriminate
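A minimal sketch of bidirectional processing with a softmax output over four classes. For brevity it uses plain tanh recurrent cells rather than long short-term memory cells, so it is a simplification and not the network from the talk; the function names, weight shapes and hidden layer size are assumptions.

```python
import numpy as np

def rnn_pass(X, W_in, W_rec, reverse=False):
    """Run a simple tanh recurrent layer over X (frames x features).
    With reverse=True the sequence is processed from the last frame backwards."""
    T = X.shape[0]
    h = np.zeros(W_rec.shape[0])
    out = np.zeros((T, W_rec.shape[0]))
    order = range(T - 1, -1, -1) if reverse else range(T)
    for t in order:
        h = np.tanh(W_in @ X[t] + W_rec @ h)
        out[t] = h
    return out

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)  # for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def bidirectional_posteriors(X, fwd, bwd, W_out):
    """fwd and bwd are (W_in, W_rec) pairs for the two hidden layers.
    Returns frame-wise class posteriors (frames x classes)."""
    H = np.hstack([rnn_pass(X, *fwd), rnn_pass(X, *bwd, reverse=True)])
    return softmax(H @ W_out.T)
```

Each output frame sees the forward layer's summary of the past and the backward layer's summary of the future, so a change in a later frame can alter the posteriors of an earlier one.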
0:10:38	so our evaluation was done on the buckeye corpus of spontaneous speech
0:10:42	i don't know how many of you know it
0:10:44	so it's uh we took only the subject turns
0:10:47	so there remained about twenty five hours of spontaneous speech
0:10:52	it's spontaneous in the sense that it's interview speech so there is one interviewer
0:10:57	and a test subject and
0:10:59	they follow a free conversation
0:11:01	without any specific protocol
0:11:04	there are forty speakers twenty male and twenty female
0:11:07	and we subdivided the corpus in a speaker independent manner
0:11:11	which means we divided it into a training validation and test set
0:11:15	all stratified by age and gender
0:11:18	so the percentages were around eighty percent for training ten percent for validation and ten percent for test
0:11:26	and yeah to make it more reproducible we did this subdivision in ascending order of speaker id
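A minimal sketch of a speaker-independent split in ascending order of speaker id, stratified by age and gender. How the strata were actually formed is not stated in the talk, so the metadata layout (a mapping from speaker id to an age group and gender pair), the function name and the per-stratum counts are all assumptions.

```python
from collections import defaultdict

def split_speakers(speakers, n_val=1, n_test=1):
    """speakers: dict mapping speaker id -> (age_group, gender).
    Within each stratum, speakers are sorted by id; the first ones go to
    training, the next n_val to validation and the last n_test to test."""
    strata = defaultdict(list)
    for speaker_id, key in speakers.items():
        strata[key].append(speaker_id)
    split = {"train": [], "val": [], "test": []}
    for key in sorted(strata):
        ids = sorted(strata[key])
        n_train = len(ids) - n_val - n_test
        split["train"] += ids[:n_train]
        split["val"] += ids[n_train:n_train + n_val]
        split["test"] += ids[n_train + n_val:]
    return split
```

With forty speakers in four equally sized strata, one validation and one test speaker per stratum gives the roughly eighty, ten and ten percent proportions mentioned above, and no speaker appears in more than one set.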
0:11:33	and the corpus also comes with an automatic alignment
0:11:36	of
0:11:37	speech and phonemes
0:11:39	and laughter vocal noise and other noise
0:11:41	and this automatic alignment was used on the training data
0:11:45	to train the nmf
0:11:49	as well as to train the neural network
0:11:55	here is just a short summary
0:11:57	of the different sizes of the sets subdivided by classes
0:12:02	so as you would expect the speech class is predominant
0:12:07	and
0:12:08	yeah especially the laughter and the other noise class are quite
0:12:13	sparse
0:12:14	especially in the test set
0:12:18	so
0:12:20	yeah the evaluation that that we did um
0:12:23	was
0:12:24	yeah motivated by the question whether it is better to model the non-linguistic vocalisations inside the asr system
0:12:31	or outside the asr system
0:12:34	which is why we set up and
0:12:35	yeah produced an asr system on the buckeye corpus
0:12:40	as a reference
0:12:42	so i'm going
0:12:43	yeah quite fast here because it's all pretty standard
0:12:47	plp coefficients plus deltas and uh
0:12:50	bigram language model trained on the buckeye training set
0:12:53	we also experimented with other language models but it didn't increase the word accuracy
0:12:58	in addition to the thirty nine monophones
0:13:01	we had three models for non-linguistic vocalisations laughter vocal noise and other noise
0:13:06	with uh twice as many states as the phoneme models
0:13:10	and we estimated state-clustered triphones with
0:13:13	sixteen to thirty two mixtures
0:13:15	and yeah as you can see the word accuracy of the system is quite low
0:13:20	which is quite common actually for spontaneous speech
0:13:28	on this slide i have the comparison of the discriminability of the different classes by different types of
0:13:35	rnns
0:13:36	and the general trend that you can see is that the
0:13:39	normal uh rnn has the lowest frame-wise F1 measure
0:13:44	which is the primary evaluation metric here
0:13:47	and UA and WA stand for unweighted average of the four classes and weighted average
0:13:53	weighted is just uh weighted by the prior class probability
0:13:59	um what you also can see
0:14:01	is that the lstm concept doesn't
0:14:04	give that much gain over the normal rnn
0:14:07	but uh
0:14:09	the bidirectional blstm
0:14:11	delivers a gain for
0:14:13	almost all the classes over will be
0:14:15	i'll the network
0:14:17	the only class that this is not the case is the other noise class
0:14:21	but this might also be due just to sparsity
0:14:24	as i've indicated before
0:14:31	so i according to be a less M size and features
0:14:35	we can actually conclude that the nmf
0:14:37	features computed with the kullback-leibler divergence
0:14:41	outperform the plp features
0:14:43	and also the nmf features
0:14:46	generated by the itakura-saito divergence
0:14:50	and yeah the improvements are
0:14:52	especially
0:14:54	visible for the
0:14:56	vocal noise and the
0:14:58	the other noise class
0:15:00	but uh as i said there is
0:15:02	yeah not that much data on that
0:15:05	but um in sum we can see that the unweighted average
0:15:08	increases by about uh two percent absolute
0:15:12	from the plp features to the kl-based nmf features
0:15:17	and yeah the weighted
0:15:19	average is of course dominated by the performance on speech
0:15:23	which has
0:15:24	also increased
0:15:29	so now to come to a conclusion whether it is better to model the
0:15:34	vocalisations inside asr or outside asr
0:15:38	we can see that
0:15:40	actually except for the laughter class it is always better
0:15:44	in terms of frame-wise F1 measure
0:15:46	to model it with the blstm approach
0:15:50	instead of direct modeling in the asr
0:15:54	so
0:15:55	there are yeah
0:15:57	quite a few differences
0:15:58	according to recall and precision
0:16:00	because we are actually not talking about detection here
0:16:03	but about classification so of course
0:16:06	we could also
0:16:08	uh yeah
0:16:10	reduce it to a binary detection task and
0:16:12	calculate auc measures but we have not done this here
0:16:16	um
0:16:19	yeah so over all the
0:16:21	you weighted average accuracy or weighted average recall
0:16:25	this increased from
0:16:27	you ninety one but
0:16:28	three five
0:16:29	a nine point one percent
0:16:32	and this improvement is also significant
0:16:35	uh like and on said previously
0:16:37	it's not a
0:16:38	real significance test here but
0:16:40	just a heuristic measure and the p-value is actually smaller than
0:16:44	ten to the minus three here
0:16:49	so concluding we can say that the
0:16:51	blstm approach
0:16:53	delivers uh
0:16:54	quite a high
0:16:55	reduction of the frame-wise error rate
0:16:58	by thirty
0:16:59	seven point five percent relative
0:17:01	and best results have been obtained with the
0:17:04	kl divergence as nmf cost function
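A minimal sketch of how a relative reduction of the frame-wise error rate follows from two accuracies. The function name is an assumption, and the values in the usage note are made up for illustration; they are not the figures reported in the talk.

```python
def relative_error_reduction(acc_baseline, acc_new):
    """Relative reduction of the error rate (1 - accuracy),
    with both accuracies given as fractions between 0 and 1."""
    err_baseline = 1.0 - acc_baseline
    err_new = 1.0 - acc_new
    if err_baseline <= 0.0:
        raise ValueError("baseline error rate must be positive")
    return (err_baseline - err_new) / err_baseline
```

For instance, going from a hypothetical accuracy of 0.80 to 0.90 halves the error rate from 0.20 to 0.10, a relative reduction of 50 percent, although the absolute gain is only ten points.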
0:17:07	and of course future work will deal with
0:17:10	yeah how to integrate this uh blstm classifier actually in the asr system
0:17:16	and for this we have a quite promising approach
0:17:18	our multi-stream hmm asr system which is
0:17:21	currently also using a blstm phoneme prediction
0:17:25	and it could also use the
0:17:27	prediction of whether there is speech or non-linguistic vocalisations
0:17:31	and other improvements uh relate to the nmf algorithm and there we could
0:17:36	include context-sensitive features
0:17:39	like features computed by non-negative matrix deconvolution
0:17:43	or also use sparsity constraints
0:17:45	in the supervised nmf
0:17:47	to improve the discrimination
0:17:50	so this concludes the talk from my part
0:17:52	and i'm looking forward to your questions
0:17:55	thank you very much felix
0:17:56	i
0:18:00	somehow we moved from slightly behind schedule to being ahead of schedule so there's quite some
0:18:05	time for questions
0:18:11	and
0:18:17	have you done some experiments uh
0:18:20	whether the um asr modelling outperforms this one or whether the uh they both together
0:18:27	uh outperform each single one
0:18:37	channel
0:18:43	so how about how about the result on mixed speech
0:18:47	i mean speech that is mixed up with snap
0:18:53	yeah yeah yeah
0:19:15	i have a question
0:19:17	can
0:19:30	i
0:19:33	a short reply
0:19:34	so
0:19:36	thing
0:19:37	let's thank felix again

Audio/Visual Detection of Non-Linguistic Vocal Outbursts

Presented by: Felix Weninger, Author(s): Felix Weninger, Björn Schuller, Martin Wöllmer, Gerhard Rigoll, Technische Universität München, Germany