Right, so this talk is about long short-term memory and feature extraction by non-negative matrix factorisation for non-linguistic events, like laughter, in speech.

As probably most of you know, the localisation of non-linguistic events in spontaneous speech can have multiple applications. One of them is, of course, to gain some paralinguistic information, for example if you recognise laughter or sighs or other vocalisations that have a semantic meaning. Another application would be to increase the word accuracy of an ASR system: for example, if you know where there are lexical items in the speech and where there are no lexical items, you can perform decoding only on the lexical items, and maybe this can increase your word accuracy.

So the crucial question here is whether to do this inside or outside the ASR framework. If you think of doing it inside the ASR framework, you can just add some more models to the recogniser, for example for laughter or other vocal noise, include them in the language model, and do standard acoustic modelling for them. Another approach would be to do this outside the ASR framework, with a different classifier, and this is actually the approach that I will discuss here: a frame-wise, context-sensitive classification of the speech into lexical speech and non-linguistic segments. I do it in a purely data-based way, which means I just train on different non-linguistic segments and on speech and try to discriminate them.

Why I am confident that this should work is because we already did some work on static classification of speech and non-linguistic vocalisations, using NMF features and an SVM classifier, and we could show that NMF features together with the SVM outperform MFCC classification there. But of course, static classification means that you already have a presegmentation into speech and non-linguistic segments, so this is not a realistic application, which is why in this study we now include the segmentation part. As the classifier we used a long short-term memory recurrent neural network, which has been widely and successfully used for phoneme recognition, also in spontaneous speech.

I don't know how many of you are familiar with non-negative matrix factorisation: the W matrix contains the spectra as a basis, and the H matrix gives you the time activations of those spectra. And here is the place for a little advertisement, because we have an open-source toolkit for NMF, which we will also present on Thursday in the evening poster session, so all of our experiments can be reproduced very easily.

The NMF algorithm that we apply is just the multiplicative update, which I think is pretty standard: it is an iterative minimisation of a cost function between the original spectrogram V and the product of W and H. In our previous study we could show that the Euclidean distance is not a good measure to minimise here, so on the one hand we evaluate the Kullback-Leibler divergence, and on the other hand a newer cost function that has been proposed especially for music processing, which is the Itakura-Saito divergence. The main difference between those is that the Itakura-Saito divergence is scale invariant, so low-energy components are weighted basically the same way as high-energy components in the calculation of the error.
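As a rough illustration of the multiplicative-update NMF just described (a minimal sketch, not the actual toolkit code; all variable and function names are hypothetical), the standard updates that minimise the generalised Kullback-Leibler divergence between the spectrogram V and the product W H look like this:

```python
import numpy as np

def nmf_kl(V, rank, n_iter=200, eps=1e-12, seed=0):
    """Multiplicative-update NMF minimising the generalised KL divergence
    D(V || W H). V is a non-negative (bins x frames) magnitude spectrogram,
    W holds the spectral basis vectors, H the corresponding time activations."""
    rng = np.random.default_rng(seed)
    n_bins, n_frames = V.shape
    W = rng.random((n_bins, rank)) + eps
    H = rng.random((rank, n_frames)) + eps
    ones = np.ones_like(V)
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ ones + eps)   # activation update
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (ones @ H.T + eps)   # basis update
    return W, H
```

The Euclidean and Itakura-Saito variants mentioned above differ only in the update rules implied by their respective cost functions; the structure of the iteration stays the same.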
So now we move on to the feature extraction by NMF. The idea is to follow a supervised NMF approach, which means that the W matrix is fixed. This is similar to the approach usually pursued in NMF source separation: if you have different sources, like speech and noise, you pre-initialise the W matrix and then reconstruct the sources afterwards. What we did here is to predefine the W matrix with spectra from the different classes, which are, on the one hand, normal speech, that is, words, and then laughter, other vocal noise, and other noise, which is mostly environmental noise or microphone noise.

In an ideal world, if you do this decomposition, you could just look at the activation matrix H, and it would give you exactly the temporal location of those segments. But of course this does not work like that, because of the large spectral overlap between the spectra from the different classes. So in the realistic case, our approach is just to normalise each column of the H matrix to get something like a likelihood that a given spectrum was active at a given time frame, and because those likelihood features do not contain energy information, as opposed to the normal H matrix, we also add the energy.

OK, so now I come to the classification with long short-term memory. My colleague will be presenting another talk on long short-term memory afterwards, which is why I will not explain it in too much detail here. The drawback of a conventional recurrent network is that the context range is quite limited, because the weight of a single input on the output calculation decreases exponentially over time; this is known as the vanishing gradient problem. One solution is to use long short-term memory cells instead of the standard cells in the neural network. These have an internal state that is maintained by a recurrent connection with a constant weight of 1.0, which means that the network can actually store information over an arbitrarily long time. Of course, to also access that information, to update it, and maybe to delete it, you need some other units that control the state of this cell, and these are the gate units for input, output and memory. The great advantage of this architecture is that it automatically learns the required amount of context: all the weights of those gate units are learned during training, by resilient propagation for example, so you do not have to specify the required amount of context yourself, as you would have to do, for example, with simple feature frame stacking.

Of course you can ask whether this gives us any advantage over a normal recurrent network, which is why we investigated several architectures in this study. Just to sketch it briefly: bidirectional means that the network processes the input forwards and backwards, so to this end it has two input layers and also two hidden layers. The dimensionality of the input layer is just the number of input features, which is thirty-four in our NMF configuration, or thirty-nine if you just use normal PLP features plus deltas. The size of the hidden layer was evaluated at eighty and one hundred and twenty, and the output layer just gives the posterior probabilities of the four different classes that I want to discriminate.
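To make the supervised NMF feature extraction described before the network part more concrete, here is a minimal sketch under the same assumptions (hypothetical names; the class-wise basis W is assumed to have been built beforehand from speech, laughter, vocal-noise and other-noise spectra): the basis stays fixed, only the activations are updated, each column of H is normalised to a likelihood-like distribution, and a frame energy term, here assumed to be log energy, is appended.

```python
import numpy as np

def nmf_likelihood_features(V, W_fixed, n_iter=100, eps=1e-12):
    """Supervised NMF feature extraction: W_fixed (bins x rank) holds the
    predefined class spectra and is NOT updated; only the activations H are
    estimated with the KL multiplicative update, then column-normalised."""
    rank = W_fixed.shape[1]
    n_frames = V.shape[1]
    H = np.full((rank, n_frames), 1.0 / rank)
    ones = np.ones_like(V)
    for _ in range(n_iter):
        WH = W_fixed @ H + eps
        H *= (W_fixed.T @ (V / WH)) / (W_fixed.T @ ones + eps)
    # normalise each column so the activations behave like likelihoods
    H_norm = H / (H.sum(axis=0, keepdims=True) + eps)
    # the normalised activations carry no energy information, so append
    # a (log) frame energy term as an extra feature dimension
    log_energy = np.log(V.sum(axis=0) + eps)
    return np.vstack([H_norm, log_energy[np.newaxis, :]])
```

The resulting per-frame vectors, activations plus energy, are what is fed into the recurrent network described above.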
Our evaluation was done on the Buckeye corpus of spontaneous speech; I don't know how many of you know it. We took only the subject turns, so there remained about twenty-five hours of spontaneous speech. It is interview speech: there is one interviewer and a test subject, and they follow a free conversation without any specific protocol. There are forty speakers, twenty male and twenty female, and we subdivided the corpus in a speaker-independent manner, which means we divided it into a training, validation and test set, all stratified by age and gender. The percentages were around eighty percent for training, ten percent for validation and ten percent for test, and to make it more reproducible we did this subdivision in ascending order of speaker ID. The corpus also comes with an automatic alignment of speech and phonemes, as well as laughter, vocal noise and other noise, and this automatic alignment on the training data was used to train the NMF as well as the neural network. Here is just a short summary of the sizes of the sets for the respective classes: as you would expect, the speech class is predominant, and the laughter and other-noise classes are quite sparse, especially in the test set.

The evaluation we did was motivated by the question of whether it is better to model the non-linguistic vocalisations inside the ASR system or outside the ASR system, which is why we set up an ASR system on the Buckeye corpus as a reference. I am going quite fast here because it is all pretty standard: PLP coefficients plus deltas, and a bigram language model trained on the Buckeye training set; we also experimented with other language models, but they did not increase the word accuracy. In addition to the thirty-nine monophones we had three models for non-linguistic vocalisations (laughter, vocal noise and other noise), but with twice as many states as the phoneme models, and we estimated state-clustered triphones with sixteen and thirty-two mixtures. As you can see, the word accuracy of the system is quite low, which is actually quite common for spontaneous speech.

On this slide I have the comparison of the discriminability of the different classes by different types of RNNs. The general trend you can see is that the normal RNN has the lowest frame-wise F1 measure, which is the primary evaluation measure here; UA and WA stand for the unweighted average and the weighted average over the four classes, where weighted means weighted by the prior class probability. What you can also see is that the LSTM concept alone does not give that much gain over the normal RNN, but the bidirectional LSTM delivers gains for almost all classes over the unidirectional network. The only class where this is not the case is the other-noise class, but this might just be due to its sparsity, as I indicated before. Regarding the BLSTM input features, we can conclude that the NMF features computed with the Kullback-Leibler divergence outperform the PLP features, and also with the NMF features generated by the Itakura-Saito divergence you get improvements; these were especially visible for the other-noise class, but as I said, there is not that much data for that class.
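Since the frame-wise F1 measure and its unweighted and weighted averages (UA, WA) are the primary figures here, a small sketch of how they can be computed from frame-level reference and hypothesis labels may help; the helper below is hypothetical and simply illustrates the definitions (per-class F1, UA as the plain mean over classes, WA weighted by the class priors).

```python
import numpy as np

def framewise_f1(ref, hyp, n_classes=4):
    """Per-class frame-wise F1 plus unweighted (UA) and weighted (WA) averages.
    ref, hyp: integer class labels per frame (e.g. 0 = speech, 1 = laughter,
    2 = vocal noise, 3 = other noise)."""
    ref = np.asarray(ref)
    hyp = np.asarray(hyp)
    f1 = np.zeros(n_classes)
    prior = np.zeros(n_classes)
    for c in range(n_classes):
        tp = np.sum((hyp == c) & (ref == c))
        fp = np.sum((hyp == c) & (ref != c))
        fn = np.sum((hyp != c) & (ref == c))
        prec = tp / (tp + fp) if tp + fp > 0 else 0.0
        rec = tp / (tp + fn) if tp + fn > 0 else 0.0
        f1[c] = 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0
        prior[c] = np.mean(ref == c)   # prior class probability from the reference
    ua = f1.mean()                     # unweighted average over classes
    wa = float(np.sum(f1 * prior))     # average weighted by the class priors
    return f1, ua, wa
```

A call like `f1, ua, wa = framewise_f1(ref_labels, hyp_labels)` then yields the per-class and averaged figures discussed in the results.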
In sum, we can see that the unweighted average increases by about two percent absolute from the PLP features to the KL-based NMF features, and the weighted average is of course dominated by the performance on speech, which also increased.

Now, to come to a conclusion on whether it is better to model the vocalisations inside the ASR or outside the ASR: we can see that, except for the laughter class, it is always better in terms of frame-wise F1 measure to model them with the BLSTM approach instead of modelling them directly in the ASR. There are of course differences with respect to recall and precision, because we are actually not talking about detection here but about classification; we could also reduce it to a binary detection task and calculate AUC measures, but we have not done that here. Overall, the weighted average recall increased from 91.3 to 95.1 percent, and this improvement is also significant. As I said previously, it is not a real significance test but just a heuristic measure, and the p-value is actually smaller than ten to the minus three here.

Concluding, we can say that the BLSTM approach delivered quite a high reduction of the frame-wise error rate, by 37.5 percent relative, and the best results have been obtained with the KL divergence as the NMF cost function. Future work will deal with how to integrate this BLSTM classifier into the ASR system, and for that we have a quite promising approach, a multi-stream HMM ASR system, which currently uses BLSTM phoneme predictions and could also use the prediction of whether there is speech or a non-linguistic vocalisation. Other improvements relate to the NMF: there we could include context-sensitive NMF features, like features obtained by non-negative matrix deconvolution, or use sparsity constraints in the supervised NMF to improve the discrimination. So this concludes the talk from my part, and I am looking forward to your questions. Thank you very much.

[Session chair] Since we moved from slightly behind to being ahead of schedule, there is quite some time for questions.

[Audience] Have you done any experiments on whether the ASR modelling outperforms this one, or whether both together outperform each single one? And how about the results on mixed speech, I mean speech that is mixed up with laughter?

[Speaker] Yeah, can I give a short reply?

[Session chair] So let's thank Felix again.