0:00:24 Right. In this talk I will explain feature extraction by non-negative matrix factorization and classification by long short-term memory networks for spontaneous speech.
0:00:46 Yeah, so as probably most of you know, the localisation of non-linguistic events in spontaneous speech can have multiple applications. One of them is of course to gain some paralinguistic information, for example if you recognise laughter or sighs or other vocalisations that have a semantic meaning. Another application would be to increase the word accuracy of an ASR system: if you know where there are lexical items in the speech and where there are no lexical items, you can perform decoding only on the lexical items, and maybe this can increase the word accuracy.
0:01:28 So the crucial question addressed here is whether to do this inside or outside the ASR framework. If you think of doing it inside the ASR framework, you can just add some more models to the recognizer, for example word models for laughter and so on, include them in the language model, and do the standard acoustic modeling for them. Another approach would be to do this outside the ASR framework, in a different classifier, and this is actually the approach I pursue here.
0:02:01 So I do a frame-wise, context-sensitive classification of the speech into lexical speech and non-linguistic segments, and I do it in a purely data-based way, which means I just train on different non-linguistic segments and on speech and try to discriminate them.
0:02:25 So, why am I confident that this should work? Because we already did some work on static classification of speech and non-linguistic vocalisations, using NMF features and an SVM classifier, and we could show that NMF features together with the SVM outperform an MFCC classification there. But of course, static classification means that you already have a presegmentation into speech and non-linguistic segments, so this is not a realistic application, which is why in this study we now include the segmentation part. As the classifier we used a long short-term memory recurrent neural network, which has been used widely and successfully for phoneme recognition in speech, also in spontaneous speech.
0:03:21 So, I don't know how many of you are familiar with non-negative matrix factorization, so here is just a short overview: the spectrogram is decomposed into the product of two matrices, where the W matrix contains the spectra as basis vectors and the H matrix gives you the time activations of those spectra.
0:04:06 And here is the place for a little advertisement, because we have an open-source toolkit for NMF, which we will also present on Thursday in the evening poster session, so all of our experiments can be redone very easily.
0:04:26 So the NMF algorithm that we apply is just the multiplicative update; I think it's pretty standard. It is an iterative minimisation of a cost function between the original spectrogram V and the product of W and H. In our previous study we could show that the Euclidean distance is not a good measure to minimize here, so on the one hand we evaluate the Kullback-Leibler divergence, and on the other hand a, let's say, newer cost function that has been proposed especially for music processing, which is the Itakura-Saito divergence. The main difference between those is that the Itakura-Saito divergence is scale-invariant, so low-energy components are weighted basically the same way as high-energy components in the calculation of the error.
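A minimal sketch of this kind of multiplicative-update NMF with the generalized Kullback-Leibler divergence, assuming a non-negative magnitude spectrogram V of shape (frequency bins, frames); the function name and defaults are illustrative, and an Itakura-Saito variant would only change the cost and update rules:

    import numpy as np

    def nmf_kl(V, rank, n_iter=200, eps=1e-9, seed=0):
        """Factorize V ~= W @ H by multiplicative updates minimizing the
        generalized Kullback-Leibler divergence between V and W @ H."""
        rng = np.random.default_rng(seed)
        n_bins, n_frames = V.shape
        W = rng.random((n_bins, rank)) + eps    # spectra as basis vectors (columns)
        H = rng.random((rank, n_frames)) + eps  # time activations of those spectra
        ones = np.ones_like(V)
        for _ in range(n_iter):
            WH = W @ H + eps
            H *= (W.T @ (V / WH)) / (W.T @ ones + eps)  # KL update for H
            WH = W @ H + eps
            W *= ((V / WH) @ H.T) / (ones @ H.T + eps)  # KL update for W
        return W, H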
0:05:26 So now we move on to the feature extraction by NMF. The idea is to follow a supervised NMF approach, which means that the W matrix is fixed. That is actually an approach that is also pursued in source separation: if you have different sources, like speech and noise, you pre-initialize the W matrix and can reconstruct the sources afterwards. What we did here is to predefine the W matrix with spectra from different classes: on the one hand normal speech, let's say speech with words, and then laughter, other vocal noise, and other noise, which is mostly environmental noise or microphone noise.
0:06:21 So in an ideal world, if you do this decomposition, we could just look at the activation matrix, and it would give us exactly the temporal location of those segments. But of course this does not work like that, because of the large spectral overlap between the spectra from the different classes.
0:07:04 So the real case looks like this.
0:07:20 So our approach is just to normalize each column of the H matrix, to get something like a likelihood that a given spectrum was active at a given time frame. And because those likelihood features do not contain energy information, as opposed to the normal H matrix, we also add the energy.
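A sketch of the supervised feature extraction described above, assuming W has been predefined column-wise with spectra of the classes and is kept fixed, so only H is estimated; the column-wise normalization and the appended energy follow the description in the talk, while the use of log-energy and all names are assumptions:

    import numpy as np

    def nmf_activation_features(V, W, n_iter=100, eps=1e-9):
        """Supervised NMF feature extraction: W (class spectra) stays fixed and
        only H is updated with the KL multiplicative rule.  Each column of H is
        then normalized to sum to one, giving a likelihood-like activation per
        class and frame, and a frame energy term is appended."""
        rank, n_frames = W.shape[1], V.shape[1]
        H = np.full((rank, n_frames), 1.0 / rank)
        ones = np.ones_like(V)
        for _ in range(n_iter):
            WH = W @ H + eps
            H *= (W.T @ (V / WH)) / (W.T @ ones + eps)
        act = H / (H.sum(axis=0, keepdims=True) + eps)       # column-wise normalization
        energy = np.log(V.sum(axis=0, keepdims=True) + eps)  # appended energy (log-energy is an assumption)
        return np.vstack([act, energy])                       # (rank + 1) features per frame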
0:07:45 Okay, so now I come to the classification with long short-term memory. My colleague will afterwards present another talk on long short-term memory, which is why I explain it here only briefly. So the drawback of a conventional recurrent network is that the context range is quite limited, because the weight of a single input on the output calculation decreases exponentially over time; this is also known as the vanishing gradient problem. The solution, or one solution for this, is to use long short-term memory cells instead of the standard cells of the neural network.
0:08:29 These cells have an internal state that is maintained by a self-connection with a recurrent weight that is constant at one point zero. This means that the network can actually store information over an arbitrarily long time. And of course, to also access that information, to update it, and maybe to delete it, you need some other units that control the state of this cell; these are known as the gate units for input, output, and memory. The great advantage of this architecture is that it automatically learns the required amount of context: all the weights of the gate units that control input, output, and memory are learned during training, for example by resilient propagation.
0:09:18 So you don't have to specify the required amount of context, as you would have to do, for example, when you just do feature frame stacking. Of course you can ask whether this gives us any advantage over just a normal recurrent network, which is why we investigated several architectures in this study.
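A minimal numpy sketch of one LSTM cell step, just to illustrate the mechanism described above: the cell state is carried on a self-connection with constant weight 1.0 and is only modified through the input, forget ("memory") and output gates; all names, shapes and the parameter layout are illustrative:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x, h_prev, c_prev, p):
        """One time step of an LSTM cell; p holds input weights W*, recurrent
        weights U* and biases b* for the three gates and the cell input."""
        i = sigmoid(p["Wi"] @ x + p["Ui"] @ h_prev + p["bi"])  # input gate
        f = sigmoid(p["Wf"] @ x + p["Uf"] @ h_prev + p["bf"])  # forget ("memory") gate
        o = sigmoid(p["Wo"] @ x + p["Uo"] @ h_prev + p["bo"])  # output gate
        g = np.tanh(p["Wc"] @ x + p["Uc"] @ h_prev + p["bc"])  # candidate cell input
        c = f * c_prev + i * g   # cell state: self-connection with constant weight 1.0
        h = o * np.tanh(c)       # cell output for the next layer / time step
        return h, c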
0:09:40 So, to explain what you see here: bidirectional actually means that the network processes the input forward and backward, and to this end it has two input layers and also two hidden layers. The dimensionality of the input layers is just the number of input features, which is three for our NMF configuration, or thirty-nine if you just use normal PLP features plus deltas. The size of the hidden layers was evaluated at eighty and one hundred and twenty, and the output layer just gives the posterior probabilities of the four different classes that I want to discriminate.
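A sketch of how the bidirectional part comes together at the output, assuming the forward and backward hidden sequences (one row per frame) have already been computed by the two hidden layers; the softmax output layer then yields frame-wise posteriors for the four classes, and the time re-alignment of the backward pass is an assumption about the framework:

    import numpy as np

    def blstm_posteriors(H_fwd, H_bwd, W_out, b_out):
        """Frame-wise class posteriors of a bidirectional network: the forward
        hidden sequence and the time-reversed backward hidden sequence are
        concatenated per frame and mapped to the classes by a softmax layer."""
        H = np.concatenate([H_fwd, H_bwd[::-1]], axis=1)  # re-align the backward pass in time
        logits = H @ W_out + b_out                        # shape: (frames, number of classes)
        logits -= logits.max(axis=1, keepdims=True)       # numerical stability
        p = np.exp(logits)
        return p / p.sum(axis=1, keepdims=True)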
0:10:38 Our evaluation was done on the Buckeye corpus of spontaneous speech; I don't know how many of you know it. We took only the subject turns, so there remained about twenty-five hours of spontaneous speech. It is somewhat special in the sense that it is interview speech: there is one interviewer and a test subject, and they have a free conversation without any specific protocol. There are forty speakers, twenty male and twenty female, and we subdivided the corpus in a speaker-independent manner, which means we divided it into a training, validation and test set, all stratified by age and gender. The percentages were around eighty percent for training, ten percent for validation and ten percent for test, and to make it more reproducible, we did this subdivision in ascending order of speaker ID.
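A sketch of this kind of speaker-independent partitioning, assuming a list of speaker records with id, gender and age group; the roughly 80/10/10 assignment in ascending speaker-ID order follows the description above, while the exact grouping logic is an assumption:

    def split_speakers(speakers):
        """Divide speakers into train / validation / test (about 80/10/10),
        stratified by gender and age group, taking speaker IDs in ascending
        order within each stratum so the split is reproducible."""
        train, valid, test = [], [], []
        strata = {}
        for spk in sorted(speakers, key=lambda s: s["id"]):
            strata.setdefault((spk["gender"], spk["age_group"]), []).append(spk["id"])
        for ids in strata.values():
            n_valid = max(1, round(0.1 * len(ids)))
            n_test = max(1, round(0.1 * len(ids)))
            n_train = len(ids) - n_valid - n_test
            train += ids[:n_train]
            valid += ids[n_train:n_train + n_valid]
            test += ids[n_train + n_valid:]
        return train, valid, test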
0:11:33 The corpus also comes with an automatic alignment of speech and phonemes, as well as laughter, vocal noise and other noise, and this automatic alignment was used on the training data, both to train the NMF and to train the neural network.
0:11:55 Here is just a short summary of the sizes of the different sets, subdivided by the classes. As you would expect, the speech class is predominant, and especially the laughter and the other-noise classes are quite sparse, particularly in the test set.
0:12:18 The evaluation that we did was motivated by the question whether it is better to model the non-linguistic vocalisations inside the ASR system or outside the ASR system, which is why we set up a, let's say, standard ASR system on the Buckeye corpus as a reference. I'm going through this quite fast because it's all pretty standard: PLP coefficients plus deltas, and a bigram language model trained on the Buckeye training set. We also experimented with other language models, but that didn't increase the word accuracy. In addition to the thirty-nine monophones, we had three models for non-linguistic vocalisations, laughter, vocal noise and other noise, but with double as many states as the phoneme models, and we estimated state-clustered triphones with sixteen to thirty-two mixtures. As you can see, the word accuracy of the system is quite low, which is actually quite common for spontaneous speech.
0:13:28 On this slide I have the comparison of the discriminability of the different classes by different types of RNNs. The general trend you can see is that the normal RNN has the lowest frame-wise F1 measure, which is the primary evaluation measure here. UA and WA stand for the unweighted average over the four classes and the weighted average, where weighted means weighted by the prior class probabilities.
0:13:59 What you can also see is that the LSTM concept alone doesn't give that much gain over the normal RNN, but the bidirectional LSTM delivers gains for almost all classes over all the other networks. The only class where this is not the case is the other-noise class, but this might also just be due to its sparsity, as I indicated before.
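A sketch of the frame-wise evaluation used here, computing per-class F1 from frame-level reference and hypothesis labels together with the unweighted average (UA) over the classes and the weighted average (WA), weighted by the prior class probabilities; names are illustrative:

    import numpy as np

    def framewise_f1(y_true, y_pred, classes):
        """Per-class frame-wise F1 plus unweighted (UA) and weighted (WA) averages."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        f1, priors = [], []
        for c in classes:
            tp = np.sum((y_pred == c) & (y_true == c))
            fp = np.sum((y_pred == c) & (y_true != c))
            fn = np.sum((y_pred != c) & (y_true == c))
            prec = tp / (tp + fp) if tp + fp else 0.0
            rec = tp / (tp + fn) if tp + fn else 0.0
            f1.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
            priors.append(np.mean(y_true == c))  # prior probability of class c
        f1 = np.array(f1)
        return f1, float(f1.mean()), float(np.dot(f1, priors))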
0:14:31 Regarding the BLSTM size and features, we can actually conclude that the NMF features computed with the Kullback-Leibler divergence outperform both the PLP features and the NMF features generated with the Itakura-Saito divergence. The improvements are especially visible for the vocal-noise and other-noise classes, but as I said, there is not that much data on those. In sum, we can see that the unweighted average increases by about two percent absolute from the PLP features to the KL-based NMF features, and the weighted average is of course dominated by the performance on speech, which is also increased.
0:15:29 So now, to come to a conclusion on whether it is better to model the vocalisations inside the ASR or outside the ASR: we can see that, except for the laughter class, it is actually always better in terms of the frame-wise F1 measure to model them with the BLSTM approach instead of modelling them directly in the ASR. There are a few differences with respect to recall and precision. Note that we are actually not talking about detection here but about classification; of course, we could also reduce it to a binary detection task and calculate AUC measures, but we have not done that here.
0:16:16 Overall, the weighted average accuracy, or weighted average recall, is increased from ninety-one point three to ninety-five point one percent, and this improvement is also significant. As said previously, it is not a real significance test here but just a heuristic measure; the p-values are actually smaller than ten to the minus three.
0:16:49 Concluding, we can say that the BLSTM approach delivers quite a high reduction of the frame-wise error rate, by thirty-seven point five percent relative, and that the best results have been obtained with the KL divergence as the NMF cost function. Future work will of course deal with how to integrate this BLSTM classifier into the ASR system, and for that we have a quite promising approach, a multi-stream HMM ASR system, which currently also uses a BLSTM for phoneme prediction and could also use the prediction of whether there is speech or a non-linguistic vocalisation. Other improvements relate to the NMF algorithm, where we could include context-sensitive features, like features computed by non-negative matrix deconvolution, or also use sparsity constraints in the supervised NMF to improve the discrimination.
0:17:50 So this concludes the talk from my part, and I'm looking forward to your questions. Thank you very much.
0:18:00 Somehow we have moved from being slightly behind schedule to being ahead of schedule, so there is quite some time for questions.
0:18:17 Have you done some experiments on whether the ASR modelling outperforms this one, or whether both together outperform each single one?
0:18:43 And how about the results on mixed speech, I mean speech that is mixed up with laughter?
0:18:53 Yeah, yeah.
0:19:15 I have a question: can ...
0:19:33 A short reply, so ...
0:19:37 Okay, let's thank Felix again.