Thank you. So, as was just mentioned, this talk is about vocal outbursts, and it is joint work with several colleagues from our group, which also includes the previous speaker.

So, what are vocal outbursts? They are non-linguistic vocalisations, which basically means expressions that are conveyed vocally rather than verbally. Examples include laughter, which is probably the most common one, but also coughing and breathing, and there are many more types of vocalisations. We may not realise it, but these vocalisations play an important role in face-to-face conversations. For example, it has been shown that laughter punctuates speech, meaning that we tend to laugh at places where punctuation would be placed. Another example: when both participants in a conversation laugh, it is very likely that this indicates the end of the topic they are discussing, and that a new topic will start. Apart from laughter, which is by far the most widely studied vocalisation, most of the other vocalisations are used as a feedback mechanism during interaction, and, as I said, they are very common in real conversations although we do not realise it.

There have been several works on laughter recognition and classification from audio alone, and also a few works on audio-visual classification of laughter, but works on discriminating between different vocalisations are limited compared to laughter, and one of the main reasons is the lack of data. So our goal in this work was to discriminate between different vocalisations, since we found a dataset that contains such vocalisations, using not only audio features but also visual features.
The idea here is that since, most of the time, there is a facial expression involved in the production of a vocalisation, this information can be captured by visual features, so it can improve the performance when it is added to the audio information.

The dataset we used was the Audiovisual Interest Corpus from TUM, which contains 21 subjects and 3,901 turns in a dyadic interaction scenario: basically there is a presenter and a subject who interact, and during the interaction several vocalisations occur. The partitioning we used is the same one that was used in the Paralinguistics Challenge, with training, development and test sets — although unfortunately the development column is missing from this slide. There are four classes of non-linguistic vocalisations — breathing, consent, hesitation and laughter — and there is also another class, garbage, which contains other noises and speech. For the experiments I am going to show you, we have excluded the breath class, because in this dataset, most of the time, there is no facial expression involved when the subjects breathe.

So let me just show you a few examples. This is an example of laughter from the database. As you can see, although the camera is pointed directly at the face of the subject, there is still significant head movement, which is quite common in natural interactions — unlike, for example, posed databases, where we show the subjects a funny video clip and record their reaction, and the subjects are basically static because they are just watching something.
In those recordings the head movement is small, whereas in real cases the head movement is always there. And here is an example of hesitation — basically an "uhm" — and an example of consent, an "mhm, uh-huh"; in this case there is not much of an expression, just some head movement.

So we used this dataset for the classification of the vocalisations into four classes: the three vocalisations plus garbage. This slide is just an overview, and I will explain each of the blocks: we extract audio and visual features, the visual features are upsampled to match the frame rate of the audio features and then concatenated for feature-level fusion, and the classification is performed with two different approaches — one is SVMs and the other is long short-term memory recurrent neural networks.

The frame rate of the visual features is 25 frames per second, which is a common frame rate, and there are two types of features: shape features, which are based on a point distribution model, and appearance features, which are based on PCA on gradient orientations. In the beginning we track 20 points on the face — points around the mouth, the chin, the nose, the eyes and the eyebrows — and this is an example of a subject laughing with the points tracked on the face. As you may see, the tracking is not perfect; there is no perfect tracker. We initialise the points and then the tracker follows these 20 points. Now, the main problem, which is present for both the shape and the appearance features, is that we want to decouple head pose from facial expressions.
So how do we do this? For the shape features we basically use a point distribution model. Each point has two coordinates, x and y, so if we concatenate all the coordinates we end up with a 40-dimensional vector for each frame. If we now stack these vectors from all the frames into a matrix — say we have K frames — we end up with a K-by-40 matrix, and then we apply PCA to this matrix. It is well known that the greatest variance of the data lies in the first few components, and this means that, since there are significant head movements, most of the variance will be captured by the first principal components, whereas the facial expressions, which account for a smaller variance, will be encoded in the lower components. In this case we found that the first four components correspond to head movement and the remaining ones, from five to ten, to facial expressions — but this depends on the dataset: on other datasets with even stronger head movement we found that the first five or six components correspond to head movement. So the shape features are very simple: just the projection of the 40-dimensional coordinate vectors onto the principal components that correspond to facial expressions.

Let me show you an example. This is not from this database; it is just to give an idea of how the principle works. On the top left you see the video, on the top right the tracked points, on the bottom left the reconstruction based on the principal components that correspond to head movement, and on the bottom right the reconstruction based on the components that correspond to the facial expressions. You can see that when the subject turns the head, the bottom right remains frontal and only shows the expressions, whereas the bottom left follows the head pose. It is a very simple approach, but it works.
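The pose/expression decoupling described above could be sketched roughly as follows. This is my own illustration, not the authors' code; the 20 tracked points and the choice of four pose components follow the talk, and with the random toy data the split is of course only mechanical:

```python
import numpy as np

def expression_features(points, n_pose=4):
    """points: (K, 20, 2) tracked facial coordinates per frame.
    Stack x/y into a K x 40 matrix, run PCA via SVD, treat the
    first `n_pose` components (largest variance) as head pose,
    and project the frames onto the remaining components to get
    expression features.  n_pose=4 is the value from the talk;
    it is dataset-dependent."""
    X = points.reshape(len(points), -1)          # (K, 40)
    Xc = X - X.mean(axis=0)                      # centre the data
    # rows of Vt are the principal components, sorted by variance
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    expr_components = Vt[n_pose:]                # drop pose components
    return Xc @ expr_components.T                # (K, 40 - n_pose)

K = 50
feats = expression_features(np.random.randn(K, 20, 2))
print(feats.shape)  # (50, 36)
```

Reconstructing from only the first `n_pose` components (instead of dropping them) would give the "head movement only" video shown in the demo.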
For the appearance features we also want to remove the head pose, and in this case it is harder. What we do is the common approach in computer vision: we use a reference frame, which contains the neutral expression of the subject facing the camera frontally, and we compute the affine transformation between each frame and this reference frame. By affine transformation we basically mean that we scale, rotate and translate the face so that it becomes frontal. You can see a very simple example at the bottom: on the left the head is a bit rotated, and after applying the scaling, translation and rotation the face becomes frontal. Then we crop the face area and apply PCA to the image gradient orientations. I will not go into the details — you can find more information in this paper — but while it is quite common to apply PCA directly to the pixel intensities, this approach, as discussed in that paper, has some advantages, for example it is more robust to illumination, and that is why we decided to use it.

Now, the audio features were computed with the openSMILE toolkit, which is provided by TUM, and their frame rate is one hundred frames per second; that is why we need to upsample the visual features, which are extracted at 25 frames per second. We use some standard audio features — PLP coefficients, energy, loudness, fundamental frequency and probability of voicing, together with their first and second order delta coefficients — so pretty standard features.

For the classification, the first approach was to use long short-term memory recurrent neural networks.
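The upsampling to a common frame rate and the feature-level fusion could look roughly like this. A minimal sketch: the simple frame-repetition scheme is my assumption — the talk only says the visual features were upsampled to the audio rate:

```python
import numpy as np

def upsample_and_fuse(visual, audio, v_rate=25, a_rate=100):
    """Repeat each visual frame to match the audio frame rate,
    then concatenate audio and visual vectors per frame
    (feature-level fusion).  visual: (K, Dv), audio: (T, Da)."""
    factor = a_rate // v_rate                      # 100 / 25 = 4
    visual_up = np.repeat(visual, factor, axis=0)  # (K*4, Dv)
    n = min(len(visual_up), len(audio))            # align lengths
    return np.hstack([audio[:n], visual_up[:n]])   # (n, Da + Dv)

# toy example: 10 visual frames (40-dim), 40 audio frames (20-dim)
fused = upsample_and_fuse(np.zeros((10, 40)), np.zeros((40, 20)))
print(fused.shape)  # (40, 60)
```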
Felix described LSTMs earlier, so I will give just one slide. They are suitable for dynamic, frame-by-frame classification. For the static approach, the main problem is that the utterances have different lengths, so, in order to extract features which do not depend on the length of the utterance, we simply compute statistics — functionals — of the low-level features over the entire utterance: for example the mean of a feature over the entire utterance, or its maximum value, or its range. This converts each utterance into a feature vector of fixed size, and then the classification is performed for the entire utterance using support vector machines.

So, as you can see in this diagram, we have the same features in both cases — the shape or appearance features, PLPs, energy, F0, loudness and probability of voicing. In the static case we compute the statistics over the entire utterance, feed them to an SVM and get a label for the sequence, whereas in the LSTM case we simply feed in the low-level features — there is no need to compute functionals — and the LSTM network provides a label for each frame; then we simply take the majority vote over the frames to label the sequence.

Now, the results. As you can see, in terms of the weighted average the SVMs provide better performance, whereas in terms of the unweighted average the LSTMs lead to better performance. What this means is that the SVMs are good at classifying the largest class, which in this case is hesitation and contains more than a thousand examples, but they are not so good at recognising the other classes, whereas the LSTMs do a better job at recognising all the classes — you can see that their unweighted average values are much higher.
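As a rough sketch of the two labelling schemes — the exact functional set and the voting rule here are my reading of the talk, not the authors' code:

```python
import numpy as np
from collections import Counter

def functionals(llds):
    """Utterance-level statistics over frame-wise low-level
    descriptors of shape (K, D): mean, maximum and range per
    dimension, giving a fixed-size vector for the SVM regardless
    of the utterance length.  The talk names mean/max/range; the
    full functional set may have been larger."""
    return np.concatenate([llds.mean(axis=0), llds.max(axis=0),
                           llds.max(axis=0) - llds.min(axis=0)])

def majority_vote(frame_labels):
    """LSTM path: the network emits one label per frame and the
    utterance takes the most frequent frame label."""
    return Counter(frame_labels).most_common(1)[0][0]

llds = np.random.randn(120, 10)   # 120 frames, 10 descriptors
print(functionals(llds).shape)    # (30,)
print(majority_vote(["laughter", "garbage", "laughter"]))  # laughter
```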
Something which is also interesting is to compare the performance of the audio-only with the audio-visual approach. For example, here you see that audio alone gives 64.6 percent, and when we add the appearance features the performance actually goes down. This may sound a bit surprising, because, especially for visual speech recognition, appearance features are considered the state of the art, but there are two reasons. First of all, we use information from the entire face, so basically there is a lot of redundant information, which may make it impossible to improve the performance. The second reason is that, as I showed you before, there is significant head movement: although we perform the registration step to convert all expressions to a frontal pose, it is not perfect, and especially when there are out-of-plane rotations — which means the subject is not looking at the camera but somewhere else — it is impossible with this approach to reconstruct the frontal view. And it is known that appearance features are much more sensitive to registration errors than shape features. So this could be a reasonable explanation for the worse performance when adding the appearance features. When we add the shape information instead, we see a significant gain, from 64.6 to 72 percent.

Now, if we look at the confusion matrices, to see the result per class: the matrix on the left is the result when using the audio information only, and the one on the right when using audio plus shape — this is for the LSTM networks. We see that for consent and laughter there is a significant improvement, from 47 to 66 and from 63 to 79 percent respectively, whereas for hesitation the performance goes down when we add this extra visual information.
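As an aside, the weighted versus unweighted average distinction used in these comparisons can be made concrete; the numbers below are made up for illustration, not results from the paper:

```python
import numpy as np

def wa_ua(confusion):
    """confusion[i, j] = # utterances of class i classified as j.
    Weighted average recall = overall accuracy (dominated by the
    largest class); unweighted average recall = mean of the
    per-class recalls, which weights a rare class like laughter
    the same as the big hesitation class."""
    per_class = np.diag(confusion) / confusion.sum(axis=1)
    wa = np.diag(confusion).sum() / confusion.sum()
    ua = per_class.mean()
    return wa, ua

# toy 2-class case: 1000 'hesitation', 100 'laughter' examples
C = np.array([[900, 100],
              [ 60,  40]])
wa, ua = wa_ua(C)
print(round(wa, 3), round(ua, 3))  # 0.855 0.65
```

A classifier that mostly gets the big class right scores well on WA but poorly on UA, which is exactly the SVM-versus-LSTM pattern in the results.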
So, to summarise: we saw that the shape features improve the performance for consent and laughter, whereas the appearance features do not seem to do so — they only show a gain in the case of the SVMs, and even that is negligible, an improvement to 59.4 percent — and when we combine all the features together there is no further improvement either, so the same holds for the appearance features. Comparing now the LSTM networks with the SVMs, the LSTMs basically do a better job of recognising the different vocalisations, whereas the SVMs mostly recognise the largest class, which is hesitation.

As for future work: as Felix said, in these experiments we have used presegmented sequences, which means we know the start and the end, we extract the sequence, and we do the classification. A much harder problem is the spotting of these non-linguistic vocalisations: you are given a continuous stream and you do not know the beginning and the end. That is actually our goal, and especially when using visual information this could be a challenging task, because there are cases where the face may not be visible, and then we would have to turn off, for example, the visual system. I think this is it — you can find more information on these websites. Thank you very much.

Chair: Thank you. We have time for a couple of questions.

Question: As far as I can tell — but you can correct me — the illumination was pretty much okay in these recordings. I am just wondering, when you go over to more realistic recordings where the illumination varies, do you expect to get the same amount of improvement for consent and laughter, or not? Or have you done anything on this?

Answer: Well, in this case, the appearance features are definitely influenced by illumination — they are sensitive to it. Now, for the shape features, the
question is whether the tracker is affected. If the tracker works fine even under severe changes in illumination, then the shape features can still provide useful information, because the points will be tracked correctly. But basically this is an open problem — it has not been solved. As you know, computer vision is still behind audio processing here, and these are problems to which nobody knows the answer yet; that is why most applications use simple conditions. For example, in audio-visual speech recognition the subject usually looks directly at the camera and there is always a frontal view of the face. Quite recently there have been some approaches trying to apply these methods to more realistic scenarios, but for cases like the one you mention, a real environment, at least I do not know of an approach that would work well at the moment.

Question: [about the feature upsampling]

Answer: Basically, all the features were upsampled to the same frame rate in both cases, although for the SVMs it may not be necessary, since we compute the functionals anyway.

Question: You showed a couple of examples. Is there much variance within the classes? I mean, what is the class definition with respect to the visual features — for example, for hesitation, which facial expression accompanies hesitation, and is it consistent enough to actually train on it?

Answer: Okay, just to give you an example: if you look at all the examples, you will see that sometimes there is a big difference between them, and if you watch the video without listening to the audio, it is very likely that even humans would be confused between the different vocalisations. Hesitation in particular comes with a lot of variance, and for laughter there are around three hundred examples and the variance is high there too.

Question: And how was the training and test partitioning decided? What was the criterion?

Answer: I think this is the official partitioning — maybe my colleagues can say more about the criteria.

From the audience: It was done in the same way as for the challenge — split by speaker, so it is speaker-independent.

Answer: Right. So yes, there is variance, and as I said before, sometimes even if you turn off the audio you cannot discriminate between the different vocalisations — I think that is actually one of the main issues.

Question: My question was actually about the covariance between the two feature sets: was there high covariance between, for example, the audio and the visual features? That could explain why you did not get much improvement for hesitation — because the covariance between the combined features was high.

Answer: It could be. I mean, some expressions are similar even though they belong to different classes, but I am not sure — these are spontaneous expressions and they are all different; if you look at all of them you will not find two that are exactly the same. Okay, thank you.

Chair: Any other questions? Okay, so thank you again.