0:00:14 Thank you. As was just mentioned, my name is Stavros, and I am going to present our work on audiovisual classification of vocal outbursts. This is joint work with Florian Eyben and Björn Schuller, and with our group at Imperial College, which also includes Georgios Tzimiropoulos and Maja Pantic.
0:00:36 So, what are vocal outbursts? They are non-linguistic vocalisations, which are often accompanied by facial expressions as well. Examples include laughter, which is probably the most common, coughing, and breathing, but there are many other types of vocalisations too. We may not realise it, but these vocalisations play an important role in face-to-face conversations.
0:01:03 For example, it has been shown that laughter punctuates speech, meaning that we tend to laugh at places where punctuation would be placed. Another example: when two participants in a conversation laugh simultaneously, it is very likely that this marks the end of the topic they are discussing and that a new topic will start.
0:01:27 Apart from laughter, which is by far the most widely studied vocalisation, most of the other vocalisations are used as a feedback mechanism during interaction. As I said before, they are very common in real conversations, even though we do not realise it.
0:01:48 There have been several works on laughter recognition and classification from audio only, and also a few works on audiovisual classification of laughter, but works on recognising and discriminating between different vocalisations are limited compared to laughter. One of the main reasons is the lack of data.
0:02:08 So, what was our goal in this work? We wanted to discriminate between different vocalisations, since we had access to a dataset that contains such vocalisations, using not only audio features but also visual features.
0:02:24 The idea here is that, since most of the time there is a facial expression involved in the production of a vocalisation, this information can be captured by visual features and can improve the performance when it is added to the audio information.
0:02:42 Okay. The dataset we used was the Audiovisual Interest Corpus (AVIC) from TUM, which contains 21 subjects and 3,901 turns. It is a dyadic interaction scenario: a presenter and a subject interact, and during the interaction several vocalisations occur.
0:03:10 The partitioning we used has been used before: this dataset was also used in the Paralinguistics Challenge, and we use the same partitioning for training, development, and testing, although unfortunately the development column is missing from this slide.
0:03:29 In total there are four non-linguistic vocalisation classes: breathing, consent (vocalisations like "yes"), hesitation, and laughter. There is also another class, the garbage class, which contains other noises and speech.
0:03:46 For the experiments I am going to show you, we have excluded the breathing class, because most of the time it is not audible in this dataset, and most of the time there is no facial expression involved.
0:04:03 Okay, so let me show you a few examples.
0:04:17 This is an example of laughter from the database. What you can see is the following:
0:04:27 Although there is a camera pointed directly at the face of the subject, there is still significant head movement. This is quite common in natural interaction, whereas if you build a database where the subjects watch a funny video clip and you record their reaction, the subjects are mostly static: they just watch something, so there is little head movement. In this and other real cases, head movement is always there.
0:05:01 Okay, so let me show you an example of hesitation.
0:05:08 It is pretty short. And here is an example of consent.
0:05:21 Basically, in this one there was not much of a facial expression, just some head movement.
0:05:28 So, back to the presentation.
0:05:40 Okay. So we used this dataset and performed classification of the vocalisations into four classes: the three vocalisations plus the garbage class.
0:06:00 This is just an overview, which I will explain in the next slides. We extracted audio features and visual features; the visual features were upsampled to match the frame rate of the audio features and then concatenated for feature-level fusion. Classification was then performed with two different approaches: one was SVMs, and the other was long short-term memory recurrent neural networks.
0:06:27 The frame rate of the visual features is 25 frames per second, which is a common frame rate. There are two types of visual features: shape features, which are based on a point distribution model, and appearance features, which are based on PCA and gradient orientations.
0:06:45 In the beginning we track 20 points on the face: points on the mouth, one on the chin, points around the eyes, and two on each eyebrow. Here you can see an example of a subject laughing, with the points tracked on the face. As you may see, the tracking is not perfect; that happens, you cannot expect perfect tracking. We initialise the points and then the tracker tracks these 20 points.
0:07:19 Now, the main problem we have, which is present for both the shape and the appearance features, is that we want to decouple head pose from facial expressions. How do we do this?
0:07:34 Basically, we use a point distribution model. Each point has two coordinates, x and y, so if we concatenate all the coordinates we end up with a 40-dimensional vector for each frame. If we now stack these vectors from all frames into a matrix (with K frames we end up with a K-by-40 matrix), we can apply PCA to this matrix.
0:07:58 It is well known that the greatest variance of the data lies in the first few principal components. This means that, since there are significant head movements, most of the variance will be captured by the first principal components, whereas facial expressions, which account for a smaller variance, will be encoded in the lower-order components.
0:08:20 In this case, we found that the first four components correspond to head movements and the remaining ones, from five to ten, to facial expressions. Of course, this depends on the dataset: in other datasets with even stronger head movement, we have seen that the first five or six components correspond to head movement.
0:08:48 So the shape features are very simple: they are just the projections of the 40-dimensional coordinate vector of each frame onto the principal components that correspond to facial expressions.
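To make this concrete, here is a minimal sketch of the shape-feature computation, assuming the tracked points of one sequence are already available as a NumPy array; the 4-versus-6 component split encodes the dataset-dependent choice described above and is not a fixed rule.

```python
import numpy as np

def shape_features(points, n_pose=4, n_expr=6):
    """Project tracked facial points onto expression-related principal components.

    points: (K, 20, 2) array, K frames of 20 tracked (x, y) points
            (assumes K >= n_pose + n_expr frames).
    n_pose: leading principal components attributed to head movement (skipped).
    n_expr: following components attributed to facial expression (kept).
    Returns a (K, n_expr) matrix of per-frame shape features.
    """
    X = points.reshape(len(points), -1)        # K x 40 data matrix
    Xc = X - X.mean(axis=0)                    # centre before PCA
    # PCA via SVD: rows of Vt are principal components, sorted by variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    expr_basis = Vt[n_pose:n_pose + n_expr]    # skip the head-motion components
    return Xc @ expr_basis.T                   # per-frame projections
```

Projecting onto the first `n_pose` components instead, and reconstructing, isolates the head movement; that is essentially the split shown in the demo that follows.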
0:08:59 Let me show you an example.
0:09:11 What you see here is not from this database, but it gives an idea of how this principal-component decomposition works. On the top left you see the video stream, and on the top right the actual tracked points. On the bottom left you see the reconstruction based on the principal components that correspond to head movements, and on the bottom right the reconstruction that corresponds to the principal components related to facial expressions.
0:09:45 You can see that even when the subject turns the head, the bottom right always remains frontal and shows only the expressions, whereas the bottom left follows the head pose.
0:10:02 It is a very simple approach, but it works.
0:10:14 Okay. Similarly, for the appearance features we want to remove the head pose, and in this case it is harder. What we do is the common approach in computer vision: we use a reference frame, which contains the neutral expression of the subject facing the camera frontally, and we compute the affine transformation between each frame and the reference frame. By affine transformation we mean that we scale, rotate, and translate the face so that it becomes frontal.
0:10:48 You can see a very simple example at the bottom: on the left the head is a bit rotated, and after applying scaling, translation, and rotation the face becomes frontal. Then we crop an area of the face.
0:11:14 Then we apply PCA to the image gradient orientations. I will not go into the details; you can find more information in this paper. The main idea is that it is quite common to apply PCA directly on the pixel intensities, but this approach, as discussed in the paper, has some advantages, for example it is more robust to illumination changes, and that is why we decided to use it.
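A rough sketch of the registration step only, using OpenCV; the anchor points, the use of `estimateAffinePartial2D`, and the crop size are illustrative assumptions rather than the talk's exact procedure, and the gradient-orientation PCA itself is described in the cited paper.

```python
import cv2
import numpy as np

def register_to_reference(frame, frame_pts, ref_pts, out_size=(128, 128)):
    """Warp a face image towards the frontal, neutral reference pose.

    frame:     current video frame (grayscale or colour).
    frame_pts: (N, 2) stable facial points tracked in this frame.
    ref_pts:   (N, 2) the same points in the neutral, frontal reference frame.
    """
    # A partial affine (similarity) transform = scale + rotation + translation,
    # matching the "scale, rotate, translate" operations described above.
    M, _ = cv2.estimateAffinePartial2D(frame_pts.astype(np.float32),
                                       ref_pts.astype(np.float32))
    # Warp so the face is (approximately) frontal; cropping and the PCA on
    # gradient orientations happen downstream.
    return cv2.warpAffine(frame, M, out_size)
```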
0:11:40 Now, the audio features were computed with openSMILE, the toolkit provided by TUM. The audio frame rate is 100 frames per second, and that is why we need to upsample the visual features, which are extracted at 25 frames per second.
0:11:59 We use some standard audio features: the first five PLP cepstral coefficients, energy, loudness, the fundamental frequency, and the probability of voicing, together with their first- and second-order delta coefficients. These are pretty standard features.
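As a sketch of the fusion step: the visual stream at 25 fps is upsampled to the 100 fps audio rate and the two are concatenated per frame. Simple sample-and-hold repetition is assumed here, since the talk does not specify the interpolation used.

```python
import numpy as np

def fuse_audio_visual(audio_feats, visual_feats, ratio=4):
    """Feature-level fusion of audio (100 fps) and visual (25 fps) features.

    audio_feats:  (Ta, Da) low-level audio descriptors.
    visual_feats: (Tv, Dv) visual features, with Ta approximately ratio * Tv.
    """
    up = np.repeat(visual_feats, ratio, axis=0)   # hold each frame ratio times
    T = min(len(audio_feats), len(up))            # guard against off-by-one
    return np.concatenate([audio_feats[:T], up[:T]], axis=1)
```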
0:12:16 For classification, the first approach was to use long short-term memory recurrent neural networks, which Felix has already described, so I will skip this slide.
0:12:32 LSTMs perform dynamic, frame-level classification. For static classification, the main problem is that utterances have different lengths, so in order to extract features that do not depend on the length of the utterance, we simply compute some statistics of the low-level features over the entire utterance, for example the mean of a feature over the entire utterance, or its maximum value, or its range. In this way we convert each utterance into a feature vector of fixed size, and classification of the entire utterance is performed using support vector machines.
0:13:13 Here you can see an overview. We use the same features in both cases: shape or appearance features, plus PLP, energy, F0, loudness, and probability of voicing. In the static case, we compute the statistics over the entire utterance, feed them to an SVM, and get one label for the sequence.
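A minimal sketch of this static pipeline with scikit-learn; the mean, maximum, and range functionals are the examples given in the talk, while the SVM settings are placeholders rather than the configuration actually used.

```python
import numpy as np
from sklearn.svm import SVC

def functionals(llds):
    """Map a variable-length (T, D) low-level descriptor sequence to a
    fixed-size vector via statistics over the whole utterance."""
    return np.concatenate([llds.mean(axis=0),
                           llds.max(axis=0),
                           llds.max(axis=0) - llds.min(axis=0)])  # range

def train_static(utterances, labels):
    """utterances: list of (T_i, D) arrays; labels: one class id each."""
    X = np.stack([functionals(u) for u in utterances])
    clf = SVC()               # kernel and parameters are placeholders
    clf.fit(X, labels)
    return clf
```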
0:13:34 In the second case, when we use the LSTM networks, we simply feed the low-level features to the LSTM (there is no need to compute functionals), and the network provides a label for each frame. We then simply take the majority vote over the frames to label the sequence.
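The majority-vote step itself is tiny; a sketch, assuming the network outputs per-frame class posteriors:

```python
import numpy as np

def label_sequence(frame_probs):
    """frame_probs: (T, C) per-frame class posteriors from the LSTM.
    Returns the utterance label by majority vote over frame-wise decisions."""
    frame_labels = frame_probs.argmax(axis=1)   # one decision per frame
    return np.bincount(frame_labels).argmax()   # most frequent class wins
```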
0:13:55 So now the results. As you can see, for the weighted average, SVMs provide better performance, whereas for the unweighted average, the LSTMs lead to better performance.
0:14:11 This means that SVMs are good at classifying the largest class, which in this case is hesitation and contains more than a thousand examples, but they are not so good at recognising the other classes, whereas the LSTMs do a better job at recognising all the classes. That is why their unweighted average values are much higher.
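For reference, the two measures being compared here, with $C$ classes, $N_c$ test examples and recall $R_c$ for class $c$, and $N=\sum_c N_c$:

```latex
\mathrm{WA} = \frac{1}{N}\sum_{c=1}^{C} N_c\,R_c,
\qquad
\mathrm{UA} = \frac{1}{C}\sum_{c=1}^{C} R_c
```

WA is dominated by the largest class (here, hesitation), while UA weights all classes equally, which is why the LSTMs come out ahead on UA.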
0:14:42 Something else which is also interesting is to compare the performance of the audio-only approach with the audiovisual approach. The audio-only result here, for example, is 64.6 percent; when we add the appearance features, it actually goes down.
0:14:57 This may sound a bit surprising, especially because for visual speech recognition appearance features are considered the state of the art, but there are two reasons. First of all, we use information from the entire face, so there is a lot of redundant information, which can make it harder to get good performance.
0:15:15 The second reason is that, as I showed you before, there is significant head movement. Although we perform the registration step to convert all expressions to frontal pose, it is still not perfect, and especially when there are out-of-plane rotations, that is, when the subject is not looking at the camera but somewhere else, this approach cannot reconstruct the frontal view.
0:15:42 It is also known that appearance features are much more sensitive to registration errors than shape features, so this could be a reasonable explanation of the worse performance when adding the appearance features. When we add the shape information, there is a significant gain, from 64.6 to 72 percent.
0:16:05 Now, if we look at the confusion matrices to see the result per class: the matrix on the left is the result when using the audio information only, and the other one when using audio plus shape; this is for the LSTM networks.
0:16:28 We see that for consent and laughter there is a significant improvement, from 47 to 66 percent and from 63 to 79 percent respectively, whereas for hesitation the performance goes down when we add the extra visual information.
0:16:50 So, to summarise: we saw that the shape features improve the performance for consent and laughter, whereas the appearance features do not seem to do so. They only seem to help in the case of hesitation, and even then the improvement is negligible, from roughly 59 to 59.4 percent, whereas when we combine all the features together there is a bit more improvement. That is the only gain in the case of the appearance features.
0:17:21 Comparing now the LSTM networks with the SVMs, the LSTMs basically do a better job at recognising the different vocalisations, whereas the SVMs mostly recognise the largest class, which is hesitation.
0:17:40 As for future work: as Felix said, in our experiments we have used presegmented sequences, which means we know the start and the end, we extract the sequence, and we do classification. A much harder problem is to do spotting of these non-linguistic vocalisations, where you are given a continuous stream and you do not know the beginning and the end; that is actually our goal.
0:18:03 Especially when adding visual information, this can be a challenging task, because there are cases where the face may not be visible, and in those cases it is likely that we will have to turn off the visual system, for example.
0:18:19 I think this is it; you can have a look at these websites for more information. Thank you very much.
0:18:32 [Chair] Thank you. We have time for a couple of questions.
0:18:38 [Audience] In the paper, as far as I can see (but you can correct me), the illumination was pretty much okay, so I am just wondering: when you go over to more realistic recordings, where the illumination deteriorates, would you expect to get the same amount of improvement for consent and laughter, or not? Have you done any experiments on this?
0:19:18 [Speaker] Well, in this case, the appearance features are definitely influenced by illumination, they are sensitive to it, whereas the shape features are not. The question is whether a difference in illumination can affect the tracker: if the tracker works fine and is robust to illumination changes, then even the shape features are going to provide useful information, because the points will be tracked well.
0:19:47 But no, this is basically an open problem; it has not been solved, and, as you know, computer vision is still behind audio processing; these are problems for which nobody knows the answer. That is why most applications use a controlled environment: for example, in audiovisual speech recognition the subject looks directly at the camera, and it is always a frontal view of the face.
0:20:16 Quite recently there have been some approaches trying to apply these methods to more realistic scenarios, but for cases like you said, a real environment, at least I do not know of an approach that would work well at the moment.
0:20:36 [Chair] Any other questions?
0:20:48 [Speaker] Basically, yes, all the features were upsampled, in both cases, to the same frame rate, although for the SVMs it may not be necessary, since we extract the functionals.
0:20:57 [Audience] Actually, I am asking about the upsampling itself.
0:21:05 [Speaker] Okay, only the visual features are upsampled, to the audio frame rate.
0:21:12 [Chair] Any other questions?
0:21:16 [Audience] You showed a couple of examples. Are there any classes whose definition does not map well onto the visual features? For hesitation, for example, I could frown, or show a different facial expression each time I hesitate, and that variability would also be present in the training data.
0:21:52 [Speaker] That is a fair point. I only showed you one example, but if you look at all the examples, you see that sometimes there is a big difference between them, and sometimes, if you only watch the video without listening to the audio, it is very likely that even humans would be confused between the different vocalisations, for example when hesitation comes in.
0:22:18 So yes, there is variance; in particular for laughter, there are around three hundred examples, and the variance is high.
0:22:29 [Audience] And how was it partitioned into training and test sets? What were the criteria for deciding training and testing?
0:22:34 [Speaker] I think this is the official partitioning; maybe Björn can say more.
0:22:43 [Audience] It was actually just done to be very transparent, similar to the AMI corpus: split by speaker.
0:22:53 [Speaker] Right. So yes, as I said before, there is variance, and sometimes, even if you turn off the audio, you cannot discriminate between those two, I think.
0:23:09 [Audience] Is there covariance between the different...?
0:23:21 [Speaker] I think there is actually also another issue here.
0:23:28 [Chair] I interpret the question as: was there high covariance between, for example, the features, which would explain why you did not get much improvement for hesitation, because the covariance was high between the combined features?
0:23:58 [Speaker] Yeah, it could be. I mean, some expressions are similar but come from different classes.
0:24:13 I am not sure; it could be, because these are spontaneous expressions and they are all different; if you look at all of them, you will not find two that are exactly the same.
0:24:28 [Chair] Okay, thank you.
0:24:31 Any other questions?
0:24:33 Okay, so thank you again.