Thank you. So, as was just mentioned in the introduction, I am going to talk about audio-visual classification of vocal outbursts. This is joint work with colleagues from our group and from the group at TUM.
So, what are vocal outbursts? They are non-linguistic vocalisations, which are usually accompanied by facial expressions as well. Examples include laughter, which is probably the most common, coughing, and breathing, but there are many other types of vocalisations too. We may not realise it, but these vocalisations play an important role in face-to-face conversations. For example, it has been shown that laughter punctuates speech, which means that we tend to laugh at places where punctuation would be placed.
Another example: when both participants in a conversation laugh, it is very likely that this indicates the end of the topic they are discussing and that a new topic will start. Apart from laughter, which is by far the most widely studied vocalisation, most of the other vocalisations are used as a feedback mechanism during interaction. As I said before, they are very common in real conversations, even though we do not always realise it. There have been several works on laughter recognition and classification from audio only, and also a few works on audio-visual classification of laughter, but works on discriminating between different vocalisations are limited compared to laughter, and one of the main reasons is the lack of data.
So our goal in this work was to discriminate between different vocalisations, since we found a suitable data set that contains such vocalisations, and to use not only audio features but also visual features. The idea here is that, since most of the time a facial expression is involved in the production of a vocalisation, this information can be captured by visual features and can improve the performance when it is added to the audio information.
Okay, so the data set we used was the audio-visual interest corpus from TUM, which contains twenty-one subjects and 30,901 turns, recorded in a dyadic interaction scenario: a presenter and a subject interact, and during the interaction several vocalisations occur. The partitioning we used is the same one used in the Interspeech Paralinguistics Challenge, with the same split into training, development, and testing sets, although unfortunately the development column is missing from this slide.
There are four classes of non-linguistic vocalisations: breathing, consent (an agreement sound, something like "yes"), hesitation, and laughter. There is also another class, garbage, which contains other noises and speech. For the experiments I am going to show, we have excluded the breathing class, because in this data set it is mostly not really audible and most of the time there is no facial expression involved.
Okay, so let me show you a few examples. This is an example of laughter from the database. What you can see is that, although there is a camera pointed directly at the face of the subject, there is still significant head movement. This is quite common in naturalistic interactions, whereas, for example, when you build a database where you show the subjects a funny video clip and record their reaction, the subjects are just watching something, so there is not much head movement. In this case, and in real cases in general, head movement is always there.
Okay, and here is an example of hesitation. As you can see, it is quite subtle. And here is an example of consent. In this one there was not much of a facial expression, basically just some head movement. Okay, back to the presentation.
So we used this data set and these vocalisations, and the task is classification into four classes: the three vocalisations plus garbage. We extracted audio and visual features; the numbers you see here I will explain in the following slides. The visual features were up-sampled to match the frame rate of the audio features and then concatenated for feature-level fusion. Classification was performed with two different approaches: one was support vector machines, and the other was long short-term memory recurrent neural networks.
The frame rate for the visual features is twenty-five frames per second, which is a common frame rate. There are two types of visual features: shape features, which are based on a point distribution model, and appearance features, which are based on PCA and gradient orientations. To begin with, we track twenty points on the face, including points on the mouth, the chin, the eyes, and the eyebrows. Here you can see an example of a subject laughing, with the tracked points on the face. As you may see, the tracking is far from perfect, and tracking errors do happen. We initialise the points and then the tracker follows these twenty points.
Now, the main problem, which is present for both shape and appearance features, is that we want to decouple head pose from facial expressions. How do we do this? Basically, we use a point distribution model. Each tracked point has two coordinates, x and y, so if we concatenate all the coordinates we end up with a forty-dimensional vector for each frame. If we now concatenate these vectors from all frames into a matrix, then with K frames we end up with a K-by-forty matrix, and we apply PCA on this matrix. It is well known that the greatest variance of the data lies in the first few principal components. This means that, since there are significant head movements, most of the variance will be captured by the first principal components, whereas facial expressions account for smaller variance and will be encoded in lower components. In this case we found that the first four components correspond to head movement and the remaining ones, from five to ten, to facial expressions. Of course this depends on the data set; in other data sets with even stronger head movement, we have found that the first five or six components correspond to head movement. So the shape features are very simple: they are just the projection of the original forty-dimensional coordinate vector onto the principal components that correspond to facial expressions.
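As a rough illustration of these shape features, here is a minimal NumPy sketch of the idea just described; the mean-centring step, the SVD-based PCA, and the exact component indices (four pose components, expression components up to the tenth) are assumptions for illustration, not the authors' actual code.

```python
import numpy as np

def expression_shape_features(points, n_pose_components=4, n_total_components=10):
    """Shape features that (approximately) discard head movement.

    points: (K, 20, 2) array -- K frames, 20 tracked facial points, (x, y).
    The first principal components are assumed to capture the large-variance
    head movement; the later ones the facial expression.
    """
    K = points.shape[0]
    X = points.reshape(K, -1)            # K x 40 matrix of concatenated coordinates
    X = X - X.mean(axis=0)               # remove the mean shape

    # PCA via SVD: rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)

    # Keep only the components assumed to encode facial expression.
    expression_basis = Vt[n_pose_components:n_total_components]

    # Per-frame shape features = projection onto the expression components.
    return X @ expression_basis.T        # K x (n_total_components - n_pose_components)
```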
Let me give you an example. This is not from this database, but it shows how this principal-component decomposition works. On the top left you see the video stream; on the top right you see the actual tracked points; on the bottom left you see the reconstruction based on the principal components that correspond to head movement; and on the bottom right you see the reconstruction based on the principal components associated with the facial expressions. You can see that when he turns his head, the bottom right remains frontal and shows only the expressions, whereas the bottom left follows the head pose. It is a very simple approach.
Okay, we would like to do the same for the appearance features, that is, remove the head pose, but in this case it is harder. What we do is a common approach in computer vision: we use a reference frame, which is a neutral expression of the subject in frontal view, and we compute the affine transformation between each frame and the reference frame. By affine transformation we mean that we scale, rotate, and translate the face so that it comes to a frontal pose. You can see a very simple example at the bottom: on the left the head is a bit rotated, and after applying scaling, translation, and rotation the face becomes frontal. We then crop an area of the face.
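A minimal OpenCV sketch of this registration step might look as follows; using the tracked points as correspondences, estimating a similarity transform (scale, rotation, translation), and the fixed crop box are assumptions for illustration rather than the exact procedure used in the work.

```python
import cv2
import numpy as np

def register_face(frame, frame_points, ref_points, ref_size, crop_box):
    """Warp a frame towards the reference (neutral, frontal) frame and crop the face.

    frame_points, ref_points: (N, 2) corresponding facial points in the
    current frame and in the reference frame.
    ref_size: (width, height) of the reference frame.
    crop_box: (x, y, w, h) facial region to keep after registration.
    """
    # Scale + rotation + translation mapping the current points onto the reference.
    M, _ = cv2.estimateAffinePartial2D(frame_points.astype(np.float32),
                                       ref_points.astype(np.float32))

    # Warp so that the face is brought (approximately) to a frontal pose.
    registered = cv2.warpAffine(frame, M, ref_size)

    # Crop the area of the face used for the appearance features.
    x, y, w, h = crop_box
    return registered[y:y + h, x:x + w]
```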
Then we apply PCA to the image gradient orientations. I will not go into the details; you can find more information in this paper. The main idea is that, while it is quite common to apply PCA directly on the pixel intensities, this approach, as discussed in the paper, has some advantages, for example it is more robust to illumination, and that is why we decided to use it.
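Only as a very rough sketch of the gradient-orientation idea (the cited paper describes the actual method): each face crop can be represented by the cosines and sines of its gradient orientations before PCA, instead of the raw intensities. The Sobel gradients and the number of components here are placeholders.

```python
import cv2
import numpy as np

def gradient_orientation_vector(gray_face):
    """cos/sin representation of the image gradient orientations of a face crop."""
    gx = cv2.Sobel(gray_face, cv2.CV_64F, 1, 0)
    gy = cv2.Sobel(gray_face, cv2.CV_64F, 0, 1)
    theta = np.arctan2(gy, gx)
    return np.concatenate([np.cos(theta).ravel(), np.sin(theta).ravel()])

def appearance_features(face_crops, n_components=20):
    """PCA over the orientation vectors of all registered face crops."""
    X = np.stack([gradient_orientation_vector(f) for f in face_crops])
    X = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T       # per-frame appearance features
```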
Now, the audio features were computed with openSMILE, which is a toolkit provided by TUM. The audio frame rate is one hundred frames per second, and that is why we need to up-sample the visual features, which are extracted at twenty-five frames per second. We use some standard audio features: the first five PLP cepstral coefficients, intensity, loudness, fundamental frequency, and probability of voicing, together with their first- and second-order delta coefficients, so pretty standard features.
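The up-sampling and feature-level fusion of the two streams could be sketched like this; simple frame repetition and the trimming to a common length are assumptions, since the talk only states that the visual features were up-sampled to the audio frame rate and concatenated.

```python
import numpy as np

def fuse_audio_visual(audio_feats, visual_feats, audio_rate=100, video_rate=25):
    """Feature-level fusion of audio (e.g. openSMILE LLDs at 100 fps)
    and visual features (25 fps) by repeating each visual frame."""
    factor = audio_rate // video_rate                    # e.g. 100 / 25 = 4
    visual_up = np.repeat(visual_feats, factor, axis=0)  # match the audio frame rate

    # Trim both streams to a common length and concatenate per frame.
    T = min(len(audio_feats), len(visual_up))
    return np.hstack([audio_feats[:T], visual_up[:T]])
```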
For classification, the first approach was to use long short-term memory recurrent neural networks, which Felix has already described, so I am going to skip this slide. These are used for dynamic classification. For static classification, the main problem is that each utterance has a different length, so in order to extract features that do not depend on the length of the utterance, we simply compute some statistics of these low-level features over the entire utterance, for example the mean of a feature over the entire utterance, or its maximum value, or its range. In this way each utterance is represented by a feature vector of fixed size, and classification is then performed for the entire utterance using support vector machines.
In this diagram you can see it for the same features: the appearance features, PLP, energy, F0, loudness, and probability of voicing. In the static case we compute the statistics over the entire utterance, we feed them to an SVM, and we get one label for the sequence.
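A minimal sketch of this static approach, assuming the mean, maximum, and range functionals mentioned above and a scikit-learn SVM (the kernel and any other settings are placeholders):

```python
import numpy as np
from sklearn.svm import SVC

def utterance_functionals(frames):
    """Fixed-size representation of a variable-length (T, D) feature sequence."""
    return np.concatenate([frames.mean(axis=0),                       # mean per feature
                           frames.max(axis=0),                        # maximum per feature
                           frames.max(axis=0) - frames.min(axis=0)])  # range per feature

def train_static_svm(utterances, labels):
    """One functional vector and one label per utterance."""
    X = np.stack([utterance_functionals(u) for u in utterances])
    clf = SVC(kernel="rbf")
    clf.fit(X, labels)
    return clf
```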
In the second case, when we use the LSTM networks, we simply feed in the low-level features, with no need to compute functionals. The LSTM networks provide a label for each frame, and then we simply take the majority vote over the frames to label the sequence.
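A minimal sketch of the dynamic approach, with a frame-level LSTM classifier and a majority vote over the frame decisions; the framework (Keras), the layer size, and the missing training loop are placeholders rather than the configuration used in this work.

```python
import numpy as np
from collections import Counter
import tensorflow as tf

def build_frame_level_lstm(input_dim, n_classes):
    """LSTM that outputs a class distribution for every frame of the utterance."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(None, input_dim)),
        tf.keras.layers.LSTM(64, return_sequences=True),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])

def label_utterance(model, frames):
    """Per-frame predictions, then a majority vote to label the whole utterance."""
    probs = model.predict(frames[np.newaxis], verbose=0)[0]  # (T, n_classes)
    frame_labels = probs.argmax(axis=1)
    return Counter(frame_labels.tolist()).most_common(1)[0][0]
```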
Now to the results. As you can see, for the weighted average the SVMs provide better performance, whereas for the unweighted average the LSTMs lead to better performance. What this means is that the SVMs are good at classifying the largest class, which in this case is hesitation and contains more than a thousand examples, but they are not so good at recognising the other classes, whereas the LSTMs do a better job at recognising all classes. That is why their unweighted-average values are usually much higher.
Something which is also interesting is to compare the performance of the audio-only approach with the audio-visual approach. For example, here you see that the audio-only performance is 64.6 percent; when we add the appearance features it actually goes down. This may sound a bit surprising, because especially for visual speech recognition appearance features are considered the state of the art, but there are two reasons. First of all, we use information from the entire face, so there is a lot of redundant information, which may make it harder to improve the performance. The second reason is that, since there is significant head movement, as I showed you before, the registration step we perform to convert all expressions to a frontal pose is still not perfect, especially when there are out-of-plane rotations, which means the subject is not looking at the camera but somewhere else; in that case this approach cannot properly reconstruct the frontal view. And it is known that appearance features are much more sensitive to registration errors than shape features. So this could be a reasonable explanation for the worse performance when adding the appearance features, whereas when we add the shape information we see a significant gain, from 64.6 to 72 percent.
Now, if we look at the confusion matrices to see the result per class: the one on the left is the result when using the audio information only, and the other is when using audio plus shape; this is for the LSTM networks. We see that for consent and laughter there is a significant improvement, from 47 to 66 percent and from 63 to 79 percent, whereas for hesitation the performance actually goes down when we add this extra visual information.
So, to summarise: we saw that shape features improve the performance for consent and laughter, whereas appearance features do not seem to do so. Only when they are combined with shape features do they seem to help, in the case of the support vector machines, and even then the improvement is negligible, from about 59 to 59.4 percent; when we combine all the features together there is a bit more improvement. So this is the situation for the appearance features. Comparing now the LSTM networks with the SVMs, the LSTMs basically do a better job at recognising the different vocalisations, whereas the SVMs mostly recognise the largest class, which is hesitation.
As for future work: as Felix said, in our experiments we have used pre-segmented sequences, which means we know the start and the end, we extract the sequence, and we do classification. A much harder problem is to do spotting of these non-linguistic vocalisations, which means you are given a continuous stream and you do not know the beginning and the end. That is actually our goal, especially using audio-visual information. This could be a challenging task, because there are cases where the face may not be visible, and in such cases it is likely that we will have to turn off the visual stream, for example. And I think that is it; you can have a look at the websites of our group and of the group at TUM for more information. Thank you very much.
Thank you very much. We have time left for a couple of questions.

In the paper, as far as I can tell, but you can correct me, the illumination was pretty much okay, so I am just wondering: when you go over to more realistic recordings where the illumination varies, would you expect to get the same amount of improvement for consent and laughter, or not? Have you done anything on this?
Well, in this case the appearance features are definitely influenced by illumination, they are sensitive to it, whereas the shape features are not. The question is whether a difference in illumination can affect the tracker. If the tracker works fine even with changes in illumination, then even the shape features alone can provide useful information, because the points will still be tracked correctly. But basically this is an open problem, because it has not been solved, and, as you know, computer vision is still behind audio processing; these are problems to which nobody knows the answer yet. That is why most applications use a simplified scenario: for example, in audio-visual speech recognition the subject is looking directly at the camera and it is always a frontal view of the face. Quite recently there have been some approaches trying to apply these methods to more realistic scenarios, but for cases like you said, a real environment, at least I do not know of an approach that would work well at the moment.
Are there other questions?

Basically, yes, all the features were up-sampled in both cases to the same frame rate, although for the SVMs it may not be necessary, since we compute the functionals over the whole utterance. Only the visual features are up-sampled.
Are there any other questions?

You showed a couple of example clips. Are there any issues with the class definitions with respect to the visual features? For example, for hesitation, I could imagine more than one type of facial expression accompanying a hesitation. Is there enough consistency to actually use it in training?
There is variability, yes. I mean, just from the examples: if you look at all of them you will see that sometimes there is a big difference between them, and sometimes, if you only watch the video without listening to the audio, it is very likely that even humans would be confused between the different vocalisations, for instance between hesitation and consent. So yes, there is variance, and in particular for laughter, for which there are around three hundred examples, the variance is high.
And how was the partitioning into training and test set decided? What were the criteria for choosing training and testing?

I think this is the official partitioning, but maybe you can say more about it.

Actually, it was done just as for the challenge, and it was done to be very transparent, splitting by speaker, similar to what has been done for other corpora.
So yes, it is clear that there is variance, and as I said before, sometimes even if you turn off the audio you cannot discriminate between those two.

I think the difference between the classes was my main concern, and I think there is actually also another issue.
So, the question was about the covariance between the features. I interpreted it as: if there was high covariance between, for example, the different feature sets, could that explain why you did not get much improvement for hesitation, because the covariance is high between the combined features?
Yes, it could be. I mean, some expressions are similar even though they belong to different classes. I am not sure; it could be, because these are spontaneous expressions and they are all somewhat different. If you look at all of them, you will not find two that are exactly the same.
Okay, thank you.

Are there any other questions? Okay, so thank you again.