Thank you. As was just mentioned, this talk is about the classification of vocal outbursts.

This is joint work between the group at TUM and our group at Imperial College London.

So, what are vocal outbursts? They are non-linguistic vocalisations which are usually accompanied by facial expressions as well. Examples include laughter, which is probably the most common, coughing and breathing, but there are many other types of vocalisations too.

We may not realise it, but these vocalisations play an important role in natural conversations.

For example, it has been shown that laughter punctuates speech, which means that we tend to laugh at places where punctuation would be placed.

Another example: when, in a conversation, the two participants laugh simultaneously, it is very likely that this indicates the end of the topic they are discussing, and that a new topic will start.

Apart from laughter, which is by far the most widely studied vocalisation, most of the other vocalisations are used as a feedback mechanism during interaction. As I said before, they are very common in real conversations, even though we do not realise it.

There have been several works on laughter recognition and classification from audio only, and also a few works on audiovisual classification of laughter, but works on discriminating between different vocalisations are limited compared to laughter. One of the main reasons is the lack of data.

So our goal in this work was to discriminate between different vocalisations, since we finally have a dataset that contains such vocalisations, using not only audio features but also visual features.

The idea here is that, since most of the time there is a facial expression involved in the production of a vocalisation, this information can be captured by visual features, so it can improve the performance when it is added to the audio information.

Okay, so the dataset we used was the audiovisual interest corpus from TUM, which contains 21 subjects and 3,901 turns. It is a dyadic interaction scenario: a presenter and a subject interact, and during the interaction there are several vocalisations.

The partitioning into training, development and test sets is the same one used in the Speech Paralinguistics Challenge, although unfortunately the development column is missing from this slide.

There are four non-linguistic vocalisation classes: breathing, consent (like "yes"), hesitation and laughter. There is also another class, garbage, which contains other noises and speech.

For the experiments I am going to show you, we have excluded the breath class, because most of the time it is barely audible in this dataset, and most of the time there is no visible facial expression involved.

Okay, so let me show you a few examples. This is an example of laughter from the database.

What you can see is that, although there is a camera pointed directly at the face of the subject, there is still significant head movement. This is quite common in naturalistic interactions, whereas, for example, in induced databases, where the subjects watch a funny video clip and we record their reaction, the head is mostly static, since they just watch something and there is little or no head movement. In real cases the head movement is always there.

Let me also show you an example of hesitation; as you can see, it is quite subtle. And here is an example of consent: in this case there is hardly any facial expression, just some head movement.

Okay, back to the presentation.

So we use this dataset and we do classification of the vocalisations into four classes: the three vocalisations plus garbage.

This slide is just an overview; I will explain each part in the next slides. We extract audio and visual features. The visual features are upsampled to match the frame rate of the audio features, and the two streams are then concatenated for feature-level fusion. Classification is performed with two different approaches: one is support vector machines, and the other is long short-term memory recurrent neural networks.

Okay, so the frame rate of the visual features is 25 frames per second, which is a common frame rate, and there are two types of features: shape features, which are based on a point distribution model, and appearance features, which are based on PCA of the image gradient orientations.

So, in the beginning we track 20 points on the face; these include points on the mouth outline, one point on the chin, points around the eyes and two for each eyebrow. Here you can see an example of a subject laughing, with the points tracked on the face. As you may see, the tracking is not perfect, and that happens; you should not expect perfect tracking. We initialise the points and then the tracker follows these 20 points.

Now, the main problem we have, and it is present for both the shape and the appearance features, is that we want to decouple head pose from facial expressions.

For the shape features we do this with the point distribution model. Each point has two coordinates, x and y, so if we concatenate all the coordinates we end up with a 40-dimensional vector for each frame. If we now stack these vectors from all frames into a matrix, then for K frames we end up with a K-by-40 matrix, and we apply PCA to this matrix.

It is well known that the greatest variance of the data lies in the first few principal components. Since there are significant head movements, most of the variance will be captured by the first principal components, whereas facial expressions account for a smaller variance and will be encoded in the lower components.

In this case we found that the first four components correspond to head movement and the remaining ones, roughly five to ten, to facial expressions. Of course this depends on the dataset; in other datasets with even stronger head movement we considered the first five or six components as corresponding to head movement.

So the shape features are very simple: they are just the projections of the 40-dimensional coordinate vector of each frame onto the principal components that correspond to facial expressions.
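As a minimal sketch of how this decoupling could be implemented (assuming 20 tracked points per frame and the component split mentioned above; the function and variable names are mine, not the authors'):

```python
import numpy as np

def expression_shape_features(points, n_head=4, n_expr_last=10):
    """points: (K, 20, 2) array with the x, y coordinates of the 20 tracked
    facial points over K frames. The split into head-motion components
    (the first n_head) and expression components (up to n_expr_last) is
    dataset-dependent, as noted in the talk."""
    K = points.shape[0]
    X = points.reshape(K, -1)                  # K x 40 matrix of concatenated coordinates
    X = X - X.mean(axis=0, keepdims=True)      # centre the data before PCA

    # PCA via SVD of the centred data matrix; rows of Vt are the principal directions
    _, _, Vt = np.linalg.svd(X, full_matrices=False)

    expr_basis = Vt[n_head:n_expr_last]        # components assumed to encode expression
    return X @ expr_basis.T                    # K x (n_expr_last - n_head) shape features
```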

Let me show you an example. This is not from this database, but it gives an idea of how this principal component decomposition works.

On the top left you see the original video, on the top right the actual tracked points, on the bottom left the reconstruction based on the principal components that correspond to head movement, and on the bottom right the reconstruction based on the principal components that correspond to facial expressions.

You can see that when the subject turns the head, the bottom right remains frontal and shows only the expressions, whereas the bottom left follows only the head pose. It is very simple, but it works.

Okay, as we did for the shape features, for the appearance features we also want to remove the head pose, and in this case it is harder.

What we do is the common approach in computer vision: we use a reference frame, which is a neutral, frontal expression of the subject, and we compute the affine transformation between each frame and the reference frame. By affine transformation we mean that we basically scale, rotate and translate the face so that it becomes frontal. You can see a very simple example at the bottom: on the left the head is a bit rotated, and after applying the scaling, translation and rotation the face becomes frontal.

Then we crop the face area and we apply PCA to the image gradient orientations. I will not go into the details here; you can find more information in this paper. The main point is that, although it is quite common to apply PCA directly to the pixel intensities, this approach, as discussed in the paper, has some advantages; for example, it is more robust to illumination. That is why we decided to use it.
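As a rough sketch of the registration and cropping step, under the assumption that a similarity transform between the tracked points of the current frame and those of the neutral reference frame is used (OpenCV is my choice here, not necessarily the authors' implementation):

```python
import cv2
import numpy as np

def register_and_crop(frame_gray, frame_pts, ref_pts, crop_box):
    """Warp the current frame so its tracked points align with the reference
    (neutral, frontal) frame, then crop the face region.
    crop_box = (x, y, w, h) in reference-frame coordinates."""
    # Scale + rotation + translation estimated from the two point sets
    M, _ = cv2.estimateAffinePartial2D(frame_pts.astype(np.float32),
                                       ref_pts.astype(np.float32))
    h, w = frame_gray.shape
    warped = cv2.warpAffine(frame_gray, M, (w, h))
    x, y, cw, ch = crop_box
    return warped[y:y + ch, x:x + cw]

def gradient_orientations(face_patch):
    """Per-frame descriptor: image gradient orientations of the cropped face,
    flattened; PCA is then applied over all frames, as described above."""
    gy, gx = np.gradient(face_patch.astype(np.float64))
    return np.arctan2(gy, gx).ravel()
```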

Now, the audio features were computed with openSMILE, the toolkit provided by TUM. Their frame rate is one hundred frames per second, and that is why we need to upsample the visual features, which are extracted at 25 frames per second.

We use some standard audio features: the first five PLP coefficients, energy, loudness, the fundamental frequency and the probability of voicing, together with their first and second order delta coefficients. These are pretty standard features.
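A minimal sketch of the feature-level fusion, assuming the visual features are simply repeated to go from 25 to 100 frames per second (the exact interpolation is not specified in the talk):

```python
import numpy as np

def fuse_audio_visual(audio_feats, visual_feats, rate_ratio=4):
    """audio_feats: (T_audio, D_a) at 100 fps; visual_feats: (T_video, D_v) at 25 fps.
    Upsample the visual stream and concatenate frame by frame."""
    visual_up = np.repeat(visual_feats, rate_ratio, axis=0)
    T = min(len(audio_feats), len(visual_up))   # guard against small length mismatches
    return np.concatenate([audio_feats[:T], visual_up[:T]], axis=1)
```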

For classification, the first approach was to use long short-term memory recurrent neural networks, which Felix described earlier, so I am just going to skip this slide; these are used for dynamic, frame-by-frame classification.

For static classification the main problem is that the utterances have different lengths, so in order to extract features that do not depend on the length of the utterance we simply compute some statistics, functionals, of the low-level features over the entire utterance; for example the mean of a feature over the entire utterance, or its maximum value, or its range. In this way each utterance is represented by a feature vector of fixed size, and classification for the entire utterance is performed using support vector machines.

In the diagram here you can see that both pipelines use the same features: the visual features, PLP, energy, F0, loudness and the probability of voicing. In the static case we compute the statistics over the entire utterance, feed them to an SVM, and get one label for the whole sequence.
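A small sketch of the static pipeline, assuming a simple set of functionals (mean, maximum, range); the exact functionals and SVM settings used in the paper may differ:

```python
import numpy as np
from sklearn.svm import SVC

def utterance_functionals(llds):
    """llds: (T, D) low-level descriptors of one utterance.
    Returns a fixed-size vector regardless of the utterance length T."""
    return np.concatenate([llds.mean(axis=0),
                           llds.max(axis=0),
                           llds.max(axis=0) - llds.min(axis=0)])

# Hypothetical usage: one functional vector and one label per utterance
# X_train = np.stack([utterance_functionals(u) for u in train_utterances])
# clf = SVC().fit(X_train, y_train)
# y_pred = clf.predict(np.stack([utterance_functionals(u) for u in test_utterances]))
```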

In the second case, when we use the LSTM networks, we simply feed the low-level features, with no need to compute functionals, to the LSTM network, which provides a label for each frame. Then we take the majority vote over the frames and label the sequence accordingly.
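A hedged sketch of the dynamic pipeline: a frame-wise LSTM classifier (layer sizes and the use of Keras are my assumptions, not the original setup) followed by the majority vote over frame labels described above:

```python
import numpy as np
from tensorflow.keras import layers, models

def build_framewise_lstm(n_features, n_classes, n_hidden=64):
    """One softmax output per frame; sequences may have variable length."""
    model = models.Sequential([
        layers.LSTM(n_hidden, return_sequences=True,
                    input_shape=(None, n_features)),
        layers.TimeDistributed(layers.Dense(n_classes, activation="softmax")),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model

def sequence_label(frame_labels):
    """Majority vote over the per-frame predictions to label the utterance."""
    values, counts = np.unique(frame_labels, return_counts=True)
    return values[np.argmax(counts)]
```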

Now the results. As you can see, for the weighted average recall the SVMs provide better performance, whereas for the unweighted average recall the LSTMs lead to better performance.

What this means is that the SVMs are good at classifying the largest class, which in this case is hesitation and contains more than a thousand examples, but they are not so good at recognising the other classes, whereas the LSTMs do a better job at recognising all the classes. That is why their unweighted average values are usually much higher.
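For reference, this is how weighted and unweighted average recall relate to a confusion matrix (a standard definition, not specific to this paper):

```python
import numpy as np

def wa_ua_recall(conf):
    """conf[i, j]: number of examples of true class i predicted as class j.
    WA recall is overall accuracy, dominated by the largest class;
    UA recall averages the per-class recalls, treating all classes equally."""
    per_class = np.diag(conf) / conf.sum(axis=1)
    wa = np.diag(conf).sum() / conf.sum()
    ua = per_class.mean()
    return wa, ua
```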

Something else that is also interesting is to compare the performance of the audio-only approach with the audiovisual approach. The audio-only result here, for example, is 64.6 percent; when we add the appearance features, the performance actually goes down.

This may sound a bit surprising, because especially for visual speech recognition appearance features are considered the state of the art, but there are two reasons. First of all, we use information from the entire face, so there is a lot of redundant information, which may make it harder to improve the performance. The second reason is that, as I showed you before, there is significant head movement. Although we apply the registration step to convert all expressions to a frontal pose, it is not perfect, especially when there are out-of-plane rotations, which means the subject is not looking at the camera but somewhere else; in that case it is impossible for this approach to reconstruct a frontal view. And it is known that appearance features are much more sensitive to registration errors than shape features. So this is a reasonable explanation for the drop in performance when adding the appearance features.

When we add the shape information, on the other hand, there is a significant gain, from 64.6 to 72 percent.

Now, if we look at the confusion matrices to see the result per class: the matrix on the left is the result when using the audio information only, and the one on the right is when using audio plus shape; this is for the LSTM networks.

We see that for consent and laughter there is a significant improvement, from 47 to 66 percent and from 63 to 79 percent respectively, whereas for hesitation the performance goes down slightly when we add this extra visual information.

So, to summarise: we saw that the shape features improve the performance for consent and laughter, whereas the appearance features do not seem to do so. The only case where they seem to help is consent with the SVMs, and even there the improvement is negligible, from roughly 59 to 59.4 percent; when we combine all the features together there is somewhat more improvement. So that is the situation for the appearance features.

Comparing now the LSTM networks with the SVMs, the LSTMs basically do a better job at recognising the different vocalisations, whereas the SVMs mostly recognise the largest class, which is hesitation.

Finally, for future work: as Felix said, in our experiments we have used pre-segmented sequences, which means we know the start and the end, we extract the sequence, and we do classification. A much harder problem is to do spotting of these non-linguistic vocalisations, which means we are given a continuous stream and we do not know the beginning and the end; that is actually our goal.

This is especially challenging when adding visual information, because there are cases where the face may not be visible, so in those cases it is likely that we would have to turn off, for example, the visual stream.

I think this is it. If you want to take a look, these are our websites. Thank you very much.

Thank you very much. We have time for a couple of questions.

As far as I can tell, and you can correct me, the illumination in this database was pretty much okay, so I am just wondering: when you move to more realistic recordings, where the illumination varies a lot, would you expect to get the same amount of improvement for consent and laughter, or not? Have you done any experiments on this?

Well, in this case, the appearance features are definitely influenced by illumination; they are sensitive to it, whereas the shape features are not. The question is whether the difference in illumination affects the tracker. If the tracker works fine even with large changes in illumination, then the shape features can still provide useful information, because the points will be tracked correctly. But basically this is an open problem; it has not been solved, and as you know computer vision is still behind audio processing, so these are problems to which nobody knows the answer. That is why most applications use a single, easy setting; for example, in audiovisual speech recognition the subject is looking directly at the camera and it is always a frontal view of the face. Quite recently there have been some approaches trying to apply these methods to more realistic scenarios, but for cases like the one you describe, a real environment, at least I do not know of an approach that would work well at the moment.

Any other questions?

Basically, yes, all the features were brought to the same frame rate in both cases: the visual features are upsampled, although for the SVM case it may not be strictly necessary, since we compute the functionals over the entire utterance anyway.

Any other questions?

You showed a couple of examples with the videos. Is there much variation within the classes with respect to the visual features? For example, for hesitation I could imagine more than one facial expression accompanying a hesitation. And were you able to account for that in training?

So, there is variance. I mean, I just showed you one example, but if you look at all the examples you will see that sometimes there are big differences, and sometimes, if you only watch the video without listening to the audio, it is very likely that even humans would be confused between the different vocalisations, for example between hesitation and consent.

So yes, there is variance, and in particular for laughter: there are around three hundred examples of laughter, and the variance is high.

As for the training and test sets, I think this is the official partitioning, so maybe the corpus authors can say more about the criterion for deciding training and testing.

It was actually the same split as the one used for the challenge, and it was done to be very transparent: similar to the AMI corpus, the split was done by speaker ID.

So yes, it is clear that there is variance, and as I said before, sometimes even if you turn off the audio you cannot discriminate between those two classes.

I think there is also confusion between the different classes; that was my main point.

I think there is actually also another issue. I interpreted the question as asking whether there is high covariance between, for example, the audio and the visual features, which would explain why you did not get much improvement for hesitation: because the covariance between the combined features is high.

Yes, it could be. I mean, some expressions are similar but come from different classes. I am not sure; it could be, because these are spontaneous expressions and they are all different. If you look at all of them you will not find two that are exactly the same.

Okay, thank you.

Any other questions?

Okay, so thank you again.