Good morning, everybody. This paper is about tracking changes in continuous emotion states using body language and prosody. It is joint work with my co-authors at the University of Southern California, where this work was done.

So, here is a video of an expressive interaction between two actors.

[Video plays.]

All right. As this example shows, through the course of time there is a continuous flow of body language and prosodic cues, and at the same time the emotional state of the two actors evolves, with variable intensity and clarity.

So the focus of this work is, first, to examine the emotional content of these body language gestures: how are different body language gestures indicative of different underlying emotional states of the actors? And second, we want to use this body language and prosodic information to continuously track the evolving emotional state through the course of time.

For this paper, we pose the problem as a tracking problem: we are trying to track the underlying emotional state through time, and we use a mapping between that emotional state and the observed audiovisual cues, so that from the observed audiovisual cues we can predict the underlying emotional state.

The database that we use for this work is the USC CreativeIT database; you just saw an example from this database. It is a multimodal and multidisciplinary database that was collected as a collaboration between the USC engineering department and the USC theatre school, and it consists of a variety of dyadic theatrical improvisations. We asked actors from the theatre school to come to our motion capture lab and perform improvised plays and theatrical exercises, and at the same time we recorded them: we placed motion capture markers on their bodies, as you can see, and we recorded them with video cameras, motion capture, and close-up audio from body microphones. More details on this database are available at this website, but it is very important to keep in mind that it contains a large variety of very expressive body language.

Now, we wanted an annotation of the underlying emotional state of the two actors during their interaction, but we found that categorical emotion labels, such as angry, happy, or sad, were too restrictive to cleanly describe the emotional state of an actor. So we went for dimensional emotional descriptors, which are widely used in the emotion research community. These descriptors are: activation, which describes how excited versus calm a person is; valence, which describes how positive versus negative the attitude of a person is; and dominance, which describes how dominant versus submissive a person is in an interaction.

Also, we did not want to chop the recordings into arbitrary segments, because we want to analyze the continuous flow of audiovisual cues, so we decided to go for a continuous annotation of these emotional descriptors through the course of time. We used the Feeltrace instrument, which is software that allows people, as they watch a video, to give continuous annotations of valence, activation, and dominance. Now I would also like to show a demo of the use of Feeltrace, with three annotators giving a rating of activation as they watch the video.
[Video plays.]

So, as you can see, three different people are giving a rating of the male actor through time, and it seems reasonable: in the beginning the actor is talking a lot and moving a lot, whereas toward the end he turns around, walks to the end of the room, and does not talk much, so basically his rated activation decreases. Of course, we can also see differences between the different annotators, which brings us to the next question: how do we define whether different annotators agree?

We noticed, by examining our data, that annotators agree much more on the trend than on the absolute values; that is, they agree on whether an emotion attribute is increasing, decreasing, or staying stable, rather than on the absolute value of the attribute. This is expected, since for humans it seems to be easier to judge emotion in relative terms, to say that someone is more active or less active, rather than in absolute terms. So, motivated by this, we decided to define evaluator agreement in this work as positive correlation between the different annotation curves.

In this plot you can see an example of our annotations: the blue lines are three different annotators rating the recording through the course of time, and the red line is the mean annotation, which will be used as ground truth for our experiments. In this work we are using around thirty different instances of people interacting.

Now, moving on to the feature extraction: we decided to extract a variety of body language features that are intuitive and inspired by psychology, so that they are well suited for our analysis purposes. We have two different types of features. The first type is absolute features, which concern just one person: the body posture of that person and the movement of that person. The second type is relative features of one person with respect to his interlocutor, which capture things like leaning toward the other person, looking at the other person, or approaching versus avoiding the other person. These are interesting because you need both people to extract them, so they carry information about the interaction. I should add that the purpose of this analysis is to see how an actor's behavior changes when the underlying emotional attribute either increases, decreases, or stays stable.

Specifically, the features were extracted from the motion capture markers in an automatic, straightforward manner. You can see a list of some of the features that we used: for example, there are absolute features, such as hand velocity, and relative features, such as the relative velocity of one actor with respect to the other.

From these low-level features we can derive high-level behaviors. Take, for example, the relative velocity: if it is positive, that means I am moving toward the other person; if it is negative, that means I am moving away from the other person; and if it is very close to zero, I am not moving. So by thresholding this value, I can infer high-level behaviors, for example that this actor is approaching the other person; a minimal sketch of this step follows below.
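As a concrete illustration, here is a minimal sketch in Python of how a signed relative-velocity feature could be thresholded into coarse behavior labels. The function name and the threshold value are hypothetical choices for illustration, not the exact implementation used in this work.

```python
import numpy as np

# Hypothetical dead-zone threshold separating "moving" from "still";
# in practice this would be tuned on the data.
EPS = 0.05

def approach_labels(rel_velocity):
    """Map the signed relative velocity of one actor toward the other
    (positive = closing the distance) to a per-frame behavior label."""
    labels = np.full(rel_velocity.shape, "not moving", dtype=object)
    labels[rel_velocity > EPS] = "approaching"    # moving toward the other actor
    labels[rel_velocity < -EPS] = "moving away"   # moving away from the other actor
    return labels

# Example: a short trajectory of relative velocities
print(approach_labels(np.array([0.2, 0.01, -0.3])))
# -> ['approaching' 'not moving' 'moving away']
```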
Now, for the statistical analysis of these high-level behaviors, we use the ground truth to select regions where we have confidence in the mean estimation, that is, where the different annotators agree, and within those we select regions of increase, decrease, or stability of the emotional attribute. Then we compare the body language behaviors across these different regions and check whether there are statistically significant differences between them. We find meaningful results for the activation and dominance emotion attributes.

As an example, when we compare regions of activation increase with regions of activation decrease, we notice that when a person's activation is increasing, he tends to walk more toward the other person and move his hands more toward the other person, whereas when his activation is decreasing, he moves more away from the other person. When we look at dominance increase versus dominance decrease, we see that when a person's dominance is increasing, he tends to look at or move toward the other person, whereas when his dominance is decreasing, he looks away more or turns away from the other person. These results seem intuitive, and they are in agreement with qualitative results from psychology. Unfortunately, we were not able to find statistically significant behaviors for the valence attribute, which is also reflected later in our tracking results.

Now let us go to the tracking part; I also have a video here. Our goal for the tracking experiment is to use all the audiovisual features to track the underlying emotional state, which is the red curve that you see here; the gray curve is our estimate of the underlying emotional state through the course of time, in this case the activation of the actor for one of our recordings.

In order to do this, we use a Gaussian mixture model based mapping, which has been used, among other applications, for articulatory-to-acoustic speech inversion. It consists of finding a mapping between the continuous underlying emotional state at a specific time instant, which I denote by x, and the observed body language or prosodic feature vector at that instant, which I denote by y. We model their joint distribution with a Gaussian mixture model. Therefore, the conditional distribution p(x|y) of the underlying emotional state given the observed audiovisual feature vector is also a Gaussian mixture model, and we obtain the optimal maximum likelihood mapping by selecting the underlying emotional state that maximizes this conditional probability. This is an iterative process, carried out through an expectation-maximization algorithm, and it converges to the maximum likelihood mapping.

In practice, we do not want to use only the information at a single time instant; we want to take temporal context into account. To do this, we also use first and second derivatives: we augment the feature vector with these derivatives, which enables us to take temporal context into account and produce smoother emotional trajectory estimates. A minimal sketch of this kind of mapping follows below.
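To make the mapping concrete, here is a minimal sketch, in Python with scikit-learn, of a GMM-based feature-to-emotion mapping. For simplicity it computes the conditional expectation E[x|y] in closed form, rather than the full iterative maximum likelihood trajectory estimation with dynamic features described in the talk; the function names are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(x, Y, n_components=8):
    """Fit a GMM to joint vectors z_t = [x_t, y_t], where x_t is the
    annotated emotion value and y_t the audiovisual feature vector."""
    Z = np.hstack([x.reshape(-1, 1), Y])           # shape (T, 1 + D)
    return GaussianMixture(n_components=n_components,
                           covariance_type="full").fit(Z)

def map_features_to_emotion(gmm, Y):
    """MMSE mapping: E[x | y] = sum_m P(m | y) * E[x | y, m]."""
    T, M = Y.shape[0], gmm.n_components
    cond_means = np.zeros((T, M))
    log_marg = np.zeros((T, M))
    for m in range(M):
        mu, S = gmm.means_[m], gmm.covariances_[m]
        mu_x, mu_y = mu[0], mu[1:]
        S_xy, S_yy = S[0, 1:], S[1:, 1:]
        A = np.linalg.solve(S_yy, S_xy)            # Sigma_yy^-1 Sigma_yx
        cond_means[:, m] = mu_x + (Y - mu_y) @ A   # per-component E[x | y, m]
        log_marg[:, m] = (np.log(gmm.weights_[m])
                          + multivariate_normal.logpdf(Y, mu_y, S_yy))
    post = np.exp(log_marg - log_marg.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)        # component posteriors P(m | y_t)
    return (post * cond_means).sum(axis=1)         # E[x_t | y_t], shape (T,)
```

The derivative augmentation described above would simply append first and second differences of the features (and of the target) before fitting, with a trajectory-smoothing step at estimation time.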
Now, I would like to use not only body language but also speech cues in our feature vector. However, while the body language is relevant for the whole interaction, the audio cues are only relevant when the actor is actually speaking, which is only part of the time. So we needed a solution for this, and we decided to train two Gaussian mixture model mappings. One mapping is trained on the portions where the actor is not speaking; this is the body-only GMM mapping, which uses only body features. The second is trained on the portions where the actor is speaking; this is the audiovisual GMM mapping, which uses both body and speech features. Then, at the testing stage, we split each test recording into consecutive overlapping segments, and for each segment we apply the appropriate mapping, according to whether the actor is speaking or not, in order to obtain the final curve estimate. The audio features that we are using are pitch and energy features.

Here are some of our tracking results. In this plot, the red line is the mean annotation, the blue line is the maximum likelihood estimation trajectory using only body features, and the green line is the estimate using body and speech features together. This is one of our best results: we achieve a correlation between the ground truth and our estimate on the order of 0.8, which increases to 0.81 when we also use the audio information. The second plot shows a more typical, median result: the correlation between the ground truth and the blue line, which uses only body features, is around 0.4; when we also use the audio features, it increases significantly, to around 0.5. In general, what we can see from these two plots is that the estimated trajectories follow the trends of the underlying emotional state rather than its absolute values, so we can argue that we track emotional changes rather than absolute emotional values. This is in part explained by the fact that it is easier to quantify and annotate emotion in relative terms rather than in absolute terms.

Now, moving on to the overall results: we evaluate the performance of tracking emotional changes by measuring the correlation between the ground truth and the estimate. To provide an upper bound that shows the difficulty of the problem, we also report the inter-annotator correlation, that is, the correlation between the annotators, for each recording.

These are our results. For the activation case, the median correlation between the ground truth and the body-only MLE mapping is 0.31; when we use the audiovisual mapping, the median correlation increases to 0.42, a significant increase; and the inter-annotator correlations are around 0.55, so one could argue that our tracking is comparable to the annotations performed by humans. For dominance, our results are lower: the body-only and audiovisual mappings perform similarly, around 0.26 to 0.33, while the inter-annotator correlations are around 0.47. For the valence case, we were not able to track the changes; we get a median correlation around zero, which is in agreement with our statistical analysis results, where we were not able to show that valence is meaningfully reflected through our features.
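As an illustration of the test-time switching and the evaluation measure just described, here is a minimal sketch that reuses the hypothetical map_features_to_emotion helper from the earlier sketch; the per-frame granularity and all names here are assumptions, not the paper's implementation.

```python
import numpy as np

def track_with_switching(gmm_body, gmm_av, body_Y, audio_Y, is_speaking):
    """Estimate the emotion curve frame by frame: apply the audiovisual
    GMM mapping where the actor is speaking, the body-only mapping
    elsewhere. Reuses the hypothetical map_features_to_emotion()."""
    speak = np.asarray(is_speaking, dtype=bool)
    est = np.empty(len(body_Y))
    if speak.any():
        av_Y = np.hstack([body_Y[speak], audio_Y[speak]])
        est[speak] = map_features_to_emotion(gmm_av, av_Y)
    if (~speak).any():
        est[~speak] = map_features_to_emotion(gmm_body, body_Y[~speak])
    return est

def tracking_correlation(truth, estimate):
    """Pearson correlation between the ground-truth curve and the estimate,
    the per-recording evaluation measure (the talk reports the median)."""
    return np.corrcoef(truth, estimate)[0, 1]
```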
This brings us to a discussion of how observable the underlying emotional states that we are trying to track are through the features we extracted. It seems from our results that we are able to capture the activation changes of an actor through the course of an interaction, and some of the dominance changes, but we were not able to track the valence changes. This might mean that body language and prosody are more informative about the activation and dominance states than about the valence state; the valence state may be better expressed through other modalities, such as facial expressions or lexical content. It could also be the case that we need more detailed feature extraction, specifically tailored to the valence attribute, rather than using the same feature set for all three emotion attributes; this is part of our future work.

We also note that the use of prosodic cues greatly benefits activation tracking; the fact that vocal cues are relevant features for activation has already been observed in the emotion literature. Finally, our overall conclusion is that, with this framework, we can track emotional changes rather than absolute emotional values.

For future work, we would like to focus on improving our features by extracting features designed specifically for each emotional attribute, as I described. We would also like to improve the annotation process, so as to enable annotators to achieve higher inter-evaluator agreement and thus obtain a more consistent ground truth. And, as an application, this continuous monitoring of the emotional state would enable us to estimate, through time, regions where there is an increase or decrease of, for example, the activation of a person, and so find regions where something interesting happens, or, as we could say, the emotionally salient regions of an interaction.

These are the references that were used for this presentation. Thank you for your attention.