Speech Transcript - Modeling Human Communication Dynamics

0:00:18	thank you very much for waking up early the star
0:00:23	this is really exciting this is the first time
0:00:26	i will be giving a talk in this room in two years
0:00:29	it is at the same time kind of emotional for me
0:00:33	and so i'm really happy to share
0:00:36	the recent research i've done on human communication analysis
0:00:41	and i will also talk a little bit briefly about some
0:00:43	of the earlier projects i've been doing
0:00:45	on this topic
0:00:46	and as you know really well
0:00:49	i'm here presenting a lot of the work that is done
0:00:52	with my students and also with my collaborators
0:00:56	this is
0:00:57	this is the new of the comp lab other one at cmu there is one
0:01:02	let us see that stuff answer is leading
0:01:05	this the theme of you don't and we are all working together
0:01:09	with the goal of building algorithms
0:01:12	to analyze
0:01:13	really and event may sometimes think the five
0:01:16	german syllable can get the behaviors
0:01:19	and to really get into this understanding of why
0:01:24	human communication and why multimodal the magic word i know it's impossible for me to
0:01:29	give a talk without talking about multimodal
0:01:32	i really strongly believe that when we analyze dialogue
0:01:36	dialogue is powerful in what people are saying
0:01:41	and this is a really strong component
0:01:44	of dialogue in conversation analysis
0:01:47	but i also strongly believe that nonverbal communication both vocal and visual
0:01:52	is also really important
0:01:53	and for that reason i'm gonna show you an example some of you may have
0:01:57	seen it so don't tell your neighbor about the answer but i want to give
0:02:02	you this short clip where we have an interview
0:02:07	between two people
0:02:08	and i want two tasks from you an easy one and a hard one
0:02:13	the easy one is to find out from the so you have the interviewer and
0:02:17	interviewee
0:02:18	what emotion
0:02:20	does the interviewee
0:02:22	feel
0:02:23	and that's what i'll do it is a hard one
0:02:25	it is the easiest of the two tasks
0:02:27	the second thing that i want from you is
0:02:30	what is the cause
0:02:31	that's the hardest but is also the most interesting
0:02:35	so we're gonna we will let's read it together about a corpus tried to have
0:02:39	no prior to the
0:02:41	denote the board
0:02:42	so guy kewney is the editor of the technology website
0:02:46	hello good morning good morning
0:02:48	were you surprised by the verdict today i'm very surprised to see this verdict come on me
0:02:53	because i was not expecting that
0:02:55	when i came they told me something else
0:02:57	so maybe something of big surprise
0:03:00	what emotion does he feel
0:03:04	it is an easy question
0:03:06	surprise exactly
0:03:09	and let's look at it from a computer
0:03:12	who is probably just gonna do some kind of word embedding and matching things
0:03:18	why is he surprised
0:03:20	let's look at the question probably because of the verdict
0:03:23	that the that the follows
0:03:25	really quick one
0:03:27	but if we look more carefully
0:03:29	we do see that there was something unexpected
0:03:32	and maybe even that it is related to him
0:03:36	so let's add one more modality
0:03:38	that is which word does he decide to emphasise
0:03:43	i for me is a set of technology websites f is it is i see
0:03:49	this is like this i said ice surfaces yes
0:03:57	is something that
0:04:00	this is this something isn't done yet to address this as a basis i said
0:04:08	yes
0:04:09	okay so
0:04:10	which word
0:04:12	in his second answer did he decide to emphasise
0:04:16	me
0:04:17	he strongly emphasised the me
0:04:19	so the surprise doesn't seem as much about the verdict
0:04:23	but mostly because it came on him
0:04:26	so let's add another modality
0:04:28	where you see surprise but now you want to look at it at the timing
0:04:33	of things
0:04:34	and that's one of the other takeaways i want to bring in
0:04:37	it's not just multimodal
0:04:39	it's the alignment of the modalities that's really important
0:04:43	the let's look at the visual modality second line
0:04:46	for tracking the et cetera technology website f news line is a good morning t
0:04:52	is fine this and i said that the surface to see this is not have
0:04:59	to come only because i would like to think that
0:05:02	unless you know that is something this and don't think of the to address this
0:05:08	implies that they suffices that
0:05:11	okay so
0:05:13	with that that's a driveway came a lot earlier than with to
0:05:17	much earlier
0:05:18	and five where with the
0:05:22	rampantly and five you look carefully it is around that
0:05:26	so given that information
0:05:28	what is the cause or how can you explain the surprise
0:05:33	probably is related to this title there's probably something wrong with the title
0:05:39	okay and that would be interesting so that's where the timing is important
0:05:43	really surprised at uni of the case of pride so if you look at name
0:05:47	entity recognition there are two different entities there is the name of the person and there is the
0:05:54	position and the place if you look carefully it is the second one
0:05:58	so
0:05:58	based on that you inferred that his name at uni his job title is not
0:06:05	recognizer web site
0:06:06	the last be i have to give you will never have known that there ought
0:06:10	without the context but effectively his did you need
0:06:13	what he's a taxi driver
0:06:17	the taxi driver goes therefore small job interview item one of the small there
0:06:22	and i'll give you a that's great command
0:06:25	they put him with the makeup what the microphone is that we're the job interview
0:06:31	thing i think that up and everything
0:06:34	and that well i don't the realise that all my guys this is not that
0:06:38	have interviewed it only something love and that the that thing
0:06:42	but what is also really interesting is to see the
0:06:46	professionalism of the interviewer she keeps a straight face
0:06:50	the only thing she says is that they will come back after the commercial
0:06:54	he never comes back so what we saw here is
0:07:00	we as humans are expressing our communicative behaviors through what i call the
0:07:07	verbal vocal and visual
0:07:09	a word you decide to use
0:07:11	is maybe slightly more positive or negative
0:07:16	this is a choice you make
0:07:18	this is a choice because you want to emphasize the sentiment
0:07:20	or because you want to be polite and that's really important for discourse
0:07:25	the way you decide to phrase the sentence will bring a lot also
0:07:30	the vocal every word you speak can be emphasized differently
0:07:35	and also you can decide to put more or less tension or breathiness on
0:07:39	the voice
0:07:40	there are also the vocal expressions like laughter
0:07:42	or the pauses that are important
0:07:46	the visible which i come from computer vision background the reason is i put the
0:07:50	phone call them on visual
0:07:52	is my bias but i strongly believe there's also a lot in the gestures
0:07:56	i'm doing beat gestures i may do some iconic gestures
0:08:00	the eye gaze the way i will also do occur on gesture
0:08:05	the body language is important it's both the posture of the body and also
0:08:09	the proxemics with others
0:08:12	and that is really also culture specific i always have this great example
0:08:17	of a brand new student who has graduated by now
0:08:20	but had just come from china
0:08:22	and we had a wonderful discussion and i go to the whiteboard and i turn
0:08:27	and he was right there
0:08:30	and i tried to have a conversation but my canadian bobble well
0:08:36	lied
0:08:36	i survive only twenty seconds and then when we have a wonderful conversation about tried
0:08:41	to make so that within a
0:08:44	i j then had gate
0:08:45	one of the first cues i look at almost always in any video analysis i do
0:08:50	is eye gaze eye gaze is extremely important
0:08:53	it is also some time cognitive emotions also eye gaze is really important
0:08:58	and i have a bias for facial expression also so i believe the face brings
0:09:03	a lot
0:09:04	we have about forty two muscles on the face depending how you count exactly but forty
0:09:08	two
0:09:09	all of them have been assigned a number by paul ekman's famous coding scheme
0:09:15	and i'm interest and not just in the basic emotion like had is that if
0:09:19	you is happy starts to cry
0:09:21	but i'm also interested in these other cognitive states like thinking confusion
0:09:26	and understanding
0:09:27	they are probably even more important when we think about learning and education for
0:09:31	example
0:09:32	so that's the view of the three v's verbal vocal and visual
0:09:36	and
0:09:37	this research has been in people's minds for many years
0:09:42	if you look back sixty years ago and by the way happy birthday a i it
0:09:46	is the sixtieth anniversary of artificial intelligence
0:09:50	the idea was there from the beginning but we didn't have all the technology now these
0:09:56	days we have technology to do a lot of the low-level sensing finding facial landmarks
0:10:02	and analysing the voice
0:10:03	even speech recognition is getting better
0:10:06	so we can in real time at least mostly transcribe speech
0:10:11	and we are able to start working on some of the original goals of inferring
0:10:17	behaviour and emotion
0:10:18	so personally when i look at this challenge of looking at human communication dynamics
0:10:23	i identify four types of dynamics
0:10:27	the first one is behavioural dynamics
0:10:30	and not every smile is born equal there are some smiles that seem to show
0:10:36	politeness some are feeling and there is also what we call and that this is
0:10:42	i have to give this to my
0:10:44	appear as opposed to
0:10:45	but if the size of
0:10:47	which means that the same
0:10:49	can be really need a lot there's by the change of prosody and for people
0:10:54	working in speech in conversation analysis try to find out who is speaking
0:11:01	the stuff
0:11:05	the
0:11:11	i
0:11:12	okay this was one that only
0:11:15	this was from only one hour of audio
0:11:19	do you know with it
0:11:21	it nick campbell and it's was from one of experiments data as that the interaction
0:11:27	they have that but only from one hour of audio you can see the variety
0:11:33	as some of them are just
0:11:35	which is more like a continuation please continue
0:11:38	some clearly show some common ground
0:11:41	and the lights men
0:11:42	and some of them maybe eventually agreement so just from the prosody the same word
0:11:46	changes
0:11:47	the second one which by now you have hopefully bought into is the idea of multimodal
0:11:52	dynamics with alignment
0:11:54	the third one is really important i think that's where a lot of the research
0:11:57	in this conference
0:11:59	and moving forward is needed is the interpersonal dynamic
0:12:03	and the fourth one is the cultural and societal dynamics
0:12:07	this is a lot of study of both difference of also and event between cultures
0:12:12	so today i will focus
0:12:14	primarily on these three
0:12:16	and try to explain some of the mathematics behind that
0:12:20	how can we use the
0:12:21	and develop new algorithms to be able to sense
0:12:24	the behaviors so
0:12:26	and what makes me personally excited in this field
0:12:30	is primarily because of its potential for healthcare
0:12:35	there's a lot of potential in being able to help the doctors
0:12:39	during their assessment or treatment
0:12:42	of depression
0:12:44	the since i don't live and offers them
0:12:46	and the other area which is really important is education
0:12:50	the way people are learning these days is shifting completely we are seeing more
0:12:55	and more online learning
0:12:57	online learning brings a lot of advantages
0:13:00	but one of the big disadvantages is you lose the face-to-face interaction
0:13:04	how can you still improve that in this new era
0:13:08	and
0:13:09	the internet is wonderful
0:13:11	there is so much data out there people like to talk about themselves and talk
0:13:17	about what they love their puppy and everything there is so much data in every language
0:13:21	every culture and a lot of it
0:13:24	is even transcribed already
0:13:26	it gives us a great opportunity for gathering data and studying people's behaviour
0:13:32	so today i on purpose put it in three phases
0:13:36	the first phase is probably where one half of my heart is which is
0:13:40	on health behavior informatics i will present some of the work we have done when
0:13:45	i was also at usc
0:13:47	working on how to analyse communicative behavior to help doctors
0:13:52	the core of this talk
0:13:54	will be about the mathematics
0:13:56	of communication
0:13:58	and there is a little bit of math but you can always ignore the
0:14:01	bottom half of the screen if you don't
0:14:03	want to see mathematical equations and i will give an intuition on every
0:14:08	algorithm that i present
0:14:09	but i want you to believe and understand
0:14:12	that we can get a lot from mathematics and algorithms
0:14:15	when studying
0:14:17	communication
0:14:17	and the last one is the interpersonal dynamic i will show some results but i
0:14:22	think this is where there's a need
0:14:23	of working together and pushing this part of the research
0:14:27	a lot further
0:14:29	and so let me start with health behavior informatics
0:14:33	you're gonna recognise right away
0:14:36	any maze of a person who's been really important was sick dial this year us
0:14:41	they're elicit thank you for your email as a citizen realise but i mean using
0:14:45	her as my patient well out of my slide
0:14:48	but let's suppose that we have a patient
0:14:51	weights for anybody else who than that in this room
0:14:54	and we wanted the interaction between the patient and the doctor
0:14:58	during that interaction we will have some camera let's say a samsung three sixty
0:15:03	just sitting on the table
0:15:05	if we are lucky and are at i c t or we are working with
0:15:10	i c t then we can also have a virtual interviewer
0:15:14	the advantage of the virtual interviewer versus the human is the standardization
0:15:20	the virtual interviewer is gonna ask the question always the same way as long as
0:15:24	we ask it to do it
0:15:25	the core of my research there
0:15:27	is to while the interaction is happening
0:15:30	to be able to pick up on the communicative cues
0:15:33	that may be related to depression
0:15:35	exactly within this schizophrenia
0:15:38	we bring it back to the clinician
0:15:41	and then they can do a better assessment of depression
0:15:44	this is the vision long-term
0:15:48	what is really lucky
0:15:50	is we started this
0:15:52	and it was primarily computer scientists
0:15:55	with one strong believer which is skip rizzo
0:15:57	who believed in this and working with us
0:16:01	made it possible but now the medical field is seeing it as
0:16:05	more and more important and with a lot more links going on after that
0:16:10	so let me
0:16:11	introduce ellie probably a lot of you have seen her she changed a lot of clothing
0:16:15	and you may ask you know in three
0:16:18	i'm gonna show that primarily because i want you to see the
0:16:23	technology which i think is amazing because it took forty five people and four years
0:16:29	to build
0:16:30	i'm showing this video as the landmark video in that field but also
0:16:35	to look at the nonverbal happening in real time the sensing of this
0:16:40	hi i'm ellie
0:16:42	thanks for coming in today
0:16:43	i was created to talk to people in a safe and secure environment
0:16:48	i'm not a therapist but i'm here to learn about people and would love to
0:16:51	learn about you i'll ask a few questions to get us started
0:16:55	and please feel free to tell me anything your answers are totally confidential
0:17:00	are you okay with this
0:17:03	yes
0:17:04	so
0:17:05	how are you doing today
0:17:08	i'm doing well
0:17:10	that's good
0:17:13	where are you from originally
0:17:16	from los angeles
0:17:19	oh i'm from l a myself
0:17:22	when was the last time you felt really happy
0:17:27	and
0:17:29	i time and i i'd rather be happy
0:17:35	like a skinny nine
0:17:38	okay thanks but you get an yourself to twenty years ago
0:17:43	and
0:17:47	i it's not a lean
0:17:51	it
0:17:52	an example that is that i'll
0:17:56	okay this is really short it turns out we originally designed it with a fifteen
0:18:01	minute interaction in mind people easily talk twenty thirty minutes with ellie
0:18:06	we have one example of a really famous professor i'm not gonna name
0:18:11	and that person who came in visiting and we told them
0:18:14	be careful we're gonna be watching behind the videos
0:18:18	don't tell too much we are there
0:18:21	just and allow no problem
0:18:24	this start talking a little bit and eventually the started talking the slow thing about
0:18:29	the bars and about everything and i was not there are present at that point
0:18:34	the l a brings that in what are that's really
0:18:39	and ellie is there to listen to you she is a good listener
0:18:42	has been designed with that if you want otherwise you know so in what like
0:18:46	so much emotion
0:18:48	emotion is double edged in this case
0:18:52	you can show emotion and get the person more engaged you can go the opposite
0:18:55	way for example with a bad error in speech recognition the patient said
0:19:01	i and my grandmother died and the l it was a
0:19:05	and so you can definitely be sure so all those reduce the aspect
0:19:09	and a lot of the work there was done by david and david
0:19:13	on handling the dialogue at a level
0:19:16	then make the interaction grow through a rapport way
0:19:20	true of phase of intimacy what part of their what was positive in the lower
0:19:26	what have you moment in the last week
0:19:28	a negative as well
0:19:30	if you could go back in time what would you change about yourself
0:19:33	these are important and he
0:19:36	four hours or research because
0:19:38	how does the presenter we have from positive
0:19:41	and how they react one they can sit will tell you a lot about the
0:19:44	their reaction and allow us to calibrate
0:19:47	so our view
0:19:50	is and that's primarily my research in this case how to analyze
0:19:55	the patient behavior today
0:19:58	and how to assess that and compare it to like two weeks ago
0:20:03	that allows us to see a change so if you ask me where the technology
0:20:06	is going to be used first
0:20:08	it's in treatment
0:20:10	because in treatment you see the same person over time
0:20:13	and now over time we have gathered enough data that allows also to maybe
0:20:18	do screening with this technology and give great indicators
0:20:23	so this is a project that started more than six years ago and let
0:20:26	me show you in a few minutes
0:20:29	what are the things we discovered that we did not expect
0:20:33	and things i think that were not seen previously
0:20:36	and so the first
0:20:38	population we looked at is depression
0:20:41	and you think of depressed people and you think
0:20:44	smile is gonna be a great way to tell you look at the rate for depressed
0:20:48	and non depressed this is an obvious one it turns out that no
0:20:52	the count of smiles
0:20:54	is almost exactly the same between depressed and non depressed
0:20:59	what changes is the duration shorter
0:21:02	and less amplitude
0:21:04	hypothetically what it means is social norms tell you that you have to smile
0:21:10	when you don't feel it
0:21:12	and so you change the dynamic of your behavior
0:21:15	and that's why behavior dynamics is so important
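A minimal sketch of how the three smile measures mentioned here (count, duration, amplitude) could be computed from a per-frame smile intensity signal. The threshold, the frame rate and the two toy signals are assumptions made for illustration only; they are not data or settings from the study described in the talk.

```python
def smile_episodes(intensity, threshold=0.5):
    """Return (start, end, peak) for each run of frames at or above threshold."""
    episodes = []
    start = None
    for i, value in enumerate(intensity):
        if value >= threshold and start is None:
            start = i  # a smile episode begins
        elif value < threshold and start is not None:
            episodes.append((start, i, max(intensity[start:i])))
            start = None  # the episode ended on the previous frame
    if start is not None:  # signal ended while still smiling
        episodes.append((start, len(intensity), max(intensity[start:])))
    return episodes


def summarize(intensity, fps=10.0, threshold=0.5):
    """Count of smiles, mean duration in seconds, and mean peak amplitude."""
    episodes = smile_episodes(intensity, threshold)
    count = len(episodes)
    if count == 0:
        return {"count": 0, "mean_duration_s": 0.0, "mean_peak": 0.0}
    mean_duration = sum((end - start) / fps for start, end, _ in episodes) / count
    mean_peak = sum(peak for _, _, peak in episodes) / count
    return {"count": count, "mean_duration_s": mean_duration, "mean_peak": mean_peak}


# Hypothetical signals: same number of smiles, but the second has shorter, weaker ones.
signal_a = [0.0, 0.6, 0.9, 0.6, 0.0, 0.0, 0.7, 0.8, 0.7, 0.6, 0.0]
signal_b = [0.0, 0.6, 0.0, 0.0, 0.0, 0.0, 0.55, 0.6, 0.0, 0.0, 0.0]
summary_a = summarize(signal_a)
summary_b = summarize(signal_b)
```

Comparing `summary_a` and `summary_b` shows how two signals can have the same smile count while differing in duration and amplitude, which is the kind of dynamic a plain count would miss.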
0:21:18	the second population we looked at
0:21:20	is posttraumatic stress
0:21:23	and you are like okay p t s d for sure there are some negative expressions
0:21:28	with this
0:21:29	it is a given
0:21:30	people would be it is there will probably so
0:21:33	and what did we see almost the same rate or intensity
0:21:37	the same intensity of negative expressions
0:21:39	what did we end up doing we split it into men and women
0:21:43	what did we find out
0:21:44	men
0:21:46	see an increase in negative facial expressions while women see a decrease in
0:21:52	negative expressions when they have symptoms related to p t s d
0:21:56	this is really interesting
0:21:57	so why
0:21:59	another interesting question
0:22:01	i respond we have nice research question
0:22:03	again probably maybe because of social norm
0:22:06	for men it is accepted in our culture
0:22:08	that they may show more negative expressions
0:22:11	so they are not
0:22:13	reducing them while women because of the social norm again may be reducing them
0:22:17	this one here is a hypothesis i'm just gonna say it because i'm here
0:22:21	that maybe it is because they're from los angeles and botox is so popular
0:22:26	i don't know we have to study this it does give a
0:22:32	new interesting research question to study
0:22:35	the third population that we looked at is suicidal ideation
0:22:41	did you know that there are forty teenagers a week going to the e r in cincinnati
0:22:47	alone
0:22:49	for suicidal ideation either a first attempt or strong suicidal ideation
0:22:53	and the doctor has to make this hard decision
0:22:55	am i keeping all of them here
0:22:57	sending some of them or putting them on medication or not
0:23:01	it is a hard decision so we have two tasks in mind
0:23:04	one is finding suicidal versus non suicidal
0:23:08	but where is the money
0:23:09	the money is in detecting repeaters
0:23:12	because the first time is always
0:23:14	a phrase that then the second item bits of and the most and to
0:23:18	so we did a lot of research and this is in collaboration with defined server
0:23:22	in cincinnati john pestian
0:23:24	where we studied the behavior between suicidal and non suicidal
0:23:29	and the language is really important
0:23:32	you see more pronouns when suicidal about themselves
0:23:36	and you also see more negative
0:23:38	these are not surprising but they were confirmation of previous research
0:23:42	what was the most challenging is repeaters and non repeaters
0:23:48	how can we differentiate that and one of the most interesting results is that the
0:23:53	voice
0:23:54	was the differentiator
0:23:56	people were speaking differently
0:23:58	when a repeat what's gonna happen we will call again three weeks later to find
0:24:03	if there was a second attempt
0:24:05	and so the breathiness of the voice was an indicator
0:24:08	is it just one indicator will not just because you were had to rate advice
0:24:13	in itself
0:24:13	but that's and that is then in together and then we can add this
0:24:17	we did you know there's a lot of other indicated that you can add
0:24:21	to help with this
0:24:22	the last population that we also looked at is schizophrenia
0:24:27	schizophrenia is a really important
0:24:31	disorder
0:24:32	and they also related to buy there's also by problem is a free not vote
0:24:36	in the cycle this
0:24:37	arena
0:24:38	and so we were really interested to look at the facial behaviors because we
0:24:43	were like oh schizophrenia are they gonna look everywhere are they gonna move and all
0:24:47	this
0:24:47	and what did we find out
0:24:49	when they were with the doctor nothing
0:24:51	they were not moving their eyebrows it was more or less the same
0:24:55	whether they were strongly schizophrenic or not
0:24:57	but
0:24:58	if they were by themselves
0:25:01	then we could see it
0:25:03	so that brings in the really interesting aspect of interpersonal dynamics
0:25:08	when the doctor is there
0:25:09	they're kind of constraining a little bit their behaviour while when they were by
0:25:13	themselves you could see a lot in the facial expression
0:25:16	so those are some of the examples these are some of the populations we've been
0:25:20	working on
0:25:21	since then we started looking at autism
0:25:23	and also sleep deprivation
0:25:26	it's all of my phd student the like can be really get paid one that
0:25:30	sleeping
0:25:32	and yes they're the lattice that is
0:25:34	onthe-fly and so we're looking at these as well
0:25:37	if you're interested in doing and pushing for that kind of research
0:25:41	i strongly suggest
0:25:43	to go online right now and download openface
0:25:47	openface is us
0:25:49	taking promote to stance and taking the main component of multisensor for visual analysis
0:25:55	and giving it
0:25:56	not only for free
0:25:59	not only giving the open source for recognition
0:26:02	but also giving you all the open source for the training
0:26:06	of all the models that were all trained with public datasets
0:26:10	it's probably not good for my grant proposals and all this because i'm probably gonna
0:26:14	give too much but i think it is important for the community and we're doing
0:26:18	that for that
0:26:19	openface has state-of-the-art performance for facial landmarks sixty eight facial landmarks
0:26:24	state-of-the-art performance for twenty two facial action units
0:26:28	also for eye gaze
0:26:30	eye gaze just from a webcam plus or minus by degree and also head position
0:26:34	we're adding more and more every few months also
0:26:38	so this is online
0:26:40	and be sure to contact tadas who is the main person behind all of
0:26:44	this work
0:26:45	so i think i got you hopefully excited about the potential of analysing nonverbal
0:26:50	and verbal behaviour for healthcare
0:26:53	so how do we do this
0:26:55	how can we go a step ahead right now we just saw a couple of uni
0:26:59	modal
0:27:00	one behavior
0:27:01	but what i'm really excited about is how can we add together
0:27:05	all of these indicators from verbal vocal and visual
0:27:08	so then we can better infer
0:27:10	the type of the disorder or in a social interaction recognize leadership
0:27:16	rapport
0:27:17	and also maybe emotion
0:27:19	so
0:27:20	what are the core challenges and
0:27:23	if you have to remember one thing from this lecture it is these four challenges
0:27:28	when you look at communication there are four main challenges the first one as i
0:27:32	mentioned is the temporal aspect i told you with the smile the dynamic of the smile is
0:27:38	really important
0:27:39	we need to model each behaviour
0:27:41	but there is also what's called representation alignment and fusion
0:27:48	representation i have what the person said and i have these gestures how can i
0:27:53	learn a joint way of representing it
0:27:57	so that if someone says i like it
0:27:59	and smiles
0:28:01	these should be indicators that are represented close to each other
0:28:05	and by representation what i mean
0:28:07	i mean numbers that are interpretable by the computer
0:28:13	imagine a vector in some sense
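One way to picture "represented close to each other" is with vectors and a similarity measure. The sketch below uses cosine similarity on three hand-made vectors; the vectors, their dimensionality and the choice of cosine similarity are all assumptions for illustration, since the talk does not say how the joint representation is learned or compared.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical points in a shared space for a spoken phrase and two facial behaviors.
said_i_like_it = [0.9, 0.1, 0.2]
smile = [0.8, 0.2, 0.1]
frown = [-0.7, 0.3, 0.1]

similarity_to_smile = cosine_similarity(said_i_like_it, smile)
similarity_to_frown = cosine_similarity(said_i_like_it, frown)
```

In a good joint representation the phrase and the smile would score higher with each other than the phrase and the frown do, as they do for these toy vectors.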
0:28:16	the alignment is the second thing
0:28:18	we move our eyes sometimes faster and our voice changes faster than our words so we
0:28:24	need to align the modalities and the last one is the fusion
0:28:28	we want to predict a disorder or emotion how do you use this information
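A minimal sketch of alignment followed by fusion, assuming word-level timestamps and a faster frame-level visual signal. Averaging frames inside each word interval and then concatenating the result with a word vector is only one simple option, chosen here for illustration; the word timings, the frame rate and the feature values are hypothetical.

```python
def align_frames_to_words(words, frames, fps):
    """Average the frame-level features that fall inside each word's time span.

    words:  list of (token, start_seconds, end_seconds)
    frames: list of per-frame feature vectors sampled at `fps` frames per second
    """
    dim = len(frames[0])
    aligned = []
    for token, start, end in words:
        first = round(start * fps)
        last = max(first + 1, round(end * fps))  # always keep at least one frame
        window = frames[first:last]
        if window:
            mean = [sum(frame[d] for frame in window) / len(window) for d in range(dim)]
        else:
            mean = [0.0] * dim  # the word lies beyond the end of the video
        aligned.append((token, mean))
    return aligned


def fuse(word_vectors, aligned):
    """Early fusion: concatenate each word vector with its aligned visual features."""
    return [word_vectors[token] + visual for token, visual in aligned]


# Hypothetical input: three words and six frames of one feature (smile intensity).
words = [("i", 0.0, 0.2), ("like", 0.2, 0.5), ("it", 0.5, 0.6)]
frames = [[0.0], [0.0], [0.2], [0.4], [0.6], [1.0]]
word_vectors = {"i": [1, 0, 0], "like": [0, 1, 0], "it": [0, 0, 1]}

aligned = align_frames_to_words(words, frames, fps=10)
fused = fuse(word_vectors, aligned)
```

Each entry of `fused` is one vector per word that carries both what was said and what the face was doing while it was said.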
0:28:33	so the first one is and i will ask you to use one other part
0:28:38	of your brain
0:28:39	the one that is slowly waking up because of the coffee about looking at math
0:28:44	and algorithms but i want to give you a little bit of a background on
0:28:47	the math side
0:28:48	so we have the behavior of a person
0:28:51	and we wanna be looking at
0:28:53	what are the sub
0:28:56	components to it and what is the information you have a plot
0:29:00	like a movie plot and there are subplots to it
0:29:04	there is a gesture and there are subcomponents to it
0:29:08	these subcomponents are really important when you look at behaviors
0:29:12	so how do we do this so anybody the let's see
0:29:16	whose strongest background is in language and n l p
0:29:21	would be most of you
0:29:22	anybody with a strong background in vocal and out of the speech
0:29:26	okay great
0:29:27	anybody with a strong background in visual computer vision
0:29:31	okay good thank you
0:29:33	i don't feel lonely well for each of these modalities
0:29:37	there are existing problems that are well studied looking at structure for example in language
0:29:44	looking at noun phrases or shallow segmentation
0:29:48	in the visual one recognizing gestures or in vocal looking at the tenseness or the emotion
0:29:54	in the voice
0:29:55	and there have been a lot of approaches suggested for that
0:29:59	it generates addresses this common that's a
0:30:02	generative in a nutshell is looking at each gesture and trying to generate it so
0:30:07	if you look at head nod and head shake it's gonna learn how a head nod is generated
0:30:12	and how a head shake is generated
0:30:15	and if i'm given a new video it says whether it looks more like a head nod or a head
0:30:19	shake a discriminative approach is really looking at what differentiates the two
0:30:25	and so in a lot of our work it turns out the discriminative approaches perform
0:30:30	better at least for the task of prediction
0:30:32	and so i'm gonna give you
0:30:34	information about this kind of approach
0:30:37	knowing really well there is interesting work on the generative side
0:30:40	so
0:30:41	what is a conditional random field
0:30:45	my god i didn't think i would see that this morning
0:30:48	but a conditional random field is what's called a graphical model
0:30:52	and the reason i want you to learn about it is that this is a
0:30:55	good entryway to a lot of the research that you've heard about word
0:31:00	embedding
0:31:01	or word2vec or deep learning or recurrent neural network you hear all of these
0:31:06	terms
0:31:07	we're gonna go step by step to be able to understand them and at the
0:31:11	same time i will give you some of the work we've done through that
0:31:15	so given the task and given the sentence
0:31:18	and i want to know what is the beginning of a noun phrase
0:31:21	or what is the continuation of a noun phrase or what is other like a verb
0:31:25	so it is a simple classification task
0:31:28	and you could imagine given observation
0:31:30	where you have a one hot encoding
0:31:32	zero and one for the words or if it's a word embedding
0:31:37	you can try to predict
0:31:38	the relationship between the word and the noun phrase
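A small sketch of the one-hot encoding mentioned here, assuming a toy three-word sentence and a begin / inside / other labeling of noun phrases. The sentence, the vocabulary order and the label names are made up for illustration.

```python
def one_hot(word, vocabulary):
    """Vector of zeros with a single one at the word's position in the vocabulary."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector


# Hypothetical sentence with one label per word:
# B-NP = beginning of a noun phrase, I-NP = continuation, O = other.
sentence = ["the", "dog", "barks"]
labels = ["B-NP", "I-NP", "O"]

vocabulary = sorted(set(sentence))  # ['barks', 'dog', 'the']
observations = [one_hot(word, vocabulary) for word in sentence]
# observations[1] is the encoding of "dog": [0, 1, 0]
```

The classifier then sees one such vector per word and has to predict the matching label; a word embedding would replace each sparse vector with a dense one.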
0:31:42	if you wanna do it in a discriminative way what does discriminative mean
0:31:46	it means that you model the probability of the label
0:31:50	given the input p of y given x
0:31:53	now this equation is simpler than it looks
0:31:56	there is one component that looks at
0:31:59	how much my observation looks like the label this is what's called the singular potential
0:32:06	and the second part is if i'm at the beginning of a noun phrase what
0:32:10	is the likely label afterwards
0:32:13	i can tell you that if i'm at the beginning of a noun phrase what is likely
0:32:16	afterwards is a continuation of a noun phrase or other but if i'm
0:32:22	in the continuation of a noun phrase
0:32:24	it's really less likely maybe that i go
0:32:26	into a global after that so this is the kind of intuition you put
0:32:30	in this model
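A minimal sketch of the two components just described, for a linear-chain conditional random field on a three-word sentence. It scores a label sequence as the sum of per-word scores (how well each observation matches its label) and transition scores (how likely one label is to follow another), then normalizes by brute-force enumeration. All score values are invented for illustration, and real implementations use dynamic programming rather than enumeration.

```python
import itertools
import math

LABELS = ["B", "I", "O"]  # begin noun phrase, inside noun phrase, other


def sequence_score(labels, unary, pairwise):
    """Sum of per-position scores and label-to-label transition scores."""
    node = sum(unary[t][y] for t, y in enumerate(labels))
    edge = sum(pairwise[(a, b)] for a, b in zip(labels, labels[1:]))
    return node + edge


def crf_probability(labels, unary, pairwise):
    """p(y | x): exponentiated score divided by the sum over all label sequences."""
    all_sequences = itertools.product(LABELS, repeat=len(unary))
    z = sum(math.exp(sequence_score(s, unary, pairwise)) for s in all_sequences)
    return math.exp(sequence_score(labels, unary, pairwise)) / z


# Hypothetical scores for a three-word sentence (one dict per word).
unary = [
    {"B": 2.0, "I": 0.0, "O": 0.5},
    {"B": 0.0, "I": 1.5, "O": 0.5},
    {"B": 0.0, "I": 0.0, "O": 2.0},
]
pairwise = {(a, b): 0.0 for a in LABELS for b in LABELS}
pairwise[("B", "I")] = 1.0   # a continuation often follows a beginning
pairwise[("O", "I")] = -2.0  # a continuation rarely follows "other"

candidates = list(itertools.product(LABELS, repeat=len(unary)))
best = max(candidates, key=lambda s: crf_probability(s, unary, pairwise))
total = sum(crf_probability(s, unary, pairwise) for s in candidates)
```

Here `best` is the most probable labeling under these toy scores and `total` checks that the probabilities over all labelings sum to one.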
0:32:31	this model can recognize behaviour and it can do it but
0:32:37	but there's always a but
0:32:39	but this problem would be
0:32:41	so much easier
0:32:43	if i knew the part-of-speech tags it would be so much easier
0:32:47	if i had an undergrad in the box if i had annotators
0:32:52	annotating all of this for us
0:32:56	the task would be so much easier from this pronoun you know it's likely the
0:33:01	beginning of a well i
0:33:02	beginning of a noun phrase
0:33:04	this is the verb so
0:33:06	why don't we just do that well it is hard the i r b doesn't
0:33:10	allow us to put undergrads in the box and it is a time-consuming
0:33:15	process to do that so
0:33:17	this is the one thing i want you to remember from this part of the
0:33:21	lecture
0:33:21	latent variable i'm gonna replace that by a latent variable a latent variable is a number
0:33:28	from one so let's they can
0:33:31	that's gonna do the job for you
0:33:33	latent variables are there for grouping
0:33:36	they can group the words together for you
0:33:39	but you don't have to give them the name of each group
0:33:44	they can define a grouping naturally that works for the purpose of your task which is
0:33:49	in this case
0:33:49	noun phrase
0:33:51	so you tell it hey learn this grouping for me of all the words
0:33:56	and you can do that by doing a small tweak which is saying for the
0:34:00	noun phrase the beginning of a noun phrase i'm allowing you
0:34:05	these four groups
0:34:07	for the middle for the continuation of a noun phrase i'm allowing
0:34:11	you to group all the words in four other groups
0:34:14	and i would do it also for all the other ones
0:34:17	so you see it almost
0:34:19	it's not unsupervised clustering because the grouping will be happening because i have a
0:34:25	task in mind
0:34:26	a discriminative model task in mind
0:34:29	so if you do this what's beautiful is the complexity of this algorithm is
0:34:34	almost the same as the crf with a simple summation over that
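A minimal sketch of the latent-variable idea described above, assuming each label owns its own disjoint group of hidden states and that the probability of a label sequence is the sum over all hidden paths that stay inside the right groups. The group size of two per label (the talk mentions four), the scores and the function names are hypothetical.

```python
import math
from itertools import product

# Each label owns a disjoint group of hidden states (two per label here, an assumption).
HIDDEN = {"B": ["B0", "B1"], "I": ["I0", "I1"], "O": ["O0", "O1"]}
ALL_STATES = [h for group in HIDDEN.values() for h in group]


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward(allowed, obs_score, trans_score):
    """Log-sum of hidden-path scores, using only allowed[t] states at step t."""
    alpha = {h: obs_score[0][h] for h in allowed[0]}
    for t in range(1, len(allowed)):
        alpha = {
            h: obs_score[t][h]
            + logsumexp([alpha[p] + trans_score[(p, h)] for p in alpha])
            for h in allowed[t]
        }
    return logsumexp(list(alpha.values()))


def label_sequence_prob(labels, obs_score, trans_score):
    """p(y | x): hidden paths compatible with the labels, over all hidden paths."""
    constrained = forward([HIDDEN[y] for y in labels], obs_score, trans_score)
    unconstrained = forward([ALL_STATES] * len(labels), obs_score, trans_score)
    return math.exp(constrained - unconstrained)


# Made-up scores for a three-word sentence (illustration only).
n_words = 3
obs_score = [
    {h: 0.1 * ((i + t) % 3) for i, h in enumerate(ALL_STATES)}
    for t in range(n_words)
]
# Staying inside the same label group scores higher than switching groups.
trans_score = {
    (a, b): (0.5 if a[0] == b[0] else 0.0) for a in ALL_STATES for b in ALL_STATES
}

p_bio = label_sequence_prob(["B", "I", "O"], obs_score, trans_score)
# Because the groups partition the hidden states, all label sequences sum to one.
total = sum(
    label_sequence_prob(list(y), obs_score, trans_score)
    for y in product(HIDDEN, repeat=n_words)
)
```

The only change from a plain chain model is the restriction of the forward pass to the allowed groups, which is why the cost stays close to that of the model without latent variables.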
0:34:38	now what do you end up learning with this grouping
0:34:42	the most important is this link
0:34:45	what you end up learning is what's called the intrinsic dynamic what is
0:34:50	that if i want to recognize a head nod the intrinsic tells me i'm going down
0:34:55	and up this is the dynamic
0:34:58	but a head shake has a different dynamic this is specific to the gesture
0:35:02	extrinsic tells you if i am head nodding how likely am i to switch strategy
0:35:07	this is between the labeled how likely am i two had say now rely on
0:35:12	lightly in fact come back then i can head shape
0:35:15	it's an intuition behind this
0:35:17	so if you do this and you apply this to the famous task
0:35:22	of noun phrase
0:35:24	segmentation also called shallow parsing
0:35:26	and then you know
0:35:27	you look at the hidden state the most likely one for this word
0:35:31	i want to know what did my model learn what is the grouping
0:35:36	that it learned
0:35:37	and if you look at what it did learn
0:35:39	it's really beautiful
0:35:40	it learns automatically that the beginning of a phrase is a determiner or pronoun
0:35:44	and it also give me intuition
0:35:47	about the kind of part-of-speech tags
0:35:49	that is without being given the part-of-speech tags it just learned automatically
0:35:54	because of the words and the way these words happen in the phrase
0:35:58	so this is the first take-home message
0:36:01	latent variables are there to do the grouping
0:36:05	for you
0:36:06	they're grouping things temporal grouping
0:36:08	that's the first ingredient we will need
0:36:12	the
0:36:14	you probably heard the word recurrent neural network
0:36:18	and you like that fancy name have no clue what i don't wanna use that
0:36:23	right away recurrent neural network looks a lot like this model
0:36:27	the only thing that changes is instead of having one latent state from one
0:36:32	so well
0:36:33	i'm gonna have many neurons that are binary
0:36:36	zero or one
0:36:38	and so a recurrent neural network is someone looking at a neural network like looking
0:36:43	at a painting and being like oh it will look better horizontally so it's taking
0:36:48	a neural network and moving it horizontally and that is your temporal
0:36:53	so if i was to show you the other way around you would say oh this is
0:36:56	just a neural network the normal one
0:36:58	by shifting it this way this is the temporal
0:37:02	that i model and so this is great
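A minimal sketch of the "neural network moved horizontally" picture, assuming a simple Elman-style recurrent layer. The talk describes the neurons as binary; this sketch uses the common continuous tanh units instead, and the weights, inputs and the name `rnn_forward` are hypothetical.

```python
import math


def rnn_forward(inputs, w_in, w_rec, bias):
    """Run a simple recurrent layer over a sequence and return every hidden state.

    inputs: list of input vectors, one per time step
    w_in[j][i]: weight from input feature i to hidden neuron j
    w_rec[j][k]: weight from previous hidden neuron k to hidden neuron j
    """
    n_hidden = len(bias)
    h = [0.0] * n_hidden  # the hidden state starts empty
    states = []
    for x in inputs:
        # The same weights are reused at every time step; only h carries the past.
        h = [
            math.tanh(
                sum(w_in[j][i] * x[i] for i in range(len(x)))
                + sum(w_rec[j][k] * h[k] for k in range(n_hidden))
                + bias[j]
            )
            for j in range(n_hidden)
        ]
        states.append(h)
    return states


# Made-up weights and a three-step sequence whose first and last inputs are identical.
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
w_in = [[0.5, -0.5], [0.3, 0.8]]
w_rec = [[0.1, 0.2], [-0.3, 0.4]]
bias = [0.0, 0.0]

states = rnn_forward(inputs, w_in, w_rec, bias)
# With the recurrent weights zeroed out, the layer is just an ordinary network per step.
no_memory = rnn_forward(inputs, w_in, [[0.0, 0.0], [0.0, 0.0]], bias)
```

In `states` the first and third hidden vectors differ even though the inputs are the same, because the third one also sees what came before; in `no_memory` they are identical.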
0:37:05	the problem with these
0:37:06	is they forget
0:37:08	they forget they have a problem in the learning
0:37:11	so this famous algorithm that happened in germany
0:37:15	more than twenty years ago that is becoming super famous recently
0:37:19	is long short-term memory
0:37:21	and the long short-term memory is really similar to the previous neural network
0:37:26	but in addition you have the memory
0:37:30	and how do you guard the memory
0:37:33	you're going to put a gate
0:37:34	that lets only what you want into the memory
0:37:38	and only what you want get out of the memory you're putting a gating and
0:37:43	then you think hey i'm gonna sometimes forget things but i'm gonna decide what
0:37:47	i forget this is a really high level view but you could imagine by now
0:37:51	this is the exact same task
0:37:53	the word
0:37:54	and the label
0:37:56	and the only difference is i'm going to memorise when i memorise i memorise what
0:38:01	happened before
0:38:02	i'm gonna memorise what are the words and the past grouping that happened before
0:38:06	i wanted to show you that
0:38:08	just so that when you see these terms you have at least an intuition
0:38:11	that there is a way to approach
0:38:14	temporal modeling through latent variables that i talked about
0:38:17	or through neural networks
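A minimal sketch of the gated memory described above, assuming the commonly used long short-term memory formulation with input, forget and output gates. It uses a single memory cell with a scalar input to keep the arithmetic visible; the parameter values and the name `lstm_step` are hypothetical.

```python
import math


def sigmoid(z):
    """Squash a real number into (0, 1); used for the gates."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-cell LSTM with scalar input x.

    params maps a gate name to (weight on x, weight on h_prev, bias).
    Returns the new output h and the new memory content c.
    """
    def gate(name, squash):
        w_x, w_h, b = params[name]
        return squash(w_x * x + w_h * h_prev + b)

    i = gate("input", sigmoid)         # how much new content enters the memory
    f = gate("forget", sigmoid)        # how much of the old memory is kept
    o = gate("output", sigmoid)        # how much of the memory is let out
    g = gate("candidate", math.tanh)   # the new content proposed for the memory
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


# Made-up parameters: the input gate is nearly closed and the forget gate nearly
# open, so the cell should hold on to whatever was already in memory.
params = {
    "input": (0.0, 0.0, -20.0),
    "forget": (0.0, 0.0, 20.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, params=params)
```

With these settings the new memory `c` stays very close to the previous value 0.7, which is the "decide what i forget" behaviour in its simplest form. In a trained network the gate parameters are learned, not set by hand.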
0:38:20	okay
0:38:21	now i want to address the second challenge
0:38:23	that's one of the most interesting from my perspective although i worked a lot of
0:38:27	my life on temporal modeling so i should not say that i think the next
0:38:31	frontier
0:38:32	is how do you work on representation how can you look at someone
0:38:37	what they say
0:38:38	and how they say it and the gesture
0:38:41	and find a common representation
0:38:43	what should this common representation look like
0:38:46	i want a representation so that if i have a video and i have a
0:38:52	segment of someone saying i like it
0:38:55	a part of the video that has someone smiling
0:38:59	a part of the video with
0:39:00	a joyful tone
0:39:02	i want these
0:39:03	to all be represented really similar to each other if you look at the
0:39:09	numbers representing
0:39:11	this it should be really similar i like it from happy from joyful
0:39:16	and if i have someone who looks a little bit tense or depressed or
0:39:20	some tenseness in their voice i want the numbers like if i take their
0:39:24	audio clip
0:39:26	and i try to represent it with this transformation
0:39:29	i want it to be really close to someone who looks depressed
0:39:32	or if i have someone who looked surprised and i hear
0:39:35	wow
0:39:36	i want these to look alike
0:39:38	and this was the dream
0:39:40	i personally had this dream
0:39:42	back more than ten years ago
0:39:44	and this really smart researcher at toronto
0:39:47	showed us a path for that
0:39:50	and it is ruslan at the university of toronto
0:39:53	there is a lot of interesting work
0:39:55	where neural networks
0:39:57	are allowing us to make this dream come true
0:40:00	they didn't solve it all don't worry but they've done the first step that's really
0:40:04	important i'm gonna show you a result in a second
0:40:06	what they say is visual
0:40:09	could be represented with multiple layers of neurons
0:40:13	and verbal can be represented
0:40:16	with multiple layers of neurons
0:40:18	what i see here
0:40:19	is a lot like word2vec for people who know about it it's a
0:40:23	representation of a word that becomes a vector and here i have images that suddenly
0:40:29	become also a nice vector by the way
0:40:32	if you wonder why multimodal was not working
0:40:34	it's all the fault of computer vision people
0:40:37	the reason multimodal was not working is images were so hard to recognize any
0:40:43	object it was barely working
0:40:46	but certainly in two thousand and eleven
0:40:48	computer vision started working
0:40:50	at a level that is really impressive we can recognize objects really efficiently and now
0:40:56	we can look at
0:40:58	what is the high-level representation of the image that is useful
0:41:02	words were always quite informative in themselves
0:41:05	but the you guys that solve a lot of the and now we can do
0:41:09	that and put them together
0:41:11	in one representation
0:41:13	and there's been a lot of really interesting work
0:41:16	starting from two thousand ten
0:41:18	and there is still a lot of work in that field
0:41:21	i'm gonna show this one result that showed
0:41:24	to me how it may be possible
0:41:26	and this is the work from toronto
0:41:29	this is what they did
0:41:31	they learned
0:41:32	from images from the web from flickr
0:41:35	they take a bunch of images and then
0:41:37	they have
0:41:39	one word or a few words describing them
0:41:42	and they force the two
0:41:43	to point to the same place
0:41:46	and when you do that
0:41:48	you get for any image
0:41:50	a representation and for any word you get a representation
0:41:54	but now i'm going to do
0:41:55	multilingual
0:41:57	and here is the beauty of it i'm gonna take an image
0:42:01	and get the number
0:42:03	representation
0:42:04	i'm gonna get the word
0:42:05	and get a number and subtract
0:42:08	the word number from the image number
0:42:11	and i am gonna add another number
0:42:15	and finally i get this final number out of it and i'm gonna look at what kind
0:42:19	of image
0:42:20	is in that part of the space
0:42:22	then you get a new car
0:42:24	and then it becomes red color
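A minimal sketch of the arithmetic just described, assuming images and words already live in one shared vector space: take an image vector, subtract one word vector, add another, and look up the nearest image. The three-dimensional toy vectors, the image names and the colour words are all made up for illustration; real embeddings are learned and have far more dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query, candidates):
    """Name of the candidate vector most similar to the query."""
    return max(candidates, key=lambda name: cosine(query, candidates[name]))


# Hypothetical shared space with axes (car-ness, blue-ness, red-ness).
images = {
    "blue car": [1.0, 1.0, 0.0],
    "red car": [1.0, 0.0, 1.0],
    "blue sky": [0.0, 1.0, 0.0],
}
words = {
    "blue": [0.0, 1.0, 0.0],
    "red": [0.0, 0.0, 1.0],
}

# image number minus one word number plus another word number
query = [
    i - b + r
    for i, b, r in zip(images["blue car"], words["blue"], words["red"])
]
result = nearest(query, images)
```

Here `result` is the red car image, because the subtraction removes the blue direction and the addition moves the point toward red while the car direction is untouched.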
0:42:26	that for me what it man is i find belief on what is the bad
0:42:31	l what is the their magic language where everything can be no the
0:42:37	and that's no there is a language
0:42:41	the magic language where everybody can go from the french think this and all that
0:42:44	is this magic language
0:42:47	this is the live in the same for language and bayes and we finally got
0:42:50	a piece of that magic language where computer vision people can live happily with natural
0:42:55	language people and speech people
0:42:58	and they can do that for the they and then i
0:43:03	flying in sailing bold box i don't know it is beautiful but they didn't solve
0:43:09	any of the problem i mentioned about communicative behavior they don't have yet
0:43:14	a happy smile that goes with i like it but you can see the path now to
0:43:19	that
0:43:20	so i'm gonna show now an algorithm
0:43:23	that brings together what you learned earlier
0:43:26	latent variables
0:43:28	which have the grouping role
0:43:30	and now i'm gonna add this new ingredient which is neural networks whose goal
0:43:35	is to find a better way of representing i don't like one-hot
0:43:41	representation for words like zero and one
0:43:44	i want something that's more informative
0:43:46	and i don't like images i want something much more informative
0:43:50	so i'm going to learn at the same time
0:43:52	what is my grouping temporally what is my temporal dynamic and what is
0:43:57	my way to
0:43:58	represent
0:43:59	so given the same input
0:44:02	and the goal of maybe
0:44:04	doing emotion recognition or let's say recognizing what is positive or negative i'm
0:44:10	changing the task
0:44:11	because noun phrase
0:44:13	segmentation is not really a multimodal problem
0:44:15	so i'm thinking of a task like positive versus negative like
0:44:19	we will smaller sentiment the not of that for example
0:44:22	and that was at the first layer here this is in fact i'm showing it
0:44:26	this way but what it is
0:44:27	is that the word
0:44:29	is multidimensional
0:44:31	and this is also multi dimensional because you have neurons
0:44:34	so i'm replacing this as one layer
0:44:37	of neurons
0:44:38	and then
0:44:39	i'm gonna add our famous latent variables
0:44:43	so what is happening here
0:44:45	and that's really important
0:44:46	on this their job
0:44:48	is that they all the agenda-based here
0:44:51	that's a me is about a false there you don't and those then because i
0:44:56	speak french about of other
0:44:57	and so they call this gibberish and one in the format
0:45:00	that's going to be useful for the computer and their task here is to say
0:45:04	from a useful information that we tried to bit
0:45:08	to see what is similar between the different
0:45:12	between the different modalities
0:45:14	and so this is what you get here
0:45:16	it is the right grouping what should i group
0:45:19	this is the this is here
0:45:22	how should i go from the numbers to something that's useful for my computer and
0:45:26	here is the same as earlier it is how to go between latent variables or
0:45:31	grouping
0:45:32	so this is beautiful because you do at the same time
0:45:36	translate from gibberish to something useful and cluster at the same time
0:45:40	one of the most challenging things when you train that
0:45:44	is that each layer is hidden latent you don't have any
0:45:48	ground truth labelling it
0:45:49	so when you have many of them what happens is one could try to learn
0:45:54	the same as the next layer
0:45:55	so you want diversity in each of your layers
0:45:58	and the good neural networks they will do what's called dropout
0:46:02	or you can also impose some sparsity so that this is gonna be really different
0:46:06	from this one
0:46:08	and when you do this for emotion recognition
0:46:10	you get a huge boost over any of the prior work
0:46:13	because we were not just doing late fusion we're really at the same
0:46:17	time modeling the representation
0:46:19	and the temporal clustering
0:46:21	okay
0:46:23	has everyone survived this is the last equation we had so this was
0:46:28	this is my goal of
0:46:31	presenting for you
0:46:33	the representation how do i go
0:46:35	from temporal to the representation and the last two ones which i wanna present quickly
0:46:42	one is about alignment
0:46:44	how do you align
0:46:46	visual which is really high thirty frames per second
0:46:49	with language
0:46:51	which is in fact i don't know how many words per second i say i'm
0:46:54	probably you know on the high end on that
0:46:56	but it's probably five to six words maybe a little bit more per second
0:46:59	so how do you manage to be able to
0:47:02	take a really high frame rate and align it to something much lower
0:47:06	in some other way i have a video
0:47:09	and i want to summarize that video
0:47:12	it's which is so that at the end
0:47:14	i really have only the important part
0:47:16	and if you look at computer vision people
0:47:19	they tend to look at the pixel
0:47:21	and this is a lot of change per pixel
0:47:23	and this is really few changes
0:47:25	is really little change here
0:47:27	is about the and pixel changing here so if you just look at the pixel
0:47:31	and you try to merge you have all of these frames
0:47:35	and you want to find how am i gonna merge them
0:47:38	there's two obvious ways to do it
0:47:40	one is ignore one out of two frames
0:47:44	for a really long sequence you just ignore and a lot of the people in neural
0:47:48	networks that's often what they do they take one out of ten frames the second
0:47:52	maybe the most interesting will be
0:47:54	look at one image does it look like the previous one
0:47:58	if they look alike i'm gonna merge them but if they don't look alike
0:48:02	at this time
0:48:03	then i do not merge them
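A minimal sketch of the second "obvious" strategy above, assuming each frame is a small feature vector and that "look alike" means a sum of absolute differences below a threshold. This is the plain similarity-based merge, not the latent-variable version the talk turns to next; the threshold, the toy frames and the name `merge_similar_frames` are hypothetical.

```python
def merge_similar_frames(frames, threshold):
    """Greedily merge consecutive frames that look like the previous one.

    frames: list of equal-length feature vectors, in temporal order
    threshold: largest sum of absolute differences still counted as "alike"
    Returns one averaged vector per merged segment.
    """
    if not frames:
        return []
    segments = [[frames[0]]]
    for frame in frames[1:]:
        previous = segments[-1][-1]
        distance = sum(abs(a - b) for a, b in zip(frame, previous))
        if distance <= threshold:
            segments[-1].append(frame)   # looks like the previous frame: merge
        else:
            segments.append([frame])     # looks different: start a new segment
    # Represent each segment by the per-feature average of its frames.
    return [[sum(col) / len(seg) for col in zip(*seg)] for seg in segments]


# Five made-up frames: three nearly identical, then a jump, then two alike.
frames = [[0.0, 0.0], [0.1, 0.0], [0.1, 0.1], [1.0, 1.0], [1.0, 0.9]]
merged = merge_similar_frames(frames, threshold=0.3)
```

The five frames collapse into two segments. Taking one frame out of every ten, the first strategy, would instead be `frames[::10]`, which ignores content entirely.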
0:48:05	what is more important our magic ingredient you remember latent variables latent variables
0:48:11	are gonna group things for you
0:48:12	for a task in mind which is recognizing gesture
0:48:16	and if i do the merging because they look alike in this space
0:48:20	then there really more important more fusion
0:48:22	and if you do that you get a huge boost in performance for recognizing gesture
0:48:28	and i'm gonna give you one more intuition about crf and hmm
0:48:32	so crf and hmm are a lot like finding nemo or finding dory
0:48:37	it is dory
0:48:39	short memory they don't remember they only remember the last thing they've seen that's
0:48:44	really short term memory
0:48:45	so if you give them something really high frame rate
0:48:48	the only thing they would remember is the previous one
0:48:51	so what do they remember they remember my previous frame always looks a lot
0:48:55	like my current frame
0:48:56	so it's smoothing
0:48:58	but if i give it
0:48:59	these frames here that are different from each other
0:49:03	it will learn some temporal information that's more useful and that's why
0:49:08	a lot of models work so much better on language
0:49:12	because every word is quite different from the previous
0:49:15	but images in a video frame are really similar to each other so that's
0:49:18	this model
0:49:19	and when you do that you get a nice clustering
0:49:23	of the frames because it's not looking
0:49:25	just at the similarity but it's really
0:49:27	looking at the grouping that you get from the latent variable
0:49:32	the last one is fusion and there's a lot more work to be done on
0:49:36	fusion but this one is like okay
0:49:39	i model the temporal
0:49:42	i model the representation i align my modalities
0:49:45	but now i want to make a prediction i wanna make my final prediction
0:49:50	and i want to use all the information i have
0:49:52	to make my prediction
0:49:54	and to do that there's a lot of new ways to do that
0:49:58	if you think about it each modality has its own dynamics voice is really
0:50:03	quick
0:50:04	word is slower
0:50:05	so you don't want to lose that
0:50:07	so you have word
0:50:09	uhuh dynamic for
0:50:11	each modality so one is private and one
0:50:14	will in fact with mine mation
0:50:16	okay so you will learn a dynamic for audio and you learn a dynamic for
0:50:21	visual and then you learn how to synchronise them
0:50:25	i'm going quickly through that but i just want to give you the intuition
0:50:28	that user and the last that is the one that's going to do
0:50:33	learn the dynamic and learn also to synchronise at the same time and when you
0:50:36	do that you improve a lot so
0:50:38	i'm gonna come back closing the loop
0:50:41	i'm closing the loop
0:50:43	and going back to the original work on distress depression and ptsd
0:50:48	i'm gonna take verbal acoustic and visual
0:50:51	and i want to predict how
0:50:54	distressed you are
0:50:55	and here are the results you get when you do multimodal fusion
0:50:59	what you have is a hundred participants
0:51:03	who interacted with ellie
0:51:05	and each of them has a level of distress in blue
0:51:10	and some of them have ptsd and depression
0:51:13	and in green what you get
0:51:15	is in fact the prediction
0:51:18	you get the prediction in green
0:51:20	by putting together the verbal indicators
0:51:24	the vocal and the visual
0:51:26	and you can do that i'm gonna skip to that because of time
0:51:29	but you can also do this a lot for
0:51:32	looking at sentiment
0:51:34	in videos sentiment in youtube videos
0:51:37	is another application of that i'm gonna skip this one
0:51:40	just because i want to go quickly to the last point i want to make
0:51:44	but the last part i want to address now is interpersonal dynamic
0:51:49	you guys have been amazing you've been head nodding smiling yawning watching emails
0:51:56	i got you
0:51:57	okay
0:51:58	but interpersonal dynamic is i think the next frontier really in algorithms because people some
0:52:06	people will show synchrony in their behaviors
0:52:09	synchrony in their behaviour is great this shows up as some kind of rapport
0:52:14	i with the also the in the video
0:52:17	in some of our video using the virtual human mimicking each other
0:52:21	well in negotiation
0:52:22	you also see asymmetry or divergence
0:52:26	which is also really informative
0:52:28	if i move forward and you move backward this is an important cue
0:52:32	this is important in negotiation but also in learning
0:52:35	if i look at the behavior of one speaker and another
0:52:39	i can find moments where they synchronise
0:52:41	and i can also find ones where there is asynchrony
0:52:45	and these are often in our data
0:52:48	related to
0:52:49	a rejection or bad in their homework
0:52:53	because they're not working well together
0:52:55	there's a disagreement
0:52:57	and asynchrony can show that
0:53:00	we can use some of the behaviour is more for one but you get the
0:53:03	right leader from expert
0:53:06	and this year otherwise you think the other knowledge about the on but they're not
0:53:09	always that the there are not only the knowledgeable and so hard to differentiate that
0:53:14	and voice is a good one for that
0:53:17	and another type is are you gonna accept or not my offer during negotiation
0:53:23	and to do that i will look
0:53:25	at your behavior
0:53:27	i will look at my behaviour as the proposer and i will look at
0:53:31	our history together if we do that together we get a huge improvement when we
0:53:36	put the dinally
0:53:38	but that i think what is that
0:53:39	it's your behavior if you head nod are stored bothers you are likely to accept
0:53:44	but my behaviour is important by the way the best way to have someone accept
0:53:48	your offer
0:53:49	tells you have
0:53:50	you put that you put that out in your on a request
0:53:54	so the last one is there you guys
0:53:59	good listeners
0:54:01	how do i create a crowd like you guys as good listener you
0:54:05	i can do that from data
0:54:07	i can look at each of you how you're reacting to the speaker
0:54:11	and learn
0:54:12	what are the most predictive ones
0:54:14	and be able to eventually create a virtual listener
0:54:17	these are the top four most predictive listener speech about features so if i pause
0:54:23	you're likely to head nod
0:54:24	that's not a surprise if i look at you you're likely to head nod after a
0:54:29	little while not right away
0:54:31	if i say a word and the word and by itself is not a good
0:54:34	predictor but if i'm in the middle of a sentence and then i pause and look at
0:54:39	you
0:54:39	you're really likely to give feedback
0:54:41	so this is the power of multimodal and badly if i don't look at you
0:54:46	you're unlikely
0:54:47	to head nod but not all of you guys are the same
0:54:50	you all the little bit different you not all a smiling at another thing which
0:54:55	i don't know why use all be about the
0:54:58	a
0:54:58	some of you i can learn a model for one person
0:55:02	i can learn a model for another person
0:55:04	and another person
0:55:06	and then what i would like to do is find out the prototypical grouping
0:55:11	grouping
0:55:13	latent variable again very like that model selection
0:55:18	again at that it
0:55:20	but here we will be grouping people we want to find what is common between people
0:55:24	and what do you find
0:55:25	you find that some people
0:55:27	is that was produced by law on so that they also that the warm i'm
0:55:31	a men's is the than is only about one that if i begin in france
0:55:35	event have i say stupid things you will hand not just because that the part
0:55:39	of the right time
0:55:40	and some people will be visual they don't even care about listening
0:55:45	and i do this and noun phrases turn out to be a good predictor
0:55:50	okay so i want to show work from stacy marsella here this is a
0:55:55	really great representation of putting all this interpersonal dynamic in one video i could have
0:56:01	never done better than that
0:56:02	so i want to just show this
0:56:04	this is a video a movie and we are only gonna take the audio
0:56:08	track
0:56:09	and the text
0:56:10	only the audio and the text
0:56:12	and we're gonna animate
0:56:13	the virtual humans here we're gonna animate two of them
0:56:17	some of them are going to be speakers so it speaking behavior based on how
0:56:21	to
0:56:22	you don't the speech you want to know is the icing that the head is
0:56:26	the
0:56:27	which facial expression is it speaker behaviour
0:56:30	but we also want to predict the listener behaviors
0:56:33	directly from the speech of the speaker and so look at the
0:56:38	it's is beautiful
0:56:40	and i hope you enjoy the movie
0:56:42	s two s process i like an answer the question judge the core poor performance
0:56:51	statistical touched
0:56:59	technical difficulty writing style change to do so
0:57:05	i o
0:57:11	i
0:57:13	i
0:57:14	i don't have to answer the question or answer the question
0:57:19	you want answers i entirely
0:57:23	i don't try to
0:57:27	but this was all automatic from the audio
0:57:31	and the visual one that some of the text only
0:57:34	i you get the can you cues from the audio you get the emotion
0:57:38	so this is an example putting everything together these are some of the application that
0:57:44	you can will
0:57:45	bringing together the behavior dynamic not every smile is equal you're going to model
0:57:50	that with the latent variable then you get the multimodal representation
0:57:57	and alignment and the fusion
0:57:59	and then the interpersonal dynamics so
0:58:01	with the bocal for your attention remotes
1:00:16	okay
1:00:17	so
1:00:18	let me answer the second one first and maybe the first one we can
1:00:22	discuss more
1:00:23	about the second one about multimodal alignment right now we are looking at alignment at
1:00:29	a really instantaneous level so it's only a really small piece of the big problem of
1:00:36	alignment
1:00:36	right now we are only aligning
1:00:38	at really short term
1:00:40	i personally believe the next
1:00:43	okay at the next level
1:00:48	of alignment needs to be at the segment level so you need to be able
1:00:52	to do segmentation
1:00:54	at the same time as you do the alignment and to go ahead with the other
1:00:59	example that you mention
1:01:01	when you don't do mimicry instantaneously
1:01:05	the classic example i think it's four seconds or something like that so the
1:01:10	problem is that temporal contingency you need to model that and i think
1:01:14	right now as i said a lot of our models are short on memory
1:01:17	and so we need the infrastructure
1:01:20	to be able to remember so
1:01:21	i think all the points you mention are wonderful i agree with you this is
1:01:25	why i'm excited with this we don't
1:01:27	is that we got actually the building blocks there
1:01:30	and i think we need to study the next step so
1:01:33	thank you
1:01:35	okay the with the money and then
1:02:12	great question
1:02:13	so right now we tried to work with the calibration of each speaker
1:02:20	by having a for space of four or
1:02:23	but where we got more sober indicators
1:02:26	what's the difference on how to direct from positive
1:02:29	and from positive
1:02:31	as a problem there from negative still really
1:02:34	and looking at the delta
1:02:35	was the most informative
1:02:37	because the delta is a little bit
1:02:39	it's not completely independent of the user but it's a lot less dependent
1:02:43	than just looking at how often they smile how often they smile if it's positive how
1:02:48	often they smile when it's negative
1:02:49	that is more informative
1:02:52	the other work is if you ask me where this research is going it's in
1:02:56	treatment
1:02:57	and they're
1:02:58	what is it and we're working with harvard medical school
1:03:01	is you get a schizophrenic patient at their worst
1:03:04	you get a schizophrenic patient as they go through treatment and then they go
1:03:08	back home
1:03:09	you can create a beautiful patient profile of them at their worst and at their best and
1:03:14	then use that to monitor
1:03:16	their behaviour as they go back
1:03:18	and so the work we are putting forward with harvard medical school
1:03:22	is to be able to create these
1:03:24	profiles of people
1:03:25	and the word profile doesn't sound good so we call it signature
1:03:28	it sounds a little less big brother but the idea is the profile of that
1:03:34	so
1:03:36	thank you all for your attention thank you

Modeling Human Communication Dynamics

Keynotes

Louis-Philippe Morency (Carnegie Mellon University)