| 0:00:15 | an aspect of communication which is just as important |
|---|
| 0:00:21 | namely nonverbal communication |
|---|
| 0:00:25 | and in my talk i will discuss |
|---|
| 0:00:28 | how to enrich |
|---|
| 0:00:31 | the precise and useful functioning of computers with the human ability |
|---|
| 0:00:37 | to show and read nonverbal behaviors |
|---|
| 0:00:42 | you also see here, in the collaboration between the woman and the robot, that they are |
|---|
| 0:00:48 | not just collaborating; there is even a kind of close affective bond between them |
|---|
| 0:00:56 | and this is actually the focus of both lines of my research |
|---|
| 0:01:01 | so my talk will be structured as follows: i will first talk |
|---|
| 0:01:07 | about |
|---|
| 0:01:08 | the recognition of social cues in human robot interaction, but of course the technology |
|---|
| 0:01:15 | is also useful for any kind |
|---|
| 0:01:18 | of application where there are social signals in human |
|---|
| 0:01:23 | interaction |
|---|
| 0:01:25 | or in human virtual agent interaction |
|---|
| 0:01:29 | then i will talk about the generation of social cues in human robot interaction |
|---|
| 0:01:36 | of course the robot should not just be able to interpret the human signals |
|---|
| 0:01:42 | it should also be able to respond appropriately |
|---|
| 0:01:47 | the next topic will be dialogue management |
|---|
| 0:01:51 | a social virtual human or robot should be able to handle phenomena such as |
|---|
| 0:01:56 | turn taking |
|---|
| 0:01:58 | and also, as part of the solution, mutual gaze and backchannels |
|---|
| 0:02:04 | and to handle all these challenges we need of course a lot of data |
|---|
| 0:02:10 | and so the last part of my talk will be on interactive machine learning |
|---|
| 0:02:16 | approaches |
|---|
| 0:02:19 | which will ease the effort of human annotation by using |
|---|
| 0:02:25 | semi-automated techniques |
|---|
| 0:02:28 | so let's start with the recognition of social cues in human robot interaction |
|---|
| 0:02:36 | so what kind of social signals are we interested in |
|---|
| 0:02:40 | basically in speech and facial expressions, gaze, posture, gestures, body movements |
|---|
| 0:02:47 | and proxemics |
|---|
| 0:02:49 | but we are not only interested in the social cues of |
|---|
| 0:02:54 | an individual person |
|---|
| 0:02:56 | but also in interaction patterns such as synchrony, or maybe interpersonal attitudes, for |
|---|
| 0:03:06 | example the dominance of a person |
|---|
| 0:03:09 | or agent in an interaction |
|---|
| 0:03:11 | and also engagement |
|---|
| 0:03:15 | so how engaged are the participants in an interaction |
|---|
| 0:03:20 | so if you look at the literature, the most attention has been paid to facial features |
|---|
| 0:03:29 | i don't want to go into detail here; i just mention |
|---|
| 0:03:35 | the facial action coding system, which is usually applied to recognize |
|---|
| 0:03:42 | but also to generate facial expressions |
|---|
| 0:03:45 | and the basic idea is to define action units |
|---|
| 0:03:50 | that characterize emotional expressions |
|---|
| 0:03:54 | such as raised lip corners, which is usually an indicator of |
|---|
| 0:04:00 | happiness |
|---|
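As a loose illustration of the idea just described, here is a minimal Python sketch of mapping detected action units to a coarse emotion label; the rule set and the input format are assumptions for illustration, not a validated model:

```python
# Hypothetical sketch: inferring a coarse emotion label from FACS action
# units. AU numbers follow the published FACS convention; the rules are a
# simplification for illustration only.

# Detected action units with intensities in [0, 1] (assumed input format).
detected_aus = {6: 0.8, 12: 0.9}   # AU6 cheek raiser, AU12 lip corner puller

RULES = {
    "happiness": {6, 12},          # Duchenne smile: AU6 + AU12
    "surprise":  {1, 2, 5, 26},    # brow raise, upper lid raise, jaw drop
    "sadness":   {1, 4, 15},       # inner brow raise, brow lower, lip depress
}

def coarse_emotion(aus, threshold=0.5):
    """Return the first rule whose AUs are all active above threshold."""
    active = {au for au, v in aus.items() if v >= threshold}
    for label, required in RULES.items():
        if required <= active:
            return label
    return "neutral/unknown"

print(coarse_emotion(detected_aus))   # -> "happiness"
```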
| 0:04:02 | also a lot of effort has been spent on |
|---|
| 0:04:06 | vocal emotion recognition |
|---|
| 0:04:09 | here, just for illustration, i show you the signal of the same utterance spoken |
|---|
| 0:04:16 | in different emotions |
|---|
| 0:04:19 | you can see here that the pitch contour is quite different |
|---|
| 0:04:24 | depending on the emotion expressed |
|---|
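A minimal sketch of extracting such a pitch contour, assuming a local recording named "utterance.wav" and using librosa's pyin tracker:

```python
# Sketch: frame-wise pitch contour of an utterance. The file name is a
# placeholder; comparing such contours across emotional renditions shows
# the differences mentioned above.
import librosa

y, sr = librosa.load("utterance.wav", sr=None)
f0, voiced_flag, voiced_prob = librosa.pyin(
    y,
    fmin=librosa.note_to_hz("C2"),   # ~65 Hz lower bound
    fmax=librosa.note_to_hz("C7"),   # ~2093 Hz upper bound
    sr=sr,
)
print(f0[:10])                       # NaN marks unvoiced frames
```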
| 0:04:27 | and there has been some effort to find good predictors for vocal |
|---|
| 0:04:34 | emotion; i would like to mention the geneva minimalistic acoustic parameter set |
|---|
| 0:04:42 | which was recently introduced and which actually attains quite competitive results, also if |
|---|
| 0:04:49 | you compare the two kinds of feature sets |
|---|
| 0:04:52 | namely sets consisting of thousands of acoustic features, or, if you like, approaches that try to |
|---|
| 0:05:00 | process speech with deep neural networks; so it would |
|---|
| 0:05:07 | be |
|---|
| 0:05:08 | interesting to compare, side by side, the results they obtain |
|---|
| 0:05:13 | with the results obtained by the geneva minimalistic feature set |
|---|
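As a hedged sketch, the Geneva minimalistic set can be extracted with audEERING's opensmile-python package; the file name is a placeholder:

```python
# Sketch: extract the Geneva Minimalistic Acoustic Parameter Set (GeMAPS)
# as one functionals vector per utterance.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.GeMAPSv01b,        # the Geneva minimalistic set
    feature_level=opensmile.FeatureLevel.Functionals,   # one vector per file
)
features = smile.process_file("utterance.wav")          # pandas DataFrame
print(features.shape)                                   # small, interpretable set
```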
| 0:05:17 | so if you look at the literature you might get the impression, okay, you get |
|---|
| 0:05:22 | very high recognition rates for emotions |
|---|
| 0:05:26 | and it is even a little bit scary: if you apply the model and |
|---|
| 0:05:32 | test it in the real world, you might find out, okay, |
|---|
| 0:05:37 | the result sometimes even comes |
|---|
| 0:05:41 | close to |
|---|
| 0:05:43 | chance-level |
|---|
| 0:05:46 | results |
|---|
| 0:05:47 | so why is that? actually, previous research has focused on the analysis of |
|---|
| 0:05:55 | acted basic emotions: emotions |
|---|
| 0:05:58 | that are quite exaggerated, prototypical |
|---|
| 0:06:03 | emotions such as happiness, sadness, disgust, anger |
|---|
| 0:06:08 | but emotional responses in the wild can usually not be mapped to basic |
|---|
| 0:06:16 | emotions; we see here, for example, this woman, and because i |
|---|
| 0:06:24 | know |
|---|
| 0:06:25 | the recording, i know the woman was actually quite happy in the |
|---|
| 0:06:31 | interaction with the robot, but it is not clearly |
|---|
| 0:06:35 | visible |
|---|
| 0:06:36 | a couple of years ago |
|---|
| 0:06:38 | a colleague of mine ran a study that we were very interested in |
|---|
| 0:06:46 | so actually they investigated the emotion recognition rate for acted emotions |
|---|
| 0:06:53 | for read emotions, and for emotions elicited in a wizard-of-oz setting, which of course sound more |
|---|
| 0:06:59 | natural; and the task actually was just to distinguish between emotion and |
|---|
| 0:07:06 | no emotion, so not a very difficult task |
|---|
| 0:07:10 | and for acted emotions |
|---|
| 0:07:12 | they got one hundred percent, so perfect |
|---|
| 0:07:16 | for read emotions |
|---|
| 0:07:19 | which are a little bit more natural than acted emotions, they got eighty percent |
|---|
| 0:07:25 | which is okay but not really exciting, because chance is fifty percent if we |
|---|
| 0:07:31 | just need to distinguish between |
|---|
| 0:07:35 | emotion and |
|---|
| 0:07:38 | no emotion |
|---|
| 0:07:40 | and finally, for the wizard-of-oz scenario, they just got seventy percent |
|---|
| 0:07:47 | so obviously systems developed under laboratory conditions may perform poorly in less controlled |
|---|
| 0:07:56 | scenarios |
|---|
| 0:07:58 | and the challenge is actually adaptive real-time applications |
|---|
| 0:08:05 | so usually, if you look at the literature, if you look at the papers people publish |
|---|
| 0:08:11 | you will find out that most studies are offline studies: so they take a |
|---|
| 0:08:17 | corpus |
|---|
| 0:08:18 | and the corpus |
|---|
| 0:08:21 | is usually prepared |
|---|
| 0:08:23 | so, for example, expressions that cannot be unambiguously annotated with |
|---|
| 0:08:29 | emotional states |
|---|
| 0:08:31 | are simply taken out |
|---|
| 0:08:33 | and also |
|---|
| 0:08:36 | they of course start from the assumption that the whole corpus is segmented |
|---|
| 0:08:42 | in some way |
|---|
| 0:08:44 | but in real life we, on the one hand, have noise |
|---|
| 0:08:50 | and corrupted data |
|---|
| 0:08:53 | so we might encounter previously unseen information |
|---|
| 0:08:56 | and also our classifiers can only rely on previously seen data, so we cannot |
|---|
| 0:09:01 | look into the future |
|---|
| 0:09:03 | and of course the system has to respond |
|---|
| 0:09:06 | in real time |
|---|
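A minimal sketch of the online constraints just listed: frame-wise processing, a causal window over previously seen data only, and tolerance to dropped frames; `extract_features` and `classifier` are hypothetical stand-ins:

```python
# Sketch of a causal, real-time classification loop under the constraints
# described above (no future frames, possibly missing data).
import collections
import numpy as np

WINDOW = 50                                   # frames of history (past only)
buffer = collections.deque(maxlen=WINDOW)

def on_new_frame(frame, classifier, extract_features):
    """Called once per incoming frame; may return None while warming up."""
    if frame is not None:                     # sensors may drop out in the wild
        buffer.append(extract_features(frame))
    if len(buffer) < WINDOW:
        return None                           # not enough history yet
    window = np.stack(buffer)                 # uses only previously seen data
    return classifier.predict(window[np.newaxis, ...])[0]
```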
| 0:09:09 | so the question is, what can we do about that |
|---|
| 0:09:13 | and one thing we might consider would be |
|---|
| 0:09:18 | the context |
|---|
| 0:09:19 | so if you look at the picture, can you imagine in which emotional state |
|---|
| 0:09:27 | this couple is |
|---|
| 0:09:29 | so, do we have any idea? those of you who don't know |
|---|
| 0:09:35 | the context |
|---|
| 0:09:37 | any ideas what the emotional state |
|---|
| 0:09:40 | could be |
|---|
| 0:09:44 | you're quite right; usually people actually say, okay, they are in distress |
|---|
| 0:09:53 | it's sadness |
|---|
| 0:09:55 | but you are actually very good at guessing |
|---|
| 0:10:00 | because it's actually jealousy |
|---|
| 0:10:04 | you are actually the first audience that got the correct emotion immediately |
|---|
| 0:10:09 | but nevertheless, i dare say a system |
|---|
| 0:10:14 | even one able to detect the facial action units in a |
|---|
| 0:10:18 | perfect manner would have problems to figure that out without knowing the context |
|---|
| 0:10:25 | or at least other channels |
|---|
| 0:10:28 | so there is some |
|---|
| 0:10:31 | recent research that has been done to consider the context, and it actually led to |
|---|
| 0:10:38 | some improvement |
|---|
| 0:10:40 | so a couple of years ago we investigated gender-specific aspects in emotion |
|---|
| 0:10:48 | recognition |
|---|
| 0:10:49 | and we were able to improve the recognition rates by training gender-specific models |
|---|
| 0:10:57 | and another approach was to consider success and failure: so actually you can observe |
|---|
| 0:11:05 | success and failure events |
|---|
| 0:11:08 | during an application; for example, if a student is having a hard time |
|---|
| 0:11:13 | and is smiling while interacting with the learning application |
|---|
| 0:11:19 | then probably the student is not really happy; it might be that the student |
|---|
| 0:11:25 | does not take the system |
|---|
| 0:11:27 | seriously |
|---|
| 0:11:28 | and even though this approach is quite reasonable, it has not |
|---|
| 0:11:35 | been picked up so much |
|---|
| 0:11:38 | so |
|---|
| 0:11:39 | we ourselves considered the dialogue behavior of the virtual |
|---|
| 0:11:43 | agent in a job interview training scenario |
|---|
| 0:11:47 | so, for example, when the job interviewer asks difficult questions about the weaknesses |
|---|
| 0:11:53 | of the candidate |
|---|
| 0:11:55 | then this is also a hint towards the |
|---|
| 0:11:59 | likely emotional state |
|---|
| 0:12:02 | and at the same time |
|---|
| 0:12:05 | we learned the temporal context using bidirectional long short-term memory neural |
|---|
| 0:12:14 | networks |
|---|
| 0:12:15 | so the context |
|---|
| 0:12:16 | might be a good option to consider |
|---|
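A sketch of modeling such temporal context with a bidirectional LSTM over a sequence of per-frame feature vectors, written in PyTorch; all dimensions are illustrative assumptions:

```python
# Sketch: BLSTM over per-frame features, yielding frame-wise class scores.
import torch
import torch.nn as nn

class BLSTMEmotion(nn.Module):
    def __init__(self, n_features=62, hidden=64, n_classes=4):
        super().__init__()
        self.blstm = nn.LSTM(n_features, hidden,
                             batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)   # 2x: forward + backward

    def forward(self, x):                  # x: (batch, time, n_features)
        out, _ = self.blstm(x)
        return self.head(out)              # frame-wise class scores

model = BLSTMEmotion()
scores = model(torch.randn(8, 100, 62))   # 8 sequences, 100 frames each
print(scores.shape)                       # torch.Size([8, 100, 4])
```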
| 0:12:22 | and another, maybe obvious, thing to consider is multimodality; here |
|---|
| 0:12:28 | you can see a woman in two pictures; note, it's just one frame |
|---|
| 0:12:32 | and if i just look at the faces to compare them, then, |
|---|
| 0:12:37 | actually, |
|---|
| 0:12:38 | for me it's not possible to recognize any difference in the face |
|---|
| 0:12:43 | but if you look at the body |
|---|
| 0:12:46 | you can tell the pictures apart: so on the right |
|---|
| 0:12:50 | the woman is obviously quite tense; i would guess she is not |
|---|
| 0:12:55 | very happy |
|---|
| 0:12:57 | but nonetheless we learn more from the body |
|---|
| 0:13:00 | than is demonstrated by the face |
|---|
| 0:13:05 | so |
|---|
| 0:13:07 | does multimodal fusion help |
|---|
| 0:13:12 | there is an interesting study by d'mello and colleagues |
|---|
| 0:13:16 | on multimodal affect detection |
|---|
| 0:13:20 | and this study surveyed the many studies where multimodal classifiers outperformed |
|---|
| 0:13:28 | unimodal ones |
|---|
| 0:13:29 | and they showed |
|---|
| 0:13:31 | that the improvement correlates with the naturalness of the corpus, which is actually intriguing |
|---|
| 0:13:38 | so for |
|---|
| 0:13:41 | acted emotions |
|---|
| 0:13:43 | you get quite high recognition rates if you use multiple modalities |
|---|
| 0:13:48 | so you can even get improvements of more than ten percent |
|---|
| 0:13:52 | but for the difficult task, namely spontaneous emotions |
|---|
| 0:13:56 | the improvement is less than five percent, which is really bad news, because |
|---|
| 0:14:02 | should we burden |
|---|
| 0:14:05 | the user with additional devices just to get less than five percent improvement in recognition rates |
|---|
| 0:14:11 | and the assumption actually is that in natural interaction |
|---|
| 0:14:17 | people do not show the emotion in every channel: one channel may show the emotion |
|---|
| 0:14:24 | and another channel may not show the emotion |
|---|
| 0:14:28 | so not all channels express it in the same |
|---|
| 0:14:32 | manner |
|---|
| 0:14:33 | and we first investigated whether this |
|---|
| 0:14:38 | assumption holds: so we looked at a corpus, we had a corpus |
|---|
| 0:14:45 | where we annotated affect just by the video, and then just by the audio |
|---|
| 0:14:50 | and then we noted where the annotations mismatch |
|---|
| 0:14:55 | and then we looked at the recognition rates, and actually, where the annotations |
|---|
| 0:15:02 | mismatch |
|---|
| 0:15:04 | so where the modalities do not match, we also |
|---|
| 0:15:08 | get low recognition rates |
|---|
| 0:15:10 | so i will show you another example: look at the woman here |
|---|
| 0:15:16 | so, let's look at the second frame: here the woman shows a |
|---|
| 0:15:21 | neutral face |
|---|
| 0:15:23 | and the voice is happy |
|---|
| 0:15:26 | and a little bit later it is the other way round: the |
|---|
| 0:15:30 | face looks happy but the voice is neutral |
|---|
| 0:15:34 | and of course the question is, what should a fusion approach do in |
|---|
| 0:15:41 | such a situation |
|---|
| 0:15:43 | and here i sketch a potential solution |
|---|
| 0:15:48 | so, as you see, the modality-specific recognizers might decide when |
|---|
| 0:15:57 | to contribute a result |
|---|
| 0:15:58 | and then we interpolate |
|---|
| 0:16:01 | and via this interpolation we get better recognition results |
|---|
| 0:16:08 | so if you look at the literature, most of the fusion approaches actually used are |
|---|
| 0:16:13 | synchronous fusion approaches |
|---|
| 0:16:16 | and synchronous fusion approaches are characterized by considering multiple modalities |
|---|
| 0:16:23 | within the same time frame: so, for example, people take a complete sentence and |
|---|
| 0:16:31 | analyze the face |
|---|
| 0:16:33 | and the voice over the complete |
|---|
| 0:16:36 | sentence |
|---|
| 0:16:38 | asynchronous fusion |
|---|
| 0:16:41 | approaches |
|---|
| 0:16:42 | actually |
|---|
| 0:16:44 | accommodate that the modalities are not observed at the same time |
|---|
| 0:16:52 | so they do not assume that, for example, audio and video |
|---|
| 0:16:57 | are expressed |
|---|
| 0:16:59 | at the same time |
|---|
| 0:17:01 | and therefore they are able to track the temporal nature of cues in the |
|---|
| 0:17:06 | individual modalities; so it's very important, if you use a fusion approach, to |
|---|
| 0:17:13 | use an approach that is able |
|---|
| 0:17:17 | to consider not only the contribution |
|---|
| 0:17:21 | of the individual modalities, but also |
|---|
| 0:17:26 | the interdependencies between modalities |
|---|
| 0:17:29 | and that is only possible |
|---|
| 0:17:31 | if you go for a frame-wise recognition approach |
|---|
| 0:17:36 | so we followed this approach, and i will explain it first here |
|---|
| 0:17:39 | so we adopted an event-based fusion approach where we consider events |
|---|
| 0:17:46 | as an additional |
|---|
| 0:17:48 | layer of abstraction between raw signals |
|---|
| 0:17:51 | and higher-level emotional states |
|---|
| 0:17:54 | events are, for example, a laugh |
|---|
| 0:17:59 | or similar kinds of social cues |
|---|
| 0:18:04 | and in this way we were able to take into account the temporal |
|---|
| 0:18:10 | relationships between channels |
|---|
| 0:18:12 | and learn when each channel provides information |
|---|
| 0:18:15 | and also, in case some data cannot be seen |
|---|
| 0:18:20 | the approach still delivers reasonable recognition results |
|---|
| 0:18:26 | so let's have a look at an example; it's a simplified example |
|---|
| 0:18:31 | here we have audio and we have facial expressions |
|---|
| 0:18:36 | and the fusion approach might combine |
|---|
| 0:18:41 | them |
|---|
| 0:18:43 | with some degree of weighting |
|---|
| 0:18:45 | and now let's assume for some reason the audio is no longer available |
|---|
| 0:18:50 | then via interpolation |
|---|
| 0:18:52 | we still get a quite reasonable |
|---|
| 0:18:56 | result |
|---|
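A simplified sketch of this interpolation idea (our own illustration, not the exact published model): each modality contributes a value with a confidence, and a missing modality's last contribution is decayed over time rather than dropped abruptly:

```python
# Sketch of confidence-weighted, event-based fusion with decay for stale
# (missing) modalities. Values, confidences, and the half-life are assumptions.
import math

def fuse(events, now, half_life=2.0):
    """events: {modality: (timestamp, value, confidence)} -> fused value."""
    num, den = 0.0, 0.0
    for ts, value, conf in events.values():
        decay = math.pow(0.5, (now - ts) / half_life)  # older events count less
        w = conf * decay
        num += w * value
        den += w
    return num / den if den > 0 else None

events = {"audio": (10.0, 0.8, 0.9), "face": (12.0, 0.4, 0.7)}
print(fuse(events, now=12.0))      # both modalities fresh
print(fuse(events, now=16.0))      # audio stale: face dominates via decay
```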
| 0:18:57 | so we compared |
|---|
| 0:19:01 | a number of asynchronous fusion approaches, synchronous fusion approaches |
|---|
| 0:19:07 | and event-driven fusion |
|---|
| 0:19:10 | and so, for the fusion approaches, we |
|---|
| 0:19:16 | considered, for example, recurrent neural networks, which are able to |
|---|
| 0:19:24 | take into account the temporal history of signals |
|---|
| 0:19:29 | and also bidirectional long short-term memory neural networks |
|---|
| 0:19:34 | which are able to look into the future |
|---|
| 0:19:38 | and to learn the temporal history; and what you can see here, which is |
|---|
| 0:19:43 | quite striking |
|---|
| 0:19:46 | is that the asynchronous fusion |
|---|
| 0:19:49 | approaches actually outperform the |
|---|
| 0:19:53 | synchronous fusion approaches |
|---|
| 0:19:56 | so the message i would like to convey is: if you fuse modalities |
|---|
| 0:20:02 | you should go for an approach that is able to consider |
|---|
| 0:20:07 | the contribution of the modalities |
|---|
| 0:20:10 | but also the interdependencies between modalities |
|---|
| 0:20:54 | and actually, to |
|---|
| 0:20:58 | support the development of |
|---|
| 0:21:02 | social signal processing approaches for online recognition tasks |
|---|
| 0:21:08 | we developed a framework which is called ssi, for social signal |
|---|
| 0:21:12 | interpretation |
|---|
| 0:21:14 | and this framework synchronizes the modalities, and it supports complete |
|---|
| 0:21:21 | machine learning pipelines, offering various kinds of machine learning |
|---|
| 0:21:27 | approaches |
|---|
| 0:21:28 | and |
|---|
| 0:21:29 | we are actually able to |
|---|
| 0:21:34 | integrate new modalities and sensors: whenever a new sensor or device |
|---|
| 0:21:41 | becomes available |
|---|
| 0:21:42 | my people write wrappers for it |
|---|
| 0:21:45 | so we support motion capturing as well as eye tracking with various |
|---|
| 0:21:51 | kinds of |
|---|
| 0:21:52 | eye trackers, stationary ones as well as |
|---|
| 0:21:56 | mobile ones |
|---|
| 0:21:58 | and |
|---|
| 0:21:59 | also other devices |
|---|
| 0:22:02 | so basically all kinds of |
|---|
| 0:22:05 | sensors that are commercially |
|---|
| 0:22:07 | available |
|---|
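A minimal sketch of one thing such a framework has to do internally, namely aligning two sensor streams with different rates onto a common clock; the streams here are synthetic stand-ins:

```python
# Sketch: resampling a 100 Hz stream and a 25 Hz stream onto one clock so
# that frame-wise, multimodal processing becomes possible.
import numpy as np

t_audio = np.arange(0, 5, 1 / 100)          # 100 Hz feature stream
t_video = np.arange(0, 5, 1 / 25)           # 25 Hz feature stream
audio_feat = np.sin(t_audio)                # placeholder signals
video_feat = np.cos(t_video)

t_common = np.arange(0, 5, 1 / 25)          # resample both to 25 Hz
audio_on_common = np.interp(t_common, t_audio, audio_feat)
video_on_common = np.interp(t_common, t_video, video_feat)
frames = np.stack([audio_on_common, video_on_common], axis=1)
print(frames.shape)                          # (125, 2): synchronized frames
```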
| 0:22:09 | so this was the part on |
|---|
| 0:22:12 | emotion recognition; now i would like to come to the other side, namely |
|---|
| 0:22:18 | to the generation of social cues by the robot |
|---|
| 0:22:22 | as i said, it is not sufficient to recognize the emotion |
|---|
| 0:22:25 | you also need to respond appropriately, or at least to appear to give appropriate responses |
|---|
| 0:22:33 | and |
|---|
| 0:22:36 | i guess it is clear why we would want nonverbal human-like signals: we learned |
|---|
| 0:22:43 | that they not only express emotions but also attitudes and |
|---|
| 0:22:48 | intentions |
|---|
| 0:22:49 | they also convey interpersonal relations: whether the other person, for example, |
|---|
| 0:22:55 | is interested in talking to you |
|---|
| 0:22:59 | or not |
|---|
| 0:23:00 | and nonverbal behaviors can of course also be used |
|---|
| 0:23:05 | to help the other understand verbal messages |
|---|
| 0:23:10 | and in general they make the communication |
|---|
| 0:23:12 | more natural and plausible |
|---|
| 0:23:15 | so we started a couple of years ago with a |
|---|
| 0:23:18 | nao robot |
|---|
| 0:23:19 | and of course the nao robot does not have a |
|---|
| 0:23:23 | very expressive face, so we had to look for other options, and we |
|---|
| 0:23:29 | looked at action |
|---|
| 0:23:31 | tendencies |
|---|
| 0:23:32 | which are related to emotions; an action tendency is actually what you show before you |
|---|
| 0:23:39 | start an action, so it's very common |
|---|
| 0:23:42 | in sports |
|---|
| 0:23:44 | so here you have two boxers |
|---|
| 0:23:48 | and the fight has not yet started, but it's quite clear what is coming next |
|---|
| 0:23:56 | and so with the nao we simulated action tendencies such as approach |
|---|
| 0:24:03 | attack, and submission |
|---|
| 0:24:05 | and it turned out that people were able to |
|---|
| 0:24:08 | recognize these action tendencies |
|---|
| 0:24:13 | later we actually |
|---|
| 0:24:16 | got a robot from hanson robokind |
|---|
| 0:24:19 | and here we actually tried to simulate facial |
|---|
| 0:24:24 | expressions |
|---|
| 0:24:26 | and you may remember that we started from the facial action |
|---|
| 0:24:32 | coding system i mentioned |
|---|
| 0:24:35 | earlier |
|---|
| 0:24:36 | which actually identifies over forty action units for the human face |
|---|
| 0:24:45 | so the question was, can we simulate the action units |
|---|
| 0:24:51 | for the robot |
|---|
| 0:24:53 | now, this robot allows the simulation of just seven action |
|---|
| 0:24:59 | units |
|---|
| 0:25:00 | the robot has a synthetic skin, and under the skin there are |
|---|
| 0:25:06 | motors, and the motors can move and thereby |
|---|
| 0:25:11 | deform the skin |
|---|
| 0:25:13 | so we were only able to |
|---|
| 0:25:16 | simulate the seven action units, and the question is whether this is enough, so |
|---|
| 0:25:21 | i will show you a video |
|---|
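Purely as a hypothetical sketch of the mapping just described: target action-unit intensities could be turned into motor positions through an assumed linear mixing matrix. The motor names, the chosen action units, and the matrix are invented for illustration; the real robot's interface is not shown in the talk:

```python
# Hypothetical sketch: AU intensities -> motor positions for a skin-deforming
# robot head. Everything here is an assumption for illustration.
import numpy as np

AUS = [1, 2, 4, 6, 12, 15, 26]              # seven simulated action units (assumed)
MOTORS = ["brow_l", "brow_r", "cheek_l", "cheek_r", "lip_l", "lip_r", "jaw"]
MIX = np.eye(7)                              # assumed AU-to-motor mixing matrix

def au_to_motor_commands(au_intensities):
    """au_intensities: dict AU -> [0,1]; returns motor -> position in [0,1]."""
    vec = np.array([au_intensities.get(au, 0.0) for au in AUS])
    positions = np.clip(MIX @ vec, 0.0, 1.0)
    return dict(zip(MOTORS, positions))

print(au_to_motor_commands({6: 0.8, 12: 1.0}))   # a smile-like configuration
```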
| 0:25:24 | so the video is in german with english subtitles; the robot |
|---|
| 0:25:28 | is introduced; the focus is on nonverbal signals, so it is not necessary that you understand |
|---|
| 0:25:35 | what is being said |
|---|
| 0:25:38 | you can just ignore the actual content of the discussion: the information the machine had |
|---|
| 0:25:44 | to rely on did not consider, at this stage, the semantics of the utterances |
|---|
| 0:25:50 | (video plays: the robot holds a conversation driven purely by nonverbal features, without considering the semantics of what is said) |
|---|
| 0:27:04 | okay, just to show you that it really does not consider the semantics, another |
|---|
| 0:27:08 | example |
|---|
| 0:27:12 | (another video clip of a conversation with the robot) |
|---|
| 0:27:25 | the system does not really understand what is being said; it is basically |
|---|
| 0:27:38 | just a constant talk detector |
|---|
| 0:27:44 | so this just shows that you cannot |
|---|
| 0:27:47 | carry a conversation with emotional features alone; that's of course not enough |
|---|
| 0:27:53 | and for a full system we |
|---|
| 0:27:57 | would of course have to use different components in addition; so maybe we |
|---|
| 0:28:05 | might add that at a later stage |
|---|
| 0:28:09 | so what is empathy? empathy is an emotional response, and it stems |
|---|
| 0:28:16 | from the comprehension of the emotional state of another |
|---|
| 0:28:22 | person |
|---|
| 0:28:24 | and the emotional state of the other person |
|---|
| 0:28:29 | might be similar to your own emotional state, but it does not have to be the same |
|---|
| 0:28:36 | emotion |
|---|
| 0:28:37 | and empathy requires, on the one hand, the perception of the emotional state of |
|---|
| 0:28:43 | another person, and the perception is what we can cover with signal processing technology |
|---|
| 0:28:50 | but it also requires appraisal: so we have to think about the situation |
|---|
| 0:28:55 | of the other person; we somehow |
|---|
| 0:28:59 | need to know |
|---|
| 0:29:00 | what the other person is feeling and why, and not just to perceive |
|---|
| 0:29:05 | it |
|---|
| 0:29:07 | and also we are required to decide how to respond to the other person's |
|---|
| 0:29:14 | emotion |
|---|
| 0:29:16 | so for example, in a tutoring system |
|---|
| 0:29:19 | if |
|---|
| 0:29:20 | the student is in a very negative emotional state and depressed |
|---|
| 0:29:24 | then it could be a disaster if the virtual agent would actually mirror the |
|---|
| 0:29:30 | emotional state |
|---|
| 0:29:32 | of the student, because it might make the student |
|---|
| 0:29:36 | even more |
|---|
| 0:29:37 | depressed |
|---|
| 0:29:38 | so |
|---|
| 0:29:40 | the agent actually has to decide what is appropriate: |
|---|
| 0:29:45 | which is a potential emotion to show, and which one is not to show |
|---|
| 0:29:49 | and we can realize a kind of empathy mechanism as follows |
|---|
| 0:29:55 | so we perceive the emotion, we try to understand the emotional state |
|---|
| 0:30:02 | and, understanding the emotional state of the other person, |
|---|
| 0:30:08 | we choose an internal reaction, and then the question is, should we externalize |
|---|
| 0:30:15 | the reaction, and in what way; and for the virtual agents in the |
|---|
| 0:30:20 | examples i will show, the behavior is |
|---|
| 0:30:25 | driven by a simulated appraisal model |
|---|
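A schematic sketch of the perceive, understand, react, externalize loop just outlined; the strategy table is an illustrative assumption inspired by the tutoring example (do not mirror a depressed student):

```python
# Sketch of an empathy-style response policy; all labels and rules are
# assumptions for illustration, not the talk's actual appraisal model.
def appraise(user_emotion, context):
    """Hypothetical appraisal step: a crude internal evaluation."""
    return "negative" if user_emotion in {"depressed", "angry"} else "positive"

def choose_external_reaction(user_emotion, context):
    internal = appraise(user_emotion, context)      # internal reaction
    if user_emotion == "depressed":
        return "encourage"                          # mirroring could backfire
    if internal == "negative" and context == "tutoring":
        return "stay_neutral"
    return "mirror"                                 # default: congruent emotion

print(choose_external_reaction("depressed", "tutoring"))   # -> encourage
```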
| 0:30:29 | the dialogue i will show you is actually scripted, of course |
|---|
| 0:30:34 | so first of all, what do we do in this kind of dialogue |
|---|
| 0:30:39 | so we reveal emotions |
|---|
| 0:30:41 | and also |
|---|
| 0:30:43 | we comment on the user's emotions; so the story will be |
|---|
| 0:30:50 | that the user has forgotten |
|---|
| 0:30:52 | to take her medication |
|---|
| 0:30:54 | and |
|---|
| 0:30:56 | the function of the emotions is this: the robot shows concern about |
|---|
| 0:31:01 | the forgotten medication to increase awareness, but it is doing it in a subtle |
|---|
| 0:31:08 | way |
|---|
| 0:31:09 | so that it does not |
|---|
| 0:31:13 | annoy the user too much |
|---|
| 0:31:15 | and moreover, the robot will show some intentions as |
|---|
| 0:31:20 | well |
|---|
| 0:31:21 | to calm down the user |
|---|
| 0:31:23 | so i will play |
|---|
| 0:31:26 | the video, and what is actually kind of amazing |
|---|
| 0:31:32 | is how convincingly disappointed the robot appears; here it |
|---|
| 0:31:36 | is |
|---|
| 0:32:07 | (video plays) |
|---|
| 0:32:39 | okay; so actually, to develop a better understanding of the emotions of users |
|---|
| 0:32:47 | we are currently investigating how to combine social signal processing with an affective |
|---|
| 0:32:54 | theory of mind, and this is actually a cooperation where we are happily |
|---|
| 0:33:01 | supported by |
|---|
| 0:33:03 | an external partner |
|---|
| 0:33:05 | so our partners developed a computational model |
|---|
| 0:33:10 | actually to simulate emotional behaviors |
|---|
| 0:33:14 | and the basic idea is actually |
|---|
| 0:33:17 | that we |
|---|
| 0:33:19 | have some emotion simulation and then check whether what we |
|---|
| 0:33:24 | recognize in terms of social cues |
|---|
| 0:33:27 | actually matches the |
|---|
| 0:33:29 | simulation |
|---|
| 0:33:31 | and the interesting bit i will come to in a moment: |
|---|
| 0:33:36 | we do not just consider how a situation elicits |
|---|
| 0:33:41 | an emotional state |
|---|
| 0:33:43 | we also consider how people |
|---|
| 0:33:46 | actually regulate their emotions; let me show you an example |
|---|
| 0:33:52 | so let's consider |
|---|
| 0:33:55 | shame: so if you do not regulate the emotion at all, |
|---|
| 0:34:00 | the person would |
|---|
| 0:34:03 | just blush and lower their head |
|---|
| 0:34:07 | and this is the typical |
|---|
| 0:34:10 | emotional expression |
|---|
| 0:34:12 | we would expect |
|---|
| 0:34:14 | but people usually regulate their emotions, actually because they would like to better |
|---|
| 0:34:20 | control their emotional state |
|---|
| 0:34:24 | and they have quite different ways to regulate emotions |
|---|
| 0:34:32 | so avoidance is one reaction, but you might also protect yourself: for |
|---|
| 0:34:37 | example you say, okay, it was not my fault, and you attack another |
|---|
| 0:34:43 | person |
|---|
| 0:34:44 | and |
|---|
| 0:34:46 | what you can see here is that we get quite different signals |
|---|
| 0:34:53 | that people might show, depending on the way they regulate their emotions; and if |
|---|
| 0:34:59 | you use a typical machine learning approach |
|---|
| 0:35:04 | to analyze these signals |
|---|
| 0:35:07 | you would never be able to infer the emotions |
|---|
| 0:35:11 | because you don't know |
|---|
| 0:35:13 | how people regulate the underlying emotional state; so here is, and |
|---|
| 0:35:19 | we had this discussion already yesterday |
|---|
| 0:35:23 | maybe an option: |
|---|
| 0:35:27 | machine learning approaches, as black boxes, recognize certain signals |
|---|
| 0:35:33 | and we combine them with some understanding, actually, |
|---|
| 0:35:37 | to map |
|---|
| 0:35:39 | these signals onto emotional states |
|---|
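A sketch of that hybrid idea: black-box detectors report raw cues, and a small knowledge layer maps them to an emotional state while accounting for regulation strategies; the rules are illustrative assumptions:

```python
# Sketch: rule layer on top of black-box cue detectors. Cue names, context
# labels, and rules are invented for illustration.
def interpret(cues, situation):
    """cues: set of detected signals; situation: appraisal of the context."""
    if situation == "embarrassing":
        if {"blush", "gaze_down"} <= cues:
            return "shame (unregulated)"
        if "laughter" in cues:                 # laughing it off can mask shame
            return "shame (regulated: avoidance)"
        if "blame_other" in cues:
            return "shame (regulated: attack other)"
    if "smile" in cues:
        return "happiness"
    return "unknown"

print(interpret({"laughter"}, "embarrassing"))
```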
| 0:35:41 | and it's even more important |
|---|
| 0:35:44 | if the system has to respond to the emotional state; so imagine |
|---|
| 0:35:49 | you talk to somebody, and the guy is not really understanding what your problem |
|---|
| 0:35:53 | is |
|---|
| 0:35:54 | and is just behaving as if he understood, |
|---|
| 0:36:02 | responding in a schematic manner: we would not call that empathic |
|---|
| 0:36:09 | behavior |
|---|
| 0:36:10 | so, towards the end of my talk, i would also like to come to what |
|---|
| 0:36:15 | is required for dialogue between |
|---|
| 0:36:18 | humans and |
|---|
| 0:36:20 | robots |
|---|
| 0:36:21 | and in a recent project we looked |
|---|
| 0:36:27 | at engagement in human robot interaction |
|---|
| 0:36:32 | we looked at |
|---|
| 0:36:35 | signs of engagement in a human robot dialogue, such as the amount of mutual gaze |
|---|
| 0:36:41 | and directed gaze, and turn taking |
|---|
| 0:36:45 | and i will just show you an example; here it's a game between a |
|---|
| 0:36:52 | robot and a human |
|---|
| 0:36:54 | and the user is wearing eye-tracking glasses, so that the |
|---|
| 0:36:59 | robot knows where the user is looking |
|---|
| 0:37:03 | and in this specific scenario |
|---|
| 0:37:06 | we simulated directed gaze, which has a kind of |
|---|
| 0:37:12 | functional meaning |
|---|
| 0:37:14 | so |
|---|
| 0:37:15 | the robot is able to detect which object |
|---|
| 0:37:19 | the user is focusing on, and this makes the interaction more efficient because |
|---|
| 0:37:25 | the user is no longer forced to describe |
|---|
| 0:37:28 | the object; we also implemented, in another scenario, social gaze |
|---|
| 0:37:36 | but this social gaze actually does |
|---|
| 0:37:38 | not have a real function |
|---|
| 0:37:41 | so the dialogue was completely understandable without the social gaze; we just wanted to know |
|---|
| 0:37:48 | whether it makes any difference |
|---|
| 0:37:51 | so, just very quickly |
|---|
| 0:37:55 | we have directed gaze, where the robot has the following two options: |
|---|
| 0:38:02 | pointing at the object, or just looking at the object |
|---|
| 0:38:06 | and for mutual gaze, both interactants establish eye contact |
|---|
| 0:38:12 | the next thing we realized was gaze-based disambiguation |
|---|
| 0:38:18 | and gaze-based disambiguation is interesting insofar as people |
|---|
| 0:38:25 | fixate an object, then look away, and then fixate it again |
|---|
| 0:38:30 | so we need a different disambiguation approach |
|---|
| 0:38:34 | than, for example, for pointing gestures: when people point, they usually just |
|---|
| 0:38:40 | point once, and that's it; they do not point a second time |
|---|
| 0:38:45 | and so gaze is |
|---|
| 0:38:47 | simply |
|---|
| 0:38:50 | different |
|---|
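A sketch of how such gaze-based disambiguation might accumulate evidence over repeated fixations, in contrast to a one-shot pointing gesture; the thresholds are assumptions:

```python
# Sketch: accumulate fixation time per candidate object and commit to a
# referent only once one object is clearly preferred.
from collections import defaultdict

dwell = defaultdict(float)                      # accumulated fixation time per object

def on_fixation(target_object, duration_s, threshold_s=0.6):
    """Feed each detected fixation; returns the referent once confident."""
    dwell[target_object] += duration_s
    best = max(dwell, key=dwell.get)
    runner_up = sorted(dwell.values())[-2] if len(dwell) > 1 else 0.0
    if dwell[best] >= threshold_s and dwell[best] > 1.5 * runner_up:
        return best                             # clearly preferred referent
    return None                                 # still ambiguous: keep waiting

print(on_fixation("red_cube", 0.4))             # None: not confident yet
print(on_fixation("red_cube", 0.4))             # "red_cube"
```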
| 0:38:53 | and we also realized |
|---|
| 0:38:58 | some typical gaze behaviors that are used in turn taking |
|---|
| 0:39:05 | so speakers usually look away from the addressee to indicate |
|---|
| 0:39:11 | that they are in the process of thinking about what to say next |
|---|
| 0:39:15 | and also to show that they do not want to be interrupted |
|---|
| 0:39:20 | and typically, at the end of an utterance, the speakers |
|---|
| 0:39:23 | look at the other person |
|---|
| 0:39:26 | because they want to know how the addressee responds |
|---|
| 0:39:28 | what the addressee is |
|---|
| 0:39:31 | thinking about what has been said |
|---|
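A sketch of these turn-taking gaze behaviors as a small state machine; the states and events are assumptions for illustration:

```python
# Sketch: gaze controller for turn taking. Look away while planning and
# holding the turn; look at the addressee at the utterance end to invite
# a response.
TRANSITIONS = {
    ("idle",      "start_speaking"): ("speaking",  "gaze_away"),     # plan, hold turn
    ("speaking",  "utterance_end"):  ("yielding",  "gaze_at_user"),  # invite response
    ("yielding",  "user_speaks"):    ("listening", "gaze_at_user"),
    ("listening", "user_done"):      ("idle",      "gaze_neutral"),
}

def step(state, event):
    next_state, gaze_command = TRANSITIONS.get((state, event), (state, None))
    return next_state, gaze_command

state = "idle"
for event in ["start_speaking", "utterance_end", "user_speaks", "user_done"]:
    state, gaze = step(state, event)
    print(state, gaze)
```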
| 0:39:31 | so basically |
|---|
| 0:39:33 | we realized shared attention: the robot follows the user's hand movements, and the robot |
|---|
| 0:39:40 | follows the user's gaze |
|---|
| 0:39:42 | we realized social gaze |
|---|
| 0:39:45 | so here the robot initiates and reciprocates mutual gaze |
|---|
| 0:39:49 | and finally we enabled the robot to make inferences about what the user |
|---|
| 0:39:54 | intends |
|---|
| 0:39:55 | and that i will show you in a |
|---|
| 0:39:58 | video |
|---|
| 0:40:06 | (video: the human and the robot collaborate on a construction task, with the robot using gaze for grounding) |
|---|
| 0:41:06 | in the video, "the red one" is of course ambiguous, so the robot asks |
|---|
| 0:41:12 | which one is meant |
|---|
| 0:42:53 | and we did an evaluation of this work |
|---|
| 0:42:58 | and what we found was that actually the object-directed gaze was more effective than |
|---|
| 0:43:04 | the social gaze |
|---|
| 0:43:07 | so people were able to interact more efficiently with object groundings |
|---|
| 0:43:12 | and the dialogues were much shorter |
|---|
| 0:43:15 | and there were less misconceptions |
|---|
| 0:43:18 | but the social grounding did not improve the perception |
|---|
| 0:43:23 | of the interaction |
|---|
| 0:43:25 | which is of course a pity, because we spent quite some time on mutual gaze |
|---|
| 0:43:32 | and one assumption is that people were concentrating on the task instead |
|---|
| 0:43:38 | of the social interaction with the robot |
|---|
| 0:43:40 | and we might investigate whether, if you have a more social task, for example looking at |
|---|
| 0:43:46 | family photos with the robot |
|---|
| 0:43:47 | the social gaze might become more important |
|---|
| 0:43:52 | and another assumption, which we did not yet try, is |
|---|
| 0:43:57 | that some people are focusing more on the task and some others are focusing more on |
|---|
| 0:44:01 | the social interaction; people can be classified like this |
|---|
| 0:44:06 | and particular people |
|---|
| 0:44:08 | might appreciate the social gaze more |
|---|
| 0:44:12 | than others |
|---|
| 0:44:14 | so finally, i would like to come to recent developments; so |
|---|
| 0:44:20 | we started to record |
|---|
| 0:44:23 | interactions and dialogues |
|---|
| 0:44:27 | and to use the data from both sides: |
|---|
| 0:44:30 | to learn how humans interact with each other, but also to teach machines and robots |
|---|
| 0:44:36 | how they can interact |
|---|
| 0:44:38 | with humans |
|---|
| 0:44:39 | so in a project which was already mentioned yesterday |
|---|
| 0:44:45 | we have collected a corpus of dyadic dialogues between |
|---|
| 0:44:51 | humans |
|---|
| 0:44:52 | and the dialogue data then had to be labeled |
|---|
| 0:44:58 | and we actually integrated active learning and cooperative learning |
|---|
| 0:45:05 | into the annotation work; so basically the idea is that the system actually |
|---|
| 0:45:11 | decides which samples should be |
|---|
| 0:45:16 | labeled by the human annotator, and it also decides which samples can be |
|---|
| 0:45:26 | labeled automatically |
|---|
| 0:45:29 | and so the human is asked to label examples |
|---|
| 0:45:34 | for which the classifier actually |
|---|
| 0:45:38 | has a low confidence |
|---|
| 0:45:40 | and with that approach we were able to |
|---|
| 0:45:47 | make the annotation process |
|---|
| 0:45:51 | significantly more efficient |
|---|
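A sketch of the uncertainty-based selection behind such a cooperative annotation workflow: low-confidence samples are routed to the human, the rest are labeled automatically; the model outputs here are stand-ins:

```python
# Sketch: split samples between human annotation and automatic labeling
# based on the current classifier's confidence.
import numpy as np

def split_for_annotation(probabilities, confidence_threshold=0.8):
    """probabilities: (n_samples, n_classes) from the current classifier."""
    confidence = probabilities.max(axis=1)
    to_human = np.where(confidence < confidence_threshold)[0]
    auto_labels = {i: int(probabilities[i].argmax())
                   for i in np.where(confidence >= confidence_threshold)[0]}
    return to_human, auto_labels

probs = np.array([[0.95, 0.05], [0.55, 0.45], [0.10, 0.90]])
humans, auto = split_for_annotation(probs)
print(humans)        # [1]  -> low-confidence sample goes to the annotator
print(auto)          # {0: 0, 2: 1}
```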
| 0:45:53 | and this is basically an integration of the nova annotation tool and the ssi system |
|---|
| 0:45:59 | which i mentioned earlier |
|---|
| 0:46:02 | and for the interaction studies we actually annotated an additional phenomenon |
|---|
| 0:46:10 | namely interruptions |
|---|
| 0:46:13 | in |
|---|
| 0:46:15 | dialogues between humans |
|---|
| 0:46:21 | so let me come to a conclusion |
|---|
| 0:46:25 | i think that human robot interaction research cannot avoid treating |
|---|
| 0:46:34 | the problem of |
|---|
| 0:46:35 | appropriate social interaction between robots and humans |
|---|
| 0:46:40 | and this holds |
|---|
| 0:46:42 | in particular |
|---|
| 0:46:47 | if a robot is employed in people's homes |
|---|
| 0:46:49 | and what we need, of course, is a fully integrated system consisting of perception |
|---|
| 0:46:54 | reasoning |
|---|
| 0:46:55 | learning, and responding |
|---|
| 0:46:58 | and in particular there is at the moment a big gap between the perception |
|---|
| 0:47:04 | and the reasoning; so the reasoning is |
|---|
| 0:47:08 | kind of neglected |
|---|
| 0:47:10 | at the moment in favor of black-box |
|---|
| 0:47:14 | approaches |
|---|
| 0:47:15 | which are useful for |
|---|
| 0:47:18 | actually detecting |
|---|
| 0:47:20 | social cues such as laughter |
|---|
| 0:47:24 | but after that we need to reason about what |
|---|
| 0:47:28 | the social signal actually means |
|---|
| 0:47:32 | and of course interdisciplinary expertise is |
|---|
| 0:47:37 | necessary in order to emulate aspects of social intelligence; that's why |
|---|
| 0:47:41 | we cooperate a lot with, for example, |
|---|
| 0:47:44 | psychologists |
|---|
| 0:47:46 | and we have made a lot of our software publicly available, in |
|---|
| 0:47:51 | particular the ssi system for social signal |
|---|
| 0:47:55 | interpretation; and, as a result of our work on dialogue, there is also the visual scenemaker |
|---|
| 0:48:01 | a dialogue |
|---|
| 0:48:02 | authoring tool |
|---|
| 0:48:05 | that internally builds on finite state automata |
|---|
| 0:48:11 | and which has actually been connected to various virtual agents but also |
|---|
| 0:48:17 | to all kinds of |
|---|
| 0:48:18 | robots |
|---|
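A sketch of the design idea behind such an authoring tool: the dialogue flow is authored once and then rendered through interchangeable embodiments; the backend classes and the trivial two-node flow are hypothetical:

```python
# Sketch: one authored dialogue flow, multiple interchangeable embodiments.
class NaoBackend:
    def say(self, text): print(f"[NAO] {text}")

class VirtualAgentBackend:
    def say(self, text): print(f"[virtual agent] {text}")

SCENEFLOW = [("greet", "Hello!"), ("ask", "How can I help you?")]

def run(backend):
    for _node, line in SCENEFLOW:       # same authored flow, any embodiment
        backend.say(line)

run(NaoBackend())
run(VirtualAgentBackend())
```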
| 0:48:20 | and with this i am at the end of my talk; thank you |
|---|
| 0:49:46 | yes, so actually that's a good point |
|---|
| 0:49:50 | because |
|---|
| 0:49:53 | you could argue that, by design, the robot is at some point able to |
|---|
| 0:49:57 | recognize where the user is looking |
|---|
| 0:50:01 | at a much higher level of accuracy than any human would be |
|---|
| 0:50:05 | and some people, because they were aware of that, explicitly also pointed; and of course |
|---|
| 0:50:13 | pointing is a flexible kind of referring act |
|---|
| 0:50:19 | and in that particular video the discourse features are just there |
|---|
| 0:50:24 | for the illustration |
|---|
| 0:50:26 | whether somebody would see the benefits |
|---|
| 0:50:31 | of it quickly, or also adopt this kind of behavior, we just |
|---|
| 0:50:37 | observed here: we had people cooperating in direct contact with the robot |
|---|
| 0:50:42 | and the behavior was quite individual: some people use pointing, some people do not use pointing |
|---|
| 0:50:49 | but nevertheless the robot can usually still work |
|---|
| 0:50:55 | with |
|---|
| 0:50:59 | the gaze information |
|---|
| 0:51:02 | and because it was a controlled study, people usually believe they have |
|---|
| 0:51:08 | to perform, and so they are really concentrating on the task |
|---|
| 0:51:13 | and so that's probably why |
|---|
| 0:51:15 | they did not |
|---|
| 0:51:19 | appreciate the social gaze so much; okay, so it is not that the people |
|---|
| 0:51:24 | did not notice it at all: when turn taking was realized via gaze, the |
|---|
| 0:51:30 | dialogue was more efficient, because it was clear |
|---|
| 0:51:36 | when the robot was expecting user input; but in terms of |
|---|
| 0:51:42 | subjective evaluation, the users did not judge the robot's behavior as more natural or |
|---|
| 0:51:52 | more social |
|---|
| 0:51:55 | but in my case it's really a task-based |
|---|
| 0:51:59 | scenario |
|---|
| 0:52:00 | i did not have time to show the video of humans collaborating on the same |
|---|
| 0:52:07 | task |
|---|
| 0:52:08 | but we have |
|---|
| 0:52:10 | some examples of human-human interaction as well, not just the human robot interaction; and |
|---|
| 0:52:18 | interestingly, in those cases the two humans coordinated very well; in fact, when |
|---|
| 0:52:25 | a piece was being taken |
|---|
| 0:52:27 | or a piece was very close to being taken, they had to look at each other |
|---|
| 0:52:32 | and at the objects on the table |
|---|
| 0:52:34 | and this was quite interesting to see |
|---|
| 0:53:30 | that is because, actually, strictly speaking, we |
|---|
| 0:53:36 | acquired a robot which looks like a human: it has what |
|---|
| 0:53:43 | look like arms and what look like hands, and so intuitively people of course talk |
|---|
| 0:53:51 | to it; and it may well be that, in a more expressive condition, |
|---|
| 0:53:58 | it |
|---|
| 0:54:00 | could express itself |
|---|
| 0:54:02 | more clearly |
|---|
| 0:54:03 | and it was also easy for people to relate to the robot; so we |
|---|
| 0:54:10 | brought the robot to an |
|---|
| 0:54:12 | old people's home, and at first people were |
|---|
| 0:54:15 | rather reserved and said, you know, why should we, at home, |
|---|
| 0:54:21 | want to be treated by a robot |
|---|
| 0:54:25 | and then they said, okay, as long as the robot just coaches us, it's okay |
|---|
| 0:54:30 | because it cannot replace a human |
|---|
| 0:54:31 | anyway |
|---|
| 0:54:33 | and indeed, regarding the robot's performance, people |
|---|
| 0:54:39 | realized exactly when the robot had something pointed out, and |
|---|
| 0:54:45 | they actually started to treat the robot like a real person; they would, for |
|---|
| 0:54:51 | example, try to shake its hand |
|---|
| 0:54:55 | and sometimes they were also |
|---|
| 0:54:58 | surprised: there was one lady, she was |
|---|
| 0:55:02 | around a hundred years old, and what she did was really |
|---|
| 0:55:07 | remarkable: she touched the robot and said |
|---|
| 0:55:10 | it's just plastic! i had imagined the robot would feel strange, but |
|---|
| 0:55:18 | you get along with it |
|---|
| 0:55:21 | and i also think lots of people find it easier to open up and talk to |
|---|
| 0:55:26 | a robot |
|---|
| 0:55:32 | thank you for the question |
|---|
| 0:56:17 | it probably depends on the setting |
|---|
| 0:56:19 | because, for example, in |
|---|
| 0:56:22 | a texas hold'em game |
|---|
| 0:56:25 | people actually intentionally show a particular emotional state, whereas when people regulate emotions |
|---|
| 0:56:37 | in everyday life, they usually do not really |
|---|
| 0:56:39 | think about it |
|---|
| 0:56:43 | and there are |
|---|
| 0:56:44 | quite some subtle properties behind it, so that the general |
|---|
| 0:56:51 | question is whether, just by looking at these cues with |
|---|
| 0:56:58 | machine learning, we would ever be able |
|---|
| 0:57:04 | to recognize |
|---|
| 0:57:07 | the emotional state: what has actually happened |
|---|
| 0:57:11 | and in what situation |
|---|
| 0:58:00 | i believe that the face is quite important |
|---|
| 0:58:05 | so i was at a presentation by a company that was really proud of their |
|---|
| 0:58:10 | robot, and it did not have facial expressions; it did not have anything like that |
|---|
| 0:58:17 | and somebody in the audience said: i don't understand the point, it's just a loudspeaker |
|---|
| 0:58:25 | and what is the point? so i think |
|---|
| 0:58:30 | this brings it back to the face being important as well; and with |
|---|
| 0:58:35 | the |
|---|
| 0:58:35 | robot we showed before |
|---|
| 0:58:39 | that was possible: with the gaze and, apart |
|---|
| 0:58:47 | from that, the head pose actually |
|---|