Speech Transcript - Situated Interaction

0:00:19	but not any
0:00:23	i don't fit the crime and i have great pleasure in introducing
0:00:29	the second keynote speaker of the conference dan bohus from microsoft research
0:00:36	and
0:00:39	seeing everything either restart at microsoft research
0:00:45	and that is where he has been for the last twelve years and he's going to
0:00:49	talk to us about situated interaction
0:00:52	okay thanks a lot thanks ingrid thanks for the introduction and also for the invitation
0:00:57	to talk it's great to
0:00:59	be back here i think i missed the last couple of years but
0:01:02	this is always and
0:01:03	a central to come back to
0:01:05	so the title of the talk is situated interaction and i think it's gonna
0:01:10	dovetail pretty well with the panel discussion we had at the end of yesterday
0:01:15	about narrowing versus broadening of the field
0:01:18	and interesting questions we might all be working on there are basically
0:01:23	two main points that i would like to highlight in this talk the first one
0:01:28	is that dialogue is really a multimodal highly coordinated complex affair
0:01:35	that goes well beyond the spoken word
0:01:38	i don't know how many of you are familiar with the work of ray birdwhistell
0:01:41	birdwhistell was an anthropologist that did some of the seminal work on
0:01:46	kinesics back in the sixties
0:01:48	and basically studying the role of body movement in communication and in one of his books
0:01:53	he essentially or
0:01:55	comments on how basically
0:01:59	perhaps the problem with the early records that we have of studies of communications
0:02:03	is that they were done by literate people
0:02:06	now all joking aside it is the case that if you look at
0:02:10	most of the work we do today in dialogue
0:02:12	is really heavily anchored in text in the written word and at best in the
0:02:17	spoken word
0:02:19	but in reality we do a lot of work with our bodies when we interact
0:02:23	with each other when we communicate with each other and the surrounding physical context also
0:02:27	plays a very important role in these interactions
0:02:30	from where we place ourselves in space relative to each other the stance we adopt
0:02:35	to where our gaze goes moment-by-moment to facial expressions head nods hand gestures
0:02:41	prosodic contours
0:02:43	all of these channels come into play when we interact with each other
0:02:47	and so that's the view of dialogue that i would like to highlight today
0:02:50	the second point that i'm gonna try to make in this talk is i think
0:02:54	we're also at a very interesting time
0:02:57	when in the last decade or so we've seen very fast paced advances based on
0:03:01	deep learning
0:03:01	in areas like vision and in general perception and sensing
0:03:05	and i think these advances are getting us to this point where we're able to
0:03:10	start building machines that understand
0:03:12	people in physical space and how people move and behave in physical space
0:03:17	i think it's a very interesting time in that sense and just like in the
0:03:20	nineties
0:03:21	advances in speech recognition have broadened the field and opened up this whole area
0:03:25	of spoken dialogue systems with all the research that has come with that
0:03:29	and that today has led to these mobile assistants in our pockets i think these
0:03:33	advances in vision and in the perceptual technologies
0:03:36	give us a chance to again broaden the field in this direction of
0:03:40	physically situated dialogue and more generally situated interaction
0:03:45	so what i'm gonna do in this talk is i'm gonna try to give us a
0:03:47	sense of this area based on some research vignettes from our own work at m
0:03:52	s r
0:03:52	over the last ten years or so in this space
0:03:56	and hopefully i'll be able to convey to you my excitement about it and maybe get
0:04:00	more of you guys to look into this direction
0:04:02	because i think there's a lot of interesting and open problems in this space and
0:04:07	i think a lot of the people in this room will have quite a bit to
0:04:10	contribute to solving these problems
0:04:13	so finally before i get going before we dive in i also wanna make sure i
0:04:17	thank my collaborators that i've had over the years i've been lucky to work with
0:04:22	fabulous people at m s r
0:04:25	and have long-term collaborations with folks like eric horvitz and sean andrist who is
0:04:29	here and also many other researchers
0:04:33	talented engineers and great interns we've had
0:04:35	over the years and
0:04:37	some of the work you'll see and the work we've done at m s r in
0:04:39	this space would not be possible without their help so
0:04:42	thank them
0:04:43	okay so let's get started situated interaction well
0:04:47	i started working in this space shortly after i joined m s r around
0:04:51	two thousand and eight and the main question that has been driving my research agenda
0:04:56	since has been basically how do we get computers
0:05:00	to reason about the physical space around them
0:05:02	and to interact with people in this kind of open world physically situated setting in a
0:05:07	fluid and seamless manner
0:05:09	and the general approach i've taken towards that space has been one where
0:05:13	we built a variety of systems
0:05:15	and we've deployed them in the wild and by deploying in the wild what i mean in
0:05:20	this case is place them in some public space in our building where people would
0:05:24	naturally encounter and interact with them without much instruction
0:05:28	so it's not a controlled setting they're just deployed some
0:05:31	where out there and people just come and interact with them
0:05:33	then we observe the interactions and we let that drive what are the research problems
0:05:39	that we go and address we find what are the problems we need to solve
0:05:42	by observing what happens in this kind of ecologically more valid setting
0:05:47	and try to let that
0:05:49	give us direction and so to make this concrete and to give you a sense
0:05:52	of the variety of systems we built i'm gonna start by showing you a few videos
0:05:56	and then we can go more into some of the research questions we've looked at
0:06:02	the first video i'm gonna show you is from the system that we refer to as
0:06:06	the assistant
0:06:07	it's a virtual agent based system that's placed outside eric's office and interacts with people
0:06:13	that come by whenever he is not available or maybe when he is available but busy in
0:06:17	his office
0:06:18	and basically the system does some simple assistive type tasks like
0:06:23	handling meetings and taking you know some notes to relay and so on
0:06:29	it's connected to quite a wide infrastructure it has access to eric's calendar but also
0:06:33	to other machine learned models that predict his availability when is he gonna be back in
0:06:38	his office you know what's the likelihood that he will attend a particular meeting
0:06:42	and so on
0:06:43	but what i want to highlight with this video is not so much that part
0:06:46	as much as
0:06:47	the multi party dialogue or interaction capabilities here the system has a camera a wide angle
0:06:53	camera at the top and a microphone array and
0:06:56	it's able to basically reason about multiple people
0:06:59	and understand who it engages with and have dialogue in this kind of open multiparty
0:07:04	setting
0:07:05	based on the roles and needs that these people have
0:07:16	and you hear forty five o'clock meeting we can eat
0:07:20	i'm sorry i told you hear for each meeting could you please one appeal h
0:07:26	and it is not here and then it makes me look at scanning to see
0:07:30	if i can tell me he's trying to see he should be meeting in his
0:07:34	office now saline actually not sure he's he will probably be packing amount fifteen minutes
0:07:40	listen would you like to have unit or maybe come back later
0:07:46	and you could try sending him an email message i'm expecting to look at seen
0:07:51	in amount a mean it's fancy rejecting nighttime
0:08:05	great
0:08:08	thanks
0:08:10	so over the years we built a variety of these systems based on virtual
0:08:14	agents this is a prototype that was aiming to do shuttle reservations on campus
0:08:18	for people moving from one building to another when you're going to the lobby you
0:08:21	can say i'm going to this building and get a shuttle
0:08:24	we built a fun trivia questions game that we deployed in a corridor near one
0:08:29	of our kitchens where the system would try to engage people that go by into
0:08:33	this questions game like it'll ask you what's the longest river in the world then you
0:08:37	try to figure out the answer but the interesting bit here is that
0:08:40	the system is trying to do this in some sense
0:08:42	cooperatively it's trying to get people to reach a consensus before revealing the answer and moving
0:08:47	to the next question
0:08:48	we did a lot of interesting studies on engagement and how do you attract a
0:08:51	bystander a lot of times people kind of sit back and watch from a distance what
0:08:56	happens so working on how do you attract bystanders
0:08:59	into an interaction so again studying various problems related to multi party dialogue in open
0:09:03	world settings
0:09:05	we've done work that has also nothing to do with language so i'm using the term
0:09:08	situated interaction
0:09:10	purposefully
0:09:12	because my interests are in sort of how do we
0:09:15	get machines to interact with people whether there's language or not
0:09:18	this is an example a system we call the third generation elevator
0:09:23	what you're seeing here is a view from the top in our atrium
0:09:28	there's basically let's see if this works the elevator doors are over there this is
0:09:32	a fisheye distorted view from the top but this is in front of the bank
0:09:35	of elevators where people are going by
0:09:37	so we built a simple model that just does optical flow and based on features
0:09:41	from optical flow
0:09:42	it tries to anticipate by about three seconds when the button will be pushed
0:09:46	so as you walk towards the elevator it pushes the button for you the idea was
0:09:49	let's build a star trek elevator but if you just simply go by you
0:09:53	know nothing happens
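As a rough illustration of the elevator idea, the sketch below scores how strongly observed motion points toward the elevator door and triggers the button only above a threshold. It is not the model from the talk, which was learned from optical flow features; the function names, the pixel coordinates, and the threshold value are all assumptions made for this example.

```python
import math

def approach_score(flow_vectors, positions, door):
    """Mean flow component pointing toward the door (hypothetical feature)."""
    if not flow_vectors:
        return 0.0
    total = 0.0
    for (dx, dy), (x, y) in zip(flow_vectors, positions):
        to_door_x, to_door_y = door[0] - x, door[1] - y
        dist = math.hypot(to_door_x, to_door_y)
        if dist > 0:
            # Project the motion vector onto the unit vector toward the door.
            total += (dx * to_door_x + dy * to_door_y) / dist
    return total / len(flow_vectors)

def should_press_button(score, threshold=1.0):
    """Assumed decision rule: press only when motion clearly heads to the door."""
    return score > threshold

# Someone walking toward the door versus someone walking past it.
walking_toward = approach_score([(0, 2), (0, 2)], [(0, 0), (1, 0)], door=(0, 10))
walking_past = approach_score([(2, 0)], [(0, 0)], door=(0, 10))
```

A learned classifier over many such features would replace the fixed threshold; the projection only shows why walking past the elevators produces no trigger.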
0:09:56	and it's not necessarily that i think this is how elevators will work in
0:09:59	the future but it's an exploration and a nod
0:10:02	to this idea that machines should be able to reason about and think about how
0:10:07	people behave in physical space
0:10:09	and drive interesting interactions off of that and the system has been running for years in
0:10:13	our lobby and by now
0:10:15	no one notices it's there in some sense it just works
0:10:20	we've in the last years also started looking in the direction of interaction with robots
0:10:24	so human robot interaction and a system that we've done a lot of research with are
0:10:29	these directions robots we have three of these guys we have them deployed on each
0:10:33	of the floors in our building as you come out of the elevator
0:10:36	and they can give you directions inside the building so you can ask for meeting
0:10:40	rooms or various people
0:10:41	and they can directly there are four
0:10:47	conference room three hundred
0:10:50	go to be useful way
0:10:53	turn right into three down the hall we review
0:10:56	conference room three hundred will be the first room on your right
0:11:02	your will
0:11:07	john is in all use number forty one twenty
0:11:10	here
0:11:11	t v all of your to the fourth floor
0:11:14	you're right when you mix of the elevator and continue to the end of this
0:11:18	fall
0:11:19	john solve this will be in that we have revealed
0:11:25	okay so hopefully this gives you guys a sense of the class of systems we've
0:11:29	been building and working with and doing research with over the years
0:11:33	now when you try to build these things and have them actually work in the
0:11:37	wild in this kind of uncontrolled settings you quickly run into a number of
0:11:41	problems that otherwise you might not even think of or consider
0:11:45	so
0:11:46	a lot of the problems with interaction i think we as humans solve
0:11:51	subconsciously this is so ingrained in us that we don't think about it
0:11:55	but you know once you try to do something with a machine and computationalize
0:11:59	it you run into the actual problems so the first problem you have to solve is
0:12:03	that of engagement knowing
0:12:05	who am i engaged with in an interaction and when
0:12:08	like this is all obvious to us whenever we're in an interaction
0:12:11	but a machine has to reason about it for instance here it needs to reason that
0:12:14	even though these two guys are looking away from it at this moment
0:12:19	they're actually still engaged in an interaction with the machine they're looking away because the
0:12:24	robot just pointed over there and she while she's been looking at the machine all
0:12:28	the time she's actually not engaged in this conversation and going one step further the
0:12:32	robot might reason that well perhaps she's you know in a group with them and waiting for
0:12:35	them
0:12:36	or perhaps she's not in a group with them but has an intention to engage with
0:12:40	the robot once they're done there's all this reasoning that we do kind of
0:12:43	on automatic and we don't think about but you have to kind of program
0:12:47	the machine to do it
0:12:49	once you can solve the problem of engagement another problem you have to solve
0:12:52	is that of turn taking and you know the standard dialogue model we all often
0:12:58	work with is one where dialogue is a volley of utterances by the system
0:13:02	and user and system and user this breaks to pieces immediately once you're in a
0:13:06	multi party setting
0:13:08	you need to reason not only about when utterances are happening but you need to reason
0:13:11	about who's producing them
0:13:13	who are the utterances addressed to and who does the producer expect would talk
0:13:19	next so who is the next ratified speaker here
0:13:22	should i as a robot inject myself at the end of this utterance that i
0:13:25	heard or should i wait 'cause someone else is gonna respond
0:13:28	so the problem gets more complex
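The multiparty turn-taking questions above (who produced an utterance, who it was addressed to, and who is expected to speak next) can be written down as a small data structure. This is a minimal sketch under assumed names: `Utterance`, `should_take_turn`, and the participant labels are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

@dataclass(frozen=True)
class Utterance:
    speaker: str                 # who is producing the utterance
    addressees: FrozenSet[str]   # who the utterance is addressed to
    start: float                 # start time in seconds
    end: float                   # end time in seconds

def should_take_turn(utt: Utterance, me: str,
                     expected_next: Optional[str]) -> bool:
    """Take the floor only if addressed and nobody else is expected to reply."""
    if me == utt.speaker or me not in utt.addressees:
        return False
    return expected_next in (None, me)

to_robot = Utterance("alice", frozenset({"robot"}), 0.0, 1.2)
to_bob = Utterance("alice", frozenset({"bob"}), 2.0, 3.1)
```

In a two-party system every user utterance is implicitly addressed to the system, so none of these fields are needed; that is the assumption that stops holding in a multiparty setting.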
0:13:30	and again all of this
0:13:32	we do on automatic and it's regulated with gaze with prosody with
0:13:36	how we move our bodies and so on and only once you can kind of
0:13:41	deal with these two problems you can start worrying about speech recognition and decoding the
0:13:45	signals and understanding what is actually contained in the signals that we send to each
0:13:49	other
0:13:50	and doing the high-level interaction planning and dialogue control so in some sense
0:13:56	we view this as
0:13:59	almost like a minimal set of communicative competencies that you need to have to
0:14:03	do this kind of interaction in open world settings
0:14:05	and over the years our research agenda
0:14:08	has been basically looking at various problems in these processes by trying to
0:14:16	leverage the information we have about the situated context the who the what and the
0:14:20	why of the surroundings
0:14:23	so that's kind of the very high level kind of fuzzy one slide about
0:14:26	what the research has been about at m s r in the last ten years
0:14:29	in this space and i'm gonna dive in now and show you two different examples in
0:14:36	a little bit more detail i'm not gonna go very technically deep i'll point you to
0:14:40	the papers i'm happy to talk more
0:14:42	offline but i want to show you give you a sense of what the research
0:14:46	problems look like i'm gonna start with the problem that has to do with engagement
0:14:52	i've already mentioned
0:14:54	engagement is a process candy sidner defines this as the process by which
0:14:57	participants
0:14:58	initiate maintain and terminate the conversations that they jointly undertake now in a lot
0:15:04	of classical dialogue work i mean think of telephony applications or mobile phones and
0:15:09	so on it's a
0:15:11	trivial problem to solve right i push a button i know i'm engaged or i
0:15:14	pick up a phone call i know i'm engaged i don't have a really big
0:15:17	problem to solve however if you have a robot or system that's embodied and situated
0:15:20	in space this becomes a more complex problem
0:15:24	and just to illustrate sort of the diversity of behaviors
0:15:28	with respect to engagement that one might have
0:15:32	we sort of
0:15:34	captured this video this was many years ago at the start of this
0:15:38	work
0:15:38	it's a video from the receptionist prototype the one that was doing the shuttle
0:15:42	reservations
0:15:43	and it mostly highlights how by reasoning about
0:15:46	three engagement variables in particular engagement state am i engaged in a conversation or not engagement
0:15:53	actions which regulate the transitions between the states and engagement intentions which are different from
0:15:58	states
0:15:59	by reasoning about these three key variables you can construct fairly sophisticated policies
0:16:04	in terms of how you manage engagement in you know a group setting
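The three engagement variables named above (state, actions, intentions) can be sketched as a tiny state machine. The transition rule used here, where engagement starts only when both sides take an engage action and ends when either side disengages, is an assumption for illustration, as are the action names and the `system_policy` function; the talk does not spell these out.

```python
from enum import Enum

class State(Enum):
    NOT_ENGAGED = "not_engaged"
    ENGAGED = "engaged"

class Action(Enum):
    ENGAGE = "engage"
    MAINTAIN = "maintain"
    DISENGAGE = "disengage"
    NO_ACTION = "no_action"

def next_state(state, person_action, system_action):
    """Assumed rule: both must engage to start; either can end it."""
    if state is State.NOT_ENGAGED:
        both = person_action is Action.ENGAGE and system_action is Action.ENGAGE
        return State.ENGAGED if both else State.NOT_ENGAGED
    either_leaves = Action.DISENGAGE in (person_action, system_action)
    return State.NOT_ENGAGED if either_leaves else State.ENGAGED

def system_policy(state, person_intends_to_engage):
    """Hypothetical policy: act on the inferred intention, not only the state."""
    if state is State.NOT_ENGAGED:
        return Action.ENGAGE if person_intends_to_engage else Action.NO_ACTION
    return Action.MAINTAIN if person_intends_to_engage else Action.DISENGAGE
```

Keeping intention separate from state is what lets a policy greet someone who is approaching but not yet engaged, or hold an engagement open for someone who has briefly looked away.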
0:16:08	so i'll play this video for you in a second just before i do that
0:16:11	to help you with the legend here and all this annotation
0:16:14	a yellow line
0:16:16	below the face means this is who the system is engaged with at some point
0:16:19	so this is the system's viewpoint what they see is one of these avatar heads
0:16:23	that but for us
0:16:25	dotted line is an engagement that is currently suspended
0:16:28	the red dot moving around that right now it's on eric's face shows the direction
0:16:33	of the avatar's gaze
0:16:35	so i'll run this for you
0:16:38	sorry for the quality of the audio here
0:16:42	here
0:16:50	right
0:16:54	right
0:16:59	for
0:17:03	right
0:17:21	alright thank you
0:17:24	i
0:17:27	right
0:17:28	sure
0:17:29	yes
0:17:31	or not
0:17:34	yes
0:17:37	in addition
0:17:40	right
0:17:45	sure
0:17:50	sure
0:17:52	here
0:17:54	right
0:18:07	right
0:18:09	sure
0:18:12	right
0:18:16	i
0:18:19	so there's many behaviors in here that fly by pretty fast like for instance when
0:18:24	the receptionist turns from eric to me and my attention is in my cellphone it says
0:18:28	excuse me and waits for my attention to come up to continue that engagement or
0:18:33	at the end when i'm pacing somewhere far away in the distance
0:18:37	the moment i turn my attention towards it even though i'm at a distance it
0:18:40	initiates this engagement because you know i still have this task of getting
0:18:44	the shuttle so it can give me an update
0:18:47	there's a lot of behaviours that you can create from relatively simple inferences
0:18:51	now obviously this is a demonstration video that was shot in the
0:18:54	lab and probably we had to do it i don't know three five times to
0:18:57	get it right
0:18:58	this stuff does not work that well when you put it out there in the
0:19:02	wild and i will show you in a second how well it works in the wild
0:19:05	but this is almost like a north star video like a north star direction for
0:19:09	us in our research work
0:19:11	we wanna be able to create systems where the underlying inference models are so robust
0:19:16	that
0:19:16	we can actually have this kind of fluid interactions out there in the wild
0:19:21	so let me
0:19:23	show you how it works in practice and talk about a particular give an example
0:19:27	of a research problem in this space
0:19:30	i'll start with this video that kind of motivates it pay attention to how badly in
0:19:36	this case it's a video from the directions robot
0:19:39	how badly the robot is negotiating disengagement so the moment of breaking off the
0:19:44	interaction
0:19:48	you need help finding something
0:19:52	a room that hallway and on my
0:19:55	by the way would you mind swiping your badge on the remote so i know
0:20:00	wideband park with not
0:20:03	thank you hear anything else i can help you find nothing
0:20:07	okay
0:20:08	think that it
0:20:10	i'm or
0:20:12	i know i help you find something else no thank you
0:20:17	okay that
0:20:19	by
0:20:21	not very good he's running to the bottom so
0:20:26	so what happens here well what happens here is that at this point in time
0:20:30	it's obvious to all of us that this interaction is over
0:20:34	but all the machine sees is just
0:20:37	the rectangle of where the face is back in the day that's all the tracking
0:20:40	we were doing it doesn't understand this gesture
0:20:44	and so at this point the robot continues the dialogue with is there anything else i
0:20:48	can help you find and this is quite a long production now what's interesting here
0:20:54	is that just a couple of seconds right after that by this
0:20:58	point by this frame
0:20:59	the robot's engagement model can actually tell that this person is disengaging but by that
0:21:04	time it's already too late because we've already started producing this is there anything else
0:21:09	and the person hears this and turns and we're in this bad loop now where we
0:21:14	are basically not negotiating this disengagement properly and the person starts coming back so now they're
0:21:18	engaged again
0:21:19	and we get into this problem
0:21:22	so what's interesting here is that the robot eventually knows
0:21:25	and so the idea that comes to mind is
0:21:27	well
0:21:28	if we could somehow forecast from here that at some future time this person is likely
0:21:33	to disengage with some
0:21:36	good probability
0:21:37	we could perhaps use hesitations to mitigate the uncertainty people often use hesitations in situations
0:21:43	of uncertainty so if we could somehow forecast and we don't need to be perfect in that forecast that
0:21:48	at t zero plus delta this person might be disengaging instead of launching
0:21:52	this production we could launch a filler or like a hesitation like so
0:21:56	and then if at t zero plus delta we find them disengaging we say so
0:22:00	well i guess i'll catch you later then
0:22:02	or alternatively if they are not we can still say so is there anything
0:22:06	else i can help you find and that doesn't sound too bad and so the
0:22:10	core idea here is
0:22:12	let's forecast what's gonna happen in the future
0:22:15	and maybe use hesitations to mitigate the associated uncertainty
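The hesitation idea can be sketched as a two-step decision: open with a filler when the forecast says disengagement is likely, then commit to one of two continuations once the uncertainty resolves. The function names and the 0.5 threshold are assumptions for illustration; the utterance wording follows the examples given in the talk.

```python
FULL_PROMPT = "is there anything else i can help you find"
FILLER = "so"

def open_turn(p_disengage_soon, threshold=0.5):
    """Start with a filler if the person is forecast to disengage soon."""
    return FILLER if p_disengage_soon >= threshold else FULL_PROMPT

def after_filler(person_disengaged):
    """Continuation chosen once the forecast horizon has passed."""
    if person_disengaged:
        return "well i guess i'll catch you later then"
    return FULL_PROMPT
```

The filler buys time cheaply: both continuations sound natural after "so", which is why the forecast does not need to be very accurate for the policy to help.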
0:22:20	now how do we do this well we have an interesting approach here that is
0:22:24	in some sense self-supervised
0:22:26	the machine eventually knows so we can leverage that knowledge you basically roll back time and
0:22:31	you can learn from your own experience basically without the need for any manual supervision
0:22:36	so you have a variety of features i'm illustrating here three features like the location
0:22:40	of the face in the image and the size of the face
0:22:42	which kind of descends this is where they start moving away right the
0:22:45	size of the face is kind of a proxy for how far away from you
0:22:48	they are we have all sorts of probabilistic models for instance for inferring where their
0:22:52	attention is is their attention on the robot or is their attention somewhere else
0:22:56	and there's many such features in the system
0:22:59	now the idea is you start with a
0:23:01	very conservative heuristic for detecting disengagement and you wanna be conservative because
0:23:06	the flip side of the equation breaking the engagement when someone is still engaged is even
0:23:11	more painful so you don't wanna kind of stop talking to someone while
0:23:14	they're talking to you so you err on the conservative side which means you're gonna be
0:23:18	late in detecting when they disengage
0:23:20	but you will eventually detect that they disengage at some point you would exceed some
0:23:24	probability threshold that says they're disengaging and then what you can do is like i
0:23:29	said you roll back time so let's say you wanna anticipate that moment by five seconds
0:23:32	well it's easy to automatically construct a label that looks like that and five seconds
0:23:37	ahead of time predicts that event
0:23:39	and then you train a model from all these features that you have to predict
0:23:42	this label
0:23:44	now this model is not gonna be it's gonna be far from perfect but you'll
0:23:47	probably detect that moment a bit earlier on
0:23:50	so if you use the same threshold of point eight you might have you know
0:23:55	you might be able to detect by this much earlier we call this the early
0:23:58	detection
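The "roll back time" labeling step can be sketched as follows: find the first frame where the conservative heuristic crosses its threshold, then mark every frame from `horizon` frames before that point as positive, so that a model trained on these labels learns to fire earlier. The function name and the frame-based horizon are assumptions; the 0.8 threshold matches the value mentioned in the talk, and the mapping from five seconds to a number of frames depends on a frame rate that is not stated.

```python
def forecast_labels(p_disengage, threshold=0.8, horizon=5):
    """Build self-supervised labels by shifting the late detection earlier.

    p_disengage: per-frame probabilities from the conservative heuristic.
    horizon: how many frames ahead of the detection to start the label.
    """
    labels = [0] * len(p_disengage)
    # First frame where the conservative heuristic finally fires.
    detect = next((i for i, p in enumerate(p_disengage) if p >= threshold), None)
    if detect is None:
        return labels  # no disengagement observed, all frames stay negative
    start = max(0, detect - horizon)
    for i in range(start, len(labels)):
        labels[i] = 1
    return labels

# The heuristic fires at frame 4; with horizon=2 the label starts at frame 2.
example = forecast_labels([0.1, 0.2, 0.3, 0.5, 0.9, 0.95], horizon=2)
```

No manual annotation is involved: the system's own late but reliable detection supplies the ground truth, which is what makes online adaptation possible.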
0:23:59	and so then you go and train models with all these features and really the
0:24:02	technical details are not that important here the point i wanna make is a high-level
0:24:05	point in this case i think
0:24:07	we used logistic regression boosted trees whatever your favorite machine learning technique is and you can
0:24:12	see that like you know for the same false positive rate you can kind of
0:24:15	increase how early you can detect disengagement over baseline heuristics
0:24:19	the other sort of high-level lesson is that
0:24:21	by using multi modal features you tend to improve your performance so we use features
0:24:26	related to focus of attention location and tracking confidence scores
0:24:30	dialogue features like dialog state how long we've been in there and so on
0:24:33	each of these individually do something and then when you add them all up together
0:24:37	you get better results which is generally
0:24:40	something that tends to happen with multimodal systems
0:24:44	again the high-level point i wanna make here is
0:24:47	forecasting as a construct i think is very interesting like there's been a lot of
0:24:52	work recently in dialogue with incrementality and i think forecasting goes hand in hand with that
0:24:57	because it's very important in order to be able to achieve
0:25:01	the kind of fluid coordination we want we probably have to anticipate more
0:25:06	and then it also presents these interesting opportunities for learning directly from experience without manually
0:25:12	labeling data because in general if you wanna forecast an event like you have the
0:25:16	label you know when it happens you just know it too late but you can
0:25:19	still learn from all of that and you can do that online and the system
0:25:23	can adapt to the particular situation it's in
0:25:26	so i think those are a couple of interesting lessons sort of at the high level
0:25:29	from this work
0:25:31	i'm gonna switch gears
0:25:32	and talk about a different problem that lives more
0:25:37	relatively speaking in the turn taking area you know just like engagement is a
0:25:43	rich mixed-initiative process you know by which we regulate how we initiate interactions
0:25:48	turn taking is also you know mixed-initiative incrementally controlled by the participants it's this process
0:25:55	by which we regulate who takes the turn to talk
0:25:58	in conversation and as i mentioned before again in a lot of traditional dialogue
0:26:04	work we make the simple turn taking assumptions of you speak then i speak
0:26:08	then you speak then i speak and maybe there's barge ins that are being handled
0:26:11	in multiparty settings you really need to develop a more sophisticated model 'cause you need to understand
0:26:16	who's talking to whom at any given point in time
0:26:19	and when is your time to speak
0:26:20	and we've done a
0:26:22	bunch of work in that direction i'm not gonna show you that i'm gonna show
0:26:25	you a different problem that relates to turn taking that i think illustrates even better
0:26:30	this high degree of coordination and multimodality in situated dialogue and this has to
0:26:37	do with coordination between speech and attention
0:26:39	and in some sense this work was prompted by reading some of goodwin's work on
0:26:45	disfluencies and attention so goodwin made this interesting observation about disfluencies in one of
0:26:52	his papers
0:26:53	we all know that if you look at transcripts of conversational speech it's full of false
0:26:58	starts and restarts and disfluencies so they're gonna look like
0:27:01	you know the speaker says anyway
0:27:03	we went to i want to bad or brian you're gonna have
0:27:06	you can still have to go or i can't mean and also mercy down the
0:27:10	car choice of these this part of a t v transcribe like very literally you
0:27:14	know conversational speech these are everywhere and they create problems for speech recognition people in
0:27:19	language modeling people and so on conversational speech is hard
0:27:23	well goodwin had the interesting insight of looking at this in conjunction with gaze
0:27:28	so here's the listener's gaze
0:27:30	and the region in red dots
0:27:32	is where the listener is not looking at the speaker
0:27:36	this is the point where mutual gaze gets reestablished and then we have mutual gaze
0:27:40	between listener in speaker
0:27:42	and something that's really interesting in these examples is that things become much more
0:27:45	grammatical
0:27:47	in regions of mutual gaze
0:27:49	and this leads to kind of one interesting hypothesis that maybe disfluencies are not just
0:27:54	errors in production maybe some of these disfluencies we have
0:27:58	actually fulfil a coordinative purpose they are used to regulate and coordinate and make sure
0:28:03	that either i'm able to attract your attention back if it has drifted away
0:28:09	or whenever i deliver what i want to deliver i really have your attention
0:28:13	and so
0:28:15	partly inspired by this work and partly inspired by behaviors in our systems
0:28:20	we did a bunch of work on coordinating speech and attention
0:28:25	so let me show you an example in contrast to
0:28:28	what humans are able to do without thinking about it
0:28:31	here our robot is not able to reason about where the person's attention is
0:28:36	there's a bunch of speech recognition errors in this interaction as well but i'd like
0:28:40	you to pay more attention to basically how the robot is not able to take
0:28:44	into account where the participant's attention is as the interaction is happening she's just looking at her
0:28:49	phone trying to get
0:28:50	the number for the meeting she's going to but the robot is ignoring all that
0:28:56	right
0:28:58	or
0:28:59	i think that again
0:29:05	metric in going right
0:29:09	during that would help or go back to like you want right
0:29:15	well o where
0:29:19	or maybe not so she's you know she's just looking at her phone trying
0:29:24	to find the and the robot keeps pushing this question over and over where are you going
0:29:27	where are you going and so that's you know quite different from what people are doing
0:29:32	so inspired by goodwin's work we did some work on
0:29:36	basically coordinating speech with the
0:29:40	attention and the idea here was to have a model where on one hand
0:29:44	we model the attentional demands
0:29:46	like where does the robot expect the person's attention to be
0:29:50	and on the other hand we model attentional supply where is the actual attention going
0:29:55	so attentional demands are defined at the phrase level so for every output that the
0:29:59	robot is producing at the phrase level
0:30:01	we have an expectation about where attention should be in most cases it probably should
0:30:04	be on the robot but it is not always the case when for instance i point
0:30:08	over there and say to get to thirty eight hundred i might expect that
0:30:11	your attention will go over there and actually if your attention doesn't go over there maybe
0:30:14	we have a problem
0:30:16	so we are specifying descent is are manually specified basically of just like a natural
0:30:20	language generation for every output we have one of these
0:30:23	expected attention targets and then on the other hand we make inferences about where is
0:30:27	your attention
0:30:28	and we do that based on machine learning models that use radio features us on
0:30:33	and so forth
0:30:34	whenever there's a difference between the two instead of just ballistically producing speech synthesis
0:30:40	we use this coordinative policy that basically interjects the same kinds of pauses and filled
0:30:46	pauses and false starts and restarts
0:30:49	that humans do it basically creates these disfluencies
0:30:52	to get to a point where attention is exactly where we expect it to be and
0:30:55	only then we continue so instead of saying to get to thirty eight hundred we
0:30:59	might pause for a while say excuse me maybe say the first two words to get
0:31:03	pause more and so on before we actually produce the utterance
0:31:08	so all of this is again done on a phrase by phrase
0:31:13	basis
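
A minimal sketch of the phrase-by-phrase coordination described above, under stated assumptions. The names `Phrase`, `coordinate` and `sense_attention`, the string-valued attention targets, and the fixed list of repairs are all hypothetical and are not taken from the deployed system; the talk says attentional supply is inferred by machine learning models, which is replaced here by a plain callback.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Phrase:
    text: str
    expected_target: str  # attentional demand, specified by hand per phrase


def coordinate(
    phrases: Iterable[Phrase],
    sense_attention: Callable[[], str],
    repairs: Tuple[str, ...] = ("<pause>", "excuse me"),
) -> List[Tuple[str, str]]:
    """Return the ordered output events as (kind, text) pairs."""
    events: List[Tuple[str, str]] = []
    for phrase in phrases:
        used = 0
        # Attentional supply is re-sampled before each decision to repair.
        while used < len(repairs) and sense_attention() != phrase.expected_target:
            events.append(("repair", repairs[used]))
            used += 1
        # Speak once supply matches demand, or once the repairs are used up.
        events.append(("speech", phrase.text))
    return events


# Hypothetical attention readings, consumed one per sample.
readings = iter(["phone", "hallway", "hallway"])
events = coordinate(
    [
        Phrase("to get to thirty eight hundred", "hallway"),
        Phrase("walk to the end of this hallway", "hallway"),
    ],
    lambda: next(readings),
)
```

In this run the first reading is on the phone, so one pause is injected before the first phrase; the second phrase is produced without any repair. Capping the number of repairs is a design choice of the sketch so that the output never stalls indefinitely.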
0:31:13	here is again a demonstration video of
0:31:16	eric and i bad actors trying to kind of illustrate this behavior
0:31:23	yes or no i
0:31:33	for me excuse
0:31:42	for
0:31:47	where he is just you know it's fashion
0:31:58	so still a bit clunky you know but you get the sense and the idea let
0:32:02	me show you a few interactions captured in the wild once we deployed this coordinative
0:32:08	mechanism
0:32:09	in here basically
0:32:11	the regions in black are the productions that you know the robot normally produces the
0:32:17	synthesis these are phrase boundary delimiters
0:32:20	and the regions in orange are
0:32:24	these filled pauses interjections that are dynamically injected on the fly
0:32:29	based on where the user's attention is
0:32:40	rule
0:32:47	all you cough that the volume was kind of level
0:32:57	excuse me
0:33:00	really
0:33:02	you know you
0:33:14	you are right
0:33:21	here you direction
0:33:25	so that excuse me might be a bit aggressive you know there's a lot of
0:33:28	tuning once you put this in there you realise the next layer of
0:33:31	problems that you have and how synthesis is not quite conversational enough and you know
0:33:36	like the nuances of saying filled pauses or an excuse me and so on
0:33:42	and while these videos again might make it look like wow like we can
0:33:45	go quite far again i don't wanna leave you with the wrong impression a lot of
0:33:49	work remains to be done
0:33:51	these things often fail the videos i've shown are instances where things work
0:33:54	relatively well i would say
0:33:56	but these things often fail and i want to show you one interesting example
0:34:00	of a failure
0:34:02	right
0:34:04	we would be
0:34:08	you give you
0:34:09	what would be included
0:34:16	well you will see later
0:34:19	so the signals to say whoops
0:34:21	so what actually happens here well what happens here is that
0:34:25	we are you know paying a lot of attention to coordinating our speech with
0:34:31	the participant's attention
0:34:33	but we're completely ignoring what his upper body and torso is signalling so what happens
0:34:38	here is
0:34:38	the robot gets to this phrase where it says to get there walk to the
0:34:41	end of this hallway
0:34:43	at which point the person feels that maybe this is the end of the instructions
0:34:47	so they start turning both their face and their body to kind of indicate that
0:34:51	they might be leaving right
0:34:54	the robot sees their attention go away and thinks well i'm gonna wait for their
0:34:58	attention to come back and the long pause that gets created further reinforces the
0:35:03	person's belief that this is the end of the directions so i'm just going
0:35:06	even though the robot had all these other things to say right
0:35:09	and so the robot in some sense ignores the signal from his upper
0:35:14	body and if the robot could take into account that signal we could
0:35:17	be a bit smarter and maybe not wait there maybe use a different mechanism to
0:35:21	get their attention back
0:35:22	or maybe just
0:35:23	blast through that you don't always have to coordinate exactly that way right and
0:35:27	so
0:35:29	i love this example because it really highlights and drives home this point i'm trying
0:35:34	to make i think that
0:35:35	dialogue is really highly coordinated and highly multimodal dialogue between people in face-to-face settings
0:35:41	has these properties you know
0:35:44	we've talked about coordinating speech and gaze
0:35:47	and we've seen in this example how not reasoning about body pose gets us into
0:35:51	trouble
0:35:52	there's many other things going on we do head gestures like nods and shakes and
0:35:57	all sorts of other head gestures and there's a myriad of hand gestures you know
0:36:01	from beat to metaphoric to iconic to deictic gestures
0:36:05	facial expressions smiles frowns expressions of uncertainty
0:36:09	where we
0:36:10	put our bodies and how we move dynamically prosodic contours all of these things
0:36:14	come into play and they're highly coordinated frame-by-frame moment-by-moment and the coordination that happens is
0:36:21	not just across the channels
0:36:23	it's across people
0:36:25	and these channels and so i'd like us to think about dialogue in this view
0:36:29	to move from a view of you know a sequence of turns into a view of a
0:36:35	multimodal incrementally co-produced process
0:36:38	and i think if we do that i think there's a lot of interesting opportunities
0:36:42	because of these enabling technologies that are coming up these days
0:36:46	so i've shown you a couple of problems in the space of turn taking and
0:36:51	engagement there's many more problems and every time we touch one of these we really
0:36:55	feel like we barely scratched the surface
0:36:58	take for instance engagement i talked for a bit about
0:37:01	how to forecast disengagement and maybe negotiate the disengagement process better but there's many other
0:37:08	problems how do we build robust models for making inferences about those engagement variables like
0:37:13	states engagement actions and intentions
0:37:16	how do we construct measures of engagement that are more continuous here all the
0:37:21	work we've done is on i'm engaged or i'm not engaged well in educational or tutoring
0:37:25	or other kinds of settings you want a more continuous measure of engagement
0:37:28	how do you reason about that
0:37:31	similarly many other problems in turn taking understanding how do we ground all these things
0:37:35	in the physical situation there's interesting challenges with rapport with negotiation grounding well lots of
0:37:43	open space lots of interesting problems once you start thinking about how the physical world
0:37:46	and all these channels interact with each other
0:37:50	like i said i think we have these interesting opportunities because
0:37:53	there has been a lot of progress in the visual and perception space
0:37:57	the tracking facial expression tracking smiles affect recognition and so on that can
0:38:03	help us sort of in this direction
0:38:06	i think the other thing that i really want to highlight besides the
0:38:09	current technological advances that i think is very important
0:38:12	is all this body of work that comes from connected fields like anthropology sociology
0:38:18	psycholinguistics sociolinguistics conversation analysis context analysis and so on
0:38:23	there's a wide body of work basically
0:38:25	as soon as people got their hands on video tapes in the fifties and sixties
0:38:28	they started looking carefully at
0:38:30	human communicative behaviours
0:38:32	and all that work was done
0:38:34	based on you know small snippets of video and if you think about it today
0:38:37	we have millions of videos
0:38:40	and interesting and powerful data techniques so there's interesting questions about how do we bring
0:38:46	this work into the present and how do we leverage all the knowledge and the
0:38:49	theoretical models that have been built in the past
0:38:51	i've put here just some names there's many more
0:38:54	people that have done work in this space and i picked one title from each
0:38:57	of them and each of these guys
0:38:58	has full bodies of work i really recommend that
0:39:01	as a community we look back more on all this work that has
0:39:05	been done already in human communication and try to understand how to leverage that
0:39:09	when we think of dialogue
0:39:12	so
0:39:14	with that i guess i have ten minutes left i wanna kind of
0:39:17	switch gears a bit and talk more about
0:39:20	challenges because you know
0:39:22	there's a lot of opportunity there's a lot of open field
0:39:25	but working in this space is not necessarily easy either and when i think of
0:39:30	challenges i think at the
0:39:32	high level i think of three kinds of categories there's obviously the research challenges that
0:39:37	we have like i wanna work on this problem of forecasting disengagement how do i solve
0:39:41	that there's obviously the research challenge
0:39:44	but i'm gonna leave those aside and gonna try to talk about two other kinds
0:39:48	of challenges one is data and experimentation challenges and we touched briefly on this in
0:39:52	the panel yesterday i think getting data for these kinds of systems is it's not
0:39:57	easy and stuff
0:39:59	if you look at a lot of our adjacent fields like machine translation and speech
0:40:03	recognition nlp and so on
0:40:05	a lot of progress has been accomplished by you know
0:40:09	challenges with datasets and clear evaluation metrics and so on
0:40:12	in dialogue this is not easy to do and it is not easy to do because
0:40:15	dialogue is an interactive process you cannot easily study it on a fixed dataset
0:40:20	because by the time you've
0:40:22	made an improvement or change something the whole thing behaves differently
0:40:26	and so that creates challenges generally for dialogue and even more so for multimodal
0:40:31	dialogue in the multimodal space right
0:40:33	then apart from the data challenges there's also kind of experimentation challenges
0:40:38	we've done a lot of the work we've done in the wild because i feel
0:40:42	like you see the real problems you see ecologically valid settings and you see what
0:40:47	really happens
0:40:48	some of these phenomena are actually even probably
0:40:52	challenging and hard to study in controlled lab settings like studying engagement how
0:40:56	these breakdowns happen and so on you can think of all sorts of things with confederates and
0:40:59	you can try to you know figure out controlled experiments but it is not easy and
0:41:04	on the other hand experimenting in-the-wild is not easy either for many reasons
0:41:09	one of the
0:41:10	other kinds of challenges in here are purely building the systems right so
0:41:14	in our work over the last ten years the way we've gotten our data is by
0:41:17	building systems and deploying them right
0:41:21	but building systems is hard and so in the last five minutes i wanna talk
0:41:25	a bit about actual engineering challenges because i think they're just as important in that
0:41:29	they kind of put a damper on the research and they kind of stifle
0:41:34	things from moving faster forward building these kinds of multimodal systems is hard for
0:41:39	a number of reasons
0:41:41	first there's a problem of integration they leverage many different kinds of technologies
0:41:45	that
0:41:47	are of different types operate on different time scales the sheer complexity and the number
0:41:51	of boxes you have to have in one of these systems kind of makes the problem
0:41:55	challenging
0:41:56	but then there's other things where constructs that are pervasive in the systems like time
0:42:01	space and uncertainty are nowhere in our
0:42:04	programming fabrics like
0:42:06	it's kind of clear to me that time for instance is not a first-order
0:42:10	citizen in any programming language that i can think of so every time i wanna
0:42:14	do something that's over time or streaming
0:42:16	i have to go implement my buffers and my streaming and my you know like
0:42:19	i kind of have to go from scratch and it's similar for space and uncertainty
0:42:23	but it is very important because
0:42:26	we want to create systems that are fluid
0:42:27	but the sensing thinking acting all of these things take time
0:42:32	being fast is not even enough often times you need to do fusion in the
0:42:36	systems and things arrive with different latencies so you need to coordinate basically so
0:42:40	you need to kind of reason about time in a deeper sense deep down
0:42:45	below and the same things can be said i think in these systems about
0:42:49	the notions of space and notions of uncertainty
0:42:52	and finally the other thing that kind of puts a damper is the fact
0:42:55	that the development tools we have
0:42:58	are not here for this class of systems right so the development environments and debuggers
0:43:03	and all of this stuff
0:43:05	they were not developed with this class of systems in mind
0:43:09	and if i think back on all the work we've done i don't know
0:43:12	half of our time was maybe spent on building the tools to build the systems rather than
0:43:16	building the systems or doing the research right and so
0:43:20	basically driven by a lot of the lessons we've learned over the years
0:43:25	in the last three or four years at msr we basically embarked
0:43:29	on this project and i wanted to spend the last couple of minutes telling you
0:43:32	about it because if there's any people in the room that are interested in
0:43:36	joining the space this might be useful for them
0:43:39	we've worked on developing an open-source platform that
0:43:42	basically aims to simplify building these systems
0:43:46	the end goal being lowering the barrier to entry and enabling more research in this
0:43:51	space so it's a framework that's really targeted at researchers
0:43:55	it's open source and it
0:43:59	supports the construction of these kinds of situated interactive systems
0:44:04	we call it
0:44:06	platform for situated intelligence which is kind of a mouthful so we abbreviate it as psi pronounced like
0:44:10	the greek letter psi
0:44:12	and i want to just give you a whirlwind tour in two minutes just to kind
0:44:15	of give a sense of what's available in there
0:44:19	the platform consists of three layers there's a runtime layer
0:44:23	a set of tools and a set of components the runtime basically provides all
0:44:28	this infrastructure
0:44:30	for building systems that operate over streaming data and have latency constraints anytime you have
0:44:34	something interactive
0:44:36	it's latency constrained
0:44:38	so there's a certain model for parallel coordinated computation that actually feels pretty natural you
0:44:43	just kind of connect components with streams of data you know so it's the standard sort
0:44:48	of data flow model
0:44:50	but the streams have some really interesting properties and i don't have time to
0:44:55	get into
0:44:56	the full detail and all the glory here
0:44:59	but i wanna kind of highlight some of the important aspects so for instance i
0:45:03	mentioned time how time needs to be a first-order citizen well we baked that in from
0:45:08	day one deep below in the fabric all messages that are flowing through are timestamped
0:45:13	at the origin when they're captured
0:45:15	and then as they flow
0:45:16	through the pipeline
0:45:18	we have access not only to the
0:45:20	time the message was created by the component that created it but also to that originating
0:45:24	time
0:45:25	so we know this message has a latency of four hundred and thirty milliseconds so
0:45:29	in the entire graph we can be latency aware at all points
0:45:32	which enables synchronization so we provide a whole time algebra and synchronization mechanisms when you
0:45:37	work with streaming data
0:45:39	that pair these messages correctly and so on
0:45:41	so it is basically all about enabling coordinated computation where time is really a first-order citizen
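
The idea of carrying an originating time alongside a creation time can be sketched as follows. This is an illustrative Python sketch, not the platform's actual API: `Message`, `derive` and `join_nearest` are hypothetical names, times are integer milliseconds for simplicity, and nearest-in-time pairing within a tolerance stands in for the richer time algebra mentioned in the talk.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass(frozen=True)
class Message:
    data: Any
    originating_ms: int  # when the underlying signal was captured at the sensor
    creation_ms: int     # when the emitting component produced this message

    @property
    def latency_ms(self) -> int:
        return self.creation_ms - self.originating_ms


def derive(source: Message, data: Any, now_ms: int) -> Message:
    """A downstream result keeps the originating time of its input."""
    return Message(data, source.originating_ms, now_ms)


def join_nearest(
    primary: List[Message], secondary: List[Message], tolerance_ms: int
) -> List[Tuple[Message, Message]]:
    """Pair each primary message with the secondary message closest in originating time."""
    pairs = []
    for p in primary:
        candidates = [
            s for s in secondary
            if abs(s.originating_ms - p.originating_ms) <= tolerance_ms
        ]
        if candidates:
            best = min(candidates, key=lambda s: abs(s.originating_ms - p.originating_ms))
            pairs.append((p, best))
    return pairs


frame = Message("frame", originating_ms=1000, creation_ms=1005)
faces = derive(frame, "face boxes", now_ms=1430)  # slow vision component
audio = [Message("a0", 990, 992), Message("a1", 1020, 1022), Message("a2", 1055, 1057)]
pairs = join_nearest([faces], audio, tolerance_ms=15)
```

Here `faces.latency_ms` is 430, and the face result is paired with the audio captured around the same moment (`a0`) rather than with whatever audio happened to arrive when the slow component finished.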
0:45:49	the streams can be automatically persisted so there's a logging infrastructure
0:45:53	that is there for free for any data type you know you can stream any of
0:45:57	your data types and we can automatically persist those and because we persist them with
0:46:02	all this associated timing information
0:46:04	we can enable more interesting replay scenarios where i say well forget about these
0:46:08	sensors let's play back from disk
0:46:10	and tune this component and i can play this back from disk exactly as it
0:46:14	happened in real time or i can speed it up or slow it down time is
0:46:18	entirely under our control because it is baked deep down in the fabric
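
A rough sketch of why persisted timing information makes replay controllable. The JSON-lines log format, the `persist` and `replay` functions and the `speed` parameter are assumptions made for illustration; they do not describe the platform's real store format or API.

```python
import io
import json
import time
from typing import Any, Callable, Iterable, Iterator, TextIO, Tuple


def persist(records: Iterable[Tuple[int, Any]], log: TextIO) -> None:
    """Write (originating_ms, data) records, one JSON object per line."""
    for originating_ms, data in records:
        log.write(json.dumps({"t_ms": originating_ms, "data": data}) + "\n")


def replay(
    log: TextIO,
    speed: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield records with their original spacing divided by `speed`."""
    previous_ms = None
    for line in log:
        record = json.loads(line)
        if previous_ms is not None:
            sleep((record["t_ms"] - previous_ms) / 1000.0 / speed)
        previous_ms = record["t_ms"]
        yield record


log = io.StringIO()
persist([(0, "hello"), (100, "are you"), (300, "there")], log)
log.seek(0)

delays = []  # a stand-in for time.sleep that records the requested delays
replayed = [r["data"] for r in replay(log, speed=2.0, sleep=delays.append)]
```

With `speed=2.0` the gaps of 100 ms and 200 ms are replayed as 0.05 s and 0.1 s. Passing the sleep function in as a parameter is a choice made for the sketch so that replay can be tested without waiting.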
0:46:22	so these are some of the properties of the runtime there's a lot more
0:46:25	it is basically a very lightweight very efficient kind of
0:46:29	system for constructing things that work with streaming data
0:46:32	at this level we don't care we don't know anything about speech or dialogue or
0:46:36	components
0:46:37	it's agnostic to that you can use it for anything that operates with streaming
0:46:40	data and temporal constraints
0:46:42	the set of tools we built
0:46:44	basically are heavily centred on visualisation this is a snapshot from
0:46:49	the visualisation tool we have on the right there someone's actually using it and this
0:46:52	video is sped up a bit but these are the streams that were persisted in an application
0:46:56	these are just visualisers for different kinds of streams that can get composited and
0:47:00	overlaid
0:47:00	so this is a visualiser for an image stream this is a visualiser for
0:47:05	a face detection results stream this is audio this is voice activity detection that's a
0:47:09	speech recognition result this is a visualiser for a three d conversational scene analysis and the
0:47:15	basic idea is that you can composite and overlay these visualisers
0:47:18	and then you can navigate over time left and right and zoom in and look at
0:47:22	particular moments this is very powerful especially when coupled with debugging
0:47:28	and we're evolving this to visualize not just the data collected and running through the
0:47:33	systems
0:47:34	but also
0:47:35	the architecture of the system itself and you know the view of the component graph
0:47:42	and also towards annotation for supporting data annotation
0:47:45	finally at the components layer we are hoping to create an ecosystem of components where
0:47:51	people can plug n play different kinds of components we're bootstrapping this with things like
0:47:56	sensors imaging components vision audio speech output these are relatively simple components that we
0:48:01	have in the initial ecosystem
0:48:03	but the idea is that
0:48:05	it is meant to be an ecosystem and people are meant to contribute into it
0:48:08	it is an open source project already at boise state casey kennington has his own repository
0:48:13	of psi components
0:48:15	and so people are starting to use this and the hope is that as more
0:48:18	people use it
0:48:19	if i can get you to have eighty percent of what you need off-the-shelf and
0:48:24	just focus on your research
0:48:26	that's the key idea
0:48:28	last thing also is something we haven't released yet but we are planning to
0:48:32	release in the next few months
0:48:34	is an array of components that we refer to as a situated interaction foundation it's
0:48:41	basically a set of components at that level
0:48:43	plus a set of representations
0:48:45	that aim to further abstract and accelerate the development of these physically situated interactive systems
0:48:51	basically what we are planning to construct is
0:48:56	the ability to instantiate the perception pipeline where you as a developer of the system
0:49:00	just tell me where your sensors are and what sensors you have
0:49:03	so in this instance there is you know there's a kinect sensor the big box
0:49:08	there represents my office and there's a kinect sensor sitting on top of the screen
0:49:12	and if you tell me i have three sensors i'm gonna use the data from
0:49:15	all the three sensors and fuse it we'll configure the perception pipeline automatically from all the sensors
0:49:20	with the right fusion
0:49:22	and provide in the end
0:49:24	the kind of
0:49:25	analysis a deep scene analysis object that runs at frame rate at thirty frames
0:49:30	per second i'm gonna tell you things like here's where the people are in the
0:49:34	scene and what their body poses are here's where everyone's attention is
0:49:39	in this case there's an actual engagement happening between the two of us and an
0:49:43	agent that's on the screen
0:49:44	and stewart is you know directing the utterance towards
0:49:50	you know the agent and at some later point
0:49:53	we have peeled off we've gone more towards the back of the office towards the
0:49:56	whiteboard
0:49:57	and we're just talking to each other and so we're trying to provide all this
0:50:00	rich analysis of
0:50:02	the conversation in the conversation space including issues of engagement turn taking utterances sources targets
0:50:07	and all of that
0:50:08	from the available sensors and
0:50:10	if you give me more sensors
0:50:12	the idea is that you get the same object back
0:50:15	but at a higher fidelity because we have more sensors and we can fuse data
0:50:19	these parts have not been released yet they're coming out probably in the next couple
0:50:22	of months
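
One way to picture "the same object back but at a higher fidelity" is inverse-variance weighting, a standard fusion technique used here purely as an illustration. The talk does not say which fusion method the framework uses, and the function name `fuse` and the numbers below are hypothetical. Each sensor reports an estimate of one coordinate of a person's position together with a variance.

```python
from typing import List, Tuple


def fuse(estimates: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Fuse (value, variance) pairs by inverse-variance weighting.

    Returns the fused (value, variance); the output has the same shape
    whether one sensor or many contributed.
    """
    if not estimates:
        raise ValueError("need at least one estimate")
    weights = [1.0 / variance for _, variance in estimates]
    total = sum(weights)
    value = sum(w * x for w, (x, _) in zip(weights, estimates)) / total
    return value, 1.0 / total


# Hypothetical readings in metres with variances in square metres.
one_sensor = fuse([(1.00, 0.04)])
three_sensors = fuse([(1.00, 0.04), (1.20, 0.04), (1.10, 0.02)])
```

The fused variance shrinks as sensors are added, which is the sense in which the caller receives the same kind of result at higher fidelity. The weighting assumes independent sensor errors, which a real deployment would need to check.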
0:50:23	but our hope with the entire framework is basically to accelerate research in this space
0:50:27	to get people to be able to
0:50:29	build and experiment with these kinds of systems without having to spend two years to
0:50:33	construct all the infrastructure that's necessary
0:50:37	and so this brings me basically to
0:50:40	the end of my talk i'll conclude on this slide
0:50:44	i tried to adopt this view of dialogue in this talk and portray this
0:50:48	view of dialogue as a
0:50:50	multimodal incrementally co-produced process where participants in interaction really
0:50:56	do fine grained coordination across all these different modalities
0:50:59	i think there is
0:51:01	a tremendous number of opportunities here and i think it's up to us to basically
0:51:05	broaden the field in this direction because the
0:51:08	underlying technologies are coming and they are starting to get to the point where
0:51:12	they're reliable enough to start to do interesting work and again there's this
0:51:18	big body of work in human communication dynamics that we can leverage and that
0:51:22	we can draw upon
0:51:24	so i'll stop here thank you all for listening and i'll take questions
0:51:37	thanks very much and then
0:51:44	thanks dan it was so great to see
0:51:47	all this work again and how
0:51:50	impressive the research program is over the number of years to get to this point
0:51:53	i'm really looking forward to that
0:51:55	situated interaction foundation
0:51:58	coming out
0:51:59	i've a question i guess related partly to that but
0:52:03	one of the problems with integration is not just taking a bunch of pieces and
0:52:07	putting them together but
0:52:08	the maintenance of that over time as you add new pieces so
0:52:11	in particular for this last thing
0:52:15	how much can you just add a new component and expect everything else to
0:52:21	work the way it did and just have some value added by getting new information
0:52:24	and how much do you have to re-engineer the whole architecture to make sure that
0:52:30	you're not undoing things or getting into a problem i'm thinking
0:52:34	you know in terms of engineering that the recent plane flight crashes seem to stem
0:52:38	from this kind of thing where
0:52:41	different engineers design systems very well given a set of assumptions about what else would
0:52:45	be there
0:52:46	or not and then that changed under them and that's what seemed to cause the problem
0:52:50	right
0:52:51	i mean i completely agree i mean the ideal world is one where
0:52:56	you know everything works and you plug your thing in but the reality is
0:52:59	never that way right and it is gonna be like different people with different
0:53:02	research agendas you know view things differently they have different mental models or different
0:53:08	viewpoints from which they look at a problem and attack it
0:53:11	and i think that does create challenges that way i don't know how to solve all those
0:53:15	challenges
0:53:15	all i can say is that we're kind of aware of that and when
0:53:19	we're constructing this we're trying to
0:53:21	make as few commitments in some sense as possible to allow for the flexibility that's
0:53:26	needed for research
0:53:27	because i think there's actual value in all those different viewpoints and different architectures and
0:53:31	exploration
0:53:33	and so
0:53:33	yes i think what i can say is that we are purposefully trying to
0:53:37	not make hard commitments to what is an utterance you know i don't
0:53:42	wanna tell you what an utterance is i wanna have you have your opinion
0:53:45	of what an utterance is
0:53:46	but that also might mean that again when you try to plug in your speech recognizer
0:53:50	in my system
0:53:51	there might need to be some wrangling and so on or you know making these
0:53:54	components work together i don't know how we can solve this problem
0:53:57	i'm not a big believer in oh we'll all come together with the big beautiful
0:54:01	standard that we'll agree to i don't see that happening
0:54:04	we're just trying to design towards
0:54:07	flexibility i would say
0:54:10	and
0:54:11	i think that was a wonderful talk and you're highlighting these things and you're right
0:54:16	this is now ripe for us to be able to tackle and address and we
0:54:20	should be working more on this work beyond the simple turn and
0:54:26	sorry i might be introducing something even more complex down the line what about
0:54:31	user adaptation users are very good humans are very good at changing their behaviour based
0:54:36	on the system that is in front of them you know if it's a human or if it's that
0:54:42	you're on a call and there's a delay we will not backchannel because it
0:54:46	screws up the conversation
0:54:48	and people can adapt to this forty dollars
0:54:52	and that might be confusing to our learning this will then allowed to be able
0:54:56	to the
0:54:58	two shows the affects that
0:55:01	to windsor good adapting to rather the most natural ones of you thought about how
0:55:06	to
0:55:07	to hear about not getting the human to adapt or to be able to control
0:55:11	how the human adapts to the particular system
0:55:14	and the policies that you're doing that are adaptation
0:55:18	no i think it's a very interesting question so i think
0:55:20	so there's a couple of things here so one is i have seen in a lot of
0:55:24	the data that we observe a large variability
0:55:27	between people's attitudes and what people so
0:55:30	both in the you know just the initial meant like you that they come towards
0:55:33	the system and the expectations they have and also you how they do or do
0:55:38	not adapt to whatever the system is doing
0:55:40	well i guess my view is one thing i would say is i think more
0:55:45	of these systems should be learning continuously because you are basically in a continuous dance with
0:55:51	the person on the other end in this adaptation you know and
0:55:54	doing things in big batches
0:55:56	is likely to create more friction than doing things that are continuously adaptive so
0:56:00	i think that's an interesting direction to solve the problem
0:56:04	i feel
0:56:05	a lot of the work i do and the way i think of it is i want
0:56:08	to reduce this impedance mismatch in interaction between where machines are and where people are and i
0:56:13	think we still have a long way to travel with the machines this way
0:56:17	people always come wherever the machines are and meet them but i think i want that meeting point
0:56:21	to be closer to where the human is and that would make things easier
0:56:25	so i think of all
0:56:26	the work we've done in that way i see this as kind of
0:56:30	i'm gonna try to reduce that impedance from the machine side as much as possible
0:56:35	but you're right people will adapt and sometimes with clever designs you can actually you
0:56:40	know create interesting experiences that leverage that adaptation when you know it's gonna happen
0:56:46	but i think in most cases i'm in favour of systems that just
0:56:49	incrementally adjust themselves to be able to be at the right spot 'cause it continues
0:56:54	to shift
0:56:55	i don't know that really asses the questions were some sort surrounded
0:57:00	hi dan i'm robert ross from technological university dublin speaking maybe as one of the many
0:57:06	people here who over the years have wasted two years of our lives building dialogue
0:57:10	systems from the ground up i think what you presented there at the end
0:57:14	is fantastic but my question is a bit more specific
0:57:18	in terms of the work you did on interjections being used and hesitations being
0:57:23	used to sort of keep the user's engagement
0:57:26	in the work in the wild did you do any variation in terms of the
0:57:30	multimodal aspects of the task in other words the avatar that's being used the gestures that
0:57:36	were being used in fact whether or not using an avatar was a good idea
0:57:39	that's my fine grained question and then just a more general question is have you
0:57:45	looked at all at the issues
0:57:47	of engagement in terms of activity modeling because it's always struck me that a big problem
0:57:51	in situated interaction
0:57:53	when you move away from the kiosk style where the user is asking a question is
0:57:59	that users are engaged in activities and for us to truly get the situated interaction working
0:58:05	we necessarily need to track the user and what they're doing to be able
0:58:10	to make sensible contributions to the dialogue not just answer questions yep
0:58:14	so to the first part of the question the short answer is no but we
0:58:18	should have
0:58:19	like i think there's a
0:58:22	there's a rich set of nuances basically in how you do hesitations and
0:58:26	interjections and all these policies and definitely the corresponding nonverbal behaviors
0:58:31	would affect that
0:58:32	and we just seen the process of the prosodic o contours of a
0:58:35	so you know also was not such a good choice because so
0:58:39	as excitation sometimes
0:58:41	pricks people back the likes
0:58:42	so what
0:58:43	you know why wanna say but does are hard to synthesize it's on the display
0:58:47	the technology we have at the time
0:58:50	so i should say that yes definitely should consider those aspects
0:58:56	the second part of the of the question remind me was
0:59:03	so i think you're absolutely right a lot of the work i've shown and that
0:59:07	we've done actually in the last you know ten years has been
0:59:11	focused on interaction where the interaction is communication like there's some
0:59:17	communication that happens between the human and the machine
0:59:19	and the whole task is this conversation that we're having
0:59:22	we're actually just now starting to do more work with systems where the human
0:59:28	is involved in an actual task that's not just the communicative task
0:59:31	and we're trying to see how the machine can play a supporting role in that
0:59:35	and i think you're absolutely right like that kind of brings up the next interesting
0:59:39	level of how we really get collaboration going rather than just this kind of back
0:59:44	and forth of i can ask or answer question and so on i think that's
0:59:47	a very interesting space and we're just starting to play in that space
0:59:54	thank you very much for two where interesting a request i think is great but
0:59:58	this is
1:00:00	going out in the wild approach i was just wondering have you
1:00:05	i still assume that microsoft research office is
1:00:09	a certain type of people who are in there
1:00:13	so it's not completely out in the wild so it's sort of a question
1:00:17	of have you considered the sort of other i mean i guess children or
1:00:22	the other types of user groups that
1:00:25	or other types of problems that you might have in is more sort of accepting
1:00:29	or something no we have not so the short answer is again no we have not
1:00:33	but i completely agree like the population we have is just a very narrow very
1:00:37	specific one
1:00:39	it's interesting to me
1:00:40	how much variability i see even in that narrow cross section which makes me wonder
1:00:44	like you know and it's interesting there's a lot of variability even in
1:00:49	that narrow population
1:00:50	but you're absolutely right like it's not
1:00:53	truly in-the-wild it is not a true public space like
1:00:56	and so it would be very interesting to go there and see what kind of 'cause
1:01:00	yes populations are different than
1:01:05	we haven't done much outside this
1:01:08	okay let's thank dan again

Situated Interaction

Keynotes

Dan Bohus (Microsoft Research, Redmond, Washington, US)