0:00:15 Hello, I'm from Maluuba, and I'm here to present a dataset that I collected and annotated with my colleagues there. My colleague Hannes is actually here with me if you want to talk to him.
0:00:32 The motivation behind this dataset is that there is a need for dialogue systems to be able to handle complex interactions. One motivation comes from studies on e-commerce: there is a paper from 2011 showing that users who come to an e-commerce website sometimes arrive with a very well-defined goal in mind, but sometimes they just come to shop around; they don't really know what they want and just want to look at options.
0:01:01 There is also some interest in the dialogue community. Most notably, there was a paper last year, "Task Lineages: Dialog State Tracking for Flexible Interaction" by Lee and Stent; I think it was one of last year's best papers. In this paper they try to move beyond the traditional linear slot-filling paradigm and handle more complex conversations where you have different user goals, possibly across domains.
0:01:33 For this work they actually didn't have a proper dataset to test their method, because there wasn't anything available, so they modified an existing dataset. We therefore decided to collect data ourselves and promote this kind of work for future dialogue systems. We collected 1,369 human-human dialogues in the travel domain. We also propose a new task, frame tracking, and the dataset is fully annotated and publicly available at this URL.
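For readers of this transcript: the corpus statistics quoted later in the talk are easy to recompute once the dataset is downloaded. The sketch below is not from the talk; it assumes the released file is a JSON list of dialogues, each with a "turns" list, which may differ from the actual release format.

```python
import json

def load_dialogues(path):
    """Load the corpus; assumes a JSON file containing a list of
    dialogues, each with a 'turns' list of {'author', 'text'} dicts."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def corpus_stats(dialogues):
    """Return (number of dialogues, number of turns, average turns per dialogue)."""
    n_turns = sum(len(d["turns"]) for d in dialogues)
    return len(dialogues), n_turns, n_turns / max(len(dialogues), 1)

# Toy example in the assumed schema:
sample = [
    {"turns": [{"author": "user", "text": "I want to go to Toronto."},
               {"author": "wizard", "text": "I found 3 packages."}]},
    {"turns": [{"author": "user", "text": "Anything under $2000?"}]},
]
print(corpus_stats(sample))
```

Pointing `corpus_stats` at the real file via `load_dialogues` should reproduce the dialogue and turn counts mentioned in the talk, provided the assumed schema matches the release.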
0:02:11 When I talk about linear slot filling, I mean something like this; this is an actual dialogue from the dataset. The user gives some constraints: he wants to go somewhere from Columbus but doesn't really know where. The wizard, the agent who plays the role of the dialogue system, proposes two options, Vancouver and Toronto. The user then gives a bit more information about his constraints, asks for information about the offers from the wizard, and at the end books one of the proposed trips. Here the user goal never really changes during the dialogue; the user is just drilling down into some options.
0:02:53 By nonlinear slot filling I mean something like this dialogue, which is also from our dataset. It was too long to fit entirely on the slide, so I just cut the interesting part. On the left is a representation of the different options and goals that the user might have during the dialogue. At the beginning the user is talking about going to Toronto; then he explores other options, shown in green; but at the end of the dialogue he actually decides to go back to the Toronto trip. In this case the user goal changes during the dialogue, and the user also jumps from one goal to another. If we want to be able to actually book the Toronto package for this user, we need to remember it.
0:03:51 Let's dive into the details of the dataset. First, the domain: it's the travel domain, with travel packages consisting of a round-trip flight and a hotel. This is an example of a package: you have the flights with their times and dates, and for the hotel you have the category (the number of stars), guest ratings on a scale of one to ten, amenities, and vicinity. The top graph, which is a bit too small to read, shows the distribution of hotel vicinities: things like shopping malls, museums, universities, airports, et cetera. The bottom graph shows the number of amenities per hotel; amenities could be breakfast, wifi, whether the hotel has a spa, those kinds of things. Most hotels have more than one amenity, so that users had some grounds on which to compare hotels against each other. In total we had 268 hotels in 109 cities in the database.
0:05:10 For this dataset we hired twelve participants to collect the entire corpus over twenty days. Four of the participants worked through the entire data collection, and the others were hired for just one week. Each dialogue was conducted through a chat on Slack: we had a bot that paired up a user and a wizard, and then they were able to chat. When a user was paired with a wizard, he would get a task. We generated those tasks from templates like this one: basically, we tell the user his goal, and to generate the tasks we replace the placeholders for the different entities with values that we randomly drew from the database.
0:06:03 To vary the tasks, we added a success probability to each template. For this template, we would say it has a probability of 0.5 of succeeding. That means that when we actually query the database with the chosen entities, fifty percent of the time it will return results and fifty percent of the time it won't. When it won't return results, we would either tell the user to close the dialogue, or give him some alternative, like: if nothing matches your constraints, then try increasing your budget by twelve hundred.
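The template-instantiation scheme just described can be sketched as follows. This is a minimal illustration, not the authors' code: the template string, the toy database, and all field names are hypothetical.

```python
import random

def generate_task(template, db, p_success=0.5, rng=random):
    """Instantiate a task template: with probability p_success, pick
    constraint values the database can satisfy; otherwise pick values
    that return no results, so an alternative must be offered."""
    dest = rng.choice(sorted({p["dest"] for p in db}))
    prices = [p["price"] for p in db if p["dest"] == dest]
    if rng.random() < p_success:
        budget = max(prices)        # at least one package will match
    else:
        budget = min(prices) - 1    # no package will match
    task = template.format(dest=dest, budget=budget)
    solvable = any(p["dest"] == dest and p["price"] <= budget for p in db)
    return task, solvable

# Hypothetical template and database (placeholders use str.format syntax):
template = "Find a package to {dest} for under ${budget}."
db = [{"dest": "Toronto", "price": 1800},
      {"dest": "Toronto", "price": 2400},
      {"dest": "Recife", "price": 1500}]
print(generate_task(template, db, rng=random.Random(0)))
```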
0:06:42 As I said, we only had twelve participants, and we collected a bit more than a thousand dialogues. To keep it interesting for them, we told them to play roles and to vary the way they speak to the wizard. To encourage this a bit more, we also wrote some fun templates like this one: this was at the time when Pokémon Go was very popular, so we told them to pretend that they are a Pokémon hunter who really wants to go to a city because there is a very rare Pokémon there, and that they should find a good package to do so. We created such templates and released them throughout the data collection, so that the participants would keep getting different tasks and would stay engaged.
0:07:39 We also gave some instructions to the users to make sure that we collected dialogues we could use. We told them not to use too many abbreviations, but also to use some, so that the data stays a bit realistic. We told them to make the dialogues personal. We also told them to feel free to end the conversation at any time, because we wanted them to behave like real users, and we created some templates that would encourage this. One of the templates read: you're a pop star, you're an absolute diva, and you won't accept anything under five stars. So sometimes the user would act like a diva and just close the dialogue and leave. That was interesting for us, because it gives us different cases in the dataset: successful dialogues, but also dialogues where the user just leaves. We also told them to try to spell things correctly, to keep the language processing not too complicated. Finally, we told them to try to determine what they could get for their money, so that they would really explore the options, compare the hotels, and try to figure out what's in the database.
0:08:49 On the wizard side, the agent playing the role of the dialogue system, at the beginning of each dialogue they get a link to a search interface that looks like this. On the left you have all the searchable fields, and on the right you have the results; for each search the wizard always gets up to ten results, so from zero to ten. You can also see the little tabs on top. Every time the user changed a constraint (here it's the city, Baltimore), say the user asks "okay, what about Toronto?", we create a new search tab, so that if the user wants to go back, the wizard can do it easily and doesn't have to repeat the search all over again.
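The tab behaviour described here amounts to caching one search per distinct constraint set. A minimal sketch, with a hypothetical run_query callback standing in for the real database search:

```python
def get_search_tab(tabs, constraints, run_query):
    """One tab per distinct constraint set: reuse the cached results if
    the user comes back to an earlier search, otherwise run the query
    and open a new tab. The wizard only ever sees up to ten results."""
    key = tuple(sorted(constraints.items()))
    if key not in tabs:
        tabs[key] = run_query(constraints)[:10]
    return tabs[key]

# Demo with a fake query function that records how often it runs:
calls = []
def fake_query(constraints):
    calls.append(dict(constraints))
    return [{"hotel": "Hotel %d" % i} for i in range(12)]

tabs = {}
first = get_search_tab(tabs, {"dst_city": "Baltimore"}, fake_query)
again = get_search_tab(tabs, {"dst_city": "Baltimore"}, fake_query)
print(len(first), len(calls))
```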
0:09:52 We also gave instructions to the wizards, and those were quite critical for us to obtain a dataset where we can actually try to imitate the wizard's behaviour. We told them to be polite, and not to jump on the role played by the user or claim that it's a mistake. The next point relates to that: we told them "your knowledge of the world is limited to the database", because we don't want the wizard to start talking about Pokémon, or doing things that we wouldn't want a dialogue system to do. So we told them: you know the user is going to play a role and be kind of funny, but try to just talk like a dialogue system, basically. We also told them to try to spell things correctly. For the second point, we told them to vary the way they answer the user, and that sometimes they could say something a bit impromptu: imagine you're having a dialogue and in the middle of it the wizard says "hello"; it doesn't make sense. We did that because we have a lot of experience training dialogue systems with reinforcement learning, and the problem there is that if you only have positive examples, you never see what a mistake looks like, something that you shouldn't do at some point of the dialogue, and that makes training a bit hard.
0:11:03 As a way to measure how good the wizards were, we asked the user to rate the dialogue at the end of each dialogue. We told them to base the rating only on the wizard's behaviour: if they didn't get any results because there was nothing matching in the database, but the wizard was helpful, we told them to give the maximum score. We have those scores on a scale of one to five, and they are available with the dataset. As you can see, most dialogues have the maximal score of five, but some have lower scores, because the wizard was not fully cooperative or took actions that were not very helpful.
0:11:49 Now some other statistics of the corpus. This is the proportion of dialogues per dialogue length, that is, the number of turns in a dialogue. As you can see, the bulk of the dataset is around fifteen turns, and the average is about fifteen turns per dialogue, so even though we have only 1,369 dialogues, we have about twenty thousand turns in total. Then this is the distribution of dialogue act types in the dataset; we had about twenty dialogue act types. And this is the number of dialogue acts per turn: because these are human-human dialogues, there was very often more than one dialogue act per turn, and as you can see, about three percent of the time there is more than one dialogue act type per utterance.
0:12:46 Now, the frames in the dataset. What is a frame? As I said, what we really want to do is remember everything that the user has told us during the dialogue, so that we can get back to an option if the user decides to book that option in the end. We took inspiration from state tracking and from the definition of a state in the Dialog State Tracking Challenge. In that challenge, the state is defined by the user constraints and the user requests, that is, everything that the user asks for: if he asks for the price or for the name of the hotel, that's a request. We also added things that we saw in the dataset and that we needed. One is user binary questions: a request is when the user asks for the price; a binary question is when the user asks "is the price two thousand dollars?", for instance, which calls for a yes/no answer. We also added comparison requests, where the user asks to compare something between two hotels; you can ask whether hotel A is cheaper than hotel B, for instance.
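Putting the pieces of this state definition together, one frame could be represented roughly like this (a sketch only; the field names are illustrative, not the dataset's actual annotation keys):

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    """One user goal: the constraints set so far, plus the three kinds
    of user questions described in the talk."""
    frame_id: int
    constraints: dict = field(default_factory=dict)  # e.g. {"dst_city": "Toronto"}
    requests: list = field(default_factory=list)     # slots asked for, e.g. "price"
    binary_questions: list = field(default_factory=list)  # (slot, value) yes/no checks
    compare_requests: list = field(default_factory=list)  # (slot, frame_a, frame_b)

f = Frame(1, constraints={"dst_city": "Toronto", "budget": 2000})
f.requests.append("price")                  # "what is the price?"
f.binary_questions.append(("price", 2000))  # "is the price $2000?"
f.compare_requests.append(("price", 2, 3))  # "is hotel A cheaper than hotel B?"
print(f)
```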
0:13:58 These are examples of frames and of how they are related: the two hotel frames are children of the frame above them, as you can see. Something new in our dataset is that frames can be created by users but also by wizards: every time the wizard makes a proposition for a hotel, we create a frame, because we want to remember it in case the user wants to book this hotel. We made up a few rules for frame creation after analysing the dataset and seeing what makes sense.
0:14:35 For frame creation, we create a new frame every time the user changes a value. Here, at the beginning, the user wants to go to Atlantis, so that's one frame. Then, in this other utterance, the user asks to go to Neverland instead; the destination city changes, so we create a new, separate frame with this value for the destination city. He actually changes more entities here, but one entity changing is enough to create a new frame. That's one type of frame creation, but we also create a new frame when the wizard makes a proposition for a hotel, and we put all the properties of the hotel into this frame. This slide gives you the frequencies of those behaviours in the dataset.
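The two creation rules (a user-changed value, or a wizard offer) can be sketched as a single function. This is an illustration of the rules as described, not the annotation code; the act and frame representations are made up for the example.

```python
def maybe_create_frame(active, act, frames, next_id):
    """Frame-creation rules as described: a user 'inform' that changes an
    existing constraint value starts a new frame, and a wizard 'offer'
    always starts a new frame holding the proposed package's properties."""
    author, act_type, slots = act
    changed = any(active["constraints"].get(s) not in (None, v)
                  for s, v in slots.items())
    if (author == "user" and act_type == "inform" and changed) or \
       (author == "wizard" and act_type == "offer"):
        new = {"id": next_id, "constraints": {**active["constraints"], **slots}}
        frames.append(new)
        return new
    active["constraints"].update(slots)  # no new frame: extend the current one
    return active

frames = [{"id": 1, "constraints": {"dst_city": "Atlantis"}}]
active = frames[0]
# Changing the destination city creates frame 2:
active = maybe_create_frame(active, ("user", "inform", {"dst_city": "Neverland"}), frames, 2)
# Adding a brand-new constraint does not create a frame:
active = maybe_create_frame(active, ("user", "inform", {"budget": 2000}), frames, 3)
print([f["id"] for f in frames], active["constraints"])
```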
0:15:21 As for changing frames, as you can see, it's all user-controlled, because we want the wizard, and thus the dialogue system, to really be an assistant that proposes things; the user controls what we're talking about, the topic of the dialogue. So only the user has the power to change the frame we're talking about. That happens when we create a new frame: when the user provides a new value, for example changes the destination city, we automatically switch to that new frame. If the user decides to consider an option, a hotel, and asks for more information about this option, then we also switch to the frame corresponding to that option. And we can also switch to an earlier frame: if the user says, for instance, "okay, let's go back to the Toronto package", then we switch to the frame corresponding to the Toronto package.
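The "go back to an earlier frame" rule can be sketched as a matching step over previously created frames. Again, this only illustrates the described behaviour, with a made-up frame representation.

```python
def switch_frame(active, user_slots, frames):
    """If the user mentions slot values that all match an earlier frame
    (e.g. 'let's go back to the Toronto package'), make that frame
    active again; otherwise stay on the current frame."""
    for frame in reversed(frames):  # prefer the most recently created match
        if frame is active:
            continue
        if user_slots and all(frame["constraints"].get(s) == v
                              for s, v in user_slots.items()):
            return frame
    return active

toronto = {"id": 1, "constraints": {"dst_city": "Toronto"}}
neverland = {"id": 2, "constraints": {"dst_city": "Neverland"}}
frames = [toronto, neverland]
back = switch_frame(neverland, {"dst_city": "Toronto"}, frames)
print(back["id"])
```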
0:16:21 We also have annotations for dialogue acts and slots. For the dialogue acts, we have general-purpose functions, the typical dialogue acts: inform, offer, compare. We also have dialogue acts specific to frame tracking, such as switch_frame for the case when the user switches to a frame. For the slots, we have all the fields in the database, plus specific slots describing particular aspects of the dialogue. One is intent, the intent of the user, to book for instance. Action is its counterpart on the wizard side: if the wizard books a hotel, we annotate it as action=book. Count is when the number of hotels in the database matching the user constraints is given: sometimes the wizard will say "I have three more matching your constraints", and we would annotate that with count=3. Then we have slots specific to reporting the creation and modification of frames. We actually annotated the frames and their content automatically based on these slots: ref assigns each new frame a new reference id and is also used every time the user references a past frame; and then there are read and write.
0:17:47 I'm going to go faster here. This is an example of how we used read and write. For read: we are in frame 5 here, meaning the active frame is frame 5, but the wizard talks about values that were provided in frame 4, so we read those values from frame 4 and put them into frame 5. For write: on the last utterance, the wizard provides new information about a frame that we already talked about before, so we write this information into the previous frame, frame 4, even though the currently active frame is frame 6. It's a bit complicated, but it's basically a way to keep track of all the values and populate the content of the frames.
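In code, the read and write operations described here are just value copies between frames. A sketch with a made-up frame representation (not the dataset's actual annotation format):

```python
def read_from(frames, src_id, dst_id, slots):
    """'Read': values first given in frame src_id are mentioned again
    while frame dst_id is active, so copy them into the active frame."""
    src = next(f for f in frames if f["id"] == src_id)
    dst = next(f for f in frames if f["id"] == dst_id)
    for slot in slots:
        dst["constraints"][slot] = src["constraints"][slot]

def write_to(frames, frame_id, slots):
    """'Write': the wizard gives new information about a non-active
    frame, so store it directly in that frame."""
    frame = next(f for f in frames if f["id"] == frame_id)
    frame["constraints"].update(slots)

frames = [{"id": 4, "constraints": {"price": 1800}},
          {"id": 5, "constraints": {}},
          {"id": 6, "constraints": {}}]
read_from(frames, 4, 5, ["price"])        # wizard repeats frame 4's price in frame 5
write_to(frames, 4, {"breakfast": True})  # new info about frame 4 while frame 6 is active
print(frames)
```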
0:18:42 Here are some statistics of frame changes in the dataset. The average number of frames created per dialogue is 6.7, and the average number of frame switches is 3.58, with a lot of variability between the dialogues, as you can see here. So we did observe the behaviour that we wanted to observe. We also had five experts annotating the dataset, and we evaluated how well they agreed on the annotation; we got reasonable agreement.
0:19:21 We propose baselines for this dataset. One is an NLU baseline, meant to show how hard the NLU task is. We adapted a model published in 2016, predicting dialogue act types, slots, and slot values, and we get about eighty percent accuracy; that's already pretty good, but there is room for improvement. For frame tracking, we propose a first task. If you want to create a dialogue system that can keep in memory all the frames talked about during the dialogue, you will have to create the frames dynamically throughout the dialogue; but we decided to take a first step with a simpler task. You know all the frames created so far, you have the new user utterance, and you have the NLU annotation for this utterance, so you know the dialogue acts and the slot types; the task consists of finding, for each dialogue act, the frame that it references. Here, for instance, the first slot references frame number 1, "budget=cheaper" actually makes us create a new frame, and "flexibility=true" refers to the current frame.
0:20:42 We proposed a rule-based baseline that was very simple: we just observed some behaviours in the dataset and wrote very simple rules. Basically, if the user informs a new value, we create a new frame; we switch to a previous frame if we find the values that the user is talking about in one of the previous frames; and we have similarly simple rules for the other frame switches. The performance was bad, because rules are not enough for this task. We broke the performance down over different cases in the dataset. For frame switching: if the user provides a slot, say "let's go back to the Toronto package", then we get about forty-five percent performance. If the user refers to a previous frame without specifying a slot, it's harder to understand which frame the user is talking about. After the wizard proposes a hotel, so after an offer, most of the time the user will ask for more information about this hotel, so very often we switch to that frame; that case is easier to predict than when there is no offer, where we get lower performance. For frame creation, we can predict when no frame is created, but it's harder to predict when a frame is created.
0:22:13 As follow-up work, we wrote a paper with a better model that outperforms this baseline by a lot; we presented it at a workshop at ACL very recently. To conclude: this is a new human-human dataset for studying complex state tracking; we have turn-level annotations of dialogue acts, slots, and frames; and we propose a new task, frame tracking, along with some baselines. Thanks for your attention.
0:22:49 Chair: We have a few minutes for questions.
0:23:02 Audience (partly inaudible): With only twelve participants producing over a thousand dialogues, did the users actually vary their language?
0:23:18 Speaker: Just by eyeballing, we didn't really compute anything, but looking at the dialogues, they really got into it: they played the roles and they changed their language. Sometimes it goes from very polite to more, like, young speech. There's a lot of variability thanks to the role playing.
0:23:51 Audience (partly inaudible): Is it possible for the user to ask about combinations, several things at once?
0:24:10 Speaker: That's something we decided not to deal with: we actually asked the participants to always talk about one thing at a time. [The follow-up exchange is largely inaudible.]
0:24:42 Audience (partly inaudible): Thank you, very interesting talk. Could you use the wizard's search queries and their detailed results as additional supervision?
0:25:04 Speaker: We record all those searches and the end results of the searches. That's an idea we had; we haven't really tried to see if it's reliable. Also, not everything was searchable in the database, so that's probably hard. But we're actually collecting more dialogues right now to make the dataset bigger, and now we're going to make all the fields in the database searchable, so that we can record all those searches and then do something like that.
0:25:39 Chair: Just one more question? No? Okay, let's thank the speaker again.