0:00:14 Great, thanks everyone for attending the final session. I'm going to be talking about using dialogue context to improve language understanding performance in multi-domain dialogues.
0:00:27 This is the outline of the talk: I'll give a brief background of the problem, then talk about the datasets, the model architectures, a data augmentation scheme, and the experiments.
0:00:38 First, what is a goal-oriented dialogue system? In goal-oriented dialogue systems, the goal of the system is to help the user complete some task, as opposed to chat-based dialogue systems, where the user just wants to have a conversation and the system's goal is to engage the user.
0:00:57 This is a typical architecture for a goal-oriented dialogue system; it is laid out as a pipeline of components. The first component is the language understanding module: it acts as an interface that takes incoming user utterances and transforms them into a semantic representation. The next component is the state tracker, which keeps track of a probability distribution over dialogue states over the course of the conversation. After that is the policy, which, depending on the dialogue state and the backend state, decides what action to take: making a backend call, asking the user for some information, or informing the user of something. The last component is language generation, which transforms the dialogue-act-based representation of the policy's output into natural language for the user.
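The pipeline just described can be sketched as four toy components wired in sequence. Everything here (the function names, the frame format, the templates, the restaurant name) is invented for illustration and is not the actual system's interface:

```python
def understand(utterance):
    # toy language understanding: text -> semantic frame
    frame = {"domain": "restaurants", "intent": "inform", "slots": {}}
    if "cascal" in utterance.lower():
        frame["slots"]["restaurant_name"] = "Cascal"
    return frame

def track_state(state, frame):
    # toy state tracker: accumulate slot values across turns
    new_state = dict(state)
    new_state.update(frame["slots"])
    return new_state

def policy(state):
    # toy policy: request missing info, otherwise act on what we have
    if "restaurant_name" not in state:
        return ("request", "restaurant_name")
    return ("confirm", state["restaurant_name"])

def generate(act):
    # toy language generation: dialogue act -> surface text
    templates = {"request": "Which restaurant would you like?",
                 "confirm": "Booking a table at {}."}
    act_name, arg = act
    return templates[act_name].format(arg)

state = track_state({}, understand("I want a table at Cascal"))
reply = generate(policy(state))   # "Booking a table at Cascal."
```

Each arrow in the slide's pipeline corresponds to one function boundary here; the real components are learned models rather than rules.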
0:01:50 I'll just briefly talk about the semantic frame representation. Our dialogue understanding is based on frames, and we define frames in connection to backend actions, in the sense that the backend might support certain arguments and intents, and those are basically replicated in the frame: the arguments the backend supports are replicated as slots, and the backend intents are replicated as intents. Apart from the backend intents, we support a bunch of conversational intents or dialogue acts, like affirm, deny, compliment, express frustration, et cetera.
0:02:25 So what does the language understanding module actually do? It performs three tasks. The first task is domain classification: given an incoming user utterance, the language understanding module tries to identify which domain it corresponds to, so this is an utterance classification task.
0:02:42 The second task is intent classification: it tries to identify what intents exist in the user's utterance. This is also an utterance classification task.
0:02:54 The third task is slot filling, and the idea there is to identify attributes which have been defined in the frame but appear in the user's utterance. For example, for a query like "flights from Boston to Seattle", the predicted domain might be the flights domain, the user intent might be find-flights, and you are trying to identify attributes like the departure city and the arrival city. This is a sequence tagging task, and we treat it as a sequence labeling task based on IOB tagging.
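The IOB (inside-outside-begin) scheme mentioned here can be made concrete with a small helper; the slot names and spans below are made up for the example:

```python
def to_iob(tokens, spans):
    """Convert slot spans to IOB tags.

    spans maps slot name -> (start, end), with end exclusive."""
    tags = ["O"] * len(tokens)
    for slot, (start, end) in spans.items():
        tags[start] = "B-" + slot           # first token of the span
        for i in range(start + 1, end):
            tags[i] = "I-" + slot           # continuation tokens
    return tags

tokens = ["flights", "from", "Boston", "to", "Seattle"]
tags = to_iob(tokens, {"departure_city": (2, 3), "arrival_city": (4, 5)})
# tags == ["O", "O", "B-departure_city", "O", "B-arrival_city"]
```

The tagger then predicts one of these tags per token, which is what makes slot filling a sequence labeling task.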
0:03:26 So to basically sum it up: given a user utterance asking to book a table at some restaurant, the goal of the language understanding module is to identify that the domain is restaurant reservation, that the intent is inform (the user is trying to inform the system about the restaurant name), and to identify the restaurant-name entity and tag it with the corresponding slot, and similarly for the rest of the utterance.
0:03:55 There has been a lot of related work on using context for dialogue-related tasks. For language understanding, there has been work on using memory networks on a single domain; there has also been work on using memory networks for end-to-end dialogue systems; and there has been work on using hierarchical recurrent encoder-decoder models for generative query suggestion, which is a slightly unrelated task, but our model is an enhancement of those models.
0:04:25 Now I'll give an overview of our datasets.
0:04:29 We have a collection of single-domain dialogue datasets. The idea here is that the user has a single task that they are trying to complete, and the dialogue spans a single domain. We have around a thousand dialogues in these datasets.
0:04:47 Then we have a smaller multi-domain dialogue dataset, where the training set is around five hundred dialogues, the dev set around a hundred and fifty dialogues, and the test set around two hundred and seventy-two dialogues. These dialogues are longer, because the user has multiple tasks that they are trying to complete, and they span across multiple domains.
0:05:06 The entity sets that we used to create the training and test dialogue sets are non-overlapping, so we have a lot of out-of-vocabulary entities in our dataset: the entities appearing in the test user utterances are out of the vocabulary.
0:05:22 Our data collection process relies on the interaction of a policy model and a user simulator, which interact in terms of dialogue acts, backend calls, et cetera, and then we collect natural language manifestations of those dialogue acts. The process and the datasets will be covered in an upcoming publication.
0:05:44 Okay, so now I'll describe the models we tested. Conceptually, the idea is that there is a context encoder that acts on the turns of the dialogue and tries to produce a context vector, and then there is a tagger network that takes in the dialogue context and the current user utterance and tries to determine the domains, intents and slots. Note that we train a single model across all domains, and everything is trained jointly.
0:06:13 Now I'll describe the architecture of the tagger network. We use the same tagger architecture across all the models that we compare; they vary only in the context encoder. This is an RNN-based model that jointly models the domain, intent and slot predictions.
0:06:33 We feed the embeddings corresponding to the user utterance tokens into a bidirectional GRU, which is depicted here in yellow, if visible. The outputs of the bi-GRU are then fed into an LSTM, which is depicted in blue.
0:06:52 This is where the context encoder comes in: the output of the dialogue context encoder is fed into the initial state of the LSTM. We tried a bunch of different configurations, but this one seemed to work best, so that's what we went with. We used an LSTM in the second layer instead of a GRU only because it seemed to work better for slot filling, maybe because it allows a separation between the internal states and the outputs.
0:07:22 The final state of the LSTM is fed into the domain and intent classifiers, and the token-level outputs of the LSTM are fed into the slot tagger.
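At the level of tensor shapes, the tagger can be sketched as below. A plain tanh recurrence stands in for the GRU and LSTM cells, and all dimensions and weights are random placeholders; this shows only how the pieces connect, not the real trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_hid, n_domains, n_intents, n_slots = 8, 16, 3, 5, 7
n_tokens = 5

def recur(xs, h0, W):
    # simplified recurrent cell (tanh RNN standing in for GRU/LSTM)
    h, outs = h0, []
    for x in xs:
        h = np.tanh(W @ np.concatenate([x, h]))
        outs.append(h)
    return np.stack(outs), h

tokens = rng.normal(size=(n_tokens, d_emb))            # embedded user utterance
W_fwd = 0.1 * rng.normal(size=(d_hid, d_emb + d_hid))
W_bwd = 0.1 * rng.normal(size=(d_hid, d_emb + d_hid))
fwd, _ = recur(tokens, np.zeros(d_hid), W_fwd)
bwd, _ = recur(tokens[::-1], np.zeros(d_hid), W_bwd)
bi = np.concatenate([fwd, bwd[::-1]], axis=-1)         # first (bidirectional) layer

# the dialogue context vector initializes the second-layer state
context = rng.normal(size=d_hid)
W_top = 0.1 * rng.normal(size=(d_hid, 2 * d_hid + d_hid))
tok_outs, final = recur(bi, context, W_top)

# final state -> utterance-level classifiers; token outputs -> slot tagger
domain_logits = rng.normal(size=(n_domains, d_hid)) @ final
intent_logits = rng.normal(size=(n_intents, d_hid)) @ final
slot_logits = tok_outs @ rng.normal(size=(d_hid, n_slots))
```

The key wiring is visible in the last block: one prediction per utterance for domain and intent, and one prediction per token for slots.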
0:07:33 So this is the tagger network, and it is shared across all the models. That was basically a description of our base setup.
0:07:46 Why do we need to use context at all, and not just train the tagger network only on the current user utterance? Suppose the user is having a conversation with a restaurant reservation bot, and the user says "five". In the absence of context this is a pretty ambiguous statement; it's not easy to make out what the user means. It could mean five people, or five p.m., or maybe it could even be a restaurant name. But if you know that the system just asked "What time would you prefer?", then it's pretty obvious that the user meant five as a time, as opposed to a number of people.
0:08:18 This leads us to our first baseline model. The idea here is that we just feed the previous system turn into a GRU, and we use the final state of the GRU as the dialogue context.
0:08:29 We evaluate on four metrics. The first one is domain F1, which is the classification F1 score over domains. The second is intent F1, which is the classification F1 score over intents, and the third one is slot F1. The fourth metric is frame error rate, which is the ratio of utterances where the model gets any one of the predictions wrong, so obviously you want to aim for the lowest possible frame error rate.
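Frame error rate is straightforward to compute once each utterance's predictions are bundled into a single frame. The tuple encoding of a frame below (and the restaurant name) is just one possible convention, invented for the example:

```python
def frame_error_rate(predicted, reference):
    # ratio of utterances with at least one wrong prediction (lower is better)
    wrong = sum(p != r for p, r in zip(predicted, reference))
    return wrong / len(reference)

# frame = (domain, intent, slots)
reference = [
    ("restaurants", "inform", (("restaurant_name", "Cascal"),)),
    ("movies", "buy_tickets", ()),
]
predicted = [
    ("restaurants", "inform", (("restaurant_name", "Cascal"),)),
    ("restaurants", "inform", ()),   # domain and intent wrong
]
error = frame_error_rate(predicted, reference)   # 0.5
```

Because a single wrong slot makes the whole frame count as an error, this metric is stricter than any of the three F1 scores.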
0:08:56 These are the performances of the no-context model compared to the model where the system turn is encoded with a GRU and then fed into the tagger network.
0:09:06 So why do we need richer dialogue context than the previous turn? Suppose the user, instead of just responding to a system-initiative dialogue, is taking the initiative. This makes the problem more difficult, because in the absence of context about the previous dialogue turns, it can be unclear what the user is referring to here: it could be a movie name, it could be a time, it could be a restaurant name. But suppose you knew that this user has been talking about booking a movie ticket; then it's much more likely that you get the prediction right. So we want context from all of the previous turns of the dialogue.
0:09:54 This is our second baseline, and it is based on the model proposed by Chen and others in the memory-networks-for-language-understanding paper. The idea here is to have a GRU layer that acts on the previous dialogue utterances to produce the memory vectors, so this memory is a representation of all the previous utterances. We have another GRU that acts on the current utterance to produce a representation of the current utterance. Based on the inner product of the memory and the current utterance vector, we get an attention distribution, and we use it to take a weighted sum of the memory and get the context vector for the dialogue, which is depicted here. The output of this context encoder is then fed into the tagger network.
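The attention step described here, inner products followed by a softmax and a weighted sum, can be sketched in a few lines; the vectors are made up:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# one memory vector per previous utterance, plus the current-utterance encoding
memory = np.array([[0.2, 0.9],
                   [0.8, 0.1],
                   [0.5, 0.5]])
current = np.array([0.9, 0.0])

attention = softmax(memory @ current)   # distribution over history turns
context = attention @ memory            # weighted sum = dialogue context vector
```

The turn whose memory vector is most similar to the current utterance gets the most attention, and therefore dominates the context vector.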
0:10:41 So, as you can see, adding a memory of the entire dialogue leads to an improvement over all the metrics: for domain we see an improvement of around a percent absolute, for intent around two point three percent, for slot around half a percent, and a significant reduction in frame error rate. So, can we do better than this?
0:11:06 Remember, we are working on multi-domain dialogues, so the user might have multiple goals, and just the knowledge of what the user said, without being able to understand the dialogue history in context, that is, each utterance in context of the rest of the utterances in the history, might not give the complete picture. For example, suppose the user has multiple goals: the user is trying to buy a movie ticket, and then trying to make a restaurant reservation. In the absence of knowledge of how these utterances relate to each other, a user utterance is still ambiguous; but if you have the sequential history of the dialogue, where you can understand each utterance in context of the others, it's much more likely that you get the prediction right.
0:11:55 So this is the final model that we experimented with, and it is an extension of the memory network. The idea here is that, again, you get the memory of the previous dialogue turns, which is depicted here in yellow, and you get a representation of the current utterance, which is depicted in green. But instead of taking an inner product to get an attention distribution, you combine them together with a feedforward layer to get the context-combined memory vectors, and these are then fed into another GRU, which produces the context vector.
0:12:34 Basically, what is happening is that we first get a representation of the entire dialogue history in context with the current utterance, and then we have a recurrent layer that goes over the dialogue and tries to combine these utterances together in context of each other, and the final state of that GRU is the context vector that is fed to the tagger.
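A shape-level sketch of this combination step, with made-up dimensions and a plain tanh recurrence standing in for the GRU (so, connectivity only, not the trained model):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
memory = rng.normal(size=(3, d))   # encodings of three previous utterances
current = rng.normal(size=d)       # current-utterance encoding

# combine each memory vector with the current utterance via a feedforward layer
W_ff = 0.5 * rng.normal(size=(d, 2 * d))
combined = np.tanh(np.stack([W_ff @ np.concatenate([m, current])
                             for m in memory]))

# run a recurrence over the combined vectors in dialogue order;
# the final state is the context vector handed to the tagger
W_rec = 0.5 * rng.normal(size=(d, 2 * d))
h = np.zeros(d)
for g in combined:
    h = np.tanh(W_rec @ np.concatenate([g, h]))
context = h
```

The difference from the memory network is visible here: instead of a scalar attention weight per turn, every turn contributes a full vector, and the recurrence preserves the order of the turns.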
0:13:02 So this is an enhancement of the memory network, and it is also, in a sense, an enhancement of the hierarchical recurrent encoder-decoder model that has been used for next-utterance prediction and for context-based generative query suggestion.
0:13:18 Somewhat unexpectedly, what we observe is that this model doesn't perform as well as the memory network. We dug into this, and our hypothesis is that there is a huge training-test shift in our datasets: the training set is composed largely of single-domain dialogues, thousands of dialogues from the single-domain datasets versus only around five hundred multi-domain dialogues. We believe that the sequential dialogue encoder is unable to adapt from the single-domain dialogues to the multi-domain test set.
0:13:58 So what do we do? We went with a simple data augmentation scheme: since there is a mismatch between our training and test datasets, we make the training dataset more similar to the test data. We take our large single-domain dialogue datasets, and we combine pairs of single-domain dialogues to synthetically create domain switches, basically by grafting one single-domain dialogue into another one. We ended up with around ten thousand combined dialogues sampled from the possible pairs.
0:14:39 This is an example of how we combine dialogues. Dialogue A is a dialogue where the user is trying to buy movie tickets; in dialogue B the user is trying to find a restaurant. We randomly sample a location in dialogue A and insert dialogue B there; this is how we get the combined dialogue, and we use it for training.
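The grafting operation amounts to splicing one dialogue's turns into the other at a random turn boundary; a minimal sketch, with invented turns:

```python
import random

def combine_dialogues(dlg_a, dlg_b, rng=random):
    """Graft dialogue B into dialogue A at a random turn boundary,
    producing a synthetic multi-domain dialogue."""
    cut = rng.randrange(len(dlg_a) + 1)
    return dlg_a[:cut] + dlg_b + dlg_a[cut:]

movies = [("user", "two tickets for tonight"), ("system", "which theatre?")]
restaurants = [("user", "find me a Brazilian restaurant")]
combined = combine_dialogues(movies, restaurants)
```

Both source dialogues keep their internal turn order; only the switch point between domains is random.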
0:15:01 Looking at the numbers, we see a sizable improvement in performance from training on the combined dialogues, compared to training without them. The boldfaced numbers are the ones where a model beats all the other models on a certain metric.
0:15:24 Using the sequential dialogue encoder seems to benefit the most from dialogue combination. Dialogue combination leads to performance improvements for almost all the models, but the one that benefits the most is the sequential dialogue encoder, and this is probably because combining dialogues leads to longer dialogues and adds noise, which acts like regularization; since the sequential dialogue encoder is the most complex model, we would expect it to benefit the most from this.
0:15:55 And this is what we observe: after augmentation, the sequential dialogue encoder does better than the memory network on domain F1, slot F1 and frame error rate, and is the best model on intent classification.
0:16:08 This is an example; it's a cherry-picked example, but it tries to illustrate what's happening. Here we just look at the attention distributions and try to figure out what the models are doing. This is an exchange from the test set, and the OOV tokens are in boldface; all of these are OOV for the dataset, because the entity sets are non-overlapping.
0:16:34 You can see that the last three utterances have a lot of OOV tokens. If you look at the memory network's attention distribution, you notice that its focus is almost entirely on the user utterance where the user says they want to visit a restaurant, whereas the sequential dialogue encoder's focus is split equally over the last two utterances.
0:16:54 And I should make it clear: the utterance that we are trying to understand is the final one, at the bottom, where the user asks to book a table for two at the restaurant. A good model should identify that the domain is restaurant finding, and identify the slot values for the party size and the restaurant name.
0:17:20 What we observe is that the baseline encoder model fails to identify the domain or the slots. The memory network correctly identifies the correct domain, because it is focusing on the utterance where the user says they want some restaurant, but it fails to incorporate context from the previous system utterance, where the system is offering a restaurant to the user, and so it is unable to identify the slot values. The sequential dialogue encoder, on the other hand, is able to successfully combine context from all the previous utterances, and it recognizes both the domain and the slots.
0:17:55 Okay, I think that's it. Thanks a lot for listening.
0:18:04 Questions?
0:18:13 I have two questions. The first one: as a byproduct of what you're doing, you get a memory representation of the context; you have the whole dialogue history. I'm wondering if you've considered, since you have access to the simulated user, whether you could train a policy using this representation, because it's very similar to belief tracking in traditional systems. So the question is more like: maybe, instead of doing a modular thing, you could just have the same model do the whole thing end-to-end.
0:18:59 So, that's a very interesting question, because we have some people running experiments on this; it's something we are looking into.
0:19:10 Because I think the problem, usually, with such an uninterpretable representation is that when you pick some action, say a confirm, you don't know which slot to confirm; but at the same time you have these semantics, so you can make it usable.
0:19:30 I think that by carefully designing the semantics that we are using, we can alleviate or remove that limitation. For instance, instead of having a single confirm act, you could have a confirm-slot act, and then have the model predict, based on the context, what the user is trying to confirm, and that might make it usable. But again, there is some uncertainty there still.
0:19:59 Any other question?
0:20:15 Can you go back to the second-to-last slide, where you had the Brazilian restaurant? I wanted to ask two questions about this example. First, I thought you said you trained on a synthetic dataset where you combine domains, right? So do you still consider the restaurant domain to be out of domain at this point? That's the first one. And second: how would you deal with something that is, to me, truly out of domain, like "the weather is nice" or "today I'm grumpy" or whatever, which is out of domain in a very different way than an utterance that is still task-related, even if not part of the current task?
0:21:02 For the first question: the restaurant domain is not out of domain for us, because our system can handle movie tickets and restaurants. Given an utterance, the system will try to keep track across the different domains; it will see that this is a different-domain utterance, and it is equipped to handle it. So even though the dialogue is multi-domain, the domains are not out of vocabulary; it's the entities that are out of vocabulary.
0:21:31 For the second question: we have a few out-of-domain utterances in this dataset too, for which the system is supposed to say "I cannot handle that". But I think in our dataset they are not there in large enough numbers, so we would definitely need more out-of-domain data to be able to successfully handle out-of-domain utterances.
0:21:58 Right. Any more questions?
0:22:07 My second question: do you use delexicalization of the input, or not?
0:22:14 No, we don't use any delexicalization; this is basically the model that does the delexicalization, effectively. This model will identify the entities that you are trying to delexicalize. Because if you use an exact-match or gazetteer-based approach to delexicalize, it doesn't scale to, say, all the restaurants in the world, whereas this model will try to identify the entities based on context, based on the slot annotations, without needing a complete gazetteer.
0:22:49 Thank you very much.