0:00:14 Great, thanks everyone for attending the final session. I'm going to be talking about using dialogue context to improve language understanding performance in multi-domain dialogues.
0:00:27 This is the outline of the talk: I'll give a brief background of the problem, then talk about the datasets, the model architectures, a data augmentation scheme, and the experiments.
0:00:38 First, what is a goal-oriented dialogue system? In goal-oriented dialogue systems, the goal of the system is to help the user complete some task, as opposed to chat-based dialogue systems, where the user just wants to have a conversation and the system's goal is to engage the user.
0:00:57 This is a typical architecture for a goal-oriented dialogue system; it is laid out as a pipeline of components. The first component is the language understanding module: it acts as an interface that takes incoming user utterances and transforms them into a semantic representation. The next component is the state tracker, which keeps track of a probability distribution over dialogue states over the course of the conversation. After that is the policy, which, depending on the dialogue state and the backend state, decides what action to take: making a backend call, asking the user for some information, or informing the user of something. The last component is language generation, which transforms the dialogue-act-based representation of the policy's output into natural language for the user.
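The pipeline just described can be sketched as four toy components wired in sequence. Everything here (the function names, the frame format, the templates, the restaurant name) is invented for illustration and is not the actual system's interface:

```python
def understand(utterance):
    # toy language understanding: text -> semantic frame
    frame = {"domain": "restaurants", "intent": "inform", "slots": {}}
    if "cascal" in utterance.lower():
        frame["slots"]["restaurant_name"] = "Cascal"
    return frame

def track_state(state, frame):
    # toy state tracker: accumulate slot values across turns
    new_state = dict(state)
    new_state.update(frame["slots"])
    return new_state

def policy(state):
    # toy policy: request missing info, otherwise act on what we have
    if "restaurant_name" not in state:
        return ("request", "restaurant_name")
    return ("confirm", state["restaurant_name"])

def generate(act):
    # toy language generation: dialogue act -> surface text
    templates = {"request": "Which restaurant would you like?",
                 "confirm": "Booking a table at {}."}
    act_name, arg = act
    return templates[act_name].format(arg)

state = track_state({}, understand("I want a table at Cascal"))
reply = generate(policy(state))   # "Booking a table at Cascal."
```

Each arrow in the slide's pipeline corresponds to one function boundary here; the real components are learned models rather than rules.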
0:01:50 I'll just briefly talk about the semantic frame representation. Our dialogue understanding is based on frames, and we define frames in connection to backend actions, in the sense that the backend might support certain arguments and intents, and those are basically replicated in the frame: the arguments the backend supports are replicated as slots, and the backend intents are replicated as intents. Apart from the backend intents, we support a bunch of conversational intents or dialogue acts, like affirm, deny, compliment, express frustration, et cetera.
0:02:25 So what does the language understanding module actually do? It performs three tasks. The first task is domain classification: given an incoming user utterance, the language understanding module tries to identify which domain it corresponds to, so this is an utterance classification task.
0:02:42 The second task is intent classification: it tries to identify what intents exist in the user's utterance. This is also an utterance classification task.
0:02:54 The third task is slot filling, and the idea there is to identify attributes which have been defined in the frame but appear in the user's utterance. For example, for a query like "flights from Boston to Seattle", the predicted domain might be the flights domain, the user intent might be find-flights, and you are trying to identify attributes like the departure city and the arrival city. This is a sequence tagging task, and we treat it as a sequence labeling task based on IOB tagging.
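The IOB (inside-outside-begin) scheme mentioned here can be made concrete with a small helper; the slot names and spans below are made up for the example:

```python
def to_iob(tokens, spans):
    """Convert slot spans to IOB tags.

    spans maps slot name -> (start, end), with end exclusive."""
    tags = ["O"] * len(tokens)
    for slot, (start, end) in spans.items():
        tags[start] = "B-" + slot           # first token of the span
        for i in range(start + 1, end):
            tags[i] = "I-" + slot           # continuation tokens
    return tags

tokens = ["flights", "from", "Boston", "to", "Seattle"]
tags = to_iob(tokens, {"departure_city": (2, 3), "arrival_city": (4, 5)})
# tags == ["O", "O", "B-departure_city", "O", "B-arrival_city"]
```

The tagger then predicts one of these tags per token, which is what makes slot filling a sequence labeling task.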
0:03:26 So to basically sum it up: given a user utterance asking to book a table at some restaurant, the goal of the language understanding module is to identify that the domain is restaurant reservation, that the intent is inform (the user is trying to inform the system about the restaurant name), and to identify the restaurant-name entity and tag it with the corresponding slot, and similarly for the rest of the utterance.
0:03:55 There has been a lot of related work on using context for dialogue-related tasks. For language understanding, there has been work on using memory networks on a single domain; there has also been work on using memory networks for end-to-end dialogue systems; and there has been work on using hierarchical recurrent encoder-decoder models for generative query suggestion, which is a slightly unrelated task, but our model is an enhancement of those models.
0:04:25 Now I'll give an overview of our datasets.
0:04:29 We have a collection of single-domain dialogue datasets. The idea here is that the user has a single task that they are trying to complete, and the dialogue spans a single domain. We have around a thousand dialogues in these datasets.
0:04:47 Then we have a smaller multi-domain dialogue dataset, where the training set is around five hundred dialogues, the dev set around a hundred and fifty dialogues, and the test set around two hundred and seventy-two dialogues. These dialogues are longer, because the user has multiple tasks that they are trying to complete, and they span across multiple domains.
0:05:06 The entity sets that we used to create the training and test dialogue sets are non-overlapping, so we have a lot of out-of-vocabulary entities in our dataset: the entities appearing in the test user utterances are out of the vocabulary.
0:05:22 Our data collection process relies on the interaction of a policy model and a user simulator, which interact in terms of dialogue acts, backend calls, et cetera, and then we collect natural language manifestations of those dialogue acts. The process and the datasets will be covered in an upcoming publication.
0:05:44 Okay, so now I'll describe the models we tested. Conceptually, the idea is that there is a context encoder that acts on the turns of the dialogue and tries to produce a context vector, and then there is a tagger network that takes in the dialogue context and the current user utterance and tries to determine the domains, intents and slots. Note that we train a single model across all domains, and everything is trained jointly.
0:06:13 Now I'll describe the architecture of the tagger network. We use the same tagger architecture across all the models that we compare; they vary only in the context encoder. This is an RNN-based model that jointly models the domain, intent and slot predictions.
0:06:33 We feed the embeddings corresponding to the user utterance tokens into a bidirectional GRU, which is depicted here in yellow, if visible. The outputs of the bi-GRU are then fed into an LSTM, which is depicted in blue.
0:06:52 This is where the context encoder comes in: the output of the dialogue context encoder is fed into the initial state of the LSTM. We tried a bunch of different configurations, but this one seemed to work best, so that's what we went with. We used an LSTM in the second layer instead of a GRU only because it seemed to work better for slot filling, maybe because it allows a separation between the internal states and the outputs.
0:07:22 The final state of the LSTM is fed into the domain and intent classifiers, and the token-level outputs of the LSTM are fed into the slot tagger.
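At the level of tensor shapes, the tagger can be sketched as below. A plain tanh recurrence stands in for the GRU and LSTM cells, and all dimensions and weights are random placeholders; this shows only how the pieces connect, not the real trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_hid, n_domains, n_intents, n_slots = 8, 16, 3, 5, 7
n_tokens = 5

def recur(xs, h0, W):
    # simplified recurrent cell (tanh RNN standing in for GRU/LSTM)
    h, outs = h0, []
    for x in xs:
        h = np.tanh(W @ np.concatenate([x, h]))
        outs.append(h)
    return np.stack(outs), h

tokens = rng.normal(size=(n_tokens, d_emb))            # embedded user utterance
W_fwd = 0.1 * rng.normal(size=(d_hid, d_emb + d_hid))
W_bwd = 0.1 * rng.normal(size=(d_hid, d_emb + d_hid))
fwd, _ = recur(tokens, np.zeros(d_hid), W_fwd)
bwd, _ = recur(tokens[::-1], np.zeros(d_hid), W_bwd)
bi = np.concatenate([fwd, bwd[::-1]], axis=-1)         # first (bidirectional) layer

# the dialogue context vector initializes the second-layer state
context = rng.normal(size=d_hid)
W_top = 0.1 * rng.normal(size=(d_hid, 2 * d_hid + d_hid))
tok_outs, final = recur(bi, context, W_top)

# final state -> utterance-level classifiers; token outputs -> slot tagger
domain_logits = rng.normal(size=(n_domains, d_hid)) @ final
intent_logits = rng.normal(size=(n_intents, d_hid)) @ final
slot_logits = tok_outs @ rng.normal(size=(d_hid, n_slots))
```

The key wiring is visible in the last block: one prediction per utterance for domain and intent, and one prediction per token for slots.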
0:07:33 So this is the tagger network, and it is shared across all the models. That was basically a description of our base setup.
0:07:46 Why do we need to use context at all, and not just train the tagger network only on the current user utterance? Suppose the user is having a conversation with a restaurant reservation bot, and the user says "five". In the absence of context this is a pretty ambiguous statement; it's not easy to make out what the user means. It could mean five people, or five p.m., or maybe it could even be a restaurant name. But if you know that the system just asked "What time would you prefer?", then it's pretty obvious that the user meant five as a time, as opposed to a number of people.
0:08:18 This leads us to our first baseline model. The idea here is that we just feed the previous system turn into a GRU, and we use the final state of the GRU as the dialogue context.
0:08:29 We evaluate on four metrics. The first one is domain F1, which is the classification F1 score over domains. The second is intent F1, which is the classification F1 score over intents, and the third one is slot F1. The fourth metric is frame error rate, which is the ratio of utterances where the model gets any one of the predictions wrong, so obviously you want to aim for the lowest possible frame error rate.
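Frame error rate is straightforward to compute once each utterance's predictions are bundled into a single frame. The tuple encoding of a frame below (and the restaurant name) is just one possible convention, invented for the example:

```python
def frame_error_rate(predicted, reference):
    # ratio of utterances with at least one wrong prediction (lower is better)
    wrong = sum(p != r for p, r in zip(predicted, reference))
    return wrong / len(reference)

# frame = (domain, intent, slots)
reference = [
    ("restaurants", "inform", (("restaurant_name", "Cascal"),)),
    ("movies", "buy_tickets", ()),
]
predicted = [
    ("restaurants", "inform", (("restaurant_name", "Cascal"),)),
    ("restaurants", "inform", ()),   # domain and intent wrong
]
error = frame_error_rate(predicted, reference)   # 0.5
```

Because a single wrong slot makes the whole frame count as an error, this metric is stricter than any of the three F1 scores.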
0:08:56 These are the performances of the no-context model compared to the model where the system turn is encoded with a GRU and then fed into the tagger network.
0:09:06 So why do we need richer dialogue context than the previous turn? Suppose the user, instead of just responding to a system-initiative dialogue, is taking the initiative. This makes the problem more difficult, because in the absence of context about the previous dialogue turns, it can be unclear what the user is referring to here: it could be a movie name, it could be a time, it could be a restaurant name. But suppose you knew that this user has been talking about booking a movie ticket; then it's much more likely that you get the prediction right. So we want context from all of the previous turns of the dialogue.
0:09:54 This is our second baseline, and it is based on the model proposed by Chen and others in the memory-networks-for-language-understanding paper. The idea here is to have a GRU layer that acts on the previous dialogue utterances to produce the memory vectors, so this memory is a representation of all the previous utterances. We have another GRU that acts on the current utterance to produce a representation of the current utterance. Based on the inner product of the memory and the current utterance vector, we get an attention distribution, and we use it to take a weighted sum of the memory and get the context vector for the dialogue, which is depicted here. The output of this context encoder is then fed into the tagger network.
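The attention step described here, inner products followed by a softmax and a weighted sum, can be sketched in a few lines; the vectors are made up:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# one memory vector per previous utterance, plus the current-utterance encoding
memory = np.array([[0.2, 0.9],
                   [0.8, 0.1],
                   [0.5, 0.5]])
current = np.array([0.9, 0.0])

attention = softmax(memory @ current)   # distribution over history turns
context = attention @ memory            # weighted sum = dialogue context vector
```

The turn whose memory vector is most similar to the current utterance gets the most attention, and therefore dominates the context vector.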
0:10:41 So, as you can see, adding a memory of the entire dialogue leads to an improvement over all the metrics: for domain we see an improvement of around a percent absolute, for intent around two point three percent, for slot around half a percent, and a significant reduction in frame error rate. So, can we do better than this?
0:11:06 Remember, we are working on multi-domain dialogues, so the user might have multiple goals, and just the knowledge of what the user said, without being able to understand the dialogue history in context, that is, each utterance in context of the rest of the utterances in the history, might not give the complete picture. For example, suppose the user has multiple goals: the user is trying to buy a movie ticket, and then trying to make a restaurant reservation. In the absence of knowledge of how these utterances relate to each other, a user utterance is still ambiguous; but if you have the sequential history of the dialogue, where you can understand each utterance in context of the others, it's much more likely that you get the prediction right.
0:11:55 So this is the final model that we experimented with, and it is an extension of the memory network. The idea here is that, again, you get the memory of the previous dialogue turns, which is depicted here in yellow, and you get a representation of the current utterance, which is depicted in green. But instead of taking an inner product to get an attention distribution, you combine them together with a feedforward layer to get the context-combined memory vectors, and these are then fed into another GRU, which produces the context vector.
0:12:34 Basically, what is happening is that we first get a representation of the entire dialogue history in context with the current utterance, and then we have a recurrent layer that goes over the dialogue and tries to combine these utterances together in context of each other, and the final state of that GRU is the context vector that is fed to the tagger.
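A shape-level sketch of this combination step, with made-up dimensions and a plain tanh recurrence standing in for the GRU (so, connectivity only, not the trained model):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
memory = rng.normal(size=(3, d))   # encodings of three previous utterances
current = rng.normal(size=d)       # current-utterance encoding

# combine each memory vector with the current utterance via a feedforward layer
W_ff = 0.5 * rng.normal(size=(d, 2 * d))
combined = np.tanh(np.stack([W_ff @ np.concatenate([m, current])
                             for m in memory]))

# run a recurrence over the combined vectors in dialogue order;
# the final state is the context vector handed to the tagger
W_rec = 0.5 * rng.normal(size=(d, 2 * d))
h = np.zeros(d)
for g in combined:
    h = np.tanh(W_rec @ np.concatenate([g, h]))
context = h
```

The difference from the memory network is visible here: instead of a scalar attention weight per turn, every turn contributes a full vector, and the recurrence preserves the order of the turns.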
0:13:02 So this is an enhancement of the memory network, and it is also, in a sense, an enhancement of the hierarchical recurrent encoder-decoder model that has been used for next-utterance prediction and for context-based generative query suggestion.
0:13:18 Somewhat unexpectedly, what we observe is that this model doesn't perform as well as the memory network. We dug into this, and our hypothesis is that there is a huge training-test shift in our datasets: the training set is composed largely of single-domain dialogues, thousands of dialogues from the single-domain datasets versus only around five hundred multi-domain dialogues. We believe that the sequential dialogue encoder is unable to adapt from the single-domain dialogues to the multi-domain test set.
0:13:58 So what do we do? We went with a simple data augmentation scheme: since there is a mismatch between our training and test datasets, we make the training dataset more similar to the test data. We take our large single-domain dialogue datasets, and we combine pairs of single-domain dialogues to synthetically create domain switches, basically by grafting one single-domain dialogue into another one. We ended up with around ten thousand combined dialogues sampled from the possible pairs.
0:14:39 This is an example of how we combine dialogues. Dialogue A is a dialogue where the user is trying to buy movie tickets; in dialogue B the user is trying to find a restaurant. We randomly sample a location in dialogue A and insert dialogue B there; this is how we get the combined dialogue, and we use it for training.
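The grafting operation amounts to splicing one dialogue's turns into the other at a random turn boundary; a minimal sketch, with invented turns:

```python
import random

def combine_dialogues(dlg_a, dlg_b, rng=random):
    """Graft dialogue B into dialogue A at a random turn boundary,
    producing a synthetic multi-domain dialogue."""
    cut = rng.randrange(len(dlg_a) + 1)
    return dlg_a[:cut] + dlg_b + dlg_a[cut:]

movies = [("user", "two tickets for tonight"), ("system", "which theatre?")]
restaurants = [("user", "find me a Brazilian restaurant")]
combined = combine_dialogues(movies, restaurants)
```

Both source dialogues keep their internal turn order; only the switch point between domains is random.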
0:15:01 Looking at the numbers, we see a sizable improvement in performance from training on the combined dialogues, compared to training without them. The boldfaced numbers are the ones where a model beats all the other models on a certain metric.
0:15:24 Using the sequential dialogue encoder seems to benefit the most from dialogue combination. Dialogue combination leads to performance improvements for almost all the models, but the one that benefits the most is the sequential dialogue encoder, and this is probably because combining dialogues leads to longer dialogues and adds noise, which acts like regularization; since the sequential dialogue encoder is the most complex model, we would expect it to benefit the most from this.
0:15:55 And this is what we observe: after augmentation, the sequential dialogue encoder does better than the memory network on domain F1, slot F1 and frame error rate, and is the best model on intent classification.
0:16:08 This is an example; it's a cherry-picked example, but it tries to illustrate what's happening. Here we just look at the attention distributions and try to figure out what the models are doing. This is an exchange from the test set, and the OOV tokens are in boldface; all of these are OOV for the dataset, because the entity sets are non-overlapping.
0:16:34 You can see that the last three utterances have a lot of OOV tokens. If you look at the memory network's attention distribution, you notice that its focus is almost entirely on the user utterance where the user says they want to visit a restaurant, whereas the sequential dialogue encoder's focus is split equally over the last two utterances.
0:16:54 And I should make it clear: the utterance that we are trying to understand is the final one, at the bottom, where the user asks to book a table for two at the restaurant. A good model should identify that the domain is restaurant finding, and identify the slot values for the party size and the restaurant name.
0:17:20 What we observe is that the baseline encoder model fails to identify the domain or the slots. The memory network correctly identifies the correct domain, because it is focusing on the utterance where the user says they want some restaurant, but it fails to incorporate context from the previous system utterance, where the system is offering a restaurant to the user, and so it is unable to identify the slot values. The sequential dialogue encoder, on the other hand, is able to successfully combine context from all the previous utterances, and it recognizes both the domain and the slots.
0:17:55 Okay, I think that's it. Thanks a lot for listening.
0:18:04 Questions?
0:18:13 I have two questions. The first one: as a byproduct of what you're doing, you get a memory representation of the context; you have the whole dialogue history. I'm wondering if you've considered, since you have access to the simulated user, whether you could train a policy using this representation, because it's very similar to belief tracking in traditional systems. So the question is more like: maybe, instead of doing a modular thing, you could just have the same model do the whole thing end-to-end.
0:18:59 So, that's a very interesting question, because we have some people running experiments on this; it's something we are looking into.
0:19:10 Because I think the problem, usually, with such an uninterpretable representation is that when you pick some action, say a confirm, you don't know which slot to confirm; but at the same time you have these semantics, so you can make it usable.
0:19:30 I think that by carefully designing the semantics that we are using, we can alleviate or remove that limitation. For instance, instead of having a single confirm act, you could have a confirm-slot act, and then have the model predict, based on the context, what the user is trying to confirm, and that might make it usable. But again, there is some uncertainty there still.
0:19:59 Any other question?
0:20:15 Can you go back to the second-to-last slide, where you had the Brazilian restaurant? I wanted to ask two questions about this example. First, I thought you said you trained on a synthetic dataset where you combine domains, right? So do you still consider the restaurant domain to be out of domain at this point? That's the first one. And second: how would you deal with something that is, to me, truly out of domain, like "the weather is nice" or "today I'm grumpy" or whatever, which is out of domain in a very different way than an utterance that is still task-related, even if not part of the current task?
0:21:02 For the first question: the restaurant domain is not out of domain for us, because our system can handle movie tickets and restaurants. Given an utterance, the system will try to keep track across the different domains; it will see that this is a different-domain utterance, and it is equipped to handle it. So even though the dialogue is multi-domain, the domains are not out of vocabulary; it's the entities that are out of vocabulary.
0:21:31 For the second question: we have a few out-of-domain utterances in this dataset too, for which the system is supposed to say "I cannot handle that". But I think in our dataset they are not there in large enough numbers, so we would definitely need more out-of-domain data to be able to successfully handle out-of-domain utterances.
0:21:58 Right. Any more questions?
0:22:07 My second question: do you use delexicalization of the input, or not?
0:22:14 No, we don't use any delexicalization; this is basically the model that does the delexicalization, effectively. This model will identify the entities that you are trying to delexicalize. Because if you use an exact-match or gazetteer-based approach to delexicalize, it doesn't scale to, say, all the restaurants in the world, whereas this model will try to identify the entities based on context, based on the slot annotations, without needing a complete gazetteer.
0:22:49 Thank you very much.