0:00:17so the mixed speaker
0:00:29so the next be could be sounded
0:00:31so these
0:00:36we study presentation
0:00:40okay everyone my name is sanchit and i'm going to present our paper dialog
0:00:44state tracking a neural reading comprehension approach
0:00:46this is joint work with shuyang abhishek tagyoung and dilek hakkani-tur from
0:00:49amazon alexa ai in sunnyvale california
0:00:52so i'll first briefly introduce the problem dialog state tracking i guess most of
0:00:55you already know about it
0:00:57but just for completeness and then i'll
0:00:59talk about the motivation of our approach go into the details of the architecture show some
0:01:03results and ablation studies and finally conclude with some error analysis
0:01:07so let's start so this is the dialog state so what a dialog state is basically
0:01:12a dialog state is a compact representation of the dialogue history a dialog state basically
0:01:17represents what the user is interested in at any point in the conversation and
0:01:20typically you represent the dialog state with
0:01:23slots and values
0:01:24so here in the first turn the user says that he needs to
0:01:27book a hotel in the east that has four stars and
0:01:30this corresponds to a state where you have two slots stars and area together with their respective
0:01:35values and hotel here represents the domain that the user is talking about
0:01:37and it will become more evident why that's important because
0:01:41in a conversation you can have multiple domains
0:01:43so continuing the example in the second turn the agent
0:01:48responds asking about the price range and the user says that it does not matter if it
0:01:52has free wifi and parking so the state gets updated with three new
0:01:57slots parking and internet with the value yes and the price range with don't care and the
0:02:01other slots stars and area get carried over
0:02:04in the next turn the agent gives some recommendation and the user says that sounds good i
0:02:08would also like a taxi to the hotel from cambridge so here we see
0:02:11that the slots corresponding to the hotel domain get carried over
0:02:14but there are
0:02:15two new slots departure and destination
0:02:17corresponding to
0:02:18a new domain taxi
0:02:21which also get updated in the dialog state
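A minimal sketch of how the state in this example evolves over the three turns, assuming a plain dictionary keyed by (domain, slot). The helper name `apply_turn`, the slot spellings, and the `<hotel name>` placeholder are illustrative assumptions, not the dataset's or the authors' actual format.

```python
# Hypothetical representation: a dialog state as a dict keyed by (domain, slot).
def apply_turn(previous_state, updates):
    """Carry over every slot from the previous turn, then apply this turn's updates."""
    state = dict(previous_state)  # copy so earlier turns stay unchanged
    state.update(updates)
    return state

# Turn 1: "book a hotel in the east that has four stars"
turn1 = apply_turn({}, {("hotel", "area"): "east", ("hotel", "stars"): "4"})

# Turn 2: price does not matter, wants free wifi and parking
turn2 = apply_turn(turn1, {
    ("hotel", "parking"): "yes",
    ("hotel", "internet"): "yes",
    ("hotel", "pricerange"): "dontcare",
})

# Turn 3: a taxi to the hotel from cambridge (the hotel name is not given in the talk)
turn3 = apply_turn(turn2, {
    ("taxi", "departure"): "cambridge",
    ("taxi", "destination"): "<hotel name>",
})

for key, value in sorted(turn3.items()):
    print(key, value)
```

The hotel slots from the first two turns are still present in `turn3`, which is the carry-over behaviour described in the talk.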
0:02:24now what is the task of dialog state tracking so dialog state tracking basically
0:02:28means you want to predict
0:02:29the dialog state of the user or more concretely you are given the dialogue
0:02:33history plus the current user utterance and you want to predict a distribution over
0:02:37dialogue states and we saw that dialog states are typically represented as slots
0:02:42and values so this means state trackers
0:02:44output a distribution over the slots and all the associated values
0:02:47and the dialogue history typically consists of features like past user utterances past system utterances
0:02:52it can have the previous belief state or even any nlu interpretation if that is available
0:02:56so this is the task
0:02:58i want to talk briefly about what the traditional approaches to dst are
0:03:02so one of the common approaches is where you encode the dialogue history
0:03:07through some model architecture and then you have
0:03:10a linear plus softmax layer on top and you output a distribution
0:03:13over the vocabulary
0:03:14of the slot type and you do this for each slot in your schema or
0:03:17your dialog state
0:03:19for example here you see an approach called joint state tracking where they encode the dialogue
0:03:23history using a hierarchical lstm and then on top of the hidden representation
0:03:28of the context they have a feed-forward layer one for each slot type
0:03:32and then a softmax layer to output the distribution over the values that
0:03:36particular slot can take and these are the values which you have seen in the
0:03:39training set
0:03:40this leads to two
0:03:42main problems with such approaches
0:03:44one is that they cannot handle out-of-vocabulary slot value mentions because they only output a
0:03:49distribution over values that have been seen in the training set
0:03:52so in such approaches it is assumed that the vocabulary or the ontology
0:03:56is known in advance
0:03:57and the second thing is that they do not scale well for slots that have a
0:04:00large vocabulary
0:04:00for example the slot restaurant name you can imagine that
0:04:04the slot can take values from a possibly very large set so there's not enough
0:04:08data to learn a good distribution over this large vocabulary
0:04:11so on the other hand reading comprehension approaches typically do not rely on a
0:04:15fixed vocabulary
0:04:16this is because reading comprehension approaches are typically structured as
0:04:20extractive question answering where the goal is to find a span of tokens
0:04:23in the
0:04:24passage which contains the answer so there is no fixed vocabulary
0:04:28and the second thing is
0:04:29also that there have been a lot of recent advancements in reading comprehension that
0:04:32we can leverage
0:04:33if we structure our problem of state tracking as reading comprehension this led us to
0:04:37propose this reading comprehension approach for dialog state tracking and
0:04:43in the next slide
0:04:44before i go to exactly how we formulate the problem i also want to
0:04:47just give a one slide overview of how
0:04:50machine reading comprehension problems are typically posed
0:04:52so the general idea in reading comprehension is you are given a question and a passage
0:04:55and you are looking for a span of tokens in the passage that can be
0:04:58the answer
0:04:59this is also called extractive question answering
0:05:01and how people do it is you encode the passage so you get a representation of each
0:05:06token in the passage you encode the question so you have a question representation and
0:05:09on top you generally have two attention heads attending from the question
0:05:13to each token in the passage one of the attention heads represents the
0:05:16start probability distribution
0:05:17and the other represents the end probability distribution once you have these two probability distributions you
0:05:21just output the
0:05:22most probable span
0:05:25and that is your answer
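A minimal sketch of the last step described here, picking the most probable span from a start distribution and an end distribution. It assumes the span score is the product of the start and end probabilities with the end not before the start, which is a common convention rather than something stated in the talk; the function name `best_span` and the probabilities are made up for illustration.

```python
def best_span(start_probs, end_probs, max_len=None):
    """Return ((start, end), score) maximising start_probs[start] * end_probs[end]
    subject to start <= end (and an optional maximum span length)."""
    best, best_score = None, -1.0
    for i, p_start in enumerate(start_probs):
        last = len(end_probs) if max_len is None else min(len(end_probs), i + max_len)
        for j in range(i, last):
            score = p_start * end_probs[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score

tokens = "i need to book a hotel in the east that has four stars".split()

# Illustrative distributions peaked on the token "east" (index 8, zero-based).
start_probs = [0.01] * len(tokens)
end_probs = [0.01] * len(tokens)
start_probs[8] = 0.88
end_probs[8] = 0.88

span, score = best_span(start_probs, end_probs)
answer = tokens[span[0]:span[1] + 1]
print(span, answer)
```

The quadratic loop is fine for dialogue-length passages; a `max_len` cap keeps it cheaper when only short values are expected.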
0:05:26here it shows a popular architecture contatenate which is from microsoft the internally gets on
0:05:31one and they use a bunch of self attention and convolution layers to encode the passage
0:05:36but in general it is assumed that you encode the passage and the question and then you
0:05:38have attention for representing the start and end of the span
0:05:41so now let's look at how we formulate the dialog state tracking problem
0:05:45as reading comprehension so
0:05:46this is the same dialogue as before
0:05:48the user is looking for a hotel
0:05:49and after the second turn you want to predict the values for each of these
0:05:52slots hotel area hotel rating price
0:05:55and so on
0:05:56and this easily translates into something like this where
0:05:59your dialogue context the whole dialogue context of agent and user turns becomes the passage
0:06:04and then the questions are something like what is the requested hotel area or
0:06:08the requested value of the slot that you want to track or
0:06:11something like is parking required in the hotel and so on and then what you want
0:06:14to find is the answer to these questions so
0:06:17for the first question you can look for the answer in the passage and
0:06:20the model can point to something like east and similarly for the second question
0:06:24you are looking for the hotel rating and the model can point to this set of tokens
0:06:27four stars
0:06:28so as simple as that
0:06:31now the representations how we represent the different components so the dialogue history
0:06:36which is like the passage in our formulation is represented as a concatenation of user
0:06:41and agent utterances
0:06:42it can be either a one dimensional representation or you can have a two dimensional matrix like
0:06:47a hierarchical representation and then you can use appropriate encoders to
0:06:50encode them
0:06:52and the slot which is the question in our formulation is a domain plus slot embedding
0:06:57we want the domain as well because as we saw in the previous
0:06:59example
0:07:01the slots in a dialogue can span multiple domains
0:07:04and we have a fixed dimensional vector for this domain slot combination which is learned
0:07:08along with the full model
0:07:10one thing to note here is that unlike actual question answering
0:07:13we don't actually convert the slot into a full natural language question we just treat
0:07:17the embedding of the slot plus domain
0:07:19as the question itself
0:07:21and finally the answer is just the
0:07:23start and end position in the conversation
0:07:26okay so this is the main model in our approach which is called the slot span model
0:07:31which is just like a typical extractive rc model what it does is it predicts
0:07:34the slot value as a span of tokens in the dialogue so you have the start
0:07:38and end span pointers computed with bilinear attention between the dialogue context and
0:07:42the slot embedding
0:07:43just like reading comprehension models and the example shown here is
0:07:46the same dialogue first turn the user says he wants to book a hotel in
0:07:49the east with four stars so after the first turn if you want to track the slot
0:07:53hotel area in this case we'll assume that our model outputs a
0:07:58start and end probability which is high for the eighth token in the context which represents
0:08:02basically the value east
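A minimal sketch of bilinear attention between token representations and a slot embedding, assuming the usual form score_t = h_t^T W q followed by a softmax over tokens. The vectors, the identity weight matrix, and the function names are toy assumptions; in the model described in the talk these would be learned, and separate weights would be used for the start and end pointers.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bilinear_attention(token_vectors, weight, slot_vector):
    """Distribution over tokens with score_t = h_t^T (W q)."""
    # projected = W q
    projected = [sum(w * q for w, q in zip(row, slot_vector)) for row in weight]
    # score_t = h_t . projected
    scores = [sum(h * p for h, p in zip(token, projected)) for token in token_vectors]
    return softmax(scores)

# Toy 2-dimensional example: three tokens, identity weight matrix.
token_vectors = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]
slot_vector = [1.0, 0.0]  # stands in for the learned domain plus slot embedding

start_probs = bilinear_attention(token_vectors, weight, slot_vector)
print(start_probs)
```

Running the same function with a second weight matrix would give the end distribution.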
0:08:04okay so but this model is not sufficient
0:08:08and this is true also for question answering cases because
0:08:11certain slots can take values from a closed set like parking and
0:08:14internet yes no so we need to account for that and also there are some slots
0:08:18that can have a value called don't care for example price range in the previous example
0:08:23and many of the slots in the schema are never mentioned and so you need
0:08:26to fill them with the default none value so these are the cases that
0:08:28cannot be directly handled by the span model
0:08:31so to do this we augment our rc model with two other auxiliary models the slot
0:08:35carryover model and the slot type model
0:08:37the carryover model predicts whether we should
0:08:40update a slot value in the current dialogue turn or carry over the old slot value
0:08:44from the previous turn and in the beginning the slot is initialized with the default
0:08:47none value
0:08:48and the type model is just a simple classifier which makes a
0:08:51decision among one of the four classes yes no don't care or span type
0:08:57so i'm going to go over the two models so the carryover model as i
0:09:00said just predicts whether to update the slot value for the current turn
0:09:02or to carry it over and it makes the binary decision for all the slots
0:09:06jointly at each turn
0:09:08an example here would be so after the first turn you have
0:09:11values so one thing i wanted to clarify is that the carryover
0:09:14model name is kind of confusing because
0:09:18what it exactly is is a slot update model what i mean by that is
0:09:21the one here represents that
0:09:23you want to update the slot and zero represents that you want to carry it over
0:09:25i just kept this convention because we have it like that in the paper
0:09:28so here when you go from the first turn to the second
0:09:32turn the user has mentioned three new slots wifi
0:09:36like internet parking and the price range so those slots will get updated so
0:09:39the values are one but for the other two slots area and stars
0:09:43they will be zero because they will just get carried over from the
0:09:46previous turn
0:09:48and the type model is simple it just predicts the slot type given the
0:09:53question which is the slot and the dialogue context and it makes a four way
0:09:57decision between yes no don't care and span a simple example would be hotel
0:10:01area which in this context would be a span type because you want to
0:10:04find the value east in the context and for the slot hotel parking the value
0:10:08would be just
0:10:08yes so it would be the yes type that the model should output
0:10:12okay so putting all this together the combined model is as follows at the
0:10:16bottom we have a word embedding layer over the tokens in the passage
0:10:20next we have a contextual embedding encoding which is basically a bidirectional lstm
0:10:25we just use a single bidirectional lstm so this will give us the contextual
0:10:28representation for each of the tokens we use the last hidden state of the lstm which
0:10:31gives us the embedding of the dialogue
0:10:33we embed the question using just the slot plus domain embedding which is randomly initialized
0:10:38and we just learn it with the model
0:10:40then
0:10:42this dialogue embedding vector will be used to predict the slot carryover
0:10:47decision so we have a dense sigmoid
0:10:49layer on top of that which just makes the binary decision for each of the
0:10:53slots the slot type model takes as input the dialogue embedding vector along with the
0:10:56question vector and then it has a softmax layer to predict
0:10:59one of the four classes
0:11:00and the span model finally will take as input the question vector and it will
0:11:04have attention from the question to each of the tokens in the passage just
0:11:08like any rc model and you would have the start span prediction and the end
0:11:12so at inference what happens is you will begin with
0:11:17the slot carryover model the carryover model says
0:11:20whether to update the slot if it is a zero then we
0:11:23just carry over the slot value from the previous turn if it says one which
0:11:26means you want to update the slot
0:11:27then we invoke the type model
0:11:30if the type model says yes no or don't care we update the slot value with that
0:11:32if it's a span then we invoke the span model to get
0:11:36the start and end position of the slot value and then we just extract that
0:11:40from the conversation and update the slot value
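A minimal sketch of the inference cascade just described, for a single slot at a single turn. It assumes the three model outputs are already available as plain values; the function name `update_slot`, the argument names, and the string labels are illustrative assumptions. The bit convention follows the talk: 1 means update, 0 means carry over.

```python
def update_slot(prev_value, update_bit, slot_type, span, tokens):
    """Combine the carryover, type, and span predictions for one slot.

    update_bit: 1 = update the slot, 0 = carry over the previous value.
    slot_type:  one of "yes", "no", "dontcare", "span" (only used when updating).
    span:       (start, end) token indices, inclusive (only used for "span").
    """
    if update_bit == 0:
        return prev_value                      # carry over from the previous turn
    if slot_type in ("yes", "no", "dontcare"):
        return slot_type                       # closed-set value from the type model
    start, end = span
    return " ".join(tokens[start:end + 1])     # extract the value from the conversation

tokens = "i need to book a hotel in the east that has four stars".split()

# Every slot starts with the default "none" value.
area = update_slot("none", 1, "span", (8, 8), tokens)
parking = update_slot("none", 0, None, None, tokens)
price = update_slot("none", 1, "dontcare", None, tokens)
print(area, parking, price)
```

Because the carryover decision gates everything else, a wrong bit here cannot be repaired by the type or span models, which matches the bottleneck reported later in the talk.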
0:11:42okay so
0:11:44everyone in the last two days has been using the same dataset i think you
0:11:46know about the multiwoz dataset
0:11:48multiwoz is a human human collection of about two point five thousand single domain and
0:11:52seven thousand multi domain dialogues
0:11:54it has annotations for dialog state and system acts we don't use the user dialogue acts in
0:11:59this work
0:12:00and some statistics it has about eight thousand four hundred dialogs and about hundred fifteen
0:12:04thousand turns
0:12:06and on average about thirteen point five turns per dialogue and the total number of slots we are tracking here is thirty
0:12:10a cross six domains
0:12:14some results
0:12:15so this is the original so before that the metric is joint goal accuracy
0:12:20which basically means that at every turn you want to predict all the slots correctly if
0:12:24any of the slots is wrong then the accuracy is zero otherwise one
0:12:29so it is a strict metric
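A minimal sketch of joint goal accuracy as defined here: a turn scores one only if every slot matches the ground truth, otherwise zero, and the scores are averaged over turns. The function name and the toy states are illustrative assumptions.

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns whose full predicted state equals the gold state."""
    if not gold_states:
        return 0.0
    correct = sum(1 for pred, gold in zip(predicted_states, gold_states) if pred == gold)
    return correct / len(gold_states)

gold = [
    {"hotel-area": "east", "hotel-stars": "4"},
    {"hotel-area": "east", "hotel-stars": "4", "hotel-parking": "yes"},
    {"hotel-area": "east", "hotel-stars": "4", "hotel-parking": "yes", "taxi-departure": "cambridge"},
]
pred = [
    {"hotel-area": "east", "hotel-stars": "4"},
    {"hotel-area": "east", "hotel-stars": "4", "hotel-parking": "no"},   # one wrong slot
    {"hotel-area": "east", "hotel-stars": "4", "hotel-parking": "yes", "taxi-departure": "cambridge"},
]

accuracy = joint_goal_accuracy(pred, gold)
print(accuracy)
```

In the toy example a single wrong slot in the second turn costs the whole turn, which is why the metric is called strict.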
0:12:31so in the table
0:12:33the first number is from the original multiwoz paper the benchmark
0:12:36glad and gce are approaches that have been there glad uses self attention
0:12:41and then splits
0:12:43into a global tracker and a local tracker and gce is just a simplified
0:12:46version of glad
0:12:47so these two numbers and then
0:12:49dst joint state tracking that i showed before where they encode the dialogue history
0:12:53through a hierarchical lstm
0:12:54and then have a feed-forward layer for each slot type
0:12:57so that number is about thirty eight now our approach with the single model
0:13:01beats all these approaches
0:13:03and then we also built an ensemble model which basically just takes a
0:13:06majority vote between three different models trained with three different seeds
0:13:10and finally we also wanted to check
0:13:13how well this works if you just combine our approach with this
0:13:17with a closed vocabulary approach like the joint state tracking model and
0:13:21how we combine it is very simple we just
0:13:23choose one of the two approaches based on
0:13:26for each slot we choose one of the two approaches based on which of them
0:13:28is better
0:13:29for that particular slot on the dev set
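A minimal sketch of the per-slot combination rule described here: for each slot, use whichever of the two approaches scored better on the dev set. The accuracy numbers below are made up for illustration and are not results from the talk; the function name, the labels, and the tie-break in favour of the span model are also assumptions.

```python
def choose_per_slot(dev_acc_span, dev_acc_closed):
    """Pick, for every slot, the approach with the higher dev-set accuracy.
    Ties go to the span model (an arbitrary choice for this sketch)."""
    return {
        slot: "span" if dev_acc_span[slot] >= dev_acc_closed[slot] else "closed_vocab"
        for slot in dev_acc_span
    }

# Illustrative per-slot dev accuracies (not from the talk).
dev_acc_span = {"hotel-area": 0.95, "attraction-type": 0.80, "taxi-departure": 0.93}
dev_acc_closed = {"hotel-area": 0.94, "attraction-type": 0.91, "taxi-departure": 0.85}

selection = choose_per_slot(dev_acc_span, dev_acc_closed)
print(selection)
```

At test time each slot's value would then be taken from the selected approach before computing joint goal accuracy.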
0:13:31and this gives us a considerable boost of about five percent
0:13:34and we'll see why this happens
0:13:38we did some ablation studies the first and most important is like
0:13:41if we feed the ground truth for all the three models so these
0:13:45ablations are for the single model of our approach this is
0:13:49not for the
0:13:49combined
0:13:51model with the dst
0:13:52so here if we feed the carryover the slot type and the slot
0:13:56span model with the ground truth
0:13:57you get the joint goal accuracy on the dataset as seventy three
0:14:01which basically means that our approach is upper bounded by seventy three
0:14:04that is basically because twenty seven percent of slot values are not
0:14:07even present in the conversation an example would be something like in the context what kind of sports
0:14:11attractions are available in the centre of town
0:14:15and you want to find the slot attraction type
0:14:18if the answer is multiple sports our model will never get it right even
0:14:22if the model points to this value sports
0:14:25it is not the same as the ground truth which is multiple sports so
0:14:28we are bounded
0:14:30by seventy percent and this is also the reason that when we combine our approach with the
0:14:33closed vocabulary which is more based on the ontology we get some
0:14:37alleviation of that bound so if we add that you get about two percent
0:14:41gain then we did some oracle with each of the model types so if we
0:14:45replace just the slot type model with the ground truth so this is already a
0:14:49very strong model we don't get much gain we get about one percent gain or
0:14:54half a percent
0:14:55if we replace the slot span model with the ground truth we get about four percent
0:15:00but if we replace the slot carryover model with the ground truth
0:15:03we get about sixty we get about twenty percent gain
0:15:05so as you can see this is the bottleneck here the carryover
0:15:08model this is also evident if you look at the accuracy for each
0:15:11model the type and span models have like
0:15:13ninety to ninety five percent which is pretty high
0:15:15but the carryover model only has like seventy six percent
0:15:19so this gives a direction for future work maybe we want to improve this
0:15:23carryover model
0:15:26so we also analyzed how the performance varies with the depth of the conversation history
0:15:30and there is a steady decrease in performance as the conversation gets deeper and this is
0:15:35because of the error propagation from the carryover model
0:15:40and finally we did some error analysis we basically took some two hundred error samples
0:15:45and
0:15:47we analyzed them and bucketed
0:15:50them into four different categories
0:15:52the first and the biggest category we call unanswerable slot errors
0:15:55so these are the errors which are made by our slot carryover model
0:15:59so there are two cases in this the first one is where the reference
0:16:03is none and the hypothesis is not none which basically means
0:16:05the reference says that we should carry over the slot value from the previous turn but
0:16:09the carryover model says to update it
0:16:12so in this case and the second one is the opposite of
0:16:14this so in the first case
0:16:16even though this is the bulk of the errors like forty two percent when
0:16:19we look at the actual errors these are not real errors the model is
0:16:22making a prediction which is actually correct
0:16:25but there is a lot of annotation noise in the dataset because the states
0:16:28the state annotations are delayed like they
0:16:33are updated one turn later so because of this all of these
0:16:37get counted as errors but a bunch of them
0:16:40a lot of them are not really errors
0:16:42in the second case
0:16:44the ground truth says that we should
0:16:47the ground truth is to update the slot value while our model predicts
0:16:49to just carry over from the previous turn in this case there are some real errors
0:16:54for example here you can see the user is trying to book
0:16:58a restaurant in the centre part of the town and finally the agent
0:17:01is able to make the reservation and then the user next says that he
0:17:05also needs an attraction near the restaurant so here when you want
0:17:09to fill a slot say attraction area the model says
0:17:14that this would be none which basically means it has not been mentioned
0:17:17but as you can see the user says it should be near the
0:17:20restaurant so it should be carried over from the previous domain so our model is
0:17:23unable to do that
0:17:26so the next error type is what we call incorrect reference which
0:17:29basically means there are multiple possible candidates in the context but our model predicts the
0:17:33wrong candidate so in this case you see the user is trying to book a
0:17:37hotel for four people and the agent responds that the booking was unsuccessful
0:17:41and the user
0:17:42then asks for eight people
0:17:45the ground truth is eight of course but our model predicts four
0:17:47so we have seen that a lot of this happens
0:17:50as in this case when the user changes their mind so our model is
0:17:54not robust to these kinds of things and the possible reason would be that the model
0:17:57overfitted to a particular entity like four which is seen more in the
0:18:00training set
0:18:02this accounts for about twenty percent of the errors
0:18:05the next category is what we call slot resolution errors here you see the
0:18:10context is something like i want to leave the hotel by two thirty
0:18:13the model points to two thirty but the ground truth is actually fifteen
0:18:17thirty so this is kind of an intended output because we only do
0:18:20pointing in the context
0:18:21so these are more like normalization errors and it's about thirty percent
0:18:25the final category is the slot boundary errors where the span model makes a mistake
0:18:28it either gets a span which is a superset
0:18:34of the reference or it is a subset of the reference in this case the reference
0:18:37is just the value whereas the span our model gets also includes
0:18:40city center but this is only a small fraction it is around two percent of
0:18:43the errors
0:18:45finally i also want to just
0:18:47put one slide on that the number that i showed was state of the art at submission
0:18:50time but since then there's a paper here this is the
0:18:53trade transferable multi-domain state generator for task oriented dialogue systems
0:18:57here what they do is they use a pointer generator network to combine the fixed vocabulary along
0:19:01with the distribution over the
0:19:03dialogue history and they get a slightly better accuracy than
0:19:05our model combined with the dst but a key difference between their approach
0:19:10and ours is that
0:19:11they use a decoder to generate the value token by token whereas we just use two
0:19:15pointers to point to the start and end of the span
0:19:20that's all i wanted to say thank you and
0:19:22i can take questions
0:19:29okay so we have time for questions
0:19:32thank you for the talk my question is when you're
0:19:36considering the different types like yes no don't care and the span
0:19:42there's potentially another case right that is when the user doesn't really say
0:19:50the value of the slot but you can infer it like when the system asks what
0:19:54cuisine type do you want and the user says i want some pizza tonight the classifier could
0:19:59infer that the value for the cuisine type will be italian but the
0:20:04user never said italian so the span would not cover that case right
0:20:08so what the user says so you're saying the user says i want some pizza
0:20:12tonight or something like that okay that's true so those are not covered here
0:20:16because we are just doing more like pointing our type model probably would predict span
0:20:20because it's not one of the other types and the span model will probably
0:20:24point to something in the context but we fail just like in the other cases
0:20:28so we have a future direction where we can sort of take inspiration from reading
0:20:31comprehension where you can do more like abstractive question answering you can use these
0:20:35as a rationale and then try to have a generative model that generates the value
0:20:39which is more like italian grounded on the selection so we can do that
0:20:43in future
0:20:51thanks for the great talk just one simple question so if i give
0:20:56you a sentence like i want to go from cambridge to london you know
0:21:00the destination is london and the departure is cambridge do you think your model can do like in this case
0:21:06can it do better because they are both values for a place
0:21:10do you think your model can do better than the baseline system in these kinds of cases
0:21:15because you are still going slot by slot then how does the model know the
0:21:19destination is
0:21:20london and not cambridge
0:21:24i see
0:21:26so it would because we feed the context right so it can learn like from
0:21:29and to from the context
0:21:32what about because you check the span right and it is possible
0:21:37that both slots
0:21:39both predict the same span
0:21:42no but we also feed in the slot type right
0:21:47so destination and so you see
0:21:56no maybe it's in the final so here in the prediction
0:22:02we also
0:22:03have a question vector right so it would be either the destination or the source right
0:22:07so based on that the span model can infer which one it would be
0:22:11the question is a user query embedding so for those two slots is it the same user query
0:22:17no it should be different right so it would be a destination or the source
0:22:22so it also considers slot information yes okay so the question representation
0:22:26is the slot
0:22:28okay then it can tell the difference okay cool thanks
0:22:35more questions
0:22:50maybe a provocative question but we have heard many papers about you know
0:22:56dialog state tracking and in particular on this particular corpus and so my question is
0:23:02what do you think we need to take it to the next level
0:23:05when you know we don't talk about going from cambridge to london or looking
0:23:11for a chinese restaurant
0:23:15so if you're asking about particularly improving on this dataset i think to
0:23:20be honest
0:23:24i think
0:23:26i mean whether it is necessary i would say having particularly experimented
0:23:30with this dataset i found that there are a lot of errors in
0:23:33it especially with respect to dialog state annotations so if you're just trying to improve
0:23:38upon this it's not a good idea because we won't even know whether we are
0:23:41doing better or not so there is this new dataset dstc8 that we
0:23:45can look into and see
0:23:47which approaches do better but otherwise i mean i feel like now people have
0:23:52begun to do more end-to-end approaches where you don't even need the state it's
0:23:55more implicit but then that's again the same problem to pipeline or not to
0:24:00pipeline so
0:24:01i don't have good answers
0:24:05any other questions
0:24:08i have one question so have you considered the way of evaluation i'm
0:24:14not sure if the carryover is you know causing some problem in the evaluation
0:24:18if we carry over previous slot values it is sort of propagating errors to the
0:24:25next ones but if you sort of
0:24:28have another metric like a slot update rate or something like that would it
0:24:33be possible for you to evaluate your system more accurately
0:24:37a slot will be treated like
0:24:42i see your point
0:24:44so the numbers i think i gave for the
0:24:48so this number of seventy six percent is more like
0:24:51a turn level accuracy for a particular turn whether the carryover model predicts everything
0:24:55correctly it is more like
0:24:57whether to update or not
0:24:59something more like precision and recall would be better exactly i didn't put
0:25:03it here but also here like these two error types
0:25:07you can think about it the first one is more like a precision error
0:25:10for the slot update model think of the carryover model as a slot
0:25:13update model so in this case the model predicts that we should update
0:25:17but the ground truth is not to update so this is like precision and the
0:25:20second is more like recall
0:25:22statistic is more likely correlated
0:25:23so i don't know the numbers this morning at t and eighty four percent number
0:25:27to this more destructive actually somewhat inflated i is more meaningful looking down level because
0:25:32you want all the
0:25:34slots to be right because the eventual goal is joint goal accuracy where you
0:25:37want all the slots to be correctly predicted
0:25:39okay and the way we train our models so an important
0:25:43thing is also that we train the carryover model jointly and not
0:25:48per slot and this is important because if you do it per slot
0:25:51we tried that and did not get good performance because
0:25:55the dataset the examples particularly for the carryover model are highly biased you
0:26:00can imagine the number of updates is very few most of the time the slot is
0:26:03just getting carried over so if you just train per slot you won't have
0:26:06enough signal for the updates and you will get just biased
0:26:11training
0:26:14so it's about time so let's thank the speaker again