0:00:17 But now we will listen to three papers that, as I said, underwent the
0:00:22 regular review process.
0:00:25 And the first one is DeepCopy: Grounded Response Generation with Hierarchical Pointer
0:00:33 Networks, presented by Semih Yavuz.
0:01:24 Hello everyone.
0:01:26 I'm Semih Yavuz, a PhD student at the University of California, Santa Barbara.
0:01:32 Today I'm going to talk about our work on grounded response generation with hierarchical pointer networks.
0:01:39 This is joint work with Abhinav Rastogi, Guan-Lin Chao, and Dilek Hakkani-Tür, who
0:01:46 are at different places now; this is actually work done at Google
0:01:51 while I was an intern
0:01:53 last year.
0:01:56 Okay, without further ado, let's start.
0:02:03 This paper is about building dialogue models for knowledge-grounded response generation,
0:02:12 and the problem that we want to tackle here is basically to push these models to be able to carry out more natural and engaging conversations.
0:02:28 Previous papers in this domain have pointed out several problems that mostly come down to generic response generation by these models;
0:02:48 this is the basic problem that this paper is trying to tackle.
0:02:51 Just to start with an example: say we have a user looking for Italian food in Los Altos.
0:03:02 A response coming from a system like "Poppy's is a nice restaurant in Los Altos serving Italian food" would be a good response, but at the same time, how engaging does this response sound?
0:03:17 I don't know; I would probably prefer something that contains more information.
0:03:23 But in general, this is basically the scenario that we try to address: enriching responses with more information.
0:03:34 So the question that we ask is: what happens if we were able to use external knowledge to make the content of these responses more informative, or more engaging if you want to say.
0:03:49 So basically, let's say we have a model that can actually go look at the reviews of this restaurant that you want to recommend to the user,
0:04:01 and then take pieces of information from these user reviews,
0:04:07 and then generate a response where the first sentence is the same, but it also mentions specific dishes of theirs that are quite popular.
0:04:20 Excuse me.
0:04:22 So this would be a more engaging response to me.
0:04:26 So basically, the general problem that we are going to be trying to solve involves proposing models to incorporate external knowledge in response generation.
0:04:43 Most of the early previous work in this domain tried to do this with sequence-to-sequence models.
0:04:54 It's not exactly the same problem, but they were trying to model the dialogue without using external knowledge.
0:05:02 This requires a lot of data to be able to encode world knowledge into the model's parameters,
0:05:16 and some additional drawbacks also include that, depending on the model, you might need to retrain the model as new knowledge becomes available.
0:05:30 Instead of that, we can think of this problem as adding the knowledge as an input to the model.
0:05:41 There is an early work that tries to achieve this: they basically have the conversation history,
0:05:53 and then they try to use additional facts, let's say external knowledge,
0:06:01 and sort of pick some of the knowledge from this resource
0:06:05 and incorporate that into response generation.
0:06:08 So in this work we try to go over the existing models that try to achieve this exact scenario,
0:06:19 and then propose further models that we think might be useful.
0:06:26 So basically, the contributions that we'll talk about are going to be models that try to incorporate external knowledge as an additional input.
0:06:40 In more detail, this will contain going over some baselines,
0:06:42 and actually proposing further baselines that are not covered in the literature but which may be useful models,
0:06:53 and then at the end we will talk about the model that we propose, which we think might be helpful.
0:07:02 Okay, so there are a bunch of new datasets in this domain
0:07:10 where you have conversations that are actually accompanied by external knowledge.
0:07:19 One of them is the DSTC7 challenge from last year, which had a sentence generation track:
0:07:28 basically there are Reddit conversations, and you want to use the linked web articles to be able to generate better responses.
0:07:40 And there's Wizard of Wikipedia, where there are natural conversations between a learner and an expert,
0:07:50 and there are other recently released datasets as well.
0:07:55 In this work we will actually talk about the ConvAI2 dataset.
0:08:01 One of the reasons why we worked on this dataset was
0:08:06 that it doesn't need any retrieval step;
0:08:11 basically, the relevant facts already come attached to the dialogue.
0:08:16 Let me talk about the dataset in more detail.
0:08:20 So basically, in this dataset there are two persons, each of whom is assigned a persona,
0:08:28 and they are asked to hold a conversation based on their personas.
0:08:40 Some of the properties of this dataset,
0:08:45 or some of its challenges, are:
0:08:47 you have some facts that you want to be able to incorporate in
0:08:51 your response generation, which is actually sort of one of the motivations of why
0:08:56 you have the persona,
0:08:59 but it's also hard for the models to be able to do that.
0:09:04 And you have some responses that need facts
0:09:07 that you don't have in your persona,
0:09:11 but you still have to be able to
0:09:14 produce them, which is also
0:09:16 another main challenge of this dataset.
0:09:20 And there are also other kinds of responses, which are, I would say, close to the statistics of the data and hard to model.
0:09:31 Okay, so basically this is the dataset that we're going to work on.
0:09:37 Now, some evaluation metrics, before we dive into the models.
0:09:44 There will be automated metrics, which are common for the sentence generation task,
0:09:52 which was the main task of this challenge,
0:09:57 and we will also have a human evaluation where we ask
0:10:01 humans to rate the responses
0:10:03 generated by the models from one to five.
0:10:07 We will also, at the end, present a little bit of further analysis on
0:10:12 the ability of the models to
0:10:15 incorporate the facts they are presented with.
0:10:21 And finally we will also have a diversity analysis (this is also sort of an automated metric)
0:10:29 to see if the models can generate diverse responses.
0:10:34 Okay, so the models are going to come in two parts: one is the baseline models,
0:10:40 which we will cover pretty fast, and then we'll have the
0:10:44 models that we
0:10:47 think are helpful for this task.
0:10:52 Let's start with the sequence-to-sequence model with attention, where basically
0:10:56 you have the dialogue history,
0:10:59 which is concatenated into a single sequence;
0:11:03 then you have the sequence encoder, for which we use an LSTM;
0:11:08 and we have the decoder that actually generates the
0:11:13 response based on this.
0:11:16 Then we have sequence-to-sequence again with a single fact, where we
0:11:21 also take the most relevant fact from the persona
0:11:25 and append it to the
0:11:29 context, so now you have a longer sequence
0:11:32 which also has the factual information,
0:11:36 and you want to generate a response from this.
0:11:38 The most relevant fact is obtained in two ways.
0:11:42 The first one is "best fact (context)",
0:11:46 which is basically: you take the dialogue context and then
0:11:49 find the most relevant fact to it
0:11:52 based on TF-IDF similarity.
0:11:54 And then we have "best fact (response)", where the similarity
0:11:58 is measured between
0:12:00 the facts and the ground-truth response.
0:12:04 So this is a
0:12:06 cheating model, just to be able to see whether,
0:12:09 if you were able to provide the
0:12:11 right fact,
0:12:13 the model would be able to generate
0:12:15 a better response, basically.
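The best-fact selection just described can be sketched as a simple TF-IDF cosine ranking; the whitespace tokenizer and the example persona facts below are illustrative assumptions, not the paper's exact setup.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # TF-IDF over simple whitespace tokens; real systems use richer tokenizers.
    tokenized = [d.lower().split() for d in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    n = len(docs)
    return [{t: c * math.log(n / df[t]) for t, c in Counter(toks).items()}
            for toks in tokenized]

def cosine(a, b):
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_fact(context, facts):
    # rank persona facts by TF-IDF cosine similarity to the dialogue context
    vecs = tfidf_vectors(facts + [context])
    scores = [cosine(vecs[-1], fv) for fv in vecs[:-1]]
    return facts[max(range(len(facts)), key=scores.__getitem__)]

persona = ["i love italian food", "i have two dogs", "i teach high school"]
print(best_fact("any good italian restaurants in los altos ?", persona))
```

The "best fact (response)" variant would be the same ranking with the ground-truth response in place of the context, which is why it can only be used as a cheating upper bound.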
0:12:22 So basically,
0:12:24 some results here. I'm going to first
0:12:29 present the main results, which are going to be automated metrics
0:12:35 like perplexity and BLEU,
0:12:37 and also the human evaluation,
0:12:39 which is appropriateness.
0:12:41 So here, the no-fact model is
0:12:46 the first model,
0:12:47 and basically what we see is that incorporating a single fact
0:12:51 improves the perplexity,
0:12:52 as you see here,
0:12:55 and also, if you incorporate the cheating
0:12:58 fact, it improves it even further,
0:13:01 but you lose a bit on
0:13:03 naturalness.
0:13:05 And one of the
0:13:08 reasons
0:13:09 (I mean, this is a hypothesis I have from looking at the results)
0:13:15 is that the no-fact model generates very generic responses,
0:13:20 which are sometimes rated higher than the
0:13:26 ones that are trying to incorporate the facts
0:13:28 but fail to do it.
0:13:31 So that's probably the main reason why this happens.
0:13:36 Another thing that is interesting here is that if you look at the
0:13:39 appropriateness score of the ground-truth response,
0:13:42 I mean, this is out of five and it's 4.4, so it's
0:13:46 not perfect;
0:13:48 that's sort of another challenge here.
0:13:51 So another line of baselines is memory networks,
0:13:57 where basically we encode the context again with a sequence model,
0:14:03 and we take its representation and attend on the facts.
0:14:07 Each fact actually has a key representation, shown in green,
0:14:11 which is basically a vector,
0:14:13 and they also have value representations, which are shown in blue.
0:14:17 So we attend on the key representations,
0:14:19 obtain a probability distribution over the facts,
0:14:22 and then compute a summary vector
0:14:26 out of them; then we add it to the context vector, feed it
0:14:29 to the decoder, and the decoder generates the response.
0:14:37 So we will call this a memory network for this task.
0:14:42 And then we also have
0:14:46 another version of this, which is similar to
0:14:51 a model that is covered in previous work (this is again
0:14:54 another baseline model),
0:14:58 where basically, in the decoder, you also have an attention on the context;
0:15:03 in the previous one there was no per-step decoder attention, but
0:15:07 here there is.
0:15:10 We also have the fact-attention version: basically, at every decoder step
0:15:14 you have an additional attention on the facts,
0:15:18 meaning that when you are generating, you can go back and look at
0:15:21 the facts.
0:15:24 And then we have a memory network where attention on both the facts and the
0:15:29 context is enabled.
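The key-value memory read just described can be sketched numerically as follows; the dot-product scoring and the dimensions are assumptions made for illustration, not the paper's exact parameterization.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_read(context_vec, fact_keys, fact_values):
    # attend on the key representations -> distribution over facts
    probs = softmax(fact_keys @ context_vec)   # (num_facts,)
    # summary of the value representations, weighted by that distribution
    summary = probs @ fact_values              # (dim,)
    # the summary is added to the context vector before feeding the decoder
    return context_vec + summary, probs

rng = np.random.default_rng(0)
keys = rng.normal(size=(5, 8))      # one key vector per fact
values = rng.normal(size=(5, 8))    # one value vector per fact
ctx = rng.normal(size=8)            # encoded dialogue context
decoder_input, fact_probs = memory_read(ctx, keys, values)
```

Note that only the single summary vector reaches the decoder here, which is the limitation the hierarchical models later in the talk try to remove.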
0:15:33 Okay, so if we
0:15:36 look at the results of these compared to the sequence-to-sequence baselines,
0:15:41 we see that, basically, attention only on the facts,
0:15:44 as you can see here, results in the best fact incorporation,
0:15:50 and additionally, the sequence-to-sequence models that we analyzed
0:15:55 earlier hold up well compared to the memory network models that were proposed
0:16:00 by previous work.
0:16:08 Basically, on top of that, the next thing is that we realized that
0:16:12 the sequence models that we analyzed
0:16:15 failed to reproduce the
0:16:18 factual information, such as the examples that I showed at the beginning.
0:16:25 So for that, we tried incorporating a copy mechanism, again
0:16:31 on the baselines here.
0:16:34 For that, we basically adapt the pointer-generator network that was
0:16:38 proposed two years back,
0:16:42 and what it does is that basically, at every decoder step, you have
0:16:50 a soft combination of word generation
0:16:52 and copying of the tokens from the input,
0:16:55 so that if there is something in the input that is not in your
0:16:59 vocabulary, you can still generate it.
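That soft combination can be sketched as below, following the general pointer-generator recipe (See et al., 2017) rather than the paper's exact implementation; the gate value and the toy distributions are made up for illustration.

```python
import numpy as np

def pointer_generator_dist(p_gen, vocab_dist, attn, src_ids, ext_vocab_size):
    # p_gen: probability of generating from the fixed vocabulary (the gate)
    # vocab_dist: distribution over the fixed vocabulary
    # attn: decoder attention over source positions, reused as copy scores
    # src_ids: id of the token at each source position in the extended vocab
    out = np.zeros(ext_vocab_size)
    out[: len(vocab_dist)] = p_gen * vocab_dist
    for pos, tok in enumerate(src_ids):
        out[tok] += (1.0 - p_gen) * attn[pos]   # copy mass for source tokens
    return out

# toy example: id 3 is an out-of-vocabulary source token, reachable only by copying
dist = pointer_generator_dist(
    p_gen=0.8,
    vocab_dist=np.array([0.5, 0.3, 0.2]),
    attn=np.array([0.6, 0.4]),
    src_ids=[1, 3],
    ext_vocab_size=4,
)
print(dist)   # sums to 1; the OOV token gets probability 0.2 * 0.4 = 0.08
```

Because the result is still a valid distribution over the extended vocabulary, the model can be trained with the usual negative log-likelihood.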
0:17:03 Okay, so basically, as I said, this is important for
0:17:11 producing the factual information that may not be in the vocabulary.
0:17:15 So basically, what we do is take the sequence-to-sequence models
0:17:20 that we explored in the beginning,
0:17:23 add the copy mechanism to each of them, and look at what happens.
0:17:28 And we immediately see
0:17:31 that the copy mechanism improves all of them
0:17:36 pretty well.
0:17:38 We also see that if you look at
0:17:42 the model that is
0:17:43 fed the best
0:17:44 response fact
0:17:45 in a cheating way,
0:17:47 basically it says that
0:17:50 if you had a way to
0:17:53 find the best fact from the ground-truth response, then you'd be able
0:17:57 to do pretty well; so it's sort of an upper bound, again.
0:18:00 Okay, so now
0:18:04 we want to further see
0:18:07 how we can actually make use of
0:18:12 every token in every fact that is available to us, because the previous models
0:18:18 either did not use
0:18:20 all the facts, like the sequence models, which just pick one fact and then use
0:18:23 it, or, like the memory network models,
0:18:25 basically use an entire summary of the
0:18:29 facts as a whole, and then just use that.
0:18:31 Now we want to see
0:18:32 what happens if basically we were able to condition the response on every fact.
0:18:39 I mean, this might be important for
0:18:46 copying the relevant
0:18:48 pieces of information from the facts,
0:18:50 even though you're not explicitly given
0:18:52 the best fact.
0:18:57 So basically, the basis for this is a model
0:18:59 that we call hierarchical attention,
0:19:02 where the context encoding is the same,
0:19:05 but for the fact encoding we also use an LSTM, so basically we
0:19:09 have a contextual representation for every fact,
0:19:13 sorry, every fact token.
0:19:14 And what we do is that at every decoder step we take the
0:19:19 decoder state and attend on the
0:19:24 tokens of each fact, which basically gives us a distribution over that fact's tokens,
0:19:29 and then
0:19:31 we compute a context vector over these,
0:19:35 which gives us basically fact summaries.
0:19:37 And then we do another attention on the fact summaries, which gives us
0:19:42 a distribution over the facts: which fact might be more important.
0:19:46 And then
0:19:48 we also have a context summary
0:19:51 coming from the attention on the context,
0:19:54 and then we have one more attention,
0:19:56 which basically attends on the fact summary and the context summary and then combines them
0:20:02 based on which one is more important. This is all soft
0:20:05 attention, so you don't need anything extra;
0:20:08 this is basically fully differentiable, that's what I'm saying.
0:20:13 And then you generate your response, and
0:20:18 the loss is basically
0:20:23 the negative log-likelihood.
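The levels of attention just described can be sketched roughly as follows, using plain dot-product scoring throughout; the paper's exact scoring functions, shapes, and combination step are assumptions here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hierarchical_attention(dec_state, fact_tokens, context_summary):
    # fact_tokens: list of (len_i, dim) arrays, the encoded tokens of each fact
    # level 1: token-level attention inside each fact -> fact summaries
    token_dists, summaries = [], []
    for toks in fact_tokens:
        d = softmax(toks @ dec_state)          # distribution over this fact's tokens
        token_dists.append(d)
        summaries.append(d @ toks)             # context vector = fact summary
    summaries = np.stack(summaries)            # (num_facts, dim)
    # level 2: attention over fact summaries -> distribution over facts
    fact_dist = softmax(summaries @ dec_state)
    fact_summary = fact_dist @ summaries
    # level 3: soft combination of fact summary and context summary
    pair = np.stack([fact_summary, context_summary])
    combined = softmax(pair @ dec_state) @ pair
    return combined, fact_dist, token_dists

rng = np.random.default_rng(1)
facts = [rng.normal(size=(n, 6)) for n in (4, 3, 5)]   # three toy facts
combined, fact_dist, token_dists = hierarchical_attention(
    rng.normal(size=6), facts, rng.normal(size=6))
```

Every step is a softmax-weighted sum, which is what makes the whole stack differentiable and trainable with negative log-likelihood.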
0:20:27 And then DeepCopy, which is the main model that
0:20:30 we propose in this paper. What we try to exploit here is that
0:20:35 basically everything remains the same
0:20:37 as in the previous model that I showed,
0:20:40 but what we basically do here is that
0:20:43 we use the
0:20:45 attention probabilities
0:20:46 over the context
0:20:48 tokens and the fact tokens
0:20:50 as the corresponding copying probabilities.
0:20:54 So basically, as you can see here, you have a distribution over the facts
0:20:58 and a distribution over the tokens of every fact, so you can basically assign a probability
0:21:03 to whatever unique token appears in your facts here.
0:21:07 And then you also have another attention on
0:21:11 the context and the facts, and using those,
0:21:14 you can combine these two into,
0:21:16 again, a single distribution,
0:21:19 and you can use that
0:21:21 as the copy probabilities of the tokens,
0:21:25 and then combine it with the generation distribution.
0:21:30 So here,
0:21:31 basically, we already have generation probabilities over the vocabulary, and then we
0:21:36 also have the copy probabilities from the context tokens and the fact tokens,
0:21:40 and we combine all of them into a single distribution.
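Numerically, combining the generation and copy distributions might look like this sketch; the two gate values p_gen and p_fact are assumed given (in the model they would be computed from decoder states), and the toy numbers are purely illustrative.

```python
import numpy as np

def deepcopy_dist(vocab_dist, p_gen, p_fact, fact_dist, token_dists,
                  fact_ids, ctx_attn, ctx_ids, ext_vocab_size):
    # p_gen gates generation vs. copying; p_fact gates facts vs. context
    out = np.zeros(ext_vocab_size)
    out[: len(vocab_dist)] = p_gen * vocab_dist
    p_copy = 1.0 - p_gen
    # copy from facts: P(token) = P(fact i) * P(token j | fact i)
    for i, (dist, ids) in enumerate(zip(token_dists, fact_ids)):
        for j, tok in enumerate(ids):
            out[tok] += p_copy * p_fact * fact_dist[i] * dist[j]
    # copy from the dialogue context
    for j, tok in enumerate(ctx_ids):
        out[tok] += p_copy * (1.0 - p_fact) * ctx_attn[j]
    return out

dist = deepcopy_dist(
    vocab_dist=np.array([0.7, 0.3]), p_gen=0.5, p_fact=0.6,
    fact_dist=np.array([0.5, 0.5]),
    token_dists=[np.array([1.0]), np.array([0.4, 0.6])],
    fact_ids=[[2], [0, 3]],
    ctx_attn=np.array([1.0]), ctx_ids=[1],
    ext_vocab_size=4,
)
```

Since all the ingredients are normalized distributions and the gates are convex weights, the result always sums to one, so the single output distribution can again be trained with negative log-likelihood.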
0:21:44 And then if you look at the results,
0:21:48 basically on all of the main evaluation metrics,
0:21:53 DeepCopy
0:21:56 outperforms all the other models that we have seen here.
0:22:01 And it's also important to note that this best fact (context) plus
0:22:04 copy model that we analyzed was also a competitive model.
0:22:17 We also did a diversity analysis;
0:22:19 this is a metric that was actually proposed in one of the
0:22:23 previous works.
0:22:25 Looking at the diversity of the generated responses,
0:22:28 DeepCopy is also shown to be
0:22:33 performing well here compared to the other models.
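If the diversity metric in question is distinct-n (Li et al., 2016), which is an assumption since the talk does not name it, it is simply the ratio of unique n-grams to total n-grams over the generated responses:

```python
def distinct_n(responses, n):
    # ratio of unique n-grams to total n-grams over all generated responses
    ngrams = []
    for r in responses:
        toks = r.split()
        ngrams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

print(distinct_n(["i am good", "i am fine"], 1))   # 4 unique / 6 total
```

Higher values mean less repetitive, less generic output, which is why it complements the appropriateness scores here.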
0:22:38 This is an example
0:22:40 where we can see that
0:22:43 DeepCopy can achieve what we wanted it to do.
0:22:47 Basically, it can
0:22:49 attend on the right persona fact (we just highlighted it here) without knowing in advance which
0:22:52 one is relevant; it can copy exactly the relevant pieces
0:22:57 from the fact and the current context of the dialogue;
0:23:01 and you can also see that it can copy and generate
0:23:04 at the same time, so basically it can switch between the modes.
0:23:09 So basically, we propose a general model that can take a query, which is
0:23:14 the dialogue context in this case,
0:23:15 and external knowledge, which is basically a set of facts in unstructured text,
0:23:19 and generate a response out of them.
0:23:23 We also propose strong baselines
0:23:26 on top of this,
0:23:27 and then show that the proposed model actually performs
0:23:31 favorably compared to the existing ones in the literature.
0:23:38 That's it; thank you for listening.
0:23:40 I can take questions.
0:23:44 Okay, so do we have any questions
0:23:46 in the audience?
0:23:54 Hi, [inaudible].
0:23:57 A quick question: when you say that for the copy, instead of focusing on only
0:24:03 one fact, you focus on all five facts; so, in fact, do you compute
0:24:08 the weights of all the facts and then do a
0:24:12 weighted sum,
0:24:14 instead of just picking the top fact?
0:24:20 I mean, are you asking what we do in the proposed model, or...?
0:24:25 In the proposed model, basically, you feed all the facts,
0:24:29 and it can choose which ones to use.
0:24:31 So it doesn't pick exactly one; it actually computes a
0:24:35 soft representation out of all of them.
0:24:38 Okay, and then uses that as a weighted sum over the vocabulary?
0:24:44 Well, that is actually the copy part. In the copy part, you have a
0:24:49 vocabulary from which you can generate; let's say it's of size five thousand,
0:24:55 and these are the most frequent words,
0:24:59 and you have a distribution over this, right?
0:25:02 And then you also have a distribution over the unique tokens that appear either in
0:25:08 the facts or the dialogue context.
0:25:10 So now you can induce a single probability distribution out of all of this,
0:25:15 and then you have a single probability distribution which is computed in a
0:25:19 soft way, which means it's differentiable, so you can just train it with the negative log-likelihood.
0:25:30 I have a question also about the human evaluation: you have appropriateness as
0:25:36 one measure,
0:25:38 but the motivation for this was to create more engaging
0:25:42 responses, and appropriateness doesn't sound like it measures whether they are engaging. So
0:25:48 what was the actual instruction?
0:25:49 That's a good question.
0:25:52 Actually, that was
0:25:54 something I was about to
0:25:57 address a bit. So basically we have two
0:26:00 human evaluations:
0:26:02 one is the appropriateness,
0:26:04 and the other one is the fact-inclusion analysis.
0:26:08 And this one is more relevant to
0:26:12 measuring whether it is more engaging or not,
0:26:15 but it is
0:26:16 not entirely that, because of the following.
0:26:19 So if you look at, well, here, these are
0:26:22 metrics for which we have humans do the rating; these are binary metrics.
0:26:26 Fact inclusion means: does the response include a fact? And it doesn't
0:26:31 have to be from the persona:
0:26:32 you have the five persona facts, but it could also be a fact of any sort.
0:26:37 And then you have a breakdown of how much of it is coming
0:26:41 from the persona and how much
0:26:42 of it is coming from elsewhere.
0:26:44 So that's basically what we ask the humans.
0:26:48 Basically, here, I would say this metric says a bit about
0:26:52 the engagingness, but not exactly, because of the following.
0:26:56 If you look at the ground-truth score (this is the main metric here,
0:27:02 for factual information included from the persona),
0:27:06 even the ground truth does about
0:27:08 fifty percent.
0:27:09 So it means that the ground-truth responses
0:27:12 don't even have
0:27:15 coverage of a
0:27:16 persona fact all the time,
0:27:18 because you can think of this as:
0:27:21 in an actual conversation between two people,
0:27:24 basically, five facts cannot cover the complexity of
0:27:28 such a conversation, right? That's why
0:27:31 this is also not a perfect metric.
0:27:33 So what I'm trying to say is,
0:27:36 measuring engagingness is a little bit
0:27:39 more difficult.
0:27:41 We tried to measure it this way,
0:27:44 just by looking at whether the generated response included a relevant fact,
0:27:50 but we don't have a perfect evaluation for that.