But now we will listen to three papers that, as I said, went through the regular review process. The first one is "DeepCopy: Grounded Response Generation with Hierarchical Pointer Networks", presented by Semih Yavuz.

Hello everyone. I'm Semih Yavuz, a PhD student from the University of California, Santa Barbara. Today I'm going to talk about our work on grounded response generation with hierarchical pointer networks. This is joint work with Abhinav, Guan-Lin, and Dilek; we are at different places now, but this is actually work done at Google AI while I was an intern there last year. Okay, without further ado, let's start.

So this paper is about building dialogue models for knowledge-grounded response generation. The problem we want to tackle here is basically to push these models to be able to hold more natural and engaging conversations. Previous papers in this domain have pointed out several problems that mostly boil down to the generic responses these models generate, and that is the basic problem this paper is trying to tackle.

Just to start with an example: say we have a user looking for Italian food in Los Altos. A response coming from a system like "Poppy's is a nice restaurant in Los Altos serving Italian food" would be a good response, but at the same time, how engaging does this response sound? I don't know; I would probably prefer something that contains more information. In general, this is the scenario we are trying to address: enriching responses with more information. So the question we ask is: what happens if we were able to use external knowledge to make the content of these responses more informative, or more engaging, if you want to call it that?

So let's say we have a model that can actually go and look at the reviews of the restaurant that you want to recommend to the user, take pieces of information from those reviews, and then generate a response that starts with the same first sentence but also adds that certain dishes there are quite popular. That would be a more engaging response, to me.

So the general problem we are trying to solve is proposing models that incorporate external knowledge into response generation. Most of the early previous work in this domain tried to do this with sequence-to-sequence models; it's not exactly the same problem, but they tried to model the dialogue without using this external knowledge. That requires a lot of data to encode world knowledge into the model's parameters, and additional drawbacks include that, depending on the model, you might need to retrain it as new knowledge becomes available. Instead, we can think of the problem as adding the knowledge as an input to the model.

There is an early piece of work that tries to achieve this. What they do is basically take the conversation history, use additional facts, let's say external knowledge, pick some of the knowledge from this resource, and incorporate that into response generation. In this work, we go over the existing models that address this exact scenario and then propose further models that we think might be useful.

So the contributions we are going to talk about are models that incorporate external knowledge as an additional input. In more detail, this will include going over some baselines, proposing further baselines that are not covered in the literature but that we found to be useful models, and, at the end, the model that we propose and that we believe is helpful.

Okay, so there is a bunch of fairly new datasets in this domain where you have conversations and, as part of the data, accompanying external knowledge. One of them is the DSTC7 challenge from last year with its sentence generation track: there are Reddit conversations, and you want to use the accompanying web text to generate better responses. There is Wizard of Wikipedia, which has natural conversations between a learner and an expert, and there are other recent datasets as well. In this work we will actually focus on the ConvAI2 dataset. One of the reasons we worked on it is that it doesn't need any retrieval step: the facts relevant to the dialogue are already given.

Let me talk about the dataset in a bit more detail. In this dataset there are two people, each given a persona, and they are asked to have a conversation based on their persona sentences. Some of the properties, or rather challenges, of this dataset are the following. You have some facts that you want to be able to incorporate into your response generation, which is one of the motivations for having the persona in the first place, but it is hard for models to actually do that. You also have facts that are, let's say, not in your persona but that you still have to be able to produce, which is another main challenge of this dataset. And there are also all kinds of infrequent words, which are, I would say, tied to the statistics of the dataset and hard to model.

Okay, so this is the dataset we are going to work on. A few words on evaluation metrics before we dive into the models. There will be automated metrics, which are common for the sentence generation task, which is the main task of this challenge. We will also have a human evaluation where we ask humans to rate the responses generated by the models from one to five. At the end, we will also present a further analysis of the ability of the models to incorporate the facts present in the persona. And finally, we will also have a diversity analysis, which is again an automated metric, to see whether the models can generate diverse responses.

Okay, the models come in two parts: the baseline models, which I will cover pretty fast, and then the models that we believe are helpful for this task.

Let's start with the sequence-to-sequence model with attention. You have the dialogue history, which is concatenated into a single sequence, then a sequence encoder, for which we use an LSTM, and a decoder that generates the response based on it. Then we have sequence-to-sequence, again, with a single fact, where we also take the most relevant fact from the persona and append it to the context. Now you have a longer sequence that also contains the factual information, and you want to generate a response from it.
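To make the input construction concrete, here is a tiny sketch, under my own assumptions about tokenization and the separator token (not the authors' code), of how the dialogue history, and optionally the selected fact, could be flattened into a single encoder sequence:

```python
# Hypothetical helper: flatten dialogue turns (and optionally one fact)
# into the single token sequence fed to the seq2seq encoder.
def build_input(dialogue_turns, fact=None, sep="<sep>"):
    tokens = []
    for turn in dialogue_turns:
        tokens += turn.split() + [sep]
    if fact is not None:               # "seq2seq + single fact" variant
        tokens += fact.split()
    return tokens

history = ["hi , any good italian places nearby ?", "sure , what area are you in ?"]
print(build_input(history, fact="i love italian food"))
```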

The most relevant fact is selected in two ways. The first one is "best fact by context", where you take the dialogue context and find the most relevant fact based on TF-IDF similarity. Then we have "best fact by response", where the similarity is measured between the facts and the ground-truth response. This is a cheating model, just so we can see whether, if we were able to provide the right fact, the model would be able to generate a better response.
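As a concrete illustration of the two selection heuristics, here is a minimal sketch, an assumed scikit-learn based helper rather than the authors' code, that picks the fact with the highest TF-IDF cosine similarity to either the context or the gold response:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def best_fact(facts, query):
    """Return the fact most similar to `query` (dialogue context or gold response)."""
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform(facts + [query])
    fact_vecs, query_vec = vectors[:-1], vectors[-1]
    scores = cosine_similarity(fact_vecs, query_vec).ravel()
    return facts[scores.argmax()]

facts = ["i love italian food", "i have two dogs", "i work as a teacher"]
context = "looking for an italian restaurant in los altos"
print(best_fact(facts, context))        # "best fact by context"
# best_fact(facts, gold_response)       # "best fact by response" (the cheating/oracle variant)
```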

So, some results. I'm going to first present the main results, which are automated metrics like perplexity, BLEU, and so on, and also the human evaluation, which is appropriateness.

Here, "no fact" is the first model, and basically what we see is that incorporating a single fact improves the perplexity, as you see here, and incorporating the cheating fact improves it even further. But you lose somewhat on naturalness, and one of the reasons, and this is a hypothesis based on looking at the outputs, is that the no-fact model generates very generic responses, which are safe and therefore quite frequently rated higher than responses that try to incorporate a fact but fail at it. That is, I think, the main reason why this happens. Another interesting thing here is the appropriateness score of the ground-truth response: this is out of five and it is 4.4, so it is not perfect, which is another challenge here.

Another line of baselines is memory networks, where we again encode the context with a sequence model and take its representation to attend on the facts. Each fact has a key representation, shown in green, which is basically a vector, and a value representation, shown in blue. We attend on the key representations to get a probability distribution over the facts, compute a summary vector out of the values, add it to the context vector, feed that to the decoder, and the decoder generates the response. We will call this the memory network, or MemNet, for this task. We also have another version, similar to a model covered in previous work, so this is again a baseline, where the decoder additionally has attention on the context itself; in the previous one there was no per-step decoder attention, but here there is. We also have a fact-attention version, where at every decoder step there is an additional attention on the facts, so when you are generating you can go back and look at the facts. And then we have a memory network where both the fact and the context attention are enabled.
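For clarity, here is a condensed sketch, with assumed tensor shapes rather than the released implementation, of the key-value memory read described above:

```python
import torch
import torch.nn.functional as F

def memory_read(context_vec, fact_keys, fact_values):
    """
    context_vec: [batch, d]      encoded dialogue context
    fact_keys:   [batch, n, d]   one key vector per fact (green in the slide)
    fact_values: [batch, n, d]   one value vector per fact (blue in the slide)
    """
    # attention scores over the facts, computed from the context representation
    scores = torch.bmm(fact_keys, context_vec.unsqueeze(-1)).squeeze(-1)   # [batch, n]
    probs = F.softmax(scores, dim=-1)                                      # distribution over facts
    # summary of the fact values, weighted by the attention
    summary = torch.bmm(probs.unsqueeze(1), fact_values).squeeze(1)        # [batch, d]
    # the decoder is conditioned on context + fact summary
    return context_vec + summary
```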

Okay, if we look at the results of these compared to the earlier baselines, we see that attention only on the facts, as you can see here, gives the best fact incorporation, and additionally that the sequence-to-sequence models we analyzed earlier are competitive with the memory network models proposed in previous work.

On top of that, the next thing we realized is that the sequence-to-sequence models we analyzed so far fail to reproduce factual information, such as in the examples I showed at the beginning, like the Los Altos one. To address that, we tried incorporating a copy mechanism, again into the baselines here. We basically adopted the pointer-generator network that was proposed a couple of years back: what it does is, at every decoder step, you have a soft combination of word generation and copying of tokens from the input, so that if there is something in the input that is not in your vocabulary, you can still generate it.
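A rough sketch of that mixing step, in the style of See et al.'s pointer-generator and with assumed tensor shapes, might look like this:

```python
import torch

def mix_generate_and_copy(vocab_probs, attn_probs, src_ids, p_gen, extended_vocab_size):
    """
    vocab_probs: [batch, V]        softmax over the fixed vocabulary
    attn_probs:  [batch, src_len]  attention over input tokens at this decoder step
    src_ids:     [batch, src_len]  input token ids in the extended vocabulary (long tensor)
    p_gen:       [batch, 1]        soft switch: probability of generating vs copying
    """
    batch = vocab_probs.size(0)
    final = torch.zeros(batch, extended_vocab_size)
    # generation part, scaled by p_gen
    final[:, :vocab_probs.size(1)] = p_gen * vocab_probs
    # copy part: scatter-add the attention mass onto the ids of the input tokens
    final.scatter_add_(1, src_ids, (1.0 - p_gen) * attn_probs)
    return final
```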

Okay, so as I said, this is important for reusing factual information that may not be in the vocabulary. What we do is take the sequence-to-sequence models we explored at the beginning, add the copy mechanism to each of them, and look at what happens. We immediately see that the copy mechanism improves all of them, actually quite a bit. We also see that the model fed the best fact by response, the cheating one, basically tells us that if you had a way to find the fact that best matches the response, you would be able to do pretty well; so it is sort of an upper bound again.

Okay, so now we want to go further and see how we can make use of every token of every fact that is available to us, because the previous models either did not use all the facts, like the sequence models that just pick one fact and use it, or, like the memory network models, only use a summary of the facts as a whole. Now we want to see what happens if we are able to condition the response on every fact token. This might be important for copying the relevant pieces of information from the facts, even though you are not explicitly given the best fact.

The basis for this is what we call the multi-source sequence-to-sequence model with hierarchical attention. The context encoder is the same, but for the fact encoding we also use an LSTM, so we have a contextual representation for every fact token. At every decoder step, we take the decoder state and attend on the tokens of each fact, which gives us a distribution over that fact's tokens; computing a context vector over these gives us the fact summaries. We then apply another attention over the fact summaries, which gives us a distribution over the facts, that is, which fact might be more important. We also have a context summary coming from the attention over the context, and then one more attention that attends over the fact summary and the context summary and combines them based on which one is more important. This is all soft attention, so you don't need anything extra; it is fully differentiable, that's the point. Now you generate your response, and the loss is the negative log-likelihood.
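The following is a condensed sketch, under my own assumptions about tensor shapes and using plain dot-product attention in place of whatever scoring function the model actually uses, of what a single decoder step of this hierarchical attention could look like:

```python
import torch
import torch.nn.functional as F

def attend(query, keys):
    """Dot-product attention; returns (weights, weighted sum of keys)."""
    scores = torch.einsum('bd,bnd->bn', query, keys)
    weights = F.softmax(scores, dim=-1)
    summary = torch.einsum('bn,bnd->bd', weights, keys)
    return weights, summary

def hierarchical_step(dec_state, fact_tokens, ctx_tokens):
    """
    dec_state:   [batch, d]                 current decoder state
    fact_tokens: [batch, n_facts, L, d]     LSTM outputs for every fact token
    ctx_tokens:  [batch, T, d]              LSTM outputs for the dialogue context
    """
    b, n, L, d = fact_tokens.shape
    # token-level attention inside each fact -> one summary per fact
    tok_w, fact_summaries = attend(
        dec_state.repeat_interleave(n, dim=0),     # [b*n, d]
        fact_tokens.reshape(b * n, L, d))          # [b*n, L, d]
    fact_summaries = fact_summaries.reshape(b, n, d)
    # fact-level attention over the fact summaries
    fact_w, fact_vec = attend(dec_state, fact_summaries)
    # attention over the context tokens
    ctx_w, ctx_vec = attend(dec_state, ctx_tokens)
    # final attention decides how much to rely on facts vs. context
    src_w, mixed = attend(dec_state, torch.stack([fact_vec, ctx_vec], dim=1))
    return mixed, (tok_w.reshape(b, n, L), fact_w, ctx_w, src_w)
```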

Then there is DeepCopy, which is the main model we propose in this paper. What we exploit here is that everything remains the same as in the previous model I showed, but we use the attention probabilities over the context tokens and the fact tokens as the corresponding copy probabilities. As you can see here, you have a distribution over the facts and a distribution over the tokens of every fact, so you can induce a single distribution over every unique token in your facts. You also have the attention over the context and the facts as sources, and using that you can combine these two into, again, a single distribution, which you can use as the copy probabilities of the tokens and combine with generation. So here we already have the generation probabilities over the vocabulary, we also have the copy probabilities from the context tokens and the fact tokens, and we combine all of them into a single distribution.
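Building on the previous sketch, here is an assumed, simplified view (not the released implementation) of how those attention weights could be reused as copy probabilities and blended with the generation distribution:

```python
import torch

def deepcopy_output(vocab_probs, p_gen, tok_w, fact_w, ctx_w, src_w,
                    fact_ids, ctx_ids, extended_vocab_size):
    """
    vocab_probs: [b, V]      generation distribution over the fixed vocabulary
    p_gen:       [b, 1]      soft switch between generating and copying
    tok_w:       [b, n, L]   token-level attention inside each fact
    fact_w:      [b, n]      attention over facts
    ctx_w:       [b, T]      attention over context tokens
    src_w:       [b, 2]      attention over the two sources (facts, context)
    fact_ids:    [b, n, L]   fact token ids in the extended vocabulary (long tensor)
    ctx_ids:     [b, T]      context token ids in the extended vocabulary (long tensor)
    """
    b = vocab_probs.size(0)
    # copy distribution over fact tokens: P(fact i) * P(token j | fact i)
    fact_copy = (fact_w.unsqueeze(-1) * tok_w).reshape(b, -1)       # [b, n*L]
    copy_ids = torch.cat([fact_ids.reshape(b, -1), ctx_ids], dim=1)
    # source-level weights mix the fact side with the context side
    copy_probs = torch.cat([src_w[:, :1] * fact_copy,
                            src_w[:, 1:] * ctx_w], dim=1)            # [b, n*L + T]
    final = torch.zeros(b, extended_vocab_size)
    final[:, :vocab_probs.size(1)] = p_gen * vocab_probs             # generation part
    final.scatter_add_(1, copy_ids, (1.0 - p_gen) * copy_probs)      # copy part
    return final
```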

If you look at the results, on all of the main evaluation metrics DeepCopy outperforms all the other models we have seen here. It is also important to note that the "best fact by context plus copy" model we analyzed was also a competitive model.

We also did a diversity analysis; this is a metric that was proposed in one of the previous works, looking at the ratio of distinct unigrams, bigrams, and trigrams in the generated responses. DeepCopy also performs well here compared to the other models.

performing good here is all compared to the other models

this is an example

where we can see that like

the deep copy can achieve that is that we wanted to do

basically it can

i depend on the right person of fact we just highlighted here before knowing which

one is related it can copied exactly kind of relevant pieces

from the fact and the current context of the dialogue

and also like you can also see that it can copy and generate

at the same time so basically it can switch between the modes

So to conclude, we propose a general model that can take a query, which is the dialogue context in this case, and external knowledge, which is a set of facts in unstructured text, and generate a response out of them. We propose strong baselines on top of this, and then show that the proposed model performs favorably compared to the existing ones in the literature. That's it, thank you for listening. I can take questions.

Okay, do we have any questions in the audience? Over there.

Hi, a quick question: when you say that for the copy, instead of focusing on only one fact you focus on a few facts, does it compute the weights of all the facts and then do a weighted sum, instead of just picking the top three or the top one?

Are you asking what we do in the proposed model? In the proposed model, basically, you feed in all the facts and it can choose which ones to use. But it doesn't pick exactly one; it computes a soft representation out of all of them.

Okay, and then it uses that as a weighted sum over the vocabulary?

Well, that is actually the copy part. In the copy part, you have a vocabulary from which you can generate; say its size is 5k, and these are the frequent words, right? Then you also have a way to combine: you have a distribution over this vocabulary, and you also have a distribution over the unique tokens that appear either in the facts or in the dialogue context. So you can induce a single probability distribution out of all of this, and that final output distribution is computed in a soft way, which means it is differentiable, so you can just train it with the negative log-likelihood.

Okay.

I also have a question about the human evaluation: you have this appropriateness measure, and I was wondering what it actually was, because the motivation for this was to create more engaging responses, and appropriateness doesn't sound like that; responses don't have to be engaging to be appropriate. So what was the actual instruction?

That's a good question. Actually, that was something I had to skip a bit. So we have two human evaluations: one is the appropriateness, and the other one is the fact-inclusion analysis. The latter is more relevant to measuring whether a response is engaging or not, but it doesn't capture it entirely, for the following reason. These are metrics that we have humans rate, and they are binary: for every response, does it include a fact? It doesn't have to be from the persona; it could come from the persona facts or it could be a fact from the conversation. Then you have a breakdown of how much of it comes from the persona and how much comes from the conversation. That is what we asked the humans. So this metric tells you a bit about engagingness, but not exactly, because of the following: if you look at the ground-truth score, and this is the main metric here, factual information included from the persona, even the ground truth does this only about fifty percent of the time. It means the ground-truth responses do not draw on the persona facts all the time, because you can think of it this way: in an actual conversation between two people, five facts cannot cover the complexity of such a conversation, right? That's why this is also not a perfect metric. So what I'm trying to say is that measuring engagingness is a little more difficult. We tried to measure it this way, by looking at whether the generated response includes a relevant fact, but we don't have a perfect evaluation for that.