Hello everyone. I'm a student at the university, and I'll be discussing some joint work between my collaborators in the NLP group and also Ford Research and Innovation.

And I guess before I actually get started: this talk is going to be pretty deep-learning-heavy. So before you start writing me off as, like, the harbinger of everything that is bad in dialogue today, I just want to say, you know, please let's be civil about this.

Before I get started, I'd like to take a step back and discuss some of what I think are the larger motivations of dialogue research. To do that, I'd like to talk about the film Her, which some of you may have seen.

In it, the protagonist, played by Joaquin Phoenix, essentially develops an intimate relationship, in this sparse futuristic world, with his super-intelligent assistant Samantha. And what makes Samantha so appealing is her charisma, her ability to conduct very intelligent conversations.

While I won't necessarily spoil the details of the movie, I would like to say that I think it does a fantastic job of illustrating what is really at the core of a lot of dialogue research. On the one hand, we are trying to build very practically useful agents; we're trying to build things that people can use on a daily basis.

But more broadly, I think we should also be trying to build agents that are sociable, compassionate, empathetic, relatable, and collaborative. I think in doing so we'll learn a lot about ourselves: what we as humans are, what makes us human, what's at the core of our humanity. And I think this deeper motive is something that drives a lot of dialogue research, and it certainly guides a lot of the research that I like to do.

Moving now into the actual talk itself, a quick roadmap: I'm going to discuss some background to this work, then the model that we developed, a dataset we also developed, the experiments that validated the approach, and some concluding remarks.

So, background. Take this snippet of dialogue, in which a human asks a fairly simple query: "What time is my doctor's appointment?" We would like an agent to be able to answer the query with reasonable effectiveness and say something like, "Your appointment is at 3 pm."

Traditional dialogue systems tend to have a lot going on in the back end. We have a number of modules doing various things, including natural language understanding, interfacing with some sort of knowledge base, and then, obviously, natural language generation. Traditionally we have separate modules doing all of these things, and it can often be very difficult to make all these different modules interact smoothly.

And so I think the promise of a lot of present-day neural dialogue research is that we'll be able to unify all of these separate modules into one end-to-end system, in a way that is effective and doesn't really limit performance.

More specifically, I think one of the big challenges that a lot of present neural dialogue systems suffer from is interfacing with the knowledge base itself.

So really, the kind of thing we would like to see is a smooth interaction between these heterogeneous components. If we could replace all these separate, hardworking little robots with one mega-robot, i.e. an end-to-end dialogue system, then maybe we'd be making some real progress. This is of course a lofty goal, but it's one we would like to work towards.

For the purposes of this work, let me first discuss some previous work that has been done in this general line of research.

Some work from Wen et al. has sought to essentially take the traditional modular, connected paradigm and replace some or all of the components with neural equivalents. Other work has tried to enhance the KB lookups, interacting with the KB through some sort of soft operation that still maintains some sort of belief state tracking. There's another line of work that tries to find a middle ground, seeking the best of the rule-based, heuristic systems while still being amenable to neural training. And then there's some work that we have been pursuing in the past that seeks to build an end-to-end system on top of the traditional seq2seq paradigm, enhancing that paradigm with mechanisms that allow more effective dialogue exchanges.

The motivation of our work is then twofold. One, we would like to develop a system that can interface with the knowledge base in a more or less end-to-end fashion, without the need for explicit training of belief state trackers. And a subproblem of that is: how do we get a sequence-to-sequence architecture, this very popular architecture, to interact nicely with intrinsically structured information? We're talking about a sequential model combining with a more structured representation, and getting these to work together is something that I think is going to be a challenge going forward.

Now, some details on the model.

First off, I don't know what people's general familiarity with seq2seq models is, but the encoder-decoder-with-attention framework is one that has been investigated in a number of different works, and for the purposes of dialogue it follows more or less the exact same starting paradigm, the same general backbone. On the encoder side, we're basically feeding in a single token of the dialogue context at a time through a recurrent unit, highlighted in blue, and we unroll the recurrent unit for some number of time steps. After some number of computations, we get a hidden state that is used to initialize the decoder, which is also a recurrent unit and is also unrolled for some number of time steps.

At each step of the decoding, we refer back to the encoder and essentially compute a distribution over the various tokens of the encoder. This is used to generate a context vector, which is then combined with the decoder hidden state to form a distribution over possible output tokens that we can argmax over to essentially generate our system response.
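To make the mechanics concrete, here is a minimal sketch of one such decoding step in PyTorch. This is my own illustrative rendering, not the implementation from the talk; the dot-product scoring and the single linear output projection "out_proj" are assumptions made for the sake of the example.

    import torch
    import torch.nn.functional as F

    def attention_decode_step(dec_hidden, enc_outputs, out_proj):
        """One step of decoding with attention over the encoder states.

        dec_hidden:  (batch, hidden)          current decoder hidden state
        enc_outputs: (batch, src_len, hidden) one state per context token
        out_proj:    nn.Linear(2 * hidden, vocab_size) output projection
        """
        # Score every encoder token against the decoder state; a softmax
        # turns the scores into a distribution over the context tokens.
        scores = torch.bmm(enc_outputs, dec_hidden.unsqueeze(2)).squeeze(2)
        attn = F.softmax(scores, dim=1)                        # (batch, src_len)

        # Context vector: attention-weighted sum of the encoder states.
        context = torch.bmm(attn.unsqueeze(1), enc_outputs).squeeze(1)

        # Combine decoder state and context to score the output vocabulary;
        # argmax-ing over these logits yields the next response token.
        logits = out_proj(torch.cat([dec_hidden, context], dim=1))
        return logits, attn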

So with this general background, I hypothesized that in principle we should be able to take this decoder hidden state that we're already computing at a given timestep, push it one step further, and say: hey, use this exact same decoder hidden state to compute some sort of attention over the rows of a knowledge base. The question then is how we actually represent the knowledge base in such a way that this is feasible. I mean, we're talking about structured information, and we're trying to deal with it in more of a sequential fashion, since we're interested in sequences.

So again, the question really guiding this work is: how can we represent a KB effectively? To do so, we draw inspiration from the key-value memory networks of Miller et al., which essentially showed that a key-value representation is not only a nice, elegant design paradigm, but can also be directly shown to be quite effective on a number of different tasks. So maybe it's something helpful for us.

To show how this would actually play out for our purposes, I'm going to take one row of a KB and show how we transform it into something that is amenable to a key-value representation.

So consider this single row of a KB. Here we're talking about a calendar scheduling task, and we have some structured information that we want to convert into what is essentially a subject-relation-object triple format. What we're doing here is: we have some event, the dinner, which is connected to a number of different items, facts about the dinner, through some relation. So we have some time, which is connected through a time relation, and a date, which is connected through a date relation, et cetera, et cetera. All the information that was originally represented in the row of the knowledge base is now collapsed into triple format. That is the first operation we're going to work with: going to the subject-relation-object triple format.

We then make just one small change, which converts this into a key-value store: taking the subject and the relation and essentially concatenating them to form a sort of canonicalized representation, which is our key. That is exactly what we're trying to do here. If you look at the first row, we had the subject-relation-object triple for the dinner time, at 8 pm. The subject and relation essentially become this new canonicalized mega-key called "dinner time", for lack of a better word, and the object is just mapped one-to-one to the value. And we do the same for every single other row of the original triple format.
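As a toy illustration of that flattening step, here is a small Python sketch; the field names and values are made up for the example, not taken from the actual dataset schema.

    # Hypothetical calendar row, flattened into a key-value store.
    row = {"event": "dinner", "time": "8pm", "date": "the 13th", "party": "Ana"}

    def row_to_kv(row, subject_field="event"):
        """Collapse one KB row into canonicalized (subject_relation -> object) entries."""
        subject = row[subject_field]
        kv = {}
        for relation, obj in row.items():
            if relation == subject_field:
                continue
            # Concatenate subject and relation into a single canonicalized key.
            kv[f"{subject}_{relation}"] = obj
        return kv

    print(row_to_kv(row))
    # {'dinner_time': '8pm', 'dinner_date': 'the 13th', 'dinner_party': 'Ana'}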

And because we're dealing with embeddings, the keys in this case end up being just the sum of the subject and relation embeddings. So "dinner time" in this case is literally just the sum of the dinner embedding and the time embedding.

An important detail is that now, when we're doing decoding, we're argmaxing over an augmented vocabulary, which includes not only the original vocabulary that we started off with, but now also these additional canonicalized key representations.

When we put it all together, we essentially have what we started out with, the seq2seq encoder-decoder-with-attention framework, but now we've added in this attention over the knowledge base. We compute some weight over every single row of the knowledge base. So, for example, in the case of something like the football time at 2 pm that's visible here, there's a weight that is used to weight the appropriate entry, in this case the "football time" canonicalized representation, in the distribution we're argmaxing over. We do this for every single row of the new canonicalized KB. And that, essentially, is our adjusted model.
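Putting those pieces together, here is a hedged sketch of how the KB scores could be appended to the ordinary vocabulary logits. The projection "W" and the shared embedding table are my assumptions for the illustration, not necessarily the exact parameterization used in the work.

    import torch

    def kb_attention_logits(dec_hidden, subj_ids, rel_ids, embedding, W):
        """Score every KB row from the current decoder hidden state.

        Each canonicalized key (e.g. 'football_time') is embedded as the sum
        of its subject and relation embeddings, and the decoder state attends
        over these key embeddings to produce one logit per KB row.

        dec_hidden: (batch, hidden)
        subj_ids:   (num_rows,) vocabulary ids of the subjects
        rel_ids:    (num_rows,) vocabulary ids of the relations
        embedding:  nn.Embedding shared with the rest of the model
        W:          nn.Linear(hidden, embed_dim), maps decoder state to key space
        """
        keys = embedding(subj_ids) + embedding(rel_ids)   # (num_rows, embed_dim)
        return W(dec_hidden) @ keys.t()                   # (batch, num_rows)

    # At each decoding step, the KB logits extend the vocabulary logits, so the
    # argmax may emit either an ordinary word or a canonicalized KB key:
    # logits = torch.cat([vocab_logits, kb_attention_logits(h_t, s, r, emb, W)], dim=1)
    # next_token = logits.argmax(dim=1)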

Moving on to the dataset that we used.

First off, a quick note: data scarcity is an obvious issue in a lot of dialogue research, especially with the neural dialogue models that a lot of people are dealing with, where it seems that more data often helps. But given that our collaboration was with Ford, which obviously is a car company, and hence is really only interested in things relevant to cars, we had to go about building essentially a new dataset, one that would still let us ask the same questions we want to ask about knowledge bases, but is more relevant to their use case: that being the in-car virtual assistant domain. The three sub-domains we were interested in are calendar scheduling, weather, and point-of-interest navigation.

The way we went about collecting the dataset is essentially a Wizard-of-Oz scheme, adapted from the work of Wen et al. Essentially, we have crowdsourced workers playing one of two roles: they can either be the driver or the car system, and we progress dialogue collection one exchange at a time.

The driver-facing interface looks like this: you have a task that's generated automatically for the worker, and usually they're provided with the dialogue history, but because this is the first exchange of the dialogue, there's no history to begin with. The worker is then tasked with progressing the dialogue a single turn.

On the car system's side, we also provide the dialogue history so far. But the car system is additionally asked to use some private collection of information that they have access to, which the user does not, and they are then supposed to use that information to also progress the dialogue, iteratively supporting exactly what the user wants.

The dataset ontology has a number of different entity types and associated values across the different domains, and I guess that lends itself to a fairly large amount of diversity in the types of things that people can talk about. Once data collection was done, we had a little over three thousand dialogues, split more or less evenly across the three different domains, with an average of around five utterances per dialogue, as well as around nine tokens per utterance.

Now for some experiments, using this dataset and the model we proposed.

The baselines that we used for benchmarking our model were two. First, we built a sort of traditional rule-based system that uses manual rules to do the natural language understanding as well as the natural language generation, and to do all of the interfacing with the KB. And then the neural competitor that we put up against our new model was the copy-augmented seq2seq model that we had built previously in prior work, which at its core is essentially also an encoder-decoder framework with attention, that same kind of backbone, but augments it with an additional copy mechanism over the entities that are mentioned in the dialogue context.

We chose this because, one, it is in the exact same class of models as the new one we're proposing, i.e. seq2seq with attention; and, I guess, previous work had also shown that it is actually pretty competitive with other model classes, including the end-to-end memory network from Facebook; and also because the code was already there, so, you know.
For automatic evaluation we had a number of different metrics, and I'm going to say this up front and bite the bullet: we did provide some automatic evaluation, but I know that in dialogue especially, automatic evaluation is something that is a little tricky to do, and it really is a bit of a divisive topic. Still, there are some metrics that people have reported previously, so we just followed the line of previous work.

We used BLEU, which is of course adapted from machine translation; there's some work that says it's actually an awful metric with no correlation to human judgment, and then there's some more recent work that says it's actually pretty decent, that the n-gram-based matching is not really all that bad. And then we report an entity F1, which is basically a micro-averaged F1 over the set of entities mentioned in the generated response, as compared to those in the target response that we're going for.
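For concreteness, here is a minimal Python sketch of a micro-averaged entity F1. It captures the spirit of the metric; the exact entity canonicalization used in the actual evaluation may differ in its details.

    def entity_f1(predicted, gold):
        """Micro-averaged F1 over entities mentioned in responses.

        predicted, gold: parallel lists of entity sets, one pair per response.
        """
        tp = fp = fn = 0
        for pred, ref in zip(predicted, gold):
            tp += len(pred & ref)   # entities in both responses
            fp += len(pred - ref)   # predicted but not in the target
            fn += len(ref - pred)   # in the target but missed
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        return (2 * precision * recall / (precision + recall)
                if (precision + recall) else 0.0)

    # Example: one response with one correct and one wrong entity.
    print(entity_f1([{"gas_station", "2_miles"}], [{"gas_station", "3_miles"}]))  # 0.5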

When we pit all the models against each other, we see, first off, that the rule-based model doesn't have a particularly high BLEU, which I wouldn't read into too much; that can simply be explained by the fact that maybe we didn't write as many diverse templates for natural language generation. But its entity F1 is decent, in the sense that we did build the rules in a way that would be pretty accurate at picking out entities and accommodating search queries.

The copy network, meanwhile, did have a pretty decent BLEU score, which can of course be attributed to the fact that these seq2seq models are known to be good at language modeling, but its entity F1 is pretty bad comparatively, and this is, I guess, a function of the fact that the copy network doesn't really make use of the KB directly, instead relying totally on dialogue context to generate entities.

And then the key-value retrieval network outperforms these on the various metrics; it performed pretty well on BLEU as well as entity F1. But we also show human performance on this, and show that there's still naturally a gap to be filled. So while this is encouraging, I'm not reading too much into it, and it by no means suggests that one model is superior to the other, but it is there as a coarse-grained evaluation.

We also provided a human evaluation, where we essentially generated about a hundred and twenty distinct scenarios across the three different domains that we had, ones that had never before been seen in training or test. We then paired the different model classes with AMT workers in real time, had them conduct the dialogues, and then had the workers assess the quality of each dialogue based on fluency, cooperativeness, and human-likeness, on a one-to-five scale.

This kind of human evaluation tends to be a little more consistent and a little more seriously regarded, and again the key-value retrieval network actually outperforms the various competitors, especially getting good gains over the copy network, which is of course encouraging. Here again we also have human performance, which, as a sort of sanity check, does provide an upper bound: there's still a really large margin between even our best-performing system and human performance, so there's still a gap to be filled there.

Just as an example of a dialogue from one of these scenarios: we have here a sort of truncated knowledge base, in this case in a point-of-interest navigation setting, and we have the driver asking for the gas station with the shortest route from where they are. The car answers appropriately; the driver then follows up about the nearest gas station, and the car answers again, answering appropriately with respect to the knowledge base it's given. So it's nice to see that there is a reference to the knowledge base and that it's handling the lookups well.

Some conclusions and final thoughts.

The main contributions of the work were, namely, that we have this new class of seq2seq-style models that is able to perform a lookup over the knowledge base in a way that is fairly effective, and it does this without any slot or belief state tracking, which is a nice benefit; and it does outperform several baselines on a number of different metrics. In the process, we also created a new dataset of roughly three thousand dialogues in an entirely new domain, the in-car virtual assistant domain.

As for future directions, I think one of the main ones is scaling up the knowledge bases. Right now we're not exactly at the scale of knowledge bases that people would be seeing in real-world applications; if you think of somebody's typical Google Calendar, or anything of that nature, there is always a disparity in the size of these knowledge bases. So we'd like to move into the actually feasible realm of the types of things that people talk about, and the magnitude of the types of things people talk about.

We'd also like to move away from operating in this mostly static training regime and instead do more RL-based things, with which we could accommodate deviations from the typical dialogue templates that we may see.

And I guess even further down the line, it would be nice to see models that can actually incorporate more pragmatic reasoning into the kinds of inferences they're able to make, so that for a simple query like "Will I need to wear a jacket today?", the pragmatic reasoning would allow the system to say: hey, wearing a jacket is indicative of some sort of temperature, that's the kind of reasoning I'm going to have to do, and to bake that into the model as well.

So, that's my presentation. Thank you; I'd be happy to take questions.

(Audience question, inaudible)

(Speaker) I think that's a great question. For this particular iteration of the model, I think it is relatively dependent on the types of things being talked about, because the entire lookup operation depends on embeddings, and those embeddings have to have been trained on the appropriate types of data. So if you train on calendar scheduling with, you know, five hundred dialogues, and all of a sudden you're talking about, you know, ponies or something, it's going to be hard to have well-trained embeddings that will allow you to do that. So I think this is certainly a subject of future work, and I can think of some ways, you know, using pre-trained embeddings, that may let you circumvent the need to literally train from scratch again, and kind of bootstrap a little bit more to other kinds of things you'd expect to see. I think it's definitely something to be explored further.

(Audience member) Thank you for your presentation. I just wondered, during the experiments, the training process, and the testing as well: how do you deal with unseen situations? You know, if you show the system knowledge it has never seen, entities used in ways it hasn't seen, how is the model able to deal with that?

(Speaker) In what particular sense, exactly, are you asking?

(Audience member) What I mean exactly is: if there's something that is entirely different from what it has seen before, or maybe just some new KB values, though the task format has not changed.

(Speaker) I mean, I think in this case the model would have to be augmented a little bit more with some sort of copy mechanism over the KB. I guess in this case it is a little bit dependent on the kinds of things it has seen, and I think that, in general, more would have to be done. Right now it's able to do this lookup only having seen these entities in training as well. In general, I think it's something where we'd have to look at how this can be done in a way that is less dependent on the keys as they were seen in training, and right now completely new values would probably be a little difficult to handle. But it's something to look into.

(Audience member) On the last point that you had, you cite as a future direction adding structured knowledge, information over which the system can perform reasoning. How would you like to incorporate that kind of reasoning into the model?

(Speaker) Right, you mean allowing for these kinds of more complex styles of reasoning. I mean, that's a really good point, and I think the last direction especially is right now a little bit of a long shot, in the sense that even though it covers the kinds of queries that are common, it still more or less falls into one particular type of pattern, with slots you can fill, where the system can act on that. I think that right now, where the model would struggle the most would probably be with the kinds of things that deal with, you know, synonyms, various figures of speech, et cetera. And I don't have a super concrete answer for what handling that would look like, because the model is very much of this slot-filling flavor. But I think the interplay of chit-chat systems and this kind of more structured information is one that should definitely be explored more; I think, you know, earlier talks really touched on that a lot as well.

(Session chair) Thank you. Let's thank our speaker.