Good afternoon everybody, I'm Tiancheng Zhao from Carnegie Mellon University. Today we're going to present our work on learning an end-to-end dialogue system using reinforcement learning.


To get started, my talk is going to focus on task-driven dialog agents. Those are agents that have an explicit goal to achieve, such as providing information from a database given the user's preferences.

Traditionally, people use this kind of pipeline to build such a system. First we have some user input, and that input is annotated in some high-level annotation format by the NLU. That output is piped into a state tracker that accumulates the information over the turns into some kind of dialog state, which should also be able to provide sufficient information to formulate a query that interfaces with some structured database. Conditioned on this information, we have a dialogue policy that decides the next action to take and what to say back to the user.

So our project is going to focus on how to replace the three highlighted modules here by a single end-to-end model. But before getting into how we build such a model, let me talk about why we want to do this.

There are some limitations in the traditional pipeline. The first one is the credit assignment problem. When the system is brought into the real world, we get feedback from end users. Those users tell us whether it was a good dialogue or a bad dialogue, but it is not clear which module is responsible for the success or the failure behind those feedback signals. And errors can propagate between modules, so debugging can be even more challenging.

The second problem is the scalability of the dialog state representation. For example, we have the DSTC challenge, and in those cases we use the state tracker to estimate the values of a set of dialog state variables. Those variables are handcrafted, and designing those dialog state variables requires a lot of expert knowledge. A bad design usually handicaps the performance of the policy, because it simply does not provide sufficient information to make good decisions.

At the same time, it's also challenging to build an end-to-end task-oriented agent, so we're going to face some challenges.

The first challenge is that a task-oriented agent needs to learn some sort of strategic plan or policy to achieve the goal under uncertainty from the ASR, the NLU and the user, which is beyond what standard supervised learning can do. Also, a task-oriented agent needs to interface with some external structured knowledge source that only accepts symbolic queries, like SQL queries or API calls. There are some end-to-end chatbot models based on neural networks, but those neural models only have continuous intermediate representations, and it's not easy to get a symbolic query from them.

Addressing those challenges is the contribution of this project. We introduce a reinforcement-learning-based end-to-end dialogue system that addresses them from two perspectives. First, we show that we can jointly optimize state tracking, database interaction and the dialogue policy together. Second, we provide some evidence that deep recurrent nets can learn some sort of dialog state representation automatically.

So, broadly speaking, we first follow this intuition: we want to have a minimal symbolic dialog state. The dialog state is defined as all the sufficient information in a dialogue, and we note that a symbolic representation is only needed for the parts that are related to the database. So we only need to maintain symbolic values for those variables. The rest of the information, such as the discourse structure, the intent of the user, the goal of the user and what has just been said, we can model as a continuous vector. So that's the high-level intuition.

Then we propose this architecture. We still follow the typical POMDP approach: we have an agent and an environment. The agent can apply some action to the environment, and the environment responds with an observation and a reward. In our case the environment is comprised of two elements: the first element is the user and the second element is the database. The agent can apply two types of action. The first type is the verbal action, which is similar to the previous dialogue manager: the agent can choose an action and say something back to the user. The second type we call the hypothesis action. We maintain an external piece of memory to hold the values of the variables related to the database, and the agent can apply some sort of hypothesis action to modify this memory. The memory can then be parsed into a database query, which returns some matching entities and gives instant feedback.
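The interaction between hypothesis actions, the external memory and the database can be sketched roughly as follows. This is only an illustrative toy under stated assumptions: the database rows, the slot names and the helper functions `apply_hypothesis_action` and `query_database` are hypothetical and are not taken from the talk.

```python
# Hypothetical toy database; names and attribute values are illustrative only.
DATABASE = [
    {"name": "person_a", "from_america": "yes", "is_scientist": "no"},
    {"name": "person_b", "from_america": "no", "is_scientist": "yes"},
    {"name": "person_c", "from_america": "yes", "is_scientist": "yes"},
]


def apply_hypothesis_action(memory, slot, value):
    """Return a new memory with one slot set to 'yes', 'no' or 'unknown'."""
    updated = dict(memory)
    updated[slot] = value
    return updated


def query_database(memory, database=DATABASE):
    """Turn the symbolic memory into a filter and return the matching entities."""
    constraints = {slot: value for slot, value in memory.items() if value != "unknown"}
    return [
        row["name"]
        for row in database
        if all(row[slot] == value for slot, value in constraints.items())
    ]


# The memory starts with every slot unknown; a hypothesis action edits one slot.
memory = {"from_america": "unknown", "is_scientist": "unknown"}
memory = apply_hypothesis_action(memory, "from_america", "yes")
matches = query_database(memory)  # instant feedback from the database
```

Only the slots that the database needs are kept in symbolic form here; everything else about the dialogue would live in the continuous state of the agent.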


With this architecture, the entire process of dialog state tracking and dialogue policy is formulated as a single sequential decision-making process, so we can use reinforcement learning to learn this policy in an end-to-end fashion. Basically, we want to approximate the Q-value function, which is the expected future return, and we want to be able to learn this Q-value function.
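For reference, the standard textbook definition of the Q-value function as an expected future return is written out below. The discount factor $\gamma$ and the notation $s_t$, $a_t$, $r_t$ are assumptions of this sketch; the talk does not state them.

```latex
Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\left[\, \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \,\middle|\, s_t, a_t \right]
```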

Okay, so here is the implemented neural network structure. It has several layers; I will explain them from the bottom to the top. The bottom layer is the observation we get at every time step. It has three elements: the action that was taken at the last step, the observation from the user, and the observation from the database. Those pieces of information are all mapped into low-dimensional embeddings via a linear transformation, and the embeddings are passed into the recurrent net, which we hope can maintain the temporal information over time. We call its output the dialog state embedding. The output from the LSTM feeds into two decision networks, which are fully connected feed-forward neural networks: one of them is used to model the Q-value function for verbal actions, and the other one is used to model the Q-value function for hypothesis actions. You can unroll this network over time: every time you make a decision, the action is piped into the next step, we get a new observation from the environment, and it just keeps going.
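The data flow through the network can be sketched in plain Python as below. This is a deliberately simplified illustration, not the network from the talk: it assumes a plain tanh recurrent cell instead of an LSTM, single linear layers for the two Q heads, random untrained weights, and made-up dimensions. All names are hypothetical.

```python
import math
import random

random.seed(0)


def make_matrix(rows, cols):
    """Small random weight matrix stored as a list of rows."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Hypothetical sizes, chosen only for illustration.
ACTION_DIM, USER_DIM, DB_DIM = 5, 8, 3
EMBED_DIM, STATE_DIM = 4, 6
NUM_VERBAL, NUM_HYPOTHESIS = 5, 3

W_action = make_matrix(EMBED_DIM, ACTION_DIM)
W_user = make_matrix(EMBED_DIM, USER_DIM)
W_db = make_matrix(EMBED_DIM, DB_DIM)
W_in = make_matrix(STATE_DIM, 3 * EMBED_DIM)
W_rec = make_matrix(STATE_DIM, STATE_DIM)
W_q_verbal = make_matrix(NUM_VERBAL, STATE_DIM)
W_q_hypothesis = make_matrix(NUM_HYPOTHESIS, STATE_DIM)


def step(state, last_action, user_obs, db_obs):
    """One time step: embed the three observations, update the state, score actions."""
    # Linear embeddings of the three inputs, concatenated into one vector.
    embedded = matvec(W_action, last_action) + matvec(W_user, user_obs) + matvec(W_db, db_obs)
    # Simplified recurrent update (a tanh cell stands in for the LSTM).
    pre = [a + b for a, b in zip(matvec(W_in, embedded), matvec(W_rec, state))]
    new_state = [math.tanh(v) for v in pre]
    # Two separate heads: Q-values for verbal actions and for hypothesis actions.
    return new_state, matvec(W_q_verbal, new_state), matvec(W_q_hypothesis, new_state)


# Unroll over two turns, feeding the chosen verbal action back in as input.
state = [0.0] * STATE_DIM
last_action = [0.0] * ACTION_DIM
for user_obs in ([1.0] + [0.0] * (USER_DIM - 1), [0.0] * (USER_DIM - 1) + [1.0]):
    db_obs = [1.0, 0.0, 0.0]
    state, q_verbal, q_hypothesis = step(state, last_action, user_obs, db_obs)
    best_verbal = max(range(NUM_VERBAL), key=lambda a: q_verbal[a])
    last_action = [1.0 if i == best_verbal else 0.0 for i in range(ACTION_DIM)]
```

The point of the sketch is only the wiring: one shared recurrent state, two Q-value outputs, and the last action fed back as part of the next observation.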

Ideally, the proposed architecture can be trained end-to-end using only the dialog success signal, which only comes at the end of a session: was it a successful dialogue or a failed dialogue. But this kind of sparse reward usually results in very slow learning.

So we also thought about what we can do about that. Sometimes we have hypothesis oracle labels, like the ones we get from the DSTC. So how can we include those labels to speed up the learning? We describe a simple technique that results in a significant speed-up in terms of the convergence of the algorithm.

The trick is that we modify the reward model in this POMDP. We assume the correct hypothesis action follows a multinomial distribution, so there is only one single correct one at a time. So we can add a term to the reward, which is simply the probability that this hypothesis action is correct at that time.
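A minimal sketch of this reward modification is shown below. It assumes the oracle label is available as a probability per hypothesis action (one-hot when the label is certain); the function name, the action names and the example per-turn reward are hypothetical.

```python
def hybrid_reward(env_reward, hypothesis_probs, chosen_action):
    """Environment reward plus the probability that the chosen hypothesis action is correct."""
    return env_reward + hypothesis_probs[chosen_action]


# With an oracle label the distribution is one-hot: exactly one action is correct.
oracle = {"set_yes": 1.0, "set_no": 0.0, "set_unknown": 0.0}

correct_step = hybrid_reward(-1.0, oracle, "set_yes")  # label agrees with the action
wrong_step = hybrid_reward(-1.0, oracle, "set_no")     # label disagrees with the action
```

The extra term gives a dense signal at every turn in which a hypothesis action is taken, instead of waiting for the end-of-session outcome.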

The second trick comes from the fact that the environment is comprised of two parts, the user and the database. The user is difficult to model, but the database is just a program and its dynamics are known. So we can easily generate synthetic samples of this kind of state, action and next state by applying all possible hypothesis actions at a time step. We can add those generated samples to the experience table, and this experience table will be used to update the parameters at the next update. This is kind of similar to the Dyna-Q learning introduced by Sutton, but the difference is that they use a separate model to estimate the transition probability, whereas here we do not need such a model: we just generate samples from the database.
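The idea of generating extra experience from the database can be sketched as follows. This is an assumption-laden toy: the database rows, the three hypothesis values and the use of the match count as the database feedback are all hypothetical choices made for illustration.

```python
# Hypothetical toy database and hypothesis values, for illustration only.
DATABASE = [
    {"name": "person_a", "from_america": "yes"},
    {"name": "person_b", "from_america": "no"},
    {"name": "person_c", "from_america": "yes"},
]
HYPOTHESIS_VALUES = ["yes", "no", "unknown"]


def count_matches(memory, database):
    """Number of database rows consistent with the known slots in memory."""
    known = {slot: value for slot, value in memory.items() if value != "unknown"}
    return sum(all(row[s] == v for s, v in known.items()) for row in database)


def synthetic_transitions(memory, slot, database):
    """Apply every hypothesis action to one slot and record the resulting transitions."""
    transitions = []
    for value in HYPOTHESIS_VALUES:
        next_memory = dict(memory)
        next_memory[slot] = value
        feedback = count_matches(next_memory, database)
        transitions.append((dict(memory), (slot, value), next_memory, feedback))
    return transitions


# The database dynamics are known, so no user interaction is needed for these samples.
experience_table = []
experience_table.extend(synthetic_transitions({"from_america": "unknown"}, "from_america", DATABASE))
```

Each real turn can therefore contribute several transitions to the experience table, one per possible hypothesis action, even though the agent only executed one of them.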



Okay, so for training we use a state-of-the-art value-based deep reinforcement learning approach, which is prioritized double DQN. Prioritized experience replay allows the algorithm to focus on the samples that carry more important information. The second part is that we use double DQN, which reduces the bias in the Q-value function, namely the overestimation. The loss function is simply the expectation of the temporal difference error under the squared loss, and we minimize the loss function by stochastic gradient descent.
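The pieces named above can be sketched for a single transition as follows. The function names, the discount factor and the small constant added to the priority are assumptions; the priority shown is the commonly used proportional form, not necessarily the exact variant used in the talk.

```python
def double_dqn_target(reward, next_q_online, next_q_target, gamma, done):
    """Double DQN target: the online net picks the action, the target net scores it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


def squared_td_loss(q_values, action, target):
    """Squared temporal difference error for the action that was taken."""
    td_error = target - q_values[action]
    return td_error * td_error


def replay_priority(td_error, epsilon=1e-3):
    """Proportional priority: larger TD errors are replayed more often."""
    return abs(td_error) + epsilon


# One hypothetical transition with two actions.
target = double_dqn_target(
    reward=1.0,
    next_q_online=[0.2, 0.9],   # online network prefers action 1
    next_q_target=[0.5, 0.1],   # target network evaluates action 1
    gamma=0.5,
    done=False,
)
loss = squared_td_loss(q_values=[0.0, 1.0], action=1, target=target)
```

Separating action selection from action evaluation is what reduces the overestimation: a plain DQN target would have used the maximum of the target network's own values instead.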

Okay, so to test our proposed model we chose a conversational game called the twenty questions game. In this game, the agent guesses the famous person that the user is thinking of. It has two players, the user and the agent. The agent has access to a famous people database. It can also select from a list of yes/no questions to ask the user, and the user can answer the question with yes, no or I don't know, expressing that intent in any possible natural way. The agent can also make guesses, like "are you thinking about Bill Gates?" or "Alan Turing?". If it happens to guess the correct person, the game is considered a win for the agent; otherwise the agent loses.

To be able to do the experiments, we constructed a user simulator. For the simulator, we first constructed a famous people database: we selected a hundred people from Freebase, and each person is associated with about six attributes. We manually designed a few yes/no questions to ask about every attribute. You can see some examples here.

Second, we also want the user to reply with the three intents in many different possible ways. So we collected different natural ways of saying those three intents from the Switchboard Dialog Act corpus, and eventually we gathered a few hundred unique ways of expressing those intents from the SWDA corpus. We also maintain the frequency count of each expression, so we can sample from them: some expressions are used more often in replies, and rarer expressions are only sampled occasionally.
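Frequency-weighted sampling of reply expressions can be sketched like this. The phrases and counts below are invented placeholders, not data from the Switchboard Dialog Act corpus, and `sample_reply` is a hypothetical helper name.

```python
import random

# Hypothetical expressions with made-up frequency counts per intent.
EXPRESSIONS = {
    "yes": {"yes": 50, "yeah": 30, "uh-huh": 5},
    "no": {"no": 40, "nope": 8, "not really": 2},
    "unknown": {"i don't know": 20, "i'm not sure": 6},
}


def sample_reply(intent, rng):
    """Sample one surface form for an intent, proportionally to its count."""
    phrases = list(EXPRESSIONS[intent])
    weights = [EXPRESSIONS[intent][phrase] for phrase in phrases]
    return rng.choices(phrases, weights=weights, k=1)[0]


rng = random.Random(0)
replies = [sample_reply("yes", rng) for _ in range(1000)]
```

With weights like these, frequent expressions dominate the simulated replies while rare ones still appear occasionally.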

Here is the final piece of configuration for the POMDP. The game is terminated if one of four conditions is fulfilled: either the agent guesses the correct person, or the game takes too long, or the agent makes too many wrong guesses, or it ends up with a hypothesis that is not consistent with anyone in the people database. Only if the agent guesses the correct person do we consider that a success; otherwise it is a failure. If the agent wins it gets thirty points, otherwise negative thirty. It can make at most ten wrong guesses, and every wrong guess induces a negative five penalty, so it tries to make more careful guesses.
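The termination conditions and rewards can be sketched as a single function. The values of thirty, negative thirty and negative five follow the talk; the limit of ten wrong guesses is our reading of a garbled passage, and the turn limit, the function name and the choice to add the wrong-guess penalty on top of the loss reward are assumptions.

```python
WIN_REWARD = 30
LOSS_REWARD = -30
WRONG_GUESS_PENALTY = -5
MAX_WRONG_GUESSES = 10   # assumed reading of the talk
MAX_TURNS = 100          # hypothetical turn limit; the talk only says "too long"


def step_outcome(guessed_correctly, just_guessed_wrong, wrong_guesses, turns, num_matches):
    """Return (reward, done, success) for the current turn."""
    if guessed_correctly:
        return WIN_REWARD, True, True
    reward = WRONG_GUESS_PENALTY if just_guessed_wrong else 0
    too_long = turns >= MAX_TURNS
    too_many_wrong = wrong_guesses >= MAX_WRONG_GUESSES
    inconsistent = num_matches == 0  # hypothesis matches nobody in the database
    if too_long or too_many_wrong or inconsistent:
        return reward + LOSS_REWARD, True, False
    return reward, False, False


win = step_outcome(True, False, wrong_guesses=2, turns=12, num_matches=1)
wrong_guess = step_outcome(False, True, wrong_guesses=3, turns=13, num_matches=4)
dead_end = step_outcome(False, False, wrong_guesses=0, turns=5, num_matches=0)
```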

Okay, so we have described our model and the environment, so let's look at the analysis. We analyze the model from three different perspectives: dialogue policy, state tracking and dialog state representation.

The first analysis is the dialogue policy analysis. We compare three models. The first one is the baseline: the only difference is that we train the state tracker and the dialog policy separately, without joint training, so they don't know about the errors coming from each other. The second one is the proposed model that only uses the end-of-session reward, which is just success or failure. The last one uses the hypothesis labels that we talked about, with a different reward function.

The table shows the results. We can see that both proposed models outperform the baseline by a large margin, and the hybrid approach performs even better than RL.

To have a deeper understanding of what's happening, we also plot the learning curves during training. The horizontal axis is the number of parameter updates and the vertical axis is the success rate. The green line is RL, the red line is the hybrid approach and the purple line is the baseline.

You can see they have quite distinct behaviour. For the baseline model, because the state tracker is simply trained by supervised learning, it converges much faster, but its performance saturates pretty quickly, because the policy and the state tracker are trained separately and don't have information from each other. The RL approach takes a very long time to reach good performance, but eventually it gets there. The hybrid approach kind of benefits from both sides: it learns relatively fast in the beginning and then converges to the best performance.

For the second analysis, we did a state tracking analysis. To do this, we took the best trained model for each of the baseline, RL and hybrid RL, and we collected ten thousand samples from each. We report the precision and recall for each one, and the surprising thing is that the baseline's scores are actually much worse than those of the proposed approach.

So what happened? We looked at some example dialogues. In the figure, the left one is the baseline agent and the right one is the proposed model. The agent asks whether this person is from America, and the user's reply is kind of difficult for the model to classify. The baseline is just doing classification and doesn't take the future into account, so it has to make a hard decision right now, and it chooses yes, which is wrong. In the second case, the proposed model finds that the reply is kind of ambiguous, so it leaves the value as unknown for now and asks another question. This time the user says no, which is much simpler to classify, so now it updates the value to no.

So the main difference is that the baseline doesn't take the future into account, so it has to make a hard decision every time. The proposed model, because it is trained with reinforcement learning, models the future, so it can do long-term planning.

Lastly, we did some dialog state representation analysis. We want to see how much the LSTM hidden layer knows about the dialog state. We have two tasks. In the first task, we want to see if we can reconstruct some important variables from the dialog state embedding. What we did is take the models trained for twenty k, fifty k and a hundred k steps, and train a simple linear regression to predict the number of guesses the agent has made. We took eighty percent for training and twenty percent for testing, and the table shows the R-squared on the testing set. Clearly, for models with more training it is easier to reconstruct this state variable from the dialog state embedding. So we can kind of confirm the hypothesis that the model is implicitly trying to encode this information in its hidden layer, although we never explicitly asked it to do so.

The second task is a retrieval task. Because we built the user simulator, we actually know the true internal state of the simulator. So we have many pairs of a dialog state embedding and the corresponding simulator internal state. The hypothesis is that if the dialog state embedding is really learning the true internal state of the simulator, those two spaces must be very similar. So we can do a simple nearest-neighbor retrieval based on cosine distance in the embedding space, and then compare the similarity of the retrieved true states. If the hypothesis holds, the retrieved true states should also be very similar to each other, because their embeddings are very close in the embedding space.

So we did this experiment. The horizontal axis here is basically the variable index in the simulator, and the vertical axis is the probability that the five retrieved nearest neighbors are different from each other.
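One possible way to compute such a measure is sketched below. It assumes that "different from each other" means the retrieved neighbors do not all share the same value of a given simulator variable; that reading, the function names and the toy data (with two neighbors instead of five) are assumptions.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def nearest_neighbors(query, embeddings, k):
    """Indices of the k embeddings most similar to embeddings[query], excluding itself."""
    scored = [
        (cosine_similarity(embeddings[query], emb), i)
        for i, emb in enumerate(embeddings)
        if i != query
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


def disagreement_rate(embeddings, true_states, variable, k=5):
    """Fraction of queries whose k retrieved neighbors disagree on one state variable."""
    disagreements = 0
    for query in range(len(embeddings)):
        neighbors = nearest_neighbors(query, embeddings, k)
        values = {true_states[i][variable] for i in neighbors}
        if len(values) > 1:
            disagreements += 1
    return disagreements / len(embeddings)


# Toy data: two tight clusters whose true states agree within each cluster.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
true_states = [{"v": "yes"}] * 3 + [{"v": "no"}] * 3
rate = disagreement_rate(embeddings, true_states, "v", k=2)
```

A lower rate means that points that are close in the embedding space also tend to share the same true simulator state for that variable.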


We also compare the performance of the twenty k, fifty k and a hundred k models. Again we confirm that for models with more training, the probability of being different keeps decreasing. That kind of means that the dialog state embedding becomes more and more correlated with the internal state of the simulator, so it's actually learning the internal dynamics of this particular environment.

In conclusion, we first show that it is possible to jointly learn dialog state tracking and dialogue policy together, and that this also outperforms the baseline modular approach. Second, we show that recurrent neural networks are able to learn a continuous vector representation of the dialog state that is also task-driven, so it only retains the important information that is useful for making decisions to achieve the goal, and leaves the other irrelevant information alone. Third, we show that a purely reinforcement learning approach with a very sparse reward still suffers from slow convergence in the deep reinforcement learning setting, and how to improve this kind of learning speed is left to our future work. That's all, thank you.

Okay, thank you. So we have time for questions.

So the user observation that is input to the network, I mean, how do you encode the input utterance? Do you just send the tokens, or do you extract anything?


So the question is how we encode the user observations. What we do is actually simple: we just take a bag-of-bigrams vector of the user utterance. It's just a vector, and that vector is linearly transformed into a user utterance embedding and then concatenated with the other vectors from the database and the last action.
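A bag-of-bigrams encoding can be sketched as follows. The whitespace tokenization, the sentence boundary markers and the way the vocabulary is built are assumptions of this sketch; the answer above only states that a bag-of-bigrams vector is used.

```python
def bigrams(utterance):
    """Word bigrams of an utterance, with hypothetical boundary markers."""
    tokens = ["<s>"] + utterance.lower().split() + ["</s>"]
    return list(zip(tokens, tokens[1:]))


def build_vocab(utterances):
    """Map every bigram seen in the given utterances to a vector index."""
    vocab = {}
    for utterance in utterances:
        for bigram in bigrams(utterance):
            vocab.setdefault(bigram, len(vocab))
    return vocab


def encode(utterance, vocab):
    """Count vector over the bigram vocabulary; unseen bigrams are ignored."""
    vector = [0] * len(vocab)
    for bigram in bigrams(utterance):
        if bigram in vocab:
            vector[vocab[bigram]] += 1
    return vector


vocab = build_vocab(["yes", "i don't know"])
yes_vector = encode("yes", vocab)
unknown_vector = encode("I don't know", vocab)
```

The resulting count vector is what would then be passed through the linear transformation to obtain the user utterance embedding.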

Go ahead.


The first question is whether we tried a simple decision tree. I tried it with the simplest setting, where the user only answers with yes, no or don't know and there is no ambiguity; there, just a basic decision tree with the labels works pretty well. But I didn't try it at a large scale, because the point is not to compare this model with a decision tree. In this simple setting a decision tree would work well, but in a more complex setting the advantage of this kind of model will be more obvious.

And the second one is the question about the baseline. The baseline here has a state tracker that is trained as a three-way classifier, and it has another model trained separately just to select a verbal action every time. The problem is that they don't know about each other, and they make mistakes because they are not aware of the distribution difference between training and testing.

Here, did you try having soft outputs from the state tracker to the policy, meaning the confidence scores, non-zero probability distributions?

No, we use hard decisions.

It's a really nice paper, really interesting, but I guess I'm being slightly sceptical about whether this actually scales. You've actually got three possible user intents: yes, no and don't know. There is some uncertainty because you're allowing different ways of saying yes and no, but essentially there is no noise. So I wonder two things. First, why you haven't tried it with real users, so you'd have the real kind of noise you get from an ASR system. And second, have you done anything which suggests this would scale to a system of the question-answering type, or a general personal agent, where the user is expressing some really rather rich intents, which then need to be encoded in the state that you're tracking, rather than this very simple one? At least from the user side, the state, as far as I can see, is basically a three-state tracker.

So this model is very preliminary at this point. Okay, so if I go back to the proposed architecture: how we define this hypothesis action really determines the scalability of this approach. At this point, because we only have three intents, we can have just three actions that change the value of the hypothesis. It's true that if there are many attributes involved in the system, so that we need to track a complex dialog state that has to be symbolic, then the proposed design may not be an obvious fit. I think this proposed architecture would still work, but I don't yet have a way to design the hypothesis actions and to maintain this external memory that holds the important variables for the interface to the database. I think that will be part of future research.

Okay, so I think we're out of time, so let's thank the speaker again.