Good afternoon, everybody. I'm from Carnegie Mellon University, and we are going to present our work on building an end-to-end dialogue system using reinforcement learning.

To get started, my talk is going to focus on task-driven dialog agents. Those are agents that have an explicit goal to achieve, such as providing information from a database given the user's preferences.
Traditionally, people use this kind of pipeline to build such a system. First we have some user input, and that input is mapped by an NLU module into some semantic annotation format. That output is then piped into a state tracker that accumulates the information over turns into some kind of dialog state, which provides sufficient information for the system to formulate a query and interface with some structured external database. Conditioned on this information, a dialogue policy decides the next action to take and says something back to the user.
Our project is going to focus on how to replace the three highlighted modules here with a single end-to-end model. Before getting into how we build such a model, let me talk about why we want to do this.
There are some limitations of the traditional pipeline. The first one is the credit assignment problem. When the system is deployed in the real world and we get feedback from end users, those users only tell us whether it was a good dialogue or a bad dialogue, but it is not clear from that mixed feedback signal which module is responsible for the success or the failure. Errors can also propagate between modules, so debugging can be even more challenging.
The second problem is the scalability of the dialog state representation. For example, in the DSTC challenges we use the NLU output to estimate the values of a set of dialog state variables, and those variables are handcrafted. Designing this kind of state representation requires a lot of expert knowledge, and the design usually handicaps the performance of the policy, because it may simply not provide sufficient information to make good decisions.
At the same time, it is also challenging to build an end-to-end task-oriented agent, and we face several challenges. The first challenge is that a task-oriented agent needs to learn some sort of strategic plan, or policy, to achieve the goal under uncertainty from the ASR, the NLU, and the user, which is beyond what plain supervised learning can do. The second challenge is that a task-oriented agent has to interface with some external structured knowledge source that only accepts symbolic queries, like an SQL query or an API call, while deep models only have continuous intermediate representations, and it is not easy to get a symbolic query out of them.
To address these challenges, the contribution of this project is a reinforcement-learning-based end-to-end dialogue system that improves on prior work from two perspectives. The first is that we show we can jointly optimize the state tracking, the database interaction, and the dialogue policy together. The second is that we provide some evidence that deep recurrent networks can learn a dialog state representation automatically.
So let's get into the model. First of all, we follow this intuition: we want to have a minimal symbolic dialog state. The dialog state is defined as all of the sufficient information in a dialogue, and we note that a symbolic representation is only needed for the parts that are related to the database. The agent only needs to maintain explicit values for those database-related variables; the rest of the information, such as the discourse structure, the intent of the user, the goal of the user, and what has just been said, can be modeled in a continuous vector. That is the high-level intuition.
Then we propose this architecture. We still follow the typical POMDP setup: there is an agent and an environment, the agent applies some action to the environment, and the environment responds with observations and a reward. In our case the environment comprises two elements: the first element is the user, and the second element is the database. The agent can apply two types of actions. The first type is the verbal action, which is similar to the usual dialogue manager setting: the agent chooses an action and says something back to the user. The second type of action we call the hypothesis action: we maintain an external piece of memory that holds the values of the variables related to the database, and the agent can apply hypothesis actions to modify this memory. The memory can then be parsed into a database query, which returns some matching entities and gives instant feedback.
With this architecture, the entire process of dialog state tracking and dialogue policy is formulated as a single sequential decision-making process, so we can use reinforcement learning to learn this policy in an end-to-end fashion. Basically, we want to approximate a Q-value function, which is the expected future return, and learn this Q-value function.
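(For reference, the quantity being approximated is the standard action-value function; in the usual notation, with discount factor gamma and per-step reward r, it is:)

```latex
Q^{\pi}(s, a) \;=\; \mathbb{E}_{\pi}\!\left[\,\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \;\middle|\; s_t = s,\; a_t = a\right]
```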
Okay, so here is the neural network structure we implemented. It has several layers, and I will explain them from the bottom to the top. The bottom layer is the observation we get at every turn, which has three elements: the action the agent took at the last step, the observation from the user, and the observation from the database. Those are mapped into low-dimensional embeddings with a linear transformation, and the embeddings are passed into a recurrent network, which we hope can maintain the temporal information over time; we call its hidden state the dialog state. The output of the recurrent network then feeds into two decision networks, which are fully connected feed-forward neural networks: one of them models the Q-value function for the verbal actions, and the other one models the Q-value function for the hypothesis actions. The network unrolls over time, so every time the agent takes an action, that action is piped into the next step together with the new observation from the environment, and the process keeps going.
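(A rough sketch of such a network; the layer sizes, names, and the choice of PyTorch are assumptions for illustration rather than the exact implementation described in the talk:)

```python
import torch
import torch.nn as nn

class DialogQNetwork(nn.Module):
    """Embeds the previous action, the user observation, and the database
    observation, feeds them through an LSTM that carries the continuous
    dialog state, and outputs two sets of Q-values: one for verbal actions
    and one for hypothesis actions."""

    def __init__(self, n_verbal, n_hypothesis, user_dim, db_dim,
                 embed_dim=64, hidden_dim=128):
        super().__init__()
        self.act_embed = nn.Embedding(n_verbal + n_hypothesis, embed_dim)
        self.user_proj = nn.Linear(user_dim, embed_dim)   # user observation -> embedding
        self.db_proj = nn.Linear(db_dim, embed_dim)       # database observation -> embedding
        self.lstm = nn.LSTM(3 * embed_dim, hidden_dim, batch_first=True)
        self.q_verbal = nn.Linear(hidden_dim, n_verbal)          # Q-values for verbal actions
        self.q_hypothesis = nn.Linear(hidden_dim, n_hypothesis)  # Q-values for hypothesis actions

    def forward(self, prev_action, user_obs, db_obs, state=None):
        # prev_action: (batch, seq) int ids; user_obs/db_obs: (batch, seq, dim) floats
        x = torch.cat([self.act_embed(prev_action),
                       torch.relu(self.user_proj(user_obs)),
                       torch.relu(self.db_proj(db_obs))], dim=-1)
        h, state = self.lstm(x, state)   # h is the continuous dialog state over time
        return self.q_verbal(h), self.q_hypothesis(h), state
```

The key design point is that both Q-value heads share the same recurrent dialog state, which is what lets state tracking and the dialogue policy be optimized jointly.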
Ideally, the proposed architecture could be trained using only the dialog success signal, which only comes at the end of a session and says whether the dialogue succeeded or failed. But this kind of sparse reward typically results in very slow learning. So we also want to exploit the fact that sometimes we have oracle labels for the hypotheses, like the ones we get from the DSTC data. How can we include those labels to speed up the learning? We describe two simple tricks, and they result in a significant speed-up in terms of the convergence of the algorithm.
The first trick is that we modify the reward model of this POMDP. We assume the correct hypothesis action follows a multinomial distribution, so there is a single correct answer at each turn, and we can add an extra term to the reward, which is simply the probability that the chosen hypothesis action is correct at that turn.
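(Written compactly, with h_t the chosen hypothesis action and h*_t the oracle label at turn t, the shaped reward is the following; any scaling of the extra term would be an assumption, the talk only says the probability itself is added:)

```latex
r'_t \;=\; r_t \;+\; P\bigl(h_t = h^{*}_t\bigr)
```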
The second trick is that, because the environment comprises two parts, the user and the database, we note that the user is difficult to model, but the database is just a program whose dynamics are known. So we can easily generate extra samples of state, action, and next state by applying all possible hypothesis actions at a given turn, and we can add those generated samples into the experience table, which is then used to update the parameters at the next update. This is similar to the Dyna-Q learning introduced by Sutton; the difference is that Dyna-Q uses a separately learned model to estimate the transition probability, whereas here we do not need to learn a model, we just generate samples from the database, which has known dynamics.
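(A toy sketch of this Dyna-style sample generation; the data structures, names, and the particular shaped reward used here are illustrative, not taken from the actual code:)

```python
from dataclasses import dataclass, field

# Toy stand-ins for the real components.
DATABASE = [
    {"name": "Person A", "gender": "male", "field": "politics"},
    {"name": "Person B", "gender": "female", "field": "science"},
]
HYPOTHESIS_ACTIONS = [("gender", "male"), ("gender", "female"),
                      ("field", "politics"), ("field", "science")]

@dataclass
class ReplayBuffer:
    samples: list = field(default_factory=list)
    def add(self, s, a, r, s_next):
        self.samples.append((s, a, r, s_next))

def db_matches(memory):
    """Deterministic database 'dynamics': entities consistent with the hypothesis memory."""
    return [e for e in DATABASE
            if all(e.get(attr) == val for attr, val in memory.items())]

def generate_db_transitions(state_vec, memory, oracle_probs, buffer):
    """Dyna-Q-style trick: because the database side of the environment is known,
    roll out every possible hypothesis action and add the imagined transitions
    to the experience table used for the next parameter update."""
    for action in HYPOTHESIS_ACTIONS:
        attr, val = action
        new_memory = dict(memory, **{attr: val})   # apply the hypothesis action to a copy
        next_obs = len(db_matches(new_memory))     # e.g. number of matching entities
        reward = oracle_probs.get(action, 0.0)     # shaped reward: P(action is correct)
        buffer.add(state_vec, action, reward, next_obs)

# usage sketch
buf = ReplayBuffer()
generate_db_transitions(state_vec=[0.1, -0.3], memory={},
                        oracle_probs={("gender", "male"): 0.9}, buffer=buf)
```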
Okay, for training we use state-of-the-art value-based deep reinforcement learning, namely prioritized Double DQN. The prioritized experience replay allows the algorithm to focus on samples that carry more important information, which speeds up learning, and Double DQN reduces the overestimation bias in the Q-value estimates. The loss function is simply the square of the temporal-difference error, and we minimize this loss function by stochastic gradient descent.
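(In standard notation, with online parameters theta and target-network parameters theta-minus, the Double DQN target and the squared TD loss are:)

```latex
y_t \;=\; r_t + \gamma \, Q_{\theta^{-}}\!\bigl(s_{t+1},\, \arg\max_{a'} Q_{\theta}(s_{t+1}, a')\bigr),
\qquad
L(\theta) \;=\; \mathbb{E}\bigl[\bigl(y_t - Q_{\theta}(s_t, a_t)\bigr)^{2}\bigr]
```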
Okay, so to test our proposed model we chose a conversational game called the twenty questions game. In this game the agent guesses a famous person that the user is thinking of. There are two players, the user and the agent. The agent has access to a famous-people database, and it can select questions from a list of yes/no questions to ask the user. The user then answers each question with yes, no, or I don't know, in any possible natural way. The agent can also make guesses in natural language, like asking whether the person is Bill Gates, and if it happens to guess the correct person, the game is considered a win for the agent; otherwise the agent loses.
To be able to run experiments, we built a user simulator. For the simulator we first constructed the famous-people database: we selected a hundred people from Freebase, each person is associated with about six attributes, and we manually designed a few yes/no questions to ask about every attribute; you can see some examples here, such as whether the person is from a particular country or was born before a certain year. Secondly, we also want the simulated user to reply with its intents in different possible ways, so we collected the different natural ways of saying the three intents, yes, no, and I don't know, from the Switchboard Dialog Act corpus. Eventually we gathered a few hundred unique expressions of those intents from that corpus, and we also maintain the frequency count of each expression so we can sample from that distribution: common expressions are replied more often, and rarer expressions appear only occasionally.
Also, here is the final piece of configuration for the POMDP. The game is terminated only if one of four conditions holds: the agent guesses the correct person, the dialogue has taken too long, the agent has made too many wrong guesses, or the agent outputs a hypothesis that is not consistent with any person in the database. Only if the agent guesses the correct person do we consider the dialogue a success; otherwise it is a failure. If the agent wins it gets thirty points, otherwise negative thirty points. It can make at most five wrong guesses, and every wrong guess induces an additional penalty, so it has an incentive to make more careful guesses.
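(Summarized as a small configuration, with names of my own choosing and values that were not stated explicitly left as None:)

```python
# Illustrative summary of the game configuration described above.
GAME_CONFIG = {
    "reward_win": +30.0,           # agent guesses the correct person
    "reward_lose": -30.0,          # any other termination counts as failure
    "max_wrong_guesses": 5,        # the game ends after too many wrong guesses
    "penalty_wrong_guess": None,   # each wrong guess is penalized (exact value not stated here)
    "max_turns": None,             # the game also ends if the dialogue runs too long
}
```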
Okay, so we have described our model and how we train it; now let's move on to the analysis. We analyze the model from three different perspectives: the dialogue policy, the state tracking quality, and the dialog state representation.
The first analysis is the dialogue policy analysis. We compare three models. The first one is the baseline; the only difference is that we train the state tracking and the dialogue policy separately, without joint training, so neither knows about the errors coming from the other. The second one is the proposed model that only uses the end-of-session reward, that is, success or failure. And the last one is the hybrid model that also uses the hypothesis labels we talked about, with the modified reward function. The table shows the results: both proposed models outperform the baseline by a large margin, and the hybrid approach performs even better than pure RL.
To get a deeper understanding of what is happening, we also plot the learning curves during training. The horizontal axis is the number of parameter updates and the vertical axis is the success rate; the green line is RL, the red line is the hybrid approach, and the purple line is the baseline. You can see they have quite distinct behaviour. For the baseline model, because the state tracker is simply trained with supervised learning, it converges much faster, but its performance also plateaus quite quickly, because the state tracker and the policy are trained without knowing about each other. Pure RL takes a very long time to reach good performance, but it does get there eventually. The hybrid approach benefits from both sides: it learns relatively fast in the beginning and then converges, probably to the best policy of the three.
The second analysis is the state tracking analysis. To do this, we deployed the best model for both the baseline and the hybrid RL, collected about ten thousand samples from each, and we report the precision and recall of the hypothesis tracking for each one. Interestingly, the baseline's scores are actually higher than the proposed approach's. So what happened? We looked at some example dialogues, and here is a typical one: the left dialogue is with the baseline agent and the right one is with the proposed model. The agent asks whether the person is from America, and the user's utterance is something like "Hmm, I don't think so", which is kind of difficult for the model to classify. The baseline is just a classifier that does not take the future into account, so it has to make a hard decision right now; it chooses yes, which is wrong. In the second case, the proposed model decides the input is ambiguous, keeps the hypothesis at "I don't know" for now, and asks another question. This time the user simply says no, which is much easier to classify, so now the hypothesis is updated correctly. The main difference is that the baseline does not take the future into account, so it has to commit to a decision at every turn, while the proposed model, because it is trained with reinforcement learning, models the future and can do long-term planning.
Lastly, we do some dialog state representation analysis. We want to see how well the LSTM hidden layer is learning a dialog state representation, and we have two tasks. The first task is to see if we can reconstruct some important variables from the dialog state embedding. We took the models trained for twenty K, fifty K, and one hundred K steps, and we trained a simple regression on the dialog state embedding to predict the number of guesses the agent has made, using eighty percent of the data for training and twenty percent for testing; the table shows the prediction quality on the testing set. Clearly, for the models with more training it is easier to reconstruct this state variable from the dialog state embedding, so we can tentatively confirm the hypothesis that the model is implicitly trying to encode this information in its hidden layer, although we never explicitly asked it to do so.
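(A sketch of this kind of probing experiment, assuming the embeddings and the target counts have already been collected; the use of scikit-learn, linear regression, and R-squared as the metric are assumptions for illustration:)

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def probe_guess_count(embeddings, guess_counts, seed=0):
    """Train a simple regressor from dialog-state embeddings to the number of
    guesses made so far, and report R^2 on a held-out 20% split."""
    X_train, X_test, y_train, y_test = train_test_split(
        embeddings, guess_counts, test_size=0.2, random_state=seed)
    model = LinearRegression().fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))

# usage sketch with random placeholder data
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 128))       # dialog-state embeddings
counts = rng.integers(0, 5, size=1000)   # number of guesses made so far
print(probe_guess_count(emb, counts))
```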
The second task is a retrieval task. Because we built the simulator ourselves, we also know its true internal state, so we have many pairs of a dialog state embedding with the corresponding true simulator state. Our hypothesis is that if the dialog state embedding is really learning the true internal state of the simulator, the two spaces must be strongly correlated. So we do a simple nearest-neighbour search based on cosine distance in the embedding space, and then we compare the similarity of the retrieved true states: if two embeddings are very close in the embedding space, the retrieved true states should also be very similar to each other. We ran this experiment; the horizontal axis here is basically the state variable index in the simulator, and the vertical axis is the probability that the five retrieved nearest neighbours differ from each other on that variable. Again we compare the twenty K, fifty K, and one hundred K models, and again we find that for models with more training, the probability of disagreement keeps decreasing. That means the dialog state embedding gradually becomes more and more correlated with the internal state of the simulator, so it is actually learning the internal dynamics of this particular environment.
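(Here is a sketch of that nearest-neighbour check, again with hypothetical names and scikit-learn as an assumed tool:)

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighbour_disagreement(embeddings, true_states, k=5):
    """For each dialog-state embedding, find its k nearest neighbours by cosine
    distance and measure, per simulator state variable, the probability that a
    neighbour's true state differs from the query's true state."""
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(embeddings)
    _, idx = nn.kneighbors(embeddings)
    neighbours = idx[:, 1:]                                      # drop the query itself
    diffs = true_states[neighbours] != true_states[:, None, :]   # (n, k, n_vars)
    return diffs.mean(axis=(0, 1))                               # disagreement rate per variable

# usage sketch with placeholder data
rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 64))            # dialog-state embeddings
states = rng.integers(0, 3, size=(200, 6))  # e.g. six simulator state variables
print(neighbour_disagreement(emb, states))
```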
In conclusion, we first showed that it is possible to jointly learn dialog state tracking and the dialogue policy together and to outperform a baseline modular approach. Secondly, we showed that the recurrent neural network's hidden layer is able to learn a continuous vector representation of the dialog state that is also task-driven: it only learns the information that is useful for making the decisions needed to achieve the goal and leaves the irrelevant information alone. Finally, we also showed that a purely reinforcement-learning approach with a very sparse reward still suffers from slow convergence at the beginning of deep reinforcement learning, and how to improve this learning speed is left to our future work. Thank you.
Okay, thank you. So we have time for questions.
For the user observation that is the input to the network, how do you encode the input utterance? Do you just feed in the tokens, or do you extract any features?
So the question is how we encode the user observation. We do it very simply: we just take a bag-of-bigrams vector of the user observation. It is just a count vector, and that vector is linearly transformed into a user utterance embedding, which is then concatenated with the other vectors from the database and the previous action.
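(A minimal sketch of that bag-of-bigrams encoding; the vocabulary handling and names are illustrative, and the real system may also include unigram features:)

```python
from collections import Counter
import numpy as np

def bigram_vector(utterance, bigram_vocab):
    """Encode a user utterance as a bag-of-bigrams count vector over a fixed
    bigram vocabulary (bigrams outside the vocabulary are simply dropped)."""
    tokens = utterance.lower().split()
    bigrams = Counter(zip(tokens, tokens[1:]))
    vec = np.zeros(len(bigram_vocab), dtype=np.float32)
    for bg, count in bigrams.items():
        if bg in bigram_vocab:
            vec[bigram_vocab[bg]] = count
    return vec

# usage sketch
vocab = {("i", "don't"): 0, ("don't", "think"): 1, ("think", "so"): 2}
print(bigram_vector("I don't think so", vocab))   # -> [1. 1. 1.]
```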
[Audience question, inaudible.]
The first question is whether we tried a simple decision tree. I tried it in the simplest possible setting, where the user only answers with yes, no, or I don't know and there is no ambiguity, so it is just a classification with three labels, and a decision tree works pretty well there. But I didn't run it at a larger scale, because the point is that in this simple setting a decision tree works well, while in a more complex setting the advantage of an end-to-end model would be more obvious.
And the second question, sorry, what was it again?
Actually, somewhat. The baseline here is set up so that it has a state tracker trained as a three-way classifier, and another model, trained separately, just to select the verbal action at every turn. The problem is that the two components do not know about each other, so they make mistakes the other is not aware of, and there is a distribution difference between training and testing.
Related to this, did you try having soft outputs from the state tracker to the policy, meaning confidence scores or probability distributions instead of hard decisions?
No, we use hard decisions, because the other system also uses hard decisions.
It is a really nice paper, really interesting, but I guess I'm slightly sceptical about whether this actually scales. You've only got three possible user intents, yes, no, and I don't know, with no elaborations of them, and the only uncertainty comes from allowing different ways of saying yes, so essentially there is no noise. Two things, then. First, why haven't you tried it with real users, so that you would have the real kind of noise you get from an ASR system? And second, have you done anything which suggests this would scale to a system of the question-answering type, a general personal agent, where the user expresses some really rather rich intents that then need to be encoded in the state you're tracking, rather than what, at least from the user side, looks to me like basically a three-state tracker?
This model is very preliminary at this point. If I go back to the proposed architecture, the way we define the hypothesis actions does limit the scalability of the approach for now: because there are only three intents, we can get away with just three actions that change the value of the hypothesis. It is true that if many attributes are involved in the system, and we need to track a complex dialog state that has to be symbolic, then this particular design may not remain an obvious fit. Whether the proposed architecture still works there, how to design the action set, and how to maintain this external memory that holds the important variables interfacing with the database, I think that will be part of future research.
Okay, thank you. We are out of time, so let's thank the speaker again.