Good afternoon, everybody. I'm from Carnegie Mellon University, and we are going to present our work on building an end-to-end dialogue system using reinforcement learning.

To get started, my talk is going to focus on task-driven dialog agents. Those are agents that have an explicit goal to achieve, such as providing information from a database given the user's preferences.
Traditionally, people use this kind of pipeline to build such a system. First we have some user input, and that input is mapped by an NLU module into some semantic annotation format. That output is then piped into a state tracker that accumulates the information over turns into some kind of dialog state, which provides sufficient information for the system to formulate a query and interface with some structured external database. Conditioned on this information, a dialogue policy decides the next action to take and says something back to the user.
Our project is going to focus on how to replace the three highlighted modules here with a single end-to-end model. Before getting into how we build such a model, let me talk about why we want to do this.
There are some limitations of the traditional pipeline. The first one is the credit assignment problem. When the system is deployed in the real world and we get feedback from end users, those users only tell us whether it was a good dialogue or a bad dialogue, but it is not clear from that mixed feedback signal which module is responsible for the success or the failure. Errors can also propagate between modules, so debugging can be even more challenging.
The second problem is the scalability of the dialog state representation. For example, in the DSTC challenges we use the NLU output to estimate the values of a set of dialog state variables, and those variables are handcrafted. Designing this kind of state representation requires a lot of expert knowledge, and the design usually handicaps the performance of the policy, because it may simply not provide sufficient information to make good decisions.
At the same time, it is also challenging to build an end-to-end task-oriented agent, and we face several challenges. The first challenge is that a task-oriented agent needs to learn some sort of strategic plan, or policy, to achieve the goal under uncertainty from the ASR, the NLU, and the user, which is beyond what plain supervised learning can do. The second challenge is that a task-oriented agent has to interface with some external structured knowledge source that only accepts symbolic queries, like an SQL query or an API call, while deep models only have continuous intermediate representations, and it is not easy to get a symbolic query out of them.
To address these challenges, the contribution of this project is a reinforcement-learning-based end-to-end dialogue system that improves on prior work from two perspectives. The first is that we show we can jointly optimize the state tracking, the database interaction, and the dialogue policy together. The second is that we provide some evidence that deep recurrent networks can learn a dialog state representation automatically.
So let's get into the model. First of all, we follow this intuition: we want to have a minimal symbolic dialog state. The dialog state is defined as all of the sufficient information in a dialogue, and we note that a symbolic representation is only needed for the parts that are related to the database. The agent only needs to maintain explicit values for those database-related variables; the rest of the information, such as the discourse structure, the intent of the user, the goal of the user, and what has just been said, can be modeled in a continuous vector. That is the high-level intuition.
Then we propose this architecture. We still follow the typical POMDP setup: there is an agent and an environment, the agent applies some action to the environment, and the environment responds with observations and a reward. In our case the environment comprises two elements: the first element is the user, and the second element is the database. The agent can apply two types of actions. The first type is the verbal action, which is similar to the usual dialogue manager setting: the agent chooses an action and says something back to the user. The second type of action we call the hypothesis action: we maintain an external piece of memory that holds the values of the variables related to the database, and the agent can apply hypothesis actions to modify this memory. The memory can then be parsed into a database query, which returns some matching entities and gives instant feedback.
With this architecture, the entire process of dialog state tracking and dialogue policy is formulated as a single sequential decision-making process, so we can use reinforcement learning to learn this policy in an end-to-end fashion. Basically, we want to approximate a Q-value function, which is the expected future return, and learn this Q-value function.
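(For reference, the quantity being approximated is the standard action-value function; in the usual notation, with discount factor gamma and per-step reward r, it is:)

```latex
Q^{\pi}(s, a) \;=\; \mathbb{E}_{\pi}\!\left[\,\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \;\middle|\; s_t = s,\; a_t = a\right]
```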
Okay, so here is the neural network structure we implemented. It has several layers, and I will explain them from the bottom to the top. The bottom layer is the observation we get at every turn, which has three elements: the action the agent took at the last step, the observation from the user, and the observation from the database. Those are mapped into low-dimensional embeddings with a linear transformation, and the embeddings are passed into a recurrent network, which we hope can maintain the temporal information over time; we call its hidden state the dialog state. The output of the recurrent network then feeds into two decision networks, which are fully connected feed-forward neural networks: one of them models the Q-value function for the verbal actions, and the other one models the Q-value function for the hypothesis actions. The network unrolls over time, so every time the agent takes an action, that action is piped into the next step together with the new observation from the environment, and the process keeps going.
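(A rough sketch of such a network; the layer sizes, names, and the choice of PyTorch are assumptions for illustration rather than the exact implementation described in the talk:)

```python
import torch
import torch.nn as nn

class DialogQNetwork(nn.Module):
    """Embeds the previous action, the user observation, and the database
    observation, feeds them through an LSTM that carries the continuous
    dialog state, and outputs two sets of Q-values: one for verbal actions
    and one for hypothesis actions."""

    def __init__(self, n_verbal, n_hypothesis, user_dim, db_dim,
                 embed_dim=64, hidden_dim=128):
        super().__init__()
        self.act_embed = nn.Embedding(n_verbal + n_hypothesis, embed_dim)
        self.user_proj = nn.Linear(user_dim, embed_dim)   # user observation -> embedding
        self.db_proj = nn.Linear(db_dim, embed_dim)       # database observation -> embedding
        self.lstm = nn.LSTM(3 * embed_dim, hidden_dim, batch_first=True)
        self.q_verbal = nn.Linear(hidden_dim, n_verbal)          # Q-values for verbal actions
        self.q_hypothesis = nn.Linear(hidden_dim, n_hypothesis)  # Q-values for hypothesis actions

    def forward(self, prev_action, user_obs, db_obs, state=None):
        # prev_action: (batch, seq) int ids; user_obs/db_obs: (batch, seq, dim) floats
        x = torch.cat([self.act_embed(prev_action),
                       torch.relu(self.user_proj(user_obs)),
                       torch.relu(self.db_proj(db_obs))], dim=-1)
        h, state = self.lstm(x, state)   # h is the continuous dialog state over time
        return self.q_verbal(h), self.q_hypothesis(h), state
```

The key design point is that both Q-value heads share the same recurrent dialog state, which is what lets state tracking and the dialogue policy be optimized jointly.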
Ideally, the proposed architecture could be trained using only the dialog success signal, which only comes at the end of a session and says whether the dialogue succeeded or failed. But this kind of sparse reward typically results in very slow learning. So we also want to exploit the fact that sometimes we have oracle labels for the hypotheses, like the ones we get from the DSTC data. How can we include those labels to speed up the learning? We describe two simple tricks, and they result in a significant speed-up in terms of the convergence of the algorithm.
The first trick is that we modify the reward model of this POMDP. We assume the correct hypothesis action follows a multinomial distribution, so there is a single correct answer at each turn, and we can add an extra term to the reward, which is simply the probability that the chosen hypothesis action is correct at that turn.
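(Written compactly, with h_t the chosen hypothesis action and h*_t the oracle label at turn t, the shaped reward is the following; any scaling of the extra term would be an assumption, the talk only says the probability itself is added:)

```latex
r'_t \;=\; r_t \;+\; P\bigl(h_t = h^{*}_t\bigr)
```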
The second trick is that, because the environment comprises two parts, the user and the database, we note that the user is difficult to model, but the database is just a program whose dynamics are known. So we can easily generate extra samples of state, action, and next state by applying all possible hypothesis actions at a given turn, and we can add those generated samples into the experience table, which is then used to update the parameters at the next update. This is similar to the Dyna-Q learning introduced by Sutton; the difference is that Dyna-Q uses a separately learned model to estimate the transition probability, whereas here we do not need to learn a model, we just generate samples from the database, which has known dynamics.
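(A toy sketch of this Dyna-style sample generation; the data structures, names, and the particular shaped reward used here are illustrative, not taken from the actual code:)

```python
from dataclasses import dataclass, field

# Toy stand-ins for the real components.
DATABASE = [
    {"name": "Person A", "gender": "male", "field": "politics"},
    {"name": "Person B", "gender": "female", "field": "science"},
]
HYPOTHESIS_ACTIONS = [("gender", "male"), ("gender", "female"),
                      ("field", "politics"), ("field", "science")]

@dataclass
class ReplayBuffer:
    samples: list = field(default_factory=list)
    def add(self, s, a, r, s_next):
        self.samples.append((s, a, r, s_next))

def db_matches(memory):
    """Deterministic database 'dynamics': entities consistent with the hypothesis memory."""
    return [e for e in DATABASE
            if all(e.get(attr) == val for attr, val in memory.items())]

def generate_db_transitions(state_vec, memory, oracle_probs, buffer):
    """Dyna-Q-style trick: because the database side of the environment is known,
    roll out every possible hypothesis action and add the imagined transitions
    to the experience table used for the next parameter update."""
    for action in HYPOTHESIS_ACTIONS:
        attr, val = action
        new_memory = dict(memory, **{attr: val})   # apply the hypothesis action to a copy
        next_obs = len(db_matches(new_memory))     # e.g. number of matching entities
        reward = oracle_probs.get(action, 0.0)     # shaped reward: P(action is correct)
        buffer.add(state_vec, action, reward, next_obs)

# usage sketch
buf = ReplayBuffer()
generate_db_transitions(state_vec=[0.1, -0.3], memory={},
                        oracle_probs={("gender", "male"): 0.9}, buffer=buf)
```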
Okay, for training we use state-of-the-art value-based deep reinforcement learning, namely prioritized Double DQN. The prioritized experience replay allows the algorithm to focus on samples that carry more important information, which speeds up learning, and Double DQN reduces the overestimation bias in the Q-value estimates. The loss function is simply the square of the temporal-difference error, and we minimize this loss function by stochastic gradient descent.
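(In standard notation, with online parameters theta and target-network parameters theta-minus, the Double DQN target and the squared TD loss are:)

```latex
y_t \;=\; r_t + \gamma \, Q_{\theta^{-}}\!\bigl(s_{t+1},\, \arg\max_{a'} Q_{\theta}(s_{t+1}, a')\bigr),
\qquad
L(\theta) \;=\; \mathbb{E}\bigl[\bigl(y_t - Q_{\theta}(s_t, a_t)\bigr)^{2}\bigr]
```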
Okay, so to test our proposed model we chose a conversational game called the twenty questions game. In this game the agent guesses a famous person that the user is thinking of. There are two players, the user and the agent. The agent has access to a famous-people database, and it can select questions from a list of yes/no questions to ask the user. The user then answers each question with yes, no, or I don't know, in any possible natural way. The agent can also make guesses in natural language, like asking whether the person is Bill Gates, and if it happens to guess the correct person, the game is considered a win for the agent; otherwise the agent loses.
To be able to run experiments, we built a user simulator. For the simulator we first constructed the famous-people database: we selected a hundred people from Freebase, each person is associated with about six attributes, and we manually designed a few yes/no questions to ask about every attribute; you can see some examples here, such as whether the person is from a particular country or was born before a certain year. Secondly, we also want the simulated user to reply with its intents in different possible ways, so we collected the different natural ways of saying the three intents, yes, no, and I don't know, from the Switchboard Dialog Act corpus. Eventually we gathered a few hundred unique expressions of those intents from that corpus, and we also maintain the frequency count of each expression so we can sample from that distribution: common expressions are replied more often, and rarer expressions appear only occasionally.
Also, here is the final piece of configuration for the POMDP. The game is terminated only if one of four conditions holds: the agent guesses the correct person, the dialogue has taken too long, the agent has made too many wrong guesses, or the agent outputs a hypothesis that is not consistent with any person in the database. Only if the agent guesses the correct person do we consider the dialogue a success; otherwise it is a failure. If the agent wins it gets thirty points, otherwise negative thirty points. It can make at most five wrong guesses, and every wrong guess induces an additional penalty, so it has an incentive to make more careful guesses.
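(Summarized as a small configuration, with names of my own choosing and values that were not stated explicitly left as None:)

```python
# Illustrative summary of the game configuration described above.
GAME_CONFIG = {
    "reward_win": +30.0,           # agent guesses the correct person
    "reward_lose": -30.0,          # any other termination counts as failure
    "max_wrong_guesses": 5,        # the game ends after too many wrong guesses
    "penalty_wrong_guess": None,   # each wrong guess is penalized (exact value not stated here)
    "max_turns": None,             # the game also ends if the dialogue runs too long
}
```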
Okay, so we have described our model and how we train it; now let's move on to the analysis. We analyze the model from three different perspectives: the dialogue policy, the state tracking quality, and the dialog state representation.
The first analysis is the dialogue policy analysis. We compare three models. The first one is the baseline; the only difference is that we train the state tracking and the dialogue policy separately, without joint training, so neither knows about the errors coming from the other. The second one is the proposed model that only uses the end-of-session reward, that is, success or failure. And the last one is the hybrid model that also uses the hypothesis labels we talked about, with the modified reward function. The table shows the results: both proposed models outperform the baseline by a large margin, and the hybrid approach performs even better than pure RL.
To get a deeper understanding of what is happening, we also plot the learning curves during training. The horizontal axis is the number of parameter updates and the vertical axis is the success rate; the green line is RL, the red line is the hybrid approach, and the purple line is the baseline. You can see they have quite distinct behaviour. For the baseline model, because the state tracker is simply trained with supervised learning, it converges much faster, but its performance also plateaus quite quickly, because the state tracker and the policy are trained without knowing about each other. Pure RL takes a very long time to reach good performance, but it does get there eventually. The hybrid approach benefits from both sides: it learns relatively fast in the beginning and then converges, probably to the best policy of the three.
The second analysis is the state tracking analysis. To do this, we deployed the best model for both the baseline and the hybrid RL, collected about ten thousand samples from each, and we report the precision and recall of the hypothesis tracking for each one. Interestingly, the baseline's scores are actually higher than the proposed approach's. So what happened? We looked at some example dialogues, and here is a typical one: the left dialogue is with the baseline agent and the right one is with the proposed model. The agent asks whether the person is from America, and the user's utterance is something like "Hmm, I don't think so", which is kind of difficult for the model to classify. The baseline is just a classifier that does not take the future into account, so it has to make a hard decision right now; it chooses yes, which is wrong. In the second case, the proposed model decides the input is ambiguous, keeps the hypothesis at "I don't know" for now, and asks another question. This time the user simply says no, which is much easier to classify, so now the hypothesis is updated correctly. The main difference is that the baseline does not take the future into account, so it has to commit to a decision at every turn, while the proposed model, because it is trained with reinforcement learning, models the future and can do long-term planning.
Lastly, we do some dialog state representation analysis. We want to see how well the LSTM hidden layer is learning a dialog state representation, and we have two tasks. The first task is to see if we can reconstruct some important variables from the dialog state embedding. We took the models trained for twenty K, fifty K, and one hundred K steps, and we trained a simple regression on the dialog state embedding to predict the number of guesses the agent has made, using eighty percent of the data for training and twenty percent for testing; the table shows the prediction quality on the testing set. Clearly, for the models with more training it is easier to reconstruct this state variable from the dialog state embedding, so we can tentatively confirm the hypothesis that the model is implicitly trying to encode this information in its hidden layer, although we never explicitly asked it to do so.
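(A sketch of this kind of probing experiment, assuming the embeddings and the target counts have already been collected; the use of scikit-learn, linear regression, and R-squared as the metric are assumptions for illustration:)

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def probe_guess_count(embeddings, guess_counts, seed=0):
    """Train a simple regressor from dialog-state embeddings to the number of
    guesses made so far, and report R^2 on a held-out 20% split."""
    X_train, X_test, y_train, y_test = train_test_split(
        embeddings, guess_counts, test_size=0.2, random_state=seed)
    model = LinearRegression().fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))

# usage sketch with random placeholder data
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 128))       # dialog-state embeddings
counts = rng.integers(0, 5, size=1000)   # number of guesses made so far
print(probe_guess_count(emb, counts))
```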
The second task is a retrieval task. Because we built the simulator ourselves, we also know its true internal state, so we have many pairs of a dialog state embedding with the corresponding true simulator state. Our hypothesis is that if the dialog state embedding is really learning the true internal state of the simulator, the two spaces must be strongly correlated. So we do a simple nearest-neighbour search based on cosine distance in the embedding space, and then we compare the similarity of the retrieved true states: if two embeddings are very close in the embedding space, the retrieved true states should also be very similar to each other. We ran this experiment; the horizontal axis here is basically the state variable index in the simulator, and the vertical axis is the probability that the five retrieved nearest neighbours differ from each other on that variable. Again we compare the twenty K, fifty K, and one hundred K models, and again we find that for models with more training, the probability of disagreement keeps decreasing. That means the dialog state embedding gradually becomes more and more correlated with the internal state of the simulator, so it is actually learning the internal dynamics of this particular environment.
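(Here is a sketch of that nearest-neighbour check, again with hypothetical names and scikit-learn as an assumed tool:)

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighbour_disagreement(embeddings, true_states, k=5):
    """For each dialog-state embedding, find its k nearest neighbours by cosine
    distance and measure, per simulator state variable, the probability that a
    neighbour's true state differs from the query's true state."""
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(embeddings)
    _, idx = nn.kneighbors(embeddings)
    neighbours = idx[:, 1:]                                      # drop the query itself
    diffs = true_states[neighbours] != true_states[:, None, :]   # (n, k, n_vars)
    return diffs.mean(axis=(0, 1))                               # disagreement rate per variable

# usage sketch with placeholder data
rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 64))            # dialog-state embeddings
states = rng.integers(0, 3, size=(200, 6))  # e.g. six simulator state variables
print(neighbour_disagreement(emb, states))
```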
In conclusion, we first showed that it is possible to jointly learn dialog state tracking and the dialogue policy together and to outperform a baseline modular approach. Secondly, we showed that the recurrent neural network's hidden layer is able to learn a continuous vector representation of the dialog state that is also task-driven: it only learns the information that is useful for making the decisions needed to achieve the goal and leaves the irrelevant information alone. Finally, we also showed that a purely reinforcement-learning approach with a very sparse reward still suffers from slow convergence at the beginning of deep reinforcement learning, and how to improve this learning speed is left to our future work. Thank you.
Okay, thank you. So we have time for questions.
For the user observation that is the input to the network, how do you encode the input utterance? Do you just feed in the tokens, or do you extract any features?
So the question is how we encode the user observation. We do it very simply: we just take a bag-of-bigrams vector of the user observation. It is just a count vector, and that vector is linearly transformed into a user utterance embedding, which is then concatenated with the other vectors from the database and the previous action.
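(A minimal sketch of that bag-of-bigrams encoding; the vocabulary handling and names are illustrative, and the real system may also include unigram features:)

```python
from collections import Counter
import numpy as np

def bigram_vector(utterance, bigram_vocab):
    """Encode a user utterance as a bag-of-bigrams count vector over a fixed
    bigram vocabulary (bigrams outside the vocabulary are simply dropped)."""
    tokens = utterance.lower().split()
    bigrams = Counter(zip(tokens, tokens[1:]))
    vec = np.zeros(len(bigram_vocab), dtype=np.float32)
    for bg, count in bigrams.items():
        if bg in bigram_vocab:
            vec[bigram_vocab[bg]] = count
    return vec

# usage sketch
vocab = {("i", "don't"): 0, ("don't", "think"): 1, ("think", "so"): 2}
print(bigram_vector("I don't think so", vocab))   # -> [1. 1. 1.]
```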
[Audience question, inaudible.]
The first question is whether we tried a simple decision tree. I tried it in the simplest possible setting, where the user only answers with yes, no, or I don't know and there is no ambiguity, so it is just a classification with three labels, and a decision tree works pretty well there. But I didn't run it at a larger scale, because the point is that in this simple setting a decision tree works well, while in a more complex setting the advantage of an end-to-end model would be more obvious.
And the second question, sorry, what was it again?
Actually, somewhat. The baseline here is set up so that it has a state tracker trained as a three-way classifier, and another model, trained separately, just to select the verbal action at every turn. The problem is that the two components do not know about each other, so they make mistakes the other is not aware of, and there is a distribution difference between training and testing.
Related to this, did you try having soft outputs from the state tracker to the policy, meaning confidence scores or probability distributions instead of hard decisions?
No, we use hard decisions, because the other system also uses hard decisions.
It is a really nice paper, really interesting, but I guess I'm slightly sceptical about whether this actually scales. You've only got three possible user intents, yes, no, and I don't know, with no elaborations of them, and the only uncertainty comes from allowing different ways of saying yes, so essentially there is no noise. Two things, then. First, why haven't you tried it with real users, so that you would have the real kind of noise you get from an ASR system? And second, have you done anything which suggests this would scale to a system of the question-answering type, a general personal agent, where the user expresses some really rather rich intents that then need to be encoded in the state you're tracking, rather than what, at least from the user side, looks to me like basically a three-state tracker?
This model is very preliminary at this point. If I go back to the proposed architecture, the way we define the hypothesis actions does limit the scalability of the approach for now: because there are only three intents, we can get away with just three actions that change the value of the hypothesis. It is true that if many attributes are involved in the system, and we need to track a complex dialog state that has to be symbolic, then this particular design may not remain an obvious fit. Whether the proposed architecture still works there, how to design the action set, and how to maintain this external memory that holds the important variables interfacing with the database, I think that will be part of future research.
Okay, thank you. We are out of time, so let's thank the speaker again.