Speech Transcript - Improving Interaction Quality Estimation with BiLSTMs and the Impact on Dialogue Policy Learning

0:00:17	i
0:00:18	i
0:00:20	stefan ultes will talk about improving interaction quality estimation with bilstms
0:00:26	and the impact on dialogue policy learning
0:00:51	great
0:00:53	so welcome everyone to my talk today
0:00:55	on improving interaction quality estimation with bilstms
0:00:59	and the impact on dialogue policy learning
0:01:02	i didn't check but i'm quite sure i've already won the prize for
0:01:06	the longest paper title
0:01:09	nonetheless
0:01:11	let's get started
0:01:13	in reinforcement learning
0:01:16	one of the things that
0:01:18	has a huge influence on the learnt behaviour is the reward
0:01:22	and this is also true for
0:01:24	the world of task oriented dialogue
0:01:27	and in a modular statistical dialogue system we use reinforcement learning to learn the
0:01:31	dialogue policy
0:01:32	which
0:01:34	maps
0:01:35	a belief state
0:01:36	which represents the
0:01:38	progression over the several
0:01:40	dialogue turns
0:01:42	from the input side i mean
0:01:45	and this maps to a system action which is then
0:01:49	transferred into a response to the user
0:01:53	and the task of reinforcement learning is then to find the policy that maximizes the
0:01:58	accumulated future reward
0:02:00	also called the optimal policy
0:02:03	and
0:02:04	for most goal-oriented dialogue systems
0:02:07	the reward that's used is
0:02:11	based on task success
0:02:12	so you have some measure of task success usually
0:02:15	the user provides some task information or you can
0:02:19	derive it in some other way
0:02:22	and
0:02:23	you can check what the system responses look like and then you can
0:02:26	evaluate whether the task was successfully achieved
0:02:30	or not
0:02:34	however
0:02:35	i think that the learned behaviour should optimize user satisfaction instead and not
0:02:39	task success
0:02:40	and this is mainly for two reasons
0:02:43	first of all user satisfaction better represents
0:02:47	what the user wants
0:02:49	and in fact it has only been
0:02:52	used as it correlates well with task success
0:02:55	no task success has only been used because it correlates well with user satisfaction
0:02:59	that's the right order
0:03:01	secondly
0:03:05	user
0:03:07	satisfaction can be linked
0:03:09	to task or domain independent phenomena
0:03:14	and you don't need information about the underlying task
0:03:17	to illustrate this i have just this quick
0:03:19	example dialogue here you only see the parameters extracted from it it's from the let's
0:03:24	go bus information
0:03:26	system
0:03:27	you have the asr status the asr confidence the activity type and the reprompt
0:03:31	and the claim is that you can derive some measure of user satisfaction just by
0:03:37	looking at this
0:03:39	whereas if you actually needed to look at
0:03:41	task success
0:03:43	you would have to have knowledge about what was actually going on what were the
0:03:46	system utterances user input and so on and so forth to know whether the
0:03:50	right
0:03:51	entity has been found
0:03:55	and i'm proposing a novel bilstm reward estimator that first of all improves the
0:04:00	estimation of interaction quality itself
0:04:04	and also improves the dialogue performance
0:04:08	and this is done without explicitly modelling temporal features
0:04:12	so you see this diagram
0:04:14	where we don't evaluate the task success anymore but we estimate
0:04:20	the user satisfaction instead
0:04:22	which is the idea that has originally been published
0:04:26	two years ago already funnily enough at interspeech also in stockholm
0:04:31	so i'm talking about this topic in stockholm only apparently
0:04:35	so to model user satisfaction
0:04:39	we use the interaction quality
0:04:41	as a less subjective metric
0:04:43	here we use both terms for the same purpose
0:04:47	and previously the estimation
0:04:50	was making use of a lot of
0:04:53	manually handcrafted features to encode the temporal information
0:04:57	and in the proposed estimator
0:05:00	i'll show you on the next slides how you can do this
0:05:03	without the need to actually handcraft that temporal information
0:05:10	so there are two aspects of this talk one is the
0:05:13	interaction quality estimation itself
0:05:15	and after that i'm going to talk about how
0:05:17	using this as a reward actually influences the dialogue policy
0:05:24	so
0:05:27	first of all let's have a closer look at interaction quality and how it's modelled
0:05:30	and how it used to be modelled with all the handcrafted stuff going on
0:05:34	you see the modular architecture of a dialogue system and information is extracted from the user
0:05:41	input and the system response and these parameters constitute one exchange
0:05:47	so you end up with a sequence of exchanges from the
0:05:50	beginning of the dialogue up to the current
0:05:52	turn t we call it exchange in this case because it
0:05:57	contains information about both
0:05:58	sides user and system
0:06:00	and
0:06:01	this is the exchange level
0:06:03	and the temporal information used to be encoded
0:06:06	on the window level where you have a look at a
0:06:10	window of three in this example and also on the overall dialogue level
0:06:15	and the parameters of those levels were concatenated
0:06:19	and then fed into a classifier
0:06:22	to estimate interaction quality
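The handcrafted temporal encoding described here can be sketched as follows. This is only an illustration under assumptions: the dictionary keys (`asr_confidence`, `reprompt`), the choice of mean and count as aggregates, and the function name are hypothetical and not taken from the talk; only the three levels (exchange, window of three, whole dialogue) and the concatenation come from it.

```python
from statistics import mean

def temporal_features(exchanges, window=3):
    """Concatenate exchange-, window- and dialogue-level parameters.

    exchanges: list of dicts for exchanges 1..t, each with the
    hypothetical keys 'asr_confidence' (float) and 'reprompt' (bool).
    """
    current = exchanges[-1]          # exchange level: the current turn t
    recent = exchanges[-window:]     # window level: the last `window` exchanges
    return {
        "asr_confidence": current["asr_confidence"],
        "reprompt": int(current["reprompt"]),
        "win_mean_asr_confidence": mean(e["asr_confidence"] for e in recent),
        "win_reprompts": sum(int(e["reprompt"]) for e in recent),
        "dlg_mean_asr_confidence": mean(e["asr_confidence"] for e in exchanges),
        "dlg_reprompts": sum(int(e["reprompt"]) for e in exchanges),
    }
```

The resulting flat feature vector is what a classifier such as a support vector machine would consume; the recurrent model discussed next replaces the window and dialogue aggregates with a learned sequence encoding.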
0:06:25	and the interaction quality itself
0:06:28	is then obviously the supervision signal
0:06:30	it's annotated on a scale from five to one
0:06:34	five representing satisfied one extremely unsatisfied
0:06:38	and it's important to understand that every interaction quality annotation which
0:06:43	exists for every exchange
0:06:45	actually models the whole subdialogue from the beginning up to that exchange
0:06:49	so it's not a measure of
0:06:51	how well was this turn the system reaction or whatever but it's
0:06:56	a measure of how
0:06:57	well
0:06:58	how satisfied was the user from the beginning up to now so if it
0:07:02	goes down
0:07:03	it might be that the last turn wasn't really great but also many things before
0:07:08	could also have had an influence
0:07:14	and the neural network model
0:07:17	i proposed
0:07:18	gets rid of those temporal features it only uses a very small set
0:07:23	of exchange level features which you can see here the asr recognition status asr confidence
0:07:29	was it a reprompt or not
0:07:32	what's the channel
0:07:33	utterance type is it a statement or question and so on
0:07:37	or is the system action a confirm request or something like that
0:07:41	so these are the parameters we use
0:07:44	and
0:07:46	notice that work anymore to be cy
0:07:52	this exchange level is then
0:07:54	used as input to
0:07:56	an encoder it's a recurrent encoder using an lstm or a
0:08:00	bilstm
0:08:01	and so
0:08:02	for every subdialogue we want to estimate one interaction quality value and every subdialogue is
0:08:07	then fed into the encoder
0:08:11	to generate a sequence of hidden states with an additional attention layer
0:08:17	to
0:08:18	with the hope of figuring out which
0:08:21	turn actually contributes most to the final estimation
0:08:25	of
0:08:27	interaction quality itself
0:08:32	the attention layer is a set of attention values
0:08:34	which are computed based on the context of the other
0:08:36	hidden states
0:08:38	weights are computed
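To make the attention step concrete, here is a minimal sketch of attention pooling over a sequence of encoder hidden states. It is a simplification under stated assumptions: the BiLSTM encoder itself is omitted, the scoring uses a dot product with a single `query` vector (hypothetical, standing in for whatever learned scoring the actual model uses), and the function name is invented for illustration.

```python
import math

def attention_pool(hidden_states, query):
    """Return (weights, context) for one subdialogue.

    hidden_states: list of equal-length float lists, one per exchange.
    query: float list of the same length (assumed learned scoring vector).
    """
    # One relevance score per exchange.
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    # Numerically stable softmax turns scores into weights that sum to one.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the hidden states.
    dim = len(hidden_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return weights, context
```

The context vector would then be passed to an output layer that predicts one of the five interaction quality classes, and the weights indicate which exchange contributed most to that prediction.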
0:08:40	and the results of applying this to the task of
0:08:45	interaction quality estimation
0:08:47	so those results
0:08:48	you see the
0:08:50	unweighted average recall it's just the arithmetic average over all class-wise recalls
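As a small illustration of the metric just defined, unweighted average recall can be computed like this. The function name is chosen here for illustration, and the sketch assumes every class of interest occurs at least once in the reference labels.

```python
def unweighted_average_recall(y_true, y_pred):
    """Arithmetic mean of the per-class recalls."""
    recalls = []
    for cls in sorted(set(y_true)):
        # Positions where the reference label is this class.
        positions = [i for i, y in enumerate(y_true) if y == cls]
        hits = sum(1 for i in positions if y_pred[i] == cls)
        recalls.append(hits / len(positions))
    return sum(recalls) / len(recalls)
```

Because every class counts equally, a classifier that always predicts the majority interaction quality value scores poorly on this metric even when its plain accuracy looks high.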
0:08:56	the grey ones are the
0:08:59	baselines the one on the right from two thousand fifteen is
0:09:04	using the support vector machine using
0:09:07	the handcrafted temporal features
0:09:10	and the second one by rach et al from two thousand seventeen is a
0:09:14	neural network approach which
0:09:17	is making use of a different architecture
0:09:19	but still not using the temporal features
0:09:24	for training and test data we use the lego corpus
0:09:28	which consists of two hundred bus information dialogues
0:09:32	with the
0:09:32	let's go system of pittsburgh
0:09:35	and those results are computed in ten-fold dialogue-wise cross validation
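Dialogue-wise cross validation means that all exchanges of one dialogue end up in the same fold, so no dialogue is split between training and test data. A minimal sketch, assuming each exchange carries a dialogue identifier; the round-robin assignment and the function name are illustrative choices, not the procedure used for the reported results.

```python
def dialoguewise_folds(dialogue_ids, k=10):
    """Group exchange indices into k folds without splitting any dialogue.

    dialogue_ids: one dialogue identifier per exchange.
    """
    unique = sorted(set(dialogue_ids))
    # Assign whole dialogues to folds in round-robin order (an assumption).
    fold_of = {d: i % k for i, d in enumerate(unique)}
    folds = [[] for _ in range(k)]
    for index, d in enumerate(dialogue_ids):
        folds[fold_of[d]].append(index)
    return folds
```

Each fold serves once as the test set while the remaining folds are used for training.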
0:09:40	and
0:09:41	you can see that the best performing
0:09:45	classifier is the bilstm with the attention mechanism
0:09:48	we compared this also with a pure bilstm and pure
0:09:51	lstms
0:09:52	with and without attention
0:09:54	here we achieved an
0:09:55	unweighted average recall of zero point five four which is an increase over the previous
0:09:59	best
0:10:01	result of zero point zero nine
0:10:05	now those numbers
0:10:06	don't seem to be
0:10:08	very useful
0:10:09	because i mean if you want to estimate a reward you want to have
0:10:12	a very good quality and you need to have a certain
0:10:15	certainty that what you actually
0:10:17	get as an estimate actually
0:10:19	can be used as a reward you don't get like
0:10:22	right wrong indicators
0:10:25	another measure we used to evaluate that is the extended accuracy where we did not
0:10:28	just look at the actual match but also look at neighboring
0:10:32	values so estimating a five although it was
0:10:36	originally a four would still be counted as correct
0:10:39	because the way we
0:10:41	transfer those interaction quality values to the reward
0:10:45	makes it not a very big problem if you're off by one
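The extended accuracy just described can be sketched in a few lines. The function and parameter names are illustrative; the tolerance of one class comes from the description above.

```python
def extended_accuracy(y_true, y_pred, tolerance=1):
    """Share of estimates that are at most `tolerance` classes off."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= tolerance)
    return hits / len(y_true)
```

For example, with reference labels `[4, 3, 1, 5]` and estimates `[5, 3, 3, 5]`, three of the four estimates are within one class of the reference, so the extended accuracy is 0.75 while the exact-match accuracy is only 0.5.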
0:10:50	you will see later how this is done and then we can see
0:10:52	that we can actually get very good values above the ninety
0:10:56	percent accuracy rate with zero point nine four
0:10:59	for the bilstm based approach which is
0:11:01	zero point zero six
0:11:03	better than the previous
0:11:06	best result
0:11:09	and
0:11:12	this estimation of the best performing model the bilstm with the attention mechanism
0:11:18	is then used to train dialogue policies
0:11:22	first of all we have to address the question how can we make use of
0:11:24	an interaction quality value
0:11:26	in a reward here we see that for the reward based on interaction quality we
0:11:31	use the turn penalty of minus one per turn
0:11:34	and then we actually scale
0:11:37	the interaction quality so that
0:11:42	it takes values from zero to twenty
0:11:45	to be in correspondence to the baseline of task success which has been used in
0:11:50	many different papers already
0:11:52	where you also have the turn penalty and plus twenty if the dialogue was successful
0:11:56	and zero if not
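The two reward definitions can be written down as a short sketch. The turn penalty of minus one per turn, the bonus of twenty for success, and the target range of zero to twenty are from the talk; the specific linear mapping `(iq - 1) * 5` is an assumption that is consistent with that range for a one-to-five scale, and the function names are illustrative.

```python
def task_success_reward(num_turns, success):
    """Baseline: minus one per turn, plus twenty if the task succeeded."""
    return -num_turns + (20 if success else 0)

def interaction_quality_reward(num_turns, iq):
    """Minus one per turn, plus the final interaction quality estimate
    (1..5) scaled linearly to 0..20 (the scaling formula is assumed)."""
    return -num_turns + (iq - 1) * 5
```

With this mapping a six-turn dialogue ending at interaction quality five gets the same reward as a successful six-turn dialogue under the baseline (14), while intermediate values such as three yield a reward in between (4).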
0:12:00	so
0:12:01	you get the same value range but you have a more fine
0:12:04	grained
0:12:05	interpretation of
0:12:06	how the dialogue actually did
0:12:11	we compare
0:12:12	the
0:12:13	best performing estimator again against the support vector machine estimator from
0:12:17	previous work as the evaluation system we use pydial
0:12:22	with a setup using the focus tracker and the gp-sarsa policy learning algorithm
0:12:26	we use different simulation environments
0:12:31	with zero percent error rate fifteen percent error rate and
0:12:34	thirty percent error rate
0:12:38	we used two different evaluation metrics one is the task success rate because even though
0:12:42	we want to optimize towards interaction quality or user satisfaction
0:12:46	it's still
0:12:48	very important to also successfully
0:12:50	complete the task it doesn't help if you
0:12:53	estimate that this was a very nice dialogue
0:12:55	but the user didn't achieve the task that's of no use
0:13:01	the second metric we use is the average interaction quality
0:13:04	where we just take the estimate
0:13:06	and
0:13:07	compute the average of all final estimates for the overall dialogue
0:13:13	and to address the aspect of domain independence
0:13:17	we actually look at many different domains
0:13:21	so the estimator has been trained on the let's go domain
0:13:24	there we have the annotations
0:13:26	for
0:13:27	but the dialogues themselves the domains in which the dialogue policies
0:13:32	have been learned are actually totally different so we have the cambridge restaurants domain
0:13:36	the cambridge hotels domain the san francisco restaurants
0:13:39	domain the san francisco hotels domain and the laptops domain
0:13:42	they have different complexity they have different
0:13:44	aspects to them
0:13:46	so
0:13:48	so this basically will showcase that
0:13:51	that the approach is actually
0:13:54	domain independent you don't need to have information about the
0:13:58	underlying task
0:13:59	so
0:14:00	now the question is how does this perform actually
0:14:03	you have a lot of bars
0:14:05	obviously because there are a lot of experiments
0:14:10	curators in the non logically be the paper
0:14:13	but i think what's very interesting here is that
0:14:16	for the task success rate
0:14:18	and the different noise levels we can see
0:14:21	that when comparing the black bars which is
0:14:25	the reward using the support vector machine
0:14:28	with the blue ones
0:14:30	the reward using the
0:14:32	novel
0:14:33	bilstm approach
0:14:35	we can see that
0:14:37	overall the task success
0:14:38	increases and this is especially interesting for higher noise rates so here we
0:14:41	have it for all domains combined we can see that especially for higher noise
0:14:45	rates
0:14:46	the improvement in task success is
0:14:50	very large and almost
0:14:52	even comes close
0:14:54	to using the actual task success
0:14:58	as the reward signal
0:14:59	so what the slide tells us is
0:15:02	that
0:15:04	even though we are not using any information about the task
0:15:08	just looking at user satisfaction and actually estimating that
0:15:12	we can still get
0:15:13	on average
0:15:15	almost the same task success rate as when
0:15:19	you're optimising on task success
0:15:22	directly
0:15:24	and
0:15:26	obviously also the interaction quality is of importance
0:15:31	here we show the
0:15:34	average interaction quality as i said earlier which is computed
0:15:37	at the end of the dialogue
0:15:39	and here we can see that also for the task success based
0:15:43	ones you already get
0:15:47	decent interaction quality estimates
0:15:50	so the users are estimated to be
0:15:55	not completely unsatisfied so it's quite okay but
0:16:00	by optimising towards the interaction quality itself
0:16:03	you get also an improvement on this side
0:16:06	this is not very surprising because
0:16:08	you are actually optimising towards the actual value
0:16:13	you are
0:16:13	showing here so it would be
0:16:15	bad if it weren't the case
0:16:17	so it's mostly more like a proof of concept
0:16:25	as i said earlier this was all done in simulation those were simulated experiments
0:16:31	in my publication two years ago i already did evaluations with humans
0:16:35	as a validation
0:16:37	we had humans
0:16:38	talking to the system and using this interaction directly to learn
0:16:43	the dialogue policy
0:16:45	and
0:16:47	you see the moving average interaction quality and the moving average task success rate the
0:16:54	green
0:16:55	curve uses interaction quality and the red one is using task success
0:16:59	you can see that in terms of task success there's not a real
0:17:02	visible difference here
0:17:03	however when you look at
0:17:05	the interaction quality you see the same effects as in the simulated experiments
0:17:10	that already after a few hundred dialogues you get
0:17:14	an
0:17:15	increase in
0:17:16	interaction quality estimation
0:17:20	so
0:17:21	so what have i told you so far today
0:17:25	we used the interaction quality to model user satisfaction for subdialogues
0:17:31	i presented a novel
0:17:32	recurrent neural network
0:17:34	model that outperforms all previous models without explicitly encoding the temporal information
0:17:39	and this best performing model was then
0:17:42	used to learn dialogue policies in unseen domains without knowledge about the underlying task
0:17:48	it didn't require knowledge about the underlying task
0:17:51	and increased user satisfaction and
0:17:54	seems to be
0:17:55	more robust to noise
0:17:57	and the simulated experiments were validated in a human evaluation
0:18:02	already a while ago
0:18:05	for future work
0:18:06	obviously would be
0:18:08	very beneficial
0:18:09	to
0:18:11	apply this to more complex tasks
0:18:14	and also to have
0:18:17	a better understanding of
0:18:18	what are the actual differences in the policies learned
0:18:21	to be able to transfer this to
0:18:24	new knowledge
0:18:25	thank you
0:18:34	any questions
0:18:40	hi you know has from all mention that and i am i have to questions
0:18:45	that data star and probably in something circular neatly the lexical a that and getting
0:18:52	it we do their domains and just okay and my other question is am
0:18:58	one problem that i actually have now is that we just have a normalized score
0:19:03	satisfaction score of the user from close
0:19:08	and we don't have an annotation at each dialogue turn
0:19:13	so what are your thoughts about that because you have annotations at
0:19:18	every dialogue turn so what is your intuition about the
0:19:23	kind of model that close to that point eight problem and global
0:19:28	as user satisfaction to a time and well in a user satisfaction
0:19:34	estimation
0:19:36	i think it's a very interesting question i think that's probably
0:19:39	the biggest disadvantage of this approach that
0:19:42	you
0:19:43	seem to need those turn level annotations i think they are quite important during the
0:19:47	learning phase because during dialogue policy learning you see a
0:19:52	lot of
0:19:53	maybe a lot of interrupted dialogues and all kinds of things and if you don't have
0:19:58	a good estimate for those it's hard to learn from those because
0:20:01	an interruption can come from anything basically somebody hangs up the phone
0:20:06	even though the dialogue was pretty good until then you wouldn't
0:20:10	be able to get something out of it
0:20:13	so i think if you only have turn level estimates
0:20:18	you can still make it work i think but you need to set
0:20:21	up your
0:20:22	policy learning
0:20:23	more carefully
0:20:24	maybe
0:20:26	don't regard some dialogues as actual experience because you
0:20:31	won't be able to
0:20:32	take anything out of them
0:20:34	but then it can actually work quite well i think
0:20:37	i don't think the estimator itself
0:20:40	needs the turn level estimates
0:20:43	as i said those are subdialogues
0:20:45	and if you only consider the whole dialogue and you have enough annotations of those
0:20:49	not only the ones we have here
0:20:51	like i don't know
0:20:52	thousands millions i don't know what scale you operate at
0:20:55	then i think it's possible
0:20:58	without the turn level ones
0:21:00	or you can try using the let's go
0:21:02	system and then apply it to your domains
0:21:10	we have time for more questions
0:21:22	or cell phone i'm lose their from a report university
0:21:27	thanks for the talk i was wondering a lot of people have reported for instance
0:21:31	in the alexa challenge that
0:21:34	this user satisfaction can be very noisy
0:21:38	now your corpus was collected some years back
0:21:43	did you see this noise in the corpus and in the annotations and
0:21:49	how do you think this is affecting the way you're predicting
0:21:53	this interaction quality
0:21:55	so the idea of the interaction quality
0:21:59	is specifically to reduce the noisiness
0:22:03	interaction quality was not collected by
0:22:07	people rating their own dialogues
0:22:10	but it was rated by expert raters after the dialogue so people sitting there
0:22:15	following
0:22:16	a few guidelines some general guidelines on how to apply those interaction quality
0:22:21	labels
0:22:22	and based on that
0:22:24	labels were applied
0:22:25	then you have multiple raters per exchange to reduce noise
0:22:29	and
0:22:31	this was done to actually reduce noisiness
0:22:34	but for the data we have
0:22:37	we are able to cover the noisiness
0:22:43	one last
0:22:50	so the buildings k google
0:22:53	did you see cases where the interaction quality predictions within a dialogue change dramatically
0:22:59	were there patterns that were interesting so cases of interesting recovery
0:23:04	within dialogues or something that could be learned from these
0:23:08	stepwise processes in dialogues
0:23:11	well the estimation is not
0:23:14	one hundred percent accurate so there you see drops
0:23:17	but in the annotation you don't see any drops
0:23:19	because
0:23:21	based on the guidelines
0:23:23	it was forbidden basically
0:23:24	so the idea was to have a more consistent labeling
0:23:27	and it was
0:23:29	regarded rather unlikely that only one single event would kind of
0:23:33	drop the satisfaction level from three to one or something like that
0:23:38	so from the annotation you don't see those
0:23:41	but from the learned policies
0:23:44	i haven't done yet the analysis of
0:23:48	what has actually been learned comparing this to other things maybe to
0:23:52	human dialogues human generated dialogues
0:23:54	but this is as i said part of the future work i think this will
0:23:58	hopefully shed a lot of light on what these
0:24:02	different reward signals actually
0:24:04	learn
0:24:05	and how we can make use of that
0:24:13	and thank you stefan

Oral Session 1: Policy and Knowledge

Stefan Ultes