0:00:20 that's cool — so we'll now hear about improving interaction quality estimation with BiLSTMs
0:00:26 and the impact on dialogue policy learning
0:00:53 so welcome everyone to my talk today
0:00:55 on improving interaction quality estimation with BiLSTMs
0:00:59 and the impact on dialogue policy learning
0:01:02 i didn't check, but i'm quite sure i've already won the award for
0:01:06 the longest paper title
0:01:11 so let's get started
0:01:13 in reinforcement learning
0:01:16 one of the things that
0:01:18 has a huge influence on the learned behaviour is the reward
0:01:22 and this is also true for
0:01:24 the world of task-oriented dialogue
0:01:27 and in a modular statistical dialogue system we use reinforcement learning to learn the
0:01:31dialogue policy
0:01:35 a belief state
0:01:36 which represents the
0:01:38 progression over the several
0:01:40 dialogue turns
0:01:42 forms the input side, and
0:01:45 the policy maps this to a system action which is then
0:01:49 transformed into a response to the user
0:01:53 and the task of reinforcement learning is then to find the policy that maximises the
0:01:58 expected future reward
0:02:00 also called the optimal policy
0:02:04 for most task-oriented dialogue systems
0:02:07 the reward that's used is
0:02:11 based on task success
0:02:12 so you have some measure of task success — usually
0:02:15 the user provides some task information, or you can
0:02:19 derive it in some other way
0:02:23 you can check what the system responses look like and then you can
0:02:26 evaluate whether the task was successfully achieved
0:02:30 or not
0:02:35 i think that the reward should optimise user satisfaction instead
0:02:40 and this is mainly for two reasons
0:02:43 first of all, user satisfaction better represents
0:02:47 what the user wants
0:02:49 and in fact user satisfaction has only been
0:02:52 used as it correlates well with task success —
0:02:55 no, task success has long been used because it correlates well with user satisfaction
0:02:59 that's the right order
0:03:05 second
0:03:07 user satisfaction can be linked
0:03:09 to task- or domain-independent phenomena
0:03:14 and you don't need information about the underlying task
0:03:17 to illustrate this i have just this quick
0:03:19 sample dialogue here — you only see the parameters extracted from it — it's from the let's
0:03:24 go bus information system
0:03:27 you have parameters like the ASR confidence and whether the turn was a reprompt
0:03:31 and the claim is that you can derive some measure of user satisfaction just by
0:03:37 looking at this
0:03:39 whereas if you actually needed to look at
0:03:41 task success
0:03:43 you would have to have knowledge about what was actually going on — what were the
0:03:46 system utterances, the user inputs, and so forth — to know whether
0:03:51 the right entity has been found
0:03:55 and i'm proposing a novel BiLSTM reward estimator that first of all improves the
0:04:00 estimation of interaction quality itself
0:04:04 and also improves the dialogue performance
0:04:08 and this is done without explicit modelling of temporal features
0:04:12 so you see this setup here
0:04:14 where we don't evaluate the task success anymore but we estimate
0:04:20 the user satisfaction instead
0:04:22 which is an idea that has originally been published
0:04:26 two years ago already — funnily enough at interspeech, which was also in stockholm
0:04:31 so apparently i only talk about this topic in stockholm
0:04:35 so to model the user satisfaction
0:04:39 we use the interaction quality
0:04:41 as a less subjective metric
0:04:43 which has been applied many times for the same purpose
0:04:47 and previously the estimation
0:04:50 was making use of a lot of
0:04:53 manually handcrafted features to encode the temporal information
0:04:57 and for the proposed estimator
0:05:00 i'll show you on the next slide how you can do this
0:05:03 without the need to explicitly encode that temporal information
0:05:10 so there are two aspects to this talk — one is the
0:05:13 interaction quality estimation itself
0:05:15 and after that i'm going to talk about how
0:05:17 using it as a reward actually influences the dialogue policy
0:05:27 first of all, let's have a closer look at interaction quality and how it's modelled —
0:05:30 or how it used to be modelled, with all the handcrafted stuff going on
0:05:34 you see the modular architecture of a dialogue system — information is extracted from the user
0:05:41 input and the system response, and these together constitute one exchange
0:05:47 so you end up with a sequence of exchanges from the
0:05:50 beginning of the dialogue up to the current
0:05:52 turn t — we call it an exchange in this case because it
0:05:57 contains information about both
0:05:58 sides, user and system
0:06:01 this is the exchange level
0:06:03 and the temporal information used to be encoded
0:06:06 on the window level, where you have a look at the last few exchanges —
0:06:10 a window of three in this example — and also on the overall dialogue level
0:06:15 the parameters of both levels were then concatenated
0:06:19 and fed into a classifier
0:06:22 to estimate interaction quality
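To make the handcrafted encoding concrete, here is a minimal sketch of combining exchange-level parameters with window-level and dialogue-level aggregates; the feature names (`asr_confidence`, `reprompt`) and the choice of aggregates are illustrative assumptions, not the exact Let's Go feature set.

```python
def temporal_feature_vector(exchanges, t, window=3):
    """Concatenate exchange-level features at turn t with simple
    aggregates over the last `window` turns and over the whole
    dialogue so far. `exchanges` is a list of dicts with a numeric
    'asr_confidence' and boolean 'reprompt' per turn; these names
    are illustrative, not the exact Let's Go feature set."""
    def mean(vals):
        return sum(vals) / len(vals)

    current = exchanges[t]
    win = exchanges[max(0, t - window + 1): t + 1]   # window level
    dial = exchanges[: t + 1]                        # dialogue level
    return [
        current["asr_confidence"],                    # exchange level
        mean([e["asr_confidence"] for e in win]),     # window mean
        sum(e["reprompt"] for e in win),              # window reprompt count
        mean([e["asr_confidence"] for e in dial]),    # dialogue mean
        sum(e["reprompt"] for e in dial),             # dialogue reprompt count
    ]
```

The concatenated vector would then be fed to the classifier; the BiLSTM model described next removes the need for the window and dialogue aggregates entirely.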
0:06:25 and the interaction quality itself
0:06:28 is then obviously the supervision signal
0:06:30 it is annotated on a scale from five to one
0:06:34 five representing satisfied, one extremely unsatisfied
0:06:38 and it's important to understand that every interaction quality annotation — which
0:06:43 exists for every exchange —
0:06:45 actually models the whole subdialogue from the beginning up to that exchange
0:06:49 so it's not a measure of
0:06:51 how good this turn or the system reaction or whatever was, but it's
0:06:56 a measure of
0:06:58 how satisfied the user was from the beginning up to now — so if it
0:07:02 goes down
0:07:03 it might be that the last turn wasn't really great, but also many things before
0:07:08 could have had an influence
0:07:14 and the neural model
0:07:17 i propose
0:07:18 gets rid of those temporal features — it only uses a very small set of
0:07:23 exchange-level features, which you can see here: the ASR recognition status, the ASR confidence
0:07:29 was it a reprompt or not
0:07:32 what's the
0:07:33 utterance type — is it a statement or a question and so on —
0:07:37 or is the system action a confirm, a request, or something like that
0:07:41 so these are the parameters we use
0:07:46 and no temporal features anymore, as you can see
0:07:52 this exchange vector is then
0:07:54 used as input to
0:07:56 an encoder, to encode the subdialogue using an LSTM or a
0:08:00 BiLSTM
0:08:01 and so
0:08:02 for every subdialogue we want to estimate one interaction quality value, and every subdialogue is
0:08:07 then fed into the encoder
0:08:11 to generate a sequence of hidden states, with an additional attention layer
0:08:18 with the hope of figuring out which
0:08:21 turn actually contributes most to the final estimation
0:08:27 of the interaction quality itself
0:08:32 the attention values are computed
0:08:34 based on the context of the encoder —
0:08:36 for each hidden state
0:08:38 a weight is computed
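A minimal numpy sketch of this attention pooling over the encoder hidden states, assuming a simple dot-product scoring against a learned vector `w` (the exact scoring function in the model may differ):

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax so the attention weights sum to one."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def attention_pool(hidden_states, w):
    """Pool a (T x d) sequence of encoder hidden states into one
    subdialogue representation: score each turn against the learned
    vector `w`, softmax the scores into weights, and take the
    weighted sum. Returns (weights, pooled_vector)."""
    scores = hidden_states @ w           # one scalar score per turn
    alpha = softmax(scores)              # attention weight per turn
    return alpha, alpha @ hidden_states  # weighted sum over turns
```

The pooled vector would then go to the final classification layer; the weights `alpha` are what lets one inspect which turn contributed most.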
0:08:40 and these are the results of applying this to the task of
0:08:45 interaction quality estimation
0:08:47 so for those results
0:08:48 you see the
0:08:50 unweighted average recall, which is just the arithmetic average over all class-wise recalls
0:08:56 the grey ones are the
0:08:59 baselines — the one on the right, from two thousand fifteen, is
0:09:04 a support vector machine using
0:09:07 the handcrafted temporal features
0:09:10 and the second one, from two thousand seventeen, is a
0:09:14 neural network approach which
0:09:17 makes use of a different architecture
0:09:19 but is also not using the temporal features
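The unweighted average recall mentioned above is just the arithmetic mean of the per-class recalls, so rare interaction quality classes weigh as much as frequent ones; a sketch:

```python
def unweighted_average_recall(y_true, y_pred):
    """Arithmetic mean of per-class recalls, so each interaction
    quality class counts equally regardless of how often it occurs."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)
```

This is the same quantity scikit-learn calls macro-averaged recall.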
0:09:24 for the test data we use the LEGO corpus
0:09:28 which contains two hundred bus information dialogues
0:09:32 with the
0:09:32 let's go system of pittsburgh
0:09:35 and those results are computed with ten-fold dialogue-wise cross-validation
0:09:41 you can see that the best performing
0:09:45 classifier is the BiLSTM with the attention mechanism
0:09:48 we compare those results with the plain LSTM and BiLSTM models
0:09:52 with and without attention
0:09:54 and we achieved an
0:09:55 unweighted average recall of zero point five four, which is an increase over the previous
0:10:01 best result of zero point zero nine
0:10:05 now those numbers
0:10:06 don't seem to be
0:10:08 very useful
0:10:09 because — i mean, if you want to estimate a reward, you want to have
0:10:12 a very good quality, and you need to have a certain
0:10:15 certainty that what you actually
0:10:17 get as an estimate
0:10:19 can be used as a reward — you don't get clear
0:10:22 right or wrong indicators
0:10:25 another measure we used to evaluate this is the extended accuracy, where we did not
0:10:28 just look at the exact match but also at neighbouring values
0:10:32 so if you were estimating a five although it was
0:10:36 originally a four, it would still be counted as correct
0:10:39 because the way we
0:10:41 transfer those interaction quality values to the reward
0:10:45 makes it not very problematic if you're off by one
0:10:50 you will see later how this is done — and then we can see
0:10:52 that we actually get very good values above the ninety
0:10:56 percent accuracy rate, with zero point nine four
0:10:59 which is
0:11:01 three point zero six
0:11:03 better than the previous
0:11:06 best result
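The extended accuracy can be sketched as ordinary accuracy with an off-by-one tolerance on the five-point scale:

```python
def extended_accuracy(y_true, y_pred, tolerance=1):
    """Fraction of estimates within `tolerance` of the annotation,
    so predicting a 5 for a true 4 still counts as correct."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= tolerance)
    return hits / len(y_true)
```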
0:11:12 the estimate of the best performing model — the BiLSTM with the attention mechanism —
0:11:18 is then used to train dialogue policies
0:11:22 first of all we have to address the question of how we can make use of
0:11:24 an interaction quality value
0:11:26 in a reward — here we see that for the reward based on interaction quality we
0:11:31 use a turn penalty of minus one per turn
0:11:34 and then we actually scale
0:11:37 the reward — the interaction quality — so that
0:11:42 it takes values from zero to twenty
0:11:45 to be in correspondence with the task success baseline we have been using
0:11:50 in many different papers already
0:11:52 where you also have the turn penalty, plus twenty if the dialogue was successful
0:11:56 and zero if not
0:12:01 so you get the same value range, but you have a more fine-grained
0:12:05 interpretation of
0:12:06 how the dialogue actually did
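The reward described above can be sketched as follows; the linear mapping of the interaction quality scale [1, 5] onto [0, 20] is an assumption chosen so the value range matches the task-success baseline (the exact mapping in the paper may differ):

```python
def iq_to_final_reward(iq, max_reward=20.0):
    """Linearly scale an interaction quality value (1 = extremely
    unsatisfied, 5 = satisfied) onto [0, max_reward]."""
    return (iq - 1.0) / 4.0 * max_reward

def dialogue_return(num_turns, final_iq):
    """Total dialogue reward: a turn penalty of -1 per turn plus the
    scaled final interaction quality estimate, mirroring the
    task-success baseline of -1 per turn and +20 on success."""
    return -float(num_turns) + iq_to_final_reward(final_iq)
```

With this shaping, a fully satisfied user (IQ 5) yields the same +20 terminal reward as a successful dialogue under the task-success baseline, which is what makes the two rewards directly comparable.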
0:12:11 we compare
0:12:13 the best performing estimator against the support vector machine estimator from
0:12:17 previous work — as the evaluation system we use pydial
0:12:22 with a setup using the focus tracker and the gp-sarsa policy learning algorithm
0:12:26 we use three different simulation environments
0:12:31 with zero percent error rate, fifteen percent error rate, and
0:12:34 thirty percent error rate
0:12:38 we used two different evaluation metrics — one is the task success rate, because even though
0:12:42 we want to optimise towards interaction quality or user satisfaction
0:12:46 it's still
0:12:48 very important to also complete the task successfully
0:12:50 you see, it doesn't help if you
0:12:53 estimate that this was a very nice dialogue
0:12:55 but the user didn't achieve the task — that's of no use
0:13:01 the second metric we use is the average interaction quality
0:13:04 where, to get the estimate, we
0:13:07 compute the average of all final estimates over all dialogues
0:13:13 and to address the aspect of domain independence
0:13:17 we actually look at many different domains
0:13:21 so the estimator has been trained on the let's go domain
0:13:24 where we have the annotations
0:13:27 but the dialogues themselves — the domains in which the dialogues
0:13:32 take place — are actually different ones: we have the cambridge restaurants domain
0:13:36 the san francisco restaurants domain, the san
0:13:39 francisco hotels domain, and the laptops domain
0:13:42 they have different complexity, they have different
0:13:44 aspects to them
0:13:48 so this basically will showcase
0:13:51 that the approach is actually
0:13:54 domain-independent — you don't need to have information about the
0:13:58 underlying task
0:14:00 now the question is how does this actually perform
0:14:03 you have a lot of bars
0:14:05 here, obviously, because there are a lot of experiments
0:14:10 there are even more in the paper
0:14:13 but i think what's very interesting here is that
0:14:16 for the task success rate
0:14:18 and the different noise levels we can see —
0:14:21 comparing the black bars, which is
0:14:25 the reward using a support vector machine
0:14:28 with the blue ones
0:14:30 the reward using the
0:14:33 BiLSTM approach —
0:14:35 we can see that
0:14:37 overall the task success
0:14:38 increases, and this is especially interesting for higher noise rates — so here we
0:14:41 have all domains combined, and we can see that for higher noise
0:14:46 the improvement in task success is
0:14:50 very pronounced and almost
0:14:52 even equal
0:14:54 to using the actual task success
0:14:58 as the reward signal
0:14:59 so what this slide tells us is
0:15:04 that even though we are not using any information about the task —
0:15:08 just looking at user satisfaction and actually estimating that —
0:15:12 we can still get
0:15:13 on average
0:15:15 almost the same task success rate as
0:15:19 if we were optimising on task success
0:15:26 obviously the interaction quality is also of importance
0:15:31 here we show the
0:15:34 average interaction quality which, as i said earlier, is computed
0:15:37 at the end of the dialogue
0:15:39 and here we can see that there is an improvement over the task-success-based
0:15:43 reward — you already get
0:15:47 decent interaction quality estimates
0:15:50 so the users are estimated to be
0:15:55 not completely unsatisfied, so it's quite okay — but by
0:16:00 optimising towards the interaction quality itself
0:16:03 you also get an improvement on that side
0:16:06 this is not very surprising because
0:16:08 you are actually optimising towards the actual value
0:16:13 you are
0:16:13 showing here — so it would be
0:16:15 bad if it weren't the case
0:16:17 so it's mostly more of a proof of concept
0:16:25 as i said earlier, this was all done in simulation — these were simulated experiments
0:16:31 in my publication two years ago i already did evaluations with humans
0:16:35 as a validation
0:16:37 we had humans
0:16:38 talking to the system, feeding the interaction quality estimates directly into
0:16:43 the dialogue policy
0:16:47 you see the moving average interaction quality and the moving average task success rate — the
0:16:54 green
0:16:55 curve is using interaction quality and the red one is using task success
0:16:59 you can see that in terms of task success there's not a real
0:17:02 visible difference here
0:17:03 however, when you look at
0:17:05 the interaction quality, you see the same picture as in the simulated experiments —
0:17:10 that already after a few hundred dialogues you get
0:17:16 better interaction quality estimates
0:17:21 so what have i told you so far today
0:17:25 we used the interaction quality to model user satisfaction for subdialogues
0:17:31 i presented a novel
0:17:32 recurrent neural network
0:17:34 model that outperforms all previous models without explicitly encoding the temporal information
0:17:39 and this better-performing model was then
0:17:42 used to learn dialogue policies in unseen domains — it
0:17:48 didn't require knowledge about the underlying task
0:17:51 it increased user satisfaction and
0:17:54 seems to be
0:17:55 more robust to noise
0:17:57 and the simulated experiments were validated in a human evaluation
0:18:02 i already did a while ago
0:18:05 for future work
0:18:06 it obviously would be
0:18:08 very beneficial
0:18:11 to apply this to more complex tasks
0:18:14 and also to have a
0:18:17 better understanding of
0:18:18 what the actual differences in the learned policies are
0:18:21 to be able to transfer this into
0:18:24 new knowledge
0:18:25 thank you
0:18:34 time for questions
0:18:40 hi — i have two questions
0:18:45 the first one is about the data — you trained on let's go and then applied
0:18:52 it to different domains — okay — and my other question is
0:18:58 one problem that i actually have is that we just have a normalised
0:19:03 satisfaction score of the user for the whole dialogue
0:19:08 and we don't have an annotation at each dialogue turn
0:19:13 so what are your thoughts about that — because for you, you have annotations at
0:19:18 every dialogue turn — so what is your intuition about how to
0:19:23 handle that kind of setting, where you only have a global
0:19:28 user satisfaction, about going from a global to a turn-level user satisfaction
0:19:36 i think that's a very interesting question — i think that's probably the
0:19:39 biggest disadvantage of this approach, that it
0:19:43 seems to need those turn-level annotations — i think they're quite important during the
0:19:47 learning phase, because during the dialogue learning you see a
0:19:52 lot of —
0:19:53 maybe a lot of interrupted dialogues and all those kinds of things, and if you don't have
0:19:58 a good estimate for those, it's very hard to learn from them, because
0:20:01 an interruption can come from anything basically — somebody hangs up the phone even
0:20:06 though the dialogue was pretty good until then, and you wouldn't —
0:20:10 you can't get anything out of it
0:20:13 so i think if you only have dialogue-level estimates
0:20:18 you can still make it work, i think, but you need to set
0:20:21 up your
0:20:22 policy learning
0:20:23 more carefully
0:20:26 and disregard some dialogues from the experience, because you
0:20:31 won't be able to
0:20:32 take anything out of them
0:20:34 but then it can actually work quite well, i think
0:20:37 i don't think the estimator itself
0:20:40 needs the turn-level estimates
0:20:43 as i said, those are for subdialogues
0:20:45 and if you only consider the whole dialogue and you have enough annotations — not
0:20:49 only the two hundred dialogues we have here
0:20:51 but, i don't know
0:20:52 thousands, millions, i don't know at what scale you operate —
0:20:55 then i think it's possible
0:20:58 without the turn-level ones
0:21:00 or you can try using the let's go
0:21:02 estimator and then apply it to your own data
0:21:10 we have time for two more questions
0:21:22 hello, i'm [inaudible] from [inaudible] university
0:21:27 thanks for the talk — i was wondering, a lot of people have reported, for instance
0:21:31 in the alexa challenge,
0:21:34 that user satisfaction can be very noisy
0:21:38 now, your corpus was collected some years back —
0:21:43 did you see this noise in the corpus and in the annotations, and
0:21:49 how do you think this is affecting the way you're predicting
0:21:53 this interaction reward
0:21:55 so the idea of the interaction quality is especially —
0:21:59 it is specifically designed to avoid, or to reduce, that noisiness
0:22:03 interaction quality was not collected by
0:22:07 people rating their own dialogues
0:22:10 but it was rated by expert raters after the dialogue — so people sitting there
0:22:16 with guidelines, some general guidelines on how to apply those interaction quality
0:22:22 labels, and based on that
0:22:24 they applied them
0:22:25 and you have multiple raters per exchange to counter the noisiness
0:22:31 this was done to actually reduce the noisiness
0:22:34 and for the data we have
0:22:37 we were able to counter the noisiness
0:22:43 one last question
0:22:50 hello, i'm [inaudible] from google
0:22:53 did you see cases where the interaction quality predictions within a dialogue changed dramatically —
0:22:59 were there patterns that were interesting, so interesting user cases, recovery cases
0:23:04 within a dialogue, or something that could be learned from these
0:23:08 stepwise processes in dialogues
0:23:11 well, the estimation is not
0:23:14 a hundred percent accurate, so you do see drops
0:23:17 but in the actual annotation you don't see any drops
0:23:21 based on the guidelines
0:23:23 it was forbidden, basically
0:23:24 so the idea was to have a more consistent labelling
0:23:27 and it was rather —
0:23:29 we regarded it as rather unlikely that only one single event would
0:23:33 drop the user satisfaction level from, say, three to one or something like that
0:23:38 so in the annotation you don't see those drops
0:23:41 but from the learned policies —
0:23:44 i haven't done the analysis yet of
0:23:48 what has actually been learned, comparing this to other things — maybe to
0:23:52 human dialogues or simulated dialogues
0:23:54 but this is, as i said, part of the future work — i think this will
0:23:58 hopefully shed a lot of insight into what these
0:24:02 different reward signals actually do
0:24:05 and how we can make use of that
0:24:13 okay, thank you again