So our next speaker will talk about improving interaction quality estimation with BiLSTMs and the impact on dialogue policy learning.

Great.

So welcome, everyone, to my talk today on improving interaction quality estimation with BiLSTMs and the impact on dialogue policy learning. I didn't check, but I'm quite sure I have already won the award for the longest paper title. Nonetheless, let's get started with this.

In reinforcement learning, one of the things that has a huge influence on the learned behaviour is the reward, and this is also true in the world of task-oriented dialogue. In a modular statistical dialogue system, we use reinforcement learning to learn the dialogue policy, which maps a belief state, representing the progression of the user input over the several dialogue turns, to a system action, which is then transformed into a response to the user. The task of reinforcement learning is then to find the policy that maximizes the expected future reward, also called the optimal policy.
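For reference, a standard way to write down this objective (my notation, not taken from the slides) is:

```latex
\pi^{*} \;=\; \arg\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t}\, r_{t}\right]
```

where r_t is the reward received at turn t and gamma is a discount factor.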

For most goal-oriented dialogue systems, the reward that is used is based on task success. So you have some measure of task success: usually the user provides some task information, or you can derive it in some other way, and you can check what the system responses look like and then evaluate whether the task was successfully achieved or not.

However, I think that the reward should instead optimize user satisfaction and not task success, and this is mainly for two reasons. First of all, user satisfaction better represents what the user wants; in fact, task success has only ever been used because it correlates well with user satisfaction, that is the right order.

Secondly, user satisfaction can be linked to task- or domain-independent phenomena, so you don't need information about the underlying task.

To illustrate this, I have this quick example dialogue here, where you only see the parameters extracted from it; it is from the Let's Go bus information system. You have things like the ASR confidence, the activity type, whether the turn was a re-prompt, and the claim is that you can derive some measure of user satisfaction just by looking at this. Whereas if you actually needed to look at task success, you would have to have knowledge about what was going on, what the system utterances and user inputs were, and so forth, to see whether the right entity has been found.

I am proposing a novel BiLSTM reward estimator that, first of all, improves the estimation of interaction quality itself, and that also improves the dialogue performance, and this is done without explicitly modelling temporal features. So you see this framework here, where we don't evaluate the task success anymore but instead estimate the user satisfaction, which is an idea that was originally published two years ago already. Funnily enough, that was at Interspeech, also in Stockholm, so apparently I only talk about this topic in this city.

So, to model the user satisfaction, we use the interaction quality as a less subjective metric; I will use both terms for the same purpose here. Previously, the estimation made use of a lot of manually handcrafted features to encode the temporal information, and in the proposed estimator, as I will show on the next slide, there is no need to explicitly encode that temporal information. So there are two aspects to this talk: one is the interaction quality estimation itself, and afterwards I will talk about how using it as a reward actually influences the dialogue policy.

So, first of all, let's have a closer look at interaction quality and how it is modelled, or rather how it used to be modelled with all the handcrafted stuff going on. You see the modular architecture of a dialogue system; information is extracted from the user input and the system response, and these together constitute one exchange. So you end up with a sequence of exchanges from the beginning of the dialogue up to the current turn t. We call it an exchange in this case because it contains information about both sides, user and system.

This is the exchange level, and the temporal information used to be encoded on the window level, where you look at the last few exchanges, a window of three in this example, and also on the overall dialogue level. The parameters from all levels were then concatenated and fed into a classifier to estimate the interaction quality.

The interaction quality itself is then obviously the supervision signal. It is annotated on a scale from five to one, with five representing satisfied and one representing extremely unsatisfied. It is important to understand that every interaction quality annotation, which exists for every exchange, actually models the whole subdialogue from the beginning up to that exchange. So it is not a measure of how good this turn or the system reaction was, but a measure of how satisfied the user was from the beginning up to now. So if it goes down, it might be that the last turn wasn't really great, but many things before could also have had an influence.
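To make this concrete, a minimal sketch of how such per-exchange labels turn into training examples (the helper name is hypothetical, not from the paper):

```python
# Every exchange t yields one training example: the whole subdialogue e_1..e_t
# paired with the interaction quality label annotated at exchange t.
def make_subdialogue_examples(exchanges, iq_labels):
    # exchanges: list of per-exchange feature vectors; iq_labels: values in 1..5
    return [(exchanges[: t + 1], iq_labels[t]) for t in range(len(exchanges))]
```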

The novel model I propose gets rid of those temporal features; it only uses a very small set of exchange-level features, which you can see here: the ASR recognition status, the ASR confidence, whether it was a re-prompt or not, the utterance type, so is it a statement or a question and so on, or whether the system action is a confirm, a request, or something like that. So these are the parameters we use.

Note that no handcrafted temporal features are needed anymore. This exchange representation is then used as input to an encoder, a recurrent encoder using an LSTM or a BiLSTM. For every subdialogue we want to estimate one interaction quality value, so every subdialogue is fed into the encoder to generate a sequence of hidden states, with an additional attention layer, with the hope of figuring out which turn actually contributes most to the final estimation of the interaction quality itself. The attention layer computes a set of attention weights over the hidden states based on their context.
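To make the architecture concrete, here is a minimal sketch in PyTorch; the layer sizes and names are my own assumptions, not the exact configuration from the paper:

```python
import torch
import torch.nn as nn

class BiLSTMAttentionIQ(nn.Module):
    """BiLSTM encoder with an attention layer over the hidden states,
    predicting one of the five interaction quality classes per subdialogue."""

    def __init__(self, feature_dim, hidden_dim=64, num_classes=5):
        super().__init__()
        self.encoder = nn.LSTM(feature_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)          # one score per hidden state
        self.out = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, exchanges):
        # exchanges: (batch, turns, feature_dim), one subdialogue per row
        hidden, _ = self.encoder(exchanges)                # (batch, turns, 2*hidden_dim)
        weights = torch.softmax(self.attn(hidden), dim=1)  # attention weights over turns
        context = (weights * hidden).sum(dim=1)            # weighted sum of hidden states
        return self.out(context)                           # logits over the 5 IQ classes
```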

And here are the results of applying this to the task of interaction quality estimation. You see the unweighted average recall, which is just the arithmetic average over all class-wise recalls.
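As a quick reminder of how this metric is computed, a small sketch; it is simply the macro-averaged recall:

```python
# Unweighted average recall (UAR): the arithmetic mean of the per-class recalls.
from sklearn.metrics import recall_score

def unweighted_average_recall(y_true, y_pred):
    return recall_score(y_true, y_pred, average="macro")
```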

The grey ones are the baselines: one, from 2015, uses a support vector machine with the handcrafted temporal features, and the second one, from 2017, is a neural network approach which makes use of a different architecture but still relies on the temporal features. As test data we use the LEGO corpus, which contains two hundred bus information dialogues with the Let's Go system of Pittsburgh.

Those results are computed with ten-fold dialogue-wise cross-validation, and you can see that the best performing classifier is the BiLSTM with the attention mechanism. We compare it with a pure BiLSTM and with plain LSTMs, with and without attention. It achieves an unweighted average recall of 0.54, which is an increase of 0.09 over the previous best.

Now, those numbers alone don't seem to be very useful, because if you want to estimate a reward, you want very good quality and a certain amount of certainty that what you actually get as an estimate can be used as a reward; it is not as if you just get right or wrong indicators.

Another measure we used for the evaluation is the extended accuracy, where we did not just look at the exact match but also at neighbouring values. So estimating a five, even though it was originally a four, would still be counted as correct, because the way we transfer those interaction quality values into the reward means it is not a big problem if you are off by one; you will see later how this is done. And then we can see that we actually get very good values above a ninety percent accuracy rate, 0.94 with the proposed approach, which is 3.06 points better than the previous best result.
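A small sketch of the extended accuracy as described here, counting a prediction as correct if it is off by at most one interaction quality point:

```python
def extended_accuracy(y_true, y_pred):
    # correct if the predicted IQ value is within one point of the annotation
    hits = sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred))
    return hits / len(y_true)
```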

This estimate from the best performing model, the BiLSTM with the attention mechanism, is then used to train dialogue policies.

First of all, we have to address the question of how we can make use of an interaction quality value in a reward. Here we see that for the reward based on interaction quality, we use a turn penalty of minus one per turn, and then we actually scale the interaction quality so that it takes values from zero to twenty, in correspondence with the task success baseline that has been used in many different papers already, where you also have the turn penalty and get plus twenty if the dialogue was successful and zero if not. So you get the same value range, but with a more fine-grained interpretation of how the dialogue actually did.
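A sketch of the two reward schemes; the exact scaling of the interaction quality is my assumption, a linear map of the final IQ estimate from 1..5 onto 0..20:

```python
def task_success_reward(num_turns, success):
    # baseline: -1 per turn, +20 if the dialogue was successful
    return -num_turns + (20 if success else 0)

def interaction_quality_reward(num_turns, final_iq):
    # -1 per turn, final IQ estimate scaled so that IQ 1 -> 0 and IQ 5 -> 20
    return -num_turns + 5 * (final_iq - 1)
```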

We compare the best performing estimator again against the support vector machine estimator from previous work. For the evaluation we use PyDial, with the focus tracker and the GP-SARSA policy learning algorithm, and we use different simulation environments with zero percent, fifteen percent, and thirty percent error rates.

We used two different evaluation metrics. One is the task success rate, because even though we want to optimize towards interaction quality or user satisfaction, it is still very important to also complete the task successfully; it doesn't help if the estimator says this was a very nice dialogue but the user didn't achieve the task, that is of no use. The second metric we use is the average interaction quality, where we just take the estimate and compute the average of all final estimates over the dialogues.

And to address the aspect of domain independence, we actually look at many different domains. The estimator has been trained on the Let's Go domain, where we have the annotations, but the domains in which the dialogue policies are learned are completely different ones: we have the Cambridge restaurants domain, the San Francisco restaurants domain, and the laptops domain. They have different complexity and different aspects to them. So this basically showcases that the approach is actually domain independent: you don't need to have information about the underlying task.

So now the question is: how does this actually perform? There are a lot of bars here, obviously, because there are a lot of experiments; you can read about them in the paper. But I think what is very interesting here is that for the task success rate at the different noise levels, comparing the black bars, which is the reward using the support vector machine, with the blue ones, the reward using the novel BiLSTM approach, we can see that overall the task success increases, and this is especially interesting for higher noise rates. So here we have all domains combined, and we can see that for higher noise rates the improvement in task success is very important and almost equivalent to using the actual task success as the reward signal.

So what this slide tells us is that even though we are not using any information about the task, just looking at user satisfaction and actually estimating it, we can still get, on average, almost the same task success rate as when optimising on task success directly.

Obviously, the interaction quality is also of importance. Here we show the average interaction quality which, as I said earlier, is computed at the end of the dialogue, and we can see that there is an improvement. For the task-success-based rewards you already get decent interaction quality estimates, so the users are estimated to be not completely unsatisfied, it is quite okay, but by optimising towards the interaction quality itself, you also get an improvement on that side. This is not very surprising, because you are actually optimising towards the very value that is shown here, so it would be bad if it were not the case; so it is mostly more of a proof of concept.

As I said earlier, this was all done in simulation, these were simulated experiments. In my publication two years ago I already did evaluations with humans as a validation: we had humans talking to the system, and we used the interaction quality estimates directly to learn a dialogue policy. You see the moving average interaction quality and the moving average task success rate; the green curve is using interaction quality and the red one is using task success. You can see that in terms of task success there is not a real visible difference here. However, when you look at the interaction quality, you see the same effect as in the simulated experiments: already after a few hundred dialogues you get an improvement in the interaction quality estimate.

So what have I told you today? We used the interaction quality to model user satisfaction for subdialogues. I presented a novel recurrent neural network model that outperforms all previous models without explicitly encoding the temporal information. This better-performing model was then used to learn dialogue policies in unseen domains; it did not require knowledge about the underlying task, it increased user satisfaction, and it proved to be more robust to noise. And these simulated experiments had already been corroborated in human evaluation a while ago.

For future work, it would obviously be very beneficial to apply this to more complex tasks, and also to get a better understanding of what the actual differences in the learned policies are, to be able to turn this into new knowledge.

Thank you.

Time for questions.

Hi, I have two questions. The first one is about the data and whether this would also work for other domains, okay. And my other question is: one problem that I actually have right now is that we only have a normalized satisfaction score of the user for the whole dialogue, and we don't have an annotation at each dialogue turn. What would you say about that? Because here you have annotations at every turn. What is your intuition about how to go from that kind of global, dialogue-level user satisfaction to a turn-level user satisfaction estimation?

I think that is a very interesting question. I think that is probably the biggest disadvantage of this approach, that you seem to need those turn-level annotations. I think they are quite important during the learning phase, because during dialogue policy learning you see a lot of interrupted dialogues and all those things, and if you don't have a good estimate for those, it is very hard to learn from them, because an interruption can come from anything; basically, somebody hangs up the phone even though the dialogue was pretty good until then, and you wouldn't be able to extract anything out of it. So I think if you only have dialogue-level estimates, you can still make it work, but you need to set up your policy learning more carefully; maybe disregard some dialogues entirely because you won't be able to take anything out of them. But then it can actually work quite well, I think.

I don't think the estimator itself needs turn-level annotations. As I said, those are subdialogues, and if you only consider the whole dialogue and you have enough annotations of those, not only the couple of hundred dialogues we have here but, I don't know, thousands or millions, whatever scale you operate at, then I think it is possible without the turn-level ones. Or you can try using the estimator trained on Let's Go and then apply it to your own data.

We have time for two more questions.

Hi, thanks for the talk. I was wondering, a lot of people have reported, for instance in the Alexa challenge, that this user satisfaction can be very noisy. Now, your corpus was collected some years back. Did you see this noise in the corpus and in the annotations, and how do you think this is affecting the way you are predicting this interaction quality reward?

So the idea of the interaction quality is specifically to reduce that noisiness. The interaction quality was not collected by having people rate their own dialogues; it was rated by expert raters after the dialogue. So you have people sitting there following guidelines, some general guidance on how to apply those interaction quality labels, and annotating based on that, and then you have multiple raters for every exchange, which reduces the noisiness. So this was done to actually reduce the noisiness, and for the data we have, we were able to control it.

One last question.

Hi, I'm from Google.

Did you see cases where the interaction quality predictions within a dialogue changed dramatically, and were there interesting patterns, so cases of interesting recovery within a dialogue, or something that could be learned from these stepwise processes in dialogues?

Well, the estimation is not one hundred percent accurate, so you do see drops there. But in the annotation you don't see any drops, because based on the guidelines that was basically forbidden. The idea was to have a more consistent labelling, and it was regarded as rather unlikely that one single event alone would drop the satisfaction level from, say, three to one or something like that. So in the annotation you don't see these drops.

But for the learned policies, I haven't yet done the analysis of what has actually been learned, comparing this to other things, maybe to human dialogues or simulated dialogues. But this is, as I said, part of the future work. I think this will hopefully shed a lot of insight into what these different reward signals actually learn and how we can make use of that.

Thank you.