Right, on to the last talk of the session.

I'll try to keep it fun; there are a lot of videos, so hopefully it will be engaging. This work is on using reinforcement learning for modeling incrementality in the context of a fast-paced dialogue game.

This is joint work with my advisors, David DeVault and Kallirroi Georgila.

So, incrementality is what this work is focused on. Human speech processing is incremental: we process content word by word, and sometimes sub-word, trying to process it as soon as it becomes available. Incrementality helps us model different natural dialogue phenomena such as rapid turn-taking, speech overlaps, barge-ins, and backchannels, and modeling these is very important for making dialogue systems more natural and efficient.

The contributions of this work can be grouped into three points. First, we apply a reinforcement learning method to model incrementality. Second, we provide a detailed analysis of what it learned. In our previous work we built a state-of-the-art, carefully designed rule-based baseline system which interacts with humans in real time and performs nearly as well as humans, so it is a really strong baseline; you will see videos and get more context in the slides to come. The reinforcement learning model introduced here actually outperforms that carefully-designed-rules (CDR) baseline, but please keep in mind this is an offline result; we do not have a real-time system with it yet, but it does outperform the baseline. Third, we provide some analysis of the development time it took to build each approach.

On to the domain. The domain is a fast-paced dialogue game that we call RDG-Image. It is a two-player, collaborative image-matching game: each person is assigned a role, either the director or the matcher. The director sees eight images, as you see on the screen, and one of them is highlighted with a red border; the director is supposed to describe that one. The matcher sees the same eight images in a different order and is supposed to make a selection based on the description given. The goal is to get as many matches as possible in the allotted time, which is why the dialogue is fast and incremental.

Let's look at an example of how this game works. Here you see two human players playing with one another. The person on the top is the director: they see one of the images highlighted with a red border and describe that highlighted image. The person below is the matcher, who tries to guess the image based on the description. There is also a timer and a score, so points depend on both.

(Video of two humans playing the game plays.)

Okay, so in this particular game the dialogue is very fast and incremental, and there is a lot of rapid turn-taking. It is a fast-paced game, and it is fun.

We collected a lot of data from these human-human conversations, and then we designed an incremental agent named Eve. She is a high-performance baseline system, trained on the human conversation data; I will provide more details in the coming slides. We evaluated her with 125 users, and she performs nearly as well as the humans.

This video shows how an interaction between Eve and a human goes; it is the same game. On the top you will see the eight images that the human sees, and on the bottom you see Eve's eight images, with green bars going up and down; those are basically her confidence in each image, and they change based on the human's descriptions.

(Video plays: the human describes images such as "a yellow bird", "a sleeping black and white cat", and "a bike with handlebars", and Eve responds in real time.)

Alright, so that was Eve playing the game with humans in real time; she is the baseline agent we begin with.

So how does she work? Basically, the user's speech comes in to an incremental ASR, which is Kaldi, and it provides a 1-best hypothesis every hundred milliseconds. We use this hypothesis to compute a confidence distribution over all eight images on the screen. The dialogue policy then uses this distribution to decide whether to wait, to select, or to skip. The wait action is when she stays silent and keeps listening. Select is when she has enough confidence to make the selection. Skip is when she is thinking, "hey, I am not getting much information here; maybe I will just skip, go to the next one, and hope I get that one right." The natural language generation is very simple and template-based: if the decision is select, she says "got it," as you heard in the video, and if it is skip, she says something like "let's move on."
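To make the wait/select/skip loop concrete, here is a minimal sketch of that per-partial decision step (Python, with illustrative names; the NLU is assumed to have already produced the eight confidences, and `policy` stands in for whichever policy is plugged in):

```python
from typing import List, Tuple

def decision_step(confidences: List[float], elapsed_s: float, policy) -> Tuple[str, int]:
    """One tick of the agent, called every 100 ms with the NLU's confidence
    distribution over the 8 candidate images for the current ASR partial.
    Returns the chosen action ("wait", "select" or "skip") and, for "select",
    the index of the image with the highest confidence (-1 otherwise)."""
    p_star = max(confidences)                  # highest image confidence, P*
    action = policy.choose(p_star, elapsed_s)  # the policy only sees (P*, time consumed)
    best_image = confidences.index(p_star) if action == "select" else -1
    return action, best_image
```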

Now, the focus of this work is the dialogue policy. The dialogue policy in the previous work uses hand-designed rules; we call them carefully designed rules, and I will explain why in a minute. We asked whether we could do better than this current baseline, so we used reinforcement learning and tried to see whether it could perform better.

The carefully designed baseline uses these quantities. The first is P*, which is the highest probability assigned by the NLU to any one of the eight images. Then there are two threshold values, the identification threshold and the give-up threshold. The identification threshold is the minimum confidence that P* has to reach for any given image before Eve will say "got it," and the give-up threshold is the maximum time she waits, after which she says "skip"; at any time in between, she is waiting. That is what the carefully-designed-rules baseline system is.
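As a rough illustration, and not the exact published rules or threshold values, the carefully-designed-rules baseline can be pictured as two thresholds applied to P* and the elapsed time:

```python
class CDRPolicy:
    """Sketch of the carefully-designed-rules baseline: two tuned thresholds.
    The numeric defaults are placeholders, not the values tuned from the data."""

    def __init__(self, identification_threshold: float = 0.8,
                 give_up_threshold_s: float = 8.0):
        self.it = identification_threshold  # minimum P* at which Eve says "got it"
        self.gt = give_up_threshold_s       # maximum time Eve waits before skipping

    def choose(self, p_star: float, elapsed_s: float) -> str:
        if p_star >= self.it:     # confident enough in some image -> select it
            return "select"
        if elapsed_s >= self.gt:  # waited too long without confidence -> skip
            return "skip"
        return "wait"             # otherwise stay silent and keep listening
```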

Why do we call it a carefully-designed-rules baseline? In published comparisons of learned policies against rule-based ones, it is often unclear how much time was actually spent designing the rule-based system; in this work we report that. The identification threshold IT and the give-up threshold GT are not some random values that we picked; they are tuned from the human-human conversation data. We use something called the Eavesdropper framework to tune IT and GT; for more details, please refer to our paper from 2015. We spent almost one month trying to find the best way to design these policies; predicting the next word is one example of something we tried. So this is what it looks like: we designed these rules, and the agent performs nearly as well as humans, so it is a really strong baseline.

But even though these are carefully designed rules, she still has a few limitations, which I will group into case one, case two, and case three. In this slide, the x-axis is time as the game unfolds, and the y-axis is the confidence assigned by the NLU. Each of the points corresponds to a partial hypothesis coming in from the ASR, so the confidence keeps changing. In case one, Eve is too eager to skip. In case two, she is too eager to select: with incremental speech recognition we often get unstable partial hypotheses, and that can lead to premature commitments. And in case three, Eve could actually save time by selecting much earlier. So these are three cases where Eve could perform better.

So we use reinforcement learning. The state space is represented by a tuple ⟨P*, TC⟩, where P* is the highest confidence in any one of the eight images and TC is the time consumed. The actions are select, skip, and wait. We do not model the transition probabilities explicitly, and the reward is very simple: if Eve gets the image right she gets a reward of plus one hundred, and if she gets it wrong it is minus one hundred. The reward for waiting is a very small epsilon value, very close to zero, and she gets a slightly larger reward for skipping.
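A minimal sketch of that formulation, assuming placeholder values for the wait and skip rewards (the talk only says they are near zero, with skip slightly larger):

```python
from typing import NamedTuple

class State(NamedTuple):
    p_star: float         # highest NLU confidence over the 8 images
    time_consumed: float  # seconds elapsed on the current target image

ACTIONS = ("wait", "select", "skip")

def reward(action: str, selected_correctly: bool = False,
           eps: float = 1e-3, skip_bonus: float = 1e-2) -> float:
    """Simplistic reward as described in the talk: +100 / -100 on a selection,
    a tiny positive value for waiting, and a slightly larger one for skipping.
    eps and skip_bonus are illustrative placeholders, not the tuned values."""
    if action == "select":
        return 100.0 if selected_correctly else -100.0
    if action == "skip":
        return skip_bonus
    return eps  # wait
```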

The data we use for this experiment comes in three flavours: the human-human data we collected in the lab, human web-interaction data collected in another experiment, and Eve's interactions with humans, the 125 users I was talking about. Altogether there are more than thirteen thousand sub-dialogues. We split them by user: ninety percent of the users are used for training and ten percent for testing. For reinforcement learning we use LSPI, least-squares policy iteration, and we use radial basis functions for representing the features.
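As a sketch of what a radial-basis-function representation of this two-dimensional state could look like (the centre positions and widths here are arbitrary, not the ones used in the actual experiments):

```python
import numpy as np

def rbf_features(p_star: float, time_consumed_s: float,
                 conf_centres=np.linspace(0.0, 1.0, 5),
                 time_centres=np.linspace(0.0, 15.0, 5),
                 sigma_conf: float = 0.25, sigma_time: float = 3.0) -> np.ndarray:
    """Map the 2-D state (P*, time consumed) to a grid of Gaussian RBF activations.
    Centre positions and widths are illustrative placeholders."""
    feats = [np.exp(-((p_star - c) ** 2) / (2 * sigma_conf ** 2)
                    - ((time_consumed_s - t) ** 2) / (2 * sigma_time ** 2))
             for c in conf_centres for t in time_centres]
    feats.append(1.0)  # bias term
    return np.asarray(feats)
```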

So how does it operate? Every hundred milliseconds the ASR gives out a partial, P* is assigned by the NLU, and the policy decides whether the action is wait, select, or skip. If it is wait, we simply sample the next time step from the data, that is, the next hundred milliseconds of what happened, which gives a new value of P* and a new policy decision. This keeps happening until we see a selection or a skip; at that point we know the ground truth, so we can assign the reward values based on that.
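In other words, an offline "episode" is just a logged sub-dialogue replayed partial by partial. A hedged sketch of that replay loop, assuming the logged data has been reduced to (P*, time, best-image) tuples per 100 ms step:

```python
def replay_subdialogue(partials, policy, correct_image: int):
    """Replay one logged sub-dialogue against a policy and return (state, action,
    reward) transitions for offline training. `partials` is assumed to be a list of
    (p_star, time_consumed, best_image_index) tuples, one per 100 ms ASR partial."""
    transitions = []
    for p_star, t, best_image in partials:
        action = policy.choose(p_star, t)
        if action == "wait":
            transitions.append(((p_star, t), action, 1e-3))  # tiny reward, keep listening
            continue
        if action == "select":
            r = 100.0 if best_image == correct_image else -100.0
        else:  # "skip"
            r = 1e-2  # placeholder skip reward, slightly above the wait reward
        transitions.append(((p_star, t), action, r))
        break  # the episode ends on a select or a skip
    return transitions
```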

This is a snapshot of how things unfold. On the x-axis you see the partials; each of those markers is a partial coming in from the ASR. On the y-axis you see the confidence assigned by the NLU. In this example the baseline agent skips at this point, while the RL agent keeps waiting until she sees very high confidence, and hence she gets the image right.

Okay, I want to take a little time to explain this graph; it is a bit dense. On the horizontal axis you see three groups: on the left the wait actions, in the middle the skip actions, and on the right the select actions. This graph shows the complete state space, everything in the state. The red dots indicate the baseline agent's decisions, and the blue dots show what was learned by the reinforcement learning policy. On the vertical axis you see the time, going from zero to fifteen seconds, and on the other axis you see the confidence, going up to one.

You can see the red dots fit tightly together; it is a rule-based system, so we can tell deterministically what action the agent takes in any state. The blue dots are the actions learned by reinforcement learning. She is learning broadly similar things, but there are some differences: the reinforcement learning policy learns to select an image only at very high confidence, extremely high confidence close to 1.0, when the time consumed is low, and when the time consumed is not so low, she learns to wait more. By waiting more she gets more partials, that is, more words, and as a result she has a better chance of performing well in the game and hence scoring more points.

This next graph is simpler. On the x-axis you see the average points scored for each of the image subsets, and on the y-axis you see the time consumed. The blue points are the reinforcement learning agent and the red ones are the baseline agent. You can see the RL agent is waiting for a longer time and scoring more points, whereas the baseline system is in a hurry to skip or make a selection. So here we have the RL agent scoring significantly more points than the baseline, and there is a trend that, while performing better, she is taking more time to make her selections.

So why could the CDR baseline not learn what the reinforcement learning policy learned? If you go back to the policy we used for the CDR baseline, the time and the confidence value P* are treated independently of each other. What reinforcement learning does is optimize the policy based on P* and the time consumed jointly, and that results in the reinforcement learning agent performing much better than the baseline agent.

This table shows how many points she scores, and PPS is the points per second, which combines both the points and the time aspects in a single number. You can see the RL agent consistently scores much higher in terms of points across all the image sets, but the points per second is where it gets interesting. In this subset the points per second is 0.09 for the baseline and 0.14 for the RL agent, which means that by scoring more points she is actually doing better in the game, because her points per second is also a lot higher. And in the necklaces subset we see that even though the baseline agent scored far fewer points, its points per second is very high. That is because the baseline agent is very eager and won some points by chance, whereas RL gets more points basically by waiting more, which is reflected in her points per second.
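Points per second is simply the points scored divided by the time spent, so an agent that waits longer has to convert that extra time into proportionally more points. A trivial illustration with made-up totals:

```python
def points_per_second(total_points: float, total_time_s: float) -> float:
    """Points per second: combines score and time into one efficiency number."""
    return total_points / total_time_s

# Illustrative totals only, chosen to reproduce the PPS values quoted in the talk.
print(points_per_second(9.0, 100.0))   # 0.09
print(points_per_second(14.0, 100.0))  # 0.14
```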

I also want to discuss a little bit about effort and time. Rule-based systems are often criticized as being laborious and time-consuming to build, and they are, but this one performs nearly as well as humans, so that criticism is not entirely fair here. It took nearly the same amount of time to build the CDR baseline as to develop the reinforcement learning policy; this is of course excluding the data collection and the infrastructure-building efforts. But the advantage we get is that the RL approach is more scalable, because adding features is much easier.

As for future work, we want to investigate whether these improvements transfer to live interactions, which means plugging the reinforcement learning policy into the agent and seeing whether she actually performs better in a real user study. We also want to explore adding more features to the state space, and to learn the reward function from data using inverse reinforcement learning. Finally, I want to thank the anonymous reviewers for their very useful comments, our sponsors, including NSF, for supporting this work, and the people who provided the images we are using in this particular paper. Thank you very much; I am happy to take questions.

Thank you very much. Now it is time for questions.

Thank you very much for a nice talk. Just a clarification question regarding your reinforcement learning setup: if I am correct, you are learning from a corpus, right?

Yep.

But you are using least-squares policy iteration, which is an on-policy method that requires learning from interaction, rather than learning from a corpus.

Right, so this is one way to look at it: we kind of treat the corpus as a real interaction. That is, for each user or sub-dialogue we sample step by step, every hundred milliseconds, just as it would happen in a real interaction. For the first hundred milliseconds we have a partial, and for that first partial we have the probability distribution and the time consumed, and we use the probability distribution and the time as features. Then the next time step is sampled as it happened in the real interaction: the next partial comes in, which is something the user actually spoke in the data we collected, and it keeps going on like that. So basically it is trained per sub-dialogue, per image.

But I still think you would get an improvement if you used something like importance sampling, to account for the fact that you are seeing a trajectory that happened in the corpus rather than using an online exploration method, which is what on-policy reinforcement learning assumes.

That is a good point. I have not explored that yet, but I guess it is something worth exploring.

Thanks for the talk. Two questions. First, can you explain a little bit more how you handle the image recognition? Are you using some CNN model?

We fake the vision.

You fake the vision?

Okay, so the NLU assigns the confidences. The way the NLU is trained is that we have the human data we collected, where humans are actually describing the images: we have the descriptions from the human-human games, where one human was speaking and describing the target image, so we have the words associated with each image. Using real images and actually learning from the image, rather than faking the vision, is something we would really like to do, but in this particular work we are just learning from the text.
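Since the vision is faked, the NLU essentially maps the words of the partial description onto the eight candidate images using statistics from the human-human descriptions. A very rough sketch of that idea, a simple additive word-association scorer rather than the actual trained model:

```python
from typing import Dict, List

def image_confidences(partial: str,
                      word_counts_per_image: List[Dict[str, int]],
                      smoothing: float = 1.0) -> List[float]:
    """Toy NLU: score each of the 8 images by how often the words in the partial
    description were used for it in the human-human data, then normalise.
    `word_counts_per_image` is a hypothetical stand-in for the trained model."""
    words = partial.lower().split()
    scores = []
    for counts in word_counts_per_image:
        score = sum(counts.get(w, 0) + smoothing for w in words)
        scores.append(score)
    total = sum(scores) or 1.0
    return [s / total for s in scores]
```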

And did you play around with setting the reward for the wait action to actually be negative, so you might speed up the agent?

So we tried a lot of different things. For one, we did not start with LSPI; in the beginning we tried different algorithms, for example Q-learning, but it needed a lot more samples. And we really did try negative rewards for the wait actions, but that would mean the agent is penalized for waiting, and we do not really want that: we want the agent to be rewarded for doing well in the game rather than shaped by specific reward-function manipulation. The reward function is meant to be reflective of what is happening in the game: more points for getting matches right.

And a follow-up: have you tried switching the roles of the human and the machine in the game, where the machine has to describe the images? What would happen?

So currently the agent plays only the matcher role; it does not play the role of the director. That becomes much more complex, because we would have to incrementally generate the descriptions, but it is something we really want to do in future work. We do not know how yet.

Thanks, that was a nice talk. Just a quick question about the state representation: are you putting the partials themselves into the state?

No, the state just has the confidence and the time.

Okay, so you are not modeling the instability of the partials; that is not being captured. A partial might say "bicycle" and then get revised to something else, so the agent could be faster if you put the partials, or their stability, into the state; it could even learn from that instability, because the original partials contain that information.

the most because in the original consistent

That is right. Let me show one small thing: the instability in case two, where Eve scores incorrectly precisely because of this instability. What actually happens in the game is that the NLU confidence fluctuates, and the reinforcement learning policy, because it treats the probability and the time jointly, kind of rides out these blips: it waits a lot more, giving the confidence a chance to settle. But that is a good question.

I mean, I think if you had more information in the state, the learning would probably be more successful, because it is possible the current state weakens the MDP assumptions a little bit.

So yes, adding more features to the state is something we plan to do. Right.

Thank you. Let's thank the speaker once again.