Hi everyone. I'm going to present work done together with my co-authors. Today I'm going to talk about the dialogue policy learning problem for task-oriented visual dialogue.

First, let me introduce the problem. The visually grounded setting we want to study is one where a visual dialogue agent engages with the user to help locate and identify a target image.

Here you can see that twenty similar images are presented to the agent, and at the first turn the user provides a description related to the image they want.

The agent then plays a more proactive role by asking relevant, discriminative questions, and once it is confident enough it makes a decision, so as to finish the task within a minimal number of turns.

In this setting there are two main challenges. The agent needs to learn and understand the multimodal representations, and it also needs to be aware of the dynamic dialogue context, especially when receiving signals for making decisions, for example wrong information or wrong guesses.

So the main goal for the agent is to learn an efficient dialogue policy to accomplish the task.

The motivation also comes from potential real applications. Imagine a virtual online shopping assistant that helps customers by proposing or recommending products based on the user's preferences and the multimodal context throughout the dialogue.

As a side note, we are also working on task-oriented visual dialogue based on a fashion dataset, and hopefully we will have something interesting to show next year.

Previous research on visual dialogue has mainly focused on visual and language understanding and generation, where a questioner bot and an answerer bot talk with each other for a fixed number of turns.

However, we focus on the dialogue policy learning problem of the questioner agent, so that within the dialogue the questioner bot can play a more constructive role in helping the human accomplish the task.

We want to evaluate the efficiency and robustness of the dialogue policies in terms of task-oriented dialogue system metrics.

It is also worth mentioning that our work is related to hierarchical reinforcement learning. Basically, we view this as a two-stage problem. First, we want to obtain a high-level dialogue policy that decides whether to issue an information query or to make a decision and do image retrieval. Then we have a lower-level policy that selects the primitive actions, such as which question to ask.

Hierarchical reinforcement learning has been applied in multi-domain dialogue systems, but ours involves a multimodal context and action space. Our architecture also resembles feudal reinforcement learning, which has some nice properties such as state abstraction, state sharing, and sequential execution.
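To make the two-stage structure concrete, here is a minimal sketch of how a high-level policy and a low-level question selector could interact in one episode; all class and method names are illustrative assumptions, not our actual implementation.

```python
# Minimal sketch of the two-stage dialogue policy (names are illustrative).
ASK, GUESS = 0, 1

def run_episode(high_policy, question_selector, retriever, simulator, max_turns=10):
    """High-level policy chooses between asking and guessing;
    the low-level policy picks which question to ask."""
    state = simulator.reset()                          # caption + candidate image pool
    for _ in range(max_turns):
        option = high_policy.act(state)                # high level: query info or retrieve
        if option == ASK:
            question = question_selector.act(state)   # low level: which question to ask
            answer = simulator.answer(question)       # user simulator replies
            state = simulator.update_state(state, question, answer)
        else:
            guess = retriever.best_match(state)       # image retrieval from belief state
            reward, done = simulator.guess(guess)
            if done:
                return reward
    return 0.0
```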

Here is an overview of the information flow in our proposed framework. We have a user simulator module, which drives the dialogue state transitions and also provides the reward signals. The generated answers feed into the visual dialogue matching and embedding module to update the visual belief state with the new question-answer pairs, and this module also communicates with the dialogue state tracking module through an attention signal. The dialogue state tracking module then forms the visual dialogue state representations, which the high-level policy learning module uses to learn the policy over asking questions versus making a guess. Finally, we have a specialized question selection module to learn which question to ask.

The first module is the visual dialogue matching and embedding module. The goal of this module is to learn an encoder that maps the image and the text information into a joint space. The intuition is that we want the agent to be able to understand the visual content and the semantic relation between the image and the dialogue context.

We also need to pre-train this module in order to have robust and efficient reinforcement learning training, and its output can also be used for image retrieval. To verify the performance of this module we perform a sanity check: it achieves high image retrieval accuracy in this guessing-game setting, which means it can provide reliable signals for our reinforcement learning training.
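As a rough illustration of such a joint embedding, here is a minimal dual-encoder sketch with a hinge ranking loss; the dimensions, loss, and names are assumptions, not the actual pre-trained model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedder(nn.Module):
    """Toy dual encoder mapping image features and dialogue text features
    into a shared space (dimensions are placeholder assumptions)."""
    def __init__(self, img_dim=2048, txt_dim=512, joint_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, joint_dim)
        self.txt_proj = nn.Linear(txt_dim, joint_dim)

    def forward(self, img_feats, txt_feats):
        img = F.normalize(self.img_proj(img_feats), dim=-1)
        txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
        return img, txt

def ranking_loss(img, txt, margin=0.2):
    """Hinge ranking loss: pull each dialogue embedding toward its own image,
    push it away from the other images in the batch."""
    scores = txt @ img.t()                              # (B, B) cosine similarities
    positives = scores.diag().unsqueeze(1)              # similarity to the matching image
    loss = (margin + scores - positives).clamp(min=0)   # hinge over the negatives
    mask = 1.0 - torch.eye(scores.size(0), device=scores.device)
    return (loss * mask).mean()
```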

In the visual dialogue state tracking module we track three types of state information. The vision belief state represents the agent's internal decision-making model and is the output of the visual dialogue matching and embedding module. The vision context state captures the visual features of the environment, and here we apply a technique we call state adaptation. The intuition is that we want to adapt the vision context so that it conforms better to the vision belief state, that is, the decision-making model of the agent, based on feedback attention signals.

The attention signal here is calculated from the semantic similarity score between the vision belief state and each image vector, and we then take the weighted average over the image vectors. In the case of a wrong guess, we set the attention weight for that image to zero. We also track alignment information such as the number of questions asked, the number of image guesses, and the last guess.
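A minimal sketch of that adaptation step, under the assumption that the weights are a normalized similarity between the belief state and each image vector, with wrongly guessed images zeroed out:

```python
import numpy as np

def adapt_vision_context(belief_state, image_vecs, wrong_guesses):
    """Weight each candidate image by its similarity to the current belief state.
    `wrong_guesses` is the set of image indices already guessed incorrectly."""
    sims = image_vecs @ belief_state / (
        np.linalg.norm(image_vecs, axis=1) * np.linalg.norm(belief_state) + 1e-8)
    attn = np.exp(sims)                    # assumption: softmax-style normalization
    for idx in wrong_guesses:
        attn[idx] = 0.0                    # wrong guesses get zero attention
    attn = attn / (attn.sum() + 1e-8)
    return attn @ image_vecs               # weighted average = adapted vision context
```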

Given the tracked dialogue state, we come to the policy learning modules. Since we have two separate Q-networks, we apply the double DQN method, and we also apply prioritized experience replay to improve the sample efficiency.
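Since the talk only names these techniques, the following is a generic textbook-style sketch of the double DQN update, not code from this system; prioritized replay would additionally sample transitions in proportion to their TD error.

```python
import torch

def double_dqn_loss(batch, online_net, target_net, gamma=0.99):
    """Standard double DQN: the online net picks the next action,
    the target net evaluates it."""
    states, actions, rewards, next_states, done = batch
    q_sa = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_a = online_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(next_states).gather(1, next_a).squeeze(1)
        target = rewards + gamma * next_q * (1.0 - done)
    return torch.nn.functional.smooth_l1_loss(q_sa, target)
```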

Another important part of reinforcement learning is the reward design. The reward for the policy training can be decomposed into the question rewards and the image retrieval rewards. We add a reward shaping term to the question selection, which captures the information gain of the question asked. Here we calculate it as the difference in the similarity score between the visual belief state and the target image vector before and after the question is answered.
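Read that way, the shaped question reward is an information-gain term; here is a minimal sketch, assuming cosine similarity is the matching score:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def question_shaping_reward(belief_before, belief_after, target_image_vec):
    """Reward a question by how much it moves the belief state toward the target image."""
    return cosine(belief_after, target_image_vec) - cosine(belief_before, target_image_vec)
```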

The question selection module selects the most informative question to ask at each turn, based on the shared visual context state. We use a deep reinforcement relevance network, which is able to handle a large, discrete, text-based action space.

The Q-value is estimated from the embedding vectors of the vision context and the candidate questions. The reward here is the intermediate, shaped question reward we just discussed, and we use a standard exploration strategy as the behavior policy during training.
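A rough sketch of a DRRN-style Q-function, where the Q-value is an interaction (here a dot product) between a state embedding and each candidate question embedding; all layer sizes are placeholders:

```python
import torch
import torch.nn as nn

class DRRNHead(nn.Module):
    """Q(s, a) as an interaction between state and question embeddings (DRRN-style sketch)."""
    def __init__(self, state_dim=256, question_dim=300, hidden=128):
        super().__init__()
        self.state_enc = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.question_enc = nn.Sequential(nn.Linear(question_dim, hidden), nn.ReLU())

    def forward(self, state, question_feats):
        # state: (B, state_dim); question_feats: (N, question_dim) candidate questions
        s = self.state_enc(state)                 # (B, hidden)
        q = self.question_enc(question_feats)     # (N, hidden)
        return s @ q.t()                          # (B, N) Q-values, one per question
```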

To train the reinforcement learning agent we need a user simulator, so we build a corpus-based simulator: each game instance consists of a set of similar images, and each image corresponds to ten rounds of question-answer pairs. This module provides the reward signal, checking whether a guess matches the target image, and it also checks the termination conditions. There are three types of termination conditions: first, the agent's guess is correct; second, the maximum number of guesses is reached; and third, the maximum number of dialogue turns is reached, which depends on the experiment setting. We define the win and loss rewards as plus and minus ten, plus a negative penalty for each wrong guess.
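For illustration, the simulator's termination and reward logic could look like the sketch below; the reward magnitudes are placeholders rather than the exact values used in our experiments.

```python
# Sketch of the simulator's termination and reward logic.
# The exact reward magnitudes are placeholder assumptions.
WIN_REWARD = 10.0
LOSS_PENALTY = -10.0
WRONG_GUESS_PENALTY = -1.0     # placeholder per-wrong-guess penalty

def step_guess(guess_idx, target_idx, n_guesses, max_guesses, turn, max_turns):
    """Return (reward, done) after the agent guesses an image."""
    if guess_idx == target_idx:
        return WIN_REWARD, True                   # condition 1: correct guess
    if n_guesses + 1 >= max_guesses:
        return LOSS_PENALTY, True                 # condition 2: guess budget exhausted
    if turn + 1 >= max_turns:
        return LOSS_PENALTY, True                 # condition 3: turn limit reached
    return WRONG_GUESS_PENALTY, False             # keep going, small penalty
```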

To evaluate the contribution of each component within our framework, we compare five policy models. The first baseline is a random policy that selects a question or makes a guess at any state. Then we add DQN to optimize the high-level decision making, and on top of that the hierarchical policy with the lower-level question selection process. We also evaluate the state adaptation and reward shaping techniques to see how they affect policy learning.

Because we want to evaluate the efficiency and robustness of the dialogue policy, we construct three sets of experiments, step by step. In the first experiment, the agent only selects questions from the collected data, and it also obtains question-answer pairs generated by humans for the target image. This relatively clean setting allows us to verify the effectiveness of our framework.

Then we increase the task difficulty by enlarging the number of questions: there are two hundred questions generated by humans, and the answers are generated using a visual question answering model with respect to the target image. In the third experiment, we scale up the testing process to use question-answer pairs generated automatically with pre-trained question generation and question answering models, which simulates a more noisy, realistic setting.

We evaluate the policy models every one thousand iterations during the training process: we freeze the policy and look at evaluation metrics such as the win rate and the average number of dialogue turns.

Here are the results of Experiment One, where we constrain the maximum number of dialogue rounds to ten and use ten predefined questions. The learning curves show the win rate and the average turn reward. We can see that the optimal model is the last policy model, which has a faster convergence rate and outperforms the models with only the hierarchical policy, the question selection, or the state adaptation.

Another question we want to answer is whether the hierarchical reinforcement learning policy enables efficient decision making. Here we define an oracle baseline in which the agent keeps asking questions in order and only makes the guess at the end of the dialogue; an oracle at N means the agent asks N rounds of questions and then makes a single decision. We found that our optimal dialogue policy achieves a significantly higher win rate than the oracle at seven turns, and a comparable win rate to the oracle baseline at eight turns, where there is no statistically significant difference.

We also note that the oracles at nine and ten turns have higher win rates because they can gather more information over longer dialogues. So we can see that our hierarchical reinforcement learning policy enables efficient decision making.

We further want to evaluate the robustness of our dialogue policies. In Experiment Two, we increase the number of questions and use a visual question answering model as the user simulator to generate the answers. We can see that our proposed model achieves the best performance in this noisier setting.

In Experiment Three, we further increase the task difficulty. As we know, in real use the training and the test data can be very different, so here we use a different user simulator and a different testing dataset.

We observe that the performance drops, but our proposed reward is more robust to noise. We also think there is a potential application of using this reward as a quality filter for datasets constructed by two humans talking to each other, since the dialogue quality there varies and is largely casual, so it may not be very suitable for task-oriented applications.

Here are sample dialogues from our systems: a success example from Experiment Two and a failure example from Experiment Three. As we can see in the success example, the dialogue policy is able to select relevant questions, for example ones related to the color and the birds, and although some wrong guesses happen and some of the answers are noisy, it does a good job of self-correcting and makes the right guess in the end. In the failure example, since the questions and answers are over-generated using a sequence-to-sequence model, the questions in this setting are more generic rather than very specific.

To summarize, we propose a task-oriented visual dialogue task setting that is applicable and extensible to real applications. We also propose a hierarchical reinforcement learning framework to effectively learn the multimodal state representation and an efficient dialogue policy, and we propose a state adaptation technique to make the vision context representation more relevant to the visual dialogue state. Finally, we evaluate with task-oriented dialogue system metrics in different simulated scenarios to validate task completion efficiency and robustness.

For future work, we plan to extend and apply the framework studied here to real applications such as online shopping scenarios. We can also explore ways to incorporate domain knowledge such as ontologies and database interactions into the multimodal dialogue system to enable large-scale information retrieval tasks.

Thanks.

Q: How do you pass the signals between the different modules? Basically, how do you model the reward and how the reward works, I guess.

A: As I mentioned, the reward mostly drives the reinforcement learning part, that is, the high-level policy and the question selection module. It consists of three parts of rewards, as I mentioned: the task reward, the question reward, and also the intermediate penalty for making a wrong guess. The reward for question selection actually applies the reward shaping technique, so we measure, basically, the similarity between the two embedding vectors.

Q: In a real environment, how would the system know that a guess is wrong?

A: Because we run in simulation, we have a predefined target image, so this is controlled by the simulator module: it evaluates at each state whether the guess is correct or not, and so we can get those signals during the training process.

Q: Sorry, about the question selection: do you have any ideas on how to find the most informative question? Here it is a fixed set of questions; have you considered generating the most distinguishing question with end-to-end question generation?

A: I think that is a good question. Here it is basically a discriminative approach to selecting questions from the dataset, because there is a given question pool, so we can only select from that set of questions. A more interesting direction is how we can generate discriminative questions in an online fashion, and I think that is something to explore in the future.