and the next

speaker we have is shikib mehri

with the paper on structured fusion networks for dialogue which uses an end-to-end dialog model

so

please

i'm shikib mehri

and i'm here today to talk about structured fusion networks for dialogue

this work was done with

tejas srinivasan and my advisor maxine eskenazi

okay let's talk about neural models of dialogue

so neural dialogue systems do really well on the task of dialog generation

but they have several well-known shortcomings

they need a lot of data to train

they struggle to generalize to new domains

they are difficult to control

and

they exhibit divergent behavior when tuned with reinforcement learning

on the other hand traditional pipelined dialogue systems

have structured components

that allow us to easily generalize them

interpret them and control these systems

both these systems have their respective advantages and disadvantages

neural dialogue systems can learn from data

and they can learn higher level reasoning

or a higher level policy

on the other hand pipeline systems

are very structured in nature which has several benefits

yesterday there was this question in the panel of

to pipeline or not to pipeline

and to me the obvious answer seems why not both and i think that

combining these two approaches is a very intuitive thing to do

so how do we go about combining these two approaches

so in pipelined systems we have structured components so the very first thing to do

to bring the structure

to neural dialogue systems

is to neuralize these components

so using the multiwoz dataset we first define and train

several neural dialogue modules

one for the nlu

one for the dm and one for the nlg

so for the nlu what we do is

we read the dialogue context

encode it and then

ultimately make a prediction about the belief state

for the dialogue manager

we look at the belief state as well as some vectorized representation of the database

pass it through several linear layers and ultimately predict the system dialogue act

for the nlg we have a conditioned language model

where the initial hidden state is a linear combination

of the dialogue act the belief state and the database vector and then at every

time step

the model outputs what the next word should be to ultimately generate the response

so we have these three neural dialogue modules

that mirror the structured components of traditional pipelined systems
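as a rough sketch of that three-module data flow, here is a toy pipeline in python; the rule-based functions, slot names, and responses are purely illustrative stand-ins for the actual neural modules:

```python
# Hypothetical toy stand-ins for the three pre-trained modules; the real ones
# are neural networks, but the data flow between them is the same.

def nlu(dialogue_context):
    # Encode the dialogue context and predict a belief state (slot -> value).
    beliefs = {}
    if "cheap" in dialogue_context:
        beliefs["price"] = "cheap"
    if "north" in dialogue_context:
        beliefs["area"] = "north"
    return beliefs

def dm(belief_state, db_vector):
    # Map belief state + database features to a system dialogue act.
    if db_vector["num_matches"] == 0:
        return "inform_no_match"
    return "inform_restaurant"

def nlg(dialogue_act, belief_state):
    # Conditioned language model: generate the system response.
    if dialogue_act == "inform_no_match":
        return "sorry no restaurant matches your request"
    price = belief_state.get("price", "")
    area = belief_state.get("area", "town")
    return f"there is a {price} restaurant in the {area}".replace("  ", " ")

def pipeline(dialogue_context, db_vector):
    belief = nlu(dialogue_context)
    act = dm(belief, db_vector)
    return nlg(act, belief)

print(pipeline("i want a cheap place in the north", {"num_matches": 3}))
# -> there is a cheap restaurant in the north
```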

given these three components

how do we actually go about

building a system for dialog generation

well the simplest thing to do is

naive fusion

where what we do is we train these modules independently and then we just combine them

naively during inference where instead of passing in the ground truth belief state to the

dialogue manager which is what we would do during training we make a prediction

using our trained nlu

and then pass it into the dialogue manager

another way of using these dialogue modules

after training them independently is multitasking

so

where we simultaneously learn the dialogue modules

as well as the final task of dialog response generation so we have these three

independent modules here

and then we have these red arrows that correspond to the forward propagation

for the task of response generation

sharing the parameters in this way results in more structured components

now the encoder

is both being used for the task of the nlu

as well as for the task of response generation

so now it has this notion of structure in it

another way which is the primary

novel work in our paper is structured fusion networks

structured fusion that works aim to learn a higher level model

on top of pre-trained neural dialogue modules

here's a visualization of structured fusion networks

and don't worry if this seems like spaghetti i'll come back to it

so here what we have is

we have the original dialogue modules the nlu the dm and the nlg

in these grey small boxes in the middle

and then what we do is we

define these black boxes around them

that consist of a higher level module

so the nlu gets upgraded to the nlu plus

the dm to the dm plus and the nlg to the nlg plus

by doing this

the higher level model does not need to relearn and remodel the dialogue structure

because it's provided to it

through the pre-trained dialogue modules

instead the higher level model

can focus on the necessary abstract modeling for the task of response generation

which includes encoding complex natural language

modeling the dialogue policy

and generating language conditioned on some latent representation

and it can leverage

the already provided dialogue structure to do this

so let's go through the structured fusion network piece by piece and see how we

build it up

we start out with these dialogue modules in grey here

the combination between them is exactly what you saw in naive fusion

first we're gonna add the nlu plus

the nlu plus gets the outputted belief state

and when it

re encodes the dialogue context

it has the already predicted belief state concatenated at every time step

and in this way the encoder does not need to relearn the structure and can

leverage the already computed belief state to better encode

the dialogue context

next we're gonna add the dm plus

and the dm plus

initially

it takes as input a concatenation of four different features

the database vector the predicted dialogue act

the predicted belief state

and the final hidden state of the higher level encoder

and then passes that through a linear layer

by providing the structure in this way it's our hope that

this sort of serves as the policy modeling component

in this end-to-end model

the nlg plus

takes as input the output of the dm plus and uses that to

initialize the hidden state and then interfaces with the nlg

let's take a closer look into the nlg plus

it relies on cold fusion

so basically what this means is

the nlg a conditioned language model gives us a sense of what the next word

could be

the decoder on the other hand

is more so

performing higher level reasoning

and then

we take the logits the output from the nlg about what the next word

could be as well as the hidden state from the decoder

about the representation of what we should be generating and combine them using cold fusion

and then there's a cyclical relationship between the nlg and the higher level

decoder

in the sense that once cold fusion predicts what the next word should be through a

combination of the decoder and nlg it passes that prediction both into the decoder

and into the next time step of the nlg
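for reference, the cold fusion combination at a single time step can be sketched roughly as below in numpy; the weight matrices are random and the dimensions are made up purely for illustration, so this shows the mechanism rather than the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden = 6, 4  # toy sizes for illustration only

# Stand-ins for one decoding step: logits from the pre-trained NLG (the
# conditioned language model) and the higher-level decoder's hidden state.
lm_logits = rng.normal(size=vocab)
decoder_state = rng.normal(size=hidden)

# Randomly initialized cold-fusion parameters (illustrative shapes):
W_lm = rng.normal(size=(hidden, vocab))        # projects LM logits to features
W_gate = rng.normal(size=(hidden, 2 * hidden)) # computes the fusion gate
W_out = rng.normal(size=(vocab, 2 * hidden))   # produces output logits

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

h_lm = W_lm @ lm_logits                                        # LM features
gate = sigmoid(W_gate @ np.concatenate([decoder_state, h_lm])) # how much LM to use
fused = np.concatenate([decoder_state, gate * h_lm])           # gated combination
probs = softmax(W_out @ fused)  # distribution over the next word
```

the gate lets the higher-level decoder decide, per dimension, how much to trust the pre-trained language model's opinion about the next word.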

and here's the final combination again which

hopefully should make more sense

so how do we train the structured fusion network

because we have these modules there's three different ways that we can do it

the first one is that we can freeze these modules

we can freeze the modules for obvious reasons since they're pre-trained

and then just learn the higher level model on top

another way is that we can fine tune these modules for the final task of

dialog response generation

and then of course we can multitask the modules where we

simultaneously fine tune them for response generation and for their original tasks

we use the multiwoz dataset and generally follow their experimental setup

which means the same hyper parameters and because they use the ground truth belief state

we do so as well

and you can sort of think of this as the oracle nlu in

our case

for evaluation we use the same metrics which include bleu score

inform rate which

measures how often the system has provided the appropriate entities to the user

and success rate which is how often the system

answers all the attributes the user requests

and then we use a combined score which they propose as well

which is bleu plus the average of inform and success rate
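as a quick sanity check, that combined score is just the following; the numbers in the example are made up, not results from the paper:

```python
def combined_score(bleu, inform, success):
    """Combined score from the MultiWOZ evaluation setup:
    BLEU plus the average of inform rate and success rate."""
    return bleu + (inform + success) / 2.0

# Illustrative numbers only:
print(combined_score(18.0, 80.0, 70.0))  # -> 93.0
```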

so let's take a look at our results

first our baselines so as you see here seq2seq with attention gets a combined

score of about eighty three point three six

next we have naive fusion both zero shot which means that the modules are independently

pre-trained and just combined at inference

and then we also finetune for

the task of response generation which does just slightly better than the baseline

multitasking does not do so well which sort of indicates that

the loss functions may be pulling

the weights in different directions

structure fusion networks with frozen modules

also do not do so well

but as soon as we start fine tuning

we get a significant improvement

with slight improvements over these other models

in bleu score and then very strong improvements in inform and success rate

and we observe

somewhat similar patterns with sfn and with multitasking

and honestly this seems kind of

intuitive when you think about it inform and success rate measure how often we inform

the user of the appropriate entities and how often we provide the appropriate attributes

and explicitly modeling the belief state explicitly modeling the system act

should intuitively help with this

if our model is explicitly aware of

what attributes the user has requested it's going to better provide that information to the

user

but of course i talked about several different problems

with neural models so let's see if structured fusion networks did anything about those problems

the first problem that i mentioned is that neural models are very data hungry

and i think that the added structure should result in less data hungry models

so we compare seq2seq with attention and structured fusion networks

at one percent five percent ten percent and twenty five percent of the training data

on the left you see the inform rate graph and on the right you see

the success rate graph

at varying levels of percentage of data used

so for the inform rate

we're at about thirty

thirty percent inform rate with seq2seq

and at fifty five

with structured fusion networks

of course this difference is really big when we're

at very small amounts of data as in one percent

and then it slowly comes together

as we increase the data

with success rate we're at about twenty

with structured fusion networks

and fairly close to about like two or three percent

with seq2seq at one percent of the data

so for extremely low data scenarios one percent which is about

six hundred utterances

we do

really well with structured fusion networks

and the difference

remains at about a ten percent improvement across both metrics

another problem i mentioned is domain generalisability

the added structure should give us more generalisable models

so what we do is we compare seq2seq and structured fusion networks

by training on two thousand out of domain

dialogue examples

and fifty in domain examples

where in domain is restaurant and then we evaluate entirely on the restaurant domain

and what we see here is we get a sizable improvement in the combined score

using structured fusion networks

with stronger improvements in success and inform

the bleu is slightly lower but this drop matches roughly

what we saw when using all the data so i don't think it's a

problem specific to generalisability

the next problem and to me the most interesting one

is divergent behavior with reinforcement learning

training generative dialogue models with reinforcement learning

often results in divergent behavior

in the generated output

i'm sure that everybody here has seen the headlines where people claimed that facebook

ai shut down their bot after it started inventing its own language really what happened

was it started outputting

stuff that doesn't look like english because it loses the structure as soon as you

train it with reinforcement learning

so why does this happen

my theory about why this happens is the notion of the implicit language model

seq2seq decoders have the issue of the implicit language model which basically means that the

decoder simultaneously learns the task and strategy

as well as modeling language

in image captioning this is very well observed

and it's observed that the implicit language model overwhelms the decoder

so basically what happens is

if the image model detects that there's a giraffe

the model always outputs the giraffe standing in a field

which it does even if the giraffe is not standing in a field just because

that's what the language model has been

trained to do

in dialogue on the other hand this problem is slightly different in the sense that

when we finetune dialogue models with reinforcement learning

we're optimising for the strategy

and ultimately causing it to unlearn the implicit language model

so

structured fusion networks have an explicit language model

so maybe we don't have this problem

so let's try structured fusion networks with reinforcement learning

so for this we train with supervised learning and then we freeze the dialogue modules

and finetune only the higher level model with the reward being inform rate and success rate

so we're optimising the higher level model for some dialogue strategy

while relying on the structured components

to maintain the structured nature of the model
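that training scheme, updating only the unfrozen higher-level parameters with a reward-scaled gradient, can be sketched like this; the parameter names, gradient values, and reward here are all made up for illustration, not taken from the paper:

```python
# Hypothetical parameter dictionary: pre-trained module weights plus
# higher-level weights, all scalar toys for illustration.
params = {
    "nlu.w": 0.5, "dm.w": -0.2, "nlg.w": 1.1,      # pre-trained modules (frozen)
    "encoder_plus.w": 0.3, "decoder_plus.w": -0.7,  # higher-level model
}
frozen = {"nlu.w", "dm.w", "nlg.w"}

def rl_step(params, grads, reward, lr=0.1):
    # REINFORCE-style update: scale the gradient by the reward
    # (inform/success based) and skip the frozen module parameters.
    return {
        name: value if name in frozen else value + lr * reward * grads[name]
        for name, value in params.items()
    }

# Toy gradients and reward for one episode:
grads = {name: 1.0 for name in params}
updated = rl_step(params, grads, reward=0.8)
```

after the step the module weights are untouched while the higher-level weights have moved, which is the point: the frozen modules keep the structure (and the explicit language model) intact while the policy is optimised.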

and we compare to tiancheng zhao's work on larl

where he explored a similar problem

and what we see is we get

less divergence in language

and fairly similar inform and success rate with the state-of-the-art combined score here

so here are all the results for all the models that we compared

throughout this presentation

we see that

adding structure in general seems to help

and we get a sizable improvement over our baseline

and

the model is especially robust to reinforcement learning

of course given how fast this field moves

while our paper was in review somebody beat our results and we don't have state-of-the-art

anymore

but

one of the core contributions of their work

was improving dialogue act prediction

and because structured fusion networks have this ability

to leverage dialogue act predictions in an explicit component

i think there's room for combination here

so

no dialogue paper is complete without human evaluation so what we did here was we

asked mechanical turk workers

to read the dialogue context and rate responses on a scale of one to five

on the notion of appropriateness

and what we see here is that

structured fusion networks with reinforcement learning

are rated slightly higher

with

ratings of four or more given

more often i should say that everything in bold is statistically significant

of course we have a lot more room

to improve before we beat the human ground truth but i think adding structure to our

models is the way to go

thank you for your attention and the code is available here

thank you for the talk

so now we have

actually

quite some time for questions so any questions

that's very interesting work and it looks promising but

do you have plans to extend the evaluation and look at whether

the system with your architecture can actually engage in dialogue rather than replicating dialogues

to the second question i think the structure should help us do that and maintain

like not have the issue of when you start training models and evaluating models in an

adaptive manner usually what happens is the errors propagate and i think that

the structure should make that less likely to happen

i think that's something that we should definitely look into in the literature

and just if you put up your comparative slide the first one to compare i think

you're too quick to

cede the rank to the other one as having

the preferred performance because bleu i would say is not something that should be measured

in this context it's

they're doing much better than you in bleu but it's completely irrelevant whether you give

exactly the same words as the original or not

and you're actually doing much better in success

that's true like my general feeling having looked at the data a lot is that

for this type of task at least it does relatively well and i think in

the original paper they did some correlation analysis with human judgement

but i think like

bleu does not like on its own will not measure the quality of the system

but more so what it's measuring is

how structured the language is and how like

you disagree

okay that's fair i guess with multiple references maybe we can improve this

so you have these three components and you said that they're pre-trained but

trained on what are they pre-trained on and the second question sorry during training do you

also have intermediate supervision there or are they finetuned in an end-to-end fashion

right okay good question

let me just go back to that slide

so in the multiwoz data

they

they give us the belief state and they give us the system dialogue act

so what we do for pre-training these components is

the nlu is pre-trained to go from context to belief state

the dm from belief state to dialogue act

and the nlg from dialogue act to response

for your second question

we do explore one of them in our multitask setup

where we do intermediate supervision but in the other two we don't

so it seems to me that you use much more supervision than the usual

sequence to sequence model which would be the reason for better performance rather than the

different architecture no

no like i completely agree with that point but i think

a point of our paper is that doing this additional supervision

and adding the structure into the model is something that's promising something that people should

be doing fair enough

but i do understand that

it's not necessarily the architecture on its own that's doing better cool thank you

any other questions

a great talk thanks so much it looks promising so you talked a bit about

generalisability and about this issue of divergence with rl but you didn't touch much on the other

issue you mentioned in the trade off at the beginning which was controllability and

i'm wondering if you have some thoughts on that

i guess some of the questions that come into my mind when we design models

with respect to control is suppose i wanted it to behave a little bit differently in

one case is there any way that this architecture can address that and the other way

to look at it is let's say i'm interested in improving one of these

components can i do it in any other way other than

getting more data like how does the architecture help something in that sense okay

that's a good question controllability isn't something that we've looked at yet but

it's definitely something that i do want to look at in the future just

because i think doing something as simple as

adding rules on top of the dialogue manager

to just change and say like output this dialogue act instead if these conditions

are met would work really well and the model does leverage those dialogue acts and

like i've seen bad predictions from the lower level model

result in poor outputs

that's definitely something that we should look into in the future

what i meant with the second thing is

the other part is is this architecture suitable for decomposability can i invest

more in one component like

is there a way to do blame assignment in any sense and does it you

know

so


i'm not entirely sure

for when we look at the final task of response generation

but we do sort of have a sense just because of the intermediate supervision

of how well each of the respective lower level components is doing

and what i can say is that the nlu does really well

the natural language generation does pretty well

the main thing that's struggling is this

step of going from belief state to dialogue act

and i think that if i was to recommend a component

based on just this pre-training supervision

to improve it would be the dialogue manager

but like blame assignment in general for the response generation task

isn't something that

i think is really easy with the current state of the model but i think

things might be able to be done to further interpret the model

any more questions

okay in that case i'll ask

one of my own

can you

explain how exactly the you know what is it that the

dm and the dm plus predict how does it look like is it some

kind of

a like

dialogue act embedding or is it explicit like a one

hot

so

so you mean like the dialogue act vector or just i mean basically what

when you look at the dm

well this i guess these are two different things when you look at the dm

the output is a dialogue act right yes and the dm plus has something different so

like okay

so for the dm itself because of the supervision

we're predicting the dialogue act which is a multi class label

and it's basically just ones and zeros

like a binary vector okay and that's like

inform or request at a single slot level inform restaurant available type thing right

but then for the dm plus

it's not as structured in that sense and basically what we do is

we just treat it as a linear layer that initialises

the decoder's hidden state and in the original multiwoz paper they had this type

of thing as well

where they just had a linear layer between the encoder and decoder to combine

more information into the hidden state

and they call that the policy and

that's sort of what we're hoping that

by adding the structure beforehand

it's actually more like a policy rather than just a linear layer before

right okay thank you

any more questions

the last one

did you try other baselines because sequence to sequence seems to be

basic

well we did try the other ways of combining the neural modules

the naive fusion the multitasking those ones

i can go to that slide

but we didn't try transformers or anything like that and i think that

that's something that we can look into in the future

but we tried like naive fusion multitasking which are baselines that we

came up with

for actually leveraging the structure as well

okay thank you thank you