okay, so, how shall we start... hello everyone, good morning, and welcome to the third session.

today the topic is end-to-end dialogue systems and natural language generation; we go from natural language generation models to end-to-end systems.

and the first speaker is Bo-Hsiang Tseng, with the paper on a tree-structured semantic encoder with knowledge sharing for domain adaptation in NLG. so this is a natural language generation model.

are we ready? okay, go ahead, you have the floor.

hello everyone, good morning, and welcome to my presentation. my name is Bo-Hsiang Tseng, I'm from the University of Cambridge, and today I'm going to share my work, a tree-structured semantic encoder with knowledge sharing for domain adaptation in natural language generation.

I guess most of you are pretty familiar with this pipelined dialogue system; here I just want to highlight that this work focuses on the natural language generation component. the input is the semantics from the policy network, and the output is natural language.

okay, so given a semantic representation like this, here from the restaurant domain, the system wants to inform the user about the name of the restaurant, its address, and its phone number. with a natural language generation model, it should produce natural language for the user, and this utterance has to contain all the correct information from the semantics. that is the goal of an NLG model.

in this work we focus on domain adaptation in NLG, which means that you might have a bunch of data from your source domain, and you can use that data to pretrain your model and get a pretrained model. then you use the limited data from your target domain to fine-tune your model, so that your model is able to work well in the domain you are interested in. that is the domain adaptation scenario.

so how do we usually encode the semantics? in prior work there are pretty much two main approaches. the first one is that people use a binary representation like this, where each element in the vector corresponds to a certain slot-value pair in the ontology. or we can treat the semantics as a sequence of tokens and simply use an LSTM to encode them. actually, both approaches work well.
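As a concrete illustration (not taken from the talk itself), here is a minimal sketch of those two conventional encodings; the ontology and slot names are illustrative, not the exact MultiWOZ ones:

```python
# Minimal sketch of the two conventional ways to encode a dialogue act,
# using an illustrative (not exact) ontology.
ONTOLOGY = [
    ("inform", "name"), ("inform", "address"), ("inform", "phone"),
    ("request", "price"), ("request", "area"),
]

semantics = {"inform": {"name": "Pizza Hut", "address": "Regent St", "phone": "01223 323737"},
             "request": {"price": "?"}}

# 1) Binary representation: one dimension per act-slot pair in the ontology.
binary_vec = [1 if act in semantics and slot in semantics[act] else 0
              for act, slot in ONTOLOGY]
print(binary_vec)   # [1, 1, 1, 1, 0]

# 2) Flat token sequence, which a standard LSTM encoder would consume.
tokens = []
for act, slots in semantics.items():
    tokens.append(act)
    tokens.extend(slots.keys())
print(tokens)       # ['inform', 'name', 'address', 'phone', 'request', 'price']
```

Both views flatten the dialogue act; neither keeps its hierarchy explicit, which is the point the talk makes next.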

however, they don't really capture the internal structure of the semantics. for example, in this semantics you actually have a tree structure: under the request dialogue act there is a price slot, which is what the system wants to ask the user about, like this subtree here; under the inform dialogue act you have three slots of information that you want to tell the user; and both dialogue acts are under the restaurant domain. that semantic structure is not captured by those two approaches.

but do we really need to capture this kind of structure? does it help, and if it doesn't help, then why bother, right? I'll give a very simple example. again, given a semantics like this for the source domain, you have the corresponding tree. during adaptation, in the domain adaptation scenario, you might have a similar semantics which shares some content, with its own corresponding tree structure. as you can see, most of the information is shared between those two semantics in the tree structure, apart from the domain information. so if we can come up with a better way to capture these structures within the semantics, perhaps the model is able to share information more effectively between domains during domain adaptation. that is the motivation of this work.

so the question here is how to encode the structure. here is the proposed model, the tree-structured semantic encoder. the structure is pretty much the one you saw on the previous slide. first we have the slot layer, where all the slots in the ontology are listed. then there is the dialogue act layer, which describes all the dialogue acts you have in your system, and then the domain layer. at the bottom of the tree we designed a property layer, which describes the property of a slot: for example, a slot such as area can be requestable or informable, and here it is informable, so we use this layer to describe the property of each slot.

so, given a semantics like this, based on all the information and the structure you have, we can build the corresponding tree with this definition. first, based on the property of a slot, you build the links between the property layer and the slot layer. then every slot connects to the dialogue act it belongs to in the semantics, like this, and the two dialogue acts in this example connect to the restaurant domain. finally, we take the root of the tree as the final representation. this is how we encode the tree structure of the semantics.
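As a rough illustration of how such a tree could be assembled from a dialogue act, here is a small sketch; the layer names follow the talk, but the node class and helper are made up for this example:

```python
# Hypothetical sketch of building the four-layer tree described above:
# property layer (leaves) -> slot layer -> dialogue-act layer -> domain layer (root).
class Node:
    def __init__(self, name):
        self.name, self.children = name, []

def build_tree(domain, acts):
    """acts: dict mapping dialogue act -> list of (slot, property) pairs."""
    root = Node(domain)                              # domain layer, root of the tree
    for act, slots in acts.items():
        act_node = Node(act)                         # dialogue-act layer
        for slot, prop in slots:
            slot_node = Node(slot)                   # slot layer
            slot_node.children.append(Node(prop))    # property layer (leaves)
            act_node.children.append(slot_node)
        root.children.append(act_node)
    return root

tree = build_tree("restaurant",
                  {"inform": [("name", "informable"),
                              ("address", "informable"),
                              ("phone", "informable")],
                   "request": [("price", "requestable")]})
```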

now, what do we exactly compute in the tree? basically we follow the prior work on the Tree-LSTM from 2015. first, for a given node, we compute the summation over all its children: the summation of the hidden states and of the memory cells of its children. then, as in a vanilla LSTM, we compute the input gate, forget gate, and output gate, and finally the memory cell and hidden state of the node. I hope that is clear.
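For reference, a minimal sketch of the child-sum Tree-LSTM update from Tai et al. (2015), which the talk says the encoder follows; it is plain NumPy with assumed parameter shapes, and note that in that formulation each child gets its own forget gate over its memory cell rather than a plain sum:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tree_lstm_node(x, child_h, child_c, W, U, b):
    """Child-sum Tree-LSTM update (Tai et al., 2015).
    x: input vector for this node; child_h / child_c: lists of child states.
    W, U, b: dicts of parameters for gates 'i', 'f', 'o', 'u'."""
    h_sum = np.sum(child_h, axis=0) if child_h else np.zeros_like(b["i"])
    i = sigmoid(W["i"] @ x + U["i"] @ h_sum + b["i"])           # input gate
    o = sigmoid(W["o"] @ x + U["o"] @ h_sum + b["o"])           # output gate
    u = np.tanh(W["u"] @ x + U["u"] @ h_sum + b["u"])           # candidate update
    # one forget gate per child, each gating that child's memory cell
    f = [sigmoid(W["f"] @ x + U["f"] @ h_k + b["f"]) for h_k in child_h]
    c = i * u + sum(f_k * c_k for f_k, c_k in zip(f, child_c))  # memory cell
    h = o * np.tanh(c)                                          # hidden state
    return h, c
```

The encoder applies this bottom-up, so the root's hidden state summarises the whole dialogue act.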

so, again with the same simple example: given the semantics in the source domain, we have the corresponding tree structure, and during adaptation you might have a similar semantics in the target domain. as we can see here, with the designed tree structure most of the information in the two trees is shared, and we hope that helps the model share information between domains.

okay, so far we know how to encode a tree over the semantics; now let's go to the generation process. it is very straightforward to just take the output, the final representation of the tree, as the initialisation of your decoder.

we follow prior work where the values in the utterances are delexicalised into slot tokens, and in this work we designed the slot token to carry domain information, dialogue act information, and slot information. we then use the standard cross entropy to train our decoder.
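As an illustration of what such a delexicalised target might look like, with a domain-act-slot token format that is assumed here rather than quoted from the paper:

```python
# Hypothetical delexicalised decoder target; the token format (domain-act-slot)
# is illustrative, not the exact one used in the paper.
target = ("the phone number of [restaurant-inform-name] is "
          "[restaurant-inform-phone] and it is on [restaurant-inform-address]")
# After generation, each slot token is lexicalised back to the value from the
# input semantics, e.g. [restaurant-inform-phone] -> "01223 323737".
```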

sounds alright, sounds good: we have a way to encode the tree structure. but so far we only use the very abstract information of the tree, the root, even though there is a lot of information at the intermediate levels, thanks to the way the tree is defined. this motivated us to come up with a better way to access the information at the intermediate levels, so that the decoder can have more information about the tree structure.

so here we propose, and it is very straightforward, to apply attention to the domain, dialogue act, and slot layers. with this layer-wise attention mechanism, whenever the decoder produces a special slot token like this, the hidden state at that time step is used as a query to trigger the attention. for example, at the slot layer, all the slot information is treated as the context for the attention mechanism, and the model computes a probability distribution over the information in each of the three layers. so, again at the slot layer, you get a distribution over all possible slots; it basically tells the model which slot, which information, it should focus on at that time step.

of course, during training we have supervision signals from the input semantics; they guide the model and tell it what to focus on at each time step. we then use these attention distributions as extra information for the next time step, and the generation process goes on.
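A rough sketch of what one of these attention layers could look like is below. The talk only specifies that the decoder hidden state at a slot-token step is the query, the nodes of a tree layer are the context, and the resulting distribution is supervised and fed into the next step; the bilinear scoring, names, and shapes here are assumptions:

```python
import numpy as np

def layer_attention(query, layer_states, W_q, W_k):
    """query: decoder hidden state at the current step, shape (d,).
    layer_states: node states of one tree layer, shape (num_nodes, d).
    Returns a probability distribution over the nodes of that layer."""
    scores = (layer_states @ W_k) @ (W_q @ query)   # (num_nodes,)
    scores = scores - scores.max()                  # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs

# At a slot-token step, the three distributions (domain, act, and slot layers)
# would be supervised against the input semantics and passed on as extra
# input for the next decoding step.
```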

so with the layer-wise attention mechanism, the loss function becomes the standard cross entropy plus the losses from the three attention mechanisms. that is how we train our model.
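Written out, the objective described here is roughly the following; any weighting coefficients between the terms are omitted, since the talk does not mention them:

```latex
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{CE}}
  \;+\; \mathcal{L}_{\mathrm{att}}^{\mathrm{domain}}
  \;+\; \mathcal{L}_{\mathrm{att}}^{\mathrm{act}}
  \;+\; \mathcal{L}_{\mathrm{att}}^{\mathrm{slot}}
```

where \(\mathcal{L}_{\mathrm{CE}}\) is the decoder cross entropy and each attention term is the supervision loss for one tree layer.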

okay, let's go to the basic setup for the experiments. we use the MultiWOZ dataset, which has ten thousand dialogues over seven domains, and an utterance can actually have more than one dialogue act. we have three strong baselines: the first one is SC-LSTM, which basically uses a binary representation to encode the semantics, and we also have TGen and RALSTM; those two models are basically seq2seq models, so they use an LSTM encoder to encode the semantics.

for evaluation we have the standard automatic metrics such as BLEU, and also the slot error rate, because we don't just want our natural language generation model to be fluent; the content should also be correct. and we also conduct a human evaluation.

okay, let's see some numbers first. here, the source domain is restaurant and the target domain is hotel. the x-axis is the amount of adaptation data and the y-axis is the BLEU score. the three baseline models are here, along with the tree-structured encoder and its variant, the tree structure with the attention mechanism.

as you can see, with the full adaptation data, a hundred percent of the data, all the models perform pretty much the same, because there is pretty much enough data. however, with less data, such as five percent, our models start to gain benefits, thanks to the tree structure.

next, let's look at the numbers for the slot error rate. the slot error rate is defined like this: we don't want our model to produce missing slots or redundant slots.
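The usual definition of this metric in the NLG literature, which I assume is the one used here, is:

```latex
\mathrm{SER} \;=\; \frac{N_{\mathrm{missing}} + N_{\mathrm{redundant}}}{N_{\mathrm{total}}}
```

where \(N_{\mathrm{total}}\) is the number of slots in the input dialogue act, and missing and redundant slots are counted in the generated utterance.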

again, with a hundred percent of the data all the models perform very similarly; they are all good with the full data. however, with pretty limited data, even at one point two five percent of the data, our models start to produce very good performance over all the baselines.

the previous slides just showed one setup; we actually conducted three different setups to show that the model works in different scenarios. the first column is the one used in the previous slides, restaurant-to-hotel adaptation; the middle column is restaurant to attraction; and the third one is train to taxi. here we just want to show that we observe a similar trend, similar results, over all the different setups.

okay, we all know that for the natural language generation task it is not enough to evaluate only with automatic metrics, so we also conducted a human evaluation using Amazon Mechanical Turk. each turker was asked to score the outputs in terms of informativeness and naturalness.

here are the numbers. in terms of informativeness, the tree structure with attention scores best, and the tree without attention scores second. this tells us that if you have a better way to encode your tree structure, so that information can be shared during domain adaptation, the model tends to produce the correct semantics in the generated sentences. meanwhile, we can still maintain the naturalness of the generated sentences.

so we wondered where our improvements come from: on what kind of examples does our model really perform well? we divided the test set into seen and unseen subsets. seen basically means that the input semantics was seen during training, so it belongs to the seen subset; otherwise it is unseen.
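In code, that split is essentially the following trivial sketch, assuming the input semantics can be compared as canonicalised strings; the function and argument names are made up for illustration:

```python
def split_seen_unseen(test_examples, train_semantics):
    """test_examples: list of (semantics, reference) pairs, with semantics already
    canonicalised; train_semantics: set of canonicalised semantics seen during
    pretraining and adaptation."""
    seen, unseen = [], []
    for sem, ref in test_examples:
        (seen if sem in train_semantics else unseen).append((sem, ref))
    return seen, unseen
```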

let's look at the numbers for fifty percent adaptation data. with this much data, most of the test examples are seen, and all the models perform similarly. these numbers are the number of wrong examples each model produces, so the lower the better.

however, with very limited adaptation data, out of nine hundred unseen semantics, that is, semantics never seen during training or adaptation, the baseline systems produce around seven hundred wrong examples, wrong semantics in the generated sentences, whereas our tree with attention produces a very low number, just around a hundred and thirty. this implicitly tells us our model may have a better generalisation ability to unseen semantics.

okay, here comes my conclusion. by modelling the semantic structure, more information can be shared between domains, and this is helpful for domain adaptation. our model, especially with the proposed layer-wise attention mechanism, generates better sentences in terms of automatic metrics and human scores, and with very limited adaptation data our model performs the best. thank you very much for coming, and any questions and feedback are welcome, thank you.

thank you very much. so, questions?

you said that you're doing well with one point two five percent, which sounds good; what is the actual number of training examples?

yes, so for example when we adapt from restaurant to hotel, during pretraining the number of examples is eight point five k, but if we are using only one percent here it is probably around six hundred; it is still small.

yes.

hi, can you go to the slide with the tree?

yes, so you mean the full tree?

yes. so first, is the attention over all of the slots, all the slots in the ontology? because for a given example only the green nodes are actually present in the data, so why would you need to attend to the others, and from which ones?

sorry, actually it is over the slots within the semantics, so only the slots in the semantics are activated

and those are what we use for the attention.

okay. in that case another question: when you do domain transfer, what if the two domains have different sets of slots, and the slots that only appear in the unseen domain are never trained on in the data?

because of the nature of this dataset, as you see, we have restaurant, hotel, and attraction, and those three domains share most of their slots, though each also has a few unique slots, and train and taxi also share some slots. that is why, when I designed the setup, we have restaurant to hotel, restaurant to attraction, and train to taxi, because we try to leverage the shared slots.

hello, great. so I had a question about the evaluation that looks at the redundant and missing slots.

yes, the slot error rate.

my question is: conceptually, why does that even need to be a problem? because you could have constraints that ensure that each slot is produced exactly one time during generation.

yes, and it depends on where you put your constraints. if you put them in the generation loss function during training, that doesn't guarantee anything, right? the model can still violate your constraints. but if you apply your constraint at the output, like a post-processing step, you might filter out some slots, which is good, but you might end up with unnatural sentences, because you used a rule to filter something out, and then you need more rules to make the sentence fluent around whatever you filtered out. so it is actually a problem, and we simply follow some prior work in how it is handled.

okay, so I guess, yes, conceptually there would be a tradeoff between naturalness and coverage, but if you know in advance that a requirement is coverage, then I guess your only degree of freedom would be to give up some naturalness.

sorry, I...

sorry, I think you missed the last part. I was just making the comment that if you know in advance that your requirement is that you need to generate all the slots, then your only degree of freedom is to give up on naturalness.

right, that is basically the tradeoff with the naturalness scoring in this task.

thank you.

I have a question regarding this tree that you have shown here in the picture: are the values somehow encoded in this tree, or are you only taking the slots into account?

only the slots; we don't use the values, because you don't need to know them.

yes, and also there are just too many possible values, so actually I don't use them.

and then I have another question: have you thought about a model without delexicalisation?

yes. without delexicalisation you run into some challenges; for example, your values become pretty much an open vocabulary, right? for this dataset we have restaurant names, attraction names, hotel names, the train IDs, and the time slots, so this becomes very complex. it is still a challenging problem in NLG.

okay, right, I think we need to move on to the next paper, so let's thank the speaker again. thank you very much.