0:00:17 Okay, so, how can we start? Hello everyone, good morning, and welcome to the third session.
0:00:29 Today's topic is end-to-end dialogue systems and natural language generation; we have everything from natural language generation models to end-to-end systems.
0:00:45 The first speaker of the session is Bo-Hsiang Tseng, with the paper "A Tree-Structured Semantic Encoder with Knowledge Sharing for Domain Adaptation in NLG", so this is a natural language generation model.
0:01:05 Are we ready? Okay, so go ahead, you have the floor.
0:02:28 Hello everyone, good morning, and welcome to my presentation. My name is Bo-Hsiang Tseng, I am from the University of Cambridge, and today I am going to share my work "Tree-Structured Semantic Encoder with Knowledge Sharing for Domain Adaptation in Natural Language Generation".
0:02:46 I guess most of you are pretty familiar with this pipeline dialogue system. Here I just want to highlight that this work focuses on the natural language generation component: the input is the semantics from the policy network, and the output is natural language.
0:03:06 Okay, so given a semantic representation like this from the dialogue manager, the system is going to inform the user about the name of a restaurant, its address, and its phone number.
0:03:23 We send this to a natural language generation model, and it should produce natural language for the user. The generated utterance has to contain all the correct information in the semantics; that is the goal of an NLG model.
0:03:38 In this work we focus on domain adaptation in NLG, which means that you might have a bunch of data from your source domain, and you can use that data to pretrain your model to get a pretrained model. Then you use some limited data from your target domain to fine-tune your model, so that the model is able to work well in the domain you are interested in. That is the domain adaptation scenario.
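For illustration only, here is a minimal sketch of that pretrain-then-fine-tune recipe, assuming a generic encoder-decoder model and data loaders that yield tensor batches; none of the names, signatures, or hyperparameters below come from the paper.

```python
# Sketch of the adaptation recipe described above: pretrain on plentiful
# source-domain data, then fine-tune on limited target-domain data.
# The model and the data loaders are placeholders, not the authors' setup.
import torch

def run_epochs(model, loader, optimizer, loss_fn, epochs):
    model.train()
    for _ in range(epochs):
        for semantics, target_tokens in loader:        # loader yields tensor batches
            optimizer.zero_grad()
            logits = model(semantics, target_tokens)   # (batch, seq_len, vocab), teacher forcing
            loss = loss_fn(logits.flatten(0, 1), target_tokens.flatten())
            loss.backward()
            optimizer.step()

def domain_adapt(model, source_loader, target_loader):
    loss_fn = torch.nn.CrossEntropyLoss()
    # Stage 1: pretrain on the source domain (e.g. restaurant).
    run_epochs(model, source_loader,
               torch.optim.Adam(model.parameters(), lr=1e-3), loss_fn, epochs=20)
    # Stage 2: fine-tune on the small target-domain set (e.g. hotel),
    # typically with a lower learning rate.
    run_epochs(model, target_loader,
               torch.optim.Adam(model.parameters(), lr=1e-4), loss_fn, epochs=5)
    return model
```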
0:04:11 So how do we usually encode the semantics? Among prior work there are pretty much two main approaches. The first one is that people use a binary representation like this, where each element in the vector corresponds to a certain slot-value pair in your ontology.
0:04:36 Or we can treat the semantics as a sequence of tokens and simply use an LSTM to encode them.
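To make those two prior encodings concrete, here is a small illustrative sketch (not the baselines' actual code); the ontology and the example semantics are made up to match the running restaurant example.

```python
# Two common ways to encode dialogue-act semantics, sketched for illustration.
ontology = [
    ("inform", "name"), ("inform", "address"), ("inform", "phone"),
    ("inform", "area"), ("request", "pricerange"), ("request", "area"),
]

semantics = {"domain": "restaurant",
             "acts": {"inform": ["name", "address", "phone"],
                      "request": ["pricerange"]}}

# (1) Binary representation: one bit per (act, slot) pair in the ontology.
binary = [1 if slot in semantics["acts"].get(act, []) else 0
          for act, slot in ontology]
print(binary)    # [1, 1, 1, 0, 1, 0]

# (2) Flat token sequence, e.g. fed token by token to an LSTM encoder.
tokens = [semantics["domain"]]
for act, slots in semantics["acts"].items():
    tokens.append(act)
    tokens.extend(slots)
print(tokens)    # ['restaurant', 'inform', 'name', 'address', 'phone', 'request', 'pricerange']
```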
0:04:47 Actually, both approaches work well. However, they don't really capture the internal structure of the semantics. For example, in these semantics you actually have this kind of tree structure, because under the request act there is a price slot, meaning the system wants to ask the user about the price, so it links up to here; and under the inform dialogue act you actually have three slots of information that you want to tell the user. Both dialogue acts are under the restaurant domain.
0:05:28 So that semantic structure is not captured by those two approaches.
0:05:35 But do we really need to capture this kind of structure? Does it help? And if it does not help, why bother? I will give a very simple example.
0:05:46 So again, given semantics like this for the source domain, you have the corresponding tree, like this. During adaptation, in the domain adaptation scenario, you might have similar semantics in the target domain that share some content, and this is its corresponding tree structure. As you can see here, most of the information is shared between those two semantics in the tree structure, apart from the domain information.
0:06:24 So if we can come up with a better way to capture the structure within the semantics, perhaps the model is able to share information more effectively between domains during domain adaptation. That is the motivation of this work.
0:06:40 So the question here is: how do we encode the structure?
0:06:48 So here is the proposed model, the tree-structured semantic encoder. The structure is pretty much the one you saw on the previous slide. First, we have the slot layer, where all the slots in your ontology are listed. Then we have the dialogue act layer, which describes all the dialogue acts you have in your system. Then we have the domain layer.
0:07:22 At the bottom of the tree we design a property layer, which is used to describe the property of a slot: for example, the area slot can be requestable, or it can be informable, and so on. So we use it to describe the property of each slot.
0:07:46 So, given semantics like this, based on all the structural information you have, we can build the corresponding tree with this definition. First, based on the property of a slot, you can build the links between the property layer and the slot layer. Then each slot connects to the dialogue act it belongs to in the semantics, like this, and the two dialogue acts in this example connect to the restaurant domain. Finally, we take the root of the tree as the final representation. So this is the way we encode the tree structure of the semantics.
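As an illustration of that construction (an assumption about the details, not the authors' code), here is a toy version that builds the property-slot-act-domain tree for the running example; the property names are placeholders.

```python
# Toy illustration of the tree described above: property nodes at the bottom,
# then slot nodes, dialogue-act nodes, and the domain node as the root.
class Node:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

    def render(self, depth=0):
        lines = ["  " * depth + self.name]
        for child in self.children:
            lines.append(child.render(depth + 1))
        return "\n".join(lines)

def build_tree(domain, acts):
    """acts maps each dialogue act to a list of (slot, property) pairs."""
    act_nodes = []
    for act, slots in acts.items():
        slot_nodes = [Node(slot, [Node(prop)]) for slot, prop in slots]
        act_nodes.append(Node(act, slot_nodes))
    return Node(domain, act_nodes)   # the root summarises the whole semantics

tree = build_tree("restaurant",
                  {"inform": [("name", "informable"), ("address", "informable"),
                              ("phone", "informable")],
                   "request": [("pricerange", "requestable")]})
print(tree.render())
```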
0:08:35 So what exactly do we compute in the tree? Basically, we follow the prior work on the Tree-LSTM from 2015 (Tai et al.). First, for example for this node here, we compute the summation of the hidden states and the summation of the memory cells over all its children. Then, as in the vanilla LSTM, we compute the input gate, the forget gate, and the output gate, and finally we can compute the memory cell and the hidden state. I hope that is clear enough.
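For reference, these are the standard child-sum Tree-LSTM updates of Tai et al. (2015) that the talk points to, for a node $j$ with children $C(j)$, input $x_j$, and child states $(h_k, c_k)$; the paper's exact variant may differ in minor details:

```latex
\begin{aligned}
\tilde{h}_j &= \textstyle\sum_{k \in C(j)} h_k \\
i_j &= \sigma\big(W^{(i)} x_j + U^{(i)} \tilde{h}_j + b^{(i)}\big) \\
f_{jk} &= \sigma\big(W^{(f)} x_j + U^{(f)} h_k + b^{(f)}\big), \qquad k \in C(j) \\
o_j &= \sigma\big(W^{(o)} x_j + U^{(o)} \tilde{h}_j + b^{(o)}\big) \\
u_j &= \tanh\big(W^{(u)} x_j + U^{(u)} \tilde{h}_j + b^{(u)}\big) \\
c_j &= i_j \odot u_j + \textstyle\sum_{k \in C(j)} f_{jk} \odot c_k \\
h_j &= o_j \odot \tanh(c_j)
\end{aligned}
```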
0:09:29 So again, to give the same simple example: given the semantics in the source domain, we have the corresponding tree structure, and during adaptation you might have similar semantics in the target domain. As we can see here, with our designed tree structure most of the information in the tree is shared, and we hope that can help the model share information between domains.
0:10:11 Okay, so now we know how to encode the tree of the semantics; then let's go to the generation process. It is very straightforward to just take the final representation of the tree as the initialisation of your decoder. We follow some prior work where the values in the utterances are delexicalised as slot tokens; in this work we design the slot token to carry the domain information, the dialogue act information, and the slot information. We then just use the standard cross entropy to train our decoder.
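For illustration, here is a minimal sketch of that kind of delexicalisation; the exact token format and the example utterance are assumptions, not the authors' preprocessing.

```python
# Values in the target utterance are replaced by composite slot tokens that
# carry domain, dialogue act, and slot information (token format assumed).
def delexicalise(utterance, domain, act, slot_values):
    for slot, value in slot_values.items():
        token = f"[{domain}-{act}-{slot}]"
        utterance = utterance.replace(value, token)
    return utterance

raw = "Pizza Hut is at 12 Market Square and its phone number is 01223 323737."
delex = delexicalise(raw, "restaurant", "inform",
                     {"name": "Pizza Hut",
                      "address": "12 Market Square",
                      "phone": "01223 323737"})
print(delex)
# [restaurant-inform-name] is at [restaurant-inform-address] and its phone
# number is [restaurant-inform-phone].
```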
0:10:54 Sounds good: we have a way to encode the tree structure. But actually, so far we just use the very abstract information of the tree, its root. However, there is a bunch of information at the intermediate levels, and thanks to our defined tree, this motivates us to come up with a better way to access the information at the intermediate levels, so that the decoder can have more information about the tree structure.
0:11:29 So here we propose, and it is very straightforward, to apply attention over the domain, dialogue act, and slot layers. With this layer-wise attention mechanism, whenever the decoder produces a special slot token like this, the hidden state at that time step is used as a query to trigger the attention mechanism. For example, at the slot layer, all the slot information is treated as the context for the attention mechanism, and then the model computes a probability distribution over the information in each of the three layers. So, for example, at the slot layer you will have a distribution over all possible slots; it basically tells the model which slot, which information, it should focus on at that time step.
0:12:39 Of course, during training we do have supervision signals from the input semantics; this can guide the model, telling it what to focus on at each time step. We then use these attention distributions as extra information for the next time step, and the generation process goes on.
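Here is a minimal sketch of what one such attention step could look like; a simple dot-product scoring function is assumed, which may not match the paper's exact formulation.

```python
# One layer-wise attention step: the decoder hidden state queries the node
# states of one tree layer, giving a distribution over its nodes plus a context.
import torch
import torch.nn.functional as F

def layer_attention(query, layer_states):
    """query: (hidden,); layer_states: (num_nodes, hidden)."""
    scores = layer_states @ query       # dot-product scores, shape (num_nodes,)
    dist = F.softmax(scores, dim=0)     # which node to focus on
    context = dist @ layer_states       # weighted sum of node states
    return dist, context

hidden = 8
decoder_state = torch.randn(hidden)     # hidden state at a special slot token
slot_layer = torch.randn(4, hidden)     # e.g. name, address, phone, pricerange nodes
dist, context = layer_attention(decoder_state, slot_layer)
print(dist)   # probability over the four slot nodes; supervised during training
```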
0:13:06 So with the layer-wise attention mechanism, the loss function becomes the standard cross entropy plus the losses from the three attention mechanisms. That is how we train our model.
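Written out, and assuming the three attention losses are simply added with unit weights (the talk does not give the exact weighting), the training objective would be:

```latex
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{CE}}
\;+\; \mathcal{L}_{\mathrm{att}}^{\mathrm{domain}}
\;+\; \mathcal{L}_{\mathrm{att}}^{\mathrm{act}}
\;+\; \mathcal{L}_{\mathrm{att}}^{\mathrm{slot}}
```

where $\mathcal{L}_{\mathrm{CE}}$ is the token-level cross entropy of the decoder and each attention term is a cross entropy between the predicted attention distribution at a slot token and the supervision signal derived from the input semantics.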
0:13:26 Okay, let's go through some basic setups for the experiments. We are using the MultiWOZ dataset, which has ten thousand dialogues over seven domains, and an utterance can actually have more than one dialogue act.
0:13:47 We have three strong baselines. The first one is SC-LSTM, which basically uses a binary representation to encode the semantics. And we have TGen and RALSTM; those two models are basically seq2seq models, so they use an LSTM to encode the semantics.
0:14:10 For evaluation, we have the standard automatic metrics such as BLEU, and also the slot error rate, because we don't just want our natural language generation model to be fluent; the content should also be correct. And we also conduct a human evaluation.
0:14:34 Okay, let's see some numbers first. Here the source domain is restaurant and the target domain is hotel. The x-axis is the different amounts of adaptation data, and the y-axis is the BLEU score. The three baseline models are here, and here are the tree-structured encoder and its variant, the tree structure with the attention mechanism.
0:15:05 As you can see, with full adaptation data, a hundred percent of the data, all the models perform pretty much similarly, because the data is pretty much enough. However, with less data, such as the last point at five percent, our models start to gain benefits, thanks to the tree structure.
0:15:34 Now let's look at the numbers for the slot error rate. The slot error rate is defined like this: we don't want our model to have missing slots or to produce redundant slots.
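Sketched under the common definition of this metric (missing plus redundant slots over the number of slots in the input semantics); the paper's exact formula may differ in detail.

```python
# Slot error rate: penalise slots the generated utterance misses and slots it
# adds that were not in the input semantics.
def slot_error_rate(reference_slots, generated_slots):
    missing = [s for s in reference_slots if s not in generated_slots]
    redundant = [s for s in generated_slots if s not in reference_slots]
    return (len(missing) + len(redundant)) / max(len(reference_slots), 1)

ref = {"name", "address", "phone"}
gen = {"name", "address", "pricerange"}   # dropped "phone", added "pricerange"
print(slot_error_rate(ref, gen))          # 2 / 3 ≈ 0.67
```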
0:15:50 So again, with a hundred percent of the data, all the models perform very similarly; they are all good with all the data. However, with pretty limited data, even at one point two five percent of the data, our models start to produce very good performance over all the baselines.
0:16:18 The previous slides just showed one setup. We actually conducted three different setups, to show that the model works in different scenarios. The first column is the one used in the previous slides, the restaurant-to-hotel adaptation; the middle column is restaurant to attraction; and the third one is train to taxi. Here we just want to show that we can observe similar trends, similar results, over all the different setups.
0:16:54 Okay, so we all know that for the natural language generation task it is not enough to just evaluate with automatic metrics, so we also conduct a human evaluation using Amazon Mechanical Turk. Each Turker was asked to score the generated responses in terms of informativeness and naturalness.
0:17:16 So here are some basic numbers. In terms of informativeness, the tree structure with attention scores the best, and the tree without attention scores second. This tells us that if you have a better way to encode the tree structure, then the information can be shared for domain adaptation, and the model tends to produce the correct semantics in the generated sentences. Meanwhile, we can still maintain the naturalness of the generated sentences.
0:17:56 So we wondered where our improvements come from: on what kind of examples does our model really perform well? So we divided the test set into seen and unseen subsets. "Seen" basically means that the input semantics was seen during training, so it belongs to the seen subset; otherwise it is unseen.
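A small sketch of how such a split could be computed; the canonical form used here is an assumption about how the semantics are compared.

```python
# A test example is "seen" if its input semantics (as a canonical act-slot
# structure) appeared in the training or adaptation data, "unseen" otherwise.
def canonical(semantics):
    """semantics: dict mapping a dialogue act to a list of slots."""
    return tuple(sorted((act, tuple(sorted(slots)))
                        for act, slots in semantics.items()))

def split_test_set(train_semantics, test_examples):
    seen_keys = {canonical(s) for s in train_semantics}
    seen, unseen = [], []
    for example in test_examples:
        bucket = seen if canonical(example["semantics"]) in seen_keys else unseen
        bucket.append(example)
    return seen, unseen
```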
0:18:24 Let's see the numbers for the fifty percent adaptation data. With this much data, most of the testing examples are seen, and all the models perform similarly well; these numbers are the numbers of wrong examples the models produce, so the lower the better.
0:18:45 However, with very limited adaptation data, out of nine hundred unseen semantics, that is, semantics never seen before during training or adaptation, the baseline systems produce around seven hundred wrong examples, wrong semantics in the generated sentences, whereas our tree with attention produces a very low number, just around one hundred and thirty. So this implicitly tells us that our model may have a better generalisation ability to unseen semantics.
0:19:28 Okay, so here comes my conclusion. By modelling the semantic structure, more information might be shared between domains, and this is helpful for domain adaptation. Our model, especially with the proposed layer-wise attention mechanism, generates better sentences in terms of automatic metrics and human scores; in particular, with very limited adaptation data our model performs the best.
0:20:02 So thank you very much for coming, and any questions and feedback are welcome. Thank you.
0:20:12 Thank you very much. So, questions?
0:20:21 Q: You said that you are doing well with one point two five percent, which sounds good. What is the number of training examples at that point?
0:20:31 A: Yes, it is here. So for example, when we adapt from restaurant to hotel, the number of pretraining examples is eight point five k, but if we are using only one percent here, it is probably around six hundred. It is still small.
0:20:47 Yes.
0:20:49 Q: Hi, can you go to the plot for the tree? Yes, so, for the slots: is the attention over all the slots, or not all of them? Because for a given example only the four green nodes are actually present in the data, so why would we need to attend to the rest?
0:21:17 A: Sorry, actually it is the slots within the semantics. So only the slots in the semantics are activated, and those are what we use for the attention.
0:21:30 Q: Okay, another question: when you do domain transfer, what if the two domains have different sets of slots, and those slots only appear in the unseen domain, so they are never trained on in the data?
0:21:46 A: By the nature of this dataset, as you see, we have restaurant, hotel, and attraction, which share slots across those three domains, while the other domains mostly have their own unique slots, and train and taxi share some slots. That is why we have these setups, restaurant to hotel, restaurant to attraction, and train to taxi: because we try to leverage the shared slots.
0:22:18 Q: Hello, great talk. I had a question about the evaluation that looks at the redundant and missing slots.
0:22:27 A: Yes, the slot error rate.
0:22:30 Q: My question is: conceptually, why does this even need to be a problem? Because you could have constraints that ensure that each slot is produced exactly one time during generation.
0:22:44 A: Yes, it depends on where you put your constraints. If you put them in the loss function during training, that does not guarantee, right, that the model will still follow your constraints. But if you put your constraint at the output, like a post-processing step, you might filter out some redundant slots, which is good, but you might end up with unnatural sentences, right, because you used a module to filter something out, and you would need another module to make the output fluent around whatever you filtered. So it is actually a problem, and we simply follow some prior work on this.
0:23:26 Q: Okay, so I guess, yes, conceptually I get that there would be a tradeoff between naturalness and coverage, but if you know in advance that a requirement is coverage, then I guess your only degree of freedom would be to give up some naturalness.
0:23:43 A: Sorry, I...
0:23:44 Q: Sorry, maybe you missed what I said; I am just making the comment that if you know in advance that your requirement is that you need to generate all the slots, then your only degree of freedom is to give up on naturalness.
0:23:55 A: All right, based on that, yes, that is the scoring required in this task.
0:24:01 Q: Thanks. I have a question regarding this tree that you have shown here, this figure: are the values somehow encoded in this tree, or are you only taking the slots into account?
0:24:18 A: Only the slots; we do not use the values, because you do not need them, and also there are too many values, so actually I do not use them.
0:24:27 Q: And then I have a second question: have you thought about a model that works without delexicalisation?
0:24:39 A: Yes, that will come with some issues. For example, your values will be pretty much like an open vocabulary, right? For this dataset we have restaurant names, attraction names and hotel names, and the train IDs, and the time slots. This would become very complex. It is still a challenging problem in NLG.
0:25:08 Okay.
0:25:11 Right, I think we need to move on to the next papers, so let's thank the speaker again. Thank you very much.