Speech Transcript - Tree-Structured Semantic Encoder with Knowledge Sharing for Domain Adaptation in Natural Language Generation

0:00:17	okay so i
0:00:20	think we can start hello everyone good morning and
0:00:25	welcome to the third session
0:00:29	and today the topic is the end-to-end dialog systems and natural language generation
0:00:37	we have from natural language generation models to end-to-end systems
0:00:43	and the
0:00:45	first speaker today is bo-hsiang tseng
0:00:50	with the paper on a tree-structured semantic encoder with knowledge sharing for domain
0:00:57	adaptation in nlg
0:01:00	so this is this is the natural language generation model
0:01:05	are we ready
0:01:07	okay so
0:01:10	go ahead you have the floor
0:02:28	hello everyone
0:02:29	good morning welcome to my presentation
0:02:32	my name is bo-hsiang tseng from university of cambridge and today i'm going to
0:02:36	share my work tree-structured semantic encoder with knowledge sharing for domain adaptation in natural
0:02:42	language generation
0:02:46	i guess
0:02:47	pretty much all of you
0:02:48	are pretty much familiar with this pipeline dialogue system
0:02:52	here i just want to highlight that
0:02:54	this work is focusing on
0:02:56	this natural language generation component
0:02:59	so the input is just these semantics from the policy network and the output is
0:03:03	natural language
0:03:06	okay so given the semantics representation like this
0:03:12	here it is from the restaurant domain
0:03:15	and the system is informing about the name of the restaurant
0:03:19	address
0:03:20	and its phone number
0:03:23	so with our natural language generation model
0:03:26	it will produce natural language to the user
0:03:29	and this sentence this utterance has to contain all the correct information in the
0:03:34	semantics
0:03:35	that's the goal of an nlg model
0:03:38	we focus on the domain adaptation scenario in this work
0:03:42	which means that
0:03:44	you might have a bunch of data from your source domain
0:03:47	and you can use that data
0:03:49	to pretrain your model
0:03:51	to get a pretrained model
0:03:53	and then you want to use some of the limited data from your target
0:03:58	domain
0:03:58	to fine-tune your model
0:04:00	that makes your model able to work well in the domain you
0:04:05	are interested in
0:04:07	that's the domain adaptation scenario
0:04:11	so how do we usually encode our semantics
0:04:15	among prior work
0:04:17	there are pretty much two main approaches
0:04:20	the first one is this
0:04:22	people will use a binary representation
0:04:24	like this
0:04:25	so each element
0:04:27	each element in the vector representation
0:04:29	is corresponding to a certain slot-value pair
0:04:32	in your ontology
0:04:36	or we can treat
0:04:38	our semantics
0:04:39	as a sequence of tokens
0:04:40	and simply use an lstm
0:04:43	to encode your semantics
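A minimal sketch of the two flat encodings described here, assuming a toy ontology of slot-value pairs. The names `ONTOLOGY`, `binary_encode` and `token_sequence` are illustrative and do not come from the talk.

```python
from typing import List, Tuple

# Hypothetical toy ontology: every slot-value pair the system knows about.
ONTOLOGY: List[Tuple[str, str]] = [
    ("area", "north"),
    ("area", "south"),
    ("price", "cheap"),
    ("price", "expensive"),
]


def binary_encode(pairs: List[Tuple[str, str]]) -> List[int]:
    """Return a 0/1 vector with one element per slot-value pair in the ontology."""
    active = set(pairs)
    return [1 if pair in active else 0 for pair in ONTOLOGY]


def token_sequence(domain: str, act: str, pairs: List[Tuple[str, str]]) -> List[str]:
    """Flatten the semantics into tokens that a sequence encoder could read."""
    tokens = [domain, act]
    for slot, value in pairs:
        tokens.extend([slot, value])
    return tokens


example = [("area", "north"), ("price", "cheap")]
vector = binary_encode(example)
tokens = token_sequence("restaurant", "inform", example)
```

Both outputs are flat: neither records which slot sits under which dialogue act or domain.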
0:04:47	actually
0:04:48	both of the approaches work well
0:04:50	however
0:04:52	they don't really capture
0:04:53	the internal structure of the semantics
0:04:56	for example
0:04:56	in the semantics
0:04:58	you actually have this kind of tree structure
0:05:01	because
0:05:03	under the request
0:05:05	there's a price slot
0:05:07	that the dialogue system wants to ask from the user
0:05:10	so like here like this up to here
0:05:13	and under the inform dialogue act
0:05:16	you actually have three slots
0:05:18	information that you want to tell the user
0:05:21	and both dialogue acts
0:05:22	are under the restaurant domain
0:05:27	so
0:05:28	this semantic structure is not captured by those two approaches
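The tree the speaker describes (domain, then dialogue acts, then slots) can be sketched as a nested mapping. This is only an illustration: the triple format and the helper `build_semantic_tree` are assumptions, not the authors' data format.

```python
from collections import defaultdict


def build_semantic_tree(triples):
    """Group (domain, act, slot) triples into a domain -> act -> [slots] mapping."""
    tree = defaultdict(lambda: defaultdict(list))
    for domain, act, slot in triples:
        tree[domain][act].append(slot)
    # Convert to plain dicts so the result is easy to print and compare.
    return {domain: dict(acts) for domain, acts in tree.items()}


semantics = [
    ("restaurant", "request", "price"),
    ("restaurant", "inform", "name"),
    ("restaurant", "inform", "address"),
    ("restaurant", "inform", "phone"),
]
tree = build_semantic_tree(semantics)
```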
0:05:35	but do we really need to capture this kind of structure
0:05:38	does it help if it does not help then why bother right
0:05:42	i'll give a very simple example
0:05:46	so again given the semantics like this
0:05:49	from the source domain
0:05:52	and you have the corresponding tree like this
0:05:56	during adaptation
0:05:57	in the domain adaptation scenario
0:06:00	you might have this similar semantics
0:06:04	which shares some content
0:06:09	and that's its corresponding tree structure
0:06:12	as you can see here
0:06:14	most of the information
0:06:17	is shared between those two semantics in the tree structure
0:06:21	besides
0:06:22	domain information
0:06:24	so if we can come up with a better way to capture those structures
0:06:29	within the semantics
0:06:30	perhaps the model is able to share information
0:06:33	more effectively
0:06:35	between domains during domain adaptation
0:06:37	and that's the motivation of this work
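To make the sharing argument concrete, here is a small sketch that counts which tree links a source and a target semantics have in common. The edge representation and the function `tree_edges` are hypothetical; the example semantics mirror the restaurant and hotel case from the talk.

```python
def tree_edges(domain, act_slots):
    """Return the set of (child, parent) links for one semantics."""
    edges = set()
    for act, slots in act_slots.items():
        edges.add((act, domain))  # the dialogue act feeds the domain node
        for slot in slots:
            edges.add((slot, act))  # each slot feeds its dialogue act
    return edges


acts = {"inform": ["name", "address", "phone"], "request": ["price"]}
source = tree_edges("restaurant", acts)
target = tree_edges("hotel", acts)

shared = source & target       # links present in both trees
only_source = source - target  # links that differ between the domains
```

In this toy case every slot-to-act link is shared and only the links into the domain node differ, which is the situation the speaker points to.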
0:06:40	so the question here is
0:06:42	how to encode the structure
0:06:48	so here is the proposed model the
0:06:51	tree-structured semantic encoder
0:06:55	actually the structure is pretty much
0:06:57	the one you see
0:06:58	in the previous slide
0:07:00	first
0:07:01	we have the slot layer
0:07:03	and all your slots in the ontology
0:07:06	will be listed here
0:07:10	and then you have dialogue act layer
0:07:12	it is used to describe all the dialogue acts you have
0:07:15	in your system
0:07:18	and then we have the domain layer
0:07:22	at the bottom of the tree
0:07:23	we designed a property layer
0:07:25	that is used to describe
0:07:27	the property of a slot
0:07:29	because for example
0:07:31	a slot
0:07:32	perhaps area can be requestable
0:07:35	or a slot can be requestable
0:07:37	and
0:07:39	here it is informable
0:07:42	so we use it to describe the property of a slot
0:07:46	so
0:07:47	and given the semantics like this
0:07:50	based on all the information all the structure you have
0:07:53	we can build a corresponding tree
0:07:55	with this definition of a tree
0:07:58	so first basically based on the property of a slot you can
0:08:02	build the links between the property layer
0:08:05	between the property layer and the slot layer
0:08:08	and then
0:08:09	all the slots will go to the dialogue acts
0:08:12	they belong to in the semantics
0:08:14	like this
0:08:17	and the two dialogue acts in this example will go to the restaurant domain
0:08:23	and finally
0:08:24	we'll take the root of the tree
0:08:25	as the final representation
0:08:28	so that this is the way we can
0:08:30	encode
0:08:31	the tree structure in the semantics
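The layer-by-layer linking just described can be sketched as follows. The property table is hypothetical (the talk only says a slot can be requestable or informable), and `build_layered_tree` is an illustrative helper rather than the authors' code.

```python
def build_layered_tree(domain, act_slots, slot_properties):
    """Return (child, parent) links for each pair of adjacent layers."""
    links = {"property->slot": [], "slot->act": [], "act->domain": [], "domain->root": []}
    for act, slots in act_slots.items():
        for slot in slots:
            for prop in slot_properties[slot]:
                link = (prop, slot)
                if link not in links["property->slot"]:
                    links["property->slot"].append(link)
            links["slot->act"].append((slot, act))
        links["act->domain"].append((act, domain))
    # The root collects the domain node and serves as the final representation.
    links["domain->root"].append((domain, "root"))
    return links


# Hypothetical property assignment, for illustration only.
properties = {
    "price": ["requestable"],
    "name": ["informable"],
    "address": ["informable"],
    "phone": ["informable"],
}
links = build_layered_tree(
    "restaurant",
    {"request": ["price"], "inform": ["name", "address", "phone"]},
    properties,
)
```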
0:08:35	so what do we exactly compute in the tree
0:08:40	basically we follow the prior work tree-lstm
0:08:47	from two thousand fifteen
0:08:50	first
0:08:52	for example let's say the node here
0:08:56	we compute
0:08:57	the summation over all its children
0:09:02	the summation of the hidden states and the summation of the
0:09:06	memory cells over all its children
0:09:11	and then
0:09:12	like the vanilla lstm
0:09:14	we compute the input gate forget gate and output gate
0:09:19	and finally
0:09:20	we can compute the memory cell and hidden state
0:09:23	i hope that is clear enough to you
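The node update as spoken can be written out as below. This is a sketch under assumptions: the node input x_j, the candidate u_j and the weight names are not given in the talk, and the single forget gate on the summed memory cell follows the speaker's wording. The child-sum Tree-LSTM of Tai et al. (2015) instead uses one forget gate per child.

```latex
\begin{align}
\tilde{h}_j &= \sum_{k \in C(j)} h_k, \qquad \tilde{c}_j = \sum_{k \in C(j)} c_k \\
i_j &= \sigma\left(W_i x_j + U_i \tilde{h}_j + b_i\right) \\
f_j &= \sigma\left(W_f x_j + U_f \tilde{h}_j + b_f\right) \\
o_j &= \sigma\left(W_o x_j + U_o \tilde{h}_j + b_o\right) \\
u_j &= \tanh\left(W_u x_j + U_u \tilde{h}_j + b_u\right) \\
c_j &= i_j \odot u_j + f_j \odot \tilde{c}_j \\
h_j &= o_j \odot \tanh\left(c_j\right)
\end{align}
```

Here C(j) is the set of children of node j, and the hidden state of the root node is what the speaker calls the final representation.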
0:09:29	so
0:09:30	on
0:09:33	again the same simple example
0:09:37	given the semantics in the source domain
0:09:44	we have the corresponding tree structure
0:09:47	and during adaptation
0:09:50	you might have this similar semantics in the target
0:09:53	domain
0:09:54	and as we can see here
0:09:55	with our designed
0:09:57	tree structure
0:09:58	most information in the tree
0:10:00	is shared
0:10:02	and we hope that can help the model share information between domains
0:10:11	okay
0:10:12	so now so far we know how to encode a tree
0:10:14	of the semantics
0:10:16	then let's go to the generation process
0:10:21	it is very straightforward to just take the output
0:10:24	the final representation of the tree as the initialization
0:10:28	of your decoder
0:10:30	and we follow some prior work
0:10:32	where the values in the utterances are delexicalised as
0:10:38	slot tokens
0:10:40	so in this work we design a slot token as domain information dialogue act information and
0:10:45	slot information
0:10:48	so we just use the standard cross entropy
0:10:51	to train our decoder
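A minimal sketch of delexicalisation with slot tokens that carry domain, dialogue act and slot information. The bracketed token format and the example utterance are invented for illustration; the talk does not specify the exact token form.

```python
def slot_token(domain, act, slot):
    """Compose a slot token from domain, dialogue act and slot (format is hypothetical)."""
    return f"[{domain}-{act}-{slot}]"


def delexicalise(utterance, domain, act, slot_values):
    """Replace each surface value in the utterance with its slot token."""
    for slot, value in slot_values.items():
        utterance = utterance.replace(value, slot_token(domain, act, slot))
    return utterance


text = "the phone number of golden house is 01223 123456"
delex = delexicalise(
    text,
    "restaurant",
    "inform",
    {"name": "golden house", "phone": "01223 123456"},
)
```

The decoder is then trained to produce the delexicalised sentence, and the tokens are filled back in with real values after generation.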
0:10:54	sounds alright sounds good
0:10:55	we have a way to encode a tree structure
0:10:59	but actually if we think more we just use the very abstract information of the tree
0:11:05	however there is
0:11:07	a bunch of information at the intermediate levels
0:11:12	thanks to our defined tree
0:11:15	so this motivates us to
0:11:18	come up with a better way
0:11:19	to access the information at the intermediate levels
0:11:22	so that the decoder
0:11:24	can have more information about the tree structure
0:11:29	so here we propose our it is very straightforward
0:11:32	we apply
0:11:34	we apply attention to the
0:11:36	to the domain dialogue act and slot layers
0:11:40	so we have the layer-wise attention mechanism
0:11:44	whenever the model
0:11:45	the decoder
0:11:47	produces the special
0:11:49	slot token like this
0:11:53	the hidden state at that time-step
0:11:55	will be used as a query
0:11:57	to trigger the attention mechanism
0:11:59	like this so for example at the slot layer
0:12:04	all the slot information
0:12:06	will be treated as the context
0:12:09	for the
0:12:10	for the attention mechanism
0:12:13	and then the model
0:12:14	will compute
0:12:16	a probability distribution over the information for the three layers
0:12:21	so for example again
0:12:23	in the slot layer
0:12:25	you will have a distribution over all possible slots
0:12:29	it basically tells the model which slot
0:12:32	to focus on which information the model should focus on at this time step
0:12:39	of course during training
0:12:40	we do have supervision signals
0:12:42	from the input semantics
0:12:45	this can help the model this can guide the model
0:12:48	to tell it what to focus on
0:12:50	at each time-step
0:12:54	and then we will use this extra
0:12:55	we use these attention distributions as the extra information for the next time step
0:13:02	and the generation process
0:13:04	goes on
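A small sketch of attention over one layer, here the slot layer. Dot-product scoring is an assumption made for brevity (the talk does not say how scores are computed), and the two-dimensional node states are made-up numbers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, node_states):
    """Return one attention weight per node, using the decoder state as the query."""
    names = list(node_states)
    scores = [sum(q * k for q, k in zip(query, node_states[name])) for name in names]
    return dict(zip(names, softmax(scores)))


# Hypothetical hidden states of the active slot nodes.
slot_layer = {"name": [1.0, 0.0], "address": [0.0, 1.0], "phone": [0.5, 0.5]}
weights = attend([2.0, 0.0], slot_layer)
```

The same routine would be run once per layer (domain, dialogue act, slot), giving the three distributions that are fed to the next decoding step.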
0:13:06	so
0:13:07	with our layer-wise attention mechanism
0:13:10	our loss function becomes the standard cross entropy
0:13:14	the cross entropy plus
0:13:16	the losses
0:13:18	from the
0:13:18	three attention mechanisms
0:13:21	that's how we train our model
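One way to write the combined objective, as a sketch only: the unweighted sum and the negative log-likelihood form of each attention loss are assumptions, since the talk states just that the attention losses are added to the cross entropy and supervised from the input semantics.

```latex
\begin{align}
\mathcal{L} &= \mathcal{L}_{\mathrm{CE}} + \mathcal{L}_{\mathrm{domain}} + \mathcal{L}_{\mathrm{act}} + \mathcal{L}_{\mathrm{slot}} \\
\mathcal{L}_{\ell} &= -\sum_{t \in T_{\mathrm{slot}}} \log a^{\ell}_{t}\left(y^{\ell}_{t}\right), \qquad \ell \in \{\mathrm{domain}, \mathrm{act}, \mathrm{slot}\}
\end{align}
```

Here T_slot is the set of time steps at which a slot token is produced, a_t^l is the attention distribution of layer l at step t, and y_t^l is the node the input semantics marks as correct for that step.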
0:13:26	okay let's go to some basic setups
0:13:29	for experiments
0:13:31	we are using the multiwoz dataset which has around ten thousand dialogues
0:13:37	over seven domains
0:13:39	and within one utterance
0:13:42	there can actually be more than one dialogue act
0:13:47	we have three strong baselines the first one is sc-lstm
0:13:52	it basically uses a binary representation to encode the semantics
0:13:57	and we have
0:13:58	tgen and ralstm
0:14:01	those two models are basically seq2seq models
0:14:04	so they are using an lstm encoder
0:14:07	to encode the semantics
0:14:10	as for evaluation
0:14:12	we have the standard
0:14:14	automatic metrics such as bleu
0:14:18	and also the slot error rate
0:14:20	because we don't want our
0:14:23	natural language generation model
0:14:24	to just be fluent but also
0:14:26	the content should be correct
0:14:29	and we also conduct a human evaluation
0:14:34	okay let's see some numbers first
0:14:36	on
0:14:38	here the
0:14:39	source domain is restaurant
0:14:41	and the target domain is hotel
0:14:46	the x-axis
0:14:48	is the different amount of the adaptation data
0:14:51	and the y-axis is the bleu score
0:14:55	three baseline models are here
0:14:58	and the tree-structured encoder
0:15:01	and its variant
0:15:02	tree structure with attention mechanism
0:15:05	as you can see
0:15:08	with full adaptation data a hundred percent of the data
0:15:11	all the models perform pretty much similarly
0:15:14	because the data is
0:15:15	pretty much enough
0:15:17	however
0:15:18	with less data
0:15:21	such as less than five percent
0:15:24	our model starts to gain benefits
0:15:27	thanks to the
0:15:30	structure
0:15:30	thanks to the tree structure
0:15:34	let's look again at the numbers of the slot error rate
0:15:38	the slot error rate is defined like this
0:15:41	we don't want our model
0:15:42	to produce
0:15:43	missing slots
0:15:45	to have missing slots or produce redundant slots
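A minimal sketch of a slot error rate in the sense described, assuming the common form (missing + redundant) / (number of required slots). The slide with the exact definition is not in the transcript, so the formula and the helper name are assumptions.

```python
from collections import Counter


def slot_error_rate(required_tokens, generated_tokens):
    """Compute (missing + redundant) / required over slot tokens.

    Assumes required_tokens is non-empty and that generated_tokens holds
    only the slot tokens found in the generated sentence.
    """
    required = Counter(required_tokens)
    produced = Counter(generated_tokens)
    missing = sum((required - produced).values())    # required but not generated
    redundant = sum((produced - required).values())  # generated but not required
    return (missing + redundant) / sum(required.values())


ser = slot_error_rate(
    ["[name]", "[address]", "[phone]"],
    ["[name]", "[phone]", "[phone]"],
)
```

In this example the address is missing and the phone token appears once too often, so two errors are counted against three required slots.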
0:15:50	so again with a hundred percent of data
0:15:53	all the models perform very similarly
0:15:54	they're all good
0:15:56	with all data
0:15:57	however
0:15:58	with pretty much less data
0:16:00	with pretty limited data
0:16:04	even with
0:16:06	one point two five percent of the data
0:16:08	our model starts to
0:16:10	produce very good performance
0:16:12	over all the baselines
0:16:18	the previous slides just show one setup
0:16:20	we actually conduct three different kinds of setups to show that
0:16:24	the model works in different scenarios
0:16:28	the first column is
0:16:30	the one you saw in the previous slide
0:16:32	restaurant to hotel adaptation
0:16:36	and the second one
0:16:37	the middle column is restaurant to attraction
0:16:40	and the third one is train to taxi
0:16:44	here we just want to show that we can observe a similar trend similar results
0:16:49	over all the different setups
0:16:54	okay so we all know that natural language generation task
0:16:58	is not enough
0:16:59	to just evaluate by the automatic metrics
0:17:02	so we also conduct human evaluation
0:17:04	we use amazon mechanical turk
0:17:08	each turker is asked to give a score out of five in terms of
0:17:12	informativeness
0:17:14	and naturalness
0:17:16	so here are some basic numbers
0:17:20	in terms of informativeness
0:17:22	the tree structure with attention
0:17:25	scores the best
0:17:27	and the tree without attention scores second
0:17:30	which tells us that
0:17:32	if you have a better way to encode your tree structure
0:17:36	then the information can be shared during domain adaptation
0:17:40	the model tends to produce
0:17:43	the correct semantics in the generated sentences
0:17:49	meanwhile we can still maintain the naturalness
0:17:52	of the generated sentences
0:17:56	so we wonder
0:17:58	where our
0:17:59	improvements are coming from
0:18:01	on what kind of examples our model really performs
0:18:05	good
0:18:05	performs well
0:18:07	so we divide the test set into seen and unseen
0:18:11	subsets
0:18:13	seen basically means
0:18:15	if the input semantics is seen during training
0:18:18	then it belongs to the seen subset otherwise it is unseen
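The seen/unseen split can be sketched as a set-membership check. Treating a semantics as an unordered set of (domain, act, slot) triples is an assumption; the talk does not say how two semantics are compared.

```python
def split_seen_unseen(train_semantics, test_semantics):
    """Split test semantics by whether the same set of triples occurred in training."""
    seen_keys = {frozenset(sem) for sem in train_semantics}
    seen, unseen = [], []
    for sem in test_semantics:
        (seen if frozenset(sem) in seen_keys else unseen).append(sem)
    return seen, unseen


train = [
    [("restaurant", "inform", "name")],
    [("restaurant", "inform", "name"), ("restaurant", "request", "price")],
]
test = [
    [("restaurant", "request", "price"), ("restaurant", "inform", "name")],
    [("hotel", "inform", "name")],
]
seen, unseen = split_seen_unseen(train, test)
```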
0:18:24	that's
0:18:25	let's see the
0:18:26	numbers with fifty percent adaptation data
0:18:30	with this bunch of data
0:18:32	most of the testing examples are seen
0:18:35	and all the models perform similarly
0:18:37	these numbers are the
0:18:39	number of the wrong examples the model produces
0:18:43	and the lower the better
0:18:45	however
0:18:47	with very limited adaptation data
0:18:50	out of nine hundred unseen semantics
0:18:53	that is semantics never seen before during training or adaptation
0:18:58	the baseline systems
0:19:00	produce
0:19:01	around seven hundred
0:19:03	wrong examples
0:19:04	wrong semantics in the generated sentences
0:19:08	however our tree with attention can produce a very low number
0:19:13	just around a hundred and thirty
0:19:16	so this implicitly tells us
0:19:19	our model might have better generalisation ability
0:19:23	to the unseen semantics
0:19:28	okay so here comes my conclusion
0:19:33	by modeling the semantic structure
0:19:35	more information might be shared between domains and this is helpful for domain adaptation
0:19:41	and our model
0:19:43	especially with the proposed layer-wise attention mechanism
0:19:48	generates better sentences in terms of automatic metrics and the human scores
0:19:54	especially with limited
0:19:56	very limited adaptation data
0:19:58	our model performs the best
0:20:02	so thank you very much for your coming
0:20:04	and any questions and feedback are welcome thank you
0:20:12	thank you very much so questions
0:20:21	you said that you're doing well with one point two five percent which sounds good
0:20:24	what's the number of training examples yes one point
0:20:28	yes
0:20:29	on
0:20:31	it is here so for example when we adapt from restaurant to hotel the number of pretraining
0:20:36	examples is
0:20:37	eight point five k but if we are using only one percent
0:20:41	here it probably six under
0:20:44	is still song
0:20:47	yes
0:20:49	hi can you go to the plot for the tree
0:20:53	yes so you're for the full use
0:20:57	yes of to unseen so first use a the attention is on all fours on
0:21:03	wall for slot is not all the slots
0:21:05	but for a given example the only the for green nodes are like yes in
0:21:13	the data so why do we do need to attend to internet and from which
0:21:17	sorry actually it is the slots within the semantics
0:21:22	so only the slots in the semantics are activated
0:21:26	and will be used for the attention okay so another question
0:21:30	is when you do domain transfer what if the two domains have different sets of slots
0:21:36	and for those slots that only appear in one in the unseen domain it's never
0:21:42	trained in the in the data because in
0:21:45	on
0:21:46	because by the nature of this dataset
0:21:48	as you see we have restaurant hotel attraction those three domains share
0:21:55	most of the slots although they have their unique slots
0:22:00	train and taxi share some slots so that's why when i have
0:22:04	the setups we have restaurant to hotel restaurant to attraction
0:22:09	and train to taxi
0:22:10	because we try to leverage the shared slots
0:22:18	hello great so i had a question about the evaluation that looks at the
0:22:24	redundant and missing slots
0:22:27	yes the slot error rate
0:22:30	my question is
0:22:32	conceptually why does that even need to be a problem because
0:22:36	you could have constraints
0:22:38	that ensure that each slot is produced exactly one time during the generation
0:22:44	yes it depends on how you put your constraints
0:22:47	if you put it in the loss function during training
0:22:51	that doesn't guarantee right that the model will still follow your constraint
0:22:56	but if you put your constraint at the output more like a post
0:23:00	processing
0:23:02	you might filter out some slots that's good but
0:23:06	you might come up with unnatural sentences right because you use
0:23:10	rules to filter out something
0:23:12	you need to come up with some rules to make it fluent
0:23:14	between the slots you filtered out
0:23:17	so it is actually a problem and the we simply follow some prior work which
0:23:23	is which is my fix the use ice
0:23:26	just okay so i guess yes conceptually i get there'd be a tradeoff between naturalness
0:23:30	and coverage but if you know in advance that a requirement is coverage then
0:23:36	i guess your only degree of freedom would be to give up some naturalness
0:23:43	sorry i
0:23:44	i missed your last point i was just making the comment that if
0:23:47	you know in advance that your requirement is that you need to generate all the
0:23:50	slots yes then your only degree of freedom
0:23:53	is to give up on naturalness
0:23:55	all right based on that nation if the scoring for the right in this task
0:24:01	thanks to
0:24:02	i have a question regarding this tree that you have shown here in the
0:24:09	picture yes are the values somehow encoded in this tree or are you
0:24:15	only taking into account the slots
0:24:18	only the slots we don't use
0:24:20	values because you don't need to
0:24:23	yes and also the values there are too many actually
0:24:27	and then i have another question have you thought about a model
0:24:34	it takes a condensation
0:24:37	without delexicalisation
0:24:39	on yes
0:24:41	on
0:24:42	it will come up with some problems for example your values will be pretty much
0:24:47	like open vocabulary right for this dataset we have restaurant
0:24:53	name
0:24:54	attraction name and the hotel name
0:24:57	and the train id
0:24:59	and the time slot
0:25:01	this will become very complex
0:25:04	it is still a challenging problem in nlg
0:25:08	okay
0:25:11	right i think we need to move to the next paper so let's thank the
0:25:15	speaker again thank you very much

Tree-Structured Semantic Encoder with Knowledge Sharing for Domain Adaptation in Natural Language Generation

Oral Session 3: Generation and End-to-end Dialogue Systems

Bo-Hsiang Tseng, Paweł Budzianowski, Yen-chen Wu and Milica Gasic