0:00:15 Hi everyone, my name is Tiancheng, from Carnegie Mellon University, and I'm going to talk about our work on zero-shot dialog generation with cross-domain latent actions. The code and data are both publicly available.
0:00:30 So this talk is going to be about generative end-to-end dialog systems, which are perhaps one of the most flexible frameworks we have nowadays to model both task-oriented and non-task-oriented conversations.
0:00:43 The basic idea, I'm sure everybody is already familiar with: we have a dialog context, and we have an encoder that encodes whatever is available at testing time, such as the dialog history and other information, and then the decoder network can generate the response. The response can be a verbal response that is sent back to the human, or it can be an API request sent to back-end databases. So a single model can handle both the interaction between human and machine and the interaction between machine and back-end databases.
0:01:14 Although this paradigm is powerful and flexible, most of the successful prior work has one assumption: that there is a large training dataset for the exact task or domain we are interested in, so that we can train the model on it.
0:01:30 That assumption is often not true in practice, because dialog systems can be applied to so many different domains. Even just for slot filling, we have slot filling for bus schedules, weather, flights, and so many other domains, and many times we don't have data for the exact domain in which we are going to deploy.
0:01:54 Humans, on the other hand, are incredibly good at transferring knowledge from domain to domain. Imagine a customer service agent who was working in the shoe department: they can very quickly adapt to the clothing department just by reading some training materials, without the need for actual example dialogs. We want to achieve a similar goal in this study.
0:02:18 To summarize, the first goal is that we want to exploit the flexibility of generative models, so that a single model can simultaneously acquire knowledge from multiple domains. The second goal is that we want the end-to-end model to be able to transfer knowledge from source domains to a new domain where we don't have data. This is a new problem that we formalize as a learning problem we name zero-shot dialog generation, or ZSDG.
0:02:46 The setup is as follows. We have source domains, meaning domains where we do have dialog data, and we have a set of target domains where we don't have dialog data. For every domain, both source and target, we do have access to a domain description, which can be any type of knowledge that describes the specific information about that domain. Given this setup, the learning problem becomes the following: at training time, the model can be trained on the dialogs from the source domains, together with the domain descriptions from both the source and target domains. At testing time, we ask the model to directly generate responses in the target domains, whose dialogs were never seen in training. That is why we call it the zero-shot dialog generation problem.
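To make the data conditions concrete, the setup can be written as follows; the notation here is a paraphrase of the talk, not necessarily the paper's exact symbols:

```latex
% S: source domains; T: target domains; DD(d): the domain description of d.
\mathcal{D}_{\mathrm{train}}
   = \{(c, x)\ \text{dialog pairs from } d \in S\}
   \;\cup\; \{DD(d) : d \in S \cup T\}
\qquad
\text{Test: generate a response } x \text{ for context } c \text{ in } d \in T
```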
0:03:36 Just to show the formulation alongside the visual figure: given this setup, it is very easy to see that the design of the domain description is the most important factor here, because it covers all the domains and it is what enables the possibility of transferring knowledge from source to target. There could be many different types of domain description, and in this study we propose one type that we call seed responses.
0:04:03 The assumption behind seed responses is that between the source and target domains there exist some shared, related discourse patterns, for example a shared dialog policy. Given that assumption, what is a seed response? The seed responses are a list of triples, and each triple contains three elements: x, a, and d. x is an example utterance, which can be spoken by either the user or the system in this domain; a is the annotation of that utterance, as the example shown here; and d is basically the domain index. For each domain we then have a table like this, holding the seed responses from that domain.
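As a minimal sketch of that data structure (the example utterances and annotations below are invented for illustration, not taken from the actual data):

```python
from typing import List, NamedTuple

class SeedResponse(NamedTuple):
    utterance: str   # x: an example user or system utterance
    annotation: str  # a: the annotation of x, e.g. a semantic frame
    domain: str      # d: the domain index

# Hypothetical seed responses for a "movie" target domain.
movie_seed: List[SeedResponse] = [
    SeedResponse("what kind of movie are you looking for", "request(genre)", "movie"),
    SeedResponse("movie-55 is a good choice", "inform(name=movie-55)", "movie"),
]
```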
0:04:46 So given the seed responses and the dialogs from the source domains, how do we use these two types of data to train a model that actually achieves ZSDG? In this work we propose a new class of algorithms called Action Matching, and in this algorithm the most important notion is the cross-domain latent action. We introduce a new latent space, and we assume that all the possible latent actions from the system and the user reside in this latent space. In Action Matching we propose to learn three sets of parameters. The first one is R, the recognition network; the function of R is basically to map an utterance or an annotation, that is, a sequence of words, to a latent action. The second is an encoder that encodes the dialog context and tries to predict what the next latent action is. And the third one is the decoder: because we want a generative model, we expect the decoder to take an action, any point in the latent space, and map it back to a sentence. The figure here shows all the possible transformations between the four variables: utterance, annotation, latent action, and dialog context.
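A compact PyTorch sketch of the three components may help; this is my simplification, not the authors' code (the actual model uses a hierarchical LSTM context encoder and, in one variant, a copy decoder, as described later):

```python
import torch
import torch.nn as nn

Z = 64  # size of the shared cross-domain latent action space

class RecognitionNet(nn.Module):
    """R: maps a word sequence (an utterance x or its annotation a) to a latent action."""
    def __init__(self, vocab_size: int, emb_dim: int = 32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, Z // 2, bidirectional=True, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # [B, T] -> [B, Z]
        _, h = self.rnn(self.emb(tokens))
        return torch.cat([h[0], h[1]], dim=-1)

class ContextEncoder(nn.Module):
    """Maps an encoded dialog context to a predicted latent action."""
    def __init__(self, ctx_dim: int):
        super().__init__()
        self.proj = nn.Linear(ctx_dim, Z)

    def forward(self, ctx: torch.Tensor) -> torch.Tensor:  # [B, ctx_dim] -> [B, Z]
        return self.proj(ctx)

class Decoder(nn.Module):
    """Maps any point z in the latent space back to a sentence (teacher-forced)."""
    def __init__(self, vocab_size: int, emb_dim: int = 32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, Z, batch_first=True)
        self.out = nn.Linear(Z, vocab_size)

    def forward(self, z: torch.Tensor, prev_tokens: torch.Tensor) -> torch.Tensor:
        state = (z.unsqueeze(0), torch.zeros_like(z).unsqueeze(0))  # z seeds the LSTM
        out, _ = self.rnn(self.emb(prev_tokens), state)
        return self.out(out)  # [B, T, vocab] logits
```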
0:05:57 Okay, so now we have these three sets of parameters we want to learn, and we have two types of data, so how do we optimize? The first type of data we encounter is the seed-response data, basically a bunch of sentences from different domains. The objective here is that we want the latent actions of two utterances from two different domains to match each other only when their annotations match each other. What we do here is the following: say the yellow utterance is from one domain, for instance restaurant, and the other is from movie. We introduce the first loss function, called the domain description loss, where we basically minimize the distance from the latent action of x to the latent action of a. In this way, utterances from two domains will only be close to each other when their annotations are close to each other.
0:06:44 The second type of data we are dealing with is the dialog data from the source domains. Here the objective is that we want the predicted action to be accurate: we want the action predicted from a dialog context to be similar to the latent action of the actual response observed in the data. So we introduce the second loss. The bottom part of the figure is the same as in the previous slide, and now we have the predicted latent action from the encoder, and we try to minimize the distance between this predicted latent action and the latent action of the reference response x here.
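Put as code, the two objectives might look like this (a hedged sketch: the squared-distance choice and the weight lam are my assumptions, not necessarily what the paper uses):

```python
import torch.nn.functional as F

def domain_description_loss(logits, target_tokens, z_x, z_a, lam=1.0):
    """Seed-response objective: reconstruct the utterance from its own latent
    action, and pull R(x) toward R(a), so utterances from different domains
    end up close only when their annotations match."""
    nll = F.cross_entropy(logits.flatten(0, 1), target_tokens.flatten())
    return nll + lam * F.mse_loss(z_x, z_a)

def dialog_loss(logits, target_tokens, z_pred, z_resp, lam=1.0):
    """Source-dialog objective: generate the response from the context's
    predicted latent action, and pull that prediction toward the latent
    action recognized from the reference response."""
    nll = F.cross_entropy(logits.flatten(0, 1), target_tokens.flatten())
    return nll + lam * F.mse_loss(z_pred, z_resp)
```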
0:07:19 To summarize, the Action Matching algorithm is as shown here, and it is a very simple and elegant solution: we only have two loss functions, and we alternate between them. The first one is the domain description loss, which is used when we are dealing with data from the seed responses. Its second term minimizes the distance between the latent actions of x and a, and its first term trains the decoder to generate responses from both source and target domains. The second one, the dialog loss, is related to the latent-variable view of the original encoder-decoder: one term is training the decoder, and the other term is minimizing the distance I just talked about. Training with Action Matching is then basically taking data from the two streams, the seed responses and the dialogs; we randomly pick one and optimize the corresponding loss function.
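The alternating schedule could be sketched as below; the 50/50 sampling ratio is an assumption, since the talk only says one stream is picked at random per step:

```python
import random
import torch

def train_action_matching(dd_loss_fn, dlg_loss_fn, seed_batches, dialog_batches,
                          optimizer: torch.optim.Optimizer, num_steps: int) -> None:
    """dd_loss_fn / dlg_loss_fn are closures over the three networks that
    compute the domain description loss and the dialog loss for one batch."""
    for _ in range(num_steps):
        if random.random() < 0.5:
            loss = dd_loss_fn(next(seed_batches))     # (x, a, d) triples
        else:
            loss = dlg_loss_fn(next(dialog_batches))  # (context, response) pairs
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```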
0:08:14 For the exact instantiation in this study, we use a bidirectional GRU for the recognition network, and we have a hierarchical LSTM encoder for the context encoder. For the decoder we conduct experiments with two kinds of decoders: one is a standard LSTM decoder with attention, and the second one is an LSTM with the pointer-sentinel copy mechanism, which is a decoder with a copying mechanism, so it can copy words from the context directly into the output response. It has been shown to be pretty robust against out-of-vocabulary tokens in language modeling.
0:08:51 And here we show the picture of what we have in this model when we use the copy decoder: the left figure shows how we deal with the dialog data, and the second figure shows how we deal with the seed-response data, so that we can optimize the three networks jointly.
0:09:11 With that as our method, we tested this framework on two tasks: the first one is SimDial, and the second one is the Stanford Multi-Domain dialog dataset. SimDial is a new open-source multi-domain dialog generator with complexity control; it is open-sourced on GitHub, where there are more detailed instructions about how to use it. We used this generator to generate dialogs from seven domains. We take three domains as the source domains, which are restaurant, bus, and weather, with one thousand dialogs each.
0:09:43 For the target domains we have four, and we chose them to test different perspectives. The first one is restaurant; this is in-domain, because it also appears in training. The second one is restaurant-slot: it is still restaurant, but with a completely different set of slot values. The third one is restaurant-style: again restaurant, but using a completely different style and different natural-language templates for both the user and the system. And the last one, movie, is a new domain that shares nothing with anything in the source domains, which makes it the most challenging one. For the seed responses, we take one hundred utterances from each domain, and we use the internal semantic frame as the annotation.
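In other words, the SimDial experiment layout looks like this (a summary of the numbers just stated, not code from the release):

```python
simdial_sources = {"restaurant": 1000, "bus": 1000, "weather": 1000}  # dialogs each
simdial_targets = ["restaurant",        # in-domain
                   "restaurant-slot",   # unseen slot values
                   "restaurant-style",  # unseen NLG style and templates
                   "movie"]             # entirely new domain
seed_responses_per_domain = 100         # annotated with the internal frame
```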
0:10:24 The second type of data we work with is the Stanford Multi-Domain data, which contains human-written dialogs from three domains: scheduling, weather, and navigation. We take a leave-one-out approach by rotation: we use one domain as the target and the other two as the sources, so we have three possible configurations. We use one hundred fifty utterances from each domain as the seed responses, annotated with semantic frames by an expert annotator. That is all we need from the target domain: we only use those utterances from the target domain in training, and we don't use any dialogs from that domain.
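The leave-one-out rotation can be spelled out as:

```python
# Each Stanford Multi-Domain domain takes a turn as the zero-shot target,
# with the other two serving as source domains.
domains = ["scheduling", "weather", "navigation"]
configs = [(target, [d for d in domains if d != target]) for target in domains]
# -> [('scheduling', ['weather', 'navigation']),
#     ('weather', ['scheduling', 'navigation']),
#     ('navigation', ['scheduling', 'weather'])]
```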
0:11:02 For evaluation, dialog evaluation is known to be hard, so rather than relying on a single metric we evaluate the system with four different metrics: the BLEU score, entity F1, dialog act F1, and KB query F1. To quantify the overall performance we introduce a new score, the BEAK score, which basically takes the geometric mean of the four metrics, so that we have one number for each system as an overall performance measure.
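As a small sketch of the combined score (the input values below are made up for illustration, not reported results):

```python
def beak(bleu: float, entity_f1: float, act_f1: float, kb_f1: float) -> float:
    """Geometric mean of the four metrics: one overall number per system."""
    scores = [bleu, entity_f1, act_f1, kb_f1]
    prod = 1.0
    for s in scores:
        prod *= s
    return prod ** (1.0 / len(scores))

print(beak(20.0, 60.0, 80.0, 70.0))  # ~50.9
```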
0:11:30 We compare four different models. The top two are baselines: the first is an encoder-decoder with attention, and the second one is the decoder with the copy mechanism. The two proposed methods basically add the proposed Action Matching algorithm to these two baselines, so we can see what happens when we add action matching.
0:11:52 Now the results. On the left we show the BEAK score on SimDial, and on the right we show the overall performance on the Stanford data. Here we can already see some interesting findings. First, we can see that the two baselines perform pretty well on the in-domain data, which is the normal training-and-testing scenario, but when they move to the unseen-slot, unseen-style, and new domains, their performance drops significantly. We can also see that the green bar, which is action matching plus the copy decoder, has really strong performance in those target domains that are quite different from the training data; especially on the new domain, movie, it is able to achieve a score of about sixty-eight, whereas even the in-domain performance cap is about eighty-two. So it is actually learning something transferable, and compared with the two baselines the improvement is significant.
0:12:53 So we came up with four questions that we want to answer in the following analysis. The first one is: what fails when we move from source to target? The second: it is interesting to see that the copy decoder alone already does something pretty interesting compared to the baseline, so what does the copy mechanism solve? The third question is: what does action matching solve? And lastly, how does the size of the seed responses affect the performance? Now let's go into each question one by one.
0:13:46as well as normal utterance
0:13:48the novel words in domain
0:13:51but dialogue acts as actually okay at least in this dataset
0:13:54so one good example can see here
0:13:56the reference it see you all model is able to generating so you next time
0:14:00of the you something
0:14:01so that kind of a short response across domains the no problem
0:14:05but the bad examples let's go sample
0:14:07so once this then the referent is that finally about what kind before you
0:14:11this is then generating high this the russell system how can do for you
0:14:15the hardest thing the current dialogue act secreting
0:14:18but the words to compute here arabic i still think it's interest from
0:14:21and not think about as the in the movie domain and estimate example for example
0:14:25here the reference science fiction movie what times movie the baseline only generating focus by
0:14:31what kind of rust right looking for all
0:14:33so that's the problem that was for the way moving training on a restaurant in
0:14:38casting movie
0:14:40 Then the next question is: what does the copy mechanism solve? Here the most useful metric is the entity F1 score. We found that the copy decoder, the decoder with the copy mechanism, keeps a high entity score because it continues to be able to copy entities from the context and output them, even if those words are out-of-vocabulary for the model. A good example you can see here: when the reference contains an entity like 'science fiction', the copy model is able to generate 'science fiction' by retrieving that word from the user's speech instead of predicting it from its vocabulary. But the copy mechanism does not solve the whole problem. A bad example you can see here: the reference is 'I believe you said comedy movies', and the system generates something like 'I believe you said comedy' and then breaks off; it grabs the entity 'comedy', but it does not generate a complete sentence. In another example here, the system says something like 'I would recommend restaurant-55', although, given the movie name, it should be saying 'movie-55 is a good choice'.
0:15:44 The next question is: what does the proposed action matching solve? The answer is that the most relevant score here is the BLEU score, because we want to see whether the correct words are being generated in the new domain. We find that action matching enables the decoder to generate whole novel utterances that never occurred in training, not only entities. Here we also show some good examples. One example is 'movie-55 is a good choice, did you make a choice'. And from the more complex human data, we can see the model say something like 'okay, setting a reminder on Friday at ten', where the only training domains were weather and navigation, which only talk about things like weather conditions and distances; yet it still generates this novel scheduling utterance.
0:16:34 The last question is: how does the size of the seed responses affect performance? This is tested on the Stanford human data. We had fixed the size at one hundred fifty in the previous results; here we vary it from zero to two hundred and look at how the performance changes. One thing this confirms is that performance indeed increases when we grow the size of the seed responses, since a larger set has wider coverage of what is going on in the data. But we can also see that the performance becomes flat when we go beyond about one hundred twenty-five to one hundred fifty. That validates the attractiveness of using seed responses, because we don't need a huge set of seed responses to get good performance.
0:17:17 So to summarize: we proposed a new problem called ZSDG, and we proposed Action Matching, an algorithm that performs pretty well in this setting under the assumption of shared discourse patterns. We ran experiments to validate its performance on both human and synthetic datasets. And lastly, we also open-sourced SimDial, the entire multi-domain dialog generator, which can be used to benchmark future experiments.
0:17:45 At the last, I want to say that this is a first step towards a very big direction, and it opens up many interesting problems that we can explore in the future. For example, how do we quantify the relationship between domains, and in which situations is this transfer possible? Also, how do we rely less on human annotation? Right now we depend on annotation to find the relationship between utterances across domains. Also, how do we address the situation where the assumption behind seed responses fails, where the target domain actually has different discourse patterns, a different dialog policy; how do we deal with that? And the last one is: what other types of domain description can we have to enable ZSDG? Thank you very much.
0:18:53 Which one?
0:19:29 So on the left we have the BEAK score, and here it ranges from zero to a hundred at maximum. Because this is a synthetic dataset, it is easier to achieve high performance in these domains, even though we intentionally added a lot of complexities, such as hesitations and self-repairs, simulating different noisy verbal behavior; that explains the range there. And here, I think this one is the BLEU score, which is meant to match the wording of the true response, together with the entity F1. They also range from zero to a hundred, but this is the human dataset, which is much more challenging, so the raw scores are actually mostly pretty low; the BLEU score is maybe around zero to twenty-something, which drags the number down. So the range here is also zero to a hundred; for the two plots, it is just that the datasets differ, and the right one is much more challenging.
0:20:34 Okay. Okay.
0:21:09 So this example comes from the configuration where we treat scheduling as the target and weather and navigation as the source domains, and we test on the scheduling domain. We omit the dialog history here because of space, but the actual system utterance is 'okay, scheduling a reminder' with the meeting details. The generation is not perfect, but this is the first time the model is able to generate a coherent utterance that is obviously a scheduling-domain utterance, with a similar dialog act compared to the one shown. The baseline systems just cannot generate a coherent utterance for the scheduling domain; they are more likely to generate something like 'what is the weather' or 'okay, navigating to some place', which is a strong bias toward the source. The key is the transfer from the source to the target, and this is the only model that is able to shift its style completely from the source to the target.
0:22:27 Clearly, I think the most challenging one is the navigation domain. In scheduling, if you look into the conversations in the data, they are usually short dialogs: a scheduling request is usually not very long, something like 'schedule a meeting with my friend at eleven', followed by a confirmation, so it takes only about three to five turns before the conversation finishes. Navigation dialogs are much longer, and they involve even more detailed information, like wanting to navigate from one place to another place, so it is much harder to get all of the entities right, and sometimes users want to change the navigation destination. So I think that is the more challenging domain compared to the other two.
0:23:40 If you don't have seed responses from the target domain, then you cannot do the transfer, because all the knowledge we have about the target domain is from the seed responses. Basically, what action matching is doing is finding utterances that serve similar functions across domains. For example, if one domain has an utterance that makes a request, the model tries to find an utterance in the new domain that fulfills a similar function, so that we can transfer knowledge about the policy to the new domain: when I would issue a request here, in the new domain I will find the best-matching sentence in the target domain and say that instead. So if we don't have the target domain's seed responses, the approach would not work.
0:24:31 So the definition here is that zero-shot means we don't have any dialog data from the target domain: we don't have any multi-turn conversations from the target domain.
0:24:45 The seed responses are only single utterances, not dialogs, so they are not dialog data. The overall notion we propose here is the domain description: it does not have to be seed responses, it could be any other type of domain description, depending on the application, but here we assume the seed responses are the only description of the domain. So you do have some sort of description, some knowledge, about the target.
0:25:50 That is definitely an interesting direction, if you mean that we want to make the latent representation something we can inspect and interpret. Right now the latent space is continuous, so we tried to project it to 2D and plot it, and we can see some patterns where it groups similar sentences from different domains together. But I think it is an interesting direction to see how we can get more explicit information out of it for interpretation.