0:00:15 Hi everyone, my name is Tiancheng, from Carnegie Mellon University, and I'm going to talk about our work on zero-shot dialog generation with cross-domain latent actions. The code and data are both publicly available.
0:00:30 So this talk is going to be about generative end-to-end dialog systems, which are perhaps one of the most flexible frameworks we have nowadays to model both task-oriented and non-task-oriented conversations.
0:00:43 The basic idea, I'm sure everybody is already familiar with: we have a dialog context, and we have an encoder that encodes whatever is available at testing time, such as the dialog history and other information, and then the decoder network can generate the response. The response can be a verbal response that is sent back to the human, or it can be an API request sent to back-end databases. So a single model can handle both the interaction between human and machine and the interaction between machine and back-end databases.
0:01:14 Although this paradigm is powerful and flexible, most of the successful prior work has one assumption: that there is a large training dataset for the exact task or domain we are interested in, so that we can train the model on it.
0:01:30 That assumption is often not true in practice, because dialog systems can be applied to so many different domains. Even just for slot filling, we have slot filling for bus schedules, weather, flights, and so many other domains, and many times we don't have data for the exact domain in which we are going to deploy.
0:01:54 Humans, on the other hand, are incredibly good at transferring knowledge from domain to domain. Imagine a customer service agent who was working in the shoe department: they can very quickly adapt to the clothing department just by reading some training materials, without the need for actual example dialogs. We want to achieve a similar goal in this study.
0:02:18 To summarize, the first goal is that we want to exploit the flexibility of generative models, so that a single model can simultaneously acquire knowledge from multiple domains. The second goal is that we want the end-to-end model to be able to transfer knowledge from source domains to a new domain where we don't have data. This is a new problem that we formalize as a learning problem we name zero-shot dialog generation, or ZSDG.
0:02:46 The setup is as follows. We have source domains, meaning domains where we do have dialog data, and we have a set of target domains where we don't have dialog data. For every domain, both source and target, we do have access to a domain description, which can be any type of knowledge that describes the specific information about that domain. Given this setup, the learning problem becomes the following: at training time, the model can be trained on the dialogs from the source domains, together with the domain descriptions from both the source and target domains. At testing time, we ask the model to directly generate responses in the target domains, whose dialogs were never seen in training. That is why we call it the zero-shot dialog generation problem.
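To make the data conditions concrete, the setup can be written as follows; the notation here is a paraphrase of the talk, not necessarily the paper's exact symbols:

```latex
% S: source domains; T: target domains; DD(d): the domain description of d.
\mathcal{D}_{\mathrm{train}}
   = \{(c, x)\ \text{dialog pairs from } d \in S\}
   \;\cup\; \{DD(d) : d \in S \cup T\}
\qquad
\text{Test: generate a response } x \text{ for context } c \text{ in } d \in T
```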
0:03:36 Just to show the formulation alongside the visual figure: given this setup, it is very easy to see that the design of the domain description is the most important factor here, because it covers all the domains and it is what enables the possibility of transferring knowledge from source to target. There could be many different types of domain description, and in this study we propose one type that we call seed responses.
0:04:03 The assumption behind seed responses is that between the source and target domains there exist some shared, related discourse patterns, for example a shared dialog policy. Given that assumption, what is a seed response? The seed responses are a list of triples, and each triple contains three elements: x, a, and d. x is an example utterance, which can be spoken by either the user or the system in this domain; a is the annotation of that utterance, as the example shown here; and d is basically the domain index. For each domain we then have a table like this, holding the seed responses from that domain.
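As a minimal sketch of that data structure (the example utterances and annotations below are invented for illustration, not taken from the actual data):

```python
from typing import List, NamedTuple

class SeedResponse(NamedTuple):
    utterance: str   # x: an example user or system utterance
    annotation: str  # a: the annotation of x, e.g. a semantic frame
    domain: str      # d: the domain index

# Hypothetical seed responses for a "movie" target domain.
movie_seed: List[SeedResponse] = [
    SeedResponse("what kind of movie are you looking for", "request(genre)", "movie"),
    SeedResponse("movie-55 is a good choice", "inform(name=movie-55)", "movie"),
]
```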
0:04:46 So given the seed responses and the dialogs from the source domains, how do we use these two types of data to train a model that actually achieves ZSDG? In this work we propose a new class of algorithms called Action Matching, and in this algorithm the most important notion is the cross-domain latent action. We introduce a new latent space, and we assume that all the possible latent actions from the system and the user reside in this latent space. In Action Matching we propose to learn three sets of parameters. The first one is R, the recognition network; the function of R is basically to map an utterance or an annotation, that is, a sequence of words, to a latent action. The second is an encoder that encodes the dialog context and tries to predict what the next latent action is. And the third one is the decoder: because we want a generative model, we expect the decoder to take an action, any point in the latent space, and map it back to a sentence. The figure here shows all the possible transformations between the four variables: utterance, annotation, latent action, and dialog context.
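A compact PyTorch sketch of the three components may help; this is my simplification, not the authors' code (the actual model uses a hierarchical LSTM context encoder and, in one variant, a copy decoder, as described later):

```python
import torch
import torch.nn as nn

Z = 64  # size of the shared cross-domain latent action space

class RecognitionNet(nn.Module):
    """R: maps a word sequence (an utterance x or its annotation a) to a latent action."""
    def __init__(self, vocab_size: int, emb_dim: int = 32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, Z // 2, bidirectional=True, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # [B, T] -> [B, Z]
        _, h = self.rnn(self.emb(tokens))
        return torch.cat([h[0], h[1]], dim=-1)

class ContextEncoder(nn.Module):
    """Maps an encoded dialog context to a predicted latent action."""
    def __init__(self, ctx_dim: int):
        super().__init__()
        self.proj = nn.Linear(ctx_dim, Z)

    def forward(self, ctx: torch.Tensor) -> torch.Tensor:  # [B, ctx_dim] -> [B, Z]
        return self.proj(ctx)

class Decoder(nn.Module):
    """Maps any point z in the latent space back to a sentence (teacher-forced)."""
    def __init__(self, vocab_size: int, emb_dim: int = 32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, Z, batch_first=True)
        self.out = nn.Linear(Z, vocab_size)

    def forward(self, z: torch.Tensor, prev_tokens: torch.Tensor) -> torch.Tensor:
        state = (z.unsqueeze(0), torch.zeros_like(z).unsqueeze(0))  # z seeds the LSTM
        out, _ = self.rnn(self.emb(prev_tokens), state)
        return self.out(out)  # [B, T, vocab] logits
```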
0:05:57 Okay, so now we have these three sets of parameters we want to learn, and we have two types of data, so how do we optimize? The first type of data we encounter is the seed-response data, basically a bunch of sentences from different domains. The objective here is that we want the latent actions of two utterances from two different domains to match each other only when their annotations match each other. What we do here is the following: say the yellow utterance is from one domain, for instance restaurant, and the other is from movie. We introduce the first loss function, called the domain description loss, where we basically minimize the distance from the latent action of x to the latent action of a. In this way, utterances from two domains will only be close to each other when their annotations are close to each other.
0:06:44 The second type of data we are dealing with is the dialog data from the source domains. Here the objective is that we want the predicted action to be accurate: we want the action predicted from a dialog context to be similar to the latent action of the actual response observed in the data. So we introduce the second loss. The bottom part of the figure is the same as in the previous slide, and now we have the predicted latent action from the encoder, and we try to minimize the distance between this predicted latent action and the latent action of the reference response x here.
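Put as code, the two objectives might look like this (a hedged sketch: the squared-distance choice and the weight lam are my assumptions, not necessarily what the paper uses):

```python
import torch.nn.functional as F

def domain_description_loss(logits, target_tokens, z_x, z_a, lam=1.0):
    """Seed-response objective: reconstruct the utterance from its own latent
    action, and pull R(x) toward R(a), so utterances from different domains
    end up close only when their annotations match."""
    nll = F.cross_entropy(logits.flatten(0, 1), target_tokens.flatten())
    return nll + lam * F.mse_loss(z_x, z_a)

def dialog_loss(logits, target_tokens, z_pred, z_resp, lam=1.0):
    """Source-dialog objective: generate the response from the context's
    predicted latent action, and pull that prediction toward the latent
    action recognized from the reference response."""
    nll = F.cross_entropy(logits.flatten(0, 1), target_tokens.flatten())
    return nll + lam * F.mse_loss(z_pred, z_resp)
```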
0:07:19 To summarize, the Action Matching algorithm is as shown here, and it is a very simple and elegant solution: we only have two loss functions, and we alternate between them. The first one is the domain description loss, which is used when we are dealing with data from the seed responses. Its second term minimizes the distance between the latent actions of x and a, and its first term trains the decoder to generate responses from both source and target domains. The second one, the dialog loss, is related to the latent-variable view of the original encoder-decoder: one term is training the decoder, and the other term is minimizing the distance I just talked about. Training with Action Matching is then basically taking data from the two streams, the seed responses and the dialogs; we randomly pick one and optimize the corresponding loss function.
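The alternating schedule could be sketched as below; the 50/50 sampling ratio is an assumption, since the talk only says one stream is picked at random per step:

```python
import random
import torch

def train_action_matching(dd_loss_fn, dlg_loss_fn, seed_batches, dialog_batches,
                          optimizer: torch.optim.Optimizer, num_steps: int) -> None:
    """dd_loss_fn / dlg_loss_fn are closures over the three networks that
    compute the domain description loss and the dialog loss for one batch."""
    for _ in range(num_steps):
        if random.random() < 0.5:
            loss = dd_loss_fn(next(seed_batches))     # (x, a, d) triples
        else:
            loss = dlg_loss_fn(next(dialog_batches))  # (context, response) pairs
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```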
0:08:14 For the exact instantiation in this study, we use a bidirectional GRU for the recognition network, and we have a hierarchical LSTM encoder for the context encoder. For the decoder we conduct experiments with two kinds of decoders: one is a standard LSTM decoder with attention, and the second one is an LSTM with the pointer-sentinel copy mechanism, which is a decoder with a copying mechanism, so it can copy words from the context directly into the output response. It has been shown to be pretty robust against out-of-vocabulary tokens in language modeling.
0:08:51 And here we show the picture of what we have in this model when we use the copy decoder: the left figure shows how we deal with the dialog data, and the second figure shows how we deal with the seed-response data, so that we can optimize the three networks jointly.
0:09:11 With that as our method, we tested this framework on two tasks: the first one is SimDial, and the second one is the Stanford Multi-Domain dialog dataset. SimDial is a new open-source multi-domain dialog generator with complexity control; it is open-sourced on GitHub, where there are more detailed instructions about how to use it. We used this generator to generate dialogs from seven domains. We take three domains as the source domains, which are restaurant, bus, and weather, with one thousand dialogs each.
0:09:43 For the target domains we have four, and we chose them to test different perspectives. The first one is restaurant; this is in-domain, because it also appears in training. The second one is restaurant-slot: it is still restaurant, but with a completely different set of slot values. The third one is restaurant-style: again restaurant, but using a completely different style and different natural-language templates for both the user and the system. And the last one, movie, is a new domain that shares nothing with anything in the source domains, which makes it the most challenging one. For the seed responses, we take one hundred utterances from each domain, and we use the internal semantic frame as the annotation.
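In other words, the SimDial experiment layout looks like this (a summary of the numbers just stated, not code from the release):

```python
simdial_sources = {"restaurant": 1000, "bus": 1000, "weather": 1000}  # dialogs each
simdial_targets = ["restaurant",        # in-domain
                   "restaurant-slot",   # unseen slot values
                   "restaurant-style",  # unseen NLG style and templates
                   "movie"]             # entirely new domain
seed_responses_per_domain = 100         # annotated with the internal frame
```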
0:10:24 The second type of data we work with is the Stanford Multi-Domain data, which contains human-written dialogs from three domains: scheduling, weather, and navigation. We take a leave-one-out approach by rotation: we use one domain as the target and the other two as the sources, so we have three possible configurations. We use one hundred fifty utterances from each domain as the seed responses, annotated with semantic frames by an expert annotator. That is all we need from the target domain: we only use those utterances from the target domain in training, and we don't use any dialogs from that domain.
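The leave-one-out rotation can be spelled out as:

```python
# Each Stanford Multi-Domain domain takes a turn as the zero-shot target,
# with the other two serving as source domains.
domains = ["scheduling", "weather", "navigation"]
configs = [(target, [d for d in domains if d != target]) for target in domains]
# -> [('scheduling', ['weather', 'navigation']),
#     ('weather', ['scheduling', 'navigation']),
#     ('navigation', ['scheduling', 'weather'])]
```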
0:11:02 For evaluation, dialog evaluation is known to be hard, so rather than relying on a single metric we evaluate the system with four different metrics: the BLEU score, entity F1, dialog act F1, and KB query F1. To quantify the overall performance we introduce a new score, the BEAK score, which basically takes the geometric mean of the four metrics, so that we have one number for each system as an overall performance measure.
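As a small sketch of the combined score (the input values below are made up for illustration, not reported results):

```python
def beak(bleu: float, entity_f1: float, act_f1: float, kb_f1: float) -> float:
    """Geometric mean of the four metrics: one overall number per system."""
    scores = [bleu, entity_f1, act_f1, kb_f1]
    prod = 1.0
    for s in scores:
        prod *= s
    return prod ** (1.0 / len(scores))

print(beak(20.0, 60.0, 80.0, 70.0))  # ~50.9
```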
0:11:30 We compare four different models. The top two are baselines: the first is an encoder-decoder with attention, and the second one is the decoder with the copy mechanism. The two proposed methods basically add the proposed Action Matching algorithm to these two baselines, so we can see what happens when we add action matching.
0:11:52 Now the results. On the left we show the BEAK score on SimDial, and on the right we show the overall performance on the Stanford data. Here we can already see some interesting findings. First, we can see that the two baselines perform pretty well on the in-domain data, which is the normal training-and-testing scenario, but when they move to the unseen-slot, unseen-style, and new domains, their performance drops significantly. We can also see that the green bar, which is action matching plus the copy decoder, has really strong performance in those target domains that are quite different from the training data; especially on the new domain, movie, it is able to achieve a score of about sixty-eight, whereas even the in-domain performance cap is about eighty-two. So it is actually learning something transferable, and compared with the two baselines the improvement is significant.
0:12:53 So we came up with four questions that we want to answer in the following analysis. The first one is: what fails when we move from source to target? The second: it is interesting to see that the copy decoder alone already does something pretty interesting compared to the baseline, so what does the copy mechanism solve? The third question is: what does action matching solve? And lastly, how does the size of the seed responses affect the performance? Now let's go into each question one by one.
0:13:46as well as normal utterance
0:13:48the novel words in domain
0:13:51but dialogue acts as actually okay at least in this dataset
0:13:54so one good example can see here
0:13:56the reference it see you all model is able to generating so you next time
0:14:00of the you something
0:14:01so that kind of a short response across domains the no problem
0:14:05but the bad examples let's go sample
0:14:07so once this then the referent is that finally about what kind before you
0:14:11this is then generating high this the russell system how can do for you
0:14:15the hardest thing the current dialogue act secreting
0:14:18but the words to compute here arabic i still think it's interest from
0:14:21and not think about as the in the movie domain and estimate example for example
0:14:25here the reference science fiction movie what times movie the baseline only generating focus by
0:14:31what kind of rust right looking for all
0:14:33so that's the problem that was for the way moving training on a restaurant in
0:14:38casting movie
0:14:40 Then the next question is: what does the copy mechanism solve? Here the most useful metric is the entity F1 score. We found that the copy decoder, the decoder with the copy mechanism, keeps a high entity score because it continues to be able to copy entities from the context and output them, even if those words are out-of-vocabulary for the model. A good example you can see here: when the reference contains an entity like 'science fiction', the copy model is able to generate 'science fiction' by retrieving that word from the user's speech instead of predicting it from its vocabulary. But the copy mechanism does not solve the whole problem. A bad example you can see here: the reference is 'I believe you said comedy movies', and the system generates something like 'I believe you said comedy' and then breaks off; it grabs the entity 'comedy', but it does not generate a complete sentence. In another example here, the system says something like 'I would recommend restaurant-55', although, given the movie name, it should be saying 'movie-55 is a good choice'.
0:15:44 The next question is: what does the proposed action matching solve? The answer is that the most relevant score here is the BLEU score, because we want to see whether the correct words are being generated in the new domain. We find that action matching enables the decoder to generate whole novel utterances that never occurred in training, not only entities. Here we also show some good examples. One example is 'movie-55 is a good choice, did you make a choice'. And from the more complex human data, we can see the model say something like 'okay, setting a reminder on Friday at ten', where the only training domains were weather and navigation, which only talk about things like weather conditions and distances; yet it still generates this novel scheduling utterance.
0:16:34 The last question is: how does the size of the seed responses affect performance? This is tested on the Stanford human data. We had fixed the size at one hundred fifty in the previous results; here we vary it from zero to two hundred and look at how the performance changes. One thing this confirms is that performance indeed increases when we grow the size of the seed responses, since a larger set has wider coverage of what is going on in the data. But we can also see that the performance becomes flat when we go beyond about one hundred twenty-five to one hundred fifty. That validates the attractiveness of using seed responses, because we don't need a huge set of seed responses to get good performance.
0:17:17 So to summarize: we proposed a new problem called ZSDG, and we proposed Action Matching, an algorithm that performs pretty well in this setting under the assumption of shared discourse patterns. We ran experiments to validate its performance on both human and synthetic datasets. And lastly, we also open-sourced SimDial, the entire multi-domain dialog generator, which can be used to benchmark future experiments.
0:17:45 At the last, I want to say that this is a first step towards a very big direction, and it opens up many interesting problems that we can explore in the future. For example, how do we quantify the relationship between domains, and in which situations is this transfer possible? Also, how do we rely less on human annotation? Right now we depend on annotation to find the relationship between utterances across domains. Also, how do we address the situation where the assumption behind seed responses fails, where the target domain actually has different discourse patterns, a different dialog policy; how do we deal with that? And the last one is: what other types of domain description can we have to enable ZSDG? Thank you very much.
0:18:53 Which one?
0:19:29 So on the left we have the BEAK score, and here it ranges from zero to a hundred at maximum. Because this is a synthetic dataset, it is easier to achieve high performance in these domains, even though we intentionally added a lot of complexities, such as hesitations and self-repairs, simulating different noisy verbal behavior; that explains the range there. And here, I think this one is the BLEU score, which is meant to match the wording of the true response, together with the entity F1. They also range from zero to a hundred, but this is the human dataset, which is much more challenging, so the raw scores are actually mostly pretty low; the BLEU score is maybe around zero to twenty-something, which drags the number down. So the range here is also zero to a hundred; for the two plots, it is just that the datasets differ, and the right one is much more challenging.
0:20:34 Okay. Okay.
0:21:09 So this example comes from the configuration where we treat scheduling as the target and weather and navigation as the source domains, and we test on the scheduling domain. We omit the dialog history here because of space, but the actual system utterance is 'okay, scheduling a reminder' with the meeting details. The generation is not perfect, but this is the first time the model is able to generate a coherent utterance that is obviously a scheduling-domain utterance, with a similar dialog act compared to the one shown. The baseline systems just cannot generate a coherent utterance for the scheduling domain; they are more likely to generate something like 'what is the weather' or 'okay, navigating to some place', which is a strong bias toward the source. The key is the transfer from the source to the target, and this is the only model that is able to shift its style completely from the source to the target.
0:22:27 Clearly, I think the most challenging one is the navigation domain. In scheduling, if you look into the conversations in the data, they are usually short dialogs: a scheduling request is usually not very long, something like 'schedule a meeting with my friend at eleven', followed by a confirmation, so it takes only about three to five turns before the conversation finishes. Navigation dialogs are much longer, and they involve even more detailed information, like wanting to navigate from one place to another place, so it is much harder to get all of the entities right, and sometimes users want to change the navigation destination. So I think that is the more challenging domain compared to the other two.
0:23:40 If you don't have seed responses from the target domain, then you cannot do the transfer, because all the knowledge we have about the target domain is from the seed responses. Basically, what action matching is doing is finding utterances that serve similar functions across domains. For example, if one domain has an utterance that makes a request, the model tries to find an utterance in the new domain that fulfills a similar function, so that we can transfer knowledge about the policy to the new domain: when I would issue a request here, in the new domain I will find the best-matching sentence in the target domain and say that instead. So if we don't have the target domain's seed responses, the approach would not work.
0:24:31 So the definition here is that zero-shot means we don't have any dialog data from the target domain: we don't have any multi-turn conversations from the target domain.
0:24:45 The seed responses are only single utterances, not dialogs, so they are not dialog data. The overall notion we propose here is the domain description: it does not have to be seed responses, it could be any other type of domain description, depending on the application, but here we assume the seed responses are the only description of the domain. So you do have some sort of description, some knowledge, about the target.
0:25:50 That is definitely an interesting direction, if you mean that we want to make the latent representation something we can inspect and interpret. Right now the latent space is continuous, so we tried to project it to 2D and plot it, and we can see some patterns where it groups similar sentences from different domains together. But I think it is an interesting direction to see how we can get more explicit information out of it for interpretation.