0:00:17 And the next speaker we have is Shikib Mehri, with the paper on structured fusion networks for dialogue, which use an end-to-end dialogue model. So, please.
0:00:49 Hi, I'm Shikib Mehri, and I'm here today to talk about structured fusion networks for dialogue. This work was done with Tejas Srinivasan and my advisor Maxine Eskenazi.
0:01:01 Okay, let's talk about neural models of dialogue. Neural dialogue systems do really well on the task of dialogue generation, but they have several well-known shortcomings: they need a lot of data to train, they struggle to generalize to new domains, they are difficult to control, and they exhibit divergent behavior when tuned with reinforcement learning.
0:01:25 On the other hand, traditional pipelined dialogue systems have structured components that allow us to easily generalize, interpret and control these systems. Both of these approaches have their respective advantages and disadvantages. Neural dialogue systems can learn from data, and they can learn higher-level reasoning, or a higher-level policy. On the other hand, pipelined systems are very structured in nature, which has several benefits.
0:01:53 Yesterday there was this question in the panel of "to pipeline or not to pipeline", and to me the obvious answer seems "why not both", and I think that combining these two approaches is a very intuitive thing to do. So how do we go about combining these two approaches?
0:02:10 In pipelined systems we have structured components, so the very first thing to do to bring that structure to neural dialogue systems is to emulate these components with neural modules. Using the MultiWOZ dataset, we first define and pre-train several neural dialogue modules: one for the NLU, one for the DM and one for the NLG.
0:02:33 For the NLU, what we do is we read the dialogue context, encode it, and then ultimately make a prediction about the belief state. For the dialogue manager, we take the belief state as well as some vectorized representation of the database, pass it through several linear layers, and ultimately predict the system dialogue act. For the NLG, we have a conditioned language model where the initial hidden state is a linear combination of the dialogue act, the belief state and the database vector, and then at every time step the model outputs what the next word should be, to ultimately generate the response.
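(A minimal sketch of the three pre-trained modules described above, written in PyTorch; the class names, dimensions and single-layer GRU choices are illustrative assumptions rather than the authors' exact implementation.)

```python
import torch
import torch.nn as nn

class NLU(nn.Module):
    """Encodes the dialogue context and predicts the belief state."""
    def __init__(self, vocab_size, hidden_dim, belief_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.encoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.belief_head = nn.Linear(hidden_dim, belief_dim)

    def forward(self, context_tokens):
        _, h = self.encoder(self.embed(context_tokens))       # final hidden state
        return torch.sigmoid(self.belief_head(h[-1]))         # multi-label belief state

class DM(nn.Module):
    """Maps belief state plus database vector to a system dialogue act."""
    def __init__(self, belief_dim, db_dim, hidden_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(belief_dim + db_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, act_dim))

    def forward(self, belief, db):
        return torch.sigmoid(self.net(torch.cat([belief, db], dim=-1)))

class NLG(nn.Module):
    """Conditioned LM: initial hidden state is a linear combination of
    dialogue act, belief state and database vector."""
    def __init__(self, vocab_size, act_dim, belief_dim, db_dim, hidden_dim):
        super().__init__()
        self.init = nn.Linear(act_dim + belief_dim + db_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, act, belief, db, response_tokens):
        h0 = torch.tanh(self.init(torch.cat([act, belief, db], dim=-1))).unsqueeze(0)
        outputs, _ = self.rnn(self.embed(response_tokens), h0)
        return self.out(outputs)                               # next-word logits
```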
0:03:11 So we have these three neural dialogue modules that mirror the structured components of traditional pipelined systems. Given these three components, how do we actually go about building a system for dialogue generation?
0:03:25 Well, the simplest thing to do is naive fusion, where we train these modules and then just combine them naively during inference: instead of passing in the ground-truth belief state to the dialogue manager, which is what we would do during training, we make a prediction using our trained NLU and then pass it into the dialogue manager.
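(A sketch of naive fusion at inference time, assuming module interfaces like the ones sketched earlier; the greedy decoding loop and function name are illustrative.)

```python
import torch

def naive_fusion_generate(nlu, dm, nlg, context_tokens, db_vector,
                          bos_id, eos_id, max_len=50):
    belief = nlu(context_tokens)                 # predicted, not ground-truth, belief state
    act = dm(belief, db_vector)                  # predicted system dialogue act
    tokens = torch.tensor([[bos_id]])
    for _ in range(max_len):                     # greedy decoding with the NLG
        logits = nlg(act, belief, db_vector, tokens)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_id], dim=1)
        if next_id.item() == eos_id:
            break
    return tokens
```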
0:03:50 Another way of using these dialogue modules after training them independently is multi-tasking, where we simultaneously learn the dialogue modules as well as the final task of dialogue response generation. So we have these three independent modules here, and then we have these red arrows that correspond to the forward propagation for the task of response generation. Sharing the parameters in this way results in more structured components: now the encoder is being used both for the task of the NLU as well as for the task of response generation, so it now has this notion of structure in it.
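(A sketch of the multi-task objective, assuming the batch carries ground-truth belief state, dialogue act and response; how exactly the end-to-end path chains the module predictions is an assumption here, not the paper's exact recipe.)

```python
import torch.nn.functional as F

def multitask_loss(nlu, dm, nlg, batch, w_modules=1.0):
    # Module losses on their own supervised tasks (ground-truth inputs).
    belief_pred = nlu(batch["context"])
    act_pred = dm(batch["belief"], batch["db"])
    loss_nlu = F.binary_cross_entropy(belief_pred, batch["belief"])
    loss_dm = F.binary_cross_entropy(act_pred, batch["act"])

    # End-to-end response-generation loss along the chained ("red-arrow") path.
    act_chained = dm(belief_pred, batch["db"])
    word_logits = nlg(act_chained, belief_pred, batch["db"], batch["response_in"])
    loss_gen = F.cross_entropy(word_logits.transpose(1, 2), batch["response_out"])

    return loss_gen + w_modules * (loss_nlu + loss_dm)
```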
0:04:29 Another way, which is the primary novel work in our paper, is structured fusion networks. Structured fusion networks aim to learn a higher-level model on top of pre-trained neural dialogue modules. Here's a visualization of a structured fusion network, and don't worry if this seems like spaghetti; I'll come back to it.
0:04:49 What we have here are the original dialogue modules, the NLU, the DM and the NLG, in these small grey boxes in the middle, and then what we do is define these black boxes around them that constitute a higher-level model. So the NLU gets upgraded to the NLU plus, the DM to the DM plus and the NLG to the NLG plus. By doing this, the higher-level model does not need to relearn and remodel the dialogue structure, because it's provided to it by the pre-trained dialogue modules. Instead, the higher-level model can focus on the necessary abstract modeling for the task of response generation, which includes encoding complex natural language, modeling the dialogue policy, and generating language conditioned on some latent representation, and it can leverage the already provided dialogue structure to do this.
0:05:43 So let's go through the structured fusion network piece by piece and see how we build it up. We start out with these dialogue modules, in grey here; the combination between them is exactly what you saw in naive fusion.
0:05:56 First, we're going to add the NLU plus. The NLU plus gets the outputted belief state, and when it re-encodes the dialogue context, it has the already-predicted belief state concatenated at every time step. In this way the encoder does not need to relearn the structure and can leverage the already-computed belief state to better encode the dialogue context.
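(A sketch of the NLU-plus re-encoding idea: the module's predicted belief state is concatenated to every encoder time step; names and dimensions are assumptions.)

```python
import torch
import torch.nn as nn

class NLUPlus(nn.Module):
    def __init__(self, vocab_size, hidden_dim, belief_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.encoder = nn.GRU(hidden_dim + belief_dim, hidden_dim, batch_first=True)

    def forward(self, context_tokens, belief_pred):
        emb = self.embed(context_tokens)                           # (B, T, H)
        belief_rep = belief_pred.unsqueeze(1).expand(-1, emb.size(1), -1)
        outputs, h = self.encoder(torch.cat([emb, belief_rep], dim=-1))
        return outputs, h[-1]                                      # per-step states, final state
```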
0:06:21 Next, we're going to add the DM plus. The DM plus takes as input a concatenation of four different features: the database vector, the predicted dialogue act, the predicted belief state, and the final hidden state of the higher-level encoder. It then passes that through a linear layer. By providing the structure in this way, it's our hope that this serves as the policy modeling component in this end-to-end model.
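(A sketch of the DM-plus: a linear layer over the concatenation of the four features listed above; the sizes and the tanh nonlinearity are assumptions.)

```python
import torch
import torch.nn as nn

class DMPlus(nn.Module):
    def __init__(self, db_dim, act_dim, belief_dim, enc_dim, hidden_dim):
        super().__init__()
        self.linear = nn.Linear(db_dim + act_dim + belief_dim + enc_dim, hidden_dim)

    def forward(self, db, act_pred, belief_pred, enc_final):
        features = torch.cat([db, act_pred, belief_pred, enc_final], dim=-1)
        return torch.tanh(self.linear(features))   # intended to act as the policy representation
```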
0:06:48 The NLG plus takes as input the output of the DM plus, uses that to initialize the hidden state, and then interfaces with the NLG.
0:06:59 Let's take a closer look at the NLG plus. It relies on cold fusion. Basically, what this means is that the NLG, a conditioned language model, gives us a sense of what the next word could be. The decoder, on the other hand, is performing more of the higher-level reasoning. We take the logits, the output from the NLG about what the next word could be, as well as the hidden state from the decoder, a representation of what we should be generating, and combine them using cold fusion. There's also a cyclical relationship between the NLG and the higher-level decoder, in the sense that once cold fusion predicts what the next word should be, through the combination of the decoder and the NLG, it passes that prediction both into the decoder and into the next time step of the NLG.
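(A sketch of the cold-fusion step inside the NLG-plus, following the general cold fusion recipe of Sriram et al.; the projection and gating sizes are assumptions.)

```python
import torch
import torch.nn as nn

class ColdFusionHead(nn.Module):
    def __init__(self, vocab_size, dec_dim, fuse_dim):
        super().__init__()
        self.lm_proj = nn.Linear(vocab_size, fuse_dim)          # project the NLG's logits
        self.gate = nn.Linear(dec_dim + fuse_dim, fuse_dim)     # fine-grained gating
        self.out = nn.Sequential(nn.Linear(dec_dim + fuse_dim, dec_dim),
                                 nn.ReLU(),
                                 nn.Linear(dec_dim, vocab_size))

    def forward(self, decoder_state, nlg_logits):
        h_lm = self.lm_proj(nlg_logits)
        g = torch.sigmoid(self.gate(torch.cat([decoder_state, h_lm], dim=-1)))
        fused = torch.cat([decoder_state, g * h_lm], dim=-1)
        return self.out(fused)                                   # fused next-word logits
```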
0:07:53 And here's the final combination again, which hopefully should make more sense now.
0:08:00 So how do we train the structured fusion network? Because we have these modules, there are three different ways we can do it. The first one is that we can freeze the modules, which makes sense because they're pre-trained, and then just learn the higher-level model on top. Another way is that we can fine-tune these modules for the final task of dialogue response generation. And then of course we can multi-task the modules, where we simultaneously fine-tune them for response generation and for their original tasks.
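(A sketch of how the three training regimes differ in which parameters are trainable; the `sfn` object and its module attributes are hypothetical.)

```python
def configure_modules(sfn, mode):
    module_params = (list(sfn.nlu.parameters()) +
                     list(sfn.dm.parameters()) +
                     list(sfn.nlg.parameters()))
    if mode == "frozen":
        for p in module_params:
            p.requires_grad = False      # only the higher-level model learns
    elif mode == "finetune":
        for p in module_params:
            p.requires_grad = True       # modules tuned for response generation only
    elif mode == "multitask":
        for p in module_params:
            p.requires_grad = True       # plus keep the modules' own supervised losses
    return sfn
```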
0:08:30 We use the MultiWOZ dataset and generally follow their experimental setup, which means the same hyperparameters, and because they use the ground-truth belief state, we do so as well; you can sort of think of this as an oracle NLU in our case. For evaluation we use the same metrics, which include BLEU score; inform rate, which measures how often the system has provided the appropriate entities to the user; success rate, which is how often the system answers all the attributes the user requests; and a combined score, which they propose as well, which is BLEU plus the average of inform and success rate.
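(A sketch of the combined score as just described: BLEU plus the average of inform and success rate; the example numbers are made up.)

```python
def combined_score(bleu, inform_rate, success_rate):
    """MultiWOZ combined score: BLEU + 0.5 * (Inform + Success), rates in percent."""
    return bleu + 0.5 * (inform_rate + success_rate)

# made-up illustrative values, not results from the paper
print(combined_score(bleu=18.0, inform_rate=80.0, success_rate=70.0))  # 93.0
```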
0:09:07 So let's take a look at our results. First, our baseline: as you see here, seq2seq with attention gets a combined score of about eighty-three point three six. Next we have naive fusion, both zero-shot, which means the modules are independently pre-trained and just combined at inference, and fine-tuned for the task of response generation, which does slightly better than the baseline. Multi-tasking does not do so well, which sort of indicates that the loss functions may be pulling the weights in different directions.
0:09:38 Structured fusion networks with frozen modules also do not do so well, but as soon as we start fine-tuning we get a significant improvement: slight improvements over these other models in BLEU score and then very strong improvements in inform and success rate. We observe similar patterns with SFN with multi-tasking. And honestly, this seems kind of intuitive when you think about it: inform and success rate measure how often we inform the user of the appropriate entities and how often we provide the appropriate attributes, and explicitly modeling the belief state and explicitly modeling the system act should intuitively help with this. If our model is explicitly aware of what attributes the user has requested, it's going to better provide that information to the user.
0:10:29 But of course, I talked about several different problems with neural models, so let's see if structured fusion networks did anything about those problems. The first problem I mentioned is that neural models are very data-hungry, and I think that the added structure should result in less data-hungry models. So we compare seq2seq with attention and structured fusion networks at one percent, five percent, ten percent and twenty-five percent of the training data. On the left you see the inform rate graph and on the right you see the success rate graph, at varying percentages of data used.
0:11:01 For the inform rate, we're at about thirty percent with seq2seq and about fifty-five with structured fusion networks. Of course, this difference is really big when we're at very small amounts of data, as in one percent, and then it slowly comes together as we increase the data.
0:11:21 For success rate, we're at about twenty with seq2seq, and with structured fusion networks we're within about two or three percent of sixty-six, at one percent of the data. So for extremely low data scenarios, one percent, which is about six hundred utterances, we do really well with structured fusion networks, and the difference remains at about a ten percent improvement across both metrics.
0:11:49 Another problem I mentioned is domain generalizability; the added structure should give us more generalizable models. So what we do is compare seq2seq and structured fusion networks by training on two thousand out-of-domain dialogue examples and fifty in-domain examples, where the in-domain is restaurant, and then we evaluate entirely on the restaurant domain. What we see here is that we get a sizable improvement in the combined score using structured fusion networks, with stronger improvements in success and inform. The BLEU is slightly lower, but this drop roughly matches what we saw when using all the data, so I don't think it's a problem specific to generalizability.
0:12:30 The next problem, and to me the most interesting one, is divergent behavior with reinforcement learning. Training end-to-end dialogue models with reinforcement learning often results in divergent behavior and degenerate output. I'm sure everybody here has seen the headlines where people claimed that Facebook shut down their bot after it started inventing its own language; really, what happened was it started outputting stuff that doesn't look like English, because it loses the structure as soon as you train it with reinforcement learning.
0:13:02 So why does this happen? My theory about why this happens is the notion of the implicit language model. Seq2seq decoders have the issue of the implicit language model, which basically means that the decoder simultaneously learns the response strategy as well as modeling language. In image captioning this is very well observed, and it's observed that the implicit language model overpowers the decoder. So basically what happens is, if the image model detects that there's a giraffe, the model always outputs "a giraffe standing in a field", even if the giraffe is not standing in a field, just because that's what the language model has been trained to do.
0:13:44 In dialogue, on the other hand, this problem is slightly different, in the sense that when we fine-tune dialogue models with reinforcement learning, we're optimizing for the strategy and ultimately causing the model to unlearn the implicit language model. So, structured fusion networks have an explicit language model, so maybe we don't have this problem.
0:14:05 So let's try structured fusion networks with reinforcement learning. For this, we train with supervised learning, then freeze the dialogue modules and fine-tune only the higher-level model with a reward of inform rate plus success rate. So we're optimizing the higher-level model for some dialogue strategy while relying on the structured components to maintain the structured nature of the model.
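(A sketch of this RL fine-tuning stage with frozen modules and a REINFORCE-style update on the higher-level model; `module_parameters`, `sample_responses` and `compute_reward` are hypothetical placeholders, not the paper's API.)

```python
def rl_step(sfn, optimizer, dialogue_batch, compute_reward):
    for p in sfn.module_parameters():            # assumed helper: pre-trained module params
        p.requires_grad = False                  # keep the structured modules frozen

    responses, log_probs = sfn.sample_responses(dialogue_batch)   # assumed sampling API
    rewards = compute_reward(responses, dialogue_batch)           # e.g. inform + success rate

    loss = -(log_probs * rewards).mean()         # policy-gradient (REINFORCE) objective
    optimizer.zero_grad()
    loss.backward()                              # gradients only reach the higher-level model
    optimizer.step()
    return loss.item()
```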
0:14:29 We compare to Tiancheng Zhao's work on LaRL, where he explored a similar problem, and what we see is that we get less divergence in language and fairly similar inform and success rates, with the state-of-the-art combined score here.
0:14:47 So here are the results for all the models that we compared throughout this presentation. We see that adding structure in general seems to help, we get a sizable improvement over our baseline, and the model is especially robust to reinforcement learning. Of course, given how fast this field moves, while our paper was in review somebody beat our results, and we don't have state-of-the-art anymore. But one of the core contributions of their work was improving dialogue act prediction, and because structured fusion networks have this ability to leverage dialogue act predictions in an explicit component, I think there's room for combination here.
0:15:30 No dialogue paper is complete without human evaluation, so what we did here was we asked Mechanical Turk workers to read the dialogue context and rate responses on a scale of one to five on the notion of appropriateness. What we see here is that structured fusion networks with reinforcement learning are rated slightly higher, with ratings of four or more given more often. I should say that everything in bold is statistically significant.
0:16:02 Of course, we have a lot more room to improve before we beat the human ground truth, but I think adding structure to our models is the way to go. Thank you for your attention, and the code is available here.
0:16:20 Thank you for the talk. So now we have actually quite some time for questions. Any questions?
0:16:31 That's very interesting work and it looks promising, but do you have plans to extend the evaluation and look at whether a system with your architecture can actually engage in dialogue, rather than replicating dialogues?
0:16:47 To that question, I think the structure should help us do that, and not have the issue where, when you start training models and evaluating models in an interactive manner, usually what happens is the errors propagate; I think the structure should make that less likely to happen. And I think that's something we should definitely look into in the literature.
0:17:11 And just, if you could put up your comparative slide, the first one: I think you're a bit too quick to concede the ranking to the other one as having the preferred performance, because BLEU, I would say, is not something that should be measured in this context. They're doing much better than you in BLEU, but it's completely irrelevant whether you give exactly the same words as the original or not, and you're actually doing much better in success. That's true. My general feeling, having looked at the data a lot, is that for this type of task at least BLEU does relatively well, and I think in the original paper they did some correlation analysis with human judgement. But I think BLEU on its own will not measure the quality of the system; what it's measuring is more how structured the language is, and how... You disagree?
0:18:06 Okay, that's fair. I guess with multiple references maybe we can improve this.
0:18:15 So you have these three components and you said that they're pre-trained, but what are they pre-trained on? And the second question, sorry: during training, do you also have intermediate supervision there, or are they fine-tuned in an end-to-end fashion?
0:18:34 Right, okay, good question. Let me just go back to that slide. So in the MultiWOZ data, they give us the belief state and they give us the system dialogue act. So what we do for pre-training these components is: the NLU is pre-trained to go from context to belief state, the DM from belief state to dialogue act, and the NLG from dialogue act to response. For your second question: in our multi-task setup we do use intermediate supervision, but in the other two we don't.
0:19:08 So it seems to me that you use much more supervision than the usual sequence-to-sequence model, which would be the reason for the better performance rather than a different architecture, no?
0:19:21 I completely agree with that point. I think a point of our paper is that doing this additional supervision and adding the structure into the model is something that people should be doing. Fair enough. But I do understand that it's not necessarily the architecture on its own that's doing better. Cool, thank you.
0:19:42 Any other questions?
0:19:46 A great talk, thanks, it looks promising. So you talked a bit about generalizability and about this issue of divergence with RL, but you didn't touch much on the other issue you mentioned in the trade-off at the beginning, which was controllability, and I'm wondering if you have some thoughts on that. I guess some of the questions that come to my mind when we design models with respect to control: suppose I wanted it to behave a little bit differently in one case, is there any way that this architecture can address that? And the other way to look at it: let's say tomorrow I'm interested in improving one of these components, can I do it in any other way than getting more data? How does the architecture afford something in that sense?
0:20:30 Okay, that's a good question. Controllability isn't something that we've looked at yet, but it's definitely something that I do want to look at in the future, just because I think doing something as simple as adding rules on top of the dialogue manager, to just change it and say "output this dialogue act instead if these conditions are met", would work really well. The model does leverage those dialogue acts, and I've seen bad predictions from the lower-level model result in poor outputs. So that's definitely something that we should look into in the future.
0:21:01 Remind me, what was the second thing? The other part is: is this architecture suitable for decomposability? Can I invest more in one component? Is there blame assignment in any sense, and does it do better, you know?
0:21:13 So, I'm not entirely sure when we look at the final task of response generation, but we do sort of have a sense, just because of the intermediate supervision, of how well each of the respective lower-level components is doing. And what I can say is that the NLU does really well, the natural language generation does pretty well, and the main thing that's struggling is this step of going from belief state to dialogue act. So I think that if I were to recommend a component to improve, based on just the pre-training supervision, it would be the dialogue manager. But blame assignment in general for the response generation task isn't something that I think is really easy with the current state of the model, though I think things might be able to be done to further interpret the model.
0:22:09 Any more questions? Okay, in that case I'll ask one of my own. Can you explain what exactly it is that the DM and the DM plus predict? How does it look? Is it some kind of dialogue act embedding, or is it explicit, like a one-hot?
0:22:42 So, do you mean the dialogue act vector, or... I mean, basically, when you look at the DM... Well, I guess these are two different things: when you look at the DM, the output is a dialogue act, right? Yes. And the DM plus has something different? Okay.
0:23:03 So for the DM itself, because of the supervision, we're predicting the dialogue act, which is a multi-label target, and it's basically just ones and zeros, like a binary vector. Okay, and that's like inform or request at a single slot level, an inform-restaurant-available type of thing, right? But then for the DM plus, it's not structured in that sense; basically, we just treat it as a linear layer that initializes the decoder's hidden state. In the original MultiWOZ paper they had this type of thing as well, where they just had a linear layer between the encoder and decoder that combined more information into the hidden state, and they called that the policy. That's sort of what we're hoping: that by adding the structure beforehand, it's actually more like a policy rather than just a linear layer.
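(A small sketch of the DM's binary dialogue-act target discussed here; the slot inventory is made up purely for illustration.)

```python
# Hypothetical act inventory; MultiWOZ's real act set is larger.
ACT_SLOTS = ["inform-restaurant-name", "inform-restaurant-area",
             "request-food", "offer-booking"]

def act_to_vector(active_acts):
    """Multi-label dialogue act as a binary (ones-and-zeros) vector."""
    return [1.0 if slot in active_acts else 0.0 for slot in ACT_SLOTS]

# e.g. act_to_vector({"inform-restaurant-name", "request-food"}) -> [1.0, 0.0, 1.0, 0.0]
```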
0:23:58 Right, okay, thank you. Any more questions? The last one.
0:24:06 Did you try any other baselines? Because sequence-to-sequence seems to be quite basic.
0:24:12 Well, we did try the other ways of combining the neural modules, naive fusion and multi-tasking, those ones; I can go to that slide. But we didn't try Transformers or anything like that, and I think that's something we can look into in the future. We did try naive fusion and multi-tasking, which are baselines that we came up with for actually leveraging the structure as well.
0:24:37 Okay, thank you. Thank you.