0:00:16 So this is joint work with Serge Bibauw.
0:00:23 Neural network architectures have been increasingly popular for the development of conversational agents, and one major advantage of these approaches is that they can be learned from raw, unannotated dialogues without needing much domain knowledge or feature engineering.
0:00:40 However, they also require large amounts of training data, because they have a large parameter space. So usually we use large online resources to train them, such as Twitter conversations, technical web forums like the Ubuntu chat logs, movie scripts, and movie subtitles, that is, subtitles for films and TV series. These resources are undeniably useful, but they all face some kinds of limitations in terms of dialogue modelling.
0:01:11 We could talk for a long time about these limitations, but I would like to point out just two of them, especially the ones that are important for subtitles. One limitation is that for movie subtitles we don't have any explicit turn structure. The corpus itself is only a sequence of sentences together with timestamps for their start and end times, but we don't know who is speaking, because the subtitles of course don't come together with the audio track and the video, where you would see who is speaking at a given time. So we don't know who is speaking, and we don't know whether a sentence answers another turn or is a continuation of the current turn.
0:01:58 In this particular example, the actual turn structure is the following. As you can see, there are some strong cues: the timestamps can be used in a few cases, and there are lexical and syntactic cues that can be used to infer the turn structure, but you never have the ground truth. That's an important disadvantage when you want to build a system that generates responses, and not just continuations of a given dialogue.
0:02:29 Another limitation is that many of these data contain references to named entities that might be absent from the inputs, in particular fictional characters. They often refer to a context which is external to the dialogue and which cannot be captured by the inputs alone. In this particular case, "Mister Holmes" is an input that would require access to an external context in order to make sense of what is happening. There are other limitations of course, but I just wanted to point out these two important ones.
0:03:08 So how do we deal with these problems? The key idea I'm going to present here starts from the fact that not all examples of context-response pairs are equally useful or relevant for building conversational models. Some examples, as Oliver Lemon showed in his keynote, might even be detrimental to the development of your model. So we can view this as a kind of domain adaptation problem: there is some kind of discrepancy between the context-response pairs that we observe in a corpus and the ones that we wish to encode in our neural conversational model, for the particular application that we want.
0:03:48 The proposed solution is one that is very well known in the field of domain adaptation, which is simply the inclusion of a weighting model. We try to map each pair of context and response to a particular weight value that corresponds to its importance, to its quality if you want, for the particular purpose of building a conversational model.
0:04:18 So how do we assign these weights? Of course, due to the sheer size of our corpora, we cannot annotate each pair manually, and even handcrafted rules may be difficult to apply in many cases, because the quality of examples might depend on multiple factors that interact in complex ways. What we propose here is a data-driven approach, where we learn a weighting model from examples of high-quality responses. Of course, what constitutes a high-quality response might depend on the particular objectives and the particular type of conversational model that one wishes to build, so there is no single answer to what constitutes a high-quality response. But if you have some idea of which kinds of responses you want and which ones you don't want, you can often select a subset of high-quality responses and learn a weighting model from these. The weighting model uses a neural architecture, which is the following.
0:05:21 As you can see here, we have two recurrent neural networks with shared weights: an embedding layer and a recurrent layer with LSTM units. These two networks respectively encode the context and the response, as sequences of tokens, and output fixed-size vectors, which are then fed to a dense layer. This dense layer can also incorporate additional inputs, for instance document-level factors: if you have some features that are specific to movie dialogue and that may be of interest for calculating the weights, you can incorporate them in this dense layer. For the subtitles, for instance, we also have information about the time gaps between the context and the response, and that is something that can be used as well. So we feed all these inputs into this final dense layer, which then outputs a weight for a given context-response pair.
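The architecture just described can be sketched as follows. This is a minimal illustration, not the actual implementation: the recurrent encoders are replaced here by mean-pooled shared embeddings to keep it short, and all dimensions, token ids and extra features are made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions (illustrative, not the talk's settings)
VOCAB, EMB, EXTRA = 100, 8, 1   # EXTRA: e.g. time gap between context and response

E = rng.normal(size=(VOCAB, EMB))          # shared embedding layer
W = rng.normal(size=(2 * EMB + EXTRA,))    # final dense layer weights
b = 0.0                                    # final dense layer bias

def encode(token_ids):
    """Stand-in for the recurrent encoder: mean-pool the shared embeddings.
    (The talk uses an LSTM layer; mean-pooling keeps the sketch short.)"""
    return E[token_ids].mean(axis=0)

def weight(context_ids, response_ids, extra_features):
    """Map a (context, response) pair plus document-level features to a
    weight in (0, 1) via the final dense layer with a sigmoid output."""
    x = np.concatenate([encode(context_ids), encode(response_ids), extra_features])
    return float(1.0 / (1.0 + np.exp(-(x @ W + b))))

w = weight([3, 14, 15], [9, 2, 6], np.array([0.5]))  # 0.5 = normalized time gap
assert 0.0 < w < 1.0
```

The key point is only the shape of the computation: two shared-weight encoders, a dense layer that can absorb extra document-level inputs, and a single weight as output.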
0:06:30 So that's the model. Once we have learned a weighting model from examples of high-quality responses, we can apply it to the full training data to assign a particular weight to each pair, which we then include in the empirical loss that we minimize when training the neural model. The exact formula for the empirical loss might depend on what kind of model you're building and what kind of loss function you're using, but the key idea is that the loss function calculates some kind of distance between what the model produces and the ground truth, and you then weight this loss by the weight value calculated from the weighting model. So it is a kind of two-step procedure: you first calculate the weight of your example, and then, given this weight and the output of your neural model, you calculate the empirical loss and optimize the parameters on this weighted sum.
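The two-step procedure above can be sketched as follows, assuming for illustration a binary cross-entropy loss (the talk deliberately leaves the exact loss function open) and normalizing by the total weight, which is one common choice:

```python
import numpy as np

def weighted_loss(probs, labels, weights):
    """Weighted empirical loss: the per-example cross-entropy between the
    model's prediction and the ground truth, scaled by the example's weight
    from the weighting model, then normalized by the total weight.
    (A plain mean over the batch is another common normalization.)"""
    probs, labels, weights = map(np.asarray, (probs, labels, weights))
    ce = -(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))
    return float(np.sum(weights * ce) / np.sum(weights))

# A noisy example (label 1, predicted 0.2) with a low weight contributes
# far less to the loss than the same prediction error at full weight.
low = weighted_loss([0.2, 0.9], [1, 1], [0.1, 1.0])
high = weighted_loss([0.2, 0.9], [1, 1], [1.0, 1.0])
assert low < high
```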
0:07:37 So that's the model, and the way the weights are integrated at training time.
0:07:44 So how do we evaluate the models? We evaluated only using retrieval-based neural models, because the evaluation metrics are more clearly defined than for generative models. Retrieval-based models seek to compute a score for a given context-response pair, expressing how relevant the response is given the context; you can then use this score to rank possible responses and select the most relevant one. The training data comes from OpenSubtitles, which is a large corpus of subtitles that was released last year. We compare three models: a classical TF-IDF model and two dual encoder models, one with uniform weights, so without weighting, and one using the weighting model. We conducted both an automatic and a human evaluation of this approach.
0:08:40 The dual encoder models, which were proposed a few years ago, are actually quite simple models: you have two recurrent networks with shared weights, which you then feed to dense layers and combine with a dot product. So it computes some kind of semantic similarity between the response that is predicted given the context and the actual response found in the corpus. We made a small modification to the model to allow the final score to also be defined on some features from the response itself, because there might be features that are not due to the similarity between the context and the response, but are due to aspects of the response itself that give clues about whether it is of high or low quality. For instance, some unknown words might indicate a response of lower quality.
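A sketch of this modified scoring function, assuming the usual dual encoder formulation where the score is a dot product between a linear transformation of the context encoding and the response encoding, extended with response-only features. All dimensions and feature choices here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, NFEAT = 8, 2  # toy encoding and feature dimensions (illustrative only)

M = rng.normal(size=(DIM, DIM))  # maps the context encoding to a predicted response
v = rng.normal(size=(NFEAT,))    # weights for the response-only features

def score(context_vec, response_vec, response_feats):
    """Dual encoder score: dot product between the predicted response (M @ c)
    and the actual response encoding, plus a term for features intrinsic to
    the response itself (e.g. fraction of unknown words), squashed to (0, 1)."""
    s = (M @ context_vec) @ response_vec + v @ response_feats
    return float(1.0 / (1.0 + np.exp(-s)))

c = rng.normal(size=DIM)                   # context encoding (stand-in)
r = rng.normal(size=DIM)                   # response encoding (stand-in)
s = score(c, r, np.array([0.0, 0.3]))      # e.g. [OOV fraction, length feature]
assert 0.0 < s < 1.0
```

The response-only term is the small modification mentioned above: without it, anything intrinsic to the response but unrelated to the context would be invisible to the dot product.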
0:09:39 In terms of evaluation, as I said, we used the subtitles as training data. To select the high-quality responses, we took a subset of this training data for which we knew the turn structure, because we could align it with movie scripts, where you have the speaker names. We then used two heuristics: we only kept responses that introduce a new turn, so not subsequent sentences that are simply a continuation of a given turn, and we only used two-party conversations, because in two-party conversations it is easier to determine whether the response is indeed a response to the previous speaker or not. We also filtered out responses containing fictional names or out-of-vocabulary words, and we arrived at a set of about one hundred thousand context-response pairs that we considered to be of high quality.
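The selection heuristics can be sketched as a simple filter. The field names, tokenization, and example data below are illustrative assumptions, not the actual pipeline:

```python
import re

def is_high_quality(pair, fictional_names, vocabulary):
    """Heuristic filter for the seed set of high-quality responses:
    keep only responses that start a new turn in a two-party conversation,
    and drop those containing fictional names or out-of-vocabulary words.
    (Field names are hypothetical stand-ins for the aligned-script data.)"""
    if not pair["new_turn"] or pair["num_speakers"] != 2:
        return False
    tokens = re.findall(r"[a-z']+", pair["response"].lower())
    if any(t in fictional_names for t in tokens):
        return False
    return all(t in vocabulary for t in tokens)

vocab = {"i", "will", "see", "you", "tomorrow", "holmes"}
ok = {"new_turn": True, "num_speakers": 2, "response": "I will see you tomorrow"}
bad = {"new_turn": True, "num_speakers": 2, "response": "I will see you Holmes"}
assert is_high_quality(ok, {"holmes"}, vocab)
assert not is_high_quality(bad, {"holmes"}, vocab)
```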
0:10:37 For the test data, we used one in-domain and one slightly out-of-domain test set: the Cornell Movie-Dialogs corpus, which is a collection of movie scripts (not movie subtitles, but movie scripts), and a small corpus of sixty-two theatre plays that we found on the web. Of course we preprocessed them, tokenized and POS-tagged them. In terms of experimental design, we limited the context to the last ten utterances preceding the response, and the response to a maximum of sixty tokens and a maximum of five utterances, in the case of turns with multiple utterances. We then had a one-to-one ratio between positive examples, which were actual pairs observed in the corpus, and negative examples drawn at random from the same corpus.
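The example construction can be sketched as follows. This is a hypothetical simplification: utterances are plain strings, and a real implementation would also avoid sampling a response as its own negative:

```python
import random

def make_examples(pairs, max_context_utts=10, seed=0):
    """Build training examples with a one-to-one ratio of positives
    (actual context-response pairs from the corpus) and negatives, where
    the negative response is drawn at random from the same corpus.
    Contexts are truncated to the last `max_context_utts` utterances."""
    rng = random.Random(seed)
    responses = [r for _, r in pairs]
    examples = []
    for context, response in pairs:
        context = context[-max_context_utts:]
        examples.append((context, response, 1))               # positive
        examples.append((context, rng.choice(responses), 0))  # random negative
    return examples

pairs = [(["hello", "how are you"], "fine thanks"),
         (["where is he"], "no idea")]
ex = make_examples(pairs)
assert len(ex) == 2 * len(pairs)
assert sum(label for _, _, label in ex) == len(pairs)
```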
0:11:32 We used GRU units instead of LSTMs because they are faster to train, and we didn't see any difference in performance compared to LSTMs.
0:11:42 And here are the results. As you can see, TF-IDF doesn't perform well, but that's really well known. We look at the Recall@N metric, which considers a set of possible responses, one of which is the actual response observed in the corpus, and checks whether the model was able to put the actual response among the top-ranked ones. So Recall 10@1 means that, in a set of ten responses, one of which is the actual response, we check whether the model ranks the actual response the highest. So that's the metric.
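The metric just described can be sketched as a few lines of Python (the candidate set here is made up for illustration):

```python
def recall_at_k(scored_candidates, true_response, k):
    """Recall@k: rank the candidate responses by model score (highest first)
    and check whether the actual corpus response appears in the top k."""
    ranked = sorted(scored_candidates, key=lambda sr: sr[0], reverse=True)
    return any(resp == true_response for _, resp in ranked[:k])

# Ten candidates, one of which is the true response; recall 10@1 asks
# whether the model ranks the true response highest.
candidates = [(0.1 * i, f"cand{i}") for i in range(9)] + [(0.95, "truth")]
assert recall_at_k(candidates, "truth", 1)
assert recall_at_k(candidates, "truth", 5)
```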
0:12:27 We then compared the two dual encoder models, and as you can see, the one with the weighting model performs a little better on both test sets. What we found in a subsequent error analysis was that the weighting model gives more importance to cohesive adjacency pairs between the context and the response, that is, responses that were not simply continuations but actual responses, clearly uttered by another speaker and clearly answering the context.
0:12:58 We also performed a human evaluation of responses generated by the dual encoder models, using crowdsourcing. We picked one hundred and fifteen random contexts from the Cornell corpus and four possible responses for each: a random response, the two responses from the dual encoder models, and an expert response that was manually authored. This resulted in four hundred and sixty pairs, each evaluated by human judges who were asked to rate the consistency between the context and the response on a five-point scale. One hundred and eighteen individuals participated in the evaluation through CrowdFlower.
0:13:45 Unfortunately, the results were not conclusive: we couldn't find any statistically significant difference between the two models, and there was in general a very low agreement between the participants for all four models. We hypothesize that this was due to the difficulty for the raters to discriminate between the responses, which might be due to the nature of the corpus itself: it is heavily dependent on an external context, namely the movie scenes, and if you don't have access to the movie scenes it is very difficult to understand what's going on. Even giving the raters a longer dialogue history didn't seem to help. So for a human evaluation, we think another type of test data might be more beneficial. 0:14:34 So that was for the human evaluation.
0:14:38 So, to conclude. Large dialogue corpora usually include many noisy examples, and noise can cover many things: it can be responses that were not actual responses, responses that include, for instance, fictional names that you don't want to appear in your models, dull, commonplace responses, or responses that are inconsistent with what the model knows. So not all examples have the same quality or the same relevance for learning conversational models, and a possible remedy is to include a weighting model, which can be seen as a form of domain adaptation, since instance weighting is a common approach for domain adaptation.
0:15:26 And we showed that this weighting model does not need to be handcrafted. If you have a clear idea of how you want to filter your data, then you can of course use handcrafted rules, but in many cases what determines the quality of an example is hard to pinpoint, so it might be easier to use a data-driven approach and learn a weighting model from examples of high-quality responses. What constitutes this quality, what constitutes a good response, of course depends on the actual application that you are trying to build. This approach is very general: it is essentially a preprocessing step, so it can be applied to any data-driven model of dialogue. As long as you have examples of high-quality responses, you can use it as a preprocessing step to anything.
0:16:21 As future work, we would like to extend this to generative models. In the evaluation we restricted ourselves to one type of retrieval-based model, but it might be very interesting to apply the approach to other kinds of models, and especially to generative ones, which are known to be quite difficult to train. An additional benefit of weighting models would be that you could filter out examples that are known to be detrimental to the model before you even feed them to the training scheme, so you might get performance benefits in addition to the accuracy benefits reported here. So that's for future work, and possibly also other types of test data than the Cornell Movie-Dialogs corpus that we have used. 0:17:16 That's it, thank you.
0:17:32 Can you go back to the box plot towards the end? 0:17:37 I'm not sure what's in the box plot. The way I read it is that there is no real difference in agreement between the two models, but you have said that there is very low agreement between the evaluators, so I was wondering whether we are looking at two different things. How do you define agreement there, is that right?
0:18:06 The agreement issue is mostly between the two dual encoder models. There is of course a statistically significant difference between the authored responses and the random ones, and also between the two dual encoder models and the random one, but there is no significant difference between the two dual encoder models themselves, with weighting and without weighting.
0:18:24 So maybe the difference would be more significant if you just compared those two settings. 0:18:34 Right, I agree.
0:18:43 Could you elaborate on why you changed the final piece of the dual encoder? What was the reason it was extended?
0:18:55 So the idea is that the dot product gives you a similarity between the response predicted from the context and the actual response. This is a very important aspect when considering how relevant the response is compared to the context, but there might be aspects that are really intrinsic to the response itself and have nothing to do with the context: for instance unknown words, rare words that may well be typos, wrong punctuation, or overly lengthy responses. This is not going to be directly captured in the dot product; it is captured by extracting some features from the response and then using these in the final adequacy score. That was something that was missing in the standard dual encoder, and that's why we wanted to modify it.
0:19:54 I was just wondering if you could elaborate on the generalizability of training a weighting model on a single dataset and having it extend reasonably to enhance performance, compared to training on multiple domains. 0:20:12 What I mean is: is the general scheme such that, whenever you want to improve performance on a dataset, you would find a similar dataset, train the weighting model on that similar dataset, and then use the weighting model on the new dataset? Is that the general scheme when we use this?
0:20:34 It's not exactly the question you are asking, but in some cases you might want to use different domains, or to preselect, to prune out some parts of the data that you don't want. In some cases, and that was the case we had here, it is very difficult to do the preprocessing in advance on the full dataset, because the quality is very hard to determine using simple rules. In particular, here the turn structure is something that is important for determining what constitutes a natural response, but it was nearly impossible to write rules for that, because it depended on pauses and gaps, lexical cues, and many different factors. You could of course build a machine-learning classifier that would segment your turns, but then it would be all or nothing: many examples in my dataset were probably responses, but the classifier wouldn't give me a reliable answer. So it was better to use a weighting function, so that we could still account for some of these examples, but not in the same way as the known high-quality responses. 0:21:53 Another aspect I would like to mention is that we could of course have trained only on the high-quality responses, but in that case I would have had to prune away ninety-nine point nine percent of my dataset, and I don't want that; I want to keep everything, just because I'm not exactly sure about the quality of the other responses. You can view it as a kind of regression.
0:22:22 One more question, if I may. Maybe I didn't look at the evaluation too closely, but did you try a baseline where you used a simpler heuristic for assigning the weights, rather than building a separate model to learn the weights? 0:22:50 So a rule-based assignment rather than a learned one? 0:22:56 No, I didn't.
0:23:00 I'm not exactly sure we could find a very simple one. Something that could be done, though I don't know how well it would perform, would be to use the time gaps between the context and the response as a way to determine the weights, but I didn't try that.
0:23:22 I tried that in a previous paper, where I was just looking at turn segmentation, and it didn't work very well for that particular task. Here it would be a bit different, since we would assign a weight value instead of just segmenting, but the time gap alone doesn't work very well; you usually have to use some lexical cues as well. For instance, a vocative like "Mister Holmes" at the start of a sentence is usually an indicator that the next speaker is going to be Holmes, but you need a classifier for that.