So this is joint work with Serge Bibauw.

So, neural architectures have been increasingly popular for the development of conversational agents, and one major advantage of these approaches is that they can be learned from raw, unannotated dialogues without needing much domain knowledge or feature engineering. However, they also require large amounts of training data, because they have a large parameter space.

So we usually use large online resources to train them, such as Twitter conversations, technical web forums like the Ubuntu one, chat logs, movie scripts, or movie subtitles, including subtitles for TV series. These resources are undeniably useful, but they all face some limitations in terms of dialogue modeling.

There are several limitations, and we could talk for a long time about them, but I would like to point out just two, especially the ones that are important for subtitles.

One of these limitations is that for movie subtitles we don't have any explicit turn structure. The corpus itself is only a sequence of sentences, together with timestamps for their start and end times. But we don't know who is speaking, because of course the subtitles don't come together with the audio track and the video where you can see who is speaking at a given time. So we don't know who is speaking, and we don't know whether a sentence answers another turn or is a continuation of the current turn.

So in this particular example, the actual turn structure is the following. As you can see, there are some strong cues: the timestamps can be used in a few cases, and there are lexical and syntactic cues that can be used to infer the turn structure, but you never have the ground truth. That is an important disadvantage when you want to build a system that generates actual responses, and not just continuations of a given dialogue.

Another limitation is that many of these data contain references to named entities that might be absent from the inputs, in particular fictional characters. These often refer to a context which is external to the dialogue and which cannot be captured by the inputs alone. In this particular case, "Mr. Holmes" is an entity that would require access to an external context in order to make sense of what is happening. There are other limitations of course, but I just wanted to point out these two important ones.

So how do we deal with these problems? The key idea I'm going to present here starts from the fact that not all examples of context-response pairs are equally useful or relevant for building conversational models. Some examples, as Oliver Lemon showed in his keynote, might even be detrimental to the development of your model.

We can view this as a kind of domain adaptation problem: there is some discrepancy between the context-response pairs that we observe in a corpus and the ones that we wish to encode in our neural conversational model, for the particular application that we have in mind. The proposed solution is one that is very well known in the field of domain adaptation, which is simply the inclusion of a weighting model: we try to map each pair of context and response to a particular weight value that corresponds to its importance, its quality if you want, for the particular purpose of building a conversational model.

So how do we assign these weights? Of course, due to the sheer size of our corpora we cannot annotate each pair manually, and even handcrafted rules may be difficult to apply in many cases, because the quality of an example might depend on multiple factors that interact in complex ways.

What we propose here is a data-driven approach, where we learn a weighting model from examples of high-quality responses. Of course, what constitutes a high-quality response might depend on the particular objectives and the particular type of conversational model that one wishes to build, so there is no single answer to what constitutes a high-quality response. But if you have some idea of what kinds of responses you want and which ones you don't want, you can often select a subset of high-quality responses and learn a weighting model from these.

The weighting model uses a neural architecture, which is the following. As you can see, we have two recurrent neural networks with shared weights: an embedding layer and a recurrent layer with LSTM units. These two respectively encode the context and the response as sequences of tokens, and output fixed-size vectors, which are then fed to a dense layer. This dense layer can also incorporate additional inputs, for instance document-level factors: if you have some features that are specific to movie dialogue and that may be of interest for calculating the weights, you can incorporate them there. For the subtitles, for instance, we also have information about the time gap between the context and the response, and that is something that can be used as well. So we feed all of these into the final dense layer, which then outputs a weight for the given context-response pair.
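To make this architecture a bit more concrete, here is a minimal sketch of such a weighting model in PyTorch. The layer sizes, the sigmoid output, and the names (WeightingModel, extra_feats for document-level factors such as the time gap) are illustrative assumptions, not the exact configuration used in the work.

```python
import torch
import torch.nn as nn

class WeightingModel(nn.Module):
    """Sketch: shared embedding + recurrent encoder applied to both the
    context and the response, plus a dense layer that also takes extra
    document-level features (e.g. the time gap) and outputs a weight."""

    def __init__(self, vocab_size, emb_dim=128, hidden_dim=128, n_extra_feats=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.dense = nn.Linear(2 * hidden_dim + n_extra_feats, 1)

    def encode(self, token_ids):
        # Encode a token sequence into a fixed-size vector (final hidden state).
        _, (h_n, _) = self.encoder(self.embedding(token_ids))
        return h_n[-1]                       # shape: (batch, hidden_dim)

    def forward(self, context_ids, response_ids, extra_feats):
        ctx_vec = self.encode(context_ids)   # same weights used for both inputs
        rsp_vec = self.encode(response_ids)
        combined = torch.cat([ctx_vec, rsp_vec, extra_feats], dim=-1)
        # Sigmoid keeps the weight in (0, 1); the exact output squashing
        # is an assumption of this sketch.
        return torch.sigmoid(self.dense(combined)).squeeze(-1)
```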

So that's the weighting model. Once we have learned it from examples of high-quality responses, we can apply it to the full training data to assign a particular weight to each pair, and then include these weights in the empirical loss that we minimize when we train the neural conversational model.

The exact formula for the empirical loss might depend on what kind of model you are building and what kind of loss function you are using, but the key idea is that the loss function computes some kind of distance between what the model produces and the ground truth, and you then weight this loss by the weight value calculated from the weighting model. So it is a kind of two-pass procedure: you first calculate the weight of your example, and then, given this weight and the output of your neural model, you calculate the empirical loss and optimize the parameters on this weighted sum.
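In other words, assuming a per-example loss $\ell$ and a trained weighting model $w$, the weighted empirical loss takes roughly the following form (the exact formulation depends on the model and loss function used):

$$\hat{\mathcal{L}}(\theta) \;=\; \frac{1}{N}\sum_{i=1}^{N} w(c_i, r_i)\,\ell\bigl(f_\theta(c_i),\, r_i\bigr)$$

where $(c_i, r_i)$ are the context-response pairs in the training data and $f_\theta$ is the conversational model being trained; the weights are computed in a first pass and kept fixed while $\theta$ is optimized.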

So that's the model and the way the weights are integrated at training time. Now, how do we evaluate the models? We evaluate only with retrieval-based neural models, because their evaluation metrics are more clearly defined than for generative models. Retrieval-based models seek to compute a score for a given context-response pair, reflecting how relevant the response is given the context; you can then use this score to rank possible responses and select the most relevant one.

The training data comes from OpenSubtitles, a large corpus of dialogues that we released last year. We compare three models: a classical TF-IDF model and two dual encoder models, one with uniform weights, so without weighting, and one using the weighting model. We conducted both an automatic and a human evaluation of this approach.

The dual encoder models were proposed a few years ago. They are actually quite simple models, where you have two recurrent networks with shared weights, which are then fed to dense layers and combined with a dot product. So the model computes some kind of semantic similarity between the response that is predicted given the context and the actual response found in the corpus. We made a small modification to this model, to allow the final score to also depend on some features of the response itself, because some features are not related to the similarity between the context and the response, but to aspects of the response itself that give clues about whether it is of high or low quality. For instance, unknown words might indicate a response of lower quality.
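As a rough illustration of this scoring model, a dual encoder with the additional response-feature term could look like the following PyTorch sketch. The dimensions, the feature count, and the class and parameter names are assumptions for illustration, not the exact model from the work.

```python
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    """Sketch of a dual encoder scorer: shared recurrent encoders for context
    and response, a dot-product similarity, plus a small term based on
    response-only features (the modification described in the talk)."""

    def __init__(self, vocab_size, emb_dim=128, hidden_dim=128, n_resp_feats=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        # Maps the context vector to a "predicted response" vector.
        self.predict = nn.Linear(hidden_dim, hidden_dim)
        # Extra term computed from response-only features (length, OOV rate, ...).
        self.resp_feats = nn.Linear(n_resp_feats, 1)

    def encode(self, token_ids):
        _, h_n = self.encoder(self.embedding(token_ids))
        return h_n[-1]                               # (batch, hidden_dim)

    def forward(self, context_ids, response_ids, response_features):
        ctx = self.predict(self.encode(context_ids)) # predicted response vector
        rsp = self.encode(response_ids)              # actual response vector
        similarity = (ctx * rsp).sum(dim=-1)         # dot product
        score = similarity + self.resp_feats(response_features).squeeze(-1)
        return torch.sigmoid(score)                  # relevance score in (0, 1)
```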

In terms of evaluation, as I said, we used the subtitles as training data. To select the high-quality responses, we took a subset of this training data for which we knew the turn structure, because we could align it with movie scripts where you have the speaker names. Then we used two heuristics: we only kept responses that introduce a new dialogue turn, so not sentences that are simply a continuation of a given turn; and we only used two-party conversations, because in two-party conversations it is easier to determine whether the response is actually a response to the previous speaker or not. We also filtered out responses containing fictional names and out-of-vocabulary words, and we ended up with a set of about one hundred thousand response pairs that we considered to be of high quality.
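A minimal sketch of these selection heuristics, assuming speaker information obtained from the aligned movie scripts and hypothetical fictional_names and vocabulary sets (the two-party filter is assumed to be applied upstream, per dialogue):

```python
def keep_example(context_speaker, response_speaker, response_tokens,
                 fictional_names, vocabulary):
    """Sketch of the selection heuristics for high-quality responses."""
    # Keep only responses that start a new turn by a different speaker.
    if response_speaker == context_speaker:
        return False
    # Discard responses containing fictional names or out-of-vocabulary words.
    if any(tok in fictional_names for tok in response_tokens):
        return False
    if any(tok not in vocabulary for tok in response_tokens):
        return False
    return True
```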

For the test data, we used one in-domain and one slightly out-of-domain test set: the Cornell Movie Dialogs corpus, which is a collection of movie scripts (not movie subtitles, but movie scripts), and a small corpus of sixty-two theatre plays that we found on the web. Of course, we preprocessed them: tokenization and POS-tagging.

In terms of experimental design, we limited the context to the last ten utterances preceding the response, with a maximum of sixty tokens; for the response, the maximum was five utterances, in the case of turns with multiple utterances. We used a one-to-one ratio between positive examples, which were actual pairs observed in the corpus, and negative examples, where the response was drawn at random from the same corpus.
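A minimal sketch of this 1:1 negative sampling, with hypothetical pair structures:

```python
import random

def make_training_examples(pairs):
    """Sketch: 1:1 ratio of positive examples (actual context-response pairs)
    and negative examples whose response is drawn at random from the corpus."""
    responses = [response for _, response in pairs]
    examples = []
    for context, response in pairs:
        examples.append((context, response, 1))                  # positive
        examples.append((context, random.choice(responses), 0))  # negative
    random.shuffle(examples)
    return examples
```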

We used GRU units instead of LSTMs, because they are faster to train and we didn't see any difference in performance compared to LSTMs. And here are the results. As you can see, TF-IDF doesn't perform well, but that is well known.

We look at the Recall@k metric, which considers a set of possible responses, one of which is the actual response observed in the corpus, and checks whether the model is able to put that actual response among the top k responses. So 1-in-10 Recall@1 means that, in a set of ten responses, one of which is the actual response, we check whether the model ranks the actual response highest. That's the metric.
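A small sketch of how this 1-in-N Recall@k can be computed, using illustrative candidate scores:

```python
def recall_at_k(scores, true_index, k):
    """Sketch of 1-in-N Recall@k: `scores` are the model scores for N candidate
    responses, one of which (at `true_index`) is the actual response."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return 1.0 if true_index in ranked[:k] else 0.0

# Example: the true response (index 0) gets the highest score -> Recall@1 = 1.0
print(recall_at_k([0.9, 0.2, 0.4, 0.1], true_index=0, k=1))
```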

We then compared the two dual encoder models, and as you can see, the one with the weighting model performs a little better on both test sets. What we found in a subsequent error analysis was that the weighting model gives more importance to cohesive adjacency pairs between the context and the response, that is, responses that were not simply continuations, but actual responses that were clearly from another speaker and that were answering the context.

We also performed a human evaluation of the responses produced by the dual encoder models, using crowdsourcing. We picked one hundred and fifteen random contexts from the Cornell corpus and four possible responses: a random response, the two responses from the dual encoder models, and an expert response that was manually authored. That gave us four hundred and sixty pairs, each of which was evaluated by human judges, who were asked to rate the consistency between the context and the response on a five-point scale. One hundred and eighteen individuals participated in the evaluation, through CrowdFlower.

Unfortunately, the results were not conclusive: we couldn't find any statistically significant difference between the two models, and there was in general very low agreement between the participants, for all four types of responses. We hypothesize that this was due to the difficulty for the raters to discriminate between the responses, and this might be due to the nature of the corpus itself, which is heavily dependent on an external context, namely the movie scenes: if you don't have access to the movie scenes, it is very difficult to understand what is going on. And even with a longer dialogue history, that didn't seem to help. So for a human evaluation, we think another type of test data might be more beneficial. That was the human evaluation.

So, to conclude: large dialogue corpora usually include many noisy examples, and noise can cover many things. It can be responses that were not actual responses, or responses that include, for instance, fictional names that you don't want to appear in your models; it might also include dull, commonplace responses, or responses that are inconsistent with what the model knows. So not all examples have the same quality or the same relevance for learning conversational models. A possible remedy is to include a weighting model, which can be seen as a form of domain adaptation, since instance weighting is a common approach for domain adaptation.

And we showed that this weighting model does not need to be handcrafted. If you have a clear idea of how you want to filter your data, then you can of course use handcrafted rules, but in many cases what determines the quality of an example is hard to pinpoint, so it might be easier to use a data-driven approach and learn the weighting model from examples of high-quality responses. What constitutes this quality, what constitutes a good response, of course depends on the actual application that you are trying to build. The approach is very general: it is essentially a preprocessing step, so it can be applied to any data-driven model of dialogue. As long as you have examples of high-quality responses, you can use it as a preprocessing step to anything.

As future work, we would like to extend this to generative models. In the evaluation we restricted ourselves to one type of retrieval-based model, but it would be very interesting to apply the approach to other kinds of models, and especially to generative ones, which are known to be quite difficult to train. An additional benefit of weighting models would be that you could filter out examples that are known to be detrimental to the model before you even feed them to the training scheme, so you might get performance benefits in addition to the benefits in terms of accuracy. That is for future work, and possibly also other types of test data than the Cornell Movie Dialogs corpus that we used. That's it, thank you.

Can you go back to the box plot towards the end?

So, I'm not sure what is in the box plot. The way I read it is that there is no real difference in agreement between the two, but you said that there is very low agreement between the evaluators, so I was wondering whether we are looking at two different things there. Is that right?

I think it is mostly between the two dual encoder models. There is of course a statistically significant difference between the authored responses and the random ones, and also between the two dual encoder models and the random one, but there is no significant difference between the two dual encoder models themselves, with weighting and without weighting.

So maybe the difference would be more significant with a different test set?

Right, I agree.

Could you elaborate on why you changed the final part of the dual encoder? What was the reason for extending it?

So the idea is that the dot product gives you a similarity between the response predicted from the context and the actual response, right? That is a very important aspect when considering how relevant a response is given the context, but there may be aspects that are really intrinsic to the response itself and have nothing to do with the context. For instance, unknown words or rare words that might be typos, wrong punctuation, or very lengthy responses. These are not going to be directly captured by the dot product; they are captured by extracting some features from the response and then using these in the final adequacy score. So that is something that was somewhat missing in the original dual encoder, and that is why we wanted to modify it.
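As an illustration, response-intrinsic features of this kind could be computed roughly as follows. This is a sketch with an assumed feature set, matching the response_features input of the earlier dual encoder sketch, not the actual features used in the work.

```python
def response_features(tokens, vocabulary):
    """Sketch: simple response-intrinsic features that could feed the final
    adequacy score (out-of-vocabulary rate, punctuation rate, length)."""
    n = max(len(tokens), 1)
    oov_rate = sum(1 for tok in tokens if tok not in vocabulary) / n
    punct_rate = sum(1 for tok in tokens if not tok.isalnum()) / n
    length = float(len(tokens))
    return [oov_rate, punct_rate, length]
```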

I was just wondering if you could elaborate on the extent to which you believe in the generalizability of training a weighting model on a single dataset and having it extend reasonably well to enhance performance elsewhere. (Compared to training on multiple domains, you mean?) What I mean is: is the general scheme such that, whenever you are trying to improve performance on a dataset, you would basically find a similar dataset, train the weighting model on that similar dataset, and then use the weighting model on the new dataset? Is that roughly how this is meant to be used?

It's not exactly the question you are asking, but in some cases you might want to use different domains, or to preselect, to prune out some parts of the data that you don't want. In some cases, and that was the case here, it is very difficult to do that preprocessing in advance on the full dataset, because the quality is very hard to determine using simple rules. In particular, the turn structure is important here for determining what constitutes a natural response, but it was nearly impossible to write rules for that, because it depends on pauses and gaps, lexical cues, and many other factors. You could of course build a machine learning classifier to segment the turns, but then it would be all or nothing: there were many examples in my dataset that were probably responses, but for which the classifier wouldn't give me a reliable answer. So it was better to use a weighting function, so that I can still account for some of these examples, but not in the same way as I would for clearly high-quality responses.

Another aspect I would like to mention is that I could, for instance, have trained only on the high-quality responses, but in that case I would have had to prune away ninety-nine point nine percent of my dataset. I didn't want that; I wanted the model to be able to learn from everything, because I am not exactly sure about the quality of the remaining responses. I'd rather view it as a regression.

We have time for one more question, maybe.

I guess I'm not sure, maybe I didn't look at the evaluation too closely, but did you try a baseline where you used a simpler heuristic for assigning the weights, rather than building a separate neural model to learn them? So something assigned directly, not necessarily learned.

No, I didn't. I'm not exactly sure we could find a very simple one. Something that could be done, although I don't know how well it would perform, would be to use the time gap between the context and the response as a way to determine the weights. But I didn't try that.

I tried it in a previous paper, when I was just looking at turn segmentation, and it didn't work very well for that particular task. But here it could be different, since we would be assigning a weight value instead of just segmenting. Still, the time gap alone doesn't work very well; you usually have to use some lexical cues as well. For instance, if you see "Doctor Holmes, blah blah blah", that is usually an indicator that the next speaker is going to be Doctor Holmes, but you need a classifier for that.