Okay.

I'm Marilyn Walker, and the work that I'm presenting is the PhD work of my student, who couldn't be here.

I'm going to talk about summarizing dialogic arguments in social media. The first thing I want to say, since this is the negotiation session, is that it's not clear how much negotiation actually gets carried on in these argumentative dialogues, although people definitely do seem to be negotiating something.

So, the current state of the art for summaries of argumentative dialogues is really human: websites have curators who manually curate argument summaries. Lots of different debate websites have curated arguments; iDebate has these kinds of points for and points against, and ProCon.org has the top ten pro and con arguments. On these websites they kind of summarize the repeated arguments that people make about a particular social issue. The examples here are one about gay marriage and another one about gun control.

And when you look at the natural human dialogues where people discuss the same kinds of issues, it's really striking how difficult it would be to actually produce a summary of these dialogues.

I'll give you a minute to look at this one; I know you're going to read it anyway.

People are very emotional, they're not necessarily logical, they make fun of each other, they're sarcastic; there's all kinds of stuff going on in these dialogues. They don't really fit your notion of what should go into a summary of an argument, especially when you compare them to the curated arguments that are produced by professionals.

So the first question that we had was this: obviously it would be great if you could actually summarize the whole bunch of conversations out there in social media. What is it that the person on the street is saying about gay marriage? What is it that the person on the street is saying about gun control, or abortion, or evolution, or any kind of issue that's constantly debated on these social media websites?

And I would claim that you're interested not just in the kinds of arguments that a lawyer or a constitutional expert would make; you're actually interested to know what it is that people are saying. Everybody can vote these days, whether or not you're in the top one percent of the population that's actually educated in how to argue logically. So it would be a good thing to actually know what it is that people are saying and what kinds of arguments they're making.

When you look at this, the first question is: what should the summary contain? What kind of information should we pull out of these conversations in order to make a summary? The conversants don't agree, so it seems like you would at least need to represent both sides of the argument; that might be a first criterion, that you want to represent the opposing stances. Then, do you want to include some kind of emotional information? Do you want to include the socio-emotional relationship, like the fact that the second speaker is making fun of the first speaker, or that they're being sarcastic? Should that kind of information go into a summary? Or do you want to take the philosophical, logical view of argumentation and say: I'm going to consider all of this to be just flaming or trolling or whatever, and I'm not really interested in any part of this argument that doesn't fit the logical view of argumentation.

There has been previous work on dialogue summarization, but there hasn't been any on summarizing argumentative dialogues automatically. All the other dialogue summarization work that's out there, some of which I think has been done by people in this room, has very different properties; those dialogues are not nearly as messy as these argumentative dialogues are.

So our goal is to automatically produce summaries of argumentative dialogues. We're taking an extractive summarization perspective at this point, although it would clearly be nice if we could do abstractive summarization. The step that we're trying to take in this paper is to identify and extract the most important arguments on each side of an issue.

Our initial starting point is that, as I pointed out on the previous slides, it's actually really difficult to figure out what information these summaries should contain. So we start from the standpoint that summarization is something that any native speaker knows how to do; they don't have to have any training. Our initial idea is that we're going to collect summaries that humans produce of these conversations and see what people pick out. Then we're going to take these summaries that we collected and apply the pyramid method, which has been used in the DUC summarization tasks, and we're going to assume that the arguments that appear in multiple model summaries are the most important arguments. So we're applying a fairly standard extractive summarization and evaluation approach to these argumentative dialogues.

So we have gold-standard training data. We have collected five human summaries for each of about fifty dialogues on the topics of gay marriage, gun control and abortion. A lot of this, what the summaries look like and what their properties are, is described in more detail in our 2015 paper. We then trained undergraduate linguists to use the pyramid method to identify important arguments in the dialogue, so they construct pyramids for each set of five summaries. The idea is that the repeated elements of the summaries end up on the higher tiers of the pyramid. I'm going to give you an example in a minute, in case this is all opaque, so it should be clear after the next slide.

So then we have these human dialogues, we have five summaries for each dialogue, and then we have these pyramids that are constructed on top of each set of summaries, looking at which elements get repeated. But we still have a problem. We know which are the important concepts in the dialogue, because those are the ones that appeared in multiple model summaries, but we have to map them back onto the actual original dialogues: if we want to develop an extractive summarizer, we want to be able to operate on the original dialogue texts and not on the intermediate summary representation which we collected. So that's the third step, getting this mapping back, and once we have that mapping we can characterize our problem as a binary classification problem, or a ranking problem, of identifying the most important utterances in the dialogues, the ones we want to go into the extractive summary.

This is what the sample summaries look like; this one is from a gay marriage dialogue. These summaries are really good quality, and the ones for gay marriage are currently available on our website. The new ones that we collected, the ones talked about in this paper, about abortion and gun control, we will be releasing soon. But if you want to see what they look like, the gay marriage summaries were released a few years ago with our previous paper.

So this is what the data looks like. We have the summaries for about fifty different conversations for each topic. What the human annotator does when they make the pyramid labels is read through all the summaries and decide what the important concepts are, somewhat distinct from the words that are actually used by the summarizers, and then make their own human label. So they come up with the human label, which is a paraphrase: "No one has been able to prove that gun owners are safer than non-gun owners." Then they identify, for each summary, how that summarizer phrased that particular argument, that particular concept. If a concept shows up in more than one of the summaries, up to five because we have five summaries, that means it's a very important concept, so it's represented in a higher tier. So the arguments that multiple summarizers picked out and put in their summaries have more contributors to their human label, and they end up being ranked as more important arguments.
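
To make the tier idea concrete, here is a minimal Python sketch, not the annotators' actual tooling, of how tier scores could be computed by counting how many of the five summaries contribute to each labeled concept; the concept names and data layout are made up for illustration.

    from collections import defaultdict

    # Hypothetical pyramid input: for each of the five summaries, the set of
    # concept labels that annotators found expressed in that summary.
    summaries = [
        {"no proof gun owners are safer", "second amendment"},
        {"no proof gun owners are safer"},
        {"no proof gun owners are safer", "guns deter crime"},
        {"guns deter crime"},
        {"no proof gun owners are safer"},
    ]

    def pyramid_tiers(summaries):
        """Tier of a concept = number of summaries that contribute to it."""
        counts = defaultdict(int)
        for summary in summaries:
            for concept in summary:
                counts[concept] += 1
        return dict(counts)

    tiers = pyramid_tiers(summaries)
    print(tiers)  # e.g. {'no proof gun owners are safer': 4, 'guns deter crime': 2, ...}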

Okay, so now we're on step three, where we have these summary contributions, which, as I said, are removed from the language of the original dialogues, and we have these human labels. What we want to do is figure out which utterances in the original dialogue actually correspond to the things that ended up highly ranked in the pyramid.

When we collected this data two or three years ago, we thought we were going to be able to do this mapping automatically once we had the summaries. After multiple different attempts, we decided that we could not in fact do it automatically, because the language of the summarizers and the language of the human labels from the pyramids is too different from the language in the original dialogues. So we tried Mechanical Turk tasks; actually, we couldn't get Mechanical Turkers to do this task of mapping back from the summary labels to the original dialogues reliably.

So, I forgot to say, we recruited undergraduate linguists to actually do this mapping for us, in order to get good-quality data. We presented them with the original conversations and with the labels they had produced, the highest-tier labels, and we asked them, for each utterance of the conversation, to pick one or more of the labels that correspond to the content of that utterance. Again, we're only interested in the utterances that have a score of three or higher, the ones that are considered most important by the original summarizers. And we get pretty good reliability on this; once we started using our own internally trained people instead of Turkers, we could get it done reliably.

So, at step three, we have the fifty dialogues for each topic, or about fifty for each one; so for each topic that's fifty dialogues and two hundred and fifty summaries, five per dialogue. Now we pull out the important sentences and the not-important sentences for each dialogue, and we frame this as a binary classification task. Again, we could have framed it as a ranking task and just used the tier label, but we decided to frame it as binary classification.

So we group the labels by tier, we compute the average tier label for each sentence, and then we define any sentence with an average score of three or higher as important.
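
As a minimal sketch of that labeling step, assuming each utterance has been mapped to zero or more pyramid labels with tier scores (the data layout and the threshold variable here are illustrative, not the actual code):

    # Turn averaged pyramid tier scores into binary importance labels.
    THRESHOLD = 3.0  # average tier at or above this counts as important

    def binary_labels(utterance_tiers):
        """utterance_tiers: one list of tier scores per utterance; an empty
        list means no pyramid label was mapped to that utterance."""
        labels = []
        for tiers in utterance_tiers:
            avg = sum(tiers) / len(tiers) if tiers else 0.0
            labels.append(1 if avg >= THRESHOLD else 0)
        return labels

    # Example: three utterances; only the first is important.
    print(binary_labels([[4, 3], [2], []]))  # -> [1, 0, 0]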

We believe that by going through this whole process we have provided a well-motivated and theoretically grounded definition of what an important argument is, and now we have this binary classification problem that we're trying to solve.

We applied three different off-the-shelf summarizers, to see how standard summarization algorithms work. We used SumBasic, which is an algorithm by, I think, Nenkova and Vanderwende; we used the KL-divergence summarizer, which is from Haghighi and Vanderwende; and we used LexRank. These are all available off the shelf and they are all different kinds of algorithms; LexRank is the one that was most successful at the most recent Document Understanding Conference. All of these rank utterances instead of classifying them, so what we did is apply them to the dialogues, get the ranking, and then cut off the ranking at the point where the length of the extractive summary is the same as what we expect.
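
As a minimal sketch of that cutoff step (the scores would come from an off-the-shelf ranker such as LexRank; names and numbers here are illustrative):

    def extractive_summary(utterances, scores, target_length):
        """Rank utterances by a summarizer's scores and keep only as many
        as the expected extractive summary length."""
        ranked = sorted(zip(utterances, scores), key=lambda pair: pair[1], reverse=True)
        return [utt for utt, _ in ranked[:target_length]]

    # A five-utterance dialogue where we expect a two-utterance summary.
    utts = ["u1", "u2", "u3", "u4", "u5"]
    print(extractive_summary(utts, [0.1, 0.8, 0.3, 0.9, 0.2], 2))  # -> ['u4', 'u2']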

We have a bunch of different models. We tried support vector machines with a linear kernel from scikit-learn, using cross-validation for tuning the parameters, and then we also tried a combination of a convolutional neural net with a bidirectional LSTM. We split our data into training and test sets.
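
A minimal sketch of that SVM setup in scikit-learn; the feature matrix, labels, and parameter grid are placeholders rather than the values used in the paper:

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Placeholder data: one feature vector per utterance, binary importance label.
    X_train = np.random.rand(200, 50)
    y_train = np.random.randint(0, 2, size=200)

    # Linear-kernel SVM with the regularization constant tuned by cross-validation.
    search = GridSearchCV(
        SVC(kernel="linear"),
        param_grid={"C": [0.01, 0.1, 1, 10]},
        cv=5,
        scoring="f1_weighted",
    )
    search.fit(X_train, y_train)
    print(search.best_params_)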

For features, we have two different kinds of word embeddings, Google word2vec and GloVe, and then we have some other things that are more linguistically motivated and that we expected might possibly help. We have readability scores; we expected that utterances that are more readable would be better, would be more important. We thought sentiment might be important. We thought sentence position might be important, in that the first sentences of a summary might matter more, or the first sentences in the dialogue. Then we have LIWC, Linguistic Inquiry and Word Count, which gives us a lot of lexical categories, with three different representations of the context: one in terms of LIWC, and one in terms of the dialogue act classification of the previous utterance or the previous two utterances in the dialogue. And then we ran the Stanford coreference system, which I expected not to produce anything; that's a little foreshadowing, because it actually helps, amazingly.
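
A rough sketch of how a feature vector for one utterance might be assembled; every helper here (the embedding lookup, the readability proxy, the tiny category lists standing in for LIWC) is a placeholder, not the paper's actual feature extraction code:

    import numpy as np

    CATEGORIES = {  # tiny stand-in for LIWC-style lexical categories
        "negate": {"no", "not", "never"},
        "certain": {"always", "proof", "fact"},
    }

    def category_counts(tokens):
        return [sum(tok in words for tok in tokens) for words in CATEGORIES.values()]

    def utterance_features(tokens, prev_tokens, position, embed):
        """Concatenate averaged word embeddings with simple linguistic features
        for the utterance and for its previous utterance (the context)."""
        avg_embedding = np.mean([embed(tok) for tok in tokens], axis=0)
        readability = np.mean([len(tok) for tok in tokens])  # crude readability proxy
        extras = [readability, position] + category_counts(tokens) + category_counts(prev_tokens)
        return np.concatenate([avg_embedding, np.array(extras)])

    # Toy three-dimensional embedding lookup, just for the example.
    embed = lambda tok: np.full(3, float(len(tok)))
    vec = utterance_features(["guns", "are", "not", "safe"], ["any", "proof"], 2, embed)
    print(vec.shape)  # (9,) = 3 embedding dims + 2 scalars + 2 + 2 category counts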

And these are our results. LexRank was our very best baseline; I'm not going to tell you what the other baselines were. With LexRank we're getting a weighted F-score on the test set in the upper fifties. Our very best model is the SVM using features: with just word embeddings it doesn't do as well, but if we put in all these linguistic features, we see that for both the gun control and the abortion topics the off-the-shelf coreference engine, applied to these very noisy dialogues, actually improves performance, so having a representation of the context helps. And we get better results for gun control than we do for gay marriage and abortion, and we have had that result repeatedly, over and over. We think the reason is that the same arguments get repeated for gun control, and people aren't as creative about it as they are about the other topics.
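
For reference, the weighted F-score mentioned above can be computed with scikit-learn as in this small sketch; the gold and predicted labels are made up:

    from sklearn.metrics import f1_score

    # Gold and predicted binary importance labels for a few test utterances.
    gold = [1, 0, 0, 1, 0, 1]
    pred = [1, 0, 1, 1, 0, 0]

    # "weighted" averages the per-class F1 scores, weighted by how often each
    # class occurs, which matters when the two classes are imbalanced.
    print(f1_score(gold, pred, average="weighted"))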

The CNN with the biLSTM, with just the word embeddings, gets results somewhere in the sixties, and then we get our best model using that one along with the features. As for what this slide shows, let me remind you what these features are: LIWC plus a context representation that is also LIWC, the readability score, the dialogue act score, and then the coreference.

So for gun control, having three different representations of the context gives us the best model, and for both gay marriage and abortion as well, just having the LIWC categories of the previous utterance also gives good performance. I think it's interesting that a pretty simple representation of context, not a sequential model, still shows that the context helps.

One minute.

Okay.

Why did LexRank not work very well? It didn't work very well because of all this repetition in dialogue. The assumption of LexRank, on something like newspaper corpora, is that if something gets repeated, it's important. But, as you might infer from the previous speaker talking about alignment, there's lots of repetition in conversation that doesn't indicate that the information is actually important, and LexRank is based on lexical repetition, so it doesn't really help.

What's interesting about sentiment is that positive sentiment actually turns out to be a very good predictor that an utterance is not important, and it's not necessarily for the reason you would think: it's because sentiment classifiers treat anything that's purely conversational, things like "right, yeah", as positive sentiment. So it just rules out a lot of stuff that's purely conversational, and that's why sentiment helps.

And then for the LIWC categories, we get some categories that are different for each topic, which shows that some of what we're learning with LIWC is actually topic-specific, because the model is learning to use particular LIWC categories. Okay.

So we have presented a novel method for summarizing argumentative dialogues and shown that our results beat several summarization baselines. We compared the SVM with a neural deep learning model, showed that the linguistic features really help, and showed that the context-based features improve over the sentence alone. We want to do more work exploring whether this could be topic-independent; I do want to point out that our summarization baselines are all topic-independent, they don't need any training.

Okay. Questions?


That's a really good point. Did we distinguish between conversations where there was more or less agreement? We haven't looked at that, and I think it would be interesting, because you would think it would be easier to summarize a conversation where people were more on the same side in terms of stance.

Yes?

I don't know; it's in the paper. It seems like they would be pretty... can you rephrase? So, for a given model, when you use the features, do you use them simultaneously? For our method, no, I don't think we tried that; we tried word2vec embeddings and then weighted GloVe embeddings, and we didn't put both in, and we looked at both of those with the other features. You mean which features make a difference? Probably not all of the results are in the paper, but there is a pretty decent set of ablation results in the paper about how much each feature contributes.

David, do you have a quick question? Sorry, wait. So, did you train on abortion and test on gun control?

So, we have done that; we had a paper a few years back where we did some cross-domain experiments on a subset of this problem, which is just trying to identify sentences that are more likely to be understandable as good arguments out of context. In that paper, which has Swanson as first author, and I can tell you about it afterwards, we did some cross-domain experiments, and of course it doesn't work as well. And it is interesting, because you would think, and we had thought, that most of the features that we're using would not be domain-specific, but every time we do that cross-domain experiment the results are about ten percent worse.

Okay, so the most domain-specific features are the embeddings? The embeddings, and also the LIWC: you give it all the LIWC features, but the ones that the model learns to pay attention to are topic-specific.

Let's thank the speaker again.