Everyone, so I will continue on the topic of RST, but we will focus instead on discourse units in the context of summarization. This is joint work with Kapil and Amanda from when I was interning at Yahoo.
For summarization, let's first look at an example, and I will read it. In this case: "As the global warming created by human emissions caused land ice to melt and ocean water to expand, scientists warned that the accelerating rise of the sea would eventually imperil the United States' coastline. Now these warnings are no longer theoretical: the inundation of the coast has begun. The sea has crept up to the point that a high tide and a brisk wind are all it takes to send water pouring into streets and homes," and so on.
So here I'm showing a real human summary, which says: "Scientists' warnings that the rise of the sea would eventually imperil the United States' coast are no longer theoretical."
If we compare the two, we see that humans compress the document's sentences in order to capture the document's meaning, and they do so by trimming extraneous content, by combining sentences, by replacing phrases or clauses, and so on.
Now, for machine summarization there are usually two big schools of systems. One is extractive summarization, where the summarizer extracts full sentences from the original article. The second one is abstractive summarization, where the system actually generates the text of the summary. And if we look at the number of results returned by a search engine, we see that the extractive techniques are very popular. Since they select full sentences from the documents, the summaries are always grammatical, so the systems can focus on things like content selection and coherence.
Now, if we want an extractive summary that conveys everything the human was trying to convey in their summary, these two sentences would have to be selected. And we can see that the resulting summary is very long; it's nothing like what the human was trying to produce.
So in this paper we look at single-document summarization. We want to ask whether extractive summarization techniques can be used to produce more human-like summaries. In particular, we are interested in whether extracting sub-sentential units would help produce a wider range of summaries. By a wider range, I mean for summaries to be near-extractive, where the tokens are extracted from contiguous and non-contiguous spans of the original sentences. And as sub-sentential units we are particularly interested in elementary discourse units, or EDUs, and we want to see whether they are good summarization units.
So, just as a quick recap: what are elementary discourse units? They are part of Rhetorical Structure Theory, or RST, where they are defined as a segmentation of sentences into independent clauses. For example: "As the floppy drive writes or reads, it cleans the disk it is working on, to keep loose particles and dust from causing soft errors and dropouts."
Here the sentence is segmented into three EDUs. In the full discourse tree, the second and third EDUs have a Purpose relationship, and together they have a Circumstance relationship with the first EDU. In the full discourse tree, the more important part of a relation is called the nucleus, and the less important part is called the satellite; this fact will be used later.
Here are the contributions of this paper. First, we do an analysis of automatically obtained EDUs against human-identified concepts. We show that EDUs correspond well with the conceptual units identified by humans, and second, we show that the importance of EDUs correlates with the importance of concepts. Next, we look at near-extractive summarization, where we first introduce a large dataset of extractive and near-extractive summaries, and then we show that EDU boundaries align with human content extraction in this dataset. Furthermore, we show that EDUs are superior to sentences in near-extractive summarization under varying length constraints.
Okay, so I will start with the first contribution: how we look at EDUs and their correspondence with human-identified conceptual units. The idea is that on the one hand we have abstract units of information, and on the other hand we have sentences that contain these units; what we want to see is whether elementary discourse units are a happy middle ground between the two. So what we have is articles with human-identified and labeled conceptual units, and we can segment them automatically into EDUs, so we get a correspondence between EDUs and concepts. Then, using this correspondence, we can look at the lexical coverage of EDUs.
The articles with human-labeled concepts that we use are the human summaries from DUC 2005 to 2007 and TAC 2008 to 2011. The concepts here are summary content unit contributors, where each summary content unit, or SCU, contains at least one contributor extracted from each summary that expresses it.
So what do I mean by contributors? Say here is an original article, and humans come in and write summaries for this article. At this point we disregard the original article and consider the summaries as independent articles, except that they share the same topic. Now other humans come in and mark contributors in these summaries, and these are aggregated into summary content units with a weight, where the weight is determined by how many summaries contain a contributor with the same semantic content. So here a weight of four means that it comes from four summaries, and here a weight of two means that it comes from two summaries.
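As a minimal sketch (hypothetical data of mine, not the actual annotation tooling), the weighting works like this:

```python
from collections import defaultdict

# Hypothetical contributors: (summary_id, scu_label) pairs, where the SCU
# label identifies the shared semantic content a contributor expresses.
contributors = [
    (0, "ABA represents private bookstore owners"),
    (1, "ABA represents private bookstore owners"),
    (2, "ABA represents private bookstore owners"),
    (3, "ABA represents private bookstore owners"),
    (0, "ABA sponsors Book Expo"),
    (2, "ABA sponsors Book Expo"),
]

# The weight of an SCU is the number of distinct summaries contributing to it.
scu_summaries = defaultdict(set)
for summary_id, scu in contributors:
    scu_summaries[scu].add(summary_id)

weights = {scu: len(ids) for scu, ids in scu_summaries.items()}
print(weights)
# {'ABA represents private bookstore owners': 4, 'ABA sponsors Book Expo': 2}
```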
So what do they look like? For example: "The American Booksellers Association represents private bookstore owners, and sponsors Book Expo, an annual convention." Here the first contributor is "the American Booksellers Association represents private bookstore owners"; the second one is "the American Booksellers Association sponsors Book Expo"; and the third one is "Book Expo is an annual convention." In all, we have more than 32,000 contributors, and about 79% of them are contiguous spans in the text. From now on, we will refer to these contributors as concepts.
So now we have human-labeled concepts from the summaries; how do we get the EDUs? We do full discourse parsing automatically using Feng and Hirst's tool. In the previous example, everything before the word "and" is the first EDU, and everything afterwards is the second.
Now we can look at the number of overlapping EDUs per concept. In particular, this graph shows the number of EDUs that overlap with at least one token of each concept, and we see that it's usually one, sometimes two, and rarely more than three. On average, each concept overlaps with 1.56 EDUs, while the corresponding number for whole sentences is 2.18. So we can see that sentences are much coarser than EDUs. And if we want to represent concepts using EDUs, we would not like extraneous content in the concept that is not present in the EDU.
So here we show the number of words that would need to be deleted from each concept for it to be covered by a single EDU. We see that in most cases EDUs are larger than the concepts, and less than 8% of the concepts have more than four words outside their corresponding EDU. (A sketch of how these overlap and coverage counts can be computed follows below.)
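Here is a minimal sketch, assuming character-offset spans for concepts and EDUs (hypothetical inputs of mine; the actual alignment details may differ), of the two statistics above: EDUs overlapping each concept, and concept tokens falling outside the best single EDU:

```python
def overlapping_edus(concept_span, edu_spans):
    """EDUs that share at least one character with the concept span."""
    c_start, c_end = concept_span
    return [(s, e) for (s, e) in edu_spans if s < c_end and e > c_start]

def tokens_outside_best_edu(concept_tokens, concept_span, edu_spans):
    """Fewest concept tokens to delete so the rest fits in a single EDU.

    concept_tokens: list of (start, end) character spans, one per token.
    """
    best = None
    for s, e in overlapping_edus(concept_span, edu_spans):
        inside = sum(1 for ts, te in concept_tokens if ts >= s and te <= e)
        outside = len(concept_tokens) - inside
        best = outside if best is None else min(best, outside)
    return best

# Hypothetical spans: one concept mostly covered by the first of two EDUs.
edus = [(0, 40), (41, 90)]
concept = (10, 50)
tokens = [(10, 15), (16, 24), (25, 39), (42, 50)]
print(len(overlapping_edus(concept, edus)))            # 2
print(tokens_outside_best_edu(tokens, concept, edus))  # 1
```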
So now we have seen that EDUs do correspond with human-identified conceptual units. Next we can look at another angle: comparing the importance of EDUs with the importance of concepts, i.e., the concept weights. How do we do this? Remember that each concept is associated with a weight, namely the number of summaries in which the same semantic content is present. So we have the weight of each concept, and for each concept we have its overlapping EDUs. Now, if we can also get weights for the EDUs, we have the full picture for comparison. And indeed we can. I will not elaborate on how to derive them, but the idea is to use the nucleus and satellite information; in our earlier example, the second EDU is the most important one. A rough sketch of this idea follows below.
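Reusing the tree representation from the earlier sketch, here is one way to realize that idea (a Marcu-style nucleus promotion; a simplification of mine, not necessarily the paper's exact scoring):

```python
def salience(node, depth=0, scores=None):
    """Score each EDU by how high in the tree it survives as a nucleus:
    depth 0 is the most salient (promoted all the way to the root)."""
    if scores is None:
        scores = {}
    if isinstance(node, EDU):
        scores[node.text] = depth
        return scores
    salience(node.nucleus, depth, scores)        # the nucleus keeps its depth
    salience(node.satellite, depth + 1, scores)  # the satellite is demoted
    return scores

print(salience(tree))
# EDU 2 gets depth 0 (most salient); EDUs 1 and 3 get depth 1.
```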
Now, in this table I show the average salience score for EDUs that overlap with concepts of different weights, and we can see that as the weight of a concept becomes larger, the weight of the EDU also goes higher. I want to stress that the weight of a concept comes from multiple documents, while the weight of an EDU comes from a single document; so, intuitively, the weight of an EDU in itself carries some notion of the importance of the concept.
Okay, so now we have seen that intra-document EDU weights correlate with inter-document concept weights. Next we can investigate near-extractive summarization, and I will first talk about the dataset. The data we use comes from the LDC release of the New York Times Annotated Corpus. In particular, it contains about 245,000 online lead paragraphs from 2001 to 2007; these are the paragraphs under the headlines on the New York Times homepage. And indeed, the very first example at the beginning of the talk is one of these lead paragraphs.
In particular, in this dataset we have identified three subsets of extractive and near-extractive summaries. The first one is extractive sentence; it contains more than 38,000 examples, where the summary sentences are extracted verbatim from the original text sentences. The second one is near-extractive span; it contains more than 15,000 examples, where the summary sentences are contiguous spans from the original text sentences. The third one is near-extractive subsequence, which contains more than 25,000 examples; here the summary sentences are non-contiguous subsequences of the original text sentences. We have cleaned up the data, and it is released, along with the code, on this website. (A sketch of these three conditions follows below.)
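As a minimal sketch (a hypothetical check of mine, assuming both sides are tokenized; the released code may differ), the three conditions can be tested like this:

```python
def is_extractive(summary_toks, source_toks):
    """Summary sentence equals an original sentence."""
    return summary_toks == source_toks

def is_contiguous_span(summary_toks, source_toks):
    """Summary tokens form one contiguous span of the source sentence."""
    n, m = len(summary_toks), len(source_toks)
    return any(source_toks[i:i + n] == summary_toks for i in range(m - n + 1))

def is_subsequence(summary_toks, source_toks):
    """Summary tokens appear in order in the source, possibly with gaps."""
    it = iter(source_toks)
    return all(tok in it for tok in summary_toks)

src = "the plan would be built on a 175-block area".split()
print(is_contiguous_span("built on a 175-block area".split(), src))  # True
print(is_subsequence("the plan would be on a area".split(), src))    # True
```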
Okay, so with this dataset we can now look at how EDU boundaries align with human content extraction. We are only interested in the near-extractive subsets, because there the humans actually needed to delete something. So on the one hand we have the article, and on the other hand we have the summary; what we can do is get the corresponding units, whether sentences or EDUs, and study the number of words that need to be deleted from or added to each unit to recover the summary. For example, here I'm showing a summary sentence with three EDUs, and below I'm showing the corresponding sentences from the document; we can see that some of the EDUs' content is deleted from the original text.
Here we show the average number of tokens that need to be deleted or added for each type of unit in order to recover the summary. We can see that on average twelve tokens need to be deleted from sentences, but for EDUs this average is less than two, and the number of added tokens for EDUs is also less than one. So EDUs involve much less token deletion and very little addition. (A sketch of how these counts can be computed is below.)
So what are the words that are deleted? Here I'm showing different part-of-speech categories, where the darker colors are the sentences. The takeaway is that for sentences, a lot of content words need to be deleted, and those deletions are difficult to get right.
Okay, so now we have seen that EDU boundaries do align with human content extraction. Now we can look at summarization itself: whether EDUs are superior to sentences. We do single-document summarization on the New York Times dataset, and we vary the length constraint from 100 to 300 characters. One hundred is about one standard deviation below the mean length of the shortest subset, near-extractive span, and 300 characters is one standard deviation above the mean length of the longest subset, extractive sentence.
The summarization framework we use is a supervised greedy summarizer: given n units, we want to select a subset such that the learned feature weights are maximized and the length constraint is satisfied. For inference we use a greedy procedure; for learning we use the structured perceptron. For the features, we want neutral features that are not biased towards the benefits or disadvantages of either type of unit, so we basically use things like the position of the unit, the position of the paragraph containing the unit, the cosine similarity between the unit and the document, and whether the unit is adjacent to something previously added to the summary. (A sketch of this setup follows below.)
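Here is a minimal sketch of that setup (hypothetical data structures of mine, illustrating the idea rather than the paper's exact system): greedy inference under a character budget, with structured-perceptron weight updates:

```python
import numpy as np

def greedy_select(units, feats, w, budget):
    """Greedy inference: add units in order of model score while the
    character budget allows.  units: strings; feats: one vector per unit."""
    chosen, length = [], 0
    for i in sorted(range(len(units)), key=lambda i: float(feats[i] @ w),
                    reverse=True):
        if feats[i] @ w > 0 and length + len(units[i]) <= budget:
            chosen.append(i)
            length += len(units[i])
    return sorted(chosen)

def perceptron_update(feats, w, predicted, gold, lr=1.0):
    """Structured perceptron: move the weights toward the gold extraction
    and away from the current prediction."""
    gold_vec = sum(feats[i] for i in gold)
    pred_vec = sum(feats[i] for i in predicted)
    return w + lr * (gold_vec - pred_vec)

# Training loop sketch: for each article, predict, then update on mistakes.
# w = np.zeros(num_features)
# for units, feats, gold in data:
#     pred = greedy_select(units, feats, w, budget=200)
#     if set(pred) != set(gold):
#         w = perceptron_update(feats, w, pred, gold)
```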
For evaluation we use ROUGE-1 and ROUGE-2. ROUGE is a recall-oriented metric that looks at the coverage of the summary content; ROUGE-1 here means unigrams, and ROUGE-2 means bigrams. (A sketch of the computation is below.)
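For reference, a minimal sketch of ROUGE-N recall (the official ROUGE toolkit adds stemming and other options omitted here):

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: fraction of reference n-grams covered by the
    candidate, with clipped counts.  Both inputs are token lists."""
    def ngrams(toks):
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    ref, cand = ngrams(reference), ngrams(candidate)
    overlap = sum(min(c, cand[g]) for g, c in ref.items())
    return overlap / max(sum(ref.values()), 1)

ref = "the rise of the sea is no longer theoretical".split()
cand = "scientists said the rise of the sea has begun".split()
print(round(rouge_n_recall(cand, ref, n=1), 2))  # unigram recall
```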
Okay, so before I show the varying-length results: if we think about single-document summarization, a strong baseline is just selecting the first k units such that the length constraint is satisfied; a sketch is below.
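A minimal sketch of that lead baseline (an illustration of mine):

```python
def lead_baseline(units, budget):
    """Take units in document order until the character budget is exhausted."""
    chosen, length = [], 0
    for i, u in enumerate(units):
        if length + len(u) > budget:
            break
        chosen.append(i)
        length += len(u)
    return chosen
```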
We want to compare with that, and here we show the results for each type of unit and each system. We see that the supervised summarizers outperform the baseline in all cases, and that EDUs outperform sentences in all cases. This is under a length constraint of 200 characters.
Now we are ready to look at the varying-budget results. Here I'm showing the results: for extractive sentence, in almost all cases EDUs outperform sentences; for near-extractive span, in all cases EDUs outperform sentences; and for near-extractive subsequence, the situation is similar to the extractive sentence setting. In particular, we see that when the length constraint is tighter, EDUs have a much bigger advantage over sentences.
Why are EDUs so good? Here's an example. The reference summary is: "The plan, which rivals the scope of Battery Park City, would be on a 175-block area of Greenpoint and Williamsburg." Here we can see that the sentence summarizer is not selecting the right sentence at all, but with EDUs, all of the content is selected. So it's not the case that the summarizer cannot find the right sentence; sometimes the rest of the sentence is just too long. Also, EDU boundaries really do correspond well with human-identified content boundaries. And finally, since EDUs are clauses, they have much better readability than units like n-grams.
Okay, so in conclusion: we first conducted a corpus analysis where we showed that EDUs correspond well with human-identified conceptual units, and that the importance of EDUs, from intra-document weights, correlates with inter-document concept weights. We also looked at near-extractive summarization, where we first introduced a large dataset of extractive and near-extractive summaries, which is released on this website. We showed that in this dataset EDU boundaries align with human content extraction, and finally, that EDUs are superior to sentences in near-extractive summarization under varying length constraints.
And that's all. Thanks for your attention; I welcome questions.
So, are you referring to the boundaries of the EDUs, or are you referring to... there's also the importance of the concepts, right?
I think it depends on how someone wants to express something; that importance itself may be different. But as we can see, the summaries are from different people, and we still observe this kind of correlation, which we found really interesting, but we need to look more into why this is the case.
But for EDUs, I think... we analyzed two corpora: one is different summaries from different people, and the second one is gold summaries from editors. We see good correspondence in each case, so I'm pretty confident that this is okay.
Right, we're not looking at coherence and grammaticality in this work, but it's part of our future plans.
So, for some reason we still find the summaries quite readable. For example, if we look at this one, for the EDUs it is built very reasonably, but I wouldn't say everything is perfectly grammatical. We do see different EDUs being attached, just because the summarizer wants to fill the length constraint, where it doesn't make sense; things like that do happen.
No, not at all. For the summarizer's features, we left out anything that would reveal the advantage or disadvantage of either type of unit. So we are only using things like position, cosine similarity, adjacency, and so on.
The weights? Yes, we didn't use the parser there; for the summarization task we only use the EDUs. But for the analysis part we did look at the weights for the EDUs, and we associated them with the weights for the concepts.
Right, there is existing work; that's what we used for the parsing.
Right, so the PDTB doesn't have two things that I think we really need for this task. The first one is full segmentation: PDTB arguments have a lot of freedom in where they are positioned, they do not form a segmentation of the text, and they are not necessarily contiguous. The second part is that in the PDTB there is nothing associated with salience, so if we want to consider weights, or salience, we cannot do that with the PDTB.