0:00:15 Everyone, so I will continue on the topic of RST, but we will focus instead on discourse units in the context of summarization. This is joint work with Kapil and Amanda from when I was interning at Yahoo.
0:00:34 For summarization, let's first look at an example, and I will read it: "As the global warming created by human emissions caused land ice to melt and ocean water to expand, scientists warned that the accelerating rise of the sea would eventually imperil the United States' coastline. Now these warnings are no longer theoretical: the inundation of the coast has begun. The sea has crept up to the point that a high tide and a brisk wind are all it takes to send water pouring into streets and homes," and so on.
0:01:03 Here I'm showing a real human summary, which says: "Scientists' warnings that the rise of the sea would eventually imperil the United States' coastline are no longer theoretical."
0:01:14 If we compare the two, we see that humans edit the document's sentences in order to capture the document's meaning, and they do so by trimming extraneous content, by combining sentences, by replacing phrases or clauses, and so on.
0:01:29 For machine summarization there are usually two broad classes of systems. One is extractive summarization, where the summarizer extracts full sentences from the original article. The second is abstractive summarization, where the system actually generates the text of the summary. If we look at the number of results returned by a search engine, we see that extractive techniques are very popular. Since they select sentences from the documents, the summaries are always grammatical, so the systems can focus on things like content selection and coherence.
0:02:10 Now, if we want an extractive summary that conveys everything the human was trying to convey in their summary, these two sentences would have to be selected, and we can see that the resulting summary is very long and nothing like what the human was trying to produce.
0:02:28 So in this paper we look at single-document summarization. We want to ask whether extractive summarization techniques can be used to produce more human-like summaries. In particular, we are interested in whether extracting sub-sentential units would help produce a wider range of summaries. By a wider range, what I mean is allowing summaries to be near-extractive, where the tokens are extracted from contiguous or non-contiguous spans of the original sentences. For sub-sentential units, we are particularly interested in elementary discourse units, or EDUs, and we want to see whether they are good summarization units.
0:03:08 Just for a quick recap, what are elementary discourse units? This is part of Rhetorical Structure Theory, or RST, where EDUs are defined as the segmentation of sentences into independent clauses. For example: "As your floppy drive writes or reads, a Syncom diskette is working four ways to keep loose particles and dust from causing soft errors, dropouts." Here the sentence is segmented into three EDUs. In a full discourse tree, the second and third EDUs are linked by a purpose relationship, and together they have a circumstance relationship with the first EDU. In the full discourse tree, the more important part of a relation is called the nucleus, and the less important part is called the satellite; this fact will be used later.
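To make the structure concrete, here is a minimal sketch (not the output of any particular parser) of how this example's three EDUs and its nucleus/satellite relations could be represented; the class names and fields are purely illustrative.

```python
# Hand-rolled representation of the floppy-drive example's EDU segmentation
# and its RST relations; illustrative only, not a real parser's data model.
from dataclasses import dataclass

@dataclass
class EDU:
    idx: int
    text: str

@dataclass
class Relation:
    label: str         # e.g. "purpose", "circumstance"
    nucleus: object    # the more important span (an EDU or another Relation)
    satellite: object  # the less important span

edus = [
    EDU(1, "As your floppy drive writes or reads,"),
    EDU(2, "a Syncom diskette is working four ways"),
    EDU(3, "to keep loose particles and dust from causing soft errors, dropouts."),
]

# EDUs 2 and 3 are linked by a purpose relation (2 is the nucleus);
# that pair is the nucleus of a circumstance relation with EDU 1 as satellite.
purpose = Relation("purpose", nucleus=edus[1], satellite=edus[2])
tree = Relation("circumstance", nucleus=purpose, satellite=edus[0])
print(tree.label, "->", tree.nucleus.label)   # circumstance -> purpose
```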
0:03:58 Here are the contributions of this paper. First, we do an analysis across automatically obtained EDUs and human-identified concepts: we show that EDUs correspond well with the conceptual units identified by humans, and we show that the importance of EDUs correlates with the importance of concepts. Next, we look at near-extractive summarization, where we first introduce a large dataset of extractive and near-extractive summaries, and then show that EDU boundaries align with human content extraction in this dataset. Furthermore, we show that EDUs are superior to sentences in near-extractive summarization under varying length constraints.
0:04:44 Okay, so I will start with the first contribution: how we look at EDUs and their correspondence with human-identified conceptual units. The idea is that on the one hand we have abstract units of information, and on the other hand we have sentences that contain these units, and we want to see whether elementary discourse units are a happy middle ground between the two. What we have is articles with human-identified and labeled conceptual units, and we can segment them automatically into EDUs, so we can get a correspondence between EDUs and concepts.
0:05:20 Then, using this correspondence, we can look at the lexical coverage of EDUs. The articles with human-labeled concepts that we use are the human summaries from DUC 2005 to 2007 and TAC 2008 to 2011. The concepts here are summary content unit contributors, where each summary content unit, or SCU, contains at most one contributor extracted from each summary. So what do I mean by contributors?
0:05:47 Say here is an original article, and humans come in and write summaries for this article. At this point we disregard the original article and consider the summaries as independent articles, except that they share the same topic. Now other humans come in and mark contributors in these summaries, and these are aggregated into summary content units with a weight, where the weight is determined by how many summaries contain a contributor with the same semantic content. So here a weight of four means that it comes from four summaries, and here a weight of two means that it comes from two summaries.
0:06:32 So what do they look like? For example: "The American Booksellers Association represents private bookstore owners and sponsors BookExpo, an annual convention." Here the first contributor is "the American Booksellers Association represents private bookstore owners", the second one is "the American Booksellers Association sponsors BookExpo", and the third one is "BookExpo, an annual convention". In all we have more than thirty-two thousand contributors, and about seventy-nine percent of them are contiguous spans in the text. From now on we will refer to these contributors as concepts.
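As a rough illustration of how the weights arise, here is a small sketch that counts, for each SCU, how many distinct summaries contribute to it. The identifiers and contributor texts below are made up for illustration; the actual grouping of contributors into SCUs is done by annotators.

```python
# Sketch: SCU weight = number of distinct summaries contributing to the SCU.
from collections import defaultdict

# (scu_id, summary_id, contributor_text) -- hypothetical annotation records
contributors = [
    ("scu_1", "summary_A", "ABA represents private bookstore owners"),
    ("scu_1", "summary_B", "the association represents independent bookstores"),
    ("scu_2", "summary_A", "ABA sponsors BookExpo"),
    ("scu_2", "summary_C", "the ABA sponsors BookExpo, an annual convention"),
]

summaries_per_scu = defaultdict(set)
for scu_id, summary_id, _text in contributors:
    summaries_per_scu[scu_id].add(summary_id)

scu_weight = {scu: len(s) for scu, s in summaries_per_scu.items()}
print(scu_weight)   # {'scu_1': 2, 'scu_2': 2}
```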
0:07:12 Now that we have human-labeled concepts from the summaries, how do we get the EDUs? We do full discourse parsing automatically using Feng and Hirst's tool. In the previous example, everything before the word "and" is the first EDU and everything afterwards is the second.
0:07:31 Now we can look at the number of overlapping EDUs per concept. In particular, this graph shows the number of EDUs that overlap by at least one token with each concept, and we see that it's usually one, sometimes two, and rarely more than three. On average, a concept overlaps with 1.56 EDUs, while a whole sentence contains 2.18 concepts. So we can see that sentences are much more coarse-grained than EDUs or concepts.
0:08:06 And if we want to represent concepts using EDUs, we would not like content in the concept that is not present in the EDU. So here we show the number of words that would need to be deleted from each concept for it to be covered by a single EDU. We see that in most cases EDUs are larger than concepts, and less than eight percent of the concepts have more than four words outside their corresponding EDU.
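A minimal sketch of the two coverage statistics just described, assuming concepts and EDUs are represented as token-offset spans over the same summary text; the offsets below are invented for illustration.

```python
# Sketch: EDUs overlapping a concept, and concept tokens outside the best EDU.

def overlapping_edus(concept, edus):
    """EDUs sharing at least one token with the concept span."""
    c_start, c_end = concept
    return [(s, e) for (s, e) in edus if s < c_end and e > c_start]

def words_to_delete(concept, edus):
    """Fewest concept tokens falling outside any single covering EDU."""
    c_start, c_end = concept
    best = c_end - c_start
    for s, e in edus:
        outside = max(0, s - c_start) + max(0, c_end - e)
        best = min(best, outside)
    return best

edus = [(0, 7), (7, 15), (15, 24)]   # three EDUs in one sentence
concept = (5, 12)                    # a concept straddling the first two EDUs
print(len(overlapping_edus(concept, edus)))  # 2 EDUs overlap the concept
print(words_to_delete(concept, edus))        # 2 tokens fall outside the best EDU
```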
0:08:37 So now we see that EDUs do correspond with human-identified conceptual labels. Next we can look at another angle, which is comparing the importance of EDUs with the importance of concept weights.
0:08:52 How do we do this? Remember that each concept is associated with a weight, which comes from how many summaries contain a concept with the same semantic content. So we have the weight of the concepts, and for each concept we have the overlapping EDUs. Now, if we can also get a weight for the EDUs, we have the full picture for comparison, and indeed we can. I will not elaborate on how to derive this, but the idea is to use the nucleus and satellite information; in this case the second EDU is the most important one. Now, in this table I show the average salience score for EDUs that overlap with concepts of different weights, and we can see that as the weight of a concept becomes larger, the weight of the EDU also goes higher. I want to stress that the weight for concepts comes from different documents, but the weight for an EDU comes from a single document, so intuitively the weight of an EDU can carry some notion of the importance of the concept in itself.
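The talk does not spell out the exact salience formula, so the following is only a rough sketch in the spirit of nucleus-based scoring: an EDU is treated as more salient the fewer satellite edges lie on the path from the root of the discourse tree to it. The tree encoding and scores are illustrative.

```python
# Sketch: salience from nucleus/satellite structure (fewer satellite edges
# on the root-to-EDU path => higher salience). Not the paper's exact formula.

def edu_salience(node, satellite_edges=0, scores=None):
    """node is an int (EDU index) or a dict {'nucleus': ..., 'satellite': ...}."""
    if scores is None:
        scores = {}
    if isinstance(node, int):
        scores[node] = -satellite_edges   # fewer satellite edges => higher score
        return scores
    edu_salience(node["nucleus"], satellite_edges, scores)
    edu_salience(node["satellite"], satellite_edges + 1, scores)
    return scores

# The floppy-drive example: (EDU 2 <-purpose- EDU 3) is the nucleus of a
# circumstance relation whose satellite is EDU 1.
tree = {"nucleus": {"nucleus": 2, "satellite": 3}, "satellite": 1}
print(edu_salience(tree))   # {2: 0, 3: -1, 1: -1} -> EDU 2 is the most salient
```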
0:10:04 Okay, so now we see that intra-document EDU weights correlate with inter-document concept weights. Next we can investigate near-extractive summarization, and I will first talk about the dataset. The data we use is gathered from the LDC release of the New York Times Annotated Corpus. In particular, it contains about two hundred forty-five thousand online lead paragraphs from 2001 to 2007; these are the paragraphs underneath the headlines on the New York Times homepage. The very first example at the beginning of the talk is actually one of these lead paragraphs.
0:10:46 In this dataset we have identified three subsets of extractive and near-extractive summaries. The first one is sentence-extractive; it contains more than thirty-eight thousand examples where the summary sentences are extracted verbatim from the original text sentences. The second one is near-extractive span; it contains more than fifteen thousand examples where the summary sentences are contiguous spans of the original text sentences. And the third one is near-extractive subsequence, which contains more than twenty-five thousand examples where the summary sentences are non-contiguous spans of the original text sentences. We have cleaned up the data, and it is released along with the code on this website.
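As a rough illustration of what distinguishes the three subsets, here is a simplified sketch that buckets a summary sentence against a source sentence by token matching; the actual dataset construction used more careful alignment, and the example strings below are paraphrased.

```python
# Sketch: classify a summary sentence as extractive, near-extractive span,
# or near-extractive subsequence relative to a source sentence.

def is_subsequence(short, long):
    it = iter(long)
    return all(tok in it for tok in short)

def is_contiguous_span(short, long):
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))

def bucket(summary_sent, source_sent):
    s, d = summary_sent.split(), source_sent.split()
    if s == d:
        return "extractive"
    if is_contiguous_span(s, d):
        return "near-extractive span"
    if is_subsequence(s, d):
        return "near-extractive subsequence"
    return "other"

src = "the plan would be built on a 175-block area of Greenpoint and Williamsburg"
print(bucket("built on a 175-block area", src))            # near-extractive span
print(bucket("the plan built on a 175-block area", src))   # near-extractive subsequence
```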
0:11:33 Okay, so with this dataset we can now look at how EDU boundaries align with human content extraction. We are only interested in the near-extractive subsets, because there the human actually needs to delete something. On the one hand we have the article, on the other hand we have the summary; we can get the corresponding units, whether sentences or EDUs, and study the number of words that need to be deleted from or added to each unit in order to recover the summary. For example, here I'm showing a summary sentence with three EDUs, and below I'm showing the corresponding sentences from the document; we can see that some of the content, or some of the EDUs, is deleted from the original text.
0:12:21 Here we show the average number of tokens that need to be deleted or added for each type of unit in order to recover the summary. We can see that on average twelve tokens need to be deleted from sentences, but for EDUs this average number is less than two, and the number of added tokens for EDUs is also less than one. So EDUs involve much less token deletion and very little addition.
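A minimal sketch of how such deletion and addition counts could be computed for a pair of aligned units, using longest-matching-block alignment from Python's difflib; the token sequences below are illustrative, not taken from the dataset.

```python
# Sketch: tokens deleted from a document unit and added to reach the summary unit.
from difflib import SequenceMatcher

def edit_counts(doc_tokens, summary_tokens):
    matcher = SequenceMatcher(a=doc_tokens, b=summary_tokens, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    deleted = len(doc_tokens) - matched
    added = len(summary_tokens) - matched
    return deleted, added

doc = "the plan , which rivals the scope of Battery Park City , would be built".split()
summ = "the plan would be built".split()
print(edit_counts(doc, summ))   # (10, 0): ten tokens deleted, none added
```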
0:12:49 So what are the words that are deleted? Here I'm showing different part-of-speech categories, where the darker colours are the sentences. The takeaway is that for sentences a lot of the content words need to be deleted, and these are quite difficult to handle.
0:13:06 Okay, so now we see that EDU boundaries do align with human content extraction. Next we can look at summarization itself: whether EDUs are superior to sentences. We do single-document summarization on the New York Times dataset, and we vary our length constraint from one hundred to three hundred characters. One hundred is about one standard deviation below the average length of the shortest subset, near-extractive span, and three hundred characters is about one standard deviation above the average length of the longest subset, the extractive-sentence data.
0:13:43 The summarization framework we use is a supervised greedy summarizer: given n units, we want to select a subset such that the feature-weighted score is maximized and the length constraint is satisfied. For inference we do greedy selection, and for learning we use the structured perceptron. For the features, we want neutral features that are not biased towards the benefits or disadvantages of either type of unit, so we basically use things like the position of the unit, the position of the paragraph containing the unit, the cosine similarity between the unit and the document, whether the unit is adjacent to something previously added to the summary, and so on.
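A minimal sketch of the greedy inference step under a character budget, assuming feature vectors and perceptron-learned weights are already given; the unit texts, feature values, and weights below are invented for illustration.

```python
# Sketch: greedily add the highest-scoring units that still fit the budget.
import numpy as np

def greedy_select(units, features, weights, budget):
    """units: list of strings; features: one vector per unit; budget: max characters."""
    scores = features @ weights
    selected, length = [], 0
    for i in np.argsort(-scores):
        if length + len(units[i]) <= budget:
            selected.append(int(i))
            length += len(units[i])
    return sorted(selected)   # keep document order in the output summary

units = ["Scientists warned that the sea would imperil the coastline.",
         "Now these warnings are no longer theoretical.",
         "The article continues with further background."]
features = np.array([[1.0, 0.9], [0.5, 0.8], [0.1, 0.2]])  # e.g. position, similarity
weights = np.array([0.6, 0.4])      # in the paper's setup, learned by structured perceptron
print(greedy_select(units, features, weights, budget=120))   # [0, 1]
```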
0:14:31 For evaluation we use ROUGE-1 and ROUGE-2. ROUGE is a recall-oriented metric that looks at the coverage of the summary content; ROUGE-1 means unigrams and ROUGE-2 means bigrams.
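For reference, here is a small sketch of ROUGE-N recall in its simplest form (no stemming or stopword handling), i.e. the recall-oriented n-gram coverage just described.

```python
# Sketch: fraction of reference n-grams that also appear in the system summary.
from collections import Counter

def rouge_n_recall(system, reference, n=1):
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    sys_ngrams = ngrams(system.lower().split(), n)
    ref_ngrams = ngrams(reference.lower().split(), n)
    overlap = sum(min(count, sys_ngrams[g]) for g, count in ref_ngrams.items())
    return overlap / max(1, sum(ref_ngrams.values()))

ref = "the warnings are no longer theoretical"
sys = "scientists say the warnings are not theoretical"
print(rouge_n_recall(sys, ref, n=1))   # 4 of 6 reference unigrams covered
print(rouge_n_recall(sys, ref, n=2))   # 2 of 5 reference bigrams covered
```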
0:14:45 Okay, before I show the varying-length results: for single-document summarization, a strong baseline is simply selecting the first k units such that the length constraint is satisfied. We want to compare with that, and here we show the results for each type of unit and each system. We see that the supervised summarizers outperform the baseline in all cases, and that EDUs outperform sentences in all cases; this is under a length constraint of two hundred characters.
0:15:23 Now we are ready to look at the varying-budget results. Here I'm showing the results for the extractive-sentence subset: in almost all cases EDUs outperform sentences. For near-extractive span, EDUs outperform sentences in all cases, and for near-extractive subsequence the situation is similar to the extractive-sentence case. In particular, we see that when the length constraint is tighter, EDUs have a much bigger advantage over sentences.
0:15:55 So why are EDUs still good? Here's an example. The reference summary is: "The plan, which rivals the scope of Battery Park City, would be spread over a 175-block area of Greenpoint and Williamsburg." Here we can see that the sentence-based summarizer is not selecting the right sentence at all, but with EDUs all of the content is selected. So it's not that the summarizer cannot find the right sentence; sometimes the rest of the sentence is simply too long. Also, EDU boundaries correspond well with human-identified content boundaries, and finally, since EDUs are clauses, they have much better readability than units like n-grams.
0:16:40 Okay, so in conclusion: we first conduct a corpus analysis where we show that EDUs correspond well with human-identified conceptual units, and that the importance of EDUs, derived from intra-document weights, correlates with the inter-document concept weights. We also look at near-extractive summarization, where we introduce a large dataset of extractive and near-extractive summaries, which is released on this website. We show that in this dataset EDU boundaries align with human content extraction, and finally that EDUs are superior to sentences in near-extractive summarization under varying length constraints. That's all, thanks for your attention, and I welcome questions.
0:18:11 Are you referring to the boundaries of the EDUs, or to the importance of the concepts? I think it depends on how someone wants to express something; the importance itself may differ. But as we can see, the summaries are from different people, and we still observe this kind of correlation, which we found really interesting, though we need to look more into why this is the case. As for the EDUs, we analyzed two corpora: one has different summaries from different people, and the second has gold summaries from editors. We see good correspondence in each case, so I'm fairly confident about this.
0:19:15 Right, we're not looking at coherence and grammaticality in this work, but it's part of our future plans. For some reason we still get quite readable summaries. For example, if we look at this one for the EDUs, it reads very reasonably, but I wouldn't say everything is perfectly grammatical. Sometimes we will see different EDUs strung together just because the summarizer wants to fill the length budget, even when it doesn't make sense; things like that do happen.
0:20:16 No, not at all. For the summarizer's features we left out anything that would show the advantage or disadvantage of either type of unit, so we only use things like position, cosine similarity, adjacency, and so on.
0:20:44 The weights? Yes, for the summarization task we didn't use the parser; we only use the EDUs. But for the analysis part we did look at the weights of the EDUs, and we associated them with the weights of the concepts.
0:21:07 Right, there is existing work; that's what we used for the parsing.
0:21:25 Right, so the PDTB doesn't have two things that I think we really need for this task. The first is a full segmentation: PDTB arguments allow a lot of freedom in where they are positioned, they do not form a segmentation, and they are not necessarily contiguous. The second is that the PDTB has nothing associated with salience, so if we want to consider weights or salience, we cannot do that with the PDTB.