0:00:15 | everyone so i will continue to talk |
---|
0:00:19 | about the topic of rst but we will focus instead on discourse units in |
---|
0:00:25 | the context of summarisation |
---|
0:00:27 | this is joint work with kapil and amanda from when i was interning |
---|
0:00:31 | at yahoo |
---|
0:00:34 | for summarisation let's first look at an example and i will read it |
---|
0:00:39 | for decades as the global warming created by human emissions caused land ice to |
---|
0:00:43 | melt and ocean water to expand |
---|
0:00:45 | scientists warned that the accelerating rise of the sea would eventually imperil the united states |
---|
0:00:50 | coastline |
---|
0:00:51 | now these warnings are no longer theoretical the inundation of the coast has begun the |
---|
0:00:55 | sea has crept up to the point that a high tide and a brisk wind are all |
---|
0:00:59 | it takes to send water pouring into streets and homes and so on |
---|
0:01:03 | so here i'm showing a real human summary that says |
---|
0:01:07 | scientists' warnings that the rise of the sea would eventually imperil the united states |
---|
0:01:11 | coastline are no longer theoretical |
---|
0:01:14 | so if we compare the two we see that humans edit the document's sentences in |
---|
0:01:17 | order to capture the document's meaning |
---|
0:01:20 | and they do so by trimming extraneous content |
---|
0:01:23 | by combining sentences |
---|
0:01:25 | by replacing phrases or clauses |
---|
0:01:27 | and so on |
---|
0:01:29 | so for machine summarisation usually there are two big classes of systems one is |
---|
0:01:35 | extractive summarization where the summarizer extracts whole sentences from the original |
---|
0:01:41 | article |
---|
0:01:42 | the second one is abstractive summarisation where the system actually generates the text |
---|
0:01:47 | for the summary |
---|
0:01:49 | and if we look at the number of results returned by a search engine we |
---|
0:01:53 | see that actually the extractive techniques are very popular |
---|
0:01:57 | and since they select sentences from the documents the summaries are |
---|
0:02:02 | always grammatical |
---|
0:02:04 | so the systems can focus on things like content selection and coherence |
---|
0:02:10 | now |
---|
0:02:11 | if we want to have an extractive summary that conveys everything that the human |
---|
0:02:16 | was trying to convey in their summary |
---|
0:02:19 | then these two sentences would be selected |
---|
0:02:21 | and we can see that the summary here is very long and it's |
---|
0:02:24 | nothing like what the human was trying to produce |
---|
0:02:28 | so in this paper we look at single document summarization |
---|
0:02:31 | we want to ask the question whether extractive summarization techniques can be used to produce |
---|
0:02:36 | more human-like summaries |
---|
0:02:38 | in particular we are interested in whether extracting sub-sentential units would help to produce |
---|
0:02:43 | a wider range of summaries |
---|
0:02:45 | by a wider range what i mean is for summaries to be near-extractive |
---|
0:02:50 | where the tokens are extracted from contiguous and non-contiguous spans |
---|
0:02:54 | from the original sentences |
---|
0:02:57 | and for sub-sentential units we are particularly interested in elementary discourse units or |
---|
0:03:03 | edus and we want to see whether they are good summarisation units |
---|
0:03:08 | so just as a quick recap what are elementary discourse units |
---|
0:03:13 | this is part of rhetorical structure theory or rst where edus are |
---|
0:03:18 | defined as the segmentation of sentences |
---|
0:03:20 | into independent clauses |
---|
0:03:23 | so for example as the floppy drive writes or reads |
---|
0:03:26 | the disk is working four ways to keep loose particles and dust |
---|
0:03:30 | from causing soft errors and dropouts |
---|
0:03:33 | so here the sentence is segmented into three edus |
---|
0:03:37 | in a full discourse tree the second and third edus have a purpose relationship |
---|
0:03:41 | and |
---|
0:03:42 | they also have a circumstance relationship with the first edu |
---|
0:03:47 | in the full discourse tree the more important part of a relation is called the |
---|
0:03:51 | nucleus |
---|
0:03:52 | and the less important part is called the satellite and this fact will be used |
---|
0:03:56 | later |
---|
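To make the recap concrete, here is a minimal sketch in Python (with invented variable names; this is not code from the paper) of how the three edus and the two relations in this example could be represented, using the nucleus/satellite convention just described.

```python
# hypothetical encoding of the example rst fragment above
edus = [
    "as the floppy drive writes or reads,",   # edu 1
    "the disk is working four ways",          # edu 2
    "to keep loose particles and dust from causing soft errors and dropouts.",  # edu 3
]

# (relation, nucleus edu, satellite edu), using 1-based edu numbers
relations = [
    ("purpose", 2, 3),       # edu 2 is the nucleus, edu 3 the satellite
    ("circumstance", 2, 1),  # the edu 2-3 span is the nucleus, edu 1 the satellite
]

for rel, nuc, sat in relations:
    print(f"{rel}: nucleus = edu {nuc}, satellite = edu {sat}")
```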
0:03:58 | here are the contributions of this paper |
---|
0:04:00 | we first of all do an analysis of |
---|
0:04:03 | automatically obtained edus against human-identified concepts |
---|
0:04:08 | we show that edus correspond with these conceptual units identified by humans |
---|
0:04:13 | and second we show that the importance of edus |
---|
0:04:17 | correlates with the importance of concepts |
---|
0:04:20 | next we look at the context of near extractive summarization where we first introduce a |
---|
0:04:25 | large dataset of extractive and near-extractive summaries |
---|
0:04:29 | and then we show that edu boundaries align with human content extraction in this dataset |
---|
0:04:34 | and furthermore we show that edus are superior to sentences in near-extractive |
---|
0:04:40 | summarisation |
---|
0:04:41 | under varying length constraints |
---|
0:04:44 | okay so i will start with the first contribution where we look at edus |
---|
0:04:48 | and their correspondence with human-identified conceptual units |
---|
0:04:53 | the idea is on the one hand we have abstract units of information on the other |
---|
0:04:58 | hand we have sentences that contain these units |
---|
0:05:01 | and what we want to see is whether elementary discourse units are a happy middle ground between |
---|
0:05:04 | the two |
---|
0:05:06 | so what we have is articles with human-identified and labeled conceptual units |
---|
0:05:13 | and we can segment them automatically into edus so we can get a correspondence between |
---|
0:05:17 | edus and concepts |
---|
0:05:20 | and then using this correspondence we can look at the lexical coverage for edus |
---|
0:05:25 | the articles with human-labeled concepts we use are the human |
---|
0:05:29 | summaries from duc two thousand five to two thousand seven and tac two thousand eight |
---|
0:05:33 | to two thousand eleven |
---|
0:05:34 | the concepts here are summary content unit contributors and here each summary content |
---|
0:05:40 | unit or scu contains at least one contributor extracted from each summary |
---|
0:05:45 | so what do i mean by contributors |
---|
0:05:47 | so say here is an original article |
---|
0:05:50 | and humans come in and write summaries for this article and |
---|
0:05:54 | at this point we will |
---|
0:05:55 | disregard the original article and consider the summaries as independent |
---|
0:06:01 | articles |
---|
0:06:02 | except that they have the same topic |
---|
0:06:05 | now other humans come in and they mark contributors |
---|
0:06:10 | from these summaries |
---|
0:06:11 | and they're aggregated into summary content units with a weight where the weight is |
---|
0:06:15 | determined by |
---|
0:06:18 | how many summaries contain |
---|
0:06:21 | a contributor with the same semantic content |
---|
0:06:24 | so here the weight of four means that it comes from four summaries |
---|
0:06:28 | and here a weight of two means that it comes from two summaries |
---|
0:06:32 | so what do they look like |
---|
0:06:35 | so for example the american booksellers association represents private bookstore owners and sponsors |
---|
0:06:40 | book expo an annual convention |
---|
0:06:43 | here the first contributor is the american booksellers association represents private bookstore owners |
---|
0:06:49 | the second one is american booksellers association sponsors book expo |
---|
0:06:53 | and the third one is book expo an annual convention |
---|
0:06:58 | so in all we have more than thirty two thousand contributors and about seventy nine |
---|
0:07:03 | percent of them are contiguous spans in the text |
---|
0:07:06 | and from now on we will refer to these contributors as concepts |
---|
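As a rough illustration of the weighting just described, here is a small Python sketch; the contributor annotations, labels, and summary ids are invented for illustration and this is not the actual pyramid annotation tooling.

```python
from collections import defaultdict

# hypothetical contributor annotations: (scu label, id of the summary it was marked in)
contributors = [
    ("aba represents private bookstore owners", "summary_1"),
    ("aba represents private bookstore owners", "summary_3"),
    ("aba sponsors book expo", "summary_1"),
    ("aba sponsors book expo", "summary_2"),
    ("aba sponsors book expo", "summary_3"),
    ("aba sponsors book expo", "summary_4"),
]

scu_summaries = defaultdict(set)
for scu_label, summary_id in contributors:
    scu_summaries[scu_label].add(summary_id)

# an scu's weight is the number of distinct summaries that contain
# a contributor with the same semantic content
scu_weight = {label: len(ids) for label, ids in scu_summaries.items()}
print(scu_weight)  # weights of 2 and 4 for the two example scus
```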
0:07:12 | so now we have human-labelled concepts from the summaries how do we get the |
---|
0:07:17 | edus to do so we do full discourse parsing automatically using feng and hirst's |
---|
0:07:22 | tool |
---|
0:07:24 | so in the previous example everything before the word and is the first edu and |
---|
0:07:29 | everything afterwards is the second |
---|
0:07:31 | so now we can look at the number of overlapping edus per concept in particular this |
---|
0:07:36 | graph shows the number of edus that overlap with at least one token |
---|
0:07:41 | with each concept |
---|
0:07:43 | and we see that it's usually one sometimes two and rarely more than |
---|
0:07:47 | three |
---|
0:07:48 | so on average |
---|
0:07:51 | concepts overlap with one point five six edus |
---|
0:07:55 | and |
---|
0:07:56 | the number of concepts per whole sentence is two point one eight |
---|
0:08:00 | so we can see that sentences are much more coarse than edus |
---|
0:08:04 | or concepts |
---|
0:08:06 | and if we want to represent concepts using edus we would not like |
---|
0:08:11 | extraneous content in the concept that's not present in the edu so |
---|
0:08:15 | here we show the number of words that need to be deleted from each concept |
---|
0:08:19 | to be covered by a single edu |
---|
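A sketch of the two coverage statistics used in this analysis, assuming concepts and edus are already available as (start, end) token offsets; the span format and function names are illustrative rather than the paper's actual code.

```python
def overlapping_edus(concept, edus):
    """indices of edus that share at least one token with the concept span."""
    c_start, c_end = concept
    return [i for i, (s, e) in enumerate(edus) if s < c_end and e > c_start]

def tokens_outside_best_edu(concept, edus):
    """tokens that must be deleted from the concept for it to be covered
    by the single edu that overlaps it the most."""
    c_start, c_end = concept
    best = max((max(0, min(c_end, e) - max(c_start, s)) for s, e in edus), default=0)
    return (c_end - c_start) - best

edu_spans = [(0, 6), (6, 14), (14, 20)]                   # three edus as token offsets
concept_span = (4, 14)                                    # a 10-token concept
print(overlapping_edus(concept_span, edu_spans))          # [0, 1]
print(tokens_outside_best_edu(concept_span, edu_spans))   # 2
```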
0:08:21 | and here |
---|
0:08:23 | we see that |
---|
0:08:26 | in most cases edus are larger than concepts |
---|
0:08:29 | and less than eight percent of the concepts are observed to have more than |
---|
0:08:32 | four words outside their corresponding edu |
---|
0:08:37 | so now we see that edus do correspond with human-identified conceptual labels |
---|
0:08:42 | so now we can look at |
---|
0:08:45 | this from another angle which is whether the importance of edus correlates with the importance |
---|
0:08:49 | of concept weights |
---|
0:08:52 | so how do we do this so remember that each concept is associated with the |
---|
0:08:56 | weight |
---|
0:09:00 | that is in how many summaries a |
---|
0:09:04 | concept with the same semantic content is present |
---|
0:09:04 | so we have the weight of concepts and we have for each concept the overlapping |
---|
0:09:08 | edus |
---|
0:09:09 | so now if we can get the weights of edus we have the full picture for |
---|
0:09:13 | comparison |
---|
0:09:14 | and indeed we can |
---|
0:09:16 | i will not elaborate on how to derive this |
---|
0:09:19 | but the idea is to use the nucleus and satellite information and in this case |
---|
0:09:26 | the second edu is the most important one |
---|
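The derivation is skipped in the talk; one common way to turn nucleus/satellite structure into an edu score (roughly in the spirit of promotion-based salience, and possibly different in detail from what the paper actually does) is sketched below: an edu is more salient the closer to the root it survives as a nucleus.

```python
def satellite_depth(node, depth=0, out=None):
    """how many times an edu is demoted as a satellite on its path from the root."""
    if out is None:
        out = {}
    if isinstance(node, int):              # a leaf is just an edu number
        out[node] = min(out.get(node, depth), depth)
        return out
    for child, nuclearity in node["children"]:
        satellite_depth(child, depth if nuclearity == "N" else depth + 1, out)
    return out

# the floppy-drive example: edu 1 is a satellite of the (edu 2, edu 3) span,
# and edu 3 is a satellite of edu 2
tree = {"children": [(1, "S"),
                     ({"children": [(2, "N"), (3, "S")]}, "N")]}
depths = satellite_depth(tree)
salience = {edu: max(depths.values()) - d for edu, d in depths.items()}
print(salience)  # {1: 0, 2: 1, 3: 0} -> edu 2 gets the highest salience
```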
0:09:30 | now in this table i show the average salience score for edus |
---|
0:09:35 | that overlap with concepts with different weights and we can see that as the weight |
---|
0:09:40 | of a concept becomes larger the weight of the edu also goes higher |
---|
0:09:45 | and |
---|
0:09:46 | i want to stress that the weight for concepts is from different documents |
---|
0:09:50 | but the weight for an edu is from a single document so that intuitively |
---|
0:09:55 | the weight of the edu can have some notion of the importance of the |
---|
0:10:01 | concept in itself |
---|
0:10:04 | okay so now we see that intra-document edu weights correlate with inter-document |
---|
0:10:08 | concept weights next we can investigate near-extractive summarization and i will first |
---|
0:10:14 | talk about the dataset |
---|
0:10:16 | the data we use is from the ldc release of the new york times annotated |
---|
0:10:22 | dataset |
---|
0:10:23 | in particular it contains about two hundred forty five thousand online lead paragraphs |
---|
0:10:29 | from two thousand one to two thousand seven so these are the paragraphs under the |
---|
0:10:34 | headlines |
---|
0:10:35 | on the new york times homepage |
---|
0:10:37 | and indeed the first example there in the beginning |
---|
0:10:42 | is one of these online lead paragraphs |
---|
0:10:46 | so in particular in this dataset we have identified three subsets of extractive and |
---|
0:10:52 | near-extractive summaries |
---|
0:10:54 | so the first one is |
---|
0:10:55 | sentence-extractive which contains more than thirty eight thousand examples |
---|
0:11:00 | where the summary sentences are extracted from the original text sentences |
---|
0:11:05 | the second one is near-extractive span |
---|
0:11:07 | it contains more than fifteen thousand examples |
---|
0:11:10 | where the summary sentences are from contiguous spans from the original text sentences |
---|
0:11:15 | and the third one is near-extractive subsequence |
---|
0:11:18 | which contains more than twenty five thousand examples where the summary sentences are from non-contiguous spans |
---|
0:11:24 | from the original text sentences |
---|
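The three subsets can be characterised by a simple matching test between a summary sentence and an article sentence; the sketch below assumes both are already tokenised, and the helper names and labels are just for illustration.

```python
def is_contiguous_span(short, long):
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))

def is_subsequence(short, long):
    it = iter(long)
    return all(tok in it for tok in short)   # tokens appear in order, gaps allowed

def match_type(summary_sent, article_sent):
    if summary_sent == article_sent:
        return "extractive"                   # sentence copied verbatim
    if is_contiguous_span(summary_sent, article_sent):
        return "near-extractive span"         # one contiguous span
    if is_subsequence(summary_sent, article_sent):
        return "near-extractive subsequence"  # non-contiguous spans, in order
    return "other"

print(match_type("the plan would be built".split(),
                 "the plan , which rivals battery park city , would be built".split()))
# -> near-extractive subsequence
```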
0:11:26 | and we have cleaned up the data and it's released with the code |
---|
0:11:30 | on this website |
---|
0:11:33 | okay so with this dataset now we can look at how edu boundaries |
---|
0:11:37 | align with human content extraction and we are only interested in the near-extractive |
---|
0:11:44 | datasets because there the humans actually need to delete something |
---|
0:11:48 | so we have on the one hand the article and on the other hand we |
---|
0:11:51 | have the summary what we can do is we can get the corresponding units whether |
---|
0:11:55 | sentences or edus and we can study the number of |
---|
0:11:58 | words that need to be deleted or added from each unit to recover the summary |
---|
0:12:04 | for example here i'm showing a summary sentence |
---|
0:12:08 | with three edus |
---|
0:12:09 | and below i'm showing the corresponding sentences |
---|
0:12:12 | from the document and we can see that some of the content of the edus is |
---|
0:12:16 | deleted |
---|
0:12:18 | from the original text |
---|
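Counting the deleted and added tokens amounts to a token-level alignment between the document unit and the summary text; a bare-bones sketch using Python's difflib (the paper's actual alignment procedure may be more careful):

```python
from difflib import SequenceMatcher

def deleted_and_added(unit_tokens, summary_tokens):
    """tokens to delete from the unit, and tokens to add, to recover the summary text."""
    sm = SequenceMatcher(a=unit_tokens, b=summary_tokens, autojunk=False)
    deleted = added = 0
    for op, a1, a2, b1, b2 in sm.get_opcodes():
        if op in ("delete", "replace"):
            deleted += a2 - a1
        if op in ("insert", "replace"):
            added += b2 - b1
    return deleted, added

doc_sent = "the city plan , officials said on monday , would be built in queens".split()
summ = "the city plan would be built in queens".split()
print(deleted_and_added(doc_sent, summ))  # (6, 0): six tokens deleted, none added
```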
0:12:21 | so here we show the average number of tokens |
---|
0:12:24 | that need to be deleted or added for each type of units in order to |
---|
0:12:29 | recover the summary |
---|
0:12:30 | and we can see that on average twelve tokens need to be deleted from sentences |
---|
0:12:35 | but for edus this average number is less than two |
---|
0:12:39 | and the number of added tokens for edus is also less than one |
---|
0:12:44 | so we see that edus do involve much less token deletion and very little |
---|
0:12:47 | addition |
---|
0:12:49 | so what are the words that are deleted so here i'm showing different part-of-speech categories |
---|
0:12:54 | and the |
---|
0:12:56 | darker colours are the sentences so the takeaway here is that for sentences a |
---|
0:13:01 | lot of the content words need to be deleted and these are kind of difficult |
---|
0:13:04 | to solve |
---|
0:13:06 | okay so now we see that edu boundaries do align with human content extraction |
---|
0:13:11 | now we can look at summarisation and whether edus are superior to sentences |
---|
0:13:17 | so we do single-document summarization on the new york times dataset and we vary our |
---|
0:13:23 | length constraint from a hundred to three hundred characters so a hundred here is |
---|
0:13:27 | about one standard deviation below the |
---|
0:13:30 | average length of the shortest subset, the near-extractive spans |
---|
0:13:33 | and three hundred character is |
---|
0:13:35 | one standard deviation |
---|
0:13:37 | above the average length of the longest subset, the extractive sentence |
---|
0:13:41 | dataset |
---|
0:13:43 | the summarization framework that we use is a supervised greedy summarizer |
---|
0:13:47 | where we have n units |
---|
0:13:49 | we want to select a subset |
---|
0:13:52 | where the feature weights are maximized |
---|
0:13:55 | and the length constraint is satisfied |
---|
0:13:58 | and for inference we do greedy search |
---|
0:14:00 | for learning we do structured perceptron |
---|
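A sketch of the selection step, assuming a feature function and a learned weight vector (both hypothetical here); the structured-perceptron loop that learns the weights is a separate piece and is not shown. A sketch of what the features might look like follows the feature list below.

```python
def greedy_summarize(units, feature_fn, weights, budget):
    """repeatedly add the highest-scoring unit that still fits the character budget."""
    selected, length = [], 0
    remaining = list(units)
    while remaining:
        def score(u):
            return sum(weights.get(f, 0.0) * v for f, v in feature_fn(u, selected).items())
        for unit in sorted(remaining, key=score, reverse=True):
            if length + len(unit) <= budget:
                selected.append(unit)
                length += len(unit)
                remaining.remove(unit)
                break
        else:
            break  # nothing left fits within the budget
    return selected

# toy usage with two invented features
units = ["the inundation of the coast has begun.", "officials met on monday.", "seas are rising."]
feats = lambda u, sel: {"position": -units.index(u), "length": len(u) / 100.0}
print(greedy_summarize(units, feats, {"position": 1.0, "length": 0.1}, budget=60))
```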
0:14:04 | for the features |
---|
0:14:05 | we want to use neutral features that are not biased towards the benefits or disadvantages |
---|
0:14:12 | for each type of unit |
---|
0:14:14 | so we basically use |
---|
0:14:17 | things like position of the unit position of the paragraph containing the unit |
---|
0:14:21 | cosine weighted similarity between the document and the unit |
---|
0:14:25 | and whether the unit is adjacent to something that's previously added to the summary so far |
---|
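A sketch of what such a unit-neutral feature function could look like; the names and exact definitions are illustrative, not the paper's feature set.

```python
import math
from collections import Counter

def cosine(tokens_a, tokens_b):
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def unit_features(unit_text, unit_index, paragraph_index, doc_tokens, selected_indices):
    return {
        "unit_position": 1.0 / (1 + unit_index),          # earlier units get larger values
        "paragraph_position": 1.0 / (1 + paragraph_index),
        "cosine_with_document": cosine(unit_text.lower().split(), doc_tokens),
        "adjacent_to_selected": float(any(abs(unit_index - i) == 1 for i in selected_indices)),
    }
```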
0:14:31 | so for evaluation we use rouge one and two |
---|
0:14:35 | so rouge is a recall-oriented metric that looks at the coverage of the summary |
---|
0:14:40 | content |
---|
0:14:41 | and rouge one here means unigram and rouge two means bigram |
---|
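For reference, rouge-n recall is essentially the fraction of the reference summary's n-grams that also appear in the system summary; a bare-bones sketch (the official rouge script adds stemming, stopword options, and multiple references):

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n):
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum(min(count, cand[g]) for g, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

ref = "the inundation of the coast has begun".split()
cand = "the inundation has begun".split()
print(rouge_n_recall(cand, ref, 1))  # 4 of 7 reference unigrams covered, ~0.571
print(rouge_n_recall(cand, ref, 2))  # 2 of 6 reference bigrams covered, ~0.333
```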
0:14:45 | okay so before i show the varying length results |
---|
0:14:51 | if we think about single-document summarization a strong baseline is just selecting |
---|
0:14:55 | the first k units such that the length constraint is satisfied |
---|
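That lead baseline amounts to taking units in document order until the budget is exhausted; a tiny sketch:

```python
def lead_baseline(units, budget):
    """take units from the top of the article until the character budget would be exceeded."""
    selected, length = [], 0
    for unit in units:
        if length + len(unit) > budget:
            break
        selected.append(unit)
        length += len(unit)
    return selected
```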
0:15:02 | so |
---|
0:15:03 | we want to compare with that and here we show the results for each type |
---|
0:15:07 | of unit |
---|
0:15:08 | for each system and we see that the supervised summarizers outperform |
---|
0:15:13 | the baseline in all cases and then |
---|
0:15:16 | edus outperform sentences in all cases |
---|
0:15:19 | and this is under a length constraint of two hundred characters |
---|
0:15:23 | now we are ready to look at varying budget results |
---|
0:15:26 | so here i'm showing the results for extractive sentence in almost all cases edus improve |
---|
0:15:32 | on sentences |
---|
0:15:34 | for near-extractive span in all cases edus outperform sentences and for near-extractive |
---|
0:15:40 | subsequence the situation is similar to the extractive sentence situation |
---|
0:15:46 | and in particular we see that when the length constraint is tighter edus have |
---|
0:15:51 | a much bigger advantage over sentences |
---|
0:15:55 | so why are edus still good here's an example |
---|
0:15:59 | the reference summary is the plan which rivals the scope of battery park city would |
---|
0:16:03 | be spread over a one seventy five block area of greenpoint |
---|
0:16:05 | and williamsburg |
---|
0:16:07 | so here we can see that the summarizer is not selecting the right sentence at |
---|
0:16:10 | all |
---|
0:16:12 | but for edus all of the content is selected so |
---|
0:16:15 | we see that it's not the case that the summarizer cannot find the right sentence |
---|
0:16:18 | it's sometimes just that the rest of the sentence is too long |
---|
0:16:25 | and also edu boundaries really correspond well with human-identified content |
---|
0:16:32 | boundaries and finally since edus are clauses |
---|
0:16:35 | they have much better readability than things like n-grams |
---|
0:16:40 | okay so in conclusion we first conduct a corpus analysis where we show that edus |
---|
0:16:46 | correspond well with human-identified conceptual units |
---|
0:16:49 | we show that the importance of edus from intra-document |
---|
0:16:55 | weights |
---|
0:16:55 | correlates with the inter-document concept weights |
---|
0:16:58 | and we also look at near extractive summarization where first i introduce a large dataset |
---|
0:17:04 | of extractive and near-extractive summaries |
---|
0:17:07 | which are released on this website |
---|
0:17:09 | and |
---|
0:17:11 | we showed that in this dataset edu boundaries align with human content extraction and finally |
---|
0:17:16 | edus are superior to sentences in near-extractive summarization under varying length constraints |
---|
0:17:22 | and that's all thanks for your attention i welcome questions |
---|
0:18:11 | so |
---|
0:18:13 | are you referring to kind of the boundary for edus or are you referring to |
---|
0:18:17 | so there's also like the importance of the concept right |
---|
0:18:22 | i think it depends on how someone wants to express something the importance itself may be |
---|
0:18:27 | different but as we can see the summaries are produced by different people |
---|
0:18:32 | but we also |
---|
0:18:33 | observe this kind of correlation which we found really interesting but we need to look |
---|
0:18:37 | into more like why this is the case |
---|
0:18:39 | but for edus i think |
---|
0:18:41 | for |
---|
0:18:42 | like we analyzed two corpora one is like |
---|
0:18:45 | different summaries from different people and the second one is gold summaries from editors |
---|
0:18:50 | we see good correspondence in each case so i'm pretty confident that you know |
---|
0:18:54 | this is okay |
---|
0:19:15 | right we're not looking at coherence and grammaticality for this work but it's |
---|
0:19:20 | part of the future plans that we have |
---|
0:19:24 | so |
---|
0:19:26 | for some reason we still find good readable summaries |
---|
0:19:30 | so |
---|
0:19:33 | for example |
---|
0:19:38 | well if we look at this one for the edus it's |
---|
0:19:42 | built very reasonably but i wouldn't say like everything is super grammatical and |
---|
0:19:48 | like |
---|
0:19:48 | we will see different edus being attached just because the summarizer wants to fulfil the |
---|
0:19:54 | length constraint even when it doesn't make sense and |
---|
0:19:56 | things like that do happen |
---|
0:20:16 | no not at all so for all of our features for the summarizer we bypassed anything |
---|
0:20:22 | that would |
---|
0:20:24 | show the advantage or disadvantage of each type of unit |
---|
0:20:29 | so we are only using things like position and |
---|
0:20:32 | cosine similarity and things like adjacency and so on |
---|
0:20:44 | the weights |
---|
0:20:46 | yes we didn't use the parser for the summarization task we only |
---|
0:20:51 | use the edus |
---|
0:20:53 | but for the analysis part we did look at the weights for the edus and |
---|
0:20:57 | we associated them with the weights for concepts |
---|
0:21:07 | right there is existing work that's what we used for parsing |
---|
0:21:25 | right so |
---|
0:21:27 | the pdtb |
---|
0:21:29 | it doesn't have two things that i think we really need in this task |
---|
0:21:33 | the first one is a full segmentation |
---|
0:21:36 | so with the pdtb arguments |
---|
0:21:39 | there is |
---|
0:21:41 | a lot of freedom in where the arguments are positioned |
---|
0:21:46 | and it's not a segmentation and nothing is necessarily contiguous |
---|
0:21:49 | the second part is that for the pdtb there's nothing |
---|
0:21:54 | associated with salience so if we want to consider weights |
---|
0:21:57 | or salience we cannot do that with the pdtb |
---|