0:00:17 The final talk for SIGDIAL two thousand nineteen is by Chris Hidey, entitled "Discourse Relation Prediction: Revisiting Word Pairs with Convolutional Networks".
0:00:34 So, I'm Chris Hidey, and I'm presenting on behalf of Siddharth, who is not able to make it to the conference.
0:00:41 Detecting the presence of a discourse relation between two text segments is important for a lot of downstream applications, including text-level or document-level tasks such as text planning or summarization.
0:00:56 One such resource that's labeled with discourse relations is the Penn Discourse Treebank, as also mentioned in the previous talk. This defines the shallow discourse semantics between segments, unlike a framework such as RST, which builds a full parse tree over a document.
0:01:15 At the top level there are four different classes: the comparison relation, which includes contrast and concession; the expansion relation, which might include examples; contingency, which includes conditional and causal statements; and then temporal relations.
0:01:33 These can be expressed either explicitly, using a discourse connective, or implicitly.
0:01:41 To provide an example from the PDTB: the first argument is "Mr. Hahn began selling non-core businesses such as oil and gas and chemicals", and the second argument is "[he] even sold one unit that made vinyl checkbook covers". This is an implicit example, and it would be the expansion relation with the implicit connective "in fact".
0:02:03 So I'll discuss the background on using word pairs to predict discourse relations, and I'll talk about the related work on word pairs along with previous work on neural models. Then I'll discuss our method of using convolutional networks to model word pairs, compare it to the previous work, and provide some analysis of the performance of our model.
0:02:26 Earlier work by Marcu and Echihabi looked at using word pairs to identify discourse relations. They noted that, absent very good semantic parsers, one way to identify the relationship between text segments is to find word pairs using a very large corpus.
0:02:51 So take the comparison relation, say, and a word pair such as "good" and "fails": this wouldn't be an antonym pair in a resource like WordNet, but we might be able to identify it from a large unlabeled corpus. So they leveraged discourse connectives to identify these word pairs and then built a model using those word pairs as features.
0:03:17 The initial work using word pairs did that by taking the cross product of words on either side of a connective from some external resource and then using those identified word pairs as features for a classifier.
0:03:31 Some work on the PDTB found that the top word pairs in terms of information gain are discourse connectives and functional words, and this may be a product of the frequency of those words as well as the sparsity of word pairs.
0:03:47 In order to handle the sparsity issue, Biran and McKeown built separate tf-idf features: they identified word pairs around each connective in the Gigaword corpus, which gave around a hundred different tf-idf vectors, and thus around a hundred dot products that they could use as features on the labeled data.
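Roughly, the flavor of that aggregation looks like the following. This is only a loose sketch for illustration, not Biran and McKeown's exact procedure; the corpus handling, weighting, and all names here are simplified placeholders.

```python
from collections import Counter, defaultdict
import math

def connective_tfidf_vectors(instances):
    """Aggregate word pairs per connective from unlabeled text and weight them
    by tf-idf, treating each connective's collection as one document.
    `instances` is an iterable of (connective, word_pairs) tuples."""
    tf = defaultdict(Counter)
    df = Counter()
    for connective, pairs in instances:
        tf[connective].update(pairs)
    for counts in tf.values():
        for pair in counts:
            df[pair] += 1
    n_docs = len(tf)
    return {c: {p: cnt * math.log(n_docs / df[p]) for p, cnt in counts.items()}
            for c, counts in tf.items()}

def dot_product_features(word_pairs, connective_vectors):
    """One feature per connective: the dot product between an instance's word
    pairs and that connective's aggregated tf-idf vector."""
    return [sum(vec.get(p, 0.0) for p in word_pairs)
            for vec in connective_vectors.values()]

# toy usage with hypothetical harvested instances
vectors = connective_tfidf_vectors([
    ("but", [("good", "fails"), ("rose", "fell")]),
    ("because", [("rain", "wet")]),
])
print(dot_product_features([("good", "fails")], vectors))
```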
0:04:10 Recently, neural models have had a lot of success on the PDTB, whether recurrent models, CNNs, or, more recently, attention-based models. One advantage of these models is that it is easier to jointly model the PDTB with other corpora, either labeled or unlabeled data.
0:04:31 More recent work has used adversarial learning, given an implicit connective as well as the model without the connective. And very recently, Dai and Huang used a joint approach with the full paragraph context, jointly modeling explicit and implicit relations using a bidirectional LSTM and a CRF.
0:04:57 So the advantage of the word pairs is that they provide an intuitive way of identifying features, but they also tend to rely on noisy unlabeled external data, and the word pair representations are very sparse, since it's not possible to explicitly model every word pair. On the other hand, neural models allow us to jointly model other data as well, but the downside is that we have to identify a specific architecture, and these models can be very complex.
0:05:33 This suggests two research questions: whether we can explicitly model these word pairs using neural models, and whether we can transfer knowledge by joint learning with the explicitly labeled examples in the PDTB.
0:05:48 To give an example: given the sentence "I'm late for the meeting because the train was delayed", we would split that into argument one and argument two, where argument two starts with the explicit discourse connective. Then we take the Cartesian product of the words on either side of the argument boundary, and this gives us a matrix of word pairs.
0:06:15 We take the same approach for implicit relations: it's the same matrix, minus the connective.
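To make this construction concrete, here is a minimal sketch of the Cartesian-product grid; the embedding size and the random vectors are toy placeholders, not the setup from the paper.

```python
import numpy as np

def word_pair_grid(arg1_vecs, arg2_vecs):
    """Cell (i, j) holds the concatenation of the i-th word vector of argument
    one and the j-th word vector of argument two, i.e. the Cartesian product
    of the two token sequences."""
    n1, d = arg1_vecs.shape
    n2, _ = arg2_vecs.shape
    grid = np.zeros((n1, n2, 2 * d))
    for i in range(n1):
        for j in range(n2):
            grid[i, j] = np.concatenate([arg1_vecs[i], arg2_vecs[j]])
    return grid

# toy example: "I'm late for the meeting" x "the train was delayed"
d = 4                               # placeholder embedding size
arg1 = np.random.randn(5, d)        # 5 tokens in argument one
arg2 = np.random.randn(4, d)        # 4 tokens in argument two
print(word_pair_grid(arg1, arg2).shape)   # (5, 4, 8)
```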
0:06:26 Given this grid of word pairs, we then take filters of even length and slide them over the grid. Initially we take word-and-word pairs, where we take a single word from either side of the argument, and we slide the filter across so that we get word pair representations.
0:06:50 We can also do the same thing with larger filter sizes, which essentially represent word-and-n-gram pairs. In this case the filter is of size eight, and it represents a word and a four-gram pair from the first argument and the second argument. We can again take this filter and slide it across the grid using a stride of two, and for the most part we're getting word-and-n-gram pairs, except at row and column boundaries, where we end up with multiple word pairs.
0:07:22 We then do the same thing down the columns: where before we were going across the rows, we again take these convolutions and slide them down the columns, so we get arg two and arg one pairs as well as arg one and arg two pairs.
0:07:38 This gives us our initial architecture, where we have argument one and argument two, which are passed into a CNN, and we do max pooling over that to extract the features. Then we do the same thing for argument two and argument one, and we concatenate the resulting features, which gives us the representation for the word pairs. The weights between these two CNNs are shared as well.
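A rough sketch of this word-pair encoder is below, under the assumption that the pair grid is flattened row-wise into an interleaved sequence so that even-length filters with stride two line up with (word, word) and (word, n-gram) pairs. Filter counts, sizes, and the random inputs are placeholders, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class WordPairCNN(nn.Module):
    """Even-length 1D filters slide with stride two over the row-wise
    flattened pair grid: a size-2 filter sees a (word, word) pair, while a
    size-8 filter sees a word paired with a 4-gram. The same weights are used
    in both directions (arg1 x arg2 and arg2 x arg1), and the max-pooled
    outputs are concatenated."""
    def __init__(self, emb_dim, n_filters=64, kernel_sizes=(2, 4, 6, 8)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, k, stride=2) for k in kernel_sizes])

    def _pair_sequence(self, a, b):
        # interleave: [a_1 b_1, a_1 b_2, ..., a_1 b_m, a_2 b_1, ...]
        n1, d = a.shape
        n2, _ = b.shape
        grid = torch.stack([a.unsqueeze(1).expand(n1, n2, d),
                            b.unsqueeze(0).expand(n1, n2, d)], dim=2)
        return grid.reshape(1, n1 * n2 * 2, d).transpose(1, 2)   # (1, d, L)

    def _encode(self, seq):
        # max pooling over time for each filter size, then concatenate
        return torch.cat([torch.relu(c(seq)).max(dim=2).values
                          for c in self.convs], dim=1)

    def forward(self, arg1, arg2):   # each: (n_tokens, emb_dim)
        forward_rep = self._encode(self._pair_sequence(arg1, arg2))
        backward_rep = self._encode(self._pair_sequence(arg2, arg1))
        return torch.cat([forward_rep, backward_rep], dim=1)

pair_encoder = WordPairCNN(emb_dim=50)
print(pair_encoder(torch.randn(5, 50), torch.randn(4, 50)).shape)  # (1, 512)
```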
0:08:06 Similarly, we take the same kind of approach for the individual arguments. The reason for this is twofold. The first reason is that it's a way to determine the effect of the word pairs, that is, to evaluate whether the word pairs are complementary to the individual arguments. The other motivation for including the individual arguments is that many discourse relations contain lexical indicators, absent context, that are often indicative of a discourse relation; an example of that are the implicit causality verbs that might identify, say, a contingency relation, such as "make" or "provide".
0:08:50 So we use the same architecture here, where instead of the cross product of the arguments we have the individual arguments, which are passed into a CNN, and that gives a feature representation for the individual arguments, which we can concatenate together to obtain the argument representation.
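As a small sketch, the per-argument encoder can be read as a standard text CNN applied to each argument separately; the filter counts and sizes here are placeholders.

```python
import torch
import torch.nn as nn

class ArgumentCNN(nn.Module):
    """A text CNN applied to each argument separately: convolutions over the
    token sequence, max pooling over time, and the two pooled vectors
    concatenated into the argument representation."""
    def __init__(self, emb_dim, n_filters=64, kernel_sizes=(2, 3, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, k) for k in kernel_sizes])

    def encode(self, arg):                       # arg: (n_tokens, emb_dim)
        x = arg.t().unsqueeze(0)                 # (1, emb_dim, n_tokens)
        return torch.cat([torch.relu(c(x)).max(dim=2).values
                          for c in self.convs], dim=1)

    def forward(self, arg1, arg2):
        return torch.cat([self.encode(arg1), self.encode(arg2)], dim=1)

arg_encoder = ArgumentCNN(emb_dim=50)
print(arg_encoder(torch.randn(6, 50), torch.randn(7, 50)).shape)  # (1, 384)
```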
0:09:12 We also want to be able to model the interaction between the arguments, and the way that we do that is with an additional gate layer. We concatenate argument one and argument two, pass that through a nonlinearity, and then determine how much to weight the individual features. This gives us a weighted representation of the interaction between the two arguments.
0:09:41 Then, in order to model the interaction between the arguments and the word pairs, we have another gate with an identical architecture, where we take the output of the first gate, so the argument interaction, combine that with the word pairs, pass it through a nonlinearity, and predict how much to weight the individual features.
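A minimal sketch of this gating idea follows; the sigmoid gate and the feature sizes below are illustrative assumptions rather than the exact parameterization from the paper.

```python
import torch
import torch.nn as nn

class GateLayer(nn.Module):
    """A gate computed from the concatenated inputs decides how much to weight
    each feature of the combined representation. One such gate models the
    argument interaction; a second, identical one combines that output with
    the word-pair features."""
    def __init__(self, dim_a, dim_b):
        super().__init__()
        self.gate = nn.Linear(dim_a + dim_b, dim_a + dim_b)

    def forward(self, a, b):
        combined = torch.cat([a, b], dim=1)
        weights = torch.sigmoid(self.gate(combined))   # per-feature weights in (0, 1)
        return weights * combined

# first gate: argument interaction; second gate: interaction with word pairs
arg_gate = GateLayer(192, 192)
pair_gate = GateLayer(384, 512)
arg_interaction = arg_gate(torch.randn(1, 192), torch.randn(1, 192))
print(pair_gate(arg_interaction, torch.randn(1, 512)).shape)   # (1, 896)
```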
0:10:06 Finally, this entire architecture is shared between the implicit and explicit relations, except for the final classification, where we just have separate multilayer perceptrons for explicit relations and for implicit relations, and we predict the discourse relation. We then do joint learning over the PDTB to predict the discourse relation.
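The joint setup can be sketched as a shared encoder with two classification heads, trained by summing the explicit and implicit losses. The toy encoder, sizes, and names below are placeholders, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    """Stand-in for the shared word-pair / argument / gate encoder."""
    def forward(self, arg1, arg2):
        return torch.cat([arg1.mean(dim=0), arg2.mean(dim=0)]).unsqueeze(0)

class RelationClassifier(nn.Module):
    """Shared encoder with separate MLP heads for explicit and implicit
    instances; only the heads differ between the two relation types."""
    def __init__(self, encoder, feat_dim, n_classes=4, hidden=128):
        super().__init__()
        self.encoder = encoder
        def head():
            return nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_classes))
        self.explicit_head, self.implicit_head = head(), head()

    def forward(self, arg1, arg2, explicit):
        features = self.encoder(arg1, arg2)
        return self.explicit_head(features) if explicit else self.implicit_head(features)

# joint learning: sum the losses from an explicit and an implicit example
model = RelationClassifier(ToyEncoder(), feat_dim=100)
loss = F.cross_entropy(model(torch.randn(5, 50), torch.randn(4, 50), True),
                       torch.tensor([2])) + \
       F.cross_entropy(model(torch.randn(6, 50), torch.randn(3, 50), False),
                       torch.tensor([0]))
print(loss.item())
```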
0:10:39 Overall, this gives us features from argument one and argument two, where we have word-and-word pairs, word-and-n-gram pairs, and n-gram features. For the word pairs we use even-sized filters of two, four, six, and eight, whereas for the n-grams we use filters of sizes two, three, and five.
0:10:57 We use static word embeddings, so we fix them and don't update them during training; we just initialize them with word2vec, and we use word2vec embeddings trained on the PDTB for the out-of-vocabulary words. Finally, we concatenate those with one-hot part-of-speech encodings, and this is the initial input into the network.
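A small sketch of that input representation is below; the lookup tables, the 300-dimensional embedding size, and the tag set are hypothetical placeholders.

```python
import numpy as np

def token_features(tokens, pos_tags, w2v, oov_w2v, pos_vocab, emb_dim=300):
    """A frozen word2vec vector per token (falling back to separately trained
    vectors for out-of-vocabulary words, then zeros), concatenated with a
    one-hot part-of-speech encoding."""
    rows = []
    for token, pos in zip(tokens, pos_tags):
        vec = w2v.get(token, oov_w2v.get(token, np.zeros(emb_dim)))
        pos_one_hot = np.zeros(len(pos_vocab))
        pos_one_hot[pos_vocab[pos]] = 1.0
        rows.append(np.concatenate([vec, pos_one_hot]))
    return np.stack(rows)            # (n_tokens, emb_dim + number of POS tags)

# toy usage with tiny hypothetical lookup tables
w2v = {"train": np.ones(300), "delayed": np.full(300, 0.5)}
pos_vocab = {"DT": 0, "NN": 1, "VBD": 2, "VBN": 3}
feats = token_features(["the", "train", "was", "delayed"],
                       ["DT", "NN", "VBD", "VBN"], w2v, {}, pos_vocab)
print(feats.shape)   # (4, 304)
```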
0:11:22 We evaluated on two different datasets: PDTB 2.0, as well as the test datasets from the CoNLL two thousand sixteen shared task. We evaluate on three different tasks: the one-versus-all task, the four-way classification task, and the fifteen-way classification. All of these experiments are available in the paper; for this talk I'll discuss the four-way classification results. We use the standard splits so that we can compare to previous work.
0:12:01 Compared to recent work, we obtain improved performance. To compare to previous work: some previous work used the max over a number of different runs and some used the average, so we present both in order to provide a fair comparison. We primarily compare to Dai and Huang, since they also have a joint model over implicit and explicit relations, and we find improved performance over their model on both types. Compared to other recent work, we also find that the max F1 and accuracy are better on implicit relations as well.
0:12:44 In order to identify where the improved performance is coming from, we conduct a number of ablation experiments. Examining the full model with joint learning and comparing it to the implicit-only case, we find that most of the improved performance is coming from expansion: there is a five-point improvement on the expansion class from the joint learning, and this improves the overall F1 and accuracy. So the representations of explicitly marked expansion relations are helpful for implicit relations.
0:13:25 We conduct an additional experiment to determine the effect of the word pairs, and we find that, compared to using the individual arguments, on implicit relations we obtain increasingly better performance as we increase the number of word pairs that we use.
0:13:45 In terms of implicit relations, we obtain around a two-point improvement overall on both F1 and accuracy. On the other hand, with explicit relations we don't find improved performance, and part of that is probably due to the fact that the connective itself is a very strong baseline that is difficult to improve upon: even just learning a representation of the connective by itself is a pretty strong model. On the other hand, we don't do worse, so we're still able to use this joint model for both.
0:14:20 If we examine the performance on individual classes, in terms of where the word pairs help, we find that using word pairs of up to length four, compared to individual arguments, improves the overall F1 and accuracy on the full four-way task. We find that it especially helps the comparison relations, where we obtain a six-and-a-half-point improvement, with small improvements on expansion and temporal, whereas for contingency we do a bit worse.
0:15:02 So this is worth investigating further in future work: we find that three of the four high-level relations are helped by word pairs, but contingency is not.
0:15:15 Some speculation about why the word pairs might help: expansion and comparison tend to have words or phrases of similar or opposite meaning, and it's possible the word pair representations are capturing that. Whereas for contingency, since it does much better in the individual-arguments case, it might be because of those implicit causality verbs that are indicative of the contingency relation.
0:15:47 We also conducted a qualitative analysis, to look at some examples of where the word pair features are helping.
0:15:57 We ran an experiment where we removed all the nonlinearities after the convolutional layers, so removing the gates, and we only have the features extracted from the word pairs and the arguments, concatenated together, before making the prediction with a linear classifier. On the average of three runs with these two different models, this reduces the score by around a point or so, which shows both that the gates help with modeling discourse relations, but also that this is a reasonable approximation of what the model is learning.
0:16:34 We then take the argmax of these feature maps, instead of just doing max pooling, and we map those back to the original word pair or n-gram features, and we identify examples that are recovered by the full model and not by the implicit-only model.
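A sketch of that mapping step is below: take the argmax position of each feature map and read off the span of the interleaved pair sequence that the filter covered. The function, token sequence, and filter are placeholder illustrations, not the exact analysis code.

```python
import torch
import torch.nn as nn

def top_pair_features(conv, pair_seq, pair_tokens, k=3):
    """Instead of discarding the pooling indices, take the argmax position of
    each feature map and map it back to the span of the interleaved pair
    sequence it covered, i.e. the word pair or word-n-gram pair that fired
    the filter. `pair_tokens` is the token sequence aligned with `pair_seq`."""
    fmap = torch.relu(conv(pair_seq))            # (1, n_filters, L_out)
    scores, positions = fmap.max(dim=2)          # best position per filter
    width, stride = conv.kernel_size[0], conv.stride[0]
    spans = []
    for f in scores.squeeze(0).topk(k).indices:  # the k strongest filters
        start = positions[0, f].item() * stride
        spans.append(pair_tokens[start:start + width])
    return spans

# toy usage on an interleaved (arg1 word, arg2 word) token sequence
tokens = ["plans", "declined", "plans", "to", "plans", "discuss", "plans", "its"]
pair_seq = torch.randn(1, 50, len(tokens))       # stand-in embeddings
print(top_pair_features(nn.Conv1d(50, 16, 8, stride=2), pair_seq, tokens))
```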
0:16:53 This is a comparison example: "Alliant said it plans to use a microprocessor", and "it declined to discuss its plans". One of the top word pair features that the model learns in this case is "plans" paired with "declined to discuss its plans", so here it seems like the model is able to learn that this is a word and a phrase with opposing meaning.
0:17:20 We also provide an expansion example: "it allows Mr. Van de Kamp to get around campaign spending limits", and "he can spend the legal maximum for his campaign". Again, one of the top word pair features learned is "spending limits" paired with "maximum", so it seems like the model is learning that these are important features because they have similar meaning.
0:17:45 Finally, we conduct an experiment to compare our model to the previous work in terms of running time and number of parameters. We find that, compared to a bidirectional LSTM-CRF model, we have around half the number of parameters. We also ran the models three times, for four to five epochs each, using PyTorch on the same GPU, and we find that our model runs in around half the running time. So, using a less complex model, we're able to obtain similar or better performance.
0:18:26 Overall, we find that word pairs are complementary to individual arguments, both overall and on three of the four top-level classes. We also find that joint learning improves the model, indicating some shared properties between the implicit and explicit discourse relations, in particular for the expansion class.
0:18:56 For future work, we would like to evaluate the impact of contextual embeddings such as BERT, instead of just word embeddings, to see if we can obtain improved performance, but also to evaluate whether these properties transfer to other corpora as well, either external labeled datasets or unlabeled datasets across explicit connectives.
0:19:26 So, if there are any questions, feel free to email us, and our code is available at the following link.
0:19:44 So, are there any questions?
0:19:50 Thanks for the talk. So, you talk about word pairs, but actually you showed word-to-n-gram combinations, with the length of the n-gram being, a priori, anything you need, right? I mean, within the limits of the longest sentence. So why did you do that, and did you try, with experimentation, to limit the n, or to just use the word pairs, the actual word pairs, and what happened?
0:20:30 So, we did try just word pairs, and we found that that improved performance, but modeling the word-and-n-gram pairs identified better features.
0:20:55 As you can see here, WP1 in this case is just the individual word pairs. So the word pairs by themselves improve performance overall, but not as much as when we include the word-and-n-gram pairs. In this case we limited it to four; that was just an experimental determination, as beyond four we didn't obtain any improved performance.
0:21:29 Excellent talk. I had a question, I think, about your last example. This one, right? So if you say "he will spend the legal maximum for his campaign", couldn't that also be a comparison?
0:21:51 I think it might be both. So you can have multiple... yes, the PDTB allows for multiple labels for a single instance.
0:22:03 Okay. It seems to me, from your talk and also from the previous talk, that the temporal relations were more difficult than the other ones. Is that right? That's correct. And so, why?
0:22:11 I think part of the reason is that the temporal class in the PDTB is very small. I also think temporal relations are hard in general: I don't know that neural models are particularly good at representing dates and times, so that might be part of the reason, but that's just speculation.
0:22:34 Any more questions?
0:22:41 There is a question.
0:22:44 Is your estimator also able to identify whether there is a relation between the two arguments? What I mean is, you always assume there is either an explicit or an implicit relation, right?
0:22:59 Right, so we just did the four-way task, so assuming there is a discourse relation.
0:23:13 All right then, let's thank the speaker again.