0:00:17 The final talk for SIGDIAL two thousand nineteen is by Chris Hidey, entitled "Discourse Relation Prediction: Revisiting Word Pairs with Convolutional Networks".
0:00:34 So, I'm Chris Hidey, and I'm presenting on behalf of Siddharth, who is not able to make it to the conference.
0:00:41 Detecting the presence of a discourse relation between two text segments is important for a lot of downstream applications, including text-level or document-level tasks such as text planning or summarization.
0:00:56 One such resource that's labeled with discourse relations is the Penn Discourse Treebank, as also mentioned in the previous talk. This defines the shallow discourse semantics between segments, unlike a framework such as RST, which builds a full parse tree over a document.
0:01:15 At the top level there are four different classes: the comparison relation, which includes contrast and concession; the expansion relation, which might include examples; contingency, which includes conditional and causal statements; and then temporal relations.
0:01:33 These can be expressed either explicitly, using a discourse connective, or implicitly.
0:01:41 To provide an example from the PDTB: the first argument is "Mr. Hahn began selling non-core businesses such as oil and gas and chemicals", and the second argument is "[he] even sold one unit that made vinyl checkbook covers". This is an implicit example, and it would be the expansion relation with the implicit connective "in fact".
0:02:03 So I'll discuss the background on using word pairs to predict discourse relations, and I'll talk about the related work on word pairs along with previous work on neural models. Then I'll discuss our method of using convolutional networks to model word pairs, compare it to the previous work, and provide some analysis of the performance of our model.
0:02:26 Earlier work by Marcu and Echihabi looked at using word pairs to identify discourse relations. They noted that, absent very good semantic parsers, one way to identify the relationship between text segments is to find word pairs using a very large corpus.
0:02:51 So take the comparison relation, say, and a word pair such as "good" and "fails": this wouldn't be an antonym pair in a resource like WordNet, but we might be able to identify it from a large unlabeled corpus. So they leveraged discourse connectives to identify these word pairs and then built a model using those word pairs as features.
0:03:17 The initial work using word pairs did that by taking the cross product of words on either side of a connective from some external resource and then using those identified word pairs as features for a classifier.
0:03:31 Some work on the PDTB found that the top word pairs in terms of information gain are discourse connectives and functional words, and this may be a product of the frequency of those words as well as the sparsity of word pairs.
0:03:47 In order to handle the sparsity issue, Biran and McKeown built separate tf-idf features: they identified word pairs around each connective in the Gigaword corpus, which gave around a hundred different tf-idf vectors, and thus around a hundred dot products that they could use as features on the labeled data.
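Roughly, the flavor of that aggregation looks like the following. This is only a loose sketch for illustration, not Biran and McKeown's exact procedure; the corpus handling, weighting, and all names here are simplified placeholders.

```python
from collections import Counter, defaultdict
import math

def connective_tfidf_vectors(instances):
    """Aggregate word pairs per connective from unlabeled text and weight them
    by tf-idf, treating each connective's collection as one document.
    `instances` is an iterable of (connective, word_pairs) tuples."""
    tf = defaultdict(Counter)
    df = Counter()
    for connective, pairs in instances:
        tf[connective].update(pairs)
    for counts in tf.values():
        for pair in counts:
            df[pair] += 1
    n_docs = len(tf)
    return {c: {p: cnt * math.log(n_docs / df[p]) for p, cnt in counts.items()}
            for c, counts in tf.items()}

def dot_product_features(word_pairs, connective_vectors):
    """One feature per connective: the dot product between an instance's word
    pairs and that connective's aggregated tf-idf vector."""
    return [sum(vec.get(p, 0.0) for p in word_pairs)
            for vec in connective_vectors.values()]

# toy usage with hypothetical harvested instances
vectors = connective_tfidf_vectors([
    ("but", [("good", "fails"), ("rose", "fell")]),
    ("because", [("rain", "wet")]),
])
print(dot_product_features([("good", "fails")], vectors))
```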
0:04:10 Recently, neural models have had a lot of success on the PDTB, whether recurrent models, CNNs, or, more recently, attention-based models. One advantage of these models is that it is easier to jointly model the PDTB with other corpora, either labeled or unlabeled data.
0:04:31 More recent work has used adversarial learning, given an implicit connective as well as the model without the connective. And very recently, Dai and Huang used a joint approach with the full paragraph context, jointly modeling explicit and implicit relations using a bidirectional LSTM and a CRF.
0:04:57 So the advantage of the word pairs is that they provide an intuitive way of identifying features, but they also tend to rely on noisy unlabeled external data, and the word pair representations are very sparse, since it's not possible to explicitly model every word pair. On the other hand, neural models allow us to jointly model other data as well, but the downside is that we have to identify a specific architecture, and these models can be very complex.
0:05:33 This suggests two research questions: whether we can explicitly model these word pairs using neural models, and whether we can transfer knowledge by joint learning with the explicitly labeled examples in the PDTB.
0:05:48 To give an example: given the sentence "I'm late for the meeting because the train was delayed", we would split that into argument one and argument two, where argument two starts with the explicit discourse connective. Then we take the Cartesian product of the words on either side of the argument boundary, and this gives us a matrix of word pairs.
0:06:15 We take the same approach for implicit relations: it's the same matrix, minus the connective.
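To make this construction concrete, here is a minimal sketch of the Cartesian-product grid; the embedding size and the random vectors are toy placeholders, not the setup from the paper.

```python
import numpy as np

def word_pair_grid(arg1_vecs, arg2_vecs):
    """Cell (i, j) holds the concatenation of the i-th word vector of argument
    one and the j-th word vector of argument two, i.e. the Cartesian product
    of the two token sequences."""
    n1, d = arg1_vecs.shape
    n2, _ = arg2_vecs.shape
    grid = np.zeros((n1, n2, 2 * d))
    for i in range(n1):
        for j in range(n2):
            grid[i, j] = np.concatenate([arg1_vecs[i], arg2_vecs[j]])
    return grid

# toy example: "I'm late for the meeting" x "the train was delayed"
d = 4                               # placeholder embedding size
arg1 = np.random.randn(5, d)        # 5 tokens in argument one
arg2 = np.random.randn(4, d)        # 4 tokens in argument two
print(word_pair_grid(arg1, arg2).shape)   # (5, 4, 8)
```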
0:06:26 Given this grid of word pairs, we then take filters of even length and slide them over the grid. Initially we take word-and-word pairs, where we take a single word from either side of the argument, and we slide the filter across so that we get word pair representations.
0:06:50 We can also do the same thing with larger filter sizes, which essentially represent word-and-n-gram pairs. In this case the filter is of size eight, and it represents a word and a four-gram pair from the first argument and the second argument. We can again take this filter and slide it across the grid using a stride of two, and for the most part we're getting word-and-n-gram pairs, except at row and column boundaries, where we end up with multiple word pairs.
0:07:22 We then do the same thing down the columns: where before we were going across the rows, we again take these convolutions and slide them down the columns, so we get arg two and arg one pairs as well as arg one and arg two pairs.
0:07:38 This gives us our initial architecture, where we have argument one and argument two, which are passed into a CNN, and we do max pooling over that to extract the features. Then we do the same thing for argument two and argument one, and we concatenate the resulting features, which gives us the representation for the word pairs. The weights between these two CNNs are shared as well.
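A rough sketch of this word-pair encoder is below, under the assumption that the pair grid is flattened row-wise into an interleaved sequence so that even-length filters with stride two line up with (word, word) and (word, n-gram) pairs. Filter counts, sizes, and the random inputs are placeholders, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class WordPairCNN(nn.Module):
    """Even-length 1D filters slide with stride two over the row-wise
    flattened pair grid: a size-2 filter sees a (word, word) pair, while a
    size-8 filter sees a word paired with a 4-gram. The same weights are used
    in both directions (arg1 x arg2 and arg2 x arg1), and the max-pooled
    outputs are concatenated."""
    def __init__(self, emb_dim, n_filters=64, kernel_sizes=(2, 4, 6, 8)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, k, stride=2) for k in kernel_sizes])

    def _pair_sequence(self, a, b):
        # interleave: [a_1 b_1, a_1 b_2, ..., a_1 b_m, a_2 b_1, ...]
        n1, d = a.shape
        n2, _ = b.shape
        grid = torch.stack([a.unsqueeze(1).expand(n1, n2, d),
                            b.unsqueeze(0).expand(n1, n2, d)], dim=2)
        return grid.reshape(1, n1 * n2 * 2, d).transpose(1, 2)   # (1, d, L)

    def _encode(self, seq):
        # max pooling over time for each filter size, then concatenate
        return torch.cat([torch.relu(c(seq)).max(dim=2).values
                          for c in self.convs], dim=1)

    def forward(self, arg1, arg2):   # each: (n_tokens, emb_dim)
        forward_rep = self._encode(self._pair_sequence(arg1, arg2))
        backward_rep = self._encode(self._pair_sequence(arg2, arg1))
        return torch.cat([forward_rep, backward_rep], dim=1)

pair_encoder = WordPairCNN(emb_dim=50)
print(pair_encoder(torch.randn(5, 50), torch.randn(4, 50)).shape)  # (1, 512)
```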
0:08:06 Similarly, we take the same kind of approach for the individual arguments. The reason for this is twofold. The first reason is that it's a way to determine the effect of the word pairs, that is, to evaluate whether the word pairs are complementary to the individual arguments. The other motivation for including the individual arguments is that many discourse relations contain lexical indicators, absent context, that are often indicative of a discourse relation; an example of that are the implicit causality verbs that might identify, say, a contingency relation, such as "make" or "provide".
0:08:50 So we use the same architecture here, where instead of the cross product of the arguments we have the individual arguments, which are passed into a CNN, and that gives a feature representation for the individual arguments, which we can concatenate together to obtain the argument representation.
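As a small sketch, the per-argument encoder can be read as a standard text CNN applied to each argument separately; the filter counts and sizes here are placeholders.

```python
import torch
import torch.nn as nn

class ArgumentCNN(nn.Module):
    """A text CNN applied to each argument separately: convolutions over the
    token sequence, max pooling over time, and the two pooled vectors
    concatenated into the argument representation."""
    def __init__(self, emb_dim, n_filters=64, kernel_sizes=(2, 3, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, k) for k in kernel_sizes])

    def encode(self, arg):                       # arg: (n_tokens, emb_dim)
        x = arg.t().unsqueeze(0)                 # (1, emb_dim, n_tokens)
        return torch.cat([torch.relu(c(x)).max(dim=2).values
                          for c in self.convs], dim=1)

    def forward(self, arg1, arg2):
        return torch.cat([self.encode(arg1), self.encode(arg2)], dim=1)

arg_encoder = ArgumentCNN(emb_dim=50)
print(arg_encoder(torch.randn(6, 50), torch.randn(7, 50)).shape)  # (1, 384)
```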
0:09:12 We also want to be able to model the interaction between the arguments, and the way that we do that is with an additional gate layer. We concatenate argument one and argument two, pass that through a nonlinearity, and then determine how much to weight the individual features. This gives us a weighted representation of the interaction between the two arguments.
0:09:41 Then, in order to model the interaction between the arguments and the word pairs, we have another gate with an identical architecture, where we take the output of the first gate, so the argument interaction, combine that with the word pairs, pass it through a nonlinearity, and predict how much to weight the individual features.
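A minimal sketch of this gating idea follows; the sigmoid gate and the feature sizes below are illustrative assumptions rather than the exact parameterization from the paper.

```python
import torch
import torch.nn as nn

class GateLayer(nn.Module):
    """A gate computed from the concatenated inputs decides how much to weight
    each feature of the combined representation. One such gate models the
    argument interaction; a second, identical one combines that output with
    the word-pair features."""
    def __init__(self, dim_a, dim_b):
        super().__init__()
        self.gate = nn.Linear(dim_a + dim_b, dim_a + dim_b)

    def forward(self, a, b):
        combined = torch.cat([a, b], dim=1)
        weights = torch.sigmoid(self.gate(combined))   # per-feature weights in (0, 1)
        return weights * combined

# first gate: argument interaction; second gate: interaction with word pairs
arg_gate = GateLayer(192, 192)
pair_gate = GateLayer(384, 512)
arg_interaction = arg_gate(torch.randn(1, 192), torch.randn(1, 192))
print(pair_gate(arg_interaction, torch.randn(1, 512)).shape)   # (1, 896)
```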
0:10:06 Finally, this entire architecture is shared between the implicit and explicit relations, except for the final classification, where we just have separate multilayer perceptrons for explicit relations and for implicit relations, and we predict the discourse relation. We then do joint learning over the PDTB to predict the discourse relation.
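The joint setup can be sketched as a shared encoder with two classification heads, trained by summing the explicit and implicit losses. The toy encoder, sizes, and names below are placeholders, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    """Stand-in for the shared word-pair / argument / gate encoder."""
    def forward(self, arg1, arg2):
        return torch.cat([arg1.mean(dim=0), arg2.mean(dim=0)]).unsqueeze(0)

class RelationClassifier(nn.Module):
    """Shared encoder with separate MLP heads for explicit and implicit
    instances; only the heads differ between the two relation types."""
    def __init__(self, encoder, feat_dim, n_classes=4, hidden=128):
        super().__init__()
        self.encoder = encoder
        def head():
            return nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_classes))
        self.explicit_head, self.implicit_head = head(), head()

    def forward(self, arg1, arg2, explicit):
        features = self.encoder(arg1, arg2)
        return self.explicit_head(features) if explicit else self.implicit_head(features)

# joint learning: sum the losses from an explicit and an implicit example
model = RelationClassifier(ToyEncoder(), feat_dim=100)
loss = F.cross_entropy(model(torch.randn(5, 50), torch.randn(4, 50), True),
                       torch.tensor([2])) + \
       F.cross_entropy(model(torch.randn(6, 50), torch.randn(3, 50), False),
                       torch.tensor([0]))
print(loss.item())
```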
0:10:39 Overall, this gives us features from argument one and argument two, where we have word-and-word pairs, word-and-n-gram pairs, and n-gram features. For the word pairs we use even-sized filters of two, four, six, and eight, whereas for the n-grams we use filters of sizes two, three, and five.
0:10:57 We use static word embeddings, so we fix them and don't update them during training; we just initialize them with word2vec, and we use word2vec embeddings trained on the PDTB for the out-of-vocabulary words. Finally, we concatenate those with one-hot part-of-speech encodings, and this is the initial input into the network.
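A small sketch of that input representation is below; the lookup tables, the 300-dimensional embedding size, and the tag set are hypothetical placeholders.

```python
import numpy as np

def token_features(tokens, pos_tags, w2v, oov_w2v, pos_vocab, emb_dim=300):
    """A frozen word2vec vector per token (falling back to separately trained
    vectors for out-of-vocabulary words, then zeros), concatenated with a
    one-hot part-of-speech encoding."""
    rows = []
    for token, pos in zip(tokens, pos_tags):
        vec = w2v.get(token, oov_w2v.get(token, np.zeros(emb_dim)))
        pos_one_hot = np.zeros(len(pos_vocab))
        pos_one_hot[pos_vocab[pos]] = 1.0
        rows.append(np.concatenate([vec, pos_one_hot]))
    return np.stack(rows)            # (n_tokens, emb_dim + number of POS tags)

# toy usage with tiny hypothetical lookup tables
w2v = {"train": np.ones(300), "delayed": np.full(300, 0.5)}
pos_vocab = {"DT": 0, "NN": 1, "VBD": 2, "VBN": 3}
feats = token_features(["the", "train", "was", "delayed"],
                       ["DT", "NN", "VBD", "VBN"], w2v, {}, pos_vocab)
print(feats.shape)   # (4, 304)
```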
0:11:22 We evaluated on two different datasets: PDTB 2.0, as well as the test datasets from the CoNLL two thousand sixteen shared task. We evaluate on three different tasks: the one-versus-all task, the four-way classification task, and the fifteen-way classification. All of these experiments are available in the paper; for this talk I'll discuss the four-way classification results. We use the standard splits so that we can compare to previous work.
0:12:01 Compared to recent work, we obtain improved performance. To compare to previous work: some previous work used the max over a number of different runs and some used the average, so we present both in order to provide a fair comparison. We primarily compare to Dai and Huang, since they also have a joint model over implicit and explicit relations, and we find improved performance over their model on both types. Compared to other recent work, we also find that the max F1 and accuracy are better on implicit relations as well.
0:12:44 In order to identify where the improved performance is coming from, we conduct a number of ablation experiments. Examining the full model with joint learning and comparing it to the implicit-only case, we find that most of the improved performance is coming from expansion: there is a five-point improvement on the expansion class from the joint learning, and this improves the overall F1 and accuracy. So the representations of explicitly marked expansion relations are helpful for implicit relations.
0:13:25 We conduct an additional experiment to determine the effect of the word pairs, and we find that, compared to using the individual arguments, on implicit relations we obtain increasingly better performance as we increase the number of word pairs that we use.
0:13:45 In terms of implicit relations, we obtain around a two-point improvement overall on both F1 and accuracy. On the other hand, with explicit relations we don't find improved performance, and part of that is probably due to the fact that the connective itself is a very strong baseline that is difficult to improve upon: even just learning a representation of the connective by itself is a pretty strong model. On the other hand, we don't do worse, so we're still able to use this joint model for both.
0:14:20 If we examine the performance on individual classes, in terms of where the word pairs help, we find that using word pairs of up to length four, compared to individual arguments, improves the overall F1 and accuracy on the full four-way task. We find that it especially helps the comparison relations, where we obtain a six-and-a-half-point improvement, with small improvements on expansion and temporal, whereas for contingency we do a bit worse.
0:15:02 So this is worth investigating further in future work: we find that three of the four high-level relations are helped by word pairs, but contingency is not.
0:15:15 Some speculation about why the word pairs might help: expansion and comparison tend to have words or phrases of similar or opposite meaning, and it's possible the word pair representations are capturing that. Whereas for contingency, since it does much better in the individual-arguments case, it might be because of those implicit causality verbs that are indicative of the contingency relation.
0:15:47 We also conducted a qualitative analysis, to look at some examples of where the word pair features are helping.
0:15:57 We ran an experiment where we removed all the nonlinearities after the convolutional layers, so removing the gates, and we only have the features extracted from the word pairs and the arguments, concatenated together, before making the prediction with a linear classifier. On the average of three runs with these two different models, this reduces the score by around a point or so, which shows both that the gates help with modeling discourse relations, but also that this is a reasonable approximation of what the model is learning.
0:16:34 We then take the argmax of these feature maps, instead of just doing max pooling, and we map those back to the original word pair or n-gram features, and we identify examples that are recovered by the full model and not by the implicit-only model.
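A sketch of that mapping step is below: take the argmax position of each feature map and read off the span of the interleaved pair sequence that the filter covered. The function, token sequence, and filter are placeholder illustrations, not the exact analysis code.

```python
import torch
import torch.nn as nn

def top_pair_features(conv, pair_seq, pair_tokens, k=3):
    """Instead of discarding the pooling indices, take the argmax position of
    each feature map and map it back to the span of the interleaved pair
    sequence it covered, i.e. the word pair or word-n-gram pair that fired
    the filter. `pair_tokens` is the token sequence aligned with `pair_seq`."""
    fmap = torch.relu(conv(pair_seq))            # (1, n_filters, L_out)
    scores, positions = fmap.max(dim=2)          # best position per filter
    width, stride = conv.kernel_size[0], conv.stride[0]
    spans = []
    for f in scores.squeeze(0).topk(k).indices:  # the k strongest filters
        start = positions[0, f].item() * stride
        spans.append(pair_tokens[start:start + width])
    return spans

# toy usage on an interleaved (arg1 word, arg2 word) token sequence
tokens = ["plans", "declined", "plans", "to", "plans", "discuss", "plans", "its"]
pair_seq = torch.randn(1, 50, len(tokens))       # stand-in embeddings
print(top_pair_features(nn.Conv1d(50, 16, 8, stride=2), pair_seq, tokens))
```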
0:16:53 This is a comparison example: "Alliant said it plans to use a microprocessor", and "it declined to discuss its plans". One of the top word pair features that the model learns in this case is "plans" paired with "declined to discuss its plans", so here it seems like the model is able to learn that this is a word and a phrase with opposing meaning.
0:17:20 We also provide an expansion example: "it allows Mr. Van de Kamp to get around campaign spending limits", and "he can spend the legal maximum for his campaign". Again, one of the top word pair features learned is "spending limits" paired with "maximum", so it seems like the model is learning that these are important features because they have similar meaning.
0:17:45 Finally, we conduct an experiment to compare our model to the previous work in terms of running time and number of parameters. We find that, compared to a bidirectional LSTM-CRF model, we have around half the number of parameters. We also ran the models three times, for four to five epochs each, using PyTorch on the same GPU, and we find that our model runs in around half the running time. So, using a less complex model, we're able to obtain similar or better performance.
0:18:26 Overall, we find that word pairs are complementary to individual arguments, both overall and on three of the four top-level classes. We also find that joint learning improves the model, indicating some shared properties between the implicit and explicit discourse relations, in particular for the expansion class.
0:18:56 For future work, we would like to evaluate the impact of contextual embeddings such as BERT, instead of just word embeddings, to see if we can obtain improved performance, but also to evaluate whether these properties transfer to other corpora as well, either external labeled datasets or unlabeled datasets across explicit connectives.
0:19:26 So, if there are any questions, feel free to email us, and our code is available at the following link.
0:19:44 So, are there any questions?
0:19:50 Thanks for the talk. So, you talk about word pairs, but actually you showed word-to-n-gram combinations, with the length of the n-gram being, a priori, anything you need, right? I mean, within the limits of the longest sentence. So why did you do that, and did you try, with experimentation, to limit the n, or to just use the word pairs, the actual word pairs, and what happened?
0:20:30 So, we did try just word pairs, and we found that that improved performance, but modeling the word-and-n-gram pairs identified better features.
0:20:55 As you can see here, WP1 in this case is just the individual word pairs. So the word pairs by themselves improve performance overall, but not as much as when we include the word-and-n-gram pairs. In this case we limited it to four; that was just an experimental determination, as beyond four we didn't obtain any improved performance.
0:21:29 Excellent talk. I had a question, I think, about your last example. This one, right? So if you say "he will spend the legal maximum for his campaign", couldn't that also be a comparison?
0:21:51 I think it might be both. So you can have multiple... yes, the PDTB allows for multiple labels for a single instance.
0:22:03 Okay. It seems to me, from your talk and also from the previous talk, that the temporal relations were more difficult than the other ones. Is that right? That's correct. And so, why?
0:22:11 I think part of the reason is that the temporal class in the PDTB is very small. I also think temporal relations are hard in general: I don't know that neural models are particularly good at representing dates and times, so that might be part of the reason, but that's just speculation.
0:22:34 Any more questions?
0:22:41 There is a question.
0:22:44 Is your estimator also able to identify whether there is a relation between the two arguments? What I mean is, you always assume there is either an explicit or an implicit relation, right?
0:22:59 Right, so we just did the four-way task, so assuming there is a discourse relation.
0:23:13 All right then, let's thank the speaker again.