Speech Transcript - Creating and Characterizing a Diverse Corpus of Sarcasm in Dialogue

0:00:16	hi everybody
0:00:18	so
0:00:18	creating and characterizing a diverse corpus of sarcasm in dialogue
0:00:24	i want to start by explaining why we study sarcasm
0:00:28	and then the need for a large-scale corpus of sarcasm
0:00:32	different examples of sarcasm in the wild
0:00:35	followed by how we build our corpus some experimental results and linguistic analysis and then
0:00:40	conclusions
0:00:42	so why study sarcasm
0:00:44	well as we all kind of know it's creative complex and diverse here are
0:00:49	some examples
0:00:50	things like this or missing the point
0:00:53	i love it when you bash people for stating opinions and no facts then you
0:00:56	turn around to do the same thing
0:01:00	and even more complex my pyramidal tinfoil hat is an antenna for knowledge and truth
0:01:05	it reflects idiocy and this into deep space
0:01:08	as we can see
0:01:10	it's very creative it's very diverse
0:01:12	and
0:01:13	it gets more and more ambiguous and complex
0:01:15	a very long tail problem
0:01:19	so further motivation is it's very prevalent so estimated around ten percent in debate forums
0:01:26	dialogue which is kind of our domain of interest
0:01:30	and this sort of dialogue is very different from traditional mediums like independent tweets or
0:01:34	reviews for products things like that
0:01:38	so it's very interesting to our group
0:01:41	also part of the motivation is that things like sentiment analysis systems are thwarted by
0:01:45	misleading sarcastic posts so people
0:01:47	being sarcastic saying something is really great about a product and then it's very misleading
0:01:53	also for question answer systems it's important to know when things are not sarcastic to
0:01:57	use that as good data right so it's also important to differentiate between
0:02:01	the classes sometimes you wanna look at the not sarcastic posts sometimes you care about
0:02:05	the sarcastic ones
0:02:07	so some examples of sarcasm in the wild
0:02:12	so sarcasm is clearly not a unitary phenomenon gibbs in two thousand developed a taxonomy of
0:02:19	five different categories of sarcasm in conversations between friends
0:02:23	so he talks about sarcasm as speaking positively to convey negative intent
0:02:28	this is kind of a generally accepted way
0:02:31	to define sarcasm
0:02:33	but he also defines different categories where sarcasm is prevalent things like rhetorical questions so
0:02:38	somebody asking a question implying a humorous or critical assertion
0:02:42	things like hyperbole expressing a non-literal meaning by exaggeration
0:02:46	on the other side of the scale understatement so underplaying the reality of a
0:02:50	situation
0:02:52	and jocularity so teasing in humorous ways
0:02:56	so this is a little bit more fine grained
0:02:59	as a taxonomy for sarcasm
0:03:03	and it's kind of
0:03:04	accepted that people use the term sarcasm to mean all of these things as like
0:03:07	a big grab bag for anything that could be sarcastic
0:03:11	but the theoretical models say that there is often a contrast between what is
0:03:16	said
0:03:17	and a literal description of the actual situation
0:03:19	so that's a very common thing that characterizes much of sarcasm in different domains
0:03:25	so no previous work has really operationalized these different categories that gibbs has defined
0:03:30	that gibbs and other people have defined
0:03:34	so that's kind of the focus of our corpus building
0:03:37	so we explore in great detail rhetorical questions and hyperbole as two very prevalent
0:03:44	subcategories of sarcasm in our online debates
0:03:46	they're very prevalent in our debate forums and they can be used in fact sarcastically or
0:03:50	not sarcastically so that's an interesting binary
0:03:53	classification question
0:03:55	so to kind of showcase why that's true here are examples of rhetorical questions
0:04:02	that in the top row are used sarcastically and in the bottom row not sarcastically
0:04:06	so
0:04:07	something like then why do you call a politician who ran such measures liberal
0:04:11	oh yes it's 'cause he's a republican and you're not conservative at all
0:04:15	and what would that prove it would certainly show that an animal adapted to blah more of
0:04:18	like an informative sort of thing
0:04:20	so rhetorical questions exist in both categories
0:04:25	similarly for hyperbole
0:04:27	something like thank you for
0:04:29	making my point better than i could ever do
0:04:31	or again i'm astonished by the fact that you think i will do this
0:04:35	so there's kind of different ways that you can use these categories in both sarcastic
0:04:39	or not sarcastic
0:04:42	with sarcastic or not sarcastic intent
0:04:46	so kind of going into why do we need a large
0:04:49	scale corpus of sarcasm
0:04:52	first of all like i said creativity and diversity make it difficult to model generalizations
0:04:58	and subjectivity makes it very difficult to get high agreement annotation and we see that
0:05:02	from lots of previous work on sarcasm
0:05:04	people often use hashtags like sarcasm or use you know positive or negative
0:05:11	sentiment in different mediums to try to
0:05:15	highlight where sarcasm exists
0:05:17	because it's very difficult to get high agreement annotations
0:05:20	and these annotations are costly and they require kind of expert workers
0:05:25	so for example in an out of the blue context something like got your sosa
0:05:29	think simple found i think i love you it's hard to tell if that's really
0:05:33	sarcastic right
0:05:34	out of the blue
0:05:36	something like humans are nominal mammal that the fact it you just this in the
0:05:40	real schools
0:05:41	very subtle we don't know right
0:05:44	so it's pretty hard to ask people to do this sort of annotation you have
0:05:48	to be a little bit clever about it that's kind of what we try to
0:05:51	do
0:05:52	so we need a way to get more labeled data in the short term to study
0:05:56	sarcasm
0:05:57	to allow for better linguistic generalisations
0:06:00	more powerful classifiers in the long term that's kind of the premise of our corpus
0:06:04	building stage
0:06:06	how do we do it
0:06:08	so we do bootstrapping
0:06:10	we begin by replicating lukin and walker's bootstrapping setup from twenty thirteen
0:06:14	and the idea behind this is that
0:06:17	you begin with a small set of annotated sarcastic and not sarcastic posts
0:06:21	and use some kind of linguistic pattern extractor to find
0:06:26	cues that you think are highly
0:06:28	precise indicators of sarcasm and not sarcasm in the data
0:06:32	once you have these sorts of cues you can go out against huge sets of
0:06:35	unannotated data look for those cues
0:06:37	and anything that matches we're gonna call the bootstrap data
0:06:41	drop it back in the original annotated data and then kind of iteratively expand your
0:06:45	data set that way
0:06:47	that's kind of the premise that we use
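The bootstrapping loop described above can be sketched roughly as follows. This is a minimal illustration, not the setup used in the talk: `toy_learner` is a hypothetical stand-in for a real pattern learner (it simply keeps words seen in only one class), and the labels and example posts are invented.

```python
def toy_learner(labeled):
    """Hypothetical stand-in for a pattern learner: words seen in only one class."""
    words_by_label = {}
    for text, label in labeled:
        words_by_label.setdefault(label, set()).update(text.lower().split())
    cues = {}
    for label, words in words_by_label.items():
        other = set()
        for other_label, other_words in words_by_label.items():
            if other_label != label:
                other |= other_words
        cues[label] = words - other
    return cues


def bootstrap(seed, unlabeled, learn_cues, max_rounds=5):
    """Iteratively grow a labeled set from unlabeled posts that match learned cues."""
    labeled = list(seed)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        cues = learn_cues(labeled)
        matched, remaining = [], []
        for text in pool:
            tokens = set(text.lower().split())
            hits = [label for label, cue_set in cues.items() if tokens & cue_set]
            if len(hits) == 1:
                # Only accept posts that match cues from exactly one class.
                matched.append((text, hits[0]))
            else:
                remaining.append(text)
        if not matched:
            break  # nothing new was bootstrapped, so stop early
        labeled.extend(matched)
        pool = remaining
    return labeled
```

How well this works depends almost entirely on the precision of the learned cues, since every wrongly matched post is fed back into the next round.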
0:06:48	but really the crux of this is that
0:06:51	to do good bootstrapping we need this
0:06:53	portion right here
0:06:55	which requires the high-precision linguistic patterns to be really good we need really good high
0:06:59	precision patterns so we try to get them
0:07:04	using the linguistic pattern learner autoslog-ts
0:07:08	so autoslog-ts developed by riloff in ninety six is a weakly supervised pattern
0:07:11	learner
0:07:13	and we use it to extract lexico-syntactic patterns highly associated with both sarcastic and not
0:07:18	sarcastic utterances
0:07:19	so the way that works is that it has a bunch of pattern templates that
0:07:23	are defined so things like
0:07:25	some sort of a subject followed by a passive verb phrase et cetera
0:07:29	and it uses these patterns to then find instantiations in the text and then ranks
0:07:33	these different instantiations based on probability of occurrence in a certain class and frequency of
0:07:38	occurrence
0:07:39	so something like if you had the sentence in your data there are millions of
0:07:43	people saying all sorts of stupid things about the president
0:07:46	and you ran autoslog-ts on this
0:07:49	for example it would match this noun phrase preposition
0:07:53	noun phrase pattern
0:07:55	millions of people
0:07:57	and then if this pattern was very frequent and highly
0:08:00	probable in sarcasm then it would float up to the top of our
0:08:03	ranked list
0:08:06	so we do this
0:08:09	and give each extraction pattern a frequency threshold theta f and a probability threshold theta p
0:08:13	and we classify a post as belonging to a class if it has at least n of
0:08:17	those patterns
0:08:20	so the first round that we observe
0:08:22	looking at the small sample data
0:08:24	is
0:08:25	so here's some examples so something like say about your head get over a current
0:08:30	sarcastic posts
0:08:31	with these frequencies and probabilities of association
0:08:34	and things like
0:08:35	natural signal selection big thing area of our probabilities not sarcastic post
0:08:40	and just to kind of sparse
0:08:43	we find that the not sarcastic class contains
0:08:45	a lot of very technical jargon scientific language topic specific things
0:08:49	and then we can get
0:08:50	high precision when classifying posts based on just these templates
0:08:55	it's up to about eighty percent
0:08:57	whereas the sarcastic classes you can see are much more varied not high precision
0:09:01	thirty percent
0:09:03	and so it's difficult to do bootstrapping
0:09:06	on data where the precision of these patterns is relatively low
0:09:11	so
0:09:13	we decided to make use of this high precision not sarcastic set of patterns that
0:09:17	we can collect
0:09:18	to actually expand our data trying to find posts that would be good to get
0:09:23	annotated
0:09:24	that we think would have a higher probability than ten percent of being sarcastic
0:09:28	based on that original estimate from a sample of debate forums data
0:09:32	so using a pool of thirty k posts we filter out those that we think are
0:09:36	not sarcastic posts
0:09:38	so posts that contain any of those not sarcastic patterns that we identified
0:09:43	and we end up with about eleven k posts that we believe have higher likelihood
0:09:47	of being sarcastic and we put those out for annotation on mechanical turk
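The filtering step can be sketched as below. This is a simplified illustration: the cues are treated as lowercase substrings, whereas the talk uses learned lexico-syntactic patterns, and the example posts and cue strings are invented.

```python
def filter_pool(pool, not_sarcastic_cues):
    """Drop every post that contains any high-precision not-sarcastic cue."""
    kept = []
    for post in pool:
        text = post.lower()
        if not any(cue in text for cue in not_sarcastic_cues):
            kept.append(post)
    return kept


# Hypothetical example: two technical posts are removed, one candidate remains.
pool = [
    "Natural selection explains this",
    "Oh really, genius",
    "The fossil record shows it",
]
candidates = filter_pool(pool, ["natural selection", "fossil record"])
```

The posts that survive are not labeled sarcastic; they are only more likely to be sarcastic than a random sample, which is why they still go out for annotation.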
0:09:51	and the way the annotation task looks is they get a definition of
0:09:54	sarcasm and an example of responses that contain sarcasm
0:09:58	and don't contain sarcasm
0:10:00	and then we show them a quote response pair so this is like a dialogic
0:10:03	pair where we have a dialogic parent and the response and we ask them to
0:10:07	identify sarcasm in the response
0:10:10	so that's what our annotators are seeing
0:10:14	and then using this method we're able to skew the distribution of sarcasm from
0:10:18	ten percent up to thirty one percent
0:10:21	so getting annotations for that pool of eleven k
0:10:26	depending on where we set our agreement threshold we're able to skew this distribution quite
0:10:30	high
0:10:31	so here from nineteen to twenty three percent using this relatively conservative
0:10:35	threshold of six out of nine annotators agreeing
0:10:38	that a post is sarcastic
0:10:40	since it's so subjective and diverse we wanna make sure that our
0:10:44	annotations are
0:10:45	clean
0:10:46	so that's why we use a relatively high threshold
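To make the effect of the agreement threshold concrete, here is a small sketch. The vote lists are invented for illustration; only the idea of requiring a minimum number of annotators out of nine to say "sarcastic" comes from the talk.

```python
def sarcastic_fraction(all_votes, threshold):
    """Fraction of posts labeled sarcastic when at least `threshold` annotators agree.

    all_votes: one list of booleans per post, True meaning "sarcastic".
    """
    labeled = [sum(votes) >= threshold for votes in all_votes]
    return sum(labeled) / len(labeled)


# Hypothetical votes from nine annotators on four posts.
votes = [
    [True] * 7 + [False] * 2,
    [True] * 6 + [False] * 3,
    [True] * 4 + [False] * 5,
    [False] * 9,
]
```

With these toy votes, a threshold of six labels half of the posts sarcastic, while a stricter threshold of seven labels a quarter, so the reported share of sarcasm moves with the threshold.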
0:10:52	so having more data
0:10:54	means we can do better at the bootstrapping task but we
0:10:58	still observe some of the same trends
0:11:00	so highly precise not sarcastic patterns less precise sarcastic
0:11:05	and we're still not quite at the point we want to be at for
0:11:08	bootstrapping
0:11:09	so
0:11:10	kind of given
0:11:12	the diversity of the data we decide to revisit that
0:11:15	categorization i talked about earlier
0:11:18	so sarcasm rhetorical questions hyperbole understatement jocularity
0:11:23	so we make this observation that some of these lexico-syntactic cues are frequently used sarcastically
0:11:30	so for example
0:11:31	i
0:11:32	well
0:11:32	let's all copper that great argument revolution as well
0:11:35	well
0:11:37	the what's your plan how to
0:11:40	how to realistic my friend
0:11:43	interesting someone hijacked your account role
0:11:46	central
0:11:47	so pretty funny and really a combination of words expel an arm mean to expel
0:11:53	arms louse use the creative genius
0:11:55	so kind of these different terms terms like this
0:12:00	are pretty prevalent in sarcastic posts and we try to make use of this observation
0:12:05	in our data collection
0:12:08	so the way we do that is
0:12:12	we develop regexes to search for different patterns in our unannotated data
0:12:16	so we get annotations for different things that we think are quite prevalent in the data
0:12:20	things like well
0:12:22	and things like
0:12:24	all the all of these ones pretty much fantastic et cetera
0:12:28	and we find that we're able to get again distributions that are much higher than
0:12:31	ten percent searching for posts that only contain a single cue so it's interesting to
0:12:37	note that just a single cue has such a large distribution of sarcasm so
0:12:41	something like well
0:12:42	used forty four percent of the time about
0:12:44	about something post
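A single-cue search of this kind can be sketched with Python's `re` module. The cue, the word-boundary matching and the annotated examples below are assumptions for illustration, not the regexes used for the corpus.

```python
import re


def cue_pattern(cue):
    """Whole-word, case-insensitive regex for a single lexical cue."""
    return re.compile(r"\b" + re.escape(cue) + r"\b", re.IGNORECASE)


def sarcasm_rate_for_cue(cue, annotated):
    """Share of posts containing the cue that were annotated sarcastic.

    annotated: list of (text, is_sarcastic). Returns None if the cue never occurs.
    """
    pattern = cue_pattern(cue)
    labels = [is_sarc for text, is_sarc in annotated if pattern.search(text)]
    return sum(labels) / len(labels) if labels else None


# Hypothetical annotated posts.
annotated = [
    ("Oh really, how clever", True),
    ("Is that really true", False),
    ("really? wow", True),
    ("nothing here", False),
    ("surreally odd", True),  # not a whole-word match for "really"
]
rate = sarcasm_rate_for_cue("really", annotated)
```

Comparing this per-cue rate with the roughly ten percent base rate is what makes a cue worth searching for when collecting posts to annotate.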
0:12:46	so using these observations we begin constructing our subcorpora
0:12:51	one for rhetorical questions and one for hyperbole
0:12:55	and the way we gather more data for this is that we observe that they're
0:12:58	used both sarcastically and not sarcastically for argumentation
0:13:02	and we use this middle of post heuristic to estimate whether
0:13:07	a question is actually used rhetorically or not
0:13:09	so when a speaker
0:13:11	asks a question then continues on with their turn they're not giving a chance
0:13:15	for
0:13:17	the listener to actually respond and so it's a question at least that doesn't require
0:13:21	an answer from someone else
0:13:23	in the view of the writer
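The middle-of-post heuristic can be approximated as follows. This is a rough sketch assuming naive punctuation-based sentence splitting; the function name and the splitting rule are illustrative choices, not the method used for the corpus.

```python
import re


def rhetorical_candidates(post):
    """Return (question, continuation) pairs for questions not at the end of a post.

    A question followed by more text from the same writer is treated as a
    candidate rhetorical question, since the writer keeps the turn.
    """
    # Naive segmentation: text up to and including sentence-final punctuation.
    segments = [s.strip() for s in re.findall(r"[^.!?]+[.!?]*", post)]
    segments = [s for s in segments if s]
    candidates = []
    for i, segment in enumerate(segments[:-1]):
        if segment.endswith("?"):
            candidates.append((segment, segments[i + 1]))
    return candidates
```

A question that closes the post is skipped, because the writer may genuinely be handing the turn over for an answer.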
0:13:25	so we do a little pilot annotation and find that seventy five percent of these rhetorical
0:13:29	questions that we gather in this way are in fact
0:13:34	annotated to be
0:13:35	rhetorical
0:13:36	and we do annotations of these new posts ending up with eight hundred fifty one
0:13:40	posts per class so something like do you wish to not have a logical to
0:13:44	be
0:13:45	already then god bless you anyway
0:13:47	proof that you can't prove that i got
0:13:49	and given anything but insult et cetera so these things where in
0:13:53	the same post someone is asking questions and going on with their turn
0:13:58	the second subcorpus we look at is hyperbole so hyperbole exaggerates a situation we use
0:14:03	intensifiers to capture these sorts of instances and we can get more annotations so colston
0:14:08	and o'brien cite this sort of situational scale this contrast effect i was
0:14:13	talking about earlier
0:14:14	so hyperbole can shift utterances across the scale so shift something into extremely positive
0:14:21	and away from literal and also into extremely negative and away from literal and so
0:14:27	intensifiers kind of serve this purpose
0:14:28	so something like wow i'm so amazed by your comeback skills
0:14:33	or i'm so impressed by your intellectual argument
0:14:35	things like that
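Searching for intensifiers can be sketched with a small regex. The intensifier list here is an assumption for illustration, and the pattern simply captures the word that follows the intensifier without checking that it is an adjective or adverb.

```python
import re

# Assumed, illustrative list; the talk does not enumerate the intensifiers used.
INTENSIFIERS = ["so", "very", "really", "totally", "absolutely"]
PATTERN = re.compile(r"\b(" + "|".join(INTENSIFIERS) + r")\s+(\w+)", re.IGNORECASE)


def intensified_phrases(post):
    """Return (intensifier, following_word) pairs found in a post."""
    return [(m.group(1).lower(), m.group(2).lower()) for m in PATTERN.finditer(post)]
```

For "wow i'm so amazed by your comeback skills" this returns the pair ("so", "amazed"). It also fires on non-hyperbolic uses such as "so what", which is one reason matched posts still need annotation.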
0:14:38	so the statistics for our final corpus we get around six thousand
0:14:43	five hundred posts for our generic sarcasm corpus
0:14:46	and then rhetorical questions and hyperbole with this distribution and more information on the dataset
0:14:51	is available there
0:14:53	it's in the paper
0:14:55	so to kind of validate the quality of our corpus
0:15:01	we do simple experiments using very simple features bag of words and word2vec features
0:15:06	noting previous work has achieved about seventy percent with more complex features
0:15:11	and we end up with distributions that are higher than that so we get we
0:15:16	do this kind of
0:15:18	segmented
0:15:20	segmented set of experiments where we test at different dataset sizes
0:15:23	and we see that our f-measures continue to increase our peak right now is
0:15:27	seventy four with these simple features
0:15:29	so that warrants you know expanding our dataset even more
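The segmented experiment, training on growing portions of the data and tracking F-measure, can be sketched as below. The `fit` argument is a hypothetical callable that takes a training subset and returns a prediction function; no particular classifier or feature set from the talk is implied.

```python
def f_measure(gold, pred, positive=True):
    """F1 for the positive class from parallel lists of gold and predicted labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def learning_curve(train, test, fit, sizes):
    """Train on the first n examples for each n in sizes and score on a fixed test set.

    train, test: lists of (text, label); fit(subset) must return predict(text) -> label.
    """
    gold = [label for _, label in test]
    curve = []
    for n in sizes:
        predict = fit(train[:n])
        pred = [predict(text) for text, _ in test]
        curve.append((n, f_measure(gold, pred)))
    return curve
```

If the scores in the returned curve are still rising at the largest size, that is the kind of evidence that suggests more annotated data would help.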
0:15:33	also we do again our weakly supervised experiments with autoslog-ts just to see what
0:15:38	sorts of precisions we can get now for bootstrapping
0:15:41	and we see much higher precisions than we were getting before at reasonable recalls
0:15:45	for bootstrapping so that's good news as well
0:15:47	so now we could expand our method to be weakly supervised and gather more data more
0:15:51	quickly
0:15:53	and these are the numbers of new patterns that we learned so patterns that we
0:15:56	never searched for in the original data
0:15:59	so we're learning a lot of new patterns that we didn't originally search
0:16:02	for
0:16:03	for all of the datasets
0:16:05	and then some linguistic analysis quickly
0:16:09	so we aim to characterize the differences between our datasets so again using some autoslog-ts
0:16:14	instantiations in our generic data we see these
0:16:20	creative sorts of different instantiations for sarcastic posts whereas again the not sarcastic posts have
0:16:26	these highly
0:16:28	technical jargon sort of terminologies
0:16:31	for the rhetorical questions we observe a lot of the same properties for the not
0:16:36	sarcastic class
0:16:38	but for the sarcastic class we observe that
0:16:40	there's a lot of attacks on basic human abilities right in these debate forum dialogues
0:16:44	so people say things like can you read can you write
0:16:46	do you understand
0:16:48	so we kind of went through looking at some of the dependency parses on these
0:16:52	sorts of questions
0:16:53	and just found a lot of things that really relate to basic human ability so
0:16:56	people are attacking people
0:16:58	not really attacking their argument that's very prevalent in our debate forums data
0:17:03	and finally for hyperbole we find that the adjective and adverb patterns are really common
0:17:09	even though we don't search for these originally in our regex experiments
0:17:13	so
0:17:14	when there and things like contrast by exclusion used
0:17:24	samples of hyperbole are really interesting that we pick up
0:17:29	so in conclusion we develop a large-scale highly reliable corpus of sarcasm we reduce annotation
0:17:35	cost and effort by skewing the distribution avoiding having to annotate huge pools of
0:17:39	data
0:17:40	and we operationalize lexico-syntactic cues for rhetorical questions and hyperbole
0:17:45	and verify the quality of our corpus empirically and qualitatively
0:17:49	for future directions we wanna do more feature engineering more model selection based on our
0:17:54	linguistic observations
0:17:56	develop more generalisable models of different categories of sarcasm that we haven't looked at
0:18:01	and explore characteristics of our lower agreement data see if there's anything interesting there as
0:18:05	well
0:18:06	thanks
0:18:15	questions
0:18:35	so first of all we began with not looking at those categories
0:18:39	right so we started with this really generic sarcasm so definitely there it's kind of
0:18:44	so long tail right so there's a lotta different exaggerations
0:18:48	definitely the problem
0:18:49	we began initially by looking at just sarcasm in general but it's kind of interesting
0:18:55	to get into the more refined categories and look at how those are different and
0:19:00	yes there's also different sorts of things that we could look at understatement is
0:19:06	quite prevalent as well
0:19:07	so it doesn't only exist in debate forums it's just quite pronounced in the
0:19:11	forums so
0:19:12	good to look at there
0:19:27	right so the question is about the word2vec features so
0:19:31	do we train them
0:19:34	do we train the word2vec model on our corpus or do we use an existing model
0:19:37	so we've done both the results that we're reporting are actually on the google
0:19:40	news trained vectors which is kind of
0:19:43	it correlates with our data as well
0:19:45	the debate forums
0:19:47	we have used our own trained model it didn't perform as well as this probably
0:19:51	because of the smaller amount of data
0:19:53	and the google news is trained on a huge amount of data so that's definitely
0:19:55	worth exploring in the future as well
0:20:12	right so actually i didn't mention the numbers here there's more detail in our
0:20:17	paper but our levels of agreement were about seventy percent for each of
0:20:22	the for each of the tasks and they were actually better for the smaller tasks
0:20:26	where what in generic sarcasm is a little bit are more constraint
0:20:29	i think it's
0:20:31	no that's actually agreement with the majority label so just
0:20:36	and
0:20:37	so it's actually better for the subcategories in fact than the generic
0:20:42	sarcasm it's pretty hard to
0:20:45	to get high agreement annotations rhetorical
0:20:52	so i was wondering about the idea of twelve contrast try so you set
0:20:59	these somewhere that highlights the fact that the entire time and it is some contrast
0:21:02	between let us thing and what you think that element and so i guess
0:21:06	that
0:21:08	and also this idea that there is a meaning that is non
0:21:12	literal right yes so i was thinking about the possible connection with metaphor and
0:21:18	with the task of metaphor detection right and so here you are focusing on trying
0:21:23	to find patterns that can act the rights a constant
0:21:27	but for instance in some work in metaphor detection the goal is to
0:21:31	to
0:21:32	to capture contrast right so what makes a particular use different from the literal use
0:21:37	so by looking at how the sarcastic intended indicate actually be far from the regular
0:21:44	used by was wondering into the so it's a very open question i was wondering
0:21:47	that they have you had thought about the task in
0:21:51	in this their arms
0:21:53	that's really interesting so looking at kind of
0:21:57	maybe trying to measure how far away on sort of a contrast scale that would
0:22:01	definitely be interesting we haven't
0:22:02	done that explicitly but i mean
0:22:04	like the different intensifiers can have different effects so it's kind of
0:22:08	trying to map it across the scale
0:22:13	other questions
0:22:19	a question
0:22:21	it when you're doing the mining of the data
0:22:24	and you're identifying different
0:22:27	phrases that are more associated with sarcasm and non sarcasm
0:22:32	did you do things to make sure that the dataset was not biased you know
0:22:37	because of utilizing those kinds of phrases
0:22:40	so that if later someone wanted to build an automated system to detect
0:22:44	sarcasm and non sarcasm they would just
0:22:47	read your paper and they are gonna go after these phrases 'cause this was used to
0:22:50	construct the corpus
0:22:51	right so for our generic sarcasm corpus that was a random sample
0:22:55	so all of that is not sampled in any way for the rhetorical questions and hyperbole
0:23:01	we would select those posts but
0:23:04	the posts actually contain all sorts of other cues and it's important to note that
0:23:09	if we ever selected a cue it would exist in both sarcastic and not sarcastic
0:23:12	posts
0:23:13	so it's not like you would only find them in one and that's kind of
0:23:16	what made it interesting that you can see those cues used in both sorts of
0:23:20	situations so it wouldn't be biased solely

Oral Session 2: Corpus creation

Shereen Oraby, Vrindavan Harrison, Lena Reed, Ernesto Hernandez, Ellen Riloff and Marilyn Walker