
My name is Elahe Rahimtoroghi. I work with Marilyn Walker at the Natural Language and Dialogue Systems Lab at UC Santa Cruz.

I'm going to talk about learning fine-grained knowledge about contingent relations between everyday events.

Our goal in this work is to capture commonsense knowledge about the fine-grained events of everyday experience, the events that occur in the everyday lives of people, like preparing food and getting ready, or an alarm going off triggering waking up and getting out of bed.

We believe that the relation between these events is a contingency relation, based on the definition of contingency from the Penn Discourse Treebank, which has two types: cause and condition.

Another motivation for our work is that much of the user-generated content on social media is produced by ordinary people telling stories about their daily lives, and stories are rich in commonsense knowledge and in this contingency relation between events. I have two examples here from our dataset. Our dataset is drawn from, actually is a subset of, the Spinn3r corpus, which has millions of blog posts, and it contains personal stories written by people about their daily lives.


In the examples you can see that there are sequences of coherent events in the stories. For example, the first one is about going on a camping trip, and you have a sequence of events: they pack everything, they wake up in the morning to get to the camping ground, they find a place and set up the tent; so there is a sequence of events. The second story is about witnessing a storm: the storm makes landfall, the wind blows, a tree falls, and then the people start cleaning up, getting rid of the fallen trees and branches.

This commonsense knowledge, this contingency relation between events, is present in stories implicitly, and we want to learn it. I will show in this talk that this fine-grained knowledge is not found in previous work on the extraction of narrative events and event collections.

Much of the previous work does not focus particularly on the relation between the events; they characterize what they learn as a collection of events that tend to co-occur, and they are vague about what the relation between the sequence of events is. Previous work has also mostly focused on the newswire genre, so the type of knowledge it can learn is limited to the newsworthy events that appear in news articles, like bombings or explosions. As for evaluation, they mostly used the narrative cloze task, which we believe is not the right way to evaluate this type of knowledge.


In our work we focus on the contingency relation between events. We use personal blog stories as the dataset, so we can learn new types of knowledge about events other than the newsworthy ones. We also use two evaluation methods: one of them is inspired and motivated by previous work, and the other one is completely new.

The stories in this dataset tend to be told in chronological order, so there is a temporal order between the events told in a story, and this is great because temporal order between events is a strong cue to contingency. This makes the dataset suitable for our task. But this dataset comes with its own challenges.

It has a more informal structure compared to news articles, which are well structured, and the structure of these stories is more similar to oral narrative. In one of our previous studies we applied the oral narrative framework of Labov and Waletzky to label the clauses in these personal stories, and we showed that only about a third of the sentences in the personal narratives describe actions and events; the other two thirds talk about the background and describe the emotional state of the narrator.

I have an example here. I'm not going to describe what the labels are, but you can see that there is some background at the beginning of the story, then there are some actions and events, about the person getting arrested by the traffic police, and then the narrator's commentary at the end. So it's not all events; there are a lot of other things going on, which makes this data more challenging.


So we need novel methods to learn useful relations between events from this dataset, and I'm going to show in the experiments that if we apply the methods that work on news articles to extract the events, we won't get good results on this dataset.

What are events? We define an event as a verb with three arguments: the subject, the direct object, and the particle. Here are some examples. This definition is motivated by previous work by Pichotta and Mooney in 2014, who showed that the multi-argument representation is richer and more capable of capturing the interaction between events.

They used the verb, subject, and object; we also added the particle, because we think it is also necessary for conveying the right meaning of a verb. For example, the first event in the table is putting up a tent: you have the verb 'put', the direct object 'tent', and the particle 'up', and you can see how all these arguments contribute to the meaning of the event. 'Put' by itself has a different meaning than putting up a tent, and the particle matters too: 'put' and 'put up' tell you different things. In our data this is especially important, because the writing is more informal and has a lot of phrasal verbs, so it's important to have all the arguments in the event representation.

For extracting events, we use the Stanford dependency parser and use the dependency parse trees to extract the verbs and their arguments. We also use the Stanford named entity recognizer to do a little generalization of the arguments: for example, terms or phrases that refer to a location are mapped to their type, LOCATION, and the same for PERSON and DATE, and so on.
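As a rough sketch of this extraction step (the data structure below is a simplified assumption, not the Stanford parser's actual output format), an event tuple can be collected from dependency triples like this:

```python
# Sketch: build (subject, verb, particle, direct object) event tuples from
# dependency triples. NER types (PERSON, LOCATION, DATE) are assumed to have
# already replaced the raw argument words.

def extract_events(deps):
    """deps: list of (token, dependency_label, head_verb) triples."""
    args = {}
    for token, label, head in deps:
        if label in ("nsubj", "dobj", "prt"):
            args.setdefault(head, {})[label] = token
    return [(a.get("nsubj"), verb, a.get("prt"), a.get("dobj"))
            for verb, a in args.items()]

deps = [("PERSON", "nsubj", "put"),
        ("up", "prt", "put"),
        ("tent", "dobj", "put")]
print(extract_events(deps))  # [('PERSON', 'put', 'up', 'tent')]
```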

So, the contributions of our work: we have a data collection step in which we generate topic-specific sets of personal stories using a bootstrapping algorithm. Then we directly compare our method for extracting these contingency relations between events on a general-domain set of stories and also on the topic-specific data that we have collected. We will show that we can learn more fine-grained, richer, and more interesting knowledge from the topic-specific corpus, and our model works significantly better on the topic-specific corpus; this is the first time this comparison has been done directly between these two types of datasets for event collection. We will also show that this improvement is possible even with a smaller amount of data on the topic-specific corpus. In the experiments we directly compare our work to the most relevant previous work, and we use two evaluation methods for these experiments.

Now, the data collection part of our work. We have an unsupervised algorithm for generating a topic-specific dataset using a bootstrapping method. The corpus here is the general, unannotated blogs corpus that has all the personal blog stories. We first manually label a small seed set for the bootstrapping, about two hundred to three hundred stories for each topic, and we feed that into AutoSlog, the pattern learner. It generates event patterns specific to that topic; for example, if you're looking at the camping trip stories, it can generate a pattern like an NP followed by a preposition followed by an optional NP, where the head of the first noun phrase is 'camping'. So it generates event patterns that are strongly correlated with the topic, and then we use these patterns to bootstrap and automatically label more stories on the topic from the corpus: we feed the unlabeled data to AutoSlog, and based on how many of a topic's patterns we can find in an unlabeled story, we label that story with the topic. So, starting from about two to three hundred stories per topic, with one round of bootstrapping we generated about a thousand newly labeled stories.

Here I'm presenting the results on two topics from our corpus: the camping stories and stories about witnessing a major storm. From about three hundred seed stories each, we generated an expanded corpus of about a thousand.
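A minimal sketch of the labeling step, assuming bootstrapped patterns are matched as simple strings against a story's extracted events (the threshold here is an illustrative assumption, not the value used in the actual system):

```python
# Sketch: label an unlabeled story with a topic when enough of that topic's
# bootstrapped event patterns appear among the story's events.

def matches_topic(story_events, topic_patterns, threshold=2):
    hits = sum(1 for p in topic_patterns if p in story_events)
    return hits >= threshold

camping_patterns = {"go camping", "set up tent", "pack up"}
story = ["wake up", "pack up", "drive", "go camping", "set up tent"]
print(matches_topic(story, camping_patterns))  # True
```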

For learning the contingency relation between events, we use the causal potential measure introduced by Beamer and Girju in 2009. It's an unsupervised distributional measure: it measures the tendency of an event pair to encode a causal relation. An event pair with a higher causal potential score has a higher probability of occurring in a causal context. The first component here is pointwise mutual information, and the second one takes into account the temporal order between the events: if the events co-occur more often in this particular order, we get a higher causal potential score. And this is great for our corpus, because the events tend to be told in the right temporal order.

Then we calculate the causal potential for every pair of adjacent events in the corpus, using a skip-2 bigram model. As I showed in the example, not all sentences are events, and events can be interrupted by non-events; that's why we use the skip-2 bigram model, which defines two events to be adjacent if they are within two or fewer events of each other.
most of the previous work use this narrative closed test

for evaluating their

sequence of events that they have learned

now suppose that a sequence of narrative events in a document from which one event

has been removed

and the task is to predict the remove event

no we believe that this is not

suitable for our task for evaluating the coherence of events

and also in the previous work by each other and we they show that unigram

model results are nearly as good as or more complicated more sophisticated models

on this task so it's

not good for a capturing the

all the capabilities of the model

So we are proposing a new evaluation method, motivated by COPA, an evaluation framework for commonsense causal reasoning that uses two-choice questions. We automatically generate these two-choice questions from a separate held-out test set that we have for each dataset. Each two-choice question consists of one question event, which is extracted from the test set, so it occurs in the test set. One of the choices, the correct answer, is the event that follows the question event in the test set. The other choice, the incorrect answer, is randomly generated from the list of all events that we have. So if in the test set the question event is followed by the first choice, the second choice is drawn at random. The model is supposed to predict which one of the two choices is more likely to have a contingency relation with the event in the question, and then we calculate accuracy based on the answers the model generates.
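The evaluation loop itself is simple; a sketch, with a toy lookup scorer standing in for any of the models (the pair scores and event names below are made up for illustration):

```python
# Sketch: COPA-style two-choice accuracy. `score` is any pairwise scorer
# (unigram, bigram, SCP, or causal potential); the model picks the choice
# that scores higher with the question event.

def two_choice_accuracy(questions, score):
    """questions: list of (question_event, correct_choice, distractor)."""
    right = sum(1 for q, pos, neg in questions if score(q, pos) > score(q, neg))
    return right / len(questions)

table = {("pitch tent", "build fire"): 2.0}          # made-up scores
score = lambda a, b: table.get((a, b), 0.0)
questions = [("pitch tent", "build fire", "check email")]
print(two_choice_accuracy(questions, score))  # 1.0
```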

For previous work, we compare indirectly to the work by Balasubramanian et al. in 2013. They generate what they call Rel-gram tuples, basically pairs of relational tuples of events, so they generate pairs of events that tend to co-occur together. They use news articles as their dataset, and they use co-occurrence statistics based on symmetric conditional probability, SCP here, which basically just combines the bigram model in two directions. The corpus of Rel-grams they learned is publicly available; you can access it through an online search interface. And they showed in their work that they outperform the previous work on learning narrative events.
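For reference, a sketch of the SCP baseline, the bigram conditional probability taken in both directions and multiplied (the counts here are made up):

```python
# Sketch: symmetric conditional probability, SCP(e1, e2) = P(e2|e1) * P(e1|e2),
# estimated from unigram and pair counts.

def scp(e1, e2, unigrams, pairs):
    c = pairs.get((e1, e2), 0) + pairs.get((e2, e1), 0)
    return (c / unigrams[e1]) * (c / unigrams[e2])

unigrams = {"wind blow": 4, "tree fall": 2}
pairs = {("wind blow", "tree fall"): 2}
print(scp("wind blow", "tree fall", unigrams, pairs))  # 0.5
```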

We designed two experiments to compare to this previous work. We compare the content of what we have learned, to show that what we learned does not exist in the previous collections, and we also apply their model to our dataset, to show that a model that works on more structured data like news articles cannot get good results on our data. For the baselines, we use the unigram model, which is basically the prior probability distribution of the events; the bigram model, which is the bigram probability of the event pair, again using the skip-2 bigram model; and the event SCP, the symmetric conditional probability from the Rel-grams work. Our main method here is the causal potential.

We have two datasets. In the general-domain stories dataset, the stories are randomly selected from the corpus and don't have a specific theme or topic; we have four thousand stories in the training set and two hundred stories in the held-out test set. We also have the topic-specific datasets; here I will present results on two topics, the camping stories and the stories about witnessing a storm. Here is the split of the dataset for each topic: we split the hand-labeled data into test and training, so we have the hand-labeled test set and the hand-labeled training set, and then for each topic we create a larger training set that has the hand-labeled training data plus the bootstrapped data, to see whether the bootstrapping is helpful at all.

Here are the results; this is the accuracy on the two-choice task for each topic. I'm reporting the results of the baselines on the largest training set for each topic, the hand-labeled plus the bootstrapped data, because the hand-labeled-only results are just a little worse, so I'm only reporting the best results for the baselines. Then for causal potential, I have the results for both the hand-labeled training set, which is small, about one to two hundred stories, and the largest training set, about a thousand stories, which is the hand-labeled plus the bootstrapped data.

Here you can see that the causal potential results are significantly stronger than all of the baselines, and also that the results on the topic-specific datasets are significantly stronger than the results on the general domain. Even for causal potential, the accuracy on the general domain is 0.51, but on the topic-specific data, even with a smaller dataset, we get sixty-eight percent accuracy for one topic and about eighty-eight percent for the other. Also, if you compare the results on the smaller hand-labeled training set to the larger training set with the bootstrapped data, they show that the additional training data collected by bootstrapping can improve the results, so the bootstrapping was actually effective.


The event SCP and the bigram models that were used in previous work for generating these event collections did not work very well on our dataset.

The next thing is that we want to compare the content of what we have learned and see whether it actually exists in the previous collections or not. Here I want to show the results of comparing the event pairs we extracted from the camping trip stories against the Rel-gram tuples. The Rel-grams are not sorted by topic, so what we did to get the ones related to camping is that we took our top ten event patterns generated in the bootstrapping and used them to search the interface.

For each event pattern, for example 'go camping', which is one of the event patterns that we have, we search it in the interface and get all the pairs where at least one of the events is 'go camping'. Then we apply the filtering and ranking that was used in their paper: they filter based on frequency and rank based on the symmetric conditional probability metric. Then we evaluate the top-ranked pairs on our next evaluation task, which I will describe next.

Here are some examples of the pairs extracted for camping from the Rel-grams, and you can see, if you look at the second events in these pairs, that this is not about a camping trip like the ones we want to find; it is mostly about things like aid movements or refugee camps.


So we propose a new evaluation method on Amazon Mechanical Turk for evaluating the topic-specific contingent event pairs. We evaluate the pairs based on their topic relevance and their contingency relation. We asked the annotators to rate the pairs on a scale of zero to three: zero means the events are not contingent; one, the events are contingent but not relevant to the topic; two, contingent and somewhat relevant to the topic; and three, the strongest, the events are contingent and strongly about the topic. To make the event representation more readable for annotators, we map it to subject, verb, particle, direct object; for example, an event with a PERSON subject is rendered as 'person' followed by the verb, particle, and object, which is more readable for the users.

Here are the results of the Rel-gram evaluation: only seven percent of the pairs are judged to be contingent and topic-relevant, and we think this is because the camping trip topic actually does not exist in that collection. Overall, only forty-two percent are judged to be contingent.

We evaluated our topic-specific contingent event pairs in the same way, applying the same filtering and ranking method: selecting the pairs that occur more than five times, filtering by the same event patterns, ranking by the causal potential model, and evaluating the top hundred for each topic. Here are the results, showing that for the camping topic forty-four percent, and for the storm topic thirty-three percent, of the pairs are contingent and topic-relevant, and overall about eighty percent of all the pairs that we learned are contingent. The average inter-annotator reliability on these Mechanical Turk tasks was 0.73, which shows substantial agreement.

Finally, I want to show some examples of the event pairs. We showed that the results on the topic-specific data are stronger, and even by looking at the examples you can see that the knowledge we learn is more interesting, like 'climb, find a rock', or pairs from the storm topic about the power going out and trees getting crushed, whereas the ones from the general-domain dataset are more general, like 'person walk down trail' and similar generic pairs.


So, we learn a new type of knowledge, contingency knowledge about everyday events, that is not available in the previous work on the newswire genre. Our data collection step uses an unsupervised bootstrapping model to create topic-specific data, and this is the first work that directly compares the results on topic-specific versus general-domain story data. We have two new evaluation methods, one of them completely new on Mechanical Turk and the other one inspired by the COPA task. And the results I have already talked about. Thank you.

I think that's true; if you have a dataset that is specific to a topic, it's easier to learn, and the methods will be more effective.

That is definitely an interesting idea. I have tried word2vec models on the corpus, but the results didn't look good; the events that are considered similar, when I look at them, are actually not similar for our task.

The labeling is only for the stories, not the event types.

Right, so the event patterns are generated automatically by AutoSlog. You just need to find some topics, like, come up with the topics: you think, okay, people write about, for example, storms, and then you go look at the corpus and try to find a small seed set of stories that are on that topic.

What I did initially was run topic modeling. The topics it generates are not coherent, but they help you get some idea of what topics exist in the corpus, so you know you can go look for the stories about, for example, going on a camping trip. But once you come up with the topics, I think you can expand this, and with more rounds of bootstrapping you can collect more data.