Speech Transcript - Controlling Personality-Based Stylistic Variation with Neural Natural Language Generators

0:00:15	Okay, my name is Lena,
0:00:18	and I'm from the Natural Language and Dialogue Systems lab
0:00:22	at UC Santa Cruz, presenting our paper, Controlling Personality-Based Stylistic Variation
0:00:28	with Neural Natural Language Generators.
0:00:34	so
0:00:37	The problem is that work on task-oriented neural NLG from structured data has focused
0:00:44	on
0:00:44	avoiding semantic errors, which has resulted in
0:00:48	stylistically uninteresting outputs.
0:00:50	so for example
0:00:52	i two references with the
0:00:54	locus generating coca is a stronger describe our holiday and coca
0:01:00	is the low really construct your expressed by holiday and
0:01:04	Both realise
0:01:06	all of the attributes in the MR, but that's really all they do.
0:01:10	So our goal is to train a neural NLG that handles both semantics and stylistic
0:01:16	variation by controlling the input data and the amount of supervision available to the model.
0:01:27	Neural models really need lots of training data to learn a style,
0:01:31	so we use a statistical generator, PERSONAGE, which is able to generate data based
0:01:38	on the Big Five personality traits to create stylistic variation.
0:01:43	We use
0:01:44	five personalities, agreeable, conscientious, disagreeable, extrovert, and unconscientious,
0:01:51	to
0:01:53	generate
0:01:54	data using the train and dev MRs from the E2E
0:01:58	challenge. With PERSONAGE we can systematically control
0:02:02	the types of stylistic variation produced, and we know which
0:02:06	stylistic variation our model
0:02:08	is reproducing. So there are two examples
0:02:11	on the screen, one for the agreeable personality and one for the disagreeable personality.
0:02:16	the remote personality and
0:02:19	part markers like i e
0:02:21	and the disagreeable one hand and the size or like effectively
0:02:25	and or
0:02:27	conversation
0:02:29	and disagreeable is broken up into five sentences, whereas for agreeable
0:02:35	everything is in one sentence.
0:02:44	For our data distribution, we have eighty-eight thousand eight hundred fifty-five total utterances
0:02:49	generated from three thousand seven hundred and eighty-four unique MRs, with seventeen thousand seven
0:02:55	hundred and seventy-one references per personality. For the test set we generate one thousand three hundred
0:03:01	and ninety total utterances
0:03:03	from the unique test MRs, so you get one
0:03:06	reference per personality for each of the five personalities.
0:03:10	So with this data, the MRs for our
0:03:13	train and dev sets are from the E2E challenge, and the test MRs are taken directly
0:03:19	from the test set
0:03:20	of the E2E challenge.
0:03:22	So the distribution of this data comes from the E2E challenge. In
0:03:27	the training data, the number of attributes per MR is a bit more balanced,
0:03:33	mostly four or five attributes
0:03:36	per MR, and the test data
0:03:39	has a lot,
0:03:40	quite a bit more attributes per MR, mostly seven or eight actually.
0:03:46	We think this makes the test set a little harder than the training set.
0:03:52	So there are five types of aggregation operations that PERSONAGE can use
0:03:57	to combine the attributes in the MR. There's the period operation, so "X serves Y. It
0:04:03	is in Z"; the "with" cue, "X is in Y, with Z";
0:04:07	the conjunction operation, "X is Y and it is Z";
0:04:12	merge, where "X is Y, W and Z";
0:04:16	the different areas the lack of four
0:04:19	and the "also" cue, which is
0:04:22	"X has Y, also it
0:04:24	has Z".
0:04:26	Aggregation operations are necessary to combine
0:04:30	attributes together. Looking at the distribution,
0:04:33	most of the personalities use most of the aggregation
0:04:37	operations, but there is still some
0:04:40	variety. So
0:04:41	the disagreeable voice
0:04:44	uses the period operation a lot more than all of the other ones,
0:04:48	and extrovert
0:04:50	is a lot more likely to use the conjunction operation than the other
0:04:56	voices, so we can still see that they are different.
0:05:02	Here's a sample of the pragmatic
0:05:05	markers
0:05:06	that PERSONAGE can
0:05:08	use.
0:05:09	We have about thirty-one binary parameters.
0:05:15	So some of these are the request confirmation, so "let's see what we can
0:05:19	find on X",
0:05:21	where X is the restaurant; emphasizers,
0:05:26	like "really", "basically", "actually", "just"; competence mitigation,
0:05:30	like "come on", "obviously", "everybody knows that".
0:05:34	and include markers such as
0:05:36	however we need it for and
0:05:39	and has a product
0:05:41	markers are necessary
0:05:43	or a grammatically correct
0:05:45	sentence and what utterance
0:05:48	You can see that not all of the personalities
0:05:52	use every pragmatic
0:05:54	marker that can occur.
0:05:58	You end up with some, like the tag question, that are really only used by agreeable.
0:06:05	Many of them are used by multiple personalities, so
0:06:08	one is pretty much equally used by disagreeable and conscientious, and some are a
0:06:14	little bit less balanced, so "you know" is mostly used by extrovert, but agreeable
0:06:19	will also
0:06:21	use the
0:06:22	"you know" marker.
0:06:27	So we begin with the TGen system from Dušek et al., and we have
0:06:32	three different models with varying levels of supervision.
0:06:36	There's the NoSup model, which directly follows the baseline model and has no
0:06:42	supervision. The token model has a single token that
0:06:46	specifies the personality,
0:06:48	similar to machine translation problems.
0:06:51	And our context model directly encodes the thirty-six style parameters, the pragmatic markers and aggregation
0:06:57	operations
0:06:58	from PERSONAGE, as context in a feed-forward network.
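The two supervised input encodings described here can be sketched as follows. This is only an illustration of the idea, not the authors' implementation: the token format, the function names, and the placeholder parameter names (`marker_00` and so on) are assumptions. The only facts taken from the talk are that the token model adds a single personality token and that the context model encodes thirty-six style parameters (five aggregation operations plus about thirty-one pragmatic markers).

```python
from typing import List, Set

# Personality labels as named in the talk.
PERSONALITIES = ["agreeable", "disagreeable", "conscientious",
                 "unconscientious", "extrovert"]

# Hypothetical parameter inventory: 5 aggregation operations plus
# 31 placeholder pragmatic-marker names, giving 36 style parameters.
AGGREGATION = ["period", "with_cue", "conjunction", "merge", "also_cue"]
PRAGMATIC = [f"marker_{i:02d}" for i in range(31)]
STYLE_PARAMETERS = AGGREGATION + PRAGMATIC


def add_personality_token(mr_tokens: List[str], personality: str) -> List[str]:
    """Token-style supervision: prepend one token naming the personality."""
    if personality not in PERSONALITIES:
        raise ValueError(f"unknown personality: {personality}")
    return [f"<{personality}>"] + mr_tokens


def build_context_vector(active: Set[str], names: List[str]) -> List[float]:
    """Context-style supervision: one binary entry per style parameter."""
    unknown = active - set(names)
    if unknown:
        raise ValueError(f"unknown style parameters: {sorted(unknown)}")
    return [1.0 if name in active else 0.0 for name in names]


tokens = add_personality_token(["name", "eatType"], "agreeable")
vector = build_context_vector({"period", "marker_03"}, STYLE_PARAMETERS)
```

In a real system the vector would be fed to the network alongside the encoded MR; how that is wired into the generator is not specified in the talk.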
0:07:06	Here's an example output
0:07:08	from our context model
0:07:12	i e
0:07:13	The realization has no aggregation and no pragmatic markers, so
0:07:19	each attribute is its own sentence, and
0:07:22	there's no variety; it's just realising the attributes.
0:07:27	sar
0:07:28	And I have three examples from the personalities. First, agreeable:
0:07:33	let's see what we can finally it is well is we could use a rating
0:07:38	also with an italian restaurant riverside moderately priced notice right so
0:07:42	also with it in a really friendly easy
0:07:45	So it has a request confirmation and acknowledgement justifications, "I see" and "well", and
0:07:53	then it has a tag question at the end, and it also uses the "also"
0:07:59	cue for aggregation.
0:08:01	The second one
0:08:03	is the unconscientious voice:
0:08:06	god i don't know it's really said at separating also it is moderately priced restaurant
0:08:11	so italian place in riverside and you think you'd friendly
0:08:16	It has the expletive "god"
0:08:18	and an initial rejection with the "I don't know", and this
0:08:23	still uses the "also" cue, as in "also it is".
0:08:27	The final one is
0:08:29	an extrovert:
0:08:30	basically it's really is an italian place of this right and actually moderately priced the
0:08:36	riverside decent reading okay brightly and it's a you know
0:08:40	So this one has emphasizers,
0:08:45	"basically" and "actually", and the "you know" marker, and it only uses merge and conjunction,
0:08:52	so it's just one sentence and there is no use of the period operation.
0:09:00	so
0:09:02	automatic metrics
0:09:05	really or just
0:09:07	the
0:09:10	i really you know why
0:09:12	reward systems whose output is really similar
0:09:18	to the training data, and are inherently bad
0:09:22	for
0:09:23	stylistic variation.
0:09:24	So
0:09:26	our context model does perform the best, but the numbers aren't great.
0:09:32	We are mostly showing these for completeness,
0:09:35	and we propose new metrics for evaluating semantic quality and stylistic variation.
0:09:43	So first we evaluate the semantic quality
0:09:46	using four types of errors, comparing the attributes in the MR to
0:09:52	the realizations. So
0:09:55	the first is deletions, which is when
0:09:57	an attribute in the MR is not realized in the reference;
0:10:02	repetitions, which is where an
0:10:05	attribute is realized in the reference multiple times;
0:10:08	substitutions, which is where an
0:10:11	attribute in the MR is realized in the reference with a different value,
0:10:17	so for example if the MR said it was an Italian restaurant and the reference said a French
0:10:23	restaurant;
0:10:24	what he wants everything you know
0:10:26	and then hallucinations, which is when there is an attribute in the reference that was not in the original MR.
0:10:32	So we have a table here that has
0:10:35	the values for each model and each personality for each of these error types. It is not a very
0:10:42	readable table, and it is hard to tell which one
0:10:45	is doing the best overall, so we
0:10:48	simplified it into a slot error rate,
0:10:52	which is the sum of those four semantic errors over the number of slots,
0:10:59	or attributes.
0:11:00	This is modelled after the word error rate.
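As a worked illustration of the slot error rate just described, the sum of the four semantic error counts divided by the number of slots, here is a minimal sketch. The function and argument names are assumptions made for this example; only the formula itself comes from the talk.

```python
def slot_error_rate(deletions: int, repetitions: int,
                    substitutions: int, hallucinations: int,
                    num_slots: int) -> float:
    """SER = (D + R + S + H) / N, where N is the number of slots in the MR."""
    if num_slots <= 0:
        raise ValueError("num_slots must be positive")
    errors = deletions + repetitions + substitutions + hallucinations
    return errors / num_slots


# An MR with 8 attributes whose realization drops one attribute
# and substitutes the value of another.
ser = slot_error_rate(deletions=1, repetitions=0,
                      substitutions=1, hallucinations=0, num_slots=8)
```

Because repetitions and hallucinations add errors without adding slots, the rate can exceed 1.0, in the same way word error rate can.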
0:11:03	And now we have a much simpler table where you can
0:11:07	actually see the difference between the models, and you can see that NoSup
0:11:12	performs the best, but also that this
0:11:14	comes at the cost of stylistic variation, and that
0:11:18	context isn't really
0:11:19	that much worse.
0:11:24	So
0:11:26	that evaluated the semantic quality, and now we want to measure stylistic variation.
0:11:31	So first we take the Shannon text entropy to see how varied
0:11:36	the results are.
0:11:37	The context model performs the best of the models and is closest to the original
0:11:44	PERSONAGE training data, so it is almost as varied as the original data.
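A minimal sketch of a Shannon entropy measurement over generated text follows. The talk does not say which units the entropy is computed over, so this example assumes lowercase whitespace-separated tokens; the function name is also an assumption.

```python
import math
from collections import Counter
from typing import Iterable


def shannon_entropy(utterances: Iterable[str]) -> float:
    """Entropy in bits of the token distribution across all utterances."""
    counts = Counter(tok for utt in utterances for tok in utt.lower().split())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Two tokens used equally often give exactly one bit of entropy.
balanced = shannon_entropy(["a b", "a b"])
# A single repeated token gives zero entropy.
repeated = shannon_entropy(["a a a"])
```

Higher values mean the output vocabulary is used more evenly, which is the sense of "more varied" here.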
0:11:53	We also want to measure whether the models are faithfully reproducing the pragmatic markers
0:12:00	that each personality uses.
0:12:03	So we
0:12:06	calculated the occurrences of each pragmatic marker for each personality,
0:12:10	and then we get the Pearson correlation between
0:12:15	the PERSONAGE training data and the output for each
0:12:20	model and each personality.
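The comparison described here, counting marker occurrences and correlating the counts between the training data and a model's output, can be sketched as below. The marker list and the substring-based counting are simplifying assumptions for illustration; the talk does not describe how occurrences were detected.

```python
import math
from typing import List, Sequence


def marker_counts(utterances: Sequence[str], markers: Sequence[str]) -> List[int]:
    """Count occurrences of each marker string (crude substring matching)."""
    text = " ".join(u.lower() for u in utterances)
    return [text.count(m) for m in markers]


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical marker inventory and toy data for one personality.
MARKERS = ["you know", "basically", "actually"]
train = ["You know, it is basically cheap.", "It is basically an Italian place."]
output = ["It is basically cheap, you know.", "Basically it is Italian."]
r = pearson(marker_counts(train, MARKERS), marker_counts(output, MARKERS))
```

A value near 1 means the model uses the markers in roughly the same proportions as the training data for that personality; a negative value means the usage pattern is inverted.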
0:12:22	So the context model does best for most of the personalities, except for one where token
0:12:29	performs better.
0:12:31	NoSup
0:12:32	has positive values for two of them, agreeable and conscientious; the rest are actually negatively correlated.
0:12:39	I think this is because conscientious
0:12:42	actually uses few pragmatic markers,
0:12:45	mostly the request confirmation and initial rejection, which are generally at the
0:12:50	very beginning or the very end of the sentence, which makes them easier
0:12:55	to reproduce, and NoSup pretty much exclusively just does that one,
0:13:00	so it's very similar to conscientious.
0:13:06	So we did pretty much the same thing for aggregation
0:13:09	operations, counting occurrences of each
0:13:14	and computing the Pearson correlation for each in the test data.
0:13:19	Again context is performing better than
0:13:23	the others,
0:13:24	except for one case, this time disagreeable.
0:13:28	And
0:13:29	you see that NoSup actually does pretty well here;
0:13:33	it does better than token in a couple of instances. We think this is
0:13:37	because
0:13:40	aggregation operations are necessary; you can't have a sentence
0:13:45	without one.
0:13:46	A sentence can go without pragmatic markers but not without
0:13:51	aggregation operations, so there is more opportunity to do better with the aggregation than with
0:13:58	the pragmatic markers.
0:14:00	Overall, our context model
0:14:03	gives us the best mix of semantic quality and stylistic variation.
0:14:11	So we also evaluated the quality of the outputs with a Mechanical Turk
0:14:17	study.
0:14:18	We took our best performing model,
0:14:20	the context model, and tested whether people
0:14:24	can recognize the personality.
0:14:27	As a baseline we randomly selected a set of ten unique MRs from training
0:14:32	and their references so we gave its workers is very three hundred and i would
0:14:41	the Ten Item Personality
0:14:43	Inventory,
0:14:45	TIPI, and we also asked
0:14:48	them to rate how natural the utterance sounded.
0:14:57	So we evaluated unique MRs
0:15:00	that we generated from the context model in test.
0:15:04	We had five Turkers per HIT, and we measured how
0:15:09	frequently the majority selected the correct TIPI item
0:15:12	versus the opposite item,
0:15:14	to get a ratio, which is what is highlighted.
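One way to read the ratio described here is the fraction of HITs in which the majority of judges picked the correct TIPI item rather than the opposite one. The sketch below implements that reading; the exact definition used in the study is not spelled out in the talk, so the data layout and function name are assumptions.

```python
from typing import List, Sequence


def majority_correct_ratio(judgments_per_hit: Sequence[Sequence[bool]]) -> float:
    """Fraction of HITs where more than half of the judges chose the correct item.

    Each inner sequence holds one boolean per judge:
    True for the correct TIPI item, False for the opposite item.
    """
    if not judgments_per_hit:
        raise ValueError("need at least one HIT")
    wins = sum(1 for votes in judgments_per_hit if sum(votes) > len(votes) / 2)
    return wins / len(judgments_per_hit)


# Two hypothetical HITs with five judges each: the majority is correct
# in the first and wrong in the second.
hits: List[List[bool]] = [
    [True, True, True, False, False],
    [False, False, False, True, True],
]
ratio = majority_correct_ratio(hits)
```

With five judges per HIT there are no ties, so a strict majority is always well defined.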
0:15:20	PERSONAGE
0:15:20	had over fifty percent for
0:15:23	all of the personalities.
0:15:25	Model context
0:15:27	is over fifty percent in everything except agreeable and conscientious.
0:15:32	yes or
0:15:33	The lowest percentages do seem to follow the trend of
0:15:37	PERSONAGE, just a little bit lower.
0:15:42	We also got the average rating on a one to seven scale from the TIPI,
0:15:48	and we basically took the average rating for the item
0:15:52	in each case, so if it's agreeable, it's the average rating for the agreeable
0:15:58	item.
0:16:00	it's but a
0:16:03	the average for all the time for percentage most of them for the context model
0:16:08	agreeable it it'd
0:16:13	about
0:16:16	that the same and then for unconsciously and you know
0:16:21	condescension it also has a little better than the original personage
0:16:29	We also got the naturalness rating, again one to seven.
0:16:35	i
0:16:35	Model context again has a couple of instances where it actually sounds a little more natural
0:16:41	than the original data, so disagreeable,
0:16:44	and then the same thing with unconscientious.
0:16:47	people are models that's where k
0:16:51	more natural in overall results
0:16:58	So we also tested our model for generalization
0:17:02	ability,
0:17:04	and we tried to generate output that matches characteristics of
0:17:09	multiple personalities. So for example, we take
0:17:12	the disagreeable voice and the conscientious voice,
0:17:16	and we combine them and get new sentences.
0:17:20	Here is one example of
0:17:23	our model output for a disagreeable and conscientious personality.
0:17:29	We
0:17:30	to evaluate it we look at
0:17:32	the average occurrence of the different features.
0:17:37	Here are two examples.
0:17:38	The period aggregation is a lot more common
0:17:42	in disagreeable than
0:17:43	in conscientious,
0:17:44	and when we combine them the results are sort of in the middle, and the same with
0:17:49	the
0:17:50	expletive pragmatic marker: it's much more common in disagreeable than
0:17:54	conscientious,
0:17:55	and you can see the result is again in between. So we think this indicates
0:18:00	that the model is not sticking
0:18:03	one way or the other; it is
0:18:07	sort of averaging them and generalizing to new data as well.
0:18:12	and this is from a model that we only trained on a single personality train
0:18:17	it on x personalities so word tells me to have a paper speech
0:18:23	a neural model to voice models p-expression novel personality or we can t s
0:18:31	o solution we show
0:18:34	that neural models can be used to generate output that is both syntactically and semantically
0:18:38	correct,
0:18:39	based on the E2E generation challenge,
0:18:42	and that neural models are able to produce stylistic variation in a controlled
0:18:47	setting,
0:18:48	based on the type of data they are trained on and the amount of supervision
0:18:52	they are given in training. We're currently
0:18:55	focusing on other forms of stylistic variation.
0:19:00	our dataset is available at that link
0:19:37	i
0:19:41	well
0:19:43	so all these results are actually people have test i don't is with first which
0:19:49	i
0:19:52	we got around the same results, so we really just want to show
0:19:56	that
0:19:57	the neural model, model context, is
0:20:01	still producing these personalities in a way that is recognisable, so
0:20:07	people can still tell that the conscientious voice
0:20:10	is conscientious, and
0:20:15	it's not just that we're looking at these pragmatic markers and thinking that it
0:20:19	is actually still the same personality as in training.

Controlling Personality-Based Stylistic Variation with Neural Natural Language Generators

Oral Session 2: Generation 2

Shereen Oraby, Lena Reed, Shubhangi Tandon, Sharath T.S., Stephanie Lukin, Marilyn Walker