Okay, my name is [inaudible], and I'm from the Natural Language and Dialogue Systems Lab at UC Santa Cruz, presenting the paper "Controlling Personality-Based Stylistic Variation with Neural Natural Language Generators".


The problem is that work on task-oriented neural NLG from structured data has focused on avoiding semantic errors, which has resulted in stylistically uninteresting outputs.

So for example, here are two references for the same MR describing a restaurant. Both realize all of the attributes in the MR, but that's really all they do.

So our goal is to train a neural NLG to jointly control semantic and stylistic variation, by controlling the input data and the amount of supervision available to the model. Neural models need lots of training data to learn these styles.

So we use a statistical generator, PERSONAGE, which is able to generate data based on the Big Five personality traits, to create stylistic variation. We use five personalities: agreeable, conscientious, disagreeable, extravert, and unconscientious,


to generate data using the train and dev MRs from the E2E challenge. With PERSONAGE we can systematically control the types of stylistic variation produced, and we know exactly which parameters led to the stylistic variation in our inputs and outputs,

so it's reproducible. Here are two examples on the screen, one for the agreeable personality and one for the disagreeable personality. The agreeable one is more polite and uses softening pragmatic markers, while the disagreeable one uses harsher emphasizers and expletives. The disagreeable output is also broken up into five sentences, whereas the agreeable one is all in one sentence.

For our data distribution, we have 88,855 total utterances for training, generated from 3,784 unique MRs, with 17,771 references per personality. For test, we generate 1,390 total utterances: for each unique test MR we get one reference per personality, from each personality.

So with this data, the training and dev MRs are taken from the E2E challenge, and the test MRs are taken directly from the test set of the E2E challenge.

So the distribution of this data follows the E2E challenge. In the training data, the number of attributes per MR is a bit more balanced, mostly four or five attributes per MR, while the test data has quite a bit more attributes per MR, mostly seven or eight. We think this makes the test set a little harder than the training set.

So there are five types of aggregation operation that PERSONAGE can use to combine the attributes of the MR. There's the period operation ("X. Y."), the "with" cue ("X, with Y"), the conjunction operation ("X and Y"), the merge operation, which joins attributes that share the same verb, and the "also" cue ("X, also Y"). Aggregation operations are necessary to combine the attributes together. Here is their distribution.
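The five operations can be sketched roughly like this; a toy illustration only, with guessed surface templates, not the actual PERSONAGE implementation:

```python
# Illustrative sketch only -- not the actual PERSONAGE implementation.
# The templates are guesses at the surface form of each operation.
def aggregate(op, x, y):
    """Combine two realized attributes x and y with one aggregation operation."""
    templates = {
        "period":      "{x}. {y}.",       # "X. Y."
        "with_cue":    "{x}, with {y}",   # "X, with Y"
        "conjunction": "{x} and {y}",     # "X and Y"
        "also_cue":    "{x}, also {y}",   # "X, also Y"
    }
    return templates[op].format(x=x, y=y)

def merge(subject, verb, value1, value2):
    """Merge: two values that share the same verb are joined under it."""
    return f"{subject} {verb} {value1} and {value2}"

print(aggregate("period", "X is an Italian place", "it is in riverside"))
# -> "X is an Italian place. it is in riverside."
print(merge("X", "serves", "pasta", "pizza"))
# -> "X serves pasta and pizza"
```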

Most of the personalities use most of the aggregation operations, but there is still some variety. So the disagreeable voice uses the period operation a lot more than all of the other personalities, and extravert is a lot more likely to use the conjunction operation than the others, so we can still see that they are different.

Here are the sample pragmatic markers that PERSONAGE can add. By now we have about thirty-one binary parameters.

So some of these are: the request confirmation ("let's see what we can find on X", "did you say X?"); emphasizers like "really", "basically", "actually", and "just"; competence mitigation ("come on", "obviously", "everybody knows that"); and in-group markers. Not all of the pragmatic markers are necessary for a grammatically correct sentence or a good utterance.
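Conceptually, each binary marker parameter just toggles an insertion into an otherwise complete base utterance; a hypothetical sketch (the marker names and insertion rules here are illustrative, not PERSONAGE's actual rules):

```python
# Hypothetical sketch: binary pragmatic-marker parameters spliced into a
# base utterance. The insertion rules are made up for illustration.
base = "nameX is an Italian restaurant in riverside"

markers = {
    "request_confirmation": lambda s: f"let's see what we can find on nameX. {s}",
    "emphasizer":           lambda s: s.replace("is an", "is really an"),
    "initial_rejection":    lambda s: f"oh I don't know. {s}",
    "tag_question":         lambda s: f"{s}, you see?",
}

# a toy "agreeable" setting: confirmation and tag question on, the rest off
active = ["request_confirmation", "tag_question"]
utt = base
for name in active:
    utt = markers[name](utt)
print(utt)
# -> "let's see what we can find on nameX. nameX is an Italian restaurant
#     in riverside, you see?"
```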

But you can see that not all of the personalities use every pragmatic marker, and overlaps can occur. You end up with some, like the tag question, that are really only used by agreeable, while many of them are used by multiple personalities: some are pretty much equally used by disagreeable and conscientious, and by others a little bit less. And the "you know" marker is mostly used by extravert, but agreeable will also use it.

So we begin with the TGen system from Dušek et al., and we have three different models with varying levels of supervision. The NoSup model directly follows the baseline model and has no supervision. The Token model adds a single token that specifies the personality, similar to approaches in multilingual machine translation. And our Context model directly encodes the thirty-six style parameters, the pragmatic markers and aggregation operations from PERSONAGE, as context through a feed-forward network.
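Roughly, the three supervision levels differ only in what extra information accompanies the MR; a toy sketch (the linearization and token names are assumptions, not the exact TGen input format):

```python
# Toy sketch of the three supervision levels. The MR linearization and the
# personality-token spelling are assumptions for illustration.
mr = ["name=nameX", "food=Italian", "area=riverside", "price=moderate"]

# NoSup: the MR alone, no style information at all
nosup_input = mr

# Token: one extra token naming the personality, as in multilingual MT
token_input = ["<DISAGREEABLE>"] + mr

# Context: a vector of the 36 binary PERSONAGE style parameters
# (pragmatic markers + aggregation operations), fed through a
# feed-forward layer and used as extra context for the decoder
context_vector = [0, 1, 0, 1] + [0] * 32   # toy values, 36 dimensions
assert len(context_vector) == 36
```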

Here's some example output from our context model. The first is a realization with no aggregation and no pragmatic markers, so each attribute is realized in its own sentence and there's no variety; it's just realizing the attributes.
And I have three examples from the personalities. The first is agreeable: it starts with "let's see what we can find on X", describes, "well", a moderately priced Italian restaurant in riverside that is family friendly, and ends with "you see?". So it has a request confirmation at the beginning, an acknowledgment justification ("well"), a tag question at the end, and it also uses the "also" cue for aggregation.

The second one is the disagreeable voice. It opens with "god, I don't know", then says it is a moderately priced restaurant, an Italian place in riverside, and kid friendly. So there's an expletive ("god") and an initial rejection with the "I don't know", and this one still uses the "also" cue for aggregation ("also it is...").
The final one is the extravert: "basically, it's really an Italian place, and actually moderately priced, in riverside, decent rating, kid friendly, you know." So it has a lot of emphasizers ("basically", "actually") and the "you know" marker, and it only uses merge and conjunction, so everything is in one sentence; there is no use of the period operation.

For automatic metrics, they really aren't that informative here: they reward systems whose outputs are very similar to the training data, and they are inherently bad at measuring stylistic variation. Our context model does perform the best, but the numbers may not be a great indicator; we are mostly showing them for completeness. So we propose new metrics for evaluating semantic quality and stylistic variation.

So first we evaluate the semantic quality using four types of errors, comparing the attributes in the original MR against their realizations. The first is deletions, which is when an attribute in the MR is not realized in the output. Repetitions is where an attribute is realized in the reference multiple times. Substitutions is where an attribute is realized in the output with a different value than in the MR; for example, if the MR said Italian and the reference said French, that would be a substitution. And then hallucinations, which is when the reference realizes an attribute that was not in the original MR.

So we have a table here that has the values for each model and each personality for deletions, repetitions, and substitutions. Everything is fairly stable, and it is hard to tell which model is doing the best overall, so we simplified it and computed a slot error rate, which is the sum of those four semantic errors over the number of slots in the MR. This is modeled after the word error rate.
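The slot error rate as just defined is simple to compute; a minimal sketch:

```python
# Slot error rate: the four semantic error counts summed, divided by the
# number of slots in the MR (analogous to word error rate).
def slot_error_rate(deletions, repetitions, substitutions, hallucinations, n_slots):
    return (deletions + repetitions + substitutions + hallucinations) / n_slots

# e.g. an 8-slot MR whose realization drops one attribute and invents another
print(slot_error_rate(1, 0, 0, 1, 8))  # -> 0.25
```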

And now we have a similar table where you can actually see the difference between the models. You can see that NoSup performed the best, but also that this comes at a cost in stylistic variation, and that context really isn't that much worse.


That was the semantic quality; now we want to measure stylistic variation. So first we take the Shannon entropy of the generated text to see how varied the results are. The context model performs the best of the three models and is closest to the original PERSONAGE training data, so its output is as varied as the original data.
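A minimal version of the entropy measure, over the word distribution of the generated text (token-level and stdlib-only here; the exact tokenization used for the reported numbers may differ):

```python
# Shannon entropy of the word distribution, as a proxy for how varied
# a model's output is: flat, repetitive text scores lower.
import math
from collections import Counter

def shannon_entropy(tokens):
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

flat = "it is a place . it is a place .".split()
varied = "basically nameX is really an italian place , you know".split()
print(shannon_entropy(flat) < shannon_entropy(varied))  # -> True
```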

We also want to measure whether the models are faithfully reproducing the pragmatic markers that each personality uses. So we calculated, for all of the pragmatic markers, their counts of occurrence, and then we computed the Pearson correlation between the PERSONAGE training data and the output, for each model and each personality.
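That measure can be sketched as a plain Pearson r over marker-count vectors (stdlib only; the counts below are made up for illustration):

```python
# Pearson correlation between two vectors of pragmatic-marker counts:
# one counted from the PERSONAGE training data, one from a model's output.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# toy marker counts (tag_question, emphasizer, expletive, ...) per source
personage_counts = [120, 30, 5, 80]
model_counts     = [100, 35, 2, 70]
print(round(pearson(personage_counts, model_counts), 3))  # -> 0.992
```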

So the context model performs better for most of the personalities, with one exception. NoSup only has a positive value for two of them; the rest are actually negatively correlated. I think this is because conscientious uses easy-to-reproduce markers, mostly the request confirmation and the initial rejection, which generally appear at the very beginning or the very end of the sentence. That makes them easier to reproduce, and NoSup pretty much exclusively produces just those, so its output is very similar to conscientious.
So we did pretty much the same thing for the aggregation operations: we count occurrences of each operation and compute the Pearson correlation between each model and the test data. Again, context is performing better than the others, except for one case, this time disagreeable.

You can see that NoSup actually does pretty well here; it does better than token in a couple of instances. We think this is because aggregation operations, unlike the pragmatic markers, are required: you can't have a sentence without one. So there is less variety in the aggregation operations than in the pragmatic markers, and more of an opportunity for the supervised models to do better with the pragmatic markers than with the aggregation operations.

But overall, our context model gives us the best mix of semantic quality and stylistic variation.

So we also evaluated the quality of the outputs with a Mechanical Turk study. We took our best performing model, the context model, and tested whether people can recognize the personality. As a baseline, we randomly selected a set of ten unique MRs from training along with their references. We gave Turkers the references and items from the Ten-Item Personality Inventory (TIPI), and we also asked the Turkers to rate how natural the utterance sounded.

Then we evaluated the same unique MRs with outputs generated from the context model on the test set. We had five Turkers per HIT, and we measured how frequently the majority selected the correct TIPI item versus the opposite item, to get a ratio.
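The majority-vote ratio just described could be computed like this (toy judgment data, assuming five judges per HIT and a win at three or more correct votes):

```python
# Fraction of HITs where the majority of the five judges picked the
# correct TIPI item over its opposite. Judgment data is made up.
from collections import Counter

def majority_correct_ratio(hits):
    """hits: list of 5-judgment lists, each judgment 'correct' or 'opposite'."""
    wins = sum(Counter(h)["correct"] >= 3 for h in hits)
    return wins / len(hits)

judgments = [
    ["correct"] * 4 + ["opposite"],
    ["correct"] * 3 + ["opposite"] * 2,
    ["opposite"] * 4 + ["correct"],
    ["correct"] * 5,
]
print(majority_correct_ratio(judgments))  # -> 0.75
```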


On the slide, the highlighted cells are those that had over fifty percent for the TIPI items. The context model is right over fifty percent on everything except agreeable and conscientious, and even there, the lowest percentages seem to follow the trend of the original PERSONAGE data, which is just a little bit lower.

We also got the TIPI ratings on a one-to-seven scale from the Turkers, and we averaged the rating of the matching item for each case; so for agreeable, it's the average rating for the agreeable item, and so on. For most of the personalities, the context model's averages are about the same as the original data, and for unconscientious it actually does a little better than the original PERSONAGE.

We also got a naturalness rating, again on a one-to-seven scale. The context model again has a couple of instances where it actually sounds a little more natural than the original data: for disagreeable and unconscientious, people rated our model's outputs as more natural in the overall results.

So we also tested our model for generalizability, and we tried to generate output that matches the characteristics of multiple personalities. For example, we take the disagreeable voice and the conscientious voice, combine them, and generate new sentences. On the slide is one example where our model outputs a blend of the disagreeable and conscientious personalities.
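One simple way to build such a blended input is an element-wise average of the two personalities' style-parameter vectors; a toy sketch (whether this matches the exact combination used here is an assumption, and the vectors below are made up):

```python
# Toy sketch: combine two personalities' binary style-parameter vectors
# by element-wise averaging. Only a 6-dim slice of the 36 parameters is
# shown, and the values are invented for illustration.
disagreeable  = [1, 0, 1, 0, 1, 0]
conscientious = [0, 0, 1, 1, 0, 0]

combined = [(a + b) / 2 for a, b in zip(disagreeable, conscientious)]
print(combined)  # -> [0.5, 0.0, 1.0, 0.5, 0.5, 0.0]
```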


To evaluate it, we look at the average occurrence of the different features. Here are two examples that are pretty clear. The period aggregation is a lot more common in disagreeable than in conscientious, and when we combine them, the result is sorted in the middle. The same goes for the expletive pragmatic marker: it's much more common in disagreeable, and again the combined result is somewhere in between. So it really does indicate that the model is not sticking to one personality or the other; it is sort of averaging them and generating novel data.

And this is from a model that we only trained on single personalities, never on mixed personalities. So to our knowledge, this is the first paper showing a neural model able to voice a novel personality.

In conclusion, we showed that neural models can be used to generate output that is both syntactically and semantically sound, based on the E2E generation challenge, and that neural models are able to produce stylistic variation in a controlled way, based on the type of data they are trained on and the amount of supervision they are given in training. We're currently focusing on new forms of stylistic variation. Our dataset is available at that link.



So all of these results are actually on the held-out test set. We got around the same results as with the training references, which really just shows that the neural model, the context model, is still producing these personalities in a way that is recognizable. People can still tell that the conscientious voice is conscientious, and it's not just that we're looking at these pragmatic markers and counting that it repeats them; it is actually still the same personality as in the training data.