so good morning everyone, i am from the nara institute of science and technology in japan

today i'd like to talk about our recent work on utilizing unsupervised clustering for positive emotion elicitation

in a neural dialogue system

so in this research we particularly look at affective dialogue systems, and that is

dialogue systems that take into account

affective aspects of the interaction

traditionally, dialogue systems serve as a way for users to interact naturally with a

system

especially to complete certain tasks

but as the technology develops we see high potential for

dialogue systems to address the emotional needs of the user

and we can see this in the increasing number of dialogue system applications

in various tasks that involve affect

for example companionship for the elderly

distress assessment, and affect-sensitive tutoring

the traditional framework for a dialogue system with affective aspects has so far revolved around two main processes

so there's emotion recognition, where we try to see what the

user is currently feeling, their affective state, and then use this information in the

interaction

and there's also emotion expression, where the system tries to convey a certain personality or

emotion to the user

however, these two processes do not fully represent the emotion processes in human communication

resulting in an increasing interest in emotion elicitation

which focuses on the change of emotion

in dialogue

there is some existing work on this, one line of work

uses

machine translation to translate the user's query into a system response that targets a

specific emotion

there's also work that implements different affective personalities in a dialogue system

and studies how users are impacted by each of these personalities

upon interaction

so the

drawback or the shortcoming of these existing works is that they have not yet

considered the emotional benefit for the user

the focus is on the intent of the elicitation itself and whether the machine will

be able to achieve this intention

but whether this can actually benefit the user has not yet been studied

so in this research we aim to tap into this overlooked potential of emotion elicitation to improve the user's

emotional state

in the form of a chat-based dialogue system

with an implicit goal of positive emotion elicitation

and now to formalize this, we follow an emotion model which is called the circumplex

model

this describes emotion in terms of two dimensions, so there's valence

that measures the positivity or negativity of emotion

and there's arousal that captures the activation of emotion

so based on this model, what we mean when we say positive emotion is

emotion with

positive valence

and what we mean when we say positive emotional change, or positive emotion elicitation

is any move in this valence-arousal space towards more positive feelings, so any

of the arrows that are shown here we consider positive emotion elicitation
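the idea above can be sketched in a few lines of code; this is a minimal illustration of the circumplex model, not part of the actual system, and the state names and values are made up for the example

```python
# Minimal sketch of the circumplex model: an emotional state is a point
# in valence-arousal space, and a change counts as positive emotion
# elicitation when it moves towards more positive feelings (higher
# valence). The names, ranges, and values are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class EmotionState:
    valence: float  # positivity/negativity of the emotion, in [-1, 1]
    arousal: float  # activation of the emotion, in [-1, 1]

def is_positive_elicitation(before: EmotionState, after: EmotionState) -> bool:
    """Any move in valence-arousal space towards higher valence."""
    return after.valence > before.valence

# Example: moving from a distressed state (negative valence, high arousal)
# towards a content state (positive valence, low arousal).
distressed = EmotionState(valence=-0.6, arousal=0.7)
content = EmotionState(valence=0.5, arousal=-0.2)
print(is_positive_elicitation(distressed, content))  # True
```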

so given a query uttered to a dialogue system or social bot

there are many ways to answer it

and actually in real life each of these answers has a different emotional impact

meaning they elicit different kinds of emotion

and as can be seen in the very obvious example shown here, the first one

has a negative impact and the second one a positive one

and we can actually

learn this kind of response-affect information from conversational data

now if we take a look at chat-based dialogue systems

neural response generators have been frequently reported to perform well

and to have promising properties

we have the recurrent encoder-decoder

that encodes the sequence of user input words and then uses this representation

to infer a sequence of

words

as the response

and the hierarchical recurrent encoder-decoder (HRED) goes a step further and considers the two levels

of

sequences in a dialogue

so we have the sequence of words

that makes up a dialogue turn, and then we have the sequence of dialogue turns that

makes up the dialogue itself

and when we model that in a neural network we get something that looks

like this

so at the bottom we have an utterance encoder that deals with the sequence of

words

and in the middle we take the dialogue turn representations

and then we

model them sequentially

so when we

generate a sequence of words as the response we don't only take into account the

current turn but also the dialogue context

and this helps to maintain longer

dependencies in the dialogue
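the two-level encoding described above can be sketched as follows; this is a toy illustration with plain tanh RNNs in NumPy, and the sizes and parameter names are assumptions for the example, not the actual model

```python
# Toy sketch of the HRED idea: a word-level encoder turns each turn into
# an utterance vector, and a turn-level encoder runs over those vectors
# to build a dialogue context vector, on which the decoder would be
# conditioned. Simple tanh RNNs are used purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
EMB, HID, CTX = 8, 16, 12  # embedding / utterance / context sizes (illustrative)

def rnn_encode(vectors, w_in, w_rec):
    """Run a simple tanh RNN over a sequence; return the last hidden state."""
    h = np.zeros(w_rec.shape[0])
    for v in vectors:
        h = np.tanh(w_in @ v + w_rec @ h)
    return h

# Illustrative random parameters for the two encoder levels.
w_utt_in = rng.normal(size=(HID, EMB)) * 0.1   # word -> utterance encoder
w_utt_rec = rng.normal(size=(HID, HID)) * 0.1
w_ctx_in = rng.normal(size=(CTX, HID)) * 0.1   # utterance -> context encoder
w_ctx_rec = rng.normal(size=(CTX, CTX)) * 0.1

# A dialogue of three turns, each a sequence of (random) word embeddings.
dialogue = [rng.normal(size=(n_words, EMB)) for n_words in (5, 3, 7)]

# Level 1: encode each turn's words into an utterance vector.
turn_vectors = [rnn_encode(turn, w_utt_in, w_utt_rec) for turn in dialogue]

# Level 2: encode the sequence of turns into a dialogue context vector.
context = rnn_encode(turn_vectors, w_ctx_in, w_ctx_rec)
print(context.shape)  # (12,)
```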

in terms

of

in terms of applying emotion

there is previous work that

proposed a system that can express different kinds of emotion

by using an internal state in the neural response generator

but you can see here that application for emotion elicitation using neural networks

is still very lacking, if not altogether absent

what we have recently proposed is emotion-sensitive response generation

which was published earlier this year, so the main idea is

to have an emotion encoder that takes into account the emotional context of the dialogue

and uses this information in generating the response

so now we have an emotion encoder, which is

shown here

that takes the dialogue context

and tries to predict

the emotion context of the current turn

and when generating the response we use the combination of both

the dialogue context and the emotion context

so in this way the network is emotion-sensitive

and if we train it on data whose responses elicit positive emotion

we can achieve positive emotion elicitation

and our subjective evaluation actually showed this method works very well
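the way the two contexts feed the decoder can be sketched as below; the concatenation and single softmax step are illustrative assumptions about one way to combine them, not the exact architecture

```python
# Toy sketch of emotion-sensitive generation: an emotion encoder predicts
# an emotion context from the dialogue context, and the decoder is
# conditioned on the combination of both. Sizes and the softmax decoder
# step stand in for the real network and are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
CTX, EMO, VOCAB = 12, 4, 20  # illustrative sizes

dialogue_context = rng.normal(size=CTX)

# Emotion encoder: a distribution over coarse emotion classes,
# predicted from the dialogue context.
w_emo = rng.normal(size=(EMO, CTX)) * 0.1
logits = w_emo @ dialogue_context
emotion_context = np.exp(logits) / np.exp(logits).sum()

# Decoder conditioning: concatenate both contexts, then take one
# softmax step over the vocabulary as a stand-in for word generation.
combined = np.concatenate([dialogue_context, emotion_context])
w_out = rng.normal(size=(VOCAB, CTX + EMO)) * 0.1
word_logits = w_out @ combined
word_probs = np.exp(word_logits) / np.exp(word_logits).sum()

print(word_probs.shape)  # (20,)
```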

however there are two main limitations

the first is that it has not yet learned strategies from an expert, so the training data were

Wizard-of-Oz conversations

but we would like to see how an expert, people who are knowledgeable in

emotional interaction, would elicit positive emotion

and also it still tends towards short and generic responses with positive affect words

and this impairs

engagement, and that's

important especially in

emotion-oriented interaction

so the main focus of this contribution is to address these limitations

there are several challenges

which i will talk about now

so the first goal is to learn

elicitation strategies from an expert

and the challenge is the absence of such resources, if we take a look

at

existing emotion-rich corpora

none of them

have yet involved an expert in the data collection

and there is also no data that shows positive emotion elicitation strategies in everyday situations

so what we did is construct such a dialogue corpus, we carefully designed the scenario and

i will be talking about this in more detail in a bit

the second goal is to increase variety in the generated responses

to improve engagement, and the main challenge here is sparsity

so we would like to cover as much as possible of the dialogue-emotion space

however it's really hard to collect large amounts of dialogue data annotated with emotion information

reliably, so we would like to tackle this problem methodically, and we hypothesize that higher-

level information such as dialogue acts can help reduce the sparsity

by capturing broad types of responses, that is, the actions available to

the system, and

emphasizing this information in the training and generation process

and then, putting it all together, we try to utilize this information in the

response generation, and the main difference here now is that

using the dialogue context, not only do we predict the emotion context of the dialogue

but we also try to predict the action that best suits

the response

so then

we

propose the multiple-context HRED (MC-HRED)

that uses a combination of these three contexts to generate a response

now talking about the corpus construction

as we talked about before, the goal here is to capture expert strategies for emotion elicitation

so what we do is we record interactions between an expert and a participant

we recruited a professional counselor to take the place of the expert

and the main thing is to condition the interaction at the beginning with negative emotion, so that

as the

dialogue progresses we can see how the expert drives the conversation

to allow emotional recovery, and we record this

and this is what a typical recording session looks like

we start with an opening for small talk

and afterwards we induce the negative emotion, and what we

do for that is we show videos, non-fictional videos such as interview

clips

about topics

that have a negative sentiment, such as

poverty or environmental change

and the bulk of the session is the discussion that

we talked about before

we recorded sixty sessions amounting to about twenty-four hours of data, we recruited one

counselor and thirty participants

each participant

took part in two

recordings, and each session used

a video that might induce a different type of negative emotion

for the emotion annotation we rely on self-reported emotion with the GTrace tool, so we

have the participants

watch the recordings they had just taken part in

and, using GTrace with the scale on the right-hand side

mark their emotional state

at any given time

so if we project this over

the length of the dialogue we get

an emotion

trace that looks like this
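working with such a trace can be sketched as follows; the trace values here are made up for illustration, and the simple start-versus-end comparison is an assumed stand-in for the fuller analysis

```python
# Illustrative sketch of a self-reported emotion trace: the annotation
# yields a valence value over time, and projecting it over the dialogue
# lets us check whether emotional recovery took place (valence at the
# end of the discussion higher than right after the negative induction).

def emotional_recovery(trace):
    """trace: list of (time_sec, valence) pairs from the annotation tool.
    Returns the valence change from the start of the discussion to its end."""
    first_valence = trace[0][1]
    last_valence = trace[-1][1]
    return last_valence - first_valence

# Made-up trace: negative right after the video, recovering over the session.
trace = [(0, -0.7), (120, -0.4), (300, -0.1), (600, 0.3), (900, 0.5)]
delta = emotional_recovery(trace)
print(delta > 0)  # True: the participant's valence improved
```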

of course we also transcribed the recordings, and we use the combination

of these two sources of information

later on, but before that

the other goal is to find higher-level information from the collected expert

responses

what we would like to have here is information that is roughly equivalent to dialogue

acts

but we would like it to be specific to our dialogue scenario, because this is the

scenario of particular interest

and

we would also like these dialogue acts to reflect the affective intents of the

expert

there are several ways to obtain this

human annotation has the obvious limitations of being expensive and it being hard to reach

reliable inter-annotator agreement

we could also use standard dialogue act classifiers, but the constraint here is that they may

not cover the specific emotional intents we are interested in

so we resorted to unsupervised clustering

we do that by first extracting the responses of the counselor from the

corpus

and then, using a pre-trained word2vec model, we get a compact representation of each

response

and we tried out two types of clustering methods

there is k-means, which requires you to define beforehand how many clusters you would like to find

in our case we chose k empirically

and for the DP-GMM

we do not need to define the model complexity beforehand, as the model itself tries to find

the optimal number of components given the data
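the pipeline described above can be sketched as below; the tiny "word2vec" vocabulary is random stand-in vectors and the NumPy k-means is purely illustrative (in practice one might use scikit-learn's KMeans, and BayesianGaussianMixture for the DP-GMM)

```python
# Sketch of the clustering pipeline: each response is represented by the
# average of its word vectors, then grouped with k-means. The vocabulary,
# responses, and k are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)
DIM = 10  # embedding size (illustrative)

# Stand-in "word2vec": word -> random vector.
vocab = {w: rng.normal(size=DIM) for w in "i see you feel that is hard yes right".split()}

def embed(response):
    """Average the word vectors of a response into one compact vector."""
    return np.mean([vocab[w] for w in response.split()], axis=0)

responses = ["i see", "yes right", "i see you",
             "you feel that is hard", "that is right"]

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: assign points to nearest center, recompute centers."""
    r = np.random.default_rng(seed)
    centers = points[r.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    return labels

points = np.stack([embed(r) for r in responses])
labels = kmeans(points, k=2)
print(labels.shape)  # one cluster id per response
```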

and then we did some analysis, this is the t-SNE visualization of

the vectors and their cluster labels

this is the result of the k-means clustering

in the green cluster we have many sentences that are mainly related to the participants' concerns

in the red cluster we get affirmative or confirmation responses

and in the blue cluster we have active listening or backchannels

what we do get here, though, is one very large cluster where all the more

complex

sentences are grouped together

so we

clustered that one more time and we found a number of sub-clusters

some examples, in the red cluster we have

a lot of sentences

that contain factual recollections about the topic

and in the green cluster

we have

sentences that focus on the participant, so "you" is the most common word there

and it sounds like the

counselor tries to probe their opinions and their assessment of the topic

for the DP-GMM

the

characteristics of each cluster are less clear, and probably this is due to the

very imbalanced

distribution of the sentences over the clusters, so we have two very big clusters here

and there are plenty of very small clusters at the periphery

and just because the clusters are bigger, it is harder to conclude what they represent

so then we put all of this into the experiment to see if things are

working as we expect

this is the experimental setup, and the first thing that we do is pre-train the model

so before we introduce any action or emotion-specific information, we would

like the model to have

a prior for the

response generation task

so we use a large-scale dialogue corpus, which is the SubTle corpus, containing

five point five million dialogue pairs from movie subtitles

and we pre-train the HRED model so that it learns to encode

the dialogue context

and then we fine-tune this pre-trained model on the data that we have collected

for comparison we fine-tune three types of models, we have

the Emo-HRED that only relies on the emotion context

we have the MC-HRED that uses both

emotion and action contexts in combination, and for completeness we also train a model that

only relies on the action context

and of course all the models start from the same pre-trained network

a little bit about how we do the pre-training and fine-tuning

so what pre-training does is initialize the weights of the

HRED components

so the

parts that have nothing to do with the additional contexts

and when doing the fine-tuning, because the data that we have is pretty small, we

do it selectively, so we only optimize the parameters that are affected by the new objectives

so the decoder here and the two

added context encoders
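the selective fine-tuning can be sketched as follows; the parameter names and the plain SGD step are illustrative assumptions, the point being that gradient updates touch only the components affected by the new objectives

```python
# Sketch of selective fine-tuning: all parameters come from pre-training,
# but updates are applied only to the decoder and the added context
# encoders; the pre-trained encoders stay frozen.

import numpy as np

rng = np.random.default_rng(3)
params = {
    "utterance_encoder": rng.normal(size=(4, 4)),  # pre-trained, frozen
    "context_encoder": rng.normal(size=(4, 4)),    # pre-trained, frozen
    "emotion_encoder": rng.normal(size=(4, 4)),    # new, fine-tuned
    "action_encoder": rng.normal(size=(4, 4)),     # new, fine-tuned
    "decoder": rng.normal(size=(4, 4)),            # fine-tuned
}
trainable = {"emotion_encoder", "action_encoder", "decoder"}

frozen_before = {k: v.copy() for k, v in params.items()}

def sgd_step(params, grads, lr=0.1):
    """Apply a gradient step only to the trainable parameters."""
    for name in trainable:
        params[name] -= lr * grads[name]

grads = {k: np.ones_like(v) for k, v in params.items()}
sgd_step(params, grads)

# Frozen components keep their pre-trained values; trainable ones move.
print(np.allclose(params["utterance_encoder"], frozen_before["utterance_encoder"]))  # True
print(np.allclose(params["decoder"], frozen_before["decoder"]))  # False
```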

in terms of the MC-HRED, we have three different targets

during training

and each of those targets has its own loss

we have the negative log-likelihood of the

target response

we have the prediction error of the emotion encoder, which tries to predict the emotional state

and we have the prediction error for the action encoder

which tries to predict the action of the response

and we combine these losses together, linearly interpolating them, and then use backpropagation

to update the corresponding parts of the network
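the combined objective can be written out in a few lines; the interpolation weights and the toy loss values below are illustrative assumptions, and perplexity (reported in the next part of the evaluation) is just the exponential of the per-word negative log-likelihood

```python
# Toy sketch of the multi-task objective: the response NLL, the emotion
# prediction loss, and the action prediction loss are linearly
# interpolated into one training loss.

import math

def combined_loss(nll_response, loss_emotion, loss_action,
                  w_resp=1.0, w_emo=0.5, w_act=0.5):
    """Linear interpolation of the three task losses (weights assumed)."""
    return w_resp * nll_response + w_emo * loss_emotion + w_act * loss_action

# Toy values for one batch.
nll = 3.75       # per-word negative log-likelihood of the target response
emo_loss = 0.40  # emotion encoder prediction error
act_loss = 0.25  # action encoder prediction error

total = combined_loss(nll, emo_loss, act_loss)
print(round(total, 4))          # 4.075
print(round(math.exp(nll), 2))  # 42.52, the corresponding perplexity
```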

the first evaluation we do is to look at the perplexity of the models

for perplexity, lower is better

for the Emo-HRED baseline, the perplexity we get is forty-two point six

and actually if we use the action information alone we get a slightly better model

however, when we combine the information together we see different things happening for each action

labeling

so for the k-means cluster-and-label we see some improvement

and for the DP-GMM it is actually slightly worse

we analyzed this further by

separating the test set by query length

so we can compare

performance on short versus long

queries

there's a stark difference between the two groups, performance on short queries

is consistently better than that on long ones, which is not surprising given the long-term dependency

problem

of neural networks

and the thing with the MC-HRED is that

it gains a substantial improvement for the longer queries

most of the improvement

that it gets comes from

it being able to perform better on long queries

so from this we can see that the multiple contexts help especially for longer inputs

and then we also did a subjective evaluation, we extracted a hundred queries

and had each judged by crowd workers

whom we asked to rate the naturalness, emotional impact, and engagement of the responses

of two models

so we have the Emo-HRED as the baseline and the best-performing

MC-HRED as the proposed system

and we see improved engagement

from the proposed model, while maintaining the emotional impact and naturalness

and when we look at the responses that are generated by the

systems, we see that the MC-HRED responses are

on average two and a half words longer than the baseline's

so in conclusion, we have presented a corpus that captures expert

strategies in positive emotion elicitation

we have also shown how we use unsupervised clustering methods

to obtain higher-level information

and we use all of these in

a response generation model

in the future there are many things

that need to be worked on, but in particular we would like to look at

multimodal information

this is especially important for estimating

the emotional context of the dialogue

and of course evaluation with real user interaction is also important

that concludes my presentation

so the pre-training is done

using another corpus, which we did not construct ourselves

we use this model with

the training data

shown here

so it's a large-scale corpus

of movie subtitles

right, so during

the pre-training we did not use any emotion or action information

so the pre-training serves

only to prime the network towards dialogue generation

and then during fine-tuning we give the model the ability to encode the action context and emotion

context, and use these

in the generation

right, so

about the word embeddings

so

there are embeddings in two different places, the first one is

the pre-trained word embedding model that we

use for the counselor response clustering, and the other is in the model itself

where we learn the word embeddings

during pre-training

so there it is learned

it's learned by the utterance encoder

on the large-scale data

and whether we cluster sentences or whole dialogues in our clustering

for the expert response clustering we cluster sentences

and for that we use the pre-trained word2vec model

and we average

the word vectors over the sentence

right, I only just heard about skip-thought vectors yesterday, so whether they would make

a difference is something we have yet to look into, thank

you

so there's definitely an overlap between the actions that we would like to find from the

experts

and the actions in general dialogues

so we did find, for example, backchannels

and confirmations, which are actions that occur in conversations generally

but the unsupervised clustering is especially helpful for the other, domain-specific actions

and

it does not need any expert annotation at all

right

so

what we find is that most of the time, the majority of the time, the counselor

is able to reach the positive emotions

in terms of the participants' reactions towards the videos, it varies highly

so there are people who are not so reactive and there are people who are

very emotionally sensitive

so

we get

different types of responses

but this is an example of one

of the dialogues, so the red line here is the valence throughout the dialogue

we can see that it starts out quite positive, and during the video viewing it

becomes

very negative, but as the dialogue progresses

the

the counselor

successfully elicits emotional recovery

we have a more extensive analysis in another paper

i'll be happy to point you to it