0:00:14so good morning everyone i am from nara institute of science and technology
0:00:20in japan
0:00:22today i would like to talk about our recent work on utilizing unsupervised clustering for positive emotion elicitation
0:00:29in a neural dialogue system
0:00:32so in this research we particularly look at affective dialogue systems that is
0:00:37dialogue systems that take into account
0:00:40affective aspects in the interaction
0:00:43so dialogue systems are a way for users to interact naturally with the
0:00:48system
0:00:49especially to complete certain tasks
0:00:51but as the technology develops we see high potential of
0:00:56dialogue system to address the emotional needs of the user
0:00:59and we can see this in the increase of dialogue system works and applications
0:01:04in various tasks that involve affect
0:01:07for example companionship for the elderly
0:01:10distress clues assessment and affect-sensitive tutoring
0:01:15the traditional work on dialogue systems with affective aspects revolves around two main
0:01:20problems
0:01:21so there's emotion recognition or affect recognition where we try to see what the
0:01:27user is currently feeling or their affective state and then use this information in the
0:01:31interaction
0:01:32and there's also emotion expression where the system tries to convey a certain personality or
0:01:38emotion to the user
0:01:41while useful this does not fully represent emotion processes in human communication
0:01:47resulting in an increasing interest in emotion elicitation
0:01:51which focuses on the change of emotion
0:01:55in dialogue
0:01:57there is some work by hasegawa and colleagues that
0:02:01uses
0:02:02machine translation to translate the user's query into a system response that targets a
0:02:09specific emotion
0:02:10there's also work by skowron and colleagues that implements different affective personalities in a dialogue system
0:02:18and studies how users are impacted by each of these personalities
0:02:22upon interaction
0:02:24so
0:02:26the drawback or the shortcoming of these existing works is that they have not yet
0:02:31considered the emotional benefit for the user
0:02:34so they focus on the intent of the elicitation itself and whether the system will
0:02:38be able to achieve this intention
0:02:41but how this can benefit the user has not yet been studied
0:02:47so in this research we draw on the overlooked potential of emotion elicitation to improve user
0:02:53emotional states
0:02:55and its form is a chat-based dialogue system
0:02:59with an implicit goal of positive emotion elicitation
0:03:02and now to formalize this we follow an emotion model which is called the circumplex
0:03:07model
0:03:08this describes emotion in terms of two dimensions so there's valence
0:03:12that measures the positivity or negativity of emotion
0:03:15and there's arousal that captures the activation of emotion
0:03:20so based on this model what we mean when we say a positive emotion is
0:03:25an emotion with
0:03:26positive valence
0:03:29and what we mean when we say positive emotional change or positive emotion elicitation
0:03:34is any move in this valence-arousal space toward more positive feelings so any
0:03:40of these arrows that are shown here we consider as positive emotion elicitation
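Editor's sketch: the two definitions above can be written down as a small check. This is a minimal illustration, assuming valence and arousal are scaled to [-1, 1] and that "a move toward more positive feelings" means an increase in valence. The class and function names are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmotionState:
    """A point in the circumplex model (assumed range: -1.0 to 1.0 on each axis)."""
    valence: float  # positivity or negativity of the emotion
    arousal: float  # activation of the emotion


def is_positive_emotion(state: EmotionState) -> bool:
    # A positive emotion is one with positive valence, whatever its arousal.
    return state.valence > 0.0


def is_positive_elicitation(before: EmotionState, after: EmotionState) -> bool:
    # Assumption: any increase in valence counts as positive elicitation,
    # even when the end point is still in the negative half of the space.
    return after.valence > before.valence
```

Under this reading, moving a user from strongly negative to mildly negative valence already counts as positive emotion elicitation, which matches the "any of these arrows" remark.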
0:03:46so given a query in a chat-based dialogue system or social bot
0:03:50there are many ways to answer it
0:03:53and actually in real life each of these answers has a different emotional impact
0:03:58meaning they elicit different kinds of emotion
0:04:01and as can be seen in the very obvious example here the first one
0:04:05has a negative impact and the second one has a positive one
0:04:09and we can actually
0:04:11find this kind of information from conversational data
0:04:16now if we take a look at chat-based dialogue systems
0:04:19neural response generators have been frequently reported to perform well
0:04:24and have promising properties
0:04:27we have the recurrent encoder-decoder
0:04:29that encodes a sequence of user inputs and then uses this representation
0:04:34to generate a sequence of
0:04:35words
0:04:37as the response
0:04:38and serban and colleagues took this a step further and considered two levels
0:04:44of
0:04:44sequences
0:04:45so we have a sequence of words
0:04:47that makes up a dialogue turn and then we have a sequence of dialogue turns that
0:04:51makes up the dialogue itself
0:04:53and if we try to model that in a neural network we get something that looks
0:04:57like this
0:04:58so at the bottom we have an utterance encoder that encodes the sequence of
0:05:02words
0:05:02and in the middle we take the dialogue turn representation
0:05:07and then also
0:05:08model that sequentially
0:05:10so when we
0:05:12generate a sequence of words as the response we don't only take into account the
0:05:17current dialogue turn but also the dialogue context
0:05:20and this helps to maintain
0:05:22longer dependencies in the dialogue
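Editor's sketch: the two-level structure described above (words make up a turn, turns make up a dialogue) can be illustrated without a neural network. This is a toy stand-in, not the actual model: averaging replaces the recurrent utterance encoder, and an exponential running average replaces the recurrent dialogue-level encoder. All names, and the `decay` parameter, are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def encode_utterance(words: Sequence[str], embeddings: Dict[str, Vector], dim: int) -> Vector:
    # Lower level: one vector per dialogue turn.
    # Stand-in for a recurrent utterance encoder: average the known word vectors.
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def encode_dialogue(turns: Sequence[Sequence[str]], embeddings: Dict[str, Vector],
                    dim: int, decay: float = 0.5) -> Vector:
    # Upper level: consume the turn vectors in order and keep a running context.
    # Stand-in for a recurrent dialogue encoder: exponential moving average.
    context = [0.0] * dim
    for turn in turns:
        utterance = encode_utterance(turn, embeddings, dim)
        context = [decay * c + (1.0 - decay) * u for c, u in zip(context, utterance)]
    return context
```

A response decoder would then condition on the returned context vector rather than on the last turn alone, which is what lets earlier turns influence the response.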
0:05:28in terms of application to emotion
0:05:32there is a work by zhou and colleagues
0:05:36that proposes a system that can express different kinds of emotion
0:05:40by using an internal state in the response generator
0:05:46so you see here that application to emotion elicitation using neural networks
0:05:51is still very lacking if not altogether absent
0:05:54what we have recently proposed is emotion-sensitive response generation
0:06:00which was published in the aaai proceedings this year so the main idea is
0:06:05to have an emotion encoder that takes into account the emotional context of the dialogue
0:06:10and use this information in generating the response
0:06:13so now we have an emotion encoder which is
0:06:17in here
0:06:18that takes the dialogue context
0:06:21and tries to predict
0:06:23the emotion context of the current turn
0:06:26and when generating the response we use the combination of both
0:06:30the dialogue context and the emotion context
0:06:32so in this way the network is emotion-sensitive
0:06:37and if we train it on data that contains responses that elicit positive emotion
0:06:42we can achieve positive emotion elicitation
0:06:48and our subjective evaluation actually proves this method works very well
0:06:53however there are two main limitations
0:06:56the first is that it has not yet learned strategies from an expert so we
0:07:00relied on
0:07:01wizard of oz conversations
0:07:05but we would like to see how an expert or people who are knowledgeable in
0:07:10emotional interaction would elicit positive emotion
0:07:14and also it still tends towards short and generic responses with positive affect words
0:07:20this impairs
0:07:21engagement and that's
0:07:23important especially in
0:07:25emotion-oriented interaction
0:07:28so the main focus in this contribution is to address these limitations
0:07:35there are several challenges
0:07:36which i will talk about now
0:07:38so the first goal is to learn
0:07:41elicitation strategy from an expert
0:07:44and the challenge is the absence of such data if we take a look
0:07:49at
0:07:50emotion-rich corpora
0:07:53none of them
0:07:54have yet involved an expert in the data collection
0:08:00and there is also no data that shows positive emotion elicitation strategy in everyday situations
0:08:07so what we did is construct such a dialogue corpus we carefully designed the scenario and
0:08:13i will be talking about this in more detail in a bit
0:08:17the second goal is to increase variety in the generated response
0:08:21to improve engagement and the main challenge here is the sparsity
0:08:25so we would like to cover as much as possible of the dialogue and emotion space
0:08:30however it's really hard to collect large amounts of data annotated with emotion information
0:08:36reliably so we would like to tackle this problem methodically we hypothesize that higher
0:08:43level information such as dialogue action can help reduce the sparsity
0:08:47by grouping types of responses into actions
0:08:52and
0:08:54emphasizing this information in the training and generation process
0:08:58and then we put it all together and try to utilize this information in the
0:09:02response generation the main difference here now is that
0:09:07using the dialogue state not only do we predict the emotional context of the dialogue
0:09:12but we also try to predict the action that the system should take
0:09:17in the response
0:09:19so then
0:09:21we
0:09:22propose the multi-context hred
0:09:25that uses a combination of these three contexts to generate a response
0:09:32now talking about the corpus construction
0:09:37as talked about before the goal here is to capture expert strategy for emotion elicitation
0:09:44so to do this we record interactions between an expert and a participant
0:09:49we recruit a professional counsellor to take the place of the expert
0:09:55and the main thing is to condition the interaction at the beginning with negative emotion so that
0:10:00as the
0:10:01dialogue progresses we can see how the expert drives the conversation
0:10:06to allow emotional recovery
0:10:10and this is how a typical recording session looks
0:10:15we start with an opening of small talk
0:10:18and afterwards we induce the negative emotion
0:10:22to do this we show videos non-fictional videos such as interview
0:10:28clips
0:10:28or news
0:10:30about topics
0:10:31that have a negative sentiment such as
0:10:34poverty or environmental change
0:10:37and the bulk of the session is the discussion that
0:10:42we've talked about before
0:10:45we recorded sixty sessions amounting to about twenty four hours of data we recruited one
0:10:52counsellor and thirty participants
0:10:54for each participant
0:10:56we made two recordings
0:10:58in one of the sessions
0:11:02we showed a video that might induce anger and in the other one a video that might induce
0:11:07sadness
0:11:09for the emotion annotation we rely on self-reported emotion annotation so we
0:11:16have the participants
0:11:18watch the recordings that they just made
0:11:21and using the gtrace tool they use this scale on the right-hand side
0:11:27to mark their emotional state
0:11:30at any given time
0:11:31so if we project this along
0:11:35the length of the dialogue we can get
0:11:37an emotion
0:11:39trace that looks like this
0:11:42of course we also transcribed it and we use the combination
0:11:47of these two pieces of information in training
0:11:50later on but before that
0:11:53the other goal is to find higher level information from the expert
0:11:59responses
0:12:01what we would like to have here is information that is probably equivalent to dialogue
0:12:06actions
0:12:08but we would like it to be specific to our dialogue scenario because this is the
0:12:12scenario of particular interest
0:12:15interestingly
0:12:16we would also like these dialogue acts to reflect the intents of the
0:12:21expert
0:12:22there are several ways to do this
0:12:25one is human annotation but its obvious limitations are that it is expensive and it is hard to reach a
0:12:31reliable inter-annotator agreement
0:12:35we can also use standard dialogue act classifiers but the constraint here is that they may
0:12:40not cover the specific emotion-related intents
0:12:44so we resorted to unsupervised clustering
0:12:48so we do that by first extracting the responses of the counsellor from the
0:12:53corpus
0:12:53and then using a pre-trained word2vec model we get a compact representation of each
0:12:59response
0:13:00and we try out two types of clustering methods
0:13:04with k-means you need to define beforehand how many clusters you would like to find
0:13:09in our case we chose k empirically
0:13:13for dpgmm
0:13:15we do not need to define the model complexity beforehand the model itself tries to find
0:13:21the optimal number of components to represent the data
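Editor's sketch: the clustering step can be illustrated in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the sentence vector is taken as the average of its word vectors, the embedding table is a toy dictionary standing in for a pre-trained word2vec model, and k-means is initialised from the first k points so the example is deterministic (a real run would normally use random or k-means++ initialisation). Only the k-means variant is shown.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def sentence_vector(tokens: Sequence[str], embeddings: Dict[str, Vector], dim: int) -> Vector:
    # Average the vectors of the words that have an embedding.
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: Sequence[Vector], k: int, iterations: int = 20) -> List[int]:
    # Requires len(points) >= k. Deterministic init: the first k points.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(m[i] for m in members) / len(members)
                                for i in range(len(members[0]))]
    return labels
```

The returned cluster index of each counsellor response can then serve as its action label.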
0:13:26and then we did some analysis this is the t-sne representation of
0:13:30the vectors and the labels
0:13:32this is the result of the k-means clustering with our chosen k
0:13:37in between cluster we have many sentences that are really didn't corresponding to participants contains
0:13:45in the red cluster we get affirmative responses or confirmation responses
0:13:49and in the blue cluster we have listening responses or backchannels
0:13:54what we do get here though is a very large cluster where all the more
0:13:58complex
0:13:59sentences are grouped together
0:14:02so
0:14:03we cluster that one more time and we find another set of sub-clusters
0:14:09some examples on the right cluster we have
0:14:11a lot of sentences
0:14:13that contain recollection about the topic
0:14:17and in the green cluster
0:14:19we have
0:14:21sentences that are focused on the participant so you is the most common word there
0:14:26and it seems like the
0:14:28counsellor tries to ask for their opinions and their assessment of the topic
0:14:32for dpgmm
0:14:34the
0:14:36characteristic of each cluster is less evident probably this is due to the
0:14:41very imbalanced
0:14:44distribution of the sentences in the clusters so we have a few very big clusters here
0:14:50and there are plenty of very small clusters
0:14:55so because the clusters are bigger it is harder to conclude what they represent
0:15:03so then we put all of this into the experiment to see if things are
0:15:07working as we expect
0:15:10this is the experimental setup the first thing that we did is to pre-train the model
0:15:14so before we start on anything action or emotion specific we would
0:15:19like to have
0:15:20a prior for the
0:15:23response generation task
0:15:25so we use a large-scale dialogue corpus which is the subtle corpus containing
0:15:32five point five million dialogue pairs from movie subtitles
0:15:35and we use the hred model so we do not incorporate any other
0:15:40context than the dialogue context
0:15:43and then we fine-tune this pre-trained model on the counselling data that we have
0:15:48collected
0:15:50for comparison we fine-tune three types of model we have
0:15:55emo-hred that only relies on emotion context
0:15:59we have mc-hred that uses both
0:16:03emotion and action context in combination and for completeness we also train a model that
0:16:08only relies on action
0:16:12and of course because the models after works
0:16:16a little bit about how we do pre-training and fine-tuning
0:16:21so what pre-training does is initialize the weights of the
0:16:25of the hred components
0:16:27so
0:16:29the parts that have nothing to do with additional context
0:16:33and in fine-tuning because the data that we have is pretty small we
0:16:38do it selectively so we only optimize parameters that are affected by the new contexts
0:16:43so the decoder here and the two
0:16:48additional context encoders
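Editor's sketch: selective fine-tuning amounts to deciding, per parameter, whether the optimizer may update it. This is a minimal framework-free illustration. The parameter names and the prefix convention are hypothetical, and in a real framework the same decision would be expressed through that framework's own mechanism for freezing parameters.

```python
from typing import Dict, Iterable


def select_trainable(param_names: Iterable[str],
                     trainable_prefixes: Iterable[str]) -> Dict[str, bool]:
    # True means "update during fine-tuning", False means "keep the pre-trained value".
    prefixes = tuple(trainable_prefixes)
    return {name: name.startswith(prefixes) for name in param_names}


# Hypothetical parameter names for a model with two added context encoders.
PARAMS = [
    "utterance_encoder.weight",
    "dialogue_encoder.weight",
    "emotion_encoder.weight",
    "action_encoder.weight",
    "decoder.weight",
]

# Only the decoder and the two added context encoders are fine-tuned.
TRAINABLE = select_trainable(PARAMS, ["decoder.", "emotion_encoder.", "action_encoder."])
```

Keeping the pre-trained encoders fixed limits how many parameters the small counselling corpus has to estimate.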
0:16:50in terms of mc-hred we have three different targets
0:16:55during training
0:17:00each of those targets has its own loss we have the negative log-likelihood of the
0:17:04target response
0:17:05the emotion encoder tries to predict the emotional state
0:17:09and we have the prediction error for training this as well as for the action encoder
0:17:15which tries to predict the action for the response
0:17:18and we combine these losses together we linearly interpolate them and then use back propagation
0:17:26to update the corresponding parameters
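Editor's sketch: combining the three training targets by linear interpolation can be written as a weighted sum whose weights add up to one. The weight values below are placeholders chosen for illustration; the talk does not state the actual interpolation weights, and the loss names are hypothetical.

```python
import math
from typing import Dict


def interpolate_losses(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    # Every loss needs exactly one weight, and the weights must sum to 1.
    if set(losses) != set(weights):
        raise ValueError("losses and weights must have the same keys")
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("interpolation weights must sum to 1")
    return sum(weights[name] * losses[name] for name in losses)


# Placeholder values for one training step (not taken from the talk).
step_losses = {"response_nll": 2.0, "emotion_error": 1.0, "action_error": 4.0}
step_weights = {"response_nll": 0.5, "emotion_error": 0.25, "action_error": 0.25}
total = interpolate_losses(step_losses, step_weights)
```

The single combined value is what gets back-propagated, so each weight controls how strongly its target shapes the shared parameters.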
0:17:30the first evaluation is that we look at the perplexity of the model
0:17:35forty one cherry be a perplexity lower is better
0:17:39with emo-hred the perplexity that we get is forty two point six
0:17:43and actually if we use action information we get a slightly better model
0:17:48however when we combine this information together we see different things happening for each action
0:17:54label set
0:17:55so for the k-means cluster labels we see some improvement
0:17:59here and for the dpgmm it is actually slightly worse
0:18:04we analyze this further by
0:18:07separating the test set by query length
0:18:09so we can get
0:18:10perplexity for shorter queries and for longer
0:18:15queries
0:18:18there's a stark difference between the two groups performance on short queries
0:18:23is consistently better than that on long ones which is not surprising as long-term dependency
0:18:28is a known weakness
0:18:31of neural networks in terms of performance
0:18:36so the thing with mc-hred basically is that
0:18:40it gains substantial improvement for long queries
0:18:45most of the improvement
0:18:46that it gets comes from
0:18:49being able to perform better for long queries
0:18:52so from this we can see that the multiple contexts help especially for longer inputs
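Editor's sketch: the length-wise analysis can be reproduced by grouping test examples by query length and computing perplexity per group. Perplexity is the exponential of the average negative log-likelihood per token. The length threshold and the data layout below are assumptions made for illustration only.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

# Each example: (query tokens, natural-log probabilities of the reference response tokens).
Example = Tuple[Sequence[str], Sequence[float]]


def perplexity(token_log_probs: Sequence[float]) -> float:
    # exp of the mean negative log-likelihood per token; lower is better.
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def perplexity_by_query_length(examples: Sequence[Example],
                               threshold: int) -> Dict[str, Optional[float]]:
    # Hypothetical split: queries with at most `threshold` tokens count as short.
    short_group: List[float] = []
    long_group: List[float] = []
    for query, log_probs in examples:
        target = short_group if len(query) <= threshold else long_group
        target.extend(log_probs)
    return {
        "short": perplexity(short_group) if short_group else None,
        "long": perplexity(long_group) if long_group else None,
    }
```

Comparing the two numbers across models shows whether an improvement comes mainly from short or from long queries.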
0:18:59and then we also did a subjective evaluation we extracted a hundred queries
0:19:04and had each judged by crowd workers
0:19:07we asked them to rate the naturalness emotional impact and engagement of the response
0:19:13from two models
0:19:15so we have emo-hred as the baseline and mc-hred the best of
0:19:19the mc-hreds as the proposed system
0:19:21and we see improved engagement
0:19:26from the proposed model while maintaining the emotional impact and naturalness
0:19:31and when we look at the responses generated by the
0:19:35system we see that mc-hred is
0:19:38on average two and a half words longer than the baseline
0:19:43so in conclusion here we have presented a corpus that shows expert
0:19:47strategy in positive emotion elicitation
0:19:50we also show how we use an unsupervised clustering method
0:19:56to obtain higher level information
0:19:59and use all of these in
0:20:01response generation
0:20:04in the future there are many things
0:20:06that need to be worked on but in particular we would like to look at
0:20:09multimodal information
0:20:11this is especially important for the
0:20:14emotional context of the dialogue
0:20:17and of course evaluation with user interaction is also important
0:20:23that was my presentation
0:20:51so the pre-training is
0:20:54using another corpus which we did not construct so
0:20:58we use this model
0:21:03the training data is
0:21:05is the time here
0:21:08so it's a large-scale corpus
0:21:11of movie subtitles
0:21:16right so the pre-training
0:21:20in pre-training we did not use any emotion or action information
0:21:24so the pre-training is
0:21:26only to prime the network towards dialogue generation
0:21:32and then in fine-tuning we give the model the ability to encode action context and emotion
0:21:38context and use this
0:21:40in the generation
0:21:56right so
0:21:59the word embedding
0:22:07so
0:22:09there are two uses of embedding in different ways the first one is
0:22:14using a pre-trained word embedding model
0:22:19we use that for the counsellor dialogue clustering and the other is in the model itself
0:22:24where we learn the word embeddings
0:22:26during pre-training
0:22:29it is learned
0:22:31it's learned by the utterance encoder
0:22:35on the large scale data
0:22:43cluster sentences or the dialogue our clustering
0:22:47for the expert response clustering we cluster sentences
0:22:51and for that we use the pre-trained word2vec model
0:23:01we average
0:23:02over the sentence
0:23:07right i've just heard about skip or yesterday whereas the s and that's of the
0:23:13different we think that
0:23:16q
0:23:49so there's definitely an overlap between the actions that we would like to find from the
0:23:54expert
0:23:55and actions in general dialogues
0:23:58so we did find for example backchannels
0:24:02backchannels are actions that are general in conversation and confirmation
0:24:10but the unsupervised clustering is especially helpful for these other actions
0:24:18and
0:24:20you do not need any expert annotation at all
0:24:57right
0:25:01so
0:25:03what we find is that most of the time the majority of the time the counsellor
0:25:07is able to elicit positive emotions
0:25:11in terms of the participant's reaction towards the video it highly varies
0:25:17so there are people who are not so reactive and there are people who are
0:25:22emotionally sensitive
0:25:24so
0:25:26we get
0:25:27different types of responses
0:25:30but this is an example of
0:25:34one of the dialogues so the red line here is the valence throughout the dialogue
0:25:40we can see that it begins quite positive and then after the video
0:25:43it goes
0:25:44very negative but as the dialogue progresses
0:25:47the
0:25:51counsellor
0:25:53successfully recovers this
0:25:55you know we have a more extensive analysis in another paper
0:26:00i'll be happy to help you