0:00:15 Okay, so hello. I am from Ulm University, as already introduced. Thank you for having me. I am going to talk about changing the level of directness of an utterance in dialogue.
0:00:27 First, let's have a little motivation for why this could be useful. If we look at human dialogue, one person could ask: "Do you want to eat a pizza?" For some reason the other person decides not to answer that question directly and says: "I'd prefer a warm meal." In human dialogue we can easily infer: okay, that is a no. The first person can then choose to be polite and, instead of saying directly "you should really go on a diet", just say that pizza has a lot of calories. The other person is not offended and can reply: okay, I'll take something else.
0:01:07 If we look at the same conversation with a dialogue system that is not equipped to handle indirectness, we can run into a number of problems. For example, the system asks: "Do you want to eat a pizza?" and the human answers: "I'd rather have a warm meal." If the system is not equipped to handle this indirectness and just expects a direct reply, it won't understand that. It then has to repeat the question, and the user has to state the answer more directly. That is not terrible, but it could be handled better by the system if it could understand the indirect version of the answer.
0:01:50 There is another problem on the output side, because as humans we sometimes expect our conversation partner to not be too direct. If the system chooses to be direct and says something like "you should really go on a diet", the human will be quite angry. So it would be better if the system could handle indirectness well both on the input and on the output side. That is why the goal of my work is changing the level of directness of an utterance.
0:02:22 Now let's have a look at the algorithm. First I will give an overview of the overall algorithm, and then address some challenges specifically.
0:02:34 My algorithm works with three different inputs: the current utterance, the previous utterance in the dialogue, and a pool of utterances it can choose from to exchange the current utterance. The next step is to estimate the directness level of those utterances, which gives us the directness of the current utterance and of every utterance in the pool. We need the previous utterance because directness of course depends on what was said before: the same utterance can have different levels of indirectness depending on the previous utterance. The next step is to filter the utterances so that the pool we can choose from contains only utterances with the opposite directness of the current utterance. In the last step, we determine which of those utterances is functionally most similar to the current one, which leaves us with the utterance we can exchange it for.
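As a minimal sketch, the steps above can be written down as a small pipeline. Both helper functions here are toy stand-ins invented for illustration; in the talk, the directness estimator is a recurrent neural network and the similarity measure comes from dialogue act vector models.

```python
# Sketch of the utterance-exchange algorithm: estimate directness,
# filter the pool for the opposite level, pick the most similar.
# Both helpers are toy stand-ins, not the models used in the talk.

def estimate_directness(utterance, previous):
    """Toy stand-in: an answer that repeats the topic word of the
    previous question counts as direct (1), otherwise indirect (3)."""
    topic = previous.rstrip(" ?").split()[-1]
    return 1 if topic in utterance else 3

def similarity(a, b):
    """Toy stand-in for functional similarity: word overlap (Jaccard)."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def exchange(current, previous, pool):
    """Return the utterance from `pool` that has the opposite
    directness level and is functionally most similar to `current`."""
    level = estimate_directness(current, previous)
    candidates = [u for u in pool
                  if estimate_directness(u, previous) != level]
    if not candidates:
        return current  # no valid exchange exists in the pool
    return max(candidates, key=lambda u: similarity(u, current))

previous = "do you want to eat a pizza ?"
pool = ["no i do not want a pizza", "i would rather have a warm meal"]
print(exchange("no pizza for me thanks", previous, pool))
```

The fallback when no candidate remains matters later in the talk: with a small pool there may simply be no alternative with the opposite directness level.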
0:03:41 There are two challenges in this algorithm: one is how to estimate the directness level, and the other is how to decide which utterances are functionally similar. Let's start with the latter: what is "functionally similar"? I define it as the degree to which two utterances can be used interchangeably in the dialogue, i.e. they fulfil the same function in the dialogue.
0:04:12 As a measure of functional similarity I decided to use dialogue act models. They are inspired by word vector models and follow the same principle: utterances are mapped into a vector space in such a manner that utterances appearing in the same context are mapped into close vicinity of each other. If two utterances are used in the same context, it is very likely that they can be exchanged in the dialogue. The distance in this vector space is then used as an approximation of functional similarity.
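To illustrate the idea, here is a toy sketch in which utterance vectors (made-up 2-D coordinates standing in for learned dialogue act embeddings) are compared; cosine similarity is one common choice, though the actual dimensionality and distance measure in the talk may differ.

```python
import numpy as np

# Toy illustration: utterances mapped to vectors (made-up 2-D
# coordinates standing in for learned dialogue-act embeddings).
# Utterances used in similar contexts should end up close together.
vectors = {
    "yes that would be great": np.array([0.9, 0.1]),
    "sounds good to me":       np.array([0.8, 0.2]),
    "no thanks":               np.array([-0.7, 0.6]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two utterance vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = vectors["yes that would be great"]
for text, vec in vectors.items():
    print(f"{text!r}: {cosine_similarity(query, vec):.2f}")
```

The two accepting utterances come out close together, while the rejection points in a different direction.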
0:04:54 I am fairly confident that this works because I have already published a paper about it earlier this year, and I will quickly summarise its findings so you can see why this is a good fit. I evaluated the accuracy of clusters obtained with k-means in the dialogue vector space, comparing them to a ground truth of clusters formed by hand-annotated dialogue acts. The goal was to see whether the grouping in the dialogue vector space corresponds to the annotated dialogue acts. I did a cross-corpus evaluation, so the dialogue act models were trained on a different corpus than the one the clustering was performed on. As you can see on the left side, the accuracy is very good. That is why I think dialogue act models work very well for identifying functionally similar utterances.
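The clustering evaluation can be sketched as follows: cluster toy "utterance vectors" with k-means and score the clusters against hand-assigned dialogue act labels by majority vote. The data, dimensions, and two-cluster setup are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two well-separated groups of "utterance vectors" with
# hand-assigned dialogue-act labels (0 = accept, 1 = reject).
points = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
                    rng.normal(3.0, 0.1, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)

def kmeans(x, k, steps=20):
    """Minimal k-means: random init from the data, then alternate
    assignment and centroid updates for a fixed number of steps."""
    centers = x[rng.choice(len(x), k, replace=False)]
    for _ in range(steps):
        assign = np.argmin(((x[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([x[assign == j].mean(axis=0) for j in range(k)])
    return assign

def cluster_accuracy(assign, labels):
    """Majority-vote mapping from clusters to annotated dialogue acts."""
    correct = 0
    for j in set(assign):
        members = labels[assign == j]
        correct += (members == np.bincount(members).argmax()).sum()
    return correct / len(labels)

assign = kmeans(points, k=2)
print(cluster_accuracy(assign, labels))  # well-separated toy data: 1.0
```

On this cleanly separated toy data the clusters match the labels perfectly; on real dialogue corpora the accuracy reported in the talk is what matters.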
0:05:54 Let's get to the estimation of directness, which was the second challenge. You can already see the architecture here: a recurrent neural network estimates the directness with a supervised learning approach. As input it uses, on the one hand, the sum of the word vectors: for every word in the utterance we take its word vector and simply add them all up. I also use the dialogue vector representation of the utterance as an input. Since the directness depends on the context, there is a recurrent connection, so the network also receives the input of the previous utterance. The output is framed as a classification problem: the network outputs the probability of the utterance being very direct (for example, "I want a vegetarian pizza"), slightly indirect (for example, "Can I get a vegetarian pizza instead?", which is not quite the same but still contains the main words necessary for the meaning), or very indirect (where you just say "I don't like meat" and hope the other person gets it).
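A rough sketch of the input representation and the recurrent classification step described above, with untrained random weights and toy dimensions. All sizes, the example words, and the three-class output are illustrative assumptions, not the actual configuration from the talk.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM_W, DIM_D, HIDDEN, CLASSES = 4, 3, 8, 3  # toy dimensions

# Toy word vectors; real ones would come from a trained model.
word_vecs = {w: rng.normal(size=DIM_W)
             for w in "i want a vegetarian pizza don't like meat".split()}

def utterance_features(utterance, dialogue_vec):
    """Input as described in the talk: the sum of the word vectors of
    the utterance, concatenated with its dialogue-vector representation."""
    bow = sum(word_vecs[w] for w in utterance.split())
    return np.concatenate([bow, dialogue_vec])

# Untrained recurrent cell plus softmax head, just to show the data
# flow: the hidden state carries the previous utterance into the
# current step, giving the context dependence.
W_in = rng.normal(size=(HIDDEN, DIM_W + DIM_D)) * 0.1
W_h = rng.normal(size=(HIDDEN, HIDDEN)) * 0.1
W_out = rng.normal(size=(CLASSES, HIDDEN)) * 0.1

def step(x, h):
    h = np.tanh(W_in @ x + W_h @ h)
    logits = W_out @ h
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs, h

h = np.zeros(HIDDEN)
for utt in ["i want a pizza", "i don't like meat"]:
    probs, h = step(utterance_features(utt, rng.normal(size=DIM_D)), h)
    print(utt, "->", np.round(probs, 2))
```

With training (omitted here), the three output probabilities would correspond to the very direct, slightly indirect, and very indirect classes.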
0:07:13 This had not been tested before, so as part of the evaluation I also evaluated how well this approach estimates directness.
0:07:25 With that, let's get to the evaluation. As I said, we measured on the one hand the accuracy of the directness estimation, and of course also the accuracy of the actual utterance exchange. For that we need a ground truth: a dialogue corpus that contains utterances we can exchange, an annotation of the directness level, and an annotation of dialogue acts so we can check whether we made a correct exchange. It was impossible to find a corpus like that, and I was not sure we could collect such a corpus ourselves, because it is very difficult not to inhibit the naturalness of the conversation while at the same time telling the participants: we need this meaning in different phrasings at different directness levels, to make sure that there are exchangeable equivalent utterances in the corpus.
0:08:33 So I decided to use an automatically generated corpus, which I want to present now. The generation started from a definition of the dialogue domain with system and user actions, plus succession rules that specify which action can follow which other action. Each action had multiple utterances that could be used, each weighted and with a directness level depending on the previous utterance. We then started at the start action and simply recursed through all the successors, and their successors in turn, until we reached the end, thereby generating all the dialogue flows that are possible within the domain we defined. The wording was then chosen randomly, and this resulted in more than four hundred thousand dialogue flows.
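The generation procedure can be sketched like this, with an invented miniature domain. The action names, succession rules, and wordings are illustrative; the real domain was much larger.

```python
import random

random.seed(0)

# Toy domain definition: succession rules (which action can follow
# which), plus several wordings per action. All names are invented
# for illustration; the real domain covered e.g. pizza ordering.
successors = {
    "greet":     ["ask_pizza"],
    "ask_pizza": ["accept", "decline"],
    "accept":    ["end"],
    "decline":   ["end"],
    "end":       [],
}
wordings = {
    "greet":     ["hello", "good evening"],
    "ask_pizza": ["do you want a pizza?", "shall we order pizza?"],
    "accept":    ["yes please", "that sounds great"],
    "decline":   ["no thanks", "i don't like pizza"],
    "end":       ["goodbye"],
}

def all_flows(action="greet"):
    """Recursively follow every successor from the start action until
    the end action, yielding every possible flow of actions."""
    if not successors[action]:
        return [[action]]
    return [[action] + rest
            for nxt in successors[action]
            for rest in all_flows(nxt)]

flows = all_flows()
# One generated dialogue: pick a random wording for each action.
dialogue = [random.choice(wordings[a]) for a in flows[0]]
print(len(flows), dialogue)
```

Enumerating every flow and then sampling wordings is what makes the resulting corpus complete with respect to the defined domain.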
0:09:42 Every dialogue act had several wordings. For example, "yes" could be worded as "that is great, I'm looking forward to it" or "that sounds delicious", depending on what the previous utterance was; or "I would like to order pizza" versus "Can I order pizza from you?". The topics of those conversations were everyday scenarios, for example ordering a pizza or arranging to cook together.
0:10:17 I tried to incorporate many elements of human conversation, for example over-answering, misunderstandings, requests for confirmation, corrections, and, as already mentioned, context-dependent directness levels.
0:10:35 For example, "Do you have time today?" can be answered with "I haven't planned anything yet", which is not a direct answer, so it has directness level three. After a different question, such as "What have you planned for today?", the answer "I haven't planned anything" is a direct one, and so this time it receives directness level one.
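In the generated corpus, this context dependence means the directness annotation is keyed by the pair (previous utterance, utterance) rather than by the utterance alone. A toy sketch, noting that the second question here is reconstructed, so the exact wordings are illustrative:

```python
# Toy annotation table: the same answer receives a different
# directness level depending on the preceding question.
directness = {
    ("do you have time today?", "i haven't planned anything yet"): 3,
    ("what have you planned for today?", "i haven't planned anything yet"): 1,
}

def directness_level(previous, utterance):
    """Look up the annotated directness level in its context."""
    return directness[(previous.lower(), utterance.lower())]

print(directness_level("Do you have time today?",
                       "I haven't planned anything yet"))  # → 3
```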
0:11:01 Of course, an automatically generated corpus has some limitations. It has less variation than natural conversations, both with regard to the dialogue flow and to the wording, which very likely makes it more predictable and therefore easier to process.
0:11:25 However, I also see some advantages in this approach. On the one hand, we have a very controlled environment: we can make sure that, for every utterance, an exchangeable alternative actually exists in the corpus. So we know there is a valid exchange, and if we do not find it, the fault lies with our algorithm and not with a missing correct utterance in the corpus. Also, the corpus was not annotated afterwards but generated together with the ground truth, so this ground truth is very dependable.
0:12:07 Another advantage of this approach is that we get a very complete dataset: we have all the possible flows and many different wordings. Having this for a small application can yield implications for what happens if we actually have a lot of data and approach full coverage. Usually, if I just collect dialogues, I will not have a lot of data and I will have poor coverage, and even large companies may not obtain data with the coverage of this small but complete generated set. So it can tell us something about what would be possible if we could get such data at all.
0:12:58 For our results this means that they do not represent the actual performance in an applied spoken dialogue system, because we do not have natural conversations; it is very likely that the approach will perform worse there. But we can evaluate the potential of our approach under ideal circumstances, so I think the results still have merit.
0:13:27 With that, let's get to the actual results; first, the accuracy of the directness estimation. As input we used a dialogue vector model trained on our automatically generated corpus, and word vector models trained on the Google News corpus (you can see the reference for that). The dependent variable is the accuracy of correctly predicting the level of directness as annotated. As independent variables, we used versions with and without word vectors as input, to see whether they improve the results at all. We also wanted to see whether the size of the training set impacts the classifier: we used ten-fold cross-validation as usual, which leads to a training corpus of ninety percent of the data, and we also tested what happens when we only use ten percent of the data. Furthermore, we used different dialogue vector models, trained on different sizes of the generated dialogue corpus, i.e. varying how many of the dialogues were included in the training.
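The ten-fold setup mentioned above can be sketched as follows; the data here are just index stand-ins for the annotated utterances.

```python
import random

random.seed(0)

# Ten-fold cross-validation: each fold holds out 10% of the data,
# the remaining 90% form the training set, rotating across folds.
data = list(range(100))       # stand-ins for annotated utterances
random.shuffle(data)
folds = [data[i::10] for i in range(10)]

splits = []
for k in range(10):
    test_set = folds[k]
    train_set = [x for j, f in enumerate(folds) if j != k for x in f]
    splits.append((train_set, test_set))

print(len(splits), len(splits[0][0]), len(splits[0][1]))  # → 10 90 10
```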
0:14:54 Here you can see the results. We could achieve a very high accuracy of directness estimation, but keep in mind that it is an automatically generated corpus, so that certainly plays a role. The baseline of majority-class prediction would have been 0.529, and we clearly outperform that. We can see a significant influence both of the size of the training set and of whether or not we include the word vectors.
0:15:33 I think the fact that the word vectors improve the estimation results so much really speaks for the quality of those models, even though they were not trained on our data. Extensive word vector models are commonly available, so that should not be a problem. What could be a problem is the size of the training set: since this is a supervised approach on annotated data, scaling it up would require a lot of annotated data. Perhaps in the future we could therefore consider an unsupervised approach that does not need as much annotated data.
0:16:28 For the accuracy of the utterance exchange, we again used the dialogue act models trained on the automatically generated corpus for the functional similarity, and for the directness estimation we used different versions of the trained classifier I just presented. The dependent variable is the percentage of correctly exchanged utterances; the independent variables are the classifier accuracy and, again, the size of the training corpus for the dialogue act models. Here you can see the results: the best performance we could achieve overall was 0.7, i.e. seventy percent of the utterances were correctly exchanged. We see a significant influence both of the classifier accuracy and of the size of the training data for the dialogue act models.
0:17:22 A common error made by the algorithm was that the exchanged utterance carried either more or less information than the original one. For example, "I want something spicy" was exchanged with "I want a large pepperoni pizza", and "large" is of course not included in the first sentence. This suggests that the dialogue act models, as we trained them, cannot differentiate that well between such variants. But this could be addressed by simply adding more context during training, i.e. taking more utterances in the vicinity into account.
0:18:06 We can see here the importance of a good classifier and a good similarity measure. I do not consider the similarity measure a problem, because it is trained on unannotated data, so we can just take large corpora of dialogue data and use them. Again, the annotated data is the real challenge, and we should consider an unsupervised approach.
0:18:32 A short discussion of the results. I think the approach shows high potential, but the evaluation was done in a theoretical setting, and we have not applied it to a full dialogue system, so some questions remain open. Our corpus lacks the variability of natural dialogue, which means the performance of the classifier and the dialogue vector model will very likely decrease in an actual dialogue; to compensate for that, we would need more data. We also have the problem that in an actual dialogue we do not really know whether a suitable alternative to exchange even exists. Again, with an increasing amount of data it becomes more likely, but it is not certain. So perhaps, as future work, we can look into the generation of utterances instead of just their exchange. Another point is the interrelation of user experience and exchange accuracy: at the moment we do not know what accuracy we actually need to achieve in order to improve the user experience, so that is also something we should look into.
0:19:56 That brings me to the end of my talk, where I want to conclude what I presented today. I discussed the impact of indirectness in human-computer interaction and proposed an approach for changing the level of directness of an utterance. The directness estimation is done using recurrent neural networks; the functional similarity measure uses dialogue act models. The evaluation shows the high potential of this approach, but there is also a lot of future work to do. It would be good to have a corpus of natural dialogues annotated with directness levels, to use for evaluation. An unsupervised estimation of the directness level would be beneficial, and an evaluation on an actual dialogue corpus would give more insight into how that impacts performance. The generation of suitable utterances would also be desirable, because we do not actually know whether the right utterance is in the corpus. And finally, of course, we would like to apply this to an actual full dialogue system.
0:21:08 Thank you very much for your attention.
0:21:52 No, I did not evaluate that yet.

0:22:12 Yes, a lot of my interest in this comes from cultural differences: directness is a very major difference that exists between cultures, and therefore it is of major interest to me.
0:23:06 I think such a corpus would be really good, and I am thinking about ways to build one. One of the main difficulties is that, as I said, I am coming from the angle of cultural differences. For example, I would expect a German speaker to be even more direct than, say, a Japanese speaker. But then we have the translation problem: we cannot exchange German utterances for Japanese utterances, and that makes it difficult. I am also not sure how to ensure, for example with German participants, that they would actually use the indirect versions as well as the direct utterances. So there is a bit of a problem there.
0:24:28 That sounds interesting, thank you very much.
0:24:42 So that was only a small part of the evaluation, and there I just used the k-means clustering algorithm to find clusters. In this work I do not actually form clusters but just use the closest utterance.
0:25:15 No, I used a very basic notion of directness there: if it is a colloquial reformulation, like, you know, if I find the main words from the original sentence in the exchanged sentence, then it is rated as very direct.