0:00:15 Okay, so hello. I am from Ulm University, as already introduced. Thank you for having me. I am going to talk about changing the level of directness of an utterance in dialogue.
0:00:27 First, let's have a little motivation for why this could be useful. If we look at human dialogue, one person could ask: "Do you want to eat a pizza?" For some reason the other person decides not to answer that question directly and says: "I'd prefer a warm meal." In human dialogue we can easily infer: okay, that is a no. The first person can then choose to be polite and, instead of saying directly "you should really go on a diet", just say that pizza has a lot of calories. The other person is not offended and can reply: okay, I'll take something else.
0:01:07 If we look at the same conversation with a dialogue system that is not equipped to handle indirectness, we can run into a number of problems. For example, the system asks: "Do you want to eat a pizza?" and the human answers: "I'd rather have a warm meal." If the system is not equipped to handle this indirectness and just expects a direct reply, it won't understand that. It then has to repeat the question, and the user has to state the answer more directly. That is not terrible, but it could be handled better by the system if it could understand the indirect version of the answer.
0:01:50 There is another problem on the output side, because as humans we sometimes expect our conversation partner to not be too direct. If the system chooses to be direct and says something like "you should really go on a diet", the human will be quite angry. So it would be better if the system could handle indirectness well both on the input and on the output side. That is why the goal of my work is changing the level of directness of an utterance.
0:02:22 Now let's have a look at the algorithm. First I will give an overview of the overall algorithm, and then address some challenges specifically.
0:02:34 My algorithm works with three different inputs: the current utterance, the previous utterance in the dialogue, and a pool of utterances it can choose from to exchange the current utterance. The next step is to estimate the directness level of those utterances, which gives us the directness of the current utterance and of every utterance in the pool. We need the previous utterance because directness of course depends on what was said before: the same utterance can have different levels of indirectness depending on the previous utterance. The next step is to filter the utterances so that the pool we can choose from contains only utterances with the opposite directness of the current utterance. In the last step, we determine which of those utterances is functionally most similar to the current one, which leaves us with the utterance we can exchange it for.
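As a minimal sketch, the steps above can be written down as a small pipeline. Both helper functions here are toy stand-ins invented for illustration; in the talk, the directness estimator is a recurrent neural network and the similarity measure comes from dialogue act vector models.

```python
# Sketch of the utterance-exchange algorithm: estimate directness,
# filter the pool for the opposite level, pick the most similar.
# Both helpers are toy stand-ins, not the models used in the talk.

def estimate_directness(utterance, previous):
    """Toy stand-in: an answer that repeats the topic word of the
    previous question counts as direct (1), otherwise indirect (3)."""
    topic = previous.rstrip(" ?").split()[-1]
    return 1 if topic in utterance else 3

def similarity(a, b):
    """Toy stand-in for functional similarity: word overlap (Jaccard)."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def exchange(current, previous, pool):
    """Return the utterance from `pool` that has the opposite
    directness level and is functionally most similar to `current`."""
    level = estimate_directness(current, previous)
    candidates = [u for u in pool
                  if estimate_directness(u, previous) != level]
    if not candidates:
        return current  # no valid exchange exists in the pool
    return max(candidates, key=lambda u: similarity(u, current))

previous = "do you want to eat a pizza ?"
pool = ["no i do not want a pizza", "i would rather have a warm meal"]
print(exchange("no pizza for me thanks", previous, pool))
```

The fallback when no candidate remains matters later in the talk: with a small pool there may simply be no alternative with the opposite directness level.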
0:03:41 There are two challenges in this algorithm: one is how to estimate the directness level, and the other is how to decide which utterances are functionally similar. Let's start with the latter: what is "functionally similar"? I define it as the degree to which two utterances can be used interchangeably in the dialogue, i.e. they fulfil the same function in the dialogue.
0:04:12 As a measure of functional similarity I decided to use dialogue act models. They are inspired by word vector models and follow the same principle: utterances are mapped into a vector space in such a manner that utterances appearing in the same context are mapped into close vicinity of each other. If two utterances are used in the same context, it is very likely that they can be exchanged in the dialogue. The distance in this vector space is then used as an approximation of functional similarity.
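To illustrate the idea, here is a toy sketch in which utterance vectors (made-up 2-D coordinates standing in for learned dialogue act embeddings) are compared; cosine similarity is one common choice, though the actual dimensionality and distance measure in the talk may differ.

```python
import numpy as np

# Toy illustration: utterances mapped to vectors (made-up 2-D
# coordinates standing in for learned dialogue-act embeddings).
# Utterances used in similar contexts should end up close together.
vectors = {
    "yes that would be great": np.array([0.9, 0.1]),
    "sounds good to me":       np.array([0.8, 0.2]),
    "no thanks":               np.array([-0.7, 0.6]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two utterance vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = vectors["yes that would be great"]
for text, vec in vectors.items():
    print(f"{text!r}: {cosine_similarity(query, vec):.2f}")
```

The two accepting utterances come out close together, while the rejection points in a different direction.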
0:04:54 I am fairly confident that this works because I have already published a paper about it earlier this year, and I will quickly summarise its findings so you can see why this is a good fit. I evaluated the accuracy of clusters obtained with k-means in the dialogue vector space, comparing them to a ground truth of clusters formed by hand-annotated dialogue acts. The goal was to see whether the grouping in the dialogue vector space corresponds to the annotated dialogue acts. I did a cross-corpus evaluation, so the dialogue act models were trained on a different corpus than the one the clustering was performed on. As you can see on the left side, the accuracy is very good. That is why I think dialogue act models work very well for identifying functionally similar utterances.
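The clustering evaluation can be sketched as follows: cluster toy "utterance vectors" with k-means and score the clusters against hand-assigned dialogue act labels by majority vote. The data, dimensions, and two-cluster setup are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two well-separated groups of "utterance vectors" with
# hand-assigned dialogue-act labels (0 = accept, 1 = reject).
points = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
                    rng.normal(3.0, 0.1, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)

def kmeans(x, k, steps=20):
    """Minimal k-means: random init from the data, then alternate
    assignment and centroid updates for a fixed number of steps."""
    centers = x[rng.choice(len(x), k, replace=False)]
    for _ in range(steps):
        assign = np.argmin(((x[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([x[assign == j].mean(axis=0) for j in range(k)])
    return assign

def cluster_accuracy(assign, labels):
    """Majority-vote mapping from clusters to annotated dialogue acts."""
    correct = 0
    for j in set(assign):
        members = labels[assign == j]
        correct += (members == np.bincount(members).argmax()).sum()
    return correct / len(labels)

assign = kmeans(points, k=2)
print(cluster_accuracy(assign, labels))  # well-separated toy data: 1.0
```

On this cleanly separated toy data the clusters match the labels perfectly; on real dialogue corpora the accuracy reported in the talk is what matters.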
0:05:54 Let's get to the estimation of directness, which was the second challenge. You can already see the architecture here: a recurrent neural network estimates the directness with a supervised learning approach. As input it uses, on the one hand, the sum of the word vectors: for every word in the utterance we take its word vector and simply add them all up. I also use the dialogue vector representation of the utterance as an input. Since the directness depends on the context, there is a recurrent connection, so the network also receives the input of the previous utterance. The output is framed as a classification problem: the network outputs the probability of the utterance being very direct (for example, "I want a vegetarian pizza"), slightly indirect (for example, "Can I get a vegetarian pizza instead?", which is not quite the same but still contains the main words necessary for the meaning), or very indirect (where you just say "I don't like meat" and hope the other person gets it).
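A rough sketch of the input representation and the recurrent classification step described above, with untrained random weights and toy dimensions. All sizes, the example words, and the three-class output are illustrative assumptions, not the actual configuration from the talk.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM_W, DIM_D, HIDDEN, CLASSES = 4, 3, 8, 3  # toy dimensions

# Toy word vectors; real ones would come from a trained model.
word_vecs = {w: rng.normal(size=DIM_W)
             for w in "i want a vegetarian pizza don't like meat".split()}

def utterance_features(utterance, dialogue_vec):
    """Input as described in the talk: the sum of the word vectors of
    the utterance, concatenated with its dialogue-vector representation."""
    bow = sum(word_vecs[w] for w in utterance.split())
    return np.concatenate([bow, dialogue_vec])

# Untrained recurrent cell plus softmax head, just to show the data
# flow: the hidden state carries the previous utterance into the
# current step, giving the context dependence.
W_in = rng.normal(size=(HIDDEN, DIM_W + DIM_D)) * 0.1
W_h = rng.normal(size=(HIDDEN, HIDDEN)) * 0.1
W_out = rng.normal(size=(CLASSES, HIDDEN)) * 0.1

def step(x, h):
    h = np.tanh(W_in @ x + W_h @ h)
    logits = W_out @ h
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs, h

h = np.zeros(HIDDEN)
for utt in ["i want a pizza", "i don't like meat"]:
    probs, h = step(utterance_features(utt, rng.normal(size=DIM_D)), h)
    print(utt, "->", np.round(probs, 2))
```

With training (omitted here), the three output probabilities would correspond to the very direct, slightly indirect, and very indirect classes.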
0:07:13 This had not been tested before, so as part of the evaluation I also evaluated how well this approach estimates directness.
0:07:25 With that, let's get to the evaluation. As I said, we measured on the one hand the accuracy of the directness estimation, and of course also the accuracy of the actual utterance exchange. For that we need a ground truth: a dialogue corpus that contains utterances we can exchange, an annotation of the directness level, and an annotation of dialogue acts so we can check whether we made a correct exchange. It was impossible to find a corpus like that, and I was not sure we could collect such a corpus ourselves, because it is very difficult not to inhibit the naturalness of the conversation while at the same time telling the participants: we need this meaning in different phrasings at different directness levels, to make sure that there are exchangeable equivalent utterances in the corpus.
0:08:33 So I decided to use an automatically generated corpus, which I want to present now. The generation started from a definition of the dialogue domain with system and user actions, plus succession rules that specify which action can follow which other action. Each action had multiple utterances that could be used, each weighted and with a directness level depending on the previous utterance. We then started at the start action and simply recursed through all the successors, and their successors in turn, until we reached the end, thereby generating all the dialogue flows that are possible within the domain we defined. The wording was then chosen randomly, and this resulted in more than four hundred thousand dialogue flows.
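The generation procedure can be sketched like this, with an invented miniature domain. The action names, succession rules, and wordings are illustrative; the real domain was much larger.

```python
import random

random.seed(0)

# Toy domain definition: succession rules (which action can follow
# which), plus several wordings per action. All names are invented
# for illustration; the real domain covered e.g. pizza ordering.
successors = {
    "greet":     ["ask_pizza"],
    "ask_pizza": ["accept", "decline"],
    "accept":    ["end"],
    "decline":   ["end"],
    "end":       [],
}
wordings = {
    "greet":     ["hello", "good evening"],
    "ask_pizza": ["do you want a pizza?", "shall we order pizza?"],
    "accept":    ["yes please", "that sounds great"],
    "decline":   ["no thanks", "i don't like pizza"],
    "end":       ["goodbye"],
}

def all_flows(action="greet"):
    """Recursively follow every successor from the start action until
    the end action, yielding every possible flow of actions."""
    if not successors[action]:
        return [[action]]
    return [[action] + rest
            for nxt in successors[action]
            for rest in all_flows(nxt)]

flows = all_flows()
# One generated dialogue: pick a random wording for each action.
dialogue = [random.choice(wordings[a]) for a in flows[0]]
print(len(flows), dialogue)
```

Enumerating every flow and then sampling wordings is what makes the resulting corpus complete with respect to the defined domain.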
0:09:42 Every dialogue act had several wordings. For example, "yes" could be worded as "that is great, I'm looking forward to it" or "that sounds delicious", depending on what the previous utterance was; or "I would like to order pizza" versus "Can I order pizza from you?". The topics of those conversations were everyday scenarios, for example ordering a pizza or arranging to cook together.
0:10:17 I tried to incorporate many elements of human conversation, for example over-answering, misunderstandings, requests for confirmation, corrections, and, as already mentioned, context-dependent directness levels.
0:10:35 For example, "Do you have time today?" can be answered with "I haven't planned anything yet", which is not a direct answer, so it has directness level three. After a different question, such as "What have you planned for today?", the answer "I haven't planned anything" is a direct one, and so this time it receives directness level one.
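In the generated corpus, this context dependence means the directness annotation is keyed by the pair (previous utterance, utterance) rather than by the utterance alone. A toy sketch, noting that the second question here is reconstructed, so the exact wordings are illustrative:

```python
# Toy annotation table: the same answer receives a different
# directness level depending on the preceding question.
directness = {
    ("do you have time today?", "i haven't planned anything yet"): 3,
    ("what have you planned for today?", "i haven't planned anything yet"): 1,
}

def directness_level(previous, utterance):
    """Look up the annotated directness level in its context."""
    return directness[(previous.lower(), utterance.lower())]

print(directness_level("Do you have time today?",
                       "I haven't planned anything yet"))  # → 3
```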
0:11:01 Of course, an automatically generated corpus has some limitations. It has less variation than natural conversations, both with regard to the dialogue flow and to the wording, which very likely makes it more predictable and therefore easier to process.
0:11:25 However, I also see some advantages in this approach. On the one hand, we have a very controlled environment: we can make sure that, for every utterance, an exchangeable alternative actually exists in the corpus. So we know there is a valid exchange, and if we do not find it, the fault lies with our algorithm and not with a missing correct utterance in the corpus. Also, the corpus was not annotated afterwards but generated together with the ground truth, so this ground truth is very dependable.
0:12:07 Another advantage of this approach is that we get a very complete dataset: we have all the possible flows and many different wordings. Having this for a small application can yield implications for what happens if we actually have a lot of data and approach full coverage. Usually, if I just collect dialogues, I will not have a lot of data and I will have poor coverage, and even large companies may not obtain data with the coverage of this small but complete generated set. So it can tell us something about what would be possible if we could get such data at all.
0:12:58 For our results this means that they do not represent the actual performance in an applied spoken dialogue system, because we do not have natural conversations; it is very likely that the approach will perform worse there. But we can evaluate the potential of our approach under ideal circumstances, so I think the results still have merit.
0:13:27 With that, let's get to the actual results; first, the accuracy of the directness estimation. As input we used a dialogue vector model trained on our automatically generated corpus, and word vector models trained on the Google News corpus (you can see the reference for that). The dependent variable is the accuracy of correctly predicting the level of directness as annotated. As independent variables, we used versions with and without word vectors as input, to see whether they improve the results at all. We also wanted to see whether the size of the training set impacts the classifier: we used ten-fold cross-validation as usual, which leads to a training corpus of ninety percent of the data, and we also tested what happens when we only use ten percent of the data. Furthermore, we used different dialogue vector models, trained on different sizes of the generated dialogue corpus, i.e. varying how many of the dialogues were included in the training.
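The ten-fold setup mentioned above can be sketched as follows; the data here are just index stand-ins for the annotated utterances.

```python
import random

random.seed(0)

# Ten-fold cross-validation: each fold holds out 10% of the data,
# the remaining 90% form the training set, rotating across folds.
data = list(range(100))       # stand-ins for annotated utterances
random.shuffle(data)
folds = [data[i::10] for i in range(10)]

splits = []
for k in range(10):
    test_set = folds[k]
    train_set = [x for j, f in enumerate(folds) if j != k for x in f]
    splits.append((train_set, test_set))

print(len(splits), len(splits[0][0]), len(splits[0][1]))  # → 10 90 10
```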
0:14:54 Here you can see the results. We could achieve a very high accuracy of directness estimation, but keep in mind that it is an automatically generated corpus, so that certainly plays a role. The baseline of majority-class prediction would have been 0.529, and we clearly outperform that. We can see a significant influence both of the size of the training set and of whether or not we include the word vectors.
0:15:33 I think the fact that the word vectors improve the estimation results so much really speaks for the quality of those models, even though they were not trained on our data. Extensive word vector models are commonly available, so that should not be a problem. What could be a problem is the size of the training set: since this is a supervised approach on annotated data, scaling it up would require a lot of annotated data. Perhaps in the future we could therefore consider an unsupervised approach that does not need as much annotated data.
0:16:28 For the accuracy of the utterance exchange, we again used the dialogue act models trained on the automatically generated corpus for the functional similarity, and for the directness estimation we used different versions of the trained classifier I just presented. The dependent variable is the percentage of correctly exchanged utterances; the independent variables are the classifier accuracy and, again, the size of the training corpus for the dialogue act models. Here you can see the results: the best performance we could achieve overall was 0.7, i.e. seventy percent of the utterances were correctly exchanged. We see a significant influence both of the classifier accuracy and of the size of the training data for the dialogue act models.
0:17:22 A common error made by the algorithm was that the exchanged utterance carried either more or less information than the original one. For example, "I want something spicy" was exchanged with "I want a large pepperoni pizza", and "large" is of course not included in the first sentence. This suggests that the dialogue act models, as we trained them, cannot differentiate that well between such variants. But this could be addressed by simply adding more context during training, i.e. taking more utterances in the vicinity into account.
0:18:06 We can see here the importance of a good classifier and a good similarity measure. I do not consider the similarity measure a problem, because it is trained on unannotated data, so we can just take large corpora of dialogue data and use them. Again, the annotated data is the real challenge, and we should consider an unsupervised approach.
0:18:32 A short discussion of the results. I think the approach shows high potential, but the evaluation was done in a theoretical setting, and we have not applied it to a full dialogue system, so some questions remain open. Our corpus lacks the variability of natural dialogue, which means the performance of the classifier and the dialogue vector model will very likely decrease in an actual dialogue; to compensate for that, we would need more data. We also have the problem that in an actual dialogue we do not really know whether a suitable alternative to exchange even exists. Again, with an increasing amount of data it becomes more likely, but it is not certain. So perhaps, as future work, we can look into the generation of utterances instead of just their exchange. Another point is the interrelation of user experience and exchange accuracy: at the moment we do not know what accuracy we actually need to achieve in order to improve the user experience, so that is also something we should look into.
0:19:56 That brings me to the end of my talk, where I want to conclude what I presented today. I discussed the impact of indirectness in human-computer interaction and proposed an approach for changing the level of directness of an utterance. The directness estimation is done using recurrent neural networks; the functional similarity measure uses dialogue act models. The evaluation shows the high potential of this approach, but there is also a lot of future work to do. It would be good to have a corpus of natural dialogues annotated with directness levels, to use for evaluation. An unsupervised estimation of the directness level would be beneficial, and an evaluation on an actual dialogue corpus would give more insight into how that impacts performance. The generation of suitable utterances would also be desirable, because we do not actually know whether the right utterance is in the corpus. And finally, of course, we would like to apply this to an actual full dialogue system.
0:21:08 Thank you very much for your attention.
0:21:52 No, I did not evaluate that yet.

0:22:12 Yes, a lot of my interest in this comes from cultural differences: directness is a very major difference that exists between cultures, and therefore it is of major interest to me.
0:23:06 I think such a corpus would be really good, and I am thinking about ways to build one. One of the main difficulties is that, as I said, I am coming from the angle of cultural differences. For example, I would expect a German speaker to be even more direct than, say, a Japanese speaker. But then we have the translation problem: we cannot exchange German utterances for Japanese utterances, and that makes it difficult. I am also not sure how to ensure, for example with German participants, that they would actually use the indirect versions as well as the direct utterances. So there is a bit of a problem there.
0:24:28 That sounds interesting, thank you very much.
0:24:42 So that was only a small part of the evaluation, and there I just used the k-means clustering algorithm to find clusters. In this work I do not actually form clusters but just use the closest utterance.
0:25:15 No, I used a very basic notion of directness there: if it is a colloquial reformulation, like, you know, if I find the main words from the original sentence in the exchanged sentence, then it is rated as very direct.