| 0:00:16 | Hi everyone, | 
|---|
| 0:00:17 | I am Nikos, from Saarland University in Germany. | 
|---|
| 0:00:22 | I'm going to talk to you about | 
|---|
| 0:00:24 | a way to discover user groups for natural language generation in dialogue. | 
|---|
| 0:00:29 | This is work I've done together with Christoph Teichmann and Alexander Koller. | 
|---|
| 0:00:39 | Let's look at this example here. | 
|---|
| 0:00:42 | We have a navigation system that tells | 
|---|
| 0:00:45 | the user: "turn right after Melbourne Central". | 
|---|
| 0:00:50 | User A succeeds | 
|---|
| 0:00:52 | in finding the right turn, | 
|---|
| 0:00:55 | and user B fails. | 
|---|
| 0:00:58 | So why could that be? | 
|---|
| 0:01:03 | Well, there are different reasons why | 
|---|
| 0:01:05 | users react differently to such instructions. | 
|---|
| 0:01:10 | Most likely, here, user B is not from Melbourne, | 
|---|
| 0:01:16 | so they do not know what "Melbourne Central" means, | 
|---|
| 0:01:20 | but we can imagine other reasons as well, such as | 
|---|
| 0:01:26 | demographic factors, or lack of | 
|---|
| 0:01:29 | experience with navigation systems. | 
|---|
| 0:01:34 | However, such information is often difficult to obtain. | 
|---|
| 0:01:40 | We could ask everyone, before they use the navigation system, where they are from, | 
|---|
| 0:01:47 | but in an interactive setting, a system can do something better: | 
|---|
| 0:01:52 | it can collect observations and react to them. So ideally, after observing something like this, | 
|---|
| 0:01:58 | a system would go: OK, for user A, using place names from Melbourne works, | 
|---|
| 0:02:04 | but then it would adapt to user B and say something like: "at the roundabout, | 
|---|
| 0:02:09 | take the third exit". | 
|---|
| 0:02:14 | So people deal with this problem in different ways. One approach is, of course, to | 
|---|
| 0:02:18 | completely ignore it, | 
|---|
| 0:02:21 | which we don't want. | 
|---|
| 0:02:24 | Another approach is | 
|---|
| 0:02:26 | to use | 
|---|
| 0:02:27 | one model for every user. | 
|---|
| 0:02:31 | However, this requires lots of data for that user, and we might lose information | 
|---|
| 0:02:37 | that | 
|---|
| 0:02:39 | might help us from similar users. | 
|---|
| 0:02:44 | Another approach would be to use predefined groups: | 
|---|
| 0:02:48 | so, for example, have | 
|---|
| 0:02:50 | a group for residents of Melbourne and another group for outsiders. | 
|---|
| 0:02:57 | But this is hard to annotate, and it's also hard to know in advance | 
|---|
| 0:03:04 | which categories could be relevant, and | 
|---|
| 0:03:09 | which categories we can actually find in the dataset. | 
|---|
| 0:03:16 | So instead of doing these things, | 
|---|
| 0:03:19 | we assume that the users' behavior clusters | 
|---|
| 0:03:23 | into | 
|---|
| 0:03:24 | groups that we cannot observe, | 
|---|
| 0:03:29 | and we use Bayesian reasoning to infer those groups from unannotated | 
|---|
| 0:03:35 | training data, | 
|---|
| 0:03:36 | and then, at test time, to dynamically assign users to those groups as the dialogue progresses. | 
|---|
| 0:03:46 | So our starting point is a simple log-linear model of language use, | 
|---|
| 0:03:52 | where, in particular, we are agnostic as to which | 
|---|
| 0:03:57 | communicative task we are simulating: comprehension or production. | 
|---|
| 0:04:02 | We just say, in general, that we want to predict the behavior b | 
|---|
| 0:04:07 | of the user in response to a stimulus s coming from | 
|---|
| 0:04:12 | the system. So, if we are trying to simulate language production, | 
|---|
| 0:04:17 | the stimulus can be the communicative goal that the user is trying to achieve, and the | 
|---|
| 0:04:22 | behavior would be the utterance that the user produces, or some other linguistic choice they | 
|---|
| 0:04:28 | make. | 
|---|
| 0:04:31 | And if we want to predict what the user would understand, | 
|---|
| 0:04:35 | the stimulus is a system-produced utterance, and the behavior is the meaning that the user | 
|---|
| 0:04:42 | assigns to | 
|---|
| 0:04:43 | the utterance. | 
|---|
| 0:04:47 | So this is | 
|---|
| 0:04:49 | how our basic model looks, | 
|---|
| 0:04:52 | before we add the user groups: | 
|---|
| 0:04:54 | it's a log-linear model with a real-valued parameter vector θ | 
|---|
| 0:05:00 | and a set of feature functions φ over behaviors and stimuli. | 
|---|
| 0:05:05 | This model can be trained on a dataset of pairs of behaviors and stimuli | 
|---|
| 0:05:11 | using normal gradient-based methods. | 
|---|
| 0:05:15 | We have actually already used this kind of model in previous work, for | 
|---|
| 0:05:20 | reference resolution in dialogue. | 
|---|
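As a minimal sketch (an illustration, not the authors' actual code), this is the kind of basic log-linear model being described, assuming a finite list of candidate behaviors and a hypothetical feature function `phi(b, s)` that returns a NumPy feature vector:

```python
import numpy as np

def loglinear_probs(theta, phi, behaviors, stimulus):
    """P(b | s; theta) ∝ exp(theta · phi(b, s)), normalized over
    the finite set of candidate behaviors."""
    scores = np.array([theta @ phi(b, stimulus) for b in behaviors])
    scores -= scores.max()          # subtract the max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()
```

Training this basic model then amounts to maximizing the log-likelihood of the observed behavior-stimulus pairs with any gradient-based optimizer.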
| 0:05:24 | So, | 
|---|
| 0:05:27 | if we now want to extend this model with user groups, | 
|---|
| 0:05:33 | we just assume that there is a finite number of user groups in the data, | 
|---|
| 0:05:39 | and we give | 
|---|
| 0:05:41 | each of the groups its own parameter vector. | 
|---|
| 0:05:46 | So we replace the single parameter vector θ from the model before | 
|---|
| 0:05:53 | with a set of group-specific parameter vectors θ_g. If we knew exactly which group a | 
|---|
| 0:06:00 | user belongs to, | 
|---|
| 0:06:01 | all we would have to do is just use these new | 
|---|
| 0:06:06 | parameters, and | 
|---|
| 0:06:08 | we would have a new prediction model that is tailored to that group in particular. | 
|---|
| 0:06:16 | However, we still | 
|---|
| 0:06:20 | want to adapt to users that we haven't seen in the training data. | 
|---|
| 0:06:27 | So we assume that the training data was generated in the following way: | 
|---|
| 0:06:33 | we have a set | 
|---|
| 0:06:34 | of users u, | 
|---|
| 0:06:38 | and each user is assigned | 
|---|
| 0:06:42 | to a group | 
|---|
| 0:06:45 | with a probability | 
|---|
| 0:06:47 | given by π, which is another parameter vector, one that determines the prior | 
|---|
| 0:06:53 | probability of each group. | 
|---|
| 0:06:57 | And then, as we said, we have one parameter vector per group, so now the | 
|---|
| 0:07:02 | behavior of the user | 
|---|
| 0:07:05 | depends not only on the stimulus but also on their group assignment, via | 
|---|
| 0:07:10 | the group-specific parameter vectors. | 
|---|
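One way to write this generative story down, as a hedged sketch: the talk does not spell out the exact parameterization of the prior, so a softmax form is assumed here:

```latex
% Assumed notation: g ranges over the groups; \pi and the \theta_g are parameter vectors.
% 1. Each user u is assigned to a group g with prior probability
P(g \mid \pi) = \frac{\exp(\pi_g)}{\sum_{g'} \exp(\pi_{g'})}
% 2. Each behavior b of u in response to a stimulus s is then drawn from
P(b \mid s, g; \theta) = \frac{\exp\!\big(\theta_g \cdot \phi(b, s)\big)}{\sum_{b'} \exp\!\big(\theta_g \cdot \phi(b', s)\big)}
```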
| 0:07:16 | So now let's suppose that we have trained our system on the training data, | 
|---|
| 0:07:23 | and then a new user starts talking to us. | 
|---|
| 0:07:28 | Since we don't know what their actual group is, | 
|---|
| 0:07:31 | we marginalize over all groups using the prior probability, | 
|---|
| 0:07:37 | and so we directly have | 
|---|
| 0:07:40 | an idea of what they would do, | 
|---|
| 0:07:46 | given the prior probabilities that we have observed in the training data, and | 
|---|
| 0:07:51 | we can already use this model for interacting with them, and then observe their behavior. | 
|---|
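Concretely, before any observations come in, the prediction for a new user marginalizes over the groups using the prior (same assumed notation as above):

```latex
P(b \mid s) = \sum_{g} P(g \mid \pi)\, P(b \mid s, g; \theta)
```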
| 0:08:00 | So as the user | 
|---|
| 0:08:02 | keeps interacting with the system, we start collecting observations for them. | 
|---|
| 0:08:09 | So let's say we have | 
|---|
| 0:08:11 | a set d_u of observations for user u up to that particular time step. | 
|---|
| 0:08:20 | We can now use these observations to | 
|---|
| 0:08:24 | find out which group u belongs to. | 
|---|
| 0:08:28 | We can do that because, | 
|---|
| 0:08:30 | as I said, we have group-specific | 
|---|
| 0:08:34 | behavior predictions. | 
|---|
| 0:08:36 | So we can | 
|---|
| 0:08:39 | calculate the probability on the right-hand side: the probability of the observations for the | 
|---|
| 0:08:46 | user, given the group-specific parameters of each group. | 
|---|
| 0:08:51 | And we also have the prior membership probabilities, so with Bayes' rule we can also | 
|---|
| 0:08:57 | compute | 
|---|
| 0:08:59 | the probability that the user belongs to each of the groups g, given the data. | 
|---|
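A sketch of this posterior computation, reusing the hypothetical `loglinear_probs` from above; `group_priors` and `group_thetas` are assumed to come out of training:

```python
def group_posterior(group_priors, group_thetas, observations, phi, behaviors):
    """P(g | d_u) ∝ P(g; pi) * prod over (b, s) in d_u of P(b | s; theta_g),
    computed in log space and then renormalized."""
    log_post = np.log(np.asarray(group_priors, dtype=float))
    for g, theta_g in enumerate(group_thetas):
        for b, s in observations:              # d_u: the user's observed pairs
            probs = loglinear_probs(theta_g, phi, behaviors, s)
            log_post[g] += np.log(probs[behaviors.index(b)])
    log_post -= log_post.max()                 # stability before exponentiating
    post = np.exp(log_post)
    return post / post.sum()
```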
| 0:09:09 | So if we plug this new posterior group membership estimate | 
|---|
| 0:09:14 | into the previous | 
|---|
| 0:09:16 | behavior prediction model, | 
|---|
| 0:09:19 | we get | 
|---|
| 0:09:20 | a new | 
|---|
| 0:09:22 | prediction model that takes into account | 
|---|
| 0:09:28 | the data that we have seen for this new user, and | 
|---|
| 0:09:31 | the new group membership estimate. | 
|---|
| 0:09:35 | And as we collect more observations from the user, | 
|---|
| 0:09:41 | we hopefully get a more accurate group membership estimate, and a better behavior | 
|---|
| 0:09:45 | prediction. | 
|---|
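In other words, the prior in the earlier marginalization is simply replaced by the posterior (assumed notation as before):

```latex
P(b \mid s, d_u) = \sum_{g} P(g \mid d_u)\, P(b \mid s, g; \theta)
```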
| 0:09:50 | Now, how do we train such a system, i.e. find the best parameter setting? | 
|---|
| 0:09:58 | As we said, our model has | 
|---|
| 0:10:01 | parameters π, which determine the prior group membership probabilities, and, | 
|---|
| 0:10:06 | for each of the groups, | 
|---|
| 0:10:09 | one parameter | 
|---|
| 0:10:11 | vector θ_g for the features. | 
|---|
| 0:10:15 | Now we assume that we have a corpus of | 
|---|
| 0:10:19 | behaviors and stimuli, | 
|---|
| 0:10:21 | and for each of these behavior-stimulus pairs | 
|---|
| 0:10:25 | we know the user that produced it, | 
|---|
| 0:10:29 | but we don't know the groups of the users. | 
|---|
| 0:10:33 | So we will try to maximize the data likelihood | 
|---|
| 0:10:37 | according to | 
|---|
| 0:10:40 | the previously defined | 
|---|
| 0:10:43 | behavior probabilities. | 
|---|
| 0:10:46 | However, it is not straightforward to use gradient ascent as for the | 
|---|
| 0:10:52 | basic model, because we don't know the group assignments. | 
|---|
| 0:10:58 | So instead, | 
|---|
| 0:11:00 | we use | 
|---|
| 0:11:01 | a method similar to expectation maximization. | 
|---|
| 0:11:07 | In the beginning, we just initialize all parameters | 
|---|
| 0:11:13 | randomly, from a normal distribution, | 
|---|
| 0:11:15 | and then, at each step, | 
|---|
| 0:11:18 | we compute | 
|---|
| 0:11:20 | the group membership probabilities, | 
|---|
| 0:11:24 | given the data, for each user, | 
|---|
| 0:11:29 | using the parameter setting from the previous step, | 
|---|
| 0:11:34 | and we use these probabilities | 
|---|
| 0:11:37 | as frequencies for the observations, | 
|---|
| 0:11:42 | weighting them according to this distribution. | 
|---|
| 0:11:46 | So we get a set of weighted observations with | 
|---|
| 0:11:51 | "observed" | 
|---|
| 0:11:54 | group memberships. | 
|---|
| 0:11:55 | Now we can use normal gradient ascent to maximize a lower | 
|---|
| 0:12:01 | bound of the log-likelihood given these observations, | 
|---|
| 0:12:06 | and we find a new parameter setting, and | 
|---|
| 0:12:12 | then we | 
|---|
| 0:12:14 | go back to step one, until the log-likelihood doesn't improve | 
|---|
| 0:12:20 | by more than a threshold. | 
|---|
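A rough sketch of this training loop, built on the hypothetical helpers above. The learning rate, the single gradient step per iteration, and updating the prior directly in probability space (rather than through π) are all simplifying assumptions of this sketch:

```python
def train(data_by_user, num_groups, dim, phi, behaviors,
          max_iters=100, lr=0.1, threshold=1e-4):
    """EM-style training sketch: the E-step computes soft group memberships
    per user; the M-step updates the prior in closed form and takes one
    gradient-ascent step on the membership-weighted log-likelihood."""
    rng = np.random.default_rng(0)
    group_priors = np.full(num_groups, 1.0 / num_groups)
    group_thetas = rng.normal(size=(num_groups, dim))  # random init, as described
    prev_ll = -np.inf
    for _ in range(max_iters):
        # E-step: membership probabilities under the current parameters
        memberships = {u: group_posterior(group_priors, group_thetas, obs,
                                          phi, behaviors)
                       for u, obs in data_by_user.items()}
        # M-step, part 1: the prior is just the average membership
        group_priors = np.mean(list(memberships.values()), axis=0)
        # M-step, part 2: gradient of the weighted log-likelihood per group,
        #   sum_u gamma_u(g) * sum_(b,s) [phi(b, s) - E_{b'}[phi(b', s)]]
        for g in range(num_groups):
            grad = np.zeros(dim)
            for u, obs in data_by_user.items():
                for b, s in obs:
                    probs = loglinear_probs(group_thetas[g], phi, behaviors, s)
                    expected = sum(p * phi(bb, s)
                                   for p, bb in zip(probs, behaviors))
                    grad += memberships[u][g] * (phi(b, s) - expected)
            group_thetas[g] += lr * grad
        # convergence check on the marginal data log-likelihood
        ll = 0.0
        for u, obs in data_by_user.items():
            per_group = [group_priors[g] * np.prod(
                [loglinear_probs(group_thetas[g], phi, behaviors, s)
                 [behaviors.index(b)] for b, s in obs])
                for g in range(num_groups)]
            ll += np.log(sum(per_group))
        if ll - prev_ll < threshold:           # stop when gains fall below it
            break
        prev_ll = ll
    return group_priors, group_thetas
```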
| 0:12:29 | So now let's see | 
|---|
| 0:12:32 | if our method works, | 
|---|
| 0:12:34 | if we can discover groups in natural data. | 
|---|
| 0:12:39 | Our model is actually very generic, so we can use it in any | 
|---|
| 0:12:43 | component of a dialogue system | 
|---|
| 0:12:46 | for which we need to predict the user's behavior. | 
|---|
| 0:12:51 | But for the purposes of this work, we evaluated it on | 
|---|
| 0:12:55 | two specific prediction tasks related to natural language generation. | 
|---|
| 0:13:02 | So the first task | 
|---|
| 0:13:06 | is taken from the referring expression generation tradition. | 
|---|
| 0:13:11 | In this case, the stimulus is a visual scene and a target object, | 
|---|
| 0:13:15 | and we want to predict | 
|---|
| 0:13:19 | whether the speaker will use a spatial relation in describing that object. | 
|---|
| 0:13:26 | So, for example, in this scene, whether they would say something like "the ball in | 
|---|
| 0:13:30 | front of the cube" or just "the small ball". | 
|---|
| 0:13:34 | The dataset we use | 
|---|
| 0:13:36 | is GRE3D3, | 
|---|
| 0:13:40 | which is a commonly used dataset in referring expression generation. | 
|---|
| 0:13:44 | It has | 
|---|
| 0:13:46 | scenes described by sixty-three users, | 
|---|
| 0:13:51 | and spatial relations are used in thirty-five percent of the descriptions. | 
|---|
| 0:13:56 | It is difficult to predict, | 
|---|
| 0:14:03 | in this dataset, | 
|---|
| 0:14:05 | whether a speaker will use a spatial relation or not, just from the scene, | 
|---|
| 0:14:10 | because some users don't use spatial relations at all, | 
|---|
| 0:14:16 | some use | 
|---|
| 0:14:17 | spatial relations all the time, and some are in between. | 
|---|
| 0:14:21 | So | 
|---|
| 0:14:22 | we expect that | 
|---|
| 0:14:24 | our model will capture that | 
|---|
| 0:14:27 | difference. | 
|---|
| 0:14:30 | The way we evaluate it is: | 
|---|
| 0:14:32 | first, we do cross-validation, splitting the data in such a way that the | 
|---|
| 0:14:37 | users that we see in testing were never seen in training, | 
|---|
| 0:14:42 | and we implement two baselines based on the state of the art for this dataset, which is work | 
|---|
| 0:14:50 | by Ferreira and Paraboni from 2014. | 
|---|
| 0:14:56 | So | 
|---|
| 0:14:58 | we see that | 
|---|
| 0:15:03 | the version of our model with one group is actually equivalent to one of | 
|---|
| 0:15:09 | the baselines, | 
|---|
| 0:15:10 | the basic one, | 
|---|
| 0:15:12 | and the second baseline also uses some demographic data, which also doesn't | 
|---|
| 0:15:20 | help | 
|---|
| 0:15:25 | improve the f-score on the prediction task. | 
|---|
| 0:15:29 | But as soon as we introduce more than one group, | 
|---|
| 0:15:34 | the performance goes up, because we are able to actually distinguish between | 
|---|
| 0:15:39 | the different user behaviors. | 
|---|
| 0:15:44 | And this is what happens at test time, as we see more and more observations. | 
|---|
| 0:15:48 | We see that already | 
|---|
| 0:15:53 | after seeing one observation, our model is better at predicting what the | 
|---|
| 0:15:59 | user will do next. | 
|---|
| 0:16:01 | The green line is the entropy of the group membership | 
|---|
| 0:16:05 | probability distributions, and this keeps falling throughout the testing phase. | 
|---|
| 0:16:12 | This means that our system is more and more certain about | 
|---|
| 0:16:17 | the actual group that the user | 
|---|
| 0:16:19 | belongs to. | 
|---|
| 0:16:22 | The second task | 
|---|
| 0:16:24 | is related to comprehension: | 
|---|
| 0:16:28 | given a stimulus s, which is a visual scene and a referring expression, | 
|---|
| 0:16:32 | we want to predict the object that the user understood as the referent. | 
|---|
| 0:16:38 | Our baseline is based on our previous work from 2015, | 
|---|
| 0:16:43 | where we also used a log-linear model like the one I showed in the beginning. | 
|---|
| 0:16:49 | For this experiment, we use, | 
|---|
| 0:16:51 | as in that paper, the data from the GIVE-2.5 challenge | 
|---|
| 0:16:56 | for training, and the GIVE-2 challenge for testing. | 
|---|
| 0:17:01 | However, on this dataset, | 
|---|
| 0:17:04 | we could not achieve an accuracy improvement compared to the baseline, | 
|---|
| 0:17:10 | and we observed that our model cannot decide which group to assign the | 
|---|
| 0:17:16 | users to. | 
|---|
| 0:17:20 | Even as we tried different features, | 
|---|
| 0:17:22 | we could not detect any variability | 
|---|
| 0:17:26 | in the data. So | 
|---|
| 0:17:28 | we assume that, in this case, | 
|---|
| 0:17:32 | the user behavior does not actually cluster | 
|---|
| 0:17:40 | into meaningful groups. | 
|---|
| 0:17:42 | To test that hypothesis, however, we did a third experiment, | 
|---|
| 0:17:48 | where we used the same scenes, but with one hundred synthetic users, | 
|---|
| 0:17:53 | and we artificially introduced two completely different user behaviors into the dataset: | 
|---|
| 0:18:02 | half the users always select the most visually salient target, and the other | 
|---|
| 0:18:07 | half the least salient. | 
|---|
| 0:18:10 | In this case, we did find that our model can actually distinguish between those two | 
|---|
| 0:18:16 | groups, | 
|---|
| 0:18:17 | and that using more than two groups doesn't really improve | 
|---|
| 0:18:25 | the accuracy. | 
|---|
| 0:18:28 | And again, in the test phase, we see the same picture as before: | 
|---|
| 0:18:34 | after a couple of observations, our model is | 
|---|
| 0:18:37 | fairly certain about which group the user belongs to. | 
|---|
| 0:18:45 | So, | 
|---|
| 0:18:47 | to sum up, | 
|---|
| 0:18:49 | we have shown that we can | 
|---|
| 0:18:51 | cluster users into groups based on their behavior, using data for which we don't | 
|---|
| 0:18:57 | have group annotations, | 
|---|
| 0:18:59 | and that at test time we can dynamically assign unseen users to groups in the course of | 
|---|
| 0:19:05 | the dialogue, | 
|---|
| 0:19:06 | and we can use these assignments to provide better and better predictions of their | 
|---|
| 0:19:13 | behavior. | 
|---|
| 0:19:15 | In future work, we want to try | 
|---|
| 0:19:19 | different datasets, | 
|---|
| 0:19:21 | applying the same method to other dialogue-related prediction tasks, | 
|---|
| 0:19:28 | and also | 
|---|
| 0:19:30 | slightly more sophisticated underlying models. | 
|---|
| 0:19:35 | And with that, thank you for your attention. | 
|---|
| 0:19:56 | Yes, of course, it's very task-dependent. We only wanted | 
|---|
| 0:20:03 | to predict how the users cluster; depending on the task, we can ask different questions. | 
|---|
| 0:20:27 | Yes. | 
|---|
| 0:20:35 | As I said, | 
|---|
| 0:20:37 | or I'm not sure if I said it: we evaluated on pre-recorded data, so | 
|---|
| 0:20:40 | we didn't have live interaction, but that is of course a very good thing to do when you | 
|---|
| 0:20:46 | have an actual system. | 
|---|
| 0:21:03 | Well, we expected to. So, in this task, | 
|---|
| 0:21:10 | to be honest, it's an easy task for the user, right? | 
|---|
| 0:21:14 | I don't know if you can see, if you can read that, but it | 
|---|
| 0:21:18 | says "press the button to the right of the lamp", so most users get it | 
|---|
| 0:21:20 | right. | 
|---|
| 0:21:21 | But there is around fifteen percent of errors, | 
|---|
| 0:21:26 | so we | 
|---|
| 0:21:28 | hoped to find some hidden pattern there, | 
|---|
| 0:21:33 | like why some users, | 
|---|
| 0:21:36 | or whether some users, for example, have difficulty with colors | 
|---|
| 0:21:40 | or with spatial relations. | 
|---|
| 0:21:44 | Well, | 
|---|
| 0:21:45 | we didn't. | 
|---|
| 0:21:48 | Yes, probably. | 
|---|
| 0:22:16 | So, for the production task... | 
|---|
| 0:22:28 | Yes, so we didn't... | 
|---|
| 0:22:32 | So for this task, the literature says that | 
|---|
| 0:22:37 | there are basically two clearly distinguishable groups, | 
|---|
| 0:22:41 | and some people are in between. | 
|---|
| 0:22:44 | So this might be why we see a slight improvement for | 
|---|
| 0:22:49 | six or seven | 
|---|
| 0:22:51 | groups: | 
|---|
| 0:22:56 | when we have six or seven groups, maybe we have | 
|---|
| 0:23:01 | groups that happen to capture some particular user's behavior, but which have a very low prior | 
|---|
| 0:23:07 | probability. | 
|---|
| 0:23:08 | But we do find the two main groups, which are | 
|---|
| 0:23:13 | the people who always use relations and | 
|---|
| 0:23:17 | those who don't. | 
|---|
| 0:23:34 | You mean to look at the particular feature weights? | 
|---|
| 0:24:01 | Yes, we did. Ah, we did look at that; I don't remember exactly | 
|---|
| 0:24:08 | what we found out, but we | 
|---|
| 0:24:10 | did find that there are | 
|---|
| 0:24:15 | some particular features which | 
|---|
| 0:24:18 | have completely different weights across the groups, | 
|---|
| 0:24:25 | but I don't remember which | 
|---|
| 0:24:27 | ones. | 
|---|