0:00:16 | hi everyone |
---|---|

0:00:17 | I am Nikos, from Saarland University in Germany.

0:00:22 | I'm going to talk to you about

0:00:24 | a way to discover user groups for natural language generation in dialogue.

0:00:29 | This is work I've done together with Christoph Teichmann and Alexander Koller.

0:00:39 | Let's look at this example here.

0:00:42 | We have a navigation system that tells

0:00:45 | the user: "Turn right after Melbourne Central."

0:00:50 | User A succeeds

0:00:52 | in finding it,

0:00:55 | and user B fails.

0:00:58 | So why could that be?

0:01:03 | Well, there are different reasons why

0:01:05 | users react differently to such instructions.

0:01:10 | Most likely, the user here is not from Melbourne,

0:01:15 | so

0:01:16 | they do not know what "Melbourne Central" means. But

0:01:20 | we can also imagine other reasons, such as

0:01:26 | the user's demographics, impaired sight, or

0:01:29 | their experience with navigation systems.

0:01:34 | However, such information is often difficult to obtain.

0:01:38 | So,

0:01:42 | we can't ask everyone, before they use the navigation system, where they are from.

0:01:47 | But in an interactive setting, a more appealing approach is to

0:01:52 | collect observations and react to them. So ideally, after observing something like this,

0:01:58 | a system would think: okay, user A understands place names from Melbourne, but

0:02:04 | then it would adapt to user B and say something like: "After the roundabout,

0:02:09 | take the third exit."

0:02:14 | People deal with this problem in different ways. One approach is of course to

0:02:18 | completely ignore it,

0:02:21 | which we don't want.

0:02:24 | Another approach is

0:02:26 | to use

0:02:27 | one model for every user.

0:02:31 | However, this requires lots of data for that user, and we might lose information

0:02:39 | from similar users that could help us.

0:02:44 | Yet another approach would be to use pre-defined groups:

0:02:48 | for example, have

0:02:50 | a group for residents of Melbourne and another group for outsiders.

0:02:57 | But this is hard to annotate, and it's also hard to know in advance

0:03:04 | which categories could be relevant, and

0:03:09 | which categories we can actually find in the dataset.

0:03:16 | So instead of doing these things,

0:03:19 | we assume that the users' behavior clusters

0:03:23 | into

0:03:24 | groups that we cannot observe,

0:03:29 | and we use Bayesian reasoning to infer those groups from unannotated

0:03:35 | training data,

0:03:36 | and then, at test time, to dynamically assign users to those groups as the dialogue progresses.

0:03:46 | So our starting point is a simple log-linear model of language use,

0:03:52 | where in particular we stay agnostic as to whether we are

0:03:57 | simulating comprehension or production.

0:04:02 | So we just say, in general, that we want to predict the behavior b

0:04:07 | of the user in response to a stimulus s coming from

0:04:12 | the system. If we are trying to simulate language production,

0:04:17 | the stimulus can be the communicative goal that the user is trying to achieve, and

0:04:22 | the behavior would be the utterance, or some other linguistic choice, that the user

0:04:28 | makes.

0:04:31 | And if we want to predict what the user would understand,

0:04:35 | then the stimulus is the system-produced utterance and the behavior is the meaning that the user

0:04:42 | assigns to

0:04:43 | the utterance.

0:04:47 | So this is

0:04:49 | how our basic model looks

0:04:52 | before we add the user groups:

0:04:54 | it's a log-linear model with a real-valued parameter vector θ

0:05:00 | and a set of feature functions φ over behaviors and stimuli.

0:05:05 | This model can be trained on a dataset of pairs of behaviors and stimuli

0:05:11 | using normal gradient-based methods.

0:05:15 | We have actually already used this kind of model in previous work, for

0:05:20 | referring expression resolution in dialogue.
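The basic model just described can be sketched in a few lines of Python. This is only an illustration: the feature function, the candidate behaviors, and the weights below are all invented, not the ones used in this work.

```python
import math

# Invented feature function: maps a (behavior, stimulus) pair to a
# feature vector phi(b, s). Real features would be task-specific.
def phi(behavior, stimulus):
    return [1.0 if behavior == stimulus else 0.0, float(len(behavior))]

def p_behavior(theta, behaviors, stimulus, b):
    """P(b | s; theta), proportional to exp(theta . phi(b, s)),
    normalized over all candidate behaviors."""
    def score(bb):
        return sum(t * f for t, f in zip(theta, phi(bb, stimulus)))
    z = sum(math.exp(score(bb)) for bb in behaviors)
    return math.exp(score(b)) / z

behaviors = ["left", "right"]
theta = [2.0, 0.1]  # a made-up "trained" parameter vector
probs = [p_behavior(theta, behaviors, "left", b) for b in behaviors]
```

Training the basic model then amounts to maximizing the likelihood of the observed (behavior, stimulus) pairs with gradient ascent, as the talk describes.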

0:05:24 | So,

0:05:27 | now, if we want to extend this model with user groups,

0:05:33 | we just assume that there is a finite number of user groups in the data,

0:05:39 | and we give

0:05:41 | each of the groups their own parameter vector.

0:05:46 | So we replace the single vector θ from the model before

0:05:53 | with group-specific parameter vectors θ_g. If we knew exactly which group a

0:06:00 | user belongs to,

0:06:01 | all we would have to do is use these new

0:06:06 | parameters, and

0:06:08 | we would have a new prediction model that is adapted to that group in particular.

0:06:16 | However, as we said,

0:06:20 | we want to adapt to users that we haven't seen in the training data.

0:06:27 | So, we assume that the training data was generated in the following way:

0:06:33 | we have a set

0:06:34 | of users u,

0:06:38 | and each user is assigned

0:06:42 | to a group

0:06:45 | with a probability

0:06:47 | given by π, which is another parameter vector, one that determines the prior

0:06:53 | probability of each group.

0:06:57 | And then, as we said, we have one parameter vector for each group, so now the

0:07:02 | behavior of the user

0:07:05 | depends not only on the stimulus but also on their group assignment, via

0:07:10 | the group-specific parameter vectors.
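As a sketch, this generative story might look as follows, with two groups, a single binary behavior standing in for the full stimulus-conditioned model, and all numbers invented for illustration:

```python
import random

pi = [0.7, 0.3]          # prior probability of each group (illustrative)
p_relation = [0.9, 0.1]  # per-group probability of the binary behavior

def generate_user(rng, n_obs=5):
    """Generative story: first assign the user to a group g ~ pi, then
    draw all of their behaviors from that group's distribution."""
    g = rng.choices([0, 1], weights=pi)[0]
    behaviors = [1 if rng.random() < p_relation[g] else 0 for _ in range(n_obs)]
    return g, behaviors

rng = random.Random(0)
users = [generate_user(rng) for _ in range(100)]
```

In the actual training data the group assignment on the left is never observed; only the behaviors are.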

0:07:16 | So now let's suppose that we have trained our system on some training data,

0:07:23 | and then a new user starts talking to us.

0:07:28 | Since we don't know what their actual group is,

0:07:31 | we marginalize over all groups using the prior probabilities,

0:07:37 | and so we directly have

0:07:40 | an idea of what they would do,

0:07:46 | given the prior probabilities that we have observed in the training data. And

0:07:51 | we can already use this model for interacting with them, and then observe their behavior.

0:08:00 | So as the user

0:08:02 | keeps interacting with the system, we start collecting observations for them.

0:08:09 | So let's say we have

0:08:11 | a set D_u of observations for user u at a particular time step.

0:08:20 | We can now use these observations to estimate,

0:08:24 | or find out, which group u belongs to.

0:08:28 | We can do that because,

0:08:30 | as I said, we have group-specific

0:08:34 | behavior prediction models.

0:08:36 | So we can

0:08:39 | calculate the probability on the right-hand side: the probability of the observations for the

0:08:46 | user, given the group-specific parameters of each group.

0:08:51 | And since we also have the prior membership probabilities, we can also

0:08:57 | compute

0:08:59 | the probability that the user belongs to each of the groups g, given the data.

0:09:09 | So, if we plug this new posterior group membership estimate

0:09:14 | into the previous

0:09:16 | behavior prediction model,

0:09:19 | we get

0:09:22 | a new prediction model that takes into account

0:09:28 | the data that we have seen for this new user, and

0:09:31 | the new group membership estimate.

0:09:35 | And as we collect more observations from the user,

0:09:41 | we hopefully get a more accurate group membership estimate, and a better behavior

0:09:45 | prediction.
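A minimal sketch of this adaptive loop, again with a single binary behavior in place of the full stimulus-conditioned model, and with invented numbers:

```python
import math

pi = [0.5, 0.5]      # prior group probabilities (illustrative)
theta = [3.0, -3.0]  # one weight per group: group 0 favors the behavior

def p_b_given_g(b, g):
    """P(b | theta_g) for a binary behavior b, in logistic form."""
    p1 = 1.0 / (1.0 + math.exp(-theta[g]))
    return p1 if b == 1 else 1.0 - p1

def posterior(observations):
    """P(g | D_u) proportional to pi_g times the product of
    P(b | theta_g) over all observed behaviors b in D_u.
    With no observations this reduces to the prior."""
    scores = [pi[g] * math.prod(p_b_given_g(b, g) for b in observations)
              for g in range(len(pi))]
    z = sum(scores)
    return [s / z for s in scores]

def predict(observations, b=1):
    """Plug the posterior back in:
    P(b | D_u) = sum over g of P(g | D_u) * P(b | theta_g)."""
    post = posterior(observations)
    return sum(post[g] * p_b_given_g(b, g) for g in range(len(pi)))
```

Before any observation, `predict([])` marginalizes over the prior; after a few consistent observations the posterior concentrates on one group and the prediction sharpens.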

0:09:50 | Now, how do we train such a system to find the best parameter setting?

0:09:58 | As I said, our model has

0:10:01 | parameters π, which determine the prior group membership probabilities, and,

0:10:06 | for each of the groups,

0:10:11 | its own weight vector over the features.

0:10:15 | Now, we assume that we have a corpus of

0:10:19 | behaviors and stimuli,

0:10:21 | and for each of these pairs of behavior and stimulus,

0:10:25 | we know the user that produced it,

0:10:29 | but we don't know the groups of the users.

0:10:33 | So we will try to maximize the data likelihood

0:10:37 | according to

0:10:40 | the previous

0:10:43 | behavior probabilities.

0:10:46 | However, it is not straightforward to use gradient descent, as for the

0:10:52 | basic model, because we don't know the group assignments.

0:10:58 | So instead,

0:11:00 | we use

0:11:01 | a method similar to expectation maximization.

0:11:05 | So,

0:11:07 | in the beginning we just initialize all parameters

0:11:13 | randomly from a normal distribution,

0:11:15 | and then, at each step,

0:11:18 | we compute

0:11:20 | the group membership probabilities,

0:11:24 | given the data, for each user,

0:11:29 | using the parameter setting from the previous step.

0:11:34 | We use these probabilities

0:11:37 | as frequencies for the observations,

0:11:42 | weighting them according to this distribution.

0:11:46 | So we have a set of observations with

0:11:51 | "observed"

0:11:54 | group memberships.

0:11:55 | Now we can use normal gradient ascent to maximize the log-likelihood

0:12:01 | given these completed observations.

0:12:06 | We find a new parameter setting, and

0:12:14 | we go back to step one, until the log-likelihood doesn't improve

0:12:20 | by more than a threshold.
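The training procedure might be sketched as follows, on a deliberately tiny version of the model: one binary behavior per observation and one weight per group, with the membership probabilities used directly as soft weights in the M-step (a common EM variant; the model and numbers are invented for illustration):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def likelihood(obs, theta_g):
    """P(D_u | theta_g) for one user's binary observations."""
    p1 = sigmoid(theta_g)
    out = 1.0
    for b in obs:
        out *= p1 if b == 1 else 1.0 - p1
    return out

def em_step(data, pi, theta, lr=0.5, grad_iters=50):
    # E-step: posterior group responsibilities P(g | D_u) for each user.
    resp = []
    for obs in data:
        scores = [pi[g] * likelihood(obs, theta[g]) for g in range(len(pi))]
        z = sum(scores)
        resp.append([s / z for s in scores])
    # M-step: re-estimate the prior from the responsibilities...
    pi = [sum(r[g] for r in resp) / len(data) for g in range(len(pi))]
    # ...and run gradient ascent on the responsibility-weighted log-likelihood.
    for _ in range(grad_iters):
        for g in range(len(theta)):
            grad = sum(resp[u][g] * sum(b - sigmoid(theta[g]) for b in obs)
                       for u, obs in enumerate(data))
            theta[g] += lr * grad / len(data)
    return pi, theta

# Toy corpus: three users who mostly produce the behavior, three who mostly don't.
data = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 1, 1],
        [0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0]]
pi, theta = [0.5, 0.5], [0.5, -0.5]
for _ in range(20):
    pi, theta = em_step(data, pi, theta)
```

On this toy corpus the two groups separate cleanly: one weight goes positive, the other negative, and the prior settles near fifty-fifty.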

0:12:29 | So now let's see

0:12:32 | if our method works,

0:12:34 | if we can discover groups in natural data.

0:12:39 | Actually, our model is very generic, so we can use it in any

0:12:43 | component of a dialogue system

0:12:46 | for which we need to predict the user's behavior.

0:12:51 | But for the purposes of this work, we evaluated it on

0:12:55 | two specific prediction tasks related to natural language generation.

0:13:02 | So, the first task

0:13:06 | is taken from referring expression generation.

0:13:11 | In this case, the stimulus is a visual scene and a target object,

0:13:15 | and we want to predict

0:13:19 | whether the speaker will use a spatial relation in describing that object.

0:13:26 | So, for example, in this scene, whether they would say something like "the ball in

0:13:30 | front of the cube" or just "the small ball".

0:13:34 | The dataset we use

0:13:36 | is GRE3D3,

0:13:40 | which is a commonly used dataset in referring expression generation.

0:13:44 | It has

0:13:46 | scenes described by sixty-three users,

0:13:51 | and spatial relations are used in thirty-five percent of the descriptions.

0:13:56 | So it is difficult to predict,

0:13:59 | in this dataset, just from the scene,

0:14:05 | whether the speaker will use a spatial relation or not,

0:14:10 | because some users don't use spatial relations at all,

0:14:16 | some use

0:14:17 | spatial relations all the time, and some are in between.

0:14:21 | So

0:14:22 | we expect that

0:14:24 | our model will capture that

0:14:27 | difference.

0:14:30 | The way we evaluate it is:

0:14:32 | first, we do cross-validation, splitting the data in such a way that the

0:14:37 | users that we see in testing were never seen in training,

0:14:42 | and we implement two baselines based on the state of the art for this dataset, which is work

0:14:50 | from 2014.

0:14:56 | So,

0:14:58 | we see that

0:15:03 | the version of our model with one group is actually equivalent to one of

0:15:09 | the baselines,

0:15:10 | the basic one.

0:15:12 | The second baseline also uses some demographic data, which turns out not

0:15:20 | to help

0:15:25 | improve the f-score on the prediction task.

0:15:29 | But as soon as we introduce more than one group,

0:15:34 | the performance goes up, because we are able to actually distinguish between

0:15:39 | the different user behaviors.

0:15:44 | And this is what happens at test time, as we see more and more observations.

0:15:48 | We see that

0:15:53 | already after seeing one observation, our model is better at predicting what the

0:15:59 | user will do next.

0:16:01 | And the green line is the entropy of the group membership

0:16:05 | probability distribution, which drops throughout the testing phase.

0:16:12 | So this means that our system is more and more certain about

0:16:17 | the actual group that the user

0:16:19 | belongs to.
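The entropy curve mentioned here is just the Shannon entropy of the posterior group membership distribution. As a small illustration (the two distributions below are made up):

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of a group membership distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

before = entropy([0.5, 0.5])   # no observations yet: maximal uncertainty
after = entropy([0.99, 0.01])  # after several observations: near certainty
```

As the posterior concentrates on one group, the entropy falls toward zero, which is exactly the downward trend the plot shows.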

0:16:22 | The second task

0:16:24 | is related to comprehension:

0:16:28 | given a stimulus s, which is a visual scene and a referring expression,

0:16:32 | we want to predict the object that the user understood as the referent.

0:16:38 | Our baseline is based on our previous work from 2015,

0:16:43 | where we also used a log-linear model like the one I showed in the beginning.

0:16:49 | For this experiment, we use,

0:16:51 | as in that paper, the data from the GIVE-2.5 challenge

0:16:56 | for training, and the GIVE-2 challenge for testing.

0:17:01 | However, on this dataset

0:17:04 | we cannot achieve an accuracy improvement compared to the baseline,

0:17:10 | and we observe that our model cannot decide which group to assign the

0:17:16 | users to.

0:17:20 | Even as we tried different features,

0:17:22 | we could not detect any variability

0:17:26 | in the data. So

0:17:28 | we assume that, in this case,

0:17:32 | the user behavior cannot actually be clustered

0:17:38 | into

0:17:40 | meaningful clusters.

0:17:42 | To test that hypothesis, however, we did a third experiment,

0:17:48 | where we use the same scenes but with one hundred synthetic users,

0:17:53 | and we artificially introduced two completely different user behaviors into the dataset.

0:18:02 | So half the users always select the most visually salient target, and the other

0:18:07 | half the least salient.

0:18:09 | And

0:18:10 | in this case, we found that our model can actually distinguish between those two

0:18:16 | groups,

0:18:17 | and that using more than two groups doesn't really improve

0:18:25 | the accuracy.

0:18:28 | And again, in the test phase, we have the same picture as before:

0:18:34 | after a couple of observations, our model is

0:18:37 | quite certain about which group the user belongs to.

0:18:45 | So,

0:18:47 | to sum up,

0:18:49 | we have shown that we can

0:18:51 | cluster users into groups based on their behavior, in data for which we don't

0:18:57 | have group annotations.

0:18:59 | At test time, we can dynamically assign unseen users to groups in the course of

0:19:05 | the dialogue,

0:19:06 | and we can use these assignments to provide better and better predictions of their

0:19:13 | behavior.

0:19:15 | In future work, we want to try

0:19:19 | different datasets,

0:19:21 | apply the same method to other dialogue-related prediction tasks,

0:19:28 | and also

0:19:30 | try slightly more sophisticated underlying models.

0:19:35 | And with that, thank you.

0:19:56 | Yes, of course, it's very task-dependent. Here we only wanted

0:20:03 | to predict how the users behave; depending on that, we could ask different questions.

0:20:27 | Yes.

0:20:35 | As I said, or,

0:20:37 | I'm not sure if I said it: we evaluated on pre-recorded data, so

0:20:40 | we didn't have live interaction, but that is of course a very good thing to do when you

0:20:46 | have an actual system.

0:21:03 | Well, we expected that. So, in this task,

0:21:10 | to be honest, it is an easy task for the user, right? So,

0:21:14 | I don't know if you can see, if you can read that, but it

0:21:18 | says "press the button to the right of the lamp", so most users get it

0:21:20 | right.

0:21:21 | But there is some fifteen percent of errors,

0:21:26 | so we

0:21:28 | hoped to find some hidden pattern there,

0:21:33 | like whether some users,

0:21:36 | for example, have difficulty with colours

0:21:40 | or with spatial relations.

0:21:44 | Well,

0:21:45 | we didn't.

0:21:48 | Yes, probably.

0:22:16 | So, for the production task,

0:22:28 | yes. So:

0:22:32 | for this task, the literature says that

0:22:37 | there are basically two clearly distinguishable groups,

0:22:41 | and some people are in between.

0:22:44 | So this might be why we see a slight improvement for

0:22:49 | six or seven

0:22:51 | groups:

0:22:56 | when we have six or seven groups, maybe we get

0:23:01 | groups that happen to capture some particular user's behaviour, but which have very low prior

0:23:07 | probability.

0:23:08 | But we do find the main two groups, which are

0:23:13 | the people who always use relations, and

0:23:17 | those who don't.

0:23:34 | You mean, to look at particular feature weights?

0:24:01 | Yes, we did; although we didn't look at that in depth, and I don't remember exactly

0:24:08 | what we found out. But we

0:24:10 | did find out that there are

0:24:15 | some particular features which

0:24:18 | have completely different weights across groups.

0:24:25 | But I don't remember which

0:24:27 | ones.