0:00:16 | hi everyone |
---|---|

0:00:17 | I am Nikos, from Saarland University in Germany.

0:00:22 | I'm going to talk to you about

0:00:24 | a way to discover user groups for natural language generation in dialogue.

0:00:29 | This is work I've done together with Christoph Teichmann and Alexander Koller.

0:00:39 | Let's look at this example here.

0:00:42 | We have a navigation system that tells

0:00:45 | the user: "Turn right after Melbourne Central."

0:00:50 | User A succeeds

0:00:52 | in finding it,

0:00:55 | and user B fails.

0:00:58 | So why could that be?

0:01:03 | Well, there are different reasons why

0:01:05 | users react differently to such instructions.

0:01:10 | Most likely, the user here is not from Melbourne,

0:01:15 | so

0:01:16 | they do not know what "Melbourne Central" means. But

0:01:20 | we can also imagine other reasons, such as

0:01:26 | the user's demographics, impaired sight, or

0:01:29 | their experience with navigation systems.

0:01:34 | However, such information is often difficult to obtain.

0:01:38 | So,

0:01:42 | we can't ask everyone, before they use the navigation system, where they are from.

0:01:47 | But in an interactive setting, a more appealing approach is to

0:01:52 | collect observations and react to them. So ideally, after observing something like this,

0:01:58 | a system would think: okay, user A understands place names from Melbourne, but

0:02:04 | then it would adapt to user B and say something like: "After the roundabout,

0:02:09 | take the third exit."

0:02:14 | People deal with this problem in different ways. One approach is of course to

0:02:18 | completely ignore it,

0:02:21 | which we don't want.

0:02:24 | Another approach is

0:02:26 | to use

0:02:27 | one model for every user.

0:02:31 | However, this requires lots of data for that user, and we might lose information

0:02:39 | from similar users that could help us.

0:02:44 | Yet another approach would be to use pre-defined groups:

0:02:48 | for example, have

0:02:50 | a group for residents of Melbourne and another group for outsiders.

0:02:57 | But this is hard to annotate, and it's also hard to know in advance

0:03:04 | which categories could be relevant, and

0:03:09 | which categories we can actually find in the dataset.

0:03:16 | So instead of doing these things,

0:03:19 | we assume that the users' behavior clusters

0:03:23 | into

0:03:24 | groups that we cannot observe,

0:03:29 | and we use Bayesian reasoning to infer those groups from unannotated

0:03:35 | training data,

0:03:36 | and then, at test time, to dynamically assign users to those groups as the dialogue progresses.

0:03:46 | So our starting point is a simple log-linear model of language use,

0:03:52 | where in particular we stay agnostic as to whether we are

0:03:57 | simulating comprehension or production.

0:04:02 | So we just say, in general, that we want to predict the behavior b

0:04:07 | of the user in response to a stimulus s coming from

0:04:12 | the system. If we are trying to simulate language production,

0:04:17 | the stimulus can be the communicative goal that the user is trying to achieve, and

0:04:22 | the behavior would be the utterance, or some other linguistic choice, that the user

0:04:28 | makes.

0:04:31 | And if we want to predict what the user would understand,

0:04:35 | then the stimulus is the system-produced utterance and the behavior is the meaning that the user

0:04:42 | assigns to

0:04:43 | the utterance.

0:04:47 | So this is

0:04:49 | how our basic model looks

0:04:52 | before we add the user groups:

0:04:54 | it's a log-linear model with a real-valued parameter vector θ

0:05:00 | and a set of feature functions φ over behaviors and stimuli.

0:05:05 | This model can be trained on a dataset of pairs of behaviors and stimuli

0:05:11 | using normal gradient-based methods.

0:05:15 | We have actually already used this kind of model in previous work, for

0:05:20 | referring expression resolution in dialogue.
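The basic model just described can be sketched in a few lines of Python. This is only an illustration: the feature function, the candidate behaviors, and the weights below are all invented, not the ones used in this work.

```python
import math

# Invented feature function: maps a (behavior, stimulus) pair to a
# feature vector phi(b, s). Real features would be task-specific.
def phi(behavior, stimulus):
    return [1.0 if behavior == stimulus else 0.0, float(len(behavior))]

def p_behavior(theta, behaviors, stimulus, b):
    """P(b | s; theta), proportional to exp(theta . phi(b, s)),
    normalized over all candidate behaviors."""
    def score(bb):
        return sum(t * f for t, f in zip(theta, phi(bb, stimulus)))
    z = sum(math.exp(score(bb)) for bb in behaviors)
    return math.exp(score(b)) / z

behaviors = ["left", "right"]
theta = [2.0, 0.1]  # a made-up "trained" parameter vector
probs = [p_behavior(theta, behaviors, "left", b) for b in behaviors]
```

Training the basic model then amounts to maximizing the likelihood of the observed (behavior, stimulus) pairs with gradient ascent, as the talk describes.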

0:05:24 | So,

0:05:27 | now, if we want to extend this model with user groups,

0:05:33 | we just assume that there is a finite number of user groups in the data,

0:05:39 | and we give

0:05:41 | each of the groups their own parameter vector.

0:05:46 | So we replace the single vector θ from the model before

0:05:53 | with group-specific parameter vectors θ_g. If we knew exactly which group a

0:06:00 | user belongs to,

0:06:01 | all we would have to do is use these new

0:06:06 | parameters, and

0:06:08 | we would have a new prediction model that is adapted to that group in particular.

0:06:16 | However, as we said,

0:06:20 | we want to adapt to users that we haven't seen in the training data.

0:06:27 | So, we assume that the training data was generated in the following way:

0:06:33 | we have a set

0:06:34 | of users u,

0:06:38 | and each user is assigned

0:06:42 | to a group

0:06:45 | with a probability

0:06:47 | given by π, which is another parameter vector, one that determines the prior

0:06:53 | probability of each group.

0:06:57 | And then, as we said, we have one parameter vector for each group, so now the

0:07:02 | behavior of the user

0:07:05 | depends not only on the stimulus but also on their group assignment, via

0:07:10 | the group-specific parameter vectors.
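As a sketch, this generative story might look as follows, with two groups, a single binary behavior standing in for the full stimulus-conditioned model, and all numbers invented for illustration:

```python
import random

pi = [0.7, 0.3]          # prior probability of each group (illustrative)
p_relation = [0.9, 0.1]  # per-group probability of the binary behavior

def generate_user(rng, n_obs=5):
    """Generative story: first assign the user to a group g ~ pi, then
    draw all of their behaviors from that group's distribution."""
    g = rng.choices([0, 1], weights=pi)[0]
    behaviors = [1 if rng.random() < p_relation[g] else 0 for _ in range(n_obs)]
    return g, behaviors

rng = random.Random(0)
users = [generate_user(rng) for _ in range(100)]
```

In the actual training data the group assignment on the left is never observed; only the behaviors are.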

0:07:16 | So now let's suppose that we have trained our system on some training data,

0:07:23 | and then a new user starts talking to us.

0:07:28 | Since we don't know what their actual group is,

0:07:31 | we marginalize over all groups using the prior probabilities,

0:07:37 | and so we directly have

0:07:40 | an idea of what they would do,

0:07:46 | given the prior probabilities that we have observed in the training data. And

0:07:51 | we can already use this model for interacting with them, and then observe their behavior.

0:08:00 | So as the user

0:08:02 | keeps interacting with the system, we start collecting observations for them.

0:08:09 | So let's say we have

0:08:11 | a set D_u of observations for user u at a particular time step.

0:08:20 | We can now use these observations to estimate,

0:08:24 | or find out, which group u belongs to.

0:08:28 | We can do that because,

0:08:30 | as I said, we have group-specific

0:08:34 | behavior prediction models.

0:08:36 | So we can

0:08:39 | calculate the probability on the right-hand side: the probability of the observations for the

0:08:46 | user, given the group-specific parameters of each group.

0:08:51 | And since we also have the prior membership probabilities, we can also

0:08:57 | compute

0:08:59 | the probability that the user belongs to each of the groups g, given the data.

0:09:09 | So, if we plug this new posterior group membership estimate

0:09:14 | into the previous

0:09:16 | behavior prediction model,

0:09:19 | we get

0:09:22 | a new prediction model that takes into account

0:09:28 | the data that we have seen for this new user, and

0:09:31 | the new group membership estimate.

0:09:35 | And as we collect more observations from the user,

0:09:41 | we hopefully get a more accurate group membership estimate, and a better behavior

0:09:45 | prediction.
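A minimal sketch of this adaptive loop, again with a single binary behavior in place of the full stimulus-conditioned model, and with invented numbers:

```python
import math

pi = [0.5, 0.5]      # prior group probabilities (illustrative)
theta = [3.0, -3.0]  # one weight per group: group 0 favors the behavior

def p_b_given_g(b, g):
    """P(b | theta_g) for a binary behavior b, in logistic form."""
    p1 = 1.0 / (1.0 + math.exp(-theta[g]))
    return p1 if b == 1 else 1.0 - p1

def posterior(observations):
    """P(g | D_u) proportional to pi_g times the product of
    P(b | theta_g) over all observed behaviors b in D_u.
    With no observations this reduces to the prior."""
    scores = [pi[g] * math.prod(p_b_given_g(b, g) for b in observations)
              for g in range(len(pi))]
    z = sum(scores)
    return [s / z for s in scores]

def predict(observations, b=1):
    """Plug the posterior back in:
    P(b | D_u) = sum over g of P(g | D_u) * P(b | theta_g)."""
    post = posterior(observations)
    return sum(post[g] * p_b_given_g(b, g) for g in range(len(pi)))
```

Before any observation, `predict([])` marginalizes over the prior; after a few consistent observations the posterior concentrates on one group and the prediction sharpens.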

0:09:50 | Now, how do we train such a system to find the best parameter setting?

0:09:58 | As I said, our model has

0:10:01 | parameters π, which determine the prior group membership probabilities, and,

0:10:06 | for each of the groups,

0:10:11 | its own weight vector over the features.

0:10:15 | Now, we assume that we have a corpus of

0:10:19 | behaviors and stimuli,

0:10:21 | and for each of these pairs of behavior and stimulus,

0:10:25 | we know the user that produced it,

0:10:29 | but we don't know the groups of the users.

0:10:33 | So we will try to maximize the data likelihood

0:10:37 | according to

0:10:40 | the previous

0:10:43 | behavior probabilities.

0:10:46 | However, it is not straightforward to use gradient descent, as for the

0:10:52 | basic model, because we don't know the group assignments.

0:10:58 | So instead,

0:11:00 | we use

0:11:01 | a method similar to expectation maximization.

0:11:05 | So,

0:11:07 | in the beginning we just initialize all parameters

0:11:13 | randomly from a normal distribution,

0:11:15 | and then, at each step,

0:11:18 | we compute

0:11:20 | the group membership probabilities,

0:11:24 | given the data, for each user,

0:11:29 | using the parameter setting from the previous step.

0:11:34 | We use these probabilities

0:11:37 | as frequencies for the observations,

0:11:42 | weighting them according to this distribution.

0:11:46 | So we have a set of observations with

0:11:51 | "observed"

0:11:54 | group memberships.

0:11:55 | Now we can use normal gradient ascent to maximize the log-likelihood

0:12:01 | given these completed observations.

0:12:06 | We find a new parameter setting, and

0:12:14 | we go back to step one, until the log-likelihood doesn't improve

0:12:20 | by more than a threshold.
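The training procedure might be sketched as follows, on a deliberately tiny version of the model: one binary behavior per observation and one weight per group, with the membership probabilities used directly as soft weights in the M-step (a common EM variant; the model and numbers are invented for illustration):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def likelihood(obs, theta_g):
    """P(D_u | theta_g) for one user's binary observations."""
    p1 = sigmoid(theta_g)
    out = 1.0
    for b in obs:
        out *= p1 if b == 1 else 1.0 - p1
    return out

def em_step(data, pi, theta, lr=0.5, grad_iters=50):
    # E-step: posterior group responsibilities P(g | D_u) for each user.
    resp = []
    for obs in data:
        scores = [pi[g] * likelihood(obs, theta[g]) for g in range(len(pi))]
        z = sum(scores)
        resp.append([s / z for s in scores])
    # M-step: re-estimate the prior from the responsibilities...
    pi = [sum(r[g] for r in resp) / len(data) for g in range(len(pi))]
    # ...and run gradient ascent on the responsibility-weighted log-likelihood.
    for _ in range(grad_iters):
        for g in range(len(theta)):
            grad = sum(resp[u][g] * sum(b - sigmoid(theta[g]) for b in obs)
                       for u, obs in enumerate(data))
            theta[g] += lr * grad / len(data)
    return pi, theta

# Toy corpus: three users who mostly produce the behavior, three who mostly don't.
data = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 1, 1],
        [0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0]]
pi, theta = [0.5, 0.5], [0.5, -0.5]
for _ in range(20):
    pi, theta = em_step(data, pi, theta)
```

On this toy corpus the two groups separate cleanly: one weight goes positive, the other negative, and the prior settles near fifty-fifty.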

0:12:29 | So now let's see

0:12:32 | if our method works,

0:12:34 | if we can discover groups in natural data.

0:12:39 | Actually, our model is very generic, so we can use it in any

0:12:43 | component of a dialogue system

0:12:46 | for which we need to predict the user's behavior.

0:12:51 | But for the purposes of this work, we evaluated it on

0:12:55 | two specific prediction tasks related to natural language generation.

0:13:02 | So, the first task

0:13:06 | is taken from referring expression generation.

0:13:11 | In this case, the stimulus is a visual scene and a target object,

0:13:15 | and we want to predict

0:13:19 | whether the speaker will use a spatial relation in describing that object.

0:13:26 | So, for example, in this scene, whether they would say something like "the ball in

0:13:30 | front of the cube" or just "the small ball".

0:13:34 | The dataset we use

0:13:36 | is GRE3D3,

0:13:40 | which is a commonly used dataset in referring expression generation.

0:13:44 | It has

0:13:46 | scenes described by sixty-three users,

0:13:51 | and spatial relations are used in thirty-five percent of the descriptions.

0:13:56 | So it is difficult to predict,

0:13:59 | in this dataset, just from the scene,

0:14:05 | whether the speaker will use a spatial relation or not,

0:14:10 | because some users don't use spatial relations at all,

0:14:16 | some use

0:14:17 | spatial relations all the time, and some are in between.

0:14:21 | So

0:14:22 | we expect that

0:14:24 | our model will capture that

0:14:27 | difference.

0:14:30 | The way we evaluate it is:

0:14:32 | first, we do cross-validation, splitting the data in such a way that the

0:14:37 | users that we see in testing were never seen in training,

0:14:42 | and we implement two baselines based on the state of the art for this dataset, which is work

0:14:50 | from 2014.

0:14:56 | So,

0:14:58 | we see that

0:15:03 | the version of our model with one group is actually equivalent to one of

0:15:09 | the baselines,

0:15:10 | the basic one.

0:15:12 | The second baseline also uses some demographic data, which turns out not

0:15:20 | to help

0:15:25 | improve the f-score on the prediction task.

0:15:29 | But as soon as we introduce more than one group,

0:15:34 | the performance goes up, because we are able to actually distinguish between

0:15:39 | the different user behaviors.

0:15:44 | And this is what happens at test time, as we see more and more observations.

0:15:48 | We see that

0:15:53 | already after seeing one observation, our model is better at predicting what the

0:15:59 | user will do next.

0:16:01 | And the green line is the entropy of the group membership

0:16:05 | probability distribution, which drops throughout the testing phase.

0:16:12 | So this means that our system is more and more certain about

0:16:17 | the actual group that the user

0:16:19 | belongs to.
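The entropy curve mentioned here is just the Shannon entropy of the posterior group membership distribution. As a small illustration (the two distributions below are made up):

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of a group membership distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

before = entropy([0.5, 0.5])   # no observations yet: maximal uncertainty
after = entropy([0.99, 0.01])  # after several observations: near certainty
```

As the posterior concentrates on one group, the entropy falls toward zero, which is exactly the downward trend the plot shows.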

0:16:22 | The second task

0:16:24 | is related to comprehension:

0:16:28 | given a stimulus s, which is a visual scene and a referring expression,

0:16:32 | we want to predict the object that the user understood as the referent.

0:16:38 | Our baseline is based on our previous work from 2015,

0:16:43 | where we also used a log-linear model like the one I showed in the beginning.

0:16:49 | For this experiment, we use,

0:16:51 | as in that paper, the data from the GIVE-2.5 challenge

0:16:56 | for training, and the GIVE-2 challenge for testing.

0:17:01 | However, on this dataset

0:17:04 | we cannot achieve an accuracy improvement compared to the baseline,

0:17:10 | and we observe that our model cannot decide which group to assign the

0:17:16 | users to.

0:17:20 | Even as we tried different features,

0:17:22 | we could not detect any variability

0:17:26 | in the data. So

0:17:28 | we assume that, in this case,

0:17:32 | the user behavior cannot actually be clustered

0:17:38 | into

0:17:40 | meaningful clusters.

0:17:42 | To test that hypothesis, however, we did a third experiment,

0:17:48 | where we use the same scenes but with one hundred synthetic users,

0:17:53 | and we artificially introduced two completely different user behaviors into the dataset.

0:18:02 | So half the users always select the most visually salient target, and the other

0:18:07 | half the least salient.

0:18:09 | And

0:18:10 | in this case, we found that our model can actually distinguish between those two

0:18:16 | groups,

0:18:17 | and that using more than two groups doesn't really improve

0:18:25 | the accuracy.

0:18:28 | And again, in the test phase, we have the same picture as before:

0:18:34 | after a couple of observations, our model is

0:18:37 | quite certain about which group the user belongs to.

0:18:45 | So,

0:18:47 | to sum up,

0:18:49 | we have shown that we can

0:18:51 | cluster users into groups based on their behavior, in data for which we don't

0:18:57 | have group annotations.

0:18:59 | At test time, we can dynamically assign unseen users to groups in the course of

0:19:05 | the dialogue,

0:19:06 | and we can use these assignments to provide better and better predictions of their

0:19:13 | behavior.

0:19:15 | In future work, we want to try

0:19:19 | different datasets,

0:19:21 | apply the same method to other dialogue-related prediction tasks,

0:19:28 | and also

0:19:30 | try slightly more sophisticated underlying models.

0:19:35 | And with that, thank you.

0:19:56 | Yes, of course, it's very task-dependent. Here we only wanted

0:20:03 | to predict how the users behave; depending on that, we could ask different questions.

0:20:27 | Yes.

0:20:35 | As I said, or,

0:20:37 | I'm not sure if I said it: we evaluated on pre-recorded data, so

0:20:40 | we didn't have live interaction, but that is of course a very good thing to do when you

0:20:46 | have an actual system.

0:21:03 | Well, we expected that. So, in this task,

0:21:10 | to be honest, it is an easy task for the user, right? So,

0:21:14 | I don't know if you can see, if you can read that, but it

0:21:18 | says "press the button to the right of the lamp", so most users get it

0:21:20 | right.

0:21:21 | But there is some fifteen percent of errors,

0:21:26 | so we

0:21:28 | hoped to find some hidden pattern there,

0:21:33 | like whether some users,

0:21:36 | for example, have difficulty with colours

0:21:40 | or with spatial relations.

0:21:44 | Well,

0:21:45 | we didn't.

0:21:48 | Yes, probably.

0:22:16 | So, for the production task,

0:22:28 | yes. So:

0:22:32 | for this task, the literature says that

0:22:37 | there are basically two clearly distinguishable groups,

0:22:41 | and some people are in between.

0:22:44 | So this might be why we see a slight improvement for

0:22:49 | six or seven

0:22:51 | groups:

0:22:56 | when we have six or seven groups, maybe we get

0:23:01 | groups that happen to capture some particular user's behaviour, but which have very low prior

0:23:07 | probability.

0:23:08 | But we do find the main two groups, which are

0:23:13 | the people who always use relations, and

0:23:17 | those who don't.

0:23:34 | You mean, to look at particular feature weights?

0:24:01 | Yes, we did; although we didn't look at that in depth, and I don't remember exactly

0:24:08 | what we found out. But we

0:24:10 | did find out that there are

0:24:15 | some particular features which

0:24:18 | have completely different weights across groups.

0:24:25 | But I don't remember which

0:24:27 | ones.