0:00:16thanks for the introduction
0:00:18this is joint work with my student ramesh manuvinakurike and our collaborators from
0:00:23adobe research trung bui and walter chang
0:00:30so image editing means changing
0:00:34certain characteristics of an image
0:00:37and that can be done with software tools such as adobe photoshop microsoft's photos et
0:00:42cetera and here we can see two examples
0:00:45in the first example we add clouds to the sky
0:00:49the second example we have
0:00:51the photograph of a file a
0:00:53we make it black and white
0:00:56image editing is a very hard task
0:01:04it requires artistic creativity patience a lot of experimentation and trial and error which in turn makes
0:01:11it a very time-consuming task
0:01:13and users may not be fully aware of the functionality of a given image
0:01:19editing tool and some image editing tools are very complex to use and there is a steep
0:01:26learning curve
0:01:27furthermore users may not be sure about what changes exactly they want to perform on
0:01:33the image and here's an example basically scores look nice rugby this is pretty abstract
0:01:39and how do you make a
0:01:41but also look like that is currently
0:01:44or maybe they do know at an abstract level what change they want to perform
0:01:48but they don't know the precise
0:01:51editing steps required so here's another example
0:01:55remove the human from the fields
0:01:57so this is pretty concrete
0:01:59but it requires many editing steps
0:02:04so clearly there's a need here for professional help and indeed there are web services and web forums
0:02:12where novice users post their images
0:02:17and their editing requests and expert editors will perform these edits either for free
0:02:24or for a fee
0:02:26and then the expert editors and the novice users can exchange messages until the user
0:02:33is happy with the edit
0:02:36and typically in these forums the requests are formulated in an abstract manner using natural language
0:02:41here is an example
0:02:44i have this photo from our last holiday can someone please remove my ex from this photo
0:02:48i'm the one on the right
0:02:51so it's more likely to get something like this rather than detailed step-by-step instructions
0:03:02so the web forums are clearly very helpful
0:03:07but they have major drawbacks
0:03:10first of all users cannot request changes or provide feedback in real time
0:03:17and the users cannot ask for clarification or provide suggestions while the editing is
0:03:25being performed
0:03:27so users would clearly benefit greatly from conversing in real time with an expert image editor
0:03:36so our ultimate goal is to build a dialogue system with such capabilities
0:03:43and now i'm going to play a video to show what we mean by conversational image editing
0:03:53a realistic speech
0:04:13you actually
0:04:28now let's take a step back and talk about incrementality in dialogue systems incrementality in
0:04:32dialogue systems means that user utterances start being processed word by word before the user has uttered
0:04:41a complete utterance and the system has to respond
0:04:44as soon as possible
0:04:48now conversational image editing is a domain particularly well suited for incremental processing and
0:04:56the reason is that it requires a lot of fine-grained changes as we will see
0:05:00in this video
0:05:03users may modify their requests rapidly
0:05:09and speak very fast
0:05:12and the wizard has to process everything very fast and respond as soon as possible
0:05:42all right you just
0:05:52so we collected a corpus over skype in a wizard-of-oz setting
0:05:57so we had the user request edits and the wizard would perform the edits
0:06:03the wizard's screen was shared and only the wizard could control the image editing tool
0:06:10and this was done deliberately because we wanted to record all the user's input in
0:06:15spoken language format
0:06:18there were no time constraints they could talk for as long as they wanted
0:06:23we did not explicitly tell the users whether they were interacting with a system
0:06:28or a human but the conversation was very natural so it was pretty obvious
0:06:32that they were talking to another human
0:06:36and here are some statistics of our corpus we had twenty users and one hundred and twenty
0:06:43nine dialogues
0:06:45we can see that roughly the number of user utterances is double the number of wizard utterances
0:06:51users would talk more
0:06:52and occasionally the wizards would provide suggestions and ask questions
0:06:57however by technology alliance
0:06:59the corpus will be released to the public in the near future
0:07:04so that's our corpus
0:07:06our next step was to annotate it with dialogue acts
0:07:10so we define an utterance as a portion of speech
0:07:13between silence intervals greater than three hundred milliseconds
0:07:17utterances are segmented into utterance segments and we assign a dialogue act to each utterance segment
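[editor's note] The utterance definition just given (speech bounded by silences longer than three hundred milliseconds) can be illustrated with a minimal sketch. It assumes word-level timestamps are available as (token, start, end) tuples in seconds; the function name `split_utterances` and the sample timings are hypothetical and are not taken from the corpus or from the speakers' tooling.

```python
from typing import List, Tuple

# (token, start time in seconds, end time in seconds); an assumed input format
Word = Tuple[str, float, float]


def split_utterances(words: List[Word], max_gap: float = 0.3) -> List[List[str]]:
    """Start a new utterance whenever the silence before a word exceeds max_gap."""
    utterances: List[List[str]] = []
    current: List[str] = []
    prev_end = None
    for token, start, end in words:
        if prev_end is not None and start - prev_end > max_gap:
            utterances.append(current)
            current = []
        current.append(token)
        prev_end = end
    if current:
        utterances.append(current)
    return utterances


# Hypothetical timings: a half-second pause separates the two requests.
words = [
    ("increase", 0.00, 0.40), ("saturation", 0.45, 1.10),
    ("a", 1.60, 1.65), ("bit", 1.70, 1.90), ("more", 1.95, 2.20),
]
print(split_utterances(words))
```

Segmenting utterances further into dialogue-act segments is an annotation step in the talk and is not attempted here.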
0:07:26and here are the dialogue act labels from our corpus
0:07:32but we are interested for this study only in some of them so the dialogue act
0:07:39labels are
0:07:40user i
0:07:43image edit requests
0:07:44new requests updates to previous requests
0:07:47reverting to the previous state of the image comparing the current state of the
0:07:52image with a previous state of the image we have like or dislike comments
0:07:57and image comments these are neutral comments for example
0:08:02it looks very striking
0:08:04we have yes no responses
0:08:06and we have other
0:08:08anything that cannot be classified into any of the other labels
0:08:15so these are the
0:08:17dialogue act labels that we're interested in for this study the most frequent ones were
0:08:22the new image edit requests the updates
0:08:25the like comments
0:08:26the yes responses and other
0:08:32and here are some examples so increase the saturation a little this is a new image
0:08:37edit request a bit more this is an update
0:08:41that's not good enough
0:08:42this is a dislike comment
0:08:45change the saturation back to the original this is an image edit request revert
0:08:52great this is a like comment
0:08:55can you show me before and after this is an image edit request compare
0:09:01we measured inter-annotator agreement we had two expert annotators annotate the same
0:09:06dialogue session of twenty minutes
0:09:08and we calculated cohen's kappa to be point eighty one which shows high agreement
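[editor's note] For readers unfamiliar with the statistic, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance from their label distributions. The sketch below is a plain-Python illustration on made-up labels; the label names and the six-item example are hypothetical and do not reproduce the point eighty one reported in the talk.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's own label distribution.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for six items; the annotators disagree on the second one.
annotator_1 = ["new", "new", "update", "like", "other", "new"]
annotator_2 = ["new", "update", "update", "like", "other", "new"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

In the talk the unit of comparison also takes segmentation into account, so the items fed to such a function would need to encode segment boundaries as well as labels.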
0:09:13note that
0:09:15when we measure agreement
0:09:17we want to make sure that the annotators agree
0:09:20not only on the dialogue act labels but also on the segments
0:09:25here's an example the utterance is
0:09:28crop the photo to remove it
0:09:30so the first annotator
0:09:31assumes that this is one segment
0:09:35and annotates it with image edit request new whereas the second annotator
0:09:41assumes that these are two segments crop the photo and to remove it
0:09:46and annotates the first segment
0:09:48as image edit request new and the second segment as an image edit request
0:09:55and we look at each word
0:09:57and we compare the annotations
0:10:02both segmentation and dialogue act
0:10:04and only when they agree on everything
0:10:07a this counts as agreement so in this case we have so agreements out of
0:10:12which is point thirty three
0:10:17i was used
0:10:18if you only have the dialogue act label this is not enough information to perform
0:10:22image edits
0:10:23so for this reason
0:10:25we also have more complex annotations of actions and entities so here's an example
0:10:31so this segment is make the tree brighter to like one hundred
0:10:34the dialogue act is image edit request new the action is adjust the attribute is brightness
0:10:40the object is tree
0:10:43the value is one hundred
0:10:46but for this study
0:10:49we only use information from the dialogue act labels
0:10:53so let's talk about dialogue act detection models so we split our corpus into training and testing
0:11:01we have one hundred and sixteen dialogues for training and thirteen for testing and we
0:11:05compare neural network based models versus traditional classification models so for neural networks we have
0:11:12long short-term memory networks
0:11:14and convolutional neural networks
0:11:16for more traditional models we have naive bayes
0:11:19conditional random fields and random forests
0:11:23we also compare
0:11:24embeddings trained on image related corpora versus out-of-the-box pre-trained embeddings
0:11:32and here are classification results
0:11:36note that
0:11:37we don't do anything incremental yet
0:11:40we assume
0:11:40that we have the full utterance
0:11:47we see that the conditional random fields and lstms
0:11:50don't perform very well these are both sequential models
0:11:54we hypothesize that this is because we didn't have enough data to capture larger
0:11:58context dependencies
0:12:00random forests are doing well
0:12:02and so are cnns
0:12:05however using trained word embeddings
0:12:08is better than using out-of-the-box embeddings but the difference is not statistically significant
0:12:13and it also helps when we use sentence embeddings that is we generate
0:12:17a vector for the whole sentence rather than for each word
0:12:22no be a
0:12:24well it
0:12:27a simple random forest
0:12:29this is a problem and difference is that it is that
0:12:34but when we use random forest with sent back
0:12:37that is that's as the latter at the difference is not statistically
0:12:43now here we can see
0:12:46that as we get
0:12:47more words from the user on the x-axis we see that the
0:12:52confidence score for each one of the dialogue acts
0:12:57changes so initiate for the first word i opinion score or the dialogue act a
0:13:04but i
0:13:05after the word that the word also here is that it's pretty clear
0:13:11the output is the image edit request new which happens to be the right
0:13:17label for this example
0:13:22now let's talk about incrementality
0:13:26how right
0:13:28where i'm correctness of prediction
0:13:32when we have getting from and so i which from the user
0:13:36so in order to calculate
0:13:39how much we say whether we're right or not
0:13:44is we get all these samples from the user
0:13:49we have two
0:13:50this is one comic strip prediction
0:13:54here we use a very
0:13:56simple model we set a confidence threshold
0:14:00that explains it is also we have come from the user i think i think
0:14:05that i think that's something that's
0:14:08then we have a confidence scores
0:14:10but the classifier
0:14:13and how a
0:14:15the classifier assigns to each
0:14:19now let's say we have a confidence score of point two
0:14:23i mean i wish detection
0:14:27sorry a confidence threshold of point two this means that we should make a prediction when the
0:14:32score becomes higher than this threshold so we should make a prediction here
0:14:39and the prediction is the other class
0:14:42so this is a wrong prediction
0:14:43and when we have wrong predictions we assume that we're not going to have any
0:14:47word savings
0:14:49now let's say that the confidence threshold is point four
0:14:52this means that we should make a prediction here because point five is larger than
0:14:58point four and here the classifier predicts a like comment and this is correct
0:15:05we have a correct prediction and we save one word
0:15:08because we have right until the user i think that
0:15:13so it helps when we increase
0:15:15the confidence threshold
0:15:19the problem is that
0:15:20if we increase the threshold too much
0:15:23let's say we set it to point five or point six then we don't have anything to make
0:15:29a prediction on
0:15:30because here all the
0:15:33all the scores are below point
0:15:37five or point six
0:15:40so as we will see in the next slide
0:15:44the higher the confidence threshold
0:15:46the fewer samples
0:15:50we use to make a prediction
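[editor's note] The thresholding walk-through above can be sketched in a few lines. This is only an illustration under stated assumptions: each sample is assumed to be a list of (predicted label, confidence) pairs, one per word heard so far, plus a gold label; a prediction fires at the first word whose confidence is strictly higher than the threshold; wrong predictions save no words. The names `evaluate` and `first_confident_step` and the scores are hypothetical, chosen to mirror the spoken example.

```python
from typing import Dict, List, Optional, Tuple

Step = Tuple[str, float]          # (predicted label, confidence) after each word
Sample = Tuple[List[Step], str]   # (per-word classifier outputs, gold label)


def first_confident_step(steps: List[Step], threshold: float) -> Optional[int]:
    """Index of the first word whose confidence exceeds the threshold, if any."""
    for index, (_label, confidence) in enumerate(steps):
        if confidence > threshold:
            return index
    return None


def evaluate(samples: List[Sample], threshold: float) -> Dict[str, int]:
    """Count fired predictions, correct predictions, and words saved."""
    fired = correct = words_saved = 0
    for steps, gold in samples:
        index = first_confident_step(steps, threshold)
        if index is None:
            continue  # confidence never exceeded the threshold: no prediction
        fired += 1
        if steps[index][0] == gold:
            correct += 1
            # Words the system did not have to wait for (correct predictions only).
            words_saved += len(steps) - (index + 1)
    return {"fired": fired, "correct": correct, "words_saved": words_saved}


# One hypothetical four-word utterance whose gold label is "like".
samples = [([("other", 0.30), ("other", 0.35), ("like", 0.50), ("like", 0.45)], "like")]
for threshold in (0.2, 0.4, 0.6):
    print(threshold, evaluate(samples, threshold))
```

With these made-up scores, a threshold of 0.2 fires on the first word with the wrong label, 0.4 fires on the third word with the right label and saves one word, and 0.6 never fires.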
0:15:57now in this
0:15:59graph
0:16:00on the x-axis we have the confidence threshold
0:16:04and on the y-axis we have percentages
0:16:07so the blue line shows us the percentage of correct predictions
0:16:11and the red line shows the percentage of word savings
0:16:18and these numbers here that we see at each point
0:16:24show us how many samples
0:16:28we have above a certain threshold so we can see that for a confidence threshold of point
0:16:34two we have a lot of samples
0:16:35but as the confidence threshold increases
0:16:39we have fewer and fewer samples
0:16:42so basically the blue line
0:16:45is precision
0:16:54so as the confidence threshold increases we notice that we're doing better but
0:17:01the number of samples becomes lower and lower so precision is high but recall would
0:17:07really be very low
0:17:09so here's precision as a function of the confidence threshold and here's recall
0:17:16so it seems that this is a good point but it's not really a good point
0:17:19because the recall is very low
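[editor's note] The precision and recall trade-off described here can be made concrete with a small sketch. It assumes precision is the number of correct predictions divided by the number of samples on which a prediction fired, and recall is the number of correct predictions divided by all samples; these definitions, the function name `precision_recall`, and the scores are assumptions for illustration and may differ from what was plotted in the talk.

```python
from typing import List, Optional, Tuple

Step = Tuple[str, float]          # (predicted label, confidence) after each word
Sample = Tuple[List[Step], str]   # (per-word classifier outputs, gold label)


def precision_recall(
    samples: List[Sample], threshold: float
) -> Tuple[Optional[float], Optional[float]]:
    """Precision over fired predictions and recall over all samples."""
    fired = correct = 0
    for steps, gold in samples:
        # Label of the first word whose confidence exceeds the threshold.
        predicted = next((label for label, conf in steps if conf > threshold), None)
        if predicted is None:
            continue
        fired += 1
        correct += int(predicted == gold)
    precision = correct / fired if fired else None
    recall = correct / len(samples) if samples else None
    return precision, recall


# Four hypothetical utterances with made-up per-word scores.
samples = [
    ([("other", 0.30), ("like", 0.50), ("like", 0.90)], "like"),
    ([("new", 0.60), ("new", 0.70)], "new"),
    ([("other", 0.25), ("update", 0.45)], "update"),
    ([("like", 0.30), ("like", 0.35)], "other"),
]
for threshold in (0.2, 0.4, 0.8):
    print(threshold, precision_recall(samples, threshold))
```

On this toy data the highest threshold keeps precision at 1.0 while recall drops to 0.25, which is the pattern the speaker warns about.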
0:17:22and now let's go back to the original graph we have the same pattern for the
0:17:26percentage of word savings
0:17:28at a confidence threshold of point five
0:17:32we save the most
0:17:35we don't have many examples where the score
0:17:38is going to become larger than the confidence threshold of point five so it's the
0:17:43same pattern
0:17:44as the confidence threshold becomes higher we have fewer samples
0:17:50so the bottom line
0:17:51is that it is not clear where we should set the threshold
0:17:58so that means that just relying on a confidence threshold is not a good model
0:18:04and we certainly need something more sophisticated than that
0:18:09and here we can see what would happen if we had histogram-based model and that
0:18:14i think i rates can make the prediction and the prediction would always be alright
0:18:20so we can see here
0:18:22the word savings for each of the dialogue acts
0:18:25and for all the data
0:18:27we have thirty nine percent savings and we can see the percentage of correct predictions
0:18:32is seventy four and this is actually accuracy because in this case the
0:18:37number of samples that we have doesn't depend on a confidence threshold
0:18:44so we introduced
0:18:47a new domain conversational image editing
0:18:51it's a real world application which combines language and vision and it is particularly well suited
0:18:58for incremental dialogue processing we compared models for incremental intent identification cnns and random
0:19:07forests outperformed the rest of the models
0:19:10and we found that
0:19:12embeddings trained on image related corpora outperformed pre-trained out-of-the-box embeddings
0:19:19we also calculated the impact of a confidence threshold
0:19:22above which the classifier's prediction should be considered on the classification accuracy and
0:19:28word savings which is a proxy for time savings so our experiments provide evidence
0:19:34that incremental intent processing
0:19:38can save time and be more efficient for real users
0:19:42so for our future work obviously
0:19:45we need a better model for predicting
0:19:48whether to commit to a prediction because just relying on a confidence threshold
0:19:52is not enough
0:19:54we also need to perform full natural language understanding
0:19:57taking into account the actions and entities
0:20:00because just having
0:20:02the dialogue act is not enough information to perform an edit
0:20:05and
0:20:06the ultimate goal is to build a dialogue system that can play the role of the wizard
0:20:11for this we're going to need not only natural language processing
0:20:15and dialogue processing
0:20:17but also computer vision algorithms because the system should be able to locate the part
0:20:22of the image that the user is referring to
0:20:28thank you very much
0:20:35okay thank you time for questions one the
0:21:04could you please repeat the question
0:21:23what would you can be made action
0:21:26we mean the dialogue act label
0:21:30when they say something that we cannot categorise then it's fine for it to
0:21:35be labeled as other
0:21:45no i
0:21:49let's have
0:21:55right actually let's go back to the annotations
0:22:06so you're asking about the other
0:22:08so we have the label and we assign the label to each word
0:22:13so you're saying that whether they say some something garbage here that we conduct
0:22:35well we looked into many dialogue act schemes for example the recent
0:22:42one by
0:22:43harry bunt
0:22:48let me go to this might be
0:22:56you know i talked about the dialogue acts for users
0:23:00there are other dialogue acts like request implementation questions about features questions about the image
0:23:08actually use
0:23:11nova be talking about the image location actually directive et cetera so obviously some of
0:23:17these dialogue acts are domain specific so we looked into other annotation schemes but we
0:23:23had to adapt these annotation schemes to our data so we had to look at
0:23:27our data and
0:23:37this way round a convolutional neural networks with multiple
0:23:54so you're talking about
0:23:59or here
0:24:00so after
0:24:02we feed it to the convolutional neural network and we get the class
0:24:09that has the highest probability
0:24:16one or something
0:24:37well we have
0:24:39percent of the users in the training data but i agree with you that if
0:24:43we i it's a small corpus and we had some for effect so on the
0:24:49ls this and this year s
0:24:51and we think that was because we didn't have enough data okay maybe if we
0:24:55rearrange the data maybe we'll get slightly different
0:25:01results but i think the patterns will still remain the same
0:25:06maybe the random forest are quite close
0:25:09to the cnn
0:25:11so maybe we have an effect
0:25:13we have in fact
0:25:15there and of course the models are also very sensitive to the parameters of the
0:25:19neural networks we had to do a lot of experimentation to set the hyperparameters
0:25:23and we did that on the training data
0:25:52well i
0:25:53the wizard did need to ask clarification questions or provide suggestions
0:25:58so that sometimes the users would say okay i'm not i want to make it
0:26:05what's the what's the type of feature what's the bottom actually use so
0:26:10the users were asking questions and the wizard would answer
0:26:15but also the wizard could ask clarification questions
0:26:54so initially we have a quite low confidence threshold
0:27:00i think that was interaction and explaining that are also explain it well
0:27:04so when we calculate the word savings we only take into account correct predictions
0:27:11in the beginning we have a low confidence threshold
0:27:16and most likely our predictions are not right
0:27:19so we don't really save much it seems that at point five
0:27:24we save quite a lot
0:27:26because whatever predictions we make are correct
0:27:31but after that
0:27:32when we have a confidence threshold of point six or point seven basically the
0:27:38user has uttered
0:27:39the whole
0:27:41utterance so we don't save anything
0:27:43and here
0:27:44so when i talked about the blue line i said that these numbers in
0:27:49blue are the number of samples that we consider for the correctness of the predictions
0:27:54here are the number of samples that we consider for the word savings and
0:27:59you can see that this is lower than this
0:28:02and this is because when we calculate the word savings we only consider
0:28:08the correct predictions
0:28:09it's quite a complicated graph i tried to make it as
0:28:14clear as possible
0:28:17okay or
0:28:41that it doesn't mean that
0:28:45it has to be
0:28:48it has to respond immediately as soon as it gets the user's input
0:28:54ideally we should have a policy that
0:28:58should tell the system when to wait and when there is enough information to perform an
0:29:03edit in the video that i played it was clear everything was
0:29:08happening very fast
0:29:09the user was speaking very fast
0:29:11and the wizard had to follow and respond very fast
0:29:15if the user says
0:29:19i don't know tell me about the functionality of the
0:29:23image editing tool it doesn't mean that the system should
0:29:28jump right in and start talking before the user has finished so we need a
0:29:32policy to tell us when it makes sense
0:29:36to process things very fast and when it makes sense to wait
0:29:41we did have a paper last year exec the higher
0:29:43with a different domain
0:29:45but we had an incremental dialogue policy which would make decisions when to wait
0:29:52or when to
0:29:56jump in and perform an action
0:29:59okay but the last question
0:30:27so when you make correct predictions you save time
0:30:30when you
0:30:31don't make
0:30:32correct predictions
0:30:36it it's
0:30:40i'm not sure i understand it's a tradeoff
0:30:53well first of all we don't have a system
0:30:58so you if you may have
0:31:02okay the will be will give a wrong answer
0:31:09i don't think i understood your question
0:31:35that's true but this is the distribution this is just an analysis of what happens
0:31:43for each confidence threshold it does not
0:31:46it doesn't
0:31:48we don't show it let's say interdependencies about jumping
0:31:55and then the user was happy with me until then
0:32:00actually started being happy because i did sometimes jpeg now we don't manager this at
0:32:05this point is just an analysis of the results
0:32:07and as i said
0:32:11the bottom line is that
0:32:13we cannot just rely on a confidence threshold so we need something more sophisticated to decide
0:32:19whether to make a prediction or not
0:32:22and it should take into account all kinds of things
0:32:25the context
0:32:26there should be rewards
0:32:28if the user gives a like comment it means that we're doing well
0:32:32so this is just an analysis of what happened okay
0:32:36okay useful tools for this purpose