0:00:16 Thanks for the introduction.
0:00:18 This is joint work with my student Ramesh Manuvinakurike and our collaborators at Adobe Research. Let me start by defining what we mean by image editing.
0:00:30 So, image editing is changing certain characteristics of an image, and that can be done with software tools such as Adobe Photoshop, Microsoft Photos, et cetera. Here we can see two examples.
0:00:45 In the first example we add clouds to the sky. In the second example we have the photograph of a flower and we make it black and white.
0:00:56 Image editing is a very hard task. It requires artistic creativity, patience, and a lot of experimentation and trial and error, which in turn makes it a very time-consuming task.
0:01:13 Also, users may not be fully aware of the functionality of a given image editing tool, and some image editing tools are very complex to use, so there is a steep learning curve.
0:01:27 Furthermore, users may not be sure about what changes exactly they want to perform on the image. Here's an example: "make this look nicer". This is pretty abstract; how do you make something look nicer than it currently is? Or maybe the request is not abstract, the change that is required is concrete, but we don't know the precise editing steps. So here's another example: "remove the human from the field". This is pretty concrete, but it may require many editing steps.
0:02:04 So clearly there is a need here for expert help, and indeed there are web services and web forums where novice users post their images together with an edit request, and there expert editors will perform the edit, either for free or for a fee. Then the expert editors and the novice users can exchange messages until the user is happy with the edit.
0:02:36 And typically in these forums the requests are formulated in an abstract manner using natural language. Here is an example: "Here is a photo from our last holiday. Can someone please remove my ex from this photo? I'm the one on the right." So it's more likely to get something like this rather than detailed step-by-step instructions.
0:03:02 So the web forums are clearly very helpful, but they have major drawbacks. First of all, users cannot request changes or provide feedback in real time, and the editors cannot ask clarification questions or provide suggestions while the editing is being performed. So a user would clearly benefit greatly from conversing in real time with an expert image editor.
0:03:36 So our ultimate goal is to build a dialogue system with such capabilities.
0:03:43 And now I'm going to play a video which shows what we mean by conversational image editing, with realistic speech.
0:04:08 [video plays]
0:04:28 Now let me step back and talk about incrementality in dialogue systems. Incremental dialogue systems means that user utterances start being processed word by word, before the user has uttered a complete utterance, and the system has to respond as soon as possible.
0:04:48 Now, conversational image editing is a domain particularly well suited for incremental processing, and the reason is that it requires a lot of fine-grained changes, as we will see in this video. Users may modify their requests radically, and they speak very fast, and the wizard has to process everything very fast and respond as soon as possible.
0:05:36 [video plays]
0:05:52 So we collected a corpus over Skype in a Wizard-of-Oz setting. The user would make edit requests, and the wizard would perform the edits. The wizard's screen was shared, and only the wizard could control the image editing tool. This was done deliberately, because we wanted to record all of the user's input in spoken language format.
0:06:18 There were no time constraints; the dialogues could take as long as the participants wanted. And we did not explicitly tell the users whether they would be interacting with a system or with a human, but the conversation was very natural, so it was pretty obvious that they were talking to another human.
0:06:36 And here are some statistics of our corpus. We had twenty users and a hundred and twenty-nine dialogues. We can see that roughly the number of user utterances is double the number of wizard utterances: the users would talk a lot, and occasionally the wizards would provide suggestions and ask questions. We hope to release the corpus to the public in the near future.
0:07:04 So, after we collected our corpus, our next step was to annotate it with dialogue acts. We define an utterance as a portion of speech between silence intervals greater than three hundred milliseconds. Utterances are segmented into utterance segments, and we assign a dialogue act to each utterance segment.
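A minimal sketch of this 300 ms segmentation rule, assuming word-level timestamps from an ASR system; the tuple format and the example words are illustrative, not taken from the corpus tooling itself.

```python
SILENCE_THRESHOLD = 0.3  # seconds; a longer pause ends the utterance

def segment_utterances(words):
    """Split (token, start_time, end_time) tuples into utterances
    whenever the silence between consecutive words exceeds 300 ms."""
    utterances, current, prev_end = [], [], None
    for token, start, end in words:
        if prev_end is not None and start - prev_end > SILENCE_THRESHOLD:
            utterances.append(current)
            current = []
        current.append(token)
        prev_end = end
    if current:
        utterances.append(current)
    return utterances

# The 0.5 s pause after "saturation" starts a new utterance:
print(segment_utterances([
    ("increase", 0.00, 0.40), ("the", 0.45, 0.55),
    ("saturation", 0.60, 1.10), ("a", 1.60, 1.65), ("little", 1.70, 2.00),
]))  # -> [['increase', 'the', 'saturation'], ['a', 'little']]
```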
0:07:26 And here are the dialogue act labels from our corpus. For this study we are interested only in some of them, but I'll go through all the dialogue act labels. From the user we have image edit requests: new requests, updates to a previous request, reverting to the previous state of the image, and comparing the current state of the image with a previous state of the image. We have comments: like or dislike comments, and image comments, which are neutral comments, for example "it looks very striking". We have yes/no responses, and we have "other": anything that cannot be classified into any of the other labels.
0:08:15 So these are the dialogue act labels that we are interested in for this study. The most frequent ones were the image edit request new, the image edit request update, the like comments, the yes responses, and "other".
0:08:32 And here are some examples. "Increase the saturation": this is a new image edit request. "A little bit more": this is an update. "That's not good enough": this is a dislike comment. "Change the saturation back to the original": this is an image edit request revert. "Great": this is a like comment. And "show me before and after": this is an image edit request compare.
0:09:01 We measured inter-annotator agreement: we had two expert annotators annotate the same dialogue session of twenty minutes, and we calculated Cohen's kappa to be point eighty-one, which shows high agreement.
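For reference, Cohen's kappa corrects raw agreement for chance agreement. A small sketch of the computation over two annotators' label sequences (not the authors' evaluation code):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """kappa = (p_o - p_e) / (1 - p_e): observed agreement p_o,
    corrected by the agreement p_e expected from label frequencies."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```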
0:09:13 Now, when we measure agreement, we want to make sure that the annotators agree not only on the dialogue act labels but also on how they segment the utterances.

0:09:25 Here's an example. The first annotator assumes that the whole utterance is one segment, and annotates it as an image edit request new, whereas the second annotator assumes that there are two segments, and annotates the first segment as an image edit request new and the second segment as an image edit request update.

0:09:55 So we align the annotations word by word, and we compare them on both segmentation and dialogue act, and only when the annotators agree on everything does a word count as an agreement. So in this case we have two agreements out of six, which is point thirty-three.
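The word-level comparison described here can be sketched as follows. The pair encoding (a boundary flag plus a label per word) and the short labels are my own illustration, and the example annotations below are one configuration that reproduces the 2/6 score; the exact alignment on the talk's slide may differ.

```python
def word_level_agreement(ann_a, ann_b):
    """Each annotation is one (starts_new_segment, dialogue_act) pair per
    word; a word counts as agreement only if both parts match."""
    matches = sum(a == b for a, b in zip(ann_a, ann_b))
    return matches / len(ann_a)

# Annotator A: one six-word segment labelled IER-Update.
# Annotator B: two segments, IER-New (words 1-3), IER-Update (words 4-6).
a = [(True, "IER-U")] + [(False, "IER-U")] * 5
b = [(True, "IER-N"), (False, "IER-N"), (False, "IER-N"),
     (True, "IER-U"), (False, "IER-U"), (False, "IER-U")]
print(word_level_agreement(a, b))  # -> 2/6 = 0.33
```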
0:10:17 Note that if you only have the dialogue act label, this is not enough information to perform the image edit. For this reason we also have more complex annotations of actions and attributes. Here's an example: the segment is "adjust the brightness of the tree to one hundred"; the dialogue act is image edit request new, the action is "adjust", the attribute is "brightness", the object is "the tree", and the value is "one hundred". But for this study we only use information from the dialogue act labels.
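A sketch of what such a richer annotation record might look like; the field names are illustrative assumptions, not the corpus's exact schema.

```python
from dataclasses import dataclass

@dataclass
class SegmentAnnotation:
    text: str
    dialogue_act: str  # e.g. "IER-New"
    action: str        # e.g. "adjust"
    attribute: str     # e.g. "brightness"
    region: str        # e.g. "the tree"
    value: str         # e.g. "100"

seg = SegmentAnnotation("adjust the brightness of the tree to one hundred",
                        "IER-New", "adjust", "brightness", "the tree", "100")
```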
0:10:53 So now let's talk about our dialogue act detection models. We split our corpus into training and testing: we have one hundred and sixteen dialogues for training and thirteen for testing, and we compare neural-network-based models against traditional classification models. For neural networks we have long short-term memory networks and convolutional neural networks; for more traditional models we have naive Bayes, conditional random fields, and random forests. We also compare word embeddings trained on image-related corpora against out-of-the-box pre-trained embeddings.
0:11:32 And here are the classification results. Note that we don't do anything incremental yet; here we assume we have the full utterance. We see that the conditional random fields and the LSTMs don't perform very well. These are both sequential models, and we hypothesize that this is because we didn't have enough data to capture long-range context dependencies. The random forests are doing well, and so are the CNNs.

0:12:05 Moreover, using embeddings trained on image-related corpora is better than using out-of-the-box embeddings, although the difference is not statistically significant. And it also helps when we use sentence embeddings, that is, when we generate a vector for the whole sentence rather than for each word.
0:12:22 Now, the CNN outperforms the simple random forest, and this difference is statistically significant. But when we use random forests with sentence embeddings, the CNN still does better, but the difference is not statistically significant.
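A rough sketch of the random-forest-with-sentence-embeddings setup, under the assumption that a sentence vector is obtained by averaging word vectors; the actual system may build sentence vectors differently (e.g. with a dedicated sentence-embedding model).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def sentence_vector(tokens, word_vectors, dim=300):
    """Average the word embeddings of an utterance into one vector;
    out-of-vocabulary words are skipped."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def train_dialogue_act_classifier(train_utts, train_acts, word_vectors):
    # word_vectors: a token -> vector map, e.g. embeddings trained on
    # image-editing text (hypothetical variable here).
    X = np.stack([sentence_vector(u, word_vectors) for u in train_utts])
    clf = RandomForestClassifier(n_estimators=300, random_state=0)
    clf.fit(X, train_acts)
    return clf
```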
0:12:43 Now, here we can see what happens as more words arrive from the user. On the x-axis we see the incoming words, and we have a score for each one of the dialogue acts, which changes as the words come in. So initially, for the first word, the top-scoring dialogue act is the wrong one, but after a few words it's pretty clear: the output is the image edit request new, which happens to be the correct label for this example.
0:13:22 Now let's talk about incremental intent processing: how do we measure the correctness of a prediction when we are getting input from the user word by word? In order to calculate how much we save, and whether we are right or not, we take all these samples from the user, and we have to decide when to commit to a prediction.

0:13:54 Here we use a very simple model: we set a confidence threshold. Let me explain how this works. As the words come in from the user, we have the confidence scores that the classifier assigns to each dialogue act. Say that we set a confidence threshold of point two for dialogue act detection. A confidence threshold of point two means that we should make a prediction as soon as the score becomes higher than point two, so we should make a prediction here. But at this point the classifier predicts the other class, so this is a wrong prediction, and when we have a wrong prediction we assume that we do not save any words.

0:14:49 Now let's say that the confidence threshold is point four. This means that we should make a prediction here, because point five is larger than point four, and here the classifier's top class is the correct one. So we have a correct prediction, and we save one word, because we were right before the user had finished the utterance.

0:15:13 So it helps when we increase the confidence threshold. The problem is that if we increase the confidence threshold too much, let's say we set it at point five or point six, then we are never going to make a prediction at all, because all the scores stay below point five or point six. So, as we will see in the next slide, the higher the confidence threshold, the fewer samples we use to make a prediction.
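This commit rule amounts to only a few lines of code. A minimal sketch, where classify is any function returning the top dialogue act and its confidence for a word prefix (a hypothetical interface, not the authors' code):

```python
def simulate_threshold(tokens, true_act, classify, threshold):
    """Commit at the first prefix whose top-class confidence exceeds the
    threshold; a correct commit saves the remaining words, a wrong one
    saves nothing."""
    n = len(tokens)
    for i in range(1, n + 1):
        act, conf = classify(tokens[:i])
        if conf > threshold:
            correct = act == true_act
            return correct, (n - i) if correct else 0
    # Threshold never exceeded: fall back to the full utterance, no savings.
    act, _ = classify(tokens)
    return act == true_act, 0
```

Sweeping the threshold over the test set reproduces the tradeoff discussed next: higher thresholds make commits more precise but rarer.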
0:15:55 Okay. Now, in this graph, on the x-axis we have the confidence threshold, and on the y-axis we have percentages. The blue line shows us the percentage of correct predictions, and the red line shows the percentage of word savings. And the numbers that we see at each point show us how many samples we have above that threshold. We can see that at a confidence threshold of point two we have the most samples, but as the confidence threshold increases, we have fewer and fewer samples. So the blue line is basically like a precision.
0:16:54 So as the confidence threshold rises, we notice that we are doing better, but because the number of samples becomes lower and lower, this is a precision with a very small recall: on the few samples we keep, we may do very well. So here is precision as a function of the confidence threshold, and here is recall. It seems that this is a good operating point, but it's not really a good point, because the recall is very low.
0:17:22 And now let's go back to the original graph. We have the same pattern for the percentage of word savings. At a confidence threshold of point five we save the most, but we don't have many examples where the score becomes larger than point five, so it's the same pattern: as the confidence threshold becomes higher, we have fewer samples. So the bottom line is that as we increase the threshold it is not clear where we should set it, which means that just relying on a confidence threshold is not a good model, and we certainly need something more sophisticated than that.
0:18:09 And here we can see what would happen if we had an oracle model that could decide at exactly the right point to make the prediction, so that the prediction would always be right. We can see here the word savings for each of the dialogue acts, and over all the data we have thirty-nine percent word savings, and the percentage of correct predictions is seventy-four. And this is interesting, because in this case the number of samples that we have does not depend on a confidence threshold.
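The oracle analysis can be sketched the same way: commit at the earliest prefix where the classifier is already correct, which upper-bounds the achievable word savings. This is my reading of the analysis, not the authors' exact code; classify is the same hypothetical interface as above.

```python
def oracle_savings(tokens, true_act, classify):
    """Words saved if an oracle committed at the earliest prefix where
    the classifier's top dialogue act is already the true one."""
    n = len(tokens)
    for i in range(1, n + 1):
        act, _ = classify(tokens[:i])
        if act == true_act:
            return n - i  # words the oracle never needs to hear
    return 0  # the classifier is never right on this utterance
```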
0:18:44 So, to conclude: we introduced a new domain, conversational image editing. It is a real-world application which combines language and vision, and it is particularly well suited for incremental dialogue processing. We compared models for incrementally identifying dialogue acts, and the CNNs and random forests outperformed the rest of the models. We also found that embeddings trained on image-related corpora outperformed out-of-the-box embeddings. And we calculated the impact of varying the confidence threshold, above which the classifier's prediction should be considered, on the classification accuracy and on the word savings, which are a proxy for time savings. So our experiments provide evidence that incremental intent processing can save time and would be more efficient for real users.
0:19:42 As for our future work: obviously we need a better model for deciding whether to commit to a prediction, because just relying on a confidence threshold is not enough. We also need to perform full natural language understanding, taking into account the actions and attributes, because just having the dialogue act label does not give us enough information to perform the edit. And the ultimate goal is to build a dialogue system that can play the role of the wizard. For this we are going to need not only natural language processing and dialogue processing, but also computer vision algorithms, because the system should be able to locate the part of the image that the user is referring to.
0:20:28 Thank you very much.
0:20:35 [Session chair] Okay, thank you. Time for questions.
0:21:04 [Audience question, inaudible] Could you please repeat the question?
0:21:23 "What do you mean by action?" We mean the dialogue act label. And when the user says something that we cannot categorize, then it is fine for it to be labeled as "other".
0:21:39 Right. Actually, let's go back to the annotations, here. So here, regarding the "other" label: we have the labels, and we assign a label to each word segment. So you're saying that if they say something, some garbage, here, then we can capture that. Right. Okay.
0:22:35 Well, we reviewed existing dialogue act schemes, for example the recent one by Harry Bunt. But let me go to this slide. So here I talked about the dialogue acts of the users; there are other dialogue acts, like requests for recommendations, questions about features, questions about the image. Here we are talking about the image location, action directives, et cetera. So obviously some of these dialogue acts are domain-specific, so we looked into other annotation schemes, but we had to adapt those annotation schemes to our data; we had to look at our own data.
0:23:37 [Question about the convolutional neural networks, partially inaudible]

Exactly. So you're talking about this, here. So we feed the input to the convolutional neural network, and we get the class that has the highest probability.

0:24:16 [Follow-up question, inaudible]
0:24:37 Well, we have a percentage of the users in the training data, but I agree with you that it's a small corpus, and we may have had some confounding effects, for example on the LSTMs and the CNNs, where we think we didn't have enough data. Okay, maybe if we rearranged the splits we would get slightly different results, but I think the patterns would still remain the same. Maybe the random forests would come quite close to the CNN, so maybe we have an effect there. And of course the models are also very sensitive to the hyperparameters of the neural networks; we had to do a lot of experimentation to set the hyperparameters, and we did that on the training data.
0:25:52 Well, the wizard could ask clarification questions or provide suggestions. So sometimes the users would say, okay, I want to make it brighter, or they would ask, what is this type of feature, which button should I use. So the users were asking questions and the wizard would answer, but also the wizard could ask clarification questions.
0:26:51 So, initially we have a quite low confidence threshold. I think there is an interaction there, and maybe I didn't explain it well. When we calculate the word savings, we only take into account the correct predictions. So in the beginning we have a low confidence threshold, and most likely our predictions are not right, so we don't really save much. Then it seems that at point five we save quite a lot, because whatever predictions we make are correct. But after that, when we have a confidence threshold of point six or point seven, basically we wait until the user has completed the whole utterance, so we don't save anything.

0:27:43 And here, when I talked about the blue line, I said that these numbers in blue are the numbers of samples that we consider for the correctness of the predictions; here are the numbers of samples that we considered for the word savings, and you can see that these are lower. This is because when we calculate the word savings we only consider the correct predictions. It's quite a complicated graph; I tried to make it as clear as possible.
0:28:17 [Question, inaudible]
0:28:41 It doesn't mean that it has to respond immediately as soon as it gets the user's input. Ideally we should have a policy that should tell the system when to wait and when there is enough information to perform the edit. In the video that I played, everything was happening very fast: the user changed topics quickly, and the wizard had to follow, so everything had to happen fast. But if the user says, I don't know, "tell me about the functionality of this tool", it doesn't mean that the system should jump right in and start talking before the user has finished. So we need a policy to tell us when it makes sense to process things very fast, and when it makes sense to wait.

0:29:41 We did have a paper last year, at SIGDIAL, with a different domain, but there we had an incremental dialogue policy which would make decisions on when to wait or when to jump in and perform an action.
0:29:59 [Session chair] Okay, one last question.
0:30:27 So, when you make correct predictions you save time, and when you don't make correct predictions, it costs you; sure, I understand, it's a tradeoff. Well, first of all, we don't have a full system yet, so if we make a wrong prediction, okay, the system would give a wrong answer. Sorry, I don't think I understood you right.
0:31:35 That's true, but this is just the distribution; this is just an analysis of what happens for each confidence threshold. It doesn't model, let's say, interdependencies, like: I jumped in, and the user was unhappy with me until then, and then started being happy because I did some things right. No, we don't model this at this point; it's just an analysis of the results. And as I said, the bottom line is that we cannot just rely on a confidence threshold, so we need something more sophisticated to decide whether to make a prediction or not. And it should take into account all kinds of things: the context, and there should be rewards; if the user gives a like comment, it means that we are doing well. So this is just an analysis of what happened, okay?

0:32:36 [Session chair] Okay, thank you.