Speech Transcript - Multimodal Hierarchical Reinforcement Learning Policy for Task-Oriented Visual Dialog

0:00:14	hi everyone
0:00:17	i'm Jiaping and i'm going to show work done with
0:00:21	my co-authors Tiancheng Zhao and Zhou Yu
0:00:27	so today i'm going to talk about the dialogue policy learning problem for task-oriented visual
0:00:32	dialog
0:00:36	so first let me introduce the problem
0:00:40	so
0:00:41	the physically situated setting that we want to study is
0:00:46	where a visual chatbot tries to engage with the user to have
0:00:52	a dialog and find the target image
0:00:57	so here you can see
0:00:59	that there are twenty similar images presented to the agent and at the
0:01:07	first turn
0:01:09	the user can provide a
0:01:12	description
0:01:13	a luddite and you want
0:01:17	so
0:01:17	and then the
0:01:19	agent here plays a more proactive role by asking
0:01:23	relevant questions and discriminative questions
0:01:27	and hopefully once
0:01:28	it is confident enough it makes a decision
0:01:33	to finish the task within a minimal number of turns
0:01:37	so in this setting
0:01:38	there are two main challenges
0:01:42	the agent really needs to
0:01:45	understand the multimodal representations
0:01:48	and also be aware of the dynamic dialog context
0:01:52	especially when receiving signals
0:01:56	for making decisions for example wrong information or wrong guesses
0:02:03	so the main goal for the agent is to learn
0:02:07	an efficient dialogue policy to accomplish the task
0:02:14	so the motivation really comes from
0:02:16	some
0:02:17	potentially useful real applications
0:02:21	so imagine a virtual online shopping assistant
0:02:26	to help the customers
0:02:28	it can propose or recommend products based on the user's preferences and the
0:02:35	multimodal context through the dialog
0:02:39	so as a side note we are also working on building a task-oriented visual dialog based on
0:02:44	the fashion dataset
0:02:47	and hopefully we have something interesting to show
0:02:50	next year
0:02:53	so
0:02:54	the previous research on visual dialog mainly focuses on vision
0:02:59	to language understanding and generation
0:03:02	where they have a question bot and an answer bot chatting with each other
0:03:07	within a fixed number of turns
0:03:09	however we focus on the dialogue policy learning problem of the question bot
0:03:16	so that within this dialog setting
0:03:19	the question bot can play a more constructive role to help or assist
0:03:24	the human to accomplish the task
0:03:28	we want to evaluate the efficiency and the robustness of the dialogue policies in
0:03:34	terms of
0:03:35	more task-oriented dialogue system metrics
0:03:40	and it is worth mentioning that our
0:03:43	work is also related to hierarchical reinforcement learning
0:03:47	basically we view this as a two-stage problem
0:03:50	first we want to obtain a dialog policy
0:03:53	to select a proper dialog action basically an information query or making a decision to
0:03:59	do image retrieval
0:04:01	and then we can have a lower-level policy to select primitive
0:04:06	actions
0:04:07	like which question to ask
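
The two-stage view described here can be illustrated with a small sketch. It is only an assumption-laden illustration, not the speakers' implementation: the function name `choose_action`, the act labels "ask" and "guess", and the greedy selection at both levels are all hypothetical.

```python
def choose_action(high_level_q, question_q, image_scores):
    """Two-stage decision: pick a dialog act, then a primitive action.

    high_level_q: dict with the hypothetical keys "ask" and "guess".
    question_q:   list of scores, one per candidate question.
    image_scores: list of scores, one per candidate image.
    """
    # Stage 1: the high-level policy picks the dialog act greedily.
    act = max(high_level_q, key=high_level_q.get)
    # Stage 2: a lower-level choice picks the primitive action for that act.
    scores = question_q if act == "ask" else image_scores
    index = max(range(len(scores)), key=scores.__getitem__)
    return act, index


# Example: the high-level scores favour asking, so a question index is returned.
print(choose_action({"ask": 0.8, "guess": 0.1}, [0.2, 0.9, 0.4], [0.5, 0.3]))
```

In training one would replace the greedy choices with an exploration strategy; the sketch only shows how the two levels hand off to each other.
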
0:04:09	all
0:04:12	hierarchical reinforcement learning has been applied in multi-domain dialogue systems but here with
0:04:18	our multimodal context and action space
0:04:21	and
0:04:21	our
0:04:23	architecture also resembles feudal reinforcement learning which has some nice properties
0:04:28	that is state abstraction state sharing and sequential execution
0:04:35	and here is the overview of
0:04:38	the information flow in our proposed framework
0:04:44	we have the simulator module which
0:04:47	handles the game state transitions and also provides the
0:04:53	reward signals and
0:04:54	the generated answers
0:04:56	to feed into the vision-dialog semantic embedding module to
0:05:01	update the vision belief state with the new question-answer pairs and it also communicates with
0:05:07	the
0:05:08	dialog state tracking module with the attention signal and dialog state tracking kind of
0:05:14	formalizes
0:05:15	the multimodal dialog state representations and the high-level policy
0:05:21	learning module uses a
0:05:23	DQN
0:05:24	to
0:05:26	learn the policy in terms of asking a question or making a guess
0:05:31	and we have a specialized question selection module to
0:05:35	learn the decision of which question to ask
0:05:42	the first component is our visual dialog
0:05:45	semantic embedding module
0:05:47	the goal of this module is to
0:05:50	try to learn an encoder
0:05:54	that maps
0:05:55	the vision and the text information into a joint space
0:06:00	so the intuition is
0:06:02	we want the agent to be able to kind of have
0:06:06	the intelligence to understand the vision
0:06:10	and the semantic relation between the image
0:06:15	and the dialog context
0:06:18	so
0:06:20	basically we also need to pretrain this module
0:06:23	in order to have
0:06:27	robust and efficient reinforcement learning training
0:06:32	and the output can also be used for image retrieval
0:06:38	and to verify the performance of this module we perform a sanity check
0:06:43	and we achieve high image retrieval accuracy
0:06:49	in this image guessing game setting
0:06:50	which means it can provide reliable signals for our reinforcement learning training
0:06:59	in the visual dialog state tracking module we need to track three types of
0:07:03	state information
0:07:05	the vision belief state kind of represents the agent's internal decision-making model
0:07:11	which is the output of the vision-dialog semantic embedding module and
0:07:17	the vision context state kind of captures the
0:07:21	visual features of the environment and here we apply a new
0:07:27	technique called state adaptation
0:07:31	basically the intuition is that we want to adapt the vision context
0:07:35	to be more focused and adapted to
0:07:39	the vision belief state or the decision-making model of the agent
0:07:43	also
0:07:44	based on some feedback the attention signals
0:07:48	so the attention signal here is calculated by the semantic similarity score between
0:07:54	the vision belief state and the image vectors and
0:07:58	then we take the weighted average
0:08:00	and in case of a wrong guess we set the attention signal to
0:08:05	zero
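
As a rough illustration of the state adaptation step just described, the sketch below scores each candidate image against the vision belief state, zeroes the attention for images that were guessed wrongly, and takes the attention-weighted average of the image vectors as the adapted vision context. Cosine similarity, clipping negative scores to zero, and normalizing the weights to sum to one are assumptions made for this sketch; the talk only says "semantic similarity score" and "weighted average". All names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def adapt_context(belief, images, wrong_guesses=()):
    """Return (attention, context) for a belief vector and candidate image vectors."""
    # Attention signal: similarity between the belief state and each image,
    # clipped at zero (an assumption of this sketch).
    attention = [max(0.0, cosine(belief, img)) for img in images]
    # Images that were already guessed wrongly get zero attention.
    for i in wrong_guesses:
        attention[i] = 0.0
    total = sum(attention)
    dim = len(images[0])
    if total == 0.0:
        return attention, [0.0] * dim
    # Adapted vision context: attention-weighted average of the image vectors.
    weights = [a / total for a in attention]
    context = [sum(w * img[d] for w, img in zip(weights, images)) for d in range(dim)]
    return attention, context


# Example: image 0 was guessed wrongly, so it no longer contributes to the context.
attention, context = adapt_context([1.0, 0.0], [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]], wrong_guesses={0})
print(attention, context)
```
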
0:08:06	and we also keep track of the dialog meta information the number of questions asked
0:08:10	the number of image guesses and the last action
0:08:15	so given the dialog state we have
0:08:19	this
0:08:21	kind of policy learning module and basically since we have two separate discrete
0:08:26	action spaces we applied the
0:08:29	double DQN method
0:08:31	so we have applied prioritized replay and pose tracking to
0:08:36	improve the sample efficiency
0:08:39	and
0:08:40	another important task for reinforcement learning is the reward design
0:08:45	and
0:08:45	so
0:08:46	the reward for this model training can be decomposed into
0:08:52	the task reward
0:08:54	the question reward and the image retrieval reward and we add
0:09:00	reward shaping specifically to the question selection which
0:09:07	kind of captures
0:09:09	the information gain
0:09:12	of
0:09:13	the question asked
0:09:15	and here we calculate it as the
0:09:19	difference in the similarity score between the vision belief state and the target image vector
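
The shaped question reward can be written down in a few lines. This is a minimal sketch under assumptions: cosine similarity stands in for the similarity score, and the names `question_reward`, `belief_before` and `belief_after` are hypothetical. The reward is positive when asking the question moves the belief state closer to the target image and negative when it moves it away.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def question_reward(belief_before, belief_after, target):
    """Change in belief-to-target similarity caused by one question-answer turn."""
    return cosine(belief_after, target) - cosine(belief_before, target)


# Example: the belief vector moves from orthogonal to aligned with the target.
print(question_reward([0.0, 1.0], [1.0, 0.0], [1.0, 0.0]))
```
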
0:09:29	so
0:09:31	the question selection module is to select the
0:09:34	most informative question to ask
0:09:39	based on the shared vision context state
0:09:43	so we use the deep reinforcement relevance network
0:09:48	that's able to handle a large discrete text-based
0:09:52	action space
0:09:54	and
0:09:54	the Q-value can be approximated by
0:09:58	the interaction between the embedding
0:10:00	vectors of the vision context
0:10:03	and
0:10:04	the questions
0:10:06	the reward here is the intermediate question reward we discussed
0:10:10	and then we use a softmax selection strategy
0:10:14	as the exploration policy
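
A small sketch of this kind of question selection follows. It assumes the interaction between the context embedding and each question embedding is a plain inner product, which is one common choice for a deep reinforcement relevance network but is not confirmed by the talk, and it samples a question from a softmax over the resulting Q-values. The function names and the `temperature` parameter are hypothetical.

```python
import math
import random


def q_values(state_vec, question_vecs):
    """Q-value per question as the inner product of state and question embeddings."""
    return [sum(s * q for s, q in zip(state_vec, qv)) for qv in question_vecs]


def softmax(values, temperature=1.0):
    """Numerically stable softmax; lower temperature means greedier selection."""
    m = max(values)
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def select_question(state_vec, question_vecs, temperature=1.0, rng=random):
    """Sample a question index from the softmax distribution over Q-values."""
    probs = softmax(q_values(state_vec, question_vecs), temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Example with two candidate questions; the second one has the higher Q-value.
state = [1.0, 0.0]
questions = [[0.5, 2.0], [2.0, 0.0]]
print(q_values(state, questions), softmax(q_values(state, questions)))
```

Sampling instead of taking the arg max keeps some exploration over a large question pool while still favouring high-value questions.
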
0:10:18	to train the reinforcement learning we need to have a simulator and so we
0:10:23	propose a corpus-based simulator where one set of games consists
0:10:27	of similar images
0:10:30	and each image
0:10:32	corresponds to ten rounds of question-answer pairs
0:10:36	also this module provides the reward signal and answers related to the target image
0:10:42	and also tracks the internal game state
0:10:46	there are several termination conditions first the agent guesses the correct answer
0:10:51	or the max number of guesses is reached
0:10:54	or the max number of dialog turns is reached depending on the experiment setting
0:11:00	we define the winning and loss rewards as
0:11:03	plus and negative ten and
0:11:06	the wrong guess penalty is negative three
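
The termination conditions and reward values can be sketched as a simulator step for a guess action. The plus and minus ten values come from the talk. The wrong guess penalty of minus three is an assumption, since the recording is unclear at that point. Giving only the loss reward on the final wrong guess is also a choice made for this sketch. The names are hypothetical.

```python
WIN_REWARD = 10           # stated in the talk
LOSS_REWARD = -10         # stated in the talk
WRONG_GUESS_PENALTY = -3  # assumed value; the recording is unclear here


def guess_step(guess, target, guesses_made, max_guesses):
    """Return (reward, done) after the agent guesses an image.

    guesses_made counts the guesses so far, including this one.
    """
    if guess == target:
        # Correct guess: the game ends with the winning reward.
        return WIN_REWARD, True
    if guesses_made >= max_guesses:
        # Guess budget exhausted: the game ends with the loss reward (assumption).
        return LOSS_REWARD, True
    # Wrong guess with budget left: small penalty and the dialog continues.
    return WRONG_GUESS_PENALTY, False


# Example: a wrong first guess out of three allowed.
print(guess_step(guess=2, target=3, guesses_made=1, max_guesses=3))
```
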
0:11:11	and to evaluate the contribution of each component within our framework we compare
0:11:17	five policy models
0:11:20	the first baseline is
0:11:22	a random policy which selects a question or makes a guess at any state
0:11:27	and then we add a DQN to optimize the top
0:11:31	level decision making and then the hierarchical policy with
0:11:34	the lower-level question selection process
0:11:38	and we also want to evaluate the state adaptation and reward shaping techniques to see
0:11:43	how
0:11:45	they affect the policy learning
0:11:50	and
0:11:52	because we want to evaluate the efficiency and the robustness of the dialogue policy
0:11:57	we construct three sets of experiments
0:11:59	step by step
0:12:02	for the first experiment the agent only selects
0:12:06	questions from a predefined set and it also obtains question-answer pairs generated by humans
0:12:12	for the target image
0:12:14	so this
0:12:16	setting allows us to verify the effectiveness of our
0:12:20	framework
0:12:21	and then we increase the task difficulty by enlarging the number of questions
0:12:27	so there are two hundred questions generated by humans
0:12:30	and the answers are generated using a pretrained visual question answering model
0:12:37	with respect to the target image
0:12:39	and
0:12:40	in the third experiment we scale up the testing process to have
0:12:45	question-answer pairs generated automatically using the pretrained question and answer bots
0:12:52	which kind of simulates a more noisy and real-world setting
0:12:56	also
0:12:58	we evaluate the policy model every one thousand iterations during the
0:13:04	training process we freeze the policy and we look at the evaluation metrics like win rate
0:13:09	and average number of dialog turns
0:13:15	here are the results of experiment one where we constrain
0:13:20	the maximum number of rounds of dialog to ten and
0:13:25	within ten predefined questions
0:13:29	the plots show the
0:13:32	learning curves of the win rate and the average reward
0:13:38	and we can see the optimal model is the last policy model and it has
0:13:43	a faster convergence rate and outperforms the other models with a hierarchical
0:13:50	policy question selection and state adaptation
0:13:54	and the question we want to answer is whether a hierarchical reinforcement
0:13:58	learning policy enables efficient decision making
0:14:02	so here we define the oracle baseline where the agent keeps asking questions
0:14:09	in order
0:14:10	and only makes the guess at the end of the dialog
0:14:14	which means oracle with a number means
0:14:19	the
0:14:19	agent asks that number of rounds of questions and then only
0:14:26	makes a guess at the end and so we found our optimal dialogue policy
0:14:33	achieves a significantly higher win rate than oracle seven
0:14:38	and has a comparable win rate with the oracle baseline at eight where there was
0:14:46	no
0:14:48	statistically significant difference
0:14:50	so
0:14:51	and also we note that oracle nine and ten have higher win rates
0:14:56	because they can gather more information over longer turns
0:15:01	so we can see that our hierarchical reinforcement policy enables efficient decision making
0:15:09	and
0:15:11	in experiment two we want to further evaluate
0:15:16	the robustness of our dialogue policies
0:15:19	so
0:15:22	we increase the number of questions and then we also use a pretrained
0:15:28	visual question answering model as a user simulator to generate answers and we
0:15:33	can see our model with reward shaping has the best performance in this more noisy setting
0:15:43	and so in
0:15:44	experiment three we further
0:15:47	increase the
0:15:49	task difficulty
0:15:50	and as we know
0:15:54	in a real setting the training and the test data can be very different
0:15:58	so here we use a simulator with a different testing dataset
0:16:06	and
0:16:07	we observe that the performance drops but our proposed reward shaping
0:16:13	is more robust to noise and we think there is a potential application of using
0:16:20	the restart it has a bicluster orchard a song datasets constructing
0:16:26	by to humans are just talk about their
0:16:32	the call quality may state assets hope that was basically goes
0:16:37	so it may not be very suitable for task-oriented
0:16:40	applications so here are the sampled dialogs with
0:16:47	a success example in experiment two and a failure example in experiment three
0:16:52	as we can see in the success example the dialogue policy is able to
0:16:58	successfully select relevant questions related to color
0:17:04	and birds
0:17:05	and although some wrong guesses happen and there are some wrong answers
0:17:13	the
0:17:14	agent can do a good job of self-correcting and then make the guess
0:17:18	in the end
0:17:20	and in the failure example
0:17:23	since the questions and answers are generated using a sequence-to-sequence model
0:17:28	the questions are more general
0:17:32	and not very specific
0:17:38	to summarize we propose
0:17:41	a task-oriented visual dialog setting that is
0:17:46	applicable and extensible for real applications and we also propose a hierarchical reinforcement learning framework
0:17:53	to effectively learn the multimodal state
0:17:56	representation and an efficient dialogue policy
0:18:00	and then we also propose a state adaptation technique to make the vision context
0:18:07	representation more relevant to the visual dialog state
0:18:11	and we evaluate the dialogue system metrics in different simulation scenarios
0:18:17	to validate the task completion efficiency and robustness
0:18:23	for future work we plan to extend and apply the framework in
0:18:28	real applications like online shopping scenarios and
0:18:34	we can also explore ways to incorporate domain knowledge like the ontology
0:18:39	and the database interactions into a multimodal dialogue system to enable large-scale
0:18:45	information retrieval tasks
0:18:48	thanks
0:19:05	which again
0:19:08	okay
0:19:19	how do you push the signals in different modules i mean
0:19:22	basically how do you model the rewards
0:19:24	the rewards i guess
0:19:27	so as i mentioned
0:19:30	the rewards are mostly for
0:19:35	the reinforcement learning part for
0:19:38	the high-level policy and the question selection module
0:19:42	so
0:19:44	the reward
0:19:47	consists of
0:19:48	three parts as i mentioned the task reward and the question reward and
0:19:53	also the image retrieval reward which is for making wrong guesses
0:19:57	and the reward for the question selection module only
0:20:04	applies the reward shaping technique
0:20:07	so we measure it by
0:20:11	basically the
0:20:13	similarity between these two embedding vectors
0:20:35	it's a real environment
0:20:42	system defines itself wrong from what they're having
0:20:49	okay
0:20:50	we also
0:20:52	because we have
0:20:54	the simulation we have a predefined target image
0:20:59	so this is controlled by the simulator module to kind of evaluate
0:21:04	at each game state whether
0:21:08	the guess is correct or not
0:21:10	and so we can get the signals
0:21:15	during the training process
0:21:30	sorry affected by the question selection like you have any idea is to five to
0:21:36	find it
0:21:38	the most important question defined in section
0:21:42	a here it's a fixed number here a paragraph
0:21:48	a situation i have here a
0:21:53	find you have generated
0:21:56	nee most if english ink
0:22:02	question
0:22:03	and two and add a question is mm i finish working on it
0:22:12	i think in high recognition i
0:22:18	i think that's a good question
0:22:22	so here it is basically a discriminative approach
0:22:28	to select questions
0:22:30	from the dataset because there's a question pool
0:22:36	so we can just select questions but
0:22:40	a more interesting question is how we can generate discriminative questions
0:22:46	in an online fashion
0:22:50	so i think that's something to explore in the future

Multimodal Hierarchical Reinforcement Learning Policy for Task-Oriented Visual Dialog

Special Session: Physically Situated Dialogue

Jiaping Zhang, Tiancheng Zhao, Zhou Yu