0:00:15 | hi and i'm going to introduce a new statistical dialogue framework for state tracking that is called task lineages |
---|
0:00:37 | to handle flexible interaction |
---|
0:00:40 | so this is my outline |
---|
0:00:43 | after giving a brief introduction i will get to some challenges we want to address in this work |
---|
0:00:50 | and the approaches we used to solve the problems |
---|
0:00:54 | and i'm gonna show some experimental results on benchmark test datasets and then i'll conclude my talk |
---|
0:01:00 | and if time permits i will go deeper into some technical details on task frame parsing |
---|
0:01:10 | so we have seen a lot of recent advances in statistical dialogue state tracking |
---|
0:01:18 | and one of the nice things is that many algorithms have been shown to be effective as observed in the results from the dialogue state tracking challenge tasks |
---|
0:01:26 | so we got really robust systems against noise and errors like asr errors and speaking style variations |
---|
0:01:35 | but they are usually limited to session based simple tasks on simple corpora of dialogues |
---|
0:01:44 | given the surge in the interest in conversational agents and the enormous use cases |
---|
0:01:52 | it seems necessary to extend the previous methods to handle multiple tasks with complex goals in long interactions |
---|
0:02:07 | so let me talk about the set of challenges we want to address here and our approaches to solve them |
---|
0:02:15 | the first challenge is complex goals |
---|
0:02:18 | what i mean by complex goals is any combination of positive and negative constraints |
---|
0:02:27 | so in the restaurant finding domain you can say italian or french but not thai that kind of thing |
---|
0:02:35 | and the approach we take is very straightforward |
---|
0:02:38 | we just do constraint level belief tracking rather than slot level tracking |
---|
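The constraint-level tracking described above can be sketched roughly as follows. This is a minimal illustration, not the talk's actual model: the `Constraint` class, the noisy-or style accumulation, and all confidence values are assumptions. The point is that "italian or french but not thai" becomes three tracked constraints rather than one slot value.

```python
# A sketch of constraint-level belief tracking: one belief score per
# (slot, value, polarity) constraint instead of one distribution per slot.
from dataclasses import dataclass

@dataclass(frozen=True)
class Constraint:
    slot: str
    value: str
    positive: bool  # True: "want this", False: "but not this"

beliefs: dict = {}

def observe(constraint: Constraint, confidence: float) -> None:
    """Accumulate SLU confidence for one constraint across turns."""
    prior = beliefs.get(constraint, 0.0)
    # Simple noisy-or style accumulation of evidence (an assumed rule).
    beliefs[constraint] = prior + (1.0 - prior) * confidence

# "italian or french but not thai"
observe(Constraint("food", "italian", True), 0.7)
observe(Constraint("food", "french", True), 0.6)
observe(Constraint("food", "thai", False), 0.8)
```

Because each constraint is tracked independently, positive and negative evidence for the same slot can coexist, which a single slot-level distribution cannot represent.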
0:02:45 | the second challenge is to handle complex input from distributed slus |
---|
0:02:51 | so the complex input includes not only complex goals but also multiple tasks at the same time |
---|
0:02:57 | so we introduce a new concept called task frame parsing to address this challenge |
---|
0:03:08 | so to scale a conversational agent platform we usually adopt a distributed architecture |
---|
0:03:16 | in this architecture we have numerous service providers with their own slu and service components and we also provide a range of common components like slus and tasks |
---|
0:03:33 | and then when a user input comes in it is dispatched to all these components and each component will return its own interpretation |
---|
0:03:42 | so it is up to the platform to resolve the possibly competing semantic interpretations and come up with a coherent semantic interpretation |
---|
0:03:56 | so for example when the user says connections from soho to midtown at one p m italian restaurant near times square and a friendly coffee shop |
---|
0:04:08 | then the transit domain will detect soho as from midtown as to one p m as time and so on |
---|
0:04:18 | and it goes similarly for the local domain as well |
---|
0:04:23 | so as you can see there are competing slots |
---|
0:04:28 | and what we want to get is a list of coherent task frame parses like these two examples |
---|
0:04:39 | so the first parse identifies the first three spans as the transit task frame and it also has two more local task frames |
---|
0:04:54 | and it is probably right so it gets a high score like zero point eight |
---|
0:04:59 | and the second one is less likely it only has two local task frames and it gets a rather low score like zero point two |
---|
0:05:11 | so we call this process task frame parsing |
---|
0:05:16 | so we use beam search using mcmc with simulated annealing |
---|
0:05:23 | of course there are many algorithms we can use to do this |
---|
0:05:27 | we chose to use this method because it allows us to integrate hard constraints with probabilistic reasoning very easily |
---|
0:05:38 | so one example of hard constraints would be mutual exclusivity |
---|
0:05:48 | that is two dialogue act items with the same span cannot be used at the same time that is the kind of constraint we mean |
---|
0:05:56 | and to do probabilistic reasoning we use a normalized global log linear model like this |
---|
0:06:04 | so we can get a confidence score at the end |
---|
0:06:07 | and there are numerous features you can use so please refer to our paper for more details |
---|
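The search just described can be sketched as a toy simulated-annealing MCMC over assignments of dialogue act items to task frames. Everything concrete here is an invented illustration: the items, spans, frame names, and the `weight` table standing in for log-linear feature weights. The hard constraint (items over the same span are mutually exclusive) filters proposals, exactly as described in the talk.

```python
# Toy task frame parsing via simulated annealing over item-to-frame
# assignments, with a hard mutual-exclusivity constraint on spans.
import math
import random

# (item -> text span): "from:soho" and "place:soho" compete for one span.
items = {
    "from:soho": (0, 1), "place:soho": (0, 1),
    "to:midtown": (2, 3), "food:italian": (4, 5),
}
frames = ["transit", "local", "etc"]  # "etc" absorbs unused items

# Assumed compatibility scores (stand-ins for log-linear feature weights).
weight = {
    ("from:soho", "transit"): 2.0, ("place:soho", "local"): 0.5,
    ("to:midtown", "transit"): 2.0, ("food:italian", "local"): 1.5,
}

def score(assign):
    return sum(weight.get((i, f), 0.0) for i, f in assign.items())

def valid(assign):
    # Hard constraint: used items (frame != "etc") occupy distinct spans.
    used = [items[i] for i, f in assign.items() if f != "etc"]
    return len(used) == len(set(used))

def anneal(steps=3000, seed=0):
    rng = random.Random(seed)
    assign = {i: "etc" for i in items}       # a trivially valid start
    for step in range(steps):
        temp = max(0.01, 1.0 - step / steps)
        proposal = dict(assign)
        proposal[rng.choice(list(items))] = rng.choice(frames)  # one move
        if not valid(proposal):
            continue                          # hard constraints filter moves
        delta = score(proposal) - score(assign)
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            assign = proposal                 # MH-style accept step
    return assign

best = anneal()
```

The appeal of this setup is the one the speaker names: hard constraints are just a rejection test inside the sampler, while the soft preferences live entirely in the score.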
0:06:14 | and the third challenge is about flexible task management |
---|
0:06:18 | so to address this we also introduce a new concept called task lineage |
---|
0:06:27 | so suppose this situation |
---|
0:06:31 | a user starts this conversation with two tasks let's say weather information and restaurant finding |
---|
0:06:38 | and then she continues this conversation with the transportation and ticket booking tasks without completing the first ones |
---|
0:06:47 | and then she left and attended some meeting and came back and tried to resume the restaurant related tasks |
---|
0:06:57 | and then she now finishes the restaurant booking and moves to the transportation and ticket booking again and completes them |
---|
0:07:09 | so if you use a traditional stack based task management then you might have several problems |
---|
0:07:17 | first you might not be able to do multiple tasks at the same time |
---|
0:07:23 | and the other problem is information loss |
---|
0:07:27 | so when you come to turn three if the system regards the first restaurant finding as complete then the relevant information will be gone so you just restart the restaurant task at turn three again |
---|
0:07:46 | on the contrary if the system regards it as incomplete then the system can resume the restaurant related tasks with the relevant information from the past |
---|
0:07:56 | but when you get to turn four the system should pop off the transportation and ticket booking to resume the restaurant booking task |
---|
0:08:11 | so the relevant information for the tasks popped off at turn four will be gone |
---|
0:08:18 | so either way you might suffer from information loss |
---|
0:08:23 | to handle this problem we came up with task lineages |
---|
0:08:28 | and there is no restriction on the number of task states in a task lineage |
---|
0:08:34 | so you can have as many task states as you want |
---|
0:08:39 | and we also update the task lineage at each turn |
---|
0:08:44 | i mean whenever we have a new turn we just add new task states and retrieve relevant information from the task states in the past |
---|
0:08:55 | so for the transportation and ticket booking you can get some information from restaurant finding at turn two if these task states are related |
---|
0:09:10 | and at turn three you can resume the restaurant finding even after a long time |
---|
0:09:19 | and you can retrieve the relevant information from the task states at turn one without any problem |
---|
0:09:27 | and you can do similarly for turn four |
---|
0:09:32 | so basically we don't remove or abandon any information in the past |
---|
0:09:37 | and you can always retrieve relevant information from the past |
---|
0:09:41 | and the current task states give you the idea of the current focus |
---|
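The task lineage idea above can be sketched as an append-only record of task states: nothing is ever popped or discarded, and resuming a task just means fetching its most recent state. The class names and fields below are illustrative assumptions, not the paper's exact schema.

```python
# A sketch of a task lineage: an append-only list of task states, one or
# more added per turn, with no cap on its size and nothing ever removed.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TaskState:
    turn: int
    task: str                    # e.g. "restaurant_finding"
    beliefs: dict = field(default_factory=dict)

class TaskLineage:
    def __init__(self) -> None:
        self.states: list = []   # no restriction on the number of states

    def add(self, state: TaskState) -> None:
        self.states.append(state)

    def latest(self, task: str) -> Optional[TaskState]:
        """Resume a task by fetching its most recent state, however old."""
        for state in reversed(self.states):
            if state.task == task:
                return state
        return None

lineage = TaskLineage()
lineage.add(TaskState(1, "weather", {"city": "nyc"}))
lineage.add(TaskState(1, "restaurant_finding", {"food": "italian"}))
lineage.add(TaskState(2, "transportation", {"to": "midtown"}))
lineage.add(TaskState(2, "ticket_booking", {}))
# Turn 3: resume restaurant finding; the turn-1 information is still there.
resumed = lineage.latest("restaurant_finding")
```

Contrast this with a stack: here the transportation and ticket booking states stay in place while the restaurant state is retrieved, so no information is lost in either direction.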
0:09:47 | and how do we do the context matching |
---|
0:09:50 | we construct context sets |
---|
0:09:53 | it is very simple given the task lineage we set a time window |
---|
0:09:59 | and then we construct a belief set by collecting all the latest belief estimates within the time window |
---|
0:10:10 | and then we construct a machine act set and a user act set by collecting all the machine acts and task frame parses in the time window |
---|
0:10:24 | and then given the context sets and the current machine act and the current task frame parse you try to select which information you want to use to update the current belief |
---|
0:10:42 | so it is nothing but a bunch of binary classifications |
---|
0:10:47 | so we use logistic regression |
---|
0:10:50 | and there are a bunch of features for this task so you can refer to my paper for them |
---|
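The context fetch just described can be sketched as a bank of binary classifiers: for each candidate piece of past context, a logistic regression decides whether to use it in the current update. The two features and all weights below are invented for illustration; the paper uses a much richer feature set.

```python
# A sketch of the context fetch as per-item binary logistic regression.
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Assumed weights: prefer same-task context, penalize stale context.
W_SAME_TASK, W_AGE, BIAS = 3.0, -0.5, -1.0

def fetch_probability(same_task: bool, age_in_turns: int) -> float:
    """Probability that this past context item should be fetched."""
    z = BIAS + W_SAME_TASK * float(same_task) + W_AGE * age_in_turns
    return sigmoid(z)

def should_fetch(same_task: bool, age_in_turns: int) -> bool:
    return fetch_probability(same_task, age_in_turns) > 0.5

# Fresh same-task context is fetched; old or unrelated context is not.
```

Because each decision is independent, the tracker can fetch several pieces of context at once, one per relevant past task state.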
0:10:58 | and the fourth challenge is about task disambiguation |
---|
0:11:05 | there always could be some ambiguity in task detection |
---|
0:11:11 | to solve this problem we use an n best list of task lineages |
---|
0:11:19 | so let's say the user says i want thai then this could be interpreted as restaurant finding or travel |
---|
0:11:34 | so we have the two task lineages here |
---|
0:11:37 | and then when the user clarifies her real intention like saying i meant i want to travel to thailand |
---|
0:11:48 | then the second task lineage will get a higher score because it is more coherent |
---|
0:11:56 | in this way we mitigate the task ambiguity |
---|
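The n-best mechanism above can be sketched as keeping every competing lineage alive with a score, then rescoring when a clarification arrives. The multiplicative boost is a stand-in of my own choosing; the real system rescores lineages with its tracking model.

```python
# A toy n-best list of task lineage hypotheses, rescored on clarification.
hypotheses = [
    {"task": "restaurant_finding", "score": 0.6},  # "thai" as a cuisine
    {"task": "travel", "score": 0.4},              # "thailand" as destination
]

def rescore(hyps, clarified_task, boost=2.0):
    """Multiply in evidence from a clarification, then renormalize."""
    for h in hyps:
        if h["task"] == clarified_task:
            h["score"] *= boost
    total = sum(h["score"] for h in hyps)
    for h in hyps:
        h["score"] /= total
    return sorted(hyps, key=lambda h: h["score"], reverse=True)

# User clarifies: "i meant i want to travel to thailand"
ranked = rescore(hypotheses, "travel")
```

Because the losing interpretation is never discarded, a later clarification can flip the ranking without restarting the dialogue.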
0:12:03 | so the overall dialogue state tracking procedure consists of three steps |
---|
0:12:07 | first we do task frame parsing so given a set of possibly competing semantic frames from distributed slus we generate coherent task frame parses |
---|
0:12:24 | and then given the task frame parses we try to retrieve relevant information from the past task states in the lineage |
---|
0:12:33 | and this happens for each lineage |
---|
0:12:36 | and we use this retrieved information and the input information to update the task states at this turn |
---|
0:12:44 | then how do we do the task state update |
---|
0:12:48 | actually this is one of the most trivial tasks in this framework |
---|
0:12:54 | because we can pick up any method developed so far for dialogue state tracking because the task state update is nothing but dialogue state tracking in the conventional setting |
---|
0:13:08 | so you can enjoy a wide range of different algorithms like discriminative models and generative models or rule based ones tuned on the data |
---|
0:13:21 | and you can control the trade off between the use of prior belief estimates and the raw observations |
---|
0:13:31 | so to make the analysis simple we actually just adopted a generative rule based update method from prior work |
---|
0:13:43 | and we use this algorithm for belief tracking for each slot value pair |
---|
0:13:49 | and these rules just update the current belief by accumulating negative and positive confidence scores |
---|
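A rule-based update of the kind just mentioned can be sketched per slot-value pair as follows. The two-line update rule is a simplification I chose for illustration, not the exact rules of the referenced method: positive evidence raises the belief, negative evidence discounts it.

```python
# A sketch of generative rule-based belief tracking for one slot-value
# pair, accumulating positive and negative confidence scores per turn.

def update_belief(prior: float, pos_conf: float, neg_conf: float) -> float:
    """One turn of belief update for a single slot-value pair."""
    b = prior + (1.0 - prior) * pos_conf   # positive evidence raises belief
    b = b * (1.0 - neg_conf)               # negative evidence discounts it
    return b

b = 0.0
b = update_belief(b, pos_conf=0.6, neg_conf=0.0)  # "italian" heard at 0.6
b = update_belief(b, pos_conf=0.3, neg_conf=0.0)  # re-mention, lower conf
b = update_belief(b, pos_conf=0.0, neg_conf=0.9)  # "no, not italian"
```

This kind of update needs no training data at all, which is why it keeps the analysis simple: any gain over it can be attributed to the framework rather than to a strong tracker.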
0:14:02 | so let's move on to evaluation |
---|
0:14:04 | so we used dstc two to evaluate our algorithm and it is based on the restaurant finding domain |
---|
0:14:13 | and one interesting characteristic of this dataset is relatively frequent user goal changes |
---|
0:14:21 | so if our method works well on this test dataset then the context fetch should not fetch any information for the old goals |
---|
0:14:34 | so let's look at the results |
---|
0:14:37 | actually our method shows the best performance so far on accuracy |
---|
0:14:45 | and this actually tells you the importance of proper context handling in the dialogue state tracking problem |
---|
0:14:55 | and we got this performance without using any ensemble method like system combination or neural networks or decision trees |
---|
0:15:06 | it is just a rule based update for the task state update |
---|
0:15:13 | and we wanted to evaluate our system on more complex interactions |
---|
0:15:19 | but unfortunately there is no such dataset available out there so we had to simulate some datasets |
---|
0:15:28 | and we took the dstc three data as our base corpus |
---|
0:15:34 | because that data contains multiple tasks like restaurant finding coffee shop finding and pub finding |
---|
0:15:42 | so we simulated three datasets |
---|
0:15:48 | with different representative settings |
---|
0:15:52 | the first one does not have any complex user goals and no multiple tasks so we just use the dstc three data itself |
---|
0:16:01 | and for the second setting we have complex user goals simulated and no multiple tasks |
---|
0:16:14 | and for the third dataset we have both complex user goals and multiple tasks |
---|
0:16:19 | these numbers are the overall statistics for these corpora |
---|
0:16:25 | so let's look at the results |
---|
0:16:28 | if you look at the joint goal accuracy we actually compare our system with the baseline system in dstc |
---|
0:16:38 | and the joint goal accuracy of the baseline system drops very sharply from zero point five seven to zero point three one and zero point zero two |
---|
0:16:52 | while our system drops much more gently from zero point five nine to around zero point four and zero point three |
---|
0:17:04 | so keeping in mind the fact that the task gets exponentially harder with respect to the complexity |
---|
0:17:12 | this gentle reduction is a big win |
---|
0:17:16 | and we further evaluate our system with oracle results |
---|
0:17:21 | tl dst op uses oracle parses and tl dst o uses both oracle parses and oracle context fetches |
---|
0:17:35 | and you can see the improved results by using the oracle information |
---|
0:17:40 | so this indicates that there is some room for future improvement |
---|
0:17:46 | then let me conclude my talk |
---|
0:17:49 | we have proposed a new statistical dialogue state tracking framework called task lineages to orchestrate multiple tasks with complex goals across multiple domains in continuous interaction |
---|
0:18:06 | and as a proof of concept we demonstrated good performance on a common benchmark test dataset and purposely simulated dialogue corpora |
---|
0:18:17 | and some interesting future directions include the use of sophisticated machine learning models like gbdt or random forests or recurrent neural networks |
---|
0:18:29 | i'm pretty sure you can get performance much higher than the performance shown here by just using these techniques for the task state update |
---|
0:18:39 | and i'm also interested in extending this framework for weakly supervised learning to reduce the cost |
---|
0:18:51 | and i'm also interested in seeing some potential impact on other dialogue system components since we provide a more comprehensive state representation through task lineages |
---|
0:19:04 | okay i have about one minute left |
---|
0:19:11 | so basically task frame parsing works like this |
---|
0:19:15 | given this input utterance |
---|
0:19:22 | let's say there are two domains and they generate two different interpretations like the upper two and the bottom two |
---|
0:19:33 | and we identify all possible candidate task frames for each dialogue act item |
---|
0:19:39 | and we have a special task frame to accommodate all the leftover information |
---|
0:19:45 | and then the task is to get the right assignment from dialogue act items to the right task frames |
---|
0:19:58 | so the parsing algorithm starts with some configuration that is somehow valid |
---|
0:20:05 | then it moves assignments one at a time |
---|
0:20:10 | and according to the scores it eventually gets to a proper configuration with a high score |
---|
0:20:26 | i think that is all for my presentation thank you |
---|
0:20:35 | okay |
---|
0:20:46 | right sure |
---|
0:20:50 | right |
---|
0:20:53 | actually it is done through some feature functions |
---|
0:21:03 | let's say as you extend the task lineage we actually keep the timestamps |
---|
0:21:09 | and the feature functions to match the context use the timestamp as one of the features |
---|
0:21:18 | so as the context gets farther and farther from the current timestamp you will have less chance to fetch the information |
---|
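The timestamp feature described in this answer can be sketched as a simple recency function over the stored timestamps. The exponential form and the half-life value are my own illustrative assumptions; any monotonically decaying function of the age would serve the same role as a context-fetch feature.

```python
# A sketch of a timestamp-based recency feature for the context fetch:
# older context yields a smaller feature value, so it is fetched less.
import math

def recency_feature(now: float, then: float, half_life: float = 300.0) -> float:
    """Decays from 1.0 toward 0.0 as the context ages (times in seconds)."""
    age = max(0.0, now - then)
    return math.exp(-math.log(2.0) * age / half_life)
```

Plugged into the logistic-regression fetch with a positive weight, this feature makes distant context progressively less likely to be retrieved, which is exactly the behavior the speaker describes.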
0:21:28 | and |
---|
0:21:29 | so it's |
---|
0:21:32 | okay |
---|
0:21:35 | okay so actually it involves another notion i guess which is long term memory |
---|
0:21:42 | this talk is more about short term interaction management |
---|
0:21:45 | and then another module should be about long term memory management so you are gonna have a more perpetual memory there |
---|
0:21:54 | and it would also be used for some features to disambiguate or to boost some evidence that kind of thing |
---|
0:22:04 | so you need a memory structure other than just the short term dynamic structure |
---|
0:22:16 | random of course |
---|
0:22:49 | i missed the initial part of the question so can you repeat it |
---|
0:22:54 | when you have multiple interpretations where do you start |
---|
0:22:57 | it is probably more about the dialogue policy |
---|
0:23:04 | so given a lot of ambiguity or high entropy in your state representation |
---|
0:23:10 | you can actually train some smart policy that decides whether it is better to ask a confirmation at this time or to assume something or to try to retrieve some user habits from long term memory |
---|
0:23:27 | so all these are determined by your policy so it is kind of another module that takes care of such kind of things |
---|
0:23:40 | small question |
---|
0:24:01 | i'm repeating the question i think the question is whether i did a classification for each constraint for complex goals is that right |
---|
0:24:38 | okay so what he asks is whether we can use a task classification to predict the user's intention with a different classifier is that right |
---|
0:24:59 | to predict the task okay so i think that is actually orthogonal to this framework and i didn't actually use that kind of classifier |
---|
0:25:06 | but actually that is a necessary part i guess for scalability |
---|
0:25:14 | because if we consider all possible interpretations for all possible slus and other components then the complexity will explode |
---|
0:25:27 | so we can do some filtering as a preprocessing step |
---|
0:25:32 | and then do this parsing to construct parses that can contain multiple tasks in one utterance |
---|
0:25:39 | but with a single classification it is a little bit difficult to handle multiple tasks |
---|
0:25:48 | so does that answer your question |
---|
0:25:53 | okay so terrible |
---|
0:26:13 | okay let me repeat the question whether i have to use the context to interpret the user's utterances |
---|
0:26:22 | actually i'm using just the corpus so i didn't have to use the context to understand the user utterance |
---|
0:26:31 | but there have been a lot of research efforts that try to use the context to interpret the intention so there is no reason not to use it |
---|
0:26:43 | right okay thanks again |
---|