0:00:17 [Chair] So the first presenter is here, so please start your presentation.
0:00:22 Good afternoon everyone. My name is Andrea Vanzo, I'm a research associate at the Interaction Lab of Heriot-Watt University, and I'm going to present work done with Emanuele Bastianelli and Oliver Lemon, about a hierarchical multi-task natural language understanding system for cross-domain conversational AI that we call HERMIT NLU.
0:00:45 So, natural language understanding is quite a wide concept. Most of the time, when it comes to conversational AI and dialogue systems, it refers to the process of extracting the meaning from natural language and providing it to the dialogue system in a structured way, so that the dialogue system can act on it.
0:01:04 And we didn't end up studying this problem just for the sake of it: we did it in the context of the MuMMER project, an EU H2020 project about the deployment of a robot with multimodal interaction capabilities. It was supposed to be deployed in a shopping mall in Finland and to interact with the users, giving them directions, entertaining them, and doing a little bit of chit-chat. I'm going to show a video of it that may explain better what the robot was supposed to do. Hopefully you can hear the audio, although there are subtitles.
0:01:45 [video playing] So the robot gives both vocal and gestural indications, pointing toward the destination, with or without guiding the user there, depending on the user's preference.
0:02:35 So we saw a lot of generation there, but everything started with a request from the user, and that is the NLU part, which is what we are focusing on today: basically, designing an NLU component that is robust enough to work inside this very complex multimodal dialogue system.
0:02:52 Again, most often in conversational AI, natural language understanding is a synonym of shallow semantic parsing (this actually connects with this morning's keynote), which is the process of extracting some frame-and-argument structure that captures the meaning of a sentence. It doesn't really matter whether we call them intents and slots; most of the time these types are defined according to the application domain. Alternatively, there are systems that rely on theories like frame semantics, which stand at a higher level of abstraction, and that is the one we are using in our context.
0:03:26 But this raises some problems, especially in our case, where we wanted an interface that was able to work across several different domains. Most of the time, when a dialogue system has a natural language understanding component, it deals with a single domain, or at best a few domains at the same time. This is also because the available resources are always about booking restaurants or booking flights, while we wanted our interface to be usable in several different settings: a domestic environment, a shopping mall, or, for example, a bar where you have to command a robot to serve drinks. So one of the first requirements was for the system to be cross-domain, and even if there may be no easy recipe for that, we tried to address the problem anyway.
0:04:16 The second big problem is that most of the time the datasets designed for dialogue systems contain only a single intent, or frame, per sentence, while in our case there are many sentences given to the robot which contain two or more frames or intents. And it can be very important to detect both of them, because if we ignore the temporal relation between these two different frames, we may fail to satisfy the user, for whom, say, both the coffee-buying action and the meeting matter at the same time.
0:04:50 So that is another problem: when you rely on these flat intent structures, most of the time two different kinds of interaction might end up being the exact same intent or frame, like in this case, while in the dialogue they actually belong to two different kinds of interaction. So what we actually wanted to do is not only tag the frames and the slots, but also add a layer of dialogue acts, which tells the dialogue system the context in which these things have been said. For example, in the first case we are informing the robot of where Starbucks is (imagine we want to teach the robot the layout of the shopping mall), while in the second one there is a customer who is asking for information about the location of Starbucks.
0:05:33 So, to sum up quickly: we wanted to deal with different domains at the same time; we wanted to tag more than one single intent, with its arguments, per sentence; and since we were also tagging the dialogue acts, so that we have a multi-task architecture, we had to deal with multiple dialogue acts as well. You might ask why it is actually so important to understand both dialogue acts in this case, if in the end the final intent is only to give information about the location of Starbucks. But we might also want to understand why the user is asking for Starbucks: maybe they need a coffee, or maybe they are meeting a friend there. If they need a coffee and there is no Starbucks, we could have pointed them somewhere else. So having this information is really important.
0:06:16 And of course we wanted to benchmark our NLU system against existing off-the-shelf tools; this comparison was enabled by the people who were actually providing us with the utterances and the evaluation protocol, as we will see later.
0:06:34 Now, very quickly (it is nothing complicated), we tackled this problem by addressing three different tasks at the same time: the tasks of identifying the dialogue acts, the frames, and the arguments. Each task was solved with a sequence labelling approach, in which we give a label to each token of the sentence, something very common in NLP. Each label is composed of the class of the structure we want to tag for the given task, enriched with a tag that can be B, I, or O, depending on whether the token is the beginning of the span of a structure, inside it, or outside any of them. And here we have a very easy example.
0:07:21 Now, the problem is that this is a linear solution for a problem which is not linear, because language is recursive, so we might end up having some structures which are actually nested inside other structures. This basically never happens for dialogue acts, but for frames and arguments it happens quite often, especially in the data we collected. So the solution we adopted was to basically collapse the nested structures into a single linear annotation, and then try to recover whether one of these structures was actually inside a previously tagged one by using some heuristics over the syntactic relations among the words. For example, if "find" is a syntactic child of "need", we can use this syntactic relation to say that the Locating frame tagged on "find" is actually embedded inside the Requirement argument of the Needing frame.
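As a rough illustration of the kind of syntactic heuristic described (the actual rules are in the paper; the sentence, the toy dependency heads, and the frame names below are simplified assumptions), one can check whether the trigger of one frame sits below the other frame's trigger in the dependency tree and falls inside its argument span:

```python
# Toy example in the spirit of the heuristic: "i need to find starbucks".
tokens = ["i", "need", "to", "find", "starbucks"]
heads  = [1, -1, 3, 1, 3]   # invented dependency heads (-1 = root)

def is_descendant(i, j, heads):
    """True if token i lies below token j in the dependency tree."""
    while i != -1:
        i = heads[i]
        if i == j:
            return True
    return False

# Flat (collapsed) annotation: two frames tagged side by side.
needing_trigger, requirement_span = 1, range(2, 5)  # "need", "to find starbucks"
locating_trigger = 3                                # "find"

# Heuristic: "find" is a descendant of "need" and falls inside the
# Requirement span, so Locating is embedded in Needing.Requirement.
embedded = (locating_trigger in requirement_span
            and is_descendant(locating_trigger, needing_trigger, heads))
print(embedded)  # True
```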
0:08:18 Now, the three tasks have been solved in a multi-task fashion: we basically created a single network that deals with the three tasks at the same time. It is basically a hierarchy of sequence taggers, one per task, with the architecture that I'm going to show in the next slide. It is nothing particularly complicated, but there are two main reasons why we adopted this architecture. First of all, we wanted more or less to replicate a hierarchy of task difficulty, in the sense that we were assuming (not claiming) that tagging dialogue acts is easier than tagging frames, and that tagging frames is easier than tagging arguments. And there is also a kind of structural relationship between these three, because many frames tend to appear more often in the context of certain dialogue acts, and arguments are almost always dependent on frames, especially when there is a strong theory behind them, like frame semantics. These are the reasons why the network is designed like this.
0:09:17 I'm going to illustrate the network quite quickly, because this is a little bit of technical stuff. The input of the network is a pre-trained word embedding that we were not re-training. The first layer encodes it with a bidirectional LSTM followed by a self-attention step, which is supposed to capture relationships that the BiLSTM encoder was missing, because self-attention is better able to capture relationships among words which are quite distant in the sentence. Then we feed a CRF layer, which tags the sequence of BIO tags for the dialogue acts, right on top of the self-attention layer.
0:10:00 For the frames it is basically the same thing, but we use a shortcut connection before it, because we wanted to provide the encoder with the fresh information from the first layer, that is, the lexical information, but also with the information that was encoded while tagging the dialogue acts, so that the frame layer is indirectly conditioned on what the dialogue act was. So we concatenate the two kinds of information and serve them to the next layer, again with a CRF for tagging, as before. And finally for the arguments it is again the same thing: another step of encoding with self-attention, and a CRF layer. This design came out of the experiments we have done, with some ablation studies that are in the paper; I won't bother you with them here, but this is the final network we managed to tune in the end.
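The data flow of the hierarchy can be sketched as follows. This is only an illustration of the shortcut connections: each `encoder` below is a stand-in random projection, not the actual BiLSTM + self-attention + CRF blocks, and all the sizes are made up.

```python
import random

random.seed(0)

def encoder(seq, out_dim):
    """Stand-in for a BiLSTM + self-attention block: a random
    projection per call, just to illustrate the data flow."""
    in_dim = len(seq[0])
    w = [[random.gauss(0, 1) for _ in range(out_dim)] for _ in range(in_dim)]
    return [[sum(x * w[i][j] for i, x in enumerate(tok)) for j in range(out_dim)]
            for tok in seq]

def concat(a, b):
    """Shortcut connection: concatenate the two representations per token."""
    return [x + y for x, y in zip(a, b)]

seq_len, emb_dim, hid = 6, 50, 32   # toy sizes, not the paper's
tokens = [[random.gauss(0, 1) for _ in range(emb_dim)] for _ in range(seq_len)]

# Layer 1: the dialogue-act tagger reads the raw pre-trained embeddings.
da_hidden = encoder(tokens, hid)            # feeds a CRF over DA labels

# Layer 2: the frame tagger reads a shortcut concatenation of the
# original embeddings and the dialogue-act hidden states.
fr_input = concat(tokens, da_hidden)
fr_hidden = encoder(fr_input, hid)          # feeds a CRF over frame labels

# Layer 3: the argument tagger again concatenates everything below it.
arg_input = concat(fr_input, fr_hidden)
arg_hidden = encoder(arg_input, hid)        # feeds a CRF over argument labels

print(len(fr_input[0]), len(arg_input[0]))  # the widths grow: 82, then 114
```

The point of the sketch is that each layer sees both the lexical input and everything the layers below it computed, which is how the frame and argument taggers get conditioned on the dialogue act.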
0:10:51 So, as I was saying at the beginning, we wanted to benchmark this NLU component. Now, benchmarking an NLU for dialogue systems is quite a big issue, in the sense that, as I was saying before, most of the available datasets are single-domain, and there is very little beyond that. I mean, by now some datasets of this kind have started popping up, but at the beginning of this year we were still poor on that side. Luckily, there was this resource, which is called the NLU Benchmark: it is basically a cross-domain corpus of human interactions with a home assistant. It is not a collection of dialogues, but of single-utterance interactions with the system, and it covers a lot of domains, as we will see later. But it is mostly home- and IoT-oriented: there are some commands that could be used for a robot, but it is, again, mostly IoT.
0:11:53 Then there is the second resource, which we started collecting ourselves (and it is taking a lot of time), called ROMULUS; it is called like that because it stands for RObotics-oriented MUltitask Language Understanding corpuS. It is, again, a collection of single interactions with a robot, but it covers different domains and, more importantly, different kinds of interaction: there is shopping, there are state commands for the robot, and there is also a lot of information you can give to the robot about the composition of the environment, the names of objects, this kind of stuff. There is quite a big overlap between the two corpora in terms of kinds of interaction, but they span different domains.
0:12:35 The first corpus, the NLU Benchmark, provides us with three different semantic layers, which are called scenario, action, and entity. I know this sounds completely different from what we said before, but we had to find some mapping with the structures we wanted to tag over the sentences. The full set is quite big: almost twenty-six thousand sentences. There are eighteen different scenario types, where each scenario is basically a domain, and then fifty-four different action types and fifty-six different entity types. There is also the notion of intent, which is basically the combination of scenario plus action, and this is important for the evaluation, as we will see later. As you can see, the good part of this dataset is that it is properly cross-domain, and it is multi-task, since we have three different semantic layers; the limitation is that we always have one single scenario and action, so one single intent, per sentence.
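For example (the label names here are invented in the style of the benchmark's scenario/action split, not quoted from it), the intent is just the two labels combined, so an intent prediction only counts as right when both parts are:

```python
def intent(scenario: str, action: str) -> str:
    """The benchmark's 'intent' is the scenario and action combined."""
    return f"{scenario}_{action}"

# Hypothetical gold/predicted pairs for two utterances.
gold = [("alarm", "set"), ("iot", "hue_lightoff")]
pred = [("alarm", "set"), ("iot", "hue_lighton")]

accuracy = sum(intent(*g) == intent(*p) for g, p in zip(gold, pred)) / len(gold)
print(accuracy)  # 0.5: the second action is wrong, so that intent is wrong
```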
0:13:35 So what we could benchmark on this corpus was mostly the first two of our initial desiderata. We did the evaluation according to the paper that presented the benchmark: it was done with ten-fold cross-validation on about half of the sentences, around eleven thousand of them, selected so as to balance the number of classes and entities, and that is what the results are computed on.
0:14:02 So, as I was saying, we had to define a mapping between their tagging scheme and what we wanted to tag, which is a very general approach for extracting the semantics from sentences in the context of a dialogue system. We also saw that the kinds of relationship holding between their semantic layers were more or less the same ones holding for our approach. And so, these are some results. They are the ones reported in the paper, but they are quite dated, in the sense that they were evaluated in 2018, run on the open-source and off-the-shelf versions of these NLU components for dialogue systems. There is a caveat for Watson: it requires specific training for the entities, and this was not possible because there is a constraint on the number of entity types and entity examples you can pass. We did try to talk with the Watson people, but we didn't manage to get a licence to run even one training with the full set of entities, so you have to take that into account, unfortunately.
0:14:41 The intent, as I was saying, is the combination of the scenario and the action. These performances were obtained with ten-fold cross-validation; I didn't report the standard deviations, because they were almost all stable, but if you want to look at them, they are in the paper. The other important thing is that we did not take into account whether the span of a tagged structure matched exactly: following the authors of the benchmark paper, we counted a true positive whenever there was an overlap of the spans. So what we are evaluating here is a kind of loose metric.
0:15:52 We can see that for the entities, and for the combined setting, our system was performing better than the others on average, while for the intents we were actually not performing as well as Watson, but better than the other two systems. The other important bit is that the combined measure is actually the sum of the two confusion matrices of intents and entities, so it doesn't actually tell us anything about how well the full pipeline is working.
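The difference between the loose span-overlap criterion used here and a strict exact-match criterion can be sketched like this (the spans and labels are invented for illustration):

```python
def match(gold, pred, exact=True):
    """gold/pred: (label, start, end) spans, end exclusive."""
    if gold[0] != pred[0]:
        return False
    if exact:
        return (gold[1], gold[2]) == (pred[1], pred[2])
    return gold[1] < pred[2] and pred[1] < gold[2]  # any overlap counts

def f1(gold_spans, pred_spans, exact=True):
    """Span-level F1 under either matching criterion."""
    tp_p = sum(any(match(g, p, exact) for g in gold_spans) for p in pred_spans)
    tp_g = sum(any(match(g, p, exact) for p in pred_spans) for g in gold_spans)
    prec = tp_p / len(pred_spans) if pred_spans else 0.0
    rec = tp_g / len(gold_spans) if gold_spans else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = [("Entity:place", 4, 6)]
pred = [("Entity:place", 5, 6)]     # right label, partial span
print(f1(gold, pred, exact=True))   # 0.0 under the strict metric
print(f1(gold, pred, exact=False))  # 1.0 under the overlap metric
```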
0:16:20 But that is something we have done on our own corpus, which is much smaller and is not yet available, because we are still gathering data; probably at the end of this year we are going to release it. I know the figures are very small, but for people doing NLU for dialogue in the context of robotics, this can be an interesting resource.
0:16:42 So here we have eleven dialogue act types and fifty-eight frame types, which, compared to the number of examples, is quite high, and eighty-four frame element types, which are the arguments. And as you can see, not always, but in many cases we have more than one frame per sentence, and more than one dialogue act per sentence, and the frame elements are quite numerous. The annotation fits into the semantic spaces mentioned before: we have the dialogue acts, exactly as we saw during the rest of the presentation, and we also provide the semantics in terms of frame semantics, with frames and frame elements; theoretically these belong to the same semantic layer, but operationally they are two different layers. And as you can see, we have a lot of embedded structures, a frame inside another frame, this kind of stuff.
0:17:36 This is the mapping we had to do, again, between the different semantic layers, and it is basically the identity: dialogue acts to dialogue acts, frames to frames, and frame elements to arguments. And of course, these are the two aspects that we could tackle using this corpus. It is not as cross-domain as the other one, but it is enough that we have different kinds of interaction, and we also have sentences coming from two different scenarios: the house scenario and the shopping mall scenario, the latter coming from the interactions with the MuMMER robot. So we don't want to say it is completely closed-domain, but the other corpus covers many more domains than this one. It is, however, truly multi-task, and it is really multi-dialogue-act and multi-frame per sentence.
0:18:27 The results might look quite weird, so I'm going to explain why they look like this. The first row I report here is the same exact measure that I was reporting for the NLU Benchmark, so we take into account only whether the spans of two structures overlap. The results are quite high, and the main reason is that the corpus has not been delexicalised, so many sentences are quite similar and the system performs very well on them; but you don't have to be caught off guard by that.
0:19:00 The second row is basically computed using the CoNLL-2000 shared task evaluation, which is a standard, and we report it for general comparison with other systems. But the most important one is the last row, the exact match. The exact match tells us how well the system, the whole pipeline, was working: we were taking into account the exact span of the tagged structures, and, in addition, a frame was counted as correctly tagged only if the dialogue act was also correctly tagged. So it is actually the end-to-end system, the pipeline, and that is the measure we have to chase.
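The exact-match criterion just described can be sketched as a joint check across layers (the structure and names below are illustrative assumptions): a sentence only counts as correct when every layer, dialogue act included, is right.

```python
def pipeline_correct(gold, pred):
    """gold/pred: one sentence's full analysis across the layers."""
    return (pred["dialogue_act"] == gold["dialogue_act"]
            and pred["frames"] == gold["frames"])

gold_sent = {"dialogue_act": "Inform", "frames": {("Locating", 3, 4)}}
pred_sent = {"dialogue_act": "Question", "frames": {("Locating", 3, 4)}}

print(pipeline_correct(gold_sent, pred_sent))  # False: the frames match,
# but the dialogue act is wrong, so the end-to-end analysis fails
```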
0:19:43 Now, to conclude, with some future work. The system that I presented, this cross-domain, multi-task NLU system for natural language understanding for conversational AI, is actually running in the shopping mall in Finland. The video I showed you was from the deployment we have done, and it is going to be deployed for three months in a row, with some pauses during the weekends to do some maintenance and reboot the system. We managed to collect a lot of data; we will try to integrate it into the corpus and release it at the end of this year, or, if we only manage to tag it properly later, at the latest at the beginning of next year.
0:20:25 We also want to deal with nested structures in a different way, which means not relying on these heuristics over the syntactic structure, but actually tagging multiple sequences simultaneously, so that embedded spans can be predicted one inside the other. In fact, we already have this system; we finalised it a few months ago, so we didn't have time to include it here, but it exists, and there is a branch in the repository that I can show you which contains this new system.
0:20:55 The bulk of our future work is this one: building a general framework for frame-like structures, so that it doesn't matter which theory or application they come from; that is the reasoning behind it. We are trying to create a network that can deal with all the possible kinds of frame-like structure parsing. This is our long-term goal, something very big, but we are actually pushing for it.
0:21:14 And the last bit is mostly about dealing with the tagging of segmented utterances. We noticed that in our corpus there were many small bits of sentences, produced because the users were pausing while dictating, so the first part of a sentence, like "I would like to", would arrive on its own; the ASR was segmenting the input and sending the pieces to the parser, and the parser would work correctly, but with some bits missing. Now, when the user then said, for example, "to find the Starbucks", we received this "find the Starbucks" and could contextualise it as a Locating frame, but we didn't know that it was also a frame element of the previous structure. So we are studying a way to make the system aware of what has been parsed before, so that it can give more informative output in the context of the same utterance, even if it is broken up.
0:22:06 And this is everything. Okay, thanks very much.
0:22:13 [Chair] Okay, so now it's time for questions.
0:22:30 [Q] Hi, and thanks for the great talk; it is always good to see Rasa being benchmarked. I'm just curious: did you use the default, out-of-the-box parameters, or did you do any tuning?
0:22:40 [A] So, we just took the results from the paper of the benchmark, and they only say that they did something like a little bit of domain-specific training, for the entities, something like that. And for Rasa, they used the version based on the CRF, not the neural, TensorFlow one.
0:23:01 [Q] Okay, so that's actually a very basic version, I suppose.
0:23:12 [Q] So, you showed the architecture with some intermediate layers; do they also receive intermediate supervision? These labels, the dialogue acts and so on, are they also supervised labels?
0:23:25 [A] Yes, those are all supervised parts. It is multi-task in the sense that we are solving the three tasks at the same time.
0:23:32 [Q] So you need a slightly more complicated dataset for that, to have all of that supervised.
0:23:38 [A] Yes, we have more labels than just intents: we need the dialogue acts, or in that case the scenarios; we need the actions, that is, the frames; and then the arguments. That is why the dataset is called multi-task: because we have these three layers. But for us it was really important to differentiate between actions and dialogue acts, because, as I showed you, there were many cases in which it was important for the robot to have a better idea of what was going on in a single sentence.
0:24:06 [Q] Okay.
0:24:10 [Q] Thanks for the talk. A question on the last slide: you mentioned "frame-like". So what's the difference between frame-like and the FrameNet frame?
0:24:19 [A] Frame-like, for us, is whatever can be modelled as an abstraction which represents a predication in a sentence and has some arguments; this is the general frame-like notion. It is the same idea as the FrameNet frame; the big difference is that FrameNet frames are very well specified, there is a theory behind them, and there are some extra features, things like the relationships between frames, or special elements like the lexical unit itself, which make it easier to spot the frame in the sentence. What we would like is for it not to matter whether it is a FrameNet frame or an intent-and-slots structure from this corpus or any other corpus: we are trying to build a shallow semantic parser that can deal with all this stuff at the same time, as well as possible. It is a kind of mapping task: we are trying to incorporate these different aspects of the theories, and to deal with them, more or less, in different ways, but without compromising the ability to handle the other kinds of formalism.
0:25:24 [Q] One other question: what tools did you use for data annotation?
0:25:29 [A] So, for our corpus we actually had to develop our own interface; it is basically a web interface where we have all the tokens of a sentence and we can tag everything on them. And the corpus has been annotated entirely by us; it is something we have been collecting over the last years, and it takes a long time. It is a hard task to collect these sentences, and we also had to filter many of them out, because the contexts were very different: sometimes we went to the robot lab to do the collection and there was a lot of noise, and things we were evaluating at the same time interfered, so we dropped those. And in the end, we were always employing some people from our lab to annotate them, two or three of them, and then doing some inter-annotator agreement on the annotation, trying to check whether the annotation scheme was actually understood and was working. It is a very long process, and it is hard even for us as computational linguists. That is the situation with the corpus.
0:26:32 [Chair] Okay, so we have run out of time; let's thank the speaker again.