0:00:17so the first presenter is emanuele so
0:00:19please start your presentation
0:00:22good afternoon to everyone
0:00:24so my name is emanuele bastianelli i am from the interaction lab
0:00:29of heriot-watt university and i'm gonna present work i have done with andrea
0:00:34vanzo and oliver lemon
0:00:36about a multitask natural language understanding system for cross domain conversational ai that
0:00:42we call hermit nlu
0:00:45so natural language understanding is quite a wide concept
0:00:50most of the time when it is about conversational ai or dialogue systems it
0:00:54refers to the process of extracting the meaning from natural language and providing it to
0:00:58the dialogue system in a structured way so that the dialogue system can perform definitely
0:01:03better
0:01:04and we didn't end up
0:01:07studying this problem just for the sake of it but actually
0:01:10we did it in the context of the mummer project which was an eu
0:01:14project that was about
0:01:16the deployment of a robot with
0:01:18multimodal interaction capabilities it was supposed to be deployed in a shopping mall in finland
0:01:23and it was supposed to interact with the users giving them instructions entertaining
0:01:27them with a little bit of chit chatting
0:01:29and i'm gonna show a video of it that maybe explains a bit better
0:01:33what the robot was supposed to do
0:01:35i hope you can hear the audio although there are the subtitles
0:01:45i dunno one of the recording
0:01:50so the robot with both i sent and if no indication we just the and
0:02:00voice
0:02:01in this five phase
0:02:03and with or without the backing being detriment and the preference of the user
0:02:09right
0:02:16one value no straight i actually and no not attacking
0:02:32but for some with of the next to
0:02:33so we saw a lot of generation but everything started with a request from the
0:02:38user
0:02:39and that's the bit where we are focusing today so it is basically designing an
0:02:45nlu component that is robust enough to work in this very complex
0:02:49multimodal dialogue system
0:02:52again most often in conversational ai
0:02:56natural language understanding is a synonym of shallow semantic parsing so this can actually
0:03:00be linked with the
0:03:02morning keynote and which is the process of extracting some frame and argument structure
0:03:08that conveys the meaning in a sentence and it doesn't really matter how we call them
0:03:12if it is intents or slots
0:03:13and most of the time these types are defined according to
0:03:17the application domain
0:03:18whether they have some theory behind like frame semantics which offers a nice level of
0:03:22abstraction and it is the one we are using in our context
0:03:26but there are actually some problems especially in our case where we wanted to build an interface
0:03:30that was able to be used in several different domains while most of the time
0:03:35in dialogue systems when you have a natural language understanding component they always deal with a
0:03:39single domain or
0:03:41with a few domains at the same time
0:03:44and this is also
0:03:44because
0:03:45the resources that are available are always about booking restaurants or booking flights
0:03:51while we wanted our interface to be used in several different locations that can
0:03:55be a domestic environment or the shopping mall or scenarios for example
0:04:00where you have to command robots
0:04:02for maintenance on an offshore oil rig
0:04:04and so
0:04:05one of the first problems was that we wanted the system to be a system that was
0:04:08cross domain
0:04:09and even if there may be no recipe for that we were trying
0:04:13to tackle this problem anyway
0:04:16and the big problem is that
0:04:17most of the time the sentences in datasets that are designed for dialogue systems
0:04:22only contain a single intent or frame
0:04:25while in our case there are many sentences given to the robot
0:04:29which contain two different frames or intents and for us it can be very important to
0:04:35detect both of them because even if we ignore the temporal relation between these two
0:04:41different frames it is very important to you know satisfy the user both for the codec
0:04:46a mess by action and also the needing of a pole at the same time
0:04:50so there's another problem when you rely on these
0:04:54hi you know the and structure
0:04:57most of the time
0:04:58two different kinds of interaction might end up having the exact same intent or frame
0:05:03like in this case while they actually belong in the dialogue
0:05:06to two different kinds of interaction so what we actually wanted to do is not only
0:05:10tagging the frame or intent
0:05:13and the slots
0:05:14but also adding a layer of dialogue acts that will tell the dialogue system
0:05:18the context in which this has been said so for example in the first
0:05:21case we are informing the robot where starbucks is imagine that we
0:05:24want to teach the robot how the shopping mall is done and in the second one
0:05:28there is a customer that is asking for information about the location
0:05:32of starbucks
0:05:33so just to
0:05:35quickly recap we wanted to deal with different domains at the same time if
0:05:39possible
0:05:40we wanted to tag more than one single intent and arguments
0:05:44per sentence and since we are also doing the dialogue acts so we have a
0:05:48multitask architecture
0:05:49we have to deal also with multiple dialogue acts
0:05:52one might argue why
0:05:54it is actually very important to understand both the dialogue acts in this case
0:05:58if in the end the final intent is only to give information about the location of starbucks
0:06:03but actually we might want also to understand why
0:06:06the user is asking for starbucks because if he needs a coffee or maybe was needing
0:06:09a milkshake and not starbucks you could have pointed him somewhere else
0:06:13so for us this stuff is really important
0:06:16and of course
0:06:17we wanted to try to benchmark our nlu system
0:06:24against off-the-shelf tools and this was given by the people that were
0:06:28actually
0:06:29providing us with these utterances and evaluations and we will see later
0:06:34now very quickly i mean it is nothing complicated we tried to tackle
0:06:39this problem by
0:06:40addressing three different tasks
0:06:42at the same time so these tasks are tagging dialogue acts frames
0:06:48and arguments
0:06:50each task was solved with a sequence labeling approach in which we were giving
0:06:55a label to each token of the sentence which is
0:06:57something very common in nlp
0:07:00and each label was actually composed by the class
0:07:03of the structure we were aiming to tag for a given task
0:07:08enriched with a label that can be b i or o
0:07:12depending on whether
0:07:13the token was at the beginning of a span of a structure was inside
0:07:18or was outside one of these and here we have a very easy example
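The token-level labelling described here can be sketched in a few lines. This is only an illustration under assumptions: the sentence, the span boundaries and the label names (`Inform`, `Request_info`, `Needing`, `Locating`, `Cognizer`, `Requirement`, `Sought_entity`) are made up for the example and are not taken from the slides.

```python
def spans_to_bio(tokens, spans):
    """Turn (start, end_exclusive, label) spans into one B/I/O tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label          # first token of the span
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens inside the span
    return tags


tokens = "i need a coffee where is starbucks".split()

# Hypothetical annotations: one list of spans per task (assumed, for illustration).
dialogue_acts = spans_to_bio(tokens, [(0, 4, "Inform"), (4, 7, "Request_info")])
frames = spans_to_bio(tokens, [(0, 4, "Needing"), (4, 7, "Locating")])
arguments = spans_to_bio(
    tokens, [(0, 1, "Cognizer"), (2, 4, "Requirement"), (6, 7, "Sought_entity")]
)

# Print the three tag sequences side by side, one token per row.
for row in zip(tokens, dialogue_acts, frames, arguments):
    print("{:10} {:16} {:12} {}".format(*row))
```

Each task gets its own tag sequence over the same tokens, which is what lets one sentence carry several dialogue acts, frames and arguments at once.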
0:07:21now the problem is that
0:07:23this is a linear solution for a problem which is
0:07:26hierarchical because language is recursive and we might end up
0:07:29having some structures which are actually nested inside other structures especially for frames and arguments this
0:07:35basically never happens for dialogue acts
0:07:39but for frames and arguments this happens quite often especially in the data
0:07:44we collected
0:07:45so what we did as a solution was to
0:07:48basically collapse
0:07:49the whole structure in a single linear sequence and trying to get whether one of
0:07:53these structures
0:07:54was actually inside
0:07:56a previously tagged one
0:07:58by using some heuristics on the syntactic relations among the words for example if
0:08:02find was actually
0:08:04a syntactic child of to
0:08:06we could by heuristics by some rules actually say that the locating
0:08:11frame was actually embedded inside the requirement argument of the needing frame
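A rough sketch of this kind of rule is given below. It is not the rule set used in the system, only one plausible reading of it: a frame is treated as embedded in another frame when the syntactic head of its trigger word falls inside the other frame's span. The dependency heads are written by hand here, and the frame dictionaries and the name `find_embedded` are hypothetical.

```python
def find_embedded(frames, heads):
    """Return (inner, outer) frame-name pairs.

    frames: list of dicts with 'name', 'start', 'end' (exclusive) and 'trigger'
            (index of the word evoking the frame).
    heads:  heads[i] is the index of the syntactic head of token i, -1 for the root.
    """
    pairs = []
    for inner in frames:
        head = heads[inner["trigger"]]
        for outer in frames:
            # The trigger's head lies inside another frame's span: assume nesting.
            if outer is not inner and outer["start"] <= head < outer["end"]:
                pairs.append((inner["name"], outer["name"]))
    return pairs


tokens = ["i", "need", "to", "find", "starbucks"]
# Hand-written dependency heads (assumed parse): "find" depends on "need".
heads = [1, -1, 3, 1, 3]

# Flat, non-overlapping frame spans as produced by the linear tagging.
frames = [
    {"name": "Needing", "start": 0, "end": 2, "trigger": 1},
    {"name": "Locating", "start": 2, "end": 5, "trigger": 3},
]

print(find_embedded(frames, heads))
```

With this hand-written parse the sketch reports the locating frame as nested in the needing frame; deciding which argument of the outer frame it fills would need a further rule.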
0:08:18now this has been solved in a multitask fashion so we basically created
0:08:23a single network that was dealing with the three tasks at the same
0:08:26time it is basically a sequence of taggers with encoders and crf that
0:08:31i'm gonna show
0:08:32in the next slide it is nothing particularly complicated but there are two main reasons why
0:08:37we adopted this
0:08:39architecture first of all we wanted more or less to replicate
0:08:42a hierarchy of
0:08:44task difficulty in the sense that we were assuming actually we were
0:08:48led to think that tagging dialogue acts is easier than tagging frames and
0:08:52it is easier to tag frames than tagging arguments
0:08:56and there's also
0:08:57a kind of structural relationship between these three because many times
0:09:00some frames tend to appear more often in the context of some dialogue acts and
0:09:05arguments are almost always dependent on frames
0:09:09especially when there is a strong theory behind like frame semantics
0:09:12and
0:09:13so these are the reasons why the network is done like this
0:09:17and i'm going to illustrate the network quite quickly because this is a little bit
0:09:21more
0:09:22technical stuff so
0:09:24the input of the network was only pretrained elmo embeddings that we
0:09:27were not retraining and then the first layer was encoding with a step
0:09:32of encoding with some self-attention that was supposed to capture
0:09:36some relationships that the bidirectional lstm encoder wasn't capturing because you know sometimes self
0:09:42attention is more able to capture relationships among words which are quite distant in the
0:09:47sentence
0:09:48and then we were feeding a crf layer
0:09:51that was actually tagging the sequence with bio tags for the dialogue acts
0:09:56on top of the self-attention layer
0:10:00so for the frames it was basically the same thing
0:10:04but we were
0:10:06using shortcut connections before because we wanted to provide the encoder with the fresh information
0:10:11from the first layer so actually the lexical information but also
0:10:16with some information that was encoded while
0:10:18being
0:10:19kind of indirectly conditioned on what the
0:10:23dialogue act layer was tagging so we were putting the information together and serving
0:10:28the information to the next layer
0:10:30and then with a crf for tagging as before
0:10:32and finally for the arguments there is again the same thing
0:10:36another step of encoding and crf layer with lots of attention and these came up
0:10:40from the experiments we have done with some ablation study it is in the paper
0:10:44but i'm not gonna bother you here about this this is the final network we managed to
0:10:49tune at the very end
0:10:51so as i was saying at the beginning we wanted to
0:10:57benchmark this
0:10:59this nlu
0:11:01component now benchmarking an nlu for dialogue systems is quite a big issue in
0:11:05the sense that as i was saying before most of the datasets
0:11:10are quite
0:11:12single domain
0:11:13and there is very little stuff
0:11:15i mean by now there are some datasets that
0:11:18started popping up but at the beginning of this year we were still poor
0:11:22on that side
0:11:24but luckily there was this resource which is called the nlu benchmark
0:11:29which is basically a cross domain corpus of interactions with a home assistant
0:11:33robot
0:11:34it is mostly iot oriented it is not a collection of dialogues it is only
0:11:38single utterance interactions with the system
0:11:42and it covers a lot of domains we will see later
0:11:45but it is mostly iot oriented there are some
0:11:50commands that can be used for a robot but it is mostly again iot
0:11:53oriented
0:11:53and there is a second resource that we started collecting a long time ago and it is taking
0:11:58a lot of time
0:11:59which is the romulus corpus it is called like that because it
0:12:03stands for robotics oriented multitask language understanding corpus
0:12:07and it is again a collection of single interactions with the robot that cover
0:12:12different domains that are more kinds of interaction there is chit
0:12:16chatting there are
0:12:17straight commands to the robot there is also a lot of information you can
0:12:21give to the robot about the composition of the environment names of objects and
0:12:25this kind of stuff
0:12:26there's quite a huge overlap between the two in terms of kind of interaction
0:12:30but they span over
0:12:32different domains
0:12:33so
0:12:35the first corpus the nlu benchmark provides three different semantic layers
0:12:41and they are called scenario action and entity i know this sounds completely different
0:12:44from what we said before but we had to find some mappings with the stuff
0:12:48we wanted to tag over the sentences
0:12:52the corpus is quite big the full set of it is twenty five almost twenty
0:12:57six thousand sentences
0:13:00and there are eighteen different scenario types and each scenario is basically a domain
0:13:05and there are fifty four different action types and fifty six different entity types
0:13:11there is something they call an intent which is basically the sum of scenario
0:13:15plus action and this is important for the evaluation as we will see later
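As a small illustration of what "scenario plus action" means for scoring, the sketch below builds an intent label from the two parts and counts a prediction as correct only when both match. The underscore separator and the example values are assumptions made for the illustration.

```python
def intent_label(scenario, action):
    """Compose an intent from its two parts (separator is an assumption)."""
    return f"{scenario}_{action}"


def intent_correct(gold, pred):
    """gold and pred are (scenario, action) pairs; both parts must match."""
    return intent_label(*gold) == intent_label(*pred)


# Illustrative values only.
print(intent_label("alarm", "set"))
print(intent_correct(("alarm", "set"), ("alarm", "set")))
print(intent_correct(("alarm", "set"), ("alarm", "query")))
```

Getting the scenario right but the action wrong therefore counts as a wrong intent.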
0:13:20as you can see there is a problem with this dataset that is
0:13:24it is cross domain
0:13:26and it is multitask because we have three different semantic layers
0:13:29but
0:13:30we always have one single scenario and action so one single intent per sentence
0:13:35so what we could benchmark on this
0:13:38corpus was mostly these two initial
0:13:42these two initial factors
0:13:45we did the evaluation according to the paper that was presenting
0:13:49the benchmark
0:13:50and this was done on a ten fold cross validation with like half of the
0:13:53sentences about eleven thousand of the sentences and this was to balance
0:13:56the number of classes and entities inside the folds
0:14:02so as i was saying we had to do a mapping
0:14:05between
0:14:06their tagging scheme and what we wanted to tag which is a very general approach for
0:14:11extracting the semantics from sentences in the context of a dialogue system
0:14:16but we also saw that
0:14:18the kind of relationship that was holding between
0:14:20their semantic layers was more or less the same that was holding for
0:14:24our approach
0:14:26and so these are some results
0:14:28these are the ones reported in the paper but they are quite old in the
0:14:31sense that they are from the beginning of this year they've been evaluated in two
0:14:34thousand eighteen
0:14:35they have been run on all the open source
0:14:39versions of these nlu components of dialogue systems available off the shelf
0:14:46there's a problem with watson because it requires a specific training for entities
0:14:50and this was not possible because it has a constraint on the number
0:14:56of entity types and entity examples you can pass and we tried to
0:15:00talk with the watson people but we didn't manage to get the license at least
0:15:03to run one training with the full set of things so you don't have
0:15:08to take that into account too much unfortunately
0:15:11the intent as i was saying is the sum of the scenario
0:15:14and the action
0:15:16and these
0:15:19performances are
0:15:21obtained on ten fold cross validation i didn't report the standard deviation because
0:15:25they were almost all stable but if you want to look at them
0:15:28they're in the paper
0:15:29and the other important thing is that we were not taking into account whether the
0:15:34span
0:15:35of a tagged structure was matching exactly actually
0:15:39the authors of the paper weren't taking that into account
0:15:41but they counted a true positive whenever there was an overlap
0:15:45an overlap of the span
0:15:48so this is kind of a loose metric
0:15:50that we are evaluating on
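The difference between the loose overlap criterion and an exact span match can be made concrete with a short sketch. The matching policy below (same label required, each gold span matched at most once) and the example spans are assumptions; the benchmark's own scoring script may differ in its details.

```python
def same_label_overlap(a, b):
    """Spans are (start, end_exclusive, label); true if labels agree and spans intersect."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def count_matches(gold, pred, exact):
    """Return (true positives, false positives, false negatives)."""
    used = set()  # indices of gold spans already matched
    tp = 0
    for p in pred:
        for i, g in enumerate(gold):
            if i in used:
                continue
            hit = (p == g) if exact else same_label_overlap(p, g)
            if hit:
                used.add(i)
                tp += 1
                break
    return tp, len(pred) - tp, len(gold) - tp


# Made-up spans: the first prediction has the right label but misses one token.
gold = [(0, 2, "date"), (4, 6, "place")]
pred = [(1, 2, "date"), (4, 6, "place")]

print("loose:", count_matches(gold, pred, exact=False))
print("exact:", count_matches(gold, pred, exact=True))
```

Under the loose criterion the truncated span still counts as a true positive, while under exact match it becomes both a false positive and a false negative.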
0:15:52we can see that for the entity and the combined setting
0:15:57our system was performing on average better than the others while for the intent
0:16:01we were actually not performing as well as watson but better than the other
0:16:06two systems
0:16:07the other important bit is that the combined
0:16:11measure is actually the sum of the two confusion matrices of intents and entities
0:16:15and it doesn't
0:16:16actually tell us anything about the pipeline
0:16:18how the full pipeline is working
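One way to read "the sum of the two confusion matrices" is to add up the true positive, false positive and false negative counts of the intent and entity tasks and compute a single F1 from the totals. The sketch below does that with made-up counts; it assumes the counts are pooled in this micro-averaged way.

```python
def f1(tp, fp, fn):
    """F1 from raw counts, returning 0.0 when a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def combined_f1(intent_counts, entity_counts):
    """Pool (tp, fp, fn) of the two tasks, then compute one F1."""
    tp, fp, fn = (a + b for a, b in zip(intent_counts, entity_counts))
    return f1(tp, fp, fn)


# Made-up counts, for illustration only.
intent_counts = (80, 20, 20)
entity_counts = (60, 10, 30)

print(round(f1(*intent_counts), 4))
print(round(f1(*entity_counts), 4))
print(round(combined_f1(intent_counts, entity_counts), 4))
```

Because the counts are simply pooled, a sentence with a correct intent and wrong entities still contributes a true positive, which is why this score says nothing about whole-sentence correctness.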
0:16:20but this is something that we have done
0:16:22on our corpus which is much smaller
0:16:25and it is not yet available because we are still gathering data
0:16:29probably end of this year we're gonna release it
0:16:32i know it covers a very niche area but for people doing hri
0:16:37or dialogue in the context of robotics this can be
0:16:39quite interesting
0:16:42so here we have eleven dialogue act types and fifty eight frame types
0:16:46which compared to the number of examples is quite high
0:16:49and eighty four frame element types which are the arguments
0:16:52and as you can see
0:16:54not always but there are many cases in which we have more than one
0:16:58frame per sentence and more than one dialogue act per sentence
0:17:01and the frame elements are quite a lot
0:17:07we have like
0:17:09three semantic layers but in theory they are more formally only two
0:17:13because
0:17:13we have thirteen dialogue acts exactly like we saw in the rest of
0:17:16the presentation
0:17:17and we also provide semantics in terms of frame semantics
0:17:22where we have frames and frame elements these are actually the same
0:17:25semantic layer theoretically but they are two different layers operationally
0:17:30and as you can see we have a lot of
0:17:32embedded structures a frame inside another frame and this kind of stuff
0:17:36and this is the mapping we had to do again
0:17:39with the different semantic layers it is basically the same dialogue acts are dialogue acts frames are frames
0:17:43and frame elements are arguments
0:17:46and of course
0:17:47these are the aspects that we could tackle while using this corpus so
0:17:51it is not as cross domain because it is not as cross domain as the
0:17:54other one
0:17:54but it is enough to say that we have
0:17:56different kinds of interaction and we have also sentences coming
0:17:59from two different scenarios that can be
0:18:03the house scenario and the shopping mall scenario we also started adding something coming from these interactions
0:18:09with the mummer robot
0:18:12but we don't want to say it is completely cross domain mostly because the other
0:18:17corpus is much more cross domain than this one
0:18:19but it is really multitask and it is really multi dialogue act and frame in each
0:18:23sentence
0:18:24okay the results
0:18:27they might look quite weird
0:18:29i'm gonna explain why they are like this
0:18:31so the first one i report here is the same exact measure that i was reporting
0:18:36for the nlu benchmark so
0:18:38we take into account only when the spans
0:18:40of two structures overlap okay
0:18:43and
0:18:43the results are quite high
0:18:45and the main reason is that the corpus has not been delexicalised
0:18:49so the sentences are quite similar
0:18:52and then the system behaves very well
0:18:53but you don't have to get fooled by that because
0:18:56if we look at the last one well the second one is basically only
0:18:59using the
0:19:00conll two thousand shared task evaluation which is a standard and we report
0:19:05it for general comparison with other systems
0:19:07but the most important one is the last one that is the exact
0:19:11match
0:19:11and the last one the exact match is telling us
0:19:14how well the system the whole pipeline was working completely so we were taking into
0:19:18account the exact span
0:19:21of
0:19:23all of the tagged structures
0:19:24and also
0:19:25we were
0:19:26yes we were
0:19:30we were actually
0:19:31trying to get
0:19:32i mean a frame was actually correctly tagged only if also the dialogue act
0:19:36was correctly tagged so it is actually the end-to-end system
0:19:39in a pipeline and that is
0:19:40the measure we have to chase
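The chained criterion described here, where a frame only counts if its dialogue act is also right, can be sketched by keying every structure on its parent. The nested dictionary layout, the label names and the function names below are assumptions made for the illustration, not the evaluation code used for the reported numbers.

```python
def flatten(analysis):
    """analysis: list of dialogue acts, each a dict
    {"span": (start, end, label),
     "frames": [{"span": (start, end, label), "args": [(start, end, label), ...]}]}.
    Every frame is keyed by its dialogue act, every argument by its frame."""
    acts, frames, args = set(), set(), set()
    for act in analysis:
        acts.add(act["span"])
        for frame in act["frames"]:
            frame_key = (act["span"], frame["span"])
            frames.add(frame_key)
            for arg in frame["args"]:
                args.add((frame_key, arg))
    return {"acts": acts, "frames": frames, "args": args}


def exact_true_positives(gold, pred):
    """Count structures whose span, label and whole parent chain match exactly."""
    g, p = flatten(gold), flatten(pred)
    return {layer: len(g[layer] & p[layer]) for layer in g}


# "where is starbucks": hypothetical gold analysis.
gold = [{
    "span": (0, 3, "Request_info"),
    "frames": [{"span": (0, 3, "Locating"), "args": [(2, 3, "Sought_entity")]}],
}]
# Same frame and argument, but the dialogue act label is wrong.
pred = [{
    "span": (0, 3, "Inform"),
    "frames": [{"span": (0, 3, "Locating"), "args": [(2, 3, "Sought_entity")]}],
}]

print(exact_true_positives(gold, gold))
print(exact_true_positives(gold, pred))
```

In the second call the frame and argument spans are identical to the gold ones, yet nothing counts as correct because the error at the dialogue act level propagates down the chain.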
0:19:43now to
0:19:45conclude and some future work so the system that i presented which is this
0:19:49cross domain multitask
0:19:52nlu system for natural language understanding
0:19:55for conversational ai that we designed is actually running in the shopping mall
0:20:01in finland
0:20:03the video i showed you was from the deployment we have done
0:20:07and it is gonna be there for three months in a row
0:20:09with some pauses during the weekend to do some maintenance and rebooting the system
0:20:13but we
0:20:14managed to collect a lot of data and we will maybe integrate them in the corpus
0:20:17and release it at the end of this year
0:20:19if we manage to tag them properly otherwise at the latest beginning of
0:20:23next year
0:20:25we have to deal with the hierarchy with a different method
0:20:28it means not relying on these heuristics on the syntactic structure but actually simultaneously
0:20:33tagging
0:20:35embedded sequences or multiple sequences that can be one inside the other
0:20:38it is a hot topic because we actually already have this system we
0:20:42finalized it a few months ago so we didn't have time to submit it
0:20:45here but this exists and there is a branch on the repository that i
0:20:50showed you which is about this new system
0:20:55part of our work is
0:20:56this one of creating a general framework for frame-like structures so it doesn't
0:21:01matter what is the theory or the application that is the reason behind
0:21:04we are trying to create a network that can deal with all the possible frame
0:21:08like structure parsing this is our long-term goal something very big but we are
0:21:13actually pushing for that
0:21:14and the last bit is mostly dealing with this special tagging of
0:21:19segmented utterances we realized that in our corpus there were many
0:21:23small bits of sentences that the users were saying because they were stopping
0:21:27or hesitating so they were saying the first part of the sentence like i would
0:21:30like to
0:21:31and the asr was actually segmenting and was sending the thing to the
0:21:36parser and the parser worked correctly by tagging it but with some
0:21:39bit missing
0:21:40now when the user was saying
0:21:42to find the starbucks for example we were receiving this find the starbucks that was contextualized
0:21:47as a finding locating frame
0:21:50but we didn't know it was also a frame element of the previous
0:21:53structure so we are studying the way to
0:21:55make the system aware of what has been parsed before
0:21:58so that it can actually give more information in the context of the
0:22:02same utterance even if this is broken by the asr
0:22:05and
0:22:06this is everything
0:22:07okay thanks very much
0:22:13okay so that's it's time for questions
0:22:23no him
0:22:30hi thanks for the great talk it is always good to see rasa being
0:22:34benchmarked i'm just curious did you use just default out of the box parameters
0:22:38or did you do a bit of tuning
0:22:40so we just used the results from the people of the benchmark and they
0:22:45were only saying that they
0:22:48did something like a little bit of specific training for the
0:22:51entities something like that
0:22:54and they used the version
0:22:57that was using the crf and not the neural one the tensorflow one okay so that's actually like a very basic version i suppose
0:23:01one okay so that's actually like a very basic version i suppose
0:23:08questions
0:23:09okay
0:23:12so you showed the architecture there with some intermediate layers also with crfs are they
0:23:18also intermediate supervision there
0:23:21thirty one so this labels via alarm and sonar they also
0:23:25supervised labels yes it is all supervised that's why it is
0:23:29multitask in the sense that we are solving the three tasks at the same time
0:23:32so you need
0:23:34a slightly more complicated dataset for that to have all of that supervised
0:23:38well we have more labels than just
0:23:41we need the dialogue acts or in this case the scenarios we need
0:23:44the actions or the frames and the arguments basically so that's why
0:23:49the architecture is called multitask because we have these three layers okay
0:23:53but for us it was really important to differentiate between actions and dialogue
0:23:57acts because as i showed you
0:23:58there were many cases in which it was important for the robot to have a
0:24:02better idea of what was going on in the single sentence okay
0:24:06okay
0:24:10thanks for the talk a question in the last slide you mentioned frame like
0:24:15so what's the difference between frame like and framenet
0:24:19frame like is whatever can be
0:24:25some kind of abstraction which represents a predication in a sentence and has
0:24:30some arguments
0:24:32this is like the general frame like you know like the very
0:24:35bold
0:24:35is it the same as framenet so the idea is basically the
0:24:39same the big difference is that framenet has a very specific theory behind
0:24:43and there are some extra things in the theory like some relationships
0:24:47between frames and there are also special frame elements like the lexical unit itself
0:24:51which make it easier to look at the frame in the sentence
0:24:54but
0:24:55what we would like to do is it doesn't matter whether it is framenet or just
0:24:58intent slot like from the atis corpus or any other corpus
0:25:02what we would like what we are trying to build is a shallow semantic parser
0:25:07that can deal with all this stuff at the same time
0:25:09as well as possible it is kind of a tough task but we are trying
0:25:13to incorporate these different
0:25:14aspects of the theories and we are trying to deal with them
0:25:17more or less in different ways but without compromising
0:25:21the system on other kinds of formalisms
0:25:24one other question what tool did you use for data annotation
0:25:29so actually for our corpus we had to develop our own interface
0:25:34it is basically a web interface where we have all the tokens of a sentence
0:25:39and we can tag everything on that and the corpus has been entirely i
0:25:45mean it is something we have been collecting for a long time it takes a long
0:25:48time it's
0:25:51it is a hard task to collect these sentences and also we have to filter
0:25:54out many of them because the contexts were different sometimes we went
0:25:59to the robocup to do this collection and there was a lot of noise and
0:26:02things we were also value that you're
0:26:05file of these then we stopped but in the end we were always employing some
0:26:09people from our lab
0:26:11to annotate them like two or three of them then you know doing some inter-annotator
0:26:14agreement trying to get whether the annotation was working it is
0:26:18a very long process okay and
0:26:21we're the computational linguist but opposite thing point so
0:26:24it is very hard but this that's
0:26:29that's the situation with the corpus
0:26:32okay so we have run out of time so let's thank the speaker again
0:26:36okay