0:00:15 So I'm going to present my work on the topic of language-guided adaptive perception for efficient grounded communication with robotic manipulators in cluttered environments. It's kind of a long title, but hopefully you will understand it by the end of the presentation.
0:00:33 So this work is about situated language understanding, that is, language understanding in physically situated settings, which is an interesting problem in robotics. The ability to interact with collaborative robots using natural language, and then perceiving, planning, and establishing common ground, is critical for effective human-robot interaction. Let's look at an example.
0:01:02 The user says, "pick up the leftmost blue cube." The robot perceives the scene, reasons, and grounds the phrase to a specific object. Once that's done, the user continues and says, "put it on top of the red cube."
0:01:19 So a few things to note about this: there is diversity in the language, both in terms of the instructions that the user can give to the robot and in the way in which the instructions are said. There are challenges because the environments are unstructured; they could be cluttered, as in the example here. And we need real-time interaction with the robot, but perception takes time. So that's what this work specifically talks about: how to efficiently perceive environments for fast and accurate grounding of a wide variety of natural language instructions, demonstrated in the context of robotic manipulation.
0:02:07 To give you some background on perception and representations: perception usually refers to taking the sensor measurements that come from the robot's sensors through some perception pipeline; the perception pipeline compresses these high-dimensional sensor measurements and gives you a representation of the environment, something called a world model.
0:02:30 To give you an example from visual perception: you can feed in a sequence of RGB-D images, and what you get out of it is some representation of the world. The representation varies based on the application. For example, here it's just a point cloud representation; I can make a 3D voxel grid; you can have an occupancy grid or a semantic map. Or, if you want to manipulate specific objects, you can model the six-degrees-of-freedom poses of those objects and get something like that. Going even further, you can have articulation models of the components of individual objects.
0:03:10 So the point to note is that the representations vary based on the application. One more point to note is what happens as we move from simple representations to more detailed ones: here it's just a bounding box representation of the object; then you add semantics, so you know that it's a water bottle; then you know its six-degrees-of-freedom pose; and going further, you have decomposed the bottle into its lid and body. The more detailed the representations you have, the more complicated the tasks the robot can perform.
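This progression of fidelity can be sketched as a hierarchy of data types. This is a minimal illustration of the idea only; all class and field names here are my own, not the authors':

```python
# Sketch of world-model representations at increasing levels of detail:
# bounding box -> + semantic label -> + 6-DOF pose -> + articulated parts.
from dataclasses import dataclass, field

@dataclass
class BoundingBox:
    # coarsest representation: an axis-aligned box around the object
    x: float
    y: float
    z: float
    dx: float
    dy: float
    dz: float

@dataclass
class SemanticObject(BoundingBox):
    # adds a semantic label, e.g. "water bottle"
    label: str = "unknown"

@dataclass
class PosedObject(SemanticObject):
    # adds a full six-degrees-of-freedom pose
    position: tuple = (0.0, 0.0, 0.0)
    orientation_quat: tuple = (0.0, 0.0, 0.0, 1.0)  # x, y, z, w

@dataclass
class ArticulatedObject(PosedObject):
    # finest representation: decomposes the object into parts
    parts: list = field(default_factory=list)  # e.g. ["lid", "body"]

# the richest model subsumes all the coarser ones
bottle = ArticulatedObject(0.0, 0.0, 0.0, 0.1, 0.1, 0.3,
                           label="water bottle", parts=["lid", "body"])
```

Each class subsumes the one before it, which mirrors the talk's point: every extra level of detail costs more perception time, so you only want the levels the task actually needs.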
0:03:51 So highly detailed models allow for reasoning and planning for a wide variety of complex tasks, but that leads us to the problem: maintaining exhaustively detailed world models of cluttered environments is computationally expensive, and it inhibits real-time interaction and dialogue with a collaborative robot.
0:04:09 One common approach is to have task-specific representations, where you know what tasks the robot is supposed to perform and you hardcode the perception pipeline accordingly. But how to best represent environments to facilitate planning and grounding for a wide variety of complex tasks is an open question.
0:04:34 What we observe is that in the case of exhaustive modeling, where you model all the properties of all objects in the world, some of these properties are inconsequential to interpreting the meaning of some of the instructions. In this case, modeling the articulation between the lid and the body of the bottle is irrelevant for the task of picking up the ball, and vice versa.
0:05:02 So what we propose in our work is learning a model of language and perception, specifically to adapt the configuration of the perception pipeline at runtime in order to infer task-optimal representations of the world that facilitate the grounding of natural language instructions. For example, this is the environment representation inferred for the task of picking up the leftmost blue cube, where we just segment out the blue cubes, and this is for the task of picking up the nearest red object, where we ignore the blue objects and infer only the properties that we need.
0:05:43 To give you some background about the models that have been used: we are not the first ones to do language understanding. The Generalized Grounding Graphs model was developed by Tellex et al., and they demonstrated its utility on the task of lifting pallets with an autonomous forklift. One advancement over that model was the DCG (I'll come to the details of this model later), which basically exploited conditional independence assumptions across the constituents of language and the constituents of the symbol space to infer high-level motion planning constraints given an instruction. There's one more model that was used to infer abstract visual concepts, for example to learn what it means to be the middle block in the row of five blocks on the right.
0:06:39 So all of these language models assume some fixed, flat representation of the world. Then there has been work at the intersection of perception and language understanding that talks about how we can leverage language to aid perception. In one case it was used to add semantic labels to regions in the map, akin to the occupancy grid representation. In another work, language was used to aid the process of inferring kinematic models. And there is another body of work where, for an instruction like "go to the hydrant behind the cone," the robot cannot see what's behind the cone, so the instruction itself is leveraged to augment the representation with assumed models.
0:07:40 These models augment the representation, but they do not consider how to efficiently convert raw observations into representations that can speed up the grounding process. The work most related to ours is by Matuszek et al., who use a joint language-perception model to select a subset of objects based on their color and geometric properties, and work on segmentation from natural language expressions, where, given an RGB image and an instruction, the model segments out the referred objects. What is different in our work is that we are expanding the breadth and complexity of the perceptual classifiers used, we work with real cluttered environments, and we present an approach to adapt the configuration of the perception pipeline in order to infer task-specific representations.

0:08:36 Moving on to the technical approach: we pose the general, very high-level language understanding problem as finding the most likely trajectory given some natural language expression and some observations. The observations could be a sequence of RGB-D frames; in our case it's just a single RGB-D frame. Solving this inference directly is computationally expensive, because the space of trajectories is quite large for complicated environments and robots.
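Written out (with symbol names assumed for illustration, in the style common to this line of work), the problem is roughly:

```latex
% Most likely trajectory x_{1:T} given the utterance \Lambda and
% the observations z_{1:T}:
x_{1:T}^{*} \;=\; \underset{x_{1:T}}{\arg\max}\;
  p\!\left(x_{1:T} \mid \Lambda,\, z_{1:T}\right)
```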
0:09:12 So, in line with contemporary techniques, we propose to structure this problem as a symbol grounding problem, where instead we infer a distribution over symbols given the language and a world model. So we are moving from high-dimensional sensor measurements to a structured, compact representation of the world, the world model, which is a function of the perception pipeline of the robot.
0:09:41 So what does the symbol space consist of, exactly? In the DCG model, the symbol space basically consists of the objects in the world, the properties which are perceived, the regions in the world model, spatial relations, and action symbols. So the symbols represent a discrete space of interpretations in which an instruction will be understood.
0:10:12 We are specifically using the DCG in our work. The DCG is a probabilistic graphical model: it is a factor graph constructed over the parsed instruction. On this axis are the phrases, the linguistic components, and on the vertical axis are the constituents of the symbol space. This is an example of one of the factors; it links a linguistic phrase to one of the symbols, which could represent objects or regions, and connects them with a correspondence variable. What the DCG does is try to find the most likely set of correspondence variables in the context of the grounding language, the child correspondence variables, and the world model. It does this by maximizing a product of individual factors across the linguistic components and symbol constituents. The likelihoods of these factors are estimated with log-linear models in the DCG.
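In symbols (with notation assumed from the DCG literature, since the slide text is not in the transcript), that maximization looks roughly like:

```latex
% DCG inference: most likely correspondence variables \phi_{ij} linking
% phrase \lambda_i to symbol \gamma_j, conditioned on the child
% correspondences \Phi_{c_i} and the world model \Upsilon:
\Phi^{*} \;=\; \underset{\phi \in \Phi}{\arg\max}\;
  \prod_{i=1}^{|\Lambda|} \prod_{j=1}^{|\Gamma|}
  f\!\left(\phi_{ij},\, \lambda_i,\, \Phi_{c_i},\, \Upsilon\right)
```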
0:11:12 A problem here is that the runtime of the DCG is directly proportional to the world model fidelity, and this is because the size of the symbol space increases as the number of objects in the world increases. What we observe is that modeling some objects, and the symbols based on those objects, is inconsequential to interpreting the meaning of the instruction. So we hypothesize that there exists an optimal world model that expresses the necessary and sufficient information to solve this problem. We go from the previous equation to this one, where we now have this optimal world model, and we hypothesize that the runtime to solve this equation will be lower.
0:11:59 So what we propose is using language as a means to guide the process of generating these optimal world models: we make the world model a function of the perception pipeline, the observations, and the language. So now we have added this NLU part, which takes in language and yields some constraints on perception based on the task; perception gives an optimal world model back, and that is what we directly reason over.
0:12:27 To achieve this, we define a new symbol space specific to perception. What it basically consists of is different color detectors, geometry detectors, pose detectors, semantic object detectors, and so on. These need not be just low-level detectors; you could have, say, a detector to infer the likelihood of an object being able to contain some other object, something like that. So we use these to infer perception symbols by modifying this equation: we no longer have the world model in the equation, and we are reasoning in the perception symbol space.
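As a sketch of the modified inference (again with assumed notation: P for the perceptual symbols and g for the adaptive perception step), the world model drops out of the grounding factors and is instead generated from the inferred symbols:

```latex
% First infer perceptual symbols from the utterance alone, without a
% world model, then build the compact world model from them:
P^{*} \;=\; \underset{p \in P}{\arg\max}\;
  \prod_{i=1}^{|\Lambda|} \prod_{j=1}^{|P|}
  f\!\left(p_{ij},\, \lambda_i,\, P_{c_i}\right),
\qquad
\Upsilon^{*} \;=\; g\!\left(P^{*},\, z_{1:T}\right)
```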
0:13:08 To give you some details: the symbolic representation that we use is made up of two different sets of symbols, independent symbols and conditionally dependent symbols. Independent perceptual symbols basically index the individual detectors that exist in the perception pipeline, like a cube detector, a red color detector, and so on, and they form the set of all those detectors. We also recognized that to interpret some complex phrases, like "pick up the red ball," you would want some conditional dependence, so that we run the sphere detector, in this case, only on the objects which are red; that way we can have a faster interpretation.
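To make the two kinds of symbols concrete, here is a minimal sketch (my own illustration, not the authors' code; detector names and the toy scene are hypothetical):

```python
# Independent vs. conditionally dependent perceptual symbols: the
# conditional symbol for "red ball" runs the sphere detector only on
# objects the red detector has already accepted, instead of running
# every detector on every object in the scene.

def red_detector(obj):
    return obj.get("color") == "red"

def sphere_detector(obj):
    return obj.get("shape") == "sphere"

def run_independent(detector, objects):
    # independent symbol: one detector over all candidate objects
    return [o for o in objects if detector(o)]

def run_conditional(first, second, objects):
    # conditionally dependent symbol: chain detectors, shrinking the
    # candidate set (and the work done) at each stage
    return run_independent(second, run_independent(first, objects))

scene = [
    {"id": 1, "color": "red", "shape": "cube"},
    {"id": 2, "color": "red", "shape": "sphere"},
    {"id": 3, "color": "blue", "shape": "sphere"},
]

# "pick up the red ball" -> engage only the red and sphere detectors
matches = run_conditional(red_detector, sphere_detector, scene)
```

The chaining is where the speed-up comes from: the second (possibly expensive) detector sees only the survivors of the first.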
0:13:59 So going forward to the experiments, this is the system architecture. We have an RGB-D sensor that feeds into the adaptive perception module. We have a parser that takes the instruction, parses it, and feeds it to two NLU models: the first one is for inferring the language-perception constraints (we refer to this as the LPN model), and the second one is the NLU used for symbol grounding. The first NLU takes in the language and gives you the perception constraints, that is, the detectors that should be in the pipeline to suit the task. Then adaptive perception takes the observations and the constraints and gives you an optimal world model in which to reason, and symbol grounding infers the high-level motion planning constraints that go to the motion planner.
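The flow of that architecture can be stubbed out end to end. Every component below is a deliberately trivial stand-in of my own (the vocabulary table, detector names, and grounding rule are all hypothetical), just to show how the pieces connect:

```python
# instruction -> parser -> LPN (which detectors to enable) ->
# adaptive perception (compact world model) -> symbol grounding.

def parse(instruction):
    return instruction.lower().split()  # stand-in for a real parser

def lpn_constraints(tokens):
    # language-perception NLU: map words to required detectors
    vocab = {"red": "red_detector", "blue": "blue_detector",
             "ball": "sphere_detector", "cube": "cube_detector"}
    return {vocab[t] for t in tokens if t in vocab}

def adaptive_perception(observation, constraints):
    # run only the detectors named in the constraints
    detectors = {
        "red_detector": lambda o: o["color"] == "red",
        "blue_detector": lambda o: o["color"] == "blue",
        "sphere_detector": lambda o: o["shape"] == "sphere",
        "cube_detector": lambda o: o["shape"] == "cube",
    }
    world = observation
    for name in constraints:
        world = [o for o in world if detectors[name](o)]
    return world  # the compact, task-specific world model

def ground(tokens, world_model):
    # stand-in for DCG symbol grounding over the compact model
    return world_model[0] if world_model else None

observation = [{"id": 1, "color": "red", "shape": "sphere"},
               {"id": 2, "color": "blue", "shape": "cube"}]
tokens = parse("Pick up the red ball")
target = ground(tokens, adaptive_perception(observation,
                                            lpn_constraints(tokens)))
```

The point of the sketch is the data flow, not the components: the baseline in the comparative study below is the same pipeline with `lpn_constraints` removed, i.e. with every detector always enabled.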
0:14:48 So we do a comparative study in which we compare our proposed model with a baseline; the only difference is that the LPN block is missing in the baseline architecture. We time the different processes: the time required to infer the constraints, the time required for perception (which for the baseline is complete perception, using all the detectors, all of the modules in the perception pipeline), and the time required for symbol grounding.
0:15:18 There are a few assumptions in the experiments that we have. For the environment, we have a Baxter robot with a cluttered tabletop workspace; we use several different world arrangements in our work, and the number of objects in the clutter varies from fifteen to twenty.
0:15:46 This is the actual set of detectors in the perception pipeline. It has different components like color detectors, geometry detectors, label detectors, and bounding box detectors: different types of bounding box detectors, region detectors, object detectors, and so on. For all of those detectors we have the independent symbols, and then the conditionally dependent symbols, a set of symbols which depend on, say, both geometry and color labels, like "red cube," which basically expresses that we engage the geometry detector of a specific type on the output of the color detector of a specific type. The symbolic representation for the symbol grounding model basically consists of several different kinds of symbols: objects in the world, labels, colors, geometries, regions in the world, and so on.
0:16:41 The corpus consists of syntactically varied parsed instructions; we had about a hundred instructions, annotated once with the perception symbols and once with the grounding symbols. The linguistic patterns that we followed were inspired by the work done in an analysis paper that collected data using Amazon Mechanical Turk, so we used similar linguistic patterns.
0:17:11 In our experiments we have two hypotheses. The first one is that adaptively inferring task-optimal representations will reduce the perception runtime compared to exhaustively detailed uniform modeling of the world. The second hypothesis is that reasoning in the context of these compact representations will reduce the symbol grounding time as well. So we have three experiments: the first is just the simple learning characteristics of the NLU models, where we observe, as the training fraction increases, what happens to the accuracy. The second, more interesting, one is how the LPN impacts the perception runtime, and the third is how it impacts the symbol grounding runtime.
0:17:53 So we hypothesized that as the training fraction increases, the accuracy of inference should increase. Second, as the number of objects increases, if you are using complete perception the runtime should grow, and when using the LPN it should stay lower than that. Similarly in the case of symbol grounding. The linear or exponential shapes in these sketches are just to demonstrate the trend.
0:18:24 In our results, this is basically the learning characteristics, just as we expected. In the second plot, the blue curve shows the time required to perceive the world as the number of objects changes from fifteen to twenty: you see it rising, while here, with adaptive perception, it's largely independent of the number of objects. And you see a similar trend for the symbol grounding runtime.
0:18:53 To summarize, this table shows the average perception runtime over all the instructions when we used the complete, exhaustive modeling of the world versus adaptive perception: you see a good decrease in the perception runtime here, and we see a similar decrease for the symbol grounding runtime. The point to note is that the symbol grounding accuracy is fairly the same in both cases. So coming back to the hypotheses: we had these two hypotheses, which we verified through the experiments.
0:19:26 In conclusion: real-time interaction is important for physically situated dialogue with a robot, and the problem is that exhaustive modeling of cluttered environments is a perception bottleneck in such cases. So we propose a language-perception model that takes an instruction, infers the perception constraints, and configures the perception pipeline of the robot to give optimal world models, which in turn speed up the symbol grounding process; we verified that through the experiments. Thank you.
0:20:19 [Audience] This is really great. So, in relation to the efficiency optimization you have in mind: your language interpreter, is it a parser? What does your language interpreter do, for example?

0:20:44 [Presenter] What are you asking about, exactly?

0:20:48 [Audience] Do you have it designed to run in real time, incrementally, or does it do the whole parse, that is, wait till the end of the utterance and then parse the whole thing?

0:20:55 [Presenter] Parsing is not among the main contributions; we just use the parsed structure of the instructions. It is the NLU model that interprets the instructions. Are you asking whether it interprets the instructions word by word, or at the end of the instruction?

0:21:15 [Audience] I was asking because I think we might see further efficiency gains if you interpret the utterance word by word. There is evidence from the visual world paradigm that humans do that; you can see it in their eye movements as they are listening. It could speed up the process.
0:21:40 [Presenter] In this work it interprets the instruction after it is received by the NLU, but it does this phrase by phrase, so the interpretation of a parent phrase, say the verb phrase in "pick up the blue ball," is a function of its child phrases. To pick up a blue ball, you need not know the six-degrees-of-freedom pose of the ball, because the semantic detection is sufficient; in that case it will reason that you do not need the six-degrees-of-freedom pose estimator. As opposed to that, in a case where you have to place an object precisely, for example into a box, you would need the six-degrees-of-freedom pose estimation of the object, so for that it will engage the six-degrees-of-freedom pose estimator and reason in the context of those symbols.
0:22:37 Other questions?
0:22:51 [Audience] Over the course of a back-and-forth dialogue we're going to have discussion of different objects. I noticed in your conclusions slide you have an example using the word "it," as in the second instruction, "put it on the top of the red cube." So I was wondering whether you are currently exploring dialogue history, like the previous utterances, and how you might track longer histories in the future.

0:23:15 [Presenter] In this work we are not tracking the dialogue history; it's basically the first, monologue part of the dialogue. Eventually this is supposed to speed up the entire dialogue by speeding up the perception, but we are not currently modeling what it means in the context of a longer dialogue.
0:23:32 Any other questions?
0:23:38 [Audience question, partially inaudible.]

0:23:46 [Presenter] It's a special case where, for the detectors in the perception pipeline, the time required to run the detections was also a function of the size of the objects. In this specific case I had lots of objects, but they were small in size, not large ones. It specifically affects the geometry detectors, because they depend on the point cloud: with a larger point cloud they need to reason about more points. That's why the time required to do the complete perception was still low in that one case.