0:00:17hello everyone we're ready for the final oral session of the conference
0:00:23on discourse
0:00:26i will be the session chair the first talk is from
0:00:32dmytro kalpakchi
0:00:34spacerefnet a neural approach to spatial reference resolution in a real city environment
0:00:41thank you everybody so my name is dmytro and i'm a phd
0:00:45student here at kth and of course as a phd student you get to read
0:00:49a lot of papers and every time the thing that confuses me is the title
0:00:52the most
0:00:53so when presenting mine i'm going to make sure all of you
0:00:57understand what those words in the title mean
0:00:59so we start backwards the phrase a real city environment refers to the domain we're working in
0:01:04which is pedestrian wayfinding
0:01:07in real cities
0:01:09and when you are in an unfamiliar city the first thing you do is
0:01:12you take a smartphone and launch something like google maps
0:01:17the way google maps and the like typically guide you is the same as they would
0:01:21guide cars which is they present you something called turn by turn navigation so
0:01:25you get a bunch of instructions presented to you on the screen supplemented by a map
0:01:29with a moving marker indicating your position
0:01:31and the instructions can be voiced as well
0:01:32and they would sound like turn right on valhallavägen slash route
0:01:36two hundred seventy seven then go six hundred fifty meters and turn left on the
0:01:40frame on which is not exactly the most natural thing you would expect from the
0:01:45system mainly because of two reasons
0:01:47the first one is that it relies mostly on quantitative data so on cardinal directions
0:01:53street names and distances and these are exactly the things that we humans are trying to
0:01:57avoid when guiding each other instead we tend to rely more on landmarks according to
0:02:03previous research so on salient objects in the vicinity
0:02:06so what we would really like to have here is the shift from turn by
0:02:10turn navigation to landmark by landmark navigation
0:02:14the second reason is that the wayfinding process is inherently interactive because it happens in
0:02:19a dialogue between two humans so we would like to have more interaction
0:02:24from the wayfinding systems which led us to the idea of a spoken dialogue system that uses
0:02:29landmark navigation for wayfinding
0:02:32this would give instructions like these go forward until you see the fountain either
0:02:36glass building with some slicer and if the person gets lost and says i think i'm
0:02:40lost but i see a yellow house to my right the system should still
0:02:44be able to respond with no worries do you possibly see a park opposite to this
0:02:47yellow house
0:02:48now in order to do that the basic task is to identify that the yellow house
0:02:53to my right is really referring to something and being able to
0:02:57find this geographical object and then
0:03:00process it and respond accordingly which leads us to the basic task that we
0:03:04need to solve that of spatial reference resolution and this is the next phrase in
0:03:08the title that we have
0:03:11what we're talking about here is referring expressions so those words that people use
0:03:15when they refer to something
0:03:17like those in pink over here
0:03:19then the objects that are meant by those referring expressions the geographical objects
0:03:25are called referents like those with green frames and what we're interested in here is
0:03:31street level referring expressions so when you're walking down the street whatever you see
0:03:37those are the objects we're interested in
0:03:40and then the task of reference resolution is defined simply as resolving referring expressions
0:03:44to their referents
0:03:45now some very
0:03:47attentive listeners might say wait there is also that
0:03:51it is also a referring expression and indeed but this is a coreference so
0:03:56it refers to something that is inside the discourse that is an endophoric referring expression
0:04:00whereas in this work we're interested in exophoric referring expressions so those
0:04:03referring to something outside the discourse the geographical objects and then another problem
0:04:07here is that we can have nested referring expressions
0:04:10so have class we don't their which refers to
0:04:12the small shop and for this particular work we decided to take the maximal
0:04:16referring expressions so the largest ones in case we have nested ones
0:04:20okay so from this example it seems like it's pretty easy you just
0:04:23take noun phrases and you're done you say those are referring expressions
0:04:26is that so
0:04:27well not really
0:04:29consider these three examples for example the first one is do you know if there is
0:04:32a subway station nearby and a subway station is a noun phrase but it's
0:04:36not a referring expression and the reason is that there is no referent you're not
0:04:40meaning any specific object just any kind of subway station it can be there it can
0:04:44not be we don't know and the same goes for the
0:04:47two examples below
0:04:49then the last phrase is spacerefnet a neural approach which is sort of the
0:04:54method we're proposing here and the word neural might hint to you that there will
0:04:58be neural networks yes indeed there will be two
0:05:00and when you're thinking about neural networks you think that they're really hungry for data
0:05:05so what data do we use
0:05:07we use a dataset called spaceref
0:05:10and it was collected by letting ten users walk predefined routes and just basically
0:05:15describing the way sort of thinking aloud so like i see a red brick building
0:05:19over there i'm going down the steps and so on and so forth
0:05:22and this way
0:05:23one thousand three hundred and three geographical referring expressions
0:05:27have been collected which we're going to use for the purpose of this work so
0:05:32now we see the problem of spatial reference resolution as being decomposed into three
0:05:36stages the first one is when you have a spoken utterance you want to identify referring expressions
0:05:41in it so those words that could potentially refer to something the second step would
0:05:45be to find potential referents potential geographical objects which we call the candidate set and the third
0:05:50step would be the resolution itself so we go
0:05:52bottom to top
0:05:55so when we were thinking about referring expression identification we realised that it's actually very similar
0:06:01to named entity recognition because what you need to do is just find a specific kind
0:06:05of phrase in a sentence named entities in one case and referring expressions in the other
0:06:09well actually named entities can also be referring expressions so we were thinking
0:06:14okay then we can maybe borrow or get inspired by the methods for the named
0:06:18entity recognition
0:06:19and we started by labeling the data in the same way so with this
0:06:24famous bio
0:06:26labeling and here
0:06:27what's new is that if you have a filler word
0:06:29we can label it as o and then still have a referring expression because this way
0:06:34we can have noncontiguous referring expressions labeled as well
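The labeling scheme described here can be sketched in a few lines of Python. This is only an illustration: the label names `B-REF`, `I-REF` and `O`, the `FILLERS` list and the function `bio_labels` are assumptions, not names taken from the talk's slides.

```python
FILLERS = {"uh", "um", "eh"}  # illustrative filler list (assumption)

def bio_labels(tokens, spans):
    """Assign B-REF / I-REF / O labels to tokens.

    spans holds (start, end) token indices of referring expressions,
    end exclusive. Fillers inside a span are labeled O, so the
    expression stays one unit even though its labels are noncontiguous.
    """
    labels = ["O"] * len(tokens)
    for start, end in spans:
        labels[start] = "B-REF"
        for i in range(start + 1, end):
            if tokens[i] not in FILLERS:
                labels[i] = "I-REF"
    return labels

tokens = "now i see a big uh red building".split()
for token, label in zip(tokens, bio_labels(tokens, [(3, 8)])):
    print(f"{token:<10}{label}")
```

The filler `uh` gets `O` while the words around it still belong to the same referring expression.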
0:06:37and then we were thinking that the method could also be inspired by
0:06:42the methods for named entity recognition and yes in this case it's a neural network with
0:06:46the architecture to the right which we call refnet so let's see how it works
0:06:50say we have an utterance now i see a big red building so the first
0:06:54thing we do is we pad it to a fixed width because
0:06:57we want fixed width tensors
0:07:00then we feed it word by word to refnet
0:07:03and every word is first encoded with a fifty dimensional
0:07:08glove embedding those are pre-trained we have downloaded them and of course there are out
0:07:12of vocabulary cases and mostly those are swedish street names in our case and
0:07:18for those
0:07:19we encode the character level information using a character level bidirectional rnn
0:07:26and the reason why we're using a birnn is because
0:07:30swedish street names tend to have this bit at the end like gatan
0:07:33as part of the name meaning the street
0:07:36actually so we were hoping that this small birnn can identify those patterns
0:07:42and that we have some kind of information for those words
0:07:46lacking glove vectors
0:07:47so then
0:07:48the final encoding for every word would be
0:07:51the glove vector
0:07:52then the hidden state of the forward cell of the small character level
0:07:55birnn and the hidden state of the backward cell and we do this for each word
0:07:59so we get this sort of matrix
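The per-word encoding just described (word vector, plus the last forward and last backward state of a character-level RNN) can be sketched as follows. This is a toy under stated assumptions: weights are random and untrained, the cell is a plain tanh RNN (the talk does not name the cell type), out-of-vocabulary words get a zero word vector, and all names and sizes except the fifty-dimensional word vector are made up for illustration.

```python
import math
import random

random.seed(0)
WORD_DIM = 50             # embedding size mentioned in the talk
CHAR_DIM, HIDDEN = 8, 4   # toy sizes (assumption)
PAD = "<pad>"

def pad(tokens, width):
    """Pad (or cut) an utterance to a fixed number of tokens."""
    return (tokens + [PAD] * width)[:width]

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def rnn_last_state(inputs, w_in, w_rec):
    """Run a plain tanh RNN over the inputs and return its last hidden state."""
    h = [0.0] * len(w_rec)
    for x in inputs:
        h = [math.tanh(sum(w * v for w, v in zip(w_in[j], x)) +
                       sum(w * v for w, v in zip(w_rec[j], h)))
             for j in range(len(w_rec))]
    return h

CHAR_VECS = {}
FWD = (rand_matrix(HIDDEN, CHAR_DIM), rand_matrix(HIDDEN, HIDDEN))
BWD = (rand_matrix(HIDDEN, CHAR_DIM), rand_matrix(HIDDEN, HIDDEN))

def char_vec(c):
    # Untrained random character embeddings, created on first use.
    return CHAR_VECS.setdefault(c, [random.uniform(-0.1, 0.1) for _ in range(CHAR_DIM)])

def encode_word(word, word_vectors):
    """Concatenate word vector, forward char state and backward char state."""
    word_part = word_vectors.get(word, [0.0] * WORD_DIM)  # zeros for OOV words (assumption)
    chars = [char_vec(c) for c in word]
    return word_part + rnn_last_state(chars, *FWD) + rnn_last_state(chars[::-1], *BWD)

utterance = pad("now i see valhallavägen".split(), 6)
matrix = [encode_word(w, {}) for w in utterance]
print(len(matrix), len(matrix[0]))
```

With an empty word-vector table every word is treated as out of vocabulary, so only the character-level part carries information, which is the case the character RNN is meant to cover.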
0:08:01and of course we want to have sentence level information there as well
0:08:05so next there is a larger birnn to account for sentence level information
0:08:12at the end we get another matrix
0:08:14which we call sub sentence encodings and
0:08:19the idea here is that for each word we want to account for information that
0:08:23all the preceding words are giving and that all the succeeding words are giving
0:08:27so for example
0:08:28for the word big
0:08:30we're taking the hidden state of the forward cell
0:08:34which sort of encodes the information for all the preceding words so now i see a and the word
0:08:38big itself and
0:08:40also we take the hidden state of the backward cell that encodes information about all
0:08:44the words
0:08:45from the backward direction so big red building and a number of
0:08:49pads that we have there
0:08:53why do we need to do that
0:08:55so let's consider two examples the word in green is train
0:08:59now if you consider only the preceding words
0:09:02in both cases they're the same you can see a and you can see a so
0:09:06when you're deciding whether this word will be part of a referring expression you have
0:09:11to look at the succeeding words and in the first case station would hopefully
0:09:14indicate that this is a part of a referring expression the word train and in the
0:09:17second case departing would indicate that it's not hopefully
0:09:21and the same goes for the word turn but in the other direction
0:09:24so for turn
0:09:25the succeeding words are the same but the preceding words are different so we can hopefully
0:09:30label them differently
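Why both directions are needed can be checked on a toy bidirectional RNN. The sketch below is an illustration under assumptions: random untrained weights, a plain tanh cell, made-up sizes, and two hypothetical sentences modeled on the train example. For the word "train", the forward half of the encoding is identical in both sentences, and only the backward half tells them apart.

```python
import math
import random

random.seed(1)
DIM, HIDDEN = 6, 3  # toy sizes (assumption)

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def rnn_states(inputs, w_in, w_rec):
    """Return the hidden state of a plain tanh RNN after each input."""
    h, states = [0.0] * len(w_rec), []
    for x in inputs:
        h = [math.tanh(sum(w * v for w, v in zip(w_in[j], x)) +
                       sum(w * v for w, v in zip(w_rec[j], h)))
             for j in range(len(w_rec))]
        states.append(h)
    return states

FWD = (rand_matrix(HIDDEN, DIM), rand_matrix(HIDDEN, HIDDEN))
BWD = (rand_matrix(HIDDEN, DIM), rand_matrix(HIDDEN, HIDDEN))
WORD_VECS = {}

def word_vec(word):
    # Stand-in for the per-word encoding; random and untrained.
    return WORD_VECS.setdefault(word, [random.uniform(-1, 1) for _ in range(DIM)])

def sub_sentence_encodings(words):
    """Per word: forward state (words up to it) + backward state (words from it on)."""
    xs = [word_vec(w) for w in words]
    forward = rnn_states(xs, *FWD)
    backward = rnn_states(xs[::-1], *BWD)[::-1]
    return [f + b for f, b in zip(forward, backward)]

a = sub_sentence_encodings("you can see a train station".split())
b = sub_sentence_encodings("you can see a train departing".split())
same_left = a[4][:HIDDEN] == b[4][:HIDDEN]    # forward half for "train"
same_right = a[4][HIDDEN:] == b[4][HIDDEN:]   # backward half for "train"
print(same_left, same_right)
```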
0:09:32and then
0:09:33the part of the network
0:09:35the part of refnet that is getting us the sub sentence encoding matrix
0:09:39we dub the refnet encoder and we'll be using it later
0:09:44so then we feed the
0:09:46output of the refnet encoder through a fully connected layer followed by dropout
0:09:50and then we get the final softmax layer which gives us this kind of matrix
0:09:53where for every word we're getting a distribution over the three labels so b
0:09:59ref i ref and o then we take the maximum probability there you see in green
0:10:04which is sort of the labeling so now i and see would get o and
0:10:07a would get b ref so this is where the referring expression starts and big red building
0:10:10would get
0:10:11i ref and then all the pads would get o so this means
0:10:14that a big red building is a referring expression
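The decoding step (take the most probable label per word, then read off the expression) might look like this. The probabilities are invented for illustration, and the label names and helper functions are assumptions rather than the authors' code.

```python
LABELS = ["B-REF", "I-REF", "O"]

def decode(probabilities):
    """Pick the most probable label for every word."""
    return [LABELS[row.index(max(row))] for row in probabilities]

def extract_spans(labels):
    """Group a B-REF and the labels after it into (start, end) spans, end exclusive."""
    spans, start = [], None
    for i, label in enumerate(labels):
        if label == "B-REF":
            if start is not None:
                spans.append((start, i))
            start = i
        elif label == "O" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans

tokens = "now i see a big red building <pad>".split()
probabilities = [          # made-up softmax outputs, one row per word
    [0.05, 0.05, 0.90],    # now
    [0.10, 0.05, 0.85],    # i
    [0.10, 0.10, 0.80],    # see
    [0.70, 0.10, 0.20],    # a
    [0.10, 0.75, 0.15],    # big
    [0.05, 0.80, 0.15],    # red
    [0.05, 0.85, 0.10],    # building
    [0.01, 0.01, 0.98],    # <pad>
]
labels = decode(probabilities)
for start, end in extract_spans(labels):
    print(" ".join(tokens[start:end]))
```

Note that this simple span extraction closes an expression at the first `O`, so a filler labeled `O` inside an expression would need extra post-processing.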
0:10:18when it comes to evaluation what do we consider as a positive data point positively
0:10:24labeled data point
0:10:25so we are interested only in those cases where the whole referring expression is labeled so
0:10:30if only a part of the referring expression is then we say it's not correct
0:10:33like the second case or whatever
0:10:35but then we also noticed that there are cases where you have filler words
0:10:39in between
0:10:40and we label them with o
0:10:42for the filler words but the network sometimes tends to put i ref there and
0:10:46it's a pity to count this as a wrong example directly because some sort of
0:10:50post processing can be used so we introduce the notion of partial precision and recall
0:10:56so we say a data point is partially correct
0:10:59so a referring expression is labeled partially correctly if it starts at
0:11:03the same word
0:11:04and then it has at most one labeling error and of course it's
0:11:08longer than two words because if you start with one word and allow one labeling error it's
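One possible reading of the two correctness criteria, written as code. It assumes that gold and predicted labels cover the same words, and that a fully correct expression also counts as partially correct; the function names are made up.

```python
def exactly_correct(gold, predicted):
    """The whole referring expression is labeled exactly as in the gold data."""
    return gold == predicted

def partially_correct(gold, predicted):
    """Same first label, at most one labeling error, more than two words."""
    if gold == predicted:
        return True
    if len(gold) <= 2 or predicted[0] != gold[0]:
        return False
    errors = sum(g != p for g, p in zip(gold, predicted))
    return errors <= 1

gold = ["B-REF", "O", "I-REF", "I-REF"]            # e.g. "the uh red building"
predicted = ["B-REF", "I-REF", "I-REF", "I-REF"]   # filler mislabeled as I-REF
print(exactly_correct(gold, predicted), partially_correct(gold, predicted))
```

The example is the filler case from the talk: one `I-REF` on a filler word makes the expression wrong under the strict criterion but still partially correct.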
0:11:14then we have the baseline that we're comparing with and this is the most natural
0:11:17baseline you can think of it's just basically taking noun phrases you have an utterance
0:11:20you parse it to get all the noun phrases and you say those are the referring expressions
0:11:28let's see
0:11:29what results we had so refnet
0:11:32performs better than the baseline
0:11:34but this is not the most interesting result partial precision and recall are
0:11:38much higher for refnet than precision and recall which indicates that
0:11:41probably if we get more data we will get much better performance on just precision
0:11:46and recall so the whole architecture has potential we think
0:11:50and the second step is finding the candidate set
0:11:52the geographical objects and for that we use open street map specifically two
0:11:56types of objects in open street map ways representing mostly streets and buildings and
0:12:01nodes representing points of interest like say an entrance somewhere or a fountain or a statue
0:12:08now the way we construct the candidate set is all the objects that you
0:12:12can see
0:12:13from the point where you're standing
0:12:15so let's say you're standing over there
0:12:17and then we know the direction that you're walking in by just taking
0:12:20your
0:12:21coordinates five ten and fifteen seconds before
0:12:24so then we take
0:12:27a radius of one hundred meters and from minus one hundred to one hundred degrees and
0:12:31we call those objects visible
0:12:33and so in this case
0:12:35you get
0:12:36those objects over there
0:12:38and on average you have thirty three such objects in the candidate set
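The candidate set construction can be approximated with a little geometry. The sketch below makes several simplifying assumptions that are not from the talk: positions are planar coordinates in meters (x east, y north) rather than latitude and longitude, every object is reduced to a single point, and the walking direction is the mean direction from the three earlier positions to the current one. The 100 meter radius and the range of minus 100 to 100 degrees are the values given in the talk.

```python
import math

def heading(current, earlier_positions):
    """Walking direction in radians, averaged over the earlier positions."""
    dx = sum(current[0] - x for x, _ in earlier_positions)
    dy = sum(current[1] - y for _, y in earlier_positions)
    return math.atan2(dy, dx)

def visible_candidates(current, walk_heading, objects, radius=100.0, half_angle=100.0):
    """Names of objects within the radius and within +/- half_angle degrees of the heading."""
    visible = []
    for name, (x, y) in objects.items():
        dx, dy = x - current[0], y - current[1]
        if math.hypot(dx, dy) > radius:
            continue
        diff = math.atan2(dy, dx) - walk_heading
        # Wrap the angle difference into [-180, 180] degrees.
        relative = math.degrees(math.atan2(math.sin(diff), math.cos(diff)))
        if abs(relative) <= half_angle:
            visible.append(name)
    return visible

here = (0.0, 0.0)
earlier = [(0.0, -5.0), (0.0, -10.0), (0.0, -15.0)]  # 5, 10 and 15 seconds ago
objects = {
    "fountain": (10.0, 50.0),   # ahead and close
    "church": (0.0, -40.0),     # directly behind
    "tower": (0.0, 300.0),      # ahead but too far away
}
print(visible_candidates(here, heading(here, earlier), objects))
```

A real implementation would work on OpenStreetMap ways and nodes with proper geodesic distances; the filtering logic stays the same.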
0:12:43and then each object we're going to encode
0:12:47in the following way so first
0:12:49we have taken four hundred and twenty seven
0:12:51automatically derived binary features from open street map and the way they were derived
0:12:55is by considering
0:12:56the osm tags over here both keys and values and a typical one could
0:13:00be building with the value university
0:13:02and this would get one of those slots
0:13:04of zeros and ones over there and we also take the normalized distance feature and then
0:13:10also take the normalized sweep width sweep width being how much of your visual field this
0:13:15particular object occupies and we divide it by three hundred and sixty degrees
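A candidate's feature vector as described (binary tag features, then distance, then sweep width) could be assembled like this. The three-entry vocabulary stands in for the 427 features mentioned in the talk, and dividing the distance by the 100 meter radius is an assumption, since the talk only says the distance is normalized.

```python
def encode_candidate(tags, distance_m, sweep_deg, vocabulary, max_distance_m=100.0):
    """Binary key=value features followed by normalized distance and sweep width."""
    binary = [1.0 if tags.get(key) == value else 0.0 for key, value in vocabulary]
    return binary + [distance_m / max_distance_m, sweep_deg / 360.0]

# Tiny illustrative vocabulary of (key, value) pairs.
VOCABULARY = [("building", "university"), ("amenity", "fountain"), ("leisure", "park")]

vector = encode_candidate(
    tags={"building": "university", "name": "main building"},
    distance_m=50.0,
    sweep_deg=36.0,
    vocabulary=VOCABULARY,
)
print(vector)
```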
0:13:20so this is the second network as promised
0:13:23it's called spacerefnet and this one is a bit easier so it operates on
0:13:28pairs so on pairs of a referring expression and a candidate
0:13:33for example we have a referring expression the bus station
0:13:35and a candidate set which is just three objects here because it's hard to
0:13:38put thirty three there
0:13:41it starts by encoding the bus station using the refnet encoder as we've
0:13:45seen before and having the sub sentence encoding matrix it takes the last
0:13:49hidden state of the forward cell and the first hidden state of the backward cell and concatenates them
0:13:53and this is the representation of your referring expression
0:13:57then it takes each candidate as we're operating on pairs referring expression candidate
0:14:01the first candidate in this case and represents it with those osm features distance
0:14:06and sweep width as we've seen just a couple of slides before
0:14:09then we concatenate all of those
0:14:11put it through a fully connected layer and have a
0:14:14final softmax or rather a sigmoid prediction as this is a binary classification and
0:14:19we have the label
0:14:20between zero and one or rather zero or one zero meaning that the referring
0:14:24expression and the candidate do not match and one meaning that they do
0:14:28so resolving one referring expression
0:14:30would involve on average thirty three binary classification problems being solved and after we're done
0:14:36hopefully the first one is being labeled as a
0:14:40referent for this referring expression because it is a bus station
0:14:44and then we do the same thing with the second one
0:14:47and hopefully the second and the third are labeled as zero
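The pairwise classification can be sketched as a forward pass: concatenate the expression vector with one candidate's features, apply a fully connected layer, and squash to a match probability with a sigmoid. Everything below is illustrative: weights are random and untrained, so the scores are meaningless, and the ReLU activation, layer sizes, 0.5 threshold and all names are assumptions.

```python
import math
import random

random.seed(0)

def dense(x, weights, biases, activation):
    """Fully connected layer: activation(W x + b)."""
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_params(input_dim, hidden_dim):
    def rand():
        return random.uniform(-0.5, 0.5)
    return {
        "w1": [[rand() for _ in range(input_dim)] for _ in range(hidden_dim)],
        "b1": [0.0] * hidden_dim,
        "w2": [[rand() for _ in range(hidden_dim)]],
        "b2": [0.0],
    }

def match_probability(expression_vec, candidate_vec, params):
    """Score one (referring expression, candidate) pair."""
    x = expression_vec + candidate_vec
    hidden = dense(x, params["w1"], params["b1"], relu)
    return dense(hidden, params["w2"], params["b2"], sigmoid)[0]

def resolve(expression_vec, candidates, params, threshold=0.5):
    """One binary decision per candidate; zero, one or several may match."""
    return [name for name, vec in candidates.items()
            if match_probability(expression_vec, vec, params) >= threshold]

expression = [0.2, -0.1, 0.4, 0.3]           # stand-in for the two concatenated hidden states
candidates = {                               # [tag features..., distance, sweep width]
    "bus_station": [1.0, 0.0, 0.3, 0.10],
    "park":        [0.0, 1.0, 0.8, 0.25],
    "cafe":        [0.0, 0.0, 0.5, 0.02],
}
params = init_params(input_dim=8, hidden_dim=5)
scores = {name: match_probability(expression, vec, params) for name, vec in candidates.items()}
print(scores)
```

Because each pair is classified independently, nothing forces exactly one candidate to be selected, which fits the multiple-referent behaviour shown later in the demo.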
0:14:54and now what kind of baseline do we have to compare to
0:14:57that's a pretty straightforward baseline the first thing you could think of
0:15:01so you take a referring expression like a very nice big park
0:15:05split it by space then you lowercase it and remove stopwords which gives a set
0:15:08of words like very nice big park
0:15:10and then you look at the open street map tags for every candidate
0:15:15and if any word from this set that we got in the previous step
0:15:19appears in either a tag key or a value
0:15:22we say it's a match
0:15:23otherwise it's not a match
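The word-matching baseline is simple enough to write out. The stopword list here is a small illustrative one, and splitting tag keys and values on underscores (so that `bus_station` yields `bus` and `station`) is an assumption about how "appears in" is meant.

```python
STOPWORDS = {"a", "an", "the", "of", "to", "my", "in", "on"}  # illustrative list (assumption)

def content_words(expression):
    """Split by space, lowercase, and drop stopwords."""
    return {w for w in expression.lower().split(" ") if w and w not in STOPWORDS}

def tag_words(tags):
    """All words found in tag keys and values, split on underscores (assumption)."""
    words = set()
    for key, value in tags.items():
        words.update(key.lower().split("_"))
        words.update(value.lower().split("_"))
    return words

def baseline_match(expression, tags):
    """A candidate matches if any content word appears among its tag words."""
    return bool(content_words(expression) & tag_words(tags))

print(baseline_match("a very nice big park", {"leisure": "park"}))
print(baseline_match("a very nice big park", {"building": "university"}))
```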
0:15:27and these are the results
0:15:28we also compared it with another method previously reported in the literature that's called words as
0:15:33classifiers and spacerefnet performs better
0:15:40which is where you stop sleeping and probably this is why tuple for everybody
0:15:47many things can go wrong so
0:15:48let's see what works
0:15:52the blue dot will represent my position where i'm standing
0:15:56so i'll just put myself
0:15:58just near the building where we are
0:16:07and say an utterance like i'm
0:16:10standing near the university
0:16:14different number
0:16:15the one in green is the work of refnet so we found
0:16:20a referring expression in the utterance
0:16:23now we take the data from the open street map
0:16:30these are all the sort of
0:16:32objects that are present in open street map
0:16:36now we assume that we're looking north
0:16:38so this is the direction we're sort of
0:16:40going to
0:16:42now we're trying to resolve
0:16:45the reference
0:16:47yes i
0:16:48so those objects in orange denote the candidate
0:16:53set so these are the objects that have been considered by us
0:16:56to be possible referents and the one in green denotes the actual referent according to spacerefnet
0:17:02and this is the building we're in exactly
0:17:05so if we move
0:17:07the dot down over here
0:17:10and try to say
0:17:12i see
0:17:14the fountain
0:17:16in front of me
0:17:19and do the same trick
0:17:22we see again everything in orange is the candidate set
0:17:26and the one in green there denotes the actual fountain so it
0:17:30does not only find the ways so the buildings but also the nodes so the
0:17:34points of interest
0:17:36then if we go in a bit different direction
0:17:40and say
0:17:44kth for example
0:17:48we see that it's capable of also finding multiple referents because sometimes
0:17:53a referring expression can be ambiguous so it can be the case that you get more
0:17:58than one referent you can't also the case if you give the reference what
0:18:06it's of course not perfect
0:18:08because with sixty four percent precision it cannot be perfect so let's see
0:18:11where it is imperfect
0:18:14if i say something where i'm standing
0:18:19and the bits
0:18:22cool on the mean things
0:18:30so somehow for some reason it selects parts of the street
0:18:34so i mean some streets not all of them we don't know why yet but
0:18:37this is also a research question for us
0:18:40to understand why in this case it selected something like eight objects because the
0:18:44streets are actually not contiguous objects for some reason in open
0:18:48street map the streets are just
0:18:50stored as bits sort of just parts of the streets and not as one contiguous
0:18:54street which makes
0:18:56definitely our job harder
0:18:59right hand down
0:19:01we try one more
0:19:08here somewhere
0:19:11and we say
0:19:12i see the church for example
0:19:17i mean in some cases it does not actually identify the object although
0:19:21the church is up there you see probably the cross
0:19:23but then
0:19:25if you come a bit closer
0:19:30it still doesn't work
0:19:36right okay out of course doesn't
0:19:39when it did work because it was very hard for example
0:19:43if you say i see the church entrance
0:19:47it works on that one
0:19:49so it's sort of sensitive and we don't know why yet but this
0:19:52raises a number of research questions we'll address in the future thank you very much
0:20:06thank you very much for the very interesting talk and the
0:20:10we can call so we don't fall asleep again
0:20:13any questions
0:20:20thank you for your talk that was great i was wondering in one of
0:20:24the earlier slides you had an example where the person said now i see a and then
0:20:28there was like an explicit reference to the
0:20:32the object or the building
0:20:34and i was just wondering can you handle purely anaphoric references like
0:20:38if the person had just said now i see it
0:20:40no it's just exophoric reference that we're handling in this paper
0:20:44we consciously excluded anaphoric references because we think it's a separate problem with separate that's okay
0:21:01well time to work are uses like to have a question the back to the
0:21:07this like
0:21:09the powers for a what happen if it's close that we consider the distance i
0:21:15for some object like a church or something
0:21:19if the user might say that i see a church
0:21:22well maybe just a direct form a really marcus that's
0:21:25also very short distance
0:21:27that is true so as i said previously the
0:21:32way we sort of define the candidate set is
0:21:35we take a fixed radius of one hundred meters in this case so if it's
0:21:39really far and the user says that if we do not increase the radius
0:21:42it will not be able to track it currently
0:21:45i hope that answers the question
0:21:48thank you for the talk in the demo you had a couple of
0:21:53examples in which you know it was i'm standing near the university and
0:21:59then there was another one with kth where
0:22:03these are
0:22:04denoting
0:22:06a large geographical object right
0:22:11but then especially in for example the first case you resolved it to
0:22:14the building we are in
0:22:18and i'm just wondering if you can speculate how
0:22:23these sort of references can be you know they are really context-dependent right so that
0:22:31you identified that building but actually i am
0:22:34like in the corner of the campus and i'm near to the whole university in
0:22:39a sense right
0:22:41true so the first thing is again we have this radius of one hundred meters
0:22:45we cannot get the whole university that is the first limitation we have the
0:22:48second is again this was more to show the imperfection of the system rather than
0:22:52the perfection so actually when you say university and it identified
0:22:56a building it was just one because we're in this building
0:22:58but really you have also the building on the right-hand side which it didn't identify
0:23:02and this is more to show that you know it's imperfect and has sixty one
0:23:05percent precision so we sort of still have a way to go
0:23:13okay how would improve okay by ice so that so okay a i'm not as
0:23:18effective requesting you press so the obvious thing is to collect more data and try
0:23:22to try to train the same thing and see if it works
0:23:25and the second thing might be
0:23:29you know
0:23:30probably to take not only those objects that are in the immediate vicinity
0:23:35but this will probably be harder because you know computationally it will
0:23:39become infeasible i guess
0:23:41because you know you have to identify which of the objects you
0:23:44i mean you still need to have some notion of visibility so to identify
0:23:48which of these you can potentially refer to and then you have collision computations
0:23:52you have the line of sight and you need to see if it collides with a specific object and
0:23:55that obstructs the vision and all of that
0:23:59i don't know if that answers the question probably not it seems like it is
0:24:12we can take it later because we sort of have a time limit here but
0:24:15we can take it later we have of the right
0:24:17okay thank you
0:24:19thanks i think that's all the time we have so let's thank the
0:24:22speaker again