Hello everyone, my name is Yanchao Yu, and I'm from the Interaction Lab at Heriot-Watt University in Edinburgh. Today I want to talk about our paper on training an adaptive dialogue policy for interactive learning of visually grounded word meanings.
In this talk I want to cover two main aspects. First, I'll give an overview of the system architecture, and then show a short movie of how the system works. Second, based on this architecture, we investigate the effectiveness of different dialogue strategies and policies in the interactive learning process, and based on that investigation we train an adaptive dialogue policy.
Now let's move to the motivation. What we want to do is build a teachable multimodal system which can learn from individual users using their natural language, through utterances like "this is a red square" or "what does this look like?". And the system learns everything, the visual concepts and the knowledge, online through live interactions with users, rather than from text-based descriptions or manual annotations. Also, the system uses really small amounts of training data, maybe just one or a few examples.
We also put the system in a really different position: we put it in the position of a child, rather than a second-language learner. A second-language learner already has all of the visual knowledge: it knows what colour means and what shape means, and all it needs to do is associate that visual knowledge with specific words or phrases in another language. But a child is quite different, because it doesn't have any of that knowledge, and it has to learn the word meanings, both the language and the visual concepts, from scratch.
As we know, there is a lot of recent work trying to deal with the symbol grounding problem: some approaches try to generate natural language descriptions of images or videos, and others try to identify or describe visual objects using visual features like colours, shapes or materials. But to our knowledge, none of these methods alone is enough for a teachable robot or multimodal system, and these aspects should be combined together.
So here we present a table comparing our project with others. As we can see, almost all other works focus on only one or some of the aspects in this table, but our work considers all of them, including interaction, online learning, natural language, and incrementality.
Now let's move to the system architecture. It is a really general architecture that combines a vision module and a DS-TTR dialogue module. On the left we can see a set of visual classifiers, which ground the semantic representations in the language processing module. The system feeds the predictions from the classifiers into the dialogue: the visual observation module produces a semantic analysis of the scene, and this then becomes the non-linguistic context of the dialogue, used for parsing and generation, for example for pronoun or reference resolution. On the other side, the DS-TTR dialogue module parses the dialogue with the users, and through this parsing, all object judgements are used as labels to update the classifiers incrementally during the interaction.
Now let's talk about the vision module. The vision module extracts high-dimensional feature vectors, including HSV colour-space features for colour and bag-of-visual-words features for shape. It then incrementally trains a binary classifier for each visual attribute, using logistic regression with stochastic gradient descent. Finally, after classification, it produces the visual context based on the predictions and the corresponding confidence scores, and tries to ground the atomic semantic items onto particular classifiers.
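As a rough illustration of that incremental training scheme, here is a minimal sketch, not the actual system's code; the class name, feature layout and learning rate are my assumptions:

```python
# Sketch of per-attribute binary classifiers (illustrative only): one
# logistic-regression classifier per visual attribute ("red", "square", ...),
# updated one labelled example at a time with stochastic gradient descent.
import numpy as np

class AttributeClassifier:
    """Binary logistic regression trained incrementally via SGD."""
    def __init__(self, n_features, lr=0.1):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        # Confidence that the attribute applies to this object
        return 1.0 / (1.0 + np.exp(-(self.w @ x + self.b)))

    def update(self, x, label):
        # label is 1 (attribute applies) or 0 (it does not)
        err = self.predict_proba(x) - label
        self.w -= self.lr * err * x
        self.b -= self.lr * err

# New classifiers are created on the fly when a new word is encountered
classifiers = {}

def observe(word, features, label):
    if word not in classifiers:
        classifiers[word] = AttributeClassifier(len(features))
    classifiers[word].update(np.asarray(features, float), label)
```

After a handful of positive and negative examples arriving from the dialogue, the "red" classifier's confidence separates red from non-red feature vectors, which is the behaviour the talk relies on.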
As for the DS-TTR dialogue module: it parses with Dynamic Syntax, a word-by-word incremental semantic model of dialogue covering both parsing and generation, and it produces semantic and contextual representations as Type Theory with Records (TTR) record types. What I want to highlight here is that this is quite similar to Kennington and Schlangen's work, but where they ground individual words to the classifiers, we ground whole TTR record types, the logical forms.
Okay, here's an example of the incremental parser. This graph shows the DAG for the dialogue context; it tracks the dialogue context from all of the participants in the dialogue, including the learner and the tutor. Each node here represents the semantic state, the TTR record type, at a particular point, and each edge represents a particular word parsed by the DS-TTR parser. So when we get a new word, the graph grows: it gets a new node and the record type is updated again, and so on. Here there is an utterance with a final record type for the question "what is this?", and the learner then says "square". The parser continues and updates the context by resolving the question type with the answer "square". And because the tutor then says "yes" as a kind of acknowledgement, the content of the previous dialogue has been grounded by that judgement, and the learner checks it into the dialogue context.
Here is another example, about grounding natural language semantics in the visual classifiers. The system captures an object from the webcam, extracts features, pushes them into the classifiers, and gets a set of prediction labels with corresponding confidence scores. Based on that, instead of using a probability distribution, we use binary classifiers for each attribute, so we just pick the label with the highest score for each group. We use this to generate a visual context as a TTR record type: here, the label "red" with a score of 0.75, and the prediction "square" from the square classifier with 0.88. When we push this into the generator, that means: "I can see a red square."
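The per-group selection just described can be sketched as follows; this is an illustration under my own naming assumptions, with the 0.75 and 0.88 scores taken from the example and the other labels invented:

```python
# Illustrative sketch: each attribute has its own binary classifier, so
# there is no distribution over a group; the visual context is built by
# taking, per attribute group, the label with the highest confidence score.
def build_visual_context(scores, groups):
    """scores: {label: confidence}; groups: {group_name: [labels]}."""
    context = {}
    for group, labels in groups.items():
        known = [l for l in labels if l in scores]
        if known:
            best = max(known, key=lambda l: scores[l])
            context[group] = (best, scores[best])
    return context

scores = {"red": 0.75, "green": 0.31, "square": 0.88, "circle": 0.22}
groups = {"colour": ["red", "green", "blue"],
          "shape": ["square", "circle", "triangle"]}
```

With these scores, the context picks "red" for colour and "square" for shape, which the generator can then verbalise as "I can see a red square."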
In our system, the vision module treats all of the visual attributes equally, assigning a binary classifier to each; it does not identify the meaning of a specific word. But in the language processing module, the grammar knows that they are in different categories: "red" is a kind of colour and "square" is a kind of shape. And, as we mentioned already, since our system is in the position of a child, we don't need to hand-build the mappings between the classifiers and the semantic items. Instead, we create new types of classifiers on the fly for the new semantic items and concepts we encounter in the dialogue, and we retrain them incrementally through the interaction.
Now I want to show you how the system works. [The speaker plays a demo video.] Here is a really simple dialogue. The system has the dialogue on the left; it is very simple, the tutor just teaches specific words about colour or shape. Then we start a new dialogue to test what the system has learned from the previous one. We get an object and we get the visual context: based on the visual features we get a set of classification results, and based on those results we generate the window that shows the visual context produced by the classification. Then this dialogue context window shows the TTR record type parsed by the DS-TTR parser from the previous tutor utterances. And this window shows the generation goal: the system shapes the answer by unifying the dialogue context and the visual context, and we plug that into the generator to get the final sentence, like "red square". Because of the time, let me finish the video here. What I want to highlight is that we are using very simple colours and shapes in this video, but the modules form a really general framework, and it should scale to more complex visual scenes and classifiers in future work.
Now let's move to the experiments. In the experiments, we aim to explore the effectiveness of different dialogue capabilities and policies on learning grounded word meanings, with three factors: uncertainty, context-dependency, and initiative. Then, based on that exploration, we learn an adaptive dialogue strategy which takes the context, that is, the reliability of the classifier results, into account.
In experiment one, we designed a 2x2x2 factorial experiment that considers three factors. The first one is initiative, which determines who takes the initiative in the dialogue. The second is context-dependency, which determines whether the learner can process and produce context-dependent expressions, like short answers or incrementally constructed turns, as in the example here. And then we considered uncertainty, which determines whether and how the classification confidence scores affect the learner's dialogue behaviour.
As we know, for each classification the system gets a set of confidence scores along with the predictions. An agent that considers uncertainty tries to find the point where it can believe its own predictions; we call this point the confidence threshold. Such an agent behaves a bit like active learning, and like a human: it only asks questions when it is not very sure about its answers or predictions. On the other hand, in the condition without uncertainty, the agent always seeks confirmation or more information from the tutor, so it is more costly as well. And as we know, the classification scores are not always reliable, especially at the very beginning, when there are not many training examples; this reliability obviously improves during interactive learning as more and more examples come in. So the agent that considers uncertainty takes the risk of missing some information from the users: it may not ask any questions even when its answer is wrong.
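The threshold-driven behaviour described here can be sketched in a few lines; this is my own minimal illustration of the general scheme, not the exact rules of the system, and the 0.7 default is an invented value:

```python
# Uncertainty-driven dialogue act selection (illustrative sketch): the
# learner asks the tutor for confirmation only when its best classifier
# confidence falls below the confidence threshold; otherwise it trusts
# its own prediction and asserts it.
def choose_act(best_label, confidence, threshold=0.7):
    if confidence < threshold:
        return ("ask", f"Is this {best_label}?")    # seek confirmation
    return ("assert", f"This is {best_label}.")     # believe own prediction
```

An agent in the no-uncertainty condition corresponds to always taking the "ask" branch, which is why that condition is more costly for the tutor.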
To evaluate the interactive learning performance, we came up with a metric that integrates the classification accuracy and the tutoring cost, that is, the effort made by the tutor in the interaction with the system. We define an overall performance score for the increase in accuracy against the cost to the tutor, and this score captures the trade-off between accuracy and cost. If we plot these values on a graph, we get a curve, and the score can be represented by the gradient of that curve. What we want to do is find, or learn, a suitable dialogue strategy that maximizes this performance score.
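One simple way to compute such a gradient score, sketched here as a least-squares slope of accuracy against cumulative cost; the helper name and the example numbers are mine, not from the paper:

```python
# Performance score as the gradient of the accuracy-vs-cost curve:
# a higher slope means more accuracy gained per unit of tutoring cost.
def performance_score(costs, accuracies):
    """Least-squares slope of accuracy over cumulative tutoring cost."""
    n = len(costs)
    mean_c = sum(costs) / n
    mean_a = sum(accuracies) / n
    num = sum((c - mean_c) * (a - mean_a) for c, a in zip(costs, accuracies))
    den = sum((c - mean_c) ** 2 for c in costs)
    return num / den
```

For instance, a policy that raises accuracy from 0.5 to 0.9 over 200 cost units scores 0.002 accuracy per unit; a cheaper policy achieving the same gain would score higher.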
Okay, here are the results from experiment one. The x-axis represents the cost paid by the tutor per training instance, and the y-axis represents the accuracy. From this graph we can see that the agents with the learner taking the initiative, with uncertainty, the green and the blue curves, perform much better than the others. However, because those policies take more risks, they cannot get feedback and answers from the tutor for everything, so they cannot reach the really high accuracy the others achieve: the others reach nearly 0.9, while these ones reach around 0.75 or so. So we conclude that, because the confidence scores are not really reliable over the whole learning process, the threshold shouldn't be kept constant over the whole learning task. We assume that a confidence threshold that can change dynamically over time should lead to a better trade-off between accuracy and cost.
Therefore, we trained an adaptive dialogue policy using an MDP model and reinforcement learning. Because of the time, I cannot go into the details here, so please find them in the paper.
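To give a feel for the idea, here is a toy sketch, emphatically not the model in the paper: tabular Q-learning where the state is a coarse "amount of training data" bucket, the actions adjust the confidence threshold, and the reward trades accuracy gain against tutoring cost. All names, state/action choices and hyperparameters are my own assumptions:

```python
# Toy adaptive-threshold policy learned with tabular Q-learning.
import random

ACTIONS = [-0.1, 0.0, +0.1]          # lower / keep / raise the threshold

def train_policy(simulate_reward, n_states=5, episodes=2000,
                 alpha=0.1, gamma=0.9, eps=0.2, seed=0):
    rng = random.Random(seed)
    Q = [[0.0] * len(ACTIONS) for _ in range(n_states)]
    for _ in range(episodes):
        for s in range(n_states):
            # epsilon-greedy action selection
            a = (rng.randrange(len(ACTIONS)) if rng.random() < eps
                 else max(range(len(ACTIONS)), key=lambda i: Q[s][i]))
            r = simulate_reward(s, ACTIONS[a])
            s2 = min(s + 1, n_states - 1)   # more data arrives over time
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
    # greedy policy: threshold adjustment per data-amount bucket
    return [ACTIONS[max(range(len(ACTIONS)), key=lambda i: Q[s][i])]
            for s in range(n_states)]
```

With a simulated reward that favours asking more early (high threshold) and trusting predictions later (low threshold), the learned policy raises the threshold in early states and lowers it in late ones, matching the intuition behind the adaptive strategy.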
So here are the results. In this experiment we kept all of the other conditions constant: the learner takes the initiative and takes uncertainty and context-dependency into account as well. From the results we can see that the adaptive strategy, the right curve, achieves much higher accuracy much faster. It cannot really beat the best constant-threshold policy; the gap in final accuracy is not really big, and the constant policy is already good enough. But what we found is that the adaptive policy reaches high accuracy much faster, especially over the first thousand units of cost, where we can see it is much better. So we conclude that an agent with the adaptive strategy is more feasible in the interactive learning task.
In conclusion, in this paper we built a fully integrated multimodal interactive teachable system for natural language grounding, and we trained a prototype adaptive dialogue strategy for word-meaning grounding. We investigated the impact of different strategies and conditions on the learning process, and we found that the learned policy, which takes the uncertainty into account via an adaptive threshold, shows the best overall performance. In future work, we are trying to refine and train the dialogue policy with human tutors, using a human data collection, and to train two agents, a tutor and a learner, at the same time. Then we want to learn a word-level adaptive dialogue policy using reinforcement learning based on the DS-TTR module. And finally, in order to deal with previously unseen words, features and concepts, we are trying to integrate distributional semantics into the system.
And here is a list of references. Thanks for your attention. Thank you very much.
[Audience question]

Actually, the main reason we use it is that we are considering the uncertainty from the visual knowledge. We have thought about using different things as well, for example entropy or something else, to manage the reliability of the classifiers. The point is that we build all of the classifiers from scratch, so a classifier has only one or two examples at the very beginning. So we tried this kind of strategy: we assign a high threshold at the very beginning so that the agent asks more questions and gets more information, and then when it gets more examples, we just reduce the threshold and try to get rid of the questions that the learner doesn't really need. And what we want to support is this kind of scenario: imagine you buy a robot for your home, and you just teach the robot all of the information from the user's perspective. We consider that situation because everyone has different knowledge about visual things, and that is why we need to consider it for the confidence threshold part.
[Audience question]

Well, actually, that's a very good question. In this case we didn't really think about that: we just think about the overall uncertainty changes in the visual knowledge, rather than in the ASR as well. I think I will try to figure that out.
[Audience question]

To be honest, currently not yet; that is maybe future work.
[Audience question]

Sorry, so your question is how I generate the representations for the objects, right? We just use MATLAB: we get the HSV colour-space bins for the colour, and the bag of visual words for the shape; we build a kind of dictionary ourselves, then we get the frequency of each feature over the pixels and put them together to generate the feature vectors.
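As a rough illustration of the kind of feature vector described in that answer, here is a sketch of a quantised HSV histogram; the bin counts and layout are my assumptions, not the exact pipeline:

```python
# Sketch: quantise each pixel's HSV values into a small dictionary of bins
# and count frequencies, so an image becomes a fixed-length histogram
# feature vector suitable for the attribute classifiers.
import numpy as np

def hsv_histogram(pixels_hsv, bins=(8, 4, 4)):
    """pixels_hsv: (N, 3) array with H, S, V each in [0, 1]."""
    p = np.clip(np.asarray(pixels_hsv, float), 0.0, 1.0 - 1e-9)
    idx = (p * np.array(bins)).astype(int)            # per-channel bin index
    # Combine the three channel indices into one dictionary entry
    flat = (idx[:, 0] * bins[1] + idx[:, 1]) * bins[2] + idx[:, 2]
    hist = np.bincount(flat, minlength=bins[0] * bins[1] * bins[2])
    return hist / hist.sum()                          # normalised frequencies
```

A uniformly coloured object then concentrates its mass in a single bin, which is exactly the kind of signal the binary colour classifiers can pick up.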
[Audience question]

Well, actually, we know there are a lot of people working on classification using deep learning or convolutional neural networks. But we are trying to model a child: it doesn't have any knowledge, and it has to learn all of the classifiers from scratch. And we don't really want the system to already know the meanings, what the group of colours is, or what the shapes are. So we use a binary classifier for each attribute, and treat them all equally. Afterwards, through the interaction, as we get more knowledge, the system can figure out that "red" is a kind of colour, and then when we get a new feature like "yellow", it knows yellow is quite similar to red, so it is also in the same group.
0:24:47right
0:24:53that's not really the weights
0:24:56you mean that the distribution no
0:25:00results right across all of the classifiers in the same group right
0:25:05that's different where we just use in the country the binary and it the all
0:25:09of the even the shamanic colours they are encoding
0:25:13it doesn't have any difference between