Speech Transcript - Learning Natural Language Interfaces with Neural Models

0:00:17	but we have a session sure
0:00:24	okay thank you got real so that we want to the office keynote
0:00:34	so using this time i vector you do not use
0:00:37	the first keynote
0:00:40	the first keynote speaker is mirella lapata
0:00:43	professor at the
0:00:44	school of informatics university of edinburgh
0:00:47	she is a professor of natural language processing in the school
0:00:52	of informatics at the university of edinburgh okay
0:00:55	her research focuses
0:00:59	on getting computers to understand reason with and generate natural language so she will talk about
0:01:06	these kinds of research activities
0:01:11	there's more information in the proceedings but
0:01:16	she doesn't
0:01:18	okay right okay you can hear me all right
0:01:22	okay
0:01:23	right so as was being said earlier this talk is gonna be
0:01:27	about learning
0:01:28	natural language interfaces with neural models
0:01:31	and so i'm gonna give you a bit of
0:01:33	an introduction as to what these natural language interfaces are
0:01:38	and then we're gonna see how we build them what problems are related to them
0:01:41	and you know what future lies ahead
0:01:44	okay so what is a natural language interface it's the most intuitive thing one wants
0:01:51	to do to a computer
0:01:53	you just want to speak to it the computer in an ideal world understands and
0:02:00	executes what you wanted to do
0:02:02	and this
0:02:03	believe it or not is like one of the first things that people
0:02:06	wanted to do with nlp so in the sixties
0:02:10	when we didn't have computers the computers didn't have memory
0:02:13	we didn't have neural networks none of this
0:02:15	the first systems that appeared out there
0:02:19	had to do with
0:02:21	speaking to the computer
0:02:23	and
0:02:23	getting some response so green et al in nineteen fifty nine
0:02:29	presented this system called the conversation machine
0:02:32	and this was a system that was having conversations with a human can people guess
0:02:38	or know what about
0:02:41	the weather
0:02:43	well it's always the weather
0:02:45	first the weather and then everything else so then they said okay the
0:02:49	weather is a bit boring let's talk about baseball
0:02:51	and these were very primitive systems they just had models they had grammars you know
0:02:56	it was all manual but the intent was there we want to communicate with computers
0:03:02	a little bit more formally
0:03:04	what the task entails
0:03:06	is we have a natural language
0:03:08	and natural language has to be translated
0:03:11	by what you see the arrow there by a parser you can think of it as a
0:03:15	model or some black box that takes the natural language
0:03:18	and translates it
0:03:19	to something
0:03:21	the computer can understand
0:03:22	and this cannot be natural language it must be
0:03:25	either sql or lambda calculus or some internal representation that the computer has
0:03:33	to give you an answer
0:03:34	okay
0:03:35	so as an example
0:03:37	and this again has been very popular within the semantic parsing field you query a
0:03:42	database
0:03:44	but you actually don't want to learn the syntax of the database and you don't
0:03:47	want to learn sql you just ask the question what are the capitals
0:03:51	of states bordering texas
0:03:53	you translate this into this logical form you see down there
0:03:58	okay you don't need to understand this is just something that the computer understands you
0:04:02	can see there are variables it's a formal language
0:04:05	and then you get the answer and i'm not gonna tell you the answer you
0:04:07	can see here texas is bordering a lot of states
0:04:11	now apart from asking databases questions another task and this is an
0:04:18	actual task that people have deployed in the real world
0:04:21	is instructing a robot to do something that you want it to do
0:04:25	again this is another example you can tell the robot if you have
0:04:29	one of these little robots that make you coffee and you know go up and
0:04:32	down the corridor
0:04:33	you can say at the chair move forward three steps past the sofa
0:04:38	again the robot has to translate this into some internal representation that it understands
0:04:43	in order not to crash against the sofa
0:04:48	another example is actually doing question answering and
0:04:52	there are a lot of systems like this using a big knowledge base like
0:04:58	freebase which doesn't exist anymore
0:05:01	it's called knowledge graph
0:05:03	but this is a huge graph with millions of entities and connections between them
0:05:07	and this is what google is using
0:05:09	when you ask a question i mean they have many modules but one of
0:05:13	them is that
0:05:14	so
0:05:14	one of the questions you may want to ask is who are the male actors in
0:05:18	the titanic and again this has to be translated
0:05:21	in some language
0:05:23	that freebase or your knowledge graph understands and you can see here this is expressed
0:05:27	in
0:05:28	lambda calculus but you have to translate it into some sql that freebase
0:05:33	again
0:05:34	understands
0:05:35	so you see there are many applications in the real world that
0:05:39	necessitate semantic parsing or some interface with a computer
0:05:44	and
0:05:45	here comes the man himself so bill gates
0:05:48	so mit publishes this technology review it's actually
0:05:54	very interesting i suggest that you take a look
0:05:56	and it's not very mit centric they talk about many things
0:06:00	and so this year they went and asked bill gates they said to him
0:06:04	okay what do you think are the new technological breakthroughs the inventions of two
0:06:09	thousand nineteen
0:06:10	that will actually change the world
0:06:12	and so if you read the review
0:06:14	he starts by saying you know i want to be able to detect premature babies
0:06:19	fine
0:06:20	then he says you know a cow-free burger
0:06:23	so no meat
0:06:24	you make burgers so you know because the world has so many animals
0:06:28	then he talks about drugs for cancer and the very last
0:06:33	is
0:06:35	smooth-talking ai assistants so semantic parsing comes last which means that you know it's
0:06:41	very important to bill gates
0:06:43	now
0:06:44	i don't know why i mean no why
0:06:47	but anyway he thinks it's really cool
0:06:50	and
0:06:50	of course it's not only bill gates
0:06:53	every company you can think of has a smooth-talking ai assistant or is
0:06:58	working on one
0:07:00	or has it in the back of their head or they have prototypes
0:07:02	and there's so many of them
0:07:05	so alexa is your sponsor
0:07:08	there is cortana apple has siri google
0:07:13	decided to be different and they call it google home not some female name
0:07:17	thank god
0:07:18	so there are gazillions of these things
0:07:21	and can i see a show of hands how many people have one of them at home
0:07:27	very good
0:07:28	do you think do you think they work
0:07:31	how many of you think they work
0:07:35	exactly so i have one of these things it sets alarms for me all the time
0:07:42	i mean they work if you're in the kitchen you say alexa set a timer
0:07:45	for half an hour
0:07:47	or you have to monitor the kids' homework
0:07:49	but
0:07:50	we want these things to
0:07:52	go beyond simple commands
0:07:56	now i'll just show here
0:07:58	and the reason why there's so much talk about these smooth-talking ai
0:08:01	assistants is because
0:08:03	the impact they could have in society for
0:08:06	less able people for people who cannot see for people who you know are
0:08:10	disabled
0:08:11	is actually pretty huge if it worked
0:08:14	now i'm gonna show here
0:08:18	a video
0:08:19	the video is a parody of amazon alexa
0:08:23	and you see it and then you understand immediately
0:08:26	why
0:08:28	there's no sound
0:08:30	hello
0:08:33	we check the sound as well before
0:08:39	should i do something
0:08:44	the volume is raised
0:08:48	to the max
0:08:53	amazon and everyone asking for help
0:09:05	the latest technology isn't always easy to use for people of a certain age
0:09:12	that's why amazon partnered with aarp to present amazon echo silver the only
0:09:20	smart speaker device designed specifically to be used by the greatest generation it's super loud and
0:09:26	responds to any name even remotely close to alexa
0:09:31	and there is a forty i agree i
0:09:48	i
0:09:58	i
0:10:03	no
0:10:05	using hold true
0:10:08	one two three
0:10:21	this is like your thermostat i was set to ten
0:10:25	i one
0:10:28	i feel may have
0:10:30	amazon echo silver plays all the music they loved when they were young
0:10:44	it also has a quick scan feature to help them find things
0:10:50	right
0:10:55	feature for a long rambling stories i is the one i
0:11:03	so i
0:11:07	i really great of yours did i say yours today to them as a nickel
0:11:17	silver said to check or money order to do not go right i think that's
0:11:21	not exist
0:11:22	okay
0:11:23	it's a saturday night live sketch
0:11:25	but you can see how we could help the elderly
0:11:30	or those in need it could remind you for example to take your pills
0:11:33	or you know it could help you feel more comfortable in your own home
0:11:38	now
0:11:40	let's get a bit more formal so what are we going to try to
0:11:43	do here we will try to learn this mapping from the natural language
0:11:48	to the
0:11:49	formal
0:11:50	representation that the computer understands and the learning setting is we have
0:11:55	sentence logical form
0:11:57	and by logical form i will use the terms logical form
0:12:01	and meaning representation interchangeably because
0:12:05	the models we will be talking about do not care about what the
0:12:09	meaning representation is what the program if you like that the computer will execute is
0:12:15	so we assume we have sentence logical form pairs
0:12:19	and this is a setting that most of the work has focused on previously
0:12:26	so it's like machine translation except that you know the target is an
0:12:32	executable language now
0:12:33	this task
0:12:35	is harder than it seems for three reasons
0:12:38	first of all
0:12:40	there is
0:12:41	a severe mismatch between
0:12:44	the natural language and the logical form
0:12:49	so if you look at this example how much does it cost a flight to
0:12:53	boston
0:12:54	and look at the representation here
0:12:56	you will immediately notice that
0:12:59	they're not very similar the structures mismatch
0:13:02	and not only is there a mismatch between the logical form
0:13:06	and the natural language string
0:13:09	but also its syntactic representation so you couldn't even use syntax if you wanted to
0:13:13	get the matching
0:13:15	so here for example
0:13:17	flight
0:13:18	would align to flight
0:13:20	and to to to and boston to boston but then fare corresponds to this huge natural language
0:13:26	phrase how much does it cost and the system must
0:13:30	infer all of that
0:13:32	now
0:13:33	this is the first challenge of the structure mismatch
0:13:36	the second challenge has to do with the fact
0:13:39	that
0:13:40	the formal language
0:13:42	the program if you like that we have to execute with a computer
0:13:46	has structure and it has to be well-formed
0:13:50	you cannot just generate anything and hope that the computer will give you an answer
0:13:54	so this is a structured prediction problem and
0:13:57	if you look here for the male actors in the titanic there are
0:14:01	three meaning representations
0:14:03	do people see which one is the right one
0:14:06	i mean they all look similar you have to squint at it
0:14:09	the first one
0:14:12	has unbound variables the second one has a parenthesis that is missing
0:14:17	so the only right one is the last one
0:14:20	you cannot do it approximately
0:14:22	it's not like machine translation where you're gonna get the gist of it you actually need
0:14:25	to get the right logical form that executes on the computer
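The two failure modes just mentioned, an unbound variable and a missing parenthesis, can be checked mechanically. The following is only a minimal sketch under assumed conventions: logical forms are written as parenthesised expressions, variables start with `$`, and `lambda` binds the variable that follows it. The predicate names in the usage note are hypothetical and not taken from the slides.

```python
import re

def is_well_formed(logical_form):
    """Return True if parentheses balance and every $-variable is bound by an enclosing lambda."""
    tokens = re.findall(r"\(|\)|[^\s()]+", logical_form)
    scopes = []       # one set of bound variables per currently open parenthesis
    previous = None   # previous token, used to detect "lambda $x" bindings
    for token in tokens:
        if token == "(":
            scopes.append(set())
        elif token == ")":
            if not scopes:          # closing parenthesis with nothing open
                return False
            scopes.pop()
        elif token.startswith("$"):
            if previous == "lambda":
                if not scopes:      # a binding outside any parenthesis
                    return False
                scopes[-1].add(token)
            elif not any(token in scope for scope in scopes):
                return False        # variable used without being bound
        previous = token
    return not scopes               # every opened parenthesis must be closed
```

With these assumptions, `is_well_formed("(lambda $0 (and (actor $0) (male $0)))")` holds, while dropping the final parenthesis or the `lambda $0` binding makes the check fail.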
0:14:29	now the third challenge
0:14:31	and this is when you deploy google home or alexa that the people who developed these
0:14:35	things immediately notice is that people will say
0:14:38	i mean
0:14:40	so
0:14:41	the same intent can be realized in very many different expressions who created microsoft
0:14:47	microsoft was created by
0:14:50	who founded microsoft who is the founder of microsoft and so on and so forth
0:14:55	and all that maps to this little bit from the knowledge graph which is
0:15:01	paul allen and bill gates are the founders of microsoft
0:15:04	and we have to be able the system has to be able your semantic parser
0:15:08	to actually deal
0:15:09	with all of these
0:15:10	different ways that we can express
0:15:13	our intent
0:15:14	okay
0:15:15	so in this talk we have three parts
0:15:18	well actually three parts so first i'm gonna show you how with neural models we
0:15:23	are dealing with this
0:15:24	structural mismatch
0:15:25	using something that is very familiar to all of you the encoder decoder paradigm
0:15:30	then i will talk about the
0:15:33	structured prediction problem and the fact that your output if you like your formal
0:15:38	representation has to be well-formed using this coarse to fine decoding algorithm i will explain
0:15:43	it and then finally i will show you a solution to the coverage problem
0:15:49	okay
0:15:49	now i should point out that there are many more challenges that are out there
0:15:53	that i'm not going to talk about but it's good to flag them
0:15:57	where do we get the training data from so i told you that we have
0:16:00	to have
0:16:01	natural language logical form pairs to train the models who creates these and some of
0:16:06	it is actually quite complicated
0:16:08	what happens if you have out-of-domain queries if you have a parser trained on one
0:16:12	domain let's say the weather and then you want to use it for baseball
0:16:17	what happens if you don't have actually only
0:16:20	independent questions and answers but you have codependent ones there's coreference between the queries now we're
0:16:26	getting into the territory of dialogue
0:16:29	what about speech we all pretend here that speech is a solved problem it isn't
0:16:33	a lot of times alexa doesn't understand children it doesn't understand some people with
0:16:37	accents like me
0:16:39	and then you talk to the amazon people and you say okay so do
0:16:42	you use the lattice the good old lattice and they say we use a lattice
0:16:46	of one because you know
0:16:48	otherwise it slows us down so there are many
0:16:52	technical and actual challenges that you know
0:16:56	all have to work together to make this thing work
0:16:59	okay
0:17:00	so let's talk about the structure mismatches
0:17:03	and so here the model is something you all must be a bit familiar with
0:17:08	and it's
0:17:09	one of the like
0:17:11	there are three or four things with neural models that get recycled over
0:17:16	and over again the encoder decoder framework is one of them
0:17:19	so we have natural language as input
0:17:22	we encode it using an lstm or whatever favourite model you have you can
0:17:28	use a transformer although transformers don't work for this task
0:17:31	that well because the datasets are small
0:17:34	whatever the next thing is you encode it you get a vector out of it then
0:17:38	this encoded vector serves as an input to
0:17:41	another lstm that actually decodes it into
0:17:46	a logical form
0:17:47	and you will notice here i say you decode it into a sequence
0:17:51	or a tree
0:17:53	i will not talk about trees but i should flag that there is a lot
0:17:57	of work trying to decode
0:17:59	the natural language into this tree structure which makes sense since
0:18:04	the logical form has structure there are parentheses it is recursive
0:18:10	however in my experience these models
0:18:13	are way too complicated to get to work
0:18:15	and
0:18:17	the advantage over assuming that the logical form is a sequence is not that
0:18:21	great so for the rest of the talk we will assume that we have sequences
0:18:25	in and we get sequences out and we will pretend
0:18:28	that the logical form is a sequence even though it isn't
0:18:32	okay
0:18:33	a little bit more formally the model will map
0:18:36	the natural language input
0:18:38	which is a sequence of tokens x to a logical form
0:18:41	representation of its meaning which is a sequence of tokens y
0:18:46	and we are modeling the probability of
0:18:49	the
0:18:50	meaning representation
0:18:51	given
0:18:51	the input
0:18:53	and the encoder
0:18:56	will just encode the language into a vector this vector then will be fed
0:19:01	into the decoder which will generate conditioned on the encoding vector
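Written out, the quantity being modeled is the probability of the output token sequence y given the input token sequence x. The token-by-token factorization below is the usual one for encoder-decoder models and is assumed here; the talk itself only names the conditional probability.

```latex
p(y \mid x) = \prod_{t=1}^{|y|} p\left(y_t \mid y_{<t},\, x\right)
```

Here y_{<t} denotes the output tokens generated before step t.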
0:19:05	and of course we have the
0:19:08	very important
0:19:09	attention here the attention mechanism the original models did not use attention but then
0:19:16	everybody realised in particular in semantic parsing it's very important because it deals with this
0:19:22	structure mismatch problem
0:19:25	so i'm assuming people are familiar with it instead of actually generating the tokens in
0:19:32	the logical form one by one without considering the input the attention will look at
0:19:37	the input and be able to
0:19:38	weight
0:19:39	the output given the input and you will get things you will get some sort
0:19:44	of certainty that you know
0:19:46	if i generate mountain it maps to mountain in my input
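To make the weighting concrete, here is a minimal sketch of one attention step in plain Python. The dot-product scoring function and the toy vectors are assumptions made for illustration; the talk does not say which scoring function the models use.

```python
import math

def attention(decoder_state, encoder_states):
    """Weight each encoder state by its (assumed) dot-product similarity to the decoder state."""
    scores = [sum(d * e for d, e in zip(decoder_state, state)) for state in encoder_states]
    highest = max(scores)                            # subtract the max for numerical stability
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    weights = [value / total for value in exps]      # softmax over input positions
    width = len(encoder_states[0])
    context = [sum(w * state[k] for w, state in zip(weights, encoder_states))
               for k in range(width)]                # weighted sum of the encoder states
    return weights, context
```

For example, with encoder states `[[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]` and decoder state `[0.0, 5.0]`, most of the weight lands on the second input position, which is the kind of alignment described for mountain.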
0:19:52	now
0:19:53	this is a very sort of simplistic view of semantic parsing
0:19:58	it assumes that not only natural language is a string
0:20:01	but that the logical form
0:20:03	is also a string and
0:20:06	and this may be okay but maybe it isn't
0:20:10	there is a problem and i'll explain
0:20:12	so we train this model by maximizing the likelihood of the logical forms
0:20:17	given the natural language input this is standard
0:20:21	at test time
0:20:22	we have to predict the logical form for an input utterance
0:20:28	and we have to find the one that actually maximizes this probability
0:20:33	of the output given the input
0:20:35	now trying to find this
0:20:38	argmax can be very computationally intensive and if you're google you can do beam search
0:20:43	if you're university of edinburgh you just do greedy search and it works just fine
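Greedy search simply takes the most probable next token at every step instead of searching for the exact argmax over whole sequences. A minimal sketch follows, assuming a hypothetical `next_token_probs` function that returns a token-to-probability dictionary for a given prefix; the toy table stands in for a trained decoder and is not from the talk.

```python
def greedy_decode(next_token_probs, max_len=20, end_token="</s>"):
    """Pick the single most probable next token until the end token or max_len is reached."""
    output = []
    while len(output) < max_len:
        probs = next_token_probs(tuple(output))
        token = max(probs, key=probs.get)   # locally best choice, no backtracking
        if token == end_token:
            break
        output.append(token)
    return output

# Hypothetical toy distribution standing in for a trained decoder.
TOY_MODEL = {
    (): {"SELECT": 0.9, "</s>": 0.1},
    ("SELECT",): {"name": 0.6, "city": 0.3, "</s>": 0.1},
    ("SELECT", "name"): {"</s>": 0.8, "name": 0.2},
}

def toy_next_token_probs(prefix):
    return TOY_MODEL.get(prefix, {"</s>": 1.0})
```

Beam search would instead keep several partial outputs alive at each step, which costs more computation.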
0:20:50	now
0:20:52	can people see the problem with this assumption of actually decoding into a string
0:20:58	remember the second problem that i said we have we have to make sure
0:21:03	that the logical form is well-formed
0:21:07	and by assuming that everything is a sequence i have no way to check for
0:21:12	example that my parentheses are being matched
0:21:15	i don't know this because i've forgotten what i've generated
0:21:19	so i keep going and at some point i
0:21:22	hit the end of sequence and that's it
0:21:24	so we actually want
0:21:26	to be able to enforce some constraints of well-formedness on the output
0:21:32	so how are we gonna do that
0:21:34	we're gonna do this with this idea of coarse to fine decoding which i'm gonna
0:21:38	explain
0:21:39	so again we will have our natural language input here all flights from dallas
0:21:43	before ten am
0:21:45	and what we would do before is we would decode the entire
0:21:49	natural language string into this logical form representation but now we insert a second
0:21:55	stage
0:21:56	where we first
0:21:58	decode
0:21:59	to a meaning sketch
0:22:01	what the meaning sketch does is it abstracts away details
0:22:05	from the very detailed logical form it's an abstraction
0:22:11	it doesn't have arguments it doesn't have variable names you can think of it
0:22:15	if you're familiar with
0:22:16	templates it's a template of the
0:22:19	logical form of the meaning representation
0:22:22	so first we will have a natural language
0:22:25	to decode into this meaning sketch and then we will use this meaning sketch
0:22:29	to fill in the details
0:22:31	now why does this make sense
0:22:34	well there are several arguments first of all you disentangle higher level information from low-level
0:22:41	information
0:22:43	so there are some things that are the same
0:22:45	across logical forms
0:22:47	that you want to capture
0:22:49	so your meaning representation in this case at the sketch level is gonna be
0:22:53	more compact so for example in atis which is a dataset we
0:22:57	work with
0:22:58	the sketch uses nine point two tokens as opposed to twenty one twenty one tokens
0:23:04	is a very long logical form
0:23:06	another thing that is important is at the model level because you explicitly share
0:23:12	the coarse structure
0:23:14	that is the same for multiple examples so you use your data more efficiently
0:23:19	and you learn to represent commonalities across examples which the other model did not do
0:23:24	and you also provide global context
0:23:27	to the fine meaning decoding now i have a graph coming up in a
0:23:31	minute
0:23:32	now
0:23:32	the formulation of the problem is the same as before we again map natural language
0:23:37	input to the logical form representation
0:23:40	except now that we have two stages in this model and so we again
0:23:45	model the probability of the output given the input
0:23:48	but now
0:23:49	this is factorized into two terms
0:23:51	the probability of
0:23:53	the meaning sketch given the input
0:23:56	and the probability of the output
0:23:59	given the input and the meaning sketch
0:24:02	so the meaning sketch
0:24:04	is shared between those two terms
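In symbols, with x the input, a the meaning sketch and y the output as used in the talk, the two-term factorization just described reads:

```latex
p(y \mid x) = p(a \mid x)\; p(y \mid x, a)
```

The first term is the coarse decoding step (input to sketch) and the second is the fine decoding step (input and sketch to the full meaning representation).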
0:24:07	and i'll show you a graph here so the
0:24:11	green nodes are the encoder units the orange or brown i don't know how
0:24:16	it comes out here this colour
0:24:18	are the decoder units so in the beginning we have natural language which
0:24:23	we will encode with your favourite encoder
0:24:25	here you see a bidirectional lstm
0:24:29	then we will use this encoding
0:24:31	to decode a sketch
0:24:33	which is this abstraction of the high-level meaning representation
0:24:38	once we have decoded this sketch we will
0:24:41	encode it again
0:24:43	with another bidirectional lstm into some representation
0:24:47	that we will feed into our final decoder that fills in all the details
0:24:52	we're missing
0:24:54	and you can see there the red bits are the information that i'm filling
0:24:59	in
0:25:00	you will also see that this decoder
0:25:03	this decoder takes into account
0:25:05	not only the encoding
0:25:07	of the sketch
0:25:08	but also the input
0:25:10	remember in the probability terms it is
0:25:13	the probability of y given x and a
0:25:16	y given x and a where y is our output x is the
0:25:21	input and a is the encoding of my sketch
0:25:26	okay this is why we say
0:25:29	the sketch provides context for the decoding
0:25:33	okay
0:25:34	now training and inference work the same way we are again maximizing the log-likelihood of the
0:25:38	generated meaning representations given the natural language
0:25:42	and at test time again we have to predict both the sketch and the
0:25:49	more detailed logical form
0:25:51	and we do this via greedy search
0:25:55	okay so a question that i have not addressed is where do these templates come
0:26:00	from
0:26:01	where do we find the meaning sketches
0:26:04	and the answer that i would like to give you is oh well we
0:26:09	would just learn them
0:26:11	now
0:26:12	that is fine we can learn them
0:26:14	but first we'll try something very simple and i'll show you examples because if the
0:26:19	simple thing doesn't work then learning will never work
0:26:22	so
0:26:24	let me show you examples of the different meaning sketches
0:26:27	for different kinds of meaning representations
0:26:31	so here we have logical form lambda calculus
0:26:34	and it's very trivial
0:26:36	to understand how you would get the meaning sketches you would just
0:26:40	get rid of variable information
0:26:43	you know lambda count and argmax the sketch keeps anything that is specific
0:26:48	to the example we would remove we would remove any notions of arguments
0:26:53	and
0:26:54	any sort of
0:26:56	information that may be specific to the logical form so you see here
0:27:00	this is the detailed form and this
0:27:03	whole expression becomes lambda two and flight there is no numeric information so these
0:27:09	are variables
0:27:10	this is for logical form
0:27:13	if you have source code this is python things are very easy actually we would
0:27:17	just substitute tokens with token types
0:27:22	so here is the python code and
0:27:25	s will become a name four will become a number
0:27:30	name here is the name of the function and then this is a string
0:27:34	of course
0:27:35	we want to keep the structure of the expression as it is so we will
0:27:39	not substitute delimiters operators or built-in keywords
0:27:43	because that would actually change what the program is meant to do
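The substitution of tokens by token types can be sketched with Python's standard `tokenize` module. This is an illustrative approximation rather than the procedure used in the work described: the `python_sketch` name, the `NAME`/`NUMBER`/`STRING` labels and the example line are assumptions, and only reserved keywords (not built-in function names) are left untouched.

```python
import io
import keyword
import tokenize

# Layout-only token types that carry no sketch information.
LAYOUT = (tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
          tokenize.DEDENT, tokenize.ENDMARKER)

def python_sketch(source):
    """Replace identifiers, numbers and strings by their token type; keep everything else."""
    sketch = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            sketch.append("NAME")
        elif tok.type == tokenize.NUMBER:
            sketch.append("NUMBER")
        elif tok.type == tokenize.STRING:
            sketch.append("STRING")
        elif tok.type in LAYOUT:
            continue
        else:
            sketch.append(tok.string)   # delimiters, operators and keywords stay as they are
    return sketch
```

For instance, `python_sketch("s = pad(4, 'abc')")` keeps the `=`, the parentheses and the comma, and replaces `s`, `pad`, `4` and `'abc'` by their types.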
0:27:49	if we have sql queries
0:27:52	it's again simple to get these meaning sketches so above you can see
0:27:56	the sql syntax
0:27:58	so we have a select clause and we have to
0:28:02	first select the columns so in sql we have tables and they have columns
0:28:07	here we have to select the column and then
0:28:10	we have the where clause that has conditions on it so in the example we're
0:28:14	selecting a record company
0:28:16	and here we are saying
0:28:19	the where clause puts some conditions so the year of recording for this record company has
0:28:24	to be after nineteen ninety six and the conductor has to be
0:28:28	mikhail snitko he's a russian composer now if you want to create a meaning sketch it's
0:28:33	very simple
0:28:34	we'll just have the syntax of the where clause where
0:28:37	larger and
0:28:39	and equal
0:28:40	so we'll just have the where clause and the conditions on it
0:28:43	these are not filled out yet so it could apply
0:28:47	to many different columns in an sql table
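The where-clause sketch and its later filling can be illustrated as follows. This is a simplified sketch under assumptions: conditions are given as hypothetical `(column, operator, value)` triples joined by AND, and the column names and values below are placeholders rather than the exact ones on the slide.

```python
def where_sketch(conditions):
    """Keep only the operators of a WHERE clause; columns and values are abstracted away."""
    return "WHERE " + " AND ".join(op for _column, op, _value in conditions)

def fill_sketch(sketch, slots):
    """Fill each operator in the sketch with a (column, value) pair, in order."""
    ops = [token for token in sketch.split() if token not in ("WHERE", "AND")]
    if len(ops) != len(slots):
        raise ValueError("number of slots does not match the sketch")
    filled = [f"{column} {op} {value!r}" for op, (column, value) in zip(ops, slots)]
    return "WHERE " + " AND ".join(filled)

# Hypothetical conditions in the spirit of the example above.
conditions = [("year_of_recording", ">", 1996), ("conductor", "=", "some conductor")]
```

Here `where_sketch(conditions)` gives `WHERE > AND =`, and the same sketch can be filled with different columns and values, which is why it can apply to many different tables.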
0:28:53	okay let me show you some results so i'm gonna compare
0:28:56	the simple model that i have shown you the simple sequence to sequence model
0:29:02	with this more sophisticated model that does constrained decoding
0:29:06	and this is comparing to the state-of-the-art of course
0:29:10	the state-of-the-art is a moving target in the sense that now all these numbers with
0:29:15	bert
0:29:16	people are familiar with bert right and so these numbers with bert
0:29:20	go up by some percent so whatever show you
0:29:23	you can add in your head
0:29:25	two or three percent
0:29:28	so these models do not use bert so this is
0:29:31	the previous state-of-the-art this is geoquery and atis i'm gonna
0:29:35	show you results for
0:29:37	different datasets
0:29:38	and it's important to see that it works on different datasets with very different meaning
0:29:43	representations so some have logical forms geoquery and atis have logical forms
0:29:48	and then we have an example with python code and with sql so here is
0:29:53	the system
0:29:55	that uses syntactic decoding
0:29:58	so it uses
0:30:00	quite sophisticated grammatical operations that then get composed with neural networks
0:30:05	to perform semantic parsing
0:30:07	this is the simple sequence to sequence model i showed you before
0:30:10	and this is coarse to fine decoding so
0:30:13	you do get a three percent increase
0:30:16	with regard to atis this is very interesting it has very
0:30:20	long utterances and very long logical forms
0:30:24	again here you do almost as well
0:30:27	remember what i said about you know
0:30:29	syntactic decoding does not give so much of an advantage
0:30:33	and then again
0:30:35	we get a boost with coarse to fine
0:30:37	and a similar pattern can be observed when you use
0:30:40	sql
0:30:43	where you jump from seventy four to seventy nine
0:30:45	and on django which is
0:30:50	python so you execute python code and again from seventy to seventy four
0:30:57	okay
0:30:59	now this is on the side i'll just mention it very briefly
0:31:04	all the tasks i'm talking about here
0:31:08	are dealing with the fact that you have
0:31:10	your input and your output pre-specified some human goes and writes down the logical
0:31:17	form
0:31:17	for the utterance
0:31:19	and the community has realised that this is not scalable
0:31:22	so what we're also trying to do is to work with weak supervision where you
0:31:27	have the question
0:31:28	and then you have the answer
0:31:30	no logical form
0:31:32	the logical form is latent
0:31:33	and you have to
0:31:34	come up with it the model has to come up with it so now this
0:31:37	is good because it's more realistic
0:31:39	but it opens another huge can of worms which is you have to come up
0:31:43	with the logical forms you have to have a way of generating them
0:31:47	and then there is this variance because you don't know which ones
0:31:50	are correct and which ones aren't
0:31:52	so here i show you a table you're given the table
0:31:56	you're given how many silver medals did the nation of turkey win
0:32:00	and the answer which is zero and you have to hallucinate all the rest
0:32:04	so this idea of actually using the meaning sketches
0:32:08	is very useful in this scenario
0:32:10	because it sort of restricts the search space
0:32:14	so rather than actually looking for all the types of logical forms you can
0:32:20	have you sort of first generate an abstract
0:32:24	program or a meaning sketch
0:32:26	and then
0:32:27	once you have that
0:32:29	you can fill in the details so this idea of abstraction
0:32:32	is helpful i would say
0:32:33	in this scenario even more
0:32:37	okay
0:32:37	now
0:32:38	let's go back to the third challenge which has to do with linguistic coverage
0:32:44	and this is the problem
0:32:46	that will always be with us whatever we do the human
0:32:50	is unpredictable
0:32:51	they will say things that your model does not anticipate
0:32:55	and so we have to have a way of dealing with it
0:33:00	okay so
0:33:03	this is not a new idea
0:33:05	whoever has done question answering has come up against this problem
0:33:09	of how do i increase the coverage of my system
0:33:14	so what people have done and this is actually an obvious thing to do you have
0:33:18	a question there and you paraphrase it in ir for example people do query
0:33:24	expansion it's the analogous idea i have a question i will have some paraphraser
0:33:28	that will paraphrase it and then
0:33:31	you know what i will submit the paraphrases and i will get some answers and
0:33:34	then the problem is solved
0:33:36	except that it isn't and if any of you have worked with paraphrases you know
0:33:40	that you know
0:33:42	the paraphrases can be really bad
0:33:44	and so you get bad answers so you had a problem and then
0:33:49	you've created another problem and the reason why this happens is because the
0:33:55	paraphrases are generated
0:33:58	independently
0:33:59	of your task of the qa module that you have so you have a qa module
0:34:04	you paraphrase the questions and then you get answers and at no point does
0:34:08	the
0:34:09	answer communicate with the paraphrase
0:34:12	to get something that you know
0:34:14	is appropriate for the task or for the qa model
0:34:18	so what i'm gonna show you now is how
0:34:20	we train this paraphrase model jointly
0:34:24	with a qa model for an end task and our task is again semantic
0:34:28	parsing except that this time because this is a more realistic task we're gonna be
0:34:33	querying a knowledge base like freebase or a knowledge graph
0:34:37	and of course there is a question that i will address in a bit where
0:34:41	do the paraphrases come from
0:34:43	who gives them to us where are they
0:34:48	okay so this is a daunting slide but it's actually really simple and
0:34:52	i'm gonna take you through this so this is how we see the
0:34:58	modeling framework as
0:35:00	we have a question who created microsoft
0:35:03	and we have some paraphrases
0:35:06	bear with me on this and i will tell you in a minute who gives us the
0:35:09	paraphrases assume for a moment we have these paraphrases
0:35:13	now what we will do is we will first take all these paraphrases here
0:35:19	and score them
0:35:22	okay
0:35:22	so we will encode them we will get question vectors we will have a model
0:35:27	that gives a score how good is this paraphrase for the question
0:35:31	how good is who founded microsoft as a paraphrase for who created microsoft
0:35:36	now once we normalize these scores
0:35:39	then we have our question answering module so we have two modules one is the
0:35:43	paraphrasing module and one is the question answering module and they're trained jointly
0:35:47	so once i have my scores for my paraphrases these are gonna be used
0:35:52	to weight the answers given the question
0:35:56	so this is gonna tell your model well look
0:35:59	this answer is quite good given your paraphrase or this answer is not so good
0:36:05	given your paraphrases do you see now that you kind of learn which paraphrases are
0:36:10	important for your task
0:36:12	for your question answering model
0:36:14	and your answer jointly
0:36:18	okay
0:36:20	so
0:36:20	a bit more formally we have
0:36:23	the modeling problem is we have an answer
0:36:26	and we want to model the probability of the answer given the question
0:36:30	and this is factorized into two models one is the question answering model
0:36:35	and the other one is the paraphrasing model
0:36:37	now for the question answering model you can use whatever you like
0:36:41	your latest neural qa model you can plug in there and
0:36:46	the same goes for the paraphrase model
0:36:48	use whatever you have as long as you can actually
0:36:52	encode them somehow
0:36:54	it doesn't really matter
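The factorization described in this passage can be written out as follows. This is a sketch under stated assumptions: the symbols a (answer), q (question), q' (a paraphrase), H_q (the set of candidate paraphrases of q, which may include q itself) and the parameter names \theta and \psi are notation chosen here for illustration, not taken from the slides.

```latex
p(a \mid q) \;=\; \sum_{q' \in H_q} \underbrace{p_{\psi}(a \mid q')}_{\text{question answering model}} \; \underbrace{p_{\theta}(q' \mid q)}_{\text{paraphrasing model}}
```

Read this way, the paraphrasing model decides how much each paraphrase counts, and the question answering model decides how likely the answer is under that paraphrase.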
0:36:56	now i will not talk a lot about the question answering model we used an
0:37:01	in-house model that is based on graphs that
0:37:05	is quite simple it just does graph matching on the knowledge graph
0:37:10	and i'm gonna tell you a bit more about the paraphrasing model
0:37:15	okay so this is how we score the paraphrases
0:37:20	we have a question
0:37:22	we generate paraphrases for this question
0:37:25	and then for each of these paraphrases so we will just
0:37:30	score them how good are they given
0:37:33	my question
0:37:34	and this is you know a dot product essentially
0:37:37	is a good paraphrase or not
0:37:39	but it's trained end to end
0:37:42	with the answer in mind
0:37:44	so
0:37:46	is this paraphrase going to help me to find the right answer
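The scoring step described above (encode, take a dot product, normalize, then weight the answers) might be sketched as follows. Everything here is a toy assumption for illustration: the two-dimensional vectors stand in for learned encodings, the per-paraphrase answer distributions stand in for a real question answering model, and the function names are hypothetical rather than taken from the speaker's system.

```python
import math


def softmax(scores):
    """Normalize raw scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def paraphrase_weights(question_vec, paraphrase_vecs):
    """Score each paraphrase against the question and normalize."""
    return softmax([dot(question_vec, p) for p in paraphrase_vecs])


def answer_distribution(weights, qa_probs):
    """Weight each paraphrase's answer distribution by its paraphrase score.

    qa_probs[i] maps answer -> p(answer | paraphrase i).
    """
    combined = {}
    for w, dist in zip(weights, qa_probs):
        for answer, p in dist.items():
            combined[answer] = combined.get(answer, 0.0) + w * p
    return combined


# Toy encodings (assumed): "who created microsoft" and three paraphrases.
question_vec = [1.0, 0.0]
paraphrase_vecs = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]

# Toy QA model outputs (assumed), one distribution per paraphrase.
qa_probs = [
    {"bill gates": 0.9, "steve jobs": 0.1},
    {"bill gates": 0.8, "steve jobs": 0.2},
    {"bill gates": 0.2, "steve jobs": 0.8},
]

weights = paraphrase_weights(question_vec, paraphrase_vecs)
combined = answer_distribution(weights, qa_probs)
best_answer = max(combined, key=combined.get)
```

In a jointly trained system the gradient of the answer loss would flow back into the paraphrase scores, which is what lets the model learn which paraphrases help the task. The sketch only shows the forward computation.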
0:37:50	and now
0:37:51	as far as the paraphrases are concerned again this is a plug and play module you
0:37:55	can use your favourite so if you are in limited domain you can write them
0:38:00	yourself
0:38:02	manually
0:38:03	you could use wordnet
0:38:05	or pp db which is this database which has a lot of paraphrases
0:38:10	but we do something else
0:38:12	using neural machine translation
0:38:17	okay so this slide i know everybody knows it but it's my
0:38:20	favourite slide of all times
0:38:22	because
0:38:23	whoever has tried to do this slide again
0:38:26	it's not as good as the original
0:38:29	you see it in particular if you go to machine translation talks
0:38:32	oh this is machine translation
0:38:34	nobody has ever come to capture so beautifully
0:38:37	the fact that sorry the fact that you have this language here
0:38:41	you have this english language and that you have attention weights so beautiful
0:38:46	and then you take these attention weights and you weight them
0:38:49	with the decoder and hey presto you get the french language
0:38:53	so
0:38:54	this is your usual machine translation your vanilla machine translation engine
0:38:59	it's again an encoder-decoder model with attention
0:39:02	and we assume we have access to this engine
0:39:06	now
0:39:07	you may wonder how i'm gonna get paraphrases out of this
0:39:12	this is again an old idea which goes back actually to martin kay
0:39:17	i think in the eighties
0:39:19	notice this thing so what we want to do is
0:39:23	in the case of english go from english to english
0:39:27	so we want to be able to sort of paraphrase an english expression to another
0:39:31	english expression but in machine translation i don't have any direct path
0:39:35	from english to english
0:39:37	what i do have is a path from english to german
0:39:40	and german to english
0:39:42	so
0:39:43	the theory goes if i have two english phrases
0:39:47	like here under control
0:39:49	and
0:39:50	in check
0:39:51	if they are aligned or if they correspond to the same phrase in another language
0:39:57	they are likely to be a paraphrase
0:40:00	now i'm not gonna use these alignments this is for you to understand the concept but you
0:40:04	can see that i have english i translate english to german
0:40:09	then german gets back translated to english
0:40:13	i have my paraphrase
0:40:19	more specifically
0:40:20	i have my input which is in one language
0:40:24	okay i encode it i decode it into some translations in the foreign language g stands
0:40:29	here for german
0:40:31	i encode my german and then i decode it back to english
0:40:36	there are
0:40:37	two or three things you should note about this thing
0:40:41	first of all
0:40:42	these things in the middle the translations are called pivots
0:40:46	and you see that we have k pivots
0:40:49	i don't have one translation but i have multiple translations this turns out to be really
0:40:53	important because a single translation may be very wrong and then i'm completely screwed i
0:40:58	have very bad paraphrases
0:41:00	so i have to have multiple pivots and not only that i could also
0:41:05	have multiple pivots in multiple languages
0:41:08	which then i take into account while i'm decoding
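The round trip through k pivots can be illustrated with a deliberately simplified sketch. Note the substitution: the talk describes a neural encoder-decoder that attends over the pivots while decoding, whereas the code below uses small hand-written phrase tables with made-up probabilities, purely to show why summing over several pivots is more robust than trusting a single translation. The tables, probabilities, and function name are all assumptions.

```python
def pivot_paraphrases(phrase, to_pivot, from_pivot, k=2):
    """Score English paraphrases by summing over the top-k pivot translations.

    to_pivot[phrase] maps pivot -> p(pivot | phrase)
    from_pivot[pivot] maps english -> p(english | pivot)
    """
    # Keep only the k most probable pivot translations.
    pivots = sorted(to_pivot[phrase].items(), key=lambda item: item[1], reverse=True)[:k]
    scores = {}
    for pivot, p_pivot in pivots:
        for candidate, p_back in from_pivot[pivot].items():
            scores[candidate] = scores.get(candidate, 0.0) + p_pivot * p_back
    scores.pop(phrase, None)  # the input phrase is not a useful paraphrase of itself
    return scores


# Toy English -> German table (probabilities are invented for illustration).
to_german = {
    "under control": {"unter kontrolle": 0.7, "im griff": 0.3},
}

# Toy German -> English table (probabilities are invented for illustration).
from_german = {
    "unter kontrolle": {"under control": 0.6, "in check": 0.4},
    "im griff": {"under control": 0.5, "in check": 0.3, "in hand": 0.2},
}

single = pivot_paraphrases("under control", to_german, from_german, k=1)
both = pivot_paraphrases("under control", to_german, from_german, k=2)
```

With one pivot only "in check" is recovered; with two pivots "in hand" also appears and "in check" gains support from both paths. Adding tables for further languages would extend the same sum across languages.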
0:41:12	now this is very different from what you may think of as paraphrases because
0:41:17	the paraphrases are never
0:41:20	explicitly stored anywhere they're all model internal
0:41:23	so what this thing learns is i give it english it just paraphrases english into english
0:41:30	but i don't have an explicit database
0:41:32	with paraphrases
0:41:34	and of course they are all vectors and they're all scored but
0:41:37	you know i cannot go in and say
0:41:39	where is that paraphrase i can only give the model a phrase and it generates another
0:41:44	one which is very nice because you do generation for free in the past if
0:41:49	you had rules you have to see how you actually use them to generate something
0:41:53	that is meaningful and so on
0:41:55	okay
0:41:55	let me show you an example
0:41:57	this is paraphrasing the question what is the zip code of the largest car
0:42:02	manufacturer if we pivot through french
0:42:06	so french tells us what is the zip code of the largest vehicle manufacturer or
0:42:11	what is the zip code of the largest car producer
0:42:14	if we pivot through german
0:42:16	what's the postal code of the biggest automobile manufacturer
0:42:20	what is the postcode of the biggest car manufacturer
0:42:24	and if we pivot through czech
0:42:25	what is the largest car manufacturer's postal code
0:42:29	or zip code of the largest car manufacturer
0:42:32	can i see a show of hands which pivot language do you think
0:42:36	gives you the best
0:42:37	paraphrases
0:42:39	i mean it's a sample of two
0:42:43	czech
0:42:44	very good
0:42:44	czech
0:42:45	proved to be the best pivot
0:42:47	followed by german
0:42:49	french was not so good
0:42:51	and again here there's the question how many pivots to use what languages do
0:42:56	you choose i mean these are all experimental variables that you can manipulate okay
0:43:00	let me show you some results
0:43:03	the grey you don't need to understand
0:43:05	these are all baselines that somebody can use
0:43:10	to show that the model is doing something over and above the obvious things
0:43:16	this is
0:43:17	this graph here is using nothing so you go from forty nine
0:43:23	to fifty one
0:43:25	this is from sixteen to twenty
0:43:27	these are webquestions and graphquestions these are datasets that people have developed
0:43:33	graphquestions is very difficult it has like
0:43:36	very complicated questions that need multihop reasoning so what is obama's daughter's friend's dog
0:43:43	called very difficult that's why the performance is really bad
0:43:48	what you should see is that
0:43:52	here pink is our model
0:43:54	so in all cases
0:43:56	hold on our model is pink
0:44:00	here is the second best system
0:44:03	and
0:44:05	red here is the best system and you can see that it does very well in
0:44:08	the difficult dataset
0:44:09	in the other dataset there is another system that is better
0:44:12	but they use a lot of external knowledge which we don't and they better exploit
0:44:16	the graph itself which is another avenue for future work
0:44:21	okay
0:44:22	now this is my last slide and then i'll take questions
0:44:27	what have we learned so there is a couple of things that are interesting
0:44:31	first of all is that
0:44:34	encoder-decoder models
0:44:36	are
0:44:37	good enough
0:44:38	for mapping natural language to meaning representations with minimal engineering effort and i cannot emphasise
0:44:46	that
0:44:48	more
0:44:49	before
0:44:50	this paradigm shift
0:44:53	what we used to do is we would spend ages coming up with
0:44:56	features that we would have to re-engineer
0:44:58	for every single domain so if i go from lambda calculus to sql and then
0:45:02	to python code i would have to do the whole process from scratch
0:45:05	here you have one model
0:45:08	with some experimental variables that you know you can keep fixed or change and it
0:45:13	works very well across domains
0:45:17	constrained decoding improves performance not only for the setting that i showed you
0:45:22	but for more weakly supervised settings
0:45:25	and people are using this constrained decoding even
0:45:29	not in semantic parsing so you know in generation for example
0:45:34	the paraphrases enhance the robustness of the model and in general i would
0:45:38	say they're useful
0:45:40	if you have other tasks for dialogue for example
0:45:43	you could give robustness to a dialogue model to generate answers for a chatbot
0:45:49	and the models could transfer to other tasks or architectures i've shown for the purposes
0:45:54	of this talk
0:45:56	you know so as not to overwhelm people
0:45:59	simple architectures but you know you can put neural networks left right and centre as you
0:46:03	feel like
0:46:04	now in the future i think there is a couple of avenues for future
0:46:08	work worth pursuing one is of course learning the sketches so this could be
0:46:12	a latent variable in your model trying to you know generalise and that would mean
0:46:18	that you don't need to do any preprocessing you don't need to give the algorithm
0:46:21	the sketches
0:46:23	how do you deal with multiple languages i have a semantic parser in english
0:46:27	how do i do this in chinese big problem in particular in industry they come
0:46:33	up against this problem a lot and their answer is we hire annotators
0:46:39	how do you
0:46:42	train these models if you have no data at all so just a database
0:46:47	and of course there is something that would be of interest to you
0:46:51	is how do i actually
0:46:53	do coreference how do i
0:46:56	model a sequence of turns
0:46:59	as opposed to a single turn
0:47:01	and without further ado i have one last slide and it's a very depressing slide
0:47:07	so
0:47:08	when i gave this talk like a couple months ago i used to have this
0:47:11	where it was theresa may
0:47:13	and this is on twitter and the joke is theresa may
0:47:18	will ask alexa to negotiate for her
0:47:21	and it will be fine i tried to find another one with boris johnson
0:47:25	and failed i don't think he does technology
0:47:28	so and he doesn't do negotiating either
0:47:30	so she would have been she would at least negotiate and at this point i'll
0:47:35	just take questions thank you very much
0:47:38	really
0:47:43	and my store
0:47:45	the time for question
0:47:48	thank you this is result from i j p morgan so my question is do
0:47:53	we really need to do
0:47:56	to extract the logical forms
0:47:58	given the fact that
0:48:00	probably humans don't do we really except in really complicated
0:48:05	case
0:48:06	about my daughter that
0:48:10	do we really need to do that for a well in that world machine translation
0:48:15	we don't really extract all these things
0:48:18	but we do translate i even to
0:48:22	like personal data stuff
0:48:24	that's a good question so the answer is
0:48:27	yes and no
0:48:28	so if you look at alexa or google these people
0:48:33	they have very complicated systems where they have
0:48:37	one module that does what you say i don't translate to logical form i just
0:48:41	you know do query matching and then extract the answer
0:48:44	but for some of the highly compositional queries which they get they do execute them
0:48:49	in databases
0:48:51	and they all have internal representations of what the queries mean
0:48:55	also
0:48:56	if you are a developer and for example
0:48:59	whenever you have a database
0:49:02	and let's say i sell jeans or i sell fruit and i have a
0:49:06	database and i deal with
0:49:07	customers and i have to have a spoken interface there you would have to extract it
0:49:12	somehow now for the phone when you say siri set my alarm clock i
0:49:17	would agree with you there you just need to recognize intents
0:49:20	and do the attribute slot filling
0:49:22	and then you're done
0:49:24	but whenever you have
0:49:27	more structure in the
0:49:30	output or the answer space then you do this
0:49:39	thanks for a very nice talk
0:49:42	i had a question on the paraphrase
0:49:47	the scoring and it seem to me something wasn't quite right if i understood it
0:49:51	well but what's more the you have an equation with the summation of thing that's
0:49:57	what so intuitively
0:50:01	to make the right thing is to you look for the closest paraphrase that actually
0:50:06	has an answer that you can a good quality actually can find it so you're
0:50:09	trying to optimize that's two things by finding something that means the same that we're
0:50:14	i can find an answer if i can't find a matter of the original question
0:50:17	but when you some that the problem as paraphrases that have been an equal
0:50:22	distribution out of some phrases have many paraphrases are many paraphrases in a particular direction
0:50:27	but maybe not so many in the others just depending on how many synonyms you
0:50:31	haven't so trying to add them up and weight them if you have a lot
0:50:35	of paraphrases here for the wrong answer and one for something that's better you know
0:50:39	it seems like the
0:50:40	closeness should dominated if you have a very high quality after and it seems like
0:50:45	your models trying to do something different that i'm wondering if that
0:50:48	is causing problems or something that i'm not seeing no right so this is
0:50:52	how the model is trained in that case we have to make it robust
0:50:55	and you can manipulate the n-best paraphrases
0:50:59	at test time you're absolutely right we just find the one the max the one
0:51:03	that is best
0:51:05	so you are right and i did not explain it well but you are absolutely
0:51:09	right that you know you don't have
0:51:11	you know you can be all over the place if you're just looking for the
0:51:14	sum but at test time we just want the one
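The distinction drawn in this exchange, summing over paraphrases versus taking the single best one, can be made concrete with a small sketch. The weights and answer probabilities below are invented so that the two rules disagree. The function names are hypothetical, and the max rule shown (best weighted paraphrase and answer pair) is only one plausible reading of what "the one that is best" means at test time.

```python
def marginal_answer(weights, qa_probs):
    """Pick the answer with the highest total score summed over all paraphrases."""
    totals = {}
    for w, dist in zip(weights, qa_probs):
        for answer, p in dist.items():
            totals[answer] = totals.get(answer, 0.0) + w * p
    return max(totals, key=totals.get)


def max_answer(weights, qa_probs):
    """Pick the answer from the single best (paraphrase, answer) pair."""
    best_answer, best_score = None, float("-inf")
    for w, dist in zip(weights, qa_probs):
        for answer, p in dist.items():
            if w * p > best_score:
                best_answer, best_score = answer, w * p
    return best_answer


# One close paraphrase supporting answer "A", three weaker ones supporting "B".
weights = [0.34, 0.22, 0.22, 0.22]
qa_probs = [{"A": 0.9, "B": 0.1}] + [{"A": 0.1, "B": 0.9}] * 3

summed_choice = marginal_answer(weights, qa_probs)
max_choice = max_answer(weights, qa_probs)
```

Here the sum is dominated by the three weaker paraphrases, which is the concern raised in the question, while the max follows the single closest paraphrase.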
0:51:21	hi thank you for the great talk decision model from microsoft research so my
0:51:25	question is for the coarse to fine decoding what do you think of its potential in
0:51:30	generating natural language outputs like dialogue like summarisation
0:51:35	can you come again ask the question again
0:51:40	what do you think of the potential of coarse to fine that's a good question
0:51:44	that's a good question so
0:51:46	i think well i think it's very interesting now
0:51:51	for a
0:51:52	sentence generation so you mentioned summarisation i'll do one thing at a time so if
0:51:57	you're just want to generate
0:51:59	from some input a sentence
0:52:02	you want to do surface realization people have already done this sasha rush they
0:52:06	have a very similar model where they first sort of
0:52:11	produce a template which they learn and from the template they surface realize a
0:52:15	sentence
0:52:16	however summarization which is the more interesting case
0:52:20	you would have to have a document template
0:52:24	and
0:52:25	it's not clear what this document template might look like in how you might learn
0:52:29	it so you may
0:52:31	for example assume that the template is some sort of a tree or
0:52:36	a graph
0:52:37	with generalizations and then from there you just generate the summary
0:52:41	and i believe it's like very
0:52:44	we should do this but it will not be as trivial as
0:52:50	what we do right now which is encode the document in a vector and
0:52:53	then have attention and then a bit of copy and then here's your summary
0:52:57	so the question there what the template is
0:53:01	nobody has an answer
0:53:13	i was wondering if you could elaborate on your very latest work on generating
0:53:19	the abstract meaning representation because of course my reaction
0:53:23	to what you were saying in the first part was
0:53:26	well
0:53:27	it's all good and well when you have you know
0:53:29	a
0:53:30	corpus where you have the mapping between the query and the
0:53:35	logical form what do you do if you don't have which is the majority of
0:53:40	cases
0:53:41	okay so this is a tough problem so how do you do inference
0:53:47	with weak supervision
0:53:49	and there is two things there that we found out
0:53:56	because the space you are searching is a very large space
0:53:59	of
0:54:01	potential programs that execute and we haven't any signal
0:54:04	other than the right answer
0:54:06	so because the only signal is the right answer there's two things that can happen
0:54:10	one is ambiguity
0:54:12	so
0:54:13	entities may be ambiguous turkey can be the animal turkey or
0:54:17	turkey the country
0:54:20	government
0:54:21	and so then you're screwed and you will get things wrong and the other one
0:54:24	is spuriousness so you have things that execute to the right answer
0:54:29	but they don't have the right intent the right semantics
0:54:31	and so what people do what we do is we use the templates here
0:54:36	and then we have another step which actually again tries to do
0:54:41	some structural matching and tries to say okay so i have this abstract program
0:54:46	this will cut down the search space
0:54:49	and then
0:54:49	you also have to do some alignment and put some constraints so you say for
0:54:55	example
0:54:55	i cannot have
0:54:57	the column silver repeated twice
0:55:00	because this is not well formed
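A well-formedness constraint of the kind mentioned here could be sketched as a filter over candidate programs. The program representation below (a list of operation and column pairs), the operation names, and the column names are all hypothetical, and the rule shown covers only the repeated-column example from the talk.

```python
from collections import Counter


def is_well_formed(program):
    """Reject programs that mention the same column more than once.

    program: list of (operation, column) steps; this format is assumed.
    """
    counts = Counter(column for _, column in program)
    return all(n == 1 for n in counts.values())


# Two candidate programs for a question over a medal table (toy example).
candidates = [
    [("select", "nation"), ("argmax", "silver")],
    [("select", "silver"), ("argmax", "silver")],  # repeats the column "silver"
]

# Pruning ill-formed candidates shrinks the search space before execution.
kept = [program for program in candidates if is_well_formed(program)]
```

Filtering like this only removes candidates that are structurally implausible. It does not by itself distinguish a spurious program from a correct one when both are well formed and both execute to the right answer.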
0:55:02	but
0:55:02	the accuracy of this i didn't put it is like forty four percent
0:55:06	which you know
0:55:09	is not anywhere i mean google and amazon would laugh
0:55:12	there is more work to be done
0:55:18	so thank you for the talk so i have a question about your coarse to fine
0:55:21	decoding so in your coarse to fine decoding you use a meaning representation but
0:55:27	you're the whole being final deporting these of these two based on the cross marks
0:55:33	it'll be both old ones but it to be politically
0:55:37	o and it means that there is no guarantee that the meaning representation we use
0:55:42	the on wavelet the that intonation without but in some cases so we need to
0:55:48	consider such things because if we consider of the semantics some arguments over the eight
0:55:54	it was something
0:55:56	of the d scene which should be included in that the warnings
0:56:00	that is a very good point i'm glad you guys were paying attention so
0:56:04	yes we don't have we don't have this and
0:56:08	we say constrained decoding but what you really do is you constrain the encoding
0:56:12	hoping that your decoder will be more constrained by the encoding
0:56:17	you could include we did an analysis where we saw two things one is how
0:56:22	good are the templates so if your templates are
0:56:25	not great then what you're saying
0:56:28	will be more problematic
0:56:31	and we did an analysis let me see if i have a slide that shows that
0:56:34	actually the templates are working quite well
0:56:37	i might have a slide i don't remember
0:56:41	yes
0:56:42	so this slide shows you
0:56:46	the sequence to sequence model the first row is the sequence to sequence model
0:56:50	without any sketches
0:56:53	and the second is coarse to fine where you have to predict the sketch
0:56:56	and you see that the coarse to fine predicts the sketch much better
0:57:01	than the one stage model the sequence to sequence
0:57:04	so this tells you that you
0:57:06	are kind of winning but not exactly
0:57:09	so i don't know what would happen if you include these constraints
0:57:14	my
0:57:14	my answer would be this doesn't happen a lot it could be but in the
0:57:18	logical forms we tried if you have very long very complicated ones so
0:57:23	really huge sql queries then
0:57:25	i would say that your approach
0:57:27	would be required
0:57:30	okay no it's
0:57:33	this could do
0:57:35	so maybe i ask one question okay in the last slide you
0:57:40	said that the model
0:57:43	so what i mean is that all of this is related to
0:57:47	qa on a single turn but in a dialogue case we have
0:57:52	multiple turns
0:57:54	so what are the common problems
0:57:57	yes so i'll send you i have a slide on this so we did
0:58:01	try to do
0:58:02	this paper in submission multiple turns
0:58:06	so where you say for example i want to buy these levi's jeans
0:58:14	how much do they cost do you have them in another size
0:58:18	or what is the colour so you know you elaborate with
0:58:22	new questions and there's patterns of you know these multiturn dialogues that you can do
0:58:28	and
0:58:29	you can do this but the one thing that we actually need to sort out
0:58:34	before doing this
0:58:35	is coreference
0:58:37	and
0:58:37	because right now these models don't take coreference into account if you model coreference
0:58:42	in the simple way of like look at the past and model it
0:58:45	as a sequence it doesn't really work that well so i think definitely
0:58:49	sequential question answering is the way to go i have not seen any models that
0:58:54	make me go like oh this is great but
0:59:00	yes it's a very hard problem but you know one step
0:59:04	at a time
0:59:05	so thank you much so that sense because they give him

Learning Natural Language Interfaces with Neural Models

Keynotes

Mirella Lapata (University of Edinburgh, UK)