Speech Transcript - Deep Reinforcement Learning For Modeling Chit-Chat Dialog With Discrete Attributes

0:00:17	two
0:00:30	hello and the lightning
0:00:33	again and welcome to the next session on policy and knowledge and we will
0:00:41	start this session with the talk
0:00:44	on deep reinforcement learning for modeling chitchat dialogue with discrete attributes
0:00:50	and the authors are
0:00:53	chinnadhurai sankar and sujith ravi and the presenter is
0:00:58	sujith
0:01:10	i works
0:01:12	and you at the trial run
0:01:14	hi everyone
0:01:16	thank you for being here and it's pretty exciting to be at sigdial
0:01:20	i'm sujith ravi and let me give a little background intro
0:01:25	to what we do i'm from google ai i'm a research scientist and i lead multiple machine
0:01:30	learning groups my group is focused on dealing with a lot of deep learning problems
0:01:35	where you actually have to inject structure into deep networks like how do you combine graph learning
0:01:39	the traditional graph learning approaches
0:01:41	with deep learning so we've actually released like a bunch of things on doing semi
0:01:45	supervised learning at scale if you're using any of the google products gmail so
0:01:49	to anything et cetera where you will actually be using stuff that we built
0:01:53	we also do conversational ai so i'll show you one example of
0:01:57	that on detecting intents but also like multimodal ai
0:02:02	both for language and also for vision so using state-of-the-art vision
0:02:07	technology
0:02:09	misnomer
0:02:10	people might think google
0:02:12	large companies have a lot of resources we label all the data sets that we
0:02:16	have
0:02:17	if you actually take the state-of-the-art image recognition system that you're using in
0:02:21	google photos and cloud
0:02:23	we have less than one percent
0:02:25	annotation
0:02:26	and the reason it works is
0:02:28	in like two words
0:02:30	semi supervised
0:02:31	plus
0:02:32	deep learning and a lot of other optimisations that are going on under the hood
0:02:36	so my group is responsible for some of these things
0:02:39	and finally
0:02:40	a lot of the problems that we have to deal with
0:02:43	actually require a lot of compute on the cloud
0:02:45	my group is also looking at things like how to do things on device
0:02:49	imagine you have to build a dialog generation system
0:02:52	or a conversational system that has to fit on your watch that cannot actually have
0:02:56	access to gigabytes of memory or even you know a lot of compute unlike you
0:03:01	know the cloud where you can do cpus gpus and all the latest generation hardware
0:03:06	so with that
0:03:07	hopefully that gives you a sampling of things we work on
0:03:09	this is joint work with
0:03:11	my fabulous intern chinnadhurai who couldn't be here he's from yoshua
0:03:15	bengio's lab
0:03:18	the talk is gonna be about deep reinforcement learning for modeling chitchat
0:03:22	dialogue with discrete attributes if that's quite a mouthful
0:03:25	all it means is
0:03:26	we try to do dialog generation with controllable semantics
0:03:30	and i will give you an overview of what we are talking about here so
0:03:34	first off
0:03:36	like for any generation system you have to predict responses
0:03:39	here are two applications where we have to predict responses and these are not more data
0:03:44	and but equally hard
0:03:46	at the order of like millions or even billions of predictions per day
0:03:50	one is smart reply which our team developed
0:03:54	several years ago
0:03:55	i mean if you're familiar with smart reply
0:03:57	okay quite a few if for those of you who don't know
0:04:00	if you're using gmail
0:04:02	on your phone
0:04:03	if you see those blue suggestion boxes that pop up at the bottom that's exactly
0:04:07	what it is
0:04:08	so
0:04:09	if you have any email or chat message it actually contextually generates responses that are
0:04:13	relevant for you and if you notice these are actually very different responses so all
0:04:17	the three suggestions are not necessarily the same so this is the smart reply system
0:04:22	and for folks who think that this is a simple
0:04:24	encoder-decoder problem
0:04:26	i can assure you that
0:04:28	to get it to work
0:04:29	it's definitely not there's a lot more things going on you can read the paper from
0:04:33	kdd
0:04:34	but i'll touch on some of these attributes later in the talk today as well
0:04:37	but you can take this to the multi modal setting as well so we also
0:04:40	released something called photo reply after the initial smart reply version
0:04:44	where now you receive an image and you have to understand the
0:04:49	semantics of the visual content
0:04:51	and generate an appropriate response so if you look at the picture
0:04:55	and it shows a baby
0:04:56	the system would say so cute
0:04:59	and you probably send it unless probably you don't have a heart
0:05:02	right
0:05:03	or if you see like other favourite things that would like if you see skydiving
0:05:07	video or a image it'll actually suggest how brave
0:05:11	i always been a very good the start
0:05:13	one more suggestion how stupid should come at the end of it as well but
0:05:17	we control for those sort of things
0:05:20	so these are just examples of generation systems but
0:05:23	like the task that we're trying to solve in this paper is well basically we
0:05:26	try to model open-domain dialogue so everybody here i don't need to introduce
0:05:30	task-oriented dialog systems are available in everyday systems i mean you're talking about booking reservations
0:05:35	like you know playing music et cetera there is a task and all the you
0:05:39	know prediction system that you build
0:05:42	parameters are optimized towards solving the task
0:05:45	open-ended dialogue is much harder
0:05:47	and one of the common ways that people solve this is the standard
0:05:51	sequence-to-sequence model
0:05:52	where you try to model it as a machine translation problem so you're given a history
0:05:56	of dialogue utterance sequences
0:05:57	and then you're trying to translate
0:05:59	some representation of that encoded sequence
0:06:03	into
0:06:04	you know decoder sequence in this case an utterance that you're going to
0:06:07	like send
0:06:09	what's the problem
0:06:10	almost every system especially the neural systems
0:06:14	that you have today
0:06:16	like doesn't matter which one over time when you use them they seem quite repetitive and they sound
0:06:20	very redundant right so the problem is like from an ml perspective
0:06:26	unlike the task oriented dialogue the vocabulary is much larger and
0:06:30	there's a high entropy that you have like few responses that are very commonly occurring
0:06:33	but then there's this long tail of like rare responses so
0:06:37	given a choice most of these systems are trying to maximize likelihood in some form
0:06:41	or the other
0:06:42	they'll actually prefer to generate responses that give you the maximum
0:06:46	likelihood or the lowest perplexity
0:06:48	so this is a common problem of course it's not a new problem like anyone
0:06:52	who's
0:06:53	built systems would have realised this and there are many ways to address this like
0:06:56	people have tried adding like you know loss function objective function extending the loss functions
0:07:02	you basically bias the system to produce longer sequences you know non-redundant responses
0:07:08	adding an rl layer on top of the you know the deep learning system so
0:07:12	that you can actually optimise your policy to do something that is non redundant and
0:07:16	even injecting knowledge from sources like wikipedia et cetera
0:07:22	so
0:07:22	in our work
0:07:23	what we propose is instead
0:07:25	a
0:07:26	conditional model where we're trying to condition the utterance generation that is the dialog generation
0:07:30	based on interpretable and discrete dialog attributes
0:07:34	so
0:07:34	i will unpack each of those phrases like within the next few slides but
0:07:41	here's the building block for the model
0:07:43	so we use the standard
0:07:45	encoder-decoder model but this is a hierarchical encoder-decoder model like originally introduced in serban et
0:07:50	al
0:07:50	and
0:07:50	you can think of this as like two levels of an rnn recurrent neural network
0:07:55	where the first layer is actually operating over words in the utterance
0:07:59	at any given time step and then that generates a context state
0:08:02	and then you have another rnn that operates over a sequence of
0:08:06	time steps
0:08:07	so basically that operates over the multiple turns in the dialogue
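The two-level structure described here can be sketched in a few lines. This is a minimal illustration, not the model from the talk: the plain tanh RNN cell, the random weights, and the names `rnn_step`, `init_rnn`, `run_rnn` and `encode_dialogue` are all assumptions introduced for the example.

```python
import math
import random


def rnn_step(W, U, b, x, h):
    """One plain (Elman) RNN step: h' = tanh(W x + U h + b)."""
    return [
        math.tanh(
            sum(W[i][j] * x[j] for j in range(len(x)))
            + sum(U[i][j] * h[j] for j in range(len(h)))
            + b[i]
        )
        for i in range(len(h))
    ]


def init_rnn(in_dim, hid_dim, rng):
    """Random small weights; a trained model would learn these."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hid_dim)]
    U = [[rng.uniform(-0.1, 0.1) for _ in range(hid_dim)] for _ in range(hid_dim)]
    b = [0.0] * hid_dim
    return W, U, b


def run_rnn(params, inputs, hid_dim):
    """Run an RNN over a sequence and return its final hidden state."""
    h = [0.0] * hid_dim
    for x in inputs:
        h = rnn_step(*params, x, h)
    return h


def encode_dialogue(dialogue, embed, word_rnn, ctx_rnn, hid_dim):
    """dialogue: list of turns, each a list of tokens.

    Level 1 (word_rnn) summarises the words of one utterance.
    Level 2 (ctx_rnn) runs over those summaries, one step per turn,
    and yields one context state per turn.
    """
    ctx = [0.0] * hid_dim
    states = []
    for utterance in dialogue:
        utt_vec = run_rnn(word_rnn, [embed[tok] for tok in utterance], hid_dim)
        ctx = rnn_step(*ctx_rnn, utt_vec, ctx)
        states.append(ctx)
    return states
```

In this sketch the context state after each turn is what a decoder would start from; real systems typically use gated cells (GRU or LSTM) rather than the plain tanh cell assumed here.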
0:08:11	simple enough of course
0:08:12	training these things is never ever simple enough there's like you know all kinds of
0:08:16	hyperparameter tunings et cetera but we're not gonna talk about that
0:08:20	instead what our model does is we propose a conditional response generation model
0:08:24	where we're trying to learn a conversational network that is conditioned on interpretable and
0:08:29	composable dialogue attributes so
0:08:32	you have the same first layer of rnn operating over the words in the
0:08:36	utterance
0:08:36	but instead of actually using just the context state to start decoding and generate a
0:08:41	response we're now going to model attributes
0:08:44	dialog attributes and i'll tell you what those dialog attributes are
0:08:47	these are interpretable and discrete attributes
0:08:50	it's not like there's been work on like latent attributes where you have continuous
0:08:53	representations that model a dialog state et cetera but here we can use discrete
0:08:58	attributes
0:08:59	which are predicted
0:09:00	and modeled
0:09:01	during the generation process
0:09:02	and now once you predict the attribute at a given time step
0:09:06	that plus the context state is
0:09:09	together used to generate the decoding state that means then you're gonna start generating the
0:09:13	utterance after that point
0:09:14	so what is a dialog attribute
0:09:17	so we chose intentionally chose things like
0:09:20	dialogue acts
0:09:21	sentiment emotion speaker persona these are things that we actually want to model about a
0:09:25	dialogue
0:09:26	so the reason is we want to get controllable semantics so
0:09:29	it's not just about
0:09:30	saying that hey does it look fluent or not
0:09:33	but imagine what i want to if i want to say that
0:09:37	make the dialogue sound more happy
0:09:38	or
0:09:39	for example
0:09:40	add a specific speaker style
0:09:43	or a specific emotion
0:09:44	or in the extreme and this is like
0:09:46	further along if you want your dialogue systems to start becoming empathetic et cetera
0:09:52	like first of all quantifying what that means is also a hard problem like
0:09:56	we could have a whole talk on just that
0:09:59	and
0:10:00	this is the
0:10:01	crucial part here
0:10:02	so we are trying to force the encoder not to just generate the contextual
0:10:06	state but instead use that also to generate a latent but interpretable representation of the dialogue
0:10:11	at that particular time step and together use it to start the generation process
0:10:16	now these are composable as i said
0:10:19	so it's not just one single dialogue act or dialogue attribute that you
0:10:22	would predict you can actually predict multiple ones of them so you can have a
0:10:25	sentiment and a dialogue act
0:10:28	and an emotion and a style all being represented in the same model and in
0:10:33	a few slides it'll be clear why this is useful
0:10:36	so
0:10:38	this is pretty much the gist of the model
0:10:40	so the
0:10:42	part that you change is now you would model the attribute sequence
0:10:45	and predicting the attribute itself is a simple mlp multilayer perceptron you can have more
0:10:50	fancier things
0:10:51	but this is integrated with the joint model
0:10:53	and then used for the generation process
0:10:55	during inference the best part about this is you would say that now you're complicating
0:10:59	the model even more
0:11:00	you just introduce another bunch of parameters there
0:11:02	obviously it's gonna do better on perplexity but
0:11:06	what are you going to do for annotation like do you need another system just
0:11:09	to give you manually labeled annotated data at the attribute level now for your dialogue
0:11:14	the good news is that you don't need it so here's how you do the
0:11:17	inference
0:11:18	so you start predicting the dialog attributes from the dialogue context so at any time
0:11:22	step you use the context vector to predict the attribute
0:11:25	now conditioned on the previous attribute
0:11:28	you actually predict the next
0:11:30	attribute that means at time step i you use the attribute at i minus one to
0:11:34	predict the you know the dialogue act
0:11:36	combine it with the context state at i minus one
0:11:40	to start the generation process
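The inference step just described (predict the next attribute from the context and the previous attribute, then combine the two to start decoding) can be sketched as follows. This is a hedged illustration only: the one-hidden-layer MLP, the concatenation of context and attribute embedding, and the names `predict_attribute` and `decoder_init` are assumptions for the example, not details confirmed by the talk.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def affine(W, b, x):
    """Compute W x + b for a list-of-rows matrix W."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]


def predict_attribute(ctx, prev_attr_emb, mlp):
    """MLP over [context ; previous attribute embedding].

    Returns the index of the most likely attribute and the full
    distribution. The layer sizes are whatever `mlp` provides.
    """
    W1, b1, W2, b2 = mlp
    hidden = [math.tanh(v) for v in affine(W1, b1, ctx + prev_attr_emb)]
    probs = softmax(affine(W2, b2, hidden))
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


def decoder_init(ctx, attr_emb, W, b):
    """Initial decoder state built from the context state and the
    embedding of the predicted attribute (assumed combination rule)."""
    return [math.tanh(v) for v in affine(W, b, ctx + attr_emb)]
```

No attribute label is needed at this point: the attribute fed to `decoder_init` is the model's own prediction, which matches the remark that annotation is only used during training.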
0:11:43	and as i mentioned the
0:11:44	attribute annotation is not required during inference you just use it during training
0:11:49	now there is a whole
0:11:50	bunch of things you can do to get away even from the actual annotation during training
0:11:56	time for example
0:11:57	you don't need to say that
0:11:58	i need my training data also to be tied with semantic labels or like
0:12:02	emotion labels or dialogue acts
0:12:04	you could learn
0:12:05	an open-ended
0:12:06	set of things like for example open-ended topics of the dialogue
0:12:10	and i won't get into that in this talk but if anyone is interested
0:12:13	i'd be happy to answer that too
0:12:16	so
0:12:17	this is the crux of the model
0:12:19	of course it doesn't stop there
0:12:21	for most dialogue systems we also have to add an rl reinforcement learning layer on
0:12:25	top of that where you try to optimize a policy gradient
0:12:28	usually these objectives are slightly different from the maximum likelihood objective that means you're trying
0:12:33	to bias towards longer responses or some other goal
0:12:36	you use the standard reinforce
0:12:37	and usually the policies are initialized from the supervised pre-training so the
0:12:42	attribute-conditional hierarchical recurrent
0:12:44	encoder model is the one you first train and then you initialise the rl policy
0:12:49	parameters
0:12:50	from that state
0:12:52	in standard work this is how it looks like
0:12:55	you formulate the policy as a token prediction problem so the state space is basically
0:13:00	represented by the context that means the encoder state
0:13:04	and the action space is you're trying to predict tokens from the vocabulary one at a
0:13:08	time
0:13:09	what's the problem with this
0:13:11	besides that the vocabulary is large for open-domain
0:13:14	usually what ends up happening is these
0:13:16	policy gradient methods exhibit high variance and this is basically because of the large action
0:13:20	space
0:13:21	and
0:13:22	the rl which is actually introduced to actually bias this supervised learning system you
0:13:26	know away from what it was supposed to learn and like produce
0:13:29	meaningful dialogue
0:13:31	instead tries to step away from linguistic and natural language phenomena
0:13:35	simply because
0:13:36	certain words are more frequent than others
0:13:38	again
0:13:39	the policy is trying to
0:13:40	pick
0:13:40	those words
0:13:41	from the vocabulary that will actually maximize its reward or utility function
0:13:46	so
0:13:47	of course
0:13:48	training and convergence is another issue in this
0:13:51	setting as well
0:13:52	instead what we say is like
0:13:55	instead of doing the
0:13:57	token generation we formulate the policy as a dialog attribute prediction problem the state space now
0:14:02	becomes
0:14:04	a combination of the dialogue context
0:14:06	and the contextual attributes and these attributes are the dialogue attributes that i mentioned in
0:14:10	the previous slide
0:14:11	the action space is
0:14:13	the set of dialog attributes
0:14:15	something more latent
0:14:17	something more interpretable
0:14:18	and
0:14:19	in fact
0:14:20	think about it like if you capture some aspect of a semantics of a sentiment
0:14:25	do you need all the words possible
0:14:28	in the english vocabulary or any language vocabulary to generate that specific sentiment i mean
0:14:33	as soon as you've got that gist
0:14:35	the generation can actually downstream do much more interesting things so you're elevating the problem
0:14:39	from the lexical level to the semantic level
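To make the change of action space concrete, here is a toy REINFORCE update where each action is one of a handful of discrete attributes rather than a vocabulary token. It is a simplified sketch under loud assumptions: the policy is a state-free softmax over three hypothetical attribute labels, the reward function `toy_reward` is invented for the example, and no baseline is used. The talk's policy also conditions on the dialogue context and previous attributes, which is omitted here.

```python
import math
import random

# Hypothetical attribute inventory; the real one would come from
# dialogue-act, sentiment, or similar label sets.
ATTRIBUTES = ["question", "statement", "positive_sentiment"]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def sample(probs, rng):
    """Draw an index from a categorical distribution."""
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1


def toy_reward(action):
    """Invented reward: pretend only one attribute leads to a good reply."""
    return 1.0 if ATTRIBUTES[action] == "positive_sentiment" else 0.0


def reinforce_step(theta, reward_fn, rng, lr=0.5):
    """One REINFORCE update on the attribute logits `theta`.

    For a softmax policy, d log pi(a) / d theta_k = 1[k == a] - pi(k),
    so the update is theta_k += lr * reward * (1[k == a] - pi(k)).
    """
    probs = softmax(theta)
    a = sample(probs, rng)
    r = reward_fn(a)
    return [
        t + lr * r * ((1.0 if k == a else 0.0) - probs[k])
        for k, t in enumerate(theta)
    ]


def train(steps=500, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * len(ATTRIBUTES)
    for _ in range(steps):
        theta = reinforce_step(theta, toy_reward, rng)
    return softmax(theta)
```

With only a few attributes to choose from, each sampled action carries far more signal than a single token drawn from a large vocabulary, which is the intuition behind the variance argument made above. Practical implementations usually subtract a baseline from the reward to reduce variance further.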
0:14:44	so
0:14:45	there's a reason why this so people might say okay you introduce another attribute or
0:14:50	like another set of parameters a latent layer there this is interpretable it's great
0:14:56	of course this is gonna improve perplexity
0:14:58	i'll show you that it's not just about perplexity what ends up happening is even
0:15:02	from the
0:15:03	learning theory perspective
0:15:05	because you're introducing these
0:15:06	latent models and interpretable discrete variable models
0:15:10	it actually converges better and learns to generate much more fluent and smooth responses
0:15:15	and explore parts of the search space that it wouldn't have before
0:15:19	simply because as you know almost every problem in this space is nonconvex so here
0:15:24	we start with that but
0:15:25	so here you're actually using the semantics or the natural language phenomena to guide
0:15:30	it in a better
0:15:31	what was it speaks
0:15:33	so the experimental results confirm the same so we ran this on a bunch of
0:15:37	datasets this is perplexity and the table shows basically
0:15:41	the columns are how much training data it was trained on
0:15:44	obviously if you go from left to right
0:15:46	the more data it's trained on the better the perplexity of the generated dialogue that
0:15:50	is
0:15:51	and here are the attributes that we use to model the dialogue
0:15:56	now
0:15:57	like sentiment means you're actually incorporating sentiment in the dialogue attribute stage of the model
0:16:01	prediction switchboard is basically the dialogue acts frames is another set of dialogue acts
0:16:06	so
0:16:07	these can all be mutually exclusive or be complementary or even overlapping
0:16:12	and what we noticed is it's actually even beneficial to compose some of
0:16:15	these attributes so they provide very different information so
0:16:18	the fact that you model sentiment is not the same as the fact that you
0:16:21	model
0:16:21	dialogue acts the fact that you model dialogue acts from one particular
0:16:25	domain is not the same as modeling
0:16:27	dialogue acts from a different domain so you can actually compose these attributes in very
0:16:31	flexible fashion and in fact it actually improves the generation
0:16:34	that means the perplexity goes down
0:16:38	so overall what we see is that
0:16:40	both the attribute conditioning and the reinforcement learning part
0:16:44	generate like much better responses and more interesting and diverse responses
0:16:49	so one we obviously
0:16:51	as i said i keep repeating perplexity because every time you see a deep learning
0:16:55	system i mean it's easy to improve perplexity right i mean you add more parameters
0:16:59	to the system i mean
0:17:00	the
0:17:01	the way it works is like more parameters and you add more data you can
0:17:05	actually improve perplexity by optimising towards better states or other parameter settings configurations
0:17:12	now we also in addition
0:17:14	did human evals on the generated responses to see if it actually makes sense
0:17:18	i mean because that's the whole goal of generation i believe every generation system should
0:17:22	do
0:17:22	human evals in some setting if at all possible
0:17:26	and what we notice is like
0:17:27	a standard sequence-to-sequence model compared with the attribute conditioning
0:17:32	obviously the attribute conditioning actually helps the diversity and also relevance
0:17:36	better that means it has a much better win-to-loss ratio compared to this baseline model
0:17:41	now in addition
0:17:42	when you add the rl conditioning on top of that that means like we do
0:17:46	the policy optimisation from this supervised pre-training step
0:17:49	it does even better
0:17:51	so the rl as i said is actually going to
0:17:54	move the supervised training state from that initialization state to a better
0:18:00	policy but instead of learning it over at the token level
0:18:02	now it's actually gonna learn it at the attribute level so we're injecting attribute conditioning both at the
0:18:06	rl level and also in the supervised training model
0:18:11	if you compute the diversity scores and there are standard ways to
0:18:15	do this based on the literature
0:18:17	look at the responses and you can do automatic
0:18:20	you know computation of the eval metrics like
0:18:23	compute the number of you know n-grams
0:18:25	that are overlapping et cetera
0:18:27	or how many distinct phrases are you know generated by the system
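One common way to count "how many distinct phrases are generated" is the distinct-n ratio: unique n-grams divided by total n-grams across all generated responses. Whether the authors used exactly this formulation is an assumption; the function name `distinct_n` and the whitespace tokenisation are choices made for this sketch.

```python
def distinct_n(responses, n):
    """Ratio of unique n-grams to total n-grams over a set of responses.

    Values near 0 mean the system keeps repeating itself; values
    near 1 mean almost every n-gram it produces is new.
    """
    total = 0
    unique = set()
    for response in responses:
        tokens = response.split()  # naive whitespace tokenisation
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0
```

For instance, a system that answers "i don't know" to two out of three inputs scores lower on this metric than one that gives three different replies, which is the behaviour the diversity comparison is meant to capture.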
0:18:31	overall the
0:18:33	sequence-to-sequence model is worse than the attribute-conditioned model and the rl one
0:18:37	is actually even better than both of that
0:18:42	in addition
0:18:45	if you take like the head
0:18:47	of the response space that means like the most likely responses
0:18:50	and you look at the percentage of them generated in the new systems
0:18:54	the percentage goes down significantly how many times have you seen a chatbot or anything
0:18:58	or any of the voice systems you ask a question it says i don't know right
0:19:02	so the goal is
0:19:06	that's a default you know fallback mechanism but the goal is like instead of that
0:19:10	can we model something about for example
0:19:13	emotional responses or other things just to sort of engage the user in a better fashion
0:19:18	what this allows you to do is like you don't get the
0:19:20	standard frustrating i don't know instead you get something more nuanced it may not be
0:19:24	the answer directly but it'll probably lead the conversation in a much better path
0:19:28	or direction
0:19:31	and here are some examples which i won't go through but like
0:19:34	for standard inputs or not that standard either they're from reddit so they're never standard
0:19:39	you get like interesting responses instead of saying things like
0:19:45	you know i don't know or you know i have no
0:19:48	idea you start getting like longer responses but also things that like you know
0:19:53	probably make more sense like for example i'm honestly bit confused
0:19:57	why
0:19:58	no one is brought me or my books any k might but it should be
0:20:01	box i think at kick
0:20:04	i don't think i don't think anything that's with the sequence-to-sequence model would
0:20:07	even but that you conditioning
0:20:10	whereas the rl one says i can't wait to see in the city
0:20:13	some of the context is missing from this example because the previous dialogue history it's
0:20:16	been cut off here but there's something about the city being mentioned there that's
0:20:20	why it's to see
0:20:22	okay just to summarize
0:20:25	we propose a new approach for dialog generation with controllable and composable semantics i
0:20:29	think this is a super important and interesting topic because
0:20:33	it's very easy to
0:20:34	do neural generation we can do gans and all kinds of things like
0:20:38	that but
0:20:39	making it actually interpretable and controllable in this fashion we believe also helps at least in our
0:20:44	empirical experiments the learning process as well it's not just about saying that this
0:20:48	is a good natural language phenomenon that we wanna model
0:20:51	both the rl and the attribute conditioning
0:20:54	in general improve the baseline model by generating interesting and diverse responses
0:20:58	there are a number of things that we
0:21:00	you know are looking at in the future
0:21:02	in addition to incorporating multimodal but
0:21:05	what is the impact of pre-trained
0:21:07	classifiers like for example as i said like we didn't use pre-trained classifiers for the
0:21:11	attribute prediction problem there
0:21:13	and how do we like
0:21:15	measure the interpretability via modeling this during the training process
0:21:18	are the dialogues that are generated actually
0:21:22	respecting the semantics of the attributes that it actually predicts i mean does that even
0:21:26	make sense
0:21:28	and then like how do you you know do this for
0:21:30	speaker persona and extend it to more open-ended concepts
0:21:34	these are
0:21:36	questions and like you know thoughts
0:21:37	if you have any questions related to any of these things happy to answer them
0:21:50	i am residuals from start of five am i was very interested in your training
0:21:54	corpus size of the examples you gave for the dialogue model training you've had up
0:21:57	to two million training examples obviously in that situation i assume you're not manually
0:22:03	generating them are you getting them from reddit as examples or where else do you
0:22:06	get them so yeah some of them are from
0:22:09	reddit and the opensubtitles corpora these are available
0:22:13	as i said
0:22:14	the attributes
0:22:15	themselves are not necessarily always manually annotated for example for switchboard i believe
0:22:20	first part of that behind it
0:22:22	for one of the datasets but what we ended up doing is like you
0:22:25	can take the
0:22:26	standard lda or any other you know tool
0:22:29	to actually label them with the sentiment so you can have let's say a high precision
0:22:32	classifier which you actually
0:22:34	run over the training corpus so these can be single label for instance
0:22:37	and the interesting part is that
0:22:40	after modeling all this like it's not necessary that the accuracy of the dialogue
0:22:45	attribute prediction will go up in the latent system
0:22:48	even though that might be in the eighties or something like that it still is good
0:22:52	enough for the generation system
0:22:54	so there is a so there's some work to be done about like
0:22:57	how good can we get like i mean should it be bumped up to like
0:23:00	ninety nine percent and whether that would have an effect on the generation
0:23:04	things that we are looking at
0:23:18	i am adding more german research lab just had a question about i guess did
0:23:21	you look at speaker persona at all i was only curious maybe you can speculate
0:23:25	about it do you think with enough data
0:23:29	with the conditional model you could model individual users
0:23:32	maybe like reddit user names or something like
0:23:35	there was a joke when we released smart reply after the first
0:23:38	further for version assume
0:23:41	i think it was some professor from a university said
0:23:43	the smart replies i'm getting seem very snotty to me
0:23:46	i was like
0:23:47	it's training on your own data i mean we don't look at the data but
0:23:50	you know it's basically reflecting yourself
0:23:52	so short answer is yes but of course you want to do this with
0:23:56	you know data right and you also want to do it in a privacy-preserving
0:23:59	manner which i haven't talked about here at all right part of my group focuses
0:24:01	on like
0:24:02	how do you do this all in a privacy-preserving manner right for example you
0:24:05	can build a general system
0:24:06	but then
0:24:07	all the inference and things can happen only on-device or in like sort of like
0:24:11	your data is like siloed off from everybody else
0:24:14	and the question is again
0:24:16	do you really feel like you have a specific personality or what you feel
0:24:20	versus what you actually write
0:24:21	might be very different right so there are aspects of that to be considered
0:24:35	i'll be here if you want

Oral Session 1: Policy and Knowledge

Chinnadhurai Sankar and Sujith Ravi