Speech Transcript - Unsupervised Dialogue Spectrum Generation for Log Dialogue Ranking

0:00:19	okay the
0:00:22	so then we move on to the next
0:00:24	speaker
0:00:27	so the paper is Unsupervised Dialogue Spectrum Generation for Log Dialogue Ranking
0:00:34	the
0:00:35	well as usual
0:00:48	hello everyone i'm Xinnuo
0:00:50	and this work was done with Yizhe Zhang Lars Liden and Sungjin Lee from
0:00:55	Microsoft Research by the way i'm from Heriot-Watt University
0:00:58	and let's start
0:01:01	so the aim of this paper is that we would like to build a
0:01:05	ranker
0:01:06	to detect the problematic dialogues
0:01:08	from the normal ones without any labeled data
0:01:12	we use the existing seed dialogues as the normal dialogues
0:01:16	and then learn a generative user simulator with GAN setups
0:01:21	and have it talk with the
0:01:23	bot at different training steps
0:01:26	and we get the log conversations from the different training steps
0:01:29	and take them as the problematic dialogues we call this method
0:01:35	StepGAN
0:01:36	and the experiment results show that StepGAN compares favorably with the rankers
0:01:41	trained on manually labeled datasets
0:01:46	okay so what is log dialogue ranking
0:01:49	so the log dialogues are conversations that happen between the real users
0:01:54	and the dialogue system
0:01:56	and the log
0:01:58	dialogue ranking aims to identify the problematic dialogues from the normal ones
0:02:03	here are two examples of
0:02:05	the normal dialogues and problematic dialogues
0:02:08	here is the first one
0:02:10	the first one is a normal dialogue
0:02:13	the dialogue is in the restaurant searching domain
0:02:18	firstly the system says hello and then the user
0:02:22	is asking for a european restaurant
0:02:25	and then the system asks what part of town do you have in mind
0:02:29	and the user says the centre
0:02:32	and after that the system asks for
0:02:36	the price range
0:02:37	and the user says an expensive one
0:02:41	and after getting all the information the system says i suggest
0:02:45	the Michaelhouse Cafe
0:02:47	and then repeats all the requirements of the user
0:02:51	and after that the user asks for the address of this cafe and the system
0:02:56	gives the correct information
0:02:58	and they say thank you to each other and the dialogue finishes
0:03:06	so we define normal dialogues as
0:03:09	dialogues without any contextually unnatural turns
0:03:13	that also achieve all the requirements
0:03:16	asked by the user
0:03:19	and here is the problematic dialogue
0:03:23	so here it is very
0:03:26	apparent
0:03:27	that the system can't understand the user utterance
0:03:30	and the conversation is going in the wrong direction
0:03:32	for example
0:03:33	the user says i would really like some european that's cheap
0:03:37	and the system has some problems understanding this utterance
0:03:41	and suggests a restaurant
0:03:44	which is in the east part of town
0:03:46	however the user was asking for the centre
0:03:49	and after that the user says
0:03:54	i want to eat at this restaurant have you got the address
0:03:59	and the system doesn't understand this utterance and asks what part of town do you
0:04:03	have in mind again
0:04:05	so we define problematic dialogues as
0:04:08	the dialogues with either contextually unnatural turns
0:04:12	or some unachieved requirements or both
0:04:16	so the goal for this ranker
0:04:19	actually is the best
0:04:22	so the goal for the ranker is to pick out this type of problematic
0:04:26	dialogues from the normal ones
0:04:30	so why do we need this ranker
0:04:33	in the human in the loop development of data driven dialogue systems
0:04:38	the developers would bootstrap their dialogue system
0:04:42	using some in-domain seed dialogues
0:04:47	and then the dialogue system will be
0:04:51	deployed
0:04:53	will be released to the customers
0:04:54	and then the log dialogues or log conversations can be collected
0:05:02	and then the developers can improve the performance of the system
0:05:06	by correcting some mistakes the system made in the log dialogues and then retraining the
0:05:12	dialogue system model
0:05:15	however
0:05:16	going through all these dialogues is time consuming
0:05:22	so we hope
0:05:23	that this manual checking process can be replaced by a dialogue ranker
0:05:29	that can detect dialogues with lower quality automatically
0:05:33	to make this dialogue learning process with a human in the loop more efficient
0:05:40	so here is the structure of the ranker
0:05:44	the input for the ranker is just the dialogue
0:05:46	and the output is a score
0:05:49	between zero and one where zero means a normal dialogue and one
0:05:53	means a problematic dialogue
0:05:57	so firstly
0:05:58	we get the sentence embeddings from the sentence encoder
0:06:03	and then feed them into this multi-head self-attention
0:06:07	to capture the meaning of the dialogue context
0:06:13	and then we have this turn-level classifier
0:06:16	to identify the quality of each turn
0:06:18	for example
0:06:20	for these very smooth turns the score should be zero point one
0:06:24	and for all sort
0:06:27	and for these problematic turns the score should be zero point nine
0:06:33	and then there is this dialogue-level ranker on top
0:06:38	of these turn-level qualities
0:06:40	and for this dialogue some parts of it are
0:06:44	smooth and some of them are problematic
0:06:46	so probably the score will be like zero point eight or something the expected
0:06:49	score
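The data flow described here, turn-level scores feeding a dialogue-level score, can be illustrated with a small sketch. It is not the authors' model: the sentence encoder and the multi-head self-attention are left out, the per-turn logits are made-up inputs, and the max/mean blend in the hypothetical `dialogue_score` is an assumption, since the talk does not say how the turn scores are combined.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def turn_scores(turn_logits):
    """Map hypothetical per-turn logits to scores in (0, 1); higher means more problematic."""
    return [sigmoid(z) for z in turn_logits]

def dialogue_score(scores, max_weight=0.7):
    """Assumed aggregation: blend the worst turn with the average turn."""
    worst = max(scores)
    average = sum(scores) / len(scores)
    return max_weight * worst + (1.0 - max_weight) * average

# Made-up logits: three smooth turns versus one smooth turn followed by two bad ones.
smooth = turn_scores([-2.2, -2.0, -2.5])
mixed = turn_scores([-2.2, 2.2, 2.0])
print(dialogue_score(smooth), dialogue_score(mixed))
```

With these inputs the dialogue containing problematic turns receives the higher score, which is the ordering the ranker is meant to produce.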
0:06:52	so the training of the ranker normally needs labeled data
0:06:55	and gathering all the data for the training of this
0:06:59	ranker is very time-consuming
0:07:01	so imagine that in the human in the loop process in development whenever
0:07:06	a significant change is made to the system new labeled data for the
0:07:11	ranker is required
0:07:13	this is not feasible for most of the developers
0:07:16	and that
0:07:17	motivates us to explore this StepGAN approach
0:07:23	the general idea for this task is that
0:07:25	we take the seed dialogues as the normal dialogues
0:07:27	and at the same time we use StepGAN to simulate the
0:07:31	problematic dialogues
0:07:33	and train the rankers on top of this data
0:07:39	so here is the structure of the StepGAN setup
0:07:43	we have this dialogue generator and we have this discriminator and in
0:07:47	the dialogue generator we have the restaurant searching dialogue system
0:07:51	and an rnn-based user simulator
0:07:56	firstly we start with the pre-training process
0:08:00	in this process we pre-train our user simulator with the full length
0:08:04	multi-domain dialogues
0:08:08	for example for the multi-domain dialogues this can be for example
0:08:12	the pizza ordering
0:08:13	in which the user is asking for a large pineapple pizza
0:08:17	and it can be the temperature setting domain in which the user
0:08:22	is asking for setting the temperature of the room to seventy two degrees
0:08:30	and then we
0:08:32	we just ask the user simulator to simulate some dialogues
0:08:36	together with the restaurant search bot
0:08:40	and here is an example of a simulated dialogue after pre-training as we can see the
0:08:45	user simulator has some basic language abilities
0:08:49	but it doesn't know how to talk with the restaurant search bot
0:08:54	so when the system is asking for
0:08:57	some restaurant searching requirement the user says management home or something like that
0:09:02	and of course
0:09:03	the dialogue is not going in the right direction
0:09:10	so
0:09:11	after we get the simulated problematic dialogues we train
0:09:17	the discriminator together with the seed dialogues
0:09:20	which is pre-trained sorry
0:09:24	so after the pre-training process we move on to the first step of
0:09:29	the StepGAN training
0:09:32	firstly we just initialize our user simulator and the discriminator
0:09:37	with the pre-trained models
0:09:40	separately
0:09:43	and then in the
0:09:45	GAN setups
0:09:46	for the training of the discriminator we ask the dialogue generator to
0:09:51	simulate some dialogues with only one turn
0:09:54	and take them as the problematic dialogues
0:09:57	and then we take the seed dialogues and truncate them to the first turn
0:10:03	and take them as the normal dialogues and feed them into the
0:10:06	discriminator
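The data preparation for one step, as described above, can be sketched as follows. This is an illustration under assumptions, not the authors' code: a dialogue is represented as a list of (system, user) turn pairs, the function names `truncate` and `discriminator_examples` are hypothetical, and the labels follow the ranker convention from the talk (0 for normal, 1 for problematic).

```python
def truncate(dialogue, k):
    """Keep only the first k turns of a dialogue (a list of (system, user) pairs)."""
    return dialogue[:k]

def discriminator_examples(seed_dialogues, simulated_dialogues, k):
    """Seed dialogues cut to k turns are normal (0); k-turn simulated ones are problematic (1)."""
    normal = [(truncate(d, k), 0) for d in seed_dialogues if len(d) >= k]
    problematic = [(truncate(d, k), 1) for d in simulated_dialogues]
    return normal + problematic

# Toy data: one two-turn seed dialogue and one one-turn simulated dialogue.
seed = [[("hello", "i want european food"), ("what part of town", "the centre")]]
simulated = [[("hello", "management home")]]
examples = discriminator_examples(seed, simulated, k=1)
print(examples)
```

Calling the same function with `k=2` gives the step-two data, where both sides are two turns long.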
0:10:10	and for the training of the simulator in step one we also
0:10:14	use these one-turn seed dialogues
0:10:20	after that we start our GAN setups that's iterating the training of the
0:10:24	generator and the training of the discriminator
0:10:27	after the model gets converged
0:10:30	we ask
0:10:31	the model to simulate full length dialogues
0:10:34	and put them into the simulated problematic dialogue
0:10:40	bucket
0:10:41	as we can see the first turn of this dialogue is very smooth
0:10:45	but after that when the system
0:10:48	is asking what part of town do you have in mind
0:10:52	the user says the continent which the system can't understand
0:10:57	and the dialogue is going wrong
0:11:00	and after the first step we're coming to the second step
0:11:05	and firstly we also initialize our user simulator and the discriminator
0:11:10	we initialize the user simulator with the one which was trained in
0:11:14	step one
0:11:14	and we
0:11:17	initialize the discriminator with the pre-trained model
0:11:22	and the only difference between step one and step two is that
0:11:27	we are asking the dialogue generator to
0:11:34	simulate dialogues with two turns
0:11:36	and at the same time we truncate our seed dialogues to two turns and
0:11:40	then train the discriminator and the user simulator at the same time
0:11:47	after the model gets converged
0:11:49	we ask the user simulator to simulate full length dialogues
0:11:53	and then put them into the simulated problematic dialogues
0:11:57	so as we can see the first two turns are smooth and from the
0:12:00	third turn there's something wrong
0:12:06	okay and then
0:12:08	we just repeat this for like n steps
0:12:13	and after the n steps of training
0:12:15	we get
0:12:18	the full buckets of the simulated problematic dialogues
0:12:22	and together with the seed dialogues
0:12:24	we train our dialogue ranker
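The stepwise procedure walked through above can be summarised as a loop. This is a minimal sketch under assumptions rather than the authors' implementation: `step_gan`, `adversarial_train` and `simulate_full_dialogues` are hypothetical names, the models are passed in as opaque objects, and the stand-in functions at the bottom exist only so the sketch runs.

```python
import copy

def step_gan(seed_dialogues, pretrained_simulator, pretrained_discriminator,
             adversarial_train, simulate_full_dialogues, n_steps):
    """Collect one bucket of full-length simulated dialogues per training step."""
    buckets = []
    simulator = pretrained_simulator
    for k in range(1, n_steps + 1):
        # The discriminator restarts from the pre-trained model at every step,
        # while the simulator continues from the previous step.
        discriminator = copy.deepcopy(pretrained_discriminator)
        truncated_seed = [d[:k] for d in seed_dialogues]
        simulator = adversarial_train(simulator, discriminator, truncated_seed, k)
        buckets.append(simulate_full_dialogues(simulator))
    return buckets

# Stand-in components; real ones would be neural models trained adversarially.
def fake_adversarial_train(simulator, discriminator, truncated_seed, k):
    assert all(len(d) <= k for d in truncated_seed)
    return {"trained_steps": simulator["trained_steps"] + 1}

def fake_simulate(simulator):
    return ["full dialogue after step %d" % simulator["trained_steps"]]

seed = [[("hello", "european food"), ("what part of town", "centre"),
         ("price range", "expensive")]]
buckets = step_gan(seed, {"trained_steps": 0}, {"name": "pretrained"},
                   fake_adversarial_train, fake_simulate, n_steps=3)
print(buckets)
```

The ranker is then trained on the union of the buckets (labeled problematic) and the seed dialogues (labeled normal).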
0:12:29	so here are the datasets we're using in this paper
0:12:32	basically we're using three datasets
0:12:34	the first one is the multi-domain dialogues
0:12:38	that is for the pre-training of the user simulator and the discriminator
0:12:44	and we're using the MetaLWOz dataset
0:12:47	which is task-oriented conversations with thirty sorry forty two thousand dialogues
0:12:53	over fifty one domains
0:12:56	and each dialogue in this dataset is a task-oriented conversational interaction
0:13:01	between two real speakers one of them simulating the user and the
0:13:06	other one simulating the bot
0:13:10	and the second part is the seed dialogues
0:13:14	these dialogues are for the training of the GAN structure
0:13:19	and normally the seed dialogues are human-written dialogues that will be offered to the
0:13:23	developers before the active development of the dialogue system
0:13:27	however we don't have these human-written dialogues
0:13:30	so we create these seed dialogues
0:13:34	we create our seed dialogues
0:13:37	by having the PyDial restaurant searching bot
0:13:41	talk to the rule-based
0:13:43	user simulator that is also offered by PyDial
0:13:48	and the third one is the manually labeled log dialogues which is for the evaluation
0:13:53	of this task
0:13:56	to collect this labeled data we deployed our PyDial
0:14:02	restaurant search bot via the Amazon Mechanical Turk platform
0:14:06	firstly we automatically generate some requirements for the users for example
0:14:13	food type and also
0:14:15	location and price range
0:14:17	and then
0:14:18	we ask turkers to find a restaurant
0:14:21	that satisfies those requirements
0:14:24	by chatting with our restaurant search bot
0:14:27	and at the end of each
0:14:30	and also at the end of each task
0:14:33	the users are asked two questions
0:14:37	the first one is whether they found a restaurant
0:14:40	meeting all the requirements and in the second one we ask the user
0:14:44	to label the contextually unnatural turns
0:14:48	in the conversation
0:14:53	in total we collected one thousand six hundred normal dialogues and
0:14:58	one thousand three hundred problematic dialogues
0:15:05	here are some experiment results basically we did four experiments
0:15:11	to justify the performance of this
0:15:13	StepGAN
0:15:16	so the first one is we investigate how
0:15:21	the generated dialogues move towards the normal dialogues
0:15:25	basically we examine the dialogues generated at each
0:15:30	each time step of the StepGAN
0:15:32	in terms of three metrics
0:15:34	here are two of them
0:15:35	the upper one is the ranking score and the lower
0:15:40	one is the success rate
0:15:43	and the yellow dashed line and the green dashed line which is probably very
0:15:47	weak
0:15:49	stand for the average performance of the labeled
0:15:53	normal dialogues and the labeled
0:15:55	problematic dialogues
0:16:02	so as we can see after the first turn of training
0:16:06	the
0:16:07	performance of the generated dialogues
0:16:10	is much worse than the labeled problematic
0:16:16	labeled problematic dialogues
0:16:22	okay
0:16:23	after three turns of training
0:16:25	both metrics are growing and are better than the average performance of the
0:16:30	labeled problematic dialogues
0:16:35	and as we can see after the n turns of training
0:16:39	the success rate
0:16:41	is even as high as the
0:16:44	labeled normal dialogues
0:16:47	and also we can see the dialogues are getting smoother and they're
0:16:51	very smooth and very
0:16:54	natural
0:16:58	here is the
0:17:02	here is the second experiment
0:17:03	so in the second experiment we just compare the StepGAN with the
0:17:09	ranker trained on the labeled dataset
0:17:13	so firstly we just divide the AMT labeled data into three parts the
0:17:17	two thousand training dataset
0:17:20	two thousand training examples two hundred dev examples and four hundred testing examples
0:17:25	and then we train this dialogue ranker
0:17:28	we call this supervised two thousand on this labeled training dataset
0:17:33	and use the performance
0:17:35	and by the way we're evaluating this problem by precision at k
0:17:39	and recall at k
0:17:44	so the training of the
0:17:46	sorry for this StepGAN
0:17:47	and we simulated basically rt start and problematic dialogues
0:17:51	and
0:17:52	because the number of the seed dialogues sorry all the
0:17:56	datasets are balanced datasets there are one thousand positive examples and one
0:18:00	thousand negative examples
0:18:01	and because the number of seed dialogues is only one hundred so
0:18:05	we just duplicate them thirty times and try to make this dataset balanced
0:18:11	and then we train our StepGAN on this dataset
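The balancing step mentioned here, duplicating a small set of seed dialogues until it matches the simulated set in size, can be sketched like this. The helper name `balance_by_duplication` is hypothetical, and the count of 3000 simulated dialogues in the demo is an assumption chosen so that one hundred seed dialogues duplicated thirty times balances it; the talk's own figure is not clearly recoverable from the transcript.

```python
def balance_by_duplication(seed_dialogues, simulated_dialogues):
    """Repeat the normal (seed) examples so both classes have similar size."""
    repeats = max(1, len(simulated_dialogues) // len(seed_dialogues))
    normal = [(d, 0) for d in seed_dialogues] * repeats          # label 0: normal
    problematic = [(d, 1) for d in simulated_dialogues]          # label 1: problematic
    return normal + problematic

# Placeholder dialogues; only the counts matter for this illustration.
seed_set = [["seed dialogue %d" % i] for i in range(100)]
simulated_set = [["simulated dialogue %d" % i] for i in range(3000)]
training_set = balance_by_duplication(seed_set, simulated_set)
n_normal = sum(1 for _, label in training_set if label == 0)
n_problematic = sum(1 for _, label in training_set if label == 1)
print(n_normal, n_problematic)
```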
0:18:15	so here's the performance
0:18:18	and as we can see the StepGAN performs even better than the supervised
0:18:22	approach
0:18:23	when k is lower than fifty
0:18:26	even though the supervised two thousand has higher performance
0:18:31	when k is getting larger
0:18:32	StepGAN still compares favorably with it
0:18:37	and here's the third experiment
0:18:40	we just basically
0:18:43	we just basically add the simulated data
0:18:46	into the labeled data
0:18:49	and try to compare the performance of this combined dataset with the labeled
0:18:52	dataset
0:18:53	and
0:18:56	here is the result
0:18:58	so basically the experiment shows that our StepGAN
0:19:02	approach can bring some additional
0:19:04	generalization by simulating
0:19:07	a wider range of dialogues
0:19:09	that are not covered by the labeled data
0:19:14	so the last experiment is we're comparing the StepGAN with other types
0:19:19	of user simulator
0:19:21	and the first one is the
0:19:22	basically what we call multi-domain
0:19:25	what it is doing is just like we train this user simulator with the multi
0:19:29	domain dialogues
0:19:30	and simulate one thousand problematic dialogues
0:19:32	and then together with the seed dialogues we train a ranker the dialogue
0:19:37	ranker
0:19:38	and q
0:19:41	and the second one is the fine-tune model
0:19:44	so basically we pre-train the user simulator with the multi-domain dialogues
0:19:48	and then fine-tune it on the seed dialogues
0:19:52	and then we generate
0:19:54	one thousand problematic dialogues and train it together with the seed dialogues
0:19:58	here's the performance
0:20:01	and the last one is what we call stepwise fine-tune
0:20:05	so basically we just
0:20:07	instead of fine-tuning on the full length of
0:20:12	the seed dialogues we just
0:20:14	fine-tune in the stepwise fashion which has been introduced in the StepGAN
0:20:18	just without the GAN structure
0:20:21	and
0:20:22	here's the results
0:20:23	and we also train our StepGAN on the same size of dataset
0:20:27	which is still one thousand simulated dialogues and one thousand
0:20:31	and
0:20:33	seed dialogues so as we can see the StepGAN also
0:20:38	outperforms all the other user simulators
0:20:42	so the conclusion is StepGAN can generate dialogues with a wide range of
0:20:47	qualities
0:20:48	and it compares favorably with the ranker trained on the labeled dataset
0:20:54	and it brings additional generalization by simulating a
0:20:57	wider range of dialogues
0:20:59	that are not covered by the
0:21:02	labeled data sorry
0:21:04	and lastly
0:21:05	it also outperforms other user simulators
0:21:15	thank you very much questions
0:21:22	hi i actually have two questions let's see if i
0:21:25	the first one is
0:21:28	of course you're starting with a binary classification problematic versus non problematic but of course
0:21:34	there are
0:21:35	more problematic dialogues and you had it
0:21:38	i and you address some of that via the times however in the end is
0:21:43	still a binary classification right yep
0:21:46	then my second question is because it's a binary classification what does it mean precision
0:21:51	at k in this case so basically precision at k is like
0:21:56	a ranking metric it is pretty relevant for evaluating the ranking process
0:22:03	so basically what we're doing is like
0:22:05	we for example have four hundred testing data and then we just
0:22:10	use our dialogue ranker to give a score to each dialogue and then we
0:22:15	rank them from the top
0:22:18	to the bottom
0:22:19	and then that means like
0:22:21	we suppose that
0:22:23	at the top of this dataset like if we give a higher score to these
0:22:27	dialogues that means like these dialogues are problematic dialogues
0:22:30	so what k means is like we just truncate this dataset at
0:22:34	for example the first ten dialogues
0:22:37	and then we calculate how many of them are problematic dialogues
0:22:42	and divide by ten
0:22:43	and we can extend it more like maybe we can see like top fifty and
0:22:48	top one hundred
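The precision at k computation explained in this answer, together with the recall at k mentioned earlier in the talk, can be written out as a short sketch. The function names and the four-dialogue toy data are illustrative, not from the paper; labels use 1 for problematic and 0 for normal, as in the talk.

```python
def rank_by_score(scores, labels):
    """Sort labels by ranker score, highest (most problematic) first."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return [label for _, label in ranked]

def precision_at_k(scores, labels, k):
    """Fraction of the top-k ranked dialogues that are truly problematic."""
    top_k = rank_by_score(scores, labels)[:k]
    return sum(top_k) / k

def recall_at_k(scores, labels, k):
    """Fraction of all problematic dialogues that appear in the top k."""
    total_problematic = sum(labels)
    if total_problematic == 0:
        return 0.0
    top_k = rank_by_score(scores, labels)[:k]
    return sum(top_k) / total_problematic

# Toy test set of four dialogues.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
print(precision_at_k(scores, labels, 2), recall_at_k(scores, labels, 2))
```

In the toy data one of the top two dialogues is problematic, so precision at 2 is 0.5, and one of the two problematic dialogues is retrieved, so recall at 2 is also 0.5.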
0:22:57	you generate this problem is to dialogue so sort of letting lasso for us a
0:23:02	so you generate these problematic dialogues in this fashion where the beginning is all
0:23:08	good and then the end is kind of rubbish
0:23:12	this is also comes from there or you this is a separate but there is
0:23:16	like something that in the middle of the thing to get you so for the
0:23:19	test data it's like basically the human labeled data it's not only labeled but
0:23:25	it's a human talking with our system so the error can be like at
0:23:28	the beginning or the middle or the end so it's like
0:23:32	it's just or if you don't really don't by john it's like the whole don't
0:23:35	know yes we don't run time by turn would just about the hotel or is
0:23:40	to think intent
0:23:41	it
0:23:45	all the questions
0:23:53	hi i'm a really from a dt i have a question about how you
0:23:57	define the problematic dialogue as a whole i mean there can be some
0:24:01	errors in the middle that the system can repair so what do you mean exactly by
0:24:07	a problematic dialogue so we define problematic dialogues as
0:24:12	like there are two not two there are like two types
0:24:17	of problematic actually three types of problematic dialogues
0:24:21	and the first type is like they have some unnatural turns
0:24:24	so basically
0:24:25	they achieve their goal
0:24:28	but the communication is not smooth
0:24:30	so this is the first one
0:24:31	and the second type is like the communication is not smooth and at the same
0:24:36	time they didn't achieve the goal
0:24:37	and actually there is potentially a third one which is the communication is
0:24:42	smooth but they didn't achieve the goal so
0:24:44	we just define problematic in this way
0:24:47	the in terms of the fan from the entrance is not smooth but the task
0:24:51	be successful is that this do you have a targeted the done data entry and
0:24:55	i'm sorry i didn't you have been calculated the annotator agreement
0:25:00	we didn't specifically define this type of data but because
0:25:06	we gathered the data i think this type of examples are in the testing dataset
0:25:12	alright thank you
0:25:15	question whether
0:25:19	right
0:25:23	because like the ranker outputs
0:25:26	continuous but
0:25:28	and you
0:25:32	no so the output of the ranker is continuous between zero and
0:25:38	one so it can be like zero point eight or zero point five something
0:25:41	and when it's close to one that means it's a problematic one and when
0:25:45	it's close to zero that means it's like the normal one so it's
0:25:49	between zero and one
0:25:52	it's just what is the loss function
0:25:57	so the loss function basically is the
0:26:00	the
0:26:01	score that the ranker gives compared with the label so
0:26:07	we label problematic dialogues as one
0:26:09	and the normal dialogues as zero and the loss is just like the score given by
0:26:14	the ranker
0:26:15	compared with this label so for example we use the cross-entropy
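The answer describes a loss that compares the ranker's score in the range zero to one with a label of one (problematic) or zero (normal). Binary cross-entropy is one standard loss of that form; the sketch below assumes it for illustration, since the exact loss name is not clearly recoverable from the recording. The function name and the `eps` clipping value are choices made here, not details from the talk.

```python
import math

def binary_cross_entropy(score, label, eps=1e-12):
    """Loss between a ranker score in (0, 1) and a 0/1 label."""
    # Clip the score to avoid taking log(0).
    score = min(max(score, eps), 1.0 - eps)
    return -(label * math.log(score) + (1 - label) * math.log(1.0 - score))

# A problematic dialogue (label 1) scored high is penalised less than one scored low.
print(binary_cross_entropy(0.9, 1), binary_cross_entropy(0.1, 1))
```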
0:26:24	i have one question also so these generated
0:26:28	bad dialogues the problematic dialogues
0:26:31	how do you know that they also correspond to the actual problematic dialogues
0:26:35	to that of course are so this corpus
0:26:38	so we have three metrics to evaluate that
0:26:41	and
0:26:42	the first one is like the length
0:26:44	so normally if there's something wrong in the dialogue
0:26:48	or the user didn't achieve the goal normally the dialogue is longer
0:26:51	so this is one metric
0:26:53	and the other one is the success rate that determines whether the user achieved their goal
0:26:59	and the third one is the
0:27:01	score given by the ranker which is trained on the
0:27:05	labeled data so basically it's like
0:27:07	proper boundary like giving the score
0:27:09	so we just compare like
0:27:12	so we just compare so basically that is this one
0:27:16	this
0:27:27	so basically it's this slide so
0:27:30	we just compare it with the average
0:27:33	for example the average ranking score of the labeled problematic dialogues which is the
0:27:38	green
0:27:38	and also compare with the yellow dashed line
0:27:42	that means the average performance of the labeled
0:27:46	normal dialogues so we just see like at the beginning
0:27:50	all these evaluation metrics are very low and after that they're getting higher so that
0:27:54	means like at the beginning there are a lot of
0:27:58	problematic dialogues and in the end it's getting
0:28:01	but i was if you read this example it seems like the user utterances or
0:28:05	various look up to be very unlikely to happen in about
0:28:10	what color turn your mind boston
0:28:13	them for is going up in colorado and it's like the user is doing great
0:28:18	system here
0:28:20	yes there is no virtual characters and system reacting yes but it without introducing probably
0:28:27	or whatever but for this one it's only after one training step and after
0:28:31	so you can see like after the three
0:28:35	after three steps of training the user is
0:28:40	saying something like for example i'm not looking for this place please change so
0:28:44	this is also related to the restaurant domain
0:28:47	but so that
0:28:48	that is an utterance that the system can't understand so that
0:28:52	caused the problem the failure of the dialogue so probably at the beginning well
0:28:56	we want to generate the problematic in like multiple maori in very creepy we but
0:29:02	after so during this StepGAN training process the dialogue is getting
0:29:07	into this restaurant search domain it is just like the way the
0:29:11	user is describing their requirements is not accepted by the system so the
0:29:19	generated dialogue is getting closer to the domain and is getting less
0:29:23	three
0:29:25	okay but you
0:29:27	we want to run for a final question
0:29:34	so it is you go along blues steps of the step again it looks like
0:29:38	the
0:29:39	the problems
0:29:40	looks like ordering and back
0:29:44	like after this the g m is that the case i'm just asking what you
0:29:48	like doesn't the generator
0:29:51	generated a low quality
0:29:53	problem just and
0:29:55	is actually you know so
0:29:58	so most of the problems appear in
0:30:03	the end but during the generation process because we
0:30:08	have some like a random seed or something
0:30:09	there are some problems that can
0:30:13	appear in between but these are much less than the ones that appear in the end
0:30:20	i see okay i mean so then be secure you i mean that's
0:30:24	something because we are doing
0:30:30	problems in the middle or the beginning i see it does so basically we actually
0:30:40	ideally we want like we have the error in like all kinds of
0:30:44	places
0:30:45	and the
0:30:47	indeed like some of the generated dialogues even after maybe six or
0:30:52	seven turns there are still some problems that appear in the middle but it's
0:30:56	much less
0:30:57	i think maybe second of future work this i guess it was just gonna see
0:31:01	my helpful to combine different dialogs from different steps of just a
0:31:07	in table i want to train the rent
0:31:10	you mean like to collect the data from different training steps but we're
0:31:14	doing that we're like
0:31:16	combining all these dialogues into this okay
0:31:22	okay i think that's the final question so let's thank the speaker again

Unsupervised Dialogue Spectrum Generation for Log Dialogue Ranking

Oral Session 2: Implications of Deep Learning for Dialogue Modeling

Xinnuo Xu, Yizhe Zhang, Lars Liden and Sungjin Lee