0:00:19 Okay, so then we move on to the next talk. The paper is "Unsupervised Dialogue Spectrum Generation for Log Dialogue Ranking". As usual, this work was done together with Yizhe Zhang, Lars Liden and Sungjin Lee from Microsoft Research; I myself am from Heriot-Watt University. So let's start.
0:01:01 The aim of this paper is to detect problematic dialogues among normal ones without any labeled data. We use the existing seed dialogues as the normal dialogues, and then learn a generative user simulator with GAN setups that talks with the bot at different training steps; we collect the conversations from the different training steps and take them as the problematic dialogues. We call this method stepGAN. The experimental results show that stepGAN compares favorably with a ranker trained on a manually labeled dataset.
0:01:46 Okay, so what are log dialogues, and what is ranking? Log dialogues are conversations that happen between the real users and the dialogue system, and dialogue ranking aims to identify the problematic dialogues among the normal ones.
0:02:03 Here are two examples, a normal dialogue and a problematic dialogue. The first one is a normal dialogue; it is in the restaurant-search domain. First the system says hello, and then the user asks for a European restaurant. The system asks what part of town the user has in mind, and the user says the centre. After that, the system asks for the price range, and the user picks the expensive one. After getting all the information, the system says "I suggest the Michaelhouse Cafe" and repeats all the requirements of the user. After that, the user asks for the address of this cafe, and the system gives the correct information; they thank each other and the dialogue finishes. So we define a normal dialogue as a dialogue without any contextually unnatural turns that also achieves all the requirements asked by the user.
0:03:19 And here is a problematic dialogue. Apparently the system cannot understand the user's utterances, and the conversation goes in the wrong direction. For example, the user says "I would really like a European restaurant that's cheap", and the system has some problem understanding this utterance: it suggests a restaurant that is in the east part of town, while the user was asking for the centre. After that, the user says "I want to eat at this restaurant, have you got the address?", and the system misunderstands this utterance and asks what part of town the user has in mind again. So we define a problematic dialogue as a dialogue with either contextually unnatural turns, or unachieved requirements, or both.
0:04:16 So the goal of the ranker is to pick out this type of problematic dialogue from the normal ones.
0:04:30 So why do we need a ranker? In the typical human-in-the-loop development cycle of a data-driven dialogue system, the developers first build the dialogue system using some in-domain seed dialogues. The dialogue system is then deployed and released to the customers, and log conversations can be collected. The developers can then improve the performance of the system by correcting the mistakes the system made in the log dialogues and retraining the dialogue system model. However, going through all these dialogues is time-consuming, so we hope that this manual checking process can be replaced by a dialogue ranker that detects dialogues of lower quality automatically, to make this human-in-the-loop learning process more efficient.
0:05:40 So here is the structure of the ranker. The input to the ranker is just the dialogue, and the output is a score between zero and one, where zero means a normal dialogue and one means a problematic dialogue. First, we get the sentence embeddings from the utterance encoder, and then feed them into multi-head self-attention layers to capture the meaning of the dialogue context. Then we have a turn-level classifier to identify the quality of each turn: for example, for a very smooth turn the score should be around 0.1, and for a problematic turn the score should be around 0.9. On top of these turn-level qualities there is a dialogue-level ranker; for a dialogue in which some parts are smooth and some are problematic, the final score will probably be something like 0.8.
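The two-level scoring described here can be sketched as follows. This is a simplified illustration, not the paper's model: the real ranker learns both the turn scores and the dialogue-level aggregation, whereas here the turn scores are taken as given and the aggregation rule is an assumed stand-in.

```python
# Sketch of the ranker's two-level scoring (0 = normal, 1 = problematic).
# Assumption: the aggregation below stands in for the learned dialogue-level
# ranker; the real model uses sentence embeddings and multi-head self-attention.

def dialogue_score(turn_scores):
    """Aggregate per-turn problem scores into one dialogue-level score in [0, 1]."""
    if not turn_scores:
        return 0.0
    # Intuition from the talk: a dialogue with any problematic turn should
    # score high, so weight the worst turn as heavily as the average.
    worst = max(turn_scores)
    mean = sum(turn_scores) / len(turn_scores)
    return 0.5 * worst + 0.5 * mean

# Two smooth turns (0.1) and two problematic turns (0.9), as in the example.
print(round(dialogue_score([0.1, 0.1, 0.9, 0.9]), 2))  # → 0.7
```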
0:06:52 On the training data for the ranker: gathering labeled data for training this ranker is very time-consuming, and in the human-in-the-loop development process, whenever a significant change is made to the system, new labeled data for the ranker is required. This is not feasible for most developers, and that is what motivates us to explore this stepGAN approach.
0:07:23 The general idea of stepGAN is that we take the seed dialogues as the normal dialogues, and at the same time we use stepGAN to simulate the problematic dialogues, and then train the ranker on top of this data.
0:07:39 So here is the structure of the i-th step of stepGAN. We have a dialogue generator, and we have a discriminator; the dialogue generator contains the restaurant-search dialogue system and an RNN-based user simulator. First, we start with a pre-training process, in which we pre-train our user simulator with full-length multi-domain dialogues. For example, one of these multi-domain dialogues can be about pizza ordering, in which the user asks for a large pineapple pizza, and another can be in the temperature-setting domain, in which the user asks to set the temperature of the room to seventy-two degrees. Then we ask the user simulator to simulate some dialogues together with the restaurant-search bot. Here is an example of a simulated dialogue after pre-training. As we can see, the user simulator has some basic language abilities, but it doesn't know how to talk with the restaurant-search bot: when the system asks for some restaurant-search requirement, the user replies with something irrelevant, and of course the dialogue does not go in the right direction.
0:09:11 After we get these simulated problematic dialogues, we pre-train the discriminator on them together with the seed dialogues.
0:09:24 After the pre-training process, we move on to the first step of the GAN training. First, we initialize the user simulator and the discriminator with the pre-trained models. Then, in step one, for the training of the discriminator we ask the dialogue generator to simulate some dialogues with only one turn and take them as the problematic dialogues; at the same time we take the seed dialogues, truncate each of them to the first turn, take those as the normal dialogues, and feed both into the discriminator. For the training of the simulator in step one, we also use these one-turn simulated and seed dialogues.
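The step-one data construction boils down to cutting each dialogue after its first exchange. A minimal sketch, where a dialogue is assumed (for illustration only) to be a list of (user, system) utterance pairs:

```python
def truncate(dialogue, n_turns):
    """Keep only the first n_turns exchanges of a dialogue."""
    return dialogue[:n_turns]

seed = [("I am looking for a European restaurant.", "What part of town do you have in mind?"),
        ("The centre, please.", "What price range are you looking for?")]

# In step one, seed dialogues truncated to one turn serve as the "normal" class.
print(len(truncate(seed, 1)))  # → 1
```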
0:10:20 After that, we start the adversarial training, alternating between training the generator and training the discriminator. After the model converges, we ask the model to simulate full-length dialogues and put them into the pool of simulated problematic dialogues. As we can see in this example, the first turn is very smooth, but after that, when the system asks what part of town the user has in mind, the user replies with content the system cannot understand, and the dialogue goes wrong.
0:11:00 After the first step, we come to the second step. Again, we first initialize our user simulator and the discriminator: we initialize the user simulator with the one trained in step one, and we initialize the discriminator with the pre-trained model. The only difference between step one and step two is that we ask the dialogue generator to simulate dialogues with two turns, and at the same time we truncate our seed dialogues to two turns and train the discriminator and the user simulator on them. After the model converges, we ask the user simulator to simulate full-length dialogues and put them into the pool of simulated problematic dialogues. As we can see, the first two turns of this example are smooth, and from the third turn on there is something wrong.
0:12:06 Okay, and then we just repeat this for N steps. After the N steps of training, we get N buckets of simulated problematic dialogues, and together with the seed dialogues we use them to train our dialogue ranker.
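Putting the steps together, the overall loop can be sketched like this. Everything here is a placeholder skeleton: the dummy simulator and discriminator stand in for the paper's RNN user simulator and discriminator, and the fit() bodies, which would hold the adversarial updates, are intentionally empty.

```python
# Skeleton of the step-wise GAN ("stepGAN") loop: at step k, simulated
# dialogues of k turns are the "problematic" class and seed dialogues
# truncated to k turns are the "normal" class; after convergence the
# simulator's full-length dialogues are banked as training negatives.

class DummySimulator:
    def simulate(self, max_turns=None):
        return [("user utterance", "system utterance")] * (max_turns or 5)

    def fit(self, discriminator, real_dialogues):
        pass  # generator update against the discriminator would go here

class DummyDiscriminator:
    def fit(self, real_dialogues, fake_dialogues):
        pass  # real-vs-simulated classification update would go here

def step_gan(seed_dialogues, simulator, discriminator, n_steps):
    problematic_buckets = []
    for k in range(1, n_steps + 1):
        fakes = [simulator.simulate(max_turns=k) for _ in seed_dialogues]
        reals = [d[:k] for d in seed_dialogues]
        discriminator.fit(reals, fakes)
        simulator.fit(discriminator, reals)
        # One bucket of full-length simulated "problematic" dialogues per step.
        problematic_buckets.append([simulator.simulate() for _ in seed_dialogues])
    return problematic_buckets

seeds = [[("hi", "hello")] * 5] * 3          # three 5-turn seed dialogues
buckets = step_gan(seeds, DummySimulator(), DummyDiscriminator(), n_steps=4)
print(len(buckets))  # → 4
```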
0:12:29 So here are the datasets used in this paper. Basically, we use three datasets. The first one is the multi-domain dialogues, which are used for the pre-training of the user simulator and the discriminator. Here we use the MetaLWOz dataset, which contains task-oriented conversations: thirty, sorry, forty-two thousand dialogues over fifty-one domains. Each dialogue in this dataset is a task-oriented conversational interaction between two real speakers, one of them simulating the user and the other simulating the bot. The second dataset is the seed dialogues, which are used for the training of the GAN structure. Normally, seed dialogues are human-written dialogues that are available to the developers before the actual development of the dialogue system. However, we don't have such human-written dialogues, so we created the seed dialogues by having the PyDial restaurant-search bot talk with the rule-based user simulator that is also offered by PyDial. The third dataset is the manually labeled log dialogues, which are used for the evaluation of this task.
0:13:56 To collect this labeled data, we deployed our PyDial restaurant-search bot on the Amazon Mechanical Turk platform. First, we automatically generate some requirements for the users, for example food type, location and price range, and then we ask the Turkers to find a restaurant that satisfies those requirements by chatting with our restaurant bot. At the end of each task, the users are asked two questions: the first one is whether they found a restaurant meeting all the requirements, and in the second one we ask the user to label the contextually unnatural turns in the conversation. In total, we collected one thousand six hundred normal dialogues and one thousand three hundred problematic dialogues.
0:15:05 Here are some experimental results. We basically ran four experiments to assess the performance of stepGAN. In the first one, we investigate how close the generated dialogues move towards the normal dialogues: we examine the dialogues generated at each time step of stepGAN in terms of three metrics. Here are two of them: the first one, the upper plot, is the ranking score, and the second one, the lower plot, is the success rate. The yellow dashed line and the green dashed line stand for the average performance of the labeled normal dialogues and the labeled problematic dialogues, respectively.
0:16:02 As we can see, after the first step of training, the scores of the generated dialogues are much worse than those of the labeled problematic dialogues. After three steps of training, both metrics start growing and become better than the average performance of the labeled problematic dialogues. And after the N-th step of training, the success rate is almost as high as that of the labeled normal dialogues, and the generated dialogues are also getting smoother and more natural.
0:16:58 Here is the second experiment. In this experiment, we compare stepGAN with a ranker trained on the labeled dataset. First, we divide the AMT-labeled data into three parts: two thousand training examples, two hundred development examples and four hundred test examples. Then we train a dialogue ranker, which we call Supervised-2000, on this labeled training set, and evaluate its performance. We evaluate the rankers with precision at k and recall at k.
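Precision at k and recall at k over the ranker's scores can be computed as below (a standard formulation; the variable names and example values are illustrative, not the paper's data):

```python
def precision_at_k(scores, labels, k):
    """Fraction of the k highest-scored dialogues that are truly problematic.
    scores: ranker outputs in [0, 1]; labels: 1 = problematic, 0 = normal."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k

def recall_at_k(scores, labels, k):
    """Fraction of all problematic dialogues that appear in the top k."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    total = sum(labels)
    return sum(label for _, label in ranked[:k]) / total if total else 0.0

scores = [0.9, 0.8, 0.3, 0.7, 0.1]
labels = [1,   1,   0,   0,   1]
print(precision_at_k(scores, labels, 2))  # → 1.0
```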
0:17:44 For stepGAN, we simulated three thousand problematic dialogues. All the datasets here are balanced, with equal numbers of positive and negative examples; because the number of seed dialogues is only one hundred, we just duplicated them thirty times to keep this dataset balanced. Then we trained our stepGAN ranker on this dataset.
0:18:15 So here is the performance. As we can see, stepGAN performs even better than the supervised ranker when k is lower than fifty, even though Supervised-2000 has higher performance when k gets larger; so stepGAN can compete fairly with the supervised ranker.
0:18:37 And here is the third experiment. We basically add the simulated data to the labeled data and compare the performance of this combined dataset with the labeled dataset alone. Here is the result: basically, the experiment shows that our stepGAN approach can bring additional generalization by simulating a wider range of dialogues that are not covered by the labeled data.
0:19:14 The last experiment compares stepGAN with other types of user simulator. The first one, which we call Multi-domain, works as follows: we train the user simulator with the multi-domain dialogues, simulate one thousand problematic dialogues, then train the ranker on them together with the seed dialogues and evaluate its performance. The second one is the Fine-tuned model: we pre-train the user simulator on the multi-domain dialogues, then fine-tune it on the seed dialogues, generate one thousand problematic dialogues, train the ranker on them together with the seed dialogues, and evaluate the performance. The last one we call Stepwise fine-tuned: instead of fine-tuning the user simulator on the full-length seed dialogues, we just fine-tune it in the stepwise fashion introduced for stepGAN, without the GAN structure. Here are the results. We also train our stepGAN ranker on the same size of dataset, one thousand simulated dialogues plus the seed dialogues. As we can see, stepGAN also outperforms all the other user simulators.
0:20:42 So, in conclusion: stepGAN can generate dialogues of a wide range of quality; it compares favorably with a ranker trained on a labeled dataset; it brings additional generalization by simulating a wide range of dialogues that are not covered by the labeled data; and lastly, it also outperforms the other user simulators.
0:21:15 Thank you very much. Are there any questions?
0:21:22 Q: Hi, I actually have two questions. The first one is: you are starting with a binary classification, problematic versus non-problematic, but of course there are more kinds of problematic dialogues than that. You address some of that via the turns; however, in the end it is still a binary classification, right? A: Yep. Q: Then my second question is: because it's a binary classification, what does precision mean in this case? A: Precision at k is a ranking metric; it is pretty relevant for evaluating the ranking process.
0:22:03 So basically what we're doing is this: we have, for example, four hundred test dialogues, and we use our dialogue ranker to give a score to each dialogue and rank them from top to bottom. We suppose that the dialogues at the top of this list, the ones given the higher scores, are the problematic dialogues. So we truncate this ranked list at, for example, the first ten dialogues, count how many of them are truly problematic, and divide by ten; that is precision at ten. In the same way we can compute, say, precision at fifty and precision at one hundred.
0:22:57 Q: You generate these problematic dialogues in a fashion where the beginning is all smooth and then the end is kind of rubbish. Does the test set also come from this, or is it separate? Can there be something problematic in the middle of the dialogue? A: For the test set, we use human-labeled data; not only is it labeled, it is actual humans talking with our system, so the error can appear at any point in the dialogue. Q: And you don't rank turn by turn, you rank the whole dialogue? A: Yes, we don't rank turn by turn; we just rank the whole dialogue.
0:23:45 Any other questions?
0:23:53 Q: Hi, I have a question about how you define a problematic dialogue as a whole. I mean, there can be some errors in the middle that the system can repair, so what exactly do you mean by a problematic dialogue? A: We define problematic dialogues in this way: there are actually, let's say, three types of problematic dialogue. The first type has some unnatural turns: the users achieve their goal, but the communication is not smooth. The second type is where the communication is not smooth and, at the same time, they do not achieve the goal. And potentially there is a third one, where the communication is smooth but they still did not achieve the goal. Q: For the type where the interaction is not smooth but the task is successful, do you have that in the data? Did you calculate the annotator agreement? A: We didn't specifically try to find this type of data, but because of the way we gathered the data, I think this type of example is in the test dataset. Q: Alright, thank you.
0:25:15 Any other questions?
0:25:23 Q: The ranker outputs a continuous value, and you...? A: Yes, the ranker's score is continuous between zero and one, so it can be, say, 0.8 or 0.5. When it is close to one, that means the dialogue is problematic, and when it is close to zero, the dialogue is normal; the score is normalized between zero and one. Q: Then what is the loss function? A: The loss function basically compares the score given by the ranker with the label: we label problematic dialogues as one and normal dialogues as zero, and the loss is computed between the score given by the ranker and this label; for example, we use binary cross-entropy.
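With the score in (0, 1) and labels in {0, 1}, the binary cross-entropy loss mentioned in the answer looks like this (a standard formulation, not code from the paper):

```python
import math

def bce_loss(score, label):
    """Binary cross-entropy between the ranker's score in (0, 1)
    and the label (1 = problematic, 0 = normal)."""
    eps = 1e-12  # guard against log(0)
    return -(label * math.log(score + eps)
             + (1 - label) * math.log(1 - score + eps))

# A confident, correct prediction incurs a small loss;
# a confident, wrong prediction incurs a large one.
print(round(bce_loss(0.9, 1), 4))  # → 0.1054
```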
0:26:24 Q: I have one more question. You generate these problematic dialogues, but how do you know that they actually correspond to real problematic dialogues? A: We have three metrics to evaluate that. The first one is the length: normally, if there is something wrong in the dialogue, or the user didn't achieve their goal, the dialogue is longer; that's one metric. The second one is the success rate, which determines whether the user achieved their goal. And the third one is the score given by the ranker that is trained on the labeled data, which basically acts as a problem indicator. So, on this slide, we compare the generated dialogues with the average ranking score of the labeled problematic dialogues, and also with the yellow dashed line, which is the average performance of the labeled normal dialogues. At the beginning, all these evaluation metrics are very low, and after that they get higher; that means that at the beginning the pool of generated dialogues contains a lot of problematic dialogues, and towards the end it gets better.
0:28:01 Q: But if you read this example, it seems like the user utterances are very unlikely to happen with a real user. It's like the user is saying something strange and the system is reacting, but without a plausible error being introduced. A: That one is after only one training step. After three steps of training, the user is saying something like a plausible example, "I'm not looking for this place, please change", which is also related to the restaurant domain, but it is an utterance whose content the system cannot understand, and that causes the failure of the dialogue. So while at the beginning the generated problematic dialogues can be very strange, during the stepGAN training process the dialogues get closer and closer to the restaurant-search domain; it is just that the way the user describes their requirements is not accepted by the system. So the generated dialogues get closer to the domain and get longer.
0:29:25 Okay, we have time for one final question.
0:29:34 Q: As you go along the steps of stepGAN, it looks like the problems keep appearing at the end of the dialogue. Is that the case? Doesn't the generator then only generate one narrow kind of problematic dialogue? A: So, most of the errors do appear at the end, but not only there: in the generation process, because we have some randomness, some problems can appear in between as well, although these are much fewer than the ones that appear at the end. Q: I see, okay. That seems limiting, because real dialogues can have problems in the middle or at the beginning. A: I see, yes. Ideally, in this paper we would want the errors to appear in all kinds of positions, and indeed some of the generated dialogues, even after maybe six or seven turns, still have some problems appearing in the middle, but much fewer. I think that is maybe a topic for future work. Q: I guess it would just be interesting to see whether it is helpful to combine different dialogues from different steps to train the ranker. A: You mean to collect the data from different training steps? We are doing that: we compile all these dialogues into the training set. Q: Okay.
0:31:22 Okay, I think that was the final question, so let's thank the speaker again.