Okay, so then we move on to the next speaker.

So the paper is "Unsupervised Dialogue Spectrum Generation for Log Dialogue Ranking". This work was done together with colleagues from Microsoft Research; by the way, I am from Heriot-Watt University.

So the aim of this paper is to use a ranker to detect the problematic dialogues from the normal ones without any labeled data. We use the existing seed dialogues as the normal dialogues, train a generative user simulator with GAN setups, and have it talk with the bot at different training steps. We then collect the conversations from the different training steps and take them as the problematic dialogues. We call this method StepGAN. The experimental results show that StepGAN compares favorably with a ranker trained on a manually labeled dataset.

Okay, so what is log dialogue ranking? The log dialogues are conversations that happened between real users and the dialogue system, and dialogue ranking aims to identify the problematic dialogues from the normal ones.

Here are two examples, a normal dialogue and a problematic dialogue. The first one is a normal dialogue, in the restaurant-searching domain. Firstly the system says hello, and then the user asks for a European restaurant. The system asks what part of town they have in mind, and the user says the centre. After that the system asks for the price range, and the user wants an expensive one. After getting all the information, the system suggests a restaurant and repeats all the requirements of the user. After that the user asks for the address of this restaurant, and the system gives the correct information. Then they thank each other and the dialogue finishes. So we define a normal dialogue as a dialogue without any contextually unnatural turns that also achieves all the requirements asked by the user.

And here is the problematic dialogue. Here, very apparently, the system cannot understand the user's utterances and the conversation goes in the wrong direction. For example, the user says "I would really like some European that's cheap", and the system has some problems understanding this utterance: it suggests a restaurant which is in the east part of the town, while the user was asking for the centre. After that the user says "I want to eat at this restaurant, have you got the address?", and the system again misunderstands this utterance and asks "What part of town do you have in mind?" once more. So we define the problematic dialogues as dialogues with either contextually unnatural turns, or some unachieved requirements, or both.

So the goal of the ranker is to pick out this type of problematic dialogue from the normal ones.

So why do we need a ranker? In the typical human-in-the-loop development cycle of data-driven dialogue systems, the developers first build a dialogue system using some in-domain seed dialogues, and then the dialogue system is deployed, i.e. released to the customers. Then the log conversations can be collected, and the developers can improve the performance of the system by correcting some mistakes the system made in the log dialogues and then retraining the dialogue system model. However, going through all these dialogues is time-consuming, so we hope that this manual checking process can be replaced by a dialogue ranker that can detect dialogues of lower quality automatically, to make this dialogue learning process with a human in the loop more efficient.

So here is the structure of the ranker. The input of the ranker is just the dialogue, and the output is a score between zero and one, where zero means a normal dialogue and one means a problematic dialogue. Firstly, we get the sentence embeddings with a sentence encoder, and then feed them into multi-head self-attention to capture the meaning of the dialogue context. Then we have a turn-level classifier to identify the quality of each turn: for example, for a very smooth turn the score should be around 0.1, and for a problematic turn the score should be around 0.9. And then there is a dialogue-level ranker on top of these turn-level qualities. For this example dialogue, some parts of it are smooth and some of them are problematic, so the overall score will probably be something like 0.8.
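To make the pipeline concrete, here is a toy pure-Python sketch of the computation just described: turn embeddings go through a (simplified, single-head) self-attention layer, each contextualised turn gets a quality score, and a dialogue-level score is pooled on top. The weights here are random placeholders, so the scores are meaningless; only the shape of the computation follows the slide, not the authors' trained model.

```python
import math, random

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(turns):
    # simplified single-head dot-product self-attention over turn embeddings
    d = len(turns[0])
    out = []
    for q in turns:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in turns]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, turns)) for j in range(d)])
    return out

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rank_dialogue(turn_embeddings, w_turn, w_dial):
    # contextualise each turn, score each turn, then pool into one dialogue score
    ctx = self_attention(turn_embeddings)
    turn_scores = [sigmoid(sum(c * w for c, w in zip(row, w_turn))) for row in ctx]
    dialogue_score = sigmoid(sum(t * w for t, w in zip(turn_scores, w_dial)))
    return turn_scores, dialogue_score   # each in (0, 1); 1 = problematic

random.seed(0)
turns = [[random.gauss(0, 1) for _ in range(8)] for _ in range(5)]  # 5 turns, 8-dim
turn_scores, score = rank_dialogue(turns,
                                   [random.gauss(0, 1) for _ in range(8)],
                                   [random.gauss(0, 1) for _ in range(5)])
print(len(turn_scores), 0.0 < score < 1.0)
```

A real implementation would learn these weights and use a trained sentence encoder in place of the random embeddings.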

So about the training data for the ranker: gathering labeled data for training this ranker is very time-consuming. Imagine that, in the human-in-the-loop development process, whenever a significant change is made to the system, new labeled data for the ranker is required. This is not feasible for most developers, and that motivates us to explore this StepGAN approach.

The general idea is that we take the seed dialogues as the normal dialogues, and at the same time we use StepGAN to simulate the problematic dialogues, and train the ranker on top of this data.

So here is the structure of StepGAN. We have a dialogue generator and a dialogue discriminator, and the dialogue generator contains the restaurant-searching dialogue system and an RNN-based user simulator. Firstly we start with a pre-training process. In this process we pre-train our user simulator with the full utterances of multi-domain dialogues. For example, one of the multi-domain dialogues can be from pizza ordering, in which the user is asking for a large pineapple pizza, and another can be from the temperature-setting domain, in which the user is asking to set the temperature of the room to seventy-two degrees. Then we ask the user simulator to simulate some dialogues together with the restaurant-searching bot. Here is an example of a simulated dialogue after pre-training. As we can see, the user simulator has some basic language abilities, but it doesn't know how to talk with the restaurant-searching bot, so when the system asks for some restaurant-searching requirement, the user says something unrelated, and of course the dialogue does not go in the right direction. After we get these simulated problematic dialogues, we train the discriminator on them together with the seed dialogues; this gives us the pre-trained discriminator.

After the pre-training process, we move on to the first step of the GAN-style training. Firstly we initialize the user simulator and the discriminator with the pre-trained models, separately. In this step, for the training of the discriminator, we ask the dialogue generator to simulate some dialogues with only one turn and take them as the problematic dialogues, and then we take the seed dialogues, truncate them up to the first turn, take them as the normal dialogues, and feed them into the discriminator. For the training of the simulator in step one, we also only use these one-turn simulated and seed dialogues. Then we start our GAN setup, which alternates between the training of the generator and the training of the discriminator. After the model converges, we ask the model to simulate full-length dialogues and put them into the simulated-problematic-dialogues bucket. As we can see, the first turn of this example is very smooth, but after that, when the system asks what part of town the user has in mind, the user says something that the system cannot understand, and the dialogue goes wrong.

After the first step, we come to the second step. Again we first initialize our user simulator and the discriminator: we initialize the user simulator with the one we trained in step one, and we initialize the discriminator with the pre-trained model. The only difference between step one and step two is that we ask the dialogue generator to simulate dialogues with two turns, and at the same time we truncate our seed dialogues to two turns, and then train the discriminator and the user simulator in the same way. After the model converges, we ask the user simulator to simulate full-length dialogues, and then put them into the simulated-problematic-dialogues bucket. As we can see, the first two turns of this example are smooth, and from the third turn on there is something wrong.

Okay, and then we just repeat this step for n steps. After the n-th step of training, we get n buckets of simulated problematic dialogues, and together with the seed dialogues, we train our dialogue ranker on them.
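The stepwise loop just described, with one bucket of full-length rollouts per step, can be sketched as follows. Note that `simulate` and `train_gan` are placeholder stubs standing in for the RNN user simulator rollouts and the adversarial updates, not the authors' implementation; only the control flow (truncate to k turns, train, restart the discriminator each step, collect a bucket) follows the talk.

```python
def truncate(dialogue, k):
    # keep only the first k (user, system) turn pairs
    return dialogue[:k]

def simulate(simulator, bot, n_turns):
    # placeholder rollout: a real implementation would decode utterances from
    # the RNN user simulator talking to the restaurant-searching bot
    return [("user utterance", "system utterance") for _ in range(n_turns)]

def train_gan(simulator, discriminator, fake_dialogues, real_dialogues):
    # placeholder for the alternating generator / discriminator updates
    return simulator, discriminator

def stepgan(seed_dialogues, pretrained_sim, pretrained_disc, n_steps, full_len=8):
    buckets = []
    simulator = pretrained_sim
    for k in range(1, n_steps + 1):
        discriminator = pretrained_disc    # discriminator restarts from pre-training
        fake = [simulate(simulator, None, k) for _ in seed_dialogues]  # k-turn rollouts
        real = [truncate(d, k) for d in seed_dialogues]                # k-turn seed prefixes
        simulator, discriminator = train_gan(simulator, discriminator, fake, real)
        # after convergence, roll out full-length dialogues into bucket k
        buckets.append([simulate(simulator, None, full_len) for _ in range(4)])
    return buckets

seed = [[("u", "s")] * 5 for _ in range(10)]   # 10 toy seed dialogues, 5 turns each
buckets = stepgan(seed, "sim", "disc", n_steps=3)
print(len(buckets), len(buckets[0][0]))
```

The n buckets plus the seed dialogues then form the training set for the ranker.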

So here are the datasets used in this paper. Basically we are using three datasets. The first one is the multi-domain dialogues, which are for the pre-training of the StepGAN user simulator and discriminator. Here we are using the MetaLWOz dataset, which contains task-oriented conversations: tens of thousands of dialogues over fifty-one domains. Each dialogue in this dataset is a task-oriented conversational interaction between two real speakers, where one of them is simulating the user and the other one is simulating the bot. The second part is the seed dialogues, which are for the training of the GAN structure. Normally, seed dialogues are human-written dialogues that would be offered by the developers before the active development of the dialogue system. However, we don't have these human-written dialogues, so we create the seed dialogues by having our restaurant-searching bot talk to a rule-based user simulator. And the third one is the manually labeled log dialogues, which are for the evaluation of this task.

To collect this labeled data, we deployed our restaurant-searching bot on the Amazon Mechanical Turk platform. Firstly we automatically generate some requirements for the users, for example food type, location and price range, and then we ask the Turkers to find a restaurant that satisfies those requirements by chatting with our restaurant bot. At the end of each task, the users are asked two questions. The first one is whether they found a restaurant meeting all the requirements, and in the second one we ask the users to label the contextually unnatural turns in the conversation. In total we collected about one thousand six hundred normal dialogues and one thousand three hundred problematic dialogues.

Here are some experimental results. We basically performed four experiments to justify the performance of StepGAN. In the first one, we investigate how the generated dialogues move towards the normal dialogues. Basically, we examine the dialogues generated at each time step of StepGAN in terms of three metrics. Here are two of them: the first one, the upper one, is the ranking score, and the second one, the lower one, is the success rate. The yellow dashed line and the green dashed line stand for the average performance of the labeled normal dialogues and the labeled problematic dialogues respectively. As we can see, after the first step of training, the performance of the generated dialogues is much worse than that of the labeled problematic dialogues. After three steps of training, both metrics start growing and are better than the average performance of the labeled problematic dialogues. And after the n-th step of training, the success rate is almost as high as that of the labeled normal dialogues, and we can also see the generated dialogues are getting very smooth and natural.

Here is the second experiment. In this experiment we compare StepGAN with a ranker trained on the labeled dataset. Firstly we divide the AMT-labeled data into three parts: two thousand training examples, two hundred development examples and four hundred testing examples. Then we train a dialogue ranker, which we call Supervised-2000, on this labeled training set and evaluate its performance. We evaluate with precision at k and recall at k. For the training of StepGAN, we simulated basically three thousand problematic dialogues; and to make all the datasets balanced, with equal numbers of positive and negative examples, and because the number of seed dialogues is only one hundred, we just duplicated them thirty times. Then we trained our StepGAN ranker on this dataset. Here is the performance. As we can see, StepGAN performs even better than the supervised approach when k is lower than fifty, even though Supervised-2000 has higher performance when k gets larger.
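The class balancing by duplication can be sketched as follows; the counts (one hundred seed dialogues, three thousand simulated problematic ones) are taken from the talk and should be treated as approximate:

```python
def balance(seed_dialogues, simulated_problematic):
    # duplicate the small seed set so both classes have the same size
    copies = len(simulated_problematic) // len(seed_dialogues)
    return seed_dialogues * copies, simulated_problematic

seed = ["seed dialogue"] * 100                 # assumed: 100 seed dialogues
simulated = ["simulated dialogue"] * 3000      # assumed: 3,000 simulated dialogues
normal, problematic = balance(seed, simulated)
print(len(normal), len(problematic))
```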

And here is the third experiment. We basically add the simulated data into the labeled data, and compare the performance of this combined dataset with the labeled dataset alone. Here is the result. Basically, the experiment shows that our StepGAN approach can bring some additional generalization by simulating a wide range of dialogues that are not covered by the labeled data.

The last experiment is where we compare StepGAN with other types of user simulators. The first one, which we basically call Multi-domain, just trains the user simulator with the multi-domain dialogues, simulates one thousand problematic dialogues, and then trains the dialogue ranker on them together with the seed dialogues, and we evaluate the performance. The second one is the Fine-tuned model: basically we pre-train the user simulator with the multi-domain dialogues, then fine-tune it on the seed dialogues, then generate one thousand problematic dialogues and train the ranker on them together with the seed dialogues, and evaluate the performance. And the last one we call Stepwise Fine-tuned: basically, instead of fine-tuning the user simulator on the full length of the seed dialogues, we fine-tune in the stepwise fashion that was introduced for StepGAN, just without the GAN structure. Here are the results. We also trained our StepGAN on the same size of dataset, with one thousand simulated dialogues and the seed dialogues. As we can see, StepGAN also performs better than all the other user simulators.

So the conclusion is that StepGAN can generate dialogues with a wide range of qualities, and the ranker trained on them compares favorably with a ranker trained on a labeled dataset. It brings additional generalization by simulating a wide range of dialogues that are not covered by the labeled data, and lastly, it also outperforms the other user simulators. Thank you very much. Are there any questions?

Hi, I actually have two questions, let's see. The first one is: of course you are starting with a binary classification, problematic versus non-problematic, but of course there are more kinds of problematic dialogues than that, and you address some of that via the turns; however, in the end it is still a binary classification, right? — Yep.

Then my second question is: because it is a binary classification, what does precision mean? — Okay, in this case, precision at k is a ranking metric, so it is pretty relevant for evaluating the ranking process. Basically what we are doing is, for example, we have four hundred testing dialogues, and we use our dialogue ranker to give a score to each dialogue and rank them from top to bottom. We suppose that the dialogues at the top of this list, the ones given a higher score, are the problematic dialogues. So if k is, say, ten, we just truncate this ranked list to the first ten dialogues, calculate how many of them are problematic dialogues, and divide by ten. And we can compute more of them, like top fifty and top one hundred.
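The computation just described can be sketched with toy scores and labels (not the paper's data):

```python
def precision_at_k(scores, labels, k):
    # rank dialogues by descending ranker score; fraction of the top-k
    # that are truly problematic (label 1)
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    top = [label for _, label in ranked[:k]]
    return sum(top) / k

def recall_at_k(scores, labels, k):
    # fraction of all problematic dialogues that appear in the top-k
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    top = [label for _, label in ranked[:k]]
    return sum(top) / max(1, sum(labels))

scores = [0.9, 0.2, 0.8, 0.1, 0.7]   # toy ranker outputs
labels = [1, 0, 1, 0, 0]             # 1 = problematic
print(precision_at_k(scores, labels, 2))   # top-2 are both problematic -> 1.0
```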

So you generate these problematic dialogues in a fashion where the beginning is all smooth and then the end is kind of rubbish. Does the test data also come from this, or is it separate? There could be something failing in the middle. — For the test data, we basically use human-labeled data, which is not only labeled by humans but also comes from humans talking with our system, so the errors can be in the middle of the talk or at the end. — And you don't rank turn by turn, it is the whole dialogue? — Yes, we don't rank turn by turn, we just rank the whole dialogue.

Any other questions?

Hi, I have a question about how you define the problematic dialogue as a whole. I mean, there can be some errors in the middle that the system can repair, so what exactly do you mean by a problematic dialogue? — So we define the problematic dialogues... there are actually three types of problematic dialogues. The first type is dialogues that have some unnatural turns: basically they achieve their goal, but the communication is not smooth. The second type is where the communication is not smooth and at the same time they didn't achieve the goal. And there is potentially a third one, where the communication is smooth but they didn't achieve the goal. We just define the problematic dialogues in this way. — For the type where the interaction is not smooth but the task is successful: do you have targeted data for that, and have you calculated the annotator agreement? — We didn't specifically try to find this type of data, but because of the way we gathered the data, I think this type of example is in the testing dataset. — Alright, thank you.

A question over there. — Right, because the ranker outputs a continuous value... — So the score the ranker outputs is continuous between zero and one, so it can be 0.8 or 0.5 or something, and when it is close to one that means problematic, and when it is close to zero that means normal. The scores are normalized, so they lie between zero and one.

Then what is the loss function? — So the loss function basically uses the discrepancy between the score given by the ranker and the label. We label the problematic dialogues as one and the normal dialogues as zero, and the loss is just the distance between the score given by the ranker and this label; for example, binary cross-entropy.
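A minimal sketch of that loss, assuming standard binary cross-entropy between the ranker's score and the 0/1 label (the exact form used in the paper may differ):

```python
import math

def bce_loss(score, label):
    # binary cross-entropy between the ranker score in (0, 1) and the
    # 0 (normal) / 1 (problematic) label
    eps = 1e-12
    return -(label * math.log(score + eps)
             + (1 - label) * math.log(1 - score + eps))

print(round(bce_loss(0.9, 1), 4))            # confident and correct: small loss
print(bce_loss(0.9, 0) > bce_loss(0.9, 1))   # confident but wrong: larger loss
```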

I have one question also: so this generates some problematic dialogues, but how do you know that they also correspond to actually problematic scores? — So we have three metrics to evaluate that. The first one is the length: normally, if there is something wrong in the dialogue, or the user didn't achieve their goal, the dialogue is longer, so this is one metric. Another one is the success rate, which determines whether the user achieved their goal. And the third one is the score given by the ranker that is trained on the labeled data, so basically a proper ranker giving the score. We then compare these with the averages: for example, the average ranking score of the labeled problematic dialogues, which is the green dashed line, and also the yellow dashed line, which is the average performance of the labeled normal dialogues. We see that at the beginning of training all these evaluation metrics are very low, and after that they get higher, so that means that at the beginning the generated dialogues are mostly problematic dialogues, and towards the end they get better.

But if you read this example, it seems like the user utterances are very unlikely to happen with a real bot; it is like the user is doing something strange and the system is reacting to it, but no real user would say that. — Yes, but this one is only after one training step. As you can see, after three steps of training, the user is saying something like a plausible example, "I'm not looking for this place, please change", so these are related to the restaurant domain, but they are utterances that the system cannot understand, and that causes the failure of the dialogue. At the beginning we want to generate the problematic dialogues in multiple ways, very freely, but during the StepGAN training process the dialogues get closer to the restaurant-search domain; it is just that the way the user describes their requirements is not accepted by the system. So the generated dialogues get closer to the domain while still failing.

Okay. We have time for a final question.

So as you go along the steps of StepGAN, it looks like the problems are always ordered towards the end. Doesn't the generator generate low-quality problems earlier in the dialogue as well? — So most of the errors appear at the end, but that is not guaranteed in the generation process, because we have some randomness, like a random seed, so some problems can appear in between, but these are much fewer than the ones that appear at the end. — I see, okay. That matters, because real dialogues can have problems in the middle or at the beginning. — I see, yes. Ideally, in this paper, we want to have the errors in all kinds of places, and indeed some of the generated dialogues, even after maybe six or seven turns, still have some problems appearing in the middle, but it is much less.

I think maybe, as future work, it would be helpful to combine different dialogues from different steps if you want to train the ranker. — You mean to collect the data from different training steps? We are doing that: we combine all these dialogues together for training. — Okay.

Okay, I think that was the final question, so let's thank the speaker again.