Speech Transcript - Role play-based question-answering by real users for building chatbots with consistent personalities

0:00:15	so my name is ryuichiro higashinaka from ntt corporation and
0:00:20	today i'm gonna talk about role play-based question answering by real
0:00:26	users for building chatbots with consistent personalities
0:00:32	so
0:00:33	now we are seeing a lot of
0:00:36	chatbots because we are talking to them every day at least some people are talking to
0:00:41	these characters every day
0:00:42	like this is microsoft's rinna in japan
0:00:45	it is very famous people are talking to it every day and we have like
0:00:51	gatebox as in this image
0:00:53	people can talk to the virtual characters in this small case
0:00:58	and also we have a
0:01:00	more human like
0:01:01	characters as mentioned in yesterday's invited talk
0:01:06	so we are having
0:01:08	many chatbots and they have consistent personalities
0:01:12	and if we want them to be believable they need to have consistent
0:01:16	personalities
0:01:17	and to generate consistent responses what we need is
0:01:21	a set of character-specific question-answer pairs
0:01:26	like
0:01:27	but the creation of such pairs is as you know very costly
0:01:32	so the motivation behind this work is that
0:01:35	we want to efficiently
0:01:37	collect
0:01:38	question-answer pairs for characters
0:01:41	and in this work we particularly use
0:01:44	the technique called role play-based question answering
0:01:47	as a technique for collecting
0:01:49	the
0:01:50	question-answer pairs
0:01:52	and before going into the details of this work i'm gonna explain what
0:01:56	role play-based question answering is
0:02:00	so in role play-based question answering
0:02:02	in the middle we have
0:02:04	a famous person
0:02:06	and people users talk to this famous person
0:02:10	and in this case this is an image and cutting down who is very famous
0:02:14	we've got is a
0:02:15	and
0:02:16	at the back
0:02:17	of this character we have a bunch of role players who collectively play the
0:02:23	role of the famous person
0:02:26	so if the user this user
0:02:29	asks a question to this famous person like what do you like
0:02:32	and this question is broadcast
0:02:34	to all the role players
0:02:36	and then
0:02:38	one of the role players answers the question by saying like i like
0:02:42	sweets
0:02:42	then this answer is relayed to the user
0:02:46	and
0:02:48	this question and this answer
0:02:51	can be collected as a question-answer pair for this character
0:02:57	since role players can enjoy playing the role of their favourite character
0:03:02	and also the users can ask questions to their favourite character
0:03:06	users can get highly motivated to provide question-answer pairs so this is how it
0:03:11	works
0:03:13	but there are some problems with this architecture
0:03:17	so that is
0:03:18	only a small scale experiment with paid users was performed
0:03:23	to test the concept of role play-based question answering
0:03:26	so it was not clear if this scheme would work with real users
0:03:32	and also another problem is that the small scale experiment
0:03:35	did not yield enough data
0:03:37	to allow data driven methods to work
0:03:40	so the applicability of the collected data to the creation of chatbots
0:03:45	was not verified
0:03:47	so to solve these problems in this
0:03:50	work for the first problem we verified the
0:03:53	effectiveness of role play-based question answering with real users
0:03:58	in this study we focus on two famous characters in japan
0:04:02	and
0:04:03	we set up websites for role play-based question answering
0:04:06	for the people to you know enjoy the process
0:04:11	and for the second problem we created chatbots using the collected data
0:04:16	collected in this way
0:04:18	and
0:04:19	in this paper we propose a retrieval-based method
0:04:22	and evaluate its performance by subjective evaluation
0:04:27	so let me
0:04:28	talk about
0:04:30	the data collection by real
0:04:32	users
0:04:34	so we focus on these two characters
0:04:37	who are very famous in japan
0:04:39	one is max murai he is an actual person and he's a company ceo and
0:04:45	he's also a youtuber who specialises in the live coverage of tv games
0:04:50	and
0:04:51	the other character is ayase aragaki she is a fictional character in a novel
0:04:56	and it does is the company this you
0:04:58	and her character is often referred to as a yandere
0:05:01	according to wikipedia yandere characters are mentally unstable and use extreme violence or
0:05:07	brutality as an outlet
0:05:09	for their emotions so they are two very distinct
0:05:12	different characters one is an
0:05:13	actual person
0:05:15	a male character and the other one is a fictional character a female character
0:05:20	and we set up websites
0:05:23	so that people can enjoy role play-based question answering
0:05:27	so each character has a channel
0:05:29	kind of maybe a kind of channel
0:05:31	user channels for the fans on the japanese
0:05:34	streaming service niconico douga
0:05:36	this is like youtube
0:05:38	and
0:05:38	we set up the sites
0:05:40	on their channels for the subscribers to enjoy role-play based question answering
0:05:46	so this is how the website looks like for murai
0:05:50	people can
0:05:51	post questions these are the posted questions
0:05:55	and these are the answers given by several role players
0:05:59	and this is how it looks like for
0:06:01	ayase
0:06:02	you can post questions in the text field and this is a
0:06:07	question posted by the user and this is the answer posted by the role player
0:06:13	so this is how it looks like
0:06:16	and we ran this kind of trial for several months
0:06:21	and this is what we got this table shows the statistics of
0:06:27	the collected data
0:06:30	if you look at these two
0:06:32	number of users who participated and number of question-answer pairs we obtained
0:06:37	we obtained many users
0:06:40	as you can see to play the roles of murai and ayase more than three hundred people
0:06:44	participated
0:06:46	and over ten thousand question-answer pairs were collected for both
0:06:50	murai and ayase
0:06:53	and also the answers for ayase this is the average number of
0:06:57	words per answer the answers of ayase were much
0:07:01	longer and contained more words and letters
0:07:04	so ayase was more talkative and murai was not as talkative
0:07:09	that is
0:07:11	just reflecting their respective personalities
0:07:15	and this slide shows efficiency
0:07:18	of the data collection process
0:07:20	so this
0:07:21	table shows
0:07:23	how long it took to reach this number of question-answer pairs
0:07:28	so
0:07:29	for example
0:07:30	to reach two thousand
0:07:32	pairs
0:07:33	it took a couple of days for murai and about one
0:07:38	day for ayase and to reach ten thousand pairs
0:07:42	it took about three months for murai and eighteen days for ayase
0:07:47	so for both characters it took just about a couple of days to reach two
0:07:51	thousand question-answer pairs
0:07:53	and for ayase we collected
0:07:55	ten thousand question-answer pairs in just eighteen days i think it is quite
0:07:59	fast
0:08:00	and this finding confirms the efficiency of role play-based question answering for data collection
0:08:06	and also note that users were not paid to provide data they just
0:08:10	voluntarily
0:08:13	provided data enjoying the process
0:08:17	and we also assessed the quality of the data and the satisfaction of the users
0:08:23	so this shows
0:08:25	this table shows the average scores for the questions and answers
0:08:30	and the maximum score is five and we got very reasonable scores for the
0:08:36	posted pairs
0:08:38	and for the user satisfaction of the users
0:08:42	we had three questionnaire items usability of the website willingness for future
0:08:48	use and enjoyment of role play and we see that users really enjoyed role playing
0:08:58	so we have collected more than ten k question-answer pairs
0:09:02	using
0:09:03	role play-based question answering and now it's time to create chatbots using the collected
0:09:08	data
0:09:10	so this is an overview of our proposed method
0:09:14	basically we employ a retrieval-based approach that given an input question q
0:09:20	and
0:09:20	question-answer pairs are retrieved from this question-answer pairs database that we
0:09:25	have collected
0:09:26	and if
0:09:27	the score of this retrieved pair is high
0:09:30	then it is selected
0:09:33	so
0:09:35	the pair with the highest score is selected and its a prime
0:09:39	is used as the output of this chatbot
0:09:42	so for example this has a score of zero point nine and other ones have
0:09:46	scores below
0:09:47	zero point nine then this would be selected and a prime would be used as the
0:09:52	output of this chatbot
0:09:55	and
0:09:55	the important thing here is
0:09:58	how do we calculate this score
0:10:01	so for this purpose we have this scoring function
0:10:04	it is a weighted sum of six
0:10:06	different
0:10:08	scores
0:10:09	search score question type match score center-word score translation score
0:10:15	reverse translation score and semantic similarity score and these scores are integrated to
0:10:20	calculate this overall score for each question-answer pair
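The weighted sum described here can be written compactly as follows. This is only a sketch of the notation: the talk states that six scores are combined by a weighted sum, while the symbols w_i and s_i and the ordering of the terms are assumptions introduced for illustration.

```latex
% Overall score of a retrieved pair (Q', A') for an input question Q.
% w_i are the weights; s_1..s_6 stand for the search, question type match,
% center-word, translation, reverse translation and semantic similarity scores.
\mathrm{score}\bigl(Q, (Q', A')\bigr) = \sum_{i=1}^{6} w_i \, s_i(Q, Q', A'),
\qquad 0 \le s_i \le 1
```

The pair with the largest overall score is the one whose answer A' is returned.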
0:10:25	and i'll
0:10:27	describe these scores one by one
0:10:30	for the first three scores
0:10:32	so for the search score
0:10:34	this is the score given by the text retrieval engine lucene when
0:10:38	it searches with this question as a query
0:10:41	and we are using lucene with default settings so it uses bm25 for search
0:10:47	and for the question type
0:10:48	match score
0:10:49	this score is calculated on the basis of whether the question type of q
0:10:53	matches that of q prime and the number of named entities in a prime requested by
0:10:59	q
0:11:01	and also the center-word score
0:11:03	we first extract center-words and they are mainly noun phrases representing topics and are extracted
0:11:09	from both q and q prime and if they overlap a score of one
0:11:13	is obtained
0:11:16	for the other three scores
0:11:18	for the translation score we use a neural translation model to calculate p of a
0:11:23	prime given q that is the generative probability of a prime given q as the score
0:11:29	the model is pretrained with our in-house zero point five million question-answer pairs and
0:11:34	then fine-tuned with the collected question-answer pairs
0:11:38	and for this purpose we use opennmt
0:11:42	and the reverse translation score is very similar to the translation score but p of q
0:11:48	given a prime is used
0:11:49	as the score
0:11:51	finally the semantic similarity score
0:11:55	first sentence vectors are obtained for both q and q prime by using the averaged word vectors
0:12:00	using word2vec
0:12:01	then the cosine similarity between the two sentence vectors is
0:12:05	used as the score
0:12:07	the word2vec model is trained from wikipedia articles
0:12:11	note that all scores are normalized between zero and one before integrating the scores
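A minimal sketch of the semantic similarity score as described, averaging word vectors into a sentence vector and taking the cosine between the two. The toy vectors, the function names and the mapping of the cosine from [-1, 1] to [0, 1] are assumptions for illustration; the talk only says that a trained word-vector model is used and that scores are normalized between zero and one.

```python
import math

# Hypothetical toy vectors standing in for a trained word-vector model.
TOY_VECTORS = {
    "what": [0.2, 0.1, 0.0],
    "food": [0.0, 0.9, 0.1],
    "like": [0.1, 0.8, 0.2],
    "live": [0.9, 0.0, 0.3],
}


def sentence_vector(tokens, vectors):
    """Average the vectors of the known tokens; None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_similarity_score(q_tokens, q_prime_tokens, vectors):
    """Similarity of Q and Q' in [0, 1]; 0.0 when a sentence has no known word."""
    u = sentence_vector(q_tokens, vectors)
    v = sentence_vector(q_prime_tokens, vectors)
    if u is None or v is None:
        return 0.0
    # Assumed normalization: map cosine from [-1, 1] to [0, 1].
    return (cosine(u, v) + 1.0) / 2.0


if __name__ == "__main__":
    print(semantic_similarity_score(["what", "food"], ["what", "like"], TOY_VECTORS))
```

Averaging ignores word order, so this score mainly captures topical closeness between the two questions.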
0:12:17	so this slide shows the overall flow of the system
0:12:22	so the user question comes in then the lucene document retrieval engine retrieves
0:12:28	question-answer pairs from this question-answer pairs database
0:12:31	and top n candidates are retrieved
0:12:33	and for each of the candidate
0:12:35	we calculate the score
0:12:37	by using these modules
0:12:39	question type estimation named entity recognition center-word extraction module neural translation
0:12:44	models
0:12:45	and the word2vec model
0:12:47	and we obtain the six
0:12:49	scores that i just explained
0:12:52	and
0:12:53	we get the final ranking of the candidates and output the
0:12:57	top n
0:12:58	answers and in this study we use only the top one answer
0:13:01	as the chatbot's response
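The flow above, retrieving candidates, scoring each one, and returning the top answer, can be sketched as below. This is an illustration under stated assumptions: the two scoring functions (word overlap and exact match) are simple stand-ins for the six scores in the talk, and all names here are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

QAPair = Tuple[str, str]                     # (Q', A')
ScoreFn = Callable[[str, str, str], float]   # (Q, Q', A') -> score in [0, 1]


def word_overlap_score(q: str, q_prime: str, a_prime: str) -> float:
    """Stand-in score: Jaccard overlap between the words of Q and Q'."""
    a, b = set(q.split()), set(q_prime.split())
    return len(a & b) / len(a | b) if a | b else 0.0


def exact_match_score(q: str, q_prime: str, a_prime: str) -> float:
    """Stand-in score: 1.0 when Q and Q' are identical strings."""
    return 1.0 if q == q_prime else 0.0


def rerank(q: str, candidates: List[QAPair], score_fns: Dict[str, ScoreFn],
           weights: Optional[Dict[str, float]] = None) -> List[Tuple[float, str, str]]:
    """Rank candidate pairs by the weighted sum of their scores, best first."""
    if weights is None:
        weights = {name: 1.0 for name in score_fns}  # all weights set to one
    scored = []
    for q_prime, a_prime in candidates:
        total = sum(weights[name] * fn(q, q_prime, a_prime)
                    for name, fn in score_fns.items())
        scored.append((total, q_prime, a_prime))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


def respond(q: str, candidates: List[QAPair],
            score_fns: Dict[str, ScoreFn]) -> Optional[str]:
    """Return A' of the top-ranked pair, or None if nothing was retrieved."""
    ranked = rerank(q, candidates, score_fns)
    return ranked[0][2] if ranked else None


if __name__ == "__main__":
    pairs = [("what do you like", "i like sweets"), ("where do you live", "in tokyo")]
    fns = {"overlap": word_overlap_score, "exact": exact_match_score}
    print(respond("what food do you like", pairs, fns))
```

Keeping each score as a separate function makes it easy to drop one or change its weight without touching the ranking step.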
0:13:05	and
0:13:06	because we have only about ten k
0:13:09	question-answer pairs in this database it can lack the coverage of the questions
0:13:12	you know you know
0:13:13	so we additionally have another database which is an extended question answer pairs
0:13:18	created from the question-answer pairs i just explained
0:13:24	so to extend the question-answer pairs
0:13:27	we first
0:13:29	focus on this
0:13:30	on the full
0:13:32	in one particular question-answer pair
0:13:35	and we first search for a very similar
0:13:38	tweet in twitter space
0:13:40	which has a very similar content the normalized edit distance is below zero point
0:13:45	one so they should be very similar on the surface
0:13:48	and for this we use
0:13:50	all the questions
0:13:54	to which this was an answer
0:13:56	and we gather these questions
0:14:00	and
0:14:00	and couple these questions with this answer
0:14:04	and these
0:14:05	couples i mean tuples are the extended question-answer pairs that's how we extend the question
0:14:13	answer pairs into these extended question-answer pairs
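The extension step can be sketched as follows: for each collected answer, find answers in another corpus whose normalized edit distance to it is below 0.1, then pair their questions with the collected answer. The external corpus, the function names and the choice to normalize by the longer string's length are assumptions for illustration; the talk gives only the 0.1 threshold.

```python
from typing import List, Tuple

QAPair = Tuple[str, str]  # (question, answer)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def extend_qa_pairs(collected: List[QAPair], external: List[QAPair],
                    threshold: float = 0.1) -> List[QAPair]:
    """Couple questions from `external` with collected answers they nearly match."""
    extended = []
    for _, a_prime in collected:
        for ext_q, ext_a in external:
            if normalized_edit_distance(a_prime, ext_a) < threshold:
                # Keep the character's own answer, borrow the new question.
                extended.append((ext_q, a_prime))
    return extended


if __name__ == "__main__":
    collected = [("what do you like", "i like sweets a lot")]
    external = [("favorite food?", "i like sweets a lot!"), ("hobby?", "playing games")]
    print(extend_qa_pairs(collected, external))
```

This nested loop is quadratic in the corpus sizes, so a real run over millions of pairs would need an index or blocking step first.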
0:14:17	and for murai
0:14:19	we all the thing additional wasn't really on
0:14:22	question-answer pairs and for ayase
0:14:24	we obtained
0:14:26	about one million additional question-answer pairs
0:14:31	so by using the proposed method
0:14:34	we did an experiment to verify the effectiveness of the proposed method
0:14:40	we used twenty six subjects
0:14:42	each for murai and ayase
0:14:44	and they were recruited from the subscribers so they are very picky about the quality
0:14:49	of the utterances because they are fans of the characters
0:14:54	and the procedure is that each subject evaluated answers
0:14:58	of the five methods for comparison i'll explain the methods later
0:15:02	on a five point likert scale
0:15:05	and
0:15:06	test questions which were the held-out data from the collected question-answer pairs
0:15:13	were used as input
0:15:16	we had two evaluation criteria
0:15:19	one is naturalness
0:15:21	not knowing who is speaking whether the answer is appropriate to the input question or not
0:15:26	and character-ness
0:15:27	knowing that the character in question is speaking whether the answer is appropriate to the input question or not
0:15:35	so
0:15:36	i
0:15:37	describe the methods for comparison we have five
0:15:40	we have two baselines
0:15:42	and two proposed methods and one upper bound
0:15:46	as for baseline one it's called aiml
0:15:51	and it uses general-purpose three hundred k handcrafted rules written in aiml artificial
0:15:57	intelligence markup language for response generation
0:16:01	and personal pronouns and sentence-end expressions
0:16:05	were altered by rules to match those of the characters
0:16:08	so as you know this is quite a massive amount of
0:16:11	handcrafted rules that we have been developing and we are using that
0:16:15	for response generation in this baseline
0:16:19	and baseline two this is called lucene
0:16:22	and it uses the answer of the highest ranking pair
0:16:26	retrieved by lucene which uses bm25 by using the input
0:16:30	question as a query
0:16:32	and this is proposed method one it is called prop
0:16:36	without exdb extended database the proposed method without the extended question-answer
0:16:42	pairs
0:16:44	and all the weights in the scoring function are set
0:16:47	to one
0:16:49	for this proposed method
0:16:51	and for proposed method two it's called prop
0:16:55	this is the proposed method itself and all the weights are set to
0:16:59	one as well
0:17:01	and the upper bound
0:17:04	it's called gold and it's the gold responses
0:17:06	provided by the online users for the test questions
0:17:10	then we compare these five
0:17:13	and this shows the results
0:17:16	for the five methods for both murai and ayase
0:17:20	and as you can see the proposed methods are much better than the baselines
0:17:25	for murai the proposed methods significantly outperformed the baselines
0:17:30	and it does not matter whether we use the extended database or not
0:17:36	and for ayase
0:17:38	the proposed method outperforms one of the baselines which is aiml
0:17:42	and also the proposed method is better than prop without extended database for naturalness
0:17:49	the results are quite good and this is
0:17:51	the upper bound but we are getting close to gold which is the
0:17:55	gold data
0:17:59	i'll show you some of the examples that are more interesting so for example this
0:18:03	is for murai and what did you have
0:18:06	for lunch today and then we tend i have it's a compressed by for it
0:18:10	is good at the g
0:18:12	and it had a very high naturalness score but it was not very
0:18:16	much like him so
0:18:18	and the proposed method just returned ramen
0:18:20	it was short but it was just like himself
0:18:26	and for ayase
0:18:27	you are so cute was the question and
0:18:30	we had the two
0:18:32	responses like thank you very embarrassing thank you from the proposed methods and they
0:18:36	got very high scores
0:18:39	so aiml rules may produce natural sentences
0:18:42	but such utterances do not necessarily lead to high scores
0:18:46	and short answers just like this ramen and thank you
0:18:50	can lead to high scores showing that the content of utterances
0:18:53	is very important
0:18:57	so to summarize
0:18:59	we successfully verified the effectiveness of role play-based question answering
0:19:03	by using real users
0:19:05	and we successfully created chatbots using the collected question-answer pairs
0:19:09	and as future work
0:19:10	we want to improve the quality
0:19:12	of the proposed method and also we want to try additional types of characters
0:19:17	as targets for role play-based question answering
0:19:20	actually
0:19:28	questions
0:19:49	so actually this is a kind of a
0:19:53	how they say people can compare different the answers and that's the winds in part
0:19:58	of this the system
0:19:59	people can just actually there's a kind of like button here
0:20:02	people can just press this button then
0:20:04	you know you can see that this was a much better utterance so
0:20:08	it was kind of you know it's not a confusion but this kind of into
0:20:12	the thing for comparing them
0:20:31	yes they are completely isolated
0:20:36	no it was just this amounts to
0:21:09	so we just wanted to make sure that
0:21:12	we are not cheating so that that's not that the point
0:21:15	and we could have had
0:21:17	users
0:21:18	put in their own questions and then evaluate the responses but since we had a
0:21:23	dataset we wanted to do kind of us as kind of a class wasn't survey
0:21:27	so we can do that so we what how
0:21:46	so we
0:21:48	we have an agreement with this streaming service and they have
0:21:53	the right to be addicted and area
0:21:56	so we have the rights to put our website on their fan channels
0:22:00	and we all of the right have been created
0:22:06	any other question
0:22:10	okay so let's thank the speaker again

Role play-based question-answering by real users for building chatbots with consistent personalities

Oral Session 3: Dialogue

Ryuichiro Higashinaka, Masahiro Mizukami, Hidetoshi Kawabata, Emi Yamaguchi, Noritake Adachi, Junji Tomita