Speech Transcript - A Large-Scale User Study of an Alexa Prize Chatbot: Effect of TTS Dynamism on Perceived Quality of Social Dialog

0:00:21	or it but so slowly start them accession my name is for the province crevices
0:00:27	evaluation very to session
0:00:31	first speaker today
0:00:33	is gonna be michelle cohn
0:00:37	we're gonna have three talks in the session
0:00:39	which runs until lunchtime
0:00:43	so we shall
0:00:45	thank you
0:00:49	can you hear me okay
0:00:51	hi i'm michelle cohn i'm a postdoc at u c davis working jointly with
0:00:56	the departments of linguistics
0:00:57	computer science and psychology and today i'll be presenting a project i did with arbit
0:01:02	chen and zhou yu
0:01:04	so more and more humans are talking to voice activated artificially intelligent devices like amazon's
0:01:10	alexa to complete daily tasks
0:01:12	like setting a timer turning on the lights
0:01:15	and a new aspect through the amazon alexa prize competition is the ability to
0:01:20	engage real users in social chitchat through these systems many of you here have competed or
0:01:26	are competing but for those of you who don't know about it
0:01:30	the amazon alexa prize is a competition to create socialbots that can converse coherently
0:01:36	and engagingly with humans on a range of topics like food music technology animals
0:01:42	and so on
0:01:43	and what's unique
0:01:45	at least for researchers in academia is the ability to deploy these bots right in
0:01:50	the wild something dan bohus talked about yesterday
0:01:54	so during the competition anyone with an amazon echo
0:01:57	could say let's chat
0:01:58	and get one of the competing chatbots
0:02:01	you may be familiar with some other teams from twenty eighteen
0:02:05	including one from kth fantom advised by gabriel skantze and led by patrik
0:02:10	jonell
0:02:12	but today i'm going to be talking about gunrock the socialbot developed at
0:02:16	u c davis advised by zhou yu and led by arbit chen and make two
0:02:20	corridors
0:02:21	and gunrock is special as it won first place in the twenty eighteen competition
0:02:26	you can see zhou and arbit here
0:02:30	so when i met zhou and arbit in july last summer the gunrock team was
0:02:34	about halfway through the competition and i was working on other projects related to how
0:02:38	humans talk to voice ai so i was
0:02:40	interested in seeing how
0:02:41	users would engage with a socialbot like gunrock
0:02:45	so we started to collaborate recording these user interactions you can see my microphone there
0:02:51	but we noticed something as we listened to how these interactions unfolded
0:02:56	alexa's speech was relatively flat
0:02:59	and really lacked the dynamism in human interaction
0:03:02	where speakers vary their speech to show their excitement
0:03:05	their interest and their understanding
0:03:08	and this is important
0:03:09	as users for example were offering information about their favourite movie alexa really didn't
0:03:14	sound like she cared
0:03:16	and others have noticed this flatness in the alexa voice as well here's an echo review
0:03:20	where they mentioned that it would be nice if alexa didn't sound so monotone
0:03:25	and that she needs to have a little more expression when she speaks
0:03:29	and another where they say that they're having a lot of fun with her
0:03:33	but her monotone productions can make things difficult for us to understand so this flatness
0:03:38	could also affect users' ability to understand her speech
0:03:42	so this led to several research questions the first was how can we improve alexa's
0:03:47	expressiveness in a social dialogue system like gunrock
0:03:50	especially given the time constraints of being in a competition
0:03:54	so we know from work on human interaction that cognitive emotional expression is important for
0:04:00	the quality of our interactions with others
0:04:03	we see that readily in people's faces such as happiness and excitement
0:04:07	when we go to the vasa museum or contemplation and interest
0:04:12	but we also see that in the way we produce and perceive speech so for
0:04:16	example how emotionally expressive we are relates to perceptions of speaker enthusiasm in human
0:04:22	conversation
0:04:23	so this is something we wanted to mimic in alexa's speech
0:04:27	so how do we make alexa more expressive well one option is to
0:04:31	completely overhaul the prosody
0:04:33	we really didn't have that as an option we didn't have control of the tts models
0:04:37	in the competition which are given by amazon
0:04:40	we can adjust the tts in minor ways using ssml
0:04:44	but again we were on a time crunch and
0:04:46	we also wanted to very carefully specify where cognitive
0:04:51	emotional expression would be inserted
0:04:54	so we asked whether we could add discrete units of cognitive emotional expression or voice
0:04:59	emojis to improve expressiveness of the alexa voice
0:05:04	so we identified two that we were interested in expressive interjections and these are ones
0:05:09	that were pre-recorded by the alexa voice
0:05:12	here's an example
0:05:13	wowza
0:05:15	and filler words like or
0:05:20	and they're relatively easy to add in the alexa skills kit just with a
0:05:24	simple ssml tag to adjust expressiveness
0:05:27	here for a speechcon interjection
0:05:31	or to add in a pause to make the filler words sound more natural
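The SSML step described here can be sketched as follows. This is a minimal illustration, not the Gunrock code: the helper names `interjection`, `filler`, and `speak` are hypothetical, and the pause length (300 ms) and speaking rate ("slow") are assumed values that the talk does not state. The `say-as` interjection, `prosody`, and `break` tags themselves are standard Alexa SSML.

```python
from xml.sax.saxutils import escape


def interjection(word: str) -> str:
    """Wrap a pre-recorded interjection in the SSML interjection tag."""
    return f'<say-as interpret-as="interjection">{escape(word)}</say-as>'


def filler(word: str, rate: str = "slow", pause_ms: int = 300) -> str:
    """Slow a filler word down and follow it with a pause.

    The default rate and pause length are assumptions for illustration.
    """
    return (
        f'<prosody rate="{escape(rate)}">{escape(word)}</prosody>'
        f'<break time="{pause_ms}ms"/>'
    )


def speak(*parts: str) -> str:
    """Join already-escaped SSML fragments into one <speak> document."""
    return "<speak>" + " ".join(parts) + "</speak>"


if __name__ == "__main__":
    print(speak(interjection("wow"), escape("tell me more about it")))
    print(speak(filler("so"), escape("do you like to play video games")))
```

Keeping the tags in small helpers means the surrounding response templates stay plain text, which matches the talk's point that only discrete words were added and no other prosody was changed.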
0:05:34	so this is modeled off of human
0:05:37	interaction where
0:05:39	individuals signal their cognitive emotional states
0:05:41	using these smaller response tokens
0:05:44	so for this project we focus on these two types of voice emojis interjections
0:05:49	and fillers
0:05:50	and interjections can signal different things
0:05:53	like the speaker's emotion
0:05:55	but also how interested or surprised they are about information
0:05:59	or whether what we're hearing about is newsworthy
0:06:03	the other type of voice emojis are fillers
0:06:05	like and
0:06:07	which can also signal information about the speaker
0:06:10	such as the speaker needing more time to collect their thoughts and consider a topic their degree
0:06:15	of uncertainty about a topic and even their level of understanding
0:06:20	so while our first research question was how do we add expressiveness our second is
0:06:25	how will people respond to alexa's expressiveness
0:06:29	theories of computer personification such as clifford nass's computers are social actors framework propose that
0:06:35	when a person senses a cue of humanity in the system we automatically treat it like
0:06:39	a person so here our question is really theoretically important in considering the degree to
0:06:44	which users personify voice ai
0:06:48	will users develop greater rapport with an
0:06:51	expressive alexa
0:06:53	or will it be creepy falling into the uncanny valley
0:06:56	the idea that the more similar a nonhuman entity like a robot or alexa is to
0:07:01	a person the more people like it up to a point where they find it
0:07:05	incredibly creepy
0:07:07	so here's an overview of the rest of the talk
0:07:10	first we'll go over some prior work looking at interjections and fillers in human computer
0:07:14	interaction
0:07:15	then i'll go over a study we did with our alexa prize chatbot gunrock
0:07:19	and then go over some conclusions and future directions
0:07:23	so there are actually very few studies that have tested adding interjections and exclamations in
0:07:28	the dialogue system
0:07:29	and there's been a lot greater focus on overall prosodic adjustments to a phrase or utterance
0:07:35	one study did test the impact of non-linguistic affective bursts
0:07:40	so buzzes and beeps
0:07:42	in a robot the nao robot and they found that
0:07:45	kids sixty years old readily attributed emotion to those noises
0:07:51	and while not using interjections per se syrdal and colleagues found that speech trained on
0:07:55	a corpus of positive exclamations like great
0:07:59	resulted in higher listener ratings
0:08:01	in a seven utterance simulated dialogue
0:08:04	but they observed no such effect when the tts was trained on negative exclamations
0:08:08	like dear or oops
0:08:10	so really overall adding interjections is an
0:08:13	understudied area in human computer interaction
0:08:17	and there's a bit more work looking at adding filler words but the findings have
0:08:21	been mixed
0:08:22	so on the one hand some studies have found a facilitatory effect
0:08:26	for example users have reported having a greater sense of engagement
0:08:30	with the robot if that robot uses filler words
0:08:34	and in another study independent raters gave higher naturalness ratings
0:08:38	for human computer conversations
0:08:40	when that voice included filler words
0:08:43	but others have found no positive effect of introducing filler words or even a negative effect
0:08:47	for some listeners
0:08:49	so it's really an open question as to how humans might respond to voice ai
0:08:53	systems
0:08:54	using interjections and fillers
0:08:56	whether these voice emojis for example might be beneficial or detrimental to user
0:09:01	experience
0:09:03	okay so now to gunrock
0:09:07	here's the overall architecture i'm just gonna provide a brief overview there's a technical report
0:09:12	if you're if you're curious
0:09:14	so the asr and tts models were provided by amazon
0:09:19	then we have a multi step nlu pipeline including sentence segmentation constituency parsing
0:09:24	and dialogue act prediction
0:09:27	and then gunrock has a hierarchical dialogue manager with a higher level topic
0:09:31	organizer as well as
0:09:34	template specific dialogue flows and that's for about ten different topics so it includes animals movies
0:09:40	news books
0:09:42	and so on
0:09:44	and this dialogue manager pulls in information from evi a factual knowledge base and
0:09:49	the gunrock persona
0:09:51	database
0:09:52	for questions about who alexa is
0:09:56	next we have a template based nlg module where the system fills slots with data
0:10:01	retrieved from various knowledge sources such as imdb
0:10:05	and then
0:10:05	finally we adjusted the prosody by adding the fillers and interjections so this is really
0:10:10	the focus of this presentation which were then output by the tts in the
0:10:14	default alexa voice
0:10:18	okay so how are we going to insert
0:10:20	interjections and fillers
0:10:22	we can't just insert them randomly that's not how language works
0:10:26	as dan mentioned in his keynote yesterday placement of these elements is really you
0:10:31	words so together we created a framework
0:10:34	for context specific placement of interjections and fillers into existing
0:10:38	gunrock templates
0:10:39	and again we didn't manipulate any other prosodic aspects of alexa's speech we just
0:10:44	added these discrete words and phrases
0:10:48	okay so starting with the interjections we defined five contexts
0:10:52	for each we defined a list of possible interjections which could be used in that
0:10:56	context so we defined a list and then they're randomly pulled in
0:10:59	so the first is to signal interest this was really important because we wanted the
0:11:04	user to elaborate
0:11:06	so for example
0:11:09	so tell me more about it
0:11:12	since the goal of the competition is to get users talking as long as possible
0:11:16	we really wanted them to expand on their experience and make it seem as
0:11:20	though alexa was actually interested in what they had to say
0:11:23	so here we used
0:11:25	a lot of different interjections which could be randomly inserted
0:11:28	into this phrase initial slot
0:11:32	the second context was for error resolution or to show alexa's feelings about her
0:11:37	misunderstanding
0:11:38	and this was a really important one since alexa often misheard the user
0:11:43	we wanted to convey her disappointment
0:11:45	in not getting it right
0:11:47	again with lots of possible variations for example
0:11:50	darn i think you said probably can you say that one more time
0:11:55	the third was to accept the user's request
0:11:58	for example
0:11:59	l t here is some more information
0:12:03	this we didn't have as many as to signal interest since it was a social
0:12:07	dialogue system it's less
0:12:09	task based than alexa is usually
0:12:13	the fourth was to change topic as if alexa just remembered something she wanted to
0:12:17	share with the user
0:12:19	and this was the part of a strategy to change the topic if the user
0:12:22	wasn't being very responsive giving a lot of one word versatile answers
0:12:26	well i've been meaning to ask you do you like animals
0:12:31	and the fifth was to express agreement of opinion
0:12:34	yes
0:12:36	we share the same thoughts
0:12:37	and this didn't happen as often in the gunrock templates so we just used
0:12:40	two interjections here
0:12:42	but if you had a bot that really wanted to agree with people you could
0:12:45	add a lot of others like awesome or cool
0:12:48	so in addition to the five contexts we also included some interjections meant to convey
0:12:52	alexa's playfulness
0:12:54	and these were all utterance specific and not interchangeable so for example or in
0:13:01	that's so cute
0:13:03	and what one so get ready for a cheesy joke
0:13:06	what do you call blueberries playing the guitar
0:13:09	a jam session while
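The context-specific placement of interjections described above (a list per context, with one pulled at random into the phrase-initial slot) could look roughly like this. It is a sketch under stated assumptions: the context keys, the function name `add_interjection`, and most of the word lists are hypothetical placeholders, since the transcript preserves only a few of the actual interjections.

```python
import random
from typing import Optional

# One list of candidate interjections per context. The five contexts come
# from the talk; the words are illustrative placeholders, not the real lists.
INTERJECTIONS = {
    "signal_interest": ["wow", "ooh"],
    "resolve_error": ["darn", "oops"],
    "accept_request": ["okey dokey", "you bet"],
    "change_topic": ["oh", "hey"],
    "agree_with_opinion": ["yes", "awesome"],
}


def add_interjection(
    template: str, context: str, rng: Optional[random.Random] = None
) -> str:
    """Put a randomly chosen interjection for `context` in phrase-initial position."""
    rng = rng or random.Random()
    word = rng.choice(INTERJECTIONS[context])
    return f"{word} {template}"


if __name__ == "__main__":
    print(add_interjection("tell me more about it", "signal_interest"))
    print(add_interjection("can you say that one more time", "resolve_error"))
```

Passing in a seeded `random.Random` makes the choice reproducible, which is convenient for testing a template without changing the random behaviour in production.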
0:13:12	okay and now for the fillers like the interjections we identified certain types of context
0:13:17	where fillers would be used in human conversation so here are the three
0:13:23	the first was to change the topic for these we also tuned the production of
0:13:27	so slowing its rate and adding a pause
0:13:30	so i've been meaning to ask you do you like to play video games
0:13:36	the second was when retrieving alexa's backstory so this was as if
0:13:41	to make it seem as though alexa had to think about a response that was
0:13:45	less automatic
0:13:48	i'm not all animals that i think my favourite is probably the elephant
0:13:53	and the third was to acknowledge the user's answer to alexa's question
0:13:57	a like an interesting choice
0:14:01	so following that framework this resulted in interjections in about twelve percent
0:14:05	of turns in a conversation and filler words in about eighty percent of turns
0:14:09	the exact number of fillers and interjections varied based on whether that particular utterance was
0:14:15	engaged in
0:14:16	the conversation so you could
0:14:18	think of it as each user got a slightly different sampling
0:14:22	of these interjections and fillers based on the course of their conversation
0:14:27	okay so here's an example i'll be the user so hey alexa let's chat
0:14:33	hi this is an alexa prize socialbot glad to meet you how are
0:14:38	you doing today
0:14:39	doing okay how are you
0:14:42	i'm good so tell me more about yourself do you like sports
0:14:47	not really
0:14:49	so have you seen any movies lately
0:14:53	a star is born
0:14:55	ah i've seen a star is born too what would you rate this movie on a
0:15:00	scale from one to ten ten
0:15:04	wow
0:15:05	a perfect ten it really is an exceptional movie here is something i just learned
0:15:10	some of the scenes were filmed at the coachella festival in twenty seventeen where
0:15:14	lady gaga was the first female headliner in ten years
0:15:18	festival goers paid ten dollars to see the scenes being filmed all the recording devices
0:15:23	were locked away to prevent any footage being leaked any thoughts
0:15:27	so you can see it's really a discrete
0:15:29	phrase in a very long utterance
0:15:32	okay so does adding interjections and fillers have an impact
0:15:37	so we conducted a user study through the devices themselves so this is in the
0:15:41	wild as part of the alexa prize competition
0:15:44	so we had four conditions
0:15:46	one with interjections
0:15:47	one with fillers one with both and one with neither and
0:15:51	these conditions were pushed live to all alexa enabled devices from november twenty have to december
0:15:57	third so this was after the competition was over and no other code updates
0:16:02	were happening that's very crucial
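The four conditions can be read as two on/off switches applied to the same templates. The sketch below illustrates that reading only: the names `Condition`, `CONDITIONS`, and `render` are hypothetical, and how users were assigned to a condition is not described in the talk, so it is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Condition:
    use_interjections: bool
    use_fillers: bool


# The four conditions named in the talk, expressed as two switches.
CONDITIONS = {
    "baseline": Condition(False, False),
    "interjections": Condition(True, False),
    "fillers": Condition(False, True),
    "both": Condition(True, True),
}


def render(
    template: str,
    condition: str,
    token: Optional[str] = None,
    kind: Optional[str] = None,
) -> str:
    """Prepend `token` only if its kind is switched on in this condition.

    `kind` is "interjection" or "filler"; any other value leaves the
    template unchanged.
    """
    cond = CONDITIONS[condition]
    enabled = {
        "interjection": cond.use_interjections,
        "filler": cond.use_fillers,
    }
    if token and enabled.get(kind, False):
        return f"{token} {template}"
    return template


if __name__ == "__main__":
    for name in CONDITIONS:
        print(name, "->", render("have you seen any movies lately", name, "so", "filler"))
```

Because the baseline is the same template with both switches off, any rating difference between conditions can be attributed to the added tokens rather than to different wording.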
0:16:04	and this methodology extends prior work on human computer interaction
0:16:09	giving us a large sample size of over five thousand unique users individuals who actually wanted
0:16:15	to talk to the device and were doing so in the place most comfortable to
0:16:18	them
0:16:18	in their own homes
0:16:20	and the raters so at the end of the conversation they would rate the conversation on a
0:16:24	scale from one to five so
0:16:26	the raters were actually the users in the conversation itself
0:16:31	this also consists of users anyone with the device so it's not constrained to the
0:16:35	eighteen to twenty two year old slice that we generally test
0:16:38	but it's still likely skewed by socioeconomic status and finally users have more experience
0:16:44	with the specific system so perhaps they have more familiarity and rapport with their
0:16:48	alexa
0:16:51	so we analyzed the rating at the end of the conversation with a linear mixed
0:16:54	effects model with condition and with users as random intercepts
0:16:59	we only included data for conversations with at least ten turns
0:17:01	and for the ones that had a filler interjection or both
0:17:05	that had at least one of those
0:17:07	or two of those options
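The inclusion rule just described can be written as a small filter. This is one reading of the spoken description, not the authors' analysis code: the `Conversation` record and its field names are hypothetical, and treating the combined condition as requiring at least one filler and at least one interjection is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    condition: str       # "baseline", "interjections", "fillers" or "both"
    turns: int           # number of turns in the conversation
    interjections: int   # interjections actually produced
    fillers: int         # fillers actually produced
    rating: int          # user rating on the one-to-five scale


def is_included(conv: Conversation, min_turns: int = 10) -> bool:
    """Keep conversations that are long enough and actually received
    the manipulation their condition was supposed to deliver."""
    if conv.turns < min_turns:
        return False
    if conv.condition == "interjections":
        return conv.interjections >= 1
    if conv.condition == "fillers":
        return conv.fillers >= 1
    if conv.condition == "both":
        # Assumption: "two of those options" means one of each kind.
        return conv.interjections >= 1 and conv.fillers >= 1
    return True  # baseline has no token requirement


if __name__ == "__main__":
    sample = [
        Conversation("both", 14, 2, 3, 4),
        Conversation("fillers", 6, 0, 1, 3),
    ]
    print([is_included(c) for c in sample])
```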
0:17:09	so i'll take you through the results one by one here we have the
0:17:12	conditions on the x-axis and the rating on the y
0:17:16	here we can see the baseline model this is the one without interjections and
0:17:19	fillers had an average around two point eight
0:17:23	then we see
0:17:24	the linear regression model revealed a main effect of condition so we see significantly higher
0:17:29	ratings for conversations with interjections this is all relative to the baseline
0:17:34	we also see higher ratings for the conversations with fillers
0:17:38	and also for the conversations with both with an average increase of about
0:17:43	point seven five
0:17:45	we were curious to see if the combined condition
0:17:48	was different from the
0:17:50	single interjections and fillers and we did
0:17:52	indeed find that was the case
0:17:55	so adding voice emojis in appropriate contexts
0:17:59	improves user ratings
0:18:01	and
0:18:02	this shows that even adding discrete elements may improve overall expressiveness of a social dialogue
0:18:07	system and this provides support for casa frameworks as humans appear to be responding positively to
0:18:13	human like displays of cognitive emotional expression
0:18:16	in an alexa voice
0:18:18	and may in some ways be responding to the system more like a person
0:18:23	we also see that the effect is additive for different types of voice
0:18:26	emojis so users gave the highest ratings for conversations with both fillers and interjections
0:18:32	and overall this effect is robust we see it over thousands of unique
0:18:36	users and conversations
0:18:38	but one limitation perhaps you're already thinking of is that these ratings are really a
0:18:42	holistic measure of the overall conversation so we wanted to do one more controlled study
0:18:49	to confirm that the voice emojis do indeed improve the ratings of the conversations
0:18:55	so we did a mechanical turk experiment with any five turkers
0:19:00	and a similar condition structure as in the user study
0:19:04	with two dialogues one to signal interest
0:19:06	and one to resolve an error
0:19:10	so just as in the main study we had the baseline one with fillers
0:19:15	one with an interjection
0:19:16	and one with both here's an example
0:19:20	movies can be really fun
0:19:21	so i've been meaning to ask you what else are you interested in do you
0:19:27	like animals
0:19:29	what we're animals
0:19:31	some i think my favourite animal is the elephant
0:19:35	and then same for the dialogue or the error resolution dialogue
0:19:38	one with neither fillers nor interjections
0:19:41	one with fillers only
0:19:43	one with interjections only and one with both
0:19:46	that's pretty interesting
0:19:47	so have you seen any movies lately
0:19:51	the not really is really in good
0:19:55	darn i didn't catch that can you say that again
0:19:58	so these aren't real user interactions these are ones we scripted loosely based off of topics
0:20:04	in gunrock
0:20:05	so the turkers heard these two dialogues in all possible conditions randomly and then for
0:20:11	each dialogue they heard they rated the alexa voice on a sliding scale so
0:20:15	how engaged does alexa sound how expressive does alexa sound how likable and
0:20:20	how natural
0:20:21	and we analyzed these ratings with separate linear mixed effects models
0:20:26	since i'm running out of time i'll go through this quickly
0:20:29	so here's what we found as with the overall user study we found a main
0:20:33	effect of condition
0:20:35	again relative to the baseline
0:20:38	my computers
0:20:39	having some issues
0:20:43	so we see an increase for
0:20:48	so conversations with interjections shown in red show significantly higher ratings of all of those
0:20:54	social variables look for
0:20:59	for those four dimensions
0:21:02	i'll just give you a quick summary my computer it's frozen so overall what we
0:21:07	found perfect so what we saw was that the results for the user study mirrored
0:21:12	what we observed in the mechanical turk study in terms of social ratings we saw something
0:21:17	a little bit different with the fillers so users
0:21:20	the mechanical turkers actually rated the voice as having lower likability and lower engagement
0:21:26	when that voice had the fillers so this is a little bit different and suggests
0:21:30	that the role of the rater so who you are makes a difference so
0:21:35	if you're the person in the conversation you tend to like the interjections and
0:21:39	also like the fillers
0:21:41	but if you're an external rater listening in on the conversation
0:21:44	you really pick up on those fillers and that really mirrors what we've
0:21:48	seen in research on human interaction
0:21:50	thank you
0:22:04	we have some time for questions
0:22:12	very interesting topics i'm wondering about how
0:22:17	given the way that you're adding these
0:22:20	fillers and interjections it seems like it's somewhat stochastic as to when they come out
0:22:25	and i was wondering if
0:22:27	all the dialogues that included them have roughly the same percentage or number
0:22:32	or number per turn or whether there's a big variance within the different
0:22:37	dialogs and if there's variance whether you
0:22:41	looked more carefully at whether having more fillers or fewer fillers changed the rating that's
0:22:46	a great question we did look at that so we looked at the number of
0:22:50	fillers and interjections in a particular conversation
0:22:53	and there didn't seem to be a relationship at least with rating
0:22:58	it's related to overall turns but that's
0:23:00	to be expected
0:23:12	that backs fascinating and results reducing and
0:23:16	i was wondering having looked at the data
0:23:20	do you think doesn't is goal for building a model that can you know look
0:23:24	at context and decides yes or no we're gonna put a veteran seems likely
0:23:27	limits the yes right so this was just a very simple kind of way to
0:23:32	test this but we it was not the most sophisticated way that we could we
0:23:37	could do it by definitely
0:23:39	but i mean if you look at the conversations in the ones that looks like
0:23:42	it's going well looks like number do you think there's some signal on the
0:23:46	but there could be a model to train or
0:23:49	i noticed in the in-person user studies
0:23:53	that the users would smile if you had an interjection
0:23:58	and some actually
0:24:00	mention the filler words themselves
0:24:03	it's so
0:24:04	i mean that's a very explicit sort of cue but if you're able to record
0:24:09	you know you could
0:24:11	use
0:24:12	you know the smiling the facial expressions
0:24:14	to know if it's
0:24:15	if it's going well that's appropriate
0:24:20	more question
0:24:24	since the goal is to keep people engaged for longer what was the effect
0:24:28	on length of conversation
0:24:30	there wasn't a clear relationship so there are two so we wanna keep people
0:24:34	engaged as long as possible but also
0:24:36	have a meaningful conversation
0:24:38	really feel for so there was no relationship between number of
0:24:42	utterances but only with rating
0:24:47	this is more a comment than a question
0:24:49	sometimes people have new toys and they like the toy the first time then after a
0:24:55	while a good point five t
0:24:57	have you thought of making an experiment over time
0:25:03	well
0:25:04	you see if this really works
0:25:06	in the long term that's a great that's a great question no we haven't but
0:25:10	that's already down
0:25:15	and we have time for one last question
0:25:21	just for clarification your fillers seem to be all sort of turn
0:25:26	initial did you have them you know like the most notable fillers are like you
0:25:31	know in noun phrases just up the services
0:25:34	so we didn't so we just put them in the same location as the interjections
0:25:39	but you're absolutely right they occur in a lot of different places if you
0:25:43	have a hesitation for example or a false start sometimes you get fillers there as
0:25:48	well
0:25:49	we were just trying to keep it very simple
0:25:53	let's thank the speaker once more

A Large-Scale User Study of an Alexa Prize Chatbot: Effect of TTS Dynamism on Perceived Quality of Social Dialog

Oral Session 6: Evaluation and Data

Michelle Cohn, Chun-Yen Chen and Zhou Yu