0:00:21or it but so slowly start them accession my name is for the province crevices
0:00:27evaluation very to session
0:00:31first speaker today
0:00:33is gonna be michelle cohn
0:00:37we're gonna have three talks in the session
0:00:39which run until lunchtime
0:00:43so we shall
0:00:45thank you
0:00:49can you hear me okay
0:00:51hi i'm michelle cohn i'm a postdoc at u c davis working jointly with
0:00:56department of linguistics
0:00:57computer science and psychology and today i'll be presenting a project i did with arbit
0:01:02chen and zhou yu
0:01:04so more and more humans are talking to voice activated artificially intelligent devices like amazon
0:01:10alexa to complete daily tasks
0:01:12like setting a timer turning on the lights
0:01:15and a new aspect through the amazon alexa prize competition is the ability to
0:01:20engage real users in social chitchat through these systems many of you here have competed or
0:01:26are competing but for those of you who don't know about it
0:01:30the amazon alexa prize is a competition to create socialbots that can converse coherently
0:01:36and engagingly with humans on a range of topics like food music technology animals
0:01:42and so on
0:01:43and what's unique
0:01:45at least for researchers in academia is the ability to deploy these bots right in
0:01:50the wild something dan bohus talked about yesterday
0:01:54so during the competition anyone with an amazon echo
0:01:57could say let's chat
0:01:58and get one of the competing chatbots
0:02:01you may be familiar with some other teams from twenty eighteen
0:02:05including one from kth fantom advised by gabriel skantze and led by patrik
0:02:10jonell
0:02:12but today i'm going to be talking about gunrock the socialbot developed at
0:02:16u c davis advised by zhou yu and led by arbit chen and make two
0:02:21and gunrock is special as it won first place in the twenty eighteen competition
0:02:26you can see zhou and arbit here
0:02:30so when i met zhou and arbit in july last summer the gunrock team was
0:02:34about halfway through the competition and i was working on other projects related to how
0:02:38humans talk to voice ai so i was
0:02:40interested in seeing how
0:02:41users would engage with a socialbot like gunrock
0:02:45so we started to collaborate recording these user interactions you can see my microphone there
0:02:51but we noticed something as we listened to how these interactions unfolded
0:02:56alexa's speech was relatively flat
0:02:59and really lacked the dynamism in human interaction
0:03:02where speakers vary their speech just to show their excitement
0:03:05their interests and their understanding
0:03:08and this is important
0:03:09as users for example were offering information about their favourite movie alexa really didn't
0:03:14sound like she cared
0:03:16and others have noticed this flatness in the alexa voice as well here's an echo review
0:03:20where they mentioned that it would be nice if alexa didn't sound so monotone
0:03:25and that she needs to have a little more expression when she speaks
0:03:29and another where they say that they're having a lot of fun with her
0:03:33but her monotone productions can make things difficult for us to understand so this flatness
0:03:38could also affect users' ability to understand her speech
0:03:42so this led to several research questions the first was how can we improve alexa's
0:03:47expressiveness in a social dialogue system like gunrock
0:03:50especially given the time constraints of being in a competition
0:03:54so we know from work on human interaction that cognitive emotional expression is important for
0:04:00the quality of our interactions with others
0:04:03we see that readily in people's faces such as happiness and excitement
0:04:07when you go to the vasa museum or contemplation and interest
0:04:12but we also see that in the way we produce and perceive speech so for
0:04:16example how emotionally expressive we are relates to perceptions of speaker enthusiasm in human
0:04:23so this is something we wanted to mimic in alexa's speech
0:04:27so how do we make alexa more expressive well one option is to
0:04:31completely overhaul the prosody
0:04:33we really didn't have that as an option we weren't controlling the tts models
0:04:37in the competition which are given by amazon
0:04:40we can adjust the tts in minor ways using ssml
0:04:44but again we were on a time crunch and
0:04:46we also wanted to very carefully specify where cognitive
0:04:51emotional expression would be inserted
0:04:54so we asked whether we could add discrete units of cognitive emotional expression or voice
0:04:59emojis to improve expressiveness of the alexa voice
0:05:04so we identified two that we were interested in expressive interjections and these are ones
0:05:09that were pre-recorded by the alexa voice
0:05:12here's an example
0:05:13wow is a
0:05:15and filler words like or
0:05:20and they're relatively easy to add in the alexa skills kit just with a
0:05:24simple ssml tag to adjust expressiveness
0:05:27here for a speechcon an interjection
0:05:31or to add in a pause to make the filler words sound more natural
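The tagging described here can be sketched in a few lines of Python. This is only an illustration, not the team's code: the helper names `interjection`, `filler` and `speak`, the 300 ms pause and the "slow" rate are assumptions made for the example. The `say-as interpret-as="interjection"`, `prosody` and `break` elements are part of the SSML that Alexa accepts.

```python
from xml.sax.saxutils import escape


def interjection(word: str) -> str:
    """Mark a word so it is rendered as a pre-recorded interjection."""
    return f'<say-as interpret-as="interjection">{escape(word)}</say-as>'


def filler(word: str, pause_ms: int = 300, rate: str = "slow") -> str:
    """Slow a filler word down and follow it with a short pause.

    The default pause length and rate are illustrative guesses.
    """
    return (f'<prosody rate="{rate}">{escape(word)}</prosody>'
            f'<break time="{pause_ms}ms"/>')


def speak(*parts: str) -> str:
    """Join already marked-up parts into one SSML response."""
    return "<speak>" + " ".join(parts) + "</speak>"


# Example: an interjection placed at the start of an existing response.
ssml = speak(interjection("wow"), escape("tell me more about it"))
```

Keeping the markup in small helpers means the response text itself stays untouched, which fits the idea of adding discrete elements rather than changing the overall prosody.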
0:05:34so this is all modeled off of human
0:05:37interaction where
0:05:39individuals signal their cognitive emotional states
0:05:41using these smaller response tokens
0:05:44so for this project we focus on these two types of voice emojis interjections
0:05:49and fillers
0:05:50and interjections can signal different things
0:05:53like the speaker's emotion
0:05:55but also how interested or surprised they are about information
0:05:59or whether what we're hearing about is newsworthy
0:06:03the other type of voice emojis are fillers
0:06:05like um
0:06:07which can also signal information about the speaker
0:06:10such as the speaker needing more time to collect their thoughts and consider a topic their degree
0:06:15of uncertainty about a topic and even their level of understanding
0:06:20so while our first research question was how do we add expressiveness our second is
0:06:25how will people respond to alexa's expressiveness
0:06:29theories of computer personification such as clifford nass's computers are social actors framework propose that
0:06:35when a person senses a cue of humanity in the system we automatically treat it like
0:06:39a person so here our question is really theoretically important in considering the degree to
0:06:44which users personify voice ai
0:06:48will users develop greater rapport with an
0:06:51expressive alexa
0:06:53or will it be creepy falling into the uncanny valley
0:06:56the idea that the more similar a nonhuman entity like a robot or alexa is to
0:07:01a person the more people like it up to a point where they find it
0:07:05incredibly creepy
0:07:07so here's an overview of the rest of the talk
0:07:10first we'll go over some prior work looking at interjections and fillers in human computer
0:07:15then i'll go over a study we did with our alexa prize chatbot gunrock
0:07:19and then go over some conclusions and future directions
0:07:23so there are actually very few studies that have tested adding interjections and exclamations in
0:07:28the dialogue system
0:07:29and there's been a lot greater focus on overall prosodic adjustments to a phrase or utterance
0:07:35one study did test the impact of non-linguistic affective bursts
0:07:40so buzzes and beeps
0:07:42in a robot the nao robot and they found that
0:07:45kids sixty years old readily attribute emotion to those noises
0:07:51and while not using interjections per se syrdal and colleagues found that speech trained on
0:07:55a corpus of positive exclamations like great
0:07:59resulted in higher listener ratings
0:08:01in a seven utterance simulated dialogue
0:08:04but they observed no such effect when the tts was trained on negative exclamations
0:08:08like dear or oops
0:08:10so really overall adding interjections is an
0:08:13understudied area in human computer interaction
0:08:17and there's a bit more work looking at adding filler words but the findings have
0:08:21been mixed
0:08:22so on the one hand some studies have found a facilitatory effect
0:08:26for example users have reported having a greater sense of engagement
0:08:30with the robot if that robot uses filler words
0:08:34and in another study independent raters gave higher naturalness ratings
0:08:38for human computer conversations
0:08:40when that voice included filler words
0:08:43but others have found no positive effect of introducing filler words or even a negative effect
0:08:47for some listeners
0:08:49so it's really an open question as to how humans might respond to voice ai
0:08:54using interjections and fillers
0:08:56whether these voice emojis for example might be beneficial or detrimental to user
0:09:03okay so now to gunrock
0:09:07here's the overall architecture i'm just gonna provide a brief overview there's a technical report
0:09:12if you're if you're curious
0:09:14so the asr and tts models were provided by amazon
0:09:19then we have a multi step nlu pipeline including sentence segmentation constituency parsing
0:09:24and dialogue act prediction
0:09:27and then gunrock has a hierarchical dialogue manager with a higher level topic
0:09:31organizer as well as
0:09:34template specific dialogue flows and that's for about been different topics so includes animals movies
0:09:40news books
0:09:42and so on
0:09:44and this dialogue manager pulls in information from evi a factual knowledge base and
0:09:49the gunrock persona
0:09:52questions about who alexa is
0:09:56next we have a template based nlg module where the system fills slots with data
0:10:01retrieved from various knowledge sources such as imdb
0:10:05and then
0:10:05finally we adjusted the prosody by adding the fillers and interjections so this is really
0:10:10the focus of this presentation which were then output by the tts in the
0:10:14default alexa voice
0:10:18okay so how are we going to insert
0:10:20interjections and fillers
0:10:22we can't just insert them randomly that's not how language works
0:10:26as dan mentioned in his keynote yesterday placement of these elements is really
0:10:31nuanced so together we created a framework
0:10:34for context specific placement of interjections and fillers into existing
0:10:38gunrock templates
0:10:39and again we didn't manipulate any other prosodic aspects of alexa's speech we just
0:10:44added these discrete words and phrases
0:10:48okay so starting with the interjections we defined five contexts
0:10:52for each we defined a list of possible interjections which could be used in that
0:10:56context so we defined a list and then they're randomly pulled in
0:10:59so the first is to signal interest this was really important because we wanted the
0:11:04user to elaborate
0:11:06so for example
0:11:09so tell me more about it
0:11:12since the goal of the competition is to get users talking as long as possible
0:11:16we really wanted them to expand on their experience and make it seem as
0:11:20though alexa was actually interested in what they had to say
0:11:23so here we used
0:11:25a lot of different interjections which could be randomly inserted
0:11:28into this phrase initial slot
0:11:32the second context was for error resolution or to show alexa's feelings about the error
0:11:38and this was a really important one since alexa often misheard the user
0:11:43we wanted to convey her disappointment
0:11:45in not getting it right
0:11:47again with lots of possible variations for example
0:11:50there are i think you said probably can you say that one more time
0:11:55the third was to accept the user's request
0:11:58for example
0:11:59l t here is some more information
0:12:03for this we didn't have as many as to signal interest since as a social
0:12:07dialogue system it is less
0:12:09task based than alexa is usually
0:12:13the fourth was to change topic as if alexa just remembered something she wanted to
0:12:17share with the user
0:12:19and this was the part of a strategy to change the topic if the user
0:12:22wasn't being very responsive giving a lot of one word versatile answers
0:12:26well i've been meaning to ask you do you like animals
0:12:31and the fifth was to express agreement of opinion
0:12:36we share the same thoughts
0:12:37and this didn't happen as often in the gunrock templates so we just used
0:12:40two interjections here
0:12:42but if you had a bot that really wanted to agree with people you could
0:12:45add a lot of others like awesome or cool
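The context-specific placement described above (a list of candidate interjections per context, one drawn at random and placed phrase-initially in an existing template) could look roughly like the Python sketch below. It is an assumption-laden illustration rather than the Gunrock code: the context keys, the word lists and the function name `add_interjection` are all hypothetical placeholders.

```python
import random

# Hypothetical word lists, one per context from the talk.
# The actual lists used in the system are not given in the talk.
INTERJECTIONS = {
    "signal_interest": ["wow", "ooh", "aha"],
    "error_resolution": ["darn", "oops"],
    "accept_request": ["okey dokey"],
    "change_topic": ["oh"],
    "agree_opinion": ["awesome", "cool"],
}


def add_interjection(template: str, context: str, rng: random.Random) -> str:
    """Prefix a template with a randomly drawn interjection for its context.

    Templates tagged with an unknown context are returned unchanged, so
    nothing is ever inserted outside the defined contexts.
    """
    options = INTERJECTIONS.get(context)
    if not options:
        return template
    return f"{rng.choice(options)} {template}"


rng = random.Random(0)  # seeded only so the example is repeatable
print(add_interjection("tell me more about it", "signal_interest", rng))
print(add_interjection("here is the weather", "no_such_context", rng))
```

In a deployed skill the drawn word would additionally be wrapped in the SSML interjection tag; plain text is used here to keep the sketch short.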
0:12:48so in addition to the five contexts we also included some interjections meant to convey
0:12:52alexa's playfulness
0:12:54and these were all utterance specific and not interchangeable so for example or in
0:13:01that's so cute
0:13:03and what one so get ready for a cheesy joke
0:13:06what do you call blueberries playing the guitar
0:13:09a jam session while
0:13:12okay and now for the fillers like the interjections we identified certain types of contexts
0:13:17where fillers would be used in human conversation so here are the three
0:13:23the first was to change the topic and for these we also tuned the production of
0:13:27so slowing its rate and adding a pause
0:13:30so i've been meaning to ask you do you like to play video games
0:13:36the second was when retrieving alexa's backstory so this was as if
0:13:41to make it seem as though alexa had to think about a response that it was
0:13:45less automatic
0:13:48i'm not all animals that i think my favourite is probably the elephant
0:13:53and the third was to acknowledge the user's answer to alexa's question
0:13:57a like an interesting choice
0:14:01so following that framework this resulted in interjections in about twelve percent
0:14:05of turns in a conversation and filler words in about eighty percent of turns
0:14:09the exact number of fillers and interjections varied based on whether that particular utterance was
0:14:15engaged in
0:14:16during the conversation so you
0:14:18can think of it as each user got a slightly different sampling of
0:14:22these interjections and fillers based on the course of their conversation
0:14:27okay so here's an example i'll be the user so hey alexa let's chat
0:14:33hi this is an alexa prize socialbot glad to meet you how are
0:14:38you doing today
0:14:39doing okay how are you
0:14:42um good so tell me more about yourself do you like sports
0:14:47not really
0:14:49so have you seen any movies lately
0:14:53a star is born
0:14:55ah i've seen a star is born too what would you rate this movie on a
0:15:00scale from one to ten times
0:15:05a perfect ten it really is an exceptional movie here is something i just learned
0:15:10some of the scenes were filmed at the coachella festival in twenty seventeen where
0:15:14lady gaga was the first female headliner in ten years
0:15:18festival goers paid ten dollars to see the scenes being filmed all the recording devices
0:15:23were locked away to prevent any footage being leaked any thoughts
0:15:27so you can see it's really a discrete
0:15:29phrase in a very long utterance
0:15:32okay so does adding interjections and fillers have an impact
0:15:37so we conducted a user study through the devices themselves so this is in the
0:15:41wild as part of the alexa prize competition
0:15:44so we had four conditions
0:15:46one with interjections
0:15:47one with fillers one with both and one with neither and
0:15:51these conditions were pushed live to all alexa enabled devices from november twenty have to december
0:15:57third so this was after the competition was over and no other code updates
0:16:02were happening that's very crucial
0:16:04and this methodology extends prior work on human computer interaction
0:16:09giving us a large sample size of over five thousand unique users individuals who actually wanted
0:16:15to talk to the device and were doing so in the place most comfortable to them
0:16:18in their own homes
0:16:20and the raters so at the end of the conversation they would rate the conversation on a
0:16:24scale from one to five so
0:16:26the raters were actually the users in the conversation itself
0:16:31this also consists of users anyone with the device so it's not constrained to the
0:16:35eighteen to twenty two year old slice that we generally test
0:16:38but it's still likely skewed by socioeconomic status and finally users have more experience
0:16:44with the specific system so perhaps they have more familiarity and rapport with their alexa
0:16:51so we analyzed the rating at the end of the conversation with a linear mixed
0:16:54effects model with condition and by user random intercepts
0:16:59we only included data for conversations with at least ten turns
0:17:01and for the ones that had a filler interjection or both
0:17:05that had at least one of those
0:17:07or two of those options
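The inclusion rule just described can be written down as a small Python sketch. It is a hedged illustration only: the field names (`condition`, `turns`, `n_interjections`, `n_fillers`, `rating`) and the rows are made up for the example and are not data from the study, and reading the rule for the combined condition as "at least one of each" is an assumption. The sketch only filters and averages; it does not fit the linear mixed effects model used in the analysis.

```python
from collections import defaultdict
from statistics import mean

MIN_TURNS = 10  # conversations shorter than this are excluded


def keep(conv: dict) -> bool:
    """Apply the inclusion rule to one conversation record."""
    if conv["turns"] < MIN_TURNS:
        return False
    cond = conv["condition"]
    if cond == "interjections":
        return conv["n_interjections"] >= 1
    if cond == "fillers":
        return conv["n_fillers"] >= 1
    if cond == "both":
        # Assumed reading: the combined condition needs one of each.
        return conv["n_interjections"] >= 1 and conv["n_fillers"] >= 1
    return True  # baseline has nothing inserted, so only the turn rule applies


def mean_rating_by_condition(conversations: list) -> dict:
    """Average the end-of-conversation rating per condition after filtering."""
    ratings = defaultdict(list)
    for conv in conversations:
        if keep(conv):
            ratings[conv["condition"]].append(conv["rating"])
    return {cond: mean(vals) for cond, vals in ratings.items()}


# Made-up toy rows, not results from the study.
toy = [
    {"condition": "baseline", "turns": 12, "n_interjections": 0, "n_fillers": 0, "rating": 3},
    {"condition": "baseline", "turns": 14, "n_interjections": 0, "n_fillers": 0, "rating": 4},
    {"condition": "baseline", "turns": 4, "n_interjections": 0, "n_fillers": 0, "rating": 1},
    {"condition": "interjections", "turns": 15, "n_interjections": 2, "n_fillers": 0, "rating": 4},
    {"condition": "interjections", "turns": 11, "n_interjections": 0, "n_fillers": 0, "rating": 2},
    {"condition": "both", "turns": 20, "n_interjections": 1, "n_fillers": 3, "rating": 5},
]
summary = mean_rating_by_condition(toy)
```

Per-condition means like these are only a descriptive first look; a mixed model with a random intercept per user would additionally account for repeated conversations from the same person.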
0:17:09so i'll take you through the results one by one here we have the
0:17:12conditions on the x-axis and the rating on the y
0:17:16here we can see the baseline model this is the one without interjections and
0:17:19fillers had an average around two point eight
0:17:23then we see
0:17:24the linear regression model revealed a main effect of condition so we see significantly higher
0:17:29ratings for conversations with interjections this is all relative to the baseline
0:17:34we also see higher ratings for the conversations with fillers
0:17:38and also for the conversations with both with an average increase of about
0:17:43point seven five
0:17:45we were curious to see if the combined condition
0:17:48was different from the
0:17:50single interjections and fillers and we did
0:17:52indeed find that was the case
0:17:55so adding voice emojis in appropriate contexts
0:17:59improves user ratings
0:18:02this shows that even adding discrete elements may improve overall expressiveness of a social dialogue
0:18:07system and this provides support for casa frameworks as humans appear to be responding positively to
0:18:13human like displays of cognitive emotional expression
0:18:16in an alexa voice
0:18:18and may in some ways be responding to the system more like a person
0:18:23we also see that the effect is additive for different types of voice emoji
0:18:26so users gave the highest ratings for conversations with both fillers and interjections
0:18:32and overall this effect is robust we see it over thousands of unique
0:18:36users and conversations
0:18:38but one limitation perhaps you're already thinking of is that these ratings are really a
0:18:42holistic measure of the overall conversation so we wanted to do a more controlled study
0:18:49to confirm that the voice emojis do indeed improve the ratings of the conversations
0:18:55so we did a mechanical turk experiment with ninety five turkers
0:19:00and a similar condition structure as in the user study
0:19:04with two dialogues one to signal interest
0:19:06and one to resolve an error
0:19:10so just as in the main study we had the baseline one with fillers
0:19:15one with an interjection
0:19:16and one with both here's an example
0:19:20movies can be really fun
0:19:21so i've been meaning to ask you what else are you interested in do you
0:19:27like animals
0:19:29what we're animals
0:19:31some i think my favourite animal is the elephant
0:19:35and then same for the dialogue or the error resolution dialogue
0:19:38one with neither fillers nor interjections
0:19:41one with fillers only
0:19:43one with interjections only and one with both
0:19:46that's pretty interesting
0:19:47so have you seen any movies lately
0:19:51the not really is really in good
0:19:55darn i didn't catch that can you say that again
0:19:58so these aren't real user interactions these are ones we scripted loosely based off of topics
0:20:04in gunrock
0:20:05so the turkers heard these two dialogues in all possible conditions randomly and then for
0:20:11each dialogue they heard they were asked to rate the alexa voice on a sliding scale so
0:20:15how engaged does alexa sound how expressive does alexa sound how likable and
0:20:20how natural
0:20:21and we analyzed these ratings with separate linear mixed effects models
0:20:26since i'm running out of time i'll go through this quickly
0:20:29so here's what we found as with the overall user study we found a main
0:20:33effect of condition
0:20:35again relative to the baseline
0:20:38my computers
0:20:39having some issues
0:20:43so we see an increase for
0:20:48so conversations with interjections shown in red show significantly higher ratings of all of those
0:20:54social variables look for
0:20:59for those four dimensions
0:21:02i'll just give you a quick summary my computer it's frozen so overall what we
0:21:07found perfect so what we saw that the results for the user study mirror
0:21:12what we observed in the mechanical turk study in terms of social ratings we saw something
0:21:17a little bit different with the fillers so users
0:21:20the mechanical turkers actually rated the voice as having lower likability and lower engagement
0:21:26when that voice had the fillers so this is a little bit different and suggests
0:21:30that the role of the rater so who you are makes a difference so
0:21:35if you're the person in the conversation you tend to like the interjections you
0:21:39also like the fillers
0:21:41but if you're an external rater listening in on the conversation
0:21:44you really pick up on those fillers and that really mirrors what we've
0:21:48seen in research on human interaction
0:21:50thank you
0:22:04we have time for some questions
0:22:12very interesting topic i'm wondering about how
0:22:17given the way that you're adding these
0:22:20fillers and interjections it seems like it's somewhat stochastic as to when they come out
0:22:25and i was wondering if
0:22:27all the dialogues that included them have roughly the same percentage or number
0:22:32or number per turn or whether there's a big variance within the different
0:22:37dialogs and if there's variance whether you
0:22:41looked more carefully at whether having more fillers or fewer fillers changed the rating
0:22:46that's actually a question we did look at so we looked at the number of
0:22:50fillers and interjections in a particular conversation
0:22:53and there didn't seem to be a relationship at least with rating
0:22:58it's related to overall turns but that's
0:23:00to be expected
0:23:12that backs fascinating and results reducing and
0:23:16i was wondering having looked at the data
0:23:20do you think doesn't is goal for building a model that can you know look
0:23:24at context and decides yes or no we're gonna put a veteran seems likely
0:23:27limits the yes right so this was just a very simple kind of way to
0:23:32test this but we it was not the most sophisticated way that we could we
0:23:37could do it by definitely
0:23:39but i mean if you look at the conversations in the ones that looks like
0:23:42it's going well looks like number do you think there's some signal on the
0:23:46but there could be a model to train or
0:23:49i noticed in the increase in user studies
0:23:53that the users would smile if you had interjection
0:23:58and some actually
0:24:00mention the filler words themselves
0:24:03it's so
0:24:04i mean that's a very explicit sort of cue but if you're able to record
0:24:09we you know you could
0:24:12you know the smiling the facial expressions
0:24:14to know if it's
0:24:15if it's going well that's appropriate
0:24:20more question
0:24:24since the goal is to keep people engaged for longer what was the effect
0:24:28on length of conversation
0:24:30there wasn't a clear relationship so there are two so we wanna keep people
0:24:34engaged as long as possible but also
0:24:36have a meaningful conversation
0:24:38really feel for so there was no relationship between number of
0:24:42okay utterances but well only with reading
0:24:47this is more a comment than a question
0:24:49sometimes people have news stories and they like tori the first time then after a
0:24:55while a good point five t
0:24:57and have you thought of making an experiment over time
0:25:04you see if this really works
0:25:06in the long time that's a great that's a great question no we haven't but
0:25:10that's already down
0:25:15and we have time for one last question
0:25:21just for clarification your fillers seem to be all sort of turn
0:25:26initial did you have them you know like the most notable fillers are like you
0:25:31know in noun phrases just up the services
0:25:34so we didn't so we just put them in the same location as the interjections
0:25:39but you're absolutely right they occur in a lot of different places if you
0:25:43have a hesitation for example or a false start sometimes you get fillers there
0:25:49so we were just trying to keep it very simple
0:25:53let's thank the speaker once more