Speech Transcript - Toward incremental dialogue act segmentation in fast-paced interactive dialogue systems

0:00:14	right good evening everyone again
0:00:20	a problem of how or more
0:00:25	a tall
0:00:26	i one i tried to make it more interesting or and exciting
0:00:33	alright so
0:00:37	taking a step back and looking at our previous work and what that previous work was
0:00:42	so in the last work we looked at
0:00:44	fine-grained semantics, that is we tried to understand
0:00:48	are the scene descriptions by segmenting the target descriptions or what's right and as described
0:00:54	target you
0:00:55	into different parts
0:00:57	into different semantic acts as we see here and then
0:01:02	try to understand the images
0:01:04	so in this work we take a step back and we try to understand the
0:01:08	high-level dialogue acts
0:01:09	so here
0:01:11	we try to understand these high-level dialogue acts, for instance
0:01:16	we try to understand what the person is trying to do, and we kind of
0:01:21	extend upon the previous work presented previously
0:01:25	alright
0:01:25	so the motivation for this work is to achieve fast paced interaction so
0:01:31	in fast-paced interactions a lot of things happen, that is
0:01:35	a single user speech segment can have multiple dialogue acts and a single dialogue act can
0:01:40	span across multiple speech segments
0:01:43	and in those cases what should we do, what kind of things should we
0:01:47	design
0:01:48	the main thing is we want to understand
0:01:52	a methodology to perform this dialogue act segmentation and try to understand what the dialogue acts are
0:01:57	in an environment which is very fast paced, and try
0:02:03	to understand these things
0:02:05	and then initiate a dialogue act at the right
0:02:09	time so that's the motivation
0:02:13	well
0:02:14	the structure of this talk will be divided into these parts so
0:02:18	first i'll speak a bit about our domain and the previous work and try
0:02:22	to
0:02:22	set up the technical problem as a starting point
0:02:26	and then the annotation scheme that we used, an outline of
0:02:32	a target al
0:02:33	then a description of the methods we use to perform the segmentation and
0:02:37	the dialogue act labeling
0:02:40	sorry
0:02:41	then we evaluate the components and then see how it works with the agent
0:02:47	so
0:02:48	the domain that we use is very similar to the one that we saw in the
0:02:52	last talk so
0:02:54	that's not cases topic one fly so the domain is basically called RDG-Image
0:02:58	are
0:02:59	okay so it's
0:03:00	it's a rapid dialogue game between two people
0:03:04	a two-player game so it's fast it's
0:03:08	it's very rapid and time-constrained
0:03:11	and
0:03:12	thus we don't study has little harder classes little heart classes got it
0:03:17	okay this line is i
0:03:21	before that i'm sorry
0:03:22	before that, so this person is the director
0:03:26	so this person is the director and the director gets to
0:03:29	see the screen on a computer
0:03:32	and she is basically trying to describe
0:03:34	the highlighted target image, and this is the matcher
0:03:37	the matcher doesn't see any of those images highlighted so the matcher is trying
0:03:42	to
0:03:43	make the selection, and they can have dialogue exchanges back and forth
0:03:48	and it's time-constrained and they also see the score so it's
0:03:52	it's incentivized
0:04:01	extensive study has a little hard classes little hard classes got okay this line is
0:04:08	i really didn't go flying its actual a lot of might actually got it okay
0:04:11	it's one as the line classes yellow classes with the space on the tiny classes
0:04:16	that high
0:04:19	well as you can see it's something that the game from furthermore dialogue exchanges and
0:04:23	it's a kind of problem
0:04:25	so we built an agent using this data and that is what we presented
0:04:29	in the previous work
0:04:31	the agent can play this fast-paced game with a real user
0:04:38	we had incremental components, it had asr nlu and the policy and all these
0:04:42	components were operating incrementally
0:04:46	an incremental architecture is very important because we got better game scores
0:04:53	but
0:04:54	it's not significantly better than humans
0:04:56	which means you know like it or from really rather i don't perform much better
0:05:00	than alternate incremental
0:05:02	architectures for which a one point of view back or what previous adaptive one thing
0:05:08	and it had favorable subjective evaluations, that is people interacting with this agent liked interacting
0:05:13	with the agent compared to other versions of the agent
0:05:18	it is
0:05:19	there are a few limitations of this architecture, the limitation is that it
0:05:24	assumes every description, every word that the person is speaking, is
0:05:29	basically a description of the target
0:05:30	and if that's the case we can't
0:05:33	have really fast-paced interaction as the two players were having
0:05:38	so it's
0:05:38	not as interactive as human players
0:05:40	but it is really fast
0:05:42	so
0:05:45	we built an agent so i want to show a small video of the agent
0:05:48	interacting with a human to reinforce the points that i just made
0:05:55	at the top you see the human director's screen
0:05:57	so there's a cultural studies so you want to human faces but
0:06:00	in the top eight images you see the human describing them and in the bottom screen you see
0:06:05	the agent's eight images and its confidence
0:06:08	in
0:06:09	in the power
0:06:17	i
0:06:21	it is apparent i
0:06:26	so i
0:06:31	it is asleep and y
0:06:34	i
0:06:38	so which one is the same time
0:06:45	are
0:06:46	i placed indoors and c
0:06:50	i
0:06:54	though the agent is you know like very muffins as you can see
0:06:59	which is really grappling the game but models
0:07:01	alright so what we want to do is make the agent more interactive
0:07:05	so we want to make use of the full range of dialogue acts that you know
0:07:08	only know of this
0:07:10	we want to initiate the right dialogue act at the
0:07:14	right time so that we get the right interactions and for that
0:07:19	it needs incremental dialogue act segmentation and labeling in some sense and
0:07:24	we show how we use it and why we need it
0:07:28	and the challenge is that
0:07:30	they efficiently employing a good for it for instance in the previous architecture we had
0:07:34	the agent which
0:07:36	treated every utterance as basically a target image description so she was being very efficient
0:07:41	in understanding the target images
0:07:42	but
0:07:43	if we have if we include more dialogue acts it's very possible like dialogue acts
0:07:47	make it is i don't i've make it is label of you want to be
0:07:51	going to other dialogue acts surrounding the target descriptions for instance and the gimp make
0:07:57	a good so we wanted to see
0:08:00	if the agent performance takes a hit or not
0:08:05	so we collected a human-human dialogue corpus in the lab setting in one of
0:08:11	the previous studies
0:08:13	and we annotated this data, it was annotated by a human
0:08:17	so
0:08:18	the game characteristic is that
0:08:21	it's rapid
0:08:23	and there are multiple dialogue acts within a speech segment
0:08:28	and the same dialogue act can actually span across different speech segments like for
0:08:32	instance here
0:08:34	you can see whenever the don't they can kind of work sits down dating is
0:08:38	like really fast there's like lot of overlaps
0:08:41	and here in this example we can see that there are multiple dialogue acts
0:08:45	within a single speech segment
0:08:48	so each speech segment is separated out by these two hundred
0:08:51	milliseconds
0:08:53	and in this example the same dialogue act spans across
0:08:57	multiple speech segments
0:09:00	and from this table we can see there are a lot of dialogue acts
0:09:03	within each speech segment and the hypothesis is that for each
0:09:08	ipu or each speech segment, if we segment by separating it out by a
0:09:13	silence threshold, we won't do a good job of identifying the dialogue
0:09:18	acts
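
The silence-threshold segmentation discussed here can be sketched as follows. This is a minimal illustration, not the speakers' implementation: the `(token, start, end)` word format, the function name `split_on_silence`, and the default threshold of 0.2 seconds (taken from the "two hundred milliseconds" mentioned in the talk) are assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical word format: (token, start time in seconds, end time in seconds).
Word = Tuple[str, float, float]


def split_on_silence(words: List[Word], threshold: float = 0.2) -> List[List[str]]:
    """Group words into speech segments, starting a new segment whenever
    the pause before a word is at least `threshold` seconds."""
    segments: List[List[str]] = []
    current: List[str] = []
    prev_end = None
    for token, start, end in words:
        if prev_end is not None and current and start - prev_end >= threshold:
            segments.append(current)
            current = []
        current.append(token)
        prev_end = end
    if current:
        segments.append(current)
    return segments


# Illustrative timings only: one long pause between "one" and "got".
example = [
    ("okay", 0.0, 0.3), ("this", 0.35, 0.5), ("one", 0.55, 0.8),
    ("got", 1.4, 1.6), ("it", 1.65, 1.8),
]
segments = split_on_silence(example)
```

A segmenter like this only sees pauses, so two dialogue acts spoken without a pause land in one segment, which is the limitation the talk points at.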
0:09:20	so the human annotators or our goal and it was annotated is doing so
0:09:25	and annotation is done at a very fine-grained level
0:09:29	at the word level
0:09:31	so here for instance a couple of dialogue acts are identified, this is a
0:09:35	question and if that is its answer to the previous question than a little or
0:09:40	i don't all
0:09:43	so how does the annotated corpus look
0:09:47	so it's very diverse
0:09:49	so if we think of this game as simple target descriptions and acknowledgements or
0:09:54	assert-identified moves by the person
0:09:57	as two dialogue acts we will be covering only fifty six percent of the total
0:10:01	dialogue acts
0:10:01	so the remaining forty four percent of dialogue acts contains a
0:10:05	lot of other
0:10:06	kinds of dialogue exchanges
0:10:09	some of them are questions, answers, echo confirmations and all
0:10:15	game but it is
0:10:18	so on to the methods
0:10:19	so this is the corpus that we have been working with, we have a human corpus and
0:10:23	our goal is, if we include this in an agent
0:10:28	how would the segmentation and dialogue act labeling perform with the agent, so that's
0:10:33	the thing that we want to work on
0:10:37	and evaluate
0:10:38	now on to the methods
0:10:40	the method that we use is divided into two steps
0:10:47	so the first step, we have
0:10:48	the asr utterances, the asr is giving out its incremental utterances
0:10:53	we just feed them to the linear-chain conditional random
0:10:57	field, the crf, the crf is a sequential
0:11:01	classifier
0:11:02	every word is labeled as part of a new segment or part of
0:11:06	a previous segment
0:11:08	and then once we have the segment boundaries assigned we want to identify what each
0:11:12	of these segments is
0:11:15	so
0:11:16	one thing is that it's not a new approach, this idea of you know like
0:11:19	first segmenting the whole dialogue into segments and then
0:11:23	identifying the dialogue act of each segment
0:11:25	it's been used by many people in the past
0:11:32	and we make sure
0:11:34	so here in this approach let's see, so we have the transcripts which
0:11:40	contain these many words that are just coming out from the asr
0:11:44	so these black boxes are basically two hundred milliseconds, at least three hundred milliseconds of
0:11:48	speech
0:11:49	and
0:11:51	once these are input
0:11:53	they are fed to the linear-chain conditional random field, and this is a
0:11:58	sequential classifier that assigns each word
0:12:02	a label for whether this word is part of a new segment or of a previous
0:12:07	segment
0:12:08	so we just use b/i tagging because each word is part of
0:12:11	a segment
0:12:13	and then once we have the segments extracted
0:12:17	we label each one of the segments using an svm classifier
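
The step from per-word tags to segments can be sketched as below. This is only an illustration of the decoding step, under the assumption that the tagger emits `"B"` (word begins a new segment) or `"I"` (word continues the previous segment); the function name and the example words and tags are hypothetical, and the CRF and SVM themselves are not shown.

```python
from typing import List


def segments_from_bi_tags(words: List[str], tags: List[str]) -> List[List[str]]:
    """Turn a word sequence and its B/I tags into a list of segments.

    "B" starts a new segment, "I" extends the current one. No "O" tag is
    needed because every word belongs to some segment.
    """
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    segments: List[List[str]] = []
    for word, tag in zip(words, tags):
        if tag not in ("B", "I"):
            raise ValueError(f"unexpected tag: {tag!r}")
        if tag == "B" or not segments:
            # Also treat a leading "I" as a segment start rather than failing.
            segments.append([word])
        else:
            segments[-1].append(word)
    return segments


# Hypothetical tagger output for one speech segment with two dialogue acts.
words = ["yeah", "is", "it", "the", "yellow", "one"]
tags = ["B", "B", "I", "I", "I", "I"]
decoded = segments_from_bi_tags(words, tags)
```

Each decoded segment would then be passed to the dialogue act classifier.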
0:12:22	but what kind of segment
0:12:25	so what kind of features are used to perform these methods
0:12:28	so we used three kinds of features, the first is lexico-syntactic features
0:12:33	which include words, the part-of-speech tags and
0:12:37	the top-level phrases which are obtained from the parse trees
0:12:41	and then we have the prosody features which we extracted from the audio
0:12:47	incrementally
0:12:48	so every ten milliseconds we run this prosody feature extractor for which we use
0:12:53	InproTK
0:12:54	and then we obtain the
0:12:58	mean, the max and the z-scores for pitch and rms values which give
0:13:03	us an idea about
0:13:05	the frequency and energy values
0:13:08	and then we have the pause duration between the words which is also included
0:13:12	as a feature
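
As a rough illustration of the prosodic summary described here, the sketch below computes a mean, a max, and their z-scores over a window of frame values, plus the pause before a word. It assumes the frame values (pitch or energy, one per 10 ms) have already been extracted by some other tool; the function names, the z-normalisation against the speaker's earlier frames, and the numbers are all assumptions for the example, not the speakers' exact feature set.

```python
from statistics import mean, pstdev
from typing import Dict, List


def prosody_summary(frames: List[float], history: List[float]) -> Dict[str, float]:
    """Summarise one window of frame-level values (e.g. pitch).

    `history` holds earlier values from the same speaker and is used to
    z-normalise the window statistics.
    """
    mu = mean(history)
    sigma = pstdev(history)

    def z(value: float) -> float:
        return 0.0 if sigma == 0 else (value - mu) / sigma

    window_mean = mean(frames)
    window_max = max(frames)
    return {
        "mean": window_mean,
        "max": window_max,
        "mean_z": z(window_mean),
        "max_z": z(window_max),
    }


def pause_before(prev_word_end: float, word_start: float) -> float:
    """Pause duration in seconds between two consecutive words."""
    return max(0.0, word_start - prev_word_end)


# Illustrative values only.
features = prosody_summary(frames=[150.0, 250.0], history=[100.0, 200.0])
pause = pause_before(1.0, 1.5)
```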
0:13:14	then for the contextual features we want
0:13:18	the agent to know what kind of role the person is performing, is it the
0:13:22	director or the matcher, because they both have different kinds of dialogue
0:13:26	act distributions
0:13:27	then we have the previously recognized dialogue act labels which are very
0:13:31	important to identify things like echo confirmations or answers to questions
0:13:35	and then the recent words from the other interlocutor which are very important
0:13:39	to identify echo confirmations
0:13:44	we use these features and all these modules are operating incrementally, that means with every new
0:13:49	asr hypothesis that comes in
0:13:51	the b/i tagger
0:13:54	splits the utterance into different segments and then the classifier, that is
0:13:59	the dialogue act labeler, runs and identifies the dialogue acts
0:14:04	so the dialogue acts change with every new word because
0:14:08	you know it has more information, and going on to the task
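
The incremental operation described here, where segmentation and labeling are re-run on every new partial ASR hypothesis, might look roughly like this. The `toy_segmenter` and `toy_labeler` functions are deliberately trivial stand-ins for the trained tagger and classifier, and the label names are hypothetical; only the control flow is the point of the sketch.

```python
from typing import Callable, List, Tuple

Segmenter = Callable[[List[str]], List[List[str]]]
Labeler = Callable[[List[str]], str]


def process_incrementally(
    partials: List[List[str]], segmenter: Segmenter, labeler: Labeler
) -> List[List[Tuple[List[str], str]]]:
    """Re-segment and re-label the full hypothesis each time it grows."""
    history = []
    for words in partials:
        history.append([(seg, labeler(seg)) for seg in segmenter(words)])
    return history


def toy_segmenter(words: List[str]) -> List[List[str]]:
    # Stand-in rule: a new segment starts at the first word and after "okay".
    segments: List[List[str]] = []
    for i, word in enumerate(words):
        if i == 0 or words[i - 1] == "okay":
            segments.append([word])
        else:
            segments[-1].append(word)
    return segments


def toy_labeler(segment: List[str]) -> str:
    # Stand-in rule with hypothetical label names.
    if segment == ["okay"]:
        return "acknowledgement"
    if segment[-1] == "right":
        return "question"
    return "description"


# Hypothetical partial ASR hypotheses arriving over time.
partials = [
    ["okay"],
    ["okay", "the", "yellow", "one"],
    ["okay", "the", "yellow", "one", "right"],
]
history = process_incrementally(partials, toy_segmenter, toy_labeler)
labels_per_step = [[label for _, label in step] for step in history]
```

In this toy run the label of the second segment changes once the final word arrives, which mirrors the point that dialogue act hypotheses can be revised with every new word.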
0:14:13	so the questions that we want to ask are how well does
0:14:16	the segmenter and dialogue act labeler pipeline perform in this
0:14:21	reference resolution image task
0:14:25	and what is the impact of asr performance, that is an asr with a reasonable word
0:14:30	error rate if it is it is into those who makes
0:14:34	how well how well we're not ask kind of a core
0:14:38	and then how does the automated pipeline perform
0:14:42	i mean like how does it impacted image understanding of the user can correctly one
0:14:46	dimensional
0:14:47	evaluation of the components is a little hard because there are a lot of variables
0:14:54	because the first thing is that there are transcripts from the users and there are asr
0:14:58	hypotheses just coming in
0:15:00	and they don't match up and it's very hard to align them
0:15:04	so here in this example they are not aligned
0:15:08	to one another, it's basically just aligned as they come in
0:15:12	and the human annotator does the segmentation and the dialogue act labeling
0:15:17	at the word level and
0:15:21	we have that as data
0:15:23	now if we want to measure the performance of the dialogue act labeler
0:15:27	we can just run the dialogue act labeler on this human transcribed
0:15:32	and human segmented information and we can get a
0:15:36	sense as to how the dialogue act labeler is performing
0:15:40	but if you put the segmenter in as well then
0:15:44	you lose the one-to-one mapping
0:15:49	between the dialogue acts from the gold and
0:15:52	from the segmenter and the dialogue act labeler
0:15:55	so how do we measure, because we can't go by a word by word measure for
0:15:58	instance
0:16:00	but once we put asr into the picture
0:16:04	we even lose the one-to-one mapping between the transcribed and annotated gold
0:16:09	and the asr
0:16:11	output, so how do we evaluate you know
0:16:16	a pipeline that is working in such a mode
0:16:19	so previously researchers have used
0:16:23	many metrics to measure these things so we have da segmentation error
0:16:27	rates, da error rates, f-scores and concept error rates which people have used
0:16:33	in the past to measure the system
0:16:38	but each one of these metrics
0:16:40	you know like
0:16:42	kind of measures different things in the system
0:16:45	but what we actually want to measure when we're building the system is that
0:16:49	we want to know if
0:16:51	the right dialogue act was identified so that we can take the right action
0:16:56	for example it doesn't matter if the asr made
0:17:01	an error in identifying the whole word for example and it gave you know like
0:17:06	instead of oh no maybe it gave no, and i am identifying the no answer in
0:17:12	spite of
0:17:15	the asr error which was happening
0:17:17	so if i get the right dialogue act maybe my agent can get a better
0:17:21	performance i mean take better actions
0:17:23	so to measure such a kind of system we need multiset precision and
0:17:28	recall metrics
0:17:30	for which
0:17:31	in the interest of time i won't go into the details of
0:17:34	this metric but let's just keep in mind that the segment level boundaries
0:17:39	for the words
0:17:40	are not so important, it's important that we identify the dialogue acts
0:17:46	that was kind of traffic
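
The talk does not spell out the metric, so the following is only one plausible reading of a multiset precision and recall over dialogue act labels: the predicted and gold labels are compared as bags, ignoring word boundaries and order. The function name and the label names in the example are assumptions.

```python
from collections import Counter
from typing import List, Tuple


def multiset_precision_recall(predicted: List[str], gold: List[str]) -> Tuple[float, float]:
    """Compare predicted and gold dialogue act labels as multisets.

    The overlap counts each label as many times as it appears in both
    bags, so segment boundaries and ordering play no role.
    """
    overlap = sum((Counter(predicted) & Counter(gold)).values())
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical labels for one stretch of dialogue.
gold = ["describe", "describe", "ack"]
predicted = ["describe", "ack", "question", "ack"]
precision, recall = multiset_precision_recall(predicted, gold)
```

Here two of the four predicted labels are matched in the gold bag and two of the three gold labels are recovered, even though no word-level alignment between the two sequences is assumed.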
0:17:49	so the evaluation produces these numbers so if we use the baseline which
0:17:54	is just one dialogue act for each speech segment
0:17:59	we end up with an accuracy of seventy eight percent
0:18:04	but if we perform the segmentation with just the
0:18:09	prosody features we get like seventy two percent
0:18:12	the drop in performance could be because it's not able to identify
0:18:18	boundaries that are not separated out by silence
0:18:21	and then if we use the lexico-syntactic and contextual
0:18:25	features we get to ninety percent
0:18:27	and once we combine all the features we get a performance increase
0:18:32	of like one or two percent
0:18:34	so the prosody features aren't impacting the performance much
0:18:39	you can see that change
0:18:41	but it's not close to human-level performance
0:18:44	so these are the numbers that we have for the multiset
0:18:48	precision and recall for describe-target and assert-identified, and from this table
0:18:54	we can observe that
0:18:56	with automation at every level
0:18:58	the performance kind of takes a hit so the numbers are dropping down as
0:19:02	we
0:19:03	go from human transcribed and human segmented to auto segmented and automated labeling and
0:19:09	finally the asr
0:19:12	but really what we want to see is how well
0:19:16	the agent performs, is the agent performing equally well or
0:19:19	not
0:19:20	so in a previous study we used a simulation method to measure how well the
0:19:24	agent
0:19:25	performed
0:19:27	so this offline method of evaluating the agent is called Eavesdropper which we have explained
0:19:31	in the twenty fifteen paper
0:19:36	and that gave us a really good picture as to how the agent performance actually
0:19:39	was, so we use that metric to evaluate
0:19:45	the agent performance on target image identification
0:19:50	and we found that
0:19:51	there was no significant difference between
0:19:55	finally the take-home message is that there are many metrics to measure dialogue
0:20:00	act segmentation, measuring the final impact on the agent performance is very important and the
0:20:05	individual module performance might give us a different picture
0:20:09	from the pipeline performance, and finally da segmentation can facilitate
0:20:15	us in building better and more complex agents
0:20:19	and in future work we want to integrate these policies in the agent
0:20:23	thank you
0:20:57	so that's a very good question
0:21:00	so the question was that
0:21:01	this domain is really specific in terms of utterances being of short duration
0:21:07	and so on, does it really scale up to a larger domain
0:21:13	so the answer is that i don't know, maybe it could because the framework
0:21:17	is kind of general in the sense that the features that are used are not
0:21:22	very much tied to this domain, but we should really explore and see how
0:21:27	it would perform in another domain for example
0:21:30	so the answer is it could
0:21:33	i can't say
0:21:42	of the creation
0:21:47	however questionable be architecture for segmentation and labeling what do you have to stuck swarm
0:21:54	for segmenting about one probably where you do the prior to drawing or cost
0:22:00	so the question was why we have separate steps for segmentation and labeling
0:22:06	so other researchers have looked at other architectures, like they
0:22:11	have tried to do the joint method
0:22:13	of identifying the boundaries
0:22:15	and also doing it in two separate steps
0:22:18	so i would say we did try it out and
0:22:23	we measured the performance and the joint method was not working
0:22:28	as well as
0:22:32	this method
0:22:50	that's right they were just set of e
0:22:53	we have a long tail of dialogue acts, from this table
0:22:58	the dialogue act distribution is kind of long-tailed and the joint method
0:23:03	would probably work if we had more
0:23:05	with this issue
0:23:10	no scripts
0:23:28	that's a good question, the question was can we look at the n-best
0:23:33	list and
0:23:35	see how well the performance was for dialogue act labeling and whether it would
0:23:41	work as well, so the answer is we didn't, we can take a
0:23:46	look at the n-best list, definitely that's something

Toward incremental dialogue act segmentation in fast-paced interactive dialogue systems

Oral session 4: Incremental processing

Ramesh Manuvinakurike, Maike Paetzel, Cheng Qu, David Schlangen and David DeVault