0:00:13 Yeah, so hi, I'm Daniel, and I'm presenting this work, which was done together with my co-author.
0:00:27 Okay, so conversational systems today are developed using this kind of walkie-talkie turn-taking paradigm, which means that one party talks at a time and response times are long. That is mostly because they use a pause duration threshold for end-of-utterance detection.
0:00:43 But in human-human conversation, more than fifty percent of all speaker shifts happen in two situations: either in partial overlap, or with a gap of up to two hundred milliseconds, which is supposed to be the minimum response time to a pause.
0:01:03 So if you want to do turn-taking when talking to a computer, you can divide this into two cases. The first one is when you have a long gap, longer than two hundred milliseconds; this can be handled by end-of-utterance predictors.
0:01:19 The second case is when you have a small overlap or a short gap. This is the target of this study. We introduce a simplified approach to it by focusing on the acknowledgement move, which is basically a backchannel type of dialogue act; it covers the things people say like "hmm", "yeah", and so on.
0:01:42 So if you want a dialogue system to handle this, it has to do two things: it should be able to continue talking when it receives these signals, whether they come in a gap or in complete overlap, and it should be able to detect such speech while it is still producing one of its own utterances.
0:02:00 We have been talking a lot about response times here. This is the corpus we used: the classical Edinburgh Map Task corpus, and we used the face-to-face dialogues. The task is a map task, where one participant explains a route to another. The corpus comes with provided dialogue move annotations, among them the acknowledgement move, which is mostly realized as things like "uh-huh" and "yeah".
0:02:26 A talkspurt is defined here as a stretch of voice activity with a minimum duration of fifty milliseconds, separated from other talkspurts by pauses of at least two hundred milliseconds. This makes the provided segmentation more perceptually relevant and more closely resembles an online condition.
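As a rough illustration of the talkspurt definition above, the following sketch (in Python; the frame length, variable names and helper function are my own assumptions, not the authors' code) extracts talkspurts from a frame-level voice activity sequence using the 50 ms minimum speech and 200 ms minimum pause thresholds mentioned in the talk.

FRAME_MS = 10          # assumed analysis frame step
MIN_SPEECH_MS = 50     # minimum voice activity duration (from the talk)
MIN_PAUSE_MS = 200     # minimum pause separating two talkspurts (from the talk)

def talkspurts(vad):
    """vad: list of 0/1 voice activity decisions, one per frame.
    Returns (start_ms, end_ms) tuples, one per talkspurt."""
    # collect raw speech runs
    runs, start = [], None
    for i, v in enumerate(list(vad) + [0]):
        if v and start is None:
            start = i
        elif not v and start is not None:
            runs.append((start * FRAME_MS, i * FRAME_MS))
            start = None
    # merge runs separated by pauses shorter than the pause threshold
    merged = []
    for s, e in runs:
        if merged and s - merged[-1][1] < MIN_PAUSE_MS:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    # discard segments shorter than the minimum speech duration
    return [(s, e) for s, e in merged if e - s >= MIN_SPEECH_MS]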
0:02:48 These are the twenty most frequent words in the acknowledgement moves, according to the word transcriptions. You can see that the top five include "right", "okay", "uh-huh" and "yeah". So these might actually be detectable by their lexical content.
0:03:05 So how do these occur in overlap? Well, I took the corpus and binned it into ten-millisecond frames. Given that a frame is in non-overlap, there is about a five percent probability that it is an acknowledgement move, while given that it is in overlap, the probability of an acknowledgement is considerably higher. So acknowledgements seem to be more common in overlap. What is going on here?
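A minimal sketch of the kind of frame-level count implied here, assuming a hypothetical list of (in_overlap, is_acknowledgement) flags for the 10 ms frames; this is my own illustration, not the authors' analysis script.

def ack_probabilities(frames):
    """frames: iterable of (in_overlap, is_acknowledgement) booleans,
    one pair per 10 ms frame. Returns P(ack | overlap), P(ack | non-overlap)."""
    in_ov = [ack for ov, ack in frames if ov]
    out_ov = [ack for ov, ack in frames if not ov]
    p_ov = sum(in_ov) / len(in_ov) if in_ov else 0.0
    p_non = sum(out_ov) / len(out_ov) if out_ov else 0.0
    return p_ov, p_non

# The observation in the talk corresponds to p_ov coming out clearly higher
# than p_non, i.e. acknowledgement frames are over-represented in overlap.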
0:03:28 I tried to dig a bit deeper by computing the between-speaker intervals, which are defined by the partial overlaps and the gaps. What we are actually going for as the target is the inclusion assumption for the acknowledgement moves, but the other conditions were computed in order to have a reference to compare with: that is, the intervals in the context of an exclusion model, and both excluding and including the extra-linguistic types of sounds.
0:04:00 So this is the graph, from a publication currently in press. As you can see here, if you include these extra-linguistic tokens you get much more overlap, which is the negative side of the scale in this graph.
0:04:19 Whereas if you compute it in the context of the exclusion model, there is not much difference; and if you compute it under the inclusion assumption for the acknowledgement moves, you get slightly more overlap, as you can see here in the left part of the figure. So what does this mean? Well, it seems that conversational turn-taking is tightly coordinated, and that overlaps are mostly due to interjections in complete overlap.
0:04:47 But what we actually want to do here is this: for both interjections in complete overlap, which are mostly acknowledgement moves, and interjections into silence, we need to classify incoming speech as acknowledgement move or not, and to do so as early as possible.
0:05:04 So I came up with this concept that I call maximum latency classification. It is quite simple, actually: it is just the usual talkspurt segmentation, with a minimum speech activity threshold and a minimum pause duration threshold, where you make the decision at the end of the talkspurt. However, if the talkspurt grows longer than the maximum latency, you make the decision at that time instead, in order to minimize the response time.
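The decision-timing rule, as I understand it from the description above, can be sketched like this (the function name, the inclusion of the pause threshold in the short-talkspurt case, and the millisecond bookkeeping are my assumptions):

def decision_time_ms(talkspurt_ms, max_latency_ms, min_pause_ms=200):
    """Time after talkspurt onset at which the classifier commits to a
    decision under maximum latency classification."""
    if talkspurt_ms + min_pause_ms <= max_latency_ms:
        # short talkspurt: behave like ordinary segmentation and decide
        # once the end-of-talkspurt pause has been confirmed
        return talkspurt_ms + min_pause_ms
    # long talkspurt: do not wait for the end, decide at the latency bound
    return max_latency_ms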
0:05:42 So how were the settings chosen for the maximum latency? Well, these are the durations of the two kinds of talkspurts, the acknowledgement moves versus the other ones, and you can see here that the acknowledgements are much shorter. So if you want to use duration as a feature for classification, you basically have a trade-off: the longer you wait, the more salient duration becomes as a feature, but the less responsive the system will be. So I tried two hundred, three hundred and five hundred milliseconds, just to see what we would get.
0:06:23 For the acoustic detector we use this kind of parameterization. It is a length-invariant parameterization, which is basically a type of DCT: smooth cosine basis functions whose wavelengths are divided by the length of the talkspurt. This is quite useful, because the basis functions are periodic, which gives a good interpolation at the syllabic level. Length invariance gives invariance to duration, or speaking rate; you can separate these out and give them to the classifier as separate features. If the zeroth coefficient is set equal to zero, or omitted, you only parameterize the relative shape of the feature trajectory. This is useful if you want to model F0, which has a speaker-dependent bias, and intensity, where the distance to the microphone and the channel introduce a bias.
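A minimal sketch of a length-invariant, DCT-style parameterization as I read the description above; the use of scipy's DCT, the number of coefficients, and the keep_bias switch are my own assumptions rather than the exact parameterization used in the paper.

import numpy as np
from scipy.fftpack import dct

def shape_coefficients(trajectory, n_coef=4, keep_bias=False):
    """trajectory: 1-D array of e.g. F0, intensity or one MFCC over the
    talkspurt. The cosine basis is stretched over the whole segment, so
    the coefficients describe shape independently of duration; dropping
    coefficient 0 removes the mean level (speaker or channel bias)."""
    c = dct(np.asarray(trajectory, dtype=float), type=2, norm='ortho')
    start = 0 if keep_bias else 1
    return c[start:start + n_coef]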
0:07:18 And this is the classifier setup. We use the relative shape of the F0 envelope, both the absolute and relative shapes of the intensity, and both the absolute and relative shapes of the MFCCs, because we want to see how this affects what we can get out of them. For duration, we use the full talkspurt duration for training, while for testing we cap it at the maximum latency. Then we add spectral flux, without much motivation. The classifier is a support vector machine with an RBF kernel.
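A sketch of the classifier stage using scikit-learn, under the assumption that the shape coefficients and the (latency-capped) duration are simply concatenated into one feature vector; the pipeline, the scaling step and the MAX_LATENCY_MS constant are my additions, while the SVM with an RBF kernel is what the talk names.

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

MAX_LATENCY_MS = 500  # one of the tried latencies; chosen here as an example

def duration_feature(duration_ms, training):
    # full duration when training, capped at the maximum latency when testing
    return duration_ms if training else min(duration_ms, MAX_LATENCY_MS)

clf = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
# X rows: [F0/intensity/MFCC shape coefficients..., spectral flux, duration]
# y: 1 for acknowledgement move, 0 for other incoming speech
# clf.fit(X_train, y_train); predictions = clf.predict(X_test)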
0:07:53 Here are the results per feature. As you can see, it seems that the F0 envelopes are the weakest feature, followed by intensity and spectral flux, while the MFCCs are the strongest ones, and that is when we omit the zeroth cepstral coefficient, which means that we are really only modeling the shape of the spectral trajectories with these features. For duration, you get almost nothing at the shorter latencies, but if you wait longer, at five hundred milliseconds it becomes the second most salient feature.
0:08:31 So, sorry, yes: we then decided to exclude the zeroth coefficients and also the F0 envelopes, since in this case these were the weakest in the feature combination.
0:08:45 These are the overall results. We tried two conditions here: the offline condition uses the provided segmentation, while the online condition uses an energy-based threshold voice activity detector, because we were a little bit worried about how sensitive this time-warping parameterization would be to that kind of voice activity detection. It turned out that it was not that sensitive, as you can see. The longer you wait, the better the classification gets, and it is actually possible to get quite decent classification already at a few hundred milliseconds, which is quite encouraging, I think, and that was surprising.
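For the online condition, an energy-threshold voice activity detector of the kind mentioned could look roughly like this (the frame length, the percentile-based noise floor and the 12 dB margin are all assumptions for illustration, not the detector used in the paper):

import numpy as np

def energy_vad(samples, sample_rate, frame_ms=10, margin_db=12.0):
    """samples: 1-D float array. Returns one 0/1 decision per frame:
    1 where the frame energy exceeds an estimated noise floor plus a margin."""
    hop = int(sample_rate * frame_ms / 1000)
    frames = [samples[i:i + hop] for i in range(0, len(samples) - hop + 1, hop)]
    energy_db = np.array([10 * np.log10(np.mean(f ** 2) + 1e-12) for f in frames])
    floor_db = np.percentile(energy_db, 10)   # assume the quietest 10% is noise
    return (energy_db > floor_db + margin_db).astype(int)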
0:09:31 So, to conclude: duration and the MFCCs seem to be the most salient features here. If you want to integrate this kind of classifier into an incremental dialogue framework, assuming a framework that can handle multiple ongoing plans, my suggestion is to run several classifiers in parallel: perhaps the first one prepares decisions at two hundred milliseconds, and the next ones execute them at three and five hundred milliseconds. The actual implementation of this, though, is left outside the scope of this talk.
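Since the integration is only a suggestion here, this is just a sketch of how classifiers trained for different maximum latencies might be run in parallel inside an incremental framework; the class, the method names and the tentative/firm distinction are my own assumptions.

LATENCIES_MS = (200, 300, 500)   # the latency settings discussed in the talk

class IncrementalAckDetector:
    def __init__(self, classifiers):
        # classifiers: dict mapping latency in ms -> trained classifier
        self.classifiers = classifiers
        self.hypotheses = {}

    def on_latency_reached(self, latency_ms, features):
        """Called by the dialogue framework once the incoming talkspurt has
        lasted latency_ms; earlier calls prepare tentative decisions, later
        calls confirm or revise them before the system commits."""
        label = self.classifiers[latency_ms].predict([features])[0]
        self.hypotheses[latency_ms] = label
        return label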
0:10:12 That is all I have to say; thank you, and I will take any questions.
0:10:16 Thank you very much. Are there any questions?
0:10:32 Do you have any particular explanation for these features being able to do what they do here, given the results you get?
0:10:40 No, not really. It seems that the annotators basically relied a lot on the lexical content, so I cannot give a particular reason. But it is not just that either, because if you set the zeroth coefficients to zero you remove the part that actually holds the information about the formants; once you omit that you lose it, and what is left is only the shape of the trajectories, which you would expect to be different. So it seems to be more something about voice quality; I cannot really explain exactly why it works.
0:11:27 Are there any more questions? Okay then, thank you again.
0:11:35 Thank you.