Speech transcript - PROCESSING ‘YUP!’ AND OTHER SHORT UTTERANCES IN INTERACTIVE SPEECH

0:00:13	OK, I'm John Kane from Trinity College Dublin
0:00:16	and, um, I'll be presenting our preliminary work on processing 'yup'
0:00:20	and other short utterances in interactive speech
0:00:22	so I'll just
0:00:23	start by saying that this is primarily the work of
0:00:26	Professor Campbell, who couldn't be here,
0:00:29	but also that the work was with Helena Moniz, who visited us in Trinity for, um,
0:00:34	for a month last summer to work on the corpus
0:00:38	so we're interested in spoken conversation: we're interested in describing, characterising, modeling and ultimately synthesizing spoken
0:00:46	conversation
0:00:47	and probably the most striking aspect of spoken conversation is that it's massively interactive
0:00:53	so you have two people, say, having a conversation
0:00:55	it could be that one person is doing most of the talking, which happens quite a lot
0:00:59	and
0:01:00	perhaps it's
0:01:01	somebody talking about an issue they had at work
0:01:03	to a friend
0:01:05	even though one is doing most
0:01:07	of the talking
0:01:08	the interaction is still massively interactive and, um,
0:01:12	the other participant is
0:01:13	providing constant feedback to the main speaker
0:01:16	about whether they understand, whether
0:01:18	something needs to be repeated, or that they need to go faster or provide background information,
0:01:22	this sort of thing
0:01:24	and
0:01:25	spoken conversation is
0:01:26	in stark contrast to the type of monologue-like speech you get in
0:01:29	news broadcasts, lectures or, well, talks like this
0:01:33	and in spoken interaction meaning is built up
0:01:35	uh, adaptively, in clusters
0:01:38	so rather than being a kind of linear transfer of information
0:01:41	it's
0:01:41	rather bidirectional: it goes forward and back and forth, and even at the one time
0:01:46	if you look at, uh, any spoken interaction, what is
0:01:50	very apparent from it is that there's a very high frequency of short utterances
0:01:54	and
0:01:54	short utterances have, um,
0:01:57	a function in spoken conversation which is disproportionate to their length
0:02:00	they're
0:02:01	very, very useful and very important in managing spoken discourse
0:02:06	you have here, um,
0:02:07	a graph of a telephone conversation
0:02:09	where each of the panels
0:02:11	um, corresponds to each speaker
0:02:13	and the x-axis is time and the y-axis is,
0:02:16	well, speech density in ten-second frames
0:02:20	and so speech density here is
0:02:22	a measure of talking time per frame
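
The speech density measure described here (talking time per ten-second frame) can be sketched as follows. This is only an illustration: the function name, the interval format and the example timings are assumptions, not taken from the talk, and the utterance intervals of one speaker are assumed not to overlap each other.

```python
def speech_density(intervals, total_duration, frame_len=10.0):
    """Fraction of each frame that one speaker spends talking.

    intervals: list of (start, end) times in seconds for that speaker's
    utterances (hypothetical input format, assumed non-overlapping).
    """
    n_frames = int(-(-total_duration // frame_len))  # ceiling division
    density = [0.0] * n_frames
    for start, end in intervals:
        first = int(start // frame_len)
        last = min(int(end // frame_len), n_frames - 1)
        for k in range(first, last + 1):
            frame_start = k * frame_len
            frame_end = frame_start + frame_len
            # Part of this utterance that falls inside frame k
            overlap = min(end, frame_end) - max(start, frame_start)
            if overlap > 0:
                density[k] += overlap / frame_len
    return density


# Made-up example: three utterances in a 20-second recording
example = speech_density([(1.0, 4.0), (8.0, 12.0), (15.0, 15.5)], 20.0)
print(example)
```

Plotting one such list per speaker against time would give panels of the kind the speaker describes.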
0:02:26	and
0:02:27	from this you can see that in this conversation one speaker is
0:02:30	doing most of the talking, and you can see a high frequency
0:02:34	of both long and short utterances here
0:02:38	the second speaker, even though they're not
0:02:41	using as many long utterances, is still very active in terms of the amount of short utterances
0:02:46	they use
0:02:47	so both conversation partners are highly active; the number of short utterances
0:02:51	is extremely high; if you look at the transcription of these utterances
0:02:54	the linguistic content
0:02:56	is
0:02:56	really very repetitive and
0:02:58	the variation is only,
0:03:00	quite minimal
0:03:02	so of course when we're engaging in spoken conversation we're not using linguistic content as a tool here
0:03:06	and we have to use some other tool
0:03:09	to provide differentiation to the speaker
0:03:11	and hence the importance of
0:03:13	prosody and voice quality or vocal timbre
0:03:17	so just to illustrate this a little bit I'm gonna play
0:03:20	uh, a sequence of short utterances from a single male speaker from the D64 corpus
0:03:26	which I'll describe in a few minutes
0:03:28	um, so I'll just play it first
0:03:33	[audio: sequence of short utterances played]
0:03:58	okay, um
0:03:59	that was just a short illustration, but, uh, I'm sure just by listening to them
0:04:03	you can hear that
0:04:05	in spoken conversation the same linguistic units have
0:04:09	very different pragmatic functions in the discourse
0:04:12	and the element that we believe
0:04:15	provides this differentiation, well, a lot of it, is
0:04:18	the prosody and voice quality, which I think you could hear in some of those short utterances
0:04:23	one of the
0:04:24	corpora that Professor Campbell worked on was the Expressive Speech Processing corpus, which was one thousand five hundred hours of
0:04:30	interactive speech
0:04:31	recorded in Japan between two thousand and two thousand and five or six
0:04:35	and, um,
0:04:36	in that corpus
0:04:39	single words made up more than half of the total utterance count
0:04:43	these single words came in a wide range of prosodic conditions
0:04:47	carrying entirely different messages
0:04:49	one of the, um, examples that Professor Campbell sometimes gives
0:04:53	is the word 'honma'
0:04:54	which is an Osaka dialect word which roughly translates as 'really'
0:04:59	in English
0:05:00	and in the study of the corpus
0:05:02	um
0:05:03	it has at least twenty different
0:05:07	pragmatic functions, that single word
0:05:10	and again prosody and voice quality are essential to provide the differentiation in
0:05:14	spoken conversation
0:05:16	just, um, a final graph just to, um,
0:05:21	illustrate the frequency of short utterances
0:05:24	also in larger multiparty conversations
0:05:27	this graph here is from the, uh, FreeTalk corpus
0:05:30	so
0:05:31	there are five speakers involved in the conversation
0:05:35	each of the different colours represents a different speaker and the length of the bar represents the length of the utterance
0:05:40	and again if you look at this there is a high frequency of short utterances, sometimes by a single speaker
0:05:45	and sometimes
0:05:47	by more than one at the same time
0:05:51	okay, so this brings us on to the corpus and the current study, so, um, in
0:05:57	two thousand and nine the D64 corpus was recorded in, um,
0:06:01	in an apartment in Dublin
0:06:04	and the goal of the corpus was to, um, richly
0:06:08	record
0:06:10	highly naturalistic spoken conversation
0:06:14	um
0:06:15	so there were twelve audio lines, five video cameras, two three-sixty-degree videos and six
0:06:20	OptiTrack motion capture
0:06:22	and there were five participants
0:06:23	three male and two female
0:06:25	the social interaction was completely unstructured, non-scripted, and
0:06:28	there was no particular conversation goal
0:06:30	and
0:06:31	for this reason the topics
0:06:32	varied, um,
0:06:34	very widely
0:06:35	so there were four sessions over two days; in the current study we look at the first two sessions
0:06:40	the first session, I don't think it was even meant to be recorded
0:06:44	and the
0:06:44	the three male speakers in the room at the time
0:06:47	all had,
0:06:48	um, had headset mics on
0:06:51	and, um,
0:06:52	that was only a short time before the other two female participants arrived
0:06:56	and because of
0:06:58	problems with Cubase and other technical issues
0:07:01	it turned out to be a tense
0:07:03	and stressful environment, and this is very apparent from the speech data when you listen to it afterwards
0:07:08	the second session was, um,
0:07:10	when the two female participants arrived; it was, um,
0:07:13	much more relaxed
0:07:15	people were sitting and drinking cups of coffee, talking a little bit about themselves
0:07:20	and
0:07:20	so the two sessions are starkly contrasting
0:07:23	in this regard
0:07:25	and I should state that only one of the female speakers is included in the analysis
0:07:30	in the current work
0:07:32	so while Helena Moniz was over with us
0:07:34	last summer, um,
0:07:36	she annotated,
0:07:38	she used twelve annotation labels for the short utterances
0:07:42	in these two sessions
0:07:43	so I'm not gonna go through all of them but just highlight the most frequent ones
0:07:47	so backchannels are clearly the, um,
0:07:50	the most frequent; so backchannels are the kind of feedback that a speaker might
0:07:55	be given, like 'yeah', 'okay', 'right'
0:07:58	uh, also very common are filled pauses
0:08:00	so like 'um', 'uh',
0:08:02	like
0:08:03	these sort of things
0:08:04	and also, apparently, interjections and repetitions were quite frequent
0:08:09	we carried out some prosodic analysis on these
0:08:12	short utterances
0:08:13	and we measured fundamental frequency mean, max and
0:08:18	position of the peak as a percentage,
0:08:21	the location of the peak in the utterance,
0:08:23	and for power the same measures
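
A minimal sketch of the fundamental frequency summary mentioned here (mean, maximum, and position of the peak as a percentage of the utterance). The input format, a per-frame F0 track with 0 or None for unvoiced frames, and the function name are assumptions for illustration; the talk does not say how F0 was extracted.

```python
def f0_summary(f0_track):
    """Summarise a per-frame F0 track in Hz (0 or None marks unvoiced frames)."""
    voiced = [(i, f) for i, f in enumerate(f0_track) if f]
    if not voiced:
        return None  # no voiced frames, nothing to measure
    values = [f for _, f in voiced]
    peak_index, peak_value = max(voiced, key=lambda pair: pair[1])
    # Peak position expressed as a percentage of the utterance length
    denom = max(len(f0_track) - 1, 1)
    return {
        "mean": sum(values) / len(values),
        "max": peak_value,
        "peak_position_pct": 100.0 * peak_index / denom,
    }


# Made-up F0 track for one short utterance
track = [0, 100, 120, 180, 150, 0, 0, 110, 0, 0, 0]
print(f0_summary(track))
```

The same three statistics could be computed over a per-frame power track instead of F0.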
0:08:24	we used two
0:08:26	pretty crude voice quality measures: the difference between the first two harmonics of the speech spectrum
0:08:30	and the difference between the first
0:08:32	harmonic and the harmonic
0:08:33	in the third formant region
0:08:36	and we also measured duration
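
The first of these measures, the level difference between the first two harmonics (often written H1-H2), can be sketched as below. This is an illustration under stated assumptions: F0 is taken as already known, the harmonic amplitudes are read directly off the signal by correlation with a sinusoid, and no formant correction is applied. The function names and the synthetic test signal are not from the talk.

```python
import math


def harmonic_amplitude(signal, sample_rate, freq):
    """Amplitude of the sinusoidal component at `freq`, by direct correlation."""
    n = len(signal)
    re = sum(x * math.cos(2 * math.pi * freq * i / sample_rate)
             for i, x in enumerate(signal))
    im = sum(x * math.sin(2 * math.pi * freq * i / sample_rate)
             for i, x in enumerate(signal))
    return 2.0 * math.hypot(re, im) / n


def h1_minus_h2(signal, sample_rate, f0):
    """Difference in dB between the first and second harmonic amplitudes."""
    h1 = harmonic_amplitude(signal, sample_rate, f0)
    h2 = harmonic_amplitude(signal, sample_rate, 2 * f0)
    return 20.0 * math.log10(h1 / h2)


# Synthetic frame: 10 full periods at 100 Hz, second harmonic at half amplitude
sr, f0 = 8000, 100.0
frame = [math.sin(2 * math.pi * f0 * i / sr)
         + 0.5 * math.sin(2 * math.pi * 2 * f0 * i / sr)
         for i in range(800)]
print(round(h1_minus_h2(frame, sr, f0), 2))  # about 6.02 dB
```

The frame length is chosen to hold a whole number of pitch periods so that the two harmonics do not leak into each other; real speech would normally need windowing.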
0:08:39	we then carried out principal component analysis on that, which showed
0:08:42	the first loading to be dominated by power values, the second to be dominated by F0 values and the third to be
0:08:47	dominated by both voice quality and duration values
0:08:50	so this kind of suggested to us
0:08:52	uh, the independence of these,
0:08:54	of these groups
0:08:55	and the first five loadings accounted for seventy percent of the variance
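
The principal component analysis step can be sketched in plain Python as below, to show how loadings and the share of variance they account for are obtained. Everything here is an assumption for illustration: the standardisation of features, the power-iteration solver and the toy three-feature data are not from the talk, and a real analysis would more likely use a numerical library.

```python
import math


def standardize(rows):
    """Z-score each feature column."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]


def covariance(rows):
    """Covariance matrix of zero-mean rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_components(cov, k, iters=500):
    """First k (explained variance ratio, loading vector) pairs,
    found by power iteration with deflation."""
    d = len(cov)
    cov = [row[:] for row in cov]
    total = sum(cov[i][i] for i in range(d))
    results = []
    for c in range(k):
        v = [1.0 / (i + 1 + c) for i in range(d)]  # arbitrary non-zero start
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        eigval = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
        results.append((eigval / total, v))
        # Remove the component just found before looking for the next one
        for i in range(d):
            for j in range(d):
                cov[i][j] -= eigval * v[i] * v[j]
    return results


# Toy data: features 0 and 1 move together, feature 2 is unrelated to them
pattern = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0]
rows = [[float(t), 2.0 * t, pattern[t]] for t in range(8)]
components = top_components(covariance(standardize(rows)), k=2)
for ratio, loading in components:
    print(round(ratio, 3), [round(x, 3) for x in loading])
```

In the toy data the first loading is shared by the two correlated features and the second is carried by the unrelated one, which is the kind of grouping the speaker reads as independence between feature groups.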
0:09:00	we wanted to look further at the voice quality involved in this
0:09:03	and so we wanted to look at voice qualities across the breathy-to-tense
0:09:06	dimension
0:09:08	so phonation mode, or mode of vocal fold vibration, is
0:09:12	uh, critical to these voice qualities
0:09:15	and what's shown here is
0:09:16	a kind of image of the vocal folds of the larynx taken from above
0:09:20	and the three main muscular tensions are, um, highlighted
0:09:24	so with breathy voice quality, when the vocal folds are vibrating you have these low levels of tension,
0:09:29	low adductive tension
0:09:30	so this
0:09:31	means that there's not full closure
0:09:34	and you get this,
0:09:35	you get this chink at the posterior end of the vocal folds, that lies at the arytenoids there
0:09:40	and air passes through the vocal folds here, and this is the main contributor to the
0:09:45	sort of breathy perceptual quality
0:09:47	with tense voice quality, at the other end of the spectrum,
0:09:50	you've,
0:09:51	yeah, high levels of the three main laryngeal tensions
0:09:54	producing, uh, a tenser voice quality; so we wanted to use some acoustic measures to
0:09:58	measure these
0:10:00	physiological correlates
0:10:01	we used a three-step method: first we
0:10:04	measured glottal closure instants
0:10:07	using, uh,
0:10:09	the DYPSA method; so glottal closure instants
0:10:11	correspond to the moments where the vocal folds, uh, come together
0:10:16	we then used, um,
0:10:18	an inverse filtering method; so inverse filtering is basically to remove the contribution of the vocal tract
0:10:23	from the speech signal, giving an estimate of the voice source signal, the signal created
0:10:28	by the vocal folds of the larynx
0:10:30	so we used the iterative adaptive inverse filtering method
0:10:33	described by Alku
0:10:34	I won't go through the block diagram, just
0:10:36	to say that the method attempts to compensate for the spectral roll-off of the voice source signal
0:10:41	and uses LPC analysis to try and get an all-pole model
0:10:44	of the vocal tract transfer function
0:10:47	this is done in a couple of iterations and, uh,
0:10:50	the output is the estimate of the voice source signal
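
The core idea of LPC-based inverse filtering can be sketched as below: fit an all-pole model to the signal, then apply the inverse (all-zero) filter to remove that model's contribution. This is a single-pass simplification and not the iterative adaptive method the speaker refers to, which adds a source roll-off compensation stage and repeats the estimate. The function names and the second-order test filter are assumptions for illustration.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation values r[0..max_lag] of a finite signal."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Coefficients of A(z) = 1 + a1 z^-1 + ... + ap z^-p
    from autocorrelation values (autocorrelation-method LPC)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a


def inverse_filter(x, a):
    """Apply A(z) as an FIR filter, removing the all-pole envelope."""
    return [sum(a[j] * x[n - j] for j in range(len(a)) if n - j >= 0)
            for n in range(len(x))]


# Test signal: impulse response of a known all-pole filter 1 / A(z)
true_a = [1.0, -1.3, 0.6]
x = []
for n in range(300):
    value = 1.0 if n == 0 else 0.0
    for j in (1, 2):
        if n - j >= 0:
            value -= true_a[j] * x[n - j]
    x.append(value)

estimated_a = levinson_durbin(autocorrelation(x, 2), 2)
residual = inverse_filter(x, estimated_a)
print([round(c, 4) for c in estimated_a])
print([round(v, 4) for v in residual[:4]])
```

For this synthetic case the inverse filter recovers the original excitation, a single impulse. With real speech the residual is only an estimate of the voice source, and its quality depends on the model order and on how the source's own spectral tilt is handled.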
0:10:53	um, so then with this output signal
0:10:57	we used these glottal gradients described by Lugger and Yang in two thousand and six, which is kind of
0:11:02	follow-on work from Stevens and Hanson
0:11:04	the reason for using these glottal gradients was
0:11:06	they were described in previous work to be, um,
0:11:09	to be useful even in less than ideal recording conditions, which will inevitably happen when you, um,
0:11:16	um, hmm, when you're dealing with
0:11:18	kind of interactive speech like this
0:11:20	so we chose the two glottal gradients that, from our own work on carefully controlled data,
0:11:25	showed the best differentiation of voice qualities across the breathy-to-tense dimension
0:11:30	just highlighted here are the two glottal gradients:
0:11:33	GOG, the glottal opening gradient, and RCG, the rate of closure gradient
0:11:38	and so, uh,
0:11:39	I won't describe them any further but just state that low levels of these two values
0:11:44	suggest
0:11:45	tenser voice qualities and
0:11:47	higher levels
0:11:48	suggest, um, breathier voice qualities
0:11:52	okay, so
0:11:54	we carried out this, uh, we analysed the short utterances using
0:11:59	this method
0:12:01	and I should have mentioned earlier that in the annotation there was also annotation of overlapping and non-
0:12:07	overlapping segments
0:12:08	so we found that, um,
0:12:10	RCG values are significantly lower
0:12:12	in overlapping speech, taking, um,
0:12:15	all four speakers
0:12:16	uh, and there were also lower GOG values
0:12:19	this trend was also seen in each of the speakers individually
0:12:23	um, for
0:12:24	comparing session one to session two we only used the three male speakers because one of the females wasn't present in
0:12:29	the first
0:12:30	we found lower RCG and lower GOG values
0:12:34	but when we looked at the speakers individually
0:12:37	two of the male speakers showed significantly lower RCG values, whereas
0:12:41	the other one of the male speakers showed significantly higher RCG values, so this was,
0:12:46	this was a little bit curious
0:12:49	so
0:12:49	how do we interpret this? well, we interpret this as a tenser overall voice quality in the first
0:12:53	session
0:12:54	and during overlapping speech; this seems reasonable,
0:12:57	and perhaps in overlapping speech a tenser voice quality could be a mechanism for competing for the turn
0:13:05	also, um, as I stated at the beginning, this was kind of a
0:13:08	more stressful first session
0:13:10	and this leads to an overall tenser
0:13:12	um, voice quality in the productions by the participants, but
0:13:16	when participant two showed the opposite trend across sessions, we spoke to him afterwards
0:13:20	and, um,
0:13:21	he actually described
0:13:23	the first session as, uh, an environment with more complicated equipment set-up, which some people can find
0:13:27	to be more controlled; he didn't
0:13:29	feel the stress that the others did
0:13:31	whereas
0:13:32	the kind of get-to-know-you session can actually be very socially awkward for some people, and
0:13:37	he admits to this in the second session
0:13:40	overall, the kind of take-home message is that short utterances vary substantially in terms of prosody and voice quality
0:13:46	in spoken conversation
0:13:48	traditional speech recognition systems
0:13:51	don't take account of these,
0:13:53	these aspects
0:13:54	of speech
0:13:55	and if we want to
0:13:57	properly model, uh,
0:14:00	the type of naturalistic speech we have in spoken conversation
0:14:03	then we feel that these aspects
0:14:05	need to be taken care of
0:14:07	so finally, just to state what we're doing with this: we currently have, um,
0:14:12	an exhibition in the Science Gallery in Trinity College
0:14:15	where Herme the robot, it's a Lego robot with,
0:14:18	with a
0:14:19	camera and audio recording
0:14:23	and, uh, basically it tracks people's faces as they walk around, then strikes up a conversation
0:14:27	so we're using this for data collection on spoken conversation and short utterances
0:14:31	and also as, uh, a platform for testing our hypotheses about
0:14:36	short utterances
0:14:38	and so
0:14:39	yeah, um,
0:14:40	Nick Campbell and I are funded by the SFI
0:14:44	and Helena Moniz was supported by an FCT grant
0:14:48	and that's
0:14:48	my stuff
0:14:54	so we can have
0:14:55	time for two questions
0:15:02	[audience question, largely unclear]
0:15:13	did you try to differentiate the different types that you have labeled?
0:15:18	no, we didn't do that; that was done in the annotation but we didn't match that
0:15:24	to the acoustics in the
0:15:25	description here
0:15:26	but that would be something that,
0:15:29	that
0:15:30	Professor Campbell,
0:15:31	yeah, I think he may work along those lines with
0:15:35	some of the measurements we used, but that didn't,
0:15:37	that wasn't,
0:15:38	yeah,
0:15:38	I don't have that
0:15:45	um
0:15:46	[audience question, largely unclear]
0:15:54	and I'm, as,
0:15:56	maybe [unclear] might know a little better than me on this but, um,
0:16:00	uh,
0:16:01	just
0:16:02	what I think is true is that Professor Campbell has said that
0:16:06	he does want people to be using it; I think he wants a system where
0:16:10	any annotation that people would do would be
0:16:12	contributed back into the overall project
0:16:15	but if you contact him at nick at TCD dot
0:16:19	he is definitely open to exchange
0:16:23	thank you


Audio/Visual Detection of Non-Linguistic Vocal Outbursts

Presenter: John Kane, Authors: Nick Campbell, John Kane, University of Dublin, Ireland; Helena Moniz, FLUL/INESC-ID, Ireland