0:00:13okay, I'm from Trinity College Dublin
0:00:16and, um, I'll be presenting our preliminary work on prosody
0:00:20and short utterances in speech
0:00:22so I'll just
0:00:23start by saying that this is primarily the work of
0:00:26Professor Campbell; he couldn't be here
0:00:29but it's also work with a colleague who visited us at Trinity, um
0:00:34last summer to work on the corpus
0:00:38so we're interested in spoken conversation; we're interested in describing, characterising, modelling and ultimately synthesising spoken
0:00:46conversation
0:00:47and probably the most striking aspect of spoken conversation is that it's massively interactive
0:00:53so you have two people, say, having a conversation
0:00:55it could be that one person is doing most of the talking, which happens quite a lot
0:00:59and
0:01:00perhaps it's
0:01:01somebody talking about an issue they had at work
0:01:03to a friend
0:01:05even though one is doing most
0:01:07of the talking
0:01:08the interaction is still massively interactive, and, um
0:01:12the other participant is
0:01:13providing constant feedback to the main speaker
0:01:16about whether they understand, whether
0:01:18something needs to be repeated, or they need to go faster, or provide background information
0:01:22this sort of thing
0:01:24and
0:01:25spoken conversation is
0:01:26in stark contrast to the type of monologue-like speech you get in
0:01:29news broadcasts, lectures or, well, talks like this
0:01:33and in spoken interaction meaning is built up
0:01:35uh, adaptively and collaboratively
0:01:38so rather than being a kind of linear transfer of information
0:01:41it's
0:01:41rather bidirectional; it goes back and forth, and at times even at the same time
0:01:46if you look at, uh, any spoken interaction, what is
0:01:50very apparent from it is that there's a very high frequency of short utterances
0:01:54and
0:01:54short utterances have, um
0:01:57a function in spoken conversation which is disproportionate to their length
0:02:00they're
0:02:01very, very useful and very important in managing spoken discourse
0:02:06you have here, um
0:02:07a graph of a telephone conversation
0:02:09so each of the panels
0:02:11um, corresponds to each speaker
0:02:13and the x-axis is time and the y-axis is
0:02:16well, speech density in ten-second frames
0:02:20and so speech density here is
0:02:22a measure of talking time per frame
0:02:26and
0:02:27from this, in this conversation, one speaker is
0:02:30doing most of the talking and you can see a high frequency
0:02:34of both long and short utterances here
0:02:38the second speaker, even though they're not
0:02:41using as many long utterances, they're still very active in terms of the amount of short utterances
0:02:46that they use
0:02:47so the partner is highly active; the number of short utterances
0:02:51is extremely high. if you look at the transcription of these utterances
0:02:54the linguistic content
0:02:56is
0:02:56really very repetitive and
0:02:58the variation is only
0:03:00quite minimal
0:03:02of course, when we're engaging in spoken conversation we're not using linguistic content as a tool
0:03:06and we have to use some other tool
0:03:09to provide differentiation to the speaker
0:03:11and hence the importance of
0:03:13prosody and voice quality, or vocal timbre
0:03:17so just to illustrate this a little bit, I'm going to play
0:03:20uh, a sequence of short utterances from a single male speaker from the D64 corpus
0:03:26which I'll describe in a few minutes
0:03:28um, so I'll just play it first
0:03:33[audio playback: a sequence of short utterances from a single male speaker]
0:03:58okay, um
0:03:59what I wanted to show from that is just, uh, I'm sure just by listening
0:04:03you can hear that
0:04:05in spoken conversation those same linguistic units have
0:04:09very different pragmatic functions in discourse
0:04:12and the element that we believe
0:04:15provides the differentiation, well, a lot of it, is
0:04:18the prosody and voice quality, which I think you could hear in some of those short utterances
0:04:23one of the
0:04:24corpora that Professor Campbell worked on was the Expressive Speech Processing corpus, which was one thousand five hundred hours of
0:04:30interactive speech
0:04:31recorded in Japan between two thousand and two thousand and five or six
0:04:35and, um
0:04:36the most common words made up
0:04:39like, single words made up more than half of the total utterance count
0:04:43these single words came in a wide range of prosodic conditions
0:04:47carrying entirely different meanings
0:04:49one of the, um, examples that Professor Campbell sometimes gives
0:04:53is the word "honma"
0:04:54which is an Osaka dialect word which roughly translates as "really"
0:04:59in English
0:05:00and he states that in the corpus
0:05:02um
0:05:03it has at least twenty different
0:05:07pragmatic functions, that single word
0:05:10and again prosody and voice quality are essential in providing the differentiation in
0:05:14spoken conversation
0:05:16just, um, a final graph, just to, um
0:05:21illustrate the frequency of short utterances
0:05:24also in large multiparty conversations
0:05:27a graph here from the, uh, Freetalk corpus
0:05:30so
0:05:31there are five speakers involved in the conversation
0:05:35each of the different colours represents a different speaker and the length of the bar represents the length of the utterance
0:05:40and again, if you look at this, there is a high frequency of short utterances, sometimes by a single speaker
0:05:45and sometimes
0:05:46and
0:05:47by more than one at the same time
0:05:51okay, so this brings us on to the corpus and the current study, so, um, in
0:05:57two thousand and [year unclear], the D64 corpus was recorded in, um
0:06:01in an apartment in Dublin
0:06:04and the goal of the corpus was to, um, richly
0:06:08record
0:06:10highly naturalistic spoken conversation
0:06:14um
0:06:15so there were twelve audio lines, five video cameras, two three-sixty-degree videos and six
0:06:20OptiTrack motion capture cameras
0:06:22and there were five participants
0:06:23three male and two female
0:06:25the social interaction was completely unstructured, non-scripted and
0:06:28there was no particular conversational goal
0:06:30and
0:06:31for this reason the topics
0:06:32varied, um
0:06:34very widely
0:06:35so there were four sessions over two days, and in the current study we look at the first two sessions
0:06:40the first session, I don't think it was even meant to be recorded
0:06:44and the
0:06:44the three male speakers in the room at the time
0:06:47well, had
0:06:48um, had headset mikes on
0:06:51and, um
0:06:52there was only a short amount of time before the other two female participants arrived
0:06:56and because of
0:06:58problems with Cubase and other technical issues
0:07:01it was kind of a tense
0:07:03and stressful environment, and this is very apparent from the speech data when you listen to it afterwards
0:07:08the second session was, um
0:07:10where the two female participants arrived; it was, um
0:07:13much more relaxed
0:07:15kind of, um, people sitting and drinking cups of coffee, talking a little bit about themselves
0:07:20and
0:07:20so the two sessions are starkly contrasting
0:07:23in this regard
0:07:25and I should state that only one of the female speakers is included in the analysis
0:07:30in the current work
0:07:32so our visiting colleague, who was over with us
0:07:34last summer, um
0:07:36she annotated, um, twelve
0:07:38she used twelve annotation labels for the short utterances
0:07:42in these two sessions
0:07:43so I'm not going to go through all of them but just highlight the most frequent ones
0:07:47so backchannels are clearly the, um
0:07:50the most frequent; backchannels are the kind of feedback that a speaker might
0:07:55be given, like "yeah", "okay", "right"
0:07:58also very common were filled pauses
0:08:00so like "um", "uh"
0:08:02like
0:08:03these sort of things
0:08:04and also interjections and repetitions were quite frequent
0:08:09we carried out some prosodic analysis on these short
0:08:12short utterances
0:08:13and we measured fundamental frequency mean, max and
0:08:18position of the peak as a percentage
0:08:21the location of the peak in the utterance
0:08:23and power, the same measures
0:08:24we used two
0:08:26fairly crude voice quality measures: the difference between the first two harmonics of the speech spectrum
0:08:30and the difference between the first
0:08:32harmonic and the harmonic
0:08:33in the third formant region
0:08:36and we also measured duration
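[Editor's sketch] Two of the measures listed above can be illustrated roughly as follows: the F0 summary (mean, maximum, and peak position as a percentage of the utterance) and the difference between the first two harmonics, often written H1-H2. This is a sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the F0 track is assumed to be already extracted with `None` for unvoiced frames, and the harmonic amplitudes are read from a clean synthetic signal with known F0 rather than from windowed real speech.

```python
import math


def f0_summary(f0_track):
    """Mean and max F0 plus the peak position as a percentage of the utterance."""
    voiced = [(i, f) for i, f in enumerate(f0_track) if f is not None and f > 0]
    if not voiced:
        return None
    values = [f for _, f in voiced]
    peak_index, peak_value = max(voiced, key=lambda pair: pair[1])
    last = len(f0_track) - 1
    position = 100.0 * peak_index / last if last > 0 else 0.0
    return {
        "mean": sum(values) / len(values),
        "max": peak_value,
        "peak_position_pct": position,
    }


def harmonic_amplitude(samples, sample_rate, freq):
    """Amplitude of one sinusoidal component via a single-frequency DFT."""
    re = sum(x * math.cos(2 * math.pi * freq * n / sample_rate)
             for n, x in enumerate(samples))
    im = sum(x * math.sin(2 * math.pi * freq * n / sample_rate)
             for n, x in enumerate(samples))
    return 2.0 * math.hypot(re, im) / len(samples)


def h1_minus_h2(samples, sample_rate, f0):
    """Level difference in dB between the first and second harmonics."""
    h1 = harmonic_amplitude(samples, sample_rate, f0)
    h2 = harmonic_amplitude(samples, sample_rate, 2 * f0)
    return 20.0 * math.log10(h1 / h2)


# Synthetic test signal: 100 Hz fundamental, second harmonic at half amplitude.
sample_rate = 16000
f0 = 100.0
samples = [
    math.sin(2 * math.pi * f0 * n / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 2 * f0 * n / sample_rate)
    for n in range(1600)
]
summary = f0_summary([None, 110.0, 120.0, 150.0, 130.0, None])
difference_db = h1_minus_h2(samples, sample_rate, f0)
```

On real recordings the F0 estimate and the analysis window would both affect the harmonic amplitudes, which is one reason the speaker calls these measures crude.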
0:08:39we then carried out principal component analysis on that, which showed
0:08:42the first loading to be dominated by power values, the second to be dominated by F zero values and the third to be
0:08:47dominated by both voice quality and duration values
0:08:50so this kind of suggested, um
0:08:52uh, the independence of these
0:08:54of these groups
0:08:55and the first five loadings accounted for seventy percent of the variance
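[Editor's sketch] The principal component analysis step can be illustrated roughly as follows, assuming NumPy is available. The feature matrix here is random synthetic data standing in for the per-utterance measures (power, F0, voice quality, duration), so the numbers it produces say nothing about the study's results; the function name `explained_variance_ratio` is hypothetical.

```python
import numpy as np


def explained_variance_ratio(features):
    """PCA via SVD on z-scored features.

    Returns the share of variance per component and the component loadings.
    """
    x = np.asarray(features, dtype=float)
    x = (x - x.mean(axis=0)) / x.std(axis=0)  # z-score each feature column
    _, singular_values, components = np.linalg.svd(x, full_matrices=False)
    variances = singular_values ** 2
    return variances / variances.sum(), components


# Synthetic stand-in: 200 utterances, 8 acoustic features.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 8))
features[:, 1] += features[:, 0]  # make two features correlated

ratio, components = explained_variance_ratio(features)
cumulative = np.cumsum(ratio)
# Number of components needed to reach 70 percent of the variance.
n_needed = int(np.searchsorted(cumulative, 0.70) + 1)
```

Each row of `components` holds the loadings of one component, so inspecting which features carry the largest absolute loadings is how one would see a component being "dominated by" power or F0 values.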
0:09:00we wanted to look further at the voice quality involved in this
0:09:03and so we wanted to look at voice qualities across breathy to tense
0:09:06and, um
0:09:08so, uh, phonation mode, or mode of vocal fold vibration, is
0:09:12uh, critical to these voice qualities
0:09:15and I've shown here
0:09:16a kind of image of the vocal folds of the larynx taken from above
0:09:20and the three main muscular tensions are, um, highlighted
0:09:24so with breathy voice quality, when the vocal folds are vibrating you have these low levels of tension
0:09:29low adductive tension
0:09:30so this
0:09:31means that there's not full glottal closure
0:09:34and you get this
0:09:35and you get this chink at the posterior end of the vocal folds that lets air through there
0:09:40and that air passing through the vocal folds, this is the main contributor to the
0:09:45sort of breathy perceptual quality
0:09:47with tense voice quality, at the other end of the spectrum
0:09:50you've
0:09:51high levels of the three main muscular tensions
0:09:54producing a, uh, tenser voice quality. so we wanted to use some acoustic measures to
0:09:58measure these
0:10:00physiological correlates
0:10:01we used a three-step method. first we
0:10:04measured glottal closure instants
0:10:06using
0:10:07using, uh
0:10:09the dsps S S method; so glottal closure instants
0:10:11correspond to the moments where the vocal folds, uh, come together
0:10:16we used, um
0:10:18an inverse filtering method; so inverse filtering is basically to remove the contribution of the vocal tract
0:10:23from the speech signal, giving an estimate of the voice source signal, the, uh, signal that was created
0:10:28by the vocal folds of the larynx
0:10:30so we used the iterative adaptive inverse filtering method
0:10:33described by Alku
0:10:34I won't go through the block diagram, just
0:10:36that the method attempts to compensate for the spectral roll-off of the voice source signal
0:10:41and uses LPC analysis to try and get an all-pole model
0:10:44of the vocal tract transfer function
0:10:47this is done in a couple of iterations and, uh
0:10:50the output is the estimate of the voice source signal
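[Editor's sketch] The core idea of inverse filtering, fitting an all-pole (LPC) model and then filtering the signal with its inverse, can be illustrated roughly as follows. This is deliberately a single LPC pass on a synthetic signal, not the iterative adaptive method the speaker refers to, and it omits the compensation for source spectral roll-off. NumPy is assumed; the function names and the two-pole "vocal tract" are hypothetical.

```python
import numpy as np


def lpc_coefficients(signal, order):
    """Autocorrelation-method LPC: a[k] such that s[n] ~ sum_k a[k] * s[n-1-k]."""
    s = np.asarray(signal, dtype=float)
    r = np.array([np.dot(s[: len(s) - k], s[k:]) for k in range(order + 1)])
    # Normal equations with a Toeplitz autocorrelation matrix.
    toeplitz = np.array(
        [[r[abs(i - j)] for j in range(order)] for i in range(order)]
    )
    return np.linalg.solve(toeplitz, r[1:])


def inverse_filter(signal, coeffs):
    """Apply the inverse all-pole filter: e[n] = s[n] - sum_k a[k] * s[n-1-k]."""
    s = np.asarray(signal, dtype=float)
    fir = np.concatenate(([1.0], -np.asarray(coeffs, dtype=float)))
    return np.convolve(s, fir)[: len(s)]


# Synthetic "speech": white-noise excitation through a known two-pole filter.
rng = np.random.default_rng(0)
excitation = rng.normal(size=4000)
true_coeffs = np.array([1.3, -0.6])
speech_like = np.zeros_like(excitation)
for n in range(len(excitation)):
    past = sum(
        true_coeffs[k] * speech_like[n - 1 - k]
        for k in range(2)
        if n - 1 - k >= 0
    )
    speech_like[n] = excitation[n] + past

estimated = lpc_coefficients(speech_like, order=2)
residual = inverse_filter(speech_like, estimated)
```

Here the residual approximates the excitation because the filter is known to be all-pole; with real speech the residual is only an estimate of the voice source, which is why the method described in the talk iterates.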
0:10:53um, so then with this output signal we
0:10:57used these glottal gradients described by Lugger and Yang in two thousand and six, which is kind of
0:11:02follow-on work from Stevens and Hanson
0:11:04the reason for using these glottal gradients was
0:11:06they were described in previous work to be, um
0:11:09to be useful even in less-than-ideal recording conditions, which will inevitably happen when you, um
0:11:16um, hmm, when you're dealing with
0:11:18kind of interactive speech like this
0:11:20so we chose the two glottal gradients that, from our own work on carefully controlled data
0:11:25showed the best differentiation of voice qualities across breathy to tense
0:11:30I've just highlighted here the two glottal gradients
0:11:33GOG, glottal opening gradient, and RCG, rate of closure gradient
0:11:38and so, uh
0:11:39I won't describe these in full, but just to state that low levels of these two values
0:11:44suggest
0:11:45tenser voice qualities and
0:11:47higher levels
0:11:48suggest, um, a breathier voice quality
0:11:52okay, so we carried out
0:11:54we carried out this, uh, we analysed the short utterances using
0:11:59this method
0:12:01and I should have mentioned earlier that in the annotation there was also annotation of overlapping and non-
0:12:07overlapping segments
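[Editor's sketch] The overlapping versus non-overlapping distinction can be illustrated roughly as follows. In the study this labelling was done by hand in the annotation; the sketch below only shows one way such a flag could be derived from utterance times, and the function name, tuple layout and example values are hypothetical.

```python
def label_overlaps(utterances):
    """Flag each utterance that overlaps in time with another speaker's utterance.

    utterances: list of (speaker, start, end) tuples, times in seconds.
    Returns a list of booleans in the same order.
    """
    flags = []
    for i, (speaker, start, end) in enumerate(utterances):
        overlapped = any(
            other_speaker != speaker and other_start < end and start < other_end
            for j, (other_speaker, other_start, other_end) in enumerate(utterances)
            if j != i
        )
        flags.append(overlapped)
    return flags


# Hypothetical example: B's short utterance falls inside A's first turn.
utterances = [
    ("A", 0.0, 2.0),
    ("B", 1.5, 1.9),
    ("A", 3.0, 4.0),
    ("C", 4.0, 4.5),
]
flags = label_overlaps(utterances)
```

Utterances that merely touch at a boundary, like the last two above, are treated as non-overlapping in this sketch.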
0:12:08so we find that, um
0:12:10RCG values are significantly lower in overlapping speech
0:12:12taking, uh, taking all, um
0:12:15all four speakers
0:12:16uh, and there were lower GOG values
0:12:19this trend was also seen in each of the speakers individually
0:12:23um, for
0:12:24comparing session one and session two we only used the three male speakers because one of the females wasn't present in
0:12:29in the first
0:12:30we found lower RCG and lower GOG values in the first session
0:12:34but when we looked at the speakers individually
0:12:37two of the male speakers showed significantly lower RCG values, whereas
0:12:41the other one of the male speakers showed significantly higher RCG values, so this was
0:12:46this was a little bit curious
0:12:49so
0:12:49how do we interpret this? well, we interpret this as a tenser overall voice quality in the first
0:12:53session
0:12:54and during overlapping speech; this is reasonably intuitive
0:12:57and perhaps in overlapping speech a tenser voice quality could be a mechanism for competing for the turn
0:13:05also, um, as I stated at the beginning, this kind of
0:13:08more stressful first session
0:13:10leads to an overall tenser
0:13:12um, voice production by the participants, but
0:13:16one participant showed the opposite trend across sessions; we spoke to him afterwards
0:13:20and, um
0:13:21he actually described
0:13:23the first session as an environment with computer equipment set up, which some people can find
0:13:27to be more controlled, and he didn't
0:13:29feel the stress that the others did
0:13:31whereas
0:13:32the getting-to-know-you session can actually be very socially awkward for some people, and he
0:13:37attributes this to the second session
0:13:40overall, the take-home message is just that short utterances vary substantially in terms of prosody and voice quality
0:13:46in spoken conversation
0:13:48traditional speech recognition systems
0:13:51don't take account of these
0:13:53these aspects
0:13:54of speech
0:13:55and if we want to
0:13:57properly model, uh
0:14:00the type of naturalistic speech we have in spoken conversation
0:14:03then we feel that these aspects
0:14:05need to be taken care of
0:14:07so just finally, just to state what we're doing with this: we currently have, um
0:14:12an exhibition at the Science Gallery in Trinity College
0:14:15where Herme the robot is, like, a robot with
0:14:18with a
0:14:19and bows and audio recordings
0:14:23and, uh, basically tracks people's faces as they walk towards it, then strikes up a conversation
0:14:27so we're using this for data collection on spoken conversation and short utterances
0:14:31and also as, uh, a platform for testing our hypotheses about
0:14:36short utterances
0:14:38and so
0:14:39yeah, um
0:14:40Nick Campbell and I have support from the SFI
0:14:44and they "'em" more it's was supported by F C T god
0:14:46let
0:14:48and that's
0:14:48i stuff
0:14:54so we have
0:14:55time for two questions
0:15:02maybe i thought
0:15:03all there it one
0:15:08i
0:15:08the on
0:15:10i K
0:15:10i i just my i would maybe i
0:15:13no, something like trying to differentiate the different types of short utterances that you have labelled
0:15:18we didn't do; well, that was done in the annotation, but we didn't match that
0:15:24to the acoustics in the
0:15:25description here
0:15:26but that would be something that
0:15:29that
0:15:30Professor Campbell
0:15:31yeah, I think he might have work along those lines with
0:15:35with some of the measurements we use, but that
0:15:37that wasn't
0:15:38yeah
0:15:38I don't have that
0:15:45um
0:15:46why you're my a very similar but
0:15:48a in the very fact one again what
0:15:50for
0:15:52was going from over that you that
0:15:54and i i'm as
0:15:56may may maybe be N Z might know little that better than me and this but um
0:16:00uh i i i i
0:16:01just
0:16:02what I think is true is that Professor Campbell has said that
0:16:06he does want people to be using it; I think he wants a system where
0:16:10any annotation that people would do would be
0:16:12contributed back into the overall project
0:16:15but if you contact him at nick at tcd dot
0:16:19ie, he is definitely open to exchange
0:16:23thank you
0:16:24oh