Speech Transcript - PROCESSING ‘YUP!’ AND OTHER SHORT UTTERANCES IN INTERACTIVE SPEECH

OK, I'm John Kane from Trinity College Dublin, and I'll be presenting our preliminary work on processing 'yup' and other short utterances in interactive speech. I'll just start by saying that this is primarily the work of Professor Campbell, who couldn't be here. It is also the work of Helena Moniz, who visited us at Trinity for [unclear] last summer to work on the corpus.

So we're interested in spoken conversation: in describing, characterising, modelling and ultimately synthesising spoken conversation. Probably the most striking aspect of spoken conversation is that it's massively interactive. Say you have two people having a conversation. It could be that one person is doing most of the talking, which happens quite a lot; perhaps somebody is talking to a friend about an issue they had at work. Even though one person is doing most of the talking, the interaction is still massively interactive, with the other participant providing constant feedback to the main speaker: whether they understand, whether something needs to be [unclear], whether the speaker needs to go faster or provide background information, this sort of thing.

Spoken conversation is in stark contrast to the monologue type of speech you get in news broadcasts, lectures or, well, talks like this. In spoken interaction, meaning is built up adaptively and collaboratively. So instead of being a kind of linear transfer of information, it goes forward and back and forth. If you look at any spoken interaction, what is very apparent is that there's a very high frequency of short utterances, and short utterances have a function in spoken conversation which is disproportionate to their length. They're very useful and very important in managing spoken discourse.

You have here a graph of a telephone conversation, where each of the panels corresponds to a speaker. The x-axis is time and the y-axis is speech density in ten-second frames; speech density here is a measure of talking time per frame. From this you can see that in this conversation the first speaker is doing most of the talking, and you can see a high frequency of both long and short utterances here. The second speaker, even though they're not using as many long utterances, is still very active in terms of short utterances. So the supposedly passive partner is in fact highly active: the number of short utterances is extremely high.

If you look at the transcription of these utterances, the linguistic content is really very repetitive and the variation is quite minimal. Of course, when we're engaging in spoken conversation, if we're not using linguistic content as a tool, then we have to use some other tool to provide differentiation to the speaker, and hence the importance of prosody and voice quality, or vocal timbre.

Just to illustrate this a little better, I'm going to play a sequence of short utterances from a single male speaker from the D64 corpus, which I'll describe in a few minutes. So I'll just play it first.

[Audio example of short utterances plays.]

OK. I'm sure just by listening you can hear that in spoken conversation these same linguistic units have very different pragmatic functions in the discourse, and what we believe provides that differentiation, or a lot of it, is the prosody and voice quality, which I think you could hear in some of those short utterances.

One of the corpora that Professor Campbell worked on was the Expressive Speech Processing corpus, which was one thousand five hundred hours of interactive speech recorded in Japan between [years unclear]. The most common words, single words, made up more than half of the total utterance count. These single words came in a wide range of prosodic conditions, carrying entirely different messages. One of the examples that Professor Campbell sometimes gives is the word 'honma', which is an Osaka dialect word that roughly translates as 'really' in English. In a subset of the corpus, that single word has at least twenty different pragmatic functions, and again prosody and voice quality are essential to provide the differentiation in spoken conversation.

Just a final graph, to illustrate the frequency of short utterances also in larger-party conversations. The graph here is from the FreeTalk corpus. There are five speakers involved in the conversation; each of the different colours represents a different speaker, and the length of the bar represents the length of the utterance. Again, if you look through this there is a high frequency of short utterances, sometimes by a single speaker and sometimes by more than one at the same time.

OK, so this brings us on to the corpus and the current study. The D64 corpus was recorded in two thousand and nine in an apartment in Dublin, and the goal of the corpus was to richly record highly naturalistic spoken conversation. There were twelve audio lines, five video cameras, two three-sixty-degree videos and six OptiTrack motion capture cameras, and there were five participants, three male and two female. The social interaction was completely unstructured and non-scripted, and there was no particular conversation goal; for this reason the topics varied very widely.

There were four sessions over two days, and in the current study we look at the first two sessions. The first session wasn't even meant to be recorded. The three male speakers were in the room at the time with headset mikes on, and it was only a short time before the two female participants arrived. Because of problems with Cubase and other technical issues, it was known to be a tense and stressful environment, and this is very apparent from the speech data when you listen to it afterwards. The second session, when the two female participants arrived, was much more relaxed: people were sitting and drinking cups of coffee, talking a little bit about themselves. So the two sessions are in stark contrast in this regard. I should state that only one of the female speakers is included in the analysis in the current work.

When Helena Moniz was over with us last summer, she annotated the data, and a set of twelve annotation labels was used for the short utterances in these two sessions. I'm not going to go through all of them, but just highlight the most frequent ones. Backchannels are clearly the most frequent; backchannels are the kind of feedback that a speaker might be given, like 'yeah', 'OK', 'right'. Also very common were filled pauses, like 'um' and 'uh', these sorts of things, and interjections and repetitions were also quite frequent.

We carried out some prosodic analysis on these short utterances. We measured fundamental frequency mean, max and the position of the peak, as a percentage location of the peak in the utterance, and the same measures for power. We used two fairly crude voice quality measures: the difference between the first two harmonics of the speech spectrum, and the difference between the first harmonic and the harmonic in the third formant region. We also measured duration. We then carried out principal component analysis on that, which showed the first loading to be dominated by power values, the second to be dominated by F0 values, and the third to be dominated by both voice quality and duration values. So this suggested the independence of these groups. The first five loadings accounted for seventy percent of the variance.

We wanted to look further at the voice quality involved in this, so we wanted to look at voice qualities across the breathy-to-tense dimension. Phonation mode, or mode of vocal fold vibration, is critical to these voice qualities. Shown here is an image of the larynx taken from above, with the three main muscular tensions highlighted. For breathy voice quality, when the vocal folds are vibrating you have low levels of tension, in particular low adductive tension. This means that there's not full closure, and you get this chink at the posterior end of the vocal folds that allows air to pass through, and this is the main contributor to the breathy perceptual quality. For tense voice quality, at the other end of the spectrum, you have high levels of the three main types of tension, producing a tenser voice quality.

So we wanted to use some acoustic measures to measure these physiological correlates. We used a three-step method. First, we measured glottal closure instants using the [method name unclear] method; glottal closure instants correspond to the moments where the vocal folds come together. Then we used an inverse filtering method. Inverse filtering basically attempts to remove the contribution of the vocal tract from the speech signal, giving an estimate of the glottal source signal, the signal created by the vocal folds of the larynx. We used the iterative adaptive inverse filtering method described by Alku. Looking at the block diagram of the method, it attempts to compensate for the spectral roll-off of the voice source signal and uses LPC analysis to get an all-pole model of the vocal tract transfer function. This is done in a couple of iterations, and the output is the estimate of the source signal.

With this output signal, we used the glottal gradients described by Lugger and Yang in two thousand and six, which is follow-on work from Stevens and Hanson. The reason for using these glottal gradients was that they were described in previous work to be useful even in less than ideal recording conditions, which inevitably happen when you're dealing with interactive speech like this. We chose the two gradients that, from our own work on carefully controlled data, showed the best differentiation of voice qualities across the breathy-to-tense dimension. Highlighted here are the two glottal gradients: GOG, glottal opening gradient, and RCG, rate of closure gradient. I won't describe these in any detail, but just state that low levels of these two values suggest tenser voice qualities and higher levels suggest breathier voice qualities.

OK, so we analysed the short utterances using this method. I should have mentioned earlier that in the annotation there was also annotation of overlapping and non-overlapping segments. We find that RCG values are significantly lower in overlapping speech, taking all four speakers, and there were also lower GOG values. This trend was also seen in each of the speakers individually. For comparing session one to session two we only used the three male speakers, because one of the females wasn't present in the first. We found lower RCG and lower GOG values overall in the first session. When we looked at the speakers individually, two of the male speakers showed significantly lower RCG values, whereas the other male speaker showed significantly higher RCG values. So this was a little bit curious.

So how do we interpret this? Well, we interpret this as a tenser overall voice quality in the first session and during overlapping speech. This is reasonably intuitive: perhaps in overlapping speech a tenser voice quality could be a mechanism for competing for the turn. Also, as I stated at the beginning, this more stressful first session leads to an overall tenser voice quality. But the productions by participant two showed the opposite trend across sessions. We spoke to him afterwards, and he actually described the first session as an environment where, with equipment being set up, some people can find things to be more controlled, and he didn't perceive the stress that the others did, whereas the get-to-know-you session can actually be very socially awkward for some people. We attribute his results in the second session to this.

So the take-home message is just that short utterances vary substantially in terms of prosody and voice quality in spoken conversation. Traditional speech recognition systems don't take account of these aspects of speech, and if we want to properly model the type of naturalistic speech we have in spoken conversation, then we feel that these aspects need to be taken care of.

Finally, just to state what we're doing with this. We currently have an exhibition at the Science Gallery in Trinity College, where Herme the robot is a Lego robot with [unclear] and audio recording. Basically it tracks people's faces, [unclear], and strikes up a conversation. We are using this for data collection on spoken conversation and short utterances, and also as a platform for testing our hypotheses about short utterances.

Nick Campbell and I are funded by SFI, and Helena Moniz was supported by an FCT grant. That's it, so I think we have time for a question, maybe.

[Audience question, largely unclear, about differentiating the different types of short utterance that were labelled.]

Well, that was done in the annotation, but we didn't match that to the acoustics in the description here. That would be something that Professor Campbell [unclear], and I think you may find work along those lines, but with some of the measurements we used, I don't have that.

[Audience question, largely unclear.]

Maybe [unclear] might know a little better than me on this, but what I think is true is that Professor Campbell has said that he does want people to be using it. I think he wants a system where any annotation that people would do would be contributed back into the overall project. But if you contact him at nick at TCD dot IE, he is definitely open to that. Thank you.
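
Editorial sketch: the talk defines speech density as a measure of talking time per ten-second frame. A minimal way to compute such a measure from utterance start and end times might look like the following. The function name, the input format (a list of `(start, end)` pairs in seconds for one speaker) and the normalisation to a fraction of the frame are assumptions made for illustration; they are not taken from the authors' tools.

```python
import math


def speech_density(utterances, total_duration, frame_len=10.0):
    """Return, for each frame, the fraction of the frame occupied by speech.

    utterances: list of (start, end) times in seconds for ONE speaker,
                assumed not to overlap each other (hypothetical input format).
    total_duration: length of the recording in seconds.
    frame_len: frame length in seconds (ten seconds in the talk).
    """
    n_frames = math.ceil(total_duration / frame_len)
    density = [0.0] * n_frames
    for start, end in utterances:
        first = int(start // frame_len)
        last = min(int(end // frame_len), n_frames - 1)
        for i in range(first, last + 1):
            frame_start = i * frame_len
            frame_end = frame_start + frame_len
            # Portion of this utterance that falls inside frame i.
            overlap = min(end, frame_end) - max(start, frame_start)
            if overlap > 0:
                density[i] += overlap / frame_len
    return density


# Example: three utterances in a 20-second recording, two frames.
example = speech_density([(0.0, 4.0), (8.0, 12.0), (12.5, 13.0)], 20.0)
print([round(d, 2) for d in example])
```

An utterance that straddles a frame boundary, such as the one from 8 to 12 seconds above, contributes to both frames in proportion to the time it spends in each.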

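Editorial sketch: the talk describes inverse filtering as using LPC analysis to obtain an all-pole model of the vocal tract and then removing that contribution from the speech signal. The code below illustrates only that core idea, as a single pass of autocorrelation-method LPC followed by FIR inverse filtering. It is not the iterative adaptive inverse filtering method named in the talk, which adds compensation for the spectral roll-off of the source and repeats the estimation; all function names here are hypothetical.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of a finite signal x."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations by the Levinson-Durbin recursion.

    Returns [1, a1, ..., ap], the coefficients of the inverse filter
    A(z) = 1 + a1 z^-1 + ... + ap z^-p. Assumes r[0] > 0.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a


def inverse_filter(x, a):
    """Apply the FIR filter A(z) to x, giving the LPC residual."""
    return [sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(x))]


# Toy example: an impulse passed through the one-pole filter 1 / (1 - 0.9 z^-1).
signal = [0.9 ** n for n in range(200)]
coeffs = levinson_durbin(autocorrelation(signal, 1), 1)
residual = inverse_filter(signal, coeffs)
```

In this toy case the estimated filter closely matches the pole that generated the signal, so the residual is close to the original impulse. Real speech would need windowing, a higher model order and the additional steps of the full method.
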
PROCESSING ‘YUP!’ AND OTHER SHORT UTTERANCES IN INTERACTIVE SPEECH

Audio/Visual Detection of Non-Linguistic Vocal Outbursts

Presented by: John Kane, Author(s): Nick Campbell, John Kane, University of Dublin, Ireland; Helena Moniz, FLUL/INESC-ID, Portugal