0:00:18 Marcin will be presenting the next talk.
0:00:26 Is this on? So how do I do that? I need some help here, I think.
0:00:38 Oops, I'm sorry. The presentation is on this computer, but I can't find the... there is no pointer here now, right? Right.
0:01:07 Well, I can start while this is happening. I can start by saying that the work I'm going to be presenting is really Kornel Laskowski's work, and he very generously invited us to collaborate with him on this. Then it turned out that he cannot make it today, which means that you are stuck with me here. I will try not to make too much of a mess of his talk.
0:01:45 So the question we are tackling here is a very old question in speech science: whether, or to what extent, pitch plays a role in the management of speaker change. This question has generated a huge and steady stream of papers, but if you look across those papers you can extract some broad consensus. First of all, pitch does play some role. Secondly, there is this binary opposition between flat pitch signalling, or being linked to, turn-holding, and any kind of pitch movement, dynamic pitch, being linked to turn-yielding. And that's it, that's the whole story. Except of course it is not, because there are still a number of questions you might want to ask about the contribution of pitch to turn-taking,
0:02:37 such as: does it matter whether you are looking at spontaneous or task-oriented material? Does it matter whether the speakers can see each other, or whether they know each other? What is the actual contribution of pitch over lexical or syntactic cues? And finally, I am a linguist by training, a phonetician, and we know that different languages use pitch linguistically to different extents, so the question is whether this is also reflected in how they use pitch for pragmatic purposes such as turn-taking.
0:03:12 And then there is a whole other list of questions about how you transform, how you represent, pitch in your model. Do you do some kind of perceptual stylisation based on perceptual thresholds? Do you do some sort of curve fitting: polynomials, functional data analysis, what have you? Do you use a log scale? Do you transform to semitones? How far back do you look for those cues: ten milliseconds, a hundred, one second, ten seconds?
0:03:41 These are all interesting and important questions, but it is very difficult to answer them in a systematic way, because any two studies you point to will vary across so many dimensions that it is very difficult to estimate, to quantify, how much each of these factors contributes to the actual contribution of pitch to turn-taking.
0:04:02 So what we are trying to do here is propose a way of evaluating the role of pitch in turn-taking, a method which has three properties we think are important. First, it is scalable: it is applicable to material of any size. Second, it is not reliant on manual annotation. Third, it gives you a quantitative index of the contribution of pitch, or of any other feature for that matter, because in the long term this method can be applied to any candidate turn-taking cue.
0:04:43 The way we chose to showcase, and also to evaluate, this method was to ask three questions which we thought were interesting for us and which we hope are interesting to some of you. The first question is whether there is any benefit in having pitch information for the prediction of speech activity in dialogue. The second is, if it does make a difference, how best to represent the pitch information. And the third is how far back you have to look for these cues.
0:05:15 These are the questions we will be asking, and we will be trying to answer them using Switchboard, which we divided into three speaker-disjoint sets, so there is no speaker in more than one of them. And instead of running our own voice activity detection, we just used the forced alignments of the manual transcriptions that come with Switchboard.
0:05:39 And what we did then, and this is the idea that lies at the heart of this method, and I am sure you have seen it before, is the interaction chronogram, which is a sort of discretised, quantised speech/silence annotation. You have a frame of predefined duration, here one hundred milliseconds, and for each of those frames and for each of the speakers you indicate whether that person was speaking or silent during that interval. So here we have speaker A speaking for four hundred milliseconds, there is a hundred milliseconds of overlap, speaker B then speaks for four frames, there is a hundred milliseconds of silence, and then speaker A comes in again.
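As a rough sketch of what this chronogram representation might look like in code (the segment format and helper name here are illustrative assumptions, not from the talk or the paper):

```python
import numpy as np

FRAME = 0.1  # 100 ms frames, as in the talk

def chronogram(segments_a, segments_b, total_dur, frame=FRAME):
    """Build a 2 x n_frames binary speech/silence matrix from per-speaker
    (start, end) speech segments, e.g. taken from forced alignments."""
    n_frames = int(np.ceil(total_dur / frame))
    mids = (np.arange(n_frames) + 0.5) * frame  # frame midpoints
    chrono = np.zeros((2, n_frames), dtype=int)
    for row, segments in enumerate((segments_a, segments_b)):
        for start, end in segments:
            chrono[row, (mids >= start) & (mids < end)] = 1
    return chrono

# toy example: A speaks 0-0.4 s, B speaks 0.3-0.7 s (100 ms overlap), A resumes at 0.8 s
chrono = chronogram([(0.0, 0.4), (0.8, 1.2)], [(0.3, 0.7)], total_dur=1.2)
```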
0:06:24 Once you have this sort of representation, you can predict speech activity very simply. You take one speaker's history; we call this speaker the target speaker. You take this person's speech activity history, and you can also, if you are interested in that, take the other person's speech activity history, and then you try to predict whether the target speaker is going to be silent or speaking in the next hundred milliseconds. This kind of model can serve as a very neat baseline onto which you can then keep adding other features, in our case pitch.
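A minimal sketch of how such training examples could be assembled from the chronogram (the function name and windowing details are my own, not the authors'):

```python
import numpy as np

def make_examples(chrono, target_row, history=10):
    """Sliding-window (X, y) pairs: X holds the last `history` frames of both
    speakers' speech activity, y is whether the target speaker is active in
    the following 100 ms frame."""
    other_row = 1 - target_row
    X, y = [], []
    for t in range(history, chrono.shape[1]):
        feats = np.concatenate([chrono[target_row, t - history:t],
                                chrono[other_row, t - history:t]])
        X.append(feats)
        y.append(chrono[target_row, t])
    return np.array(X), np.array(y)
```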
0:07:07 What you can then do is compare this speech-activity-only model, the baseline, with the composite speech-activity-plus-pitch model, and of course you can also compare the different types of pitch parameterisation with one another.
0:07:23 Of course, the only thing you have to do before this kind of exercise is to somehow take the continuously varying pitch values and cast them into this chronogram-like, matrix-like representation. What we did here was the simplest possible solution: for each hundred-millisecond frame we calculated the average pitch in that interval, or we left it as a missing value if there was no voicing in that interval.
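A sketch of this frame-wise pitch averaging, assuming you already have timestamped F0 samples from some pitch tracker (all names here are illustrative):

```python
import numpy as np

def pitch_per_frame(f0_times, f0_values, n_frames, frame=0.1):
    """Mean F0 per 100 ms frame; NaN (missing) where the frame contains no
    voiced F0 samples."""
    out = np.full(n_frames, np.nan)
    frame_idx = np.minimum((f0_times / frame).astype(int), n_frames - 1)
    for i in range(n_frames):
        vals = f0_values[(frame_idx == i) & np.isfinite(f0_values)]
        if vals.size:
            out[i] = vals.mean()
    return out
```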
0:07:57 We then ran those prediction experiments using quite simple feed-forward networks with a single hidden layer; for all the experiments I am talking about here we had two units in that hidden layer, and there are more configurations in the paper which I will not be talking about here. You will note that this is a non-recurrent network, and there is a reason for this: since we are actually interested in the length of the usable pitch history, we want to have control over how much history the network has access to.
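For concreteness, a single-hidden-layer feed-forward classifier of this kind could be set up with off-the-shelf tooling along these lines (a sketch only; the two hidden units come from the talk, everything else is my assumption):

```python
from sklearn.neural_network import MLPClassifier

# Non-recurrent by construction: the model only ever sees the fixed-length
# history window packed into its input vector, so the usable context is
# controlled explicitly by how the examples are built.
model = MLPClassifier(hidden_layer_sizes=(2,), activation="logistic", max_iter=1000)
# model.fit(X_train, y_train)  # X built from speech-activity (and later pitch) frames
```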
0:08:33 Before we go on: the differences were compared using cross entropy, expressed in bits per hundred-millisecond frame. There will be a lot of comparisons here, so there will be lots of pictures, and there are even more in the paper; I have sort of taken the liberty of picking out the more boring ones, which I think is fine as long as you don't tell Kornel. So if you know them, don't tell.
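The metric itself is just average binary cross entropy per frame, converted to bits; a minimal sketch (variable names are mine):

```python
import numpy as np

def cross_entropy_bits(y_true, p_pred, eps=1e-12):
    """Mean cross entropy in bits per 100 ms frame, given binary speech/silence
    labels y_true and predicted speaking probabilities p_pred."""
    p = np.clip(p_pred, eps, 1 - eps)
    ce_nats = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return ce_nats.mean() / np.log(2)  # nats -> bits
```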
0:08:54 So, the first two questions were: first, is there any benefit in having access to pitch history when doing speech activity prediction; and second, what is the optimal representation of pitch values in such a system?
0:09:16 What we do here is start with the speech-activity-only baseline, and we will be seeing this kind of picture a lot. What we have here are the training set, the dev set and the test set; these are the cross-entropy rates for all those systems; and on the x-axis is the conditioning context. So this is a system trained on one hundred milliseconds of speech activity history, and this is a system trained on one second of speech activity history, and you can see that across all three sets the cross entropies drop as you would expect: there is an improvement in prediction.
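That baseline curve is obtained simply by retraining the same model with longer and longer input windows; schematically, reusing the sketches above (chrono_train, chrono_test and make_examples are the hypothetical helpers from earlier, not the authors' code):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import log_loss

# hypothetical sweep over the conditioning context, 100 ms to 1 s
for history in range(1, 11):
    X_tr, y_tr = make_examples(chrono_train, target_row=0, history=history)
    X_te, y_te = make_examples(chrono_test, target_row=0, history=history)
    clf = MLPClassifier(hidden_layer_sizes=(2,), max_iter=1000).fit(X_tr, y_tr)
    ce_bits = log_loss(y_te, clf.predict_proba(X_te)[:, 1]) / np.log(2)
    print(f"{history * 100} ms context: {ce_bits:.3f} bits/frame")
```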
0:09:55 What we will be doing from now on is taking this guy, the system trained on one second of speech activity history of both speakers, and adding more and more pitch history. So it is always ten frames of speech activity history for both speakers, and then pitch on top of that.
0:10:22 What we did first was simply add absolute pitch, on a linear scale, in Hz. And surprisingly, even this simple pitch representation helps quite a bit: you can see that even having one frame of pitch history is already better than the baseline here, and then it improves further and starts to settle around three hundred milliseconds.
0:10:50 So that is good news: it seems to suggest that pitch information is somehow relevant for speech activity prediction. But clearly, representing pitch in absolute terms is a kind of laughable idea, because it is completely speaker dependent. So what you want to do is make this speaker-independent somehow, you want to do speaker normalisation, and here again we did the simplest thing: we just z-scored the pitch values. And surprisingly, this did not really make much of a difference, which is surprising.
0:11:29 You would expect some improvement, but if you think about it, this actually introduces more confusion, because z-scoring of course brings the mean to zero, and the voiceless frames are also represented as zeros in the model, so these models simply confuse those two phenomena. This can be quite easily fixed by adding another feature vector, a binary voicing feature: it is one when there is voicing and zero when there is not. This allows the model to disambiguate zeros which are due to being close to the speaker's mean from zeros which are due to voicelessness.
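Put together, the representation that worked best so far amounts to something like this (a sketch; the per-speaker statistics are assumed to be known in advance, a point that comes back later):

```python
import numpy as np

def pitch_features(frame_pitch, speaker_mean, speaker_std):
    """Two feature rows per frame: z-scored pitch (0 where unvoiced) plus a
    binary voicing flag, so the model can tell 'near the speaker's mean'
    apart from 'no voicing at all'."""
    voiced = np.isfinite(frame_pitch).astype(float)
    z = np.where(voiced > 0, (frame_pitch - speaker_mean) / speaker_std, 0.0)
    return np.stack([z, voiced])
```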
0:12:12 And when you do this, you actually get quite a substantial drop in cross-entropy rates, which suggests that this is a good representation; and this drop was actually greater than if you add voicing on top of absolute pitch. Again, that is not something I am showing here, but it is in the paper.
0:12:31 Then of course you can go on and say: well, we know that pitch is really perceived on a semitone scale, on a log scale, so does it matter if we convert the Hz values to semitones before z-scoring? And it does, a little: there is a slight improvement, which generalises to the test set.
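The semitone conversion itself is a one-liner relative to an arbitrary reference frequency (100 Hz here is my choice; the reference shifts everything by a constant and therefore cancels out once you z-score per speaker):

```python
import numpy as np

def hz_to_semitones(f0_hz, ref_hz=100.0):
    """Convert F0 from Hz to semitones relative to ref_hz: 12 * log2(f / ref)."""
    return 12.0 * np.log2(f0_hz / ref_hz)
```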
0:12:52 And the last thing we asked here: so far we have only been using the pitch history of the target speaker, but you can also ask whether it helps to know the pitch history of the interlocutor. And again, there is a slight but consistent improvement if you use both speakers' histories. So this is our answer, or a preliminary answer anyway, to questions number one and two.
0:13:18 Then we have question number three, which is how far back you have to look, and for this we have this sort of diagram. The top line is as before, the speech-activity-only model, except that previously we ended here at this blue dot and here we extended it for another ten frames, so this model is trained on two seconds of speech activity history, and you can see that it continues dropping, but a little less abruptly. This curve here is exactly the curve we had before, so pitch plus one second of speech activity history, and this one is more and more pitch history plus two seconds of speech activity history.
0:14:08 And this is quite interesting, actually, and a little bit puzzling, in that these curves are quite similar: they all start settling around four hundred milliseconds, but this one is just shifted down. What this means, basically, is that the same amount of pitch history is more helpful if you have more speech activity history, which is kind of interesting. We have some ideas about it, but frankly we do not know why that is. One possibility is that it has something to do with the backchannel versus non-backchannel distinction, in that those four hundred milliseconds of pitch cues might only be useful when the person has been talking for sufficiently long.
0:14:56 So, as I said, there is more in the paper, but this is all I wanted to show you here. What have we learned? Back to the three questions. First, does pitch help in the prediction of speech activity in dialogue? The answer is yes. What is the optimal representation? From what we have seen, it seems to be the combination of a binary voicing feature, for the disambiguation of voicelessness, with z-score-normalised pitch on a semitone scale. And how far back should one look? It seems that four hundred milliseconds of context is sufficient.
0:15:39 We have also seen that, in terms of the absolute reduction in cross entropy, the best performing pitch representation resulted in a reduction corresponding to roughly seventy-five percent of the reduction you get in the speech-activity-only model when you go from one frame to ten frames, so it is quite substantial in those terms.
0:16:07 We have also seen that four hundred milliseconds seems to be enough, which is not much if you think about the study that Kornel did in two thousand twelve, where it was found that with speech activity history alone you can go back as much as eight seconds and still keep improving. On the other hand, if you think about the prosodic domain, the window within which any kind of pitch cue could be embedded, then something on the order of magnitude of a prosodic foot, so something like four hundred milliseconds long, makes perfect sense to me.
0:16:50 And one thing we did, of course, was cheat a little bit: when we did the z-scoring of the pitch, we used speaker means and standard deviations that we assumed to be known a priori. This of course would not be the case if you were to run this analysis in a real-time scenario; there, they would have to be estimated incrementally.
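In an online setting the per-speaker mean and standard deviation used for z-scoring could be estimated incrementally, for instance with Welford's algorithm (a generic sketch, not what was done in the paper):

```python
class RunningStats:
    """Welford's online algorithm for incremental mean and standard deviation."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        return (self.m2 / self.n) ** 0.5 if self.n > 1 else 1.0
```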
0:17:17 I want to finish here by going back to the rationale for doing all this analysis, all this playing around, which was really to come up with a better way of doing automated analysis of large amounts of speech material, and especially to be able to produce results across different corpora and make them comparable. One thing you could do with this, for instance: we ran it on Switchboard; you could take the same thing and run it on CallHome, which is also dyadic and also telephone speech, but where the people know each other. And what you can then do is compare those results and see to what extent familiarity between speakers, for instance, plays a role in how pitch is employed for turn management.
0:18:11 And of course, and this is kind of what got Kornel and me excited about this, there is nothing that limits this approach to pitch: there is nothing stopping you from doing intensity, any kind of voice quality features, or multimodal features. So this really opens the way, in a sense, for doing a lot of interesting things. And in the long term, whatever you find out could potentially also be used in some sort of mixed-initiative dialogue system, but that is really something that you know more about than I do. So I will stop here, thank you.
0:18:53 We have plenty of time for questions.
0:19:03 I have a hidden slide with Kornel's phone number.
0:19:08 So how are you handling cases where you are not able to find the pitch, where there is no pitch because it is voiceless? Do you do anything in particular?
0:19:16 Originally it is left as a missing value, but then, because of all the shenanigans that happen inside, those values are just transformed into zeros, so that is why there is this confusion between voicelessness and, after z-scoring, the mean pitch.
0:19:42 Are there other questions?
0:19:50 Thanks for the nice talk. I was wondering: absolute pitch is very different for male voices and for female voices, so I am wondering whether your model is not just separating male voices from female voices.
0:20:19 Well, maybe, but how would that information be useful for predicting whether the speaker will be speaking in the next hundred milliseconds?
0:20:33 Still, your result for absolute pitch is very surprising. Yes, I think so too, because you would not assume that speaking at one hundred and sixty-five hertz signals that you want to hold the floor all the time. I agree that it is surprising. But of course, if you compare the absolute pitch with the speaker-normalised pitch, there is clearly a lot that the absolute pitch misses, so there is a lot to improve on; there must be some information that is still in there.
0:21:15 How do you mean? That inside the network there was some kind of clustering, that it sort of had one classifier for men and one for women?
0:21:28 Actually, I think you just anticipated my question. I am wondering how much the modelling itself is doing: you are proposing a certain representation, you binarise pitch and so on, but obviously the model is probably also doing something on top of that, and I am not sure whether you have looked into that, whether you can really disentangle it. Because if someone takes a different approach, say constructs features that are temporal in nature, looking at slopes and all that stuff, how much of that is the model already accounting for? I am not sure, it is hard to say, I cannot answer this; of course you do not know what the model is actually doing, absolutely. But the thing is, this is one way of approaching the problem while producing results which are comparable across studies.
0:22:23 You mentioned at the beginning that pitch might be flat before a turn boundary, and since you do not use a recurrent model, did you also consider taking the change in pitch over time, not only the absolute values? No, we did not, but is that not something that the network could potentially figure out itself? That is the question. I mean, I think so.
0:22:54 I do not think you have done this, but are you planning to take this outside the corpus and see whether the kinds of differentiation your models are finding might be used productively, to change the behaviour of the other speaker, for instance if you altered the pitch of what is being generated? Absolutely, that could be done. And the other question:
0:23:22 I was wondering what you would need to change if it were a multi-speaker situation, not just two, but three or four.
0:23:29 Possibly; this is something we have discussed a lot. We had a paper at Interspeech two thousand seventeen where we did this kind of modelling for respiratory data and turn-taking, and there we had three speakers, and you can absolutely do it, but then you would have another row here, and what you then have to do is keep shifting those speakers around, because you do not want your model to rely on the fact that speaker B was on row two and speaker C was on row three. So with three speakers it is still doable; once you go into really multi-party settings, this just explodes. Then you would have to do it somehow differently, perhaps only take into account the speakers who have spoken within the last, I don't know, five minutes or something, and then incrementally, dynamically, produce those subsets of speakers that you predict for.
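The speaker shifting mentioned here can be done, for a small number of speakers, by training on all permutations of the non-target rows, so the model cannot latch onto which row a particular interlocutor happens to occupy (an illustrative sketch only, not the setup used in the 2017 paper):

```python
from itertools import permutations
import numpy as np

def permuted_examples(chrono, target_row, history=10):
    """Yield training examples with the non-target speakers' rows permuted,
    so the model cannot rely on a fixed row order."""
    others = [r for r in range(chrono.shape[0]) if r != target_row]
    for order in permutations(others):
        rows = np.vstack([chrono[target_row]] + [chrono[r] for r in order])
        for t in range(history, rows.shape[1]):
            yield rows[:, t - history:t].ravel(), rows[0, t]
```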
0:24:39 Any more questions?
0:24:48 I am just wondering whether you have looked into the granularity here: you are picking hundred-millisecond frames; did you look at other time windows?
0:24:56 Well, we did not, but this is clearly, I think, a key issue that should somehow be addressed, absolutely. On the other hand, the method itself is agnostic about this, just as whatever pitch extraction you use will produce different pitch tracks, and whatever voice activity detection you run will also produce different results; this is, in some sense, part of the preprocessing. But still, I think... absolutely, absolutely.
0:25:41 Alright, let's thank our speaker again.