Marcin will be presenting the next talk.

Is this on?

So how do I... how do I do that? I need some help here, I think. Oops, I'm sorry, it stopped... the computer. The presentation is on this computer, but I can't find the... there is no pointer there now, right? Right. Is it the other one?

Well, I can start while this is happening. I can start by saying that the work I'm going to be presenting is really Kornel's work. He very generously

invited Mattias and me to collaborate with him on this,

and then it turned out that he cannot make it today, which means that you are stuck with me here. I will try not to make too much of a mess of his talk.

So the question that we are tackling here is a very old question in speech science: the question of whether pitch, or to what extent pitch, plays a role in the management of speaker change. This has generated, over the years, a huge and steady stream of papers, but if you look across those papers you can extract some broad consensus: first of all, that pitch does play some role, and secondly, that there is this binary opposition between flat pitch signalling, or being linked to, turn-holding, and any kind of pitch movement, dynamic pitch, being linked to turn-yielding. And that's it, that's the whole story.

Except, of course, it is not, because there are still a number of questions that you might want to ask about the contribution of pitch to turn-taking. Such as: does it matter whether you are looking at spontaneous or task-oriented material? Does it matter whether the speakers can see each other, whether they know each other? What is the actual contribution of pitch over lexical or syntactic cues? And finally, I am a linguist by training, a phonetician, and we know that different languages use pitch linguistically to different extents, so the question is whether this is also reflected in how they use pitch for pragmatic purposes such as turn-taking.

And then there is a whole other list of questions about how you represent pitch in your model. Do you do some kind of perceptual stylisation based on perceptual thresholds? Do you do some sort of curve fitting: polynomials, functional data analysis, what have you? Do you use a log scale? Do you transform to semitones? And how far back do you look for those cues: ten milliseconds, a hundred, one second, ten seconds?

These are all interesting and important questions, but it is very difficult to answer them in a systematic way, because any two studies you point to will vary across so many dimensions that it is very difficult to estimate, to sort of quantify, the contribution of any of these factors to the actual contribution of pitch to turn-taking.

So what we are trying to do here is propose a way of evaluating the role of pitch in turn-taking, and it is a method which has three, we think, important properties. First, it is scalable, so it is applicable to material of any size. Second, it is not reliant on manual annotation. And third, it gives you a sort of quantitative index of the contribution of pitch, or of any other feature for that matter, because in the long term this method can be applied to any candidate turn-taking cue.

So the way we chose to showcase, and also to evaluate, this method was to ask three questions which we thought were interesting for us and which we hope are interesting to some of you. The first question is whether there is any benefit in having pitch information for the prediction of speech activity in dialogue. The second one is, if it does make a difference, how best to represent your pitch information. And the third one is how far back you have to look for these cues.

So these are the questions that we will be asking and trying to answer, using Switchboard, which we divided into three speaker-disjoint sets, so there is no speaker in more than one of them. And instead of running our own voice activity detection, we just used the forced alignments of the manual transcriptions that come with Switchboard.

And the idea that lies at the heart of this method, and I am sure you have seen this before, is the idea of the conversational chronogram, which is a sort of discretised, quantised speech/silence annotation. You have basically a frame of predefined duration, here we used a hundred milliseconds, and for each of those frames and for each of the speakers you indicate whether that person was speaking or silent in that interval. So here we have speaker A speaking for four hundred milliseconds, then there is a hundred milliseconds of overlap, then speaker B takes four frames of speech, then there is a hundred milliseconds of silence, and then speaker A continues.
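A minimal sketch, not the authors' code, of what such a chronogram could look like in practice: a binary speech/silence matrix with one row per speaker and one column per 100 ms frame, built from the kind of speech intervals a forced alignment provides.

```python
import numpy as np

FRAME = 0.1  # frame duration in seconds (100 ms, as in the talk)

def chronogram(speech_intervals, total_dur, frame=FRAME):
    """speech_intervals: {speaker: [(start_s, end_s), ...]} from the alignments."""
    n_frames = int(np.ceil(total_dur / frame))
    speakers = sorted(speech_intervals)
    chrono = np.zeros((len(speakers), n_frames), dtype=np.int8)
    for row, spk in enumerate(speakers):
        for start, end in speech_intervals[spk]:
            # mark every frame overlapped by this stretch of speech as "speaking"
            first = int(np.floor(start / frame + 1e-9))
            last = int(np.ceil(end / frame - 1e-9))
            chrono[row, first:last] = 1
    return speakers, chrono

# Toy example matching the slide: A speaks 0.0-0.4 s and again from 0.8 s,
# B speaks 0.3-0.7 s, giving 100 ms of overlap and 100 ms of silence.
speakers, c = chronogram({"A": [(0.0, 0.4), (0.8, 1.0)], "B": [(0.3, 0.7)]}, 1.0)
print(speakers)
print(c)
```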

And once you have this sort of representation, you can very simply predict speech activity: you just take one speaker's history, we call this speaker the target speaker, you take this person's speech activity history, and potentially, if you are interested in that, you can also take the other person's speech activity history, and then you try to predict whether the target speaker is going to be silent or speaking in the next hundred milliseconds. This kind of model can serve as a very neat baseline onto which you can then keep adding other features, in our case pitch.

And what you can do then is compare this speech-activity-only model, the baseline, with the composite speech activity plus, in our case, pitch model, and of course you can also compare the different types of pitch parameterisation with one another.
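As a rough sketch of the prediction setup just described, this is my reconstruction, not the authors' pipeline: for each time step, the last few frames of both speakers' activity become the input, and the target speaker's activity in the next 100 ms frame is the label.

```python
import numpy as np

def make_examples(chrono, target_row, hist=10):
    """chrono: (n_speakers, n_frames) binary matrix; hist: context length in frames."""
    X, y = [], []
    for t in range(hist, chrono.shape[1]):
        X.append(chrono[:, t - hist:t].ravel())   # activity history of both speakers
        y.append(chrono[target_row, t])           # is the target speaking next?
    return np.array(X, dtype=np.float32), np.array(y, dtype=np.float32)
```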

Of course, the only thing that you have to do before this kind of exercise is to somehow take the continuously varying pitch values and cast them into this chronogram-like, matrix-like representation. What we did here was the simplest possible solution: for each hundred-millisecond frame we calculated the average pitch in that interval, or we just left it as a missing value if there was no voicing in that interval.
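A sketch of that frame-level pitch representation, with assumed details rather than the authors' exact code: mean F0 per 100 ms frame, and NaN marking frames that contain no voicing at all.

```python
import numpy as np

def frame_pitch(f0, f0_hop, n_frames, frame=0.1):
    """f0: F0 track in Hz with 0 for unvoiced samples; f0_hop: its hop size in s."""
    per_frame = int(round(frame / f0_hop))
    out = np.full(n_frames, np.nan)
    for i in range(n_frames):
        chunk = f0[i * per_frame:(i + 1) * per_frame]
        voiced = chunk[chunk > 0]
        if voiced.size:                    # average only over voiced samples
            out[i] = voiced.mean()
    return out
```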

We then ran those prediction experiments using quite simple feed-forward networks with a single hidden layer, and for all the experiments that I am talking about here we had two units in that hidden layer; there are more in the paper which I will not be talking about here. You will note that this is a non-recurrent network, and there is a reason for that: since we are actually interested in the length of the usable pitch history, we want to have control over how much history the network has access to.

And before we go on: the differences were compared using cross-entropy, expressed in bits per hundred-millisecond frame. There will be a lot of comparisons here, so there will be lots of pictures; there are even more in the paper, and I have sort of taken the liberty of picking out the less boring ones, which I think is fine as long as you don't tell Kornel. So if you know him, don't tell.
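A minimal sketch of the kind of model and metric just described: a feed-forward network with a single hidden layer of two units, evaluated with cross-entropy in bits per 100 ms frame. The use of scikit-learn and the training settings are my assumptions for illustration, not the authors' setup.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def cross_entropy_bits(y_true, p_pred, eps=1e-12):
    p = np.clip(p_pred, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log2(p) + (1 - y_true) * np.log2(1 - p))))

# X_train, y_train, X_test, y_test would come from make_examples() above:
# model = MLPClassifier(hidden_layer_sizes=(2,), max_iter=500).fit(X_train, y_train)
# print(cross_entropy_bits(y_test, model.predict_proba(X_test)[:, 1]))
```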

So the first two questions were: first of all, is there any benefit in having access to pitch history when doing speech activity prediction, and secondly, what is the optimal representation of pitch values in such a system?

What we do here is start with the speech-activity-only baseline, and we will be seeing this kind of picture a lot. What we have here are the training set, dev set and test set, here we have the cross-entropy rates for all those systems, and on the x-axis is the conditioning context: so this is a system trained on one hundred milliseconds of speech activity history, and this is a system trained on one second of speech activity history. You can see that across all three sets the cross-entropies drop, as you would expect, so there is an improvement in prediction.

And what we will be doing from now on is taking this guy, the system trained on ten frames, one second, of speech activity history of both speakers, and adding more and more pitch history, so it is always ten frames of speech activity history for both speakers, and then pitch on top.
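My guess at the bookkeeping behind these composite inputs, not the authors' code: a fixed ten-frame speech activity history for both speakers, plus a variable-length pitch history for the target speaker appended to it.

```python
import numpy as np

def composite_example(chrono, pitch, target_row, t, act_hist=10, pitch_hist=3):
    """chrono, pitch: (n_speakers, n_frames) arrays; pitch has NaN when unvoiced."""
    act = chrono[:, t - act_hist:t].ravel()        # both speakers' activity history
    f0 = pitch[target_row, t - pitch_hist:t]       # target speaker's pitch history
    f0 = np.nan_to_num(f0, nan=0.0)                # unvoiced frames become zeros here
    return np.concatenate([act, f0]).astype(np.float32)
```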

What we did first was to just add absolute pitch, on a linear scale, in hertz. And surprisingly, even this simple pitch representation helps quite a bit: you can see that even having one frame of pitch history is already better than the baseline here, and then it improves further and starts to settle around three hundred milliseconds. So that is good news, it seems to suggest that pitch information is somehow relevant for speech activity prediction.

But clearly, representing pitch in absolute terms is a kind of laughable idea, because it is completely speaker-dependent. So what you want to do is make it speaker-independent somehow, you want some speaker normalisation, and what we did here was, again, the simplest thing: we just z-scored the pitch values. And surprisingly, this did not really make much of a difference, which is, well, surprising.

You would expect some improvement, but of course, if you think about it, this actually introduces more confusion, because what z-scoring does is bring the mean to zero, and the voiceless frames are also represented as zeros in the model, so these models are simply confusing those two phenomena.

This can be quite easily improved by just adding another feature vector, a vector which is just a binary voicing feature: it is one when there is voicing and zero when there is not. This allows the model to disambiguate zeros which are due to being close to the speaker's mean from zeros which are due to voicelessness. And when you do this, you actually get quite a substantial drop in cross-entropy rates, which suggests this is a good representation; this drop was actually greater than if you add voicing on top of absolute pitch. Again, that is not something I am showing here, but it is in the paper.
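A sketch of this normalisation plus voicing feature, my reconstruction of what was described: z-score F0 per speaker and add a binary voicing indicator, so that "unvoiced" zeros are not confused with "close to the speaker's mean".

```python
import numpy as np

def zscore_with_voicing(frame_f0, spk_mean, spk_std):
    """frame_f0: per-frame F0 in Hz, NaN for unvoiced frames."""
    voiced = ~np.isnan(frame_f0)                             # 1 = voiced, 0 = unvoiced
    z = np.where(voiced, (frame_f0 - spk_mean) / spk_std, 0.0)
    return z, voiced.astype(np.float32)
```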

And then of course you can go on and say: well, we know that pitch is really perceived on a semitone scale, on a log scale, so does it actually matter if we convert the hertz to semitones before z-scoring? And it actually does, a little bit: there is a slight improvement, which generalises to the test set.
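For reference, the standard Hz-to-semitone conversion mentioned here; the 1 Hz reference value is my assumption, and its choice only shifts the scale, which the subsequent per-speaker z-scoring removes anyway.

```python
import numpy as np

def hz_to_semitones(f0_hz, ref_hz=1.0):
    return 12.0 * np.log2(f0_hz / ref_hz)
```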

And the last thing we asked here: all along we have only been using the pitch history of the target speaker, but you can also ask whether it helps to know the pitch history of the interlocutor. And again, there is a slight but consistent improvement if you use both speakers' histories. So this is our answer to questions number one and two, or our preliminary answer anyway.

And then we have question number three, which is how far back you have to look, and for this we have this sort of diagram. The top line is as before, the speech-activity-only model, except that previously we ended here, at this blue dot, and here we have extended it for another ten frames, so this model is trained on two seconds of speech activity history. You can see that it continues dropping, but a little less abruptly. This curve here is exactly the curve that we had before, so pitch plus one second of speech activity history, and this one is more and more pitch history plus two seconds of speech activity history.

And this is quite interesting actually, and a little bit puzzling, in that these curves are quite similar: they all start settling around four hundred milliseconds, but this one is just shifted down. What this means, basically, is that the same amount of pitch history is more helpful if you have more speech activity history, which is kind of interesting. We have some ideas about it, but frankly we do not know why that is. One possibility is that it could be something to do with the backchannel versus non-backchannel distinction, and that those four hundred milliseconds of pitch cues might only be useful when the person has been talking for sufficiently long.

So, as I said, there is more in the paper, but this is all I wanted to show you here.

So what have we learned? Back to the three questions. First: does pitch help in the prediction of speech activity in dialogue? The answer is yes. What is the optimal representation? Well, from what we have seen, it seems to be the combination of binary voicing, for the disambiguation of voicelessness, and z-score-normalised pitch on a semitone scale. And how far back should one look? Well, it seems that four hundred milliseconds of context is sufficient.

But we have also seen that, in terms of the absolute reduction in cross-entropy, the best-performing pitch representation resulted in a reduction which corresponds to roughly seventy-five percent of the reduction you get in the speech-activity-only model when you go from one frame to ten frames, so it is quite substantial in those terms.

And we have seen that four hundred milliseconds seems to be enough, which is not much if you think about the study that Kornel did in two thousand twelve, where they found that with speech activity history only you can go back as much as eight seconds and still keep improving. But on the other hand, if you think about the sort of prosodic domain, the window within which any kind of pitch cue could be embedded, then something on the order of magnitude of a prosodic foot, so something like four hundred milliseconds long, makes perfect sense to me.

And one thing we did, of course, was to cheat a little bit, in that when we did the z-scoring of the pitch we used speaker means and standard deviations that we assumed to be known a priori. This would of course not be the case if you were to run this analysis in a real-time scenario; these would then have to be estimated incrementally.
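A sketch of how a speaker's F0 mean and standard deviation could be estimated incrementally (Welford's online algorithm), as the real-time scenario mentioned here would require; my illustration only.

```python
class RunningStats:
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # population standard deviation; falls back to 1.0 before enough data
        return (self.m2 / self.n) ** 0.5 if self.n > 1 else 1.0
```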

I want to finish here and go back to the rationale for doing all this analysis, all this playing around, which was really to come up with a better way of doing automated analysis of large speech material, and especially to be able to produce results across different corpora and make them comparable. So one thing you could do with this, for instance: we ran this on Switchboard; you can take the same thing and run it on CallHome, for instance, which is also dyadic, which is also telephone speech, but where people know each other. What you can then do is compare those things and see to what extent familiarity between speakers, for instance, plays a role in how pitch is employed for turn management.

And of course, this is kind of what got Kornel and me excited about this: there is nothing that limits these things to pitch. There is nothing stopping you from doing intensity, any kind of voice quality features, or for that matter multimodal features. So this really opens the way, in a sense, for doing a lot of interesting things, and of course in the long term whatever you find out could potentially also be used in some sort of mixed-initiative dialogue system, but that really is something that you know more about than I do. So I will stop here, thank you.

We have plenty of time for questions.

I have a hidden slide with Kornel's phone number, by the way.

So perhaps I will aim at this: how are you handling cases where you are not able to find the pitch, where the pitch is not there because of voicelessness? Do you do anything in particular?

I mean, originally it is left as a missing value, but then, because of all the shenanigans that happen inside, they just get transformed into zeros. So that is why there is this confusion, after z-scoring, between voicelessness and the mean pitch.

Other questions?

Thanks for the interesting talk. I was wondering: absolute pitch is very different between male voices and female voices, so I am wondering whether your model more or less learned to tell male and female voices apart.

I mean, well, maybe, but how would that information be useful for predicting whether the speaker will be speaking in the next hundred milliseconds?

But your result is very surprising, that absolute pitch helps.

Yes, right, I think so too. I think so too, because you don't assume that speaking at a hundred and sixty-five hertz signals the same thing all the time, right? I agree that it is surprising. But of course, if you compare the absolute pitch and the speaker-normalised pitch, there is clearly a lot that the absolute pitch misses, so there is a lot to improve on; there must be some information that is still...

How do you mean?

That there is some kind of clustering inside the network, so that it sort of had one classifier for men and one for women.

Yes, actually I think you just touched on my question. I am wondering how much the modelling is doing: you are proposing a certain representation, you binarise pitch and so on, but obviously the model is probably also doing something on top of that, and I am not sure whether you have looked into it, whether you can really disentangle that. Because if someone takes a different approach, say constructs features that are temporal in nature, like looking at slopes and that kind of stuff, how much of that is the model accounting for?

It is hard to say, I guess; I cannot answer this, because of course you do not know what the model is actually doing, yes, absolutely. But the thing is, this is, I think, one way of approaching the problem while producing results which are sort of comparable across studies.

Yes, but in its absence...

You mentioned at the beginning that pitch might be flat before turn-holding and so on, and since you do not use a recurrent model, did you also consider taking the derivative of the absolute pitch, not only the absolute values?

No, we didn't, but isn't that something that the network could potentially figure out?

So that's the question.

I mean, I think so.

That was the question, whether you... I don't think you have done it, but are you planning to take this out of the corpus and see whether the kinds of differentiation your model is finding might be used productively to change the behaviour of the other speaker, like if you altered the pitch?

Right, if we were to generate speech: absolutely, that could be done.

And the other question: I was wondering what you would need to change if it was a multi-speaker situation, not just two but three or four.

Possibly. I mean, this is something that we have discussed a lot. The problem with doing this is... we had a paper at Interspeech in two thousand seventeen where we did this kind of modelling for respiratory data and turn-taking, and there the problem was that we had three speakers. You can absolutely do it, but then you would have another row here, and what you then have to do is keep sort of shifting those speakers around, because you do not want your model to rely on the fact that speaker B was on row two and speaker C was on row three. So with three speakers it is still doable; once you go into really multi-party settings, this just explodes. Then you would have to do it somehow differently, and perhaps only take into account the speakers who have been speaking within, let's say, the last five minutes or so, and then incrementally, dynamically, produce those subsets of speakers that you predict for.
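A sketch of the row-shifting idea mentioned in this answer, my illustration only: randomly permute the non-target speaker rows of each chronogram window so the model cannot rely on which interlocutor happens to sit on which row.

```python
import numpy as np

def permute_interlocutors(window, target_row, rng=np.random.default_rng()):
    """window: (n_speakers, hist) slice of activity/pitch for one training example."""
    others = [r for r in range(window.shape[0]) if r != target_row]
    out = window.copy()
    out[others, :] = window[rng.permutation(others), :]
    return out
```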

Any more questions?

I was just wondering whether you have looked into the granularity here: you are picking a hundred milliseconds; have you looked at other time windows?

We didn't, but this is, I think, a key problem, and it somehow should be addressed, absolutely. But the method itself is agnostic about this, in a sense: whatever your pitch extraction is, it will produce different pitch tracks, and whatever your voice activity detection is will also produce different results, but that is all, in some sense, preprocessing.

But still, I think...

Absolutely. Absolutely.

All right, let's thank our speaker again.