So, as we all know, turn-taking is one of the most fundamental aspects of dialogue, and it's something that dialogue systems are struggling with.

If we look at human-human dialogue, we know that humans are very good at turn-taking: they can take the turn with barely any gap and very little overlap, and at the same time people can make pauses within their speech without the other person interrupting them.

This is accomplished by a number of turn-taking cues, as many researchers have established.

So, syntax-wise, you typically yield the turn when you are syntactically complete. If we look at prosody: pitch is normally rising or falling when you are yielding the turn, the intensity might be lower, the phoneme duration might be shorter, you might breathe out. Gaze: you look at the other person to yield the turn, and gestures might also be used.

We also know that the more cues we combine, the stronger the signal is. And of course, for dialogue systems to properly handle turn-taking, this is something they have to take into account.

In dialogue systems, there are a number of decisions that have to be made that are related to turn-taking. Maybe the most common one that has been addressed is: given that the user has stopped speaking, should the system take the turn?

Of course, it would be nice with systems that could anticipate that the user is yielding the turn, so that the system can start preparing a response.

Another decision: given that the user has just started to speak, is this just the beginning of a brief backchannel, or an attempt to take the turn? That affects what the system should do.

Also, if the system is going to produce an utterance and wants to make a pause, it would be good to know how likely it is that the user will try to take the turn, depending on the cues that the system produces.

So far, these different questions have been addressed with different models, basically. And part of the problem, of course, is that turn-taking is highly context-dependent, and the dialogue context, with all these different factors, is very hard to model.

What we would like to have, or at least what I would like to have, is a model that is more general: a model that can be applied to many different turn-taking decisions; that is continuous, so it can be applied continuously, not just at specific events; that is predictive, so it shouldn't just classify the current state but be able to predict what will happen in the future, so that the system can start preparing; and that is probabilistic, making not just binary decisions.

So what I propose is that we use a recurrent neural network for this. The model that I have been working on works like this: we have two speech channels from the two speakers, which can be two humans, as when we are predicting between two humans, but it could also be a human and the system's speech. We segment the speech into slices which are fifty milliseconds long, so twenty frames per second. We do feature extraction and feed it into a recurrent neural network using LSTM units, to be able to capture long-range dependencies, and at each frame we make a prediction for the next three seconds:

what is the likelihood that this speaker, say speaker zero here, is speaking in this future time window?

So we feed it with both speakers, but we make predictions for one speaker here, and then we train it with what is actually happening in the future; those are the training labels.
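To make this concrete, here is a minimal sketch of that setup in PyTorch. The talk does not give implementation details beyond the 50 ms frames, the three-second window, and the actual future voice activity as training target, so the layer sizes, names, and the binary cross-entropy loss are my own assumptions (the original work used a different toolkit, mentioned below).

```python
import torch
import torch.nn as nn

FRAMES_PER_SEC = 20                   # 50 ms slices, as in the talk
FUTURE_WINDOW = 3 * FRAMES_PER_SEC    # predict speech activity 3 s ahead

class TurnTakingLSTM(nn.Module):
    def __init__(self, n_features: int, hidden_size: int = 64):
        super().__init__()
        # features from BOTH speakers are concatenated at each frame
        self.lstm = nn.LSTM(2 * n_features, hidden_size, batch_first=True)
        # at each frame, one sigmoid output per future frame:
        # P(this speaker is speaking at frame t + k), k = 1..60
        self.out = nn.Linear(hidden_size, FUTURE_WINDOW)

    def forward(self, x):                  # x: (batch, time, 2 * n_features)
        h, _ = self.lstm(x)
        return torch.sigmoid(self.out(h))  # (batch, time, FUTURE_WINDOW)

def future_labels(vad: torch.Tensor) -> torch.Tensor:
    """Training targets: the actual voice activity (0/1 per frame) of the
    predicted speaker over the next 60 frames, for every frame t."""
    T = vad.shape[0]
    y = torch.zeros(T, FUTURE_WINDOW)
    for t in range(T):
        chunk = vad[t + 1 : t + 1 + FUTURE_WINDOW]
        y[t, : chunk.shape[0]] = chunk     # zero-padded at the end of the dialogue
    return y

# a natural training loss under these assumptions:
loss_fn = nn.BCELoss()
```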

When we do this, we of course want to be able to model both speakers, so if we have speakers A and B, we first train the whole thing with A as speaker zero and B as speaker one, and then we switch them around, so that A is speaker one; in these experiments we trained it from both perspectives.

At application time, we run two instances of the network at the same time, to make predictions for both speakers.

The features that we have been using are voice activity, and pitch and power, normalized per speaker; we don't do any further hand-crafted processing of these, because we think that the network should figure these things out itself.
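A small sketch of what that per-speaker normalization could look like (the talk only says pitch and power are normalized for the speaker; z-normalization is my assumption):

```python
import numpy as np

def z_normalize_per_speaker(track: np.ndarray) -> np.ndarray:
    """z-normalize one speaker's feature track (e.g. pitch or power)."""
    return (track - track.mean()) / (track.std() + 1e-8)  # epsilon for all-silence tracks
```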

We also use a measure of spectral stability, to capture phrase-final lengthening.

And we use part-of-speech tags: at the end of each word, we feed in a one-hot representation of the part of speech that has just been produced.
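As the question-and-answer session at the end clarifies, the one-hot vector is active only at the single frame where the word ends and is all zeros otherwise. A sketch of that input track (the tag set and names are illustrative assumptions):

```python
from typing import List, Tuple
import numpy as np

POS_TAGS = ["NOUN", "VERB", "ADP", "DET", "PRON", "ADJ", "ADV", "OTHER"]

def pos_feature_track(n_frames: int,
                      word_ends: List[Tuple[int, str]]) -> np.ndarray:
    """word_ends: (frame_index, pos_tag) for the frame where each word ends."""
    track = np.zeros((n_frames, len(POS_TAGS)))
    for frame, tag in word_ends:
        track[frame, POS_TAGS.index(tag)] = 1.0  # one-hot at that single frame only
    return track
```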

We compared two models: a full model that uses all of these inputs, and a prosody model that uses everything but the part-of-speech tags, to see how much the part of speech actually helps.

We used the Deeplearning4j toolkit.

We used the Map Task corpus for this, which we divided into ninety-six training dialogues and thirty-two test dialogues; that gives us about ten hours of training data.

We used the manually labelled voice activity, which we would expect to be possible to extract automatically, and the manually labelled words for the part-of-speech tags, whereas the prosody was extracted automatically.

I can show you in a video what the predictions look like when we run the model continuously online. So these are the predictions: the red line is the point in time where we are now, and the green is the probability, so the higher the curve, the more likely it is that this person will be speaking in this future time window.

In the video you can of course also see into the future, what is actually going to happen, which the model cannot; this is just to illustrate how it behaves: the higher the curve rises, the more likely it is, according to the model, that this person will be speaking. It cannot, of course, determine exactly what will happen several seconds from now. Okay, so I have looked at two different tasks that we can use this model for.

One very common task is to predict, given a pause, who is the most likely next speaker. This is an example where you can see that one person here has just stopped speaking, and we can see that the model makes a fairly good prediction in this case.

It thinks it will take some time before this person continues, but that it's quite likely that the other person will produce a response, although not a very long one, so it makes a very good prediction.

Here is another prediction: the previous one was a turn shift it was predicting; here it is predicting that the current speaker will actually continue speaking, with a fairly high prediction, and that it is not very likely that the other person will produce a response.

To make it easy, I turned this into a binary classification task: at the pause, we basically take the average prediction over the future window for the two speakers, compare them, and say: is it a shift or a hold?
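A minimal sketch of that decision rule (variable names are illustrative; the averaging-and-comparing step is as described above):

```python
import numpy as np

def shift_or_hold(pred_current: np.ndarray, pred_other: np.ndarray) -> str:
    """pred_current / pred_other: the per-frame speaking probabilities predicted
    over the future window for the speaker who just stopped and for the other
    speaker, taken at the start of the pause."""
    return "shift" if pred_other.mean() > pred_current.mean() else "hold"
```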

Then we can compute an F-score, see how well it does, and compare it with other methods for doing this.

This axis is the number of training epochs; the blue is the full model, the red is the prosody model. We can see that the prosody model stabilizes, whereas the full model continues to learn.

So the best prediction we get for this is with the full model; you can see the numbers here. It's hard to know, of course, whether this is good or not. It's impossible, of course, to get a hundred percent, because turn-taking is highly optional: it's not always obvious whether a speaker will take the turn or continue speaking.

Of course, if we compare it to the majority-class baseline, always predicting hold, the model is much better, but that's not very interesting. So we also let humans listen to these dialogues, up to the point of the pause, and try to estimate who would be the next speaker, using crowdsourcing.

And they did not perform as well. We also tried more traditional modelling, where we simply try to model, as well as possible, the features we have at that point and make a one-shot decision, and the best such classifiers did not perform as well either, as we can see. This is also comparable to what we find in the literature, where people have done similar tasks with more traditional modelling.

We also compared what happens if we look at different pause lengths, that is, how quickly into the pause we make the decision. We see that when we are only fifty milliseconds into the pause, we already make a fairly good prediction of who will be the next speaker; it doesn't really matter how long the pause eventually is.

So, the next task we looked at was prediction at speech onsets, and this is interesting: someone has just started to speak, as we can see here, and we want to know whether this is likely to be a very short utterance, a backchannel, or a longer utterance. If it is a long utterance, maybe the dialogue system, if it was speaking, should stop and let the other person take the turn, and otherwise continue speaking.

Here the model makes a fairly good prediction: you can see the curve is going down very quickly, so this is going to be a short utterance, whereas here it predicts a much longer utterance. We are at the same point into the utterance in both cases, but as you can see, the predictions are very different.

To make the task binary again, we divided between short and long utterances that we find in the test data. In both cases we are one half second into the speech; short utterances are not allowed to be more than half a second longer than that, and long utterances have to be longer than two and a half seconds.

Then we average the speaking probability that is predicted over the future time window.
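A minimal sketch of that decision (the 0.5 cut-off is an illustrative assumption, not a value from the talk):

```python
import numpy as np

def short_or_long(pred_speaker: np.ndarray, threshold: float = 0.5) -> str:
    """pred_speaker: the per-frame speaking probabilities predicted over the
    future window for the speaker who just started, taken 0.5 s after onset."""
    return "long" if pred_speaker.mean() > threshold else "short"
```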

This is a histogram showing the average predicted speaking probabilities for the short utterances and for the longer utterances, and you can see that it gives a fairly good separation.

Just using this very simple method, which could of course be made more sophisticated, we get an F-score of 0.76.

Again, if we compare to the majority-class baseline or to more traditional modelling, we get better performance, also compared to similar tasks that have been done before.

Okay, so this looks very promising; the question, of course, is whether this can be used in a spoken dialogue system.

So we took a corpus we had of human-robot interaction, which was already annotated, at the end of each user speech segment, with whether this was a good place to take the turn or not.

We fed the network with the synthesized speech from the system and the user's speech, and we computed the predictions just like we did before.

And of course, since these are very different types of dialogue, the map task dialogue and this human-computer dialogue, direct application, where we used the prosody model, did not give a very good F-score: better than the baseline, but not very useful.

So what can we do about that? Well, maybe we can at least use the recurrent neural network as a feature extractor, as a representation of the current turn-taking dialogue state.

So we take the LSTM layers, and on top of them we train, with supervised learning, a logistic regression that predicts whether this is a good place to take the turn, and then we get fairly good results with cross-validation.
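A sketch of that transfer step (the talk only says a logistic regression is trained on the LSTM layers and evaluated with cross-validation; the 5-fold setup and the names here are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def train_turn_classifier(hidden_states: np.ndarray, labels: np.ndarray):
    """hidden_states: (n_segments, hidden_size) frozen LSTM states taken at the
    end of each user speech segment; labels: 1 = good place to take the turn."""
    clf = LogisticRegression(max_iter=1000)
    f1 = cross_val_score(clf, hidden_states, labels, cv=5, scoring="f1").mean()
    clf.fit(hidden_states, labels)
    return clf, f1
```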

It also performs well if we train it with just twenty percent of the data, so that's promising.

So, on to future work. We think we need more human-robot interaction data like that; the map task is highly specific, and of course it's not very similar to human-machine interaction, so we could for example train on Wizard-of-Oz data.

Also, the way we have used the model so far is very coarse: I just averaged these predictions and compared them, and that doesn't really do justice to the model, which makes a much more fine-grained prediction.

Also, what's interesting is that as time goes along during these pauses, the predictions update while the pause is unfolding, so we can make continuous decisions, and also make use of the probabilities, for example in a decision-theoretic framework.
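As a sketch of what such continuous decision-making during a pause could look like (everything here, including the threshold, is an illustrative assumption rather than something from the talk):

```python
def decide_during_pause(user_prob_stream, take_threshold: float = 0.3):
    """user_prob_stream yields, once per 50 ms frame during a pause, the user's
    average predicted speaking probability over the future window; the system
    takes the turn once that probability has dropped low enough."""
    for p_user_speaks in user_prob_stream:
        if p_user_speaks < take_threshold:
            return "take_turn"      # model is confident the user is yielding
    return "user_resumed"           # stream ended because the user spoke again
```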

Multimodal interaction: we have data from face-to-face interaction, and of course we know that gaze and gesture and so on are very important, so that should be highly useful.

And also multi-party interaction: the model extends very naturally to multi-party, since each speaker is modelled with its own network, so we could apply it to any number of speakers.

Thank you.

So, we are trying to feed in a feature for what is happening during each fifty-millisecond slice: if we have pitch, for example, we take the average pitch in that small window.

Sorry, the... So what we do is: as soon as a word is finished, we take a one-hot representation of its POS tag and feed it into the network at that frame. As soon as the word ends, we feed the tag in, and then the POS inputs go back to zeros again. So it's just for one frame that you get the value for that part of speech.

Thanks for the talk; a brief clarification question: for the two prediction tasks that you were presenting, were those separate networks that you were training, or the same network with two output layers?

It's the same network that is trained.

So it's not two networks for the two sorts of roles or anything; we run two instances of the same network.

Okay, so it's a kind of multitask learning? I mean, you just have two different ways of prediction, but the latent representation is the same?

No, at application time they are completely separate: the two networks both get the input from both speakers; it's just that each network makes predictions for one of the speakers.

Right, but the model itself, the parameters that you are learning, are they trained completely in isolation, or trained at the same time for the two prediction tasks?

No, the model is trained to predict what is happening at each frame, and then we can apply the same model to different tasks: we can see what the model predicts at speech onset, and what the model predicts at the beginning of a pause. That is why I wanted a general model: it's the same model that is applied to the different tasks.

Thanks for a great talk. The model includes temporal information in the prediction, so I wanted to ask if you could talk a little bit about how you imagine systems could use that kind of temporal information.

I talked about long versus short utterances, I think; so the system could say, okay, this is the right time for a short utterance. Or do you mean in more detail than that?

So, for the user's utterances: if I expect the user to produce a short utterance, I don't have to stop speaking; I might continue speaking, for example, because in turn-taking it's okay for someone to produce a very brief utterance in overlap. Whereas if the user is initiating a longer response, I might have to stop speaking and yield the turn, for example. So that's the temporal aspect.

Coming back to the POS tags: what exactly was the intuition for including that as a feature?

So, the POS tag: we cannot feed in syntactic completeness directly, but the POS tag has a lot of relation to it, because the syntax is a strong cue. Typically, if I say "I want to go to...", you know that I am going to continue, because the last word was a preposition, whereas in an example like "I want to go to the bus stop", the noun at the end typically signals that I am done.

So, in general, we try to give the network as low-level information as possible and hope that it will figure things out, and typically I don't think you need anything much more complicated: I think it's the last POS tags that are going to influence the decision, and my intuition is that a deeper syntactic analysis wouldn't help that much.

Okay, thank you; let's move on to the next speaker.