0:00:18so
0:00:20as we all know, turn-taking is one of the most fundamental aspects of dialogue,
0:00:26and it's something that dialogue systems are struggling with.
0:00:30if we look at human-human dialogue, we know that humans are very good at
0:00:33turn-taking: they can take the turn with
0:00:36barely any gap and very little overlap
0:00:40at the same time,
0:00:41and people make pauses within speech without the other person interrupting them.
0:00:50and this is accomplished by a number of turn-taking cues,
0:00:56as many researchers have established.
0:00:59so, syntax-wise:
0:01:01you typically yield the turn when you are syntactically complete.
0:01:05if we look at prosody, pitch is normally rising or falling when you're yielding the
0:01:10turn,
0:01:11the intensity might be lower, the phoneme duration might be shorter,
0:01:16you might breathe out.
0:01:19gaze: you look at the other speaker
0:01:21to yield the turn, and also gestures might be used.
0:01:26we also know that the more cues
0:01:30we combine, the stronger the signal is.
0:01:35and of course, for dialogue systems to properly handle turn-taking, this is something they
0:01:39have to take into account.
0:01:43and in dialogue systems, there are a number of decisions that have to be made that
0:01:47are
0:01:48related to turn-taking. so maybe the most common one that has been addressed
0:01:51is: given that the user stops speaking,
0:01:54should the system take the turn?
0:01:58of course, it would also be nice if the system could anticipate that the user
0:02:02is about to yield the turn, so that the system can start preparing a response.
0:02:08another decision: given that the user has just started to speak, is it just
0:02:12the beginning of a brief backchannel,
0:02:14or something that aims to take the turn? because that affects what the system should
0:02:17do.
0:02:19also, if the system
0:02:21is going to produce an utterance and wants to produce a pause, it would be good
0:02:26to know how likely it is that the user will try to take the turn, depending
0:02:29on the cues that the system produces.
0:02:34so
0:02:36before, these different questions have been addressed with different models, basically,
0:02:42and the problem, of course, is also that turn-taking is highly context-dependent,
0:02:47and
0:02:48the dialogue context, with all these different factors, is of course very hard to model.
0:02:54so what
0:02:56we would like, or at least what i would like to have,
0:02:59is a model that is more general: a model that can apply
0:03:02to many different turn-taking decisions;
0:03:05that is continuous, so you can apply it continuously, not just at specific
0:03:10events that happen;
0:03:12it should also be predictive, so you shouldn't just classify the current state but be
0:03:16able to predict what will happen in the future, so that the system can start
0:03:20preparing;
0:03:21and it should also be probabilistic, not just binary decisions.
0:03:26so what i propose is that we
0:03:29use a recurrent neural network for this, and the model that i have been working
0:03:35on works like this: we have two speech channels from two
0:03:39speakers,
0:03:41which can be two humans, as if we are predicting between two humans, but
0:03:45it could also be human and system speech.
0:03:48we segment the speech into slices which are fifty milliseconds long, so twenty frames
0:03:52per second.
0:03:54we do feature extraction and feed it into a recurrent neural network using lstm,
0:04:02to be able to capture long temporal dependencies, and at each frame
0:04:08we make a prediction
0:04:09for the next three seconds:
0:04:12what is the likelihood of,
0:04:15let's say,
0:04:17speaker zero here
0:04:19speaking in this future time window.
0:04:24so it sees both speakers, but we make the prediction for one speaker here,
0:04:29and then we train it with what's actually happening in the future,
0:04:34so those are the training labels.
0:04:37and when we do this, we of course want to be able to model
0:04:40both speakers, so if we have speakers a and b,
0:04:44we first train the whole thing with a being speaker zero and b as
0:04:48speaker one, and then we switch them around so a is speaker one; in these experiments we
0:04:52trained it from both perspectives.
0:04:57then, at application time, we run two neural networks at the same time to
0:05:01make predictions for both speakers.
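To make this concrete, here is a minimal sketch of the kind of network described, written in PyTorch for illustration (the talk itself used a different toolkit). The layer sizes, the one-sigmoid-per-future-bin output parameterization, and the feature count are assumptions, not the exact architecture from the talk:

```python
# Minimal sketch of a frame-level turn-taking predictor; sizes are assumptions.
import torch
import torch.nn as nn

FRAME_RATE = 20                  # 50 ms frames -> 20 frames per second
FUTURE_BINS = 3 * FRAME_RATE     # predict 3 seconds into the future

class TurnTakingLSTM(nn.Module):
    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.out = nn.Linear(hidden, FUTURE_BINS)

    def forward(self, x):
        # x: (batch, frames, n_features), covering the dialogue so far
        h, _ = self.lstm(x)                # hidden state at every frame
        return torch.sigmoid(self.out(h))  # (batch, frames, FUTURE_BINS)

# Training targets are the target speaker's future voice activity:
# target[t, k] = 1 if that speaker is speaking at frame t + k + 1.
model = TurnTakingLSTM(n_features=30)
loss_fn = nn.BCELoss()

# As described above, each dialogue is used twice for training (once per
# speaker as the target), and two instances run in parallel at run time.
```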
0:05:05the features that we have been using are voice activity, and pitch and power,
0:05:11normalized per speaker; we don't do any further processing
0:05:15of those or anything, because we think that the network should figure these things
0:05:19out.
0:05:20we use a measure of spectral stability to capture final lengthening,
0:05:24and we also use part-of-speech tags:
0:05:28so at the end of each word we feed in a one-hot representation of
0:05:32the part of speech that has just been produced.
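As a rough illustration, here is a sketch of assembling the feature vector for a single 50 ms frame of one speaker, following the description above. The tag set, the z-score normalization, and the function name are assumptions for illustration:

```python
# Sketch of per-frame feature assembly for one speaker.
import numpy as np

# A hypothetical tag set; the talk does not specify which tags were used.
POS_TAGS = ["NOUN", "VERB", "ADP", "DET", "PRON", "ADJ", "ADV", "OTHER"]

def frame_features(vad, pitch, power, stability, pos_tag=None,
                   pitch_stats=(0.0, 1.0), power_stats=(0.0, 1.0)):
    """Feature vector for one 50 ms frame of one speaker."""
    z_pitch = (pitch - pitch_stats[0]) / pitch_stats[1]  # per-speaker z-score
    z_power = (power - power_stats[0]) / power_stats[1]
    pos = np.zeros(len(POS_TAGS))
    if pos_tag is not None:                 # a word ended in this frame
        pos[POS_TAGS.index(pos_tag)] = 1.0  # one-hot pulse for a single frame
    return np.concatenate(([vad, z_pitch, z_power, stability], pos))
```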
0:05:36we compared a full model that uses all of these inputs
0:05:40with
0:05:43a prosody model that uses everything but the part-of-speech tags, to see how much the
0:05:46part of speech actually helps.
0:05:48we used the DeepLearning4J toolkit.
0:05:52we have used the Map Task corpus for this, which we divided into
0:05:57training dialogues and test dialogues;
0:06:00that gives us about ten hours of training data.
0:06:04we use the manually labelled voice activity, which should be possible to extract
0:06:08automatically,
0:06:09and the manually labelled parts of speech, whereas the prosody is extracted automatically.
0:06:15i can show you in a video what the predictions look like when we
0:06:19run it
0:06:21continuously online. so these are the predictions:
0:06:24the red is the point where the prediction is made, where we are now,
0:06:28and the green is the probability, so the higher the curve, the more likely it
0:06:32is that this person will speak in this future time window.
0:06:37afterwards, of course, you will also see the future, what was actually going to happen,
0:06:45so you can see how well the predictions
0:06:50match the speech that actually follows.
0:06:59okay, so
0:07:03i have looked at two different tasks that we can use this model for.
0:07:09one very common task is to predict,
0:07:11given a pause, who is the most likely next speaker, and this is an example
0:07:17where you can see that:
0:07:20here, one person has just stopped speaking, and we can see that it makes a
0:07:24fairly good prediction in this case.
0:07:26it's not certain:
0:07:27it thinks it will take some time, and this person might continue,
0:07:31but it's quite likely that this person will produce a response, though it's not going to
0:07:35be very long. so it makes a very good prediction.
0:07:40here is another prediction.
0:07:42so that was a turn shift it was predicting; here it is predicting that the
0:07:46speaker will actually continue speaking,
0:07:48a fairly high prediction, and it is not very likely that the other person will produce a response.
0:07:54so to make it easy, i made this into a binary classification task:
0:07:59at the pause, we basically take
0:08:01the average prediction over the future window for the two speakers, compare them, and say:
0:08:08is it a shift
0:08:10or a hold?
0:08:11and then we can compute an f-score and see how well it does, and we
0:08:15can compare it with other methods for doing this.
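A minimal sketch of that decision rule and its evaluation, under the assumption that the model's output at the frame where the speaker falls silent is one future-window probability vector per speaker (names and shapes are illustrative):

```python
# Sketch of the binary hold/shift decision described above.
import numpy as np
from sklearn.metrics import f1_score

def hold_or_shift(pred_current, pred_other):
    """Decide at the frame where the current speaker goes silent, using the
    average predicted speaking probability over the future window."""
    return "shift" if np.mean(pred_other) > np.mean(pred_current) else "hold"

# Over all pauses in the test data, with gold labels and model decisions:
# f1_score(gold, decisions, pos_label="shift") gives the shift F-score.
```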
0:08:18this is the number of training epochs; the blue is the full model, the
0:08:21red is the prosody model.
0:08:23we can see that the prosody model stabilizes, whereas the full model continues to
0:08:27learn.
0:08:32so the best prediction we get for this
0:08:35is with the full model; you can see the numbers here:
0:08:41the f-score is around
0:08:42zero point seven six. it's hard to know, of course, whether this is good or
0:08:46not good.
0:08:47it's
0:08:48impossible, of course, to get a hundred percent, because turn-taking is highly optional: it's not
0:08:52always the case that it's obvious whether someone will take the turn or the speaker will continue speaking.
0:08:57of course, if we compare to the majority-class baseline (always hold the turn),
0:09:02it is much better, but that's not very interesting. so we let humans listen to these
0:09:07dialogues
0:09:09up to this point in the dialogue and try to estimate who will be the next
0:09:12speaker,
0:09:14using crowdsourcing,
0:09:17and they didn't perform as well. we also tried
0:09:21more traditional modeling, where we just
0:09:25try to model as well as possible the features we have at that point and
0:09:28make a one-shot decision with the best classifiers;
0:09:34they did not perform as well either, as we can see.
0:09:36this is also comparable to what we find in the literature, where people have done similar tasks
0:09:41with more traditional modeling.
0:09:45we also compared what happens if we look at different pause lengths, so how
0:09:49far into the pause we make the decision,
0:09:52and we see that once we are two hundred and fifty milliseconds into the pause, we
0:09:55make a fairly good prediction of who will be the next speaker;
0:09:59it doesn't really improve much as the pause gets longer.
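As a small illustration of that evaluation, a hedged sketch of making the decision a fixed number of 50 ms frames into the pause (the function name and array layout are assumptions):

```python
# Sketch: next-speaker decision a fixed time into a pause.
def next_speaker_at(preds_a, preds_b, pause_start, offset_frames=5):
    """Decide the next speaker `offset_frames` x 50 ms into a pause
    (5 frames = 250 ms). preds_a/preds_b: per-frame future-window
    predictions for each speaker, shape (frames, FUTURE_BINS)."""
    t = pause_start + offset_frames
    return "A" if preds_a[t].mean() > preds_b[t].mean() else "B"
```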
0:10:03so the next task was the prediction at speech onsets. this
0:10:07is interesting:
0:10:09someone has just started to speak, as we can see here,
0:10:12and we want to know: is this likely to be a very short utterance,
0:10:17a backchannel, or is it likely to be a longer utterance? if it is a long utterance,
0:10:20maybe the dialogue system, which has just stopped speaking, should let the other person
0:10:24take the turn; otherwise it might want to continue speaking.
0:10:29here it makes a fairly good prediction, and you can see the slope is
0:10:34going down very quickly, so it's going to be a short utterance, whereas here it makes a
0:10:39prediction
0:10:41of a much longer utterance. we are here, yes,
0:10:46at the same
0:10:47point into the utterance, and as you can see, the predictions are quite different.
0:10:52to make the task binary again, we divide between short and long utterances that we
0:10:57find
0:10:59in the test data.
0:11:02so, short and long utterances: in both cases we are one half second into the speech;
0:11:08short utterances are not allowed to be more than half a second more, and long utterances
0:11:12have to be more than
0:11:13two and a half seconds.
0:11:17and then we average the
0:11:21speaking probability that is predicted over the future time window,
0:11:24and this is a histogram showing what the average predicted
0:11:29speaking probabilities are for the short utterances and for the longer utterances.
0:11:33so you can see it gives a fairly good separation,
0:11:36just using this very simple method (it could be made more sophisticated, of course),
0:11:41and an f-score of
0:11:42zero point seven six.
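A sketch of this simple method, assuming the decision is taken ten frames (0.5 s) after the onset and that the short/long threshold would be tuned on held-out data (both are assumptions):

```python
# Sketch of the short-vs-long decision at a speech onset.
def classify_onset(preds, onset, threshold=0.5):
    """Decide 10 frames (0.5 s) after a speech onset whether the new
    utterance will be short or long, from the average predicted speaking
    probability over the future window. The threshold is an assumption."""
    avg = preds[onset + 10].mean()
    return "long" if avg > threshold else "short"
```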
0:11:45again, if we compare to the majority-class baseline or
0:11:48to more traditional modeling, we get
0:11:53better performance, and also if we compare to similar tasks that have
0:11:58been done before.
0:12:02okay, so this looks very promising. of course, the question is: can this be
0:12:07used for a
0:12:09spoken dialogue system?
0:12:12so we took a corpus we had of human-robot interaction,
0:12:18which was already annotated at the end of each user speech segment for whether
0:12:23this was a good place to take the turn or not.
0:12:25and we ran the network with the synthesized speech from the system and the
0:12:30user speech, and we computed the predictions just like we did
0:12:35before.
0:12:36and of course, since these are
0:12:40very different types of dialogue, the Map Task dialogue and the human-computer dialogue, direct
0:12:46application (we used the prosody model) didn't give a very good f-score: it's better than
0:12:52baseline, but not very useful.
0:12:54so what can we do?
0:12:57well, maybe at least we can use the recurrent neural network as a feature extractor,
0:13:01as a representation of the current turn-taking dialogue state.
0:13:06so we take the lstm layers and we
0:13:10train, with supervised learning, a logistic regression that is to predict whether this is a
0:13:16good place to take the turn,
0:13:21and then we get fairly good
0:13:24results with cross-validation.
0:13:29and it also performs well if we train it with just twenty percent of
0:13:33the data,
0:13:35so that's promising.
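A minimal sketch of this transfer setup, with random placeholder data standing in for the extracted LSTM states (the 64-dimensional state size and the counts are assumptions):

```python
# Sketch: trained network's LSTM state at the end of each user segment,
# fed to a logistic regression that predicts "good place to take the turn".
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
lstm_states = rng.normal(size=(200, 64))  # placeholder LSTM state per segment
labels = rng.integers(0, 2, size=200)     # 1 = good place to take the turn

clf = LogisticRegression(max_iter=1000).fit(lstm_states, labels)
print("training accuracy:", clf.score(lstm_states, labels))
```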
0:13:40so, on to future work.
0:13:44we think we need more human-robot interaction data like that;
0:13:49Map Task is highly specific, and of course
0:13:53it's not very similar to
0:13:55human-machine interaction, so we could for example train on Wizard-of-Oz data.
0:14:01also, the way we have used it now is very coarse: we just average
0:14:06these two predictions
0:14:09and compare them, and that doesn't really do justice to the model, which makes a
0:14:13much more fine-grained
0:14:16prediction. also, what's interesting is that as we go along in these pauses, the
0:14:20predictions update during the pause, so we can make continuous decisions while the pause is unfolding,
0:14:29and also make use of the probabilities, of course, for example in a decision-theoretic
0:14:34framework.
0:14:36multimodal interaction: of course, we have data from
0:14:42face-to-face interaction,
0:14:45and of course we know that gaze and gesture and so on are very important,
0:14:48so that should be highly useful.
0:14:50and also multi-party interaction: the model applies very well to multi-party, since each
0:14:54speaker is modeled with its own network,
0:14:59so we could apply it to any number of speakers.
0:15:02thank you
0:15:29so we are trying to feed a feature vector for what's happening during these
0:15:34fifty milliseconds; if we have pitch, for example, we take the average pitch in that small
0:15:40window.
0:15:45sorry, the
0:15:49pos tags, yes:
0:15:51as soon as a word is finished, we take a one-hot representation
0:15:55of the pos tag and feed it into the network
0:16:00at that frame.
0:16:02as soon as the word ends, we add the tag to it, and then
0:16:06it's zeros again for the pos tags.
0:16:09so it's just for one frame that you get the value for that part of speech.
0:16:19thanks for the talk. it's more a clarification question: the two prediction tasks that you presented,
0:16:24were those separate networks that you were training, or the
0:16:29same network with two output layers?
0:16:35it's the same network that is trained.
0:16:37so it's not separate for the two sorts of roles or anything; we run two
0:16:41instances of the same network.
0:16:43okay, so is it a kind of multitask learning?
0:16:45i mean, you just have two different ways of prediction, but the latent representation is the same?
0:16:52no. at application time they're completely different. the two networks both get the words
0:16:57from both speakers;
0:16:59it's just that each network makes predictions for
0:17:02one of the speakers.
0:17:04right, but the model itself, the parameters that you are learning,
0:17:09are they trained completely in isolation, or trained at the same time
0:17:13for the two prediction tasks?
0:17:15no, there is one prediction task: i mean, the model is trained to predict what's happening
0:17:20at each frame,
0:17:22and then we can apply the same model to different tasks,
0:17:25so we can see what the model predicts at a speech onset, what the
0:17:28model predicts at the beginning of a pause.
0:17:31okay. so that's why i wanted a general model: it's the
0:17:34same model that is applied to the different tasks.
0:17:47so, thanks for a great talk. so,
0:17:50the model includes temporal information in the predictions,
0:17:54so i wanted to ask if you could talk a little bit about
0:17:59how you imagine systems could use that kind of temporal information.
0:18:05you talked about long versus short utterances; i think
0:18:09you could say, okay, this is the right time for a short utterance, but are more
0:18:13detailed uses of the temporal
0:18:16predictions to come?
0:18:18so if it's for the user utterance: if it's a
0:18:23short utterance, if i expect the user to make a short utterance, i don't stop speaking;
0:18:28i might continue speaking, for example, because it's okay in turn-taking for someone to
0:18:32have a very brief utterance,
0:18:34whereas if they are initiating a longer response,
0:18:38i might have to stop speaking and yield the turn, for example. so that's a
0:18:44temporal aspect.
0:18:53coming back to the pos tags: what was the intuition for including
0:18:58that as a feature?
0:18:59so, the pos tags: the intuition for including that feature is that
0:19:03syntax has a lot to do with turn-taking, because the syntax
0:19:11is a strong cue. typically, if i say
0:19:15"and then i want to go to",
0:19:18you know that i'm going to continue, because
0:19:21that was a preposition; it's usually the last word that matters. whereas in an example where i say "i
0:19:25want to go to the bus stop",
0:19:28a noun like that typically signals that i
0:19:44am done, since it ends with a noun.
0:19:54so in general, we try to give it as low-level information
0:19:58as possible and hope that it will figure things out,
0:20:02and typically i don't think you need
0:20:04anything much more complicated. i mean, i think it's the last pos tags
0:20:08that are going to influence the decision, and
0:20:11my intuition is that a deeper syntactic analysis wouldn't help that much.
0:20:17okay, thanks. let's thank the speaker again.