yeah so high uh and daniel angle percent these work that the to get with a cheap

you are still trying

so

hmmm

oh

Okay, so conversational systems tend to be developed using this kind of walkie-talkie turn-taking paradigm. This means that one guy is talking at a time and response times are long, mostly because they use pause duration thresholds for end-of-utterance detection.

But in human talk, more than fifty percent of all speaker shifts occur in these two situations, that is, in partial overlap or in a gap of up to two hundred milliseconds, which is supposed to be the minimum response time to pauses.

So if you want to do turn-taking when you are talking to a computer, you can divide this into two cases. The first one is when you have a long gap, longer than two hundred milliseconds.

You can handle these with some kind of prediction.

The second case here is when you have a little overlap or a short gap. This is our target for this study.

We introduce a simplified approach for this by introducing these acknowledgement moves, which are basically a backchannel type of dialogue act. This means what people say, like "hmm", "yeah" and so on.

So if you want a dialogue system to handle these, it has to do these two things, that is:

it should be a bit to continue to talk in income than uh these signals transmit

windy G in the in to complete overlap

or or to compute it should be a to that speech

well you you still a training one of

um

things

So we are talking a lot about response times here. This is the corpus that we use, the classical Edinburgh Map Task corpus. We use the face-to-face dialogues, and the task is the map task, where one guy describes a route to another. It comes with provided annotations.

Among these are the acknowledgement moves. The annotations are grouped into talk spurts, which are defined here as a minimum voice activity duration of fifty milliseconds, separated by silence durations of two hundred milliseconds. This makes the provided segmentation more perceptually relevant, and it more closely resembles an online condition.
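The talk spurt definition above can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the function name `talk_spurts`, the interval representation, the default thresholds (as heard in the talk) and the order of operations (bridge short silences first, then drop short spurts) are all assumptions.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def talk_spurts(activity: List[Interval],
                min_speech: float = 0.05,
                min_pause: float = 0.2) -> List[Interval]:
    """Merge voice-activity intervals of one speaker into talk spurts.

    Assumption: silences shorter than `min_pause` are bridged first,
    then spurts shorter than `min_speech` are discarded.
    """
    merged: List[Interval] = []
    for start, end in sorted(activity):
        if merged and start - merged[-1][1] < min_pause:
            # The silence is too short to count as a pause: extend the spurt.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    # Keep only spurts that reach the minimum voice activity duration.
    return [(s, e) for s, e in merged if e - s >= min_speech]
```

With these defaults, two stretches of speech separated by 100 ms of silence become one talk spurt, while an isolated 20 ms burst is dropped.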

So these are the twenty most frequent acknowledgement words. You can see that the top five here include "right", "okay" and "yeah". So these might actually be identified by their lexical content.

So how are they distributed with respect to overlap? Well, I took the corpus and divided it into ten-millisecond frames. Given that a frame is non-overlap, there is a five percent probability that there is an acknowledgement move.

If the frame is in overlap, the probability of an acknowledgement move is higher, so this seems to be more common in overlap.
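The frame-based count described here can be sketched like this. It is only an illustration under stated assumptions: the function names are hypothetical, a frame counts as "overlap" when both speakers are active, and every other frame counts as "non-overlap" (the talk does not say whether silent frames were included).

```python
from typing import List, Optional, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def to_frames(intervals: List[Interval], n_frames: int,
              frame: float = 0.01) -> List[bool]:
    """Mark every 10 ms frame that is covered by at least one interval."""
    flags = [False] * n_frames
    for start, end in intervals:
        first = max(0, int(round(start / frame)))
        last = min(n_frames, int(round(end / frame)))
        for i in range(first, last):
            flags[i] = True
    return flags


def ack_probabilities(speaker_a: List[Interval], speaker_b: List[Interval],
                      ack: List[Interval], n_frames: int, frame: float = 0.01
                      ) -> Tuple[Optional[float], Optional[float]]:
    """Return (P(ack | overlap), P(ack | non-overlap)) over frames."""
    a = to_frames(speaker_a, n_frames, frame)
    b = to_frames(speaker_b, n_frames, frame)
    is_ack = to_frames(ack, n_frames, frame)
    in_overlap = [x and y for x, y in zip(a, b)]

    def ratio(condition: List[bool]) -> Optional[float]:
        total = sum(condition)
        hits = sum(1 for c, k in zip(condition, is_ack) if c and k)
        return hits / total if total else None  # None: no such frames

    return ratio(in_overlap), ratio([not o for o in in_overlap])
```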

So what is going on here? I tried to go a bit deeper by computing the between-speaker interval, which is defined by the partial overlap and the gap.
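One way to compute a between-speaker interval is sketched below, assuming talk spurts are given as hypothetical `(speaker, start, end)` tuples. Gaps come out positive and partial overlaps negative, matching the negative scale used for overlap in the talk. Skipping spurts that lie completely inside the current speaker's spurt is an assumption of this sketch.

```python
from typing import List, Tuple

Spurt = Tuple[str, float, float]  # (speaker, start, end) in seconds


def between_speaker_intervals(spurts: List[Spurt]) -> List[float]:
    """Gap (> 0) or partial overlap (< 0) at each speaker shift."""
    ordered = sorted(spurts, key=lambda s: s[1])
    intervals: List[float] = []
    if not ordered:
        return intervals
    current = ordered[0]
    for nxt in ordered[1:]:
        if nxt[2] <= current[2]:
            # Completely inside the current spurt: no speaker shift here.
            continue
        if nxt[0] != current[0]:
            # Next speaker's start minus previous speaker's end.
            intervals.append(nxt[1] - current[2])
        current = nxt
    return intervals
```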

so uh what are actually going for a are the target to used an assumption didn't look at or assumption

of them all this month mode

But it was also computed for other talk spurts, to have a reference to compare with, that is, in the context of non-acknowledgement moves, both without extralinguistic sounds and including extralinguistic sounds.

so this is to drop

um

from coming station in press

And as you can see here, if you include these extralinguistic tokens you get much more overlap, which is the negative scale of the graph here.

If you compute it in the context of an acknowledgement move, however, there is not much difference. You get slightly more overlap, as you can see here in the left image.

So what does this mean? Well, it seems like the acknowledgement moves in overlap are mostly due to interaction to complete overlap.

But what we actually want to do here, for both interaction to complete overlap and interaction to silence, is to classify incoming speech as an acknowledgement move or not, as early as possible.

So I tried this scheme, which I call maximum latency classification. It's quite simple actually: there is just a sequence of talk spurts, each segmented with a minimum speech activity threshold and a minimum pause duration threshold. Normally you make the decision at the end. However, in the first case here, if the talk spurt duration plus the minimum pause threshold is larger than the maximum latency, you make it at this time instead, to minimize the response times.
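One reading of the decision rule is sketched below. The function name and default values are assumptions, as is the interpretation that the end of a talk spurt is only known once the minimum pause has elapsed, so the classifier fires at whichever comes first: that point, or the maximum latency measured from the start of the spurt.

```python
def decision_time(start: float, end: float,
                  min_pause: float = 0.2,
                  max_latency: float = 0.3) -> float:
    """Time (in seconds) at which the classifier decides for one talk spurt."""
    # The spurt end is confirmed only after min_pause of silence.
    end_based = end + min_pause
    # Hard cap on how long we are willing to wait after speech onset.
    latency_based = start + max_latency
    return min(end_based, latency_based)
```

For a 100 ms spurt with a 500 ms maximum latency the decision comes at about 300 ms after onset, while for a one-second spurt it comes at 500 ms, before the speaker has finished.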

so

How was the setup done for the maximum latency? Well, these are the durations of these two kinds of talk spurts, the acknowledgement moves and the other ones. You can see here that these are much shorter.

so

If you want to use duration as a feature for classification, you basically might have to make a trade-off: the longer you wait, the more salient duration is as a feature,

but to less of between two for

a a a a lot to watch

So I tried one hundred milliseconds, three hundred milliseconds and five hundred milliseconds, just to see what we would get.

For the acoustic detector I use this kind of parametrization. It's a length-invariant parametrization, which is basically a type of DCT, normalized in such a way that you divide by the length of the talk spurt.

This is quite useful because the basis functions are periodic, which allows for a good interpolation at the syllabic level. Length invariance gives normalization for duration or speaking rate, so you can separate these in the classifier.

The zeroth coefficient is equal to the arithmetic mean, which means that if you omit it, you only parametrize the relative shape of the trajectory.
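A minimal sketch of such a parametrization, assuming it is a DCT-II whose coefficients are divided by the number of frames. The exact transform and normalization used in the study are not stated in the talk, and the function name is hypothetical. With this normalization the zeroth coefficient is the arithmetic mean of the contour, and dropping it leaves only the relative shape.

```python
import math
from typing import List


def length_invariant_dct(contour: List[float], n_coef: int,
                         keep_zeroth: bool = True) -> List[float]:
    """DCT-II coefficients of a contour, divided by its length.

    Dividing by the number of frames makes the coefficients roughly
    independent of how long the talk spurt is (assumes n_coef <= len).
    """
    n = len(contour)
    coefs = [
        sum(v * math.cos(math.pi * (i + 0.5) * k / n)
            for i, v in enumerate(contour)) / n
        for k in range(n_coef)
    ]
    # coefs[0] is the mean; without it only the relative shape remains.
    return coefs if keep_zeroth else coefs[1:]
```

Adding a constant offset to the contour (for example a speaker-dependent F0 level) changes only the zeroth coefficient, and the same shape sampled with 10 or 100 frames gives nearly the same coefficients.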

So this is useful if you want to model F0, which has a speaker-dependent bias; intensity, where the bias is because of the distance to the microphone; and then MFCCs, which have this channel bias.

And this is the classifier setup. We use F0 envelopes, the relative shape of the intensity, and both the absolute and relative shapes of the MFCCs, because we want to see how this will affect what we can get out of it. For duration, we use the full talk spurt duration for training, while for testing we cap it at the maximum latency. Then we add the spectral flux, without much motivation.

And the classifier is a support vector machine with an RBF kernel.
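The different handling of the duration feature at training and test time, mentioned in the setup above, can be illustrated with a small sketch. The function name and the `max_latency` argument are hypothetical. The idea is that at test time the classifier cannot know a duration longer than the time it has waited.

```python
from typing import Optional


def duration_feature(start: float, end: float,
                     max_latency: Optional[float] = None) -> float:
    """Duration of a talk spurt as seen by the classifier.

    Training: pass max_latency=None to use the full duration.
    Testing:  pass the maximum latency to cap the observed duration.
    """
    duration = end - start
    if max_latency is None:
        return duration
    return min(duration, max_latency)
```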

this or that was was personal

And as you can see, it seems like F0 envelopes are the weakest feature, followed by intensity and spectral flux, while MFCCs are the strongest ones.

And it doesn't matter if you omit the zeroth coefficient, which means that we are actually only modelling the relative trajectories of these features.

As for duration, you get nothing at one hundred milliseconds, but if you wait longer, at five hundred milliseconds it becomes the second most salient feature.

so i

uh

sorry

yeah so you decided to uh include the is is it's of the sears consent

and to omit the F0 envelopes, because these were the weakest in the feature combination.

So these are the results. We tried two conditions here: one uses the provided segmentation, while the online one uses an energy-based threshold voice activity detector, because we were a little bit worried about how sensitive this time-varying parametrization was to this kind of voice activity detection.
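An energy-based threshold voice activity detector of the general kind mentioned here can be sketched as follows. The talk does not describe the actual detector, so the frame length, the threshold in dB and the function name are all assumptions chosen for illustration.

```python
import math
from typing import List


def energy_vad(samples: List[float], frame_len: int,
               threshold_db: float = -40.0) -> List[bool]:
    """Flag each non-overlapping frame as speech if its energy exceeds a threshold.

    `samples` are assumed to be floats in [-1, 1]; the level is in dB
    relative to full scale.
    """
    flags: List[bool] = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        level = 10.0 * math.log10(energy) if energy > 0 else float("-inf")
        flags.append(level > threshold_db)
    return flags
```

The per-frame flags would then be grouped into talk spurts using the minimum speech activity and minimum pause thresholds described earlier in the talk.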

It turns out that it was not that sensitive, and as you can see, the longer you wait, the better classification you get.

We were surprised to get quite decent classification at one hundred milliseconds.

So to conclude, duration and MFCCs seem to be the most salient features here.

And if you want to integrate this kind of classifier in an incremental dialogue system framework, this kind of framework can handle multiple ongoing plans, and my suggestion here is to run several classifiers in parallel: perhaps the first one would prepare decisions at one hundred milliseconds and the next ones would execute them at three hundred and five hundred milliseconds.

so and the actual implementation there was done in one spot

um

That's all I have to say. Any questions?

Thank you very much.

know

question

oh or do you have any

particularly the nation for is being able to here

but that what you get

uh no uh

well

it

um

simply the because they

you can

It seems like the annotators basically rely a lot on the lexical content.

say

i a particular reason

But it's not that either, because if you omit the zeroth coefficient, which holds the information about the actual formants, then you lose this. So what we are left with is the relative trajectories, and that would be different.

It seems to be more something about voice quality.

um uh

uh i can't explain

white what's alone

so i have a question

okay than

think again on you

thank you