yeah so hi, i'm daniel and i will present this work, which was done jointly
you are still trying
so
hmmm
oh
okay so conversational systems have typically been developed using this kind of walkie-talkie
turn-taking paradigm
this means that one guy is talking at a time and response times are long
mostly because they use a pause duration threshold for end-of-utterance detection
but
in human talk
more than fifty percent
of all speaker shifts
occur in these two situations, that is in partial overlap
or
in a gap of up to two hundred milliseconds
which is supposed to be the minimum response time to pauses
so if you want to do turn-taking
when you are talking to a computer
you can divide this into two cases, the first one is when you have a long gap
longer than two hundred milliseconds
you can handle these with end-of-utterance
predictors
the second case here is when you have a little overlap
or a short gap
this is our target for this study
and we
introduce a simplified approach for this by introducing these acknowledgement moves, which are
basically a
backchannel type of dialogue act
this means
when people say mm-hmm, yeah and so on
so if you want a dialogue system to handle this it has to do
these two things, that is
it should be able to continue to talk when these incoming signals arrive
in partial or complete overlap
and it should be able to detect that speech
while it is still uttering one of its own
things
so we are talking a lot about response times here
this is the corpus that i use, the classical Edinburgh Map Task corpus, which is Scottish English
and i use the face-to-face dialogues
and the task is the map task, where one guy explains a route to another
and it has provided annotations
among these are the acknowledgement moves
and i segmented this
into talkspurts
these are defined here as a minimum voice activity
duration threshold of fifty milliseconds
separated by pause durations of at least two hundred milliseconds
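The talkspurt definition given here (at least fifty milliseconds of voice activity, separated by pauses of at least two hundred milliseconds) can be sketched as follows. This is only an illustration: the function name `talkspurts`, the 10 ms frame step, and the order of the two steps (bridge short pauses first, then drop short segments) are assumptions, not details stated in the talk.

```python
def talkspurts(vad, frame_ms=10, min_speech_ms=50, min_pause_ms=200):
    """Return (start_ms, end_ms) talkspurts from a list of 0/1 VAD frames.

    Assumed procedure (not from the talk): bridge pauses shorter than
    min_pause_ms, then drop segments shorter than min_speech_ms.
    """
    # 1. Collect raw runs of consecutive speech frames.
    runs = []
    start = None
    for i, v in enumerate(list(vad) + [0]):  # trailing 0 closes an open run
        if v and start is None:
            start = i
        elif not v and start is not None:
            runs.append([start * frame_ms, i * frame_ms])
            start = None

    # 2. Bridge pauses that are shorter than the minimum pause duration.
    merged = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_pause_ms:
            merged[-1][1] = run[1]
        else:
            merged.append(run)

    # 3. Keep only segments that reach the minimum voice activity duration.
    return [(s, e) for s, e in merged if e - s >= min_speech_ms]


# Toy input: two bursts split by a 50 ms pause, a 20 ms blip, a 60 ms burst.
vad = [1] * 10 + [0] * 5 + [1] * 10 + [0] * 30 + [1] * 2 + [0] * 30 + [1] * 6
spurts = talkspurts(vad)
```

In the toy input the 50 ms pause is bridged and the 20 ms blip is discarded, so two talkspurts remain.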
this makes the provided segmentation more
perceptually relevant and
more closely resembles an online condition
so these are the
twenty most frequently occurring words
you can see that the top five here include right, okay and yeah
so
these might actually be identified by their lexical content
so how is this related to overlap, well i took the corpus and divided it into ten millisecond
frames
so given that the frame is non-overlap
there is a five percent probability that it is an acknowledgement move
while if it is in overlap there is a higher probability that it is an acknowledgement move
so
this seems to be more common in overlap
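The frame-level comparison described here can be sketched as a pair of conditional probabilities over ten millisecond frames. The function name, the input layout (one 0/1 list per speaker plus one 0/1 list of acknowledgement frames), and the choice to treat "non-overlap" as frames where exactly one speaker is active are all assumptions made for illustration; the toy data is invented and does not reproduce the corpus figures.

```python
def ack_probabilities(speech_a, speech_b, ack):
    """Return (P(ack | non-overlap), P(ack | overlap)) over aligned frames.

    speech_a, speech_b: 0/1 voice activity per frame for each speaker.
    ack: 0/1 per frame, 1 if an acknowledgement move is being produced.
    """
    single_total = single_ack = 0
    overlap_total = overlap_ack = 0
    for a, b, k in zip(speech_a, speech_b, ack):
        if a and b:                # both speakers active: overlap frame
            overlap_total += 1
            overlap_ack += k
        elif a or b:               # exactly one speaker active
            single_total += 1
            single_ack += k
    p_single = single_ack / single_total if single_total else 0.0
    p_overlap = overlap_ack / overlap_total if overlap_total else 0.0
    return p_single, p_overlap


# Invented toy data, eight frames.
speech_a = [1, 1, 1, 1, 0, 0, 0, 0]
speech_b = [0, 0, 1, 1, 1, 1, 0, 0]
ack      = [0, 0, 1, 0, 1, 0, 0, 0]
p_non_overlap, p_overlap = ack_probabilities(speech_a, speech_b, ack)
```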
so what is going on here
i tried to go a bit deeper by computing the between-speaker interval
which is defined by the partial overlap
and the gap
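A minimal sketch of a between-speaker interval, assuming talkspurts are given as `(start_ms, end_ms)` pairs: a gap comes out positive and a partial overlap negative, which matches the later remark that overlap sits on the negative scale of the graph. The function name and the rule of returning `None` for a talkspurt that lies completely inside the other speaker's talkspurt are assumptions for illustration.

```python
def between_speaker_interval(prev, nxt):
    """Signed interval in ms between two talkspurts by different speakers.

    prev, nxt: (start_ms, end_ms), with nxt starting after prev starts.
    Positive values are gaps, negative values are partial overlaps.
    Returns None when nxt ends before prev does (complete overlap),
    because no speaker shift takes place in that case (assumed rule).
    """
    prev_start, prev_end = prev
    nxt_start, nxt_end = nxt
    if nxt_end <= prev_end:
        return None
    return nxt_start - prev_end


gap = between_speaker_interval((0, 1000), (1150, 2000))
partial_overlap = between_speaker_interval((0, 1000), (900, 2000))
complete_overlap = between_speaker_interval((0, 1000), (400, 700))
```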
so what we are actually going for, the target, is the initiation, the interlocutor's initiation
of an acknowledgement move
but i also computed it for others, to have a reference to compare with
that is
in the context
of an acknowledgement move
and with and without extralinguistic types of sounds, i mean including the extralinguistic sounds
so this is the graph
from a publication in press
and as you can see here, if you include these extralinguistic tokens you get much more
overlap
which is
the negative scale of the graph here
while
if you compute it
in the context of an acknowledgement move
there is not much difference
if you compute it for the interlocutor's initiation of an acknowledgement move you get slightly more overlap
as you can see here
to the left in the figure
so what does this mean, well it seems like the association between acknowledgement moves
and overlap is mostly due to interaction in complete overlap
but what we actually want to do here is
for both interaction in partial or complete overlap
and interaction in silence, we need to classify incoming speech
as an acknowledgement move or not
as
early as possible
so to do this i use a setup that is called maximum
latency classification
it's quite simple actually, each talkspurt is segmented with
a minimum speech activity
threshold and a minimum pause duration threshold
and you want to make the decision at the maximum latency
however
in the first case here, if the maximum latency
is larger than
the talkspurt
duration plus
the minimum pause threshold, you make it at
that time instead
to minimize the response time
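The timing rule of maximum latency classification can be sketched as below: the decision is made at the maximum latency after talkspurt onset, unless the talkspurt has already ended and the minimum pause has elapsed before that point. The function and parameter names are assumptions for illustration; the 200 ms default follows the pause threshold mentioned earlier in the talk.

```python
def decision_time_ms(talkspurt_ms, max_latency_ms, min_pause_ms=200):
    """Time after talkspurt onset at which the classification is made.

    A short talkspurt is known to be finished once the minimum pause has
    passed, so the decision can be taken then instead of waiting for the
    full maximum latency.
    """
    end_confirmed_ms = talkspurt_ms + min_pause_ms
    return min(max_latency_ms, end_confirmed_ms)


# An 80 ms talkspurt with a 500 ms maximum latency is decided at 280 ms.
short_case = decision_time_ms(80, 500)
# A 400 ms talkspurt with a 300 ms maximum latency is decided at 300 ms.
long_case = decision_time_ms(400, 300)
```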
so
how should the maximum latency be set, well these are the durations
of these two
kinds of talkspurts, the acknowledgement moves and the other ones
you can see here that the acknowledgement moves are much shorter
so
if you want to use duration as a feature for classification
basically the longer you wait
the more salient duration will be as a feature
but you do not want to wait
for too long
so i tried one hundred milliseconds, three hundred milliseconds and five hundred milliseconds, just to see what we would get
for the acoustic features i use this kind of parametrization
it's a length-invariant parametrization which is basically a type of DCT
which is normalized such that you divide by the length of the talkspurt
this is quite useful because the basis functions are periodic, which gives
a good interpolation of the syllabic rhythm
the length invariance gives
a normalization for duration or speaking rate, so you can separate these in
the classifier
and the zeroth coefficient is equal to the arithmetic average, which means that if you omit it
you only parametrize the relative shape of the trajectory
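A minimal sketch of a length-normalized cosine transform of a feature trajectory, assuming a DCT-II style basis divided by the number of frames. The exact transform used in the talk is not recoverable from the transcript, so the formula and the name `length_normalized_dct` are assumptions. The sketch does show the two properties mentioned: the zeroth coefficient equals the arithmetic mean, and the remaining coefficients are unchanged by a constant offset, so they describe only the relative shape.

```python
import math


def length_normalized_dct(x, n_coeffs):
    """First n_coeffs cosine coefficients of x, divided by len(x).

    c[k] = (1/N) * sum_n x[n] * cos(pi * k * (n + 0.5) / N)
    Dividing by N makes c[0] the arithmetic mean of the trajectory.
    """
    n = len(x)
    coeffs = []
    for k in range(n_coeffs):
        total = 0.0
        for i, value in enumerate(x):
            total += value * math.cos(math.pi * k * (i + 0.5) / n)
        coeffs.append(total / n)
    return coeffs


trajectory = [1.0, 2.0, 4.0, 3.0]             # e.g. a toy F0 contour
coeffs = length_normalized_dct(trajectory, 3)

# Adding a constant bias (speaker, microphone distance, channel) moves
# only the zeroth coefficient; the higher ones keep the relative shape.
shifted = length_normalized_dct([v + 10.0 for v in trajectory], 3)
relative_shape = coeffs[1:]
```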
so this is useful if you want to model F0
which has a speaker-dependent bias
intensity
which has a bias because
of the distance to the microphone
and then MFCCs, which have this channel
bias
and this is the classifier setup
i use F0 envelopes
with the relative shapes, and for intensity the relative shapes of the intensity
for MFCCs
i tried both the absolute and relative shapes
because i wanted to see how this would affect
what we can get out of it
and then for duration
i used the full talkspurt duration for training, while for testing we cut it at the maximum latency
then i added the spectral flux without too much motivation
and the classifier is a support vector machine with an RBF kernel
and these are the results
and as you can see
it seems like F0 envelopes are the weakest feature
followed by intensity
and spectral flux
while MFCCs are the strongest ones
and it doesn't matter if you omit the zeroth coefficient, which means that we are actually only modeling
the relative trajectories
of these features
as for duration
you get nothing
at one hundred milliseconds
but if you wait longer
at five hundred milliseconds it becomes the second most salient feature
so i
sorry
yeah so i decided to exclude the zeroth coefficient
and also the F0 envelopes
because
these were the weakest, in the feature combination
so these are the results, and we tried two conditions here, that is offline and online
the offline one uses the provided annotations while the online one
uses an energy-based threshold
voice activity detector
because we were a little bit
worried about how sensitive this time-varying parametrization was
to this kind of
voice activity detection
it turns out that it was not that sensitive to this, and as you can see
the longer you wait the better classification you get
however
it is surprising that you get quite decent classification at one hundred milliseconds
which is great
i think
and i was surprised
so
to conclude, duration and MFCCs
seem to be the most salient features here
and if you want to integrate this kind of
classifier in
an incremental dialogue framework
a framework that
can handle multiple ongoing plans
i guess the idea here is to run several classifiers in parallel, perhaps the first one would prepare
decisions at one hundred milliseconds
and the next ones would execute them at three hundred and five hundred milliseconds
so and the actual implementation there was done in one spot
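The suggestion of several classifiers at different latencies can be sketched as a list of timed decisions, where an early decision may be revised by a later one. Everything concrete here is hypothetical: the function name, the 10 ms frame step, and the toy classifier (a plain duration threshold standing in for the real trained classifiers). The loop is sequential for simplicity, although each classifier only needs the frames up to its own latency and could run independently.

```python
def incremental_decisions(frames, classifiers, frame_ms=10):
    """Apply each classifier to the frames available at its latency.

    frames: 0/1 voice activity frames of the incoming talkspurt.
    classifiers: dict mapping latency in ms to a callable that takes
    the frames seen so far and returns True for "acknowledgement move".
    Returns a list of (latency_ms, decision), earliest first.
    """
    decisions = []
    for latency_ms in sorted(classifiers):
        seen = frames[: latency_ms // frame_ms]
        decisions.append((latency_ms, classifiers[latency_ms](seen)))
    return decisions


def looks_short(seen, frame_ms=10, limit_ms=250):
    """Toy stand-in classifier: speech so far is under limit_ms."""
    return sum(seen) * frame_ms < limit_ms


classifiers = {100: looks_short, 300: looks_short, 500: looks_short}

# A 600 ms stretch of continuous speech: the 100 ms guess is revised later.
long_speech = incremental_decisions([1] * 60, classifiers)
# 200 ms of speech followed by silence: all three decisions agree.
short_speech = incremental_decisions([1] * 20 + [0] * 40, classifiers)
```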
that's all i have to say, so any questions
thank you very much
know
question
do you have any
particular explanation for why these features work here
for what you get
no
well
it is
simply because
it seems like the annotators have basically relied
a lot on the lexical content
so that would be
a particular reason
but it's not that easy either, because if you omit the zeroth coefficient
which holds the information about
the actual formants
if you omit that then you lose this
so what you are left with is kind of the relative trajectories
and that would be different
it seems to be more something about voice quality
i can't explain
why that is so
so i have a question
okay then
thanks again to you
thank you