0:00:13 Yeah, so hi, I'm Daniel, and I'm presenting this work, which was done together with my co-author.
0:00:27 Okay, so conversational systems today are developed using this kind of walkie-talkie turn-taking paradigm, which means that one party talks at a time and response times are long. That is mostly because they use a pause duration threshold for end-of-utterance detection.
0:00:43 But in human-human conversation, more than fifty percent of all speaker shifts happen in two situations: either in partial overlap, or with a gap of up to two hundred milliseconds, which is supposed to be the minimum response time to a pause.
0:01:03 So if you want to do turn-taking when talking to a computer, you can divide this into two cases. The first one is when you have a long gap, longer than two hundred milliseconds; this can be handled by end-of-utterance predictors.
0:01:19 The second case is when you have a small overlap or a short gap. This is the target of this study. We introduce a simplified approach to it by focusing on the acknowledgement move, which is basically a backchannel type of dialogue act; it covers the things people say like "hmm", "yeah", and so on.
0:01:42 So if you want a dialogue system to handle this, it has to do two things: it should be able to continue talking when it receives these signals, whether they come in a gap or in complete overlap, and it should be able to detect such speech while it is still producing one of its own utterances.
0:02:00 We have been talking a lot about response times here. This is the corpus we used: the classical Edinburgh Map Task corpus, and we used the face-to-face dialogues. The task is a map task, where one participant explains a route to another. The corpus comes with provided dialogue move annotations, among them the acknowledgement move, which is mostly realized as things like "uh-huh" and "yeah".
0:02:26 A talkspurt is defined here as a stretch of voice activity with a minimum duration of fifty milliseconds, separated from other talkspurts by pauses of at least two hundred milliseconds. This makes the provided segmentation more perceptually relevant and more closely resembles an online condition.
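As a rough illustration of the talkspurt definition above, the following sketch (in Python; the frame length, variable names and helper function are my own assumptions, not the authors' code) extracts talkspurts from a frame-level voice activity sequence using the 50 ms minimum speech and 200 ms minimum pause thresholds mentioned in the talk.

FRAME_MS = 10          # assumed analysis frame step
MIN_SPEECH_MS = 50     # minimum voice activity duration (from the talk)
MIN_PAUSE_MS = 200     # minimum pause separating two talkspurts (from the talk)

def talkspurts(vad):
    """vad: list of 0/1 voice activity decisions, one per frame.
    Returns (start_ms, end_ms) tuples, one per talkspurt."""
    # collect raw speech runs
    runs, start = [], None
    for i, v in enumerate(list(vad) + [0]):
        if v and start is None:
            start = i
        elif not v and start is not None:
            runs.append((start * FRAME_MS, i * FRAME_MS))
            start = None
    # merge runs separated by pauses shorter than the pause threshold
    merged = []
    for s, e in runs:
        if merged and s - merged[-1][1] < MIN_PAUSE_MS:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    # discard segments shorter than the minimum speech duration
    return [(s, e) for s, e in merged if e - s >= MIN_SPEECH_MS]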
0:02:48 These are the twenty most frequent words in the acknowledgement moves, according to the word transcriptions. You can see that the top five include "right", "okay", "uh-huh" and "yeah". So these might actually be detectable by their lexical content.
0:03:05 So how do these occur in overlap? Well, I took the corpus and binned it into ten-millisecond frames. Given that a frame is in non-overlap, there is about a five percent probability that it is an acknowledgement move, while given that it is in overlap, the probability of an acknowledgement is considerably higher. So acknowledgements seem to be more common in overlap. What is going on here?
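A minimal sketch of the kind of frame-level count implied here, assuming a hypothetical list of (in_overlap, is_acknowledgement) flags for the 10 ms frames; this is my own illustration, not the authors' analysis script.

def ack_probabilities(frames):
    """frames: iterable of (in_overlap, is_acknowledgement) booleans,
    one pair per 10 ms frame. Returns P(ack | overlap), P(ack | non-overlap)."""
    in_ov = [ack for ov, ack in frames if ov]
    out_ov = [ack for ov, ack in frames if not ov]
    p_ov = sum(in_ov) / len(in_ov) if in_ov else 0.0
    p_non = sum(out_ov) / len(out_ov) if out_ov else 0.0
    return p_ov, p_non

# The observation in the talk corresponds to p_ov coming out clearly higher
# than p_non, i.e. acknowledgement frames are over-represented in overlap.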
0:03:28 I tried to dig a bit deeper by computing the between-speaker intervals, which are defined by the partial overlaps and the gaps. What we are actually going for as the target is the inclusion assumption for the acknowledgement moves, but the other conditions were computed in order to have a reference to compare with: that is, the intervals in the context of an exclusion model, and both excluding and including the extra-linguistic types of sounds.
0:04:00 So this is the graph, from a publication currently in press. As you can see here, if you include these extra-linguistic tokens you get much more overlap, which is the negative side of the scale in this graph.
0:04:19 Whereas if you compute it in the context of the exclusion model, there is not much difference; and if you compute it under the inclusion assumption for the acknowledgement moves, you get slightly more overlap, as you can see here in the left part of the figure. So what does this mean? Well, it seems that conversational turn-taking is tightly coordinated, and that overlaps are mostly due to interjections in complete overlap.
0:04:47 But what we actually want to do here is this: for both interjections in complete overlap, which are mostly acknowledgement moves, and interjections into silence, we need to classify incoming speech as acknowledgement move or not, and to do so as early as possible.
0:05:04 So I came up with this concept that I call maximum latency classification. It is quite simple, actually: it is just the usual talkspurt segmentation, with a minimum speech activity threshold and a minimum pause duration threshold, where you make the decision at the end of the talkspurt. However, if the talkspurt grows longer than the maximum latency, you make the decision at that time instead, in order to minimize the response time.
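The decision-timing rule, as I understand it from the description above, can be sketched like this (the function name, the inclusion of the pause threshold in the short-talkspurt case, and the millisecond bookkeeping are my assumptions):

def decision_time_ms(talkspurt_ms, max_latency_ms, min_pause_ms=200):
    """Time after talkspurt onset at which the classifier commits to a
    decision under maximum latency classification."""
    if talkspurt_ms + min_pause_ms <= max_latency_ms:
        # short talkspurt: behave like ordinary segmentation and decide
        # once the end-of-talkspurt pause has been confirmed
        return talkspurt_ms + min_pause_ms
    # long talkspurt: do not wait for the end, decide at the latency bound
    return max_latency_ms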
0:05:42 So how were the settings chosen for the maximum latency? Well, these are the durations of the two kinds of talkspurts, the acknowledgement moves versus the other ones, and you can see here that the acknowledgements are much shorter. So if you want to use duration as a feature for classification, you basically have a trade-off: the longer you wait, the more salient duration becomes as a feature, but the less responsive the system will be. So I tried two hundred, three hundred and five hundred milliseconds, just to see what we would get.
0:06:23 For the acoustic detector we use this kind of parameterization. It is a length-invariant parameterization, which is basically a type of DCT: smooth cosine basis functions whose wavelengths are divided by the length of the talkspurt. This is quite useful, because the basis functions are periodic, which gives a good interpolation at the syllabic level. Length invariance gives invariance to duration, or speaking rate; you can separate these out and give them to the classifier as separate features. If the zeroth coefficient is set equal to zero, or omitted, you only parameterize the relative shape of the feature trajectory. This is useful if you want to model F0, which has a speaker-dependent bias, and intensity, where the distance to the microphone and the channel introduce a bias.
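A minimal sketch of a length-invariant, DCT-style parameterization as I read the description above; the use of scipy's DCT, the number of coefficients, and the keep_bias switch are my own assumptions rather than the exact parameterization used in the paper.

import numpy as np
from scipy.fftpack import dct

def shape_coefficients(trajectory, n_coef=4, keep_bias=False):
    """trajectory: 1-D array of e.g. F0, intensity or one MFCC over the
    talkspurt. The cosine basis is stretched over the whole segment, so
    the coefficients describe shape independently of duration; dropping
    coefficient 0 removes the mean level (speaker or channel bias)."""
    c = dct(np.asarray(trajectory, dtype=float), type=2, norm='ortho')
    start = 0 if keep_bias else 1
    return c[start:start + n_coef]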
0:07:18 And this is the classifier setup. We use the relative shape of the F0 envelope, both the absolute and relative shapes of the intensity, and both the absolute and relative shapes of the MFCCs, because we want to see how this affects what we can get out of them. For duration, we use the full talkspurt duration for training, while for testing we cap it at the maximum latency. Then we add spectral flux, without much motivation. The classifier is a support vector machine with an RBF kernel.
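A sketch of the classifier stage using scikit-learn, under the assumption that the shape coefficients and the (latency-capped) duration are simply concatenated into one feature vector; the pipeline, the scaling step and the MAX_LATENCY_MS constant are my additions, while the SVM with an RBF kernel is what the talk names.

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

MAX_LATENCY_MS = 500  # one of the tried latencies; chosen here as an example

def duration_feature(duration_ms, training):
    # full duration when training, capped at the maximum latency when testing
    return duration_ms if training else min(duration_ms, MAX_LATENCY_MS)

clf = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
# X rows: [F0/intensity/MFCC shape coefficients..., spectral flux, duration]
# y: 1 for acknowledgement move, 0 for other incoming speech
# clf.fit(X_train, y_train); predictions = clf.predict(X_test)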
0:07:53 Here are the results per feature. As you can see, it seems that the F0 envelopes are the weakest feature, followed by intensity and spectral flux, while the MFCCs are the strongest ones, and that is when we omit the zeroth cepstral coefficient, which means that we are really only modeling the shape of the spectral trajectories with these features. For duration, you get almost nothing at the shorter latencies, but if you wait longer, at five hundred milliseconds it becomes the second most salient feature.
0:08:31 So, sorry, yes: we then decided to exclude the zeroth coefficients and also the F0 envelopes, since in this case these were the weakest in the feature combination.
0:08:45 These are the overall results. We tried two conditions here: the offline condition uses the provided segmentation, while the online condition uses an energy-based threshold voice activity detector, because we were a little bit worried about how sensitive this time-warping parameterization would be to that kind of voice activity detection. It turned out that it was not that sensitive, as you can see. The longer you wait, the better the classification gets, and it is actually possible to get quite decent classification already at a few hundred milliseconds, which is quite encouraging, I think, and that was surprising.
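For the online condition, an energy-threshold voice activity detector of the kind mentioned could look roughly like this (the frame length, the percentile-based noise floor and the 12 dB margin are all assumptions for illustration, not the detector used in the paper):

import numpy as np

def energy_vad(samples, sample_rate, frame_ms=10, margin_db=12.0):
    """samples: 1-D float array. Returns one 0/1 decision per frame:
    1 where the frame energy exceeds an estimated noise floor plus a margin."""
    hop = int(sample_rate * frame_ms / 1000)
    frames = [samples[i:i + hop] for i in range(0, len(samples) - hop + 1, hop)]
    energy_db = np.array([10 * np.log10(np.mean(f ** 2) + 1e-12) for f in frames])
    floor_db = np.percentile(energy_db, 10)   # assume the quietest 10% is noise
    return (energy_db > floor_db + margin_db).astype(int)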
0:09:31 So, to conclude: duration and the MFCCs seem to be the most salient features here. If you want to integrate this kind of classifier into an incremental dialogue framework, assuming a framework that can handle multiple ongoing plans, my suggestion is to run several classifiers in parallel: perhaps the first one prepares decisions at two hundred milliseconds, and the next ones execute them at three and five hundred milliseconds. The actual implementation of this, though, is left outside the scope of this talk.
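Since the integration is only a suggestion here, this is just a sketch of how classifiers trained for different maximum latencies might be run in parallel inside an incremental framework; the class, the method names and the tentative/firm distinction are my own assumptions.

LATENCIES_MS = (200, 300, 500)   # the latency settings discussed in the talk

class IncrementalAckDetector:
    def __init__(self, classifiers):
        # classifiers: dict mapping latency in ms -> trained classifier
        self.classifiers = classifiers
        self.hypotheses = {}

    def on_latency_reached(self, latency_ms, features):
        """Called by the dialogue framework once the incoming talkspurt has
        lasted latency_ms; earlier calls prepare tentative decisions, later
        calls confirm or revise them before the system commits."""
        label = self.classifiers[latency_ms].predict([features])[0]
        self.hypotheses[latency_ms] = label
        return label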
0:10:12 That is all I have to say; thank you, and I will take any questions.
0:10:16 Thank you very much. Are there any questions?
0:10:32 Do you have any particular explanation for these features being able to do what they do here, given the results you get?
0:10:40 No, not really. It seems that the annotators basically relied a lot on the lexical content, so I cannot give a particular reason. But it is not just that either, because if you set the zeroth coefficients to zero you remove the part that actually holds the information about the formants; once you omit that you lose it, and what is left is only the shape of the trajectories, which you would expect to be different. So it seems to be more something about voice quality; I cannot really explain exactly why it works.
0:11:27 Are there any more questions? Okay then, thank you again.
0:11:35 Thank you.