Yeah, so, hi, I'm Daniel, and I will present this work, which I did together with [unclear].

Okay. So far, conversational systems have mostly been developed using this kind of walkie-talkie turn-taking paradigm. This means that one party is talking at a time and response times are long, mostly because these systems use a pause duration threshold for end-of-utterance detection. But in human talk, more than fifty percent of all speaker shifts occur in these two situations, that is, in partial overlap or in a gap of up to two hundred milliseconds, which is supposed to be the minimum response time to pauses.

So if you want to do turn-taking when you are talking to a computer, you can divide this into two cases. The first one is when you have a long gap, longer than two hundred milliseconds. You can handle these with end-of-utterance predictors. The second case is when you have a little overlap or a short gap. This is our target for this study, and we introduce a simplified approach to it by focusing on acknowledgement moves, which are basically a backchannel type of dialogue act. This is when people say "mm-hmm", "yeah" and so on. So if you want a dialogue system to handle these, it has to do two things: it should be able to continue to talk when these signals come in, in partial or complete overlap, and it should be able to detect that speech while it is still talking [unclear].

So we are talking a lot about response times here. This is the corpus that we use, the classical Edinburgh Map Task corpus. It has face-to-face dialogues, and the task is the map task, where one guy describes a path to another. It has dialogue act annotation provided, and among these acts are the acknowledgement moves, which we use in this study. Talk spurts are defined here as a minimum voice activity duration of fifty
milliseconds, separated by pause durations of two hundred milliseconds. This makes the provided segmentation more perceptually relevant, and it more closely resembles an online condition.

These are the twenty most frequently occurring words. You can see that the top five here include "right", "okay" and "yeah" [others unclear]. So these moves might actually be defined by their lexical content.

So how are they distributed with respect to overlap? Well, I took the corpus and divided it into ten-millisecond frames. Given that a frame is in non-overlap, there is a five percent probability that it belongs to an acknowledgement move, while if the frame is in overlap, the probability is [unclear] percent. So acknowledgement moves seem to be more common in overlap.

So what is going on here? I tried to dig a bit deeper by computing the between-speaker interval, which is defined by the partial overlap and the gap. What we are actually going for, the target, is the initiation of an acknowledgement move, but I also computed it for other cases to have a reference to compare with, that is, in the context of an acknowledgement move, and without and including extralinguistic sounds. This is a plot from a publication in press. As you can see here, if you introduce these extralinguistic tokens you get much more overlap, which is the negative scale of the graph here, while if you compute it in the context of an acknowledgement move there is not much difference. If you compute it for the initiation of an acknowledgement move, you get slightly more overlap, as you can see here in the left image.

So what does this mean? Well, it seems like the acknowledgement moves in overlap are mostly introduced in complete overlap. But what we actually want to do here is this: both for acknowledgement moves introduced in complete overlap and for those introduced into
silence, we need to classify incoming speech as an acknowledgement move or not, as early as possible.

So I introduce this setup, which is called maximum latency classification. It is quite simple, actually. You just have a talk spurt segmenter, which has a minimum speech activity threshold and a minimum pause duration threshold, and I want to make the decision at tau, the maximum latency. However, in the first case here, tau is larger than the talk spurt duration plus the minimum pause threshold, so you make the decision at this time instead, to minimize the response time.

So how do we set the upper bound for the maximum latency? Well, these are the durations of the two kinds of talk spurts, the acknowledgement moves and the other ones. You can see here that the acknowledgement moves are much shorter. So if you want to use duration as a feature for classification, you basically have to make a trade-off: the longer you wait, the more salient duration becomes as a feature, but [unclear]. So I tried one hundred milliseconds, three hundred milliseconds and five hundred milliseconds, just to see what we would get.

For the acoustic features I use this kind of parameterization. It is a length-invariant parameterization, which is basically a type-II DCT that is modified in such a way that you divide by the length of the talk spurt. This is quite useful because the basis functions are periodic, which gives a good interpolation at the syllabic level [unclear], and the length invariance gives a normalization for duration or speaking rate, so you can separate these in the classifier. The zeroth coefficient is equal to the arithmetic mean, which means that if you omit it, you only parameterize the relative shape of the trajectory. This is useful if you want to model F0, which has a speaker-dependent bias, or intensity, which has a bias because of the distance to the microphone, and then
the MFCCs, which have this channel bias.

And this is the classifier setup. I use the F0 envelopes, the relative shape, and the intensity, also the relative shape. For the MFCCs I tried both the absolute and the relative shapes, because I wanted to see how this would affect what we get. For duration, I used the full talk spurt duration for training, while for testing we cut at the maximum latency. Then I added spectral flux, without too much motivation. The classifier is a support vector machine with an RBF kernel.

These are the results per feature. As you can see, it seems like the F0 envelopes are the weakest feature, followed by intensity and spectral flux, while the MFCCs are the strongest ones. It does not hurt to omit the zeroth coefficient, which means that we are actually only modelling the relative trajectories of these features. For duration, you get nothing at one hundred milliseconds, but if you wait longer, at five hundred milliseconds it becomes the second most salient feature.

So, sorry, yeah, I decided to use the MFCCs without the zeroth coefficient and to leave out the F0 envelopes, because these were the weakest, in the feature combination.

So these are the results. We tried two conditions here, that is, the offline and the online. The offline condition uses the provided segmentation, while the online condition uses an energy-based threshold voice activity detector, because we were a little bit worried about how sensitive this length-invariant parameterization was to this kind of voice activity detection. It turned out that it was not that sensitive. And as you can see, the longer you wait, the better classification you get. But it is still possible to get quite decent classification at one hundred milliseconds, which is quite interesting, I think, and it was a surprise.

So, to conclude: duration and MFCCs seem to be the most salient features here. And if you want to integrate this kind of
classifier in an incremental dialogue framework, this [unclear] framework can handle multiple ongoing plans, and my suggestion here is to run several classifiers in parallel. Perhaps the first one prepares decisions at one hundred milliseconds, and the next ones execute them at three hundred and five hundred milliseconds. The actual implementation was done in [unclear]. That is all from me, I think. Thank you.

Any questions?

Thank you very much. I have a question: do you have any particular explanation for the MFCCs being able to give the results that you get?

No. Well, it could simply be because, it seems like the annotators basically relied a lot on the lexical content, so that could be a particular reason. But it is not that easy either, because if you omit the zeroth coefficient, which holds the information about the actual formants, then you lose this. So what we are left with is kind of the relative trajectories, and that would be different. It seems to be more something about voice quality. I can't explain what is going on.

Any other questions? Okay, then let's thank the speaker again. Thank you.
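Two ideas from the talk, the maximum-latency decision time and the length-invariant DCT parameterization, can be sketched in Python roughly as follows. This is only an illustrative sketch under assumptions, not the speaker's implementation: the function names `decision_time_ms` and `length_invariant_dct`, the default of four coefficients, and the exact normalization (dividing a type-II DCT by the number of frames) are all hypothetical choices made for the example.

```python
import numpy as np


def decision_time_ms(spurt_duration_ms, max_latency_ms, min_pause_ms=200):
    """Time after talk-spurt onset at which a decision is made.

    Assumption: the decision is taken at the maximum latency (tau), unless
    the talk spurt has already ended and the minimum pause has elapsed,
    in which case it is taken at that earlier time.
    """
    return min(max_latency_ms, spurt_duration_ms + min_pause_ms)


def length_invariant_dct(trajectory, n_coeffs=4, drop_zeroth=True):
    """Type-II DCT of a feature trajectory, divided by its length.

    Dividing by the number of frames makes the coefficients roughly
    independent of how long the talk spurt is. The zeroth coefficient is
    the arithmetic mean; dropping it keeps only the relative shape.
    """
    x = np.asarray(trajectory, dtype=float)
    n = len(x)
    k = np.arange(n_coeffs)[:, None]          # coefficient index
    t = (np.arange(n)[None, :] + 0.5) / n     # normalised time in (0, 1)
    basis = np.cos(np.pi * k * t)             # DCT-II basis functions
    coeffs = basis @ x / n
    return coeffs[1:] if drop_zeroth else coeffs


# Usage: the same rising contour sampled with 50 and with 100 frames,
# and a copy shifted by a constant bias (for example, a speaker offset).
short = (np.arange(50) + 0.5) / 50
long_ = (np.arange(100) + 0.5) / 100
print(length_invariant_dct(short))
print(length_invariant_dct(long_))
print(length_invariant_dct(short + 7.0))
```

In this sketch the first two printed vectors are nearly identical, because the length normalization removes the effect of duration, and the third matches the first, because the constant bias only affects the dropped zeroth coefficient.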