0:00:18 Marcin will be presenting the next talk.
0:00:26 Is this on? So how do I do that? I need some help here, I think.
0:00:38 Oops, I'm sorry. The presentation is on this computer, but I can't find the... there is no pointer here now, right? Right.
0:01:07 Well, I can start while this is happening. I can start by saying that the work I'm going to be presenting is really Kornel Laskowski's work, and he very generously invited us to collaborate with him on this. Then it turned out that he cannot make it today, which means that you are stuck with me here. I will try not to make too much of a mess of his talk.
0:01:45 So the question we are tackling here is a very old question in speech science: whether, or to what extent, pitch plays a role in the management of speaker change. This question has generated a huge and steady stream of papers, but if you look across those papers you can extract some broad consensus. First of all, pitch does play some role. Secondly, there is this binary opposition between flat pitch signalling, or being linked to, turn-holding, and any kind of pitch movement, dynamic pitch, being linked to turn-yielding. And that's it, that's the whole story. Except of course it is not, because there are still a number of questions you might want to ask about the contribution of pitch to turn-taking,
0:02:37 such as: does it matter whether you are looking at spontaneous or task-oriented material? Does it matter whether the speakers can see each other, or whether they know each other? What is the actual contribution of pitch over lexical or syntactic cues? And finally, I am a linguist by training, a phonetician, and we know that different languages use pitch linguistically to different extents, so the question is whether this is also reflected in how they use pitch for pragmatic purposes such as turn-taking.
0:03:12 And then there is a whole other list of questions about how you transform, how you represent, pitch in your model. Do you do some kind of perceptual stylisation based on perceptual thresholds? Do you do some sort of curve fitting: polynomials, functional data analysis, what have you? Do you use a log scale? Do you transform to semitones? How far back do you look for those cues: ten milliseconds, a hundred, one second, ten seconds?
0:03:41 These are all interesting and important questions, but it is very difficult to answer them in a systematic way, because any two studies you point to will vary across so many dimensions that it is very difficult to estimate, to quantify, how much each of these factors contributes to the actual contribution of pitch to turn-taking.
0:04:02 So what we are trying to do here is propose a way of evaluating the role of pitch in turn-taking, a method which has three properties we think are important. First, it is scalable: it is applicable to material of any size. Second, it is not reliant on manual annotation. Third, it gives you a quantitative index of the contribution of pitch, or of any other feature for that matter, because in the long term this method can be applied to any candidate turn-taking cue.
0:04:43 The way we chose to showcase, and also to evaluate, this method was to ask three questions which we thought were interesting for us and which we hope are interesting to some of you. The first question is whether there is any benefit in having pitch information for the prediction of speech activity in dialogue. The second is, if it does make a difference, how best to represent the pitch information. And the third is how far back you have to look for these cues.
0:05:15 These are the questions we will be asking, and we will be trying to answer them using Switchboard, which we divided into three speaker-disjoint sets, so there is no speaker in more than one of them. And instead of running our own voice activity detection, we just used the forced alignments of the manual transcriptions that come with Switchboard.
0:05:39 And what we did then, and this is the idea that lies at the heart of this method, and I am sure you have seen it before, is the interaction chronogram, which is a sort of discretised, quantised speech/silence annotation. You have a frame of predefined duration, here one hundred milliseconds, and for each of those frames and for each of the speakers you indicate whether that person was speaking or silent during that interval. So here we have speaker A speaking for four hundred milliseconds, there is a hundred milliseconds of overlap, speaker B then speaks for four frames, there is a hundred milliseconds of silence, and then speaker A comes in again.
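As a rough sketch of what this chronogram representation might look like in code (the segment format and helper name here are illustrative assumptions, not from the talk or the paper):

```python
import numpy as np

FRAME = 0.1  # 100 ms frames, as in the talk

def chronogram(segments_a, segments_b, total_dur, frame=FRAME):
    """Build a 2 x n_frames binary speech/silence matrix from per-speaker
    (start, end) speech segments, e.g. taken from forced alignments."""
    n_frames = int(np.ceil(total_dur / frame))
    mids = (np.arange(n_frames) + 0.5) * frame  # frame midpoints
    chrono = np.zeros((2, n_frames), dtype=int)
    for row, segments in enumerate((segments_a, segments_b)):
        for start, end in segments:
            chrono[row, (mids >= start) & (mids < end)] = 1
    return chrono

# toy example: A speaks 0-0.4 s, B speaks 0.3-0.7 s (100 ms overlap), A resumes at 0.8 s
chrono = chronogram([(0.0, 0.4), (0.8, 1.2)], [(0.3, 0.7)], total_dur=1.2)
```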
0:06:24 Once you have this sort of representation, you can predict speech activity very simply. You take one speaker's history; we call this speaker the target speaker. You take this person's speech activity history, and you can also, if you are interested in that, take the other person's speech activity history, and then you try to predict whether the target speaker is going to be silent or speaking in the next hundred milliseconds. This kind of model can serve as a very neat baseline onto which you can then keep adding other features, in our case pitch.
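A minimal sketch of how such training examples could be assembled from the chronogram (the function name and windowing details are my own, not the authors'):

```python
import numpy as np

def make_examples(chrono, target_row, history=10):
    """Sliding-window (X, y) pairs: X holds the last `history` frames of both
    speakers' speech activity, y is whether the target speaker is active in
    the following 100 ms frame."""
    other_row = 1 - target_row
    X, y = [], []
    for t in range(history, chrono.shape[1]):
        feats = np.concatenate([chrono[target_row, t - history:t],
                                chrono[other_row, t - history:t]])
        X.append(feats)
        y.append(chrono[target_row, t])
    return np.array(X), np.array(y)
```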
0:07:07 What you can then do is compare this speech-activity-only model, the baseline, with the composite speech-activity-plus-pitch model, and of course you can also compare the different types of pitch parameterisation with one another.
0:07:23 Of course, the only thing you have to do before this kind of exercise is to somehow take the continuously varying pitch values and cast them into this chronogram-like, matrix-like representation. What we did here was the simplest possible solution: for each hundred-millisecond frame we calculated the average pitch in that interval, or we left it as a missing value if there was no voicing in that interval.
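A sketch of this frame-wise pitch averaging, assuming you already have timestamped F0 samples from some pitch tracker (all names here are illustrative):

```python
import numpy as np

def pitch_per_frame(f0_times, f0_values, n_frames, frame=0.1):
    """Mean F0 per 100 ms frame; NaN (missing) where the frame contains no
    voiced F0 samples."""
    out = np.full(n_frames, np.nan)
    frame_idx = np.minimum((f0_times / frame).astype(int), n_frames - 1)
    for i in range(n_frames):
        vals = f0_values[(frame_idx == i) & np.isfinite(f0_values)]
        if vals.size:
            out[i] = vals.mean()
    return out
```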
0:07:57 We then ran those prediction experiments using quite simple feed-forward networks with a single hidden layer; for all the experiments I am talking about here we had two units in that hidden layer, and there are more configurations in the paper which I will not be talking about here. You will note that this is a non-recurrent network, and there is a reason for this: since we are actually interested in the length of the usable pitch history, we want to have control over how much history the network has access to.
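For concreteness, a single-hidden-layer feed-forward classifier of this kind could be set up with off-the-shelf tooling along these lines (a sketch only; the two hidden units come from the talk, everything else is my assumption):

```python
from sklearn.neural_network import MLPClassifier

# Non-recurrent by construction: the model only ever sees the fixed-length
# history window packed into its input vector, so the usable context is
# controlled explicitly by how the examples are built.
model = MLPClassifier(hidden_layer_sizes=(2,), activation="logistic", max_iter=1000)
# model.fit(X_train, y_train)  # X built from speech-activity (and later pitch) frames
```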
0:08:33 Before we go on: the differences were compared using cross entropy, expressed in bits per hundred-millisecond frame. There will be a lot of comparisons here, so there will be lots of pictures, and there are even more in the paper; I have sort of taken the liberty of picking out the more boring ones, which I think is fine as long as you don't tell Kornel. So if you know them, don't tell.
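The metric itself is just average binary cross entropy per frame, converted to bits; a minimal sketch (variable names are mine):

```python
import numpy as np

def cross_entropy_bits(y_true, p_pred, eps=1e-12):
    """Mean cross entropy in bits per 100 ms frame, given binary speech/silence
    labels y_true and predicted speaking probabilities p_pred."""
    p = np.clip(p_pred, eps, 1 - eps)
    ce_nats = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return ce_nats.mean() / np.log(2)  # nats -> bits
```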
0:08:54 So, the first two questions were: first, is there any benefit in having access to pitch history when doing speech activity prediction; and second, what is the optimal representation of pitch values in such a system?
0:09:16 What we do here is start with the speech-activity-only baseline, and we will be seeing this kind of picture a lot. What we have here are the training set, the dev set and the test set; these are the cross-entropy rates for all those systems; and on the x-axis is the conditioning context. So this is a system trained on one hundred milliseconds of speech activity history, and this is a system trained on one second of speech activity history, and you can see that across all three sets the cross entropies drop as you would expect: there is an improvement in prediction.
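That baseline curve is obtained simply by retraining the same model with longer and longer input windows; schematically, reusing the sketches above (chrono_train, chrono_test and make_examples are the hypothetical helpers from earlier, not the authors' code):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import log_loss

# hypothetical sweep over the conditioning context, 100 ms to 1 s
for history in range(1, 11):
    X_tr, y_tr = make_examples(chrono_train, target_row=0, history=history)
    X_te, y_te = make_examples(chrono_test, target_row=0, history=history)
    clf = MLPClassifier(hidden_layer_sizes=(2,), max_iter=1000).fit(X_tr, y_tr)
    ce_bits = log_loss(y_te, clf.predict_proba(X_te)[:, 1]) / np.log(2)
    print(f"{history * 100} ms context: {ce_bits:.3f} bits/frame")
```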
0:09:55 What we will be doing from now on is taking this guy, the system trained on one second of speech activity history of both speakers, and adding more and more pitch history. So it is always ten frames of speech activity history for both speakers, and then pitch on top of that.
0:10:22 What we did first was simply add absolute pitch, on a linear scale, in Hz. And surprisingly, even this simple pitch representation helps quite a bit: you can see that even having one frame of pitch history is already better than the baseline here, and then it improves further and starts to settle around three hundred milliseconds.
0:10:50 So that is good news: it seems to suggest that pitch information is somehow relevant for speech activity prediction. But clearly, representing pitch in absolute terms is a kind of laughable idea, because it is completely speaker dependent. So what you want to do is make this speaker-independent somehow, you want to do speaker normalisation, and here again we did the simplest thing: we just z-scored the pitch values. And surprisingly, this did not really make much of a difference, which is surprising.
0:11:29 You would expect some improvement, but if you think about it, this actually introduces more confusion, because z-scoring of course brings the mean to zero, and the voiceless frames are also represented as zeros in the model, so these models simply confuse those two phenomena. This can be quite easily fixed by adding another feature vector, a binary voicing feature: it is one when there is voicing and zero when there is not. This allows the model to disambiguate zeros which are due to being close to the speaker's mean from zeros which are due to voicelessness.
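Put together, the representation that worked best so far amounts to something like this (a sketch; the per-speaker statistics are assumed to be known in advance, a point that comes back later):

```python
import numpy as np

def pitch_features(frame_pitch, speaker_mean, speaker_std):
    """Two feature rows per frame: z-scored pitch (0 where unvoiced) plus a
    binary voicing flag, so the model can tell 'near the speaker's mean'
    apart from 'no voicing at all'."""
    voiced = np.isfinite(frame_pitch).astype(float)
    z = np.where(voiced > 0, (frame_pitch - speaker_mean) / speaker_std, 0.0)
    return np.stack([z, voiced])
```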
0:12:12 And when you do this, you actually get quite a substantial drop in cross-entropy rates, which suggests that this is a good representation; and this drop was actually greater than if you add voicing on top of absolute pitch. Again, that is not something I am showing here, but it is in the paper.
0:12:31 Then of course you can go on and say: well, we know that pitch is really perceived on a semitone scale, on a log scale, so does it matter if we convert the Hz values to semitones before z-scoring? And it does, a little: there is a slight improvement, which generalises to the test set.
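The semitone conversion itself is a one-liner relative to an arbitrary reference frequency (100 Hz here is my choice; the reference shifts everything by a constant and therefore cancels out once you z-score per speaker):

```python
import numpy as np

def hz_to_semitones(f0_hz, ref_hz=100.0):
    """Convert F0 from Hz to semitones relative to ref_hz: 12 * log2(f / ref)."""
    return 12.0 * np.log2(f0_hz / ref_hz)
```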
0:12:52 And the last thing we asked here: so far we have only been using the pitch history of the target speaker, but you can also ask whether it helps to know the pitch history of the interlocutor. And again, there is a slight but consistent improvement if you use both speakers' histories. So this is our answer, or a preliminary answer anyway, to questions number one and two.
0:13:18 Then we have question number three, which is how far back you have to look, and for this we have this sort of diagram. The top line is as before, the speech-activity-only model, except that previously we ended here at this blue dot and here we extended it for another ten frames, so this model is trained on two seconds of speech activity history, and you can see that it continues dropping, but a little less abruptly. This curve here is exactly the curve we had before, so pitch plus one second of speech activity history, and this one is more and more pitch history plus two seconds of speech activity history.
0:14:08 And this is quite interesting, actually, and a little bit puzzling, in that these curves are quite similar: they all start settling around four hundred milliseconds, but this one is just shifted down. What this means, basically, is that the same amount of pitch history is more helpful if you have more speech activity history, which is kind of interesting. We have some ideas about it, but frankly we do not know why that is. One possibility is that it has something to do with the backchannel versus non-backchannel distinction, in that those four hundred milliseconds of pitch cues might only be useful when the person has been talking for sufficiently long.
0:14:56 So, as I said, there is more in the paper, but this is all I wanted to show you here. What have we learned? Back to the three questions. First, does pitch help in the prediction of speech activity in dialogue? The answer is yes. What is the optimal representation? From what we have seen, it seems to be the combination of a binary voicing feature, for the disambiguation of voicelessness, with z-score-normalised pitch on a semitone scale. And how far back should one look? It seems that four hundred milliseconds of context is sufficient.
0:15:39 We have also seen that, in terms of the absolute reduction in cross entropy, the best performing pitch representation resulted in a reduction corresponding to roughly seventy-five percent of the reduction you get in the speech-activity-only model when you go from one frame to ten frames, so it is quite substantial in those terms.
0:16:07 We have also seen that four hundred milliseconds seems to be enough, which is not much if you think about the study that Kornel did in two thousand twelve, where it was found that with speech activity history alone you can go back as much as eight seconds and still keep improving. On the other hand, if you think about the prosodic domain, the window within which any kind of pitch cue could be embedded, then something on the order of magnitude of a prosodic foot, so something like four hundred milliseconds long, makes perfect sense to me.
0:16:50 And one thing we did, of course, was cheat a little bit: when we did the z-scoring of the pitch, we used speaker means and standard deviations that we assumed to be known a priori. This of course would not be the case if you were to run this analysis in a real-time scenario; there, they would have to be estimated incrementally.
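In an online setting the per-speaker mean and standard deviation used for z-scoring could be estimated incrementally, for instance with Welford's algorithm (a generic sketch, not what was done in the paper):

```python
class RunningStats:
    """Welford's online algorithm for incremental mean and standard deviation."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        return (self.m2 / self.n) ** 0.5 if self.n > 1 else 1.0
```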
0:17:17 I want to finish here by going back to the rationale for doing all this analysis, all this playing around, which was really to come up with a better way of doing automated analysis of large amounts of speech material, and especially to be able to produce results across different corpora and make them comparable. One thing you could do with this, for instance: we ran it on Switchboard; you could take the same thing and run it on CallHome, which is also dyadic and also telephone speech, but where the people know each other. And what you can then do is compare those results and see to what extent familiarity between speakers, for instance, plays a role in how pitch is employed for turn management.
0:18:11 And of course, and this is kind of what got Kornel and me excited about this, there is nothing that limits this approach to pitch: there is nothing stopping you from doing intensity, any kind of voice quality features, or multimodal features. So this really opens the way, in a sense, for doing a lot of interesting things. And in the long term, whatever you find out could potentially also be used in some sort of mixed-initiative dialogue system, but that is really something that you know more about than I do. So I will stop here, thank you.
0:18:53 We have plenty of time for questions.
0:19:03 I have a hidden slide with Kornel's phone number.
0:19:08 So how are you handling cases where you are not able to find the pitch, where there is no pitch because it is voiceless? Do you do anything in particular?
0:19:16 Originally it is left as a missing value, but then, because of all the shenanigans that happen inside, those values are just transformed into zeros, so that is why there is this confusion between voicelessness and, after z-scoring, the mean pitch.
0:19:42 Are there other questions?
0:19:50 Thanks for the nice talk. I was wondering: absolute pitch is very different for male voices and for female voices, so I am wondering whether your model is not just separating male voices from female voices.
0:20:19 Well, maybe, but how would that information be useful for predicting whether the speaker will be speaking in the next hundred milliseconds?
0:20:33 Still, your result for absolute pitch is very surprising. Yes, I think so too, because you would not assume that speaking at one hundred and sixty-five hertz signals that you want to hold the floor all the time. I agree that it is surprising. But of course, if you compare the absolute pitch with the speaker-normalised pitch, there is clearly a lot that the absolute pitch misses, so there is a lot to improve on; there must be some information that is still in there.
0:21:15 How do you mean? That inside the network there was some kind of clustering, that it sort of had one classifier for men and one for women?
0:21:28 Actually, I think you just anticipated my question. I am wondering how much the modelling itself is doing: you are proposing a certain representation, you binarise pitch and so on, but obviously the model is probably also doing something on top of that, and I am not sure whether you have looked into that, whether you can really disentangle it. Because if someone takes a different approach, say constructs features that are temporal in nature, looking at slopes and all that stuff, how much of that is the model already accounting for? I am not sure, it is hard to say, I cannot answer this; of course you do not know what the model is actually doing, absolutely. But the thing is, this is one way of approaching the problem while producing results which are comparable across studies.
0:22:23 You mentioned at the beginning that pitch might be flat before a turn boundary, and since you do not use a recurrent model, did you also consider taking the change in pitch over time, not only the absolute values? No, we did not, but is that not something that the network could potentially figure out itself? That is the question. I mean, I think so.
0:22:54 I do not think you have done this, but are you planning to take this outside the corpus and see whether the kinds of differentiation your models are finding might be used productively, to change the behaviour of the other speaker, for instance if you altered the pitch of what is being generated? Absolutely, that could be done. And the other question:
0:23:22 I was wondering what you would need to change if it were a multi-speaker situation, not just two, but three or four.
0:23:29 Possibly; this is something we have discussed a lot. We had a paper at Interspeech two thousand seventeen where we did this kind of modelling for respiratory data and turn-taking, and there we had three speakers, and you can absolutely do it, but then you would have another row here, and what you then have to do is keep shifting those speakers around, because you do not want your model to rely on the fact that speaker B was on row two and speaker C was on row three. So with three speakers it is still doable; once you go into really multi-party settings, this just explodes. Then you would have to do it somehow differently, perhaps only take into account the speakers who have spoken within the last, I don't know, five minutes or something, and then incrementally, dynamically, produce those subsets of speakers that you predict for.
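The speaker shifting mentioned here can be done, for a small number of speakers, by training on all permutations of the non-target rows, so the model cannot latch onto which row a particular interlocutor happens to occupy (an illustrative sketch only, not the setup used in the 2017 paper):

```python
from itertools import permutations
import numpy as np

def permuted_examples(chrono, target_row, history=10):
    """Yield training examples with the non-target speakers' rows permuted,
    so the model cannot rely on a fixed row order."""
    others = [r for r in range(chrono.shape[0]) if r != target_row]
    for order in permutations(others):
        rows = np.vstack([chrono[target_row]] + [chrono[r] for r in order])
        for t in range(history, rows.shape[1]):
            yield rows[:, t - history:t].ravel(), rows[0, t]
```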
0:24:39 Any more questions?
0:24:48 I am just wondering whether you have looked into the granularity here: you are picking hundred-millisecond frames; did you look at other time windows?
0:24:56 Well, we did not, but this is clearly, I think, a key issue that should somehow be addressed, absolutely. On the other hand, the method itself is agnostic about this, just as whatever pitch extraction you use will produce different pitch tracks, and whatever voice activity detection you run will also produce different results; this is, in some sense, part of the preprocessing. But still, I think... absolutely, absolutely.
0:25:41 Alright, let's thank our speaker again.