0:00:16i'm planning to k i'm working in you gap research and ct would and i
0:00:19will be presenting the talk on modeling overlapping speech using vector taylor series
0:00:24this work i have been but modified eisenhower we would low
0:00:30first i will present the motivation for this problem
0:00:35then i will briefly discuss the previous approaches for detecting overlapping speech
0:00:41then i will move on to the vector taylor series approach
0:00:44which has two parts: the first part is
0:00:47the standard vts approach, and the next part is the multiclass vts
0:00:51algorithm,
0:00:53which we have proposed in this work, and then we will discuss the results at the end
0:00:59so the motivation comes from the problem of speaker diarization
0:01:03which is the task of determining who spoke when in the meeting audio
0:01:07so given an audio recording, you want to find out the different portions which
0:01:12belong to the speakers
0:01:15one challenge is that the number of speakers is not known a priori
0:01:18so it has to be determined in an unsupervised manner
0:01:24now in this task overlapping speech
0:01:28becomes a very large
0:01:30source of error
0:01:32so first i will define overlapping speech: it is the moment when two
0:01:37speakers speak simultaneously. it might be when people are debating or arguing, or when
0:01:42they are just
0:01:45saying okay, so when they are just agreeing or disagreeing,
0:01:49this kind of thing, or when they are laughing together
0:01:52so what happens is that when you have overlapping speech in your audio then you
0:01:57cannot estimate
0:01:59the speaker models very precisely
0:02:01or when you are doing speaker recognition and you want to assign one speaker identity
0:02:06to a portion in which there are actually two people speaking, then that also
0:02:10results in some errors in speaker recognition
0:02:14and previous studies have shown that in meetings sometimes almost twenty percent of
0:02:19the spoken time can be overlapping if the participants are maybe
0:02:28now the previous approaches: one of the first works was done by boakye
0:02:32at icsi, where they had an hmm-based segmenter for the three classes speech, non-speech
0:02:37and overlapping speech
0:02:39this was the baseline
0:02:40then people have used
0:02:42the knowledge of the silence distribution
0:02:45and things like speaker changes, because it has been found that people tend to overlap
0:02:50when the speaker changes
0:02:53the state of the art is based on convolutive non-negative sparse coding, in which
0:03:00they have
0:03:03a set of bases for each speaker
0:03:05and then they try to find out the activity of each speaker for each frame
0:03:09and they have used the same features with long short-term memory
0:03:14neural networks
0:03:18now we come to
0:03:19our problem
0:03:20so before i move to overlapping speech, there is an analogous problem
0:03:24of how to model speech which is corrupted with noise
0:03:28so if you have a noisy speech signal y then you can express it in
0:03:32the signal domain as
0:03:33x * h + n, where x is your clean speech,
0:03:37h is the channel noise and n is the additive noise
0:03:43so in the mfcc domain
0:03:46you have the mel-scale filtered power spectrum
0:03:49you take the log and the dct, and then you get the mfcc features
0:03:53so in the mfcc domain
0:03:55this simple expression here
0:03:57becomes
0:03:59a quite complex expression where you have a linear part and a nonlinear part
0:04:04c is the dct matrix and c inverse is the pseudoinverse of that
0:04:09so we call this nonlinear part g
0:04:12so you have y equal to x plus h plus this nonlinear part
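The cepstral-domain relation described here can be written out as below. This is a sketch of the standard VTS mismatch function, assuming the symbols used in the talk ($y$ noisy speech, $x$ clean speech, $h$ channel, $n$ additive noise, $C$ the DCT matrix, $C^{\dagger}$ its pseudoinverse); the exact form on the speaker's slide is not visible in the transcript. The $\log$ and $\exp$ are applied elementwise.

```latex
y = x + h + g(x, h, n),
\qquad
g(x, h, n) = C \,\log\!\left(1 + \exp\!\left(C^{\dagger}\,(n - x - h)\right)\right)
```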
0:04:18we want to model this equation and we use the vector taylor series, which is
0:04:22basically used to
0:04:23expand this expression here
0:04:26so it is simply an expansion of a function about a point
0:04:30where you
0:04:31have the first term
0:04:33and then the first-order term, where you take the first derivative
0:04:37so given this expression for the noisy speech
0:04:41if you expand it around the point mu x, mu n and mu
0:04:44h, which are the mean of clean speech, the mean of the noise and the mean of
0:04:48the
0:04:49channel noise
0:04:51you get this expression here, in which the first line
0:04:54is the evaluation of y at this point
0:04:58and the second line is
0:04:59the first-order
0:05:01term
0:05:01with this capital g and this capital f
0:05:07which are the derivatives of y with respect to x and h, and to n
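Written out, the first-order expansion described here takes the following form. This is a sketch in the usual VTS notation, assumed rather than copied from the slide; the identity $F = I - G$ additionally assumes $C\,C^{\dagger} = I$, which holds when $C$ has orthonormal rows (a truncated orthonormal DCT).

```latex
\begin{align}
y &\approx \mu_x + \mu_h + g(\mu_x, \mu_h, \mu_n)
     + G\,(x - \mu_x) + G\,(h - \mu_h) + F\,(n - \mu_n) \\
G &= \left.\frac{\partial y}{\partial x}\right|_{\mu}
   = \left.\frac{\partial y}{\partial h}\right|_{\mu}
   = C \,\operatorname{diag}\!\left(
       \frac{1}{1 + \exp\!\left(C^{\dagger}(\mu_n - \mu_x - \mu_h)\right)}
     \right) C^{\dagger} \\
F &= \left.\frac{\partial y}{\partial n}\right|_{\mu} = I - G
\end{align}
```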
0:05:15so in the standard
0:05:18vector taylor series, when you are trying to model
0:05:21this y here
0:05:23what people do is that they model a gmm for x
0:05:27and a single gaussian for the noise n
0:05:29this is because the noise is stationary
0:05:32and h is the channel noise
0:05:37so x is a gmm, it is being corrupted by the additive noise and h
0:05:42using the vts approximation, and that gives you the corrupted
0:05:46speech y
0:05:49these are gaussians, so although they look like waves they are the
0:05:53components of the gmm
0:05:57now we come to the overlapping speech. what we propose is that
0:06:01overlapping speech is actually just a superposition of two or more individual speakers
0:06:06so if we look at the model for the noisy speech, we can make
0:06:11the analogous model for overlapping speech, where this x is x one
0:06:16which we call the main speaker
0:06:18and this x two here is the corrupting speaker
0:06:21which acts like the additive noise
0:06:24for simplicity we ignore the channel noise because
0:06:30the recordings for all the speakers are made in the same room
0:06:34so we are not going to deal with h here
0:06:40by analogy we have this expression, where the overlapping speech is now a
0:06:45combination of
0:06:47this linear term and this nonlinear term
0:06:50this nonlinear term remains the same as in the case of the noisy speech
0:06:56again, analogous to the case of noisy speech, we have the main speaker gmm here
0:07:01and the corrupting speaker, which is represented by a single gaussian here,
0:07:05like the additive noise
0:07:07the equations are the same as for the noisy speech
0:07:11and you can see here
0:07:13the subscript m, so each component of y here is computed using this
0:07:19component from the main speaker and then some contribution from the corrupting speaker
0:07:27this g and f, which are the derivatives of y,
0:07:31are also different for each component
0:07:37if you take the expectation of this y here then you get the mean
0:07:42for the overlapping speech and the variance for the overlapping speech. so this is the
0:07:46final overlapping speech model which we want to estimate
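The per-component mean and covariance of the overlapping speech model can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `dct_matrix` and `vts_overlap_gaussian` are hypothetical, the channel term is dropped as in the talk, an orthonormal truncated DCT is assumed so that `C @ C_pinv` is the identity, and the covariance formula is the standard first-order VTS result.

```python
import numpy as np


def dct_matrix(num_ceps, num_filters):
    """Orthonormal DCT-II matrix of shape (num_ceps, num_filters)."""
    k = np.arange(num_filters)
    i = np.arange(num_ceps)[:, None]
    C = np.sqrt(2.0 / num_filters) * np.cos(np.pi * i * (k + 0.5) / num_filters)
    C[0] /= np.sqrt(2.0)  # scale the constant row so that rows are orthonormal
    return C


def vts_overlap_gaussian(mu_x1, cov_x1, mu_x2, cov_x2, C, C_pinv):
    """Combine one main-speaker Gaussian (x1) with a corrupting-speaker
    Gaussian (x2) using a first-order VTS expansion, ignoring the channel.

    Returns the overlapped-speech mean, covariance, and the Jacobians G, F.
    """
    # Difference of the two speakers in the log mel filterbank domain.
    d = C_pinv @ (mu_x2 - mu_x1)
    # Nonlinear term g = C log(1 + exp(d)), computed stably.
    g = C @ np.logaddexp(0.0, d)
    mu_y = mu_x1 + g
    # G = dy/dx1 = C diag(1 / (1 + exp(d))) C^+ ;  F = dy/dx2 = I - G.
    G = C @ np.diag(np.exp(-np.logaddexp(0.0, d))) @ C_pinv
    F = np.eye(len(mu_x1)) - G
    cov_y = G @ cov_x1 @ G.T + F @ cov_x2 @ F.T
    return mu_y, cov_y, G, F
```

Calling `vts_overlap_gaussian` once per component `m` of the main speaker's GMM gives the set of means and covariances that make up the overlapping speech model. When the corrupting speaker is much quieter than the main speaker in every filterbank channel, `G` approaches the identity and the result falls back to the main speaker's Gaussian.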
0:07:53now for estimating that model we are going to use the em algorithm, for
0:07:58which this is the q function
0:08:00so given the overlapping speech data, from frame one to frame t,
0:08:05we want to maximize the probability of
0:08:09having this data under the overlapping speech model mu y m, sigma y m
0:08:15we optimize this function q
0:08:17with respect to the mean of the corrupting speaker, mu x
0:08:21two. so this is the update equation for the mean: mu x two zero is
0:08:26the previous
0:08:28value for the
0:08:29mean of the corrupting speaker, and mu x two is the new value for the mean of the corrupting speaker
0:08:34one thing that you can notice here is that
0:08:37this one mean mu x two represents the corrupting speaker, because it is
0:08:42computed using all the mixture components
0:08:45from the overlapping speech model
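For reference, the auxiliary function and the mean update described here usually take the following form in VTS-based adaptation. This is a sketch of the standard result, assumed to match the slide; $y_t$ denotes the observed overlapped frame at time $t$, $\gamma_{t,m}$ the posterior of component $m$, $F_m$ the derivative of $y$ with respect to $x_2$ for component $m$, and the superscript $0$ marks values from the previous iteration.

```latex
\begin{align}
Q(\mu_{x_2}) &= \sum_{t=1}^{T} \sum_{m=1}^{M} \gamma_{t,m}\,
   \log \mathcal{N}\!\left(y_t;\ \mu_{y,m},\ \Sigma_{y,m}\right),
\qquad
\mu_{y,m} \approx \mu_{y,m}^{0} + F_m\left(\mu_{x_2} - \mu_{x_2}^{0}\right) \\
\mu_{x_2} &= \mu_{x_2}^{0}
  + \left[\sum_{t=1}^{T} \sum_{m=1}^{M} \gamma_{t,m}\, F_m^{\top} \Sigma_{y,m}^{-1} F_m\right]^{-1}
    \left[\sum_{t=1}^{T} \sum_{m=1}^{M} \gamma_{t,m}\, F_m^{\top} \Sigma_{y,m}^{-1}
          \left(y_t - \mu_{y,m}^{0}\right)\right]
\end{align}
```

The sums run over all $M$ mixture components, which is why a single mean ends up representing the corrupting speaker.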
0:08:49the whole vts algorithm looks something like this: initially we estimate, or
0:08:55we initialize, the mean of the corrupting speaker and the covariance
0:08:59then we compute the overlapping speech model using these expressions
0:09:04after that we use the em loop, where we optimize the q function
0:09:08and we replace mu x two zero and sigma x two zero by
0:09:12mu x two and sigma x two
0:09:15in this work we are not going to update sigma x two because
0:09:20it is very heavy computationally
0:09:24then when this loop converges we finally get the
0:09:28overlapping speech model y
0:09:30which we use for overlapping speech detection
0:09:36so the overlapping speech detection system takes as input the meeting audio
0:09:42and the recordings are divided into speech segments, which we get using speech activity detection
0:09:48then one major task is to have the initial speaker models for the
0:09:54main speaker and the corrupting speaker, so how do we get them
0:09:56there are two options: either you can use the oracle speaker segmentations
0:10:01or we take them from the diarization output
0:10:04this is a much more challenging task, because when you take the speaker alignment
0:10:10from the diarization output
0:10:12you don't know how many speakers there actually were in your audio
0:10:16so you might get more than the actual number of speakers, and
0:10:22the output which we want finally is the detection of overlaps
0:10:28now
0:10:30given the audio recording, this blue box shows
0:10:34a speech segment given by the speech activity detection
0:10:37we move a sliding analysis window over it
0:10:40for each analysis window we can have n squared hypotheses. we assume
0:10:45that in an overlap there would be two speakers overlapping, so
0:10:49if you have n speakers then the total number of overlapping speech models can be
0:10:53n squared minus n
0:10:55and this n is the number of single-speaker models, when only one speaker is speaking
0:11:01so this is a huge number. what we do is that for each speech
0:11:05segment, first we determine the main speaker and then we compute the overlapping speech models
0:11:11where that main speaker is being corrupted by
0:11:13each of the other speakers
0:11:16finally we have overlap models where speaker
0:11:21i is being corrupted by speaker j
0:11:23and then there are single-speaker models
0:11:25where speaker i is speaking alone. so we compare all these likelihoods
0:11:30to determine whether we have overlapping speech or single-speaker speech
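The hypothesis counting and the pruned decision for one analysis window can be sketched as below. This is an illustrative sketch only: the function names and the dictionary-based interface are hypothetical, and the log-likelihoods are assumed to have been computed elsewhere from the single-speaker and overlap models.

```python
def count_hypotheses(n_speakers):
    """Return (single-speaker models, ordered overlap models) for N speakers."""
    single = n_speakers
    overlap = n_speakers * (n_speakers - 1)  # N^2 - N ordered (main, corrupting) pairs
    return single, overlap


def classify_window(single_loglik, overlap_loglik):
    """Decide between single-speaker and overlapped speech for one window.

    single_loglik:  dict speaker -> log-likelihood of the window
    overlap_loglik: dict (main, corrupting) -> log-likelihood of the window
    Returns (main_speaker, corrupting_speaker or None).
    """
    # Step 1: the main speaker is the most likely single-speaker model.
    main = max(single_loglik, key=single_loglik.get)
    # Step 2: only overlap models that share this main speaker are compared.
    best_other, best_ll = None, single_loglik[main]
    for (i, j), ll in overlap_loglik.items():
        if i == main and ll > best_ll:
            best_other, best_ll = j, ll
    return main, best_other
```

Fixing the main speaker first means only N - 1 overlap models are evaluated per window instead of N squared minus N.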
0:11:39up to here that was the standard vts algorithm. now we
0:11:42move to the multiclass vts algorithm. you would have seen
0:11:47that in the standard vts we used only one single gaussian distribution for the noise
0:11:52that might be good in the case when we are dealing with
0:11:55noise, but in the case of overlapping speech
0:11:58the corrupting speaker is himself a human being
0:12:02so
0:12:04it's not like noise: he might utter multiple phonemes in that window
0:12:09so we want to represent him using more data
0:12:13or better modeling
0:12:16so what we propose is that instead of having one single gaussian here, we assume
0:12:22that all the gaussians in the gmm of x two
0:12:26are also present
0:12:29so now we are going to have a vts combination of
0:12:33two gmms, with this gmm for the corrupting speaker
0:12:40so what we do here is that we first start with the assumption that
0:12:44each of these gaussians might have been hit in that analysis window
0:12:49then for each of the gaussians we compute a gamma value, which is the average
0:12:54number of frames assigned to that gaussian component in that analysis window
0:12:58if this gamma value happens to be lower than a threshold tau
0:13:03then we cluster it with its nearest gaussian component
0:13:07we get this kind of clustering
0:13:11then we say that
0:13:14the gaussian which has the highest gamma in this cluster would be the cluster
0:13:18centre. so earlier all the components were being corrupted by one single
0:13:23gaussian here; now all these gaussians would be corrupted by the centre of their cluster
0:13:30this makes sense because
0:13:31all these gaussian mixture models have been adapted
0:13:34from the same ubm
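One possible reading of this clustering step is sketched below. The details are assumptions: the talk does not say which distance is used or how "nearest" is resolved, so this sketch keeps every component whose occupancy reaches the threshold as a cluster centre and attaches each remaining component to the closest centre by Euclidean distance between means. The name `cluster_components` is hypothetical.

```python
import numpy as np


def cluster_components(posteriors, means, tau):
    """Cluster GMM components of the corrupting speaker for one analysis window.

    posteriors: (T, M) frame-level component posteriors in the window
    means:      (M, D) component means
    tau:        occupancy threshold in frames
    Returns (assignment, occupancy) where assignment[m] is the index of the
    centre component that represents component m.
    """
    occupancy = posteriors.sum(axis=0)  # soft count of frames per component
    centres = np.flatnonzero(occupancy >= tau)
    if centres.size == 0:
        # Assumed fallback: keep the most occupied component as the only centre.
        centres = np.array([int(np.argmax(occupancy))])
    centre_set = set(centres.tolist())
    assignment = np.empty(len(means), dtype=int)
    for m in range(len(means)):
        if m in centre_set:
            assignment[m] = m
        else:
            dist = np.linalg.norm(means[centres] - means[m], axis=1)
            assignment[m] = centres[int(np.argmin(dist))]
    return assignment, occupancy
```

With `tau = 0` every component is its own centre, which corresponds to the one-to-one combination without clustering; with a positive threshold each centre has the highest occupancy in its cluster by construction.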
0:13:40in the overlapping speech y
0:13:42the gaussian here
0:13:43would be computed using the gaussian here plus a contribution from the corrupting
0:13:48speaker
0:13:50from this component
0:13:53if you set tau to zero, that is, you don't want to set any
0:13:57threshold, then there would be no clustering
0:13:59and each gaussian would be combined one-to-one with another to
0:14:04give you the overlapping speech
0:14:09the equations for the mean update in the case of multiclass vts
0:14:13are the same as in the previous case, the only difference being that
0:14:17now you have a subscript c here which denotes the cluster
0:14:20for each cluster you have a different centroid
0:14:23and that centroid would be updated using this equation
0:14:27and as i showed in the previous slide
0:14:31earlier this mean was computed using all the gaussian components, but now this equation
0:14:38only takes into account
0:14:40the gaussians which are in the cluster c
0:14:46similarly all the other equations are identical, the only difference being that
0:14:51instead of having the single gaussian representation for x two, now we
0:14:56do it cluster-wise
0:14:57so you have a subscript c everywhere
0:15:02so that's the multiclass vts algorithm framework
0:15:07now coming to the experiments. we evaluated on the ami corpus
0:15:10which is a meeting data set
0:15:12the meetings have a group of three or four
0:15:16people who are trying to design a remote or something, so they are
0:15:20discussing, arguing, debating
0:15:22and the duration varies from seventeen to fifty-seven minutes
0:15:27the audio which we take
0:15:28is from a single distant microphone, which is the most difficult task
0:15:34and then we use mfcc features
0:15:36and for the single-speaker models we use ubm-adapted gmms
0:15:45now the evaluation metric is called the overlap detection error, which is
0:15:49the false alarm time plus the miss time
0:15:51divided by the labelled speaker overlap time. one thing to notice is that the false
0:15:55alarms come from the regions where only a single speaker is speaking
0:15:59and those regions are much larger than the overlapping speech
0:16:03so this whole expression can take values over a hundred percent
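The metric as defined here is simple enough to state in code. A minimal sketch, with a hypothetical function name and all durations in the same time unit:

```python
def overlap_detection_error(false_alarm_time, miss_time, labelled_overlap_time):
    """Overlap detection error in percent.

    (false alarm time + miss time) / labelled overlap time * 100.
    False alarms are counted over single-speaker regions, so the value
    is not bounded by 100.
    """
    if labelled_overlap_time <= 0:
        raise ValueError("labelled overlap time must be positive")
    return 100.0 * (false_alarm_time + miss_time) / labelled_overlap_time
```

For example, 50 seconds of false alarms and 30 seconds of misses against 100 seconds of labelled overlap give an error of 80 percent, while 120 seconds of false alarms with 40 seconds of misses give 160 percent.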
0:16:11the first experiment which we did was using the standard vts, where we have only
0:16:15one gaussian representation for the corrupting speaker
0:16:18we wanted to determine the analysis window size which works the best
0:16:22so we found that
0:16:24when we were using a window size of three point two seconds the
0:16:27error was
0:16:28lower as compared to the smaller windows
0:16:32above this
0:16:33the error did not decrease that much
0:16:36and instead the computation time increased a lot, because then you are doing
0:16:40the same computation
0:16:41over a larger window
0:16:44so in the next experiments we are going to use this window size
0:16:50so these are the curves for the previous table: this is the
0:16:54recall-precision curve, and the one on the top is with the window
0:16:58size two point two seconds
0:17:03so now the results for the multiclass vts. in the standard vts
0:17:07the overlap detection error rate was ninety-six point two percent
0:17:11when we use the multiclass vts
0:17:13it drops by an absolute value of sixteen percent
0:17:17and these experiments were done to determine
0:17:22what should be an optimal value for the threshold tau
0:17:26so in a window of three point two seconds we have three hundred
0:17:30twenty frames
0:17:31and if we set a
0:17:34threshold of five frames for each gaussian
0:17:38these values here denote how much clustering happens. i mean we start
0:17:44from sixty-four clusters in the beginning
0:17:46and if we have a tau of five then we have tens of these
0:17:50we found that the best results were
0:17:52when we had a threshold of one frame
0:17:55in that case
0:17:56the overlap detection error reduces to eighty percent
0:18:00which is quite good
0:18:03the final number of clusters that we got is
0:18:07twenty-four point seven, so beginning with sixty-four we end up having twenty-four
0:18:10point seven clusters here
0:18:15as i said, we have two different options for modeling the speakers
0:18:20one, we model the speakers from the oracle, or two,
0:18:23we model the speakers from the diarization output
0:18:26in the case of the oracle, the speaker models are very pure to begin with, so
0:18:29that's why
0:18:30the results are quite good
0:18:32but when we start with the diarization output
0:18:35we might get seven speakers
0:18:38when there are actually only four speakers, so
0:18:41it has its own set of problems. given that,
0:18:43the error is ninety-three point three percent
0:18:46which is still better than the standard vts approach
0:18:53so these are the curves for the previous table
0:18:56this one is using the diarization system, and that diarization system works
0:19:02in a totally unsupervised manner. the final goal we have is to take
0:19:05this diarization curve, which we want to
0:19:09improve
0:19:10up to this point, which is the oracle
0:19:12so we are trying to reduce this gap
0:19:16comparing to the other works
0:19:18the mfcc gmm system which was proposed by boakye
0:19:24works at ninety-two point four percent
0:19:27the state of the art, which uses lstm, works at
0:19:31seventy-six point nine percent
0:19:33the best that we have in this work is eighty percent
0:19:37but that is using the oracle
0:19:40the completely unsupervised system works at an error rate of ninety-three point
0:19:44two percent
0:19:52okay so
0:19:54in the conclusions: we have proposed a new approach for overlapping speech modeling
0:19:59and we extended the vts framework to the multiclass vts
0:20:04and we found that with a window of three point two seconds
0:20:09it was better
0:20:11and then we were able to have
0:20:14okay concentrations precisions up to fourteen seventy percent
0:20:17one thing to note here is that in the lstm approach
0:20:21they had very good precision, but in our case we have a much better recall
0:20:25than that
0:20:28the future work which we want to do is to include the covariance update and delta
0:20:31features, and in the case of the diarization output we want purer models
0:20:38so after that we also we extended the work for you think we wouldn't submission
0:20:43and you when the security of got its output
0:20:45so we have been able to improve these numbers from eighty to seventy-eight
0:20:50and this ninety-three point two to eighty-nine
0:20:53but still, from eighty-nine to seventy-six, we have to work on that
0:20:59although we cannot say that this is working on par with the
0:21:03state-of-the-art system, we think that this is a very promising approach
0:21:06and this can be used maybe for some other kinds of problems, maybe if you want
0:21:11to model speech corrupted with noise, but noise which is much more complex
0:21:18i think that's it, thank you
0:21:34so i'm having problems understanding: when you go from ninety-six to ninety-three percent
0:21:38error, that's a big improvement
0:21:41i'm not questioning that piece
0:21:44what might help, i guess, is if you can do some kind of a test
0:21:48like, what is the performance that you think
0:21:52is necessary for a usable system to work
0:21:56you said seventy-six is kind of state of the art
0:21:59has anyone done any test where maybe you take clean data that doesn't have
0:22:03any overlap at all
0:22:05but insert controlled amounts of overlap
0:22:09where you can run that performance metric and decide whether humans, or whether
0:22:14the subsequent diarization system,
0:22:18is acceptable when it hits, you know, an error rate of fifty percent. i'm
0:22:23not sure what number you actually have to hit before you say it's a viable
0:22:27solution, because going from ninety-six to ninety-three
0:22:32just seems like the numbers are too high to make it practically useful
0:22:38okay so answering the first question, i'm not aware of any work
0:22:42where they have artificially created or included overlaps in the audio
0:22:48the main purpose of doing all this is to improve the speaker
0:22:52recognition system, so we want to know what values finally improve the diarization error
0:23:01the state of the art using lstm had the error of seventy-six point nine, but
0:23:06i think in that paper they have not given the diarization error which
0:23:10they achieved using that system
0:23:12in our system
0:23:15we have a paper in interspeech where we also present
0:23:19the effect of this overlap detection on diarization
0:23:24in the case when we have eighty-nine percent error
0:23:29this value ninety-three point three, we have been able to reduce it to
0:23:32eighty-nine percent, and when we use that system for diarization we have a marginal improvement
0:23:37over the baseline
0:23:39so i hope that when you have an overlap detection error
0:23:45below that, it would give quite a significant improvement
0:23:59a show why
0:24:07more speakers once
0:24:09the second question: how do you define who is the main speaker and who is the corrupting speaker
0:24:19once the first base and that's of anybody question
0:24:21that's can have them sent to keep the number of more than slow and that
0:24:25we have done the thing that
0:24:28the overlaps are, i don't remember the exact values, but
0:24:33unless people are laughing together or having a very uncontrolled
0:24:38meeting or discussion, where they would tend to speak three or four
0:24:41together, otherwise they tend to, like,
0:24:43when one speaker is speaking and then some other speaker starts speaking, at
0:24:47that moment they might have an overlap of two speakers
0:24:50and this formulation of vts
0:24:56at this moment we cannot extend it to three speakers
0:25:00because of the formulation, we are assuming one additive noise
0:25:06can you repeat the second question
0:25:13so for the main speaker, we have speaker models for all speakers
0:25:18so we directly use them
0:25:20to find out which gives the highest likelihood for that analysis window
0:25:25so we use that to determine the main speaker
0:25:31i'm just wondering about the inter-annotator agreement on this task. it seems to
0:25:36be a very difficult task even for humans
0:25:38so are those numbers in the range of the inter-annotator agreement scores
0:25:43i mean
0:25:44do you have any idea on this point
0:25:47well the annotation which we have comes from icsi and i have seen the
0:25:51annotation, it's quite accurate, even the small overlaps
0:25:56have been annotated
0:25:59but i'm not sure about the inter-annotator agreement