0:00:14 | hello everyone |

0:00:16 | i'm working at the research center and i |

0:00:19 | will be presenting the talk on modeling overlapping speech using vector taylor series |

0:00:24 | this work i have done together with my colleagues |

0:00:28 | so |

0:00:30 | first i will present the motivation for studying this problem |

0:00:35 | then in brief i will discuss the previous approaches for detecting overlapping speech |

0:00:41 | then i will move on to the vector taylor series approach |

0:00:44 | in which we have two parts the first part is |

0:00:47 | using the standard vts approach and then the next part is the multiclass vts |

0:00:51 | algorithm which is |

0:00:53 | which we have proposed in this work and then we will discuss the results and |

0:00:57 | experiments |

0:00:59 | so the motivation comes from the problem of speaker diarization |

0:01:03 | which is the task of determining who spoke when in a meeting audio |

0:01:07 | so given the audio recording you want to find out the different portions which |

0:01:12 | belong to the different speakers |

0:01:15 | one challenge is that the number of speakers is not known a priori |

0:01:18 | so it has to be determined in an unsupervised manner |

0:01:24 | now in this task overlapping speech |

0:01:28 | becomes a very big |

0:01:30 | source of error |

0:01:32 | so first i will define overlapping speech it is the moment when two or more |

0:01:37 | speakers speak simultaneously it might be when people are debating they are arguing or when |

0:01:42 | they are just |

0:01:45 | agreeing or disagreeing or |

0:01:49 | things like that or when they are laughing together |

0:01:52 | so what happens is that when you have overlapping speech in your audio then you |

0:01:57 | cannot model |

0:01:59 | your speaker models very precisely |

0:02:01 | or when you are doing speaker recognition and you want to assign one speaker identity |

0:02:06 | to a portion in which actually there are two people speaking then that also |

0:02:10 | results in some errors in speaker recognition |

0:02:14 | and previous studies have shown that in meetings sometimes almost twenty percent of |

0:02:19 | the spoken time can be overlapping if the participants are very |

0:02:24 | active |

0:02:28 | now the previous approaches so one of the first works was done by boakye |

0:02:32 | in which an hmm was used to segment the audio into the three classes speech non-speech |

0:02:37 | and overlapping speech |

0:02:39 | this was the baseline |

0:02:40 | then people have used |

0:02:42 | the knowledge of silence distribution |

0:02:45 | and cues like speaker changes because it has been found that people tend to overlap |

0:02:50 | when the speaker changes |

0:02:53 | the state-of-the-art is based on convolutive non-negative sparse coding in which |

0:03:00 | they learn |

0:03:03 | a basis for each speaker |

0:03:05 | and then they try to find out the activity of each speaker for each frame |

0:03:09 | and they have also used the same features with long short-term memory |

0:03:14 | neural networks |

0:03:18 | now we come to |

0:03:19 | our problem |

0:03:20 | so before i move to overlapping speech there is an analogous problem |

0:03:24 | of how to model speech which is corrupted with noise |

0:03:28 | so if you have a noisy speech signal y then you can express it in |

0:03:32 | the signal domain as |

0:03:33 | x convolved with h plus n where x is your clean speech |

0:03:37 | h is the channel and n is the additive noise |

0:03:43 | so in the mfcc domain |

0:03:46 | these are the mel-scale filterbank power spectra |

0:03:49 | you take the log and then the dct and then you get the mfcc features |

0:03:53 | so in the mfcc domain |

0:03:55 | this simple expression here |

0:03:57 | it becomes |

0:03:59 | a quite complex expression where you have a linear part and a nonlinear part |

0:04:04 | this c is the dct matrix and c plus is the pseudoinverse of that |

0:04:09 | so we call this nonlinear part g |

0:04:12 | so you have y equals x plus h plus this nonlinear part |
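As a concrete sketch of the nonlinear term the talk calls g, here is a small numpy illustration (toy dimensions, and the channel term h is dropped for brevity; this is our reconstruction of the standard formula, not the actual system code):

```python
import numpy as np

def dct_matrix(n_cep, n_mel):
    # Orthonormal type-II DCT: maps log mel-filterbank energies to cepstra.
    k = np.arange(n_cep)[:, None]
    m = np.arange(n_mel)[None, :]
    C = np.sqrt(2.0 / n_mel) * np.cos(np.pi * k * (2 * m + 1) / (2 * n_mel))
    C[0] /= np.sqrt(2.0)
    return C

def mismatch_g(x, n, C, C_pinv):
    # Nonlinear VTS mismatch term in the cepstral domain (channel h dropped):
    #   g(x, n) = C log(1 + exp(C+ (n - x)))
    return C @ np.log1p(np.exp(C_pinv @ (n - x)))

n_mel, n_cep = 24, 13
C = dct_matrix(n_cep, n_mel)
C_pinv = np.linalg.pinv(C)

x = C @ np.zeros(n_mel)             # clean-speech cepstrum (toy values)
n_weak = C @ np.full(n_mel, -20.0)  # noise far below the speech level
y = x + mismatch_g(x, n_weak, C, C_pinv)  # ~= x: weak noise barely matters
```

When the noise is far below the speech, g is nearly zero and y reduces to the clean cepstrum; when their levels are equal, each log energy rises by log 2, which matches intuition about adding two equal powers.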

0:04:16 | now |

0:04:18 | we want to model this equation and we use the vector taylor series which is |

0:04:22 | basically to |

0:04:23 | expand this expression here |

0:04:26 | so it is simply a taylor expansion of a function about a point |

0:04:30 | where you |

0:04:31 | have the zeroth-order term |

0:04:33 | and this is the first-order term where you take the first derivative |

0:04:37 | so we write this expression for the noisy speech |

0:04:41 | if you expand it around the point mu x mu n and |

0:04:44 | mu h which are the mean of the clean speech the mean of the noise and the |

0:04:48 | you know |

0:04:49 | channel |

0:04:51 | then you get this expression here in which the first line |

0:04:54 | is the evaluation of y around this point |

0:04:58 | and the second line is |

0:04:59 | the first-order term |

0:05:01 | in which |

0:05:01 | we have this capital g and this capital f |

0:05:07 | they are the derivatives of y with respect to x and n |

0:05:15 | so in the standard |

0:05:18 | in the standard vector taylor series when you are trying to model |

0:05:21 | this y here |

0:05:23 | what people do is that they fit a gmm for x |

0:05:26 | and |

0:05:27 | a single gaussian for the noise n and h |

0:05:29 | this is because the noise is assumed stationary |

0:05:32 | and h is the channel |

0:05:37 | so you take a gmm it is being corrupted by additive noise and then |

0:05:42 | using the vts approximation you can obtain the noisy |

0:05:46 | speech model |

0:05:49 | here |

0:05:49 | these are gaussians so they look like one but they are the |

0:05:53 | components of the gmm |
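The per-component compensation just described can be sketched as follows. This is a toy illustration in the log-filterbank domain (so the DCT matrix and its pseudoinverse drop out and the Jacobians become diagonal), with made-up dimensions and parameters, not the actual system:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def vts_compensate(mu_x, Sigma_x, mu_n, Sigma_n):
    # First-order VTS compensation of one speech-GMM component by a single
    # noise Gaussian (log-filterbank domain, channel h omitted):
    #   mu_y    = mu_x + log(1 + exp(mu_n - mu_x))
    #   Sigma_y = G Sigma_x G^T + F Sigma_n F^T,  G = dy/dx, F = dy/dn
    s = sigmoid(mu_n - mu_x)
    F = np.diag(s)                 # dy/dn
    G = np.eye(len(mu_x)) - F      # dy/dx (note G + F = I here)
    mu_y = mu_x + np.log1p(np.exp(mu_n - mu_x))
    Sigma_y = G @ Sigma_x @ G.T + F @ Sigma_n @ F.T
    return mu_y, Sigma_y

# Compensate every component of a toy 2-component GMM:
mus_x = [np.zeros(4), np.full(4, 3.0)]
Sig_x = np.eye(4)
mu_n, Sig_n = np.full(4, -10.0), 0.5 * np.eye(4)  # weak stationary "noise"
compensated = [vts_compensate(m, Sig_x, mu_n, Sig_n) for m in mus_x]
```

With the noise far below the speech each compensated component stays close to the clean one, and with dominant noise the compensated mean approaches the noise mean, which is the sanity check one expects from the model.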

0:05:57 | now we come to the overlapping speech so what we propose is that |

0:06:01 | the overlapping speech is actually just a superposition of two or more individual |

0:06:05 | speakers |

0:06:06 | so if we see the model for the noisy speech we can make |

0:06:11 | the analogous model for overlapping speech where we say that this x is x one |

0:06:16 | which we call the main speaker |

0:06:18 | and this x two here is the corrupting speaker |

0:06:21 | which acts like the additive noise |

0:06:24 | and for simplicity we ignore h the channel because |

0:06:30 | the recordings of all the speakers are made in the same room |

0:06:34 | so we are not going to deal with h here |

0:06:38 | so |

0:06:38 | doing the |

0:06:40 | analogy we have this expression where the overlapping speech is now a |

0:06:45 | combination of |

0:06:47 | a linear term plus this nonlinear term |

0:06:50 | and this nonlinear term remains the same as in the case of the noisy speech |

0:06:56 | again analogous to the case of noisy speech we have the main speaker gmm here |

0:07:01 | and the corrupting speaker which is being represented by a single gaussian here |

0:07:05 | like the additive noise |

0:07:07 | the equations are totally similar to those of the noisy speech |

0:07:11 | and you can see here |

0:07:13 | the subscript m so each component of y here is being computed using this |

0:07:19 | component from the main speaker and then some contribution from the corrupting speaker |

0:07:26 | and |

0:07:27 | this g and f which are the derivatives of y |

0:07:31 | they are also different for each component |

0:07:35 | now |

0:07:37 | if you take the expectation of this y here then you can get the mean |

0:07:42 | for the overlapping speech and the variance for the overlapping speech so this is the |

0:07:46 | final overlapping speech model which we want to estimate |

0:07:53 | now for estimating that model we are going to use the em algorithm for |

0:07:58 | which this is the q function |

0:08:00 | so given the overlapping speech data y one to y t over the time |

0:08:04 | frames |

0:08:05 | we want to maximize the probability of |

0:08:09 | having this data under the overlapping speech model mu y m sigma y m |

0:08:14 | and |

0:08:15 | we optimize this function q |

0:08:17 | with respect to the mean of the corrupting speaker x |

0:08:20 | two |

0:08:21 | so the update equation for the mean is this expression mu x two zero is |

0:08:26 | the previous |

0:08:28 | value for the |

0:08:29 | mean of the corrupting speaker and this is the new value for the mean of the corrupting |

0:08:32 | speaker |

0:08:34 | one thing that you can notice here is that |

0:08:37 | this one mean of x two represents the whole corrupting speaker because it is being |

0:08:41 | updated |

0:08:42 | using all the mixture components |

0:08:45 | from the overlapping speech model |

0:08:49 | so the whole vts algorithm works something like this initially we estimate or |

0:08:55 | we initialize the mean of the corrupting speaker and the covariance |

0:08:59 | then we compute the overlapping speech model using these expressions |

0:09:04 | after that we run the em loop where we optimize the q function |

0:09:08 | and we replace the old mean of the corrupting speaker mu x two zero by |

0:09:12 | the new mean mu x two |

0:09:15 | in this work we are not going to update sigma x two because it |

0:09:20 | is very heavy computationally |

0:09:24 | then when this loop converges we finally get the |

0:09:28 | overlapping speech model |

0:09:30 | which we used for overlapping speech detection |
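The loop just described might be sketched like this. It is a toy, diagonal-covariance illustration in the log-filterbank domain with synthetic data, and the M-step is our reconstruction as the usual linearized weighted least-squares update of the corrupting mean (keeping its covariance fixed, as the talk keeps sigma x two fixed); it is not the actual system code:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compensate(mu_x, var_x, mu_n, var_n):
    # Diagonal first-order VTS compensation (log-filterbank domain).
    s = sigmoid(mu_n - mu_x)                       # per-dimension dy/dn
    mu_y = mu_x + np.log1p(np.exp(mu_n - mu_x))
    var_y = (1.0 - s) ** 2 * var_x + s ** 2 * var_n
    return mu_y, var_y, s

def em_corrupting_mean(Y, mus_x, var_x, mu_n0, var_n, n_iter=10):
    # EM re-estimation of the corrupting-source mean mu_n.
    mu_n = mu_n0.copy()
    for _ in range(n_iter):
        # E-step: responsibilities under the compensated "overlap" GMM.
        comps = [compensate(m, var_x, mu_n, var_n) for m in mus_x]
        logp = np.stack([
            -0.5 * np.sum((Y - mu_y) ** 2 / var_y + np.log(var_y), axis=1)
            for mu_y, var_y, _ in comps], axis=1)
        logp -= logp.max(axis=1, keepdims=True)
        gam = np.exp(logp)
        gam /= gam.sum(axis=1, keepdims=True)
        # M-step: weighted least-squares step using F_m = diag(s_m).
        num = np.zeros_like(mu_n)
        den = np.zeros_like(mu_n)
        for m, (mu_y, var_y, s) in enumerate(comps):
            w = gam[:, m][:, None]
            num += np.sum(w * s * (Y - mu_y) / var_y, axis=0)
            den += np.sum(w * (s ** 2) / var_y, axis=0)
        mu_n = mu_n + num / np.maximum(den, 1e-12)
    return mu_n

# Toy overlap data: main-speaker GMM with two components, corrupting
# source at true mean 2.0, frames combined exactly as log(e^x + e^n).
d, T = 2, 400
mus_x = [np.zeros(d), np.full(d, 4.0)]
var_x, var_n = np.full(d, 0.1), np.full(d, 0.1)
comp = rng.integers(0, 2, T)
X = np.stack([rng.normal(mus_x[c], np.sqrt(var_x)) for c in comp])
N = rng.normal(2.0, np.sqrt(var_n), size=(T, d))
Y = np.log(np.exp(X) + np.exp(N))

mu_hat = em_corrupting_mean(Y, mus_x, var_x, np.zeros(d), var_n)
```

Starting from a wrong initial mean (zero), a few iterations pull the estimate close to the true corrupting level of 2.0.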

0:09:36 | so the overlapping speech detection system for input it takes the meeting audio |

0:09:40 | recordings |

0:09:42 | and the recordings are in the form of speech segments which we got using a speech activity detection |

0:09:46 | system |

0:09:48 | then one major task is to have the initial speaker models for the |

0:09:54 | main speaker and the corrupting speaker the question is how to get them |

0:09:56 | so there are two options either you can use the oracle speaker segmentation |

0:10:01 | or we take them from the diarization output |

0:10:04 | so this is a much more challenging task because when you take the speaker alignment |

0:10:10 | from the diarization output |

0:10:12 | you don't know how many speakers there actually are in your audio |

0:10:16 | so you might get more than the actual number of speakers in the diarization |

0:10:20 | output |

0:10:22 | the output which we want finally is the detection of overlaps |

0:10:28 | now so |

0:10:30 | given the audio recording this blue box shows a |

0:10:34 | speech segment given by the speech activity detection |

0:10:37 | and we move a sliding analysis window over it |

0:10:40 | for each analysis window we can have n squared hypotheses in an |

0:10:45 | overlap there would be two speakers who would be overlapping so |

0:10:49 | if you have n speakers then the total number of overlapping speech models can be |

0:10:53 | n squared minus n |

0:10:55 | and this n shows the single-speaker models when only one speaker is speaking |

0:11:01 | so this is a huge number so what we do is that for each speech |

0:11:05 | segment first we determine the main speaker and then we compute the overlapping speech models |

0:11:11 | where that main speaker is being corrupted by |

0:11:13 | some other speaker |

0:11:16 | finally we have overlap models that say that speaker |

0:11:21 | i is being corrupted by speaker j |

0:11:23 | and then the single-speaker models |

0:11:25 | where speaker i is speaking alone so we compare all these likelihood ratios to |

0:11:30 | determine if we have overlapping speech or single-speaker speech |
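The hypothesis comparison just described can be sketched with placeholder scores. The speaker indices, log-likelihood values, and the threshold below are invented for illustration; in the real system the scores come from the VTS overlap and single-speaker models:

```python
# With n speakers there are n*n - n ordered overlap models plus n single ones.
n = 4
num_overlap_models = n * n - n  # 12 for n = 4

# Placeholder log-likelihoods for a segment whose main speaker is 1:
loglik_single = {1: -410.0}                      # speaker 1 alone
loglik_overlap = {(1, 2): -395.0, (1, 3): -402.0, (1, 4): -408.0}

# Pick the best overlap hypothesis and compare it to "speaker 1 alone".
best_pair = max(loglik_overlap, key=loglik_overlap.get)
llr = loglik_overlap[best_pair] - loglik_single[1]
threshold = 5.0                                   # tunable operating point
is_overlap = llr > threshold
```

Restricting the overlap hypotheses to those involving the detected main speaker is what reduces the n squared search to a linear one per segment.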

0:11:37 | so |

0:11:39 | up to here that was the standard vector taylor series now we |

0:11:42 | move to the multiclass vector taylor series algorithm so you would have seen |

0:11:47 | that in the standard vts we used only one single gaussian distribution for the noise |

0:11:52 | and that might be good in the case when we are dealing with |

0:11:55 | noise but in the case of overlapping speech |

0:11:58 | the other source the corrupting speaker is a human being |

0:12:02 | himself so |

0:12:04 | it's not like noise he might utter multiple phonemes in that |

0:12:08 | window |

0:12:09 | so we want to represent him using more detail |

0:12:13 | or a better model |

0:12:16 | so what we propose is that instead of having one single gaussian here we assume |

0:12:22 | that all the gaussians in the gmm of x two |

0:12:26 | are also present |

0:12:29 | so now we are going to have a vts combination of |

0:12:33 | two gmms with this gmm for the corrupting speaker |

0:12:40 | so what we do here is that we first start with the assumption that |

0:12:44 | each of these gaussians might have been active in that analysis window |

0:12:49 | and then for each of the gaussians we compute a gamma value which is the average |

0:12:54 | number of frames assigned to that gaussian component in that analysis window |

0:12:58 | if this gamma value happens to be lower than a threshold |

0:13:03 | then we cluster it with a neighbouring gaussian component |

0:13:06 | so |

0:13:07 | we get like this kind of clustering |

0:13:10 | and |

0:13:11 | then we say that |

0:13:14 | the gaussian which has the highest gamma in this cluster would be the |

0:13:18 | centroid so here all these components would be represented by one single |

0:13:23 | gaussian and all these gaussians would be updated by the cluster centroid of |

0:13:28 | this cluster |

0:13:30 | we can make that assumption because |

0:13:31 | all these gaussian mixture models have been derived |

0:13:34 | using the same reference ubm |

0:13:39 | so |

0:13:40 | in the overlapping speech y |

0:13:42 | the gaussian here |

0:13:43 | would be computed using the gaussian here plus a contribution from the corrupting |

0:13:48 | speaker |

0:13:50 | from this component |

0:13:53 | if you set the threshold to zero that means you don't want to set any |

0:13:57 | threshold and there will be no clustering |

0:13:59 | and each gaussian would have a one-to-one combination to |

0:14:04 | give you the overlapping speech |
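A minimal sketch of the occupancy-based clustering just described, with invented one-dimensional component means and gamma counts (the real system clusters the multivariate Gaussians of the corrupting-speaker GMM; the nearest-centroid rule here is one plausible reading of "cluster it with a neighbouring component"):

```python
import numpy as np

def cluster_by_occupancy(means, gammas, threshold):
    # Components whose average frame count (gamma) in the analysis window
    # reaches the threshold survive as cluster centroids; every component
    # is then assigned to the closest surviving centroid.
    means = np.asarray(means, dtype=float)
    gammas = np.asarray(gammas, dtype=float)
    keep = np.where(gammas >= threshold)[0]       # retained centroids
    assign = np.empty(len(means), dtype=int)
    for i, mu in enumerate(means):
        assign[i] = keep[np.argmin(np.abs(means[keep] - mu))]
    return assign

means = [0.0, 0.2, 3.0, 3.1, 6.0]
gammas = [40.0, 2.0, 55.0, 1.0, 80.0]  # avg frames per component
assign = cluster_by_occupancy(means, gammas, threshold=5.0)
# components 1 and 3 fall below the threshold and are absorbed
```

With the threshold set to zero every component keeps itself as centroid, which reproduces the talk's remark that a zero threshold means no clustering and a one-to-one combination.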

0:14:09 | the equations for the mean update in the case of multiclass vts remain the same |

0:14:13 | as in the previous case the only difference being that |

0:14:17 | now you have a subscript c here which denotes the cluster |

0:14:20 | so for each cluster you have a different centroid |

0:14:23 | and that centroid would be updated using this equation |

0:14:27 | and as i showed in the previous slide |

0:14:31 | there this mean was being computed using all the gaussian components but now this equation |

0:14:38 | only takes into account |

0:14:40 | the gaussians which are in the cluster c |

0:14:46 | similarly all the other equations are identical the only difference being that |

0:14:51 | instead of having the single gaussian representation for x two now we |

0:14:56 | do it cluster-wise |

0:14:57 | so you have a subscript c everywhere |

0:15:02 | so that's the multiclass vector taylor series algorithm framework |

0:15:07 | now coming to the experiments so the data set we run it on is ami |

0:15:10 | which is a meeting data set |

0:15:12 | so the meetings are kind of like there are a group of three or four |

0:15:16 | people who are trying to design a remote control or something so they are |

0:15:20 | discussing arguing debating |

0:15:22 | and the meeting duration varies from seventeen to fifty seven minutes |

0:15:27 | the audio which we take |

0:15:28 | is from a single distant microphone which is the most difficult condition |

0:15:34 | and then we use like mfcc features |

0:15:36 | and for the single-speaker models we use map-adapted |

0:15:41 | gmms |

0:15:45 | now the error measure that is used is called the overlap detection error which is |

0:15:49 | the false alarm time plus the miss time |

0:15:51 | divided by the labelled speaker overlap time so one thing to note is that the false |

0:15:55 | alarms come from the regions where only a single speaker is speaking |

0:15:59 | and those regions are quite a lot bigger than the overlapping speech |

0:16:03 | so this whole expression can take values over a hundred |

0:16:07 | percent |
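The metric as just defined, with invented times showing why it can exceed one hundred percent:

```python
def overlap_detection_error(false_alarm_t, miss_t, ref_overlap_t):
    # (false alarm time + miss time) / labelled overlap time, in percent.
    # Because false alarms accrue over the (much longer) single-speaker
    # regions while the denominator is only the reference overlap time,
    # the value is not bounded by 100%.
    return 100.0 * (false_alarm_t + miss_t) / ref_overlap_t

# Toy numbers (seconds): 300 s of false alarms plus 250 s of misses
# against only 500 s of labelled overlap gives an error above 100%.
err = overlap_detection_error(false_alarm_t=300.0, miss_t=250.0,
                              ref_overlap_t=500.0)
```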

0:16:11 | the first experiment which we did was using the standard vts where we have only |

0:16:15 | one gaussian representation for the corrupting speaker |

0:16:18 | we wanted to determine the analysis window size which would work best |

0:16:22 | so we found that |

0:16:24 | when we were using a window size of three point two seconds the |

0:16:27 | error rate was |

0:16:28 | lower as compared to smaller windows like these |

0:16:32 | above this |

0:16:33 | the error rate does not improve that much and |

0:16:36 | instead the computation time increases a lot because then you are |

0:16:40 | spending the same computational effort |

0:16:41 | over a larger window |

0:16:44 | so in the next experiment we are going to use this window size |

0:16:50 | so this is the curve for the previous table so this is the |

0:16:54 | recall precision curve we get and the curve on the top is with the window |

0:16:58 | size three point two seconds |

0:17:03 | so now we present the results for the multiclass vts so in the standard vts |

0:17:07 | the overlap detection error rate was ninety six point two percent |

0:17:11 | when we use the multiclass vts |

0:17:13 | it improves by an absolute value of sixteen percent |

0:17:17 | and these four experiments here are to determine |

0:17:22 | what should be an optimal value for the threshold for the clustering |

0:17:26 | so in a window of three point two seconds we have three hundred |

0:17:30 | twenty frames |

0:17:31 | and if we set a |

0:17:34 | threshold of five frames for each gaussian |

0:17:36 | then |

0:17:38 | these values here denote how much clustering happened i mean we start |

0:17:44 | from sixty four clusters in the beginning |

0:17:46 | and if we have a threshold of five then we have only this many of them left |

0:17:50 | we found that the best results were |

0:17:52 | when we were using a threshold of one frame |

0:17:55 | in that case |

0:17:56 | the overlap detection error reduces to eighty percent |

0:18:00 | which is quite good |

0:18:02 | and |

0:18:03 | the final number of clusters that we got is |

0:18:07 | twenty four point seven so beginning with sixty four we end up having twenty four |

0:18:10 | point seven clusters on average |

0:18:14 | so |

0:18:15 | as i said we have two different kinds of options for modeling the speakers |

0:18:20 | one we model the speakers from the oracle segmentation or |

0:18:23 | we model the speakers from the diarization output |

0:18:26 | so in the case of the oracle the speaker models are quite pure to begin with so |

0:18:29 | that's why |

0:18:30 | the results there are quite good |

0:18:32 | but when we start with the diarization output |

0:18:35 | we might get say seven target speakers |

0:18:38 | when there are actually only four speakers so |

0:18:41 | with that set of problems the error rate is |

0:18:43 | ninety three point three percent |

0:18:46 | which is still better than the standard vts approach |

0:18:53 | so these are the curves for the previous table |

0:18:56 | the other one is using the diarization system so the diarization system works |

0:19:02 | in a totally unsupervised manner and the final goal we have is to take |

0:19:05 | this diarization back-end and |

0:19:09 | improve it |

0:19:10 | up to this point which is with the oracle |

0:19:12 | so we are trying to reduce this gap |

0:19:16 | comparing to the other works |

0:19:18 | so the mfcc gmm system which was proposed by boakye |

0:19:24 | it works at an error of ninety two point four percent |

0:19:27 | whereas the state-of-the-art which uses lstm networks works at |

0:19:31 | seventy six point nine percent |

0:19:33 | the best error that we have in this work is eighty percent |

0:19:37 | but that is using the oracle segmentation |

0:19:40 | the completely unsupervised system works at an overlap detection error of ninety three point |

0:19:44 | three percent |

0:19:49 | so |

0:19:50 | after |

0:19:52 | okay so |

0:19:54 | so in the conclusions we have proposed a new approach for overlapping speech modeling |

0:19:59 | and we extended the vector taylor series framework to the multiclass vts |

0:20:04 | system |

0:20:04 | and we analyzed that if we have a window of three point two seconds |

0:20:09 | it works better |

0:20:11 | and then we were able to achieve |

0:20:14 | good detection precision |

0:20:17 | one thing to note here is that in the lstm approach |

0:20:21 | they had very good precision but in our case we have a much better |

0:20:25 | recall |

0:20:28 | the future work which we want to do is to include the covariance adaptation and delta |

0:20:31 | features and in the case of the diarization output we want better models |

0:20:36 | to |

0:20:37 | use |

0:20:38 | so after that we also extended the work for the interspeech submission |

0:20:43 | using the diarization output |

0:20:45 | so we have been able to improve these numbers from eighty to seventy eight |

0:20:50 | and this ninety three point three down to eighty nine |

0:20:53 | but still from eighty nine to seventy six there is work we have to do |

0:20:59 | so although we cannot say that it is working on par with the |

0:21:03 | state-of-the-art system we think that this is a very promising approach |

0:21:06 | and this can be used maybe for some other tasks maybe if you want |

0:21:11 | to model speech corrupted with noise but noise which is much more complex |

0:21:18 | i think that's it thank you |

0:21:34 | so i'm having problems understanding when you go from ninety six to ninety three percent |

0:21:38 | error that that's a big improvement |

0:21:41 | i'm not questioning that |

0:21:44 | it's more what might help i guess is if we could define some kind of a test |

0:21:48 | like what is the performance that you think |

0:21:52 | is necessary for a usable system to work |

0:21:56 | you said seventy six is kind of state-of-the-art |

0:21:59 | has anyone done any test where maybe you take clean data that doesn't have |

0:22:03 | any overlap at all |

0:22:05 | but insert certain controlled amounts of overlap |

0:22:09 | where you can run that performance metric there and decide whether humans or whether |

0:22:14 | the subsequent diarization system |

0:22:18 | find it acceptable when it hits you know an error rate of fifty percent i'm |

0:22:23 | not sure what number you actually have to hit before you say it's a viable |

0:22:27 | solution because coming from ninety six to ninety three |

0:22:32 | it just seems like the numbers are too high to make it practically usable |

0:22:38 | okay so for the first question i'm not aware of any work |

0:22:42 | where they have artificially created or inserted overlaps in the audio |

0:22:47 | but |

0:22:48 | so the main task the main purpose of doing all this is to improve the speaker |

0:22:52 | diarization system so we want to know how much it helps to finally improve the diarization error |

0:22:59 | so |

0:23:01 | the state-of-the-art using lstms had an error rate of seventy six point nine but |

0:23:06 | i think in that paper they have not given the diarization error rate which |

0:23:10 | they have achieved using that system |

0:23:12 | in our system so |

0:23:15 | we have a paper in interspeech where we also present |

0:23:19 | the effect of this overlap detection on diarization |

0:23:24 | in the case when we have eighty nine percent error |

0:23:27 | so |

0:23:29 | this value ninety three point three we have meanwhile managed to reduce to |

0:23:32 | eighty nine percent and when we use that system for diarization we have a marginal improvement |

0:23:37 | over the baseline |

0:23:39 | so i hope that when someone or when we have an overlap detection error |

0:23:44 | rate |

0:23:45 | below eighty it would give quite a significant improvement |

0:23:59 | (inaudible) |

0:24:03 | sure |

0:24:07 | can it handle more speakers at once |

0:24:09 | the second question is how do you define who is the main speaker and who is the corrupting one |

0:24:17 | so |

0:24:19 | so for the first question that's a very good question |

0:24:21 | whether we can extend it to a number of speakers more than two and what |

0:24:25 | we have found is that |

0:24:28 | the overlaps i don't remember the exact values but |

0:24:33 | unless people are laughing together or having a very uncontrolled |

0:24:38 | meeting or discussion they would not tend to speak all |

0:24:41 | together otherwise they tend to |

0:24:43 | but when one speaker is speaking and then some other speaker starts speaking at |

0:24:47 | that moment they might have an overlap of two speakers |

0:24:50 | and this formulation of vts |

0:24:55 | well |

0:24:56 | at this moment we cannot extend it to three speakers |

0:25:00 | because of the formulation we are assuming one additive noise |

0:25:06 | and can you repeat the second question |

0:25:10 | sorry |

0:25:12 | okay |

0:25:13 | so for the main speaker we have speaker models for all speakers |

0:25:18 | so we directly use them |

0:25:20 | to find out which gives the most likely score for that analysis window |

0:25:25 | so we use that to determine the main speaker |

0:25:31 | i'm just wondering about the inter-annotator agreement on this task it seems to |

0:25:36 | be a very difficult task even for humans |

0:25:38 | so are those numbers in the range of the inter-annotator agreement |

0:25:43 | i mean |


0:25:47 | so the annotations which we have come from icsi and i have seen that the |

0:25:51 | annotation is quite accurate even the overlaps of more than two speakers |

0:25:56 | have been annotated |

0:25:59 | but i'm not sure about the inter-annotator agreement |