| 0:00:15 | okay | 
|---|
| 0:00:15 | thank you my name is abraham | 
|---|
| 0:00:18 | i'll present the work we have carried out by extracting i-vectors from | 
|---|
| 0:00:23 | short and long time speech features for speaker clustering | 
|---|
| 0:00:26 | and | 
|---|
| 0:00:27 | this is also the work of jordi luque and javier hernando | 
|---|
| 0:00:32 | so the outline of | 
|---|
| 0:00:34 | the presentation is as follows so first we will describe | 
|---|
| 0:00:38 | the objectives of our research | 
|---|
| 0:00:40 | we will also describe the main | 
|---|
| 0:00:44 | long-term features that are used in our experiments we will also mention the | 
|---|
| 0:00:49 | baseline and the proposed speaker diarization architectures | 
|---|
| 0:00:53 | and then we will | 
|---|
| 0:00:55 | describe the fusion techniques that are carried out in the speaker segmentation and speaker clustering | 
|---|
| 0:01:00 | and finally the experimental setup and conclusions will be presented | 
|---|
| 0:01:06 | so first of all | 
|---|
| 0:01:08 | speaker diarization consists of two main tasks and these are | 
|---|
| 0:01:12 | speaker segmentation and speaker clustering | 
|---|
| 0:01:14 | and in speaker segmentation | 
|---|
| 0:01:16 | a given audio signal is | 
|---|
| 0:01:19 | split into speaker-homogeneous segments and in speaker clustering | 
|---|
| 0:01:23 | speech segments that belong to a given speaker are grouped together | 
|---|
| 0:01:28 | so the main motivation for this work is that in our previous | 
|---|
| 0:01:32 | work | 
|---|
| 0:01:32 | we have shown that the use of jitter and shimmer and | 
|---|
| 0:01:36 | prosodic features have improved | 
|---|
| 0:01:39 | the performance of | 
|---|
| 0:01:41 | gmm based speaker diarization systems so based on this | 
|---|
| 0:01:45 | we have proposed the extraction of i-vectors from these | 
|---|
| 0:01:49 | voice quality and prosodic features | 
|---|
| 0:01:51 | and then to fuse their cosine distance scores with those of the | 
|---|
| 0:01:56 | mfcc for the speaker clustering task | 
|---|
| 0:02:00 | so here in the feature selection | 
|---|
| 0:02:02 | we select different sets of features from the voice quality and from the prosodic ones | 
|---|
| 0:02:08 | from the voice quality ones we extract | 
|---|
| 0:02:10 | features called absolute jitter absolute shimmer and shimmer apq3 and from the prosodic ones we extract | 
|---|
| 0:02:16 | the pitch | 
|---|
| 0:02:18 | intensity and the first four formant frequencies | 
|---|
| 0:02:21 | once these features are extracted they are stacked in the same feature vector | 
|---|
| 0:02:27 | then we extract two different sets of i-vectors the first i-vector is from | 
|---|
| 0:02:32 | the mfcc | 
|---|
| 0:02:33 | and the second i-vector is from the long-term features | 
|---|
| 0:02:37 | then the cosine similarity of these two | 
|---|
| 0:02:41 | i-vectors is used for speaker clustering task | 
|---|
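The clustering score mentioned here is plain cosine similarity between the i-vectors of two clusters. A minimal sketch of that scoring step (numpy assumed; the function name is illustrative, not from the talk):

```python
import numpy as np

def cosine_score(w1: np.ndarray, w2: np.ndarray) -> float:
    """Cosine similarity between two i-vectors."""
    return float(np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2)))
```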
| 0:02:46 | so these are the main speech features that are used in our experiments the mfcc the | 
|---|
| 0:02:52 | voice quality ones that is the jitter and shimmer and we have also used the prosodic ones | 
|---|
| 0:02:59 | so from the voice qualities we have selected three different measurements based on previous | 
|---|
| 0:03:04 | studies these are the absolute jitter which measures the variation between | 
|---|
| 0:03:09 | two consecutive periods | 
|---|
| 0:03:11 | and we have also used the absolute shimmer | 
|---|
| 0:03:15 | it measures the variation of the amplitude between consecutive periods and also | 
|---|
| 0:03:20 | the shimmer apq3 | 
|---|
| 0:03:22 | it is similar to shimmer but the difference is that | 
|---|
| 0:03:26 | it takes into consideration three consecutive periods | 
|---|
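A sketch of how these three measurements are typically computed from per-cycle pitch periods and peak amplitudes; the formulas follow the common Praat-style definitions, which is an assumption about the exact variants used here:

```python
import numpy as np

def absolute_jitter(periods):
    """Mean absolute difference between consecutive pitch periods."""
    p = np.asarray(periods, dtype=float)
    return float(np.mean(np.abs(np.diff(p))))

def absolute_shimmer_db(amplitudes):
    """Mean absolute log-ratio of consecutive peak amplitudes, in dB."""
    a = np.asarray(amplitudes, dtype=float)
    return float(np.mean(np.abs(20.0 * np.log10(a[1:] / a[:-1]))))

def shimmer_apq3(amplitudes):
    """Three-point amplitude perturbation quotient: each amplitude is
    compared with the mean of itself and its two neighbours, normalized
    by the overall mean amplitude."""
    a = np.asarray(amplitudes, dtype=float)
    neighbour_mean = (a[:-2] + a[1:-1] + a[2:]) / 3.0
    return float(np.mean(np.abs(a[1:-1] - neighbour_mean)) / np.mean(a))
```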
| 0:03:31 | so from prosody we have extracted pitch | 
|---|
| 0:03:34 | intensity and formant frequencies | 
|---|
| 0:03:38 | so when it comes to the speaker diarization architecture first i'll try to describe the | 
|---|
| 0:03:43 | baseline system | 
|---|
| 0:03:45 | so given a speech signal | 
|---|
| 0:03:48 | so we | 
|---|
| 0:03:49 | first take the speech and non-speech mappings from an oracle | 
|---|
| 0:03:53 | the main reason we are using the oracle sad is | 
|---|
| 0:03:56 | that we are mainly interested in the speaker errors | 
|---|
| 0:03:59 | rather than in the speech and non-speech errors | 
|---|
| 0:04:03 | then we extract the mfcc the jitter and shimmer and the prosodic ones only for | 
|---|
| 0:04:08 | the speech frames | 
|---|
| 0:04:10 | then the jitter and shimmer and the prosodic ones are stacked in the same feature | 
|---|
| 0:04:14 | vectors | 
|---|
| 0:04:17 | so based on the size of the data set the | 
|---|
| 0:04:19 | initial number of clusters is initialized if we have | 
|---|
| 0:04:23 | more data that is if the | 
|---|
| 0:04:25 | size of the data is larger or if the | 
|---|
| 0:04:29 | show is longer we have a larger number of clusters and if it is shorter we | 
|---|
| 0:04:33 | have a smaller number of clusters so the initial number of clusters | 
|---|
| 0:04:38 | depends just on the duration of the audio signal | 
|---|
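A tiny sketch of this duration-based initialization; the seconds-per-cluster ratio and the minimum count are hypothetical tuning values, since the talk only states that the count grows with the duration:

```python
def initial_num_clusters(duration_s: float,
                         seconds_per_cluster: float = 100.0,
                         min_clusters: int = 2) -> int:
    """Longer recordings start with more initial clusters."""
    return max(min_clusters, round(duration_s / seconds_per_cluster))
```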
| 0:04:42 | then we assign segments consecutively to these initialized clusters | 
|---|
| 0:04:48 | then we perform the hmm decoding and | 
|---|
| 0:04:51 | training process and then we get two different log-likelihood scores the first one is | 
|---|
| 0:04:56 | for the | 
|---|
| 0:04:57 | short-term spectral features | 
|---|
| 0:04:58 | and then we also get another score for | 
|---|
| 0:05:01 | the long-term features | 
|---|
| 0:05:02 | then these two scores are fused linearly in the speaker segmentation and | 
|---|
| 0:05:07 | we get the speaker segmentation output which gives us | 
|---|
| 0:05:10 | a set of clusters | 
|---|
| 0:05:12 | so we use a classical bic | 
|---|
| 0:05:15 | computation technique that computes | 
|---|
| 0:05:17 | pairwise similarity between | 
|---|
| 0:05:19 | all pairs of clusters and at each iteration the two clusters that have | 
|---|
| 0:05:25 | the highest | 
|---|
| 0:05:28 | bic score | 
|---|
| 0:05:29 | will be merged and this process | 
|---|
| 0:05:33 | iterates until the highest bic value among the clusters is less than the | 
|---|
| 0:05:38 | specified threshold value | 
|---|
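A sketch of the agglomerative loop just described; `bic_score` stands in for the model-dependent delta-BIC computation, and clusters are represented as lists of segments:

```python
import itertools

def bic_clustering(clusters, bic_score, threshold):
    """Merge the pair of clusters with the highest pairwise BIC score at
    each iteration; stop once the best score drops below the threshold."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        pairs = list(itertools.combinations(range(len(clusters)), 2))
        scores = [bic_score(clusters[i], clusters[j]) for i, j in pairs]
        best = max(scores)
        if best < threshold:              # stopping criterion
            break
        i, j = pairs[scores.index(best)]  # merge the best-scoring pair
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters
```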
| 0:05:40 | so this is the classical bic computation so in our work | 
|---|
| 0:05:44 | the initialization and the speaker segmentation are the same | 
|---|
| 0:05:47 | the main change we introduce is in the speaker clustering one where | 
|---|
| 0:05:52 | the gmm bic computation is replaced by the i-vector clustering one | 
|---|
| 0:06:00 | so this is our proposed architecture so given a set of clusters | 
|---|
| 0:06:05 | that are | 
|---|
| 0:06:06 | the output of the viterbi segmentation we extract | 
|---|
| 0:06:09 | two different sets of i-vectors | 
|---|
| 0:06:11 | the first i-vector is from the mfcc | 
|---|
| 0:06:14 | and the second one is from the jitter and shimmer and the prosodic ones | 
|---|
| 0:06:18 | and we use two different | 
|---|
| 0:06:20 | universal background models the first one is for the | 
|---|
| 0:06:25 | short-term spectral features | 
|---|
| 0:06:26 | and the second one is for | 
|---|
| 0:06:29 | the | 
|---|
| 0:06:31 | long-term features | 
|---|
| 0:06:33 | so the ubm and the t matrices are trained using the same source from | 
|---|
| 0:06:38 | ami and we have selected one hundred shows with a duration of forty hours to | 
|---|
| 0:06:43 | train | 
|---|
| 0:06:44 | the ubm | 
|---|
| 0:06:46 | and the i-vectors are extracted using the alize toolkit | 
|---|
| 0:06:49 | so the stopping criterion of the clustering | 
|---|
| 0:06:52 | is normally based on a | 
|---|
| 0:06:54 | specified threshold value so if the score falls below the | 
|---|
| 0:06:59 | specified one the | 
|---|
| 0:07:01 | system stops merging | 
|---|
| 0:07:04 | so | 
|---|
| 0:07:05 | to find the optimum | 
|---|
| 0:07:07 | threshold value we have used a semi-automatic way of | 
|---|
| 0:07:11 | finding | 
|---|
| 0:07:12 | the threshold value | 
|---|
| 0:07:14 | for example in this figure | 
|---|
| 0:07:17 | we have displayed how we have selected | 
|---|
| 0:07:20 | the lambda value and the stopping criterion for five shows from the development set | 
|---|
| 0:07:25 | so these ones the red ones show the highest | 
|---|
| 0:07:29 | cosine distance scores per each iteration | 
|---|
| 0:07:32 | and | 
|---|
| 0:07:34 | these black ones are the diarization error rates per each iteration so the | 
|---|
| 0:07:40 | horizontal dashed line is the lambda value selected | 
|---|
| 0:07:44 | as a threshold to stop the process for example | 
|---|
| 0:07:48 | if we talk about the first | 
|---|
| 0:07:49 | show | 
|---|
| 0:07:52 | the system | 
|---|
| 0:07:53 | stops at the fourth iteration because in the fourth iteration | 
|---|
| 0:07:56 | the | 
|---|
| 0:07:57 | maximum | 
|---|
| 0:07:58 | cosine distance score value is less than this threshold value so we have applied this | 
|---|
| 0:08:04 | technique on | 
|---|
| 0:08:06 | the whole development shows and the lambda value obtained is applied directly on the test | 
|---|
| 0:08:12 | set | 
|---|
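The proposed i-vector clustering follows the same merge-until-threshold pattern, but with cosine scores and the tuned lambda. A sketch under the assumption that `extract_ivector` wraps the ubm/t-matrix front end:

```python
import itertools
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def ivector_clustering(clusters, extract_ivector, lam):
    """At each iteration merge the cluster pair whose i-vectors have the
    highest cosine score; stop when that score falls below lambda."""
    clusters = list(clusters)
    while len(clusters) > 1:
        pairs = list(itertools.combinations(range(len(clusters)), 2))
        scores = [cosine(extract_ivector(clusters[i]),
                         extract_ivector(clusters[j])) for i, j in pairs]
        best = max(scores)
        if best < lam:                    # lambda tuned on the development shows
            break
        i, j = pairs[scores.index(best)]
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```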
| 0:08:15 | so we have used two different fusion techniques one on the speaker segmentation and the | 
|---|
| 0:08:21 | other on speaker clustering | 
|---|
| 0:08:23 | so in the | 
|---|
| 0:08:25 | segmentation | 
|---|
| 0:08:26 | the fusion technique is based on log-likelihood scores so we get | 
|---|
| 0:08:31 | two different scores for a given segment one from | 
|---|
| 0:08:35 | the short-term spectral features and the other from the long-term features so | 
|---|
| 0:08:40 | we get | 
|---|
| 0:08:41 | a model | 
|---|
| 0:08:43 | for | 
|---|
| 0:08:44 | the short-term spectral features so we get the log-likelihood score and this is multiplied by | 
|---|
| 0:08:48 | alpha and again similarly for the | 
|---|
| 0:08:52 | long-term features we | 
|---|
| 0:08:54 | extract | 
|---|
| 0:08:55 | the log-likelihood score and this is multiplied by one minus alpha and the alphas | 
|---|
| 0:09:00 | had to be tuned on the development data set | 
|---|
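A one-line sketch of this segmentation-time fusion, assuming the two weights sum to one (the system may equally use two independently tuned weights):

```python
def fused_log_likelihood(ll_short: float, ll_long: float, alpha: float) -> float:
    """alpha weights the short-term spectral log-likelihood and
    (1 - alpha) the long-term one; alpha is tuned on the development set."""
    return alpha * ll_short + (1.0 - alpha) * ll_long
```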
| 0:09:06 | so the fusion technique in the speaker clustering is carried out as follows so we have | 
|---|
| 0:09:11 | three different sets of features the mfcc the voice quality and the prosodic ones | 
|---|
| 0:09:17 | so the long-term features are stacked together | 
|---|
| 0:09:21 | then we extract two different sets of i-vectors from the mfcc and from the long | 
|---|
| 0:09:26 | term ones | 
|---|
| 0:09:27 | then the cosine similarity between | 
|---|
| 0:09:30 | these two sets of i-vectors | 
|---|
| 0:09:33 | is fused by | 
|---|
| 0:09:34 | a linear weighting function | 
|---|
| 0:09:36 | so in the fused score each cosine similarity is multiplied by | 
|---|
| 0:09:46 | a weight | 
|---|
| 0:09:48 | so the beta in this one is | 
|---|
| 0:09:51 | the weight | 
|---|
| 0:09:52 | that is applied to the cosine distance scores extracted from | 
|---|
| 0:09:57 | the spectral features and one minus beta is | 
|---|
| 0:10:00 | the weight assigned | 
|---|
| 0:10:02 | for the cosine distance scores | 
|---|
| 0:10:04 | extracted from the long-term features | 
|---|
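A sketch of the clustering-time fusion as stated: beta weights the cosine score of the mfcc i-vectors and one minus beta the cosine score of the long-term i-vectors:

```python
import numpy as np

def fused_cosine_score(w_mfcc_a, w_mfcc_b, w_long_a, w_long_b, beta):
    """beta * cos(mfcc i-vectors) + (1 - beta) * cos(long-term i-vectors)."""
    def cos(u, v):
        u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    return beta * cos(w_mfcc_a, w_mfcc_b) + (1.0 - beta) * cos(w_long_a, w_long_b)
```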
| 0:10:10 | so when we come to the experimental setup | 
|---|
| 0:10:13 | we have | 
|---|
| 0:10:15 | developed and tested our experiments on the ami corpus which is | 
|---|
| 0:10:19 | a multi-party and spontaneous set of meeting recordings | 
|---|
| 0:10:24 | so normally in the ami shows the number of speakers is | 
|---|
| 0:10:27 | limited from two | 
|---|
| 0:10:29 | up to five but mostly | 
|---|
| 0:10:31 | the number of speakers is four and | 
|---|
| 0:10:34 | it is a set of meeting recordings and it is multi-channel with the far-field | 
|---|
| 0:10:37 | condition | 
|---|
| 0:10:38 | so we have selected ten shows as a development set to tune the different parameters that is | 
|---|
| 0:10:43 | the weight values | 
|---|
| 0:10:45 | and the threshold values | 
|---|
| 0:10:48 | then we have defined | 
|---|
| 0:10:50 | two experimental setups the first one is a single site so ten shows | 
|---|
| 0:10:54 | had to be selected from idiap | 
|---|
| 0:10:58 | and the other one is a multiple sites one | 
|---|
| 0:11:00 | we have selected ten shows from the idiap | 
|---|
| 0:11:03 | edinburgh and tno sites so | 
|---|
| 0:11:06 | the | 
|---|
| 0:11:07 | optimum parameters that are obtained from the development set are directly used on these | 
|---|
| 0:11:14 | single and multiple site shows so we have used two different | 
|---|
| 0:11:18 | sizes of i-vectors | 
|---|
| 0:11:20 | for the short and long term features and these are also | 
|---|
| 0:11:24 | tuned on the development set and we have | 
|---|
| 0:11:27 | used the oracle sad for the speech references | 
|---|
| 0:11:30 | as the speech activity detection so the diarization error rate reported in this | 
|---|
| 0:11:35 | work | 
|---|
| 0:11:36 | corresponds mainly to the speaker errors the missed speech and the false alarms have | 
|---|
| 0:11:40 | a zero value | 
|---|
| 0:11:44 | so here if we see | 
|---|
| 0:11:45 | the results the baseline system that is based on mfcc and gmm bic | 
|---|
| 0:11:51 | clustering | 
|---|
| 0:11:52 | is the state of the art | 
|---|
| 0:11:54 | but when we are using jitter and shimmer and prosody both in the gmm and | 
|---|
| 0:12:00 | i-vector | 
|---|
| 0:12:02 | clustering technique it improves | 
|---|
| 0:12:05 | a lot compared to | 
|---|
| 0:12:07 | the baseline | 
|---|
| 0:12:08 | and if we compare these to the i-vector | 
|---|
| 0:12:12 | clustering techniques with the gmm ones | 
|---|
| 0:12:16 | the i-vector clustering techniques | 
|---|
| 0:12:19 | again provide better results than | 
|---|
| 0:12:22 | the gmm clustering technique | 
|---|
| 0:12:24 | and we can also conclude that | 
|---|
| 0:12:26 | if we compare the same clustering technique the i-vector clustering techniques that is this one based | 
|---|
| 0:12:31 | on only short-term spectral features and this one | 
|---|
| 0:12:34 | using two different sets of features | 
|---|
| 0:12:37 | the latter provides us better results than | 
|---|
| 0:12:40 | using one set of i-vectors from the | 
|---|
| 0:12:43 | short-term features | 
|---|
| 0:12:48 | so we have | 
|---|
| 0:12:49 | also done | 
|---|
| 0:12:50 | some post-paper processing work | 
|---|
| 0:12:53 | after the submission of this paper to do better | 
|---|
| 0:12:55 | so we have | 
|---|
| 0:12:57 | also tested | 
|---|
| 0:12:59 | the plda scoring | 
|---|
| 0:13:01 | in the clustering stage | 
|---|
| 0:13:02 | and the plda clustering as it is shown in the table | 
|---|
| 0:13:06 | whether it uses only one set of i-vectors or | 
|---|
| 0:13:09 | two sets of i-vectors | 
|---|
| 0:13:11 | it provides better diarization results than both the gmm and cosine scoring techniques | 
|---|
| 0:13:19 | so one of the issues in speaker diarization is that the diarization error rate among | 
|---|
| 0:13:24 | the different shows | 
|---|
| 0:13:28 | varies a lot | 
|---|
| 0:13:31 | from one show to another for example one show may give us | 
|---|
| 0:13:34 | a small der like five percent and another show may give us a big der of | 
|---|
| 0:13:39 | like fifty percent | 
|---|
| 0:13:41 | so for example this box plot shows the | 
|---|
| 0:13:45 | der variation over the multiple and the single sites so | 
|---|
| 0:13:50 | this one is the der variation for the single site and the grey one | 
|---|
| 0:13:54 | is | 
|---|
| 0:13:55 | the der variation for the multiple sites | 
|---|
| 0:13:57 | so this is the highest der and this is the lowest der | 
|---|
| 0:14:01 | so we can see that there is | 
|---|
| 0:14:02 | a huge variation | 
|---|
| 0:14:04 | between | 
|---|
| 0:14:05 | the maximum and the minimum | 
|---|
| 0:14:09 | so if we see | 
|---|
| 0:14:12 | here the use of long-term features | 
|---|
| 0:14:15 | both in the gmm and i-vector clustering technique | 
|---|
| 0:14:18 | help us to reduce the | 
|---|
| 0:14:21 | der variation among the different shows | 
|---|
| 0:14:24 | and the other thing we can see is that both | 
|---|
| 0:14:27 | i-vector clustering techniques that are based on | 
|---|
| 0:14:30 | short-term and short-term plus long-term features | 
|---|
| 0:14:33 | they give us | 
|---|
| 0:14:35 | fewer errors | 
|---|
| 0:14:37 | at least we can say they reduce again | 
|---|
| 0:14:39 | the der variations among | 
|---|
| 0:14:42 | the different shows | 
|---|
| 0:14:43 | and finally this one that is the i-vector clustering technique based on | 
|---|
| 0:14:48 | short-term and long-term features gives us | 
|---|
| 0:14:51 | the lowest | 
|---|
| 0:14:52 | variations among | 
|---|
| 0:14:53 | the different shows | 
|---|
| 0:14:58 | so in conclusion | 
|---|
| 0:15:00 | we have proposed the extraction of i-vectors from | 
|---|
| 0:15:04 | short and long term speech features for | 
|---|
| 0:15:06 | speaker clustering task | 
|---|
| 0:15:09 | and the experimental results demonstrate that the | 
|---|
| 0:15:12 | i-vector clustering techniques provide | 
|---|
| 0:15:15 | better diarization error rates than the gmm clustering ones | 
|---|
| 0:15:20 | and also the extraction of i-vectors from the | 
|---|
| 0:15:24 | long-term features | 
|---|
| 0:15:25 | in addition to the | 
|---|
| 0:15:27 | short-term ones | 
|---|
| 0:15:29 | help us to reduce the der | 
|---|
| 0:15:32 | so in conclusion we can say that the extraction of i-vectors | 
|---|
| 0:15:37 | and the use of | 
|---|
| 0:15:39 | i-vector clustering techniques are helpful for speaker diarization systems | 
|---|
| 0:15:43 | and thank you | 
|---|
| 0:15:52 | then it's time for questions | 
|---|
| 0:16:12 | so i have | 
|---|
| 0:16:19 | i was wondering if you could explain the process you are using for calculating the | 
|---|
| 0:16:26 | jitter and shimmer and did you find it to be a robust process across the | 
|---|
| 0:16:32 | tv shows | 
|---|
| 0:16:37 | normally our | 
|---|
| 0:16:40 | shows are meeting domains | 
|---|
| 0:16:42 | but | 
|---|
| 0:16:44 | it is | 
|---|
| 0:16:45 | it is | 
|---|
| 0:16:46 | a meeting domain it's not a tv show | 
|---|
| 0:16:49 | but when we extract jitter and shimmer | 
|---|
| 0:16:53 | the problem we face is if | 
|---|
| 0:16:56 | the speech is unvoiced | 
|---|
| 0:16:59 | we get zero values | 
|---|
| 0:17:01 | so we compensate them by averaging over five hundred millisecond duration | 
|---|
| 0:17:06 | that is we extract the features over a short frame duration | 
|---|
| 0:17:10 | so to compensate the zero values for the unvoiced frames we average over five hundred millisecond | 
|---|
| 0:17:16 | duration | 
|---|
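A sketch of the compensation described in this answer: zero-valued (unvoiced) frames are replaced by the mean of the voiced values in a 500 ms window around them; the 10 ms frame shift is an assumed, typical value:

```python
import numpy as np

def smooth_unvoiced(values, frame_shift_ms=10.0, window_ms=500.0):
    """Fill zero (unvoiced) jitter/shimmer frames with the mean of the
    voiced frames inside a centred 500 ms window."""
    v = np.asarray(values, dtype=float)
    half = int(window_ms / frame_shift_ms) // 2
    out = v.copy()
    for i in np.where(v == 0.0)[0]:
        window = v[max(0, i - half): i + half + 1]
        voiced = window[window != 0.0]
        if voiced.size:
            out[i] = voiced.mean()
    return out
```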
| 0:17:27 | you have also in one of your slides you said that the threshold is trained | 
|---|
| 0:17:31 | from the development set | 
|---|
| 0:17:33 | how did you find it or train it how did you find that threshold | 
|---|
| 0:17:38 | and did you experiment with changing the threshold value | 
|---|
| 0:17:42 | you mean the segmentation i think this one | 
|---|
| 0:17:47 | no in the formula where you | 
|---|
| 0:17:51 | present the segmentation | 
|---|
| 0:17:57 | this one | 
|---|
| 0:17:58 | oh you mean here | 
|---|
| 0:18:01 | so you mean the alpha values they have been | 
|---|
| 0:18:05 | manually tuned on the development set | 
|---|
| 0:18:08 | we test different weight | 
|---|
| 0:18:11 | values | 
|---|
| 0:18:12 | for the two features | 
|---|
| 0:18:14 | and | 
|---|
| 0:18:15 | these values are directly applied on the test set | 
|---|
| 0:18:22 | okay so they are fixed so in the test experiments they are fixed | 
|---|
| 0:18:41 | thank you very clear presentation i just wanted to understand a little bit about | 
|---|
| 0:18:48 | the physical motivation do you have an explanation why you went for jitter | 
|---|
| 0:18:54 | shimmer and prosody | 
|---|
| 0:18:56 | so for example in experiments that we do we found pitch to be quite well | 
|---|
| 0:19:00 | quite important how did you sort of converge on these two did you go | 
|---|
| 0:19:06 | through a selection process to get to them or do you have any intuition or | 
|---|
| 0:19:10 | explanation for the | 
|---|
| 0:19:11 | so you're saying why we are interested in the extraction of the jitter and shimmer and | 
|---|
| 0:19:15 | prosodic features how did you zero in on them what's your sort of physical | 
|---|
| 0:19:19 | intuition for using those as opposed to other long-term features | 
|---|
| 0:19:25 | because they are voice quality measurements | 
|---|
| 0:19:27 | not spectral measurements | 
|---|
| 0:19:29 | so they can be used to discriminate | 
|---|
| 0:19:33 | the speech of one person from another one so your hypothesis is that there | 
|---|
| 0:19:38 | would be a significant difference between speakers as we have seen there is and that | 
|---|
| 0:19:41 | this will be robust to whatever channel it is going through | 
|---|
| 0:19:46 | but they seem extremely delicate if you will so | 
|---|
| 0:19:53 | if you had to extend this outside this dataset for example to real-life recordings | 
|---|
| 0:19:59 | would you worry about the sensitivity of these features that you are looking at | 
|---|
| 0:20:04 | okay for example jitter and shimmer they have also been used in speaker | 
|---|
| 0:20:10 | verification and recognition on nist databases | 
|---|
| 0:20:14 | so normally | 
|---|
| 0:20:16 | that is the reason why we applied them on speaker diarization | 
|---|
| 0:20:20 | and we have checked the jitter and shimmer on the ami corpus | 
|---|
| 0:20:25 | which is what i'm presenting we have also extracted them on another corpus which is a | 
|---|
| 0:20:30 | catalan tv show | 
|---|
| 0:20:32 | and there also we got some improvements | 
|---|
| 0:20:36 | so you would argue that it helps and | 
|---|
| 0:20:40 | would there be any others do you think | 
|---|
| 0:20:42 | i don't know would there be others you think that you could add to the | 
|---|
| 0:20:46 | two | 
|---|
| 0:20:47 | note that there are different types we have about ten or eleven types of jitter | 
|---|
| 0:20:51 | and shimmer measurements | 
|---|
| 0:20:53 | but we have selected these three based on previous studies for speaker recognition and | 
|---|
| 0:20:59 | maybe we can check with the others also | 
|---|
| 0:21:08 | any other question | 
|---|
| 0:21:14 | i do have a quick question so it's about the stopping criterion so you are | 
|---|
| 0:21:22 | not assuming that you know the number of speakers beforehand | 
|---|
| 0:21:26 | that's right no we don't know the number of speakers you know and you | 
|---|
| 0:21:29 | know it is like real conditions | 
|---|
| 0:21:37 | so any other questions | 
|---|
| 0:21:42 | there are no more questions so let's thank the speaker again | 
|---|