0:00:14 Good morning. Our next speaker, from Ghent University, recently
0:00:19 worked on soft voice activity detection in factor analysis based speaker segmentation of
0:00:23 broadcast news.
0:00:26 So this work has been done in the context of a subtitling project with the VRT.
0:00:31 The VRT is the public broadcaster of Flanders,
0:00:35 the Dutch-speaking region of Belgium,
0:00:38 and the idea is to use speech technology to
0:00:42 speed up the process of creating subtitles for TV shows.
0:00:47 Another use case is for journalists who make reports: to
0:00:51 have a fast track to put the report online with subtitles, they
0:00:55 can use the speech technology to generate the subtitles.
0:00:58 The quality may be a bit lower, but
0:01:01 for online use the speed is more important than the quality of
0:01:05 the subtitles.
0:01:07 So the idea is that subtitling is a very time-consuming manual process, so we
0:01:12 want to use
0:01:13 speech technology.
0:01:15 In this presentation we will focus on diarization, and on why we want to
0:01:21 solve this 'who spoke when' problem.
0:01:23 First of all, we want to add colours to the subtitles.
0:01:27 If we want to generate subtitles it can also be useful to use
0:01:31 speaker-adapted models: once we have speaker labels we can use such adapted models.
0:01:36 Another thing is that if we detect speaker changes, this can be extra
0:01:41 information for the language model of the speech recognizer to
0:01:46 begin and end sentences, so this can also help recognition.
0:01:51 At Interspeech there will be a show-and-tell session in which we
0:01:56 will demo the complete system platform.
0:02:00 It will show how you can upload a video and then start the whole chain
0:02:03 of speech/non-speech segmentation, speaker diarization, language detection and then speech recognition.
0:02:09 But that's not the final step: then we actually have to make short sentences to
0:02:12 display them on the screen.
0:02:18 Okay, so what is the concept? We get the audio signal, and
0:02:22 the first step is the speech/non-speech segmentation: we have to remove laughter,
0:02:26 we have to remove music.
0:02:28 Once we have detected the speech segments we can start the speaker
0:02:32 diarization.
0:02:33 This includes detecting the speaker change points and finding homogeneous segments.
0:02:39 Once we have found those segments we can cluster them to assign a speaker
0:02:42 label to all of them.
0:02:45 Then we make the hypothesis that each speaker only uses one language,
0:02:51 and because in Flanders we are interested in Flemish we only keep the Flemish segments,
0:02:55 and then we do the speech recognition.
0:02:58 The output of the speech recognizer will need some processing to make the sentences
0:03:02 short enough to display on the screen.
0:03:05 Here we will focus on more accurate speaker segmentation, because if we use too
0:03:10 short segments, they cannot provide enough data for reliable speaker models. In
0:03:15 the kind of files we will use we sometimes have fifty speakers in one
0:03:19 audio file, so the longer the homogeneous speaker segments are, the more reliable the clustering
0:03:25 will be.
0:03:26 Obviously, if we don't detect a speaker change, this results in non-homogeneous segments, and
0:03:32 this will cause error propagation during the clustering process. Also, if we make
0:03:37 too short segments, this will make the clustering a lot slower because we have to compute a
0:03:41 lot more distances between segments.
0:03:46 Okay, we propose a two-pass system. In the first pass, the speech segments
0:03:52 are generated by the speech/non-speech segmentation.
0:03:55 Then we do the speaker segmentation with a standard
0:03:59 eigenvoice approach; we call these generic eigenvoices because they
0:04:04 are composed to model every speaker that can appear.
0:04:07 Once we have detected those speaker segments we can do standard speaker clustering.
0:04:12 The output of the speaker clustering, the speaker clusters, we will
0:04:16 use to
0:04:18 retrain our eigenvoice model: we now know which speakers are active in the audio file,
0:04:22 the broadcast news file, so we retrain eigenvoices that match those speakers.
0:04:27 We also have speech segments, so we can also retrain our universal
0:04:32 background model.
0:04:33 Then we go to a second pass: again we start from our baseline
0:04:37 speech segments and do the speaker segmentation again, but now with specific eigenvoices matching
0:04:42 the speakers inside the audio file.
0:04:44 Then we do the speaker clustering again, and hopefully we get better speaker
0:04:48 clusters than in the first pass.
0:04:52 Okay, the first step of our speaker segmentation is boundary generation, that is,
0:04:58 the generation of candidate speaker change points.
0:05:02 We use a sliding window approach with two comparison windows, a left
0:05:07 window and a right window, and we have two hypotheses: either we
0:05:11 have the same speaker in the two windows, or we have a different speaker.
0:05:16 We use a measure that looks for maximal dissimilarity between
0:05:20 the distributions of the acoustic features, and if there is sufficient dissimilarity, then
0:05:24 this indicates that there was a speaker change.
0:05:30 Also, our
0:05:31 speech/non-speech segmentation does not eliminate short pauses: it is tuned to detect only
0:05:36 laughter and music segments longer than one second.
0:05:41 So there can actually be a short pause when speakers alternate.
0:05:45 If we would use adjacent comparison windows, this generates several maxima
0:05:51 around a speaker change. We argue that
0:05:54 maxima can appear at the beginning and at the end
0:05:58 of the pause, because then the dissimilarity between the acoustic features in
0:06:03 both windows is maximal.
0:06:05 Instead, we propose to use overlapping comparison windows.
0:06:09 If you look at the regions close to the pause, these actually contribute to the similarity:
0:06:15 the red regions
0:06:18 make the comparison windows more similar.
0:06:21 So when the overlap region between both comparison windows
0:06:25 matches the pause,
0:06:27 the dissimilarity between both windows will be maximal, and the speaker
0:06:31 change will be inserted in the middle of the pause, which is actually what
0:06:35 we want;
0:06:36 it is just the more logical thing to do.
0:06:39 So when we apply this to our
0:06:43 sliding window approach, we simply use two
0:06:46 overlapping sliding windows, a left window and a right window.
0:06:53 Okay, for each comparison window we want to extract speaker-specific information,
0:06:58 so we do this with factor analysis.
0:07:03 Because we use the sliding window approach, we use very
0:07:07 low-dimensional models, as we have to extract those speaker factors for each frame.
0:07:12 So we use a GMM-UBM speech model with thirty-two components and
0:07:17 a low-dimensional speaker variability or eigenvoice matrix with only twenty eigenvoices.
0:07:23 We use a window of one second, slide it across each
0:07:26 frame, and extract those twenty speaker factors.
0:07:30 I should mention that for the training data we use English broadcast news data.
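To make this factor analysis step more concrete, here is a minimal numpy sketch of per-window speaker factor extraction with a 32-component diagonal-covariance GMM-UBM and a 20-column eigenvoice matrix. The function names and the use of a scikit-learn GaussianMixture as the UBM are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

# Minimal sketch (not the authors' code) of per-window speaker factor extraction.
# `ubm` is assumed to be a fitted 32-component diagonal-covariance GMM, e.g.
#   ubm = sklearn.mixture.GaussianMixture(n_components=32, covariance_type='diag').fit(train_feats)
# `V` is a (C*D, 20) eigenvoice matrix trained on (English) broadcast news data.

def baum_welch_stats(frames, ubm):
    """Zeroth-order and centered first-order statistics of a window w.r.t. the UBM."""
    post = ubm.predict_proba(frames)                  # (T, C) occupation probabilities
    N = post.sum(axis=0)                              # (C,)   zeroth-order stats
    F = post.T @ frames - N[:, None] * ubm.means_     # (C, D) centered first-order stats
    return N, F

def extract_speaker_factors(frames, ubm, V):
    """MAP point estimate of the speaker factors y for one comparison window."""
    C, D = ubm.means_.shape
    R = V.shape[1]                                    # number of eigenvoices (20 here)
    N, F = baum_welch_stats(frames, ubm)
    prec = 1.0 / ubm.covariances_                     # (C, D) diagonal precisions
    A = np.eye(R)                                     # posterior precision of y
    b = np.zeros(R)
    for c in range(C):
        Vc = V[c * D:(c + 1) * D, :]                  # (D, R) block of component c
        A += N[c] * Vc.T @ (prec[c][:, None] * Vc)
        b += Vc.T @ (prec[c] * F[c])
    return np.linalg.solve(A, b)                      # (R,) speaker factors

def sliding_speaker_factors(features, ubm, V, win=100):
    """One-second windows (e.g. 100 frames at a 10 ms shift), advanced frame by frame."""
    return np.array([extract_speaker_factors(features[t:t + win], ubm, V)
                     for t in range(len(features) - win)])
```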
0:07:39 Okay, now that we have the speaker factors per frame, we
0:07:43 look for significant local changes between the speaker factors, because these indicate
0:07:48 a speaker change.
0:07:50 We use an extraction window of one second, so it's quite obvious that the phonetic
0:07:55 content of this one-second window
0:07:57 will have a huge impact on the speaker factors.
0:08:00 We propose to estimate this phonetic variability, the intra-speaker variability, on
0:08:06 the test data itself. So we have our speaker factor extraction, and
0:08:13 if we look at the segment to the left and make the
0:08:17 hypothesis that it is the same speaker, and the same for the segment to the right,
0:08:20 we can actually use a Gaussian model
0:08:22 to estimate the phonetic or intra-speaker variability on the signal to the
0:08:28 left,
0:08:29 and under the right-speaker hypothesis we estimate the phonetic variability on
0:08:33 the signal to the right.
0:08:35 We actually want to find changes in the speaker factors that
0:08:39 are not explained by this phonetic variability; we want to look for changes that
0:08:43 have occurred because of a real speaker change.
0:08:46 If we use the Mahalanobis distance we can look
0:08:49 for changes that are in other directions than those caused by the phonetic variability.
0:08:54 So we propose a Mahalanobis distance with two components: one where we
0:08:59 have the hypothesis that we have the left speaker,
0:09:01 so we look for changes in the speaker factors that are not explained by the phonetic
0:09:05 variability of the left speaker,
0:09:06 and a second component looking for changes not explained by the phonetic variability
0:09:11 of the right speaker.
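As a hedged illustration of this two-component distance, the sketch below estimates the intra-speaker (phonetic) variability once on the left window and once on the right window, and measures the change in the mean speaker factors against both; the regularisation constant and function names are assumptions, not the authors' exact estimator.

```python
import numpy as np

# Two-component Mahalanobis-style change score on per-frame speaker factors
# (illustrative sketch; exact estimator details are not taken from the talk).

def mahalanobis_change_score(sf_left, sf_right, reg=1e-3):
    """sf_left, sf_right: (T, R) speaker factors of the left and right windows."""
    mu_l, mu_r = sf_left.mean(axis=0), sf_right.mean(axis=0)
    diff = mu_l - mu_r
    R = sf_left.shape[1]
    # Intra-speaker (phonetic) variability under the "same speaker" hypothesis,
    # estimated once on the left window and once on the right window.
    cov_l = np.cov(sf_left, rowvar=False) + reg * np.eye(R)
    cov_r = np.cov(sf_right, rowvar=False) + reg * np.eye(R)
    # Directions already explained by phonetic variability are down-weighted;
    # changes in other directions point to a real speaker change.
    return diff @ np.linalg.solve(cov_l, diff) + diff @ np.linalg.solve(cov_r, diff)
```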
0:09:15 Okay, here we have a speech segment, and
0:09:17 this plot shows our distance metric.
0:09:21 I also included the Euclidean distance to compare it to the Mahalanobis distance.
0:09:26 The red lines are the detected maxima: we mine this distance
0:09:31 measurement for maxima, so we need a peak selection algorithm.
0:09:35 We average our distance measure,
0:09:38 then, according to the length of our speech segment, we select the number
0:09:41 of maxima,
0:09:43 and we also enforce a minimum duration of a speaker turn of one
0:09:47 second.
0:09:47 The red lines indicate the detected boundaries,
0:09:50 and the black lines are the real speaker turns. We
0:09:54 see that the Mahalanobis distance emphasises
0:09:58 the real speaker changes,
0:10:00 so it successfully detects the two
0:10:03 speaker turns, unlike the Euclidean distance.
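A rough sketch of such a peak selection over the frame-level distance curve: the distance is averaged, the number of maxima is budgeted by the segment length, and the one-second minimum turn duration is enforced as a minimum spacing. The smoothing window and the budget rule are my assumptions.

```python
import numpy as np

def select_boundaries(dist, frame_rate=100, min_turn_s=1.0):
    """Pick candidate change points from a per-frame distance curve (sketch)."""
    win = int(min_turn_s * frame_rate)
    smoothed = np.convolve(dist, np.ones(win) / win, mode='same')  # average the measure
    n_max = max(1, len(dist) // win)             # maxima budget grows with segment length
    picked = []
    for t in np.argsort(smoothed)[::-1]:         # strongest peaks first
        if all(abs(t - p) >= win for p in picked):   # one-second minimum turn duration
            picked.append(int(t))
        if len(picked) == n_max:
            break
    return sorted(picked)
```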
0:10:09 Okay, once we have our candidate speaker change points, we can use some clustering
0:10:15 of the adjacent segments to eliminate false positives.
0:10:19 Again we have a two-pass system. In our first pass
0:10:24 we use delta BIC clustering of the adjacent speaker turns to see
0:10:29 if there is enough acoustic similarity between the segments; if they are quite similar,
0:10:34 then we simply eliminate this boundary.
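For reference, a small sketch of a standard delta BIC test on two adjacent segments; the talk does not spell out its exact variant, so the penalty weight and threshold here are illustrative.

```python
import numpy as np

def delta_bic(x, y, lam=1.0):
    """Standard full-covariance Gaussian delta BIC between two adjacent segments
    x, y of shape (T, D). Positive values favour the two-speaker hypothesis."""
    nx, ny, d = len(x), len(y), x.shape[1]
    n = nx + ny
    logdet = lambda s: np.linalg.slogdet(np.cov(s, rowvar=False))[1]
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (n * logdet(np.vstack((x, y))) - nx * logdet(x) - ny * logdet(y)) - penalty

def keep_boundary(x, y, threshold=0.0, lam=1.0):
    # Boundaries between acoustically similar segments (low delta BIC) are eliminated.
    return delta_bic(x, y, lam) > threshold
```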
0:10:39 In the second pass we have the specific eigenvoice model; this eigenvoice model
0:10:43 matches the speakers in the file,
0:10:45 so then we can extract speaker factors
0:10:48 per homogeneous segment
0:10:50 and use the cosine distance to compare the speaker factors.
0:10:54 If they are similar we eliminate the candidate change point;
0:10:57 if they are dissimilar it's a speaker change point.
0:11:00 We can use a threshold
0:11:02 on both criteria to control the number of eliminated boundaries.
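A minimal sketch of the second-pass elimination based on the cosine distance between per-segment speaker factors; the single left-to-right pass and the threshold handling are simplifications of my own.

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def eliminate_boundaries(segment_factors, boundaries, threshold):
    """segment_factors: one speaker-factor vector per segment delimited by the
    candidate boundaries; boundaries[i] separates segment i from segment i+1."""
    kept = []
    for i, boundary in enumerate(boundaries):
        if cosine_distance(segment_factors[i], segment_factors[i + 1]) > threshold:
            kept.append(boundary)        # dissimilar factors: keep the change point
        # otherwise the boundary is eliminated and the segments are merged later
    return kept
```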
0:11:07 Okay.
0:11:09 We test this on the COST278 broadcast news test
0:11:12 set, which is actually a set with twelve languages.
0:11:16 We used one language as development data to tune our parameters,
0:11:21 and the eleven remaining sets were used as test data.
0:11:26 This comprises thirty hours of data
0:11:29 and four thousand four hundred speaker turns.
0:11:33 For the evaluation we do a mapping between the estimated change points and the real
0:11:37 speaker change points, with a margin of five hundred milliseconds,
0:11:41 and we compute the precision and recall with this mapping.
0:11:46 The recall is the percentage of real boundaries that are mapped to
0:11:52 computed ones, and the precision is the percentage of computed boundaries that are
0:11:57 actually mapped to real ones.
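A small sketch of this evaluation, assuming a greedy one-to-one mapping within the 500 ms margin (the exact mapping procedure is not specified in the talk):

```python
def precision_recall(ref, hyp, margin=0.5):
    """ref, hyp: boundary times in seconds; margin: allowed mismatch (500 ms)."""
    matched_hyp, matched_ref = set(), 0
    for r in ref:
        best = min((h for h in hyp if h not in matched_hyp and abs(h - r) <= margin),
                   key=lambda h: abs(h - r), default=None)
        if best is not None:
            matched_hyp.add(best)
            matched_ref += 1
    recall = matched_ref / len(ref) if ref else 0.0          # real boundaries mapped
    precision = len(matched_hyp) / len(hyp) if hyp else 0.0  # computed boundaries mapped
    return precision, recall
```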
0:12:03 So we compare
0:12:07 our speaker change detection with a delta BIC baseline,
0:12:11 and we can see that for a low precision we get a maximum recall of
0:12:15 nineteen point six percent, which is larger than the delta BIC
0:12:20 baseline.
0:12:21 Once we have these precision-recall curves, we can select an operating
0:12:26 point according to the threshold of the
0:12:29 boundary elimination algorithm,
0:12:31 and we can use this operating point to start our speaker clustering.
0:12:39 Okay, now some more details about our two-pass adaptive speaker segmentation system. In
0:12:44 the first pass we get our speaker turns,
0:12:47 and clusters are generated
0:12:49 by clustering those speaker turns. We then retrain the
0:12:53 UBM and the eigenvoice model on the speech and the speaker clusters of the test file.
0:12:58 Then we repeat the boundary generation,
0:13:02 and we eliminate boundaries with the cosine distance instead of the delta BIC
0:13:05 elimination.
0:13:07 Here the yellow line
0:13:09 indicates our system, and we can see that the cosine distance boundary elimination
0:13:14 now outperforms the delta BIC elimination that we
0:13:19 used in the first pass.
0:13:21 So now we can choose an operating point on the
0:13:25 output of the second pass.
0:13:30 Okay, when we extract speaker factors for each comparison window, this
0:13:36 does not differentiate between the speech and non-speech frames in the test file.
0:13:41 The idea is to give the speech frames in the windows more weight
0:13:45 during the speaker factor extraction.
0:13:47 So we integrate a GMM-based
0:13:51 soft voice activity detection: we estimate a speech UBM and a non-speech UBM, and then
0:13:56 we use a softmax
0:14:01 to convert the log-likelihoods of the speech UBM and the non-speech UBM into speech posteriors per
0:14:05 frame.
0:14:07 We weight the Baum-Welch statistics that are used during the speaker factor
0:14:11 extraction
0:14:14 with these speech posteriors.
0:14:16 It is also important to note that here we use the speech UBM to
0:14:21 estimate the occupation probabilities of each frame,
0:14:25 because we also use these speech posteriors in the second pass of the
0:14:29 system. We do not only retrain the speech UBM, but we also retrain
0:14:33 the non-speech UBM on the test file: we have non-speech segments with
0:14:37 music and applause,
0:14:38 and we also use the low-energy frames inside the speech segments to
0:14:43 retrain the non-speech UBM.
0:14:45 Also during the boundary elimination, to remove the false positives,
0:14:50 we use the soft voice activity detection
0:14:53 to extract speaker factors and then use the cosine distance boundary elimination.
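To illustrate the integration, a minimal sketch that turns speech and non-speech UBM log-likelihoods into per-frame speech posteriors with a softmax and uses them to weight the Baum-Welch statistics; the use of scikit-learn GMM calls and the function names are assumptions, not the authors' code.

```python
import numpy as np

def speech_posteriors(frames, speech_ubm, nonspeech_ubm):
    """Softmax over the two hypotheses, i.e. a logistic of the log-likelihood ratio."""
    ll_speech = speech_ubm.score_samples(frames)        # (T,) per-frame log-likelihoods
    ll_nonspeech = nonspeech_ubm.score_samples(frames)
    return 1.0 / (1.0 + np.exp(ll_nonspeech - ll_speech))

def weighted_baum_welch_stats(frames, speech_ubm, speech_post):
    """Baum-Welch statistics where each frame is weighted by its speech posterior,
    so non-speech frames contribute little to the speaker factor extraction."""
    post = speech_ubm.predict_proba(frames)             # occupation probs (speech UBM)
    post = post * speech_post[:, None]
    N = post.sum(axis=0)
    F = post.T @ frames - N[:, None] * speech_ubm.means_
    return N, F
```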
0:15:00 Okay, here we still
0:15:02 plot the delta BIC baseline again,
0:15:04 and this is our
0:15:06 speaker factor extraction without the soft voice activity detection. We see that if we don't
0:15:10 use the two-pass system, the soft voice activity detection doesn't really improve the results,
0:15:15 but if we use the two-pass system, where we use the cosine distance
0:15:18 boundary elimination,
0:15:20 we see that we can further improve the results. So the soft voice activity
0:15:24 detection is really useful if we use the two-pass system.
0:15:29 Once we have this best precision-recall curve,
0:15:34 we choose an operating point to start our clustering.
0:15:39 This clustering is an agglomerative clustering. First we do traditional BIC clustering across
0:15:45 the whole file,
0:15:46 and this is quite important to get enough data for the i-vector PLDA clustering
0:15:52 in the second stage.
0:15:53 The idea is, for each cluster we get at the output of that first clustering,
0:15:57 to extract an i-vector,
0:16:00 and then we use PLDA to test the hypothesis whether we have the
0:16:04 same speaker or a different speaker.
0:16:07 If the PLDA indicates
0:16:11 that this is the same speaker, then we merge this
0:16:14 cluster pair,
0:16:15 and for this merged cluster we again extract an i-vector by summing
0:16:20 up
0:16:20 the sufficient statistics, extract a new i-vector, and
0:16:24 test the hypothesis again with the PLDA.
0:16:26 We iterate this whole clustering process until
0:16:30 the PLDA outputs a low probability of the same speaker.
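A rough sketch of this merge loop, with `extract_ivector` and `plda_llr` standing in for an i-vector extractor and a trained PLDA scorer (both assumed, not provided in the talk):

```python
def agglomerative_ivector_plda(cluster_stats, extract_ivector, plda_llr, stop_llr=0.0):
    """cluster_stats: [(N, F), ...] Baum-Welch statistics per initial (BIC) cluster.
    Merging stops when the best pair's same-speaker log-likelihood ratio is low."""
    stats = list(cluster_stats)
    ivecs = [extract_ivector(N, F) for N, F in stats]
    while len(stats) > 1:
        # Score every cluster pair with PLDA and pick the most same-speaker-like one.
        pairs = [(plda_llr(ivecs[i], ivecs[j]), i, j)
                 for i in range(len(stats)) for j in range(i + 1, len(stats))]
        best_llr, i, j = max(pairs)
        if best_llr < stop_llr:                     # no pair looks like the same speaker
            break
        N = stats[i][0] + stats[j][0]               # merge by summing sufficient statistics
        F = stats[i][1] + stats[j][1]
        stats[i], ivecs[i] = (N, F), extract_ivector(N, F)
        del stats[j], ivecs[j]                      # j > i, so index i stays valid
    return stats, ivecs
```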
0:16:37 Okay, what are the results after clustering? Again we use the COST278
0:16:41 broadcast news data sets.
0:16:43 We evaluate the diarization error rate, which is the percentage of frames that are
0:16:48 attributed to the wrong speaker after a mapping between the clusters and the real speakers.
0:16:54 Here we have the popular delta BIC segmentation, which gives a diarization
0:16:59 error rate of ten point one percent,
0:17:01 and we see that the detected boundaries are not that accurate when we use
0:17:05 a margin of five hundred milliseconds.
0:17:07 If we look for local changes between the speaker factors, we see a slight
0:17:11 improvement in the diarization error rate, but the big change is clearly in the accuracy
0:17:17 of the boundaries: the speaker factor extraction is much more accurate in detecting the
0:17:21 boundaries.
0:17:23 The same when we use the two-pass system: we see
0:17:26 a slight improvement in the precision and the recall.
0:17:29 But if we use the two-pass system with the soft voice activity detection,
0:17:33 apparently the boundaries get better; besides that, we get a ten percent relative improvement in
0:17:38 the diarization error rate, and a boundary precision of eighty-one percent and a
0:17:43 recall of eighty-five percent, which is clearly better than the popular
0:17:48 standard BIC segmentation.
0:17:51 So,
0:17:53 I also want to note that it is popular to use
0:17:56 Viterbi re-segmentation to find more accurate boundaries after the clustering, but
0:18:02 with our speaker factor approach this actually deteriorates the results.
0:18:14 Thank you very much. We have time for some questions.
0:18:24 [Question] The two-pass adaptation works quite well in the speaker diarization,
0:18:32 but ... (remainder of the question unintelligible).
0:18:46 So the first pass, the first line here, is again
0:18:52 based on the speaker factors.
0:18:57 I also tried to fit Gaussian
0:19:04 models on the speaker factors, but that did not give the same results; actually, using
0:19:08 a distance measure
0:19:10 gave better results than trying to fit Gaussian models on the speaker factors.
0:19:18 [Question] (largely unintelligible)
0:19:39 I haven't tried that. One thing about this approach is that the number of
0:19:44 maxima we select depends on the length of the speech segments, so we
0:19:47 use this to limit the number of speaker changes that we
0:19:51 hypothesise inside a speech segment,
0:19:55 so you would have to find a solution for that.
0:19:58 But I think it's possible to use this i-vector approach to find boundaries between
0:20:02 speech and non-speech segments;
0:20:05 it would probably even generate more accurate boundaries than the HMM system that I use now,
0:20:10 but that's a hypothesis that I should test.
0:20:24 [Question] So you use a GMM-based speech/non-speech segmentation;
0:20:29 is it somehow adapted to the given recordings, or is it trained only once?
0:20:35 The HMM system is also adapted, so it's again a two-pass system.
0:20:40 We have two models for the non-speech:
0:20:43 a music model and a background noise model.
0:20:46 Then for speech we also have different models: clean speech, speech with background
0:20:51 noise, and speech with music.
0:20:53 We go through the file once,
0:20:55 estimate posteriors and adapt the models,
0:21:00 and then we go through it a second time.
0:21:18 [Question] To what extent are your error rates affected by overlapping speech, where two speakers
0:21:25 speak simultaneously? What proportion of that would you have, two
0:21:30 speakers speaking in the same region?
0:21:33 So you are talking about overlapping speech?
0:21:37 In this dataset we don't have annotations of overlapping speech, so I cannot
0:21:41 comment on how this impacts the overall results.
0:21:48 [Question] But by the same token, what would you have in your clusters?
0:21:59 How does your model behave
0:22:04 if you've got two speakers speaking in the same region?
0:22:07 In most cases each of these would be detected as a separate cluster. I think,
0:22:12 if I manually look at the files, that this could be detected as a separate
0:22:16 cluster,
0:22:18 so a complete cluster
0:22:20 of overlapping speech.
0:22:22 I think it also occurs that the overlapping speech is assigned to
0:22:27 one of the two speakers, but I did notice sometimes that it is detected as its own
0:22:31 cluster.
0:22:37 I think we have time for another question.
0:22:45 [Question] Okay, the use of this method is for
0:22:48 TV.
0:22:52 So,
0:22:52 how would you adapt this method to online diarization,
0:22:56 real-time subtitles?
0:23:01 So you are asking about the second pass of the system?
0:23:15 It's not an online system. The idea is that the journalist uploads the
0:23:19 file, starts the process, and comes back in one hour, for example.
0:23:24 The first goal is not to make an online system, but
0:23:27 there might be techniques to make it online; I would have to think about
0:23:31 that.
0:23:44 [Question] In this diarization system, do you have to model the number of speakers?
0:23:49 So how many speakers were there
0:23:53 in reality, and how many speakers were estimated?
0:23:56 Okay, if we combine the BIC clustering and then the i-vector PLDA clustering,
0:24:00 the ratio is very close to one. But I have to note that if you
0:24:04 don't use the initial BIC clustering,
0:24:06 the i-vector PLDA system actually reaches a low diarization error
0:24:10 rate, but the ratio between clusters and speakers is quite off; it's about a factor
0:24:14 of two.
0:24:15 So in the system it's quite important to do the initial BIC clustering
0:24:18 to make the ratio close to one,
0:24:21 but the diarization error rate does not change that much when just using the i-vector PLDA clustering.
0:24:30 Alright, I think that's all.
0:24:32 If there are no more questions, I'd like to thank the speaker, and all the speakers, once
0:24:36 again.