Speech Transcript - Soft VAD in Factor Analysis Based Speaker Segmentation of Broadcast News

0:00:14	good morning i am from ghent university at the data science lab and i recently
0:00:19	worked on soft voice activity detection in factor analysis based speaker segmentation of
0:00:23	broadcast news
0:00:26	so this work has been done in the context of the ston project
0:00:31	so the vrt is actually the public broadcaster of flanders
0:00:35	the dutch speaking region of belgium
0:00:38	and the idea is to use speech technology to
0:00:42	speed up the process of creating subtitles for tv shows
0:00:47	another use case can be for journalists who make reports to
0:00:51	have a fast track to put the report online with subtitles so then they
0:00:55	can use the speech technology to generate the subtitles
0:00:58	and the quality may be a bit lower but
0:01:01	in the case of online use the speed is more important than the quality of
0:01:05	the subtitles
0:01:07	so the idea is that subtitling is a very time-consuming manual process so we
0:01:12	want to use
0:01:13	speech technology
0:01:15	so in this presentation i will focus on the diarisation and why do we want to
0:01:21	solve this who spoke when problem
0:01:23	first of all we want to add colours to the subtitles
0:01:27	and if you want to generate subtitles it can also be useful to use
0:01:31	speaker adapted models so if we have speaker labels we can use these adapted models
0:01:36	and another thing is that if we detect speaker changes this can be extra
0:01:41	information for the language model of the speech recognizer to
0:01:46	begin and end sentences so this can also help the recognition
0:01:51	so at interspeech we have a show and tell session in which we
0:01:56	will show the complete system platform
0:01:58	so
0:02:00	it will show how you can upload a video and then start the whole chain
0:02:03	of speech nonspeech segmentation speaker diarization language detection and then speech recognition
0:02:09	but that's not the final step then we actually have to make short sentences to
0:02:12	display them on the screen
0:02:18	okay so what is this concept i think you probably get it we have our audio signal first
0:02:22	of all the first step is the speech nonspeech segmentation we have to remove laughter
0:02:26	we have to remove music
0:02:28	so once we detected the speech segments we can start our speaker
0:02:32	diarization
0:02:33	so this includes detecting the speaker change points and finding homogeneous segments
0:02:39	and once we found those segments we can cluster those segments to assign a speaker
0:02:42	label to all these segments
0:02:45	so then we make the hypothesis that each speaker only uses one language
0:02:51	and because in flanders we're interested in flemish we only keep the flemish segments
0:02:55	and then we will do the speech recognition
0:02:58	and the output of the speech recognizer will need some processing to make the sentences
0:03:02	short enough to display them on the screen
0:03:05	so here we will focus on more accurate speaker segmentation because if we use too
0:03:10	short segments they cannot provide enough data for reliable speaker models because in
0:03:15	the kind of files we use we sometimes have fifty speakers in one
0:03:19	audio file so the longer these homogeneous speaker segments are the more reliable the clustering
0:03:25	will be
0:03:26	obviously if we don't detect a speaker change this will result in non-homogeneous segments and
0:03:32	this will result in error propagation during the clustering process and also if we make
0:03:37	too short segments this will make clustering a lot slower because we have to compute a
0:03:41	lot more distances between segments
0:03:46	okay we propose a two-pass system so in the first pass the speech segments
0:03:52	are generated by the speech nonspeech segmentation
0:03:55	and then we will do speaker segmentation using a standard
0:03:59	eigenvoice approach so we will call these generic eigenvoices because these
0:04:04	are supposed to model actually every speaker that can appear
0:04:07	so once we detected those speaker segments we can do standard speaker clustering
0:04:12	and the output of the speaker clustering the speaker clusters we will
0:04:16	use actually to
0:04:18	retrain our eigenvoice model so we know which speakers are active in the audio file
0:04:22	the broadcast news file so we will retrain eigenvoices that match those speakers
0:04:27	and we also got speech segments so we can also actually retrain our universal
0:04:32	background model
0:04:33	so then we go to a second pass again we start from our baseline
0:04:37	speech segments we do the speaker segmentation again but now with our specific eigenvoices matching
0:04:42	the speakers inside the audio file
0:04:44	and then we do speaker clustering again and hopefully we have better speaker
0:04:48	clusters than in the first pass
0:04:52	okay the first step of our speaker segmentation will be boundary generation so that is
0:04:58	actually the generation of candidate speaker change points
0:05:02	so we will use a sliding window approach we have two comparison windows a left
0:05:07	window and a right window and you can have two hypotheses either we
0:05:11	have the same speaker in the two windows or we have a different speaker
0:05:16	so we will use a measure that looks for the maximal dissimilarity between
0:05:20	the distributions of the acoustic features and if there is a peak in the dissimilarity then
0:05:24	this will indicate that there was a speaker change
0:05:30	okay also
0:05:31	our speech nonspeech segmentation actually does not eliminate short pauses so it is tuned to detect
0:05:36	all laughter and music segments longer than one second
0:05:41	so there can actually be a short pause between speakers
0:05:45	so if we would use adjacent comparison windows this would actually generate several maxima
0:05:51	during the speaker change so we argue that these
0:05:54	maxima can actually appear at the beginning and the end
0:05:58	of the pauses because then the dissimilarity between acoustic features would be maximal in
0:06:03	both windows
0:06:05	so instead we propose to use overlapping comparison windows
0:06:09	so if you look at these regions they actually contribute to
0:06:13	the dissimilarity
0:06:15	and the red regions
0:06:18	make the comparison windows more similar
0:06:21	so when the overlap region between both comparison windows
0:06:25	matches the pause
0:06:27	then the dissimilarity between both windows will be maximal and the speaker
0:06:31	change will be inserted in the middle of the pause which is actually the thing
0:06:35	we want
0:06:36	it is the more logical thing to do
0:06:39	so when we apply this
0:06:42	to our
0:06:43	sliding window approach we just simply use two
0:06:46	overlapping sliding windows a left window and a right window
0:06:53	okay for each comparison window we actually want to extract speaker specific information
0:06:58	so we will do this through factor analysis
0:07:03	because we use the sliding window approach we will use very
0:07:07	low dimensional models because we have to extract those speaker factors for each frame
0:07:12	so we will use a gmm-ubm speech model with thirty two components and use
0:07:17	a low dimensional speaker variability or eigenvoice matrix with only twenty eigenvoices
0:07:23	so we use a window of one second then we slide it across each
0:07:26	frame and we extract those twenty speaker factors
0:07:30	so i should mention that for the training data we use english broadcast news data
0:07:39	okay so now that we have the speaker factors per frame we
0:07:43	actually look for significant local changes between the speaker factors because these will indicate
0:07:48	a speaker change
0:07:50	so we use an extraction window of one second so it's quite obvious that the phonetic
0:07:55	content of this one second window
0:07:57	will have a huge impact on the speaker factors
0:08:00	so we propose to estimate this phonetic variability this intra speaker variability on the test
0:08:06	data itself so we got our speaker factor extraction windows
0:08:12	and
0:08:13	if we look at the segment to the left and make the
0:08:17	hypothesis that it is the same speaker and the same to the right
0:08:20	we can actually use a gaussian model
0:08:22	to estimate the phonetic variability or the intra speaker variability on the test data sigma
0:08:28	l
0:08:29	and if we have a right speaker we can estimate the phonetic variability
0:08:33	sigma r
0:08:34	and
0:08:35	we actually want to find changes in the speaker factors that
0:08:39	are not explained by this phonetic variability we want to look for changes that
0:08:43	have occurred because of a real speaker change
0:08:46	so if we use the mahalanobis distance we can actually look
0:08:49	for changes that are in other directions than those caused by the phonetic variability
0:08:54	so we propose a mahalanobis based distance with two components one where we
0:08:59	have the hypothesis that we have the left speaker
0:09:01	so we look for changes in the speaker factors that are not explained by the phonetic
0:09:05	variability of the left speaker
0:09:06	and the second component is looking for changes not explained by the phonetic variability
0:09:11	of the right speaker
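
The two-component distance described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the use of window means as the compared vectors, the plain sum of the two Mahalanobis terms, and the `ridge` regularisation that keeps the covariances invertible are all hypothetical choices. Speaker factors are plain lists of floats, one vector per frame.

```python
def mean(vectors):
    """Component-wise mean of a list of equally sized vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def covariance(vectors, ridge=1e-3):
    """Full covariance of the vectors plus a small ridge on the diagonal (assumption)."""
    mu = mean(vectors)
    n, dim = len(vectors), len(mu)
    cov = [[0.0] * dim for _ in range(dim)]
    for v in vectors:
        diff = [v[d] - mu[d] for d in range(dim)]
        for i in range(dim):
            for j in range(dim):
                cov[i][j] += diff[i] * diff[j] / n
    for i in range(dim):
        cov[i][i] += ridge
    return cov


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def mahalanobis_sq(delta, cov):
    """Squared Mahalanobis length of delta under covariance cov."""
    return sum(d * s for d, s in zip(delta, solve(cov, delta)))


def two_sided_distance(left_factors, right_factors, ridge=1e-3):
    """Sum of two Mahalanobis terms: one under the left-segment covariance
    (sigma l) and one under the right-segment covariance (sigma r)."""
    mu_l, mu_r = mean(left_factors), mean(right_factors)
    delta = [r - l for l, r in zip(mu_l, mu_r)]
    return (mahalanobis_sq(delta, covariance(left_factors, ridge))
            + mahalanobis_sq(delta, covariance(right_factors, ridge)))
```

A shift along a direction in which the factors already vary inside a segment (the phonetic variability) yields a small distance, while the same shift along a direction with little within-segment variation yields a large one, which is the behaviour the talk motivates.
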
0:09:15	okay so here we got a speech segment and
0:09:17	this shows our distance metric
0:09:21	so i also included the euclidean distance to compare it to the mahalanobis distance
0:09:26	so the red lines are the maximum peaks so actually we have this distance
0:09:31	measure and we look for a maximum distance so we have a peak selection algorithm
0:09:35	so we average our distance measure
0:09:38	and then according to the length of our speech segment we select the number
0:09:41	of maxima
0:09:43	and we also enforce a minimum duration of a speaker turn of one
0:09:47	second
0:09:47	so the red lines indicate all the detected peaks
0:09:50	and the black lines are actually the real speaker turns and we
0:09:54	see that our mahalanobis distance emphasizes the
0:09:58	real speaker changes
0:10:00	so it successfully detects the two
0:10:03	speaker turns
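
The peak selection step can be sketched roughly as below. The talk only states that the distance curve is averaged, that the number of maxima depends on the segment length, and that a minimum turn duration of one second is enforced; the moving-average smoothing, the `peaks_per_second` budget and all function and parameter names are assumptions made for this illustration.

```python
def smooth(values, half_width):
    """Moving average over a window of 2 * half_width + 1 frames (clipped at the edges)."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - half_width)
        hi = min(len(values), i + half_width + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def select_peaks(distance, frame_rate=100, min_turn_s=1.0,
                 peaks_per_second=0.5, half_width=2):
    """Return frame indices of candidate speaker change points.

    distance: one dissimilarity value per frame.
    The number of peaks kept grows with the segment length, and two kept
    peaks are never closer than min_turn_s seconds.
    """
    curve = smooth(distance, half_width)
    # Local maxima; a flat plateau is represented by its left edge.
    candidates = [i for i in range(1, len(curve) - 1)
                  if curve[i] > curve[i - 1] and curve[i] >= curve[i + 1]]
    candidates.sort(key=lambda i: curve[i], reverse=True)
    max_peaks = max(1, int(len(curve) / frame_rate * peaks_per_second))
    min_gap = int(min_turn_s * frame_rate)
    selected = []
    for i in candidates:
        if len(selected) >= max_peaks:
            break
        if all(abs(i - j) >= min_gap for j in selected):
            selected.append(i)
    return sorted(selected)
```

Picking the highest peaks first means that when two maxima violate the minimum turn duration, the weaker one is the one that is dropped.
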
0:10:09	okay once we got our candidate speaker change points we can use some clustering
0:10:15	of the adjacent segments to eliminate false alarms
0:10:19	so again we have a two-pass system in the first pass of the segmentation
0:10:24	system we will use delta bic clustering of the adjacent speaker turns to see
0:10:29	if there is much acoustic similarity between the neighbouring segments if they are quite similar
0:10:34	then we can simply eliminate this boundary
0:10:39	in the second pass we have the specific eigenvoice model so this eigenvoice model
0:10:43	matches the speakers in the file
0:10:45	so then we can actually extract speaker factors
0:10:48	per homogeneous segment
0:10:50	and use the cosine distance to compare the speaker factors
0:10:54	if they're similar we eliminate the candidate change point
0:10:57	if they're dissimilar it's a speaker change point
0:11:00	so we can use the thresholds
0:11:02	of both criteria to control the number of eliminated boundaries
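
A minimal sketch of the second-pass boundary elimination with the cosine distance follows. It assumes one speaker factor vector per homogeneous segment has already been extracted, and, as a simplification not taken from the talk, it does not re-extract factors after a boundary is removed. The function names and the threshold value in the test are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def eliminate_boundaries(boundaries, segment_factors, threshold):
    """Keep only the boundaries whose neighbouring segments are dissimilar.

    boundaries: candidate change points (seconds), in time order.
    segment_factors: one speaker factor vector per segment, so there is
        exactly one more vector than there are boundaries.
    threshold: cosine distance below which a boundary is removed.
    """
    if len(segment_factors) != len(boundaries) + 1:
        raise ValueError("need one factor vector per segment")
    kept = []
    for i, boundary in enumerate(boundaries):
        left, right = segment_factors[i], segment_factors[i + 1]
        if cosine_distance(left, right) >= threshold:
            kept.append(boundary)
    return kept
```

Raising the threshold removes more boundaries (higher precision, lower recall); lowering it keeps more of them, which is how an operating point on the precision recall curve can be chosen.
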
0:11:07	okay
0:11:09	so we tested this on the cost278 broadcast news test
0:11:12	set this is actually a set with twelve languages
0:11:16	we used one language as development data to tune our parameters
0:11:21	and the eleven remaining sets were used as the test data
0:11:26	so this includes thirty hours of data
0:11:29	and four thousand four hundred speaker turns
0:11:33	for the evaluation we do a mapping between the estimated change points and the real
0:11:37	speaker change points with a margin of five hundred milliseconds
0:11:41	and we compute the precision and recall with this mapping
0:11:46	so the precision is the percentage of computed boundaries that are actually mapped to real ones
0:11:52	and the recall is the percentage of real boundaries that are mapped to
0:11:57	computed ones
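
The boundary evaluation can be illustrated with the sketch below. The talk specifies a mapping with a margin of five hundred milliseconds but not how the mapping is built; the greedy one-to-one matching used here, and the function names, are assumptions.

```python
def match_boundaries(computed, reference, margin=0.5):
    """Count reference boundaries that can be paired with a distinct
    computed boundary lying within `margin` seconds (greedy matching)."""
    computed = sorted(computed)
    used = set()
    matched = 0
    for ref in sorted(reference):
        best = None
        for i, c in enumerate(computed):
            if i in used or abs(c - ref) > margin:
                continue
            if best is None or abs(c - ref) < abs(computed[best] - ref):
                best = i
        if best is not None:
            used.add(best)
            matched += 1
    return matched


def precision_recall(computed, reference, margin=0.5):
    """Precision: fraction of computed boundaries mapped to real ones.
    Recall: fraction of real boundaries mapped to computed ones."""
    matched = match_boundaries(computed, reference, margin)
    precision = matched / len(computed) if computed else 0.0
    recall = matched / len(reference) if reference else 0.0
    return precision, recall
```

For example, with computed boundaries at 1.0, 2.3, 5.0 and 9.0 seconds and real boundaries at 1.2, 5.4 and 7.0 seconds, two pairs fall within the margin, giving a precision of 2/4 and a recall of 2/3.
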
0:12:03	so we compare
0:12:06	this
0:12:07	speaker change detection with a delta bic baseline
0:12:11	and we can see that for a low precision we get a maximum recall of
0:12:15	nineteen point six percent which is larger than that of the delta bic
0:12:20	baseline
0:12:21	so once we get these precision recall curves we can then select an operating
0:12:26	point according to the threshold of the
0:12:29	boundary elimination algorithm
0:12:31	and we can use this operating point to start our speaker clustering
0:12:39	okay now more details about our two-pass adaptive speaker segmentation system so in
0:12:44	the first pass we got our speaker turns
0:12:47	our clusters are generated
0:12:49	by clustering the speaker turns generated in the first pass then we retrain the
0:12:53	ubm and the eigenvoice model on the speech and the speaker clusters of the test file
0:12:58	and we repeat the boundary generation
0:13:02	and then we eliminate the boundaries with the cosine distance instead of the delta bic
0:13:05	elimination
0:13:07	so here the yellow line
0:13:09	indicates our two-pass system and we can see that now the cosine distance boundary elimination
0:13:14	actually outperforms the delta bic elimination that we
0:13:19	used in the first pass
0:13:21	so now we can choose an operating point on the
0:13:25	output of the second pass
0:13:30	okay now actually if we extract speaker factors for each comparison window this
0:13:36	does not differentiate between the speech and non-speech frames in the test file
0:13:41	so the idea is actually to give the speech frames in the windows more weight
0:13:45	during the speaker factor extraction
0:13:47	so we will integrate a gmm based
0:13:51	soft voice activity detection we estimate a speech ubm and a non-speech ubm and
0:13:56	then we will use a softmax
0:14:00	to
0:14:01	convert the log likelihoods of the speech ubm and the non-speech ubm to speech posteriors per
0:14:05	frame
0:14:07	and we will weight the baum-welch statistics that are used during the speaker factor
0:14:11	extraction
0:14:14	with the speech posteriors
0:14:16	so it's also important to note that here we will use the speech ubm to
0:14:21	estimate the occupation probabilities of each frame
0:14:25	because we also use the speech posteriors in the second pass of the
0:14:29	system we do not only retrain the speech ubm but we also retrain
0:14:33	the non-speech ubm on the test file so we got non speech segments with the
0:14:37	music and the applause
0:14:38	and we will also use the low energy frames inside the speech segments to
0:14:43	retrain the non-speech ubm
0:14:45	and also during the boundary elimination so to eliminate the false positives
0:14:50	we will use the soft voice activity detection
0:14:53	to extract speaker factors and then use cosine distance boundary elimination
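
The soft voice activity detection described above can be sketched as follows: a two-class softmax turns the per-frame log likelihoods of the speech and non-speech models into a speech posterior, and that posterior scales each frame's contribution to the zeroth and first order Baum-Welch statistics. The function names and the `prior_speech` parameter (equal priors by default) are assumptions; the per-frame log likelihoods and the speech ubm occupation probabilities are taken as given inputs.

```python
import math


def speech_posterior(ll_speech, ll_nonspeech, prior_speech=0.5):
    """Two-class softmax over the log likelihoods of one frame."""
    a = ll_speech + math.log(prior_speech)
    b = ll_nonspeech + math.log(1.0 - prior_speech)
    m = max(a, b)  # subtract the maximum for numerical stability
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb)


def weighted_baum_welch_stats(frames, occupations, posteriors):
    """Zeroth (N) and first order (F) statistics with soft VAD weighting.

    frames: one feature vector per frame.
    occupations: per frame, the occupation probability of each speech ubm
        component.
    posteriors: per frame, the speech posterior used as a weight.
    """
    n_comp, dim = len(occupations[0]), len(frames[0])
    N = [0.0] * n_comp
    F = [[0.0] * dim for _ in range(n_comp)]
    for x, gamma, p in zip(frames, occupations, posteriors):
        for c in range(n_comp):
            w = p * gamma[c]
            N[c] += w
            for d in range(dim):
                F[c][d] += w * x[d]
    return N, F
```

A frame with a speech posterior near zero then contributes almost nothing to the speaker factors of its comparison window, without any hard speech or non-speech decision being made.
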
0:15:00	okay we still
0:15:02	have the delta bic baseline again
0:15:04	this is our
0:15:06	speaker factor extraction without the soft voice activity detection and we actually see that if we don't
0:15:10	use the two-pass system then the soft voice activity detection doesn't really improve the results
0:15:15	but if we use the two-pass system where we use the cosine distance
0:15:18	boundary elimination
0:15:20	we see that we can further improve the results so the soft voice activity
0:15:24	detection is really useful if we use a two-pass system
0:15:29	so once we got the best precision recall curve
0:15:34	we choose an operating point to start the clustering
0:15:39	so this clustering is an agglomerative clustering first we do conventional bic clustering across
0:15:45	the whole file
0:15:46	and this is quite important to get enough data for the i-vector plda clustering
0:15:52	in the second stage
0:15:53	so the idea is for each cluster we got from the output of the bic clustering
0:15:57	to extract an i-vector
0:16:00	and then we will use plda to test the hypothesis if we have the
0:16:04	same speaker or a different speaker
0:16:07	and if the plda indicates
0:16:11	that it is the same speaker then we will merge this
0:16:14	cluster pair
0:16:15	and then for this merged cluster we will again extract the i-vector by summing
0:16:20	up
0:16:20	the sufficient statistics extract a new i-vector and
0:16:24	test the hypothesis again with the plda
0:16:26	so we will iterate this whole clustering process until
0:16:30	the plda outputs a low probability of the same speaker
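
The iterative merging loop described above can be sketched generically as below. The i-vector extractor and the plda scorer are not implemented here: `embed` and `score` are hypothetical callables standing in for them, and the sufficient statistics are represented as plain lists of floats that are summed on a merge.

```python
def agglomerative_cluster(stats, embed, score, threshold):
    """Greedy agglomerative clustering with a same-speaker stopping rule.

    stats: one list of sufficient statistics per initial cluster.
    embed: callable mapping statistics to a cluster representation
        (stand-in for i-vector extraction).
    score: callable returning a same-speaker score for two representations
        (stand-in for plda); higher means more likely the same speaker.
    threshold: merging stops once the best pair scores below this value.
    Returns the clusters as sorted lists of initial cluster indices.
    """
    clusters = [([i], list(s)) for i, s in enumerate(stats)]
    while len(clusters) > 1:
        vectors = [embed(s) for _, s in clusters]
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                current = score(vectors[i], vectors[j])
                if best is None or current > best[0]:
                    best = (current, i, j)
        if best[0] < threshold:
            break
        _, i, j = best
        members = clusters[i][0] + clusters[j][0]
        # Merge by summing the sufficient statistics; the representation
        # is re-extracted from the sum on the next iteration.
        merged = [a + b for a, b in zip(clusters[i][1], clusters[j][1])]
        rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters = rest + [(members, merged)]
    return [sorted(members) for members, _ in clusters]
```

In the toy test for this sketch, each cluster carries a frame count and a feature sum, `embed` returns their ratio, and `score` is the negative absolute difference; a real system would plug in i-vector extraction and plda scoring instead.
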
0:16:37	okay so what are the results after clustering again we use the cost278
0:16:41	broadcast news data set
0:16:43	so we evaluate the diarization error rate which is the percentage of frames that are
0:16:48	attributed to the wrong speaker after mapping between the clusters and the real speakers
0:16:54	so here we got the popular delta bic segmentation so then we get a diarization
0:16:59	error rate of ten point one percent
0:17:01	and we see that actually the detected boundaries are not that accurate when we use
0:17:05	a margin of five hundred milliseconds
0:17:07	if we look for local changes between speaker factors we see a slight
0:17:11	improvement in the diarization error rate but the big change is clearly in the accuracy
0:17:17	of the boundaries so the speaker factor extraction is much more accurate in detecting the
0:17:21	boundaries
0:17:23	the same when we use the two-pass system we see
0:17:26	a slight improvement in the precision and the recall
0:17:29	but then if we use the two-pass system with the soft voice activity detection
0:17:33	the boundaries get better and besides that we get a ten percent relative improvement in
0:17:38	our diarization error rate and a boundary precision of eighty one percent and a
0:17:43	recall of eighty five percent which is clearly better than the popular
0:17:48	standard bic segmentation
0:17:51	so
0:17:53	we also want to note that it's popular to use
0:17:56	viterbi re-segmentation to find more accurate boundaries after clustering but
0:18:02	if we use the speaker factor approach this actually deteriorates the results
0:18:14	thank environments so simple buttons
0:18:24	it to pass that the a two-pass liquidation is quite well in the speaker diarization
0:18:32	but problem of hunters this
0:18:34	somehow you
0:18:36	you
0:18:37	you can represent them is the
0:18:41	do you like that the actual or the u languages
0:18:46	so one selection but ratio between posterior features or not the speaker factors the first
0:18:52	step the first line again this is
0:18:57	on speaker factors it is difficult i tried to put gaussian
0:19:04	models on the speaker factors but that didn't generate the same results so actually using
0:19:08	a distance measure
0:19:10	gave better results than trying to fit gaussian models on them
0:19:18	question is did you have
0:19:22	we were then
0:19:30	so you
0:19:31	the
0:19:35	that is
0:19:39	one thing about this approach is that the number of
0:19:44	speaker changes that we detect depends on the length of the speech segments so we
0:19:47	use this to reduce the number of speaker changes so we make a
0:19:51	hypothesis of the number of speaker changes that could actually occur inside a speech segment
0:19:55	so then you would have to find a solution for that
0:19:58	but i think it's possible to use actually this i-vector approach to find boundaries between
0:20:02	speech and non-speech segments
0:20:05	probably it would even generate more accurate boundaries than the hmm system that i use now
0:20:10	that's a hypothesis that i should test
0:20:24	so the you use a gmm based what the real spectrum
0:20:29	i just like are somewhat appears to the ribbon recordings or the trend of one
0:20:35	completely the hmm system is also that so it's again to both system variable with
0:20:40	so we got two models for the non speech
0:20:43	so the music model and a background noise model
0:20:46	then also for speech we got three different models clean speech speech with background
0:20:51	noise and speech with music
0:20:53	so we go through the file once
0:20:55	and then we estimate posteriors and adapt the models
0:21:00	and then we go through it a second time
0:21:18	the to what extent your rates to its over talking figure it is a speaker
0:21:23	states it
0:21:25	significantly what proportion of that would you have to
0:21:30	speakers just of your region
0:21:33	so are you talking about overlapping speech
0:21:37	in this dataset we don't have annotations of overlapping speech so i cannot
0:21:41	comment on how this has an impact on the results
0:21:48	the by that token would you have in your class
0:21:52	we have here
0:21:56	and that just
0:21:59	you model is that
0:22:02	you're
0:22:04	if you've got speakers speaking region
0:22:07	in most cases each of these would be detected as a separate cluster i think
0:22:12	if i manually look at the files then this could be detected as a separate
0:22:16	cluster
0:22:18	so the complete cluster
0:22:20	that of
0:22:22	i think it also occurs that the overlapping speech is assigned to
0:22:27	one of the two speakers but i did notice sometimes that it's detected as
0:22:31	a separate cluster
0:22:37	think we have to sample
0:22:45	okay she of this method is for
0:22:48	t v
0:22:50	other
0:22:52	so
0:22:52	how these that this method to online diarization
0:22:56	you time citizens
0:23:01	so you're talking about the second pass of the system
0:23:15	so it's not an on-line system so the idea is that the journalist uploads the
0:23:19	file then starts the process and comes back in one hour for example
0:23:24	so the first goal is not to make an on-line system but
0:23:27	there might be techniques to make it online but i would have to think about
0:23:31	that
0:23:44	in this election system don't the to model the number of speakers
0:23:49	so how many speakers were there
0:23:53	in reality and how many speakers were estimated
0:23:56	okay so if we combine the bic clustering and then the i-vector plda clustering
0:24:00	the ratio is very close to one but i have to note that if you
0:24:04	don't use the initial bic clustering
0:24:06	then the i-vector plda system actually reaches a low diarisation error
0:24:10	rate but the ratio between clusters and speakers is quite off it's about a factor
0:24:14	two
0:24:15	so in this system it's quite important to do the initial bic clustering
0:24:18	to make the ratio close to one
0:24:21	but the diarisation error rate is not that bad just using i-vector plda
0:24:30	alright i think so
0:24:32	if there are no more questions i'd like to thank the speaker and all the speakers once
0:24:36	again

Soft VAD in Factor Analysis Based Speaker Segmentation of Broadcast News

Speaker Recognition in Multimedia Content

Brecht Desplanques, Kris Demuynck, Jean-Pierre Martens