Speech Transcript - Factor analysis-based approaches applied to the speaker diarization task of meetings: a preliminary study

0:00:06	well
0:00:07	after a great discussion uh about uh the
0:00:10	last
0:00:11	talk i will i will continue with another topic uh related to speaker diarisation
0:00:16	uh
0:00:17	my name is pavel tomasek and
0:00:19	uh i was working uh previous semester or uh
0:00:23	as an erasmus student in that
0:00:26	uh at the university of avignon
0:00:29	at the lia
0:00:30	the laboratoire
0:00:31	informatique d'avignon
0:00:32	uh where my supervisors were
0:00:35	corinne fredouille and driss matrouf
0:00:38	uh
0:00:39	it was about uh preliminary study
0:00:42	of factor analysis based approaches applied to the speaker diarization task
0:00:48	of meetings
0:00:50	well
0:00:51	what it would be about
0:00:53	uh i will briefly describe the speaker diarisation
0:00:58	also factor analysis
0:01:00	i will tell you something about the objectives of this study uh some experiments
0:01:05	and the
0:01:06	perspective
0:01:10	uh shortly about diarisation i suppose uh almost all of you know what speaker diarization means
0:01:20	what
0:01:20	is its purpose
0:01:22	uh speaker diarization tries to find the answer to the question
0:01:27	who spoke when
0:01:29	uh we don't have
0:01:31	uh any a priori knowledge
0:01:32	about speakers their number
0:01:35	and their identity
0:01:38	uh as you can see here is a small
0:01:41	small
0:01:42	you have uh
0:01:44	if uh
0:01:45	an
0:01:46	output of uh such a such a system
0:01:48	uh where we can see the
0:01:50	speech segments are labelled by the by the speakers
0:01:55	uh the diarisation system uh tries to find the same segments of
0:01:59	speakers
0:02:00	and label them
0:02:01	uh for for my experiments i used uh diarisation system uh developed in
0:02:08	in the lia
0:02:09	uh the the system uh participated uh in the nist rich transcription
0:02:15	uh campaigns since two thousand three
0:02:18	uh the system uses topdown strategy
0:02:21	uh what is the top down strategy i will
0:02:23	i will uh
0:02:24	show
0:02:25	now
0:02:26	uh the top down strategy consists of uh
0:02:29	four main steps
0:02:31	the first
0:02:32	the uh is in uh speech activity detection
0:02:36	uh
0:02:37	where two pretrained gmm models
0:02:41	uh
0:02:42	are are are used uh
0:02:44	as a as a models of speech and nonspeech
0:02:47	uh
0:02:49	then uh it's
0:02:50	used uh viterbi decoding and the map adaptation
0:02:54	another step is uh segmentation
0:02:56	uh where is
0:02:57	used the evolutive
0:03:00	uh hidden markov model
0:03:02	uh
0:03:03	also viterbi decoding and uh
0:03:07	uh the third and the fourth
0:03:09	steps
0:03:10	are almost the same uh it's resegmentation but
0:03:13	using different
0:03:15	parameterisation
0:03:19	uh factor analysis is uh
0:03:22	is
0:03:22	is so well known in in fields like uh speaker verification
0:03:27	language identification uh and video genre classification
0:03:32	uh
0:03:34	and
0:03:35	the the uh
0:03:38	the big difference uh
0:03:41	you can say it's
0:03:42	uh
0:03:43	mathematically
0:03:44	described
0:03:45	uh in these two equations
0:03:47	where the the first equation
0:03:50	is standard gmm ubm modelling
0:03:52	and
0:03:53	the second equation
0:03:55	uh
0:03:57	contains
0:03:58	uh
0:03:59	um
0:04:00	contains the U term
0:04:02	which uh
0:04:04	is modelling the session variability
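
The two equations on the slide are not captured in the transcript. A sketch of the form they usually take in the factor analysis literature is given below; the symbol names are an assumption and may differ from the speaker's slide. Here m is the UBM mean supervector, D a diagonal matrix, y_s the speaker factors, U a low-rank matrix and x_{(h,s)} the session factors for recording h of speaker s.

```latex
% Assumed notation, not taken from the talk.
% Standard GMM-UBM (MAP) modelling of speaker s:
\[ m_s = m + D\,y_s \]
% Factor analysis modelling, with an additional session term:
\[ m_{(h,s)} = m + D\,y_s + U\,x_{(h,s)} \]
```

The only difference between the two lines is the added term U x_{(h,s)}, which is the component the talk refers to when results are given with and without U.
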
0:04:10	so what about
0:04:11	trying factor analysis uh
0:04:13	modelling uh uh in
0:04:15	the single audio files
0:04:17	uh
0:04:19	uh we have situation for example
0:04:22	speaker is
0:04:23	speaking and
0:04:25	environment of the recording is changing like
0:04:28	the speaker is going
0:04:30	uh around the microphone and the distance between
0:04:33	speaker and uh
0:04:35	and the microphone is changing
0:04:37	uh the the factor analysis can be
0:04:39	helpful in this case
0:04:41	um
0:04:44	uh we we tried uh
0:04:47	two approaches in this work and
0:04:50	the first is uh by localising subspace U containing the inter segment variability
0:04:56	and the second uh is
0:04:59	uh in localising the interspeaker variability
0:05:05	about the experimental protocol the details uh are the following as a development set i used twenty three audio files
0:05:13	from the nist uh rich transcriptions
0:05:16	from two thousand four
0:05:18	to two thousand six
0:05:20	uh
0:05:22	it took place in seven different meeting rooms and
0:05:25	uh from
0:05:26	some statistical data
0:05:28	uh the recordings
0:05:29	uh have from ten to eighteen minutes
0:05:32	containing from four to nine participants
0:05:36	and
0:05:36	as evaluation set i use the
0:05:40	seven audio files from nist uh from the previous year
0:05:45	they have from seventeen to twenty seven minutes
0:05:48	and from four to seven speakers
0:05:52	uh
0:05:54	the multiple distant microphones were used here and as a performance uh
0:05:59	measure
0:06:00	uh i used uh diarisation error rate
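
As a reminder of what the performance measure counts, the sketch below computes a diarization error rate from its three usual components (missed speech, false alarm speech and speaker confusion) divided by the total scored speech time. The function name and the example durations are hypothetical and are not taken from the talk.

```python
def diarization_error_rate(missed: float, false_alarm: float,
                           speaker_error: float, total_speech: float) -> float:
    """Return DER as a fraction; all arguments are durations in seconds."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + speaker_error) / total_speech


# Hypothetical durations for one meeting, for illustration only.
der = diarization_error_rate(missed=30.0, false_alarm=12.0,
                             speaker_error=78.0, total_speech=600.0)
print(f"DER = {100 * der:.1f}%")
```

Official scoring tools also apply a forgiveness collar around reference boundaries and can include or exclude overlapping speech, which this sketch leaves out.
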
0:06:05	the factor analysis modelling was applied
0:06:08	only
0:06:09	in the third step of the speaker diarisation system
0:06:14	now the first approach
0:06:16	the modelling of
0:06:18	interspeaker variability
0:06:23	uh the U matrix uh here
0:06:27	in
0:06:27	in this equation
0:06:29	is common to all speakers
0:06:31	and the assumptions are uh
0:06:34	main relevant speaker information located in the low
0:06:37	dimension subspace and the rest
0:06:40	uh
0:06:41	all the speaker information in the full space
0:06:45	and the results are on the next
0:06:47	page
0:06:48	uh there is uh
0:06:51	nothing interesting
0:06:52	except
0:06:53	one thing
0:06:54	it's the difference
0:06:56	between
0:06:57	these two columns
0:06:59	uh
0:07:00	what does it mean and the first column
0:07:03	uh contains the baseline diarization error rate
0:07:07	of
0:07:07	this file
0:07:08	without application of factor analysis
0:07:11	uh the next
0:07:12	column contains uh
0:07:14	results
0:07:15	after
0:07:16	application of uh factor analysis for segmentation
0:07:20	containing
0:07:21	the U term
0:07:23	and the last without
0:07:24	the U term
0:07:26	and the difference is
0:07:28	big
0:07:28	uh on average about ten percent
0:07:30	what does it mean it means that the U term
0:07:34	can
0:07:35	contains some information
0:07:37	useful
0:07:38	for
0:07:39	what they're doing
0:07:40	speaker
0:07:41	uh
0:07:42	in this case uh the only only thing
0:07:45	uh which is important
0:07:46	all the all the results
0:07:48	are uh
0:07:49	on average worse
0:07:52	the second approach is uh the the inter segment variability
0:07:58	um
0:08:00	it's almost the same except uh
0:08:04	the the basic
0:08:04	thing that the U matrix is
0:08:07	uh
0:08:08	modelling
0:08:08	inter segment variability
0:08:10	so the results uh are
0:08:13	this page
0:08:17	yeah the baseline
0:08:19	diarisation error rate
0:08:22	there is uh
0:08:24	after
0:08:24	application of factor analysis
0:08:28	with uh with the U term
0:08:30	and here without
0:08:31	the U term
0:08:34	uh what is
0:08:35	what is uh interesting here
0:08:38	only the fact that uh
0:08:41	some speaker information uh
0:08:44	present
0:08:45	is present in the inter segment component but
0:08:47	not significant
0:08:50	uh i tried another experiment
0:08:53	and it was based uh on filtering
0:08:57	um uh of a speech segment
0:09:00	in
0:09:00	mm kay
0:09:01	development set
0:09:03	in the first column you can uh see
0:09:06	there are
0:09:07	results of system uh
0:09:09	which uses
0:09:10	U matrix
0:09:12	uh estimated on all speech segments uh from the the from the development set
0:09:18	in the next next column you can see uh
0:09:20	results
0:09:21	system
0:09:22	using uh
0:09:24	U matrix estimated on uh segments
0:09:27	longer or equal to
0:09:29	one second
0:09:30	and so on
0:09:32	so seconds five second consequence
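
The filtering described here can be sketched as a simple duration threshold applied to the reference segments before the U matrix is estimated. The `Segment` type, the function name and the example values are hypothetical; the talk does not describe the actual implementation.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    speaker: str   # speaker label from the reference
    start: float   # start time in seconds
    end: float     # end time in seconds


def filter_by_min_duration(segments: List[Segment],
                           min_duration: float) -> List[Segment]:
    """Keep only segments lasting at least min_duration seconds."""
    return [s for s in segments if s.end - s.start >= min_duration]


# Hypothetical development-set segments, for illustration only.
dev_segments = [
    Segment("spk1", 0.0, 0.5),
    Segment("spk1", 1.0, 2.0),
    Segment("spk2", 2.0, 4.5),
]
kept = filter_by_min_duration(dev_segments, min_duration=1.0)
print(len(kept), "of", len(dev_segments), "segments kept for U estimation")
```

Each column of the results table then corresponds to one value of `min_duration`, with the unfiltered case equivalent to a threshold of zero.
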
0:09:33	uh the most uh interesting i think uh
0:09:36	uh
0:09:37	this
0:09:38	this in this paper is
0:09:40	is the uh
0:09:41	the big difference
0:09:43	in these values
0:09:44	uh for this file
0:09:46	uh
0:09:49	it's uh
0:09:49	the original
0:09:51	uh diarization error rate
0:09:53	for this file was about twenty percent
0:09:57	after application uh
0:09:58	this modelling and this filtering of
0:10:01	uh segments shorter than one second
0:10:03	we improved the segmentation
0:10:05	uh by about fifteen point five
0:10:08	point five
0:10:09	percent
0:10:10	uh
0:10:13	well
0:10:13	it's interesting
0:10:15	and uh
0:10:16	we improved
0:10:17	this segmentation
0:10:20	uh so much
0:10:21	we
0:10:22	we got from
0:10:23	twenty percent error rate to five percent error rate
0:10:26	uh
0:10:27	what about a next
0:10:28	uh resegmentation step using a
0:10:32	normal uh standard uh resegmentation step
0:10:34	uh the hypothesis is that uh we can gain
0:10:38	another improvement
0:10:40	uh with viterbi and map adaptation
0:10:43	and
0:10:44	we can see here that
0:10:46	this hypothesis is con
0:10:47	it's confirmed because from
0:10:49	uh from the well changed
0:10:51	segmentation
0:10:53	we improved it uh
0:10:55	by another one point four percent
0:10:58	but this is uh
0:11:00	this is important uh
0:11:02	and significant only for
0:11:04	for this file
0:11:06	uh
0:11:07	where the segmentation
0:11:08	changed a lot
0:11:14	oh
0:11:15	in general
0:11:16	the it's not significant
0:11:19	these changes
0:11:21	uh
0:11:22	and the second resegmentation uh
0:11:26	was uh just
0:11:27	about classical viterbi and
0:11:29	map adaptation
0:11:36	i would like to summarise
0:11:37	this work
0:11:39	uh i tested uh two strategies
0:11:43	the
0:11:44	interspeaker variability modelling and inter segment
0:11:48	variability modelling
0:11:50	and
0:11:50	uh
0:11:51	only the second
0:11:53	has uh an improvement
0:11:56	uh of of the segmentation
0:11:58	but
0:11:59	very
0:12:00	or
0:12:02	uh it can be useful
0:12:05	to to fil
0:12:06	filter some
0:12:08	some short
0:12:09	uh
0:12:10	speech segments
0:12:11	in the
0:12:12	in the part of estimation of U matrix
0:12:15	and it's
0:12:17	also useful to use uh another
0:12:20	resegmentation step
0:12:27	next work uh can be done with uh
0:12:30	more training data
0:12:32	uh and
0:12:35	uh
0:12:36	a larger number of speakers when dealing with the
0:12:39	interspeaker variability
0:12:41	uh
0:12:43	regarding the inter segment variability
0:12:46	uh it can be interesting to to
0:12:49	when dealing with the multiple distant microphones
0:12:53	uh and uh
0:12:55	also another
0:12:57	test
0:12:57	can be done uh
0:12:59	uh
0:13:01	with the application of factor analysis based uh speaker modelling in the first step
0:13:06	of the
0:13:07	the speaker diarization system
0:13:14	well thank you very much for your attention
0:13:16	and
0:13:16	if you have any questions
0:13:26	question
0:13:32	you only reported an improvement when actually you selected only the
0:13:36	speech segments longer than one second
0:13:39	right
0:13:40	it means that actually in your segmentation of most of most lots of research
0:13:44	and this is your variable files
0:13:46	so good that was how we were i was configure are there any
0:13:50	it limits for the minimum duration of a segment
0:13:53	uh sorry i cannot tell uh anything about the vad because i just worked
0:13:58	uh with the diarization system as it was
0:14:01	uh maybe uh korean if uh not serious
0:14:22	uh but uh maybe uh i i didn't understand well uh this uh this uh filtration is made on the
0:14:28	development
0:14:29	so
0:14:37	uh_huh
0:14:51	yeah in fact the U matrix
0:14:53	is trained on the
0:14:55	and development it so we have to wait for instance the development set
0:14:59	so we can choose
0:15:00	and the length of the segment
0:15:02	and you try to train
0:15:06	yeah for the U matrix estimation yeah
0:15:11	yeah
0:15:11	oh i have a question
0:15:14	so
0:15:14	i see that there is inter speaker variability and inter segment
0:15:18	variability
0:15:19	and uh
0:15:21	do you
0:15:22	so
0:15:23	i guess inter segment variability uh reflects the changes
0:15:27	speaker
0:15:28	is it useful information for
0:15:30	or
0:15:31	detecting the speaker
0:15:32	change
0:15:35	so
0:15:36	and we expect
0:15:38	a two
0:15:39	speaker
0:15:39	i think they should okay
0:15:41	can you do some information but
0:15:43	we should keep
0:15:44	okay segment and applications compensated and
0:15:48	nation
0:15:49	well you can line
0:15:50	why not
0:15:51	in uh in the estimation of U matrix
0:15:54	the uh the vocal development set
0:15:56	uh we had the reference
0:15:58	and uh U matrix was estimated um
0:16:02	in this case
0:16:03	uh
0:16:06	for for each speaker
0:16:09	uh
0:16:09	between uh the segments
0:16:11	of
0:16:12	of one speaker
0:16:14	so it was it was not uh
0:16:16	inter segment variability
0:16:19	in the way of
0:16:20	for uh
0:16:21	inter
0:16:21	all segments variability but only
0:16:24	uh it was inter segment variability of
0:16:26	of a certain speaker
0:16:31	all speakers soprano testing
0:16:36	and then you do the presegmentation
0:16:38	using a generative model
0:16:41	you can see you mentioned B B segmentation
0:16:44	i always
0:16:45	process so you have one
0:16:47	one night lately
0:16:50	and
0:16:52	how many rounds
0:16:53	right
0:16:55	uh how many how many or segmentation
0:16:58	uh
0:16:59	uh well
0:17:00	there is normally there is uh
0:17:02	one one uh
0:17:03	segmentation and then uh takes place the resegmentation
0:17:07	in this case it was a resegmentation using uh factor analysis
0:17:12	modelling
0:17:13	uh
0:17:16	and uh the resegmentation uh uh was iterating until
0:17:20	uh the number of
0:17:22	five
0:17:22	changes of
0:17:23	in the in the segmentation
0:17:26	uh was uh
0:17:28	less than a certain
0:17:30	value
0:17:33	one one
0:17:38	one per segmentation process
0:17:40	with many iterations
0:17:41	right
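
The stopping rule discussed in this exchange (repeat the resegmentation until the number of changes in the segmentation falls below a certain value) can be sketched as below. All names are hypothetical, and `toy_resegment` is only a stand-in that relabels isolated frames; in the system described in the talk this step is Viterbi decoding followed by MAP adaptation.

```python
from typing import Callable, List


def count_changes(old: List[str], new: List[str]) -> int:
    """Number of frames whose speaker label differs between two passes."""
    return sum(a != b for a, b in zip(old, new))


def iterate_resegmentation(labels: List[str],
                           resegment: Callable[[List[str]], List[str]],
                           min_changes: int,
                           max_iters: int = 50) -> List[str]:
    """Repeat resegment until fewer than min_changes labels change."""
    for _ in range(max_iters):
        new_labels = resegment(labels)
        changes = count_changes(labels, new_labels)
        labels = new_labels
        if changes < min_changes:
            break
    return labels


def toy_resegment(labels: List[str]) -> List[str]:
    """Stand-in step: relabel a frame whose two neighbours agree against it."""
    new = list(labels)
    for i in range(1, len(labels) - 1):
        if labels[i - 1] == labels[i + 1] != labels[i]:
            new[i] = labels[i - 1]
    return new


# Hypothetical frame-level labels for two speakers A and B.
result = iterate_resegmentation(list("AABAAABBABBB"), toy_resegment,
                                min_changes=1)
print("".join(result))
```

The `max_iters` cap is an added safeguard so the loop terminates even if the labels keep oscillating; the talk does not say whether the original system has one.
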
0:17:42	which
0:17:46	oh
0:17:47	slide
0:17:49	a class
0:17:52	uh
0:17:54	i don't know which slide you mean
0:17:56	in this uh
0:17:57	there are parts of the uh
0:18:01	right
0:18:01	segmentation
0:18:02	yes
0:18:03	uh yeah
0:18:04	this is the original baseline system
0:18:07	and there are two resegmentation uh steps and uh the factor analysis
0:18:12	took place after this
0:18:14	resegmentation step
0:18:16	as the last
0:18:17	part of the of the diarization system
0:18:20	okay
0:18:24	you can can anything
0:18:25	in fact that the number education is not speak
0:18:28	it depends that understands it changes
0:18:31	and giving them a sense
0:18:32	so when we an estimated ten
0:18:35	no more changes
0:18:37	it went a segmentation that a given state we stop
0:18:42	thank you
0:18:44	no
0:18:46	i actually
0:18:50	yes
0:18:56	uh but you tested so
0:18:58	you you only scored the sections of the meetings that did not have overlapping speakers correct
0:19:05	uh
0:19:06	we tested uh only the the evaluation set uh from the nist
0:19:10	so i but there were different ways to score that there was a parameter which determines how much overlapping speech
0:19:16	was included
0:19:18	uh
0:19:19	uh and and your your uh error rates
0:19:22	are quite low so i assume you
0:19:24	but did not score the overlap
0:19:26	speakers
0:19:27	but that's just an assumption i want to
0:19:29	from you
0:19:32	well there are rights uh
0:19:36	maybe maybe you don't know because you just drama
0:19:38	for example the yeah right
0:19:40	here are
0:19:41	are the global arrays that although the total
0:19:44	all rights including curve force are um with
0:19:46	speech and the speaker
0:19:48	you could change
0:19:50	okay i i don't know uh
0:19:52	if i
0:19:53	and just
0:19:54	oh okay
0:19:55	and then about this one meeting
0:19:57	where you
0:19:58	had a significant improvement
0:20:00	um
0:20:02	i i
0:20:03	i remember that on one of the nist meetings
0:20:06	there was a
0:20:07	much larger number of speakers
0:20:09	than
0:20:10	in the other meetings
0:20:12	and i wonder if that was the one meeting where you saw a gain
0:20:16	um so there were many more
0:20:19	speaker changes because the number of speakers was actually
0:20:22	that's like double the other meetings
0:20:25	uh so i wondered if you had actually looked at some statistics of your meetings
0:20:29	to see uh if
0:20:31	there is some variable like the number
0:20:33	of speakers that
0:20:34	uh could predict when your method works
0:20:37	uh well and when that might make a difference
0:20:40	oh well uh i i don't have
0:20:42	anyhow
0:20:44	oh
0:20:52	no information
0:20:55	and we we did not
0:20:57	it
0:20:57	and
0:20:58	and is about to the
0:21:00	this was it
0:21:01	we
0:21:02	we know that
0:21:03	you say that again
0:21:05	yeah sometimes
0:21:07	and is not
0:21:08	necessary you to to the fact and he's in good
0:21:11	if we change
0:21:13	and finally implies sense
0:21:14	and
0:21:16	insinuation of that and uh
0:21:18	and this is and we know that we can
0:21:20	and this improvement
0:21:22	an infected E es work
0:21:24	the good ones too
0:21:26	and
0:21:27	exp tool
0:21:28	and don't we all
0:21:29	applying thank john and easy C speaker deviation
0:21:33	on meetings we had these aladdin
0:21:35	speakers most because then
0:21:36	implementation
0:21:38	and a different connotation
0:21:41	it is
0:21:45	and that that's the overlap
0:21:47	and we didn't
0:21:49	scroll and we thought about that
0:21:51	and
0:21:52	because we we
0:21:54	do something
0:21:55	oh
0:21:56	and to delete
0:21:57	overlap
0:21:58	and the ones to law school

Factor analysis-based approaches applied to the speaker diarization task of meetings: a preliminary study

SESSION 6: Diarization

Added: 14. 7. 2010 11:08, Author: Pavel Tomasek, Corinne Fredouille, Driss Matrouf (University of Avignon, CERI/LIA, Avignon, France), Length: 0:22:18