| 0:00:15 | okay | 
|---|
| 0:00:15 | thank you my name is abraham | 
|---|
| 0:00:18 | i'll present the work we have carried out by extracting i-vectors from | 
|---|
| 0:00:23 | short and long time speech features for speaker clustering | 
|---|
| 0:00:26 | and | 
|---|
| 0:00:27 | this is also the work of jordi luque and javier hernando | 
|---|
| 0:00:32 | so the outline of | 
|---|
| 0:00:34 | the presentation is as follows so first we will describe | 
|---|
| 0:00:38 | the objectives of our research | 
|---|
| 0:00:40 | we will also describe the main | 
|---|
| 0:00:44 | long-term features that are used in our experiments we will also mention the | 
|---|
| 0:00:49 | baseline and the proposed speaker diarization architectures | 
|---|
| 0:00:53 | and then we will | 
|---|
| 0:00:55 | describe the fusion techniques that are carried out in the speaker segmentation and speaker clustering | 
|---|
| 0:01:00 | and finally the experimental setup and conclusions will be presented | 
|---|
| 0:01:06 | so first of all | 
|---|
| 0:01:08 | speaker diarization consists of two main tasks and these are | 
|---|
| 0:01:12 | speaker segmentation and speaker clustering | 
|---|
| 0:01:14 | and in speaker segmentation | 
|---|
| 0:01:16 | a given audio signal is | 
|---|
| 0:01:19 | split into speaker-homogeneous segments and in speaker clustering | 
|---|
| 0:01:23 | speech segments that belong to a given speaker are grouped together | 
|---|
| 0:01:28 | so the main motivation for this work is that in our previous | 
|---|
| 0:01:32 | work | 
|---|
| 0:01:32 | we have shown that the use of jitter and shimmer and | 
|---|
| 0:01:36 | prosodic features have improved | 
|---|
| 0:01:39 | the performance of | 
|---|
| 0:01:41 | gmm based speaker diarization systems so based on this | 
|---|
| 0:01:45 | we have proposed the extraction of i-vectors from these | 
|---|
| 0:01:49 | voice quality and prosodic features | 
|---|
| 0:01:51 | and then to fuse their cosine distance scores with those of the | 
|---|
| 0:01:56 | mfcc for the speaker clustering task | 
|---|
| 0:02:00 | so here in the feature selection | 
|---|
| 0:02:02 | we select different sets of features from the voice quality and from the prosodic ones | 
|---|
| 0:02:08 | from the voice quality ones we extract | 
|---|
| 0:02:10 | features called absolute jitter absolute shimmer and shimmer apq3 and from the prosodic ones we extract | 
|---|
| 0:02:16 | the pitch | 
|---|
| 0:02:18 | intensity and the first four formant frequencies | 
|---|
| 0:02:21 | once these features are extracted they are stacked in the same feature vector | 
|---|
| 0:02:27 | then we extract two different sets of i-vectors the first i-vector is from | 
|---|
| 0:02:32 | the mfcc | 
|---|
| 0:02:33 | and the second i-vector is from the long-term features | 
|---|
| 0:02:37 | then the cosine similarity of these two | 
|---|
| 0:02:41 | i-vectors is used for speaker clustering task | 
|---|
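The clustering score mentioned here is plain cosine similarity between the i-vectors of two clusters. A minimal sketch of that scoring step (numpy assumed; the function name is illustrative, not from the talk):

```python
import numpy as np

def cosine_score(w1: np.ndarray, w2: np.ndarray) -> float:
    """Cosine similarity between two i-vectors."""
    return float(np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2)))
```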
| 0:02:46 | so these are the main speech features that are used in our experiments the mfcc the | 
|---|
| 0:02:52 | voice quality ones that is the jitter and shimmer and we have also used the prosodic ones | 
|---|
| 0:02:59 | so from the voice qualities we have selected three different measurements based on previous | 
|---|
| 0:03:04 | studies these are the absolute jitter which measures the variation between | 
|---|
| 0:03:09 | two consecutive periods | 
|---|
| 0:03:11 | and we have also used the absolute shimmer | 
|---|
| 0:03:15 | it measures the variation of the amplitude between consecutive periods and also | 
|---|
| 0:03:20 | the shimmer apq3 | 
|---|
| 0:03:22 | it is similar to shimmer but the difference is that | 
|---|
| 0:03:26 | it takes into consideration three consecutive periods | 
|---|
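A sketch of how these three measurements are typically computed from per-cycle pitch periods and peak amplitudes; the formulas follow the common Praat-style definitions, which is an assumption about the exact variants used here:

```python
import numpy as np

def absolute_jitter(periods):
    """Mean absolute difference between consecutive pitch periods."""
    p = np.asarray(periods, dtype=float)
    return float(np.mean(np.abs(np.diff(p))))

def absolute_shimmer_db(amplitudes):
    """Mean absolute log-ratio of consecutive peak amplitudes, in dB."""
    a = np.asarray(amplitudes, dtype=float)
    return float(np.mean(np.abs(20.0 * np.log10(a[1:] / a[:-1]))))

def shimmer_apq3(amplitudes):
    """Three-point amplitude perturbation quotient: each amplitude is
    compared with the mean of itself and its two neighbours, normalized
    by the overall mean amplitude."""
    a = np.asarray(amplitudes, dtype=float)
    neighbour_mean = (a[:-2] + a[1:-1] + a[2:]) / 3.0
    return float(np.mean(np.abs(a[1:-1] - neighbour_mean)) / np.mean(a))
```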
| 0:03:31 | so from prosody we have extracted pitch | 
|---|
| 0:03:34 | intensity and formant frequencies | 
|---|
| 0:03:38 | so when it comes to the speaker diarization architecture first i'll try to describe the | 
|---|
| 0:03:43 | baseline system | 
|---|
| 0:03:45 | so given a speech signal | 
|---|
| 0:03:48 | so we | 
|---|
| 0:03:49 | first take the speech and non-speech mappings from an oracle | 
|---|
| 0:03:53 | the main reason we are using the oracle sad is | 
|---|
| 0:03:56 | that we are mainly interested in the speaker errors | 
|---|
| 0:03:59 | rather than in the speech and non-speech errors | 
|---|
| 0:04:03 | then we extract the mfcc the jitter and shimmer and the prosodic ones only for | 
|---|
| 0:04:08 | the speech frames | 
|---|
| 0:04:10 | then the jitter and shimmer and the prosodic ones are stacked in the same feature | 
|---|
| 0:04:14 | vectors | 
|---|
| 0:04:17 | so based on the size of the data set the | 
|---|
| 0:04:19 | initial number of clusters is initialized if we have | 
|---|
| 0:04:23 | more data that is if the | 
|---|
| 0:04:25 | size of the data is larger or if the | 
|---|
| 0:04:29 | show is longer we have a larger number of clusters and if it is shorter we | 
|---|
| 0:04:33 | have a smaller number of clusters so the initial number of clusters | 
|---|
| 0:04:38 | depends just on the duration of the audio signal | 
|---|
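A tiny sketch of this duration-based initialization; the seconds-per-cluster ratio and the minimum count are hypothetical tuning values, since the talk only states that the count grows with the duration:

```python
def initial_num_clusters(duration_s: float,
                         seconds_per_cluster: float = 100.0,
                         min_clusters: int = 2) -> int:
    """Longer recordings start with more initial clusters."""
    return max(min_clusters, round(duration_s / seconds_per_cluster))
```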
| 0:04:42 | then we assign segments consecutively to these initialized clusters | 
|---|
| 0:04:48 | then we perform the hmm decoding and | 
|---|
| 0:04:51 | training process and then we get two different log-likelihood scores the first one is | 
|---|
| 0:04:56 | for the | 
|---|
| 0:04:57 | short-term spectral features | 
|---|
| 0:04:58 | and then we also get another score for | 
|---|
| 0:05:01 | the long-term features | 
|---|
| 0:05:02 | then these two scores are fused linearly in the speaker segmentation and | 
|---|
| 0:05:07 | we get the speaker segmentation output which gives us | 
|---|
| 0:05:10 | a set of clusters | 
|---|
| 0:05:12 | so we use a classical bic | 
|---|
| 0:05:15 | computation technique that computes | 
|---|
| 0:05:17 | pairwise similarity between | 
|---|
| 0:05:19 | all pairs of clusters and at each iteration the two clusters that have | 
|---|
| 0:05:25 | the highest | 
|---|
| 0:05:28 | bic score | 
|---|
| 0:05:29 | will be merged and this process | 
|---|
| 0:05:33 | iterates until the highest bic value among the clusters is less than the | 
|---|
| 0:05:38 | specified threshold value | 
|---|
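A sketch of the agglomerative loop just described; `bic_score` stands in for the model-dependent delta-BIC computation, and clusters are represented as lists of segments:

```python
import itertools

def bic_clustering(clusters, bic_score, threshold):
    """Merge the pair of clusters with the highest pairwise BIC score at
    each iteration; stop once the best score drops below the threshold."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        pairs = list(itertools.combinations(range(len(clusters)), 2))
        scores = [bic_score(clusters[i], clusters[j]) for i, j in pairs]
        best = max(scores)
        if best < threshold:              # stopping criterion
            break
        i, j = pairs[scores.index(best)]  # merge the best-scoring pair
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters
```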
| 0:05:40 | so this is the classical bic computation so in our work | 
|---|
| 0:05:44 | the initialization and the speaker segmentation are the same | 
|---|
| 0:05:47 | the main change we introduce is in the speaker clustering one where | 
|---|
| 0:05:52 | the gmm bic computation is replaced by the i-vector clustering one | 
|---|
| 0:06:00 | so this is our proposed architecture so given a set of clusters | 
|---|
| 0:06:05 | that are | 
|---|
| 0:06:06 | the output of the viterbi segmentation we extract | 
|---|
| 0:06:09 | two different sets of i-vectors | 
|---|
| 0:06:11 | the first i-vector is from the mfcc | 
|---|
| 0:06:14 | and the second one is from the jitter and shimmer and the prosodic ones | 
|---|
| 0:06:18 | and we use two different | 
|---|
| 0:06:20 | universal background models the first one is for the | 
|---|
| 0:06:25 | short-term spectral features | 
|---|
| 0:06:26 | and the second one is for | 
|---|
| 0:06:29 | the | 
|---|
| 0:06:31 | long-term features | 
|---|
| 0:06:33 | so the ubm and the t matrices are trained using the same source from | 
|---|
| 0:06:38 | ami and we have selected one hundred shows with a duration of forty hours to | 
|---|
| 0:06:43 | train | 
|---|
| 0:06:44 | the ubm | 
|---|
| 0:06:46 | and the i-vectors are extracted using the alize toolkit | 
|---|
| 0:06:49 | so the stopping criterion of the clustering | 
|---|
| 0:06:52 | is normally based on a | 
|---|
| 0:06:54 | specified threshold value so if the score falls below the | 
|---|
| 0:06:59 | specified one the | 
|---|
| 0:07:01 | system stops merging | 
|---|
| 0:07:04 | so | 
|---|
| 0:07:05 | to find the optimum | 
|---|
| 0:07:07 | threshold value we have used a semi-automatic way of | 
|---|
| 0:07:11 | finding | 
|---|
| 0:07:12 | the threshold value | 
|---|
| 0:07:14 | for example in this figure | 
|---|
| 0:07:17 | we have displayed how we have selected | 
|---|
| 0:07:20 | the lambda value and the stopping criterion for five shows from the development set | 
|---|
| 0:07:25 | so these ones the red ones show the highest | 
|---|
| 0:07:29 | cosine distance scores per each iteration | 
|---|
| 0:07:32 | and | 
|---|
| 0:07:34 | these black ones are the diarization error rates per each iteration so the | 
|---|
| 0:07:40 | horizontal dashed line is the lambda value selected | 
|---|
| 0:07:44 | as a threshold to stop the process for example | 
|---|
| 0:07:48 | if we talk about the first | 
|---|
| 0:07:49 | show | 
|---|
| 0:07:52 | the system | 
|---|
| 0:07:53 | stops at the fourth iteration because in the fourth iteration | 
|---|
| 0:07:56 | the | 
|---|
| 0:07:57 | maximum | 
|---|
| 0:07:58 | cosine distance score value is less than this threshold value so we have applied this | 
|---|
| 0:08:04 | technique on | 
|---|
| 0:08:06 | the whole development shows and the lambda value obtained is applied directly on the test | 
|---|
| 0:08:12 | set | 
|---|
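The proposed i-vector clustering follows the same merge-until-threshold pattern, but with cosine scores and the tuned lambda. A sketch under the assumption that `extract_ivector` wraps the ubm/t-matrix front end:

```python
import itertools
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def ivector_clustering(clusters, extract_ivector, lam):
    """At each iteration merge the cluster pair whose i-vectors have the
    highest cosine score; stop when that score falls below lambda."""
    clusters = list(clusters)
    while len(clusters) > 1:
        pairs = list(itertools.combinations(range(len(clusters)), 2))
        scores = [cosine(extract_ivector(clusters[i]),
                         extract_ivector(clusters[j])) for i, j in pairs]
        best = max(scores)
        if best < lam:                    # lambda tuned on the development shows
            break
        i, j = pairs[scores.index(best)]
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```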
| 0:08:15 | so we have used two different fusion techniques one on the speaker segmentation and the | 
|---|
| 0:08:21 | other on speaker clustering | 
|---|
| 0:08:23 | so in the | 
|---|
| 0:08:25 | segmentation | 
|---|
| 0:08:26 | the fusion technique is based on log-likelihood scores so we get | 
|---|
| 0:08:31 | two different scores for a given segment one from | 
|---|
| 0:08:35 | the short-term spectral features and the other from the long-term features so | 
|---|
| 0:08:40 | we get | 
|---|
| 0:08:41 | a model | 
|---|
| 0:08:43 | for | 
|---|
| 0:08:44 | the short-term spectral features so we get the log-likelihood score and this is multiplied by | 
|---|
| 0:08:48 | alpha and again similarly for the | 
|---|
| 0:08:52 | long-term features we | 
|---|
| 0:08:54 | extract | 
|---|
| 0:08:55 | the log-likelihood score and this is multiplied by one minus alpha and the alphas | 
|---|
| 0:09:00 | had to be tuned on the development data set | 
|---|
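A one-line sketch of this segmentation-time fusion, assuming the two weights sum to one (the system may equally use two independently tuned weights):

```python
def fused_log_likelihood(ll_short: float, ll_long: float, alpha: float) -> float:
    """alpha weights the short-term spectral log-likelihood and
    (1 - alpha) the long-term one; alpha is tuned on the development set."""
    return alpha * ll_short + (1.0 - alpha) * ll_long
```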
| 0:09:06 | so the fusion technique in the speaker clustering is carried out as follows so we have | 
|---|
| 0:09:11 | three different sets of features the mfcc the voice quality and the prosodic ones | 
|---|
| 0:09:17 | so the long-term features are stacked together | 
|---|
| 0:09:21 | then we extract two different sets of i-vectors from the mfcc and from the long | 
|---|
| 0:09:26 | term ones | 
|---|
| 0:09:27 | then the cosine similarity between | 
|---|
| 0:09:30 | these two sets of i-vectors | 
|---|
| 0:09:33 | is fused by | 
|---|
| 0:09:34 | a linear weighting function | 
|---|
| 0:09:36 | so in the fused score each cosine similarity is multiplied by | 
|---|
| 0:09:46 | a weight | 
|---|
| 0:09:48 | so the beta in this one is | 
|---|
| 0:09:51 | the weight | 
|---|
| 0:09:52 | that is applied to the cosine distance scores extracted from | 
|---|
| 0:09:57 | the spectral features and one minus beta is | 
|---|
| 0:10:00 | the weight assigned | 
|---|
| 0:10:02 | for the cosine distance scores | 
|---|
| 0:10:04 | extracted from the long-term features | 
|---|
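A sketch of the clustering-time fusion as stated: beta weights the cosine score of the mfcc i-vectors and one minus beta the cosine score of the long-term i-vectors:

```python
import numpy as np

def fused_cosine_score(w_mfcc_a, w_mfcc_b, w_long_a, w_long_b, beta):
    """beta * cos(mfcc i-vectors) + (1 - beta) * cos(long-term i-vectors)."""
    def cos(u, v):
        u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    return beta * cos(w_mfcc_a, w_mfcc_b) + (1.0 - beta) * cos(w_long_a, w_long_b)
```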
| 0:10:10 | so when we come to the experimental setup | 
|---|
| 0:10:13 | we have | 
|---|
| 0:10:15 | developed and tested our experiments on the ami corpus which is | 
|---|
| 0:10:19 | a multi-party and spontaneous set of meeting recordings | 
|---|
| 0:10:24 | so normally in the ami shows the number of speakers is | 
|---|
| 0:10:27 | limited from two | 
|---|
| 0:10:29 | up to five but mostly | 
|---|
| 0:10:31 | the number of speakers is four and | 
|---|
| 0:10:34 | it is a set of meeting recordings and it is multi-channel with the far-field | 
|---|
| 0:10:37 | condition | 
|---|
| 0:10:38 | so we have selected ten shows as a development set to tune the different parameters that is | 
|---|
| 0:10:43 | the weight values | 
|---|
| 0:10:45 | and the threshold values | 
|---|
| 0:10:48 | then we have defined | 
|---|
| 0:10:50 | two experimental setups the first one is a single site so ten shows | 
|---|
| 0:10:54 | had to be selected from idiap | 
|---|
| 0:10:58 | and the other one is a multiple sites one | 
|---|
| 0:11:00 | we have selected ten shows from the idiap | 
|---|
| 0:11:03 | edinburgh and tno sites so | 
|---|
| 0:11:06 | the | 
|---|
| 0:11:07 | optimum parameters that are obtained from the development set are directly used on these | 
|---|
| 0:11:14 | single and multiple site shows so we have used two different | 
|---|
| 0:11:18 | sizes of i-vectors | 
|---|
| 0:11:20 | for the short and long term features and these are also | 
|---|
| 0:11:24 | tuned on the development set and we have | 
|---|
| 0:11:27 | used the oracle sad for the speech references | 
|---|
| 0:11:30 | as the speech activity detection so the diarization error rate reported in this | 
|---|
| 0:11:35 | work | 
|---|
| 0:11:36 | corresponds mainly to the speaker errors the missed speech and the false alarms have | 
|---|
| 0:11:40 | a zero value | 
|---|
| 0:11:44 | so here if we see | 
|---|
| 0:11:45 | the results the baseline system that is based on mfcc and gmm bic | 
|---|
| 0:11:51 | clustering | 
|---|
| 0:11:52 | is the state of the art | 
|---|
| 0:11:54 | but when we are using jitter and shimmer and prosody both in the gmm and | 
|---|
| 0:12:00 | i-vector | 
|---|
| 0:12:02 | clustering technique it improves | 
|---|
| 0:12:05 | a lot compared to | 
|---|
| 0:12:07 | the baseline | 
|---|
| 0:12:08 | and if we compare these to the i-vector | 
|---|
| 0:12:12 | clustering techniques with the gmm ones | 
|---|
| 0:12:16 | the i-vector clustering techniques | 
|---|
| 0:12:19 | again provide better results than | 
|---|
| 0:12:22 | the gmm clustering technique | 
|---|
| 0:12:24 | and we can also conclude that | 
|---|
| 0:12:26 | if we compare the same clustering technique the i-vector clustering techniques that is this one based | 
|---|
| 0:12:31 | on only short-term spectral features and this one | 
|---|
| 0:12:34 | using two different sets of features | 
|---|
| 0:12:37 | the latter provides us better results than | 
|---|
| 0:12:40 | using one set of i-vectors from the | 
|---|
| 0:12:43 | short-term features | 
|---|
| 0:12:48 | so we have | 
|---|
| 0:12:49 | also done | 
|---|
| 0:12:50 | some post-paper processing work | 
|---|
| 0:12:53 | after the submission of this paper to do better | 
|---|
| 0:12:55 | so we have | 
|---|
| 0:12:57 | also tested | 
|---|
| 0:12:59 | the plda scoring | 
|---|
| 0:13:01 | in the clustering stage | 
|---|
| 0:13:02 | and the plda clustering as it is shown in the table | 
|---|
| 0:13:06 | whether it uses only one set of i-vectors or | 
|---|
| 0:13:09 | two sets of i-vectors | 
|---|
| 0:13:11 | it provides better diarization results than both the gmm and cosine scoring techniques | 
|---|
| 0:13:19 | so one of the issues in speaker diarization is that the diarization error rate among | 
|---|
| 0:13:24 | the different shows | 
|---|
| 0:13:28 | varies a lot | 
|---|
| 0:13:31 | from one show to another for example one show may give us | 
|---|
| 0:13:34 | a small der like five percent and another show may give us a big der of | 
|---|
| 0:13:39 | like fifty percent | 
|---|
| 0:13:41 | so for example this box plot shows the | 
|---|
| 0:13:45 | der variation over the multiple and the single sites so | 
|---|
| 0:13:50 | this one is the der variation for the single site and the grey one | 
|---|
| 0:13:54 | is | 
|---|
| 0:13:55 | the der variation for the multiple sites | 
|---|
| 0:13:57 | so this is the highest der and this is the lowest der | 
|---|
| 0:14:01 | so we can see that there is | 
|---|
| 0:14:02 | a huge variation | 
|---|
| 0:14:04 | between | 
|---|
| 0:14:05 | the maximum and the minimum | 
|---|
| 0:14:09 | so if we see | 
|---|
| 0:14:12 | here the use of long-term features | 
|---|
| 0:14:15 | both in the gmm and i-vector clustering technique | 
|---|
| 0:14:18 | help us to reduce the | 
|---|
| 0:14:21 | der variation among the different shows | 
|---|
| 0:14:24 | and the other thing we can see is that both | 
|---|
| 0:14:27 | i-vector clustering techniques that are based on | 
|---|
| 0:14:30 | short-term and short-term plus long-term features | 
|---|
| 0:14:33 | they give us | 
|---|
| 0:14:35 | fewer errors | 
|---|
| 0:14:37 | at least we can say they reduce again | 
|---|
| 0:14:39 | the der variations among | 
|---|
| 0:14:42 | the different shows | 
|---|
| 0:14:43 | and finally this one that is the i-vector clustering technique based on | 
|---|
| 0:14:48 | short-term and long-term features gives us | 
|---|
| 0:14:51 | the lowest | 
|---|
| 0:14:52 | variations among | 
|---|
| 0:14:53 | the different shows | 
|---|
| 0:14:58 | so in conclusion | 
|---|
| 0:15:00 | we have proposed the extraction of i-vectors from | 
|---|
| 0:15:04 | short and long term speech features for | 
|---|
| 0:15:06 | speaker clustering task | 
|---|
| 0:15:09 | and the experimental results demonstrate that the | 
|---|
| 0:15:12 | i-vector clustering techniques provide | 
|---|
| 0:15:15 | better diarization error rates than the gmm clustering ones | 
|---|
| 0:15:20 | and also the extraction of i-vectors from the | 
|---|
| 0:15:24 | long-term features | 
|---|
| 0:15:25 | in addition to the | 
|---|
| 0:15:27 | short-term ones | 
|---|
| 0:15:29 | help us to reduce the der | 
|---|
| 0:15:32 | so in conclusion we can say that the extraction of i-vectors | 
|---|
| 0:15:37 | and the use of | 
|---|
| 0:15:39 | i-vector clustering techniques are helpful for speaker diarization systems | 
|---|
| 0:15:43 | and thank you | 
|---|
| 0:15:52 | then it's time for questions | 
|---|
| 0:16:12 | so i have | 
|---|
| 0:16:19 | i was wondering if you could explain the process you are using for calculating the | 
|---|
| 0:16:26 | jitter and shimmer and did you find it to be a robust process across the | 
|---|
| 0:16:32 | tv shows | 
|---|
| 0:16:37 | normally our | 
|---|
| 0:16:40 | shows are meeting domains | 
|---|
| 0:16:42 | but | 
|---|
| 0:16:44 | it is | 
|---|
| 0:16:45 | it is | 
|---|
| 0:16:46 | a meeting domain it's not a tv show | 
|---|
| 0:16:49 | but when we extract jitter and shimmer | 
|---|
| 0:16:53 | the problem we face is if | 
|---|
| 0:16:56 | the speech is unvoiced | 
|---|
| 0:16:59 | we get zero values | 
|---|
| 0:17:01 | so we compensate them by averaging over five hundred millisecond duration | 
|---|
| 0:17:06 | that is we extract the features over a short frame duration | 
|---|
| 0:17:10 | so to compensate the zero values for the unvoiced frames we average over five hundred millisecond | 
|---|
| 0:17:16 | duration | 
|---|
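A sketch of the compensation described in this answer: zero-valued (unvoiced) frames are replaced by the mean of the voiced values in a 500 ms window around them; the 10 ms frame shift is an assumed, typical value:

```python
import numpy as np

def smooth_unvoiced(values, frame_shift_ms=10.0, window_ms=500.0):
    """Fill zero (unvoiced) jitter/shimmer frames with the mean of the
    voiced frames inside a centred 500 ms window."""
    v = np.asarray(values, dtype=float)
    half = int(window_ms / frame_shift_ms) // 2
    out = v.copy()
    for i in np.where(v == 0.0)[0]:
        window = v[max(0, i - half): i + half + 1]
        voiced = window[window != 0.0]
        if voiced.size:
            out[i] = voiced.mean()
    return out
```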
| 0:17:27 | you have also in one of your slides you said that the threshold is trained | 
|---|
| 0:17:31 | from the development set | 
|---|
| 0:17:33 | how did you find it or train it how did you find that threshold | 
|---|
| 0:17:38 | and did you experiment with changing the threshold value | 
|---|
| 0:17:42 | you mean the segmentation i think this one | 
|---|
| 0:17:47 | no in the formula where you | 
|---|
| 0:17:51 | present the segmentation | 
|---|
| 0:17:57 | this one | 
|---|
| 0:17:58 | oh you mean here | 
|---|
| 0:18:01 | so you mean the alpha values they have been | 
|---|
| 0:18:05 | manually tuned on the development set | 
|---|
| 0:18:08 | we test different weight | 
|---|
| 0:18:11 | values | 
|---|
| 0:18:12 | for the two features | 
|---|
| 0:18:14 | and | 
|---|
| 0:18:15 | these values are directly applied on the test set | 
|---|
| 0:18:22 | okay so they are fixed so in the test experiments they are fixed | 
|---|
| 0:18:41 | thank you very clear presentation i just wanted to understand a little bit about | 
|---|
| 0:18:48 | the physical motivation do you have an explanation why you went for jitter | 
|---|
| 0:18:54 | shimmer and prosody | 
|---|
| 0:18:56 | so for example in experiments that we do we found pitch to be quite well | 
|---|
| 0:19:00 | quite important how did you sort of converge on these two did you go | 
|---|
| 0:19:06 | through a selection process to get to them or do you have any intuition or | 
|---|
| 0:19:10 | explanation for the | 
|---|
| 0:19:11 | so you're saying why we are interested in the extraction of the jitter and shimmer and | 
|---|
| 0:19:15 | prosodic features how did you zero in on them what's your sort of physical | 
|---|
| 0:19:19 | intuition for using those as opposed to other long-term features | 
|---|
| 0:19:25 | because they are voice quality measurements | 
|---|
| 0:19:27 | not spectral measurements | 
|---|
| 0:19:29 | so they can be used to discriminate | 
|---|
| 0:19:33 | the speech of one person from another one so your hypothesis is that there | 
|---|
| 0:19:38 | would be a significant difference between speakers as we have seen there is and that | 
|---|
| 0:19:41 | this will be robust to whatever channel it is going through | 
|---|
| 0:19:46 | but they seem extremely delicate if you will so | 
|---|
| 0:19:53 | if you had to extend this outside this dataset for example to real-life recordings | 
|---|
| 0:19:59 | would you worry about the sensitivity of these features that you are looking at | 
|---|
| 0:20:04 | okay for example jitter and shimmer they have also been used in speaker | 
|---|
| 0:20:10 | verification and recognition on nist databases | 
|---|
| 0:20:14 | so normally | 
|---|
| 0:20:16 | that is the reason why we applied them on speaker diarization | 
|---|
| 0:20:20 | and we have checked the jitter and shimmer on the ami corpus | 
|---|
| 0:20:25 | which is what i'm presenting we have also extracted them on another corpus which is a | 
|---|
| 0:20:30 | catalan tv show | 
|---|
| 0:20:32 | and there also we got some improvements | 
|---|
| 0:20:36 | so you would argue that it helps and | 
|---|
| 0:20:40 | would there be any others do you think | 
|---|
| 0:20:42 | i don't know would there be others you think that you could add to the | 
|---|
| 0:20:46 | two | 
|---|
| 0:20:47 | note that there are different types we have about ten or eleven types of jitter | 
|---|
| 0:20:51 | and shimmer measurements | 
|---|
| 0:20:53 | but we have selected these three based on previous studies for speaker recognition and | 
|---|
| 0:20:59 | maybe we can check with the others also | 
|---|
| 0:21:08 | any other question | 
|---|
| 0:21:14 | i do have a quick question so it's about the stopping criterion so you are | 
|---|
| 0:21:22 | not assuming that you know the number of speakers beforehand | 
|---|
| 0:21:26 | that's right no we don't know the number of speakers you know and you | 
|---|
| 0:21:29 | know it is like real conditions | 
|---|
| 0:21:37 | so any other questions | 
|---|
| 0:21:42 | there are no more questions so let's thank the speaker again | 
|---|