Speech Transcript - The IBM 2016 Speaker Recognition System

0:00:15	good morning everyone so i'm not sure if you noticed but
0:00:19	this is the only speaker recognition talk in this session
0:00:24	so which makes me feel somehow like
0:00:27	the distant relative that the family invites
0:00:31	but you know they don't want to like to
0:00:35	so today i'm gonna be presenting some of the recent advances that we've
0:00:40	seen in our speaker recognition system and i will share some results
0:00:47	that we obtained with the system on the nist sre two thousand and ten
0:00:53	extended core task
0:00:54	with an emphasis on the telephony condition which is condition five
0:01:01	this is joint work with sriram ganapathy who is now an assistant professor
0:01:07	at iisc bangalore india and jason pelecanos
0:01:13	i will start my talk with a brief overview of some of the recent
0:01:18	state-of-the-art works in speaker recognition then i will
0:01:24	share with you the objectives of my talk i will present our speaker recognition
0:01:31	system and the key components that contributed the most towards the end results
0:01:37	i'll describe our experimental setup the data we used the dnn acoustic model
0:01:44	and its configuration as well as the speaker recognition system configuration and i'll share
0:01:50	with you as i said the results we obtained with the system on the nist
0:01:54	sre two thousand ten extended core task
0:01:57	mostly on condition five and then i'll wrap up
0:02:02	so when we look at the recent state-of-the-art work on speaker recognition first of all
0:02:09	most of the state-of-the-art systems are i-vector based
0:02:13	and they somehow use a universal background model to generate statistics to compute the i-vectors
0:02:22	now when we look at this through time we started with traditional unsupervised
0:02:26	gaussian mixture models to represent the ubms
0:02:31	and then more recently we used
0:02:36	phonetically aware ubms which are derived from asr systems
0:02:43	so
0:02:44	i would like to emphasise here that even though this work
0:02:50	done at ibm does not get much credit this is the first
0:02:53	work that in fact used senones to compute
0:02:57	the hyperparameters of the ubm for speaker recognition in fact it achieved
0:03:03	state-of-the-art results as a single system on the nist sre two thousand
0:03:07	and then came the work from sri which basically used
0:03:13	dnn based senone posteriors to compute the ubm parameters
0:03:20	more recently there was the work from johns hopkins university they used
0:03:25	tdnn based
0:03:27	posteriors to compute the ubm parameters and in fact they found that
0:03:32	contrary to what sri found
0:03:36	with the
0:03:37	diagonal covariance so with the ubm that uses
0:03:43	diagonal covariance matrices you can in fact
0:03:46	estimate the ubm parameters
0:03:48	with full covariance matrices and reduce a lot of computation then you don't need to
0:03:53	necessarily
0:03:56	go through the hassle of
0:03:58	a dnn based system so you can use directly a supervised
0:04:03	ubm to compute the statistics and then from there compute the i-vectors and they had
0:04:08	nice gains as well
0:04:10	also some of the state-of-the-art systems don't use the dnn posteriors
0:04:15	to compute the ubm hyperparameters they use dnn bottleneck features and then
0:04:21	the
0:04:21	rest of the pipeline in the i-vector based speaker recognition system remains the same
0:04:27	so i mentioned some of the work here i would like to give some credit
0:04:30	to heck's work in ninety eight
0:04:34	that was the first to explore bottleneck based features for speaker recognition
0:04:41	so the objectives of my talk today
0:04:45	i will be sharing our state-of-the-art results on the nist sre two thousand and
0:04:50	ten extended core tasks again our emphasis is on telephony condition which is condition five
0:04:56	i will be presenting the key system components that contributed the most towards achieving these
0:05:02	results
0:05:05	namely i will talk about the fmllr based features that we used
0:05:10	and in fact compared them with the more traditional raw acoustic
0:05:16	features such as mfccs
0:05:20	we also used dnn based acoustic models in place of an
0:05:25	unsupervised gmm acoustic model for the ubm
0:05:29	so this is basically
0:05:32	technically not novel dnn based i-vectors they've been around for
0:05:37	a while now
0:05:38	what we did here is we nearly doubled
0:05:41	the size of the senone set and we wanted to see how that impacts the
0:05:45	speaker recognition performance
0:05:47	and then finally we explored nearest neighbor discriminant analysis to achieve intersession
0:05:54	variability compensation in the i-vector space we compared the performance with
0:05:59	the more commonly used lda
0:06:02	we also quantify the contribution of
0:06:06	these three system components
0:06:09	towards the performance in fact we also
0:06:12	will see how varying for example the size of the senone
0:06:17	set
0:06:18	will impact the performance
0:06:22	now let's take a look at
0:06:23	our speaker recognition system so you see the flowchart of our speaker recognition system
0:06:29	here this is assuming that all the model parameters are already trained so
0:06:35	we have the dnn acoustic model trained the i-vector extractor and the lda
0:06:40	models
0:06:40	and
0:06:42	so the three components i just mentioned let me repeat this
0:06:46	we have
0:06:48	fmllr based features
0:06:51	that can be used to train and evaluate the dnn
0:06:56	as well as
0:06:58	to compute the sufficient statistics for i-vector extraction so with fmllr you
0:07:02	can achieve speaker and channel normalization
0:07:07	we have the dnn acoustic model instead of an unsupervised gmm
0:07:12	acoustic model to compute
0:07:14	the i-vectors again compared to the previous work we nearly doubled the size of the
0:07:20	senone set
0:07:20	and then we replaced the traditional or the more commonly used lda with
0:07:25	nda for intersession variability compensation and i'm sure you're familiar with the
0:07:31	sparkling
0:07:34	so if we look at the previous work
0:07:38	with the dnn based senone i-vectors what we observe there is
0:07:44	that many systems use two different sets of features
0:07:48	to
0:07:49	compute the posteriors
0:07:52	and to compute the sufficient statistics so typically asr features are different from
0:07:59	speaker-recognition features which makes sense so in this work we wanted to see
0:08:03	what happens if we can unify
0:08:05	or use the same set of features to both train and evaluate the dnn and
0:08:09	to compute the sufficient
0:08:10	statistics for i-vector extraction so towards that we
0:08:14	considered using feature space maximum likelihood linear regression transforms which
0:08:21	are actually asr based features which are used as features for our dnn
0:08:27	system
0:08:30	so an fmllr transform is a linear transform like this which basically can
0:08:35	be decomposed into a linear matrix and a translation and these parameters
0:08:40	can be obtained
0:08:42	using the alignments that we obtain
0:08:46	from the first pass through a gmm-hmm system
0:08:51	and then maximum likelihood basically
0:08:57	estimation gives us the a hat and b hat parameters here
0:09:04	and the product of this transform on raw acoustic features such as mfccs or
0:09:09	even transformed features like lda transformed features
0:09:12	are speaker and channel normalized features
0:09:15	i mean this may sound contradictory by the way because fmllr is used
0:09:20	to reduce speaker variability but as we know
0:09:23	there are two types of speaker variability
0:09:27	speaker variability can be within or across speakers right so here we believe that
0:09:33	the
0:09:35	normalisation that fmllr provides within speakers dominates the between speaker normalisation also
0:09:43	we get the benefit of channel normalization so if we have different types of handsets
0:09:47	for example we think that fmllr can take care of that as well
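
The affine form of the transform described above can be sketched in a few lines. This is a minimal illustration, not the system's implementation: the names `apply_fmllr`, `A` and `b` are hypothetical, the toy values are made up, and in a real system the matrix and offset would be estimated per speaker by maximum likelihood from first-pass alignments.

```python
def apply_fmllr(frames, A, b):
    """Apply the affine feature transform x_hat = A x + b to every frame.

    frames: list of D-dimensional feature vectors (lists of floats)
    A:      D x D matrix (list of rows), assumed already estimated
    b:      D-dimensional translation vector, assumed already estimated
    """
    transformed = []
    for x in frames:
        x_hat = [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i
                 for row, b_i in zip(A, b)]
        transformed.append(x_hat)
    return transformed


# Toy 2-dimensional example with made-up parameters.
A = [[2.0, 0.0],
     [0.0, 0.5]]
b = [1.0, -1.0]
frames = [[1.0, 2.0], [0.0, 4.0]]
print(apply_fmllr(frames, A, b))
```

The same transformed frames would then feed both the acoustic model and the statistics computation, which is the feature unification the talk describes.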
0:09:53	now as i mentioned the dnn senone i-vectors have been around for a while
0:09:59	so nothing technically new in this
0:10:01	here in this slide the only difference is that we nearly doubled
0:10:06	the size of our senone set compared to the previous work to compute the posteriors
0:10:11	and from there compute the
0:10:14	sufficient statistics
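
As a rough sketch of how posteriors turn into sufficient statistics: for each senone, the zeroth-order statistic is the sum of its posteriors over frames, and the first-order statistic is the posterior-weighted sum of the feature vectors. The function name `baum_welch_stats` and the toy inputs are assumptions for illustration; the posteriors here stand in for the per-frame outputs of a DNN acoustic model.

```python
def baum_welch_stats(frames, posteriors):
    """Accumulate zeroth-order (N) and first-order (F) statistics.

    frames:     T feature vectors of dimension D
    posteriors: T vectors of K senone posteriors (each row sums to 1)
    Returns N (length K) and F (K x D).
    """
    K = len(posteriors[0])
    D = len(frames[0])
    N = [0.0] * K
    F = [[0.0] * D for _ in range(K)]
    for x, gamma in zip(frames, posteriors):
        for k, g in enumerate(gamma):
            N[k] += g                  # soft count for senone k
            for d in range(D):
                F[k][d] += g * x[d]    # posterior-weighted feature sum
    return N, F


# Toy example: 2 frames, 2 senones, 2-dimensional features.
N, F = baum_welch_stats([[1.0, 2.0], [3.0, 4.0]],
                        [[1.0, 0.0], [0.5, 0.5]])
print(N, F)
```

With around ten thousand senones the number of classes K grows accordingly, which is why the granularity of the output layer matters for both cost and performance.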
0:10:19	so i'm not gonna spend much time on this
0:10:21	now what
0:10:23	but can i conduct a present that
0:10:28	now we know basically how to rapidly compute i-vectors using even ten k senones
0:10:34	so just to connect this work to
0:10:38	one of the presentations yesterday sandro cumani talked about
0:10:42	how i-vector distributions are not necessarily gaussian and actually he showed us some distributions
0:10:48	and that was even on clean data not even noisy data okay so
0:10:54	and lda basically is formulated based on gaussian distribution assumptions for
0:11:00	individual classes
0:11:02	or even if they are not gaussian they need to be at least uni modal
0:11:08	so
0:11:09	therefore lda cannot effectively handle multimodal data
0:11:13	which is typical in the nist sre types of scenarios because data come from
0:11:19	various sources we have switchboard sources of data we have mixer sources of data and
0:11:24	that causes multimodality in the i-vectors
0:11:28	and also for applications such as language recognition because we only have a few
0:11:34	classes the lda transform
0:11:38	can be rank deficient
0:11:40	so we might get a hit from that as well
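
The rank deficiency mentioned here follows from the standard definition of the lda between-class scatter. The notation below (C classes, N_c samples in class c, class mean mu_c, global mean mu) is introduced only for this illustration:

```latex
S_b = \sum_{c=1}^{C} N_c \,(\mu_c - \mu)(\mu_c - \mu)^{\top},
\qquad
\sum_{c=1}^{C} N_c \,(\mu_c - \mu) = 0
\;\Longrightarrow\;
\operatorname{rank}(S_b) \le C - 1 .
```

The C centered class means are linearly dependent, so lda yields at most C - 1 useful directions. With a handful of language classes that is only a few dimensions, whereas with thousands of training speakers the limit is rarely reached.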
0:11:44	so instead of trying to transform the i-vector space so that it is more gaussian like
0:11:49	what sandro presented yesterday
0:11:53	here we tried to use a transform that does not assume gaussianity
0:11:57	or does not use the
0:12:01	global structure of the classes to compute the between class scatter
0:12:06	matrices so
0:12:07	when you look at lda it uses the class centroids
0:12:11	the differences between class centroids the
0:12:16	arrow here we see to compute the between-class scatter matrix now in nda
0:12:21	what we do is that we don't assume any global structure for
0:12:27	individual classes rather we assume that classes are only locally structured
0:12:31	so we use the local
0:12:34	means that are computed based on the k nearest neighbours for each individual sample
0:12:39	and then use the differences to compute the between class scatter matrices
0:12:46	another thing is that we introduce this weighting function here which is basically to emphasise
0:12:50	the samples near the classification boundary which are more important for discrimination between
0:12:58	different classes rather than the sample here
0:13:01	which should get a really small weight because it doesn't contribute towards discrimination
0:13:07	the class discrimination
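
The construction just described can be sketched as follows. This is a hedged illustration of the general nearest neighbor discriminant analysis recipe, not the authors' code: the function names, the exponent `alpha`, and the choice to measure distance to the K-th neighbour in the weight are assumptions, and the code assumes each class has at least k + 1 distinct samples. Plain lists are used so the sketch runs without dependencies.

```python
import math


def dist(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def knn(x, pool, k):
    """Return the k points in pool closest to x, nearest first."""
    return sorted(pool, key=lambda p: dist(x, p))[:k]


def local_mean(neighbours):
    n = len(neighbours)
    return [sum(col) / n for col in zip(*neighbours)]


def nda_between_scatter(classes, k=2, alpha=2.0):
    """classes: list of classes, each a list of D-dimensional samples."""
    D = len(classes[0][0])
    Sb = [[0.0] * D for _ in range(D)]
    for i, own in enumerate(classes):
        for j, other in enumerate(classes):
            if i == j:
                continue
            for x in own:
                same = [p for p in own if p is not x]   # exclude x itself
                nn_same = knn(x, same, k)
                nn_other = knn(x, other, k)
                # Distances to the k-th neighbour in own and other class.
                d_same = dist(x, nn_same[-1]) ** alpha
                d_other = dist(x, nn_other[-1]) ** alpha
                # Weight: about 0.5 near the boundary, near 0 far from it.
                w = min(d_same, d_other) / (d_same + d_other)
                # Difference between x and the local mean of class j.
                diff = [u - m for u, m in zip(x, local_mean(nn_other))]
                for r in range(D):
                    for c in range(D):
                        Sb[r][c] += w * diff[r] * diff[c]
    return Sb


# Toy example: two well separated 2-dimensional classes.
classes = [[[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
           [[5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]]
print(nda_between_scatter(classes, k=2))
```

Because every sample contributes its own local difference vector, the resulting scatter is not limited to C - 1 directions the way the centroid-based lda scatter is.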
0:13:09	and then another thing is that unlike lda nda
0:13:12	given that we have enough examples for different classes
0:13:17	can always be full rank
0:13:20	so therefore it is very useful for applications such as language id on which we
0:13:27	published work in i guess twenty fifteen and we actually obtained some gains
0:13:32	over the
0:13:34	so our experimental setup for training data we extracted english telephony and microphone data
0:13:40	from
0:13:41	the nist two thousand and four through two thousand and eight sre data we also used
0:13:48	switchboard data both cellular and land line data
0:13:52	this basically resulted in a total of sixty k recordings
0:13:58	to train our system hyperparameters for evaluation we considered the nist twenty ten
0:14:05	sre evaluation set the reason that we considered nist sre two
0:14:09	thousand ten compared to twenty twelve is because
0:14:13	we had some anchors to compare the performance of our system
0:14:17	with other sites
0:14:23	so the conditions we considered were c one condition one to c five
0:14:28	you can see the details here but i wanted to emphasise that again our emphasis
0:14:31	is on condition five which is
0:14:34	telephony and there is a mismatch between enrollment and test
0:14:39	so the type of
0:14:42	phones used in
0:14:45	enrollment and test are not necessarily the same
0:14:48	our dnn system
0:14:50	our dnn acoustic model had seven hidden layers six of
0:14:55	which
0:14:55	had twenty forty eight hidden units and then the bottleneck layer which
0:15:01	had five hundred and twelve units we used fisher data to train it
0:15:04	in addition to the original ten k senones we also considered two point
0:15:10	four k
0:15:12	posteriors to see basically how varying the size or how varying the
0:15:17	granularity
0:15:19	in the output layer
0:15:21	will impact speaker recognition performance
0:15:24	a typical setup for our speaker recognition system
0:15:29	we used a five hundred dimensional total variability subspace
0:15:33	which was reduced to two fifty using either nda or lda which
0:15:38	was trained on the entire training set and we report equal error rate
0:15:43	and mindcf oh eight and ten
0:15:47	i think that we also considered two point four k and ten k senones for i-vector
0:15:51	extraction
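
For readers unfamiliar with the equal error rate reported here, the sketch below shows one simple way to compute it from trial scores: sweep a threshold and find the point where the miss rate and the false alarm rate are closest. The function name and the toy scores are made up for illustration, and official scoring tools may interpolate between operating points rather than pick the nearest one.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Return the EER (as a fraction) by sweeping a threshold over all scores."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # Miss: a target trial scored below the threshold.
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        # False alarm: a non-target trial scored at or above the threshold.
        p_fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(p_miss - p_fa)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (p_miss + p_fa) / 2
    return eer


# Toy scores: one target scores low and one non-target scores high.
print(equal_error_rate([2.0, 3.0, 4.0, 0.5], [0.0, 1.0, 1.5, 2.5]))  # 0.25
```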
0:15:54	in terms of results let's compare lda and nda so
0:15:58	these results are obtained with mfccs a twenty forty component gaussian mixture model and the ten k dnn and
0:16:03	results are reported on condition five as we see no matter what type of acoustic
0:16:09	model we use nda always provided a nice benefit over
0:16:13	lda across the three metrics
0:16:18	and the reason is because as i mentioned nda can handle non
0:16:23	gaussian and multimodal data
0:16:26	more effectively than lda for comparison of
0:16:30	mfccs versus fmllr
0:16:34	again condition five ten k dnn we can see that
0:16:40	first with lda or nda it doesn't matter we always have improvement with
0:16:45	fmllr over mfccs and the reason is because
0:16:49	fmllr provides speaker and channel normalization
0:16:53	also note that we unified a speaker recognition and speech recognition features this way okay
0:16:58	so the system is even simpler but we should also take into account the
0:17:05	fact that in order to compute fmllr transforms we need a two-pass
0:17:08	system as opposed to
0:17:12	to measure the impact of senone set
0:17:16	size we considered two point four k versus ten k posteriors as you can see as we increase
0:17:21	the senone set size results improve
0:17:24	we also considered thirty two k senones
0:17:27	okay just to see how it impacts performance
0:17:33	we did not see much gains with thirty two k c that experiment to the
0:17:39	wind to finish
0:17:42	i just want to emphasise here that in contrast to what we see with the
0:17:47	dnn
0:17:48	if you increase the number of components in a gmm
0:17:53	at least with diagonal covariance matrices you don't see these gains if you increase the
0:17:57	size
0:17:58	of the gmm from two k to four k to six k to eight k
0:18:03	you don't see
0:18:04	probably gains if not degradations
0:18:08	and now they say a picture is worth a thousand words so here is the plot
0:18:13	of these two tables
0:18:16	so what you can see is how lda compares
0:18:20	to nda with both gmm and dnn based systems the performance gap is larger
0:18:26	when we use gmms to compute the posteriors to compute i-vectors
0:18:33	and with the dnn as we increase the size of the senone set this
0:18:37	gap in performance
0:18:38	narrows and then secondly we can compare two k versus
0:18:43	ten k dnn senone
0:18:48	performance
0:18:50	the progression of our system over time we started with the very basic system gmms
0:18:54	and mfccs and lda we replaced lda with nda and got a gain and we
0:19:00	replaced the gmm with the dnn got a boost and then with the ten k dnn and fmllrs
0:19:05	got a further boost in performance so we have the best published performance on the
0:19:12	nist sre two thousand and ten condition five at least
0:19:15	for other conditions we believe those are also the best published performances please
0:19:20	refer to the paper for more detail
0:19:23	regarding previous best results i wanted to mention and give credit to
0:19:28	jhu's work that achieved one point o nine equal error rate
0:19:33	they had a gender dependent system our system is gender independent
0:19:36	and this work
0:19:39	which also used gender dependent systems and only reported results
0:19:44	on female trials so i'm not sure how we can
0:19:48	compare these numbers but
0:19:53	so in conclusion i presented our speaker recognition system the components that it has
0:19:58	i shared with you our results and quantified the contribution of different components
0:20:05	if you're interested in further progress on our system
0:20:09	please come visit us at interspeech or you know if you buy
0:20:13	me a cookie after this i might be able to share more details with
0:20:18	you thank you
0:20:27	for some questions
0:20:37	thank you for your presentation my question is about the weights in the nda computation are
0:20:44	those weights in the
0:20:46	original nda that you mentioned
0:20:50	in computation
0:20:54	those weights are originally in the nda yes they are and you
0:20:59	say that the data that are close to the boundaries have more weight
0:21:04	so let's take a look at how things are computed so
0:21:09	it says minimum of the distance between
0:21:13	each sample and its k nearest neighbours within that class
0:21:18	and the sample and its k nearest neighbours from the j-th class right and
0:21:24	then that is divided by the sum so of course if the sample is not close
0:21:28	to the boundary it's gonna be closer to its k nearest neighbours
0:21:33	from the same class
0:21:35	right
0:21:36	so this number is gonna be small compared to the denominator
0:21:40	so you're gonna get a close to zero number versus if the samples are
0:21:44	close to the boundary this number
0:21:48	is gonna get
0:21:49	close to this number
0:21:51	so sorry this number is gonna basically this is gonna come out of
0:21:57	the min so it's the sample from class i compared to its
0:22:04	k nearest the mean of the k nearest neighbours from class j is gonna come
0:22:09	out so this is divided by
0:22:12	this term is gonna be point five so for samples near the classification
0:22:16	boundary you get point five for samples that are far from the boundary
0:22:20	you're gonna get some near zero number
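
The behaviour of the weight described in this answer can be written out explicitly. The symbols below are introduced for illustration and follow the common nearest neighbor discriminant analysis formulation rather than anything shown verbatim in the talk: d_i is the distance from a class-i sample to its K nearest neighbours within class i, d_j the corresponding distance to class j, and alpha a positive exponent.

```latex
w = \frac{\min\{\, d_i^{\alpha},\; d_j^{\alpha} \,\}}{d_i^{\alpha} + d_j^{\alpha}},
\qquad
\begin{cases}
d_i \approx d_j \ (\text{near the boundary}) & \Rightarrow\ w \approx \dfrac{d_i^{\alpha}}{2\, d_i^{\alpha}} = \tfrac{1}{2}, \\[2ex]
d_i \ll d_j \ (\text{far from the boundary}) & \Rightarrow\ w \approx \left(\dfrac{d_i}{d_j}\right)^{\alpha} \to 0 .
\end{cases}
```

So the weight is bounded between 0 and 0.5, and a larger alpha makes it decay faster as a sample moves away from the boundary.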
0:22:24	can you tell conceptually what does it mean i mean
0:22:27	because i mean why do the samples near the boundaries get more weight because the
0:22:34	samples which are far are far anyway so what are but directly what
0:22:38	we what they contribute or you know a station the between class scatter because sometimes
0:22:44	that are if it assumes that they are gaussians
0:22:48	samples that are near yes
0:22:52	first well even if it's gaussian
0:22:56	so all those data that are away from the mean so they are
0:23:00	like outliers right
0:23:03	the because you can distinguish them
0:23:07	if there is that she that can conclude that can do not directly last but
0:23:13	again we extended data more than those that are or more representative of the training
0:23:18	set is the training set what we mean by shift is the training set already
0:23:22	have the labels we know that those samples
0:23:26	are in that class and they are far from the classification boundary
0:23:33	okay
0:23:37	work
0:23:45	thank you thank you actually i have a question regarding the implementation of
0:23:49	the nda so in your papers i've seen several things like the within class covariance
0:23:57	do you use the classical one or for this work we use the
0:24:01	classical one
0:24:02	okay be clear lately we limit limited with something for this work we use the
0:24:07	exactly the same we compute the within class scatter matrix exactly the same way you
0:24:12	compute it for lda
0:24:14	for the k nearest neighbours we use one versus rest
0:24:18	that means that for each class
0:24:21	you consider that class versus all the other classes and you
0:24:25	compute the nearest neighbours that was the other question so is there except for
0:24:29	the computational time because it was gonna be otherwise gonna be very slow i know
0:24:35	but in terms of results does it change anything
0:24:38	this was so slow that i've never explored it
0:24:44	thank you
0:24:46	english degree

The IBM 2016 Speaker Recognition System

Speaker & Language Recognition Systems

Seyed Omid Sadjadi, Sriram Ganapathy, Jason Pelecanos