0:00:15 Hello, my name is Daniel Garcia-Romero, and I will be presenting joint work with my colleagues from the Human Language Technology Center of Excellence at Johns Hopkins University. The title of our work is MagNetO: x-vector magnitude estimation network plus offset for improving speaker recognition.
0:00:41 The current state of the art in text-independent speaker recognition is based on DNN embeddings trained with a classification loss, for example multiclass cross-entropy. If there is no severe mismatch between the DNN training data and the deployment environment, the cosine similarity between embeddings from a system trained with an angular margin softmax provides very good speaker discrimination.
0:01:09 For example, in the most recent NIST SRE evaluation, which used audio extracted from online videos, the top-performing single system on the audio track was based on this paradigm.
0:01:26 Unfortunately, even though cosine similarity provides good speaker discrimination, directly using those scores does not allow us to make optimal decisions with a theoretical threshold, because these scores are not calibrated.
0:01:45 The typical way to address this problem is to use an affine mapping to transform the scores into log-likelihood ratios that are calibrated. This is typically done using linear logistic regression, where we learn two numbers: a scale and an offset. Looking at the top equation, the raw scores are denoted by s_ij, which is the cosine similarity between two embeddings. This can be expressed as an inner product of the unit-length embeddings, s_ij = x̃_i^T x̃_j, so it is nothing more than the inner product of two unit-length embeddings. Once we learn a calibration mapping with parameters a and b, we can transform this score into a log-likelihood ratio, and then we can make use of the Bayes threshold to make optimal decisions.
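As a concrete sketch of this standard recipe (the scale a, offset b, and prior below are made-up illustrative numbers, not values learned by any real system):

```python
import math

def cosine_score(x, y):
    """Cosine similarity: inner product of the length-normalized embeddings."""
    nx = math.sqrt(sum(v * v for v in x))
    ny = math.sqrt(sum(v * v for v in y))
    return sum(a * b for a, b in zip(x, y)) / (nx * ny)

def calibrate(s, a, b):
    """Affine calibration mapping raw score to a log-likelihood ratio: llr = a*s + b."""
    return a * s + b

def bayes_decision(llr, p_target):
    """Accept the trial iff the LLR exceeds the Bayes threshold log((1-p)/p)."""
    threshold = math.log((1.0 - p_target) / p_target)
    return llr > threshold

# Hypothetical values: a and b would be learned by logistic regression on held-out trials.
a, b = 8.0, -4.0
s = cosine_score([1.0, 0.0], [0.8, 0.6])   # 0.8
llr = calibrate(s, a, b)                    # 2.4
print(bayes_decision(llr, p_target=0.5))    # True (threshold is 0 for a prior of 0.5)
```

With a target prior of 0.5 the Bayes threshold is zero, so any positive LLR is accepted; lowering the prior raises the threshold.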
0:02:45 In this work we propose MagNetO. One way to look at it is to note that the affine scale a can be thought of as simply assigning a constant magnitude to the unit-length embeddings, so every embedding gets the same magnitude. Instead, we suggest that it is probably better for every embedding to have its own magnitude, and we want to use a neural network to estimate the optimal value of those magnitudes. We also use a global offset to complete the mapping to log-likelihood ratios. Note that this new approach may result in a non-monotonic mapping, which means that it has the potential to not only produce calibrated scores but also improve discrimination by increasing the separation between classes.
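A minimal numeric sketch of the difference (all magnitudes and offsets here are illustrative, not learned values): the global calibrator multiplies every cosine score by the same constant, while per-embedding magnitudes make the mapping trial-dependent.

```python
def llr_global(cos_ij, a, b):
    # Global linear calibration: the same scale a for every trial.
    return a * cos_ij + b

def llr_magnitude(cos_ij, m_i, m_j, b):
    # Magnitude-based score: inner product of the scaled unit-length embeddings
    # plus a global offset: (m_i * x_i) . (m_j * x_j) + b = m_i * m_j * cos_ij + b.
    return m_i * m_j * cos_ij + b

# Two trials with the same cosine score can now receive different LLRs, so the
# overall mapping from cosine score to LLR need not be monotonic.
print(llr_magnitude(0.5, 3.0, 3.0, -2.0))  # 2.5  (large magnitudes -> boosted)
print(llr_magnitude(0.5, 1.0, 1.0, -2.0))  # -1.5 (small magnitudes -> attenuated)
```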
0:03:43 To train this magnitude network, we use a binary classification task. We draw target and non-target trials from a training set, and the loss function is a weighted binary cross-entropy, where α is the prior of a target trial and l_ij is the log posterior odds, which can be decomposed in terms of the log-likelihood ratio and the log prior odds.
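This loss can be sketched in a few lines (the trial scores and prior below are toy values): the log posterior odds are the LLR plus the log prior odds, and the loss is the prior-weighted binary cross-entropy over the target and non-target trials.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def weighted_bce(target_llrs, nontarget_llrs, alpha):
    """Prior-weighted binary cross-entropy over trials.
    l = llr + log(alpha / (1 - alpha)) are the log posterior odds."""
    logit_prior = math.log(alpha / (1.0 - alpha))
    # -log sigmoid(l) = softplus(-l); -log(1 - sigmoid(l)) = softplus(l)
    loss_tar = sum(softplus(-(llr + logit_prior)) for llr in target_llrs) / len(target_llrs)
    loss_non = sum(softplus(llr + logit_prior) for llr in nontarget_llrs) / len(nontarget_llrs)
    return alpha * loss_tar + (1.0 - alpha) * loss_non

# A well-separated toy trial set gives a small loss; flipping the scores makes it large.
good = weighted_bce([5.0, 6.0], [-5.0, -6.0], alpha=0.5)
bad = weighted_bce([-5.0, -6.0], [5.0, 6.0], alpha=0.5)
print(good < bad)  # True
```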
0:04:13 The overall system architecture that we are going to use will be trained in three steps. On the left is a block diagram of our baseline architecture. We use a 2D convolutional ResNet architecture, followed by temporal pooling, to obtain a high-dimensional vector of pooled activations. We then use an affine layer as a bottleneck to obtain the embedding, which in our case has 256 dimensions. The star is used to denote the node where the embedding is extracted. The network will be trained using multiclass cross-entropy with a softmax classification head that uses an additive angular margin.
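The classification head can be sketched as follows. This is the generic additive-angular-margin (ArcFace-style) cross-entropy written from the usual textbook formulation, not our exact implementation, and the scale and margin values are illustrative:

```python
import math

def aam_softmax_loss(cosines, target, scale=30.0, margin=0.2):
    """Cross-entropy over additive-angular-margin logits.
    cosines: cos(theta) between the unit-length embedding and each class weight.
    The target class angle is penalized by the margin before re-scaling."""
    logits = list(cosines)
    theta = math.acos(max(-1.0, min(1.0, logits[target])))
    logits[target] = math.cos(theta + margin)
    logits = [scale * c for c in logits]
    # Standard log-softmax cross-entropy on the modified logits.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

# The margin makes the task harder: for the same cosines, the loss is higher
# than it would be with plain softmax (margin = 0).
with_margin = aam_softmax_loss([0.6, 0.4, 0.2], target=0, margin=0.3)
no_margin = aam_softmax_loss([0.6, 0.4, 0.2], target=0, margin=0.0)
print(with_margin > no_margin)  # True
```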
0:05:02 The first stage of the training process uses short segments to train the network. In the past we have seen this to be a good compromise, because the short sequences allow for good use of GPU memory with large batches, and at the same time they make the task harder, so that even though we have a very powerful classification head we still get errors whose gradients we can back-propagate. As the second step, we propose to freeze the most memory-intensive layers, which are typically the layers operating at the frame level, and then fine-tune the post-pooling layers with whole recordings, using all the frames of the audio recording, which might be up to two minutes of speech. By freezing the lower layers we reduce the memory demands, so we can use the long sequences, and we also avoid overfitting to the easier problem posed by the long sequences.
0:06:11 Finally, in the third step we train the magnitude estimation network. The first thing we do is discard the multiclass classification head and use binary classification instead. We use a Siamese structure, which is depicted here by drawing the network twice, but the parameters are the same; this is just for illustration purposes. Notice that we also freeze the affine embedding layer, which is denoted by the gray color. So at this point everything that produces the embeddings is fixed, and we are adding a magnitude estimation network that takes the pooled activations, which are very high-dimensional, and tries to learn a scalar magnitude to go along with the unit-length x-vector. It is optimized to minimize the binary cross-entropy. We also keep the global offset as part of the optimization problem.
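To make the freezing idea concrete, here is a toy sketch (the layer names and numbers are purely illustrative, not our network): frozen parameter groups are simply excluded from gradient updates, which is also what frees the memory needed to process full-length recordings.

```python
# Toy parameter store standing in for the real network. In stage 2 the
# frame-level layers are frozen; in stage 3 the embedding layer is frozen too.
params = {
    "frame_conv": [0.5, -0.2],   # memory-hungry frame-level layers
    "pooling":    [1.0],         # post-pooling layers, fine-tuned on full recordings
    "embedding":  [0.3, 0.7],    # affine bottleneck
}

def sgd_step(params, grads, frozen, lr=0.1):
    """Update every parameter group except the frozen ones."""
    for name, g in grads.items():
        if name in frozen:
            continue
        params[name] = [w - lr * gi for w, gi in zip(params[name], g)]

grads = {"frame_conv": [1.0, 1.0], "pooling": [1.0], "embedding": [1.0, 1.0]}

# Stage 2: freeze the frame-level layers, fine-tune the rest on full-length audio.
sgd_step(params, grads, frozen={"frame_conv"})
print(params["frame_conv"])  # [0.5, -0.2] -- unchanged
print(params["pooling"])     # [0.9]
```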
0:07:16 To validate our ideas we use the following setup. As our baseline system we use a modification of the ResNet-34 x-vector proposed in prior work. The modification we make is to allocate more channels to the early layers, because we have seen that this improves performance. At the same time, to control the number of parameters, we change the expansion rates of the different layers so that we do not increase the channels so much in the deeper layers; in this way we control the number of parameters without degrading performance. To train the DNN we use the VoxCeleb2 dev data, which comprises about six thousand speakers and a million utterances, and this is wideband, 16 kHz audio.
0:08:12 Note that we process the data differently when we use it with short segments versus full-length refinement, in terms of how we apply augmentations; I refer you to the paper for the details. Those are very important for good performance and also for generalization.
0:08:29 To make sure that we do not overfit to a single evaluation set, we benchmark against four different sets. Speakers in the Wild and VoxCeleb1 are actually a good fit to VoxCeleb2; there is not much domain mismatch between those two evaluation sets and the training data. The SRE19 audio-from-video portion and CHiME-5 have some domain mismatch compared to the training data, and I will comment on this in the results later. In the case of SRE19 this is mostly because the test audio comprises multiple speakers and there is a need for diarization. In the CHiME-5 case there are far-field microphone recordings with a lot of overlapped speech and higher levels of reverberation, so it is a very challenging setup. Also, the CHiME-5 results will be split in terms of a close-talking microphone and a far-field microphone.
0:09:31 Let's start by looking at the baseline system that we are proposing. We present results in terms of equal error rate and two other operating points; we do this to facilitate comparison with prior work. If you look at the right of the table, we list the best single-system, no-fusion numbers that we were able to find in the literature for all the benchmarks; not all of the costs were reported. Our baseline seems to do a good job compared to the prior work, improving performance at most of the operating points. Note that we are not doing any particular tuning for each evaluation set. There is one small caveat: as I said, SRE19 requires diarization, so we diarize the test segments, and then for each speaker cluster detected we extract an x-vector. We then score the enrollment against all the test x-vectors and take the maximum score.
0:10:43 To check the improvements that the full-length refinement brings in the second stage, we can compare against the baseline in this table. Overall we see positive trends across all the data sets and operating points, but the gains are larger for the Speakers in the Wild set. This makes sense because it is the set for which the evaluation data has a longer duration compared to the four-second segments that were used to train the DNN. This validates the recent findings in our Interspeech paper, in which we saw that full-length refinement is a good way to mitigate the duration mismatch between the training phase and the test phase.
0:11:36 Regarding the magnitude estimation network, we explored multiple topologies, all of them feed-forward architectures, and we explored the interaction of depth and width. Here I present three representative cases that change in terms of the number of layers and the width of the layers; the parameters go from 1.5 million to 20 million. When we compare performance for these three architectures across all the tasks, we do not see large changes; the performance is quite stable across networks, which is probably a strength. As a good trade-off between the number of parameters and performance, we pick the second architecture for the remaining experiments.
0:12:31 This slide presents the overall gains in discrimination due to the three stages. In the graphs, the horizontal axis shows the different benchmarks; we have split the far-field microphone results into a different plot just to facilitate the visualization, because they are in a different dynamic range. On the vertical axis we depict one of the cost operating points. The color coding indicates the system: one color for the baseline, orange for the full-length refinement applied to that baseline, and gray for the magnitude estimation applied on top of the full-length refinement. Overall we can see that the two extra training stages, the full-length refinement and the magnitude estimation, both produce gains, and we see that across all data sets. In terms of EER we are getting about a twelve percent gain, and for the other two operating points we are getting an average of about twenty-one percent gains. Even though I am only showing one operating point here, in the paper you can find the results for the other two operating points.
0:13:52 So finally, let's look into the calibration results. Both the global calibrator and the magnitude network are trained on the VoxCeleb2 dev dataset. This is a very good fit for the VoxCeleb1 and Speakers in the Wild evaluation sets, but it is not such a good match for CHiME-5 and SRE19, where the test segments are as described before. Using the global calibrator, we can see that we obtain good performance in terms of the actual cost versus the minimum cost for both VoxCeleb1 and Speakers in the Wild, but when we move to the other datasets we struggle to obtain good calibration with the global calibrator. Looking at the magnitude estimation network, we see a similar trend: for VoxCeleb1 and Speakers in the Wild we obtain very good calibration, but the system also struggles for the other sets. I think a fair statement is to say that the magnitude estimation does not fully deal with the domain shift, but it outperforms the global linear calibration at all the operating points and for all data sets.
0:15:06 To gain some understanding of what the magnitude estimation is doing, we did some analysis. The bottom plot on the right shows the histogram of the cosine scores for the non-target and target distributions; the red color indicates the non-target scores and the blue color indicates the target scores. The top two panels show the cosine score plotted against the product of the magnitudes for both embeddings involved in the trial, and the horizontal line indicates the global scale, or magnitude, that the global calibrator assigns to every embedding. The scores used for this analysis are from the Speakers in the Wild evaluation. Since the magnitude estimation network improves discrimination, we expect two trends. For the low cosine scores of target trials, we expect that the product of the magnitudes should be bigger than the global scale. On the other hand, for the high-cosine-score non-target trials, we expect the opposite: the product of the magnitudes should be smaller than the global scale. The expected trends are actually present in these plots. If we look at the top plot, we see that there is an upward tilt, and the magnitudes for the low cosine scores tend to be above the constant magnitude that would be assigned by the global calibrator. On the other hand, we see that a large portion of the non-targets are below the global scale, and the ones that are getting very high cosine scores are also quite attenuated. This is consistent with the observation that the magnitude estimation network improves discrimination.
0:17:10 So to conclude: we have introduced a magnitude estimation network with a global offset. The idea is to assign a magnitude to each one of the unit-length x-vectors that are trained with an angular margin softmax. The resulting scaled x-vectors can be directly compared using inner products to produce calibrated scores, and we have also seen that this increases the discrimination between speakers. Although the domain shift still remains a challenge, there are significant improvements: the proposed system outperforms a very strong baseline on the four benchmarks we mentioned. We also validated the use of full-recording refinement to help with the duration mismatch introduced between the training and test phases.
0:18:05 If you found this work interesting, I suggest that you also take a look at the concurrent work that my colleagues are going to present at this conference, since it is related. If you have any questions you can reach me at my email, and I look forward to talking with you in the live sessions. Thanks for your time.