Speech Transcript - Bayesian x-vector: Bayesian Neural Network based x-vector System for Speaker Verification

0:00:14	hello everyone this is xu li a phd student from the chinese university of hong kong
0:00:20	i will present our work bayesian x-vector bayesian neural network based x-vector system for
0:00:27	speaker verification
0:00:28	accepted to odyssey two thousand and twenty
0:00:33	this work proposes to incorporate bayesian neural networks into automatic speaker verification
0:00:38	systems to improve the system generalization ability
0:00:43	in this presentation i will firstly introduce asv systems and the challenges that
0:00:48	arise in developing asv systems
0:00:52	this is followed by some related work on bayesian learning and the use of
0:00:56	bayesian modeling in the machine learning community
0:01:00	then i will talk about our approach including the motivation and how to apply bayesian learning
0:01:06	to asv systems
0:01:07	next our experimental setup and results will be presented to verify the effectiveness of
0:01:14	our approach
0:01:15	this is followed by the final
0:01:17	conclusions
0:01:20	automatic speaker verification systems aim at confirming a spoken utterance against the speaker identity claim
0:01:28	we have seen an ever increasing use of asv systems in our daily lives
0:01:33	including voice authentication in electronic devices
0:01:37	e-banking authentication and so on
0:01:40	there are three most representative frameworks for developing asv systems
0:01:46	i-vector based asv systems were proposed to model speaker and channel variations
0:01:52	together
0:01:53	and then use a speaker discriminative back end for verification
0:01:58	benefiting from the powerful discriminative ability of neural networks
0:02:03	speaker embedding systems were proposed to extract speaker discriminative representations from utterances
0:02:11	these could achieve state-of-the-art performance
0:02:15	with the development of end-to-end techniques
0:02:19	many researchers also focus on constructing asv systems
0:02:25	in an end-to-end manner
0:02:29	one challenge for asv systems development is the mismatch between
0:02:33	the training and evaluation data
0:02:36	such as the speaker population mismatch and the variations in channel and environmental
0:02:41	background
0:02:43	the speaker populations used for training and evaluation commonly have no overlap
0:02:48	especially for practical applications
0:02:52	to work on this mismatched evaluation data
0:02:56	it requires the speaker representations to generalize well on unseen speaker data
0:03:03	the channel and environmental variations also commonly exist in
0:03:07	practical applications
0:03:09	where the training and evaluation data are collected from different types of recorders
0:03:15	and environments
0:03:16	these mismatches also place a high demand on the model's generalization ability
0:03:22	on unseen data
0:03:24	to address this issue
0:03:26	previous efforts have applied adversarial training to alleviate the channel and environmental variations
0:03:31	from a christian by any
0:03:34	these approaches have been shown to be effective in alleviating the effects
0:03:38	of channel and environmental mismatches
0:03:41	but they did not consider the speaker population mismatch that could also lead to
0:03:46	system performance degradation
0:03:50	in this work
0:03:51	we focus on the x-vector system and try to incorporate bayesian neural
0:03:56	networks
0:03:57	to improve
0:03:58	the system's generalization ability
0:04:01	across all these three sources of mismatch
0:04:06	bayesian learning has been shown to be effective in improving
0:04:10	the generalisation ability of discriminatively trained dnn systems
0:04:15	in the machine learning community
0:04:17	barber et al proposed
0:04:20	an efficient variational inference strategy for bayesian neural networks
0:04:26	blundell et al proposed a novel
0:04:29	backpropagation
0:04:31	compatible algorithm for learning the network parameters' posterior
0:04:36	distribution
0:04:37	in the speech area
0:04:40	reefs and a whole
0:04:41	proposed to apply bayesian neural networks to speech recognition
0:04:46	xie et al studied bayesian learning of hidden unit contributions for speaker adaptation
0:04:53	chien et al also applied bayesian learning to language modeling
0:05:00	before we introduce our approach its motivation will be firstly talked about
0:05:07	in the traditional dnn x-vector system
0:05:09	the system parameters are estimated by the maximum likelihood strategy
0:05:14	as a fixed point estimate as shown in figure one
0:05:20	it tends to
0:05:21	overfit when given limited training data or when there
0:05:26	is a large mismatch between the training and evaluation data
0:05:31	in the case of speaker population mismatch
0:05:34	the overfitted model parameters may result in speaker representations following a spiky
0:05:40	distribution
0:05:41	in favour of classifying the seen speaker identities
0:05:44	however this could not generalize well on
0:05:48	unseen speaker data
0:05:50	the cases of channel and environmental mismatch
0:05:54	are similar for instance for channel mismatch the overfitted model parameters may
0:06:00	partially rely on the channel information
0:06:02	to classify speakers since various recorders are used for different speakers in the training data
0:06:10	however when generalizing to channel mismatched evaluation data
0:06:14	the original relation between channels and speakers is broken and the parameters that rely on
0:06:20	channel information could lead to misclassification
0:06:27	previous work has also shown
0:06:29	that the extracted speaker representations from the x-vector system
0:06:35	still contain speaker unrelated information such as channel
0:06:40	transcription and utterance length
0:06:42	this information could affect the verification performance especially on mismatched
0:06:47	evaluation data
0:06:49	bayesian neural networks instead model the network parameters with a
0:06:55	posterior distribution as shown in figure two
0:06:58	this probabilistic parameter modeling could help when given limited data
0:07:02	to address the speaker population mismatch issue
0:07:06	it could help smooth the distribution of speaker representations for
0:07:12	better generalization on unseen speaker data
0:07:15	to alleviate the mismatches caused by channel and environmental variations the probabilistic parameter modeling
0:07:20	could reduce the risk of
0:07:23	overfitting on channel information by smoothing the parameters to consider other possible values
0:07:29	that do not rely on channel information for speaker classification
0:07:34	based on this motivation this work proposes to incorporate
0:07:38	bayesian neural networks into the x-vector system by replacing the tdnn layers to
0:07:45	improve the system's generalization ability
0:07:51	the x-vector system consists of two parts a front end for extracting
0:07:56	utterance level speaker embeddings and a verification scoring back end
0:08:01	the front end compresses speech utterances of different lengths into fixed dimensional
0:08:06	speaker-related embeddings
0:08:09	based on these embeddings different scoring schemes can be used for judging whether two utterances
0:08:15	belong to the same person or not
0:08:18	in this work we focus on the x-vector front end and choose probabilistic linear discriminant
0:08:24	analysis and cosine scoring as the two
0:08:28	back ends for performance evaluation
0:08:32	the x-vector extractor is a neural network
0:08:35	trained on a speaker discrimination task as shown in figure three it consists of frame
0:08:41	level and utterance level structures
0:08:44	at the frame level several layers
0:08:47	of time delay neural network are used to model the
0:08:53	characteristics
0:08:56	of acoustic features
0:08:58	then
0:09:00	the
0:09:00	statistics pooling layer aggregates all the frame level outputs from the last tdnn
0:09:06	layer and computes their mean and standard deviation
0:09:10	the computed statistics are propagated through several embedding layers and
0:09:17	finally the softmax output layer
0:09:20	the cross entropy is used as the loss function in the training stage
0:09:26	in the testing stage
0:09:27	given the acoustic features of an utterance the embedding layer output is extracted as the
0:09:34	x-vector
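The statistics pooling step described here can be sketched in a few lines. This is a minimal NumPy illustration rather than the authors' implementation; the function name `statistics_pooling`, the toy feature dimension of 8 and the random inputs are assumptions made for the example.

```python
import numpy as np

def statistics_pooling(frame_outputs: np.ndarray) -> np.ndarray:
    """Aggregate frame-level outputs of shape (T, D) into one vector of shape (2 * D,)."""
    mean = frame_outputs.mean(axis=0)   # per-dimension mean over all frames
    std = frame_outputs.std(axis=0)     # per-dimension standard deviation over all frames
    return np.concatenate([mean, std])

# Two toy "utterances" of different lengths (hypothetical frame-level outputs).
rng = np.random.default_rng(0)
short_utt = rng.normal(size=(200, 8))
long_utt = rng.normal(size=(900, 8))

pooled_short = statistics_pooling(short_utt)
pooled_long = statistics_pooling(long_utt)
```

Both pooled vectors have the same size regardless of the number of frames, which is what lets the later embedding layers work on utterances of different lengths.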
0:09:39	bayesian neural networks
0:09:40	learn the parameters' posterior distribution p of w given d to model weight uncertainty
0:09:47	and this theoretically enables an infinite
0:09:50	number of possible model parameters to fit the given training data
0:09:55	this weight uncertainty modeling has the potential to smooth the model
0:09:59	parameters and make the model generalize well on unseen data
0:10:03	during the testing stage
0:10:05	the model computes the output
0:10:07	y given the input x
0:10:09	by taking
0:10:10	the expectation
0:10:11	over the weights' posterior distribution
0:10:14	p of w
0:10:15	given d as shown in equation one
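Equation one is on the slides and is not part of the transcript. Under the usual Bayesian neural network notation (assumed here: x the input, y the output, w the weights, D the training data), the expectation described at this point would read roughly as follows.

```latex
P(y \mid x, D)
  = \mathbb{E}_{P(w \mid D)}\big[ P(y \mid x, w) \big]
  = \int P(y \mid x, w)\, P(w \mid D)\, dw
```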
0:10:20	however the direct estimation of p of w
0:10:23	given d is intractable for neural networks of any practical size
0:10:29	as the number of possible weight values could be infinite
0:10:33	so the variational approximation is commonly adopted to estimate the posterior distribution
0:10:40	the variational
0:10:42	approximation estimates a set of parameters theta for a distribution
0:10:48	q of w
0:10:49	to approximate the posterior distribution p of w given d
0:10:54	this is achieved by minimizing the kullback leibler
0:10:59	divergence between these two distributions as shown in equation two
0:11:05	from equation two to equation four
0:11:08	we apply bayes' rule and drop the constant term
0:11:14	log p of d
0:11:16	that does not affect the minimization with respect to theta
0:11:19	from equation four to six
0:11:23	we can see that the
0:11:26	expression could be decomposed into two parts
0:11:31	one is
0:11:32	the kl divergence between
0:11:35	the approximation distribution q of w
0:11:40	and the prior distribution p of w
0:11:45	the other one is
0:11:49	the expectation of the log likelihood of the training data over the approximation distribution q
0:11:55	of w
0:11:56	equation six is used as the loss function to be minimized in the training
0:12:01	process
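The slide equations are not in the transcript, so the following is a reconstruction of the standard variational derivation that the speaker appears to walk through, not a copy of the paper's equations. The symbols q_theta(w), P(w), P(D | w) and P(D) are assumed notation, and the equation numbering on the slides may differ.

```latex
\begin{aligned}
\theta^{*}
  &= \arg\min_{\theta}\; \mathrm{KL}\big( q_{\theta}(w) \,\|\, P(w \mid D) \big) \\
  &= \arg\min_{\theta} \int q_{\theta}(w)\, \log \frac{q_{\theta}(w)}{P(w \mid D)}\, dw \\
  &= \arg\min_{\theta} \int q_{\theta}(w)\, \log \frac{q_{\theta}(w)\, P(D)}{P(w)\, P(D \mid w)}\, dw
     \qquad \text{(Bayes' rule)} \\
  &= \arg\min_{\theta} \Big\{ \mathrm{KL}\big( q_{\theta}(w) \,\|\, P(w) \big)
     - \mathbb{E}_{q_{\theta}(w)}\big[ \log P(D \mid w) \big] \Big\}
     \qquad \text{(dropping the constant } \log P(D) \text{)}
\end{aligned}
```

The first term keeps the approximation close to the prior, and the second term is the expected log likelihood of the training data, which enters with a negative sign because the whole expression is minimized.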
0:12:07	it is commonly adopted to assume that both the variational approximation q of w and the
0:12:13	prior distribution p of w follow diagonal gaussian distributions with parameters
0:12:20	theta composed of mu q and sigma q
0:12:24	and theta p composed of mu p and sigma p respectively
0:12:30	the two parts of the loss function can be formulated as
0:12:34	equations seven and eight respectively where in equation eight it is useful to apply
0:12:40	monte carlo sampling
0:12:42	to approximate the integration
0:12:44	process
0:12:45	finally combining
0:12:47	equations seven and eight we have the final loss function as equation nine
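A minimal sketch of how the two loss terms could be computed under the diagonal Gaussian assumption follows. It is not the authors' code: the closed-form KL between diagonal Gaussians and the reparameterized Monte Carlo sampling are the standard textbook forms, and the one-layer softmax classifier, the function names, the toy data and the choice of four samples are all assumptions made for illustration.

```python
import numpy as np

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(q || p) for diagonal Gaussians, summed over all weights."""
    return np.sum(
        np.log(sigma_p / sigma_q)
        + (sigma_q**2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p**2)
        - 0.5
    )

def log_softmax(logits):
    # Subtract the row maximum first for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def mc_negative_log_likelihood(x, labels, mu_q, sigma_q, num_samples, rng):
    """Monte Carlo estimate of -E_q[log P(D | w)] for a one-layer softmax classifier."""
    total = 0.0
    for _ in range(num_samples):
        eps = rng.standard_normal(mu_q.shape)
        w = mu_q + sigma_q * eps                 # sample weights from q(w)
        log_probs = log_softmax(x @ w)
        total += -log_probs[np.arange(len(labels)), labels].sum()
    return total / num_samples

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 5))            # 16 toy feature vectors (hypothetical)
labels = rng.integers(0, 3, size=16)    # 3 toy speaker classes (hypothetical)
mu_q, sigma_q = np.zeros((5, 3)), np.full((5, 3), 0.1)
mu_p, sigma_p = np.zeros((5, 3)), np.ones((5, 3))

kl_term = kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p)
nll_term = mc_negative_log_likelihood(x, labels, mu_q, sigma_q, num_samples=4, rng=rng)
loss = kl_term + nll_term               # KL part plus expected negative log likelihood
```

In practice the two terms are often weighted against each other, for example by the number of mini-batches; whether and how the paper does this is not stated in the transcript.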
0:12:53	this news
0:12:54	we see you be the directly use the are watching imprecise
0:12:59	in order to evaluate the effectiveness of bayesian learning for speaker verification in both
0:13:04	short and long utterance conditions
0:13:06	we performed experiments on two datasets
0:13:09	for the short utterance condition we consider the voxceleb one dataset
0:13:15	with in total
0:13:16	one hundred and forty
0:13:17	eight thousand
0:13:19	six hundred and forty two utterances from one thousand two hundred and fifty one
0:13:25	celebrities
0:13:28	we adopt
0:13:31	four thousand eight hundred and seventy four utterances from forty speakers for evaluation
0:13:37	and the remaining utterances are used for training
0:13:40	the system parameters
0:13:42	for the long utterance condition the core test in the nist speaker
0:13:46	recognition evaluation ten is used for benchmarking our models
0:13:52	for training we adopt the previous
0:13:54	sre corpora since two thousand and four
0:13:56	in total we have around
0:13:59	sixty miles thousands recordings from six thousand
0:14:03	and of the hundred speakers in this dataset
0:14:06	we evaluate the system's generalization ability
0:14:10	based on two kinds of
0:14:12	evaluations with different mismatch degrees
0:14:15	we performed in-domain and out-of-domain evaluations
0:14:18	when the training and testing stages are executed on the same
0:14:21	dataset it is an in-domain evaluation
0:14:23	while executed on different datasets it is an out-of-domain evaluation
0:14:30	so if you dimensional mel-frequency cepstral coefficients are adopted as acoustic features in our experiments
0:14:37	the extracted mfccs are mean normalized and voice activity detection filters out non-speech frames
0:14:43	the x-vector system configuration is shown in table one
0:14:47	linear discriminant analysis is applied to reduce the x-vector's dimension
0:14:53	to make a fair comparison the bayesian x-vector system is configured with the
0:14:58	same architecture as the baseline system
0:15:00	except
0:15:01	that the first tdnn layer is replaced by a bayesian layer with the same number
0:15:06	of units
0:15:08	stochastic gradient descent is adopted as the optimizer in
0:15:12	training the evaluation metrics adopted are the commonly used equal error rate and minimum
0:15:18	detection
0:15:19	cost function
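For readers unfamiliar with the equal error rate, here is a small sketch of how it can be computed from trial scores. It is a simple threshold sweep in NumPy, not the scoring tool used in the paper; the function name `equal_error_rate` and the toy scores are assumptions.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Return the error rate at the threshold where false accepts and false rejects meet."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    # False rejection rate: target trials scored below the threshold.
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    # False acceptance rate: non-target trials scored at or above the threshold.
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

# Toy scores (hypothetical): one target trial scores lower than one non-target trial.
target = [0.9, 0.8, 0.7, 0.35]
nontarget = [0.4, 0.3, 0.2, 0.1]
eer = equal_error_rate(target, nontarget)   # 0.25 for these toy scores
```

The minimum detection cost function is computed from the same two error curves, but weights them with application-specific costs and a target prior instead of looking for the point where they are equal.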
0:15:21	here are the in-domain evaluation results we observe that
0:15:26	the equal error rates
0:15:27	consistently decrease after incorporating bayesian learning on both datasets
0:15:34	on each dataset we consider the average relative equal error rate decrease
0:15:39	across the cosine and plda back ends
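The averaged metric mentioned here can be made concrete with a short sketch. The equal error rates below are made-up placeholder values chosen only to show the arithmetic, not results from the paper, and the helper name `relative_decrease` is an assumption.

```python
def relative_decrease(baseline_eer: float, new_eer: float) -> float:
    """Relative EER decrease in percent with respect to the baseline."""
    return (baseline_eer - new_eer) / baseline_eer * 100.0

# Hypothetical EERs in percent for each back end (placeholders, not the paper's numbers).
baseline = {"cosine": 5.00, "plda": 4.00}
bayesian = {"cosine": 4.80, "plda": 3.90}

per_backend = {name: relative_decrease(baseline[name], bayesian[name]) for name in baseline}
average = sum(per_backend.values()) / len(per_backend)
```

With these placeholder values the relative decreases are about 4 percent for cosine and 2.5 percent for plda, so the average relative decrease is about 3.25 percent.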
0:15:45	on the
0:15:46	voxceleb one dataset the average relative equal error rate decrease from the bayesian
0:15:50	x-vector system is two point six six percent
0:15:55	and the fusion system could achieve a further average relative equal error rate decrease of
0:15:59	so on to four presents
0:16:04	on the nist sre ten dataset the average relative equal error rate decrease is
0:16:09	two point
0:16:10	three two percent for the
0:16:12	bayesian x-vector system and three point
0:16:16	eight a stands for the fusion system
0:16:19	we also observe a consistent improvement in detection cost function performance after applying bayesian learning
0:16:27	and the fusion strategy
0:16:29	these observations verify the improved generalization ability brought by bayesian neural
0:16:35	networks
0:16:37	figure four illustrates
0:16:39	the detection error tradeoff curves
0:16:41	of all systems with the cosine back end when benchmarked on the voxceleb one dataset
0:16:47	it shows the proposed bayesian system outperforms the baseline for all operating
0:16:52	points
0:16:53	and the fusion system could achieve further improvements due to the
0:16:57	complementary advantages of the baseline and bayesian systems
0:17:02	here are the out-of-domain evaluation results
0:17:05	the model trained on voxceleb one was evaluated on nist sre ten
0:17:10	and vice versa
0:17:12	the system performance drops significantly due to the larger mismatch between the training
0:17:17	and evaluation data
0:17:22	from the table we observe that
0:17:25	systems could benefit more from the generalization power of bayesian learning
0:17:29	we also consider the average relative equal error rate decrease across cosine and plda
0:17:35	scoring back ends for performance evaluation
0:17:39	in the experiments evaluated on the nist sre ten dataset the average relative equal error rate
0:17:45	decrease is
0:17:47	four point six nine percent and six point
0:17:51	well three percent over the baseline system and the fusion system respectively
0:17:56	for the experiments evaluated on the voxceleb one dataset the average relative
0:18:02	equal error rate decrease is three point o seven percent for the bayesian x-vector system
0:18:07	and the fusion system achieves a further
0:18:11	average relative equal error rate decrease of six point
0:18:15	four
0:18:15	one percent
0:18:18	the larger relative equal error rate decreases compared to those in the in-domain evaluations
0:18:24	suggest that bayesian learning could be
0:18:26	more beneficial when a larger mismatch exists
0:18:29	between the training and evaluation data the last column in the table shows the corresponding
0:18:36	detection cost function performance
0:18:38	and we also can see consistent improvement by applying bayesian learning and the fusion
0:18:44	system
0:18:46	similar to the observation in figure four
0:18:49	the detection error tradeoff curves in figure five also show consistent improvements by applying bayesian learning
0:18:56	and the fusion system
0:18:58	for all operating points
0:19:02	in this work
0:19:03	we incorporate bayesian neural networks into the
0:19:08	x-vector system to improve the
0:19:10	model's generalisation ability
0:19:12	our experimental results verify that bayesian learning enables a consistent
0:19:17	generalization ability improvement over the x-vector system in both
0:19:22	short and long utterance conditions
0:19:25	and the fusion system achieves further improvements moreover a
0:19:29	larger improvement is observed in the out-of-domain evaluation results
0:19:35	which suggests bayesian learning is
0:19:36	more beneficial when a larger mismatch exists between the training and evaluation data
0:19:42	possible future research will focus on
0:19:45	incorporating bayesian learning into end-to-end speaker verification systems
0:19:49	thanks for listening

Bayesian x-vector: Bayesian Neural Network based x-vector System for Speaker Verification

Speaker Recognition 2

Xu Li, Jinghua Zhong, Jianwei Yu, Shoukang Hu, Xixin Wu, Xunying Liu, Helen Meng