Speech Transcript - Combined Vector Based on Factorized Time-delay Neural Network for Text-Independent Speaker Recognition

0:00:14	where one is it can problems in wyoming recently
0:00:19	Our paper's title is Combined Vector Based on Factorized Time-Delay Neural
0:00:24	Network for Text-Independent Speaker Recognition.
0:00:28	It's too long, so [unintelligible].
0:00:33	Currently, the most effective text-independent speaker recognition method has turned out to be extracting speaker
0:00:40	embeddings.
0:00:41	Among them, the x-vector extracted from the factorized time-delay neural
0:00:46	network has been demonstrated to be among the best performers in recent NIST SRE evaluations.
0:00:55	A speech signal consists of content, speaker identity, emotion, channel and noise
0:01:02	information, and so on.
0:01:04	As we know, speech content is the main information.
0:01:10	Generally, different recognition tasks focus on different types of target information
0:01:17	and ignore the influence of other information.
0:01:21	However, the fact is that the different components share some common information and cannot be
0:01:28	completely separated.
0:01:31	Based on this idea, some multitask learning methods have been proposed.
0:01:37	In conventional multitask learning, hidden layers are shared between the different tasks'
0:01:43	networks.
0:01:46	In our previous works, we have proposed the combined vector, the c-vector, and proved that
0:01:53	the performance can be further improved by introducing phonetic information.
0:01:59	But one of the disadvantages of the c-vector is that it only
0:02:03	adopts a simple TDNN network.
0:02:07	So in this paper, we introduce factorized TDNN layers into the c-vector and
0:02:12	propose an extended network called the ft-vector.
0:02:20	Speaker embedding is the mainstream speaker recognition method at this stage.
0:02:26	The input is the frame-level acoustic features of the speech.
0:02:32	The features are passed through several layers of time-delay architecture.
0:02:38	Frame-level information is then aggregated in a statistics pooling layer.
0:02:44	The mean and the standard deviation are calculated and converted into
0:02:51	segment-level information.
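The statistics pooling step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `statistics_pooling` and the toy input are assumptions, and the population standard deviation is used for simplicity.

```python
import math


def statistics_pooling(frames):
    """Collapse a variable-length list of frame vectors into one vector.

    frames: list of equal-length lists of floats, one list per frame.
    Returns the per-dimension means followed by the per-dimension
    (population) standard deviations.
    """
    if not frames:
        raise ValueError("need at least one frame")
    num_frames = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / num_frames)
        for d in range(dim)
    ]
    return means + stds


# Two frames with two features each (hypothetical values).
frames = [[1.0, 2.0], [3.0, 2.0]]
pooled = statistics_pooling(frames)
```

The output length is always twice the feature dimension, regardless of how many frames the utterance has, which is what lets the later layers work at the segment level.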
0:02:55	After the whole training, the output of the layer after the
0:02:59	statistics pooling layer will be extracted as the speaker embedding,
0:03:06	and LDA and PLDA
0:03:08	are used to calculate the score.
0:03:12	The F-TDNN is a factorized form of the TDNN.
0:03:17	It is constructed by factorizing the weight matrix between TDNN layers.
0:03:24	This method uses fewer network parameters while maintaining [unintelligible], has
0:03:31	a better effect in
0:03:32	network training,
0:03:34	and obtains good results.
0:03:36	The weight matrix is factorized into the product of two factor matrices.
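As a rough illustration of why factorizing a weight matrix into two factors saves parameters, here is a small sketch. The layer sizes (512 and 128) and the helper names `matmul` and `param_counts` are assumptions chosen for the example, not values from the talk. The semi-orthogonal constraint that the published TDNN-F places on one of the factors is omitted here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    assert all(len(row) == inner for row in a)
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(len(b[0]))]
        for row in a
    ]


def param_counts(out_dim, in_dim, bottleneck):
    """Parameters in a full weight matrix versus its two-factor form."""
    full = out_dim * in_dim
    factored = out_dim * bottleneck + bottleneck * in_dim
    return full, factored


# Toy factors: A is 3x1 and B is 1x4, so W = A * B is 3x4 with rank 1.
A = [[1.0], [2.0], [3.0]]
B = [[1.0, 0.0, -1.0, 2.0]]
W = matmul(A, B)

# Hypothetical layer sizes: a 512x512 matrix with a 128-dim bottleneck.
full, factored = param_counts(out_dim=512, in_dim=512, bottleneck=128)
```

With these assumed sizes the factored form stores half as many weights as the full matrix, and the saving grows as the bottleneck gets narrower.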
0:03:44	The effectiveness of the factorized TDNN layers has been confirmed in
0:03:51	the NIST SRE18 evaluation.
0:03:56	Although the x-vector network performs speaker detection at the segment level,
0:04:03	some information is ignored in this process.
0:04:07	Unlike the x-vector network, the ASR network uses phonetic labels, and the ASR network obtains this
0:04:17	information at the frame level.
0:04:21	The phonetic adaptation method introduces phonetic information into the x-vector
0:04:29	network.
0:04:31	First, we pretrain an ASR model with a bottleneck layer, and feed the output of the bottleneck
0:04:38	layer as an auxiliary vector into the x-vector network.
0:04:46	Next,
0:04:48	the multitask learning method combines the x-vector network with the ASR network,
0:04:54	so that the two networks share a part of the frame-level layers, and
0:05:00	the training process alternates between the two parts of the combined network.
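The alternating training scheme can be sketched with a deliberately tiny model. Everything here is an assumption for illustration: one shared scalar weight `w` stands in for the shared frame-level layers, `a` and `b` stand in for the speaker and phonetic heads, and the squared-error loss and learning rate are arbitrary. Real networks would use frame-level phonetic targets and segment-level speaker targets.

```python
def alternating_training(speaker_batches, phonetic_batches, lr=0.01):
    """Toy alternating multitask training with one shared weight.

    Model: shared feature h = w * x; the speaker head predicts a * h and
    the phonetic head predicts b * h, each with a squared-error loss.
    Batches are lists of (x, target) pairs.
    """
    w, a, b = 1.0, 1.0, 1.0  # shared weight, speaker head, phonetic head
    history = []
    for spk, pho in zip(speaker_batches, phonetic_batches):
        # Speaker step: update the shared weight and the speaker head only.
        for x, y in spk:
            err = a * w * x - y
            grad_w, grad_a = 2 * err * a * x, 2 * err * w * x
            w, a = w - lr * grad_w, a - lr * grad_a
        history.append(("speaker", w, a, b))
        # Phonetic step: update the shared weight and the phonetic head only.
        for x, y in pho:
            err = b * w * x - y
            grad_w, grad_b = 2 * err * b * x, 2 * err * w * x
            w, b = w - lr * grad_w, b - lr * grad_b
        history.append(("phonetic", w, a, b))
    return history


history = alternating_training(
    speaker_batches=[[(1.0, 2.0)], [(1.0, 2.0)]],
    phonetic_batches=[[(1.0, 0.5)], [(1.0, 0.5)]],
)
```

Each entry in `history` records which task was just trained. The shared weight moves on every step, while each head only moves during its own task's step.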
0:05:06	The speaker embedding part of the combined network can learn more
0:05:11	about the common information of both speaker features and the phonetic content,
0:05:18	and the recognition [unintelligible]
0:05:20	for this method.
0:05:26	Phonetic adaptation and multitask learning correspond to two
0:05:31	aspects of the speech information.
0:05:34	The former tries to suppress the negative effect of the phonetic
0:05:42	content,
0:05:43	and the latter
0:05:45	tries to learn more detailed information from the phonetic content.
0:05:51	The c-vector network combines these two methods in an attempt to more
0:05:57	effectively learn the shared part of
0:06:00	phonetic information and speaker information.
0:06:05	Similar to the phonetic adaptation method,
0:06:09	the ASR network is pretrained first.
0:06:11	Features are extracted from its bottleneck layer and fed into the
0:06:18	speaker embedding part of the hybrid multitask learning network.
0:06:22	During hybrid multitask learning network training,
0:06:27	the pretrained ASR network is no longer updated.
0:06:32	After that, the two parts of the hybrid multitask learning network
0:06:37	are trained alternately.
0:06:41	After training, the embedding is extracted from the hidden layer behind the pooling layer
0:06:47	of the speaker embedding part.
0:06:55	Many experiments have shown that
0:06:59	deeper network architectures improve the performance.
0:07:04	So we adopt the extended factorized TDNN network architecture, which is
0:07:10	extended from the F-TDNN.
0:07:14	We used this architecture in the NIST SRE19 evaluation.
0:07:20	While greatly deepening the network architecture,
0:07:27	the number of network parameters is
0:07:30	controlled within a certain range, and the performance is
0:07:34	significantly improved.
0:07:38	Given the good performance of this network and the impact of phonetic information on
0:07:46	speaker recognition,
0:07:48	we introduce it to the c-vector
0:07:53	and call it the ft-vector.
0:07:58	The way we implement
0:08:00	the factorized layers is a little different from the F-TDNN method.
0:08:06	The connection method within the layers is [unintelligible]
0:08:13	[unintelligible]
0:08:15	The input of the factorized TDNN layer is [unintelligible]
0:08:20	and the output of the [unintelligible]
0:08:25	[unintelligible],
0:08:27	similar to the ResNet.
0:08:32	We use the F-TDNN to replace the part used to
0:08:37	extract the embedding in the c-vector network.
0:08:44	The replaced part accepts the phonetic adaptation vector at layer
0:08:50	twenty,
0:08:52	and we extract the embedding at layer twenty-two.
0:08:58	At the same time, we use a simplified F-TDNN network without
0:09:03	a pooling layer to replace the ASR part of the multitask learning in the
0:09:09	c-vector.
0:09:10	The number of first few layers that the two parts share
0:09:15	can vary.
0:09:21	As for the experiment,
0:09:24	it is performed according to the requirements of the fixed training condition of NIST
0:09:30	SRE18.
0:09:32	It should be noted that our training dataset doesn't include the VoxCeleb
0:09:39	and SITW datasets.
0:09:43	The Fisher dataset contains phonetic labels, so we use it to train the
0:09:50	additional ASR task, and the remaining datasets are used to train the neural
0:09:57	network and the back end.
0:10:00	The MUSAN and RIRs noise sets are used as
0:10:06	noise sources to enhance the training data,
0:10:10	and the amount of training data is doubled.
0:10:15	The test set is the development and evaluation datasets of the
0:10:20	NIST SRE18 CTS task.
0:10:23	The input features of the network are [unintelligible]-dimensional MFCCs.
0:10:30	Since the ASR network is trained using English datasets such as
0:10:36	Switchboard and Fisher,
0:10:39	it doesn't match well with the language of the SRE18 dataset.
0:10:48	The transcriptions are force-aligned by a GMM-HMM system and used to form
0:10:56	phonetic labels.
0:10:58	i extractor all the pre-trained asr
0:11:01	it's really
0:11:02	on this range
0:11:04	and it has the same as that all the same actors as though
0:11:13	The above experiments all use the same back-end processing.
0:11:19	After the embeddings are extracted,
0:11:22	a 200-dimensional LDA and a PLDA are trained [unintelligible].
0:11:29	Due to domain mismatch,
0:11:31	a common optimization method is to use the SRE18 unlabeled data to
0:11:37	realize PLDA adaptation.
0:11:40	But we use another method to get better results.
0:11:45	That is to apply unsupervised clustering to the
0:11:50	SRE18 unlabeled data and use the clustered data to train the PLDA.
0:11:57	Then the PLDA is used for scoring.
0:12:01	The results of the ft-vector are better than those of [unintelligible]
0:12:06	[unintelligible],
0:12:10	and the ft-vector that shares the first [unintelligible] layers gets the best
0:12:15	performance.
0:12:18	The overall effect of the ft-vector decreases as the number
0:12:23	of shared layers increases. We believe that this is partly due to
0:12:30	language mismatch.
0:12:32	The training data for the ASR part is spoken in English, while the test dataset
0:12:39	is spoken in Tunisian Arabic.
0:12:44	But from the results we can see that, even in this case,
0:12:49	the extracted phonetic information can still improve the effect of speaker recognition.
0:12:59	That's all, thank you.

Combined Vector Based on Factorized Time-delay Neural Network for Text-Independent Speaker Recognition

Speech Application

Tianyu Liang, Yi Liu, Can Xu, Xianwei Zhang, Liang He