| 0:00:13 | Hello. I am glad to be presenting at this workshop. |
|---|
| 0:00:19 | I will present our paper on selectively compensating speaker embeddings of distant utterances. |
|---|
| 0:00:27 | These are the contents of this presentation. We will start with an introduction and the motivation. |
|---|
| 0:00:32 | Next, we will go over the VOiCES dataset. |
|---|
| 0:00:36 | Then we will introduce the baseline system, |
|---|
| 0:00:38 | which uses RawNet, and the proposed models. |
|---|
| 0:00:44 | Experiments and the corresponding results will then be presented, followed by our conclusion. |
|---|
| 0:00:49 | Let's move on to the introduction. |
|---|
| 0:00:53 | Recently, |
|---|
| 0:00:54 | deep neural networks have achieved state-of-the-art performance in speaker verification. |
|---|
| 0:01:01 | However, distant utterances are well known to degrade performance because they contain environmental factors |
|---|
| 0:01:08 | such as reverberation and noise. |
|---|
| 0:01:11 | To address such cases, the VOiCES (Voices Obscured in Complex Environmental Settings) |
|---|
| 0:01:16 | from a Distance Challenge was held, |
|---|
| 0:01:19 | along with its corresponding dataset. |
|---|
| 0:01:24 | Previously, |
|---|
| 0:01:25 | several studies have performed compensation for the performance degradation caused by distant environments. |
|---|
| 0:01:33 | However, two problems remain in existing compensation methods. |
|---|
| 0:01:39 | First, |
|---|
| 0:01:39 | they address the degradation of only one kind of utterance. |
|---|
| 0:01:44 | Applying the compensation brought good improvement in the recognition of distant utterances; |
|---|
| 0:01:51 | however, when the distance compensation technique was applied to close-talk utterances, the performance |
|---|
| 0:01:58 | deteriorated. |
|---|
| 0:02:01 | Therefore, it is not easy to use such a compensation system when the recordings come |
|---|
| 0:02:05 | from various distances. |
|---|
| 0:02:08 | Second, |
|---|
| 0:02:08 | there is a dependency on the speaker verification system. |
|---|
| 0:02:12 | When a new speaker embedding structure is proposed, |
|---|
| 0:02:15 | the corresponding compensation system must be redesigned and retrained as well. |
|---|
| 0:02:23 | To address these |
|---|
| 0:02:24 | previous problems, |
|---|
| 0:02:26 | we wanted to build a system with the following properties. |
|---|
| 0:02:31 | First, |
|---|
| 0:02:32 | it should be independent of the front-end speaker embedding extractor. |
|---|
| 0:02:36 | Second, |
|---|
| 0:02:37 | the proposed system should perform selective enhancement |
|---|
| 0:02:41 | while considering the distance between the speech source and the microphone. |
|---|
| 0:02:45 | Third, |
|---|
| 0:02:46 | both close-talk and distant utterances can be input |
|---|
| 0:02:50 | into the proposed system. |
|---|
| 0:02:53 | Finally, |
|---|
| 0:02:53 | the proposed system should comprise a relatively simple architecture |
|---|
| 0:02:58 | so that the additional cost on top of the overall verification pipeline remains minimal. |
|---|
| 0:03:05 | We propose two distant utterance compensation systems. |
|---|
| 0:03:10 | The first proposed system decides the condition of the utterance and, according to the result, |
|---|
| 0:03:15 | selectively applies compensation. |
|---|
| 0:03:18 | We designed the system to determine the level of noise and reverberation in the input |
|---|
| 0:03:23 | and apply compensation accordingly. |
|---|
| 0:03:26 | Our second proposed system is based on the auto-encoder framework, |
|---|
| 0:03:31 | dividing the hidden representation |
|---|
| 0:03:34 | into two subspaces so that the system can separately extract the speaker information |
|---|
| 0:03:40 | included in the embedding and the recording-quality information. |
|---|
| 0:03:44 | One subspace is targeted to contain clean speaker information only, by applying an objective |
|---|
| 0:03:50 | function to this hidden layer, |
|---|
| 0:03:52 | and the other subspace is targeted to |
|---|
| 0:03:54 | contain the remaining information, such as reverberation and noise. |
|---|
| 0:04:01 | Next, the dataset used in this study will be described. |
|---|
| 0:04:06 | The VOiCES dataset was collected by playing the LibriSpeech dataset |
|---|
| 0:04:10 | through a loudspeaker |
|---|
| 0:04:12 | and re-recording it with distant microphones under various distances and acoustic conditions. |
|---|
| 0:04:18 | The acoustic conditions vary according to the room, |
|---|
| 0:04:21 | the recording microphone and its placement, |
|---|
| 0:04:23 | the loudspeaker angle, and the distractor noises. |
|---|
| 0:04:25 | In the VOiCES dataset, |
|---|
| 0:04:27 | there are three hundred speakers. |
|---|
| 0:04:30 | The development set comprises utterances from a total of |
|---|
| 0:04:34 | two hundred speakers, and the evaluation set comprises utterances from the other |
|---|
| 0:04:40 | one hundred speakers. |
|---|
| 0:04:44 | Next, we introduce RawNet, which is used as our baseline. |
|---|
| 0:04:48 | RawNet is used as the front-end speaker embedding extractor; |
|---|
| 0:04:52 | it takes raw waveforms as input directly. |
|---|
| 0:04:56 | Conventionally, acoustic features are used to extract speaker embeddings: |
|---|
| 0:04:59 | mel-frequency cepstral coefficients |
|---|
| 0:05:02 | and log mel spectrograms are the most widely used. |
|---|
| 0:05:05 | These acoustic features incorporate human knowledge to emphasize discriminative information. |
|---|
| 0:05:12 | Convolutional neural networks, which are frequently used as embedding extractors, |
|---|
| 0:05:18 | have a receptive field that gradually increases with depth. |
|---|
| 0:05:21 | Thus, when a spectrogram is input to a CNN, the network can |
|---|
| 0:05:26 | consider only limited time and frequency regions |
|---|
| 0:05:30 | in the layers |
|---|
| 0:05:31 | that are close to the input layer. |
|---|
| 0:05:35 | Although |
|---|
| 0:05:36 | these conventional acoustic features are widely used, |
|---|
| 0:05:39 | recent studies also explore modeling raw waveforms directly with deep neural networks. |
|---|
| 0:05:45 | It is expected that end-to-end learning can better extract discriminative information in the hidden layers. |
|---|
| 0:05:53 | When raw waveforms are processed by CNNs, |
|---|
| 0:05:56 | the frequency response |
|---|
| 0:05:58 | can also be learned in a data-driven manner. |
|---|
| 0:06:02 | In addition, the learned features adapt to the given data and task. |
|---|
| 0:06:09 | RawNet adopts an architecture in which residual-block-based CNNs |
|---|
| 0:06:14 | extract frame-level representations, |
|---|
| 0:06:17 | as illustrated here. |
|---|
| 0:06:19 | The residual blocks are similar to those of the original ResNet, |
|---|
| 0:06:23 | with minor modifications. |
|---|
| 0:06:27 | These frame-level representations are then fed to a uni-directional gated recurrent unit layer |
|---|
| 0:06:33 | to aggregate them into a single utterance-level representation. |
|---|
| 0:06:37 | A fully connected layer with one thousand twenty-four nodes |
|---|
| 0:06:41 | then conducts an affine transformation, and the output of this layer is used as the speaker embedding. |
|---|
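To make the baseline concrete, below is a minimal PyTorch-style sketch of a RawNet-like extractor as just described: residual 1-D convolution blocks over the raw waveform, a uni-directional GRU that aggregates frame-level features into an utterance-level vector, and a 1,024-node fully connected layer whose output serves as the speaker embedding. The channel counts, block count, and pooling choices are illustrative assumptions, not the exact configuration used in the talk.

```python
import torch
import torch.nn as nn

class ResBlock1d(nn.Module):
    """1-D residual block operating on raw-waveform feature maps."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU()
        self.pool = nn.MaxPool1d(3)  # shrink the time axis between blocks

    def forward(self, x):
        out = self.conv2(self.act(self.conv1(x)))
        return self.pool(self.act(out + x))

class RawNetLikeExtractor(nn.Module):
    """Sketch of a RawNet-style speaker embedding extractor (sizes are assumptions)."""
    def __init__(self, channels=128, emb_dim=1024, num_blocks=4):
        super().__init__()
        self.front = nn.Conv1d(1, channels, kernel_size=3, padding=1)
        self.blocks = nn.Sequential(*[ResBlock1d(channels) for _ in range(num_blocks)])
        self.gru = nn.GRU(channels, 1024, batch_first=True)  # uni-directional GRU
        self.fc = nn.Linear(1024, emb_dim)                    # affine layer -> speaker embedding

    def forward(self, wav):                  # wav: (batch, samples)
        x = self.front(wav.unsqueeze(1))     # (batch, channels, time)
        x = self.blocks(x)
        x = x.transpose(1, 2)                # (batch, time, channels) for the GRU
        _, h = self.gru(x)                   # last hidden state = utterance-level vector
        return self.fc(h.squeeze(0))         # (batch, emb_dim)

emb = RawNetLikeExtractor()(torch.randn(2, 16000))  # two one-second 16 kHz waveforms
print(emb.shape)  # torch.Size([2, 1024])
```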
| 0:06:49 | In this section, we introduce our two proposed speaker embedding enhancement systems. |
|---|
| 0:06:57 | The first proposed system conducts clean-condition-based selective enhancement; we refer to it as SC. |
|---|
| 0:07:03 | The figure on the right shows the framework of the proposed SC. |
|---|
| 0:07:08 | This system comprises a DNN that compensates the input speaker embedding according to its clean condition, |
|---|
| 0:07:14 | and another DNN that estimates that condition. |
|---|
| 0:07:18 | The compensation network is analogous to an auto-encoder, |
|---|
| 0:07:20 | and the detection network decides the clean condition of the utterance, similar to a binary classifier. |
|---|
| 0:07:29 | During the training phase, |
|---|
| 0:07:31 | the compensation DNN is trained to minimize a mean squared error objective function |
|---|
| 0:07:35 | between its output and the target clean speaker embedding. |
|---|
| 0:07:39 | When a source utterance is input, |
|---|
| 0:07:41 | the compensation network learns to reconstruct the input embedding itself. |
|---|
| 0:07:46 | On the other hand, when a distant utterance is input, |
|---|
| 0:07:49 | the target is the embedding |
|---|
| 0:07:52 | of the source utterance |
|---|
| 0:07:54 | that was used to make the distant utterance. |
|---|
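A minimal sketch of the reconstruction objective just described, assuming the compensation network is a simple embedding-to-embedding mapper: for a source (close-talk) input the target is the input embedding itself, while for a distant input the target is the embedding of the paired source utterance. The network shape and tensor names are illustrative assumptions.

```python
import torch
import torch.nn as nn

emb_dim = 1024
comp_net = nn.Sequential(                     # compensation network (illustrative MLP)
    nn.Linear(emb_dim, emb_dim), nn.LeakyReLU(), nn.Linear(emb_dim, emb_dim))

def reconstruction_loss(input_emb, source_emb, is_distant):
    """MSE with the target chosen by the input condition.

    input_emb : embedding fed to the network (source or distant utterance)
    source_emb: embedding of the paired close-talk (source) utterance
    is_distant: bool tensor, True when the input is a distant re-recording
    """
    target = torch.where(is_distant.unsqueeze(1), source_emb, input_emb)
    return nn.functional.mse_loss(comp_net(input_emb), target)

x = torch.randn(8, emb_dim)                   # input embeddings
src = torch.randn(8, emb_dim)                 # matching source embeddings
flag = torch.tensor([True, False] * 4)
print(reconstruction_loss(x, src, flag).item())
```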
| 0:07:58 | The detection DNN is trained to minimize a binary cross-entropy objective |
|---|
| 0:08:02 | function. |
|---|
| 0:08:04 | When a source utterance is input, the binary label is one, representing the clean |
|---|
| 0:08:09 | condition, |
|---|
| 0:08:11 | and when a distant utterance is input, the binary label is zero, representing the |
|---|
| 0:08:16 | degraded condition. |
|---|
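A small sketch of the detection objective just described: a network maps an embedding to a single logit, and the binary label is one for a source utterance and zero for a distant one, trained with binary cross-entropy. The layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

emb_dim = 1024
det_net = nn.Sequential(nn.Linear(emb_dim, 256), nn.LeakyReLU(), nn.Linear(256, 1))

emb = torch.randn(8, emb_dim)                            # speaker embeddings
label = torch.tensor([1., 0., 1., 0., 0., 0., 1., 0.])   # 1 = source, 0 = distant

# BCEWithLogitsLoss applies the sigmoid internally, matching the sigmoid output
# used as the clean condition at test time.
loss = nn.BCEWithLogitsLoss()(det_net(emb).squeeze(1), label)
print(loss.item())
```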
| 0:08:18 | In the figure below, |
|---|
| 0:08:20 | the top panel presents the training phase of our proposed |
|---|
| 0:08:24 | SC system. |
|---|
| 0:08:27 | According to a previous study, |
|---|
| 0:08:29 | when compensation is conducted on speaker embeddings, |
|---|
| 0:08:33 | the compensation alone may not be beneficial for the evaluation pairs. |
|---|
| 0:08:38 | This phenomenon is analyzed as the compensation losing the discriminative power |
|---|
| 0:08:43 | of the speaker embedding by changing its values |
|---|
| 0:08:46 | in the high-dimensional embedding space. |
|---|
| 0:08:50 | Based on this knowledge, we add a speaker identification component to the proposed system, |
|---|
| 0:08:54 | for which a categorical cross-entropy loss function is |
|---|
| 0:08:59 | used. |
|---|
| 0:09:01 | So the final loss function used to train the SC system |
|---|
| 0:09:05 | is |
|---|
| 0:09:06 | described here. |
|---|
| 0:09:09 | The first term is the reconstruction error, |
|---|
| 0:09:12 | the second term measures the distance detection error, |
|---|
| 0:09:15 | and the last term measures the speaker identification error |
|---|
| 0:09:19 | of the compensated speaker embedding. |
|---|
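Putting the three terms together, a hedged sketch of the overall SC training loss as described: reconstruction error, clean-condition detection error, and speaker identification error on the compensated embedding. Equal weighting of the terms and the shapes of the placeholder networks are assumptions; the talk does not give the exact coefficients.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

emb_dim, n_speakers = 1024, 200
comp_net = nn.Linear(emb_dim, emb_dim)      # compensation network (placeholder)
det_net = nn.Linear(emb_dim, 1)             # clean-condition detection network (placeholder)
spk_head = nn.Linear(emb_dim, n_speakers)   # speaker identification head (placeholder)

def sc_loss(input_emb, target_emb, clean_label, spk_label):
    comp = comp_net(input_emb)
    recon = F.mse_loss(comp, target_emb)                                        # reconstruction error
    detect = F.binary_cross_entropy_with_logits(det_net(input_emb).squeeze(1),  # detection error
                                                clean_label)
    ident = F.cross_entropy(spk_head(comp), spk_label)                          # speaker ID error
    return recon + detect + ident        # equal weights assumed for illustration

loss = sc_loss(torch.randn(8, emb_dim), torch.randn(8, emb_dim),
               torch.randint(0, 2, (8,)).float(), torch.randint(0, n_speakers, (8,)))
print(loss.item())
```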
| 0:09:23 | In the test phase, the speaker embedding is input to both the compensation network |
|---|
| 0:09:27 | and the detection network. |
|---|
| 0:09:30 | The clean condition that connects the input and output of the compensation network is not a binary |
|---|
| 0:09:35 | label; rather, it is the output of the detection network |
|---|
| 0:09:37 | with a sigmoid activation function. |
|---|
| 0:09:41 | This value lies between zero and one and represents how close the input is to the source (clean) condition. |
|---|
| 0:09:48 | The compensated speaker embedding is finally obtained by adding the output of the |
|---|
| 0:09:52 | compensation network |
|---|
| 0:09:54 | according to its estimated clean condition. |
|---|
| 0:09:56 | In the figure below, the bottom panel represents the test process of our proposed |
|---|
| 0:10:02 | SC system. |
|---|
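A sketch of the test-time combination just described: the detection network's sigmoid output, a value between zero and one, acts as the clean condition, and the compensated embedding is formed from the input embedding and the compensation network's output. The exact mixing rule is not spelled out in the talk, so the residual, condition-weighted combination below is an assumption.

```python
import torch
import torch.nn as nn

emb_dim = 1024
comp_net = nn.Linear(emb_dim, emb_dim)   # trained compensation network (placeholder)
det_net = nn.Linear(emb_dim, 1)          # trained detection network (placeholder)

def compensate(emb):
    """Assumed gating rule: keep the input embedding in proportion to the clean
    condition and add the compensation output for the remaining proportion."""
    cond = torch.sigmoid(det_net(emb))            # clean condition in (0, 1)
    return cond * emb + (1.0 - cond) * comp_net(emb)

print(compensate(torch.randn(2, emb_dim)).shape)  # torch.Size([2, 1024])
```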
| 0:10:05 | The second proposed system utilizes prior knowledge |
|---|
| 0:10:20 | and is designed as a |
|---|
| 0:10:23 | discriminative auto-encoder. |
|---|
| 0:10:30 | It is composed of an encoder, a decoder, and two intermediate hidden layers, |
|---|
| 0:10:37 | as illustrated in the figure here. |
|---|
| 0:10:41 | The architecture design follows a denoising auto-encoder structure. |
|---|
| 0:10:46 | Inspired by PCA, we set the sizes of the two intermediate hidden layers |
|---|
| 0:10:51 | to collect the reverberation and noise in one layer |
|---|
| 0:10:54 | and to contain |
|---|
| 0:10:55 | the clean speaker information in the other layer. |
|---|
| 0:11:01 | The intermediate hidden layer in which the clean speaker information is ideally |
|---|
| 0:11:06 | isolated |
|---|
| 0:11:07 | is used as the enhanced embedding. |
|---|
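A minimal sketch of the auto-encoder just described: an encoder, two intermediate hidden layers (one targeted to hold clean speaker information, the other reverberation and noise), and a decoder that reconstructs the embedding from the concatenation of the two. Layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SplitBottleneckAE(nn.Module):
    """Auto-encoder whose bottleneck is split into a speaker subspace and a
    reverberation/noise subspace (layer sizes are assumptions)."""
    def __init__(self, emb_dim=1024, spk_dim=512, env_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(emb_dim, 1024), nn.LeakyReLU())
        self.to_spk = nn.Linear(1024, spk_dim)   # targeted to hold clean speaker information
        self.to_env = nn.Linear(1024, env_dim)   # targeted to hold reverberation / noise
        self.decoder = nn.Linear(spk_dim + env_dim, emb_dim)

    def forward(self, emb):
        h = self.encoder(emb)
        spk, env = self.to_spk(h), self.to_env(h)
        recon = self.decoder(torch.cat([spk, env], dim=1))
        return recon, spk, env   # spk is used as the enhanced embedding

recon, spk, env = SplitBottleneckAE()(torch.randn(4, 1024))
print(recon.shape, spk.shape, env.shape)
```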
| 0:11:09 | In the training setup, |
|---|
| 0:11:11 | additional loss functions are adopted to minimize the intra-class variance and maximize the |
|---|
| 0:11:16 | inter-class variance: |
|---|
| 0:11:18 | we utilize the center loss and an inter-class distance margin loss. |
|---|
| 0:11:23 | The center loss is reported to minimize the intra-class variance while keeping the embeddings |
|---|
| 0:11:28 | discriminative. |
|---|
| 0:11:31 | An inter-class loss function is additionally used to maximize the inter-class |
|---|
| 0:11:36 | variance. |
|---|
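A compact sketch of the two auxiliary objectives mentioned: a center loss that pulls each speaker's embeddings toward its class center, reducing intra-class variance, and an inter-class term that pushes the centers apart. The exact inter-class formulation is not given in the talk, so the pairwise hinge on center distances below is an assumption.

```python
import torch
import torch.nn as nn

class CenterInterLoss(nn.Module):
    """Center loss plus a simple pairwise inter-class separation term (assumed form)."""
    def __init__(self, n_speakers, dim, margin=10.0):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_speakers, dim))
        self.margin = margin

    def forward(self, emb, labels):
        # Intra-class: squared distance from each embedding to its speaker's center.
        center_loss = ((emb - self.centers[labels]) ** 2).sum(dim=1).mean()
        # Inter-class: hinge penalty on pairwise distances between distinct centers.
        dist = torch.cdist(self.centers, self.centers)
        off_diag = dist[~torch.eye(len(self.centers), dtype=torch.bool)]
        inter_loss = torch.clamp(self.margin - off_diag, min=0).mean()
        return center_loss, inter_loss

c_loss, i_loss = CenterInterLoss(n_speakers=200, dim=512)(
    torch.randn(8, 512), torch.randint(0, 200, (8,)))
print(c_loss.item(), i_loss.item())
```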
| 0:11:40 | In the same manner as the previous SC system, a weighted loss function is used to train |
|---|
| 0:11:46 | the encoder and decoder. |
|---|
| 0:11:48 | To address the imbalance between the number of source utterances |
|---|
| 0:11:52 | and distant utterances in the training set, |
|---|
| 0:11:54 | a different sample weight |
|---|
| 0:11:58 | is given according to the input. |
|---|
| 0:12:01 | The categorical cross-entropy loss function is also used, as in the SC system, for |
|---|
| 0:12:05 | speaker identification. |
|---|
| 0:12:08 | The final loss function of the proposed system |
|---|
| 0:12:12 | is described below. |
|---|
| 0:12:14 | Here, gamma is a hyperparameter that scales the weighting of the reconstruction |
|---|
| 0:12:19 | term, |
|---|
| 0:12:20 | and the alpha terms are hyperparameters that combine the overall loss function with the center loss and the inter-class |
|---|
| 0:12:27 | loss. |
|---|
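A hedged sketch of combining the loss terms just listed: a sample-weighted reconstruction term, the speaker identification cross-entropy, the center loss, and the inter-class loss, mixed with scalar hyperparameters. The gamma and alpha symbols follow the talk, but the way they are attached to particular terms here, and the example values, are assumptions.

```python
import torch

def dae_loss(recon_err, id_err, center_err, inter_err, sample_w,
             gamma=1.0, alpha_center=0.01, alpha_inter=0.01):
    """Combine the loss terms of the proposed auto-encoder (weights are assumptions).

    recon_err is a per-sample reconstruction error; sample_w counters the imbalance
    between source and distant utterances in the training set."""
    weighted_recon = (sample_w * recon_err).mean()
    return gamma * weighted_recon + id_err + alpha_center * center_err + alpha_inter * inter_err

loss = dae_loss(recon_err=torch.rand(8),
                id_err=torch.tensor(2.3),
                center_err=torch.tensor(5.0),
                inter_err=torch.tensor(1.0),
                sample_w=torch.tensor([2., 1., 2., 1., 1., 1., 2., 1.]))
print(loss.item())
```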
| 0:12:29 | Now let's move on to the experiments and results. |
|---|
| 0:12:34 | The training set comprises the VOiCES development set |
|---|
| 0:12:38 | and the VoxCeleb 1 and 2 datasets. |
|---|
| 0:12:42 | For the baseline RawNet system, |
|---|
| 0:12:43 | input waveforms are cropped |
|---|
| 0:12:46 | to a fixed number of samples, |
|---|
| 0:12:48 | which corresponds to a few seconds of speech, |
|---|
| 0:12:51 | for mini-batch construction. |
|---|
| 0:12:54 | To do so, |
|---|
| 0:12:55 | we duplicate short utterances and crop long ones. |
|---|
| 0:12:59 | All the details are presented in the paper. |
|---|
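The mini-batch construction mentioned above requires fixed-length inputs; below is a small sketch of the usual way to do this (tile utterances that are too short and randomly crop ones that are too long). The target length of 48,000 samples, three seconds at 16 kHz, is an illustrative assumption, not the exact figure from the talk.

```python
import numpy as np

def fix_length(wav, target_len=48000):
    """Tile short waveforms and randomly crop long ones to a fixed sample count."""
    if len(wav) < target_len:                        # duplicate (tile) short utterances
        wav = np.tile(wav, target_len // len(wav) + 1)
    start = np.random.randint(0, len(wav) - target_len + 1)
    return wav[start:start + target_len]             # random crop to the target length

batch = np.stack([fix_length(np.random.randn(np.random.randint(8000, 80000)))
                  for _ in range(4)])
print(batch.shape)  # (4, 48000)
```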
| 0:13:04 | The baseline system used the RawNet architecture |
|---|
| 0:13:07 | with some modifications. |
|---|
| 0:13:10 | First, the number of output nodes was |
|---|
| 0:13:15 | increased |
|---|
| 0:13:17 | to consider more speakers. |
|---|
| 0:13:20 | Secondly, |
|---|
| 0:13:21 | we increased the dimensionality of the speaker embedding to one thousand twenty-four. |
|---|
| 0:13:28 | "'kay" the glow described here top on it in a single system o'connor's from the | 
|---|
| 0:13:33 | always the challenge | 
|---|
| 0:13:34 | and our baseline system with various congregation | 
|---|
| 0:13:38 | target comparison between the current system in our baseline | 
|---|
| 0:13:43 | The compared configurations differ in the |
|---|
| 0:13:46 | input features, |
|---|
| 0:13:47 | the training configuration, |
|---|
| 0:13:49 | and the back-end classifiers. |
|---|
| 0:13:52 | We first describe the results of different strategies for using the VOiCES dataset for training. |
|---|
| 0:13:58 | In one strategy, |
|---|
| 0:14:00 | we first trained the network using VoxCeleb 2, |
|---|
| 0:14:03 | and then, |
|---|
| 0:14:04 | on top of that, |
|---|
| 0:14:06 | conducted fine-tuning with the VOiCES set. |
|---|
| 0:14:09 | However, our results showed that, among the training schemes considered, |
|---|
| 0:14:15 | training on all three datasets simultaneously provides the best performance. |
|---|
| 0:14:23 | For the proposed SC system, we explored the learning rate scheduler and optimizer. |
|---|
| 0:14:29 | The best performance was obtained when using a cosine annealing learning rate |
|---|
| 0:14:34 | scheduler. |
|---|
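For reference, a cosine annealing learning-rate schedule of the kind mentioned can be set up in PyTorch as below; the model, optimizer choice, and step count are placeholders.

```python
import torch

model = torch.nn.Linear(1024, 1024)                  # placeholder model
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # placeholder optimizer
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)

for step in range(100):
    opt.step()     # a real training step (forward/backward) would go here
    sched.step()   # decay the learning rate along a cosine curve
print(opt.param_groups[0]["lr"])
```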
| 0:14:36 | The SC system showed an EER |
|---|
| 0:14:38 | in the six percent range |
|---|
| 0:14:40 | on the test set, a relative error reduction of roughly eleven percent |
|---|
| 0:14:46 | compared to the baseline. |
|---|
| 0:14:50 | We experimented with the proposed auto-encoder system, varying the batch size and the margin. |
|---|
| 0:14:56 | The best performance was obtained |
|---|
| 0:14:59 | with the batch size set to ten thousand. |
|---|
| 0:15:03 | This system showed an EER in the six percent range on the test set |
|---|
| 0:15:08 | and a fifteen point nine seven percent relative reduction |
|---|
| 0:15:11 | compared to the baseline. |
|---|
| 0:15:16 | Score normalization techniques are frequently applied under various acoustic mismatch conditions. |
|---|
| 0:15:22 | Most of the participants in the VOiCES 2019 challenge also |
|---|
| 0:15:27 | used score normalization techniques |
|---|
| 0:15:31 | such as z-norm and s-norm. |
|---|
| 0:15:36 | We experimented with these techniques on our baseline and our two proposed systems, |
|---|
| 0:15:43 | the SC and the auto-encoder, |
|---|
| 0:15:45 | and the results are reported in the table below. |
|---|
| 0:15:48 | The results show that z-norm demonstrated the best performance in most cases in our experiments. |
|---|
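Score normalization such as the z-norm mentioned here standardizes each trial score using statistics gathered from an impostor cohort scored against the enrollment model; a minimal NumPy sketch with synthetic cohort scores.

```python
import numpy as np

def z_norm(raw_score, cohort_scores):
    """Z-norm: standardize a trial score with the enrollment model's cohort statistics."""
    mu, sigma = cohort_scores.mean(), cohort_scores.std()
    return (raw_score - mu) / sigma

cohort = np.random.randn(500) * 0.1 - 0.2   # synthetic scores of the enrollment model vs. a cohort
print(z_norm(0.35, cohort))
```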
| 0:15:55 | In addition, a score-level fusion of the two proposed systems |
|---|
| 0:15:59 | brought additional performance improvement, |
|---|
| 0:16:02 | with an EER of |
|---|
| 0:16:03 | six point one nine percent with z-norm. |
|---|
| 0:16:08 | Finally, we present the conclusion. |
|---|
| 0:16:13 | In this study, we proposed two speaker embedding enhancement systems. |
|---|
| 0:16:18 | Both proposed systems are independent of the front-end speaker embedding extraction, |
|---|
| 0:16:23 | and they can process not only distant utterances but also close-talk utterances. |
|---|
| 0:16:29 | This property can prevent the performance degradation that occurs |
|---|
| 0:16:33 | when close-talk utterances are input into a speaker embedding enhancement system |
|---|
| 0:16:37 | designed only for distant utterances. |
|---|
| 0:16:41 | Compared to the baseline system, the two proposed systems, the SC and the auto-encoder, |
|---|
| 0:16:47 | improved performance by a relative eleven point two three percent |
|---|
| 0:16:51 | and fourteen point nine three percent, respectively. |
|---|
| 0:16:55 | These results show the effectiveness of the proposed systems on inputs of both close-talk and distant utterances. |
|---|
| 0:17:01 | In our future work, we are considering integrating the two proposed systems into a single speaker |
|---|
| 0:17:07 | embedding enhancement system. |
|---|
| 0:17:12 | Thank you for listening. |
|---|