Speech Transcript - An Empirical Analysis of Information Encoded in Disentangled Neural Speaker Representations

0:00:14	hello everyone
0:00:15	my name is raghuveer
0:00:17	i am from the signal analysis and interpretation laboratory at the university of southern california los angeles
0:00:23	today i'll be presenting our work
0:00:25	titled an empirical analysis of information encoded
0:00:29	in disentangled neural speaker representations
0:00:32	and here are the people that have
0:00:34	collaborated with me on this work
0:00:38	so first
0:00:40	i'll introduce what i refer to as speaker embeddings in the rest of the talk
0:00:44	speaker embeddings are low-dimensional vector representations
0:00:48	that are discriminative of speaker identity
0:00:52	these have several applications
0:00:54	such as
0:00:55	in voice biometrics where the task is to verify a person's identity from speech
0:01:01	they also have applications in speaker-adapted asr models
0:01:06	they can also be used in speaker diarization
0:01:08	where the task is to determine
0:01:10	who spoke when in multiparty conversations
0:01:14	this can be of particular use in meeting analysis and many other applications
0:01:19	good speaker embeddings should satisfy two properties
0:01:23	first they should be discriminative of speaker factors
0:01:26	second they should be invariant to other factors
0:01:30	so what are the factors of information that could be encoded in speaker embeddings
0:01:34	for ease of analysis we broadly categorize them as follows
0:01:39	first are the speaker factors these are related to the speaker's identity for example
0:01:44	gender age et cetera
0:01:47	content factors these are acquired during speech production by the speaker
0:01:51	for example
0:01:53	the emotional state portrayed in the speech signal
0:01:58	sentiment whether it is positive or negative
0:02:00	the language being spoken
0:02:02	and most importantly the lexical content in the signal
0:02:06	and
0:02:07	last are the channel factors these are factors acquired when the signal is captured by the microphone
0:02:12	these could be the room acoustics
0:02:14	the microphone nonlinearities background acoustic noise
0:02:18	and also the artifacts related to the compression
0:02:22	of the signal
0:02:26	as i mentioned previously good speaker embeddings are supposed to be invariant to nuisance factors
0:02:30	these are the factors that are unrelated to the speaker's identity
0:02:34	such embeddings are useful for robust speaker recognition
0:02:38	in the presence of background acoustic noise
0:02:42	they're also useful for detecting a speaker's identity
0:02:45	irrespective of the emotional state of the speaker
0:02:48	and
0:02:49	also independent of what the speaker says
0:02:52	this is particularly useful
0:02:54	in text-independent speaker verification applications
0:02:58	so with that as the motivation the goal of our work is
0:03:03	twofold
0:03:03	first
0:03:04	is to quantify the amount of nuisance information in speaker embeddings
0:03:08	second is to investigate
0:03:10	to what extent
0:03:11	unsupervised learning can help
0:03:13	to remove the nuisance information
0:03:18	most existing works
0:03:20	only perform analysis based on one or two datasets
0:03:24	and
0:03:24	a comprehensive analysis is lacking
0:03:27	also most of these works do not consider the dependence
0:03:30	between the individual variables in the dataset
0:03:32	for example
0:03:33	in a dataset the lexical content and the speaker identity can be entangled
0:03:38	because some sentences are spoken only by certain speakers
0:03:42	therefore
0:03:42	it should be possible to predict the speakers based on lexical content alone
0:03:47	we intend to mitigate these limitations of previous work
0:03:51	by making the following contributions
0:03:53	firstly we use multiple datasets to comprehensively analyze the information encoded in neural speaker
0:03:59	representations
0:04:00	secondly we analyze the
0:04:02	effect of disentangling speaker factors from nuisance factors on the encoded information
0:04:11	let me briefly detail what i mean by disentanglement
0:04:14	in the
0:04:15	context of this talk
0:04:17	we define disentanglement broadly as the task of separating out information streams from an input
0:04:23	signal
0:04:24	here is a toy example
0:04:26	the input speech signal from a speaker
0:04:29	who is happy about having just bought something
0:04:33	contains information related to various factors
0:04:36	it contains information about the speaker's identity including gender and age
0:04:42	the information pertaining to the emotional state is also encoded
0:04:46	more importantly
0:04:47	the language identity and the lexical content are also encoded in the signal
0:04:52	the goal of a disentangled embedding extractor
0:04:54	is to separate all these information streams
0:04:59	and in the context of speaker embeddings which are supposed to capture speaker
0:05:02	identity related information
0:05:04	all other factors such as the emotional state and the lexical content
0:05:08	are considered nuisance factors
0:05:11	it is these factors which we propose to remove from the speaker embeddings
0:05:15	to make them more robust
0:05:18	now i'll explain the methodology behind disentangled speaker embedding extraction
0:05:23	this is the model we use
0:05:24	as input we can use any speech representation such as spectrograms
0:05:29	or even speaker embeddings from pretrained models such as x-vectors
0:05:33	and
0:05:34	using unsupervised disentanglement adapted from
0:05:38	a method that was previously proposed in the computer vision domain
0:05:41	we try to separate out
0:05:42	the speaker-related information
0:05:44	from the nuisance information
0:05:47	please note that this method was previously proposed in our earlier work
0:05:51	and you can find more details
0:05:54	in that paper
0:05:55	however for completeness i'll explain it here
0:06:01	our technique comprises two models the main model
0:06:04	which is shown in the green
0:06:07	blocks here
0:06:08	and
0:06:09	the adversarial model shown in blue
0:06:12	the input x is first processed by an encoder which splits it into two
0:06:16	embeddings h one and h two as shown in the figure
0:06:19	the embedding h one
0:06:21	is fed to the predictor
0:06:22	which predicts speaker labels
0:06:25	the embedding h two is concatenated with a noisy version of h one
0:06:30	which is denoted by h one prime here
0:06:32	h one prime is obtained by passing h one
0:06:34	through a dropout module
0:06:36	to randomly remove certain elements of h one
0:06:40	h two along with the noisy
0:06:43	h one which is h one prime
0:06:45	are concatenated
0:06:47	and fed into a decoder
0:06:49	which tries to reconstruct the original input x
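The construction of the decoder input described in this part of the talk can be illustrated with a minimal sketch. It assumes plain Python lists for the embeddings; the function names `noisy_copy` and `decoder_input`, the drop probability of 0.5, and the fixed seed are hypothetical choices for illustration and are not taken from the talk. Unlike common deep learning dropout layers, this sketch does not rescale the surviving elements.

```python
import random


def noisy_copy(h1, drop_prob, rng):
    """Return a copy of h1 where each element is zeroed with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else value for value in h1]


def decoder_input(h1, h2, drop_prob=0.5, seed=0):
    """Concatenate h2 with a randomly corrupted copy of h1 (h one prime)."""
    rng = random.Random(seed)  # fixed seed only to keep the example repeatable
    h1_prime = noisy_copy(h1, drop_prob, rng)
    return h2 + h1_prime  # list concatenation stands in for vector concatenation


# Toy embeddings (hypothetical values).
h1 = [0.3, -1.2, 0.8, 0.1]
h2 = [2.0, -0.5]
z = decoder_input(h1, h2)
print(z)
```

Because the decoder sees h two intact but only a corrupted h one, it has an incentive to rely on h two for reconstruction, which is the motivation the speaker gives next.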
0:06:54	the motivation behind using the dropout
0:06:56	is to make sure that
0:06:58	h one
0:06:59	is an unreliable source of information for the reconstruction task
0:07:03	and training in this way makes sure that
0:07:05	the information required for reconstruction is not stored in h one
0:07:08	and only
0:07:09	the information required for
0:07:11	speaker embeddings is stored
0:07:14	here
0:07:16	in addition
0:07:17	we also use two disentangler models
0:07:21	these models are jointly trained
0:07:23	to perform poorly in predicting h one from h two
0:07:27	and h two from h one
0:07:29	the goal of these models is to ensure that
0:07:31	h one and h two are not predictive of each other
0:07:35	which makes sure that they do not contain similar information
0:07:38	this way
0:07:40	we can aim for disentangled representations
0:07:44	and the loss functions that we use are presented here the main model
0:07:49	produces two losses one is a standard cross entropy loss from the predictor
0:07:52	which predicts speakers
0:07:54	and the second is the mean squared error reconstruction loss from the decoder
0:07:59	and the adversarial
0:08:00	model uses a mean squared error loss
0:08:04	the overall loss function is shown here
0:08:07	we try to minimize the loss with respect to the main model
0:08:10	while adversarially maximizing it with respect to the disentanglers
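The slide with the overall loss is not part of this transcript, so the following is only a sketch of one way the three terms named in the talk could be combined into a single minimax objective. The weights `alpha`, `beta`, and `gamma`, the function name `overall_loss`, and the sign convention on the disentangler term are assumptions, not the authors' stated formulation.

```python
import math


def cross_entropy(probs, target):
    """Cross entropy for one example, given predicted class probabilities."""
    return -math.log(probs[target])


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def overall_loss(speaker_probs, speaker_id, x, x_hat,
                 h1, h1_hat, h2, h2_hat,
                 alpha=1.0, beta=1.0, gamma=1.0):
    """Assumed objective: the main model minimizes it, the disentanglers maximize it."""
    l_pred = cross_entropy(speaker_probs, speaker_id)  # predictor loss
    l_recon = mse(x_hat, x)                            # decoder reconstruction loss
    l_dis = mse(h1_hat, h1) + mse(h2_hat, h2)          # disentangler prediction error
    # Subtracting l_dis means: minimizing rewards poor disentangler predictions,
    # while maximizing (the disentanglers' side) rewards accurate ones.
    return alpha * l_pred + beta * l_recon - gamma * l_dis


value = overall_loss(
    speaker_probs=[0.5, 0.5], speaker_id=0,
    x=[1.0, 2.0], x_hat=[1.0, 4.0],
    h1=[1.0, 0.0], h1_hat=[0.0, 0.0],
    h2=[2.0], h2_hat=[2.0],
)
print(value)
```

In practice the two sides would be updated in alternation with a gradient-based optimizer; that training loop is omitted here.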
0:08:14	this training process is adapted from previous work as i mentioned before
0:08:18	which used this technique
0:08:20	on
0:08:20	the digit recognition task
0:08:24	on successful training
0:08:26	the embedding h one is expected to capture speaker discriminative information
0:08:30	and the embedding h two is expected to capture nuisance information
0:08:34	notice that we have not used any labels of the nuisance factors such as
0:08:38	noise type channel conditions
0:08:40	et cetera
0:08:44	for training the models we use the standard voxceleb training corpus which
0:08:47	consists of
0:08:48	in the wild recordings of interviews with celebrities
0:08:51	augmented with additive noise and reverberation which is standard practice in data augmentation
0:08:56	this results in two point four million utterances from around seven thousand two hundred
0:09:00	speakers
0:09:01	as mentioned before we can either use spectrograms as input
0:09:06	or we can also use speaker embeddings from pretrained models which we do in
0:09:09	this work
0:09:11	so we use x-vectors extracted from a publicly available pretrained model
0:09:15	as input
0:09:17	x-vectors as most of you already know are speaker embeddings obtained from a
0:09:21	neural network
0:09:23	that is trained to classify speakers
0:09:25	from a large dataset artificially augmented with noise and reverberation
0:09:29	and this model has been shown to provide state-of-the-art performance on multiple tasks
0:09:35	that require speaker discrimination
0:09:39	we use multiple datasets in our evaluations as mentioned here
0:09:43	and by evaluating some factors for example
0:09:46	emotion on multiple datasets
0:09:48	we could also
0:09:50	lower the
0:09:51	issue of dataset bias
0:09:53	creeping into the model
0:09:55	and following others in the literature we make the assumption that
0:09:59	better classification performance
0:10:01	of the speaker embedding for a factor
0:10:04	implies
0:10:06	there is more information present in the embedding with respect to that factor
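The probing assumption stated here (higher accuracy in predicting a factor from an embedding is read as more information about that factor) can be illustrated with a small sketch. The talk does not say which classifier was used, so a nearest-centroid classifier is swapped in purely for illustration; the function names and the toy data are hypothetical.

```python
def fit_centroids(embeddings, labels):
    """Compute the mean embedding for each factor label."""
    sums, counts = {}, {}
    for vec, lab in zip(embeddings, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict(centroids, vec):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(vec, center))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab]))


def probe_accuracy(train_x, train_y, test_x, test_y):
    """Accuracy of predicting a factor label from embeddings."""
    centroids = fit_centroids(train_x, train_y)
    hits = sum(predict(centroids, x) == y for x, y in zip(test_x, test_y))
    return hits / len(test_y)


# Toy case 1: the factor is clearly encoded in the embedding.
informative = probe_accuracy(
    [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]], ["a", "a", "b", "b"],
    [[0.0, 0.5], [5.0, 5.5]], ["a", "b"],
)
# Toy case 2: the embedding carries no information about the factor.
uninformative = probe_accuracy(
    [[1.0, 1.0]] * 4, ["a", "a", "b", "b"],
    [[1.0, 1.0], [1.0, 1.0]], ["a", "b"],
)
print(informative, uninformative)
```

Under this reading, a drop in probe accuracy for a nuisance factor after disentanglement is taken as evidence that less of that factor is encoded.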
0:10:11	and as a baseline we use x-vectors as speaker embeddings since our model takes x-vectors
0:10:16	as input
0:10:17	we can consider our speaker embeddings as a refinement of x-vectors
0:10:21	with speaker related information retained and nuisance factors removed
0:10:26	we also reduce the dimension of x-vectors by using pca
0:10:30	to match the
0:10:32	dimension of the embeddings from our models
0:10:37	so on to the results
0:10:41	the first set of results shows the accuracy of predicting speaker factors using x
0:10:46	vectors
0:10:47	shown in blue
0:10:48	and using our embeddings shown in red
0:10:50	and in this case higher is better
0:10:53	the first two graphs here show speaker classification accuracy and the other two show
0:10:58	gender prediction accuracy
0:11:00	so we find that in general both x-vectors and our embeddings
0:11:04	perform well in distinguishing speakers and genders
0:11:07	and we see a slight degradation using our model
0:11:11	however the differences are minimal
0:11:14	one other observation is that
0:11:15	in iemocap we find lower performance for
0:11:19	both x-vectors and our model
0:11:21	we conjecture that this could be due to speaker overlap
0:11:25	and also this dataset is not ideally suited for the speaker
0:11:29	recognition task since
0:11:31	the purpose of this dataset was emotion recognition
0:11:36	now the more interesting results
0:11:39	here we show the results of predicting the content factors using x-vectors and
0:11:44	our speaker embeddings and in this case since these are nuisance factors lower
0:11:48	is better
0:11:49	we find that in
0:11:51	all
0:11:51	the cases our model reduces the nuisance information
0:11:56	in particular
0:11:58	emotion and lexical information are reduced to a greater extent
0:12:02	here the lexical accuracy
0:12:04	is the accuracy of predicting the sentence
0:12:06	spoken given the speaker embedding of that sentence
0:12:10	and apart from emotion and lexical content we also see a reduction
0:12:14	in information pertaining to sentiment
0:12:18	which is closely related to emotion
0:12:20	and also language
0:12:25	in this slide i report the results of predicting the channel factors using x
0:12:28	vectors
0:12:29	and our speaker embeddings
0:12:31	again in this case lower is better
0:12:33	in particular we focus on three factors
0:12:36	the room the microphone distance or the microphone location
0:12:41	and the noise type
0:12:44	we find that in predicting the location of the microphone used
0:12:48	and the type of noise present
0:12:50	x-vectors have a much higher accuracy than our model
0:12:54	this means that we are able to successfully reduce the amount of nuisance information from
0:12:58	x-vectors
0:13:00	however we notice that
0:13:03	in predicting the room
0:13:05	in which the recording was made
0:13:07	the disentangled embeddings are not very
0:13:11	effective
0:13:12	this needs further investigation
0:13:18	next we show the results of the evaluation
0:13:21	where we evaluated the models on the speaker verification task
0:13:24	and here we show
0:13:27	the detection error tradeoff curves
0:13:29	where the false positive rate and the false negative rate are plotted against each other
0:13:34	the curves of our models are compared to those of x-vectors
0:13:38	and the curves
0:13:40	that are closer to the origin
0:13:42	denote better models
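The points on a detection error tradeoff curve can be computed from verification scores as in the sketch below. It assumes that a trial is accepted when its score is at or above the threshold; the function names and the toy scores are hypothetical. Plotting on the normal deviate scale that such curves conventionally use is left out.

```python
def error_rates(target_scores, nontarget_scores, threshold):
    """Return (false positive rate, false negative rate) at one threshold."""
    # A target trial scoring below the threshold is a miss (false negative).
    fnr = sum(s < threshold for s in target_scores) / len(target_scores)
    # A non-target trial scoring at or above the threshold is a false alarm.
    fpr = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    return fpr, fnr


def det_points(target_scores, nontarget_scores):
    """Sweep every observed score as a threshold and collect the error-rate pairs."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    return [error_rates(target_scores, nontarget_scores, t) for t in thresholds]


# Toy scores (hypothetical): same-speaker trials and different-speaker trials.
targets = [0.9, 0.8, 0.4]
nontargets = [0.1, 0.3, 0.5]
points = det_points(targets, nontargets)
print(points)
```

A system whose points lie closer to the origin makes fewer errors of both kinds across thresholds, which is why curves nearer the origin are read as better.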
0:13:44	the black dotted lines show the x-vector model
0:13:47	and all the other
0:13:49	lines denote our models with and without
0:13:52	lda based dimensionality reduction
0:13:57	we found statistically significant differences only in the graphs
0:14:00	based numbers dimension
0:14:05	most notably in challenging scenarios
0:14:09	with babble and television noise in the background
0:14:11	our models perform better than x-vectors
0:14:13	also in the distant microphone condition our models perform significantly better than x-vectors
0:14:21	we also found that the model m two that is trained with augmented data
0:14:26	was slightly better compared to the model m one
0:14:29	that is trained without augmentation
0:14:31	this actually conforms to what we expected
0:14:38	so finally i'd like to quickly present a discussion based on our experiments which hopefully will be
0:14:44	useful pointers for future research
0:14:46	in this domain
0:14:48	first we find that speaker embeddings do capture a variety of information pertaining to
0:14:52	nuisance factors
0:14:54	and this can sometimes be detrimental to robustness
0:14:58	and we also found that just introducing a
0:15:02	bottleneck on the dimension of the speaker embedding by using pca
0:15:06	doesn't remove this information
0:15:09	this points to the need for explicitly modeling the nuisance factors
0:15:13	and using the
0:15:14	unsupervised adversarial invariance technique which is the
0:15:19	technique that we use in our model
0:15:21	we can reduce the nuisance information
0:15:23	from the speaker embeddings
0:15:25	and the biggest advantage is that labels of nuisance factors are not required for this method
0:15:31	we also found that unsupervised disentanglement retains gender information
0:15:36	this actually suggests that speaker gender
0:15:38	as captured by neural representations
0:15:41	is a crucial part of identity
0:15:43	this is quite intuitive from a human perception point of view
0:15:47	essentially what this shows is that neural representations are consistent
0:15:51	with human perception
0:15:54	finally disentangled speaker representations show
0:15:57	better verification performance in the presence of a variety of noise conditions
0:16:01	particularly babble and television noise which are considered
0:16:05	very challenging for this task
0:16:10	going forward we would like to explore methods to further improve the disentanglement
0:16:14	and
0:16:15	so far as i mentioned we have not used any nuisance
0:16:18	labels so
0:16:20	we would like to see if
0:16:22	if we use datasets with such labels available
0:16:25	we can
0:16:26	achieve better disentanglement
0:16:29	so that brings me to the
0:16:33	end of the talk these are some of the references
0:16:36	finally i would like to acknowledge the support for this work
0:16:40	and
0:16:42	that's it that's the end of my presentation
0:16:45	please feel free to send me an email with any questions or suggestions you might
0:16:48	have
0:16:49	thank you

An Empirical Analysis of Information Encoded in Disentangled Neural Speaker Representations

Special Session: VOiCES 2020

Raghuveer Peri, Haoqi Li, Krishna Somandepalli, Arindam Jati, Shrikanth Narayanan