0:00:16Hi, I'm presenting work from SRI International on adaptive mean normalization for
0:00:23unsupervised adaptation of speaker embeddings.
0:00:28We'll first cover the problem statement: what we're actually trying to tackle with this work.
0:00:33We'll then look at the role of mean normalization as it applies to state-of-the-art speaker
0:00:38recognition systems,
0:00:40then go through the proposed technique, adaptive mean normalization,
0:00:43and have a look at experiments to see how it performs.
0:00:48The problem statement.
0:00:50Variability is well known as one of the biggest challenges to the practical use of speech and face recognition
0:00:55systems,
0:00:56and it really refers to changes in effects between the training of a system
0:01:00and successive detection attempts.
0:01:04Variability can be grouped into two types.
0:01:06One is extrinsic, that is, something separate from the speaker; this
0:01:10includes things like microphones, the acoustic environment, and the transmission channel.
0:01:14The other type is intrinsic, and this is to do with the speaker themselves,
0:01:18who will vary over time. The things that introduce variability here
0:01:23include health, stress, their overall state, and speaking style.
0:01:27These differences are collectively referred to as domain mismatch
0:01:31when we're looking at the differences between the system training data
0:01:36and detection attempts.
0:01:40Now, as many of us know, domain mismatch typically results in a performance loss in the system.
0:01:44This is a loss with respect to the expected performance of the system:
0:01:49once the system is trained and evaluated, we have a certain estimate of how it should perform.
0:01:54If the domain then changes when the system is deployed,
0:01:57we see a performance loss
0:01:59due to this domain mismatch.
0:02:01Now, this loss is due to two different things.
0:02:04One is discrimination loss:
0:02:06that means less ability in the system to separate speakers.
0:02:09The other is miscalibration. When a system is miscalibrated,
0:02:12it gives a score
0:02:14that may actually mislead the user into believing that something was present that shouldn't have
0:02:19been detected.
0:02:22Domain adaptation can be used to cope with this problem,
0:02:24and there are two different ways of dealing with domain adaptation. One is supervised:
0:02:28this is where we have labeled data,
0:02:31and where we often get improvement, or reliable improvement,
0:02:35but there is a high cost to the human-labeled data that you end up
0:02:38needing to improve the system.
0:02:40The alternative is unsupervised adaptation.
0:02:43It has a very low cost,
0:02:45there's actually no end-user labeling at all,
0:02:48there's plenty of data available,
0:02:49and it's ideally matched to the deployment conditions.
0:02:52The downside here is that we have no ground-truth labels to rely on.
0:02:59The focus of this work
0:03:01is the unsupervised adaptation scenario.
0:03:06There are some shortcomings of unsupervised adaptation.
0:03:09One is a lack of generalisation.
0:03:11Quite often decisions have to be made in how to apply an unsupervised
0:03:17approach. For instance, if we're going to retrain the
0:03:20LDA or PLDA of the system,
0:03:22we end up needing to make some kind of assumptions about
0:03:25which clusters different audio segments are going to belong to with respect to different speakers.
0:03:30It can also be overfit to the data being trained with.
0:03:35Trustworthiness is another one:
0:03:37when do we get guarantees for improvements from unsupervised adaptation? They are limited, as we'll see.
0:03:42Then there is complexity:
0:03:44some approaches have high computation, and that makes it a little bit more difficult to
0:03:48give to
0:03:49clients or users once the system goes out the door.
0:03:54So the question we tried to answer is: where is the best place to
0:03:57apply adaptation
0:03:59in the unsupervised scenario,
0:04:01where it can be fast and reliable once deployed?
0:04:04On screen here we have a diagram of the different stages of a speaker recognition
0:04:09pipeline,
0:04:10and we can look at what would happen if we applied adaptation at each of these
0:04:14stages.
0:04:15Take the feature extraction, the MFCCs or power-normalized cepstral coefficients,
0:04:22or the speaker embeddings network.
0:04:23If someone was to tune that,
0:04:25as highlighted here,
0:04:27it typically requires full retraining of both the DNN and the back-end modules.
0:04:31You need to have a lot of data on hand ready to do that process,
0:04:34so that's hard to explore.
0:04:38What about speech activity detection?
0:04:40Now, there are approaches that adapt this stage
0:04:44to different scenarios.
0:04:46This is useful when that is actually where the domain sensitivity lies,
0:04:49but it's only a partial solution: it doesn't really help the discrimination in the rest of
0:04:54the pipeline.
0:04:57LDA, PLDA, and calibration:
0:05:00now, these are some of the core components of the back-end process,
0:05:03but they tend to require labels or predictions from clustering, and these can be error-prone
0:05:08or affected by a poor decision.
0:05:11Length normalization: well, there's no actual adaptation to be done there, so it's not applicable.
0:05:16Which leads us to mean normalization.
0:05:18Now, this is simple to adapt: the parameters are, in general, typically about 200
0:05:23numbers, the dimensionality of the PLDA input,
0:05:25and it requires relatively little data.
0:05:28So what is the role of mean normalization in a system?
0:05:32Now, PLDA is a strong model when the assumptions of the
0:05:36PLDA model hold,
0:05:37that is, the distribution of the data going into it
0:05:40fits a standard normal distribution.
0:05:43In training,
0:05:44we aim to ensure that:
0:05:46mean normalization
0:05:47and length normalization together achieve this,
0:05:50so the assumptions are fulfilled
0:05:52when the system is trained.
0:05:55Length normalization actually projects the embeddings onto the unit hypersphere,
0:06:00and there's a diagram right here that demonstrates that:
0:06:03the embeddings are evenly spread around the hypersphere.
0:06:06This assumes a zero mean, which works well during training.
0:06:11The problem is a shift of the domain
0:06:12away from that of training,
0:06:15such as with evaluation data.
0:06:18This is shown in this diagram here:
0:06:20length normalization then produces embeddings on the unit hypersphere with a distribution that is
0:06:24not evenly spread.
0:06:26Therefore the assumptions of the PLDA model are not fulfilled anymore,
0:06:30and we actually reduce the discrimination ability of the model.
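As a minimal sketch (not from the talk; names are illustrative), the centering and length-normalization steps described above might look like this in Python with NumPy:

```python
import numpy as np

def mean_length_normalize(embedding, mean):
    """Center an embedding with a (possibly adapted) mean, then length-normalize
    it onto the unit hypersphere, bringing it closer to the Gaussian assumptions
    of the PLDA back end."""
    centered = embedding - mean
    return centered / (np.linalg.norm(centered) + 1e-12)
```

If `mean` is mismatched to the evaluation domain, the centered embeddings end up unevenly spread on the hypersphere, which is exactly the problem illustrated in the diagram.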
0:06:36So let's look at actual performance here, when we look at the difference between using a
0:06:41system-based mean,
0:06:42where we have taken the mean from the actual training data, versus the
0:06:46impact if we actually adjust the mean of the system
0:06:51to a held-out dataset
0:06:53of relevant conditions for the data we benchmark.
0:06:57Now, there are more details on the evaluation protocol and the datasets used here later
0:07:01on in the presentation,
0:07:02but for now this is a quick snapshot of what happens if you simply update
0:07:06the mean of the PLDA input to a matched dataset.
0:07:10We actually see the equal error rate improve by up to 19 percent,
0:07:14really helping the discrimination ability.
0:07:16But even more impressive
0:07:17is the fact that Cllr, that's the cost of the likelihood ratio,
0:07:21an indication of both discrimination and calibration performance,
0:07:25improves by up to 60 percent.
0:07:27Now, this is despite holding the calibration model fixed and mismatched to the
0:07:33other conditions.
0:07:34In particular, the calibration model here is trained on the RATS source data,
0:07:38that's clean telephone data,
0:07:40and yet on the SRE datasets and Speakers in the Wild
0:07:44it is dramatically helping calibration.
0:07:46So having a relevant mean really is crucial.
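For reference (the talk only names the metric), Cllr is commonly defined over target and non-target log-likelihood-ratio scores $s$ as:

```latex
C_{\mathrm{llr}} \;=\; \frac{1}{2}\left(
  \frac{1}{N_{\mathrm{tar}}}\sum_{i\,\in\,\mathrm{tar}} \log_2\!\bigl(1 + e^{-s_i}\bigr)
  \;+\;
  \frac{1}{N_{\mathrm{non}}}\sum_{j\,\in\,\mathrm{non}} \log_2\!\bigl(1 + e^{s_j}\bigr)
\right)
```

Lower is better, and it penalizes both poor discrimination and poorly calibrated scores.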
0:07:52Now, on to adaptive mean normalization.
0:07:55So what we've got so far,
0:07:58a mean adapted to held-out data, is suitable when evaluation conditions are homogeneous. If we deploy a
0:08:03system and we know generally what it will be used for, and it's not going to vary much
0:08:07from that,
0:08:08then that approach is okay.
0:08:10The problem comes in when conditions can vary over time or between trials.
0:08:15For instance, when dealing with radio broadcasts,
0:08:18conditions vary over time depending on the signal and the time of day, or maybe a system is
0:08:23being used for both telephone and microphone-style
0:08:26calls.
0:08:27Then we end up having the distribution shown here on the bottom right, where we have different
0:08:32means
0:08:33and can see how each projects onto the unit hypersphere.
0:08:36This means that, ideally, what we would love to be able to do here is actually
0:08:41adapt the mean
0:08:42depending on the conditions of the trial at hand.
0:08:45So that means we want to dynamically define the mean
0:08:48as trials come into the system.
0:08:51That's what we propose with our method of adaptive mean normalization.
0:08:57So what is it?
0:08:58Well, this process actually stemmed from ideas in trial-based calibration,
0:09:02and what trial-based calibration does is wait until trial time to define
0:09:06the system parameters,
0:09:08in particular the calibration model parameters.
0:09:11It actually looks at the conditions at hand
0:09:14of the trial coming in, on both sides, the enrollment and test audio,
0:09:18defines different subsets for those
0:09:22conditions,
0:09:22and then trains a calibration model on the fly using held-out data.
0:09:28So the goal of the system here
0:09:30is to make the system model as general
0:09:33and reliable as possible.
0:09:37Now, one extra advantage here is that over time, as the system is seeing more and
0:09:41more conditions and more relevant data,
0:09:44it can accumulate knowledge about the new conditions over time.
0:09:48Here's the process.
0:09:50So, taking a bit of a zoomed-in view of the pipeline, the embeddings, mean
0:09:54normalization, length normalization, PLDA and calibration on the left-hand side,
0:09:58what has changed
0:10:00is the mean normalization and the adaptive process.
0:10:04Where there used to be a single static mean,
0:10:07what we're now doing is computing the mean on the fly.
0:10:10So in fact this is an embedding-
0:10:14specific process,
0:10:15not a trial-specific process, which is a bit of a benefit here
0:10:18in terms of computation.
0:10:20For each embedding, what we do is compute a comparison
0:10:24against a bunch of candidate
0:10:26embeddings.
0:10:27What we want to do is find those embeddings from the candidate pool that are similar in
0:10:31condition to our embedding that's coming in to be adaptively mean normalized.
0:10:36We make a selection of that subset,
0:10:39we then find the condition mean based on that subset,
0:10:42and, based on how many samples we find and how many we would like to
0:10:46find, we do a weighting process.
0:10:50Then we use that mean
0:10:52as the adapted mean for that embedding,
0:10:55which then follows on through the rest of the pipeline.
0:10:59So what we're trying to do here is
0:11:02make this happen on the fly, and in fact it has very little overhead.
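As a rough sketch of the per-embedding procedure just described (not the authors' code: the cosine similarity and the linear weighting below are assumptions, since the talk does not spell out the exact similarity measure or weighting formula):

```python
import numpy as np

def adaptive_mean(cond_emb, pool_cond_embs, pool_spk_embs,
                  system_mean, sim_threshold, max_candidates):
    """Estimate an adapted mean for one incoming embedding from a candidate pool.

    cond_emb       : condition embedding of the incoming segment
    pool_cond_embs : condition embeddings of the candidate pool, shape (P, dc)
    pool_spk_embs  : speaker embeddings of the candidate pool, shape (P, ds)
    system_mean    : static mean estimated on the system training data
    """
    # Condition similarity against every candidate (cosine used for illustration).
    sims = pool_cond_embs @ cond_emb / (
        np.linalg.norm(pool_cond_embs, axis=1) * np.linalg.norm(cond_emb) + 1e-12)

    # Keep candidates above the similarity threshold ...
    selected = np.flatnonzero(sims > sim_threshold)
    # ... and cap the selection at the N most similar ones.
    if selected.size > max_candidates:
        selected = selected[np.argsort(sims[selected])[-max_candidates:]]

    if selected.size == 0:
        return system_mean  # no relevant candidates: fall back to the system mean

    dynamic_mean = pool_spk_embs[selected].mean(axis=0)

    # The closer n is to N, the more we trust the dynamically estimated mean.
    w = selected.size / max_candidates
    return w * dynamic_mean + (1.0 - w) * system_mean
```

The adapted mean would then be used in place of the static mean in the centering and length-normalization step shown earlier.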
0:11:08There are some ingredients that we need for adaptive mean normalization.
0:11:12Now, in terms of making a comparison between an embedding and a candidate, we need something
0:11:17that can
0:11:18tell us whether the conditions of those embeddings are similar or not.
0:11:22For this we use condition embeddings,
0:11:24and these are really like what we use for speaker recognition,
0:11:28except instead of discriminating speakers,
0:11:30the network is trained to discriminate different conditions.
0:11:33The conditions include compression type,
0:11:35reverberation, noise type,
0:11:37language, and gender.
0:11:39When we combine those together we actually end up with around 11,000 unique
0:11:43conditions,
0:11:45so it should be a very thin slice of variability
0:11:47that we're dealing with for each one.
0:11:49Next we need a pool of candidate embeddings,
0:11:52and this is just a large mixture of conditions, anything we can get hold of really,
0:11:56and ideally it includes some examples of the evaluation conditions.
0:12:00Now, if that's not the case, what the system could actually do
0:12:05after it is deployed is use test data along the way to grow the pool of candidate
0:12:10embeddings
0:12:11to be more suited to the conditions.
0:12:15This pool is used to dynamically estimate the means of conditions.
0:12:19Finally, there are two parameters. One is the condition similarity threshold: we don't want everything
0:12:24from the candidate pool coming through, so we want to determine how similar each candidate
0:12:29is and make sure it's similar enough
0:12:32to be passed to the next stage, the mean estimate.
0:12:35The other thing that we want to set is the maximum number of candidates to select.
0:12:39Now, if everything in the candidate pool
0:12:42was above the threshold,
0:12:45everything would be passed through and maybe we'd get no benefit over the non-adapted
0:12:49system,
0:12:51so we want to make sure that we limit that somehow: we just select the
0:12:54top N of those.
0:12:55So if we then go back to our picture here, we can fill in a
0:12:58few different things.
0:13:00For instance, the comparison is now done with the condition embeddings.
0:13:04We then do the selection process, where n is the number of candidates with
0:13:08similarity
0:13:09above the threshold.
0:13:11However, if n
0:13:12is more than the maximum we allow,
0:13:14which is N,
0:13:16we keep the N with the highest similarity,
0:13:19thereby making sure we keep the most relevant ones for our mean estimate.
0:13:24Once we estimate the mean,
0:13:26we go on to do a weighted average
0:13:28with the system mean,
0:13:30and that weighted average
0:13:31means that
0:13:34the closer we get to that N value, the more we rely on
0:13:38the dynamically estimated mean,
0:13:40whereas we default back to the system mean in the case that no relevant
0:13:44samples could be found.
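The exact weighting isn't spelled out in the talk; one plausible form of the weighted average just described, with n selected candidates out of a maximum of N, is:

```latex
\mu_{\text{adapted}} \;=\; \frac{n}{N}\,\mu_{\text{dynamic}} \;+\; \Bigl(1 - \frac{n}{N}\Bigr)\,\mu_{\text{system}},
\qquad 0 \le n \le N
```

so that n = 0 recovers the static system mean and n = N fully trusts the condition-specific estimate.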
0:13:48Now, several benefits of adaptive mean normalization.
0:13:52As I said, it has very minimal overhead,
0:13:55and that overhead is defined by the number of candidate examples that it has to
0:13:59compare against.
0:14:01It is also applied per embedding instead of per trial, which is what is usually done in trial-
0:14:05based calibration,
0:14:06and that tends to do a lot in terms of reducing computation.
0:14:12It accounts for the case of no relevant examples, where it just reverts to the
0:14:16system mean via the
0:14:18weighted average.
0:14:20New enrollment audio or test audio could actually be collected over
0:14:24time into the candidate pool,
0:14:26and this allows the most relevant, changing conditions to be accounted for once the system is
0:14:30deployed.
0:14:33Now, it's a simple process,
0:14:35with the parameters being under 200 numbers that are changing here.
0:14:39It also weights against the system mean, so it makes it a little bit more difficult
0:14:42to overfit, which is a real benefit.
0:14:45And finally,
0:14:46what we find quite impressive here is that it allows a single
0:14:50static calibration model to be applied across conditions. Now, that's exactly the problem that
0:14:55trial-based calibration was trying to solve
0:14:58by adapting the calibration model.
0:15:00We've gone a step further than that and allowed the front of the system to adapt the mean,
0:15:05which leaves the calibration model as just a static one,
0:15:08without the cost.
0:15:09In other words, ensuring mean normalization there
0:15:12allows the PLDA assumptions to be fulfilled,
0:15:15so a calibration model after PLDA scoring
0:15:18is also suitable.
0:15:21Let's take a look at the experiments.
0:15:24First of all, the baseline system we used here is the SRI
0:15:30team submission for the NIST SRE
0:15:34evaluation. This system involves 16 kHz
0:15:37power-normalized cepstral coefficients
0:15:39and multi-band training of the embeddings.
0:15:42Multi-band means that we trained the embedding system with both 8 kHz and
0:15:4616 kHz data: any time we had 16 kHz data,
0:15:49we also downsampled it to 8 kHz as well,
0:15:52so that the DNN was exposed to both 8 kHz and 16 kHz versions of the same
0:15:57audio segments.
0:15:58That tended to help bridge the gap between 8 kHz and 16 kHz evaluation data.
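A minimal sketch of the downsampling augmentation described here (illustrative only; the actual feature pipeline and resampler are not specified in the talk):

```python
from scipy.signal import resample_poly

def add_downsampled_copy(wave_16k):
    """Given a 16 kHz waveform, also produce an 8 kHz copy so the embedding
    DNN is trained on both bandwidths of the same audio (multi-band training)."""
    wave_8k = resample_poly(wave_16k, up=1, down=2)  # 16 kHz -> 8 kHz
    return wave_16k, wave_8k
```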
0:16:06We trained on the standard datasets;
0:16:08the references for those datasets are in the paper,
0:16:10and we did the standard augmentation, of course.
0:16:13Now, as mentioned before, the calibration model here is trained on the RATS source data.
0:16:18That is from the DARPA RATS program,
0:16:22but it is the relatively clean telephone source data, not the retransmitted data, which is heavily degraded.
0:16:28In terms of evaluation,
0:16:30we split our held-out sets into evaluation and norm sets.
0:16:35For the NIST SRE corpora,
0:16:372016 and 2019,
0:16:39these have their own norm sets
0:16:41available with them, known as the unlabeled data, as you can see in the table. For
0:16:47Speakers in the Wild,
0:16:48we used the eval portion
0:16:49for evaluation and the dev set
0:16:51for the norm set;
0:16:53again, the speakers for these are disjoint.
0:16:55For the RATS source data, we actually split this set ourselves to have two different speaker
0:17:00pools,
0:17:01one for evaluation and one for the normalization step.
0:17:07In terms of the adaptive mean normalization parameters,
0:17:10we set a condition similarity threshold of ten,
0:17:14and the maximum number of candidates was set to half the number of candidate samples
0:17:18in the pool
0:17:20of candidates.
0:17:21These were set by searching on the RATS source data.
0:17:24You can see how many segments were available for each norm set,
0:17:29including the pooled set,
0:17:31which we use initially.
0:17:33The value of N is the number of candidates that we're trying to get,
0:17:37and that value of N, remember,
0:17:39also drives the
0:17:40weighted average
0:17:42with the dynamic mean: the closer we can get to it,
0:17:46the more we rely on the adapted mean.
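Using the earlier sketch, the configuration reported here might look roughly like this (the pool size is illustrative and varies per norm set, and the threshold of 10 is on the scale of the authors' condition-similarity score, which may differ from the cosine stand-in used in the sketch):

```python
SIM_THRESHOLD = 10                 # condition similarity threshold from the talk
POOL_SIZE = 1200                   # illustrative; the pools had on the order of 1,200+ segments
MAX_CANDIDATES = POOL_SIZE // 2    # N set to half the candidate pool

# adapted = adaptive_mean(cond_emb, pool_cond_embs, pool_spk_embs,
#                         system_mean, SIM_THRESHOLD, MAX_CANDIDATES)
```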
0:17:54Let's look at the out-of-the-box performance.
0:17:57So we've got here the four different datasets: SRE16, SRE19, Speakers in
0:18:01the Wild, and the RATS clean telephone data.
0:18:05For the baseline system here, the mean norm is
0:18:08simply the mean that was estimated during the training of the system,
0:18:13and the calibration model is RATS.
0:18:15Now, what happens
0:18:17if we go and look at
0:18:19adapting the calibration model
0:18:22to the actual eval set?
0:18:23This is a cheating experiment on the right-hand side:
0:18:26essentially what we're doing is replacing the RATS calibration model
0:18:31with the eval-set calibration model.
0:18:36What we can see here
0:18:38is that we're getting much better calibration performance,
0:18:41because for some datasets the calibration model trained on
0:18:44RATS doesn't seem to match too well.
0:18:48The equal error rates tend to vary wildly between these different datasets,
0:18:53but the calibration is considerably better compared to the baseline.
0:18:56But note the low Cllr value of 0.147 for RATS,
0:19:00because it is matched to the RATS data used in the calibration model.
0:19:09Let's look at the impact of relevant mean normalization.
0:19:13Now, earlier on in this presentation we showed the first two columns here, the
0:19:17baseline and the condition-based mean normalization.
0:19:20Now we bring in the third column,
0:19:22the adaptive mean normalization.
0:19:26What we found when using adaptive mean normalization with a pool
0:19:30of the held-out datasets pooled together, the SRE16, SRE19,
0:19:34Speakers in the Wild and RATS data,
0:19:36with the held-out datasets as one big candidate pool,
0:19:41was that adaptive mean normalization was able to outperform the condition-specific mean normalization in
0:19:46the heterogeneous conditions,
0:19:48so in particular the Speakers in the Wild
0:19:52and the SRE datasets.
0:19:54The calibration performance there is improving quite significantly in some cases,
0:20:00and for SRE 2019 it's a nice improvement.
0:20:05The equal error rate of SRE16 also improves
0:20:09quite reasonably.
0:20:11Now, what's interesting is the adaptive process didn't really hurt the matched RATS condition,
0:20:16so that's a benefit there as well.
0:20:20But now, what about data requirements?
0:20:23How much data do we actually need in the candidate segments to make
0:20:26adaptive mean normalization work?
0:20:29What we've done on this slide is look at the Cllr,
0:20:33which, remember,
0:20:37is how we measure the discrimination performance and calibration performance together.
0:20:41The dashed lines are the baseline performance across the four different datasets.
0:20:45The solid lines are what happens as we vary the number of candidate
0:20:49segments.
0:20:50Now, remember, these pools originally had at least
0:20:531,200 samples.
0:20:55Here we're doing this in a dataset-specific scenario,
0:20:59where, for instance, for
0:21:00SRE16
0:21:03the candidate pool is the actual unlabeled data from SRE16, so
0:21:07it's suited to the conditions, and we're randomly selecting from that held-out pool.
0:21:14What we see here is that
0:21:16quite rapidly, after 32 relevant segments
0:21:20in the candidate pool, we're already
0:21:22in front of the baseline Cllr,
0:21:24so that's already sufficient for a significant Cllr improvement, as we saw earlier.
0:21:29This also happens for the equal error rate in terms of the trend,
0:21:33though not quite so much of a relative gain,
0:21:36but again, 32 relevant segments from the target domain
0:21:40was enough for this adaptive process to get a good gain.
0:21:47Now, importantly, what happens when the adaptive mean normalization
0:21:51we employ
0:21:53has a candidate pool that is mismatched
0:21:56to the conditions that are going to be evaluated?
0:22:00We wanted to see what happens in this case, so what we did for each
0:22:03dataset we're benchmarking here is
0:22:06we excluded the relevant data
0:22:09from the candidate pool for that set.
0:22:12So, for instance,
0:22:13with the RATS dataset, down the bottom of the table,
0:22:15we actually excluded it from the pool and just retained Speakers in the
0:22:20Wild
0:22:20and the two SRE datasets in the candidate pool,
0:22:23and that's all it had to select from
0:22:25in order to estimate the mean and adapt the mean in the system.
0:22:29Now, remember, when it can't actually find anything that it thinks is relevant,
0:22:33it falls back to the system mean,
0:22:36so we would hope that the performance is the same
0:22:40as the baseline system,
0:22:42or better.
0:22:45What we can see is that Speakers in the Wild
0:22:48and RATS actually perform reasonably well.
0:22:51There was an improvement still with Speakers in the Wild, which is surprising;
0:22:55with RATS
0:22:55it still degraded, but just a little bit.
0:22:59SRE16
0:23:00degraded quite severely with respect to the baseline
0:23:03without any relevant data for the mean.
0:23:07We tried varying the selection threshold here in the hope that, using a
0:23:10higher threshold,
0:23:12we would restrict the subset selection to really
0:23:15the closest ones possible,
0:23:17but this didn't help.
0:23:18So this indicates there was a problem with the condition PLDA, or that the conditions
0:23:23in the audio pool weren't quite optimal for selection
0:23:27in this mismatched scenario.
0:23:32So, in summary, we proposed adaptive mean normalization.
0:23:36It's simple and effective, leveraging the test conditions where possible.
0:23:41It's useful with just a handful of samples; in fact, as few as
0:23:4532 samples of speech.
0:23:48For discrimination we saw improvements of up to about 26 percent, and in terms of calibration, measured through the
0:23:54Cllr,
0:23:55we saw improvements of up to
0:23:5766 percent relative over the baseline system.
0:24:01And what's important here is that it actually allows a static calibration model to
0:24:04become suitable for varying conditions;
0:24:07that's a real benefit once your system goes out the door.
0:24:11In terms of future work, we identified a couple of things.
0:24:14We want to enhance the selection method to be robust when relevant data is missing, as in that
0:24:18mismatch experiment we did at the very last,
0:24:20and we also want to do experiments in how active learning over time
0:24:25can improve the calibration pool, sorry, not the calibration pool, the candidate pool,
0:24:30by collecting live test data
0:24:32over time that's relevant to the examples and retaining a recent history.
0:24:37That ends the presentation. I'd be happy to hear any remarks or questions from anyone.
0:24:42Thank you.