0:00:16Hi, I'm presenting work from SRI International on adaptive mean normalization for
0:00:23unsupervised adaptation of speaker embeddings.
0:00:28We'll first cover the problem statement: what we're actually trying to tackle with this work.
0:00:33We'll then look at the role of mean normalization as it applies to state-of-the-art speaker
0:00:38recognition systems,
0:00:40then go through the proposed technique, adaptive mean normalization,
0:00:43and have a look at experiments to see how it performs.
0:00:48The problem statement.
0:00:50Variability is well known as one of the biggest challenges to the practical use of speech and face recognition
0:00:55systems,
0:00:56and it really refers to changes in effects between the training of a system
0:01:00and successive detection attempts.
0:01:04Variability can be grouped into two types.
0:01:06One is extrinsic, that is, something separate from the speaker; this
0:01:10includes things like microphones, the acoustic environment, and the transmission channel.
0:01:14The other type is intrinsic, and this is to do with the speaker themselves,
0:01:18who will vary over time. The things that introduce variability here
0:01:23include health, stress, their overall state, and speaking style.
0:01:27These differences are collectively referred to as domain mismatch
0:01:31when we're looking at the differences between the system training data
0:01:36and detection attempts.
0:01:40Now, as many of us know, domain mismatch typically results in a performance loss in the system.
0:01:44This is a loss with respect to the expected performance of the system:
0:01:49once the system is trained and evaluated, we have a certain estimate of how it should perform.
0:01:54If the domain then changes when the system is deployed,
0:01:57we see a performance loss
0:01:59due to this domain mismatch.
0:02:01Now, this loss is due to two different things.
0:02:04One is discrimination loss:
0:02:06that means less ability in the system to separate speakers.
0:02:09The other is miscalibration. When a system is miscalibrated,
0:02:12it gives a score
0:02:14that may actually mislead the user into believing that something was present that shouldn't have
0:02:19been detected.
0:02:22Domain adaptation can be used to cope with this problem,
0:02:24and there are two different ways of dealing with domain adaptation. One is supervised:
0:02:28this is where we have labeled data,
0:02:31and where we often get improvement, or reliable improvement,
0:02:35but there is a high cost to the human-labeled data that you end up
0:02:38needing to improve the system.
0:02:40The alternative is unsupervised adaptation.
0:02:43It has a very low cost,
0:02:45there's actually no end-user labeling at all,
0:02:48there's plenty of data available,
0:02:49and it's ideally matched to the deployment conditions.
0:02:52The downside here is that we have no ground-truth labels to rely on.
0:02:59The focus of this work
0:03:01is the unsupervised adaptation scenario.
0:03:06There are some shortcomings of unsupervised adaptation.
0:03:09One is a lack of generalisation.
0:03:11Quite often decisions have to be made in how to apply an unsupervised
0:03:17approach. For instance, if we're going to retrain the
0:03:20LDA or PLDA of the system,
0:03:22we end up needing to make some kind of assumptions about
0:03:25which clusters different audio segments are going to belong to with respect to different speakers.
0:03:30It can also be overfit to the data being trained with.
0:03:35Trustworthiness is another one:
0:03:37when do we get guarantees for improvements from unsupervised adaptation? They are limited, as we'll see.
0:03:42Then there is complexity:
0:03:44some approaches have high computation, and that makes it a little bit more difficult to
0:03:48give to
0:03:49clients or users once the system goes out the door.
0:03:54So the question we tried to answer is: where is the best place to
0:03:57apply adaptation
0:03:59in the unsupervised scenario,
0:04:01where it can be fast and reliable once deployed?
0:04:04On screen here we have a diagram of the different stages of a speaker recognition
0:04:09pipeline,
0:04:10and we can look at what would happen if we applied adaptation at each of these
0:04:14stages.
0:04:15Take the feature extraction, the MFCCs or power-normalized cepstral coefficients,
0:04:22or the speaker embeddings network.
0:04:23If someone was to tune that,
0:04:25as highlighted here,
0:04:27it typically requires full retraining of both the DNN and the back-end modules.
0:04:31You need to have a lot of data on hand ready to do that process,
0:04:34so that's hard to explore.
0:04:38What about speech activity detection?
0:04:40Now, there are approaches that adapt this stage
0:04:44to different scenarios.
0:04:46This is useful when that is actually where the domain sensitivity lies,
0:04:49but it's only a partial solution: it doesn't really help the discrimination in the rest of
0:04:54the pipeline.
0:04:57LDA, PLDA, and calibration:
0:05:00now, these are some of the core components of the back-end process,
0:05:03but they tend to require labels or predictions from clustering, and these can be error-prone
0:05:08or affected by a poor decision.
0:05:11Length normalization: well, there's no actual adaptation to be done there, so it's not applicable.
0:05:16Which leads us to mean normalization.
0:05:18Now, this is simple to adapt: the parameters are, in general, typically about 200
0:05:23numbers, the dimensionality of the PLDA input,
0:05:25and it requires relatively little data.
0:05:28So what is the role of mean normalization in a system?
0:05:32Now, PLDA is a strong model when the assumptions of the
0:05:36PLDA model hold,
0:05:37that is, the distribution of the data going into it
0:05:40fits a standard normal distribution.
0:05:43In training,
0:05:44we aim to ensure that:
0:05:46mean normalization
0:05:47and length normalization together achieve this,
0:05:50so the assumptions are fulfilled
0:05:52when the system is trained.
0:05:55Length normalization actually projects the embeddings onto the unit hypersphere,
0:06:00and there's a diagram right here that demonstrates that:
0:06:03the embeddings are evenly spread around the hypersphere.
0:06:06This assumes a zero mean, which works well during training.
0:06:11The problem is a shift of the domain
0:06:12away from that of training,
0:06:15such as with evaluation data.
0:06:18This is shown in this diagram here:
0:06:20length normalization then produces embeddings on the unit hypersphere with a distribution that is
0:06:24not evenly spread.
0:06:26Therefore the assumptions of the PLDA model are not fulfilled anymore,
0:06:30and we actually reduce the discrimination ability of the model.
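As a minimal sketch (not from the talk; names are illustrative), the centering and length-normalization steps described above might look like this in Python with NumPy:

```python
import numpy as np

def mean_length_normalize(embedding, mean):
    """Center an embedding with a (possibly adapted) mean, then length-normalize
    it onto the unit hypersphere, bringing it closer to the Gaussian assumptions
    of the PLDA back end."""
    centered = embedding - mean
    return centered / (np.linalg.norm(centered) + 1e-12)
```

If `mean` is mismatched to the evaluation domain, the centered embeddings end up unevenly spread on the hypersphere, which is exactly the problem illustrated in the diagram.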
0:06:36So let's look at actual performance here, when we look at the difference between using a
0:06:41system-based mean,
0:06:42where we have taken the mean from the actual training data, versus the
0:06:46impact if we actually adjust the mean of the system
0:06:51to a held-out dataset
0:06:53of relevant conditions for the data we benchmark.
0:06:57Now, there are more details on the evaluation protocol and the datasets used here later
0:07:01on in the presentation,
0:07:02but for now this is a quick snapshot of what happens if you simply update
0:07:06the mean of the PLDA input to a matched dataset.
0:07:10We actually see the equal error rate improve by up to 19 percent,
0:07:14really helping the discrimination ability.
0:07:16But even more impressive
0:07:17is the fact that Cllr, that's the cost of the likelihood ratio,
0:07:21an indication of both discrimination and calibration performance,
0:07:25improves by up to 60 percent.
0:07:27Now, this is despite holding the calibration model fixed and mismatched to the
0:07:33other conditions.
0:07:34In particular, the calibration model here is trained on the RATS source data,
0:07:38that's clean telephone data,
0:07:40and yet on the SRE datasets and Speakers in the Wild
0:07:44it is dramatically helping calibration.
0:07:46So having a relevant mean really is crucial.
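For reference (the talk only names the metric), Cllr is commonly defined over target and non-target log-likelihood-ratio scores $s$ as:

```latex
C_{\mathrm{llr}} \;=\; \frac{1}{2}\left(
  \frac{1}{N_{\mathrm{tar}}}\sum_{i\,\in\,\mathrm{tar}} \log_2\!\bigl(1 + e^{-s_i}\bigr)
  \;+\;
  \frac{1}{N_{\mathrm{non}}}\sum_{j\,\in\,\mathrm{non}} \log_2\!\bigl(1 + e^{s_j}\bigr)
\right)
```

Lower is better, and it penalizes both poor discrimination and poorly calibrated scores.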
0:07:52Now, on to adaptive mean normalization.
0:07:55So what we've got so far,
0:07:58a mean adapted to held-out data, is suitable when evaluation conditions are homogeneous. If we deploy a
0:08:03system and we know generally what it will be used for, and it's not going to vary much
0:08:07from that,
0:08:08then that approach is okay.
0:08:10The problem comes in when conditions can vary over time or between trials.
0:08:15For instance, when dealing with radio broadcasts,
0:08:18conditions vary over time depending on the signal and the time of day, or maybe a system is
0:08:23being used for both telephone and microphone-style
0:08:26calls.
0:08:27Then we end up having the distribution shown here on the bottom right, where we have different
0:08:32means
0:08:33and can see how each projects onto the unit hypersphere.
0:08:36This means that, ideally, what we would love to be able to do here is actually
0:08:41adapt the mean
0:08:42depending on the conditions of the trial at hand.
0:08:45So that means we want to dynamically define the mean
0:08:48as trials come into the system.
0:08:51That's what we propose with our method of adaptive mean normalization.
0:08:57So what is it?
0:08:58Well, this process actually stemmed from ideas in trial-based calibration,
0:09:02and what trial-based calibration does is wait until trial time to define
0:09:06the system parameters,
0:09:08in particular the calibration model parameters.
0:09:11It actually looks at the conditions at hand
0:09:14of the trial coming in, on both sides, the enrollment and test audio,
0:09:18defines different subsets for those
0:09:22conditions,
0:09:22and then trains a calibration model on the fly using held-out data.
0:09:28So the goal of the system here
0:09:30is to make the system model as general
0:09:33and reliable as possible.
0:09:37Now, one extra advantage here is that over time, as the system is seeing more and
0:09:41more conditions and more relevant data,
0:09:44it can accumulate knowledge about the new conditions over time.
0:09:48Here's the process.
0:09:50So, taking a bit of a zoomed-in view of the pipeline, the embeddings, mean
0:09:54normalization, length normalization, PLDA and calibration on the left-hand side,
0:09:58what has changed
0:10:00is the mean normalization and the adaptive process.
0:10:04Where there used to be a single static mean,
0:10:07what we're now doing is computing the mean on the fly.
0:10:10So in fact this is an embedding-
0:10:14specific process,
0:10:15not a trial-specific process, which is a bit of a benefit here
0:10:18in terms of computation.
0:10:20For each embedding, what we do is compute a comparison
0:10:24against a bunch of candidate
0:10:26embeddings.
0:10:27What we want to do is find those embeddings from the candidate pool that are similar in
0:10:31condition to our embedding that's coming in to be adaptively mean normalized.
0:10:36We make a selection of that subset,
0:10:39we then find the condition mean based on that subset,
0:10:42and, based on how many samples we find and how many we would like to
0:10:46find, we do a weighting process.
0:10:50Then we use that mean
0:10:52as the adapted mean for that embedding,
0:10:55which then follows on through the rest of the pipeline.
0:10:59So what we're trying to do here is
0:11:02make this happen on the fly, and in fact it has very little overhead.
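As a rough sketch of the per-embedding procedure just described (not the authors' code: the cosine similarity and the linear weighting below are assumptions, since the talk does not spell out the exact similarity measure or weighting formula):

```python
import numpy as np

def adaptive_mean(cond_emb, pool_cond_embs, pool_spk_embs,
                  system_mean, sim_threshold, max_candidates):
    """Estimate an adapted mean for one incoming embedding from a candidate pool.

    cond_emb       : condition embedding of the incoming segment
    pool_cond_embs : condition embeddings of the candidate pool, shape (P, dc)
    pool_spk_embs  : speaker embeddings of the candidate pool, shape (P, ds)
    system_mean    : static mean estimated on the system training data
    """
    # Condition similarity against every candidate (cosine used for illustration).
    sims = pool_cond_embs @ cond_emb / (
        np.linalg.norm(pool_cond_embs, axis=1) * np.linalg.norm(cond_emb) + 1e-12)

    # Keep candidates above the similarity threshold ...
    selected = np.flatnonzero(sims > sim_threshold)
    # ... and cap the selection at the N most similar ones.
    if selected.size > max_candidates:
        selected = selected[np.argsort(sims[selected])[-max_candidates:]]

    if selected.size == 0:
        return system_mean  # no relevant candidates: fall back to the system mean

    dynamic_mean = pool_spk_embs[selected].mean(axis=0)

    # The closer n is to N, the more we trust the dynamically estimated mean.
    w = selected.size / max_candidates
    return w * dynamic_mean + (1.0 - w) * system_mean
```

The adapted mean would then be used in place of the static mean in the centering and length-normalization step shown earlier.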
0:11:08There are some ingredients that we need for adaptive mean normalization.
0:11:12Now, in terms of making a comparison between an embedding and a candidate, we need something
0:11:17that can
0:11:18tell us whether the conditions of those embeddings are similar or not.
0:11:22For this we use condition embeddings,
0:11:24and these are really like what we use for speaker recognition,
0:11:28except instead of discriminating speakers,
0:11:30the network is trained to discriminate different conditions.
0:11:33The conditions include compression type,
0:11:35reverberation, noise type,
0:11:37language, and gender.
0:11:39When we combine those together we actually end up with around 11,000 unique
0:11:43conditions,
0:11:45so it should be a very thin slice of variability
0:11:47that we're dealing with for each one.
0:11:49Next we need a pool of candidate embeddings,
0:11:52and this is just a large mixture of conditions, anything we can get hold of really,
0:11:56and ideally it includes some examples of the evaluation conditions.
0:12:00Now, if that's not the case, what the system could actually do
0:12:05after it is deployed is use test data along the way to grow the pool of candidate
0:12:10embeddings
0:12:11to be more suited to the conditions.
0:12:15This pool is used to dynamically estimate the means of conditions.
0:12:19Finally, there are two parameters. One is the condition similarity threshold: we don't want everything
0:12:24from the candidate pool coming through, so we want to determine how similar each candidate
0:12:29is and make sure it's similar enough
0:12:32to be passed to the next stage, the mean estimate.
0:12:35The other thing that we want to set is the maximum number of candidates to select.
0:12:39Now, if everything in the candidate pool
0:12:42was above the threshold,
0:12:45everything would be passed through and maybe we'd get no benefit over the non-adapted
0:12:49system,
0:12:51so we want to make sure that we limit that somehow: we just select the
0:12:54top N of those.
0:12:55So if we then go back to our picture here, we can fill in a
0:12:58few different things.
0:13:00For instance, the comparison is now done with the condition embeddings.
0:13:04We then do the selection process, where n is the number of candidates with
0:13:08similarity
0:13:09above the threshold.
0:13:11However, if n
0:13:12is more than the maximum we allow,
0:13:14which is N,
0:13:16we keep the N with the highest similarity,
0:13:19thereby making sure we keep the most relevant ones for our mean estimate.
0:13:24Once we estimate the mean,
0:13:26we go on to do a weighted average
0:13:28with the system mean,
0:13:30and that weighted average
0:13:31means that
0:13:34the closer we get to that N value, the more we rely on
0:13:38the dynamically estimated mean,
0:13:40whereas we default back to the system mean in the case that no relevant
0:13:44samples could be found.
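The exact weighting isn't spelled out in the talk; one plausible form of the weighted average just described, with n selected candidates out of a maximum of N, is:

```latex
\mu_{\text{adapted}} \;=\; \frac{n}{N}\,\mu_{\text{dynamic}} \;+\; \Bigl(1 - \frac{n}{N}\Bigr)\,\mu_{\text{system}},
\qquad 0 \le n \le N
```

so that n = 0 recovers the static system mean and n = N fully trusts the condition-specific estimate.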
0:13:48Now, several benefits of adaptive mean normalization.
0:13:52As I said, it has very minimal overhead,
0:13:55and that overhead is defined by the number of candidate examples that it has to
0:13:59compare against.
0:14:01It is also applied per embedding instead of per trial, which is what is usually done in trial-
0:14:05based calibration,
0:14:06and that tends to do a lot in terms of reducing computation.
0:14:12It accounts for the case of no relevant examples, where it just reverts to the
0:14:16system mean via the
0:14:18weighted average.
0:14:20New enrollment audio or test audio could actually be collected over
0:14:24time into the candidate pool,
0:14:26and this allows the most relevant, changing conditions to be accounted for once the system is
0:14:30deployed.
0:14:33Now, it's a simple process,
0:14:35with the parameters being under 200 numbers that are changing here.
0:14:39It also weights against the system mean, so it makes it a little bit more difficult
0:14:42to overfit, which is a real benefit.
0:14:45And finally,
0:14:46what we find quite impressive here is that it allows a single
0:14:50static calibration model to be applied across conditions. Now, that's exactly the problem that
0:14:55trial-based calibration was trying to solve
0:14:58by adapting the calibration model.
0:15:00We've gone a step further than that and allowed the front of the system to adapt the mean,
0:15:05which leaves the calibration model as just a static one,
0:15:08without the cost.
0:15:09In other words, ensuring mean normalization there
0:15:12allows the PLDA assumptions to be fulfilled,
0:15:15so a calibration model after PLDA scoring
0:15:18is also suitable.
0:15:21Let's take a look at the experiments.
0:15:24First of all, the baseline system we used here is the SRI
0:15:30team submission for the NIST SRE
0:15:34evaluation. This system involves 16 kHz
0:15:37power-normalized cepstral coefficients
0:15:39and multi-band training of the embeddings.
0:15:42Multi-band means that we trained the embedding system with both 8 kHz and
0:15:4616 kHz data: any time we had 16 kHz data,
0:15:49we also downsampled it to 8 kHz as well,
0:15:52so that the DNN was exposed to both 8 kHz and 16 kHz versions of the same
0:15:57audio segments.
0:15:58That tended to help bridge the gap between 8 kHz and 16 kHz evaluation data.
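A minimal sketch of the downsampling augmentation described here (illustrative only; the actual feature pipeline and resampler are not specified in the talk):

```python
from scipy.signal import resample_poly

def add_downsampled_copy(wave_16k):
    """Given a 16 kHz waveform, also produce an 8 kHz copy so the embedding
    DNN is trained on both bandwidths of the same audio (multi-band training)."""
    wave_8k = resample_poly(wave_16k, up=1, down=2)  # 16 kHz -> 8 kHz
    return wave_16k, wave_8k
```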
0:16:06We trained on the standard datasets;
0:16:08the references for those datasets are in the paper,
0:16:10and we did the standard augmentation, of course.
0:16:13Now, as mentioned before, the calibration model here is trained on the RATS source data.
0:16:18That is from the DARPA RATS program,
0:16:22but it is the relatively clean telephone source data, not the retransmitted data, which is heavily degraded.
0:16:28In terms of evaluation,
0:16:30we split our held-out sets into evaluation and norm sets.
0:16:35For the NIST SRE corpora,
0:16:372016 and 2019,
0:16:39these have their own norm sets
0:16:41available with them, known as the unlabeled data, as you can see in the table. For
0:16:47Speakers in the Wild,
0:16:48we used the eval portion
0:16:49for evaluation and the dev set
0:16:51for the norm set;
0:16:53again, the speakers for these are disjoint.
0:16:55For the RATS source data, we actually split this set ourselves to have two different speaker
0:17:00pools,
0:17:01one for evaluation and one for the normalization step.
0:17:07In terms of the adaptive mean normalization parameters,
0:17:10we set a condition similarity threshold of ten,
0:17:14and the maximum number of candidates was set to half the number of candidate samples
0:17:18in the pool
0:17:20of candidates.
0:17:21These were set by searching on the RATS source data.
0:17:24You can see how many segments were available for each norm set,
0:17:29including the pooled set,
0:17:31which we use initially.
0:17:33The value of N is the number of candidates that we're trying to get,
0:17:37and that value of N, remember,
0:17:39also drives the
0:17:40weighted average
0:17:42with the dynamic mean: the closer we can get to it,
0:17:46the more we rely on the adapted mean.
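Using the earlier sketch, the configuration reported here might look roughly like this (the pool size is illustrative and varies per norm set, and the threshold of 10 is on the scale of the authors' condition-similarity score, which may differ from the cosine stand-in used in the sketch):

```python
SIM_THRESHOLD = 10                 # condition similarity threshold from the talk
POOL_SIZE = 1200                   # illustrative; the pools had on the order of 1,200+ segments
MAX_CANDIDATES = POOL_SIZE // 2    # N set to half the candidate pool

# adapted = adaptive_mean(cond_emb, pool_cond_embs, pool_spk_embs,
#                         system_mean, SIM_THRESHOLD, MAX_CANDIDATES)
```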
0:17:54Let's look at the out-of-the-box performance.
0:17:57So we've got here the four different datasets: SRE16, SRE19, Speakers in
0:18:01the Wild, and the RATS clean telephone data.
0:18:05For the baseline system here, the mean norm is
0:18:08simply the mean that was estimated during the training of the system,
0:18:13and the calibration model is RATS.
0:18:15Now, what happens
0:18:17if we go and look at
0:18:19adapting the calibration model
0:18:22to the actual eval set?
0:18:23This is a cheating experiment on the right-hand side:
0:18:26essentially what we're doing is replacing the RATS calibration model
0:18:31with the eval-set calibration model.
0:18:36What we can see here
0:18:38is that we're getting much better calibration performance,
0:18:41because for some datasets the calibration model trained on
0:18:44RATS doesn't seem to match too well.
0:18:48The equal error rates tend to vary wildly between these different datasets,
0:18:53but the calibration is considerably better compared to the baseline.
0:18:56But note the low Cllr value of 0.147 for RATS,
0:19:00because it is matched to the RATS data used in the calibration model.
0:19:09Let's look at the impact of relevant mean normalization.
0:19:13Now, earlier on in this presentation we showed the first two columns here, the
0:19:17baseline and the condition-based mean normalization.
0:19:20Now we bring in the third column,
0:19:22the adaptive mean normalization.
0:19:26What we found when using adaptive mean normalization with a pool
0:19:30of the held-out datasets pooled together, the SRE16, SRE19,
0:19:34Speakers in the Wild and RATS data,
0:19:36with the held-out datasets as one big candidate pool,
0:19:41was that adaptive mean normalization was able to outperform the condition-specific mean normalization in
0:19:46the heterogeneous conditions,
0:19:48so in particular the Speakers in the Wild
0:19:52and the SRE datasets.
0:19:54The calibration performance there is improving quite significantly in some cases,
0:20:00and for SRE 2019 it's a nice improvement.
0:20:05The equal error rate of SRE16 also improves
0:20:09quite reasonably.
0:20:11Now, what's interesting is the adaptive process didn't really hurt the matched RATS condition,
0:20:16so that's a benefit there as well.
0:20:20But now, what about data requirements?
0:20:23How much data do we actually need in the candidate segments to make
0:20:26adaptive mean normalization work?
0:20:29What we've done on this slide is look at the Cllr,
0:20:33which, remember,
0:20:37is how we measure the discrimination performance and calibration performance together.
0:20:41The dashed lines are the baseline performance across the four different datasets.
0:20:45The solid lines are what happens as we vary the number of candidate
0:20:49segments.
0:20:50Now, remember, these pools originally had at least
0:20:531,200 samples.
0:20:55Here we're doing this in a dataset-specific scenario,
0:20:59where, for instance, for
0:21:00SRE16
0:21:03the candidate pool is the actual unlabeled data from SRE16, so
0:21:07it's suited to the conditions, and we're randomly selecting from that held-out pool.
0:21:14What we see here is that
0:21:16quite rapidly, after 32 relevant segments
0:21:20in the candidate pool, we're already
0:21:22in front of the baseline Cllr,
0:21:24so that's already sufficient for a significant Cllr improvement, as we saw earlier.
0:21:29This also happens for the equal error rate in terms of the trend,
0:21:33though not quite so much of a relative gain,
0:21:36but again, 32 relevant segments from the target domain
0:21:40was enough for this adaptive process to get a good gain.
0:21:47Now, importantly, what happens when the adaptive mean normalization
0:21:51we employ
0:21:53has a candidate pool that is mismatched
0:21:56to the conditions that are going to be evaluated?
0:22:00We wanted to see what happens in this case, so what we did for each
0:22:03dataset we're benchmarking here is
0:22:06we excluded the relevant data
0:22:09from the candidate pool for that set.
0:22:12So, for instance,
0:22:13with the RATS dataset, down the bottom of the table,
0:22:15we actually excluded it from the pool and just retained Speakers in the
0:22:20Wild
0:22:20and the two SRE datasets in the candidate pool,
0:22:23and that's all it had to select from
0:22:25in order to estimate the mean and adapt the mean in the system.
0:22:29Now, remember, when it can't actually find anything that it thinks is relevant,
0:22:33it falls back to the system mean,
0:22:36so we would hope that the performance is the same
0:22:40as the baseline system,
0:22:42or better.
0:22:45What we can see is that Speakers in the Wild
0:22:48and RATS actually perform reasonably well.
0:22:51There was an improvement still with Speakers in the Wild, which is surprising;
0:22:55with RATS
0:22:55it still degraded, but just a little bit.
0:22:59SRE16
0:23:00degraded quite severely with respect to the baseline
0:23:03without any relevant data for the mean.
0:23:07We tried varying the selection threshold here in the hope that, using a
0:23:10higher threshold,
0:23:12we would restrict the subset selection to really
0:23:15the closest ones possible,
0:23:17but this didn't help.
0:23:18So this indicates there was a problem with the condition PLDA, or that the conditions
0:23:23in the audio pool weren't quite optimal for selection
0:23:27in this mismatched scenario.
0:23:32So, in summary, we proposed adaptive mean normalization.
0:23:36It's simple and effective, leveraging the test conditions where possible.
0:23:41It's useful with just a handful of samples; in fact, as few as
0:23:4532 samples of speech.
0:23:48For discrimination we saw improvements of up to about 26 percent, and in terms of calibration, measured through the
0:23:54Cllr,
0:23:55we saw improvements of up to
0:23:5766 percent relative over the baseline system.
0:24:01And what's important here is that it actually allows a static calibration model to
0:24:04become suitable for varying conditions;
0:24:07that's a real benefit once your system goes out the door.
0:24:11In terms of future work, we identified a couple of things.
0:24:14We want to enhance the selection method to be robust when relevant data is missing, as in that
0:24:18mismatch experiment we did at the very last,
0:24:20and we also want to do experiments in how active learning over time
0:24:25can improve the calibration pool, sorry, not the calibration pool, the candidate pool,
0:24:30by collecting live test data
0:24:32over time that's relevant to the examples and retaining a recent history.
0:24:37That ends the presentation. I'd be happy to hear any remarks or questions from anyone.
0:24:42Thank you.