0:00:15 Hello, my name is Daniel Garcia-Romero, and I will be presenting joint work with my colleagues from the Human Language Technology Center of Excellence at Johns Hopkins University. The title of our work is MagNetO: x-vector magnitude estimation network plus offset for improving speaker recognition.
0:00:41 The current state of the art in text-independent speaker recognition is based on DNN embeddings trained with a classification loss, for example multiclass cross-entropy. If there is no severe mismatch between the DNN training data and the deployment environment, the cosine similarity between embeddings from a system trained with an angular margin softmax provides very good speaker discrimination.
0:01:09 For example, in the most recent NIST SRE evaluation, which used audio extracted from online videos, the top-performing single system on the audio track was based on this paradigm.
0:01:26 Unfortunately, even though cosine similarity provides good speaker discrimination, directly using those scores does not allow us to make optimal decisions with a theoretical threshold, because these scores are not calibrated.
0:01:45 The typical way to address this problem is to use an affine mapping to transform the scores into log-likelihood ratios that are calibrated. This is typically done using linear logistic regression, where we learn two numbers: a scale and an offset. Looking at the top equation, the raw scores are denoted by s_ij, which is the cosine similarity between two embeddings. This can be expressed as an inner product of the unit-length embeddings, s_ij = x̃_i^T x̃_j, so it is nothing more than the inner product of two unit-length embeddings. Once we learn a calibration mapping with parameters a and b, we can transform this score into a log-likelihood ratio, and then we can make use of the Bayes threshold to make optimal decisions.
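As a concrete sketch of this standard recipe (the scale a, offset b, and prior below are made-up illustrative numbers, not values learned by any real system):

```python
import math

def cosine_score(x, y):
    """Cosine similarity: inner product of the length-normalized embeddings."""
    nx = math.sqrt(sum(v * v for v in x))
    ny = math.sqrt(sum(v * v for v in y))
    return sum(a * b for a, b in zip(x, y)) / (nx * ny)

def calibrate(s, a, b):
    """Affine calibration mapping raw score to a log-likelihood ratio: llr = a*s + b."""
    return a * s + b

def bayes_decision(llr, p_target):
    """Accept the trial iff the LLR exceeds the Bayes threshold log((1-p)/p)."""
    threshold = math.log((1.0 - p_target) / p_target)
    return llr > threshold

# Hypothetical values: a and b would be learned by logistic regression on held-out trials.
a, b = 8.0, -4.0
s = cosine_score([1.0, 0.0], [0.8, 0.6])   # 0.8
llr = calibrate(s, a, b)                    # 2.4
print(bayes_decision(llr, p_target=0.5))    # True (threshold is 0 for a prior of 0.5)
```

With a target prior of 0.5 the Bayes threshold is zero, so any positive LLR is accepted; lowering the prior raises the threshold.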
0:02:45 In this work we propose MagNetO. One way to look at it is to note that the affine scale a can be thought of as simply assigning a constant magnitude to the unit-length embeddings, so every embedding gets the same magnitude. Instead, we suggest that it is probably better for every embedding to have its own magnitude, and we want to use a neural network to estimate the optimal value of those magnitudes. We also use a global offset to complete the mapping to log-likelihood ratios. Note that this new approach may result in a non-monotonic mapping, which means that it has the potential to not only produce calibrated scores but also improve discrimination by increasing the separation between classes.
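A minimal numeric sketch of the difference (all magnitudes and offsets here are illustrative, not learned values): the global calibrator multiplies every cosine score by the same constant, while per-embedding magnitudes make the mapping trial-dependent.

```python
def llr_global(cos_ij, a, b):
    # Global linear calibration: the same scale a for every trial.
    return a * cos_ij + b

def llr_magnitude(cos_ij, m_i, m_j, b):
    # Magnitude-based score: inner product of the scaled unit-length embeddings
    # plus a global offset: (m_i * x_i) . (m_j * x_j) + b = m_i * m_j * cos_ij + b.
    return m_i * m_j * cos_ij + b

# Two trials with the same cosine score can now receive different LLRs, so the
# overall mapping from cosine score to LLR need not be monotonic.
print(llr_magnitude(0.5, 3.0, 3.0, -2.0))  # 2.5  (large magnitudes -> boosted)
print(llr_magnitude(0.5, 1.0, 1.0, -2.0))  # -1.5 (small magnitudes -> attenuated)
```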
0:03:43 To train this magnitude network, we use a binary classification task. We draw target and non-target trials from a training set, and the loss function is a weighted binary cross-entropy, where α is the prior of a target trial and l_ij is the log posterior odds, which can be decomposed in terms of the log-likelihood ratio and the log prior odds.
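This loss can be sketched in a few lines (the trial scores and prior below are toy values): the log posterior odds are the LLR plus the log prior odds, and the loss is the prior-weighted binary cross-entropy over the target and non-target trials.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def weighted_bce(target_llrs, nontarget_llrs, alpha):
    """Prior-weighted binary cross-entropy over trials.
    l = llr + log(alpha / (1 - alpha)) are the log posterior odds."""
    logit_prior = math.log(alpha / (1.0 - alpha))
    # -log sigmoid(l) = softplus(-l); -log(1 - sigmoid(l)) = softplus(l)
    loss_tar = sum(softplus(-(llr + logit_prior)) for llr in target_llrs) / len(target_llrs)
    loss_non = sum(softplus(llr + logit_prior) for llr in nontarget_llrs) / len(nontarget_llrs)
    return alpha * loss_tar + (1.0 - alpha) * loss_non

# A well-separated toy trial set gives a small loss; flipping the scores makes it large.
good = weighted_bce([5.0, 6.0], [-5.0, -6.0], alpha=0.5)
bad = weighted_bce([-5.0, -6.0], [5.0, 6.0], alpha=0.5)
print(good < bad)  # True
```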
0:04:13 The overall system architecture that we are going to use will be trained in three steps. On the left is a block diagram of our baseline architecture. We use a 2D convolutional ResNet architecture, followed by temporal pooling, to obtain a high-dimensional vector of pooled activations. We then use an affine layer as a bottleneck to obtain the embedding, which in our case has 256 dimensions. The star is used to denote the node where the embedding is extracted. The network will be trained using multiclass cross-entropy with a softmax classification head that uses an additive angular margin.
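The classification head can be sketched as follows. This is the generic additive-angular-margin (ArcFace-style) cross-entropy written from the usual textbook formulation, not our exact implementation, and the scale and margin values are illustrative:

```python
import math

def aam_softmax_loss(cosines, target, scale=30.0, margin=0.2):
    """Cross-entropy over additive-angular-margin logits.
    cosines: cos(theta) between the unit-length embedding and each class weight.
    The target class angle is penalized by the margin before re-scaling."""
    logits = list(cosines)
    theta = math.acos(max(-1.0, min(1.0, logits[target])))
    logits[target] = math.cos(theta + margin)
    logits = [scale * c for c in logits]
    # Standard log-softmax cross-entropy on the modified logits.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

# The margin makes the task harder: for the same cosines, the loss is higher
# than it would be with plain softmax (margin = 0).
with_margin = aam_softmax_loss([0.6, 0.4, 0.2], target=0, margin=0.3)
no_margin = aam_softmax_loss([0.6, 0.4, 0.2], target=0, margin=0.0)
print(with_margin > no_margin)  # True
```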
0:05:02 The first stage of the training process uses short segments to train the network. In the past we have seen this to be a good compromise, because the short sequences allow for good use of GPU memory with large batches, and at the same time they make the task harder, so that even though we have a very powerful classification head we still get errors whose gradients we can back-propagate. As the second step, we propose to freeze the most memory-intensive layers, which are typically the layers operating at the frame level, and then fine-tune the post-pooling layers with whole recordings, using all the frames of the audio recording, which might be up to two minutes of speech. By freezing the lower layers we reduce the memory demands, so we can use the long sequences, and we also avoid overfitting to the easier problem posed by the long sequences.
0:06:11 Finally, in the third step we train the magnitude estimation network. The first thing we do is discard the multiclass classification head and use binary classification instead. We use a Siamese structure, which is depicted here by drawing the network twice, but the parameters are the same; this is just for illustration purposes. Notice that we also freeze the affine embedding layer, which is denoted by the gray color. So at this point everything that produces the embeddings is fixed, and we are adding a magnitude estimation network that takes the pooled activations, which are very high-dimensional, and tries to learn a scalar magnitude to go along with the unit-length x-vector. It is optimized to minimize the binary cross-entropy. We also keep the global offset as part of the optimization problem.
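To make the freezing idea concrete, here is a toy sketch (the layer names and numbers are purely illustrative, not our network): frozen parameter groups are simply excluded from gradient updates, which is also what frees the memory needed to process full-length recordings.

```python
# Toy parameter store standing in for the real network. In stage 2 the
# frame-level layers are frozen; in stage 3 the embedding layer is frozen too.
params = {
    "frame_conv": [0.5, -0.2],   # memory-hungry frame-level layers
    "pooling":    [1.0],         # post-pooling layers, fine-tuned on full recordings
    "embedding":  [0.3, 0.7],    # affine bottleneck
}

def sgd_step(params, grads, frozen, lr=0.1):
    """Update every parameter group except the frozen ones."""
    for name, g in grads.items():
        if name in frozen:
            continue
        params[name] = [w - lr * gi for w, gi in zip(params[name], g)]

grads = {"frame_conv": [1.0, 1.0], "pooling": [1.0], "embedding": [1.0, 1.0]}

# Stage 2: freeze the frame-level layers, fine-tune the rest on full-length audio.
sgd_step(params, grads, frozen={"frame_conv"})
print(params["frame_conv"])  # [0.5, -0.2] -- unchanged
print(params["pooling"])     # [0.9]
```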
0:07:16 To validate our ideas we use the following setup. As our baseline system we use a modification of the ResNet-34 x-vector proposed in prior work. The modification we make is to allocate more channels to the early layers, because we have seen that this improves performance. At the same time, to control the number of parameters, we change the expansion rates of the different layers so that we do not increase the channels so much in the deeper layers; in this way we control the number of parameters without degrading performance. To train the DNN we use the VoxCeleb2 dev data, which comprises about six thousand speakers and a million utterances, and this is wideband, 16 kHz audio.
0:08:12 Note that we process the data differently when we use it with short segments versus full-length refinement, in terms of how we apply augmentations; I refer you to the paper for the details. Those are very important for good performance and also for generalization.
0:08:29 To make sure that we do not overfit to a single evaluation set, we benchmark against four different sets. Speakers in the Wild and VoxCeleb1 are actually a good fit to VoxCeleb2; there is not much domain mismatch between those two evaluation sets and the training data. The SRE19 audio-from-video portion and CHiME-5 have some domain mismatch compared to the training data, and I will comment on this in the results later. In the case of SRE19 this is mostly because the test audio comprises multiple speakers and there is a need for diarization. In the CHiME-5 case there are far-field microphone recordings with a lot of overlapped speech and higher levels of reverberation, so it is a very challenging setup. Also, the CHiME-5 results will be split in terms of a close-talking microphone and a far-field microphone.
0:09:31 Let's start by looking at the baseline system that we are proposing. We present results in terms of equal error rate and two other operating points; we do this to facilitate comparison with prior work. If you look at the right of the table, we list the best single-system, no-fusion numbers that we were able to find in the literature for all the benchmarks; not all of the costs were reported. Our baseline seems to do a good job compared to the prior work, improving performance at most of the operating points. Note that we are not doing any particular tuning for each evaluation set. There is one small caveat: as I said, SRE19 requires diarization, so we diarize the test segments, and then for each speaker cluster detected we extract an x-vector. We then score the enrollment against all the test x-vectors and take the maximum score.
0:10:43 To check the improvements that the full-length refinement brings in the second stage, we can compare against the baseline in this table. Overall we see positive trends across all the data sets and operating points, but the gains are larger for the Speakers in the Wild set. This makes sense because it is the set for which the evaluation data has a longer duration compared to the four-second segments that were used to train the DNN. This validates the recent findings in our Interspeech paper, in which we saw that full-length refinement is a good way to mitigate the duration mismatch between the training phase and the test phase.
0:11:36 Regarding the magnitude estimation network, we explored multiple topologies, all of them feed-forward architectures, and we explored the interaction of depth and width. Here I present three representative cases that change in terms of the number of layers and the width of the layers; the parameters go from 1.5 million to 20 million. When we compare performance for these three architectures across all the tasks, we do not see large changes; the performance is quite stable across networks, which is probably a strength. As a good trade-off between the number of parameters and performance, we pick the second architecture for the remaining experiments.
0:12:31 This slide presents the overall gains in discrimination due to the three stages. In the graphs, the horizontal axis shows the different benchmarks; we have split the far-field microphone results into a different plot just to facilitate the visualization, because they are in a different dynamic range. On the vertical axis we depict one of the cost operating points. The color coding indicates the system: one color for the baseline, orange for the full-length refinement applied to that baseline, and gray for the magnitude estimation applied on top of the full-length refinement. Overall we can see that the two extra training stages, the full-length refinement and the magnitude estimation, both produce gains, and we see that across all data sets. In terms of EER we are getting about a twelve percent gain, and for the other two operating points we are getting an average of about twenty-one percent gains. Even though I am only showing one operating point here, in the paper you can find the results for the other two operating points.
0:13:52 So finally, let's look into the calibration results. Both the global calibrator and the magnitude network are trained on the VoxCeleb2 dev dataset. This is a very good fit for the VoxCeleb1 and Speakers in the Wild evaluation sets, but it is not such a good match for CHiME-5 and SRE19, where the test segments are as described before. Using the global calibrator, we can see that we obtain good performance in terms of the actual cost versus the minimum cost for both VoxCeleb1 and Speakers in the Wild, but when we move to the other datasets we struggle to obtain good calibration with the global calibrator. Looking at the magnitude estimation network, we see a similar trend: for VoxCeleb1 and Speakers in the Wild we obtain very good calibration, but the system also struggles for the other sets. I think a fair statement is to say that the magnitude estimation does not fully deal with the domain shift, but it outperforms the global linear calibration at all the operating points and for all data sets.
0:15:06 To gain some understanding of what the magnitude estimation is doing, we did some analysis. The bottom plot on the right shows the histogram of the cosine scores for the non-target and target distributions; the red color indicates the non-target scores and the blue color indicates the target scores. The top two panels show the cosine score plotted against the product of the magnitudes for both embeddings involved in the trial, and the horizontal line indicates the global scale, or magnitude, that the global calibrator assigns to every embedding. The scores used for this analysis are from the Speakers in the Wild evaluation. Since the magnitude estimation network improves discrimination, we expect two trends. For the low cosine scores of target trials, we expect that the product of the magnitudes should be bigger than the global scale. On the other hand, for the high-cosine-score non-target trials, we expect the opposite: the product of the magnitudes should be smaller than the global scale. The expected trends are actually present in these plots. If we look at the top plot, we see that there is an upward tilt, and the magnitudes for the low cosine scores tend to be above the constant magnitude that would be assigned by the global calibrator. On the other hand, we see that a large portion of the non-targets are below the global scale, and the ones that are getting very high cosine scores are also quite attenuated. This is consistent with the observation that the magnitude estimation network improves discrimination.
0:17:10 So to conclude: we have introduced a magnitude estimation network with a global offset. The idea is to assign a magnitude to each one of the unit-length x-vectors that are trained with an angular margin softmax. The resulting scaled x-vectors can be directly compared using inner products to produce calibrated scores, and we have also seen that this increases the discrimination between speakers. Although the domain shift still remains a challenge, there are significant improvements: the proposed system outperforms a very strong baseline on the four benchmarks we mentioned. We also validated the use of full-recording refinement to help with the duration mismatch introduced between the training and test phases.
0:18:05 If you found this work interesting, I suggest that you also take a look at the concurrent work that my colleagues are going to present at this conference, since it is related. If you have any questions you can reach me at my email, and I look forward to talking with you in the live sessions. Thanks for your time.