0:00:15i speech
0:00:18i'm going to present our
0:00:21work on autoencoders in the
0:00:25i-vector space for speaker recognition
0:00:29well and
0:00:36let me start from the
0:00:38motivation
0:00:41and goals of our work
0:00:44down i would like to the and details are the only thing
0:00:51in particular i will focus on
0:00:54i've and
0:01:00and the
0:01:02a few words
0:01:05will be said about the back-end and scoring
0:01:08well the next section will be dedicated to
0:01:13an improved denoising autoencoder system i mean dropout regularization
0:01:19we tried to apply
0:01:21we tried to apply
0:01:23this technique
0:01:25and deep
0:01:27architectures will be considered in this section
0:01:31next the denoising autoencoder system in the domain mismatch
0:01:36scenario will be presented
0:01:38and the finally i will conclude
0:01:41my presentation
0:01:44okay let me start from
0:01:46our motivation and goals last year we published our work about an implementation of a denoising
0:01:53autoencoder
0:01:55for the speaker verification task
0:01:58and this system
0:02:01based on
0:02:04the DAE showed
0:02:07some improvement
0:02:08compared to the commonly used baseline system i mean PLDA on the raw i-vectors
0:02:15well and this motivated us to carry out
0:02:19a further investigation a detailed investigation
0:02:24and
0:02:26our goals were to study the properties of the DAE in the
0:02:31i-vector space
0:02:33to analyse different strategies of initialization and training of the back-end
0:02:38parameters
0:02:40to investigate dropout and to explore different deep architectures
0:02:47we offer
0:02:48and to investigate
0:02:50the DAE based system in domain mismatch conditions
0:02:57well now to
0:03:01the datasets and experimental setup we used in our work
0:03:05as you can see
0:03:08as training data we used telephone channel recordings from the NIST
0:03:13SRE
0:03:15corpora for evaluation we used the NIST
0:03:20SRE 2010 protocol condition five extended
0:03:23and our results are presented in terms of
0:03:28equal error rate and minimum detection cost function
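The equal error rate mentioned here can be illustrated with a minimal Python sketch. The function name and the plain threshold sweep are assumptions made for illustration; this is not the scoring tool used in the talk.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: the operating point where miss rate equals false alarm rate."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    # Every observed score is tried as a decision threshold.
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    # Miss rate: target trials scored below the threshold.
    fnr = np.array([(target_scores < t).mean() for t in thresholds])
    # False alarm rate: non-target trials scored at or above the threshold.
    fpr = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(fnr - fpr))
    return (fnr[i] + fpr[i]) / 2.0
```

Perfectly separated target and non-target scores give an EER of zero, while identical score sets give 0.5.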
0:03:33and now to our front end and
0:03:37i-vector extractor
0:03:40as you can see we used
0:03:44MFCCs with their first and second derivatives
0:03:50well our extractor was based on
0:03:53DNN posteriors
0:03:55with the eleven frames why thing
0:03:58we used
0:03:59two thousand and that's a silence at one hundred triphone states with twenty
0:04:05non-speech states
0:04:08instead of a
0:04:11hard VAD decision
0:04:12we tried to use a soft VAD solution using the DNN outputs
0:04:18well you can see this formula we
0:04:23tried to apply
0:04:25cepstral mean and variance normalization
0:04:28in this way in the statistics space
0:04:32well and you can see that only
0:04:35triphone states corresponding to the speech
0:04:38states are used to calculate
0:04:41the sufficient statistics
0:04:43finally four hundred dimensional i-vectors were extracted for
0:04:50our first experiments
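The soft decision described above can be sketched in Python as follows. The formula on the slide is not in the transcript, so everything here is an assumption: the function name, the array shapes, and the choice to weight each frame by its total speech-state posterior when computing the mean and variance.

```python
import numpy as np

def soft_vad_stats(feats, post, speech_mask):
    """Sufficient statistics with non-speech states excluded (illustrative sketch).

    feats: (T, D) frame features; post: (T, S) DNN state posteriors;
    speech_mask: (S,) boolean, True for states treated as speech.
    """
    g = post[:, speech_mask]          # posteriors of speech states only
    w = g.sum(axis=1)                 # soft speech weight of each frame
    total = w.sum()
    # Mean and variance weighted by the soft speech decision (assumed form).
    mean = (w @ feats) / total
    var = (w @ (feats - mean) ** 2) / total
    norm = (feats - mean) / np.sqrt(var + 1e-10)
    N = g.sum(axis=0)                 # zeroth-order statistics per speech state
    F = g.T @ norm                    # first-order statistics per speech state
    return N, F
```

A frame whose posterior mass is entirely on non-speech states contributes nothing to either the normalization or the statistics, which is the effect a hard voice activity decision would otherwise provide.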
0:04:56a few words about the DAE system and
0:05:00the training procedure
0:05:03to learn the denoising transform we
0:05:07do RBM pre-training generative pre-training
0:05:13with the contrastive divergence algorithm
0:05:20to train our
0:05:22denoising transform we
0:05:26we used the
0:05:29speaker and session dependent i-vectors and
0:05:34the mean
0:05:36the mean i-vector of all i-vectors of the same speaker
0:05:41i mean i s
0:05:44well and we modeled
0:05:47joint distribution of
0:05:53and then after training we unfold the RBM and
0:05:58fine-tune it
0:06:02to obtain a
0:06:04denoising autoencoder
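The training data described above pairs each session-dependent i-vector with the mean i-vector of its speaker. A minimal Python sketch of building such pairs is given below; the function name and array layout are assumptions, and the RBM pre-training and fine-tuning steps themselves are not shown.

```python
import numpy as np

def make_dae_pairs(ivectors, speaker_ids):
    """Return (inputs, targets): each target row is the mean i-vector of that row's speaker."""
    ivectors = np.asarray(ivectors, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    targets = np.empty_like(ivectors)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        # All sessions of one speaker share the same "clean" target.
        targets[mask] = ivectors[mask].mean(axis=0)
    return ivectors, targets
```

With this layout the network input carries the session variability and the target carries only the speaker-level average, which is what makes the learned transform a denoising one.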
0:06:11on the next slide i would like to present
0:06:15our systems
0:06:17under consideration
0:06:19well as you can see we used
0:06:22a conventional PLDA based system as our baseline
0:06:27with whitening and length normalisation
0:06:30pre-processing
0:06:34the next system is based on
0:06:36the RBM also with whitening and length normalisation
0:06:44pre-processing and finally
0:06:48our
0:06:51next system is a DAE based one
0:06:57well it's just an
0:06:59autoencoder which is
0:07:01fine-tuned from the RBM and this dashed arrow means the fine tuning procedure
0:07:10and
0:07:12about the arrow about the parameter transmission or substitution
0:07:18i will focus on that on my next slides
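The whitening and length normalisation pre-processing can be sketched in Python as below. The function names are assumptions, and the eigenvalue-based whitening is one common choice rather than necessarily the one used in the talk.

```python
import numpy as np

def fit_whitening(X):
    """Estimate the mean and a whitening matrix from the rows of X (n_samples, dim)."""
    mu = X.mean(axis=0)
    cov = np.cov(X - mu, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    evals = np.maximum(evals, 1e-10)   # guard against tiny or negative eigenvalues
    W = evecs / np.sqrt(evals)         # scale each eigenvector column by 1/sqrt(eigenvalue)
    return mu, W

def whiten_and_length_norm(X, mu, W):
    """Center and whiten the rows of X, then project each row onto the unit sphere."""
    Y = (X - mu) @ W
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)
```

Because `mu` and `W` are returned separately from the data they were estimated on, the same pair can be applied to a different set of vectors, which is the mechanism behind using parameters estimated on one system's output with another system.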
0:07:22it is very important
0:07:24right it just turned out to be important in our system
0:07:33we used the two-covariance model for scoring it can be viewed as a simple case of
0:07:37PLDA and the score can be
0:07:42expressed in terms of
0:07:45between speaker and within speaker covariance matrices
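A verification score under the two-covariance model can be sketched in Python as a log-likelihood ratio between the same-speaker and different-speaker hypotheses. The function name and the direct evaluation through stacked Gaussians are assumptions chosen for readability; practical systems usually use an equivalent closed form.

```python
import numpy as np

def two_cov_llr(x1, x2, B, W, mu=None):
    """LLR for two i-vectors; B is between-speaker and W within-speaker covariance."""
    d = x1.shape[0]
    if mu is None:
        mu = np.zeros(d)
    z = np.concatenate([x1 - mu, x2 - mu])
    T = B + W                                      # total covariance of one i-vector
    zero = np.zeros((d, d))
    same = np.block([[T, B], [B, T]])              # shared speaker: cross-covariance B
    diff = np.block([[T, zero], [zero, T]])        # different speakers: independent

    def log_gauss(v, S):
        _, logdet = np.linalg.slogdet(S)
        return -0.5 * (logdet + v @ np.linalg.solve(S, v) + len(v) * np.log(2 * np.pi))

    return log_gauss(z, same) - log_gauss(z, diff)
```

With a large between-speaker variance and a small within-speaker variance, two nearby vectors score positive and two opposite vectors score strongly negative.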
0:07:52a few words about the parameter substitution
0:07:56during our experiments
0:07:58in our work we figured out that
0:08:02the best performance of the DAE based system
0:08:10is achieved when we substitute
0:08:14whitening and PLDA back-end parameters from the RBM system
0:08:20to the DAE based system
0:08:24the denoising autoencoder based system
0:08:27well it's an empirical finding
0:08:29but it is very important
0:08:33for this system
0:08:35let me show you our first results
0:08:38well we tested the system
0:08:41on the NIST SRE 2010
0:08:44protocol and
0:08:45as you can see
0:08:47the gain
0:08:49we observed a gain
0:08:52over the baseline system when we applied our DAE based system with parameter replacement
0:08:59both for the
0:09:01commonly used NIST SRE 2010 protocol and our second
0:09:06corpus called RusTelecom take a look at the results
0:09:18some information about the
0:09:20RusTelecom corpus can be found on the slide
0:09:28for the analysis of the DAE based system we decided to use the cluster variability criterion
0:09:37e g
0:09:39it is also called for can not criteria
0:09:43well it is based on
0:09:46within speaker and between speaker covariance matrices
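A criterion built from within-speaker and between-speaker covariance matrices can be sketched in Python as below. The exact form shown on the slide is not in the transcript; the trace of the inverse within-speaker matrix times the between-speaker matrix is assumed here as one standard class-separability measure, and the function names are illustrative.

```python
import numpy as np

def scatter_matrices(X, labels):
    """Within-speaker and between-speaker covariance estimates from labelled rows of X."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for spk in np.unique(labels):
        Xs = X[labels == spk]
        ms = Xs.mean(axis=0)
        Sw += (Xs - ms).T @ (Xs - ms)            # spread around the speaker mean
        diff = (ms - mu)[:, None]
        Sb += len(Xs) * (diff @ diff.T)          # spread of speaker means
    return Sw / len(X), Sb / len(X)

def cluster_variability(X, labels):
    """Assumed criterion: trace(Sw^-1 Sb); larger values mean better separated speakers."""
    Sw, Sb = scatter_matrices(X, labels)
    return np.trace(np.linalg.solve(Sw, Sb))
```

Comparing this value across two projections of the same labelled data gives a back-end independent indication of which projection separates speakers better.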
0:09:50and if you
0:09:53take a look at this figure you can see that the
0:10:00autoencoder based projections have stronger cluster variability
0:10:09well and in this case we didn't apply any normalization for
0:10:16RBM and
0:10:17DAE based projections
0:10:22well about normalization i mean that no whitening
0:10:27was applied to the RBM and DAE
0:10:34and we decided to use cosine scoring
0:10:38as an independent estimate
0:10:40to assess
0:10:44the properties of our projections
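Cosine scoring needs no trained back-end parameters, which is why it can serve as an independent check of the projections. A minimal Python sketch, with an illustrative function name:

```python
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between an enrollment vector and a test vector."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

The score depends only on the angle between the two vectors, so rescaling either vector leaves it unchanged.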
0:10:49you can see from this result
0:10:51that with no whitening the DAE based system achieves
0:10:56the best performance among
0:10:59all the systems
0:11:01by the way we tried to use
0:11:04a simple
0:11:05autoencoder
0:11:10to try it in the speaker recognition field
0:11:13but it turned out to be
0:11:15not so good as the DAE system
0:11:23and now to the whitening and length normalization
0:11:27when we apply these parameters for the RBM and DAE based projections
0:11:33we obtain these results
0:11:37and
0:11:39we can see that
0:11:41the lines are very similar
0:11:43and close to each other
0:11:46in this situation where we applied
0:11:50DAE based
0:11:54whitening for
0:11:56the DAE based system
0:11:58it turned out to be
0:12:00not so good
0:12:02for the system
0:12:04and
0:12:05now on the next slide
0:12:07we applied parameter substitution so we decided to use the whitening parameters from the RBM
0:12:15system
0:12:17and
0:12:18in this situation we achieve good performance of the system
0:12:24as you can see
0:12:26one baseline
0:12:28and
0:12:30in the figure
0:12:32you can also see that
0:12:35the discriminative properties
0:12:37are in this case
0:12:41stronger for the DAE based projections
0:12:48to summarize i prepared
0:12:53a table with all the common results
0:12:58and among
0:13:02the systems the DAE based system with RBM parameter substitution i mean
0:13:09achieves the best performance
0:13:17and now to the
0:13:19PLDA based scoring
0:13:23in this table
0:13:24you can see the results we obtained in different experiments in different configurations
0:13:31of our system
0:13:32and again
0:13:33in the last line of
0:13:35the table you can see that a
0:13:38good improvement
0:13:40can be achieved by using
0:13:43parameter substitution from the RBM system
0:13:46but the question
0:13:48why it happens is still open for us and we didn't manage to answer it
0:13:59now
0:14:01we will discuss some improvements for the DAE based system
0:14:06and first we decided to apply
0:14:09dropout regularisation
0:14:11for both RBM training
0:14:14and
0:14:16for fine-tuning
0:14:18well as you can see
0:14:21dropout helps
0:14:23to improve the system
0:14:25when we used it in
0:14:27the orange where
0:14:29the RBM training stage
0:14:31RBM pre-training
0:14:33but unfortunately applying dropout at the stage of discriminative fine tuning wasn't helpful
0:14:39for us
0:14:42well now to the deep architectures we tried to use two schemes
0:14:50you can see the first one on the slide
0:14:53it is called stacking RBMs
0:14:59after training the first RBM
0:15:01its output can be used as an input for the
0:15:05next RBM
0:15:07and then we tried to fine-tune
0:15:09them all together
0:15:14well but it does not
0:15:17help to improve the system
0:15:21about the second scheme
0:15:23which is named stacking DAEs
0:15:28we managed to obtain good results
0:15:30but in this scenario we need to
0:15:33to you and or two
0:15:36substitute whitening parameters again from the RBM system
0:15:42from the generative pretrained system
0:15:45and we get a little bit of improvement from that
0:15:52and the
0:15:54next question i would like to focus on is
0:15:59the domain mismatch
0:16:01we investigated our DAE based system in
0:16:06the domain mismatch conditions
0:16:10well we used the domain adaptation challenge dataset
0:16:14and setup
0:16:16as a back end we used cosine scoring
0:16:19the two-covariance model referred
0:16:22to as PLDA and simplified PLDA with
0:16:27four hundred dimensional speaker subspace
0:16:29referred to
0:16:30as SPLDA
0:16:33it should be noted that in our experiments we completely ignored the labels of the
0:16:38in-domain data we used
0:16:41we used it
0:16:42only to estimate the
0:16:46whitening parameters of the systems
0:16:49well and now to the results
0:16:52you can see
0:16:54for the baseline
0:16:58system when we use in-domain data for training
0:17:03we obtain results for
0:17:05cosine scoring and you can see the gain from applying the
0:17:10DAE based system
0:17:13both for cosine and PLDA scoring
0:17:17but when we
0:17:22used out-of-domain data to train our systems
0:17:25you can see the degradation
0:17:29both for cosine and PLDA scoring
0:17:34and in
0:17:38this table
0:17:39you can see the improvement
0:17:41when we used whitening parameters from
0:17:45the in-domain data
0:17:51the same results but for the
0:17:53simplified PLDA scoring
0:17:56well i just little bit
0:18:01then you'll be
0:18:04and now to the conclusions
0:18:06we presented
0:18:09the study of the denoising autoencoder
0:18:12in the
0:18:13i-vector space
0:18:14we figured out that the
0:18:20observed performance of the DAE based system is due to
0:18:24employing back-end parameters directly from the RBM output
0:18:31the question is still open why the RBM transform provides better back-end parameters for this
0:18:39well dropout helps to improve the results when applied at the RBM
0:18:46training stage
0:18:47and doesn't help when we implemented it in fine tuning
0:18:54deep architecture in the form of stacked denoising autoencoders provides further improvement
0:19:03well and all our findings
0:19:06regarding the speaker verification system in matched conditions
0:19:10hold true in the
0:19:12mismatched condition case
0:19:16and the last one using whitening parameters from the target domain along with
0:19:22the DAE trained on the out-of-domain set
0:19:25allows us
0:19:27to avoid a significant
0:19:29performance gap
0:19:30caused by domain mismatch
0:19:32that's it
0:19:43top questions
0:19:57in the slide where you showed the stacked
0:20:01denoising autoencoders
0:20:03did you try more than two layers
0:20:09yes but in this
0:20:11case we need to inject whitening and length normalization between the layers
0:20:21this has five
0:20:24five i want to with whitening and length normalization injection
0:20:31i mean
0:20:32when you use two
0:20:39denoising autoencoders
0:20:40you improve the results
0:20:42so did you try the third one
0:20:50what do you know more than one at each other than the corrected where a
0:20:56i see
0:20:58well
0:21:00we decided
0:21:05not to
0:21:09go deeper in this because we found out that this result is very
0:21:15similar to our first one based on only one
0:21:32although we have probably discussed this issue about your question why copying the PLDA
0:21:38and the length normalization parameters from the RBM rather than
0:21:44the final stage gives better performance
0:21:49could it be an issue maybe of overfitting when you do the back propagation
0:21:55that you're doing since you're using the same set
0:21:59maybe therefore let's say the residual matrix that you're using in PLDA becomes artificially small
0:22:07in terms of trace let's say
0:22:10so have you checked maybe the traces of the two matrices
0:22:14the one that you estimate from the RBM and the one that you estimate after to
0:22:18see maybe
0:22:20the covariance matrices are sufficiently small
0:22:23which might be a result of overfitting
0:22:26well this is an assumption and we tried to check it after
0:22:34submitting our paper after our paper was submitted but we figured out
0:22:41that it is not the reason because we tried
0:22:46to split our dataset in two parts and to use separate data to train
0:22:54the DAE based system and the back-end parameters but
0:22:58the results
0:23:01show that
0:23:02it is not the reason
0:23:04it is not overfitting
0:23:07that occurred while we trained the system on the same data
0:23:15we also tried to
0:23:18explain the situation by
0:23:21using a gaussian distribution assumption well i mean
0:23:29det projection we can obtain
0:23:31no more or less
0:23:34goals and but less torsion the
0:23:38and that can be the
0:23:41in this case but
0:23:43seems to us
0:23:46but also it is not the answer
0:23:54there is time for another question
0:24:03just a question on the first step of your system
0:24:09you said that you are using twenty
0:24:13non-speech states i am quite amazed by this huge number could you say something about it
0:24:20you mean the huge number of non-speech states
0:24:25we have
0:24:29we use the
0:24:30standard Kaldi recipe from our speech recognition department
0:24:36and they
0:24:39you fast
0:24:40advised us to use this configuration of the system and we trained
0:24:48our DNN in this way
0:24:50and
0:24:51well it provides good
0:24:54voice activity detection for our system
0:24:57and as i
0:25:00mentioned we also used its
0:25:04capabilities to
0:25:06apply the soft VAD solution
0:25:10the soft VAD decision in the statistics space
0:25:15well i mean we
0:25:17we have done
0:25:20cepstral mean shift normalization in the statistics space
0:25:23by excluding non-speech
0:25:26well non speech is the problem our consideration
0:25:34let's thank the speaker again thank you