0:00:14 Hello everyone. This is a presentation from the LEAP lab at the Indian Institute of Science, Bangalore.
0:00:22 I will be presenting our work on a neural PLDA model for speaker verification.
0:00:30 This work was done jointly with my co-authors Prashant Krishnan and Sriram Ganapathy.
0:00:36 Let's look at the roadmap of this presentation.
0:00:40 First, we will look into what a speaker verification task consists of,
0:00:46 and move on to the motivation behind our work.
0:00:50 I will then talk about the front-end model that we used,
0:00:53 discuss various approaches to backend modeling,
0:00:57 before describing the proposed neural PLDA, or NPLDA, model,
0:01:03 and then present some experiments and results before concluding the presentation.
0:01:11 Let's look at what a speaker verification task consists of.
0:01:16 We are given a segment of an enrollment recording of a particular target speaker, and a test segment.
0:01:25 The main objective of the speaker verification system is to determine whether the target speaker is speaking in the test segment, which is the alternative hypothesis,
0:01:38 or is not speaking, which is the null hypothesis.
0:01:42 As you can see here, the enrollment recording, denoted by x_e, and the test recording, denoted by x_t, are given as input to the speaker verification system.
0:01:57 The system outputs a log-likelihood ratio score.
0:02:02 This score is used to decide whether the test segment belongs to the target speaker or a non-target speaker.
0:02:14 Most popular state-of-the-art systems for speaker verification consist of a neural embedding extractor; the most popular ones in the last few years have been the x-vector models.
0:02:28 This is followed by a backend generative model such as the probabilistic linear discriminant analysis, or PLDA.
0:02:38 There are also some discriminative backend approaches, like the discriminative PLDA and SVM backends.
0:02:45 What we propose is one neural network approach, which is discriminative as well as generative in structure, for backend modeling in speaker recognition and speaker verification tasks.
0:03:01 Let's look at the front-end model that we used.
0:03:05 As I mentioned, the most popular models in the last few years have been the x-vector extractors.
0:03:12 We trained our x-vector extractor on the VoxCeleb dataset, which consisted of 7,323 speakers.
0:03:23 This was trained using 13-dimensional MFCCs from 25-millisecond frames shifted every 10 milliseconds, using a 20-channel mel-scale filterbank.
0:03:35 This spans the frequency range 20 Hz to 7,600 Hz.
0:03:42 A five-fold augmentation strategy, which included augmenting the data using things like babble, noise, and music, was used to generate our 6.3 million training segments.
0:03:56 The architecture that we used to train the x-vector model was the extended TDNN architecture.
0:04:04 This consists of twelve hidden layers and ReLU nonlinearities.
0:04:10 The model is trained to discriminate among the speakers.
0:04:15 The first ten hidden layers operate at the frame level, while the last two layers operate at the segment level.
0:04:23 After training, the embeddings are extracted from the 512-dimensional affine component of the eleventh layer, that is, the first segment-level layer after the statistics pooling layer.
0:04:38 The embeddings extracted here are the x-vectors.
0:04:44 Let's look at a few approaches to backend modeling.
0:04:48 The most popular one in speaker verification systems is the generative Gaussian PLDA, or GPLDA.
0:04:58 Once the x-vectors are extracted, there are a few steps of processing done on them:
0:05:04 they are centered, that is, the mean is removed, then transformed using LDA, and then unit-length normalized.
0:05:14 The GPLDA model on this processed x-vector for a particular recording is given in equation (1),
0:05:22 where η_r is the x-vector for the particular recording,
0:05:27 y_r describes a latent speaker factor, which is Gaussian,
0:05:32 Φ characterizes the speaker subspace matrix,
0:05:36 and ε_r is a Gaussian residual.
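Written out from the definitions above, the generative GPLDA model of equation (1) takes the form:

```latex
\eta_r = \Phi\, y_r + \epsilon_r,
\qquad y_r \sim \mathcal{N}(0, I),
\qquad \epsilon_r \sim \mathcal{N}(0, \Sigma)
```

Here Σ is the covariance of the residual term, and the latent speaker factor y_r is shared across all recordings of the same speaker.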
0:05:40 Now, for scoring, a pair of these x-vectors is used:
0:05:44 there is one from the enrollment recording, denoted by η_e,
0:05:49 and one from the test recording, denoted by η_t.
0:05:54 These are used with the trained PLDA model in order to compute the log-likelihood ratio score given in equation (2).
0:06:03 Equation (2) is derived from equation (1), and P and Q are derived matrices.
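As a minimal sketch (not the authors' code), the quadratic scoring of equation (2) can be written as follows, assuming the matrices P and Q have already been derived from the PLDA parameters Φ and Σ; all names here are illustrative:

```python
import numpy as np

def gplda_llr(eta_e, eta_t, P, Q, const=0.0):
    """GPLDA log-likelihood ratio score of equation (2).

    eta_e, eta_t -- preprocessed enrollment and test x-vectors
    P, Q         -- matrices derived from the PLDA parameters (Phi, Sigma)
    const        -- additive term independent of the trial pair
    """
    return (eta_e @ Q @ eta_e          # within-vector term, enrollment
            + eta_t @ Q @ eta_t        # within-vector term, test
            + 2.0 * (eta_e @ P @ eta_t)  # cross term
            + const)
```

Since P is symmetric, the score is unchanged when the enrollment and test vectors are swapped, as expected of a verification score.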
0:06:11 Two other approaches to backend modeling are the discriminative PLDA
0:06:18 and the pairwise Gaussian backend.
0:06:20 The discriminative PLDA, or DPLDA, uses an expanded vector in order to represent a pair of enrollment and test x-vectors.
0:06:32 This expanded vector represents the trial, and is computed using a quadratic kernel, which is given in equation (3).
0:06:42 The final DPLDA log-likelihood ratio score is computed as the dot product of a weight vector and this expanded vector,
0:06:52 and the weight vector is trained discriminatively.
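As an illustration (the exact expansion of equation (3) may differ in detail), a common quadratic expansion used in discriminative PLDA makes the PLDA-style quadratic score linear in a single weight vector:

```python
import numpy as np

def quadratic_expansion(a, b):
    """Expand an (enrollment, test) x-vector pair so that a
    PLDA-style quadratic score becomes linear in a weight vector."""
    return np.concatenate([
        (np.outer(a, b) + np.outer(b, a)).ravel(),  # cross terms
        (np.outer(a, a) + np.outer(b, b)).ravel(),  # within terms
        a + b,                                      # linear terms
        [1.0],                                      # bias term
    ])

def dplda_llr(a, b, w):
    # LLR as the dot product of the learned weight vector
    # and the expanded trial vector
    return w @ quadratic_expansion(a, b)
```

Because each block of the expansion is symmetric in (a, b), the resulting score does not depend on which side is enrollment and which is test.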
0:06:56 The pairwise Gaussian backend models the pairs of enrollment and test x-vectors using Gaussian distributions.
0:07:06 The parameters of these Gaussians are estimated by computing the sample means and covariance matrices of the target and non-target trials in the training data.
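A minimal sketch of this backend (illustrative shapes and names, not the authors' implementation): fit one Gaussian to the stacked [enrollment; test] vectors of target trials and one to those of non-target trials, then score a trial by the log-likelihood ratio between the two.

```python
import numpy as np

def log_gauss(x, mu, cov):
    """Log-density of a multivariate Gaussian."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(cov, diff))

def fit_pairwise_gaussian(target_pairs, nontarget_pairs):
    """Sample mean and covariance of stacked [enroll; test] vectors,
    estimated separately for target and non-target training trials."""
    params = {}
    for name, pairs in (("tar", target_pairs), ("non", nontarget_pairs)):
        X = np.asarray(pairs)
        params[name] = (X.mean(axis=0), np.cov(X, rowvar=False))
    return params

def pairwise_gaussian_llr(eta_e, eta_t, params):
    x = np.concatenate([eta_e, eta_t])
    return log_gauss(x, *params["tar"]) - log_gauss(x, *params["non"])
```

On synthetic data where target pairs share a speaker factor, target trials receive higher log-likelihood ratios on average than non-target trials.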
0:07:18 Along with the NPLDA model that we propose, we report results on the generative Gaussian PLDA, the DPLDA, and the pairwise Gaussian backend.
0:07:32 Now let's look at the proposed neural PLDA backend architecture.
0:07:38 What we have here is a pairwise, Siamese-style discriminative network.
0:07:44 As you can see, the green portion of the network corresponds to the enrollment embeddings, and the other portion of the network corresponds to the test embeddings.
0:07:57 We construct the preprocessing steps of the generative backend as layers in the neural network:
0:08:07 the LDA as the first affine layer,
0:08:11 unit-length normalization as a nonlinear activation,
0:08:15 and then the PLDA centering and diagonalization as another affine transformation.
0:08:23 The final PLDA pairwise scoring, which is given in equation (2), is implemented as a quadratic layer.
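The layer stack above can be sketched as the following forward pass (parameter shapes and names are illustrative; in the actual model all of these parameters are learned jointly):

```python
import numpy as np

class NPLDASketch:
    """NPLDA backend: generative preprocessing unrolled as layers,
    followed by the quadratic scoring layer of equation (2)."""

    def __init__(self, W_lda, b_lda, W_diag, b_diag, P, Q):
        self.W_lda, self.b_lda = W_lda, b_lda      # affine layer (LDA)
        self.W_diag, self.b_diag = W_diag, b_diag  # affine layer (centering/diagonalization)
        self.P, self.Q = P, Q                      # quadratic scoring layer

    def _embed(self, x):
        h = self.W_lda @ x + self.b_lda
        h = h / np.linalg.norm(h)                  # length-norm as activation
        return self.W_diag @ h + self.b_diag

    def score(self, x_e, x_t):
        # Siamese: the same weights process enrollment and test inputs
        e, t = self._embed(x_e), self._embed(x_t)
        return e @ self.Q @ e + t @ self.Q @ t + 2.0 * (e @ self.P @ t)
```

Because the two branches share weights and P is symmetric, the output score is symmetric in the enrollment and test inputs.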
0:08:34 The parameters of this model are optimized using an approximation of the minimum detection cost function, which is known as the minDCF or C_min.
0:08:47 As the model is optimized to minimize the detection cost function,
0:08:53 we report results on the minDCF metric and the EER metric.
0:09:03 The normalized detection cost function, or DCF, is defined as C_norm(β, θ), which is equal to P_miss of θ plus β times P_FA of θ,
0:09:18 where β is an application-dependent weight,
0:09:22 and P_miss and P_FA are the probabilities of miss and false alarm, respectively.
0:09:29 A miss is when the model predicts a target trial to be a non-target one; that is, the model believes that the enrollment and test come from different speakers,
0:09:42 whereas a false alarm is when a non-target trial is wrongly predicted as a target one.
0:09:51 P_miss and P_FA are computed by applying a detection threshold of θ to the log-likelihood ratios.
0:09:59 How P_miss and P_FA are computed is given in equation (5).
0:10:05 Here, s_i is the score, or the log-likelihood ratio, output by the model for trial i.
0:10:13 t_i is the ground-truth variable for trial i:
0:10:18 t_i is equal to zero if trial i is a target trial,
0:10:23 and t_i is equal to one if it is a non-target trial.
0:10:28 𝟙 is the indicator function.
0:10:34 The normalized detection cost function is not a differentiable function of the model parameters, due to the discontinuity introduced by the indicator function,
0:10:45 and hence it cannot be used as an objective function in a neural network.
0:10:51 What we do to overcome this is propose a differentiable approximation of the normalized detection cost, by approximating the indicator function with a sigmoid function.
0:11:04 This approximation is given in equation (6).
0:11:07 Here, the approximations of the normalized detection cost are given by P_miss-soft and P_FA-soft, the soft detection costs.
0:11:19 t_i is the ground truth for trial i, s_i is the system output score, or the log-likelihood ratio,
0:11:29 and σ denotes the sigmoid function.
0:11:32 By choosing a large enough value for the warping factor α, the approximation can be made arbitrarily close to the actual detection cost function for a wide range of thresholds.
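As a minimal sketch (with illustrative names), the hard and soft detection costs compare as follows, using labels t_i = 0 for target and t_i = 1 for non-target trials, as above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def detection_cost(scores, labels, theta, beta=1.0, alpha=None):
    """Normalized detection cost C = P_miss + beta * P_fa at threshold theta.

    labels -- 0 for target trials, 1 for non-target trials.
    With alpha=None the hard indicator is used (equation 5); otherwise the
    indicator is replaced by a sigmoid with warping factor alpha, giving the
    soft, differentiable approximation (equation 6).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    tar, non = labels == 0, labels == 1
    if alpha is None:  # hard counts via the indicator function
        p_miss = np.mean(scores[tar] < theta)
        p_fa = np.mean(scores[non] >= theta)
    else:              # smooth counts via the sigmoid
        p_miss = np.mean(sigmoid(-alpha * (scores[tar] - theta)))
        p_fa = np.mean(sigmoid(alpha * (scores[non] - theta)))
    return p_miss + beta * p_fa
```

For scores away from the threshold, increasing α drives the soft cost toward the hard cost, while keeping the objective differentiable in the scores (and hence in the network parameters).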
0:11:47 Before we dive into the results, let's look at the datasets used in training and testing the backend model.
0:11:57 We sampled about 6.6 million trials from the clean VoxCeleb set, and additional trials from the augmented VoxCeleb set.
0:12:08 For testing, we report results on three datasets:
0:12:13 the Speakers in the Wild (SITW) eval core test condition, which consists of around 800,000 trials;
0:12:21 the VOiCES development set, which consists of around 4 million trials;
0:12:26 and the VOiCES evaluation set, which consisted of roughly three and a half million trials.
0:12:36 The table here shows the results on the SITW eval core, VOiCES development, and VOiCES evaluation sets for various models,
0:12:46 like the Gaussian PLDA backend, the DPLDA, the pairwise Gaussian backend, and the proposed NPLDA.
0:12:54 Along with the soft detection cost, we also ran our experiments with binary cross-entropy as the loss, which is denoted in the table as the BCE loss.
0:13:05 We observe relative improvements in terms of minDCF of around 31 percent, 20 percent, and 11 percent for SITW, VOiCES development, and VOiCES evaluation, respectively.
0:13:20 The best scores for SITW eval core are an EER of 2.05 percent and a minDCF of 0.2.
0:13:30 For the VOiCES development set, we get a best overall 1.91 percent EER, and 0.2 as the best minDCF.
0:13:39 For the VOiCES evaluation set, we get 6.01 percent EER as the best EER score, and 0.49 as the minDCF.
0:13:49 The improvements observed with the NPLDA model are consistent with data augmentation, as well as for the EER metric.
0:13:59 Training with the soft detection cost performs even better than the binary cross-entropy, or BCE, loss.
0:14:10 To summarize, the proposed model is a step in exploring a discriminative neural network model for the task of speaker verification.
0:14:21 Using a single elegant backend model that is targeted to optimize the speaker verification loss, the NPLDA model uses the extracted embeddings directly to generate the speaker verification score.
0:14:36 This model shows significant performance gains on the SITW and VOiCES datasets.
0:14:44 We have also observed considerable improvements on other datasets, like the NIST SRE datasets.
0:14:52 We have extended this as well to an end-to-end model, where the model is optimized not just from the extracted embeddings, but directly from acoustic features like MFCCs.
0:15:04 This work was accepted at Interspeech 2020.
0:15:10 These are some of the references.
0:15:16 Thank you.