0:00:15 This is a presentation from the LEAP lab, Indian Institute of Science, Bangalore.
0:00:22 I will be presenting our paper on the LEAP systems for the SRE 2019 CTS challenge: improvements and error analysis.
0:00:31 This work was done jointly with my colleagues at the LEAP lab.
0:00:42 Here is the outline of this presentation.
0:00:45 I will first give a brief overview of how speaker recognition systems work, and then describe the SRE19 challenge and its performance metrics.
0:00:56 I will then talk about the front-end and back-end modeling in our systems, discuss the results of these systems, and present some analysis of post-evaluation results before concluding the presentation.
0:01:12 This is a brief overview of how speaker verification, or speaker recognition, systems work.
0:01:19 In the first phase, we take the raw speech and extract features such as MFCCs.
0:01:26 These features are then processed with voice activity detection and normalization.
0:01:33 Then these features are given as input to train a deep neural network model.
0:01:40 The most popular neural network based embedding extractors in the last few years have been the x-vector models.
0:01:47 Once the extractor training phase is done, we enter the PLDA training phase.
0:01:54 These extracted x-vectors have some preprocessing done on them, such as centering and LDA; they are then unit length normalized before training the PLDA model.
0:02:06 Most popular state-of-the-art systems use a generative Gaussian PLDA model for the back-end system.
0:02:14 In the verification phase, we have a trial, which consists of an enrollment utterance and an utterance under test, and the objective of the speaker recognition system is to determine whether the test utterance belongs to the target speaker or a non-target speaker.
0:02:34 Once we extract x-vector embeddings for the enrollment and test utterances, we compute log-likelihood ratio scores using the PLDA back-end model, and then using these scores we determine if the trial is a target one or a non-target one.
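The back-end preprocessing and decision steps described above can be sketched as follows. This is a minimal illustration with made-up dimensions and random stand-ins for the training-set mean and LDA matrix, not the actual values from our systems:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical quantities learned during training (random stand-ins here):
# a 512-dim training-set mean and an LDA projection down to 150 dims.
mean = rng.normal(size=512)
lda = rng.normal(size=(150, 512))

def preprocess(xvec):
    """Center the x-vector, apply LDA, and unit length normalize."""
    y = lda @ (xvec - mean)
    return y / np.linalg.norm(y)

def decide(llr_score, threshold=0.0):
    """Declare the trial 'target' if the LLR score exceeds the threshold."""
    return "target" if llr_score > threshold else "non-target"

enroll = preprocess(rng.normal(size=512))
test = preprocess(rng.normal(size=512))
```

The score between `enroll` and `test` comes from the PLDA back end; `decide` then turns the log-likelihood ratio into the final target/non-target label.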
0:02:57 Let's look at the SRE19 performance metrics.
0:03:01 The NIST SRE challenge in 2019 had two tracks: the first one was speaker detection on conversational telephone speech, or CTS, and the second was multimedia speaker recognition.
0:03:18 Our work was on the first track, the CTS challenge.
0:03:22 The normalized detection cost function, or DCF, is defined as in equation 1: Cdet(beta, theta) = Pmiss(theta) + beta * Pfa(theta).
0:03:38 Here, Pmiss and Pfa are the probabilities of miss and false alarm respectively.
0:03:45 A miss is when the speaker recognition system labels a target trial as a non-target one, that is, the system wrongly fails to find the enrollment and test utterances to be of the same speaker.
0:04:00 A false alarm is when a non-target trial is erroneously detected as a target trial.
0:04:07 Pmiss and Pfa are computed by applying a detection threshold theta to the log-likelihood ratios.
0:04:15 The primary cost metric of the NIST SRE19 for the conversational telephone speech is given by equation 2 as the average of two detection costs, where beta-1 is equal to 99 and beta-2 is equal to 199.
0:04:32 The minimum detection cost, also known as minDCF or Cmin, is computed using the detection thresholds that minimize the detection cost, as in equation 3: we minimize equation 2 over the thresholds theta-1 and theta-2.
0:04:52 The equal error rate, or EER, is the value of Pfa and Pmiss computed at the threshold where Pfa and Pmiss are equal.
0:05:02 We report the results in terms of EER and Cprimary for all of our systems.
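These metrics can be computed from a set of scored trials. A minimal sketch, assuming the scores are log-likelihood ratios and using beta-1 = 99 and beta-2 = 199 as in equation 2:

```python
import numpy as np

def detection_metrics(tgt, non, beta1=99.0, beta2=199.0):
    """Compute EER and min Cprimary from target/non-target LLR scores.

    Every score is swept as a candidate threshold t: Pmiss is the
    fraction of target scores below t, Pfa the fraction of
    non-target scores at or above t.
    """
    thresholds = np.sort(np.concatenate([tgt, non]))
    pmiss = np.array([(tgt < t).mean() for t in thresholds])
    pfa = np.array([(non >= t).mean() for t in thresholds])

    # EER: the operating point where Pmiss and Pfa cross.
    i = np.argmin(np.abs(pmiss - pfa))
    eer = (pmiss[i] + pfa[i]) / 2

    # min Cprimary: average of the two detection costs (eq. 2),
    # each minimized over its own threshold (eq. 3).
    cmin = (np.min(pmiss + beta1 * pfa) + np.min(pmiss + beta2 * pfa)) / 2
    return eer, cmin
```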
0:05:11 The SRE19 evaluation set consisted of over two and a half million trials from fourteen thousand five hundred and sixty-one segments.
0:05:22 Let's look at the front-end modeling in our systems.
0:05:26 We trained three x-vector models with different subsets of the training data, which I describe in the next slide.
0:05:34 We used the extended time delay neural network architecture; the extended TDNN architecture consisted of twelve hidden layers and ReLU nonlinearities.
0:05:46 The model is trained to discriminate among the speakers in the training set.
0:05:52 The first ten hidden layers operate at the frame level, while the last two operate at the segment level.
0:05:59 There is a one-thousand-five-hundred-dimensional statistics pooling layer between the frame-level and segment-level layers; it computes the mean and standard deviation.
0:06:11 After training, embeddings are extracted from the five-hundred-and-twelve-dimensional affine component of the eleventh layer, which is the first segment-level layer.
0:06:22 These embeddings are the x-vectors we use.
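The statistics pooling step can be sketched as follows, assuming for illustration 750-dimensional frame-level activations so that the pooled output is 1500-dimensional:

```python
import numpy as np

def statistics_pooling(frames):
    """Pool a (num_frames x 750) matrix of frame-level activations into a
    single 1500-dim segment-level vector: per-dimension mean and std."""
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)
    return np.concatenate([mean, std])

# A toy segment: 200 frames of 750-dim activations.
pooled = statistics_pooling(np.ones((200, 750)))
```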
0:06:28 This table describes the details of the training and development datasets used in our SRE19 evaluation systems.
0:06:38 XVec1, our first x-vector model, was trained only on the VoxCeleb datasets.
0:06:46 XVec2 used Mixer 6 and past SRE data sets.
0:06:52 XVec3 was the full extractor system, which was trained on both VoxCeleb and previous SRE data sets.
0:07:02 The data partitions used in the back-end models of the individual submitted systems are indicated in Table 2.
0:07:14 Now let's look at the back-end model.
0:07:18 One of the popular approaches in speaker verification is the generative Gaussian PLDA, which we use as the baseline modeling approach.
0:07:28 Once the x-vectors are extracted, there is some preprocessing done on them: they are centered, that is, the mean is removed, transformed using LDA, and then unit length normalized.
0:07:41 The PLDA model on this processed x-vector of a particular recording is given by equation 4: eta_r = Phi * omega + epsilon_r, where eta_r is the x-vector for the particular recording, omega is the latent speaker factor with a Gaussian prior, Phi characterizes the speaker subspace matrix, and epsilon_r is the Gaussian residual.
0:08:06 For the scoring, a pair of x-vectors, one from the enrollment recording, denoted eta_e, and one from the test recording, denoted eta_t, are used with the GPLDA model to compute the log-likelihood ratio score given in equation 5.
0:08:27 Equation 5 is a quadratic function of eta_e and eta_t.
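One way to sketch this score in code is directly from the generative model: under the same-speaker hypothesis the pair shares a speaker factor, under the different-speaker hypothesis the two x-vectors are independent, and the LLR is the difference of the two joint Gaussian log-densities. Dimensions are toy values here; Phi and Sigma would come from PLDA training:

```python
import numpy as np

def gaussian_logpdf(x, cov):
    """Log density of a zero-mean multivariate Gaussian at x."""
    d = len(x)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + x @ np.linalg.solve(cov, x))

def plda_llr(eta_e, eta_t, phi, sigma):
    """LLR for the model eta = phi @ omega + eps, with omega ~ N(0, I)
    shared across the pair under H1 and independent under H0."""
    sb = phi @ phi.T                    # between-speaker covariance
    st = sb + sigma                     # total covariance of one x-vector
    x = np.concatenate([eta_e, eta_t])
    same = np.block([[st, sb], [sb, st]])
    diff = np.block([[st, np.zeros_like(sb)], [np.zeros_like(sb), st]])
    return gaussian_logpdf(x, same) - gaussian_logpdf(x, diff)
```

Expanding the two quadratic forms shows the score is quadratic in eta_e and eta_t, as stated above.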
0:08:32 Along with the GPLDA approach, we proposed a neural PLDA model, or NPLDA model, for back-end modeling.
0:08:45 What we have here is a pairwise discriminative network: the green portion of the network corresponds to the enrollment embeddings, and the pink portion of the network corresponds to the test embeddings.
0:09:01 We construct the preprocessing steps of the generative GPLDA as layers in the neural network: the LDA as the first affine layer, the unit length normalization as a nonlinear activation, and then the PLDA centering and diagonalization as another affine transformation.
0:09:25 The final pairwise PLDA scoring, which is given in equation 5 on the previous slide, is implemented as a quadratic layer.
0:09:36 The parameters of this model are optimized using an approximation of the minimum detection cost function, or Cmin.
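A forward-pass sketch of this architecture in plain numpy follows. Parameter shapes are illustrative; in the real system the affine layers are initialized from the generative pipeline and all parameters are learned by backpropagation through a soft detection cost like the one below:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class NPLDA:
    """Pairwise discriminative back end: the GPLDA preprocessing steps
    are unrolled as layers, and the score is a quadratic function of
    the two embeddings with learnable matrices P and Q."""

    def __init__(self, lda_w, lda_b, diag_w, diag_b, P, Q):
        self.lda_w, self.lda_b = lda_w, lda_b      # affine layer 1: LDA
        self.diag_w, self.diag_b = diag_w, diag_b  # affine layer 2: centering/diagonalization
        self.P, self.Q = P, Q                      # quadratic scoring layer

    def embed(self, x):
        y = self.lda_w @ x + self.lda_b
        y = y / np.linalg.norm(y)                  # length-norm "activation"
        return self.diag_w @ y + self.diag_b

    def score(self, enroll, test):
        e, t = self.embed(enroll), self.embed(test)
        return e @ self.Q @ e + t @ self.Q @ t + 2.0 * e @ self.P @ t

def soft_detection_cost(scores, labels, beta, theta=0.0, alpha=10.0):
    """Differentiable approximation of the detection cost: the hard
    threshold counts inside Pmiss/Pfa are replaced by sigmoids."""
    s, y = np.asarray(scores), np.asarray(labels, dtype=bool)
    pmiss = sigmoid(-alpha * (s[y] - theta)).mean()
    pfa = sigmoid(alpha * (s[~y] - theta)).mean()
    return pmiss + beta * pfa
```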
0:09:49 Now let's look at our submitted systems and the results.
0:09:54 The table here shows the details of the seven individual models that we submitted, and a couple of fusion systems.
0:10:04 The best individual system was the combination of XVec3, which is the full x-vector extractor, with the proposed NPLDA model.
0:10:16 For the SRE18 development set, it had a score of 5.31 percent EER and 0.32 Cmin, and the best scores for the SRE19 evaluation were 4.97 percent EER and 0.42 Cmin.
0:10:35 The fusion systems offer some gains over the individual systems.
0:10:41 Overall, the full extractor system, XVec3, performs significantly better than the VoxCeleb-only XVec1 and the XVec2 systems, for any choice of back end.
0:10:57 For the systems trained with the NPLDA back end, it is observed that the model fits the in-domain and out-of-domain data better than the Gaussian PLDA.
0:11:13 Let's talk about some post-evaluation experiments and analysis.
0:11:18 One of the factors that we found we did not handle optimally was calibration.
0:11:24 In our previous work for SRE18, we proposed an alternative approach to calibration, where the target and non-target scores were modeled as Gaussian distributions with a shared variance.
0:11:39 As SRE19 did not have an explicitly matched development dataset provided, the aforementioned calibration, trained using the SRE18 development dataset, turned out to be suboptimal when applied to SRE19.
0:11:55 This was done for all of our submitted systems, and thus the calibration was not as optimal as we wanted it to be.
0:12:03 The graph on the right shows how the SRE18 development and SRE19 evaluation score distributions are not matched, so the thresholds learned in calibration were not optimal for our submitted systems.
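The shared-variance Gaussian calibration mentioned above reduces to an affine transform of the raw score. A minimal sketch, with made-up score values standing in for a development set:

```python
import numpy as np

def fit_gaussian_calibration(tgt_scores, non_scores):
    """Model target scores as N(mu1, var) and non-target scores as
    N(mu0, var) with a shared (pooled) variance; the calibrated LLR
    log N(s; mu1, var) - log N(s; mu0, var) is then linear in s."""
    tgt, non = np.asarray(tgt_scores), np.asarray(non_scores)
    mu1, mu0 = tgt.mean(), non.mean()
    var = (tgt.var() * len(tgt) + non.var() * len(non)) / (len(tgt) + len(non))
    slope = (mu1 - mu0) / var
    offset = (mu0**2 - mu1**2) / (2.0 * var)
    return slope, offset

def calibrate(score, slope, offset):
    """Map a raw score to a calibrated log-likelihood ratio."""
    return slope * score + offset
```

If the evaluation scores are distributed differently from the development scores used to fit mu0, mu1, and var, the learned slope and offset, and hence the decision threshold, will be off, which is the mismatch observed between SRE18 development and SRE19 evaluation.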
0:12:21 We performed some score normalization techniques to improve our scores.
0:12:26 We performed the adaptive symmetric normalization, or AS-norm, using the SRE18 development unlabeled set as the cohort.
0:12:36 We achieved a 24 percent relative improvement for XVec1, which is the VoxCeleb extractor system, and a 21 percent relative improvement for the full extractor system, XVec3, on the SRE18 development set.
0:12:51 We got a comparatively lower but consistent improvement of about 14 percent on average across all of our systems for the SRE19 evaluation set.
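The AS-norm step can be sketched as follows: each trial score is z-normalized by the statistics of each side's top-k scores against the cohort, and the two normalizations are averaged. Here k and the cohort scores are toy values, not the actual cohort sizes we used:

```python
import numpy as np

def as_norm(raw_score, enroll_cohort_scores, test_cohort_scores, top_k=200):
    """Adaptive symmetric normalization: normalize the trial score with
    the mean/std of the top-k cohort scores of each side, then average."""
    def top_stats(scores):
        # adaptive part: keep only the best-matching cohort scores
        top = np.sort(np.asarray(scores))[-top_k:]
        return top.mean(), top.std()
    mu_e, sd_e = top_stats(enroll_cohort_scores)
    mu_t, sd_t = top_stats(test_cohort_scores)
    return 0.5 * ((raw_score - mu_e) / sd_e + (raw_score - mu_t) / sd_t)
```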
0:13:02 The table shows the best values that we got for the SRE18 development and the SRE19 evaluation sets.
0:13:11 We got an EER of 4.7 percent and a Cmin of 0.27 as the best scores for the SRE18 development set, and an EER of 4.51 percent, a Cmin of 0.36, and a Cprimary of 0.39 for the SRE19 evaluation systems.
0:13:33 To summarize: we trained three x-vector extractors and background models on different partitions of the available data.
0:13:42 We also explored a novel discriminative back-end model called the NPLDA, which is inspired by neural network architectures and the generative Gaussian PLDA.
0:13:54 We observed that the NPLDA consistently improves over the GPLDA for both of these datasets.
0:14:02 The errors that were caused by calibration with the mismatched development datasets were discussed, as were the significant performance gains achieved by using the cohort-based AS-norm adaptive score normalization technique for various systems.
0:14:21 These are some of the references that we used.
0:14:25 Thank you.