0:00:14 I'm going to present this work about domain adaptation in speaker recognition:
0:00:18 a speaker labeling strategy performed from scratch for speaker recognition.
0:00:26 We want to carry out speaker recognition on a new domain without degrading
0:00:30 detection performance,
0:00:32 thanks to adaptation techniques.
0:00:35 But we want
0:00:36 to take into account the difficulties of the task in real-life situations:
0:00:42 the cost of data collection, and therefore, instead of assuming a large
0:00:48 available in-domain dataset,
0:00:52 we assume a unique and unlabeled in-domain development dataset,
0:00:58 possibly reduced in size in terms of speakers and also of segments per speaker.
0:01:05 This dataset is used to learn an adapted speaker recognition model.
0:01:10 First, we want to know how much the performance increases depending on the amount
0:01:15 of unlabeled in-domain data,
0:01:18 in terms of segments
0:01:19 and of speakers, or
0:01:23 of sample size of segments per speaker.
0:01:31 Second, the number of clusters is usually estimated thanks to another labeled in-domain
0:01:36 dataset.
0:01:37 To break this constraint on the number of clusters,
0:01:42 we want to
0:01:43 carry out the clustering without this requirement of a preexisting
0:01:48 in-domain
0:01:49 labeled dataset.
0:01:53 This is explained later in this presentation.
0:02:01 This slide displays the usual back-end process for speaker recognition systems based on embeddings,
0:02:08 and the different adaptation techniques that can be included.
0:02:13 Methods such as CORAL aim at
0:02:15 transforming vectors to reduce the shift between target and out-of-domain distributions
0:02:21 through covariance alignment,
0:02:23 while others fit the feature distribution of the out-of-domain data to the
0:02:29 target one,
0:02:32 leading to transforming out-of-domain data into pseudo in-domain data.
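As an illustration of this family of techniques, a CORAL-style covariance alignment can be sketched as follows. This is a minimal numpy sketch, not the exact recipe used in this work; `ood_x` and `ind_x` are hypothetical arrays of out-of-domain and in-domain embeddings, one row per segment.

```python
import numpy as np

def coral_transform(ood_x, ind_x, eps=1e-6):
    """Map out-of-domain embeddings so that their covariance (and mean)
    match the in-domain distribution, in the spirit of CORAL."""
    d = ood_x.shape[1]
    c_ood = np.cov(ood_x, rowvar=False) + eps * np.eye(d)
    c_ind = np.cov(ind_x, rowvar=False) + eps * np.eye(d)
    # Whiten with the out-of-domain covariance ...
    w, U = np.linalg.eigh(c_ood)
    whiten = U @ np.diag(w ** -0.5) @ U.T
    # ... then re-color with the in-domain covariance.
    w, U = np.linalg.eigh(c_ind)
    recolor = U @ np.diag(w ** 0.5) @ U.T
    return (ood_x - ood_x.mean(axis=0)) @ whiten @ recolor + ind_x.mean(axis=0)
```

After the transform, the second-order statistics of the out-of-domain embeddings match the target ones, which is what lets them serve as pseudo in-domain data.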
0:02:40 When speaker labels of in-domain samples are available,
0:02:44 supervised adaptation can be carried out,
0:02:47 like the Kaldi-style PLDA adaptation,
0:02:51 based on a linear interpolation between the in-domain and out-of-domain parameters.
0:02:58 Also, score normalizations can be considered as unsupervised adaptation,
0:03:03 as they use an unlabeled in-domain subset for the impostor cohort.
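The idea of score normalization with an unlabeled in-domain cohort can be sketched as below. The talk does not specify which variant (z-norm, t-norm, AS-norm) is used, so this minimal z-norm function is only an assumed illustration.

```python
import numpy as np

def z_norm(raw_score, cohort_scores):
    """Normalise a trial score using the statistics of the scores between
    the enrollment model and an unlabeled in-domain impostor cohort."""
    mu = cohort_scores.mean()
    sigma = cohort_scores.std()
    return (raw_score - mu) / sigma
```

Because only impostor score statistics are needed, no speaker labels of the cohort are required, which is why this counts as unsupervised adaptation.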
0:03:09 Note that we generalize this interpolation of the PLDA parameters
0:03:14 to all possible stages of the system: mean and whitening.
0:03:18 This tactic improves performance
0:03:22 on all our experiments.
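The generalized interpolation mentioned above amounts to a convex combination applied uniformly to every back-end parameter. A minimal sketch, assuming the parameters are stored as numpy arrays in dictionaries and `alpha` is a hypothetical interpolation weight:

```python
import numpy as np

def interpolate(ood_params, ind_params, alpha=0.5):
    """Convex combination of out-of-domain and in-domain parameters,
    applied to every stage of the back-end (mean, whitening, PLDA)."""
    return {k: alpha * ind_params[k] + (1.0 - alpha) * ood_params[k]
            for k in ood_params}
```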
0:03:29 So, how does the performance increase depending on the
0:03:33 amount of data?
0:03:36 We carry out an analysis
0:03:41 focusing on the gain of adapted systems as a function of the available data, and results
0:03:47 are reported while varying the following parameters:
0:03:54 number of speakers,
0:03:56 number of samples per speaker,
0:03:58 adaptation technique.
0:04:02 Here is a description of the experimental setup for our
0:04:07 analysis.
0:04:09 We use acoustic features consisting of twenty-three cepstral coefficients,
0:04:14 with mean normalization over a sliding window
0:04:16 of three seconds,
0:04:18 then VAD based on the c0 component.
0:04:23 The extractor is the x-vector one of the Kaldi toolkit,
0:04:28 with an attentive statistics pooling layer.
0:04:32 This extractor is trained on Switchboard and NIST SRE
0:04:36 datasets.
0:04:39 We use a five-fold data augmentation strategy, with reverberation,
0:04:46 noise, music,
0:04:48 and babble from MUSAN.
0:04:52 The target domain is an Arabic language corpus, which is called CMN2,
0:04:56 as used in the NIST speaker recognition evaluations
0:04:59 of two
0:05:00 thousand eighteen
0:05:02 and two thousand
0:05:05 nineteen.
0:05:10 This language is absent from the NIST speaker recognition training databases,
0:05:15 leading thus to a mismatch.
0:05:22 The in-domain corpus for development and test is described in this table.
0:05:28 The development dataset gathers the enrollment and test segments derived from the NIST
0:05:32 SRE 18 development and test sets,
0:05:35 and half of the enrollment segments derived from NIST SRE 19;
0:05:42 the other half is set aside for making up the trial dataset for test.
0:05:47 The fifty percent split takes genders into account so that both subsets are balanced.
0:05:54 The test set contains trial pairs
0:05:57 randomly and uniformly picked up, with the constraint of being balanced by gender
0:06:03 and of a target prior
0:06:04 equal to one percent.
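Such a trial list can be sketched as follows. This is an illustrative sampling scheme only, not the exact procedure of the paper; `segments` is a hypothetical list of (segment_id, speaker_id, gender) tuples, and the one-percent target prior is enforced by construction.

```python
import random

def make_trials(segments, n_trials, target_prior=0.01, seed=0):
    """Draw trial pairs at random with a fixed same-speaker (target) prior,
    pairing only segments of the same gender."""
    rng = random.Random(seed)
    n_target = round(target_prior * n_trials)
    targets, nontargets = [], []
    while len(targets) < n_target or len(nontargets) < n_trials - n_target:
        a, b = rng.sample(segments, 2)
        if a[2] != b[2]:                       # enforce gender matching
            continue
        if a[1] == b[1] and len(targets) < n_target:
            targets.append((a[0], b[0], True))
        elif a[1] != b[1] and len(nontargets) < n_trials - n_target:
            nontargets.append((a[0], b[0], False))
    return targets + nontargets
```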
0:06:07 For analyzing the adaptation strategy,
0:06:10 the number of speakers and the number of segments per speaker are varied,
0:06:16 leading to different total amounts of segments, and also,
0:06:21 given a fixed total amount, to assess the impact of speaker class variability.
0:06:26 Each time, a subset is picked up from the three-hundred-and-ten-speaker
0:06:31 development dataset and employed for adapting the models.
0:06:36 The test set
0:06:38 is fixed and only intended for testing.
0:06:42 Four alternatives are considered and experimented:
0:06:45 the system without adaptation, the system applying unsupervised adaptation only,
0:06:49 the system applying supervised adaptation only,
0:06:52 and the system applying the full pipeline,
0:06:55 unsupervised and supervised.
0:06:57 The goal is to assess the usefulness
0:07:00 of unsupervised techniques when speaker labels are available.
0:07:07 This figure shows the results of our analyses:
0:07:12 performance, in terms of equal error rate, of unsupervised- and supervised-
0:07:17 adapted systems, depending on the number of speakers
0:07:22 and segments per speaker
0:07:25 of the in-domain development dataset.
0:07:28 The case
0:07:30 "n segments per speaker" corresponds to keeping all segments of the selected speakers,
0:07:36 so n is the mean;
0:07:39 the x-axis is the number of speakers,
0:07:42 and each curve corresponds to a number of segments per speaker.
0:07:47 It can be observed
0:07:49 that combining unsupervised and supervised adaptation is the best way,
0:07:55 whatever the amount of labeled data,
0:07:58 in terms of equal error rate.
0:08:01 Also, we observe that
0:08:03 even with a small in-domain dataset, here fifty speakers, there is
0:08:08 a significant gain of performance with adaptation, compared to the baseline of twelve point
0:08:14 twelve percent.
0:08:16 Now let's note a subset of the dashed curves in the figure.
0:08:21 They correspond to fixed total amounts of segments.
0:08:28 For example,
0:08:29 this last row corresponds to the same total amount of two thousand five hundred segments:
0:08:39 fifty speakers with fifty segments
0:08:42 per speaker, or one hundred
0:08:48 speakers with twenty-five segments per speaker.
0:08:51 We can observe that,
0:08:53 given a total amount of segments, performance improves with the number of speakers.
0:08:58 Gathering data from a few speakers, even with many utterances per speaker,
0:09:03 limits the gain of adapted systems.
0:09:07 Now let's talk about clustering.
0:09:10 The goal is to obtain a reliable labeled in-domain dataset by using
0:09:15 unsupervised clustering, and then identifying the provided classes
0:09:20 with pseudo speaker labels.
0:09:23 Given a dataset X,
0:09:26 we cluster it;
0:09:27 the result
0:09:29 is used as the actual speaker labels for adaptation.
0:09:34 Note that we use
0:09:36 Y, a preexisting labeled dataset from in-domain data.
0:09:40 A PLDA model is computed
0:09:42 using the out-of-domain training dataset;
0:09:45 then the score matrix S of all trials of X is used for carrying out
0:09:51 an agglomerative hierarchical clustering, using S as
0:09:56 a similarity matrix.
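The agglomerative hierarchical clustering step can be sketched with scipy. This is a minimal illustration, assuming `S` is a symmetric matrix of pairwise PLDA scores (higher score means more likely the same speaker), so it is turned into a distance before linkage; the linkage method is an assumption, not necessarily the one used in the talk.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def ahc_labels(S, n_clusters):
    """Agglomerative hierarchical clustering of segments, using the
    pairwise score matrix S as similarity."""
    D = S.max() - S                 # convert similarities to distances
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```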
0:09:59 A key issue of this clustering problem is how to determine the actual number
0:10:05 of classes.
0:10:08 We sweep the number of clusters: for each number Q, a PLDA model is estimated, which
0:10:12 includes the adapted parameters,
0:10:16 and the preexisting in-domain labeled dataset Y is used for computing the error rate.
0:10:27 Then we select the class labels corresponding to the number of classes Q that minimizes
0:10:32 the error rate.
0:10:37 The drawback of this approach is that it requires a preexisting
0:10:42 labeled in-domain development set, and so it is not
0:10:46 a method from scratch, working without any labeled in-domain data.
0:10:56 So we propose a method for clustering the in-domain dataset and determining the
0:11:01 optimal number of classes from scratch, without the requirement of a preexisting labeled in-domain set.
0:11:10 Here is the algorithm.
0:11:12 This algorithm is iterative:
0:11:16 for each number of classes Q,
0:11:18 we identify each cluster with a speaker,
0:11:21 and build the corresponding key matrix.
0:11:27 Then we use
0:11:28 these artificial keys
0:11:31 for computing the error rate.
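The artificial-key error rate can be sketched as follows: same-cluster trials are declared targets, cross-cluster trials non-targets, and an equal error rate is computed from the score matrix against those keys. The simple threshold-sweep EER below is an illustrative implementation, not necessarily the exact metric code of the system.

```python
import numpy as np

def eer(scores, targets):
    """Equal error rate from trial scores and boolean target flags."""
    order = np.argsort(scores)
    t = np.asarray(targets)[order]
    fnr = np.cumsum(t) / t.sum()               # misses below threshold
    fpr = 1.0 - np.cumsum(~t) / (~t).sum()     # false alarms above it
    i = int(np.argmin(np.abs(fnr - fpr)))
    return (fnr[i] + fpr[i]) / 2.0

def pseudo_eer(S, labels):
    """Error rate against the artificial keys induced by the clustering:
    same-cluster trials are targets, cross-cluster trials non-targets."""
    labels = np.asarray(labels)
    i, j = np.triu_indices(len(labels), k=1)
    return eer(S[i, j], labels[i] == labels[j])
```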
0:11:37 Now we have to determine the optimal number of classes.
0:11:42 We use the elbow criterion, well known in the field of clustering.
0:11:48 On display are the error rate and DCF criteria for determining the optimal number of
0:11:55 classes; the reported values correspond to the loop of the algorithm from scratch.
0:12:01 We can see that the slope of the equal error rate curve falls, then it slows
0:12:05 down in the neighbourhood, by excess, of the exact number of speakers,
0:12:11 which is
0:12:12 two hundred and fifty.
0:12:15 Moreover, the values of minDCF show operating points
0:12:20 which reach local minima before converging to zero,
0:12:25 the first one in the same neighbourhood
0:12:31 of two hundred and fifty.
0:12:38 So the algorithm from scratch gives around
0:12:42 three hundred,
0:12:45 slightly overestimated; beyond this threshold, the DCF increases.
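The talk does not detail how the elbow is detected; one common implementation, sketched below under that assumption, takes the knee as the point of the error curve farthest from the chord joining its first and last points, after normalising both axes.

```python
import numpy as np

def knee_point(q_values, err_values):
    """Estimate the elbow of a decreasing error curve: the point farthest
    from the chord joining its first and last points (axes normalised)."""
    q = np.asarray(q_values, dtype=float)
    e = np.asarray(err_values, dtype=float)
    qn = (q - q[0]) / (q[-1] - q[0])
    en = (e - e[0]) / (e[-1] - e[0])
    return q_values[int(np.argmax(np.abs(en - qn)))]
```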
0:12:55 Now, let's display the performance of the adapted system using clustering from scratch, as a function
0:13:01 of the number of clusters,
0:13:04 compared to unsupervised adaptation and to supervised adaptation with the exact speaker labels.
0:13:12 With the exact speaker labels and full adaptation, the equal error rate is around six percent, and
0:13:19 with only unsupervised adaptation the performance is around seven percent.
0:13:25 And we can see the curve of results obtained by varying the number of classes
0:13:33 for the clustering
0:13:35 from scratch that we propose.
0:13:42 We can see that the method overestimates the number of speakers, but manages to
0:13:47 attain interesting performance in terms of equal error rate and minDCF,
0:13:53 close to the performance
0:13:56 with exact labels and supervised adaptation.
0:14:03 The table shows the results
0:14:05 with various numbers of segments per speaker:
0:14:09 five, ten, or more.
0:14:11 For example,
0:14:13 in the last line, we can see that the results obtained by clustering from scratch
0:14:17 are
0:14:18 similar to those produced with a labeled development set,
0:14:24 but also close to the ones with the exact speaker labels.
0:14:31 Now we will conclude.
0:14:35 The analyses that we carried out
0:14:38 show that the improvement of performance is due to supervised but also unsupervised domain adaptation techniques,
0:14:46 like CORAL or PLDA interpolation.
0:14:49 These techniques combine well: one in the model field,
0:14:53 the other in the feature field, to achieve the best performance.
0:15:01 It is observed that a small sample of in-domain data can significantly reduce the
0:15:05 gap of performance,
0:15:08 above all when favouring the amount of speakers
0:15:11 rather than of segments per speaker.
0:15:18 Lastly, a new approach of speaker labeling has been introduced here,
0:15:23 done from scratch,
0:15:25 without preexisting labeled in-domain data
0:15:29 for clustering,
0:15:31 while achieving a good level of performance.
0:15:36 Thank you for your attention.
0:15:38 Contact us for more details on this study.
0:15:41 Bye bye.