I'm going to present this work about domain adaptation for speaker recognition.

It deals with an adaptation strategy including clustering from scratch, a key element in speaker recognition.

We want to carry out speaker recognition on a new domain, in order to increase the accuracy of detection thanks to adaptation techniques.

But we also want to take into account the difficulties of the task in real-life situations: the cost of data collection and labeling, and therefore the lack of a large available in-domain dataset.

So we assume that a unique and unlabeled in-domain development dataset is available, possibly reduced in size in terms of speakers and of segments per speaker.

This dataset is used to learn an adapted speaker recognition model.

First, we want to know how the performance increases depending on the amount of unlabeled in-domain data: in terms of segments, of speakers, or of number of segments per speaker.

Second, the state of the art usually determines the number of clusters thanks to another labeled in-domain dataset. To break this dependency, we want to carry out clustering without this requirement of a preexisting labeled in-domain dataset.


This is explained in the rest of this presentation.

This figure displays the main stages of the backend process for speaker recognition systems based on embeddings, and the different adaptation techniques that can be included. Some methods aim at transforming vectors to reduce the shift between target and out-of-domain distributions, like covariance alignment (CORAL). Feature distribution adaptation, on the other hand, attempts to map the out-of-domain distributions to the target ones, leading to transforming out-of-domain data into pseudo in-domain data.
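To make the covariance alignment idea concrete, here is a minimal sketch, assuming embeddings are stored as NumPy row matrices; the function name and the small regularization value are illustrative choices, not the exact recipe used in this work.

```python
import numpy as np

def coral_transform(ood_vectors, ind_vectors):
    """Whiten out-of-domain embeddings with their own covariance,
    then re-color them with the covariance of the (unlabeled)
    in-domain embeddings, so both distributions are aligned."""
    d = ood_vectors.shape[1]
    # Small ridge keeps both covariances invertible (illustrative value)
    c_ood = np.cov(ood_vectors, rowvar=False) + 1e-6 * np.eye(d)
    c_ind = np.cov(ind_vectors, rowvar=False) + 1e-6 * np.eye(d)

    def sqrtm(c):
        # Symmetric square root via eigendecomposition (c is SPD)
        w, v = np.linalg.eigh(c)
        return v @ np.diag(np.sqrt(np.maximum(w, 0.0))) @ v.T

    whiten = np.linalg.inv(sqrtm(c_ood))
    recolor = sqrtm(c_ind)
    centered = ood_vectors - ood_vectors.mean(axis=0)
    # Transformed data now matches the in-domain mean and covariance
    return centered @ whiten @ recolor + ind_vectors.mean(axis=0)
```

After this transform, the second-order statistics of the out-of-domain set match the in-domain ones, which is what lets the backend be trained on pseudo in-domain data.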

When speaker labels of the in-domain samples are available, supervised adaptation can be carried out. That is the case of the MAP adaptation of PLDA, which amounts to a linear interpolation between in-domain and out-of-domain parameters.

Also, score normalizations can be considered as unsupervised adaptations, as they use an unlabeled in-domain subset for the impostor cohort. Note that we generalize this interpolation of the PLDA parameters to all the stages of the system: mean, whitening and PLDA.

This brought consistent improvements of performance in all our experiments.
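The interpolation of backend parameters mentioned above can be sketched as follows; the statistic names and the interpolation weight `alpha` are hypothetical, chosen only to illustrate the blending of out-of-domain and in-domain estimates.

```python
import numpy as np

def interpolate_stats(ood, ind, alpha=0.5):
    """Supervised adaptation sketch: each Gaussian statistic of the
    backend (mean, whitening/PLDA covariances) is linearly blended
    between out-of-domain and in-domain estimates. `alpha` weights
    the in-domain side (hypothetical value)."""
    return {name: (1.0 - alpha) * ood[name] + alpha * ind[name] for name in ood}

# Toy statistics: global mean, within- and between-class covariances
ood_stats = {"mean": np.zeros(3), "within": np.eye(3), "between": 2 * np.eye(3)}
ind_stats = {"mean": np.ones(3), "within": 3 * np.eye(3), "between": np.eye(3)}
adapted = interpolate_stats(ood_stats, ind_stats, alpha=0.5)
print(adapted["mean"])  # halfway between the two domain means
```

Applying the same blending to every stage (mean, whitening, PLDA) is the generalization described above.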

So, how does performance increase depending on the amount of data? We carried out an analysis,


focusing on the gain of adapted systems as a function of the available data. Several parameters are varied for this analysis: the number of speakers, the number of samples per speaker, and the adaptation technique.

Here is a description of the experimental setup for our analysis.

We use an acoustic front-end with 23 cepstral coefficients, cepstral mean normalization over a sliding window of three seconds, then VAD based on the c0 component. The extractor is the x-vector one of the Kaldi toolkit, with an attentive statistics pooling layer. This extractor is trained on Switchboard and NIST SRE data. The training uses a five-fold augmentation strategy, with reverberation, noise, music and babble from MUSAN.

The target domain is an Arabic dialect, called CMN2, as in the NIST Speaker Recognition Evaluations 2018 and 2019. This language is absent from the NIST speaker recognition training databases, which leads to a domain mismatch.

The in-domain corpus for development and test is described in this table. The development dataset gathers the enrollment and test segments derived from the NIST SRE 18 development and test sets, and half of the enrollment segments derived from the NIST SRE 19 test set; the others are set aside for making up the trial dataset of the test. The fifty percent split takes genders into account, so that both subsets remain balanced.

The test set contains trial pairs randomly and uniformly picked up, with the constraint of being balanced by gender and of a target prior equal to one percent.
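Such a trial set with a controlled target prior can be sketched as follows; this is a hypothetical illustration (the segment tuples, the same-gender constraint and the helper name are assumptions), not the exact protocol of this work.

```python
import random

def sample_trials(segments, n_trials, target_prior=0.01, seed=0):
    """Build a trial list from (segment_id, speaker, gender) tuples:
    pairs are drawn uniformly, same-gender only, until the requested
    numbers of target and non-target trials are reached."""
    rng = random.Random(seed)
    n_target = int(n_trials * target_prior)
    targets, nontargets = [], []
    while len(targets) < n_target or len(nontargets) < n_trials - n_target:
        a, b = rng.sample(segments, 2)
        if a[2] != b[2]:          # cross-gender trials are discarded
            continue
        if a[1] == b[1] and len(targets) < n_target:
            targets.append((a[0], b[0]))
        elif a[1] != b[1] and len(nontargets) < n_trials - n_target:
            nontargets.append((a[0], b[0]))
    return targets + nontargets
```

With `target_prior=0.01`, one trial in a hundred compares two segments of the same speaker, mirroring the one-percent prior mentioned above.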

For analysing the adaptation strategy, the number of speakers and the number of segments per speaker are varied, leading to different total amounts of segments, and also, for a given fixed amount, allowing to assess the impact of speaker class variability. Each time, a subset is picked up from the 310-speaker development dataset and used for adapting the models. The test set is fixed and only intended for testing.

Four alternatives are considered and experimented: the system without adaptation, a system applying unsupervised adaptation only, a system applying supervised adaptation only, and a system applying the full pipeline, unsupervised plus supervised. The goal is to assess the usefulness of unsupervised techniques when speaker labels are available.

This figure shows the results of our analysis: performance, in terms of equal error rate, of unsupervised and supervised adapted systems, depending on the number of speakers S and the number of segments per speaker N of the in-domain development dataset. The case "S speakers, all segments per speaker" corresponds to keeping all the segments available for these speakers; N is then the mean number of segments per speaker.

It can be observed that combining unsupervised and supervised adaptation performs best: having in-domain labeled data does not make the unsupervised techniques questionable.

Also, we observe that even with a small in-domain dataset, here 50 speakers, there is a significant gain of performance with adaptation, compared to the baseline at 12.12 percent.

Now let us look at the dashed curves in the figure: they correspond to fixed total amounts of segments. For example, the last one corresponds to the same amount of 2500 segments: 50 speakers with 50 segments per speaker, or 100 speakers with 25 segments per speaker.


By sweeping these curves, we can see that, given a total amount of segments, performance improves with the number of speakers. Gathering data from a few speakers, even with many utterances per speaker, really limits the gain of adapted systems.

Now let's talk about clustering. The goal is to obtain a reliable labeled in-domain dataset by using unsupervised clustering and identifying the predicted classes with speaker labels: the dataset X is clustered, and the resulting classes are used as actual speaker labels.

Note that this first method uses Y, a preexisting labeled dataset of in-domain data. A PLDA model is computed using the out-of-domain training dataset. Then the score matrix of all trials of X is used for carrying out an agglomerative hierarchical clustering, using it as a similarity matrix.

A key issue of this clustering problem is how to determine the actual number of classes. Sweeping the number of clusters, a model is estimated for each number Q, which includes the adapted PLDA parameters, and the preexisting labeled in-domain dataset Y is used for computing an error rate. Then we select the class labels corresponding to the number of classes Q that minimizes this error rate.
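The clustering step can be sketched as follows, assuming SciPy's agglomerative hierarchical clustering and a square PLDA score matrix; the conversion from similarity to distance and the average linkage are illustrative choices.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_from_scores(score_matrix, n_clusters):
    """Agglomerative hierarchical clustering driven by a score matrix
    used as a similarity matrix (higher score = more likely the same
    speaker). Scores are turned into distances before calling AHC."""
    dist = score_matrix.max() - score_matrix   # similarity -> distance
    np.fill_diagonal(dist, 0.0)
    dist = (dist + dist.T) / 2.0               # enforce symmetry
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")
```

Sweeping `n_clusters` and evaluating each labeling against the labeled set Y then selects the number of classes Q with the minimal error rate, as described above.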

The drawback of this approach is that it requires a preexisting labeled in-domain development set, which is not always available. So we propose a method for clustering the in-domain dataset and determining the optimal number of classes from scratch, without the requirement of a preexisting labeled in-domain set.
Here is the algorithm. This algorithm is identical to the previous one, except that, for each number of classes Q, we identify classes with speakers and build an artificial key matrix. Then we use the scores with these artificial keys for computing the error rate.
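The artificial-key evaluation can be sketched as follows: the trial key is derived from the cluster labels themselves (same cluster means target trial), so the error rate is computed without any ground-truth speaker label. The EER computation shown here is a simple threshold sweep, an illustrative stand-in for the exact metric used in this work.

```python
import numpy as np

def pseudo_error_rate(scores, cluster_labels):
    """EER-like value of pairwise trial scores, measured against an
    artificial key derived from cluster labels."""
    n = len(cluster_labels)
    iu = np.triu_indices(n, k=1)                  # each pair counted once
    s = scores[iu]
    target = cluster_labels[iu[0]] == cluster_labels[iu[1]]
    order = np.argsort(s)[::-1]                   # accept highest scores first
    t_sorted = target[order]
    n_tar, n_non = t_sorted.sum(), (~t_sorted).sum()
    miss = 1.0 - np.cumsum(t_sorted) / n_tar      # targets rejected
    fa = np.cumsum(~t_sorted) / n_non             # non-targets accepted
    k = np.argmin(np.abs(miss - fa))              # where the two rates cross
    return (miss[k] + fa[k]) / 2.0
```

When the clustering matches the true speakers, this pseudo error rate behaves like the real one, which is what makes it usable for selecting Q from scratch.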

Now we have to determine the optimal number of classes. We use the elbow criterion, well known in the field of clustering. This slide displays the error rates and the criteria for determining the optimal number of classes.

The reported values correspond to the loop of the algorithm from scratch. We can see that the slope of the equal error rate curve slows down in the neighbourhood, by excess, of the exact number of speakers, which is 250. Moreover, the values of the minDCF at several operating points reach local minima before converging to zero, the first ones in the same neighbourhood of 250. So the elbow method selects a value around 300: the equal error rate curve is flat beyond this threshold, and the DCF increases.
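An elbow criterion of this kind can be sketched as follows; the relative-improvement threshold is a hypothetical value, and the toy curve below is illustrative, not the actual experimental one.

```python
def elbow_point(qs, eers, rel_threshold=0.02):
    """Select the first number of clusters where the relative EER
    improvement over the previous step drops below `rel_threshold`
    (hypothetical value), i.e. where the curve flattens."""
    for i in range(1, len(qs)):
        prev, cur = eers[i - 1], eers[i]
        if prev > 0 and (prev - cur) / prev < rel_threshold:
            return qs[i - 1]
    return qs[-1]

# Toy curve: steep decrease, then a plateau around the elbow
qs = [50, 100, 150, 200, 250, 300, 350]
eers = [12.0, 9.0, 7.5, 6.8, 6.5, 6.45, 6.44]
print(elbow_point(qs, eers))  # → 250
```

In practice, combining this flattening of the EER with the first local minimum of the minDCF is what locates the neighbourhood of the true number of speakers.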

Now let's display the performance of the adapted system using clustering from scratch, as a function of the number of clusters, compared to unsupervised adaptation and to supervised adaptation with the exact speaker labels. With exact speaker labels and full adaptation, the performance in equal error rate is around six percent, and with only unsupervised adaptation, the performance is around seven percent.

And here is the curve of the results obtained by varying the number of classes of the clustering from scratch that we propose. We can see that the method overestimates the number of speakers, but manages to attain interesting performance in terms of equal error rate and minDCF, close to the performance with exact labels and supervised adaptation. This holds for various numbers of segments per speaker: five, ten or more.

For example, on the last line, we can see that the results obtained by clustering from scratch are similar to the ones produced with a labeled development set, and also close to the ones with the exact speaker labels.

Now we will conclude. The analyses that we carried out show that the improvement of performance is due to supervised but also unsupervised domain adaptation techniques, like CORAL or PLDA adaptation. These techniques combine well, one in the model field, the other in the feature field, to achieve the best performance.


We observed that a small sample of in-domain data can significantly reduce the gap of performance, above all when favouring the number of speakers rather than the number of segments per speaker.

Lastly, a new approach of speaker labeling has been introduced here, done from scratch, without preexisting in-domain labeled data for clustering, while achieving a gain in performance.

Thank you for your attention. Don't hesitate to ask for more details on this study. Bye bye.