Speech Transcript - Advances in Speaker Recognition for Telephone and Audio-Visual Data: the JHU-MIT Submission for NIST SRE19

and a causes b well i think we're interest and therefore above is the speaker

recognition for telephone and audio-visual data

usually my these submission form is design a

this is a during war from distances on the human language than only standard estimators

and standard orders

it's and language processing from the two in my feeling

these assigning the income tax

cts is the conversational telephone speech in tunisian arabic

and the audio-visual is composed of videos from the internet taken from the vast

corpus

there is one that have a speaker recognition on face recognition only working model of

this

all also used and what database or some formula one and very well why don't

you lda or cosine scoring okay

in the other side the key points for women for still you what do you

really

not is gonna the nn are businessmen

still be lots kind of a place the nn vectors

also

but cannot they still using in the melee but mostly from any estimator fine tuning

to in domain data also

one will be assigned the key points where usage of rain is that are based

bodies

okay we use cosine scoring in several areas these to combine it is and

variance from different be there
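The cosine scoring mentioned here can be illustrated with a minimal sketch. The function name and the toy vectors are hypothetical and are not taken from the talk; the sketch only shows the standard cosine similarity between an enrollment embedding and a test embedding.

```python
import numpy as np

def cosine_score(enroll, test):
    """Cosine similarity between two embedding vectors (hypothetical helper)."""
    enroll = np.asarray(enroll, dtype=float)
    test = np.asarray(test, dtype=float)
    # Dot product divided by the product of the Euclidean norms.
    return float(enroll @ test / (np.linalg.norm(enroll) * np.linalg.norm(test)))

# Toy example: the score depends only on direction, not on vector length.
same_direction = cosine_score([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_score([1.0, 0.0], [0.0, 3.0])
```

Because the score ignores vector length, embeddings do not need to be length-normalized beforehand for this particular scoring rule.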

again we use this on the i

what i will face acoustic features that similar well for be overcome based detection

problem that we just a speaker

or face images

and do not isn't based on but when there walking and we kind of in

this course

we start describing the audio systems

so we're starving different acoustic features we use and this is used for units vectors

and build a lattice for rest of this vectors

it's be we use community vad or sixty s and you don't and v for

really

in video we constantly system

so what we'll from there is a sin was clustering or a be lda gmm

that a single speaker factors in speaker labels posteriors

we used to estimate of labels

based on similar you know and as the best one double will make me but

is not very sure would be is generally

also on responsiveness might consider are less money is

this and that was one based on god i

we got some improvement will is but i during

i seriously we're finding the for the n and then what we in domain data

just finding the leslie using four letter words in this way embodies becomes a sinus

and we call this

besides discriminant percent

so we have seven that is that or architectures

we have

i was gonna be and then but since

three five basis

than better since what we're gonna since the is the same that we use of

and sre

the contains translators from new domain

with a linear size of

one thousand four

alright is an utterance

we unknown

and therefore based on find a we have regulators five miles away two thousand forty

eight

are very agreements

we also several possible ways that five questions

they're having less than wireless the inverse there wasn't one the and that's always been

feeding

this is this one of the datasets used for training or not the inspectors

so it's in serious condition

zero use switchboard was designed for okay

r c of this work

it's

there isn't or is something the in work we use all the data set someone

one their completion

a is evident in a we use the same but with the model

we remove the so systems i one microphone

lincoln labs the still use businesses

microphone

confrontation or this though

we used as i e one

and i'm gonna is this study

and you state

or they are all from being the one d c and we just use the

most of the thing in this

we have a for like principal equations

c l is the only one last use the first configuration that's the line of

the you're

let's say that we have some out-of-domain and some in domain data

first we and that the out-of-domain in domain using their or a little

and they're all in an out-of-domain data in

then we use a different thing that in for in domain

out-of-domain data

we use common whitening

then the my face

the other two in domain data

and then the score normalization with in domain data and calibration
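The preprocessing part of a back-end like the one described here (out-of-domain and in-domain data, a common whitening) might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' recipe: it assumes each domain is centered with its own mean, that one whitening transform is estimated on the pooled centered data, and that embeddings are length-normalized afterwards. All function names are hypothetical.

```python
import numpy as np

def center_per_domain(x_out, x_in):
    """Subtract each domain's own mean (assumption: per-domain centering)."""
    return x_out - x_out.mean(axis=0), x_in - x_in.mean(axis=0)

def common_whitener(x_pooled):
    """Estimate one whitening matrix W so that cov(x_pooled @ W) is the identity."""
    cov = np.cov(x_pooled, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    evals = np.maximum(evals, 1e-10)  # guard against tiny or negative eigenvalues
    # Scale each eigenvector (column) by 1 / sqrt(eigenvalue).
    return evecs / np.sqrt(evals)

def length_norm(x):
    """Project every row onto the unit sphere."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Toy usage with random data standing in for embeddings.
rng = np.random.default_rng(0)
x_out = rng.normal(size=(400, 3)) * np.array([2.0, 1.0, 0.5]) + 5.0
x_in = rng.normal(size=(100, 3)) + 1.0
c_out, c_in = center_per_domain(x_out, x_in)
w = common_whitener(np.vstack([c_out, c_in]))
in_domain_ready = length_norm(c_in @ w)
```

The processed embeddings would then go into whichever scoring back-end is used; that step is left out because the transcript does not make the details recoverable.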

but for steely and have a three by conventions

something that and use for that all lda

and the use yes everyday the lda for a swear and very nice thing what's

almost instantly

are also in the scoring or

we also the lda for cases where

and then it is then we only the model in salt

so this is a this what are the values something the markets

that's a small difference between sites

but as forces us ordinance yuri a on

the use this study for then i x values to some well on this study

in u one

for the dc one

as you use the is something at you by

we also and since the only problem we also use the unlabeled

data by doing clustering

or other score normalization we use the only really

i'm use the sre seen that for

or maybe a we just think that can almost the latter

this is a very good speakers in the white honestly demos data

score by bayesian also us

the i have to be also provided us an significant improvement

a value will use this i think bias you one for calibration

that's you know this used the silence

first we analyze the us also that five million and

romana something the we use

where a source false or misleading there

on the on the lower a sliding i b d one all the

the base then system used unsupervised really in a bayesian with this study only

then in the signal were we is that in the u one okay

provides a very nice

then we i we are noise segmentation lately

that improves the convince your in the u

then we have that the a spectrum and also

and the in domain be i get some room and you the by a small

improvement

all in one

i think that if we change that sure or then run your that's where we

made the grade on our way we

getting some

implementing that you well limbaugh an improvement in the

also analysis on this you by also versa before rest

the bayesian network use a risk of a system for based silence mean versus evaluation

will also must present a unique

then we alignments unless something dusty the data

provides a nice improvement in the u and it again

then we a the we got a number of channels in the network and that

provides a small role

not remote really okay and we define the never will always unusable sinus fourteen

so on without use of us more ergonomically baseline but in there about their grace

and they always fits to the or something or thirteen data

and that's was in those identity

these are also all four to all the single system

the based system is your five better results before was one of the database sinus

ability have okay

so we're very close to be easily affected formal system for which channels

a personal one of the

and

for this part of the nn with the

will be the training set

in all cases you was greater than this method was i

or we apply several

medals for the fusion we have there

but it's a you don't use of in it was used in calibration and yes

is for a basis for

an efficient v

once you so in the real assisting calibration a one when you mean and another

is that it is not the union that i mean and

the scores

a quality with a where we can see that is consistent when interviews with a

very high or station

are you sure we got everything we on over and over

so the based system for us your proposal by in address the source for calibration

i think five series systems with but like plus three system is not possible

usually might need them

we have the fusion of existence

and the basic progress is a thing with fusion be but obviously once she

the best results that they want you can see that are the system also

the present problems phones your feature

no it's either a your problem of your results

was also an analysis of our last for the nn are where lunges it was

also for delay of advanced

or the u s

the first figure analyze this problem i phase you're

so and we can see that score normalization provides more meetings in a savvy the

in domain sre an eighteen
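As a reminder of what score normalization does, here is a minimal sketch of symmetric normalization (S-norm) against a cohort. The transcript does not make clear which variant was used, so the optional `top_n` selection (an adaptive variant) and all names here are assumptions for illustration.

```python
import numpy as np

def s_norm(score, enroll_cohort, test_cohort, top_n=None):
    """Symmetric score normalization of one trial score (hypothetical helper).

    enroll_cohort: scores of the enrollment side against a cohort set.
    test_cohort:   scores of the test side against the same cohort set.
    top_n:         if given, keep only the top_n highest cohort scores
                   (assumed adaptive variant; needs top_n >= 2 for a
                   non-zero standard deviation).
    """
    def stats(cohort):
        c = np.sort(np.asarray(cohort, dtype=float))[::-1]
        if top_n is not None:
            c = c[:top_n]
        return c.mean(), c.std()

    mu_e, sd_e = stats(enroll_cohort)
    mu_t, sd_t = stats(test_cohort)
    # Average of the two z-normalized versions of the raw score.
    return 0.5 * ((score - mu_e) / sd_e + (score - mu_t) / sd_t)

normalized = s_norm(3.0, enroll_cohort=[0.0, 2.0], test_cohort=[1.0, 3.0])
```

Using in-domain recordings as the cohort makes the cohort statistics match the evaluation conditions, which is one common reason such normalization helps more with in-domain data.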

also we can see that i mean by handle this problem i faced is that

why

provide some a similar guy

great

the second year so the was also a v i

right and that we will one between their usage

so the decision rule

the relative improvement in bic studies

so systems and it is easier to the utterance

besides the results of the signal system that we used in all submissions

we can see that there is anything about christmas is to have that is that

e d u

these is too small

so you systems for the reestimation by a significant

all by n c l is be part of the nn a waitress

there is no right in assigning from using y for a given in a network

for this

we use a real efficient is the input shows the system for fusion

we just reading writing i

includes your we still is involved in an a small step

so you're right value is yes one system

you'd reminding contrast to estimate ubm

the misuse

have a very similar a million this year use women right i have the base

a once you

now let's see the face recognition systems

this is there may be a front end

the pipeline is somewhat different for enrollment and test

but elsewhere well

phase of that still

then enrollment

we use the reference mumbles and you the test phase

but overlap with the telephone calls

in this will yes all the faces with it

then we used the final

modeling more on the original on a small line ungrounded phase and then we use

that are facing varies

we use briefly visited those and invariance

you just be used every now and a snack implementations or within a face on

our face unless you use the one d by the implementation

we examine the task as a c n

the video but since what are based on percent is for

this system doesn't use score normalization for enrollment we average the enrollment embeddings

and for the test side we use agglomerative clustering with twenty one clusters

unless you listen we have several and robustness the

but based methods also indicated in table we have

you mean and variance

averaged and variance the median of a multi clustering so turns you form an alliance

you

maybe also balanced young ones used for in somewhere in the media we go

similar to his twitter that's they will i know fine inventing which is then weighted

average

all the meetings rooms

in the total attention we obtain a single embedding for the test with a weighted average

of all the test embeddings

but also

and enrollment set

no see the this problem model

we have analysis the csp markets for this experiment we used in save face first

one hundred and very

the best figure is without is not understand your is it is not

is not improve the low in the guns you one and it's a need in

the

well rules less in this study night in

you one

and the baseline and in the about is the

made in enrollment bonuses are limited clustering in the that is

well as in the other datasets

the baseline peons overall only once the contents of attention

there are more steam or impostors are statistics

we compare the different and variance improve work of the us you by the question

and now there was as follows we have

the questions all the inside phase

printing models

we use the whole or can we use a already some enrollment and omit the

last three test

area so we can see that the white gaussian is better than a form the

exact reason but is there a lot of in the network a very significant we

can see that doesn't work on my personal

this of the submission process

then used primarily

is a really use general

the only last three assumes be systems on the taste of is a well this

year

this using a system is close to the right

using a system is worse a posteriori because we're and based on we were or

generally but one best so that no one

analysis

against a

based on the equal error rate
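For reference, the equal error rate can be computed from target and non-target trial scores roughly as in the sketch below. It is a simple threshold sweep without interpolation; the function name and toy scores are hypothetical.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: operating point where miss rate and false-alarm rate meet."""
    tgt = np.asarray(target_scores, dtype=float)
    non = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([tgt, non]))
    # A trial is accepted when its score is >= threshold.
    p_miss = np.array([(tgt < t).mean() for t in thresholds])
    p_fa = np.array([(non >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(p_miss - p_fa))
    return 0.5 * (p_miss[i] + p_fa[i])

eer_separable = equal_error_rate([2.0, 3.0], [0.0, 1.0])
eer_chance = equal_error_rate([0.0, 1.0], [0.0, 1.0])
```

With few trials the sweep is coarse; evaluation toolkits usually interpolate along the detection error trade-off curve instead.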

well no that's impossible

this was also than one model

in addition

so for the fusion we assume independence between the audio and the video so

we sum the scores
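Under an independence assumption between the two modalities, and if both systems output calibrated log-likelihood ratios, the fused log-likelihood ratio is just their sum. The sketch below illustrates that; treating the scores as calibrated log-likelihood ratios is an assumption here, and the names are hypothetical.

```python
import numpy as np

def fuse_llr(llr_audio, llr_video):
    """Fuse per-trial log-likelihood ratios from two independent modalities.

    If p(audio, video | H) = p(audio | H) * p(video | H) for both the target
    and the non-target hypothesis, the joint likelihood ratio factorizes and
    the log-likelihood ratios add.
    """
    return np.asarray(llr_audio, dtype=float) + np.asarray(llr_video, dtype=float)

# Toy trials: the second trial has weak, conflicting evidence in each modality.
fused = fuse_llr([1.0, -2.0], [0.5, 0.5])
```

No fusion weights are trained in this sketch; a trained fusion would replace the plain sum with a weighted sum plus an offset.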

in the figure we have a combination of more than useful single all those used

the additional value systems

single videos used in a fisherman previous nist and finally in one more

we can see that

we can get yours implement all eighty percent exactly

when we will from a single of assistant

who but it would be more efficient

okay

the key will results was using be data

the no more than one the one used

well cts less money loss

probably provide some woman we're got significant improvement of that a spectrum that for some

backends we

small liberal in domain

they can perform better than listening

what a probability of the screen but it was saying performance where

without the need for every

the results difference between as i the n-best and instantly in obvious that we wonder

why is that the is fitting work

so it is also studied in it has led with the transform it is because

the italians or entity that or

i mean doesn't in there is no

so we won't remember a city bus always focus on the same the on the

other side exactly you have already body was also incredibly or we don't want to

solve problem

we're really on all levels

i mean and variance

and organs performing very well

i mean it is obvious what is only obviously modalities are when

in the unimodal this so we will maybe that's came are used

that's all from my side thank you for

Advances in Speaker Recognition for Telephone and Audio-Visual Data: the JHU-MIT Submission for NIST SRE19

Evaluation and Benchmarking

Jesus Antonio Villalba Lopez, Daniel Garcia-Romero, Nanxin Chen, Gregory Sell, Jonas Borgstrom, Alan McCree, Leibny Paola Garcia Perera, Saurabh Kataria, Phani Sankar Nidadavolu, Pedro Torres-Carrasquillo, Najim Dehak