Speech Transcript - What are we missing with i-vectors? A perceptual analysis of i-vector-based falsely accepted trials

thank you the title of my presentation is what are we missing with i-vectors

a perceptual analysis of i-vector based falsely accepted trials and it was done in collaboration with people

from the phonetics lab of the csic which is the research

institution established for many years in spain

so plus not talking about

i-vectors

yes we will not i i-vectors but tones

and

i-vectors give us a compact and elegant solution where every utterance can be represented

in a fixed dimension vector

they have also given us great and efficient performance over a wide range

of conditions and allow us to

apply state-of-the-art pattern recognition techniques

and more the recently we are able to perform speaker recognition without point it is

a really great

we have we can avoid a lot of problems

and especially and i think that in the point

we don't produce calibrated likelihood ratios to forensic speaker recognition when we have lots of

i think that in accumulating a for this we have seen a nice paper we

wanted from that if i own do that's not that's

in this paper some but what if you feel you just a and have

wally score be in this paper has gone given a step farther when they have

not only over a being able to calculate an icon regularly richer when they have

i have recordings from the of these channel intercept assistant but they also have obtain

the day i select aggregation for to collapse the all that was to do so

they have an assessed not just a little bit about all the pros to sell

this is a

a great

we as a starting point but we have to look a little more in detail

about

i-vectors

and they explicitly courses lead to ignore a high-level and source little information

so the speaker and information

and is reduced to reach the short term but this is has a lot of

advantages for features to for conditional for real points

some users and imitate also so

but still spectral only detection decisions

probably will be uncorrelated with human perception this morning joe highlighted this issue

of a possible loss of credibility of the system if the it's a very user

if i boardman ldc perceive rate disagreements between what the system is doing and what

what they can see that they can see that humans are pristine

moreover we have almost total ignorance on the origin of those

detection errors

and when we have also you know system we are simply trying to restore system

probabilistic but we don't do not fit the specific with them

and we can have a transparent system which is very good but finally

if we have an error we cannot explain at all what's the reason of

the

error

and it's very important to be able to provide explanations of

why the system is working in a specific way

and just a final reminder we assess systems usually on average error rates

but from the user's perspective

they perceive performance case by case so if there is a

large error even in a single trial the system will be affected

as a whole

so what we did for the paper was to select a set of i-vector

based false acceptance trials from

sre ten and the hasr set of sre ten

and we gave them to a team of phoneticians of course not english

a great

and

the objective was to explore to better understand what they do with their with that

a date down and all

that it just a and sre that's

as we gave them that data what

types of differences they could find and also the number of different

types of differences that they could find in a single trial

and first of all a disclaimer this is not a paper on

speaker recognition by humans

both of them know in advance that the speakers in every trial are

different

all that we are asking them is to highlight the differences that they find

between the two utterances but without taking any decision just to

see what they can find

and

in trials where the i-vector system has provided a

likelihood ratio greater than one
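
The trials of interest here are non-target trials (different speakers) for which the system still produced a likelihood ratio greater than one. The following is a minimal sketch of that filter, not the authors' code: the `Trial` record, its field names and the sample scores are hypothetical, and the scores are assumed to be natural-log likelihood ratios, so that a likelihood ratio above one corresponds to a log value above zero.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    trial_id: str    # hypothetical identifier
    is_target: bool  # ground truth: True if both sides are the same speaker
    log_lr: float    # assumed natural-log likelihood ratio from the system


def false_acceptances(trials):
    """Return non-target trials whose likelihood ratio exceeds one (log LR > 0)."""
    return [t for t in trials if not t.is_target and t.log_lr > 0.0]


# Illustrative values only, not taken from the talk.
trials = [
    Trial("t1", True, 4.2),
    Trial("t2", False, -1.3),
    Trial("t3", False, 3.9),
    Trial("t4", False, 0.4),
]
fa = false_acceptances(trials)
print([t.trial_id for t in fa])
```

Only `t3` and `t4` are kept: `t1` is a target trial, and `t2` is a non-target trial that the system correctly scored below the threshold.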

as they have limited time for analysis we had to select a subset

of trials

so we used the scores from our submission to nist two thousand

and ten

and what we did was a outlier proper selection

first of all we took the sixteen false acceptances that we actually

had

with the hasr

set

and but also as those trials were specifically selected to be a special difficult for

humans just in case that was at peace stuff on it for that for the

analysis we also selected fifty different forces us a second trials from the sre can

and in that case as we had thousands of different

trials the condition was to select just those with log likelihood ratios in the range

from three to five which translates into likelihood ratios between twenty and

one hundred and fifty so those are typical for example of the systems

that we usually

and how when we use our i-vector systems with

and eight now with the real by a lot of availability
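
The range from three to five quoted for the selected scores can be checked numerically. This is a small sketch under the assumption that the scores are natural-log likelihood ratios (the talk does not state the base); the trial names and scores in the selection step are hypothetical.

```python
import math


def log_lr_to_lr(log_lr: float) -> float:
    """Convert a natural-log likelihood ratio to a plain likelihood ratio."""
    return math.exp(log_lr)


LOW_LOG_LR, HIGH_LOG_LR = 3.0, 5.0
low_lr = log_lr_to_lr(LOW_LOG_LR)
high_lr = log_lr_to_lr(HIGH_LOG_LR)
print(f"{low_lr:.1f} to {high_lr:.1f}")  # 20.1 to 148.4

# Hypothetical non-target scores; keep only those inside the selected range.
scores = {"trial_a": 2.4, "trial_b": 3.7, "trial_c": 5.6, "trial_d": 4.9}
selected = sorted(t for t, s in scores.items() if LOW_LOG_LR <= s <= HIGH_LOG_LR)
print(selected)
```

Under that assumption, log likelihood ratios of three and five correspond to likelihood ratios of roughly twenty and roughly one hundred and fifty.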

and after that we ended up with sixty six trials and after

a short hearing of each trial they selected

for detailed work eighteen trials nine male and nine female four from

hasr and fourteen from sre ten

this is the final list which is in the paper i just want to show it

because we will refer to every trial using

the number of the trial and the id

of each one of the speakers

second disclaimer i'm not a phonetician and i even have problems with english

so i will be talking about things that my colleagues as phoneticians

have detected so

my apologies if i say something not right

and this is the range of features that they explored they will be

noted by really deformation type temporal characteristics what extent means that what the characteristics degree

of the solid deep or something like than all the type of non-linguistic features or

what robert was impressions of

so that they will just

what they will extend

what they did with the selected trials was to perform detailed listening of

about one hour for each one of the trials and we focus on those features

which are present all along the conversation

i will show some samples

but

the differences that they are finding are present along

the whole conversation

and those comparison will be maybe linguistically k compare compatible segment example select you think

that set consisting of motown and finally some of the observation would be confidence through

acoustically or estimate a and then

by seasonal i used in mentioning that might expect so you don't seem a spectrogram

so the last part of my presentation will be simply showing some of the

issues found

in every case i will show the number of the trial and where

the audio came from and also the likelihood ratio value that is the

degree of support that the i-vector system has given to

the same speaker hypothesis so we know in advance they are different

but the i-vector system says

that they are the same speaker and we will see it for

every trial

and the that the that fault

degree of support of that are that can easily and english

all possible this is a case without a very high misleading value on the three

just and the operator what we use an obtain even for targets

and in that case for example what they found is that this for speech a

lot of the whole conversation is

and not different

no but we do you wanna go well

the it's for the blue line

for the right one

i really but i four

a sound like different by the that are over a regular or you are well

i really i four

and a set of features that they then used

you just about the long as variability

in declarative sentences people usually tend to decrease the energy at the end

at least that's what happens with the first speaker in that case

while the second speaker in that trial is

keeping the same stress can do you and we'll especially for to keep that log

in this

and this is consistently repeated during the whole conversation

in this case and which has which had a celebration of at a smaller value

obviously value and there's a

only dysphonic voice you once only one of the sides of the conversation is that

they have no idea what are okay

they have no idea what like are okay

is that is for the one

well there are no but neural network grammar

well there are no but you'll never bigger

for example you that are compared to the one light both phase right

but

and this is the spectral analysis of the of that powering latt uses a

without hi everyone would ratio on you know we have

much lower

another type of situation that was found is the presence of creaky voice

for example it is not very usual to find a speaker like the second one

here who just speaks all the conversation with creaky voice

i normal rate and this

second one no you know

no you know

this is not very frequent but it is present in this case and it's very

creaky but what is quite usual is the realisation of creaky voice at

the end of the phrase like here

we have a sample here in that case with a misleading likelihood ratio

of about fifty

two

one segment well

well

we also found issues about sorry more boys system where you the voice difficult is

to haul the bit it's a similar segment with and that type of speech you

can see the

tennessee of the mean value is quite similar however the second one we have

you use the oscillation problems to maintain that

i together i

i get a very i

second one

we will be known

also a feature that was found was about the speech rate

for example in that case there are two different speakers who talk at different

levels of velocity

what about how would be better marketing

moreover

it was bigger really

we were able to leave

this also issues all known hyperarticulation for example

the phase

really different see if you're selling you know

one the other one i for like you know

well

almost basis some

also this can be found in other cases with the

without using any of a key and where the formant a three of on here

it's much more the about more standard for speaker

your

second

huh

the second formant is much lower than the

one for the first speaker

also there were found differences in the specific patterns of realisation of some

sounds for example they are finding differences in the type of s that the

speaker produces

for example in that case the s in the first speaker starts at five

hundred hertz while the s in the second speaker

starts above

three thousand hertz a more standard s

also cases where there are differences in the degree of nasalisation

sample here i

this is like that

you don't want together

and that kind of nasal voice and in this case the other

one is hypernasal as if he had a cold or something

also that uses about impaired melodic voices

so regular

no we in

what is the one i know you know

in some cases they found extralinguistic issues for example the noisy breathing of

that speaker

you can hear

that you are construction some parties

for

for example

well as well

while the second one doesn't have any noisy breathing at all

they're also presents all squats or

strong not control of the o

or not the case of some of the presence of rectly voice

e and o

i go off all gonna

so finally this is the comparison table of

this work where the idea is that if you look at the rows

you can find the number of times that one given feature is

found

but it's more relevant to look trial by trial at the columns of the table

and you see that

for every trial

there's an average of about four different types of differences that are found
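
The two ways of reading the summary table, by feature and by trial, can be sketched as follows. The feature names, trial names and entries are hypothetical placeholders, not the values in the paper's table; the sketch only illustrates how the per-feature counts and the average number of difference types per trial would be derived.

```python
from collections import Counter

# Hypothetical table: for each feature type, the set of trials where
# the phoneticians reported a difference of that type.
found = {
    "creaky voice": {"trial_a", "trial_b"},
    "speech rate": {"trial_a", "trial_c"},
    "nasalisation": {"trial_b"},
    "noisy breathing": {"trial_a", "trial_b", "trial_c"},
}

# Reading by feature: in how many trials was each feature found?
trials_per_feature = {feature: len(ts) for feature, ts in found.items()}

# Reading by trial: how many different types of difference per trial?
features_per_trial = Counter(t for ts in found.values() for t in ts)

mean_per_trial = sum(features_per_trial.values()) / len(features_per_trial)
print(trials_per_feature)
print(dict(features_per_trial))
print(round(mean_per_trial, 2))
```

With these placeholder entries the per-trial average is 8/3; in the talk the corresponding figure over the eighteen analysed trials is reported as about four.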

especially of interest to us if we want to make automatic procedures to detect

some kind of features are those features related to phonation type like

creaky voice and also those

related to specific patterns of realisation of specific sounds

so to summarise

we have shown that perceptual analyses show null correlation with

i-vector false acceptances

and

there is detectable a useful information goals trials that just produce away from poland uses

what one bs recognition rate is

furthermore there's like

a relational

and specifically the but the realisation that bit of a specific cells

but also at would that those could provide an

we try to reach no signals transcription of the whole utterances and they could be

used to provide some kind of soft information or

and

this what specific highlight the inter some provide an objective measurements about this for you

not the spectral features especially for speaker

thank you

just listening to

second creaky wanna sell like was actually clipping happening in the first

creaky voice

solves one it

perhaps the reason the system to see the same because audio clip like three

was there any analysis on when you when people listening to these false like taking

part of the audio acquisition and one that was quality as well

there was no it's okay

especially analysis of brain processing of the of the data we just select the data

as it was and what is given to them and it what have phone from

the phone at finding just what the what they what they did so

how can you tell them so

what's the variance from the sets consist of experts on

to ten

that's good

there was a very high actually the second one was a student of the rate

was just you from one to and

maybe they provide for they come from the same school of listening

and then the degree of agreement ones

impressing we will be working completely separate

i we have to say that there were no this is chosen there were no

scoring sorry what does i found difference on i five difference on but the degree

but i can say that it was almost exactly the same maybe there was one

of the differences and that one of the informant the of the on

i was wondering

since you only used

non-target trials

if you had conducted the same experiment with target trials instead of non-target trials

how many of those differences would they also find especially the prosodic differences

of course they will find a lot of them what

we are trying to do is to look for clues with this analysis to know where to

look

for different information and of course that

prosodic information is very easily modified and it can depend a

lot on the type of conversation

that's why i stress the idea of the issues of

voice production and specific patterns of realisation which can be much more dependent upon

the speaker but

of course this part of the word that could be don't and of course they

would because when

i suppose like then participate in that kind of a humans just this evaluation they

also did not

as the result of this analysis can you suggest the kind of

features that we should use

in systems in the future

so you mention the prosody and duration what do you suggest

that we look at for improving systems

i'm not suggesting anything special i'm just giving the information of what they found but what

i'm saying is that for example

those voice quality features and

specific patterns of realisation of some sounds are a good set of

parameters that can be

properly detected

let's see if they can improve the overall system

What are we missing with i-vectors? A perceptual analysis of i-vector-based falsely accepted trials

Speaker Modeling I

Joaquin Gonzalez-Rodriguez, Juana Gil, Rubén Pérez and Javier Franco-Pedroso