Hello everybody, I'm happy to be here. This is my first time in the field of speaker recognition. I want to thank the organizers for providing such a challenge and the chance to participate, because I think it is an important contribution: we can improve speaker recognition systems by holding these kinds of challenges.
okay
What I'm going to propose is an idea taken from beamforming, which is a well-known technique in signal processing. Here is what I am going to present: first I will explain what beamforming is and how we apply it to this challenge. Then I will explain how we can solve the problem with adaptive filtering and find an optimal beamformer, first without any constraints, and then with constraints that include the uncertainty about the target, to make it more robust. Our work also includes a modification of the impostor covariance matrix and a score normalization in order to improve the performance.
so
So, what do we know and what do we assume? We start from i-vectors. The i-vector is interesting because it provides a fixed-dimensional representation of speech of any arbitrary length. The problem with i-vectors is that they vary with different environments, channels, and speakers, and this is the challenge in this field. With intersession compensation we try to remove this unwanted variability, but in this challenge, using probabilistic linear discriminant analysis (PLDA) is not a good idea, since we don't have any labels for the data. And if we produce labels by clustering, the quality of that clustering and labeling will affect the performance of PLDA.
okay
One important thing is what we actually have: a lot of unlabeled speech data. For example, in a telephone speech center there is a lot of speech data passing through, so we can take advantage of these data in order to improve speaker recognition, instead of producing artificial labels for them. PLDA and similar approaches need labeled data, so they are not a good choice here, and we take a new approach to solve the problem. If we cannot estimate the within-speaker scatter matrix reliably, why don't we instead estimate the between-speaker variance and increase it?
okay
The first thing I am going to explain is beamforming. It is a signal processing technique from sensor arrays, used to direct signal transmission or reception toward a desired target, and adaptive filtering is used for optimal filtering and interference rejection in order to estimate the signal of interest. The beamforming operation is this: a signal impinges on a set of antennas and is passed through a filter, and the output of that filter passes the desired angles and rejects all the other directions. This is the same as the dot product of a filter and the signal.
To illustrate the idea: with an omnidirectional antenna, the target signal and the interference are treated equally, but the beamformer focuses on the target. So we are going to design a filter like this, w^T x, where x is the i-vector and w is the filter. We want the target speaker to pass through this filter, but we reject all the other, impostor, speakers. All the impostors come from the development set.
If we use the mean squared error criterion to solve this problem, we reach the result you can see here: w is the optimal filter for this solution, R is the autocorrelation matrix, and t is the target, which can be estimated by the mean of the target i-vectors.
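As a rough sketch of this step (my own illustration, not the authors' code), the MSE-optimal filter has the Wiener-style closed form w = R^{-1} t, with R the autocorrelation matrix of the development i-vectors and t the mean of the target's enrollment i-vectors:

```python
import numpy as np

def mse_filter(dev_ivectors, target_ivectors):
    """Wiener-style MSE solution: w = R^{-1} t, with R the autocorrelation
    matrix of the development (impostor) i-vectors and t the mean of the
    target's enrollment i-vectors."""
    R = dev_ivectors.T @ dev_ivectors / len(dev_ivectors)
    t = target_ivectors.mean(axis=0)
    return np.linalg.solve(R, t)

# Toy example with random low-dimensional "i-vectors" (dimensions are made up).
rng = np.random.default_rng(0)
dev = rng.normal(size=(500, 8))          # development set (impostors)
enroll = rng.normal(size=(5, 8)) + 2.0   # one target's enrollment i-vectors
w = mse_filter(dev, enroll)
score = w @ (rng.normal(size=8) + 2.0)   # trial score is simply w^T x
```

The dimensions and data here are placeholders; the point is only the shape of the solution.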
Now let's compare it with the baseline system. The baseline score is computed after whitening the i-vectors and then using cosine similarity. Notice that when we use cosine similarity, we normalize the magnitudes of the i-vectors, but in the adaptive filtering I just explained there is no normalization of the i-vectors.
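For reference, the baseline the speaker describes could be sketched like this (a minimal sketch under my own assumptions about the whitening): estimate a whitening transform on the development set, then score with cosine similarity, which length-normalizes both vectors:

```python
import numpy as np

def whiten(dev_ivectors):
    """Return a whitening transform estimated on the development set."""
    mu = dev_ivectors.mean(axis=0)
    cov = np.cov(dev_ivectors, rowvar=False)
    # Inverse square root of the covariance via eigendecomposition.
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    return lambda x: (x - mu) @ W

def cosine_score(model, test):
    """Cosine similarity: both vectors get length-normalized."""
    return (model @ test) / (np.linalg.norm(model) * np.linalg.norm(test))

rng = np.random.default_rng(1)
dev = rng.normal(size=(500, 8))
transform = whiten(dev)
s = cosine_score(transform(rng.normal(size=8)), transform(rng.normal(size=8)))
```

The contrast with the proposed filter is exactly the `cosine_score` normalization: the adaptive filter scores w^T x without normalizing x.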
Now let's go a little further and change the criterion. In beamforming there is the minimum variance distortionless response (MVDR), whose idea is to maximize the signal-to-interference ratio. We want to maximize this ratio, that is, to maximize the output of the filter when the target passes through it, while rejecting all the impostors, which means minimizing the denominator. To solve the problem, we constrain the numerator to equal one (this is the distortionless constraint), and then minimize the denominator, which is the power of the impostors after passing through the filter. Here R is the impostor covariance matrix, and the optimal solution for this problem can easily be found this way.
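A minimal numpy sketch of this step (my own illustration): minimizing w^T R w subject to the distortionless constraint w^T t = 1 has the standard MVDR closed form w = R^{-1} t / (t^T R^{-1} t):

```python
import numpy as np

def mvdr_filter(impostor_cov, target):
    """MVDR: minimize w^T R w subject to w^T t = 1.
    Closed form: w = R^{-1} t / (t^T R^{-1} t)."""
    Rinv_t = np.linalg.solve(impostor_cov, target)
    return Rinv_t / (target @ Rinv_t)

rng = np.random.default_rng(2)
impostors = rng.normal(size=(500, 8))     # development-set i-vectors
R = np.cov(impostors, rowvar=False)       # impostor covariance matrix
t = rng.normal(size=8)                    # estimated target i-vector
w = mvdr_filter(R, t)
# The distortionless constraint holds: the target passes with gain 1.
```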
So let's compare it with cosine similarity. The baseline system scores like that, and MVDR scores this way. If you look at this, you see that MVDR proposes a new similarity measure that does not include the normalization of the test i-vector but focuses more on the target. The result shows that it provides an improvement of 7.7 percent in the i-vector challenge.
Now let's go one step further and make it more robust. As we saw on the previous slide, we used the mean of all the target i-vectors to estimate the target, since MVDR supposes there is no uncertainty regarding the target. The linearly constrained minimum variance (LCMV) approach instead includes the uncertainty through linear constraints: we align all the i-vectors provided for the target in a matrix C, and we enforce that each of them passes the filter with a value of one, so f is a vector of ones. If we solve this problem, the optimal filter is as you can see here. When we applied it to the challenge, there was a further improvement of 3.7 percent relative to MVDR, and 11.1 percent relative to the baseline system.
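The LCMV generalization can be sketched the same way (again my own illustration): with the target's enrollment i-vectors stacked as columns of C and f a vector of ones, minimizing w^T R w subject to C^T w = f gives the standard closed form w = R^{-1} C (C^T R^{-1} C)^{-1} f:

```python
import numpy as np

def lcmv_filter(impostor_cov, C):
    """LCMV: minimize w^T R w subject to C^T w = f, with f = ones.
    Closed form: w = R^{-1} C (C^T R^{-1} C)^{-1} f."""
    f = np.ones(C.shape[1])
    Rinv_C = np.linalg.solve(impostor_cov, C)
    return Rinv_C @ np.linalg.solve(C.T @ Rinv_C, f)

rng = np.random.default_rng(3)
impostors = rng.normal(size=(500, 8))
R = np.cov(impostors, rowvar=False)
C = rng.normal(size=(8, 3))   # three enrollment i-vectors as columns
w = lcmv_filter(R, C)
# Every enrollment i-vector passes the filter with value 1.
```

With a single enrollment vector (C a single column) this reduces exactly to the MVDR solution above.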
Now we can do an additional job to improve the performance. In signal processing there are many more techniques, such as the robust Capon beamformer, which improves the performance by diagonally loading the covariance matrix. I used a similar approach, but with the top impostor i-vectors, the ones most similar to the target i-vectors. In this way, we passed the impostors through the filter for each target, selected the top (about six thousand) impostors by similarity, and computed the covariance matrix again. This results in a very good improvement of 21.5 percent relative to the baseline system. Looking at the impostor scores against the target, we can see that after applying this covariance matrix modification there is a good reduction in the scores.
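The covariance modification might look roughly like this (a sketch based on my reading of the description; the six-thousand cutoff is from the talk, everything else is my assumption): for each target, score the development impostors against the current filter, keep the highest-scoring ones, and re-estimate R from them:

```python
import numpy as np

def top_impostor_cov(w, impostors, n_top):
    """Re-estimate the impostor covariance from the impostors that score
    highest against the target's current filter w."""
    scores = impostors @ w                      # pass impostors through the filter
    top = impostors[np.argsort(scores)[-n_top:]]  # keep the most target-like ones
    return np.cov(top, rowvar=False)

rng = np.random.default_rng(4)
impostors = rng.normal(size=(500, 8))
w = rng.normal(size=8)                 # filter from a previous step
R_top = top_impostor_cov(w, impostors, n_top=100)
# The filter would then be re-solved with R_top in place of R.
```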
Another factor to improve the speaker recognition performance was score normalization. I found this relation worked best. Contrary to z-norm or t-norm, which use the variance of the scores, we could not use the variance here, and this results in a further improvement.
Now let's go a bit more in a supervised direction: we use a within-class covariance matrix found by a clustering method, but this clustering method is somewhat different. We treat each single i-vector in the development set as a target and find the closest, most similar i-vector to it; this is repeated, adding one more i-vector each time, in order to find more similar i-vectors. After finding those i-vectors, we use this formula to compute the within-class covariance of the i-vectors that are assumed to be from the same speaker. The final model can be found by adding this W, since what we apply then does two things: it compensates the intersession variability as well as rejecting the impostors. We add them together to find this optimal filter. You can see the results: it leads to an improvement of 25 to 27.5 percent relative to the baseline system.
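The unsupervised pairing step could be sketched like this (my own reading of the description, simplified to a single nearest neighbor per i-vector): treat each development i-vector as a target, find its most similar i-vector by cosine similarity, and accumulate a within-class covariance from the resulting pairs:

```python
import numpy as np

def within_class_cov(ivectors):
    """Pair each i-vector with its cosine nearest neighbor (assumed to be
    the same speaker) and compute the within-class covariance of the pairs."""
    X = ivectors / np.linalg.norm(ivectors, axis=1, keepdims=True)
    sim = X @ X.T                      # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)     # exclude self-matches
    nn = sim.argmax(axis=1)            # nearest neighbor of each i-vector
    diffs = ivectors - ivectors[nn]    # within-pair deviations
    return diffs.T @ diffs / (2 * len(ivectors))

rng = np.random.default_rng(5)
dev = rng.normal(size=(200, 8))
W = within_class_cov(dev)
```

In the talk this W is then combined with the impostor covariance so that the final filter handles intersession variability and impostor rejection at the same time.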
In conclusion, we have proposed a new idea from signal processing, adaptive filtering, in order to solve the i-vector challenge, and we have shown that a modification of the impostor covariance matrix is possible this way. We think we could also improve speaker recognition by applying this idea to PLDA, but we did not have enough time to do that. Thank you for listening.
Question: If I remember correctly, around 2011 we did language identification with something like cosine scoring, where the target model was length-normalized but the test was not. The backend was able to calibrate the scores, but you get a shift in the scores that depends on the test conditions, so the calibration suffers. The calibration worked, but we found the results were much worse when we did not normalize the test. So that was the case for language identification; for speakers it may be okay.
Okay, thank you.
Good.