Hello, my name is Anssi Kanervisto, and I am here to tell you about our work, an initial investigation on optimizing tandem speaker verification and countermeasure systems using reinforcement learning.

So that we are all on the same page: a speaker verification system verifies that the claimed identity and the provided speech sample come from the same person. In other words, the speaker verification system takes in a claimed identity and a speech sample, and if the identity matches the identity of the person who spoke, all is good and the system will let them pass. Likewise, if somebody claims to be someone they are not and provides a speech sample, the system should not let them pass.

Good. Very simple, and many of you work in this field.

Of course, when it comes to security and systems like this, there are bad guys who want to break the system. For example, somebody could record Tomi's speech here with a mobile phone and later use that recorded speech to claim that they are Tomi, by stating that they are Tomi and playing back the audio, and the system will gladly accept that. Previous work, such as the ASVspoof 2017 challenge, has shown that if you do not protect against this, the ASV system will gladly accept this kind of trial even though it shouldn't.

Likewise, you could use a gathered dataset, or collect data of somebody speaking, and then use speech synthesis or voice conversion to generate speech that sounds like Tomi and feed it to the system, and it will again accept it just fine. Again, this has been shown in previous competitions to be a problem, but you can also protect against it.

So this is where countermeasures come in. A countermeasure (CM) system takes in the same speech sample that was provided to the ASV system and also checks that the sample comes from a human speaker, rather than from a mobile phone, and that it is not synthesized or voice-converted speech; it checks that it is bona fide human speech. So, for example, if somebody has recorded somebody else's speech and feeds it to the system, the sample is now fed to the countermeasure system as well; the countermeasure system says "reject", the attacker does not get access, and the attacker is kept out. Good so far, and these competitions have shown that when you train for these kinds of situations, training to detect these replay attacks or this synthesized speech, you can detect them and everything works fine.

But one issue we had with this setup is that the ASV system and the countermeasure system are trained completely independently from each other. The ASV system has its own dataset, its own loss, its own training protocol and so on, and likewise the CM system has its own datasets, its own loss, its own training protocol, its own network architecture and so on. These are trained separately, but then they are evaluated together as one bigger system, with a completely different evaluation metric, and these two systems have never been trained to actually minimize this evaluation metric; they have only been trained on their own tasks.

So we had this coffee-room idea: when we have this kind of bigger, whole system, what if we train the ASV and CM systems on the evaluation metric directly? Maybe, on top of the already existing training they had, we also optimize them to minimize (or maximize) the evaluation metric for better results.

However, sadly, it is not so straightforward. We have this system where we feed the speech to the ASV and CM systems; both of them produce an accept/reject label (either accept or reject), and these are then fed to the evaluation metric, which usually computes error rates, namely the false rejection rate and the false acceptance rate. These are then used in various ways, depending on the evaluation metric, to come up with one number that shows how good the system as a whole is.

However, if we assume that these two systems are differentiable, e.g. they are neural networks, which is quite common these days, and we wanted to minimize the evaluation metric, we would need to compute the gradient of the evaluation metric with respect to the two systems, or rather their parameters. Sadly, we cannot compute the gradient over these hard accept/reject decisions, yet they are all required to compute the error rates for the whole evaluation metric. For example, the tandem detection cost function (t-DCF), which we will be using later, requires these two error rates and weights them in different ways, but from there we cannot backpropagate all the way back to the systems, because the hard decision is not a differentiable operation and thus we cannot calculate the gradient.
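To make this concrete, here is a minimal sketch (in PyTorch, with made-up scores and illustrative cost weights rather than the actual t-DCF coefficients) of how the hard accept/reject decision cuts the gradient:

```python
import torch

# Toy scores from the two (differentiable) systems for a batch of trials.
asv_scores = torch.randn(8, requires_grad=True)             # speaker verification scores
cm_scores = torch.randn(8, requires_grad=True)              # countermeasure scores
is_target = torch.tensor([1., 1., 0., 0., 1., 0., 1., 0.])  # 1 = trial should be accepted

# Hard accept/reject decisions: thresholding is not differentiable.
asv_accept = (asv_scores > 0.0).float()
cm_accept = (cm_scores > 0.0).float()
accept = asv_accept * cm_accept              # tandem decision: accept only if both accept

# Error rates computed from the hard decisions (simplified, not the exact t-DCF).
false_reject_rate = ((1 - accept) * is_target).sum() / is_target.sum()
false_accept_rate = (accept * (1 - is_target)).sum() / (1 - is_target).sum()
cost = 1.0 * false_reject_rate + 10.0 * false_accept_rate

# The comparison ops cut the autograd graph: no gradient can reach the scores,
# so cost.backward() would fail and the systems cannot be updated this way.
print(cost.requires_grad)  # False
```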

On a related topic, other work has suggested soft versions of error metrics. For example, by softening the F1-score or the area-under-curve (AUC) loss, you can come up with a differentiable version of the metric and then do this computation. Softening means that the hard decisions are smoothed so that you end up with a function you can actually take the derivative of, and then you can compute this gradient. However, the tandem detection cost function we have here does not have such a soft version.
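To illustrate what "softening" means here (a generic sketch of the idea, not any specific soft-F1 or soft-AUC formulation from the literature, and the temperature value is arbitrary):

```python
import torch

scores = torch.randn(8, requires_grad=True)
is_target = torch.tensor([1., 1., 0., 0., 1., 0., 1., 0.])

# Hard decision: indicator(score > 0), whose derivative is zero almost everywhere.
# Soft decision: a sigmoid approximates the indicator but stays differentiable.
soft_accept = torch.sigmoid(scores / 0.1)

# A "soft" false acceptance rate built from the smooth decisions.
soft_far = (soft_accept * (1 - is_target)).sum() / (1 - is_target).sum()
soft_far.backward()
print(scores.grad)  # non-zero gradients now flow back to the scores
```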

So instead, we looked into reinforcement learning. In reinforcement learning, a simplified setup looks like this: the computer agent sees an image, or some other information extracted from the game or the environment; the agent chooses an action; the action is then executed in the environment; and depending on whether the outcome of the action is good or not, the agent receives some reward. The goal of this whole setup is to get as much reward as possible, i.e. to modify the agent so that it obtains as much reward as possible.

One way to do this is, again, via the gradient: we could take the gradient of the expected reward, that is, the reward averaged over all the different situations in this setup, with respect to the policy, i.e. the agent's parameters. If we can do this, we can of course update the agent in the direction that increases the amount of reward. However, here we also have the problem that you cannot really differentiate the decision part, where we choose one specific action out of many and execute it in the environment; we cannot differentiate that, and so we cannot compute the gradient.

However, there is a thing called the policy gradient, which estimates this gradient. It uses an equation where, instead of calculating the gradient of the reward directly, it computes the gradient of the log-probabilities of the selected actions and weights them by the reward we got. This has been shown to be quite effective in reinforcement learning, and it has also been shown that you can replace the reward with any function and, given enough samples, still obtain the correct gradient and the same results.
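As a minimal sketch of this estimator (a generic REINFORCE-style example with a made-up two-action bandit, not the setup from our paper):

```python
import torch

# A tiny policy: a categorical distribution over two actions, parameterized by logits.
logits = torch.zeros(2, requires_grad=True)
optimizer = torch.optim.SGD([logits], lr=0.1)

def reward(action: int) -> float:
    # Stand-in for "any function" of the outcome: action 1 simply pays more.
    return 1.0 if action == 1 else 0.0

for _ in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()                       # hard, non-differentiable choice
    r = reward(action.item())
    # Policy gradient: weight the log-probability of the chosen action by the reward.
    loss = -dist.log_prob(action) * r
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(torch.softmax(logits, dim=0))  # probability mass shifts towards the rewarding action
```

With enough samples, this noisy estimate points, on average, in the same direction as the true gradient of the expected reward.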

So, going back to our tandem optimization, where we had the same problem of hard decisions that we cannot differentiate, we just apply this policy gradient, or policy gradient theorem, here. The equation is more or less the same, just with the terms having a different meaning.
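A simplified sketch of how this looks in our setting (assumed function names and illustrative cost weights; the actual algorithm and t-DCF weighting in the paper differ in detail):

```python
import torch

def tandem_pg_step(asv_probs, cm_probs, is_target, optimizer):
    """One simplified tandem-optimization step via the policy gradient.

    asv_probs, cm_probs: accept-probabilities (in (0, 1)) produced by the two systems,
    still attached to their computation graphs.  is_target: 1 if the trial should be
    accepted (bona fide target speaker), else 0.  Cost weights are illustrative only.
    """
    asv_dist = torch.distributions.Bernoulli(probs=asv_probs)
    cm_dist = torch.distributions.Bernoulli(probs=cm_probs)
    asv_accept = asv_dist.sample()            # hard, sampled accept/reject decisions
    cm_accept = cm_dist.sample()
    accept = asv_accept * cm_accept           # tandem decision: both must accept

    # Detection-cost-style number computed from the hard decisions (non-differentiable).
    miss_rate = ((1 - accept) * is_target).sum() / is_target.sum().clamp(min=1)
    fa_rate = (accept * (1 - is_target)).sum() / (1 - is_target).sum().clamp(min=1)
    cost = 1.0 * miss_rate + 10.0 * fa_rate

    # Policy gradient surrogate: log-probabilities of the sampled decisions, weighted
    # by the cost.  Minimizing this pushes down the probability of costly decisions.
    log_prob = asv_dist.log_prob(asv_accept) + cm_dist.log_prob(cm_accept)
    loss = (log_prob * cost.detach()).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return cost.item()
```

Here the cost plays the role of a negative reward, so minimizing the expected cost corresponds to maximizing the expected reward in the usual reinforcement learning formulation.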

We then proceeded to test how well this works, with a rather simple setup. We have two datasets. The first is the VoxCeleb1 dataset, more specifically the speaker verification part of it.

The second is the ASVspoof 2019 dataset, for the synthesized speech and the other countermeasure-related samples and labels. For the ASV task we extract x-vectors using the pretrained Kaldi models, and for ASVspoof we extract CQCC features; these are fixed, so the feature extractors are not being trained in this setup.

We then train the ASV system and the CM system separately, as is normally done, using these two datasets, and evaluate them together using the t-DCF cost function as presented in the ASVspoof 2019 competition. After this, we take the two pretrained systems, perform the tandem optimization in the manner shown previously, and finally evaluate the results and compare them with the pretrained results to see if it actually helps.

In a very short nutshell: the tandem optimization helps. One way to see this is by looking at this learning curve, where on the x-axis you have the number of updates you do (so you can compare it to the usual curves where you have the loss and the number of epochs), and on the y-axis you have the relative change on the evaluation set compared to the pretrained system. If it is zero percent, it means the metric did not change from the pretrained system.

The main metric we wanted to minimize is the minimum normalized tandem detection cost function (min t-DCF), and this indeed decreased over time as we did the tandem optimization: from a 0 percent change it went to a minus 25 percent change, so yes, it improved.

then we also studied

how the

in the a visual systems changed over the training

so

for example to compliments or equal error rate in the

condom is a zone pass process detecting if it's move or not

it also improved by around ten percent

in this task but interestingly the a s v

e r

increased over time and we help of places that

because this is because

when

we have a way that

the a s p system in condom is a task and the condom sre in

pac task so looked at tasks we notice that of that these phantom optimization

the

to have improved in there

it's others task so a as we was better encounter mister

tasks

and counter measure was also a slightly better in the a speaker verification task so

we hypothesize that this kind of outweigh that the speaker verification systems

normal task of do that can correct speaker and it started to kind of

thick the condom answers these proved samples instead

We also compared this to a simple baseline, where instead of using the tandem optimization we just independently continued training the two systems, using the same samples as in the policy gradient method. Basically, we use the same samples: the ASV samples to continue updating the ASV system, and the countermeasure samples to update the countermeasure system, independently and completely separately from each other. We see the same ASV behaviour here, but the countermeasure system's equal error rate just explodes in the beginning and then slowly creeps back down. In the end, averaged over multiple runs, we see that the policy gradient method improves the results by 26 percent and the fine-tuning improves the results by 7.84 percent, but note that the fine-tuning results have a much higher variance than the policy gradient version.

These results are all very positive, but as said, this was a very initial investigation, and I highly recommend that you check out the paper for more results and figures. And that's all. Thank you for listening, and be sure to check out the paper and the code behind that link.