0:00:13hello my name is Anssi Kanervisto and i'm here to tell you about
0:00:15our work on an initial investigation on optimising tandem speaker verification and countermeasure systems
0:00:21using reinforcement learning
0:00:23so that we are all on the same page
0:00:26a speaker verification system
0:00:29verifies that the
0:00:30claimed identity
0:00:31and the
0:00:32provided speech sample come from the same person so an
0:00:37automatic speaker verification system takes in a claimed identity
0:00:40and some speech sample and if the identity matches
0:00:43the identity of the person who spoke
0:00:46then all good the system will pass them
0:00:49likewise if somebody else
0:00:52claims that they are someone that they are not and provides a speech sample
0:00:56it shouldn't let them pass
0:00:59good
0:00:59very simple many of you work in this field
0:01:02of course when it comes to security and systems like this there are some bad
0:01:05guys who want to break the system
0:01:08so for example in this case
0:01:10somebody could record
0:01:11Tommy's speech here with a mobile phone and then later use that recorded speech
0:01:16to claim that they are Tommy by saying that they are Tommy and
0:01:19playing the audio and the system will gladly accept that and previous work has
0:01:24shown in the ASVspoof 2017
0:01:27challenge that if you don't protect against this the ASV system will
0:01:32gladly accept this kind of a trial even though it shouldn't
0:01:37likewise
0:01:38you could
0:01:39use a gathered dataset or collect data of
0:01:41somebody speaking
0:01:42and then use speech synthesis or voice conversion
0:01:45to again generate speech that sounds like Tommy and feed it to the system and
0:01:49it will again accept it all fine
0:01:52and again this has been shown in previous competitions it's a bad problem but
0:01:56you can also fix this
0:01:58so
0:01:59this is
0:02:00where
0:02:01the countermeasures come in so a countermeasure system takes in
0:02:05the
0:02:06speech sample that was provided to the ASV system as well and also
0:02:11checks that the sample comes from a
0:02:15human speaker instead of like a mobile phone and that it's not synthesised speech or voice
0:02:21converted speech so
0:02:23it's like bona fide human speech
0:02:26so for example somebody
0:02:29has recorded somebody else's speech
0:02:31fed it to the ASV system but now it's fed to the countermeasure system as well
0:02:35and the countermeasure system says it's a reject and then the
0:02:39attacker does not get access so it keeps the attacker out
0:02:43good so far and these competitions have shown that when you train for these kinds
0:02:48of situations you train to detect these replay
0:02:50attacks or this
0:02:52synthesised speech
0:02:53you can detect them and all works fine
0:02:57so
0:02:59but one problem we had with this
0:03:02setup is that
0:03:04the
0:03:05ASV system
0:03:06and the
0:03:07countermeasure system are trained completely independently from each other so the ASV system has
0:03:13its own dataset its own loss its own training protocols
0:03:17and so on and so on
0:03:19likewise the CM system has its own datasets its own loss its own training protocols and its
0:03:23own network architecture and so on and so on
0:03:27these are trained separately but then they are evaluated together so they are evaluated as
0:03:31one bigger system
0:03:33so
0:03:34where you have a completely different
0:03:36evaluation metric and these two systems have never been trained to actually minimize this
0:03:42evaluation metric they have been trained on their own tasks
0:03:46so
0:03:47we had this
0:03:48coffee room idea
0:03:50where
0:03:52what if when we have this kind of
0:03:55bigger whole system
0:03:57what if we train the
0:04:00ASV and CM systems on the evaluation metric directly so
0:04:05maybe on top of the already existing training they had we also optimize them
0:04:11to minimize or maximize the evaluation metric for better results
0:04:16and
0:04:17however
0:04:18sadly
0:04:19it's not so very straightforward
0:04:22so
0:04:23we have this system where we feed the speech to the ASV and CM systems
0:04:27they produce
0:04:28both of them produce like accept and reject labels so either accept or
0:04:33reject and these are then fed to the evaluation metric which usually computes
0:04:38the error rates so false rejection rate and false acceptance rate
0:04:43and
0:04:44these are then used in various ways depending on the evaluation metric
0:04:48to kind of come up with one number to show how good the system
0:04:51as a whole is
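To make the evaluation pipeline just described concrete, here is a minimal Python sketch. It assumes the tandem system accepts a trial only when both subsystems accept, and it uses made-up decisions, labels and cost weights (`w_reject`, `w_accept`); it is not the t-DCF used in the talk, only an illustration of how hard decisions become error rates and then a single number.

```python
# Toy tandem evaluation: hard accept/reject decisions -> error rates -> one number.
# All decisions, labels and weights below are hypothetical.

def tandem_accept(asv_accept, cm_accept):
    """Assumed combination rule: accept only if both subsystems accept."""
    return asv_accept and cm_accept

def error_rates(decisions, should_accept):
    """Return (false_rejection_rate, false_acceptance_rate) from hard decisions."""
    targets = [d for d, t in zip(decisions, should_accept) if t]
    nontargets = [d for d, t in zip(decisions, should_accept) if not t]
    false_rejection = sum(1 for d in targets if not d) / len(targets)
    false_acceptance = sum(1 for d in nontargets if d) / len(nontargets)
    return false_rejection, false_acceptance

def weighted_cost(false_rejection, false_acceptance, w_reject=1.0, w_accept=10.0):
    """Combine the two error rates into one number with illustrative weights."""
    return w_reject * false_rejection + w_accept * false_acceptance

asv_decisions = [True, True, False, True, True, False]
cm_decisions = [True, False, True, True, False, True]
should_accept = [True, True, True, False, False, False]

decisions = [tandem_accept(a, c) for a, c in zip(asv_decisions, cm_decisions)]
frr, far = error_rates(decisions, should_accept)
cost = weighted_cost(frr, far)
print(frr, far, cost)
```

Different evaluation metrics weight the error rates differently; only `weighted_cost` would change in this sketch.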
0:04:54however
0:04:55if we assume that these two ASV and CM systems are differentiable so they
0:04:59are like neural networks which is quite common these days
0:05:04if we wanted to
0:05:05minimize the evaluation metric we would need to compute the gradient of the evaluation metric with
0:05:10respect to the
0:05:11two systems we have or their parameters
0:05:14sadly
0:05:15we can't compute the gradient over these hard decisions of like accept or reject
0:05:21and
0:05:22these are required for the error rates
0:05:25for the whole evaluation metric so for example the tandem
0:05:28decision cost function which we will be using later
0:05:32requires these two error rates
0:05:34and it's going to weight them in different ways but we can't from that
0:05:38go back and compute the gradient all the way back to the systems
0:05:42because this hard decision is not a differentiable operation and thus we can't calculate the
0:05:48gradient
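The problem with hard decisions can be seen numerically. The sketch below assumes a hypothetical system that rejects a trial when its score falls below a threshold, and checks with a finite difference that the resulting false rejection rate does not react to a small change of the threshold. The scores and names are illustrative only.

```python
# A hard accept/reject decision makes the error rate piecewise constant,
# so its derivative with respect to the decision threshold is zero almost everywhere.

def hard_false_rejection_rate(threshold, target_scores):
    """Fraction of target trials whose score falls below the threshold."""
    rejected = sum(1 for s in target_scores if s < threshold)
    return rejected / len(target_scores)

def finite_difference(f, x, eps=1e-4):
    """Central finite-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

target_scores = [0.2, 0.9, 1.5, 2.3]  # hypothetical scores of genuine trials
rate = hard_false_rejection_rate(1.0, target_scores)
grad = finite_difference(lambda t: hard_false_rejection_rate(t, target_scores), 1.0)
print(rate, grad)
```

A gradient of zero carries no information about which direction improves the metric, which is why plain backpropagation through the decisions does not help here.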
0:05:49so
0:05:55on a related topic
0:05:57other work has suggested soft versions of error metrics
0:06:02so for example
0:06:04by softening the f-score f one score or area under curve loss you can come
0:06:10up with a differentiable version of the
0:06:13metric
0:06:13and then you can do this computation so
0:06:17softening means that these hard decisions are kind of softened so that
0:06:20you have a function you can actually take the derivative of
0:06:24and then you can compute this gradient however
0:06:27the tandem decision cost function we have here does not have such a soft version
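As an illustration of what softening means, the following sketch compares a hard F1 score with a soft variant in which each thresholded decision is replaced by a sigmoid. The `sharpness` parameter, the scores and the labels are assumptions made for this example and are not taken from the talk or from any specific paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def f1_from_predictions(preds, labels):
    """F1 computed from (possibly fractional) predictions in [0, 1]."""
    tp = sum(p * y for p, y in zip(preds, labels))
    fp = sum(p * (1 - y) for p, y in zip(preds, labels))
    fn = sum((1 - p) * y for p, y in zip(preds, labels))
    return 2.0 * tp / (2.0 * tp + fp + fn)

def hard_f1(scores, labels, threshold):
    """Usual F1: every score is turned into a 0/1 decision first."""
    return f1_from_predictions([1.0 if s > threshold else 0.0 for s in scores], labels)

def soft_f1(scores, labels, threshold, sharpness=1.0):
    """Soft F1: decisions are replaced by sigmoid(sharpness * (score - threshold))."""
    return f1_from_predictions([sigmoid(sharpness * (s - threshold)) for s in scores], labels)

def finite_difference(f, x, eps=1e-4):
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

scores = [2.0, 0.5, -0.5, -2.0]
labels = [1, 1, 0, 0]
hard_grad = finite_difference(lambda t: hard_f1(scores, labels, t), 0.0)
soft_grad = finite_difference(lambda t: soft_f1(scores, labels, t), 0.0)
print(hard_grad, soft_grad)
```

The hard version gives a zero derivative with respect to the threshold, while the soft version gives a usable non-zero one.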
0:06:31so
0:06:32instead
0:06:34we looked into reinforcement learning
0:06:37so
0:06:38in reinforcement learning a very simplified setup is like this so
0:06:43the computer agent
0:06:45sees an image or something like that or it's
0:06:47provided some information from the game or the environment
0:06:51the agent chooses an action a
0:06:54and
0:06:55the action is then executed on the environment
0:06:58and depending on if the outcome of the action is good or not the agent receives
0:07:03some reward and
0:07:05in reinforcement learning
0:07:06the goal of this whole setup is to
0:07:11get as much reward as possible so modify the agent so that you get
0:07:15as much reward as possible
0:07:17and one way
0:07:18to do this is
0:07:20kind of with the gradient
0:07:22well so
0:07:23we could take the gradient
0:07:25of the expected reward so the reward averaged
0:07:29over all the different situations in the setup
0:07:32and take the gradient of that with respect to the policy so with respect to the
0:07:37agent so if we can do this we can then
0:07:39of course update the agent
0:07:43towards the direction that increases the amount of reward
0:07:46however
0:07:47here we also have this problem that you can't really
0:07:51differentiate this decision part where we
0:07:53choose
0:07:54one specific action of many
0:07:56and
0:07:58execute that on the environment so we can't differentiate that and we can't
0:08:01compute the gradient
0:08:04however there is a thing called policy gradient
0:08:07which kind of estimates this gradient
0:08:12it is kind of an equation where
0:08:15instead of calculating the gradient
0:08:17of the reward directly it computes the gradient of the
0:08:21log probabilities of the
0:08:23selected actions
0:08:25and weights them by the reward
0:08:28we got
0:08:29and this has been shown in reinforcement learning
0:08:32to
0:08:33be quite effective already and it has also been shown that you can replace
0:08:38the reward with any function and then
0:08:43by running this you will also
0:08:45get the correct gradient so you can
0:08:47get the same results with enough samples
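The estimator described here can be checked on a tiny example. The sketch below assumes a policy with a single parameter `theta` that picks action 1 with probability `sigmoid(theta)`, and an arbitrary, made-up `score` function in place of the reward. It compares the exact gradient of the expected score with the sampled estimate, that is, the average of score times the gradient of the log-probability of the sampled action.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def score(action):
    """Any function of the sampled action can stand in for the reward (values are made up)."""
    return 1.0 if action == 1 else -0.5

def exact_gradient(theta):
    """d/dtheta of J(theta) = p * score(1) + (1 - p) * score(0), with p = sigmoid(theta)."""
    p = sigmoid(theta)
    return p * (1.0 - p) * (score(1) - score(0))

def estimated_gradient(theta, num_samples=20000, seed=0):
    """Policy-gradient estimate: mean of score(action) * d/dtheta log pi(action)."""
    rng = random.Random(seed)
    p = sigmoid(theta)
    total = 0.0
    for _ in range(num_samples):
        action = 1 if rng.random() < p else 0
        grad_log_prob = action - p  # derivative of the log-probability of a Bernoulli(sigmoid(theta)) action
        total += score(action) * grad_log_prob
    return total / num_samples

theta = 0.3
exact = exact_gradient(theta)
estimate = estimated_gradient(theta)
print(exact, estimate)
```

With enough samples the two numbers agree closely, even though the sampled action itself is never differentiated.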
0:08:52so
0:08:52going back to our tandem optimization
0:08:55where we had the
0:08:56same problem of
0:08:57hard decisions which we can't
0:09:00differentiate we just apply this
0:09:03policy gradient here the policy gradient theorem here
0:09:07where the
0:09:08equation is more or less the same
0:09:10just with the terms having a different meaning
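A toy version of this idea, under strong simplifying assumptions, might look as follows. Each system is reduced to a single weight on a fixed score, decisions are sampled as accept or reject, the tandem system accepts only when both accept, and the per-trial cost (1 for a wrong tandem decision, 0 otherwise) plays the role of the negative reward. The three trial types, the cost and all hyperparameters are hypothetical; the systems in the talk are neural networks trained against the t-DCF.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical trials as (asv_score, cm_score, should_accept):
# a genuine target, a zero-effort impostor, and a spoofed sample.
TRIALS = [(1.0, 1.0, True), (-1.0, 1.0, False), (1.0, -1.0, False)]

def expected_cost(w_asv, w_cm):
    """Exact probability of a wrong tandem decision, averaged over the trials."""
    total = 0.0
    for asv_score, cm_score, should_accept in TRIALS:
        p_accept = sigmoid(w_asv * asv_score) * sigmoid(w_cm * cm_score)
        total += (1.0 - p_accept) if should_accept else p_accept
    return total / len(TRIALS)

def train(w_asv, w_cm, steps=1000, samples=10, lr=0.2, seed=0):
    """Update both systems jointly with the policy-gradient estimate of the tandem cost."""
    rng = random.Random(seed)
    for _step in range(steps):
        grad_asv, grad_cm, count = 0.0, 0.0, 0
        for asv_score, cm_score, should_accept in TRIALS:
            p_asv = sigmoid(w_asv * asv_score)
            p_cm = sigmoid(w_cm * cm_score)
            for _sample in range(samples):
                asv_accept = rng.random() < p_asv
                cm_accept = rng.random() < p_cm
                accepted = asv_accept and cm_accept
                cost = 0.0 if accepted == should_accept else 1.0
                # d/dw log Bernoulli(a; sigmoid(w * s)) = (a - p) * s
                grad_asv += cost * (float(asv_accept) - p_asv) * asv_score
                grad_cm += cost * (float(cm_accept) - p_cm) * cm_score
                count += 1
        # Gradient descent, because the cost is to be minimized.
        w_asv -= lr * grad_asv / count
        w_cm -= lr * grad_cm / count
    return w_asv, w_cm

cost_before = expected_cost(0.1, 0.1)
w_asv, w_cm = train(0.1, 0.1)
cost_after = expected_cost(w_asv, w_cm)
print(cost_before, cost_after)
```

Both systems are updated from the same tandem cost, so neither is trained only on its own task.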
0:09:17so
0:09:18we then proceeded to test this how well it works
0:09:22with a rather trivial setup so
0:09:25we have two datasets
0:09:26the VoxCeleb1
0:09:28dataset and more specifically the speaker verification
0:09:31part of it
0:09:32and then we have the ASVspoof 2019
0:09:35for the synthesized speech and the other countermeasure tasks and labels and
0:09:41for the ASV task we extract the x-vectors using the pretrained
0:09:46Kaldi models
0:09:46and for ASVspoof we extract CQCC features
0:09:50and these are fixed so these are not being trained in this
0:09:53setup
0:09:55we then train the ASV system and the CM system as they are normally
0:10:00done using these two datasets
0:10:03and then evaluate them together using the t-DCF cost function as
0:10:10presented in the ASVspoof 2019 competition
0:10:13then we take the two pretrained systems and perform tandem optimization on them as previously
0:10:19shown and then finally we evaluate the results
0:10:23and compare them with the pretrained results and see if it actually helps
0:10:28so
0:10:29let surprisingly
0:10:30the tandem optimization helps
0:10:34in a very short nutshell
0:10:35so
0:10:37one way we see this
0:10:39is by looking at this learning curve where on the x-axis you have
0:10:44the number of updates
0:10:47you do so you can compare this to curves where you have the loss and
0:10:50the number of epochs
0:10:51and on the y-axis
0:10:53you will have
0:10:54the relative change on the evaluation set
0:10:57compared to the pretrained system so if it was zero percent
0:11:01it means the
0:11:02metric did not change since the pretrained system
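The y-axis quantity can be written as a one-line function. The metric values below are invented purely to show the arithmetic and are not results from the talk.

```python
def relative_change_percent(pretrained_value, current_value):
    """Relative change of a metric compared to the pretrained system, in percent."""
    return 100.0 * (current_value - pretrained_value) / pretrained_value

# Hypothetical metric values: an unchanged metric and one that dropped by a quarter.
unchanged = relative_change_percent(0.20, 0.20)
improved = relative_change_percent(0.20, 0.15)
print(unchanged, improved)
```

For a cost that should be minimized, a negative relative change means the system got better.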
0:11:07so
0:11:08the main metric we wanted to minimize
0:11:11is the minimum
0:11:12tandem decision cost function normalized
0:11:15and this indeed
0:11:17decreased over time as we do this tandem optimization so
0:11:21from zero percent change it went to minus twenty five percent change so yes it
0:11:26improved
0:11:30then we also studied
0:11:31how the
0:11:33individual systems changed over the training
0:11:36so
0:11:37for example the countermeasure's equal error rate in the
0:11:42countermeasure's own task of detecting if it's spoofed or not
0:11:45also improved by around ten percent
0:11:47in this task but interestingly the ASV
0:11:51EER
0:11:52increased over time and we hypothesize that
0:11:56this is because
0:11:58when
0:11:58we evaluated
0:11:59the ASV system in the countermeasure task and the countermeasure system in
0:12:04the ASV task so swapped the tasks we noticed that after this tandem optimization
0:12:10the
0:12:12two had improved in
0:12:13each other's task so ASV was better in the countermeasure
0:12:18task
0:12:18and the countermeasure was also slightly better in the speaker verification task so
0:12:23we hypothesize that this kind of outweighed the speaker verification system's
0:12:31normal task of detecting the correct speaker and it started to kind of
0:12:35detect the countermeasure's spoofed samples instead
0:12:39so
0:12:40we also compared this to a simple baseline
0:12:43where instead of using this tandem optimization we just independently trained
0:12:49continued training the two systems using the same samples as in the policy gradient method
0:12:54so
0:12:56basically we just use the same samples and use the ASV samples to
0:13:01continue updating the ASV system and then we use the countermeasure
0:13:06samples to update the countermeasure system independently completely separately from each other and
0:13:11we see the same ASV behaviour here but
0:13:15the countermeasure system's equal error rate just exploded in the beginning and then slowly
0:13:20crept back down and in the end averaged over multiple runs
0:13:26we see that the
0:13:27policy gradient method improves the results by twenty six percent and the fine tuning
0:13:31improves the results by seven point eight four percent
0:13:35but do note that the results of the fine tuning
0:13:39have a lot higher variance
0:13:42than in the
0:13:44policy gradient
0:13:45version
0:13:48but
0:13:49these results are all very positive but as this was a very initial investigation
0:13:55i highly recommend that you
0:13:57check out the paper for more results and figures
0:14:00and that's all
0:14:02so
0:14:02thank you for listening and be sure to check out the paper and the code
0:14:05behind that link