| 0:00:13 | hello, my name is Anssi Kanervisto, and I'm here to tell you about |
|---|
| 0:00:15 | our work, an initial investigation on optimizing tandem speaker verification and countermeasure systems |
|---|
| 0:00:21 | using reinforcement learning |
|---|
| 0:00:23 | so that we are all on the same page: |
|---|
| 0:00:26 | a speaker verification system |
|---|
| 0:00:29 | verifies that the |
|---|
| 0:00:30 | claimed identity |
|---|
| 0:00:31 | and the |
|---|
| 0:00:32 | provided speech sample come from the same person. So |
|---|
| 0:00:37 | a speaker verification system takes in a claimed identity |
|---|
| 0:00:40 | and a speech sample, and if the claimed identity matches |
|---|
| 0:00:43 | the identity of the person who spoke, |
|---|
| 0:00:46 | then all is good, the system will let them pass. And then |
|---|
| 0:00:49 | likewise if somebody else |
|---|
| 0:00:52 | claims that they are someone they are not and provides a speech sample, |
|---|
| 0:00:56 | it shouldn't let them pass |
|---|
| 0:00:59 | good |
|---|
| 0:00:59 | very simple; many of you work in this field |
|---|
| 0:01:02 | of course, when it comes to security and systems like this, there are some bad |
|---|
| 0:01:05 | guys who want to break the system |
|---|
| 0:01:08 | so for example in this case |
|---|
| 0:01:10 | somebody could record |
|---|
| 0:01:11 | Tomi's speech here with a mobile phone and then later use that recorded speech |
|---|
| 0:01:16 | to claim that they are Tomi, by saying that they are Tomi and |
|---|
| 0:01:19 | playing the audio, and the system will gladly accept that. And previous work has |
|---|
| 0:01:24 | shown, in the ASVspoof 2017 |
|---|
| 0:01:27 | challenge, that if you do not protect against this, the ASV system will |
|---|
| 0:01:32 | gladly accept this kind of a trial even though it shouldn't |
|---|
| 0:01:37 | likewise |
|---|
| 0:01:38 | you could |
|---|
| 0:01:39 | use a gathered dataset, or collect data of |
|---|
| 0:01:41 | somebody speaking, |
|---|
| 0:01:42 | and then use speech synthesis or voice conversion |
|---|
| 0:01:45 | to again generate speech that sounds like Tomi and feed it to the system, and |
|---|
| 0:01:49 | it will again accept it all fine |
|---|
| 0:01:52 | and again this has been shown in previous competitions to be a problem, but |
|---|
| 0:01:56 | you can also protect against this |
|---|
| 0:01:58 | so |
|---|
| 0:01:59 | this is |
|---|
| 0:02:00 | where |
|---|
| 0:02:01 | the countermeasures come in. So a countermeasure system takes in |
|---|
| 0:02:05 | the |
|---|
| 0:02:06 | speech sample that was provided to the ASV system as well, and also |
|---|
| 0:02:11 | checks that the sample comes from a |
|---|
| 0:02:15 | live human speaker, instead of, say, a mobile phone replay, synthesized speech or voice- |
|---|
| 0:02:21 | converted speech. So |
|---|
| 0:02:23 | it checks that it is bona fide human speech |
|---|
| 0:02:26 | so for example somebody |
|---|
| 0:02:29 | has recorded somebody else's speech |
|---|
| 0:02:31 | feeds it to the system, but now it is fed to the countermeasure system as well, |
|---|
| 0:02:35 | and the countermeasure system says reject, and then the |
|---|
| 0:02:39 | attacker does not get access, so it keeps the attacker out |
|---|
| 0:02:43 | good so far. And these competitions have shown that when you train for these kinds |
|---|
| 0:02:48 | of situations, when you train to detect these replay |
|---|
| 0:02:50 | attacks or |
|---|
| 0:02:52 | synthesized speech, |
|---|
| 0:02:53 | you can detect them and all works fine |
|---|
| 0:02:57 | so |
|---|
| 0:02:59 | but one issue we had with this |
|---|
| 0:03:02 | setup is that |
|---|
| 0:03:04 | the |
|---|
| 0:03:05 | ASV system |
|---|
| 0:03:06 | and the |
|---|
| 0:03:07 | countermeasure system are trained completely independently from each other. So the ASV system has |
|---|
| 0:03:13 | its own datasets, its own losses, its own training protocols |
|---|
| 0:03:17 | and so on and so on |
|---|
| 0:03:19 | likewise, the CM system has its own datasets, its own losses, its own training protocols and its |
|---|
| 0:03:23 | own network architecture and so on and so on |
|---|
| 0:03:27 | these are trained separately, but then they are evaluated together, so they are evaluated as |
|---|
| 0:03:31 | one bigger system, |
|---|
| 0:03:33 | so |
|---|
| 0:03:34 | where you have a completely different |
|---|
| 0:03:36 | evaluation metric, and these two systems have never been trained to actually minimize this |
|---|
| 0:03:42 | evaluation metric; they have been trained on their own tasks |
|---|
| 0:03:46 | so |
|---|
| 0:03:47 | we had this |
|---|
| 0:03:48 | coffee-room idea |
|---|
| 0:03:50 | where, |
|---|
| 0:03:52 | what if when we have this kind of |
|---|
| 0:03:55 | bigger whole system |
|---|
| 0:03:57 | what if we train the |
|---|
| 0:04:00 | ASV and the CM system on the evaluation metric directly. So |
|---|
| 0:04:05 | maybe, on top of the already existing training they had, we also optimize them |
|---|
| 0:04:11 | to minimize or maximize the evaluation metric for better results |
|---|
| 0:04:16 | and |
|---|
| 0:04:17 | however |
|---|
| 0:04:18 | sadly |
|---|
| 0:04:19 | it is not very straightforward |
|---|
| 0:04:22 | so |
|---|
| 0:04:23 | we have this system where we feed the speech to the ASV and CM systems |
|---|
| 0:04:27 | they produce, |
|---|
| 0:04:28 | both of them produce an accept or reject label, so either accept or |
|---|
| 0:04:33 | reject, and these are then fed to the evaluation metric, which usually computes |
|---|
| 0:04:38 | the error rates, so the false rejection rate and the false acceptance rate, |
|---|
| 0:04:43 | and |
|---|
| 0:04:44 | these are then used in various ways, depending on the evaluation metric, |
|---|
| 0:04:48 | to kind of come up with one number to show how good the system |
|---|
| 0:04:51 | as a whole is |
|---|
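(A quick aside: the error rates mentioned above can be computed from the hard accept/reject labels as below. This is a minimal sketch with made-up decisions and labels, not data from the paper.)

```python
import numpy as np

# Made-up hard decisions (1 = accept) and ground truth (1 = target trial).
decisions = np.array([1, 1, 0, 0, 1, 0, 1, 0])
is_target = np.array([1, 0, 0, 1, 1, 0, 0, 1])

# False acceptance rate: fraction of non-target trials that were accepted.
far = np.mean(decisions[is_target == 0] == 1)
# False rejection rate: fraction of target trials that were rejected.
frr = np.mean(decisions[is_target == 1] == 0)

print(far, frr)  # 0.5 0.5
```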
| 0:04:54 | however |
|---|
| 0:04:55 | if we assume that these two systems are differentiable, so they |
|---|
| 0:04:59 | are, say, neural networks, which is quite common these days, |
|---|
| 0:05:04 | if we wanted to |
|---|
| 0:05:05 | minimize the evaluation metric, we would need to compute the gradient of the evaluation metric with |
|---|
| 0:05:10 | respect to the |
|---|
| 0:05:11 | two systems we have, or their parameters |
|---|
| 0:05:14 | sadly |
|---|
| 0:05:15 | we cannot compute the gradient over these hard decisions of accept or reject, |
|---|
| 0:05:21 | and |
|---|
| 0:05:22 | but these are all required for the error rates, |
|---|
| 0:05:25 | for the whole evaluation metric. So, for example, the tandem |
|---|
| 0:05:28 | decision cost function which we will be using later |
|---|
| 0:05:32 | requires these two error rates |
|---|
| 0:05:34 | and it is going to weight them in different ways, but we cannot, from that, |
|---|
| 0:05:38 | compute the gradient all the way back to the systems, |
|---|
| 0:05:42 | because this hard decision is not a differentiable operation and thus we cannot calculate the |
|---|
| 0:05:48 | gradient |
|---|
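(To illustrate why: below is a toy weighted cost in the spirit of the t-DCF, with made-up scores and cost weights, not the official formula. Because it is built from hard decisions, the cost is piecewise constant, so its gradient with respect to the threshold, or the model parameters behind the scores, is zero almost everywhere.)

```python
import numpy as np

# Illustrative weighted cost built from hard decisions (made-up weights,
# NOT the official t-DCF formula).
def weighted_cost(scores, is_target, threshold, c_miss=1.0, c_fa=10.0):
    decisions = scores > threshold            # hard accept/reject
    frr = np.mean(~decisions[is_target])      # false rejection rate
    far = np.mean(decisions[~is_target])      # false acceptance rate
    return c_miss * frr + c_fa * far

scores = np.array([0.9, 0.2, 0.6, 0.4])       # made-up trial scores
is_target = np.array([True, False, True, False])

# Nearby thresholds give identical costs, then the cost suddenly jumps:
# the function is a staircase, so its gradient is zero or undefined.
print(weighted_cost(scores, is_target, 0.30))  # 5.0
print(weighted_cost(scores, is_target, 0.35))  # 5.0
print(weighted_cost(scores, is_target, 0.45))  # 0.0
```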
| 0:05:49 | so |
|---|
| 0:05:54 | in a |
|---|
| 0:05:55 | related topic, |
|---|
| 0:05:57 | other work has suggested soft versions of error metrics |
|---|
| 0:06:02 | so for example |
|---|
| 0:06:04 | by softening the F1 score or the area-under-curve loss, you can come |
|---|
| 0:06:10 | up with a differentiable version of the |
|---|
| 0:06:13 | score metric |
|---|
| 0:06:13 | and then you can do this computation. So, |
|---|
| 0:06:17 | by softening, it means these hard decisions are kind of softened so that |
|---|
| 0:06:20 | you have a function you can actually take the derivative of, |
|---|
| 0:06:24 | and then you can compute this gradient. However, |
|---|
| 0:06:27 | the tandem decision cost function we have here does not have such a soft version |
|---|
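(For intuition, here is what such softening looks like for a simpler metric, the F1 score. The probabilities and labels below are made up; the point is that replacing hard 0/1 decisions with probabilities yields a differentiable function of the model outputs.)

```python
import numpy as np

# "Soft" F1 score: use predicted probabilities instead of hard 0/1
# decisions, so the metric is differentiable in the model outputs.
def soft_f1(probs, labels):
    tp = np.sum(probs * labels)           # soft true positives
    fp = np.sum(probs * (1 - labels))     # soft false positives
    fn = np.sum((1 - probs) * labels)     # soft false negatives
    return 2 * tp / (2 * tp + fp + fn)

labels = np.array([1.0, 0.0, 1.0, 0.0])   # made-up ground truth
probs = np.array([0.9, 0.1, 0.8, 0.3])    # made-up model probabilities
print(round(soft_f1(probs, labels), 3))   # 0.829
```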
| 0:06:31 | so |
|---|
| 0:06:32 | instead |
|---|
| 0:06:34 | we looked into reinforcement learning |
|---|
| 0:06:37 | so |
|---|
| 0:06:38 | in reinforcement learning, the simplified setup is like this: |
|---|
| 0:06:43 | the computer agent |
|---|
| 0:06:45 | sees an image or, more generally, is |
|---|
| 0:06:47 | provided some information from the game or the environment, |
|---|
| 0:06:51 | the agent chooses an action a |
|---|
| 0:06:54 | and |
|---|
| 0:06:55 | the action is then executed on the environment |
|---|
| 0:06:58 | and depending on whether the outcome of the action is good or not, the agent receives |
|---|
| 0:07:03 | some reward. And |
|---|
| 0:07:05 | in reinforcement learning |
|---|
| 0:07:06 | the goal of this whole setup is to |
|---|
| 0:07:11 | get as much reward as possible, so to modify the agent so that it gets |
|---|
| 0:07:15 | as much reward as possible |
|---|
| 0:07:17 | and one way |
|---|
| 0:07:18 | to do this is |
|---|
| 0:07:20 | through the gradient |
|---|
| 0:07:22 | well so |
|---|
| 0:07:23 | we could take the gradient |
|---|
| 0:07:25 | of the expected reward, so the reward averaged |
|---|
| 0:07:29 | over all the different situations in this setup, |
|---|
| 0:07:32 | and take the gradient of that with respect to the policy, with respect to the |
|---|
| 0:07:37 | agent. So if we can do this, we can then, |
|---|
| 0:07:39 | of course update the agent |
|---|
| 0:07:41 | to move it |
|---|
| 0:07:43 | towards the direction that increases the amount of reward |
|---|
| 0:07:46 | however |
|---|
| 0:07:47 | we here also have the problem that you cannot really |
|---|
| 0:07:51 | differentiate the decision part, where we |
|---|
| 0:07:53 | choose |
|---|
| 0:07:54 | one specific action of many |
|---|
| 0:07:56 | and |
|---|
| 0:07:58 | and execute that on the environment, so we cannot differentiate that and we cannot |
|---|
| 0:08:01 | compute the gradient |
|---|
| 0:08:04 | however, there is a thing called the policy gradient, |
|---|
| 0:08:07 | which kind of estimates this gradient. |
|---|
| 0:08:12 | It is kind of an equation where, |
|---|
| 0:08:15 | instead of calculating the gradient |
|---|
| 0:08:17 | of the reward directly, it computes the gradient of the |
|---|
| 0:08:21 | log probabilities of the |
|---|
| 0:08:23 | selected actions |
|---|
| 0:08:25 | and weights them by the reward |
|---|
| 0:08:28 | we got |
|---|
| 0:08:29 | and this has been shown in reinforcement learning |
|---|
| 0:08:32 | to |
|---|
| 0:08:33 | be quite effective. It has also been shown that you can replace the |
|---|
| 0:08:38 | reward with any function, and then |
|---|
| 0:08:43 | by running this you will also |
|---|
| 0:08:45 | get the correct gradient, so you can |
|---|
| 0:08:47 | get the same correct results with enough samples |
|---|
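(This claim can be checked in a toy example: a two-action policy with a single logit theta. Everything here is illustrative. Sampling the action is non-differentiable, yet averaging reward-weighted gradients of the log-probability recovers the true gradient of the expected reward.)

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 0.0                              # single policy parameter (logit)
p1 = 1.0 / (1.0 + np.exp(-theta))        # P(action = 1) = 0.5

samples = []
for _ in range(50000):
    a = int(rng.random() < p1)           # sample an action (hard decision)
    reward = 1.0 if a == 1 else 0.0      # only action 1 is rewarded
    glogp = a - p1                       # d log pi(a | theta) / d theta
    samples.append(glogp * reward)       # reward-weighted log-prob gradient

estimate = np.mean(samples)
# True gradient: d/dtheta E[reward] = d/dtheta P(a=1) = p1 * (1 - p1) = 0.25
print(abs(estimate - 0.25) < 0.02)  # True
```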
| 0:08:52 | so |
|---|
| 0:08:52 | going back to our tandem optimization |
|---|
| 0:08:55 | where we had the |
|---|
| 0:08:56 | same problem of |
|---|
| 0:08:57 | hard decisions, which we cannot |
|---|
| 0:09:00 | differentiate; we just apply this |
|---|
| 0:09:03 | policy gradient theorem here, |
|---|
| 0:09:07 | where the |
|---|
| 0:09:08 | equation is more or less the same, |
|---|
| 0:09:10 | just with the terms having a different meaning |
|---|
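(A rough sketch of how this could look for the tandem case. Everything here is illustrative: made-up trial scores, a t-DCF-like weighted cost, and a single score-scaling parameter w standing in for the systems' parameters. The policy-gradient estimate points downhill for the expected cost even though the cost itself is built from non-differentiable hard decisions.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up trial scores (positive = target-like) and ground-truth labels.
scores = np.array([1.0, 0.8, 1.2, -1.0, -0.7, -1.2])
is_target = np.array([True, True, True, False, False, False])

def estimate_gradient(w, n_samples=20000):
    """Policy-gradient estimate of d E[cost] / d w."""
    p_accept = 1.0 / (1.0 + np.exp(-w * scores))      # accept probabilities
    estimates = []
    for _ in range(n_samples):
        accept = rng.random(len(scores)) < p_accept   # sampled hard decisions
        frr = np.mean(~accept[is_target])             # false rejection rate
        far = np.mean(accept[~is_target])             # false acceptance rate
        cost = frr + 10.0 * far                       # t-DCF-like weighted cost
        glogp = np.sum((accept - p_accept) * scores)  # d log P(decisions) / d w
        estimates.append(cost * glogp)
    return np.mean(estimates)

# A negative estimate means increasing w is expected to lower the cost.
grad = estimate_gradient(w=0.0)
print(grad < 0)  # True
```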
| 0:09:17 | so |
|---|
| 0:09:18 | we then proceeded to test how well this works |
|---|
| 0:09:22 | with a rather trivial setup. So |
|---|
| 0:09:25 | we have two datasets |
|---|
| 0:09:26 | the VoxCeleb1 |
|---|
| 0:09:28 | as |
|---|
| 0:09:28 | the dataset, and more specifically the speaker verification |
|---|
| 0:09:31 | part of it |
|---|
| 0:09:32 | and then we have the ASVspoof 2019 dataset |
|---|
| 0:09:35 | for the synthesized speech and the countermeasure trials and labels. And |
|---|
| 0:09:41 | for the ASV task we extract the x-vectors using the pretrained |
|---|
| 0:09:46 | Kaldi models, |
|---|
| 0:09:46 | and for the ASVspoof data we extract LFCC features, |
|---|
| 0:09:50 | and these are fixed, so these are not being trained in this |
|---|
| 0:09:53 | setup |
|---|
| 0:09:55 | we then train the ASV system and the CM system separately, as normally |
|---|
| 0:10:00 | done, using these two datasets, |
|---|
| 0:10:03 | and then evaluate them together using the t-DCF cost function as |
|---|
| 0:10:10 | presented in the ASVspoof 2019 competition |
|---|
| 0:10:13 | then we take the two pretrained systems and perform tandem optimization in the manner previously |
|---|
| 0:10:19 | shown, and then finally we evaluate the results |
|---|
| 0:10:23 | and compare them with the pretrained results and see if it actually helps |
|---|
| 0:10:28 | so |
|---|
| 0:10:29 | somewhat surprisingly, |
|---|
| 0:10:30 | the tandem optimization helps. In a |
|---|
| 0:10:34 | very short nutshell: |
|---|
| 0:10:35 | so |
|---|
| 0:10:37 | one way we see this |
|---|
| 0:10:39 | is by looking at this learning curve, where on the x-axis you have |
|---|
| 0:10:44 | the number of updates |
|---|
| 0:10:47 | you do, so you can compare this to the curves where you have the loss and |
|---|
| 0:10:50 | the number of epochs, |
|---|
| 0:10:51 | and on the y-axis |
|---|
| 0:10:53 | you will have |
|---|
| 0:10:54 | the relative change on the evaluation set |
|---|
| 0:10:57 | compared to the pretrained system. So if it is zero percent, |
|---|
| 0:11:01 | it means |
|---|
| 0:11:02 | the metric did not change since the pretrained system |
|---|
| 0:11:07 | so |
|---|
| 0:11:08 | the main metric we wanted to minimize |
|---|
| 0:11:11 | is the minimum |
|---|
| 0:11:12 | normalized tandem decision cost function, |
|---|
| 0:11:15 | and this indeed |
|---|
| 0:11:17 | decreases over time as we do this tandem optimization. So |
|---|
| 0:11:21 | from zero percent change it went to minus twenty-five percent change, so yes, it |
|---|
| 0:11:26 | improved |
|---|
| 0:11:28 | and |
|---|
| 0:11:30 | then we also studied |
|---|
| 0:11:31 | how the |
|---|
| 0:11:33 | individual systems changed over the training |
|---|
| 0:11:36 | so |
|---|
| 0:11:37 | for example, the countermeasure's equal error rate in the |
|---|
| 0:11:42 | countermeasure's own task of detecting if it is spoofed or not |
|---|
| 0:11:45 | it also improved by around ten percent |
|---|
| 0:11:47 | in this task, but interestingly the ASV |
|---|
| 0:11:51 | EER |
|---|
| 0:11:52 | increased over time, and we hypothesize that |
|---|
| 0:11:56 | this is because, |
|---|
| 0:11:58 | when |
|---|
| 0:11:58 | we evaluated |
|---|
| 0:11:59 | the ASV system on the countermeasure task and the countermeasure system on the |
|---|
| 0:12:04 | ASV task, so the swapped tasks, we noticed that after this tandem optimization |
|---|
| 0:12:10 | the |
|---|
| 0:12:12 | two have improved in each |
|---|
| 0:12:13 | other's task. So the ASV was better in the countermeasure |
|---|
| 0:12:18 | task, |
|---|
| 0:12:18 | and the countermeasure was also slightly better in the speaker verification task. So |
|---|
| 0:12:23 | we hypothesize that this kind of outweighed the speaker verification system's |
|---|
| 0:12:31 | normal task of detecting the correct speaker, and it started to kind of |
|---|
| 0:12:35 | detect the countermeasure's spoofed samples instead |
|---|
| 0:12:39 | so |
|---|
| 0:12:40 | we also compared this to a simple baseline |
|---|
| 0:12:43 | where, instead of using this tandem optimization, we just independently |
|---|
| 0:12:49 | continued training the two systems, using the same samples as in the policy gradient method |
|---|
| 0:12:54 | so |
|---|
| 0:12:56 | basically we just use the same samples: we use the ASV samples to |
|---|
| 0:13:01 | continue updating the ASV system, and then we use the countermeasure |
|---|
| 0:13:06 | samples to update the countermeasure system independently, completely separate from each other. And |
|---|
| 0:13:11 | we see the same ASV behaviour here, but |
|---|
| 0:13:15 | the countermeasure system's equal error rate just exploded in the beginning and then slowly |
|---|
| 0:13:20 | creeps back down. And in the end, averaged over multiple runs, |
|---|
| 0:13:26 | we see that the |
|---|
| 0:13:27 | policy gradient method improves the results by twenty-six percent and the fine-tuning |
|---|
| 0:13:31 | improves the results by seven point eight four percent, |
|---|
| 0:13:35 | but note that the results of the fine-tuning |
|---|
| 0:13:39 | have a lot higher variance |
|---|
| 0:13:42 | than in the |
|---|
| 0:13:44 | policy gradient |
|---|
| 0:13:45 | version |
|---|
| 0:13:48 | but |
|---|
| 0:13:49 | these results are all very positive, but as said, this was a very initial investigation, and |
|---|
| 0:13:55 | i highly recommend that you |
|---|
| 0:13:57 | check out the paper for more results and figures |
|---|
| 0:14:00 | and that's all |
|---|
| 0:14:02 | so |
|---|
| 0:14:02 | thank you for listening. Be sure to check out the paper and the code |
|---|
| 0:14:05 | behind that link |
|---|