0:00:15 | hello everybody i'm happy to be here

0:00:19 | because this is my first work in the field of speaker recognition

0:00:24 | and i want to thank nist for providing such a chance for us to participate

0:00:36 | because i think it is

0:00:37 | an important opportunity

0:00:44 | that we can improve speaker recognition systems

0:00:47 | by holding these kinds of challenges

0:00:52 | okay |

0:00:54 | what i'm going to propose is a

0:00:57 | kind of idea from beamforming

0:01:02 | which is a well-known technique in signal processing

0:01:09 | okay, what am i going to

0:01:11 | present

0:01:15 | first i want to try to explain what beamforming is and

0:01:21 | how we apply it to this challenge

0:01:26 | then explain how we can solve the problem with adaptive filtering

0:01:31 | and then find an optimal beamformer in order to

0:01:37 | solve the problem

0:01:38 | first of all without any constraints, and then

0:01:43 | we include uncertainty

0:01:45 | to make it more robust

0:01:47 | and our work also includes some modification of the

0:01:52 | impostor covariance matrix and some score normalisation in order to

0:01:57 | improve the

0:01:58 | performance

0:02:01 | so what do we know about i-vectors

0:02:09 | i-vectors are interesting because they

0:02:12 | provide

0:02:14 | a fixed-dimensional representation of arbitrary-length

0:02:17 | speech

0:02:18 | and the

0:02:22 | problem with i-vectors is that they vary with different environments and speakers and so on

0:02:30 | and this is the challenge |

0:02:32 | in this field |

0:02:37 | okay, with intersession compensation we are going to remove this unwanted variability, but in

0:02:44 | this challenge

0:02:46 | using probabilistic linear discriminant analysis is not going to be a good idea since

0:02:53 | we don't have any labels for the data

0:02:56 | and if we provide these labels for the data

0:03:05 | the performance of that clustering-based labelling will affect the performance of plda

0:03:15 | okay |

0:03:19 | one important thing is

0:03:23 | what

0:03:24 | we have

0:03:25 | is a lot of

0:03:27 | speech data

0:03:30 | so

0:03:31 | we can make use of these amounts of

0:03:34 | available data, for example in a speech centre

0:03:37 | a telephone speech centre, there is a lot of speech data passing through

0:03:44 | so we can take advantage of these data in order to improve speaker

0:03:49 | recognition

0:03:50 | instead of providing some

0:03:54 | artificial

0:03:55 | labels for the data

0:04:01 | so plda and similar approaches

0:04:05 | need to

0:04:07 | have labels

0:04:08 | so it's not a good idea to use them, and we take on new

0:04:13 | approaches

0:04:15 | to

0:04:16 | solve the problem

0:04:18 | so if we can't

0:04:21 | find the within-speaker scatter matrix reliably

0:04:27 | why don't we try to find the between-speaker variance and increase that

0:04:35 | okay |

0:04:36 | the first thing i'm going to explain is beamforming

0:04:41 | it is a signal processing technique

0:04:44 | used in sensor arrays in order to direct signal transmission to a

0:04:49 | desired target

0:04:52 | and adaptive filtering is used to

0:04:58 | do optimal filtering and interference rejection

0:05:01 | in order to estimate the signal of interest

0:05:07 | so the beamforming operation is that a signal impinges on some antennas

0:05:14 | as waves

0:05:16 | from the same distance

0:05:19 | it then passes through a filter

0:05:23 | and the result is

0:05:26 | that the filter

0:05:28 | passes the

0:05:29 | desired angles

0:05:31 | and rejects all the other angles

0:05:35 | this is the same as the

0:05:38 | dot product of a filter and the signal

0:05:44 | so if we

0:05:46 | illustrate the

0:05:47 | idea, in

0:05:50 | omnidirectional antennas

0:05:53 | the signal, the interference, and the target are all treated equally, but in the beamformer

0:06:01 | we focus on the target

0:06:07 | so what is the

0:06:09 | filter? we are going to design a filter like this, w transpose times

0:06:15 | i, where i is the i-vector and w is the filter

0:06:20 | so we want to

0:06:23 | pass the target speaker through this filter

0:06:26 | but reject all the other, impostor, speakers

0:06:30 | so the development set gives the

0:06:33 | impostors, all the impostors come from the development set

0:06:38 | so if we

0:06:41 | use the mean square error

0:06:43 | in order to solve the problem

0:06:45 | we reach this result, as you can see here

0:06:54 | okay |

0:06:55 | the w is the

0:06:56 | optimal filter for this solution

0:07:00 | and r is the

0:07:02 | autocorrelation matrix

0:07:04 | and a is the target, which can be estimated by using the mean of the target i-vectors
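The MMSE filter described here can be sketched numerically as follows; the dimensions, the random stand-in data, and the use of the enrollment mean as the target `a` are illustrative assumptions, not the challenge data:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 20                                 # toy i-vector dimension (real systems use e.g. 600)

dev = rng.standard_normal((500, dim))    # development i-vectors, treated as impostors
enroll = rng.standard_normal((3, dim))   # hypothetical enrollment i-vectors for one target
a = enroll.mean(axis=0)                  # target model: mean of the enrollment i-vectors

# Autocorrelation matrix of the impostor i-vectors.
R = dev.T @ dev / len(dev)

# MMSE solution: w = R^{-1} a, a Wiener-style filter.
w = np.linalg.solve(R, a)

def score(test_ivec):
    # Filter output w^T i: high for the target, low on average for impostors.
    return w @ test_ivec
```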

0:07:11 | okay, let's compare it with the baseline system

0:07:15 | the baseline system's score

0:07:18 | is computed after whitening the i-vectors

0:07:21 | and using the cosine similarity to find the score

0:07:28 | you can see that when

0:07:30 | we use cosine similarity, we should normalize the magnitude of the i-vectors

0:07:37 | as displayed, but in the

0:07:40 | adaptive filtering, as i just

0:07:43 | explained

0:07:45 | there is no normalisation of

0:07:47 | the i-vectors
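For contrast, the cosine-similarity baseline just described (whitening followed by length-normalized cosine scoring) might look like the sketch below; the whitening estimate and toy data are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
dev = rng.standard_normal((500, 20))     # development i-vectors (toy stand-ins)
model = rng.standard_normal(20)          # target model i-vector
test = rng.standard_normal(20)           # test i-vector

# Whitening transform estimated from the development set.
mu = dev.mean(axis=0)
W_white = np.linalg.cholesky(np.linalg.inv(np.cov(dev, rowvar=False)))

def whiten(x):
    return W_white.T @ (x - mu)

def cosine_score(m, t):
    m, t = whiten(m), whiten(t)
    # Unlike the adaptive filter, BOTH i-vectors are length-normalized here.
    return (m @ t) / (np.linalg.norm(m) * np.linalg.norm(t))
```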

0:07:50 | okay, let's go a little further and change the criterion

0:07:55 | in beamforming there is the minimum variance distortionless response

0:08:01 | this is an approach that maximizes the signal-to-interference ratio

0:08:07 | so we want to

0:08:11 | maximize

0:08:12 | this ratio

0:08:14 | that is, to maximize the output of the filter when the target passes

0:08:19 | but to reject all the

0:08:22 | impostors, that is, we want to minimize the

0:08:25 | denominator while keeping the numerator fixed

0:08:29 | in order to

0:08:30 | solve the problem we assume that

0:08:33 | the numerator

0:08:35 | equals one

0:08:36 | that's the

0:08:38 | distortionless

0:08:39 | constraint

0:08:41 | so

0:08:41 | we want to minimize the

0:08:44 | denominator, which is the power of the impostors being passed through the

0:08:49 | filter

0:08:51 | the expected value of that

0:08:53 | and here

0:08:55 | r is the

0:08:58 | impostor covariance matrix, so the optimum solution for this problem

0:09:03 | can easily be found this way
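A minimal sketch of the MVDR filter as described: minimize the impostor output power w^T R w subject to the distortionless constraint w^T a = 1 (the data here are toy assumptions; R is the impostor covariance as in the talk):

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 20
impostors = rng.standard_normal((500, dim))  # development set = impostors
a = rng.standard_normal(dim)                 # target model i-vector

R = np.cov(impostors, rowvar=False)          # impostor covariance matrix

# MVDR: minimize w^T R w subject to w^T a = 1.
Rinv_a = np.linalg.solve(R, a)
w = Rinv_a / (a @ Rinv_a)
```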

0:09:07 | so let's just compare it with the cosine similarity

0:09:11 | the baseline system is like that, and the mvdr is proposed this way

0:09:16 | so if you look at this idea we see that

0:09:22 | this mvdr proposes a new similarity measure

0:09:27 | that does not include the normalisation of the test i-vector but focuses more on the

0:09:33 | target

0:09:37 | the results

0:09:39 | show that

0:09:40 | it provides an

0:09:43 | improvement of seven point seven percent

0:09:46 | in the

0:09:47 | i-vector challenge

0:09:50 | so let's go one

0:09:53 | step further and make it more robust

0:09:57 | as we had in the previous

0:10:00 | slide, we used the mean of

0:10:03 | all the target i-vectors in order to

0:10:08 | estimate the target, since the mvdr supposes that there is no uncertainty regarding the

0:10:13 | target

0:10:14 | but in this

0:10:17 | approach, the linearly constrained minimum variance, we include uncertainty by some linear constraints

0:10:26 | so we put all the i-vectors provided for the target

0:10:32 | in the matrix c

0:10:35 | and

0:10:36 | we enforce

0:10:40 | that they pass the filter

0:10:43 | with the value of one

0:10:46 | so f is equal to one

0:10:49 | so if we solve this problem

0:10:53 | the optimal filter will be as you can see here

0:10:59 | and when we applied it to the challenge

0:11:02 | there is another improvement of three point seven percent relative to the mvdr

0:11:08 | and eleven point one percent relative to the baseline system
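The LCMV filter can be sketched numerically: all enrollment i-vectors of the target are stacked as columns of C, and each is constrained to pass the filter with gain one (f is a vector of ones); sizes and data are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
dim, n_enroll = 20, 3
impostors = rng.standard_normal((500, dim))
C = rng.standard_normal((dim, n_enroll))  # columns: the target's enrollment i-vectors
f = np.ones(n_enroll)                     # each target i-vector must pass with gain 1

R = np.cov(impostors, rowvar=False)       # impostor covariance matrix

# LCMV: minimize w^T R w subject to C^T w = f.
Rinv_C = np.linalg.solve(R, C)
w = Rinv_C @ np.linalg.solve(C.T @ Rinv_C, f)
```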

0:11:13 | so

0:11:14 | now we can

0:11:17 | do an additional

0:11:20 | job

0:11:21 | in order to improve the performance

0:11:24 | since, as usual in signal processing

0:11:27 | there are many more techniques, such as the robust beamformer or the norm-constrained

0:11:33 | robust beamformer, which

0:11:37 | improve the performance by diagonally loading the

0:11:41 | covariance matrix

0:11:44 | i used a similar approach, but used the top impostor i-vectors

0:11:51 | that were the most similar to the target i-vectors

0:11:55 | so

0:11:57 | in this way we passed all the impostors through the

0:12:03 | filter for each target

0:12:05 | and

0:12:06 | selected those, the top six thousand impostors, with the highest similarity

0:12:13 | and computed the covariance matrix again
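That covariance modification step can be sketched as follows: score every impostor with the target's filter, keep the top-scoring ones (six thousand in the talk; a smaller number in this toy example), and recompute the impostor covariance from those alone; all sizes and data are assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
dim = 20
impostors = rng.standard_normal((5000, dim))
a = rng.standard_normal(dim)                 # target model i-vector

# Initial MVDR-style filter from the full impostor covariance.
R = np.cov(impostors, rowvar=False)
w = np.linalg.solve(R, a)
w /= a @ w

# Pass every impostor through the target's filter and keep the most
# target-like ones (the talk keeps the top six thousand of the dev set).
scores = impostors @ w
top = impostors[np.argsort(scores)[-1000:]]

# Recompute the covariance from those impostors only and re-derive the filter.
R_top = np.cov(top, rowvar=False)
w_ref = np.linalg.solve(R_top, a)
w_ref /= a @ w_ref
```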

0:12:17 | this resulted in a

0:12:20 | very good improvement

0:12:22 | of twenty one point five percent

0:12:25 | relative to the baseline system

0:12:28 | we can see the

0:12:30 | score distribution

0:12:31 | for all the impostors when compared to the

0:12:34 | target

0:12:36 | and after

0:12:39 | applying this covariance matrix modification we can see

0:12:43 | a good reduction

0:12:46 | in the scores

0:12:50 | another

0:12:52 | factor to improve the

0:12:54 | speaker recognition performance was to use score normalisation

0:12:59 | i just found this relation

0:13:02 | the best

0:13:04 | contrary to some others

0:13:09 | like z-norm or t-norm, which use the variance of the scores

0:13:14 | we could not do that

0:13:18 | and this resulted in further improvement
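The exact normalisation relation is only on the slide; since the talk notes that the score variance (as used by z-norm and t-norm) could not be used, a mean-only cohort normalisation is one plausible reading, sketched here purely as an illustration with hypothetical scores:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical raw scores of one target's filter against an impostor cohort.
cohort = rng.standard_normal(1000) * 0.3 + 0.1

def mean_norm(raw, cohort_scores):
    # Remove only the cohort mean shift; z-norm/t-norm would also divide
    # by the cohort standard deviation, which the talk says was not usable.
    return raw - np.mean(cohort_scores)
```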

0:13:23 | okay |

0:13:24 | now let's go and be more supervised

0:13:28 | okay

0:13:29 | we use the within-class covariance matrix

0:13:33 | found using some clustering method

0:13:37 | but this clustering method is somewhat different

0:13:41 | as we

0:13:42 | treated each single i-vector in the development set as a target

0:13:49 | and found the closest or most similar

0:13:54 | i-vector to that target

0:13:56 | and this is repeated, each time adding one more, in order to find more similar

0:14:04 | i-vectors

0:14:05 | after finding those i-vectors

0:14:08 | we use this formula in order to compute the within-class covariance of those

0:14:12 | i-vectors, which are assumed to be from the same speaker
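The unsupervised clustering just described can be sketched like this: each development i-vector is a seed, grouped with its most similar neighbours (the neighbourhood size `k` is an assumption), and the within-class covariance is the scatter of each group around its own mean:

```python
import numpy as np

rng = np.random.default_rng(6)
dev = rng.standard_normal((200, 20))   # unlabeled development i-vectors (toy)
k = 4                                  # neighbours grouped with each seed (assumed size)

# Treat every development i-vector as a "target" and collect its k most
# similar i-vectors, assumed to come from the same speaker.
sims = dev @ dev.T
np.fill_diagonal(sims, -np.inf)        # a vector must not match itself
clusters = [np.append(np.argsort(sims[i])[-k:], i) for i in range(len(dev))]

# Within-class covariance: scatter of each cluster around its own mean.
W = np.zeros((20, 20))
for idx in clusters:
    members = dev[idx]
    centered = members - members.mean(axis=0)
    W += centered.T @ centered
W /= sum(len(idx) for idx in clusters)
```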

0:14:18 | and the final model

0:14:22 | can be found by adding this w, since what we applied to the challenge handles

0:14:28 | the inter-session variability as well as

0:14:31 | rejecting impostors

0:14:34 | so we added them together

0:14:37 | to find this optimum filter

0:14:39 | so you can see the results

0:14:43 | it

0:14:45 | leads to an improvement of twenty five to twenty seven point five percent

0:14:50 | relative to the baseline system

0:14:54 | so in conclusion

0:14:55 | we have proposed a new

0:14:59 | idea, borrowed from signal processing, of adaptive filtering in order to solve the

0:15:04 | i-vector challenge

0:15:10 | so a modification of the impostor covariance matrix is possible

0:15:15 | this way

0:15:16 | so

0:15:17 | we have used

0:15:20 | this idea

0:15:22 | and i think we can

0:15:27 | improve speaker recognition if we apply it to plda

0:15:32 | but we did not have enough time to do that

0:15:37 | thank you for listening

0:15:58 | so at one time, i do not remember exactly when, we did language id, and

0:16:05 | we were doing something with cosine, and like you, the target model was length-normalized but the test

0:16:12 | was not, the same as what you did

0:16:14 | what we found was that the backend was able to calibrate the

0:16:19 | scores, but you have some shift in the scores from test to test

0:16:24 | so the calibration suffered; the calibration worked, but

0:16:32 | it was also about the score offset estimation

0:16:36 | and we found that it was much worse when you

0:16:40 | don't normalise the test

0:16:42 | so no

0:16:44 | for language id it was a problem; maybe for speaker it's okay, but

0:16:47 | okay, thank you

0:16:57 | good |