0:00:15 | hello everybody i'm happy to be here

0:00:19 | because this is my first work in the field of speaker recognition

0:00:24 | and i want to thank nist for providing such a chance for us to participate

0:00:36 | because i think it is

0:00:37 | an important opportunity

0:00:44 | that we can improve speaker recognition systems

0:00:47 | by holding these kinds of challenges

0:00:52 | okay |

0:00:54 | what i'm going to propose is a

0:00:57 | kind of idea from beamforming

0:01:02 | which is a well-known technique in signal processing

0:01:09 | okay, what am i going to

0:01:11 | present

0:01:15 | first i want to try to explain what beamforming is and

0:01:21 | how we apply it to this challenge

0:01:26 | then explain how we can solve the problem with adaptive filtering

0:01:31 | and then find an optimal beamformer in order to

0:01:37 | solve the problem

0:01:38 | first of all without any constraints, and then

0:01:43 | we include uncertainty

0:01:45 | to make it more robust

0:01:47 | and our work also includes some modification of the

0:01:52 | impostor covariance matrix and some score normalisation in order to

0:01:57 | improve the

0:01:58 | performance

0:02:01 | so what do we know about i-vectors

0:02:09 | i-vectors are interesting because they

0:02:12 | provide

0:02:14 | a fixed-dimensional representation of arbitrary-length

0:02:17 | speech

0:02:18 | and the

0:02:22 | problem with i-vectors is that they vary with different environments and speakers and so on

0:02:30 | and this is the challenge |

0:02:32 | in this field |

0:02:37 | okay, with intersession compensation we are going to remove this unwanted variability, but in

0:02:44 | this challenge

0:02:46 | using probabilistic linear discriminant analysis is not going to be a good idea since

0:02:53 | we don't have any labels for the data

0:02:56 | and if we provide these labels for the data

0:03:05 | the performance of that clustering-based labelling will affect the performance of plda

0:03:15 | okay |

0:03:19 | one important thing is

0:03:23 | what

0:03:24 | we have

0:03:25 | is a lot of

0:03:27 | speech data

0:03:30 | so

0:03:31 | we can make use of these amounts of

0:03:34 | available data, for example in a speech centre

0:03:37 | a telephone speech centre, there is a lot of speech data passing through

0:03:44 | so we can take advantage of these data in order to improve speaker

0:03:49 | recognition

0:03:50 | instead of providing some

0:03:54 | artificial

0:03:55 | labels for the data

0:04:01 | so plda and similar approaches

0:04:05 | need to

0:04:07 | have labels

0:04:08 | so it's not a good idea to use them, and we take on new

0:04:13 | approaches

0:04:15 | to

0:04:16 | solve the problem

0:04:18 | so if we can't

0:04:21 | find the within-speaker scatter matrix reliably

0:04:27 | why don't we try to find the between-speaker variance and increase that

0:04:35 | okay |

0:04:36 | the first thing i'm going to explain is beamforming

0:04:41 | it is a signal processing technique

0:04:44 | used in sensor arrays in order to direct signal transmission to a

0:04:49 | desired target

0:04:52 | and adaptive filtering is used to

0:04:58 | do optimal filtering and interference rejection

0:05:01 | in order to estimate the signal of interest

0:05:07 | so the beamforming operation is that a signal impinges on some antennas

0:05:14 | as waves

0:05:16 | from the same distance

0:05:19 | it then passes through a filter

0:05:23 | and the result is

0:05:26 | that the filter

0:05:28 | passes the

0:05:29 | desired angles

0:05:31 | and rejects all the other angles

0:05:35 | this is the same as the

0:05:38 | dot product of a filter and the signal

0:05:44 | so if we

0:05:46 | illustrate the

0:05:47 | idea, in

0:05:50 | omnidirectional antennas

0:05:53 | the signal, the interference, and the target are all treated equally, but in the beamformer

0:06:01 | we focus on the target

0:06:07 | so what is the

0:06:09 | filter? we are going to design a filter like this, w transpose times

0:06:15 | i, where i is the i-vector and w is the filter

0:06:20 | so we want to

0:06:23 | pass the target speaker through this filter

0:06:26 | but reject all the other, impostor, speakers

0:06:30 | so the development set gives the

0:06:33 | impostors, all the impostors come from the development set

0:06:38 | so if we

0:06:41 | use the mean square error

0:06:43 | in order to solve the problem

0:06:45 | we reach this result, as you can see here

0:06:54 | okay |

0:06:55 | the w is the

0:06:56 | optimal filter for this solution

0:07:00 | and r is the

0:07:02 | autocorrelation matrix

0:07:04 | and a is the target, which can be estimated by using the mean of the target i-vectors
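The MMSE filter described here can be sketched numerically as follows; the dimensions, the random stand-in data, and the use of the enrollment mean as the target `a` are illustrative assumptions, not the challenge data:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 20                                 # toy i-vector dimension (real systems use e.g. 600)

dev = rng.standard_normal((500, dim))    # development i-vectors, treated as impostors
enroll = rng.standard_normal((3, dim))   # hypothetical enrollment i-vectors for one target
a = enroll.mean(axis=0)                  # target model: mean of the enrollment i-vectors

# Autocorrelation matrix of the impostor i-vectors.
R = dev.T @ dev / len(dev)

# MMSE solution: w = R^{-1} a, a Wiener-style filter.
w = np.linalg.solve(R, a)

def score(test_ivec):
    # Filter output w^T i: high for the target, low on average for impostors.
    return w @ test_ivec
```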

0:07:11 | okay, let's compare it with the baseline system

0:07:15 | the baseline system's score

0:07:18 | is computed after whitening the i-vectors

0:07:21 | and using the cosine similarity to find the score

0:07:28 | you can see that when

0:07:30 | we use cosine similarity, we should normalize the magnitude of the i-vectors

0:07:37 | as displayed, but in the

0:07:40 | adaptive filtering, as i just

0:07:43 | explained

0:07:45 | there is no normalisation of

0:07:47 | the i-vectors
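For contrast, the cosine-similarity baseline just described (whitening followed by length-normalized cosine scoring) might look like the sketch below; the whitening estimate and toy data are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
dev = rng.standard_normal((500, 20))     # development i-vectors (toy stand-ins)
model = rng.standard_normal(20)          # target model i-vector
test = rng.standard_normal(20)           # test i-vector

# Whitening transform estimated from the development set.
mu = dev.mean(axis=0)
W_white = np.linalg.cholesky(np.linalg.inv(np.cov(dev, rowvar=False)))

def whiten(x):
    return W_white.T @ (x - mu)

def cosine_score(m, t):
    m, t = whiten(m), whiten(t)
    # Unlike the adaptive filter, BOTH i-vectors are length-normalized here.
    return (m @ t) / (np.linalg.norm(m) * np.linalg.norm(t))
```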

0:07:50 | okay, let's go a little further and change the criterion

0:07:55 | in beamforming there is the minimum variance distortionless response

0:08:01 | this is an approach that maximizes the signal-to-interference ratio

0:08:07 | so we want to

0:08:11 | maximize

0:08:12 | this ratio

0:08:14 | that is, to maximize the output of the filter when the target passes

0:08:19 | but to reject all the

0:08:22 | impostors, that is, we want to minimize the

0:08:25 | denominator while keeping the numerator fixed

0:08:29 | in order to

0:08:30 | solve the problem we assume that

0:08:33 | the numerator

0:08:35 | equals one

0:08:36 | that's the

0:08:38 | distortionless

0:08:39 | constraint

0:08:41 | so

0:08:41 | we want to minimize the

0:08:44 | denominator, which is the power of the impostors being passed through the

0:08:49 | filter

0:08:51 | the expected value of that

0:08:53 | and here

0:08:55 | r is the

0:08:58 | impostor covariance matrix, so the optimum solution for this problem

0:09:03 | can easily be found this way
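A minimal sketch of the MVDR filter as described: minimize the impostor output power w^T R w subject to the distortionless constraint w^T a = 1 (the data here are toy assumptions; R is the impostor covariance as in the talk):

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 20
impostors = rng.standard_normal((500, dim))  # development set = impostors
a = rng.standard_normal(dim)                 # target model i-vector

R = np.cov(impostors, rowvar=False)          # impostor covariance matrix

# MVDR: minimize w^T R w subject to w^T a = 1.
Rinv_a = np.linalg.solve(R, a)
w = Rinv_a / (a @ Rinv_a)
```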

0:09:07 | so let's just compare it with the cosine similarity

0:09:11 | the baseline system is like that, and the mvdr is proposed this way

0:09:16 | so if you look at this idea we see that

0:09:22 | this mvdr proposes a new similarity measure

0:09:27 | that does not include the normalisation of the test i-vector but focuses more on the

0:09:33 | target

0:09:37 | the results

0:09:39 | show that

0:09:40 | it provides an

0:09:43 | improvement of seven point seven percent

0:09:46 | in the

0:09:47 | i-vector challenge

0:09:50 | so let's go one

0:09:53 | step further and make it more robust

0:09:57 | as we had in the previous

0:10:00 | slide, we used the mean of

0:10:03 | all the target i-vectors in order to

0:10:08 | estimate the target, since the mvdr supposes that there is no uncertainty regarding the

0:10:13 | target

0:10:14 | but in this

0:10:17 | approach, the linearly constrained minimum variance, we include uncertainty by some linear constraints

0:10:26 | so we put all the i-vectors provided for the target

0:10:32 | in the matrix c

0:10:35 | and

0:10:36 | we enforce

0:10:40 | that they pass the filter

0:10:43 | with the value of one

0:10:46 | so f is equal to one

0:10:49 | so if we solve this problem

0:10:53 | the optimal filter will be as you can see here

0:10:59 | and when we applied it to the challenge

0:11:02 | there is another improvement of three point seven percent relative to the mvdr

0:11:08 | and eleven point one percent relative to the baseline system
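The LCMV filter can be sketched numerically: all enrollment i-vectors of the target are stacked as columns of C, and each is constrained to pass the filter with gain one (f is a vector of ones); sizes and data are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
dim, n_enroll = 20, 3
impostors = rng.standard_normal((500, dim))
C = rng.standard_normal((dim, n_enroll))  # columns: the target's enrollment i-vectors
f = np.ones(n_enroll)                     # each target i-vector must pass with gain 1

R = np.cov(impostors, rowvar=False)       # impostor covariance matrix

# LCMV: minimize w^T R w subject to C^T w = f.
Rinv_C = np.linalg.solve(R, C)
w = Rinv_C @ np.linalg.solve(C.T @ Rinv_C, f)
```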

0:11:13 | so

0:11:14 | now we can

0:11:17 | do an additional

0:11:20 | job

0:11:21 | in order to improve the performance

0:11:24 | since, as usual in signal processing

0:11:27 | there are many more techniques, such as the robust beamformer or the norm-constrained

0:11:33 | robust beamformer, which

0:11:37 | improve the performance by diagonally loading the

0:11:41 | covariance matrix

0:11:44 | i used a similar approach, but used the top impostor i-vectors

0:11:51 | that were the most similar to the target i-vectors

0:11:55 | so

0:11:57 | in this way we passed all the impostors through the

0:12:03 | filter for each target

0:12:05 | and

0:12:06 | selected those, the top six thousand impostors, with the highest similarity

0:12:13 | and computed the covariance matrix again
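That covariance modification step can be sketched as follows: score every impostor with the target's filter, keep the top-scoring ones (six thousand in the talk; a smaller number in this toy example), and recompute the impostor covariance from those alone; all sizes and data are assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
dim = 20
impostors = rng.standard_normal((5000, dim))
a = rng.standard_normal(dim)                 # target model i-vector

# Initial MVDR-style filter from the full impostor covariance.
R = np.cov(impostors, rowvar=False)
w = np.linalg.solve(R, a)
w /= a @ w

# Pass every impostor through the target's filter and keep the most
# target-like ones (the talk keeps the top six thousand of the dev set).
scores = impostors @ w
top = impostors[np.argsort(scores)[-1000:]]

# Recompute the covariance from those impostors only and re-derive the filter.
R_top = np.cov(top, rowvar=False)
w_ref = np.linalg.solve(R_top, a)
w_ref /= a @ w_ref
```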

0:12:17 | this resulted in a

0:12:20 | very good improvement

0:12:22 | of twenty one point five percent

0:12:25 | relative to the baseline system

0:12:28 | we can see the

0:12:30 | score distribution

0:12:31 | for all the impostors when compared to the

0:12:34 | target

0:12:36 | and after

0:12:39 | applying this covariance matrix modification we can see

0:12:43 | a good reduction

0:12:46 | in the scores

0:12:50 | another

0:12:52 | factor to improve the

0:12:54 | speaker recognition performance was to use score normalisation

0:12:59 | i just found this relation

0:13:02 | the best

0:13:04 | contrary to some others

0:13:09 | like z-norm or t-norm, which use the variance of the scores

0:13:14 | we could not do that

0:13:18 | and this resulted in further improvement
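The exact normalisation relation is only on the slide; since the talk notes that the score variance (as used by z-norm and t-norm) could not be used, a mean-only cohort normalisation is one plausible reading, sketched here purely as an illustration with hypothetical scores:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical raw scores of one target's filter against an impostor cohort.
cohort = rng.standard_normal(1000) * 0.3 + 0.1

def mean_norm(raw, cohort_scores):
    # Remove only the cohort mean shift; z-norm/t-norm would also divide
    # by the cohort standard deviation, which the talk says was not usable.
    return raw - np.mean(cohort_scores)
```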

0:13:23 | okay |

0:13:24 | now let's go and be more supervised

0:13:28 | okay

0:13:29 | we use the within-class covariance matrix

0:13:33 | found using some clustering method

0:13:37 | but this clustering method is somewhat different

0:13:41 | as we

0:13:42 | treated each single i-vector in the development set as a target

0:13:49 | and found the closest or most similar

0:13:54 | i-vector to that target

0:13:56 | and this is repeated, each time adding one more, in order to find more similar

0:14:04 | i-vectors

0:14:05 | after finding those i-vectors

0:14:08 | we use this formula in order to compute the within-class covariance of those

0:14:12 | i-vectors, which are assumed to be from the same speaker
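The unsupervised clustering just described can be sketched like this: each development i-vector is a seed, grouped with its most similar neighbours (the neighbourhood size `k` is an assumption), and the within-class covariance is the scatter of each group around its own mean:

```python
import numpy as np

rng = np.random.default_rng(6)
dev = rng.standard_normal((200, 20))   # unlabeled development i-vectors (toy)
k = 4                                  # neighbours grouped with each seed (assumed size)

# Treat every development i-vector as a "target" and collect its k most
# similar i-vectors, assumed to come from the same speaker.
sims = dev @ dev.T
np.fill_diagonal(sims, -np.inf)        # a vector must not match itself
clusters = [np.append(np.argsort(sims[i])[-k:], i) for i in range(len(dev))]

# Within-class covariance: scatter of each cluster around its own mean.
W = np.zeros((20, 20))
for idx in clusters:
    members = dev[idx]
    centered = members - members.mean(axis=0)
    W += centered.T @ centered
W /= sum(len(idx) for idx in clusters)
```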

0:14:18 | and the final model

0:14:22 | can be found by adding this w, since what we applied to the challenge handles

0:14:28 | the inter-session variability as well as

0:14:31 | rejecting impostors

0:14:34 | so we added them together

0:14:37 | to find this optimum filter

0:14:39 | so you can see the results

0:14:43 | it

0:14:45 | leads to an improvement of twenty five to twenty seven point five percent

0:14:50 | relative to the baseline system

0:14:54 | so in conclusion

0:14:55 | we have proposed a new

0:14:59 | idea, borrowed from signal processing, of adaptive filtering in order to solve the

0:15:04 | i-vector challenge

0:15:10 | so a modification of the impostor covariance matrix is possible

0:15:15 | this way

0:15:16 | so

0:15:17 | we have used

0:15:20 | this idea

0:15:22 | and i think we can

0:15:27 | improve speaker recognition if we apply it to plda

0:15:32 | but we did not have enough time to do that

0:15:37 | thank you for listening

0:15:58 | so at one time, i do not remember exactly when, we did language id, and

0:16:05 | we were doing something with cosine, and like you, the target model was length-normalized but the test

0:16:12 | was not, the same as what you did

0:16:14 | what we found was that the backend was able to calibrate the

0:16:19 | scores, but you have some shift in the scores from test to test

0:16:24 | so the calibration suffered; the calibration worked, but

0:16:32 | it was also about the score offset estimation

0:16:36 | and we found that it was much worse when you

0:16:40 | don't normalise the test

0:16:42 | so no

0:16:44 | for language id it was a problem; maybe for speaker it's okay, but

0:16:47 | okay, thank you

0:16:57 | good |