0:00:15hello everybody i'm happy to be here
0:00:19because this is my first work in the field of speaker recognition
0:00:24and remember
0:00:25one attorney
0:00:27then these four
0:00:29providing such approach and eighty four that's two
0:00:33participate
0:00:36because i think it is
0:00:37important
0:00:44that we can improve speaker recognition systems
0:00:47by holding these kinds of challenges
0:00:52okay
0:00:54what i'm going to propose is
0:00:57an idea from beamforming
0:01:02which is a well-known technique in signal processing
0:01:09okay what i am going to
0:01:11present
0:01:15first i want to try to explain what beamforming is and
0:01:21how we apply it to this challenge
0:01:26we explain how we can solve the problem with adaptive filtering
0:01:31and then find an optimal beamformer in order to
0:01:37solve the problem
0:01:38first of all without any constraints and then after that
0:01:43we include uncertainty
0:01:45to make it more robust
0:01:47and our work also includes some modification of the
0:01:52impostor covariance matrix and some score normalization in order to
0:01:57improve the
0:01:58performance
0:02:01so
0:02:01what do we expect
0:02:06from i-vectors
0:02:09an i-vector is interesting because it
0:02:12provides
0:02:14a fixed-dimensional representation of an arbitrary-length
0:02:17speech utterance
0:02:18and the
0:02:22problem with i-vectors is that they vary across different environments and speaker conditions
0:02:30and this is the challenge
0:02:32in this field
0:02:37okay in intersession compensation we are going to remove this unwanted variability but in
0:02:44this challenge
0:02:46using probabilistic linear discriminant analysis is not going to be a good idea since
0:02:53we don't have any labels for the data
0:02:56and if we create labels for the data ourselves
0:03:05the performance of that clustering and labeling will affect the performance of plda
0:03:15okay
0:03:19one important thing is that
0:03:23what
0:03:24we need
0:03:25is a lot of
0:03:27speech data
0:03:30so
0:03:31we can use the large amounts of
0:03:34available data for example in a speech centre
0:03:37or a telephone speech centre there is a lot of speech data passing through
0:03:44so we can take advantage of these data in order to improve speaker
0:03:49recognition
0:03:50instead of producing some
0:03:54artificial
0:03:55data by labelling them
0:04:01so plda and similar approaches
0:04:05need to
0:04:07have labels
0:04:08so it is not a good idea to use them and we take a new
0:04:13approach
0:04:15to
0:04:16solve the problem
0:04:18so if we can't
0:04:21find the within-speaker scatter matrix reliably then
0:04:27why don't we find the between-speaker variance and then increase that
0:04:35okay
0:04:36the first thing i am going to explain is beamforming
0:04:41it is a signal processing technique
0:04:44for sensor arrays used in order to direct the signal transmission to a
0:04:49desired target
0:04:52and adaptive filtering is used to
0:04:58perform optimal filtering and interference rejection
0:05:01in order to estimate the signal of interest
0:05:07so the beamforming operation is that when a signal impinges on some
0:05:14antennas
0:05:16from the same distance
0:05:19it is then passed through a filter
0:05:23and as a result
0:05:26that filter passes
0:05:28the
0:05:29desired angles
0:05:31and rejects all the others
0:05:35this is the same as the
0:05:38dot product of the filter and the signal
0:05:44so if we
0:05:46illustrate the
0:05:47idea it is that with
0:05:50omnidirectional antennas
0:05:53the signal the interference and the targets are treated equally but in the beamformer
0:06:01we focus on the target
0:06:07so what we have is a
0:06:09filter so we are going to design a filter like this w transpose times
0:06:15i where i is the i-vector and w is the filter
0:06:20so we want to
0:06:23pass the target speaker through this filter
0:06:26but reject all the other impostor speakers
0:06:30so the development set is the
0:06:33impostors so all the impostors come from the development set
0:06:38so if we
0:06:41use the mean square error
0:06:43in order to solve the problem
0:06:45we reach this result as you can see here
0:06:54okay
0:06:55w is the
0:06:56optimal filter for this solution
0:07:00and r is the
0:07:02autocorrelation matrix
0:07:04and i is the target which can be estimated by using the mean of the target i-vectors
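A minimal sketch of the mean square error filter described above. The slide's formula is not in the transcript, so this assumes the usual Wiener form w = R^{-1} t, with R the autocorrelation matrix of the development (impostor) i-vectors and t the mean of the target's i-vectors. The function name, the toy dimensions and the random data are illustrative assumptions, not part of the talk.

```python
import numpy as np

def mse_filter(impostors, target_ivectors):
    """Return w = R^{-1} t (assumed Wiener-style solution)."""
    # R: autocorrelation matrix of the impostor (development) i-vectors
    R = impostors.T @ impostors / impostors.shape[0]
    # t: target estimate, the mean of the i-vectors given for the target
    t = target_ivectors.mean(axis=0)
    # solve R w = t rather than forming the inverse explicitly
    return np.linalg.solve(R, t)

rng = np.random.default_rng(0)
dev = rng.standard_normal((500, 8))        # hypothetical impostor i-vectors
tgt = rng.standard_normal((5, 8)) + 2.0    # hypothetical target i-vectors
w = mse_filter(dev, tgt)
score = float(w @ tgt[0])                  # filter output w^T i for one i-vector
```

The score is the plain dot product of the filter and the test i-vector, with no length normalisation of either.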
0:07:11okay let's compare it with the baseline system
0:07:15the baseline system
0:07:18is computed after whitening the i-vectors
0:07:21and then using the cosine similarity to find the score
0:07:28you can see that when
0:07:30we use cosine similarity we should first normalize the length of the i-vectors
0:07:37as displayed but in the
0:07:40adaptive filtering as i just
0:07:43explained
0:07:45there is no normalisation of
0:07:47the i-vectors
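A rough sketch of the baseline scoring mentioned above: whiten the i-vectors using development-set statistics, then score with cosine similarity, which normalises the length of both vectors. The eigenvalue-based whitening, the function names and the random data are assumptions made for illustration; the talk does not say how the whitening was computed.

```python
import numpy as np

def whitening_transform(dev):
    """Return (mean, W) such that (x - mean) @ W has identity covariance on dev."""
    mean = dev.mean(axis=0)
    cov = np.cov(dev - mean, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    # scale each eigenvector (column) by 1 / sqrt(its eigenvalue)
    W = eigvec / np.sqrt(eigval)
    return mean, W

def cosine_score(model, test):
    """Cosine similarity: dot product after length-normalising both vectors."""
    return float(model @ test / (np.linalg.norm(model) * np.linalg.norm(test)))

rng = np.random.default_rng(1)
dev = rng.standard_normal((400, 6)) @ rng.standard_normal((6, 6))  # toy data
mean, W = whitening_transform(dev)
white = (dev - mean) @ W
model, test = white[0], white[1]
baseline = cosine_score(model, test)
```

Because of the length normalisation, rescaling the test i-vector leaves the cosine score unchanged, which is the property the adaptive filter gives up.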
0:07:50okay let's go a little further and change the criterion
0:07:55in beamforming with the minimum variance distortionless response
0:08:01there is a new criterion that is to maximize the signal to interference plus noise ratio
0:08:07so we want to
0:08:11maximize
0:08:12this ratio
0:08:14that is to maximize the output of the filter when the target is passed
0:08:19but to reject all the
0:08:22impostors so we want to minimize the
0:08:25denominator relative to the numerator
0:08:29in order to
0:08:30solve the problem we assume that
0:08:33the numerator
0:08:35equals one
0:08:36that's the
0:08:38all
0:08:39the best way
0:08:41so
0:08:41we want to minimize the
0:08:44denominator which is the power of the impostors being passed through the
0:08:49filter
0:08:51where a value of that
0:08:53and here
0:08:55r is the
0:08:58impostor covariance matrix so the optimum solution for this problem
0:09:03can easily be found this way
0:09:07so let's compare it with the cosine similarity
0:09:11the baseline system is like that and the mvdr is proposed this way
0:09:16so if you look at this idea we see that
0:09:22mvdr proposes a new similarity measure
0:09:27that does not include the normalisation of the test i-vector but focuses more on the
0:09:33target
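A minimal sketch of the MVDR criterion as described: minimise the impostor power w^T R w while keeping the response to the target fixed at one. This uses the standard closed form w = R^{-1} t / (t^T R^{-1} t); the slide's exact notation is not in the transcript, and the function name and toy data are assumptions.

```python
import numpy as np

def mvdr_filter(R, t):
    """Minimise w^T R w subject to w^T t = 1."""
    Rinv_t = np.linalg.solve(R, t)     # R^{-1} t without an explicit inverse
    return Rinv_t / (t @ Rinv_t)       # scale so the target passes with gain 1

rng = np.random.default_rng(2)
dev = rng.standard_normal((500, 8))            # hypothetical impostor i-vectors
R = np.cov(dev, rowvar=False)                  # impostor covariance matrix
t = rng.standard_normal((5, 8)).mean(axis=0) + 1.0   # mean of the target i-vectors
w = mvdr_filter(R, t)

test_ivector = rng.standard_normal(8)
score = float(w @ test_ivector)                # no normalisation of the test i-vector
```

Any other filter that also passes the target with gain one lets at least as much impostor power through, which is what makes this solution the optimum for the stated criterion.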
0:09:37the result
0:09:39shows that
0:09:40it provides an
0:09:43improvement of seven point seven percent
0:09:46in the
0:09:47i-vector challenge
0:09:50so let's go one
0:09:53step further and make it more robust
0:09:57as we had in the previous
0:10:00slide we use the mean of
0:10:03all the target i-vectors in order to
0:10:08estimate the target since mvdr supposes that there is no uncertainty regarding the
0:10:13target
0:10:14but in this
0:10:17linearly constrained minimum variance beamformer we include uncertainty by some linear constraints
0:10:26so we put all the i-vectors provided for the target
0:10:32in the matrix c
0:10:35and
0:10:36we enforce
0:10:40that they pass the filter
0:10:43with the value of one
0:10:46so f is equal to one
0:10:49so if you solve this problem
0:10:53the optimal filter will be as you can see here
0:10:59and when we apply it to the challenge
0:11:02there is another improvement of three point seven percent relative to mvdr
0:11:08and eleven point one percent relative to the baseline system
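A sketch of the linearly constrained minimum variance filter under the constraints described above: the columns of C are the target's i-vectors and every one of them must pass the filter with value one (f is a vector of ones). It uses the standard closed form w = R^{-1} C (C^T R^{-1} C)^{-1} f; the function name and the random data are assumptions for illustration.

```python
import numpy as np

def lcmv_filter(R, C, f):
    """Minimise w^T R w subject to C^T w = f."""
    Rinv_C = np.linalg.solve(R, C)                 # R^{-1} C
    # w = R^{-1} C (C^T R^{-1} C)^{-1} f
    return Rinv_C @ np.linalg.solve(C.T @ Rinv_C, f)

rng = np.random.default_rng(3)
dev = rng.standard_normal((500, 8))                # hypothetical impostor i-vectors
R = np.cov(dev, rowvar=False)                      # impostor covariance matrix
C = (rng.standard_normal((5, 8)) + 1.0).T          # columns: the target's i-vectors
f = np.ones(C.shape[1])                            # each target i-vector passes with 1
w = lcmv_filter(R, C, f)
```

With a single column in C this reduces to the MVDR filter, so the extra constraints are what carry the uncertainty about the target.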
0:11:13so
0:11:14now we can
0:11:17do some additional
0:11:20work
0:11:21in order to improve the performance
0:11:24as you know in signal processing
0:11:27there are many more techniques such as the robust capon beamformer or the doubly constrained
0:11:33robust capon beamformer which
0:11:37improve the performance by diagonally loading the
0:11:41covariance matrix
0:11:44i used a similar approach but used the top impostor i-vectors
0:11:51which are the most similar to the target i-vectors
0:11:55so
0:11:57in this way we pass all the impostors through the
0:12:03filter for each target
0:12:05and
0:12:06selected the six thousand impostors with the highest similarity
0:12:13and computed the covariance matrix again
0:12:17this results in a
0:12:20very good improvement
0:12:22of twenty one point five percent
0:12:25relative to the baseline system
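A sketch of the impostor covariance modification described above: pass every impostor through the current filter for a target, keep the highest-scoring ones, and re-estimate the covariance from those alone. The MVDR form of the filter, the function names and the toy sizes (200 selected out of 2000 instead of the six thousand mentioned in the talk) are assumptions for illustration.

```python
import numpy as np

def mvdr_filter(R, t):
    """Minimise w^T R w subject to w^T t = 1."""
    Rinv_t = np.linalg.solve(R, t)
    return Rinv_t / (t @ Rinv_t)

def select_top_impostors(impostors, w, top_n):
    """Indices of the top_n impostors with the highest filter output."""
    scores = impostors @ w
    return np.argsort(scores)[-top_n:]

rng = np.random.default_rng(4)
dev = rng.standard_normal((2000, 8))       # hypothetical impostor i-vectors
t = rng.standard_normal(8) + 1.0           # hypothetical target estimate

w0 = mvdr_filter(np.cov(dev, rowvar=False), t)     # filter from all impostors
top = select_top_impostors(dev, w0, top_n=200)     # impostors most similar to the target
R_top = np.cov(dev[top], rowvar=False)             # covariance recomputed per target
w1 = mvdr_filter(R_top, t)                         # refined target-specific filter
```

The refined covariance is specific to each target, so the selection and the filter have to be recomputed per target.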
0:12:28we can see the scores
0:12:31for all the impostors when compared to the
0:12:34target
0:12:36after
0:12:39applying this covariance matrix modification we can see
0:12:43the reduction
0:12:46in the scores
0:12:50another
0:12:52factor that improved the
0:12:54speaker recognition performance was to use score normalisation
0:12:59i found this relation
0:13:02the best
0:13:04contrary to some others that use the variance as well
0:13:09like z-norm or t-norm which use the variance of the scores
0:13:14we could not do that
0:13:18and this results in a further improvement
0:13:23okay
0:13:24and let's go on and be more supervised
0:13:28okay
0:13:29we use the within-class covariance matrix
0:13:33found using a clustering method
0:13:37but this clustering method is somewhat different
0:13:41as we
0:13:42treat each single i-vector in the development set as a target
0:13:49and find the closest or the most similar
0:13:54i-vector to that target
0:13:56and this is repeated each time with one more in order to find more similar
0:14:04i-vectors
0:14:05after finding those i-vectors
0:14:08we use this formula in order to compute the within-class covariance of those
0:14:12i-vectors which are assumed to be from the same speaker
0:14:18and the final model
0:14:22can be found by adding this w since we want to compensate for
0:14:28the intersession variability as well as
0:14:31rejecting impostors
0:14:34so we add them together
0:14:37to find this optimum filter
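A rough sketch of the pseudo within-class covariance described above. Each development i-vector is treated as a target, its most similar neighbours are assumed to come from the same speaker, and the scatter of each such group is accumulated. The formula on the slide is not in the transcript, so the k-nearest-neighbour grouping by cosine similarity, the plain sum of the two matrices, the MVDR form of the final filter and all names here are assumptions.

```python
import numpy as np

def pseudo_within_class_cov(dev, k):
    """Within-class covariance from groups of each i-vector and its k nearest neighbours."""
    unit = dev / np.linalg.norm(dev, axis=1, keepdims=True)
    sim = unit @ unit.T                              # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)                   # an i-vector is not its own neighbour
    neighbours = np.argsort(sim, axis=1)[:, -k:]     # k most similar i-vectors per row
    dim = dev.shape[1]
    W = np.zeros((dim, dim))
    for i, idx in enumerate(neighbours):
        cluster = np.vstack([dev[i], dev[idx]])      # assumed same-speaker group
        centred = cluster - cluster.mean(axis=0)
        W += centred.T @ centred                     # scatter around the group mean
    return W / (dev.shape[0] * (k + 1))

rng = np.random.default_rng(5)
dev = rng.standard_normal((300, 8))                  # hypothetical development i-vectors
t = rng.standard_normal(8) + 1.0                     # hypothetical target estimate

W = pseudo_within_class_cov(dev, k=3)
R = np.cov(dev, rowvar=False)                        # impostor covariance matrix
R_total = R + W                                      # the two matrices added together
Rinv_t = np.linalg.solve(R_total, t)
w = Rinv_t / (t @ Rinv_t)                            # filter built from the combined matrix
```

Adding the two matrices makes the filter penalise both impostor directions and directions of assumed intersession variability.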
0:14:39so you can see the results
0:14:43it
0:14:45leads to an improvement of twenty five to twenty seven point five percent
0:14:50relative to the baseline system
0:14:54so in conclusion
0:14:55we have proposed a new
0:14:59idea from signal processing adaptive filtering in order to solve the
0:15:04i-vector challenge
0:15:10a modification of the impostor covariance matrix is also possible
0:15:15this way
0:15:16so
0:15:17we have used
0:15:20this idea
0:15:22and we can also apply it to plda i think we can
0:15:27improve speaker recognition if we apply it to plda
0:15:32but we did not have enough time to do that
0:15:37thank you for listening
0:15:58so one time if do not remember starts and eleven we did language at that
0:16:05was doing something cosine and i was i the target michael length normalized text
0:16:12we should the same what's your data
0:16:14would be inferred and it was cut off the backend was able to a calibrate
0:16:19scores but you have some shift in the scroll effect on two test
0:16:24sort calibration sessions cell so that's what happened time so the calibration what nick but
0:16:32also was about the clock offset estimation
0:16:36but we found that it was much worse when you
0:16:40don't normalise the test
0:16:42so no
0:16:44for language id was possible speaker okay but
0:16:47okay thank you
0:16:57good