0:00:15 | good morning everyone. so, i'm not sure if you noticed that

0:00:19 | this is the only speaker recognition talk in this session

0:00:24 | so which makes me feel somehow like |

0:00:27 | the distant relative that the family invites |

0:00:31 | but, you know, they don't really want to

0:00:35 | so today i'm going to be presenting some of the recent advances that we've

0:00:40 | seen in our speaker recognition system, and i will share some results

0:00:47 | that we obtained with the system on the NIST SRE 2010

0:00:53 | extended core

0:00:54 | tasks, with an emphasis on the telephony condition, which is condition 5

0:01:01 | this is joint work with my colleague, who is now an assistant professor

0:01:07 | at IISc Bangalore, India, and Jason Pelecanos

0:01:13 | i will start with a brief overview of some of the recent

0:01:18 | state-of-the-art work in speaker recognition, then i will

0:01:24 | share with you the objectives of my talk. i will present our speaker recognition

0:01:31 | system and the key components that contributed the most towards the end results

0:01:37 | i'll describe our experimental setup: the data we used, the DNN acoustic models

0:01:44 | and their configurations, as well as the speaker recognition system configuration, and i'll share

0:01:50 | with you as i said the results we obtained with the system on the nist |

0:01:54 | SRE 2010 extended core task

0:01:57 | mostly on condition 5, and then a comparison

0:02:02 | so when we look at the recent state-of-the-art work on speaker recognition, first of all

0:02:09 | most of the state-of-the-art systems are i-vector based

0:02:13 | and they somehow use a universal background model to generate statistics to compute the i-vectors

0:02:22 | now when we look at this through time: we started with traditional unsupervised

0:02:26 | gaussian mixture models to represent the UBMs

0:02:31 | and then more recently we used

0:02:36 | phonetically aware UBMs, which are derived from ASR systems

0:02:43 | so |

0:02:44 | i would like to emphasise here that even though this work

0:02:50 | done at IBM does not get much credit, this was the first

0:02:53 | work that in fact used senones to compute

0:02:57 | the hyper-parameters of the UBM for speaker recognition; in fact, it achieved

0:03:03 | state-of-the-art results as a single system on the NIST SRE

0:03:07 | and then after this work came the work from SRI, which basically used

0:03:13 | DNN-based senone posteriors to compute the UBM parameters

0:03:20 | more recently there was the work from Johns Hopkins University that used TDNN-based

0:03:25 | senone

0:03:27 | posteriors to compute the UBM parameters, and in fact they found that

0:03:32 | contrary to what SRI found

0:03:36 | with the

0:03:37 | diagonal covariance matrices, so with a UBM that uses

0:03:43 | diagonal covariance matrices, you can in fact

0:03:46 | estimate the UBM parameters

0:03:48 | with full covariance matrices and save a lot of computation; then you don't need to

0:03:53 | necessarily

0:03:56 | go through the hassle of

0:03:58 | the DNN-based system, so you can directly use a supervised

0:04:03 | UBM to compute the statistics and then from there compute the i-vectors, and they had

0:04:08 | nice gains as well |

0:04:10 | also, some of the state-of-the-art systems don't use the DNN posteriors

0:04:15 | to compute the UBM hyper-parameters; they use DNN bottleneck features, and then

0:04:21 | the

0:04:21 | rest of the pipeline in an i-vector based speaker recognition system remains the same

0:04:27 | so i mentioned some of the work here, but i would like to give some credit

0:04:30 | to Heck et al.'s work in '98

0:04:34 | that was the first to explore bottleneck-based features for speaker recognition

0:04:41 | so, the objectives of my talk today:

0:04:45 | i will be sharing our state-of-the-art results on the NIST SRE 2010

0:04:50 | extended core tasks; again, our emphasis is on the telephony condition, which is condition 5

0:04:56 | i will be presenting the key system components that contributed the most towards achieving these |

0:05:02 | results |

0:05:05 | namely, i will talk about the fMLLR-based features that we used

0:05:10 | and in fact compare them with the more traditional raw acoustic

0:05:16 | features such as MFCCs

0:05:20 | we also used DNN-based acoustic models in place of an

0:05:25 | unsupervised GMM acoustic model for the UBM

0:05:29 | now, this is basically

0:05:32 | technically not novel; DNN-based i-vectors have been around for

0:05:37 | a while now

0:05:38 | what we did here is nearly double

0:05:41 | the size of the senone set, and we wanted to see how that impacts

0:05:45 | speaker recognition performance

0:05:47 | and then finally, we explored nearest neighbor discriminant analysis (NDA) to achieve inter-session

0:05:54 | variability compensation in the i-vector space; we compared its performance with

0:05:59 | the more commonly used LDA

0:06:02 | we also quantify the contribution of

0:06:06 | these three system components

0:06:09 | towards the performance; in fact we'll also

0:06:12 | see how varying, for example, the size of the senone

0:06:17 | set

0:06:18 | will impact the performance

0:06:22 | now let's take a look at |

0:06:23 | our speaker recognition system. you see the flowchart of our speaker recognition system

0:06:29 | here; this is assuming that all the model parameters are already trained, so

0:06:35 | we have the DNN acoustic model trained, the i-vector extractor, and the NDA and PLDA

0:06:40 | models

0:06:40 | and |

0:06:42 | so the three components i just mentioned let me repeat this |

0:06:46 | we have |

0:06:48 | fMLLR-based features

0:06:51 | that can be used to train and evaluate the DNN

0:06:56 | as well as

0:06:58 | to compute the sufficient statistics for i-vector extraction; so with fMLLRs you

0:07:02 | can achieve both speaker and channel normalization

0:07:07 | we have the DNN acoustic model instead of an unsupervised GMM

0:07:12 | acoustic model to compute

0:07:14 | the i-vectors; again, compared to the previous work, we nearly doubled the size of the

0:07:20 | senone set

0:07:20 | and then we replaced the more commonly used LDA with

0:07:25 | NDA for inter-session variability compensation, and used the PLDA

0:07:31 | scoring i'm sure you're familiar with

0:07:34 | so if we look at the previous work |

0:07:38 | with the DNN-based senone i-vectors, what we observe is

0:07:44 | that many systems used two different sets of features

0:07:48 | to |

0:07:49 | compute the posteriors |

0:07:52 | and to compute the sufficient statistics; so typically ASR features are different from

0:07:59 | speaker-recognition features which makes sense so in this work we wanted to see |

0:08:03 | what happens if we unify them

0:08:05 | or use the same set of features to both train and evaluate the DNN and

0:08:09 | to compute the sufficient

0:08:10 | statistics for i-vector extraction. so towards that, we

0:08:14 | considered using feature-space maximum likelihood linear regression transforms, which

0:08:21 | give us fMLLR-based features, which are used as features for our DNN

0:08:27 | system

0:08:30 | so the fMLLR transform is a linear transform like this, which can

0:08:35 | be decomposed into a linear map A and a translation b, and these parameters

0:08:40 | can be obtained

0:08:42 | using the alignments that we obtain

0:08:46 | from the first pass through a GMM-HMM system

0:08:51 | and then maximum likelihood estimation basically

0:08:57 | gives us the estimated parameters, A-hat and b-hat

0:09:04 | and the product of this transform with raw acoustic features, such as MFCCs or

0:09:09 | even transformed features like LDA-transformed features,

0:09:12 | gives us speaker- and channel-normalized features

0:09:15 | now, this may sound contradictory, because fMLLRs are used

0:09:20 | to reduce speaker variability, but as we know

0:09:23 | there are two types of variability:

0:09:27 | speaker variability can be within or across speakers, right? so here we believe that

0:09:33 | the

0:09:35 | normalisation that fMLLR provides within speakers dominates the between-speaker normalisation; also

0:09:43 | we get the benefit of channel normalization, so if we have different handsets,

0:09:47 | for example, we think that fMLLR can take care of that as well
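the fMLLR step being described can be sketched as follows. this is a toy illustration only, not the speaker's actual system: the matrix A and offset b below are random stand-ins for parameters that, in the real pipeline, come from maximum-likelihood estimation over first-pass GMM-HMM alignments.

```python
import numpy as np

# Toy sketch of applying an fMLLR transform y = A x + b to raw acoustic
# feature frames (e.g. 40-dim features). A and b are random stand-ins;
# in the system described, they come from maximum-likelihood estimation
# over first-pass GMM-HMM alignments.
rng = np.random.default_rng(0)
dim = 40
A = np.eye(dim) + 0.01 * rng.standard_normal((dim, dim))  # linear part
b = 0.1 * rng.standard_normal(dim)                        # translation

frames = rng.standard_normal((100, dim))   # 100 raw feature frames
fmllr_frames = frames @ A.T + b            # y_t = A x_t + b, per frame
```

the same transformed frames then serve both the DNN and the i-vector statistics, which is the unification the talk argues for.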

0:09:53 | now, as i mentioned, the DNN senone i-vectors have been around for a while

0:09:59 | so nothing is technically new in

0:10:01 | this slide; the only difference is that we nearly doubled

0:10:06 | the size of our senone set compared to the previous work to compute the posteriors

0:10:11 | and from there compute the

0:10:14 | sufficient statistics

0:10:19 | so i'm not gonna spend much time on this |

0:10:21 | now, what

0:10:23 | we can add here is that

0:10:28 | we now know basically how to rapidly compute i-vectors using even 10k senones
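the DNN-senone statistics the speaker refers to can be sketched like this (a toy illustration with made-up sizes; in the real system the posteriors come from the trained DNN's 10k senone outputs and the features are fMLLRs):

```python
import numpy as np

# Sketch: Baum-Welch sufficient statistics for i-vector extraction, with
# the frame-level component posteriors gamma[t, c] supplied by a DNN's
# senone outputs instead of an unsupervised GMM. Toy sizes throughout.
rng = np.random.default_rng(0)
T, C, D = 200, 8, 40                   # frames, senones, feature dim

feats = rng.standard_normal((T, D))    # acoustic features
logits = rng.standard_normal((T, C))   # stand-in for DNN outputs
gamma = np.exp(logits)
gamma /= gamma.sum(axis=1, keepdims=True)   # senone posteriors per frame

N = gamma.sum(axis=0)                  # zeroth-order stats, shape (C,)
F = gamma.T @ feats                    # first-order stats, shape (C, D)
```

from N and F the i-vector is obtained exactly as in the GMM-UBM recipe; only the source of the posteriors changes.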

0:10:34 | so, just to connect this work to

0:10:38 | one of the presentations yesterday, where the presenter talked about

0:10:42 | how i-vector distributions are not necessarily gaussian; he actually showed us some distributions

0:10:48 | and that was even on clean data, not on noisy data, okay? so

0:10:54 | LDA is basically formulated based on gaussian distribution assumptions

0:11:00 | for individual classes

0:11:02 | or even if they are not gaussian, they need to be at least unimodal

0:11:08 | and so

0:11:09 | therefore LDA cannot effectively handle multimodal data

0:11:13 | which is typical in the NIST SRE type of scenario, because the data come from

0:11:19 | various sources: we have Switchboard sources of data, we have Mixer sources of data, and

0:11:24 | that causes multimodality in the i-vectors

0:11:28 | and also, for applications such as language recognition, because we only have a few

0:11:34 | classes, the LDA transform

0:11:38 | can be rank deficient

0:11:40 | so we might take a hit from that as well

0:11:44 | so instead of trying to transform the i-vector space so that it is more gaussian-like,

0:11:49 | which is what was presented yesterday,

0:11:53 | here we tried to use a transform that does not assume gaussianity

0:11:57 | and does not use the

0:12:01 | global structure of the classes to compute the between-class scatter

0:12:06 | matrices, so

0:12:07 | when you look at it, LDA uses the class centroids;

0:12:11 | the differences between class centroids, the arrow here and the

0:12:16 | arrow here, are used to compute the between-class scatter matrix. now in NDA

0:12:21 | what we do is we don't assume any global structure for

0:12:27 | individual classes; rather we assume that classes are only locally structured

0:12:31 | so we use the local

0:12:34 | means that are computed based on the k nearest neighbours for each individual sample

0:12:39 | and then use those differences to compute the between-class scatter matrices

0:12:46 | another thing is that we introduce this weighting function here, which basically emphasises

0:12:50 | the samples near the classification boundary, which are more important for discrimination between

0:12:58 | different classes, rather than a sample here,

0:13:01 | which should get a really small weight because it doesn't contribute to the discrimination,

0:13:07 | the class discrimination

0:13:09 | and then another thing is that, unlike LDA, with NDA,

0:13:12 | given that we have a large enough number of examples for the different classes,

0:13:17 | the rank can always be full

0:13:20 | so it is very useful for applications such as language id, for which we

0:13:27 | published work in, i guess, 2015, and we actually obtained some gains

0:13:32 | over LDA
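the NDA between-class scatter described above can be sketched as follows. this is a simplified two-class toy version under stated assumptions (the published formulation handles many classes and includes a distance exponent): local k-NN means replace the class centroids, and a weight emphasises samples near the boundary.

```python
import numpy as np

# Toy sketch of an NDA-style between-class scatter for two classes:
# local k-NN means instead of class centroids, plus a boundary weight
# w = min(d_own, d_other) / (d_own + d_other), which is ~0.5 near the
# boundary and ~0 deep inside a class.
rng = np.random.default_rng(0)
K, D = 3, 5
X = {0: rng.standard_normal((20, D)),
     1: rng.standard_normal((20, D)) + 2.0}   # two toy classes

def knn_mean(x, pool, k):
    """Mean of, and summed distance to, the k nearest neighbours of x in pool."""
    d = np.linalg.norm(pool - x, axis=1)
    idx = np.argsort(d)[:k]
    return pool[idx].mean(axis=0), d[idx].sum()

Sb = np.zeros((D, D))
for i, j in [(0, 1), (1, 0)]:
    for x in X[i]:
        own_pool = X[i][~np.all(X[i] == x, axis=1)]   # leave x itself out
        _, d_own = knn_mean(x, own_pool, K)
        m_other, d_other = knn_mean(x, X[j], K)
        w = min(d_own, d_other) / (d_own + d_other)   # boundary weight
        diff = (x - m_other)[:, None]
        Sb += w * (diff @ diff.T)                     # weighted local scatter
```

the within-class scatter is computed exactly as in LDA (as confirmed in the Q&A), and the projection comes from the usual generalized eigenvalue problem.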

0:13:34 | so, our experimental setup: for training data we extracted english telephony and microphone data

0:13:40 | from

0:13:41 | the 2004 through 2008 SRE data; we also used

0:13:48 | Switchboard data, both cellular and landline

0:13:52 | this basically resulted in a total of 60k recordings

0:13:58 | to train our system hyper-parameters. for evaluation we considered the NIST 2010

0:14:05 | SRE as the evaluation set; the reason we considered NIST SRE

0:14:09 | 2010 rather than 2012 is because

0:14:13 | we had some anchors to compare the performance of our system

0:14:17 | with other sites

0:14:23 | so the conditions we considered were conditions 1 through 5

0:14:28 | you can see the details here, but i wanted to emphasise that again our emphasis

0:14:31 | is on condition 5, which is

0:14:34 | telephony, and there is a mismatch between enrollment and test

0:14:39 | so the types of

0:14:42 | phones used in

0:14:45 | enrollment and test are not necessarily the same

0:14:48 | our DNN system:

0:14:50 | our DNN acoustic model had seven hidden layers, six of

0:14:55 | which had

0:14:55 | 2048 hidden units, and then a bottleneck layer with

0:15:01 | 512 units; we used Fisher data to train it

0:15:04 | in addition to the original 10k senones, we also considered 2.4k

0:15:10 | posteriors, to see basically how varying the granularity

0:15:17 | of

0:15:19 | the output layer

0:15:21 | affects speaker recognition performance

0:15:24 | a typical setup for our speaker recognition system:

0:15:29 | we used a 500-dimensional total variability subspace

0:15:33 | which was reduced to 250 using either NDA or LDA, which

0:15:38 | was trained on the entire training set, and we report equal error rate (EER)

0:15:43 | and minDCF'08 and minDCF'10

0:15:47 | note that we also considered discarding the silence senones from i-vector

0:15:51 | extraction
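the dimensionality-reduction step just described can be sketched as follows (toy data; the projection matrix below is a random stand-in for the LDA or NDA transform that would be learned on the training set):

```python
import numpy as np

# Sketch of the channel-compensation step: 500-dim i-vectors projected
# to 250 dims with a learned transform (LDA or NDA) before scoring.
# P is a random stand-in for the learned eigenvector matrix.
rng = np.random.default_rng(0)
ivectors = rng.standard_normal((1000, 500))   # toy i-vectors
P = rng.standard_normal((500, 250))           # stand-in projection

reduced = ivectors @ P
# length-normalise before (P)LDA scoring, a common practice
reduced /= np.linalg.norm(reduced, axis=1, keepdims=True)
```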

0:15:54 | in terms of results, let's compare NDA and LDA first. these

0:15:58 | results are obtained with MFCCs, a 2048-component gaussian mixture model, and the 10k DNN;

0:16:03 | results are reported on condition 5. as we see, no matter what type of acoustic

0:16:09 | model we use, NDA always provided a nice benefit over LDA

0:16:13 | across the three metrics

0:16:18 | and the reason is because, as i mentioned, NDA can handle non-gaussian

0:16:23 | and multimodal data

0:16:26 | more effectively than LDA. for the comparison of

0:16:30 | MFCCs versus fMLLRs

0:16:34 | again on condition 5 with the 10k DNN, we can see that

0:16:40 | with both LDA and NDA, it doesn't matter, we always have improvement with fMLLRs

0:16:45 | over MFCCs, and the reason is because

0:16:49 | fMLLRs provide speaker and channel normalization

0:16:53 | also note that we unified the speaker recognition and speech recognition features this way, okay?

0:16:58 | so the system is even simpler, but we should also take into account the

0:17:05 | fact that in order to compute the fMLLR transforms we need a two-pass

0:17:08 | system as opposed to a single-pass one

0:17:12 | to measure the impact of the senone set

0:17:16 | size, we compared 2.4k versus 10k posteriors; as you can see, as we increase

0:17:21 | the senone set size, results improve

0:17:24 | we also considered 32k senones

0:17:27 | okay, just to see how it impacts performance

0:17:33 | we did not see much gain with 32k senones; that experiment took two

0:17:39 | weeks to finish

0:17:42 | i just want to emphasise here that, in contrast to what we see with the

0:17:47 | DNN,

0:17:48 | if you increase the number of components in a GMM,

0:17:53 | at least with diagonal covariance matrices, you don't see these gains; if you increase the

0:17:57 | number

0:17:58 | of GMM components from 2k to 4k to 8k,

0:18:03 | you see

0:18:04 | probably marginal gains, if not degradations

0:18:08 | and now, as they say, a picture is worth a thousand words, so here

0:18:13 | is this table as a plot

0:18:16 | so first, you can see how NDA compares

0:18:20 | to LDA with both the GMM- and DNN-based systems; the performance gap is larger

0:18:26 | when one uses GMMs to compute the posteriors to compute the i-vectors

0:18:33 | and with the DNN, as we increase the size of the senone set, this

0:18:37 | gap in performance

0:18:38 | narrows; and then secondly, we can compare the 2.4k versus

0:18:43 | 10k DNN senone

0:18:48 | performance

0:18:50 | here is the progression of our system over time. we started with a very basic system: GMMs

0:18:54 | and MFCCs with LDA; we replaced the LDA with NDA and got a gain, and we

0:19:00 | replaced the GMM with the DNN and the MFCCs with fMLLRs and

0:19:05 | got a further boost in performance, so we have the best published performance on the

0:19:12 | NIST SRE 2010, condition 5 at least

0:19:15 | for the other conditions we believe those are also the best published performances; please, you

0:19:20 | know, refer to the paper for more details

0:19:23 | regarding the previous best results, i wanted to mention and give credit to

0:19:28 | the work that achieved a 1.09 equal error rate:

0:19:33 | they had a gender-dependent system; our system is gender-independent

0:19:36 | and then this other work,

0:19:39 | which also used gender-dependent systems, only reported results

0:19:44 | on female trials, so i'm not sure how we can

0:19:48 | compare with those numbers, but

0:19:53 | so, in conclusion: i presented our speaker recognition system and the components it has;

0:19:58 | i shared with you our results and quantified the contribution of the different components

0:20:05 | if you're interested in further progress on our system,

0:20:09 | please come visit us, or, you know, if you buy

0:20:13 | me a cookie after this i might be able to share more details with

0:20:18 | you. thank you

0:20:27 | time for some questions

0:20:37 | thank you for your presentation. my question is about the weights in the NDA computation: are

0:20:44 | those weights in the

0:20:46 | original NDA that you mentioned

0:20:50 | in the computation?

0:20:54 | those weights were originally in the NDA. as you

0:20:59 | say, the data that are close to the boundaries have to... let's

0:21:04 | take a look at how things are computed. so the numerator

0:21:09 | is the minimum of the distance between

0:21:13 | each sample and its k nearest neighbours within its own class

0:21:18 | and between the sample and its k nearest neighbours from the j-th class, right? and

0:21:24 | then that is divided by the sum. so of course, if the sample is not close

0:21:28 | to the boundary, it's going to be closer to its k nearest neighbours

0:21:33 | from its own class

0:21:35 | right |

0:21:36 | so this is going to be a small number compared to the denominator

0:21:40 | so you're going to get a number close to zero; versus, if the sample is

0:21:44 | close to the boundary, this number

0:21:48 | is going to get

0:21:49 | close to this number

0:21:51 | so, sorry, this number basically comes from

0:21:57 | the mean: it's the sample from class i compared to the

0:22:04 | mean of its k nearest neighbours from class j, and

0:22:09 | this is divided by

0:22:12 | this term, so it is going to be around 0.5 for samples near the classification

0:22:16 | boundary: you get 0.5, and for samples that are far from the boundary

0:22:20 | you're going to get a weight near zero
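the weight being explained here can be illustrated numerically; this is a minimal sketch of the formula as described, w = min(d_own, d_other) / (d_own + d_other), with made-up distances:

```python
# Boundary weight as described: d_own / d_other are the distances to the
# k nearest neighbours from the sample's own class and from the
# competing class, respectively.
def boundary_weight(d_own, d_other):
    return min(d_own, d_other) / (d_own + d_other)

w_interior = boundary_weight(d_own=1.0, d_other=9.0)   # deep inside class
w_boundary = boundary_weight(d_own=5.0, d_other=5.0)   # on the boundary
```

so an interior sample gets a weight near zero, while a boundary sample gets the maximum weight of 0.5.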

0:22:24 | can you tell us conceptually what it means? i mean

0:22:27 | why do the samples near the boundaries get more weight? because the

0:22:34 | samples which are far are far anyway, so what

0:22:38 | do they contribute to the estimation of the between-class scatter? because samples

0:22:44 | that are... if it assumes that they are gaussians,

0:22:48 | samples that are near... yes

0:22:52 | well, first of all, even if it's gaussian,

0:22:56 | all those data points that are far away from the mean, that are

0:23:00 | like outliers, right,

0:23:03 | because you can distinguish them...

0:23:07 | if there is a shift, that could contribute, but

0:23:13 | again, we weight the boundary data more than those that are more representative of the training

0:23:18 | set. what we mean is: the training set already

0:23:22 | has the labels; we know that those samples

0:23:26 | are in that class and they are far from the classification boundary

0:23:33 | okay |


0:23:45 | thank you for your talk. actually, i have a question regarding the implementation of

0:23:49 | the NDA. in your papers i've seen several things, like for the within-class covariance:

0:23:57 | do you use the classical one? — for this work we used the

0:24:01 | classical one

0:24:02 | — okay, because lately we've seen something different. — for this work we used

0:24:07 | exactly the same: we compute the within-class scatter matrix exactly the same way it is

0:24:12 | computed for LDA

0:24:14 | for the k nearest neighbours we use one-versus-rest

0:24:18 | that means that for each class

0:24:21 | you consider that class versus all the other classes, and you

0:24:25 | compute the nearest neighbours. — that was exactly my question: is there a difference, except for

0:24:29 | the computational time, because one-versus-one is going to be slower, i know,

0:24:35 | but in terms of results, does it change anything?

0:24:38 | — one-versus-one was so slow that i never explored it

0:24:44 | thank you |
