0:00:14 | i'm going to present this work about domain adaptation in speaker recognition

0:00:18 | an adaptation strategy built from scratch from unlabeled elements in speaker recognition

0:00:26 | we want to carry out speaker recognition on a new domain and hope to increase

0:00:30 | the accuracy of detection

0:00:32 | thanks to adaptation techniques

0:00:35 | but we need

0:00:36 | to take into account the difficulties of the task in real life situations

0:00:42 | the cost of data collection and also the cost of labeling the large

0:00:48 | available in-domain dataset

0:00:52 | so we assume that a unique and unlabeled in-domain development dataset is available

0:00:58 | possibly reduced in size in terms of speakers and also of segments per speaker

0:01:05 | this dataset is used to learn an adapted speaker recognition model

0:01:10 | first we want to know how the performance increases depending on the amount

0:01:15 | of unlabeled in-domain data

0:01:18 | in terms of segments

0:01:19 | and also of speakers or

0:01:23 | of sample size of segments per speaker

0:01:31 | second the optimal number of clusters is usually determined thanks to a labeled in

0:01:36 | domain dataset

0:01:37 | which is a strong requirement

0:01:42 | we want to

0:01:43 | carry out clustering without this requirement of a preexisting

0:01:48 | in-domain

0:01:49 | labeled

0:01:51 | dataset

0:01:53 | this is explained below in this presentation

0:02:01 | this displays the most usual backend process for speaker recognition systems based on embeddings

0:02:08 | and the different adaptation techniques that can be included

0:02:13 | some methods are aimed at

0:02:15 | transforming vectors to reduce the shift between target and out-of-domain distributions

0:02:21 | covariance alignment

0:02:23 | for example aligns the feature distribution of the out-of-domain

0:02:29 | data to the target one

0:02:32 | leading to transform the out-of-domain data into pseudo in-domain data
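Covariance alignment can be sketched as below; this is a minimal illustration in the spirit of CORAL, assuming embeddings are stored as NumPy row matrices (the function and variable names are illustrative, not taken from the system described here):

```python
import numpy as np

def coral_align(out_domain, in_domain, eps=1e-6):
    """Whiten out-of-domain embeddings with their own covariance,
    then re-color them with the (unlabeled) in-domain covariance,
    so they look like pseudo in-domain data."""
    def sqrtm(c):
        # symmetric matrix square root via eigendecomposition
        w, v = np.linalg.eigh(c)
        return (v * np.sqrt(np.clip(w, eps, None))) @ v.T

    d = out_domain.shape[1]
    c_out = np.cov(out_domain, rowvar=False) + eps * np.eye(d)
    c_in = np.cov(in_domain, rowvar=False) + eps * np.eye(d)
    whiten = np.linalg.inv(sqrtm(c_out))
    color = sqrtm(c_in)
    mu_out = out_domain.mean(axis=0)
    mu_in = in_domain.mean(axis=0)
    return (out_domain - mu_out) @ whiten @ color + mu_in
```

After this transform, the second-order statistics of the out-of-domain set match the in-domain ones, which is exactly the shift reduction mentioned above.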

0:02:40 | when speaker labels of the in-domain sample are available

0:02:44 | supervised adaptation can be carried out

0:02:47 | that's the kind of map

0:02:49 | approach

0:02:51 | that summarizes to a linear interpolation between in-domain and out-of-domain parameters

0:02:58 | also score normalizations can be considered as unsupervised adaptation

0:03:03 | as they use an unlabeled in-domain subset for the impostor cohort
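As an illustration, a symmetric score normalization (s-norm) against an impostor cohort can be sketched like this; it is a simplified version, not necessarily the exact variant used in the system:

```python
import numpy as np

def s_norm(raw_score, enroll_cohort_scores, test_cohort_scores):
    """s-norm: z-normalize the raw trial score against the impostor-cohort
    scores of the enrollment side and of the test side, then average.
    The cohort scores come from an unlabeled in-domain subset."""
    e = np.asarray(enroll_cohort_scores, float)
    t = np.asarray(test_cohort_scores, float)
    ze = (raw_score - e.mean()) / e.std()
    zt = (raw_score - t.mean()) / t.std()
    return 0.5 * (ze + zt)
```

Because only impostor statistics are needed, no speaker label of the in-domain data is required, which is why this counts as unsupervised adaptation.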

0:03:09 | note that we generalized this interpolation of the plda parameters

0:03:14 | to all possible stages of the system lda and whitening

0:03:18 | this led to improvements of performance

0:03:22 | on all our experiments
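The interpolation step can be sketched generically; the parameter names below are placeholders for whatever is interpolated at each stage (whitening mean and covariance, LDA, PLDA covariances), not an exact reproduction of the system:

```python
import numpy as np

def interpolate_params(out_params, in_params, alpha):
    """Linearly interpolate each backend parameter between its
    out-of-domain and in-domain estimates; alpha weights the in-domain
    side (alpha = 0 keeps the out-of-domain model unchanged)."""
    return {name: alpha * in_params[name] + (1.0 - alpha) * out_params[name]
            for name in out_params}
```

Applying the same scheme at every stage of the backend rather than to the PLDA alone is the generalization mentioned above.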

0:03:29 | so how does the performance increase depending on the

0:03:33 | amount of data

0:03:36 | we carried out

0:03:37 | experiments

0:03:41 | focusing on the gain of adapted systems as a function of the available data

0:03:47 | three parameters are varied in this analysis

0:03:54 | number of speakers

0:03:56 | number of samples per speaker

0:03:57 | and

0:03:58 | adaptation technique

0:04:02 | here is a description of the experimental setup for our

0:04:07 | analyses

0:04:09 | we use an acoustic front-end of twenty three cepstral coefficients

0:04:14 | with a window size

0:04:16 | of three seconds

0:04:18 | then vad with the c zero component

0:04:23 | the extractor is an x-vector one of the kaldi toolkit

0:04:28 | with an attentive statistics pooling layer

0:04:32 | this extractor is trained on switchboard and nist sre

0:04:36 | datasets

0:04:39 | using a five-fold augmentation strategy with reverberation

0:04:46 | noise music

0:04:48 | and babble from musan

0:04:52 | so the domain of interest is an arabic language in a corpus called call my net two

0:04:56 | as in the nist speaker recognition evaluations

0:04:59 | of two thousand

0:05:00 | and eighteen

0:05:02 | cmn2 and two thousand and

0:05:05 | nineteen

0:05:06 | cts

0:05:10 | this language is absent from the nist speaker recognition training databases

0:05:15 | which leads to a mismatch

0:05:22 | the in-domain corpus for development and test is described in this table

0:05:28 | the development dataset gathers half of the enrollment and test segments derived from nist

0:05:32 | sre eighteen development and test

0:05:35 | and half of the enrollment and test segments derived from nist sre nineteen

0:05:42 | the other halves are set aside for making up the trial dataset of test

0:05:47 | the fifty percent split takes genders into account so that both subsets are balanced

0:05:52 | by gender

0:05:54 | the test set contains numerous trial pairs

0:05:57 | randomly and uniformly picked up with the constraint of being equalized by gender

0:06:03 | and of a target prior

0:06:04 | equal to one percent
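The trial-set construction with a fixed target prior can be sketched as below; gender balancing is omitted for brevity and all names are illustrative, so this is only an assumption about how such a list might be built:

```python
import random

def build_trials(target_pairs, nontarget_pairs, target_prior=0.01, seed=0):
    """Keep all target trials and subsample non-target trials uniformly
    at random so that targets make up `target_prior` of the final list."""
    rng = random.Random(seed)
    n_non = round(len(target_pairs) * (1.0 - target_prior) / target_prior)
    n_non = min(n_non, len(nontarget_pairs))
    return list(target_pairs) + rng.sample(list(nontarget_pairs), n_non)
```

With a one percent target prior, ninety-nine non-target trials are drawn for each target trial.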

0:06:07 | when analysing the adaptation strategy

0:06:10 | the number of speakers and the number of segments per speaker are varied

0:06:16 | in order to reach different total amounts of segments and also

0:06:21 | given a fixed amount to assess the impact of speaker class variability

0:06:26 | each time a subset is picked up from the three hundred and ten speaker

0:06:31 | development dataset and used as input for adapting the models

0:06:36 | the test set

0:06:38 | is fixed and only intended for testing

0:06:42 | several alternatives are considered and experimented

0:06:45 | a system applying unsupervised adaptation only

0:06:49 | a system applying supervised adaptation only

0:06:52 | and a system applying the full pipeline

0:06:55 | unsupervised then supervised

0:06:57 | the goal is to assess the usefulness

0:07:00 | of unsupervised techniques when speaker labels are available

0:07:07 | this figure shows the results of our analyses

0:07:12 | performance in terms of equal error rate of unsupervised and supervised

0:07:17 | adapted systems depending on the number of speakers

0:07:22 | and segments per speaker

0:07:25 | of the in-domain development dataset

0:07:28 | the case

0:07:30 | denoted all segments per speaker corresponds to keeping all segments of the speakers

0:07:36 | on the plots

0:07:39 | the x axis is the number of speakers

0:07:42 | and each curve corresponds to a number of segments per speaker

0:07:47 | it can be observed that

0:07:49 | combining unsupervised and supervised adaptation is the best strategy having labeled data doesn't

0:07:55 | make unsupervised adaptation useless

0:08:01 | also we observe that

0:08:03 | even with a small in-domain dataset here of fifty speakers there is

0:08:08 | a significant gain of performance with adaptation compared to the baseline of twelve point

0:08:14 | twelve percent

0:08:16 | now let us focus on the dashed curves in the figure

0:08:21 | they correspond to fixed total amounts of segments

0:08:28 | for example

0:08:29 | this last row corresponds to the same amount of two thousand five hundred segments

0:08:37 | possibly

0:08:39 | fifty speakers and fifty segments

0:08:42 | per speaker or one hundred

0:08:45 | speakers and twenty five segments per speaker

0:08:48 | by sweeping the curves

0:08:51 | we can notice that

0:08:53 | given a total amount of segments performance improves with the number of speakers

0:08:58 | gathering data from a few speakers even with many utterances per speaker

0:09:03 | really limits the gain of adapted systems

0:09:07 | now let's talk about clustering

0:09:10 | the goal is to obtain a reliable labeled in-domain dataset by using

0:09:15 | unsupervised clustering and identifying the provided classes

0:09:20 | with pseudo speaker labels

0:09:23 | the dataset x

0:09:26 | is clustered and

0:09:27 | the result

0:09:29 | is an estimated speaker label for each segment of x

0:09:34 | note that we assume

0:09:36 | a preexisting labeled dataset y of in-domain data

0:09:40 | a plda model is computed

0:09:42 | using the out-of-domain training dataset

0:09:45 | then the score matrix of all pairs of x is used for carrying out

0:09:51 | agglomerative hierarchical clustering using it as

0:09:56 | a similarity matrix

0:09:59 | a key issue of this clustering problem is how to determine the actual number

0:10:05 | of classes
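The clustering step can be sketched with a naive average-linkage agglomerative procedure driven directly by the similarity (score) matrix; a real system would use an optimized implementation, and stopping at a given number of clusters is only for illustration since choosing that number is exactly the open question:

```python
import numpy as np

def ahc_labels(similarity, n_clusters):
    """Naive average-linkage agglomerative clustering from a pairwise
    similarity matrix (e.g. PLDA scores between segments): repeatedly
    merge the two most similar clusters until n_clusters remain."""
    n = similarity.shape[0]
    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        best, pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average pairwise similarity between the two clusters
                s = np.mean([similarity[i, j]
                             for i in clusters[a] for j in clusters[b]])
                if s > best:
                    best, pair = s, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)
    labels = np.empty(n, dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels
```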

0:10:08 | by sweeping the number of clusters for each number a plda model is estimated which

0:10:12 | includes the adapted parameters

0:10:16 | and the preexisting in-domain labeled dataset y is used for error rate

0:10:21 | computation

0:10:27 | then we select the class labels corresponding to the number of classes q that minimizes

0:10:32 | the error rate
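This selection loop relies on an error-rate computation; a minimal equal-error-rate sketch (illustrative, not an official scoring tool) and the selection itself could look like:

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Sweep every score as a threshold and return the operating point
    where the false-alarm and miss rates are (almost) equal."""
    target_scores = np.asarray(target_scores, float)
    nontarget_scores = np.asarray(nontarget_scores, float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    miss = np.array([(target_scores < t).mean() for t in thresholds])
    i = int(np.argmin(np.abs(fa - miss)))
    return 0.5 * (fa[i] + miss[i])

def select_num_classes(candidate_qs, eer_of):
    """Keep the number of classes q whose adapted model minimizes the
    error rate measured on the labeled set."""
    return min(candidate_qs, key=eer_of)
```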

0:10:37 | the drawback of this approach is clearly identified

0:10:42 | it requires a preexisting labeled development set and is not

0:10:46 | a method from scratch working without labeled in-domain data

0:10:56 | so we propose a method for clustering the in-domain dataset and determining the

0:11:01 | optimal number of classes from scratch without the requirement of a preexisting labeled in-domain set

0:11:10 | here is the algorithm

0:11:11 | first

0:11:12 | this part of the algorithm is identical to the previous one

0:11:15 | then

0:11:16 | for each number of classes q

0:11:18 | we identify each class with a pseudo speaker

0:11:21 | and build the corresponding trial key matrix

0:11:27 | then we use

0:11:28 | this set of pseudo or artificial keys

0:11:31 | for computing the error rate
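Turning cluster labels into artificial trial keys can be sketched as follows: pairs of segments falling in the same cluster are treated as pseudo-target trials, all other pairs as pseudo-non-target trials, so an error rate can be computed without any true speaker label (a simplified illustration):

```python
import numpy as np

def pseudo_trial_scores(score_matrix, cluster_labels):
    """Split the pairwise score matrix into pseudo-target and
    pseudo-non-target scores according to the cluster labels,
    considering each upper-triangular pair once."""
    cluster_labels = np.asarray(cluster_labels)
    iu, ju = np.triu_indices(len(cluster_labels), k=1)
    same = cluster_labels[iu] == cluster_labels[ju]
    scores = np.asarray(score_matrix, float)[iu, ju]
    return scores[same], scores[~same]
```

The two score sets can then be fed to any error-rate computation in place of true target and non-target scores.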

0:11:37 | now we have to determine the optimal number of classes

0:11:42 | we use the elbow criterion well known in the field of clustering

0:11:48 | here we display the error rates used as criteria for determining the optimal number of

0:11:53 | clusters

0:11:55 | the reported values correspond to the loop of the algorithm from scratch

0:12:01 | we can see that the slope of the equal error rate curve slows

0:12:05 | down in the neighbourhood by excess of the exact number of speakers

0:12:11 | which is

0:12:12 | two hundred and fifty

0:12:15 | moreover the values of mindcf at several operating points

0:12:20 | reach local minima before converging to zero

0:12:25 | the first one in the same neighbourhood

0:12:31 | of two hundred and fifty

0:12:38 | so the algorithm finally gives the value of

0:12:42 | three hundred

0:12:44 | classes

0:12:45 | with the criterion that beyond this threshold the mindcf increases
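A simple way to automate such an elbow criterion is the maximum-distance-to-chord heuristic: pick the point of the error-rate curve farthest from the straight line joining its endpoints. This is one common choice among several, not necessarily the exact rule used here:

```python
import numpy as np

def knee_point(xs, ys):
    """Elbow detection: return the x value whose curve point lies
    farthest from the chord joining the first and last points."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    p0 = np.array([xs[0], ys[0]])
    p1 = np.array([xs[-1], ys[-1]])
    d = p1 - p0
    d /= np.linalg.norm(d)
    pts = np.stack([xs, ys], axis=1) - p0
    # perpendicular distance of each point to the chord (2-d cross product)
    dist = np.abs(pts[:, 0] * d[1] - pts[:, 1] * d[0])
    return int(xs[np.argmax(dist)])
```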

0:12:55 | now we display the performance of the adapted system using clustering from scratch as a function

0:13:01 | of the number of clusters

0:13:04 | compared to unsupervised adaptation and supervised adaptation with the exact speaker labels

0:13:09 | with

0:13:12 | exact speaker labels and supervised adaptation the performance in equal error rate is around six percent and

0:13:19 | with only unsupervised adaptation performance is around seven percent

0:13:25 | and we can see the curve of all results obtained by varying the number of classes

0:13:33 | for the clustering

0:13:35 | from scratch that we propose

0:13:42 | we can see that the method overestimates the number of speakers but manages to

0:13:47 | attain interesting performance in terms of equal error rate and mindcf

0:13:53 | close to the performance

0:13:56 | with exact labels and supervised adaptation

0:14:03 | here are the results

0:14:05 | with various numbers of segments per speaker

0:14:09 | five ten or more

0:14:11 | for example

0:14:13 | on the last line we can see that the results obtained by clustering from scratch

0:14:17 | are

0:14:18 | similar to the ones produced with a labeled development set

0:14:24 | and also close to the ones with the exact speaker labels

0:14:31 | now we will conclude

0:14:35 | the analyses that we carried out

0:14:38 | show that the improvement of performance is due to supervised but also unsupervised domain adaptation techniques

0:14:46 | like coral or plda interpolation

0:14:49 | these techniques are well combined one in the model field

0:14:53 | the other in the feature field to achieve the best performance

0:14:59 | also

0:15:01 | it is observed that a small sample of in-domain data can significantly reduce the

0:15:05 | gap of performance

0:15:08 | mainly when favouring the amount of speakers

0:15:11 | rather than of segments per speaker

0:15:18 | lastly a new approach for optimal speaker labeling has been introduced here

0:15:23 | working from scratch

0:15:25 | without preexisting in-domain labeled data

0:15:29 | for clustering

0:15:31 | while achieving a gain in performance

0:15:36 | thank you for your attention

0:15:38 | you can refer to the paper for more details on this study

0:15:41 | bye bye