0:00:14 | hello everyone, welcome to my presentation. this is the recorded video of our conference presentation

0:00:22 | we are from the Hong Kong University of Science and Technology

0:00:27 | in this video i'd like to introduce our work on orthogonality regularization for

0:00:34 | end-to-end speaker verification

0:00:38 | for the speaker verification task

0:00:40 | hybrid systems have been the dominant solution for a long time

0:00:45 | for example the i-vector based systems

0:00:48 | or the x-vector based systems

0:00:51 | a hybrid system usually consists of multiple modules

0:00:56 | the speaker embeddings can be for example i-vectors or deep neural network embeddings and

0:01:02 | a separate scoring function is commonly built with a PLDA classifier

0:01:09 | in the hybrid systems

0:01:12 | each module is optimized with respect to its own target function

0:01:17 | which is usually not consistent across modules

0:01:20 | moreover speaker verification is an open-set problem

0:01:26 | the system has to handle unknown speakers in the evaluation stage

0:01:30 | so the generalization ability of the system is very important

0:01:35 | recently |

0:01:36 | more and more speaker verification systems are trained in an end-to-end manner

0:01:42 | an end-to-end system will map the test utterance and

0:01:46 | the enrollment utterances directly to a single score

0:01:50 | it can simplify the training pipeline

0:01:53 | and then the whole system is optimized in a more consistent manner

0:01:59 | it also enables learning a task-specific metric during the training stage

0:02:05 | and various loss functions have been proposed for the end-to-end systems

0:02:11 | for example the triplet loss

0:02:13 | and the generalized end-to-end loss

0:02:17 | the core idea of the loss functions in the end-to-end systems is to minimize

0:02:23 | the distances between embeddings from the same speaker

0:02:28 | and maximize the distances between embeddings from different speakers

0:02:33 | among these loss functions most of them use the cosine similarity

0:02:38 | that means the cosine of the angle between two embedding vectors is taken to

0:02:43 | be the distance between these two embeddings

0:02:48 | so the major underlying assumption for the effectiveness of the cosine similarity measurement

0:02:55 | is that the embedding space is orthogonal

0:02:58 | which is not guaranteed during the training

0:03:02 | in this work

0:03:03 | we aim to explore the orthogonality

0:03:06 | regularization in the end-to-end speaker verification systems

0:03:13 | so |

0:03:14 | in this work our systems are trained with the generalized end-to-end loss

0:03:20 | we propose two regularizers

0:03:23 | the first one is called the soft orthogonality regularization and the second one

0:03:27 | is called the spectral restricted isometry property

0:03:31 | regularization

0:03:34 | and these

0:03:35 | two proposed regularizers are evaluated on two different neural network structures

0:03:42 | the LSTM based one and the time delay neural network based one

0:03:49 | so first i'd like to briefly introduce

0:03:52 | the generalized end-to-end loss

0:03:57 | in our end-to-end system

0:03:59 | one mini-batch consists of N speakers and M utterances from each speaker

0:04:06 | that means we will have N times M utterances in total for one mini-batch

0:04:13 | x_ij represents the acoustic features computed from utterance j of speaker i

0:04:20 | for each input feature x_ij the network

0:04:24 | produces a corresponding embedding vector e_ij

0:04:28 | so we can

0:04:30 | compute the centroid of the speaker

0:04:32 | that is the centroid of the embedding vectors from

0:04:35 | speaker i

0:04:36 | by averaging its embedding vectors

0:04:41 | then we define the similarity matrix S_ij,k

0:04:46 | as the

0:04:48 | scaled cosine distances between each embedding vector and all of the centroids

0:04:54 | so S_ij,k means the similarity of the

0:04:59 | speaker embedding e_ij to the

0:05:03 | speaker centroid c_k

0:05:06 | and w and

0:05:08 | b are trainable parameters

0:05:11 | and the

0:05:13 | weight w is constrained to be positive so that the similarity will be

0:05:18 | larger when the cosine distance is larger

0:05:23 | during the training

0:05:25 | we want each utterance's

0:05:26 | embedding

0:05:27 | e_ij to be close to its own speaker's

0:05:31 | centroid while far away from other speakers' centroids so we apply a softmax on

0:05:38 | S_ij,k

0:05:39 | for all the possible

0:05:41 | k

0:05:42 | and get this

0:05:43 | loss function

0:05:47 | and the final generalized end-to-end loss is the summation of the losses over

0:05:52 | all the embedding vectors

0:05:55 | in brief

0:05:57 | the generalized end-to-end loss pushes each embedding towards the centroid of the true

0:06:03 | speaker

0:06:04 | and away from the centroid of the most similar different speaker
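To make the loss concrete, here is a rough NumPy sketch of the generalized end-to-end loss just described. This is not the authors' code: the scale `w`, offset `b`, and batch shape are illustrative placeholders, and the plain per-speaker centroids from the talk are used.

```python
import numpy as np

def ge2e_loss(emb, w=10.0, b=-5.0):
    """Generalized end-to-end loss for one mini-batch.

    emb: embedding vectors e_ij, shape (N speakers, M utterances, D).
    w, b: the trainable scale and offset from the talk (placeholder
    values here); w must stay positive so that the similarity grows
    with the cosine similarity.
    """
    N, M, _ = emb.shape
    emb = emb / np.linalg.norm(emb, axis=-1, keepdims=True)
    # centroid c_k of each speaker: the average of its embedding vectors
    cent = emb.mean(axis=1)
    cent = cent / np.linalg.norm(cent, axis=-1, keepdims=True)
    # similarity matrix S[i, j, k] = w * cos(e_ij, c_k) + b
    S = w * np.einsum('ijd,kd->ijk', emb, cent) + b
    # softmax over all possible centroids k; the target for e_ij is k = i
    logsumexp = np.log(np.exp(S).sum(axis=-1))      # shape (N, M)
    target = S[np.arange(N), :, np.arange(N)]       # shape (N, M)
    # total loss: cross-entropy summed over all N*M embedding vectors
    return float((logsumexp - target).sum())
```

Each embedding is pulled toward its own centroid (the `target` term) and pushed away from competing centroids, with the most similar wrong centroid dominating the log-sum-exp.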

0:06:12 | and we introduce the two regularizers to the end-to-end systems

0:06:16 | the first one is called the soft orthogonality regularization

0:06:22 | so suppose

0:06:23 | we have a fully connected layer

0:06:26 | and it has a weight matrix W

0:06:29 | the soft orthogonality regularization is defined in this way

0:06:34 | where lambda is the regularization

0:06:37 | coefficient

0:06:40 | and the norm here refers to the frobenius norm

0:06:46 | so this soft orthogonality regularization term

0:06:50 | requires the gram matrix of W to be close to identity

0:06:57 | and since the gradient of this soft orthogonality regularization term with respect to the

0:07:02 | weight

0:07:03 | matrix can be computed in a

0:07:06 | stable form

0:07:07 | this regularization term can be directly added to the end-to-end loss

0:07:13 | and optimized

0:07:15 | together
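As a minimal sketch (assuming a NumPy weight matrix; the value of lambda is just a placeholder), the soft orthogonality term and its closed-form gradient look like:

```python
import numpy as np

def soft_orthogonality(W, lam=0.1):
    """Soft orthogonality penalty: lam * ||W^T W - I||_F^2.

    W: weight matrix of the fully connected layer, shape (d_in, d_out).
    lam: the regularization coefficient lambda (placeholder value).
    """
    residual = W.T @ W - np.eye(W.shape[1])
    return lam * float(np.sum(residual ** 2))

def soft_orthogonality_grad(W, lam=0.1):
    """Closed-form gradient 4 * lam * W (W^T W - I); because the
    gradient is available in this stable form, the penalty can simply
    be added to the end-to-end loss and optimized jointly."""
    return 4.0 * lam * W @ (W.T @ W - np.eye(W.shape[1]))
```

The penalty vanishes exactly when the columns of W are orthonormal, i.e. when the Gram matrix equals the identity.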

0:07:18 | the second one is called the spectral restricted isometry property regularization

0:07:25 | the

0:07:25 | restricted isometry property characterizes

0:07:30 | the matrices that are nearly

0:07:32 | orthonormal

0:07:35 | so this regularization term is derived from the

0:07:39 | restricted isometry property

0:07:43 | for a weight matrix W the SRIP regularization

0:07:47 | is formulated in this way

0:07:50 | here lambda is also the regularization coefficient

0:07:54 | and sigma is

0:07:56 | the spectral norm

0:07:58 | that is the largest singular value of the

0:08:02 | matrix

0:08:05 | so this SRIP regularization term

0:08:08 | requires

0:08:11 | the largest singular value of the gram matrix minus identity

0:08:15 | to be close to zero

0:08:17 | which is equivalent to

0:08:19 | requiring all the singular values of W to be close to

0:08:23 | one

0:08:27 | at the same time this

0:08:28 | SRIP regularization term

0:08:31 | requires the

0:08:33 | eigenvalue decomposition to compute exactly

0:08:36 | which may result in unstable gradients

0:08:40 | so

0:08:41 | in practice we use the technique called power iteration

0:08:46 | to

0:08:47 | approximate the spectral norm computation process

0:08:51 | so in our experiments we just randomly initialize the vector v

0:08:57 | and

0:08:58 | repeat the above iterative process two times
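The SRIP term with the power-iteration approximation just described can be sketched as follows (NumPy, with a placeholder lambda; the two-step iteration follows the talk, while the particular update form is one common variant, not necessarily the authors' exact one):

```python
import numpy as np

def srip(W, lam=0.1, n_iter=2, seed=0):
    """SRIP penalty: lam * sigma(W^T W - I), where sigma is the spectral
    norm (largest singular value), approximated by power iteration.

    As in the talk, v is randomly initialized and the iterative
    process is repeated only n_iter = 2 times.
    """
    A = W.T @ W - np.eye(W.shape[1])          # residual Gram matrix
    v = np.random.default_rng(seed).standard_normal(W.shape[1])
    for _ in range(n_iter):
        u = A @ v
        v = A @ u
    # after the loop, sigma(A) ~= ||v|| / ||u||  (A is symmetric)
    return lam * float(np.linalg.norm(v) / (np.linalg.norm(u) + 1e-12))
```

Only a few matrix-vector products are needed per step, which is why the overhead stays small and the gradients stay well behaved compared with an exact eigendecomposition.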

0:09:07 | lambda is the regularization coefficient for both regularization terms

0:09:12 | the choice of the regularization coefficient

0:09:15 | plays an important role in the training process as well as the final system performance

0:09:22 | so we investigated two different

0:09:25 | schedules

0:09:26 | the first one keeps

0:09:30 | a constant coefficient throughout the training stage

0:09:38 | and the second schedule starts with

0:09:42 | lambda equal to zero point two and then we gradually reduce it to zero

0:09:47 | during the training stage
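The two coefficient schedules can be sketched as below. The talk only says the second schedule gradually reduces lambda from 0.2 to zero, so the linear decay shape and the constant value 0.1 are assumptions for illustration:

```python
def lambda_constant(epoch, total_epochs, lam0=0.1):
    """First schedule: the same coefficient for the whole training run
    (the constant value itself is a placeholder here)."""
    return lam0

def lambda_decreasing(epoch, total_epochs, lam0=0.2):
    """Second schedule: start from lam0 = 0.2 and reduce it to zero
    over the training stage (linear decay is an assumed shape)."""
    return lam0 * max(0.0, 1.0 - epoch / total_epochs)
```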

0:09:51 | we explored two different types of neural networks

0:09:55 | the first one is an LSTM based system

0:10:00 | and the second one is a TDNN based system

0:10:03 | the LSTM system contains three layers of LSTM with projection

0:10:09 | each LSTM layer has seven hundred sixty eight hidden nodes

0:10:14 | the projection size is set to two hundred fifty six

0:10:18 | after processing the whole input utterance the last frame output

0:10:24 | of the LSTM

0:10:25 | will be used

0:10:27 | as the representation

0:10:29 | of the whole utterance

0:10:32 | and in the TDNN system we use exactly the same structure as

0:10:36 | in the original x-vector

0:10:39 | model

0:10:41 | and all of the embedding vectors are

0:10:44 | computed as the L2 normalization of the network output

0:10:52 | so our experiments are

0:10:57 | conducted

0:10:57 | on the VoxCeleb 1 corpora

0:11:02 | and the

0:11:04 | in each mini-batch we use sixty four

0:11:08 | speakers and a fixed number of segments per speaker

0:11:11 | this setting is due to the concern of our GPU memory capacity

0:11:15 | and the segment lengths are randomly sampled from one hundred forty to

0:11:20 | one hundred eighty frames

0:11:23 | and although the orthogonality regularization can be applied to all the layers

0:11:29 | in this work we only applied the orthogonality constraints on the weight

0:11:34 | matrix of the speaker embedding extraction layer

0:11:40 | here are the results of the LSTM based systems

0:11:44 | in addition to the

0:11:46 | two regularized systems

0:11:49 | we also trained a system without any

0:11:53 | regularization term during the training stage and this is the baseline

0:11:58 | from the results we can see that both regularization terms improve the system

0:12:03 | performance

0:12:06 | and the SRIP regularization

0:12:09 | outperforms the soft orthogonality regularization

0:12:13 | as well as the baseline with remarkable performance gains

0:12:18 | there are around

0:12:19 | twenty percent improvements in eer

0:12:23 | as well as in mindcf

0:12:27 | and the decreasing schedule

0:12:30 | performs better than the constant schedule for both regularizers

0:12:37 | we also show the det curves for the baseline and the best LSTM based

0:12:43 | system

0:12:44 | trained with the SRIP regularization and the decreasing schedule

0:12:48 | in this figure we can see that the

0:12:51 | orthogonality

0:12:53 | regularization really helps to produce better systems

0:12:59 | and here are the results of the TDNN based systems

0:13:04 | in this case

0:13:06 | the two regularization terms are actually comparable in performance

0:13:12 | and the

0:13:14 | for the soft orthogonality regularization term

0:13:18 | its best configuration is around eight percent better in eer

0:13:23 | and sixteen percent better in mindcf

0:13:25 | than the baseline when trained without the decreasing schedule

0:13:31 | and the SRIP regularization

0:13:34 | is beneficial when trained with the decreasing schedule

0:13:38 | the best SRIP system

0:13:41 | is twelve percent better in eer and

0:13:45 | more than ten percent better in mindcf

0:13:50 | so the SRIP regularization gives

0:13:52 | consistent

0:13:57 | improvements in performance when trained with different scheduling schemes

0:14:04 | here we plot the det curves for the baseline and the two

0:14:11 | TDNN systems trained with the two regularizers and the decreasing schedule

0:14:16 | in this figure

0:14:22 | and to explore the effect of the orthogonality regularization during training we plot

0:14:28 | the validation loss curves

0:14:31 | here is an example

0:14:32 | of the validation loss curves during the training of the LSTM based systems

0:14:38 | just notice that

0:14:39 | the actual number of training epochs is different for

0:14:44 | different systems

0:14:46 | this is because we set the maximum number of training epochs

0:14:50 | to one hundred and stop training if the validation loss does not

0:14:55 | decrease for six consecutive epochs

0:15:01 | from the loss curves we can see that

0:15:05 | all the regularizers accelerate the training process in the early

0:15:09 | training stage

0:15:11 | and attain a smaller validation loss by the end of the training compared to the baseline

0:15:18 | in general the SRIP regularization achieves a lower validation loss

0:15:23 | than

0:15:23 | the soft orthogonality regularization and this finding is consistent with the system performance

0:15:31 | where in general the SRIP regularization is better than the SO regularization

0:15:37 | for both of the regularizers

0:15:40 | training with a constant lambda results in more training

0:15:44 | epochs and also a lower final loss

0:15:48 | but this is different from the findings

0:15:51 | in the

0:15:53 | system performance

0:15:55 | where according to the final system results

0:15:59 | training with the decreasing schedule

0:16:01 | always results in better performance

0:16:05 | so one possible reason is that

0:16:07 | in the final training stage the trained model's parameters are

0:16:12 | more likely near an optimum

0:16:14 | so keeping the strength of the orthogonality constraint

0:16:18 | unchanged

0:16:18 | throughout the training

0:16:20 | will be over-restrictive

0:16:23 | at this stage

0:16:24 | so by decreasing the coefficient we loosen the orthogonality constraint and the model

0:16:30 | parameters have more flexibility in the final stage

0:16:34 | thus leading to a better system performance

0:16:40 | so in conclusion

0:16:41 | we introduced two orthogonality regularizers in training

0:16:47 | end-to-end text independent speaker verification systems

0:16:52 | the first one is the soft orthogonality regularization

0:16:56 | it requires the gram matrix of the weight matrix

0:16:59 | to be close to identity

0:17:01 | and the second one is

0:17:06 | the SRIP regularization

0:17:08 | it minimizes the largest

0:17:10 | singular value of the gram matrix minus identity

0:17:12 | based on the restricted isometry property

0:17:20 | two different neural network architectures

0:17:22 | the LSTM and the TDNN

0:17:24 | were investigated

0:17:26 | we applied different regularization coefficient schedules and investigated their effect

0:17:34 | on the training as well as the evaluation performance

0:17:39 | we find that the spectral restricted isometry property regularization

0:17:45 | performs the best in all the cases and

0:17:49 | achieves in the best case around twenty percent improvement on all the criteria

0:17:57 | both regularizers can be combined into the original training loss and optimized together with

0:18:05 | little computation overhead

0:18:09 | and that's all of

0:18:11 | my presentation, thank you for listening