0:00:14 Hello everyone, I'm Jahangir Alam from the Computer Research Institute of Montreal.
0:00:20 Today I'm going to present our work on the
0:00:24 analysis of the ABC submission to
0:00:26 the NIST SRE 2019 CMN and VAST challenges.
0:00:31 In this work I'm going to provide an overview of the ABC submission to
0:00:38 NIST SRE 2019 by
0:00:41 Brno University of Technology, the Computer Research Institute of Montreal,
0:00:45 Phonexia, and Omilia.
0:00:55 This is the outline of my talk.
0:01:04 I'm going to start with an introduction of the tasks and data,
0:01:07 and talk about speaker verification on conversational telephone speech,
0:01:14 the CMN2 task.
0:01:16 Then I'll talk about multimedia speaker verification, the VAST task,
0:01:22 employing audio and face biometric traits.
0:01:25 Finally, I'm going to draw my conclusions.
0:01:35 In the 2019 edition of NIST SRE
0:01:38 there are two tasks.
0:01:41 One task is
0:01:45 speaker verification on conversational telephone speech, where there is a domain mismatch between
0:01:51 the train and test settings, mainly due to
0:01:55 a difference in languages:
0:02:00 the training data are mostly in English, whereas the test data are in Arabic.
0:02:06 The second task is multimedia speaker recognition over VAST,
0:02:11 Video Annotation for Speech Technology data; the main challenge here is the multi-speaker
0:02:15 test recordings.
0:02:17 There are two sub-tasks in the VAST task.
0:02:22 One is the verification of a speaker on audio only,
0:02:27 using the audio biometric trait only, whereas the other is
0:02:29 audiovisual verification, i.e., verification of a speaker employing both audio
0:02:34 and face biometric traits.
0:02:36 In this work we present the systems developed by the ABC team
0:02:43 to tackle the challenges introduced in both
0:02:46 the CMN2 and VAST tasks of NIST SRE 2019,
0:02:51 and we also provide some analyses of the results.
0:02:59 Data preparation: the original data used for training the speaker-discriminant neural networks are NIST SRE
0:03:07 2004 to 2010, Fisher English,
0:03:11 all of Switchboard, and VoxCeleb 1 and 2.
0:03:15 Augmented data were created by adding noise from MUSAN and
0:03:20 room impulse responses from OpenSLR, and also using codec compression.
0:03:27 From the augmented data pool,
0:03:32 500k recordings were selected as augmented data
0:03:37 and added to the original data,
0:03:39 both to increase the amount and the diversity of the training data.
0:03:44 After filtering based on a minimum speech duration,
0:03:49 in this case five seconds after voice activity detection,
0:03:52 and a minimum number of utterances per speaker, in this case five utterances per speaker,
0:04:00 there are approximately
0:04:02 seven thousand speakers in the training data.
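The filtering step just described (minimum speech duration after VAD, minimum utterances per speaker) can be sketched as follows; this is a toy sketch, and the function name, data layout, and the thresholds as defaults are illustrative, not taken from the actual data-preparation scripts.

```python
# Sketch of the training-data filtering step: keep only utterances with at
# least 5 s of speech after VAD, then keep only speakers that still have at
# least 5 surviving utterances. Data layout is hypothetical.

def filter_training_data(utts, min_dur=5.0, min_utts=5):
    """utts: list of (speaker_id, utt_id, speech_duration_seconds)."""
    by_speaker = {}
    for spk, utt, dur in utts:
        if dur >= min_dur:                      # minimum speech duration
            by_speaker.setdefault(spk, []).append(utt)
    # minimum number of utterances per speaker
    return {spk: u for spk, u in by_speaker.items() if len(u) >= min_utts}

data = [("s1", f"u{i}", 6.0) for i in range(5)] + [("s2", "u0", 2.0)]
kept = filter_training_data(data)
print(sorted(kept))                              # only s1 survives
```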
0:04:06 The data used for backend training are NIST SRE 2004
0:04:10 to 2010, having approximately
0:04:14 sixty-six thousand recordings.
0:04:17 The adaptation set is based on
0:04:22 the SRE 2018 data: sixty percent
0:04:26 of the SRE 18 evaluation set.
0:04:29 In total there are
0:04:31 a thousand recordings from one hundred thirty-seven speakers.
0:04:36 Part of the adaptation set and the SRE 18
0:04:43 unlabeled data were used for score normalization,
0:04:47 and as the development
0:04:49 test set we used the remaining
0:04:53 forty percent of the SRE 18 evaluation data.
0:04:59 Feature extraction:
0:05:02 as local features we use forty-dimensional filterbank or twenty-two dimensional MFCC features,
0:05:09 extracted with twenty-five millisecond windows and a frame shift of ten milliseconds.
0:05:15 For feature normalization, short-term cepstral mean normalization was used with a sliding window of three seconds,
0:05:22 and non-speech frames were removed using an energy-based voice activity detector.
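Sliding-window short-term cepstral mean normalization can be sketched as below; this is a minimal illustration, assuming a window length expressed in frames (about 300 frames for a 3 s window at a 10 ms frame shift), not the actual feature-processing code.

```python
import numpy as np

# Toy sketch of short-term CMN: each frame has the mean of a local window
# around it subtracted, so slowly varying channel offsets are removed.

def short_term_cmn(feats, win=300):
    """feats: (num_frames, num_coeffs). Subtract a per-frame local mean."""
    n = len(feats)
    out = np.empty_like(feats)
    for t in range(n):
        lo = max(0, t - win // 2)
        hi = min(n, t + win // 2 + 1)
        out[t] = feats[t] - feats[lo:hi].mean(axis=0)
    return out

x = np.random.randn(500, 40) + 7.0   # fake filterbank with a channel offset
y = short_term_cmn(x)
print(y.shape)
```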
0:05:28 The general pipeline that has been adopted for
0:05:31 speaker verification on the CMN2 task is as follows.
0:05:38 The current trend in speaker verification is to use deep
0:05:41 speaker embeddings
0:05:43 with a PLDA backend,
0:05:46 where the speaker embeddings are extracted using a speaker-discriminant neural network,
0:05:52 which is normally trained to discriminate among a set of training speakers,
0:05:57 and the network is normally supervised by some variant of a classification loss such as
0:06:03 softmax, or a metric-learning loss function.
0:06:08 In this
0:06:09 case, for the CMN2 task, we use four speaker-discriminant neural networks trained with four
0:06:14 different architectures.
0:06:16 As backend we use either Gaussian PLDA or heavy-tailed PLDA.
0:06:22 Evaluation embeddings are centered using the mean of the adaptation set,
0:06:28 whereas the backend training embeddings are centered using the mean of
0:06:34 the training set itself.
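The embedding pre-processing just described can be sketched as follows. This is a hedged toy sketch: the centering with set-dependent means is from the talk, while the length normalization is a common companion step before PLDA that is assumed here, and all data are synthetic.

```python
import numpy as np

# Evaluation embeddings are centered with the adaptation-set mean, backend
# training embeddings with the training-set mean; each embedding is then
# length-normalized (an assumed, standard pre-PLDA step).

def center_and_length_norm(emb, center_mean):
    e = emb - center_mean
    return e / np.linalg.norm(e, axis=-1, keepdims=True)

train = np.random.randn(1000, 512)
adapt = np.random.randn(100, 512) + 0.5      # shifted "in-domain" set
eval_emb = np.random.randn(10, 512) + 0.5

train_c = center_and_length_norm(train, train.mean(axis=0))
eval_c = center_and_length_norm(eval_emb, adapt.mean(axis=0))
print(eval_c.shape)
```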
0:06:38 Training embeddings are adapted to the target domain using
0:06:41 feature distribution adaptation, and finally we use unsupervised PLDA adaptation
0:06:49 of the PLDA model, which is
0:06:52 trained on the unadapted, out-of-domain speaker embeddings.
0:06:56 Score normalization
0:06:58 is also used, namely adaptive s-norm.
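Adaptive s-norm can be sketched as below: each raw trial score is normalized with the statistics of the top-k cohort scores of the enrollment and test sides, and the two normalized scores are averaged. A minimal sketch under assumed conventions; the cohort data and the choice of k are illustrative.

```python
import numpy as np

# Toy AS-norm: the "adaptive" part is that only the top-k cohort scores of
# each side are used to compute the normalization mean and std.

def as_norm(raw, enroll_cohort_scores, test_cohort_scores, k=200):
    """raw: scalar trial score; *_cohort_scores: 1-D arrays vs a cohort."""
    def stats(scores):
        top = np.sort(scores)[-k:]        # top-k most competitive cohort
        return top.mean(), top.std()
    mu_e, sd_e = stats(enroll_cohort_scores)
    mu_t, sd_t = stats(test_cohort_scores)
    return 0.5 * ((raw - mu_e) / sd_e + (raw - mu_t) / sd_t)

rng = np.random.default_rng(0)
z = as_norm(3.0, rng.normal(0, 1, 1000), rng.normal(0, 1, 1000))
print(z)
```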
0:07:05 Now let me go over each individual system for the CMN2 task.
0:07:14 System one uses a standard fifty-layer ResNet architecture for training the
0:07:19 speaker-discriminant neural network on filterbank features;
0:07:25 Gaussian PLDA is used for the scoring.
0:07:29 Two-dimensional convolution is used, since we are using filterbank features,
0:07:35 and to obtain a global representation from the local ones,
0:07:40 statistics pooling is used.
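Statistics pooling, as used here to turn frame-level activations into a fixed-size utterance-level representation, can be sketched as follows; the dimensionalities are illustrative.

```python
import numpy as np

# Statistics pooling: a variable-length sequence of frame-level activations
# is mapped to one fixed-size vector by concatenating the per-dimension
# mean and standard deviation over time.

def statistics_pooling(frames):
    """frames: (num_frames, dim) -> (2*dim,) pooled vector."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

h = np.random.randn(317, 1500)        # e.g. 1500-d frame-level activations
v = statistics_pooling(h)
print(v.shape)                        # (3000,)
```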
0:07:44 In this system,
0:07:45 for training the PLDA model,
0:07:47 additional training data are used from the SRE 2006 and 2010 evaluation data,
0:07:52 which contain about ten thousand recordings.
0:08:01 Post-processing is applied to the extracted embeddings as described previously in the
0:08:06 general pipeline.
0:08:11 System two employs a factorized TDNN (F-TDNN) architecture for training the speaker-discriminant
0:08:17 neural network.
0:08:20 The Kaldi SRE16 recipe was used in this case, and the network was trained
0:08:25 for six epochs.
0:08:28 As backend, heavy-tailed PLDA was used, following the general pipeline
0:08:32 that has been mentioned before.
0:08:41 For system three, the TDNN architecture selected to train the speaker-discriminant neural network is the
0:08:48 extended TDNN (E-TDNN) architecture, with a few residual connections between its layers,
0:08:53 and the network
0:08:55 was trained for two epochs.
0:09:00 In this case the extracted embeddings are
0:09:02 768-dimensional instead of 512,
0:09:07 and the embeddings are additionally denoised.
0:09:12 One-dimensional convolution is used over the MFCC features, and then statistics pooling is used
0:09:21 to obtain a global utterance-level representation.
0:09:26 Finally, as backend, heavy-tailed PLDA is used, following the general
0:09:30 pipeline that has been mentioned before.
0:09:37 Finally, system four: a similar F-TDNN architecture as in system two was used for
0:09:44 training the speaker-discriminant neural network,
0:09:47 but this network was trained only on the SRE
0:09:50 2004 to 2010 English data,
0:09:54 and MFCC features are used as front-end features.
0:09:58 A standard domain-adversarial neural
0:10:02 network is used on top of the embeddings,
0:10:06 mainly to discriminate between the source and target domains:
0:10:10 the source domain class is English, and the target domain is Arabic.
0:10:15 The extracted embeddings in this case are 768-dimensional,
0:10:20 and as backend heavy-tailed PLDA is used, following the general pipeline mentioned before.
0:10:31 Calibration and fusion for the CMN2 task: calibration and fusion are trained with
0:10:37 logistic regression on the development set,
0:10:41 and consistent performance was observed across the progress and eval sets,
0:10:47 which indicates that
0:10:50 we achieved almost perfect calibration.
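Logistic-regression score calibration and fusion can be sketched as below: per-system weights and an offset are learned on a labeled development set so that the fused score behaves like a calibrated log-likelihood ratio. A minimal sketch with plain gradient descent and synthetic scores, not the actual calibration tooling.

```python
import numpy as np

# Toy logistic-regression fusion: scores from several systems plus a bias
# are combined with weights trained by cross-entropy gradient descent.

def train_fusion(scores, labels, lr=0.5, iters=2000):
    """scores: (n_trials, n_systems); labels: 1 target / 0 non-target."""
    X = np.hstack([scores, np.ones((len(scores), 1))])   # bias column
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))                 # sigmoid
        w -= lr * X.T @ (p - labels) / len(labels)       # CE gradient
    return w

rng = np.random.default_rng(1)
lab = rng.integers(0, 2, 400)
# two toy systems whose scores are shifted upward for target trials
sc = np.stack([lab + rng.normal(0, s, 400) for s in (0.5, 1.0)], axis=1)
w = train_fusion(sc, lab)
fused = np.hstack([sc, np.ones((400, 1))]) @ w
acc = ((fused > 0) == (lab == 1)).mean()
print(acc)
```

Deciding with `fused > 0` corresponds to thresholding the calibrated log-likelihood ratio at zero.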
0:10:56 Table 1 presents the results of the individual and fused systems on the development
0:11:02 and eval sets for the CMN2
0:11:05 task.
0:11:06 The single best system we found here was the x-vector TDNN with the heavy-tailed
0:11:12 PLDA combination.
0:11:14 The denoising did not help on its own, but when fused with the other systems
0:11:21 it resulted
0:11:22 in a nice improvement in performance.
0:11:26 Our fused system provided the best performance in this case.
0:11:36 In Table 2 we present and compare the performance using different backends with the
0:11:43 x-vector embeddings from
0:11:44 the TDNN architectures.
0:11:47 For the CMN2 task, the heavy-tailed PLDA backends are clearly the winners;
0:11:52 this is perhaps due to the domain mismatch
0:11:56 between the train and test settings.
0:11:59 In Table 3
0:12:01 we show the performance when various post-processing was adopted
0:12:07 for the extracted speaker embeddings.
0:12:10 From this we can see that when the extracted embeddings were post-processed with mean
0:12:17 normalization, feature distribution adaptation, unsupervised PLDA adaptation, and as-norm in combination,
0:12:24 this led to the
0:12:26 best performance.
0:12:34 Finally, the VAST task.
0:12:38 Data preparation:
0:12:40 the original data used in this case for training the
0:12:45 speaker-discriminant neural network are mainly the VoxCeleb 2 development data,
0:12:49 which contain about
0:12:52 six thousand speakers.
0:12:54 But for the TDNN system, VoxCeleb 1 and 2
0:12:59 and LibriSpeech
0:13:01 combined,
0:13:04 which consist of around
0:13:07 eleven thousand speakers, are used for training.
0:13:13 Augmented data are created
0:13:17 by using noises from MUSAN and room impulse responses from OpenSLR,
0:13:23 and recordings from this augmented data pool
0:13:27 were selected to add to the original pool,
0:13:33 for increasing the amount and diversity of the training data.
0:13:37 After filtering based on a minimum speech duration, in this case four seconds after
0:13:44 voice activity detection,
0:13:46 and a minimum number of utterances per speaker, in this case eight utterances per speaker,
0:13:52 there are approximately six thousand speakers in the training data.
0:13:56 The data used for backend training are one hundred
0:14:00 forty-five thousand utterances from the original training data.
0:14:04 The adaptation set is based on thirty-seven utterances from the SRE 18 VAST development data.
0:14:11 A subset of the PLDA training data is used for
0:14:16 score normalization using s-norm
0:14:19 on the development set.
0:14:22 The test set chosen for the audio-only sub-task is the SRE 18 VAST eval set,
0:14:29 whereas
0:14:31 for the audiovisual task the development test set is the SRE 19 audiovisual development
0:14:37 set.
0:14:44 Feature extraction:
0:14:47 for the VAST task, as local features, forty-dimensional filterbank
0:14:53 or twenty-three dimensional PLP features are extracted with
0:14:57 a twenty-five millisecond window and a frame shift of ten milliseconds.
0:15:03 For feature normalization we use short-term cepstral mean normalization with a sliding window of
0:15:09 two seconds.
0:15:12 Non-speech frames are removed using an energy-based voice activity detector.
0:15:17 And for the VAST
0:15:20 audio-only task, the general pipeline is as follows:
0:15:25 we use speaker-discriminant neural networks trained with three different architectures in order to
0:15:32 extract the
0:15:33 speaker embeddings.
0:15:36 As backend
0:15:38 we use Gaussian PLDA or cosine scoring.
0:15:43 Enrollment embeddings are centered using the mean of the backend training set, and
0:15:49 training embeddings are adapted to the
0:15:52 target domain using feature distribution adaptation.
0:15:56 Diarization is applied on the test set, and the final score is the maximum over
0:16:02 the diarized speakers;
0:16:06 the score is then normalized using s-norm.
0:16:15 Individual systems and ensemble for the VAST audio-only sub-task: we have three individual systems for this.
0:16:23 System one uses the standard
0:16:28 ResNet architecture, which is first pretrained using the softmax loss
0:16:34 and then later fine-tuned using the
0:16:39 additive angular margin loss function.
0:16:42 In this case filterbank is used as local features, and as backend Gaussian PLDA
0:16:48 and cosine scoring are used,
0:16:51 and for post-processing we follow the general pipeline that
0:16:56 was mentioned before.
0:16:59 System two in this case uses a
0:17:02 TDNN architecture for training the speaker-discriminant neural network,
0:17:07 and this network is trained using the
0:17:11 Kaldi SRE16 recipe on VoxCeleb 1 and 2
0:17:16 and LibriSpeech,
0:17:17 and it was trained for six epochs.
0:17:22 As backend, a Gaussian PLDA model is used, following the general pipeline that has
0:17:27 been mentioned before.
0:17:33 System three
0:17:36 is trained following the Kaldi
0:17:39 x-vector system recipe on the SRE
0:17:42 2004 to 2010 and all Switchboard data, for two epochs.
0:17:48 As front-end
0:17:50 features, PLP is used.
0:17:54 Augmented SRE 2004 to 2010
0:17:57 data were used for training the backend model.
0:18:03 Correlation-alignment-based domain adaptation is used for adapting the source domain to the target domain in
0:18:08 this case.
0:18:09 As backend, Gaussian PLDA is used, and for system three
0:18:15 no score normalization was used.
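Correlation alignment (CORAL), as used above for adapting the backend training data, can be sketched as follows: out-of-domain (source) embeddings are whitened with their own covariance and re-colored with the in-domain (target) covariance, so the adapted data matches the target's second-order statistics. A minimal numpy sketch on toy data, not the actual adaptation code.

```python
import numpy as np

# Toy CORAL: whiten source with Cs^{-1/2}, re-color with Ct^{1/2},
# and shift to the target mean.

def coral(source, target, eps=1e-6):
    def sqrtm(c):                              # symmetric matrix square root
        vals, vecs = np.linalg.eigh(c)
        return (vecs * np.sqrt(np.maximum(vals, 0))) @ vecs.T
    cs = np.cov(source, rowvar=False) + eps * np.eye(source.shape[1])
    ct = np.cov(target, rowvar=False) + eps * np.eye(target.shape[1])
    a = np.linalg.inv(sqrtm(cs)) @ sqrtm(ct)   # whiten, then re-color
    return (source - source.mean(0)) @ a + target.mean(0)

src = np.random.randn(2000, 8) * 3.0
tgt = np.random.randn(500, 8) @ np.diag(np.arange(1.0, 9.0)) + 1.0
adapted = coral(src, tgt)
print(adapted.shape)
```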
0:18:20 Since the test data contain multi-speaker recordings, we adopted speaker diarization to
0:18:25 obtain the number of speakers and then partition the speech segments according
0:18:30 to speaker identity.
0:18:31 For each
0:18:32 test utterance we extract an x-vector for
0:18:36 every two hundred fifty milliseconds;
0:18:39 then agglomerative hierarchical clustering is used to cluster the embeddings into
0:18:45 one, two,
0:18:46 three, or four speaker clusters,
0:18:49 and an embedding is obtained per test speaker from the x-vectors of each cluster.
0:18:54 The enrollment embedding is scored against all test embeddings, and finally the score is
0:19:00 the maximum obtained
0:19:02 score.
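The multi-speaker test handling just described can be sketched as below: short segment embeddings are grouped with a tiny centroid-linkage agglomerative clustering, each cluster is averaged into a per-speaker test embedding, and the trial score is the maximum cosine score over clusters. All names, the clustering details, and the toy 2-D "x-vectors" are illustrative assumptions.

```python
import numpy as np

def ahc(embs, n_clusters):
    """Tiny centroid-linkage agglomerative clustering to n_clusters groups."""
    clusters = [[i] for i in range(len(embs))]
    while len(clusters) > n_clusters:
        best, pair = np.inf, None
        for i in range(len(clusters)):          # find the closest pair
            for j in range(i + 1, len(clusters)):
                d = np.linalg.norm(embs[clusters[i]].mean(0)
                                   - embs[clusters[j]].mean(0))
                if d < best:
                    best, pair = d, (i, j)
        i, j = pair
        clusters[i] += clusters.pop(j)          # merge them
    return clusters

def score_multispeaker(enroll, seg_embs, n_clusters=2):
    scores = []
    for idx in ahc(seg_embs, n_clusters):
        spk = seg_embs[idx].mean(axis=0)        # per-cluster average
        scores.append(spk @ enroll
                      / (np.linalg.norm(spk) * np.linalg.norm(enroll)))
    return max(scores)                          # max over test speakers

enroll = np.array([1.0, 0.0])
segs = np.vstack([np.tile([1.0, 0.1], (5, 1)),   # segments near enroll
                  np.tile([-1.0, 0.2], (5, 1))]) # a second speaker
print(round(score_multispeaker(enroll, segs), 3))  # prints 0.995
```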
0:19:10 Let us now move on to the audiovisual sub-task, starting with
0:19:16 the visual, i.e., face,
0:19:21 systems. Face system one is a pretrained
0:19:26 squeeze-and-excitation version of ResNet-50,
0:19:29 which is trained on the VGGFace2 dataset,
0:19:33 and this
0:19:34 pretrained network is used to extract face embeddings
0:19:39 for the enrollment data, based on the provided frame indices and face
0:19:45 bounding boxes.
0:19:49 The face regions are cropped
0:19:51 and normalized before passing to the pretrained model for embedding extraction.
0:19:58 A speaker is represented by averaging the enrollment embeddings.
0:20:02 On the test side, the Single Shot Scale-invariant Face Detector (S3FD) tool is used to detect
0:20:07 one face
0:20:08 per second
0:20:10 in the test data.
0:20:12 For scoring, cosine similarity is computed between the enrollment and test embeddings,
0:20:17 and the maximum score is selected.
0:20:20 No score normalization is applied for any of the visual systems.
0:20:30 System two is similar to visual-only system one: system two also uses
0:20:37 a pretrained
0:20:39 squeeze-and-excitation ResNet trained on the
0:20:43 VGGFace2 dataset to extract face embeddings.
0:20:46 But for this system, at each frame, multiple bounding boxes are extracted using
0:20:52 MTCNN.
0:20:54 Kalman filtering is applied to track the extracted bounding boxes from frame to frame.
0:21:00 The Chinese whispers algorithm is applied for clustering, and this algorithm does not
0:21:06 use any prior information about the number of clusters.
0:21:10 For enrollment, a speaker is represented by averaging the embeddings.
0:21:14 For the scoring in this system, similar to system one, cosine similarity between embeddings is used, and
0:21:20 the maximum score is selected.
0:21:29 Calibration and fusion for the VAST task:
0:21:33 calibration and fusion are trained via logistic regression on the development test sets.
0:21:38 The SRE 18 VAST eval set was used for calibration and fusion for the audio-only systems,
0:21:45 and the SRE 19
0:21:47 audiovisual development set was used for calibration and fusion for the audiovisual systems.
0:21:56 Performance evaluation:
0:21:58 in Table 4 we compare different backends on top of the ResNet with
0:22:03 additive-angular-margin
0:22:05 softmax architecture.
0:22:07 We can see from here that
0:22:10 adaptation and score normalization are found helpful.
0:22:15 Cosine scoring outperformed the PLDA backend in the VAST audio-only task;
0:22:20 perhaps this is due to the fact that there is not much
0:22:25 domain shift between the train and test settings in this case.
0:22:33 In Table 5 we show the influence of using diarization on the multi-speaker test
0:22:38 recordings for the VAST audio-only task.
0:22:41 We can see from here that diarization helps to boost performance.
0:22:50 In this table we present the audio-only
0:22:53 and visual-only
0:22:55 single and fused systems, and the audiovisual fused systems'
0:23:00 performance on the development and eval test sets.
0:23:04 We can see from here that fusion helps to improve performance.
0:23:08 The performance of the visual-only systems is not that
0:23:12 good, but
0:23:14 when the visual modality is fused with the audio modality,
0:23:18 a huge improvement in performance
0:23:22 is achieved over the audio-only systems.
0:23:29 Finally, the conclusions:
0:23:32 adaptation of the source domain to the target domain played a vital role for both
0:23:38 the CMN2 and VAST tasks, using either
0:23:42 fine-tuning of the speaker-discriminant neural network to the target domain,
0:23:47 or adaptation such as correlation alignment or feature distribution adaptation, or
0:23:53 domain-adversarial adaptation using a standard DANN.
0:23:57 Diarization helped to boost performance in the
0:24:02 multi-speaker
0:24:03 test recording scenario.
0:24:06 Simple score-level fusion of the audio and face biometrics
0:24:10 provided significant
0:24:12 performance improvement over the audio-only systems,
0:24:15 which indicates that there exists complementarity between the audio and visual modalities.
0:24:25 Thank you very much for your attention.