0:00:15 Hello everyone, and welcome to my presentation. Today I'm going to present our work on the generalization of audio deepfake detection.
0:00:25 This work was done by me together with my colleagues.
0:00:32 First, I'm going to present some background on what audio deepfakes are, on audio deepfake detection, and on the motivation of our work. Then I will introduce the details of our proposed system, which is a deep neural network based audio deepfake detection system. It is trained using a large margin cosine loss and a frequency masking augmentation layer.
0:01:00 Towards the end, I will present the results and the conclusion.
0:01:07 So what is an audio deepfake? An audio deepfake, technically known as a logical access voice spoofing attack, is a manipulated audio clip generated using text-to-speech or voice conversion techniques, including several widely available commercial and open-source tools.
0:01:34 Due to the recent breakthroughs in speech synthesis and voice conversion technologies, especially deep-learning based technologies, systems can now produce very high quality synthesized speech and audio, far beyond what was possible just a few years ago.
0:01:55 Why do we need audio deepfake detection? Automatic speaker verification is widely adopted in voice-based human computer interfaces, such as popular smart home devices.
0:02:10 Several studies have shown that audio deepfakes pose a great risk to modern speaker verification systems, and the tools used to generate audio deepfakes are easily accessible to the public: anyone can download a pre-trained model and use it.
0:02:29 Thus, how to effectively detect these attacks is critical to many speech applications, including automatic speaker verification systems.
0:02:44 The research community has done a lot of work on studying audio deepfake detection. The ASVspoof challenge series started in 2015; it aims to foster research on countermeasures to detect voice spoofing.
0:03:03 In 2015, speech synthesis and voice conversion attacks were mostly based on hidden Markov models and Gaussian mixture models.
0:03:17 Since then, the quality of speech synthesis and voice conversion systems has drastically improved with the use of deep learning. Thus, the ASVspoof 2019 challenge was introduced.
0:03:34 It includes the most recent state-of-the-art text-to-speech and voice conversion techniques.
0:03:42 During the challenge, most researchers focused on investigating different types of low-level features, such as constant-Q cepstral coefficients (CQCC), MFCC, LFCC, and also phase information like modified group delay features.
0:04:04 Also, ensemble methods were commonly used during the challenge.
0:04:14 However, these methods don't perform well on the ASVspoof 2019 dataset.
0:04:21 Based on the challenge results, if we compare the equal error rates on the development set against those on the evaluation set, there are large gaps.
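As a reminder of the metric used throughout this talk: the equal error rate (EER) is the operating point where the false acceptance rate equals the false rejection rate. A minimal NumPy sketch with toy scores and labels (illustrative only, not the challenge's official scoring tool):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Find the threshold where false-acceptance rate ~= false-rejection rate.
    scores: higher = more likely genuine; labels: 1 = genuine, 0 = spoof."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best = (2.0, None)  # (smallest |FAR - FRR| gap, EER at that threshold)
    for t in np.sort(np.unique(scores)):
        far = np.mean(scores[labels == 0] >= t)  # spoof accepted as genuine
        frr = np.mean(scores[labels == 1] < t)   # genuine rejected
        gap = abs(far - frr)
        if gap < best[0]:
            best = (gap, (far + frr) / 2)
    return best[1]

# perfectly separable scores give 0 EER
print(equal_error_rate([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # -> 0.0
```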
0:04:36 So what causes this? This dataset focuses on evaluating systems against unknown spoofing techniques. It contains nineteen different TTS and VC techniques, but only six of them appear in the training and development datasets.
0:05:05 The evaluation dataset contains thirteen techniques in total; eleven of them are unknown, and only two of them are seen in the training set.
0:05:15 That makes the dataset rather challenging, and therefore strong robustness is required of a spoofing detection system on this dataset.
0:05:32 So here is the problem: how do we build a robust audio deepfake detection system that can detect unknown, unseen audio deepfakes?
0:05:43 If feature engineering doesn't work well, can we focus on increasing the generalization ability of the model itself?
0:05:58 Here is our proposed solution. Our proposed audio deepfake detection system contains a deep-learning based feature embedding extractor and a backend classifier that classifies whether a given audio sample is genuine or spoofed.
0:06:18 Our system simply uses linear filter banks as the low-level feature. The input feature map first passes through the frequency masking augmentation layer, which is added right after the input layer, and is then fed into the deep residual network. Instead of using a softmax loss, we use the large margin cosine loss to train the residual network. We use the output of the final fully connected layer as the feature embedding.
0:06:50 Once we have the feature embedding, it is fed into the backend classifier; in this case, the backend classifier is just a shallow neural network with only one hidden layer.
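The backend just described, a single hidden layer on top of the embedding, is simple enough to sketch in NumPy. The layer sizes here are illustrative assumptions, not the exact configuration from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

def shallow_classifier(embedding, w1, b1, w2, b2):
    """One-hidden-layer network: ReLU hidden layer, 2-way softmax output
    over {genuine, spoofed}."""
    h = np.maximum(0.0, embedding @ w1 + b1)  # hidden layer with ReLU
    logits = h @ w2 + b2
    e = np.exp(logits - logits.max())         # numerically stable softmax
    return e / e.sum()

emb_dim, hidden = 256, 128                    # illustrative sizes
w1 = rng.standard_normal((emb_dim, hidden)) * 0.01
b1 = np.zeros(hidden)
w2 = rng.standard_normal((hidden, 2)) * 0.01
b2 = np.zeros(2)

probs = shallow_classifier(rng.standard_normal(emb_dim), w1, b1, w2, b2)
print(probs)  # two probabilities that sum to 1
```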
0:07:14 Now let's talk about the details of the ResNet-based feature embedding extractor. We use a standard ResNet-18 architecture, shown in the table. The differences are that we replace the global max pooling layer with mean and standard deviation pooling after the residual blocks,
0:07:41 and that we use the large margin cosine loss instead of the softmax loss. The feature embedding is extracted from the second fully connected layer.
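The mean-and-standard-deviation pooling that replaces global max pooling is a statistics-pooling step over the time axis; a NumPy sketch (the tensor shape here is an assumption for illustration):

```python
import numpy as np

def stats_pooling(feature_map):
    """Statistics pooling: collapse the time axis into per-channel
    mean and standard deviation, then concatenate them.
    feature_map: (channels, time) array from the last residual block."""
    mean = feature_map.mean(axis=1)
    std = feature_map.std(axis=1)
    return np.concatenate([mean, std])  # fixed-size vector of length 2 * channels

x = np.random.default_rng(1).standard_normal((64, 300))  # 64 channels, 300 frames
pooled = stats_pooling(x)
print(pooled.shape)  # -> (128,)
```

Unlike max pooling, this keeps both the average level and the variability of each channel, and it produces a fixed-size vector regardless of utterance length.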
0:07:58 Why do we want to use the large margin cosine loss? As mentioned before, we want to increase the generalization ability of the model itself.
0:08:11 The large margin cosine loss was originally used for face recognition. The goal of the cosine loss is to maximize the inter-class variance between the genuine and spoofed classes and, at the same time, to minimize the intra-class variance.
0:08:33 Here is the visualization of the feature embeddings learned using the cosine loss, as presented in the original paper. We can see that, compared to softmax, the cosine loss not only separates the different classes, but within each class the features are also clustered together.
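The large margin cosine loss rewrites the softmax logits as scaled cosine similarities between the L2-normalized embedding and per-class weight vectors, and subtracts a margin from the target class. A NumPy sketch for a single sample; the scale and margin values are illustrative assumptions:

```python
import numpy as np

def lmcl(embedding, weights, target, s=30.0, m=0.35):
    """Large margin cosine loss for one sample.
    Logit for each class = s * cos(theta); the target class additionally
    pays a margin m, forcing cos(theta_target) to exceed the others by m."""
    x = embedding / np.linalg.norm(embedding)      # normalize embedding
    w = weights / np.linalg.norm(weights, axis=0)  # normalize class weights
    cos = x @ w                                    # cosine similarity per class
    logits = s * cos
    logits[target] = s * (cos[target] - m)         # apply margin to target class
    # cross-entropy on the margined logits (stable log-softmax)
    log_prob = logits - logits.max() - np.log(np.exp(logits - logits.max()).sum())
    return -log_prob[target]

rng = np.random.default_rng(0)
emb = rng.standard_normal(8)
W = rng.standard_normal((8, 2))                    # 2 classes: genuine, spoofed
print(lmcl(emb, W, target=0))                      # non-negative scalar loss
```

Because the margin lowers the target logit, the loss stays high until the embedding is separated from the other class by at least the margin, which is what encourages the tight clusters shown in the visualization.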
0:09:07 Finally, we add a random frequency masking augmentation layer after the input layer. This is an online augmentation method: during training, for each mini-batch, a random consecutive frequency band is masked by setting its values to zero.
0:09:33 By adding this frequency augmentation layer, we hope to add more noise into the training, which will increase the generalization ability of the model.
0:09:48 During testing, this step is disabled.
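The masking step just described can be sketched as follows; the maximum band width and the feature-map shape are illustrative assumptions:

```python
import numpy as np

def freq_mask(batch, max_width=10, rng=None):
    """Online frequency masking: for one mini-batch, zero out one random
    consecutive band of frequency bins.
    batch: (batch, freq_bins, frames). Used in training only."""
    rng = rng or np.random.default_rng()
    out = batch.copy()
    n_freq = batch.shape[1]
    width = rng.integers(1, max_width + 1)         # random band width
    start = rng.integers(0, n_freq - width + 1)    # random band start
    out[:, start:start + width, :] = 0.0
    return out

x = np.ones((4, 40, 100))                          # 4 utterances, 40 bins, 100 frames
masked = freq_mask(x, rng=np.random.default_rng(0))
print((masked == 0).all(axis=(0, 2)).sum())        # number of fully zeroed bins
```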
0:09:57 In total, we construct three training protocols and three evaluation protocols.
0:10:05 For training protocol T1, we use the original ASVspoof 2019 dataset. For T2, we create a noisy version of the data by using traditional audio augmentation techniques.
0:10:22 Two types of distortion were used for augmentation: reverberation and background noise. The room impulse responses for the reverberation were chosen from publicly available room impulse response datasets, and we chose several types of background noise for augmentation, including music, television, and babble noise.
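Augmentation of this kind is typically implemented as a convolution of the clean waveform with a room impulse response, followed by adding background noise scaled to a target SNR. A minimal sketch with synthetic signals (assumed values; not the exact pipeline from the talk):

```python
import numpy as np

def augment(clean, rir, noise, snr_db=15.0):
    """Simulate a noisy, reverberant recording: convolve with a room
    impulse response, then add background noise at a target SNR."""
    reverbed = np.convolve(clean, rir)[: len(clean)]
    sig_pow = np.mean(reverbed ** 2)
    noise_pow = np.mean(noise[: len(reverbed)] ** 2)
    # scale noise so that signal power / noise power = 10^(snr_db / 10)
    scale = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    return reverbed + scale * noise[: len(reverbed)]

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                               # 1 s of toy audio
rir = np.exp(-np.arange(800) / 100.0) * rng.standard_normal(800) # decaying toy RIR
noise = rng.standard_normal(16000)
noisy = augment(clean, rir, noise)
print(noisy.shape)  # -> (16000,)
```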
0:10:49 We also wanted to investigate the system performance under call-center scenarios. Thus we replayed the original ASVspoof 2019 dataset through telephony services to create phone-channel effects, and we used that for protocol T3.
0:11:13 Similarly for the evaluation datasets: E1 is the original ASVspoof 2019 evaluation set, E2 is its noisy version, and E3 is the evaluation set logically replayed through the telephony services.
0:11:33 Now let me present the results. These are the results on the original ASVspoof 2019 evaluation set.
0:11:44 Our baseline system is the standard ResNet-18, and it achieves around a 4 percent equal error rate on the evaluation set.
0:12:01 You can see that by adding the large margin cosine loss, the equal error rate is reduced to 3.49 percent.
0:12:09 And finally, when we add both the cosine loss and the frequency masking layer, the equal error rate can be reduced to 1.81 percent.
0:12:27 Here is the comparison of the systems trained using the three different protocols and evaluated against the different benchmarks. E1 is the original benchmark, the original ASVspoof 2019 dataset; E2 is the noisy, more general dataset; and E3 is the dataset logically replayed through the telephony services.
0:12:55 As you can see, by training with the augmented data, i.e. the T2 protocol, we can achieve significant improvements on the original dataset as well as on the larger augmented evaluation set.
0:13:20 This is the detailed equal error rate for the different spoofing techniques. Our proposed system outperforms the baseline system on almost all types of speech spoofing techniques.
0:13:38 In conclusion, we have achieved state-of-the-art performance on the ASVspoof 2019 evaluation set without using any ensemble methods.
0:13:55 We also showed that traditional data augmentation techniques may be helpful.
0:14:06 We were able to increase the generalization ability by using frequency masking augmentation and the large margin cosine loss, and we showed that increasing the generalization ability of the model itself is very useful.
0:14:28 Finally, we evaluated the system performance on the noisy version of the dataset and in call-center scenarios.
0:14:38 That's all for my presentation. Thank you for listening.