Hello everyone, welcome to my presentation. Today I'm going to present our work titled

Generalization of Audio Deepfake Detection

This work was done by me and my colleagues.

So first,

I'm going to present some background on what audio deepfakes are and audio deepfake detection,

and the motivation of our work. Then I will introduce the details of our proposed system,

which is a deep neural network based audio deepfake detection system.

It is trained using the large margin cosine loss and a frequency masking augmentation layer.

Afterwards, we will present the results and conclusions.

So what is an audio deepfake? Audio deepfakes, technically known as logical

access voice spoofing techniques,

are manipulated audio content generated using text-to-speech and voice conversion

techniques.

So, due to the recent breakthroughs in speech synthesis and voice conversion technologies,

especially deep-learning based technologies, it is now possible to produce synthesized speech of very high quality,

far beyond the traditional systems of just a few years ago.

Why do we need audio deepfake detection? Automatic speaker verification

has been widely adopted in voice-based human-computer interfaces,

which are now very popular.

Several studies have shown that audio deepfakes pose a great risk to the

models in speaker verification systems,

and the tools used to generate audio deepfakes are easily accessible to the public:

anyone can download a pretrained model and use it.

Thus, effectively detecting these attacks

is critical to many speech applications, including automatic speaker verification systems.

So recently the research community has done a lot of work on studying audio deepfake

detection.

The ASVspoof challenge series started in 2015. It aims to foster

research on countermeasures to detect voice spoofing.

In 2015,

the speech synthesis and voice conversion attacks were mostly based on hidden

Markov models and Gaussian mixture models.

Since then,

the quality of speech synthesis and voice conversion systems has drastically improved with the

use of deep learning.

Thus the ASVspoof 2019

challenge was introduced.

It includes the most recent state-of-the-art text-to-speech and voice conversion techniques.

During the challenge, most researchers focused on investigating different types of

low-level features,

such as constant-Q cepstral coefficients,

MFCCs, LFCCs, and also phase information like modified group delay features.

Ensemble methods were also commonly used during the challenge.

However, these methods don't perform well on the ASVspoof 2019 dataset.

Based on the results, the t-DCF and the equal error

rate

on the development dataset versus the evaluation dataset

show large gaps.

So what is the cause?

This dataset focuses on evaluating the systems against unknown

spoofing techniques.

It contains seventeen different TTS and VC techniques, but only

six of them are in the training dataset.

It includes eleven techniques

that are absent from the training and development datasets.

The evaluation dataset contains thirteen techniques in total: eleven of them are unknown,

and only two of them are seen in the training set.

This makes

the dataset rather challenging, and

therefore,

strong robustness is required for a spoofing detection system on this dataset.

So now here's the problem: how do we build a robust audio

deepfake detection system that can detect unknown audio deepfakes?

If feature engineering doesn't work well,

can we instead focus on increasing the generalization ability of the model itself?

So here's our proposed solution. Our proposed audio deepfake detection system contains

a deep-learning based feature embedding extractor

and a backend classifier to classify whether a given audio sample is genuine

or spoofed.

Our system simply uses linear filter banks as the low-level feature.
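The exact filter bank configuration isn't given in the slide, but linearly spaced triangular filters (as opposed to mel-spaced ones) can be sketched roughly as follows; the number of filters and the FFT size here are placeholders, not the system's actual settings:

```python
import numpy as np

def linear_filterbank(num_filters, n_fft):
    """Triangular filters spaced linearly (not on the mel scale) over the
    FFT bins 0..n_fft//2, returned as a (num_filters, n_fft//2 + 1) matrix."""
    edges = np.linspace(0, n_fft // 2, num_filters + 2).astype(int)
    fb = np.zeros((num_filters, n_fft // 2 + 1))
    for i in range(num_filters):
        lo, center, hi = edges[i], edges[i + 1], edges[i + 2]
        fb[i, lo:center + 1] = np.linspace(0.0, 1.0, center - lo + 1)  # rising edge
        fb[i, center:hi + 1] = np.linspace(1.0, 0.0, hi - center + 1)  # falling edge
    return fb

# Log filter bank energies for one frame's power spectrum would then be
# something like: np.log(power_spectrum @ linear_filterbank(40, 512).T + 1e-10)
```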

So the feature maps first pass through the frequency augmentation layer,

which we will describe later.

They are then fed into the deep residual network.

Instead of using the softmax loss, we use the large margin cosine loss to train

the residual network.

We use the output of the final fully connected layer as the feature

embedding. Once we have the feature embedding, it is then fed into a backend classifier.

In this case, the backend classifier is just a shallow neural

network with only one hidden layer.
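As a rough illustration of that backend, a one-hidden-layer scorer over the utterance embedding could look like this; the layer sizes, activation, and weights below are placeholders (in the real system they would be learned), not the paper's actual configuration:

```python
import numpy as np

def backend_score(embedding, W1, b1, W2, b2):
    """Shallow one-hidden-layer network mapping an utterance embedding to
    genuine-vs-spoofed class probabilities. Weights are placeholders here;
    they would be trained on embeddings from the ResNet extractor."""
    hidden = np.maximum(0.0, embedding @ W1 + b1)   # ReLU hidden layer
    logits = hidden @ W2 + b2                       # two output classes
    exp = np.exp(logits - logits.max())             # numerically stable softmax
    return exp / exp.sum()
```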

Now let's talk about the details of the ResNet-based embedding extractor.

We use the standard ResNet-18 architecture, as shown in the table. The differences are that we

replace the

global max pooling layer with mean and standard deviation pooling after the residual blocks,

and that we use the large margin cosine loss instead of the softmax loss.

The feature embedding is extracted from the second fully connected layer.
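The mean and standard deviation pooling mentioned above can be sketched in a few lines; this assumes a (channels, frequency, time) feature map layout, which is an illustrative choice rather than the paper's stated one:

```python
import numpy as np

def mean_std_pooling(feature_map):
    """Replace global max pooling: collapse a (channels, freq, time)
    residual-block output into a fixed-length vector by concatenating the
    per-channel mean and standard deviation over all freq/time positions."""
    flat = feature_map.reshape(feature_map.shape[0], -1)  # (channels, freq*time)
    return np.concatenate([flat.mean(axis=1), flat.std(axis=1)])
```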

So,

why do we want to use the large margin cosine loss? As mentioned above,

we want to increase the

generalization ability of the model itself.

The large margin cosine loss

was originally used for face recognition.

The goal of the

cosine loss is to maximize

the inter-class variance between the genuine and spoofed classes

and, at the same time, minimize the intra-class variance.
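The loss can be sketched as follows, in the CosFace style: logits are scaled cosine similarities between the normalized embedding and normalized class weight vectors, with a margin subtracted from the target class's cosine. The scale s and margin m below are typical values from the face-recognition literature, not necessarily the ones used in this system:

```python
import numpy as np

def large_margin_cosine_loss(embeddings, class_weights, labels, s=30.0, m=0.35):
    """CosFace-style large margin cosine loss (batch mean).
    embeddings: (batch, dim); class_weights: (dim, num_classes)."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=0, keepdims=True)
    cosine = e @ w                               # (batch, num_classes)
    rows = np.arange(len(labels))
    cosine[rows, labels] -= m                    # subtract margin from target class
    logits = s * cosine
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(probs[rows, labels]).mean()   # cross-entropy on margined logits
```

With m = 0 this reduces to an ordinary cosine-softmax loss; the margin makes the training objective strictly harder, which pushes same-class embeddings closer together.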

Here is a visualization of the learned feature embeddings using the cosine loss,

as presented in the original paper.

We can see that,

compared to softmax,

the cosine loss not only

separates the different classes, but also, within each class, the features are clustered together.

Finally, we add a random frequency masking augmentation layer after the input layer.

This is an online augmentation method: during training, for each mini-batch, a

random consecutive frequency band is masked

by setting its values to zero.

By adding this frequency augmentation layer, we hope to add more noise into the

training, which will increase the generalization ability of

the model.

During testing, this

layer is skipped.
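The masking step described above can be sketched as follows; the maximum band width is a placeholder hyperparameter, not the value used in the actual system:

```python
import numpy as np

def freq_mask(batch, rng, max_width=8):
    """Online frequency masking: zero out one random consecutive band of
    frequency bins for an entire mini-batch of (batch, freq, time) features.
    Applied during training only; skipped at test time."""
    width = int(rng.integers(1, max_width + 1))               # band width
    start = int(rng.integers(0, batch.shape[1] - width + 1))  # band position
    masked = batch.copy()
    masked[:, start:start + width, :] = 0.0                   # zero the band
    return masked
```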

So in total we constructed

three training protocols and three evaluation protocols.

For training protocol T1, we use the original ASVspoof 2019 dataset,

and for T2 we

create a noisy version of the data using traditional audio augmentation techniques.

Two types of distortion were used for augmentation: reverberation and background noise. The room impulse responses for

the reverberation were chosen from publicly available room impulse response datasets,

and we chose four types of background noise for augmentation: music, television,

babble, and other ambient noise.
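The two distortions above can be sketched with standard signal processing; the SNR convention and truncation choices below are illustrative assumptions, not the paper's exact augmentation recipe:

```python
import numpy as np

def add_noise(clean, noise, snr_db):
    """Mix background noise (music, TV, babble, ...) into clean speech
    at a target signal-to-noise ratio in dB."""
    noise = noise[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # scale the noise so that p_clean / (scale^2 * p_noise) = 10^(snr_db/10)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

def add_reverb(clean, rir):
    """Simulate room reverberation by convolving the signal with a room
    impulse response, truncated back to the original length."""
    return np.convolve(clean, rir)[:len(clean)]
```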

We also wanted to investigate

the system's performance under call center scenarios.

So we replayed the original ASVspoof 2019 dataset through telephony services

to create phone channel effects,

and we used that for our T3

protocol.

Similarly, for the

evaluation datasets: E1 is the original ASVspoof 2019 evaluation set,

E2 is its noisy version, and E3 is

the version logically replayed through telephony services.

Now let me present the results. These are the results on the original ASVspoof

2019

evaluation set.

Our baseline system is the standard ResNet-18

model,

and it

achieves

about four percent equal error rate on the

evaluation set.

You can see that by applying the large margin cosine loss, the equal error rate is

reduced to 3.49 percent,

and finally, by adding

both the cosine loss and the frequency masking layer,

the equal error rate can be reduced to 1.81 percent.

And here is the

comparison of the systems trained using the three different protocols and evaluated against

the different benchmarks:

E1 is the original benchmark, the original ASVspoof 2019 dataset, E2 is the

noisy version, and E3 is

the one

logically replayed through telephony services.

As you can see, by using all the data, that is, training with protocol T2,

we can achieve significant improvements over the original dataset and

on the augmented evaluation sets.

This is the detailed equal error rate for the different spoofing techniques.

Our proposed system outperforms the baseline system on almost all types of spoofing techniques.

In conclusion, we have achieved state-of-the-art performance on the ASVspoof 2019 evaluation

dataset

without using any ensemble methods.

We also showed that traditional data augmentation techniques

may be helpful.

We were able to increase the generalization ability

by using frequency augmentation and the large margin cosine loss, and we showed that

increasing the generalization ability of the model itself

is very useful.

Finally, we evaluated the system's performance on noisy versions of the dataset and

in call center scenarios.

That's all for my presentation.

Thank you for listening.