and, well, whether it can actually identify speakers. And then we also wanted to try whether it is possible to actually fuse the results with more traditional i-vector and GMM systems. So basically, some things were done already,

and these are basically the closest works we could find at the time of writing; but, you know, with the arXiv publishing model and stuff like that, they could be out of date. Especially the first one is relevant, because they actually use spectrograms as well,

but what they use them for is to identify disguised voices. For example, with voice actors, in The Simpsons or something, one actor can play several characters, so they want to identify the actual actors, not the characters they play. But they didn't do the fusion or the exploration, and they basically used an off-the-shelf network.

And also there is now quite a lot of work on end-to-end convolutions on raw sound.

So here is a sort of overview of the system. The lower part is basically the standard approach, where you have the MFCCs or other features extracted, and then the i-vectors or the GMM-UBM or whatever, and then you do the identification. What we wanted to do is basically extract spectrograms, put them through the network, and then get the identity out; I will explain later why there are several identities out of the CNN.

So basically, what we wanted to test was the convolutional network against the i-vector systems, and the results on this dataset were actually quite surprising; we then chose the best of these systems for the fusion.

I don't expect to need to go into much detail: the convolutional network is inspired by the networks that are currently used for image recognition. So basically, what we did was take an existing model and then start downsizing it, because that didn't change the results and sped up learning, and we came up with this.

It's actually fairly standard: you have five convolutional layers, but that may even be an overkill, especially as the images we begin with are monochromatic, so we don't have three channels at the very beginning. The system is basically trained to predict the speaker identity directly. We use ReLU as the nonlinear function, and dropout at 0.5. As opposed to the usual image training, we did no random cropping or rotations: the spectrograms basically have a pretty big overlap anyway, so cropping wouldn't actually do much, and we didn't want rotations because something in the time domain may be interesting. And we use average rather than max pooling, but this is just based on experimental results.
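For concreteness, here is a minimal sketch of such a network in PyTorch. The five convolutional layers, single-channel input, ReLU, average pooling, and 0.5 dropout follow the talk; all channel counts and kernel sizes are illustrative assumptions, not the actual configuration from the paper.

```python
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    """Sketch of the described architecture: five conv layers on a
    single-channel 48x121 spectrogram, ReLU, average pooling, and
    dropout at 0.5. Channel counts and kernel sizes are assumptions."""
    def __init__(self, n_speakers):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.AvgPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AvgPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.AvgPool2d(2),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((3, 7)),  # fixed-size map for the classifier
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),               # dropout at 0.5, as in the talk
            nn.Linear(128 * 3 * 7, 512), nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(512, n_speakers),    # one output per known speaker
        )

    def forward(self, x):                  # x: (batch, 1, 48, 121)
        return self.classifier(self.features(x))
```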

Basically, because we wanted to compare with the TVS, the GMM-UBM and so on, we want to have the same sort of output. What we get from the signal is the speech segments, and because the spectrograms have a fixed size, we have to divide each speech segment into separate spectrograms and then average the network outputs, to get one prediction per segment, equivalent to what the i-vector systems produce.
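A minimal sketch of that per-segment averaging, assuming the PyTorch model above and softmax posteriors (the exact normalisation used is not specified in the talk):

```python
import torch

def segment_prediction(model, spectrograms):
    """Average the per-spectrogram posteriors over one speech segment,
    so the CNN yields a single score vector per segment, comparable to
    the i-vector systems. `spectrograms` is an (n, 1, 48, 121) tensor
    of the overlapping spectrograms cut from the segment."""
    model.eval()
    with torch.no_grad():
        logits = model(spectrograms)             # (n, n_speakers)
        probs = torch.softmax(logits, dim=1)     # per-window posteriors
    return probs.mean(dim=0)                     # one vector per segment
```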

We used the following settings; more details are in the paper, and I'm not going to dwell on this now, but we tested several settings and kept the best ones. The segmentation for getting the speech segments is based on the BIC criterion, with the i-vector dimensionality and so on as listed.
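For reference, a rough sketch of the classic BIC change criterion that such a segmenter builds on: compare modelling a window of feature vectors with one full-covariance Gaussian versus two, with a complexity penalty. This is the generic formulation, not the specific tool used in the paper.

```python
import numpy as np

def delta_bic(X, t, lam=1.0):
    """BIC change criterion at frame t for feature matrix X (N x d):
    one Gaussian over X versus two Gaussians over X[:t] and X[t:].
    Positive values suggest a speaker change at t."""
    def logdet(cov):
        _, val = np.linalg.slogdet(cov)
        return val
    N, d = X.shape
    X1, X2 = X[:t], X[t:]
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)
    return (0.5 * N * logdet(np.cov(X.T))
            - 0.5 * len(X1) * logdet(np.cov(X1.T))
            - 0.5 * len(X2) * logdet(np.cov(X2.T))
            - lam * penalty)
```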

For the fusion we chose the TVS, because it had the best results, and then we explored three different approaches. The first is late fusion: basically, just take the scores from the TVS and from the CNN and fuse them. Then, we saw from our experiments that the CNN actually works worse for longer segments of speech, which was quite surprising; so we wanted to weight its contribution down depending on the duration, and the duration-based fusion weights the scores by the duration of the track.
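A small sketch of both score-fusion variants, assuming simple mean/variance score normalisation; the actual normalisation, the weighting function, and the `tau` constant are all assumptions:

```python
import numpy as np

def late_fusion(cnn_scores, tvs_scores, duration=None, tau=2.0):
    """Fuse per-speaker score vectors from the two systems.
    Plain late fusion: normalise each score vector and sum them.
    Duration-weighted variant: shift weight towards the TVS for
    longer tracks, since the CNN degraded with duration."""
    def norm(s):
        s = np.asarray(s, dtype=float)
        return (s - s.mean()) / (s.std() + 1e-9)
    if duration is None:
        w = 0.5                              # plain late fusion
    else:
        w = 1.0 / (1.0 + duration / tau)     # CNN weight decays with duration
    return w * norm(cnn_scores) + (1.0 - w) * norm(tvs_scores)
```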

And finally we wanted to try an early fusion: basically, take the network output at the last hidden layer, reduce it with PCA to the same dimensionality as an i-vector, and then just concatenate the two and train the classifier on that.
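A sketch of that early fusion with scikit-learn, under stated assumptions: the hidden-layer dumps and file names are hypothetical, and the talk does not name the final classifier, so logistic regression stands in here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Hypothetical dumps of per-segment CNN hidden activations and i-vectors.
hidden = np.load("cnn_hidden.npy")        # (n_segments, hidden_dim)
ivecs = np.load("ivectors.npy")           # (n_segments, ivec_dim)
labels = np.load("labels.npy")            # speaker index per segment

# Project the hidden layer down to the i-vector dimensionality.
pca = PCA(n_components=ivecs.shape[1])
hidden_low = pca.fit_transform(hidden)

# Concatenate both representations and train a classifier on the result.
fused = np.concatenate([hidden_low, ivecs], axis=1)
clf = LogisticRegression(max_iter=1000).fit(fused, labels)
```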

The dataset we used is REPERE; this is a French-language broadcast corpus, with seven types of shows, including news, debates, interviews, celebrity gossip, stuff like that. Because of this it's pretty noisy: very often you have background music, you have different voices overlapping, you have street noises, et cetera.

It's very unbalanced as well, because you sometimes have, I don't know, politicians, say the president of France, who are present almost constantly, or anchors throughout the shows, and then you have this sort of long tail of speakers. So basically the whole training set has over eight hundred speakers, but the test set contains only one hundred thirteen, and luckily those one hundred thirteen actually overlap with the speakers in the train set. There is also much more speech in the train set than the roughly six hours in the test. This figure just shows the imbalance in the distribution; this is a logarithmic scale.

On the x-axis you have all those one hundred thirteen speakers, and on the y-axis you have the duration per speaker, sorted by the duration in the train set. What you get is that it's very imbalanced: you have some people speaking for forty minutes, and then someone who speaks for just a few seconds. And, as we can see from that spike at the very right, there's actually someone who is almost nonexistent in the train set but very present in the test data.

So, pretty difficult. Another feature of this data is that almost a quarter of the speech segments are shorter than two seconds, and many are shorter still, which makes it quite difficult. For the features, we basically used MFCCs, nineteen dimensions; all the details are in the paper, but we end up with a fifty-nine-dimensional vector after some feature warping.
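As a rough illustration of such a front end with librosa; the exact composition of the 59-dimensional vector and the feature-warping step are described in the paper, not reproduced here, and the file name is a placeholder.

```python
import numpy as np
import librosa

# 19 static MFCCs per frame, extended with deltas and double-deltas
# (57 dimensions so far); the remaining terms and the feature warping
# follow the paper and are omitted in this sketch.
y, sr = librosa.load("track.wav", sr=16000)        # hypothetical file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=19)
feats = np.vstack([mfcc,
                   librosa.feature.delta(mfcc),
                   librosa.feature.delta(mfcc, order=2)])
```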

For the spectrograms, you have an example of one here. Each is two hundred forty milliseconds in duration, and there's a big overlap between neighbouring spectrograms, about two hundred milliseconds, so over eighty percent. These are the settings we used: over each audio segment, we slide a twenty-millisecond analysis window with Hamming windowing, extract the log-spectral amplitude values, and we basically get an individual matrix of forty-eight by one hundred twenty-one pixels.
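A sketch of that spectrogram extraction in plain NumPy. The sample rate and hop values are assumptions; the exact parameters and frequency binning that produce the 48 x 121 matrices are in the paper.

```python
import numpy as np

def segment_spectrograms(y, sr=16000, win_s=0.240, hop_s=0.040,
                         ana_s=0.020, ana_hop_s=0.002):
    """Cut 240 ms windows with 200 ms overlap (40 ms hop) out of a
    speech segment and turn each into a log-amplitude spectrogram,
    using a 20 ms Hamming analysis window."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    ana, ahop = int(ana_s * sr), int(ana_hop_s * sr)
    window = np.hamming(ana)
    out = []
    for start in range(0, len(y) - win + 1, hop):
        chunk = y[start:start + win]
        frames = [chunk[i:i + ana] * window
                  for i in range(0, win - ana + 1, ahop)]
        spec = np.abs(np.fft.rfft(frames, axis=1))   # amplitude spectrum
        out.append(np.log(spec + 1e-8).T)            # (freq_bins, frames)
    return out
```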

So here are the results. In the top table we see the results for each individual system, and basically the CNN doesn't work very well, which isn't that surprising considering the way the dataset is structured; but what is pretty surprising is that one of the i-vector back-ends is also not very good, and is actually beaten by the GMM-UBM. So the best system is the TVS one, and that is what we used for the fusion afterwards.

In the lower table you have more detailed results, including the accuracy for the tracks that are shorter than two seconds. Actually the best approach that we have is just the simple late fusion: basically, take the predictions from the CNN and the TVS, normalise them, and combine them.

And the biggest boost in performance is actually obtained for the tracks shorter than two seconds: from roughly forty-one and forty-nine percent for the two individual systems, accuracy goes up to fifty-eight after fusion, so a boost of about nine points, which is quite a lot of course. The early fusion, on the other hand, actually decreased the results, while the duration-weighted one is pretty similar to the plain late fusion.

So basically, even though the CNN didn't outperform the baselines, it seems to find different things in the spectrograms, and the fusion can sort of exploit that and go beyond what was otherwise possible.

In the lower plot, the red curve is the CNN performance across different duration bins, on a logarithmic scale. You can see that the gap between the CNN and the i-vectors slowly increases along with the duration, and that the fusion is actually helpful for very short tracks but doesn't affect the performance on the longest ones.

So that's basically it. We wanted to see how it works, and we conclude that the fusion of the CNN and the TVS can improve over the baseline systems. Maybe more data is required, or higher quality data, for the CNN itself to actually work better. As for perspectives,

we basically chose this corpus because it also contains faces and so on, and we wanted to explore a system that takes both the spectrogram and the face, and so identifies speaking persons, rather than just concentrating on speaker identification on its own; and we want to have it all compact, as one end-to-end trainable system.

An additional source of insight may be to force a difference in the architecture: if you have, for example, horizontal or vertical filters rather than the square ones that we use now, you can sort of force it to look more at either the time domain or the frequency domain, to look for some patterns there. And so that's it, thank you.

So we have plenty of time for some questions. Okay, yes?

Did you do any kind of pre-segmentation, or do you assume that the segmentation is given?

So the segmentation is basically an automatic speech segmentation done by the BIC criterion; it is a pretty old technique, and we just basically take the segments as they are.

They're pretty noisy sometimes; it is very hard sometimes to distinguish or to filter out music from voice and stuff like that, and sometimes segments basically straddle two speakers as well. So, you know, we could probably benefit from using a more sophisticated way to generate the segments.

Okay, maybe one more: your experiments show that the features are complementary to the baseline, so did you attempt to look at what the upper layers of the CNN learn, whether you can interpret it in some meaningful terms?

One thing we actually did was to look at the saliency maps.

So basically, once again, you can actually see the activations of particular layers of the CNN, and look at what it looks at to make a decision. What I guess is pretty interesting is that most of the features were horizontal ones, that is, patterns in the frequency domain. So that's one more reason why we finally want to see what happens if we force the filters to be not just square, and see what happens then.
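For reference, the basic gradient-saliency computation looks like this in PyTorch; the talk does not specify which saliency variant was used, so this is only the generic technique.

```python
import torch

def saliency_map(model, spectrogram, speaker_idx):
    """Gradient-based saliency: how much each spectrogram pixel affects
    the score of one speaker. `spectrogram` is a (1, 48, 121) tensor."""
    x = spectrogram.clone().unsqueeze(0).requires_grad_(True)
    score = model(x)[0, speaker_idx]   # score of the chosen speaker
    score.backward()                   # gradients w.r.t. the input pixels
    return x.grad.abs().squeeze()      # (48, 121) importance map
```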

What about segmentation errors? Sorry, let me rephrase: the question was, what fraction of your total data is affected by segmentation errors?

Okay, I don't have the number with me, sorry.

But it follows from the last question: with twenty-five percent of the segments having a duration of less than two seconds, and given that to compute a segmentation score we use a collar of zero point two five seconds around the boundaries of each segment, it means that in your case, for twenty-five percent of the data, fifty percent of the speech is not used to compute the segmentation score. So we would have to change that if we want to know whether the segmentation error has an impact on speaker identification.

Okay, thank you.

We have time for, perhaps, one final question... Okay, then let's thank the speaker again.