0:00:15 ... and, well, whether a CNN can actually identify speakers. And then we also wanted to try whether it is possible to fuse the results with more, sort of, traditional systems. So basically some things were done already, and these are the closest works we could find at the time of writing; but, you know, with the arXiv publishing pace and stuff like that, they could be out of date. The first one especially concerns us, because they actually use spectrograms as well.
0:00:51 What they use it for is to identify disguised voices. So for example, when you have voice actors, like in The Simpsons or something, one actor can play several characters, and they want to identify the actual actors and not the characters that they play. But they didn't do the fusion or the exploration, and they basically used an off-the-shelf network. And there is also quite a lot of related work beyond this, so one shouldn't draw too many conclusions from this comparison alone.
0:01:25 So basically, what you see here is an overview. The lower part is basically the standard approach, where you have the MFCCs or other features extracted, and then the i-vectors or the GMM-UBM or whatever, and then the usual identification. What we wanted to do is basically extract spectrograms, put them through the network, and then get the identity on the other end; I will explain later why there are several identities coming out of the CNN. So basically what we wanted to test was the convolutional network, and then the TVS and the other baselines on the same dataset.
0:02:20 So this is the system we actually chose. It is nothing special, so I don't need to go into detail: a convolutional network inspired by a lot of the networks that are currently used for image recognition. So basically what we did: we tried an existing model and then started downsizing it, because that didn't really change the results and it sped up learning, and we came up with this.
0:02:53 And it is actually fairly modest: you have five convolutional layers, but the main point is that the big image-recognition networks are an overkill, especially since our images, the spectrograms, are monochromatic, so we don't have the three channels at the very beginning. So the system is basically quite compact, but it actually works. And we use ReLU as the nonlinear function, and dropout at 0.5.
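As a rough illustration of the building block just described, one convolution + ReLU + dropout stage can be sketched in plain numpy. The kernel size and number of filters below are illustrative assumptions, not the values from the talk:

```python
import numpy as np

def conv2d(x, kernels):
    """Valid 2-D convolution of a single-channel image with a bank of kernels."""
    kh, kw = kernels.shape[1:]
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((len(kernels), oh, ow))
    for c, k in enumerate(kernels):
        for i in range(oh):
            for j in range(ow):
                out[c, i, j] = (x[i:i+kh, j:j+kw] * k).sum()
    return out

def relu(x):
    return np.maximum(x, 0.0)  # the nonlinearity used in the talk

def dropout(x, p=0.5, rng=np.random.default_rng(0)):
    """Training-time dropout: zero units with probability p, rescale the rest."""
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

spec = np.random.randn(48, 121)      # one monochromatic spectrogram "image"
kernels = np.random.randn(16, 5, 5)  # 16 filters of 5x5 (illustrative sizes)
h = dropout(relu(conv2d(spec, kernels)))
print(h.shape)  # (16, 44, 117)
```

The single input channel reflects the monochromatic spectrograms; a colour image would start with three.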
0:03:27 And this is the data augmentation we conducted: we did no random cropping and no rotations. This was because the spectrograms basically have a pretty big overlap anyway, so cropping wouldn't do much, and we didn't want the rotations because there may be something interesting in the time domain that rotations would mix up. And we use average rather than max pooling, but this is just based on experiments.
0:03:55 So basically, because we wanted to compare against the TVS and the GMM-UBM and so on, we wanted to have the same sort of output. What we get from the signal is the speech segments, and because the spectrograms have a fixed size, we have to divide the speech segments into separate spectrograms, run each through the network, and then average the outputs to get a score per segment, equivalent to what the i-vector system produces, for example.
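The per-segment scoring described here, run each fixed-size spectrogram through the network and average the per-spectrogram outputs, can be sketched as follows. The CNN is stubbed out with a random posterior purely for illustration:

```python
import numpy as np

N_SPEAKERS = 113
rng = np.random.default_rng(0)

def cnn_posterior(spectrogram):
    """Stand-in for the trained CNN: returns a posterior over the known speakers."""
    logits = rng.normal(size=N_SPEAKERS)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def segment_score(spectrograms):
    """Average the per-spectrogram posteriors into one score vector per segment."""
    return np.mean([cnn_posterior(s) for s in spectrograms], axis=0)

segment = [np.zeros((48, 121)) for _ in range(7)]  # 7 spectrograms in one segment
scores = segment_score(segment)
print(scores.shape)  # (113,); the averaged posterior still sums to 1
```

Averaging posteriors keeps the output a valid distribution over speakers, so the segment-level score is directly comparable to a baseline system's per-segment score.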
0:04:32 So for the baselines we used the following settings; more details are in the paper, I'm not going to go into this now, but we tested several settings and I think these gave the best results. The segmentation, for getting the speech segments, is based on the BIC criterion, and the i-vector setup is standard, with a few hundred dimensions and stuff like that.
0:05:00 So for the fusion we chose the TVS, because it had the best results, and then we explored three different approaches. First, the late fusion: basically just take the scores from the TVS and from the CNN and fuse them. Then we saw from our experiments that the relative strength of the CNN actually varies with segment duration, which was quite surprising, so we basically wanted to weight its contribution depending on the duration: the weight is a function of the duration of the track.
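The duration-dependent late fusion can be sketched as a weighted sum of the two normalised score vectors, with the CNN weight as a function of track duration. The particular weighting function below is an illustrative assumption, not the one from the paper:

```python
import numpy as np

def fuse(cnn_scores, tvs_scores, duration_s, tau=2.0):
    """Late fusion: z-normalise both score vectors, then mix them with a
    duration-dependent weight on the CNN (illustrative ramp: the CNN is
    trusted more on short tracks, the TVS more on long ones)."""
    c = (cnn_scores - cnn_scores.mean()) / cnn_scores.std()
    t = (tvs_scores - tvs_scores.mean()) / tvs_scores.std()
    w = 1.0 / (1.0 + duration_s / tau)
    return w * c + (1.0 - w) * t

cnn = np.array([0.1, 0.7, 0.2])          # CNN scores over 3 speakers
tvs = np.array([0.3, 0.4, 0.3])          # TVS scores over the same speakers
short = fuse(cnn, tvs, duration_s=0.5)   # CNN-dominated mix
long_ = fuse(cnn, tvs, duration_s=30.0)  # TVS-dominated mix
print(short.argmax(), long_.argmax())
```

The z-normalisation step matters: the raw CNN posteriors and the raw TVS scores live on different scales, so they must be put on a common footing before the weighted sum.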
0:05:44 And then we wanted to see if an early fusion works: so basically take the output of the last hidden CNN layer, reduce it with PCA to have the same dimensionality as an i-vector, and then just concatenate them and train on the combined vectors.
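A minimal sketch of the early fusion just described: project the CNN's last-hidden-layer activations down to the i-vector dimensionality with PCA (via SVD), then concatenate. The sizes are illustrative assumptions, and the downstream classifier trained on the fused vectors is left out:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = rng.normal(size=(500, 1024))  # last-hidden-layer activations, one row per track
ivecs = rng.normal(size=(500, 100))    # i-vectors for the same tracks

# PCA by SVD: keep the top-k principal components of the hidden activations,
# where k matches the i-vector dimensionality.
k = ivecs.shape[1]
centered = hidden - hidden.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:k].T          # now the same dimensionality as an i-vector

fused = np.hstack([reduced, ivecs])    # concatenate -> input to a joint classifier
print(reduced.shape, fused.shape)  # (500, 100) (500, 200)
```

Matching the dimensionalities before concatenation keeps either representation from dominating the fused vector purely by size.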
0:06:01 So the dataset that we used is REPERE. This is a French-language corpus of TV broadcasts, with seven types of shows, including news, debates, interviews, celebrity gossip, stuff like that. And because of this it is pretty noisy: very often you have background music, you have different voices overlapping, you have street noises, et cetera. It is very unbalanced as well, because you sometimes have, I don't know, politicians who are present almost constantly throughout the corpus, and then you have this long tail of speakers. So basically, in the whole training set there are around eight hundred speakers, but the test set contains only one hundred thirteen, and luckily those one hundred thirteen actually overlap with the speakers in the train set. And the train set contains many more hours of speech, versus six for the test.
0:07:15 This is just to show the imbalance in the distribution; note the logarithmic scale. On the x-axis you have all those one hundred thirteen speakers, and on the y-axis you have the duration per speaker, sorted by the duration in the train set. So basically what you get is that it is very imbalanced: you have some people speaking for forty minutes, and then someone who speaks for just a few seconds. And, as we can see from that spike at the very right, there is actually someone who is almost nonexistent in the train set but is very present in the test data.
0:08:01 Which is pretty difficult. Also, another feature of this data is that a quarter of the speech segments are shorter than two seconds, and seventy percent are still quite short, which makes it quite difficult. So for the baselines we basically used MFCC features, nineteen dimensions; all the details are in the paper, but we end up with a fifty-nine-dimensional vector after some feature warping.
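One common way to reach a 59-dimensional vector from 19 static MFCCs is to append delta and delta-delta coefficients plus delta and delta-delta log-energy: 19 + 19 + 19 + 1 + 1 = 59. This exact recipe is an assumption on my part (the talk only gives the two numbers and refers to the paper), sketched here with a simple time-difference delta:

```python
import numpy as np

def deltas(feats):
    """First-order differences along the time axis, same shape as the input."""
    return np.gradient(feats, axis=0)

T = 200
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(T, 19))  # 19 static MFCCs per frame (stand-in values)
log_e = rng.normal(size=(T, 1))  # log-energy per frame (stand-in values)

d, dd = deltas(mfcc), deltas(deltas(mfcc))
de, dde = deltas(log_e), deltas(deltas(log_e))
frame_vec = np.hstack([mfcc, d, dd, de, dde])  # 19+19+19+1+1 = 59 dims
print(frame_vec.shape)  # (200, 59)
```

Feature warping would then be applied per dimension over a sliding window; it is omitted here.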
0:08:38 So for the spectrograms, you have an example of one up here. Each is around two hundred forty milliseconds in duration, and there is a very big overlap between neighbouring spectrograms.
0:09:02 And basically, this is the setup that we used: the audio segments are cut into fixed-length pieces, and then for the spectral analysis we use a window of twenty milliseconds with Hamming windowing and log-spectral amplitude value extraction, and we basically get an individual matrix of forty-eight by one hundred twenty-one pixels.
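The extraction pipeline just described, framing, Hamming windowing, log-amplitude spectra, can be sketched in numpy. The sample rate and hop size below are illustrative assumptions, so the resulting matrix shape differs from the 48-by-121 used in the talk:

```python
import numpy as np

def log_spectrogram(signal, sr=16000, win_ms=20, hop_ms=10):
    """Frame the signal, apply a Hamming window, take log-amplitude spectra."""
    win = int(sr * win_ms / 1000)   # samples per 20 ms frame
    hop = int(sr * hop_ms / 1000)   # frame shift (assumed, gives heavy overlap)
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i*hop : i*hop + win] for i in range(n_frames)])
    frames = frames * np.hamming(win)           # Hamming windowing
    spec = np.abs(np.fft.rfft(frames, axis=1))  # amplitude spectrum per frame
    return np.log(spec + 1e-10).T               # (freq bins, time frames)

sig = np.random.randn(16000)  # 1 s of fake audio
S = log_spectrogram(sig)
print(S.shape)  # (161, 99) with these illustrative parameters
```

The resulting matrix is then treated as a monochromatic image and fed to the CNN.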
0:09:34 So basically, here are the results. In the top table we see the results for each individual system, and basically the CNN alone doesn't work very well, which isn't that surprising considering the way the dataset is structured. What is pretty surprising is that one of the baselines is also not very good, actually below the GMM-UBM. Okay, so basically the best system is the TVS one, and that is what we used for the fusion afterwards. In the lower table you have more detailed results, including the accuracy on the tracks that are shorter than two seconds.
0:10:22 Actually, the best approach that we have is just the simple late fusion: basically take the predictions from the CNN and the TVS, normalise them, and sum them. And the biggest boost in performance is actually for the tracks that are shorter than two seconds: roughly forty-one percent for the CNN and forty-nine for the TVS respectively, and then the fusion goes up to fifty-eight, which is quite a boost, of course.
0:11:02 And then the early fusion actually, well, slightly decreased the results, but for longer durations it is pretty similar. So basically, even though the CNN alone didn't do well, it seems to pick up different things in the spectrograms, and the fusion can exploit that and go beyond what was, let's say, possible before. So in the lower plot, the red curve is the CNN performance across different duration bins, on a logarithmic scale. You can see that the gap between the CNN and the i-vectors slowly increases along with the duration, and the fusion is actually most helpful for very short tracks, and then it doesn't affect the performance on the longest ones.
0:12:14 So that's basically it. We wanted to see how it works, and we conclude that the CNN fused with the TVS can improve over the baseline systems; more data may be required, or more high-quality data, for the CNN on its own to actually work better. And for the perspectives: so basically we chose this corpus because it also contains the video, faces and stuff like that, and we wanted to explore a system that takes both the spectrogram and the face, and identifies, let's say, speaking persons, rather than just concentrating on speaker identification in isolation; and we want to have it all compact, like one trainable system.
0:13:05 An additional source of insight might be to force a difference in the architecture: so basically, if you have, for example, horizontal or vertical filters, rather than the square ones that we use now, you can sort of force the network to look more at the time domain or the frequency domain, to look at some patterns there. And so that's it, thank you.
0:13:43 [applause] So we have plenty of time for some questions.
0:14:15 [Question] Do you do any kind of segmentation or pre-segmentation, or do you assume that the segmentation is known? [Answer] So the segmentation is basically an automatic speech segmentation done by the BIC criterion; it is a pretty old technique, and we just basically take the segments as they are. And they are pretty noisy: sometimes, as the analysis shows, it is very hard to distinguish, or to filter out, music from voice and stuff like that, and sometimes segments basically straddle, like, two speakers as well. So, you know, we could probably benefit from using a more sophisticated way to generate the segments.
0:15:10 [Question] Okay, maybe one more on this: the experiments show that the features are complementary to the baseline. Did you attempt to look at the upper layers learned by the CNN, to see whether they have some meaning, in terms of phones or something, given the averaging? [Answer] So basically, what we could do, and actually did, was to look at the saliency maps. So you can actually see the responses of particular layers, what the CNN looks at to make a decision. And what I guess is pretty interesting is that most of the salient features were horizontal, so in the frequency domain. That is one reason why, as my final slide said, we want to see what happens if you force the filters to be not just square but, say, vertical or horizontal, and see what happens then.
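Saliency maps of the kind mentioned here measure how much each input pixel influences the class score, i.e. the gradient of the score with respect to the input. A minimal numpy sketch using finite differences on a toy linear scorer (the scorer, its weights, and the input size are all illustrative assumptions, not the actual CNN):

```python
import numpy as np

def saliency(score_fn, x, eps=1e-4):
    """Finite-difference approximation of d(score)/d(input), per pixel."""
    grad = np.zeros_like(x)
    base = score_fn(x)
    for idx in np.ndindex(x.shape):
        bumped = x.copy()
        bumped[idx] += eps
        grad[idx] = (score_fn(bumped) - base) / eps
    return np.abs(grad)  # magnitude = how strongly the pixel sways the score

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 10))   # toy linear scorer over an 8x10 "spectrogram"
spec = rng.normal(size=(8, 10))
sal = saliency(lambda s: float((W * s).sum()), spec)
print(np.allclose(sal, np.abs(W), atol=1e-3))  # linear scorer: saliency == |weights|
```

For a real CNN one would use backpropagation rather than finite differences, but the interpretation is the same: bright horizontal bands in the saliency map indicate frequency-domain features.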
0:16:23 [Question] About the segmentation errors: the question was, what fraction of your total data is affected by the segmentation errors? [Answer] Okay, I don't have that number with me, sorry.
0:17:09 [Comment] Following up on the last question: with twenty-five percent of the segments having a duration of less than two seconds, when we compute a segmentation score we have this collar of 0.5 seconds around the boundaries of each segment; it means that, in your case, for twenty-five percent of the data, fifty percent of the speech is not used to compute the segmentation score. So we have to keep that in mind if we want to know how the segmentation error impacts speaker identification.
0:17:48 Thank you.
0:17:56 Is there perhaps time for one final question? Okay then, thank you everyone, and another round of applause for the speaker.