Speech Transcript - Deep complementary features for speaker identification in TV broadcast data

0:00:15	and well what can i actually identify speakers i and then we also wanted to
0:00:20	try if it's possible to actually fuse the results
0:00:22	where is more
0:00:24	sort of traditional atoms an and systems so basically there was some things
0:00:32	done already
0:00:34	and these are basically some of the closest works we could find at the time of writing
0:00:38	but you know due to sort of the arxiv publishing method and stuff
0:00:42	like that they could be out of date so basically especially the first one of
0:00:46	course because they actually use cnns on spectrograms as well
0:00:49	but
0:00:51	what they use it for is to identify disguised voices so for example when you
0:00:55	have voice actors in like the simpsons or something one actor can play like
0:01:03	several characters so they want to sort of identify the actual actors and not
0:01:09	the characters that they play
0:01:11	but they didn't do any sort of fusion or exploration and they basically used
0:01:16	sort of an out of the box network
0:01:18	and also there's this is quite a lot of
0:01:20	now one to conclusions a non sound
0:01:25	so basically what we want to see here is sort of the overview so the
0:01:32	lower part is basically sort of the standard
0:01:37	approach where you have the mfccs or other features extracted then sort of the i-vectors
0:01:42	or the ubm
0:01:44	whatever and then
0:01:46	you do the identification what we wanted to do is basically extract spectrograms put
0:01:52	them through the network and then sort of get the identity at the other end
0:01:57	i will explain later why
0:01:59	there are several identities
0:02:03	of the cnn so basically what we wanted to test is
0:02:07	a convolutional network and then
0:02:10	tvs
0:02:12	and you're
0:02:14	actually quite dataset was
0:02:16	quite surprising that you
0:02:20	so we actually chose this
0:02:22	system so
0:02:24	the fusion
0:02:26	so
0:02:27	this is not expect sorta need to go into detail
0:02:33	a convolutional network inspired by a lot of networks that are currently used for image recognition
0:02:39	this particular one
0:02:41	so basically what we did we tried an existing model and then we sort of
0:02:45	started downsizing it because it didn't really change the results and it sped up learning
0:02:51	and we came up with this
0:02:53	and it's actually very robust as you have five convolutional layers but it may be
0:03:00	an overkill
0:03:02	especially as the images that we feed in are monochromatic so we don't like
0:03:07	use three channels at the very beginning
0:03:11	some be system basically trying to one twelve efforts but it's actually
0:03:17	and we use relu as the nonlinear function
0:03:23	and
0:03:25	and dropout at zero point five
0:03:27	and this is the augmentation we conducted where we did no random cropping or
0:03:32	rotations this was due to the fact that the spectrograms basically have a pretty big overlap
0:03:38	anyway so cropping doesn't actually
0:03:40	have much use and we didn't want to do the rotations because hopefully there
0:03:44	may be something in the time domain that may be interesting
0:03:49	and we use
0:03:50	well average point max pooling but this is just based on experimental
0:03:55	exactly so basically because we wanted to combat
0:03:59	t v s and t o v as and stuff like that
0:04:03	we you want to have
0:04:05	the same sort of output so basically what we got from the signal is
0:04:11	the somebody news
0:04:13	so the speech segments
0:04:14	and because the spectrograms have
0:04:18	a fixed size we have to sort of divide the speech segments
0:04:22	into separate spectrograms and then do an average
0:04:27	of the outputs to get an equivalent of what we get for tvs for example
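The averaging step described here can be sketched as follows. This is only a minimal illustration, assuming the network emits one vector of class probabilities per spectrogram; the function name `average_posteriors` and the toy numbers are hypothetical and not taken from the talk.

```python
def average_posteriors(per_spectrogram):
    """Average per-spectrogram class probabilities into one segment-level vector."""
    if not per_spectrogram:
        raise ValueError("a segment needs at least one spectrogram")
    n = len(per_spectrogram)
    n_classes = len(per_spectrogram[0])
    return [sum(p[c] for p in per_spectrogram) / n for c in range(n_classes)]


# Toy example: three spectrograms cut from one speech segment, two candidate speakers.
segment_scores = average_posteriors([[0.2, 0.8], [0.6, 0.4], [0.1, 0.9]])

# The segment-level decision is the speaker with the highest averaged score.
predicted_speaker = max(range(len(segment_scores)), key=segment_scores.__getitem__)
```

The result is one score vector per speech segment, which is the same granularity at which a segment-level system would be scored.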
0:04:32	so for
0:04:35	you many the end you get your the eggs we
0:04:37	so to use the following setting like more teachers and paper one sort of going
0:04:42	dependent this now but we tested a settings in this
0:04:45	i think you the best also i'm not
0:04:48	the segmentation for
0:04:51	getting the speech segments is based on the bic criterion
0:04:55	i victim suspect hundred and stuff like that
0:05:00	so for the fusion we chose tvs because it had the best results
0:05:05	and then
0:05:06	we explored three
0:05:09	different approaches the late fusion so basically just take the scores
0:05:13	from the tvs and from the cnn and then
0:05:16	basically
0:05:18	fuse them
0:05:18	and then we saw from our experiments that
0:05:22	actually the cnn works worse for
0:05:27	longer segments
0:05:29	of speech
0:05:30	which was quite surprising so we basically wanted to sort of weight down
0:05:36	its value depending on the duration
0:05:38	so this is the duration based fusion using the duration of the track
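A rough sketch of the two late-fusion variants mentioned here, plain score fusion and duration-weighted fusion, is given below. The talk does not state the weighting formula, so `cnn_weight` is a purely hypothetical choice that simply shrinks with track duration; the scores are assumed to be normalised to a comparable range beforehand.

```python
import math


def cnn_weight(duration_s, base_weight=1.0):
    """Hypothetical weight for the CNN scores that shrinks as the track gets longer."""
    return base_weight / (1.0 + math.log1p(duration_s))


def late_fusion(tvs_scores, cnn_scores, duration_s=None):
    """Fuse two per-speaker score lists; if a duration is given, down-weight the CNN."""
    w = 1.0 if duration_s is None else cnn_weight(duration_s)
    return [t + w * c for t, c in zip(tvs_scores, cnn_scores)]


# Toy scores for two candidate speakers (assumed already normalised).
tvs = [0.40, 0.60]
cnn = [0.90, 0.10]

short_track = late_fusion(tvs, cnn, duration_s=1.0)
long_track = late_fusion(tvs, cnn, duration_s=30.0)
```

With these toy numbers the CNN changes the fused decision for the one-second track but not for the thirty-second track, which is the intended effect of weighting by duration.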
0:05:44	and then we wanted to try an early fusion
0:05:48	so basically take the output of the last hidden layer of the cnn reduce it with
0:05:52	pca to have
0:05:54	the same dimensionality as an i-vector and then we just concatenate them and train an
0:05:58	svm
0:06:01	so the dataset that we used is repere this is a french language
0:06:06	corpus this is and radios
0:06:10	and
0:06:11	it has seven types of videos including news debates
0:06:16	sort of interviews celebrity gossip stuff like that so and because of this it's pretty
0:06:22	noisy because you i don't know very often you have like background music you have different
0:06:27	voices overlapping you have street noises et cetera
0:06:33	and
0:06:34	it is very unbalanced as well because
0:06:36	you sometimes have i don't know politicians or i don't know the president of france that sort
0:06:42	of is there almost constantly or anchors throughout the show and then you have
0:06:49	sort of this long tail of speakers so basically in the whole training set there are
0:06:53	eight hundred three months speakers but that says sets
0:06:57	contains only one hundred thirteen and these one hundred thirteen
0:07:01	actually overlap
0:07:04	with the speakers in the train set
0:07:06	and while the strange about speech or frames and six for the test
0:07:15	this is just to show sort of like the imbalance in the distribution this is
0:07:18	a logarithmic scale
0:07:21	and then this
0:07:23	so on the x-axis you have all those one hundred thirteen speakers
0:07:27	and then on the y you have the duration per speaker so basically they are
0:07:32	sorted by the duration in the train set so basically what you've got is that
0:07:39	it's very imbalanced you know some people speak for forty minutes and then
0:07:43	there is someone who speaks for just a few seconds
0:07:46	and then it's
0:07:48	as we can see that spike at the very right
0:07:51	this shows that there's actually
0:07:53	someone who
0:07:54	is almost nonexistent in the train set but then he's very present in the test data
0:08:00	so
0:08:01	pretty difficult also another feature of this data is that
0:08:06	almost
0:08:07	a quarter of speech segments are shorter than two seconds
0:08:11	and seven c
0:08:13	percent shorter than that
0:08:15	which makes it quite difficult so basically we used mfcc features
0:08:22	and this is sort of problem no
0:08:26	nineteen dimensions so
0:08:28	so basically all the details are in the paper but
0:08:31	we end up with a fifty nine dimensional vector
0:08:34	after some
0:08:36	feature warping
0:08:38	so for the spectrograms you have an example of it on your
0:08:45	it's
0:08:45	the two hundred
0:08:47	forty milliseconds in duration
0:08:50	there's a big overlap between neighboring spectrograms
0:08:54	well at the two hundred milisecond systems on overall
0:08:59	it's percent
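To make the windowing arithmetic concrete, here is a small sketch of how a speech segment could be cut into overlapping spectrogram windows. It assumes 240 ms windows with a 200 ms overlap between neighbours, which is one reading of the figures mentioned above; the function name `spectrogram_windows` is hypothetical.

```python
def spectrogram_windows(segment_ms, win_ms=240, overlap_ms=200):
    """Return (start, end) times in ms of the windows cut from one speech segment."""
    hop_ms = win_ms - overlap_ms  # with the assumed values the window advances by 40 ms
    return [(s, s + win_ms) for s in range(0, segment_ms - win_ms + 1, hop_ms)]


# A two-second segment, i.e. the length below which about a quarter of the data falls.
windows = spectrogram_windows(2000)

# Fraction of each window shared with its neighbour under the assumed values.
overlap_ratio = 200 / 240
```

Under these assumptions a segment shorter than one window yields no spectrogram at all, so very short tracks contribute few or no inputs to the network.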
0:09:02	and basically
0:09:04	so this is that we use so are
0:09:09	audio segments were a value of refinement seconds
0:09:12	and then for the window we used twenty milliseconds we use
0:09:17	the
0:09:18	hamming windowing
0:09:21	log spectral
0:09:23	amplitude values extraction and then we basically got an individual matrix with a size of
0:09:28	forty eight times woman twenty one pixel
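The extraction steps listed here (short windows, Hamming windowing, log-spectral amplitude) can be sketched in plain Python as below. The sampling rate, the 10 ms hop and the small constant added before the logarithm are assumptions for illustration only, and no resizing to the final image dimensions is attempted; a real pipeline would normally use an FFT library rather than this direct DFT.

```python
import cmath
import math


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def log_spectrum(frame):
    """Log amplitude of the DFT of one Hamming-windowed frame (non-negative bins)."""
    n = len(frame)
    x = [s * w for s, w in zip(frame, hamming(n))]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(math.log(abs(acc) + 1e-10))  # small constant avoids log(0)
    return spectrum


def log_spectrogram(samples, sample_rate=16000, win_ms=20, hop_ms=10):
    """Stack log spectra of successive frames; rows are time, columns are frequency."""
    win = sample_rate * win_ms // 1000
    hop = sample_rate * hop_ms // 1000
    return [log_spectrum(samples[s:s + win])
            for s in range(0, len(samples) - win + 1, hop)]
```

Each row of the returned matrix is one 20 ms frame, so a 240 ms chunk gives a small time-frequency image that can be fed to the network as a single-channel input.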
0:09:34	so basically here are the results so in the upper table we see the results
0:09:39	for each individual system and basically
0:09:42	the cnn
0:09:43	doesn't work very well which isn't
0:09:45	that surprising considering
0:09:47	the way the dataset is structured but
0:09:50	pretty surprising is that the of the a
0:09:52	is also not very good an actually gmm ubm
0:09:57	right okay so basically the best system is the tvs one
0:10:02	and that is the one we have used for fusion afterwards so basically
0:10:07	we want to see
0:10:10	so in the lower table you have three more detailed results including the accuracy for
0:10:17	the tracks that last less than two seconds
0:10:21	and
0:10:22	actually the best approach that we have is just the simple late fusion so
0:10:28	basically take the predictions from cnn
0:10:30	and tvs sort of normalise them and
0:10:34	our remote
0:10:35	and the biggest boost in performance is actually given for the
0:10:40	tracks that are shorter than two seconds so basically from forty
0:10:49	forty one almost and forty nine for tvs and cnn respectively and
0:10:53	then it goes up to fifty eight
0:10:56	so it's a phone
0:10:58	which is quite of course
0:11:02	and then the early fusion actually well actually decreased results
0:11:08	but for like the duration based one
0:11:12	it's pretty
0:11:13	similar so basically
0:11:15	even though the cnn didn't
0:11:18	outperform
0:11:19	it
0:11:20	seems to provide different things from spectrograms and
0:11:23	by fusion we can sort of exploit it and sort of go
0:11:26	beyond what was
0:11:29	let's say possible so it's also
0:11:34	so it's of the lower plots
0:11:36	as you
0:11:39	we have so the red one is the cnn
0:11:43	performance across
0:11:44	different track durations
0:11:48	on a logarithmic scale
0:11:49	so you can see that
0:11:52	the difference between cnn and
0:11:55	i-vectors
0:11:57	increases sort of along
0:12:01	with the duration and the fusion is actually helpful for very short tracks and then
0:12:08	doesn't affect the performance and the latest
0:12:14	so that's basically it we wanted to
0:12:19	see how it works and we conclude that the fusion of cnn and
0:12:24	tvs may improve over the baseline systems
0:12:29	and more data may be required
0:12:32	or more quality data especially for this to actually work better and
0:12:37	for perspectives
0:12:40	so basically we chose this corpus
0:12:42	because it also contains
0:12:44	faces and stuff like that as we wanted to have like a system that
0:12:48	takes both the spectrogram and the face let's say
0:12:52	so it would identify like speaking persons
0:12:55	rather than just concentrate on speaker identification by standard edition and we want to have
0:13:00	it all compact in like one trainable system
0:13:05	and
0:13:05	an additional source of
0:13:08	insight may be to force a difference in the architecture so basically if you
0:13:15	have just for example horizontal or vertical filters rather than the squares that we use now
0:13:22	you can sort of force it to look
0:13:25	more in sort of the time domain or frequency domain
0:13:30	to sort of look at the
0:13:34	at some patterns there
0:13:36	and so that's it thank you
0:13:43	i performance
0:13:45	so we have plenty of time for some questions
0:13:51	okay
0:13:56	yes
0:14:15	any kind of segmentation per segmentation or you assume that there is
0:14:21	you know the segmentation so the speech segmentation is basically an automatic speech segmentation done
0:14:28	by the bic criterion so it is a pretty old technique and then we just basically take
0:14:33	the segments as they are
0:14:35	and they are pretty noisy sometimes in the sense that it is very hard sometimes to distinguish
0:14:41	or to filter out like music and voice and stuff like that and then sometimes
0:14:44	because like something's that basically have strike selecting two speakers
0:14:49	as well which you know
0:14:52	we could probably
0:14:53	benefit from using a more sophisticated way to generate the
0:15:10	okay maybe also one is not experiments on this a the features are complementary to
0:15:16	the baseline so did you have an attempt to have as well as in the
0:15:20	upper layers learned by the c n like another
0:15:24	can you can kinda the telephone or something up for a meaning in terms of
0:15:28	the old averaged it is some basic you could be a actually that was to
0:15:33	see the saliency maps
0:15:36	so basically with these you can actually see
0:15:42	the outputs of particular layers of the cnn and look at what it looks at
0:15:48	to make a decision so basically what i guess is pretty interesting most of the
0:15:54	features there were horizontal
0:15:56	so in the frequency domain so that's one thing so that's why finally we want
0:16:05	to see what happens if you like force the net to use just the vertical ones
0:16:10	and see what happens then
0:16:23	segmentation error
0:16:27	the simulation the red and no sorry
0:16:32	the measurement question was how
0:16:34	what five
0:16:36	of your total data is the segmentation that it
0:16:40	okay i don't have number wouldn't sorry
0:16:46	but
0:16:47	could be in the fact that should be
0:17:09	doesn't come out and of the last question with twenty five percent
0:17:13	of the segments with the duration less than two seconds
0:17:17	going but we are
0:17:19	but using
0:17:20	almost you know to compute a segmentation score we have this
0:17:25	collar of zero point five seconds along the boundaries of each segment it means that in your
0:17:30	case for twenty five percent of the data
0:17:33	fifty percent of the speech is not used to compute the
0:17:38	segmentation score so we have to change from it we want to go
0:17:42	if the segmentation you were a house and but on speaker identification
0:17:47	okay
0:17:48	thank you
0:17:56	time problem one final question
0:18:01	okay thinker everyone a separate so unless the spectrogram

Speaker Recognition in Multimedia Content

Mateusz Budnik, Ali Khodabakhsh, Laurent Besacier, Cenk Demiroglu