0:00:56 Hello, my name is Paola García, and it is my great pleasure to present our work on speaker detection in the wild: lessons learned from JSALT 2019. First of all, I would of course like to thank everyone who made this work possible. So let's start.
0:01:16 What data do we have? We have plenty of devices, like smartphones and recorders, and we can even get information from social media. We gather these data and use them for downstream tasks. However, these data need to be labeled to be useful, and with this labeling we can perform speaker detection.
0:01:44 One of our very first experiments was to use brute force, and it was the motivation to use diarization afterwards. We take the speech recording and obtain homogeneous segments from it. From those segments we compute the embeddings, we compare those embeddings with the target speaker's, and we obtain a result. But then we added diarization, extracted the segments that belong to the same speaker, and obtained better results. So it was worthwhile to do it this way.
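As a rough sketch of the two strategies contrasted here: brute force scores every fixed window independently, while the diarization-aware variant first pools each speaker's segments into one embedding. The embed() function is a placeholder for the x-vector network, and the window size and pooling are illustrative assumptions, not details from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
PROJ = rng.standard_normal((128, 64))

def embed(segment: np.ndarray) -> np.ndarray:
    # Placeholder for the x-vector extractor: a fixed random projection
    # of crude frame statistics, only so the sketch runs end to end.
    frames = segment[: len(segment) // 64 * 64].reshape(-1, 64)
    return PROJ @ frames.mean(axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def brute_force_scores(recording, target_emb, win=32000):
    """Brute force: embed every fixed window and score it independently."""
    return [cosine(embed(recording[s:s + win]), target_emb)
            for s in range(0, len(recording) - win + 1, win)]

def diarized_scores(recording, turns, target_emb):
    """With diarization: pool all of a speaker's segments into a single,
    more reliable embedding before scoring. `turns` maps each speaker
    label to a list of (start, end) sample indices."""
    return {spk: cosine(np.mean([embed(recording[s:e]) for s, e in segs],
                                axis=0), target_emb)
            for spk, segs in turns.items()}
```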
0:02:34 So this is the big picture of the whole pipeline. We have a recording, and we are looking for John. The first stage is to apply voice activity detection; that means getting rid of all the silence. The second stage is to perform speaker type classification; that means tagging all the segments according to gender, whether it is a kid or an adult, or even whether it is TV. Then comes speaker diarization, which answers the question "who spoke when" by gathering together the segments that belong to the same speaker. Speaker detection answers the question of whether we have John in any segment, so it is a binary decision. And then we can look for John along the recording with speaker tracking.
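A structural sketch of this five-stage pipeline; the stage names and signatures below are illustrative stubs, not the actual JSALT code:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    start: float
    end: float
    label: str                    # speaker id, "child", "adult", "TV", ...

# Stage interfaces only; a real system would put a trained model behind each.
def vad(audio) -> List[Segment]: ...
def speaker_type(audio, speech: List[Segment]) -> List[Segment]: ...
def diarize(audio, typed: List[Segment]) -> List[Segment]: ...
def detect(audio, turns: List[Segment], enrollment) -> bool: ...
def track(turns: List[Segment], enrollment) -> List[Segment]: ...

def find_john(audio, enrollment):
    speech = vad(audio)                    # 1) drop the silence
    typed = speaker_type(audio, speech)    # 2) gender / kid / adult / TV
    turns = diarize(audio, typed)          # 3) who spoke when
    if detect(audio, turns, enrollment):   # 4) binary: is John here at all?
        return track(turns, enrollment)    # 5) follow John along the recording
    return []
```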
0:03:33 Does it work fine to follow this type of pipeline if we have challenges, such as a cocktail party? The answer is no. And if we have 5 dB SNR? The answer, again, is no.
0:03:49 So let's take a look at some numbers on the diarization side. On the right we can observe the results obtained on the datasets we tried, based on the recipes provided by BUT. We can observe that CHiME-5, where we have long recordings, and BabyTrain got very bad results. We concluded that these bad results come from far-field microphones, noisy speech, overlapping speech, condition mismatch, non-cooperative speakers, and a bias towards English speech. So we wanted to study these conditions.
0:04:43 Now let's see some numbers on speaker recognition. For speaker recognition we compared two systems on two datasets: the first one is SRI and the second one is VOiCES. And we are comparing a close-talking microphone against far-field. We can observe that for far-field microphones the equal error rate doubles, or worse.
0:05:13 So our main goal was to research, develop, and benchmark speaker diarization and speaker recognition systems for real speech, using single microphones in realistic scenarios that included background noises such as television, music, or other people talking.
0:05:35 One of the characteristics of the data: is it like this one, where you are having a meeting? Or is it completely wild, like the one in CHiME-5, where people gathered together to have a party? Or is it a long recording, five hours or even longer? Or do we have a far-field microphone in the other room that is catching the voice of the speaker?
0:06:15 To cover all these types of data we included these four corpora: AMI, SRI, CHiME-5, and BabyTrain, going from the easiest one to the most difficult one. For AMI we have a meeting domain, and we used it both for diarization and detection. For SRI we have a semi-controlled domain; we only used it for detection, and we didn't use it for diarization because we don't have complete labels for all the speakers. CHiME-5 we used for diarization only; it is a dinner-party domain, and we didn't use it for detection because it usually has only four speakers, which is quite few. And BabyTrain we used both for diarization and detection; it is completely wild and uncontrolled.
0:07:17 The models that we explored, as I said before, are for diarization and for speaker detection. From diarization we get the labels for all the speakers, and with speaker detection we can track the speaker we are looking for.
0:07:36 This is the picture of the diarization pipeline. We have a traditional modularized system that is composed of enhancement, the VAD, the embedding, the scoring, the clustering, the resegmentation, and the overlap assignment. We have two types of enhancement: one at the signal level and another one at the embedding level. The boxes in orange are the ones that we explored.
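As a concrete (assumed) instance of the scoring and clustering boxes, here is a minimal agglomerative clustering over cosine distances between segment embeddings; the stopping threshold is a tuning knob, not a value from the talk:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_embeddings(embeddings: np.ndarray, threshold: float = 0.3):
    """Group per-segment embeddings into speakers with average-linkage
    agglomerative clustering; returns one integer label per segment."""
    # Length-normalise so cosine distance behaves well.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dists = pdist(X, metric="cosine")
    tree = linkage(dists, method="average")
    return fcluster(tree, t=threshold, criterion="distance")

print(cluster_embeddings(np.random.randn(10, 128)))
```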
0:08:09 Let's start with the enhancement at the signal level. We built an SNR-progressive multi-target LSTM-based speech enhancement model. The progressive multi-target network is divided into sequentially stacked blocks, with one LSTM layer and one fully connected layer for multi-target learning per block. The fully connected layer in every block is designed to learn an intermediate speech target with a higher SNR than the previous target. A series of progressive ratio masks is concatenated with the progressively enhanced log-power spectral features as the targets. At test time, we directly feed the enhanced audio, processed by the well-trained enhancement model, to the back-end systems.
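A PyTorch sketch of that progressive structure, assuming three stacked blocks and assuming each intermediate target is a ratio mask concatenated with enhanced log-power spectra; all dimensions and the block count are illustrative:

```python
import torch
import torch.nn as nn

class PMTBlock(nn.Module):
    """One block: an LSTM layer plus a fully connected layer that predicts
    an intermediate target at a higher SNR than the previous block."""
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, out_dim)

    def forward(self, x):
        h, _ = self.lstm(x)
        return self.fc(h)

class ProgressiveEnhancer(nn.Module):
    def __init__(self, feat_dim=257, hidden=512, n_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList()
        in_dim = feat_dim
        for _ in range(n_blocks):
            # Each block sees the original features plus the previous
            # block's target (mask + enhanced LPS -> 2 * feat_dim).
            self.blocks.append(PMTBlock(in_dim, hidden, 2 * feat_dim))
            in_dim = feat_dim + 2 * feat_dim

    def forward(self, noisy_lps):
        targets = []
        x = noisy_lps
        for block in self.blocks:
            y = block(x)
            targets.append(y)              # supervised at increasing SNR
            x = torch.cat([noisy_lps, y], dim=-1)
        return targets                     # train with one loss per block

model = ProgressiveEnhancer()
outs = model(torch.randn(2, 100, 257))     # (batch, frames, bins)
print([t.shape for t in outs])
```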
0:09:07 Now that we have a cleaner signal, we can explore the VAD. In this case we have two directions: the one on the top is based on MFCCs, and the one on the bottom is based on filterbank features; in both architectures these are followed by LSTM and fully connected layers. The output is speech and non-speech. It is important to note that the lower branch is the one that we chose for our experiments.
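A minimal sketch in the spirit of the lower branch: filterbank features into a bidirectional LSTM followed by fully connected layers that produce per-frame speech/non-speech decisions; the layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class SimpleVAD(nn.Module):
    """Frame-level speech / non-speech classifier."""
    def __init__(self, n_fbank=40, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(n_fbank, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),          # speech vs. non-speech
        )

    def forward(self, fbank):              # (batch, frames, n_fbank)
        h, _ = self.lstm(fbank)
        return self.fc(h)                  # per-frame logits

vad = SimpleVAD()
logits = vad(torch.randn(1, 300, 40))
speech_frames = logits.argmax(dim=-1)      # 1 = speech, 0 = non-speech
```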
0:09:42 Although this is not part of the final stages, it is also true that the embedding network is related to the performance, as shown in the table. So we explored the extended TDNN trained on VoxCeleb, with and without augmentation, and we also explored a factorized TDNN, also with augmentation. We can see that the factorized TDNN obtained the best results on BabyTrain and AMI, and was comparable on CHiME-5, so we chose the factorized TDNN for our experiments.
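The factorization idea can be sketched as one TDNN (dilated 1-d convolution) layer split into two low-rank convolutions with a bottleneck in between. Note that the real F-TDNN also constrains one factor to be semi-orthogonal during training, which this simplified sketch does not enforce; all sizes are illustrative:

```python
import torch
import torch.nn as nn

class FactorizedTDNNLayer(nn.Module):
    """One F-TDNN-style layer: a wide TDNN layer replaced by two
    low-rank convolutions through a small bottleneck."""
    def __init__(self, in_dim=512, bottleneck=256, out_dim=512, dilation=1):
        super().__init__()
        self.factor1 = nn.Conv1d(in_dim, bottleneck, kernel_size=2,
                                 dilation=dilation)
        self.factor2 = nn.Conv1d(bottleneck, out_dim, kernel_size=2,
                                 dilation=dilation)
        self.act = nn.ReLU()
        self.norm = nn.BatchNorm1d(out_dim)

    def forward(self, x):                  # (batch, feat, frames)
        return self.norm(self.act(self.factor2(self.factor1(x))))

layer = FactorizedTDNNLayer()
print(layer(torch.randn(4, 512, 200)).shape)
```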
0:10:22 Now let's focus on the speech enhancement at the feature level. We addressed how to train an unsupervised speech enhancement system which can be used as a front-end preprocessing module to improve the quality of the features before they are passed to the embedding extractor. The main idea here is to use an unsupervised adaptation system based on CycleGANs. We train a CycleGAN network using log filterbank features as input to each of the generator networks, so we have a clean source signal on the left and the real, target-domain data on the right. During testing, we map the test data towards the target signal; these enhanced acoustic features are then used by the embedding extractors. Even though the CycleGAN network was trained for dereverberation, we also tested it on noisy datasets, showing improvements.
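A minimal sketch of the CycleGAN objective, assuming generators G (clean to wild) and F (wild to clean) over feature frames; the tiny linear stand-in networks and the weight lam are placeholders, not the models from the talk:

```python
import torch
import torch.nn as nn

def cycle_gan_losses(G, F, D_clean, D_wild, clean, wild, lam=10.0):
    """Generator-side CycleGAN objective. G maps clean -> wild,
    F maps wild -> clean; D_* are the two discriminators."""
    mse, l1 = nn.MSELoss(), nn.L1Loss()
    fake_wild, fake_clean = G(clean), F(wild)

    # Least-squares adversarial terms: fool each discriminator.
    adv = mse(D_wild(fake_wild), torch.ones_like(D_wild(fake_wild))) \
        + mse(D_clean(fake_clean), torch.ones_like(D_clean(fake_clean)))

    # Cycle consistency: clean -> wild -> clean (and the reverse) must
    # reconstruct the input; this is what keeps training unsupervised,
    # since no paired clean/wild utterances are needed.
    cyc = l1(F(fake_wild), clean) + l1(G(fake_clean), wild)
    return adv + lam * cyc

# Tiny stand-in networks over 40-dim log filterbank frames.
G, F = nn.Linear(40, 40), nn.Linear(40, 40)
D_clean, D_wild = nn.Linear(40, 1), nn.Linear(40, 1)
loss = cycle_gan_losses(G, F, D_clean, D_wild,
                        torch.randn(8, 40), torch.randn(8, 40))
# At test time only F is kept: enhanced = F(wild_features).
```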
0:11:28 Now let's continue with the overlap detection. This architecture might also sound familiar here: it is exactly the same as the one used for the VAD approach, but now trained in a way that decides between overlapped and non-overlapped speech. It can also be trained to perform the VAD task at the same time, but the dedicated approach showed better results.
0:11:57 Let's continue with the overlap assignment. From the resegmentation we get a posterior matrix for each of the speakers; in this example the most probable speakers are in rows one and two. We can combine this with the overlap detector and also with the VAD. Merging these results, we get what we call the overlap assignment: in regions where the overlap detector tells us that we have two speakers, we put there the two most probable speakers. With this part, we completed our diarization system.
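A minimal sketch of this merging step: per-frame speaker posteriors combined with the VAD and overlap flags, assigning the top two speakers wherever overlap is detected; the matrix shapes are assumptions:

```python
import numpy as np

def assign_overlap(posteriors: np.ndarray, vad: np.ndarray,
                   overlap: np.ndarray):
    """posteriors: (frames, speakers) from clustering/resegmentation;
    vad, overlap: boolean per-frame flags. Returns, per frame, the list
    of assigned speakers: none for silence, the top speaker for
    single-speaker frames, the two most probable speakers in overlap."""
    labels = []
    for t in range(posteriors.shape[0]):
        if not vad[t]:
            labels.append([])              # silence
            continue
        order = np.argsort(posteriors[t])[::-1]
        n = 2 if overlap[t] else 1         # two speakers in overlap regions
        labels.append(order[:n].tolist())
    return labels

post = np.random.dirichlet(np.ones(4), size=50)   # 50 frames, 4 speakers
vad = np.ones(50, dtype=bool)
ovl = np.zeros(50, dtype=bool); ovl[10:20] = True
print(assign_overlap(post, vad, ovl)[8:12])
```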
0:12:45 But now the question is: which combination of all these things gives good results? In our case, we put together the TDNN VAD, the enhancement, the VB resegmentation, and the overlap assignment. For all the corpora we got nice improvements. For example, on AMI we went from a 49% diarization error rate to a 30% diarization error rate. For the CHiME-5 corpus, with the same combination, we went from a 69% diarization error rate to a 63% diarization error rate. And finally, for BabyTrain, we got a nice improvement from an 85% diarization error rate to a 47% diarization error rate.
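For reference, the metric quoted here is the diarization error rate. A simplified frame-level version is sketched below; the official md-eval scoring additionally searches the optimal speaker mapping and applies a forgiveness collar:

```python
import numpy as np

def frame_der(ref, hyp):
    """Simplified diarization error rate over per-frame speaker labels
    (0 = silence), assuming hypothesis speakers are already mapped to
    reference speakers."""
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    speech = ref != 0
    missed = np.sum(speech & (hyp == 0))
    falarm = np.sum(~speech & (hyp != 0))
    confusion = np.sum(speech & (hyp != 0) & (ref != hyp))
    return (missed + falarm + confusion) / max(np.sum(speech), 1)

print(frame_der([1, 1, 2, 2, 0, 0],
                [1, 2, 2, 2, 0, 1]))   # 2 errors / 4 speech frames = 0.5
```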
0:13:45 It is important to note here that, put together, these modules really improved the system.
0:13:54 This is the speaker detection pipeline. We have the enhancement at the signal level and also at the embedding level; we have the diarization segmentation; we have the embedding extractor, the back-end, the calibration; and finally we get the speaker detection. The boxes in orange use the same techniques as in diarization. So we use the enhancement at two levels: at the signal level and also at the embedding level. The diarization segmentation is fed into the embedding extractor, and the pipeline continues. The embedding extractor, as we already emphasized before, is a factorized TDNN, which is getting the best results for speaker ID. We also used an enhancement module for this embedding extractor. And finally we have the back-end and the calibration: the back-end uses a PLDA, fed from the diarization, with augmentation, and the calibration stage leads directly to the speaker detection.
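A minimal sketch of this back-end step, with cosine scoring standing in for the PLDA log-likelihood ratio and a linear calibration whose constants are placeholders, not trained values:

```python
import numpy as np

def cosine_score(enroll: np.ndarray, test: np.ndarray) -> float:
    # Stand-in for the PLDA scoring used in the talk.
    return float(enroll @ test /
                 (np.linalg.norm(enroll) * np.linalg.norm(test)))

def calibrate(score: float, a: float, b: float) -> float:
    # Linear calibration (typically trained with logistic regression)
    # mapping raw scores to well-behaved log-likelihood ratios.
    return a * score + b

def detect(enroll, cluster_embeddings, a=8.0, b=-2.0, threshold=0.0):
    """Binary decision: does the target speaker appear in any diarized
    cluster? a, b, and threshold are illustrative placeholders."""
    llrs = [calibrate(cosine_score(enroll, e), a, b)
            for e in cluster_embeddings]
    return max(llrs) >= threshold, llrs

present, llrs = detect(np.random.randn(128),
                       [np.random.randn(128) for _ in range(4)])
print(present, [round(s, 2) for s in llrs])
```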
0:15:11 The combination that gave the best results for all of our corpora includes the speech enhancement, the spectral augmentation, and the PLDA with augmentation. It is important to note that this pipeline includes the diarization as its first stage. For AMI we got an improvement, going from a 17% equal error rate to a 2% equal error rate; in terms of minDCF and actual DCF, shown at the bottom, we can also see some improvement. For BabyTrain we can observe the same trend, going from a 14% equal error rate to a 9% equal error rate; at the bottom we can observe the minDCF and the actual DCF: the minDCF got an improvement, but the actual DCF did not. For the SRI data our system also improved the results, going from a 21% equal error rate to a 16% equal error rate, and the minDCF and the actual DCF for SRI show the same trend, also getting improvements.
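Since the results are reported as EER and minDCF, here is a small sketch of how both can be computed from target and non-target trial scores; the prior and costs are common defaults, not necessarily those used in the talk:

```python
import numpy as np

def eer(tar, non):
    """Equal error rate via a simple threshold sweep: the operating
    point where the miss rate and false-alarm rate cross."""
    thr = np.sort(np.concatenate([tar, non]))
    miss = np.array([np.mean(tar < t) for t in thr])
    fa = np.array([np.mean(non >= t) for t in thr])
    i = np.argmin(np.abs(miss - fa))
    return (miss[i] + fa[i]) / 2

def min_dcf(tar, non, p_tgt=0.05, c_miss=1.0, c_fa=1.0):
    """Minimum detection cost over all thresholds."""
    thr = np.sort(np.concatenate([tar, non]))
    return min(p_tgt * c_miss * np.mean(tar < t)
               + (1 - p_tgt) * c_fa * np.mean(non >= t) for t in thr)

rng = np.random.default_rng(0)
tar = rng.normal(1.5, 1.0, 1000)      # target-trial scores
non = rng.normal(0.0, 1.0, 1000)      # non-target-trial scores
print(f"EER={eer(tar, non):.3f}  minDCF={min_dcf(tar, non):.3f}")
```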
0:16:34 Finally, some takeaways I'd like to mention. Diarization is a fundamental stage for performing speaker detection. There are some modules that are really needed to have a competitive system: of course, a good enhancement, a good VAD, good embeddings, and overlap detection and assignment. The speaker detection depends not only on the diarization module, but also on the embedding extractor and on the augmentation.

0:17:12 The future directions of this work are as follows. For the signal-to-signal enhancement and speaker separation, we need some customization; it could be by dataset, by speaker, or by task. For the speech enhancement, we have to explore other architectures, such as transformers, and large-scale training. For the VAD, we need ways to handle domain mismatch, which can be done, for example, using domain adversarial training. For the clustering, we need unsupervised adaptation, to take the overlap into account during the clustering, and also to include the transcription in parallel with the speaker labels. For the speaker detection, we need some enhancement for the multi-speaker scenario; that means highlighting the speaker of interest, and also performing better clustering for short segments.
0:18:12 This is our amazing team. I would like to thank all of them very much. Thank you. Questions?