Speech Transcript - Speech Bandwidth Expansion For Speaker Recognition On Telephony Audio

0:00:15	hello everyone
0:00:17	i am ganesh sivaraman
0:00:18	i am a research scientist at pindrop security
0:00:22	based in atlanta usa
0:00:25	i'm here to present our paper
0:00:28	titled speech bandwidth expansion for speaker recognition on telephony audio
0:00:38	this is the overview of my talk
0:00:41	i will start by giving a motivation as to why we need bandwidth expansion
0:00:45	followed by explaining the problem statement
0:00:48	and then i will describe some prior research in this area
0:00:53	we will then explain
0:00:55	the bandwidth expansion system that we propose in this paper
0:00:58	and show some results of bandwidth expansion performance
0:01:03	finally
0:01:04	i will discuss the speaker verification experiments that we performed
0:01:10	and the results that we obtained with the bandwidth expansion system
0:01:17	in this paper we refer to
0:01:20	audio that is sampled at sixteen khz
0:01:24	as wideband audio
0:01:26	typically the audio that is sampled at sixteen khz and has frequency content
0:01:33	between zero to eight khz
0:01:35	is called wideband audio in this paper
0:01:39	and traditional telephony audio
0:01:41	which is band limited to three hundred to three thousand four hundred
0:01:46	hz
0:01:47	and sampled at eight khz is referred to as
0:01:51	narrowband audio in this paper
0:01:55	speaker verification systems
0:01:57	typically work well on wideband audio
0:02:02	this is because
0:02:04	the higher frequency content between four and eight khz
0:02:10	in wideband audio
0:02:12	is helpful in speaker discrimination
0:02:17	the wideband trained systems
0:02:20	the wideband audio trained systems perform poorly
0:02:24	on narrowband audio due to the mismatch in the training and testing conditions
0:02:29	so the lack of higher frequency information in the narrowband speech leads to
0:02:35	the degraded performance
0:02:37	so the question that we raise in this paper is can we estimate the
0:02:42	higher frequency content that is missing in narrowband audio
0:02:47	in such a way that it improves the performance
0:02:50	on wideband trained
0:02:53	speaker verification systems
0:02:55	so this is the problem statement
0:02:57	the narrowband audio
0:03:00	which is band limited to four khz
0:03:04	is shown on the left
0:03:06	in this figure we have shown the spectrogram
0:03:10	showing the frequencies between zero and
0:03:13	eight kilohertz
0:03:15	you see that there is no information or frequency content between four and eight kilohertz
0:03:20	and the objective of the bandwidth expansion system is to use the lower frequency content
0:03:25	of the narrowband audio to estimate the missing higher frequency content that is typically present
0:03:32	in the wideband audio
0:03:36	the estimation of the higher frequency content is to be done in such a way
0:03:41	that it improves the performance of speaker verification systems
0:03:49	there has been a lot of research conducted in bandwidth expansion
0:03:54	the earliest approaches to bandwidth expansion divided the problem into two parts
0:03:59	estimating the envelope of the spectrum and the excitation signal of the
0:04:06	speech
0:04:07	the envelope estimation is typically
0:04:10	done using spline fitting cubic spline fitting or gaussian mixture model based approaches
0:04:17	and
0:04:18	spectral folding is used for
0:04:21	extending the excitation signal
0:04:25	so these are the earliest approaches in bandwidth expansion
0:04:30	more recent approaches use neural network based bandwidth expansion and
0:04:35	these kind of deep neural network based systems have shown improvement in the performance of
0:04:41	asr systems
0:04:42	that are trained on wideband speech
0:04:46	more recent work in speaker verification related to bandwidth expansion
0:04:51	has used
0:04:53	deep residual networks
0:04:55	and bidirectional lstm network architectures for performing bandwidth expansion
0:05:01	this work has also shown significant improvement in the performance of speaker verification systems
0:05:11	in this
0:05:12	paper we propose a novel bandwidth expansion system
0:05:15	that is lightweight compared to all the systems proposed in the literature
0:05:22	in this system the bandwidth expansion is performed using a cnn
0:05:27	dnn network architecture
0:05:30	a feed-forward cnn dnn architecture in which there is a
0:05:34	single convolutional layer
0:05:36	which performs one d convolution along the time axis
0:05:41	followed by three feed-forward layers
0:05:44	containing
0:05:45	one thousand twenty four nodes in each layer
0:05:50	there are sixty four filters in the convolutional layer
0:05:54	and after the convolution operation the feature maps are flattened and fed to the feed-forward
0:06:00	layers
0:06:01	this is the architecture of the dnn
0:06:04	that performs the bandwidth expansion
0:06:06	the input
0:06:07	to the deep neural network is
0:06:11	the narrowband log spectrum
0:06:15	narrowband log spectrum so we extract the spectrogram
0:06:20	from the eight khz telephony audio
0:06:23	and we perform
0:06:25	the mean and variance normalization of the spectrum
0:06:30	and compute the logarithm
0:06:32	of
0:06:34	of the spectrum and feed it as input to the network
0:06:39	the output of the network tries to estimate the complete
0:06:44	spectrum of the
0:06:46	corresponding wideband signal
0:06:49	the input to the network
0:06:51	consists of
0:06:52	eleven frames
0:06:54	of one twenty eight dimensional narrowband log spectrum
0:06:59	the features are computed at twenty millisecond frame size and ten millisecond frame rate
0:07:06	the network output is a two fifty seven dimensional wideband log spectrum
0:07:12	the network is trained with the mean squared error loss and adam optimiser
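
The frame and layer sizes quoted in the talk can be sanity checked with a little arithmetic. The sketch below is only an illustration, not the authors' code. The power-of-two FFT size, the kernel width of 3, the unpadded convolution, the separate linear output layer, and the choice to treat the 128 spectral bins as input channels are all assumptions that the talk does not state. The talk quotes a 128 dimensional input, which would correspond to the 129 narrowband bins with one bin dropped (also an assumption).

```python
def fft_bins(sample_rate_hz, frame_ms):
    """Return (samples per frame, next power-of-two FFT size, one-sided bin count)."""
    frame_len = sample_rate_hz * frame_ms // 1000
    n_fft = 1
    while n_fft < frame_len:
        n_fft *= 2
    return frame_len, n_fft, n_fft // 2 + 1


def layer_sizes(context=11, in_dim=128, filters=64, kernel=3,
                hidden=1024, out_dim=257):
    """Weight counts for one 1-D convolution over time, three hidden layers
    and a linear output layer. kernel=3 and the unpadded convolution are
    hypothetical choices, not values given in the talk."""
    steps = context - kernel + 1          # time steps left after the convolution
    flat = steps * filters                # size of the flattened feature maps
    conv = in_dim * kernel * filters + filters
    dense = [
        flat * hidden + hidden,           # first feed-forward layer
        hidden * hidden + hidden,         # second feed-forward layer
        hidden * hidden + hidden,         # third feed-forward layer
        hidden * out_dim + out_dim,       # output layer (257 wideband bins)
    ]
    return {"flat": flat, "conv": conv, "dense": dense,
            "total": conv + sum(dense)}


print(fft_bins(8000, 20))    # (160, 256, 129)
print(fft_bins(16000, 20))   # (320, 512, 257)
print(layer_sizes()["total"])
```

Under these assumptions the 257 dimensional output matches a 512 point FFT on 20 millisecond frames of 16 kHz audio, and most of the weights sit in the feed-forward layers rather than in the convolution.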
0:07:19	after the network output is obtained
0:07:23	the mean and variance computed from the input narrowband spectrum is added back
0:07:30	to the wideband spectra
0:07:33	after adding back the mean and variance
0:07:35	an inverse
0:07:37	low pass filter or
0:07:40	an inverse filtering is applied
0:07:42	to bring up the energy content in the higher frequencies
0:07:47	this is mainly done
0:07:49	in order to compensate for the mean values of the energies
0:07:54	in the higher frequencies
0:07:59	the output of this system
0:08:01	is the wideband log spectrum which is further processed
0:08:05	for
0:08:06	speaker verification
0:08:08	this bandwidth expansion
0:08:10	dnn system is trained on the librispeech dataset
0:08:16	and the vctk dataset
0:08:20	this is the inverse filtering that is used to reverse the low-pass filtering effect
0:08:25	the mean and variance of the narrowband log spectrum is added back to the
0:08:29	estimated wideband log spectrum which is the output of the dnn
0:08:34	the higher frequency energies of the narrowband audio are attenuated due to low pass filtering
0:08:40	to reverse this low pass filtering we add back
0:08:44	this filter we add this
0:08:49	inverse filter in the log domain to
0:08:53	the estimated wideband spectrum
0:08:55	thereby getting back the properly normalized wideband spectrum estimate
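
The normalization and restoration steps described here can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes a single mean and standard deviation computed over the whole narrowband log spectrum, and it models the inverse filter as a per-bin offset added in the log domain. The function names and the offset values are hypothetical.

```python
import math


def global_stats(log_spec):
    """Mean and standard deviation over all frames and bins (assumed scalar stats)."""
    values = [v for frame in log_spec for v in frame]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    return mean, (std if std > 0.0 else 1.0)   # guard against a constant input


def normalize(log_spec, mean, std):
    """Mean and variance normalization of a log spectrum given as a list of frames."""
    return [[(v - mean) / std for v in frame] for frame in log_spec]


def restore(est_norm, mean, std, inverse_filter_log):
    """Add the narrowband statistics back to the network output, then add a
    per-bin inverse filter offset in the log domain (hypothetical values)."""
    return [
        [v * std + mean + g for v, g in zip(frame, inverse_filter_log)]
        for frame in est_norm
    ]


narrowband = [[1.0, 2.0], [3.0, 4.0]]
mean, std = global_stats(narrowband)
normalized = normalize(narrowband, mean, std)
# Pretend the network returned its input unchanged, and boost the second bin.
restored = restore(normalized, mean, std, [0.0, 0.5])
```

Adding the offset in the log domain corresponds to multiplying the magnitude spectrum by a gain, which is why the attenuation caused by low pass filtering can be undone with a simple addition.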
0:09:03	the data for training the bandwidth expansion system is simulated using our telephony
0:09:13	codec simulation software
0:09:17	the librispeech and vctk datasets
0:09:20	are both
0:09:22	wideband audio datasets librispeech has a sampling rate of sixteen khz and
0:09:27	vctk is originally forty eight khz audio which we bring down by
0:09:33	downsampling to sixteen khz
0:09:36	both these datasets are clean speech wideband data at sixteen khz
0:09:44	in order to simulate telephony artifacts in the wideband speech we perform
0:09:53	coding and decoding using three different
0:09:57	audio codecs the three audio codecs that we used for simulating the telephony data are
0:10:03	the adaptive multi rate narrowband amr nb
0:10:07	the opus narrowband codec and the silk narrowband codec
0:10:12	so these three codecs cover a wide range of telephony applications that are commonly
0:10:17	used as you can see from this table amr nb is typically used
0:10:22	in mobile telephony
0:10:23	opus is used in voip applications like whatsapp playstation four etc and silk is also
0:10:30	used in voip applications
0:10:35	so given a sixteen khz wideband audio from the librispeech
0:10:42	dataset or the vctk dataset we
0:10:48	pass the audio through the
0:10:50	audio coding application
0:10:53	which converts it into a coded signal
0:10:58	and then we pass it through the audio codec decoder to get back the telephony
0:11:03	distorted narrowband signal so this is how the sixteen khz audio
0:11:11	is converted to eight khz telephony distorted audio
0:11:16	to simulate the dataset for bandwidth expansion training
0:11:22	the bandwidth expansion system is trained on a hundred hours of librispeech and
0:11:28	vctk data
0:11:29	the performance of the bandwidth expansion is computed by the log spectral distortion measure which
0:11:38	is basically the mean squared error between the estimated wideband spectrum and the
0:11:46	actual wideband spectrum
0:11:48	in the log domain
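
The talk describes the log spectral distortion only informally. The sketch below uses one common definition, the root mean squared difference between the two log spectra over frequency, averaged over frames. Whether the paper uses exactly this form is an assumption, and the bin ranges passed to the band helper are hypothetical.

```python
import math


def log_spectral_distortion(reference, estimate):
    """Average over frames of the RMS difference between two log spectra.
    Both arguments are lists of frames, each frame a list of log magnitudes."""
    per_frame = []
    for ref_frame, est_frame in zip(reference, estimate):
        mse = sum((r - e) ** 2 for r, e in zip(ref_frame, est_frame)) / len(ref_frame)
        per_frame.append(math.sqrt(mse))
    return sum(per_frame) / len(per_frame)


def band_lsd(reference, estimate, first_bin, last_bin):
    """Distortion restricted to bins first_bin..last_bin (inclusive), for example
    the lower or the upper half of the wideband spectrum."""
    ref_band = [frame[first_bin:last_bin + 1] for frame in reference]
    est_band = [frame[first_bin:last_bin + 1] for frame in estimate]
    return log_spectral_distortion(ref_band, est_band)
```

Splitting the score by band, as the results in the talk do, separates the error in the estimated high frequencies from the distortion introduced in the band that was already present.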
0:11:50	so the
0:11:51	log spectral distortion
0:11:54	results are shown here
0:11:57	the simple upsampling of
0:12:00	narrowband audio gives a log spectral distortion of one point seven nine
0:12:05	three in the higher frequency band by doing simple upsampling we are not adding
0:12:10	any new information to the audio all that simple upsampling does is
0:12:17	perform
0:12:19	interpolation between samples
0:12:21	followed by
0:12:24	a low pass filter so interpolation followed by smoothing that is
0:12:31	simple upsampling
0:12:32	and simple upsampling gives a log spectral distortion of one point seven nine three
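
As an illustration of why simple upsampling adds no new information, the sketch below doubles the sampling rate by linear interpolation, so every new sample is just the average of its two neighbours. This is a deliberately simplified stand-in: practical resamplers typically use a windowed sinc low pass filter, and the talk does not say which method was used.

```python
def upsample_by_two(samples):
    """Double the sampling rate by inserting the midpoint between neighbouring
    samples. The final sample is repeated because it has no right neighbour."""
    out = []
    for i, current in enumerate(samples):
        following = samples[i + 1] if i + 1 < len(samples) else current
        out.append(current)
        out.append((current + following) / 2.0)
    return out


print(upsample_by_two([0.0, 2.0, 4.0]))  # [0.0, 1.0, 2.0, 3.0, 4.0, 4.0]
```

Every output value is a linear combination of the input samples, so the band between four and eight khz stays essentially empty apart from interpolation artifacts.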
0:12:38	the bandwidth expanded system output
0:12:43	has
0:12:45	an lsd value of one point two nine one which is a significant reduction
0:12:48	compared to
0:12:51	the simple upsampled signal
0:12:54	also when we do the bandwidth expansion we estimate the complete
0:13:00	spectrum of the audio
0:13:02	that is the spectrum from zero
0:13:04	to eight khz of the
0:13:06	wideband audio
0:13:07	we also compute the log spectral distortion of
0:13:11	the lower band
0:13:13	as well
0:13:15	so in the lower frequency band zero to four khz which is already
0:13:19	present
0:13:21	in the narrowband spectrum
0:13:23	simple upsampling gives a
0:13:26	log spectral distortion of point nine three four
0:13:29	whereas the bandwidth expanded system output
0:13:32	has
0:13:33	a distortion of one point zero two nine
0:13:36	so this means that the bandwidth expansion system
0:13:41	introduces a minor amount of distortion
0:13:43	in the lower frequencies
0:13:45	compared to simple upsampling
0:13:49	and also remember
0:13:50	that
0:13:51	the audio codecs that we applied
0:13:54	to simulate the telephony audio would have introduced more distortions in the lower frequencies
0:14:00	that is why
0:14:02	there is
0:14:03	a significant amount of log spectral distortion
0:14:06	even in the lower frequencies
0:14:08	for the simple upsampled signal
0:14:12	finally this table shows that the bandwidth expansion system clearly improves the
0:14:18	spectral estimation of higher frequency content
0:14:22	this is an example plot of the output of the bandwidth expansion system on top
0:14:27	is the eight khz narrowband telephony audio
0:14:31	we have
0:14:32	performed simple upsampling of the telephony audio
0:14:35	to show the spectrogram
0:14:37	you see no frequency content in
0:14:41	between four and eight kilohertz
0:14:43	the bandwidth expanded output
0:14:46	is shown in the
0:14:48	middle pane
0:14:49	you see that the higher frequencies are estimated
0:14:55	by the
0:14:56	pretty well by the bandwidth expansion system
0:14:59	and the bottom pane
0:15:01	shows the sixteen khz reference
0:15:07	next we will move to the speaker verification experiments
0:15:13	the speaker verification experiments in this paper are performed
0:15:17	on a speaker verification system
0:15:19	as shown in this figure
0:15:22	our speaker verification system is a convolutional neural network based speaker embedding system
0:15:29	it consists of five convolutional layers
0:15:32	followed by
0:15:33	a statistics pooling layer
0:15:36	followed by two fully connected layers
0:15:40	and the output is a softmax
0:15:43	layer predicting speaker labels
0:15:46	the input to the speaker embedding system is thirty dimensional
0:15:50	mfcc features
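
A statistics pooling layer turns a variable number of frame-level vectors into one fixed-length vector. The sketch below shows the usual form, the per-dimension mean concatenated with the per-dimension standard deviation. It is a generic illustration in plain Python, and the details of the authors' layer are not given in the talk.

```python
import math


def statistics_pooling(frames):
    """Pool a list of equal-length frame vectors into [means..., stds...]."""
    n_frames = len(frames)
    dim = len(frames[0])
    mean = [sum(frame[d] for frame in frames) / n_frames for d in range(dim)]
    std = [
        math.sqrt(sum((frame[d] - mean[d]) ** 2 for frame in frames) / n_frames)
        for d in range(dim)
    ]
    return mean + std


print(statistics_pooling([[1.0, 2.0], [3.0, 6.0]]))  # [2.0, 4.0, 1.0, 2.0]
```

The output length depends only on the feature dimension, which is what lets the fully connected layers that follow work on utterances of any duration.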
0:15:52	the training of the speaker recognition system
0:15:57	the speaker embedding system is performed in two stages
0:16:00	in the first stage we use
0:16:04	a softmax output
0:16:05	of speaker labels
0:16:07	and train the network with
0:16:09	a cross entropy loss
0:16:11	training
0:16:12	in stage two
0:16:13	we remove the second fully connected layer and the output layer
0:16:18	and attach a different fully connected layer
0:16:23	at the output
0:16:25	freeze all the layers
0:16:28	before that
0:16:29	and train the network with large margin cosine loss
0:16:33	this is how the speaker embedding system is trained in two stages
0:16:38	this system is trained
0:16:41	completely on the voxceleb two dataset which consists of
0:16:45	sixteen khz clean audio
0:16:48	so this is a wideband trained system so we train
0:16:51	two different speaker embedding systems
0:16:54	using the same architecture
0:16:56	one system is trained only on the voxceleb two
0:17:01	sixteen khz audio
0:17:03	the second system we train with mixed training
0:17:07	where we use both wideband audio
0:17:09	and the downsampled and bandwidth expanded
0:17:14	audio
0:17:15	so we pass the voxceleb two dataset
0:17:19	through codec
0:17:21	distortion
0:17:21	followed by
0:17:23	the bandwidth expansion
0:17:25	using the
0:17:26	bandwidth expansion system that we proposed in this paper
0:17:30	and then we combine the two datasets the original wideband
0:17:34	voxceleb
0:17:35	and the bandwidth expanded
0:17:37	downsampled voxceleb
0:17:40	to train
0:17:41	the speaker recognition system
0:17:43	the speaker recognition speaker embedding
0:17:45	we train the speaker embedding
0:17:47	so note that
0:17:49	both of these systems are trained on wideband audio
0:17:53	one is trained on the original wideband data
0:17:56	the second wb plus bwe is trained on wideband
0:18:00	plus
0:18:01	bandwidth expanded audio
0:18:05	these are all the results
0:18:06	here
0:18:07	the speaker verification results are shown in this bar graph
0:18:13	this bar graph shows the speaker verification equal error rates that we obtain
0:18:19	using the wideband only trained system
0:18:24	so the system is trained only on voxceleb sixteen khz wideband audio
0:18:28	you see that we perform speaker verification tests
0:18:33	on
0:18:34	four different test sets
0:18:37	the voxceleb one
0:18:38	e set
0:18:40	the sitw
0:18:42	dev set
0:18:44	the speakers in the wild eval set
0:18:46	and the nist sre two thousand ten ten second test set
0:18:54	so these are the four test sets that we computed the results on
0:19:00	the green
0:19:02	you can see
0:19:04	that performing bandwidth expansion
0:19:07	reduces the equal error rate
0:19:10	compared to the simply upsampled signal
0:19:13	so the plot in orange shows the equal error rate obtained using
0:19:19	simple upsampling
0:19:22	and the bar plot in blue shows
0:19:26	the results
0:19:27	after bandwidth expansion
0:19:29	note that
0:19:30	the voxceleb one
0:19:31	e set
0:19:32	the sitw
0:19:34	dev set and the sitw eval set
0:19:36	are all
0:19:37	sixteen khz
0:19:40	audio
0:19:41	we pass these test sets
0:19:44	through the codec distortion
0:19:47	simulation that we developed in this paper to simulate telephony audio
0:19:53	and then we apply speaker verification on top of this telephony distorted
0:19:58	audio on the distorted test set and those are
0:20:03	the results shown in the orange plot
0:20:08	the blue plot is the output of the bandwidth expansion system
0:20:13	when you pass the telephony distorted test sets through the bandwidth expansion
0:20:19	system
0:20:20	note that the nist sre two thousand ten test set
0:20:25	is
0:20:26	telephony it consists of only telephony audio
0:20:29	so it
0:20:31	is inherently a narrowband speech signal
0:20:34	recorded using real telephony audio
0:20:37	so we don't have
0:20:42	we don't have the results for
0:20:47	the nist sre dataset
0:20:51	for
0:20:52	the wideband audio because there is no wideband audio in this dataset
0:20:57	so we have only the
0:21:00	results
0:21:01	for simple upsampling and bandwidth expansion
0:21:05	so you see that even in the unseen case of the nist sre dataset which consists of
0:21:10	real telephony audio
0:21:12	there is a significant improvement in the equal error rate
0:21:16	finally we show the results on the mixed trained system even on the mixed trained
0:21:21	system there is a significant improvement in the equal error rates
0:21:25	across all the test sets
0:21:28	a particular point to note here is that the equal error rates
0:21:34	obtained
0:21:34	on the original wideband test sets
0:21:40	are much lower
0:21:42	than what we obtained with the wideband trained system
0:21:45	so initially for example for the voxceleb one e set
0:21:50	we obtained four point one two percent eer
0:21:55	but after the mixed training the eer reduces to three point two percent
0:22:01	that means that the bandwidth expansion has helped improve the performance even on the original
0:22:05	sixteen khz audio
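
The size of this improvement can be expressed as a relative equal error rate reduction, the same kind of figure quoted in the conclusions. The small sketch below assumes the two values heard in the talk are 4.12 percent and 3.2 percent. The first value is uncertain in the recording, so the result is only indicative.

```python
def relative_reduction(baseline_eer, new_eer):
    """Relative reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


# Values as heard in the talk (assumed): wideband trained vs mixed trained system.
improvement = relative_reduction(4.12, 3.2)
print(f"{improvement:.1f} percent relative reduction")
```

With these inputs the reduction comes out at a little over 22 percent relative.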
0:22:09	so these are the conclusions of our paper the bandwidth expansion system that we proposed
0:22:14	in this paper performs significantly better
0:22:19	than simple upsampling
0:22:21	we obtain a relative equal error rate reduction of four point four percent
0:22:25	on the nist sre two thousand ten ten second
0:22:29	and a nine point ninety percent improvement on the sitw eval
0:22:34	set and eleven point one percent improvement on the sitw dev test set
0:22:39	the bandwidth expansion also improved the accuracy on the original sixteen khz data
0:22:47	across all
0:22:48	the protocols across all the test sets
0:22:51	which means that the bandwidth expansion system is helping as an augmentation mechanism for training
0:22:59	speaker verification or for training the speaker recognition system
0:23:03	so the proposed bandwidth expansion system is also a significantly lightweight system
0:23:08	compared to other systems that were recently proposed
0:23:12	and the system can be
0:23:14	deployed and used even in a real time scenario
0:23:20	these are some references that i have referred to in this presentation
0:23:26	please refer to the paper for further details and results
0:23:31	and
0:23:33	i will be glad to answer your questions
0:23:36	thank you for listening to my talk
0:23:39	i look forward to your
0:23:40	questions and discussions regarding this paper
0:23:43	thank you

Speech Bandwidth Expansion For Speaker Recognition On Telephony Audio

Speech Application

Ganesh Sivaraman, Amruta Vidwans, Elie Khoury