0:00:14 | hi everyone, today i'll be talking about my work on a low-power text-

0:00:18 | dependent speaker verification system |

0:00:20 | with narrowband feature pre-selection and weighted dynamic time warping |

0:00:24 | this is joint work with Texas Instruments

0:00:27 | so first to motivate the work |

0:00:29 | with the increasing popularity of mobile devices such as smartphones and smart watches, there's

0:00:35 | a rising interest in enabling

0:00:38 | voice authenticated |

0:00:41 | speaker verification |

0:00:42 | that is, the device continuously listens to its surroundings and wakes up when

0:00:49 | a designated pass phrase is pronounced by

0:00:52 | the target speaker

0:00:54 | so existing wakeup systems are implemented on the host device using digital solutions

0:01:01 | because those host devices are usually designed for general-purpose applications

0:01:06 | signals are usually acquired at rates much higher than the Nyquist rate

0:01:10 | in order to minimize information loss at the early stage and to enable

0:01:14 | flexible downstream processing, therefore these systems usually involve many stages of processing on high-dimensional

0:01:20 | data

0:01:21 | and therefore result in power consumption in the range of hundreds of milliwatts

0:01:27 | so the opportunity lies in the fact that if you design an application-specific system

0:01:32 | the desired output may be obtained in a more direct manner

0:01:36 | with less processing, by performing early-stage signal dimension reduction with analog components

0:01:41 | and adaptive data processing, so our goal is to design a low-power

0:01:48 | voice-authenticated wakeup system whose power consumption is limited to the range of a

0:01:53 | couple hundred microwatts

0:01:56 | in order to achieve this |

0:01:58 | we proposed a new architecture |

0:02:00 | so in conventional systems |

0:02:02 | the processing pipeline usually involves |

0:02:06 | high sampling rate fast processing |

0:02:08 | and due to the high dimensional features |

0:02:11 | the downstream processing usually requires high-complexity

0:02:15 | computation, therefore incurring high power consumption

0:02:18 | in contrast, the proposed system pre-selects a set of low-dimensional

0:02:24 | spectral features which can be efficiently extracted using analog components, and because these

0:02:32 | pre-selected features are sparse in the frequency domain

0:02:36 | this enables

0:02:37 | low-rate sampling and low-rate processing

0:02:42 | and therefore achieves low power consumption

0:02:47 | so in the remainder of the talk i'll describe the two components of the system |

0:02:52 | the spectral feature pre-selection component and the speaker verification back end

0:02:59 | so first i'll describe the feature pre-selection process

0:03:03 | in this part |

0:03:04 | we'll show that with a few carefully selected narrowband filters

0:03:09 | we are capable of capturing the most essential speech information

0:03:15 | first, a quick review

0:03:19 | so the speech signal s(t)

0:03:31 | can be represented as the convolution between the excitation signal u(t)

0:03:36 | and the vocal tract modulation signal h(t), and h(t)

0:03:41 | contains the essential speech information

0:03:44 | this convolution relationship can become separable in the cepstral domain |

0:03:48 | so here i plot the power spectral density of speech

0:03:53 | in the frequency domain convolution becomes multiplication, and when you take the log of the

0:03:58 | power spectral density, multiplication becomes addition; notice that the peaks of the power spectral

0:04:05 | density correspond to the harmonics of the fundamental frequency

0:04:08 | and the overall envelope corresponds to the vocal tract modulation

0:04:13 | and to transform the signal to the cepstral domain we take the inverse fourier transform |

0:04:19 | and it turns out that speech is

0:04:22 | sparse in the cepstral domain: it consists of two main components, the vocal tract modulation

0:04:27 | component h(τ), which is represented by this narrow rectangle here with cutoff

0:04:32 | quefrency θ_h, and the excitation component u(τ), which

0:04:38 | is represented by this

0:04:41 | delta component at θ_e
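To make this decomposition concrete, here is a small numerical sketch (not from the talk; all signal parameters are illustrative): a synthetic "speech" signal built as an impulse-train excitation convolved with a decaying envelope standing in for the vocal tract. Its cepstrum shows the low-quefrency h(τ) part and the excitation peak at the pitch period.

```python
import numpy as np

def real_cepstrum(x, n_fft=1024):
    """Real cepstrum: inverse FFT of the log power spectral density."""
    psd = np.abs(np.fft.fft(x, n_fft)) ** 2
    return np.fft.ifft(np.log(psd + 1e-12)).real

# Synthetic "speech": impulse-train excitation u(t) with pitch period 64
# samples, convolved with a decaying envelope h(t) standing in for the
# vocal tract response (all parameters illustrative).
n, period = 1024, 64
u = np.zeros(n)
u[::period] = 1.0                      # excitation u(t)
h = np.exp(-np.arange(40) / 8.0)       # vocal tract stand-in h(t)
x = np.convolve(u, h)[:n]

c = real_cepstrum(x, n)
# h(tau) lives at low quefrency; the excitation shows up as a peak at the
# pitch period (quefrency 64 samples), i.e. theta_e in the talk's notation.
pitch_q = 40 + int(np.argmax(c[40:100]))
```

Searching above the low-quefrency region recovers the pitch period (quefrency 64), which is exactly the separability the talk relies on.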

0:04:43 | so our goal is to extract h(τ), and this can be done by

0:04:47 | performing

0:04:49 | a transformation to the cepstral domain

0:04:51 | however this is power-expensive

0:04:53 | so the question is

0:04:55 | how to extract h(τ) without actually performing a transformation to

0:05:00 | the cepstral domain

0:05:01 | so let's begin by trying something simple: say we measure the speech at a number

0:05:06 | of points evenly spaced out across the frequency spectrum, on top of

0:05:12 | the harmonics, and let the spacing between the sampling points be denoted by Δp

0:05:18 | then what this corresponds to is multiplication between the power spectral density and an

0:05:25 | impulse train, and when transformed to the cepstral domain this corresponds to convolution between the speech

0:05:31 | cepstrum and a delta train

0:05:37 | so what we get here is that the cepstrum of the pointwise samples

0:05:42 | is an aliased version of the cepstrum of the original speech, and the aliased copies

0:05:47 | lie exactly on the multiples of 1/Δp
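This sampling-in-frequency / aliasing-in-quefrency duality can be checked numerically. The sketch below (illustrative synthetic signal, not from the talk) keeps only every Δp-th bin of the log power spectrum, on top of the harmonics, and verifies that the resulting short cepstrum equals the full cepstrum folded with period 1/Δp:

```python
import numpy as np

# Synthetic signal: impulse-train excitation (period 64 samples) convolved
# with a decaying stand-in for the vocal tract response.
n, period = 1024, 64
u = np.zeros(n)
u[::period] = 1.0
h = np.exp(-np.arange(40) / 8.0)
x = np.convolve(u, h)[:n]

log_psd = np.log(np.abs(np.fft.fft(x)) ** 2 + 1e-12)
c_full = np.fft.ifft(log_psd).real          # full 1024-point cepstrum

# Keep only every 16th frequency bin (spacing delta_p = f0 here):
c_sampled = np.fft.ifft(log_psd[::16]).real  # 64-point cepstrum

# Decimation in frequency == aliasing in quefrency: the 64-point cepstrum
# equals the full cepstrum folded (summed) with period 64 = 1/delta_p.
aliased = c_full.reshape(16, 64).sum(axis=0)
```

Since h(τ) is confined to low quefrency, it survives the folding intact; only the excitation copies move, which is the takeaway stated next.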

0:05:52 | so the takeaway message here is that even though it seems we have thrown

0:05:57 | away the majority of the frequency spectrum and only kept a few points, the

0:06:02 | most essential speech information, h(τ), is all retained; so the next question

0:06:08 | is

0:06:09 | what if we do not have an exact estimate of the fundamental frequency, or we

0:06:13 | have no information about the fundamental frequency at all? in this case, instead of

0:06:17 | doing point sampling, we use a set of bandpass filters

0:06:22 | this can be represented as

0:06:26 | multiplication between the power spectral density and a rectangular train

0:06:31 | because the rectangular train

0:06:33 | is a

0:06:34 | sinc-attenuated delta train in the cepstral domain, bandpass filtering becomes

0:06:44 | convolution between the

0:06:47 | speech cepstrum and the sinc-attenuated delta train

0:06:50 | so what we get is an aliased version of the original speech cepstrum

0:06:54 | which is attenuated by the sinc-weighted delta train; now notice here that because

0:07:01 | this time we did not choose the rectangular train to lie on top

0:07:05 | of the harmonics

0:07:06 | there's aliasing between the vocal tract component h(τ) and the excitation component u(τ)

0:07:14 | however

0:07:18 | because this aliasing is attenuated by the sinc function, it

0:07:22 | can be negligible for our application

0:07:25 | for example, with a narrowband bandwidth of two hundred Hz, a spacing

0:07:31 | of eight hundred Hz

0:07:32 | and a fundamental frequency of one hundred Hz

0:07:35 | the aliased excitation is attenuated significantly
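That attenuation is easy to quantify under an idealized rectangular band (real analog filters differ, so this is a sketch): a band of width B in frequency becomes a sinc envelope in quefrency, so aliased excitation energy at quefrency τ is scaled by |sinc(Bτ)|. With the talk's numbers, 200 Hz bands and a 100 Hz fundamental, the excitation sits at τ = 10 ms:

```python
import numpy as np

def sinc_attenuation(bandwidth_hz, quefrency_s):
    """|sinc(B * tau)|: attenuation an ideal rectangular band of width B
    applies at quefrency tau in the cepstral domain (np.sinc is the
    normalized sinc, sin(pi x) / (pi x))."""
    return float(np.abs(np.sinc(bandwidth_hz * quefrency_s)))

# Talk's numbers: 200 Hz bands, f0 = 100 Hz -> excitation at tau = 10 ms.
att_excitation = sinc_attenuation(200.0, 1.0 / 100.0)   # essentially zero
# The vocal-tract part sits at much lower quefrency, e.g. around 1 ms:
att_envelope = sinc_attenuation(200.0, 1.0 / 1000.0)    # barely attenuated
```

With B·τ = 2, the excitation alias lands almost exactly on a sinc zero, while the low-quefrency envelope is nearly untouched, which is why the aliasing is negligible for this application.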

0:07:40 | so far what we have concluded is that with a small number of a narrow |

0:07:45 | band filters |

0:07:46 | we can capture the most essential speech information |

0:07:49 | and this enables low-rate downstream sampling and processing

0:07:55 | so we integrate the narrowband

0:08:03 | spectral extraction front end into our proposed

0:08:08 | block diagram

0:08:09 | so here the front-end features

0:08:18 | are fed into the speaker verification back end

0:08:20 | so one distinguishing characteristic of this system is that the individual bands can now

0:08:26 | be turned on and off as needed, depending on the background noise band and system requirements

0:08:33 | so next we talk about the back-end speaker verification algorithm; recall that

0:08:40 | our goal is to design a voice-authenticated wakeup system

0:08:43 | that identifies the user and the pass phrase in one shot

0:08:47 | and we would like to allow the user to

0:08:50 | provide customized pass phrases with only a very small number of enrollment samples

0:08:56 | so

0:08:58 | our application falls into the category of text-dependent speaker verification

0:09:02 | and there are many existing methods for this application; there are model-based methods

0:09:10 | that leverage a cohort of speakers

0:09:14 | who pronounce the pass phrase to train a background model

0:09:17 | and the parameters of the model are then fine-tuned

0:09:20 | with enrollment samples from the target speaker

0:09:24 | and on the other hand there are

0:09:27 | template-based methods that

0:09:28 | do not require prior model training; decisions are made by comparing the input features with

0:09:35 | the enrollment features

0:09:37 | because we would like to allow the users to

0:09:42 | provide customized pass phrases

0:09:46 | we use a template-based method

0:09:49 | so |

0:09:49 | the classical dynamic time warping algorithm is used to overcome

0:09:54 | speaker speed variation and pauses in speech

0:09:58 | however, for our application it turns out that the classical dynamic time warping algorithm either

0:10:03 | provides

0:10:04 | too much

0:10:05 | warping, so that it mutates the signal envelope, which then leads to a large

0:10:10 | number of false positives

0:10:12 | or it does not provide enough warping to compensate for

0:10:15 | the long pauses between words in a pass phrase

0:10:19 | so therefore we propose

0:10:21 | a modified version of the

0:10:23 | dynamic time warping algorithm, which we call the weighted dynamic time warping algorithm, to provide

0:10:28 | sufficient warping such that it can compensate for speaker variation and pauses

0:10:33 | in between words

0:10:35 | without causing too much signal envelope mutation

0:10:38 | so to do this we simply add a penalty term to the distance measurement, and

0:10:43 | the penalty term scales up linearly with the number of consecutive warping steps

0:10:49 | so that

0:10:50 | it prevents too much warping of the signal

0:10:53 | and this penalty scales with the signal magnitude, so the penalty

0:10:58 | is small when the signal amplitude is small, because that probably corresponds to

0:11:03 | pauses in the signal, and the penalty is high when the signal amplitude

0:11:09 | is high, in order to prevent signal envelope mutation

0:11:12 | so this can be illustrated in the distance matrix computation: the

0:11:18 | weighted dynamic time warping algorithm is the same as the classical dynamic

0:11:23 | time warping algorithm

0:11:24 | with only one difference, that is

0:11:29 | we add a cost

0:11:31 | to the distance measurement, and the cost is a function of the signal magnitude and

0:11:35 | the number of consecutive warping steps

0:11:38 | i won't go into the details here; you can find the full implementation

0:11:42 | in the paper
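As a rough sketch of the idea only (the paper's exact cost differs; the lam weighting, the warp-count bookkeeping, and the magnitude scaling below are my assumptions for illustration):

```python
import numpy as np

def weighted_dtw(a, b, lam=0.1):
    """Sketch of DTW with a magnitude-scaled warping penalty (illustrative,
    not the paper's exact cost). The penalty grows linearly with the number
    of consecutive non-diagonal (warping) steps and scales with the local
    signal magnitude, so long warps are cheap during pauses but expensive
    on high-amplitude speech."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)   # D[i, j]: best accumulated cost
    W = np.zeros((n + 1, m + 1), int)     # W[i, j]: warp run on best path
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # (predecessor cost, consecutive-warp count after this step)
            moves = [
                (D[i - 1, j - 1], 0),             # diagonal: warp run resets
                (D[i - 1, j], W[i - 1, j] + 1),   # vertical: warp continues
                (D[i, j - 1], W[i, j - 1] + 1),   # horizontal: warp continues
            ]
            best, best_w = np.inf, 0
            for prev, w in moves:
                # linear-in-w penalty, scaled by local signal magnitude
                cost = prev + d + lam * w * abs(a[i - 1])
                if cost < best:
                    best, best_w = cost, w
            D[i, j], W[i, j] = best, best_w
    return D[n, m]

# identical signals align at zero cost; a warp-requiring pair costs more
a = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
b = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0])
cost_same = weighted_dtw(a, a)
cost_warp = weighted_dtw(a, b)
```

With lam = 0 this reduces to classical DTW, and any positive lam can only raise the cost of warp-heavy paths, which is how excessive envelope mutation is discouraged.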

0:11:44 | so to illustrate the benefits of weighted dynamic time warping

0:11:48 | i show

0:11:48 | a simulation example

0:11:50 | so here our goal is to align the envelopes of the two signals

0:11:54 | and you can see that with a window length of one hundred milliseconds the classical

0:12:00 | dynamic time warping algorithm fails to align the signal envelopes

0:12:05 | here

0:12:06 | and with

0:12:08 | a window length of two hundred milliseconds, and also with the weighted dynamic time warping algorithm

0:12:12 | the signal envelopes are properly aligned

0:12:15 | however you can notice here that

0:12:18 | the shape of the input signal

0:12:22 | is heavily mutated by the classical dynamic time warping algorithm

0:12:28 | on the other hand, the shape is retained

0:12:32 | by the weighted dynamic time warping algorithm

0:12:35 | so next we performed experiments on the entire system design

0:12:39 | so in our experiments we used three pass phrases:

0:12:42 | "hi galaxy", "okay glass", and "okay power", pronounced by thirty to forty speakers with

0:12:48 | twenty to forty repetitions

0:12:49 | and we also added wind and car noises to the clean samples to generate

0:12:53 | noisy examples

0:12:55 | so the reason we chose wind and car noises is because

0:12:58 | they are common for our application

0:13:01 | and also they have the distinguishing characteristic of being narrowly concentrated in the low frequencies

0:13:08 | with this we can illustrate the benefits of adaptive band selection by

0:13:13 | discarding the noisy bands

0:13:15 | so in the experiments we used three enrollment samples, with a narrowband bandwidth of two hundred

0:13:21 | Hz

0:13:22 | chosen to be around the estimated fundamental frequency

0:13:25 | and we compared with two baseline systems:

0:13:29 | a system with forty-dimensional MFCCs and the classical dynamic time warping algorithm, and forty-

0:13:34 | dimensional MFCCs with a GMM-UBM

0:13:38 | back end

0:13:39 | so this table summarizes the experimental results

0:13:43 | when there's no background noise we use the top-N narrowband spectral

0:13:48 | features, and when there is background noise, all the features below

0:13:53 | two kHz, which we believe are contaminated by noise

0:13:57 | are dropped, so we only use the remaining bands
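This band-dropping rule is simple enough to state as code; a sketch (the function and parameter names are mine, not from the talk):

```python
def select_bands(center_freqs_hz, lowfreq_noise_detected, cutoff_hz=2000.0):
    """Adaptive band selection as described in the talk: when low-frequency
    noise (e.g. wind or car noise) is present, drop the narrowband features
    below the cutoff and keep only the remaining bands."""
    if not lowfreq_noise_detected:
        return list(center_freqs_hz)
    return [f for f in center_freqs_hz if f >= cutoff_hz]

clean = select_bands([500.0, 1500.0, 2500.0, 3500.0], False)  # all bands
noisy = select_bands([500.0, 1500.0, 2500.0, 3500.0], True)   # >= 2 kHz only
```

Because each band's contribution to the decision is independent (as discussed in the Q&A), dropping bands this way requires no retraining.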

0:14:00 | it can be seen here that without background noise the narrowband

0:14:04 | spectral coefficients yield comparable overall accuracy to the MFCC features, and at three dB

0:14:10 | SNR the narrowband spectral coefficients yield much better accuracy than MFCC features

0:14:16 | and

0:14:16 | overall the weighted dynamic time warping algorithm yields improved accuracy

0:14:22 | over the classical dynamic time warping algorithm for all features

0:14:27 | and the proposed system, that is

0:14:29 | the narrowband spectral coefficients combined with the weighted dynamic time warping algorithm

0:14:34 | also yields improved accuracy over the GMM-UBM, but with only three enrollment

0:14:40 | samples and without prior background model training

0:14:45 | so we also investigated how

0:14:48 | the performance is affected by different parameters

0:14:52 | so as you can see here

0:14:54 | when we increase the number of bands the accuracy of the system improves

0:14:59 | and also

0:15:02 | the accuracy of the system improves significantly with band selection, that is, this row

0:15:08 | compared with

0:15:10 | without band selection over here

0:15:13 | so the

0:15:18 | total power consumption of the system can be estimated as the summation of the power

0:15:22 | consumption of the front end and the back end, so for front-end power estimation we

0:15:28 | use a front end designed by Texas Instruments that consists of a roughly thirty-band filter bank

0:15:33 | so this front end has a fixed power consumption of

0:15:37 | a hundred fifty microwatts, with an additional power consumption of ten microwatts

0:15:42 | per band

0:15:43 | and the back-end algorithm is implemented on a Cortex-M0 microcontroller

0:15:48 | and

0:15:52 | the assumption is that a decision is made at

0:15:55 | every sixty milliseconds with continuous triggering, so it's a worst-case trigger assumption

0:16:01 | and the power consumption of the back end is

0:16:04 | a few microwatts per band

0:16:06 | so

0:16:07 | here we're just trying to illustrate that overall our system is very power efficient

0:16:12 | and the total power consumption can be kept

0:16:15 | under a few hundred microwatts
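The power-budget arithmetic can be sketched as follows; the 150 µW fixed and 10 µW-per-band front-end figures are from the talk, while the back-end per-band figure was unclear in the recording and is therefore left as a parameter (the 9 µW used below is a hypothetical value):

```python
def total_power_uw(n_bands, backend_uw_per_band):
    """Estimated total power: analog front end (fixed cost plus a per-band
    cost) plus the per-band back-end processing cost."""
    front_end_uw = 150.0 + 10.0 * n_bands   # figures quoted in the talk
    back_end_uw = backend_uw_per_band * n_bands
    return front_end_uw + back_end_uw

# e.g. 10 active bands and a hypothetical 9 uW/band back end:
estimate = total_power_uw(10, 9.0)   # 150 + 100 + 90 = 340 uW
```

Under these assumptions a 10-band configuration lands in the few-hundred-microwatt range claimed in the talk, and turning bands off (adaptive band selection) reduces the total linearly.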

0:16:18 | so to summarize

0:16:20 | i presented a

0:16:22 | new architecture for low-power text-dependent speaker verification

0:16:26 | the system has a front end that consists of a set of narrowband

0:16:30 | analog filters

0:16:32 | which perform early-stage signal dimension reduction

0:16:36 | and performing this

0:16:43 | front-end filtering allows low-rate sampling and low-rate

0:16:48 | downstream processing

0:16:49 | and we also proposed a back-end algorithm for speaker verification

0:16:54 | and overall the system is designed to support adaptive band selection

0:16:59 | which leads to improved robustness to noise, and the overall system has demonstrated

0:17:06 | comparable accuracy to existing systems

0:17:10 | with much lower power consumption

0:17:12 | so that's all for today, thank you

0:17:21 | okay we have time for questions |

0:17:33 | so for the applications you're looking at, probably in

0:17:37 | consumer electronics or home environments and so forth, you don't have an

0:17:42 | infinite number of competing speakers, or speakers to be confused with, so

0:17:48 | have you thought maybe about which phrases might be more effective? if you're looking

0:17:52 | at, let's say, a family using consumer electronics at home, you can kind of distinguish

0:17:57 | which speakers are really more confusable with others; and second, if you're

0:18:04 | trying to keep the power low, it would seem like you'd want to

0:18:08 | have a wakeup-type device and do the verification on something else, so something

0:18:14 | that might actually respond to, say, clapping your hands or some

0:18:18 | type of sound that would actually cause the system to wake up

0:18:23 | so this way the power is not being drawn continuously

0:18:27 | well, thanks

0:18:28 | so for the first question

0:18:33 | certainly there are pass phrases that

0:18:36 | work better than others, and i'm not sure what's

0:18:40 | a good rule to choose which

0:18:42 | pass phrase is better than others

0:18:45 | because we want to sell the systems to different customers all around the world

0:18:50 | so we just

0:18:51 | let them choose their favorite pass phrase

0:18:54 | and for the second question

0:18:56 | actually, for our system we do have a VAD at the front

0:19:00 | that detects the energy

0:19:02 | so it's like

0:19:05 | you have a front end to detect energy

0:19:08 | and then it will wake up our system, and our system will wake up the host

0:19:12 | device

0:19:13 | so the reason

0:19:15 | we still use

0:19:17 | the worst-case scenario for this device

0:19:22 | sorry, the reason we consider the worst-case scenario in our power estimation is because, say

0:19:27 | you are in a very noisy environment

0:19:33 | this system is continuously running with that power consumption; the reason is that

0:19:39 | if you have, for example

0:19:41 | an Apple phone

0:19:44 | in order to activate Siri you need to actually wake up the device

0:19:48 | first

0:19:49 | and then you say the wake word

0:19:51 | so we want to make something that will not drain the battery in a noisy

0:19:56 | environment

0:19:57 | sorry, i'm also thinking you're gonna add this to an internet-of-things-type device

0:20:01 | because that's where it actually would probably have the biggest financial gains

0:20:07 | i think they are also considering selling these for home appliances

0:20:25 | so one of the things that people do with all these devices is that they

0:20:30 | don't hold the devices close anymore

0:20:33 | so you can have, you know

0:20:35 | a phone on the table and use it at a distance

0:20:39 | saying whatever you want to say; so i saw that you have added noise to

0:20:43 | the examples, but have you looked at the effect of distance?

0:20:49 | actually that's a very good question; we have encountered this in real applications, so

0:20:53 | if you know

0:20:55 | how real systems are implemented, there is AGC

0:20:59 | at the front end

0:21:01 | in fact, if you're speaking from far away and close up, the sound

0:21:06 | sounds different

0:21:07 | so

0:21:07 | i think they added some circuitry, like a multi-stage

0:21:13 | AGC, that will amplify the sound more if you are far away and

0:21:18 | amplify less if you are close by

0:21:21 | so i'm not considering this part in my work

0:21:25 | but

0:21:28 | it certainly needs to be considered for real applications

0:21:39 | okay, i have a question; i noticed in slide number nineteen that

0:21:44 | thanks

0:21:55 | you said that for noise you remove all the

0:22:00 | information below two kHz

0:22:04 | did you try to do this also for MFCCs?

0:22:08 | i didn't do that because, i think

0:22:18 | i don't know if people do it, but

0:22:19 | you see, with MFCCs, when you

0:22:23 | take the information from the frequency domain by transforming into the cepstral domain, if you

0:22:28 | just remove some bands and take the MFCCs, it actually will alter the cepstral

0:22:34 | signal significantly; it's not

0:22:37 | just additive

0:22:41 | so i didn't do it for MFCCs

0:22:43 | i think it could be done if it's trained from scratch this way

0:22:54 | okay, so the reason that it can be done in my system is because

0:23:02 | the decision is added together at the end in a linear manner: you can

0:23:07 | consider each sub-band as making a decision, and

0:23:11 | the final decision is a summation of them all, but when you have MFCC coefficients the

0:23:17 | decision is no longer a linear summation of all the sub-band decisions

0:23:22 | so i'm not sure how it can be done

0:23:25 | using MFCCs, and i don't think it has been tried before

0:23:29 | thanks. thank you. okay, let's thank the speaker again