Speech Transcript - A Low-Power Text-Dependent Speaker Verification System with Narrow-Band Feature Pre-Selection and Weighted Dynamic Time Warping

hi everyone today i'll be talking about my work on a low power text

dependent speaker verification system

with narrowband feature pre-selection and weighted dynamic time warping

this is a joint work with texas instruments

so first to motivate the work

with the increasing popularity of mobile devices such as smart phones and smart watches there's

a rising interest in enabling

voice authenticated

speaker verification

that is the device continuously listens to the surroundings and wakes up when

a designated pass phrase is pronounced by

the target speaker

so existing wakeup systems are implemented on the host device using digital solutions

because those host devices are usually designed for general purpose applications

signals are usually acquired at rates much higher than the nyquist rate

in order to minimize information loss in the early stages and to enable

flexible downstream processing therefore these systems usually involve many stages of processing of high dimensional

data

and therefore result in power consumption in the range of hundreds of milliwatts

so the opportunity lies in the fact that if you design an application specific system

then the desired output may be obtained in a more direct manner

with less processing by performing early stage signal dimension reduction with analog components

and adaptive data processing so our goal is to design a low power

voice authenticated wakeup system whose power consumption is limited to the range of a

couple hundred microwatts

in order to achieve this

we proposed a new architecture

so in conventional systems

the processing pipeline usually involves

high sampling rate fast processing

and due to the high dimensional features

the downstream processing usually requires high complexity

computation therefore resulting in high power consumption

in contrast the proposed system pre-selects a set of low dimensional

spectral features which can be efficiently extracted using analogue components and because these

pre-selected features are sparse in the frequency domain

they enable

low rate sampling and low rate processing

and therefore achieve low power consumption

so in the remainder of the talk i'll describe the two components of the system

the spectral feature pre-selection component and the speaker verification back end

so first i'll describe the feature pre-selection process

in this part

we'll show that with a few carefully selected narrow bands

we are capable of capturing the most essential speech information

first a review

so the speech signal

s of t

there is the display

sorry this one should be u of t

and

so the speech signal can be represented as the convolution between the excitation signal u

of t and the vocal tract modulation signal h of t and h of t

contains the essential speech information

this convolution relationship can become separable in the cepstral domain

so let's first plot the power spectrum density of speech

in the frequency domain convolution becomes multiplication and when you take the log of the

power spectrum density multiplication becomes addition notice that the peaks of the power spectrum

density correspond to the harmonics of the fundamental frequency

and the overall envelope corresponds to the vocal tract modulation

and to transform the signal to the cepstral domain we take the inverse fourier transform

and

it turns out that speech is

sparse in the cepstral domain it consists of two main components the vocal tract modulation

component h of tau which is represented by this narrow rectangle here with its cut-off

at tau h and the excitation component u of tau which

is represented by this

delta component at tau e
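
As an illustration of the cepstral picture described here, the following is a minimal sketch on a synthetic signal, not the speaker's code. The sampling rate, fundamental frequency, analysis length and toy vocal tract response are all assumptions, chosen so that the signal is exactly periodic. It builds speech as an excitation convolved with a vocal tract response, takes the log power spectrum, and applies an inverse Fourier transform to reach the cepstral domain, where the excitation shows up as an isolated peak at one pitch period.

```python
import numpy as np

fs = 8000      # sampling rate in Hz (assumed)
f0 = 100       # fundamental frequency in Hz (assumed)
n = 4000       # analysis length in samples, exactly 50 pitch periods (assumed)

# excitation: impulse train with one impulse per pitch period
u = np.zeros(n)
u[:: fs // f0] = 1.0

# toy vocal tract response: a short decaying resonance (assumed shape)
t = np.arange(64) / fs
h = np.exp(-800.0 * t) * np.cos(2 * np.pi * 700.0 * t)

# speech = excitation convolved with the vocal tract response
s = np.convolve(u, h)[:n]

# log power spectrum: convolution becomes multiplication, then addition
# (the small floor only avoids taking the log of zero)
log_psd = np.log(np.abs(np.fft.fft(s)) ** 2 + 1e-12)

# cepstrum: inverse fourier transform of the log power spectrum
cepstrum = np.real(np.fft.ifft(log_psd))

# look for the excitation peak away from the low-quefrency vocal tract part
lo, hi = 40, 120                       # 5 ms to 15 ms at 8 kHz
peak = lo + int(np.argmax(cepstrum[lo:hi]))
tau_e_ms = 1000.0 * peak / fs
print(f"excitation peak at {tau_e_ms:.2f} ms")
```

With real speech the peak is broader and a window function would normally be applied first; the exact periodicity here is only a convenience of the toy signal.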

so our goal is to extract h of tau and this can be done by

performing

a transformation to the cepstral domain

however this is power expensive

so the question is

how to extract h of tau without actually having to perform the transformation to

the cepstral domain

so let's begin by trying something simple say we measure the speech at

a few points that are evenly spaced out across the frequency spectrum on top of

the harmonics and let the spacing between the sampling points be denoted by delta p

then what this corresponds to is multiplication between the power spectrum density and an

impulse train and when transformed to the cepstral domain this corresponds to convolution between the speech

cepstrum and a delta train

so what we get here is that the cepstrum of the point wise samples

is the aliased version of the cepstrum of the original speech and the aliased copies

lie exactly on the multiples of one over delta p
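
The aliasing relationship can be checked numerically. The sketch below is an illustration under assumed sizes, with a random array standing in for a log power spectrum. It multiplies the spectrum by an impulse train and confirms that the inverse transform of the result equals the sum of evenly shifted copies of the original cepstrum, scaled by one over the sampling spacing.

```python
import numpy as np

n = 1024        # number of frequency bins (assumed)
spacing = 64    # keep every 64th bin, playing the role of the sample spacing (assumed)

rng = np.random.default_rng(0)
log_psd = rng.standard_normal(n)      # stand-in for a log power spectrum

# point wise sampling: multiply by an impulse train in the frequency domain
comb = np.zeros(n)
comb[::spacing] = 1.0
sampled = log_psd * comb

cep = np.fft.ifft(log_psd)            # cepstrum of the full spectrum
cep_sampled = np.fft.ifft(sampled)    # cepstrum of the point wise samples

# aliasing: shifted copies of the original cepstrum, one every n / spacing bins
period = n // spacing
aliased = sum(np.roll(cep, k * period) for k in range(spacing)) / spacing

max_err = float(np.max(np.abs(cep_sampled - aliased)))
print(max_err)   # agreement up to floating point rounding
```

In this discrete setting the copies sit at multiples of n / spacing bins, which is the counterpart of the multiples of one over the frequency spacing mentioned in the talk.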

so the take away message here is that even though it seems we have thrown

away the majority of the frequency spectrum and only kept a few points the

most essential speech information h of tau is still retained so the next question

what if we do not have an exact estimate of the fundamental frequency or we

have no information about the fundamental frequency at all so in this case instead of

doing point wise sampling we use a set of bandpass filters

so you can derive that this can be represented as

multiplication between the power spectrum density and a rectangular train

because the rectangular train

is a

sinc attenuated delta train in the cepstral domain bandpass filtering becomes

convolution between the

speech cepstrum and the sinc attenuated delta train

so what we get is the aliased version of the original speech cepstrum

which is attenuated by the sinc function so notice here that because

this time we did not choose the rectangular train to lie on top

of the harmonics

there's aliasing between the vocal tract component h of tau and the excitation component u of tau

however

because this

aliasing is attenuated by the sinc function it

can be negligible for our application

for example with a narrowband bandwidth of two hundred hz and a spacing

of eight hundred hz

and a fundamental frequency of one hundred hz

the aliasing is attenuated significantly
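
To put rough numbers on this example, here is a worked calculation under an idealised assumption that is not stated in the talk: the bands are perfect rectangles, so the k-th aliased copy of the cepstrum is shifted by k over the spacing and weighted, relative to the unshifted copy, by sinc(k times bandwidth over spacing). The variable names are hypothetical. It lists where the excitation peak at one pitch period lands after each shift and how strongly it is attenuated.

```python
import numpy as np

bandwidth_hz = 200.0   # width of each narrow band (from the talk)
spacing_hz = 800.0     # spacing between bands (from the talk)
f0_hz = 100.0          # fundamental frequency (from the talk)

tau_e_ms = 1000.0 / f0_hz             # excitation peak at one pitch period
copy_step_ms = 1000.0 / spacing_hz    # aliased copies are this far apart

# relative weight of the k-th copy under the ideal rectangular band assumption;
# np.sinc(x) is the normalised sinc, sin(pi x) / (pi x)
k = np.arange(0, 13)
rel_weight = np.sinc(k * bandwidth_hz / spacing_hz)

# quefrency at which the excitation peak lands after the k-th shift
landing_ms = tau_e_ms - k * copy_step_ms

for ki, lm, w in zip(k, landing_ms, rel_weight):
    print(f"k={ki:2d}  lands at {lm:6.2f} ms  relative weight {w:+.3f}")
```

Under these assumptions the copy that lands exactly on zero quefrency (k = 8) falls on a null of the sinc, and its neighbours at plus and minus 1.25 ms are weighted by roughly 0.1, which is consistent with the aliasing being small. Real analogue filters are not ideal rectangles, so the actual attenuation will differ.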

so far what we have concluded is that with a small number of narrow

band filters

we can capture the most essential speech information

and this will result in low rate downstream sampling and processing

so we integrate the

narrowband spectral extraction front end into our proposed

block diagram

so here the front end features

are fed into the speaker verification back end

so one distinguishing characteristic of this system is that the individual bands can

be turned on and off as needed depending on the background noise and system requirements

so next we talk about the back end speaker verification algorithm so recall that

our goal is to design a voice authenticated wakeup system

that identifies the user and the pass phrase in one shot

and we would like to allow the user to

provide customised pass phrases by providing very few enrollment samples

so

our application falls into the category of text dependent speaker verification

and there are many existing methods for this application so there are model based methods

that leverage a cohort of speakers

who pronounce the pass phrase to train a background model

and the parameters of the model are then fine tuned

with enrollment from the target speaker

and on the other hand there are

template based methods that

do not require prior model training so the decision is made by comparing the input features with

the enrollment features

because we would like to allow the users

to provide customised pass phrases

we use a template based method

the classical dynamic time warping algorithm is used to overcome

speaker variation and pauses in speech

however for our application it turns out that the classical dynamic time warping algorithm either

provides

too much

warping so that it mutates the signal envelope which then leads to a large

number of false positives

or it does not provide enough warping to compensate for

the long pauses between words in a pass phrase
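
For reference, a minimal sketch of the classical algorithm with a warping window follows. It assumes one dimensional feature sequences, an absolute difference as the local distance, and a window measured in frames; the function name and the toy sequences are hypothetical and not taken from the paper. The example shows the trade-off described here: a narrow window cannot absorb a long pause, while a wider one can.

```python
import numpy as np

def dtw_distance(x, y, window):
    """Classical DTW distance between two 1-D sequences.

    window limits how far the warping path may stray from the diagonal,
    in frames; it is widened if the lengths differ by more than that.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    window = max(window, abs(n - m))
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            d = abs(x[i - 1] - y[j - 1])
            # best of: diagonal match, or a warping step in either direction
            cost[i, j] = d + min(cost[i - 1, j - 1],
                                 cost[i - 1, j],
                                 cost[i, j - 1])
    return float(cost[n, m])

# two "words" separated by a short pause in one sequence and a long pause in the other
enrolled = [0, 1, 2, 0, 1, 2, 0, 0, 0, 0]
spoken = [0, 1, 2, 0, 0, 0, 0, 1, 2, 0]

narrow = dtw_distance(enrolled, spoken, window=1)   # too little warping
wide = dtw_distance(enrolled, spoken, window=3)     # enough to absorb the pause
print(narrow, wide)
```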

so therefore we propose a

modified version of the

dynamic time warping algorithm which we call the weighted dynamic time warping algorithm to provide

sufficient warping such that it can compensate for speaker variation and pauses

in between words

without causing too much signal envelope mutation

so to do this we simply add a penalty term to the distance measurement and

the penalty term scales up linearly with the number of consecutive warping steps

so that

it prevents too much warping of the signal

and this penalty scales with the signal magnitude so the penalty

is small when the signal amplitude is small because that probably corresponds to

pauses in the signal and the penalty is high when the signal amplitude

is high in order to prevent signal envelope mutation

so this can be illustrated in the distance matrix computation so the

weighted dynamic time warping algorithm is the same as the classical dynamic

time warping algorithm

with only one difference in the cost function that is

we add a cost

to the distance measurement and the cost is a function of the signal magnitude and

the number of consecutive warping steps

i won't go into the details here and you can find the full implementation

in the paper
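
Since the details are deferred to the paper, the following is only a hypothetical sketch of the idea as described in the talk, not the authors' implementation. It assumes one dimensional sequences, a penalty of the form lam times the number of consecutive warping steps times the larger of the two sample magnitudes, and a simple heuristic that tracks the run length along the locally best predecessor. The function name, the parameter lam and the exact penalty form are all assumptions.

```python
import numpy as np

def weighted_dtw(x, y, window, lam=0.1):
    """DTW with a penalty on consecutive warping steps (illustrative heuristic).

    Each non-diagonal step adds lam * run * magnitude, where run counts the
    consecutive warping steps so far and magnitude is max(|x_i|, |y_j|).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    window = max(window, abs(n - m))
    cost = np.full((n + 1, m + 1), np.inf)
    run = np.zeros((n + 1, m + 1), dtype=int)   # consecutive warping steps ending here
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            d = abs(x[i - 1] - y[j - 1])
            mag = max(abs(x[i - 1]), abs(y[j - 1]))
            up_run = int(run[i - 1, j]) + 1
            left_run = int(run[i, j - 1]) + 1
            candidates = [
                (cost[i - 1, j - 1], 0),                             # diagonal, run resets
                (cost[i - 1, j] + lam * up_run * mag, up_run),       # warp along x
                (cost[i, j - 1] + lam * left_run * mag, left_run),   # warp along y
            ]
            best_cost, best_run = min(candidates)
            cost[i, j] = d + best_cost
            run[i, j] = best_run
    return float(cost[n, m])

# warping inside a pause (zero magnitude) is free
enrolled = [0, 1, 2, 0, 1, 2, 0, 0, 0, 0]
spoken = [0, 1, 2, 0, 0, 0, 0, 1, 2, 0]
pause_cost = weighted_dtw(enrolled, spoken, window=3, lam=1.0)

# stretching the high amplitude part of the envelope is penalised
template = [0, 1, 2, 1, 0]
stretched = [0, 0, 1, 1, 2, 2, 1, 1, 0, 0]
stretch_cost = weighted_dtw(template, stretched, window=5, lam=1.0)
print(pause_cost, stretch_cost)
```

Setting lam to zero recovers the classical recursion, which matches the statement in the talk that the only difference is the added cost term.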

so to illustrate the benefits of weighted dynamic time warping

i show

it through a simulation plot

so here our goal is to align the main envelopes of the two signals

and you can see that with a window length of one hundred milliseconds the classical

dynamic time warping algorithm fails to align the signal envelopes

here

and with

a window length of two hundred milliseconds and also with the weighted dynamic time warping algorithm

the signal envelopes are properly aligned

however you can notice here that

the shape of the input signal is

heavily mutated by the classical dynamic time warping algorithm

on the other hand the shape is retained

by the weighted dynamic time warping algorithm

so next we performed experiments on the entire system design

so in our experiments we used three pass phrases

hi galaxy okay glass and okay power weight pronounced by thirty to forty speakers with

twenty to forty repetitions

and we also added wind and car noises to the clean samples to generate

the noisy examples

so the reason we chose wind and car noises is because

they are common for our application

and also they have the distinguishing characteristic of being narrowly concentrated in the low frequencies

so with these we can illustrate the benefits of adaptive band selection by

discarding the noisy bands

so in the experiments we used three enrollment samples with a narrowband bandwidth of two hundred hz

chosen to be around the estimated fundamental frequency

and we compare it with two baseline systems

the

system with forty dimensional mfccs with the classical dynamic time warping algorithm and forty

dimensional mfccs with the gmm ubm

back end

so this table summarizes the experimental results

when there's no background noise we use the top-n narrow band spectral

features and when there is background noise all the features below

two khz which we believe are contaminated by noise

are dropped so we only use the remaining bands
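
A small sketch of this band selection step is given below, assuming that each band contributes its own distance and that the per band distances are simply summed, as the speaker describes later in the question session. The function name, the band frequencies and the distance values are made up for illustration.

```python
import numpy as np

def verification_score(band_distances, band_freqs_hz, noisy, cutoff_hz=2000.0):
    """Sum per band distances, dropping bands below cutoff_hz when noise is present.

    Returns the summed distance and the number of bands that were used.
    """
    band_distances = np.asarray(band_distances, dtype=float)
    band_freqs_hz = np.asarray(band_freqs_hz, dtype=float)
    if noisy:
        keep = band_freqs_hz >= cutoff_hz       # discard the bands assumed noisy
    else:
        keep = np.ones(band_freqs_hz.shape, dtype=bool)
    return float(np.sum(band_distances[keep])), int(np.count_nonzero(keep))

# hypothetical band frequencies and per band distances
freqs = [400.0, 1200.0, 2000.0, 2800.0]
dists = [5.0, 4.0, 1.0, 2.0]

print(verification_score(dists, freqs, noisy=False))
print(verification_score(dists, freqs, noisy=True))
```

Because the score is a plain sum over bands, removing a band only removes its term, which is what makes this kind of selection straightforward for the narrowband features.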

it can be seen here that without background noise the narrowband

spectral coefficients yield comparable overall accuracy to the mfcc features and at three db

snr the narrowband spectral coefficients yield much better accuracy than the mfcc features

and

overall the weighted dynamic time warping algorithm yields improved accuracy compared with

the classical dynamic time warping algorithm for all features

and the proposed system that is

the narrowband spectral coefficients combined with the weighted dynamic time warping algorithm

also yields improved accuracy over the gmm ubm but with only three enrollment

samples and without prior background model training

so we also investigated how

the performance is affected by different parameters

so as you can see here

when we increase the number of bands the accuracy of the system improves

and also

the accuracy of the system improves significantly with band selection that is this row

compared with

without band selection over here

so the

total power consumption of the system can be estimated as the sum of the power

consumption of the front end and the back end so for front end power estimation we

use a front end designed by texas instruments that consists of an about thirty band filter bank

so this front end has a fixed power consumption of

a hundred fifty microwatts with an additional power consumption of ten microwatts

per band

and the back end algorithm is implemented on a cortex m microcontroller

the assumption is that a decision is made

every sixty milliseconds with continuous triggering so this is a worst-case trigger assumption

and the power consumption of the back end is

[inaudible] microwatts per band

and here we're just trying to illustrate that overall our system is very power efficient

and the total power consumption can be kept

under a few hundred microwatts
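
The power estimate can be written as simple arithmetic. The sketch below reads the front end figures as 150 microwatts fixed plus 10 microwatts per active band. The back end figure per band is not recoverable from the transcript, so it is left as a required argument rather than guessed. Function names are hypothetical.

```python
def front_end_power_uw(n_bands, fixed_uw=150.0, per_band_uw=10.0):
    """Front end power in microwatts: a fixed cost plus a cost per active band."""
    return fixed_uw + per_band_uw * n_bands

def total_power_uw(n_bands, back_end_per_band_uw):
    """Total power in microwatts: front end plus a back end cost per active band.

    back_end_per_band_uw must be supplied by the caller; the value is not
    legible in the transcript.
    """
    return front_end_power_uw(n_bands) + back_end_per_band_uw * n_bands

# front end only, for a few hypothetical band counts
for bands in (4, 8, 12):
    print(bands, front_end_power_uw(bands))
```

Turning bands off lowers both terms, which is how adaptive band selection also reduces power.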

so to summarise

i presented

a new structure for low power text dependent speaker verification

the system has a front end that consists of a set of narrowband

analogue filters

which perform early stage signal dimension reduction

and

by performing this

front end filtering it allows low rate sampling and low rate

downstream processing

and we also proposed a back end algorithm for speaker verification

and overall the system is designed to support adaptive band selection

which leads to improved robustness to noise and the overall system has demonstrated

comparable accuracy to existing systems

with

much lower power consumption

so that's it for today thank you

okay we have time for questions

testing one two three so if i think about the applications you're looking at probably

consumer electronics or home environments and so forth so you don't have an

infinite number of competing speakers or speakers to be confused with so

have you thought maybe about which phrases might be more effective if you're looking

at let's say a family using some kind of electronics at home you can kind of distinguish

which speakers are really more confusable with others and second if you're

trying to keep the power low it would seem like you'd want to

have a wakeup type device with the verification being done on something else so something

that might actually i want to say like clapping your hands or something but some

type of sound that would actually cause the system to wake up

in this way the power's not being drawn continuously

well thanks

so for the first question

certainly there are pass phrases that

work better than others and i'm not sure what's

a good rule to choose which

pass phrase is better than others and

because we want to sell the systems to different customers all around the world

we just

let them choose their favourite pass phrase

and for the second question

actually for our system we do have a vad at the front

that detects the energy and

so it's like

you have a vad at the front to detect energy

and then it will wake up our system and our system will wake up the host

device

so the reason

we still

consider the worst case scenario for this device

sorry we consider the worst case scenario in our power estimation is because say

you are in a restaurant or you know a very noisy environment

then

this system is continuously running with that power consumption so the reason is that

if you have

an apple phone

in order to activate siri you need to actually wake up the device

first

and then you say the wake word

so we want to make something that will not drain the power in a noisy

environment

sorry i'm also thinking you're gonna add this to an internet of things type device

because that's where you actually would probably have the biggest financial gains

i think they are also considering selling those to home appliances or

so one of the things that people do with all these devices is that they

don't hold the devices close anymore

so you can have a you know

a phone on the it's a little and use this at a distance

or whatever you want to say so i saw that you have added noise to

the examples have you looked at the effect of that

actually that's a very good question so we have encountered this in our applications so

if you know

how real systems are implemented there is an agc

at the front end

in fact if you're speaking from far away and close by the sound

sounds different

i think they added some circuitry it's like a multi stage

agc that will amplify the sound if you are far away and

amplify less if you are close by

so i'm not considering this part in my work

but

it certainly needs to be considered for real applications

okay i have a question i noticed in slide number nineteen

yeah thanks

you said that for noise you remove all the

information below two khz

did you try to do this also for the mfcc

i didn't because i think

i don't know if people do it but

as i

understand it with mfcc when you

take the information from the frequency domain by transforming into the cepstral domain if you

just remove some bands and take the mfcc it will actually alter the cepstral

signal significantly it's not

just like an addition

so i didn't do it for mfcc

i think it can be done if it's trained on it from scratch this way

for phone with the man

okay so the reason that it can be done in my system is because

the decision is added together at the end in a linear manner you can

consider each sub band as making a decision and

this is a summation of all of them but when you have the mfcc coefficients the

decision is no longer a linear summation of all the decisions

so i'm not sure how it can be done

using mfcc and i don't think it has been tested before

thanks thank you okay i'd like to thank the speaker again

A Low-Power Text-Dependent Speaker Verification System with Narrow-Band Feature Pre-Selection and Weighted Dynamic Time Warping

Text Dependent Speaker Verification

Qing He, Gregory Wornell and Wei Ma