0:00:14 | hi everyone, today i'll be talking about my work on a low-power text-

0:00:18 | dependent speaker verification system |

0:00:20 | with narrowband feature pre-selection and weighted dynamic time warping |

0:00:24 | this is joint work with Texas Instruments

0:00:27 | so first to motivate the work |

0:00:29 | with the increasing popularity of mobile devices such as smartphones and smart watches, there's

0:00:35 | a rising interest in enabling

0:00:38 | voice authenticated |

0:00:41 | speaker verification |

0:00:42 | that is, the device continuously listens to its surroundings and wakes up when

0:00:49 | a designated pass phrase is pronounced by

0:00:52 | the target speaker

0:00:54 | so existing wakeup systems are implemented on the host device using digital solutions

0:01:01 | because those host devices are usually designed for general-purpose applications

0:01:06 | signals are usually acquired at rates much higher than the Nyquist rate

0:01:10 | in order to minimize information loss at the early stage and to enable

0:01:14 | flexible downstream processing, therefore these systems usually involve many stages of processing on high-dimensional

0:01:20 | data

0:01:21 | and therefore result in power consumption in the range of hundreds of milliwatts

0:01:27 | so the opportunity lies in the fact that if you design an application-specific system

0:01:32 | the desired output may be obtained in a more direct manner

0:01:36 | with less processing, by performing early-stage signal dimension reduction with analog components

0:01:41 | and adaptive data processing, so our goal is to design a low-power

0:01:48 | voice-authenticated wakeup system whose power consumption is limited to the range of a

0:01:53 | couple hundred microwatts

0:01:56 | in order to achieve this |

0:01:58 | we proposed a new architecture |

0:02:00 | so in conventional systems |

0:02:02 | the processing pipeline usually involves |

0:02:06 | high sampling rate fast processing |

0:02:08 | and due to the high dimensional features |

0:02:11 | the downstream processing usually requires high-complexity

0:02:15 | computation, therefore incurring high power consumption

0:02:18 | in contrast, the proposed system pre-selects a set of low-dimensional

0:02:24 | spectral features which can be efficiently extracted using analog components, and because these

0:02:32 | pre-selected features are sparse in the frequency domain

0:02:36 | this enables

0:02:37 | low-rate sampling and low-rate processing

0:02:42 | and therefore achieves low power consumption

0:02:47 | so in the remainder of the talk i'll describe the two components of the system |

0:02:52 | the spectral feature pre-selection component and the speaker verification back end

0:02:59 | so first i'll describe the feature pre-selection process

0:03:03 | in this part |

0:03:04 | we'll show that with a few carefully selected narrowband filters

0:03:09 | we are capable of capturing the most essential speech information

0:03:15 | first, a quick review

0:03:19 | so the speech signal s(t)

0:03:31 | can be represented as the convolution between the excitation signal u(t)

0:03:36 | and the vocal tract modulation signal h(t), and h(t)

0:03:41 | contains the essential speech information

0:03:44 | this convolution relationship can become separable in the cepstral domain |

0:03:48 | so here i plot the power spectral density of speech

0:03:53 | in the frequency domain convolution becomes multiplication, and when you take the log of the

0:03:58 | power spectral density, multiplication becomes addition; notice that the peaks of the power spectral

0:04:05 | density correspond to the harmonics of the fundamental frequency

0:04:08 | and the overall envelope corresponds to the vocal tract modulation

0:04:13 | and to transform the signal to the cepstral domain we take the inverse fourier transform |

0:04:19 | and it turns out that speech is

0:04:22 | sparse in the cepstral domain: it consists of two main components, the vocal tract modulation

0:04:27 | component h(τ), which is represented by this narrow rectangle here with cutoff

0:04:32 | quefrency θ_h, and the excitation component u(τ), which

0:04:38 | is represented by this

0:04:41 | delta component at θ_e
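To make this decomposition concrete, here is a small numerical sketch (not from the talk; all signal parameters are illustrative): a synthetic "speech" signal built as an impulse-train excitation convolved with a decaying envelope standing in for the vocal tract. Its cepstrum shows the low-quefrency h(τ) part and the excitation peak at the pitch period.

```python
import numpy as np

def real_cepstrum(x, n_fft=1024):
    """Real cepstrum: inverse FFT of the log power spectral density."""
    psd = np.abs(np.fft.fft(x, n_fft)) ** 2
    return np.fft.ifft(np.log(psd + 1e-12)).real

# Synthetic "speech": impulse-train excitation u(t) with pitch period 64
# samples, convolved with a decaying envelope h(t) standing in for the
# vocal tract response (all parameters illustrative).
n, period = 1024, 64
u = np.zeros(n)
u[::period] = 1.0                      # excitation u(t)
h = np.exp(-np.arange(40) / 8.0)       # vocal tract stand-in h(t)
x = np.convolve(u, h)[:n]

c = real_cepstrum(x, n)
# h(tau) lives at low quefrency; the excitation shows up as a peak at the
# pitch period (quefrency 64 samples), i.e. theta_e in the talk's notation.
pitch_q = 40 + int(np.argmax(c[40:100]))
```

Searching above the low-quefrency region recovers the pitch period (quefrency 64), which is exactly the separability the talk relies on.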

0:04:43 | so our goal is to extract h(τ), and this can be done by

0:04:47 | performing

0:04:49 | a transformation to the cepstral domain

0:04:51 | however this is power-expensive

0:04:53 | so the question is

0:04:55 | how to extract h(τ) without actually performing a transformation to

0:05:00 | the cepstral domain

0:05:01 | so let's begin by trying something simple: say we measure the speech at a number

0:05:06 | of points evenly spaced out across the frequency spectrum, on top of

0:05:12 | the harmonics, and let the spacing between the sampling points be denoted by Δp

0:05:18 | then what this corresponds to is multiplication between the power spectral density and an

0:05:25 | impulse train, and when transformed to the cepstral domain this corresponds to convolution between the speech

0:05:31 | cepstrum and a delta train

0:05:37 | so what we get here is that the cepstrum of the pointwise samples

0:05:42 | is an aliased version of the cepstrum of the original speech, and the aliased copies

0:05:47 | lie exactly on the multiples of 1/Δp
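This sampling-in-frequency / aliasing-in-quefrency duality can be checked numerically. The sketch below (illustrative synthetic signal, not from the talk) keeps only every Δp-th bin of the log power spectrum, on top of the harmonics, and verifies that the resulting short cepstrum equals the full cepstrum folded with period 1/Δp:

```python
import numpy as np

# Synthetic signal: impulse-train excitation (period 64 samples) convolved
# with a decaying stand-in for the vocal tract response.
n, period = 1024, 64
u = np.zeros(n)
u[::period] = 1.0
h = np.exp(-np.arange(40) / 8.0)
x = np.convolve(u, h)[:n]

log_psd = np.log(np.abs(np.fft.fft(x)) ** 2 + 1e-12)
c_full = np.fft.ifft(log_psd).real          # full 1024-point cepstrum

# Keep only every 16th frequency bin (spacing delta_p = f0 here):
c_sampled = np.fft.ifft(log_psd[::16]).real  # 64-point cepstrum

# Decimation in frequency == aliasing in quefrency: the 64-point cepstrum
# equals the full cepstrum folded (summed) with period 64 = 1/delta_p.
aliased = c_full.reshape(16, 64).sum(axis=0)
```

Since h(τ) is confined to low quefrency, it survives the folding intact; only the excitation copies move, which is the takeaway stated next.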

0:05:52 | so the takeaway message here is that even though it seems we have thrown

0:05:57 | away the majority of the frequency spectrum and only kept a few points, the

0:06:02 | most essential speech information, h(τ), is all retained; so the next question

0:06:08 | is

0:06:09 | what if we do not have an exact estimate of the fundamental frequency, or we

0:06:13 | have no information about the fundamental frequency at all? in this case, instead of

0:06:17 | doing point sampling, we use a set of bandpass filters

0:06:22 | this can be represented as

0:06:26 | multiplication between the power spectral density and a rectangular train

0:06:31 | because the rectangular train

0:06:33 | is a

0:06:34 | sinc-attenuated delta train in the cepstral domain, bandpass filtering becomes

0:06:44 | convolution between the

0:06:47 | speech cepstrum and the sinc-attenuated delta train

0:06:50 | so what we get is an aliased version of the original speech cepstrum

0:06:54 | which is attenuated by the sinc-weighted delta train; now notice here that because

0:07:01 | this time we did not choose the rectangular train to lie on top

0:07:05 | of the harmonics

0:07:06 | there's aliasing between the vocal tract component h(τ) and the excitation component u(τ)

0:07:14 | however

0:07:18 | because this aliasing is attenuated by the sinc function, it

0:07:22 | can be negligible for our application

0:07:25 | for example, with a narrowband bandwidth of two hundred Hz, a spacing

0:07:31 | of eight hundred Hz

0:07:32 | and a fundamental frequency of one hundred Hz

0:07:35 | the aliased excitation is attenuated significantly
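That attenuation is easy to quantify under an idealized rectangular band (real analog filters differ, so this is a sketch): a band of width B in frequency becomes a sinc envelope in quefrency, so aliased excitation energy at quefrency τ is scaled by |sinc(Bτ)|. With the talk's numbers, 200 Hz bands and a 100 Hz fundamental, the excitation sits at τ = 10 ms:

```python
import numpy as np

def sinc_attenuation(bandwidth_hz, quefrency_s):
    """|sinc(B * tau)|: attenuation an ideal rectangular band of width B
    applies at quefrency tau in the cepstral domain (np.sinc is the
    normalized sinc, sin(pi x) / (pi x))."""
    return float(np.abs(np.sinc(bandwidth_hz * quefrency_s)))

# Talk's numbers: 200 Hz bands, f0 = 100 Hz -> excitation at tau = 10 ms.
att_excitation = sinc_attenuation(200.0, 1.0 / 100.0)   # essentially zero
# The vocal-tract part sits at much lower quefrency, e.g. around 1 ms:
att_envelope = sinc_attenuation(200.0, 1.0 / 1000.0)    # barely attenuated
```

With B·τ = 2, the excitation alias lands almost exactly on a sinc zero, while the low-quefrency envelope is nearly untouched, which is why the aliasing is negligible for this application.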

0:07:40 | so far what we have concluded is that with a small number of a narrow |

0:07:45 | band filters |

0:07:46 | we can capture the most essential speech information |

0:07:49 | and this enables low-rate downstream sampling and processing

0:07:55 | so we integrate the narrowband

0:08:03 | spectral extraction front end into our proposed

0:08:08 | block diagram

0:08:09 | so here the front-end features

0:08:18 | are fed into the speaker verification back end

0:08:20 | so one distinguishing characteristic of this system is that the individual bands can now

0:08:26 | be turned on and off as needed, depending on the background noise band and system requirements

0:08:33 | so next we talk about the back-end speaker verification algorithm; recall that

0:08:40 | our goal is to design a voice-authenticated wakeup system

0:08:43 | that identifies the user and the pass phrase in one shot

0:08:47 | and we would like to allow the user to

0:08:50 | provide customized pass phrases with only a very small number of enrollment samples

0:08:56 | so

0:08:58 | our application falls into the category of text-dependent speaker verification

0:09:02 | and there are many existing methods for this application; there are model-based methods

0:09:10 | that leverage a cohort of speakers

0:09:14 | who pronounce the pass phrase to train a background model

0:09:17 | and the parameters of the model are then fine-tuned

0:09:20 | with enrollment samples from the target speaker

0:09:24 | and on the other hand there are

0:09:27 | template-based methods that

0:09:28 | do not require prior model training; decisions are made by comparing the input features with

0:09:35 | the enrollment features

0:09:37 | because we would like to allow the users to

0:09:42 | provide customized pass phrases

0:09:46 | we use a template-based method

0:09:49 | so |

0:09:49 | the classical dynamic time warping algorithm is used to overcome

0:09:54 | speaker speed variation and pauses in speech

0:09:58 | however, for our application it turns out that the classical dynamic time warping algorithm either

0:10:03 | provides

0:10:04 | too much

0:10:05 | warping, so that it mutates the signal envelope, which then leads to a large

0:10:10 | number of false positives

0:10:12 | or it does not provide enough warping to compensate for

0:10:15 | the long pauses between words in a pass phrase

0:10:19 | so therefore we propose

0:10:21 | a modified version of the

0:10:23 | dynamic time warping algorithm, which we call the weighted dynamic time warping algorithm, to provide

0:10:28 | sufficient warping such that it can compensate for speaker variation and pauses

0:10:33 | in between words

0:10:35 | without causing too much signal envelope mutation

0:10:38 | so to do this we simply add a penalty term to the distance measurement, and

0:10:43 | the penalty term scales up linearly with the number of consecutive warping steps

0:10:49 | so that

0:10:50 | it prevents too much warping of the signal

0:10:53 | and this penalty scales with the signal magnitude, so the penalty

0:10:58 | is small when the signal amplitude is small, because that probably corresponds to

0:11:03 | pauses in the signal, and the penalty is high when the signal amplitude

0:11:09 | is high, in order to prevent signal envelope mutation

0:11:12 | so this can be illustrated in the distance matrix computation: the

0:11:18 | weighted dynamic time warping algorithm is the same as the classical dynamic

0:11:23 | time warping algorithm

0:11:24 | with only one difference, that is

0:11:29 | we add a cost

0:11:31 | to the distance measurement, and the cost is a function of the signal magnitude and

0:11:35 | the number of consecutive warping steps

0:11:38 | i won't go into the details here; you can find the full implementation

0:11:42 | in the paper
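As a rough sketch of the idea only (the paper's exact cost differs; the lam weighting, the warp-count bookkeeping, and the magnitude scaling below are my assumptions for illustration):

```python
import numpy as np

def weighted_dtw(a, b, lam=0.1):
    """Sketch of DTW with a magnitude-scaled warping penalty (illustrative,
    not the paper's exact cost). The penalty grows linearly with the number
    of consecutive non-diagonal (warping) steps and scales with the local
    signal magnitude, so long warps are cheap during pauses but expensive
    on high-amplitude speech."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)   # D[i, j]: best accumulated cost
    W = np.zeros((n + 1, m + 1), int)     # W[i, j]: warp run on best path
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # (predecessor cost, consecutive-warp count after this step)
            moves = [
                (D[i - 1, j - 1], 0),             # diagonal: warp run resets
                (D[i - 1, j], W[i - 1, j] + 1),   # vertical: warp continues
                (D[i, j - 1], W[i, j - 1] + 1),   # horizontal: warp continues
            ]
            best, best_w = np.inf, 0
            for prev, w in moves:
                # linear-in-w penalty, scaled by local signal magnitude
                cost = prev + d + lam * w * abs(a[i - 1])
                if cost < best:
                    best, best_w = cost, w
            D[i, j], W[i, j] = best, best_w
    return D[n, m]

# identical signals align at zero cost; a warp-requiring pair costs more
a = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
b = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0])
cost_same = weighted_dtw(a, a)
cost_warp = weighted_dtw(a, b)
```

With lam = 0 this reduces to classical DTW, and any positive lam can only raise the cost of warp-heavy paths, which is how excessive envelope mutation is discouraged.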

0:11:44 | so to illustrate the benefits of weighted dynamic time warping

0:11:48 | i show

0:11:48 | a simulation example

0:11:50 | so here our goal is to align the envelopes of the two signals

0:11:54 | and you can see that with a window length of one hundred milliseconds the classical

0:12:00 | dynamic time warping algorithm fails to align the signal envelopes

0:12:05 | here

0:12:06 | and with

0:12:08 | a window length of two hundred milliseconds, and also with the weighted dynamic time warping algorithm

0:12:12 | the signal envelopes are properly aligned

0:12:15 | however you can notice here that

0:12:18 | the shape of the input signal

0:12:22 | is heavily mutated by the classical dynamic time warping algorithm

0:12:28 | on the other hand, the shape is retained

0:12:32 | by the weighted dynamic time warping algorithm

0:12:35 | so next we performed experiments on the entire system design

0:12:39 | so in our experiments we used three pass phrases:

0:12:42 | "hi galaxy", "okay glass", and "okay power", pronounced by thirty to forty speakers with

0:12:48 | twenty to forty repetitions

0:12:49 | and we also added wind and car noises to the clean samples to generate

0:12:53 | noisy examples

0:12:55 | so the reason we chose wind and car noises is because

0:12:58 | they are common for our application

0:13:01 | and also they have the distinguishing characteristic of being narrowly concentrated in the low frequencies

0:13:08 | with this we can illustrate the benefits of adaptive band selection by

0:13:13 | discarding the noisy bands

0:13:15 | so in the experiments we used three enrollment samples, with a narrowband bandwidth of two hundred

0:13:21 | Hz

0:13:22 | chosen to be around the estimated fundamental frequency

0:13:25 | and we compared with two baseline systems:

0:13:29 | a system with forty-dimensional MFCCs and the classical dynamic time warping algorithm, and forty-

0:13:34 | dimensional MFCCs with a GMM-UBM

0:13:38 | back end

0:13:39 | so this table summarizes the experimental results

0:13:43 | when there's no background noise we use the top-N narrowband spectral

0:13:48 | features, and when there is background noise, all the features below

0:13:53 | two kHz, which we believe are contaminated by noise

0:13:57 | are dropped, so we only use the remaining bands
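This band-dropping rule is simple enough to state as code; a sketch (the function and parameter names are mine, not from the talk):

```python
def select_bands(center_freqs_hz, lowfreq_noise_detected, cutoff_hz=2000.0):
    """Adaptive band selection as described in the talk: when low-frequency
    noise (e.g. wind or car noise) is present, drop the narrowband features
    below the cutoff and keep only the remaining bands."""
    if not lowfreq_noise_detected:
        return list(center_freqs_hz)
    return [f for f in center_freqs_hz if f >= cutoff_hz]

clean = select_bands([500.0, 1500.0, 2500.0, 3500.0], False)  # all bands
noisy = select_bands([500.0, 1500.0, 2500.0, 3500.0], True)   # >= 2 kHz only
```

Because each band's contribution to the decision is independent (as discussed in the Q&A), dropping bands this way requires no retraining.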

0:14:00 | it can be seen here that without background noise the narrowband

0:14:04 | spectral coefficients yield comparable overall accuracy to the MFCC features, and at three dB

0:14:10 | SNR the narrowband spectral coefficients yield much better accuracy than MFCC features

0:14:16 | and

0:14:16 | overall the weighted dynamic time warping algorithm yields improved accuracy

0:14:22 | over the classical dynamic time warping algorithm for all features

0:14:27 | and the proposed system, that is

0:14:29 | the narrowband spectral coefficients combined with the weighted dynamic time warping algorithm

0:14:34 | also yields improved accuracy over the GMM-UBM, but with only three enrollment

0:14:40 | samples and without prior background model training

0:14:45 | so we also investigated how

0:14:48 | the performance is affected by different parameters

0:14:52 | so as you can see here

0:14:54 | when we increase the number of bands the accuracy of the system improves

0:14:59 | and also

0:15:02 | the accuracy of the system improves significantly with band selection, that is, this row

0:15:08 | compared with

0:15:10 | without band selection over here

0:15:13 | so the

0:15:18 | total power consumption of the system can be estimated as the summation of the power

0:15:22 | consumption of the front end and the back end, so for front-end power estimation we

0:15:28 | use a front end designed by Texas Instruments that consists of a roughly thirty-band filter bank

0:15:33 | so this front end has a fixed power consumption of

0:15:37 | a hundred fifty microwatts, with an additional power consumption of ten microwatts

0:15:42 | per band

0:15:43 | and the back-end algorithm is implemented on a Cortex-M0 microcontroller

0:15:48 | and

0:15:52 | the assumption is that a decision is made at

0:15:55 | every sixty milliseconds with continuous triggering, so it's a worst-case trigger assumption

0:16:01 | and the power consumption of the back end is

0:16:04 | a few microwatts per band

0:16:06 | so

0:16:07 | here we're just trying to illustrate that overall our system is very power efficient

0:16:12 | and the total power consumption can be kept

0:16:15 | under a few hundred microwatts
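The power-budget arithmetic can be sketched as follows; the 150 µW fixed and 10 µW-per-band front-end figures are from the talk, while the back-end per-band figure was unclear in the recording and is therefore left as a parameter (the 9 µW used below is a hypothetical value):

```python
def total_power_uw(n_bands, backend_uw_per_band):
    """Estimated total power: analog front end (fixed cost plus a per-band
    cost) plus the per-band back-end processing cost."""
    front_end_uw = 150.0 + 10.0 * n_bands   # figures quoted in the talk
    back_end_uw = backend_uw_per_band * n_bands
    return front_end_uw + back_end_uw

# e.g. 10 active bands and a hypothetical 9 uW/band back end:
estimate = total_power_uw(10, 9.0)   # 150 + 100 + 90 = 340 uW
```

Under these assumptions a 10-band configuration lands in the few-hundred-microwatt range claimed in the talk, and turning bands off (adaptive band selection) reduces the total linearly.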

0:16:18 | so to summarize

0:16:20 | i presented a

0:16:22 | new architecture for low-power text-dependent speaker verification

0:16:26 | the system has a front end that consists of a set of narrowband

0:16:30 | analog filters

0:16:32 | which perform early-stage signal dimension reduction

0:16:36 | and performing this

0:16:43 | front-end filtering allows low-rate sampling and low-rate

0:16:48 | downstream processing

0:16:49 | and we also proposed a back-end algorithm for speaker verification

0:16:54 | and overall the system is designed to support adaptive band selection

0:16:59 | which leads to improved robustness to noise, and the overall system has demonstrated

0:17:06 | comparable accuracy to existing systems

0:17:10 | with much lower power consumption

0:17:12 | so that's all for today, thank you

0:17:21 | okay we have time for questions |

0:17:33 | so for the applications you're looking at, probably in

0:17:37 | consumer electronics or home environments and so forth, you don't have an

0:17:42 | infinite number of competing speakers, or speakers to be confused with, so

0:17:48 | have you thought maybe about which phrases might be more effective? if you're looking

0:17:52 | at, let's say, a family using consumer electronics at home, you can kind of distinguish

0:17:57 | which speakers are really more confusable with others; and second, if you're

0:18:04 | trying to keep the power low, it would seem like you'd want to

0:18:08 | have a wakeup-type device and do the verification on something else, so something

0:18:14 | that might actually respond to, say, clapping your hands or some

0:18:18 | type of sound that would actually cause the system to wake up

0:18:23 | so this way the power is not being drawn continuously

0:18:27 | well, thanks

0:18:28 | so for the first question

0:18:33 | certainly there are pass phrases that

0:18:36 | work better than others, and i'm not sure what's

0:18:40 | a good rule to choose which

0:18:42 | pass phrase is better than others

0:18:45 | because we want to sell the systems to different customers all around the world

0:18:50 | so we just

0:18:51 | let them choose their favorite pass phrase

0:18:54 | and for the second question

0:18:56 | actually, for our system we do have a VAD at the front

0:19:00 | that detects the energy

0:19:02 | so it's like

0:19:05 | you have a front end to detect energy

0:19:08 | and then it will wake up our system, and our system will wake up the host

0:19:12 | device

0:19:13 | so the reason

0:19:15 | we still use

0:19:17 | the worst-case scenario for this device

0:19:22 | sorry, the reason we consider the worst-case scenario in our power estimation is because, say

0:19:27 | you are in a very noisy environment

0:19:33 | this system is continuously running with that power consumption; the reason is that

0:19:39 | if you have, for example

0:19:41 | an Apple phone

0:19:44 | in order to activate Siri you need to actually wake up the device

0:19:48 | first

0:19:49 | and then you say the wake word

0:19:51 | so we want to make something that will not drain the battery in a noisy

0:19:56 | environment

0:19:57 | sorry, i'm also thinking you're gonna add this to an internet-of-things-type device

0:20:01 | because that's where it actually would probably have the biggest financial gains

0:20:07 | i think they are also considering selling these for home appliances

0:20:25 | so one of the things that people do with all these devices is that they

0:20:30 | don't hold the devices close anymore

0:20:33 | so you can have, you know

0:20:35 | a phone on the table and use it at a distance

0:20:39 | saying whatever you want to say; so i saw that you have added noise to

0:20:43 | the examples, but have you looked at the effect of distance?

0:20:49 | actually that's a very good question; we have encountered this in real applications, so

0:20:53 | if you know

0:20:55 | how real systems are implemented, there is AGC

0:20:59 | at the front end

0:21:01 | in fact, if you're speaking from far away and close up, the sound

0:21:06 | sounds different

0:21:07 | so

0:21:07 | i think they added some circuitry, like a multi-stage

0:21:13 | AGC, that will amplify the sound more if you are far away and

0:21:18 | amplify less if you are close by

0:21:21 | so i'm not considering this part in my work

0:21:25 | but

0:21:28 | it certainly needs to be considered for real applications

0:21:39 | okay, i have a question; i noticed in slide number nineteen that

0:21:44 | thanks

0:21:55 | you said that for noise you remove all the

0:22:00 | information below two kHz

0:22:04 | did you try to do this also for MFCCs?

0:22:08 | i didn't do that because, i think

0:22:18 | i don't know if people do it, but

0:22:19 | you see, with MFCCs, when you

0:22:23 | take the information from the frequency domain by transforming into the cepstral domain, if you

0:22:28 | just remove some bands and take the MFCCs, it actually will alter the cepstral

0:22:34 | signal significantly; it's not

0:22:37 | just additive

0:22:41 | so i didn't do it for MFCCs

0:22:43 | i think it could be done if it's trained from scratch this way

0:22:54 | okay, so the reason that it can be done in my system is because

0:23:02 | the decision is added together at the end in a linear manner: you can

0:23:07 | consider each sub-band as making a decision, and

0:23:11 | the final decision is a summation of them all, but when you have MFCC coefficients the

0:23:17 | decision is no longer a linear summation of all the sub-band decisions

0:23:22 | so i'm not sure how it can be done

0:23:25 | using MFCCs, and i don't think it has been tried before

0:23:29 | thanks. thank you. okay, let's thank the speaker again