0:00:14 | hi everyone, today i'll be talking about my work on a low-power text-

0:00:18 | dependent speaker verification system |

0:00:20 | with narrowband feature pre-selection and weighted dynamic time warping |

0:00:24 | this is joint work with Texas Instruments

0:00:27 | so first to motivate the work |

0:00:29 | with the increasing popularity of mobile devices such as smartphones and smart watches, there's

0:00:35 | a rising interest in enabling

0:00:38 | voice authenticated |

0:00:41 | speaker verification |

0:00:42 | that is, the device continuously listens to its surroundings and wakes up when

0:00:49 | a designated pass phrase is pronounced by

0:00:52 | the target speaker

0:00:54 | so existing wakeup systems are implemented on the host device using digital solutions

0:01:01 | because those host devices are usually designed for general-purpose applications

0:01:06 | signals are usually acquired at rates much higher than the Nyquist rate

0:01:10 | in order to minimize information loss at the early stage and to enable

0:01:14 | flexible downstream processing, therefore these systems usually involve many stages of processing on high-dimensional

0:01:20 | data

0:01:21 | and therefore result in power consumption in the range of hundreds of milliwatts

0:01:27 | so the opportunity lies in the fact that if you design an application-specific system

0:01:32 | the desired output may be obtained in a more direct manner

0:01:36 | with less processing, by performing early-stage signal dimension reduction with analog components

0:01:41 | and adaptive data processing, so our goal is to design a low-power

0:01:48 | voice-authenticated wakeup system whose power consumption is limited to the range of a

0:01:53 | couple hundred microwatts

0:01:56 | in order to achieve this |

0:01:58 | we proposed a new architecture |

0:02:00 | so in conventional systems |

0:02:02 | the processing pipeline usually involves |

0:02:06 | high sampling rate fast processing |

0:02:08 | and due to the high dimensional features |

0:02:11 | the downstream processing usually requires high-complexity

0:02:15 | computation, therefore incurring high power consumption

0:02:18 | in contrast, the proposed system pre-selects a set of low-dimensional

0:02:24 | spectral features which can be efficiently extracted using analog components, and because these

0:02:32 | pre-selected features are sparse in the frequency domain

0:02:36 | this enables

0:02:37 | low-rate sampling and low-rate processing

0:02:42 | and therefore achieves low power consumption

0:02:47 | so in the remainder of the talk i'll describe the two components of the system |

0:02:52 | the spectral feature pre-selection component and the speaker verification back end

0:02:59 | so first i'll describe the feature pre-selection process

0:03:03 | in this part |

0:03:04 | we'll show that with a few carefully selected narrowband filters

0:03:09 | we are capable of capturing the most essential speech information

0:03:15 | first, a quick review

0:03:19 | so the speech signal s(t)

0:03:31 | can be represented as the convolution between the excitation signal u(t)

0:03:36 | and the vocal tract modulation signal h(t), and h(t)

0:03:41 | contains the essential speech information

0:03:44 | this convolution relationship can become separable in the cepstral domain |

0:03:48 | so here i plot the power spectral density of speech

0:03:53 | in the frequency domain convolution becomes multiplication, and when you take the log of the

0:03:58 | power spectral density, multiplication becomes addition; notice that the peaks of the power spectral

0:04:05 | density correspond to the harmonics of the fundamental frequency

0:04:08 | and the overall envelope corresponds to the vocal tract modulation

0:04:13 | and to transform the signal to the cepstral domain we take the inverse fourier transform |

0:04:19 | and it turns out that speech is

0:04:22 | sparse in the cepstral domain: it consists of two main components, the vocal tract modulation

0:04:27 | component h(τ), which is represented by this narrow rectangle here with cutoff

0:04:32 | quefrency θ_h, and the excitation component u(τ), which

0:04:38 | is represented by this

0:04:41 | delta component at θ_e
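To make this decomposition concrete, here is a small numerical sketch (not from the talk; all signal parameters are illustrative): a synthetic "speech" signal built as an impulse-train excitation convolved with a decaying envelope standing in for the vocal tract. Its cepstrum shows the low-quefrency h(τ) part and the excitation peak at the pitch period.

```python
import numpy as np

def real_cepstrum(x, n_fft=1024):
    """Real cepstrum: inverse FFT of the log power spectral density."""
    psd = np.abs(np.fft.fft(x, n_fft)) ** 2
    return np.fft.ifft(np.log(psd + 1e-12)).real

# Synthetic "speech": impulse-train excitation u(t) with pitch period 64
# samples, convolved with a decaying envelope h(t) standing in for the
# vocal tract response (all parameters illustrative).
n, period = 1024, 64
u = np.zeros(n)
u[::period] = 1.0                      # excitation u(t)
h = np.exp(-np.arange(40) / 8.0)       # vocal tract stand-in h(t)
x = np.convolve(u, h)[:n]

c = real_cepstrum(x, n)
# h(tau) lives at low quefrency; the excitation shows up as a peak at the
# pitch period (quefrency 64 samples), i.e. theta_e in the talk's notation.
pitch_q = 40 + int(np.argmax(c[40:100]))
```

Searching above the low-quefrency region recovers the pitch period (quefrency 64), which is exactly the separability the talk relies on.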

0:04:43 | so our goal is to extract h(τ), and this can be done by

0:04:47 | performing

0:04:49 | a transformation to the cepstral domain

0:04:51 | however this is power-expensive

0:04:53 | so the question is

0:04:55 | how to extract h(τ) without actually performing a transformation to

0:05:00 | the cepstral domain

0:05:01 | so let's begin by trying something simple: say we measure the speech at a number

0:05:06 | of points evenly spaced out across the frequency spectrum, on top of

0:05:12 | the harmonics, and let the spacing between the sampling points be denoted by Δp

0:05:18 | then what this corresponds to is multiplication between the power spectral density and an

0:05:25 | impulse train, and when transformed to the cepstral domain this corresponds to convolution between the speech

0:05:31 | cepstrum and a delta train

0:05:37 | so what we get here is that the cepstrum of the pointwise samples

0:05:42 | is an aliased version of the cepstrum of the original speech, and the aliased copies

0:05:47 | lie exactly on the multiples of 1/Δp
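This sampling-in-frequency / aliasing-in-quefrency duality can be checked numerically. The sketch below (illustrative synthetic signal, not from the talk) keeps only every Δp-th bin of the log power spectrum, on top of the harmonics, and verifies that the resulting short cepstrum equals the full cepstrum folded with period 1/Δp:

```python
import numpy as np

# Synthetic signal: impulse-train excitation (period 64 samples) convolved
# with a decaying stand-in for the vocal tract response.
n, period = 1024, 64
u = np.zeros(n)
u[::period] = 1.0
h = np.exp(-np.arange(40) / 8.0)
x = np.convolve(u, h)[:n]

log_psd = np.log(np.abs(np.fft.fft(x)) ** 2 + 1e-12)
c_full = np.fft.ifft(log_psd).real          # full 1024-point cepstrum

# Keep only every 16th frequency bin (spacing delta_p = f0 here):
c_sampled = np.fft.ifft(log_psd[::16]).real  # 64-point cepstrum

# Decimation in frequency == aliasing in quefrency: the 64-point cepstrum
# equals the full cepstrum folded (summed) with period 64 = 1/delta_p.
aliased = c_full.reshape(16, 64).sum(axis=0)
```

Since h(τ) is confined to low quefrency, it survives the folding intact; only the excitation copies move, which is the takeaway stated next.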

0:05:52 | so the takeaway message here is that even though it seems we have thrown

0:05:57 | away the majority of the frequency spectrum and only kept a few points, the

0:06:02 | most essential speech information, h(τ), is all retained; so the next question

0:06:08 | is

0:06:09 | what if we do not have an exact estimate of the fundamental frequency, or we

0:06:13 | have no information about the fundamental frequency at all? in this case, instead of

0:06:17 | doing point sampling, we use a set of bandpass filters

0:06:22 | this can be represented as

0:06:26 | multiplication between the power spectral density and a rectangular train

0:06:31 | because the rectangular train

0:06:33 | is a

0:06:34 | sinc-attenuated delta train in the cepstral domain, bandpass filtering becomes

0:06:44 | convolution between the

0:06:47 | speech cepstrum and the sinc-attenuated delta train

0:06:50 | so what we get is an aliased version of the original speech cepstrum

0:06:54 | which is attenuated by the sinc-weighted delta train; now notice here that because

0:07:01 | this time we did not choose the rectangular train to lie on top

0:07:05 | of the harmonics

0:07:06 | there's aliasing between the vocal tract component h(τ) and the excitation component u(τ)

0:07:14 | however

0:07:18 | because this aliasing is attenuated by the sinc function, it

0:07:22 | can be negligible for our application

0:07:25 | for example, with a narrowband bandwidth of two hundred Hz, a spacing

0:07:31 | of eight hundred Hz

0:07:32 | and a fundamental frequency of one hundred Hz

0:07:35 | the aliased excitation is attenuated significantly
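That attenuation is easy to quantify under an idealized rectangular band (real analog filters differ, so this is a sketch): a band of width B in frequency becomes a sinc envelope in quefrency, so aliased excitation energy at quefrency τ is scaled by |sinc(Bτ)|. With the talk's numbers, 200 Hz bands and a 100 Hz fundamental, the excitation sits at τ = 10 ms:

```python
import numpy as np

def sinc_attenuation(bandwidth_hz, quefrency_s):
    """|sinc(B * tau)|: attenuation an ideal rectangular band of width B
    applies at quefrency tau in the cepstral domain (np.sinc is the
    normalized sinc, sin(pi x) / (pi x))."""
    return float(np.abs(np.sinc(bandwidth_hz * quefrency_s)))

# Talk's numbers: 200 Hz bands, f0 = 100 Hz -> excitation at tau = 10 ms.
att_excitation = sinc_attenuation(200.0, 1.0 / 100.0)   # essentially zero
# The vocal-tract part sits at much lower quefrency, e.g. around 1 ms:
att_envelope = sinc_attenuation(200.0, 1.0 / 1000.0)    # barely attenuated
```

With B·τ = 2, the excitation alias lands almost exactly on a sinc zero, while the low-quefrency envelope is nearly untouched, which is why the aliasing is negligible for this application.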

0:07:40 | so far what we have concluded is that with a small number of a narrow |

0:07:45 | band filters |

0:07:46 | we can capture the most essential speech information |

0:07:49 | and this enables low-rate downstream sampling and processing

0:07:55 | so we integrate the narrowband

0:08:03 | spectral extraction front end into our proposed

0:08:08 | block diagram

0:08:09 | so here the front-end features

0:08:18 | are fed into the speaker verification back end

0:08:20 | so one distinguishing characteristic of this system is that the individual bands can now

0:08:26 | be turned on and off as needed, depending on the background noise band and system requirements

0:08:33 | so next we talk about the back-end speaker verification algorithm; recall that

0:08:40 | our goal is to design a voice-authenticated wakeup system

0:08:43 | that identifies the user and the pass phrase in one shot

0:08:47 | and we would like to allow the user to

0:08:50 | provide customized pass phrases with only a very small number of enrollment samples

0:08:56 | so

0:08:58 | our application falls into the category of text-dependent speaker verification

0:09:02 | and there are many existing methods for this application; there are model-based methods

0:09:10 | that leverage a cohort of speakers

0:09:14 | who pronounce the pass phrase to train a background model

0:09:17 | and the parameters of the model are then fine-tuned

0:09:20 | with enrollment samples from the target speaker

0:09:24 | and on the other hand there are

0:09:27 | template-based methods that

0:09:28 | do not require prior model training; decisions are made by comparing the input features with

0:09:35 | the enrollment features

0:09:37 | because we would like to allow the users to

0:09:42 | provide customized pass phrases

0:09:46 | we use a template-based method

0:09:49 | so |

0:09:49 | the classical dynamic time warping algorithm is used to overcome

0:09:54 | speaker speed variation and pauses in speech

0:09:58 | however, for our application it turns out that the classical dynamic time warping algorithm either

0:10:03 | provides

0:10:04 | too much

0:10:05 | warping, so that it mutates the signal envelope, which then leads to a large

0:10:10 | number of false positives

0:10:12 | or it does not provide enough warping to compensate for

0:10:15 | the long pauses between words in a pass phrase

0:10:19 | so therefore we propose

0:10:21 | a modified version of the

0:10:23 | dynamic time warping algorithm, which we call the weighted dynamic time warping algorithm, to provide

0:10:28 | sufficient warping such that it can compensate for speaker variation and pauses

0:10:33 | in between words

0:10:35 | without causing too much signal envelope mutation

0:10:38 | so to do this we simply add a penalty term to the distance measurement, and

0:10:43 | the penalty term scales up linearly with the number of consecutive warping steps

0:10:49 | so that

0:10:50 | it prevents too much warping of the signal

0:10:53 | and this penalty scales with the signal magnitude, so the penalty

0:10:58 | is small when the signal amplitude is small, because that probably corresponds to

0:11:03 | pauses in the signal, and the penalty is high when the signal amplitude

0:11:09 | is high, in order to prevent signal envelope mutation

0:11:12 | so this can be illustrated in the distance matrix computation: the

0:11:18 | weighted dynamic time warping algorithm is the same as the classical dynamic

0:11:23 | time warping algorithm

0:11:24 | with only one difference, that is

0:11:29 | we add a cost

0:11:31 | to the distance measurement, and the cost is a function of the signal magnitude and

0:11:35 | the number of consecutive warping steps

0:11:38 | i won't go into the details here; you can find the full implementation

0:11:42 | in the paper
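As a rough sketch of the idea only (the paper's exact cost differs; the lam weighting, the warp-count bookkeeping, and the magnitude scaling below are my assumptions for illustration):

```python
import numpy as np

def weighted_dtw(a, b, lam=0.1):
    """Sketch of DTW with a magnitude-scaled warping penalty (illustrative,
    not the paper's exact cost). The penalty grows linearly with the number
    of consecutive non-diagonal (warping) steps and scales with the local
    signal magnitude, so long warps are cheap during pauses but expensive
    on high-amplitude speech."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)   # D[i, j]: best accumulated cost
    W = np.zeros((n + 1, m + 1), int)     # W[i, j]: warp run on best path
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # (predecessor cost, consecutive-warp count after this step)
            moves = [
                (D[i - 1, j - 1], 0),             # diagonal: warp run resets
                (D[i - 1, j], W[i - 1, j] + 1),   # vertical: warp continues
                (D[i, j - 1], W[i, j - 1] + 1),   # horizontal: warp continues
            ]
            best, best_w = np.inf, 0
            for prev, w in moves:
                # linear-in-w penalty, scaled by local signal magnitude
                cost = prev + d + lam * w * abs(a[i - 1])
                if cost < best:
                    best, best_w = cost, w
            D[i, j], W[i, j] = best, best_w
    return D[n, m]

# identical signals align at zero cost; a warp-requiring pair costs more
a = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
b = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0])
cost_same = weighted_dtw(a, a)
cost_warp = weighted_dtw(a, b)
```

With lam = 0 this reduces to classical DTW, and any positive lam can only raise the cost of warp-heavy paths, which is how excessive envelope mutation is discouraged.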

0:11:44 | so to illustrate the benefits of weighted dynamic time warping

0:11:48 | i show

0:11:48 | a simulation example

0:11:50 | so here our goal is to align the envelopes of the two signals

0:11:54 | and you can see that with a window length of one hundred milliseconds the classical

0:12:00 | dynamic time warping algorithm fails to align the signal envelopes

0:12:05 | here

0:12:06 | and with

0:12:08 | a window length of two hundred milliseconds, and also with the weighted dynamic time warping algorithm

0:12:12 | the signal envelopes are properly aligned

0:12:15 | however you can notice here that

0:12:18 | the shape of the input signal

0:12:22 | is heavily mutated by the classical dynamic time warping algorithm

0:12:28 | on the other hand, the shape is retained

0:12:32 | by the weighted dynamic time warping algorithm

0:12:35 | so next we performed experiments on the entire system design

0:12:39 | so in our experiments we used three pass phrases:

0:12:42 | "hi galaxy", "okay glass", and "okay power", pronounced by thirty to forty speakers with

0:12:48 | twenty to forty repetitions

0:12:49 | and we also added wind and car noises to the clean samples to generate

0:12:53 | noisy examples

0:12:55 | so the reason we chose wind and car noises is because

0:12:58 | they are common for our application

0:13:01 | and also they have the distinguishing characteristic of being narrowly concentrated in the low frequencies

0:13:08 | with this we can illustrate the benefits of adaptive band selection by

0:13:13 | discarding the noisy bands

0:13:15 | so in the experiments we used three enrollment samples, with a narrowband bandwidth of two hundred

0:13:21 | Hz

0:13:22 | chosen to be around the estimated fundamental frequency

0:13:25 | and we compared with two baseline systems:

0:13:29 | a system with forty-dimensional MFCCs and the classical dynamic time warping algorithm, and forty-

0:13:34 | dimensional MFCCs with a GMM-UBM

0:13:38 | back end

0:13:39 | so this table summarizes the experimental results

0:13:43 | when there's no background noise we use the top-N narrowband spectral

0:13:48 | features, and when there is background noise, all the features below

0:13:53 | two kHz, which we believe are contaminated by noise

0:13:57 | are dropped, so we only use the remaining bands
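This band-dropping rule is simple enough to state as code; a sketch (the function and parameter names are mine, not from the talk):

```python
def select_bands(center_freqs_hz, lowfreq_noise_detected, cutoff_hz=2000.0):
    """Adaptive band selection as described in the talk: when low-frequency
    noise (e.g. wind or car noise) is present, drop the narrowband features
    below the cutoff and keep only the remaining bands."""
    if not lowfreq_noise_detected:
        return list(center_freqs_hz)
    return [f for f in center_freqs_hz if f >= cutoff_hz]

clean = select_bands([500.0, 1500.0, 2500.0, 3500.0], False)  # all bands
noisy = select_bands([500.0, 1500.0, 2500.0, 3500.0], True)   # >= 2 kHz only
```

Because each band's contribution to the decision is independent (as discussed in the Q&A), dropping bands this way requires no retraining.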

0:14:00 | it can be seen here that without background noise the narrowband

0:14:04 | spectral coefficients yield comparable overall accuracy to the MFCC features, and at three dB

0:14:10 | SNR the narrowband spectral coefficients yield much better accuracy than MFCC features

0:14:16 | and

0:14:16 | overall the weighted dynamic time warping algorithm yields improved accuracy

0:14:22 | over the classical dynamic time warping algorithm for all features

0:14:27 | and the proposed system, that is

0:14:29 | the narrowband spectral coefficients combined with the weighted dynamic time warping algorithm

0:14:34 | also yields improved accuracy over the GMM-UBM, but with only three enrollment

0:14:40 | samples and without prior background model training

0:14:45 | so we also investigated how

0:14:48 | the performance is affected by different parameters

0:14:52 | so as you can see here

0:14:54 | when we increase the number of bands the accuracy of the system improves

0:14:59 | and also

0:15:02 | the accuracy of the system improves significantly with band selection, that is, this row

0:15:08 | compared with

0:15:10 | without band selection over here

0:15:13 | so the

0:15:18 | total power consumption of the system can be estimated as the summation of the power

0:15:22 | consumption of the front end and the back end, so for front-end power estimation we

0:15:28 | use a front end designed by Texas Instruments that consists of a roughly thirty-band filter bank

0:15:33 | so this front end has a fixed power consumption of

0:15:37 | a hundred fifty microwatts, with an additional power consumption of ten microwatts

0:15:42 | per band

0:15:43 | and the back-end algorithm is implemented on a Cortex-M0 microcontroller

0:15:48 | and

0:15:52 | the assumption is that a decision is made at

0:15:55 | every sixty milliseconds with continuous triggering, so it's a worst-case trigger assumption

0:16:01 | and the power consumption of the back end is

0:16:04 | a few microwatts per band

0:16:06 | so

0:16:07 | here we're just trying to illustrate that overall our system is very power efficient

0:16:12 | and the total power consumption can be kept

0:16:15 | under a few hundred microwatts
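The power-budget arithmetic can be sketched as follows; the 150 µW fixed and 10 µW-per-band front-end figures are from the talk, while the back-end per-band figure was unclear in the recording and is therefore left as a parameter (the 9 µW used below is a hypothetical value):

```python
def total_power_uw(n_bands, backend_uw_per_band):
    """Estimated total power: analog front end (fixed cost plus a per-band
    cost) plus the per-band back-end processing cost."""
    front_end_uw = 150.0 + 10.0 * n_bands   # figures quoted in the talk
    back_end_uw = backend_uw_per_band * n_bands
    return front_end_uw + back_end_uw

# e.g. 10 active bands and a hypothetical 9 uW/band back end:
estimate = total_power_uw(10, 9.0)   # 150 + 100 + 90 = 340 uW
```

Under these assumptions a 10-band configuration lands in the few-hundred-microwatt range claimed in the talk, and turning bands off (adaptive band selection) reduces the total linearly.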

0:16:18 | so to summarize

0:16:20 | i presented a

0:16:22 | new architecture for low-power text-dependent speaker verification

0:16:26 | the system has a front end that consists of a set of narrowband

0:16:30 | analog filters

0:16:32 | which perform early-stage signal dimension reduction

0:16:36 | and performing this

0:16:43 | front-end filtering allows low-rate sampling and low-rate

0:16:48 | downstream processing

0:16:49 | and we also proposed a back-end algorithm for speaker verification

0:16:54 | and overall the system is designed to support adaptive band selection

0:16:59 | which leads to improved robustness to noise, and the overall system has demonstrated

0:17:06 | comparable accuracy to existing systems

0:17:10 | with much lower power consumption

0:17:12 | so that's all for today, thank you

0:17:21 | okay we have time for questions |

0:17:33 | so for the applications you're looking at, probably in

0:17:37 | consumer electronics or home environments and so forth, you don't have an

0:17:42 | infinite number of competing speakers, or speakers to be confused with, so

0:17:48 | have you thought maybe about which phrases might be more effective? if you're looking

0:17:52 | at, let's say, a family using consumer electronics at home, you can kind of distinguish

0:17:57 | which speakers are really more confusable with others; and second, if you're

0:18:04 | trying to keep the power low, it would seem like you'd want to

0:18:08 | have a wakeup-type device and do the verification on something else, so something

0:18:14 | that might actually respond to, say, clapping your hands or some

0:18:18 | type of sound that would actually cause the system to wake up

0:18:23 | so this way the power is not being drawn continuously

0:18:27 | well, thanks

0:18:28 | so for the first question

0:18:33 | certainly there are pass phrases that

0:18:36 | work better than others, and i'm not sure what's

0:18:40 | a good rule to choose which

0:18:42 | pass phrase is better than others

0:18:45 | because we want to sell the systems to different customers all around the world

0:18:50 | so we just

0:18:51 | let them choose their favorite pass phrase

0:18:54 | and for the second question

0:18:56 | actually, for our system we do have a VAD at the front

0:19:00 | that detects the energy

0:19:02 | so it's like

0:19:05 | you have a front end to detect energy

0:19:08 | and then it will wake up our system, and our system will wake up the host

0:19:12 | device

0:19:13 | so the reason

0:19:15 | we still use

0:19:17 | the worst-case scenario for this device

0:19:22 | sorry, the reason we consider the worst-case scenario in our power estimation is because, say

0:19:27 | you are in a very noisy environment

0:19:33 | this system is continuously running with that power consumption; the reason is that

0:19:39 | if you have, for example

0:19:41 | an Apple phone

0:19:44 | in order to activate Siri you need to actually wake up the device

0:19:48 | first

0:19:49 | and then you say the wake word

0:19:51 | so we want to make something that will not drain the battery in a noisy

0:19:56 | environment

0:19:57 | sorry, i'm also thinking you're gonna add this to an internet-of-things-type device

0:20:01 | because that's where it actually would probably have the biggest financial gains

0:20:07 | i think they are also considering selling these for home appliances

0:20:25 | so one of the things that people do with all these devices is that they

0:20:30 | don't hold the devices close anymore

0:20:33 | so you can have, you know

0:20:35 | a phone on the table and use it at a distance

0:20:39 | saying whatever you want to say; so i saw that you have added noise to

0:20:43 | the examples, but have you looked at the effect of distance?

0:20:49 | actually that's a very good question; we have encountered this in real applications, so

0:20:53 | if you know

0:20:55 | how real systems are implemented, there is AGC

0:20:59 | at the front end

0:21:01 | in fact, if you're speaking from far away and close up, the sound

0:21:06 | sounds different

0:21:07 | so

0:21:07 | i think they added some circuitry, like a multi-stage

0:21:13 | AGC, that will amplify the sound more if you are far away and

0:21:18 | amplify less if you are close by

0:21:21 | so i'm not considering this part in my work

0:21:25 | but

0:21:28 | it certainly needs to be considered for real applications

0:21:39 | okay, i have a question; i noticed in slide number nineteen that

0:21:44 | thanks

0:21:55 | you said that for noise you remove all the

0:22:00 | information below two kHz

0:22:04 | did you try to do this also for MFCCs?

0:22:08 | i didn't do that because, i think

0:22:18 | i don't know if people do it, but

0:22:19 | you see, with MFCCs, when you

0:22:23 | take the information from the frequency domain by transforming into the cepstral domain, if you

0:22:28 | just remove some bands and take the MFCCs, it actually will alter the cepstral

0:22:34 | signal significantly; it's not

0:22:37 | just additive

0:22:41 | so i didn't do it for MFCCs

0:22:43 | i think it could be done if it's trained from scratch this way

0:22:54 | okay, so the reason that it can be done in my system is because

0:23:02 | the decision is added together at the end in a linear manner: you can

0:23:07 | consider each sub-band as making a decision, and

0:23:11 | the final decision is a summation of them all, but when you have MFCC coefficients the

0:23:17 | decision is no longer a linear summation of all the sub-band decisions

0:23:22 | so i'm not sure how it can be done

0:23:25 | using MFCCs, and i don't think it has been tried before

0:23:29 | thanks. thank you. okay, let's thank the speaker again