| 0:00:16 | Hello everyone. We are from Johns Hopkins University. |
|---|
| 0:00:20 | [inaudible] |
|---|
| 0:00:22 | Our presentation is on speaker verification and speech enhancement. |
|---|
| 0:00:27 | Let's start with the slides. |
|---|
| 0:00:36 | This presentation is about a system which does enhancement for speaker verification. |
|---|
| 0:00:43 | I will be using some slides from my previous work, which was called "Feature |
|---|
| 0:00:49 | Enhancement with Deep Feature Losses |
|---|
| 0:00:50 | for Speaker Verification". |
|---|
| 0:00:56 | Our downstream task is speaker verification, |
|---|
| 0:00:59 | and the problem refers to the |
|---|
| 0:01:03 | task of determining if the speaker in utterance one, |
|---|
| 0:01:06 | which is the enrollment utterance, is the same as |
|---|
| 0:01:09 | the speaker in utterance two, which is the test utterance. |
|---|
| 0:01:13 | The state-of-the-art way to implement this is to use a so-called x-vector extractor network and |
|---|
| 0:01:19 | a probabilistic linear discriminant analysis (PLDA) back-end, |
|---|
| 0:01:23 | and also to do data augmentation |
|---|
| 0:01:27 | in conjunction. |
|---|
| 0:01:30 | Speech enhancement |
|---|
| 0:01:31 | enters this problem when you help speaker verification |
|---|
| 0:01:35 | by preprocessing the enrollment and test utterances during test time. |
|---|
| 0:01:42 | It has been noted that speech enhancement may not help when trained independently of the |
|---|
| 0:01:48 | objective of speaker recognition, |
|---|
| 0:01:52 | and we pursue a type of training, deep feature loss training, |
|---|
| 0:01:56 | which |
|---|
| 0:01:56 | connects the two problems, as we will see now. |
|---|
| 0:02:02 | This is the schematic of deep feature loss training. As you can see, there are |
|---|
| 0:02:08 | two networks: one is the |
|---|
| 0:02:10 | enhancement network, and the other one is the auxiliary |
|---|
| 0:02:15 | network. |
|---|
| 0:02:18 | The enhancement network takes noisy features and produces enhanced features. |
|---|
| 0:02:23 | These enhanced features are not directly compared with the clean features; rather, they are forwarded to |
|---|
| 0:02:30 | the auxiliary network, and the intermediate activations in the two forward passes |
|---|
| 0:02:37 | are compared; the differences between them are known as the deep feature loss. |
|---|
| 0:02:43 | When we don't use the auxiliary network on the clean and enhanced features and simply |
|---|
| 0:02:46 | compare the enhanced features with the clean features, |
|---|
| 0:02:49 | it is called |
|---|
| 0:02:51 | a feature loss. |
|---|
| 0:02:54 | As you can imagine, |
|---|
| 0:02:55 | this type of training is doing enhancement; however, it also injects speaker information, |
|---|
| 0:03:02 | since the auxiliary network is trained for speaker classification. |
|---|
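To make the training scheme concrete, here is a minimal PyTorch sketch of deep feature loss training. The network shapes are toy stand-ins, not the architectures from the talk: the frozen `aux_layers` play the role of the pre-trained auxiliary network's initial layers, and only `enhancer` receives gradients.

```python
# Minimal sketch of deep feature loss training (toy shapes, illustrative only).
import torch
import torch.nn as nn

# Stand-in enhancement network: noisy features in, enhanced features out.
enhancer = nn.Sequential(
    nn.Conv1d(40, 64, 5, padding=2), nn.ReLU(),
    nn.Conv1d(64, 40, 5, padding=2),
)
# Stand-in for the auxiliary network's initial layers (pre-trained, frozen).
aux_layers = nn.ModuleList([
    nn.Sequential(nn.Conv1d(40, 64, 3, padding=1), nn.ReLU()),
    nn.Sequential(nn.Conv1d(64, 64, 3, padding=1), nn.ReLU()),
])
for p in aux_layers.parameters():
    p.requires_grad_(False)

def deep_feature_loss(noisy, clean):
    """L1 distance between auxiliary activations of enhanced vs. clean input."""
    x_e, x_c, loss = enhancer(noisy), clean, 0.0
    for layer in aux_layers:          # accumulate one term per initial layer
        x_e, x_c = layer(x_e), layer(x_c)
        loss = loss + (x_e - x_c).abs().mean()
    return loss

noisy = torch.randn(8, 40, 200)       # (batch, feature dim, frames)
clean = torch.randn(8, 40, 200)
deep_feature_loss(noisy, clean).backward()   # gradients flow to enhancer only
```

Dropping the auxiliary pass and comparing `enhancer(noisy)` with `clean` directly gives the plain feature loss mentioned above.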
| 0:03:08 | This is how our speaker verification pipeline looks: the enrollment and test utterances go through |
|---|
| 0:03:15 | feature extraction independently and are also enhanced independently. Then |
|---|
| 0:03:22 | each of them |
|---|
| 0:03:23 | goes through our embedding extractor, which in our case is the x-vector network, |
|---|
| 0:03:30 | and |
|---|
| 0:03:31 | then the PLDA classifier |
|---|
| 0:03:33 | tries to compute a log-likelihood ratio and say |
|---|
| 0:03:38 | whether it is the |
|---|
| 0:03:40 | same speaker or not. |
|---|
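For reference, the log-likelihood ratio mentioned here is the standard PLDA verification score. Writing it out, with $\mathbf{x}_e$ and $\mathbf{x}_t$ the enrollment and test embeddings:

```latex
\mathrm{LLR}(\mathbf{x}_e, \mathbf{x}_t) =
  \log \frac{p(\mathbf{x}_e, \mathbf{x}_t \mid \mathcal{H}_\text{same})}
            {p(\mathbf{x}_e, \mathbf{x}_t \mid \mathcal{H}_\text{diff})}
```

A trial is accepted as same-speaker when the ratio exceeds a threshold set by the operating point.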
| 0:03:44 | Now, these are the details of how the database creation is done. |
|---|
| 0:03:49 | We use |
|---|
| 0:03:50 | the MUSAN corpus, which consists of |
|---|
| 0:03:53 | three noise classes: babble, |
|---|
| 0:03:57 | music, and general noises. |
|---|
| 0:03:59 | Next, |
|---|
| 0:04:02 | these |
|---|
| 0:04:03 | noise classes are used to |
|---|
| 0:04:06 | combine |
|---|
| 0:04:07 | with |
|---|
| 0:04:08 | VoxCeleb, which contains 16 kHz conversational speech, |
|---|
| 0:04:13 | and the combined, augmented set is about three times the original size. |
|---|
| 0:04:19 | For the enhancement network's |
|---|
| 0:04:22 | noisy data, a subset |
|---|
| 0:04:25 | is sampled at a fifty percent rate, |
|---|
| 0:04:28 | randomly selecting the utterances. |
|---|
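A small sketch of the additive noise augmentation described above, assuming a simple additive mixing model; the helper name and the SNR range are illustrative, not the exact recipe from the talk:

```python
# Mix a noise clip into an utterance at a chosen signal-to-noise ratio.
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`."""
    if len(noise) < len(speech):                     # loop short noise clips
        noise = np.tile(noise, len(speech) // len(noise) + 1)
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)                  # 1 s of audio at 16 kHz
noise = rng.standard_normal(8000)
noisy = mix_at_snr(speech, noise, snr_db=rng.uniform(0, 20))
```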
| 0:04:33 | We also use |
|---|
| 0:04:35 | an SNR filtering algorithm called WADA-SNR to |
|---|
| 0:04:39 | create a fifty percent subset of VoxCeleb, |
|---|
| 0:04:42 | and it is supposed to preserve the highest-SNR utterances from VoxCeleb. |
|---|
| 0:04:48 | Such a clean version of VoxCeleb is then combined with |
|---|
| 0:04:53 | the MUSAN |
|---|
| 0:04:54 | noises, and that serves as the noisy counterpart for our supervised enhancement training. |
|---|
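The filtering step, sketched under the assumption that a `wada_snr` scorer is available; WADA-SNR itself is a published blind estimator and is not implemented here:

```python
# Keep the half of the corpus with the highest estimated SNR.
def top_half_by_snr(utterances, wada_snr):
    """`wada_snr` is a placeholder callable returning an SNR estimate in dB."""
    ranked = sorted(utterances, key=wada_snr, reverse=True)
    return ranked[: len(ranked) // 2]
```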
| 0:05:04 | The PLDA is trained with the combined, augmented VoxCeleb data, and |
|---|
| 0:05:09 | the x-vector network |
|---|
| 0:05:11 | uses it too. |
|---|
| 0:05:14 | To give more details, the features that we use are 40-dimensional log-mel filterbanks; |
|---|
| 0:05:20 | this is the case for the x-vector network and the other networks. |
|---|
| 0:05:22 | The evaluation will be done on BabyTrain, which is a corpus containing |
|---|
| 0:05:27 | speech of young children recorded in an uncontrolled environment. |
|---|
| 0:05:32 | The complete data is 250 hours' worth, and it is divided into detection and |
|---|
| 0:05:37 | diarization tasks. |
|---|
| 0:05:40 | We have not included |
|---|
| 0:05:42 | the diarization component in our pipeline. |
|---|
| 0:05:46 | For the evaluation data, the numbers of speakers in enroll and test are five |
|---|
| 0:05:51 | hundred ninety-five and one hundred fifty, respectively, |
|---|
| 0:05:54 | and results are presented in the form of equal error rate and minimum decision cost function |
|---|
| 0:06:00 | with a target prior probability of five percent. |
|---|
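A sketch of how the two reported metrics can be computed from trial scores, assuming unit miss and false-alarm costs with the stated 5% target prior; the evaluation's exact cost weights are an assumption here:

```python
# Equal error rate and minimum decision cost function from trial scores.
import numpy as np

def eer_and_min_dcf(scores, labels, p_target=0.05):
    order = np.argsort(scores)[::-1]          # sweep threshold high -> low
    labels = np.asarray(labels, dtype=float)[order]
    n_tar, n_non = labels.sum(), len(labels) - labels.sum()
    p_miss = np.concatenate([[1.0], 1.0 - np.cumsum(labels) / n_tar])
    p_fa = np.concatenate([[0.0], np.cumsum(1.0 - labels) / n_non])
    eer_idx = np.argmin(np.abs(p_miss - p_fa))
    eer = 0.5 * (p_miss[eer_idx] + p_fa[eer_idx])
    dcf = p_target * p_miss + (1.0 - p_target) * p_fa
    min_dcf = dcf.min() / min(p_target, 1.0 - p_target)   # normalized
    return eer, min_dcf

scores = np.array([2.3, 0.1, -1.2, 1.5, -0.3])
labels = np.array([1, 0, 0, 1, 0])            # 1 = same-speaker trial
print(eer_and_min_dcf(scores, labels))        # perfect separation -> (0.0, 0.0)
```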
| 0:06:05 | The table that you see here is from our previous work, which we want to |
|---|
| 0:06:10 | analyze in this work. |
|---|
| 0:06:13 | If you focus on the second |
|---|
| 0:06:17 | dataset column, which is for BabyTrain, |
|---|
| 0:06:21 | you can see the first row |
|---|
| 0:06:24 | is actually without enhancement, and it refers to the original version of the x- |
|---|
| 0:06:29 | vector network, |
|---|
| 0:06:30 | and FL and DFL are just |
|---|
| 0:06:32 | notation to denote |
|---|
| 0:06:34 | the type of enhancement and data used. |
|---|
| 0:06:38 | So the first row actually gives results |
|---|
| 0:06:42 | without enhancement, and it is 7.6 percent EER; then we have the |
|---|
| 0:06:47 | feature loss, the deep feature loss, and also their combination, |
|---|
| 0:06:51 | and |
|---|
| 0:06:52 | the DFL usually gave the best performance previously. |
|---|
| 0:06:57 | Against row zero, |
|---|
| 0:06:58 | the comparison shows how much performance we gain, and you can see |
|---|
| 0:07:04 | that with our deep feature loss enhancement, |
|---|
| 0:07:07 | performance improves to around six percent. |
|---|
| 0:07:11 | Having said that, we want to address seven questions. |
|---|
| 0:07:16 | First is: |
|---|
| 0:07:18 | are only the initial layers of the auxiliary network useful for deep feature loss training? |
|---|
| 0:07:23 | Can the feature loss be additive with the deep feature loss? |
|---|
| 0:07:27 | Second is: |
|---|
| 0:07:29 | for supervised enhancement training, how clean a dataset is required? |
|---|
| 0:07:33 | Can I just use clean speech from |
|---|
| 0:07:35 | another carefully created database, |
|---|
| 0:07:38 | despite mismatch issues? |
|---|
| 0:07:40 | Third: currently the x-vector and auxiliary networks |
|---|
| 0:07:44 | are available pre-trained on low-dimensional features; can I train the enhancement |
|---|
| 0:07:49 | network with higher-dimensional features and still get some benefit? |
|---|
| 0:07:57 | Fourth is to enhance the x-vector training data and retrain for |
|---|
| 0:08:01 | improvements. |
|---|
| 0:08:05 | Fifth is to use enhanced features to bootstrap the training data, double the amount of |
|---|
| 0:08:10 | data, and make our extractor more robust. |
|---|
| 0:08:16 | Sixth is to see if the noise classes that we're working with are really useful |
|---|
| 0:08:22 | during the data augmentation process, |
|---|
| 0:08:25 | or if some of the noise classes are |
|---|
| 0:08:27 | even harmful. |
|---|
| 0:08:30 | The final question is to test the proposed scheme on the task of dereverberation and joint |
|---|
| 0:08:35 | denoising and dereverberation. |
|---|
| 0:08:40 | First, we reproduce the baseline and see which layers are good for deep |
|---|
| 0:08:45 | feature loss extraction. |
|---|
| 0:08:48 | This |
|---|
| 0:08:49 | is a results table with a lot of numbers, but for this presentation it's enough |
|---|
| 0:08:55 | to focus on the first column, which gives you the labels |
|---|
| 0:08:59 | for the type of loss or data that's being used, |
|---|
| 0:09:02 | and the final |
|---|
| 0:09:04 | column, which is the mean result on the |
|---|
| 0:09:08 | BabyTrain test set. |
|---|
| 0:09:11 | The first row shows that without enhancement we get around nine percent EER, |
|---|
| 0:09:15 | and then we have DFL-5, meaning the deep feature loss extracted from |
|---|
| 0:09:20 | five layers. |
|---|
| 0:09:21 | Note that |
|---|
| 0:09:24 | our auxiliary network has six layers: |
|---|
| 0:09:26 | the first up to five are used in this loss, and the sixth is |
|---|
| 0:09:30 | the |
|---|
| 0:09:31 | classification layer, which we are not using for this particular loss. |
|---|
| 0:09:36 | It gives the best performance among the combinations |
|---|
| 0:09:40 | we see, |
|---|
| 0:09:41 | and the FL is the plain feature loss, and it gives worse |
|---|
| 0:09:46 | performance than even the baseline. |
|---|
| 0:09:48 | This reproduces observations from previous work. |
|---|
| 0:09:52 | Combining the two losses |
|---|
| 0:09:54 | is also not good. |
|---|
| 0:09:57 | When you include the embedding |
|---|
| 0:10:00 | layer, the last layer, in the deep feature loss, |
|---|
| 0:10:04 | it is also not helpful. |
|---|
| 0:10:07 | Then we use |
|---|
| 0:10:09 | the deep feature loss with four layers, three layers, two layers, and |
|---|
| 0:10:14 | finally one layer, and they are not as good as using all five layers. |
|---|
| 0:10:18 | The bottom half of the table is the minimum decision cost function; |
|---|
| 0:10:23 | the |
|---|
| 0:10:24 | observations are mostly the same as for the equal error rate. |
|---|
| 0:10:27 | So here we have seen that the feature losses from the initial layers are all useful for our system; |
|---|
| 0:10:34 | combining them |
|---|
| 0:10:36 | is also useful. |
|---|
| 0:10:38 | Using more layers is the best, though it increases the computational complexity, |
|---|
| 0:10:44 | but that's okay. |
|---|
| 0:10:48 | The main takeaway is that |
|---|
| 0:10:50 | you need to |
|---|
| 0:10:51 | use all the initial layers of the auxiliary network. |
|---|
| 0:10:58 | Next, we see the choice of the training dataset for the enhancement and auxiliary |
|---|
| 0:11:01 | networks. |
|---|
| 0:11:03 | We see the row labeled with the WADA-SNR filter; it means |
|---|
| 0:11:07 | VoxCeleb |
|---|
| 0:11:08 | with the WADA-SNR filtering was used for the |
|---|
| 0:11:11 | enhancement network, and |
|---|
| 0:11:13 | also, |
|---|
| 0:11:15 | as a consequence, for the |
|---|
| 0:11:18 | auxiliary network, and it gives the best performance, shown by the boldface |
|---|
| 0:11:24 | row. |
|---|
| 0:11:26 | Then there is the full VoxCeleb, |
|---|
| 0:11:29 | where we don't have the filtering, |
|---|
| 0:11:32 | but instead the whole set is combined with |
|---|
| 0:11:34 | the noise augmentations. |
|---|
| 0:11:37 | We also try feeding a random subset to the enhancement network, |
|---|
| 0:11:43 | which is, if you recall, |
|---|
| 0:11:44 | a random fifty percent subset of VoxCeleb, |
|---|
| 0:11:48 | and it is not as |
|---|
| 0:11:51 | good as the WADA-SNR filtering. So |
|---|
| 0:11:55 | this shows that screening |
|---|
| 0:11:59 | VoxCeleb for SNR seems to be important. |
|---|
| 0:12:03 | Using LibriSpeech, |
|---|
| 0:12:05 | you can see, of course, an EER even greater than the baseline, |
|---|
| 0:12:11 | and so LibriSpeech, |
|---|
| 0:12:13 | we think, being |
|---|
| 0:12:14 | non-conversational and mismatched data, is bad for training, |
|---|
| 0:12:20 | even when used as the |
|---|
| 0:12:22 | clean counterpart for the |
|---|
| 0:12:24 | enhancement network. |
|---|
| 0:12:28 | We also think the auxiliary network is powerful because it is |
|---|
| 0:12:33 | trained on the whole VoxCeleb, |
|---|
| 0:12:38 | which means that more data is used and |
|---|
| 0:12:40 | the data condition is also matched. |
|---|
| 0:12:46 | Next, we see if we can mismatch the features: can I use high- |
|---|
| 0:12:50 | dimensional features in the enhancement network? |
|---|
| 0:12:53 | In the second row, the first label |
|---|
| 0:12:56 | is the one |
|---|
| 0:12:57 | that means 40-dimensional log-mel filterbank features |
|---|
| 0:13:01 | for the enhancement network. |
|---|
| 0:13:04 | Recall that 40-dimensional features are used in the x-vector network and the auxiliary |
|---|
| 0:13:10 | network as well, |
|---|
| 0:13:11 | so this is the condition where the features are matched, |
|---|
| 0:13:15 | so I don't need to learn any bridge between the networks; this case performs |
|---|
| 0:13:21 | well. |
|---|
| 0:13:22 | If the feature dimension is increased, or we use the magnitude spectrogram, |
|---|
| 0:13:28 | then the features are mismatched, and you need to learn a bridge between the networks as |
|---|
| 0:13:33 | well, |
|---|
| 0:13:33 | and |
|---|
| 0:13:34 | the results are not as good as the matched condition. |
|---|
| 0:13:38 | It seems like we cannot take advantage of high-dimensional features |
|---|
| 0:13:43 | here. |
|---|
| 0:13:44 | We also thought the spectrogram somehow should be useful, at least for enhancement, |
|---|
| 0:13:50 | but it is also |
|---|
| 0:13:51 | worse than the baseline. |
|---|
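To illustrate the matched versus mismatched front-ends discussed here, a short librosa sketch; the window, hop, and FFT settings are assumptions, not the talk's exact front-end:

```python
# 40-dimensional log-mel filterbank features, matching both networks.
import numpy as np
import librosa

sr = 16000
y = np.random.default_rng(0).standard_normal(sr).astype(np.float32)  # stand-in audio
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=512, hop_length=160, n_mels=40,   # 40 mel bands
)
logmel = np.log(mel + 1e-10)          # shape (40, num_frames): matched front-end
# A mismatched front-end would raise n_mels (e.g., 80) or use the magnitude
# spectrogram itself, forcing the enhancement network to bridge feature spaces.
```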
| 0:13:58 | Next, we see the effect of enhancement on the PLDA and the x-vector training data. |
|---|
| 0:14:06 | The first row is not as good as our standard setup where only the test data is enhanced, |
|---|
| 0:14:11 | as the percentages show; |
|---|
| 0:14:13 | we can see |
|---|
| 0:14:17 | the condition with enhanced PLDA data written |
|---|
| 0:14:20 | with a PLDA label, which means the PLDA |
|---|
| 0:14:23 | training data is also enhanced, |
|---|
| 0:14:25 | and it deteriorates the error rate to around seven percent. |
|---|
| 0:14:31 | For the minDCF, we have |
|---|
| 0:14:38 | not much change, so it seems that the PLDA is |
|---|
| 0:14:43 | not benefiting and is quite susceptible to the enhancement processing. |
|---|
| 0:14:49 | If we enhance the x-vector training set, |
|---|
| 0:14:52 | there is improvement over the standard baseline, |
|---|
| 0:14:56 | which is a retrained system; |
|---|
| 0:14:58 | however, it's not as good as just enhancing the test data. |
|---|
| 0:15:02 | On the whole, it seems like |
|---|
| 0:15:05 | the robustness of the whole system is lost, so it's not working, at least |
|---|
| 0:15:10 | for |
|---|
| 0:15:12 | this corpus. |
|---|
| 0:15:16 | Next, we combine the enhanced features with the original ones to see if we can take advantage of them being |
|---|
| 0:15:22 | complementary to the original features. |
|---|
| 0:15:25 | Note that one notation means that we enhance only the enrollment and test conditions, |
|---|
| 0:15:30 | and another means we enhance all of the data. |
|---|
| 0:15:37 | In the column |
|---|
| 0:15:39 | you will see a PLDA label; that means |
|---|
| 0:15:43 | the augmentation |
|---|
| 0:15:45 | is done in the |
|---|
| 0:15:48 | PLDA training, |
|---|
| 0:15:51 | pooling the original features as a second copy along with the enhanced data, |
|---|
| 0:15:57 | and it does not seem to be helping. |
|---|
| 0:16:01 | But |
|---|
| 0:16:02 | when we combine these features in the x-vector training set, |
|---|
| 0:16:06 | it actually gives much better performance; it seems like the network can use the doubled data, and |
|---|
| 0:16:12 | there is also complementary |
|---|
| 0:16:15 | information in the |
|---|
| 0:16:17 | enhanced features, so |
|---|
| 0:16:19 | they can be harnessed. |
|---|
| 0:16:22 | If we combine the enhanced features in the x-vector training as well |
|---|
| 0:16:26 | as in the PLDA training, it doesn't help. |
|---|
| 0:16:32 | So this confirms that the PLDA is susceptible to enhancement processing; |
|---|
| 0:16:37 | it is better to just put the enhanced features in the x-vector training |
|---|
| 0:16:41 | set, not in the PLDA, for a robust network. |
|---|
| 0:16:46 | Now we see what happens if we eliminate one type of noise class from the x-vector network |
|---|
| 0:16:51 | or the |
|---|
| 0:16:54 | enhancement data. |
|---|
| 0:16:56 | So let's focus on the last row of this table, which is for the |
|---|
| 0:17:01 | music noise class. |
|---|
| 0:17:05 | In the last column we have a better number, which means that |
|---|
| 0:17:10 | if I skip |
|---|
| 0:17:12 | using the music files for the x-vector network, and I also don't use enhancement, it is actually |
|---|
| 0:17:18 | doing better than the baseline, which means |
|---|
| 0:17:20 | that |
|---|
| 0:17:21 | removing music is good; so this noise class actually hurts performance. |
|---|
| 0:17:29 | Next, "unseen" means I use enhancement, but |
|---|
| 0:17:34 | the enhancement network has not seen music; |
|---|
| 0:17:36 | it's still able to improve over the previous number. |
|---|
| 0:17:41 | Most interestingly, |
|---|
| 0:17:43 | we then use the |
|---|
| 0:17:44 | "seen" condition, which is |
|---|
| 0:17:46 | when I use an enhancement network which has seen music, and it is the best. |
|---|
| 0:17:51 | So it seems like some noise classes are |
|---|
| 0:17:55 | harmful |
|---|
| 0:17:57 | when |
|---|
| 0:17:59 | we just include them in the x-vector training, |
|---|
| 0:18:02 | but it is okay to include them in the |
|---|
| 0:18:05 | enhancement |
|---|
| 0:18:07 | training data. |
|---|
| 0:18:11 | Next, to see if we can do dereverberation with the deep feature loss, we tried several |
|---|
| 0:18:16 | schemes: |
|---|
| 0:18:17 | a cascade, that would be a denoising scheme followed by a dereverberation scheme; trying to do |
|---|
| 0:18:25 | dereverberation and denoising in a |
|---|
| 0:18:27 | joint fashion; |
|---|
| 0:18:29 | and also in a single-stage fashion, which is denoted by "joint one-stage". |
|---|
| 0:18:33 | If we view all these numbers, |
|---|
| 0:18:37 | we see |
|---|
| 0:18:39 | that the dereverberation is not actually working. |
|---|
| 0:18:42 | We also suspect it's possible that there is no workable configuration; nevertheless, the main thing |
|---|
| 0:18:50 | we see is that |
|---|
| 0:18:53 | you may need |
|---|
| 0:18:56 | a dedicated pre-processing step for improving on this instead. |
|---|
| 0:19:02 | Finally, the takeaways: you need to use all the initial |
|---|
| 0:19:08 | layers of the auxiliary network for this type of training. |
|---|
| 0:19:12 | We used WADA-SNR based filtering to keep the highest-SNR utterances from |
|---|
| 0:19:16 | VoxCeleb |
|---|
| 0:19:17 | to construct clean data for the enhancement network training. |
|---|
| 0:19:21 | A mismatch in the enhancement and auxiliary network features |
|---|
| 0:19:25 | is slightly worse; it is better to use the same features. |
|---|
| 0:19:29 | We see that the PLDA is not really |
|---|
| 0:19:33 | robust: it's very susceptible to using enhanced data; we can put the enhanced data in the x-vector training |
|---|
| 0:19:38 | instead. |
|---|
| 0:19:39 | Some noise types are harmful in the extractor training data, like music. |
|---|
| 0:19:45 | And finally, dereverberation does not work |
|---|
| 0:19:50 | using this |
|---|
| 0:19:52 | type of training scheme. |
|---|
| 0:19:54 | So that is the end of the presentation. Please feel free to send questions my |
|---|
| 0:19:58 | way. Thank you. |
|---|