0:00:15 | so hi everyone i'm going to talk |
---|
0:00:18 | about a very similar approach to what mitchell described before |
---|
0:00:24 | at least for the speaker recognition part actually |
---|
0:00:28 | the same model |
---|
0:00:30 | so it won't be anything new |
---|
0:00:36 | this is the outline more or less i'm going to describe a little bit about the |
---|
0:00:40 | use of dnns in speech and now in speaker recognition |
---|
0:00:43 | and how to extract baum-welch statistics i'll |
---|
0:00:46 | do that a little bit more analytically than |
---|
0:00:49 | what mitchell did then the dnn plda configurations |
---|
0:00:53 | and some experiments on switchboard and nist two thousand twelve |
---|
0:00:58 | so a little bit about the limitations of ubm based speaker recognition so far the |
---|
0:01:05 | short-term spectral information that we have traditionally been using as front-end |
---|
0:01:10 | features |
---|
0:01:12 | in speaker recognition works fine in some sense |
---|
0:01:16 | but in some others not and to be more specific our experience is that |
---|
0:01:22 | when you know the alignment suppose i know which phonetic |
---|
0:01:25 | unit each frame belongs to |
---|
0:01:27 | okay |
---|
0:01:32 | then i think you'll be able to |
---|
0:01:34 | discriminate between speakers a little bit more effectively than if i don't |
---|
0:01:40 | okay and the problem is that with the current traditional ubm based |
---|
0:01:45 | speaker recognition systems we don't capture this information also because they're |
---|
0:01:52 | not phonetically |
---|
0:01:54 | aware |
---|
0:01:54 | the assignments the classes that we define |
---|
0:01:57 | by using an unsupervised way of training |
---|
0:02:00 | a ubm so segmenting let's say the input space |
---|
0:02:04 | using the features themselves which we then use |
---|
0:02:08 | to extract baum-welch statistics |
---|
0:02:13 | do not have this phonetic awareness that is needed |
---|
0:02:18 | i suppose |
---|
0:02:20 | so the challenge here |
---|
0:02:23 | is to use dnns |
---|
0:02:26 | which we know are now capable of |
---|
0:02:30 | improving drastically the performance of asr systems |
---|
0:02:33 | and capture this idiosyncratic |
---|
0:02:38 | way in which a speaker pronounces the |
---|
0:02:40 | phonetic units which actually as in asr are |
---|
0:02:43 | tied triphone states |
---|
0:02:47 | a few words now about dnns for asr |
---|
0:02:53 | they report something like thirty percent relative improvement in terms of word error |
---|
0:02:59 | rate |
---|
0:03:00 | compared to gmms |
---|
0:03:02 | they have several hidden layers five or six and tied triphone states |
---|
0:03:07 | as outputs they are discriminative classifiers yet we can combine them with hmms using this |
---|
0:03:15 | trick where |
---|
0:03:15 | we turn posteriors back into likelihoods by subtracting |
---|
0:03:19 | the prior in the log domain |
---|
0:03:24 | and then we can combine them within the hmm framework |
---|
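the posterior-to-likelihood trick mentioned above can be sketched as follows (an illustration added for clarity, not the speaker's code; the array shapes and function name are assumptions):

```python
import numpy as np

def posteriors_to_scaled_loglikes(log_posteriors, log_priors):
    """Turn DNN senone posteriors into 'scaled likelihoods' for HMM decoding.

    By Bayes' rule, log p(x|s) = log p(s|x) - log p(s) + log p(x);
    the log p(x) term is constant per frame and can be dropped,
    so subtracting the log prior in the log domain is enough.
    """
    return log_posteriors - log_priors  # shape (frames, senones)

# toy example: 2 frames, 3 senones
post = np.log(np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.1, 0.8]]))
prior = np.log(np.array([0.5, 0.3, 0.2]))
scaled = posteriors_to_scaled_loglikes(post, prior)
```

the per-frame constant log p(x) cancels in the viterbi path comparison, which is why dropping it is harmless.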
0:03:28 | initially they used to initialize them with |
---|
0:03:32 | stacked restricted boltzmann machines |
---|
0:03:35 | this is no longer needed as has been shown but |
---|
0:03:40 | you might imagine cases or domains or languages where not enough labeled data is |
---|
0:03:46 | available |
---|
0:03:48 | you might have very few labeled data but many unlabeled |
---|
0:03:53 | data |
---|
0:03:53 | in that case don't exclude the possibility of |
---|
0:03:58 | using this stacked architecture of rbms |
---|
0:04:02 | to initialize the dnn more robustly |
---|
0:04:07 | and i think the key difference is the capacity of handling |
---|
0:04:14 | longer segments as inputs |
---|
0:04:18 | okay so |
---|
0:04:20 | something about three hundred milliseconds |
---|
0:04:23 | in order to capture |
---|
0:04:26 | the temporal information |
---|
0:04:29 | this is the reference by the way a little bit old now |
---|
0:04:33 | from two of the pioneers |
---|
0:04:36 | so the ubm approach as you |
---|
0:04:39 | all know |
---|
0:04:41 | goes like this |
---|
0:04:43 | you start by training |
---|
0:04:46 | a ubm using the em algorithm |
---|
0:04:49 | and then for each new utterance you extract the so-called zeroth order statistics and |
---|
0:04:54 | first order statistics |
---|
0:04:55 | and then you use your ubm again |
---|
0:04:58 | in order to somehow prewhiten your baum-welch statistics component-wise |
---|
0:05:04 | because that's what you're doing effectively |
---|
0:05:06 | so in this dnn based approach we are replacing this with the |
---|
0:05:12 | posterior probability |
---|
0:05:13 | of each frame belonging to each component |
---|
0:05:16 | that's the only difference so this p of c given t |
---|
0:05:19 | t is the frame index c is the component |
---|
0:05:21 | that's the only thing that changes |
---|
0:05:23 | so that means we don't have to change our |
---|
0:05:25 | algorithms at all we just have to have the dnn output |
---|
0:05:29 | these posteriors |
---|
0:05:31 | and that's all |
---|
0:05:31 | no need to decode of course |
---|
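the statistics extraction just described can be sketched like this (an added illustration under assumed shapes, not the speaker's code; the only change from the gmm recipe is where the posteriors come from):

```python
import numpy as np

def baum_welch_stats(features, posteriors):
    """Zeroth- and first-order Baum-Welch statistics.

    features:   (T, D) frame features used for speaker recognition
    posteriors: (T, C) p(c | t) -- here they would come from the DNN
                (senone posteriors) instead of a GMM-UBM.
    Returns N (C,) and F (C, D); nothing downstream in the i-vector
    machinery has to change.
    """
    N = posteriors.sum(axis=0)   # zeroth order: soft frame counts per class
    F = posteriors.T @ features  # first order: posterior-weighted feature sums
    return N, F

T, D, C = 100, 20, 8
rng = np.random.default_rng(0)
feats = rng.standard_normal((T, D))
post = rng.random((T, C))
post /= post.sum(axis=1, keepdims=True)  # each frame's posteriors sum to 1
N, F = baum_welch_stats(feats, post)
```

note that the soft counts always sum to the number of frames, whichever model produced the posteriors.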
0:05:39 | so a gmm ubm is still needed but practically only for the last step |
---|
0:05:44 | to prewhiten the baum-welch statistics before feeding them either to |
---|
0:05:48 | an i-vector extractor or maybe to jfa |
---|
0:05:53 | and of course em is not required to train this ubm because |
---|
0:06:00 | the posteriors actually come from the dnn so there's |
---|
0:06:04 | no need to do this a single m step |
---|
0:06:07 | or several |
---|
0:06:10 | will be sufficient |
---|
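a sketch of that single m step and the centering of the statistics (added for illustration; diagonal covariances and the function names are my assumptions, not the talk's recipe):

```python
import numpy as np

def m_step_from_posteriors(features, posteriors, floor=1e-6):
    """One M-step: component weights, means and diagonal variances
    computed directly from the DNN posteriors -- no EM iterations,
    since the assignments do not depend on the GMM itself."""
    N = posteriors.sum(axis=0)                       # (C,) soft counts
    means = (posteriors.T @ features) / N[:, None]   # (C, D)
    second = (posteriors.T @ features**2) / N[:, None]
    variances = np.maximum(second - means**2, floor)
    weights = N / N.sum()
    return weights, means, variances

def center_stats(N, F, means):
    """Center first-order statistics around the component means,
    as done before feeding them to an i-vector extractor or JFA."""
    return F - N[:, None] * means
```

centering the training data's own statistics with the means from its own m step gives exactly zero, which is a handy sanity check.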
0:06:12 | and it is interesting to note here that different features can |
---|
0:06:16 | be used for estimating |
---|
0:06:19 | the assignments |
---|
0:06:21 | of a frame to the senones or |
---|
0:06:26 | what we used to call the components of the ubm |
---|
0:06:30 | and those that you finally use |
---|
0:06:33 | for extracting |
---|
0:06:36 | i-vectors or whatever |
---|
0:06:39 | you're using so |
---|
0:06:40 | you don't have to change them you can have two parallel front ends that are optimized |
---|
0:06:45 | for the two tasks for the asr task |
---|
0:06:47 | and for the speaker recognition task as long of course as they have |
---|
0:06:51 | the same frame rate |
---|
0:06:55 | so i'm not going to go too deep into that this is the first dnn configuration |
---|
0:07:01 | we developed it was inspired by this paper by vesely et al a |
---|
0:07:09 | very successful paper for asr we managed to reproduce the asr results |
---|
0:07:14 | and to define also as you'll see next |
---|
0:07:18 | a smaller configuration |
---|
0:07:21 | and we have some results and then |
---|
0:07:25 | the recent paper from sri appeared telling us guys we managed to obtain some |
---|
0:07:30 | amazing results |
---|
0:07:33 | with this recipe |
---|
0:07:34 | and they showed that the method was actually working |
---|
0:07:41 | so we tried this as well |
---|
0:07:43 | the first configuration but we tried it on switchboard data not on nist |
---|
0:07:48 | so this was the configuration of yun lei and colleagues |
---|
0:07:53 | from sri and it's a little bit different it uses trap-like features and that's maybe |
---|
0:08:00 | a better thing to do |
---|
0:08:02 | the input spans thirty-one |
---|
0:08:06 | frames and it uses log mel filterbanks |
---|
0:08:11 | they use forty i think while we used twenty-three which |
---|
0:08:15 | was i guess one of the reasons why the results we |
---|
0:08:19 | obtained are not that good there are several reasons as you would expect |
---|
0:08:24 | there are you know a lot of free parameters that |
---|
0:08:28 | someone has to tune |
---|
0:08:32 | as i'm going to show you next so |
---|
0:08:35 | we had two configurations the small one was practically |
---|
0:08:39 | the one we included results for in the camera-ready paper |
---|
0:08:43 | and here we have the big configuration also |
---|
0:08:46 | which is closer to what sri describe in |
---|
0:08:51 | the paper |
---|
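the thirty-one-frame filterbank input described above amounts to stacking a context window per frame; a minimal sketch (added for illustration; the edge-padding choice and dimensions are my assumptions):

```python
import numpy as np

def stack_context(frames, left=15, right=15):
    """Stack each frame with +/-15 neighbours (a 31-frame span),
    padding the edges by repeating the first/last frame."""
    T, D = frames.shape
    padded = np.concatenate([np.repeat(frames[:1], left, axis=0),
                             frames,
                             np.repeat(frames[-1:], right, axis=0)])
    return np.stack([padded[t:t + left + 1 + right].ravel()
                     for t in range(T)])

# e.g. 200 frames of 40 log mel filterbanks -> (200, 31 * 40) DNN inputs
fbank = np.random.default_rng(2).standard_normal((200, 40))
inputs = stack_context(fbank)
```

with 23 filterbanks instead of 40, the same stacking simply yields 31 * 23 = 713 input dimensions.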
0:08:54 | these are some asr results we obtained |
---|
0:08:59 | there you see first of all the comparison that is in vesely's paper |
---|
0:09:04 | just to |
---|
0:09:05 | show the dramatic improvement you can obtain by using |
---|
0:09:10 | dnns instead of |
---|
0:09:13 | gmms as emission probabilities |
---|
0:09:15 | and these are the two configurations |
---|
0:09:18 | we developed in green the one inspired by the work of vesely and then |
---|
0:09:25 | the sri one |
---|
0:09:29 | now let's go back to speaker recognition |
---|
0:09:33 | these are the plda equations just to tell you what |
---|
0:09:35 | flavour of plda we used |
---|
0:09:38 | we found that for most of the cases |
---|
0:09:43 | the full rank |
---|
0:09:44 | v v transpose that is the speaker space |
---|
0:09:47 | worked better we didn't of course try every combination but it worked better compared |
---|
0:09:51 | to rank one twenty |
---|
0:09:52 | for example for this system |
---|
0:09:55 | before length norm we applied wccn |
---|
0:09:57 | instead of doing prewhitening which in most of the cases again worked very well much |
---|
0:10:03 | better than prewhitening |
---|
0:10:06 | and about this dilemma whether you should average |
---|
0:10:09 | after or before length normalization i think you should average |
---|
0:10:14 | both before and after length normalization |
---|
0:10:16 | because that's more consistent with the way you're training the plda model |
---|
0:10:20 | and in our case it made a lot of difference |
---|
0:10:23 | okay |
---|
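the before-and-after averaging argued for above can be sketched as follows (an added illustration; the function names are my own, not the talk's code):

```python
import numpy as np

def length_norm(x):
    """Project an i-vector onto the unit sphere (length normalization)."""
    return x / np.linalg.norm(x)

def enrollment_ivector(ivectors):
    """Average a speaker's i-vectors, length-normalizing both before
    averaging (each cut) and after (the average) -- consistent with
    a PLDA model trained on length-normalized vectors."""
    normalized = np.stack([length_norm(v) for v in ivectors])
    return length_norm(normalized.mean(axis=0))

ivs = [np.array([3.0, 4.0]), np.array([0.0, 2.0])]
enr = enrollment_ivector(ivs)
```

normalizing first keeps a long i-vector from dominating the average; normalizing again puts the result back where the plda training data lives.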
0:10:25 | so |
---|
0:10:27 | these are the results from switchboard with the first configuration they're not that good |
---|
0:10:32 | not all that good |
---|
0:10:33 | they're |
---|
0:10:34 | not even comparable to |
---|
0:10:35 | the ones you attain with the baseline system |
---|
0:10:37 | okay so we were rather disappointed at that stage it was somehow around christmas |
---|
0:10:42 | but once you fuse them you get something so yes |
---|
0:10:47 | it's good note that in this case we are using a |
---|
0:10:50 | single |
---|
0:10:51 | enrollment utterance the same for males more or less |
---|
0:10:56 | and |
---|
0:10:57 | now let's go to nist with the big configuration or what we thought was the configuration |
---|
0:11:03 | of sri |
---|
0:11:06 | this is the small configuration |
---|
0:11:09 | now we see that at least for the low false alarm area |
---|
0:11:13 | we're making progress |
---|
0:11:15 | though not by fusing them much |
---|
0:11:18 | okay the fusion was not all that |
---|
0:11:20 | good |
---|
0:11:22 | and by the way i'm emphasising c two i plot both although c five is |
---|
0:11:28 | a subset just |
---|
0:11:30 | to make sure |
---|
0:11:32 | that it covers both clean and noisy telephone |
---|
0:11:36 | and this is with the big configuration the same picture now we are comparing |
---|
0:11:41 | it |
---|
0:11:42 | to a two thousand forty eight component gmm |
---|
0:11:46 | and it's more or less the same picture you get some improvement in the low false alarm |
---|
0:11:54 | area |
---|
0:11:56 | with some caveats but not that much this one |
---|
0:12:00 | is for the big configuration |
---|
0:12:03 | so i'm going to |
---|
0:12:05 | digress a little bit i'm going to talk a little bit about |
---|
0:12:08 | plda now because there was this issue about the domain adaptation |
---|
0:12:14 | agenda so we're going to focus a little bit on plda now just to |
---|
0:12:17 | share with you a result which i think is interesting |
---|
0:12:22 | we know that when you apply length normalization you may attain results that are |
---|
0:12:27 | even better |
---|
0:12:28 | compared to heavy-tailed plda in some cases |
---|
0:12:31 | the problem is that this transformation is somehow sensitive to datasets |
---|
0:12:37 | so ideally it would be great to get rid of it |
---|
0:12:43 | and a possible alternative would be to scale down the number of recordings so |
---|
0:12:48 | what that means is that you pretend |
---|
0:12:52 | that instead of having n recordings you are having n over three |
---|
0:12:56 | we define the scaling factor arbitrarily but one over three or one over two |
---|
0:13:00 | works fine |
---|
0:13:01 | in practice |
---|
0:13:03 | and using that trick all the evidence criteria work |
---|
0:13:08 | i mean once you train the plda you get |
---|
0:13:11 | a strictly increasing evidence which is good |
---|
0:13:16 | and you are somehow losing confidence |
---|
0:13:20 | which is a good thing |
---|
0:13:22 | it's okay to lose confidence in some cases |
---|
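the effect of the scaled-counts trick can be illustrated on a one-dimensional two-covariance model (a sketch I'm adding for intuition, not the talk's plda code; the model and parameter names are assumptions):

```python
def speaker_posterior(xbar, n, B, W, scale=1.0):
    """Posterior of the speaker mean in a 1-D two-covariance model
    (prior variance B, within-speaker variance W) given n recordings
    with sample mean xbar. The scaled-counts trick replaces the
    recording count n by scale * n, so the model pretends it has
    seen fewer recordings and keeps more posterior uncertainty."""
    n_eff = scale * n
    precision = 1.0 / B + n_eff / W
    mean = (n_eff / W) * xbar / precision
    return mean, 1.0 / precision

# with scale = 1/3 the posterior stays wider (less confident)
m_full, v_full = speaker_posterior(xbar=1.0, n=9, B=1.0, W=1.0)
m_scaled, v_scaled = speaker_posterior(xbar=1.0, n=9, B=1.0, W=1.0, scale=1/3)
```

the posterior variance grows and the mean shrinks toward the prior, which is exactly the "losing confidence" behaviour described above.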
0:13:26 | and the question is can we get rid of length normalization the answer |
---|
0:13:30 | is no |
---|
0:13:32 | but we are rather close i'd say so a scale factor of one means practically |
---|
0:13:36 | nothing |
---|
0:13:37 | here are some results |
---|
0:13:39 | with different scaling factors so all i'm doing is simply dividing the number of recordings |
---|
0:13:45 | in training |
---|
0:13:46 | and when evaluating the model |
---|
0:13:49 | dividing the number of |
---|
0:13:51 | recordings by either one or multiplying by |
---|
0:13:55 | one over two or by one over three |
---|
0:13:57 | i'm guessing that most of the gap |
---|
0:14:00 | between not doing length normalization and doing length normalization is somehow |
---|
0:14:07 | bridged by this trick so |
---|
0:14:09 | maybe the people that are working with domain |
---|
0:14:16 | adaptation can use that |
---|
0:14:17 | as an alternative |
---|
0:14:20 | to length normalization and tell me if they found something interesting |
---|
0:14:27 | so some conclusions |
---|
0:14:32 | the use of a state-of-the-art dnn for asr can definitely replace a traditional |
---|
0:14:37 | gmm ubm based system |
---|
0:14:41 | and a good thing is that once the baum-welch statistics are extracted exactly the |
---|
0:14:46 | same machinery can be applied |
---|
0:14:50 | no need to change the code or anything and there are the results provided by |
---|
0:14:55 | sri |
---|
0:14:56 | and not only that |
---|
0:15:00 | it was also shown this morning using i think similar |
---|
0:15:05 | models with exactly the same idea |
---|
0:15:08 | so these results clearly show the superiority |
---|
0:15:13 | we did something suboptimal probably that's why we didn't manage to get the desired |
---|
0:15:18 | results |
---|
0:15:19 | so as an extension |
---|
0:15:21 | obviously convolutional neural nets might be useful |
---|
0:15:29 | and there is also another idea that |
---|
0:15:32 | we used for asr where |
---|
0:15:36 | what we did was to augment |
---|
0:15:38 | the input layer of the dnn by appending |
---|
0:15:45 | a typical regular i-vector |
---|
0:15:48 | okay we did that for broadcast news |
---|
0:15:50 | in order to do some sort of speaker adaptation |
---|
0:15:53 | we presented that at icassp |
---|
0:15:55 | and it helps a lot it gives about |
---|
0:15:59 | one point five to two percent improvement which is mind you not relative but absolute improvement |
---|
0:16:05 | which is very good for asr so you can maybe imagine an |
---|
0:16:09 | architecture where you extract |
---|
0:16:11 | a regular i-vector |
---|
0:16:13 | and feed it to the dnn in order to extract |
---|
0:16:18 | a dnn based i-vector you can imagine all things like that |
---|
0:16:22 | so that's all thanks a lot |
---|
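the i-vector-augmented input layer mentioned at the end can be sketched like this (an added illustration; the dimensions and function name are my assumptions, not the presented system):

```python
import numpy as np

def augment_with_ivector(frame_inputs, ivector):
    """Append the utterance-level i-vector to every frame's input
    vector -- the speaker-adaptation scheme described for broadcast
    news: the DNN sees (stacked frames + i-vector) per frame."""
    T = frame_inputs.shape[0]
    tiled = np.tile(ivector, (T, 1))          # same i-vector for all frames
    return np.concatenate([frame_inputs, tiled], axis=1)

frames = np.zeros((50, 1240))  # e.g. 31 stacked 40-dim filterbank frames
iv = np.ones(400)              # an assumed i-vector dimension
aug = augment_with_ivector(frames, iv)
```

because the i-vector is constant over the utterance, it acts as a speaker code the network can condition on.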
0:16:31 | thank you we have time for some questions |
---|
0:16:40 | i didn't quite catch when you talked about scaling down the number of counts |
---|
0:16:45 | are you talking about scaling it down |
---|
0:16:47 | in the plda score |
---|
0:16:50 | i mean you don't score by the book |
---|
0:16:52 | i don't know |
---|
0:16:53 | no i'm averaging i'm training the plda model first of all by doing this |
---|
0:16:59 | trick |
---|
0:17:00 | that's crucial here it's crucial to train the model like that |
---|
0:17:06 | then |
---|
0:17:07 | i'm doing averaging |
---|
0:17:11 | but i treat |
---|
0:17:12 | the single utterance as being |
---|
0:17:14 | one over three or one over two utterances |
---|
0:17:17 | in the scoring |
---|
0:17:19 | okay so you whiten the variances when you train but |
---|
0:17:22 | then you also add uncertainty |
---|
0:17:24 | in scoring |
---|
0:17:26 | yes if you write down the llr score you can |
---|
0:17:32 | clearly see where you need to multiply by the scaling factor especially for |
---|
0:17:48 | thanks i'll just mention a few things for the community it would |
---|
0:17:52 | be quite useful to work out what the differences are it feels like |
---|
0:17:58 | there's a key ingredient somewhere that we've missed and |
---|
0:18:02 | you know all the teams are going to try this and at the same time |
---|
0:18:05 | stumble into the same thing |
---|
0:18:07 | so some of the things that popped up a lot during this conference as you |
---|
0:18:11 | mentioned the low number of filterbanks twenty three instead of forty i believe you said |
---|
0:18:17 | this wasn't an impacting factor but that might be one reason also we worked out that we're |
---|
0:18:22 | not applying vtln before training the dnn we do it for the asr |
---|
0:18:26 | yes but not for the dnn so that's another factor |
---|
0:18:32 | and also removing the silence frames out of the dnn during the accumulator generation there are a number |
---|
0:18:38 | of things there and it's good that other people have also been able |
---|
0:18:42 | to make it work as well so we know that something positive |
---|
0:18:47 | is moving in the right direction |
---|
0:18:50 | one of the other things i wanted to mention was |
---|
0:18:54 | let me think my mind's gone blank right now |
---|
0:18:59 | that's right we were talking about asr performance one of the things that people said |
---|
0:19:04 | was you know this configuration works really well for asr so why should we |
---|
0:19:09 | change it and what we've seen so far is that the indication of performance on |
---|
0:19:14 | the asr side of things |
---|
0:19:16 | doesn't necessarily reflect how suitable it is for the speaker id task so |
---|
0:19:22 | if you're struggling don't try to reuse your state-of-the-art asr system or whatever you have |
---|
0:19:26 | perhaps go back to whatever was published in the configurations and just start |
---|
0:19:32 | from scratch and see if that works better |
---|
0:19:34 | and certainly don't be afraid to contact any of the teams that are you know working |
---|
0:19:38 | on this |
---|
0:19:39 | we're all happy to address the issues |
---|
0:19:42 | because in asr once you extract the |
---|
0:19:46 | posteriors there's a language model downstream that can smooth |
---|
0:19:50 | the results |
---|
0:19:52 | whereas we don't have that when we are extracting posteriors |
---|
0:19:57 | for speaker recognition |
---|
0:19:59 | so that might be |
---|
0:20:00 | an indication that better asr results do not necessarily reflect better results for speaker |
---|
0:20:05 | recognition |
---|
0:20:06 | are you implying mitch that you guys turned off vtln specifically because you were going to |
---|
0:20:12 | use it for speaker id or was that something that was already the way you did asr |
---|
0:20:19 | i wasn't actually working on that side of the training myself yun was |
---|
0:20:23 | doing it beforehand and in the configuration we had it switched off and i asked |
---|
0:20:28 | you know should we not be doing this i |
---|
0:20:31 | can't recall whether you said it doesn't help or it doesn't make much |
---|
0:20:37 | difference |
---|
0:20:38 | that's just one thing where we can differ with other teams in what we're |
---|
0:20:41 | doing that's one thing we noted that might have an impact it's removing speaker discriminability |
---|
0:20:47 | put simply |
---|
0:20:51 | all right |
---|
0:21:05 | so you seem to have very good results here |
---|
0:21:11 | you mentioned that |
---|
0:21:14 | convolutional nets have been around for twenty years right |
---|
0:21:18 | i mean yann lecun was working on them |
---|
0:21:22 | back then |
---|
0:21:24 | so how come they are hot right now |
---|
0:21:28 | and the second question is about recurrent nets which are also useful |
---|
0:21:35 | what's the story why does this happen twenty years later |
---|
0:21:42 | sure that's a fair question |
---|
0:21:49 | i guess |
---|
0:21:51 | a major factor is the fact that we're using now much longer windows as |
---|
0:21:55 | input |
---|
0:21:56 | okay that and of course the fact that we have processing power now |
---|
0:22:00 | it took us |
---|
0:22:02 | a month |
---|
0:22:03 | maybe less to train the big system of course |
---|
0:22:07 | without using gpus of course there is some optimisation |
---|
0:22:12 | that needs to be done in terms of engineering |
---|
0:22:15 | but it takes a lot of time to process all this data |
---|
0:22:20 | that is required to train robust asr systems maybe it wasn't feasible during the eighties |
---|
0:22:26 | and that's definitely what bit most of the community |
---|
0:22:30 | why they failed to show during those years |
---|
0:22:35 | that those discriminative models are powerful enough to compete with |
---|
0:22:39 | the gmm approaches right |
---|