0:00:15 So hi everyone. I'm going to talk about a very similar approach to what Mitchell described before, at least for the speaker recognition part; it's actually the same model, so it won't be anything new.
0:00:36 This is the outline, more or less. I'm going to describe a little bit the use of DNNs in speech and now in speaker recognition, and how to extract the Baum-Welch statistics; I'll do that a bit more analytically than what Mitchell did. Then the DNN-PLDA configurations, and some experiments on Switchboard and the NIST 2012 evaluation.
0:00:58 So, a little bit about the limitations of UBM-based speaker recognition so far. The short-term spectral information that we have traditionally been using as front-end features in speaker recognition works fine in some senses, but in some others not. To be more specific, our experience is that when the comparison is phonetically aligned, suppose both of us pronounce the same word, say "Australia", then I think you'll be able to discriminate between the speakers a little bit more effectively than if you compare phonetically mismatched content.
0:01:40 OK, and the problem is that with the current, traditional UBM-based speaker recognition systems we don't capture this information, because they are not phonetically aware: the assignments to the classes that we define by training the UBM in an unsupervised way, segmenting, let's say, the input space using the features themselves, and which we then use to extract the Baum-Welch statistics, do not have the phonetic awareness that is needed.
0:02:20 So the challenge here is to use DNNs, which we know are now capable of drastically improving the performance of ASR systems, and capture the idiosyncratic way in which a speaker pronounces the phonetic units, which, as we said, are the senones: in ASR, typically tied triphone states.
0:02:47 A word about the DNN units for ASR: the reports show something like thirty percent relative improvement in terms of word error rate compared to GMMs. They have several hidden layers, five or six, and tied triphone states as outputs; they are discriminative classifiers, yet we can combine them with HMMs using the trick of turning posteriors back into likelihoods by subtracting the prior in the log domain, and then we can combine them within the HMM framework.
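The posterior-to-likelihood trick just mentioned can be sketched in a few lines (an illustrative sketch, not the speaker's actual code; the function name and toy numbers are hypothetical):

```python
import numpy as np

def posteriors_to_scaled_loglik(log_posteriors, log_priors):
    """Turn DNN senone posteriors into scaled log-likelihoods.

    Bayes' rule gives log p(x|s) = log p(s|x) - log p(s) + log p(x);
    the last term is the same for every state s at a given frame, so
    dropping it leaves a "scaled likelihood" that is equivalent for
    HMM decoding purposes.
    """
    return log_posteriors - log_priors  # priors broadcast over frames

# Toy example: 2 frames, 3 senones.
post = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.3, 0.6]])
priors = np.array([0.5, 0.3, 0.2])
scaled = posteriors_to_scaled_loglik(np.log(post), np.log(priors))
```

The priors are usually taken as the senone relative frequencies in the training alignment.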
0:03:28 Initially they used to initialize them with stacked restricted Boltzmann machines. This is no longer needed, as has been shown, but you might imagine cases, domains, or languages where there is not enough labeled data: you might have very little labeled data but much unlabeled data. In these cases I wouldn't exclude the possibility of using this stacked RBM architecture to initialize the DNN more robustly.
0:04:07 And I think the key difference is the capacity of handling longer segments as inputs, something about three hundred milliseconds, in order to capture the temporal information. This is the standard reference, by the way, a little bit old now, from two of the pioneers.
0:04:36 So the UBM approach, as you all know more or less, goes like this: you start by training a UBM using the EM algorithm, and then for each new utterance you extract the so-called zero-order and first-order statistics. Then you use your UBM again in order to somehow prewhiten your Baum-Welch statistics, component-wise; that's what you're doing, effectively.
0:05:06 So in the DNN-based approach we are instead using the DNN posterior probability of each frame belonging to each component; that's the only difference. So this gamma_t(c), where t is the frame index and c the component, is the only thing that changes. That means we don't have to change our algorithms at all; we just have to have a DNN trained to output these posteriors, and that's all. No need to change any code, of course.
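The statistics accumulation described above can be written down in a couple of lines (a minimal numpy sketch with hypothetical names; `gammas` could come from a UBM E-step or, identically, from a DNN softmax, possibly computed on a parallel ASR feature stream as long as the frame rate matches):

```python
import numpy as np

def baum_welch_stats(feats, gammas):
    """Zero- and first-order Baum-Welch statistics.

    feats:  (T, D) speaker-recognition features for T frames
    gammas: (T, C) per-frame posteriors over the C components/senones
    Returns N with shape (C,) and F with shape (C, D).
    """
    N = gammas.sum(axis=0)   # N_c = sum_t gamma_t(c)
    F = gammas.T @ feats     # F_c = sum_t gamma_t(c) * x_t
    return N, F

rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 20))   # T=100 frames, D=20 features
logits = rng.standard_normal((100, 8))   # C=8 components
gammas = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
N, F = baum_welch_stats(feats, gammas)
```

Everything downstream (i-vector extractor, JFA) consumes N and F exactly as before, which is the point being made.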
0:05:39 So a UBM is still needed, but practically only for the last step: to prewhiten the Baum-Welch statistics before feeding them either to an i-vector extractor or maybe to JFA. And of course EM is not required here to train the UBM, because the posteriors actually come from the DNN, so there is no need for E-steps: a single M-step, or several, will be sufficient.
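Since the assignments are fixed by the DNN, that "UBM" reduces to one M-step over accumulated statistics. A hedged sketch (hypothetical names, diagonal covariances assumed; N, F are the zero- and first-order statistics and S the analogous second-order sums):

```python
import numpy as np

def mstep_ubm(N, F, S):
    """Single M-step for a diagonal-covariance UBM whose
    responsibilities are fixed DNN posteriors (so no E-step).

    N: (C,)   zero-order stats,   N_c = sum_t gamma_t(c)
    S: (C, D) second-order stats, S_c = sum_t gamma_t(c) * x_t**2
    F: (C, D) first-order stats,  F_c = sum_t gamma_t(c) * x_t
    """
    means = F / N[:, None]
    covars = S / N[:, None] - means ** 2   # weighted variance per dim
    weights = N / N.sum()
    return weights, means, covars

rng = np.random.default_rng(1)
feats = rng.standard_normal((50, 3))
logits = rng.standard_normal((50, 2))
gammas = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
N = gammas.sum(axis=0)
F = gammas.T @ feats
S = gammas.T @ feats ** 2
weights, means, covars = mstep_ubm(N, F, S)
```

The resulting means and covariances are only used to center and whiten the statistics, as described above.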
0:06:12 And it is interesting to note here that different features can be used for estimating the assignments of a frame to the senones, or, as we used to say, to the components of the UBM, than those that you finally use for extracting i-vectors or whatever you're using. So you don't have to change them: you can have two parallel feature streams that are optimized for the two tasks, for the ASR task and for the speaker recognition task, as long of course as they have the same frame rate.
0:06:55 I'm not going to go too deep into that. This is the first DNN configuration we developed; it was inspired by this paper of Vesely et al., a very successful ASR paper. We managed to reproduce the ASR results and to do some fine-tuning as well. This is, more or less, the configuration, and we had some results; and then we got word from SRI telling us, "guys, we managed to obtain some amazing results with this," which showed that the method was actually working.
0:07:41 So we tried this as well, the first configuration, but on Switchboard data, not on NIST. This was the configuration of Yun Lei et al. from SRI, and it's a little bit different: it uses TRAP-like features at the front end, which is a better thing to do, with a span of thirty-one frames, and it uses log mel filterbanks. They use forty, I think, while we used twenty-three, and that was, I guess, one of the reasons why the results we obtained are not that good. There are several reasons, as you might expect; there are, you know, quite a lot of free parameters that someone has to tune.
0:08:32 As I'm going to show you next, we have two configurations: the small one is practically the one for which we included results in the camera-ready paper, and we also have a big configuration that is closer to what SRI describes in their paper.
0:08:54 These are some ASR results we obtained. First of all you see the comparison from Vesely's paper, just to show the dramatic improvement you can obtain by using DNNs instead of GMMs for the emission probabilities, and these are the two configurations we developed: the green one, mostly inspired by the work of Vesely, and then the SRI one.
0:09:29 Now let's go back to speaker recognition. These are the PLDA details, just to tell you what flavour of PLDA we used. We found that for most of the cases a full-rank speaker space (the matrix V) worked better; we didn't of course try every configuration, but it worked better compared to, say, rank one hundred twenty. For example, in this system, before length norm we applied WCCN instead of doing prewhitening; that worked in most of the cases very well, but prewhitening was much better.
0:10:06 And about this dilemma of whether you should average the enrollment i-vectors after or before length normalization: I think you should length-normalize both before and after averaging, because that's more consistent with the way you're training the PLDA model, and in our case it made a lot of difference.
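The averaging recipe just described, normalize each i-vector, average, then normalize the average, can be made concrete (a small illustrative sketch; the helper names and toy vectors are hypothetical):

```python
import numpy as np

def length_norm(x):
    """Project vector(s) onto the unit sphere (length normalization)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def enroll_average(ivectors):
    """Average multi-session enrollment i-vectors consistently with a
    PLDA model trained on length-normalized vectors:
    normalize each i-vector, average them, normalize the average."""
    return length_norm(length_norm(ivectors).mean(axis=0))

ivecs = np.array([[3.0, 4.0],
                  [0.0, 5.0]])
model = enroll_average(ivecs)   # a unit-length enrollment vector
```

This keeps the enrollment model on the same unit sphere that the PLDA training vectors lived on.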
0:10:27 These are the results from Switchboard with the first configuration. They're not that good, not even comparable to the ones you obtain with the baseline system, so we were rather disappointed at that stage, which was somehow around Christmas. But once you fuse, you get something, so that's good. Note that in this case we are using a single enrollment utterance, the same for males, more or less.
0:10:57 Now let's go to NIST with the configuration, or rather what we thought was the configuration, of SRI. This is the small configuration. Now we see that, at least for the low false alarm area, we are making progress, though not by much when fusing them; the fusion was not that good. By the way, I'm emphasizing condition C2, with C5 shown as a subset, just to make sure we cover both clean and noisy telephone data.
0:11:36 And this is with the big configuration, the same picture, now comparing it to a 2048-component GMM. It's more or less the same picture: you get some improvement in the low false alarm area, in some cases, but I don't think it's that much. This is for the big configuration.
0:12:03 So now I'm going to talk a little bit about PLDA, because there was this issue of domain adaptation on the agenda; we're going to focus a little bit on PLDA now, just to share with you a result which I think is interesting.
0:12:22 We know that when you apply length normalization you may attain results that are even better compared to heavy-tailed PLDA in some cases. The problem is that this transformation is somewhat sensitive to datasets, so ideally it would be great to get rid of it. A possible alternative would be to scale down the number of recordings. What that means is that you pretend that instead of having N recordings you have N over three. We define the scaling factor arbitrarily, but one over two or one over three works fine in practice.
0:13:03 And using that trick, all the evidence criteria work: once you train the PLDA you get a strictly increasing evidence, which is good, and you are somehow losing confidence, which is a good thing; it's OK to lose confidence in some cases. And to the question "can we get rid of length normalization?" the answer is no, but we are rather close. A scale factor of one means no scaling, practically.
0:13:37 Here are some results with different scaling factors. All I'm doing is simply dividing the number of recordings, consistently both in training and when evaluating the model, multiplying it by one over two or by one over three. I'm guessing that most of the gap between not doing and doing length normalization is somehow bridged by this trick. So maybe the people that are working with domain adaptation can use this as an alternative to length normalization, and can tell me if they find something interesting.
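The effect of the count-scaling trick, "losing confidence", can be illustrated on the simplest possible case: a scalar two-covariance model where the speaker variable is inferred from n recordings (an illustrative sketch under stated assumptions; all names and numbers are hypothetical, not the speaker's actual system):

```python
def speaker_posterior(xbar, n, b, w, alpha=1.0):
    """Posterior of the speaker variable y given n recordings with
    sample mean xbar, under the scalar two-covariance model
    y ~ N(0, b), x_i | y ~ N(y, w).

    The count-scaling trick replaces n by alpha * n (e.g. alpha = 1/3),
    which inflates the posterior variance: the model becomes less
    confident about the speaker, which is the behaviour described above.
    """
    n_eff = alpha * n
    precision = 1.0 / b + n_eff / w
    var = 1.0 / precision
    mean = var * (n_eff / w) * xbar
    return mean, var

m_full, v_full = speaker_posterior(xbar=1.0, n=9, b=1.0, w=1.0)
m_scaled, v_scaled = speaker_posterior(xbar=1.0, n=9, b=1.0, w=1.0,
                                       alpha=1.0 / 3.0)
# v_scaled > v_full: the scaled model is less confident
```

In a full PLDA system the same substitution, n replaced by alpha times n, appears wherever the recording count enters the sufficient statistics and the LLR.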
0:14:27 So, conclusions. The use of a state-of-the-art DNN for ASR can definitely replace the traditional GMM-UBM based system, and a good thing is that once the Baum-Welch statistics are extracted, exactly the same machinery can be applied; no need to change the code or anything. And the results, provided not only by us but also by the others you heard this morning, using senone models with exactly the same idea, clearly show the superiority. We did something suboptimal, probably, and that's why we didn't manage to get the desired results.
0:15:19 As an extension, obviously convolutional neural nets might be useful, and there is also another idea that we used for ASR, where what we did was to augment the input layer of the DNN by appending a typical, regular i-vector. We did that for broadcast news in order to perform some sort of speaker adaptation; we presented that at ICASSP.
0:15:55 It did help a lot: it gave about one point five to two percent improvement, and that is not relative but absolute improvement, which is very good for ASR. So you can maybe imagine an architecture where you extract a regular i-vector and feed it to the DNN in order to extract a DNN-based i-vector; you can imagine all kinds of things like that.
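The speaker-adaptation idea just sketched, appending an utterance-level i-vector to every frame's input features, looks like this in code (an illustrative sketch; shapes and names are hypothetical):

```python
import numpy as np

def augment_with_ivector(frames, ivector):
    """Append the same utterance-level i-vector to every frame's
    feature vector; the DNN input dimension grows from D to D + K.

    frames:  (T, D) frame-level spectral features
    ivector: (K,)   utterance-level i-vector
    """
    tiled = np.tile(ivector, (frames.shape[0], 1))
    return np.concatenate([frames, tiled], axis=1)

frames = np.zeros((5, 40))   # 5 frames of 40-dim features (toy sizes)
ivec = np.ones(10)           # a 10-dim i-vector
augmented = augment_with_ivector(frames, ivec)
```

The DNN then sees the same speaker summary at every frame, which is what lets it adapt its acoustic modelling to the speaker.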
0:16:22 So that's all, thanks a lot.
0:16:31 [Chair] Thank you. We have time for some questions.
0:16:40 [Question] I didn't quite catch, when you talked about scaling down the number of counts: are you talking about scaling it down in the PLDA score? I mean, you don't score by the book? I don't know.
0:16:53 [Answer] No, I'm averaging. First of all, I'm training the PLDA model by doing this; that's crucial, to train the model like that. I do the averaging, but in the scoring I treat the single averaged utterance as if it were one over three or one over two of an utterance.
0:17:19 [Question] OK, so you whiten the variances when you train, and then you also add uncertainty in the scoring?
0:17:26 [Answer] Yes. If you write down the LLR score, you can clearly see where you need to multiply by the scaling factor, especially for...
0:17:48 [Comment] Thanks. I'll just mention a few things for the community. It would be quite rewarding to see what the difference is: it feels like there's one key ingredient somewhere, and we might be close to finding it, but you know all the teams are going to try and, at this rate, stumble into the same thing.
0:18:07 So some of the things that have popped up at this conference: as you mentioned, the low number of filterbanks, twenty-three instead of forty; I believe you said this was an impacting factor, so that might be one reason. We also worked out that we're not applying VTLN before training the DNN, as we do for ASR. [Answer] Yes, but not for the DNN. [Comment] That's another factor. And also removing the silence frames during the accumulator generation; there are a number of things there. And it's good to hear that other people have been trying to make it work as well, so we know it's moving in the right direction.
0:18:50 One of the other things I wanted to mention... let me think, my mind went blank right now... ah, that's right: we were talking about ASR performance. One of the things that people said was, you know, "this configuration works really well for ASR, so why should we change it?" And what we've seen so far is that an indication of performance on the ASR side of things doesn't necessarily reflect how suitable it is for the speaker ID task.
0:19:22 So if you're struggling, instead of starting from your own ASR system or whatever you have, perhaps go back to whatever was published in the configurations and just start from scratch and see if that works better. And certainly don't be afraid to contact any of the teams that are, you know, working on this.
0:19:39 [Answer] So we're all happy to address the issues. Because in ASR, the error rates are more forgiving: once you exploit the posteriors, there is, in a sense, the language model that can smooth some results, whereas we don't have that when we are extracting posteriors for speaker recognition. So that might be an indication of why better ASR results do not necessarily reflect better results for speaker recognition.
0:20:06 [Question] Are you implying, Mitch, that you guys turned off VTLN specifically because you were going to use it for speaker ID, or was that already the way you did ASR?
0:20:19 [Answer] I wasn't actually working on that side of the training myself; a colleague was doing it beforehand, and in the configuration we had it switched off, and I asked, you know, "should we not be doing this?" I can't actually recall whether you said it doesn't help or it doesn't make much difference. That's just one thing where our setup can differ from what most people do, and one thing we anticipated might have an impact, since VTLN is removing speaker discriminability, simple as that. [Chair] All right.
0:21:05 [Question] So you seem to have very good results. But convolutional nets have been around for twenty years, right? I mean, LeCun was working on them back then. How come they work so well right now? And the second question is about recurrent nets, which are also useful: what's the story? Why does this happen twenty years later?
0:21:42 [Answer] Sure, I'll have a go at the question, I guess. A major factor is the fact that we are now using much longer windows as inputs, and of course the fact that we have the processing power now. It took us about a month, maybe less, to train the big system, and even using GPUs there is, of course, some optimization that needs to be done in terms of engineering. It takes a lot of time to process all the data that is required to train robust ASR systems, and that maybe wasn't feasible during the eighties. That is definitely what held back most of the community, and why they failed to show during that era that those discriminative models are powerful enough to compete with, and outperform by far, the GMM approaches. [Chair] All right.