So, hi everyone. I'm going to talk about a very similar approach to what Mitchell described before, at least for the speaker recognition part — actually the same model — so it won't be anything new.
This is the outline, more or less. I'm going to describe a little bit about the use of DNNs in speech and now in speaker recognition, and how to extract Baum-Welch statistics — I'll do that a little more analytically than Mitchell did — then the DNN and PLDA configurations, and some experiments on Switchboard and NIST 2012.
So, a little bit about the limitations of UBM-based speaker recognition so far. The short-term spectral information that we have traditionally been using as front-end features in speaker recognition works fine in some senses, but not in others. To be more specific, our experience is that when the alignment is known — suppose I'm about to say "Australia": if you know which sound I'm pronouncing at each instant, I think you'll be able to discriminate between speakers a little more effectively than if you don't.
And the problem is that with the current, traditional UBM-based speaker recognition systems we don't capture that information, because they are not phonetically aware: the classes to which frames are assigned are defined by training the UBM in an unsupervised way — segmenting, let's say, the input space using the features themselves — and the assignments we then use to extract Baum-Welch statistics do not have the phonetic awareness that is needed. I hope that's clear.
So the challenge here is to use DNNs — which we know are now capable of drastically improving the performance of ASR systems — to capture the idiosyncratic way in which a speaker pronounces each of these units, that is, the senones, which in ASR are actually tied triphone states.
A word about DNNs in ASR: reports show something like a thirty percent relative improvement in terms of word error rate compared to GMMs. They have several hidden layers — five or six — and tied triphone states as outputs. They are discriminative classifiers, yet we can combine them with HMMs using the trick of turning posteriors back into likelihoods by subtracting the prior in the log domain, and then we can combine them within the HMM framework.
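In symbols — a sketch of the trick, with p(s) the senone prior (my notation, not the slide's):

```latex
\[
\log p(x_t \mid s) \;=\; \log p(s \mid x_t) \;-\; \log p(s)
\;+\; \underbrace{\log p(x_t)}_{\text{constant over } s,\ \text{dropped}}
\]
```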
Initially, people used to initialize them with stacked restricted Boltzmann machines. That is no longer needed, as has been shown, but you might imagine cases, domains, or languages where not enough labeled data is available — you might have very little labeled data but a lot of unlabeled data. In such cases I wouldn't exclude the possibility of using this stacked architecture of RBMs to initialize the DNN more robustly. And I think the key difference is the capacity to handle longer segments as inputs.
Okay — so something like three hundred milliseconds of input, in order to capture the temporal information.
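A minimal sketch of what such an input looks like, assuming a 10 ms frame shift and a symmetric window of 15 frames on each side (the exact context is my assumption; the thirty-one-frame span is mentioned later in the talk):

```python
import numpy as np

def stack_context(feats, left=15, right=15):
    """Stack each frame with its +/-15 neighbours: 31 frames in total,
    roughly 310 ms of signal at a 10 ms frame shift."""
    T, _ = feats.shape
    padded = np.pad(feats, ((left, right), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(left + right + 1)])
```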
This is the reference, by the way — a little bit old now — from two of the pioneers.
So the UBM approach, as you all more or less know, goes like this: you start by training a UBM using the EM algorithm; then, for each new utterance, you extract the so-called zero-order and first-order statistics; and then you use your UBM again to somehow prewhiten your Baum-Welch statistics, component-wise — because that's what you're doing, effectively.
In the DNN-based approach we are instead using the posterior probability, given by the DNN, of each frame belonging to each component. That's the only difference: this gamma_t(c) — t is the frame index, c is the component — is the only thing that changes. So that means we don't have to change our algorithms at all; we just have to have the DNN output these posteriors, and that's all — no need to change anything else, of course.
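As a minimal sketch (my function names, not the authors' code), the accumulation is identical whether gamma comes from UBM responsibilities or from a DNN's senone outputs:

```python
import numpy as np

def baum_welch_stats(gamma, feats):
    """Zero- and first-order Baum-Welch statistics.

    gamma : (T, C) frame posteriors gamma_t(c) -- from a UBM *or* a DNN;
            this matrix is the only thing that changes between the two.
    feats : (T, D) acoustic features used for speaker recognition.
    """
    N = gamma.sum(axis=0)   # N_c = sum_t gamma_t(c)      (zero order)
    F = gamma.T @ feats     # F_c = sum_t gamma_t(c) x_t  (first order)
    return N, F
```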
So a UBM is still needed, but practically only for the last step: to prewhiten the Baum-Welch statistics before feeding them either to an i-vector extractor or maybe to JFA. And of course EM is not required to train this UBM, because the posteriors now actually come from the DNN, so there's no need for that — a single M-step, instead of several iterations, will be sufficient.
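Since the posteriors are fixed by the DNN, this "UBM" follows in closed form from one M-step — a sketch reusing `baum_welch_stats` from above, with diagonal covariances assumed for simplicity:

```python
def gmm_from_dnn_posteriors(gamma, feats, var_floor=1e-6):
    """One closed-form M-step: weights, means and diagonal covariances
    from fixed DNN posteriors. No EM iterations are needed because the
    posteriors never change; the GMM is used only for prewhitening."""
    N, F = baum_welch_stats(gamma, feats)
    S = gamma.T @ feats ** 2              # diagonal second-order stats
    w = N / N.sum()
    mu = F / N[:, None]
    var = S / N[:, None] - mu ** 2        # Var[x] = E[x^2] - (E[x])^2
    return w, mu, np.maximum(var, var_floor)
```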
And it is interesting to note here that different features can be used for estimating the assignments of a frame to the senones — or, as we used to say, to the components of the UBM — and for extracting the i-vectors, or whatever you're using. So you don't have to change anything: you can have two parallel feature streams that are optimized for the two tasks, one for the ASR task and one for the speaker recognition task, as long, of course, as they have the same frame rate.
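A toy usage sketch of that decoupling (all sizes hypothetical), with posteriors computed on the ASR stream and statistics accumulated over the speaker recognition stream:

```python
T, C, D = 200, 500, 60                      # frames, senones, SR feature dim
gamma = np.random.dirichlet(0.1 * np.ones(C), size=T)  # stand-in for DNN posteriors on ASR features
sr_feats = np.random.randn(T, D)            # parallel SR stream, same frame rate
N, F = baum_welch_stats(gamma, sr_feats)    # statistics use the SR features
```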
I'm not going to go too deep into this. This is the first DNN configuration we developed. It was inspired by this paper by Vesely et al., which was a very successful ASR paper; we managed to reproduce the ASR results and fine-tune it a bit further. This is more or less the configuration, and we had some results. And then Yun Lei sent us a message telling us, "guys, we managed to obtain some amazing results with this," and we saw that the method was actually the same.
So we tried this as well — the first configuration — but on Switchboard data, not on NIST. This was the configuration of Yun Lei et al. from SRI, and it's a little bit different: it uses TRAP-style features at the front end, which is maybe a better thing to do; the input spans thirty-one frames; and it uses log mel filterbanks — they use forty, I think, while we used twenty-three, which was, I guess, one of the reasons why the results we obtained are not that good. There are several reasons, as you'd expect, you know:
there are a lot of free parameters that someone has to tune, as I'm going to show you next. So we had two configurations: the small one, which is practically the one whose results we included in the camera-ready paper; and here we also have the big configuration, which is closer to what SRI describe in their paper.
These are some ASR results we obtained. You see, first of all, the comparison that is in Vesely's paper, just to stress the dramatic improvement you can obtain by using DNNs instead of GMMs for the emission probabilities; and then the two configurations we developed — in green, the one mostly inspired by the work of Vesely, and then the SRI one.
Now let's go back to speaker recognition. Since the question came up, let me tell you what flavour of PLDA we used. We found that for most of the cases a full-rank speaker space — the V matrix — worked better; we didn't, of course, try every configuration, but it worked better compared to, for example, a rank of one hundred and twenty. For this system, before length norm we applied WCCN instead of doing prewhitening, which in most of the cases again worked very well — much better than prewhitening. And about the dilemma of whether you should average i-vectors after or before length normalization: I think you should average before and after length normalization, because that's more consistent with the way you're training the PLDA model, and in our case it made a lot of difference. Okay?
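A minimal sketch of one reading of "average before and after": normalize the session i-vectors, average, then renormalize, so the enrollment vector lives on the same unit sphere as the PLDA training data (my interpretation, not necessarily the exact recipe):

```python
import numpy as np

def length_norm(x):
    """Project i-vectors onto the unit sphere."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def average_enrollment(session_ivectors):
    """Average multi-session enrollment i-vectors before *and* after
    length normalization."""
    return length_norm(length_norm(session_ivectors).mean(axis=0))
```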
So, these are the results from Switchboard with the first configuration. They're not that good — not at all that good; they're not even comparable to the ones you obtain with the baseline system. So we were rather disappointed at that stage — that was somewhere around Christmas. But once you fuse them, you get something like this, which is good. Note that in this case we are only using a single enrollment utterance, and the picture for males is the same, more or less.
Now let's go to NIST with the big configuration — or with what we thought was the configuration of SRI. This is the small configuration: now we see that, at least in the low false alarm region, we were making progress — though not by much when fusing them; the fusion was not that good. By the way, I'm emphasizing condition C2 here — although C5 is a subset — just to make sure it covers both clean and noisy telephone speech. And this is with the big configuration — the same picture, now comparing against a 2048-component GMM. It's more or less the same picture: you get some improvement in the low false alarm region, in some cases, but don't think it's that much. This is the one for the big configuration.
So now I'm going to talk a little bit about PLDA, because domain adaptation was on the agenda; let's focus on PLDA for a moment, just to share with you a result which I think is interesting. We know that when you apply length normalization you may attain results that are even better than heavy-tailed PLDA in some cases. The problem is that this transformation is somewhat sensitive to the datasets, so ideally it would be great to get rid of it. A possible alternative would be to scale down the number of recordings: what that means is that you pretend that, instead of having n recordings, you have n over three. We define the scaling factor arbitrarily, but one over three or one over two works fine in practice. And using that trick, all the evidence criteria work: once you train the PLDA you get a strictly increasing evidence, which is good, and you are somehow losing confidence — which is a good thing; okay, it is good to lose confidence in some cases. And as to the question "can we get rid of length normalization?" — no, the answer is no, but we are rather close. Note that a scale factor of one means practically nothing is changed.
Here are some results with different scaling factors. All I'm doing is simply dividing the number of recordings — consistently, both in training and when evaluating the model — that is, multiplying it by one over two or by one over three. I'm guessing that most of the gap between not doing length normalization and doing it is somehow bridged by this trick. So maybe the people who are working with domain adaptation can use this as an alternative to length normalization — and tell me if they find something interesting.
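To see where such a factor enters, take the simplified PLDA model as a sketch (this formulation is an assumption, not necessarily the exact model used in the talk):

```latex
\[
x_i = m + V y + \varepsilon_i,\qquad
y \sim \mathcal{N}(0, I),\quad
\varepsilon_i \sim \mathcal{N}(0, \Sigma).
\]
% With n recordings of one speaker, the posterior precision of y is
\[
L = I + n\, V^{\top} \Sigma^{-1} V
\;\;\longrightarrow\;\;
L_{\alpha} = I + \alpha\, n\, V^{\top} \Sigma^{-1} V,
\qquad \alpha \in \{\tfrac{1}{2}, \tfrac{1}{3}\}.
\]
```

Counting each recording as a fraction of one inflates the posterior uncertainty over the speaker factor — the "losing confidence" effect — consistently in training and in the likelihood ratio.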
So, conclusions. State-of-the-art DNNs from ASR can definitely replace a traditional GMM/UBM-based system, and the good thing is that once the Baum-Welch statistics are extracted, exactly the same machinery can be applied — no need to change anything else. And the results — provided by SRI, and not only them: there was also the talk this morning using supervised models for the statistics, exactly the same idea — clearly show the superiority of the approach. We probably did something suboptimal; that's why we didn't manage to get the desired results.
As for extensions: obviously, convolutional neural nets might be useful. And there is another idea, which we used for ASR, where what we did was to augment the input layer of the DNN by appending a typical, regular i-vector. We did that for broadcast news, in order to get some sort of speaker adaptation — we presented it at ICASSP. It helped a lot: it gave a one point five to two percent improvement — absolute, not relative — which is very good for ASR. So you can maybe imagine an architecture where you extract a regular i-vector and feed it to the DNN in order to then extract a DNN-based i-vector; you can imagine all sorts of things like that.
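A minimal sketch of that input augmentation (function name mine):

```python
import numpy as np

def append_ivector(frame_feats, ivector):
    """Speaker-adapt the DNN input by appending the utterance's i-vector
    to every frame's feature vector (the same i-vector on all frames)."""
    T = frame_feats.shape[0]
    return np.hstack([frame_feats, np.tile(ivector, (T, 1))])
```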
So, that's all. Thanks a lot.
Thank you. We have time for some questions.
I didn't quite catch, when you talked about scaling down the number of counts: are you talking about scaling it down in the PLDA score? I mean, if you don't score by the book... I don't know.
No. First of all, I'm training the PLDA model by doing this trick — that's crucial, training the model like that. Then I'm doing averaging, but I treat the single averaged utterance as being one over three, or one over two, utterances in the scoring.
Okay, so you whiten the variances when you train, and then you also add uncertainty in scoring?

Yes — if you write down the LLR score, you can clearly see where you need to multiply by the scaling factor, especially for...
Thanks, Themos. I'll just mention a few things for the community: it would be a good way forward to see where the differences are — it feels like there's some key ingredient somewhere, and otherwise all the teams that are going to try this will stumble into the same thing. Some of the things that have popped up during this conference: as you mentioned, the low number of filterbanks — twenty-three instead of forty, I believe you said — might be one impacting factor. Another that we worked out concerns applying VTLN before training the DNN — we sort of do it for the ASR features, yes, but not for the DNN — that's another factor. And also removing the silence index of the DNN during the accumulator generation. There are a number of things there, and it's good that other people have also been able to make it work as well, so we know it's something positive — it's moving in the right direction.
One of the other things I wanted to mention was — let me think, it's gone blank right now... That's right: we were talking about ASR performance. One of the things people said was, "this configuration works really well for ASR, so why should we change it?" And what we've seen so far is that performance on the ASR side of things doesn't necessarily reflect how suitable the system is for the speaker ID task. So if you're struggling trying to use your ASR system as-is, or whatever you have, perhaps go back to whatever was published in the configurations and just start from scratch, and see if that works better. And certainly don't be afraid to contact any of the teams that are working on this — we're all happy to address the issues.
Because ASR is more forgiving: once you extract the posteriors there is, you know, the language model that can smooth some results, whereas we don't have that when we are extracting posteriors for speaker recognition. So that might be an indication of why better ASR results do not necessarily translate into better results for speaker recognition.
Are you implying, Mitch, that you guys turned off VTLN specifically because you were going to use the system for speaker ID, or was that already the way you did ASR?

I just started working on the ASR training myself — a colleague was doing it beforehand — and in the configuration we had it switched off. I asked, "should we not be doing this?", and I can't recall whether you said it doesn't help or it doesn't make much difference. That's just one thing where we differ from what most people are doing, and one thing we anticipated might have an impact: it removes speaker discriminability, simply.

Right.
So, you seem to have very good results there. But convolutional nets have been around for twenty years, right? I mean, Yann LeCun was working on them back then. How come they arise only now? And the second question is about recurrent nets, which are also useful: what's the story — why does this happen twenty years later?
Sure — as to the question: I guess a major role is played by the fact that we are now using much longer windows as inputs, and of course by the fact that we have the processing power now. It took us a month, maybe less, to train the big system, and that's using GPUs, of course; there is still some optimization that needs to be done in terms of engineering. It takes a lot of time to process all the data that is required to train robust ASR systems, and that maybe wasn't feasible during the eighties. That's definitely what beat most of the community — why they failed to show, during that era, that these discriminative models are powerful enough to compete with, and by far outperform, the GMM approaches.