0:00:07 Right, thanks. My name's Kornel, and this is joint work with my colleague, who is sitting up there; she's happy to take all of your questions afterwards.
0:00:18 Before we begin, I just want to lay down the definitions that I'm going to be using. This is my first time at this meeting, so I may be saying things very wrong, and I apologise for that in advance. I conceive of all the features that you could compute from speech as falling into these four areas, where on the left-hand side I consider coarse spectral features, and on the right-hand side fine spectral features; the top two panels contain things that characterise a single frame of speech, whereas at the bottom are things that characterise the trajectory, that model things across frames.
0:00:53 So all the features that you are probably familiar with can be placed in this space, and prosodic features tend to be those that either model the fine structure in the spectrum, or that model long-term dependencies, at the bottom. But we in this paper are going to look only at the so-called instantaneous prosodic features, namely those that characterise a single frame, and in particular we are looking at pitch.
0:01:19 Okay, so pitch is estimated using a pitch detector, which would ideally produce a single best estimate for pitch, but the signal is usually so noisy that a pitch detector is typically expected to produce an n-best list of estimates per frame, and then a dynamic programming approach is used to collapse that to a single best estimate per frame. I'm going to refer to these two components together as pitch estimation.
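As a rough sketch of that second component, here is a minimal dynamic-programming pass over per-frame n-best pitch candidates; the function name, the log-frequency transition cost, and all parameters are my own illustration, not anything specified in the talk:

```python
import numpy as np

def smooth_pitch_track(candidates, scores, transition_weight=1.0):
    """Pick one pitch candidate per frame via dynamic programming.

    candidates : list of per-frame sequences of candidate F0s (Hz).
    scores     : list of per-frame local goodness values (higher = better).
    A transition cost penalises large jumps in log-F0 between frames.
    """
    n_frames = len(candidates)
    # cost[t][k]: best cumulative cost ending at candidate k of frame t
    cost = [-(np.asarray(scores[0], dtype=float))]
    back = []
    for t in range(1, n_frames):
        prev_f0 = np.asarray(candidates[t - 1], dtype=float)
        cur_f0 = np.asarray(candidates[t], dtype=float)
        # pairwise jump penalty |log f0_cur - log f0_prev|, shape (prev, cur)
        jump = np.abs(np.log(cur_f0)[None, :] - np.log(prev_f0)[:, None])
        total = cost[-1][:, None] + transition_weight * jump
        best_prev = np.argmin(total, axis=0)
        cost.append(total[best_prev, np.arange(len(cur_f0))]
                    - np.asarray(scores[t], dtype=float))
        back.append(best_prev)
    # backtrace the single best path
    k = int(np.argmin(cost[-1]))
    path = [k]
    for bp in reversed(back):
        k = int(bp[k])
        path.append(k)
    path.reverse()
    return [float(candidates[t][path[t]]) for t in range(n_frames)]
```

Note how the long-term constraint can override a frame's locally best candidate when it would create an implausible pitch jump.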
0:01:42 The best estimate per frame that comes out of this can be linearly or nonlinearly smoothed, can be normalised based on proximity to some kind of landmark, and then different kinds of features can be extracted from it. Now, these things at the bottom here I assume are what you would call high-level feature computation, or high-level features. In this talk, and I hope I'm not disappointing anyone, we're actually going to be looking at this point, which is as low-level as it gets in this session; we're going to claim that these features are as low-level as MFCCs.
0:02:17 Okay, so if we look at this right box a little more closely: typically pitch estimation, or pitch detection, is a two-step process, where the source domain we start from is an FFT. The first step is the computation of what I'm going to be calling a transform domain, and there are lots of alternatives here; let's say this is the autocorrelation spectrum. The second step is simply finding the argmax. A lot of effort has gone into this process, and typically the effort is spent only on this first step, because the second step is so elementary that nobody really questions it: most of the work on improving pitch detection has gone into making sure that this transform is shaped such that the argmax is optimal, or most robust, for whatever we consider.
0:03:08 What we're going to claim in this work is that you should just throw away this whole second step, and that you should model the entire transform domain; that's what this talk is about. There are four parts to this talk: I'll first describe what I'm calling the harmonic structure transform, then present some experiments and some additional analysis, and I will conclude with three slides.
0:03:36 Okay, so the particular pitch detection algorithm that we're going to look at was proposed by Schroeder in 1968, and it involves producing a new spectrum, the sigma spectrum, where at each frequency we have the sum of the energy at all the frequencies that are integer multiples of that candidate fundamental frequency in the original FFT.
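In code, a minimal version of this harmonic summation might look as follows; the parameter names and the rfft-style bin-spacing assumption are mine, not Schroeder's:

```python
import numpy as np

def sigma_spectrum(power_spec, sample_rate, f0_grid):
    """Schroeder-style harmonic sum: for each candidate F0, sum the
    power found at its integer multiples in an FFT power spectrum."""
    n_bins = len(power_spec)
    bin_hz = sample_rate / (2.0 * (n_bins - 1))  # assumes rfft-style bins
    sigma = np.zeros(len(f0_grid))
    for i, f0 in enumerate(f0_grid):
        harmonics = np.arange(f0, sample_rate / 2.0, f0)  # f0, 2*f0, ...
        bins = np.round(harmonics / bin_hz).astype(int)
        sigma[i] = power_spec[np.clip(bins, 0, n_bins - 1)].sum()
    return sigma
```

One classic weakness of the plain sum is that subharmonics of the true F0 collect the same harmonic peaks, which is part of what motivates subtracting off-harmonic energy later in the talk.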
0:03:58 Very quickly after he proposed this came harmonic compression, which is a distinctly nonlinear operation. I want to demonstrate it over here on the right: basically, what ends up happening is that the spectrum is compressed, conceptually, by integer factors and then summed. The problem with harmonic compression is that it led people to actually look for implementations of this algorithm in exactly this way, so first compress and then sum, and it turns out that this occupied people for about twenty years, a lot of last century.
0:04:34 A much better way to do this is to not do any compression at all, but to comb filter: you just add up whatever is at whatever frequency you want, without first having to compress it towards the harmonic frequency, or fundamental frequency, that you're interested in. When you do this there are of course no compression difficulties, and comb filtering is linear. We in this work are going to be defining all of our filters over the range of three hundred hertz to eight thousand hertz.
0:05:05 If you have lots of such comb filters, you have a filter bank, and in this work we're going to have nominally four hundred filters in this filterbank, which range from fifty to four hundred fifty hertz, spaced one hertz apart. This is in continuous frequency; of course we want a discrete-frequency filter, because we have discrete FFTs.
0:05:29 There are lots of ways to do this, and I always like citing the work by colleagues from Lindsay, because this is actually work that influenced me, but it's probably not the first such work. What we're going to do in this work is a little bit different: we're going to say that each tooth of the comb is triangular, and then we're going to simply Riemann-sample it, such that the discrete comb filters in the filterbank actually end up looking like this; as you can see, it doesn't look harmonic at all.
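A toy construction of such a discretised comb filterbank matrix might look like this; the triangular teeth are sampled at the FFT bin frequencies, but the half-width rule and the max-accumulation of overlapping teeth are my assumptions, not the talk's exact Riemann sampling:

```python
import numpy as np

def comb_filterbank(fft_freqs, f0_lo=50.0, f0_hi=450.0, f0_step=1.0,
                    band=(300.0, 8000.0), tooth_halfwidth=None):
    """Build a matrix H whose columns are comb filters with triangular
    teeth at integer multiples of each candidate F0, sampled at the
    discrete FFT bin frequencies.  Bins outside `band` are zeroed."""
    f0s = np.arange(f0_lo, f0_hi + 0.5 * f0_step, f0_step)
    H = np.zeros((len(fft_freqs), len(f0s)))
    for j, f0 in enumerate(f0s):
        hw = tooth_halfwidth if tooth_halfwidth is not None else f0 / 2.0
        harmonics = np.arange(f0, band[1] + hw, f0)
        for h in harmonics:
            # triangular tooth centred on harmonic h, half-width hw
            tooth = np.clip(1.0 - np.abs(fft_freqs - h) / hw, 0.0, None)
            H[:, j] = np.maximum(H[:, j], tooth)
        H[(fft_freqs < band[0]) | (fft_freqs > band[1]), j] = 0.0
    return f0s, H
```

With teeth this wide, a single column sampled on a coarse FFT grid indeed loses any obvious harmonic appearance, which is the point being made on the slide.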
0:06:03 So what do you do with this now? If you have a set of such discrete comb filters, then they actually implement a filter bank that has a matrix representation H, and it's very simple to use: you just do a matrix multiplication with the FFT that you have in hand. We're also going to take the logarithm of the output of that filter bank, the same way that's done for the mel-frequency filterbank.
0:06:27 Finally, from the energy that is found at the integer multiples of a specific candidate fundamental frequency, we're going to subtract the energy found everywhere else. To do that we're going to form this complementary transform, H-tilde. I can demonstrate it over here: this is the column vector for a particular comb filter, and we just form its unity complement, which gives us this here, the corresponding column vector of H-tilde. What this implements, of course, is a frame-wise form of the harmonics-to-noise ratio, which is known to correlate with hoarseness, or roughness of voicing, typically in pathological speech. Typically what's done is that the harmonics-to-noise ratio is computed only at the fundamental frequency, once that is known; what we're doing is computing it for all possible candidate fundamental frequencies, and then using that vector as a feature vector.
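Given a filterbank matrix H, the per-frame vector of candidate harmonics-to-noise ratios can be sketched as follows; the small eps floor is my addition to keep the logarithms finite, not something from the talk:

```python
import numpy as np

def hnr_vector(power_spec, H, eps=1e-10):
    """Frame-wise HNR-like vector: for every candidate F0 (column of H),
    log energy passed by the comb filter minus log energy passed by its
    unity complement H_tilde = 1 - H."""
    H_tilde = 1.0 - H
    harm = H.T @ power_spec      # energy at the candidate's harmonics
    noise = H_tilde.T @ power_spec  # energy everywhere else
    return np.log(harm + eps) - np.log(noise + eps)
```

A candidate whose harmonics line up with the spectral peaks gets a large value; all other candidates get small or negative values, so the whole vector carries far more than its argmax.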
0:07:34 Okay, so the elements of this vector are still correlated, and we decorrelate them in the way that anybody else would: we subtract the global mean, we form a decorrelation matrix, and then, after applying that matrix, we truncate away all those dimensions that do not have a positive eigenvalue. We're going to call the output of this harmonic structure cepstral coefficients, for lack of a better term, and it is simply a decorrelation of the logarithm of the output of the filter bank, minus a normalisation term, which is our H-tilde here. We actually explore two different options for this decorrelation, PCA and LDA, which you probably know more about than I do.
0:08:13i've claim that this is at the level of
0:08:15mfccs but
0:08:16i would like to try to convince you hear that it's nearly identical from a functional point of view
0:08:20um
0:08:21the mel filterbank
0:08:22kind of
0:08:23also be implemented as a matrix
0:08:25and so if it's if that's and
0:08:27you can see that at least
0:08:28inside here
0:08:29is approximately the same it's a matrix multiplication of the lombard
0:08:33the decorrelating transforms of course
0:08:35different
0:08:36and
0:08:37sort of
0:08:37important and in our case unfortunate that
0:08:40article or decorrelating matrix is data
0:08:42pendant where is the mfcc one is not but
0:08:45um
0:08:46to compare
0:08:47here 'em in H
0:08:49these are
0:08:49essentially the columns of and
0:08:51that is to say
0:08:53they
0:08:53smear energy across frequencies that are related by
0:08:56jason
0:08:57see
0:08:57where is the columns H the matrix that we're proposing here
0:09:01smear energy across frequencies that are related by harmonicity
0:09:06 I also want to say that this is a fairly direct follow-on from our previous work using a representation called fundamental frequency variation, which models the instantaneous change in fundamental frequency without actually computing the fundamental frequency. What we're doing here in the current work is: we take a frame of speech, we take its FFT, and then we take a bunch of idealised FFTs, which are the comb filters, the columns of capital H; we form the dot product of the frame that we are currently looking at with every one of these, and the locus of these dot products of course defines a trajectory, which is a function of the filter index, which corresponds to candidate fundamental frequency.
0:09:51 In contrast, in the FFV work that we've done before, we take two frames: we take the current frame, the same as here, but we also take the previous frame, and we dilate the previous frame by a range of logarithmic factors; then again we take the dot product of the dilated previous frame with the current frame, and the locus of those dot products gives us another trajectory, which is also a function of an index, where the index here is the logarithmic dilation factor. So the location of the peak there nominally expresses the fundamental frequency in hertz, and the location of the peak here expresses the rate of change of fundamental frequency, in octaves per second.
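A minimal numerical sketch of that FFV idea follows, with `np.interp` standing in for whatever dilation filtering the real implementation uses; all names and parameters here are illustrative, not from the talk:

```python
import numpy as np

def ffv_curve(prev_mag, cur_mag, freqs, rhos):
    """Fundamental-frequency-variation sketch: dilate the previous
    frame's magnitude spectrum by factors 2**rho (rho in octaves) and
    dot each dilated copy with the current frame's spectrum.  The peak
    over rho indicates the frame-to-frame pitch change."""
    curve = np.zeros(len(rhos))
    for i, rho in enumerate(rhos):
        factor = 2.0 ** rho
        # resample prev_mag onto a frequency axis stretched by `factor`
        dilated = np.interp(freqs, freqs * factor, prev_mag,
                            left=0.0, right=0.0)
        curve[i] = float(np.dot(dilated, cur_mag))
    return curve
```

When the pitch rises between two frames, the best-matching dilation factor, not any explicit F0 estimate, encodes the rate of change.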
0:10:39 Okay, so now I'm going to describe the experiments that we did to see whether this makes any sense at all. The data that we use is Wall Street Journal data, mostly coming from one portion of this corpus. We have one hundred and two female speakers and ninety-five male speakers, and we perform closed-set classification for each gender separately. We used ten-second trials, and we had enough data to have five minutes of training data, plus development and test data of three minutes each, corresponding to approximately fourteen hundred to eighteen hundred trials of ten seconds apiece. All of the data comes from a single microphone, and it's what we're calling matched multi-session, which means that for the majority of speakers, the data in the training set, the development set, and the test set is drawn from all sessions that are available for that speaker.
0:11:32 Something that is not in the paper, because we did it afterwards, but that I thought you might appreciate, is that we built a system based just on pitch, and we extract pitch using a standard sound-processing tool in its default settings. The comparison isn't quite fair, because any current pitch tracker actually uses dynamic programming, so this pitch-based system is actually using long-term constraints, whereas our system does not, because it treats frames independently. We ignore unvoiced frames, and we normalise the pitch of voiced frames. What we see is that this system achieves accuracies of approximately eighteen percent and approximately twenty-seven percent, for females and males respectively.
0:12:17 I get the feeling that my microphone is louder at some times and quieter at others; is that true? Does it bother anyone? Okay, sorry.
0:12:27 Okay, so the system that we're proposing here, to explore this idea of modelling the entire transform-domain signal: we don't perform any preemphasis, partly because we are not using frequencies below three hundred hertz, and since we throw those away there's no DC component, so we decided not to bother with it. We have seventy-five percent frame overlap with thirty-two-millisecond frames, and we use the Hann window instead of the Hamming window that is in ubiquitous use. The number of dimensions and the number of Gaussians in the gender-dependent models are things I still need to discuss, in the coming slides. We don't use a universal background model, and we don't use any speech activity detection.
0:13:08 In optimising the number of dimensions, what we have done is create the most laconic model you can invent, a single Gaussian with diagonal covariance; we train our PCA and LDA transforms on the training set, and we select the number of dimensions which maximises accuracy on the development set. So we can see here that for the PCA transform we achieve an accuracy of about forty percent for the first principal components, and an accuracy of about eighty-five percent for the first LDA components, for females, and slightly better for males, but in approximately the same ballpark. The lighter colours in these two plots represent longer trial durations; we decided not to do the sixty-second and thirty-second trials that we started out with, because the numbers were too high and it was difficult to compare.
0:14:04 This table summarises the performance of the HSCC system that I just described, once the number of Gaussians has been set to optimise dev-set accuracy; that number happened to be two hundred fifty-six in our experiments. What we see here is that if you take the representation that a pitch tracker is exposed to, and you spend the time looking for the argmax in it, you achieve the eighteen and twenty-seven percent we saw earlier; but if you don't bother doing that, and instead throw everything that that representation has into the model, then you achieve almost a hundred percent. So the claim here, based on these experiments, is that there is speaker-discriminative information beyond the argmax in these representation vectors, and of course discarding it yields performance that is not really comparable. Spending time improving argmax estimation appears unnecessary, and of course argmax estimation here is pitch estimation.
0:15:03 Okay, so we also constructed a contrastive MFCC system, which is not really standard in the way that you would probably build one, but we tried to retain as many similarities with the very simple HSCC system as we could. We did apply preemphasis and a Hamming window, because that just happens to be the standard front-end feature processing in our ASR systems. We retain twenty of the lowest-order MFCCs, but we also don't use a universal background model or any speech activity detection, so in this respect the two systems are most comparable. What we see when we compare these two systems is that essentially in every case, at least for this data and for the experiments that we did here, the HSCC representation outperforms the MFCC representation; but we're happy just saying that they're comparable in magnitude.
0:15:57 We've also, just to be safe, applied an LDA to the MFCC system. This is also not a fair thing to do, because we actually haven't truncated or discarded any dimensions after it: we take twenty dimensions and we rotate them. It leads to a negligible improvement. If we combine the HSCC and MFCC systems, we get improvements in every case except for the dev set for males, where MFCCs don't seem to help; but other than that, in general, HSCCs gain at least from combination with MFCCs.
0:16:35 Okay, so given these results, I'm going to describe a couple of analyses, or an analysis of a couple of perturbations, because we were interested in seeing how lucky we were in just guessing at the parameters that actually drive the evaluation of our system. We considered three different kinds of perturbations: one was changing the frequency range to which the filterbank is exposed; one was changing the number of comb filters in the filterbank; and the other was ablating, that is, throwing out, the so-called spectral-envelope information which is contained in MFCCs. We ran a very simple version of this analysis, where we used only a single diagonal-covariance Gaussian per speaker, and we only show numbers on the dev set, because we found them sufficiently similar in granularity to the eval-set numbers that we didn't actually bother computing those.
0:17:37 As before, I'm going to plot accuracy as a function of the number of dimensions.
0:17:43 So the first perturbation has to do with modifying the low-order cutoff. As I said, the HSCC system looks at frequencies between three hundred hertz and eight kilohertz, and it is interesting to see what happens if you choose a different value for this low-order, or low-frequency, cutoff. The results here, for females on the left and males on the right, indicate that the three-hundred-hertz cutoff we had chosen just happens to correspond to the best performance. If we expose the algorithm also to frequencies between zero and three hundred hertz, then for females we actually lose approximately four percent absolute; the drop is much smaller for males. Moving the cutoff further up has a smaller effect, but it's also worse than keeping what we have.
0:18:35 The second perturbation that we analysed was changing the upper limit. As I said, we had three hundred to eight thousand hertz to begin with, but it's interesting to see what happens if you cut it off at four thousand hertz, or at two thousand; this latter configuration in particular corresponds approximately to upsampled eight-kilohertz telephone audio. So here again, results for males on the right and females on the left: what we see is that for males, reducing the number of high-frequency components that you look at in the FFT has a more drastic effect than for females. For females, actually, going down to four thousand is only a drop of less than one percent absolute, but dropping it further, you see drops of approximately three percent. I want to state that even under these sort of ridiculous ablation conditions, this significantly outperforms a pitch tracker, although it is not known how well a pitch tracker would operate on three-hundred-to-two-thousand-hertz audio.
0:19:37 The third perturbation is in the transform domain. As I said at the very beginning, we have four hundred filters spaced one hertz apart, and we are at liberty to choose however many filters we want; so it's interesting to see what happens if you double that number and space them half a hertz apart, or halve that number and space them two hertz apart. What the results show, for females on the left again and males on the right, is that increasing the resolution of the candidate fundamental frequencies with which you construct the filter bank actually leads to significant improvements, of almost two percent absolute for females, and slightly smaller for males; but then decreasing the resolution has a similarly sized negative impact for both.
0:20:27the fact that the
0:20:28that the mfcc an H assisi features
0:20:30combine to improve performance in three out of four cases
0:20:34suggests that the
0:20:35that the to surf feature streams are complementary
0:20:38um
0:20:39but there is actually no proof of that until sort of now
0:20:43so what we're gonna do here is we're gonna take that
0:20:45the the source domain um
0:20:47fft
0:20:48and we're going to uh
0:20:50lifter it by transforming it into the real cepstrum and then throwing out the low order cepstral coefficients and then
0:20:55transforming it back into the
0:20:57into the spectrum
0:20:58coming
0:20:59um
0:20:59and i wanna say here that the the lower order
0:21:02real cepstral coefficients correspond approximately
0:21:05to the low order mfcc coefficients right so
0:21:08um
0:21:08ablating
0:21:10real cepstral coefficients
0:21:11is a
0:21:12which are working
0:21:13computed without a filterbank
0:21:15is a
0:21:16very similar to removing
0:21:17exactly that information that's captured by
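That liftering operation can be sketched as follows, operating on a log power spectrum over rfft bins; this is a minimal illustration under my own conventions, not the exact processing used in the experiments:

```python
import numpy as np

def lifter_out_envelope(log_power_spec, n_drop=13):
    """Remove spectral-envelope information from a log power spectrum:
    go to the real cepstrum, zero the `n_drop` lowest-order cepstral
    coefficients, and transform back to the log-spectral domain."""
    cep = np.fft.irfft(log_power_spec)      # real cepstrum
    cep[:n_drop] = 0.0
    if n_drop > 1:
        cep[-(n_drop - 1):] = 0.0           # symmetric counterpart
    return np.fft.rfft(cep).real
```

Low quefrencies carry the smooth envelope and high quefrencies carry the fine harmonic ripple, so zeroing the low-order coefficients leaves only the ripple behind.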
0:21:21 So, for the HSCC system whose performance you saw in the table, we actually don't do any liftering, but we could lifter out, say, the first thirteen low-order cepstral coefficients, which corresponds approximately to what people typically use in an ASR system, or the first twenty, which is what we used in our MFCC baseline that we saw earlier. What ends up happening, as you can see, and a curious comment on females, is that removing the spectral-envelope information actually improves performance here: if we throw away the information contained in the first thirteen cepstral coefficients, we get an improvement of about two percent absolute, meaning that the spectral-envelope information that's modelled in MFCCs actually hurts here for women. It's also the case that if we throw out twenty of them, we still do better than not throwing out any, but it's already not as good as throwing out only thirteen, which suggests that the cepstral coefficients found between order thirteen and twenty are useful. For males, doing any kind of liftering seems to hurt, but the pain is smaller: if you throw away the first thirteen, the loss is negligible, I believe it's one trial, and I have no idea whether it's statistically significant.
0:22:35 So the findings of this are that the representation appears to be robust to perturbations of various sorts; there is play of approximately five percent absolute. The performance for female speakers seems to be more sensitive to these perturbations than for males, in both pleasing and displeasing directions. And it's again important to say that even under these perturbed conditions, the performance of these systems is vastly superior to the performance that would be achieved if you spent a lot of time finding the argmax in the representation that pitch trackers are exposed to; though we don't know how a pitch tracker would perform here.
0:23:16 So the summary of this talk, and I still have three slides, is that the information that is available to a standard pitch tracker, because it is computed by that pitch tracker and then subsequently discarded, is valuable for speaker recognition. The three points that I would like to draw specific attention to are: the performance achieved with these HSCC features is comparable to that achieved with MFCC features; the information contained in these HSCC features appears to be complementary to the information in MFCCs; and HSCC modelling appears to be at least as easy as MFCC modelling. So this evidence suggests, as I have probably said too often by now, that improving the estimation of pitch, that is, finding the argmax in this representation, which is essentially what pitch trackers do, seems like an endeavour that doesn't warrant further time investment; it's possible to simply model the entire transform domain and do better. If pitch is required for other, high-level kinds of features, which of course we're ignoring here because we're not doing any long-distance feature computation, then at least that information should not be discarded, even if it is not used to estimate pitch. If these ideas generalise to other data types and other tasks, then there is some chance that this will lead to some form of paradigm shift in the way that prosody is modelled in speech.
0:24:54 So I want to close with a couple of caveats. We don't actually know how these features compare to other instantaneous prosody vectors; it's possible that if you had a vector that contained pitch, and maybe harmonics-to-noise ratio, and maybe some other things that are computable instantaneously, per single frame, the difference would be much smaller. We don't know that at the current time. We also don't know how this representation performs under various mismatched conditions, for example channel, or session, or distance from microphone, or vocal effort; these are things that need to be explored. It's also quite possible that there are other classifiers that may be better suited to this; in particular, the performance, which wasn't bad, with a single diagonal-covariance Gaussian suggests that maybe SVMs would do much better, but the feature vectors are large, so this presents some problems. Existing prosody systems of course focus a lot on long-term features, and we haven't attempted that here at all; a simple thing to try would be to stack features from temporally adjacent frames, or to stack their differences, but I think that probably the best thing to do is to simply compute the modulation spectrum over this, over the HSCC spectrogram. And, probably most importantly, we would really like to have a data-independent feature rotation which allows us to compress the feature space: this would significantly improve understanding, because right now we just have this huge bag of numbers; it would allow us to apply some of the normal things that people apply, like universal background models; and it would allow us to deploy it in other, larger tasks. Thank you.
0:26:51thank you
0:27:06 [Audience member:] Could you perhaps just help me understand your last point? Please explain to me why you see some difficulty in applying your method to larger tasks; is that because your feature vectors are very large?
0:27:27 Well, the first thing, yes: in the system that we describe most extensively here, the feature vector has four hundred numbers, and I have found it to be painful; that's four hundred numbers every ten milliseconds. Does that answer your question?
0:27:44 Can I say something more: we have actually found that if you are looking at different kinds of mismatch, you need to do some homomorphic processing, which actually increases the size of this feature vector, and so it becomes even more painful. And it's basically because we don't really know how to properly model this with a data-independent transform.
0:28:06okay thanks
0:28:08those seem like they would be very worthwhile to try based on those
0:28:12yeah
0:28:13[inaudible]
0:28:15it would be nice to think of ways to
0:28:17improve the proposal
0:28:18definitely and if any of you have any suggestions
0:28:20i would like to take them
0:28:30do you have any thoughts on why this
0:28:32might be hurt on mismatched data
0:28:38i do
0:28:39i have we have some thoughts
0:28:40so
0:28:41but we we don't have
0:28:42really the correct
0:28:42kinds of thoughts
0:28:43so
0:28:46note also that the problem is that the other dataset that we've been playing with most recently after doing this
0:28:52is a far field dataset
0:28:54and so everything is far field
0:28:55so there is a big change in what happens and we actually don't really know
0:28:59exactly where the change is so i guess we're now in the process of thinking about finding
0:29:02different data but
0:29:03let's try to remember this table um
0:29:06so
0:29:06this is on something called the mixer five dataset
0:29:09which is
0:29:10which which contains lots of
0:29:11different channels but
0:29:12the nine channels that we use are all far field channels
0:29:16and um
0:29:17we we what we have there is we have uh
0:29:20two evaluation sets
0:29:23one has
0:29:24session match and the other has session mismatch and then
0:29:27we
0:29:29we build models
0:29:30for data from every channel and apply them to that same channel that's the matched channel condition
0:29:34and we also apply those models to data from every other channel and that's the mismatched channel condition so that
0:29:39the mismatched channel condition consists of
0:29:42i think it's an average of eight times nine numbers and the matched channel condition is
0:29:46an average of nine
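the bookkeeping described here, per-channel models scored on the same channel versus every other channel, amounts to averaging the diagonal versus the off-diagonal of a 9x9 error matrix; a small sketch with synthetic numbers (the error values are made up, not results from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
# err[i, j]: error of the model trained on channel i, tested on channel j
err = rng.uniform(5.0, 20.0, size=(9, 9))

matched = np.diag(err).mean()              # average of 9 on-diagonal numbers
off_diag = err[~np.eye(9, dtype=bool)]     # the 8 * 9 = 72 cross-channel cells
mismatched = off_diag.mean()               # average of eight times nine numbers
```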
0:29:48so
0:29:48what we see is that in channel
0:29:50matched and channel
0:29:51so in session matched and channel matched conditions
0:29:54we we're we're doing something uninteresting there
0:29:58but
0:29:59you can see here
0:30:01that session mismatch
0:30:02is is more painful than channel mismatch
0:30:04right
0:30:05and um
0:30:07there there is a there is a clear reversal here in the ordering between the mfcc system and the
0:30:13system that we reported in this work
0:30:15um
0:30:16so
0:30:17oh yeah by the way
0:30:18so uh
0:30:20this
0:30:20this line here is the system that i just described
0:30:24and this other one is something that we
0:30:27submitted this summer that has just been accepted which is where this
0:30:30this table comes from
0:30:32but uh
0:30:34but the point is that um
0:30:35these numbers in this row
0:30:37are on average always smaller
0:30:44than the numbers in in in this row
0:30:46so
0:30:47uh
0:30:48i don't know if that answers your questions i can probably talk a little bit about the magnitude of these
0:30:52numbers but
0:30:53are you happy with this then
0:31:02but
0:31:03but i can see actually here that um
0:31:06i don't recall exactly but i think that
0:31:08we did the combination here and the combination leads to approximately ten percent
0:31:13absolute improvement
0:31:14over this mfcc number
0:31:17on average over all conditions right
0:31:20and
0:31:21on the asr you know processing side
0:31:23sorry
0:31:24yeah on the asr you know processing side
0:31:26i think this proposal suggests to
0:31:28[inaudible]
0:31:37could you hold your microphone a little bit closer sorry
0:31:43okay
0:31:45um i think these features
0:31:47this proposal suggests to me the band limited
0:31:50uh harmonic to noise ratios
0:31:52no
0:31:52i think in addition to harmonic to noise ratios [inaudible]
0:32:01which is useful
0:32:03[inaudible]
0:32:05in decoding so mixed excitation
0:32:08and
0:32:09so
0:32:10do you have anything in mind
0:32:11well i i'm not sure that i got all of the things that you said
0:32:16but if you said that there is something that's very similar to this
0:32:19yeah i would really like to talk with you about your system and and we can do that offline
0:32:24or or you can use this
0:32:26right
0:32:27thank you
0:32:32just to come back to the
0:32:33four hundred dimensional features i think you
0:32:36[inaudible]
0:32:39did you reduce that
0:32:40feature dimensionality before your modelling stage
0:32:44i'm sorry uh can you just repeat the very beginning of your question
0:32:49your
0:32:49your your features are
0:32:50four hundred dimensional yep
0:32:52so
0:32:53did you use an lda to reduce the dimensionality before your
0:32:58modelling stage
0:33:00yeah
0:33:00so to what dimensionality did you reduce
0:33:03sorry i i uh
0:33:17so it turns out that
0:33:19it differed for males and females it was fifty two and fifty three
0:33:22i i i don't remember which gender it doesn't matter
0:33:24it's close enough
0:33:25okay because then it would seem it's
0:33:28probably not a practical problem anymore
0:33:30uh
0:33:31we we typically use sixty dimensional features
0:33:34uh with the with the
0:33:37ubm that has two thousand components
0:33:40so that's doable
0:33:41right but the problem is that we would need to invert
0:33:44uh you know we we need to compute the lda or pca transform
0:33:47over
0:33:48because these transforms are global
0:33:50right
0:33:51so we would need to compute
0:33:52a pca transform over
0:33:54two thousand features for the entire
0:33:57i don't know
0:33:58ubm training set if you will
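to illustrate why such a transform is data-dependent, here is a minimal PCA sketch: the rotation must be estimated once over the pooled (ubm-style) training frames before it can be applied to any frame. the dimensions echo numbers from the talk (400 in, roughly 52 out), but the data is synthetic:

```python
import numpy as np

def fit_pca(train_frames, keep):
    """Estimate a global PCA rotation from pooled training frames."""
    mu = train_frames.mean(axis=0)
    cov = np.cov(train_frames - mu, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:keep]         # indices of the largest ones
    return mu, vecs[:, top]

def apply_pca(frames, mu, basis):
    """Project centered frames onto the learned basis."""
    return (frames - mu) @ basis

train = np.random.randn(5000, 400)              # pooled training frames
mu, basis = fit_pca(train, keep=52)
low = apply_pca(np.random.randn(10, 400), mu, basis)   # (10, 52)
```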
0:34:01do do do i understand you correctly
0:34:04oh
0:34:04did you say um
0:34:07that the transform you estimate will depend on [inaudible]
0:34:14no
0:34:15i
0:34:16what i meant
0:34:17if i gave that impression i didn't intend to
0:34:21so it's it's [inaudible] dimensional so much
0:34:25yes
0:34:26um [inaudible] so it's hard to see what would be um
0:34:30a problem when using a universal background model
0:34:33on some of these features
0:34:35but you would use [inaudible]
0:34:38[inaudible] dimensions [inaudible]
0:34:41right so
0:34:48i guess
0:35:02i see
0:35:03training and
0:35:05and extracting the features and
0:35:08[inaudible]
0:35:13yeah right
0:35:18basically yeah so we start from
0:35:20[inaudible]
0:35:24train it and then
0:35:26based on this
0:35:28[inaudible] and that's the test set
0:35:33[inaudible]
0:35:40due to some limitations
0:35:42it's not difficult in principle
0:35:45mainly it's just time
0:35:48i mean we just haven't gotten around to getting that far
0:35:50and like i said
0:35:52with
0:35:53feature vectors of the order of
0:35:55four hundred or eight hundred which has been shown to be better
0:35:57and
0:35:58two thousand and forty eight after
0:36:00some homomorphic
0:36:02processing
0:36:04we just haven't gotten around to even estimating how much disk space we would need on a particular corpus
0:36:09so
0:36:12that's essentially the the
0:36:14correct answer there but my
0:36:15thought always was that we were gonna attack this problem by making the feature vector smaller first
0:36:19rather than addressing the
0:36:21the
0:36:22the
0:36:22infrastructure problem
0:36:23and buying more disks right
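a back-of-the-envelope sketch of that disk-space estimate, assuming one feature vector every 10 ms stored as 4-byte floats (the 100-hour corpus size is an arbitrary assumption, not a figure from the talk):

```python
def storage_gb(dims, hours, frame_rate_hz=100, bytes_per_value=4):
    """GB needed to store per-frame feature vectors for `hours` of audio."""
    frames = hours * 3600 * frame_rate_hz
    return frames * dims * bytes_per_value / 1e9

# dimensionalities mentioned in the talk: 400, 800, and 2048
sizes = {d: storage_gb(d, hours=100) for d in (400, 800, 2048)}
```

at 400 dims this already comes to tens of gigabytes per hundred hours, which is why shrinking the feature vector first looks attractive.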
0:36:25so
0:36:25okay
0:36:26okay
0:36:32okay but
0:36:33i think
0:36:33right now
0:36:35i