Speech Transcript - Temporally Weighted Linear Prediction Features for Speaker Verification in Additive Noise

0:00:07	our idea
0:00:08	i am
0:00:08	writing saving
0:00:10	uh
0:00:11	what about um
0:00:12	research uh directions that uh we are working on at university of eastern finland
0:00:17	is concentrated on uh
0:00:19	uh new features for
0:00:21	uh speaker recognition
0:00:23	uh
0:00:23	it could be i have a
0:00:25	line
0:00:25	three in our laboratory that we are working on um
0:00:28	oh
0:00:29	let me say
0:00:29	i'm a little meeting on and working on it
0:00:32	and new features exploring area of new features for
0:00:35	the speaker recognition
0:00:37	now the current work is the application of uh
0:00:41	some sort of features name uh
0:00:44	uh weighted linear prediction features for
0:00:46	the speaker recognition
0:00:48	and uh our work
0:00:50	is
0:00:50	yeah done jointly with
0:00:52	a group from helsinki university of technology that nowadays is called
0:00:57	aalto university
0:00:59	let me just say that
0:01:01	uh i'm not pretending that i know
0:01:03	what is happening inside the
0:01:06	uh is a weighted linear prediction i'm just
0:01:09	presenting on this up my understanding that
0:01:12	what is weighted linear prediction
0:01:15	and
0:01:15	uh from the group of older is here i just have to me
0:01:19	to help me if
0:01:20	i cannot describe something to you
0:01:23	then
0:01:24	so the concept
0:01:26	we have
0:01:27	sometimes the the customers or reckons that the users of
0:01:30	speaker recognition technology that
0:01:32	they want to use
0:01:33	speaker recognition when they are when when they are
0:01:36	outside of
0:01:37	environment
0:01:38	but we also have some sort of other users of a speaker recognition technology that they want to use it
0:01:44	in any environment in high energy
0:01:46	noise environment in the street
0:01:48	or like
0:01:48	factory
0:01:49	noise whatever
0:01:50	they are not in office environment
0:01:52	or the way to control
0:01:54	uh and wonderful
0:01:55	speech record
0:01:56	then we are interested to know
0:01:58	how our speaker recognition systems
0:02:00	could
0:02:01	degrade in performance
0:02:03	by having this type of
0:02:04	additive noise
0:02:08	well
0:02:08	just to describe what is our focus
0:02:11	since record here is the
0:02:13	we are uh i think that uh
0:02:15	uh a typical speaker recognition system with different
0:02:18	phases and different modules
0:02:20	but our focus
0:02:22	in this study is to
0:02:23	see that
0:02:24	how feature extraction
0:02:25	could affect
0:02:26	speaker recognition
0:02:28	overall the speaker recognition performance
0:02:32	how typically we are doing uh
0:02:34	feature extraction
0:02:36	is that
0:02:36	we have windowed frames
0:02:38	we do spectrum estimation
0:02:40	having the mfccs rasta
0:02:42	filtering
0:02:43	appending delta double deltas
0:02:45	frame dropping according to energy
0:02:47	and uh cepstral mean and variance normalisation this is something typical we get
0:02:52	thirty six dimensional feature vector that we have in our experiments but
0:02:56	uh this is just the baseline
0:02:57	that we have
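
[Editorial sketch] The post-processing steps listed above (delta and double-delta appending, energy-based frame dropping, cepstral mean and variance normalisation) could look roughly like the following. This is only an illustration under stated assumptions, not the speaker's code: the 12 base coefficients (giving 12 x 3 = 36 dimensions), the regression width, the 30 dB energy threshold and the function names `deltas` and `postprocess` are all hypothetical, and the RASTA filtering step is omitted.

```python
import numpy as np

def deltas(feats, width=2):
    """Regression deltas over +/- `width` frames; feats has shape (frames, dims)."""
    T = len(feats)
    padded = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    num = sum(
        k * (padded[width + k : width + k + T] - padded[width - k : width - k + T])
        for k in range(1, width + 1)
    )
    return num / (2 * sum(k * k for k in range(1, width + 1)))

def postprocess(cepstra, log_energy_db, drop_db=30.0):
    """Append deltas and double deltas, drop low-energy frames, then apply CMVN.

    cepstra:       (frames, 12) base coefficients (12 is an assumption)
    log_energy_db: (frames,) frame energies in dB
    drop_db:       frames more than this far below the peak energy are dropped
                   (threshold value is an assumption)
    """
    d1 = deltas(cepstra)
    d2 = deltas(d1)
    feats = np.hstack([cepstra, d1, d2])          # 12 + 12 + 12 = 36 dims
    keep = log_energy_db > log_energy_db.max() - drop_db
    feats = feats[keep]
    # Cepstral mean and variance normalisation over the retained frames.
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-10)
```

The spectrum estimator (FFT, LP or a weighted variant) only changes how `cepstra` is computed; everything in this sketch stays the same regardless of that choice.
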
0:02:59	then
0:03:00	now the question is that
0:03:01	is really
0:03:03	uh we are all the time using
0:03:05	fft
0:03:05	to make the spectrum but if it is really
0:03:08	uh the best way
0:03:09	that we can do it
0:03:10	or
0:03:11	another question is that
0:03:12	is it really that much robust in additive noise condition
0:03:18	then going to the L P
0:03:19	it is
0:03:20	something uh
0:03:21	well known that
0:03:22	uh estimating the spectrum could be done by
0:03:25	linear prediction
0:03:27	or fft estimation
0:03:28	and they are uh
0:03:30	just alternate models alternate ways to estimate the spectrum
0:03:34	nobody says uh that
0:03:36	L P is better for speaker recognition or fft
0:03:39	is better for
0:03:40	the speaker recognition or even any
0:03:42	other
0:03:42	the speech processing applications
0:03:45	and that
0:03:47	now
0:03:47	we are trying
0:03:48	to
0:03:49	uh say that
0:03:51	what is the performance of fft
0:03:54	L P
0:03:54	and now introducing wlp
0:03:57	the wlp
0:03:58	it's uh
0:03:59	just
0:04:00	targeted to put more stress
0:04:03	on
0:04:03	some regions of
0:04:04	speech that
0:04:05	uh they have
0:04:07	let me say uh they have
0:04:09	more energy
0:04:10	yeah we have
0:04:12	uh
0:04:13	uh the way the uh we are weighting the
0:04:15	energy of error
0:04:16	by a weighting function and where the weighting function comes from
0:04:20	is that we are uh
0:04:22	computing the
0:04:28	yeah we are
0:04:30	we are computing the
0:04:32	uh the weighting function as
0:04:34	the immediate energy of the signal
0:04:37	before the
0:04:38	current sample something like M samples before the current sample
0:04:41	and put it in the
0:04:43	weighting function
0:04:44	where we are estimating the current sample
0:04:46	for example based on the previous ones
0:04:49	in this way
0:04:50	it's possible
0:04:52	to again set the derivatives of the
0:04:54	weighted
0:04:54	error uh with respect
0:04:56	to
0:04:57	estimator
0:04:57	coefficients to zero and
0:04:59	it leads to normal
0:05:01	uh equations
0:05:02	and find
0:05:03	the weights
0:05:04	of the predictor
0:05:05	and uh
0:05:06	it is
0:05:07	maybe the history comes from nineteen
0:05:09	seventy five and after that
0:05:11	again activated in nineteen ninety three
0:05:13	that the weighted linear prediction
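
[Editorial sketch] The weighted linear prediction idea described above can be written down compactly: minimise the weighted squared prediction error sum_n w[n] * (x[n] - sum_k a[k] * x[n-k])^2, with w[n] taken as the short-time energy of the M samples before sample n, and solve the resulting normal equations. The code below is a minimal sketch under assumptions, not the authors' implementation: the function names, the default order `p`, the window length `M`, the small `eps` floor and the autocorrelation-style zero padding are all choices made for illustration.

```python
import numpy as np

def ste_weights(x, M):
    """w[n] = sum of x[n-i]**2 for i = 1..M (samples before the frame count as zero)."""
    e = np.concatenate([np.zeros(M), np.asarray(x, dtype=float) ** 2])
    c = np.concatenate([[0.0], np.cumsum(e)])
    n = np.arange(len(x))
    return c[n + M] - c[n]

def wlp(x, p=20, M=20, eps=1e-9):
    """Weighted linear prediction coefficients a[1..p] for one frame x."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    # Zero-pad p samples on both sides (autocorrelation-style error range).
    xp = np.concatenate([np.zeros(p), x, np.zeros(p)])
    # Row k of D holds the delayed signal x[n-k] for n = 0 .. N+p-1.
    D = np.stack([xp[p - k : p - k + N + p] for k in range(p + 1)])
    w = ste_weights(xp[p:], M) + eps
    # Normal equations: R a = r, with weighted correlations.
    R = (D[1:] * w) @ D[1:].T
    r = (D[1:] * w) @ D[0]
    return np.linalg.solve(R, r)

def lp_spectrum(a, nfft=512):
    """All-pole power spectrum 1 / |A(e^jw)|^2 with A(z) = 1 - sum_k a[k] z^-k."""
    A = np.fft.rfft(np.concatenate([[1.0], -np.asarray(a)]), nfft)
    return 1.0 / np.abs(A) ** 2
```

Setting all weights to one reduces this to ordinary LP with the autocorrelation method. With non-constant weights nothing in this sketch keeps the roots of A(z) inside the unit circle, which is the stability issue raised later in the talk.
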
0:05:16	but
0:05:17	let's say
0:05:18	why
0:05:19	we are choosing the ste
0:05:21	short time energy
0:05:22	for weighting function of the wlp
0:05:26	it can be true that
0:05:27	the regions of
0:05:28	speech that they have high energy
0:05:31	they are less contaminated with additive noise
0:05:34	and uh
0:05:35	it is a
0:05:36	something some uh some sort of
0:05:38	five
0:05:39	but it is known we can have
0:05:42	better estimation of the spectrum in the regions of
0:05:44	speech that they are
0:05:45	less
0:05:46	corrupted by noise
0:05:47	and these are regions of
0:05:48	speech that
0:05:49	have
0:05:49	higher
0:05:50	short time
0:05:51	energy
0:05:53	it corresponds also
0:05:55	to the region of the i mean
0:05:57	when you're talking the regions of a speech that they are
0:05:59	higher
0:06:00	short time energy
0:06:02	it also corresponds to the regions
0:06:04	that
0:06:05	uh our
0:06:06	glottis
0:06:07	is closed
0:06:08	and the
0:06:09	yeah
0:06:09	subglottal system is disconnected
0:06:11	from the speech production
0:06:13	system
0:06:14	and the
0:06:14	in this case we have some standing wave inside our vocal tract
0:06:19	where
0:06:19	if we want to compute
0:06:21	formants of
0:06:22	speech signal
0:06:23	we can have more prominent
0:06:25	uh formant
0:06:26	estimation
0:06:27	of that
0:06:28	speech signal
0:06:31	well
0:06:32	if
0:06:32	now what is the problem with wlp
0:06:35	in L P the normal equations are somehow guaranteed to lead
0:06:38	to
0:06:39	stable filter when we are
0:06:41	predicting the coefficients of the predictor
0:06:44	now the problem with the wlp is that it is not
0:06:46	guaranteed
0:06:47	to lead to stable filter
0:06:49	and this is a problem in
0:06:50	speech synthesis
0:06:50	for example
0:06:52	oh how we can
0:06:53	what we can do
0:06:55	is that uh
0:06:57	instead of using
0:06:58	some sort of
0:06:59	weighting function
0:07:00	we can decompose into partial weights
0:07:02	and apply it
0:07:04	in
0:07:04	this way
0:07:05	to the estimator
0:07:06	after uh
0:07:08	yeah
0:07:09	current sample
0:07:10	and
0:07:10	in this way
0:07:11	we can come
0:07:12	to such equations
0:07:14	that they are derived
0:07:15	in the paper of magi
0:07:17	and uh
0:07:19	uh
0:07:20	they describe
0:07:21	the behaviour of the
0:07:23	a total weight
0:07:24	i mean these base
0:07:26	in the way
0:07:27	that the
0:07:28	final estimator coefficients should be
0:07:31	it should be in such a way that lead to the
0:07:34	a stable filter
0:07:36	well
0:07:37	i'm not
0:07:38	still
0:07:38	understanding completely what's happening here but in this paper
0:07:42	because we describe describe
0:07:44	but for more details
0:07:46	please
0:07:46	you can refer to that
0:07:48	paper
0:07:50	well here
0:07:51	i'm showing a
0:07:52	frame and
0:07:53	the spectrum estimation of it
0:07:56	voiced
0:07:56	frame
0:07:57	from nist two thousand
0:07:59	two uh sorry
0:08:01	and the
0:08:02	uh somehow
0:08:04	the same frame
0:08:05	that we contaminated with factory noise
0:08:07	with zero db snr
0:08:09	it is
0:08:11	let me think obvious that
0:08:12	uh
0:08:13	uh
0:08:14	when we are doing the the
0:08:16	uh spectrum estimation of the noisy signal
0:08:18	there are
0:08:19	some problems
0:08:20	that
0:08:21	it
0:08:21	is mainly caused
0:08:22	by
0:08:23	the the the
0:08:25	the noise signal and
0:08:26	how it affects
0:08:27	depends on the snr level it depends on the noise this is just a
0:08:31	sample
0:08:32	and to give you just more intuition what is
0:08:36	zero db factory noise i have here
0:08:38	yeah
0:08:39	speech file just the
0:08:40	P stuff
0:08:41	speech files that
0:08:42	we do all this
0:08:43	frame
0:08:43	from those people
0:08:44	speech file
0:08:49	a little
0:08:51	it'll the other way
0:08:54	we go real but i don't know what or something
0:08:58	yeah it was a clean sample from nist two thousand
0:09:01	two
0:09:01	test set
0:09:03	yeah
0:09:04	yeah
0:09:05	yeah
0:09:05	yeah
0:09:06	the other way
0:09:07	the remote
0:09:09	really
0:09:10	or
0:09:11	yeah
0:09:12	and
0:09:13	same piece
0:09:13	that we contaminated with zero db
0:09:16	additive noise
0:09:17	well factory
0:09:19	well
0:09:20	now it shows that
0:09:21	what are what is really
0:09:23	meant by zero db
0:09:25	snr
0:09:26	yeah
0:09:27	yeah
0:09:28	yeah
0:09:28	yeah
0:09:30	coming to some results
0:09:31	ah
0:09:32	yeah
0:09:33	let me think uh opted for
0:09:35	spectrum estimation method
0:09:37	that we are thinking about
0:09:38	and nist two thousand two
0:09:40	corpus we had known or has some other type of
0:09:43	speaker detection
0:09:44	and using factory noise then
0:09:46	the only be
0:09:48	snr
0:09:49	here we can see that
0:09:50	the methods are mainly grouped into
0:09:53	fft method
0:09:54	after that
0:09:54	the L P
0:09:55	and let me say the weighted
0:09:57	L P group
0:09:59	wlp itself
0:09:59	and
0:10:00	swlp
0:10:02	yeah
0:10:03	i i should mention that nist two thousand two
0:10:05	it's a
0:10:07	uh the database collected in uh
0:10:10	um
0:10:11	uh
0:10:12	that
0:10:13	mobile handsets mainly
0:10:15	and it includes
0:10:16	inside
0:10:17	come with uh convolutional noise and some additive noise
0:10:20	although we are i think i did too much white
0:10:23	ourselves
0:10:25	yeah
0:10:26	we can
0:10:27	see
0:10:27	that that is really some difference between the performance of
0:10:31	these feature
0:10:32	in additive noise environment
0:10:36	we also tried
0:10:37	uh some
0:10:38	just
0:10:38	let me say one
0:10:39	very famous
0:10:40	a speech enhancement method
0:10:42	and uh
0:10:43	uh as
0:10:44	just some added block
0:10:46	in our feature extraction
0:10:47	to see what
0:10:48	really uh one simple
0:10:50	speech enhancement
0:10:51	method
0:10:52	does for
0:10:53	a speaker recognition system in additive noise environment
0:10:56	and
0:10:57	looking at the results
0:10:58	it shows that yes there is
0:11:00	uh some
0:11:01	good improvement
0:11:03	based on
0:11:04	having a speech
0:11:06	enhancement like spectral
0:11:08	subtraction
0:11:09	algorithm
0:11:10	but
0:11:11	uh these results
0:11:12	although they are too much different but
0:11:14	i should say that
0:11:15	uh our
0:11:17	uh
0:11:18	noise
0:11:19	it's
0:11:19	stationary more or less
0:11:20	and uh and uh in real world it is
0:11:23	not really the case
0:11:26	coming
0:11:27	some
0:11:27	more recent data that
0:11:29	we were here
0:11:30	see
0:11:30	that
0:11:31	if these results from nist two thousand two generalise to nist two thousand
0:11:35	eight and maybe nist two thousand
0:11:37	ten because
0:11:38	we were one of the sites in the i4u submission to nist
0:11:41	two thousand
0:11:42	ten sre and this was
0:11:44	our
0:11:44	based system i mean the contribution of our
0:11:47	uh university of eastern finland was
0:11:49	trying some
0:11:50	new features
0:11:52	and it's
0:11:52	for speaker recognition
0:11:54	looking at the results
0:11:56	let me see
0:11:57	just
0:11:58	somehow
0:11:59	how them
0:12:00	group
0:12:02	the system here is
0:12:03	'cause that's where we are with
0:12:05	an A P
0:12:06	and the condition is
0:12:07	eight content second if you ask me why it contents that can be selected for
0:12:11	evaluation 'cause i was working on a forecast for
0:12:14	the speaker recognition and this was something
0:12:17	well let me say
0:12:18	somehow it has some metric nice
0:12:20	how to and i selected here
0:12:22	for the presentation but
0:12:23	if uh
0:12:24	we
0:12:24	look at the other
0:12:26	core test
0:12:26	also
0:12:27	they have this
0:12:28	same
0:12:29	uh interpretation
0:12:32	looking at the results of any weed out any P
0:12:35	it says that uh
0:12:36	uh
0:12:37	swlp
0:12:38	based results
0:12:39	they are improving
0:12:41	the det curve
0:12:42	in uh all that
0:12:44	area if
0:12:44	i
0:12:45	i carried the results correct
0:12:47	uh thing
0:12:48	i mean min dcf and equal error rate
0:12:51	swlp is improving compared to
0:12:54	yeah
0:12:55	mfcc here the red curves are for
0:12:57	uh male the blue ones are for females and the green one uh is for
0:13:03	let me say
0:13:04	all trials male and female
0:13:07	coming to the results
0:13:09	with any any
0:13:10	the effect of using swlp
0:13:13	is somehow rotating the det curve in some sense because
0:13:17	min dcf
0:13:17	is getting a bit better but
0:13:19	equal error rate
0:13:20	get a bit worse
0:13:22	but
0:13:22	if
0:13:23	uh
0:13:25	you ask
0:13:25	why
0:13:26	it is happening
0:13:27	we have i have no idea right now
0:13:29	we just applied
0:13:30	live in this
0:13:31	swlp and
0:13:33	uh we try
0:13:34	time
0:13:34	effect
0:13:35	but
0:13:36	coming to the interpretation that
0:13:37	why it happens needs more study on it
0:13:41	well
0:13:43	i think
0:13:44	yes
0:13:45	this was the point that i want to
0:13:46	oh
0:13:47	thank you
0:13:57	okay questions we have the whole question could've
0:14:14	just click less know that yes signal to noise ratio you use on the inside
0:14:19	T
0:14:20	no matter
0:14:21	yeah and you also had yeah we'll deal
0:14:25	in the you mentioned that and that you know performance was supported in the two hundred zero D B
0:14:32	yes
0:14:33	um
0:14:34	my question is how did you miss european signal you know to noise ratio because one nine or you know
0:14:40	what do you
0:14:42	it sounded as yeah maybe it's me maybe
0:14:46	he
0:14:46	other people may not agree with the
0:14:48	i thought i think that's the most signals so therefore maybe that's not zero D B maybe i did that
0:14:54	idea
0:14:55	women tend either minus ten
0:14:57	higher
0:14:58	i don't noise in it
0:14:59	and then i mean uh yeah yeah uh i thought
0:15:02	the editorial you display
0:15:05	that you called not zero D B
0:15:07	sounded is in
0:15:08	the signal is only the stronger than uh you know zero D B situation
0:15:12	well because i was suspected that somebody will ask how i'm like that with exactly the matlab code that you
0:15:17	have to get yeah
0:15:18	i uh i can interpret here that we are measuring the energy of every frame of the
0:15:23	speech signal and averaging them
0:15:25	over the
0:15:26	signal and over the noise and uh putting the desired
0:15:30	snr
0:15:32	snr here
0:15:33	yeah
0:15:34	to to get to have the gain
0:15:35	and then
0:15:36	we just add the signal together with
0:15:38	the noise
0:15:39	and we get a signal that we are thinking has that
0:15:42	average snr
0:15:45	so
0:15:47	you are measuring signal to noise
0:15:49	yeah
0:15:49	so by using that intense
0:15:52	the
0:15:53	uh rather than uh no i'm pretty you know the
0:15:57	yeah yeah framing the signal and the measuring the energy of the
0:16:00	uh
0:16:00	frames
0:16:01	and uh averaging them over the
0:16:03	signal
0:16:03	and uh
0:16:04	okay uh
0:16:05	finding the
0:16:06	relative gain between the noise and signal
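
[Editorial sketch] The noise-mixing procedure described in this answer (frame both signals, average the frame energies, derive a gain for the noise from the target SNR, then add) might be sketched as below. The speaker's MATLAB code was not shown, so this is only one plausible reading: the function names, the 240-sample frame length and the choice to average over all frames, pauses included, are assumptions.

```python
import numpy as np

def mean_frame_energy(x, frame_len=240):
    """Average energy of non-overlapping frames (frame_len is an assumption)."""
    x = np.asarray(x, dtype=float)
    n_frames = len(x) // frame_len
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.mean(np.sum(frames ** 2, axis=1))

def add_noise_at_snr(speech, noise, snr_db, frame_len=240):
    """Scale `noise` so the average frame-energy ratio equals snr_db, then add it.

    Assumes `noise` is at least as long as `speech`.
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(speech)]
    e_speech = mean_frame_energy(speech, frame_len)
    e_noise = mean_frame_energy(noise, frame_len)
    # Relative gain between noise and signal for the requested average SNR.
    gain = np.sqrt(e_speech / (e_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```

Because the SNR in this sketch is an average over the whole file, the local SNR in loud speech regions can be noticeably higher than the nominal value, and lower in quiet regions.
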
0:16:24	i don't see any difference in these
0:16:26	ah
0:16:29	what you cant difference you expect to see
0:16:31	well that's noisy i expected the spectrum ooh
0:16:34	these are
0:16:35	flat and then filled in
0:16:37	noise
0:16:38	i mean this is a
0:16:39	this looks
0:16:39	because
0:16:41	yeah this is depends on the noise
0:16:43	because this is fact
0:16:44	just the the noise that we use here
0:16:47	it just factory noise
0:16:48	i
0:16:48	these
0:16:48	right
0:16:49	just
0:16:50	uh i had these
0:16:50	type of behaviour we just selected one right
0:16:53	the effect of noise is not the same for all frames maybe you're right because
0:16:57	the I S P X
0:16:58	right
0:16:58	but i think that by increasing the noise on the noise level of the spectrum
0:17:02	uh it's flat and more flat and we are losing the information
0:17:06	in the spectrum but just some typical example to show
0:17:10	how it works
0:17:28	the other questions
0:17:30	two questions
0:17:31	we have it or not
0:17:32	but may get one more interpretation that
0:17:35	we use this swlp in conjunction with mfcc and other features
0:17:40	and uh in the i4u submission
0:17:42	is that we
0:17:43	right somehow evaluated our system
0:17:45	or just yeah he uh
0:17:47	feature
0:17:48	and then
0:17:48	i mean
0:17:49	score four
0:17:50	so
0:17:50	subsystem
0:17:51	they use the other side
0:17:53	sensing i for you and taking
0:17:55	uh uh let me say
0:17:57	uh
0:17:58	using uh me
0:17:59	having
0:18:00	this type of
0:18:01	them that they are
0:18:02	uh ultra wide
0:18:04	beat
0:18:05	S A P
0:18:06	a different type of
0:18:07	score
0:18:08	speaker
0:18:10	or just
0:18:11	one of the assumptions
0:18:12	um
0:18:13	for for your model is that you
0:18:15	more energy
0:18:17	um observations in the signal
0:18:19	so in the most reliable right
0:18:21	that's right we have a good because that has the energy of the noise increasing
0:18:25	they could be a really just
0:18:27	uh
0:18:28	uh coloured by the noise
0:18:30	okay i just
0:18:31	the the other side of the uh the the body is also
0:18:35	uh if you don't um
0:18:37	uh
0:18:37	the situation where you getting distortions because
0:18:40	uh
0:18:40	or are driving the channel for example
0:18:43	um and it may be the case where the signal is actually one time
0:18:48	and then you
0:18:49	the
0:18:49	could be
0:18:50	um but maybe another
0:18:53	indicated
0:18:53	silver jews
0:18:54	the work
0:18:54	syllable are energy
0:18:56	um observations
0:18:58	well in this case you're right uh
0:19:00	we don't know exactly what will happen if signal is to be
0:19:03	by by channel by recording device or
0:19:07	what about this
0:19:08	formance us are just somehow done with the uh
0:19:12	sounds and that
0:19:13	uh
0:19:13	all the signal exactly but if you ask me what will happen if all the signals here
0:19:18	i will say that
0:19:19	uh i think
0:19:20	after the spectral L P spectrum that all
0:19:23	fig
0:19:23	the same way that we hope U S we'll get the fate
0:19:28	really thank you very much

SESSION 2: Features for Speaker recognition

Added: 14. 7. 2010 11:08, Author: Rahim Saeidi (University of Eastern Finland), Jouni Pohjalainen (Aalto University), Tomi Kinnunen (University of Eastern Finland), Paavo Alku (Aalto University), Length: 0:19:39