Speech transcript - ANALYSIS-SYNTHESIS BASED SPEECH ENHANCEMENT WITH IMPROVED SPECTRUM ENVELOPE ESTIMATION BY TRACKING SPEECH DYNAMICS

0:00:13	Hello everyone, I am
0:00:15	Ruofei Chen from City University of Hong Kong.
0:00:18	I would like to present a joint work with my supervisor, Cheung-Fat Chan.
0:00:23	The topic is analysis-synthesis
0:00:26	based speech enhancement
0:00:28	with improved
0:00:29	spectral envelope estimation by tracking speech dynamics.
0:00:34	So,
0:00:35	first let's
0:00:36	have a look at the outline
0:00:37	of my presentation.
0:00:39	First,
0:00:40	at the very beginning, I will introduce some
0:00:43	background,
0:00:44	such as a spectral view of
0:00:47	the effect of noise corruption
0:00:50	and conventional filtering.
0:00:52	Then I will introduce a model-based speech enhancement
0:00:56	which was previously
0:00:57	proposed by us.
0:00:59	I will then introduce a
0:01:02	speech dynamics tracking scheme that is used
0:01:06	in conjunction with the model-based
0:01:08	speech enhancement,
0:01:10	followed by the performance evaluation
0:01:13	and the conclusion.
0:01:16	So,
0:01:17	let's first have a look at
0:01:19	the effect of noise corruption from a spectral
0:01:22	perspective.
0:01:24	Here we
0:01:24	use white noise, for example.
0:01:27	We can observe that the
0:01:29	harmonic structure of speech is severely damaged,
0:01:33	and
0:01:35	so is the spectral envelope,
0:01:38	which results in a lot of spectral distortion.
0:01:43	Here
0:01:45	are some conventional statistical-model-based
0:01:48	speech enhancement methods.
0:01:53	The upper figure shows the classical log-spectral amplitude
0:01:58	estimator,
0:02:00	and from the short-time spectrum we can see that
0:02:03	the lower portion of the
0:02:05	spectrum has been
0:02:06	restored,
0:02:08	but the overall noise level
0:02:11	cannot be
0:02:13	well suppressed.
0:02:14	So as a result there will be
0:02:17	many musical tones and residual noise in the
0:02:23	processed speech.
0:02:25	The
0:02:26	lower figure shows the optimally modified
0:02:30	log-spectral amplitude estimator.
0:02:33	This method generally
0:02:35	has a very good
0:02:37	capability of
0:02:38	noise suppression.
0:02:40	However,
0:02:42	the formants and the harmonic
0:02:44	structures
0:02:46	are further distorted.
0:02:48	So the upper one often gives
0:02:53	a better PESQ score,
0:02:56	while the
0:02:58	lower one
0:02:59	gives a
0:03:00	better segmental SNR score. So there is always a tradeoff
0:03:05	between the noise suppression
0:03:06	and the harmonic
0:03:08	distortion, or
0:03:09	we can say the naturalness of speech.
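
The segmental SNR score mentioned here can be sketched as follows. This is a minimal illustration, not the evaluation code used in the talk: the function name, the frame length, and the clamping limits of -10 dB and 35 dB are assumptions (a common convention, not something the speaker states).

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=256, lo=-10.0, hi=35.0):
    """Average per-frame SNR in dB between a clean and an enhanced signal.

    The frame length and the [lo, hi] clamping range are illustrative
    assumptions; they are not taken from the talk.
    """
    n_frames = len(clean) // frame_len
    snrs = []
    for i in range(n_frames):
        s = clean[i * frame_len:(i + 1) * frame_len]
        e = s - enhanced[i * frame_len:(i + 1) * frame_len]
        # Small constants avoid division by zero and log of zero.
        signal_energy = np.sum(s ** 2) + 1e-12
        error_energy = np.sum(e ** 2) + 1e-12
        snr_db = 10.0 * np.log10(signal_energy / error_energy)
        snrs.append(np.clip(snr_db, lo, hi))
    return float(np.mean(snrs))
```

Because the score averages frame-level SNRs, it rewards low residual noise in every frame, which is consistent with the speaker's remark that the stronger noise suppressor scores better on this measure even when harmonics are distorted.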
0:03:14	It
0:03:15	can also be observed
0:03:17	from the
0:03:18	spectral envelope that
0:03:20	noise will first
0:03:22	severely distort the
0:03:24	spectral envelope,
0:03:26	and the conventional statistical method
0:03:29	will
0:03:30	partially restore the
0:03:33	spectral envelope but partially further distort the spectrum.
0:03:37	So this would potentially account for some
0:03:40	common features
0:03:41	such as musical tones and the low intelligibility problems
0:03:45	in
0:03:46	speech enhancement.
0:03:49	So in our previous work we have proposed an
0:03:53	analysis-synthesis approach
0:03:56	based on the harmonic noise model.
0:03:59	The
0:03:59	basic idea is to
0:04:01	extract acoustic
0:04:03	cues
0:04:03	from noisy speech,
0:04:06	and then we reconstruct the target speech
0:04:10	by resynthesis
0:04:12	using this speech-only information.
0:04:14	So you can see from here:
0:04:16	you have the pitch information, so you can track the location of the harmonics;
0:04:22	you have the spectral gain,
0:04:24	so you can have the overall average spectral
0:04:27	energy level;
0:04:28	and you have the spectral envelope,
0:04:30	so you can
0:04:31	track
0:04:33	the magnitude spectrum.
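
The three cues described above (pitch, gain, spectral envelope) can be combined into a voiced frame roughly as in the sketch below. It is only an illustration under stated assumptions: the function name, the zero-phase cosines, the interpretation of the gain as the frame RMS, and the envelope being passed as a callable are all hypothetical choices, not the authors' synthesizer.

```python
import numpy as np

def synthesize_voiced_frame(f0, gain, envelope, fs=8000, frame_len=160):
    """Sum of harmonics of f0, shaped by an envelope and scaled to a target RMS.

    f0       : pitch in Hz (locates the harmonics)
    gain     : target RMS of the frame (assumed meaning of "spectral gain")
    envelope : callable mapping frequency in Hz to a relative amplitude
    """
    t = np.arange(frame_len) / fs
    n_harm = int((fs / 2) // f0)
    if n_harm * f0 >= fs / 2:
        n_harm -= 1  # keep every harmonic strictly below the Nyquist frequency
    frame = np.zeros(frame_len)
    for k in range(1, n_harm + 1):
        # Zero phase is assumed for simplicity; the talk uses magnitudes only.
        frame += envelope(k * f0) * np.cos(2 * np.pi * k * f0 * t)
    rms = np.sqrt(np.mean(frame ** 2))
    return gain * frame / rms if rms > 0 else frame
```

Since only harmonics are generated, nothing between the harmonics is ever synthesized, which is the sense in which background noise is removed by construction.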
0:04:36	So why
0:04:37	do we choose this approach
0:04:40	to do speech enhancement?
0:04:42	First,
0:04:43	this model is
0:04:45	capable of generating
0:04:47	clean harmonics,
0:04:49	and only speech-related information is synthesized,
0:04:52	so background noise is automatically removed.
0:04:56	This
0:04:58	model also
0:05:00	retrieves
0:05:02	some damaged harmonic structure
0:05:05	and has a smooth
0:05:07	spectral envelope, so there are no isolated spectral peaks
0:05:10	and hence no
0:05:12	musical tone problem.
0:05:15	Also, this model allows
0:05:17	independent adjustment of
0:05:20	different model parameters,
0:05:22	so it enables us to
0:05:25	focus on enhancing the spectral envelope
0:05:29	using this framework.
0:05:31	So by using this method
0:05:33	we
0:05:35	no longer suffer from the noise suppression
0:05:38	and harmonic distortion tradeoff.
0:05:43	So from some
0:05:44	of our previous work,
0:05:48	after
0:05:49	applying some
0:05:50	pre-cleaning procedures
0:05:51	using conventional methods,
0:05:54	we can apply
0:05:56	frequency-domain pitch searching
0:05:58	and
0:05:59	spectral gain estimation.
0:06:03	Some
0:06:04	preliminary results
0:06:06	show that the pitch and the spectral gain estimation
0:06:09	already give very
0:06:10	good performance
0:06:12	when applied to the pre-cleaned spectrum.
0:06:17	However,
0:06:18	the spectral envelope estimation is
0:06:21	somewhat
0:06:23	inadequate.
0:06:24	So
0:06:27	we can see here that some preliminary results
0:06:31	show that the PESQ score for
0:06:34	0 dB white noise
0:06:37	would be around 1.5,
0:06:39	and for some
0:06:42	conventional approach it would be around
0:06:44	1.9.
0:06:46	Our previous
0:06:48	approach,
0:06:49	together
0:06:51	with this
0:06:52	pre-cleaning, can give
0:06:56	about a 0.2
0:06:58	improvement.
0:06:59	However,
0:07:00	if we replace
0:07:02	the envelope with the true clean envelope,
0:07:06	it can achieve
0:07:08	3.17.
0:07:10	So there is a
0:07:11	huge gap here,
0:07:12	and we would expect some
0:07:16	improvement in PESQ score if we can
0:07:19	further improve the
0:07:20	spectral envelope.
0:07:24	So the problem can be stated
0:07:26	as follows.
0:07:27	For each frame, or several
0:07:29	frames, of noisy observations,
0:07:32	we want to find a mapping
0:07:34	between the noisy and clean spectral envelopes,
0:07:38	and for consecutive frames
0:07:40	we want to find the
0:07:42	temporal trajectories of the clean spectral envelope.
0:07:46	In other words, we want to estimate the clean spectral envelope by
0:07:51	looking at the long-term
0:07:53	speech evolution.
0:07:57	So by assuming, over a
0:08:00	certain period of time,
0:08:02	a linear relationship between the consecutive clean spectral envelopes
0:08:07	and
0:08:08	a linear relationship between the noisy and clean
0:08:11	spectral envelopes,
0:08:12	we can use a linear dynamic
0:08:14	system model to model
0:08:16	this
0:08:18	state tracking.
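
The two linear assumptions stated here correspond to a standard linear state-space model. The notation below is an assumption introduced for illustration (the talk does not give symbols): $x_t$ is the clean envelope feature vector of frame $t$, $y_t$ the noisy one, and the Gaussian noise terms are the usual Kalman filtering assumption.

```latex
\begin{aligned}
x_{t+1} &= A\,x_t + w_t, & w_t &\sim \mathcal{N}(0, Q) && \text{(clean envelope evolution)}\\
y_t     &= C\,x_t + v_t, & v_t &\sim \mathcal{N}(0, R) && \text{(clean-to-noisy mapping)}
\end{aligned}
```

Under this reading, the "system parameters" referred to in the rest of the talk would be the set $(A, C, Q, R)$, held fixed over one block of frames.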
0:08:20	So
0:08:21	the feature
0:08:22	we use here is the
0:08:24	line spectrum frequencies of the LPC coefficients.
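
For readers unfamiliar with line spectrum frequencies, the sketch below shows one textbook way to derive them from LPC coefficients by root finding on the symmetric and antisymmetric polynomials. The function name and the use of `numpy.roots` are illustrative choices; the talk does not say how the conversion was implemented.

```python
import numpy as np

def lpc_to_lsf(a):
    """Convert LPC coefficients [1, a1, ..., ap] to line spectrum frequencies.

    Returns the LSFs in radians, sorted, in the open interval (0, pi).
    """
    a = np.asarray(a, dtype=float)
    a_ext = np.append(a, 0.0)
    # P(z) = A(z) + z^-(p+1) A(1/z) and Q(z) = A(z) - z^-(p+1) A(1/z)
    p_poly = a_ext + a_ext[::-1]
    q_poly = a_ext - a_ext[::-1]
    roots = np.concatenate([np.roots(p_poly), np.roots(q_poly)])
    angles = np.angle(roots)
    # Keep one angle per conjugate pair and drop the trivial roots at z = 1, -1.
    eps = 1e-6
    return np.sort(angles[(angles > eps) & (angles < np.pi - eps)])
```

For a stable LPC filter the roots of the two polynomials interleave on the unit circle, which is what makes these frequencies well behaved under smoothing and tracking.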
0:08:31	And
0:08:33	for
0:08:33	each period,
0:08:35	each series
0:08:37	of
0:08:37	observations,
0:08:39	we have
0:08:41	a series of LPC coefficients.
0:08:45	So given the Kalman system
0:08:47	parameters,
0:08:49	we can run the
0:08:50	Kalman filter and
0:08:53	obtain the clean LPC coefficients.
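
Running a Kalman filter over a block of noisy feature vectors, given fixed system parameters, can be sketched as below. This is the standard predict/update recursion; the parameter names `A`, `C`, `Q`, `R`, `x0` and `P0` are assumed notation, not symbols from the talk.

```python
import numpy as np

def kalman_filter(obs, A, C, Q, R, x0, P0):
    """Standard Kalman filter over a sequence of observation vectors.

    obs    : array of shape (T, m) of noisy feature vectors
    A, Q   : state transition matrix and process noise covariance
    C, R   : observation matrix and observation noise covariance
    x0, P0 : initial state estimate and its covariance
    Returns the filtered state estimates, shape (T, n).
    """
    x, P = x0.copy(), P0.copy()
    estimates = []
    for y in obs:
        # Predict the next state from the dynamics.
        x = A @ x
        P = A @ P @ A.T + Q
        # Correct the prediction with the current observation.
        S = C @ P @ C.T + R
        K = P @ C.T @ np.linalg.inv(S)
        x = x + K @ (y - C @ x)
        P = (np.eye(len(x)) - K @ C) @ P
        estimates.append(x)
    return np.array(estimates)
```

A small process noise `Q` relative to `R` makes the filter trust the dynamics more than each individual noisy frame, which yields the smooth trajectories the speaker describes later.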
0:08:57	So the next problem would be how to obtain
0:09:02	the Kalman system parameters
0:09:04	for
0:09:05	each
0:09:06	period of
0:09:07	speech.
0:09:09	So the idea is that for each block of noisy observations,
0:09:15	we use the features and the codebook
0:09:19	that
0:09:20	models the
0:09:21	clustered
0:09:22	parallel LPC coefficients,
0:09:24	and
0:09:27	through some optimization
0:09:30	criteria we can obtain the corresponding Kalman system parameters.
0:09:36	So in the offline training stage we have both
0:09:40	noisy and clean
0:09:43	LPC coefficients,
0:09:45	and
0:09:47	we use
0:09:48	split VQ
0:09:50	to
0:09:52	cluster them into
0:09:54	codebook entries,
0:09:56	in the sense that blocks with similar features
0:09:59	are grouped into the same clusters.
0:10:02	By saying "similar" we need to define a distortion measure here.
0:10:07	It could be some simple
0:10:10	measure such as Euclidean, or you can
0:10:13	use
0:10:14	some complex measure such as
0:10:20	a modified IS measure.
0:10:22	And
0:10:23	you also have to define a
0:10:26	feature for each
0:10:27	block of observations.
0:10:29	You can use the average spectrum,
0:10:31	or you can use the whole series of
0:10:35	vectors,
0:10:36	so it would actually be a matrix quantization in this case.
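
The two kinds of distortion measure mentioned above can be illustrated as follows. Note the assumptions: the first is the squared Euclidean distance between feature vectors, and the second is the plain Itakura-Saito distortion between power spectra, shown here because the speaker's "modified IS measure" is not specified in the talk.

```python
import numpy as np

def euclidean_distortion(x, y):
    """Squared Euclidean distance between two feature vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum((x - y) ** 2))

def itakura_saito_distortion(p, p_hat):
    """Plain Itakura-Saito distortion between two positive power spectra.

    This is the unmodified textbook form, not the modified variant
    mentioned in the talk.
    """
    ratio = np.asarray(p, dtype=float) / np.asarray(p_hat, dtype=float)
    return float(np.mean(ratio - np.log(ratio) - 1.0))
```

Unlike the Euclidean distance, the Itakura-Saito distortion is asymmetric in its arguments, so the order in which the observed block and the codebook entry are passed matters.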
0:10:42	And
0:10:44	for each cluster
0:10:46	we have both noisy and clean
0:10:49	observations, noisy and clean
0:10:52	features,
0:10:53	so we can minimize the total negative
0:10:56	log-likelihood function
0:10:58	for each cluster,
0:10:59	and we will
0:11:01	obtain the desired
0:11:03	Kalman system parameters in this case.
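
When both clean and noisy feature tracks are available, minimizing the negative log-likelihood of a linear-Gaussian model reduces to least-squares fits, which can be sketched as below. This is an assumed simplification for a single track: the function name and the symbols `A`, `C`, `Q`, `R` are hypothetical, and the authors' actual per-cluster optimization may differ.

```python
import numpy as np

def fit_linear_dynamics(clean, noisy):
    """Least-squares estimate of linear dynamic system parameters.

    clean, noisy : arrays of shape (T, d) holding parallel feature tracks.
    Returns (A, C, Q, R) for x[t+1] = A x[t] + w and y[t] = C x[t] + v.
    """
    x_prev, x_next = clean[:-1], clean[1:]
    # Solve x_prev @ A.T ~= x_next and clean @ C.T ~= noisy.
    A = np.linalg.lstsq(x_prev, x_next, rcond=None)[0].T
    C = np.linalg.lstsq(clean, noisy, rcond=None)[0].T
    # Residual covariances serve as the noise covariance estimates.
    w = x_next - x_prev @ A.T
    v = noisy - clean @ C.T
    Q = w.T @ w / len(w)
    R = v.T @ v / len(v)
    return A, C, Q, R
```

For a cluster containing several blocks, the frame pairs of all blocks would be stacked before solving, so that one parameter set is shared by the whole cluster.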
0:11:07	So in the online adaptation stage
0:11:11	we also have noisy observations
0:11:14	for a block,
0:11:16	and we use the
0:11:18	same
0:11:20	distortion measure to find the codebook entry,
0:11:23	which has the
0:11:25	corresponding Kalman system parameters.
0:11:28	We run the Kalman filter
0:11:31	with this set of parameters, and we will get the desired envelope.
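
The online lookup step described here amounts to a nearest-neighbour search in the codebook. The sketch below assumes a squared Euclidean distortion and a codebook stored as one centroid per row; the function and argument names are hypothetical.

```python
import numpy as np

def select_parameters(block_feature, codebook, parameter_sets):
    """Pick the parameter set of the codebook entry closest to a noisy block.

    block_feature  : feature vector of the noisy block, shape (d,)
    codebook       : array of shape (K, d), one centroid per row
    parameter_sets : sequence of K parameter sets, aligned with the codebook
    """
    distances = np.sum((codebook - block_feature) ** 2, axis=1)
    index = int(np.argmin(distances))
    return index, parameter_sets[index]
```

The returned parameter set would then be handed to the Kalman filter for that block, so the only per-block search cost at run time is one distance computation per codebook entry.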
0:11:37	So you can
0:11:38	see from the
0:11:40	spectrogram here that the tracking
0:11:42	actually gives very good
0:11:46	performance.
0:11:48	We can also see from the
0:11:51	3-D view that
0:11:54	the
0:11:55	noisy envelope trajectories are
0:11:58	quite
0:11:59	damaged and flattened,
0:12:01	and
0:12:03	the conventional method will restore
0:12:07	some harmonics
0:12:08	but results in some musical tone problem,
0:12:11	but
0:12:12	the smooth tracking
0:12:14	scheme
0:12:15	used here
0:12:16	gives very smooth
0:12:17	and
0:12:19	accurate trajectories.
0:12:25	It can also be
0:12:28	observed from this figure that the
0:12:31	formants
0:12:32	are
0:12:33	expanded,
0:12:34	the formant bandwidth
0:12:35	expands, as compared to the conventional method.
0:12:38	So the tracking gives a result very close
0:12:41	to the original spectral envelope
0:12:44	trajectories.
0:12:47	So this is the magnitude spectrum,
0:12:50	and the harmonic structures
0:12:52	are also
0:12:54	restored.
0:12:55	And from the final synthesized
0:12:58	speech we can see that
0:13:01	there is no
0:13:02	[unclear]
0:13:04	and no musical tones,
0:13:06	and the
0:13:07	harmonic structures are
0:13:09	retrieved.
0:13:13	Actually, we can achieve around
0:13:17	2.41
0:13:19	PESQ score for
0:13:22	speaker-dependent training.
0:13:27	The noise we use is from 0 to 10 dB,
0:13:32	using white noise, car noise and
0:13:37	babble noise.
0:13:39	Both speaker-dependent and speaker-independent testing is used.
0:13:47	And finally,
0:13:49	I conclude the presentation here.
0:13:51	In this
0:13:52	presentation
0:13:53	we have looked at the effect of noise corruption,
0:13:58	and the conventional speech enhancement
0:14:02	has been discussed.
0:14:03	An analysis-synthesis approach is presented,
0:14:07	and speech dynamics tracking that incorporates
0:14:11	[unclear] and Kalman filtering is proposed.
0:14:14	The improved
0:14:16	spectral envelope estimation is illustrated
0:14:19	objectively in terms
0:14:21	of
0:14:22	spectral distortion and PESQ score.
0:14:25	So
0:14:26	that's all for my
0:14:27	presentation.
0:14:27	Thank you.
0:14:28	Thank you so much.
0:14:33	Yes, the first question.
0:14:42	Do you have audio samples? I am afraid I did not bring them with me, sorry.
0:14:47	yeah yeah yeah i was then some
0:14:50	good
0:14:50	it
0:14:51	Could you comment on CPU cost issues?
0:14:56	Oh, um,
0:14:57	actually, if you
0:14:59	use that for training,
0:15:01	it will be time-consuming.
0:15:04	will you out
0:15:05	but you can can show that all the in your protection
0:15:08	in of the uh
0:15:10	a a a i thought of size
0:15:13	So there's always a tradeoff.
0:15:15	okay fine tune is full
0:15:18	set able
0:15:20	let
0:15:21	that's quite lot
0:15:25	Next question, please.
0:15:27	From your presentation I realise that
0:15:30	the analysis-synthesis according to the clean signal would be the upper bound, right?
0:15:35	Yeah.
0:15:36	The analysis-synthesis results according to the given clean envelope would be the upper bound,
0:15:44	or the optimal case, [unclear]. So my question
0:15:50	is:
0:15:51	you did not say anything about the effect of the
0:15:53	noisy phase information that you are using in your synthesis.
0:15:58	So what would be
0:16:00	the exact effect of the noisy envelope
0:16:03	and the noisy
0:16:05	phase information that you are using in this case?
0:16:09	In this work we just use the
0:16:11	magnitude spectrum.
0:16:12	We have not looked at the phase.
0:16:15	Actually,
0:16:16	in conventional speech enhancement,
0:16:19	the phase has not been
0:16:21	proved by
0:16:23	research
0:16:24	to
0:16:25	affect the intelligibility.
0:16:27	So
0:16:27	maybe in future work we will come to it.
0:16:30	But for your information, there are some papers also
0:16:34	talking about the importance of phase information in
0:16:37	intelligibility.
0:16:39	[unclear]
0:16:43	I mean that the gap between the upper bound and the proposed method of yours can also
0:16:48	be because of
0:16:50	the
0:16:50	noisy phase.
0:16:51	So this tracking scheme
0:16:54	mostly
0:16:55	works well for voiced speech.
0:16:57	For unvoiced speech we
0:16:59	just use some pre-cleaned data.
0:17:02	So this would be some explanation
0:17:05	for
0:17:06	the
0:17:06	gap between the optimal and the
0:17:09	proposed method.
0:17:12	I would be interested to know whether you need
0:17:15	a voice activity detector.
0:17:17	Oh,
0:17:18	actually we have tried to use the VAD
0:17:21	to train voiced and unvoiced
0:17:24	with different data,
0:17:26	but the result is that
0:17:28	we can show that
0:17:30	training
0:17:31	one codebook for all the data
0:17:34	is
0:17:35	better
0:17:36	for the whole tracking.
0:17:37	Your synthesis model is very
0:17:40	adequate for,
0:17:41	it's a sinusoidal model approach, voiced sounds. How do you produce the unvoiced sounds?
0:17:48	So the unvoiced sound basically has
0:17:51	no pitch, and we just
0:17:52	use
0:17:55	a
0:17:57	time-domain approach to
0:17:58	synthesize it:
0:17:59	we have the gain information and the LPC information
0:18:02	and just
0:18:03	combine them in the time domain.
0:18:06	Are there any further questions?
0:18:09	That is not the case.
0:18:10	Thank you once more.

Speech Enhancement

Presenter: Ruofei Chen, Authors: Ruofei Chen, Cheung-Fat Chan, City University of Hong Kong, Hong Kong SAR of China