Speech transcript - AN SVM BASED CLASSIFICATION APPROACH TO SPEECH SEPARATION

0:00:13	Okay, this is the last talk of this session.
0:00:16	Thank you for coming.
0:00:17	My name is Kun Han, and I come from The Ohio State University.
0:00:22	Today I'm going to present my work with my advisor, DeLiang Wang.
0:00:26	It is an SVM-based classification approach to speech separation.
0:00:36	This is the outline of our presentation.
0:00:39	The first part is the introduction to our work.
0:00:43	In part two we'll talk about the feature extraction,
0:00:47	and then I'm going to talk about unit labeling and segmentation.
0:00:52	The last part is the experimental results.
0:00:58	Okay.
0:00:59	In a daily environment, the target speech is often corrupted by various types of interference.
0:01:07	So the question is how we can remove or attenuate the background noise.
0:01:14	This is the speech separation problem.
0:01:16	In this study we focus only on monaural speech separation.
0:01:22	It is very challenging because we cannot use location information;
0:01:28	we can only use the intrinsic properties of the target and the interference.
0:01:36	So I will first introduce a very important concept:
0:01:40	the ideal binary mask,
0:01:42	IBM for short.
0:01:44	It is the main computational goal of CASA.
0:01:47	The IBM can be defined here.
0:01:55	Okay.
0:01:56	Given a mixture, we decompose it into the time and frequency domain; it is a two-dimensional representation.
0:02:04	For each T-F unit, we compare the speech energy and the noise energy.
0:02:11	If the local SNR is larger than a local criterion, LC, the mask is one;
0:02:18	otherwise it is zero.
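The IBM rule just described can be written as a minimal sketch. It assumes the premixed target and noise energies are available per T-F unit as nested lists; the function name `ideal_binary_mask` and the default `lc_db=0.0` are illustrative assumptions, since the talk does not state the LC value used.

```python
import math

def ideal_binary_mask(speech_energy, noise_energy, lc_db=0.0):
    """Return a 0/1 mask with the same shape as the two energy grids.

    speech_energy and noise_energy are lists of rows (channels x frames)
    holding the premixed target and noise energy of each T-F unit.
    lc_db is the local criterion (LC) in dB; 0 dB is an assumed default.
    """
    mask = []
    for s_row, n_row in zip(speech_energy, noise_energy):
        row = []
        for s, n in zip(s_row, n_row):
            if s <= 0.0:
                # No target energy: the unit cannot be target-dominated.
                row.append(0)
            elif n <= 0.0:
                # Target energy but no noise energy: local SNR is unbounded.
                row.append(1)
            else:
                local_snr_db = 10.0 * math.log10(s / n)
                row.append(1 if local_snr_db > lc_db else 0)
        mask.append(row)
    return mask

# Toy 2-channel x 2-frame example with made-up energies.
speech = [[4.0, 0.5], [1.0, 9.0]]
noise = [[1.0, 2.0], [1.0, 3.0]]
print(ideal_binary_mask(speech, noise))
```

Raising `lc_db` makes the mask stricter, because fewer units pass the local SNR test.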
0:02:20	In this way we can convert the speech separation problem to binary mask estimation,
0:02:27	because previous studies have shown that if we use the IBM to resynthesize the mixture,
0:02:34	we can get separated speech with very high speech intelligibility.
0:02:40	So IBM estimation is just about the ones and the zeros;
0:02:45	it is nothing else but binary classification.
0:02:51	This figure illustrates the IBM.
0:02:54	The first figure is the cochleagram of the target,
0:02:59	and the second one is the cochleagram of the noise.
0:03:03	We mix them together;
0:03:05	here is the cochleagram of the mixture.
0:03:09	If we know the target and we know the noise,
0:03:12	and for each unit we compare the energy,
0:03:15	we can get this mask.
0:03:19	It is the ideal binary mask.
0:03:21	The white regions mean the target energy is stronger;
0:03:27	the black regions mean the noise energy is stronger.
0:03:32	So this IBM comes from ideal information: we need to know the target and we need to know the noise.
0:03:39	What we will do is use some features from the mixture to estimate this mask.
0:03:47	This is our goal.
0:03:51	This is the system overview.
0:03:53	Given a mixture,
0:03:54	we use a gammatone filterbank
0:03:57	to decompose the mixture into sixty-four channels.
0:04:01	In each channel, for each T-F unit, we extract the features,
0:04:06	including pitch-based features and amplitude modulation spectrum,
0:04:12	or AMS, features.
0:04:15	Once we get these features, we use a support vector machine to do the classification:
0:04:21	we classify each unit as one or zero,
0:04:25	and then we get a mask.
0:04:27	For this mask, we can use auditory segmentation to further improve it.
0:04:33	Finally, we use this mask to resynthesize the mixture and get the separated speech.
0:04:42	Now for the feature extraction: we have two types of features.
0:04:47	The first one is the pitch-based feature.
0:04:51	For this feature, for each T-F unit we compute the autocorrelation at the pitch lag.
0:04:58	Of course, for the unvoiced frames there is no pitch,
0:05:02	so we simply put a zero there.
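A minimal sketch of an autocorrelation feature at the pitch lag, with zero used when a frame has no pitch. The normalization by the energies of the two segments, the function names, and the use of `None` to mark an unvoiced frame are assumptions for illustration; the talk does not give the exact formula.

```python
def autocorr_at_lag(response, lag):
    """Normalized autocorrelation of one T-F unit's samples at a given lag."""
    n = len(response) - lag
    if lag <= 0 or n <= 0:
        return 0.0
    num = sum(response[i] * response[i + lag] for i in range(n))
    e1 = sum(response[i] ** 2 for i in range(n))
    e2 = sum(response[i + lag] ** 2 for i in range(n))
    denom = (e1 * e2) ** 0.5
    return num / denom if denom > 0.0 else 0.0

def pitch_feature(response, pitch_lag):
    """Autocorrelation at the pitch lag; 0.0 when the frame is unvoiced."""
    if pitch_lag is None:
        # Unvoiced frame: there is no pitch, so the feature is set to zero.
        return 0.0
    return autocorr_at_lag(response, pitch_lag)

# A toy signal with a period of 4 samples.
periodic = [1.0, 0.0, -1.0, 0.0] * 4
print(pitch_feature(periodic, 4), pitch_feature(periodic, None))
```

A unit whose response is periodic at the pitch lag gives a value near one, which is what makes the feature useful for telling target-dominated units apart.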
0:05:05	We also compute delta features to capture the feature variations across time and frequency.
0:05:13	We just use the feature in the current unit minus the feature in the previous unit;
0:05:19	that is the delta feature.
0:05:20	We also compute the envelope autocorrelation and its delta features.
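The delta features described here (current unit minus previous unit, along time and along frequency) could look like the sketch below. The layout `feat[channel][frame]`, the function name, and the choice of 0.0 for units that have no previous neighbour are assumptions.

```python
def delta_features(feat):
    """Compute time and frequency deltas of a feature grid.

    feat[c][m] is a feature value in channel c and frame m.
    Returns (time_delta, freq_delta), each with the same shape as feat.
    Units without a previous neighbour get 0.0 (an assumed boundary rule).
    """
    n_chan = len(feat)
    n_frames = len(feat[0]) if n_chan else 0
    time_delta = [[0.0] * n_frames for _ in range(n_chan)]
    freq_delta = [[0.0] * n_frames for _ in range(n_chan)]
    for c in range(n_chan):
        for m in range(n_frames):
            if m > 0:
                # Variation across time: current frame minus previous frame.
                time_delta[c][m] = feat[c][m] - feat[c][m - 1]
            if c > 0:
                # Variation across frequency: current channel minus previous channel.
                freq_delta[c][m] = feat[c][m] - feat[c - 1][m]
    return time_delta, freq_delta

grid = [[1.0, 3.0], [2.0, 7.0]]
t_delta, f_delta = delta_features(grid)
print(t_delta, f_delta)
```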
0:05:26	So here we have six-dimensional pitch-based features:
0:05:30	the first two are the original features,
0:05:34	then two are the time delta features,
0:05:39	and two are the frequency delta features.
0:05:44	Another type is the AMS feature.
0:05:48	For each T-F unit we extract a fifteen-dimensional AMS feature.
0:05:54	We use the same setup as Kim et al.'s 2009 paper,
0:06:00	and again we add the delta features,
0:06:04	so for the AMS feature we have a forty-five-dimensional feature vector.
0:06:13	Okay.
0:06:14	Now we have the features.
0:06:16	We combine these two types and use the features to train an SVM.
0:06:23	Once we finish the training, we can use this discriminant function to do the classification.
0:06:31	f(x) is the decision value computed from the SVM;
0:06:37	it is a real number.
0:06:40	The standard SVM uses the sign function,
0:06:44	that is, it uses zero as the threshold:
0:06:48	if f(x) is positive, the label is one; if it is negative, the label is zero.
0:06:55	We train an SVM in each channel; we have sixty-four channels,
0:07:00	which means we have sixty-four SVMs.
0:07:03	We use the Gaussian kernel, and the parameters are chosen by five-fold cross-validation.
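As a rough illustration of the decision value of a Gaussian-kernel SVM and the standard zero threshold, here is a small sketch for a single channel. The support vectors, coefficients, bias, and `gamma` below are made-up toy values, not trained parameters from the talk, and all function names are hypothetical.

```python
import math

def gaussian_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def decision_value(x, support_vectors, dual_coefs, bias, gamma):
    """f(x) = sum_i coef_i * K(sv_i, x) + bias, with coef_i = alpha_i * y_i."""
    total = sum(c * gaussian_kernel(sv, x, gamma)
                for sv, c in zip(support_vectors, dual_coefs))
    return total + bias

def standard_label(f_x):
    """Standard SVM labeling: threshold the decision value at zero."""
    return 1 if f_x > 0.0 else 0

# Toy model for one channel (assumed values, for illustration only).
svs = [[0.0, 0.0], [2.0, 2.0]]
coefs = [1.0, -1.0]
f_near_target = decision_value([0.0, 0.0], svs, coefs, bias=0.0, gamma=0.5)
f_near_noise = decision_value([2.0, 2.0], svs, coefs, bias=0.0, gamma=0.5)
print(standard_label(f_near_target), standard_label(f_near_noise))
```

In a sixty-four-channel system one such model would be kept per channel, each with its own support vectors and kernel parameters.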
0:07:13	Okay. When people do classification,
0:07:16	they usually use the classification accuracy to evaluate the performance.
0:07:22	Here we also focus on another measurement:
0:07:25	HIT-FA, the hit rate minus the false-alarm rate.
0:07:28	For the classification results, we have four types of result.
0:07:34	If both the IBM and the estimated IBM are zero, it is a correct rejection.
0:07:38	If the IBM is zero and the estimate is one, it is a false-alarm error.
0:07:43	If both are one, it is a correct hit.
0:07:46	And if the IBM is one and the estimate is zero, it is a miss.
0:07:52	So we can compute the hit rate here and the false-alarm rate,
0:07:58	and we calculate the difference between the hit rate and the false-alarm rate,
0:08:05	because this measurement is correlated with speech intelligibility.
0:08:12	That's why we use this measure.
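The four outcomes and the HIT-FA measure can be computed as in the sketch below, which assumes the ideal and estimated masks are nested lists of 0/1 values with the same shape. The function name and the returned tuple are illustrative choices.

```python
def hit_minus_fa(ideal, estimated):
    """Return (hit_rate, fa_rate, hit_rate - fa_rate) for two binary masks."""
    hits = misses = false_alarms = correct_rejections = 0
    for i_row, e_row in zip(ideal, estimated):
        for i, e in zip(i_row, e_row):
            if i == 1 and e == 1:
                hits += 1                  # target unit labeled as target
            elif i == 1 and e == 0:
                misses += 1                # target unit labeled as noise
            elif i == 0 and e == 1:
                false_alarms += 1          # noise unit labeled as target
            else:
                correct_rejections += 1    # noise unit labeled as noise
    n_target = hits + misses
    n_noise = false_alarms + correct_rejections
    hit_rate = hits / n_target if n_target else 0.0
    fa_rate = false_alarms / n_noise if n_noise else 0.0
    return hit_rate, fa_rate, hit_rate - fa_rate

ibm = [[1, 1, 0, 0], [1, 0, 0, 1]]
est = [[1, 0, 0, 1], [1, 0, 0, 1]]
print(hit_minus_fa(ibm, est))
```

Unlike plain accuracy, this score cannot be inflated by labeling every unit as noise when most units are noise-dominated: that gives a hit rate of zero.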
0:08:17	Now we have a problem,
0:08:19	because the SVM is designed to maximize the classification accuracy
0:08:26	instead of HIT-FA.
0:08:28	If we want to maximize HIT-FA,
0:08:32	we need to make some changes.
0:08:35	For HIT-FA,
0:08:37	we actually need to consider two kinds of errors: the miss errors and the false-alarm errors.
0:08:43	We want to balance these two kinds of errors and maximize this value.
0:08:49	What we use is a technique called rethresholding.
0:08:55	The standard SVM uses zero as the threshold;
0:09:00	here we choose a new threshold
0:09:03	which maximizes HIT-FA in each channel.
0:09:07	It is just like this:
0:09:09	if we have too many miss errors but few false-alarm errors,
0:09:13	we can shift the hyperplane a little bit
0:09:17	and let some negative points become one.
0:09:20	By this we can increase the hit rate,
0:09:24	so we can increase HIT-FA.
0:09:27	We use this new threshold, theta:
0:09:30	if the decision value is larger than theta,
0:09:34	it is one; otherwise it is zero.
0:09:36	This theta is chosen from a small validation set.
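Rethresholding for one channel might be sketched as follows: given validation labels and SVM decision values, try a set of candidate thresholds and keep the one with the highest HIT-FA. The candidate set (zero plus midpoints between sorted decision values) and the function names are assumptions; the talk only says that theta is chosen on a small validation set.

```python
def hit_fa_at(labels, decisions, theta):
    """HIT-FA when units with decision value > theta are labeled one."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    hits = sum(1 for y, d in zip(labels, decisions) if y == 1 and d > theta)
    fas = sum(1 for y, d in zip(labels, decisions) if y == 0 and d > theta)
    hit_rate = hits / n_pos if n_pos else 0.0
    fa_rate = fas / n_neg if n_neg else 0.0
    return hit_rate - fa_rate

def choose_threshold(labels, decisions):
    """Return (theta, score) maximizing HIT-FA on a validation set."""
    values = sorted(set(decisions))
    # Candidates: midpoints between neighbouring decision values.
    candidates = [(a + b) / 2.0 for a, b in zip(values, values[1:])]
    # Start from the standard SVM threshold and move only if strictly better.
    best_theta, best_score = 0.0, hit_fa_at(labels, decisions, 0.0)
    for theta in candidates:
        score = hit_fa_at(labels, decisions, theta)
        if score > best_score:
            best_theta, best_score = theta, score
    return best_theta, best_score

# Toy validation data: many misses at threshold zero, few false alarms.
val_labels = [1, 1, 1, 1, 0, 0, 0, 0]
val_decisions = [1.2, 0.4, -0.3, -0.5, -0.9, -1.1, -1.4, 0.1]
theta, score = choose_threshold(val_labels, val_decisions)
print(theta, score)
```

In this toy case the chosen threshold is negative, which corresponds to shifting the hyperplane so that some points with negative decision values are labeled one.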
0:09:43	After we get the
0:09:45	mask in each channel, we combine them and get a whole mask.
0:09:50	We can further use auditory segmentation
0:09:53	to improve the mask.
0:09:55	For the voiced frames we use cross-channel correlation and envelope cross-channel correlation,
0:10:00	and for the unvoiced frames we use onset and offset.
0:10:06	Okay, this figure illustrates the estimated masks.
0:10:10	The first figure is the IBM,
0:10:14	and on the right is the SVM binary mask.
0:10:18	This mask is close to the IBM, but it looks like it misses some parts.
0:10:27	Okay.
0:10:32	It just misses some white regions.
0:10:38	But using rethresholding we can enlarge the mask;
0:10:42	we can increase the hit rate.
0:10:45	You may also notice that we also increase the false-alarm rate,
0:10:49	but the point is that we increase the hit rate more than the false-alarm rate, so HIT-FA increases.
0:10:57	Another thing to notice is that
0:11:00	most of these false alarms are isolated units here.
0:11:06	These units are easy to remove by using the segmentation,
0:11:11	so the final segmentation result
0:11:14	is pretty good and close to the IBM.
0:11:21	Okay, for the evaluation:
0:11:23	for the training corpus, we use one hundred utterances from the IEEE corpus,
0:11:29	female utterances,
0:11:31	and we use three types of noise: speech-shaped noise, factory noise, and babble noise.
0:11:36	For the pitch-based features, we directly extract the pitch,
0:11:41	the ground-truth pitch, from the target speech,
0:11:43	and we use mixtures at minus five, zero, and five dB
0:11:49	and train on them together.
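Creating mixtures at a given SNR, such as minus five, zero, and five dB, is commonly done by scaling the noise before adding it to the speech. The sketch below assumes the SNR is defined over the whole utterance from sample energies; the function name and that definition are assumptions, since the talk does not describe how the mixtures were generated.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that the speech-to-noise energy ratio equals snr_db.

    Returns (mixture, scaled_noise). Both inputs are equal-length sample lists.
    """
    speech_energy = sum(s * s for s in speech)
    noise_energy = sum(n * n for n in noise)
    if speech_energy <= 0.0 or noise_energy <= 0.0:
        raise ValueError("speech and noise must both have nonzero energy")
    # Required noise energy is speech_energy / 10^(snr_db / 10).
    gain = math.sqrt(speech_energy / (noise_energy * 10.0 ** (snr_db / 10.0)))
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise

def measured_snr_db(speech, noise):
    """SNR in dB computed from the sample energies."""
    return 10.0 * math.log10(sum(s * s for s in speech) / sum(n * n for n in noise))

clean = [1.0, -1.0, 1.0, -1.0]
interference = [0.5, 0.5, -0.5, -0.5]
for target_snr in (-5.0, 0.0, 5.0):
    _, scaled = mix_at_snr(clean, interference, target_snr)
    print(target_snr, round(measured_snr_db(clean, scaled), 6))
```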
0:11:51	For the test, we use sixty utterances.
0:11:55	These utterances are not seen in the training corpus.
0:11:58	The noises are speech-shaped, factory, and babble noise;
0:12:02	we also test on two new noises:
0:12:05	white noise and cocktail-party noise.
0:12:08	Here we cannot use the ideal information;
0:12:12	we use the Jin and Wang algorithm to
0:12:15	extract the estimated pitch from the mixture.
0:12:20	We test at minus five and zero dB.
0:12:25	This is the classification result.
0:12:27	We compare our system with Kim et al.'s system.
0:12:31	Their system
0:12:32	uses a Gaussian mixture model to learn the distribution of
0:12:36	the AMS features
0:12:38	and then uses a Bayesian classifier to do the classification.
0:12:42	We chose this system because their system improves speech intelligibility
0:12:47	in listening tests.
0:12:50	From the table we can see that
0:12:53	the SVM and our proposed method
0:12:58	both have very high HIT-FA rates,
0:13:01	and they are significantly better than
0:13:03	Kim et al.'s system.
0:13:05	Also for the accuracy, our method is still better.
0:13:10	Table two shows the unseen-noise results.
0:13:15	These two noises are not seen in the training corpus,
0:13:20	but our system still works very well, and the HIT-FA results are close to
0:13:27	the results on the seen noises.
0:13:30	This means that our system can generalize well to these two new noises.
0:13:38	In the previous comparison,
0:13:41	the two systems use different features: we use AMS plus pitch, while they use only AMS;
0:13:48	we use different classifiers;
0:13:50	and we also incorporate the segmentation stage.
0:13:57	So here we want to study the performance of the classifier only.
0:14:03	We use exactly the same front end, the twenty-five-channel mel-scale filterbank,
0:14:09	the same features, only the AMS features,
0:14:13	and the same training corpus.
0:14:18	The only difference is the classifier: we use an SVM and they use a GMM.
0:14:24	We can find that
0:14:25	the HIT-FA result of the SVM is consistently better than the
0:14:31	GMM result.
0:14:32	For minus five dB, the improvement ranges from two percent to five percent;
0:14:40	for zero dB, the improvement ranges from five percent to [unintelligible].
0:14:47	This improvement shows the advantage of the SVM over the GMM.
0:14:54	This is the demo. It is female speech mixed with factory noise
0:14:59	at zero dB.
0:15:02	This is the noisy speech.
0:15:09	This is the proposed result; we use this mask.
0:15:22	This is the IBM result.
0:15:29	Okay, we can hear that our proposed result achieves pretty good speech intelligibility
0:15:37	and is close to the ideal.
0:15:39	So we conclude our work here.
0:15:41	We treat the speech separation problem as binary classification.
0:15:46	We use an SVM to classify each unit as one or zero.
0:15:50	We use the pitch-based features and the AMS features.
0:15:53	Based on the comparison,
0:15:54	we can predict that our separation results will
0:15:59	significantly improve speech intelligibility
0:16:02	in noisy conditions
0:16:04	for human listeners.
0:16:06	Our future work will test this.
0:16:10	That's all. Thank you.
0:16:16	Are there any questions?
0:16:19	[unintelligible]
0:16:23	Could you comment on the processing steps? I assume that it was a batch
0:16:28	type of processing, or would it be able to be implemented as online
0:16:32	processing with low latency?
0:16:35	Sorry, can you say it again?
0:16:37	The processing steps that you use to
0:16:40	separate the signals:
0:16:42	is that
0:16:43	batch-type processing, where you have [unintelligible],
0:16:46	or is it an online
0:16:49	method, where you just have a
0:16:51	little bit of latency as you process?
0:16:54	Yeah, it is like batch processing:
0:16:58	given a mixture, I can give you
0:17:01	the separated speech.
0:17:03	It's not online.
0:17:08	I would like to know if you can comment on
0:17:11	differences
0:17:12	between voiced and unvoiced
0:17:14	phases, because the signal-to-noise ratio might be different,
0:17:20	or it might be less critical
0:17:24	to apply the binary mask to
0:17:26	speech if it is unvoiced.
0:17:28	You mean,
0:17:29	what is the difference between voiced and unvoiced in terms of quality?
0:17:34	Yeah. In our work
0:17:36	we use two kinds of features:
0:17:38	the pitch-based features and the AMS features.
0:17:41	The pitch-based features basically
0:17:43	focus on the voiced part,
0:17:45	because for the unvoiced part we don't have the pitch,
0:17:48	so this feature doesn't work.
0:17:50	But for the unvoiced part
0:17:51	we still have the AMS features,
0:17:53	so the AMS features work for the unvoiced part,
0:17:57	and the AMS features also work for the voiced part.
0:18:02	So we combine them together;
0:18:05	they are complementary features.
0:18:08	For finding the
0:18:10	harmonics, you are using a correlation measure. I didn't get that at
0:18:14	first:
0:18:15	is the correlation over time and frequency
0:18:19	for the cochleagram, and then you take the differences between
0:18:23	adjacent frames
0:18:25	and adjacent bins?
0:18:28	You mean the pitch extractor?
0:18:31	Yeah.
0:18:32	For the estimated pitch, we
0:18:35	use the Jin and Wang algorithm, which uses the
0:18:40	correlogram
0:18:42	to extract the pitch
0:18:45	in each frame.
0:18:49	Are there any other questions?
0:18:53	Please.
0:18:56	You ran experiments at zero and at five, or
0:19:00	zero and minus five? I can't remember. Yeah, my results are at minus five and zero.
0:19:07	Right, okay.
0:19:08	My question is: you should be able to look at the masks that you estimated at zero
0:19:13	and minus five,
0:19:15	and as the signal-to-noise ratio decreases,
0:19:18	you should see erosion around the edges of your mask.
0:19:23	So you should be able to somehow connect the image
0:19:26	of the mask at zero dB and minus five dB as the signal-to-noise ratio
0:19:32	changes,
0:19:33	say if it drops from zero to minus two point five or something.
0:19:36	Have you tried looking at that, where you have a mismatch or a change in the signal-to-noise ratio
0:19:41	for your estimation approach?
0:19:44	Yeah.
0:19:45	In this study,
0:19:46	if the signal-to-noise ratio decreases,
0:19:49	like to minus five dB,
0:19:53	the masks are very different,
0:19:57	so the performance, as you can see, decreases.
0:20:03	Is it possible to interpolate the masks between those two limits?
0:20:11	Sorry, I don't
0:20:12	get your point.
0:20:14	Okay, with respect to the time, thanks once more for the contribution.

Speech Enhancement

Speaker: Kun Han, Authors: Kun Han, DeLiang Wang, The Ohio State University, United States