0:00:13okay, this is the last session
0:00:16and thank you for coming
0:00:17my name is [unclear] and I come from [unclear] University
0:00:22and today I'm going to present my work with my advisor [unclear]
0:00:26it's an SVM-based classification approach to speech separation
0:00:36this is the outline of our presentation
0:00:39the first part is the introduction
0:00:41after that
0:00:43in part two we will
0:00:45talk about the feature extraction part
0:00:47and then I will talk about the unit labeling
0:00:51and segmentation
0:00:52the last part is the
0:00:54experimental results
0:00:59well, in a daily environment
0:01:01the target speech is often corrupted by various types of noise
0:01:07so the question is how can we
0:01:11remove or attenuate the background noise
0:01:14this is the
0:01:14speech separation problem
0:01:16in this study we only focus on monaural speech separation
0:01:22it is
0:01:23very challenging
0:01:24because we cannot use the
0:01:27location information
0:01:28we can only use the
0:01:30intrinsic properties of the target and the noise
0:01:36so I will first introduce a very important concept
0:01:40which is the ideal binary mask
0:01:42IBM for short
0:01:44it is the
0:01:45main computational goal of CASA
0:01:47the IBM can be
0:01:49defined as follows
0:01:55okay
0:01:56so given a mixture
0:01:59we decompose it into the time-frequency domain
0:02:02this is a
0:02:03two-dimensional representation
0:02:04and for each t-f unit
0:02:07we compare the
0:02:09speech energy and the noise energy
0:02:11if the local SNR is larger than a
0:02:15local criterion, LC
0:02:17the mask is one
0:02:18otherwise it is zero
0:02:20so in this way we can convert the speech separation
0:02:23problem to a
0:02:25binary mask estimation
0:02:27previous studies have shown that
0:02:29if we use the IBM to resynthesize the mixture
0:02:34we can get separated speech
0:02:36with very high
0:02:38speech intelligibility
0:02:40so for IBM estimation, it is
0:02:42just the ones and the zeros
0:02:45so it is nothing but binary classification
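
The mask definition described here can be sketched in a few lines. This is a minimal illustration, not the speakers' code: the function name `ideal_binary_mask`, the nested-list layout (channels by frames), the handling of zero-energy units, and the default LC of 0 dB are all assumptions made for the example.

```python
import math

def ideal_binary_mask(speech_energy, noise_energy, lc_db=0.0):
    """Return a 0/1 mask; inputs are [channel][frame] energies (assumed layout)."""
    mask = []
    for s_row, n_row in zip(speech_energy, noise_energy):
        row = []
        for s, n in zip(s_row, n_row):
            if s <= 0.0:
                row.append(0)   # no speech energy: treat as noise-dominant
            elif n <= 0.0:
                row.append(1)   # no noise energy: treat as speech-dominant
            else:
                local_snr_db = 10.0 * math.log10(s / n)
                # Label 1 only when the local SNR exceeds the local criterion.
                row.append(1 if local_snr_db > lc_db else 0)
        mask.append(row)
    return mask

# Toy energies for 2 channels x 2 frames (made-up numbers).
speech = [[4.0, 0.5], [1.0, 9.0]]
noise = [[1.0, 2.0], [1.0, 3.0]]
mask = ideal_binary_mask(speech, noise)
print(mask)
```

Computing this mask needs the premixed speech and noise, which is why it serves as the training target and not as something available at test time.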
0:02:51this figure
0:02:52illustrates the IBM
0:02:54the first figure is the
0:02:56cochleagram
0:02:57of the target
0:02:59and the second one is the
0:03:01cochleagram of the noise
0:03:03and we mix them together
0:03:05here is the cochleagram of the mixture
0:03:09so if we know the target and we know the noise
0:03:12and for each unit we compare the energy
0:03:15we can get
0:03:17this mask
0:03:19it is the ideal binary mask
0:03:21the white region
0:03:25means the target energy is stronger
0:03:27the black region means the noise
0:03:29energy is stronger
0:03:32so
0:03:33this IBM comes from the ideal information: we need to know the target and
0:03:38we need to know the noise
0:03:39so what we will do is
0:03:42use some features from the mixture to
0:03:45estimate this mask
0:03:47this is
0:03:48our goal
0:03:51this is the system overview
0:03:53given a mixture
0:03:54we use a gammatone filterbank
0:03:57to decompose the mixture into sixty-four channels
0:04:01on each channel
0:04:02for each t-f unit
0:04:04we extract the features
0:04:06including the pitch-based features and the
0:04:09amplitude modulation spectrum
0:04:12or AMS, features
0:04:15once we get these features we use a support vector machine
0:04:19to do the classification
0:04:21to classify each unit as
0:04:23one or zero
0:04:25and then
0:04:26we get a mask
0:04:27for this mask
0:04:28we can use
0:04:29auditory segmentation to further improve it
0:04:34finally we use this mask to
0:04:36resynthesize the mixture and get the separated speech
0:04:42for the feature extraction
0:04:44we have two
0:04:45types of features
0:04:47the first one is the
0:04:48pitch-based feature
0:04:51so for this feature, for each t-f unit
0:04:53we compute the
0:04:55autocorrelation at the pitch lag
0:04:58of course for the unvoiced frames
0:05:00there is no pitch
0:05:02so we simply put zero
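
A rough sketch of a pitch-based feature of this kind follows. The normalization, the function names, and the convention that the pitch lag is given in samples (with `None` for unvoiced frames) are assumptions for illustration; the talk only states that the autocorrelation is taken at the pitch lag and that unvoiced frames get zero.

```python
import math

def normalized_autocorrelation(frame, lag):
    """Normalized autocorrelation of one unit's filter response at `lag` samples."""
    if lag <= 0 or lag >= len(frame):
        return 0.0
    a = frame[:-lag]          # samples x[n]
    b = frame[lag:]           # samples x[n + lag]
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0.0 else 0.0

def pitch_feature(frame, pitch_lag):
    """Return 0.0 when there is no pitch (unvoiced frame)."""
    if pitch_lag is None:
        return 0.0
    return normalized_autocorrelation(frame, pitch_lag)

# Toy response: a sinusoid with a period of 20 samples.
frame = [math.sin(2.0 * math.pi * n / 20.0) for n in range(200)]
at_period = pitch_feature(frame, 20)        # close to +1 at the true period
at_half_period = pitch_feature(frame, 10)   # close to -1 at half the period
unvoiced = pitch_feature(frame, None)       # exactly 0.0
```

A unit whose response is periodic at the target pitch gives a value near one, which is what makes the feature informative for target-dominant units.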
0:05:05and we also compute the delta features to capture the feature
0:05:10variations across time and frequency
0:05:13we just use the feature in the current unit
0:05:15minus the feature
0:05:17in the previous unit
0:05:19that is the delta feature
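
The delta computation described here (current unit minus previous unit, along time and along frequency) might look like the sketch below. The `[channel][frame]` layout and the choice of zero at the first frame and first channel are assumptions; the talk does not say how boundaries are handled.

```python
def delta_features(feat):
    """feat[c][m] is a feature in channel c, frame m.

    Returns (delta_time, delta_freq), each with the same shape as feat.
    Boundary units, which have no previous unit, are set to 0.0 (an assumption).
    """
    n_ch = len(feat)
    n_fr = len(feat[0])
    delta_time = [
        [feat[c][m] - feat[c][m - 1] if m > 0 else 0.0 for m in range(n_fr)]
        for c in range(n_ch)
    ]
    delta_freq = [
        [feat[c][m] - feat[c - 1][m] if c > 0 else 0.0 for m in range(n_fr)]
        for c in range(n_ch)
    ]
    return delta_time, delta_freq

# Toy feature map: 2 channels x 2 frames.
feat = [[1.0, 3.0], [2.0, 7.0]]
d_time, d_freq = delta_features(feat)
print(d_time)
print(d_freq)
```

Each original feature therefore contributes three values per unit: itself, its time delta, and its frequency delta.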
0:05:20we also compute the envelope autocorrelation and its delta features
0:05:26so here we have a six-dimensional
0:05:28pitch-based feature
0:05:30the first two are the
0:05:32original features
0:05:34and then
0:05:35two are the time
0:05:38delta features and
0:05:40two are the frequency delta features
0:05:44and another one is the AMS feature
0:05:48so for each t-f unit
0:05:50we extract a fifteen-dimensional AMS feature
0:05:54we use the same settings as the Kim et al.
0:05:58two thousand nine paper
0:06:00and we also add
0:06:03the delta features
0:06:04so for the AMS feature we have
0:06:07a forty-five-dimensional
0:06:09feature vector
0:06:13okay
0:06:14now we have the features
0:06:16we combine these two types
0:06:18and use these features to
0:06:20train an SVM
0:06:24once we finish the training
0:06:26we can use this discriminant function
0:06:29to do the classification
0:06:31here f(x)
0:06:32is the
0:06:33decision value
0:06:35computed from the SVM
0:06:37this decision value is a real number
0:06:40so for the standard SVM
0:06:41we use the
0:06:43sign function
0:06:45that is, we use zero as the threshold
0:06:48if f(x) is positive
0:06:50the label is one; if it is negative, the label is zero
0:06:55so when we train the SVM, we train it in each channel, and we have sixty-four channels
0:07:00which means we have sixty-four SVMs
0:07:03and we use the Gaussian kernel, and the
0:07:06parameters are chosen
0:07:07by five-fold cross-validation
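
To make the decision rule concrete, here is a small sketch of a Gaussian-kernel SVM decision value followed by the standard zero threshold. It uses the textbook kernel expansion, f(x) = sum_i coef_i K(sv_i, x) + b. The support vectors, coefficients, bias, and `gamma` below are made-up numbers for illustration, not values from the talk, and training itself is not shown.

```python
import math

def gaussian_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def decision_value(x, support_vectors, dual_coefs, bias, gamma):
    """Real-valued f(x); dual_coefs[i] stands for alpha_i * y_i."""
    return sum(
        c * gaussian_kernel(sv, x, gamma)
        for sv, c in zip(support_vectors, dual_coefs)
    ) + bias

def classify(fx, threshold=0.0):
    """Standard SVM labeling: 1 if f(x) is above the threshold, else 0."""
    return 1 if fx > threshold else 0

# Hypothetical one-channel model with two support vectors.
svs = [(0.0, 0.0), (1.0, 1.0)]
coefs = [-1.0, 1.0]   # first vector belongs to class 0, second to class 1
f_near_one = decision_value((0.9, 0.9), svs, coefs, bias=0.0, gamma=1.0)
f_near_zero = decision_value((0.1, 0.1), svs, coefs, bias=0.0, gamma=1.0)
labels = [classify(f_near_one), classify(f_near_zero)]
```

In the system described here, one such model would exist per filterbank channel, each with its own kernel parameters.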
0:07:13okay, when people
0:07:15do classification
0:07:16they usually use the classification accuracy to
0:07:20evaluate the performance
0:07:22here we also focus on another measure
0:07:25it is HIT minus FA
0:07:28so for the classification
0:07:30results we have these
0:07:32four types of results
0:07:34if the IBM and the estimated IBM are both zero, it is a correct rejection
0:07:38and if the IBM is zero and the estimate is one, it is a false alarm, which is an error
0:07:43and if both are one, it is a correct hit
0:07:47if the IBM is one and the estimate is
0:07:49zero
0:07:50it is a miss error
0:07:52then we can compute the hit rate
0:07:56and the false alarm rate
0:07:58and we
0:07:59calculate the difference between the hit rate and the false alarm rate
0:08:05because this measure is
0:08:08correlated with speech intelligibility
0:08:12so that is why we use this measure
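
The four outcomes and the HIT minus FA measure can be computed as in the sketch below. The function name and the `[channel][frame]` mask layout are assumptions for the example; the definitions of hit rate and false alarm rate follow the description in the talk.

```python
def hit_minus_fa(ibm, est):
    """Return (hit_rate, fa_rate, hit_rate - fa_rate) for two 0/1 masks."""
    hits = misses = false_alarms = correct_rejections = 0
    for ibm_row, est_row in zip(ibm, est):
        for target, guess in zip(ibm_row, est_row):
            if target == 1 and guess == 1:
                hits += 1
            elif target == 1:
                misses += 1
            elif guess == 1:
                false_alarms += 1
            else:
                correct_rejections += 1
    n_target = hits + misses                     # units labeled 1 in the IBM
    n_noise = false_alarms + correct_rejections  # units labeled 0 in the IBM
    hit_rate = hits / n_target if n_target else 0.0
    fa_rate = false_alarms / n_noise if n_noise else 0.0
    return hit_rate, fa_rate, hit_rate - fa_rate

# Toy masks: 2 channels x 4 frames.
ibm = [[1, 1, 0, 0], [1, 0, 0, 1]]
est = [[1, 0, 0, 1], [1, 0, 0, 1]]
hit_rate, fa_rate, score = hit_minus_fa(ibm, est)
print(hit_rate, fa_rate, score)
```

Unlike plain accuracy, this score penalizes a mask that labels everything zero: its false alarm rate is zero, but so is its hit rate.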
0:08:17now we have a problem
0:08:19because
0:08:20the SVM is designed to maximize the classification accuracy
0:08:26instead of HIT minus FA
0:08:28but if we want
0:08:30to maximize HIT minus FA
0:08:32we need to make some changes
0:08:35for HIT minus FA
0:08:37we actually need to consider two kinds of errors, the miss error and the false
0:08:42alarm error
0:08:43we want to balance these two
0:08:45kinds of errors and maximize this value
0:08:49what we do is use a technique
0:08:52called rethresholding
0:08:56the standard SVM
0:08:58uses zero as the threshold
0:09:00here we choose a new threshold
0:09:03which maximizes HIT minus FA
0:09:05in each channel
0:09:07it is just like this
0:09:09if we have
0:09:10too many miss errors but few false alarm errors
0:09:13we can shift the
0:09:15hyperplane a little bit
0:09:17and allow some negative
0:09:18data points to be labeled one
0:09:21by doing this we can increase the hit rate
0:09:24so we can
0:09:25increase HIT minus FA
0:09:27so we use this new threshold
0:09:30if the decision value is larger than theta
0:09:34it is one otherwise it is zero
0:09:36and this theta is chosen from
0:09:38a small
0:09:39validation set
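
One way to pick such a threshold for a single channel is a simple search over candidate values on held-out data, as sketched below. The candidate set (the observed decision values plus zero), the tie-breaking, and the toy numbers are assumptions; the talk only says that theta is chosen on a small validation set to maximize HIT minus FA.

```python
def score_at(labels, decisions, theta):
    """HIT minus FA when a unit is labeled 1 iff its decision value > theta."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    hits = sum(1 for y, f in zip(labels, decisions) if y == 1 and f > theta)
    fas = sum(1 for y, f in zip(labels, decisions) if y == 0 and f > theta)
    hit_rate = hits / n_pos if n_pos else 0.0
    fa_rate = fas / n_neg if n_neg else 0.0
    return hit_rate - fa_rate

def choose_threshold(labels, decisions):
    """Search observed decision values (and 0.0) for the best threshold."""
    candidates = sorted(set(decisions) | {0.0})
    return max(candidates, key=lambda t: score_at(labels, decisions, t))

# Toy validation data for one channel: IBM labels and SVM decision values.
val_labels = [0, 0, 1, 1, 1, 1]
val_decisions = [-1.5, -0.2, -0.6, -0.1, 0.4, 0.8]

standard_score = score_at(val_labels, val_decisions, 0.0)
theta = choose_threshold(val_labels, val_decisions)
rethresholded_score = score_at(val_labels, val_decisions, theta)
print(theta, standard_score, rethresholded_score)
```

In this toy case a slightly negative threshold recovers one missed target unit without admitting a false alarm, which is the trade-off the talk describes.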
0:09:43and
0:09:44after we get the
0:09:45labels on each channel, we combine them and get a whole mask
0:09:50we can further use
0:09:51auditory segmentation
0:09:53to improve the mask
0:09:55for the voiced frames we use the cross-channel correlation and envelope
0:09:58cross-channel correlation
0:10:00and for the unvoiced frames we use
0:10:02onset and offset
0:10:06okay, this figure
0:10:07illustrates the estimated
0:10:10mask. the first figure is the
0:10:12IBM
0:10:14and on the right is
0:10:16the SVM binary mask
0:10:19so this mask
0:10:21is good, it is close to the IBM, but it misses some
0:10:32it just misses some
0:10:36white regions
0:10:38but by using rethresholding we can enlarge the
0:10:42mask
0:10:42we can
0:10:43increase the hit rate
0:10:45you may also notice that we also increase the false alarm rate
0:10:49but the point is
0:10:51we increase the hit rate more than the false alarm rate, so HIT minus FA still increases
0:10:57and another thing is that
0:11:00these false alarm errors are
0:11:03mostly isolated units here
0:11:06these units
0:11:07are easy to remove by using the segmentation
0:11:11so the last one is the segmentation result
0:11:14which is
0:11:15pretty good and close to the IBM
0:11:21okay, for the
0:11:23evaluation, for the training
0:11:25corpus we use one hundred utterances from the
0:11:28IEEE corpus
0:11:29female utterances
0:11:31and we use three types of noise: speech-shaped noise, factory noise, and babble noise
0:11:36and for the pitch-based features we
0:11:38directly extract the pitch
0:11:41the ground-truth pitch, from the target speech
0:11:43and we use mixtures at minus five, zero, and five dB
0:11:49and train on them together
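
Creating a mixture at a given SNR usually means scaling the noise before adding it to the speech. The sketch below shows one common way to do this; the function name, the use of whole-signal energy, and the toy signals are assumptions, since the talk does not describe how its mixtures were generated.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise energy ratio equals snr_db, then add."""
    speech_energy = sum(x * x for x in speech)
    noise_energy = sum(x * x for x in noise)
    # Choose gain so that speech_energy / (gain^2 * noise_energy) = 10^(snr_db / 10).
    gain = math.sqrt(speech_energy / (noise_energy * 10.0 ** (snr_db / 10.0)))
    mixture = [s + gain * n for s, n in zip(speech, noise)]
    return mixture, gain

# Toy signals of equal length (made-up samples).
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]

mixture_0db, gain_0db = mix_at_snr(speech, noise, 0.0)
_, gain_minus5 = mix_at_snr(speech, noise, -5.0)
_, gain_plus5 = mix_at_snr(speech, noise, 5.0)
print(mixture_0db, gain_0db)
```

Lower SNR corresponds to a larger noise gain, so the minus five dB mixtures are the hardest of the three training conditions.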
0:11:51for the test
0:11:53we use sixty utterances
0:11:55these utterances are not seen in the training corpus
0:11:58for the noise we use the
0:12:00speech-shaped, factory, and babble noise
0:12:02and we also test on two new noises
0:12:05white noise and cocktail party noise
0:12:08and here we cannot use the ideal information
0:12:12we use the Jin and Wang algorithm to
0:12:15extract the estimated pitch from the mixture
0:12:20and we test at minus five and zero dB
0:12:25this is the classification result
0:12:27we compare our system with the Kim et al. system
0:12:31their system
0:12:32uses a Gaussian mixture model to learn the distribution of the
0:12:36AMS features
0:12:38and then uses a Bayesian classifier to do the classification
0:12:42we choose this system because their system improves speech intelligibility
0:12:47in listening tests
0:12:50from the table we can
0:12:52see that
0:12:53our proposed
0:12:58method
0:12:58achieves a very high HIT minus FA
0:13:01and it is significantly better than
0:13:03the Kim et al. system
0:13:05and also for the accuracy our method is still better
0:13:10and table two shows the unseen noise results
0:13:15these
0:13:16two noises
0:13:17are not seen in the training corpus
0:13:21but our system still performs very well, and the results are close to
0:13:27the results for the seen noises
0:13:30so this
0:13:31means that our system generalizes well to these two
0:13:36new noises
0:13:38and in this comparison
0:13:41the two systems use different
0:13:44features: we use AMS plus pitch, and they only use AMS
0:13:48we use different classifiers
0:13:50and
0:13:51we also
0:13:53incorporate the
0:13:55segmentation stage
0:13:57so here
0:13:58we want to
0:13:59study the performance of the classifier only
0:14:03so we use exactly the same front end, a twenty-five-channel mel-scale filterbank
0:14:09and the same features, only the AMS features
0:14:13and the same
0:14:15training corpus
0:14:18the only difference is the classifier: we use the SVM
0:14:21and they use a GMM
0:14:24we can find that
0:14:25the HIT minus FA
0:14:27result of the SVM is consistently better than the
0:14:31GMM result
0:14:32at minus five dB the
0:14:35improvement
0:14:36ranges from
0:14:37two percent to five percent
0:14:40at zero dB the improvement is from
0:14:44[unclear] percent to [unclear] percent
0:14:48this improvement
0:14:49shows the advantage of the SVM over the GMM
0:14:54this is a
0:14:55demo: female speech mixed with factory noise
0:14:59at zero dB
0:15:02this is the noisy speech
0:15:09this is the proposed
0:15:11result
0:15:22and this is the IBM result
0:15:29okay, we can
0:15:30hear that
0:15:31our proposed
0:15:33result
0:15:34achieves
0:15:35pretty good
0:15:35speech intelligibility
0:15:37and is close to the ideal
0:15:39so we conclude our work here
0:15:41we treat the
0:15:42speech separation problem as a binary classification
0:15:46we use SVMs to classify the units as one or zero
0:15:50we use the pitch-based features and the AMS features
0:15:53so based on the comparison
0:15:54we can
0:15:55predict that our separation results will lead to
0:15:59significantly improved speech intelligibility
0:16:03in noisy conditions
0:16:04for human listeners
0:16:06our future
0:16:07work will test this
0:16:10that is all, thank you
0:16:16are there any questions
0:16:19and a multi
0:16:23could you comment on the processing steps? I assume that was a batch
0:16:28type of processing, or could it be implemented as online
0:16:32processing with
0:16:33low latency
0:16:35sorry, can you say it again
0:16:37the processing steps you use to
0:16:40separate the signals
0:16:42is that
0:16:43a batch type of processing where you go over the signal several times
0:16:46or is it an online
0:16:49method where you just have a
0:16:51little bit of latency as you process
0:16:54yeah, it is like
0:16:55batch
0:16:57processing
0:16:58given a mixture, I can give you the
0:17:01separated speech
0:17:03it is not online
0:17:08I would like to know if you can comment on
0:17:11differences
0:17:12between voiced and unvoiced
0:17:14phases, because the signal-to-noise ratio might be different
0:17:21or it might be less critical
0:17:24to apply the binary mask to
0:17:26speech if it is unvoiced
0:17:28you mean
0:17:29what is the difference between voiced and unvoiced in terms of quality
0:17:34yeah, in our work
0:17:36we use two kinds of features
0:17:38the pitch-based features and the AMS features
0:17:41so the pitch-based features basically
0:17:43focus on the voiced part
0:17:45because for the unvoiced part we do not have pitch
0:17:48so this feature
0:17:49does not work
0:17:50but for the unvoiced part
0:17:51we still have the AMS features
0:17:53so the AMS features work for the unvoiced part
0:17:57and
0:17:59AMS also works for the voiced part
0:18:02so we combine them together
0:18:05they are complementary features
0:18:08for finding the
0:18:10harmonics you are using a correlation measure; I did not get that
0:18:15is the correlation over time and frequency
0:18:19and then you take the differences between
0:18:23adjacent frames
0:18:25and adjacent bins
0:18:28you mean the pitch extractor
0:18:31yeah
0:18:32for the estimated pitch we
0:18:35use Jin and Wang's algorithm; they use the
0:18:40correlogram
0:18:42to extract the pitch
0:18:45on each frame
0:18:49is there another question
0:18:56you ran experiments at zero and minus five, or
0:19:00five, zero, and minus five, I cannot remember. yeah, my results are at minus five and zero dB
0:19:07right, okay
0:19:08my question is, you should be able to look at the masks that you estimated at zero
0:19:13and minus five
0:19:15and as the signal-to-noise ratio decreases
0:19:18you should see erosion around the edges of your mask
0:19:23so you should be able to somehow connect the image
0:19:26of the mask at zero dB and minus five dB as the signal-to-noise ratio
0:19:32changes
0:19:33say it drops from zero to minus two point five or something
0:19:36have you tried looking at that, where you have a mismatch or change in the signal-to-noise ratio
0:19:41for your estimation approach
0:19:44yeah
0:19:45in this study
0:19:46if the signal-to-noise ratio decreases
0:19:49like to minus five dB
0:19:53the masks are very different
0:19:57so the performance, as you can see, actually decreases
0:20:03is it possible to interpolate the masks between those two limits
0:20:07right, in between
0:20:11sorry, I do not
0:20:12get your question
0:20:14okay, with respect to the time, thank you once more for the contribution