Okay, this is the last talk of this session. Today I'm going to present the work with my advisor: an SVM-based classification approach to speech separation.
This is the outline of the presentation. The first part is the introduction. Then we will talk about the feature extraction, then the unit labeling and the segmentation, and the last part is the experimental results.
In a real environment, the target speech is often corrupted by various types of interference, so the question is how we can remove or attenuate the background noise. This is the speech separation problem. In this study we focus only on monaural speech separation. It is very challenging, because we cannot use any location information; we can only use the intrinsic properties of the target and the interference.
So I will first introduce a very important concept: the ideal binary mask, IBM for short. It is the main computational goal of computational auditory scene analysis. The IBM is defined as follows.
Given a mixture, we decompose it into the time-frequency domain, a two-dimensional representation. For each T-F unit, we compare the speech energy and the noise energy. If the local SNR is larger than a local criterion, LC, the mask value is one; otherwise it is zero. In this way we convert the speech separation problem into a binary mask estimation problem.
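As a concrete illustration of this definition, here is a minimal sketch (the energy values and helper names are made up for illustration) of how the IBM could be computed from per-unit target and noise energies:

```python
import numpy as np

def ideal_binary_mask(target_energy, noise_energy, lc_db=0.0):
    """Compute the IBM: 1 where the local SNR exceeds the local criterion (LC)."""
    # Local SNR in dB for each time-frequency unit.
    snr_db = 10.0 * np.log10(target_energy / np.maximum(noise_energy, 1e-12))
    return (snr_db > lc_db).astype(int)

# Toy 2x3 grid of per-unit energies (illustrative values only).
target = np.array([[4.0, 1.0, 9.0],
                   [0.5, 8.0, 2.0]])
noise  = np.array([[1.0, 2.0, 3.0],
                   [1.0, 1.0, 2.0]])
mask = ideal_binary_mask(target, noise)  # LC = 0 dB
```

With LC = 0 dB this reduces to labeling one wherever the target energy is stronger than the noise energy in that unit.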
Previous studies have shown that if we use the IBM to resynthesize the mixture, we can get separated speech with very high intelligibility. IBM estimation is just labeling ones and zeros, so it is nothing but binary classification.
This figure illustrates the IBM. The first panel is the cochleagram of the target, and the second is the cochleagram of the noise. We mix them together; here is the cochleagram of the mixture. If we know the target and we know the noise, then for each unit we compare the energies and get this mask, the ideal binary mask. The white regions mean the target energy is stronger; the black regions mean the noise energy is stronger. This IBM comes from ideal information: you need to know the target and you need to know the noise.
What we will do instead is use features extracted from the mixture to estimate this mask. This is our goal.
This is the system overview. Given a mixture, we use a gammatone filterbank to decompose the mixture into 64 channels. In each channel, for each T-F unit, we extract features, including pitch-based features and amplitude modulation spectrum, or AMS, features. Once we have the features, we use a support vector machine to do the classification, classifying each unit as one or zero, and then we get a mask. We can use auditory segmentation to further improve this mask. Finally, we use the mask to resynthesize the mixture and obtain the separated speech.
For the feature extraction we have two types of features. The first one is the pitch-based features. For each T-F unit, we compute autocorrelation features at the pitch lag; of course, for the unvoiced frames there is no pitch, so we simply put zeros. We also compute delta features to capture the feature variations across time and frequency: we take the feature in the current unit minus the feature in the previous unit as the delta feature, and we compute the deltas of both pitch-based features. In total we have a six-dimensional pitch-based feature vector: the first two are the original features, two are the time-delta features, and two are the frequency-delta features.
The other type is the AMS features. For each T-F unit, we extract a 15-dimensional AMS feature. We use the same setup as Kim et al.'s 2009 paper, and we add the delta features, so for the AMS features we have a 45-dimensional feature vector.
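The delta computation described here can be sketched as follows; the array layout (channels × frames × dimensions) and the convention of a zero delta for the first unit along each axis are my assumptions, not details stated in the talk:

```python
import numpy as np

def add_deltas(feat):
    """Append time and frequency delta features.

    feat: array of shape (channels, frames, dims). Each delta is the feature
    in the current unit minus the feature in the previous unit along that axis.
    """
    d_time = np.diff(feat, axis=1, prepend=feat[:, :1, :])  # across frames
    d_freq = np.diff(feat, axis=0, prepend=feat[:1, :, :])  # across channels
    return np.concatenate([feat, d_time, d_freq], axis=-1)

# A 15-dimensional AMS feature per unit becomes 45-dimensional with deltas.
ams = np.random.rand(64, 100, 15)
full = add_deltas(ams)
```

The same routine applied to the two original pitch-based features would yield the six-dimensional pitch feature vector mentioned earlier.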
Now we have the features. We combine them together and use them to train an SVM. Once we finish training, we can use the discriminant function to do the classification. f(x) is the decision value computed from the SVM; it is a real number. The standard SVM uses the sign function, that is, zero as the threshold: if f(x) is positive, the label is one; otherwise it is zero. We train an SVM in each channel, and since we have 64 channels, we have 64 SVMs. We use the Gaussian kernel, and the parameters are selected by five-fold cross-validation.
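A sketch of this per-channel training with scikit-learn; the grid values below are placeholders of my own, since the talk only states a Gaussian (RBF) kernel with parameters chosen by five-fold cross-validation:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def train_channel_svms(features, labels, n_channels=64):
    """Train one RBF-kernel SVM per filterbank channel.

    features: (channels, units, dims); labels: (channels, units) of 0/1.
    C and the kernel width are picked by 5-fold cross-validation.
    """
    svms = []
    grid = {"C": [1, 10], "gamma": [0.1, 1.0]}  # illustrative search grid
    for c in range(n_channels):
        search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)
        search.fit(features[c], labels[c])
        svms.append(search.best_estimator_)
    return svms
```

At test time, each channel's units would be labeled by the corresponding channel's model.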
When people do classification, they usually use the classification accuracy to evaluate the performance. Here we also focus on another measurement: HIT minus FA (HIT−FA). For the classification results we have four types of outcomes: if the IBM is zero and the estimate is zero, it is a correct rejection; if the IBM is zero but the estimate is one, it is a false alarm error; if both are one, it is a hit; and if the IBM is one but the estimate is zero, it is a miss. We compute the hit rate and the false alarm rate and calculate the difference between them. We use this measure because it is well correlated with speech intelligibility.
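The four outcomes and the resulting score can be computed in a few lines. In this toy example (values mine), three of the four one-labeled units are detected and one of the four zero-labeled units is wrongly labeled one, so the hit rate is 3/4 and the false-alarm rate is 1/4:

```python
import numpy as np

def hit_minus_fa(ibm, est):
    """HIT - FA: hit rate minus false-alarm rate of an estimated mask vs. the IBM."""
    ibm = np.asarray(ibm).ravel()
    est = np.asarray(est).ravel()
    hit = np.mean(est[ibm == 1] == 1)  # fraction of 1-labeled units detected
    fa = np.mean(est[ibm == 0] == 1)   # fraction of 0-labeled units mislabeled 1
    return hit - fa

ibm = [1, 1, 1, 1, 0, 0, 0, 0]
est = [1, 1, 1, 0, 1, 0, 0, 0]
score = hit_minus_fa(ibm, est)  # 3/4 - 1/4 = 0.5
```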
Now we have a problem: the SVM is designed to maximize the classification accuracy instead of HIT−FA. If we want to maximize HIT−FA, we need to make a change. For HIT−FA we actually need to consider two kinds of errors, the miss errors and the false alarm errors; we want to balance these two kinds of errors and maximize this value. What we do is use a technique called rethresholding. The standard SVM uses zero as the threshold; here we choose a new threshold which maximizes HIT−FA in each channel. For example, if we have too many miss errors but only a few false alarm errors, we can shift the hyperplane a little bit and relabel some nearby points as one. By doing this we increase the hit rate, and so increase HIT−FA. We then use this new threshold: if the decision value is larger than θ, the label is one; otherwise it is zero. The θ is chosen on a small validation set.
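A minimal sketch of the rethresholding search. Searching over the unique decision values as candidate thresholds is my assumption about how θ might be selected on the validation set; the talk only says θ is chosen to maximize HIT−FA per channel:

```python
import numpy as np

def rethreshold(decision_values, ibm_labels, candidates=None):
    """Pick the decision threshold that maximizes HIT - FA on a validation set.

    decision_values: real-valued SVM outputs f(x); ibm_labels: 0/1 IBM labels.
    The standard SVM threshold is 0; here we search candidate thresholds.
    """
    dv = np.asarray(decision_values, dtype=float)
    y = np.asarray(ibm_labels)
    if candidates is None:
        candidates = np.unique(dv)  # assumed candidate grid
    best_theta, best_score = 0.0, -np.inf
    for theta in candidates:
        est = (dv > theta).astype(int)
        hit = np.mean(est[y == 1]) if np.any(y == 1) else 0.0
        fa = np.mean(est[y == 0]) if np.any(y == 0) else 0.0
        if hit - fa > best_score:
            best_score, best_theta = hit - fa, theta
    return best_theta, best_score
```

Lowering θ below zero flips borderline units to one, which is exactly the "shift the hyperplane a little" step described above.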
After we get the labels in each channel, we combine them to get a whole mask. We can further use auditory segmentation to improve the mask: for the voiced frames we use cross-channel correlation and envelope cross-channel correlation, and for the unvoiced frames we use onset and offset segmentation.
This figure illustrates the estimated masks. The first panel is the IBM, and to its right is the SVM-labeled binary mask. This mask is close to the IBM, but it misses some of the white regions. By using rethresholding we can enlarge the mask and increase the hit rate. You may also notice that the false alarm rate increases as well, but the point is that the hit rate increases more than the false alarm rate, so HIT−FA increases. Another thing to notice is that the false alarm errors are often isolated units; these units can be removed by the segmentation. So the last panel, the segmentation result, is pretty close to the IBM.
For the evaluation: for the training corpus we use one hundred utterances from the IEEE corpus, spoken by a female speaker, and we use three types of noise: speech-shaped noise, factory noise, and babble noise. For the pitch-based features we directly extract the ground-truth pitch from the target speech. We use mixtures at −5 dB and 0 dB and train on them together. For the test we use sixty utterances; these utterances are not seen in the training corpus. The noises are the speech-shaped, factory, and babble noise, and we also test on two new noises: white noise and cocktail party noise. At test time we cannot use the ideal information, so we use Jin and Wang's algorithm to extract the estimated pitch from the mixture. We test at −5 dB and 0 dB.
This is the classification result. We compare our system with Kim et al.'s system. Their system uses a Gaussian mixture model to learn the distribution of the AMS features and then uses a Bayesian classifier to do the classification. We chose this system because it improved speech intelligibility in listening tests. In the first table we can see that our proposed system achieves a very high HIT−FA rate, significantly better than Kim et al.'s system, and in terms of accuracy our method is also better.
Table two shows the results on the new noises. These two noises are not seen in the training corpus, but our system still performs very well, and the HIT−FA results are close to the results on the seen noises. This means that our system generalizes well to these two new noises.
This was an indirect comparison: the two systems use different features (we use AMS plus the pitch-based features, while they use only AMS), different classifiers, and we also incorporate the segmentation stage. So here we want to study the performance of the classifier only. We use exactly the same front end, the 25-channel mel-scale filterbank used in their system, only the AMS features, and the same training corpus. The only difference is the classifier: we use an SVM, they use a GMM. We can see that the HIT−FA results of the SVM are consistently better than the GMM results: for −5 dB, the improvement is roughly from two to five percent, and for 0 dB it starts from five percent. This improvement shows the advantage of the SVM over the GMM.
This is the demo: female speech mixed with factory noise at 0 dB. This is the noisy speech. [plays audio] This is the proposed result, where we use the estimated mask to resynthesize. [plays audio] And this is the IBM result. [plays audio] We can hear that our proposed result greatly improves the speech intelligibility and is close to the ideal.
To conclude our work: we treated the speech separation problem as binary classification; we used an SVM to classify each unit as one or zero; and we used the pitch-based features and the AMS features. Based on the comparisons, we can predict that our separation results will also significantly improve speech intelligibility for human listeners in noisy conditions. Our future work will test this. That's all. Thank you. Are there any questions?
Audience: Could you comment on the processing steps? I assume it is a batch type of processing, or is it able to be implemented as online processing with some latency?
Speaker: Can you say it again?
Audience: The processing steps you use to separate the signals: is it a batch type of processing, where you have the whole signal at once, or is it an online method, where you just have a little bit of latency and you process online?
Speaker: It is like batch processing: given a mixture, I can give you the separated speech. It is not online.
Audience: I would like to know if you can comment on the differences between the voiced and unvoiced phases, because the signal-to-noise ratio might be different, or it might be less critical to apply the binary mask to speech if it is unvoiced. What is the difference between voiced and unvoiced in terms of quality?
Speaker: In our work we use two kinds of features, the pitch-based features and the AMS features. The pitch-based features basically focus on the voiced parts, because for the unvoiced parts we don't have the pitch. But for the unvoiced parts we still have the AMS features, so the AMS features work for the unvoiced parts, and they also work for the voiced parts. So we combine them together; they are complementary features.
Audience: For finding the harmonics, you are using a correlation measure; I didn't get that at first. Is the correlation over time and frequency? You take the differences between adjacent frames and adjacent bins?
Speaker: You mean in the pitch extractor? For the estimated pitch we use Jin and Wang's algorithm, which uses the correlogram to extract the pitch, one pitch per frame.
Audience: One more question, please. You ran your experiments at zero and minus five dB?
Speaker: Yes, my results are at minus five and zero dB.
Audience: Right. My question is: you should be able to look at the masks you estimated at zero and minus five dB, and as the signal-to-noise ratio decreases, you should see erosion around the edges of your mask. So you should be able to somehow connect the masks at zero dB and minus five dB as the signal-to-noise ratio changes, say as it drops from zero to minus two point five or something. Have you tried looking at the case where there is a mismatch in the signal-to-noise ratio for your estimated masks?
Speaker: In this study, if the signal-to-noise ratio decreases, like to minus five dB, the masks become very different, so the performance decreases, as you can see.
Audience: Is it possible to interpolate the masks between those two limits, then?
Speaker: Right... sorry, I didn't get your point.
Session chair: Okay, with respect to time, let's thank the speaker for the contribution.