Okay, this is the last talk of this session. Today I'm going to present the work with my advisor: an SVM-based classification approach to speech separation.
This is the outline of the presentation. The first part is the introduction. Then we will talk about the feature extraction, then the unit labeling and the segmentation, and the last part is the experimental results.
In a real environment, the target speech is often corrupted by various types of interference, so the question is how we can remove or attenuate the background noise. This is the speech separation problem. In this study we focus only on monaural speech separation. It is very challenging, because we cannot use any location information; we can only use the intrinsic properties of the target and the interference.
So I will first introduce a very important concept: the ideal binary mask, IBM for short. It is the main computational goal of computational auditory scene analysis. The IBM is defined as follows.
Given a mixture, we decompose it into the time-frequency domain, a two-dimensional representation. For each T-F unit, we compare the speech energy and the noise energy. If the local SNR is larger than a local criterion, LC, the mask value is one; otherwise it is zero. In this way we convert the speech separation problem into a binary mask estimation problem.
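As a concrete illustration of this definition, here is a minimal sketch (the energy values and helper names are made up for illustration) of how the IBM could be computed from per-unit target and noise energies:

```python
import numpy as np

def ideal_binary_mask(target_energy, noise_energy, lc_db=0.0):
    """Compute the IBM: 1 where the local SNR exceeds the local criterion (LC)."""
    # Local SNR in dB for each time-frequency unit.
    snr_db = 10.0 * np.log10(target_energy / np.maximum(noise_energy, 1e-12))
    return (snr_db > lc_db).astype(int)

# Toy 2x3 grid of per-unit energies (illustrative values only).
target = np.array([[4.0, 1.0, 9.0],
                   [0.5, 8.0, 2.0]])
noise  = np.array([[1.0, 2.0, 3.0],
                   [1.0, 1.0, 2.0]])
mask = ideal_binary_mask(target, noise)  # LC = 0 dB
```

With LC = 0 dB this reduces to labeling one wherever the target energy is stronger than the noise energy in that unit.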
Previous studies have shown that if we use the IBM to resynthesize the mixture, we can get separated speech with very high intelligibility. IBM estimation is just labeling ones and zeros, so it is nothing but binary classification.
This figure illustrates the IBM. The first panel is the cochleagram of the target, and the second is the cochleagram of the noise. We mix them together; here is the cochleagram of the mixture. If we know the target and we know the noise, then for each unit we compare the energies and get this mask, the ideal binary mask. The white regions mean the target energy is stronger; the black regions mean the noise energy is stronger. This IBM comes from ideal information: you need to know the target and you need to know the noise.
What we will do instead is use features extracted from the mixture to estimate this mask. This is our goal.
This is the system overview. Given a mixture, we use a gammatone filterbank to decompose the mixture into 64 channels. In each channel, for each T-F unit, we extract features, including pitch-based features and amplitude modulation spectrum, or AMS, features. Once we have the features, we use a support vector machine to do the classification, classifying each unit as one or zero, and then we get a mask. We can use auditory segmentation to further improve this mask. Finally, we use the mask to resynthesize the mixture and obtain the separated speech.
For the feature extraction we have two types of features. The first one is the pitch-based features. For each T-F unit, we compute autocorrelation features at the pitch lag; of course, for the unvoiced frames there is no pitch, so we simply put zeros. We also compute delta features to capture the feature variations across time and frequency: we take the feature in the current unit minus the feature in the previous unit as the delta feature, and we compute the deltas of both pitch-based features. In total we have a six-dimensional pitch-based feature vector: the first two are the original features, two are the time-delta features, and two are the frequency-delta features.
The other type is the AMS features. For each T-F unit, we extract a 15-dimensional AMS feature. We use the same setup as Kim et al.'s 2009 paper, and we add the delta features, so for the AMS features we have a 45-dimensional feature vector.
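The delta computation described here can be sketched as follows; the array layout (channels × frames × dimensions) and the convention of a zero delta for the first unit along each axis are my assumptions, not details stated in the talk:

```python
import numpy as np

def add_deltas(feat):
    """Append time and frequency delta features.

    feat: array of shape (channels, frames, dims). Each delta is the feature
    in the current unit minus the feature in the previous unit along that axis.
    """
    d_time = np.diff(feat, axis=1, prepend=feat[:, :1, :])  # across frames
    d_freq = np.diff(feat, axis=0, prepend=feat[:1, :, :])  # across channels
    return np.concatenate([feat, d_time, d_freq], axis=-1)

# A 15-dimensional AMS feature per unit becomes 45-dimensional with deltas.
ams = np.random.rand(64, 100, 15)
full = add_deltas(ams)
```

The same routine applied to the two original pitch-based features would yield the six-dimensional pitch feature vector mentioned earlier.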
Now we have the features. We combine them together and use them to train an SVM. Once we finish training, we can use the discriminant function to do the classification. f(x) is the decision value computed from the SVM; it is a real number. The standard SVM uses the sign function, that is, zero as the threshold: if f(x) is positive, the label is one; otherwise it is zero. We train an SVM in each channel, and since we have 64 channels, we have 64 SVMs. We use the Gaussian kernel, and the parameters are selected by five-fold cross-validation.
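A sketch of this per-channel training with scikit-learn; the grid values below are placeholders of my own, since the talk only states a Gaussian (RBF) kernel with parameters chosen by five-fold cross-validation:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def train_channel_svms(features, labels, n_channels=64):
    """Train one RBF-kernel SVM per filterbank channel.

    features: (channels, units, dims); labels: (channels, units) of 0/1.
    C and the kernel width are picked by 5-fold cross-validation.
    """
    svms = []
    grid = {"C": [1, 10], "gamma": [0.1, 1.0]}  # illustrative search grid
    for c in range(n_channels):
        search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)
        search.fit(features[c], labels[c])
        svms.append(search.best_estimator_)
    return svms
```

At test time, each channel's units would be labeled by the corresponding channel's model.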
When people do classification, they usually use the classification accuracy to evaluate the performance. Here we also focus on another measurement: HIT minus FA (HIT−FA). For the classification results we have four types of outcomes: if the IBM is zero and the estimate is zero, it is a correct rejection; if the IBM is zero but the estimate is one, it is a false alarm error; if both are one, it is a hit; and if the IBM is one but the estimate is zero, it is a miss. We compute the hit rate and the false alarm rate and calculate the difference between them. We use this measure because it is well correlated with speech intelligibility.
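The four outcomes and the resulting score can be computed in a few lines. In this toy example (values mine), three of the four one-labeled units are detected and one of the four zero-labeled units is wrongly labeled one, so the hit rate is 3/4 and the false-alarm rate is 1/4:

```python
import numpy as np

def hit_minus_fa(ibm, est):
    """HIT - FA: hit rate minus false-alarm rate of an estimated mask vs. the IBM."""
    ibm = np.asarray(ibm).ravel()
    est = np.asarray(est).ravel()
    hit = np.mean(est[ibm == 1] == 1)  # fraction of 1-labeled units detected
    fa = np.mean(est[ibm == 0] == 1)   # fraction of 0-labeled units mislabeled 1
    return hit - fa

ibm = [1, 1, 1, 1, 0, 0, 0, 0]
est = [1, 1, 1, 0, 1, 0, 0, 0]
score = hit_minus_fa(ibm, est)  # 3/4 - 1/4 = 0.5
```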
Now we have a problem: the SVM is designed to maximize the classification accuracy instead of HIT−FA. If we want to maximize HIT−FA, we need to make a change. For HIT−FA we actually need to consider two kinds of errors, the miss errors and the false alarm errors; we want to balance these two kinds of errors and maximize this value. What we do is use a technique called rethresholding. The standard SVM uses zero as the threshold; here we choose a new threshold which maximizes HIT−FA in each channel. For example, if we have too many miss errors but only a few false alarm errors, we can shift the hyperplane a little bit and relabel some nearby points as one. By doing this we increase the hit rate, and so increase HIT−FA. We then use this new threshold: if the decision value is larger than θ, the label is one; otherwise it is zero. The θ is chosen on a small validation set.
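A minimal sketch of the rethresholding search. Searching over the unique decision values as candidate thresholds is my assumption about how θ might be selected on the validation set; the talk only says θ is chosen to maximize HIT−FA per channel:

```python
import numpy as np

def rethreshold(decision_values, ibm_labels, candidates=None):
    """Pick the decision threshold that maximizes HIT - FA on a validation set.

    decision_values: real-valued SVM outputs f(x); ibm_labels: 0/1 IBM labels.
    The standard SVM threshold is 0; here we search candidate thresholds.
    """
    dv = np.asarray(decision_values, dtype=float)
    y = np.asarray(ibm_labels)
    if candidates is None:
        candidates = np.unique(dv)  # assumed candidate grid
    best_theta, best_score = 0.0, -np.inf
    for theta in candidates:
        est = (dv > theta).astype(int)
        hit = np.mean(est[y == 1]) if np.any(y == 1) else 0.0
        fa = np.mean(est[y == 0]) if np.any(y == 0) else 0.0
        if hit - fa > best_score:
            best_score, best_theta = hit - fa, theta
    return best_theta, best_score
```

Lowering θ below zero flips borderline units to one, which is exactly the "shift the hyperplane a little" step described above.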
After we get the labels in each channel, we combine them to get a whole mask. We can further use auditory segmentation to improve the mask: for the voiced frames we use cross-channel correlation and envelope cross-channel correlation, and for the unvoiced frames we use onset and offset segmentation.
This figure illustrates the estimated masks. The first panel is the IBM, and to its right is the SVM-labeled binary mask. This mask is close to the IBM, but it misses some of the white regions. By using rethresholding we can enlarge the mask and increase the hit rate. You may also notice that the false alarm rate increases as well, but the point is that the hit rate increases more than the false alarm rate, so HIT−FA increases. Another thing to notice is that the false alarm errors are often isolated units; these units can be removed by the segmentation. So the last panel, the segmentation result, is pretty close to the IBM.
For the evaluation: for the training corpus we use one hundred utterances from the IEEE corpus, spoken by a female speaker, and we use three types of noise: speech-shaped noise, factory noise, and babble noise. For the pitch-based features we directly extract the ground-truth pitch from the target speech. We use mixtures at −5 dB and 0 dB and train on them together. For the test we use sixty utterances; these utterances are not seen in the training corpus. The noises are the speech-shaped, factory, and babble noise, and we also test on two new noises: white noise and cocktail party noise. At test time we cannot use the ideal information, so we use Jin and Wang's algorithm to extract the estimated pitch from the mixture. We test at −5 dB and 0 dB.
This is the classification result. We compare our system with Kim et al.'s system. Their system uses a Gaussian mixture model to learn the distribution of the AMS features and then uses a Bayesian classifier to do the classification. We chose this system because it improved speech intelligibility in listening tests. In the first table we can see that our proposed system achieves a very high HIT−FA rate, significantly better than Kim et al.'s system, and in terms of accuracy our method is also better.
Table two shows the results on the new noises. These two noises are not seen in the training corpus, but our system still performs very well, and the HIT−FA results are close to the results on the seen noises. This means that our system generalizes well to these two new noises.
This was an indirect comparison: the two systems use different features (we use AMS plus the pitch-based features, while they use only AMS), different classifiers, and we also incorporate the segmentation stage. So here we want to study the performance of the classifier only. We use exactly the same front end, the 25-channel mel-scale filterbank used in their system, only the AMS features, and the same training corpus. The only difference is the classifier: we use an SVM, they use a GMM. We can see that the HIT−FA results of the SVM are consistently better than the GMM results: for −5 dB, the improvement is roughly from two to five percent, and for 0 dB it starts from five percent. This improvement shows the advantage of the SVM over the GMM.
This is the demo: female speech mixed with factory noise at 0 dB. This is the noisy speech. [plays audio] This is the proposed result, where we use the estimated mask to resynthesize. [plays audio] And this is the IBM result. [plays audio] We can hear that our proposed result greatly improves the speech intelligibility and is close to the ideal.
To conclude our work: we treated the speech separation problem as binary classification; we used an SVM to classify each unit as one or zero; and we used the pitch-based features and the AMS features. Based on the comparisons, we can predict that our separation results will also significantly improve speech intelligibility for human listeners in noisy conditions. Our future work will test this. That's all. Thank you. Are there any questions?
Audience: Could you comment on the processing steps? I assume it is a batch type of processing, or is it able to be implemented as online processing with some latency?
Speaker: Can you say it again?
Audience: The processing steps you use to separate the signals: is it a batch type of processing, where you have the whole signal at once, or is it an online method, where you just have a little bit of latency and you process online?
Speaker: It is like batch processing: given a mixture, I can give you the separated speech. It is not online.
Audience: I would like to know if you can comment on the differences between the voiced and unvoiced phases, because the signal-to-noise ratio might be different, or it might be less critical to apply the binary mask to speech if it is unvoiced. What is the difference between voiced and unvoiced in terms of quality?
Speaker: In our work we use two kinds of features, the pitch-based features and the AMS features. The pitch-based features basically focus on the voiced parts, because for the unvoiced parts we don't have the pitch. But for the unvoiced parts we still have the AMS features, so the AMS features work for the unvoiced parts, and they also work for the voiced parts. So we combine them together; they are complementary features.
Audience: For finding the harmonics, you are using a correlation measure; I didn't get that at first. Is the correlation over time and frequency? You take the differences between adjacent frames and adjacent bins?
Speaker: You mean in the pitch extractor? For the estimated pitch we use Jin and Wang's algorithm, which uses the correlogram to extract the pitch, one pitch per frame.
Audience: One more question, please. You ran your experiments at zero and minus five dB?
Speaker: Yes, my results are at minus five and zero dB.
Audience: Right. My question is: you should be able to look at the masks you estimated at zero and minus five dB, and as the signal-to-noise ratio decreases, you should see erosion around the edges of your mask. So you should be able to somehow connect the masks at zero dB and minus five dB as the signal-to-noise ratio changes, say as it drops from zero to minus two point five or something. Have you tried looking at the case where there is a mismatch in the signal-to-noise ratio for your estimated masks?
Speaker: In this study, if the signal-to-noise ratio decreases, like to minus five dB, the masks become very different, so the performance decreases, as you can see.
Audience: Is it possible to interpolate the masks between those two limits, then?
Speaker: Right... sorry, I didn't get your point.
Session chair: Okay, with respect to time, let's thank the speaker for the contribution.