A multiple model particle filter glimpsing method

So the key idea over here is that we want to do joint separation and tracking of moving speakers in an enclosed setting.

We're using the glimpsing fact, where the sources can appear or disappear; basically, they can turn on or off.

So first I'm going to give an overview of convolutive time-invariant mixing.

Let's say we have two sources and two microphones. The sources are static in a room, and because of the multiple paths from each source to each sensor, the mixing process is convolutive; that's because of the reverberation.

Our goal over here is to demix these convolutively mixed signals.

However, if we want to do it in the time domain, it could be a complicated problem because of the convolution. So one trick that researchers often use is to transform the data into the frequency domain by use of the short-time Fourier transform, where the convolution in the time domain translates to multiplication in the frequency domain, for a large enough short-time Fourier transform window.

So in this case A of k is the mixing matrix at frequency bin k, and each bin can be viewed as a different independent component analysis problem.
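
As a rough illustration of this point, here is a minimal Python sketch. It is not from the talk: the frame length, the 8-tap filters and all names are assumptions. It uses circular convolution so that the identity is exact; with a real short-time Fourier transform it only holds approximately, when the window is long compared with the room response.

```python
import numpy as np

rng = np.random.default_rng(0)
n_src, n_mic, n = 2, 2, 64            # sources, microphones, frame length (all assumed)

def circular_convolve(h, s):
    """Circular convolution y[i] = sum_j h[j] * s[(i - j) mod n]."""
    idx = (np.arange(len(s))[:, None] - np.arange(len(s))[None, :]) % len(s)
    return (h[None, :] * s[idx]).sum(axis=1)

s = rng.standard_normal((n_src, n))   # one frame per source
h = np.zeros((n_mic, n_src, n))       # made-up room filters, zero beyond 8 taps
h[:, :, :8] = rng.standard_normal((n_mic, n_src, 8))

# Time domain: every microphone hears a sum of filtered sources.
x = np.zeros((n_mic, n))
for i in range(n_mic):
    for j in range(n_src):
        x[i] += circular_convolve(h[i, j], s[j])

# Frequency domain: one small matrix product per bin k.
S = np.fft.fft(s, axis=1)             # shape (n_src, n)
X = np.fft.fft(x, axis=1)             # shape (n_mic, n)
A = np.fft.fft(h, axis=2)             # A[:, :, k] is the mixing matrix at bin k
X_from_bins = np.einsum("ijk,jk->ik", A, S)

print(np.allclose(X, X_from_bins))
```

Each slice `A[:, :, k]` plays the role of the per-bin mixing matrix, which is why every bin can be treated as its own instantaneous mixing problem.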

ICA, independent component analysis, as we know, is indeterminate up to permutation. So if ICA is performed in each bin, further post-processing has to be done to correct for possible permutations.

Okay, so here I want to mention that the temporal dynamics of the sources in the time domain is the key to performing source separation in the frequency domain using ICA.

We showed in our previous papers, which are available online on our website, that basically each frame is a sample from a Gaussian with zero mean and a specific variance after it's transformed to the Fourier domain, and that's because of the central limit theorem.

So basically, if our signal in the time domain has an energy envelope that varies with time, you have a Gaussian in one frame and a different Gaussian with a different variance in another frame. So the overall distribution is of the form of a Gaussian scale mixture, which has a super-Gaussian form.
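
To make the super-Gaussian claim concrete, here is a small Python sketch. It is an illustration only: the two standard deviations (0.3 and 2.0), the equal frame probabilities and the use of real-valued samples are assumptions, not values from the talk. It draws one zero-mean Gaussian sample per frame with a frame-dependent variance and compares the excess kurtosis with that of a plain Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames = 100_000

# Assumed two-level energy envelope: quiet frames and loud frames.
frame_std = rng.choice([0.3, 2.0], size=n_frames)

# Each frame is zero-mean Gaussian, but with its own variance.
gsm_samples = frame_std * rng.standard_normal(n_frames)
gauss_samples = rng.standard_normal(n_frames)

def excess_kurtosis(v):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    v = v - v.mean()
    return np.mean(v**4) / np.mean(v**2) ** 2 - 3.0

k_gauss = excess_kurtosis(gauss_samples)
k_gsm = excess_kurtosis(gsm_samples)
print(f"Gaussian: {k_gauss:.2f}, scale mixture: {k_gsm:.2f}")
```

A positive excess kurtosis for the pooled samples is one common way to say that the distribution is super-Gaussian: it is more peaked and heavier-tailed than a single Gaussian.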

In this paper we use a fixed Gaussian scale mixture, approximated by a finite mixture of Gaussians.

So here we have the mixture of Gaussians. These parameters here are fixed beforehand, because they all fall into the super-Gaussian form. We're not really going to give ourselves a headache trying to estimate the parameters here, so that we can focus on other interesting aspects of speech-like signals.

So basically we have this mixture of Gaussians for each of the sources. Because of independence, the overall distribution, the joint density of the sources, is also a mixture of Gaussians.

In the previous slide I talked about how the temporal dynamics is the key to source separation in the frequency domain. Now I'm going to introduce another form of temporal dynamics, and that's the glimpsing fact, in which the sources can turn on and off with time.

This is typical of speech, where we have silence periods.

So in this case we have three sources and three microphones.

In this time period over here, only the first source is active, so that means that the first column of the mixing matrix is used for the mixing. This is basically looking at any one frequency bin, so the first column of the mixing matrix in each frequency bin is used.

Now, in this time period, all three sources are active, so the full mixing matrix is used for the mixing process.

And then let's say the third source becomes silent, and only the first and second columns of the mixing matrix are used.
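
The three time periods just described can be sketched in a few lines of Python. This is only an illustration: the matrix entries and source values are made up, real numbers stand in for the complex values of a frequency bin, and `mix` is a hypothetical helper name.

```python
import numpy as np

# Made-up 3x3 mixing matrix for one frequency bin (real-valued for simplicity).
A = np.array([[1.0, 0.5, 0.2],
              [0.3, 1.0, 0.4],
              [0.1, 0.6, 1.0]])
s = np.array([1.0, -2.0, 0.5])        # made-up source values in this bin

def mix(A, s, active):
    """Mix using only the columns of A whose sources are active."""
    active = np.asarray(active, dtype=bool)
    return A[:, active] @ s[active]

x_first_only = mix(A, s, [True, False, False])   # only column 1 is used
x_all = mix(A, s, [True, True, True])            # full mixing matrix
x_first_two = mix(A, s, [True, True, False])     # third source silent
print(x_first_only, x_all, x_first_two)
```

Silencing a source simply removes its column from the product, which is what makes the silence gaps informative about the remaining columns.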

So by glimpsing in the silence gaps, we're able to hopefully achieve better results. This is also one strategy that the human ear uses to handle adverse conditions.

Now we're going to move on to time-varying mixing in the frequency domain. That's when the sources are moving around, so basically the mixing matrix varies with time.

And here the emphasis is that incorporating glimpsing is crucial in time-varying online demixing, because if the model state is not correct, the mixing matrix estimation diverges or becomes unstable.

Just to give some more explanation on that: later on we're going to introduce particle filters, and we use particle filters to simulate the columns of the mixing matrix.

So for example, if we're in a case where the third source is silent, the particles that simulate the third source in this time period are going to diverge, or are going to just drift away to a location that's undesirable for us, because basically the source is inactive and they don't have any information about it.

So when the third source turns back on, the particles might have drifted to too far away a location and not be able to attain a track again.

So it's very crucial for time-varying online demixing to incorporate this silence information.

And also the problem becomes even more complicated when the source moves while being silent. We call this phenomenon silence blind zones, which is similar to Doppler blind zones in radar target tracking.

So basically, if the sources are silent and also moving, the problem becomes more complicated.

So here I'm going to introduce the general model for the glimpsing strategy we use here. We assume that each source can take on two states, either active or silent.

So for a total of M sources there will be a total of two to the power M states. The states can be different for different frequency bins, and indicate which source is present or absent for each frequency bin.

At each time, the set of active sources is a subset of the set of the total number of sources.

So for example, let's say you have three sources. State i could correspond to the case where the first and second ones are active and the third source is silent.
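
The two-to-the-power-M state space is easy to enumerate. A minimal Python sketch, where the function name and the tuple-of-booleans encoding are assumptions made for illustration:

```python
from itertools import product

def activity_states(num_sources):
    """All on/off patterns for the sources: 2**num_sources tuples of booleans."""
    return [tuple(bits) for bits in product([False, True], repeat=num_sources)]

states = activity_states(3)
print(len(states))                        # number of activity patterns
print((True, True, False) in states)      # sources 1 and 2 active, source 3 silent
```

In the setting described in the talk, one such pattern would be tracked per frequency bin.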

To continue with the general model, we're going to introduce our observation model over here, the relationship between our observation and our states of interest.

So here, for each discrete state i pertaining to a particular activity pattern, our observation is going to be a mixture of Gaussians. That's because the density of our sources is a mixture of Gaussians, so our observation also becomes a mixture of Gaussians.

So for example, if state i corresponds to the case where the first and second sources are active and the third source is silent, then the mixing matrix for state i contains only the first and second columns, and the third column is not used.
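
One way such a state-dependent likelihood could be evaluated is sketched below in Python. This is a sketch under stated assumptions, not the model from the slides: it assumes zero-mean circular complex Gaussian components, an additive sensor-noise term `noise_var` (added here so the covariance stays invertible when few sources are active), and hypothetical names throughout.

```python
import numpy as np
from itertools import product

def complex_gaussian_pdf(x, cov):
    """Density of a zero-mean circular complex Gaussian with covariance cov."""
    quad = np.real(np.vdot(x, np.linalg.solve(cov, x)))   # x^H cov^{-1} x
    return np.exp(-quad) / (np.pi ** len(x) * np.real(np.linalg.det(cov)))

def observation_likelihood(x, A, active, comp_weights, comp_vars, noise_var):
    """Mixture-of-Gaussians likelihood of x given which columns of A are active."""
    cols = A[:, np.asarray(active, dtype=bool)]            # active columns only
    total = 0.0
    # Sum over every combination of mixture components of the active sources.
    for combo in product(range(len(comp_weights)), repeat=cols.shape[1]):
        weight = np.prod([comp_weights[q] for q in combo])
        var = np.array([comp_vars[q] for q in combo])
        cov = (cols * var) @ cols.conj().T + noise_var * np.eye(len(x))
        total += weight * complex_gaussian_pdf(x, cov)
    return total

A = np.eye(2, dtype=complex)                    # made-up 2x2 mixing matrix
x = np.array([3.0, 0.0], dtype=complex)         # observation along column 1
like_first = observation_likelihood(x, A, [True, False], [0.7, 0.3], [0.1, 2.0], 0.1)
like_second = observation_likelihood(x, A, [False, True], [0.7, 0.3], [0.1, 2.0], 0.1)
print(like_first > like_second)
```

The observation lies along the first column, so the state in which only the first source is active explains it better than the state in which only the second is active.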

So here we're going to introduce our channel model, and that's the evolution of the columns of the mixing matrices. We use a random walk model, and the reason is that we don't have any prior information on how the channels vary with time from one location in the room to another location. So we have no choice but to use a random walk model, where u over here is a Gaussian random vector with a diagonal covariance.

And also for the discrete states, which correspond to different activity patterns, we have a Markovian property for the transitions. So we have a transition matrix pi, where each element pi i j is the probability of going from state i to state j.
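
The two dynamic models just described can be sketched as follows. The Python below is an illustration under assumptions: the step size, the "sticky" transition matrix with 0.9 on the diagonal, and the function names are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_walk_step(a_prev, step_std, rng):
    """a_t = a_{t-1} + u_t, with u_t complex Gaussian and diagonal covariance."""
    u = rng.standard_normal(a_prev.shape) + 1j * rng.standard_normal(a_prev.shape)
    return a_prev + step_std * u

def sample_next_state(state, transition, rng):
    """Draw the next activity state from row `state` of the transition matrix."""
    return int(rng.choice(len(transition), p=transition[state]))

# Assumed transition matrix for 4 activity states (2 sources): states tend to persist.
transition = np.full((4, 4), 0.1 / 3)
np.fill_diagonal(transition, 0.9)

a_prev = np.array([1.0 + 0.5j, 0.2 - 0.3j])     # one column of the mixing matrix
a_next = random_walk_step(a_prev, step_std=0.05, rng=rng)
next_state = sample_next_state(2, transition, rng)
print(a_next, next_state)
```

The diagonal covariance shows up here as independent noise on every entry of the column; a per-entry `step_std` array would work the same way.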

So here we get to why we have to use particle filters for this problem.

As we can see in this basic equation, the relationship between our observation and our continuous states a has a nonlinear, non-Gaussian form. So we cannot use standard optimal Kalman filtering techniques to track these columns of the mixing matrices. We have to resort to so-called suboptimal techniques like particle filtering.

In a particle filter, every state, whether it be continuous or discrete, is represented with a cloud of particles. If the states are continuous, the particles are also continuous; if the states are discrete, the particles are discrete.

And we also have to use a multiple model particle filter, because we have to be able to switch between the different states of activity.

So a set of continuous particles is used to represent the mixing matrices, and a set of discrete particles is used to represent the discrete state of activity.

So I'm just going to walk you guys through our multiple model particle filter.

Basically, we have continuous states a_m that are simulated by a_m^n; these are the particles that simulate a_m. And we have our discrete states x that are simulated by particles x^n.

We initialize these particles using an initial prior, and we give them uniform weights. So w_m^n are the weights for a_m^n, and there is a second set of weights for x^n.

We classify the particles into sets corresponding to different activity states. And n here corresponds to the index of the particle.

The next step is that we predict a new set of particles by drawing a new set of samples at time t according to the state transition described before.

So basically, if state i contains column m, we go ahead and make a prediction for a new set of particles. If state i does not contain column m, we just leave it as is.
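
The prediction rule above, move a column's particles only when that source is active, can be sketched like this. The array layout (particles by sources by microphones), the per-particle activity mask and the step size are assumptions made for the example.

```python
import numpy as np

def predict_columns(particles, active, step_std, rng):
    """Random-walk prediction applied only to active columns.

    particles: complex array (N, M, P) for N particles, M sources, P microphones.
    active:    bool array (N, M); active[n, m] says whether source m is active
               in the activity state carried by particle n.
    """
    u = rng.standard_normal(particles.shape) + 1j * rng.standard_normal(particles.shape)
    # Inactive columns get zero noise, so they stay exactly where they were.
    return particles + step_std * u * active[:, :, None]

rng = np.random.default_rng(0)
particles = np.zeros((5, 3, 2), dtype=complex)
active = np.zeros((5, 3), dtype=bool)
active[:, :2] = True                              # third source silent everywhere
predicted = predict_columns(particles, active, step_std=0.1, rng=rng)
print(np.abs(predicted).max(axis=(0, 2)))         # spread per source; last is 0
```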

So this is how we avoid the drifting of particles whenever we have silences.

Also, by keeping a memory of the silence patterns of the sources based on previous frames, the covariance of the cloud of particles can be increased temporarily. This way the cloud of particles during the silence blind zones would be large enough to find a track once the source becomes active again.

So by keeping this memory buffer of the previous silence patterns and increasing the variance for the silent sources, we're able to deal with the silence blind zones.
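
The talk does not spell out how the covariance is increased. One possible way, shown here purely as an assumption, is to spread the cloud about its own mean by a factor that grows with the number of consecutive silent frames remembered in the buffer; `growth_per_frame` and `max_scale` are hypothetical tuning parameters.

```python
import numpy as np

def inflate_silent_cloud(particles, silent_frames, growth_per_frame=0.05, max_scale=3.0):
    """Widen one column's particle cloud about its mean during silence.

    particles:     array (N, P), the N particles of one mixing-matrix column.
    silent_frames: how many consecutive frames the source has been silent.
    """
    scale = min(1.0 + growth_per_frame * silent_frames, max_scale)
    center = particles.mean(axis=0, keepdims=True)
    return center + scale * (particles - center)

rng = np.random.default_rng(0)
cloud = rng.standard_normal((1000, 2))
wider = inflate_silent_cloud(cloud, silent_frames=10)      # scale factor 1.5
print(cloud.std(axis=0), wider.std(axis=0))
```

The mean of the cloud, and therefore the column estimate, is left untouched; only the search region grows, so a source that moved during its silence can still be picked up.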

Now in this step we update our weights. We only update the weights if state i contains column m; if state i does not contain column m, we just keep the weights as is. This is using the standard bootstrap particle filter.

We do the same thing for the weights of the discrete states.

And then we normalize our weights in order to achieve a meaningful probability, and from there we can obtain a probability for each state.

We also do the same thing for our column weights, and from there we can estimate the mixing matrix columns by a weighted average. And if the particles become degenerate, we can resample them.
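
The update, normalize, estimate and resample steps can be sketched for a single column as follows. This is a generic bootstrap-style sketch with made-up numbers; multinomial resampling and the effective-sample-size measure of degeneracy are common choices assumed here, not details given in the talk.

```python
import numpy as np

def update_weights(weights, likelihoods, active):
    """Multiply weights by the likelihood only where the column is active."""
    return np.where(active, weights * likelihoods, weights)

def normalize(weights):
    return weights / weights.sum()

def weighted_mean(particles, weights):
    """Column estimate as the weighted average of its particles (N, P)."""
    return (weights[:, None] * particles).sum(axis=0)

def effective_sample_size(weights):
    """Small values indicate a degenerate set of weights."""
    return 1.0 / np.sum(weights ** 2)

def resample(particles, weights, rng):
    """Multinomial resampling: draw particles in proportion to their weights."""
    idx = rng.choice(len(weights), size=len(weights), p=weights)
    return particles[idx], np.full(len(weights), 1.0 / len(weights))

rng = np.random.default_rng(0)
particles = np.array([[0.0], [1.0], [2.0], [3.0]])
weights = np.full(4, 0.25)
likelihoods = np.array([1.0, 1.0, 2.0, 4.0])       # made-up likelihood values
active = np.ones(4, dtype=bool)

weights = normalize(update_weights(weights, likelihoods, active))
estimate = weighted_mean(particles, weights)
ess = effective_sample_size(weights)
if ess < 0.75 * len(weights):                      # assumed degeneracy threshold
    new_particles, new_weights = resample(particles, weights, rng)
print(weights, estimate, ess)
```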

And at the end, once we obtain these estimates of our mixing matrix columns, we can use a minimum mean square error estimator to reconstruct the sources.

Then the permutation in the frequency bins is corrected using the correlation method on the activity patterns. This is work by Sawada and others from Japan.

By keeping a memory of the past estimates of the sources in each frequency bin, as we move on with our separation process we are able to achieve better permutation correction.

Once the time-varying mixing matrices are found, the sources' time-varying directions of arrival with respect to the microphone array can be found, and this is work again by Sawada and others from Japan.

So if we have another array in a different position in the room, we can find a different direction of arrival.

However, once the sources are separated, we can easily associate each source from one array to another using a simple correlation method.

Hence we avoid the possibility of ghost locations. If we have just the directions of arrival from the two arrays, as in the picture on the right, we have a possibility of two ghost locations.

Now, if we have separation, we can easily associate each source from one array to the other, and we avoid this ghost problem.
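
A correlation-based association between two arrays could look roughly like the Python sketch below. The use of amplitude envelopes, the exhaustive search over permutations and the synthetic data are assumptions for illustration; the talk only says that a simple correlation method is used.

```python
import numpy as np
from itertools import permutations

def associate_sources(env_a, env_b):
    """Match separated sources from array A to those from array B.

    env_a, env_b: arrays (M, T) of amplitude envelopes, one row per source.
    Returns a tuple perm such that env_b[perm[i]] goes with env_a[i].
    """
    m = env_a.shape[0]
    corr = np.corrcoef(env_a, env_b)[:m, m:]       # cross-correlation block (M, M)
    return max(permutations(range(m)),
               key=lambda p: sum(corr[i, p[i]] for i in range(m)))

rng = np.random.default_rng(0)
env_a = np.abs(rng.standard_normal((3, 500)))       # synthetic envelopes at array A
# Array B sees the same sources in a different order, plus a little noise.
env_b = env_a[[2, 0, 1]] + 0.05 * rng.standard_normal((3, 500))

perm = associate_sources(env_a, env_b)
print(perm)
```

Once each direction of arrival at one array is paired with the right one at the other, only the true intersections remain and the ghost intersections are discarded.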

Also, at the end, a multiple model constant velocity, constant acceleration kinematic motion model on the spatial dynamics of the sources is implemented using, again, multiple model particle filtering.

So this is using another multiple model particle filter, now to track the spatial motion of the sources. The multiple model particle filter we use here is very similar to the one that we use for our separation problem.
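
For reference, the two textbook kinematic models named here can be written as state-transition matrices. The sketch below is noise-free and per coordinate, and the shared state layout (position, velocity, acceleration) and the names are assumptions; the talk does not give these details.

```python
import numpy as np

def constant_velocity(dt):
    """Position advances with velocity; acceleration is reset to zero."""
    return np.array([[1.0, dt, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 0.0]])

def constant_acceleration(dt):
    """Position and velocity both advance under a constant acceleration."""
    return np.array([[1.0, dt, 0.5 * dt ** 2],
                     [0.0, 1.0, dt],
                     [0.0, 0.0, 1.0]])

MODELS = {"cv": constant_velocity, "ca": constant_acceleration}

def propagate(state, model, dt):
    """Advance [position, velocity, acceleration] by dt under the chosen model."""
    return MODELS[model](dt) @ state

state = np.array([0.0, 1.0, 2.0])
cv_next = propagate(state, "cv", dt=2.0)
ca_next = propagate(state, "ca", dt=2.0)
print(cv_next, ca_next)
```

In a multiple model filter, each particle would carry a model label in addition to its kinematic state, in the same way that the separation filter carries an activity state.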

So here we have some of our results. We have two arrays, one over here with only two microphones and one over there with only two microphones.

This is simulated in a room where the reverberation time is about two hundred milliseconds. We have a thousand particles for each of the frequency bins.

The two sources are moving clockwise, kind of chasing each other. Cyan and magenta are the two true trajectories; blue and red are the estimated trajectories.

The total duration of each source was about twelve and a half seconds on average, with the source being active for only about five and a half seconds on average. Therefore we have about seven seconds of silence blind zones, which makes the problem really difficult.

So here I'm going to show you the video of the tracking process.

We have the circle and we have the triangle. The circle is the true trajectory; the triangle is the estimated trajectory. The shapes turn green, they fill with green, whenever the source becomes active. So when the circle becomes active, that's the true activity pattern; when the triangle becomes active, that's the estimated activity pattern.

As you can see, we start from an initial estimate, and the triangle tries to catch up with the circle. That's because when the source is silent we don't move the particles around; we don't want them to drift around.

Here I'm going to show the average position root mean square error of the trajectory, comparing our method with an online ICA algorithm.

As we can see, our method basically does better than online ICA.

These spikes over here correspond to the silence periods.

um we have a S the are or we here

And just to conclude: we used the glimpsing strategy for joint separation and tracking. We showed that it's necessary, and we're able to deal with silence blind zones.

And because we separate the sources first, we don't have the problem of ghosts.

Thank you very much.

Do we have questions? There is a question. Yeah.

so some work done earlier or by a you do that

um

i think in that

range of about to three taps

i

to talked about post process is for such a problem in using particle filters

this this "'cause" they you you can turn not is on or off using using

i using this kind of process and this work to show that this is a very very effective than the

clap just a little complexity of the problem can

i and i is

to protest process approach to so

like but not in a a and a great detail well it's it was a it looks like it's we

we have a would be a very uh works well process yeah and you think that so that's true

more

i you you i basically writing can we had local station so you know there is no i we shown

so as as i

No, basically, yes, just line of sight. However, the DOA estimation algorithm is sufficient to find the direction of arrival just using the direct path.

now yeah right okay

i am oh okay again