0:00:13 ...a multiple-model particle filter blinking method. So the key idea here is, rather than separating and then tracking, to do joint separation and tracking of moving speakers in an enclosed setting, and we're using the blinking effect, where the sources can appear or disappear; basically, they can turn on and off sporadically with time.
0:00:40 So first I'm going to give an overview of convolutive time-invariant mixing. Let's say we have two sources and two microphones in a room, and the sources are static. Because of the multiple paths from each source to each sensor, the mixing process is modeled in a convolutive manner, because of the reverberation.
0:01:03 Our goal here is to demix these convolutively mixed signals. However, if we want to do it in the time domain, it can be a complicated problem because of the convolution. So one trick that researchers often use is to transform the data to the frequency domain by means of the short-time Fourier transform, where convolution in the time domain translates to multiplication in the frequency domain, for a large enough short-time Fourier transform window.
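The convolution-to-multiplication property the talk relies on can be checked numerically in a few lines; this is only an illustrative NumPy sketch with toy signals, not part of the system described:

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.standard_normal(256)   # a source frame
h = rng.standard_normal(16)    # a short room impulse response

# Time-domain mixing: linear convolution of the source with the channel
x_time = np.convolve(s, h)

# Frequency domain: with an FFT of length at least len(s) + len(h) - 1,
# the same mixing is just a per-bin multiplication
n = len(s) + len(h) - 1
x_freq = np.fft.ifft(np.fft.fft(s, n) * np.fft.fft(h, n)).real

assert np.allclose(x_time, x_freq)
```

With a shorter FFT window the equality only holds approximately, which is why the talk stresses "large enough" STFT windows.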
0:01:32 So in this case, A(k) is the mixing matrix at frequency bin k.
0:01:44 Each bin can be viewed as a different independent component analysis problem, and ICA, independent component analysis, as we know, is subject to a permutation ambiguity. So if ICA is performed in each bin separately, post-processing has to be done to correct for possible permutations.
0:02:06"'kay" so here we gonna mention um
0:02:10a a source
0:02:11the the temporal dynamic
0:02:13dynamics of the sources
0:02:15in the time domain is the chi to uh
0:02:18to perform um
0:02:20a source separation and the frequency domain
0:02:23using ica
0:02:24 And we showed in our previous papers, which are available online on our website, that basically each frame is a sample from a Gaussian with zero mean and a specific variance, after it is transformed to the frequency domain, and that's because of the central limit theorem.
0:02:51 So basically, if our signal in the time domain has an energy envelope that varies with time, then you have a Gaussian in one frame and a different Gaussian with a different variance in another frame, so the overall distribution is of the form of a Gaussian scale mixture, which has a super-Gaussian form.
0:03:22 So in this paper we use a fixed Gaussian scale mixture, approximated using a finite mixture of Gaussians. The parameters here are fixed beforehand, because they all fall into the super-Gaussian form, so we're not really going to try to estimate these parameters; instead we can focus on other interesting aspects of the speech-like signals.
0:03:59 So basically we have this mixture of Gaussians for each of the sources, and because of independence, the overall joint density of the sources is also a mixture of Gaussians.
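The super-Gaussian claim can be verified with a short moment calculation; the component weights and variances below are made up for illustration, not the fixed parameters used in the paper:

```python
import numpy as np

# Hypothetical fixed GSM parameters: component weights and variances
w = np.array([0.7, 0.3])
v = np.array([0.5, 5.0])

# Zero-mean Gaussian-mixture moments: E[x^2] = sum(w*v), E[x^4] = 3*sum(w*v^2)
m2 = np.sum(w * v)
m4 = 3.0 * np.sum(w * v * v)
excess_kurtosis = m4 / m2 ** 2 - 3.0

# Super-Gaussian: heavier tails than a single Gaussian of the same power
assert excess_kurtosis > 0.0
```

By the Cauchy-Schwarz inequality the excess kurtosis of any non-degenerate zero-mean scale mixture is strictly positive, which is what makes the mixture-of-Gaussians prior a good fit for speech.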
0:04:16 In the previous slide I talked about how the temporal dynamics tie to source separation in the frequency domain. Now I'll introduce another form of temporal dynamics, and that's the blinking effect, in which the sources can basically turn on and off sporadically with time. This is due to the nature of speech, where we have silence periods.
0:04:39 So in this case we have three sources and three microphones. In this time period, only the first source is active, so that means only the first column of the mixing matrix is used for the mixing; here we're basically looking at any single frequency bin, so the first column of the mixing matrix in each frequency bin is used. In this time period, all three sources are active, so the full mixing matrix is used for the mixing process. And then, let's say, the third source becomes silent, and only the first and second columns mix.
0:05:30 By exploiting the silence gaps, we are hopefully able to achieve better results; this is also one strategy that the human ear uses to handle adverse conditions.
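The active-column idea above can be made concrete with a tiny sketch (toy numbers, one frequency bin): when a source is silent, dropping its column from the mixing matrix changes nothing in the observation:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))   # full mixing matrix in one frequency bin
s = rng.standard_normal(3)        # the three sources at one time frame
s[2] = 0.0                        # the third source is silent here

# Mixing with the full matrix equals mixing with only the active columns
x_full = A @ s
x_active = A[:, :2] @ s[:2]

assert np.allclose(x_full, x_active)
```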
0:05:46 Now we move on to time-varying mixing in the frequency domain; that's when the sources are moving around, so the mixing matrix varies with time. And here the emphasis is that incorporating blinking is crucial in time-varying online demixing, because if the modeled state is not correct, the demixing estimation becomes unstable.
0:06:13 Just to give some more explanation on that: later on we'll introduce particle filters, and we use particle filters to simulate the columns of the mixing matrix. So for example, if we're in a case where the third source is silent, the particles that simulate the third source's column in this time period are going to diverge, or just drift to a location that's undesirable for us, because basically the source is inactive and the particles don't have any information about it. So when the third source turns back on, the particles might have drifted to a location so far away that they are not able to attain the track again. So it's very crucial, for time-varying online demixing, to incorporate this silence state.
0:07:05 And the problem becomes even more complicated when the sources move while being silent. We call this phenomenon silence blind zones, which is similar to Doppler blind zones in radar target tracking. So basically, if the sources are both silent and moving, the problem becomes harder, and we'll talk about this later on.
0:07:33 So here I introduce the general model we built for the blinking strategy we use here. We assume that each source can take on two states, either active or silent, so for a total of M sources there will be a total of 2^M states. The states can be different for different frequency bins, and they indicate which source, in each frequency bin, is present or absent; at each time, the set of active sources is a subset of the set of all sources. So for example, say we have three sources; then state i could correspond to the case where the first and second sources are active and the third source is silent.
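The 2^M activity states can be enumerated as bitmasks; this is a small illustrative sketch, not the paper's notation:

```python
M = 3  # number of sources

# Each activity state is the subset of sources that are "on";
# there are 2**M such subsets in every frequency bin
states = [tuple(m for m in range(M) if (i >> m) & 1) for i in range(2 ** M)]

assert len(states) == 2 ** M
# The example state from the talk: sources 1 and 2 active, source 3 silent
assert (0, 1) in states
```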
0:08:30 To continue with the generative model, we introduce our observation model here: the relationship between our observations and our states of interest. So here, for each discrete state i, pertaining to a particular activity pattern, our observation is going to be a mixture of Gaussians, and that's because the densities of our sources are mixtures of Gaussians, so our observations also become mixtures of Gaussians.
0:09:05 So for example, if state i corresponds to the case where the first and second columns are active and the third source is silent, then the mixing matrix for state i basically has only the first and second columns, and the third column is not used.
0:09:29 So here we introduce our channel model, and that's the evolution of the columns of the mixing matrices. We use a random-walk model, and the reason is that we don't have any prior information on how the channels vary with time from one location in the room to another, so we have no choice but to use a random walk, where u here is a Gaussian random vector with a diagonal covariance.
0:09:59 And also, for the discrete states that correspond to the different activity patterns, we have a Markovian property for the transitions: we have a transition matrix Pi, where each element pi_ij is the probability of transitioning from state i to state j.
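A minimal sketch of such a Markov chain over activity states; the transition matrix below, which favors staying in the current state, is an assumption for illustration, not the paper's values:

```python
import numpy as np

rng = np.random.default_rng(2)
n_states = 2 ** 2   # e.g. two sources -> four activity states

# Row-stochastic transition matrix: pi[i, j] = P(next state j | state i);
# the 0.9 self-transition probability is made up for this example
pi = np.full((n_states, n_states), 0.1 / (n_states - 1))
np.fill_diagonal(pi, 0.9)
assert np.allclose(pi.sum(axis=1), 1.0)

# Simulate the activity pattern over a few frames
state, path = 0, [0]
for _ in range(20):
    state = int(rng.choice(n_states, p=pi[state]))
    path.append(state)
```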
0:10:24 So here we get to why we have to use particle filters for this problem. As we can see in this basic equation, the relationship between our observations and our continuous states has a nonlinear, non-Gaussian form, so we cannot use Kalman filtering techniques to track these columns of the mixing matrices; we have to resort to so-called suboptimal techniques like particle filtering.
0:11:06 So in a particle filter, every state, whether continuous or discrete, is represented with a cloud of particles: if the states are continuous the particles are continuous, and if the states are discrete the particles are discrete. And we also have to use a multiple-model particle filter, because we have to be able to switch between the different states of activity. So a set of continuous particles is used to represent the mixing matrices, and a set of discrete particles is used to represent the discrete states of activity.
0:11:39 I'm just going to walk you through our multiple-model particle filter. So basically we have continuous states a_t, simulated by the particles a_t^(m), and we have discrete states x_t, simulated by the particles x_t^(m). We initialize these state particles using an initial prior and give them uniform weights; so w_t^(m) are the weights for a_t^(m), and r_t^(m) are the weights for x_t^(m). We then classify the particles into sets corresponding to the different activity states, so the index set I_i here corresponds to the indices of the particles that have state i as their state.
0:12:42 So the next step is that we predict a new set of particles by drawing new samples at time t according to the state transition described before. So basically, if state i contains column m, we make a prediction for a new set of particles; if state i does not contain column m, we just leave it as it is. This is how we avoid the diffusion of the particles whenever we have silences.
0:13:16 Also, by keeping a memory of the silence spans of the sources over the previous frames, the covariance of the cloud of particles can be increased temporarily, so that the cloud of particles during the silence blind zones will be large enough to find the track once the source becomes active again. So by keeping this memory buffer of the previous silence patterns, and increasing the variance for the silent sources, we are able to deal with the silence blind zones.
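The prediction step just described can be sketched for a single mixing column; the step size and inflation factor here are assumptions for illustration, and the talk's exact inflation rule is not specified:

```python
import numpy as np

rng = np.random.default_rng(3)
particles = rng.standard_normal((500, 2))   # cloud for one mixing column

SIGMA_WALK = 0.05   # random-walk step size (assumed, not from the talk)
INFLATE = 1.5       # per-frame spread factor for silent sources (assumed)

def predict(particles, active, silent_frames=0):
    """One prediction step for one column's particle cloud."""
    if active:
        # Active source: the usual random-walk prediction
        return particles + SIGMA_WALK * rng.standard_normal(particles.shape)
    # Silent source: keep the cloud but widen it around its mean, so that
    # after a silence blind zone it is broad enough to reacquire the track
    mean = particles.mean(axis=0)
    return mean + INFLATE ** silent_frames * (particles - mean)

widened = predict(particles, active=False, silent_frames=3)
assert widened.std() > particles.std()
```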
0:13:46 Now in this step we update our weights. So basically, we only update the weights for which state i contains column m; if state i does not contain column m, we just keep the weights as they are. This uses the standard bootstrap particle filter. We do the same thing for the discrete weights; for the discrete states, the weights are the r's. And then we normalize the weights in order to obtain a meaningful probability.
0:14:33 And then from there we can obtain a probability estimate for each state, and we do the same for our column weights. From there we can estimate the mixing matrix columns by a weighted average, and if our particles become degenerate, we can resample them.
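These steps, bootstrap weight update, normalization, weighted-average estimate, and resampling, can be sketched for a toy scalar state; the Gaussian likelihood, observation value, and resampling threshold below are stand-ins, not the paper's exact model:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
particles = rng.standard_normal(n) + 1.0   # toy scalar "column" particles
weights = np.full(n, 1.0 / n)

# Bootstrap weight update: multiply by the observation likelihood
# (a stand-in Gaussian likelihood around a pretend observation)
obs, obs_var = 1.2, 0.1
weights *= np.exp(-0.5 * (particles - obs) ** 2 / obs_var)
weights /= weights.sum()                   # normalize to a probability

# Point estimate of the column: the weighted particle average
estimate = float(np.sum(weights * particles))

# Systematic resampling when the effective sample size degenerates
if 1.0 / np.sum(weights ** 2) < n / 2:
    positions = (rng.random() + np.arange(n)) / n
    idx = np.minimum(np.searchsorted(np.cumsum(weights), positions), n - 1)
    particles, weights = particles[idx], np.full(n, 1.0 / n)

assert abs(estimate - obs) < 0.3
```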
0:14:56 And at the end, once we obtain these estimates, our mean mixing matrix, we can use a minimum mean square error estimator to reconstruct the sources.
0:15:10 Then the permutation in the frequency bins is corrected using the correlation method on the activity patterns; this is work by Sawada and others from Japan, where one keeps a memory of the past estimates of the sources in each frequency band. So as we move on with the separation process, we are able to achieve better permutation correction.
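The correlation-based permutation correction can be sketched with toy amplitude envelopes; this is a simplified two-source illustration of the idea, not the cited method's full algorithm:

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(5)
T = 400
# Toy amplitude envelopes of two sources in a reference frequency bin
ref = np.abs(rng.standard_normal((2, T)))

# A neighbouring bin carries the same sources, but in swapped order
other = ref[::-1] + 0.05 * np.abs(rng.standard_normal((2, T)))

def best_permutation(ref, other):
    """Ordering of `other` whose envelopes correlate best with `ref`."""
    def score(p):
        return sum(np.corrcoef(ref[i], other[j])[0, 1]
                   for i, j in enumerate(p))
    return max(permutations(range(len(other))), key=score)

# Envelope correlation recovers the swap between the two bins
assert best_permutation(ref, other) == (1, 0)
```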
0:15:35 Once the time-varying mixing matrices are found, the sources' time-varying directions of arrival with respect to the microphone array can be found, and this again is work by Sawada and others from Japan. And if we have another array at a different position in the room, we can find a different set of directions of arrival. Once the sources are separated, we can easily associate each source from one array to the other using a simple correlation method, and hence avoid the possibility of ghost locations. So if we have just the directions of arrival from the two arrays, as in the picture on the right, we have the possibility of two ghost locations; but with separation we can easily associate each source from one array to the other, and we avoid this ghost problem.
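The cross-array association that removes the ghosts can be sketched with toy separated signals; the noise level and signal lengths are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)
T = 300
s = rng.standard_normal((2, T))               # two underlying sources

# Separated outputs at two arrays; array B happens to output them in
# swapped order, which is exactly what creates the ghost ambiguity
out_a = s + 0.1 * rng.standard_normal((2, T))
out_b = s[::-1] + 0.1 * rng.standard_normal((2, T))

# Associate each output of array A with its best-correlated match at B;
# pairing the DOAs this way rules out the ghost intersections
match = [int(np.argmax([abs(np.corrcoef(a, b)[0, 1]) for b in out_b]))
         for a in out_a]

assert match == [1, 0]
```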
0:16:40 Finally, a multiple-model constant-velocity/constant-acceleration kinematic motion model for the spatial dynamics of the sources is implemented, again using multiple-model particle filtering. So this is another multiple-model particle filter, used to track the spatial motion of the sources, and it is very similar to the one we use for the separation.
0:17:13 So here we have some of our results. We have two microphone arrays, one over here with only two microphones, and another one over there. This is simulated data in a room with a reverberation time of about two hundred milliseconds, and we have a thousand particles for each of the frequency bins. The two sources are moving clockwise, kind of chasing each other; cyan and magenta are the two true trajectories, blue and red are the estimated trajectories. The total duration of each source was on average about twelve and a half seconds, with the source active for only about five and a half seconds on average; therefore we have about seven seconds of silence blind zones, which makes the problem really challenging.
0:18:00 Here I'm going to show you the video of the tracking process. We have circles and triangles: the circle is the true trajectory, the triangle is the estimated trajectory, and the shapes fill with green whenever the source becomes active. So when the circle turns green, that's the true activity pattern; when the triangle turns green, that's the estimated activity pattern. As you can see, we start from an initial estimate, and the estimate tries to catch up with the circle; that's because when the source is silent but moving around, the particles can only drift around with it.
0:19:14 So here I show the average position root mean square error of the trajectories, comparing our method with an online IVA algorithm. As we can see, our method does better than the online IVA; these spikes over here correspond to the silence periods, where the online algorithm basically loses the sources. We also show the SIR over here.
0:19:45 And just to conclude: we used the blinking model, and with a different combination of tracks we showed that it is necessary; we were able to deal with the silence blind zones; and because the sources are separated, we don't have the ghost problem. Thank you very much.
0:20:10 Do we have questions? Yes, a question.
0:20:15 Question: There was some work done earlier, I think in the range of about two or three taps, that talked about post-processing for such a problem using particle filters, where you can turn a source on or off using this kind of process, and that work showed this to be very effective given the complexity of the problem. Have you compared against that post-processing approach?
0:20:50 Answer: Not in great detail, but it looks like it works very well as a post-process, yes, that's true.
0:20:59 Question: Are you basically relying on having localization, so that there is line of sight?
0:21:10 Answer: No, not just line of sight; however, the DOA estimation algorithm is sufficient to find the direction, even with just the direct path.
0:21:34 Right, okay. Okay, thank you again.