Speech Transcript - SEPARATION AND TRACKING OF MULTIPLE SPEAKERS IN A REVERBERANT ENVIRONMENT USING A MULTIPLE MODEL PARTICLE FILTER GLIMPSING METHOD

...a multiple model particle filter glimpsing method. So the key idea here is that we want to do joint separation and tracking of moving speakers in a closed setting, and we are using the glimpsing fact, where the sources can appear or disappear; basically, they can turn on or off sporadically with time.

So first I'm going to give an overview of convolutive time-invariant mixing. Let's say we have two sources and two microphones. The sources are static in a room, and because of the multiple paths from each source to each sensor, the mixing process is done in a convolutive manner, because of the reverberation. Our goal here is to demix these convolutively mixed signals. However, if we want to do it in the time domain, it could be a complicated problem because of the convolution. So one trick that researchers often use is to transform the data into the frequency domain by use of the short-time Fourier transform, where convolution in the time domain translates to multiplication in the frequency domain for a large enough short-time Fourier transform window. So in this case A(k) is the mixing matrix at frequency bin k. Each frequency bin can be viewed as a different independent component analysis problem. ICA, independent component analysis, as we know, is indeterminate up to permutation, so if ICA is performed in each bin, further post-processing has to be done to correct for possible permutations.

OK, so here we are going to mention that the temporal dynamics of the sources in the time domain is the key to performing source separation in the frequency domain using ICA. We showed in our previous papers, which are available online on a website, that basically each frame is a sample from a Gaussian with zero mean and a specific variance
after it is transformed to the Fourier domain, and that is because of the central limit theorem. So basically, if our signal in the time domain has an energy envelope that varies with time, you have a Gaussian in one frame and a different Gaussian with a different variance in another frame, so the overall distribution is of the form of a Gaussian scale mixture, which has a super-Gaussian form. In this paper we use a fixed Gaussian scale mixture by approximating it using a finite mixture of Gaussians. So here we have used a mixture of Gaussians. Now, these parameters here are fixed beforehand, because they all fall into the super-Gaussian form. So we are not really going to try to estimate the parameters here, so that we can focus on other interesting aspects of speech-like signals. So basically we have this mixture of Gaussians for each of the sources. Because of independence, the densities are multiplied, so the overall joint density of the sources also has a mixture-of-Gaussians form.

In the previous slide I talked about how the temporal dynamics is key to source separation in the frequency domain. Now we introduce another form of temporal dynamics, and that is the glimpsing fact, in which the sources can basically turn on and off sporadically with time. This is typical of speech, where we have silence periods. So in this case we have three sources and three microphones. In this time period here, only the first source is active, so that means that the first column of the mixing matrix is used for the mixing. This holds when we are looking at any frequency bin, so the first column of the
mixing matrix in each frequency bin is used. Now in this time period, all three sources are active, so the full mixing matrix is used for the mixing process. And then let's say the third source becomes silent, and only the first and second columns of the mixing matrix are used. So by glimpsing in the silence gaps, we are hopefully able to achieve better results. This is also one strategy that the human ear uses to handle adverse conditions.

Now we are going to move on to time-varying mixing in the frequency domain; that is when the sources are moving around, so basically the mixing matrix varies with time. Here the emphasis is that incorporating glimpsing is crucial in time-varying online demixing, because if the model state is not correct, the mixing matrix estimation diverges or becomes unstable. Just to give some more explanation on that: later on we are going to introduce particle filters, and we use particle filters to simulate the columns of the mixing matrix. So for example, if we are in a case where the third source is silent, the particles that simulate the third source in this time period are going to diverge, or are going to just drift away to a location that is undesirable to us, because basically the source is inactive and they don't have any information about it. So when the third source turns back on, the particles might have drifted to too far a location and are not able to attain a track again. So it is very crucial for time-varying online demixing to incorporate this silence data.

Also, the problem becomes even more complicated when the sources move while being silent. We call this phenomenon silence blind zones, which is similar to Doppler blind zones in radar target tracking. So basically, if the sources are silent and also moving, the
problem becomes more complicated, and we are going to talk about this later on.

So here I am going to introduce the general model we use for the glimpsing strategy. We assume that each source can take on two states, either active or inactive. So for a total of M sources there will be a total of two to the power M states. The states can be different for different frequency bins, and they indicate which sources are present or absent at each time for each frequency bin. The set of active sources is a subset of the set of the total number of sources. So, for example, let's say you have three sources; state i could correspond to the case where the first and second sources are active and the third source is silent.

To continue with the general model, we are going to introduce our observation model here, the relationship between our observation and our states of interest. For each discrete state i, pertaining to a particular activity pattern, our observation is going to be a mixture of Gaussians. That is because the density of our sources is a mixture of Gaussians, so our observation also becomes a mixture of Gaussians. So for example, if state i corresponds to the case where the first and second columns are active and the third source is silent, then the mixing matrix for state i has only the first and second columns, and the third column is not used.

So here we are going to introduce our channel model, and that is the evolution of the columns of the mixing matrices. We use a random walk model, and the reason we use a random walk model is that we don't have any prior
information on how the channels vary with time from one location in the room to another. So we have no choice but to use a random walk model, where u here is a Gaussian random vector with a diagonal covariance. Also, for the discrete states that correspond to different activity patterns, we have a Markovian property for the transitions. So we have a transition matrix Pi, where each element pi_ij is the probability of going from state i to state j.

So here we are going to get to why we have to use particle filters for this problem. As we can see in this basic equation, the relationship between our observation and our continuous state A has a nonlinear, non-Gaussian form, so we cannot use standard optimal Kalman filtering techniques to track these columns of the mixing matrices. So we have to resort to so-called suboptimal techniques like particle filtering. In a particle filter, every state, whether it be continuous or discrete, is represented with a cloud of particles. If the states are continuous, the particles are also continuous; if the states are discrete, the particles are discrete. We also have to use a multiple model particle filter, because we have to be able to switch between the different states of activity. So a set of continuous particles is used to represent the mixing matrices, and a set of discrete particles is used to represent the discrete states of activity.

So I am just going to walk you through our multiple model particle filter. Basically we have continuous states A_m that are simulated by A_m^n (A sub m, superscript n); these are the particles that simulate A_m. And we have our discrete
states X that are simulated by particles X^n. We initialize these state particles using an initial prior, and we give them uniform weights. So w_m^n are the weights for A_m^n, and a second set of weights is kept for X^n. We classify the particles into sets corresponding to the different activity states, so n_i here corresponds to the indices of the particles that have state i as their state.

The next step is that we predict a new set of particles by drawing a new set of samples at time t according to the state transitions described earlier. So basically, if state i contains column m, we go ahead and make a prediction for a new set of particles. If state i does not contain column m, we just leave it as it is. This is how we avoid the drifting of particles whenever we have silences. Also, by keeping a memory of the silence patterns of the sources based on previous frames, the covariance of the cloud of particles can be increased temporarily. This way the cloud of particles during the silence blind zones would be large enough to find a track once the source becomes active again. So by keeping this memory buffer of the previous silence patterns and increasing the variance for the silent sources, we are able to deal with the silence blind zones.

Now, in this step we update our weights. We only update the weights for which state i contains column m; if state i does not contain column m, we just keep the weights as they are. This is using the standard bootstrap particle filter. We do the same thing for the weights of the discrete states. Then we normalize our weights in order to achieve a meaningful probability, and from there we can obtain a probability for
each state. We also do the same thing for our column weights, and from there we can estimate the mixing matrix columns by a weighted average. If the particles become degenerate, we can resample them. At the end, once we obtain these estimates of our mixing matrix columns, we can use a minimum mean square error estimator to reconstruct the sources.

Then the permutation in the frequency bins is corrected using the correlation method on the activity patterns (this is work by Sawada and others from Japan), by keeping a memory of the past estimates of the sources in each frequency bin. So as we move on with our separation process, we are able to achieve better permutation correction.

Once the time-varying mixing matrices are found, the sources' time-varying directions of arrival with respect to the microphone array can be found, and this is work again by Sawada and others from Japan. If we have another array in a different position in the room, we can find a different direction of arrival. However, since the sources are separated, we can easily associate each source from one array to another using the simple correlation method, hence avoiding the possibility of ghost locations. So if we have just the directions of arrival from the two arrays, as in the picture on the right, we have a possibility of two ghost locations. Now if we have separation, we can easily associate each source from one array to another, and we avoid this ghost problem.

Also, at the end, a multiple model constant velocity, constant acceleration kinematic motion model on the spatial dynamics of the sources is implemented, again using multiple model particle filtering. So this is using another multiple model particle filter
to track the spatial motion of the sources, and the multiple model particle filter we use here is very similar to the one that we use for our separation problem.

So here we have some of our results. We have two arrays: one over here with only two microphones, and one over there with only two microphones. This is simulated in a room; our reverberation time is about two hundred milliseconds. We have a thousand particles for each of the frequency bins. The two sources are moving clockwise, kind of chasing each other. Cyan and magenta are the true trajectories; blue and red are the estimated trajectories. The total duration of each source was about twelve and a half seconds on average, with each source being active for only about five and a half seconds on average. Therefore we have about seven seconds of silence blind zones, which makes the problem really intricate.

So here I'm going to show you the video of the tracking process. We have the circle and we have the triangle: the circle is the true trajectory, the triangle is the estimated trajectory, and the shapes fill with green whenever the source becomes active. So when the circle turns green, that is the true activity pattern; when the triangle turns green, that is the estimated activity pattern. As you can see, we start from an initial estimate, and the triangle tries to catch up with the circle. That is because the source keeps moving around while it is silent [partly unintelligible], so the estimate drifts around.

So here I am going to show the average position root mean square error of the trajectory, comparing our method with an online algorithm [name unintelligible]. As we can see, our method basically does better than the online algorithm. These spikes over here
correspond to the silence periods [unintelligible].

Just to conclude: we used glimpsing in combination with tracking, we showed that it is necessary, and we are able to deal with the silence blind zones. And because the sources are separated, we don't have the problem of ghost locations. Thank you very much.

Do we have questions?

[Audience question, largely unintelligible: it refers to earlier work on this kind of problem using particle filters, in which sources can be turned on or off with such a process, and to the complexity of the problem.]

We have looked at that, but not in great detail. [unintelligible]

[Second audience question, largely unintelligible: it concerns localization.]

Basically, yes, just line of sight. However, the DOA estimation algorithm is sufficient to find the direction [unintelligible] just using the direct path.

OK. [unintelligible]
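
The particle filter steps described in the talk (enumerating the activity states, predicting only the active columns with a random walk, updating only the weights of active columns, and estimating the columns by a weighted average) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes real-valued mixing matrices for a single frequency bin, and the observation likelihood is passed in as an array because the mixture-of-Gaussians observation model is not reproduced here. All function and parameter names are hypothetical.

```python
import itertools
import numpy as np


def activity_states(num_sources):
    """Enumerate all 2**M on/off patterns; row i is the activity mask of state i."""
    return np.array(list(itertools.product([False, True], repeat=num_sources)))


def transition_states(state_idx, trans, rng):
    """Draw each particle's next discrete state from its row of the Markov matrix."""
    num_states = trans.shape[0]
    return np.array([rng.choice(num_states, p=trans[i]) for i in state_idx])


def predict_columns(particles, active, step_std, rng):
    """Random-walk step for active columns only; silent columns are left as they are.

    particles: (N, num_mics, M) candidate mixing matrices, one per particle.
    active:    (N, M) boolean mask, the activity pattern of each particle's state.
    """
    noise = rng.normal(0.0, step_std, size=particles.shape)
    return particles + noise * active[:, None, :]


def update_weights(weights, likelihood, active):
    """Scale w[n, m] by the likelihood only where column m is active, then normalize.

    likelihood: (N,) observation likelihood per particle (assumed computed elsewhere).
    """
    scaled = np.where(active, weights * likelihood[:, None], weights)
    return scaled / scaled.sum(axis=0, keepdims=True)


def estimate_columns(particles, weights):
    """Weighted average of the particles, column by column."""
    return np.einsum("nm,nkm->km", weights, particles)


def effective_sample_size(weights):
    """Per-column effective sample size; a small value signals degeneracy."""
    return 1.0 / np.sum(weights ** 2, axis=0)
```

In this sketch a column whose source is silent in a particle's state receives no random-walk noise and no weight update, which mirrors the way the talk describes avoiding particle drift during silences. The temporary covariance increase for silence blind zones and the resampling step are left out.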

SEPARATION AND TRACKING OF MULTIPLE SPEAKERS IN A REVERBERANT ENVIRONMENT USING A MULTIPLE MODEL PARTICLE FILTER GLIMPSING METHOD

Signal Separation

Presented by: Alireza Masnadi-Shirazi, Author(s): Alireza Masnadi-Shirazi, Bhaskar D. Rao, University of California San Diego, United States