0:00:13 ...a multiple-model particle filter blinking method. So the key idea here is, rather than separating and then tracking, to do joint separation and tracking of moving speakers in an enclosed setting, and we're using the blinking effect, where the sources can appear or disappear; basically, they can turn on and off sporadically with time.
0:00:40 So first I'm going to give an overview of convolutive time-invariant mixing. Let's say we have two sources and two microphones in a room, and the sources are static. Because of the multiple paths from each source to each sensor, the mixing process is modeled in a convolutive manner, because of the reverberation.
0:01:03 Our goal here is to demix these convolutively mixed signals. However, if we want to do it in the time domain, it can be a complicated problem because of the convolution. So one trick that researchers often use is to transform the data to the frequency domain by means of the short-time Fourier transform, where convolution in the time domain translates to multiplication in the frequency domain, for a large enough short-time Fourier transform window.
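The convolution-to-multiplication property the talk relies on can be checked numerically in a few lines; this is only an illustrative NumPy sketch with toy signals, not part of the system described:

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.standard_normal(256)   # a source frame
h = rng.standard_normal(16)    # a short room impulse response

# Time-domain mixing: linear convolution of the source with the channel
x_time = np.convolve(s, h)

# Frequency domain: with an FFT of length at least len(s) + len(h) - 1,
# the same mixing is just a per-bin multiplication
n = len(s) + len(h) - 1
x_freq = np.fft.ifft(np.fft.fft(s, n) * np.fft.fft(h, n)).real

assert np.allclose(x_time, x_freq)
```

With a shorter FFT window the equality only holds approximately, which is why the talk stresses "large enough" STFT windows.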
0:01:32 So in this case, A(k) is the mixing matrix at frequency bin k.
0:01:44 Each bin can be viewed as a different independent component analysis problem, and ICA, independent component analysis, as we know, is subject to a permutation ambiguity. So if ICA is performed in each bin separately, post-processing has to be done to correct for possible permutations.
0:02:06"'kay" so here we gonna mention um
0:02:10a a source
0:02:11the the temporal dynamic
0:02:13dynamics of the sources
0:02:15in the time domain is the chi to uh
0:02:18to perform um
0:02:20a source separation and the frequency domain
0:02:23using ica
0:02:24 And we showed in our previous papers, which are available online on our website, that basically each frame is a sample from a Gaussian with zero mean and a specific variance, after it is transformed to the frequency domain, and that's because of the central limit theorem.
0:02:51 So basically, if our signal in the time domain has an energy envelope that varies with time, then you have a Gaussian in one frame and a different Gaussian with a different variance in another frame, so the overall distribution is of the form of a Gaussian scale mixture, which has a super-Gaussian form.
0:03:22 So in this paper we use a fixed Gaussian scale mixture, approximated using a finite mixture of Gaussians. The parameters here are fixed beforehand, because they all fall into the super-Gaussian form, so we're not really going to try to estimate these parameters; instead we can focus on other interesting aspects of the speech-like signals.
0:03:59 So basically we have this mixture of Gaussians for each of the sources, and because of independence, the overall joint density of the sources is also a mixture of Gaussians.
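The super-Gaussian claim can be verified with a short moment calculation; the component weights and variances below are made up for illustration, not the fixed parameters used in the paper:

```python
import numpy as np

# Hypothetical fixed GSM parameters: component weights and variances
w = np.array([0.7, 0.3])
v = np.array([0.5, 5.0])

# Zero-mean Gaussian-mixture moments: E[x^2] = sum(w*v), E[x^4] = 3*sum(w*v^2)
m2 = np.sum(w * v)
m4 = 3.0 * np.sum(w * v * v)
excess_kurtosis = m4 / m2 ** 2 - 3.0

# Super-Gaussian: heavier tails than a single Gaussian of the same power
assert excess_kurtosis > 0.0
```

By the Cauchy-Schwarz inequality the excess kurtosis of any non-degenerate zero-mean scale mixture is strictly positive, which is what makes the mixture-of-Gaussians prior a good fit for speech.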
0:04:16 In the previous slide I talked about how the temporal dynamics tie to source separation in the frequency domain. Now I'll introduce another form of temporal dynamics, and that's the blinking effect, in which the sources can basically turn on and off sporadically with time. This is due to the nature of speech, where we have silence periods.
0:04:39 So in this case we have three sources and three microphones. In this time period, only the first source is active, so that means only the first column of the mixing matrix is used for the mixing; here we're basically looking at any single frequency bin, so the first column of the mixing matrix in each frequency bin is used. In this time period, all three sources are active, so the full mixing matrix is used for the mixing process. And then, let's say, the third source becomes silent, and only the first and second columns mix.
0:05:30 By exploiting the silence gaps, we are hopefully able to achieve better results; this is also one strategy that the human ear uses to handle adverse conditions.
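The active-column idea above can be made concrete with a tiny sketch (toy numbers, one frequency bin): when a source is silent, dropping its column from the mixing matrix changes nothing in the observation:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))   # full mixing matrix in one frequency bin
s = rng.standard_normal(3)        # the three sources at one time frame
s[2] = 0.0                        # the third source is silent here

# Mixing with the full matrix equals mixing with only the active columns
x_full = A @ s
x_active = A[:, :2] @ s[:2]

assert np.allclose(x_full, x_active)
```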
0:05:46 Now we move on to time-varying mixing in the frequency domain; that's when the sources are moving around, so the mixing matrix varies with time. And here the emphasis is that incorporating blinking is crucial in time-varying online demixing, because if the modeled state is not correct, the demixing estimation becomes unstable.
0:06:13 Just to give some more explanation on that: later on we'll introduce particle filters, and we use particle filters to simulate the columns of the mixing matrix. So for example, if we're in a case where the third source is silent, the particles that simulate the third source's column in this time period are going to diverge, or just drift to a location that's undesirable for us, because basically the source is inactive and the particles don't have any information about it. So when the third source turns back on, the particles might have drifted to a location so far away that they are not able to attain the track again. So it's very crucial, for time-varying online demixing, to incorporate this silence state.
0:07:05 And the problem becomes even more complicated when the sources move while being silent. We call this phenomenon silence blind zones, which is similar to Doppler blind zones in radar target tracking. So basically, if the sources are both silent and moving, the problem becomes harder, and we'll talk about this later on.
0:07:33 So here I introduce the general model we built for the blinking strategy we use here. We assume that each source can take on two states, either active or silent, so for a total of M sources there will be a total of 2^M states. The states can be different for different frequency bins, and they indicate which source, in each frequency bin, is present or absent; at each time, the set of active sources is a subset of the set of all sources. So for example, say we have three sources; then state i could correspond to the case where the first and second sources are active and the third source is silent.
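The 2^M activity states can be enumerated as bitmasks; this is a small illustrative sketch, not the paper's notation:

```python
M = 3  # number of sources

# Each activity state is the subset of sources that are "on";
# there are 2**M such subsets in every frequency bin
states = [tuple(m for m in range(M) if (i >> m) & 1) for i in range(2 ** M)]

assert len(states) == 2 ** M
# The example state from the talk: sources 1 and 2 active, source 3 silent
assert (0, 1) in states
```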
0:08:30 To continue with the generative model, we introduce our observation model here: the relationship between our observations and our states of interest. So here, for each discrete state i, pertaining to a particular activity pattern, our observation is going to be a mixture of Gaussians, and that's because the densities of our sources are mixtures of Gaussians, so our observations also become mixtures of Gaussians.
0:09:05 So for example, if state i corresponds to the case where the first and second columns are active and the third source is silent, then the mixing matrix for state i basically has only the first and second columns, and the third column is not used.
0:09:29 So here we introduce our channel model, and that's the evolution of the columns of the mixing matrices. We use a random-walk model, and the reason is that we don't have any prior information on how the channels vary with time from one location in the room to another, so we have no choice but to use a random walk, where u here is a Gaussian random vector with a diagonal covariance.
0:09:59 And also, for the discrete states that correspond to the different activity patterns, we have a Markovian property for the transitions: we have a transition matrix Pi, where each element pi_ij is the probability of transitioning from state i to state j.
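A minimal sketch of such a Markov chain over activity states; the transition matrix below, which favors staying in the current state, is an assumption for illustration, not the paper's values:

```python
import numpy as np

rng = np.random.default_rng(2)
n_states = 2 ** 2   # e.g. two sources -> four activity states

# Row-stochastic transition matrix: pi[i, j] = P(next state j | state i);
# the 0.9 self-transition probability is made up for this example
pi = np.full((n_states, n_states), 0.1 / (n_states - 1))
np.fill_diagonal(pi, 0.9)
assert np.allclose(pi.sum(axis=1), 1.0)

# Simulate the activity pattern over a few frames
state, path = 0, [0]
for _ in range(20):
    state = int(rng.choice(n_states, p=pi[state]))
    path.append(state)
```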
0:10:24 So here we get to why we have to use particle filters for this problem. As we can see in this basic equation, the relationship between our observations and our continuous states has a nonlinear, non-Gaussian form, so we cannot use Kalman filtering techniques to track these columns of the mixing matrices; we have to resort to so-called suboptimal techniques like particle filtering.
0:11:06 So in a particle filter, every state, whether continuous or discrete, is represented with a cloud of particles: if the states are continuous the particles are continuous, and if the states are discrete the particles are discrete. And we also have to use a multiple-model particle filter, because we have to be able to switch between the different states of activity. So a set of continuous particles is used to represent the mixing matrices, and a set of discrete particles is used to represent the discrete states of activity.
0:11:39 I'm just going to walk you through our multiple-model particle filter. So basically we have continuous states a_t, simulated by the particles a_t^(m), and we have discrete states x_t, simulated by the particles x_t^(m). We initialize these state particles using an initial prior and give them uniform weights; so w_t^(m) are the weights for a_t^(m), and r_t^(m) are the weights for x_t^(m). We then classify the particles into sets corresponding to the different activity states, so the index set I_i here corresponds to the indices of the particles that have state i as their state.
0:12:42 So the next step is that we predict a new set of particles by drawing new samples at time t according to the state transition described before. So basically, if state i contains column m, we make a prediction for a new set of particles; if state i does not contain column m, we just leave it as it is. This is how we avoid the diffusion of the particles whenever we have silences.
0:13:16 Also, by keeping a memory of the silence spans of the sources over the previous frames, the covariance of the cloud of particles can be increased temporarily, so that the cloud of particles during the silence blind zones will be large enough to find the track once the source becomes active again. So by keeping this memory buffer of the previous silence patterns, and increasing the variance for the silent sources, we are able to deal with the silence blind zones.
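The prediction step just described can be sketched for a single mixing column; the step size and inflation factor here are assumptions for illustration, and the talk's exact inflation rule is not specified:

```python
import numpy as np

rng = np.random.default_rng(3)
particles = rng.standard_normal((500, 2))   # cloud for one mixing column

SIGMA_WALK = 0.05   # random-walk step size (assumed, not from the talk)
INFLATE = 1.5       # per-frame spread factor for silent sources (assumed)

def predict(particles, active, silent_frames=0):
    """One prediction step for one column's particle cloud."""
    if active:
        # Active source: the usual random-walk prediction
        return particles + SIGMA_WALK * rng.standard_normal(particles.shape)
    # Silent source: keep the cloud but widen it around its mean, so that
    # after a silence blind zone it is broad enough to reacquire the track
    mean = particles.mean(axis=0)
    return mean + INFLATE ** silent_frames * (particles - mean)

widened = predict(particles, active=False, silent_frames=3)
assert widened.std() > particles.std()
```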
0:13:46 Now in this step we update our weights. So basically, we only update the weights for which state i contains column m; if state i does not contain column m, we just keep the weights as they are. This uses the standard bootstrap particle filter. We do the same thing for the discrete weights; for the discrete states, the weights are the r's. And then we normalize the weights in order to obtain a meaningful probability.
0:14:33 And then from there we can obtain a probability estimate for each state, and we do the same for our column weights. From there we can estimate the mixing matrix columns by a weighted average, and if our particles become degenerate, we can resample them.
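These steps, bootstrap weight update, normalization, weighted-average estimate, and resampling, can be sketched for a toy scalar state; the Gaussian likelihood, observation value, and resampling threshold below are stand-ins, not the paper's exact model:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
particles = rng.standard_normal(n) + 1.0   # toy scalar "column" particles
weights = np.full(n, 1.0 / n)

# Bootstrap weight update: multiply by the observation likelihood
# (a stand-in Gaussian likelihood around a pretend observation)
obs, obs_var = 1.2, 0.1
weights *= np.exp(-0.5 * (particles - obs) ** 2 / obs_var)
weights /= weights.sum()                   # normalize to a probability

# Point estimate of the column: the weighted particle average
estimate = float(np.sum(weights * particles))

# Systematic resampling when the effective sample size degenerates
if 1.0 / np.sum(weights ** 2) < n / 2:
    positions = (rng.random() + np.arange(n)) / n
    idx = np.minimum(np.searchsorted(np.cumsum(weights), positions), n - 1)
    particles, weights = particles[idx], np.full(n, 1.0 / n)

assert abs(estimate - obs) < 0.3
```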
0:14:56 And at the end, once we obtain these estimates, our mean mixing matrix, we can use a minimum mean square error estimator to reconstruct the sources.
0:15:10 Then the permutation in the frequency bins is corrected using the correlation method on the activity patterns; this is work by Sawada and others from Japan, where one keeps a memory of the past estimates of the sources in each frequency band. So as we move on with the separation process, we are able to achieve better permutation correction.
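The correlation-based permutation correction can be sketched with toy amplitude envelopes; this is a simplified two-source illustration of the idea, not the cited method's full algorithm:

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(5)
T = 400
# Toy amplitude envelopes of two sources in a reference frequency bin
ref = np.abs(rng.standard_normal((2, T)))

# A neighbouring bin carries the same sources, but in swapped order
other = ref[::-1] + 0.05 * np.abs(rng.standard_normal((2, T)))

def best_permutation(ref, other):
    """Ordering of `other` whose envelopes correlate best with `ref`."""
    def score(p):
        return sum(np.corrcoef(ref[i], other[j])[0, 1]
                   for i, j in enumerate(p))
    return max(permutations(range(len(other))), key=score)

# Envelope correlation recovers the swap between the two bins
assert best_permutation(ref, other) == (1, 0)
```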
0:15:35 Once the time-varying mixing matrices are found, the sources' time-varying directions of arrival with respect to the microphone array can be found, and this again is work by Sawada and others from Japan. And if we have another array at a different position in the room, we can find a different set of directions of arrival. Once the sources are separated, we can easily associate each source from one array to the other using a simple correlation method, and hence avoid the possibility of ghost locations. So if we have just the directions of arrival from the two arrays, as in the picture on the right, we have the possibility of two ghost locations; but with separation we can easily associate each source from one array to the other, and we avoid this ghost problem.
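The cross-array association that removes the ghosts can be sketched with toy separated signals; the noise level and signal lengths are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)
T = 300
s = rng.standard_normal((2, T))               # two underlying sources

# Separated outputs at two arrays; array B happens to output them in
# swapped order, which is exactly what creates the ghost ambiguity
out_a = s + 0.1 * rng.standard_normal((2, T))
out_b = s[::-1] + 0.1 * rng.standard_normal((2, T))

# Associate each output of array A with its best-correlated match at B;
# pairing the DOAs this way rules out the ghost intersections
match = [int(np.argmax([abs(np.corrcoef(a, b)[0, 1]) for b in out_b]))
         for a in out_a]

assert match == [1, 0]
```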
0:16:40 Finally, a multiple-model constant-velocity/constant-acceleration kinematic motion model for the spatial dynamics of the sources is implemented, again using multiple-model particle filtering. So this is another multiple-model particle filter, used to track the spatial motion of the sources, and it is very similar to the one we use for the separation.
0:17:13 So here we have some of our results. We have two microphone arrays, one over here with only two microphones, and another one over there. This is simulated data in a room with a reverberation time of about two hundred milliseconds, and we have a thousand particles for each of the frequency bins. The two sources are moving clockwise, kind of chasing each other; cyan and magenta are the two true trajectories, blue and red are the estimated trajectories. The total duration of each source was on average about twelve and a half seconds, with the source active for only about five and a half seconds on average; therefore we have about seven seconds of silence blind zones, which makes the problem really challenging.
0:18:00 Here I'm going to show you the video of the tracking process. We have circles and triangles: the circle is the true trajectory, the triangle is the estimated trajectory, and the shapes fill with green whenever the source becomes active. So when the circle turns green, that's the true activity pattern; when the triangle turns green, that's the estimated activity pattern. As you can see, we start from an initial estimate, and the estimate tries to catch up with the circle; that's because when the source is silent but moving around, the particles can only drift around with it.
0:19:14 So here I show the average position root mean square error of the trajectories, comparing our method with an online IVA algorithm. As we can see, our method does better than the online IVA; these spikes over here correspond to the silence periods, where the online algorithm basically loses the sources. We also show the SIR over here.
0:19:45 And just to conclude: we used the blinking model, and with a different combination of tracks we showed that it is necessary; we were able to deal with the silence blind zones; and because the sources are separated, we don't have the ghost problem. Thank you very much.
0:20:10 Do we have questions? Yes, a question.
0:20:15 Question: There was some work done earlier, I think in the range of about two or three taps, that talked about post-processing for such a problem using particle filters, where you can turn a source on or off using this kind of process, and that work showed this to be very effective given the complexity of the problem. Have you compared against that post-processing approach?
0:20:50 Answer: Not in great detail, but it looks like it works very well as a post-process, yes, that's true.
0:20:59 Question: Are you basically relying on having localization, so that there is line of sight?
0:21:10 Answer: No, not just line of sight; however, the DOA estimation algorithm is sufficient to find the direction, even with just the direct path.
0:21:34 Right, okay. Okay, thank you again.