0:00:13Thank you very much
0:00:16for your patience
0:00:17in sitting through,
0:00:21despite the delay,
0:00:22to the final talk of the day.
0:00:26Today I will talk about
0:00:28a new Bayesian approach to solving multi-target tracking using audio-visual data.
0:00:37In this talk,
0:00:39after some background,
0:00:40I will introduce you to
0:00:43a random finite set approach to the general problem of multi-object estimation.
0:00:49By multi-object estimation I mean
0:00:52the type of problem in which you are dealing with multiple objects,
0:00:56each one having its own state:
0:00:59problems where
0:01:00there is uncertainty not only in the states of the objects
0:01:05but also in their number.
0:01:10Then I switch to a special type of random finite set
0:01:15called multi-Bernoulli sets,
0:01:18and we go through the cardinality-balanced MeMBer filter.
0:01:25Then I will explain the main contribution of the paper, which is audio-visual fusion for tracking,
0:01:31and some simulation results
0:01:33and conclusions to finish the talk.
0:01:41The problem that we are focusing on in this paper and presentation is
0:01:46tracking of
0:01:48the locations of speakers.
0:01:53Let me give you an example.
0:01:56This is an example of
0:01:58audio-visual data.
0:02:03We have
0:02:05a bit of sound;
0:02:07the people are speaking occasionally, so the audio data is
0:02:12intermittent,
0:02:14and the people can get out of the camera scene, so occasionally we don't have
0:02:19any visual information coming in.
0:02:22We are interested in
0:02:25detecting and tracking the multiple targets,
0:02:30as we will see in this example.
0:02:34The targets of interest
0:02:36you can see here,
0:02:42and we want to design a filter that can detect
0:02:48and track simultaneously
0:02:50all existing active targets
0:02:54that there might be,
0:02:55in a given period of time,
0:02:58in the scene. I will tell you what the definition of an active target is
0:03:03and how it is mathematically formulated.
0:03:13In such problems there are a few main challenges.
0:03:16We have occasionally silent targets
0:03:19and occasionally invisible targets,
0:03:22and also
0:03:24we can have clutter measurements,
0:03:27both in the visual cues
0:03:30and in the audio features that we extract from the raw audio-visual information.
0:03:39Our contribution is
0:03:41a principled approach to combining audio and video data
0:03:45in a Bayesian framework.
0:03:51All of you are familiar with
0:03:54nonlinear filtering approaches.
0:03:56In single-target tracking methods
0:04:00there is a single target,
0:04:02which corresponds to
0:04:04a single measurement,
0:04:06with a single state,
0:04:08and from k-1 to k
0:04:11it transitions to a new state.
0:04:13In a
0:04:14general Bayesian filtering scheme
0:04:17we have
0:04:18a prediction step
0:04:20and an update step.
0:04:22In the prediction step we use
0:04:25the information that we have about the dynamics
0:04:28of the object;
0:04:30in the update step
0:04:32we use the information
0:04:35provided by the measurement.
0:04:39If we assume that the distribution of the state of the single target is Gaussian,
0:04:45and the dynamics and measurement models are linear,
0:04:49then the optimal solution is Kalman filtering;
0:04:51in nonlinear cases,
0:04:53particle filters
0:04:55are used.
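To make the prediction/update scheme just described concrete, here is a minimal linear-Gaussian sketch: a textbook Kalman filter on a 1-D constant-velocity model. This is my own illustration, not the speaker's code; the matrices F, Q, H, R and all numbers are invented for the example.

```python
import numpy as np

def kf_predict(m, P, F, Q):
    """Prediction step: propagate the mean and covariance through the linear dynamics."""
    return F @ m, F @ P @ F.T + Q

def kf_update(m, P, z, H, R):
    """Update step: correct the predicted state with the measurement z."""
    S = H @ P @ H.T + R              # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)   # Kalman gain
    m_new = m + K @ (z - H @ m)
    P_new = (np.eye(len(m)) - K @ H) @ P
    return m_new, P_new

# 1-D constant-velocity example: state = [position, velocity]
dt = 1.0
F = np.array([[1.0, dt], [0.0, 1.0]])   # dynamics (prediction uses this)
Q = 0.01 * np.eye(2)                    # process noise
H = np.array([[1.0, 0.0]])              # we only measure position
R = np.array([[0.5]])                   # measurement noise

m, P = np.array([0.0, 1.0]), np.eye(2)
m, P = kf_predict(m, P, F, Q)           # predicted position is 1.0
m, P = kf_update(m, P, np.array([1.2]), H, R)
```

The posterior position ends up between the prediction (1.0) and the measurement (1.2), weighted by their uncertainties, which is exactly the prediction/update trade-off the speaker describes.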
0:04:59A multi-object filtering problem
0:05:03is something like that:
0:05:04it starts with
0:05:05special complexity
0:05:08and challenges.
0:05:12We can have the number of objects
0:05:14randomly changing,
0:05:17the number of measurements available randomly changing;
0:05:21we can have
0:05:22some objects undetected,
0:05:24or unwanted detections;
0:05:27we can have clutter,
0:05:29and also data association is
0:05:32another challenge that needs to be addressed.
0:05:42A relatively recent approach
0:05:46to tackling the multi-object
0:05:48filtering problem
0:05:52is to use random finite set theory to develop principled solutions to these problems.
0:06:03In this approach
0:06:06the objects
0:06:08are modelled as a set,
0:06:10as a random finite set,
0:06:13in which
0:06:14the uncertainty both
0:06:16in the states
0:06:17and in the number
0:06:19of the targets or objects is
0:06:21mathematically modelled.
0:06:24Therefore,
0:06:26instead of multiple objects or targets, we will be dealing with a single object that is modelled
0:06:33as a set.
0:06:36We will be dealing with
0:06:43the derivation of the
0:06:47statistical properties of the set;
0:06:51the problem is encapsulated
0:06:55as a single-target
0:06:58tracking, or single-object estimation, problem.
0:07:03In the mathematical formulation of the various solutions that have been developed
0:07:08in this
0:07:12random finite set theory framework,
0:07:16detection uncertainty
0:07:18and association uncertainty
0:07:20are mathematically formulated in a principled manner.
0:07:30Let me briefly mention some of these solutions, including the well-known PHD filter
0:07:35and CPHD filter,
0:07:37the MeMBer filter,
0:07:42and the cardinality-balanced MeMBer filter,
0:07:46which is the filter that I'm using to solve the tracking problem in this presentation.
0:07:55A special kind of random finite set is the multi-Bernoulli random finite set.
0:08:01It is the ensemble of
0:08:06a number,
0:08:07not necessarily known but determinable iteratively,
0:08:11capital M,
0:08:14of Bernoulli sets.
0:08:16Each Bernoulli set is
0:08:18prescribed by
0:08:23an r, which is the existence probability
0:08:26of a possible object,
0:08:29and a p, which is the pdf of the state of that object,
0:08:33and the union of all these
0:08:35Bernoulli sets
0:08:37forms a multi-Bernoulli random finite set.
0:08:42A multi-Bernoulli RFS, or random finite set, can be fully prescribed by the ensemble of
0:08:48all the (r_i, p_i) pairs.
0:08:56As you see,
0:08:58the whole uncertainty in the number of
0:09:01objects that exist in the scene
0:09:04and the distribution of their states
0:09:06can be mathematically modelled
0:09:09having these r_i's
0:09:11and p_i functions.
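To make the (r_i, p_i) ensemble concrete, here is a small sketch, with made-up numbers, of a multi-Bernoulli RFS: the expected number of objects is simply the sum of the existence probabilities, and a realization of the set is drawn by flipping one coin per component.

```python
import random

# A multi-Bernoulli RFS is an ensemble of (r_i, p_i) pairs:
# r_i = existence probability, p_i = state pdf (here, a sampler).
components = [
    (0.9, lambda: random.gauss(0.0, 1.0)),   # almost surely present
    (0.5, lambda: random.gauss(5.0, 1.0)),   # present half the time
    (0.1, lambda: random.gauss(9.0, 1.0)),   # rarely present
]

def expected_cardinality(comps):
    """The expected number of objects is the sum of the r_i."""
    return sum(r for r, _ in comps)

def sample_realization(comps):
    """Draw one set-valued sample: each component independently
    contributes one state with probability r_i."""
    return [p() for r, p in comps if random.random() < r]

print(expected_cardinality(components))  # 1.5
```

This shows how the same ensemble encodes both the cardinality uncertainty (the coin flips) and the state uncertainty (the per-component pdfs) the speaker refers to.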
0:09:18As with
0:09:19general Bayesian filters,
0:09:21we have a prediction and an update step,
0:09:25and when we model the random finite set of targets
0:09:29as a multi-Bernoulli random finite set,
0:09:34it is the r_i's
0:09:36and p_i's which are predicted and updated.
0:09:47The MeMBer filter,
0:09:49and its fixed version, the cardinality-balanced or CB-MeMBer filter,
0:09:54are more useful than PHD filters
0:09:57in practical implementations because of their computational requirements and
0:10:02also their accuracy.
0:10:10Similar to a general Bayesian filter, in prediction
0:10:15the r_i's and p_i's are predicted,
0:10:17and the predicted r_i and p_i equations involve
0:10:24the survival probability of each object
0:10:30and the state transition density of each object.
0:10:38These are the dynamics:
0:10:40the information that we have about the movements,
0:10:44or the state
0:10:45changes, of the objects.
0:10:49In addition,
0:10:50in prediction,
0:10:52new Bernoulli sets
0:10:57are
0:10:58introduced to the system as a result of
0:11:01new objects coming
0:11:04into the scene.
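The prediction step just described can be sketched in a simplified particle-based form. This is my own toy illustration, not the paper's implementation: surviving components keep their pdf (here, a particle cloud) pushed through an assumed transition model, with existence scaled by the survival probability, and birth components are appended.

```python
import random

def predict_multi_bernoulli(components, p_survival, transition, births):
    """Multi-Bernoulli prediction (sketch): each surviving component's
    existence probability becomes p_survival * r_i, its particles are
    moved by the transition model, and new birth components model
    objects entering the scene."""
    predicted = []
    for r, particles in components:
        predicted.append((p_survival * r, [transition(x) for x in particles]))
    return predicted + list(births)

# toy 1-D example: near-constant position with small process noise
transition = lambda x: x + random.gauss(0.0, 0.1)
comps = [(0.8, [1.0, 1.1, 0.9]), (0.6, [4.0, 4.2, 3.8])]
births = [(0.05, [0.0, 0.0, 0.0])]   # one tentative newborn component
pred = predict_multi_bernoulli(comps, p_survival=0.95,
                               transition=transition, births=births)
```

Note how the existence probabilities shrink by the survival probability while the birth component arrives with its own small initial r, exactly the two effects the speaker names.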
0:11:08After prediction,
0:11:09in the update step,
0:11:12the ensemble of r_i's and p_i's of the Bernoulli sets
0:11:16is updated to the union of two sets.
0:11:20One set includes the legacy tracks:
0:11:24the sets that are there
0:11:26because they might not be detected,
0:11:29they might not have been detected in that frame.
0:11:33The other includes the sets
0:11:34that are there and are updated using the measurements.
0:11:41In these equations I want to draw your attention to
0:11:45two important parameters:
0:11:48the detection probability
0:11:50and the measurement likelihood,
0:11:52p_D
0:11:53and g_k.
0:11:55These are
0:11:57defined for
0:12:00single objects.
0:12:02You have measurements;
0:12:03the relationship between a measurement
0:12:05and an object
0:12:07is defined
0:12:09by the measurement likelihood, dependent on
0:12:12your sensor performance and your equipment,
0:12:17and also some environmental
0:12:19parameters
0:12:21such as the clutter rate;
0:12:23together these characterize the whole measurement process.
0:12:29The detection probability is another parameter using which
0:12:34we can tune the performance of the system.
0:12:39Our definition of
0:12:42speakers, or active targets, in the scene
0:12:45you will see shortly.
0:12:48So, for audio-visual tracking,
0:12:53in our implementation the state includes the x and y image coordinates, their velocities x-dot and y-dot,
0:12:59and the size of
0:13:02the rectangular
0:13:04box that we will get,
0:13:07as a result of tracking,
0:13:08in the image.
0:13:12The visual measurements are obtained by performing background subtraction followed by morphological image operations.
0:13:20The result would be a set of rectangular blobs in each frame.
0:13:26If we denote the results as a random finite set
0:13:32in which each element includes the
0:13:35x, y, W and H of a blob,
0:13:38then the likelihood
0:13:39can be defined by this function,
0:13:44which is a Gaussian-like likelihood.
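The talk does not spell this function out, so here is a hedged sketch of what a Gaussian-like likelihood over a rectangular measurement could look like; the per-component independence and the sigma values are my own illustrative assumptions, not the paper's.

```python
import math

def box_likelihood(state, z, sigmas):
    """Gaussian-like likelihood of one rectangular measurement
    z = (x, y, w, h) given the box part of a track's state.
    sigmas are assumed per-component standard deviations."""
    g = 1.0
    for s, m, sd in zip(state, z, sigmas):
        g *= math.exp(-0.5 * ((s - m) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
    return g

state = (120.0, 80.0, 30.0, 60.0)      # x, y, W, H of a track
close = (122.0, 78.0, 32.0, 58.0)      # a nearby measured blob
far = (300.0, 200.0, 10.0, 10.0)       # a distant blob (likely clutter)
sig = (10.0, 10.0, 5.0, 5.0)
assert box_likelihood(state, close, sig) > box_likelihood(state, far, sig)
```

The point is only the shape of the dependence: blobs that agree with a track's position and size get a high likelihood, distant blobs get a likelihood near zero.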
0:13:48For the audio measurements
0:13:50I have taken the simplest approach,
0:13:54assuming that there are two microphones on the two sides of the camera.
0:13:59The time difference of arrival, or TDOA,
0:14:02is calculated using cross-correlation:
0:14:05the generalized cross-correlation phase transform, or GCC-PHAT, function.
0:14:11Because of reverberation effects
0:14:13there are several peaks in the GCC-PHAT curve
0:14:17when it is plotted versus time difference.
0:14:21In our experiments
0:14:22we have considered at most five of the largest
0:14:25peaks of the GCC-PHAT values,
0:14:29and we consider them as the TDOA measurements in each frame.
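A minimal sketch of the GCC-PHAT computation described above; the frame length, sampling rate and peak-picking details are my own assumptions. The cross-spectrum is whitened so that only phase information survives, and the TDOA candidates are the lags of the largest correlation peaks.

```python
import numpy as np

def gcc_phat_tdoas(x1, x2, fs, max_peaks=5):
    """GCC-PHAT (sketch): whiten the cross-spectrum so only the phase
    remains, then return the lags (in seconds) of the largest peaks as
    TDOA candidates.  Positive TDOA means x2 lags x1 here."""
    n = 2 * len(x1)                            # zero-pad to avoid wrap-around
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X2 * np.conj(X1)
    cross /= np.abs(cross) + 1e-12             # the PHAT weighting
    cc = np.fft.fftshift(np.fft.irfft(cross, n))
    lags = np.arange(-n // 2, n // 2) / fs
    top = np.argsort(cc)[::-1][:max_peaks]     # indices of the largest peaks
    return lags[top]

# toy check: x2 is x1 delayed by 5 samples
fs = 8000
rng = np.random.default_rng(0)
s = rng.standard_normal(1024)
x1 = s
x2 = np.concatenate([np.zeros(5), s[:-5]])
tdoas = gcc_phat_tdoas(x1, x2, fs)
```

In a reverberant room the secondary peaks come from reflections, which is exactly why the talk keeps several candidates per frame instead of only the maximum.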
0:14:38In order to
0:14:40parametrize and calculate the relationship between these TDOA measurements
0:14:45and the state,
0:14:47the object state, which is x,
0:14:49y, W and H,
0:14:54there is a practical consideration:
0:14:56the distance of the targets from the microphones is
0:15:00relatively large compared to the distance between the two microphones.
0:15:04Therefore we can practically assume
0:15:07that there is a linear relationship between x
0:15:12and the corresponding TDOA.
0:15:14In order to find the parameters of this linear relationship,
0:15:18I have used
0:15:20the ground-truth states that I have
0:15:22in one of the
0:15:23cases of the database used in this paper.
0:15:29In each frame I have calculated five peaks, or five TDOAs:
0:15:37the red points.
0:15:38And then,
0:15:41because
0:15:42many of them are outliers and only some of them are inliers,
0:15:47using a robust estimation technique we can detect and remove the outliers and then use regression to find
0:15:54the linear
0:15:55relationship that exists
0:15:58between the TDOA and x.
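The speaker names the technique only as "robust estimation plus regression"; a common choice that fits the description is a RANSAC-style fit, sketched here with synthetic numbers (the slope, intercept and outlier pattern below are invented for illustration).

```python
import random

def robust_line_fit(xs, ys, trials=200, tol=1.0, seed=0):
    """RANSAC-style robust fit of y = a*x + b (sketch): repeatedly fit
    a line through two random points, keep the line with the most
    inliers, then refit by least squares on those inliers only."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(trials):
        i, j = rng.sample(range(len(xs)), 2)
        if xs[i] == xs[j]:
            continue
        a = (ys[j] - ys[i]) / (xs[j] - xs[i])
        b = ys[i] - a * xs[i]
        inliers = [k for k in range(len(xs))
                   if abs(ys[k] - (a * xs[k] + b)) < tol]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # ordinary least squares on the surviving inliers
    n = len(best_inliers)
    sx = sum(xs[k] for k in best_inliers)
    sy = sum(ys[k] for k in best_inliers)
    sxx = sum(xs[k] ** 2 for k in best_inliers)
    sxy = sum(xs[k] * ys[k] for k in best_inliers)
    a = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    b = (sy - a * sx) / n
    return a, b

# synthetic data: tdoa = 0.002*x + 0.1, with heavy outliers mixed in
xs = list(range(50))
ys = [0.002 * x + 0.1 for x in xs]
ys[::7] = [5.0] * len(ys[::7])   # corrupt every 7th point
a, b = robust_line_fit(xs, ys, tol=0.01)
```

The corrupted points never enter the final least-squares step, so the recovered slope and intercept match the clean line, mirroring how the spurious reverberation-peak TDOAs are rejected before the regression.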
0:16:03I did this for each of the two persons that are active in the scene in that case study;
0:16:10I have
0:16:12considered the two persons separately
0:16:17for
0:16:17comparison purposes,
0:16:19because if the two equations are very close
0:16:22to each other in terms of their parameters,
0:16:25that would prove that
0:16:28this assumption is practically correct
0:16:31and our estimates
0:16:32are accurate,
0:16:33and that was the case.
0:16:39So we have
0:16:40two measurement likelihoods.
0:16:42In each frame
0:16:44we have
0:16:45audio-visual data:
0:16:46with each frame coming up,
0:16:49we have audio measurements as the set of TDOAs,
0:16:52and we have
0:16:54video, or
0:16:55image, measurements
0:16:57as a result of background subtraction followed by morphological operations.
0:17:02How do we use them? How do we fuse this information
0:17:06to find the active targets?
0:17:12We define active targets
0:17:15in terms of the probability of detection.
0:17:22For example, if an active speaker is considered to be a person who is expected to be
0:17:27visible to the camera in no less than ninety-five percent of the time,
0:17:32and to be speaking at least forty percent of the time,
0:17:37we set
0:17:39the detection probability for visual data as ninety-five percent
0:17:44and for audio data as forty percent.
0:17:47By increasing
0:17:48and decreasing these
0:17:50detection probabilities
0:17:54we can tune
0:17:56how long we expect
0:17:59an active target to be speaking or to be visible.
0:18:04It is application-dependent and can be tuned by the user.
0:18:17Sensor fusion happens by
0:18:20repeating the update
0:18:23step; it is that simple.
0:18:26We do the update step twice,
0:18:28and I remind you that in the update
0:18:31step of the filter we are using the measurement likelihoods
0:18:36which we have.
0:18:37We do the update step first
0:18:40using the visual measurements and then using the audio measurements.
0:18:46In each of these repetitions we use the corresponding
0:18:49detection probability.
0:18:51And again I remind you that in each step
0:18:54we have
0:18:55the legacy tracks
0:18:57and we have the measurement-corrected tracks,
0:19:00which are weighted by these detection probabilities.
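The fusion-by-repeated-update idea can be sketched with the standard single-measurement Bernoulli existence updates. These closed forms are the textbook Bernoulli-filter expressions, not taken from the paper, and the p_D, likelihood and clutter values below are invented for illustration.

```python
def bernoulli_update_missed(r, p_d):
    """Existence update when no measurement matches the track: a missed
    detection lowers r, and more so when p_d is high."""
    return r * (1 - p_d) / (1 - r * p_d)

def bernoulli_update_detected(r, p_d, g, clutter):
    """Existence update when a measurement with likelihood g matches:
    a detection raises r relative to the clutter intensity."""
    return r * p_d * g / (clutter * (1 - r * p_d) + r * p_d * g)

# fusion by repeating the update: first the video cue, then the audio
# cue, each with its own detection probability (illustrative numbers)
r = 0.5
r = bernoulli_update_detected(r, p_d=0.95, g=2.0, clutter=0.1)  # video
r = bernoulli_update_detected(r, p_d=0.40, g=1.5, clutter=0.1)  # audio
```

Running the update twice with per-sensor detection probabilities is what lets a target that is silent but visible (or invisible but speaking) keep a high existence probability: the low-p_D sensor penalizes a miss only mildly.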
0:19:05So here are some results,
0:19:07for example in this case.
0:19:20As you see, people are talking
0:19:22and they are detected and tracked.
0:19:28This is not
0:19:29the sound of the video; there is an issue with the player,
0:19:36but
0:19:37let me run it outside;
0:19:38probably it will work.
0:20:07As you can see, there is a large circle there being drawn.
0:20:13The left frame
0:20:15shows the image result of tracking;
0:20:18the right frame shows the ensemble of particles that I have used to implement the filter,
0:20:24the particles that
0:20:26approximate the density,
0:20:28the pdf, of each Bernoulli component in the random finite set of the filter:
0:20:35one, two, three, four, five, six.
0:20:38The results that you see on the
0:20:42left frame
0:20:44are actually the weighted averages of the winning particles corresponding to each track.
0:20:54These are the results of tracking as well,
0:20:59and one thing you can verify is the number of people:
0:21:04one, two, three, four.
0:21:14And here is another example;
0:21:18all of these examples are from the same
0:21:21database.
0:21:29I will finish my talk
0:21:31with some quantitative results, because I think we are getting close to the end of the time.
0:21:36In ninety-eight point five percent of all the frames, the existing targets
0:21:41were all detected; in these frames
0:21:43and cases
0:21:44they were correctly labeled and tracked.
0:21:48Labels were never switched after or during occlusion,
0:21:53and the silent target was successfully tracked using the audio cue.
0:21:59The false negative ratio, false alarm ratio and label-switching ratios
0:22:04were evaluated, and here are the results. As you see,
0:22:08the false alarm and label-switching ratios are
0:22:14almost zero, or exactly zero and not applicable in some cases,
0:22:20and they are less than
0:22:23in the case when we are not using the audio data.
0:22:30And I will skip the conclusions.
0:22:32Thank you, and I will be answering your questions.
0:22:36Thank you very much. OK.
0:22:38I would like to thank all of you for remaining until this time, and all of the speakers for
0:22:44the very good talks.
0:22:46I think you can address questions also separately afterwards; otherwise the noise will be a problem.