0:00:13 | i have to thank everybody

0:00:15 | and thank you very much |

0:00:16 | for your patience |

0:00:17 | sitting until

0:00:19 | this time

0:00:21 | with a delay

0:00:22 | for this final talk of the day

0:00:26 | today i am going to talk about

0:00:28 | a new bayesian approach to solve multi-target tracking using audio-visual data

0:00:37 | in this talk |

0:00:39 | after a background |

0:00:40 | i will introduce you to |

0:00:43 | a random finite set approach to the general problem of multi object estimation |

0:00:49 | by multi object estimation i mean |

0:00:52 | any problem in which you are dealing with multiple objects

0:00:56 | each one having its own state

0:00:59 | problems where |

0:01:00 | there is uncertainty not only in the state of the object

0:01:04 | but

0:01:05 | also in the number of objects

0:01:09 | and |

0:01:10 | then i switch to a special type of random finite set

0:01:15 | called multi-bernoulli sets

0:01:18 | and we go through the cardinality-balanced multi-bernoulli member filter

0:01:25 | then i will explain the main contribution of the paper which is audio-visual tracking

0:01:31 | and some simulation results |

0:01:33 | and conclusions will finish the talk

0:01:41 | the problem that we are focusing on in this paper and presentation is

0:01:46 | tracking of |

0:01:47 | multiple |

0:01:48 | occasionally speaking

0:01:50 | targets |

0:01:53 | let me give you an example

0:01:56 | this is an example of

0:01:58 | audio-visual data

0:02:05 | we have a bit of sound

0:02:07 | the people are speaking occasionally so the audio data is

0:02:12 | intermittent

0:02:14 | the people can get out of the camera scene therefore occasionally we don't have

0:02:19 | visual information coming in

0:02:21 | but |

0:02:22 | we are interested in |

0:02:24 | detecting |

0:02:25 | and tracking the multiple targets

0:02:30 | as we will see in this example |

0:02:34 | the target of interest |

0:02:36 | can disappear here

0:02:37 | while |

0:02:38 | still |

0:02:39 | talking |


0:02:42 | and we want to design a filter that can detect

0:02:48 | and track simultaneously

0:02:50 | all existing active

0:02:52 | targets

0:02:54 | there might be

0:02:55 | in a given period of time

0:02:58 | in the scene i will tell you what the definition of an active target is

0:03:03 | and how it is mathematically formulated

0:03:13 | in such problems there are a few main challenges |

0:03:16 | we have occasionally silent targets |

0:03:19 | and occasionally invisible targets

0:03:22 | and also |

0:03:24 | we can have clutter measurements |

0:03:26 | both

0:03:27 | in the visual cues

0:03:30 | and in the audio features that we will extract from the raw audio-visual information

0:03:39 | our contribution is

0:03:41 | a principled approach to combine audio and video data |

0:03:45 | in a bayesian framework |

0:03:49 | okay |

0:03:51 | all of you are familiar with the |

0:03:54 | nonlinear filtering approaches |

0:03:56 | in single target tracking methods

0:04:00 | there is a single target

0:04:02 | which corresponds to

0:04:04 | a single measurement

0:04:06 | with a single state

0:04:08 | and from k minus one to k

0:04:11 | it transitions to the new state

0:04:13 | and in a |

0:04:14 | a general bayesian filtering scheme |

0:04:17 | we have |

0:04:18 | a prediction step

0:04:20 | and an update step

0:04:22 | in the prediction step we use

0:04:25 | the information that we have a about the dynamic |

0:04:28 | of the object |

0:04:30 | in the update step

0:04:32 | we use the information that we have

0:04:35 | provided by the measurements

0:04:39 | if we assume that the distribution of the state of the single target is gaussian

0:04:45 | and the dynamics and measurement models are linear |

0:04:49 | then the optimal solution is kalman filtering

0:04:51 | in nonlinear cases

0:04:53 | particle filters

0:04:55 | are used
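
The prediction-update recursion just described can be sketched for the linear-Gaussian case, where the Kalman filter is the optimal solution. This is a generic illustration with made-up matrices, not the implementation from the paper:

```python
import numpy as np

def kalman_predict(x, P, F, Q):
    """Prediction step: propagate state mean and covariance through the dynamics."""
    return F @ x, F @ P @ F.T + Q

def kalman_update(x, P, z, H, R):
    """Update step: correct the prediction with the measurement z."""
    S = H @ P @ H.T + R              # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)   # Kalman gain
    x_new = x + K @ (z - H @ x)
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new

# 1-d toy example: scalar state measured directly (all values are assumptions)
x, P = np.array([0.0]), np.array([[1.0]])
F, Q, H, R = np.eye(1), 0.01 * np.eye(1), np.eye(1), 0.25 * np.eye(1)
x, P = kalman_predict(x, P, F, Q)
x, P = kalman_update(x, P, np.array([1.0]), H, R)
```

In the nonlinear case the same two steps are carried out with a particle approximation of the posterior instead of a Gaussian.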

0:04:59 | a multi object filtering problem

0:05:03 | is something like that

0:05:04 | but with

0:05:05 | special complexity

0:05:08 | and challenges

0:05:12 | we can have the number of objects |

0:05:14 | randomly changing |

0:05:17 | the number of measurement cues available randomly changing

0:05:21 | we can have |

0:05:22 | some objects undetected |

0:05:24 | or missed detections

0:05:27 | we can have clutter |

0:05:29 | and also data association it's |

0:05:32 | another challenge that needs to be |

0:05:35 | tackled

0:05:42 | a relatively recent

0:05:45 | approach |

0:05:46 | to tackle the multi object |

0:05:48 | filtering problem |

0:05:50 | is |

0:05:51 | um |

0:05:52 | using random finite set theory to develop principled solutions to tackle these problems

0:06:03 | in this approach |

0:06:06 | the objects |

0:06:08 | are modelled as a set |

0:06:10 | as a random finite set

0:06:13 | in which |

0:06:14 | the uncertainty both

0:06:16 | in the states |

0:06:17 | and in the number |

0:06:19 | of the targets or objects are |

0:06:21 | mathematically modelled

0:06:24 | therefore

0:06:26 | instead of multiple objects or targets we will be dealing with a single target that is modelled |

0:06:33 | as a set |

0:06:35 | and |

0:06:36 | will be dealing with |

0:06:38 | mathematics |

0:06:40 | off |

0:06:40 | sets |

0:06:41 | integration |

0:06:43 | and differentiation

0:06:46 | and |

0:06:47 | statistical properties of the set |

0:06:50 | however |

0:06:51 | the problem is encapsulated |

0:06:55 | as a single target |

0:06:58 | tracking or a single object estimation problem |

0:07:03 | in the mathematical formulation of the various solutions that have been developed

0:07:08 | in this |

0:07:10 | framework |

0:07:12 | random finite set theory framework |

0:07:16 | detection uncertainty

0:07:17 | clutter

0:07:18 | and association uncertainty

0:07:20 | are mathematically formulated in a principled manner

0:07:30 | i am not going to go through all of these solutions which include the well-known phd filter

0:07:35 | and the cphd filter

0:07:37 | and the member filter

0:07:39 | and

0:07:42 | the cardinality-balanced member filter

0:07:46 | which is the filter that i am using to solve the problem that is the focus of this presentation

0:07:55 | a special kind of random finite set is the multi-bernoulli random finite set

0:08:01 | they are the ensemble of

0:08:05 | a number

0:08:06 | which is

0:08:07 | not known but can be determined iteratively

0:08:11 | capital m

0:08:14 | of bernoulli sets

0:08:16 | each bernoulli set is

0:08:18 | prescribed by

0:08:20 | two

0:08:21 | parameters

0:08:23 | r which is the existence probability

0:08:26 | of a possible object

0:08:29 | and p which is the pdf of the state of that object

0:08:33 | and the union of all these |

0:08:35 | bernoulli sets

0:08:37 | uh form a multi but only random finite set |

0:08:42 | a multi-bernoulli rfs or random finite set can be fully prescribed by the ensemble of

0:08:48 | all the r i and

0:08:51 | p i

0:08:55 | so |

0:08:56 | as you can see

0:08:58 | the whole uncertainty in the number of

0:09:01 | objects that exist in the scene

0:09:04 | and the distribution of the states

0:09:06 | can be mathematically modelled

0:09:09 | using these r i values

0:09:11 | and the p i functions
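
A minimal sketch of how such a multi-Bernoulli RFS can be represented in code. The class name and the particle representation of each pdf are illustrative assumptions, not the paper's implementation; the key point is that the ensemble of (r_i, p_i) pairs carries both the cardinality and the state uncertainty:

```python
import numpy as np

class BernoulliComponent:
    """One Bernoulli set: existence probability r and a particle
    approximation of the state pdf p(x)."""
    def __init__(self, r, particles, weights):
        self.r = r                  # existence probability in [0, 1]
        self.particles = particles  # sampled object states
        self.weights = weights      # normalized particle weights

def expected_cardinality(components):
    """Expected number of objects in the scene = sum of the r_i."""
    return sum(c.r for c in components)

comps = [
    BernoulliComponent(0.9, np.zeros((100, 4)), np.full(100, 0.01)),
    BernoulliComponent(0.4, np.zeros((100, 4)), np.full(100, 0.01)),
]
card = expected_cardinality(comps)   # approximately 1.3
```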

0:09:18 | and like

0:09:19 | general bayesian filters

0:09:21 | we have a prediction and an update step

0:09:25 | and when we model the random finite set of targets |

0:09:29 | as a multi-bernoulli random finite set

0:09:34 | it is the r i

0:09:36 | and p i which are predicted and updated

0:09:47 | the member filter

0:09:49 | and its fixed version the cardinality-balanced or cb-member filter

0:09:54 | are more useful than phd filters

0:09:57 | in practical implementations because of the computational requirements and |

0:10:02 | also their accuracy |

0:10:07 | so |

0:10:10 | similar to a general bayesian filter in prediction |

0:10:15 | the r i's and p i's are predicted

0:10:17 | and the predicted r i and p i equations involve

0:10:24 | the survival probability of each

0:10:26 | object

0:10:28 | and the

0:10:30 | state transition density of each object

0:10:38 | this is the dynamics

0:10:40 | information that we have about the movements

0:10:44 | or the state

0:10:45 | changes of the objects
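
The prediction of a single Bernoulli component can be sketched as follows, assuming a constant survival probability and a simple random-walk transition model (both are illustrative assumptions, not the paper's dynamic model):

```python
import numpy as np

def predict_component(r, particles, p_survival=0.95, noise_std=0.1, rng=None):
    """Bernoulli prediction: the existence probability is scaled by the
    survival probability, and the particles approximating the state pdf
    are pushed through the (here: random-walk) transition model."""
    rng = np.random.default_rng(0) if rng is None else rng
    r_pred = p_survival * r
    particles_pred = particles + noise_std * rng.standard_normal(particles.shape)
    return r_pred, particles_pred

# a component with existence 0.8 keeps 0.95 * 0.8 = 0.76 after prediction
r_pred, parts_pred = predict_component(0.8, np.zeros((50, 2)))
```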

0:10:49 | in addition |

0:10:50 | in prediction |

0:10:52 | a new set

0:10:54 | of

0:10:55 | bernoulli sets

0:10:57 | is

0:10:58 | introduced to the system as the result of

0:11:01 | newly coming

0:11:03 | objects

0:11:04 | to the scene

0:11:09 | in the update step

0:11:12 | the ensemble of r i's and p i's of the multi-bernoulli set

0:11:16 | is updated to the union of two sets

0:11:20 | one set includes the legacy tracks

0:11:24 | the sets that are there

0:11:26 | but might not be detected

0:11:29 | might not have been detected in that

0:11:32 | frame

0:11:33 | and the sets

0:11:34 | that are there and are updated using the measurement

0:11:38 | data
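
For the legacy tracks, the commonly used multi-Bernoulli update discounts the existence probability of a track that was not detected in the current frame. A small sketch (the formula is the usual legacy-track expression; the numeric values are only illustrative):

```python
def legacy_existence(r, p_d):
    """Existence probability of a legacy (undetected) track: the prior
    existence r is discounted by the probability of a missed detection."""
    return r * (1.0 - p_d) / (1.0 - r * p_d)

# under a high detection probability, a missed detection costs a lot of belief
high = legacy_existence(0.9, 0.95)   # ≈ 0.31
# under a low detection probability, the same track keeps most of its belief
low = legacy_existence(0.9, 0.4)     # ≈ 0.84
```

This is exactly why the modality-specific detection probabilities introduced later act as a tuning knob for what counts as an active target.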

0:11:41 | in these equation i want to draw your attention to |

0:11:45 | two important parameters |

0:11:48 | detection probability |

0:11:50 | and measurement likelihood

0:11:52 | P D |

0:11:53 | and G K |

0:11:55 | these are |

0:11:57 | defined for

0:12:00 | single objects

0:12:02 | you have measurements |

0:12:03 | the relationship between the measurement |

0:12:05 | and the object |

0:12:06 | state |

0:12:07 | is defined |

0:12:09 | and it is dependent on

0:12:12 | your sensor performance and your equipment

0:12:16 | dynamics

0:12:17 | and also some environmental

0:12:19 | parameters

0:12:21 | such as the clutter rate

0:12:23 | which characterizes the whole measurement process

0:12:29 | and the detection probability is another parameter using which |

0:12:34 | we can tune the performance of the system

0:12:37 | and |

0:12:38 | define |

0:12:39 | our definition of |

0:12:41 | active |

0:12:42 | speakers or active targets in the scene

0:12:45 | as you will see now

0:12:48 | so for audio-visual tracking

0:12:53 | in our implementation the state includes the x image and y image coordinates and x dot and y dot

0:12:59 | and the size

0:13:01 | of

0:13:02 | the rectangular

0:13:04 | box that we will get

0:13:07 | as a result of tracking

0:13:08 | in the image

0:13:12 | visual measurements are obtained by performing background subtraction followed by morphological image operations

0:13:20 | the result would be a set of rectangular boxes in each frame

0:13:26 | if we denote the results as a random finite set

0:13:30 | z

0:13:32 | in which each element includes the

0:13:35 | x y w and h

0:13:38 | then the likelihood

0:13:39 | can be defined by this function

0:13:44 | which is a gaussian likelihood
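
A sketch of such a Gaussian likelihood between a measured box and the box predicted from a target state. The per-component standard deviations are assumptions for illustration, not the paper's settings:

```python
import numpy as np

def box_likelihood(z, box_from_state, sigma=np.array([10.0, 10.0, 5.0, 5.0])):
    """Gaussian likelihood of a measured box z = (x, y, w, h) given the box
    predicted from a target state; sigma holds per-component std deviations."""
    d = (np.asarray(z, float) - np.asarray(box_from_state, float)) / sigma
    norm = np.prod(sigma) * (2 * np.pi) ** (len(sigma) / 2)
    return float(np.exp(-0.5 * d @ d) / norm)

# a box close to the predicted one scores higher than a shifted box
near = box_likelihood([100, 50, 30, 60], [100, 50, 30, 60])
far = box_likelihood([130, 50, 30, 60], [100, 50, 30, 60])
```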

0:13:48 | with the audio measurements

0:13:50 | i have taken the simplest approach |

0:13:54 | assuming that there are two microphones on two sides of the camera |

0:13:59 | the time difference of arrival or tdoa |

0:14:02 | is calculated using cross correlation |

0:14:05 | the generalized cross-correlation phase transform or gcc-phat

0:14:11 | because of reverberation effects

0:14:13 | there are several peaks in the gcc-phat curve |

0:14:17 | when it is plotted versus time difference

0:14:21 | in our experiments |

0:14:22 | we have considered at most the five largest

0:14:25 | peaks of the gcc-phat values |

0:14:28 | and |

0:14:29 | we consider them as the tdoa measurements in each frame |
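
A minimal GCC-PHAT sketch of the procedure just described: whiten the cross power spectrum, inverse-transform, and keep the largest peaks as TDOA candidates. The frame length, sampling rate, and peak count here are assumptions for illustration:

```python
import numpy as np

def gcc_phat(sig, ref, fs, n_peaks=5):
    """GCC-PHAT between two microphone signals; returns candidate TDOAs in
    seconds, ordered by increasing peak height (last entry = strongest)."""
    n = len(sig) + len(ref)
    S = np.fft.rfft(sig, n) * np.conj(np.fft.rfft(ref, n))
    S /= np.abs(S) + 1e-12                            # phase transform: keep phase only
    cc = np.fft.irfft(S, n)
    cc = np.concatenate((cc[-n // 2:], cc[:n // 2]))  # center zero lag
    lags = np.arange(-n // 2, n // 2) / fs
    top = np.argsort(cc)[-n_peaks:]                   # indices of the largest peaks
    return lags[top]

# synthetic check: a broadband signal delayed by 8 samples
fs = 16000
rng = np.random.default_rng(0)
ref = rng.standard_normal(1024)
sig = np.roll(ref, 8)
tdoas = gcc_phat(sig, ref, fs)   # strongest peak should sit at 8 / fs
```

In reverberant rooms the curve has several such peaks, which is exactly why a handful of the largest ones are kept as the audio measurement set.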

0:14:35 | so |

0:14:38 | in order to

0:14:40 | parametrize and calculate the relationship between these tdoa measurements

0:14:45 | and the state

0:14:47 | the object state which is x

0:14:49 | y w and h

0:14:53 | first |

0:14:54 | there is a practical consideration |

0:14:56 | the distance of targets from the microphones |

0:15:00 | is |

0:15:00 | relatively large compared to the distance between the two microphones |

0:15:04 | therefore we can practically assume

0:15:07 | that there is a linear relationship between x

0:15:12 | and the corresponding tdoa

0:15:14 | in order to find the parameters of this linear relationship

0:15:18 | i have used |

0:15:20 | the ground truth state that i have |

0:15:22 | in one of the

0:15:23 | cases of the database used in this paper

0:15:29 | in each frame i have calculated five peaks or five tdoas

0:15:36 | shown as

0:15:37 | the red points

0:15:38 | and then

0:15:39 | i want to find out

0:15:41 | because

0:15:42 | many of them are outliers and only some of them are inliers

0:15:47 | and using a robust estimation technique we can detect and remove the outliers and then use regression to find

0:15:54 | the linear

0:15:55 | relationship that exists

0:15:58 | between the tdoa and x

0:16:01 | of |

0:16:02 | uh |

0:16:03 | each of the two persons that are active in the scene in that case study

0:16:10 | and i have

0:16:12 | considered two persons

0:16:15 | for

0:16:17 | comparison purposes

0:16:19 | because if the two equations are very close |

0:16:22 | to each other in terms of their parameters |

0:16:25 | that would prove that

0:16:28 | this assumption is practically correct

0:16:31 | and our estimates

0:16:32 | are accurate

0:16:33 | and that was the case
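
The robust fit just described can be sketched with a simple RANSAC-style loop. The tolerance, iteration count, and synthetic data below are assumptions, and the paper's exact robust estimator may differ:

```python
import numpy as np

def robust_line_fit(tdoa, x, n_iter=200, inlier_tol=10.0, rng=None):
    """RANSAC-style fit of x ≈ a * tdoa + b that tolerates outlier TDOA peaks."""
    rng = np.random.default_rng(0) if rng is None else rng
    best, best_count = None, -1
    for _ in range(n_iter):
        i, j = rng.choice(len(tdoa), size=2, replace=False)
        if tdoa[i] == tdoa[j]:
            continue
        a = (x[j] - x[i]) / (tdoa[j] - tdoa[i])   # line through two samples
        b = x[i] - a * tdoa[i]
        count = int(np.sum(np.abs(a * tdoa + b - x) < inlier_tol))
        if count > best_count:
            best, best_count = (a, b), count
    a, b = best
    mask = np.abs(a * tdoa + b - x) < inlier_tol  # refit on inliers only
    a, b = np.polyfit(tdoa[mask], x[mask], 1)
    return a, b

# synthetic check: true mapping x = 2000 * tdoa + 160, with injected outliers
tdoa = np.linspace(-0.05, 0.05, 40)
x = 2000 * tdoa + 160
x[::7] += 300
a, b = robust_line_fit(tdoa, x)
```

A plain least-squares fit on the same data would be pulled toward the outliers; the consensus step is what makes the recovered slope and intercept usable.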

0:16:36 | okay |

0:16:37 | now |

0:16:39 | we have

0:16:40 | two measurement likelihoods

0:16:42 | in each frame

0:16:44 | we have

0:16:45 | audio data

0:16:46 | and we have the video frames coming up

0:16:49 | we have audio measurements as the set of tdoas

0:16:52 | and we have

0:16:54 | video or

0:16:55 | image measurements

0:16:57 | as a result of background subtraction followed by morphological operations

0:17:02 | how do we use them how do we fuse this information

0:17:06 | to find |

0:17:07 | active |

0:17:09 | targets |

0:17:12 | we define active targets |

0:17:15 | in terms of the probability of detection

0:17:19 | values

0:17:20 | of the system

0:17:22 | for example if an active speaker is considered to be the person who is expected to be

0:17:27 | visible to the camera in no less than ninety five percent of the time

0:17:32 | and to be speaking at least forty percent of the time

0:17:37 | then |

0:17:37 | we set |

0:17:39 | the detection probability for visual data as ninety five percent

0:17:44 | and |

0:17:44 | for audio data as forty percent

0:17:47 | by increasing

0:17:48 | and decreasing these

0:17:50 | detection probabilities

0:17:54 | we can tune

0:17:56 | how long we expect

0:17:59 | an active target to be speaking or to be visible

0:18:04 | it is application-dependent and it can be tuned by the user

0:18:10 | then

0:18:17 | sensor fusion happens

0:18:20 | by |

0:18:20 | repeating the update |

0:18:22 | step |

0:18:23 | twice |

0:18:23 | it is that simple

0:18:25 | first

0:18:26 | we do the update step

0:18:28 | and i remind you that in the update |

0:18:31 | step of the filter we are using the measurement likelihood

0:18:34 | functions |

0:18:36 | which we have |

0:18:37 | so |

0:18:37 | we do the update step first

0:18:40 | using the visual measurements and then using the audio measurements


0:18:46 | in each of these repetitions we use the corresponding |

0:18:49 | detection probability |

0:18:51 | and again i'd remind you that in each step |

0:18:54 | we have |

0:18:55 | the legacy tracks |

0:18:57 | and we have the measurement-updated tracks

0:18:59 | which

0:19:00 | are tuned by these detection probabilities
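
The two-pass fusion can be sketched as follows. `toy_update` is a deliberately simplified stand-in for the full multi-Bernoulli update, and the two detection probabilities are the values quoted in the talk; everything else is an illustrative assumption:

```python
def legacy_discount(r, p_d):
    """Existence discount for a track that was not detected this pass."""
    return r * (1 - p_d) / (1 - r * p_d)

def toy_update(tracks, measurements, p_d):
    """Stand-in update: strengthen tracks near a measurement, discount the
    rest by the current modality's detection probability."""
    out = []
    for r, state in tracks:
        if any(abs(z - state) < 1.0 for z in measurements):
            out.append((r / (r + (1 - r) * 0.1), state))  # crude evidence boost
        else:
            out.append((legacy_discount(r, p_d), state))
    return out

def fuse_step(tracks, video_meas, audio_meas):
    """Audio-visual fusion: the update step is repeated once per modality,
    each pass with its own detection probability."""
    P_D_VIDEO, P_D_AUDIO = 0.95, 0.40
    tracks = toy_update(tracks, video_meas, P_D_VIDEO)
    tracks = toy_update(tracks, audio_meas, P_D_AUDIO)
    return tracks

# two 1-d tracks: one supported by both modalities, one by neither
tracks = [(0.8, 5.0), (0.8, 20.0)]
tracks = fuse_step(tracks, video_meas=[5.2], audio_meas=[5.1])
```

The supported track ends with existence close to one, while the unsupported one decays, and the video pass (high p_D) punishes a miss much harder than the audio pass.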

0:19:05 | so here are some results |

0:19:07 | for example in this case |

0:19:20 | as you see people are talking

0:19:22 | and they are detected and tracked

0:19:28 | this is not

0:19:29 | playing the sound on my machine i am sure therefore

0:19:37 | let me run it outside

0:19:38 | probably it is

0:20:00 | okay

0:20:07 | the five targets that you see there are being tracked

0:20:13 | the left frame

0:20:15 | shows the image result of tracking

0:20:18 | the right frame shows the ensemble of particles that i have used to implement

0:20:24 | the belief that is

0:20:26 | the pdf of the state

0:20:28 | of each bernoulli component in the multi-bernoulli random finite set

0:20:34 | okay |

0:20:35 | one two three four five six

0:20:38 | and the results that you see on the

0:20:42 | left frame

0:20:44 | are actually the average of all the winning particles corresponding to each track

0:20:54 | and these are the results of tracking

0:20:59 | and you can count the number of tracks

0:21:04 | one two three four

0:21:14 | and here is another example |

0:21:18 | an example of a smart room all of them are from the same database

0:21:29 | i will finish my talk

0:21:31 | with some quantitative results because i think we are close to the end of time

0:21:36 | in ninety eight point five percent of all the frames the existing targets

0:21:41 | were all detected and in this fraction of

0:21:43 | cases

0:21:44 | they were correctly labeled and tracked

0:21:48 | labels were never switched after or during occlusion

0:21:53 | and the invisible target was successfully tracked using the audio cues

0:21:58 | and |

0:21:59 | the false negative ratio false alarm ratio and label switching ratios

0:22:04 | without audio and with audio are shown here as you see

0:22:08 | the false alarm ratio and label switching ratios are

0:22:14 | almost zero or exactly zero and not applicable in this case

0:22:20 | and are less than

0:22:23 | in the case when we are not using the audio data

0:22:30 | and i will skip the conclusions

0:22:32 | thank you and i will be answering your questions

0:22:36 | thank you very much okay

0:22:38 | i would like to thank all of you for remaining until this time and all of the speakers for

0:22:43 | their

0:22:44 | very good talks

0:22:46 | i think we can do the rest also separately otherwise the noise in the room will

0:22:51 | increase