0:00:13 Thank you very much for your patience in sitting through to the end of the day, with a delay, for the final talk of the day. Today I will talk about a new Bayesian approach to solving multi-target tracking using audio-visual data.

In this talk, after some background, I will introduce you to the random finite set approach to the general problem of multi-object estimation. By multi-object estimation I mean the family of problems in which you are dealing with multiple objects, each having its own state: problems where there is uncertainty not only in the states of the objects but also in the number of objects. Then I will switch to a special type of random finite set called the multi-Bernoulli set, and we will go through the cardinality-balanced multi-Bernoulli filter. Then I will explain the main contribution of the paper, which is audio-visual tracking, and finish with some simulation results and conclusions.

The problem we are focusing on in this paper and presentation is the tracking of multiple occasionally speaking targets. Let me give you an example. This is an example of audio-visual data. [Video plays.] As you can see, the people are speaking only occasionally, so the audio data is intermittent, and the people can get out of the camera scene, so we do not always have visual information coming in; but we are interested in detecting and tracking the multiple targets. As you will see in this example, a target of interest can leave the scene while still talking, and we want to design a filter that can simultaneously detect and track all the active targets there might be in the scene. I will tell you later what the definition of an active target is and how it can be mathematically formulated.

In such problems there are a few main challenges: we have occasionally silent targets and occasionally invisible targets, and we can also have clutter measurements, both in the visual cues and in the audio features that we extract from the raw audio-visual
information. Our contribution is a principled approach to combining audio and video data in a Bayesian framework.

All of you are familiar with the nonlinear filtering approaches to single-target tracking. There is a single target, corresponding to a single measurement, with a single state, and from time k-1 to k it transitions to a new state. In the general Bayesian filtering scheme we have a prediction step and an update step. In the prediction step we use the information we have about the dynamics of the object; in the update step we use the information provided by the measurement. If we assume that the distribution of the state of the single target is Gaussian, and that the dynamics and measurement models are linear, then the optimal solution is Kalman filtering; in nonlinear cases, particle filters are usually applied.

A multi-object filtering problem looks like this, with its own complexity and challenges: the number of objects can change randomly, the number of available measurements can change randomly, some objects can be undetected or missing detections, we can have clutter, and data association is another challenge that needs to be tackled.

A relatively recent approach to tackling the multi-object filtering problem is the random finite set: we use random finite set theory to develop principled solutions to these problems. In this approach the objects are modelled as a set, a random finite set, in which the uncertainty both in the states and in the number of targets or objects is mathematically modelled. Therefore, instead of multiple objects or targets, we deal with a single target that is modelled as a set, and we deal with the mathematics of sets: integration, differentiation and the statistical properties of sets. The whole problem is encapsulated as a single-target tracking, or single-object estimation, problem. In the mathematical formulation of the various solutions that
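To make the prediction/update scheme just described concrete, here is a minimal sketch of the linear-Gaussian special case, where the Bayes filter reduces to the Kalman filter; the matrices and numbers are illustrative placeholders, not taken from the paper.

```python
import numpy as np

def kf_predict(m, P, F, Q):
    """Prediction step: propagate the Gaussian state through linear dynamics."""
    return F @ m, F @ P @ F.T + Q

def kf_update(m, P, z, H, R):
    """Update step: correct the prediction with the measurement z."""
    S = H @ P @ H.T + R              # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)   # Kalman gain
    m_post = m + K @ (z - H @ m)
    P_post = (np.eye(len(m)) - K @ H) @ P
    return m_post, P_post

# 1-D constant-position example: prior mean 4.0, noisy measurement 5.2
m, P = kf_predict(np.array([4.0]), np.array([[1.0]]),
                  F=np.array([[1.0]]), Q=np.array([[0.1]]))
m, P = kf_update(m, P, z=np.array([5.2]),
                 H=np.array([[1.0]]), R=np.array([[0.5]]))
```

The posterior mean lands between the prediction and the measurement, and the posterior variance shrinks, which is the behaviour the talk's prediction/update picture describes.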
have been developed in this framework, the random finite set theory framework, detection uncertainty, clutter and association uncertainty are mathematically formulated in a principled manner. I do not want to go through all of these solutions; they include the well-known PHD filter, the CPHD filter, the MeMBer filter, and the cardinality-balanced MeMBer filter, which is the filter that I am using to solve the particular problem in this presentation.

A special kind of random finite set is the multi-Bernoulli random finite set. There is an ensemble of M Bernoulli sets, where M is not known but can be determined iteratively. Each Bernoulli set is prescribed by two parameters: an r, which is the existence probability of a possible object, and a p, which is the pdf of the state of that object. The union of all these Bernoulli sets forms a multi-Bernoulli random finite set. A multi-Bernoulli RFS, or random finite set, can be fully prescribed by the ensemble of all the r's and p's. So, as you see here, the whole uncertainty in the number of objects that exist in the scene, and the distribution of their states, can be mathematically modelled by these r's and p's as the prime functions.

As with general Bayesian filters, we have a prediction and an update step, and when we model the random finite set of targets as a multi-Bernoulli random finite set, it is the r's and p's which are predicted and updated. The MeMBer filter and its extension, the cardinality-balanced or CB-MeMBer filter, are more useful than PHD filters in practical implementations because of their computational requirements and also their accuracy.

So, similar to a general Bayesian filter, in prediction the r's and p's are predicted, and the predicted r and p equations involve the survival probability and the state transition density of each object. These are the dynamics information that
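The prediction of the (r, p) pairs just described can be sketched with a particle representation of each Bernoulli pdf. This is only an illustrative sketch: the survival probability, dynamics and birth parameters below are made up for the example, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p_S = 0.99          # survival probability (assumed constant here)

def predict_bernoulli(r, particles, F=1.0, sigma_q=0.1):
    """Predict one Bernoulli component (r, particle cloud representing its pdf p):
    existence shrinks by the survival probability, particles move through the
    (here scalar linear) dynamics with process noise."""
    r_pred = p_S * r
    particles_pred = F * particles + sigma_q * rng.standard_normal(len(particles))
    return r_pred, particles_pred

# Two surviving components, plus one birth Bernoulli set modelling a new object
components = [(0.9, rng.normal(2.0, 0.2, 500)), (0.6, rng.normal(7.0, 0.2, 500))]
predicted = [predict_bernoulli(r, pts) for r, pts in components]
predicted.append((0.05, rng.uniform(0.0, 10.0, 500)))   # birth component
```

The appended low-r birth component is how new objects entering the scene are introduced in prediction, as the talk mentions next.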
we have about the movements, or the state changes, of the objects. In addition, in prediction, a new set of birth Bernoulli sets is introduced to the system, as a result of new objects coming into the scene.

In the update step, the ensemble of r's and p's of the Bernoulli sets is updated to the union of two sets: one set includes the legacy tracks, the sets that are kept because they might not have been detected in that frame, and the other includes the sets that are updated using the measurement data. In these equations I want to draw your attention to two important parameters: the detection probability and the measurement likelihood, P_D and g_k. These are defined for single objects. The relationship between the measurement and the object state is defined by the measurement likelihood, which depends on your sensor performance and your equipment dynamics, and also on some environmental parameters such as the clutter rate that corrupts the whole measurement process. The detection probability is another parameter with which we can tune the performance of the system and define our definition of active speakers, or active targets, in the scene, as you will see.

So, for audio-visual tracking, in our implementation the state includes x (the image coordinate), y, x-dot and y-dot, and the size of the rectangular box that we will get as a result of tracking in the video. Visual measurements are obtained by performing background subtraction followed by morphological image operations; the result is a set of rectangular boxes in each frame. If we denote the result as a random finite set Z, in which each element includes x, y, w and h, then the likelihood can be defined by this function, which is a Gaussian likelihood.

For audio measurements I have taken the simplest approach, assuming that there are two microphones on the two sides of the camera. The time difference of
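A minimal sketch of a Gaussian bounding-box likelihood g_k(z | x) of the kind just described, assuming a state layout (x, y, x-dot, y-dot, w, h); the noise scales `sigma` are my own placeholders, not the paper's values.

```python
import numpy as np

def box_likelihood(z, state, sigma=np.array([5.0, 5.0, 10.0, 10.0])):
    """Gaussian likelihood of a rectangle measurement z = (x, y, w, h)
    given a track state (x, y, xdot, ydot, w, h); velocities are unobserved."""
    pred = np.array([state[0], state[1], state[4], state[5]])  # predicted box
    d = (z - pred) / sigma
    return np.exp(-0.5 * d @ d) / np.prod(sigma * np.sqrt(2 * np.pi))

state = np.array([100.0, 50.0, 1.0, 0.0, 40.0, 80.0])
close = box_likelihood(np.array([101.0, 51.0, 42.0, 79.0]), state)
far   = box_likelihood(np.array([300.0, 200.0, 42.0, 79.0]), state)
```

A rectangle near the predicted box scores a much higher likelihood than a distant one, which is what the update step uses to weight the measurement-corrected tracks.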
arrival, or TDOA, is calculated using the generalized cross-correlation phase transform, or GCC-PHAT. Because of reverberation effects, there are several peaks in the GCC-PHAT curve when it is plotted versus time difference. In our experiments we have considered the (at most) five largest peaks of the GCC-PHAT values, and we take them as the TDOA measurements in each frame.

In order to parametrize the relationship between these TDOA measurements and the object state, which is x, y, w and h, there is first a practical consideration: the distance of the targets from the microphones is relatively large compared to the distance between the two microphones. Therefore we can practically assume that there is a linear relationship between x and the corresponding TDOA. To find the parameters of this linear relationship, I have used the ground truth states available in one of the sequences of the database used in this paper. In each frame I have calculated five peaks, or five TDOAs, shown as the red points. Many of them are outliers and some of them are inliers; using a robust estimation technique we can detect and remove the outliers, and then use regression to find the linear relationship that exists between the TDOA and the x coordinate of each of the two persons active in the scene in that sequence. I have considered two persons for comparison purposes, because if the two fitted equations are very close to each other in terms of their parameters, that is evidence that this assumption is practically correct and our estimates are accurate; and that was the case.

Okay, now we have two measurement likelihoods. In each frame we have audio data and video frames coming in: we have audio measurements as the set of TDOAs, and we have video, or image, measurements as the result of background subtraction followed by
morphological operations. How do we use them? How do we fuse this information to find active targets?

We define active targets in terms of the detection probability values of the system. For example, if an active speaker is considered to be a person who is expected to be visible to the camera no less than ninety-five percent of the time, and to be speaking at least forty percent of the time, then we set the detection probability for the visual data at ninety-five percent and for the audio data at forty percent. By increasing and decreasing these detection probabilities we can tune how long we expect an active target to be speaking or to be visible; it is application-dependent and can be tuned by the user.

Then sensor fusion happens by repeating the update step twice. That is simple and fast. I remind you that in the update step of the filter we use the measurement likelihood functions, which we have. So we do the update step first using the visual measurements and then using the audio measurements. In each of these repetitions we use the corresponding detection probability, and again I remind you that in each step we have the legacy tracks and the measurement-corrected tracks, which are tuned by these detection probabilities.

So here are some results. For example, in this case, as you see, people are talking and they are detected. [Video plays; the speaker adjusts the sound.] The left frame shows the image result of tracking; the right frame shows the ensemble of particles that I have used to implement the pdf of each Bernoulli component in the random finite set. [Counting the targets:] one, two, three, four, five, six. The results that you see on the left are
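To see why the two detection probabilities implement the intended notion of an active target, consider the missed-detection (legacy-track) existence update of a Bernoulli component, r ← r(1 − P_D) / (1 − r·P_D), a standard Bernoulli-filter expression; the numbers below simply plug in the 0.95 and 0.40 settings from the talk.

```python
def legacy_existence(r, p_D):
    """Existence probability of a Bernoulli track that received no detection
    (the 'legacy track' term in the multi-Bernoulli update)."""
    return r * (1.0 - p_D) / (1.0 - r * p_D)

r = 0.9
r_after_video_miss = legacy_existence(r, p_D=0.95)  # missed by the camera
r_after_audio_miss = legacy_existence(r, p_D=0.40)  # silent in this frame
```

A miss by the high-P_D visual sensor cuts the existence probability sharply (from 0.9 to about 0.31), while a silent frame under the low audio P_D barely reduces it (to about 0.84), which is exactly the desired behaviour for occasionally speaking, occasionally invisible targets.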
actually the averages of all the winning particles corresponding to each target. [The video continues; the estimated number of targets changes as people enter and leave: one, two, three, four.] And here is another example; all of the examples are from the same database.

I will finish my talk with some quantitative results, because I think we are getting close to the end of our time. In 98.5 percent of all the frames, the existing targets were all detected, and in those frames they were correctly labelled and tracked; labels were never switched during or after occlusion, and the invisible target was successfully tracked using the audio cue. The false negative ratio, false alarm ratio and label switching ratios, with audio-visual and with video-only tracking, are shown here. As you see, the false alarm ratio and label switching ratio are almost zero, or practically zero, and all of them are less than in the case where we are not using the audio data. I will skip the conclusions. Thank you, and I will be happy to answer your questions.

[Session chair] Thank you very much. I would like to thank all of you for remaining until this time, and of course the speakers for their very good talks. I think you can ask your questions to the speakers separately; otherwise the noise in the room will increase.