good afternoon everyone

and thank you very much

for your patience

sitting until

this time

with a delay

for this final talk of the day

today i am going to talk about

a new bayesian approach to solve multi-target tracking using audio-visual data

in this talk

after a background

i will introduce you to

a random finite set approach to the general problem of multi object estimation

by multi object estimation i mean

every problem in which you are dealing with multiple objects

each one having their own states

problems where

there is uncertainty not only in the states of the objects

but

in the number of objects

and

then i switch to a special type of random finite set

called multi-bernoulli sets

and we go through the cardinality balanced member filter

then i will explain the main contribution of the paper which is audio-visual tracking

and some simulation results

and conclusions will finish the talk

the problem that we are focusing on in this paper and presentation is

tracking of

multiple

occasionally speaking

targets

let me give you an example

this is an example from the

spevi database

again

oh

yeah O

we have a bit of sound

the people are speaking occasionally so the audio data is

intermittent

the people can get out of the camera scene therefore occasionally we don't have

visual information coming in

but

we are interested in

detecting

and tracking the multiple targets

as we will see in this example

the targets of interest

are occluded here

while

still

talking

i

and we want to design a filter that can detect

and track simultaneously

all existing active

targets

that there might be

in a given period of time

in this scene i will tell you what the definition of an active target is

and how it can be mathematically formulated

oh

a

so

hmmm

in such problems there are a few main challenges

we have occasionally silent targets

and occasionally invisible targets

and also

we can have clutter measurements

both

in the visual cues

and in the audio features that we extract from the raw audio-visual information

our contribution is

a principled approach to combine audio and video data

in a bayesian framework

okay

all of you are familiar with the

nonlinear filtering approaches

single target tracking methods

there is a single target

which corresponds to a

single measurement

with a single state

and from time k minus one to k

it transits to the new state

and in a

a general bayesian filtering scheme

we have

a prediction step

and an update step

in the prediction step we use

the information that we have about the dynamics

of the object

in the update step

we use the information that we have

provided by the measurements

if we assume that the distribution of the state of the single target is gaussian

and the dynamics and measurement models are linear

then the optimal solution is kalman filtering

in nonlinear cases

particle filters

are used
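The prediction and update steps described above can be sketched for the linear gaussian case with a scalar kalman filter. This is only a minimal illustration, not code from the talk: the function names, the scalar state, and the parameters `f`, `q`, `h` and `r` (dynamics gain, process noise variance, measurement gain, measurement noise variance) are all assumptions chosen for the example.

```python
def kalman_predict(mean, var, f=1.0, q=0.0):
    """Prediction step: push the estimate through the dynamics x' = f*x + noise."""
    return f * mean, f * f * var + q


def kalman_update(mean, var, z, h=1.0, r=1.0):
    """Update step: correct the prediction with a measurement z = h*x + noise."""
    innovation = z - h * mean          # what the measurement adds to the prediction
    s = h * h * var + r                # innovation variance
    gain = var * h / s                 # kalman gain
    return mean + gain * innovation, (1.0 - gain * h) * var


# One filtering cycle from k-1 to k (illustrative numbers only).
mean, var = kalman_predict(0.0, 1.0, f=1.0, q=0.0)
mean, var = kalman_update(mean, var, z=2.0, h=1.0, r=1.0)
```

With equal prior and measurement variances the updated mean lands halfway between the prediction and the measurement, which is the behaviour one would expect from the update step.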

a multi-object filtering problem

is something like that

but with

special complexities

and challenges

we can have the number of objects

randomly changing

the number of measurement cues available randomly changing

we can have

some objects undetected

uncertainty in detections

we can have clutter

and also data association is

another challenge that needs to be

tackled

a relatively recent

approach

to tackle the multi-object

filtering problem

is

using random finite set theory to develop principled solutions to tackle these problems

in this approach

the objects

are modelled as a set

as a random finite set

in which

the uncertainties both

in the states

and in the number

of the targets or objects are

mathematically modelled

therefore

instead of multiple objects or targets we will be dealing with a single target that is modelled

as a set

and

will be dealing with

mathematics

of

sets

integration

and derivation

and

statistical properties of the set

however

the problem is encapsulated

as a single target

tracking or a single object estimation problem

in the mathematical formulation of various solutions that have been developed

in this

framework

random finite set theory framework

detection uncertainty

clutter

and association uncertainty

are mathematically formulated in a principled manner

i am not going to go through some of these solutions including the well-known phd filter

and C phd filter

and member filter

and

uh

cardinality balanced member filter

which is the filter that i am using to solve the problem in focus in this presentation

a special kind of random finite sets are multi-bernoulli random finite sets

they are the ensemble of M

which is

unknown but can be determined iteratively

capital M

bernoulli sets

each bernoulli set is

prescribed by

two

parameters

r which is the existence probability

of a possible object

and p which is the pdf of the state of that object

and the union of all these

bernoulli sets

forms a multi-bernoulli random finite set

a multi-bernoulli rfs or random finite set can be fully prescribed by an ensemble of

all the r_i and

p_i

so

as you see

the whole uncertainty in the number of

objects that exist in the scene

and the distribution of the states

can be mathematically modelled

having these r_i

and the p_i functions
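To make the multi-bernoulli description concrete, here is a small sketch of drawing one realisation from such a set: each component exists with probability r and, if it exists, contributes one state drawn from its pdf p. The component values and the one-dimensional gaussian state pdfs are made up for illustration and are not taken from the talk.

```python
import random


def sample_multi_bernoulli(components, rng):
    """Draw one realisation of a multi-bernoulli random finite set.

    components: list of (r, sampler) pairs, where r is the existence
    probability and sampler(rng) draws a state from that component's pdf.
    """
    realisation = []
    for r, sampler in components:
        if rng.random() < r:                 # the object exists with probability r
            realisation.append(sampler(rng))
    return realisation


def expected_cardinality(components):
    """Expected number of objects: the sum of the existence probabilities."""
    return sum(r for r, _ in components)


# Hypothetical ensemble of three bernoulli components with 1-D gaussian states.
components = [
    (0.9, lambda g: g.gauss(10.0, 1.0)),
    (0.5, lambda g: g.gauss(50.0, 2.0)),
    (0.1, lambda g: g.gauss(90.0, 1.0)),
]
targets = sample_multi_bernoulli(components, random.Random(0))
```

Both the number of elements in `targets` and their values are random, which is the point of the set-valued model: the uncertainty in the number of objects and in their states sits in one object.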

and as with

general bayesian filters

we have a prediction and an update step

and when we model the random finite set of targets

as a multi-bernoulli random finite set

it is the r_i

and p_i which are predicted and updated

the member filter

and its fixed version the cardinality balanced or cb-member filter

are more useful than phd filters

in practical implementations because of the computational requirements and

also their accuracy

so

similar to a general bayesian filter in prediction

the r_i and p_i are predicted

and the predicted r_i and p_i equations involve

the survival probability of each

object

and the

state transition density of each object

these are the dynamics

information that we have about the movements

or the state

changes of the object

in addition

in prediction

a new set

of

bernoulli sets

is

introduced to the system as the result of

newly arriving

objects

in the scene

in the prediction step
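The prediction just described can be sketched with a particle representation of each p_i. The sketch assumes the usual form of the multi-bernoulli prediction, in which the predicted existence probability is r times the expected survival probability, the particles are moved through the transition density, and birth components are appended. The tuple layout `(r, particles, weights)` and all function names are assumptions for this example.

```python
import random


def predict(components, birth, p_survive, transition, rng):
    """Multi-bernoulli prediction on particle-represented components.

    components, birth: lists of (r, particles, weights), weights summing to 1
    p_survive(x): survival probability of a state x
    transition(x, rng): draws the next state given the current state x
    """
    predicted = []
    for r, particles, weights in components:
        surviving = [w * p_survive(x) for x, w in zip(particles, weights)]
        total = sum(surviving)               # expected survival probability <p, p_S>
        if total <= 0.0:
            continue                         # the component cannot survive
        new_particles = [transition(x, rng) for x in particles]
        new_weights = [s / total for s in surviving]
        predicted.append((r * total, new_particles, new_weights))
    return predicted + list(birth)           # union with the newly born components


# Illustrative use: one existing track, one birth component, constant survival.
existing = [(0.5, [0.0, 1.0], [0.5, 0.5])]
newborn = [(0.1, [5.0], [1.0])]
result = predict(existing, newborn, lambda x: 0.9,
                 lambda x, g: x + 1.0, random.Random(0))
```

In a real tracker `transition` would add process noise; the deterministic shift here only keeps the example easy to check by hand.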

in the update step

the ensemble of r_i and p_i of the bernoulli sets

are updated to the union of two sets

one set includes the legacy tracks

the sets that are there

because they might not be detected

they might not have been detected in that

frame

and the sets

that are there and are updated using the measurement

data
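The two sets produced by the update, legacy tracks and measurement-updated tracks, can be sketched as follows. This is a simplified reading of the published cb-member update, written from general knowledge of that recursion rather than from the slides, and it assumes a constant detection probability, a constant clutter intensity, particle-represented components, and existence probabilities strictly below one. All names are illustrative.

```python
def cbmember_update(components, measurements, p_detect, likelihood, clutter):
    """Return legacy tracks followed by measurement-updated tracks.

    components: list of (r, particles, weights) with 0 <= r < 1
    likelihood(z, x): single-object measurement likelihood g(z | x)
    clutter: clutter intensity, assumed constant over the measurement space
    """
    legacy = []
    for r, xs, ws in components:
        # Constant p_D: <p, p_D> = p_D and the legacy pdf keeps its shape.
        r_legacy = r * (1.0 - p_detect) / (1.0 - r * p_detect)
        legacy.append((r_legacy, list(xs), list(ws)))

    updated = []
    for z in measurements:
        numerator, denominator = 0.0, clutter
        pooled_x, pooled_w = [], []
        for r, xs, ws in components:
            psi = [w * p_detect * likelihood(z, x) for x, w in zip(xs, ws)]
            inner = sum(psi)                       # <p, psi_z>
            miss = 1.0 - r * p_detect
            numerator += r * (1.0 - r) * inner / miss ** 2
            denominator += r * inner / miss
            pooled_x.extend(xs)
            pooled_w.extend(r / (1.0 - r) * v for v in psi)
        total = sum(pooled_w)
        if total > 0.0:                            # skip measurements nothing explains
            updated.append((numerator / denominator, pooled_x,
                            [w / total for w in pooled_w]))
    return legacy + updated


# Illustrative call: one track, one measurement that matches it, no clutter.
posterior = cbmember_update([(0.5, [0.0], [1.0])], [0.0], 0.9,
                            lambda z, x: 1.0, 0.0)
```

Each measurement spawns one updated component whose particles are pooled from all predicted components and reweighted by the likelihood, so a measurement far from every track yields a component with a small existence probability.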

in these equations i want to draw your attention to

two important parameters

detection probability

and measurement likelihood

P D

and G K

these are

defined for

single objects

you have measurements

the relationship between the measurement

and the object

state

is defined

mainly dependent on

your sensor performance and your equipment

dynamics

and also some environmental

parameters

such as the clutter rate

it characterizes the whole measurement process

and the detection probability is another parameter using which

we can tune the performance of the system

and

define

our definition of

active

speakers or active targets in the scene

you will see how

so for audio-visual tracking

in our implementation the state includes x in the image y in the image and x dot and y dot

and the size

of

a rectangular

blob that we will get

as a result of tracking

in the image

video measurements are obtained by performing kernel-based background subtraction followed by morphological image operations

the result would be a set of rectangular blobs in each frame

if we denote the results as a random finite set

Z

in which each element includes

x y w and h

then the likelihood

can be defined by this function

which is a gaussian likelihood
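The exact function on the slide is not recoverable from the recording, so the following is only a sketch of what a gaussian likelihood for a blob measurement (x, y, w, h) could look like. The state ordering `(x, y, x_dot, y_dot, w, h)`, the independence of the four dimensions, and the standard deviations in `sigmas` are all assumptions.

```python
import math


def visual_likelihood(z, state, sigmas=(5.0, 5.0, 3.0, 3.0)):
    """Gaussian likelihood of a blob z = (x, y, w, h) given an object state.

    state is assumed to be (x, y, x_dot, y_dot, w, h); the velocities are
    not observed, so only position and size enter the likelihood.
    """
    x, y, _x_dot, _y_dot, w, h = state
    expected = (x, y, w, h)
    value = 1.0
    for z_i, m_i, s in zip(z, expected, sigmas):
        value *= math.exp(-0.5 * ((z_i - m_i) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
    return value


# A blob that matches the state scores higher than one shifted by ten pixels.
state = (100.0, 60.0, 1.0, 0.0, 20.0, 40.0)
near = visual_likelihood((100.0, 60.0, 20.0, 40.0), state)
far = visual_likelihood((110.0, 60.0, 20.0, 40.0), state)
```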

with audio measurements

i have taken the simplest approach

assuming that there are two microphones on two sides of the camera

the time difference of arrival or tdoa

is calculated using cross correlation

the generalized cross-correlation function with phase transform or gcc-phat

because of reverberation effects

there are several peaks in the gcc-phat curve

when it is plotted versus time difference

in our experiments

we have considered at most the five largest

peaks of the gcc-phat values

and

we consider them as the tdoa measurements in each frame
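The tdoa extraction described above can be sketched with numpy. This is a generic gcc-phat implementation that keeps up to five of the largest local peaks, not the author's code; the function name, the `max_tau` limit and the sampling rate in the example are assumptions.

```python
import numpy as np


def gcc_phat_peaks(sig, ref, fs, max_tau=None, n_peaks=5):
    """Return up to n_peaks candidate tdoas (seconds) and their gcc-phat values.

    A positive tdoa means that `sig` lags behind `ref`.
    """
    n = len(sig) + len(ref)                       # zero-pad to avoid circular wrap
    cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cross /= np.abs(cross) + 1e-12                # phat weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)

    max_shift = n // 2
    if max_tau is not None:
        max_shift = max(1, min(int(fs * max_tau), max_shift))
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lags = np.arange(-max_shift, max_shift + 1)

    # Local maxima of the curve, then the n_peaks largest of them.
    interior = np.arange(1, len(cc) - 1)
    is_peak = (cc[interior] > cc[interior - 1]) & (cc[interior] >= cc[interior + 1])
    peak_idx = interior[is_peak]
    top = peak_idx[np.argsort(cc[peak_idx])[::-1][:n_peaks]]
    return lags[top] / fs, cc[top]


# Synthetic check: the second channel is the first one delayed by 7 samples.
rng = np.random.default_rng(0)
ref = rng.standard_normal(1024)
sig = np.concatenate((np.zeros(7), ref))[:1024]
tdoas, heights = gcc_phat_peaks(sig, ref, fs=16000.0, max_tau=0.002)
```

In a reverberant room the true delay is not always the largest peak, which is why several peaks are kept and passed to the filter as a set of measurements, with the false ones treated as clutter.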

so

in order to

parametrize and calculate the relationship between these tdoa measurements

and the state

the object state which is x

y w and h

first

there is a practical consideration

the distance of targets from the microphones

is

relatively large compared to the distance between the two microphones

therefore we can practically assume

that there is a linear relationship between x

and the corresponding tdoa

in order to find the parameters of this linear relationship

i have used

the ground truth state that i have

in one of the

cases one of the videos in the spevi database

in each frame i have calculated five peaks or five tdoas

and

these are the red points

and then

i want to find out

because

many of them are outliers and only some of them are inliers

and using a robust estimation technique you can detect and remove the outliers and then use regression to find out

that linear

relationship that exists

between the tdoa and x

of

each of the two persons that are active in the scene in that case study

and i have

considered two persons

for

comparison purposes

because if the two equations are very close

to each other in terms of their parameters

that would prove that

this assumption is practically correct

and our estimates

are accurate

and that was the case
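The calibration step, removing outlying tdoas and then regressing tdoa on x, can be sketched as below. The talk does not name the robust estimator that was used; ransac is swapped in here purely as a familiar stand-in, and the threshold, trial count and synthetic numbers are assumptions.

```python
import numpy as np


def robust_line_fit(x, tdoa, threshold, n_trials=200, seed=0):
    """Fit tdoa = slope * x + intercept while ignoring outliers.

    Stage 1 (ransac, an assumed stand-in): find the largest set of points
    that agree with a line through two randomly chosen points.
    Stage 2: ordinary least-squares regression on those inliers only.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    tdoa = np.asarray(tdoa, dtype=float)
    best = None
    for _ in range(n_trials):
        i, j = rng.choice(len(x), size=2, replace=False)
        if x[i] == x[j]:
            continue                              # vertical line, cannot be used
        slope = (tdoa[j] - tdoa[i]) / (x[j] - x[i])
        intercept = tdoa[i] - slope * x[i]
        inliers = np.abs(tdoa - (slope * x + intercept)) < threshold
        if best is None or inliers.sum() > best.sum():
            best = inliers
    if best is None:
        raise ValueError("all x values are identical; no line can be fitted")
    slope, intercept = np.polyfit(x[best], tdoa[best], 1)
    return slope, intercept, best


# Synthetic data: tdoa in milliseconds against horizontal pixel position,
# with every sixth point corrupted to play the role of a spurious peak.
xs = np.arange(60.0)
ts = 0.002 * xs - 0.3
ts[::12] += 0.2
ts[6::12] -= 0.2
slope, intercept, inliers = robust_line_fit(xs, ts, threshold=0.01)
```

Running the same fit separately on the points of each speaker and comparing the two (slope, intercept) pairs mirrors the consistency check described in the talk.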

okay

now

we have a

two measurement likelihoods

in each frame

we have

audio data

and we have the video frame coming in

we have audio measurements as the set of tdoas

and we have

video or

image measurements

as a result of background subtraction followed by morphological operations

how do we use them how do we fuse this information

to find

active

targets

we define active targets

in terms of the probability of detection

values

system

for example if an active speaker is considered to be the person who is expected to be

visible visible to the camera in no less than ninety five percent of the time

and to be speaking in have at least forty percent of the time

then

we set

the detection probability for visual data as ninety five percent

and

for audio data as forty percent

by increasing

and decreasing these

detection probabilities

we can tune

how long we expect

an active target to be speaking or to be visible

it is application dependent and it can be tuned by the user
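A short worked example may show why these two numbers act as a definition of an active target. It assumes the legacy-track formula of the published cb-member recursion with a constant detection probability $p_D$, and a prior existence probability of 0.9 chosen only for illustration. A track that is missed by a sensor keeps the existence probability $r_L$:

```latex
r_L = \frac{r\,(1 - p_D)}{1 - r\,p_D}, \qquad r = 0.9:
\quad
r_L\big|_{p_D = 0.95} = \frac{0.9 \times 0.05}{1 - 0.855} \approx 0.31,
\qquad
r_L\big|_{p_D = 0.40} = \frac{0.9 \times 0.60}{1 - 0.36} \approx 0.84
```

Under these assumptions a single frame without a visual detection cuts the existence probability sharply, while a silent frame barely changes it, which matches the expectation that an active speaker is almost always visible but speaks only part of the time.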

then

then

sensor fusion happens

by

repeating the update

step

twice

that simple

first

we do the update step

and i remind you that in the update

step of the filter we are using the measurement likelihood

functions

which we have

so

we do the update step first

using the visual measurements and then using the audio measurements

a

in each of these repetitions we use the corresponding

detection probability

and again i remind you that in each step

we have

the legacy tracks

and we have measurement-corrected tracks

which

are tuned by these detection probabilities
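The fusion scheme, one update per sensor with that sensor's own detection probability, can be sketched as a simple loop. To keep the example short and checkable, the update used in the demonstration is a stand-in that handles only the missed-detection case and tracks only existence probabilities; a full filter would pass a complete update step instead. The dictionary keys and function names are hypothetical.

```python
def fuse_sequentially(components, sensors, update):
    """Run the update once per sensor, feeding each posterior into the next."""
    for sensor in sensors:
        components = update(components, sensor)
    return components


def missed_detection_update(existence_probs, sensor):
    """Stand-in update for the demo: every track is treated as undetected,
    so only the legacy existence probability is computed."""
    p_d = sensor["p_detect"]
    return [r * (1.0 - p_d) / (1.0 - r * p_d) for r in existence_probs]


# Video first, then audio, each with its own detection probability.
sensors = [
    {"name": "video", "p_detect": 0.95},
    {"name": "audio", "p_detect": 0.40},
]
posterior = fuse_sequentially([0.9], sensors, missed_detection_update)
```

Because each pass uses its own likelihood and detection probability, adding a further sensor under this scheme would only mean appending another entry to `sensors`.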

so here are some results

for example in this case

yeah

a class

as you see people are talking

and they are uh detected and

this is not

the sound of my sure i a there for the

a

but i

let me run it outside

probably its

a

a

but

okay

a

five to the have like a large C there are being right

um

the left frame

yeah

shows the image result of tracking

the right frame shows the ensemble of particles that i have used to implement the

density of each bernoulli component in the random finite set

okay

uh uh uh i two three four five six

and the results that you see on the

left

are actually the average of all the weighted particles corresponding to each target

one and and is was the results of think well

and one thing i it if is one from the number of a whole

well uh we for one two three four

wow

and here is another example

all of them are from the

spevi database

uh i

and um

a

i

i

i

just

i will finish my talk

with some quantitative results because i think we are close to the end of time

in ninety eight point five percent of all the frames the existing targets

were all detected

and in these cases

they were correctly labeled and tracked

labels were never switched after or during occlusion

and the hidden target was successfully tracked using the audio cue

and

a false negative ratio false alarm ratio and label switching shoes

we out would you and read want you are here as you see

these false alarm rate to and label switching ratio as are

almost zero or cut a zero and none available in this and uh

right

and and are less than

uh the case when we are not using the audio data

and i will skip the conclusions

thank you and i will be answering your questions

thank you very much okay

i would like to thank all of you for remaining until this time and thank the speakers for

their

very good talks

i think you can do breast are also separately over wise the noise that in your we'll uh

increase