Speech Transcript - A PROBABILISTIC PIXEL-BASED APPROACH TO DETECT HUMANS IN VIDEO STREAMS

0:00:15	i
0:00:15	with and
0:00:16	present them at that is and you know
0:00:19	i
0:00:20	to detect humans
0:00:21	in video streams
0:00:22	so the outline
0:00:23	of my presentation is as follows
0:00:26	first i will explain why from our point of view the detection of humans
0:00:29	in video streams
0:00:31	is a binary silhouette classification problem
0:00:34	and then we will design a new robust probabilistic
0:00:37	pixel-based approach
0:00:38	to classify
0:00:38	silhouettes
0:00:40	and i will present the datasets we used to assess our method
0:00:44	the results
0:00:45	and i will give a short conclusion
0:00:49	uh so let's start with the definition of the problem of the detection
0:00:53	of humans in video streams
0:00:56	detecting humans in video streams is useful for many applications
0:01:00	such as video surveillance
0:01:01	but the at we just send an is
0:01:04	oh to ten man
0:01:06	are
0:01:06	the goal is to detect only humans and nothing else
0:01:10	so the cornerstone of such
0:01:12	an application
0:01:13	is the ability to classify the observations
0:01:17	into two classes
0:01:18	that are human and non-human
0:01:22	a first approach to detect humans has been proposed for still images
0:01:27	however such an approach has a lot of drawbacks
0:01:30	it's based on the appearance
0:01:32	and usually the appearance
0:01:33	of humans
0:01:34	is unpredictable because uh
0:01:36	colours and textures
0:01:38	uh
0:01:40	uh
0:01:41	vary
0:01:43	also a lot of detection windows have to be considered
0:01:46	per image because we have to look for humans at a lot of positions and at a lot of
0:01:50	scales
0:01:52	so um
0:01:53	the consequence is that it is difficult to obtain a low false alarm rate per image
0:01:59	a better approach consists in using a background subtraction algorithm
0:02:04	this one
0:02:05	uh
0:02:06	extracts the silhouettes of the users
0:02:07	and the moving objects in the scene
0:02:10	thus
0:02:11	uh we take advantage of the temporal information present in the video streams
0:02:17	and um
0:02:20	the um decision is based on geometric information
0:02:24	also only a few silhouettes have to be considered
0:02:26	so it's possible to obtain a lower false alarm rate
0:02:30	per image than with the approach based on
0:02:33	the image
0:02:35	so now we will design
0:02:37	a probabilistic pixel-based approach
0:02:39	to classify silhouettes
0:02:43	when we
0:02:44	designed our method we took into account
0:02:46	the errors
0:02:47	because we want to
0:02:48	get a robust method
0:02:51	so when a background subtraction is followed by a connected components analysis
0:02:55	silhouettes may present defects
0:02:58	first
0:02:59	they may present
0:03:00	um
0:03:01	noisy contours or or you do not
0:03:04	but
0:03:05	also um
0:03:07	the silhouettes of
0:03:09	several users or moving objects
0:03:12	could be merged
0:03:13	and last but not least um
0:03:15	carried objects and shadows could also be detected in the foreground
0:03:19	so we need to have a robust
0:03:22	uh description technique
0:03:25	a lot of description techniques exist
0:03:28	and mainly two criteria can be used to choose the um
0:03:33	description technique most adapted
0:03:35	to uh what we need
0:03:38	so the first criterion
0:03:40	uh is related to the use of
0:03:43	uh the contour points or the entire interior of the silhouette
0:03:48	but as you can see
0:03:50	in every image the noise affects the contours
0:03:53	so a region-based method
0:03:55	is
0:03:56	less sensitive to noise and is preferable
0:03:59	another criterion is
0:04:01	related to um
0:04:03	the use of global or local uh
0:04:06	attributes
0:04:08	with global methods
0:04:09	the attributes
0:04:11	um are related to the whole shape
0:04:13	therefore
0:04:14	um
0:04:16	and the
0:04:17	all the attributes are affected
0:04:19	by
0:04:20	the presence of defects
0:04:21	in the silhouette
0:04:22	local methods
0:04:23	split shapes into smaller components
0:04:26	with the hope to limit the influence of the defects to a few components
0:04:30	so
0:04:31	uh in our case a region-based local shape description method is preferred
0:04:39	our pixel-based description method is the simplest region-based and local description method one
0:04:44	can imagine it's the set of all pixels included in the silhouette
0:04:47	but the question is how can you classify such a set
0:04:51	unfortunately there doesn't exist any machine learning method designed for this task
0:04:55	so we will do this by ourselves
0:05:02	in our approach each pixel plays the role of an expert
0:05:06	in the first stage
0:05:07	each expert
0:05:08	decides
0:05:09	if its
0:05:10	the
0:05:11	if
0:05:11	its related pixel is part of a human silhouette or not
0:05:15	and we also assume that it gives the probability for this decision to be correct
0:05:21	um um as a matter of fact
0:05:22	the experts can be implemented by machine learning methods
0:05:27	in the second stage the opinions given by the experts
0:05:30	are merged
0:05:31	by a weighted vote
0:05:34	and the weight given to an expert
0:05:35	depends on the probability for its decision to be correct
0:05:39	indeed
0:05:40	intuitively we want that
0:05:42	a confident expert
0:05:44	has an important weight in the final decision
0:05:49	so here is a small example to understand the method
0:05:53	so
0:05:54	each
0:05:54	x
0:05:55	is a pixel and p of x
0:05:57	is the probability for it to
0:06:00	be part of a human silhouette
0:06:02	that's the information given by the expert
0:06:06	so
0:06:06	in this example six experts are used to classify the silhouette
0:06:11	each of them takes a look around it and based on this information
0:06:16	it gives the probability for it to be part of a human silhouette
0:06:21	and then all this information is mapped into the final decision that is
0:06:26	that
0:06:27	the silhouette is human
0:06:31	so to implement the experts we have used a powerful uh machine learning method
0:06:37	named
0:06:37	extra-trees
0:06:40	it's a method that doesn't require you to optimize any parameter
0:06:45	and it also avoids
0:06:46	intrinsically uh overfitting
0:06:50	it's like
0:06:52	an ensemble of decision trees
0:06:54	and in our case each tree votes for one class human or non-human
0:06:59	so we have a notation for the proportion of trees voting for the class human
0:07:04	and two others that are
0:07:06	respectively the amounts of human and non-human samples
0:07:09	in the learning set
0:07:12	and
0:07:13	we propose the following estimator for the probability
0:07:17	uh of
0:07:19	the
0:07:19	pixel
0:07:20	to be issued from a human silhouette
0:07:23	so um
0:07:24	when the learning set
0:07:25	is balanced
0:07:27	the probability uh is
0:07:29	approximately equal to the proportion of trees voting for the class human
0:07:34	however when the learning database is not balanced
0:07:38	there is a bias in the decision taken by the trees
0:07:41	and this bias has to be cancelled that's what we do in our probability estimator
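
The bias cancellation described above can be sketched in Python. The talk does not spell out the estimator, so this is only a minimal illustration under an assumption: the proportion of trees voting human is treated as a posterior under the class frequencies of the learning set, and it is re-weighted to equal class priors (a standard prior-shift correction, not necessarily the authors' formula). The names `q`, `n_human` and `n_non_human` are hypothetical.

```python
def corrected_probability(q, n_human, n_non_human):
    """Re-weight a tree-vote proportion to cancel learning-set imbalance.

    q            proportion of trees voting for the class human (0..1)
    n_human      number of human samples in the learning set
    n_non_human  number of non-human samples in the learning set

    Assumption: q behaves like a posterior under the learning-set class
    frequencies, so dividing each side by its class count removes the
    bias that the imbalance introduces.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    if n_human <= 0 or n_non_human <= 0:
        raise ValueError("both classes need at least one sample")
    human_term = q / n_human
    non_human_term = (1.0 - q) / n_non_human
    return human_term / (human_term + non_human_term)


if __name__ == "__main__":
    # Balanced learning set: the proportion of votes is left unchanged.
    print(corrected_probability(0.3, 100, 100))
    # Humans under-represented: the same vote share counts for more.
    print(corrected_probability(0.5, 100, 300))
```

With a balanced learning set the function returns `q` itself, which matches the remark in the talk that the probability is then approximately the proportion of trees voting human.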
0:07:48	so
0:07:49	once we have a probability uh for each pixel
0:07:53	we can compute
0:07:55	the decision taken by the expert
0:07:58	and the probability for this decision to be correct
0:08:01	using bayes rule
0:08:04	also um
0:08:06	as i said earlier
0:08:08	uh we use a weighted
0:08:11	a weighted um voting rule
0:08:14	to give the class
0:08:15	to the silhouette
0:08:17	and that's
0:08:18	equation number four
0:08:20	where w is the weight given
0:08:22	to the expert
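
The two stages described above (a decision and a confidence per pixel, then a weighted vote over the silhouette) can be sketched as follows. This is an illustration under stated assumptions, not the equation shown on the slides: an expert is assumed to vote human when its probability is at least 0.5, its confidence is assumed to be the larger of p and 1 - p, and the default weighting function (weight equal to confidence) is a hypothetical choice. All function names are hypothetical.

```python
def expert_decision(p_human):
    """Return (votes_human, confidence) for one pixel-level expert.

    Assumption: the expert votes human when p_human >= 0.5, and the
    probability that its decision is correct is max(p, 1 - p).
    """
    votes_human = p_human >= 0.5
    confidence = p_human if votes_human else 1.0 - p_human
    return votes_human, confidence


def classify_silhouette(pixel_probabilities, weight=lambda c: c):
    """Merge the experts of one silhouette with a weighted vote.

    pixel_probabilities  per-pixel probabilities of belonging to a human
    weight               maps an expert's confidence to its voting weight
                         (the identity default is a hypothetical choice)
    """
    score_human = 0.0
    score_non_human = 0.0
    for p in pixel_probabilities:
        votes_human, confidence = expert_decision(p)
        if votes_human:
            score_human += weight(confidence)
        else:
            score_non_human += weight(confidence)
    return "human" if score_human >= score_non_human else "non-human"


if __name__ == "__main__":
    # Six experts, as in the small example given in the talk.
    print(classify_silhouette([0.9, 0.8, 0.6, 0.4, 0.3, 0.7]))
```

Passing a different `weight` function changes how strongly confident experts dominate; for instance `lambda c: 2 * c - 1` gives almost no say to experts whose probability is close to 0.5.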
0:08:26	so now i will present
0:08:28	the datasets we used to assess our method
0:08:31	the results
0:08:32	and give
0:08:33	a short conclusion
0:08:35	both our learning set and testing set
0:08:37	contain human and non-human silhouettes
0:08:40	they are resized to one hundred by one hundred pixels
0:08:44	this means that our method is
0:08:47	scale invariant
0:08:49	and also that the method can be used with low resolution images
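
The normalization to one hundred by one hundred pixels can be sketched as follows. The talk only states the target size, so the resampling scheme is an assumption: this sketch uses nearest-neighbour sampling on a binary mask that is assumed to be already cropped to the silhouette. The function name `resize_mask` is hypothetical.

```python
def resize_mask(mask, size=100):
    """Resample a binary silhouette mask to size x size pixels.

    mask  list of equally long rows holding 0 (background) or 1 (foreground)
    size  side length of the square output

    Assumption: nearest-neighbour sampling; each output pixel copies the
    input pixel found at the same relative position.
    """
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    if rows == 0 or cols == 0:
        raise ValueError("mask must contain at least one pixel")
    return [
        [mask[(r * rows) // size][(c * cols) // size] for c in range(size)]
        for r in range(size)
    ]


if __name__ == "__main__":
    tiny = [[0, 1],
            [1, 0]]
    normalized = resize_mask(tiny)
    print(len(normalized), len(normalized[0]))
```

Because every silhouette ends up on the same grid, a pixel at a given relative position is described in the same way whatever the original size of the silhouette.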
0:08:54	also note that
0:08:56	our learning set is not balanced
0:08:58	but this is not a problem as our probability estimator cancels the bias
0:09:03	to by
0:09:05	so here are the results
0:09:07	the images show the probability p of x computed in each pixel
0:09:13	um
0:09:14	these are probability maps
0:09:16	a white pixel corresponds
0:09:18	to a probability of one
0:09:20	whereas a black pixel corresponds to a probability of zero
0:09:25	as you can see
0:09:26	human silhouettes are brighter
0:09:28	and that means that our method works very well
0:09:33	uh_huh hmmm
0:09:36	once
0:09:37	we have computed the probability p of x
0:09:39	in each pixel
0:09:41	we also have to assign a weight
0:09:44	to each pixel
0:09:46	in fact we tried three different weighting strategies
0:09:49	one of them being to learn automatically the weighting function
0:09:54	and
0:09:54	all these strategies lead to similar results
0:10:00	and then we have a correct classification rate of around ninety percent for both human and non-human
0:10:05	silhouettes
0:10:07	however this is only a starting point
0:10:09	because
0:10:11	uh
0:10:11	we did not yet try
0:10:13	to optimize
0:10:14	the set of attributes
0:10:16	used to describe a pixel
0:10:18	and also to describe it we have to define a neighborhood
0:10:22	and we didn't try to optimize the neighborhood shape and the neighborhood size
0:10:27	so i believe that better results
0:10:29	can be obtained with our method
0:10:33	so in conclusion we have proposed a new system for the detection of humans
0:10:38	well suited for video streams
0:10:40	our approach has been designed to rely on geometric information
0:10:45	and to be robust to noise
0:10:47	so in a first step we apply a background subtraction algorithm
0:10:51	to extract the silhouettes
0:10:53	of
0:10:53	persons and moving objects in the scene
0:10:56	then probabilistic information
0:10:58	is computed for each pixel in the foreground
0:11:03	and finally
0:11:04	this information is used to decide whether the silhouette is that of a human or not
0:11:08	the results show that our approach is promising for the detection of humans in video streams
0:11:14	but finding the optimal neighborhood used
0:11:17	for the description of a pixel is left for future work
0:11:20	thank you
0:11:27	thank you sébastien
0:11:29	any questions
0:11:38	uh what about a comparison with the hog-based
0:11:42	uh pedestrian detection
0:11:45	but is so that
0:11:47	my first play
0:11:49	with a but is
0:11:50	you have really uh
0:11:52	a huge number of
0:11:54	uh detection windows
0:11:55	to be considered
0:11:57	um
0:11:58	and this is
0:11:59	about
0:12:00	twelve thousand images
0:12:02	uh i mean windows per image
0:12:04	um therefore the um
0:12:06	false alarm rate
0:12:08	should be multiplied by
0:12:10	oh what i on
0:12:12	to obtain the false alarm rate
0:12:14	per image
0:12:15	so it gives a really um high false alarm rate per image
0:12:20	uh also there are uh techniques to
0:12:23	keep only um a small amount of uh responses in the images
0:12:27	but at least you have a false
0:12:30	detection per image
0:12:32	this is not
0:12:33	uh
0:12:34	acceptable for applications
0:12:36	such as video surveillance
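
The arithmetic behind this answer can be sketched as follows. The figure of about twelve thousand detection windows per image comes from the talk; the per-window false alarm rate of 1e-4 is a hypothetical value chosen only for illustration, and the second function additionally assumes that windows err independently. The function names are hypothetical.

```python
def false_alarms_per_image(rate_per_window, windows_per_image):
    """Expected number of false alarms in one image."""
    return rate_per_window * windows_per_image


def probability_of_any_false_alarm(rate_per_window, windows_per_image):
    """Chance that at least one window of the image raises a false alarm.

    Assumption: windows err independently of each other, which is
    optimistic because neighbouring windows overlap.
    """
    return 1.0 - (1.0 - rate_per_window) ** windows_per_image


if __name__ == "__main__":
    windows = 12000   # about twelve thousand windows per image (from the talk)
    rate = 1e-4       # hypothetical per-window false alarm rate
    print(false_alarms_per_image(rate, windows))
    print(probability_of_any_false_alarm(rate, windows))
```

Even a per-window rate that looks very small yields on the order of one false detection per image once it is multiplied by the number of windows, which is the point made in the answer.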
0:12:39	but you can apply uh the hog detector
0:12:43	the descriptor is going to be
0:12:45	on the uh um
0:12:47	on the moving mask
0:12:49	on the moving object
0:12:50	okay but
0:12:51	in this case
0:12:53	the hog um
0:12:55	the hog is computed using colours
0:12:59	and
0:12:59	um
0:13:00	the appearance of humans in videos
0:13:03	is
0:13:04	unpredictable you can have
0:13:06	a lot of different colours and textures
0:13:08	and from our point of view it's preferable to use only the geometric information
0:13:14	and the temporal information that we have
0:13:16	in the video streams
0:13:17	i does as
0:13:18	to do this
0:13:20	that's why we have
0:13:21	chosen
0:13:26	yes but
0:13:32	really
0:13:33	a funny question but
0:13:34	uh uh so uh
0:13:36	i mean based on the shape but uh you said you want to distinguish between humans and the rest
0:13:41	so when you put next to these a monkey
0:13:44	i mean
0:13:46	have you have you to this view my
0:13:48	because uh for me like
0:13:50	something that has like uh
0:13:53	the same shape
0:13:55	will be detected as human
0:13:57	right
0:13:58	okay so the monkey will be human
0:14:02	um um
0:14:03	in fact you can uh learn
0:14:06	with
0:14:06	uh
0:14:07	monkeys in the negative
0:14:09	set
0:14:10	so the set of non-human silhouettes
0:14:13	and probably if you don't have uh
0:14:16	too much
0:14:17	too much noise in the images
0:14:19	this will work
0:14:20	but
0:14:20	uh in real applications
0:14:22	there is noise
0:14:23	and therefore we used small images
0:14:26	one hundred by one hundred
0:14:28	that's why i guess uh you are right the monkey will be detected as human
0:14:33	but with synthetic images
0:14:36	without noise
0:14:37	it is possible to distinguish
0:14:39	well then you have
0:14:40	problems with
0:14:42	the clothes
0:14:43	and so on
0:14:44	okay
0:14:45	thank you
0:14:47	other question
0:14:50	have you got any assumptions on how the camera should be
0:14:54	well compared to
0:14:56	people
0:14:57	um
0:14:58	yes indeed
0:15:00	when you build your learning set
0:15:03	you should um
0:15:05	fill it with uh
0:15:07	silhouettes taken from the same point of view
0:15:10	as in the real application
0:15:12	for example if
0:15:13	in the
0:15:13	real application your camera
0:15:15	is
0:15:16	above the person
0:15:17	then you should
0:15:18	place in your learning set
0:15:20	so
0:15:20	silhouettes taken
0:15:21	the
0:15:22	on the same
0:15:24	but
0:15:25	um
0:15:28	this is
0:15:29	in practice
0:15:30	not a problem
0:15:31	the human silhouettes in the learning set can be generated with
0:15:36	a human avatar
0:15:38	and uh for changing the point of view
0:15:40	it takes only a few minutes
0:15:42	to compute
0:15:44	and uh
0:15:45	related to the first question
0:15:47	have you got a sense of how this would perform compared to
0:15:51	uh a trained cascade of classifiers
0:15:54	there have been approaches that look at
0:15:56	humans
0:15:57	like humans or
0:15:59	humans
0:15:59	as a set of parts of the body
0:16:01	using a cascade classifier
0:16:04	and that it takes a long time to train but it's not source will we
0:16:07	at this point we didn't compare
0:16:10	because uh we
0:16:13	we hope to have better results with our method
0:16:16	and also our method has some um
0:16:20	um positive points
0:16:21	for example you have information computed in each pixel
0:16:25	which means that for example if i
0:16:28	hold a guitar in my hand
0:16:30	it will be detected as being in the foreground by the background subtraction
0:16:34	but
0:16:34	the probability maps
0:16:36	allow
0:16:37	to erase the guitar if i want for example or to do pose recovery
0:16:42	like this
0:16:44	so
0:16:46	i think our method will
0:16:48	uh
0:16:49	steve
0:16:53	well last question from you again
0:16:59	you mentioned that this can be used for video sequences
0:17:01	have you thought about how to use the temporal information
0:17:04	because your method is
0:17:05	frame by frame
0:17:06	yes the temporal information is used
0:17:09	uh in fact by the background subtraction
0:17:12	okay i
0:17:13	but my question was do you think that
0:17:15	the uh
0:17:16	you could use the temporal information on the
0:17:18	successive detections
0:17:20	in successive frames to improve the result
0:17:23	about that
0:17:24	uh
0:17:25	yeah
0:17:26	if you want you
0:17:27	can apply tracking
0:17:29	for example
0:17:30	and
0:17:31	if you
0:17:31	track and number
0:17:33	each
0:17:34	component
0:17:34	of the foreground
0:17:36	you can
0:17:36	uh
0:17:37	improve the results
0:17:40	i don't know if it's really needed
0:17:42	i don't know
0:17:43	that depends on the application
0:17:46	you could for example use the arms
0:17:49	and the movements of the legs
0:17:50	as
0:17:51	one possible feature
0:17:52	can
0:17:53	looking
0:17:54	okay and um
0:17:56	just take a temporal window
0:17:59	um
0:18:00	just can the the what
0:18:02	and you would have a
0:18:03	uh
0:18:04	three d volumetric shape
0:18:06	and then you can apply such a method
0:18:09	uh
0:18:10	but
0:18:11	in the place of
0:18:13	describing pixels
0:18:14	we will describe voxels
0:18:18	that's all right
0:18:19	thank you very much for all the answers

Video Analysis and Processing

Presented by: Sébastien Piérard, Author(s): Sébastien Piérard, Antoine Lejeune, Marc Van Droogenbroeck, University of Liège, Belgium