0:00:13 Oops. I apologise for missing my slot, and I guess I'm moving a little faster here than I expected.
0:00:21 My name is David, and I'm from HP Labs, and today I'm presenting the work of my colleagues. The lead author had to cancel his trip at the last moment because of a US visa situation.
0:00:39 So I'm going to do my best to present his paper, on what is called event classification in photo collections.
0:00:49 The idea is to be able to take a collection of photographs from a single device over a short period of time, maybe one or two hours, and to classify the event that the photo collection represents.
0:01:10 Examples of events are Christmas scenes, birthday scenes, Valentine's Day, also sports, and things of this nature.
0:01:21 It turns out that this is apparently quite a challenging problem. I think the main difficulty is that the photos are essentially a stream: they can just be any collection of photos taken at any time.
0:01:38 The reason for doing this is to make the organization and management of personal photos much easier, and to tell the stories of people's lives. But the reason why HP is interested in this is that if we can automatically categorise, or classify, a collection of photos as fitting a certain theme, then the company is able to suggest products that the consumer can buy, like photo books and things like that; that maps to HP's interests.
0:02:17 So this is a system overview of the work of my colleagues. A collection of photos is given, and the first processing step takes a single photo at a time, extracts its metadata, and runs a classifier to obtain a prediction of what category that photo belongs to, based on the metadata.
0:02:50 Sorry, I didn't realise this slide was running on auto; it keeps clicking forward.
0:02:59 In parallel, they take the single photo, obtain a visual feature histogram, run a classifier, and likewise obtain soft predictions of which category that photo belongs to.
0:03:15 Then all of the metadata predictions and visual predictions from all the images are combined together in an information fusion step, and the event is then classified into one of the categories.
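To make the flow concrete, here is a minimal sketch of that two-stream pipeline; the names (`meta_clf`, `vis_clf`, `fuse`) are hypothetical stand-ins for the components described in the talk, not the authors' actual code.

```python
# Minimal sketch of the two-stream pipeline, assuming per-photo classifiers
# that return soft category predictions; all names here are hypothetical.
def classify_collection(photos, meta_clf, vis_clf, fuse):
    meta_preds, vis_preds = [], []
    for photo in photos:
        if photo.get("metadata") is not None:       # metadata can be missing
            meta_preds.append(meta_clf(photo["metadata"]))
        vis_preds.append(vis_clf(photo["pixels"]))
    # Information fusion combines all per-photo predictions
    # into one event label for the whole collection.
    return fuse(meta_preds, vis_preds)
```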
0:03:29 So in this talk I'll start by describing what the metadata is and the classifier attached to it. I'll also talk about the visual feature extraction process and the classifier for that, and finally the information fusion step and the results.
0:03:49 The metadata used here in these experiments actually consists of four different things: the timestamp of each of the photos, an indication of whether the flash was on or off when the photo was taken, the exposure time, and the focal length.
0:04:09 You can see that timestamps reveal a lot of information about certain events. For example, if you look here, this is a histogram of all the photos labeled as Christmas in the training set. Zero corresponds to December twenty-fifth; negative numbers are days before December twenty-fifth, and positive numbers are days after it. You can see that there is a very large spike on December twenty-fifth, but there is also a large number of Christmas photos that were taken throughout the month of December. There is a small number of Christmas-themed photos taken at other times of the year; perhaps this is due to bad data, and the fact that the photographer did not set the time correctly on the camera probably explains it.
0:05:03 Flash on or off reveals a lot of information about whether the photo was taken indoors or outdoors, and you can see most Christmas photos were taken with the flash on. Exposure time likewise tells you about the lighting conditions when the photo was taken. Focal length tells you about the relative distance of the scene from the camera.
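As a sketch, the four metadata fields could be turned into a numeric feature vector along these lines; the encoding is my assumption, not the paper's.

```python
from datetime import datetime

def metadata_features(timestamp: datetime, flash_fired: bool,
                      exposure_time_s: float, focal_length_mm: float):
    # Day of year captures date-anchored events such as Christmas.
    day_of_year = timestamp.timetuple().tm_yday
    return [
        float(day_of_year),
        1.0 if flash_fired else 0.0,   # indoor/outdoor hint
        exposure_time_s,               # lighting conditions
        focal_length_mm,               # relative distance of the scene
    ]

# Example: a flash photo taken on Christmas Day.
print(metadata_features(datetime(2009, 12, 25, 18, 30), True, 1 / 60, 35.0))
```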
0:05:24 So the metadata is extracted from a single image, and then a classifier is built offline based on the random forest technique. I'm certainly not an expert on this, so I don't have a lot to say about it, but I understand that the random forest classifier offers very good performance at very low computational cost, and that is why this choice was made.
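The talk doesn't give the forest's parameters; a minimal sketch of training such a classifier offline, e.g. with scikit-learn and placeholder data, could look like this.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder training data: 4 metadata features, 8 event categories.
rng = np.random.default_rng(0)
X_train = rng.random((1000, 4))          # [day-of-year, flash, exposure, focal length]
y_train = rng.integers(0, 8, size=1000)  # event labels

meta_clf = RandomForestClassifier(n_estimators=100, random_state=0)
meta_clf.fit(X_train, y_train)           # built offline

# Soft (probabilistic) per-photo prediction, one probability per category.
print(meta_clf.predict_proba(X_train[:1]))
```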
0:05:55 The visual features are extracted in a different way: they use the bag-of-features approach. What's done by my colleagues here is they take the original image, filter and downsample it down to one quarter of its size, and then filter and downsample that down to one sixteenth. The one-sixteenth-size image is a single tile, the quarter-size image is divided into four tiles, and the original image is divided into sixteen tiles, so there are twenty-one tiles in all, and each tile has the same size.
0:06:27 Now, from each tile they take a grid of points, and at each grid location they obtain a feature vector, something like this hundred-and-twenty-eight-dimensional feature vector. Each feature vector is then quantized to one of two hundred words; this dictionary is also trained offline.
0:06:59 So if you take all the feature vectors from a certain tile, you can then obtain a frequency vector of two hundred elements, because there are two hundred codewords. In total there are twenty-one tiles, with two hundred elements in each frequency vector, and all those frequency vectors are concatenated together to obtain a vector of four thousand two hundred elements.
0:07:30 That feature vector is then passed through a support vector machine, which produces a prediction of the event category based on the visual features of a single image.
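Here is a minimal sketch of the tile-histogram step, assuming a placeholder 200-word dictionary and random stand-ins for the 128-dimensional descriptors; in the real system both come from offline training.

```python
import numpy as np

N_WORDS, DESC_DIM = 200, 128
rng = np.random.default_rng(0)
codebook = rng.random((N_WORDS, DESC_DIM))   # stand-in for the trained dictionary

def tile_histogram(descriptors):
    """Quantize each descriptor to its nearest codeword; return a 200-bin frequency vector."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=N_WORDS).astype(float)
    return hist / max(hist.sum(), 1.0)

def image_feature(tiles):
    """tiles: 21 descriptor arrays (1 + 4 + 16 pyramid tiles).
    Concatenating 21 histograms of 200 bins gives the 4200-element vector."""
    return np.concatenate([tile_histogram(t) for t in tiles])

# Example: 21 tiles, each with a grid of 50 descriptors.
tiles = [rng.random((50, DESC_DIM)) for _ in range(21)]
x = image_feature(tiles)
assert x.shape == (4200,)
```

In the same spirit, the final classifier could be, for instance, scikit-learn's `SVC(probability=True)` fit on these 4200-element vectors, though the talk does not name an implementation.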
0:07:46 Okay, so what I've described so far is using the metadata to provide predictions for a single image, and then using visual features to provide predictions for a single image. But it turns out that for a single image these predictions might not be very good. Take a look at this image: it's not exactly clear what event the image represents. But if you look at all the surrounding images from that collection, you see pretty clearly that this is a visit to a state park. And so this is the main contribution of the current work: to leverage the fact that we have a collection of images, and see how much my colleagues could do with that.
0:08:36 Okay, so that brings us to the information fusion step, and what they've done is actually fairly simple. Suppose that we have a collection of images I_1 through I_N. From the previous steps that I've already told you about, we've already obtained probability vectors: based on the visual features, the probability that image I_i is classified as belonging to event j, and likewise the probability that the image is classified as belonging to event j based on the metadata features.
0:09:16 So we obtain these vectors of probabilities, but we also have to note that different types of features offer different amounts of confidence for different events. We've already seen that the time metadata feature is fairly useful in predicting Christmas, but it is not very useful in telling you about birthdays, because birthdays are fairly evenly distributed throughout the calendar year. So these weights also need to be obtained offline, through training.
0:09:52 Then these probabilities are combined into a single confidence number for the collection of photos I; the confidence measure is the confidence that collection I is classified as belonging to event j. What's done is a linear combination of the probabilities for the single images, weighted by those per-event weights, and also weighted by an alpha and one-minus-alpha which trades off the metadata and the visual feature classifications. If the metadata is not available, as is the case for, I think, approximately twenty-five percent of all the elements in the test set, you can only use the visual data.
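The talk gives this fusion rule only in words; a plausible reconstruction, where $P^{v}$ and $P^{m}$ are the visual and metadata probabilities, $w_j^{v}, w_j^{m}$ the per-event weights learned offline, and $\alpha$ the visual/metadata trade-off (the notation is assumed, not taken from the paper):

```latex
C_j(I) = \sum_{i=1}^{N} \Big[ \alpha\, w_j^{v}\, P^{v}(j \mid I_i)
       + (1-\alpha)\, w_j^{m}\, P^{m}(j \mid I_i) \Big],
\qquad \hat{\jmath} = \arg\max_j C_j(I),
```

with the metadata term dropped (effectively $\alpha = 1$) when no metadata is available.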
0:10:46 So I think we can now start to discuss the experimental results. I think a hundred thousand photos were obtained from online photo folders.
0:10:59 These were manually labeled into eight event types: Christmas, Halloween, Valentine's Day, Fourth of July, also sports, birthday, beach scenes, and none of the above. None of the above turns out to be a very difficult category to deal with.
0:11:24 Out of the hundred thousand photos, eight thousand were used for training, and a hundred and fifty-two collections were used for testing; each collection could contain anywhere between five and one hundred images.
0:11:41 Here I show the confusion matrix results that my colleagues obtained for single-photo classification, based on metadata on top and visual features on the bottom. Using metadata, it actually turns out that you can do very well for single images on Christmas, Halloween, Valentine's Day, and Fourth of July, because these have dates associated with the events.
0:12:10 For the other ones, sporting scenes, birthdays, and beach scenes, the metadata is not as useful. The visual classifiers seem to be pretty good as well; they are very good at sports events and beach events, and that's probably because those have a fairly consistent visual signature, a consistent visual composition especially.
0:12:40 The next page of results shows the classification results for the whole collection, after the information fusion. You can see that the results are not too bad: they're getting, I guess, between seventy and ninety percent accuracy for these seven categories. The none-of-the-above category fares less well; what's happening is that images which have nothing to do with any of these events are actually getting mapped to one of the seven categories.
0:13:20 Well, to conclude, what I've been presenting today is the work of my colleagues on classifying collections of photos into a number of event categories.
0:13:34 What my colleagues are interested in doing next, instead of just extracting features from individual photos and then fusing them later, is to directly extract features from the collection; this may require the invention of new types of features.
0:13:58 They'd also like to explore better fusion of the different classifiers. What they're using right now is linear weighting, and potentially, I think, some nonlinear approach, maybe considering different groups of metadata features and different visual features, might provide a better fusion than what they're using right now. They'd also like to grow the number of categories to a larger number.
0:14:36 So that wraps up my talk. I'll do my best to answer your questions, but if there are any I can't handle directly, I can always forward them to my colleagues.
0:14:56 Are there any questions? Okay. And this concludes this particular session, so thank you very much for coming.