0:00:13 Oops. I apologise for missing my slot, and I guess I'm moving a little faster here than I expected.
0:00:21 My name is David, and I'm from HP Labs, and today I'm presenting the work of my colleagues. The lead author had to cancel his trip at the last moment because of a US visa situation.
0:00:39 So I'm going to do my best to present his paper, on what is called event classification in photo collections.
0:00:49 The idea is to be able to take a collection of photographs from a single device over a short period of time, maybe one or two hours, and to classify the event that the photo collection represents.
0:01:10 Examples of events are Christmas scenes, birthday scenes, Valentine's Day, also sports, and things of this nature.
0:01:21 It turns out that this is apparently quite a challenging problem. I think the main difficulty is that the photos are essentially a stream: they can just be any collection of photos taken at any time.
0:01:38 The reason for doing this is to make the organization and management of personal photos much easier, and to tell the stories of people's lives. But the reason why HP is interested in this is that if we can automatically categorise, or classify, a collection of photos as fitting a certain theme, then the company is able to suggest products that the consumer can buy, like photo books and things like that; that maps to HP's interests.
0:02:17 So this is a system overview of the work of my colleagues. A collection of photos is given, and the first processing step takes a single photo at a time, extracts its metadata, and runs a classifier to obtain a prediction of what category that photo belongs to, based on the metadata.
0:02:50 Sorry, I didn't realise this slide was running on auto; it keeps clicking forward.
0:02:59 In parallel, they take the single photo, obtain a visual feature histogram, run a classifier, and likewise obtain soft predictions of which category that photo belongs to.
0:03:15 Then all of the metadata predictions and visual predictions from all the images are combined together in an information fusion step, and the event is then classified into one of the categories.
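To make the flow concrete, here is a minimal sketch of that two-stream pipeline; the names (`meta_clf`, `vis_clf`, `fuse`) are hypothetical stand-ins for the components described in the talk, not the authors' actual code.

```python
# Minimal sketch of the two-stream pipeline, assuming per-photo classifiers
# that return soft category predictions; all names here are hypothetical.
def classify_collection(photos, meta_clf, vis_clf, fuse):
    meta_preds, vis_preds = [], []
    for photo in photos:
        if photo.get("metadata") is not None:       # metadata can be missing
            meta_preds.append(meta_clf(photo["metadata"]))
        vis_preds.append(vis_clf(photo["pixels"]))
    # Information fusion combines all per-photo predictions
    # into one event label for the whole collection.
    return fuse(meta_preds, vis_preds)
```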
0:03:29 So in this talk I'll start by describing what the metadata is and the classifier attached to it. I'll also talk about the visual feature extraction process and the classifier for that, and finally the information fusion step and the results.
0:03:49 The metadata used here in these experiments actually consists of four different things: the timestamp of each of the photos, an indication of whether the flash was on or off when the photo was taken, the exposure time, and the focal length.
0:04:09 You can see that timestamps reveal a lot of information about certain events. For example, if you look here, this is a histogram of all the photos labeled as Christmas in the training set. Zero corresponds to December twenty-fifth; negative numbers are days before December twenty-fifth, and positive numbers are days after it. You can see that there is a very large spike on December twenty-fifth, but there is also a large number of Christmas photos that were taken throughout the month of December. There is a small number of Christmas-themed photos taken at other times of the year; perhaps this is due to bad data, and the fact that the photographer did not set the time correctly on the camera probably explains it.
0:05:03 Flash on or off reveals a lot of information about whether the photo was taken indoors or outdoors, and you can see most Christmas photos were taken with the flash on. Exposure time likewise tells you about the lighting conditions when the photo was taken. Focal length tells you about the relative distance of the scene from the camera.
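As a sketch, the four metadata fields could be turned into a numeric feature vector along these lines; the encoding is my assumption, not the paper's.

```python
from datetime import datetime

def metadata_features(timestamp: datetime, flash_fired: bool,
                      exposure_time_s: float, focal_length_mm: float):
    # Day of year captures date-anchored events such as Christmas.
    day_of_year = timestamp.timetuple().tm_yday
    return [
        float(day_of_year),
        1.0 if flash_fired else 0.0,   # indoor/outdoor hint
        exposure_time_s,               # lighting conditions
        focal_length_mm,               # relative distance of the scene
    ]

# Example: a flash photo taken on Christmas Day.
print(metadata_features(datetime(2009, 12, 25, 18, 30), True, 1 / 60, 35.0))
```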
0:05:24 So the metadata is extracted from a single image, and then a classifier is built offline based on the random forest technique. I'm certainly not an expert on this, so I don't have a lot to say about it, but I understand that the random forest classifier offers very good performance at very low computational cost, and that is why this choice was made.
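The talk doesn't give the forest's parameters; a minimal sketch of training such a classifier offline, e.g. with scikit-learn and placeholder data, could look like this.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder training data: 4 metadata features, 8 event categories.
rng = np.random.default_rng(0)
X_train = rng.random((1000, 4))          # [day-of-year, flash, exposure, focal length]
y_train = rng.integers(0, 8, size=1000)  # event labels

meta_clf = RandomForestClassifier(n_estimators=100, random_state=0)
meta_clf.fit(X_train, y_train)           # built offline

# Soft (probabilistic) per-photo prediction, one probability per category.
print(meta_clf.predict_proba(X_train[:1]))
```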
0:05:55 The visual features are extracted in a different way: they use the bag-of-features approach. What's done by my colleagues here is they take the original image, filter and downsample it down to one quarter of its size, and then filter and downsample that down to one sixteenth. The one-sixteenth-size image is a single tile, the quarter-size image is divided into four tiles, and the original image is divided into sixteen tiles, so there are twenty-one tiles in all, and each tile has the same size.
0:06:27 Now, from each tile they take a grid of points, and at each grid location they obtain a feature vector, something like this hundred-and-twenty-eight-dimensional feature vector. Each feature vector is then quantized to one of two hundred words; this dictionary is also trained offline.
0:06:59 So if you take all the feature vectors from a certain tile, you can then obtain a frequency vector of two hundred elements, because there are two hundred codewords. In total there are twenty-one tiles, with two hundred elements in each frequency vector, and all those frequency vectors are concatenated together to obtain a vector of four thousand two hundred elements.
0:07:30 That feature vector is then passed through a support vector machine, which produces a prediction of the event category based on the visual features of a single image.
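Here is a minimal sketch of the tile-histogram step, assuming a placeholder 200-word dictionary and random stand-ins for the 128-dimensional descriptors; in the real system both come from offline training.

```python
import numpy as np

N_WORDS, DESC_DIM = 200, 128
rng = np.random.default_rng(0)
codebook = rng.random((N_WORDS, DESC_DIM))   # stand-in for the trained dictionary

def tile_histogram(descriptors):
    """Quantize each descriptor to its nearest codeword; return a 200-bin frequency vector."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=N_WORDS).astype(float)
    return hist / max(hist.sum(), 1.0)

def image_feature(tiles):
    """tiles: 21 descriptor arrays (1 + 4 + 16 pyramid tiles).
    Concatenating 21 histograms of 200 bins gives the 4200-element vector."""
    return np.concatenate([tile_histogram(t) for t in tiles])

# Example: 21 tiles, each with a grid of 50 descriptors.
tiles = [rng.random((50, DESC_DIM)) for _ in range(21)]
x = image_feature(tiles)
assert x.shape == (4200,)
```

In the same spirit, the final classifier could be, for instance, scikit-learn's `SVC(probability=True)` fit on these 4200-element vectors, though the talk does not name an implementation.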
0:07:46 Okay, so what I've described so far is using the metadata to provide predictions for a single image, and then using visual features to provide predictions for a single image. But it turns out that for a single image these predictions might not be very good. Take a look at this image: it's not exactly clear what event the image represents. But if you look at all the surrounding images from that collection, you see pretty clearly that this is a visit to a state park. And so this is the main contribution of the current work: to leverage the fact that we have a collection of images, and see how much my colleagues could do with that.
0:08:36 Okay, so that brings us to the information fusion step, and what they've done is actually fairly simple. Suppose that we have a collection of images I_1 through I_N. From the previous steps that I've already told you about, we've already obtained probability vectors: based on the visual features, the probability that image I_i is classified as belonging to event j, and likewise the probability that the image is classified as belonging to event j based on the metadata features.
0:09:16 So we obtain these vectors of probabilities, but we also have to note that different types of features offer different amounts of confidence for different events. We've already seen that the time metadata feature is fairly useful in predicting Christmas, but it is not very useful in telling you about birthdays, because birthdays are fairly evenly distributed throughout the calendar year. So these weights also need to be obtained offline, through training.
0:09:52 Then these probabilities are combined into a single confidence number for the collection of photos I; the confidence measure is the confidence that collection I is classified as belonging to event j. What's done is a linear combination of the probabilities for the single images, weighted by those per-event weights, and also weighted by an alpha and one-minus-alpha which trades off the metadata and the visual feature classifications. If the metadata is not available, as is the case for, I think, approximately twenty-five percent of all the elements in the test set, you can only use the visual data.
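The talk gives this fusion rule only in words; a plausible reconstruction, where $P^{v}$ and $P^{m}$ are the visual and metadata probabilities, $w_j^{v}, w_j^{m}$ the per-event weights learned offline, and $\alpha$ the visual/metadata trade-off (the notation is assumed, not taken from the paper):

```latex
C_j(I) = \sum_{i=1}^{N} \Big[ \alpha\, w_j^{v}\, P^{v}(j \mid I_i)
       + (1-\alpha)\, w_j^{m}\, P^{m}(j \mid I_i) \Big],
\qquad \hat{\jmath} = \arg\max_j C_j(I),
```

with the metadata term dropped (effectively $\alpha = 1$) when no metadata is available.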
0:10:46 So I think we can now start to discuss the experimental results. I think a hundred thousand photos were obtained from online photo folders.
0:10:59 These were manually labeled into eight event types: Christmas, Halloween, Valentine's Day, Fourth of July, also sports, birthday, beach scenes, and none of the above. None of the above turns out to be a very difficult category to deal with.
0:11:24 Out of the hundred thousand photos, eight thousand were used for training, and a hundred and fifty-two collections were used for testing; each collection could contain anywhere between five and one hundred images.
0:11:41 Here I show the confusion matrix results that my colleagues obtained for single-photo classification, based on metadata on top and visual features on the bottom. Using metadata, it actually turns out that you can do very well for single images on Christmas, Halloween, Valentine's Day, and Fourth of July, because these have dates associated with the events.
0:12:10 For the other ones, sporting scenes, birthdays, and beach scenes, the metadata is not as useful. The visual classifiers seem to be pretty good as well; they are very good at sports events and beach events, and that's probably because those have a fairly consistent visual signature, a consistent visual composition especially.
0:12:40 The next page of results shows the classification results for the whole collection, after the information fusion. You can see that the results are not too bad: they're getting, I guess, between seventy and ninety percent accuracy for these seven categories. The none-of-the-above category fares less well; what's happening is that images which have nothing to do with any of these events are actually getting mapped to one of the seven categories.
0:13:20 Well, to conclude, what I've been presenting today is the work of my colleagues on classifying collections of photos into a number of event categories.
0:13:34 What my colleagues are interested in doing next, instead of just extracting features from individual photos and then fusing them later, is to directly extract features from the collection; this may require the invention of new types of features.
0:13:58 They'd also like to explore better fusion of the different classifiers. What they're using right now is linear weighting, and potentially, I think, some nonlinear approach, maybe considering different groups of metadata features and different visual features, might provide a better fusion than what they're using right now. They'd also like to grow the number of categories to a larger number.
0:14:36 So that wraps up my talk. I'll do my best to answer your questions, but if there are any I can't handle directly, I can always forward them to my colleagues.
0:14:56 Are there any questions? Okay. And this concludes this particular session, so thank you very much for coming.