Oops, I apologise for missing my slot, and I guess I'm moving a little faster here than I expected. My name is David, I'm from HP Labs, and today I'm presenting work of my colleagues. The lead author had to cancel his trip at the last moment because of a US visa situation, so I'm going to do my best to present his work.

The work is called event classification of photo collections. The idea is to take a collection of photographs from a single device over a short period of time, maybe one or two hours, and to classify the event that the photo collection represents. Examples of events are Christmas scenes, birthday scenes, Valentine's Day, sports, and things of this nature.

It turns out that this is apparently quite a challenging problem. The main difficulty is that the photos are essentially unconstrained: they can be just any collection of photos taken at any time. The reason for doing this is to make organization and management of personal photos much easier, to help tell the stories of people's lives. But the reason HP is interested is that if we can automatically categorise, or classify, a collection of photos as fitting a certain theme, then the company is able to suggest products that the consumer can buy, like photo books and things like that. That's HP's interest.

So this is a system overview of how the work of my colleagues proceeds. A collection of photos is given. The first processing step takes a single photo at a time, extracts some metadata, and runs a classifier to obtain a soft prediction of which event category that photo belongs to, based on the metadata alone. They also take each single photo, obtain a set of visual features from the pixels, run another classifier, and likewise obtain soft predictions of which category the photo belongs to. Then the metadata predictions and the visual predictions from all the images are combined in an information fusion step, and the event is classified into one of the categories.

In this talk I'll start by describing what the metadata is and its classifier; I'll then talk about the visual feature extraction and its classifier, and finally the information fusion step and the results.

The metadata used in these experiments actually consists of four different things: a timestamp for each of the photos, an indication of whether the flash was on or off when the photo was taken, the exposure time, and the focal length.

Timestamps reveal a lot of information about certain events. For example, if you look here, this is a histogram of all the photos labeled as Christmas in the training set, where zero corresponds to December 25th, negative numbers to days before the 25th, and positive numbers to days after. You can see that there is a very large spike on December 25th, but there's also a large number of Christmas photos that were taken throughout the month of December. There is also a small number of Christmas-themed photos taken at all other times of the year; perhaps this is due to bad data, the fact that the photographer did not set the time correctly on the camera.

Flash on or off reveals a lot of information about whether the photo was taken indoors or outdoors, and you can see most Christmas photos were taken with the flash on. Exposure time likewise tells you about the light conditions when the photo was taken, and focal length tells you about the relative distance of the scene from the camera.

So the metadata is extracted from a single image, and then the classifier is built offline based on the random forest technique. That part is certainly not my work, so I don't have a lot to say about it, but I understand that the random forest classifier offers very good performance at very low computational cost, and that is why this choice was made.
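Since the talk only describes this step verbally, here is a minimal sketch of what such a per-photo metadata pipeline could look like, assuming Pillow for EXIF access and scikit-learn's random forest. The feature encoding and all names here are my own illustrative assumptions, not the authors' implementation.

```python
# Sketch of a per-photo metadata pipeline: extract the four EXIF fields
# mentioned in the talk and feed them to a random forest classifier.
# Illustrative only; not the original HP Labs implementation.
from datetime import datetime
from PIL import Image, ExifTags
from sklearn.ensemble import RandomForestClassifier

TAG = {name: tag_id for tag_id, name in ExifTags.TAGS.items()}  # name -> tag id

def metadata_features(path):
    exif = Image.open(path)._getexif() or {}
    ts = exif.get(TAG["DateTimeOriginal"])
    dt = datetime.strptime(ts, "%Y:%m:%d %H:%M:%S") if ts else None
    return [
        dt.timetuple().tm_yday if dt else -1,     # day of year (captures e.g. Dec 25)
        dt.hour if dt else -1,                    # time of day
        int(exif.get(TAG["Flash"], 0)) & 1,       # flash-fired bit: indoor/outdoor hint
        float(exif.get(TAG["ExposureTime"], 0)),  # long exposure suggests low light
        float(exif.get(TAG["FocalLength"], 0)),   # rough scene distance / framing
    ]

# Classifier trained offline on labeled single photos (8 event types):
clf = RandomForestClassifier(n_estimators=100)
# X = [metadata_features(p) for p in train_paths]; clf.fit(X, train_labels)
# Soft per-image prediction, used later in the fusion step:
# p_meta = clf.predict_proba([metadata_features(photo)])[0]
```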
The visual features are extracted in a different way, using a bag-of-features approach. What's done by my colleagues here is that they take the original image, filter and downsample it to a quarter of its size, and then filter and downsample that again to one sixteenth. The original image is divided into sixteen tiles, the quarter-size image into four tiles, and the sixteenth-size image is a single tile, so in total there are twenty-one tiles, and each tile has the same size.

From each tile they sample a grid of points, and at each grid location they obtain a feature vector, which looks something like this: a 128-dimensional feature vector. Each feature vector is then quantized to one of two hundred visual words; this dictionary is also trained offline. So if you take all the feature vectors from a certain tile, you can obtain a frequency histogram with two hundred elements, because there are two hundred codewords. There are twenty-one tiles and two hundred elements per histogram, and all those histograms are concatenated together to obtain a vector of four thousand two hundred elements. That feature vector is then passed through a support vector machine that produces a prediction of the event category based on the visual features of a single image.
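To make the construction concrete, here is a small numpy sketch of that 21-tile histogram. Two assumptions on my part: `dense_descriptors` stands in for whatever 128-dimensional grid descriptor the authors used, and the 200-word codebook is taken as given (trained offline). For simplicity this slices the full-resolution image at three grid sizes rather than downsampling so that all tiles share the same pixel size, as the talk describes; the regions covered are the same.

```python
# Sketch of the 21-tile bag-of-features vector: three tilings
# (4x4 + 2x2 + 1x1 = 16 + 4 + 1 = 21 tiles), a 200-word codebook,
# one 200-bin histogram per tile, concatenated to 21 * 200 = 4200 dims.
# `dense_descriptors` and `codebook` are placeholders, not the original code.
import numpy as np

K = 200  # codebook size; the dictionary is trained offline (e.g. by k-means)

def quantize(desc, codebook):
    """Map each 128-d descriptor to the index of its nearest codeword."""
    d2 = ((desc[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def pyramid_histogram(image, codebook, dense_descriptors):
    h, w = image.shape[:2]  # image as a numpy array
    hists = []
    for grid in (4, 2, 1):  # 16 tiles, then 4 tiles, then 1 tile
        for r in range(grid):
            for c in range(grid):
                tile = image[r * h // grid:(r + 1) * h // grid,
                             c * w // grid:(c + 1) * w // grid]
                desc = dense_descriptors(tile)    # (n, 128) descriptors on a grid
                words = quantize(desc, codebook)  # n indices in [0, K)
                hists.append(np.bincount(words, minlength=K))
    return np.concatenate(hists)  # 4200-dim vector, fed to a per-image SVM
```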
Okay, so what I've described so far uses the metadata to provide predictions for a single image, and then uses visual features to provide predictions for a single image. But it turns out that for a single image this might not work very well. Take a look at this image: it's not exactly clear what event the image represents. But if you look at all the surrounding images from that collection, you see pretty clearly that this is a visit to a state park. And this is the main contribution of the current work: to leverage the fact that we have a collection of images, and to see how much my colleagues could do with that.

Okay, so that brings us to the information fusion step, and what they've done is actually fairly simple. Suppose that we have a collection of images I_1 through I_N. From the previous steps that I already told you about, we've already obtained probability vectors: based on visual features, p^V_ij is the probability that image I_i is classified as belonging to event j, and likewise p^M_ij is the probability that image I_i is classified as belonging to event j based on the metadata features.

So we obtain these vectors of probabilities, but we also have to note that different types of features offer different amounts of confidence for different events. We've already seen that the time metadata feature is fairly useful in predicting Christmas, but it is not very useful in telling you about, say, birthdays, because birthdays are fairly evenly distributed throughout the calendar year. So weights are needed, and those are also obtained offline during training. Then these probabilities are combined into a single confidence number for the collection of photos: the confidence that the collection classifies to event j. What's done is a linear combination of the probabilities for the single images, weighted by the per-event weights and also weighted by a factor alpha and one minus alpha, which trades off the metadata and the visual features. When the metadata is not available, as is the case for approximately twenty-five percent of all the elements in the test set, you can only use the visual features.
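Written out, the fusion rule as I understand it from the talk is roughly the following; the notation is my reconstruction from the verbal description, not taken from the paper:

```latex
% Reconstruction of the fusion step; symbols are my own naming.
% p^M_{ij}, p^V_{ij}: soft metadata / visual predictions for image I_i and event j
% w^M_j, w^V_j: per-event confidence weights, learned offline
% \alpha: metadata/visual trade-off (effectively 0 when metadata is missing)
c_j = \sum_{i=1}^{N} \left[ \alpha \, w^M_j \, p^M_{ij}
      + (1 - \alpha) \, w^V_j \, p^V_{ij} \right],
\qquad \hat{j} = \arg\max_j \, c_j
```

The collection is then assigned to the event with the highest confidence.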
I think we can now start to discuss the experimental results. About a hundred thousand photos were obtained from online photo folders, and these were manually labeled into eight event types: Christmas, Halloween, Valentine's Day, Fourth of July, sports, birthday, beach scenes, and none of the above. None of the above turns out to be a very difficult category to deal with. Out of the hundred thousand photos, eight thousand were used for training, and a hundred and fifty-two collections were used for testing; each collection could contain anywhere between five and one hundred images.

Here I show the confusion matrix results that my colleagues obtained for single-photo classification, based on metadata on top and visual features on the bottom. Using metadata, it actually turns out that you can do very well for single images on Christmas, Halloween, Valentine's Day, and Fourth of July, because these have dates associated with them. For the other ones, sporting scenes, birthdays, beaches, the metadata is not as useful. The visual classifier turns out to be pretty good as well, and it is very good at sports events and beach events, probably because those have a fairly consistent visual signature, a consistent visual composition especially.

The next page of results shows the classification results for the whole collection, after the information fusion has been done. You can see that the results are not too bad: you're getting, I guess, between seventy and ninety percent accuracy for these seven categories. None of the above fares less well; what's happening is that images which have nothing to do with any of these events are actually getting mapped to one of the seven categories, and we don't quite know why.

So, to conclude, what I've been presenting today is work by my colleagues on classifying collections of photos into a number of categories. What my colleagues are interested in doing next, instead of just extracting features from individual photos and then fusing them later, is to directly extract features from the collection itself; this may require quite a bit of invention of new types of features. They would also like to explore the fusion of different classifiers. What they're using right now is a linear weighting, and I think that some nonlinear approach, maybe considering different metadata feature groups or different visual feature groups, might provide better fusion than what they're using right now. They would also like to grow the number of categories to a larger number.

So that wraps up my talk. I'll do my best to answer your questions, but for any I can't handle directly, I can only refer you to my colleagues. Are there any questions? Then this concludes this particular session. Thank you very much.