0:00:15 Good afternoon, everybody. I will try to be as brief as possible. This is the title of the presentation, my name, and the names of my collaborators, who contributed some parts of this presentation. Let's start from the outline. Basically, I will be talking about certain problems of image retrieval using keypoints: the concept of keypoints in general, then some disadvantages of typical keypoint-based approaches; then new features based on keypoints are introduced, and then I gradually build several algorithms combining those new features with traditional SIFT descriptors. Everything will be supported by some preliminary experimental verification.

0:01:08 So basically, the problem is the detection of near-duplicate image fragments, or image patches. The setting is quite straightforward: given two images, we may not be interested in determining whether they are similar or nearly identical as whole images, but rather in whether they contain an unspecified number of similar fragments, similar patches. These are examples of images which are generally different but which contain some similarly looking fragments; the similarities are usually represented by an affine transformation, which is the generally accepted model. Another example: completely different images which nevertheless contain similar fragments, and those patches do not always represent just similar or identical objects placed on different backgrounds.
0:02:05 In general we expect results like this: we have those bottles on all kinds of backgrounds, and we expect the method to somehow show that these are identical, or nearly identical, patches even though the images are different. In another example, these two fragments are similar accidentally, while these two are similar because they represent the same object. Another example shows the same idea, and one more here: these images are different, but they share some nearly identical patches.

0:02:38 Of course, the concept of keypoint-based matching is very well known, very well established, and a very powerful tool, but in the context of detecting near-duplicate patches it has basically two major disadvantages. The first is that individual keypoint matches, even if they correctly represent local similarities, are very seldom correct in the wider context, whether local or global. Here is a very simple example: two images which are different but contain a small shared fragment, and these are the keypoint matches if we use the most restrictive technique, one-to-one keypoint correspondences based on mutual nearest neighbours of SIFT descriptions over Harris-Affine keypoints. As you see, it is very difficult to identify which fragments are nearly similar or nearly identical. Another example is even more confusing, because now we are using many-to-many correspondences based on identical SIFT words: the number of correspondences is huge, and only some of them actually correspond to the similar fragments existing in these two images.

0:03:59 Here are some statistics from the small dataset we are using for the experiments. If we have images which contain only similar fragments, the percentage of correct matches among all the matches between individual keypoints is very small. Actually, it renders these methods useless in terms of the direct use of keypoint correspondences as a tool for detecting near-duplicate fragments, near-duplicate patches. So what we generally do is support this technique with some second-step algorithm, jointly called configuration-constraint verification. We can use different techniques: RANSAC, which does not work when there are too few correct correspondences; variants of RANSAC such as PROSAC; geometric hashing; the Hough transform; large vocabularies, where identical words are used as seeds for further matching; and so on. But all of these techniques require, let's say, post-processing when matching a pair of images, which means there are certain limits on the size of the database for which such techniques can be used.
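The one-to-one baseline mentioned above (mutual-nearest-neighbour matching of keypoint descriptors) can be sketched as follows; the tiny 3-D descriptors and the brute-force distance matrix are illustrative stand-ins for 128-D SIFT descriptors and a proper index:

```python
import numpy as np

def mutual_nn_matches(desc_a, desc_b):
    """One-to-one correspondences: keep (i, j) only when descriptor i's nearest
    neighbour in B is j AND descriptor j's nearest neighbour in A is i."""
    # Pairwise Euclidean distances between the two descriptor sets.
    d = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    nn_ab = d.argmin(axis=1)   # for each descriptor of A, its best match in B
    nn_ba = d.argmin(axis=0)   # for each descriptor of B, its best match in A
    return [(i, int(j)) for i, j in enumerate(nn_ab) if nn_ba[j] == i]

# Tiny synthetic example: 3-D "descriptors" instead of 128-D SIFT.
a = np.array([[0.0, 0, 0], [1, 0, 0], [5, 5, 5]])
b = np.array([[1.1, 0, 0], [0.1, 0, 0], [9, 9, 9]])
print(mutual_nn_matches(a, b))   # -> [(0, 1), (1, 0), (2, 2)]
```

The mutual-nearest-neighbour condition is what makes the matching one-to-one: any descriptor can appear in at most one retained correspondence.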
0:05:21 That is why we have proposed new features which are still based on keypoints, but with which we can achieve much higher reliability from straightforward feature matching alone, with no additional verification of geometry or configuration constraints. These features are named TERM features; the name comes from Tuples of Ellipses Represented by Moments, and I will explain later why we use this name. There have been some similar ideas in the past; these are, in my opinion, the two best examples of such ideas, which nevertheless were not fully developed in this direction.
0:06:15 The features we propose are based on triplets (or, supplementarily, pairs) of keypoints, and since we propose descriptors for these features, we can convert the descriptors into vocabularies of visual words; then, instead of matching descriptors directly, we can match the features by matching visual words. Because we assume that the patches might be distorted by affine transformations, we use affine-invariant detectors, which return elliptical keypoints, as our starting point. Actually, we use two categories of detectors: Harris-Affine and MSER, both very popular.

0:07:01 Starting from the TERM3 features: if we have an ellipse, which could be a keypoint, and an external point P1, we can find two points, A01 and B01, based on the directions of the tangent lines from P1 and their points of tangency. These two points are uniquely defined by the ellipse and by the external point P1. If we have another external point, we get another pair of such points, again uniquely defined by that point and by the ellipse. So we can find four such points whenever two external points are available for a given ellipse, and then, using two of these four points and the points where the corresponding lines touch the ellipse, we can uniquely create a trapezoid which is defined by the ellipse itself and by the two external points. In other words, the geometry of this trapezoid, given the ellipse in the context of two external points, is uniquely defined; the proof is quite straightforward.

0:08:30 So if we have three ellipses, then for each of them, taking the centres of the other two as the external points, we can build such a trapezoid, and the three trapezoids together are considered a TERM3 feature. This feature represents the three ellipses, each in the context of the other two. The geometry of these features changes covariantly with the geometry of the ellipses: if we apply a linear or affine transformation to the area containing the three ellipses, the shapes of the trapezoids change correspondingly. In other words, we can relatively easily identify three ellipses which are transformed by an affine mapping just by using shape descriptors of these trapezoids which are invariant under such transformations.

0:09:38 That is the TERM3 feature. The TERM2 feature, which is based on only two ellipses, again uses three quadrilaterals which are uniquely defined by those two ellipses, and again they change covariantly with the ellipses when the ellipses are mapped by an affine transformation. But, as I said, these features are used only in a supplementary role in what follows.
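The tangency-point construction underlying these trapezoids can be illustrated numerically. The sketch below is a simplification I am adding, not the authors' code: it handles an axis-aligned ellipse x²/a² + y²/b² = 1 and one external point, finding the two points where the tangents from that point touch the ellipse via the chord-of-contact (polar) line. Repeating it for both external points gives the four points from which the trapezoid is built; the real construction works for arbitrary (affine-covariant) ellipses.

```python
import math

def tangency_points(a, b, px, py):
    """Points where the two tangents from the external point (px, py) touch the
    axis-aligned ellipse x^2/a^2 + y^2/b^2 = 1 (px != 0 assumed for brevity).

    The chord of contact (polar line) of P is px*x/a^2 + py*y/b^2 = 1;
    substituting x from it into the ellipse equation gives a quadratic in y."""
    u = py / b**2
    q2 = a**2 * u**2 / px**2 + 1.0 / b**2
    q1 = -2.0 * a**2 * u / px**2
    q0 = a**2 / px**2 - 1.0
    disc = q1 * q1 - 4.0 * q2 * q0
    if disc < 0:
        raise ValueError("point lies inside the ellipse: no tangents exist")
    pts = []
    for s in (1.0, -1.0):
        y = (-q1 + s * math.sqrt(disc)) / (2.0 * q2)
        x = a**2 * (1.0 - u * y) / px
        pts.append((x, y))
    return pts

# Unit circle, external point (2, 0): tangents touch at (1/2, +-sqrt(3)/2).
print(tangency_points(1.0, 1.0, 2.0, 0.0))
```

Because tangency is preserved by affine maps, these contact points (and hence the trapezoid built from them) move covariantly with the ellipse, which is exactly the property the TERM features rely on.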
0:10:09 Then we define the descriptors for these features, and because the shapes are very simple, we also use a very simple descriptor: the well-known affine moment invariant of the lowest order, based on the central moments of the second order divided by the moment of zero order, i.e. the area. As the descriptor of the whole feature we use seven values of this invariant, calculated gradually: for the first of the shapes, for the second one, for the third one, for one pair of them, another pair, the third pair, and finally for all of them together. This gives a very simple seven-dimensional descriptor. For the three-dimensional descriptor of TERM2 features, we use the same invariant calculated for the three contributing quadrilaterals.
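The talk does not spell out the formula, but a standard invariant matching the description (second-order central moments normalised by the zero-order moment, i.e. the area) is the first affine moment invariant of Flusser and Suk, I1 = (μ20·μ02 − μ11²)/μ00⁴ — that identification is my assumption. A sketch, computing polygon moments exactly via Green's theorem:

```python
def polygon_moments(pts):
    """Raw moments m00..m02 of a simple polygon (CCW vertices) via Green's theorem."""
    m00 = m10 = m01 = m20 = m11 = m02 = 0.0
    n = len(pts)
    for k in range(n):
        x0, y0 = pts[k]
        x1, y1 = pts[(k + 1) % n]
        c = x0 * y1 - x1 * y0            # cross term of edge k
        m00 += c / 2
        m10 += c * (x0 + x1) / 6
        m01 += c * (y0 + y1) / 6
        m20 += c * (x0 * x0 + x0 * x1 + x1 * x1) / 12
        m11 += c * (2 * x0 * y0 + x0 * y1 + x1 * y0 + 2 * x1 * y1) / 24
        m02 += c * (y0 * y0 + y0 * y1 + y1 * y1) / 12
    return m00, m10, m01, m20, m11, m02

def affine_invariant_I1(pts):
    """First affine moment invariant I1 = (mu20*mu02 - mu11^2) / mu00^4."""
    m00, m10, m01, m20, m11, m02 = polygon_moments(pts)
    mu20 = m20 - m10 * m10 / m00
    mu11 = m11 - m10 * m01 / m00
    mu02 = m02 - m01 * m01 / m00
    return (mu20 * mu02 - mu11 * mu11) / m00**4

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
# the same square under the affine map (x, y) -> (2x + y + 3, x + y - 1)
mapped = [(2 * x + y + 3, x + y - 1) for x, y in square]
print(affine_invariant_I1(square), affine_invariant_I1(mapped))  # both 1/144
```

Applying the invariant to each quadrilateral, to pairs, and to the union of all of them yields the seven-value descriptor described above.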
0:11:14 It turns out that these features are obviously very numerous: if we just take any triplet or any pair of keypoints detected in a typical image, we might get millions of features within a single image, which is of course unacceptable. So we deliberately limit the number of these features in two steps. The first step is, of course, to reduce the number of keypoints themselves: we decided to limit the number of keypoints to roughly three hundred, keeping the most prominent three hundred. Then we build the features using only a certain neighbourhood of those (at most) three hundred keypoints. These are not just traditional geometric neighbourhoods; they are determined by the ellipses themselves. In other words, we assume that two ellipses which are very close, and whose sizes are very large compared to the distance between them, most probably represent more or less the same visual content, so we reject such pairs; and if the ellipses are small compared to the distance between them, their visual contents are most probably unrelated, so we do not consider such keypoint pairs either. In other words, we use only pairs of ellipses whose centre distance is within a certain range determined by the minor and major axes of both ellipses: not less than half of the longer minor axis and not more than twice the shorter major axis (the actual thresholds are defined slightly differently, but this is the idea). Additionally, we exclude triangles of keypoints which are just too narrow, i.e. where one of the angles is less than fifteen degrees. Under such assumptions we get reasonable numbers of features. Again, these are just statistical results from our database: with the number of keypoints per image limited to roughly three hundred, we get tolerably large numbers of TERM2 and TERM3 features, up to sixteen thousand, which is not that bad.
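The two geometric filters just described can be sketched as below. Note that the exact distance thresholds are my reading of the (somewhat garbled) transcript, and the talk says the real ones are defined slightly differently: `pair_ok` accepts a pair of elliptical keypoints when their centre distance is between half the longer minor axis and twice the shorter major axis, and `triangle_ok` rejects triplets with an angle under fifteen degrees.

```python
import itertools
import math

def pair_ok(e1, e2):
    """e = (cx, cy, minor, major). Accept a pair of elliptical keypoints when
    their centre distance lies between half the longer minor axis (too close:
    both cover roughly the same visual content) and twice the shorter major
    axis (too far: the contents are most probably unrelated)."""
    d = math.dist(e1[:2], e2[:2])
    return 0.5 * max(e1[2], e2[2]) <= d <= 2.0 * min(e1[3], e2[3])

def triangle_ok(p1, p2, p3, min_deg=15.0):
    """Reject keypoint triplets whose triangle is too narrow (any angle < 15 deg)."""
    def angle(a, b, c):  # angle at vertex a, in degrees
        v1 = (b[0] - a[0], b[1] - a[1])
        v2 = (c[0] - a[0], c[1] - a[1])
        cosv = (v1[0] * v2[0] + v1[1] * v2[1]) / (math.hypot(*v1) * math.hypot(*v2))
        return math.degrees(math.acos(max(-1.0, min(1.0, cosv))))
    return min(angle(p1, p2, p3), angle(p2, p1, p3), angle(p3, p1, p2)) >= min_deg

def term3_candidates(ellipses):
    """Keypoint triplets passing both the distance and the narrowness filters."""
    return [(a, b, c)
            for a, b, c in itertools.combinations(ellipses, 3)
            if pair_ok(a, b) and pair_ok(a, c) and pair_ok(b, c)
            and triangle_ok(a[:2], b[:2], c[:2])]
```

For example, three well-spread ellipses of minor axis 1 and major axis 2 at (0,0), (3,0), (0,2) pass both filters, while moving the third centre to (1.5, 0.1) makes the triangle too narrow and the triplet is dropped.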
0:13:40 An important point in using these features is the statistical independence of the individual dimensions. We used over thirty million TERM features from over one thousand images, and we found that the correlation between individual dimensions is practically negligible. This means we can assume the dimensions are just independent, and then we build the visual vocabulary simply by dividing each dimension into intervals of equal probability (two, three, four, or five intervals); each word then corresponds to one cell of the Cartesian product of such decompositions. So we have 2^7, 3^7, 4^7, and so on, words for the T3 descriptions of TERM3 features, and 2^3, 3^3, 4^3, and so on, for the T2 descriptions.
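The vocabulary construction (per-dimension equal-probability intervals, with words as cells of the Cartesian product) might look like this; the random training data is only a stand-in for the real TERM descriptors:

```python
import numpy as np

def build_bin_edges(train, bins_per_dim):
    """Per-dimension interior cut points chosen so each interval holds roughly
    the same fraction of the training descriptors (equal-probability binning)."""
    qs = np.linspace(0.0, 100.0, bins_per_dim + 1)[1:-1]
    return [np.percentile(train[:, d], qs) for d in range(train.shape[1])]

def word_id(desc, edges, bins_per_dim):
    """A visual word is one cell of the Cartesian product of the per-dimension
    bins, encoded in base `bins_per_dim`; with 7 dimensions and n bins per
    dimension this yields n**7 possible words (e.g. 3**7 = 2187 for T3)."""
    w = 0
    for v, e in zip(desc, edges):
        w = w * bins_per_dim + int(np.searchsorted(e, v))
    return w

rng = np.random.default_rng(0)
train = rng.normal(size=(10_000, 7))   # stand-in for the ~30M TERM descriptors
edges = build_bin_edges(train, 3)
ids = [word_id(row, edges, 3) for row in train]
print(min(ids) >= 0 and max(ids) < 3**7)   # -> True: words lie in [0, 3**7)
```

Because the dimensions are (assumed) independent, quantising each one separately and taking the Cartesian product wastes no vocabulary cells on unpopulated regions of the descriptor space.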
0:14:44 So that is the mathematical background; now the algorithms and their evaluation. We have used a relatively small database, available at the address shown, but the beauty of this database is that it contains near-duplicate images in which all the near-duplicate fragments are manually outlined. This means we can automatically determine whether two features or two keypoints are properly matched. Of course there are some mistakes, because a pair might match similarly looking fragments of different objects rather than the same fragment of the same object, but altogether it provides a reliable ground truth for verification.
0:15:33 We measured the performance using both traditional keypoints and words of TERM2 and TERM3 features. SIFT was used as the representation for individual keypoints, with a small vocabulary of five hundred words; this may be too small, but it turns out it is not really that bad.

0:16:05 These are the results. First, we just matched individual keypoints using one-to-one correspondences, with no TERM2 or TERM3; then again individual keypoints, using many-to-many correspondences over the same SIFT word; then methods where just individual TERM2 and TERM3 features are matched, using either one-to-one or many-to-many correspondences. It turns out that in all these cases the numbers of correct matches, whether we use individual keypoints or individual TERM2/TERM3 features, are basically at the level of statistical error: there is no use for such matching alone. However, if we combine TERM2 or TERM3 features with the keypoint descriptions, that is, if we match the features geometrically only when their contributing keypoints have the same SIFT words, the results are much better.
0:17:11 These are exemplary results for this database, and they are very good; the problem is that the number of matches is very small compared to the number of matching images. There are over five hundred images which contain similar fragments, but altogether we found only around one and a half to two thousand such features in the whole database, and they are concentrated in just fifteen to twenty images. So the method is quite precise, but its recall is poor.

0:17:49 If instead we use just words, both for the whole features (TERM3 in particular) and for the contributing keypoints, then the results are better in terms of the number of found correspondences, but the percentage of correct matches is not so good; it is much better than previously, at about twenty percent, but it is still not enough.

0:18:14 Eventually, the ultimate method combines both kinds of words. We take triplets of keypoints and match their T3 words, then we match the T2 words for each pair of keypoints within the triplet, and finally we also match the individual SIFT words of the contributing keypoints. Then, especially if we use Harris-Affine, the results are quite good. For MSER we still do not get too many matches, but for Harris-Affine there are quite many of them, and the level of correctness is very high: ninety-three percent of the retrieved features are matched properly, which means they actually belong to the same patch, or to similarly looking patches of the same object. The recall results are also not bad; the effective recall is about eighty percent, which is good, especially since some of these objects are really small, so there are not many keypoints there.
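The three-level matching just described can be sketched as follows. The feature representation here is my assumption for illustration: each T3 feature carries its T3 word, the three T2 words of its keypoint pairs, and the SIFT words of its three contributing keypoints, and two features match only when all three levels agree (compared order-insensitively for simplicity; the talk implies a consistent ordering).

```python
def match_term3(feats_a, feats_b):
    """Match T3 features between two images. A feature is a tuple
    (t3_word, t2_words, sift_words), where t2_words are the three pair-level
    words and sift_words the three contributing keypoints' SIFT words."""
    matches = []
    for i, (w3a, w2a, sa) in enumerate(feats_a):
        for j, (w3b, w2b, sb) in enumerate(feats_b):
            if w3a == w3b and sorted(w2a) == sorted(w2b) and sorted(sa) == sorted(sb):
                matches.append((i, j))
    return matches

A = [(17, (4, 9, 2), (101, 55, 301)), (8, (1, 1, 3), (42, 42, 7))]
B = [(17, (9, 2, 4), (55, 301, 101)), (17, (4, 9, 2), (101, 55, 999))]
print(match_term3(A, B))   # -> [(0, 0)]: only this pair agrees on all three levels
```

Requiring agreement at the triplet, pair, and keypoint levels simultaneously is what drives the precision up: a coincidental match on one level alone is no longer enough.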
0:19:31 That's it. These are just exemplary results returned by this method; the ellipses show the outlines, because the individual keypoints would not be well seen. As you see, the results are really quite good, and all of them are achieved by simple matching of individual features: no verification of configuration constraints, nothing like that, just matching individual features.

0:19:59 Some of these are from the database I mentioned, and some are from other publicly available databases that have been used. There are also some semantically incorrect results, as in this example.

0:20:22 So, to summarise the advantages and disadvantages: we still have to reduce the number of keypoints, which is not perfect; that was one of the conclusions. Eventually, as an ultimate solution, we want to incorporate similar data into the descriptions of individual keypoints, which would mean we could identify similarly looking fragments of images just by matching the enriched descriptions of individual keypoints, rather than by comparing features. Thank you very much, and sorry for running one minute over.
0:21:12 [Question from the audience, mostly inaudible.]
0:21:49 No, because basically there are three of them, so we cannot incorporate more, and the idea is that each of them is represented by individual words, which means we can just match sentences of a few words. That's why we combine the three of them.
0:22:11 [Question from the audience, inaudible.]
0:22:26 Well, basically I am not saying that I am identifying the outline of the patch; I am detecting the presence of the patch. Even a single keypoint or a single feature shows that there is a patch there. Actually delineating the outline of the patch is a different story, but now we can at least know that there is a patch there with a high level of confidence.
0:23:05 [Question from the audience, inaudible.]
0:23:29 I fully agree with you. The problem is that if you are just looking in an unknown, unpredictable database for some preliminary similarities, you have to start from a method which does not assume anything about the content, and this is basically such a method.
0:23:53 [Audience: what if the images contain repeated patterns?]

0:23:56 That is not a problem; then we simply get a lot of matches. There was one example with two bottles of beer over here and two cans of beer over there: basically, we found matches for all the pairs of patterns, this one with that one, and so on. So in that case we might get a lot of pairs of patches, but a good part of them are due to repetitions of the pattern. Of course it does not make too much sense beyond a point, but with two, three, or four repetitions there is no problem.
0:24:39 Okay, well, I did not have time to mention that; it basically depends on the application. With a very naive MATLAB implementation we are able to match hundreds of images within a second, on random images of course. We have not completed it yet, but we are starting on a C++ implementation with more sophisticated techniques for organising the database, and then it can be really fast.

0:25:14 [End of the session.]