Introduction and motivation. The reason we work on this problem is the following: in state-of-the-art image search schemes, an image is typically represented by about one thousand or two thousand local descriptors, which means that we have to handle billions of descriptors in the database. These are SIFT descriptors, of dimension 128. If we look at video, say a thousand hours of video, or at the TRECVID evaluation task with a few hundred hours, the data is again represented by billions of vectors; and in music retrieval the databases are also reaching the order of one billion vectors.

To take a more concrete example from our own experiments, in the TRECVID copy detection task we extracted about five billion descriptors to represent the database, which contains on the order of fourteen million images. At query time we want to search each query descriptor against the database descriptors, which means that for each descriptor we look for its nearest neighbors under the Euclidean distance.

Now, if we consider exhaustive nearest-neighbor search, setting aside I/O costs: for one frame described by one thousand descriptors we would have to compute trillions of high-dimensional vector distances, which takes on the order of ten hours. We cannot afford that even for a single frame, which is why we need powerful approximate nearest-neighbor search techniques that are both computationally efficient, to avoid this problem, and memory efficient.

With approximate search it is quite rare to retrieve exactly the true nearest neighbors; that is the price we pay for the speed-up. Moreover, most methods optimize for speed first. Locality sensitive hashing, for instance, is very popular because it has good theoretical properties, but it is quite memory consuming: even with only ten hash tables you need at least forty bytes per vector as a strict minimum, and even then the performance is not very good. A better choice is FLANN, which offers a very good trade-off in terms of accuracy versus speed and is state of the art in that respect, but again it requires a lot of memory: in an experiment on one million vectors, the index alone used on the order of one hundred megabytes.

Approximate search is usually organized in two stages. The first stage is the approximate one: you have some kind of space partitioning, for instance a hashing scheme, so that for a given vector you can find the cell in which it lies. You do this offline for all the database vectors, and when a query arrives you compute its hash key and consider the vectors falling in the same cell as potential nearest neighbors. That is with a single hash function; in LSH you use several such hard partitions to improve the probability of catching the true nearest neighbor.
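To make this first stage concrete, here is a minimal Python sketch of the partition-then-probe idea, assuming a small k-means coarse quantizer as the space partition; all function names and parameters are illustrative, not the implementation behind the talk.

```python
import numpy as np

def train_coarse_partition(database, k=64, iters=10, seed=0):
    """Toy k-means: learn k centroids whose cells partition the space."""
    database = np.asarray(database, dtype=np.float64)
    rng = np.random.default_rng(seed)
    centroids = database[rng.choice(len(database), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((database[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
        for c in range(k):
            members = database[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids

def build_inverted_lists(database, centroids):
    """Offline step: record, for every cell, the ids of the vectors falling in it."""
    assign = np.argmin(((database[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
    return {c: np.flatnonzero(assign == c) for c in range(len(centroids))}

def first_stage_candidates(query, centroids, lists, n_probe=4):
    """Query step: find the n_probe closest cells and return the ids stored
    there as potential nearest neighbors, to be verified in a second stage."""
    cell_dist = ((centroids - query) ** 2).sum(-1)
    cells = np.argsort(cell_dist)[:n_probe]
    return np.concatenate([lists[c] for c in cells])
```

A real system would use many more cells, an optimized k-means, and probe several cells per query, much like the multiple hash tables of LSH mentioned above.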
Then, once you have these potential nearest neighbors, many of them are actually not good at all, i.e. they are very far from the query vector. That is why one typically performs a verification step based on exact Euclidean distance computations. The nice property of this kind of approximate method is that, if the true nearest neighbor survives the first stage, you are sure it will be ranked in a good position after the exact verification. The problem is that this second stage needs the raw descriptors, so we still have to keep a huge amount of data: for one billion vectors this means more than one hundred gigabytes. Either you keep them on disk, in which case the efficiency of the scheme collapses because in practice you have to check too many neighbors, or you simply cannot do it and you are limited in the number of vectors you can index.

What we propose in this paper is to add a re-ranking stage based on source coding. Before describing it, I have to recall a previous work of ours that uses a compression-based approach. The idea is to represent each database vector by a compressed, concise representation obtained with a product quantizer, which provides a very large number of reproduction values without storing them explicitly. The search is then cast as a distance estimation problem: instead of computing the true distance between the query x and a database vector y, we approximate it by the distance between x and q(y), the quantized version of y. There is a bias, which can be corrected, but in practice it matters little for the ranking. The key point is that this distance estimation can be done directly in the compressed domain; we never have to decompress the database. The scheme can also be combined with an inverted file, replacing the exact distance by the estimated one, and with it we obtain almost the same efficiency with much better accuracy; we also have proved bounds on the average distance estimation error.

Now let me focus on the second stage, the re-ranking stage, keeping in mind that the first stage is this compression-based search. The advantage of a compression-based first stage is that for each database vector we already have an explicit reconstruction of the descriptor. So instead of using the full descriptor in the second stage, we are going to refine this first reconstruction. This is done by first computing the residual vector, that is, the difference between the database vector and its first reconstruction. Because we do not want to store this residual explicitly, as it has the same dimension as the original vector, we quantize it with a quantizer adapted to the residual. We can then approximate y as the first approximation obtained in the first stage plus the decoded residual. This improves the initial estimate, both as a reconstruction and for the distance computation, and we can trade precision against memory through the number of bytes devoted to coding this residual.
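As an illustration of the compression-based first stage just described, here is a toy product quantizer with the compressed-domain (asymmetric) distance estimation. It is a sketch under simplifying assumptions (the dimension is divisible by m, the training set has at least ks vectors, float data), with illustrative names, not the speaker's actual code.

```python
import numpy as np

class ProductQuantizer:
    """Toy product quantizer: splits vectors into m sub-vectors, each encoded
    with its own small codebook of ks centroids, so a code takes m bytes when ks <= 256."""

    def __init__(self, m=8, ks=256, iters=10, seed=0):
        self.m, self.ks, self.iters, self.seed = m, ks, iters, seed

    def fit(self, X):
        X = np.asarray(X, dtype=np.float64)
        rng = np.random.default_rng(self.seed)
        self.dsub = X.shape[1] // self.m          # sub-vector dimension
        self.codebooks = []
        for j in range(self.m):
            sub = X[:, j * self.dsub:(j + 1) * self.dsub]
            cent = sub[rng.choice(len(sub), self.ks, replace=False)]
            for _ in range(self.iters):           # plain k-means per sub-space
                a = np.argmin(((sub[:, None] - cent[None]) ** 2).sum(-1), axis=1)
                for c in range(self.ks):
                    pts = sub[a == c]
                    if len(pts):
                        cent[c] = pts.mean(axis=0)
            self.codebooks.append(cent)
        return self

    def encode(self, X):
        X = np.asarray(X, dtype=np.float64)
        codes = np.empty((len(X), self.m), dtype=np.uint8)
        for j, cent in enumerate(self.codebooks):
            sub = X[:, j * self.dsub:(j + 1) * self.dsub]
            codes[:, j] = np.argmin(((sub[:, None] - cent[None]) ** 2).sum(-1), axis=1)
        return codes

    def decode(self, codes):
        """Explicit reconstruction q(y) from the stored codes."""
        return np.hstack([self.codebooks[j][codes[:, j]] for j in range(self.m)])

    def asymmetric_distances(self, query, codes):
        """Estimate d(x, y)^2 by d(x, q(y))^2 directly from the codes, using one
        lookup table of query-to-centroid distances per sub-space."""
        tables = [((cent - query[j * self.dsub:(j + 1) * self.dsub]) ** 2).sum(-1)
                  for j, cent in enumerate(self.codebooks)]
        return sum(tables[j][codes[:, j]] for j in range(self.m))
```

With the defaults m=8 and ks=256, each database vector is stored on eight bytes, which matches the first-stage code size used later in the talk.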
The parameters of the method are therefore the number of bytes for the first code, typically eight bytes per vector, which is quite small, and the number of bytes used to encode the residual. The scheme itself is quite simple. For a database vector y, the first stage assigns it a code which can be seen, back in the original Euclidean space, as the reproduction value q(y). At query time the first stage returns a shortlist of vectors that are supposed to be close to the query. Whenever a vector y of this shortlist is considered as a potential nearest neighbor, we explicitly build the improved estimate: we refine q(y) by adding the decoded residual, which gives y-hat, and the new distance estimate is the distance between x and y-hat instead of the distance between x and q(y). This is a much better approximation of the true distance between x and y.

Let me now show some search results on one billion vectors. Here we used eight bytes per vector for the first stage; the first curve, without re-ranking, shows the corresponding performance. The plot reports recall@R averaged over a large set of queries, that is, the probability that the true nearest neighbor is ranked within the first position, the first ten positions, the first hundred positions, and so on; keep in mind that the rank can go up to one billion. You can see that the first stage alone is able to bring the true nearest neighbor within the first thousand positions, but that is not sufficient. If we re-rank using only eight additional bytes per vector for the residual code, we already get a very significant improvement; refining with sixteen bytes helps further, and the curves converge: with one hundred and twenty-eight bytes per vector the true nearest neighbor is almost always ranked in the first position. I should add that the re-ranking stage has a cost that is almost negligible compared to the first stage, on the order of milliseconds, so the overall query time remains a fraction of a second even when you want to be almost certain of retrieving the true nearest neighbor among one billion vectors.

The take-home message is that, rather than spending the whole budget on the first stage, it is better to spend less on the first stage and more on the second one: you gain efficiency, because the first-stage selection can be coarser, while obtaining comparable and in fact improved precision thanks to the re-ranking stage based on source coding.

Before concluding, I would like to mention that we have released this dataset of one billion vectors. The reason is that many papers on approximate search evaluate on only one million vectors, whereas, as I showed at the beginning, real applications require on the order of one billion. So if you want to run experiments, the dataset is available, and we provide the ground truth: we extracted one thousand queries and, since this computation takes a long time, we computed for each query the exact nearest neighbors, that is, the identifiers of the true nearest neighbors up to rank one thousand and the corresponding distances, in case you want to produce comparable runs.
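To summarize what the re-ranking stage computes, here is a small sketch of the second stage, written against the toy ProductQuantizer above; the helper names and the decode callables are hypothetical, and the snippet only illustrates the reconstruction-plus-residual idea, not the released code.

```python
import numpy as np

def rerank_shortlist(query, shortlist_ids, decode_first, decode_residual, k=10):
    """Second stage (toy version): for each candidate y in the shortlist,
    rebuild y_hat = q1(y) + q2(y - q1(y)) from its stored codes and
    re-sort the candidates by the refined distance d(x, y_hat).
    decode_first(ids) / decode_residual(ids) are assumed to return, for the
    given database ids, the first-stage reconstruction and the decoded residual."""
    y_hat = decode_first(shortlist_ids) + decode_residual(shortlist_ids)
    dist = ((y_hat - query) ** 2).sum(axis=1)
    order = np.argsort(dist)[:k]
    return shortlist_ids[order], dist[order]

# Offline indexing, using the ProductQuantizer sketch above (names illustrative):
#   db_codes  = pq_first.encode(database)
#   residuals = database - pq_first.decode(db_codes)
#   res_codes = pq_res.fit(residuals).encode(residuals)
# Query time, after the first stage has produced shortlist_ids:
#   ids, d = rerank_shortlist(query, shortlist_ids,
#                             lambda i: pq_first.decode(db_codes[i]),
#                             lambda i: pq_res.decode(res_codes[i]))
```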
To conclude my talk: we have proposed a source-coding-based re-ranking approach that avoids using the full descriptors; you can view it as trading a small amount of memory for accuracy, and it improves the trade-off between efficiency and precision at a fixed memory budget. We have also released a dataset of one billion vectors for the evaluation of approximate search, and I should mention that we have packaged the method in a library for compression-based search that reproduces both our earlier results and the source-coding re-ranking I have presented. I am now happy to take questions.

Q: A very short question: how dependent is this technique on using SIFT per se? Do we have to use SIFT descriptors with it?

A: No, the method does not depend on SIFT: it operates on top of whatever descriptors you extract, and I think it can be applied to other descriptors as well.

Q: I have one more question myself. You do this refinement once; what prevents you from iterating again for further refinement? Do you get convergence, so that if you make the quantization error very small it goes to zero?

A: Of course, that is a good question. We had to optimize the first stage and the second stage jointly, and in this work two stages turned out to be sufficient; but having several refinement levels, and studying how many neighbors to keep at each level, would indeed be a nice extension.

Q: I remember from my first course on quantization that the usual assumption is that the quantization error is uniform within a bin, so that the points are more or less evenly distributed inside the bin. So I am wondering: if you do one or more additional levels of quantization, doesn't that make further quantization very hard, because there is very little structure left to take advantage of?

A: Yes, I think that is true. At some point, when the first quantizer is very fine, the residual has little remaining structure; as in compression, at high rates what you end up encoding is mostly noise, and we observe the same phenomenon here, so the last stage does suffer from that.

Q: It is also a problem with high-dimensional data, because all the points tend to be spread almost uniformly.

A: You mean, compared to... I think...