
I understand that there is a thread discussing a similar issue here:

How to efficiently search a set of vectors for a vector which is the closest match

But my problem is slightly different - and hopefully easier.

Given a set of vectors of the same dimension, e.g.,

[2, 3, 5, 9]

[12, 9, 2, 8]

[45, 1, 0, 1]

And a query vector [1, 4, 7, 2], I would like an algorithm that returns [2, 3, 5, 9] as the closest matching vector, because it has the lowest sum of element-wise differences, i.e., (1-2) + (4-3) + (7-5) + (2-9) = -5.

One way would be to pre-compute the pairwise 'similarity', but that computation is very heavy, given that I will have more than 5000 vectors, so 250,000 comparisons.

Ziqi
  • As soon as you ask for a library, someone is probably going to vote to close as off-topic. Stick to asking about the algorithm and you're more likely to get the answer you want. – Paul Apr 02 '17 at 13:30
  • I have a potential answer, but I'm not certain it's needed. You aren't correct that you need 250,000 comparisons if you're testing a query against 5,000 vectors, because you aren't testing each vector against each other vector. Can you clarify? – Trixie Wolf Apr 02 '17 at 15:43
  • Hi, it's true that if I have 1 query I need 5000 comparisons. However, my application needs to rank, for each vector, say the top 100 closest vectors. Therefore essentially I still need to compare 5000 x 5000 pairs. Correct me if I'm wrong, thanks. – Ziqi Apr 02 '17 at 16:13
  • Still confused. Are the query vectors the same vectors you're testing? And if so, do you need to test all of them and determine for each one the 100 nearest neighbors? – Trixie Wolf Apr 02 '17 at 16:42
  • I was using "query" as a way to describe how the problem could be solved, and sorry if that has been misleading. But yes, essentially, given a set of vectors of the same dimension, I need to, for each vector v, find the top-ranked N other vectors that are closest to v. – Ziqi Apr 02 '17 at 17:01
  • Not worthy of an answer, but the proposed metric is rather lousy. You should look into the L1 and L2 norms (and perhaps the L-infinity norm). Consider the vector [6,8,1,8]. Your metric clearly chooses [2,3,5,9] as the closest match, while the L1 and L2 norms would both choose [12,9,2,8] as the closest match. – David Hammen Apr 02 '17 at 20:15

3 Answers


You can rearrange your:

(1-2) + (4-3) + (7-5) + (2-9) 

...to become:

(1 + 4 + 7 + 2) - (2 + 3 + 5 + 9)

...and (modulo roundoff errors, overflows, etc.) we're assured that the answer remains the same.

In other words, we can store the sum of elements in each vector, and when we get a query, sum the elements in that vector, then search for the closest value among those given.
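A minimal C++ sketch of that idea (the helper names are mine, not from the question): precompute one sum per stored vector, then compare the query's sum against those stored sums.

    #include <cstddef>
    #include <cstdlib>
    #include <numeric>
    #include <vector>

    // Precompute the sum of each stored vector once, ahead of time.
    std::vector<long long> precompute_sums(const std::vector<std::vector<int>>& vectors) {
        std::vector<long long> sums;
        sums.reserve(vectors.size());
        for (const auto& v : vectors)
            sums.push_back(std::accumulate(v.begin(), v.end(), 0LL));
        return sums;
    }

    // Index of the stored vector whose sum is closest to the query's sum (linear scan).
    std::size_t closest_by_sum(const std::vector<long long>& sums, const std::vector<int>& query) {
        const long long q = std::accumulate(query.begin(), query.end(), 0LL);
        std::size_t best = 0;
        for (std::size_t i = 1; i < sums.size(); ++i)
            if (std::llabs(sums[i] - q) < std::llabs(sums[best] - q))
                best = i;
        return best;
    }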


This lets us do most of the calculation ahead of time, at a much more reasonable cost.

We can still do better than just searching through all those vectors, though. For example, we could create a std::multimap (or std::unordered_multimap) to tell us the vector(s) associated with each sum. With an unordered_multimap, we can expect approximately constant complexity for the lookup.

In the comments, however, you've added a requirement for finding the N most similar vectors. In this case, you probably want to use a std::multimap instead of the unordered variety. With this, the initial lookup will be logarithmic instead of (expected) constant complexity--but finding the next most similar item can be done with constant complexity, because they're ordered.
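A hedged sketch of that lookup in C++, assuming the sums have been stored in a std::multimap from each sum to the index of the vector it came from (the names are mine):

    #include <cstddef>
    #include <iterator>
    #include <map>
    #include <vector>

    using SumIndex = std::multimap<long long, std::size_t>;  // sum -> index of the source vector

    // Indices of the n stored vectors whose sums are closest to q_sum.
    std::vector<std::size_t> n_closest(const SumIndex& by_sum, long long q_sum, std::size_t n) {
        std::vector<std::size_t> result;
        auto right = by_sum.lower_bound(q_sum);  // first sum >= q_sum: one logarithmic lookup
        auto left = right;                       // everything before 'left' is < q_sum
        while (result.size() < n && (left != by_sum.begin() || right != by_sum.end())) {
            // Each step only compares the two neighboring sums, so it runs in constant time.
            const bool take_left =
                right == by_sum.end() ||
                (left != by_sum.begin() &&
                 q_sum - std::prev(left)->first <= right->first - q_sum);
            if (take_left)
                result.push_back((--left)->second);
            else
                result.push_back((right++)->second);
        }
        return result;
    }

Building the map is just one insertion of {sum, index} per vector, done once up front.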

Depending on your situation, it might also be worth considering using a sorted vector of sums instead. This tends to make substantially better use of cache, since it stores the data contiguously. The tradeoff is that it doesn't support insertion/deletion of elements as nicely. If you do a lot of searching and rarely (or never) change those 5000 vectors, then a sorted vector of sums will probably work best. If, however, they're subject to a lot of changes (i.e., you constantly delete old vectors and insert new ones) then a std::map may be a better choice.
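The sorted-vector variant is the same lookup over contiguous storage; a brief sketch (again with made-up names):

    #include <algorithm>
    #include <cstddef>
    #include <utility>
    #include <vector>

    // (sum, index) pairs, sorted by sum once after all the sums are computed.
    using SumIndexVec = std::vector<std::pair<long long, std::size_t>>;

    // Position where q_sum would be inserted; the nearest sums sit immediately on
    // either side of this iterator, so the outward walk from the multimap example
    // applies unchanged -- just over a contiguous array.
    SumIndexVec::const_iterator locate(const SumIndexVec& sums, long long q_sum) {
        return std::lower_bound(sums.begin(), sums.end(),
                                std::make_pair(q_sum, std::size_t{0}));
    }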


One more point: a multimap uses a binary search to find the closest item. If your differences are reasonably evenly distributed, you can use an interpolating search instead. For that you start by finding the difference between the smallest and largest sum. For example, let's assume the smallest is 4 and the largest is 7912038, and we're looking for the sum closest to 32750. The difference in the sums is 7912034, so we start by computing 32750/7912034 = about 0.004. Then (assuming exactly 5000 vectors) we multiply that by 5000 to get about 20. So, instead of starting at the middle of the array of sums, we can start at the 20th item in the array (and depending on how far wrong that result is, we can repeat the same process until we get to the desired point).

Assuming the sums are reasonably evenly distributed, this lets us find the most similar item with about O(log log N) complexity instead of the O(log N) complexity of a binary search. Of course, for the moment I've used linear interpolation--if we know that the sums have (for example) a logarithmic or exponential distribution, we can use a logarithmic or exponential interpolation. In other words, the sums don't really have to be evenly distributed for this to work well--they just have to be reasonably predictably distributed.
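A sketch of that interpolating search over a sorted, non-empty array of sums, using linear interpolation as described above (the function name is mine):

    #include <cstddef>
    #include <cstdlib>
    #include <vector>

    // Index of the sum closest to 'target' in a sorted, non-empty array of sums.
    std::size_t interpolation_search_nearest(const std::vector<long long>& sums, long long target) {
        std::size_t lo = 0, hi = sums.size() - 1;
        while (lo < hi && sums[lo] < target && target < sums[hi]) {
            // Probe proportionally to where the target falls in the current range,
            // e.g. roughly 32750 / 7912034 * 5000, or about 20, on the first step above.
            const double fraction =
                double(target - sums[lo]) / double(sums[hi] - sums[lo]);
            const std::size_t probe = lo + static_cast<std::size_t>(fraction * (hi - lo));
            if (sums[probe] < target)
                lo = probe + 1;
            else if (sums[probe] > target)
                hi = probe - 1;
            else
                return probe;  // exact match
        }
        // Whichever of the two remaining boundaries is closer to the target wins.
        return std::llabs(sums[lo] - target) <= std::llabs(sums[hi] - target) ? lo : hi;
    }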

log log N is often called "pseudo-constant"--it does grow with the number of items being examined, but grows so slowly that for most practical purposes it can be treated as constant--for a small number of items like your 5000, it's one. For a large number of items, it grows to 2. Before it gets to 3, you're talking about more data than you can store in all the storage produced on planet earth in its entire history. Before it gets to 4, you're talking about a number larger than the number of atoms in the universe.

Jerry Coffin
  • Thank you. I think this certainly improves performance over a bruteforce approach. – Ziqi Apr 02 '17 at 17:04
  • An excellent answer. And thanks for the detail about interpolated searches... it's a technique I've used before but have never been sure how to quantify how much performance it gained. – Jules Apr 02 '17 at 17:14
  • @Jules: for most practical purposes, you can treat the `log log N` as constant, so the only real question in quantifying performance is whether your values are similar enough to the type of interpolation you're using for the `log log N` to really apply. My own experience has been that even for data that doesn't seem like it's distributed very evenly, it still works out pretty well though. – Jerry Coffin Apr 02 '17 at 17:48

For element-wise differences, Jerry provided an excellent solution. If that's really what you want, this is a perfect way to go.

However, I want to point out that there could be some confusion between the element-wise difference and the distance between vectors. If you want to calculate the distance between vectors (i.e., you want to find the vector that is closest to the query vector), you need to redefine your calculation:

Consider these 2 vectors:

  • v1: [1 4]
  • v2: [4 1]

Given a query vector q [2 4], it is obvious that v1 is closer than v2, although the sum of the differences is the same:

diff(v1, q) = (1-2) + (4-4) = -1+0 = -1
diff(v2, q) = (4-2) + (1-4) = 2-3 = -1

The problem with your diff function is that negative differences should be treated as positive if you want to know the distance between the vectors. So using the distance function (= absolute values of the differences), the equations look like this:

dist(v1, q) = abs(1-2) + abs(4-4) = 1+0 = 1
dist(v2, q) = abs(4-2) + abs(1-4) = 2+3 = 5

Now clearly we see that v1 is closer to q.
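A small C++ illustration of the two calculations (the helper names diff and dist are just the ones used above):

    #include <cstddef>
    #include <cstdlib>
    #include <vector>

    // Signed sum of element-wise differences, as defined in the question.
    long long diff(const std::vector<int>& a, const std::vector<int>& b) {
        long long s = 0;
        for (std::size_t i = 0; i < a.size(); ++i)
            s += a[i] - b[i];
        return s;
    }

    // L1 ("Manhattan") distance: take absolute values before summing.
    long long dist(const std::vector<int>& a, const std::vector<int>& b) {
        long long s = 0;
        for (std::size_t i = 0; i < a.size(); ++i)
            s += std::abs(a[i] - b[i]);
        return s;
    }

    // diff({1, 4}, {2, 4}) == diff({4, 1}, {2, 4}) == -1, but
    // dist({1, 4}, {2, 4}) == 1 while dist({4, 1}, {2, 4}) == 5.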

Again, this might not be an answer to the initial question - just an add-on :)

Stefan Woehrer
  • Thanks, a really well-thought-out insight. But in this application, a simple element-wise difference approach would be a reasonable approximation to the problem I need to deal with. It should also be much more lightweight than classic vector-distance/similarity metrics. – Ziqi Apr 03 '17 at 18:21
  • 1
    i re-thought this and based on your description an "element wise difference" can only be an absolute difference (a difference can never be negative). if an element wise difference is what you "have to" calculate (like if this is your homework hehe), you need to use absolute values. – Stefan Woehrer Apr 06 '17 at 16:00

I would like to suggest an unorthodox approach: calculating the difference sums between a query vector and a set of m vectors using GPGPU.

Let's assume that your set of m vectors of dimension n is defined as S(m,n):

S(m,n) = 
    [s_1,1, s_1,2, .... s_1,n
     s_2,1, s_2,2, .... s_2,n
     ....               ....
     s_m,1, s_m,2, .... s_m,n]

First you have to transform your n-dimensional vectors (V1, ..., Vn) into (n+1)-dimensional vectors (V1, ..., Vn, 1), which can be done in O(m) time:

S'(m,n+1)= 
    [s_1,1, s_1,2, .... s_1,n, 1
     s_2,1, s_2,2, .... s_2,n, 1
     ....               ....
     s_m,1, s_m,2, .... s_m,n, 1]

Then you can translate the whole set so that the query vector Q' = [Q1, ..., Qn, 1] moves to the origin (in its first n coordinates). Because the rows of S' multiply the translation matrix from the left, -Q goes in its bottom row:

TQ(n+1,n+1)= 
    [  1,   0, ....   0, 0
       0,   1, ....   0, 0
      ....              ....
       0,   0, ....   1, 0
     -Q1, -Q2, .... -Qn, 1]

S''(m,n+1)= S'(m,n+1) x TQ(n+1,n+1);

The sums of element-wise differences can then be calculated with the following matrix multiplication (for a true Manhattan distance you would also need to take absolute values of the entries of S'' first):

MD(n+1,1)= 
    [1
     1
     ...
     1
     0]

 S'''(m,1) = S''(m,n+1) x MD(n+1,1)

All of the matrix multiplications can be done on the GPU (see, for example, Section 3.4 of "Matrix Computations on the GPU"), using CUDA to speed up the processing.

Now the only thing you need to do is find min(S''') and its corresponding vector, which can be done in O(m) time.
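A CPU-side C++ sketch of this formulation, with plain triple loops standing in for the GPU matrix multiplications (on a GPU each multiply() call would map to one GEMM, e.g. via cuBLAS); all the names are made up:

    #include <cstddef>
    #include <vector>

    using Matrix = std::vector<std::vector<double>>;  // row-major, assumed non-empty

    // Naive matrix product; this is the part you would hand to the GPU.
    Matrix multiply(const Matrix& a, const Matrix& b) {
        const std::size_t rows = a.size(), inner = b.size(), cols = b[0].size();
        Matrix c(rows, std::vector<double>(cols, 0.0));
        for (std::size_t i = 0; i < rows; ++i)
            for (std::size_t k = 0; k < inner; ++k)
                for (std::size_t j = 0; j < cols; ++j)
                    c[i][j] += a[i][k] * b[k][j];
        return c;
    }

    // Difference sums of all m stored vectors against query q: S''' = (S' x TQ) x MD.
    std::vector<double> difference_sums(const Matrix& s, const std::vector<double>& q) {
        const std::size_t m = s.size(), n = q.size();

        // S'(m, n+1): each stored vector with a constant 1 appended.
        Matrix s1(m, std::vector<double>(n + 1, 1.0));
        for (std::size_t i = 0; i < m; ++i)
            for (std::size_t j = 0; j < n; ++j)
                s1[i][j] = s[i][j];

        // TQ(n+1, n+1): identity with -Q in the bottom row, so S' x TQ subtracts Q from every row.
        Matrix tq(n + 1, std::vector<double>(n + 1, 0.0));
        for (std::size_t j = 0; j <= n; ++j) tq[j][j] = 1.0;
        for (std::size_t j = 0; j < n; ++j)  tq[n][j] = -q[j];

        // MD(n+1, 1): n ones and a trailing 0, which sums each row and drops the appended 1.
        Matrix md(n + 1, std::vector<double>(1, 1.0));
        md[n][0] = 0.0;

        const Matrix s3 = multiply(multiply(s1, tq), md);  // S'''(m, 1)
        std::vector<double> result(m);
        for (std::size_t i = 0; i < m; ++i)
            result[i] = s3[i][0];
        return result;
    }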