Get minimum Euclidean distance between a given vector and vectors in the database - sql

I store 128-D vectors in a PostgreSQL table as double precision[]:
create table tab (
  id integer,
  name character varying(200),
  vector double precision[]
)
For a given vector, I need to return the single record from the table whose stored vector has the minimum Euclidean distance to it.
I have a function that computes the Euclidean distance of two vectors according to the well-known formula sqrt((v1[1]-v2[1])^2 + (v1[2]-v2[2])^2 + ... + (v1[128]-v2[128])^2):
CREATE OR REPLACE FUNCTION public.euclidian(
  arr1 double precision[],
  arr2 double precision[])
  RETURNS double precision AS
$BODY$
  SELECT sqrt(SUM(tab.v)) AS euclidian
  FROM (SELECT UNNEST(vec_sub(arr1, arr2)) AS v) AS tab;
$BODY$
LANGUAGE sql IMMUTABLE STRICT;
Ancillary function:
CREATE OR REPLACE FUNCTION public.vec_sub(
  arr1 double precision[],
  arr2 double precision[])
RETURNS double precision[] AS
$BODY$
  SELECT array_agg(result)
  FROM (SELECT (tuple.val1 - tuple.val2) * (tuple.val1 - tuple.val2) AS result
        FROM (SELECT UNNEST($1) AS val1,
                     UNNEST($2) AS val2,
                     generate_subscripts($1, 1) AS ix) tuple
        ORDER BY ix) inn;
$BODY$
LANGUAGE sql IMMUTABLE STRICT;
Query:
select tab.id as tabid, tab.name as tabname,
       euclidian('{0.1,0.2,...,0.128}', tab.vector) as eucl
from tab
order by eucl ASC
limit 1
Everything works fine while I have several thousand records in tab. But the DB is going to grow, and I need to avoid a full scan of tab when running the query, adding some kind of index search instead. It would be great to filter out at least 80% of the records by index; the remaining 20% can be handled by a full scan.
One of the current directions of my search for a solution: the PostGIS extension allows searching and sorting by distance (ST_3DDistance), filtering by distance (ST_3DWithin), etc. This works great and fast using indices. Is it possible to generalize this to N-dimensional space?
Some observations:
all coordinate values are between [-0.5...0.5] (I do not know exactly; I think [-1.0...1.0] are the theoretical limits)
the vectors are not normalized; the distance from (0,0,...,0) is in the range [1.2...1.6].
This is a post translated from the Russian Stack Exchange.

Like @SaiBot hints at with locality-sensitive hashing (LSH), there are plenty of well-researched techniques that allow you to run approximate nearest neighbor (ANN) searches. You have to accept a speed/accuracy tradeoff, but this is reasonable for most production scenarios, since a brute-force approach to finding the exact neighbors tends to be computationally prohibitive.
This article is an excellent overview of the current state-of-the-art algorithms along with their pros and cons. Below, I've linked several popular open-source implementations; all three have Python bindings (a minimal usage sketch follows the list).
Facebook FAISS
Spotify Annoy
Google ScaNN
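Purely as an illustration (not from the original answer), here is roughly what building and querying an index looks like with Annoy; the ids and vectors below are placeholders for the rows of tab:

from annoy import AnnoyIndex
import random

dim = 128
index = AnnoyIndex(dim, 'euclidean')

# Hypothetical data: one 128-D vector per row id of the tab table.
for item_id in range(10000):
    index.add_item(item_id, [random.uniform(-0.5, 0.5) for _ in range(dim)])

index.build(10)                      # more trees -> better recall, bigger index
index.save('tab_vectors.ann')

query = [0.1] * dim
ids, dists = index.get_nns_by_vector(query, 1, include_distances=True)
print(ids[0], dists[0])              # id and Euclidean distance of the nearest stored vector

The index lives outside PostgreSQL, so you would keep the row id as the payload and fetch the matching record by id afterwards.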

With 128-dimensional data and the constraint of staying inside PostgreSQL, you will have no choice but to run a full scan for each query.
Even highly optimized index structures for high-dimensional data, such as the X-tree or the IQ-tree, will have problems with that many dimensions and usually offer no big benefit over a plain scan.
The main issue here is the curse of dimensionality, which causes index structures to degenerate above roughly 20 dimensions.
Newer work therefore considers the problem of approximate nearest neighbor search, since in many applications with this many dimensions it is sufficient to find a good answer rather than the best one. Locality-sensitive hashing is among these approaches.
Note: even if an index structure is able to filter out 80% of the records, you will have to access the remaining 20% by random I/O from disk (making the application I/O bound), which can be even slower than reading all the data in one sequential scan and computing the distances.

You could look at computational geometry, a field dedicated to efficient geometric algorithms. There is generally a tradeoff between the amount of data stored and the efficiency of an algorithm, so by storing some extra data we can reduce the query time. In particular, we are looking at nearest neighbour search; the approach below uses a form of space partitioning.
Let's consider the 3D case. As the distances from the origin lie in a narrow range, the vectors look like they are clustered around a fuzzy sphere. Divide space into 8 sub-cubes (octants) depending on the sign of each coordinate, and label these +++, ++-, etc. We can work out the minimum distance from the test vector to a vector in each cube.
Say our test vector is (0.4, 0.5, 0.6). The minimum distance from it to the +++ cube is zero. The minimum distance to the -++ cube is 0.4, as the closest vector in the -++ cube would be something like (-0.0001, 0.5, 0.6). Likewise, the minimum distance to +-+ is 0.5, to ++-: 0.6, to --+: sqrt(0.4^2 + 0.5^2), etc.
The algorithm then becomes: first search the cube the test vector is in and find the minimum distance to all the vectors in that cube. If that distance is smaller than the minimum distance to every other cube, we are done. If not, search the next closest cube, and continue until no vector in any unsearched cube could be closer.
If we were to implement this in a database, we would compute a key for each vector. In 3D this is a 3-bit integer, with a 0 or 1 in each bit depending on the sign of the corresponding coordinate: 0 for +, 1 for -. So first SELECT ... WHERE key = 000, then WHERE key = 100, and so on.
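A minimal Python sketch of that key and the per-octant lower bound (my own illustration, not from the original answer; a 0 bit encodes a non-negative coordinate, a 1 bit a negative one):

import math

def sign_key(vec):
    # Pack the sign of each coordinate into an integer: bit 0 for >= 0, bit 1 for < 0.
    key = 0
    for x in vec:
        key = (key << 1) | (1 if x < 0 else 0)
    return key

def min_dist_to_octant(query, key, dim):
    # Lower bound on the distance from `query` to any point in the octant named by `key`.
    # Only coordinates whose sign disagrees with the octant contribute.
    total = 0.0
    for i in range(dim):
        octant_negative = ((key >> (dim - 1 - i)) & 1) == 1
        q = query[i]
        if (q < 0) == octant_negative:
            continue          # same half-space: a point in the octant can get arbitrarily close
        total += q * q        # must cross the hyperplane x_i = 0
    return math.sqrt(total)

query = [0.4, 0.5, 0.6]
dim = len(query)
# Visit octants in order of their lower bound; stop once the bound exceeds the best match found.
order = sorted(range(2 ** dim), key=lambda k: min_dist_to_octant(query, k, dim))

In the database, the key would be a precomputed, indexed integer column, and each step of the search becomes a SELECT ... WHERE key = <bucket> query.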
You can think of this as a type of hash function which has been specifically designed to make finding close points easy. I believe these are called locality-sensitive hashes.
The high dimensionality of your data makes things much trickier. With 128 dimensions, just using the signs of the coordinates gives 2^128 ≈ 3.4E+38 possibilities. That is far too many hash buckets, and some form of dimensionality reduction is needed.
You might be able to choose k reference points and partition space according to which of them each vector is closest to; a sketch of that idea follows.
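A sketch of that last idea under simple assumptions: the k reference points below are random for illustration (k-means centroids of a sample of the data would usually give more balanced buckets), and the bucket id is what you would store and index alongside each row:

import random

def assign_bucket(vec, refs):
    # Bucket = index of the nearest reference point (squared Euclidean distance).
    best, best_d = 0, float('inf')
    for i, r in enumerate(refs):
        d = sum((a - b) ** 2 for a, b in zip(vec, r))
        if d < best_d:
            best, best_d = i, d
    return best

random.seed(0)
dim, k = 128, 256
refs = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(k)]

vec = [random.uniform(-0.5, 0.5) for _ in range(dim)]
bucket = assign_bucket(vec, refs)   # store this integer with the row and index it

At query time you would probe the query's own bucket first and then widen to neighbouring buckets, accepting that the result may be approximate.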

Related

Which data structure should I use to store a data set of 10,000 items, do the fastest search, and take the least memory?

*Suppose I have 10,000 circle (x, y, r) values, and I want to find which circle a point (p1, p2) lies in. To get the fastest response for this query, what data structure should I use to store those 10,000 circles?
The data is static, i.e. constructed once,
but the most frequent operation will be the search query. It will not be a range-based search or a nearest-neighbor search.
How about a B-tree, B+ tree, R-tree, quadtree, linear interpolation search, or some kind of bitmap? The solution should take the least memory, while a little extra time as a trade-off is okay.*
You have at least three choices:
Use a bounding volume hierarchy (BVH); in this case I think a 2D sphere tree (circle tree) is the way to go. You construct a tree in which each node is a circle, and each node can contain child circles; your input circles end up at the leaves. This way a point search will be log N in time, but the structure will be N log N in space. In general, though, a BVH may be hard to construct.
Use a tree-based N-ary space partitioning (binary space partitioning, quadtree, 2D k-d tree). This time you partition space, and each region may contain some circles. These algorithms will have the same complexity as a BVH but will most likely be less efficient -- I suspect a less tight fit around the circles. However, some space partitionings (e.g. a quadtree) can be easier to construct.
Use spatial hashing. Here space is partitioned into cells (buckets) of a fixed size and these cells are hashed. Basically, you can think of it as a uniform grid of pixels (but stored sparsely), where each pixel has a list of the circles it contains. In theory this gives O(1) search. The space usage is harder to predict: it is O(N) in theory, but it may be bigger than a BVH due to the constant factor, and it most likely depends on the distribution of circle sizes and positions (e.g. the number of cells each circle overlaps). A small sketch follows below.
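A minimal Python sketch of the spatial-hashing option, assuming a hand-picked cell size and registering each circle in every cell its bounding box touches (slightly conservative, but simple):

from collections import defaultdict

class CircleGrid:
    # Uniform grid (spatial hash): each cell stores the circles whose bounding box overlaps it.
    def __init__(self, circles, cell_size):
        self.cell = cell_size
        self.buckets = defaultdict(list)
        for cx, cy, r in circles:
            for ix in range(int((cx - r) // cell_size), int((cx + r) // cell_size) + 1):
                for iy in range(int((cy - r) // cell_size), int((cy + r) // cell_size) + 1):
                    self.buckets[(ix, iy)].append((cx, cy, r))

    def circles_containing(self, px, py):
        # Only the single cell containing the point needs to be inspected.
        key = (int(px // self.cell), int(py // self.cell))
        return [(cx, cy, r) for cx, cy, r in self.buckets.get(key, [])
                if (px - cx) ** 2 + (py - cy) ** 2 <= r * r]

circles = [(0.0, 0.0, 1.0), (5.0, 5.0, 2.0), (5.5, 4.5, 0.7)]
grid = CircleGrid(circles, cell_size=1.0)
print(grid.circles_containing(5.2, 4.8))   # -> the two circles around (5, 5)

The cell size is the main tuning knob: cells comparable to a typical circle diameter keep the per-cell lists short without blowing up the number of cells.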

Determine the running time of an algorithm with two parameters

I have implemented an algorithm that uses two other algorithms for calculating the shortest path in a graph: Dijkstra and Bellman-Ford. Based on the time complexity of these algorithms, I can calculate the running time of my implementation, which is easy given the code.
Now, I want to experimentally verify my calculation. Specifically, I want to plot the running time as a function of the size of the input (I am following the method described here). The problem is that I have two parameters - number of edges and number of vertices.
I have tried to fix one parameter and change the other, but this approach results in two plots - one for varying number of edges and the other for varying number of vertices.
This leads me to my question - how can I determine the order of growth based on two plots? In general, how can one experimentally determine the running time complexity of an algorithm that has more than one parameter?
It's very difficult in general.
The usual way you would experimentally gauge the running time in the single variable case is, insert a counter that increments when your data structure does a fundamental (putatively O(1)) operation, then take data for many different input sizes, and plot it on a log-log plot. That is, log T vs. log N. If the running time is of the form n^k you should see a straight line of slope k, or something approaching this. If the running time is like T(n) = n^{k log n} or something, then you should see a parabola. And if T is exponential in n you should still see exponential growth.
You can only hope to get information about the highest order term when you do this -- the low order terms get filtered out, in the sense of having less and less impact as n gets larger.
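For the single-variable case, the exponent can be read off with an ordinary linear fit in log-log space; a small sketch with made-up measurements:

import numpy as np

# Hypothetical measurements: input sizes and the corresponding counter values / times.
n = np.array([1000, 2000, 4000, 8000, 16000, 32000])
t = np.array([0.8, 3.3, 13.1, 52.0, 210.0, 840.0])   # made-up numbers that roughly quadruple

slope, intercept = np.polyfit(np.log(n), np.log(t), 1)
print(f"estimated exponent k ~= {slope:.2f}")         # ~2 here, suggesting T(n) ~ n^2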
In the two variable case, you could try to do a similar approach -- essentially, take 3 dimensional data, do a log-log-log plot, and try to fit a plane to that.
However this will only really work if there's really only one leading term that dominates in most regimes.
Suppose my actual function is T(n, m) = n^4 + n^3 * m^3 + m^4.
When m = O(1), then T(n) = O(n^4).
When n = O(1), then T(n) = O(m^4).
When n = m, then T(n) = O(n^6).
Each of these regimes is a "slice" through the plane of possible (n, m) values, and in each one a different term is the dominant term.
So there's no way to determine the function just from taking some points with fixed m, and some points with fixed n. If you did that, you wouldn't get the right answer for n = m -- you wouldn't be able to discover "middle" leading terms like that.
I would recommend that the best way to predict asymptotic growth when you have lots of variables / complicated data structures, is with a pencil and piece of paper, and do traditional algorithmic analysis. Or possibly, a hybrid approach. Try to break the question of efficiency into different parts -- if you can split the question up into a sum or product of a few different functions, maybe some of them you can determine in the abstract, and some you can estimate experimentally.
Luckily, two input parameters are still easy to visualize in a 3D scatter plot (the 3rd dimension is the measured running time), and you can check whether it looks like a plane (in log-log-log scale) or whether it is curved. Naturally, random variation in the measurements plays a role here as well.
In MATLAB I typically calculate a least-squares solution to a two-variable function like this (it just concatenates different powers and combinations of x and y horizontally; .* is an element-wise product):
x = log(parameter_x);
y = log(parameter_y);
% Find a least-squares fit
p = [x.^2, x.*y, y.^2, x, y, ones(length(x),1)] \ log(time)
Then this can be used to estimate running times for larger problem instances, ideally those would be confirmed experimentally to know that the fitted model works.
This approach also works for higher dimensions but gets tedious to write out; maybe there is a more general way to achieve this, and this is just a workaround for my lack of knowledge.
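For reference, a rough NumPy translation of the MATLAB snippet above (same design matrix, same coefficient order); this is just a convenience sketch, not part of the original answer:

import numpy as np

def fit_two_param_model(parameter_x, parameter_y, time):
    # Least-squares fit of log(time) against a quadratic form in log(x) and log(y).
    x = np.log(np.asarray(parameter_x, dtype=float))
    y = np.log(np.asarray(parameter_y, dtype=float))
    A = np.column_stack([x**2, x*y, y**2, x, y, np.ones_like(x)])
    p, *_ = np.linalg.lstsq(A, np.log(np.asarray(time, dtype=float)), rcond=None)
    return p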
I was going to write my own explanation but it wouldn't be any better than this.

Generating Matched Pairs for Statistical Analysis

In my study, a person is represented as a pair of real numbers (x, y). x is on [30, 80] and y is [60, 120]. There are two types of people, A and B. I have ~300 of each type. How can I generate the largest (or even a large) set of pairs of one person from A with one from B: ((xA, yA), (xB, yB)) such that each pair of points is close? Two points are close if abs(x1-x2) < dX and abs(y1 - y2) < dY. Similar constraints are acceptable. (That is, this constraint is roughly a Manhattan metric, but euclidean/etc is ok too.) Not all points need be used, but no point can be reused.
You're looking for the Hungarian Algorithm.
Suggested formulation: the As are rows, the Bs are columns, and each cell (i, j) contains a distance metric between Ai and Bj, e.g. abs(X(Ai)-X(Bj)) + abs(Y(Ai)-Y(Bj)). (You can normalize the X and Y values to [0,1] if you want distances to be proportional to the range of each variable.)
Then use the Hungarian Algorithm to minimize matching weight.
You can filter out matches with distances over your threshold. If you're worried that this filtering might cause the approach to be sub-optimal, you could set distances over your threshold to a very high number.
There are many implementations of this algorithm. A short search found one in just about any conceivable language, including VBA for Excel, as well as some online solvers (I'm not sure about matching a 300x300 matrix with those, though).
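As an illustrative sketch (not part of the original answer), SciPy's linear_sum_assignment solves exactly this kind of assignment problem; the data and thresholds below are made up:

import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical data: rows of (x, y) for roughly 300 people in each group.
rng = np.random.default_rng(0)
A = rng.uniform([30, 60], [80, 120], size=(300, 2))
B = rng.uniform([30, 60], [80, 120], size=(300, 2))

dX, dY = 2.0, 4.0                                  # made-up closeness thresholds
PENALTY = 1e6                                      # "very high number" for over-threshold pairs

dx = np.abs(A[:, None, 0] - B[None, :, 0])
dy = np.abs(A[:, None, 1] - B[None, :, 1])
cost = dx + dy                                     # Manhattan-style matching weight
cost[(dx > dX) | (dy > dY)] = PENALTY

rows, cols = linear_sum_assignment(cost)           # Hungarian-style optimal assignment
pairs = [(i, j) for i, j in zip(rows, cols) if cost[i, j] < PENALTY]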
Hungarian algorithm did it, thanks Etov.
Source code available here: http://www.filedropper.com/stackoverflow1

approximating log10[x^k0 + k1]

Greetings. I'm trying to approximate the function
Log10[x^k0 + k1], where 0.21 < k0 < 21, 0 < k1 < ~2000, and x is an integer < 2^14.
k0 & k1 are constant. For practical purposes, you can assume k0 = 2.12, k1 = 2660. The desired accuracy is 5*10^-4 relative error.
This function is virtually identical to Log[x], except near 0, where it differs a lot.
I have already come up with a SIMD implementation that is ~1.15x faster than a simple lookup table, but I would like to improve it if possible, which I think is very hard due to the lack of efficient instructions.
My SIMD implementation uses 16-bit fixed-point arithmetic to evaluate a 3rd-degree polynomial (fitted by least squares). The polynomial uses different coefficients for different input ranges. There are 8 ranges, and range i spans (64)2^i to (64)2^(i + 1).
The rationale behind this is that the derivatives of Log[x] drop rapidly with x, meaning a polynomial will fit it more accurately, since polynomials are an exact fit for functions that have a derivative of 0 beyond a certain order.
SIMD table lookups are done very efficiently with a single _mm_shuffle_epi8(). I use SSE's float to int conversion to get the exponent and significand used for the fixed point approximation. I also software pipelined the loop to get ~1.25x speedup, so further code optimizations are probably unlikely.
What I'm asking is if there's a more efficient approximation at a higher level?
For example:
Can this function be decomposed into functions with a limited domain, like
log2((2^x) * significand) = x + log2(significand)
hence eliminating the need to deal with different ranges (table lookups)? The main problem, I think, is that adding the k1 term kills all those nice log properties that we know and love, making it impossible. Or is it?
Iterative method? I don't think so, because the Newton iteration for log[x] is already a complicated expression.
Exploiting locality of neighboring pixels? If the ranges of the 8 inputs fall in the same approximation range, then I can look up a single set of coefficients instead of looking up separate coefficients for each element. I can use this as a fast common case, and use a slower, general code path when it isn't. But for my data, the range needs to be ~2000 before this property holds 70% of the time, which doesn't seem to make this method competitive.
Please, give me some opinion, especially if you're an applied mathematician, even if you say it can't be done. Thanks.
You should be able to improve on least-squares fitting by using Chebyshev approximation. (The idea is, you're looking for the approximation whose worst-case deviation in a range is least; least-squares instead looks for the one whose summed squared difference is least.) I would guess this doesn't make a huge difference for your problem, but I'm not sure -- hopefully it could reduce the number of ranges you need to split into, somewhat.
If there's already a fast implementation of log(x), maybe compute P(x) * log(x) where P(x) is a polynomial chosen by Chebyshev approximation. (Instead of trying to do the whole function as a polynomial approx -- to need less range-reduction.)
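A rough sketch of the Chebyshev idea: interpolate the target function at Chebyshev nodes over one range and check the worst-case relative error (the degree and range mirror the setup in the question, but the exact numbers are only illustrative):

import numpy as np

k0, k1 = 2.12, 2660.0
lo, hi = 64.0, 128.0            # one of the existing approximation ranges
deg = 3

def f(x):
    return np.log10(x**k0 + k1)

# Chebyshev nodes mapped into [lo, hi]; interpolating there is close to the minimax fit.
nodes = np.cos((2 * np.arange(deg + 1) + 1) * np.pi / (2 * (deg + 1)))
xs = 0.5 * (hi - lo) * nodes + 0.5 * (hi + lo)
coeffs = np.polyfit(xs, f(xs), deg)

grid = np.linspace(lo, hi, 2000)
rel_err = np.max(np.abs(np.polyval(coeffs, grid) - f(grid)) / np.abs(f(grid)))
print(rel_err)                  # compare against the 5e-4 target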
I'm an amateur here -- just dipping my toe in as there aren't a lot of answers already.
One observation:
You can find an expression for how large x needs to be, as a function of k0 and k1, such that the term x^k0 dominates k1 enough for the approximation
x^k0 + k1 ~= x^k0
to hold, allowing you to approximately evaluate the function as
k0*Log10[x].
This would take care of all x above some threshold.
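A small sketch that finds this threshold numerically for the example constants from the question; the relative error decreases monotonically in x, so the first x that passes is the cutoff:

import math

k0, k1, tol = 2.12, 2660.0, 5e-4

x = 1
while True:
    exact = math.log10(x**k0 + k1)
    approx = k0 * math.log10(x)
    if abs(approx - exact) / exact < tol:
        break
    x += 1
print(x)   # above this x the k1 term can simply be dropped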
I recently read about how the sRGB model compresses physical tristimulus values into stored RGB values.
It is basically very similar to the function I am trying to approximate, except that it is defined piecewise:
k0*x,            for x < 0.0031308
k1*x^0.417 - k2, otherwise
I was told the constant addition in Log[x^k0 + k1] was there to make the beginning of the function more linear. But that can easily be achieved with a piecewise approximation instead. That would make the approximation a lot more "uniform" -- with only 2 approximation ranges. This should be cheaper to compute, since there is no longer a need to compute an approximation-range index (an integer log) and do a SIMD coefficient lookup.
For now, I conclude this will be the best approach, even though it doesn't approximate the function precisely. The hard part will be proposing this change and convincing people to use it.

Need help generating discrete random numbers from distribution

I searched the site but did not find exactly what I was looking for... I want to generate a discrete random number from a normal distribution.
For example, if I have a range from a minimum of 4 to a maximum of 10, with an average of 7, what code or function call (Objective-C preferred) would I need to return a number in that range? Naturally, due to the normal distribution, more of the returned numbers would center around the average of 7.
As a second example, can the bell curve/distribution be skewed toward one end or the other? Let's say I need to generate a random number in a range with a minimum of 4 and a maximum of 10, and I want the majority of the returned numbers to center around 8, with a natural fall-off based on a skewed bell curve.
Any help is greatly appreciated....
Anthony
What do you need this for? Can you do it the craps player's way?
Generate two random integers in the range of 2 to 5 (inclusive, of course) and add them together. Or flip a coin (0,1) six times and add 4 to the result.
Summing multiple dice approximates a normal distribution (a "bell curve"), while eliminating high or low throws can be used to skew the distribution in various ways.
The key is that you are going for discrete numbers (and I hope you mean integers by that). Multiple dice throws famously generate an approximately normal distribution; in fact, I think that's how we were first introduced to the Gaussian curve in school.
Of course, the more throws, the more closely you approximate the bell curve. Rolling a single die gives a flat line. Rolling two dice just creates a ramp up and down that isn't terribly close to a bell. Six coin flips gets you closer.
So consider this...
If I understand your question correctly, you only have seven possible outcomes--the integers (4,5,6,7,8,9,10). You can set up an array of seven probabilities to approximate any distribution you like.
Many frameworks and libraries have this built-in.
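A sketch of that seven-probability array with Python's standard library; the weights are made up and simply peak near 8, falling off toward both ends:

import random

values  = [4, 5, 6, 7, 8, 9, 10]
weights = [1, 3, 7, 12, 16, 10, 4]   # made-up relative probabilities (they need not sum to 1)

samples = random.choices(values, weights=weights, k=1000)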
Also, just like TokenMacGuy said, a normal distribution isn't characterized by the interval it's defined on, but rather by two parameters: the mean μ and the standard deviation σ. With these two parameters you can confine a certain quantile of the distribution to a certain interval, so that, say, 95% of all points fall in that interval. But restricting it completely to any interval other than (−∞, ∞) is impossible.
There are several methods to generate normally distributed values from uniform random values (which is what most random or pseudorandom number generators produce):
The Box-Muller transform is probably the easiest, although not exactly fast to compute. Depending on how many numbers you need it should be sufficient, though, and it is definitely very easy to write (a small sketch follows this list).
Another option is Marsaglia's polar method, which is usually faster.¹
A third method is the Ziggurat algorithm, which is considerably faster to compute but much more complex to program. In applications that really use a lot of random numbers it may be the best choice, though.
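For illustration, a minimal sketch of the Box-Muller transform mentioned in the first item (in practice use a library routine, as advised below):

import math
import random

def box_muller(mu, sigma):
    # One normally distributed sample from two uniform samples.
    u1 = random.random()
    while u1 == 0.0:            # avoid log(0)
        u1 = random.random()
    u2 = random.random()
    z = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
    return mu + sigma * z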
As general advice, though: don't write it yourself if you have access to a library that already generates normally distributed random numbers for you.
For skewing your distribution I'd just use a regular normal distribution, choosing μ and σ appropriately for one side of your curve, then determining on which side of your desired mean a point fell and stretching it appropriately to fit your desired distribution.
For generating only integers, I'd suggest you round to the nearest integer when the random number falls within your desired interval and reject it if it doesn't (drawing a new random number then). This way you won't artificially skew the distribution (as you would if you were clamping the values at 4 or 10, respectively). A small sketch of this follows the footnote below.
¹ In testing with deliberately bad random number generators (yes, worse than RANDU) I've noticed that the polar method can end up in an endless loop, rejecting every sample. This won't happen with random numbers that fulfill the usual statistical expectations, though.
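A minimal sketch of the round-and-reject approach just described, using the standard library's Gaussian generator; the mean and standard deviation are illustrative, and the one-sided stretch for skewing is left out:

import random

def bounded_normal_int(mu, sigma, lo, hi):
    # Draw from N(mu, sigma), round to the nearest integer, and resample until
    # the result falls in [lo, hi] -- rejection rather than clamping.
    while True:
        n = round(random.gauss(mu, sigma))
        if lo <= n <= hi:
            return n

samples = [bounded_normal_int(7.0, 1.5, 4, 10) for _ in range(1000)]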
Yes, there are sophisticated mathematical solutions, but for "simple but practical" I'd go with Nosredna's comment. For a simple Java solution:
Random random = new Random();

public int bell7()
{
    int n = 4;
    for (int x = 0; x < 6; ++x)
        n += random.nextInt(2);
    return n;
}
If you're not a Java person, Random.nextInt(n) returns a random integer between 0 and n-1. I think the rest should be similar to what you'd see in any programming language.
If the range were large, then instead of nextInt(2) I'd use a bigger number there so there would be fewer iterations through the loop, depending on the frequency of calls and the performance requirements.
Dan Dyer and Jay are exactly right. What you really want is a binomial distribution, not a normal distribution. The shape of a binomial distribution looks a lot like a normal distribution, but it is discrete and bounded whereas a normal distribution is continuous and unbounded.
Jay's code generates a binomial distribution with 6 trials and a 50% probability of success on each trial. If you want to "skew" your distribution, simply change the line that decides whether to add 1 to n so that the probability is something other than 50% (see the sketch below).
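A sketch of that skew: summing Bernoulli trials with a success probability above 50% shifts the bulk of the (shifted) binomial toward the upper end; with 6 trials and p = 0.65 the mean lands near 8, as the question asked:

import random

def skewed_bell(minimum=4, trials=6, p=0.65):
    # Shifted binomial: minimum plus the number of successes in `trials` Bernoulli(p) trials.
    return minimum + sum(1 for _ in range(trials) if random.random() < p)

counts = {v: 0 for v in range(4, 11)}
for _ in range(10000):
    counts[skewed_bell()] += 1
print(counts)   # mass concentrated around 8, tailing off toward 4 and 10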
The normal distribution is not described by its endpoints. Normally it's described by its mean (which you have given to be 7) and its standard deviation. An important feature of this is that it is possible to get a value far outside the expected range from this distribution, although such values become vanishingly rare the further you get from the mean.
The usual means of getting a value from a distribution is to generate a random value from a uniform distribution, which is quite easily done with, for example, rand(), and then push it through the inverse of the cumulative distribution function, which maps probabilities back to values. For the normal distribution, the CDF is
F(x) = 0.5 * (1 + erf((x - μ)/(σ * sqrt(2.0))))
where erf() is the error function, which may be described by a Taylor series:
erf(z) = (2/sqrt(π)) * Σ_{n=0}^{∞} ((-1)^n * z^(2n+1)) / (n! * (2n+1))
I'll leave it as an exercise to translate this into C.
If you prefer not to engage in the exercise, you might consider using the GNU Scientific Library, which, among many other features, can generate random numbers from many common distributions, the Gaussian distribution (hint) being one of them.
Obviously, all of these functions return floating-point values. You will have to use some rounding strategy to convert to a discrete value. A useful (but naive) approach is to simply round or downcast to an integer (a small sketch follows).
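A sketch of the inverse-CDF approach described above, using the standard library's erf instead of hand-coding the series, and a simple (slow but robust) bisection to invert the CDF; the mean and standard deviation are illustrative:

import math
import random

def normal_cdf(x, mu, sigma):
    # F(x) for N(mu, sigma), via the error function.
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def sample_normal(mu, sigma):
    # Inverse-transform sampling: draw u ~ U(0,1) and invert the CDF by bisection.
    u = random.random()
    lo, hi = mu - 10.0 * sigma, mu + 10.0 * sigma
    for _ in range(60):                      # 60 halvings give plenty of precision
        mid = 0.5 * (lo + hi)
        if normal_cdf(mid, mu, sigma) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

value = round(sample_normal(7.0, 1.5))       # round to get a discrete value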