Converting (x,y,z) coordinates to a single index that minimizes "distance" of elements to neighbors

If you have a fixed-size (N) 3D space, with integral coordinates in [0,N[, how can you convert those (x,y,z) coordinates to a single linear index in [0,N*N*N[, such that the average distance of a coordinate (x,y,z) to its 26 neighbors (x-1,y,z), (x+1,y,z), ... is minimized compared to the simplistic formula "index = x + N*y + N*N*z"?
An expensive formula is an acceptable solution in my case, because N is fixed and not too big, so I can compute the mapping a single time and cache the result.
The reason I need this is that I'm trying to compress equal values based on proximity, so putting neighboring values next to each other in the array improves the compression rate.
If you can point to some Java code it would be even better ...
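One common locality-preserving mapping (not named in the question, so treat it purely as an assumed direction) is the Z-order / Morton curve, which interleaves the bits of the coordinates; a Hilbert curve usually gives even better locality but is more involved. A minimal Python sketch that precomputes the mapping once and caches it, as the question suggests:

def morton3(x, y, z, bits):
    """Interleave the bits of x, y, z into a single Morton (Z-order) index."""
    index = 0
    for b in range(bits):
        index |= ((x >> b) & 1) << (3 * b)
        index |= ((y >> b) & 1) << (3 * b + 1)
        index |= ((z >> b) & 1) << (3 * b + 2)
    return index

def build_mapping(n):
    """Cache (x, y, z) -> linear index for an n*n*n grid; assumes n is a power of two."""
    bits = n.bit_length() - 1
    return {(x, y, z): morton3(x, y, z, bits)
            for x in range(n) for y in range(n) for z in range(n)}

mapping = build_mapping(16)      # e.g. N = 16
print(mapping[(3, 5, 7)])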

Related

Get minimum Euclidean distance between a given vector and vectors in the database

I store 128D vectors in a PostgreSQL table as double precision[]:
create table tab (
  id integer,
  name character varying(200),
  vector double precision[]
);
For a given vector, I need to return one record from the database with the minimum Euclidean distance between this vector and the vector in the table entry.
I have a function that computes the Euclidean distance of two vectors according to the usual formula SQRT((v1[1]-v2[1])^2 + (v1[2]-v2[2])^2 + ... + (v1[128]-v2[128])^2):
CREATE OR REPLACE FUNCTION public.euclidian(
  arr1 double precision[],
  arr2 double precision[])
RETURNS double precision AS
$BODY$
  select sqrt(SUM(tab.v)) as euclidian from (SELECT
    UNNEST(vec_sub(arr1, arr2)) as v) as tab;
$BODY$
LANGUAGE sql IMMUTABLE STRICT;
Ancillary function:
CREATE OR REPLACE FUNCTION public.vec_sub(
  arr1 double precision[],
  arr2 double precision[])
RETURNS double precision[] AS
$BODY$
  SELECT array_agg(result)
  FROM (SELECT (tuple.val1 - tuple.val2) * (tuple.val1 - tuple.val2)
          AS result
        FROM (SELECT UNNEST($1) AS val1,
                     UNNEST($2) AS val2,
                     generate_subscripts($1, 1) AS ix) tuple
        ORDER BY ix) inn;
$BODY$
LANGUAGE sql IMMUTABLE STRICT;
Query:
select tab.id as tabid, tab.name as tabname,
       euclidian('{0.1,0.2,...,0.128}', tab.vector) as eucl
from tab
order by eucl ASC
limit 1;
Everything works fine while I have several thousand records in tab. But the DB is going to grow and I need to avoid a full scan of tab when running the query and add some kind of index search. It would be great to filter out at least 80% of the records by index; the remaining 20% can be handled by a full scan.
One of the current directions of the solution search: the PostGIS extension allows searching and sorting by distance (ST_3DDistance), filtering by distance (ST_3DWithin), etc. This works great and fast using indices. Is it possible to generalize this to N-dimensional space?
Some observations:
all coordinate values are between [-0.5...0.5] (I do not know exactly; I think [-1.0...1.0] are the theoretical limits)
the vectors are not normalized; the distance from (0,0,...,0) is in the range [1.2...1.6].
This is a post translated from the Russian Stack Exchange.
Like @SaiBot hints at with locality-sensitive hashing (LSH), there are plenty of researched techniques that allow you to run approximate nearest neighbor (ANN) searches. You have to accept a speed/accuracy tradeoff, but this is reasonable for most production scenarios since a brute-force approach to find the exact neighbors tends to be computationally prohibitive.
This article is an excellent overview of the current state-of-the-art algorithms along with their pros and cons. Below, I've linked several popular open-source implementations. All 3 have Python bindings.
Facebook FAISS
Spotify Annoy
Google ScaNN
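As an illustration only (this code is not from the original answer), here is a minimal sketch of an exact and an approximate search with the FAISS Python bindings, assuming the vectors have been exported from the table into a NumPy array:

import numpy as np
import faiss

d = 128                                          # dimensionality of the vectors
xb = np.random.rand(10000, d).astype('float32')  # stand-in for the table data
xq = np.random.rand(1, d).astype('float32')      # query vector

# Exact (brute-force) L2 search, equivalent to the full scan done in SQL.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
dist, idx = flat.search(xq, 1)                   # nearest record and its distance

# Approximate search: an inverted-file index trades accuracy for speed.
ivf = faiss.IndexIVFFlat(faiss.IndexFlatL2(d), d, 100)  # 100 clusters
ivf.train(xb)
ivf.add(xb)
ivf.nprobe = 10                                  # clusters scanned per query
dist_a, idx_a = ivf.search(xq, 1)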
With 128-dimensional data constrained to PostgreSQL you will have no choice but to apply a full scan for each query.
Even highly optimized index structures for indexing high-dimensional data like the X-Tree or the IQ-Tree will have problems with that many dimensions and usually offer no big benefit over the pure scan.
The main issue here is the curse of dimensionality, which causes index structures to degenerate above roughly 20 dimensions.
Newer work thus considers the problem of approximate nearest neighbor search, since in a lot of applications with this many dimensions it is sufficient to find a good answer, rather than the best one. Locality Sensitive Hashing is among these approaches.
Note: Even if an index structure is able to filter out 80% of the records, you will have to access the remaining 20% of the records by random access operations from disk (making the application I/O bound), which will be even slower than reading all the data in one scan and computing the distances.
You could look at computational geometry, which is a field dedicated to efficient algorithms. There is generally a tradeoff between the amount of data stored and the efficiency of an algorithm, so by storing extra precomputed data we can speed up queries. In particular, we are looking at a nearest neighbour search; the algorithm below uses a form of space partitioning.
Let's consider the 3D case. As the distances from the origin lie in a narrow range, the vectors look like they are clustered around a fuzzy sphere. Divide space into 8 sub-cubes depending on the sign of each coordinate; label these +++, ++-, etc. We can work out the minimum distance from the test vector to a vector in each cube.
Say our test vector is (0.4, 0.5, 0.6). The minimum distance from that to the +++ cube is zero. The minimum distance to the -++ cube is 0.4, as the closest vector in the -++ cube would be (-0.0001, 0.5, 0.6). Likewise, the minimum distance to +-+ is 0.5, to ++- it is 0.6, to --+ it is sqrt(0.4^2 + 0.5^2), etc.
The algorithm then becomes: first search the cube the test vector is in and find the minimum distance to all the vectors in that cube. If that distance is smaller than the minimum distance to the other cubes, then we are done. If not, search the next closest cube, until no vector in any unsearched cube could be closer.
If we were to implement this in a database, we would compute a key for each vector. In 3D this is a 3-bit integer with one bit per coordinate: 0 if the coordinate is positive, 1 if it is negative. So first select WHERE key = 000, then WHERE key = 100, etc.
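A rough sketch of the idea (the sign-key computation and the lower-bound distance to each cube; the structure and names are my own illustration, not code from the answer):

import math

def sign_key(vec):
    """3-bit key: bit i is 1 if coordinate i is negative, else 0."""
    return sum((1 << i) for i, v in enumerate(vec) if v < 0)

def min_dist_to_cube(query, key):
    """Lower bound on the distance from query to any point in the cube
    whose coordinate signs are encoded by key."""
    total = 0.0
    for i, q in enumerate(query):
        cube_negative = bool((key >> i) & 1)
        # The query coordinate lies on the wrong side of 0 for this cube,
        # so it must travel at least |q| along this axis.
        if (q < 0) != cube_negative:
            total += q * q
    return math.sqrt(total)

def nearest(query, points):
    """Search cubes in order of their lower-bound distance to the query."""
    buckets = {}
    for p in points:
        buckets.setdefault(sign_key(p), []).append(p)
    best, best_d = None, float('inf')
    for key in sorted(buckets, key=lambda k: min_dist_to_cube(query, k)):
        if min_dist_to_cube(query, key) >= best_d:
            break          # no remaining cube can contain a closer point
        for p in buckets[key]:
            d = math.dist(query, p)
            if d < best_d:
                best, best_d = p, d
    return best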
You can think of this as a type of hash function which has been specifically designed to make finding close points easy. I think these are called locality-sensitive hashes.
The high dimensionality of your data makes things much trickier. With 128 dimensions, just using the signs of the coordinates gives 2^128 = 3.4E+38 possibilities. This is far too many hash buckets, and some form of dimension reduction is needed.
You might be able to choose k points and partition space according to which of those points each vector is closest to.

BST search problem

I've implemented a BST whose n nodes each contain an object of type Point (x,y); the order in the tree is according to the X values.
I need to implement a function that takes as input a range of X values (x left, x right)
and the output is
(EDIT): the sum of the Y coordinates of the points whose X is in the range (inclusive).
It's not that hard doing it by "walking" through all the nodes; the problem is that I'm asked to do it in O(log n) complexity.
I thought about maintaining extra fields for ranges and sums of Y's, but somehow it doesn't work out with the insertion and deletion functions.
Any ideas?
Finding each end of the range is O(log n). There is no need to look at all nodes in the tree.
If you are required to sum the numbers in a range, then taking the high boundary minus the low boundary (assuming a consecutive, linear range) is a constant-time operation.
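For the sum itself, the augmented-tree idea hinted at in the question (every node stores the sum of the Y values in its subtree) can be sketched as follows. This is an illustration, not code from the thread; it omits deletion and assumes the tree is kept balanced by some other means, so the walks are O(log n):

class Node:
    def __init__(self, x, y):
        self.x, self.y = x, y
        self.left = self.right = None
        self.subtree_sum = y                  # sum of y over this node's subtree

def insert(node, x, y):
    """Insert (x, y), updating subtree sums on the way down."""
    if node is None:
        return Node(x, y)
    node.subtree_sum += y
    if x < node.x:
        node.left = insert(node.left, x, y)
    else:
        node.right = insert(node.right, x, y)
    return node

def prefix_sum(node, x, inclusive):
    """Sum of y over keys < x (or <= x when inclusive), in O(height)."""
    total = 0.0
    while node is not None:
        take = node.x <= x if inclusive else node.x < x
        if take:
            # This node and its whole left subtree qualify.
            total += node.y + (node.left.subtree_sum if node.left else 0.0)
            node = node.right
        else:
            node = node.left
    return total

def range_sum(root, x_left, x_right):
    """Sum of y for points with x_left <= x <= x_right, via two O(log n) walks."""
    return prefix_sum(root, x_right, True) - prefix_sum(root, x_left, False)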

Most efficient way to check if a point is in or on a convex quad polygon

I'm trying to figure out the most efficient/fast way to add a large number of convex quads (four given x,y points) into an array/list and then to check against those quads if a point is within or on the border of those quads.
I originally tried using ray casting but thought that it was a little overkill since I know that all my polygons will be quads and that they are also all convex.
Currently, I am splitting each quad into two triangles that share an edge and then checking whether the point is on or in each of those two triangles using their areas.
for example
Triangle ABC and test point P.
if (areaPAB + areaPAC + areaPBC == areaABC) { return true; }
This seems like it may run a little slow since I need to calculate the area of 4 different triangles to run the check and if the first triangle of the quad returns false, I have to get 4 more areas. (I include a bit of an epsilon in the check to make up for floating point errors)
I'm hoping that there is an even faster way that might involve a single check of a point against a quad rather than splitting it into two triangles.
I've attempted to reduce the number of checks by putting the polygons into an array[,]. When adding a polygon, it checks the minimum and maximum x and y values and then, using those, places the same poly into the proper array positions. When checking a point against the available polygons, it retrieves the proper list from the array of lists.
I've been searching through similar questions and I think what I'm using now may be the fastest way to figure out if a point is in a triangle, but I'm hoping that there's a better method to test against a quad that is always convex. Every polygon test I've looked up seems to be testing against a polygon that has many sides or is an irregular shape.
Thanks for taking the time to read my long-winded question to what's probably a simple problem.
I believe that the fastest methods are:
1: Find the mutual orientation of all vector pairs (DirectedEdge, CheckedPoint) through cross-product signs. If all four signs are the same, then the point is inside.
Addition: for every edge
EV[i] = V[i+1] - V[i], where V[] - vertices in order
PV[i] = P - V[i]
Cross[i] = CrossProduct(EV[i], PV[i]) = EV[i].X * PV[i].Y - EV[i].Y * PV[i].X
Cross[i] is positive if point P lies in the left half-plane relative to the i-th edge (V[i] - V[i+1]), and negative otherwise. If all the Cross[] values are positive, then point P is inside the quad and the vertices are in counter-clockwise order. If all the Cross[] values are negative, then point P is inside the quad and the vertices are in clockwise order. If the values have different signs, then the point is outside the quad.
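A small sketch of this cross-product test (my own illustration; it assumes the four vertices are given in a consistent order, either all CW or all CCW):

def point_in_convex_quad(p, quad):
    """quad: four (x, y) vertices in consistent (CW or CCW) order.
    Returns True if p is inside the quad or on its border."""
    crosses = []
    for i in range(4):
        vx, vy = quad[i]
        wx, wy = quad[(i + 1) % 4]
        # Cross product of the edge vector and the vertex-to-point vector.
        crosses.append((wx - vx) * (p[1] - vy) - (wy - vy) * (p[0] - vx))
    # Inside (or on an edge) when all cross products share a sign or are zero.
    return all(c >= 0 for c in crosses) or all(c <= 0 for c in crosses)

# Example: unit square, point on the border.
print(point_in_convex_quad((1.0, 0.5), [(0, 0), (1, 0), (1, 1), (0, 1)]))  # True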
If the quad set is the same for many point queries, then dmuir suggests precalculating a uniform line equation for every edge. The uniform line equation is a * x + b * y + c = 0, where (a, b) is the normal vector to the edge. This equation has an important property: the sign of the expression
(a * P.x + b * P.y + c) determines the half-plane where point P lies (as with the cross products).
2: Split the quad into 2 triangles and use the vector method for each: express the CheckedPoint vector in terms of basis vectors.
P = a*V1 + b*V2
The point is inside when a, b >= 0 and their sum <= 1.
Both methods require about 10-15 additions, 6-10 multiplications and 2-7 comparisons (I don't consider floating-point error compensation).
If you can afford to store, with each quad, the equation of each of its edges, then you can save a little time over MBo's answer.
For example, if you have an inward-pointing normal vector N for each edge of the quad, and a constant d (which is N.p for one of the vertices p on the edge), then a point x is in the quad if and only if N.x >= d for each edge. So that's 2 multiplications, one addition and one comparison per edge, and you'll need to perform up to 4 tests per point. This technique works for any convex polygon.
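A sketch of this precomputation (again my own illustration, not code from the answer; vertices are assumed to be in counter-clockwise order so that rotating each edge vector left gives the inward normal):

def precompute_edges(quad):
    """For a convex quad with vertices in CCW order, store (nx, ny, d) per edge,
    where (nx, ny) is the inward normal and d = N.p for a vertex p on the edge."""
    edges = []
    for i in range(4):
        px, py = quad[i]
        qx, qy = quad[(i + 1) % 4]
        nx, ny = -(qy - py), (qx - px)          # edge vector rotated left
        edges.append((nx, ny, nx * px + ny * py))
    return edges

def inside(p, edges):
    """Point is inside (or on the border) iff N.p >= d for every edge."""
    return all(nx * p[0] + ny * p[1] >= d for nx, ny, d in edges)

edges = precompute_edges([(0, 0), (2, 0), (2, 1), (0, 1)])
print(inside((1.0, 0.5), edges))   # True
print(inside((3.0, 0.5), edges))   # False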

Search optimization problem

Suppose you have a list of 2D points with an orientation assigned to them. Let the set S be defined as:
S={ (x,y,a) | (x,y) is a 2D point, a is an orientation (an angle) }.
Given an element s of S, we will indicate with s_p the point part and with s_a the angle part. I would like to know if there exists an efficient data structure such that, given a query element q, it is able to return all the elements s in S such that
(dist(q_p, s_p) < threshold_1) AND (angle_diff(q_a, s_a) < threshold_2) (1)
where dist(p1,p2), with p1,p2 2D points, is the Euclidean distance, and angle_diff(a1,a2), with a1,a2 angles, is the difference between the angles (taken to be the smallest one). The data structure should be efficient w.r.t. insertion/deletion of elements and the search defined above. The number of vectors can grow up to 10,000 and more, but take this with a grain of salt.
Now suppose we change the above requirement: instead of using condition (1), let's request all the elements of S such that, given a distance function d, d(q,s) < threshold. If I remember correctly, this last setup is called range search. I don't know if the first case can be transformed into the second.
For the distance search, I believe the accepted best method is a binary space partitioning (BSP) tree. This can be stored as a series of bits. Each two bits (for a 2D tree) or three bits (for a 3D tree) subdivides the space one more level, increasing resolution.
Using a BSP, locating a set of objects to compare distances with is pretty easy. Just find the smallest set of squares or cubes which contain the edges of your distance box.
For the angle, I don't know of anything. I suppose you could store each object in a second list or tree sorted by its angle. Then you would find every object at the proper distance using the BSP, every object at the proper angle using the angle tree, and then take the set intersection.
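A rough illustration of the distance query followed by an angle filter (using SciPy's ready-made k-d tree in place of a hand-rolled BSP; the names and structure are my own, not from the answer):

import numpy as np
from scipy.spatial import cKDTree

# points: (N, 2) array of (x, y); angles: (N,) array in radians.
rng = np.random.default_rng(0)
points = rng.uniform(-1, 1, size=(10000, 2))
angles = rng.uniform(0, 2 * np.pi, size=10000)

tree = cKDTree(points)

def query(q_point, q_angle, threshold_1, threshold_2):
    """Indices of elements within threshold_1 of q_point whose angle differs
    from q_angle by less than threshold_2 (shortest angular difference)."""
    candidates = tree.query_ball_point(q_point, threshold_1)   # spatial filter
    result = []
    for i in candidates:
        diff = abs((angles[i] - q_angle + np.pi) % (2 * np.pi) - np.pi)
        if diff < threshold_2:
            result.append(i)
    return result

print(query([0.0, 0.0], np.pi / 4, 0.1, np.pi / 8))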
You have effectively described a "three-dimensional cylindrical space", i.e. a space that is locally three-dimensional but where one dimension is topologically cyclic. In other words, it is locally flat and may be modeled as the boundary of a four-dimensional object C4 in (x, y, z, w) defined by
z^2 + w^2 = 1
where
a = arctan(w/z)
With this model, the space defined by your constraints is a 2-dimensional cylinder wrapped "lengthwise" around a cross section wedge, where the wedge wraps around the 4-d cylindrical space with an angle of 2 * threshold_2. This can be modeled using a "modified k-d tree" approach (modified 3-d tree), where the data structure is not a tree but actually a graph (it has cycles). You can still partition this space into cells with hyperplane separation, but traveling along the curve defined by (z, w) in the positive direction may encounter a point encountered in the negative direction. The tree should be modified to actually lead to these nodes from both directions, so that the edges are bidirectional (in the z-w curve direction - the others are obviously still unidirectional).
These cycles do not change the effectiveness of the data structure in locating nearby points or allowing your constraint search. In fact, for the most part, those algorithms are only slightly modified (the simplest approach being to hold a visited node data structure to prevent cycles in the search - you test the next neighbors about to be searched).
This will work especially well for your criteria, since the region you define is effectively bounded by these axis-defined hyperplane-bounded cells of a k-d tree, and so the search termination will leave a region on average populated around pi / 4 percent of the area.

Faster way to perform point-wise interpolation of a numpy array?

I have a 3D datacube, with two spatial dimensions and the third being a multi-band spectrum at each point of the 2D image.
H[x, y, bands]
Given a wavelength (or band number), I would like to extract the 2D image corresponding to that wavelength. This would be simply an array slice like H[:,:,bnd]. Similarly, given a spatial location (i,j) the spectrum at that location is H[i,j].
I would also like to 'smooth' the image spectrally, to counter low-light noise in the spectra. That is, for band bnd, I choose a window of size wind and fit an n-degree polynomial to the spectrum in that window. With polyfit and polyval I can find the fitted spectral value at that point for band bnd.
Now, if I want the whole image of bnd from the fitted value, then I have to perform this windowed-fitting at each (i,j) of the image. I also want the 2nd-derivative image of bnd, that is, the value of the 2nd-derivative of the fitted spectrum at each point.
Running over the points, I could polyfit-polyval-polyder each of the x*y spectra. While this works, it is a point-wise operation. Is there some Pythonic, NumPy-style way to do this faster?
If you do least-squares polynomial fitting to points (x+dx[i],y[i]) for a fixed set of dx and then evaluate the resulting polynomial at x, the result is a (fixed) linear combination of the y[i]. The same is true for the derivatives of the polynomial. So you just need a linear combination of the slices. Look up "Savitzky-Golay filters".
EDITED to add a brief example of how S-G filters work. I haven't checked any of the details and you should therefore not rely on it to be correct.
So, suppose you take a filter of width 5 and degree 2. That is, for each band (ignoring, for the moment, ones at the start and end) we'll take that one and the two on either side, fit a quadratic curve, and look at its value in the middle.
So, if f(x) ~= ax^2+bx+c and f(-2),f(-1),f(0),f(1),f(2) = p,q,r,s,t then we want 4a-2b+c ~= p, a-b+c ~= q, etc. Least-squares fitting means minimizing (4a-2b+c-p)^2 + (a-b+c-q)^2 + (c-r)^2 + (a+b+c-s)^2 + (4a+2b+c-t)^2, which means (taking partial derivatives w.r.t. a,b,c):
4(4a-2b+c-p)+(a-b+c-q)+(a+b+c-s)+4(4a+2b+c-t)=0
-2(4a-2b+c-p)-(a-b+c-q)+(a+b+c-s)+2(4a+2b+c-t)=0
(4a-2b+c-p)+(a-b+c-q)+(c-r)+(a+b+c-s)+(4a+2b+c-t)=0
or, simplifying,
34a+10c = 4p+q+s+4t
10b = -2p-q+s+2t
10a+5c = p+q+r+s+t
so a, b, c = (2p-q-2r-s+2t)/14, (2(t-p)+(s-q))/10, (-3p+12q+17r+12s-3t)/35.
And of course c is the value of the fitted polynomial at 0, and therefore is the smoothed value we want. So for each spatial position, we have a vector of input spectral data, from which we compute the smoothed spectral data by multiplying by a matrix whose rows (apart from the first and last couple) look like [0 ... 0 -3/35 12/35 17/35 12/35 -3/35 0 ... 0], with the central 17/35 on the main diagonal of the matrix.
So you could do a matrix multiplication for each spatial position; but since it's the same matrix everywhere you can do it with a single call to tensordot. So if S contains the matrix I just described (er, wait, no, the transpose of the matrix I just described) and A is your 3-dimensional data cube, your spectrally-smoothed data cube would be numpy.tensordot(A, S, axes=1).
This would be a good point at which to repeat my warning: I haven't checked any of the details in the few paragraphs above, which are just meant to give an indication of how it all works and why you can do the whole thing in a single linear-algebra operation.
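For completeness, here is a sketch of how one might do this in practice with SciPy's ready-made Savitzky-Golay filter (my addition, not part of the answer above); scipy.signal.savgol_filter applies exactly this kind of fixed linear combination along a single axis, and can also return derivatives of the fitted polynomial:

import numpy as np
from scipy.signal import savgol_filter

# H[x, y, band]: a stand-in data cube.
H = np.random.rand(64, 64, 200)

# Smoothed cube: window of 5 bands, quadratic fit, applied along the band axis.
H_smooth = savgol_filter(H, window_length=5, polyorder=2, axis=-1)

# Second derivative of the fitted polynomial along the spectral axis
# (delta is the band spacing, assumed here to be 1).
H_d2 = savgol_filter(H, window_length=5, polyorder=2, deriv=2, delta=1.0, axis=-1)

bnd = 100
smoothed_image = H_smooth[:, :, bnd]             # 2D image at band bnd
second_derivative_image = H_d2[:, :, bnd]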