Which data structure should I use to store a 10,000-item data set for the fastest search with the least memory? - binary-search-tree

*Suppose I have 10,000 circle (x, y, r) values, and I want to find which circle a point (p1, p2) lies in. To get the fastest response for this query, what data structure should I use to store those 10,000 circles?
The data is static: it is constructed once,
but the most frequent operation will be the search query. It will not be a range-based search or a nearest-neighbor search.
How about a B-tree, B+ tree, R-tree, quadtree, linear interpolation search, or some kind of bitmap? The solution should take the least memory; a little extra time as a trade-off is okay.*

You have at least three choices:
Use a bounding volume hierarchy (BVH); in this case I think a 2D sphere tree (circle tree) is the way to go. You construct a tree in which each node is a circle and may contain child circles; your input circles end up at the leaves. Point search is then O(log N) in time, and the structure is O(N log N) in space. However, a BVH can generally be hard to construct.
Use tree-based N-ary space partitioning (binary space partitioning, quadtree, 2D k-d tree). This time you partition space, and each region may contain some circles. These structures have the same complexity as a BVH but will most likely be less efficient - I suspect a less tight fit around the circles. On the other hand, some space partitionings (e.g. a quadtree) are easier to construct.
Use spatial hashing. Here space is partitioned into cells (buckets) of a fixed size, and these cells are hashed. You can think of it as a uniform grid of pixels (but stored compactly), where each pixel holds a list of the circles it contains. In theory this gives O(1) search. The space usage is harder to predict: it is O(N) in theory but may exceed a BVH due to the constant factor, and it most likely depends on the distribution of the circles' areas and positions (e.g. the number of cells that contain circles).
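To make the spatial-hashing option concrete, here is a minimal Python sketch (all names are mine, not from any library): circles are bucketed into fixed-size grid cells by their bounding boxes, and a point query only tests the circles registered in the query point's cell.

```python
# Minimal spatial-hashing sketch for point-in-circle queries.
# Hypothetical helper names; not from any particular library.
from collections import defaultdict

def build_grid(circles, cell):
    """circles: list of (x, y, r). Returns dict: (ix, iy) -> circle indices."""
    grid = defaultdict(list)
    for idx, (x, y, r) in enumerate(circles):
        # Register the circle in every cell its bounding box touches.
        for ix in range(int((x - r) // cell), int((x + r) // cell) + 1):
            for iy in range(int((y - r) // cell), int((y + r) // cell) + 1):
                grid[(ix, iy)].append(idx)
    return grid

def find_containing(circles, grid, cell, px, py):
    """Return indices of circles that contain the point (px, py)."""
    key = (int(px // cell), int(py // cell))
    return [i for i in grid.get(key, ())
            if (px - circles[i][0]) ** 2 + (py - circles[i][1]) ** 2
               <= circles[i][2] ** 2]
```

With a cell size on the order of the typical circle diameter, each query touches only a handful of candidates; the memory/speed trade-off is controlled entirely by the cell size.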

Related

Get minimum Euclidean distance between a given vector and vectors in the database

I store 128-dimensional vectors in a PostgreSQL table as double precision[]:
create table tab (
   id integer,
   name character varying(200),
   vector double precision[]
 )
For a given vector, I need to return the one record from the database with the minimum Euclidean distance between that vector and the vector in the table entry.
I have a function that computes the Euclidean distance of two vectors according to the known formula sqrt((v1[1]-v2[1])^2 + (v1[2]-v2[2])^2 + ... + (v1[128]-v2[128])^2):
CREATE OR REPLACE FUNCTION public.euclidian(
  arr1 double precision[],
  arr2 double precision[])
  RETURNS double precision AS
$BODY$
  select sqrt(SUM(tab.v)) as euclidian from (SELECT
     UNNEST(vec_sub(arr1, arr2)) as v) as tab;
$BODY$
LANGUAGE sql IMMUTABLE STRICT
Ancillary function:
CREATE OR REPLACE FUNCTION public.vec_sub(
  arr1 double precision[],
  arr2 double precision[])
RETURNS double precision[] AS
$BODY$
  SELECT array_agg(result)
    FROM (SELECT (tuple.val1 - tuple.val2) * (tuple.val1 - tuple.val2)
        AS result
        FROM (SELECT UNNEST($1) AS val1
               , UNNEST($2) AS val2
               , generate_subscripts($1, 1) AS ix) tuple
    ORDER BY ix) inn;
$BODY$
LANGUAGE sql IMMUTABLE STRICT
Query:
select tab.id as tabid, tab.name as tabname,
        euclidian('{0.1,0.2,...,0.128}', tab.vector) as eucl from tab
order by eucl ASC
limit 1
Everything works fine while I have only several thousand records in tab. But the DB is going to grow, and I need to avoid a full scan of tab when running the query by adding some kind of index search. It would be great to filter out at least 80% of the records by index; the remaining 20% can be handled by a full scan.
One current direction of the search for a solution: the PostGIS extension allows searching and sorting by distance (ST_3DDistance), filtering by distance (ST_3DWithin), etc. This works great and fast using indices. Is it possible to generalize this to N-dimensional space?
Some observations:
all coordinate values are in [-0.5...0.5] (I do not know exactly; I think [-1.0...1.0] are the theoretical limits)
the vectors are not normalized; the distance from (0,0,...,0) is in the range [1.2...1.6].
This is a translated post from the Russian StackExchange.
Like @SaiBot hints at with locality-sensitive hashing (LSH), there are plenty of researched techniques that allow you to run approximate nearest neighbor (ANN) searches. You have to accept a speed/accuracy trade-off, but this is reasonable for most production scenarios, since a brute-force approach to finding the exact neighbors tends to be computationally prohibitive.
This article is an excellent overview of current state-of-the-art algorithms along with their pros and cons. Below, I've linked several popular open-source implementations. All three have Python bindings:
Facebook FAISS
Spotify Annoy
Google ScaNN
With 128-dimensional data constrained to PostgreSQL, you will have no choice but to apply a full scan for each query.
Even highly optimized index structures for high-dimensional data, like the X-tree or the IQ-tree, have problems with that many dimensions and usually offer no big benefit over a plain scan.
The main issue here is the curse of dimensionality, which makes index structures degenerate above roughly 20 dimensions.
Newer work thus considers the problem of approximate nearest neighbor search, since in many applications with this many dimensions it is sufficient to find a good answer rather than the best one. Locality-sensitive hashing is among these approaches.
Note: even if an index structure is able to filter out 80% of the records, you will have to access the remaining 20% by random-access operations from disk (making the application I/O-bound), which can be even slower than reading all the data in one sequential scan and computing the distances.
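As an illustration of the locality-sensitive-hashing idea mentioned above, here is a toy Python sketch (invented names, not a production ANN index): each vector is hashed by the signs of its dot products with a few random hyperplanes, and a query only scans the vectors that land in the same bucket.

```python
# Toy random-hyperplane LSH for approximate nearest-neighbour search.
# Illustrative only; all names and parameters are made up.
import math
import random
from collections import defaultdict

def make_planes(dim, n_planes, seed=0):
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

def lsh_key(vec, planes):
    # One bit per hyperplane: which side of the plane the vector lies on.
    return tuple(sum(p * v for p, v in zip(plane, vec)) >= 0
                 for plane in planes)

def build_index(vectors, planes):
    index = defaultdict(list)
    for i, v in enumerate(vectors):
        index[lsh_key(v, planes)].append(i)
    return index

def query(vectors, index, planes, q):
    # Scan only the bucket the query falls into, then rank by exact distance.
    bucket = index.get(lsh_key(q, planes), [])
    return min(bucket, key=lambda i: math.dist(q, vectors[i]), default=None)
```

Real implementations use many hash tables and multi-probe strategies to control the miss rate; this sketch only shows why the search becomes sublinear: the exact distance is computed for one bucket instead of the whole table.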
You could look at computational geometry, a field dedicated to efficient algorithms. There is generally a trade-off between the amount of data stored and the efficiency of an algorithm, so by storing extra data we can reduce search time. In particular, this is a nearest-neighbour search, and the algorithm below uses a form of space partitioning.
Let's consider the 3D case. As the distances from the origin lie in a narrow range, the vectors look like they are clustered around a fuzzy sphere. Divide space into 8 sub-cubes (octants) depending on the sign of each coordinate, and label these +++, ++-, etc. We can work out the minimum distance from the test vector to a vector in each cube.
Say our test vector is (0.4, 0.5, 0.6). The minimum distance from it to the +++ cube is zero. The minimum distance to the -++ cube is 0.4, as the closest vector in the -++ cube would be (-0.0001, 0.5, 0.6). Likewise, the minimum distance to +-+ is 0.5, to ++- it is 0.6, to --+ it is sqrt(0.4^2 + 0.5^2), etc.
The algorithm then becomes: first search the cube the test vector is in and find the minimum distance to all the vectors in that cube. If that distance is smaller than the minimum distance to every other cube, we are done. If not, search the next-closest cube, and continue until no vector in any remaining cube could be closer.
If we were to implement this in a database, we would compute a key for each vector. In 3D this is a 3-bit integer, with each bit 0 or 1 depending on the sign of the corresponding coordinate: + gives 0, - gives 1. So first select WHERE key = 000, then WHERE key = 100, etc.
You can think of this as a type of hash function that has been specifically designed to make finding close points easy. I think these are called locality-sensitive hashes.
The high dimensionality of your data makes things much trickier. With 128 dimensions, just using the signs of the coordinates gives 2^128 = 3.4E+38 possibilities. That is far too many hash buckets, so some form of dimensionality reduction is needed.
You might be able to choose k points and partition space according to which of them each vector is closest to.
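The 3D octant scheme described above can be sketched as follows (hypothetical helper names; `min_dist_to_cube` reproduces the minimum-distance reasoning from the (0.4, 0.5, 0.6) example):

```python
# Sign-key space partitioning in 3D: bucket vectors by coordinate signs,
# then search octants in order of their minimum possible distance.
import math
from collections import defaultdict
from itertools import product

def sign_key(v):
    return tuple(c >= 0 for c in v)

def min_dist_to_cube(q, key):
    # Minimum distance from q to the octant with the given sign pattern:
    # only coordinates whose sign disagrees with the octant contribute.
    return math.sqrt(sum(c * c for c, positive in zip(q, key)
                         if (c >= 0) != positive))

def nearest(vectors, q):
    buckets = defaultdict(list)
    for i, v in enumerate(vectors):
        buckets[sign_key(v)].append(i)
    best_i, best_d = None, math.inf
    # Visit octants from most to least promising, pruning with best_d.
    for key in sorted(product((True, False), repeat=3),
                      key=lambda k: min_dist_to_cube(q, k)):
        if min_dist_to_cube(q, key) >= best_d:
            break  # no remaining octant can contain a closer vector
        for i in buckets.get(key, ()):
            d = math.dist(q, vectors[i])
            if d < best_d:
                best_i, best_d = i, d
    return best_i
```

In a database each bucket becomes a WHERE key = ... select, issued in the same best-first order, with the same early-termination test.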

Solidworks Feature Recognition on a fill pattern/linear pattern

I am currently creating a feature and patterning it across a flat plane to fit the maximum number of features on the plane. I do this frequently enough to warrant building some sort of macro for it, if possible. The issue I run into is that I still have to set the spacing between the parts manually. I want to be able to create a feature and have it determine the "best-fit" spacing for a given area while avoiding overlaps. I have had very little luck finding any resources describing this. Any information or links to potentially helpful resources would be much appreciated!
Thank you.
Before you start the linear pattern bit:
Select the face2 of that feature2 and get its outermost loop2 of edges. You can test for that using loop2.IsOuter.
Now:
if the loop has one edge, it is a circle, and the spacing must be greater than the circle's radius
if the loop has more than one edge, you need to calculate all the distances between the vertices and assume that the largest distance is the safest spacing.
NOTE: if one of the edges is a spline, you need a different strategy:
you would need to convert the face into a sketch and find the coordinates of that spline to calculate the largest distances.
Example: the distance between the edges is lower than the distance between the summits of the splines. If the linear pattern has a vertical direction, then the spacing has to be greater than the distance between the summits.
When I say distance, I mean the distance projected onto the linear pattern direction.
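That spacing rule can be sketched numerically (plain Python with made-up names, not SolidWorks API code): given the outer-loop vertices and the pattern direction, the safe spacing is the largest pairwise vertex distance projected onto that direction.

```python
# Safe pattern spacing = largest pairwise vertex distance projected onto
# the pattern direction. Hypothetical helper, not SolidWorks API code.
import math
from itertools import combinations

def safe_spacing(vertices, direction):
    """vertices: list of (x, y) points; direction: (dx, dy) pattern direction."""
    dx, dy = direction
    norm = math.hypot(dx, dy)
    ux, uy = dx / norm, dy / norm  # unit vector along the pattern direction
    return max(abs((ax - bx) * ux + (ay - by) * uy)
               for (ax, ay), (bx, by) in combinations(vertices, 2))
```

For a spline edge you would first sample points along it (e.g. from the converted sketch) and feed those samples in as extra vertices.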

Row / column vs linear indexing speed (spatial locality)

Related question:
This one
I am using a spatial grid which can potentially get big (10^6 nodes) or even bigger. I will regularly have to perform displacement operations (like moving a particle from one node to another). I'm not an expert in computer science, but I am beginning to understand the concepts of cache lines and spatial locality, though not well yet. So, I was wondering whether it is preferable to use a 2D array (and if so, which one? I'd prefer to avoid Boost for now, but maybe I will link it later), indexing the displacement for example like this:
Array[i][j] -> Array[i-1][j+2]
or, with a 1D array, if NX is the "equivalent" number of columns:
Array[i*NX+j] -> Array[(i-1)*NX+j+2]
knowing that it will be done nearly one million times per iteration, with nearly one million iterations as well.
With modern compilers and optimization enabled, both of these will probably generate the exact same code:
Array[i-1][j+2] // Where Array is 2-dimensional
and
Array[(i-1)*NX+j+2] // Where Array is 1-dimensional
assuming NX is the dimension of the second subscript in the 2-dimensional Array (the number of columns).
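The equivalence can be illustrated directly (a quick Python sketch of row-major layout, although the question is about C-style arrays): element [i][j] of an NY-by-NX grid is element i*NX + j of the flattened array, so the two indexing schemes address the same location.

```python
# Row-major layout: grid2d[i][j] corresponds to flat[i * NX + j].
NY, NX = 3, 4
grid2d = [[i * NX + j for j in range(NX)] for i in range(NY)]  # 2D view
flat = [v for row in grid2d for v in row]                      # flattened 1D view

i, j = 2, 1
# The displacement from the question, expressed both ways:
assert grid2d[i - 1][j + 2] == flat[(i - 1) * NX + j + 2]
```

Since both forms touch the same memory in the same order, spatial locality is identical; the choice is about convenience, not cache behaviour (a true C 2D array is itself one contiguous block).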

Search optimization problem

Suppose you have a list of 2D points with an orientation assigned to them. Let the set S be defined as:
S={ (x,y,a) | (x,y) is a 2D point, a is an orientation (an angle) }.
Given an element s of S, we will denote by s_p the point part and by s_a the angle part. I would like to know if there exists an efficient data structure such that, given a query element q, it is able to return all elements s in S such that
(dist(q_p, s_p) < threshold_1) AND (angle_diff(q_a, s_a) < threshold_2) (1)
where dist(p1,p2), with p1, p2 2D points, is the Euclidean distance, and angle_diff(a1,a2), with a1, a2 angles, is the difference between the angles (taken to be the smallest one). The data structure should be efficient w.r.t. insertion/deletion of elements as well as the search defined above. The number of vectors can grow up to 10,000 and more, but take this with a grain of salt.
Now suppose we change the above requirement: instead of using condition (1), given a distance function d, we want all elements of S such that d(q,s) < threshold. If I remember correctly, this last setup is called range search. I don't know whether the first case can be transformed into the second.
For the distance search I believe the accepted best method is a binary space partitioning (BSP) tree. This can be stored as a series of bits: each two bits (for a 2D tree) or three bits (for a 3D tree) subdivides the space one more level, increasing resolution.
Using a BSP, locating a set of objects to compare distances with is pretty easy: just find the smallest set of squares or cubes that contain the edges of your distance box.
For the angle, I don't know of anything ready-made. I suppose you could store each object in a second list or tree sorted by its angle. Then you would find every object at the proper distance using the BSP, every object at the proper angle using the angle tree, and then take the set intersection.
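A simplified sketch of that two-index idea (with a coarse uniform grid standing in for the BSP tree; all names are made up): candidates inside the distance box come from the grid, candidates inside the angle window come from an angle-sorted list, and the answer is their intersection, verified with the exact predicate.

```python
# Two-index search: spatial grid for the distance box, angle list for the
# angle window, set intersection for the combined condition (1).
import math
from collections import defaultdict

def build(items, cell):
    """items: list of (x, y, a) with a in radians."""
    grid = defaultdict(set)
    for i, (x, y, a) in enumerate(items):
        grid[(int(x // cell), int(y // cell))].add(i)
    by_angle = sorted(range(len(items)), key=lambda i: items[i][2])
    return grid, by_angle

def angle_diff(a, b):
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)  # smallest of the two arc differences

def search(items, grid, by_angle, cell, q, t_dist, t_ang):
    qx, qy, qa = q
    # Candidates from every grid cell overlapping the distance box.
    cand = set()
    for ix in range(int((qx - t_dist) // cell), int((qx + t_dist) // cell) + 1):
        for iy in range(int((qy - t_dist) // cell), int((qy + t_dist) // cell) + 1):
            cand |= grid.get((ix, iy), set())
    # Angle window; scanned linearly here for clarity. A sorted list plus
    # binary search works too, with care for the wrap-around at 2*pi.
    in_angle = {i for i in by_angle if angle_diff(items[i][2], qa) < t_ang}
    # Intersect, then apply the exact predicate.
    return sorted(i for i in cand & in_angle
                  if math.hypot(items[i][0] - qx, items[i][1] - qy) < t_dist)
```

Both indexes support cheap insertion and deletion, which matches the dynamic requirement in the question.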
You have effectively described a "three-dimensional cylindrical space", i.e. a space that is locally three-dimensional but where one dimension is topologically cyclic. In other words, it is locally flat and may be modeled as the boundary of a four-dimensional object C4 in (x, y, z, w) defined by
z^2 + w^2 = 1
where
a = arctan(w/z)
With this model, the space defined by your constraints is a 2-dimensional cylinder wrapped "lengthwise" around a cross-section wedge, where the wedge wraps around the 4D cylindrical space with an angle of 2 * threshold_2. This can be modeled using a "modified k-d tree" approach (a modified 3-d tree), where the data structure is not a tree but actually a graph (it has cycles). You can still partition this space into cells separated by hyperplanes, but traveling along the curve defined by (z, w) in the positive direction may reach a point also reachable in the negative direction. The tree should be modified to lead to these nodes from both directions, so that those edges are bidirectional (in the z-w curve direction - the others obviously remain unidirectional).
These cycles do not change the effectiveness of the data structure in locating nearby points or supporting your constraint search. In fact, for the most part, the algorithms are only slightly modified (the simplest approach being to keep a visited-node data structure to prevent cycles in the search - you test the neighbors about to be searched next).
This will work especially well for your criteria, since the region you define is effectively bounded by the axis-aligned, hyperplane-bounded cells of a k-d tree, and so the search will terminate leaving a region populated, on average, around pi/4 of the area.
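A lightweight way to exploit the same observation without a custom graph structure (a sketch under my own naming): map the angle a to the unit-circle point (cos a, sin a), which turns the cyclic dimension into two flat ones so that any standard 4-d spatial index applies; an angular difference below t corresponds to a chord length below 2*sin(t/2).

```python
# Embed (x, y, a) as (x, y, cos a, sin a): the cyclic angle dimension
# becomes two ordinary Euclidean dimensions, and wrap-around is handled
# automatically. Hypothetical helper names.
import math

def embed(x, y, a):
    return (x, y, math.cos(a), math.sin(a))

def matches(q, s, t_dist, t_ang):
    """q, s: embedded 4-d points. Exact test of condition (1) in embedded space."""
    dx, dy = q[0] - s[0], q[1] - s[1]
    chord = math.hypot(q[2] - s[2], q[3] - s[3])
    # chord = 2*sin(angle_diff/2), monotone in angle_diff on [0, pi],
    # so angle_diff < t_ang is equivalent to chord < 2*sin(t_ang/2).
    return math.hypot(dx, dy) < t_dist and chord < 2 * math.sin(t_ang / 2)
```

The `matches` predicate is what you would evaluate on the candidates returned by a 4-d k-d tree or R-tree range query over the embedded points.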

How to depict multidimensional vectors on a two-dimensional plot?

I have a set of vectors in a multidimensional space (possibly several thousand dimensions). In this space, I can calculate the distance between two vectors (as the cosine of the angle between them, if it matters). What I want is to visualize these vectors while preserving the distances. That is, if vector a is closer to vector b than to vector c in the multidimensional space, it must also be closer to it on the 2-dimensional plot. Is there any kind of diagram that can clearly depict this?
I don't think so. Imagine any two-dimensional picture of a tetrahedron: there is no way to depict the four vertices in two dimensions with equal distances from each other. So you will have a hard time trying to depict more than three n-dimensional vectors in 2 dimensions while conserving their mutual distances.
(But right now I can't think of a rigorous proof.)
Update:
Ok, a second idea - maybe it's dumb: if you try to find clusters of closely associated objects/texts, you can calculate the center or mean vector of each cluster and thereby reduce the problem space. First find a 2D arrangement of the clusters that preserves their relative distances. Then insert the primary vectors, accounting only for their relative distances within a cluster and their distance to the centers of the two or three closest clusters.
This approach will be OK for a large number of vectors, but it will not be accurate in that there will always be somewhat similar vectors ending up in distant places.