Manhattan distance with n dimensions in Oracle SQL

I have a table with around 5 million rows, and each row has 10 columns representing 10 dimensions.
I would like to be able, when a new input comes in, to search the table and return the rows closest to it by Manhattan distance.
The distance is the sum of abs(Ai-Aj) + abs(Bi-Bj) + ...
The problem is that at the moment any query does a full scan of the entire table to calculate the distance to every row, and then sorts to find the top X.
Is there a way to speed up the process and make the query more efficient?
I looked at the SDO_GEOMETRY distance function online, but I couldn't find support for more than 4 dimensions.
Thank you

If you are inserting a point A and you want to look for points that are within a neighbourhood of radius r (i.e., points less than r away, under any of the usual metrics), you can do a really simple query:
select x1, x2, ..., xn
from points
where x1 between a1 - r and a1 + r
and x2 between a2 - r and a2 + r
...
and xn between an - r and an + r
...where A = (a1, a2, ..., an), to find a bound. If you have an index over all the x1, ..., xn fields of points, then this query shouldn't require a full scan. Now, this result may include points that are outside the neighbourhood (i.e., the bits in the corners), but it is an easy win to find an appropriate subset: you can now check against the records returned by this subquery, rather than checking against every point in your table.
You may be able to refine this query further because, with the Manhattan metric, a neighbourhood will be square shaped (although at 45 degrees to the above) and squares are relatively easy to work with! (Even in 10 dimensions.) However, the more complicated logic required may be more of an overhead than an optimisation, ultimately.
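Putting the two stages together, a minimal sketch of what the final query could look like (FETCH FIRST assumes Oracle 12c or later; :a1 ... :a10 and :r are bind variables for the new point and the search radius):
-- Bounding-box pruning first; the exact Manhattan distance is computed only for the survivors.
-- (:a1 ... :a10, :r and the x1 ... x10 column names follow the pseudo-query above.)
SELECT x1, x2, /* ..., */ x10,
       ABS(x1 - :a1) + ABS(x2 - :a2) /* + ... */ + ABS(x10 - :a10) AS manhattan_dist
FROM points
WHERE x1 BETWEEN :a1 - :r AND :a1 + :r
  AND x2 BETWEEN :a2 - :r AND :a2 + :r
  -- ... one BETWEEN predicate per dimension ...
  AND x10 BETWEEN :a10 - :r AND :a10 + :r
ORDER BY manhattan_dist
FETCH FIRST 10 ROWS ONLY;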

I suggest using a function-based index. You need this distance calculated, so pre-calculate it with a function-based index.
You may want to read the following question and its links. A function-based index creates a hidden column for you. This hidden column will hold the Manhattan distance, so sorting will be easier.
Thanks to Xophmeister's comment: a function-based index will not help you for an arbitrary point. I do not know any SQL function that helps here. But if you are willing to use a machine-learning / data-mining algorithm:
I suggest clustering your 5 million rows using k-means. Let's say you find 1000 cluster centers. Put these cluster centers into another table.
By definition of clustering, your points will be assigned to cluster centers. Because of this you know which points are nearest to each cluster center, e.g.
cluster (1) contains 20,000 points, ..., cluster (987) contains 10,000 points, ...
Your arbitrary point will be near to one cluster. Say you find that your point is nearest to cluster 987: run your SQL using only the points that belong to that cluster center, i.e. those 10,000 points.
You need to add several tables/columns to your schema to make this effective. If your 5,000,000 rows change continuously, you need to re-run the k-means clustering as they change. But if the values are fairly constant, one clustering per week or per month will be enough.
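As a rough sketch of how the lookup could then look (the table and column names - cluster_centers, points, cluster_id, x1 ... x10 - are assumptions, and FETCH FIRST again assumes Oracle 12c+):
-- 1. Find the nearest cluster center for the incoming point (:a1 ... :a10 are bind variables).
-- 2. Compute exact Manhattan distances only for the rows assigned to that center.
-- (All object names here are assumed; points.cluster_id is filled in by the k-means step.)
WITH nearest_center AS (
  SELECT c.cluster_id
  FROM cluster_centers c
  ORDER BY ABS(c.x1 - :a1) + ABS(c.x2 - :a2) /* + ... */ + ABS(c.x10 - :a10)
  FETCH FIRST 1 ROW ONLY
)
SELECT p.*
FROM points p
JOIN nearest_center nc ON nc.cluster_id = p.cluster_id
ORDER BY ABS(p.x1 - :a1) + ABS(p.x2 - :a2) /* + ... */ + ABS(p.x10 - :a10)
FETCH FIRST 10 ROWS ONLY;
Note that a point near a cluster boundary may have true neighbours in an adjacent cluster, so in practice you would probably search the few nearest centers rather than only one.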

Related

BigQuery SQL / GIS: Extend Radius Until Count Is Greater Than / Equal To 'N'

I have a very special kind of query to write. In PostGIS / BigQuery, I have a point. I can buffer this point by increments and perform an aggregation query, such as count(distinct()), on the unique records that fall within the buffer. Once the count reaches a certain threshold, I would like to return the input value of the buffer, i.e. its radius or diameter. This problem can be phrased as "how far do I have to keep going out until I hit 'n' [ids]?".
Finely incrementing the value of the buffer or radius will be insufferably slow and expensive. Can anyone think of a nice way to short-circuit this and offer a solution that provides a good answer quickly (in BQ or PSQL terms)?
Available GIS functions:
st_buffer()
st_dwithin()
Thank you!
You would have to order by distance and keep the N closest points. The <-> operator will use the spatial index.
SELECT *
FROM pointLayer p
ORDER BY p.geometry <-> st_makePoint(...)
LIMIT 10; --N
You don't need to increment the radius finely - I would rather double it, or maybe even increase it 10x, and once you have enough distinct records, take the N nearest ones.
I've used BigQuery scripting to solve a similar problem (with N=1, but it is easy to modify for any N: just use LIMIT N instead of LIMIT 1 and adjust the stopping condition):
https://medium.com/#mentin/nearest-neighbor-using-bq-scripting-373241f5b2f5
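For reference, a minimal sketch of the doubling loop in BigQuery scripting (the dataset, table, column names and query point are assumptions; N = 50 here):
-- Keep doubling the radius until at least n distinct ids fall within it.
-- (`my_dataset.points`, id, geom and the starting point/radius are all assumed.)
DECLARE r FLOAT64 DEFAULT 100;   -- starting radius in metres
DECLARE hits INT64 DEFAULT 0;
DECLARE n INT64 DEFAULT 50;      -- target count
WHILE hits < n DO
  SET hits = (SELECT COUNT(DISTINCT id)
              FROM `my_dataset.points`
              WHERE ST_DWITHIN(geom, ST_GEOGPOINT(-87.65, 41.92), r));
  IF hits < n THEN
    SET r = r * 2;               -- in practice, also cap r so sparse data cannot loop forever
  END IF;
END WHILE;
-- r is now an upper bound on the answer; report it, or take the n nearest points within it.
SELECT r AS radius_metres;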

Index by geolocation in database

I'm trying to find a way to let my database support fast location-based searches (for example, all items that lie within a certain distance from some geopoint (LAT, LON)). I guess the brute-force solution, which calculates the distance between every point in the database and the query point, probably won't work for a large dataset, so some kind of indexing seems necessary. I'm not sure if there are any existing standard ways to do this (I know they are out there, but Google failed me), but here is a method (or more like a hack?) that I think might work:
Calculate a value from (LAT, LON) and store it in an indexed column. For example, something like floor(LAT / 10) * 10 * 100 + floor(LON / 10) * 10. Each time a query arrives, we first calculate this value for the query point and find all the corresponding rows, and then calculate the Euclidean distances between those points and the query point.
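As a sketch of that idea in SQL (Oracle-flavoured; the table name, column names and grid size are assumptions):
-- Bucket each point into a coarse grid cell and index the cell key.
-- (places, lat, lon and the 10-degree cell size are assumed for illustration.)
ALTER TABLE places ADD (
  grid_key NUMBER GENERATED ALWAYS AS
    (FLOOR(lat / 10) * 10 * 100 + FLOOR(lon / 10) * 10) VIRTUAL
);
CREATE INDEX places_grid_idx ON places (grid_key);
-- Candidates come from the query point's cell; the exact distance is checked afterwards.
-- (Near a cell border you would also need to check the neighbouring cells.)
SELECT *
FROM places
WHERE grid_key = FLOOR(:qlat / 10) * 10 * 100 + FLOOR(:qlon / 10) * 10
  AND SQRT(POWER(lat - :qlat, 2) + POWER(lon - :qlon, 2)) <= :max_dist;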

Grouping of Million Data Points slow

I have a simple table containing 2 float columns representing X and Y coordinates, with a non-clustered index on each of the 2 columns. The table holds about 5 million data points, which I want to group into a custom grid using SQL like this:
SELECT COUNT(X) Count, AVG(X) CenterX, AVG(Y) CenterY
FROM DataPoints
GROUP BY FLOOR(X / 5), FLOOR(Y / 5)
In a test case I split a data set of 815,000 points into a grid where each point gets its own grid cell. It took SQL Server 2012 26,000 milliseconds to return the results, which is definitely too long. I made a C# implementation of the same grouping using LINQ on a simple point array, and there it only took 3,450 ms! I also created a stored procedure for the SQL for some speed-up, but it still takes 26-30 seconds to calculate the grid cells.
I can't understand why it takes SQL Server that long to calculate those groups. I know it might take a while to compute the grid cell index for all 815,000 points, but 7 times longer than a simple C# program can't be a realistic result.
I also tried using spatial types to calculate the grid, but those solutions are even slower. Using a geometry column and a spatial index (GEOMETRY_AUTO_GRID), the built-in sp_help_spatial_geometry_histogram needs 2:40 min to calculate 4 grid cells containing the data.
Does anybody have an idea how to speed up such a simple SQL query? In the future this data will be sent to a map in the browser and there will be a lot of requests, so <100 ms would be the ultimate goal.
What does the execution plan tell you? Why is this slow?
I suggest you put a single non-clustered index on X and Y together (not separate indexes), as in the sketch below - is the result better?
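A minimal T-SQL sketch of that suggestion, plus an optional persisted grid-cell key so the GROUP BY no longer has to compute FLOOR per row (the index and column names are assumptions):
-- Composite index on both coordinates instead of two separate single-column indexes.
-- (Index and computed-column names here are assumed.)
CREATE NONCLUSTERED INDEX IX_DataPoints_XY ON DataPoints (X, Y);

-- Optional: persist the grid-cell key and index it.
ALTER TABLE DataPoints
    ADD CellX AS CAST(FLOOR(X / 5) AS int) PERSISTED,
        CellY AS CAST(FLOOR(Y / 5) AS int) PERSISTED;
CREATE NONCLUSTERED INDEX IX_DataPoints_Cell ON DataPoints (CellX, CellY) INCLUDE (X, Y);

SELECT COUNT(X) AS [Count], AVG(X) AS CenterX, AVG(Y) AS CenterY
FROM DataPoints
GROUP BY CellX, CellY;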

Oracle geocoding query is really slow, how could I optimize a dynamic field?

I have an Oracle table with 12K records (gyms), and the query below takes approximately ~0.3 s:
SELECT (acos(sin(41.922682*0.017453293) *
sin(to_number(LATITUDE)*0.017453293) + cos(41.922682*0.017453293) *
cos(to_number(LATITUDE)*0.017453293) * cos(to_number(LONGITUDE)*0.017453293 -
(-87.65432*0.017453293)))*3959) as distance
FROM gym
However, I would like to return all of the records where distance <= 10, and as soon as I run the following query, my query execution time jumps up to ~5.0s:
SELECT * from (SELECT (acos(sin(41.922682*0.017453293) *
sin(to_number(LATITUDE)*0.017453293) + cos(41.922682*0.017453293) *
cos(to_number(LATITUDE)*0.017453293) * cos(to_number(LONGITUDE)*0.017453293 -
(-87.65432*0.017453293)))*3959)
as distance FROM gym)
WHERE distance <= 10
ORDER BY distance asc
Any idea how I can optimize this in Oracle?
Most important:
Use a WHERE clause to exclude all longitudes and latitudes that will be more than 10 km/miles (?) away from your point. Then you only need to do the full calculation for the window within that 10 km/mile block.
As a very rough approximation you could use 0.1 degree as a rule of thumb; this is about 11 km at the equator, and less elsewhere.
So add
WHERE ABS(longitude - (-87.65)) < 0.1 AND ABS(latitude - 41.922) < 0.1
(If you use nested queries, add this to the deepest level.)
Since your distance is smaller than 10 km (or miles), you can consider the length of one degree of latitude/longitude as constant, and calculate it once using your formula. Then you can use Pythagoras' rule to calculate the distance (after adding the bounding box). This is basically why people usually use projected data for calculations. (A sketch combining both steps is below.)
Other things:
ORDER BY is always slow if you don't have an index. Do you need to order?
Save your longitude and latitude as numbers in your table. Why would you store them as anything else in a database?
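Putting the bounding box and the distance check together, a sketch of what the query could look like (the 0.15 / 0.2 degree margins are rough assumptions for about 10 miles at this latitude):
SELECT *
FROM (SELECT g.*,
             (ACOS(SIN(41.922682*0.017453293) * SIN(TO_NUMBER(latitude)*0.017453293)
                 + COS(41.922682*0.017453293) * COS(TO_NUMBER(latitude)*0.017453293)
                 * COS(TO_NUMBER(longitude)*0.017453293 - (-87.65432*0.017453293))) * 3959) AS distance
      FROM gym g
      -- cheap pre-filter: only rows inside the (roughly estimated) bounding box get the trigonometry
      WHERE TO_NUMBER(latitude)  BETWEEN 41.922682 - 0.15 AND 41.922682 + 0.15
        AND TO_NUMBER(longitude) BETWEEN -87.65432 - 0.2  AND -87.65432 + 0.2)
WHERE distance <= 10
ORDER BY distance;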
With money. Specifically, Oracle Spatial.
1) How are you measuring 0.3 seconds for the first query? I'll wager that you are measuring the time required to fetch the first row rather than the time required to fetch the last row. Most client tools will start displaying results long before the database has finished producing them if that is possible (which it almost certainly is if there is no ORDER BY). So you're probably measuring the time required by the first query to calculate the distance to the first 50 or 500 gyms against the time required by the last query to calculate the distance to all 12,000 gyms.
2) Oracle Locator is a feature that comes with all editions of the Oracle database that includes the ability to use spatial indexes and that provides built-in methods for computing distance. It's not nearly as powerful as Oracle Spatial but it should be more than sufficient for what you're discussing here.
3) If you want to roll your own, I'd second johanvdw's suggestion of using a bounding box.

How can I test that my hash function is good in terms of max-load?

I have read through various papers on the 'Balls and Bins' problem and it seems that if a hash function is working right (ie. it is effectively a random distribution) then the following should/must be true if I hash n values into a hash table with n slots (or bins):
Probability that a bin is empty, for large n is 1/e.
Expected number of empty bins is n/e.
Probability that a bin has exactly k balls is <= 1/(e*k!) (corrected).
Probability that a bin has at least k collisions is <= ((e/k)**k)/e (corrected).
These look easy to check. But the max-load test (the maximum number of collisions with high probability) is usually stated vaguely.
Most texts state that the maximum number of collisions in any bin is O( ln(n) / ln(ln(n)) ).
Some say it is 3*ln(n) / ln(ln(n)). Other papers mix ln and log - usually without defining them, or state that log is log base e and then use ln elsewhere.
Is ln the log to base e or base 2? Is this max-load formula right, and how big should n be to run a test?
This lecture seems to cover it best, but I am no mathematician.
http://pages.cs.wisc.edu/~shuchi/courses/787-F07/scribe-notes/lecture07.pdf
BTW, with high probability seems to mean 1 - 1/n.
That is a fascinating paper/lecture-- makes me wish I had taken some formal algorithms class.
I'm going to take a stab at some answers here, based on what I've just read from that, and feel free to vote me down. I'd appreciate a correction, though, rather than just a downvote :) I'm also going to use n and N interchangeably here, which is a big no-no in some circles, but since I'm just copy-pasting your formulae, I hope you'll forgive me.
First, the base of the logs. These numbers are given as big-O notation, not as absolute formulae. That means that you're looking for something 'on the order of ln(n) / ln(ln(n))', not with an expectation of an absolute answer, but more that as n gets bigger, the relationship of n to the maximum number of collisions should follow that formula. The details of the actual curve you can graph will vary by implementation (and I don't know enough about the practical implementations to tell you what's a 'good' curve, except that it should follow that big-O relationship). Those two formulae that you posted are actually equivalent in big-O notation. The 3 in the second formula is just a constant, and is related to a particular implementation. A less efficient implementation would have a bigger constant.
With that in mind, I would run empirical tests, because I'm a biologist at heart and I was trained to avoid hard-and-fast proofs as indications of how the world actually works. Start with N as some number, say 100, and find the bin with the largest number of collisions in it. That's your max-load for that run. Now, your examples should be as close as possible to what you expect actual users to use, so maybe you want to randomly pull words from a dictionary or something similar as your input.
Run that test many times, at least 30 or 40. Since you're using random numbers, you'll need to satisfy yourself that the average max-load you're getting is close to the theoretical 'expectation' of your algorithm. Expectation is just the average, but you'll still need to find it, and the tighter your std dev/std err about that average, the more you can say that your empirical average matches the theoretical expectation. One run is not enough, because a second run will (most likely) give a different answer.
Then, increase N, to say, 1000, 10000, etc. Increase it logarithmically, because your formula is logarithmic. As your N increases, your max-load should increase on the order of ln(n) / ln(ln(n)). If it increases at a rate of 3*ln(n) / ln(ln(n)), that means that you're following the theory that they put forth in that lecture.
This kind of empirical test will also show you where your approach breaks down. It may be that your algorithm works well for N < 10 million (or some other number), but above that, it starts to collapse. Why could that be? Maybe you have some limitation to 32 bits in your code without realizing it (ie, using a 'float' instead of a 'double'), or some other implementation detail. These kinds of details let you know where your code will work well in practice, and then as your practical needs change, you can modify your algorithm. Maybe making the algorithm work for very large datasets makes it very inefficient for very small ones, or vice versa, so pinpointing that tradeoff will help you further characterize how you could adapt your algorithm to particular situations. Always a useful skill to have.
EDIT: a proof of why the base of the log function doesn't matter with big-O notation:
log N = log_10(N) = log_b(N) / log_b(10) = (1/log_b(10)) * log_b(N)
1/log_b(10) is a constant, and in big-O notation, constants are ignored. Base changes are free, which is why you're encountering such variation in the papers.
Here is a rough start to the solution of this problem involving uniform distributions and maximum load.
Instead of bins and balls or urns or boxes or buckets or m and n, people (p) and doors (d) will be used as designations.
There is an exact expected value for each of the doors given a certain number of people. For example, with 5 people and 5 doors, the expected maximum door is exactly 1.2864 {(1429-625) / 625} above the mean (p/d) and the minimum door is exactly -0.9616 {(24-625) / 625} below the mean. The absolute value of the highest door's distance from the mean is a little larger than the smallest door's because all of the people could go through one door, but no less than zero can go through one of the doors. With large numbers of people (p/d > 3000), the difference between the absolute value of the highest door's distance from the mean and the lowest door's becomes negligible.
For an odd number of doors, the center door is essentially zero and is not scalable, but all of the other doors are scalable from certain values representing p=d. These rounded values for d=5 are:
-1.163 -0.495 0* 0.495 1.163
* slowly approaching zero from -0.12
From these values, you can compute the expected number of people for any count of people going through each of the 5 doors, including the maximum door. Except for the middle ordered door, the difference from the mean is scalable by sqrt(p/d).
So, for p=50,000 and d=5:
Expected number of people going through the maximum door, which could be any of the 5 doors, = 1.163 * sqrt(p/d) + p/d.
= 1.163 * sqrt(10,000) + 10,000 = 10,116.3
For p/d < 3,000, the result from this equation must be slightly increased.
With more people, the middle door slowly becomes closer and closer to zero from -0.11968 at p=100 and d=5. It can always be rounded up to zero and like the other 4 doors has quite a variance.
The values for 6 doors are:
-1.272 -0.643 -0.202 0.202 0.643 1.272
For 1000 doors, the approximate values are:
-3.25, -2.95, -2.79 … 2.79, 2.95, 3.25
For any d and p, there is an exact expected value for each of the ordered doors. Hopefully, a good approximation (with a relative error < 1%) exists. Some professor or mathematician somewhere must know.
For testing uniform distribution, you will need a number of averaged ordered sessions (750-1000 works well) rather than a greater number of people. No matter what, the variances between valid sessions are great. That's the nature of randomness. Collisions are unavoidable. *
The expected values for 5 and 6 doors were obtained by sheer brute force computation using 640 bit integers and averaging the convergence of the absolute values of corresponding opposite doors.
For d=5 and p=170:
-6.63901 -2.95905 -0.119342 2.81054 6.90686
(27.36099 31.04095 33.880658 36.81054 40.90686)
For d=6 and p=108:
-5.19024 -2.7711 -0.973979 0.734434 2.66716 5.53372
(12.80976 15.2289 17.026021 18.734434 20.66716 23.53372)
I hope that you may evenly distribute your data.
It's almost guaranteed that all of George Foreman's sons, or some similar situation, will fight against your hash function. And proper contingency planning is the work of all good programmers.
After some more research and trial-and-error I think I can provide something part way to an answer.
To start off, ln and log seem to refer to log base e if you look into the maths behind the theory. But as mmr indicated, for the O(...) estimates it doesn't matter.
max-load can be defined for any probability you like. The typical formula used is
1-1/n**c
Most papers on the topic use
1-1/n
An example might be easiest.
Say you have a hash table of 1000 slots and you want to hash 1000 things. Say you also want to know the max-load with a probability of 1-1/1000 or 0.999.
The max-load is the maximum number of hash values that end up being the same - ie. collisions (assuming that your hash function is good).
Using the formula for the probability of getting exactly k identical hash values
Pr[ exactly k ] = 1/(e*k!)
then accumulating the probabilities for exactly 0..k items until the total equals or exceeds 0.999 tells you that k is the max-load.
eg.
Pr[0] = 0.37
Pr[1] = 0.37
Pr[2] = 0.18
Pr[3] = 0.061
Pr[4] = 0.015
Pr[5] = 0.003 // here, the cumulative total is 0.999
Pr[6] = 0.0005
Pr[7] = 0.00007
So, in this case, the max-load is 5.
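In other words (restating the procedure above), the max-load at confidence 1 - 1/n is the smallest k such that
Pr[0] + Pr[1] + ... + Pr[k] = 1/(e*0!) + 1/(e*1!) + ... + 1/(e*k!) >= 1 - 1/n
and for n = 1000 the running total 0.37 + 0.37 + 0.18 + 0.061 + 0.015 + 0.003 ≈ 0.999 first crosses that threshold at k = 5.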
So if my hash function is working well on my set of data then I should expect the maximum number of identical hash values (or collisions) to be 5.
If it isn't then this could be due to the following reasons:
Your data has small values (like short strings) that hash to the same value. Any hash of a single ASCII character will pick 1 of only 128 hash values. (There are ways around this; for example, you could use multiple hash functions, but that slows down hashing and I don't know much about it.)
Your hash function doesn't work well with your data - try it with random data.
Your hash function doesn't work well.
The other tests I mentioned in my question are also helpful for checking that your hash function is running as expected.
Incidentally, my hash function worked nicely - except on short (1..4 character) strings.
I also implemented a simple split-table version which places the hash value into the least-used slot from a choice of 2 locations. This more than halves the number of collisions, but means that adding to and searching the hash table is a little slower.
I hope this helps.