Index by geolocation in database - sql

I'm trying to find a way to let my database support fast location-based searches (for example, all items that lie within a certain distance of some geopoint (LAT, LON)). The brute-force solution, which calculates the distance between every point in the database and the query point, probably won't work for large datasets, so some kind of indexing seems necessary. I'm not sure whether there are existing standard ways to do this (I know they are out there, but Google failed me), but here is a method (or more like a hack?) that I think might work:
Calculate a value from (LAT, LON) and store it in an indexed column. For example, something like floor(LAT / 10) * 10 * 100 + floor(LON / 10) * 10. Each time a query arrives, we first calculate this value for the query point and find all the corresponding rows, and then calculate the Euclidean distances between those points and the query point.
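A minimal sketch of that grid-cell hack in generic SQL, assuming an items table with numeric lat and lon columns (all names here are placeholders):

-- Precompute a grid-cell value per row and index it, using the 10-degree
-- cells from the formula above.
ALTER TABLE items ADD COLUMN geo_cell INTEGER;
CREATE INDEX idx_items_geo_cell ON items (geo_cell);

UPDATE items
SET geo_cell = FLOOR(lat / 10) * 10 * 100 + FLOOR(lon / 10) * 10;

-- At query time, compute the same value for the query point (:qlat, :qlon),
-- fetch the matching rows through the index, then compute exact distances
-- over that much smaller candidate set.
SELECT *
FROM items
WHERE geo_cell = FLOOR(:qlat / 10) * 10 * 100 + FLOOR(:qlon / 10) * 10;

One caveat with any scheme like this: a query point near a cell boundary can have close neighbors in adjacent cells, so in practice you would also check the surrounding cell values.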

Related

BigQuery SQL / GIS: Extend Radius Until Count Is Greater Than / Equal To 'N'

I have a very special kind of query to write. In PostGIS / BigQuery, I have a point. I can buffer this point by increments and perform an aggregation query, such as count(distinct()), on the unique records that fall within the buffer. Once the count reaches a certain threshold, I would like to return the input value of the geographic object, i.e. its radius or diameter. This problem can be phrased as "how far do I have to keep going out until I hit 'n' [ids]?".
Finely incrementing the value of the buffer or radius will be insufferably slow and expensive. Can anyone think of a nice way to shortcut this and offer a solution that provides a nice answer quickly (in BQ or PSQL terms!)?
Available GIS functions:
st_buffer()
st_dwithin()
Thank you!
You would have to order by distance and keep the N closest points. The <-> operator will use the spatial index.
SELECT *
FROM pointLayer p
ORDER BY p.geometry <-> st_makePoint(...)
LIMIT 10; --N
You don't need to increment the radius finely - I would rather double it, or maybe even increase it 10x, and once you have enough distinct records, take the N nearest ones.
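A minimal PL/pgSQL sketch of that doubling strategy, assuming the pointLayer table from the query above plus an id column, with N = 10 and a made-up starting radius (all assumptions):

DO $$
DECLARE
  r double precision := 100;  -- initial radius, in the layer's units
  n integer;
BEGIN
  LOOP
    SELECT count(DISTINCT id) INTO n
    FROM pointLayer
    WHERE st_dwithin(geometry, st_makePoint(-87.65, 41.92), r);
    EXIT WHEN n >= 10;  -- stop once enough distinct records are inside
    r := r * 2;         -- double rather than finely increment
  END LOOP;
  RAISE NOTICE 'radius needed: %', r;
END $$;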
I've used BigQuery scripting to solve a similar problem (with N=1, but it is easy to modify for any N: just use LIMIT N instead of LIMIT 1 and adjust the stopping condition):
https://medium.com/@mentin/nearest-neighbor-using-bq-scripting-373241f5b2f5

CoreData + Magical Record running select query

I have an application with a SQLite database that contains 7000+ records with city names, longitudes and latitudes. These "cities" are also connected to the relevant city fields in the database.
What my app does is query the current location with Core Location, fetch the lon and lat values, and then find the closest location in the database.
The result doesn't have to be super accurate (I just want to match cities), so I want to use the hypotenuse formula to find the closest point:
closest city in db: min(((x1-x2)^2 + (y1-y2)^2)^(1/2))
x1, y1: lon and lat for the user
x2, y2: lon and lat for points in the database.
If I were using an MS SQL or SQLite database, I could easily create a query, but when it comes to Core Data, I'm out of ideas.
I don't want to fetch all the data (and fill up memory) and then evaluate this formula across every record, so is there a way to create a query and get the result from the db?
Am I overthinking this problem, and missing a simple solution?
If I'm understanding your problem correctly, you're wanting to find the closest "n" cities to your current location.
I had something similar and here's how I approached it.
In essence, you probably need to take each city's lat/lon and hash it into some index. We use a Mercator Projection to convert the lat/lon to x/y, then hash that value in a manner similar to how Google/Bing/Apple Maps hash their map tiles. Fortunately, MapKit has a built-in Mercator Projection function.
In pseudocode:
for each city's lat/lon {
    CLLocationCoordinate2D coordinate = CLLocationCoordinate2DMake(lat, lon);
    MKMapPoint point = MKMapPointForCoordinate(coordinate);

    // 256 represents the size of a map tile at zoomLevel 20. You can use
    // whatever zoomLevel you want here, but we need something to quickly
    // look up close-by cities. This is the formula you can use to determine
    // how granular your index is:
    //     256 * pow(2, (20 - zoomLevel))
    NSInteger x = point.x / 256.0;
    NSInteger y = point.y / 256.0;

    // save x & y in a CityHashIndex table
}
Now, you get the current location's lat/lon, hash that into the index as above, and simply write a query against this CityHashIndex table.
So say that, for simplicity's sake, your current location is indexed at 1000, 1000. To find close-by cities, you might search for cities with indexes in the range of 900-1100, 900-1100.
From there, you're now only pulling in a much smaller set of cities, and the memory requirements to process your hypotenuse formula aren't so bad.
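Since the store underneath is SQLite, the range lookup itself is simple; a sketch, with a hypothetical CityHashIndex schema of (city_id, x, y):

-- Pull the candidate cities whose tile indexes fall in the +/-100 window
-- around the current location's index (1000, 1000 in the example above).
SELECT city_id, x, y
FROM CityHashIndex
WHERE x BETWEEN 900 AND 1100
  AND y BETWEEN 900 AND 1100;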
I can elaborate more if you're interested.
This is directly related to a commonly asked question about Core Data.
Searching for surrounding suburbs based on latitude & longitude using Objective C
Calculate a bounding box around the point you need (min lat/long, max lat/long), then use an NSPredicate against those values to find everything within the box. From there you can do a distance calculation on the results that come back and sort them.
I would suggest setting this up so that it can search at multiple distances; then you can see whether a city is within 10 miles, 100 miles, etc., slowly increasing the bounding box until you get one or more results back.
I would use NSPredicate to define my search criteria; it will act as a filter. I'm not sure how optimized this is, or whether it will pull in all your records, but I'm assuming that Core Data has some kind of indexing mechanism that will optimize the search.
You can take a look at this document:
https://developer.apple.com/library/mac/documentation/Cocoa/Conceptual/CoreData/Articles/cdFetching.html
Check the section named "Retrieving Specific Objects".

Manhattan distance with n dimensions in oracle

I have a table with around 5 million rows, and each row has 10 columns representing 10 dimensions.
When a new input comes in, I would like to be able to search the table and return the closest rows using Manhattan distance.
The distance is the sum of abs(Ai-Aj)+abs(Bi-Bj)...
The problem is that, at the moment, a query does a full scan of the entire table to calculate the distance from every row, and then sorts the rows to find the top X.
Is there a way to speed up the process and make the query more efficient?
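For reference, the naive version being described looks something like this (a sketch with placeholder table/column names, Oracle 12c FETCH FIRST syntax, and only three of the ten dimensions written out):

-- Full scan: compute the Manhattan distance to every row, sort, keep top X.
SELECT *
FROM points
ORDER BY ABS(x1 - :a1) + ABS(x2 - :a2) + ABS(x3 - :a3)  -- ... up to x10
FETCH FIRST 10 ROWS ONLY;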
I looked online at the distance function for SDO_GEOMETRY, but I couldn't find one for more than 4 dimensions.
Thank you
If you are inserting a point A and you want to look for points that are within a neighbourhood of radius r (i.e., are less than r away, in any metric), you can do a really simple query:
select x1, x2, ..., xn
from points
where x1 between a1 - r and a1 + r
and x2 between a2 - r and a2 + r
...
and xn between an - r and an + r
...where A = (a1, a2, ..., an), to find a bound. If you have an index over all of the x1, ..., xn fields of points, then this query shouldn't require a full scan. Now, this result may include points that are outside the neighbourhood (i.e., the bits in the corners), but it is an easy win toward finding the appropriate subset: you can now check against the records from this subquery, rather than checking against every point in your table.
You may be able to refine this query further because, with the Manhattan metric, a neighbourhood will be square shaped (although at 45 degrees to the above) and squares are relatively easy to work with! (Even in 10 dimensions.) However, the more complicated logic required may be more of an overhead than an optimisation, ultimately.
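Putting the box pre-filter together with the exact Manhattan distance, a sketch in Oracle-style SQL (placeholder names again, three of the ten dimensions written out, 12c FETCH FIRST syntax):

-- Index-friendly box filter, then exact Manhattan distance on the survivors.
SELECT x1, x2, x3,
       ABS(x1 - :a1) + ABS(x2 - :a2) + ABS(x3 - :a3) AS manhattan  -- ... x10
FROM points
WHERE x1 BETWEEN :a1 - :r AND :a1 + :r
  AND x2 BETWEEN :a2 - :r AND :a2 + :r
  AND x3 BETWEEN :a3 - :r AND :a3 + :r
ORDER BY manhattan
FETCH FIRST 10 ROWS ONLY;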
I suggest using a function-based index. You need this distance calculated, so pre-calculate it using a function-based index.
You may want to read the following question and its links. A function-based index creates a hidden column for you; this hidden column would hold the Manhattan distance, so sorting would be easier.
Thanks to @Xophmeister's comment: a function-based index will not help you for an arbitrary point, and I do not know any SQL function that helps here. But you may be willing to use a machine learning / data mining algorithm:
I suggest clustering your 5 million rows with k-means clustering. Let's say you find 1000 cluster centers; put these cluster centers in another table.
By the definition of clustering, your points will be assigned to cluster centers, so you know which points are nearest to each center, say
cluster (1) contains 20,000 points, ... cluster (987) contains 10,000 points ...
Your arbitrary point will be near one cluster. Say you find that it is nearest to cluster 987: run your SQL using only the points that belong to that cluster center, i.e. those 10,000 points.
You need to add several tables/columns to your schema to make this effective. If your 5,000,000 rows change continuously, you will need to re-run the k-means clustering as they change, but if the values are fairly constant, one clustering per week or per month will be enough.
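A sketch of the resulting lookup, assuming a hypothetical centers table (center_id, c1..c10) and a cluster_id column added to points - none of which exist in the original schema:

-- Find the nearest cluster center, then compute exact Manhattan distances
-- only within that cluster (two of the ten dimensions written out).
SELECT p.*
FROM points p
WHERE p.cluster_id = (
  SELECT c.center_id
  FROM centers c
  ORDER BY ABS(c.c1 - :a1) + ABS(c.c2 - :a2)  -- ... up to c10
  FETCH FIRST 1 ROW ONLY
)
ORDER BY ABS(p.x1 - :a1) + ABS(p.x2 - :a2)    -- ... up to x10
FETCH FIRST 10 ROWS ONLY;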

spatial data distance search - optimization options

Our business user loves for our searches to be done by distance; the problem is that we have over 1 million records with a lat/long location. We are using SQL 2008, but we keep running into issues when we order or restrict our searches by distance: the queries take way too long (30 seconds plus). This is unacceptable; there has got to be a better way to do this. We have done everything we can with SQL 2008 and want to upgrade to 2012 at some point if we can.
I ask, though, whether there is another technology or optimization that we could apply. Could we switch to a different DB for faster performance, or apply a different search algorithm, an estimation algorithm, a tree, grids, pre-computation, etc.?
A solution that might be useful here would be to break your search into two parts:
1) Run a query where you find all records that are within a certain value + or - of the current lat/lng of your location. The where clause might look like:
where (@latitude > (lat - .001) and @latitude < (lat + .001)) and (@longitude > (lng - .001) and @longitude < (lng + .001))
Using this approach, and especially with an index on both the latitude and longitude columns, you can very quickly define a working set of locations within a specified distance.
2) With the rough results from step 1, use the great circle/haversine method to determine the actual distance between the source location and each point.
Where this approach falls over is if there is never any limit to the radius that you are searching, but it works great if you are, for instance, looking to find all locations within a specific distance of a given point.
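A T-SQL sketch of the two steps combined, assuming a locations table with float lat/lng columns and a supporting index (names are placeholders, and the distance expression is the spherical law of cosines in miles):

-- Supporting index for the box filter in step 1.
CREATE INDEX ix_locations_lat_lng ON locations (lat, lng);

DECLARE @latitude float = 41.922682, @longitude float = -87.65432;

-- Step 1 narrows by bounding box; step 2 computes the real distance and sorts.
SELECT *,
       3959 * ACOS(
         SIN(RADIANS(@latitude)) * SIN(RADIANS(lat)) +
         COS(RADIANS(@latitude)) * COS(RADIANS(lat)) *
         COS(RADIANS(lng) - RADIANS(@longitude))
       ) AS distance_miles
FROM locations
WHERE lat BETWEEN @latitude - .001 AND @latitude + .001
  AND lng BETWEEN @longitude - .001 AND @longitude + .001
ORDER BY distance_miles;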

oracle geocoding query is really slow, how could i optimize a dynamic field?

I have an Oracle table with 12K records (gyms), and the query below takes approximately ~0.3s:
SELECT (acos(sin(41.922682*0.017453293) *
sin(to_number(LATITUDE)*0.017453293) + cos(41.922682*0.017453293) *
cos(to_number(LATITUDE)*0.017453293) * cos(to_number(LONGITUDE)*0.017453293 -
(-87.65432*0.017453293)))*3959) as distance
FROM gym
However, I would like to return all of the records where distance <= 10, and as soon as I run the following query, my query execution time jumps up to ~5.0s:
SELECT * from (SELECT (acos(sin(41.922682*0.017453293) *
sin(to_number(LATITUDE)*0.017453293) + cos(41.922682*0.017453293) *
cos(to_number(LATITUDE)*0.017453293) * cos(to_number(LONGITUDE)*0.017453293 -
(-87.65432*0.017453293)))*3959)
as distance FROM gym)
WHERE distance <= 10
ORDER BY distance asc
Any idea how I can optimize this in Oracle?
Most important:
use a where clause to exclude all longitudes and latitudes that will be more than 10 km/miles (?) away from your point, so you only need to do the calculation for the window within a 10 km/mile block.
As a very rough approximation you could use 0.1 degree as a rule of thumb; this is 11 km at the equator, and less elsewhere.
So add
WHERE abs(longitude - (-87.65)) < 0.1 and abs(latitude - 41.922) < 0.1
(If you use nested queries, add this to the deepest level.)
Since your distance is smaller than 10 km or miles, you can consider the length of one unit of latitude/longitude as constant, and calculate it once using your formula. Then you can use Pythagoras' rule to calculate the distance (after adding the bounding box). This is basically why people usually use projected data for calculations.
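A sketch of that combination against the gym table above; the 69-miles-per-degree-of-latitude constant, the cos(latitude) scaling for longitude, and the slightly oversized 0.2/0.15-degree window are all assumptions for a 10-mile search:

SELECT *
FROM (
  SELECT g.*,
         -- Pythagoras on a locally flat projection: ~69 miles per degree of
         -- latitude, ~69 * cos(latitude) miles per degree of longitude.
         SQRT(POWER((to_number(LATITUDE) - 41.922682) * 69.0, 2) +
              POWER((to_number(LONGITUDE) - (-87.65432)) * 69.0 *
                    COS(41.922682 * 0.017453293), 2)) AS distance
  FROM gym g
  WHERE abs(to_number(LONGITUDE) - (-87.65432)) < 0.2
    AND abs(to_number(LATITUDE) - 41.922682) < 0.15
)
WHERE distance <= 10
ORDER BY distance;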
Other things:
order by is always slow if you don't have an index. Do you need to order?
save your longitude and latitude as numbers in your table. Why would you store them differently in a database? A quick sketch of the conversion follows below.
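A sketch of that conversion (in practice you would validate the text values first):

-- Store coordinates as NUMBER once, instead of calling to_number() on every
-- row in every query.
ALTER TABLE gym ADD (lat_num NUMBER, lon_num NUMBER);
UPDATE gym SET lat_num = to_number(LATITUDE), lon_num = to_number(LONGITUDE);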
With money. Specifically, Oracle Spatial.
1) How are you measuring 0.3 seconds for the first query? I'll wager that you are measuring the time required to fetch the first row rather than the time required to fetch the last row. Most client tools will start displaying results long before the database has finished producing them if that is possible (which it almost certainly is if there is no ORDER BY). So you're probably measuring the time required by the first query to calculate the distance to the first 50 or 500 gyms against the time required by the last query to calculate the distance to all 12,000 gyms.
2) Oracle Locator is a feature that comes with all editions of the Oracle database; it includes the ability to use spatial indexes and provides built-in methods for computing distance. It's not nearly as powerful as Oracle Spatial, but it should be more than sufficient for what you're discussing here (see the sketch after this list).
3) If you want to roll your own, I'd second johanvdw's suggestion of using a bounding box.
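To illustrate point 2, a sketch of what the Locator route might look like; it assumes you add an SDO_GEOMETRY column (here called GEOM) with a spatial index, neither of which the gym table currently has:

-- Spatial-index-driven search for all gyms within 10 miles of the point.
SELECT *
FROM gym g
WHERE SDO_WITHIN_DISTANCE(
        g.geom,
        SDO_GEOMETRY(2001, 4326,
                     SDO_POINT_TYPE(-87.65432, 41.922682, NULL),
                     NULL, NULL),
        'distance=10 unit=MILE'
      ) = 'TRUE';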