Oracle geocoding query is really slow, how could I optimize a dynamic field? - SQL

I have an Oracle table with ~12K records (gyms), and the query below takes approximately 0.3s:
SELECT (acos(sin(41.922682*0.017453293) *
sin(to_number(LATITUDE)*0.017453293) + cos(41.922682*0.017453293) *
cos(to_number(LATITUDE)*0.017453293) * cos(to_number(LONGITUDE)*0.017453293 -
(-87.65432*0.017453293)))*3959) as distance
FROM gym
However, I would like to return all of the records where distance <= 10, and as soon as I run the following query, my query execution time jumps up to ~5.0s:
SELECT * from (SELECT (acos(sin(41.922682*0.017453293) *
sin(to_number(LATITUDE)*0.017453293) + cos(41.922682*0.017453293) *
cos(to_number(LATITUDE)*0.017453293) * cos(to_number(LONGITUDE)*0.017453293 -
(-87.65432*0.017453293)))*3959)
as distance FROM gym)
WHERE distance <= 10
ORDER BY distance asc
Any idea how I can optimize this in Oracle?

Most important:
Use a WHERE clause to exclude all longitudes and latitudes that are more than 10 miles away from your point (your formula multiplies by 3959, the Earth's radius in miles, so the unit is miles). Then you only need to run the full calculation for the window around your point.
As a very rough approximation you could use 0.1 degree as a rule of thumb: this is about 11 km at the equator, and less elsewhere.
So add:
WHERE abs(longitude - (-87.65)) < 0.1 AND abs(latitude - 41.922) < 0.1
(The abs() matters: without it, points arbitrarily far away on one side would also pass the test.)
(If you use nested queries, add this to the deepest level)
Since your distances are smaller than 10 miles, you can treat the length of one degree of latitude/longitude as constant, and calculate each once using your formula. Then you can use Pythagoras' rule to compute the distance (after adding the bounding box); see the sketch below. This is basically why people usually use projected data for calculations.
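A minimal sketch combining the bounding box and the planar approximation, assuming the gym table from the question, a 10-mile radius, and per-degree lengths of about 69 miles for latitude and 69 * cos(latitude) miles for longitude near this point:
-- bounding box first (cheap filter), planar Pythagorean distance second
SELECT *
FROM (
  SELECT g.*,
         SQRT(POWER((to_number(LATITUDE)  - 41.922682) * 69.0, 2) +
              POWER((to_number(LONGITUDE) - (-87.65432)) * 69.0
                    * COS(41.922682 * 0.017453293), 2)) AS distance
  FROM gym g
  WHERE to_number(LATITUDE)  BETWEEN 41.922682 - 0.15 AND 41.922682 + 0.15
    AND to_number(LONGITUDE) BETWEEN -87.65432 - 0.20 AND -87.65432 + 0.20
)
WHERE distance <= 10
ORDER BY distance
(0.15° of latitude ≈ 10.4 miles and 0.20° of longitude ≈ 10.3 miles here, so the box safely contains the 10-mile circle.)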
Other things:
ORDER BY is slow when there is no index the database can use for the sort. Do you really need to order?
Save your longitude and latitude as numbers in your table. Why would you store them differently in a database? (A migration sketch follows below.)
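For that last point, a hedged sketch of the migration (the column and index names are made up):
-- store the coordinates as NUMBER once, instead of calling to_number() per row on every query
ALTER TABLE gym ADD (lat_num NUMBER, lon_num NUMBER);
UPDATE gym SET lat_num = to_number(LATITUDE), lon_num = to_number(LONGITUDE);
CREATE INDEX gym_lat_lon_idx ON gym (lat_num, lon_num);
The bounding-box predicates above can then run against the indexed numeric columns directly.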

With money. Specifically, Oracle Spatial.

1) How are you measuring 0.3 seconds for the first query? I'll wager that you are measuring the time required to fetch the first row rather than the time required to fetch the last row. Most client tools will start displaying results long before the database has finished producing them if that is possible (which it almost certainly is if there is no ORDER BY). So you're probably measuring the time required by the first query to calculate the distance to the first 50 or 500 gyms against the time required by the last query to calculate the distance to all 12,000 gyms.
2) Oracle Locator is a feature that comes with all editions of the Oracle database; it includes the ability to use spatial indexes and provides built-in methods for computing distance. It's not nearly as powerful as Oracle Spatial, but it should be more than sufficient for what you're discussing here (see the sketch after this list).
3) If you want to roll your own, I'd second johanvdw's suggestion of using a bounding box.
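A hedged sketch of the Locator route (the GEOM column, index name, and metadata values are illustrative; 4326 is the WGS84 coordinate system, and Locator requires the metadata row before the spatial index can be built):
-- add a geometry column and populate it from the existing text columns
ALTER TABLE gym ADD (geom SDO_GEOMETRY);
UPDATE gym
SET geom = SDO_GEOMETRY(2001, 4326,
             SDO_POINT_TYPE(to_number(LONGITUDE), to_number(LATITUDE), NULL),
             NULL, NULL);
-- register the column and build the spatial index
INSERT INTO user_sdo_geom_metadata (table_name, column_name, diminfo, srid)
VALUES ('GYM', 'GEOM',
        SDO_DIM_ARRAY(SDO_DIM_ELEMENT('LON', -180, 180, 0.05),
                      SDO_DIM_ELEMENT('LAT',  -90,  90, 0.05)),
        4326);
CREATE INDEX gym_spx ON gym (geom) INDEXTYPE IS MDSYS.SPATIAL_INDEX;
-- the distance query then becomes
SELECT g.*
FROM gym g
WHERE SDO_WITHIN_DISTANCE(g.geom,
        SDO_GEOMETRY(2001, 4326,
          SDO_POINT_TYPE(-87.65432, 41.922682, NULL), NULL, NULL),
        'distance=10 unit=MILE') = 'TRUE'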

Related

BigQuery SQL computing spatial statistics around multiple locations

Problem: I want to make a query that allows me to compute sociodemographic statistics within a given distance of multiple locations. I can only make a query that computes the statistics at a single location at a time.
My desired result would be a table where I can see the name of the library (title), P_60YMAS (some sociodemographic data near these libraries), and a geometry of the multipolygons within the buffer distance of this location as a GEOGRAPHY data type.
Context:
I have two tables:
'cis-sdhis.de.biblioteca' (libraries), which has points as a GEOGRAPHY data type;
'cis-sdhis.inegi.resageburb', which has a lot of sociodemographic data, including polygons as a GEOGRAPHY data type (column name: 'GEOMETRY').
I want to make a 1 km buffer around the libraries, build new multipolygons within these buffers, and get sociodemographic data and geometry from those multipolygons.
My first approach was with this query:
SELECT
SUM(P_60YMAS) AS age60_plus,
ST_UNION_AGG(GEOMETRY) AS geo
FROM
`cis-sdhis.inegi.resageburb`
WHERE
ST_WITHIN(GEOMETRY,
(
SELECT
ST_BUFFER(geography,
1000)
FROM
`cis-sdhis.de.biblioteca`
WHERE
id = 'bpm-mty3'))
As you can see, this query only gives me one library ('bpm-mty3'), and that's my problem: I want them all at once.
I thought that using OVER() would be one solution, but I don't really know where or how to use it.
OVER() could be a solution, but a more performant solution is JOIN.
You first need to join the two tables on the condition that the distance between them is less than 1000 meters. That gives you pairs of rows, where each library is paired with all the relevant data; note we'll get multiple rows per library. The predicate to use is ST_DWITHIN(geo1, geo2, distance); an alternative form is ST_DISTANCE(geo1, geo2) < distance. In both cases you don't need a buffer.
SELECT
  *
FROM
  `cis-sdhis.inegi.resageburb` data,
  `cis-sdhis.de.biblioteca` lib
WHERE
  ST_DWITHIN(lib.geography, data.GEOMETRY, 1000)
Then we need to compute stats per library. Remember we have many rows per library but need a single row for each one, so we aggregate per library with GROUP BY on id. When we need information about the library itself, the cheapest way to get it is the ANY_VALUE aggregation function. So it will be something like:
SELECT
  lib.id, ANY_VALUE(lib.title) AS title,
  SUM(P_60YMAS) AS age60_plus,
  -- if you need to show the circle around the library
  ST_BUFFER(ANY_VALUE(lib.geography), 1000) AS buffered_location
FROM
  `cis-sdhis.inegi.resageburb` data,
  `cis-sdhis.de.biblioteca` lib
WHERE
  ST_DWITHIN(lib.geography, data.GEOMETRY, 1000)
GROUP BY lib.id
One thing to note here is that ST_BUFFER(ANY_VALUE(...)) is much cheaper than ANY_VALUE(ST_BUFFER(...)) - it computes the buffer only once per output row.

BigQuery SQL / GIS: Extend Radius Until Count Is Greater Than / Equal To 'N'

I have a very special kind of query to write. In PostGIS / BigQuery, I have a point. I can buffer this point by increments and perform an aggregation query, such as COUNT(DISTINCT ...), on the unique records that fall within this buffer. Once the count reaches a certain threshold, I would like to return the input value of the geographic object, i.e. its radius or diameter. This problem can be phrased as "how far do I have to keep going out until I hit 'n' [ids]?".
Finely incrementing the value of the buffer radius will be insufferably slow and expensive. Can anyone think of a nice way to short-circuit this and offer a solution that provides a nice answer quickly (in BQ or PSQL terms!)?
Available GIS functions:
st_buffer()
st_dwithin()
Thank you!
You would have to order by distance and keep the N closest points. In PostGIS, the <-> operator will use the spatial index.
SELECT *
FROM pointLayer p
ORDER BY p.geometry <-> st_makePoint(...)
LIMIT 10; --N
You don't need to increment the radius finely - I would rather double it, or maybe even increase it 10x, and once you have enough distinct records, take the N nearest ones.
I've used BigQuery scripting to solve a similar problem (with N=1, but it is easy to modify for any N: just use LIMIT N instead of LIMIT 1 and adjust the stopping condition):
https://medium.com/@mentin/nearest-neighbor-using-bq-scripting-373241f5b2f5
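In the same spirit, a hedged sketch of the doubling strategy in BigQuery scripting (the table, its id and geog columns, the search point, and the starting radius are all made up; it assumes the table holds at least N distinct ids):
-- grow the radius geometrically until at least N distinct ids fall inside it
DECLARE search_point GEOGRAPHY DEFAULT ST_GEOGPOINT(-87.654, 41.922);
DECLARE radius FLOAT64 DEFAULT 100.0;  -- metres
DECLARE n INT64 DEFAULT 10;
DECLARE cnt INT64 DEFAULT 0;
WHILE cnt < n DO
  SET cnt = (SELECT COUNT(DISTINCT id)
             FROM `project.dataset.points`
             WHERE ST_DWITHIN(geog, search_point, radius));
  IF cnt < n THEN SET radius = radius * 2; END IF;
END WHILE;
-- the radius now contains at least N ids; take the N nearest
SELECT id, ST_DISTANCE(geog, search_point) AS dist
FROM `project.dataset.points`
WHERE ST_DWITHIN(geog, search_point, radius)
ORDER BY dist
LIMIT 10;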

Index by geolocation in database

I'm trying to find a way to let my database support fast location-based searches (for example, all items that lie within a certain distance of some geopoint (LAT, LON)). I guess the brute-force solution, which calculates the distance between every point in the database and the query point, won't work for a large dataset, so some kind of indexing is necessary. I'm not sure if there are any existing standard ways to do this (well, I know they are out there, but Google failed me), but here is a method (or more like a hack?) that I think might work:
Calculate a value from (LAT, LON) and store it in an indexed column. For example, something like floor(LAT / 10) * 10 * 100 + floor(LON / 10) * 10. Each time a query arrives, we first calculate this value for the query point and find all the corresponding rows, and then calculate the Euclidean distances between those points and the query point.
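A hedged sketch of that idea in generic SQL (the items table, the 0.1-degree cell size, and all names are made up; note that a query point near a cell edge also needs the neighbouring cells, hence the +/- 1):
-- bucket every row into a coarse grid cell and index the cell
ALTER TABLE items ADD (cell_x INT, cell_y INT);
UPDATE items SET cell_x = FLOOR(lon / 0.1), cell_y = FLOOR(lat / 0.1);
CREATE INDEX items_cell_idx ON items (cell_x, cell_y);
-- candidate search: the query point's cell plus its eight neighbours
SELECT *
FROM items
WHERE cell_x BETWEEN FLOOR(:qlon / 0.1) - 1 AND FLOOR(:qlon / 0.1) + 1
  AND cell_y BETWEEN FLOOR(:qlat / 0.1) - 1 AND FLOOR(:qlat / 0.1) + 1
-- ...then compute the exact distance to each candidate and filter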

Manhattan distance with n dimensions in oracle

I have a table with around 5 million rows, and each row has 10 columns representing 10 dimensions.
When a new input arrives, I would like to search the table and return the closest rows using Manhattan distance.
The distance is the sum abs(Ai-Aj) + abs(Bi-Bj) + ...
The problem is that at the moment any such query does a full scan of the entire table, calculating the distance to every row and then sorting to find the top X.
Is there a way to speed the process and make the query more efficient?
I looked online at the distance function for SDO_GEOMETRY, but I couldn't find support for more than 4 dimensions.
Thank you
If you are inserting a point A and you want to look for points that are within a neighbourhood of radius r (i.e., are less than r away, on any metric), you can do a really simple query:
select x1, x2, ..., xn
from points
where x1 between a1 - r and a1 + r
and x2 between a2 - r and a2 + r
...
and xn between an - r and an + r
...where A = (a1, a2, ..., an), to find a bound. If you have an index over all the x1, ..., xn fields of points, then this query shouldn't require a full scan. Now, this result may include points that are outside the neighbourhood (i.e., the bits in the corners), but it is an easy win to find an appropriate subset: you can now check the exact distance against the records from this subquery, rather than against every point in your table.
You may be able to refine this query further because, with the Manhattan metric, a neighbourhood will be square shaped (although at 45 degrees to the above) and squares are relatively easy to work with! (Even in 10 dimensions.) However, the more complicated logic required may be more of an overhead than an optimisation, ultimately.
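To make the second step concrete, a hedged sketch with three dimensions written out (extend the pattern to all ten; the bind names :a1..:a3 and :r are illustrative, and FETCH FIRST needs Oracle 12c or later):
-- exact Manhattan filter and top-X over the bounding-box candidates
SELECT *
FROM (
  SELECT p.*,
         ABS(x1 - :a1) + ABS(x2 - :a2) + ABS(x3 - :a3) AS manhattan_dist
  FROM points p
  WHERE x1 BETWEEN :a1 - :r AND :a1 + :r
    AND x2 BETWEEN :a2 - :r AND :a2 + :r
    AND x3 BETWEEN :a3 - :r AND :a3 + :r
)
WHERE manhattan_dist <= :r
ORDER BY manhattan_dist
FETCH FIRST 10 ROWS ONLY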
I suggest using a function-based index: you need this distance calculated, so pre-calculate it using a function-based index.
You may want to read the following question and its links. A function-based index creates a hidden column for you; this hidden column will hold the Manhattan distance, so sorting becomes easier.
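For a fixed reference point, the idea would look like this hedged sketch (the table and column names are assumed from the question; as noted next, the indexed expression has the reference point baked in, so it cannot serve arbitrary query points):
-- function-based index on the Manhattan distance to one fixed point (here the origin)
CREATE INDEX points_l1_origin_idx ON points (
  ABS(x1) + ABS(x2) + ABS(x3) + ABS(x4) + ABS(x5) +
  ABS(x6) + ABS(x7) + ABS(x8) + ABS(x9) + ABS(x10));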
Thanks to @Xophmeister's comment: a function-based index will not help you for an arbitrary point, and I do not know any SQL function that will help you here. But you can use a machine-learning / data-mining algorithm instead.
I suggest clustering your 5 million rows using k-means. Say you find 1000 cluster centers; put these cluster centers in another table.
By definition of clustering, your points are assigned to cluster centers, so you know which points are nearest to each center, e.g.
cluster (1) contains 20,000 points, ... cluster (987) contains 10,000 points ...
Your arbitrary point will be near one cluster center. Say you find that your point is nearest to cluster 987: run your SQL using only the points which belong to that cluster center, i.e. those 10,000 points.
You need to add several tables/columns to your schema to make this effective. If your 5,000,000 rows change continuously, you need to re-run the k-means clustering as they change; but if the values are fairly constant, one clustering per week or per month will be enough.
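A hedged sketch of the lookup (the centers table with columns c1..c10, and the cluster_id column on points, are assumed to be produced by the offline k-means step):
-- 1) find the nearest cluster center to the query point
SELECT cluster_id
FROM (SELECT cluster_id
      FROM centers
      ORDER BY ABS(c1 - :a1) + ABS(c2 - :a2) /* + ... one term per dimension */)
WHERE ROWNUM = 1;
-- 2) run the real top-X search over that cluster's points only
SELECT *
FROM points
WHERE cluster_id = :nearest_cluster
ORDER BY ABS(x1 - :a1) + ABS(x2 - :a2) /* + ... one term per dimension */
FETCH FIRST 10 ROWS ONLY;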

spatial data distance search - optimization options

Our business user loves for our searches to be done by distance; the problem is we have over 1 million records with a lat/long location. We are using SQL 2008, but we keep running into issues when we order or restrict our searches by distance: the queries take way too long (30 seconds plus). This is unacceptable; there has got to be a better way to do this. We have done everything we can with SQL 2008 and want to upgrade to 2012 at some point if we can.
I ask, though, whether there is another technology or optimization that we could apply. Could we switch to a different DB for faster performance, a different search algorithm, an estimation algorithm, trees, grids, pre-computation, etc.?
A solution that might be useful here would be to break your search into two parts:
1) Run a query that finds all records within a certain value plus or minus of the current lat/lng of your location; the where clause might look like:
where (@latitude > (lat - .001) and @latitude < (lat + .001)) and (@longitude > (lng - .001) and @longitude < (lng + .001))
Using this approach, and especially with an index on both the latitude and longitude columns, you can very quickly define a working set of locations within a specified distance.
2) With the rough results from step 1, use the great-circle/haversine method to determine the actual distance between the source location and each candidate point; a T-SQL sketch follows below.
Where this approach falls over is if there is no limit at all to the radius you are searching, but it works great if you are, for instance, looking to find all locations within a specific distance of a given point.
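Step 2 might look like the hedged sketch below (the locations table, its lat/lng columns, and the box sizes are illustrative; 3959 is the Earth's radius in miles, and DECLARE with an initializer works from SQL Server 2008 on):
-- search point
DECLARE @latitude  FLOAT = 41.922682;
DECLARE @longitude FLOAT = -87.654320;
-- exact haversine distance, computed only for the bounding-box candidates
SELECT *,
       3959 * 2 * ASIN(SQRT(
         POWER(SIN(RADIANS(lat - @latitude) / 2), 2) +
         COS(RADIANS(@latitude)) * COS(RADIANS(lat)) *
         POWER(SIN(RADIANS(lng - @longitude) / 2), 2))) AS distance_miles
FROM locations
WHERE lat BETWEEN @latitude - 0.15 AND @latitude + 0.15
  AND lng BETWEEN @longitude - 0.20 AND @longitude + 0.20
ORDER BY distance_miles;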