BigQuery SQL / GIS: Extend Radius Until Count Is Greater Than / Equal To 'N'

I have a very special kind of query to write. In PostGIS / BigQuery, I have a point. I can buffer this point by increments and perform an aggregation query, such as COUNT(DISTINCT ...), on the unique records that fall within the buffer. Once the count reaches a certain threshold, I would like to return the input value of the geographic object, i.e. its radius or diameter. This problem can be phrased as "how far do I have to keep going out until I hit 'n' [ids]?".
Finely incrementing the value of the buffer or radius will be insufferably slow and expensive. Can anyone think of a nice way to short-circuit this and offer a solution that arrives at the answer quickly (in BQ or PSQL terms!)?
Available GIS functions:
st_buffer()
st_dwithin()
Thank you!

You would have to order by distance and keep the N closest points. The <-> operator will use the spatial index.
SELECT *
FROM pointLayer p
ORDER BY p.geometry <-> st_makePoint(...)
LIMIT 10; --N
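
Since the question ultimately wants the radius itself, the distance to the N-th nearest point is exactly that value. A minimal PostGIS sketch along these lines (coordinates and SRID are hypothetical; the result is in the units of the layer's coordinate system):
SELECT ST_Distance(p.geometry,
                   ST_SetSRID(ST_MakePoint(-87.65, 41.92), 4326)) AS radius
FROM pointLayer p
ORDER BY p.geometry <-> ST_SetSRID(ST_MakePoint(-87.65, 41.92), 4326)
LIMIT 1 OFFSET 9;  -- N = 10: the 10th nearest point's distance
If the same id can appear on several rows, aggregate to one row per id first so the offset counts distinct ids.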

You don't need to increment the radius finely - I would rather double it, or maybe even increase it 10x, and once you have enough distinct records, take the N nearest ones.
I've used BigQuery scripting to solve a similar problem (with N=1, but it is easy to modify for any N: just use LIMIT N instead of LIMIT 1 and adjust the stopping condition):
https://medium.com/@mentin/nearest-neighbor-using-bq-scripting-373241f5b2f5
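
A minimal BigQuery scripting sketch of the doubling approach (the table, columns, and starting radius are hypothetical):
DECLARE radius FLOAT64 DEFAULT 100.0;  -- metres; hypothetical starting value
DECLARE cnt INT64 DEFAULT 0;

-- Double the radius until at least N distinct ids fall within it,
-- then the N nearest can be taken from that final buffer.
REPEAT
  SET radius = radius * 2;
  SET cnt = (
    SELECT COUNT(DISTINCT id)
    FROM `project.dataset.points`  -- hypothetical table
    WHERE ST_DWITHIN(geom, ST_GEOGPOINT(-87.65, 41.92), radius)
  );
UNTIL cnt >= 10  -- N
END REPEAT;

SELECT radius;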

Related

Index by geolocation in database

I'm trying to find a way to let my database support fast location-based searches (for example, all items that lie within a certain distance of some geopoint (LAT, LON)). I guess the brute-force solution, which calculates the distance between every point in the database and the query point, probably won't work for large datasets, so some kind of indexing is necessary. I'm not sure if there are any existing standard ways to do this (well, I know they are out there, but Google failed me), but here is a method (or more like a hack?) that I think might work:
Calculate a value from (LAT, LON) and store it in an indexed column. For example, something like floor(LAT / 10) * 10 * 100 + floor(LON / 10) * 10. Each time a query arrives, we first calculate this value for the query and find all the corresponding rows, then calculate the Euclidean distances between those points and the query point.
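A sketch of that grid-bucket idea in generic SQL (table and column names hypothetical), using the poster's formula as the cell key:
-- Precompute a coarse grid-cell key per row and index it
ALTER TABLE points ADD COLUMN grid_key INTEGER;
UPDATE points SET grid_key = FLOOR(lat / 10) * 10 * 100 + FLOOR(lon / 10) * 10;
CREATE INDEX idx_points_grid ON points (grid_key);

-- Query: hit the index first, then compute exact distances
-- only on the small candidate set from the matching cell
SELECT *
FROM points
WHERE grid_key = FLOOR(41.9 / 10) * 10 * 100 + FLOOR(-87.6 / 10) * 10;
Points near a cell boundary fall into a neighbouring cell, so a real implementation also has to check the adjacent cells' keys.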

spatial data distance search - optimization options

Our business user loves for our searches to be done by distance; the problem is we have over 1 million records with a lat/long location. We are using SQL 2008, but we keep running into issues when we order or restrict our searches by distance: the queries take way too long (30 seconds plus). This is unacceptable; there has got to be a better way to do this. We have done everything we can with SQL 2008 and want to upgrade to 2012 if we can at some point.
I ask, though, if there is another technology or optimization that we could apply. Could we switch to a different DB for faster performance, or apply a different search algorithm, estimation algorithm, tree, grid, pre-computation, etc.?
A solution that might be useful here would be to break your search into two parts:
1) Run a query where you find all records that are within a certain value + or - of the current lat/lng of your location; the where clause might look like:
where (@latitude > (lat - .001) and @latitude < (lat + .001)) and (@longitude > (lng - .001) and @longitude < (lng + .001))
Using this approach, and especially with an index on both the latitude and longitude columns, you can very quickly define a working set of locations within a specified distance.
2) with the rough results from step 1, use the great circle/haversine method to determine what the actual distance between the source location and each point is.
Where this approach falls over is if there is never any limit to the radius that you are searching, but it works great if you are for instance looking to find all locations within a specific distance of a given point.
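A sketch combining both steps in T-SQL (table and column names are hypothetical; the 0.2-degree box is a deliberate over-estimate for a 10-mile radius):
DECLARE @lat FLOAT = 41.9227, @lng FLOAT = -87.6543, @radiusMiles FLOAT = 10;

SELECT *
FROM (
    SELECT l.*,
           -- great-circle distance in miles
           3959 * ACOS(
               SIN(RADIANS(@lat)) * SIN(RADIANS(l.lat)) +
               COS(RADIANS(@lat)) * COS(RADIANS(l.lat)) *
               COS(RADIANS(l.lng) - RADIANS(@lng))
           ) AS distance
    FROM locations l
    -- step 1: index-friendly bounding box (0.2 deg of latitude is roughly 14 miles)
    WHERE l.lat BETWEEN @lat - 0.2 AND @lat + 0.2
      AND l.lng BETWEEN @lng - 0.2 AND @lng + 0.2
) AS candidates
WHERE distance <= @radiusMiles;  -- step 2: exact filter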

oracle geocoding query is really slow, how could i optimize a dynamic field?

I have an Oracle table with ~12K records (gyms), and the query below takes approximately 0.3s:
SELECT (acos(sin(41.922682*0.017453293) * sin(to_number(LATITUDE)*0.017453293)
      + cos(41.922682*0.017453293) * cos(to_number(LATITUDE)*0.017453293)
        * cos(to_number(LONGITUDE)*0.017453293 - (-87.65432*0.017453293))) * 3959) as distance
FROM gym
FROM gym
However, I would like to return all of the records where distance <= 10, and as soon as I run the following query, my query execution time jumps up to ~5.0s:
SELECT *
FROM (SELECT (acos(sin(41.922682*0.017453293) * sin(to_number(LATITUDE)*0.017453293)
            + cos(41.922682*0.017453293) * cos(to_number(LATITUDE)*0.017453293)
              * cos(to_number(LONGITUDE)*0.017453293 - (-87.65432*0.017453293))) * 3959) as distance
      FROM gym)
WHERE distance <= 10
ORDER BY distance asc
Any idea how I can optimize this in Oracle?
Most important:
use a where clause to exclude all longitudes and latitudes that will be more than 10 km/miles (?) away from your point. So you only need to make your calculation for the window within a 10km/miles block.
as a very rough approximation you could use 0.1 degree as a rule; this is 11 km at the equator, and less elsewhere
so add
WHERE abs(longitude - (-87.65)) < 0.1 and abs(latitude - 41.922) < 0.1
(If you use nested queries, add this to the deepest level)
Since your distances are smaller than 10 km or miles, you can consider the length of one degree of latitude/longitude as constant, and calculate each once using your formula. Then you can use Pythagoras' theorem to calculate the distance (after adding the bounding box). This is basically why people usually use projected data for calculations.
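Putting the bounding box and the Pythagoras shortcut together, a rough sketch (the per-degree mile constants are approximations for this latitude):
SELECT *
FROM (SELECT g.*,
             -- Pythagoras on degrees pre-scaled to miles: ~69 mi per degree of
             -- latitude, ~51 mi per degree of longitude at ~42 degrees north
             SQRT(POWER((TO_NUMBER(LATITUDE) - 41.922682) * 69, 2) +
                  POWER((TO_NUMBER(LONGITUDE) - (-87.65432)) * 51, 2)) AS distance
      FROM gym g
      WHERE ABS(TO_NUMBER(LATITUDE) - 41.922682) < 0.2
        AND ABS(TO_NUMBER(LONGITUDE) - (-87.65432)) < 0.2)
WHERE distance <= 10
ORDER BY distance;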
Other things:
order by is slow if you don't have a supporting index. Do you need to order?
save your longitude and latitude as numbers in your table. Why would you store them differently in a database?
With money. Specifically, Oracle Spatial.
1) How are you measuring 0.3 seconds for the first query? I'll wager that you are measuring the time required to fetch the first row rather than the time required to fetch the last row. Most client tools will start displaying results long before the database has finished producing them if that is possible (which it almost certainly is if there is no ORDER BY). So you're probably measuring the time required by the first query to calculate the distance to the first 50 or 500 gyms against the time required by the last query to calculate the distance to all 12,000 gyms.
2) Oracle Locator is a feature that comes with all editions of the Oracle database that includes the ability to use spatial indexes and that provides built-in methods for computing distance. It's not nearly as powerful as Oracle Spatial but it should be more than sufficient for what you're discussing here (see the sketch after this list).
3) If you want to roll your own, I'd second johanvdw's suggestion of using a bounding box.
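As an illustration of point 2, Locator's SDO_WITHIN_DISTANCE operator can apply the radius filter against a spatial index directly (assuming the gyms were loaded into a hypothetical SDO_GEOMETRY column named geom):
SELECT *
FROM gym g
WHERE SDO_WITHIN_DISTANCE(
        g.geom,
        SDO_GEOMETRY(2001, 4326,
                     SDO_POINT_TYPE(-87.65432, 41.922682, NULL),
                     NULL, NULL),
        'distance=10 unit=MILE') = 'TRUE';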

Efficient way to compute accumulating value in sqlite3

I have an sqlite3 table that tells when I gain/lose points in a game. Sample/query result:
SELECT time,p2 FROM events WHERE p1='barrycarter' AND action='points'
ORDER BY time;
1280622305|-22
1280625580|-9
1280627919|20
1280688964|21
1280694395|-11
1280698006|28
1280705461|-14
1280706788|-13
[etc]
I now want my running point total. Given that I start w/ 1000 points,
here's one way to do it.
SELECT DISTINCT(time),
       (SELECT 1000 + SUM(p2)
        FROM events e
        WHERE p1 = 'barrycarter' AND action = 'points'
          AND e.time <= e2.time) AS points
FROM events e2
WHERE p1 = 'barrycarter' AND action = 'points'
ORDER BY time
but this is highly inefficient. What's a better way to write this?
MySQL has @variables so you can do things like:
SELECT time, @tot := @tot + points ...
but I'm using sqlite3 and the above isn't ANSI standard SQL anyway.
More info on the db if anyone needs it: http://ccgames.db.94y.info/
EDIT: Thanks for the answers! My dilemma: I let anyone run any
single SELECT query on "http://ccgames.db.94y.info/". I want to give
them useful access to my data, but not to the point of allowing
scripting or allowing multiple queries with state. So I need a single
SQL query that can do accumulation. See also:
Existing solution to share database data usefully but safely?
SQLite is meant to be a small embedded database. Given that definition, it is not unreasonable to find many limitations with it. The task at hand is not cleanly solvable using SQLite alone; as you have found, a pure-SQL approach will be terribly slow. The query you have written is a triangular cross join that will not scale, or rather, will scale badly.
The most efficient way to tackle the problem is through the program that is making use of SQLite; e.g. if you were using Web SQL in HTML5, you could easily accumulate in JavaScript.
There is a discussion about this problem in the sqlite mailing list.
Your 2 options are:
Iterate through all the rows with a cursor and calculate the running sum on the client.
Store sums instead of, or as well as storing points. (if you only store sums you can get the points by doing sum(n) - sum(n-1) which is fast).
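A minimal sketch of the second option (assuming a hypothetical extra total column on events; 1000 is the starting balance from the question): maintain the running total at insert time so reads become a plain lookup.
-- keep total up to date as rows arrive
CREATE TRIGGER set_running_total AFTER INSERT ON events
WHEN NEW.action = 'points'
BEGIN
  UPDATE events
  SET total = NEW.p2 + COALESCE(
        (SELECT total FROM events
         WHERE p1 = NEW.p1 AND action = 'points' AND time < NEW.time
         ORDER BY time DESC LIMIT 1),
        1000)
  WHERE rowid = NEW.rowid;
END;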

Efficient division by 128, in tsql

As part of implementing a half-life behaviour, I need to perform x = x - x / 128 on around a hundred thousand rows every few days. Is tsql smart enough to do the division by 128 efficiently (as a bit-shift), or is it just as efficient to divide by 130?
If tsql isn't smart enough, is there anything clever that I can do to make it more efficient?
A hundred thousand rows isn't enough for the difference in performance between a divide operation and a shift operation to even be measurable, especially if you only have to do it every few days. Better to spend your time worrying about other issues.
You could use a computed column with the PERSISTED flag to ensure that the values were physically stored and not recomputed every time they were displayed. That could (possibly, depending on your particular circumstances) be more efficient.
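For instance (hypothetical table and column names):
-- the decayed value is computed at write time and stored, not recomputed per read
ALTER TABLE scores ADD x_decayed AS (x - x / 128) PERSISTED;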
More likely you will have problems with integer math. I don't know what your x values are, but if they are also integers, you want to divide by 128.0 if you don't want the result truncated to an integer.
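A short illustration of the difference (hypothetical table):
-- if x is an INT, x / 128 truncates: 100 / 128 = 0, so small values never decay
UPDATE scores SET x = x - x / 128;

-- a decimal divisor forces decimal arithmetic before the result is stored back
UPDATE scores SET x = x - x / 128.0;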