How to overcome the limitation of `PERCENTILE_CONT` that the argument should be constant? - sql

I want to find POINTCOUNT values that cut the input set ADS.PREDICTOR into equally large groups. The parameter POINTCOUNT can have a different value for different predictors, so I don't want to hard-code it.
Unfortunately the code below fails with ORA-30496: Argument should be a constant... How can I overcome this (except for 300 lines of code with hard-coded threshold quantiles, of course)?
define POINTCOUNT=300;
select
*
from (
select
percentile_disc(MYQUANTILE)
within group (
order by PREDICTOR ) as THRESHOLD
from ADS
inner join (
select (LEVEL - 1)/(&POINTCOUNT-1) as MYQUANTILE
from dual
connect by LEVEL <= &POINTCOUNT
)
on 1=1
)
group by THRESHOLD
I want to draw a ROC curve. The curve will be plotted in Excel as a linear interpolation between pairs of points (X, Y) calculated in Oracle.
Each point (X, Y) is calculated using a threshold value.
I will get the best approximation of the ROC curve for a given number of points if the distance between each adjacent pair of (X, Y) points is uniform.
If I cut the domain of the predicted values into N values that separate 1/Nth quantiles, I should get a fairly good set of threshold values.

PERCENTILE_CONT() only requires that the percentile value be constant within each group. You do not have a group by in your subquery, so I think this might fix your problem:
select MYQUANTILE,
percentile_disc(MYQUANTILE) within group (order by PREDICTOR
) as THRESHOLD
from ADS cross join
(select (LEVEL - 1)/(&POINTCOUNT-1) as MYQUANTILE
from dual
connect by LEVEL <= &POINTCOUNT
)
GROUP BY MYQUANTILE;
Also, note that CROSS JOIN is the same as INNER JOIN . . . ON 1=1.
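To sanity-check the quantile grid outside the database, here is a minimal Python sketch of the same idea. It assumes PERCENTILE_DISC returns the smallest value whose cumulative share of the rows is at least the requested percentile; the helper names `percentile_disc` and `thresholds` are made up for this illustration, and `point_count` must be at least 2.

```python
import math
from fractions import Fraction


def percentile_disc(sorted_values, p):
    """Smallest value whose cumulative share of the data is at least p.

    This mirrors the assumed PERCENTILE_DISC semantics; it is not Oracle code.
    """
    n = len(sorted_values)
    # ceil(p * n) is the 1-based rank; clamp so that p = 0 maps to the first value.
    rank = max(math.ceil(p * n), 1)
    return sorted_values[rank - 1]


def thresholds(values, point_count):
    """Distinct thresholds at the quantiles k / (point_count - 1), k = 0 .. point_count - 1."""
    ordered = sorted(values)
    # Fractions reproduce (LEVEL - 1) / (POINTCOUNT - 1) without float rounding.
    quantiles = [Fraction(k, point_count - 1) for k in range(point_count)]
    # The set plays the role of GROUP BY THRESHOLD: duplicates collapse.
    return sorted({percentile_disc(ordered, q) for q in quantiles})


print(thresholds(range(1, 11), 3))  # [1, 5, 10]
```

With heavily tied predictor values, several quantiles can land on the same threshold, so fewer than POINTCOUNT distinct thresholds may come back.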

Related

BigQuery Geography function to compute cartesian product of distances

I need to calculate the cartesian product of distances for an array of GEO points.
More specifically, I need the distance from the first point to each of the other points, then the distance from the second point to all the other points, and so on.
Wondering if there is a BigQuery Geography function that can carry out this type of computation efficiently.
The alternative is to do it explicitly pair by pair which is kind of a brute force approach.
POINT(-95.665885 29.907145)
POINT(-95.636533 29.757219)
POINT(-95.652796 29.89204)
POINT(-84.27087 33.991642)
POINT(-84.466853 33.987008)
I tried to create an example using your example data:
-- Create a temporary table with your data
with t as
(Select * from UNNEST(
['POINT(-95.665885 29.907145)',
'POINT(-95.636533 29.757219)',
'POINT(-95.652796 29.89204)',
'POINT(-84.27087 33.991642)',
'POINT(-84.466853 33.987008)']) f
)
-- Cross join the table with itself, skip pairs of a point with itself, and compute the distance between the two points
select
t.f point_1,
t2.f point_2,
ST_DISTANCE(ST_GeogFromText(t.f), ST_GeogFromText(t2.f))
from t cross join t t2
where t.f <> t2.f
group by point_1, point_2
Is it what you are looking for?
Of course, it can be optimized if you consider that the distance between two points is the same regardless of their order.
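To illustrate that last point, here is a small Python sketch that visits each unordered pair only once. The haversine formula and the Earth radius constant are assumptions made for this example; they are not claimed to match what ST_DISTANCE does internally.

```python
import math
from itertools import combinations

EARTH_RADIUS_M = 6371008.8  # assumed mean Earth radius in meters


def haversine_m(p, q):
    """Great-circle distance in meters between two (lon, lat) points given in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# The five points from the question, as (lon, lat).
points = [
    (-95.665885, 29.907145),
    (-95.636533, 29.757219),
    (-95.652796, 29.89204),
    (-84.27087, 33.991642),
    (-84.466853, 33.987008),
]

# combinations() yields each unordered pair once: n*(n-1)/2 rows instead of n*(n-1).
pair_distances = {(p, q): haversine_m(p, q) for p, q in combinations(points, 2)}
print(len(pair_distances))  # 10
```

In the query, the same halving could probably be had by comparing with `<` instead of `<>` in the WHERE clause, so that each pair appears in one order only.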

Why is PostGIS stacking all points on top of each other? (Using ST_DWithin to find all results within 1000m radius)

New to PostGIS/PostgreSQL...any help would be greatly appreciated!
I have two tables in a postgres db aliased as gas and ev. I'm trying to choose a specific gas station (gas.site_id=11949) and locate all EV/alternative fuel charging stations within a 1000m radius. When I run the following though, PostGIS returns a number of ev stations that are all stacked on top of each other in the map (see screenshot).
Anyone have any idea why this is happening? How can I get PostGIS to visualize the points within a 1000m radius of the specified gas station?
with myplace as (
SELECT gas.geom
from nj_gas gas
where gas.site_id = 11949 limit 1)
select myplace.*, ev.*
from alt_fuel ev, myplace
where ST_DWithin(ev.geom1, myplace.geom, 1000)
The function ST_DWithin does not compute distances in meters using geometry typed parameters.
From the documentation:
For geometry: The distance is specified in units defined by the
spatial reference system of the geometries. For this function to make
sense, the source geometries must both be of the same coordinate
projection, having the same SRID.
So, if you want to compute distances in meters, you have to use the data type geography:
For geography units are in meters and measurement is defaulted to
use_spheroid=true, for faster check, use_spheroid=false to measure
along sphere.
That all being said, you have to cast the data type of your geometries. Besides that your query looks just fine - considering your data is correct :-)
WITH myplace as (
SELECT gas.geom
FROM nj_gas gas
WHERE gas.site_id = 11949 LIMIT 1)
SELECT myplace.*, ev.*
FROM alt_fuel ev, myplace
WHERE ST_DWithin(ev.geom1::GEOGRAPHY, myplace.geom::GEOGRAPHY, 1000)
Sample data:
CREATE TABLE t1 (id INT, geom GEOGRAPHY);
INSERT INTO t1 VALUES (1,'POINT(-4.47 54.22)');
CREATE TABLE t2 (geom GEOGRAPHY);
INSERT INTO t2 VALUES ('POINT(-4.48 54.22)'),('POINT(-4.41 54.18)');
Query
WITH j AS (
SELECT geom FROM t1 WHERE id = 1 LIMIT 1)
SELECT ST_AsText(t2.geom)
FROM j,t2 WHERE ST_DWithin(t2.geom, j.geom, 1000);
st_astext
--------------------
POINT(-4.48 54.22)
(1 row)
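To see why a distance of 1000 matches everything when the arguments are geometry, here is a rough Python sketch. It assumes the geometries are stored as lon/lat degrees (for example SRID 4326), which the question does not state explicitly; `planar_distance` is a made-up helper, not a PostGIS function.

```python
import math


def planar_distance(p, q):
    """Euclidean distance in the coordinates' own units (degrees for lon/lat data)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


# Largest possible planar separation of two lon/lat points, in degrees.
max_separation = planar_distance((-180.0, -90.0), (180.0, 90.0))
print(max_separation < 1000)  # True: every pair is "within 1000" degrees

# The second sample point is several kilometers from the reference point,
# yet it is only a fraction of a degree away in planar terms.
print(planar_distance((-4.47, 54.22), (-4.41, 54.18)) < 1000)  # True
```

Under that assumption, the geometry version of the filter cannot exclude anything, whereas the geography version in the sample query keeps only the point that is really within 1000 m.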
You are cross joining those tables and have PostgreSQL return the cartesian product of both when selecting myplace.* & ev.*.
So while there is only one row in myplace, its geom will be merged with every row of alt_fuel (i.e. the result set will have all columns of both tables in every possible combination of both); since the result set thus has two geometry columns, your client application likely chooses either the first, or the one called geom (as opposed to alt_fuel.geom1) to display!
I don't see that you are interested in myplace.geom in the result set anyway, so I suggest to run
WITH
myplace as (
SELECT gas.geom
FROM nj_gas gas
WHERE gas.site_id = 11949
LIMIT 1
)
SELECT ev.*
FROM alt_fuel AS ev
JOIN myplace AS mp
ON ST_DWithin(ev.geom1, mp.geom, 1000) -- ST_DWithin(ev.geom1::GEOGRAPHY, mp.geom::GEOGRAPHY, 1000)
;
If, for some reason, you also want to display myplace.geom along with the stations, you'd have to UNION[ ALL] the above with a SELECT * on myplace; note that you will also have to provide the same column list and structure (same data types!) as alt_fuel.* (or better, the other side of the UNION[ ALL]) in that SELECT!
Note the suggestions made by @JimJones about units: if your data is not projected in a meter-based CRS (but is in a geographic reference system, 'LonLat'), use the cast to GEOGRAPHY so that ST_DWithin treats the distance as meters (and calculates using spheroidal algebra instead of planar (Euclidean) algebra)!
Resolved by using:
WITH
myplace as (
SELECT geom as g
FROM nj_gas
WHERE site_id = 11949 OR site_id = 11099 OR site_id = 11679 or site_id = 480522
), myresults AS (
SELECT ev.*
FROM alt_fuel AS ev
JOIN myplace AS mp
ON ST_DWithin(ev.geom, mp.g, 0.1))
select * from myresults
Thanks so much for your help @ThingumaBob and @JimJones! Greatly appreciate it.

How to group points by defined count and return a bounding box for each group?

I have a large table of very irregularly spaced points. Given a user-defined bounding box, I would like to return rows of sub-bounding boxes that represent an equal number of points. The shape of the sub-boxes does not matter, as long as all points in the user-defined bounding box are represented and counted.
This is the logic I'm trying to implement:
select all points where intersects user's bounding box.
order all points by x value
group ordered points where count <= 1000
return ST_Extent of each group.
I'm not really sure where to begin, since I don't have a lot of experience with SQL and PostGIS, but something like this...?
SELECT
ST_Extent(geom) as extent,
c.count
FROM
xyz_master as x,
(
SELECT
COUNT(*) as count
FROM
xyz_master
) as c
WHERE
c.count < 1000
GROUP BY
extent
;
And, of course, Postgres responds with this:
ERROR: aggregate functions are not allowed in GROUP BY
LINE 3: ST_Extent(geom) as extent
I realize the subquery doesn't really make much sense, since it's just returning one row with a count of all points, but I have no idea where to begin.
Can anyone point me in the right direction?
Thanks.
Sort the table by x, then put each consecutive block of 1000 rows into its own group. Using CEIL is one way to do it. Note that in the following code you have to replace xmin, ymin, xmax, ymax, and srid with the values provided by the user:
SELECT ST_EXTENT(t2.geom) extent, COUNT(*) count
FROM (
SELECT t1.geom, ROW_NUMBER() OVER (ORDER BY ST_X(t1.geom)) row_num
FROM xyz_master t1
WHERE ST_INTERSECTS(ST_MakeEnvelope(xmin, ymin, xmax, ymax, srid), t1.geom)
) t2
GROUP BY CEIL(t2.row_num / 1000.0);
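The grouping trick can be tried out on a handful of points in plain Python. This is only a sketch of the same logic (sort by x, number the rows, bucket by the ceiling of row number over group size); `grouped_extents` is a hypothetical helper and the bounding-box filter is left out.

```python
import math


def grouped_extents(points, group_size):
    """Return [((xmin, ymin, xmax, ymax), count), ...] for consecutive x-sorted groups."""
    ordered = sorted(points, key=lambda p: p[0])
    groups = {}
    # row_num starts at 1, like ROW_NUMBER(); CEIL(row_num / size) is the group key.
    for row_num, point in enumerate(ordered, start=1):
        groups.setdefault(math.ceil(row_num / group_size), []).append(point)
    return [
        (
            (
                min(x for x, _ in group),
                min(y for _, y in group),
                max(x for x, _ in group),
                max(y for _, y in group),
            ),
            len(group),
        )
        for group in groups.values()
    ]


sample = [(5, 1), (1, 2), (3, 0), (2, 9), (4, 4)]
for extent, count in grouped_extents(sample, 2):
    print(extent, count)
```

Every group holds exactly `group_size` points except possibly the last one, which takes the remainder.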

One-dimensional earth mover's distance in BigQuery/SQL

Let P and Q be two finite probability distributions on integers, with support between 0 and some large integer N. The one-dimensional earth mover's distance between P and Q is the minimum cost you have to pay to transform P into Q, considering that it costs r*|n-m| to "move" a probability r associated to integer n to another integer m.
There is a simple algorithm to compute this. In pseudocode:
previous = 0
sum = 0
for i from 0 to N:
previous = P(i) - Q(i) + previous
sum = sum + abs(previous) // abs = absolute value
return sum
Now, suppose you have two tables that each contain a probability distribution. Column n contains integers, and column p contains the corresponding probability. The tables are correct (all probabilities are between 0 and 1, and their sum is 1). I want to compute the earth mover's distance between these two tables in BigQuery (Standard SQL).
Is it possible? I feel like one would need to use analytical functions, but I don't have much experience with them, so I don't know how to get there.
What if N (the maximum integer) is very large, but my tables are not? Can we adapt the solution to avoid doing a computation for each integer i?
Hopefully I fully understand your problem. This seems to be what you're looking for:
WITH Aggr AS (
SELECT rp.n AS n, SUM(rp.p - rq.p)
OVER(ORDER BY rp.n ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS emd
FROM P rp
LEFT JOIN Q rq
ON rp.n = rq.n
) SELECT SUM(ABS(a.emd)) AS total_emd
FROM Aggr a;
WRT question #2, note that we only scan what's actually in tables, regardless of the N, assuming a one-to-one match for every n in P with n in Q.
I adapted Michael's answer to fix its issues; here's the solution I ended up with. Suppose the integers are stored in column i and the probability in column p. First I join the two tables, then I compute EMD(i) for all i using the window, then I sum all absolute values.
WITH
joined_table AS (
SELECT
IFNULL(table1.i, table2.i) AS i,
IFNULL(table1.p, 0) AS p,
IFNULL(table2.p, 0) AS q,
FROM table1
FULL OUTER JOIN table2
ON table1.i = table2.i
),
aggr AS (
SELECT
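-- Running sum of (p - q) up to i, weighted by the gap to the next i;
-- the last row has no next i, so its term is NULL and SUM ignores it.
(SUM(p-q) OVER (win ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)) * ((LEAD(i,1) OVER win) - i) AS emd
FROM joined_table
WINDOW win AS (
ORDER BY i
)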
)
SELECT SUM(ABS(emd)) AS total_emd
FROM aggr
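For checking either query on small inputs, here is a Python sketch of the pseudocode from the question together with a variant that only visits integers present in one of the tables. It assumes both distributions sum to 1, so the running difference is zero after the last support point; `emd_dense` and `emd_sparse` are names invented for this example, and the distributions are plain dicts mapping integer to probability.

```python
def emd_dense(p, q, n_max):
    """Direct transcription of the pseudocode: one step per integer 0..n_max."""
    previous = 0.0
    total = 0.0
    for i in range(n_max + 1):
        previous += p.get(i, 0.0) - q.get(i, 0.0)
        total += abs(previous)
    return total


def emd_sparse(p, q):
    """Same value, but only visiting integers that occur in p or q.

    Between two consecutive support points the running difference is constant,
    so it is counted once and multiplied by the gap to the next support point.
    """
    support = sorted(set(p) | set(q))
    running = 0.0
    total = 0.0
    for current, following in zip(support, support[1:]):
        running += p.get(current, 0.0) - q.get(current, 0.0)
        total += abs(running) * (following - current)
    return total


P = {0: 0.5, 10: 0.5}
Q = {0: 1.0}
print(emd_dense(P, Q, 10), emd_sparse(P, Q))  # 5.0 5.0
```

Here half of the mass has to travel a distance of 10, so the cost is 5 in both versions, while the sparse one does two steps instead of eleven.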

Thin out many ST_Points

I have many (1,000,000) ST_Points in a Postgres database with the PostGIS
extension. When I show them on a map, browsers get very busy.
So I would like to write an SQL statement that reduces areas of high density to a single point.
When a user zooms out far enough that 100 ST_Points fall together, Postgres should return only one,
but only if these points are close together.
I tried it with this statement:
select a.id, count(*)
from points as a, points as b
where st_dwithin(a.location, b.location, 0.001)
and a.id != b.id
group by a.id
I would call it thinning out, but I did not find anything, maybe because I'm
not a native English speaker.
Does anybody have some suggestions?
I agree with tcarobruce that clustering is the term you are looking for. But it is available in postgis.
Basically, clustering can be achieved by reducing the number of decimals in X and Y and grouping on them:
select
count(*),
round(cast (ST_X(geom) as numeric),3),
round(cast (ST_Y(geom) as numeric),3)
from mytable
group by
round(cast (ST_X(geom) as numeric),3),
round(cast (ST_Y(geom) as numeric),3)
This results in a table with coordinates and the number of real points at each coordinate. This particular sample rounds to 3 decimals, i.e. 0.001, as in your initial statement.
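The same rounding idea can be tried on a few coordinates in Python. This is just a sketch; `grid_counts` is a made-up helper, and Python's `round` on floats can differ from PostgreSQL's `round` on numeric for values that sit exactly on a half step.

```python
from collections import Counter


def grid_counts(points, decimals=3):
    """Count how many points fall on each rounded (x, y) coordinate."""
    return Counter((round(x, decimals), round(y, decimals)) for x, y in points)


points = [(1.0001, 2.0002), (1.0004, 2.0003), (5.0, 5.0)]
for cell, count in grid_counts(points).items():
    print(cell, count)
```

The first two points share a cell and come back as one row with a count of 2, which is the thinning effect the query relies on.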
You can cluster nearby Points together using ST_ClusterDBSCAN
Then keep all single points and for example:
Select one random Point per cluster
or
Select the centroid of each Point cluster.
I use eps 300 to cluster points together that are within 300 meters.
create table buildings_grouped as
SELECT geom, ST_ClusterDBSCAN(geom, eps := 300, minpoints := 2) over () AS cid
FROM buildings
1:
create table buildings_grouped_keep_random as
select geom, cid from buildings_grouped
where cid is null
union
select * from
(SELECT DISTINCT ON (cid) *
FROM buildings_grouped
ORDER BY cid, random()) sub
2:
create table buildings_grouped_keep_centroid as
select geom, cid from buildings_grouped
where cid is null
union
select st_centroid(st_union(geom)) geom, cid
from buildings_grouped
where cid is not null
group by cid
The term you are looking for is "clustering".
There are client-side libraries that do this, as well as commercial services that do it server-side.
But it's not something PostGIS does natively. (There's a ticket for it.)
You'll probably have to write your own solution, and precompute your clusters ahead of time.
ST_ClusterDBSCAN- and KMeans-based clustering works, but it is very slow for big data sets, so it is practically unusable there. PostGIS functions like ST_SnapToGrid and ST_RemoveRepeatedPoints are faster and can help in some cases. But the best approach, I think, is using PDAL thinning filters such as the sample filter. I recommend using it with PG Point Cloud.
Edit:
ST_SnapToGrid is pretty fast and useful. Here is the example query for triangulation with optimizations:
WITH step1 AS
(
SELECT geometry, ST_DIMENSION(geometry) AS dim FROM table
)
, step2 AS
(
SELECT ST_SIMPLIFYVW(geometry, :tolerance) AS geometry FROM step1 WHERE dim > 0
UNION ALL
(WITH q1 AS
(
SELECT (ST_DUMP(geometry)).geom AS geometry FROM step1 WHERE dim = 0
)
SELECT ST_COLLECT(DISTINCT(ST_SNAPTOGRID(geometry, :tolerance))) FROM q1)
)
, step3 AS
(
SELECT ST_COLLECT(geometry) AS geometry FROM step2
)
SELECT ST_DELAUNAYTRIANGLES(geometry, :tolerance, 0)::BYTEA AS geometry
FROM step3
OFFSET :offset LIMIT :limit;