SQL ORDER BY before TABLESAMPLE BERNOULLI, or in Python

I am performing a query on PostgreSQL using Python (psycopg2).
The data is point geometries, stored in patches of 600 points per patch.
I am trying to streamline and speed up the process. Previously I would:
1. explode the geometry
2. order by x, y, z
3. save the result to a new table
4. use TABLESAMPLE BERNOULLI(1) to sample the data down to 1%
5. save back to the database
To speed things up I'm trying to reduce the amount of writing to the database and keep the data in python as much as possible.
The old code:
Exploding the patches
query = sql.SQL("""INSERT INTO {}.{} (x, y, z)
    SELECT
        st_x(PC_EXPLODE(pa)::geometry) AS x,
        st_y(PC_EXPLODE(pa)::geometry) AS y,
        st_z(PC_EXPLODE(pa)::geometry) AS z
    FROM "public".{} ORDER BY x, y, z;""").format(
    *map(sql.Identifier, (schema_name, table_name2, table_name1)))
sampling the data:
query2 = ("CREATE TABLE {}.{} AS (SELECT * FROM {}.{} TABLESAMPLE BERNOULLI ({}))".format(
schema, table_name_base, schema, imported_table_name_base, sample_base))
This works, but I would like to either:
A) Perform this as a single query, so explode --> order by --> sample.
B) Perform the explode in SQL, then sample in python.
For A) I have attempted to nest/subquery but PostgreSQL will not allow TABLESAMPLE to work on anything that isn't a table or a view.
For B) I use data = gpd.read_postgis(query, con=conn) to get the data directly into a geopandas dataframe, so sorting is then easy, but how do I perform the equivalent of TABLESAMPLE BERNOULLI to a geopandas dataframe?
Option A is my preferred option, but it might be useful to test option B in case I end up allowing different sampling methods.
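For option B, the closest pandas/geopandas equivalent is df.sample(frac=0.01), though strictly TABLESAMPLE BERNOULLI keeps each row independently with probability p rather than returning an exact fraction. A stdlib sketch of that per-row behaviour (plain Python rows standing in for the dataframe; names are hypothetical):

```python
import random

def bernoulli_sample(rows, percent, seed=None):
    """Keep each row independently with probability percent/100,
    mimicking TABLESAMPLE BERNOULLI(percent)."""
    rng = random.Random(seed)
    p = percent / 100.0
    return [row for row in rows if rng.random() < p]

rows = list(range(100_000))
sample = bernoulli_sample(rows, 1, seed=42)
# Expected sample size is ~1000 but varies run to run (unless seeded),
# which is exactly how BERNOULLI behaves in Postgres.
```

With a geopandas dataframe, gdf.sample(frac=0.01) instead returns an exact 1% of the rows; if the per-row independence of BERNOULLI matters, a boolean mask drawn from a uniform RNG reproduces it.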
Edit:
This is the visual result of:
query = """
SELECT
PC_EXPLODE(pa)::geometry as geom,
st_x(PC_EXPLODE(pa)::geometry) as x,
st_y(PC_EXPLODE(pa)::geometry) as y,
st_z(PC_EXPLODE(pa)::geometry) as z
FROM {}.{}
TABLESAMPLE BERNOULLI ({})
ORDER BY x,y,z, geom
;
""".format(schema, pointcloud, sample)

I am a little lost. A random sample is a random sample and doesn't depend on the ordering. If you want a sample that depends on the ordering, then take every nth row instead. That would be:
select t.*
from (select t.*,
             row_number() over (order by x, y, z) as seqnum
      from (select st_x(PC_EXPLODE(pa)::geometry) as x,
                   st_y(PC_EXPLODE(pa)::geometry) as y,
                   st_z(PC_EXPLODE(pa)::geometry) as z
            from "public".{}
           ) t
     ) t
where seqnum % 100 = 1;
Or perhaps you just want to take the sample and then order afterwards, which you can also do with a subquery.
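Once the exploded points are in Python, the same every-nth idea is a one-liner; a minimal sketch (synthetic tuples standing in for the exploded rows):

```python
def nth_sample(points, n=100):
    """Sort by (x, y, z) and keep every n-th row,
    like the seqnum % 100 = 1 filter above."""
    ordered = sorted(points)
    return ordered[::n]  # rows 1, n+1, 2n+1, ... in SQL terms

pts = [(x % 7, x % 5, x) for x in range(1000)]
sample = nth_sample(pts, 100)
# one row out of every 100, in sorted order
```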

Related

Implementing a SQL query without window functions

I have read that it is possible to implement anything you might do in a SQL window function, with creative use of joins, etc, but I cannot figure out how. I'm using SQLite in this project, which doesn't currently have window functions.
I have a table with four columns:
CREATE TABLE foo (
id INTEGER PRIMARY KEY,
x REAL NOT NULL,
y REAL NOT NULL,
val REAL NOT NULL,
UNIQUE(x,y));
and a convenience function DIST(x1, y1, x2, y2) that returns the distance between two points.
What I want: for every row in that table, I want the row from that same table within a certain distance [e.g. 25 km] with the lowest "val". For rows with the same "val", I want to use the lowest distance as a tie breaker.
My current solution is running n+1 queries, which works but is yucky:
SELECT * FROM foo;
... then, for each row returned, I run [where "src" is the row I just got]:
SELECT * FROM foo
WHERE DIST(foo.x, foo.y, src.x, src.y)<25
ORDER BY val ASC, DIST(foo.x, foo.y, src.x, src.y) ASC
LIMIT 1
But I really want it in a single query, partially for my own interest, and partially because it makes it much easier to work with some other tools I have.
Use your query to get the ID of the wanted row, then use that to join the tables:
SELECT *
FROM (SELECT foo.*,
             (SELECT id
              FROM (SELECT id,
                           x,
                           y,
                           foo.x AS foo_x,
                           foo.y AS foo_y,
                           val
                    FROM foo)
              WHERE DIST(foo_x, foo_y, x, y) < 25
              ORDER BY val, DIST(foo_x, foo_y, x, y)
              LIMIT 1
             ) AS id2
      FROM foo)
JOIN foo AS foo2 ON id2 = foo2.id;

Reuse computed select value

I'm trying to use ST_SnapToGrid and then GROUP BY the grid cells (x, y). Here is what I did first:
SELECT
COUNT(*) AS n,
ST_X(ST_SnapToGrid(geom, 50)) AS x,
ST_Y(ST_SnapToGrid(geom, 50)) AS y
FROM points
GROUP BY x, y
I don't want to recompute ST_SnapToGrid for both x and y. So I changed it to use a sub-query:
SELECT
COUNT(*) AS n,
ST_X(geom) AS x,
ST_Y(geom) AS y
FROM (
SELECT
ST_SnapToGrid(geom, 50) AS geom
FROM points
) AS tmp
GROUP BY x, y
But when I run EXPLAIN, both of these queries have the exact same execution plan:
GroupAggregate (...)
-> Sort (...)
Sort Key: (st_x(st_snaptogrid(points.geom, 0::double precision))), (st_y(st_snaptogrid(points.geom, 0::double precision)))
-> Seq Scan on points (...)
Question: Will PostgreSQL reuse the result value of ST_SnapToGrid()?
If not, is there a way to make it do this?
Test timing
You don't see the evaluation of individual functions per row in the EXPLAIN output.
Test with EXPLAIN ANALYZE to get actual query times to compare overall effectiveness. Run a couple of times to rule out caching artifacts. For simple queries like this, you get more reliable numbers for the total runtime with:
EXPLAIN (ANALYZE, TIMING OFF) SELECT ...
Requires Postgres 9.2+. Per documentation:
TIMING
Include actual startup time and time spent in each node in the output. The overhead of repeatedly reading the system clock can slow down the query significantly on some systems, so it may be useful to set this parameter to FALSE when only actual row counts, and not exact times, are needed. Run time of the entire statement is always measured, even when node-level timing is turned off with this option. This parameter may only be used when ANALYZE is also enabled. It defaults to TRUE.
Prevent repeated evaluation
Normally, expressions in a subquery are evaluated once. But Postgres can collapse trivial subqueries if it thinks that will be faster.
To introduce an optimization barrier, you could use a CTE instead of the subquery. This guarantees that Postgres computes ST_SnapToGrid(geom, 50) once only:
WITH cte AS (
SELECT ST_SnapToGrid(geom, 50) AS geom1
FROM points
)
SELECT COUNT(*) AS n
, ST_X(geom1) AS x
, ST_Y(geom1) AS y
FROM cte
GROUP BY geom1; -- see below
However, this is probably slower than a subquery due to the extra overhead of a CTE. The function call is probably very cheap. Generally, Postgres knows best how to optimize a query plan. Only introduce such an optimization barrier if you know better.
Simplify
I changed the name of the computed point in the subquery / CTE to geom1 to clarify it's different from the original geom. That helps to clarify the more important thing here:
GROUP BY geom1
instead of:
GROUP BY x, y
That's obviously cheaper - and may have an influence on whether the function call is repeated. So, this is probably fastest:
SELECT COUNT(*) AS n
, ST_X(ST_SnapToGrid(geom, 50)) AS x
, ST_Y(ST_SnapToGrid(geom, 50)) AS y
FROM points
GROUP BY ST_SnapToGrid(geom, 50); -- same here!
Or maybe this:
SELECT COUNT(*) AS n
, ST_X(geom1) AS x
, ST_Y(geom1) AS y
FROM (
SELECT ST_SnapToGrid(geom, 50) AS geom1
FROM points
) AS tmp
GROUP BY geom1;
Test all three with EXPLAIN ANALYZE or EXPLAIN (ANALYZE, TIMING OFF) and see for yourself. Testing >> guessing.
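The effect of grouping on the computed value can be observed outside PostGIS too; in this SQLite sketch, ROUND stands in for ST_SnapToGrid, and the snapped pair computed once in the subquery plays the role of geom1:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE points (x REAL, y REAL);
INSERT INTO points VALUES (10.2, 20.1), (10.4, 19.8), (55.0, 55.0);
""")

# Snap to a 50-unit grid once in the subquery, then group on the
# snapped values - the analogue of GROUP BY geom1 above.
rows = conn.execute("""
SELECT COUNT(*) AS n, gx AS x, gy AS y
FROM (SELECT ROUND(x / 50) * 50 AS gx, ROUND(y / 50) * 50 AS gy
      FROM points) AS tmp
GROUP BY gx, gy
ORDER BY n DESC
""").fetchall()
# two nearby points collapse into one grid cell, the far one stays alone
```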

deterministic stats_mode in Oracle

In Oracle, the stats_mode function selects the mode of a set of data. Unfortunately, it is non-deterministic in picking its result in the presence of ties (e.g. stats_mode over the values 1, 2, 1, 2 could return 1 or 2 depending on the ordering of rows inside Oracle). In many situations this is not acceptable. Is there a function or nice technique for supplying your own deterministic ordering for the stats_mode function?
Oracle's web page on STATS_MODE explains that if more than one mode exists, Oracle Database chooses one and returns only that one value.
As there are no additional parameters, etc., you cannot change its behaviour.
The same page, however, does also show that the following sample query can generate multiple mode values...
SELECT x FROM (SELECT x, COUNT(x) AS cnt1 FROM t GROUP BY x)
WHERE cnt1 = (SELECT MAX(cnt2) FROM (SELECT COUNT(x) AS cnt2 FROM t GROUP BY x));
By modifying such code you could once again choose a single value, as determined by a specified ORDER. Note that Oracle applies rownum before ORDER BY, so the ordering must go into a subquery with the rownum filter outside it:
SELECT x FROM (
    SELECT x, y FROM (SELECT x, MAX(y) AS y, COUNT(x) AS cnt1 FROM t GROUP BY x)
    WHERE cnt1 = (SELECT MAX(cnt2) FROM (SELECT COUNT(x) AS cnt2 FROM t GROUP BY x))
    ORDER BY y DESC
)
WHERE rownum = 1;
A bit messy, unfortunately, though you may be able to tidy it slightly for your particular case. But I'm not aware of alternative fundamentally different approaches.
Selecting the value among a set of values with the highest occurring frequency could also be done by counting and ordering.
select x from t group by x order by count(*) desc limit 1;
You can also make it deterministic by ordering on the value itself.
select x from t group by x order by count(*) desc, x desc limit 1;
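The count-and-order variant runs unchanged on any database with LIMIT (Oracle 12c+ spells it FETCH FIRST 1 ROW ONLY); a quick SQLite check of the deterministic tie-break:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (x INTEGER);
INSERT INTO t VALUES (1), (2), (1), (2), (3);
""")

# Both 1 and 2 occur twice; the extra "x DESC" tie-break makes the
# result deterministic: the larger of the tied values always wins.
mode = conn.execute(
    "SELECT x FROM t GROUP BY x ORDER BY COUNT(*) DESC, x DESC LIMIT 1"
).fetchone()[0]
```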
I don't quite understand the complexity of Oracle's query examples; the performance is really bad. Can anyone shed some light on the difference?

Thin out many ST_Points

I have many (1,000,000) ST_Points in a Postgres DB with the PostGIS extension. When I show them on a map, browsers get very busy.
So I would like to write an SQL statement that filters a high-density area down to a single point: when a user zooms out, out of 100 ST_Points Postgres should give back only one, but only if those points are close together.
I tried it with this statement:
select a.id, count(*)
from points as a, points as b
where st_dwithin(a.location, b.location, 0.001)
and a.id != b.id
group by a.id
I would call it thinning out, but did not find anything - maybe because I'm not a native English speaker.
Does anybody have some suggestions?
I agree with tcarobruce that clustering is the term you are looking for. But it can be done in PostGIS.
Basically clustering can be achieved by reducing the number of decimals in the X and Y and grouping upon them;
select
    count(*),
    round(cast(ST_X(geom) as numeric), 3),
    round(cast(ST_Y(geom) as numeric), 3)
from mytable
group by
    round(cast(ST_X(geom) as numeric), 3),
    round(cast(ST_Y(geom) as numeric), 3)
Which will result in a table with coordinates and the number of real points at that coordinate. In this particular sample, it leaves you with rounding on 3 decimals, 0.001 like in your initial statement.
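The same rounding trick works client-side if the points are already in Python; a small sketch with collections.Counter (the sample coordinates are made up):

```python
from collections import Counter

# Snap each point to a 3-decimal grid and count points per cell,
# mirroring the round(..., 3) / GROUP BY query above.
points = [(1.00012, 2.00034), (1.00049, 2.00021), (5.12345, 6.54321)]
cells = Counter((round(x, 3), round(y, 3)) for x, y in points)
# the two nearby points share one cell; the far point gets its own
```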
You can cluster nearby points together using ST_ClusterDBSCAN, then keep all single points and, for example:
1. select one random point per cluster, or
2. select the centroid of each point cluster.
I use eps := 300 to cluster points together that are within 300 meters of each other.
create table buildings_grouped as
SELECT geom, ST_ClusterDBSCAN(geom, eps := 300, minpoints := 2) over () AS cid
FROM buildings
1:
create table buildings_grouped_keep_random as
select geom, cid from buildings_grouped
where cid is null
union
select * from
(SELECT DISTINCT ON (cid) *
FROM buildings_grouped
ORDER BY cid, random()) sub
2:
create table buildings_grouped_keep_centroid as
select geom, cid from buildings_grouped
where cid is null
union
select st_centroid(st_union(geom)) geom, cid
from buildings_grouped
where cid is not null
group by cid
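For point clusters, st_centroid(st_union(geom)) reduces to the mean of the coordinates, so option 2 can be sanity-checked off-database; a pure-Python sketch with hypothetical (geom, cid) rows as ST_ClusterDBSCAN would label them (cid NULL = unclustered):

```python
from collections import defaultdict

# (geom, cid) pairs as ST_ClusterDBSCAN would produce them
rows = [((0.0, 0.0), 1), ((2.0, 2.0), 1), ((9.0, 9.0), None)]

clusters = defaultdict(list)
kept = []
for pt, cid in rows:
    if cid is None:
        kept.append(pt)          # unclustered points survive unchanged
    else:
        clusters[cid].append(pt)

# one centroid per cluster: the centroid of a union of points
# is simply the coordinate mean
for pts in clusters.values():
    kept.append((sum(p[0] for p in pts) / len(pts),
                 sum(p[1] for p in pts) / len(pts)))
```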
The term you are looking for is "clustering".
There are client-side libraries that do this, as well as commercial services that do it server-side.
But it's not something PostGIS does natively. (There's a ticket for it.)
You'll probably have to write your own solution, and precompute your clusters ahead of time.
ST_ClusterDBSCAN- and KMeans-based clustering works, but it is very slow for big data sets, so it is practically unusable there. PostGIS functions like ST_SnapToGrid and ST_RemoveRepeatedPoints are faster and can help in some cases. But the best approach, I think, is using PDAL thinning filters such as the sample filter. I recommend using it with PG Point Cloud.
Edit:
ST_SnapToGrid is pretty fast and useful. Here is the example query for triangulation with optimizations:
WITH step1 AS
(
SELECT geometry, ST_DIMENSION(geometry) AS dim FROM table
)
, step2 AS
(
SELECT ST_SIMPLIFYVW(geometry, :tolerance) AS geometry FROM step1 WHERE dim > 0
UNION ALL
(WITH q1 AS
(
SELECT (ST_DUMP(geometry)).geom AS geometry FROM step1 WHERE dim = 0
)
SELECT ST_COLLECT(DISTINCT(ST_SNAPTOGRID(geometry, :tolerance))) FROM q1)
)
, step3 AS
(
SELECT ST_COLLECT(geometry) AS geometry FROM step2
)
SELECT ST_DELAUNAYTRIANGLES(geometry, :tolerance, 0)::BYTEA AS geometry
FROM step3
OFFSET :offset LIMIT :limit;

Is there a performance difference between HAVING on alias, vs not using HAVING

Ok, I'm learning, bit by bit, about what HAVING means.
Now, my question is if these two queries have difference performance characteristics:
Without HAVING
SELECT x + y AS z, t.* FROM t
WHERE
x = 1 and
x+y = 2
With HAVING
SELECT x + y AS z, t.* FROM t
WHERE
x = 1
HAVING
z = 2
Yes, it should be different - (1) is expected to be faster.
HAVING ensures that the main query is run first and the HAVING filter is applied afterwards - so it basically works on the dataset returned by the query minus the HAVING.
The first query should be preferable, since it does not select those records at all.
HAVING is used for queries that contain GROUP BY or return a single row containing the result of aggregate functions. For example, SELECT SUM(scores) FROM t HAVING SUM(scores) > 100 returns either one row or no row at all.
The second query is considered invalid by the SQL Standard and is not accepted by some database systems.
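The portable forms are easy to demonstrate in SQLite, which, for one, requires a GROUP BY clause before HAVING; this sketch keeps WHERE for row filtering and uses HAVING only with an aggregate:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (x INTEGER, y INTEGER);
INSERT INTO t VALUES (1, 1), (1, 3), (2, 0);
""")

# WHERE filters individual rows before any grouping - the portable form.
where_rows = conn.execute(
    "SELECT x + y AS z, x, y FROM t WHERE x = 1 AND x + y = 2"
).fetchall()

# HAVING filters after grouping, on aggregate results.
having_rows = conn.execute(
    "SELECT x, SUM(y) FROM t GROUP BY x HAVING SUM(y) > 2"
).fetchall()
```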