Thin out many ST_Points - sql

I have many (1,000,000) ST_Points in a Postgres DB with the PostGIS
extension. When I show them on a map, browsers get very busy.
I would therefore like to write a SQL statement that reduces a high-density cluster to a single point.
When a user zooms out, Postgres should return only one point in place of, say, 100.
But only if these points are close together.
I tried it with this statement:
select a.id, count(*)
from points as a, points as b
where st_dwithin(a.location, b.location, 0.001)
and a.id != b.id
group by a.id
I would call it thinning out, but I did not find anything on it - maybe because I'm
not a native English speaker.
Does anybody have some suggestions?

I agree with tcarobruce that clustering is the term you are looking for, but it is achievable in PostGIS.
Basically, clustering can be achieved by reducing the number of decimals in the X and Y coordinates and grouping on them:
select
count(*),
round(cast(ST_X(geom) as numeric),3),
round(cast(ST_Y(geom) as numeric),3)
from mytable
group by
round(cast(ST_X(geom) as numeric),3),
round(cast(ST_Y(geom) as numeric),3);
This results in a table with coordinates and the number of real points at each coordinate. In this particular sample the coordinates are rounded to 3 decimals, i.e. a 0.001 grid, like in your initial statement.
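If the map needs an actual point geometry per cell rather than bare coordinates, the rounded pair can be rebuilt into a point. A minimal sketch, assuming the same mytable/geom names and SRID 4326:
select
count(*) as n,
ST_SetSRID(ST_MakePoint(
    round(cast(ST_X(geom) as numeric),3),
    round(cast(ST_Y(geom) as numeric),3)), 4326) as geom -- one representative point per cell
from mytable
group by
round(cast(ST_X(geom) as numeric),3),
round(cast(ST_Y(geom) as numeric),3);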

You can cluster nearby points together using ST_ClusterDBSCAN.
Then keep all single points and, for example, either:
1. select one random point per cluster, or
2. select the centroid of each point cluster.
I use eps := 300 to cluster points that are within 300 meters of each other (eps is expressed in the units of the geometry's SRID, so this assumes a meter-based projection):
create table buildings_grouped as
SELECT geom, ST_ClusterDBSCAN(geom, eps := 300, minpoints := 2) over () AS cid
FROM buildings
Option 1 - keep one random point per cluster:
create table buildings_grouped_keep_random as
select geom, cid from buildings_grouped
where cid is null
union
select geom, cid from
(SELECT DISTINCT ON (cid) geom, cid
FROM buildings_grouped
WHERE cid IS NOT NULL -- unclustered points were already kept above
ORDER BY cid, random()) sub
Option 2 - keep the centroid of each cluster:
create table buildings_grouped_keep_centroid as
select geom, cid from buildings_grouped
where cid is null
union
select st_centroid(st_union(geom)) geom, cid
from buildings_grouped
where cid is not null
group by cid

The term you are looking for is "clustering".
There are client-side libraries that do this, as well as commercial services that do it server-side.
But it's not something PostGIS does natively. (There's a ticket for it.)
You'll probably have to write your own solution, and precompute your clusters ahead of time.
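One way to precompute is to snap points to a coarse grid and store one representative per cell ahead of time. A sketch under assumed names (a points table with a geom column; the 0.01 cell size is in SRID units and would be tuned per zoom level):
create materialized view points_thinned as
select ST_Centroid(ST_Collect(geom)) as geom, count(*) as n -- one centroid plus a count per cell
from points
group by ST_SnapToGrid(geom, 0.01);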

ST_ClusterDBSCAN- and KMeans-based clustering works, but it is very slow for big data sets, so it is practically unusable there. PostGIS functions like ST_SnapToGrid and ST_RemoveRepeatedPoints are faster and can help in some cases. But the best approach, I think, is to use PDAL thinning filters such as the sample filter. I recommend using it with PG Point Cloud.
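For plain point thinning without PDAL, the simplest form of the ST_SnapToGrid idea is something like this (a sketch, assuming a points table with a geom column):
-- keep one representative point per 0.001-unit grid cell
select distinct ST_SnapToGrid(geom, 0.001) as geom
from points;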
Edit:
ST_SnapToGrid is pretty fast and useful. Here is the example query for triangulation with optimizations:
WITH step1 AS
(
SELECT geometry, ST_Dimension(geometry) AS dim FROM "table" -- "table" must be quoted: it is a reserved word
)
, step2 AS
(
SELECT ST_SimplifyVW(geometry, :tolerance) AS geometry FROM step1 WHERE dim > 0
UNION ALL
(WITH q1 AS
(
SELECT (ST_Dump(geometry)).geom AS geometry FROM step1 WHERE dim = 0
)
SELECT ST_Collect(DISTINCT ST_SnapToGrid(geometry, :tolerance)) FROM q1)
)
, step3 AS
(
SELECT ST_Collect(geometry) AS geometry FROM step2
)
SELECT ST_DelaunayTriangles(geometry, :tolerance, 0)::BYTEA AS geometry
FROM step3
OFFSET :offset LIMIT :limit;

Related

Parsing Multiple Snowflake Objects with consistent keys to rows

First post, hope I don't do anything too crazy
I want to go from JSON/object to long in terms of formatting.
I have a table set up as follows (note: there will be a large but finite number of 50+ activity columns, 2 is a minimal working example). I'm not concerned about the formatting of the date column - different problem.
customer_id (varchar), activity_count (object of int), activity_duration (object of numeric)
sample starting point
In this case I'd like to explode this into this:
customer_id (varchar), time_period, activity_count (int), activity_duration (numeric)
sample end point - long
minimum data set
WITH smpl AS (
SELECT
'12a' AS id,
OBJECT_CONSTRUCT(
'd1910', 0,
'd1911', 26,
'd1912', 6,
'd2001', 73) as activity_count,
OBJECT_CONSTRUCT(
'd1910', 0,
'd1911', 260.1,
'd1912', 30,
'd2001', 712.3) AS activity_duration
UNION ALL
SELECT
'13b' AS id,
OBJECT_CONSTRUCT(
'd1910', 1,
'd1911', 2,
'd1912', 3,
'd2001', 4) as activity_count,
OBJECT_CONSTRUCT(
'd1910', 1,
'd1911', 2.2,
'd1912', 3.3,
'd2001', 4.3) AS activity_duration
)
select * from smpl
Extra credit for also taking this from JSON/object to wide (in Google BigQuery it's SELECT id, activity_count.* FROM tbl).
Thanks in advance.
I've tried tons of random FLATTEN()-based joins. In this instance I probably just need one working example.
This needs to scale to a moderate but finite number of objects (e.g. 50).
I'll also see if I can combine it with this: Lateral flatten two columns without repetition in snowflake
Using FLATTEN:
WITH (...)
SELECT s1.ID, s1.KEY, s1.value AS activity_count, s2.value AS activity_duration
FROM (select ID, Key, VALUE from smpl,table(flatten(input=>activity_count))) AS s1
JOIN (select ID, Key, VALUE from smpl,table(flatten(input=>activity_duration))) AS s2
ON S1.ID = S2.ID AND S1.KEY = S2.KEY;
Output:
ID  | KEY   | ACTIVITY_COUNT | ACTIVITY_DURATION
----+-------+----------------+------------------
12a | d1910 | 0              | 0
12a | d1911 | 26             | 260.1
12a | d1912 | 6              | 30
12a | d2001 | 73             | 712.3
13b | d1910 | 1              | 1
13b | d1911 | 2              | 2.2
13b | d1912 | 3              | 3.3
13b | d2001 | 4              | 4.3
@Lukasz Szozda gets close, but that answer doesn't scale as well with multiple variables (it's essentially a bunch of cartesian products, and I'd need a lot of ON conditions). I have a known constraint (each field is in a strict format), so it's easy to recycle the key.
After way, way too much messing with this (off-and-on searches for weeks) it finally clicked, and it's pretty easy:
SELECT
id, key, activity_count[key], activity_duration[key], activity_duration2[key]
FROM smpl, LATERAL flatten(input => activity_count);
You can also use fields other than key, such as index.
It's inspired by this link, but I just didn't quite follow it:
https://stackoverflow.com/a/36804637/20994650
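For the 'extra credit' wide format: Snowflake has no activity_count.* expansion like BigQuery's, but since the keys are known up front, each one can be pulled out of the object directly. A sketch against the smpl CTE above (column list abbreviated; the aliases are mine):
SELECT
    id,
    activity_count['d1910']::int AS count_d1910,
    activity_count['d1911']::int AS count_d1911,
    activity_duration['d1910']::number(10,1) AS duration_d1910,
    activity_duration['d1911']::number(10,1) AS duration_d1911
    -- ...repeat per known key
FROM smpl;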

Generate and merging results of geometry function into one column in Postgis

I need to create a table of points based on linestrings.
Basically I have a linestring table (l_table), and I will generate the start point, end point, and centroid of each line. These 3 results need to be merged into a single geometry point column. I want to keep the original "id" column associated with each new geometry point. Finally, one more column should describe the new point geometry: start, end, or centroid.
Something like this:
create table point_from_line as (
select l.id, st_startpoint(l.geom), st_centroid(l.geom), st_endpoint(l.geom),
case st_startpoint ... then 'start'
     st_centroid ... then 'centroid'
     st_endpoint ... then 'end'
end as geom_origin
from linestring_table as l
where l.id any condition...
)
Note: each resulting geom needs to stay associated with the original linestring id, with one more column describing whether it is the start, centroid, or end point.
Input columns: l.id, l.geom (linestring);
Output columns: l.id, p.geom (the unified result of st_startpoint, st_centroid, st_endpoint), p.geometry_origin (classification as start, end, or centroid for each row of p.geom)
Could someone help me?
Complex and good question, I will try to collaborate.
(haha)
After several hours I think I have found a direction for my problem.
So, I will explain it in steps:
1 - Create 3 CTEs: start point, centroid, and end point, each with its respective classification (RECURSIVE is not actually needed, as nothing recurses):
with cte as (
-- start point of linestring
select l.id, st_startpoint(l.geom),
case when
l.geom is not null then 'start point'
end as geom_origin
from linestring_table as l
where l.somecolumn = 'somefilter'...
)
2 - Repeat the same query, modifying the parameters for the centroid and the end point.
3 - UNION ALL the 3 generated CTEs.
Yes, the id will be repeated, but that is the intention.
4 - Wrap the 3 CTEs in a subquery to generate a unique id value, called "gid".
So the final query looks like this:
with cte as (select ...),
cte_2 as (select ...),
cte_3 as (select ...)
select row_number() over () as gid, sub.id, sub.geom, sub.geom_origin
from (
select * from cte
union all
select * from cte_2
union all
select * from cte_3
) as sub;
If someone knows another way to reach the same results, please share, but for now, that solves my doubt. One compact alternative is sketched below.
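A sketch of such an alternative (against the same linestring_table/id/geom names): a single pass with CROSS JOIN LATERAL and a VALUES list replaces the three CTEs entirely.
select row_number() over () as gid, l.id, p.geom, p.geom_origin
from linestring_table as l
cross join lateral (
    values (st_startpoint(l.geom), 'start'),
           (st_centroid(l.geom),   'centroid'),
           (st_endpoint(l.geom),   'end')
) as p(geom, geom_origin); -- 3 rows per line, each tagged with its origin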
Regards

Why is PostGIS stacking all points on top of each other? (Using ST_DWithin to find all results within 1000m radius)

New to PostGIS/PostgreSQL...any help would be greatly appreciated!
I have two tables in a postgres db aliased as gas and ev. I'm trying to choose a specific gas station (gas.site_id=11949) and locate all EV/alternative fuel charging stations within a 1000m radius. When I run the following though, PostGIS returns a number of ev stations that are all stacked on top of each other in the map (see screenshot).
Anyone have any idea why this is happening? How can I get PostGIS to visualize the points within a 1000m radius of the specified gas station?
with myplace as (
SELECT gas.geom
from nj_gas gas
where gas.site_id = 11949 limit 1)
select myplace.*, ev.*
from alt_fuel ev, myplace
where ST_DWithin(ev.geom1, myplace.geom, 1000)
The function ST_DWithin does not compute distances in meters when given geometry-typed parameters.
From the documentation:
For geometry: The distance is specified in units defined by the
spatial reference system of the geometries. For this function to make
sense, the source geometries must both be of the same coordinate
projection, having the same SRID.
So, if you want to compute distances in meters, you have to use the data type geography:
For geography units are in meters and measurement is defaulted to
use_spheroid=true, for faster check, use_spheroid=false to measure
along sphere.
That all being said, you just have to cast your geometries to geography. Besides that, your query looks just fine - assuming your data is correct :-)
WITH myplace as (
SELECT gas.geom
FROM nj_gas gas
WHERE gas.site_id = 11949 LIMIT 1)
SELECT myplace.*, ev.*
FROM alt_fuel ev, myplace
WHERE ST_DWithin(ev.geom1::GEOGRAPHY, myplace.geom::GEOGRAPHY, 1000)
Sample data:
CREATE TABLE t1 (id INT, geom GEOGRAPHY);
INSERT INTO t1 VALUES (1,'POINT(-4.47 54.22)');
CREATE TABLE t2 (geom GEOGRAPHY);
INSERT INTO t2 VALUES ('POINT(-4.48 54.22)'),('POINT(-4.41 54.18)');
Query
WITH j AS (
SELECT geom FROM t1 WHERE id = 1 LIMIT 1)
SELECT ST_AsText(t2.geom)
FROM j,t2 WHERE ST_DWithin(t2.geom, j.geom, 1000);
st_astext
--------------------
POINT(-4.48 54.22)
(1 row)
You are cross joining those tables, and by selecting myplace.* & ev.* you have PostgreSQL return the cartesian product of both.
So while there is only one row in myplace, its geom will be merged with every row of alt_fuel (i.e. the result set will have all columns of both tables in every possible combination of rows); since the result set thus has two geometry columns, your client application likely picks either the first one, or the one called geom (as opposed to alt_fuel.geom1), to display!
I don't see that you are interested in myplace.geom in the result set anyway, so I suggest running:
WITH
myplace as (
SELECT gas.geom
FROM nj_gas gas
WHERE gas.site_id = 11949
LIMIT 1
)
SELECT ev.*
FROM alt_fuel AS ev
JOIN myplace AS mp
ON ST_DWithin(ev.geom1, mp.geom, 1000) -- ST_DWithin(ev.geom1::GEOGRAPHY, mp.geom::GEOGRAPHY, 1000)
;
If, for some reason, you also want to display myplace.geom along with the stations, you'd have to UNION[ ALL] the above with a SELECT on myplace; note that you will then have to provide the same column list and structure (same data types!) as alt_fuel.* (or better, as the other side of the UNION[ ALL]) in that SELECT!
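A sketch of that union (the trimmed column list and the 'station'/'origin' tag are mine; in practice you would pad the myplace side out to whatever alt_fuel columns you need):
WITH myplace AS (
    SELECT gas.geom FROM nj_gas gas WHERE gas.site_id = 11949 LIMIT 1
)
SELECT ev.geom1 AS geom, 'station' AS kind
FROM alt_fuel AS ev
JOIN myplace AS mp
  ON ST_DWithin(ev.geom1::GEOGRAPHY, mp.geom::GEOGRAPHY, 1000)
UNION ALL
SELECT mp.geom, 'origin' -- the gas station itself
FROM myplace AS mp;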
Note the suggestions made by @JimJones about units; if your data is not projected in a meter-based CRS (but in a geographic reference system; 'LonLat'), use the cast to GEOGRAPHY to have ST_DWithin consider the input as meters (and calculate using spheroidal algebra instead of planar (Euclidean))!
Resolved by using:
WITH
myplace as (
SELECT geom as g
FROM nj_gas
WHERE site_id = 11949 OR site_id = 11099 OR site_id = 11679 or site_id = 480522
), myresults AS (
SELECT ev.*
FROM alt_fuel AS ev
JOIN myplace AS mp
ON ST_DWithin(ev.geom, mp.g, 0.1))
select * from myresults
Thanks so much for your help @ThingumaBob and @JimJones! Greatly appreciate it.

BigQuery - Joining and pivoting large tables

I know there are some posts on pivoting, which I have used to get where I am today (thanks to the BQ community!). But this post seeks some advice on optimising this where a large number of pivot columns is needed and distributed table joins are needed... as well as deduping. Not asking much, right?!
Objective:
We have 2 large BQ tables, with a full 10 years history that needs joining:
sales_order_header (13 GB - 1.35 million rows)
sales_order_line (50 GB - 5 million rows)
This is a typical 'header/line' one-to-many relationship. Unfortunately the data arrives as 2 separate streams, rather than in 1 document style with the line nested inside the header, which would be ideal - so distributed joins become necessary for some of the views our BI tool (Tableau) periodically (every 60 mins) calls to ingest 'cleansed' data that is:
deduped (both tables that is)
joined header to line (on salesOrderId)
each has its own array of 'sourceData' name/value pairs that needs unpacking/'pivoting' so it's no longer an array
Point 3 presents an issue in its own right. We have a column called 'sourceData' which is basically where the core data is - it's an array of string name/value pairs (a row in BQ is a replication of a single row from a DB, so the key is a column name and the value is the value for a single row).
Now I think here lies the issue: as there are 250 array entries (we know the exact number up front), this equates to 250 'unnest' statements each, using the best approach I can think of - sub-selects:
(SELECT val FROM UNNEST(sourceData) WHERE name = 'a') AS a,
250 times
And this is done as a pattern for each of the header and line tables' respective views.
So the SQL for the view that retrieves a deduped, flattened/pivoted array for the sales_order_header table is as follows (the sales_order_line view follows the same pattern):
#standardSQL
WITH latest_snapshot_dups AS (
SELECT
salesOrderId,
PARSE_TIMESTAMP("%Y-%m-%dT%H:%M:%E*S%Ez", lastUpdated) AS lastUpdatedTimestampUTC,
sourceData,
_PARTITIONTIME AS bqPartitionTime
FROM
`project.ds.sales_order_header_refdata`
),
latest_snapshot_nodups AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY salesOrderId ORDER BY lastUpdatedTimestampUTC DESC) AS rowNum
FROM latest_snapshot_dups
)
SELECT
salesOrderId,
lastUpdatedTimestampUTC,
(SELECT val FROM UNNEST(sourceData) WHERE name = 'a') AS a,
(SELECT val FROM UNNEST(sourceData) WHERE name = 'b') AS b,
....250 of these
FROM
latest_snapshot_nodups
WHERE
rowNum = 1
Although just showing one here, we have these two similar views (with a total of 250 + 300 = 550 unique subqueries that unnest/pivot), and when I now want to join the header view with the line view I immediately run into a limit on the number of subqueries.
Is there a better way to do this, assuming this is the data there is to work with? A better way to 'pivot' perhaps? Or a more efficient way to build a single view that optimises the order of things, rather than using 2 discrete views?
Thanks for your help BQ Community!
I run into an issue straight away exceeding a limit of subqueries
You are currently using the pattern below (the less significant parts of the code removed for simplicity):
#standardSQL
SELECT
salesOrderId,
(SELECT val FROM UNNEST(sourceData) WHERE name = 'a') AS a,
(SELECT val FROM UNNEST(sourceData) WHERE name = 'b') AS b,
....250 OF these
FROM latest_snapshot_nodups
Try the pattern below instead:
#standardSQL
SELECT
salesOrderId,
MAX(IF(name = 'a', val, NULL)) AS a,
MAX(IF(name = 'b', val, NULL)) AS b,
....250 OF these
FROM latest_snapshot_nodups, UNNEST(sourceData) kv
GROUP BY salesOrderId
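With that rewrite, each view collapses to a single UNNEST plus GROUP BY, so joining header to line no longer piles up subqueries. A sketch of the combined shape (latest_line_snapshot_nodups and salesOrderLineId are assumed names for the line-side equivalents):
#standardSQL
WITH header_pivot AS (
  SELECT
    salesOrderId,
    MAX(IF(name = 'a', val, NULL)) AS hdr_a      -- ...one line per header field
  FROM latest_snapshot_nodups, UNNEST(sourceData) kv
  GROUP BY salesOrderId
),
line_pivot AS (
  SELECT
    salesOrderId,
    salesOrderLineId,                            -- assumed line-level key
    MAX(IF(name = 'x', val, NULL)) AS line_x     -- ...one line per line field
  FROM latest_line_snapshot_nodups, UNNEST(sourceData) kv
  GROUP BY salesOrderId, salesOrderLineId
)
SELECT h.salesOrderId, h.hdr_a, l.salesOrderLineId, l.line_x
FROM header_pivot AS h
JOIN line_pivot AS l
USING (salesOrderId)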

SQL Server - Multiplying row values for a given column value [duplicate]

I'm looking for something like SELECT PRODUCT(table.price) FROM table GROUP BY table.sale, similar to how SUM works.
Have I missed something in the documentation, or is there really no PRODUCT function?
If so, why not?
Note: I looked for the function in Postgres, MySQL and MSSQL and found none, so I assumed SQL in general does not support it.
For MSSQL you can use this. It can be adapted for other platforms: it's just maths and aggregates on logarithms.
SELECT
GrpID,
CASE
WHEN MinVal = 0 THEN 0
WHEN Neg % 2 = 1 THEN -1 * EXP(ABSMult)
ELSE EXP(ABSMult)
END
FROM
(
SELECT
GrpID,
--log of +ve row values
SUM(LOG(ABS(NULLIF(Value, 0)))) AS ABSMult,
--count of -ve values. Even = +ve result.
SUM(SIGN(CASE WHEN Value < 0 THEN 1 ELSE 0 END)) AS Neg,
--anything * zero = zero
MIN(ABS(Value)) AS MinVal
FROM
Mytable
GROUP BY
GrpID
) foo
Taken from my answer here: SQL Server Query - groupwise multiplication
I don't know why there isn't one, but (taking more care over negative numbers) you can use logs and exponents to do:
select exp(sum(ln(table.price))) from table ...
There is no PRODUCT set function in the SQL Standard. It would appear to be a worthy candidate, though (unlike, say, a CONCATENATE set function, which is not a good fit for SQL: the resulting data type would involve multivalues and pose a problem as regards first normal form).
The SQL Standards aim to consolidate functionality across SQL products circa 1990 and to provide 'thought leadership' on future development. In short, they document what SQL does and what SQL should do. The absence of a PRODUCT set function suggests that in 1990 no vendor thought it worthy of inclusion and that there has been no academic interest in introducing it into the Standard since.
Of course, vendors have always sought to add their own functionality, these days usually as extensions to the Standards rather than tangentially. I don't recall seeing a PRODUCT set function (or even demand for one) in any of the SQL products I've used.
In any case, the workaround is fairly simple using the LOG and EXP scalar functions (and logic to handle negatives) with the SUM set function; see @gbn's answer for some sample code. I've never needed to do this in a business application, though.
In conclusion, my best guess is that there is no demand from SQL end users for a PRODUCT set function; further, that anyone with an academic interest would probably find the workaround acceptable (i.e. would not value the syntactic sugar a PRODUCT set function would provide).
Out of interest, there is indeed demand in SQL Server Land for new set functions but for those of the window function variety (and Standard SQL, too). For more details, including how to get involved in further driving demand, see Itzik Ben-Gan's blog.
You can perform a product aggregate function, but you have to do the maths yourself, like this...
SELECT
Exp(Sum(IIf(Abs([Num])=0,0,Log(Abs([Num])))))*IIf(Min(Abs([Num]))=0,0,1)*(1-2*(Sum(IIf([Num]>=0,0,1)) Mod 2)) AS P
FROM
Table1
Source: http://productfunctionsql.codeplex.com/
There is a neat trick in T-SQL (not sure if it's ANSI) that allows concatenating string values from a set of rows into one variable. It looks like it works for multiplying as well:
declare @Floats as table (value float)
insert into @Floats values (0.9)
insert into @Floats values (0.9)
insert into @Floats values (0.9)
declare @multiplier float = null
select
@multiplier = isnull(@multiplier, 1) * value
from @Floats
select @multiplier
This can potentially be more numerically stable than the log/exp solution.
I think that is because no numeric type is able to accommodate many products. As databases are designed for large numbers of records, a product of 1000 numbers would be supermassive, and in the case of floating point numbers, the propagated error would be huge.
Also note that using log can be a dangerous solution. Although mathematically log(a*b) = log(a) + log(b), this might not hold in computers, as we are not dealing with real numbers. If you calculate exp(log(a) + log(b)) instead of a*b, you may get unexpected results. For example:
SELECT 9999999999*99999999974482, EXP(LOG(9999999999)+LOG(99999999974482))
in Sql Server returns
999999999644820000025518, 9.99999999644812E+23
So my point is: when you are trying to do the product, do it carefully and test it heavily.
One way to deal with this problem (if you are working in a scripting language) is to use the group_concat function.
For example, SELECT group_concat(table.price) FROM table GROUP BY table.sale
This will return a string with all prices for the same sale value, separated by a comma.
Then with a parser you can get each price and do the multiplication. (In PHP you can even use the array_reduce function; in fact, the php.net manual has a suitable example.)
Cheers
Another approach, based on the fact that the cardinality of a cartesian product is the product of the cardinalities of the particular sets ;-)
⚠ WARNING: This example is just for fun and is rather academic; don't use it in production! (Apart from the fact it works only for positive and practically small integers.) ⚠
with recursive t(c) as (
select unnest(array[2,5,7,8])
), p(a) as (
select array_agg(c) from t
union all
select p.a[2:]
from p
cross join generate_series(1, p.a[1])
)
select count(*) from p where cardinality(a) = 0;
The problem can be solved using modern SQL features such as window functions and CTEs. Everything is standard SQL and - unlike logarithm-based solutions - it does not require switching from the integer world to the floating point world, nor special handling of nonpositive numbers. Just number the rows and evaluate the product in a recursive query until no rows remain:
with recursive t(c) as (
select unnest(array[2,5,7,8])
), r(c,n) as (
select t.c, row_number() over () from t
), p(c,n) as (
select c, n from r where n = 1
union all
select r.c * p.c, r.n from p join r on p.n + 1 = r.n
)
select c from p where n = (select max(n) from p);
As your question involves grouping by the sale column, things get a little more complicated, but it's still solvable:
with recursive t(sale,price) as (
select 'multiplication', 2 union
select 'multiplication', 5 union
select 'multiplication', 7 union
select 'multiplication', 8 union
select 'trivial', 1 union
select 'trivial', 8 union
select 'negatives work', -2 union
select 'negatives work', -3 union
select 'negatives work', -5 union
select 'look ma, zero works too!', 1 union
select 'look ma, zero works too!', 0 union
select 'look ma, zero works too!', 2
), r(sale,price,n,maxn) as (
select t.sale, t.price, row_number() over (partition by sale), count(1) over (partition by sale)
from t
), p(sale,price,n,maxn) as (
select sale, price, n, maxn
from r where n = 1
union all
select p.sale, r.price * p.price, r.n, r.maxn
from p
join r on p.sale = r.sale and p.n + 1 = r.n
)
select sale, price
from p
where n = maxn
order by sale;
Result:
sale,price
"look ma, zero works too!",0
multiplication,560
negatives work,-30
trivial,8
Tested on Postgres.
Here is an Oracle solution for anyone who needs it:
with data(id, val) as(
select 1,1.0 from dual union all
select 2,-2.0 from dual union all
select 3,1.0 from dual union all
select 4,2.0 from dual
),
neg(val , modifier) as(
select exp(sum(ln(abs(val)))), case when mod(count(*),2) = 0 then 1 Else -1 end
from data
where val <0
)
,
pos(val) as (
select exp(sum(ln(val)))
from data
where val >=0
)
select (select val*modifier from neg)*(select val from pos) product from dual
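One caveat: val >= 0 lets zeros through, and LN(0) raises ORA-01428, so zeros need a guard. A sketch of the same idea folded into a single aggregate, with a zero added to the sample data:
with data(id, val) as(
select 1, 1.0 from dual union all
select 2, -2.0 from dual union all
select 3, 0.0 from dual union all
select 4, 2.0 from dual
)
select
  case when min(abs(val)) = 0 then 0                             -- anything * 0 = 0
       else exp(sum(ln(abs(nullif(val, 0)))))                    -- product of |val|, zeros skipped via NULL
            * power(-1, sum(case when val < 0 then 1 else 0 end)) -- restore the sign
  end as product
from data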