Generate and merging results of geometry function into one column in Postgis - sql

I need to create a table of points based on linestring.
Basically I have a linestring table (l_table) and I will generate start, end, centroid of each line. It's necessary to merge these 3 columns into one geometry points column.I desire to keep the original "ID" column associated with the new geometry point. Finally, One more column to describe the new point geometry: start, end or centroid.
Something like this:
create table point_from_line as (
select l.id, st_startpoint(l.geom), st_centroid(l.geom), st_endpoint(l.geom)
case st_startpoint ... then 'start'
case st_centroid ... then 'centroid'
case st_endpoint ... then 'end'
end as geom_origin
from linestring_table
where l.id any condition...
Note: the geom column result need to be associated with the original id of linestring and one more column to describe if is start, center or end point.
Input columns: l.id, l.geom (linestring);
Output columns: l.id, p.geom (the unified result of st_start, st_centroid, st_endpoint), _p.geometry_origin_ (classification in: start,end, centroid for each row of p.geom)
Could someone help me?

Complex and good question, I will try to collaborate.
(haha)
after several hours I think I get a direction to my problem.
So, I will explain by steps:
1 - create 3 recursive cte's: Start, Center, endpoint + the respective classification
with Recursive cte as
start point of linestring
select l.id, st_startpoint(l.geom),
case when
l.geom is not null then 'start point'
end as geom_origin
where l.somecolumn = 'somefilter'...
2 - repeat the same above query modifying the parameters for centroid and endpoint
3 - Union all for the 3 generated cte's
yes the id will be repeated, but that is the intention.
4 - Allocate the 3 ctes in a subquery for generate a unique id value, called by "gid"
So, the final query seems like this:
select row_number() over () as gid, cte.id, cte.geom, cte.geom_origin
from (
with Recursive cte as ( select...),
cte_2 (select...),
cte_3 (select...),
)
select * from cte
union all
select * from cte_2
union all
select * from cte_3
);
If someone know another way to reach the same results, please share, but for now, that's solve my doubt.
Regards

Related

Wrong order of elements after GROUP BY using ST_MakeLine

I have a table (well, it's CTE) containing path, as array of node IDs, and table of nodes with their geometries. I am trying to SELECT paths with their start and end nodes, and geometries, like this:
SELECT *
FROM (
SELECT t.path_id, t.segment_num, t.start_node, t.end_node, ST_MakeLine(n.geom) AS geom
FROM (SELECT path_id, segment_num, nodes[1] AS start_node, nodes[array_upper(nodes,1)] AS end_node, unnest(nodes) AS node_id
FROM paths
) t
JOIN nodes n ON n.id = t.node_id
GROUP BY path_id, segment_num, start_node, end_node
) rs
This seems to be working just fine when I try it on individual path samples, but when I run this on large dataset, small number of resulting geometries are bad - clearly the ST_MakeLine received points in wrong order. I suspect parallel aggregation resulting in wrong order, but maybe I am missing something else here?
How can I ensure correct order of points into ST_MakeLine?
If I am correct about the parallel aggregation, postgres docs are saying that Scans of common table expressions (CTEs) are always parallel restricted, but does that mean I have to make CTE with unnested array and mark it AS MATERIALIZED so it does not get optimized back into query?
Thanks for reminding me of ST_MakeLine(geom ORDER BY something) possibility, ST_MakeLine is aggregate function after all. I dont have any explicit ordering column available (order is position in nodes array, but one node can be present multiple times). Fortunately, unnest can be used in FROM clause with WITH ORDINALITY and therefore create an ordering column for me. Working solution:
SELECT *
FROM (SELECT t.path_id, t.segment_num, t.start_node, t.end_node, ST_MakeLine(n.geom ORDER BY node_order) AS geom
FROM (SELECT path_id, segment_num, nodes[1] AS start_node, nodes[array_upper(nodes,1)] AS end_node, a.elem AS node_id, a.nr AS node_order
FROM paths, unnest(nodes) WITH ORDINALITY a(elem, nr)
) t
JOIN nodes n ON n.id = t.node_id
GROUP BY path_id, segment_num, start_node, end_node
) rs
In order for ST_MakeLine to create a LineString in the right order you must explicitly state it with an ORDER BY. The following examples show how the order of points make a huge difference in the output:
Without ordering
WITH j (id,geom) AS (
VALUES
(3,'SRID=4326;POINT(1 2)'::geometry),
(1,'SRID=4326;POINT(3 4)'::geometry),
(0,'SRID=4326;POINT(1 9)'::geometry),
(2,'SRID=4326;POINT(8 3)'::geometry)
)
SELECT ST_MakeLine(geom) FROM j;
Ordering by id:
WITH j (id,geom) AS (
VALUES
(3,'SRID=4326;POINT(1 2)'::geometry),
(1,'SRID=4326;POINT(3 4)'::geometry),
(0,'SRID=4326;POINT(1 9)'::geometry),
(2,'SRID=4326;POINT(8 3)'::geometry)
)
SELECT ST_MakeLine(geom ORDER BY id) FROM j;
Demo: db<>fiddle

SQL query : multiple challenges

Not being an SQL expert, I am struggling with the following:
I inherited a larg-ish table (about 100 million rows) containing time-stamped events that represent stage transitions of mostly shortlived phenomena. The events are unfortunately recorded in a somewhat strange way, with the table looking as follows:
phen_ID record_time producer_id consumer_id state ...
000123 10198789 start
10298776 000123 000112 hjhkk
000124 10477886 start
10577876 000124 000123 iuiii
000124 10876555 end
Each phenomenon (phen-ID) has a start event and theoretically an end event, although it might not have been occured yet and thus not recorded. Each phenomenon can then go through several states. Unfortunately, for some states, the ID is recorded in either a product or a consumer field. Also, the number of states is not fixed, and neither is the time between the states.
To beginn with, I need to create an SQL statement that for each phen-ID shows the start time and the time of the last recorded event (could be an end state or one of the intermediate states).
Just considering a single phen-ID, I managed to pull together the following SQL:
WITH myconstants (var1) as (
values ('000123')
)
select min(l.record_time), max(l.record_time) from
(select distinct * from public.phen_table JOIN myconstants ON var1 IN (phen_id, producer_id, consumer_id)
) as l
As the start-state always has the lowest recorded-time for the specific phenomenon, the above statement correctly returns the recorded time range as one row irrespective of what the end state is.
Obviously here I have to supply the phen-ID manually.
How can I make this work that so I get a row of the start times and maxium recorded time for each unique phen-ID? Played around with trying to fit in something like select distinct phen-id ... but was not able to "feed" them automatically into the above. Or am I completely off the mark here?
Addition:
Just to clarify, the ideal output using the table above would like something like this:
ID min-time max-time
000123 10198789 10577876 (min-time is start, max-time is state iuii)
000124 10477886 10876555 (min-time is start, max-time is end state)
union all might be an option:
select phen_id,
min(record_time) as min_record_time,
max(record_time) as max_record_time
from (
select phen_id, record_time from phen_table
union all select producer_id, record_time from phen_table
union all select consumer_id, record_time from phen_table
) t
where phen_id is not null
group by phen_id
On the other hand, if you want prioritization, then you can use coalesce():
select coalesce(phen_id, producer_id, consumer_id) as phen_id,
min(record_time) as min_record_time,
max(record_time) as max_record_time
from phen_table
group by coalesce(phen_id, producer_id, consumer_id)
The logic of the two queries is not exactly the same. If there are rows where more than one of the three columns is not null, and values differ, then the first query takes in account all non-null values, while the second considers only the "first" non-null value.
Edit
In Postgres, which you finally tagged, the union all solution can be phrased more efficiently with a lateral join:
select x.phen_id,
min(p.record_time) as min_record_time,
max(p.record_time) as max_record_time
from phen_table p
cross join lateral (values (phen_id), (producer_id), (consumer_id)) as x(phen_id)
where x.phen_id is not null
group by x.phen_id
I think you're on the right track. Try this and see if it is what you are looking for:
select
min(l.record_time)
,max(l.record_time)
,coalesce(phen_id, producer_id, consumer_id) as [Phen ID]
from public.phen_table
group by coalesce(phen_id, producer_id, consumer_id)

Retrieve hierarchical groups ... with infinite recursion

I've a table like this which contains links :
key_a key_b
--------------
a b
b c
g h
a g
c a
f g
not really tidy & infinite recursion ...
key_a = parent
key_b = child
Require a query which will recompose and attribute a number for each hierarchical group (parent + direct children + indirect children) :
key_a key_b nb_group
--------------------------
a b 1
a g 1
b c 1
**c a** 1
f g 2
g h 2
**link responsible of infinite loop**
Because we have
A-B-C-A
-> Only want to show simply the link as shown.
Any idea ?
Thanks in advance
The problem is that you aren't really dealing with strict hierarchies; you're dealing with directed graphs, where some graphs have cycles. Notice that your nbgroup #1 doesn't have any canonical root-- it could be a, b, or c due to the cyclic reference from c-a.
The basic way of dealing with this is to think in terms of graph techniques, not recursion. In fact, an iterative approach (not using a CTE) is the only solution I can think of in SQL. The basic approach is explained here.
Here is a SQL Fiddle with a solution that addresses both the cycles and the shared-leaf case. Notice it uses iteration (with a failsafe to prevent runaway processes) and table variables to operate; I don't think there's any getting around this. Note also the changed sample data (a-g changed to a-h; explained below).
If you dig into the SQL you'll notice that I changed some key things from the solution given in the link. That solution was dealing with undirected edges, whereas your edges are directed (if you used undirected edges the entire sample set is a single component because of the a-g connection).
This gets to the heart of why I changed a-g to a-h in my sample data. Your specification of the problem is straightforward if only leaf nodes are shared; that's the specification I coded to. In this case, a-h and g-h can both get bundled off to their proper components with no problem, because we're concerned about reachability from parents (even given cycles).
However, when you have shared branches, it's not clear what you want to show. Consider the a-g link: given this, g-h could exist in either component (a-g-h or f-g-h). You put it in the second, but it could have been in the first instead, right? This ambiguity is why I didn't try to address it in this solution.
Edit: To be clear, in my solution above, if shared braches ARE encountered, it treats the whole set as a single component. Not what you described above, but it will have to be changed after the problem is clarified. Hopefully this gets you close.
You should use a recursive query. In the first part we select all records which are top level nodes (have no parents) and using ROW_NUMBER() assign them group ID numbers. Then in the recursive part we add to them children one by one and use parent's groups Id numbers.
with CTE as
(
select t1.parent,t1.child,
ROW_NUMBER() over (order by t1.parent) rn
from t t1 where
not exists (select 1 from t where child=t1.parent)
union all
select t.parent,t.child, CTE.rn
from t
join CTE on t.parent=CTE.Child
)
select * from CTE
order by RN,parent
SQLFiddle demo
Painful problem of graph walking using recursive CTEs. This is the problem of finding connected subgraphs in a graph. The challenge with using recursive CTEs is to prevent unwarranted recursion -- that is, infinite loops In SQL Server, that typically means storing them in a string.
The idea is to get a list of all pairs of nodes that are connected (and a node is connected with itself). Then, take the minimum from the list of connected nodes and use this as an id for the connected subgraph.
The other idea is to walk the graph in both directions from a node. This ensures that all possible nodes are visited. The following is query that accomplishes this:
with fullt as (
select keyA, keyB
from t
union
select keyB, keyA
from t
),
CTE as (
select t.keyA, t.keyB, t.keyB as last, 1 as level,
','+cast(keyA as varchar(max))+','+cast(keyB as varchar(max))+',' as path
from fullt t
union all
select cte.keyA, cte.keyB,
(case when t.keyA = cte.last then t.keyB else t.keyA
end) as last,
1 + level,
cte.path+t.keyB+','
from fullt t join
CTE
on t.keyA = CTE.last or
t.keyB = cte.keyA
where cte.path not like '%,'+t.keyB+',%'
) -- select * from cte where 'g' in (keyA, keyB)
select t.keyA, t.keyB,
dense_rank() over (order by min(cte.Last)) as grp,
min(cte.Last)
from t join
CTE
on (t.keyA = CTE.keyA and t.keyB = cte.keyB) or
(t.keyA = CTE.keyB and t.keyB = cte.keyA)
where cte.path like '%,'+t.keyA+',%' or
cte.path like '%,'+t.keyB+',%'
group by t.id, t.keyA, t.keyB
order by t.id;
The SQLFiddle is here.
you might want to check with COMMON TABLE EXPRESSIONS
here's the link

Purposely having a query return blank entries at regular intervals

I want to write a query that returns 3 results followed by blank results followed by the next 3 results, and so on. So if my database had this data:
CREATE TABLE table (a integer, b integer, c integer, d integer);
INSERT INTO table (a,b,c,d)
VALUES (1,2,3,4),
(5,6,7,8),
(9,10,11,12),
(13,14,15,16),
(17,18,19,20),
(21,22,23,24),
(25,26,37,28);
I would want my query to return this
1,2,3,4
5,6,7,8
9,10,11,12
, , ,
13,14,15,16
17,18,19,20
21,22,23,24
, , ,
25,26,27,28
I need this to work for arbitrarily many entries that I select for, have three be grouped together like this.
I'm running postgresql 8.3
This should work flawlessly in PostgreSQL 8.3
SELECT a, b, c, d
FROM (
SELECT rn, 0 AS rk, (x[rn]).*
FROM (
SELECT x, generate_series(1, array_upper(x, 1)) AS rn
FROM (SELECT ARRAY(SELECT tbl FROM tbl) AS x) x
) y
UNION ALL
SELECT generate_series(3, (SELECT count(*) FROM tbl), 3), 1, (NULL::tbl).*
ORDER BY rn, rk
) z
Major points
Works for a query that selects all columns of tbl.
Works for any table.
For selecting arbitrary columns you have to substitute (NULL::tbl).* with a matching number of NULL columns in the second query.
Assuming that NULL values are ok for "blank" rows.
If not, you'll have to cast your columns to text in the first and substitute '' for NULL in the second SELECT.
Query will be slow with very big tables.
If I had to do it, I would write a plpgsql function that loops through the results and inserts the blank rows. But you mentioned you had no direct access to the db ...
In short, no, there's not an easy way to do this, and generally, you shouldn't try. The database is concerned with what your data actually is, not how it's going to be displayed. It's not an appropriate scope of responsibility to expect your database to return "dummy" or "extra" data so that some down-stream process produces a desired output. The generating script needs to do that.
As you can't change your down-stream process, you could (read that with a significant degree of skepticism and disdain) add things like this:
Select Top 3
a, b, c, d
From
table
Union Select Top 1
'', '', '', ''
From
table
Union Select Top 3 Skip 3
a, b, c, d
From
table
Please, don't actually try do that.
You can do it (at least on DB2 - there doesn't appear to be equivalent functionality for your version of PostgreSQL).
No looping needed, although there is a bit of trickery involved...
Please note that though this works, it's really best to change your display code.
Statement requires CTEs (although that can be re-written to use other table references), and OLAP functions (I guess you could re-write it to count() previous rows in a subquery, but...).
WITH dataList (rowNum, dataColumn) as (SELECT CAST(CAST(:interval as REAL) /
(:interval - 1) * ROW_NUMBER() OVER(ORDER BY dataColumn) as INTEGER),
dataColumn
FROM dataTable),
blankIncluder(rowNum, dataColumn) as (SELECT rowNum, dataColumn
FROM dataList
UNION ALL
SELECT rowNum - 1, :blankDataColumn
FROM dataList
WHERE MOD(rowNum - 1, :interval) = 0
AND rowNum > :interval)
SELECT *
FROM dataList
ORDER BY rowNum
This will generate a list of those elements from the datatable, with a 'blank' line every interval lines, as ordered by the initial query. The result set only has 'blank' lines between existing lines - there are no 'blank' lines on the ends.

Thin out many ST_Points

I have many (1.000.000) ST_Points in a postgres-db with postgis
extension. When i show them on a map, browsers are getting very busy.
For that I would like to write an sql-statement which filters the high density to only one point.
When a User zoom out of 100 ST_Points, postgres should give back only one.
But only if these Points are close together.
I tried it with this statement:
select a.id, count(*)
from points as a, points as b
where st_dwithin(a.location, b.location, 0.001)
and a.id != b.id
group by a.id
I would call it thin-out but didnot find anything - maybe because I'm
not a native english speaker.
Does anybody have some suggestions?
I agree with tcarobruce that clustering is the term you are looking for. But it is available in postgis.
Basically clustering can be achieved by reducing the number of decimals in the X and Y and grouping upon them;
select
count(*),
round(cast (ST_X(geom) as numeric),3)
round(cast (ST_Y(geom) as numeric),3)
from mytable
group by
round(cast (ST_X(geom) as numeric),3),
round(cast (ST_Y(geom) as numeric),3)
Which will result in a table with coordinates and the number of real points at that coordinate. In this particular sample, it leaves you with rounding on 3 decimals, 0.001 like in your initial statement.
You can cluster nearby Points together using ST_ClusterDBSCAN
Then keep all single points and for example:
Select one random Point per cluster
or
Select the centroid of each Point cluster.
I use eps 300 to cluster points together that are within 300 meters.
create table buildings_grouped as
SELECT geom, ST_ClusterDBSCAN(geom, eps := 300, minpoints := 2) over () AS cid
FROM buildings
1:
create table buildings_grouped_keep_random as
select geom, cid from buildings_grouped
where cid is null
union
select * from
(SELECT DISTINCT ON (cid) *
FROM buildings_grouped
ORDER BY cid, random()) sub
2:
create table buildings_grouped_keep_centroid as
select geom, cid from buildings_grouped
where cid is null
union
select st_centroid(st_union(geom)) geom, cid
from buildings_grouped
where cid is not null
group by cid
The term you are looking for is "clustering".
There are client-side libraries that do this, as well as commercial services that do it server-side.
But it's not something PostGIS does natively. (There's a ticket for it.)
You'll probably have to write your own solution, and precompute your clusters ahead of time.
ST_ClusterDBSCAN- and KMeans- based clustering works but it is VERY SLOW! for big data sets. So it is practically unusable. PostGIS functions like ST_SnapToGrid and ST_RemoveRepeatedPoints is faster and can help in some cases. But the best approach, I think, is using PDAL thinning filters like sample filter. I recommend using it with PG Point Cloud.
Edit:
ST_SnapToGrid is pretty fast and useful. Here is the example query for triangulation with optimizations:
WITH step1 AS
(
SELECT geometry, ST_DIMENSION(geometry) AS dim FROM table
)
, step2 AS
(
SELECT ST_SIMPLIFYVW(geometry, :tolerance) AS geometry FROM step1 WHERE dim > 0
UNION ALL
(WITH q1 AS
(
SELECT (ST_DUMP(geometry)).geom AS geometry FROM step1 WHERE dim = 0
)
SELECT ST_COLLECT(DISTINCT(ST_SNAPTOGRID(geometry, :tolerance))) FROM q1)
)
, step3 AS
(
SELECT ST_COLLECT(geometry) AS geometry FROM step2
)
SELECT ST_DELAUNAYTRIANGLES(geometry, :tolerance, 0)::BYTEA AS geometry
FROM step3
OFFSET :offset LIMIT :limit;