Remove CROSS JOIN LATERAL from postgres query that spans many to many - sql

I have the following three tables (many to many):
Location
+====+==============+===+===+=============+
| id | coord_system | x | y | last_update |
+====+==============+===+===+=============+
| | | | | |
+----+--------------+---+---+-------------+
Mapping
+=============+============+
| location_id | history_id |
+=============+============+
| | |
+-------------+------------+
History
+====+=======+======+
| id | speed | date |
+====+=======+======+
| | | |
+----+-------+------+
The location table represents physical x, y locations within a specific coordinate system. For each x, y location at least one row in the history table exists. Each row in the history table can point to multiple rows in the location table.
Important to note is that (coord_system, x, y) is indexed and unique. I don't think it makes a difference, but all ids and coord_system are UUIDs; in the toy example below I use letters to make it easier to read. The location and history tables have additional columns, but they do not change the scope of the question. The last_update column on the location table should match the date column on the history table (I come back to this later in the post).
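For reference, a minimal sketch of the schema as described above; the column types (integer coordinates, timestamp dates, double precision speed) are my assumptions, and the additional columns are omitted:

-- Types below are assumptions; extra columns are omitted.
CREATE TABLE location (
    id           uuid PRIMARY KEY,
    coord_system uuid NOT NULL,
    x            integer NOT NULL,
    y            integer NOT NULL,
    last_update  timestamptz,
    UNIQUE (coord_system, x, y)
);

CREATE TABLE history (
    id    uuid PRIMARY KEY,
    speed double precision,
    date  timestamptz
);

CREATE TABLE mapping (
    location_id uuid NOT NULL REFERENCES location (id),
    history_id  uuid NOT NULL REFERENCES history (id),
    PRIMARY KEY (location_id, history_id)
);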
The goal is to fetch the most recent history row for a range of (coord_system, x, y). Currently this is done with a CROSS JOIN LATERAL, like this:
SELECT *
FROM location loc
CROSS JOIN LATERAL
    (SELECT *
     FROM history hist
     LEFT JOIN mapping map ON hist.id = map.history_id
     WHERE map.location_id = loc.id
     ORDER BY date DESC
     LIMIT 1) AS records
WHERE loc.coord_system = '43330ccc-3f42-4f05-8ec5-18cb659bfd2d'
  AND (x >= 403047 AND x <= 404047)
  AND (y >= 16451337 AND y <= 16452337);
For this specific range of x, y and coord_system the query takes ~25 seconds to run and returns 182 351 rows.
I am not extremely experienced in SQL, but I thought the goal of this query could also be achieved with a regular join. If I join across the three tables with the same x, y and coord_system "filters", it takes about 2 seconds and returns ~3 million rows. I tried to be clever and use the dates to prune down the result:
SELECT *
FROM history hist
RIGHT JOIN mapping map ON hist.id = map.history_id
RIGHT JOIN location loc ON loc.id = map.location_id
WHERE loc.coord_system = '43330ccc-3f42-4f05-8ec5-18cb659bfd2d'
  AND (x >= 403047 AND x <= 404047)
  AND (y >= 16451337 AND y <= 16452337)
  AND loc.last_update = hist.date;
This got very close to the result of the original query: 182 485 rows in ~3 seconds. Unfortunately the result needs to be exactly the same. I am guessing I made a logical mistake in this query and came here hoping someone can point it out.
My question is: is there a clever way to make a join take only the rows with the "newest" date from the history.date column? As expected, I am trying to make the query run as quickly as possible while maintaining the correct result set.
In the table below I show a toy example of the join and the results I would expect (marked in the "return_row" column).
+=============+==============+===+===+=============+============+============+=======+============+============+
| location.id | coord_system | x | y | location_id | history_id | history.id | speed | date | return_row |
+=============+==============+===+===+=============+============+============+=======+============+============+
| 0 | a | 1 | 1 | 0 | 0 | 0 | 3.0 | 2020/10/31 | * |
+-------------+--------------+---+---+-------------+------------+------------+-------+------------+------------+
| 0 | a | 1 | 1 | 0 | 1 | 1 | 3.1 | 2020/10/30 | |
+-------------+--------------+---+---+-------------+------------+------------+-------+------------+------------+
| 0 | a | 1 | 1 | 0 | 2 | 2 | 3.2 | 2020/10/29 | |
+-------------+--------------+---+---+-------------+------------+------------+-------+------------+------------+
| 1 | a | 1 | 2 | 1 | 3 | 3 | 3.1 | 2020/10/31 | * |
+-------------+--------------+---+---+-------------+------------+------------+-------+------------+------------+
| 1 | a | 1 | 2 | 1 | 4 | 4 | 3.0 | 2020/10/30 | |
+-------------+--------------+---+---+-------------+------------+------------+-------+------------+------------+
| 2 | a | 2 | 2 | 2 | 5 | 5 | 4 | 2020/10/31 | * |
+-------------+--------------+---+---+-------------+------------+------------+-------+------------+------------+
| 3 | b | 1 | 1 | 3 | 6 | 6 | 5 | 2020/10/1 | * |
+-------------+--------------+---+---+-------------+------------+------------+-------+------------+------------+

Does it work better with DISTINCT ON?
SELECT DISTINCT ON (l.id) l.id, h.date, ... -- enumerate the columns here
FROM location l
LEFT JOIN mapping m ON m.location_id = l.id
LEFT JOIN history h ON h.id = m.history_id
WHERE
l.coord_system = '43330ccc-3f42-4f05-8ec5-18cb659bfd2d'
AND l.x BETWEEN 403047 AND 404047
AND l.y BETWEEN 16451337 AND 16452337
ORDER BY l.id, h.date DESC
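If you want to compare plans, a window-function variant of the same "newest history row per location" idea is below; this is only a sketch reusing the table and column names from the question, not a replacement for the DISTINCT ON query above. Like DISTINCT ON, it picks one row arbitrarily if two history rows tie on the newest date.

SELECT *
FROM (SELECT l.*, h.*,
             -- rank history rows per location, newest first
             row_number() OVER (PARTITION BY l.id ORDER BY h.date DESC) AS rn
      FROM location l
      JOIN mapping m ON m.location_id = l.id
      JOIN history h ON h.id = m.history_id
      WHERE l.coord_system = '43330ccc-3f42-4f05-8ec5-18cb659bfd2d'
        AND l.x BETWEEN 403047 AND 404047
        AND l.y BETWEEN 16451337 AND 16452337) ranked
WHERE rn = 1;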

Related

Merging multiple "state-change" time series

Given a number of tables like the following, representing state-changes at time t of an entity identified by id:
| A | | B |
| t | id | a | | t | id | b |
| - | -- | - | | - | -- | - |
| 0 | 1 | 1 | | 0 | 1 | 3 |
| 1 | 1 | 2 | | 2 | 1 | 2 |
| 5 | 1 | 3 | | 3 | 1 | 1 |
where t is in reality a DateTime field with millisecond precision (making discretisation infeasible), how would I go about creating the following output?
| output |
| t | id | a | b |
| - | -- | - | - |
| 0 | 1 | 1 | 3 |
| 1 | 1 | 2 | 3 |
| 2 | 1 | 2 | 2 |
| 3 | 1 | 2 | 1 |
| 5 | 1 | 3 | 1 |
The idea is that for any given input timestamp, the entire state of a selected entity can be extracted by selecting one row from the resulting table. So the latest state of each variable corresponding to any time needs to be present in each row.
I've tried various JOIN statements, but I seem to be getting nowhere.
Note that in my use case:
rows also need to be joined by entity id
there may be more than two source tables to be merged
I'm running PostgreSQL, but I will eventually translate the query to SQLAlchemy, so a pure SQLAlchemy solution would be even better
I've created a db<>fiddle with the example data.
I think you want a full join and some other manipulations. The ideal would be:
select t, id,
       last_value(a.a ignore nulls) over (partition by id order by t) as a,
       last_value(b.b ignore nulls) over (partition by id order by t) as b
from a full join b using (t, id);
But . . . Postgres doesn't support ignore nulls. So an alternative method is:
select t, id,
       max(a) over (partition by id, grp_a) as a,
       max(b) over (partition by id, grp_b) as b
from (select *,
             count(a.a) over (partition by id order by t) as grp_a,
             count(b.b) over (partition by id order by t) as grp_b
      from a full join b using (t, id)
     ) ab;
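The question notes there may be more than two source tables; the same pattern extends by chaining full joins and adding one count()/max() pair per table. A sketch assuming a hypothetical third table c(t, id, c):

select t, id,
       max(a) over (partition by id, grp_a) as a,
       max(b) over (partition by id, grp_b) as b,
       max(c) over (partition by id, grp_c) as c
from (select *,
             count(a.a) over (partition by id order by t) as grp_a,
             count(b.b) over (partition by id order by t) as grp_b,
             count(c.c) over (partition by id order by t) as grp_c  -- c is hypothetical
      from a
      full join b using (t, id)
      full join c using (t, id)
     ) abc;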

SQL - How to transform a table with a range of values into another with all the numbers in that range?

I have a Table (A) with some intervals from start_val to end_val with an attribute for that range of values.
I want a Table (B) in which each row is a number in the interval of start_val to end_val with the attribute of that range.
I need to do that using SQL.
Example
Table A:
+---------+--------+----------+
|start_val| end_val| attribute|
+---------+--------+----------+
| 10 | 12 | 1 |
| 20 | 23 | 2 |
+---------+--------+----------+
Table B (Expected result):
+-----------------+-----------+
| val_in_interval | attribute |
+-----------------+-----------+
| 10              | 1         |
| 11              | 1         |
| 12              | 1         |
| 20              | 2         |
| 21              | 2         |
| 22              | 2         |
| 23              | 2         |
+-----------------+-----------+
Here is a way to do this
select m.start_val + n -1 as start_val_computed
,m.attribute
from t m
join lateral generate_series(1,(m.end_val-m.start_val)+1) n
on 1=1
+--------------------+-----------+
| start_val_computed | attribute |
+--------------------+-----------+
| 10 | 1 |
| 11 | 1 |
| 12 | 1 |
| 20 | 2 |
| 21 | 2 |
| 22 | 2 |
| 23 | 2 |
+--------------------+-----------+
working example
https://dbfiddle.uk/?rdbms=postgres_12&fiddle=ce9e13765b5a4c3616d95ec659c1dfc9
You may use a calendar table approach:
SELECT
t1.val,
t2.attribute
FROM generate_series(10, 23) AS t1(val)
INNER JOIN TableA t2
ON t1.val BETWEEN t2.start_val AND t2.end_val
ORDER BY
t2.attribute,
t1.val;
Note: You may expand the bounds in the above call to generate_series to cover whatever range you think your data would need.
This is a variant of George's solution, but it is a bit simpler:
select n, m.attribute
from t m cross join lateral
generate_series(m.start_val, m.end_val) n;
The changes are:
CROSS JOIN instead of JOIN. So, no need for an ON clause.
No arithmetic in the GENERATE_SERIES().
No arithmetic in the SELECT.
You can just call the result of GENERATE_SERIES() whatever name you want in the result set.
Postgres actually allows you to put GENERATE_SERIES() in the SELECT:
select generate_series(m.start_val, m.end_val) as n, m.attribute
from t m;
However, I am not a fan of putting row generating functions anywhere other than the FROM clause. I just find it confusing to figure out what the query is doing.

Make a query making groups on the same result row

I have two tables, like this:
select * from extrafieldvalues;
+----------------------------+
| id | value | type | idItem |
+----------------------------+
| 1 | 100 | 1 | 10 |
| 2 | 150 | 2 | 10 |
| 3 | 101 | 1 | 11 |
| 4 | 90 | 2 | 11 |
+----------------------------+
select * from items
+------------+
| id | name |
+------------+
| 10 | foo |
| 11 | bar |
+------------+
I need to make a query and get something like this:
+--------------------------------------+
| idItem | valtype1 | valtype2 | name |
+--------------------------------------+
| 10 | 100 | 150 | foo |
| 11 | 101 | 90 | bar |
+--------------------------------------+
The quantity of types of extra field values is variable, but every item ALWAYS uses every extra field.
If you have only two fields, then left join is an option for this:
select i.*, efv1.value as value_1, efv2.value as value_2
from items i left join
     extrafieldvalues efv1
     on efv1.iditem = i.id and
        efv1.type = 1 left join
     extrafieldvalues efv2
     on efv2.iditem = i.id and
        efv2.type = 2;
In terms of performance, two joins are probably faster than an aggregation -- and it makes it easier to bring in more columns from items. On the other hand, conditional aggregation generalizes more easily, and the performance changes little as more columns from extrafieldvalues are added to the select.
Use conditional aggregation
select iditem,
       name,
       max(case when type = 1 then value end) as valtype1,
       max(case when type = 2 then value end) as valtype2
from extrafieldvalues a
inner join items b on a.iditem = b.id
group by iditem, name
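Since the question says the quantity of extra-field types is variable, the conditional aggregation only needs one more max(case ...) line per type; a sketch assuming a hypothetical third type (type = 3) existed:

select iditem,
       name,
       max(case when type = 1 then value end) as valtype1,
       max(case when type = 2 then value end) as valtype2,
       max(case when type = 3 then value end) as valtype3  -- hypothetical third type
from extrafieldvalues a
inner join items b on a.iditem = b.id
group by iditem, name;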

Combining two views into one result set with transform?

I have a couple of views that generate the following two outputs in SQL Server.
The first one (Flats output) shows the number of flats in a particular town for each TileRoofs / Brick Wall combination. The second one shows the same, but for houses.
What I'm trying to do is to create a final table that looks like the 3rd example, where the flats and house counts are combined for the corresponding TileRoofs and Brick Wall combinations.
I have tried union and then grouping, but I'm really struggling to get the Flats and Houses count columns side by side. Is anyone able to help please?
Thanks
--View one
| Town | Flats | TileRoofs | Brick Wall |
-----------------------------------------
| A | 3 | Y | N |
| A | 4 | N | Y |
| A | 8 | N | N |
--View two
| Town | Houses | TileRoofs | Brick Wall |
------------------------------------------
| A | 1 | Y | Y |
| A | 2 | Y | N |
| A | 5 | N | Y |
| A | 2 | N | N |
--Prefered output, by combining the two--
| Town | Flats | Houses | TileRoofs | Brick Wall |
--------------------------------------------------
| A | 0 | 1 | Y | Y |
| A | 3 | 2 | Y | N |
| A | 4 | 5 | N | Y |
| A | 8 | 2 | N | N |
Full outer join might help here.
select isnull(a.Town, b.Town) Town,
isnull(a.TileRoofs, b.TileRoofs) TileRoofs,
isnull(a.[Brick wall], b.[Brick wall]) [Brick wall],
isnull(a.Flats, 0) Flats,
isnull(b.Houses, 0) Houses
from ViewOne a
full outer join ViewTwo b
on a.Town = b.Town
and a.TileRoofs = b.TileRoofs
and a.[Brick wall] = b.[Brick wall]
select v2.Town,
       coalesce(v1.Flats, 0) as Flats,
       v2.Houses,
       v2.TileRoofs,
       v2.[Brick Wall]
from view2 as v2
left join view1 as v1
  on v1.Town = v2.Town
 and v1.TileRoofs = v2.TileRoofs
 and v1.[Brick Wall] = v2.[Brick Wall]
You may be after a full outer join
select
houses.town,
flats.flats,
houses.houses,
houses.BrickWall,
houses.TileRoofs
from flats
full outer join houses
on houses.town=flats.town
and houses.TileRoofs = flats.TileRoofs
and houses.BrickWall = flats.BrickWall

Getting Sum of MasterTable's amount which joins to DetailTable

I have two tables:
1. Master
| ID | Name | Amount |
|-----|--------|--------|
| 1 | a | 5000 |
| 2 | b | 10000 |
| 3 | c | 5000 |
| 4 | d | 8000 |
2. Detail
| ID |MasterID| PID | Qty |
|-----|--------|-------|------|
| 1 | 1 | 1 | 10 |
| 2 | 1 | 2 | 20 |
| 3 | 2 | 2 | 60 |
| 4 | 2 | 3 | 10 |
| 5 | 3 | 4 | 100 |
| 6 | 4 | 1 | 20 |
| 7 | 4 | 3 | 40 |
I want to select sum(Amount) from Master joined to Detail where Detail.PID in (1,2,3).
So I execute the following query:
SELECT SUM(Amount)
FROM Master M
INNER JOIN Detail D ON M.ID = D.MasterID
WHERE D.PID IN (1,2,3)
The result should be 20000, but I am getting 40000.
See this fiddle. Any suggestions?
You are getting exactly double the amount because the detail table has two occurrences for each of the PIDs in the WHERE clause.
See demo
Use
SELECT SUM(Amount)
FROM Master M
WHERE M.ID IN (
SELECT DISTINCT MasterID
FROM DETAIL
WHERE PID IN (1,2,3) )
What is the requirement of joining the Master table with Detail when all the columns you need are in the Master table?
Also, isn't there an FK relationship defined on these tables? Looking at your data it seems to me that there should be an FK on the Detail table for MasterID. If that is the case then you do not need to join the tables at all.
Also, if you want to make sure that records exist in the Detail table for the rows you are summing and there is no FK relationship, you could try EXISTS instead of a join, as sketched below.
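A minimal sketch of that EXISTS variant, using the table and column names from the question; each Master row is counted at most once, matching the IN version above:

SELECT SUM(M.Amount)
FROM Master M
WHERE EXISTS (
    SELECT 1
    FROM Detail D
    WHERE D.MasterID = M.ID
      AND D.PID IN (1, 2, 3)  -- same filter as the original query
);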