Postgres optimize several left joins on one table - sql

I have a postgres schema like this:
CREATE TABLE rows
(
id bigint NOT NULL,
start_year integer
);
CREATE TABLE calculations
(
id bigint NOT NULL,
row_id bigint NOT NULL,
year integer,
calculation numeric(23,7)
);
INSERT INTO rows (id, start_year)
VALUES
(1, 2020),
(2, 2021);
INSERT INTO calculations (id, row_id, year, calculation)
VALUES
(1, 1, 2019, 0),
(2, 1, 2020, 100),
(3, 1, 2021, 900),
(4, 1, 2022, 300),
(5, 1, 2023, 500),
(6, 2, 2019, 220),
(7, 2, 2020, 111),
(8, 2, 2021, 222),
(9, 2, 2024, 333),
(10, 2, 2025, 444);
And an SQL view with a select like this:
SELECT
row.id,
calc1.calculation as calc1,
calc2.calculation as calc2,
calc3.calculation as calc3
FROM
rows row
LEFT JOIN calculations calc1 on calc1.row_id = row.id and calc1.year = row.start_year
LEFT JOIN calculations calc2 on calc2.row_id = row.id and calc2.year = row.start_year + 1
LEFT JOIN calculations calc3 on calc3.row_id = row.id and calc3.year = row.start_year + 2;
In reality both tables are much larger. The query takes about 10 seconds to execute, and most of that time is spent on the joins to calculations. The only optimization I've managed so far is:
SELECT
row.id,
calc.calculation->(row.start_year)::text as calc1,
calc.calculation->(row.start_year+1)::text as calc2,
calc.calculation->(row.start_year+2)::text as calc3
FROM
rows row
LEFT JOIN (select row_id, json_object_agg(year, calculation) as calculation
from calculations
group by row_id) calc on calc.row_id = row.id
This gives roughly a 2x speedup, but that is not enough, and it still reads year values that aren't needed. When I replaced this query with one that takes only the first, second and third year, it ran much faster, so I wonder if there is another way to merge these JOINs into one and get a performance boost.
http://sqlfiddle.com/#!17/8ff004/4

You may try adding the following index to the calculations table:
CREATE INDEX idx_calc ON calculations (row_id, year, calculation);
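If the planner chooses it, this index can speed up each of the joins to the calculations table: it leads with the join columns (row_id, year) and also contains calculation, so Postgres may be able to answer each lookup with an index-only scan instead of visiting the table.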
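Separately from the index, the question asks whether the three joins can be merged into one. A minimal sketch of one way to do that, using conditional aggregation over a single join, is shown below. This is not the approach from the answer above; it is an alternative, and it assumes there is at most one calculations row per (row_id, year). The sketch runs against an in-memory SQLite database through Python's sqlite3 module purely so it is easy to try; the SQL itself is standard and should also be accepted by Postgres, but whether it is faster there depends on your data and indexes and is untested here.

```python
import sqlite3

# In-memory SQLite stands in for Postgres; only the query shape matters here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "rows" (id INTEGER NOT NULL, start_year INTEGER);
CREATE TABLE calculations (id INTEGER NOT NULL, row_id INTEGER NOT NULL,
                           year INTEGER, calculation NUMERIC);
INSERT INTO "rows" VALUES (1, 2020), (2, 2021);
INSERT INTO calculations VALUES
  (1, 1, 2019, 0), (2, 1, 2020, 100), (3, 1, 2021, 900), (4, 1, 2022, 300),
  (5, 1, 2023, 500), (6, 2, 2019, 220), (7, 2, 2020, 111), (8, 2, 2021, 222),
  (9, 2, 2024, 333), (10, 2, 2025, 444);
""")

# One join restricted to the three years of interest, then one CASE per output
# column. Assumes at most one calculations row per (row_id, year).
single_join = """
SELECT r.id,
       MAX(CASE WHEN c.year = r.start_year     THEN c.calculation END) AS calc1,
       MAX(CASE WHEN c.year = r.start_year + 1 THEN c.calculation END) AS calc2,
       MAX(CASE WHEN c.year = r.start_year + 2 THEN c.calculation END) AS calc3
FROM "rows" r
LEFT JOIN calculations c
       ON c.row_id = r.id
      AND c.year BETWEEN r.start_year AND r.start_year + 2
GROUP BY r.id
ORDER BY r.id
"""
pivoted = conn.execute(single_join).fetchall()
for row in pivoted:
    print(row)
```

For the sample data this yields 100, 900, 300 for row 1 and 222, NULL, NULL for row 2, the same values the three-join view produces. The BETWEEN condition is what keeps unneeded years from being read, and it can use the (row_id, year, calculation) index as a range lookup.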

Related

Is it possible to set the initial-select value of a recursive CTE query with a parameter?

Using this self-referencing table:
CREATE TABLE ENTRY (
ID integer NOT NULL,
PARENT_ID integer,
... other columns ...
)
There are many top-level rows (with PARENT_ID = NULL) that can have 0 to several levels of child rows, forming a graph like this:
(1, NULL, 'A'),
(2, 1, 'B'),
(3, 2, 'C'),
(4, 3, 'D'),
(5, 4, 'E'),
(6, NULL, 'one'),
(7, 6, 'two'),
(8, 7, 'three'),
(9, 6, 'four'),
(10, 9, 'five'),
(11, 10, 'six');
I want to write a query that would give me the subgraph (all related rows in both directions) for a given row, for instance (just showing the ID values):
ID = 3: (1, 2, 3, 4, 5)
ID = 6: (6, 7, 8, 9, 10, 11)
ID = 7: (6, 7, 8)
ID = 10: (6, 9, 10, 11)
It's similar to the query in §3.3 Queries against a Graph of the SQLite documentation, for returning a graph from any of its nodes:
WITH RECURSIVE subtree(x) AS (
SELECT 3
UNION
SELECT e1.ID x FROM ENTRY e1 JOIN subtree ON e1.PARENT_ID = subtree.x
UNION
SELECT e2.PARENT_ID x FROM ENTRY e2 JOIN subtree ON e2.ID = subtree.x
)
SELECT x FROM subtree
LIMIT 100;
... with 3 as the anchor / initial-select value.
This particular query works fine in DBeaver. The sqlite version available in db-fiddle gives a circular reference error, but this nested CTE gives the same result in db-fiddle.
However, I can only get this to work when the initial value is hard-coded in the query. I can't find any mention of how to supply that initial-select value as a parameter.
I'd think it should be straightforward. Maybe the case of having more than one top-level row is very unusual, or I'm overlooking something blindingly obvious?
Any suggestions?
As forpas points out above, SQLite doesn't support passing parameters to stored/user defined functions.
Using a placeholder in the prepared statement from the calling code is a good alternative.
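A minimal sketch of the placeholder approach, assuming the calling code is Python with the standard sqlite3 module (the question does not say which client is used). The NAME column stands in for the elided "other columns" and is hypothetical. The two recursive branches of the original query are folded into one recursive term here, as an assumption, because some older SQLite builds reject more than one recursive term (which may be the circular reference error mentioned above).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- NAME is a hypothetical stand-in for the question's "other columns".
CREATE TABLE ENTRY (ID INTEGER NOT NULL, PARENT_ID INTEGER, NAME TEXT);
INSERT INTO ENTRY VALUES
  (1, NULL, 'A'), (2, 1, 'B'), (3, 2, 'C'), (4, 3, 'D'), (5, 4, 'E'),
  (6, NULL, 'one'), (7, 6, 'two'), (8, 7, 'three'),
  (9, 6, 'four'), (10, 9, 'five'), (11, 10, 'six');
""")

# The "?" is the initial-select value; it is bound at execution time.
SUBGRAPH_SQL = """
WITH RECURSIVE subtree(x) AS (
    SELECT ?
    UNION
    -- Walk to the child when the row's parent is x, otherwise to the parent.
    SELECT CASE WHEN e.PARENT_ID = subtree.x THEN e.ID ELSE e.PARENT_ID END
    FROM ENTRY e
    JOIN subtree ON e.PARENT_ID = subtree.x OR e.ID = subtree.x
)
SELECT x FROM subtree WHERE x IS NOT NULL ORDER BY x
"""

def subgraph(anchor_id):
    """Return the IDs reachable from anchor_id in either direction."""
    return [x for (x,) in conn.execute(SUBGRAPH_SQL, (anchor_id,))]

print(subgraph(3))  # [1, 2, 3, 4, 5]
```

The query text never changes between calls; only the bound value does, so the same prepared statement can be reused for any starting row.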

running sums, find blocks of rows that sum to given list of values

here is the test data:
declare @trial table (id int, val int)
insert into @trial (id, val)
values (1, 1), (2, 3),(3, 2), (4, 4), (5, 5),(6, 6), (7, 7), (8, 2),(9, 3), (10, 4), (11, 6),(12, 10), (13, 5), (14, 3),(15, 2) ;
select * from @trial order by id asc
description of data:
I have a list of n values that represent sums. Assume they are (10, 53) for this example. The values in @trial can be both negative and positive. Note that the values in @trial will always add up to the total of the given sums.
description of pattern:
10 in this example is the 1st sum I want to match and 53 is the 2nd. The dataset is set up so that consecutive blocks of rows always add up to these sums: here the first 4 rows sum to 10, and the next 11 rows sum to 53. The dataset will always have this feature. In other words, the 1st given sum is the total of rows 1 to i, the 2nd sum is the total of rows i + 1 to j, and so on.
Finally, I want an id that identifies the group of rows that sums to each given sum. In this example, rows 1 to 4 get id 1 and rows 5 to 15 get id 2.
This answers the original question.
From what you describe you can do something like this:
select v.grp, t.*
from (select t.*, sum(val) over (order by id) as running_val
from @trial t
) t left join
(select grp, lag(upper, 1, -1) over (order by upper) as lower, upper
from (values (1, 10), (2, 53)) v(grp, upper)
) v
on t.running_val > v.lower and
t.running_val <= v.upper
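For the sample data as it stands, the running total does not stop at 53: it reaches 10 at row 4 and 63 at row 15. A sketch of the same range-join idea with the bounds taken as cumulative totals of the target sums (10, then 10 + 53 = 63) follows. It runs on SQLite through Python's sqlite3 module only so it can be tried quickly; a plain table named trial stands in for the table variable, and lo_bound / hi_bound are names made up for this sketch. It also assumes every val is positive, so the running total only increases; with negative values a running total can cross a bound more than once and this join would mislabel rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trial (id INTEGER, val INTEGER)")
conn.executemany(
    "INSERT INTO trial VALUES (?, ?)",
    [(1, 1), (2, 3), (3, 2), (4, 4), (5, 5), (6, 6), (7, 7), (8, 2),
     (9, 3), (10, 4), (11, 6), (12, 10), (13, 5), (14, 3), (15, 2)],
)

GROUPING_SQL = """
WITH targets(grp, target) AS (VALUES (1, 10), (2, 53)),
bounds AS (
    -- Cumulative target totals: group 1 covers (0, 10], group 2 covers (10, 63].
    SELECT grp,
           SUM(target) OVER (ORDER BY grp) - target AS lo_bound,
           SUM(target) OVER (ORDER BY grp)          AS hi_bound
    FROM targets
),
running AS (
    SELECT id, val, SUM(val) OVER (ORDER BY id) AS running_val
    FROM trial
)
SELECT r.id, r.val, b.grp
FROM running r
LEFT JOIN bounds b
       ON r.running_val > b.lo_bound AND r.running_val <= b.hi_bound
ORDER BY r.id
"""
labelled = conn.execute(GROUPING_SQL).fetchall()
for row in labelled:
    print(row)
```

Rows 1 to 4 come back with grp 1 and rows 5 to 15 with grp 2, which is the labelling the question asks for.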

Window functions with summations on postgresql

What I'm trying to achieve is to calculate a daily, weekly and monthly leaderboard with sum(points), all-time high points and all-time low points per user (and per time-frame), but I haven't had a lot of success. My schema looks like:
CREATE TABLE users(
id SERIAL PRIMARY KEY,
name text NOT NULL
);
-- contains millions of rows!
CREATE TABLE results(
id SERIAL PRIMARY KEY,
user_id integer NOT NULL REFERENCES users(id),
points float NOT NULL, -- can be negative
date timestamptz NOT NULL DEFAULT NOW()
);
-- sample data
INSERT INTO users (name)
VALUES ('user1'), ('user2'), ('user3'), ('user4');
INSERT INTO results (user_id, points)
VALUES (2, -10), (1, 50), (4, -20), (3, 20), (2, 50), (4, -20), (1, 50), (1, -25), (4, 30), (3, -70), (2, 50), (1, -25), (4, 20), (2, -90), (3, 60), (4, -20);
So, for example, assuming those results all fall within the last week, the weekly leaderboard would look something like:
User | sum(points)    User | ATH points    User | ATL points
1    | 50             1    | 100           3    | -50
3    | 10             2    | 90            4    | -40
which are only calculated with the results where date is in the last week, and so on.
But to achieve that, it seems to me that I need to somehow iterate over every result to calculate the highest and lowest running point totals the user had at any moment in that time-frame. Doing it in memory isn't going to work well, because I'd need to hold millions of results in memory.
Is there any way of doing it completely in a query? I've looked into window functions but don't see how a summation could be done using them.
You should use window functions to calculate the sum and the running sum (ordered by date), then take the minimum and maximum of the running sums:
SELECT user_id,
sum,
min(running) AS atl,
max(running) AS ath
FROM (SELECT user_id,
sum(points) OVER (PARTITION BY user_id),
sum(points) OVER (PARTITION BY user_id ORDER BY date) AS running
FROM results
WHERE date > current_timestamp - INTERVAL '1 week') AS q
GROUP BY user_id, sum;
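To see the running-sum idea on the question's sample rows, here is a small sketch that runs the same query shape on SQLite through Python's sqlite3 module. Two things are assumptions made for the sketch: the date filter is left out, and the running sum is ordered by id rather than date, because the sample rows are all inserted with the same default timestamp and id then reflects insertion order. The alias total is also introduced here; Postgres names the unaliased column sum.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (id INTEGER PRIMARY KEY, "
    "user_id INTEGER NOT NULL, points REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO results (user_id, points) VALUES (?, ?)",
    [(2, -10), (1, 50), (4, -20), (3, 20), (2, 50), (4, -20), (1, 50), (1, -25),
     (4, 30), (3, -70), (2, 50), (1, -25), (4, 20), (2, -90), (3, 60), (4, -20)],
)

LEADERBOARD_SQL = """
SELECT user_id,
       total,
       MIN(running) AS atl,
       MAX(running) AS ath
FROM (SELECT user_id,
             SUM(points) OVER (PARTITION BY user_id) AS total,
             -- ordered by id here; the answer orders by date
             SUM(points) OVER (PARTITION BY user_id ORDER BY id) AS running
      FROM results) AS q
GROUP BY user_id, total
ORDER BY user_id
"""
leaderboard = conn.execute(LEADERBOARD_SQL).fetchall()
for row in leaderboard:
    print(row)  # (user_id, total, atl, ath)
```

User 1 comes out with a total of 50 and a high of 100, user 2 with a high of 90, user 3 with a low of -50 and user 4 with a low of -40, which agrees with the figures in the question's example leaderboard.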

Recursive member of a common table expression 'cte' has multiple recursive references?

I have the following two tables, E and G.
create table E(K1 int, K2 int primary key (K1, K2))
insert E
values (1, 11), (1, 20), (2, 10), (2, 30), (3, 10), (3, 30),
(4, 100), (5, 200), (6, 200),
(7, 300), (8, 300), (9, 310), (10, 310), (10, 320), (11, 320), (12, 330)
create table G(GroupID varchar(10), K1 int primary key)
insert G
values ('Group 1', 1), ('Group 1', 2), ('Group 2', 4), ('Group 2', 5),
('Group 3', 8), ('Group 3', 9), ('Group 3', 12)
I need a view that, given a K2 value, finds all related K1 values. Two K1 values are "related" if either of the following holds:
They have the same K2 in table E. For example, 2 and 3 in E are related because both have a record with K2 of 10 ((2, 10), (3, 10)).
They have the same GroupID in table G. For example, the K1 values 1 and 2 are both in group Group 1.
So querying the following view
select K1 from GroupByK2 where K2 = 200 -- or 100
should return
4
5
6
because both (5, 200) and (6, 200) have the same K2. And the 4 and 5 of (4, 100) and (5, 200) are both in 'Group 2'.
And select K1 from GroupByK2 where K2 = 300 -- or 310, 320, 330 should return 7, 8, 9, 10, 11, 12.
View:
create view GroupByK2
as
with cte as (
select E.*, K2 K2x from E
union all
select E.K1, E.K2, cte.K2x
from cte
join G on cte.K1 = G.K1
join G h on h.GroupID = G.GroupID
join E on E.K1 = h.K1 and E.K1 <> cte.K1
where not exists (select * from cte x where x.k1 = G.k1 and x.K2 = G.K2) -- error
)
select *
from cte;
However, the SQL has the error of
Recursive member of a common table expression 'cte' has multiple recursive references?
Scratched my head over this one a bit, but here is a working, although highly inefficient solution...
You correctly tried to eliminate joining the original rows back to avoid the cyclic recursion, but it won't work due to 2 reasons:
As the error stated, you can't reference the recursive member more than once.
Even if you could, at each recursion, the recursive set consists only of the output of the previous recursion, so you wouldn't be able to eliminate the cycles from earlier recursions anyway.
My solution avoids that in a "less than optimal" way, it simply includes all the rows with the cycles, but limits the recursion level to a hard number (5 in the example, but you can parameterize it as well) to avoid the endless recursion, and only at the final query, eliminates the duplicates with a group by.
This may or may not work for you depending on the depth of the hierarchy. It creates tons of redundant work, and I doubt it will scale, but YMMV. I addressed it as a logical puzzle :-)
This is one of the (rare) cases where I will definitely consider an iterative solution instead of a set based one. You will need to create a table valued function so you can parameterize it, which you won't be able to do properly with a view. Within the function create a temporary table or table variable, populate it with the output sets one by one, and loop until you are done. This way you will be able to eliminate the cycles at the root by checking the content of the temporary table and only inserting new rows.
Anyway, here goes:
;WITH KeyGroups AS
(
SELECT E.*, G.GroupID
FROM E
LEFT OUTER JOIN
G
ON E.K1 = G.K1
),
Recursive AS
(
SELECT K.K1, K.K2, K.GroupID, 0 AS lvl
FROM KeyGroups AS K
WHERE K.K2 = 300
UNION ALL
SELECT K.K1, K.K2, K.GroupID, lvl + 1
FROM Recursive AS R
INNER JOIN
KeyGroups AS K
ON R.GroupID = K.GroupID
OR
R.K2 = K.K2
OR
R.K1 = K.K1
WHERE lvl < 5
)
SELECT MIN(lvl) AS lvl, K1, K2, GroupID
FROM Recursive
GROUP BY GroupID, K1, K2
ORDER BY lvl, K1, K2, GroupID;
Also see DBFiddle.
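The iterative alternative described above (keep a working set, add only rows not seen before, and loop until nothing new turns up) can be sketched outside the database to make the termination logic concrete. This is an illustration in Python over the question's sample rows, not the T-SQL table-valued function itself; the function name related_k1 and the variable names are made up for the sketch.

```python
# Sample rows from the question: E holds (K1, K2), G holds (GroupID, K1).
E = [(1, 11), (1, 20), (2, 10), (2, 30), (3, 10), (3, 30),
     (4, 100), (5, 200), (6, 200),
     (7, 300), (8, 300), (9, 310), (10, 310), (10, 320), (11, 320), (12, 330)]
G = [("Group 1", 1), ("Group 1", 2), ("Group 2", 4), ("Group 2", 5),
     ("Group 3", 8), ("Group 3", 9), ("Group 3", 12)]

def related_k1(k2):
    """Collect every K1 reachable from k2 via shared K2 values or shared groups."""
    found = {k1 for k1, key in E if key == k2}   # plays the role of the temp table
    frontier = set(found)                        # rows added in the last pass
    while frontier:
        k2_values = {key for k1, key in E if k1 in frontier}
        groups = {grp for grp, k1 in G if k1 in frontier}
        candidates = ({k1 for k1, key in E if key in k2_values}
                      | {k1 for grp, k1 in G if grp in groups})
        frontier = candidates - found            # keep only rows not seen before
        found |= frontier
    return sorted(found)

print(related_k1(200))  # [4, 5, 6]
print(related_k1(300))  # [7, 8, 9, 10, 11, 12]
```

Because each pass only expands from rows that were new in the previous pass, cycles cannot cause endless looping and no fixed recursion depth is needed.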
I'll give this some more thought tomorrow if I have time, and update here if I find a better solution.
Thanks for the interesting challenge and well formulated post.
HTH

Grouping records by subsets SQL

I have a database with PermitHolders (PermitNum = PK) and DetailedFacilities of each Permit Holder. In the tblPermitDetails table there are 2 columns
PermitNum (foreign Key)
FacilityID (integer Foreign Key Lookup to Facility table).
A permitee can have 1 - 29 items on their permit, e.i. Permit 50 can have a Boat Dock (FacID 4), a Paved walkway (FacID 17) a Retaining Wall (FacID 20) etc. I need an SQL filter/display whatever, ALL PERMIT #s that have ONLY FacIDs 19, 20, or 28, NOT ones that have those plus "x" others,....just that subset. I've worked on this for 4 days, would someone PLEASE help me? I HAVE posted to other BB but have not received any helpful suggestions.
As Oded suggested, here are more details.
There is no PK for the tblPermitDetails table.
Let's say that we have Permitees 1 - 10; Permit 1 is John Doe, he has a Boat Dock (FacID 1), a Walkway (FacID 4), a buoy (FacID 7), and Underbrushing (FacID 19)...those are 4 records for Permit 1. Permit 2 is Sus Brown, she has ONLY underbrushing (FacID 19), Permit 3 is Steve Toni, he has a Boat Dock (FacID 1), a Walkway (FacID 4), a buoy (FacID 7), and a Retaining Wall (FacID 20). Permit 4 is Jill Jack, she has Underbrushing (FacID 19), and a Retaining Wall (FacID 20). I could go on but I hope you follow me. I want an SQL (for MS Access) that will show me ONLY Permits 2 & 4 because they have a combination of FacIDs 19 & 20 [either both, or one or the other], BUT NOT ANYTHING ELSE such as Permit 1 who has #19, but also has 4 & 7.
I hope that helps, please say so if not.
Oh yea, I DO know the difference between i.e. and e.g. since I'm in my 40's, have written over 3000 pages of archaeological field reports and an MA thesis, but I'm really stressed out here from struggling with this SQL and could care less about consulting the Chicago Manual of Style before banging out a plea for help. SO, DON'T be coy about my composition errors! Thank you!
Untested, but how about something like this?
SELECT DISTINCT p.PermitNum
FROM tblPermitDetails p
WHERE EXISTS
(SELECT '+'
FROM tblFacility f
WHERE p.FacilityID = f.FacilityID
AND f.facilityID = 19 )
AND EXISTS
(SELECT '+'
FROM tblFacility f
WHERE p.FacilityID = f.FacilityID
AND f.facilityID = 20 )
AND EXISTS
(SELECT '+'
FROM tblFacility f
WHERE p.FacilityID = f.FacilityID
AND f.facilityID = 28 )
AND NOT EXISTS
(SELECT '+'
FROM tblFacility f
WHERE p.FacilityID = f.FacilityID
AND f.facilityID NOT IN (19,20,28) )
SELECT PermitNum
FROM tblPermitDetails
WHERE FacilityID IN (19, 20, 28)
GROUP BY PermitNum
HAVING COUNT(PermitNum)=3
I wasn't sure if you wanted ALL of 19,20,28 or ANY of 19,20,28... also, this is untested, but if you want the any of solution it should be fairly close
Select
allowed.PermitNum
from
DetailedFacilities allowed
join DetailedFacilities disallowed on allowed.PermitNum != disallowed.PermitNum
where
allowed.FacilityID in (19, 20, 28)
and disallowed.FacilityID not in (19, 20, 28)
SELECT DISTINCT PermitNum FROM tblPermitDetails t1
WHERE FacilityID IN (19, 20, 28)
AND NOT EXISTS (SELECT 1 FROM tblPermitDetails t2
WHERE t2.PermitNum = t1.PermitNum
AND FacilityId NOT IN (19, 20, 28));
Or, in prose: get the list of PermitNums that have any of the requested facility IDs, as long as no row exists for that PermitNum with a facility ID outside the requested list.
A more optimized version of the same query would be the following:
SELECT PermitNum FROM (SELECT DISTINCT PermitNum FROM tblPermitDetails
WHERE FacilityID IN (19, 20, 28)) AS t1
WHERE NOT EXISTS (SELECT 1 FROM tblPermitDetails t2
WHERE t2.PermitNum = t1.PermitNum
AND FacilityID NOT IN (19, 20, 28));
It's a little harder to read, but it will involve fewer "NOT EXISTS" subqueries by doing the "DISTINCT" part first.
Update:
David-W-Fenton mentions that NOT EXISTS should be avoided for optimization reasons. For a small table, this probably won't matter much, but you could also do the query using COUNT(*) if you needed to avoid NOT EXISTS:
SELECT DISTINCT PermitNum FROM tblPermitDetails t1
WHERE (SELECT COUNT(*) FROM tblPermitDetails t2
WHERE t1.PermitNum = t2.PermitNum
AND FacilityID IN (19, 20, 28))
=
(SELECT COUNT(*) FROM tblPermitDetails t3
WHERE t1.PermitNum = t3.PermitNum)
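As a quick sanity check, the NOT EXISTS and COUNT(*) formulations can be run side by side on the four permits described in the question. The sketch below does that on SQLite through Python's sqlite3 module; that choice is only for convenience, and Access SQL may need small syntax adjustments that are not tested here. An ORDER BY is added so the two results compare directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblPermitDetails (PermitNum INTEGER, FacilityID INTEGER)")
conn.executemany(
    "INSERT INTO tblPermitDetails VALUES (?, ?)",
    [(1, 1), (1, 4), (1, 7), (1, 19),   # Permit 1: 19 plus others
     (2, 19),                           # Permit 2: only 19
     (3, 1), (3, 4), (3, 7), (3, 20),   # Permit 3: 20 plus others
     (4, 19), (4, 20)],                 # Permit 4: 19 and 20 only
)

NOT_EXISTS_SQL = """
SELECT DISTINCT PermitNum FROM tblPermitDetails t1
WHERE FacilityID IN (19, 20, 28)
  AND NOT EXISTS (SELECT 1 FROM tblPermitDetails t2
                  WHERE t2.PermitNum = t1.PermitNum
                    AND t2.FacilityID NOT IN (19, 20, 28))
ORDER BY PermitNum
"""

COUNT_SQL = """
SELECT DISTINCT PermitNum FROM tblPermitDetails t1
WHERE (SELECT COUNT(*) FROM tblPermitDetails t2
       WHERE t1.PermitNum = t2.PermitNum
         AND t2.FacilityID IN (19, 20, 28))
    = (SELECT COUNT(*) FROM tblPermitDetails t3
       WHERE t1.PermitNum = t3.PermitNum)
ORDER BY PermitNum
"""

not_exists_permits = [p for (p,) in conn.execute(NOT_EXISTS_SQL)]
count_permits = [p for (p,) in conn.execute(COUNT_SQL)]
print(not_exists_permits, count_permits)  # [2, 4] [2, 4]
```

Both queries return permits 2 and 4, the result the question asks for.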
What about (untested)
select distinct t1.permitnum
from tblPermitDetails t1
left outer join
(Select distinct permitnum from tblPermitDetails where facilityId not in (19, 20, 28)) t2
on t1.permitnum=t2.permitnum
where t2.permitnum is null
i.e. we find all the permits that cannot match your criteria (they have at least one detail outside those you list), then we find all the permits that are left, via a left join and where criteria.
with indexes set up properly, this should be pretty quick.
Quick way might be to only look at the ones with exactly three matches (with an inner query), and then among those only include the ones that have 19, 20, and 28.
Of course, that is sort of a brute force method, and not very elegant. But it has the small benefit of being understandable. None of the approaches I can think of will be easy to customize to various other sets of values.
OK, it seems I didn't understand the problem at first. So, again:
I will recreate Stacy's example here:
DECLARE @PermitHolders TABLE
(PermitNum INT NOT NULL,
PermitHolder VARCHAR(20))
DECLARE @tblPermitDetails TABLE
(PermitNum INT,
FacilityID INT)
INSERT INTO @PermitHolders VALUES (1, 'John Doe')
INSERT INTO @PermitHolders VALUES (2, 'Sus Brown')
INSERT INTO @PermitHolders VALUES (3, 'Steve Toni')
INSERT INTO @PermitHolders VALUES (4, 'Jill Jack')
INSERT INTO @tblPermitDetails VALUES (1, 1)
INSERT INTO @tblPermitDetails VALUES (1, 4)
INSERT INTO @tblPermitDetails VALUES (1, 7)
INSERT INTO @tblPermitDetails VALUES (1, 19)
INSERT INTO @tblPermitDetails VALUES (2, 19)
INSERT INTO @tblPermitDetails VALUES (3, 1)
INSERT INTO @tblPermitDetails VALUES (3, 4)
INSERT INTO @tblPermitDetails VALUES (3, 7)
INSERT INTO @tblPermitDetails VALUES (3, 20)
INSERT INTO @tblPermitDetails VALUES (4, 19)
INSERT INTO @tblPermitDetails VALUES (4, 20)
And this is the solution:
SELECT * FROM @PermitHolders
WHERE (PermitNum IN (SELECT PermitNum FROM @tblPermitDetails WHERE FacilityID IN (19, 20, 28)))
AND (PermitNum NOT IN (SELECT PermitNum FROM @tblPermitDetails WHERE FacilityID NOT IN (19, 20, 28)))
I have one observation on the side:
You didn't mention any PK for tblPermitDetails. If none exists, this may not be good for performance. I recommend that you create a composite PK on PermitNum and FacilityID, because it will serve both as your PK and as a useful index for the expected queries.