Window functions with summations on postgresql - sql

I'm trying to calculate daily, weekly and monthly leaderboards with sum(points), all-time-high points and all-time-low points per user (and per time frame), but I haven't had much success. My schema looks like:
CREATE TABLE users(
id SERIAL PRIMARY KEY,
name text NOT NULL
);
-- contains millions of rows!
CREATE TABLE results(
id SERIAL PRIMARY KEY,
user_id integer NOT NULL REFERENCES users(id),
points float NOT NULL, -- can be negative
date timestamptz NOT NULL DEFAULT NOW()
);
-- sample data
INSERT INTO users (name)
VALUES ('user1'), ('user2'), ('user3'), ('user4');
INSERT INTO results (user_id, points)
VALUES (2, -10), (1, 50), (4, -20), (3, 20), (2, 50), (4, -20), (1, 50), (1, -25), (4, 30), (3, -70), (2, 50), (1, -25), (4, 20), (2, -90), (3, 60), (4, -20);
So for example, assuming those results all fall within the last week, the weekly leaderboard would look something like:
User | sum(points)    User | ATH points    User | ATL points
1    | 50             1    | 100           3    | -50
3    | 10             2    | 90            4    | -40
These are calculated only from the results whose date is in the last week, and likewise for the other time frames.
But to achieve that, it seems to me that I need to somehow iterate over every bet to calculate the highest and lowest running point totals the user had at any moment in that time frame. Doing it in memory isn't going to work well, because I would need to hold millions of results in memory.
Is there any way of doing it entirely in a query? I've looked into window functions but don't see how a summation could be done with them.

You should use window functions to calculate the sum and the running sum (ordered by date), then take the minimum and maximum of the running sums:
SELECT user_id,
       sum,
       min(running) AS atl,
       max(running) AS ath
FROM (SELECT user_id,
             sum(points) OVER (PARTITION BY user_id) AS sum,
             sum(points) OVER (PARTITION BY user_id ORDER BY date) AS running
      FROM results
      WHERE date > current_timestamp - INTERVAL '1 week') AS q
GROUP BY user_id, sum;
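To see the running-sum idea on the question's sample data, here is a minimal sketch that runs the same kind of query through Python's built-in sqlite3 module (assuming a bundled SQLite of 3.25 or newer, which is when window functions arrived). Two things differ from the query above and are assumptions of this sketch: the running sum is ordered by id, because the sample rows would all get the same default timestamp, and the one-week filter is left out.

```python
import sqlite3

# Points from the question's sample INSERT, as (user_id, points) in insertion order.
ROWS = [(2, -10), (1, 50), (4, -20), (3, 20), (2, 50), (4, -20), (1, 50), (1, -25),
        (4, 30), (3, -70), (2, 50), (1, -25), (4, 20), (2, -90), (3, 60), (4, -20)]

# Same shape as the PostgreSQL query, but ordered by id (an assumption made
# here) and without the date filter.
QUERY = """
SELECT user_id, total, MIN(running) AS atl, MAX(running) AS ath
FROM (SELECT user_id,
             SUM(points) OVER (PARTITION BY user_id) AS total,
             SUM(points) OVER (PARTITION BY user_id ORDER BY id) AS running
      FROM results) AS q
GROUP BY user_id, total
ORDER BY user_id
"""


def leaderboard(rows):
    """Return (user_id, total, all-time low, all-time high) for each user."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE results (id INTEGER PRIMARY KEY, user_id INTEGER, points REAL)"
    )
    conn.executemany("INSERT INTO results (user_id, points) VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


board = leaderboard(ROWS)
for user_id, total, atl, ath in board:
    print(user_id, total, atl, ath)
```

On this data user 1 finishes at 50 with a high of 100 and user 3 dips to -50, in line with the figures in the question's example table. One thing to watch with ORDER BY date: rows that share a timestamp are peers under the default window frame and enter the running sum together, so adding a unique tie-breaker such as id to the ORDER BY may be worthwhile.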

Related

Count Distinct not working as expected, output is equal to count

I have a table where I'm trying to count the distinct number of members per group. I know there's duplicates based on the count(distinct *) function. But when I try to group them into the group and count distinct, it's not spitting out the number I'd expect.
select count(distinct memberid), count(*)
from dbo.condition c
output:
count   | count
303,781 | 348,722
select groupid, count(*), count(distinct memberid)
from dbo.condition c
group by groupid
output:
groupid | count   | count
2       | 19,984  | 19,984
3       | 25,689  | 25,689
5       | 14,400  | 14,400
24      | 56,058  | 56,058
25      | 200,106 | 200,106
29      | 27,847  | 27,847
30      | 1,370   | 1,370
31      | 3,268   | 3,268
The two counts in the second query are equal when they shouldn't be. Does anyone know what I'm doing wrong? I need the 3rd column to add up to 303,781, not 348,722.
Thanks!
There's nothing wrong with your second query. Since you're grouping by "groupid", the output tells you that no "memberid" value is repeated within the same groupid, so within each group count(*) and count(distinct memberid) give the same number.
In the first query, on the other hand, the aggregation runs over the whole table without any grouping, and its output shows that the same "memberid" values appear under different "groupid" values. A member who belongs to several groups is counted once per group in the second query, which is why the per-group distinct counts add up to 348,722 rather than 303,781.
I took the liberty of adding an example that corroborates your answer:
create table aa (groupid int not null, memberid int not null);
insert into aa (groupid, memberid)
values
(1, 1), (1, 2), (1, 3), (2, 1), (3, 1), (3, 2), (3, 3), (4, 1), (4, 2), (4, 3), (4, 5), (5, 3);
select groupid, count(*), count(distinct memberid)
from aa group by groupid;
select count(*), count(distinct memberid)
from aa;

SQL count number of patients who have issue of same drug 3 or more times?

I would be really grateful for some help…
I have a table with patients in it, with a PAT.ID, an EVENT.CD, and an EVENT.DT (date).
For EVENT.YR = 2019, I want to COUNT how many unique PAT.IDs had 3 or more occurrences of a specific drug (EVENT.CD = d4%) in that year.
I can produce a table with a separate row for each time a patient got the drug (so if the patient got the drug 6 times in 2019, the PAT.ID will have six rows, one per EVENT.DT), but I cannot work out how to get a count of only those unique patients who had multiple issues of the drug (3 or more) in 2019. (So if a patient has >= 3 events, they should be counted only once.)
Been battling this for 2 days as a novice and getting nowhere. Please show mercy 😁
I'm not sure if there's more to the problem than this, so here's some sample data along with the SQL. There's no specific DB tag, so this is straight ANSI SQL:
create table T1(PAT_ID int, EVENT_CD varchar(10), EVENT_DT date);
insert into T1 (PAT_ID, EVENT_CD, EVENT_DT) values
(1, 'd4r', '2019-01-01'),
(2, 'abc', '2019-01-01'),
(3, 'def', '2019-01-01'),
(1, 'd4r', '2019-04-15'),
(1, 'd4e', '2019-09-05'),
(1, 'd4s', '2020-06-12'),
(1, 'def', '2019-01-01'),
(3, 'd4s', '2019-08-17'),
(3, 'd4s', '2019-12-10'),
(2, 'd4s', '2019-11-17');
select PAT_ID, count(*) as EVENT_COUNT
from T1
where EVENT_DT >= '2019-01-01'
  and EVENT_DT <= '2019-12-31'
  and EVENT_CD like 'd4%' -- only the drug codes of interest
group by PAT_ID, left(EVENT_CD,2)
having count(*) >= 3
;
PAT_ID | EVENT_COUNT
1      | 3
SQL Server 2019 Fiddle
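The query above lists each qualifying patient; the question also asks for a single count of such patients. One way to get it is to wrap the grouped query in an outer COUNT(*). The following is a minimal sketch using Python's built-in sqlite3 module with the answer's sample rows. The LIKE 'd4%' filter and the names `COUNT_QUERY`, `repeat_patients` and `count_repeat_patients` are choices made for this illustration.

```python
import sqlite3

# Sample rows from the answer: (PAT_ID, EVENT_CD, EVENT_DT).
EVENTS = [
    (1, "d4r", "2019-01-01"), (2, "abc", "2019-01-01"), (3, "def", "2019-01-01"),
    (1, "d4r", "2019-04-15"), (1, "d4e", "2019-09-05"), (1, "d4s", "2020-06-12"),
    (1, "def", "2019-01-01"), (3, "d4s", "2019-08-17"), (3, "d4s", "2019-12-10"),
    (2, "d4s", "2019-11-17"),
]

# Inner query: one row per patient with 3+ 'd4...' events in 2019.
# Outer query: how many such patients there are.
COUNT_QUERY = """
SELECT COUNT(*) AS patient_count
FROM (SELECT PAT_ID
      FROM T1
      WHERE EVENT_DT >= '2019-01-01' AND EVENT_DT <= '2019-12-31'
        AND EVENT_CD LIKE 'd4%'
      GROUP BY PAT_ID
      HAVING COUNT(*) >= 3) AS repeat_patients
"""


def count_repeat_patients(events):
    """Count patients with at least three matching drug events in 2019."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE T1 (PAT_ID INTEGER, EVENT_CD TEXT, EVENT_DT TEXT)")
    conn.executemany("INSERT INTO T1 VALUES (?, ?, ?)", events)
    (patient_count,) = conn.execute(COUNT_QUERY).fetchone()
    conn.close()
    return patient_count


patient_count = count_repeat_patients(EVENTS)
print(patient_count)
```

Only patient 1 has three d4 events inside 2019 in the sample data (the fourth one falls in 2020), so each patient is counted at most once no matter how many events they have beyond the threshold.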

"For each player who has incurred more than $150 worth of penalties in total, find the player number and the total amount of penalties."

SQL beginner here. I'm trying to find the player number and total amount of penalties for each player who has over 150 dollars in penalties. I've written a query for it, but it somehow returns the summed amounts for every player instead. Is there something I am missing?
Database and Table:
CREATE DATABASE Tennis;
Use Tennis;
CREATE TABLE PENALTIES
(PAYMENTNO INTEGER NOT NULL PRIMARY KEY,
PLAYERNO INTEGER NOT NULL,
PAYMENT_DATE DATE NOT NULL
CHECK (PAYMENT_DATE >= DATE('1969-12-31')),
AMOUNT DECIMAL(7,2) NOT NULL
CHECK (AMOUNT > 0),
FOREIGN KEY (PLAYERNO) REFERENCES PLAYERS (PLAYERNO));
INSERT INTO PENALTIES VALUES (1, 6, '1980-12-08', 100);
INSERT INTO PENALTIES VALUES (2, 44, '1981-05-05', 75);
INSERT INTO PENALTIES VALUES (3, 27, '1983-09-10', 100);
INSERT INTO PENALTIES VALUES (4, 104, '1984-12-08', 50);
INSERT INTO PENALTIES VALUES (5, 44, '1980-12-08', 25);
INSERT INTO PENALTIES VALUES (6, 8, '1980-12-08', 25);
INSERT INTO PENALTIES VALUES (7, 44, '1982-12-30', 30);
INSERT INTO PENALTIES VALUES (8, 27, '1984-11-12', 75);
My Query:
SELECT PLAYERNO, SUM(AMOUNT) FROM PENALTIES WHERE (SELECT SUM(AMOUNT) FROM PENALTIES) > 150.00 GROUP BY PLAYERNO;
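Your WHERE condition compares the total of the whole table (480.00) with 150.00. The subquery is not tied to the current player, so the condition is true for every row and nothing is filtered out. When you need to filter the results of a query after an aggregation with GROUP BY, you should use the HAVING clause: https://www.w3schools.com/sql/sql_having.asp
So the following query should give you the expected result:
SELECT PLAYERNO, SUM(AMOUNT) FROM PENALTIES GROUP BY PLAYERNO HAVING SUM(AMOUNT) > 150.00;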
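To compare the two queries side by side, here is a small sketch that loads the question's penalty rows into an in-memory SQLite database through Python's sqlite3 module. The table is simplified for the illustration (no PLAYERS foreign key, amounts stored as whole numbers), and the names `WHERE_QUERY`, `HAVING_QUERY` and `run` are made up for this example.

```python
import sqlite3

# Rows from the question: (PAYMENTNO, PLAYERNO, PAYMENT_DATE, AMOUNT).
PENALTIES = [
    (1, 6, "1980-12-08", 100), (2, 44, "1981-05-05", 75),
    (3, 27, "1983-09-10", 100), (4, 104, "1984-12-08", 50),
    (5, 44, "1980-12-08", 25), (6, 8, "1980-12-08", 25),
    (7, 44, "1982-12-30", 30), (8, 27, "1984-11-12", 75),
]

# The question's query: the subquery totals the whole table, so the
# condition is the same (true) for every row.
WHERE_QUERY = """
SELECT PLAYERNO, SUM(AMOUNT) FROM PENALTIES
WHERE (SELECT SUM(AMOUNT) FROM PENALTIES) > 150.00
GROUP BY PLAYERNO ORDER BY PLAYERNO
"""

# HAVING is evaluated after grouping, once per player.
HAVING_QUERY = """
SELECT PLAYERNO, SUM(AMOUNT) FROM PENALTIES
GROUP BY PLAYERNO HAVING SUM(AMOUNT) > 150.00
ORDER BY PLAYERNO
"""


def run(query):
    """Execute a query against a fresh copy of the sample table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE PENALTIES (PAYMENTNO INTEGER PRIMARY KEY, PLAYERNO INTEGER,"
        " PAYMENT_DATE TEXT, AMOUNT INTEGER)"
    )
    conn.executemany("INSERT INTO PENALTIES VALUES (?, ?, ?, ?)", PENALTIES)
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows


all_players = run(WHERE_QUERY)
over_150 = run(HAVING_QUERY)
print(all_players)
print(over_150)
```

The WHERE version returns all five players, while the HAVING version keeps only player 27, whose penalties total 175.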

running sums, find blocks of rows that sum to given list of values

Here is the test data:
create table #trial (id int, val int);
insert into #trial (id, val)
values (1, 1), (2, 3), (3, 2), (4, 4), (5, 5), (6, 6), (7, 7), (8, 2), (9, 3), (10, 4), (11, 6), (12, 10), (13, 5), (14, 3), (15, 2);
select * from #trial order by id asc;
Description of the data:
I have a list of n values that represent sums. Assume they are (10, 53) for this example. The values in #trial can be both negative and positive. Note that the values in #trial will always add up to the total of the given sums.
Description of the pattern:
10 is the first sum I want to match and 53 is the second. The dataset has been set up so that consecutive blocks of rows always add up to these sums: in this example, the first 4 rows sum to 10, and the next 11 rows sum to 53. The dataset will always have this feature. In other words, the first given sum is found by summing rows 1 to i, the second by summing rows i + 1 to j, and so on.
Finally, I want an id that identifies the group of rows matching each sum. So in this example, rows 1 to 4 get id 1 and rows 5 to 15 get id 2.
This answers the original question.
From what you describe you can do something like this:
select v.grp, t.*
from (select t.*, sum(val) over (order by id) as running_val
      from #trial t
     ) t left join
     (select grp, lag(upper, 1, -1) over (order by upper) as lower, upper
      from (values (1, 10), (2, 63)) v(grp, upper) -- upper bounds are cumulative: 10, then 10 + 53
     ) v
     on t.running_val > v.lower and
        t.running_val <= v.upper
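The range join only lines up if each block's upper bound is the cumulative total of the target sums (10, then 10 + 53 = 63), since the running sum over the rows keeps growing. The sketch below derives those bounds with a second running sum and runs the whole thing in SQLite through Python's sqlite3 module. The table names `trial` and `targets` and the CTE names are invented for the illustration, and the approach assumes the running sum never drops back into an earlier block's range, which holds for the all-positive sample values but is not guaranteed once negative values appear.

```python
import sqlite3

VALS = [1, 3, 2, 4, 5, 6, 7, 2, 3, 4, 6, 10, 5, 3, 2]  # val column, ids 1..15
TARGETS = [10, 53]                                      # block sums, in order

QUERY = """
WITH running AS (
    SELECT id, val, SUM(val) OVER (ORDER BY id) AS running_val
    FROM trial
),
cumulative AS (
    -- running total of the targets gives each block's upper bound
    SELECT grp, SUM(target) OVER (ORDER BY grp) AS upper_bound
    FROM targets
),
bounds AS (
    -- previous block's upper bound is this block's (exclusive) lower bound
    SELECT grp, upper_bound,
           LAG(upper_bound, 1, 0) OVER (ORDER BY grp) AS lower_bound
    FROM cumulative
)
SELECT r.id, b.grp
FROM running AS r
LEFT JOIN bounds AS b
       ON r.running_val > b.lower_bound AND r.running_val <= b.upper_bound
ORDER BY r.id
"""


def assign_groups(vals, targets):
    """Return (id, grp) pairs labelling each row with its block number."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trial (id INTEGER PRIMARY KEY, val INTEGER)")
    conn.execute("CREATE TABLE targets (grp INTEGER PRIMARY KEY, target INTEGER)")
    conn.executemany("INSERT INTO trial VALUES (?, ?)", enumerate(vals, start=1))
    conn.executemany("INSERT INTO targets VALUES (?, ?)", enumerate(targets, start=1))
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


groups = assign_groups(VALS, TARGETS)
print(groups)
```

With the sample data, ids 1 to 4 come back with grp 1 and ids 5 to 15 with grp 2, matching the labelling the question asks for.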

How to specify a linear programming-like constraint (i.e. max number of rows for a dimension's attributes) in SQL server?

I'm looking to assign unique person IDs to marketing programs, but I need to optimize based on each person's Probability Score (some people can be sent to multiple programs, some only one) while respecting constraints such as the budgeted mail quantity for each program.
I'm using SQL Server and am able to put IDs into their highest-scoring program using row_number() over(partition by person_ID order by Prob_Score). However, I need to return a table where each ID is assigned to a program, and I'm not sure how to add the max mail quantity constraint specific to each individual program. I've looked into the CHECK() constraint functionality, but I'm not sure if that's applicable.
create table test_marketing_table(
PersonID int,
MarketingProgram varchar(255),
ProbabilityScore real
);
insert into test_marketing_table (PersonID, MarketingProgram, ProbabilityScore)
values (1, 'A', 0.07)
,(1, 'B', 0.06)
,(1, 'C', 0.02)
,(2, 'A', 0.02)
,(3, 'B', 0.08)
,(3, 'C', 0.13)
,(4, 'C', 0.02)
,(5, 'A', 0.04)
,(6, 'B', 0.045)
,(6, 'C', 0.09);
--this section assigns everyone to their highest scoring program,
--but this isn't necessarily what I need
with x
as
(
select *, row_number()over(partition by PersonID order by ProbabilityScore desc) as PersonScoreRank
from test_marketing_table
)
select *
from x
where PersonScoreRank = 1;
I also need to specify some constraints: at most two C packages, one A package and one B package can be sent. How can I reassign the IDs to a program while also using the highest probability score left available?
The final result should look like:
PersonID MarketingProgram ProbabilityScore PersonScoreRank
3 C 0.13 1
6 C 0.09 1
1 A 0.07 1
6 B 0.045 2
You need to rethink your ROW_NUMBER() formula based on your actual need, and you should also have a table of Marketing Programs to make this work efficiently. This covers the basic ideas you need to incorporate to efficiently perform the filtering you need.
MarketingPrograms Table
CREATE TABLE MarketingPrograms (
ProgramID varchar(10),
PeopleDesired int
)
Populate the MarketingPrograms Table
INSERT INTO MarketingPrograms (ProgramID, PeopleDesired) Values
('A', 1),
('B', 1),
('C', 2)
Use the MarketingPrograms Table
with x as (
select *,
row_number()over(partition by MarketingProgram order by ProbabilityScore desc) as ProgramScoreRank
from test_marketing_table
)
select *
from x
INNER JOIN MarketingPrograms m
ON x.MarketingProgram = m.ProgramID
WHERE x.ProgramScoreRank <= m.PeopleDesired;
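As a rough check of what the per-program ranking returns, the sketch below runs the same idea against the question's sample rows in SQLite via Python's sqlite3 module. The column list is spelled out instead of using select *, and the function name `capped_assignments` is made up for the illustration.

```python
import sqlite3

# (PersonID, MarketingProgram, ProbabilityScore) rows from the question.
SCORES = [
    (1, "A", 0.07), (1, "B", 0.06), (1, "C", 0.02), (2, "A", 0.02),
    (3, "B", 0.08), (3, "C", 0.13), (4, "C", 0.02), (5, "A", 0.04),
    (6, "B", 0.045), (6, "C", 0.09),
]
# (ProgramID, PeopleDesired) caps from the answer.
PROGRAMS = [("A", 1), ("B", 1), ("C", 2)]

QUERY = """
WITH ranked AS (
    SELECT PersonID, MarketingProgram, ProbabilityScore,
           ROW_NUMBER() OVER (PARTITION BY MarketingProgram
                              ORDER BY ProbabilityScore DESC) AS ProgramScoreRank
    FROM test_marketing_table
)
SELECT r.PersonID, r.MarketingProgram, r.ProbabilityScore, r.ProgramScoreRank
FROM ranked AS r
INNER JOIN MarketingPrograms AS m ON r.MarketingProgram = m.ProgramID
WHERE r.ProgramScoreRank <= m.PeopleDesired
ORDER BY r.MarketingProgram, r.ProgramScoreRank
"""


def capped_assignments(scores, programs):
    """Keep the top-scoring people per program, up to each program's cap."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE test_marketing_table"
        " (PersonID INTEGER, MarketingProgram TEXT, ProbabilityScore REAL)"
    )
    conn.execute("CREATE TABLE MarketingPrograms (ProgramID TEXT, PeopleDesired INTEGER)")
    conn.executemany("INSERT INTO test_marketing_table VALUES (?, ?, ?)", scores)
    conn.executemany("INSERT INTO MarketingPrograms VALUES (?, ?)", programs)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


assignments = capped_assignments(SCORES, PROGRAMS)
for row in assignments:
    print(row)
```

This respects the per-program caps, but it ranks each program independently, so the same person can be picked by more than one program (person 3 lands in both B and C here). If each person may only receive a limited number of mailings, that extra rule would need a further step that this sketch does not attempt.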