Calculate sum and cumul by age group and by date (but people change age group as time passes) - SQL

DBMS: PostgreSQL
My problem:
In my database I have a person table with an id and a birth date, an events table that links a person to an event (event_id) and a date, and an age table used for grouping ages. In the real database the person table has about 40 million rows, and events is about 3 times bigger.
I need to produce a report (sum and cumul of X events) by age group (age_group) and date (event_date). Counting the number of events by date is not a problem. The problem lies with the cumul: unlike other variables (sex, for example), a person grows older and changes age group as time passes, so for a given age group the cumul can increase and then decrease. I want the cumul of events, at every date in the report, to use each person's age on that date.
Example of my inputs and desired output
The only way I found is to do a Cartesian product of the tables person and v_dates, which makes it easy to follow an event and let it change age_group over time. The code below uses this method.
BUT I can't use a Cartesian product on my real data (it makes a far too large table), so I need another method.
Reproducible example
In this simplified example I produce a report every 6 months from 2020-07-01 to 2022-07-01 (view v_dates). In reality I need to produce the same report by day, but the logic remains the same.
My inputs
/* create table person*/
DROP TABLE IF EXISTS person;
CREATE TABLE person
(
person_id varchar(1),
person_birth_date date
);
INSERT INTO person
VALUES ('A', '2017-01-01'),
('B', '2016-07-01');
person_id  person_birth_date
A          2017-01-01
B          2016-07-01
/* create table events*/
DROP TABLE IF EXISTS events;
CREATE TABLE events
(
person_id varchar(1),
event_id integer,
event_date date
);
INSERT INTO events
VALUES ('A', 1, '2020-07-01'),
('A', 2, '2021-07-01'),
('B', 1, '2021-01-01'),
('B', 2, '2022-01-01');
person_id  event_id  event_date
A          1         2020-07-01
A          2         2021-07-01
B          1         2021-01-01
B          2         2022-01-01
/* create table age*/
DROP TABLE IF EXISTS age;
CREATE TABLE age
(
age integer,
age_group varchar(8)
);
INSERT INTO age
VALUES (0,'[0-4]'),
(1,'[0-4]'),
(2,'[0-4]'),
(3,'[0-4]'),
(4,'[0-4]'),
(5,'[5-9]'),
(6,'[5-9]'),
(7,'[5-9]'),
(8,'[5-9]'),
(9,'[5-9]');
/* create view v_dates : contains dates from 2020-07-01 to 2022-07-01, one every 6 months */
CREATE OR REPLACE VIEW v_dates AS
SELECT GENERATE_SERIES('2020-07-01'::date, '2022-07-01'::date, '6 month')::date AS event_date;
age  age_group
0    [0-4]
1    [0-4]
5    [5-9]
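The real report is by day rather than every 6 months; a daily date spine could presumably be built the same way as v_dates above (hypothetical variant, not part of the original post, with a made-up name v_dates_daily):

/* hypothetical daily variant of v_dates for the real day-level report */
CREATE OR REPLACE VIEW v_dates_daily AS
SELECT GENERATE_SERIES('2020-07-01'::date, '2022-07-01'::date, '1 day')::date AS event_date;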
My current method using a Cartesian product:
a CROSS JOIN between person and v_dates,
with a LEFT JOIN to get info from the events table,
and a LEFT JOIN to get age_group from the age table.
CREATE or replace view v_person_event AS
SELECT
pdev.person_id,
pdev.event_date,
pdev.age,
ag.age_group,
pdev.event1,
pdev.event2
FROM
(
SELECT pd.person_id,
pd.event_date,
date_part('year', age(pd.event_date::TIMESTAMP, pd.person_birth_date::TIMESTAMP)) as age,
CASE WHEN ev.event_id = 1 THEN 1 else 0 END as event1,
CASE WHEN ev.event_id = 2 THEN 1 else 0 END as event2
FROM
(
SELECT *
FROM person
CROSS JOIN v_dates
) pd
LEFT JOIN events ev
on pd.person_id = ev.person_id
and pd.event_date = ev.event_date
) pdev
Left JOIN age as ag on pdev.age = ag.age
ORDER by pdev.person_id, pdev.event_date;
add columns event1_cum and event2_cum
CREATE or replace view v_person_event_cum AS
SELECT *,
SUM(event1) OVER (PARTITION BY person_id ORDER BY event_date) event1_cum,
SUM(event2) OVER (PARTITION BY person_id ORDER BY event_date) event2_cum
FROM v_person_event;
SELECT * FROM v_person_event_cum;
person_id  event_date  age  age_group  event1  event2  event1_cum  event2_cum
A          2020-07-01  3    [0-4]      1       0       1           0
A          2021-01-01  4    [0-4]      0       0       1           0
A          2021-07-01  4    [0-4]      0       1       1           1
A          2022-01-01  5    [5-9]      0       0       1           1
A          2022-07-01  5    [5-9]      0       0       1           1
B          2020-07-01  4    [0-4]      0       0       0           0
B          2021-01-01  4    [0-4]      1       0       1           0
B          2021-07-01  5    [5-9]      0       0       1           0
B          2022-01-01  5    [5-9]      0       1       1           1
B          2022-07-01  6    [5-9]      0       0       1           1
Desired output: create a report grouped by age_group and event_date
SELECT
age_group,
event_date,
SUM(event1) as event1,
SUM(event2) as event2,
SUM(event1_cum) as event1_cum,
SUM(event2_cum) as event2_cum
FROM v_person_event_cum
GROUP BY age_group, event_date
ORDER BY age_group, event_date;
age_group  event_date  event1  event2  event1_cum  event2_cum
[0-4]      2020-07-01  1       0       1           0
[0-4]      2021-01-01  1       0       2           0
[0-4]      2021-07-01  0       1       1           1
[5-9]      2021-07-01  0       0       1           0
[5-9]      2022-01-01  0       1       2           2
This is why this is not an ordinary cumul: for the age group [0-4], event1_cum goes from 2 at 2021-01-01 to 1 at 2021-07-01, because A was in [0-4] at the time of event 1, still in [0-4] at 2021-01-01, but in [5-9] at 2021-07-01.
When we read the report:
on 2021-01-01, there were 2 people aged 0 to 4 (at that date) who had had event1, and 0 who had had event2;
on 2021-07-01, there was 1 person aged 0 to 4 who had had event1 and 1 who had had event2.
I can't find a solution to this problem without using a Cartesian product...
Thanks in advance!
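A possible direction (not part of the original question, just a sketch of one way to avoid the Cartesian product, assuming the age table maps every age to exactly one age_group and that PostgreSQL 9.4+'s make_interval is available): instead of expanding every person over every report date, turn each event into a few +1/-1 delta rows, one per age-group period the person passes through from the event date onward, take a running sum of those deltas per age group, and then look up, for each report date, the latest running total at or before that date. The CTE names (age_groups, deltas, running) are made up for the sketch, and only the cumulative columns are produced here; the per-date event counts can be added with an ordinary GROUP BY on events.

WITH age_groups AS (
    /* one row per age group with its age bounds */
    SELECT age_group, MIN(age) AS age_min, MAX(age) AS age_max
    FROM age
    GROUP BY age_group
),
deltas AS (
    /* +1 when an event starts counting for an age group
       (the event date, or the day the person enters the group, whichever is later) */
    SELECT ag.age_group,
           GREATEST(e.event_date,
                    (p.person_birth_date + make_interval(years => ag.age_min))::date) AS delta_date,
           CASE WHEN e.event_id = 1 THEN 1 ELSE 0 END AS d_event1,
           CASE WHEN e.event_id = 2 THEN 1 ELSE 0 END AS d_event2
    FROM events e
    JOIN person p ON p.person_id = e.person_id
    JOIN age_groups ag
      ON (p.person_birth_date + make_interval(years => ag.age_max + 1))::date > e.event_date
    UNION ALL
    /* -1 on the day the person ages out of the group */
    SELECT ag.age_group,
           (p.person_birth_date + make_interval(years => ag.age_max + 1))::date AS delta_date,
           CASE WHEN e.event_id = 1 THEN -1 ELSE 0 END,
           CASE WHEN e.event_id = 2 THEN -1 ELSE 0 END
    FROM events e
    JOIN person p ON p.person_id = e.person_id
    JOIN age_groups ag
      ON (p.person_birth_date + make_interval(years => ag.age_max + 1))::date > e.event_date
),
running AS (
    /* running number of "active" events per age group, at each date a delta occurs */
    SELECT age_group, delta_date,
           SUM(SUM(d_event1)) OVER (PARTITION BY age_group ORDER BY delta_date) AS event1_cum,
           SUM(SUM(d_event2)) OVER (PARTITION BY age_group ORDER BY delta_date) AS event2_cum
    FROM deltas
    GROUP BY age_group, delta_date
)
/* for each report date and age group, pick the latest running total at or before that date */
SELECT r.age_group, d.event_date, r.event1_cum, r.event2_cum
FROM v_dates d
JOIN LATERAL (
    SELECT DISTINCT ON (age_group) age_group, event1_cum, event2_cum
    FROM running
    WHERE delta_date <= d.event_date
    ORDER BY age_group, delta_date DESC
) r ON TRUE
ORDER BY r.age_group, d.event_date;

The number of delta rows is proportional to the number of events times the handful of age groups a person can pass through after an event, rather than persons times dates. On the sample data above this should reproduce the event1_cum/event2_cum values of the desired output; age groups with no qualifying events at a date simply do not appear, as in the desired output above.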

Related

Adding labels to rows based on a condition of the prior/next row

I have sample data like the following in Snowflake. I'd like to assign group labels (without aggregation) based on grp_start -> grp_end: when grp_start = 1 I want to assign a label, and assign each subsequent row the same label until grp_end = 1. That constitutes a single group; the next group should then get a different label, following the same logic.
Note: if a single row has grp_start = 1 and grp_end = 1, it should get its own label, following the same pattern.
The data needs to be partitioned by id and ordered by start_time as well. Please see the sample data below and a mockup of the desired result. Ideally, this needs to scale to large amounts of data.
Current data:
create or replace temporary table grp_test (id char(4), start_time date, grp_start int, grp_end int)
as select * from values
('0001','2021-01-10',1,0),
('0001','2021-01-11',0,0),
('0001','2021-01-14',0,1),
('0001','2021-07-01',1,1),
('0001','2021-09-25',1,0),
('0001','2021-09-29',0,1),
('0002','2022-11-04',1,0),
('0002','2022-11-25',0,1);
select * from grp_test;
Desired result mockup:
create or replace temporary table desired_result (id char(4), start_time date, grp_start int, grp_end int, label int)
as select * from values
('0001','2021-01-10',1,0,0),
('0001','2021-01-11',0,0,0),
('0001','2021-01-14',0,1,0),
('0001','2021-07-01',1,1,1),
('0001','2021-09-25',1,0,2),
('0001','2021-09-29',0,1,2),
('0002','2022-11-04',1,0,0),
('0002','2022-11-25',0,1,0);
select * from desired_result;
so changing the setup data to:
create or replace temporary table grp_test as
select * from values
('0001','2021-01-10'::date,1,0),
('0001','2021-01-11'::date,0,0),
('0001','2021-01-14'::date,0,1),
('0001','2021-01-15'::date,0,0),
('0001','2021-07-01'::date,1,1),
('0001','2021-09-25'::date,1,0),
('0001','2021-09-29'::date,0,1),
('0002','2022-11-04'::date,1,0),
('0002','2022-11-25'::date,0,1)
t(id, start_time, grp_start, grp_end);
We can use two CONDITIONAL_TRUE_EVENTs; this lets us know when we are past an end but before the next start, and thus set the label to null there.
select d.*
,CONDITIONAL_TRUE_EVENT(grp_start=1) over (partition by id order by start_time) as s_e
,CONDITIONAL_TRUE_EVENT(grp_end=1) over (partition by id order by start_time) as e_e
,iff(s_e != e_e OR grp_end = 1, s_e, null) as label
from grp_test as d
order by 1,2;
ID    START_TIME  GRP_START  GRP_END  S_E  E_E  LABEL
0001  2021-01-10  1          0        1    0    1
0001  2021-01-11  0          0        1    0    1
0001  2021-01-14  0          1        1    1    1
0001  2021-01-15  0          0        1    1    NULL
0001  2021-07-01  1          1        2    2    2
0001  2021-09-25  1          0        3    2    3
0001  2021-09-29  0          1        3    3    3
0002  2022-11-04  1          0        1    0    1
0002  2022-11-25  0          1        1    1    1
If you don't actually care about labeling rows that fall after an end but before the next start, you can just use a single CONDITIONAL_TRUE_EVENT:
select d.*
,CONDITIONAL_TRUE_EVENT(grp_start=1) over (partition by id order by start_time) as label
from grp_test as d
order by 1,2;
ID    START_TIME  GRP_START  GRP_END  LABEL
0001  2021-01-10  1          0        1
0001  2021-01-11  0          0        1
0001  2021-01-14  0          1        1
0001  2021-01-15  0          0        1
0001  2021-07-01  1          1        2
0001  2021-09-25  1          0        3
0001  2021-09-29  0          1        3
0002  2022-11-04  1          0        1
0002  2022-11-25  0          1        1
Here's a solution that uses two layered window functions, max and dense_rank. Snowflake (like most other DBMSs) doesn't allow you to nest one window function inside another, so we compute the first one in a subquery and the second one in the outer query.
The key to this method is to assign a common date value to all members of a group, in this case the group's start date; dense_rank then gives a 1 to all the records tied for first place, a 2 to the next group, and so on. So for every row in grp_test we want the max(Start_Time) of the records with grp_start = 1 at or before that row:
max(Case When grp_start=1 Then Start_Time End)
Over (Partition By ID Order By Start_Time
Rows Between Unbounded Preceding And Current Row) as grp_start_time
So put it all together with
Select ID, Start_Time, Grp_Start, Grp_End,
       dense_rank() Over (Partition By ID Order By grp_start_time) as label
From (
    Select ID, Start_Time, Grp_Start, Grp_End,
           max(Case When grp_start=1 Then Start_Time End)
               Over (Partition By ID Order By Start_Time
                     Rows Between Unbounded Preceding And Current Row) as grp_start_time
    From grp_test
)
Order by ID, Start_Time
METHOD 2
You can simplify this considerably if you are certain grp_start only ever contains zeros and ones. This one simply creates a running sum of grp_start:
Select ID, Start_Time, Grp_Start, Grp_End,
       sum(Grp_Start) Over (Partition By ID Order By Start_Time
                            Rows Between Unbounded Preceding And Current Row) as label
From grp_test
Order by ID, Start_Time

Select rows from a particular row to the latest row if that particular row type exists

I want to achieve these two requirements using a single query. Currently I'm using two queries in the program and doing the processing in C#, something like this:
Pseudocode
select top 1 id from table where type = 'b'
if result.row.count > 0
    var typeBid = row["id"]
    select * from table where id >= typeBid
else
    select * from table
Req 1: If records with type=b exist, the result should be the latest row with type=b plus all rows added after it.
Table
--------------------
id type date
--------------------
1 b 2021-10-15
2 a 2021-11-16
3 b 2021-11-19
4 a 2021-12-02
5 c 2021-12-12
6 a 2021-12-16
Result
--------------------
id type date
--------------------
3 b 2021-11-19
4 a 2021-12-02
5 c 2021-12-12
6 a 2021-12-16
Req 2: If no record with type=b exists, the query should select all records in the table.
Table
---------------------
id type date
---------------------
1 a 2021-10-15
2 a 2021-11-16
3 a 2021-11-19
4 a 2021-12-02
5 c 2021-12-12
6 a 2021-12-16
Result
--------------------
id type date
--------------------
1 a 2021-10-15
2 a 2021-11-16
3 a 2021-11-19
4 a 2021-12-02
5 c 2021-12-12
6 a 2021-12-16
with max_b_date as (select max(date) as date
from table1 where type = 'b')
select t1.*
from table1 t1
cross join max_b_date
where t1.date >= max_b_date.date
or max_b_date.date is null
(table is a SQL reserved word, https://en.wikipedia.org/wiki/SQL_reserved_words, so I used table1 as table name instead.)
https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=bd05543a9712e27f01528708f10b209f
Please try this (it's a bit convoluted, but it might be exactly what you're looking for):
select ab.* from
((select top 1 id, type, date from test where type = 'b' order by id desc)
union
select * from test where type != 'b') as ab
where ab.id >= (select COALESCE((select top 1 id from test where type = 'b' order by id desc), 0))
order by ab.id;
https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=739eb6bfee787e5079e616bbf4e933b1
Looks like you can use an OR condition here:
SELECT
    *
FROM
(
    SELECT
        *,
        BCount = COUNT(CASE type WHEN 'b' THEN 1 ELSE NULL END) OVER () -- count of rows with type b
    FROM table1
) Q
WHERE
    (
        BCount > 0 AND id >= (SELECT TOP 1 id FROM table1 WHERE type = 'b' ORDER BY id DESC) -- rows with type b exist: Req 1
    )
    OR
    (
        BCount = 0 -- no rows with type b: select all
    )

Get the first record from Table 2 where timestamp is greater than timestamp in Table 1

I have two tables.
EVENT_NARATIVE (EventNum, SEQ, Message, Message_Time)
GRADE_HISTORY (EventNum, Grade, GradeChangeTime)
The EVENT_NARATIVE Table has a lot of data (millions of rows) and each Event (EventNum) can have multiple EVENT_NARATIVE records.
The GRADE_HISTORY table contains thousands of rows and each EVENT can have multiple Grade changes.
In the EVENT_NARATIVE table I need to find all records where MESSAGE LIKE 'CODE set to%'. There is only one MESSAGE that meets this criterion for each event.
Once I have those records I need to find the first GRADE from the GRADE_HISTORY table that occurs after the Message_Time.
The result should look like
EventNum, Message, Message_Time, Grade, GradeChangeTime
My SQL looks like this, but I know it doesn't work:
SELECT
N.EventNum, N.SEQ, N.MESSAGE_TIME, N.MESSAGE,
G.CHANGE_DATE, G.GRADE
FROM
(SELECT EventNum, SEQ, MESSAGE_TIME, MESSAGE
FROM EVENT_NARRATIVE
Where MESSAGE LIKE 'CODE set to%' ) N
Left JOIN (SELECT Top 1
Event_Num, CHANGE_DATE, GRADE
FROM GRADE_HISTORY) G
ON G.Event_Num = N.Event_Num
AND G.CHANGE_DATE >= N.MESSAGE_TIME
SQL is not my day job so any help is appreciated to get the result I need.
SAMPLE DATA
EVENT_NARATIVE
EventNum SEQ MESSAGE_TIME MESSAGE
000001-01012021 20770236 2021-01-01 00:03:36.0000000 CODE set to 6D02
000001-01022020 8339846 2020-02-01 00:06:14.0000000 CODE set to 17B01
000001-01022021 22038639 2021-02-01 00:04:44.0000000 CODE set to 17A02
SAMPLE DATA
GRADE_HISTORY
EventNum CHANGE_DATE GRADE
000001-01012021 2021-01-01 00:03:15.0000000 2
000001-01012021 2021-01-01 00:03:37.0000000 3
000001-01012021 2021-01-01 00:03:40.0000000 5
000001-01022020 2020-02-01 00:06:10.0000000 2
000001-01022020 2020-02-01 00:06:15.0000000 2
000001-01022020 2020-02-01 00:06:18.0000000 5
000001-01022020 2020-02-01 00:06:20.0000000 5
000001-01022021 2021-02-01 00:04:40.0000000 2
000001-01022021 2021-02-01 00:04:42.0000000 3
000001-01022021 2021-02-01 00:04:44.0000000 0
000001-01022021 2021-02-01 00:04:54.0000000 5
Expected Result
EventNum SEQ CHANGE_DATE GRADE
000001-01012021 20770236 2021-01-01 00:03:37.0000000 3
000001-01022020 8339846 2020-02-01 00:06:15.0000000 2
000001-01022021 22038639 2021-02-01 00:04:44.0000000 0
Try this query
SELECT
N.EventNum,
N.SEQ,
N.MESSAGE_TIME,
N.MESSAGE,
G.CHANGE_DATE,
G.GRADE
FROM
(
SELECT EventNum, SEQ, MESSAGE_TIME, MESSAGE
FROM EVENT_NARRATIVE
WHERE MESSAGE LIKE 'CODE set to%'
) AS N
OUTER APPLY (
SELECT TOP 1
T.CHANGE_DATE,
T.GRADE
FROM GRADE_HISTORY AS T
WHERE T.EventNum = N.EventNum AND T.CHANGE_DATE > N.MESSAGE_TIME
ORDER BY T.CHANGE_DATE ASC
) AS G
You can use a window function:
SELECT EventNum, SEQ, CHANGE_DATE, GRADE
FROM
    (SELECT
         N.EventNum,
         N.SEQ,
         G.CHANGE_DATE,
         G.GRADE,
         ROW_NUMBER() OVER (PARTITION BY N.EventNum ORDER BY G.CHANGE_DATE) AS NUM
     FROM EVENT_NARATIVE N
     JOIN GRADE_HISTORY G ON N.EventNum = G.EventNum
     WHERE G.CHANGE_DATE >= N.MESSAGE_TIME AND N.MESSAGE LIKE 'CODE set to%') T
WHERE NUM = 1
demo in db<>fiddle

Does Oracle allow you to do a sum over a partition, but only when certain conditions are met, otherwise use a lag?

So my company has an application with a certain "in-app currency". We record every transaction.
Recently we found a bug that had been live for a couple of weeks which allowed users to spend currency in a certain place even when they had none. When this happened, users wouldn't get charged at all: e.g. a user had 4 m.u. and bought something worth 10 m.u.; their balance would remain at 4.
Now we need to find out who abused it and what their available balance really is.
I want to get the columns BUG_ABUSE and WISHFUL_CUMMULATIVE, which reflect the illegitimate transactions and the amount our users really see in their in-app wallets, but I'm running out of ideas on how to get there.
I was wondering if I could do something like a sum(estrelas) if the result is over 0, else a lag over (partition by user order by date), or something along those lines, to get the wishful cumulative.
We're using Oracle. Any help is highly appreciated.
USER_ID  EVENT_DATE                  AMOUNT  DIRECTION  RK  CUM  WISHFUL_CUMMULATIVE  BUG_ABUSE
1        02/01/2021 13:37:19,009000  -5      0          1   -5   0                    1
1        08/01/2021 01:55:40,000000  40      1          2   35   40                   0
1        10/01/2021 10:45:41,000000  2       1          3   37   42                   0
1        10/01/2021 10:45:58,000000  2       1          4   39   44                   0
1        10/01/2021 13:47:37,456000  -5      0          5   34   39                   0
2        13/01/2021 20:09:59,000000  2       1          1   2    2                    0
2        16/01/2021 15:14:54,000000  -50     0          2   -48  2                    1
2        19/01/2021 02:02:59,730000  -5      0          3   -53  2                    1
2        23/01/2021 21:14:40,000000  3       1          4   -50  5                    0
2        23/01/2021 21:14:50,000000  -5      0          5   -55  0                    0
Here's something you can try. This uses recursive subquery factoring (recursive WITH clause), so it will only work in Oracle 11.2 and higher.
I use columns USER_ID, EVENT_DATE and AMOUNT from your inputs. I assume all three columns are constrained NOT NULL, two events can't have exactly the same timestamp for the same user, and AMOUNT is negative for purchases and other debits (fees, etc.) and positive for deposits or other credits.
The input data looks like this:
select user_id, event_date, amount
from sample_data
order by user_id, event_date
;
USER_ID EVENT_DATE AMOUNT
------- ----------------------------- ------
1 02/01/2021 13:37:19,009000000 -5
1 08/01/2021 01:55:40,000000000 40
1 10/01/2021 10:45:41,000000000 2
1 10/01/2021 10:45:58,000000000 2
1 10/01/2021 13:47:37,456000000 -5
2 13/01/2021 20:09:59,000000000 2
2 16/01/2021 15:14:54,000000000 -50
2 19/01/2021 02:02:59,730000000 -5
2 23/01/2021 21:14:40,000000000 3
2 23/01/2021 21:14:50,000000000 -5
Perhaps your input data has additional columns (like cumulative amount, which I left out because it plays no role in the problem or its solution). You show a RK column - I assume you computed it as a step in your attempt to solve the problem; I re-create it in my solution below.
Here is what you can do with a recursive query (recursive WITH clause):
with
p (user_id, event_date, amount, rk) as (
select user_id, event_date, amount,
row_number() over (partition by user_id order by event_date)
from sample_data
)
, r (user_id, event_date, amount, rk, bug_flag, balance) as (
select user_id, event_date, amount, rk,
case when amount < 0 then 'bug' end, greatest(amount, 0)
from p
where rk = 1
union all
select p.user_id, p.event_date, p.amount, p.rk,
case when p.amount + r.balance < 0 then 'bug' end,
r.balance + case when r.balance + p.amount >= 0
then p.amount else 0 end
from p join r on p.user_id = r.user_id and p.rk = r.rk + 1
)
select *
from r
order by user_id, event_date
;
Output:
USER_ID EVENT_DATE AMOUNT RK BUG BALANCE
------- ----------------------------- ------ -- --- -------
1 02/01/2021 13:37:19,009000000 -5 1 bug 0
1 08/01/2021 01:55:40,000000000 40 2 40
1 10/01/2021 10:45:41,000000000 2 3 42
1 10/01/2021 10:45:58,000000000 2 4 44
1 10/01/2021 13:47:37,456000000 -5 5 39
2 13/01/2021 20:09:59,000000000 2 1 2
2 16/01/2021 15:14:54,000000000 -50 2 bug 2
2 19/01/2021 02:02:59,730000000 -5 3 bug 2
2 23/01/2021 21:14:40,000000000 3 4 5
2 23/01/2021 21:14:50,000000000 -5 5 0
In order to produce the result you want you'll probably want to process the rows sequentially: once the first row is processed for a user you'll compute the second one, then the third one, and so on.
Assuming the column RK is already computed and sequential for each user you can do:
with
n (user_id, event_date, amount, direction, rk, cum, wishful, bug_abuse) as (
  select t.*,
         greatest(amount, 0),
         case when amount < 0 then 1 else 0 end
  from t where rk = 1
  union all
  select
         t.user_id, t.event_date, t.amount, t.direction, t.rk, t.cum,
         case when n.wishful + t.amount < 0 then n.wishful
              else n.wishful + t.amount
         end,
         case when n.wishful + t.amount < 0 then 1 else 0 end
  from n
  join t on t.user_id = n.user_id and t.rk = n.rk + 1
)
select *
from n
order by user_id, rk;

PostgreSQL backdating query

I am trying to write a query that will return counted records as of the time they were created.
The primary key is a particular house, which is unique. Another variable is bidder. The house-to-bidder relationship is 1:1, but there can be multiple records for each bidder (different houses). Another variable is a count (CASE) of previous bids that were won.
I want the count to return the number of previous bids won at the time each house record was created.
Currently, my query (shown below) returns the overall number of previous bids won regardless of when the house record was created. Any help would be great!
Example:
SELECT h.house_id,
       h.bidder_id,
       h.created_date,
       b.previous_bids_won
FROM house h
LEFT JOIN bid_transactions bt
       ON h.house_id = bt.house_id
LEFT JOIN (
       SELECT bidder_id,
              COUNT(CASE WHEN created_date IS NOT NULL AND transaction_kind = 'Successful Bid' THEN 1 END) previous_bids_won
       FROM bid_transactions
       GROUP BY bidder_id
) b
ON h.bidder_id = b.bidder_id
ORDER BY h.created_date DESC
Example Data:
house_id bidder_id created_date previous_bids_won
1 1 2016-03-21 0
2 2 2016-02-10 1
3 2 2016-01-15 1
4 3 2016-01-01 0
Desired Data:
house_id bidder_id created_date previous_bids_won
1 1 2016-03-21 0
2 2 2016-02-10 1
3 2 2016-01-15 0
4 3 2016-01-01 0
If I understand correctly, you just want a cumulative sum:
SELECT h.house_id, h.bidder_id, h.created_date,
       SUM(CASE WHEN b.created_date IS NOT NULL AND b.transaction_kind = 'Successful Bid'
                THEN 1
                ELSE 0
           END) OVER (PARTITION BY h.bidder_id ORDER BY h.created_date) AS previous_bids_won
FROM house h
LEFT JOIN bid_transactions b
       ON h.house_id = b.house_id;
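Not part of the original answer, just a sketch under the assumption that bid_transactions has the created_date and transaction_kind columns used above: if the count should include only bids won strictly before each house record was created (which is what the desired data suggests for house_id 3), the current row can be excluded from the window frame.

SELECT h.house_id, h.bidder_id, h.created_date,
       COALESCE(SUM(CASE WHEN b.transaction_kind = 'Successful Bid' THEN 1 ELSE 0 END)
                OVER (PARTITION BY h.bidder_id
                      ORDER BY h.created_date
                      ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),  -- exclude the current row
                0) AS previous_bids_won                                   -- empty frame on the first row gives NULL, so default to 0
FROM house h
LEFT JOIN bid_transactions b ON h.house_id = b.house_id
ORDER BY h.created_date DESC;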