Mark users with a label by time for each month - SQL

Data source
| User ID | Visit Date          |
| 1       | 2020-01-01 12:29:15 |
| 1       | 2020-01-02 12:30:11 |
| 1       | 2020-04-01 12:31:01 |
| 2       | 2020-05-01 12:31:14 |
Problem
I need advice. I'm trying to write a subquery that marks a user as retention if they haven't visited again for 3 months. I'm using this query to get each user's latest visit for each month, including nulls:
select u.user_id, gs.yyyymm, s.last_visit_date
from (select distinct user_id from source s) u cross join
generate_series('2021-01-01'::timestamp, '2021-12-01'::timestamp, interval '1 month'
) gs(yyyymm) left join lateral
(select max(s.visit_date) as last_visit_date
from source s
where s.user_id = u.user_id and
s.visit_date >= gs.yyyymm and
s.visit_date < gs.yyyymm + interval '1 month'
) s
on 1=1;
But I think this will really hurt performance as the number of users keeps growing. Do you have any advice on how to achieve a result like the one below?
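Expected Result
| Month | User ID | Type       |
| 1     | 1       | FIRST      |
| 2     | 1       | RETENTION  |
| 3     | 1       | RETENTION  |
| 4     | 1       | REACTIVATE |
| ....  |         |            |
| 12    | 1       | null       |
| 1     | 2       | null       |
| ...   |         |            |
| 5     | 2       | FIRST      |
| 6     | 2       | RETENTION  |
| 7     | 2       | RETENTION  |
| 8     | 2       | RETENTION  |
| 9     | 2       | null       |
... and so on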
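Or it could be like this:
| Month | First | Retention | Reactivate |
| 1     | 1     | 0         | 0          |
| 2     | 0     | 1         | 0          |
| 3     | 0     | 1         | 0          |
| 4     | 0     | 0         | 1          |
| 5     | 1     | 0         | 0          |
| 6     | 0     | 1         | 0          |
| 7     | 0     | 1         | 0          |
| 8     | 0     | 1         | 0          |
| 9     | 0     | 0         | 0          |
... and so on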

My solution has some reasonable requirements, but it can work without them (at some cost in performance).
I built some helper tables that are filled automatically by a trigger.
The requirement is that UPDATE and DELETE are not allowed on the visit table.
The mst_user table stores the distinct user_ids and each user's first_visit.
The user_monthly_visit table stores the first and the last visit_date per user_id and month.
TABLES
CREATE TABLE mst_user (
id BIGINT,
first_visit TIMESTAMP,
CONSTRAINT pk_mst_user PRIMARY KEY (id)
);
CREATE TABLE visit (
user_id BIGINT,
visit_date TIMESTAMP,
CONSTRAINT visit_user_fkey FOREIGN KEY (user_id) REFERENCES mst_user (id)
);
CREATE TABLE user_monthly_visit (
user_id BIGINT,
month DATE,
first_visit_this_month TIMESTAMP,
last_visit_this_month TIMESTAMP,
CONSTRAINT pk_user_monthly_visit PRIMARY KEY (user_id, month),
CONSTRAINT user_monthly_visit_user_fkey FOREIGN KEY (user_id) REFERENCES mst_user (id)
);
CREATE INDEX ix_user_monthly_visit_month ON user_monthly_visit(month);
TRIGGER
CREATE OR REPLACE FUNCTION trf_visit() RETURNS trigger
VOLATILE
AS $xx$
DECLARE
l_user_id BIGINT;
l_row RECORD;
l_user_monthly_visit user_monthly_visit;
BEGIN
IF (tg_op = 'INSERT')
THEN
l_row := NEW;
INSERT INTO mst_user(id, first_visit) VALUES (l_row.user_id, l_row.visit_date)
ON CONFLICT(id) DO UPDATE SET first_visit = LEAST(mst_user.first_visit, l_row.visit_date);
INSERT INTO user_monthly_visit(user_id,month,first_visit_this_month,last_visit_this_month) VALUES (l_row.user_id,date_trunc('month',l_row.visit_date),l_row.visit_date,l_row.visit_date)
ON CONFLICT(user_id,month) DO UPDATE SET first_visit_this_month = LEAST(user_monthly_visit.first_visit_this_month,l_row.visit_date),
last_visit_this_month = GREATEST(user_monthly_visit.last_visit_this_month,l_row.visit_date);
ELSE
RAISE EXCEPTION 'UPDATE and DELETE arent allowed!';
END IF;
RETURN l_row;
END;
$xx$ LANGUAGE plpgsql;
CREATE TRIGGER trig_visit
BEFORE INSERT OR DELETE OR UPDATE ON visit
FOR EACH ROW
EXECUTE PROCEDURE trf_visit();
TESTDATA
INSERT INTO visit (user_id, visit_date)
VALUES (1, '20200101 122915');
INSERT INTO visit (user_id, visit_date)
VALUES (1, '20200102 123011');
INSERT INTO visit (user_id, visit_date)
VALUES (1, '20200401 123101');
INSERT INTO visit (user_id, visit_date)
VALUES (2, '20200501 123114');
QUERY
SELECT mnt AS month, user_id,
CASE WHEN first_visit IS NULL OR first_visit> yyyymm + INTERVAL '1 month' THEN NULL
WHEN first_visit_this_month = first_visit THEN 'FIRST'
WHEN first_visit_this_month IS NULL AND last_three_month + INTERVAL '3 month' >= yyyymm THEN 'RETENTION'
WHEN first_visit_this_month IS NOT NULL THEN 'REACTIVATE'
ELSE NULL
END user_type
FROM
(SELECT date_part('month', gs.yyyymm)::INTEGER AS mnt, gs.yyyymm, u.id user_id, umv.first_visit_this_month, umv.last_visit_this_month, u.first_visit,
GREATEST(
LAG(last_visit_this_month) OVER w,
LAG(last_visit_this_month,2) OVER w,
LAG(last_visit_this_month,3) OVER w
) last_three_month
FROM
generate_series('2020-01-01'::TIMESTAMP, '2020-12-01'::TIMESTAMP, INTERVAL '1 month') gs(yyyymm)
CROSS JOIN mst_user u
LEFT JOIN user_monthly_visit umv on (umv.user_id=u.id AND umv.month = gs.yyyymm)
WINDOW w AS (PARTITION BY u.id ORDER BY gs.yyyymm)
) monthly_visit
ORDER BY user_id,mnt;
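RESULT
| month | user_id | user_type  |
| 1     | 1       | FIRST      |
| 2     | 1       | RETENTION  |
| 3     | 1       | RETENTION  |
| 4     | 1       | REACTIVATE |
| 5     | 1       | RETENTION  |
| 6     | 1       | RETENTION  |
| 7     | 1       | RETENTION  |
| 8     | 1       | (null)     |
| 9     | 1       | (null)     |
| 10    | 1       | (null)     |
| 11    | 1       | (null)     |
| 12    | 1       | (null)     |
| 1     | 2       | (null)     |
| 2     | 2       | (null)     |
| 3     | 2       | (null)     |
| 4     | 2       | (null)     |
| 5     | 2       | FIRST      |
| 6     | 2       | RETENTION  |
| 7     | 2       | RETENTION  |
| 8     | 2       | RETENTION  |
| 9     | 2       | (null)     |
| 10    | 2       | (null)     |
| 11    | 2       | (null)     |
| 12    | 2       | (null)     |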

Restating your question so it is clearer for me:
For any given month, the user is classified as first, retention or reactivate based on the following criteria:
first: month of first visit
retention: within 3 months since previous visit
reactivate: month of visit & no visit in prior month
If I understood this correctly, you can get the first desired result with the following query
Schema (PostgreSQL v13)
create table visits (user_id int, visit_at timestamp);
insert into visits values
(1, '2020-01-01 12:29:15'),
(1, '2020-01-02 12:30:11'),
(1, '2020-04-01 12:31:01'),
(2, '2020-05-01 12:31:14');
Query
WITH trange AS (
SELECT
user_id
, DATE_TRUNC('month', min(visit_at)) visit_from
, DATE_TRUNC('month', max(visit_at)) + interval '3 month' visit_to
FROM visits
GROUP BY 1
)
, monthly_visits AS (
SELECT DISTINCT
user_id
, DATE_TRUNC('month', visit_at) visit_month
FROM visits
)
SELECT
trange.user_id
, DATE(m) report_month
, CASE
WHEN visit_from = m THEN 'FIRST'
WHEN visit_month = m AND LAST_VALUE(visit_month) OVER w IS NULL THEN 'REACTIVATE'
WHEN m <= MAX(visit_month) OVER w + INTERVAL '3 MONTH' THEN 'RETENTION'
ELSE NULL END user_type
FROM trange
LEFT JOIN LATERAL GENERATE_SERIES(visit_from, visit_to, '1 month') m(m)
ON true
LEFT JOIN monthly_visits
ON monthly_visits.user_id = trange.user_id
AND monthly_visits.visit_month = m.m
WINDOW w AS (
PARTITION BY trange.user_id
ORDER BY m.m
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
);
| user_id | report_month             | user_type  |
| 1       | 2020-01-01T00:00:00.000Z | FIRST      |
| 1       | 2020-02-01T00:00:00.000Z | RETENTION  |
| 1       | 2020-03-01T00:00:00.000Z | RETENTION  |
| 1       | 2020-04-01T00:00:00.000Z | REACTIVATE |
| 1       | 2020-05-01T00:00:00.000Z | RETENTION  |
| 1       | 2020-06-01T00:00:00.000Z | RETENTION  |
| 1       | 2020-07-01T00:00:00.000Z | RETENTION  |
| 2       | 2020-05-01T00:00:00.000Z | FIRST      |
| 2       | 2020-06-01T00:00:00.000Z | RETENTION  |
| 2       | 2020-07-01T00:00:00.000Z | RETENTION  |
| 2       | 2020-08-01T00:00:00.000Z | RETENTION  |
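For cross-checking the query output outside the database, the three criteria restated in this answer can be sketched in plain Python. This is only a sketch under assumptions: the function name `classify`, the `horizon` parameter, and the representation of a month as an integer index are hypothetical and not part of the answer's SQL.

```python
from datetime import date

def month_index(d):
    # Map a date to a running month number so months can be compared and added.
    return d.year * 12 + (d.month - 1)

def classify(visits, horizon=3):
    """visits: dates for ONE user. Returns (year, month, label) per reported month."""
    seen = {month_index(d) for d in visits}
    first, last = min(seen), max(seen)
    rows = []
    latest_prior = None  # latest visited month strictly before the current one
    for m in range(first, last + horizon + 1):
        if m == first:
            label = "FIRST"
        elif m in seen and (m - 1) not in seen:
            label = "REACTIVATE"
        elif latest_prior is not None and m <= latest_prior + horizon:
            label = "RETENTION"
        else:
            label = None
        rows.append((m // 12, m % 12 + 1, label))
        if m in seen:
            latest_prior = m
    return rows

user1 = classify([date(2020, 1, 1), date(2020, 1, 2), date(2020, 4, 1)])
user2 = classify([date(2020, 5, 1)])
```

With the sample data, `user1` covers January to July 2020 and `user2` covers May to August 2020, the same month ranges as the result table of the query.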

Related

Categorising transactions in past 90 days

I have a table where the columns are:
Transaction_id(T_id): Distinct id generated for each transactions
Date(Dt): Date of Transaction
account-id(Ac_id): The id from which the transaction is done
Org_id(O_id): It is the id given to the organizations. One organization can have multiple accounts thereby different account id can have the same org_id
Sample table:
| T_id | Dt      | Ac_id | O_id |
| 101  | 23/4/22 | 1     | A    |
| 102  | 06/7/22 | 3     | C    |
| 103  | 01/8/22 | 2     | A    |
| 104  | 13/3/22 | 6     | B    |
The task is to mark each o_id that has a transaction in the past 90 days as 1 and the others as 0.
Output
| T_id | Dt      | Ac_id | O_id | Mark |
| 101  | 23/4/22 | 1     | A    | 0    |
| 102  | 06/7/22 | 3     | C    | 1    |
| 103  | 01/8/22 | 2     | A    | 1    |
| 104  | 13/3/22 | 6     | B    | 0    |
The query I am using is:
Select *,
Case when datediff('day', Dt, current_date()) between 0 and 90 then '1'
Else '0'
End as Mark
From Table1
Desired Output:
| T_id | Dt      | Ac_id | O_id | Mark |
| 101  | 23/4/22 | 1     | A    | 1    |
| 102  | 06/7/22 | 3     | C    | 1    |
| 103  | 01/8/22 | 2     | A    | 1    |
| 104  | 13/3/22 | 6     | B    | 0    |
For o_id 'A', the mark I want is 1 in all rows, because one of its transactions was made in the past 90 days, irrespective of its other transactions made more than 90 days ago.
I have to join this output to another table, so I need every o_id that has at least one transaction in the past 90 days marked as '1'.
Please help me with it quickly.
The easiest approach is to compare the current date against the windowed MAX(Dt) partitioned by o_id:
SELECT *,
CASE
WHEN DATEDIFF('day', (MAX(Dt) OVER(PARTITION BY o_id)), CURRENT_DATE()) <= 90
THEN 1
ELSE 0
END AS Mark
FROM Tab;
Sample data:
ALTER SESSION SET DATE_INPUT_FORMAT = 'DD/MM/YYYY';
CREATE OR REPLACE TABLE tab(t_id INT,
Dt Date,
Ac_id INT,
O_id TEXT)
AS
SELECT 101, '23/04/2022' ,1 ,'A' UNION
SELECT 102, '06/07/2022' ,3 ,'C' UNION
SELECT 103, '01/08/2022' ,2 ,'A' UNION
SELECT 104, '13/03/2022' ,6 ,'B';
Snowflake natively supports the BOOLEAN data type, so the entire query could be just:
SELECT *,
DATEDIFF('day', (MAX(Dt) OVER(PARTITION BY o_id)), CURRENT_DATE()) <= 90 AS Mark
FROM tab
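The windowed MAX idea can be checked against the sample rows with a small Python sketch. The function name `mark_orgs` and the fixed `today` value of 2022-08-15 are assumptions made only for illustration; the SQL uses CURRENT_DATE(), so the real marks depend on when the query runs.

```python
from datetime import date

def mark_orgs(rows, today, window_days=90):
    """rows: (t_id, dt, ac_id, o_id) tuples. Returns the rows with a 0/1 mark appended."""
    latest = {}  # most recent transaction date per o_id
    for _t_id, dt, _ac_id, o_id in rows:
        if o_id not in latest or dt > latest[o_id]:
            latest[o_id] = dt
    return [
        row + (1 if (today - latest[row[3]]).days <= window_days else 0,)
        for row in rows
    ]

rows = [
    (101, date(2022, 4, 23), 1, "A"),
    (102, date(2022, 7, 6), 3, "C"),
    (103, date(2022, 8, 1), 2, "A"),
    (104, date(2022, 3, 13), 6, "B"),
]
# Hypothetical "current date", chosen so the sample reproduces the desired output.
marked = mark_orgs(rows, today=date(2022, 8, 15))
```

Every row of organization A gets the same mark because the comparison uses the organization's latest date rather than the row's own date.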
Create a subquery that identifies all the distinct o_id values that have a recent transaction, and use it to set the mark in the main result.
The subquery would be:
select o_id, max(Dt) as last_dt from table1
group by o_id
having datediff('day', max(Dt), current_date()) between 0 and 90;
Then your main query becomes:
Select *,'1' as Mark
From Tab
where o_id in
(select x.o_id from (select o_id, max(Dt)
from tab
group by o_id
having datediff('day', max(Dt), current_date()) between 0 and 90) x)
union all
select *,'0' as Mark
from Tab
where o_id not in
(select x.o_id from (select o_id, max(Dt)
from tab
group by o_id
having datediff('day', max(Dt), current_date()) between 0 and 90) x);

SQL 30 day active user query

I have a table of users and how many events they fired on a given date:
| DATE       | USERID | EVENTS |
| 2021-08-27 | 1      | 5      |
| 2021-07-25 | 1      | 7      |
| 2021-07-23 | 2      | 3      |
| 2021-07-20 | 3      | 9      |
| 2021-06-22 | 1      | 9      |
| 2021-05-05 | 1      | 4      |
| 2021-05-05 | 2      | 2      |
| 2021-05-05 | 3      | 6      |
| 2021-05-05 | 4      | 8      |
| 2021-05-05 | 5      | 1      |
I want to create a table showing number of active users for each date with active user being defined as someone who has fired an event on the given date or in any of the preceding 30 days.
| DATE       | ACTIVE_USERS |
| 2021-08-27 | 1            |
| 2021-07-25 | 3            |
| 2021-07-23 | 2            |
| 2021-07-20 | 2            |
| 2021-06-22 | 1            |
| 2021-05-05 | 5            |
I tried the following query which returned only the users who were active on the specified date:
SELECT COUNT(DISTINCT USERID), DATE
FROM table
WHERE DATE >= (CURRENT_DATE() - interval '30 days')
GROUP BY 2 ORDER BY 2 DESC;
I also tried using a window function with rows between but seems to end up getting the same result:
SELECT
DATE,
SUM(ACTIVE_USERS) AS ACTIVE_USERS
FROM
(
SELECT
DATE,
CASE
WHEN SUM(EVENTS) OVER (PARTITION BY USERID ORDER BY DATE ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) >= 1 THEN 1
ELSE 0
END AS ACTIVE_USERS
FROM table
)
GROUP BY 1
ORDER BY 1
I'm using SQL:ANSI on Snowflake. Any suggestions would be much appreciated.
This is tricky to do with window functions, because count(distinct) is not permitted. You can use a self-join:
select t1.date, count(distinct t2.userid)
from table t1 join
table t2
on t2.date <= t1.date and
t2.date > t1.date - interval '30 day'
group by t1.date;
However, that can be expensive. One solution is to "unpivot" the data. That is, do an incremental count per user of going "in" and "out" of active states and then do a cumulative sum:
with d as ( -- calculate the dates with "ins" and "outs"
select userid, date, +1 as inc
from table
union all
select userid, date + interval '30 day', -1 as inc
from table
),
d2 as ( -- accumulate to get the net actives per day
select date, userid, sum(inc) as change_on_day,
sum(sum(inc)) over (partition by userid order by date) as running_inc
from d
group by date, userid
),
d3 as ( -- summarize into active periods; a period ends on the date the running total drops back to 0
select userid, min(date) as start_date, max(next_date) as end_date
from (select d2.*,
lead(date) over (partition by userid order by date) as next_date,
sum(case when running_inc = 0 then 1 else 0 end) over (partition by userid order by date) as active_period
from d2
) d2
where running_inc > 0
group by userid, active_period
)
select d.date, count(d3.userid)
from (select distinct date from table) d left join
d3
on d.date >= start_date and d.date < end_date
group by d.date;
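A brute-force reference in Python can help validate either SQL approach on small data. It is a sketch, not a replacement for the queries: the function name `active_users_per_date` is hypothetical, and it assumes the same window as the self-join, meaning a user counts as active on a date if they fired an event on that date or on one of the 29 days before it.

```python
from datetime import date, timedelta

def active_users_per_date(events, window_days=30):
    """events: (date, userid) pairs. Returns {date: number of distinct active users}."""
    result = {}
    for d in sorted({day for day, _ in events}, reverse=True):
        lower = d - timedelta(days=window_days)  # exclusive lower bound
        result[d] = len({u for day, u in events if lower < day <= d})
    return result

events = [
    (date(2021, 8, 27), 1), (date(2021, 7, 25), 1), (date(2021, 7, 23), 2),
    (date(2021, 7, 20), 3), (date(2021, 6, 22), 1), (date(2021, 5, 5), 1),
    (date(2021, 5, 5), 2), (date(2021, 5, 5), 3), (date(2021, 5, 5), 4),
    (date(2021, 5, 5), 5),
]
counts = active_users_per_date(events)
```

On the question's sample rows this yields the counts listed in the desired ACTIVE_USERS table. The loop is quadratic in the number of events, so it is only meant for checking results, not for production volumes.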

SQL - Find if column dates include at least partially a date range

I need to create a report and I am struggling with the SQL script.
The table I want to query is a company_status_history table which has entries like the following (the ones that I can't figure out)
Table company_status_history
Columns:
| id | company_id | status_id | effective_date |
Data:
| 1 | 10 | 1 | 2016-12-30 00:00:00.000 |
| 2 | 10 | 5 | 2017-02-04 00:00:00.000 |
| 3 | 11 | 5 | 2017-06-05 00:00:00.000 |
| 4 | 11 | 1 | 2018-04-30 00:00:00.000 |
I want to answer to the question "Get all companies that have been at least for some point in status 1 inside the time period 01/01/2017 - 31/12/2017"
Above are the cases that I don't know how to handle since I need to add some logic of type :
"If this row is status 1 and it's date is before the date range check the next row if it has a date inside the date range."
"If this row is status 1 and it's date is after the date range check the row before if it has a date inside the date range."
I think this can be handled as a gaps and islands problem. Consider the following input data: (same as sample data of OP plus two additional rows)
id company_id status_id effective_date
-------------------------------------------
1 10 1 2016-12-15
2 10 1 2016-12-30
3 10 5 2017-02-04
4 10 4 2017-02-08
5 11 5 2017-06-05
6 11 1 2018-04-30
You can use the following query:
SELECT t.id, t.company_id, t.status_id, t.effective_date, x.cnt
FROM company_status_history AS t
OUTER APPLY
(
SELECT COUNT(*) AS cnt
FROM company_status_history AS c
WHERE c.status_id = 1
AND c.company_id = t.company_id
AND c.effective_date < t.effective_date
) AS x
ORDER BY company_id, effective_date
to get:
id company_id status_id effective_date cnt
-----------------------------------------------
1 10 1 2016-12-15 0
2 10 1 2016-12-30 1
3 10 5 2017-02-04 2
4 10 4 2017-02-08 2
5 11 5 2017-06-05 0
6 11 1 2018-04-30 0
Now you can identify status = 1 islands using:
;WITH CTE AS
(
SELECT t.id, t.company_id, t.status_id, t.effective_date, x.cnt
FROM company_status_history AS t
OUTER APPLY
(
SELECT COUNT(*) AS cnt
FROM company_status_history AS c
WHERE c.status_id = 1
AND c.company_id = t.company_id
AND c.effective_date < t.effective_date
) AS x
)
SELECT id, company_id, status_id, effective_date,
ROW_NUMBER() OVER (PARTITION BY company_id ORDER BY effective_date) -
cnt AS grp
FROM CTE
Output:
id company_id status_id effective_date grp
-----------------------------------------------
1 10 1 2016-12-15 1
2 10 1 2016-12-30 1
3 10 5 2017-02-04 1
4 10 4 2017-02-08 2
5 11 5 2017-06-05 1
6 11 1 2018-04-30 2
Calculated field grp will help us identify those islands:
;WITH CTE AS
(
SELECT t.id, t.company_id, t.status_id, t.effective_date, x.cnt
FROM company_status_history AS t
OUTER APPLY
(
SELECT COUNT(*) AS cnt
FROM company_status_history AS c
WHERE c.status_id = 1
AND c.company_id = t.company_id
AND c.effective_date < t.effective_date
) AS x
), CTE2 AS
(
SELECT id, company_id, status_id, effective_date,
ROW_NUMBER() OVER (PARTITION BY company_id ORDER BY effective_date) -
cnt AS grp
FROM CTE
)
SELECT company_id,
MIN(effective_date) AS start_date,
CASE
WHEN COUNT(*) > 1 THEN DATEADD(DAY, -1, MAX(effective_date))
ELSE MIN(effective_date)
END AS end_date
FROM CTE2
GROUP BY company_id, grp
HAVING COUNT(CASE WHEN status_id = 1 THEN 1 END) > 0
Output:
company_id start_date end_date
-----------------------------------
10 2016-12-15 2017-02-03
11 2018-04-30 2018-04-30
All you need now is to keep those records from above that overlap with the specified interval.
Demo here with somewhat more complicated use case.
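The final overlap step can be illustrated with a Python sketch. It does not reproduce the SQL above; it assumes that each history row's status stays in effect until the company's next row, and that the last row stays in effect indefinitely. The function name `companies_in_status` is hypothetical.

```python
from datetime import date

def companies_in_status(history, status, start, end):
    """history: (company_id, status_id, effective_date) rows.
    Returns the companies that held `status` at some point within [start, end]."""
    by_company = {}
    for company_id, status_id, effective in history:
        by_company.setdefault(company_id, []).append((effective, status_id))
    hits = set()
    for company_id, rows in by_company.items():
        rows.sort()
        for i, (effective, status_id) in enumerate(rows):
            if status_id != status:
                continue
            # The status holds until the next change (or forever for the last row).
            next_change = rows[i + 1][0] if i + 1 < len(rows) else date.max
            # Half-open period [effective, next_change) overlaps closed [start, end].
            if effective <= end and next_change > start:
                hits.add(company_id)
    return hits

history = [
    (10, 1, date(2016, 12, 30)),
    (10, 5, date(2017, 2, 4)),
    (11, 5, date(2017, 6, 5)),
    (11, 1, date(2018, 4, 30)),
]
result = companies_in_status(history, 1, date(2017, 1, 1), date(2017, 12, 31))
```

For the four rows from the question, only company 10 qualifies: it entered status 1 before 2017 and left it on 2017-02-04, while company 11 only reached status 1 in 2018.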
Maybe this is what you are looking for? For this kind of question you need to join two instances of your table. Here I am simply joining each row with the next record by id, which is probably not entirely correct. To do it better, you could create a new id with a window function such as row_number, ordering the table according to your criteria.
If this row is status 1 and it's date is before the date range check
the next row if it has a date inside the date range
declare @range_st date = '2017-01-01'
declare @range_en date = '2017-12-31'
select
case
when csh1.status_id=1 and csh1.effective_date<@range_st
then
case
when csh2.effective_date between @range_st and @range_en then 1 -- true
else 0 -- false
end
else NULL
end
from company_status_history csh1
left join company_status_history csh2
on csh2.id=csh1.id+1 -- csh2 is the next row
Implementing the second criterion:
"If this row is status 1 and it's date is after the date range check
the row before if it has a date inside the date range."
declare @range_st date = '2017-01-01'
declare @range_en date = '2017-12-31'
select
case
when csh1.status_id=1 and csh1.effective_date<@range_st
then
case
when csh2.effective_date between @range_st and @range_en then 1 -- true
else 0 -- false
end
when csh1.status_id=1 and csh1.effective_date>@range_en
then
case
when csh3.effective_date between @range_st and @range_en then 1 -- true
else 0 -- false
end
else null -- ¿?
end
from company_status_history csh1
left join company_status_history csh2
on csh2.id=csh1.id+1 -- csh2 is the next row
left join company_status_history csh3
on csh3.id=csh1.id-1 -- csh3 is the previous row
I would suggest using a CTE and the window function ROW_NUMBER. With these you can find the desired records. An example:
DECLARE @t TABLE(
id INT
,company_id INT
,status_id INT
,effective_date DATETIME
)
INSERT INTO @t VALUES
(1, 10, 1, '2016-12-30 00:00:00.000')
,(2, 10, 5, '2017-02-04 00:00:00.000')
,(3, 11, 5, '2017-06-05 00:00:00.000')
,(4, 11, 1, '2018-04-30 00:00:00.000')
DECLARE @StartDate DATETIME = '2017-01-01';
DECLARE @EndDate DATETIME = '2017-12-31';
WITH cte AS(
SELECT *
,ROW_NUMBER() OVER (PARTITION BY company_id ORDER BY effective_date) AS rn
FROM @t
),
cteLeadLag AS(
SELECT c.*, ISNULL(c2.effective_date, c.effective_date) LagEffective, ISNULL(c3.effective_date, c.effective_date) LeadEffective
FROM cte c
LEFT JOIN cte c2 ON c2.company_id = c.company_id AND c2.rn = c.rn-1
LEFT JOIN cte c3 ON c3.company_id = c.company_id AND c3.rn = c.rn+1
)
SELECT 'Included' AS RangeStatus, *
FROM cteLeadLag
WHERE status_id = 1
AND effective_date BETWEEN @StartDate AND @EndDate
UNION ALL
SELECT 'Following' AS RangeStatus, *
FROM cteLeadLag
WHERE status_id = 1
AND effective_date > @EndDate
AND LagEffective BETWEEN @StartDate AND @EndDate
UNION ALL
SELECT 'Trailing' AS RangeStatus, *
FROM cteLeadLag
WHERE status_id = 1
AND effective_date < @StartDate -- before the range, so 'Included' rows are not repeated
AND LeadEffective BETWEEN @StartDate AND @EndDate
I first select all records with their leading and lagging Dates and then I perform your checks on the inclusion in the desired timespan.
Try with this, self-explanatory. Responds to this part of your question:
I want to answer to the question "Get all companies that have been at
least for some point in status 1 inside the time period 01/01/2017 -
31/12/2017"
Case where you want to find the companies that have been in status 1 at any moment and that have records in the requested period:
SELECT *
FROM company_status_history
WHERE company_id IN
( SELECT company_id
FROM company_status_history
WHERE status_id=1 )
AND effective_date BETWEEN '2017-01-01' AND '2017-12-31'
Case where you want to find the rows that are in status 1 and inside the period:
SELECT *
FROM company_status_history
WHERE status_id=1
AND effective_date BETWEEN '2017-01-01' AND '2017-12-31'

Count days from start_date to end_date or end of month

With datediff() I can count the days between two dates, but how can I count the days between the later date or the end of the month and the start date?
CREATE TABLE table1 (id int, start_date datetime, end_date datetime, jan int);
INSERT INTO table1 (id, start_date, end_date) VALUES
(1, '2016-12-12', '2017-01-17'),
(2, '2017-01-10', '2017-01-10'),
(3, '2017-01-10', '2017-02-10'),
(4, '2017-01-03', '2017-02-03'),
(5, '2016-12-03', '2017-02-03');
If I run:
select id, month(start_date) as month, datediff(end_date, start_date) as diff
from table1;
it returns
id month diff
1 12 36
2 1 0
3 1 31
4 1 31
5 12 62
but I would like it to return:
id month diff
1 12 19
5 12 28
1 1 17
2 1 0
3 1 21
4 1 28
5 1 31
3 2 10
4 2 3
5 2 3
I'm trying to get the number of days in each month during which an event occurs.
I've created a separate query that updates a new column with the values, but ideally there shouldn't be a new column, since I would need a new column for each year-month combination:
update table1 set jan= case
when start_date >= "2017-01-01" and end_date <= last_day("2017-01-01") then datediff(end_date, start_date)+1
when start_date >= "2017-01-01" and start_date <= last_day("2017-01-01") and end_date > last_day("2017-01-01") then datediff(last_day("2017-01-01"), start_date)+1
when start_date < "2017-01-01" and end_date between "2017-01-01" and last_day("2017-01-01") then datediff(end_date, "2017-01-01")+1
when start_date < "2017-01-01" and end_date > last_day("2017-01-01") then day(last_day("2017-01-01"))
else null
end;
Your problem is going to be getting multiple rows... so let's take a different tack.
This ends up being trivial if you have a calendar table: a table with a row-per-date (and a bunch of individual columns and indices):
SELECT Table1.id, Calendar.calendar_month, COUNT(*)
FROM Table1
JOIN Calendar
ON Calendar.calendar_date >= start_date
AND Calendar.calendar_date < end_date
GROUP BY Table1.id, Calendar.calendar_month
ORDER BY Table1.id, MIN(Calendar.calendar_date)
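The calendar-table join amounts to enumerating the days of each range and grouping them by month, which can be sketched in Python. The function name `days_per_month` is hypothetical. The sketch assumes the day convention that reproduces the numbers in the question's desired output: the start date itself is not counted and the end date is.

```python
from collections import Counter
from datetime import date, timedelta

def days_per_month(start, end):
    """Count the days d with start < d <= end, grouped by (year, month)."""
    counts = Counter()
    day = start + timedelta(days=1)
    while day <= end:
        counts[(day.year, day.month)] += 1
        day += timedelta(days=1)
    return dict(counts)

# Row id 5 from the question: 2016-12-03 to 2017-02-03.
row5 = days_per_month(date(2016, 12, 3), date(2017, 2, 3))
```

Under this convention a range whose start and end fall on the same day (row id 2) produces no entry at all, so a 0 row for it would have to be added separately.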
I don't know if this is what you're looking for.
select month(start_date) as month,
datediff(LAST_DAY(start_date), start_date) as diff
from table1
UNION ALL
select month(end_date) as month,
IF(end_date < LAST_DAY(start_date), datediff(start_date, end_date),
datediff(end_date, LAST_DAY(start_date)))
from table1;

Columns columns by time in a sql table

I have a sql table as follows:
+-----------+----------+----------+---------------+
| AccountID | PersonId | DoctorID | Admitdatetime |
+-----------+----------+----------+---------------+
| 1 | 2 | 345 | 20090108 |
| 2 | 3 | 53 | 20090109 |
| 3 | 1 | 234 | 20090110 |
| 4 | 2 | 345 | |
+-----------+----------+----------+---------------+
Each row of this table is like a visit of a patient given by the admitdatetime. Each unique record is referenced by AccountID
Date column is basically int and is yyyymmdd. So just subtracting two dates might not be right as it is not datetime. I just checked.
Now, what I want to do for each record in the table is to add 3 columns. One for last three months, one for last 6 months, and one for last 12 months.
The columns are described as follows:
The no. of cases a DoctorID has seen in the past 3 months of that current record. Similarly, no. of cases a DoctorID has seen in the past 6 months of that current record.
I am doing a self join like this:
SELECT a.DoctorID, count(AccountID) FROM
Visits AS a INNER JOIN
Visits AS b ON a.DoctorId = b.DoctorId
WHERE a.admitdatetime - b.admitdatetime <= 90
The above one I am doing for the 3 months case, but I don't think it is right. I want for each record the no. of cases (count of AccountId) a doctor has seen 3,6,9 months before that. So for each DoctorID, that value would vary based on which record the doctorID is present and it's 3,6,9 months prior that admitdatetime of that record such that the above code would just give me one value for a doctorID. That doesn't seem right.
I think the join should be grouped by DoctorId, AccountId as I need to join all the doctorid back to each record and each record is identified by accountid. So then join it back on doctorid and accountid. Does this sound right?
I would suggest correlated subqueries:
select v.*,
(select sum(case when v2.AdmitDate >= v.AdmitDate - interval '3 month'
then 1 else 0
end)
from visits v2
where v2.doctorid = v.doctorid
) as last3,
(select sum(case when v2.AdmitDate >= v.AdmitDate - interval '6 month'
then 1 else 0
end)
from visits v2
where v2.doctorid = v.doctorid
) as last6,
(select sum(case when v2.AdmitDate >= v.AdmitDate - interval '12 month'
then 1 else 0
end)
from visits v2
where v2.doctorid = v.doctorid
) as last12
from visits v;
I should point out that Postgres allows you to simplify this syntax:
(select sum((v2.AdmitDate >= v.AdmitDate - interval '3 month')::int)
from visits v2
where v2.doctorid = v.doctorid
) as last3,
And in more recent versions of Postgres you can use a lateral join to combine the logic into a single subquery.
EDIT:
A reasonable simplification of the query is:
select v.*,
(select count(*)
from visits v2
where v2.doctorid = v.doctorid and v2.AdmitDate >= v.AdmitDate - interval '3 month'
) as last3,
(select count(*)
from visits v2
where v2.doctorid = v.doctorid and v2.AdmitDate >= v.AdmitDate - interval '6 month'
) as last6,
(select count(*)
from visits v2
where v2.doctorid = v.doctorid and v2.AdmitDate >= v.AdmitDate - interval '12 month'
) as last12
from visits v;
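Because the question stores the admit date as an integer in yyyymmdd form, the date arithmetic only works after converting to a real date. The following Python sketch shows that conversion together with a per-record trailing count. All names (`parse_yyyymmdd`, `months_back`, `doctor_counts`) are hypothetical, and unlike the queries above it also assumes an upper bound, so only visits on or before the current record's date are counted.

```python
import calendar
from datetime import date

def parse_yyyymmdd(n):
    # 20090108 -> date(2009, 1, 8)
    return date(n // 10000, n // 100 % 100, n % 100)

def months_back(d, months):
    # Step back whole months, clamping the day to the end of the target month.
    year, month0 = divmod(d.year * 12 + d.month - 1 - months, 12)
    month = month0 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

def doctor_counts(visits, months):
    """visits: (account_id, doctor_id, admit_int_or_None).
    Returns {account_id: visits by the same doctor in the trailing window, or None}."""
    parsed = [(a, doc, parse_yyyymmdd(n) if n else None) for a, doc, n in visits]
    out = {}
    for account_id, doctor_id, admitted in parsed:
        if admitted is None:  # a missing admit date cannot be placed in a window
            out[account_id] = None
            continue
        lower = months_back(admitted, months)
        out[account_id] = sum(
            1 for _, doc, other in parsed
            if doc == doctor_id and other is not None and lower <= other <= admitted
        )
    return out

visits = [(1, 345, 20090108), (2, 53, 20090109), (3, 234, 20090110), (4, 345, None)]
last3 = doctor_counts(visits, 3)
```

Calling `doctor_counts` again with 6 and 12 gives the other two columns; each record counts itself, as in the SQL.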