Rolling Sum Calculation Based on 2 Date Fields - sql

Giving up after a few hours of failed attempts.
My data is in the following format. create_date can never be later than event_date.
I need to calculate, on a rolling n-day basis (say 3), the sum of units where the create_date and event_date both fall within the same 3-day window. The data is illustrative: each event_date can have more than 500 different create_dates associated with it, and that number isn't constant. Some event_dates may be missing.
So let's say for 2022-02-03, I only want to sum units where both the event_date and create_date values were between 2022-02-01 and 2022-02-03.
event_date | create_date | rowid | units
2022-02-01 | 2022-01-20 | 1 | 100
2022-02-01 | 2022-02-01 | 2 | 100
2022-02-02 | 2022-01-21 | 3 | 100
2022-02-02 | 2022-01-23 | 4 | 100
2022-02-02 | 2022-01-31 | 5 | 100
2022-02-02 | 2022-02-02 | 6 | 100
2022-02-03 | 2022-01-30 | 7 | 100
2022-02-03 | 2022-02-01 | 8 | 100
2022-02-03 | 2022-02-03 | 9 | 100
2022-02-05 | 2022-02-01 | 10 | 100
2022-02-05 | 2022-02-03 | 11 | 100
The output I need is below (in brackets I added the rows that should be included in the calculation for each date, but my result only needs to include the numerical sum). I tried calculating using either date, but neither returned the results I needed.
date | units
2022-02-01 | 100 (Row 2)
2022-02-02 | 300 (Row 2,5,6)
2022-02-03 | 400 (Row 2,6,8,9)
2022-02-04 | 200 (Row 6,9)
2022-02-05 | 200 (Row 9,11)
In Python I solved the above with a function that loops through the dates and filters a dataframe for each one, but I am struggling to do the same in SQL.
Thank you!

Consider the approach below (BigQuery syntax):
with events_dates as (
  select date from (
    select min(event_date) min_date, max(event_date) max_date
    from your_table
  ), unnest(generate_date_array(min_date, max_date)) date
)
select date, sum(units) as units, string_agg('' || rowid) rows_included
from events_dates
left join your_table
  on create_date between date - 2 and date
  and event_date between date - 2 and date
group by date
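Applied to the sample data in your question, this returns one row per date between the earliest and latest event_date, with the summed units and the ids of the rows included.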
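As a cross-check of the join logic, here is a minimal Python sketch that applies the same two-sided window to the sample rows from the question. It is not the BigQuery query itself; the function name `rolling_units` and the tuple layout are assumptions made for illustration.

```python
from datetime import date, timedelta

# Sample rows from the question: (event_date, create_date, rowid, units)
rows = [
    (date(2022, 2, 1), date(2022, 1, 20), 1, 100),
    (date(2022, 2, 1), date(2022, 2, 1), 2, 100),
    (date(2022, 2, 2), date(2022, 1, 21), 3, 100),
    (date(2022, 2, 2), date(2022, 1, 23), 4, 100),
    (date(2022, 2, 2), date(2022, 1, 31), 5, 100),
    (date(2022, 2, 2), date(2022, 2, 2), 6, 100),
    (date(2022, 2, 3), date(2022, 1, 30), 7, 100),
    (date(2022, 2, 3), date(2022, 2, 1), 8, 100),
    (date(2022, 2, 3), date(2022, 2, 3), 9, 100),
    (date(2022, 2, 5), date(2022, 2, 1), 10, 100),
    (date(2022, 2, 5), date(2022, 2, 3), 11, 100),
]


def rolling_units(rows, window_days=3):
    """For every calendar day between the min and max event_date, sum the units
    of rows whose event_date AND create_date both fall in the trailing window."""
    first = min(r[0] for r in rows)
    last = max(r[0] for r in rows)
    result = {}
    day = first
    while day <= last:  # plays the role of generate_date_array(min_date, max_date)
        start = day - timedelta(days=window_days - 1)
        included = [r for r in rows if start <= r[0] <= day and start <= r[1] <= day]
        result[day] = (sum(r[3] for r in included), [r[2] for r in included])
        day += timedelta(days=1)
    return result


result = rolling_units(rows)
for day, (units, rowids) in result.items():
    print(day, units, rowids)
```

Because the calendar is generated rather than taken from the data, 2022-02-04 gets a row even though no event has that date. For 2022-02-03 four rows of 100 units qualify (2, 6, 8, 9), so the sum is 400. One difference from the SQL: a day with no qualifying rows yields 0 here, whereas `sum` over an unmatched left join returns NULL.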

Related

SQL - Calculate the average of a value in a table B from date range in table A

I am constructing a table in SQL like this
TABLE A
obj_id start_date end_date
1 2021-03-01 2022-08-02
1 2020-06-01 2021-07-02
2 2021-05-03 2022-08-04
3 2021-04-21 2022-06-05
And I have another table
TABLE B
obj_id date value
1 2021-04-12 21.45
3 2022-06-15 19.02
1 2020-11-02 3.11
2 2022-05-23 45.20
1 2022-07-31 32.45
3 2021-09-01 22.56
2 2021-10-10 34.04
I want to add to TABLE A a column with the average value from TABLE B for the corresponding obj_id, using only the rows where the TABLE B date falls within the TABLE A date range.
Expected result
TABLE A
obj_id start_date end_date average value
1 2021-03-01 2022-08-02 26.95 <-- Average value of 21.45 and 32.45 excluding 3.11 from average because date in table B is outside date range in table A
1 2020-06-01 2021-07-02 etc.
2 2021-05-03 2022-08-04 etc.
3 2021-04-21 2022-06-05 etc.
Sample query:
select
    a.obj_id,
    a.start_date,
    a.end_date,
    avg(b.value) as average
from table_a a
inner join table_b b
    on a.obj_id = b.obj_id
    and b.date >= a.start_date
    and b.date <= a.end_date
group by
    a.obj_id,
    a.start_date,
    a.end_date
order by
    a.obj_id
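To see what the range join computes, here is a small Python sketch over the sample data from the question. The function name `averages` and the tuple layout are assumptions for illustration; rounding to two decimals is only for display.

```python
from datetime import date
from statistics import mean

# (obj_id, start_date, end_date)
table_a = [
    (1, date(2021, 3, 1), date(2022, 8, 2)),
    (1, date(2020, 6, 1), date(2021, 7, 2)),
    (2, date(2021, 5, 3), date(2022, 8, 4)),
    (3, date(2021, 4, 21), date(2022, 6, 5)),
]
# (obj_id, date, value)
table_b = [
    (1, date(2021, 4, 12), 21.45),
    (3, date(2022, 6, 15), 19.02),
    (1, date(2020, 11, 2), 3.11),
    (2, date(2022, 5, 23), 45.20),
    (1, date(2022, 7, 31), 32.45),
    (3, date(2021, 9, 1), 22.56),
    (2, date(2021, 10, 10), 34.04),
]


def averages(table_a, table_b):
    """For each range in table_a, average the table_b values with the same
    obj_id whose date lies inside [start_date, end_date]."""
    out = []
    for obj_id, start, end in table_a:
        values = [v for b_id, d, v in table_b if b_id == obj_id and start <= d <= end]
        # None when nothing matches; the inner join above would drop such a row instead
        out.append((obj_id, start, end, round(mean(values), 2) if values else None))
    return out


for row in averages(table_a, table_b):
    print(row)
```

The first range for obj_id 1 averages 21.45 and 32.45 to 26.95, as in the expected result. Note the one behavioural difference: this sketch keeps ranges without any matching value (closer to a `left join`), while the `inner join` query omits them.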

SQL query for getting data for the last 6 months grouped by month?

I know a basic query to get some results for the last 6 months. Let's say like this:
SELECT *
FROM RANDOM_TABLE
WHERE Date_Column >= DATEADD(MONTH, -6, GETDATE())
But what if I'd like to get results grouped by month - each month looking back 6 months into the past?
The first three rows of a result could ideally look like this (count of IDs is random):
Month_and_year | COUNT(ID)
January 2017 | 120
February 2017 | 160
March 2017 | 240
The last three rows:
Month_and_year | COUNT(ID)
November 2021 | 80
December 2021 | 350
January 2022 | 260
Hope it's understandable.
Thanks in advance!
EDIT:
Over the hours I made a few corrections. Most notably I corrected the self join query to reflect my intentions and also added more details to better explain what is going on.
To my knowledge there are two ways about it (which are probably the same under the hood).
Also, please note that these solutions assume you have a month field already in place. If you have a date or timestamp field, you should take one extra preparation step.
[Addendum] To be more precise, I'd say that the ideal would be to have a date/timestamp field that is truncated/flattened to the first day of the month.
As an example,
month | amount
2021-01-01 | 50
2021-02-01 | 20
2021-03-01 | 10
2021-04-01 | 100
2021-05-01 | 20
2021-06-01 | 40
2021-07-01 | 80
2021-08-01 | 50
The first is to use a "self-non-equi join"
SELECT
    a.month,
    SUM(b.amount) AS amount_over_6_months
FROM table AS a
INNER JOIN table AS b
    ON a.month BETWEEN b.month AND DATEADD(MONTH, 5, b.month)
WHERE a.month >= DATEADD(MONTH, -5, GETDATE())
GROUP BY a.month
What happens here is that you are joining the table with itself. Specifically, for each row in the (a) alias, you will join up to six rows from the (b) alias: the row where the month is equal, plus the rows going back as far as five months prior. So...
a.month | b.month | a.amount | b.amount
2021-01-01 | 2021-01-01 | 50 | 50
2021-02-01 | 2021-01-01 | 20 | 50
2021-02-01 | 2021-02-01 | 20 | 20
2021-03-01 | 2021-01-01 | 10 | 50
2021-03-01 | 2021-02-01 | 10 | 20
2021-03-01 | 2021-03-01 | 10 | 10
2021-04-01 | 2021-01-01 | 100 | 50
2021-04-01 | 2021-02-01 | 100 | 20
2021-04-01 | 2021-03-01 | 100 | 10
2021-04-01 | 2021-04-01 | 100 | 100
2021-05-01 | 2021-01-01 | 20 | 50
2021-05-01 | 2021-02-01 | 20 | 20
2021-05-01 | 2021-03-01 | 20 | 10
2021-05-01 | 2021-04-01 | 20 | 100
2021-05-01 | 2021-05-01 | 20 | 20
2021-06-01 | 2021-01-01 | 40 | 50
2021-06-01 | 2021-02-01 | 40 | 20
2021-06-01 | 2021-03-01 | 40 | 10
2021-06-01 | 2021-04-01 | 40 | 100
2021-06-01 | 2021-05-01 | 40 | 20
2021-06-01 | 2021-06-01 | 40 | 40
2021-07-01 | 2021-02-01 | 80 | 20
2021-07-01 | 2021-03-01 | 80 | 10
2021-07-01 | 2021-04-01 | 80 | 100
2021-07-01 | 2021-05-01 | 80 | 20
2021-07-01 | 2021-06-01 | 80 | 40
2021-07-01 | 2021-07-01 | 80 | 80
... | ... | ... | ...
Then it's just a matter of grouping based on the month in the (a) alias, and summing the amounts coming from the (b) alias.
The advantage of this approach is that it should be vendor and generation agnostic, save for the DATEADD() function.
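The self-join can be mimicked in a few lines of Python to check the sums it produces on the example table. This is only a sketch: `add_months` is a hypothetical stand-in for DATEADD(MONTH, n, ...) that assumes first-of-month dates, and the WHERE filter on GETDATE() is left out so the result does not depend on today's date.

```python
from datetime import date

# The example table: (month, amount), months truncated to the first day
rows = [
    (date(2021, 1, 1), 50), (date(2021, 2, 1), 20), (date(2021, 3, 1), 10),
    (date(2021, 4, 1), 100), (date(2021, 5, 1), 20), (date(2021, 6, 1), 40),
    (date(2021, 7, 1), 80), (date(2021, 8, 1), 50),
]


def add_months(d, n):
    """Shift a first-of-month date by n months (hypothetical DATEADD stand-in)."""
    total = d.year * 12 + (d.month - 1) + n
    return date(total // 12, total % 12 + 1, 1)


def six_month_sums(rows):
    """Pair every row (a) with every row (b), keep the pairs where
    a.month BETWEEN b.month AND b.month + 5 months, then sum b.amount per a.month."""
    sums = {}
    for a_month, _ in rows:
        sums[a_month] = sum(
            b_amount
            for b_month, b_amount in rows
            if b_month <= a_month <= add_months(b_month, 5)
        )
    return sums


sums = six_month_sums(rows)
for month, total in sums.items():
    print(month, total)
```

For 2021-07-01 the pairs are February through July, so January's 50 is no longer counted and the sum is 270.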
The second solution would be to use window functions. I cannot comment on whether this would work with your vendor and the specific version.
SELECT
    month,
    SUM(amount) OVER (ORDER BY month ROWS BETWEEN 5 PRECEDING AND CURRENT ROW)
FROM table
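A rough Python equivalent of that window frame, assuming the amounts are already sorted by month, is a fixed-length buffer that holds the current row and the five before it. The function name `rolling_sum_rows` is made up for this sketch.

```python
from collections import deque


def rolling_sum_rows(amounts, window=6):
    """Mimic SUM(amount) OVER (ORDER BY month ROWS BETWEEN 5 PRECEDING AND CURRENT ROW)
    for a list of amounts that is already ordered by month."""
    buf = deque(maxlen=window)  # the oldest value falls out once 6 rows are held
    out = []
    for amount in amounts:
        buf.append(amount)
        out.append(sum(buf))
    return out


# Amounts for 2021-01 through 2021-08 from the example table
amounts = [50, 20, 10, 100, 20, 40, 80, 50]
print(rolling_sum_rows(amounts))
```

A ROWS frame counts rows, not months. If a month is missing from the table, the frame reaches further back than five months, so this form is only safe when every month has exactly one row.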

How to count users group by time interval

I have a table with a user id and a created_at column of type timestamp. I want to count how many users created their account in each 3-hour interval of a given day. So far I have this query, but I'm not able to get the count for each three-hour interval:
with time_cte AS (
    SELECT time_sample
    from generate_series('2021-12-01'::date, '2021-12-01'::date + interval '1 day', interval '3 hour') as time_sample
)
SELECT time_sample, count(u.id)
FROM time_cte
join users u ON u.created_at::date = '2021-12-01'::date
GROUP BY time_sample;
I am able to get the series and a count, but each count is the total number of users for that day.
The output I got
time_sample count
2021-12-01 00:00:00.000000, 4
2021-12-01 03:00:00.000000, 4
2021-12-01 06:00:00.000000, 4
2021-12-01 09:00:00.000000, 4
2021-12-01 12:00:00.000000, 4
2021-12-01 15:00:00.000000, 4
2021-12-01 18:00:00.000000, 4
2021-12-01 21:00:00.000000, 4
2021-12-02 00:00:00.000000, 4
The output I expect is
time_sample count
2021-12-01 00:00:00.000000, 0
2021-12-01 03:00:00.000000, 0
2021-12-01 06:00:00.000000, 3
2021-12-01 09:00:00.000000, 1
2021-12-01 12:00:00.000000, 0
2021-12-01 15:00:00.000000, 0
2021-12-01 18:00:00.000000, 0
2021-12-01 21:00:00.000000, 0
2021-12-02 00:00:00.000000, 0
For PostgreSQL 14 and later you can use the built-in date_bin function.
select
    date_bin(interval '3 hours', created_at, date_trunc('day', created_at)) as time_slot,
    count(*) as cnt
from users
group by time_slot
order by time_slot;
For PostgreSQL versions before 14 you may use this implementation of date_bin.
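The binning itself is just flooring each timestamp to a multiple of the interval, counted from midnight of the same day. A minimal Python sketch of that idea follows; the function name `bin_3h` and the four created_at values are made up for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def bin_3h(ts, hours=3):
    """Floor a timestamp to the start of its N-hour slot, counted from
    midnight of the same day."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    slot = (ts - midnight) // timedelta(hours=hours)  # whole slots since midnight
    return midnight + slot * timedelta(hours=hours)


# Hypothetical created_at values, not taken from the question
created_at = [
    datetime(2021, 12, 1, 6, 5),
    datetime(2021, 12, 1, 7, 30),
    datetime(2021, 12, 1, 8, 59, 59),
    datetime(2021, 12, 1, 9, 0),
]

counts = Counter(bin_3h(ts) for ts in created_at)
for slot in sorted(counts):
    print(slot, counts[slot])
```

As with grouping by the binned value in SQL, slots without any users do not appear at all. To list empty slots with a count of 0, the binned counts would still need to be left-joined onto a generated series of slot start times.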

How to calculate range in 1 week using Postgres?

tanggal | product
2021-01-01 | bag 1
2021-01-05 | bag 5
2021-01-08 | bag 8
2021-01-11 | bag 11
2021-01-12 | bag 12
2021-01-13 | bag 13
2021-01-14 | bag 14
Here I have a product table (tbl_product) that holds input dates and product names.
I want to count the products per week. What query calculates the data over a range of 7 days?
This is my query:
select tanggal, product from tbl_product
where tanggal > current_date + interval '7' day
You could solve this for arbitrary dates using a generated time series.
For example:
SELECT series::date
FROM generate_series(
(now() - interval '1 week')::date,
now()::date,
'1 day'::interval
) series;
Would result in:
2021-05-26
2021-05-27
2021-05-28
2021-05-29
2021-05-30
2021-05-31
2021-06-01
2021-06-02
which you can join with other tables as you see fit.
For further information on generate_series() and other set-returning functions, check out the documentation.
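The same idea in a short Python sketch: build the list of days first, then match rows against it. The helper name `date_series` is an assumption, and the fixed "today" values are chosen only so the output is reproducible.

```python
from datetime import date, timedelta


def date_series(start, stop, step_days=1):
    """Return every date from start to stop inclusive, like generate_series
    with a '1 day' step."""
    days = []
    current = start
    while current <= stop:
        days.append(current)
        current += timedelta(days=step_days)
    return days


# Fixed "today" matching the sample output above
today = date(2021, 6, 2)
series = date_series(today - timedelta(weeks=1), today)
print(series[0], series[-1], len(series))

# "Joining" with the question's table: keep the rows whose tanggal is in the series.
# The reference day 2021-01-14 is an assumption for this example.
tanggal = [date(2021, 1, d) for d in (1, 5, 8, 11, 12, 13, 14)]
last_week = set(date_series(date(2021, 1, 7), date(2021, 1, 14)))
in_window = [d for d in tanggal if d in last_week]
print(len(in_window))
```

Both endpoints are included, so a one-week span yields eight dates rather than seven.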

How to match the closest date in sql (redshift)?

For example, my table A is work_schedule:
Employee_id | Week_start | Work_schedule
A | 2021-01-03 | Day shift
A | 2021-01-10 | Day shift
A | 2021-01-17 | Night shift
B | 2020-12-27 | Day shift
B | 2021-01-03 | Day shift
Table B is employee_history:
Employee_id | Calendar_date | Tenure
A | 2020-12-20 | 0
A | 2020-12-21 | 1
A | --- | 2-30
A | 2021-01-19 | 31
A | 2021-01-20 | 32
B | 2020-12-15 | 0
B | 2020-12-16 | 1
B | ---
Employees can choose a work schedule 2 weeks ahead, and I want to fetch the tenure at the snapshot date (2 weeks before week_start). For employee A, the date 14 days earlier matches a calendar_date. But employee B started less than 2 weeks before the first week_start, so there is no exact match. In that case I want the closest date to the 2-week date.
The ideal output is:
Employee_id | Week_start | Work_schedule | Calendar_date (to calculate tenure) | Tenure (at 2 weeks ago)
A | 2021-01-03 | Day shift | 2020-12-20 | 0
A | 2021-01-10 | Day shift | 2020-12-27 | 7
A | 2021-01-17 | Night shift | 2021-01-03 | 14
B | 2020-12-27 | Day shift | 2020-12-15 | 0
B | 2021-01-03 | Day shift | 2020-12-20 | 5
To fetch the closest date for one record, I can use
order by abs(datediff(day, (week_start - 14), calendar_date)) asc
limit 1
For example, fetch ‘2020-12-15’ as the closest date to ‘2020-12-13’.
select employee_id, calendar_date, tenure
from employee_history h
where employee_id = 'B'
order by abs(datediff(day, ('2020-12-27'::date - 14), calendar_date)) asc
limit 1
But I have more than one employee in this situation. How can I get the closest calendar_date for all those that cannot find a match for exactly 2 weeks?
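To make the target behaviour concrete, here is a Python sketch of the "order by absolute difference, take the first" idea applied once per schedule row instead of once overall. The history data is hypothetical (one row per day, tenure counted from the first day), and `closest_snapshot` and `daily_history` are names invented for the example.

```python
from datetime import date, timedelta


def daily_history(first_day, n_days):
    """Hypothetical employee_history rows: (calendar_date, tenure)."""
    return [(first_day + timedelta(days=i), i) for i in range(n_days)]


history = {
    "A": daily_history(date(2020, 12, 20), 32),
    "B": daily_history(date(2020, 12, 15), 32),
}

# (employee_id, week_start) pairs from the work_schedule table in the question
schedule = [
    ("A", date(2021, 1, 3)),
    ("A", date(2021, 1, 10)),
    ("B", date(2020, 12, 27)),
    ("B", date(2021, 1, 3)),
]


def closest_snapshot(week_start, history_rows, days_ahead=14):
    """Return the history row whose calendar_date is nearest to
    week_start - days_ahead (an exact match wins with distance 0)."""
    target = week_start - timedelta(days=days_ahead)
    return min(history_rows, key=lambda row: abs((row[0] - target).days))


matches = {
    (emp, week_start): closest_snapshot(week_start, history[emp])
    for emp, week_start in schedule
}
for key, (calendar_date, tenure) in matches.items():
    print(key, calendar_date, tenure)
```

For employee B and week_start 2020-12-27 the target date 2020-12-13 does not exist, so the nearest row, 2020-12-15, is picked. In SQL, one common way to express "the first row per group" is to rank the joined rows with row_number() partitioned by employee_id and week_start, ordered by the same abs(datediff(...)) expression, and keep rank 1; whether that suits the real tables is left as an assumption.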