SQL -- computing end dates from a given start date with arbitrary breaks

I have a table of 'semesters' of variable lengths, with variable breaks in between them, and a constraint such that a 'start_date' is always greater than the previous 'end_date':
id start_date end_date
-----------------------------
1 2012-10-01 2012-12-20
2 2013-01-05 2013-03-28
3 2013-04-05 2013-06-29
4 2013-07-10 2013-09-20
And a table of students as follows, where a start date may occur at any time within a given semester:
id start_date n_weeks
-------------------------
1 2012-11-15 25
2 2013-02-12 8
3 2013-03-02 12
I am attempting to compute an 'end_date' by joining 'students' to 'semesters' in a way that takes into account the variable-length breaks between semesters.
I can draw in the previous semester's end date (i.e. from the previous row's end_date) and, by subtraction, find the number of days in between semesters using the following:
SELECT start_date
, end_date
, lag(end_date) OVER (ORDER BY start_date) AS prev_end_date
, start_date - lag(end_date) OVER (ORDER BY start_date) AS days_break
FROM terms
ORDER BY start_date;
Clearly, if there were only two terms, it would simply be a matter of adding the 'break' in days (perhaps cast to 'weeks') and thereby extending the 'end_date' by that same period of time.
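For concreteness, here's a minimal sketch of that naive two-term case (my own, untested; it assumes the terms and students tables above and handles at most one break, via a LATERAL lookup of the following term):
-- If start_date + 7*n_weeks overruns the enrolment term, push the end date
-- out by the single gap before the next term (naive: handles one break only).
SELECT s.id
     , s.start_date
     , s.start_date + s.n_weeks * 7
       + CASE
           WHEN s.start_date + s.n_weeks * 7 > t.end_date
           THEN COALESCE(nxt.start_date - t.end_date, 0)
           ELSE 0
         END AS end_date
FROM students s
JOIN terms t
  ON s.start_date BETWEEN t.start_date AND t.end_date
LEFT JOIN LATERAL (
    SELECT start_date
    FROM terms
    WHERE start_date > t.end_date
    ORDER BY start_date
    LIMIT 1
) nxt ON true;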
But should 'n_weeks' for a given student span more than one term, how could such a query be structured?
I've been banging my head against a wall for the last couple of days and I'd be immensely grateful for any help anyone is able to offer.
Many thanks.

Rather than just looking at the lengths of semesters or the gaps between them, you could generate a list of all the dates that are within a semester using generate_series(), like this:
SELECT
  row_number() OVER (ORDER BY day) as day_number,
  day
FROM
(
  SELECT
    generate_series(start_date, end_date, '1 day') as day
  FROM
    semesters
) as day_series
ORDER BY
  day;
(SQLFiddle demo)
This assigns each day that is during a semester an arbitrary but sequential "day number", skipping out all the gaps between semesters.
You can then use this as a sub-query/CTE JOINed to your table of students: first find the "day number" of their start date, then add 7 * n_weeks to find the "day number" of their end date, and finally join back to find the actual date for that "day number".
This assumes that there is no special handling needed for partial weeks - i.e. if n_weeks is 4, the student must be enrolled for 28 days that fall within the duration of a semester. The approach could be adapted to measure weeks (pass 1 week as the last argument to generate_series()), with the additional step of finding which week the student's start_date falls into; a sketch of that variant appears after the complete query below.
Here's a complete query (SQLFiddle demo here):
WITH semester_days AS
(
  SELECT
    semester_id,
    row_number() OVER (ORDER BY day_date) as day_number,
    day_date::date
  FROM
  (
    SELECT
      id as semester_id,
      generate_series(start_date, end_date, '1 day') as day_date
    FROM
      semesters
  ) as day_series
)
SELECT
  S.id as student_id,
  S.start_date,
  SD_start.semester_id as start_semester_id,
  S.n_weeks,
  SD_end.day_date as end_date,
  SD_end.semester_id as end_semester_id
FROM
  students as S
JOIN
  semester_days as SD_start
  ON SD_start.day_date = S.start_date
JOIN
  semester_days as SD_end
  ON SD_end.day_number = SD_start.day_number + (7 * S.n_weeks)
ORDER BY
  S.start_date;
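For the week-based variant mentioned above, a minimal sketch (my adaptation, untested) numbers whole weeks instead of days and locates the week containing the student's start_date:
WITH semester_weeks AS
(
  SELECT
    row_number() OVER (ORDER BY week_start) as week_number,
    week_start::date
  FROM
  (
    SELECT
      generate_series(start_date, end_date, '1 week') as week_start
    FROM
      semesters
  ) as week_series
)
SELECT
  S.id as student_id,
  S.start_date,
  WK_end.week_start as end_week_start
FROM
  students as S
JOIN semester_weeks as WK_start
  ON S.start_date >= WK_start.week_start
 AND S.start_date < WK_start.week_start + 7
JOIN semester_weeks as WK_end
  ON WK_end.week_number = WK_start.week_number + S.n_weeks;
-- Note: the last generated week of each semester may be partial; how a
-- partial week should count is a policy decision this sketch ignores.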

Related

SQL - Get historic count of rows collected within a certain period by date

For many years I've been collecting data and I'm interested in knowing the historic counts of IDs that appeared in the last 30 days. The source looks like this
id  dates
--------------
1   2002-01-01
2   2002-01-01
3   2002-01-01
...
3   2023-01-10
If I wanted to know the historic count of ids that appeared in the last 30 days I would do something like this
with total_counter as (
  select id, count(id) counts
  from source
  group by id
),
unique_obs as (
  select id
  from source
  where dates >= DATEADD(Day, -30, current_date)
  group by id
)
select count(distinct id)
from unique_obs
left join total_counter
  on total_counter.id = unique_obs.id;
The problem is that this returns a single result: today's count, as provided by current_date.
I would like to see a table of such counts as if, for example, I had run this analysis yesterday, and the day before, and so on. So the expected result would be something like
counts  date
------------------
1235    2023-01-10
1234    2023-01-09
1265    2023-01-08
...
7383    2022-12-11
So, for example, if the current_date was 2023-01-10, my query would've returned 1235.
If you need a distinct count of ids from the 30 days up to and including each date, the below should work:
WITH CTE_DATES
AS
(
--Create a list of anchor dates
SELECT DISTINCT
dates
FROM source
)
SELECT COUNT(DISTINCT s.id) AS "counts"
,D.dates AS "date"
FROM CTE_DATES D
LEFT JOIN source S ON S.dates BETWEEN DATEADD(DAY,-29,D.dates) AND D.dates --30 DAYS INCLUSIVE
GROUP BY D.dates
ORDER BY D.dates DESC
;
If the distinct count didn't matter, you could likely simplify with a rolling sum, only hitting the source table once:
SELECT S.dates AS "date"
,COUNT(1) AS "count_daily"
,SUM("count_daily") OVER(ORDER BY S.dates DESC ROWS BETWEEN CURRENT ROW AND 29 FOLLOWING) AS "count_rolling" --assumes there is at least one row for every day.
FROM source S
GROUP BY S.dates
ORDER BY S.dates DESC;
This won't work, though, if you have gaps in your list of dates, as it'll just include the latest 30 rows available. In that case the first example, without DISTINCT in the count, will do the trick.
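Alternatively, a RANGE-based frame tolerates gaps because the window is defined over dates rather than rows. A minimal sketch (PostgreSQL 11+ syntax shown for illustration; support for interval offsets in RANGE frames varies by engine):
-- Gap-tolerant rolling sum: the frame covers the prior 29 days plus the
-- current day (30 days inclusive), regardless of missing dates.
SELECT S.dates AS "date"
      ,COUNT(1) AS "count_daily"
      ,SUM(COUNT(1)) OVER (ORDER BY S.dates
                           RANGE BETWEEN INTERVAL '29 days' PRECEDING
                                     AND CURRENT ROW) AS "count_rolling"
FROM source S
GROUP BY S.dates
ORDER BY S.dates DESC;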
SELECT count(*) AS Counts
      ,dates AS Date
FROM source
WHERE dates >= DATEADD(DAY, -30, CURRENT_DATE)
GROUP BY dates
ORDER BY dates DESC

R's ceiling_date equivalent in SQL

I want to implement R's ceiling_date function in SQL (PostgreSQL).
So I have dates in a column, one for every day, with corresponding sales, and I want to accumulate the sales for a week onto a single date (say Friday).
Input format: (image omitted; the dates in yellow are the dates to aggregate sales on)
Expected output format: (image omitted)
This can easily be done in R using ceiling_date but I want to do it in SQL itself.
Any help would be appreciated. Thanks.
Accepting and processing the ISO 8601 standard is by far the easiest way to handle date ranges. But this imposes a standard definition, which is essentially:
All weeks consist of exactly 7 days.
All weeks begin on Monday.
The first week of the year is the week that contains 4-Jan.
The date_trunc function gives the first date of the week; adding 6 gives the last day of the week.
-- ISO 8601 Week definition
select (date_trunc('week',dte)::date +6) "Week Ending"
, sum(sales) "Total Sales"
from test
group by (date_trunc('week',dte)::date +6)
order by (date_trunc('week',dte)::date +6);
Date/week processing for non-ISO-8601 weeks is a somewhat tricky process of establishing the appropriate week definition. The following does so for a week-ending-Friday definition. It creates date ranges for a year beginning with the first Friday in the table, then joins using the range-containment operator to determine the appropriate summation period:
with periods (wk) as
( select daterange( ((min_dt + (n-1) * interval '1 week'))::date
, ((min_dt + (n) * interval '1 week'))::date
, '(]'
)
from (select min(dte) min_dt
from test
where extract(dow from dte) = 5 --- Day_Of_Week (5) = Friday
) s
cross join generate_series(0,52) gs(n)
) --select * from periods;
select upper(wk)-1 "Week Ending"
, sum(sales) "Total Sales"
from periods
join test
on (dte <@ wk)
group by upper(wk)-1
order by upper(wk)-1;
See demo of both here.
NOTE: The demo changes the sample dates from January (2022-01-01 ...) to May (2022-05-01 ...), as 6-January-2022 was a Thursday, not a Friday as the description implies; 6-May-2022, however, is a Friday. Also, the sum of values ending 6-May is 38 (not 42 as indicated). Finally, neither query applies a limiting date, but processes through to the end of the data; nor does either address multiple years of data.
demo
idea: for 2022-January-1 to 2022-January-20, there are 3 Fridays: '2022-01-07', '2022-01-14', '2022-01-21'.
We need to partition by these 3 Fridays, ordered by sales date.
The problem is then to compute which of these 3 Fridays each date belongs to:
get the Friday each sales_date belongs to;
deal with the special cases (the days of the week after Friday: Saturday, Sunday) -- when sales_date > friday, the real Friday is the next Friday.
final code:
SELECT
*,
sum(amount) OVER (PARTITION BY sales.compute_friday ORDER BY sales_date)
FROM
sales;
processing code:
BEGIN;
CREATE TABLE sales (
sales_date date
, amount numeric
);
INSERT INTO sales (sales_date , amount)
SELECT
i
, (random() * 10)::integer
FROM
generate_series('2022-01-01'::timestamp , '2022-01-20'::timestamp , interval '1 day') g (i);
ALTER TABLE sales
ADD COLUMN friday date;
UPDATE
sales
SET
friday = (date_trunc('week' , sales_date) + interval '4 day')::date;
ALTER TABLE sales
ADD COLUMN compute_friday date;
UPDATE
sales
SET
compute_friday = CASE WHEN sales_date > friday THEN
(friday + interval '7 days')::date
ELSE
friday
END;
COMMIT;
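The same mapping can also be computed inline, without the ALTER TABLE / UPDATE steps; a minimal single-query sketch of the same ceiling-to-Friday logic:
-- Inline version: derive week_ending (the Friday each date rolls up to)
-- in a subquery, then accumulate sales within each week.
SELECT sales_date
     , amount
     , sum(amount) OVER (PARTITION BY week_ending ORDER BY sales_date)
FROM (
    SELECT *
         , CASE WHEN sales_date > (date_trunc('week', sales_date) + interval '4 day')::date
                THEN (date_trunc('week', sales_date) + interval '11 day')::date
                ELSE (date_trunc('week', sales_date) + interval '4 day')::date
           END AS week_ending
    FROM sales
) s;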

prestosql get average from last 7 days for each day

The question I have is very similar to the question here, but I am using Presto SQL (on AWS Athena) and couldn't find information on loops in Presto.
To reiterate the issue, I want the query that:
Given a table that contains: Day, Number of Items for this Day
I want: Day, Average Items for the Last 7 Days before "Day"
So if I have a table that has data from Dec 25th to Jan 25th, my output table should have data from Jan 1st to Jan 25th, and for each day from Jan 1st-25th it will be the average number of items from the last 7 days.
Is it possible to do this with Presto?
Maybe you can try this one. The calendar common table expression (CTE) is used to generate the dates between the two ends of the range:
with calendar as (
select date_generated
from (
values (sequence(date'2021-12-25', date'2022-01-25', interval '1' day))
) as t1(date_array)
cross join unnest(date_array) as t2(date_generated)),
The temp CTE is basically used to make date groups, where each date group contains the last 7 days for that date:
temp as (select c1.date_generated as date_groups
, format_datetime(c2.date_generated, 'yyyy-MM-dd') as dates
from calendar c1, calendar c2
where c2.date_generated between c1.date_generated - interval '6' day and c1.date_generated
and c1.date_generated >= date'2021-12-25' + interval '6' day)
Output for this part:
date_groups  dates
------------------------
2022-01-01   2021-12-26
2022-01-01   2021-12-27
2022-01-01   2021-12-28
2022-01-01   2021-12-29
2022-01-01   2021-12-30
2022-01-01   2021-12-31
2022-01-01   2022-01-01
The last part joins the day column from your table with each date and then groups by the date group:
select temp.date_groups as day
, avg(your_table.num_of_items) avg_last_7_days
from your_table
join temp on your_table.day = temp.dates
group by 1
You want a running average (AVG OVER)
select
day, amount,
avg(amount) over (order by day rows between 6 preceding and current row) as avg_amount
from mytable
order by day
offset 6;
I tried many different variations of getting the "running average" (which I now know is what I was looking for, thanks to Thorsten's answer), but couldn't get exactly the output I wanted with the other columns in my table (which weren't included in my original question). This ended up working:
SELECT day, <other columns>, avg(amount) OVER (
PARTITION BY <other columns>
ORDER BY date(day) ASC
ROWS 6 PRECEDING) as avg_7_days_amount FROM table ORDER BY date(day) ASC

Data value on a given date

This time I have a table in a PostgreSQL database that contains the employee name, the date they started working, and the date they left the company; in cases where the employee still remains at the company, this field has a null value.
Knowing this, I would like to know how many people were working on a predetermined date, e.g.:
I would like to know how many people worked at the company in January 2021.
I don't know where to start; in some attempts I got the number of hires and layoffs per month, but I need to show the accumulated value per month in another column.
I hope I made myself understood; I'll leave the last SQL I tried here.
select reference, sum(hires) from
(
select
date_trunc('month', date_hires) as reference,
count(*) as hires
from
ponto_mais_relatorio_colaboradores
group by
date_hires
union all
select
date_trunc('month', date_layoff) as reference,
count(*)*-1 as layoffs
from
ponto_mais_relatorio_colaboradores
group by
date_layoff
) as reference
join calendar_aux on calendar_aux.ano_mes = reference
group by reference
order by reference
Break the requirement down. The question: how many are employed on any given date? That includes everyone hired before that date who has no layoff date, plus everyone hired before that date whose layoff date is later than the date in question. I.e., if you are interested in January, you still want to count an employee with a layoff date in February. With that in place, convert it into SQL; the above is available from a SELECT comparing dates. The other issue is that January is not a date, it is a range of dates, so you need each date. You can use generate_series to create each day in January, then join the generated dates with a selection from your table. Resulting query:
with jan_dates( jdate ) as
( select generate_series( date '2021-01-01'
, date '2021-01-31'
, interval '1' day
)::date
)
select jdate "Date", count(*) "Employees"
from jan_dates j
join employees e
on ( e.date_hires <= j.jdate
and ( e.date_layoff is null
or e.date_layoff > j.jdate
)
)
group by j.jdate
order by j.jdate;
Note: Not tested.
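If you want one row per month rather than per day (the accumulated value per month the question asks for), the same pattern works with a monthly series; a sketch (equally untested, keeping the hypothetical employees table from above) counting the headcount on the first day of each month:
with month_starts( mdate ) as
 ( select generate_series( date '2021-01-01'
                         , date '2021-12-01'
                         , interval '1 month'
                         )::date
 )
select mdate "Month Start", count(*) "Employees"
from month_starts m
join employees e
  on ( e.date_hires <= m.mdate
   and ( e.date_layoff is null
      or e.date_layoff > m.mdate
       )
     )
group by m.mdate
order by m.mdate;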

7-day user count: Big-Query self-join to get date range and count?

My Google Firebase event data is integrated with BigQuery and I'm trying to fetch from it one of the pieces of info that Firebase gives me automatically: the 1-day, 7-day, and 28-day user counts.
1-day count is quite straightforward
SELECT
"1-day" as period,
events.event_date,
count(distinct events.user_pseudo_id) as uid
FROM
`your_path.events_*` as events
WHERE events.event_name = "session_start"
group by events.event_date
with a neat result like
period event_date uid
1-day 20190609 5
1-day 20190610 7
1-day 20190611 5
1-day 20190612 7
1-day 20190613 37
1-day 20190614 73
1-day 20190615 52
1-day 20190616 36
But to me it gets complicated when I try to count, for each day, how many unique users I had in the previous 7 days.
From the above query, I know my target value for day 20190616 will be 142, by filtering to 7 days and removing the GROUP BY condition; something like:
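#one-off sketch (untested) of that check: 7 days inclusive, ending 20190616
SELECT COUNT(DISTINCT events.user_pseudo_id) AS uid
FROM `your_path.events_*` AS events
WHERE events.event_name = "session_start"
  AND PARSE_DATE("%Y%m%d", events.event_date)
      BETWEEN DATE_SUB(DATE "2019-06-16", INTERVAL 6 DAY) AND DATE "2019-06-16";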
The solution I tried is a direct self-join (and variations that didn't change the result):
SELECT
"7-day" as period,
events.event_date,
count(distinct user_events.user_pseudo_id) as uid
FROM
`your_path.events_*` as events,
`your_path.events_*` as user_events
WHERE user_events.event_name = "session_start"
and PARSE_DATE("%Y%m%d", events.event_date) between DATE_SUB(PARSE_DATE("%Y%m%d", user_events.event_date), INTERVAL 7 DAY) and PARSE_DATE("%Y%m%d", user_events.event_date) #one day in the first table should correspond to 7 days worth of events in the second
and events.event_date = "20190616" #fixed date to check
group by events.event_date
Now, I know I'm barely setting any join conditions, but if anything I expected that to produce cross joins and huge results. Instead, the count this way is 70, which is a lot lower than expected. Furthermore, I can set INTERVAL 2 DAY and the result does not change.
I'm clearly doing something very wrong here, but I also thought that the way I'm doing it is very rudimentary, and there must be a smarter way to accomplish this.
I have checked Calculating a current day 7 day active user with BigQuery? but the explicit cross join there is with event_dim, whose definition I'm unsure about.
I checked the solution provided at Rolling 90 days active users in BigQuery, improving preformance (DAU/MAU/WAU), as suggested in a comment.
The solution seemed sound at first, but it has problems for the more recent days. Here's the query using COUNT(DISTINCT) that I adapted to my case:
SELECT DATE_SUB(event_date, INTERVAL i DAY) date_grp
, COUNT(DISTINCT user_pseudo_id) unique_90_day_users
, COUNT(DISTINCT IF(i<29,user_pseudo_id,null)) unique_28_day_users
, COUNT(DISTINCT IF(i<8,user_pseudo_id,null)) unique_7_day_users
, COUNT(DISTINCT IF(i<2,user_pseudo_id,null)) unique_1_day_users
FROM (
SELECT PARSE_DATE("%Y%m%d",event_date) as event_date, user_pseudo_id
FROM `your_path_here.events_*`
WHERE EXTRACT(YEAR FROM PARSE_DATE("%Y%m%d",event_date))=2019
GROUP BY 1, 2
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
ORDER BY date_grp
And here is the result for the latest days (consider that data starts 23rd May), where you can see that the result is wrong:
row_num date_grp 90-day 28-day 7-day 1-day
114 2019-06-16 273 273 273 210
115 2019-06-17 78 78 78 78
So on the last day, the counts for 90-day, 28-day, and 7-day are only considering that very same day instead of all the days before. It's not possible for the 90-day count on the 17th of June to be 78 if the 1-day count on the 16th of June was higher.
This is an answer to my own question. My means are rudimentary, as I'm not extremely familiar with BQ shortcuts and some advanced functions, but the result is still correct. I hope others will be able to contribute better queries.
#standardSQL
WITH dates AS (
SELECT i as event_date
FROM UNNEST(GENERATE_DATE_ARRAY('2019-05-24', CURRENT_DATE(), INTERVAL 1 DAY)) i
)
, ptd_dates as (
SELECT DISTINCT "90-day" as day_category, FORMAT_DATE("%Y%m%d",event_date) AS event_date, FORMAT_DATE("%Y%m%d",DATE_SUB(event_date, INTERVAL i-1 DAY)) as ptd_date
FROM dates,
UNNEST(GENERATE_ARRAY(1, 90)) i
UNION ALL
SELECT distinct "28-day" as day_category, FORMAT_DATE("%Y%m%d",event_date) AS event_date, FORMAT_DATE("%Y%m%d",DATE_SUB(event_date, INTERVAL i-1 DAY)) as ptd_date
FROM dates,
UNNEST(GENERATE_ARRAY(1, 29)) i
UNION ALL
SELECT distinct "7-day" as day_category, FORMAT_DATE("%Y%m%d",event_date) AS event_date, FORMAT_DATE("%Y%m%d",DATE_SUB(event_date, INTERVAL i-1 DAY)) as ptd_date
FROM dates,
UNNEST(GENERATE_ARRAY(1, 7)) i
UNION ALL
SELECT distinct "1-day" as day_category, FORMAT_DATE("%Y%m%d",event_date) AS event_date, FORMAT_DATE("%Y%m%d",event_date) as ptd_date
FROM dates
)
SELECT event_date,
sum(IF(day_category="90-day",unique_ptd_users,null)) as count_90_day ,
sum(IF(day_category="28-day",unique_ptd_users,null)) as count_28_day,
sum(IF(day_category="7-day",unique_ptd_users,null)) as count_7_day,
sum(IF(day_category="1-day",unique_ptd_users,null)) as count_1_day
from (
SELECT ptd_dates.day_category
, ptd_dates.event_date
, COUNT(DISTINCT user_pseudo_id) unique_ptd_users
FROM ptd_dates,
`your_path_here.events_*` events,
unnest(events.event_params) e_params
WHERE ptd_dates.ptd_date = events.event_date
GROUP BY ptd_dates.day_category
, ptd_dates.event_date)
group by event_date
order by 1,2,3
As per the suggestion from ECris, I first defined a calendar table to use: it contains the 4 categories of PTDs (periods to date). Each is generated from basic elements; this should scale linearly, as it's not querying the event dataset, and therefore it has no gaps.
Then the join is made with events, where the join condition shows how, for each date, I'm counting distinct users across all related days in the period.
The results are correct.