I want to split something like this:
Value | Startdate | Enddate
XXXX | 2.July | 16 August
Into this:
Value | Startdate | Enddate
XXXX | 2.July | 31 July
XXXX | 1.August | 16 August
The value is not important for now.
If I understand correctly, you want to split your range into different months. A convenient method uses generate_series():
select value, greatest(startdate, gs.mon), least(enddate, gs.mon + interval '1 month - 1 day')
from t cross join lateral
generate_series(date_trunc('month', startdate), date_trunc('month', enddate), interval '1 month'
) gs(mon)
Here is a db<>fiddle.
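To sanity-check what the query should return, the same month-by-month clamping can be sketched outside the database. This is only an illustration in Python, not part of the answer above: the function name split_by_month is hypothetical, the year 2021 is an assumption (the question gives no year), and the range is assumed to satisfy start <= end.

```python
from datetime import date, timedelta

def split_by_month(start: date, end: date) -> list[tuple[date, date]]:
    """Split the inclusive range [start, end] into one piece per calendar month."""
    pieces = []
    month_start = start.replace(day=1)  # plays the role of date_trunc('month', startdate)
    while month_start <= end:
        # First day of the following month.
        if month_start.month == 12:
            next_month = date(month_start.year + 1, 1, 1)
        else:
            next_month = date(month_start.year, month_start.month + 1, 1)
        month_end = next_month - timedelta(days=1)  # like mon + interval '1 month - 1 day'
        # Clamp the month to the original range, as greatest()/least() do in the query.
        pieces.append((max(start, month_start), min(end, month_end)))
        month_start = next_month
    return pieces

# The year is assumed; the question only says "2 July" to "16 August".
segments = split_by_month(date(2021, 7, 2), date(2021, 8, 16))
for seg_start, seg_end in segments:
    print(seg_start, seg_end)
```

Each loop iteration corresponds to one row produced by generate_series() in the SQL version.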
I'm having a hard time creating statistics with the number of ongoing subscriptions per month.
I have a table subscriptions:
id | created_at | cancelled_at
----------------------------------------
1 | 2020-12-29 13:56:12 | null
2 | 2021-02-15 01:06:25 | 2021-04-21 19:35:31
3 | 2021-03-22 02:42:19 | null
4 | 2021-04-21 19:35:31 | null
and statistics should look as follows:
month | count
---------------
12/2020 | 1 -- #1
01/2021 | 1 -- #1
02/2021 | 2 -- #1 + #2
03/2021 | 3 -- #1 + #2 + #3
04/2021 | 3 -- #1 + #3 + #4, not #2 since it ends that month
05/2021 | 3 -- #1 + #3 + #4
so far i was able to make list of all months i need the stats for:
select generate_series(min, max, '1 month') as "month"
from (
select date_trunc('month', min(created_at)) as min,
now() as max
from subscriptions
) months;
and get the right number of subscriptions for specific month
select sum(
case
when
make_date(2021, 04, 1) >= date_trunc('month', created_at)
and make_date(2021, 04, 1) < date_trunc('month', coalesce(cancelled_at, now() + interval '1 month'))
then 1
else 0
end
) as total
from subscriptions
-- returns 3
but I am struggling to combine the two. Would OVER (which I am inexperienced with) be of any use here? I found "Count cumulative total in Postgresql", but that is a different case (the dates are fixed). Or is the proper approach to use a function with FOR somehow?
You can use generate_series() to generate the months and then a correlated subquery to calculate the actives:
select yyyymm,
(select count(*)
from subscriptions s
where s.created_at < gs.yyyymm + interval '1 month' and
(s.cancelled_at > gs.yyyymm + interval '1 month' or s.cancelled_at is null)
) as count
from generate_series('2020-12-01'::date, '2021-05-01'::date, interval '1 month'
) gs(yyyymm);
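The condition in the correlated subquery can be checked against the sample rows from the question. The sketch below is an illustrative Python re-implementation of the same filter, not something from the answer itself; the names next_month and active_counts are hypothetical.

```python
from datetime import datetime

# Sample rows from the question: (id, created_at, cancelled_at)
subscriptions = [
    (1, datetime(2020, 12, 29, 13, 56, 12), None),
    (2, datetime(2021, 2, 15, 1, 6, 25), datetime(2021, 4, 21, 19, 35, 31)),
    (3, datetime(2021, 3, 22, 2, 42, 19), None),
    (4, datetime(2021, 4, 21, 19, 35, 31), None),
]

def next_month(d: datetime) -> datetime:
    """First instant of the month after d (d is assumed to be a month start)."""
    if d.month == 12:
        return datetime(d.year + 1, 1, 1)
    return datetime(d.year, d.month + 1, 1)

def active_counts(first: datetime, last: datetime) -> dict[str, int]:
    """Count subscriptions created before the month ends and not cancelled by then."""
    counts = {}
    month = first
    while month <= last:
        month_end = next_month(month)  # gs.yyyymm + interval '1 month'
        counts[month.strftime("%m/%Y")] = sum(
            1
            for _id, created, cancelled in subscriptions
            if created < month_end and (cancelled is None or cancelled > month_end)
        )
        month = month_end
    return counts

counts = active_counts(datetime(2020, 12, 1), datetime(2021, 5, 1))
for label, count in counts.items():
    print(label, count)
```

Subscription #2 drops out in 04/2021 because its cancelled_at falls before the end of that month, which matches the expected table in the question.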
I have a database with a tbl_registration with rows that look like
ID | start_date_time | end_date_time | ...
1 | 2021-01-01 14:00:15 | 2021-01-01 14:00:15
2 | 2021-02-01 14:00:15 | null
4 | 2021-05-15 14:00:15 | 2024-01-01 14:00:15
5 | 2019--15 14:00:15 | 2024-01-01 14:00:15
end_date_time can be null.
The table contains 500,000 to 1,000,000 records.
We want to create an overview of year grouped by month that shows the amount of records that are active in that month. So a registration is counted per month if it lies (partially) in that month based on start and end date.
I can do a query per month like this
select count(r.id)
from tbl_registration r
where
(r.end_date_time >= to_timestamp('01/01/2021 00:00:00', 'DD/MM/YYYY HH24:MI:SS') or r.end_date_time is null )
and r.start_date_time < to_timestamp('01/02/2021 00:00:00', 'DD/MM/YYYY HH24:MI:SS');
But that forces me to repeat this query 12 times.
I don't see a creative way to solve this in one query that would give me as a result 12 rows, one for each month
I've been looking at the generate_series function, but I don't see how I can group on the comparison of those start- and end dates
Postgres supports generate_series(), so generate the dates you want and then construct the query. One method is:
select gs.mon, x.cnt
from generate_series('2021-01-01'::date, '2021-12-01'::date, interval '1 month') gs(mon) left join lateral
(select count(*) as cnt
from tbl_registration r
where (r.end_date_time >= gs.mon or r.end_date_time is null) and
r.start_date_time < gs.mon + interval '1 month'
) x
on 1=1;
Given a table as such:
# SELECT * FROM payments ORDER BY payment_date DESC;
id | payment_type_id | payment_date | amount
----+-----------------+--------------+---------
4 | 1 | 2019-11-18 | 300.00
3 | 1 | 2019-11-17 | 1000.00
2 | 1 | 2019-11-16 | 250.00
1 | 1 | 2019-11-15 | 300.00
14 | 1 | 2019-10-18 | 130.00
13 | 1 | 2019-10-18 | 100.00
15 | 1 | 2019-09-18 | 1300.00
16 | 1 | 2019-09-17 | 1300.00
17 | 1 | 2019-09-01 | 400.00
18 | 1 | 2019-08-25 | 400.00
(10 rows)
How can I SUM the amount column based on an arbitrary date range, not simply a date truncation?
Taking the example of a date range beginning on the 15th of a month, and ending on the 14th of the following month, the output I would expect to see is:
payment_type_id | payment_date | amount
-----------------+--------------+---------
1 | 2019-11-15 | 1850.00
1 | 2019-10-15 | 230.00
1 | 2019-09-15 | 2600.00
1 | 2019-08-15 | 800.00
Can this be done in SQL, or is this something that's better handled in code? I would traditionally do this in code, but I'm looking to extend my knowledge of SQL (which at this stage isn't much!)
Click demo:db<>fiddle
You can use a combination of the CASE clause and the date_trunc() function:
SELECT
payment_type_id,
CASE
WHEN date_part('day', payment_date) < 15 THEN
date_trunc('month', payment_date) + interval '-1month 14 days'
ELSE date_trunc('month', payment_date) + interval '14 days'
END AS payment_date,
SUM(amount) AS amount
FROM
payments
GROUP BY 1,2
date_part('day', ...) returns the day of the month.
The CASE clause separates dates before the 15th of the month from dates on or after it.
date_trunc('month', ...) converts every date in a month to the first day of that month.
So, if a date is before the 15th of its month, it is grouped to the 15th of the previous month (this is what + interval '-1month 14 days' calculates: +14 because date_trunc() truncates to the 1st of the month, and 1 + 14 = 15). Otherwise it is grouped to the 15th of the current month.
After calculating these payment_days, you can use them for simple grouping.
I would simply subtract 14 days, truncate the month, and add 14 days back:
select payment_type_id,
date_trunc('month', payment_date - interval '14 day') + interval '14 day' as month_15,
sum(amount)
from payments
group by payment_type_id, month_15
order by payment_type_id, month_15;
No conditional logic is actually needed for this.
Here is a db<>fiddle.
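The shift-truncate-shift trick can be verified against the sample payments from the question. The following is a minimal Python sketch of the same arithmetic, offered only as a cross-check; period_start is a hypothetical helper name, and all rows are assumed to share payment_type_id 1 as in the sample data.

```python
from collections import defaultdict
from datetime import date, timedelta
from decimal import Decimal

# (payment_date, amount) rows from the question; all have payment_type_id = 1
payments = [
    (date(2019, 11, 18), "300.00"), (date(2019, 11, 17), "1000.00"),
    (date(2019, 11, 16), "250.00"), (date(2019, 11, 15), "300.00"),
    (date(2019, 10, 18), "130.00"), (date(2019, 10, 18), "100.00"),
    (date(2019, 9, 18), "1300.00"), (date(2019, 9, 17), "1300.00"),
    (date(2019, 9, 1), "400.00"), (date(2019, 8, 25), "400.00"),
]

def period_start(d: date) -> date:
    """Subtract 14 days, truncate to the 1st of the month, add 14 days back."""
    shifted = d - timedelta(days=14)
    return shifted.replace(day=1) + timedelta(days=14)

totals = defaultdict(Decimal)
for payment_date, amount in payments:
    totals[period_start(payment_date)] += Decimal(amount)

for start in sorted(totals, reverse=True):
    print(start, totals[start])
```

A date on or after the 15th stays in its own month after the 14-day shift, while an earlier date lands in the previous month, so no CASE branch is needed.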
You can use the generate_series() function and make an inner join comparing month and year, like this:
SELECT specific_date_on_month, SUM(amount)
FROM (SELECT generate_series('2015-01-15'::date, '2015-12-15'::date, '1 month'::interval) AS specific_date_on_month) AS months
INNER JOIN payments
ON (TO_CHAR(payment_date, 'yyyymm')=TO_CHAR(specific_date_on_month, 'yyyymm'))
GROUP BY specific_date_on_month;
The generate_series(<begin>, <end>, <interval>) function generates a series from begin to end in steps of the given interval.
Given a table with rows like:
+----+-------------------------+------------------------+
| ID | StartDate | EndDate |
+----+-------------------------+------------------------+
| 1 | 2016-02-05 20:00:00.000 | 2016-02-07 5:00:00.000 |
+----+-------------------------+------------------------+
I want to produce a table like this:
+----+------------+----------+
| ID | Date | Duration |
+----+------------+----------+
| 1 | 2016-02-05 | 4 |
| 1 | 2016-02-06 | 24 |
| 1 | 2016-02-07 | 5 |
+----+------------+----------+
This is an interview-style question. I am wondering how I can go about tackling this. Is it possible to do this with just standard SQL query syntax? Or is a procedural language like pl/pgSQL required to do a query like this?
The basic idea is this:
SELECT date_trunc('day', dayhour) as dd,count(*)
FROM (VALUES (1, '2016-02-05 20:00:00.000'::timestamp, '2016-02-07 5:00:00.000'::timestamp)
) v(ID, StartDate, EndDate), lateral
generate_series(StartDate, EndDate, interval '1 hour') g(dayhour)
GROUP BY dd
ORDER BY dd;
That adds an extra hour, so this is more accurate:
SELECT date_trunc('day', dayhour) as dd,count(*)
FROM (VALUES (1, '2016-02-05 20:00:00.000'::timestamp, '2016-02-07 5:00:00.000'::timestamp)
) v(ID, StartDate, EndDate), lateral
generate_series(StartDate, EndDate - interval '1 hour', interval '1 hour') g(dayhour)
GROUP BY dd
ORDER BY dd;
Technically, the lateral is not needed (and in that case, I would replace the comma with cross join). However, this is an example of a lateral join, so being explicit is good.
I should also note that the above is the simplest method. However, the group by does slow down the query. There are other methods that don't require generating a series for every hour.
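One way to avoid generating a row per hour is to walk day by day and compute the overlap of each day with the interval directly. The sketch below shows that idea in Python. It is an assumption about what such an alternative could look like, not the method the answer has in mind, and hours_per_day is a hypothetical name.

```python
from datetime import datetime, timedelta

def hours_per_day(start: datetime, end: datetime) -> list[tuple]:
    """Return (date, hours) pairs, one per calendar day touched by [start, end)."""
    rows = []
    day_start = start.replace(hour=0, minute=0, second=0, microsecond=0)
    while day_start < end:
        day_end = day_start + timedelta(days=1)
        # Overlap of the interval with this calendar day.
        overlap = min(end, day_end) - max(start, day_start)
        rows.append((day_start.date(), overlap / timedelta(hours=1)))
        day_start = day_end
    return rows

rows = hours_per_day(datetime(2016, 2, 5, 20), datetime(2016, 2, 7, 5))
for day, hours in rows:
    print(day, hours)
```

This produces one row per day rather than one per hour, so no grouping step is needed, and partial hours are kept as fractions instead of being rounded by a count.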
Is there any convenient way to populate a table with all dates in a given range in Google BigQuery? What I need are all dates from 2015-06-01 till CURRENT_DATE(), so something like this:
+------------+
| date |
+------------+
| 2015-06-01 |
| 2015-06-02 |
| 2015-06-03 |
| ... |
| 2016-07-11 |
+------------+
Optimally, the next step would be to also get all weeks between the two dates, i.e.:
+---------+
| week |
+---------+
| 2015-23 |
| 2015-24 |
| 2015-25 |
| ... |
| 2016-28 |
+---------+
I've been fiddling around with the following answers I found, but I can't get them to work, mostly because core functions aren't supported and I can't find proper ways to replace them.
Easiest way to populate a temp table with dates between and including 2 date parameters
Generate Dates between date ranges
Your help is very much appreciated!
Best,
Max
Mikhail's answer works for BigQuery's legacy sql syntax perfectly. This solution is a slightly easier one if you're using the standard SQL syntax.
BigQuery standard SQL syntax actually has a built in function, GENERATE_DATE_ARRAY for creating an array from a date range. It takes a start date, end date and INTERVAL. For example:
SELECT day
FROM UNNEST(
GENERATE_DATE_ARRAY(DATE('2015-06-01'), CURRENT_DATE(), INTERVAL 1 DAY)
) AS day
If you wanted the week and year you could use
SELECT EXTRACT(YEAR FROM day), EXTRACT(WEEK FROM day)
FROM UNNEST(
GENERATE_DATE_ARRAY(DATE('2015-06-01'), CURRENT_DATE(), INTERVAL 1 WEEK)
) AS day
all dates from 2015-06-01 till CURRENT_DATE()
SELECT DATE(DATE_ADD(TIMESTAMP("2015-06-01"), pos - 1, "DAY")) AS DAY
FROM (
SELECT ROW_NUMBER() OVER() AS pos, *
FROM (FLATTEN((
SELECT SPLIT(RPAD('', 1 + DATEDIFF(TIMESTAMP(CURRENT_DATE()), TIMESTAMP("2015-06-01")), '.'),'') AS h
FROM (SELECT NULL)),h
)))
all weeks between the two dates
SELECT YEAR(DAY) AS y, WEEK(DAY) AS w
FROM (
SELECT DATE(DATE_ADD(TIMESTAMP("2015-06-01"), pos - 1, "DAY")) AS DAY
FROM (
SELECT ROW_NUMBER() OVER() AS pos, *
FROM (FLATTEN((
SELECT SPLIT(RPAD('', 1 + DATEDIFF(TIMESTAMP(CURRENT_DATE()), TIMESTAMP("2015-06-01")), '.'),'') AS h
FROM (SELECT NULL)),h
)))
)
GROUP BY y, w