Sum results on constant timeframe range on each date in table - sql

I'm using PostgreSQL.
I have a table that contains test names, their results and reported time:
|test_name|result |report_time|
| A |error |29/11/2020 |
| A |failure|28/12/2020 |
| A |error |29/12/2020 |
| B |passed |30/12/2020 |
| C |failure|31/12/2020 |
| A |error |31/12/2020 |
I'd like to sum how many tests have failed or errored in the last 30 days, per date (and limit it to be 5 days back from the current date), so the final result will be:
| date | sum | (notes)
| 29/11/2020 | 1 | 1 failed/errored test in range (29/11 -> 29/10)
| 28/12/2020 | 2 | 2 failed/errored tests in range (28/12 -> 28/11)
| 29/12/2020 | 3 | 3 failed/errored tests in range (29/12 -> 29/11)
| 30/12/2020 | 2 | 2 failed/errored tests in range (30/12 -> 30/11)
| 31/12/2020 | 4 | 4 failed/errored tests in range (31/12 -> 01/12)
I know how to sum the results per date (i.e, how many failures/errors were on a specific date):
SELECT report_time::date AS "Report Time",
count(case when result in ('failure', 'error') then 1 else null end) AS failed_or_errored
from tests
where report_time::date = now()::date
GROUP BY report_time::date
But I'm struggling to sum each date 30 days back.

You can generate the dates and then use window functions:
select gs.dte, num_failed_error, num_failed_error_30
from generate_series(current_date - interval '5 day', current_date, interval '1 day') gs(dte) left join
(select t.report_time, count(*) as num_failed_error,
sum(count(*)) over (order by report_time range between interval '30 day' preceding and current row) as num_failed_error_30
from t
where t.result in ('failure', 'error') and
t.report_time >= current_date - interval '35 day'
group by t.report_time
) t
on t.report_time = gs.dte ;
Note: This assumes that report_time is only the date with no time component. If it has a time component, use report_time::date. A RANGE frame with an interval offset requires PostgreSQL 11 or later.
If you have data on each day, then this can be simplified to:
select t.report_time, count(*) as num_failed_error,
sum(count(*)) over (order by report_time range between interval '30 day' preceding and current row) as num_failed_error_30
from t
where t.result in ('failure', 'error') and
t.report_time >= current_date - interval '35 day'
group by t.report_time
order by report_time desc
limit 5;
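To sanity-check the expected numbers from the question, here is a minimal Python sketch of the same rolling 30-day count. It is not part of the SQL answer; the names `ROWS` and `rolling_failures` are made up for illustration, and the window is assumed to be inclusive on both ends (date - 30 days through date).

```python
from datetime import date, timedelta

# Sample rows from the question: (test_name, result, report_date).
ROWS = [
    ("A", "error", date(2020, 11, 29)),
    ("A", "failure", date(2020, 12, 28)),
    ("A", "error", date(2020, 12, 29)),
    ("B", "passed", date(2020, 12, 30)),
    ("C", "failure", date(2020, 12, 31)),
    ("A", "error", date(2020, 12, 31)),
]


def rolling_failures(rows, days=30):
    """For every date present in rows, count failure/error rows
    whose date falls in [date - days, date] (both ends inclusive)."""
    bad_dates = [d for _, result, d in rows if result in ("failure", "error")]
    totals = {}
    for day in sorted({d for _, _, d in rows}):
        start = day - timedelta(days=days)
        totals[day] = sum(1 for d in bad_dates if start <= d <= day)
    return totals


for day, total in rolling_failures(ROWS).items():
    print(day.isoformat(), total)
```

The loop visits every date that has any row, so 30/12/2020 (which only has a passed test) still gets a total.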

Since I'm using PostgreSQL 10.12 and upgrading is currently not an option, I took a different approach: I generate the dates of the last 30 days, and for each date I calculate the cumulative distinct sum for the preceding 30 days:
SELECT days_range::date, SUM(number_of_tests)
FROM generate_series (now() - interval '30 day', now()::timestamp , '1 day'::interval) days_range
CROSS JOIN LATERAL (
SELECT COUNT(DISTINCT(test_name)) as number_of_tests from tests
WHERE report_time > days_range - interval '30 day'
GROUP BY report_time::date
HAVING COUNT(case when result in ('failure', 'error') then 1 else null end) > 0
ORDER BY report_time::date asc
) as lateral_query
GROUP BY days_range
ORDER BY days_range desc
It is definitely not the best-optimized query: it takes about 1 minute to compute.

Related

SQL window function over 31 days but including only one first day of the month

I want to calculate the avg or sum of a metric over the past 31 days to use in a visualization.
The metric varies every day, but it always jumps on the first day of each calendar month.
The problem is that some months (like November) are 30 days long, so on the first of December the 31-day window actually includes two first-of-month days (run the query below and check the row at 2021-12-01T00:00:00Z).
I need avg(metric) over the past 30 days if the window would contain two first-of-month days, and over the past 31 days otherwise.
with days as(
select
'2021-10-01' :: timestamptz + (d || ' day') ::interval as "day"
from generate_series(0, 100) d
)
, daily_metrics as (
select
"day"
-- in reality "metric" fluctuates every day. But it jumps on the first day of the months
, case when extract(day from "day") = 1 then 300 else 100 end :: float as metric
from days
)
, result as (
select
"day"
, avg(metric) over (rows between 30 preceding and current row) as metric_roll_avg
from daily_metrics
)
select * from result
where "day" > '2021-10-01' :: timestamptz + '30 day' :: interval
This is what I ended up doing.
First, create a calendar view with one row per calendar day:
with
calendar as (
select c."day"
, extract(day from c."day") = 1 and extract(day from date_trunc('month', c."day") - interval '1 day') = 30 as last_month_30
, extract(day from c."day") in (1, 2) and extract(day from date_trunc('month', c."day") - interval '1 day') = 29 as last_month_29
, extract(day from c."day") in (1, 2, 3) and extract(day from date_trunc('month', c."day") - interval '1 day') = 28 as last_month_28
from ... as c ...
)
Which returns this view:
+------------+---------------+---------------+---------------+
| day | last_month_30 | last_month_29 | last_month_28 |
|------------+---------------+---------------+---------------|
| 2022-02-26 | False | False | False |
| 2022-02-27 | False | False | False |
| 2022-02-28 | False | False | False |
| 2022-03-01 | False | False | True |
| 2022-03-02 | False | False | True |
+------------+---------------+---------------+---------------+
And using a case switch:
select
"day"
, case
when last_month_30
then avg(revenue) over (order by "day" rows between 29 preceding and current row )
when last_month_29
then avg(revenue) over (order by "day" rows between 28 preceding and current row)
when last_month_28
then avg(revenue) over (order by "day" rows between 27 preceding and current row)
else avg(revenue) over (order by "day" rows between 30 preceding and current row )
end as monthly_revenue
from ...
Probably not the cleanest way, but it works for all cases.
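The three flags and the CASE expression above boil down to one rule: shrink the window to the length of the previous month whenever a 31-row window would reach back to that month's first day. A small Python sketch of that rule, for checking individual dates (`window_rows` is a hypothetical helper name, not something from the answer):

```python
from datetime import date, timedelta


def window_rows(day):
    """Number of rows (days) the rolling average should cover for `day`,
    mirroring the last_month_30 / last_month_29 / last_month_28 cases."""
    # Length of the previous calendar month: last day before the 1st of this month.
    prev_month_len = (day.replace(day=1) - timedelta(days=1)).day
    # A 31-row window ending on `day` covers two first-of-month days when
    # day-of-month <= 31 - prev_month_len (1 for 30-day months, 2 for 29, 3 for 28).
    if day.day <= 31 - prev_month_len:
        return prev_month_len
    return 31


print(window_rows(date(2021, 12, 1)))  # previous month (November) has 30 days
print(window_rows(date(2022, 3, 2)))   # previous month (February 2022) has 28 days
```

In SQL terms the frame is `rows between window_rows - 1 preceding and current row`, which is why the CASE expression uses 29, 28 and 27 preceding.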

Count if previous month data exists postgres

I'm stuck on a query that counts, for each month, the ids that also exist in the previous month.
My table looks like this:
date | id |
2020-02-02| 1 |
2020-03-04| 1 |
2020-03-04| 2 |
2020-04-05| 1 |
2020-04-05| 3 |
2020-05-06| 2 |
2020-05-06| 3 |
2020-06-07| 2 |
2020-06-07| 3 |
This is the query I have so far:
SELECT date_trunc('month',date), id
FROM test
WHERE id IN
(SELECT DISTINCT id FROM test WHERE date
BETWEEN date_trunc('month', current_date) - interval '1 month' AND date_trunc('month', current_date))
The main problem is that I'm stuck with the current_date function. Is there a dynamic way to replace current_date? Thanks.
My expected result is:
date | count |
2020-02-01| 0 |
2020-03-01| 1 |
2020-04-01| 1 |
2020-05-01| 1 |
2020-06-01| 2 |
Solution 1 with SELF JOIN
SELECT date_trunc('month', c.date) :: date AS date
, count(DISTINCT c.id) FILTER (WHERE p.date IS NOT NULL)
FROM test AS c
LEFT JOIN test AS p
ON c.id = p.id
AND date_trunc('month', c.date) = date_trunc('month', p.date) + interval '1 month'
GROUP BY date_trunc('month', c.date)
ORDER BY date_trunc('month', c.date)
Result :
date count
2020-02-01 0
2020-03-01 1
2020-04-01 1
2020-05-01 1
2020-06-01 2
Solution 2 with WINDOW FUNCTIONS
SELECT DISTINCT ON (date) date
, count(*) FILTER (WHERE count > 0 AND previous_month) OVER (PARTITION BY date)
FROM
( SELECT DISTINCT ON (id, date_trunc('month', date))
id
, date_trunc('month', date) AS date
, count(*) OVER w AS count
, first_value(date_trunc('month', date)) OVER w = date_trunc('month', date) - interval '1 month' AS previous_month
FROM test
WINDOW w AS (PARTITION BY id ORDER BY date_trunc('month', date) GROUPS BETWEEN 1 PRECEDING AND 1 PRECEDING)
) AS a
Result :
date count
2020-02-01 0
2020-03-01 1
2020-04-01 1
2020-05-01 1
2020-06-01 2
see dbfiddle
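For readers who want to verify the expected counts without a database, here is a rough Python equivalent of the self-join idea: an id counts for a month if the same id also appears in the immediately preceding month. The helper names (`month_key`, `prev_month`, `count_returning`) are invented for this sketch.

```python
from datetime import date

# Sample rows from the question: (date, id).
ROWS = [
    (date(2020, 2, 2), 1),
    (date(2020, 3, 4), 1),
    (date(2020, 3, 4), 2),
    (date(2020, 4, 5), 1),
    (date(2020, 4, 5), 3),
    (date(2020, 5, 6), 2),
    (date(2020, 5, 6), 3),
    (date(2020, 6, 7), 2),
    (date(2020, 6, 7), 3),
]


def month_key(d):
    """Truncate a date to its (year, month), like date_trunc('month', ...)."""
    return (d.year, d.month)


def prev_month(key):
    """Return the (year, month) immediately before the given one."""
    year, month = key
    return (year, month - 1) if month > 1 else (year - 1, 12)


def count_returning(rows):
    """Per month, count distinct ids that were also present in the previous month."""
    seen = {(month_key(d), i) for d, i in rows}
    months = sorted({m for m, _ in seen})
    return {
        m: sum(1 for mm, i in seen if mm == m and (prev_month(m), i) in seen)
        for m in months
    }


for month, count in count_returning(ROWS).items():
    print(month, count)
```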

Calculating Aggregates on subset of data based on condition

I have a DB as follows:
| company | timestamp | value |
| ------- | ---------- | ----- |
| google | 2020-09-01 | 5 |
| google | 2020-08-01 | 4 |
| amazon | 2020-09-02 | 3 |
I'd like to calculate the average value for each company within the last year if there are >= 20 datapoints. If there are fewer than 20 datapoints, then I'd like the average over the entire time range. I know I can run two separate queries and get the averages for each scenario. The question is how to merge them back into a single table based on that criterion.
select company, avg(value) from my_db GROUP BY company;
select company, avg(value) from my_db
where timestamp > (CURRENT_DATE - INTERVAL '12 months')
GROUP BY company;
WITH last_year AS (
SELECT company, avg(value), 'year' AS range -- optional tag
FROM tbl
WHERE timestamp >= now() - interval '1 year'
GROUP BY 1
HAVING count(*) >= 20 -- 20+ rows in range
)
SELECT company, avg(value), 'all' AS range
FROM tbl t
WHERE NOT EXISTS (SELECT FROM last_year WHERE company = t.company)
GROUP BY 1
UNION ALL TABLE last_year;
db<>fiddle here
An index on (timestamp) will only be used if your table is big and holds many years.
If most companies have 20+ rows in range, an index on (company) will be used for the 2nd SELECT to retrieve the few outliers.
Use conditional aggregation:
select company,
case
when sum(case when timestamp > CURRENT_DATE - INTERVAL '12 months' then value end) >= 20 then
avg(case when timestamp > CURRENT_DATE - INTERVAL '12 months' then value end)
else avg(value)
end
from my_db
group by company
If by 20 datapoints you mean 20 rows in the last 12 months for each company, then:
select company,
case
when count(case when timestamp > CURRENT_DATE - INTERVAL '12 months' then value end) >= 20 then
avg(case when timestamp > CURRENT_DATE - INTERVAL '12 months' then value end)
else avg(value)
end
from my_db
group by company
You can use window functions to provide the information for filtering:
select company, avg(value),
(count(*) = cnt_this_year) as only_this_year
from (select t.*,
count(*) filter (where date_trunc('year', datecol) = date_trunc('year', now())) over (partition by company) as cnt_this_year
from t
) t
where cnt_this_year >= 20 and date_trunc('year', datecol) = date_trunc('year', now()) or
cnt_this_year < 20
group by company;
The third column specifies if all the rows are from this year. By filtering in the where clause, it is simple to add other calculations as well (such as min(), max(), and so on).
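The selection rule shared by these answers (use the recent rows if there are at least 20 of them, otherwise use everything) can be written out directly. Below is a minimal Python sketch for a single company; `company_average` and its parameters are hypothetical, and "last year" is approximated as 365 days, which is an assumption rather than what `interval '1 year'` does exactly.

```python
from datetime import date, timedelta
from statistics import mean


def company_average(points, today, min_recent=20):
    """points: list of (day, value) tuples for one company.

    Returns (average, tag) where tag is 'year' when the average covers only
    the last year and 'all' when it falls back to the full history."""
    cutoff = today - timedelta(days=365)  # assumption: 1 year ~ 365 days
    recent = [value for day, value in points if day >= cutoff]
    if len(recent) >= min_recent:
        return mean(recent), "year"
    return mean([value for _, value in points]), "all"


# Hypothetical data: 3 recent points and 1 old one, so the fallback applies.
today = date(2020, 9, 15)
points = [
    (date(2020, 9, 1), 2),
    (date(2020, 8, 1), 2),
    (date(2020, 7, 1), 2),
    (date(2015, 1, 1), 10),
]
print(company_average(points, today))
```

The tag plays the same role as the optional 'year' / 'all' column in the UNION ALL query.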

Weeks between two dates

I'm attempting to turn two dates into a series of records. One record for each week between the dates.
Additionally the original start and end dates should be used to clip the week in case the range starts or ends mid-week. I'm also assuming that a week starts on Monday.
With a start date of: 05/09/2018 and an end date of 27/09/2018 I would like to retrieve the following results:
| # | Start Date | End date |
|---------------------------------|
| 0 | '05/09/2018' | '09/09/2018' |
| 1 | '10/09/2018' | '16/09/2018' |
| 2 | '17/09/2018' | '23/09/2018' |
| 3 | '24/09/2018' | '27/09/2018' |
I have made some progress - at the moment I can get the total number of weeks between the date range with:
SELECT (
EXTRACT(
days FROM (
date_trunc('week', to_date('27/09/2018', 'DD/MM/YYYY')) -
date_trunc('week', to_date('05/09/2018', 'DD/MM/YYYY'))
) / 7
) + 1
) as total_weeks;
Total weeks will return 4 for the above SQL. This is where I'm stuck, going from an integer to actual set of results.
Window functions are your friend:
SELECT week_num,
min(d) AS start_date,
max(d) AS end_date
FROM (SELECT d,
count(*) FILTER (WHERE new_week) OVER (ORDER BY d) AS week_num
FROM (SELECT DATE '2018-09-05' + i AS d,
extract(dow FROM DATE '2018-09-05'
+ lag(i) OVER (ORDER BY i)
) = 0 AS new_week -- previous day was a Sunday (dow 0), so this day starts a new week
FROM generate_series(0, DATE '2018-09-27' - DATE '2018-09-05') AS i
) AS week_days
) AS weeks
GROUP BY week_num
ORDER BY week_num;
week_num | start_date | end_date
----------+------------+------------
0 | 2018-09-05 | 2018-09-09
1 | 2018-09-10 | 2018-09-16
2 | 2018-09-17 | 2018-09-23
3 | 2018-09-24 | 2018-09-27
(4 rows)
Use generate_series():
select gs.*
from generate_series(date_trunc('week', '2018-09-05'::date),
'2018-09-27'::date,
interval '1 week'
) gs(dte)
Ultimately I expanded on Gordon's solution to get to the following, however Laurenz's answer is slightly more concise.
select
(
case when (week_start - interval '6 days' <= date_trunc('week', '2018-09-05'::date)) then '2018-09-05'::date else week_start end
) as start_date,
(
case when (week_start + interval '6 days' >= '2018-09-27'::date) then '2018-09-27'::date else week_start + interval '6 days' end
) as end_date
from generate_series(
date_trunc('week', '2018-09-05'::date),
'2018-09-27'::date,
interval '1 week'
) gs(week_start);
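The clipping logic in the query above (start each week on Monday, then clamp the first and last week to the requested range) is easy to check outside the database. A minimal Python sketch, assuming Monday-based weeks as in the question; `clipped_weeks` is a made-up name:

```python
from datetime import date, timedelta


def clipped_weeks(start, end):
    """Return (week_start, week_end) pairs for Monday-based weeks,
    clipped so that no pair extends outside [start, end]."""
    # date.weekday() returns 0 for Monday, so this steps back to the week's Monday.
    monday = start - timedelta(days=start.weekday())
    weeks = []
    while monday <= end:
        sunday = monday + timedelta(days=6)
        weeks.append((max(monday, start), min(sunday, end)))
        monday += timedelta(days=7)
    return weeks


for number, (week_start, week_end) in enumerate(
    clipped_weeks(date(2018, 9, 5), date(2018, 9, 27))
):
    print(number, week_start.isoformat(), week_end.isoformat())
```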

GROUP BY next months over N years

I need to aggregate amounts grouped by "horizon", i.e. by each successive 12-month period over the next 5 years:
assuming we are 2015-08-15
SUM amount from 0 to 12 next months (from 2015-08-16 to 2016-08-15)
SUM amount from 12 to 24 next months (from 2016-08-16 to 2017-08-15)
SUM amount from 24 to 36 next months ...
SUM amount from 36 to 48 next months
SUM amount from 48 to 60 next months
Here is a fiddled dataset example:
+----+------------+--------+
| id | date | amount |
+----+------------+--------+
| 1 | 2015-09-01 | 10 |
| 2 | 2015-10-01 | 10 |
| 3 | 2016-10-01 | 10 |
| 4 | 2017-06-01 | 10 |
| 5 | 2018-06-01 | 10 |
| 6 | 2019-05-01 | 10 |
| 7 | 2019-04-01 | 10 |
| 8 | 2020-04-01 | 10 |
+----+------------+--------+
Here is the expected result:
+---------+--------+
| horizon | amount |
+---------+--------+
| 1 | 20 |
| 2 | 20 |
| 3 | 10 |
| 4 | 20 |
| 5 | 10 |
+---------+--------+
How can I group by these 12-month "horizons"?
I tagged PostgreSQL but I'm actually using an ORM so it's just to find the idea. (by the way I don't have access to the date formatting functions)
I would split by 12 months time frame and group by this:
SELECT
FLOOR(
(EXTRACT(EPOCH FROM date) - EXTRACT(EPOCH FROM now()))
/ EXTRACT(EPOCH FROM INTERVAL '12 month')
) + 1 AS "horizon",
SUM(amount) AS "amount"
FROM dataset
GROUP BY horizon
ORDER BY horizon;
SQL Fiddle
Inspired by: Postgresql SQL GROUP BY time interval with arbitrary accuracy (down to milli seconds)
Assuming you need intervals from current date to this day next year and so on, I would query this like this:
SELECT 1 AS horizon, SUM(amount) FROM dataset
WHERE date > now()
AND date < (now() + '12 months'::INTERVAL)
UNION
SELECT 2 AS horizon, SUM(amount) FROM dataset
WHERE date > (now() + '12 months'::INTERVAL)
AND date < (now() + '24 months'::INTERVAL)
UNION
SELECT 3 AS horizon, SUM(amount) FROM dataset
WHERE date > (now() + '24 months'::INTERVAL)
AND date < (now() + '36 months'::INTERVAL)
UNION
SELECT 4 AS horizon, SUM(amount) FROM dataset
WHERE date > (now() + '36 months'::INTERVAL)
AND date < (now() + '48 months'::INTERVAL)
UNION
SELECT 5 AS horizon, SUM(amount) FROM dataset
WHERE date > (now() + '48 months'::INTERVAL)
AND date < (now() + '60 months'::INTERVAL)
ORDER BY horizon;
You can generalize it and make something like this using additional variable:
SELECT number AS horizon, SUM(amount) FROM dataset
WHERE date > (now() + ((number - 1) * '12 months'::INTERVAL))
AND date < (now() + (number * '12 months'::INTERVAL));
Where number is an integer from range [1,5]
Here is what I get from the Fiddle:
| horizon | sum |
|---------|-----|
| 1 | 20 |
| 2 | 20 |
| 3 | 10 |
| 4 | 20 |
| 5 | 10 |
Perhaps CTE?
WITH RECURSIVE grps AS
(
SELECT 1 AS Horizon, (date '2015-08-15') + interval '1' day AS FromDate, (date '2015-08-15') + interval '1' year AS ToDate
UNION ALL
SELECT Horizon + 1, ToDate + interval '1' day AS FromDate, ToDate + interval '1' year
FROM grps WHERE Horizon < 5
)
SELECT
Horizon,
(SELECT SUM(amount) FROM dataset WHERE date BETWEEN g.FromDate AND g.ToDate) AS SumOfAmount
FROM
grps g
SQL fiddle
Rather simply:
SELECT horizon, sum(amount) AS amount
FROM generate_series(1, 5) AS s(horizon)
JOIN dataset ON "date" >= current_date + (horizon - 1) * interval '1 year'
AND "date" < current_date + horizon * interval '1 year'
GROUP BY horizon
ORDER BY horizon;
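The generate_series join above assigns each row to the horizon whose half-open interval [start + (h - 1) years, start + h years) contains its date. Here is a small Python sketch of that bucketing against the sample dataset, with the start date fixed at 2015-08-15 as in the question instead of current_date. The names `add_years` and `sum_by_horizon` are hypothetical.

```python
from datetime import date

# Sample dataset from the question: (date, amount).
DATASET = [
    (date(2015, 9, 1), 10),
    (date(2015, 10, 1), 10),
    (date(2016, 10, 1), 10),
    (date(2017, 6, 1), 10),
    (date(2018, 6, 1), 10),
    (date(2019, 5, 1), 10),
    (date(2019, 4, 1), 10),
    (date(2020, 4, 1), 10),
]


def add_years(d, years):
    """Shift a date by whole years. Assumes d is not Feb 29
    (date.replace would raise ValueError for a non-leap target year)."""
    return d.replace(year=d.year + years)


def sum_by_horizon(rows, start, horizons=5):
    """Sum amounts per 12-month horizon counted from `start`."""
    totals = {}
    for h in range(1, horizons + 1):
        lower, upper = add_years(start, h - 1), add_years(start, h)
        amounts = [amount for d, amount in rows if lower <= d < upper]
        if amounts:  # like the inner join, horizons without rows are dropped
            totals[h] = sum(amounts)
    return totals


print(sum_by_horizon(DATASET, date(2015, 8, 15)))
```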
You need a union and an aggregate function:
select 1 as horizon,
sum(amount) amount
from the_table
where date >= current_date
and date < current_date + interval '12' month
union all
select 2 as horizon,
sum(amount) amount
from the_table
where date >= current_date + interval '12' month
and date < current_date + interval '24' month
union all
select 3 as horizon,
sum(amount) amount
from the_table
where date >= current_date + interval '24' month
and date < current_date + interval '36' month
... and so on ...
I don't know how to do that with an obfuscation layer (aka ORM), but I'm sure it supports (or should support) aggregation and unions.
This could easily be wrapped up into a PL/PgSQL function where you pass the "horizon" and the SQL is built dynamically so that all you need to call is something like: select * from sum_horizon(5) where 5 indicates the number of years.
Btw: date is a horrible name for a column. For one because it's a reserved word, but more importantly because it doesn't document the meaning of the column. Is it a "release date"? A "due date"? An "order date"?
Try this
select
id,
sum(case when date>=current_date and date<current_date+interval '1 year' then amount else 0 end) as year1,
sum(case when date>=current_date+interval '1 year' and date<current_date+interval '2 year' then amount else 0 end) as year2,
sum(case when date>=current_date+interval '2 year' and date<current_date+interval '3 year' then amount else 0 end) as year3,
sum(case when date>=current_date+interval '3 year' and date<current_date+interval '4 year' then amount else 0 end) as year4,
sum(case when date>=current_date+interval '4 year' and date<current_date+interval '5 year' then amount else 0 end) as year5
from dataset
group by id