Sum duration of overlapping periods with priority by excluding the overlap itself - sql

I have R code that I am trying to rewrite in PostgreSQL to feed a Grafana dashboard. I have the basics down, so I am almost done with the other parts of the script, but what I am trying to accomplish now in PostgreSQL is beyond my league. I see very similar solved issues on Stack Overflow but I can't seem to get them to work for me. Here are some links with code that I was trying to adapt:
https://stackoverflow.com/a/54370027/7885817
https://stackoverflow.com/a/44139381/7885817
I apologize for posting such a repetitive issue.
Any help is highly appreciated!
So, my issue is:
I have messages with overlapping timestamps. These messages have priorities, A and B (A is more important), a start time, and an end time.
Strictly speaking, I would like to sum the durations for A and B.
BUT if there is an overlap, I want to find the duration between the first start time and the last end time of messages with priority A, and do the same for messages with priority B. And if an A message overlaps with a B message, I want to split that duration at the end time of the A message; up to that point, the duration of the B message is allocated to A.
I made a visual to support my cryptic explanations, along with a simplified version of my data:
CREATE TABLE activities(
id int,
name text,
start timestamp,
"end" timestamp
);
INSERT INTO activities VALUES
(1, 'A', '2018-01-09 17:00:00', '2018-01-09 20:00:00'),
(2, 'A', '2018-01-09 18:00:00', '2018-01-09 20:30:00'),
(3, 'B', '2018-01-09 19:00:00', '2018-01-09 21:30:00'),
(4, 'B', '2018-01-09 22:00:00', '2018-01-09 23:00:00');
SELECT * FROM activities;
Thank you very much for your time!

Update
My original solution was not correct. The consolidation of ranges cannot be handled in a regular window. I confused myself by using the same name, t_range, forgetting that the window is over the source rows rather than the result rows. Please see the updated SQL Fiddle with the full query, as well as an added record to illustrate the problem.
You can simplify the overlapping requirement, as well as identifying gaps and islands, by using PostgreSQL range types.
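As a quick standalone sketch of the two range operators the query relies on (&& tests for overlap, * computes the intersection):
select tsrange('2018-01-09 17:00', '2018-01-09 20:00', '[]')
    && tsrange('2018-01-09 18:00', '2018-01-09 20:30', '[]') as is_overlap,   -- true
       tsrange('2018-01-09 17:00', '2018-01-09 20:00', '[]')
     * tsrange('2018-01-09 18:00', '2018-01-09 20:30', '[]') as common_part;  -- ["2018-01-09 18:00:00","2018-01-09 20:00:00"]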
The following query is intentionally verbose to show each step of the process. A number of steps can be combined.
SQL Fiddle
First, add an inclusive [start, end] range to each record.
with add_ranges as (
select id, name, tsrange(start, "end", '[]') as t_range
from activities
),
id | name | t_range
----+------+-----------------------------------------------
1 | A | ["2018-01-09 17:00:00","2018-01-09 20:00:00"]
2 | A | ["2018-01-09 18:00:00","2018-01-09 20:30:00"]
3 | B | ["2018-01-09 19:00:00","2018-01-09 21:30:00"]
4 | B | ["2018-01-09 22:00:00","2018-01-09 23:00:00"]
(4 rows)
Identify overlapping ranges as determined by the && operator and mark the beginning of new islands with a 1.
mark_islands as (
select id, name, t_range,
case
when t_range && lag(t_range) over w then 0
else 1
end as new_range
from add_ranges
window w as (partition by name order by t_range)
),
id | name | t_range | new_range
----+------+-----------------------------------------------+-----------
1 | A | ["2018-01-09 17:00:00","2018-01-09 20:00:00"] | 1
2 | A | ["2018-01-09 18:00:00","2018-01-09 20:30:00"] | 0
3 | B | ["2018-01-09 19:00:00","2018-01-09 21:30:00"] | 1
4 | B | ["2018-01-09 22:00:00","2018-01-09 23:00:00"] | 1
(4 rows)
Number the groups based on the sum of the new_range within name.
group_nums as (
select id, name, t_range,
sum(new_range) over (partition by name order by t_range) as group_num
from mark_islands
),
id | name | t_range | group_num
----+------+-----------------------------------------------+-----------
1 | A | ["2018-01-09 17:00:00","2018-01-09 20:00:00"] | 1
2 | A | ["2018-01-09 18:00:00","2018-01-09 20:30:00"] | 1
3 | B | ["2018-01-09 19:00:00","2018-01-09 21:30:00"] | 1
4 | B | ["2018-01-09 22:00:00","2018-01-09 23:00:00"] | 2
Group by name and group_num to get the total time spent on each island, as well as a consolidated t_range to be used in overlap deduction.
islands as (
select name,
tsrange(min(lower(t_range)), max(upper(t_range)), '[]') as t_range,
max(upper(t_range)) - min(lower(t_range)) as island_time_interval
from group_nums
group by name, group_num
),
name | t_range | island_time_interval
------+-----------------------------------------------+----------------------
A | ["2018-01-09 17:00:00","2018-01-09 20:30:00"] | 03:30:00
B | ["2018-01-09 19:00:00","2018-01-09 21:30:00"] | 02:30:00
B | ["2018-01-09 22:00:00","2018-01-09 23:00:00"] | 01:00:00
(3 rows)
For the requirement to count overlap time between A messages and B messages, find each occurrence of an A message overlapping a B message, and use the * (intersection) operator to compute the overlapping range.
priority_overlaps as (
select b.name, a.t_range * b.t_range as overlap_range
from islands a
join islands b
on a.t_range && b.t_range
and a.name = 'A' and b.name != 'A'
),
name | overlap_range
------+-----------------------------------------------
B | ["2018-01-09 19:00:00","2018-01-09 20:30:00"]
(1 row)
Sum the total time of each overlap by name.
overlap_time as (
select name, sum(upper(overlap_range) - lower(overlap_range)) as total_overlap_interval
from priority_overlaps
group by name
),
name | total_overlap_interval
------+------------------------
B | 01:30:00
(1 row)
Calculate the total time for each name.
island_times as (
select name, sum(island_time_interval) as name_time_interval
from islands
group by name
)
name | name_time_interval
------+--------------------
B | 03:30:00
A | 03:30:00
(2 rows)
Join the total time for each name to the adjustments from the overlap_time CTE, and subtract the adjustment to get the final duration value.
select i.name,
i.name_time_interval - coalesce(o.total_overlap_interval, interval '0') as duration
from island_times i
left join overlap_time o
on o.name = i.name
;
name | duration
------+----------
B | 02:00:00
A | 03:30:00
(2 rows)

This is a type of gaps-and-islands problem. To solve this, find where the "islands" begin and then aggregate. So, to get the islands:
select a.name, min(start) as startt, max("end") as endt
from (select a.*,
count(*) filter (where prev_end is null or prev_end < start) over (partition by name order by start, id) as grp
from (select a.*,
max("end") over (partition by name
order by start, id
rows between unbounded preceding and 1 preceding
) as prev_end
from activities a
) a
) a
group by name, grp;
The next step is just to aggregate again:
with islands as (
select a.name, min(start) as startt, max("end") as endt
from (select a.*,
count(*) filter (where prev_end is null or prev_end < start) over (partition by name order by start, id) as grp
from (select a.*,
max("end") over (partition by name
order by start, id
rows between unbounded preceding and 1 preceding
) as prev_end
from activities a
) a
) a
group by name, grp
)
select name, sum(endt - startt)
from islands i
group by name;
Here is a db<>fiddle.
Note that this uses a cumulative trailing maximum to define the overlaps. This is the most general method for determining overlaps. I think this will work on all edge cases, including:
1----------2---2----3--3-----1
It also handles ties on the start time.
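For example, appending a few hypothetical rows shaped like that diagram (ids 5-7 are invented) shows why: the trailing maximum of "end" stays at the outer row's end time, so all three rows land in one island even though rows 6 and 7 do not overlap each other.
insert into activities values
(5, 'A', '2018-01-09 10:00:00', '2018-01-09 16:00:00'), -- 1------------------1 (outer)
(6, 'A', '2018-01-09 11:00:00', '2018-01-09 12:00:00'), -- 2---2 (nested)
(7, 'A', '2018-01-09 13:00:00', '2018-01-09 14:00:00'); -- 3--3 (nested)
A plain lag() comparison against only the single previous row would incorrectly start a new island at row 7.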

Related

BigQuery for running count of distinct values with a dynamic date-range

We are trying to make a query that gets the count of unique customers in a specific year-month, plus the count of unique customers in the 364 days before that date.
For example:
Our customer-table looks like this:
| order_date | customer_unique_id |
| -------- | -------------- |
| 2020-01-01 | tom#email.com |
| 2020-01-01 | daisy#email.com |
| 2019-05-02 | tom#email.com |
In this example we have two customers who ordered on 2020-01-01, and one of them had already ordered within the 364-day timeframe.
The desired table should look like this:
| year_month | unique_customers |
| -------- | -------------- |
| 2020-01 | 2 |
We tried multiple solutions, such as partitioning and windows, but nothing seems to work correctly. The tricky part is the uniqueness: we want to look 364 days back but do the count distinct on customers over that whole period, not per date/year/month, because then we would get duplicates. For example, if you partition by date, year, or month, tom#email.com would be counted twice instead of once.
The goal of this query is to get insight into the order frequency (orders divided by customers) over a 12-month period.
We work with Google BigQuery.
Hope someone can help us out! :)
Here is a way to achieve your desired results. Note that it computes the year-month distinct counts in a separate subquery, then joins that to the rolling 364-day-interval query.
with year_month_distincts as (
select
concat(
cast(extract(year from order_date) as string),
'-',
cast(extract(month from order_date) as string)
) as year_month,
count(distinct customer_unique_id) as ym_distincts
from customer_table
group by 1
)
select x.order_date, x.ytd_distincts, y.ym_distincts from (
select
a.order_date,
(select
count(distinct customer_unique_id)
from customer_table b
where b.order_date between date_sub(a.order_date, interval 364 day) and a.order_date
) as ytd_distincts
from customer_table a
group by 1
) x
join year_month_distincts y on concat(
cast(extract(year from x.order_date) as string),
'-',
cast(extract(month from x.order_date) as string)
) = y.year_month
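One small caveat: cast(extract(month from ...) as string) is not zero-padded, so the key comes out as '2020-1'. Both sides of the join build it the same way, so they still match, but if a '2020-01' style key is preferred, FORMAT_DATE is the simpler sketch:
-- Zero-padded year-month key in BigQuery:
SELECT FORMAT_DATE('%Y-%m', DATE '2020-01-01') AS year_month;  -- '2020-01'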
Two options using arrays that may help: the first looks back 364 days as requested; the second (the commented-out option 2 line in year_array below) looks back 11 months, in case reporting is monthly.
WITH month_array AS (
SELECT
DATE_TRUNC(order_date,month) AS order_month,
STRING_AGG(DISTINCT customer_unique_id) AS cust_mth
FROM customer_table
GROUP BY 1
),
year_array AS (
SELECT
order_month,
STRING_AGG(cust_mth) OVER(ORDER by UNIX_DATE(order_month) RANGE BETWEEN 364 PRECEDING AND CURRENT ROW) cust_12m
-- (option 2) STRING_AGG(cust_mth) OVER (ORDER by cast(format_date('%Y%m', order_month) as int64) RANGE BETWEEN 99 PRECEDING AND CURRENT ROW) AS cust_12m
FROM month_array
)
SELECT format_date('%Y-%m',order_month) year_month,
(SELECT COUNT(DISTINCT cust_unique_id) FROM UNNEST(SPLIT(cust_12m)) AS cust_unique_id) as unique_12m
FROM year_array
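The final SELECT works by re-splitting the aggregated string. As a minimal standalone sketch of that SPLIT/UNNEST distinct-count trick (the literal is invented):
-- SPLIT turns the packed string back into an array; UNNEST lets COUNT(DISTINCT ...) run over it.
SELECT (SELECT COUNT(DISTINCT id)
        FROM UNNEST(SPLIT('a,b,a,c')) AS id) AS distinct_ids;  -- 3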

Get max value of binned time-interval

I have a 'requests' table with a 'time_request' column which has a timestamp for each request. I want to know the maximum number of requests that I had in a single minute.
So I'm guessing I need to somehow 'group by' a one-minute time interval, and then do some sort of MAX(COUNT(request_id))? Although nested aggregations are not allowed.
Will appreciate any help.
Table example:
request_id | time_request
------------------+---------------------
ab1 | 2021-03-29 16:20:05
ab2 | 2021-03-29 16:20:20
bc3 | 2021-03-31 20:34:07
fw3 | 2021-03-31 20:38:53
fe4 | 2021-03-31 20:39:53
Expected result: 2 (There were a maximum of 2 requests in a single minute)
Thanks!
You may use the window function count() and specify a logical interval of one minute as the window boundary. It will calculate the count for each row, taking into account all rows that fall within the one minute before it.
Code for Postgres is below:
with a as (
select
id
, cast(ts as timestamp) as ts
from(values
('ab1', '2021-03-29 16:20:05'),
('ab2', '2021-03-29 16:20:20'),
('bc3', '2021-03-31 20:34:07'),
('fw3', '2021-03-31 20:38:53'),
('fe4', '2021-03-31 20:39:53')
) as t(id, ts)
)
, count_per_interval as (
select
a.*
, count(id) over (
order by ts asc
range between
interval '1' minute preceding
and current row
) as cnt_per_min
from a
)
select max(cnt_per_min)
from count_per_interval
| max |
| --: |
| 2 |
db<>fiddle here
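If fixed calendar-minute buckets are acceptable instead of a sliding 60-second window, a simpler sketch (assuming the question's requests table with its time_request column) would be:
-- Bucket requests into calendar minutes, then take the busiest bucket:
select max(cnt) as max_per_minute
from (
  select date_trunc('minute', time_request) as minute_bucket, count(*) as cnt
  from requests
  group by 1
) b;
Note the two approaches can disagree: a sliding window catches bursts that straddle a minute boundary, while fixed buckets do not.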

SQL: last 7 Days Calculations based on date

The table below contains a count of users signing up on each particular day. I'm looking to populate the Total_Users column.
Logic: it contains the user count between (Signupdate - 14) and (Signupdate - 7).
For example: 15/01/2020 contains the user count between 1/1/2020 and 1/7/2020.
Signupdate | Users | Total_Users (b/w D-14 & D-7)
-----------+-------+-----------------------------
1/1/2020   | 20    | 60
2/1/2020   | 30    | 80
3/1/2020   | 10    | 90
---        | ---   | ---
---        | ---   | ---
15/1/2020  | 30    | 120
16/1/2020  | 10    | 40
A straightforward way is a correlated subquery (the table name signups is assumed here; adjust to your schema):
SELECT s.Signupdate,
       s.Users,
       (SELECT SUM(s2.Users) -- users who signed up in the D-14..D-7 window
        FROM signups s2
        WHERE s2.Signupdate BETWEEN s.Signupdate - 14 AND s.Signupdate - 7
       ) AS Total_Users
FROM signups s;
This assumes that the Users column is of a numeric type and that Signupdate is a date.
Assuming you have a row for each date, you would use window functions with a windowing clause. I'm not sure if Redshift supports window frames with intervals, but this is the basic logic:
select t.*,
       sum(users) over (order by signupdate
                        range between interval '14 days' preceding
                                  and interval '7 days' preceding
                       ) as total_users
from t;
If not, you can turn the date into a number and use that:
select t.*,
       sum(users) over (order by diff
                        rows between 14 preceding and 7 preceding
                       ) as total_users
from (select t.*,
             datediff(day, date '2000-01-01', signupdate) as diff
      from t
     ) t
I am guessing you want a complete week; note, however, that D-14 through D-7 inclusive spans 8 days, not 7.
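If a strict 7-day week is what's wanted, shifting one frame bound fixes that; a hypothetical variant in the interval-frame syntax:
select t.*,
       sum(users) over (order by signupdate
                        range between interval '14 days' preceding
                                  and interval '8 days' preceding
                       ) as total_users -- D-14 through D-8 inclusive is exactly 7 days
from t;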

Finding gaps in huge event streams?

I have about 1 million events in a PostgreSQL database that are of this format:
id | stream_id | timestamp
----------+-----------------+-----------------
1 | 7 | ....
2 | 8 | ....
There are about 50,000 unique streams.
I need to find all of the events where the time between any two of the events is over a certain time period. In other words, I need to find event pairs where there was no event in a certain period of time.
For example:
a b c d e f g h i j k
| | | | | | | | | | |
\____2 mins____/
In this scenario, I would want to find the pair (f, g) since those are the events immediately surrounding a gap.
I don't care if the query is (that) slow, i.e. on 1 million records it's fine if it takes an hour or so. However, the data set will keep growing, so hopefully if it's slow it scales sanely.
I also have the data in MongoDB.
What's the best way to perform this query?
You can do this with the lag() window function over a partition by stream_id that is ordered by the timestamp. The lag() function gives you access to previous rows in the partition; without an offset argument, it returns the immediately previous row. So if the partition on stream_id is ordered by time, then the previous row is the previous event for that stream_id.
-- A window result can't be filtered in the WHERE clause of the same query
-- level, so wrap the window computation in a subquery first:
SELECT *
FROM (SELECT stream_id,
             lag(id) OVER pair AS start_id,
             id AS end_id,
             ("timestamp" - lag("timestamp") OVER pair) AS diff
      FROM my_table
      WINDOW pair AS (PARTITION BY stream_id ORDER BY "timestamp")) g
WHERE diff > interval '2 minutes';
In Postgres it can be done very easily with the help of the lag() window function. Check the fiddle below as an example:
SQL Fiddle
PostgreSQL 9.3 Schema Setup:
CREATE TABLE Table1
("id" int, "stream_id" int, "timestamp" timestamp)
;
INSERT INTO Table1
("id", "stream_id", "timestamp")
VALUES
(1, 7, '2015-06-01 15:20:30'),
(2, 7, '2015-06-01 15:20:31'),
(3, 7, '2015-06-01 15:20:32'),
(4, 7, '2015-06-01 15:25:30'),
(5, 7, '2015-06-01 15:25:31')
;
Query 1:
with c as (select *,
lag("timestamp") over(partition by stream_id order by id) as pre_time,
lag(id) over(partition by stream_id order by id) as pre_id
from Table1
)
select * from c where "timestamp" - pre_time > interval '2 sec'
Results:
| id | stream_id | timestamp | pre_time | pre_id |
|----|-----------|------------------------|------------------------|--------|
| 4 | 7 | June, 01 2015 15:25:30 | June, 01 2015 15:20:32 | 3 |
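Since the data set will keep growing, an index matching the window's partition and sort order should keep either query scaling sanely; a sketch, assuming the first answer's table name:
-- Lets each stream's events be read in timestamp order without a full sort:
CREATE INDEX my_table_stream_ts_idx ON my_table (stream_id, "timestamp");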

Select first & last date in window

I'm trying to select the first and last date in a window, based on the month and year of the date supplied.
Here is example data:
F.rates
| id | c_id | date | rate |
---------------------------------
| 1 | 1 | 01-01-1991 | 1 |
| 1 | 1 | 15-01-1991 | 0.5 |
| 1 | 1 | 30-01-1991 | 2 |
.................................
| 1 | 1 | 01-11-2014 | 1 |
| 1 | 1 | 15-11-2014 | 0.5 |
| 1 | 1 | 30-11-2014 | 2 |
Here is the pgSQL SELECT I came up with:
SELECT c_id, first_value(date) OVER w, last_value(date) OVER w FROM F.rates
WINDOW w AS (PARTITION BY EXTRACT(YEAR FROM date), EXTRACT(MONTH FROM date), c_id
ORDER BY date ASC)
Which gives me a result pretty close to what I want:
| c_id | first_date | last_date |
----------------------------------
| 1 | 01-01-1991 | 15-01-1991 |
| 1 | 01-01-1991 | 30-01-1991 |
.................................
Should be:
| c_id | first_date | last_date |
----------------------------------
| 1 | 01-01-1991 | 30-01-1991 |
.................................
For some reason last_value(date) returns a different value for every record in the window, which makes me think I'm misunderstanding how windows in SQL work. It's as if SQL forms a new window for each row it iterates through, rather than one window per YEAR and MONTH across the entire table.
So could anyone be kind enough to explain whether I'm wrong, and how I can achieve the result I want?
There is a reason why I'm not using MAX/MIN with a GROUP BY clause. My next step would be to retrieve the associated rates for the dates I selected, like:
| c_id | first_date | last_date | first_rate | last_rate | avg rate |
-----------------------------------------------------------------------
| 1 | 01-01-1991 | 30-01-1991 | 1 | 2 | 1.1 |
.......................................................................
If you want your output to become grouped into a single (or just fewer) row(s), you should use simple aggregation (i.e. GROUP BY), if avg_rate is enough:
SELECT c_id, min(date), max(date), avg(rate)
FROM F.rates
GROUP BY c_id, date_trunc('month', date)
More about window functions in PostgreSQL's documentation:
But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities.
...
There is another important concept associated with window functions: for each row, there is a set of rows within its partition called its window frame. Many (but not all) window functions act only on the rows of the window frame, rather than of the whole partition. By default, if ORDER BY is supplied then the frame consists of all rows from the start of the partition up through the current row, plus any following rows that are equal to the current row according to the ORDER BY clause. When ORDER BY is omitted the default frame consists of all rows in the partition.
...
There are options to define the window frame in other ways ... See Section 4.2.8 for details.
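Applying that last option to the original query makes last_value() behave as hoped; a sketch with an explicit frame clause:
SELECT c_id, first_value(date) OVER w AS first_date, last_value(date) OVER w AS last_date
FROM F.rates
WINDOW w AS (PARTITION BY EXTRACT(YEAR FROM date), EXTRACT(MONTH FROM date), c_id
             ORDER BY date ASC
             -- widen the frame from the default to the whole partition:
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);
This still returns one (duplicate) row per source row, but every row now carries the true first and last date of its month.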
EDIT:
If you want to collapse your data (min/max aggregation) and want to collect more columns than those listed in the GROUP BY, you have two choices:
The SQL way
Select the min/max value(s) in a sub-query, then join their original rows back (but this way you have to deal with the fact that the min/max-ed column(s) are usually not unique):
SELECT agg.c_id,
       agg.first_date,
       agg.last_date,
       f.rate AS first_rate,
       l.rate AS last_rate,
       agg.avg_rate
FROM (SELECT c_id,
             min(date) AS first_date,
             max(date) AS last_date,
             avg(rate) AS avg_rate
      FROM F.rates
      GROUP BY c_id, date_trunc('month', date)) agg
JOIN F.rates f ON agg.c_id = f.c_id AND agg.first_date = f.date
JOIN F.rates l ON agg.c_id = l.c_id AND agg.last_date = l.date
PostgreSQL's DISTINCT ON
DISTINCT ON is typically meant for this task, but it relies heavily on ordering (only one extremum can be searched for this way at a time):
SELECT DISTINCT ON (c_id, date_trunc('month', date))
       c_id,
       date AS first_date,
       rate AS first_rate
FROM F.rates
ORDER BY c_id, date_trunc('month', date), date
You can join this query with other aggregated sub-queries of F.rates, but at this point (if you really need both minimum and maximum, and in your case even an average) the SQL-compliant way is the better fit.
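For the other extremum, the same pattern works with the final sort direction flipped; a sketch giving the last row per month:
SELECT DISTINCT ON (c_id, date_trunc('month', date))
       c_id,
       date AS last_date,
       rate AS last_rate
FROM F.rates
ORDER BY c_id, date_trunc('month', date), date DESC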
Windowing functions aren't appropriate for this. Use aggregate functions instead.
select
c_id, date_trunc('month', date)::date,
min(date) first_date, max(date) last_date
from rates
group by c_id, date_trunc('month', date)::date;
c_id | date_trunc | first_date | last_date
------+------------+------------+------------
1 | 2014-11-01 | 2014-11-01 | 2014-11-30
1 | 1991-01-01 | 1991-01-01 | 1991-01-30
create table rates (
id integer not null,
c_id integer not null,
date date not null,
rate numeric(2, 1),
primary key (id, c_id, date)
);
insert into rates values
(1, 1, '1991-01-01', 1),
(1, 1, '1991-01-15', 0.5),
(1, 1, '1991-01-30', 2),
(1, 1, '2014-11-01', 1),
(1, 1, '2014-11-15', 0.5),
(1, 1, '2014-11-30', 2);