How do I window PostgreSQL rows based on row data? - sql

I have a table that has a timestamp column and some data columns. Given an interval length T (say 30 minutes), I want to partition the table into 'sessions'. Two adjacent rows (when sorted by the timestamp) are in the same 'session' if the difference between their timestamp values is less than T. If the difference is more than T, a new session starts. For example, the table below has two gaps of more than T that split the sessions. How do I generate the session column with SQL?
row | timestamp | session
-------------------------
1 | 18:00 | 1
2 | 18:02 | 1
3 | 18:04 | 1
4 | 18:30 | 1
5 | 19:10 | 2
6 | 19:20 | 2
7 | 20:20 | 3

You can use lag() on the timestamp to measure the difference and then a cumulative sum to calculate the session:
select t.*,
sum(case when prev_timestamp > timestamp - interval '30 minute' then 0 else 1 end) over
(order by timestamp) as session
from (select t.*,
lag(timestamp) over (order by timestamp) as prev_timestamp
from t
) t;
Or, you could use count() with filter in Postgres:
select t.*,
1 + count(*) filter (where prev_timestamp < timestamp - interval '30 minute') over (order by timestamp) as session
from (select t.*,
lag(timestamp) over (order by timestamp) as prev_timestamp
from t
) t;
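
To make the logic of the first query concrete, here is a minimal Python sketch of the same idea: compare each row with the previous one (the lag) and keep a running count of session starts (the cumulative sum). The function name `assign_sessions` and the sample list are illustrative assumptions, not part of the original answer.

```python
from datetime import datetime, timedelta

def assign_sessions(timestamps, gap=timedelta(minutes=30)):
    """Return a session number per timestamp; the input must already be sorted."""
    sessions = []
    current = 0
    prev = None
    for ts in timestamps:
        # Same rule as the SQL case expression: a new session starts when there
        # is no previous row or the previous row is at least `gap` older.
        if prev is None or ts - prev >= gap:
            current += 1
        sessions.append(current)
        prev = ts
    return sessions

# Sample data taken from the table in the question.
times = [datetime.strptime(s, "%H:%M")
         for s in ["18:00", "18:02", "18:04", "18:30", "19:10", "19:20", "20:20"]]
print(assign_sessions(times))  # [1, 1, 1, 1, 2, 2, 3]
```

The result matches the session column in the question.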

Related

Get occurrences of past 2 weeks on any given date

I have data like
id | date |
-------------
1 | 1.1.20 |
3 | 4.1.20 |
2 | 4.1.20 |
1 | 5.1.20 |
6 | 2.1.20 |
What I would like to get is the number of occurrences a user with a given ID had in the past 2 weeks on any given date, so basically "occurrences between date - 14 days and date". I'm trying to categorize users by their number of sessions in the past 2 weeks, and I'm following them by daily cohorts.
This query does not work since there can be days when the user does not log in aka does not have a row:
COUNT (distinct id) OVER (PARTITION BY id ORDER BY date ROWS BETWEEN 14 PRECEDING AND 0 FOLLOWING)
Unfortunately, Presto does not support range() window functions. One method is a self-join/aggregation or correlated subquery:
select t.id, t.date, count(tprev.id)
from t left join
t tprev
on tprev.id = t.id and
tprev.date > t.date - interval '14' day and
tprev.date <= t.date
group by t.id, t.date;
This interprets your request as wanting 14 days of data, including the current day.
Another method that is much more verbose but might be faster is to use lag() . . . and lag() again:
select t.id,
(1 + -- current date
(case when lag(date, 1) over (partition by id order by date) > date - interval '14' day then 1 else 0 end) +
(case when lag(date, 2) over (partition by id order by date) > date - interval '14' day then 1 else 0 end) +
. . .
(case when lag(date, 13) over (partition by id order by date) > date - interval '14' day then 1 else 0 end)
) as cnt_14
from t;
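
As a cross-check of what both queries are meant to compute, here is a small Python sketch that counts, for each (id, date) row, that user's rows in the 14 days ending on that date (current day included). The function name `counts_last_14_days` is hypothetical, and reading "1.1.20" as day.month.year is an assumption about the sample data.

```python
from datetime import date, timedelta

def counts_last_14_days(rows, days=14):
    """rows: iterable of (user_id, date). Returns {(user_id, date): count}."""
    by_user = {}
    for user_id, d in rows:
        by_user.setdefault(user_id, []).append(d)
    result = {}
    for user_id, dates in by_user.items():
        for d in dates:
            # Window is (d - 14 days, d]: 14 calendar days including d itself.
            start = d - timedelta(days=days)
            result[(user_id, d)] = sum(1 for other in dates if start < other <= d)
    return result

# Sample rows from the question (dates assumed to be day.month.year).
rows = [(1, date(2020, 1, 1)), (3, date(2020, 1, 4)), (2, date(2020, 1, 4)),
        (1, date(2020, 1, 5)), (6, date(2020, 1, 2))]
counts = counts_last_14_days(rows)
print(counts[(1, date(2020, 1, 5))])  # 2
```

Days on which the user has no row simply do not appear in the output, which is also how the SQL versions behave.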

Calculating Aggregates on subset of data based on condition

I have a DB as follows:
| company | timestamp | value |
| ------- | ---------- | ----- |
| google | 2020-09-01 | 5 |
| google | 2020-08-01 | 4 |
| amazon | 2020-09-02 | 3 |
I'd like to calculate the average value for each company within the last year if there are >= 20 datapoints. If there are fewer than 20 datapoints, then I'd like the average over the entire time range. I know I can run two separate queries and get the averages for each scenario. The question, I suppose, is how to merge them back into a single table based on these criteria.
select company, avg(value) from my_db GROUP BY company;
select company, avg(value) from my_db
where timestamp > (CURRENT_DATE - INTERVAL '12 months')
GROUP BY company;
WITH last_year AS (
SELECT company, avg(value), 'year' AS range -- optional tag
FROM tbl
WHERE timestamp >= now() - interval '1 year'
GROUP BY 1
HAVING count(*) >= 20 -- 20+ rows in range
)
SELECT company, avg(value), 'all' AS range
FROM tbl t
WHERE NOT EXISTS (SELECT FROM last_year WHERE company = t.company)
GROUP BY 1
UNION ALL TABLE last_year;
An index on (timestamp) will only be used if your table is big and holds many years.
If most companies have 20+ rows in range, an index on (company) will be used for the 2nd SELECT to retrieve the few outliers.
Use conditional aggregation:
select company,
case
when sum(case when timestamp > CURRENT_DATE - INTERVAL '12 months' then value end) >= 20 then
avg(case when timestamp > CURRENT_DATE - INTERVAL '12 months' then value end)
else avg(value)
end
from my_db
group by company
If by 20 datapoints you mean 20 rows in the last 12 months for each company, then:
select company,
case
when count(case when timestamp > CURRENT_DATE - INTERVAL '12 months' then value end) >= 20 then
avg(case when timestamp > CURRENT_DATE - INTERVAL '12 months' then value end)
else avg(value)
end
from my_db
group by company
You can use window functions to provide the information for filtering:
select company, avg(value),
(count(*) = cnt_this_year) as only_this_year
from (select t.*,
count(*) filter (where date_trunc('year', datecol) = date_trunc('year', now())) over (partition by company) as cnt_this_year
from t
) t
where cnt_this_year >= 20 and date_trunc('year', datecol) = date_trunc('year', now()) or
cnt_this_year < 20
group by company;
The third column specifies if all the rows are from this year. By filtering in the where clause, it is simple to add other calculations as well (such as min(), max(), and so on).
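
The selection rule shared by these answers (use the recent rows if there are enough of them, otherwise use all rows) can be sketched in Python as follows. The function name `company_averages` is hypothetical, and "last 12 months" is approximated as 365 days, which is an assumption made only for this illustration.

```python
from datetime import date, timedelta
from statistics import mean

def company_averages(rows, today, min_points=20):
    """rows: iterable of (company, date, value). Returns {company: average}."""
    cutoff = today - timedelta(days=365)  # assumption: 12 months ~ 365 days
    by_company = {}
    for company, day, value in rows:
        by_company.setdefault(company, []).append((day, value))
    result = {}
    for company, points in by_company.items():
        recent = [v for d, v in points if d > cutoff]
        # Use only the recent rows when there are enough of them,
        # otherwise fall back to the company's whole history.
        chosen = recent if len(recent) >= min_points else [v for _, v in points]
        result[company] = mean(chosen)
    return result

# Sample rows from the question; neither company reaches 20 recent rows,
# so both averages are taken over all rows.
rows = [("google", date(2020, 9, 1), 5),
        ("google", date(2020, 8, 1), 4),
        ("amazon", date(2020, 9, 2), 3)]
print(company_averages(rows, today=date(2020, 9, 15)))
```

This corresponds to the count-based interpretation of "20 datapoints", not the sum-based one.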

Finding multiple consecutive dates (datetime) in Ruby on Rails / Postgresql

How can we find X consecutive dates (using by hour) that meet a condition?
EDIT: here is the SQL fiddle http://sqlfiddle.com/#!17/44928/1
Example:
Find 3 consecutive dates where aa < 2 and bb < 6 and cc < 7
Given this table called weather:
timestamp | aa | bb | cc
------------------------------
01/01/2000 00:00 | 1 | 5 | 5
01/01/2000 01:00 | 5 | 5 | 5
01/01/2000 02:00 | 1 | 5 | 5
01/01/2000 03:00 | 1 | 5 | 5
01/01/2000 04:00 | 1 | 5 | 5
01/01/2000 05:00 | 1 | 5 | 5
Answer should return the 3 records from 02:00, 03:00, 04:00.
How can we do this in Ruby on Rails - or directly in SQL if that is better?
I started working on a method based on this answer:
Detect consecutive dates ranges using SQL
def consecutive_dates
the_query = "WITH t AS (
SELECT timestamp d,ROW_NUMBER() OVER(ORDER BY timestamp) i
FROM #d
GROUP BY timestamp
)
SELECT MIN(d),MAX(d)
FROM t
GROUP BY DATEDIFF(hour,i,d)"
ActiveRecord::Base.connection.execute(the_query)
end
But I was unable to get it working.
Assuming that you have one row every hour, then an easy way to get the first hour where this occurs uses lead():
select t.*
from (select t.*,
lead(timestamp, 2) over (order by timestamp) as timestamp_2
from t
where aa < 2 and bb < 6 and cc < 7
) t
where timestamp_2 = timestamp + interval '2 hour';
This filters on the conditions and looks at the row two rows ahead. If that row is exactly two hours later, then three consecutive rows match the conditions. Note: The above will return both 2000-01-01 02:00 and 2000-01-01 03:00.
From your question you only seem to want the earliest. To handle that, use lag() as well:
select t.*
from (select t.*,
lag(timestamp) over (order by timestamp) as prev_timestamp,
lead(timestamp, 2) over (order by timestamp) as timestamp_2
from t
where aa < 2 and bb < 6 and cc < 7
) t
where timestamp_2 = timestamp + interval '2 hour' and
(prev_timestamp is null or prev_timestamp < timestamp - interval '1' hour);
You can generate the additional hours using generate_series() if you really need the original rows:
select t.timestamp + n.n * interval '1 hour', aa, bb, cc
from (select t.*,
lead(timestamp, 2) over (order by timestamp) as timestamp_2
from t
where aa < 2 and bb < 6 and cc < 7
) t cross join lateral
generate_series(0, 2) n
where timestamp_2 = timestamp + interval '2 hour';
Your data seems to have precise timestamps based on the question, so the timestamp equalities will work. If the real data has more fuzziness, then the queries can be tweaked to take this into account.
This is a gaps-and-islands problem. Islands are adjacent records that match the condition, and you want islands that are at least 3 records long.
Here is one approach that uses a window count that increments every time a value that does not match the condition is met, which defines the groups. We can then count how many rows there are in each group, and use that information to filter.
select *
from (
select t.*, count(*) over(partition by a, grp) cnt
from (
select t.*,
count(*) filter(where b <= 4) over(partition by a order by timestamp) grp
from mytable t
) t
) t
where cnt >= 3
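
The gaps-and-islands idea can also be sketched outside SQL. The Python below groups consecutive rows by whether they satisfy the condition and keeps the runs that are long enough. It uses the weather data and the condition from the question; the names `matching_runs` and `is_match` are hypothetical, and it assumes there is exactly one row per hour with no missing hours.

```python
from datetime import datetime, timedelta
from itertools import groupby

def matching_runs(rows, predicate, min_len=3):
    """rows must be sorted by timestamp. Returns the runs (islands) of
    consecutive rows that satisfy `predicate` and have at least `min_len` rows."""
    runs = []
    # groupby starts a new group every time the predicate's value changes.
    for matched, group in groupby(rows, key=predicate):
        run = list(group)
        if matched and len(run) >= min_len:
            runs.append(run)
    return runs

# Weather table from the question: one row per hour starting at midnight.
start = datetime(2000, 1, 1)
values = [(1, 5, 5), (5, 5, 5), (1, 5, 5), (1, 5, 5), (1, 5, 5), (1, 5, 5)]
weather = [(start + timedelta(hours=h), aa, bb, cc)
           for h, (aa, bb, cc) in enumerate(values)]

def is_match(row):
    _, aa, bb, cc = row
    return aa < 2 and bb < 6 and cc < 7

for run in matching_runs(weather, is_match):
    print([ts.strftime("%H:%M") for ts, *_ in run])
```

Note that this returns the whole island (02:00 through 05:00 for the sample data), not only its first three rows.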

Max value by ID, date and last x days

Suppose I have a table:
id | date | value
------------------
1 | Jan 1 | 10
1 | Jan 2 | 12
1 | Jan 3 | 11
2 | Jan 4 | 11
I need the max and median value for each id and each date, computed over the past 90 days. I'm using this query:
select id, date, value,
max(value) over (partition by id, date) as max_date,
median(value) over (partition by id, date) as med_date
from table
where date > date - interval '90 days'
I tried exporting the data and checking it manually, but the result is not correct. Is there anything I missed? Thanks.
The expected output is the maximum value over the last 90 days. For example, if the date is April 5th, it should find the maximum value from Jan 5th (the start of the last 90 days) until April 5th. When the date moves to April 6th, it should do the same for Jan 6th until April 6th, and so on for each ID.
So I'm assuming you can get several values for the same ID and date, right? Otherwise partitioning by both id and date makes no sense.
SELECT id, date, max(value), avg(value) from table where date > date - interval '90 days'
group by id, date
'group by' does the partitioning
Why are you using window functions? This seems to do what you describe:
select id,
max(value) as max_date,
percentile_disc(0.5) within group (order by value) as median_value
from table
where date > current_date - interval '90 days'
group by id;
If you want this per date, use window functions:
select t.*
from (select t.*,
max(value) over (order by date range between '89 day' preceding and current row) as running_max_value,
percentile_disc(0.5) within group (order by value) over (order by date range between '89 day' preceding and current row) as running_median_value
from t
) t
where date > date - interval '90 days';
The filter is in the outer query so the preceding period can go back further in time.
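
To check results by hand, as the question attempts, a brute-force Python sketch of the per-date calculation can help. For every row it collects that id's values from the 90 days ending on the row's date and reports the max and the median. The function name `rolling_stats` is hypothetical, the year 2021 is an assumption (the question gives only "Jan 1"), and `median_low` is used because it picks the lower middle value, which is how percentile_disc(0.5) behaves for an even number of rows.

```python
from datetime import date, timedelta
from statistics import median_low

def rolling_stats(rows, days=90):
    """rows: list of (id, date, value).
    Returns one (id, date, max, median) tuple per input row."""
    out = []
    for id_, d, _ in rows:
        # 89 days preceding plus the current day = 90 days in total.
        start = d - timedelta(days=days - 1)
        window = [v for i, dd, v in rows if i == id_ and start <= dd <= d]
        out.append((id_, d, max(window), median_low(window)))
    return out

# Sample rows from the question (year assumed).
rows = [(1, date(2021, 1, 1), 10), (1, date(2021, 1, 2), 12),
        (1, date(2021, 1, 3), 11), (2, date(2021, 1, 4), 11)]
for row in rolling_stats(rows):
    print(row)
```

This is quadratic in the number of rows, so it is only meant for verifying a sample, not for production use.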

Redshift SQL Window Function frame_clause with days

I am trying to perform a window function on a data-set in Redshift using days as an interval for the preceding rows.
Example data:
date ID score
3/1/2017 123 1
3/1/2017 555 1
3/2/2017 123 1
3/3/2017 555 3
3/5/2017 555 2
SQL window function for avg score from the last 3 scores:
select
date,
id,
avg(score) over
(partition by id order by date rows
between 2 preceding and
current row) LAST_3_SCORES_AVG
from DATASET
Result:
date ID LAST_3_SCORES_AVG
3/1/2017 123 1
3/1/2017 555 1
3/2/2017 123 1
3/3/2017 555 2
3/5/2017 555 2
The problem is that I would like the average score from the last 3 DAYS (moving average) and not the last three tests. I have gone over the Redshift and PostgreSQL documentation and can't seem to find any way of doing it.
Desired Result:
date ID 3_DAY_AVG
3/1/2017 123 1
3/1/2017 555 1
3/2/2017 123 1
3/3/2017 555 2
3/5/2017 555 2.5
Any direction would be appreciated.
You can use lag() and explicitly calculate the average.
select t.*,
(score +
(case when lag(date, 1) over (partition by id order by date) >=
date - interval '2 day'
then lag(score, 1) over (partition by id order by date)
else 0
end) +
(case when lag(date, 2) over (partition by id order by date) >=
date - interval '2 day'
then lag(score, 2) over (partition by id order by date)
else 0
end)
) /
(1 +
(case when lag(date, 1) over (partition by id order by date) >=
date - interval '2 day'
then 1
else 0
end) +
(case when lag(date, 2) over (partition by id order by date) >=
date - interval '2 day'
then 1
else 0
end)
)
from dataset t;
The following approach could be used instead of the RANGE window option in a lot of (or all) cases.
You can introduce "expiry" for each of the input records. The expiry record would negate the original one, so when you aggregate all preceding records, only the ones in the desired range will be considered.
AVG is a bit harder as it doesn't have a direct opposite, so we need to think of it as SUM/COUNT and negate both.
SELECT id, date, running_avg_score
FROM
(
SELECT id, date, n,
SUM(score) OVER (PARTITION BY id ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
/ NULLIF(SUM(n) OVER (PARTITION BY id ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), 0) as running_avg_score
FROM
(
SELECT date, id, score, 1 as n
FROM DATASET
UNION ALL
-- expiry and negate
SELECT DATEADD(DAY, 3, date), id, -1 * score, -1
FROM DATASET
) u
) a
WHERE a.n = 1
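
The expiry trick can be traced step by step in Python. Each input row produces two events: the row itself and a negating event three days later. A running sum and running count over the sorted events then give the moving average at every real row. The function name `moving_avg_with_expiry` is hypothetical, and the sketch assumes at most one row per id and date. It also sorts expiry events ahead of real rows that fall on the same date, which is an explicit choice here and something the SQL above does not guarantee.

```python
from datetime import date, timedelta

def moving_avg_with_expiry(rows, days=3):
    """rows: list of (date, id, score). Returns {(id, date): moving average}."""
    events = []
    for d, id_, score in rows:
        events.append((id_, d, 1, score, 1))                            # real row
        events.append((id_, d + timedelta(days=days), 0, -score, -1))   # expiry
    # Sort by id, then date; kind 0 (expiry) sorts before kind 1 (real row).
    events.sort()
    totals = {}
    result = {}
    for id_, d, kind, score, n in events:
        running_sum, running_n = totals.get(id_, (0, 0))
        running_sum, running_n = running_sum + score, running_n + n
        totals[id_] = (running_sum, running_n)
        if n == 1:  # report only at real rows, like WHERE a.n = 1
            result[(id_, d)] = running_sum / running_n
    return result

# Example data from the question.
rows = [(date(2017, 3, 1), 123, 1), (date(2017, 3, 1), 555, 1),
        (date(2017, 3, 2), 123, 1), (date(2017, 3, 3), 555, 3),
        (date(2017, 3, 5), 555, 2)]
averages = moving_avg_with_expiry(rows)
print(averages[(555, date(2017, 3, 5))])  # 2.5
```

For the sample data this reproduces the desired 3_DAY_AVG column, including 2.5 for ID 555 on 3/5/2017.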