PostgreSQL window function: partition by comparison

I'm trying to find a way to do a comparison with the current row in the PARTITION BY clause of a window function in a PostgreSQL query.
Imagine I have the short list of these 5 elements in the following query (in the real case, I have thousands or even millions of rows). For each row, I am trying to get the id of the next different element (event column) and the id of the previous different element.
WITH events AS(
SELECT 1 as id, 12 as event, '2014-03-19 08:00:00'::timestamp as date
UNION SELECT 2 as id, 12 as event, '2014-03-19 08:30:00'::timestamp as date
UNION SELECT 3 as id, 13 as event, '2014-03-19 09:00:00'::timestamp as date
UNION SELECT 4 as id, 13 as event, '2014-03-19 09:30:00'::timestamp as date
UNION SELECT 5 as id, 12 as event, '2014-03-19 10:00:00'::timestamp as date
)
SELECT lag(id) over w as previous_different, event
, lead(id) over w as next_different
FROM events ev
WINDOW w AS (PARTITION BY event!=ev.event ORDER BY date ASC);
I know the comparison event!=ev.event is incorrect but that's the point I want to reach.
The result I get is (the same as if I delete the PARTITION BY clause):
|12|2
1|12|3
2|13|4
3|13|5
4|12|
And the result I would like to get is:
|12|3
|12|3
2|13|5
2|13|5
4|12|
Does anyone know whether this is possible, and how? Thank you very much!
EDIT: I know I can do it with two JOINs, an ORDER BY and a DISTINCT ON, but in the real case of millions of rows it is very inefficient:
WITH events AS(
SELECT 1 as id, 12 as event, '2014-03-19 08:00:00'::timestamp as date
UNION SELECT 2 as id, 12 as event, '2014-03-19 08:30:00'::timestamp as date
UNION SELECT 3 as id, 13 as event, '2014-03-19 09:00:00'::timestamp as date
UNION SELECT 4 as id, 13 as event, '2014-03-19 09:30:00'::timestamp as date
UNION SELECT 5 as id, 12 as event, '2014-03-19 10:00:00'::timestamp as date
)
SELECT DISTINCT ON (e.id, e.date) e1.id, e.event, e2.id
FROM events e
LEFT JOIN events e1 ON (e1.date<=e.date AND e1.id!=e.id AND e1.event!=e.event)
LEFT JOIN events e2 ON (e2.date>=e.date AND e2.id!=e.id AND e2.event!=e.event)
ORDER BY e.date ASC, e.id ASC, e1.date DESC, e1.id DESC, e2.date ASC, e2.id ASC

Using several different window functions and two subqueries, this should work decently fast:
WITH events(id, event, ts) AS (
   VALUES
     (1, 12, '2014-03-19 08:00:00'::timestamp)
   , (2, 12, '2014-03-19 08:30:00')
   , (3, 13, '2014-03-19 09:00:00')
   , (4, 13, '2014-03-19 09:30:00')
   , (5, 12, '2014-03-19 10:00:00')
   )
SELECT first_value(pre_id)  OVER (PARTITION BY grp ORDER BY ts)      AS pre_id
     , id, ts
     , first_value(post_id) OVER (PARTITION BY grp ORDER BY ts DESC) AS post_id
FROM  (
   SELECT *, count(step) OVER w AS grp
   FROM  (
      SELECT id, ts
           , NULLIF(lag(event) OVER w, event) AS step
           , lag(id)  OVER w AS pre_id
           , lead(id) OVER w AS post_id
      FROM   events
      WINDOW w AS (ORDER BY ts)
      ) sub1
   WINDOW w AS (ORDER BY ts)
   ) sub2
ORDER  BY ts;
Using ts as name for the timestamp column.
Assuming ts to be unique - and indexed (a unique constraint does that automatically).
In a test with a real life table with 50k rows it only needed a single index scan. So, should be decently fast even with big tables. In comparison, your query with join / distinct did not finish after a minute (as expected).
Even an optimized version, dealing with one cross join at a time (the left join with hardly a limiting condition is effectively a limited cross join) did not finish after a minute.
For best performance with a big table, tune your memory settings, in particular for work_mem (for big sort operations). Consider setting it (much) higher for your session temporarily if you can spare the RAM. Read more here and here.
How?
In subquery sub1 look at the event from the previous row and only keep that if it has changed, thus marking the first element of a new group. At the same time, get the id of the previous and the next row (pre_id, post_id).
In subquery sub2, count() only counts non-null values. The resulting grp marks peers in blocks of consecutive same events.
In the final SELECT, take the first pre_id and the last post_id per group for each row to arrive at the desired result.
Actually, this should be even faster in the outer SELECT:
last_value(post_id) OVER (PARTITION BY grp ORDER BY ts
RANGE BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING) AS post_id
... since the sort order of the window agrees with the window for pre_id, so only a single sort is needed. A quick test seems to confirm it. More about this frame definition.
SQL Fiddle.
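The three steps described under "How?" can also be traced outside the database. The following is a minimal Python sketch, not part of the original answer, that mirrors sub1, sub2 and the outer SELECT on the five sample rows. The function name neighbours_of_different_event and the shortened ts strings are assumptions made for illustration only.

```python
from itertools import accumulate

# Sample rows from the question: (id, event, ts); ts shortened to HH:MM.
events = [
    (1, 12, "08:00"),
    (2, 12, "08:30"),
    (3, 13, "09:00"),
    (4, 13, "09:30"),
    (5, 12, "10:00"),
]

def neighbours_of_different_event(rows):
    """Return (pre_id, id, post_id) per row, ordered by ts."""
    rows = sorted(rows, key=lambda r: r[2])            # ORDER BY ts
    n = len(rows)
    # sub1: mark rows whose event differs from the previous row
    # (the SQL keeps a non-null value there via NULLIF).
    step = [0] + [int(rows[i][1] != rows[i - 1][1]) for i in range(1, n)]
    # sub2: running count of markers = number of the block of equal events.
    grp = list(accumulate(step))
    pre_id = [None] + [r[0] for r in rows[:-1]]        # lag(id)
    post_id = [r[0] for r in rows[1:]] + [None]        # lead(id)
    # Outer SELECT: first pre_id and last post_id within each block.
    first_pre, last_post = {}, {}
    for g, pre, post in zip(grp, pre_id, post_id):
        first_pre.setdefault(g, pre)
        last_post[g] = post
    return [(first_pre[g], r[0], last_post[g]) for g, r in zip(grp, rows)]

result = neighbours_of_different_event(events)
for row in result:
    print(row)
```

On the sample data this yields the rows the question asks for: no previous id for the first block, 2 and 5 around the block of event 13, and no next id for the last row.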

Related

Efficiently join latest entry of first table to second table depending on entity characteristics of first table

I'm looking for an efficient way to join two tables, with the caveat that characteristics of the second table should determine what is joined to the first table (in Google BigQuery).
Let's say I have two tables: one with events (id, session, event_date) and a second with policies applying to events (event_id, policy, create_date). I want to determine which policy applied to an event based on the policy create date and the event date.
CREATE TEMP TABLE events AS (
SELECT *
FROM UNNEST([
STRUCT(1 AS id, "A" AS session, "2021-11-05" AS event_date),
(1, "B", "2021-12-17"),
(2, "A", "2021-08-13")
])
);
CREATE TEMP TABLE policies AS (
SELECT *
FROM UNNEST([
STRUCT(1 AS event_id, "foo" AS policy, "2021-01-01" AS create_date),
(1, "bar", "2021-12-01"),
(2, "foo", "2021-02-01")
])
)
In my example, the result should look like this if I get the latest policy_create_date that was in existence by the time of the event (event_date).
id  session  policy_create_date
1   A        2021-01-01
1   B        2021-12-01
2   A        2021-02-01
The following solution provides the result I want, but it creates an N:N JOIN and can become quite big and computationally intensive if both tables get large (especially if I have many of the same events and many policy changes). Hence, I'm looking for a solution that is more efficient than the one below and avoids the N:N JOIN.
SELECT
e.id,
e.session,
MAX(p.create_date) AS policy_create_date -- get latest policy amongst all policies for an event_id that existed before the session took place
FROM events e
INNER JOIN policies p
ON e.id = p.event_id -- match event and policy based on event_id
AND p.create_date < e.event_date -- match only policies that existed before the session of the event took place
GROUP BY 1, 2
TY!!!
Edit: I adjusted the known but inefficient solution to better reflect my goal. Of course, I want the policy in the end, but that is not in focus here.

You can try the window function
WITH cte AS (
SELECT e.id, e.session, p.policy
, row_number() over(partition by e.id, e.session order by p.create_date desc) rn
FROM events e
INNER JOIN policies p
ON e.id = p.event_id AND p.create_date < e.event_date
)
SELECT c.id, c.session, c.policy
FROM cte c
where rn=1

I have tried the following code on Postgres, but there shouldn't be anything in there that is postgres specific.
Your query can be reorganised using a subquery to:
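SELECT *
FROM (
   SELECT
      e.id,
      e.session,
      -- latest policy date that existed before the event took place
      (SELECT MAX(p.create_date) FROM policies AS p WHERE e.id = p.event_id AND p.create_date < e.event_date) AS policy_create_date
   FROM events e
) AS sub
-- a column alias cannot be referenced in the WHERE of the same SELECT, hence the wrapper
WHERE policy_create_date IS NOT NULL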
While this query should show similar performance, it makes it easier to spot the problem with the overall query: while finding the MAX, the database has already found and read the row from policies with the highest date, but you are not getting the value of the policy column out. So, you need to do a second join.
Using a lateral join you can get the complete relevant row from policies in one go.
SELECT
e.id,
e.session,
p2.policy,
p2.create_date
FROM events AS e
INNER JOIN LATERAL
(SELECT
*
FROM policies AS p
WHERE e.id = p.event_id AND p.create_date < e.event_date
ORDER BY p.create_date DESC
LIMIT 1) AS p2
ON TRUE;
This should use an index on policies. So, time should increase linearly with the size of events and logarithmically with the size of policies.
Nevertheless, you can't expect great performance when you do this for large result sets, because there will be lots of cache misses while accessing the policies table.
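To make the cost argument concrete, here is a minimal Python sketch of the same per-event lookup. It assumes that a sorted in-memory list per event_id stands in for an index on policies (event_id, create_date); the names policy_index and latest_policy_before are hypothetical and not part of the original answer.

```python
from bisect import bisect_left
from collections import defaultdict

# Sample data from the question; ISO date strings compare correctly as text.
events = [(1, "A", "2021-11-05"), (1, "B", "2021-12-17"), (2, "A", "2021-08-13")]
policies = [(1, "foo", "2021-01-01"), (1, "bar", "2021-12-01"), (2, "foo", "2021-02-01")]

# Stand-in for the index: one list per event_id, sorted by create_date.
policy_index = defaultdict(list)
for event_id, policy, create_date in policies:
    policy_index[event_id].append((create_date, policy))
for entries in policy_index.values():
    entries.sort()

def latest_policy_before(event_id, event_date):
    """Newest (create_date, policy) created strictly before event_date,
    or None; this plays the role of ORDER BY create_date DESC LIMIT 1."""
    entries = policy_index.get(event_id, [])
    # First position whose create_date is >= event_date; every entry
    # before it satisfies create_date < event_date.
    pos = bisect_left(entries, (event_date,))
    return entries[pos - 1] if pos else None

lateral_result = []
for event_id, session, event_date in events:           # linear in events
    hit = latest_policy_before(event_id, event_date)   # binary search in policies
    if hit is not None:                                # INNER JOIN drops misses
        lateral_result.append((event_id, session, hit[1], hit[0]))

print(lateral_result)
```

Each event costs one binary search, which is where the "linear in events, logarithmic in policies" estimate comes from.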

Another option is to interleave the two tables, then use LAST_VALUE() to look back to find the policy data...
WITH
interleave AS
(
SELECT
id AS event_id,
event_date AS event_date,
session AS event_session,
NULL AS policy_label,
NULL AS policy_date
FROM
events
UNION ALL
SELECT
event_id,
create_date,
NULL,
policy,
create_date
FROM
policies
),
lookback AS
(
SELECT
event_id,
event_session,
event_date,
LAST_VALUE(policy_label IGNORE NULLS) OVER event_order AS policy_label,
LAST_VALUE(policy_date IGNORE NULLS) OVER event_order AS policy_date
FROM
interleave
WINDOW
event_order AS (
PARTITION BY event_id
ORDER BY event_date,
event_session NULLS FIRST
ROWS BETWEEN UNBOUNDED PRECEDING
AND 1 PRECEDING
)
)
SELECT
event_id,
event_session,
event_date,
policy_label,
policy_date
FROM
lookback
WHERE
event_session IS NOT NULL
This presumes that the events table is vastly larger than the policies table.
I'd also recommend ensuring the tables are partitioned by the event_id and clustered by their respective date column.
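The interleave-and-look-back idea can be sketched in plain Python as a single sorted pass. This is only an illustration under assumptions: the function name interleave_lookback and the numeric kind tag are hypothetical, and the tag stands in for the NULLS FIRST ordering on event_session. As in the query above, a policy created on the same date as an event sorts before that event and is therefore picked up, which differs from the strict comparison in the question's own query.

```python
# Sample data from the question.
events = [(1, "A", "2021-11-05"), (1, "B", "2021-12-17"), (2, "A", "2021-08-13")]
policies = [(1, "foo", "2021-01-01"), (1, "bar", "2021-12-01"), (2, "foo", "2021-02-01")]

def interleave_lookback(events, policies):
    # UNION ALL of both tables; kind 0 = policy row, kind 1 = event row,
    # so policies sort first on equal (event_id, date).
    rows = [(eid, date, 0, None, policy) for eid, policy, date in policies]
    rows += [(eid, date, 1, session, None) for eid, session, date in events]
    rows.sort(key=lambda r: r[:3])

    out = []
    current_id, last_policy = None, (None, None)
    for eid, date, kind, session, policy in rows:
        if eid != current_id:                  # PARTITION BY event_id
            current_id, last_policy = eid, (None, None)
        if kind == 0:
            last_policy = (policy, date)       # remember the latest policy seen
        else:
            # LAST_VALUE(... IGNORE NULLS) over the preceding rows
            out.append((eid, session, date) + last_policy)
    return out

lookback = interleave_lookback(events, policies)
print(lookback)
```

The work is dominated by the sort; the look-back itself is one linear scan with no join.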

Another option is to use LEAD() to find a policy's "expiry" date, then use that in the join...
WITH
policy_range AS
(
SELECT
event_id,
policy,
create_date,
LEAD(create_date, 1, DATE '9999-12-31') OVER event_order AS expiry_date
FROM
policies
WINDOW
event_order AS (
PARTITION BY event_id
ORDER BY create_date
)
)
SELECT
e.id,
e.session,
e.event_date,
p.policy,
p.create_date
FROM
policy_range AS p
INNER JOIN
events AS e
ON e.id = p.event_id
AND e.event_date > p.create_date
AND e.event_date <= p.expiry_date
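A small Python sketch of the same idea, assuming the sample data from the question: each policy gets an expiry date taken from the next policy of the same event_id, and events are then matched against the resulting ranges. The function name policy_ranges and the far_future parameter are hypothetical; the nested loop is only for readability and is not meant to reflect how the database executes the join.

```python
from collections import defaultdict

events = [(1, "A", "2021-11-05"), (1, "B", "2021-12-17"), (2, "A", "2021-08-13")]
policies = [(1, "foo", "2021-01-01"), (1, "bar", "2021-12-01"), (2, "foo", "2021-02-01")]

def policy_ranges(policies, far_future="9999-12-31"):
    """Attach an expiry date to each policy, like LEAD(create_date, 1, default)."""
    by_id = defaultdict(list)
    for event_id, policy, created in sorted(policies, key=lambda p: (p[0], p[2])):
        by_id[event_id].append((policy, created))
    ranges = []
    for event_id, items in by_id.items():
        for i, (policy, created) in enumerate(items):
            expiry = items[i + 1][1] if i + 1 < len(items) else far_future
            ranges.append((event_id, policy, created, expiry))
    return ranges

ranges = policy_ranges(policies)

# Join condition from the query: create_date < event_date <= expiry_date.
matches = [
    (eid, session, event_date, policy, created)
    for eid, session, event_date in events
    for pid, policy, created, expiry in ranges
    if pid == eid and created < event_date <= expiry
]
print(matches)
```

Because the ranges of one event_id do not overlap, every event matches at most one policy.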

Hive: why to use partition by in selects?

I cannot understand partitioning concept in Hive completely.
I understand what partitions are and how to create them. What I cannot get is why people write select statements that have a "partition by" clause, like it is done here: SQL most recent using row_number() over partition
SELECT user_id, page_name, recent_click
FROM (
SELECT user_id,
page_name,
row_number() over (partition by session_id order by ts desc) as recent_click
from clicks_data
) T
WHERE recent_click = 1
Why specify the partition key in selects? The partition key was defined during table creation anyway, and the select statement will use the partition scheme that was defined in the Create Table statement. So why add that over (partition by session_id order by ts desc)?
What if I skip over (partition by session_id order by ts desc)?

Read about Hive Windowing and Analytics Functions.
row_number() is an analytic function that numbers rows and requires over().
In the over() you can specify for which group (partition) it will be calculated.
partition by in the over() is not the same as partitioned by in the create table DDL; the two have nothing in common. In create table it defines how the data is stored (each partition is a separate folder in Hive), and a partitioned table is used to optimize filtering or loading data.
partition by in the over() determines the group in which the function is calculated. It is similar to GROUP BY in the select, but the difference is that an analytic function does not change the number of rows.
row_number re-initializes when it crosses a partition boundary and starts again from 1.
Also, row_number needs order by in the over(). order by determines the order in which rows will be numbered.
If you do not specify partition by, row_number will work on the whole dataset as a single partition. It will produce a single 1, and the maximum number will be equal to the number of rows in the whole dataset. Table partitioning does not affect analytic function behavior.
If you do not specify order by, row_number will number rows in a non-deterministic order, and different rows will probably be marked 1 from run to run. This is why you need to specify order by. In your example, order by ts desc means that 1 will be assigned to the row with the max ts (for each session_id).
Say there are three different session_id values and three clicks in each session with different ts (9 rows in total). Then row_number in your example will assign 1 to the last click of each session, and after filtering recent_click = 1 you will get 3 rows instead of the initial 9. row_number() over() without partition by will number all rows from 1 to 9 in an arbitrary order (which may differ from run to run), and the same filtering will give you a single row, which may come from any of the 3 sessions.
See also this answer https://stackoverflow.com/a/55909947/2700344 for more details how it works in Hive, there is also similar question about table partition vs over() in the comments.
Try this example, it may be better than reading too long explanation:
with clicks_data as (
select stack (9,
--session1
1, 1, 'page1', '2020-01-01 01:01:01.123',
1, 1, 'page1', '2020-01-01 01:01:01.124',
1, 1, 'page2', '2020-01-01 01:01:01.125',
--session2
1, 2, 'page1', '2020-01-01 01:02:02.123',
1, 2, 'page2', '2020-01-01 01:02:02.124',
1, 2, 'page1', '2020-01-01 01:02:02.125',
--session 3
1, 3, 'page1', '2020-01-01 01:03:01.123',
1, 3, 'page2', '2020-01-01 01:03:01.124',
1, 3, 'page1', '2020-01-01 01:03:01.125'
) as(user_id, session_id, page_name, ts)
)
SELECT
user_id
,session_id
,page_name
,ts
,ROW_NUMBER() OVER (PARTITION BY session_id ORDER BY ts DESC) AS rn1
,ROW_NUMBER() OVER() AS rn2
FROM clicks_data
Result:
user_id session_id page_name ts rn1 rn2
1 2 page1 2020-01-01 01:02:02.125 1 1
1 2 page2 2020-01-01 01:02:02.124 2 2
1 2 page1 2020-01-01 01:02:02.123 3 3
1 1 page2 2020-01-01 01:01:01.125 1 4
1 1 page1 2020-01-01 01:01:01.124 2 5
1 1 page1 2020-01-01 01:01:01.123 3 6
1 3 page1 2020-01-01 01:03:01.125 1 7
1 3 page2 2020-01-01 01:03:01.124 2 8
1 3 page1 2020-01-01 01:03:01.123 3 9
The first row_number assigned 1 to the row with the max timestamp in each session (partition). The second row_number, with no partition and no order specified, numbered all rows from 1 to 9. Why did rn2 = 1 go to the row with the max timestamp in session 2, if the order is supposed to be random? To calculate the first row_number, all rows were distributed by session_id and ordered by timestamp descending. The second row_number happened to receive session 2 first (the reducer read it before the other two files prepared by the mapper), and because the rows were already sorted for rn1, rn2 received them in the same order. Without the first row_number, the order could be "more random". The bigger the dataset, the more random the rn2 order will look.

Custom aggregate function in PostgreSQL

Is it possible to write an aggregate function in PostgreSQL that will calculate a delta value by subtracting the initial value (last value in the column) from the current one (first value in the column)?
It would apply on a structure like this
rankings (userId, rank, timestamp)
And could be used like
SELECT userId, custom_agg(rank) OVER w
FROM rankings
WINDOW w AS (PARTITION BY userId ORDER BY timestamp DESC)
returning for an userId the rank of the newest entry (by timestamp) - rank of the oldest entry (by timestamp)
Thanks!

the rank of the newest entry (by timestamp) - rank of the oldest entry (by timestamp)
There are many ways to achieve this with existing functions.
You can use the existing window functions first_value() and last_value(), combined with DISTINCT or DISTINCT ON to get it without joins and subqueries:
SELECT DISTINCT ON (userid)
userid
, last_value(rank) OVER w
- first_value(rank) OVER w AS rank_delta
FROM rankings
WINDOW w AS (PARTITION BY userid ORDER BY ts
ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING);
Note the custom frames for the window functions!
Or you can use basic aggregate functions in a subquery and JOIN:
SELECT userid, r2.rank - r1.rank AS rank_delta
FROM (
SELECT userid
, min(ts) AS first_ts
, max(ts) AS last_ts
FROM rankings
GROUP BY 1
) sub
JOIN rankings r1 USING (userid)
JOIN rankings r2 USING (userid)
WHERE r1.ts = first_ts
AND r2.ts = last_ts;
Assuming unique (userid, ts), or your requirements would be ambiguous.
SQL Fiddle demo.
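Both queries compute the same thing per user: the rank of the newest row minus the rank of the oldest row. A minimal Python sketch of that computation follows; the sample rows and the name rank_delta are made up for illustration. In the window-function version the frame has to cover the whole partition, because with the default frame last_value() only sees rows up to the current one.

```python
from collections import defaultdict

# Hypothetical sample rows: (userid, rank, ts)
rankings = [
    (1, 10, 1), (1, 7, 2), (1, 4, 3),
    (2, 5, 1), (2, 9, 2),
]

def rank_delta(rows):
    """Per user: rank of the newest entry minus rank of the oldest entry."""
    by_user = defaultdict(list)
    for userid, rank, ts in rows:
        by_user[userid].append((ts, rank))
    deltas = {}
    for userid, items in by_user.items():
        items.sort()                           # ORDER BY ts within the partition
        first_rank = items[0][1]               # first_value(rank) over the full frame
        last_rank = items[-1][1]               # last_value(rank) over the full frame
        deltas[userid] = last_rank - first_rank
    return deltas

deltas = rank_delta(rankings)
print(deltas)
```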
Shichinin no samurai
... a.k.a. "7 Samurai"
Per request in the comments, the same for only the last seven rows per userid (or as many as can be found, if there are fewer):
Again, one of many possible ways. But I believe this to be one of the shortest:
SELECT DISTINCT ON (userid)
userid
, first_value(rank) OVER w
- last_value(rank) OVER w AS rank_delta
FROM rankings
WINDOW w AS (PARTITION BY userid ORDER BY ts DESC
ROWS BETWEEN CURRENT ROW AND 6 FOLLOWING)
ORDER BY userid, ts DESC;
Note the reversed sort order. The first row is the "newest" entry. I span a frame of (max.) 7 rows and pick only the results for the newest entry with DISTINCT ON.
SQL Fiddle demo.

You can do it with JOIN and DISTINCT ON in Postgres. The GRP query gives you the last rank value for each userId, so just join it with rankings on userId and subtract the values.
SELECT rankings.userId,
rankings.rank-GRP.rank as delta,
rankings.timestamp
FROM rankings
JOIN
(
SELECT DISTINCT ON (userId) userId, rank, timestamp
FROM rankings
ORDER BY userId, timestamp DESC
) as GRP ON rankings.userId=GRP.userId
SQLFiddle demo

PostgreSQL select daily max and corresponding hour of ocurrence

I have the following table structure, with daily-hourly data:
time_of_ocurrence(timestamp); particles(numeric)
"2012-11-01 00:30:00";191.3
"2012-11-01 01:30:00";46
...
"2013-01-01 02:30:00";319.6
How do I select the daily max and the hour in which this max occurs?
I've tried
SELECT date_trunc('hour', time_of_ocurrence) as hora,
MAX(particles)
from my_table WHERE time_of_ocurrence > '2013-09-01'
GROUP BY hora ORDER BY hora
But it doesn't work:
"2013-09-01 00:00:00";34.35
"2013-09-01 01:00:00";33.13
"2013-09-01 02:00:00";33.09
"2013-09-01 03:00:00";28.08
The result I want would be in this format instead (one max per day, showing the hour):
"2013-09-01 05:00:00";100.35
"2013-09-02 03:30:00";80.13
How can i do that? Thanks!

This type of question has come up on StackOverflow frequently, and these questions are categorized with the greatest-n-per-group tag, if you want to see other solutions.
edit: I changed the following code to group by day instead of by hour.
Here's one solution:
SELECT t.*
FROM (
SELECT date_trunc('day', time_of_ocurrence) as hora, MAX(particles) AS particles
FROM my_table
GROUP BY hora
) AS _max
INNER JOIN my_table AS t
ON _max.hora = date_trunc('day', t.time_of_ocurrence)
AND _max.particles = t.particles
WHERE time_of_ocurrence > '2013-09-01'
ORDER BY time_of_ocurrence;
This might also show more than one result per day, if more than one row has the max value.
Another solution using window functions that does not show such duplicates:
SELECT * FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY date_trunc('day', time_of_ocurrence)
ORDER BY particles DESC) AS _rn
FROM my_table
) AS _max
WHERE _rn = 1
ORDER BY time_of_ocurrence;
If multiple rows have the same max, one row will nevertheless be numbered 1. If you need specific control over which row is numbered 1, add a unique column to the ORDER BY of the window definition (the OVER clause) to break such ties.

Use window functions:
select distinct
date_trunc('day',time_of_ocurrence) as day,
max(particles) over (partition by date_trunc('day',time_of_ocurrence)) as particles_max_of_day,
first_value(date_trunc('hour',time_of_ocurrence)) over (partition by date_trunc('day',time_of_ocurrence) order by particles desc)
from my_table
order by 1
One edge case here is when the same max number of particles shows up on the same day but in different hours. This version would arbitrarily pick one of them. If you prefer one over the other (always the earlier one, for example), you can add that to the order by clause:
first_value(date_trunc('hour',time_of_ocurrence)) over (partition by date_trunc('day',time_of_ocurrence) order by particles desc, time_of_ocurrence)
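The "one row per day, earliest wins on ties" rule can be sketched in Python as follows. The readings are made-up sample data (including a deliberate tie on the first day), and daily_max is a hypothetical helper name; timestamps are kept as ISO strings so that the first ten characters play the role of date_trunc('day', ...).

```python
# Hypothetical readings: (time_of_ocurrence, particles)
readings = [
    ("2013-09-01 03:30:00", 28.08),
    ("2013-09-01 05:30:00", 100.35),
    ("2013-09-01 07:30:00", 100.35),   # same max later that day (tie)
    ("2013-09-02 03:30:00", 80.13),
    ("2013-09-02 04:30:00", 12.0),
]

def daily_max(rows):
    """Return (day, time_of_ocurrence, particles) for the max of each day."""
    best = {}
    for ts, particles in rows:
        day = ts[:10]                          # PARTITION BY day
        # ORDER BY particles DESC, time_of_ocurrence ASC:
        # the smallest key is the row that would be numbered 1.
        rank_key = (-particles, ts)
        if day not in best or rank_key < best[day]:
            best[day] = rank_key
    return [(day, best[day][1], -best[day][0]) for day in sorted(best)]

per_day = daily_max(readings)
for row in per_day:
    print(row)
```

Without the timestamp in the key, either of the two tied rows on 2013-09-01 could be returned, which is the arbitrary pick described above.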

Combine results from two independent queries

I'm fairly new to Postgres and need to fetch two separate sets of data:
1) values of avg, min, max of various records/results of table T1
2) column values of the 'last' record of table T1, based on the most recent timestamp
The problem is that I cannot run these queries separately, as that would cause performance issues (the table can hold tens of thousands of records, and combining them into a result object is even more complex).
Is it possible to combine the results of these two queries side by side in one monster query that returns the desired output?
Appreciate your help.
Updated with queries:
1st query :
select
rtp.id, rtp.received_timestamp,
rtp.agent_time, rtp.sourceip, rtp.destip, rtp.sourcedscp,
sum(rtp.numduplicate) as dups, avg(rtp.numduplicate) as avgdups,
min(rtp.numduplicate) as mindups, max(rtp.numduplicate) as maxdups
from rtp_test_result rtp
where
rtp.received_timestamp between 1274723208 and 1475642299
group by rtp.sourceip, rtp.destip, rtp.sourcedscp
order by rtp.sourceip, rtp.destip, rtp.sourcedscp
2nd query:
select id, received_timestamp, numooo
from rtp_test_result
where received_timestamp = (select max(received_timestamp) mrt from rtp_test_result)
group by id,received_timestamp, numooo
order by id desc limit 1

something like
with cte as (
select
val,
last_value(val) over(order by ts asc rows between unbounded preceding and unbounded following) as lst_value
from T1
)
select
avg(val) as avg_value,
min(val) as min_value,
max(val) as max_value,
max(lst_value) as lst_value
from cte
or
select
avg(val) as avg_value,
min(val) as min_value,
max(val) as max_value,
(select val from T1 order by ts desc limit 1) as lst_value
from T1
sql fiddle demo
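Both variants return a single row that holds the aggregates together with the value of the most recent record. As a rough Python sketch of that result shape, assuming made-up (ts, val) rows and a hypothetical summarize helper:

```python
# Hypothetical rows of T1: (ts, val)
t1 = [(1, 4.0), (2, 10.0), (3, 1.0)]

def summarize(rows):
    """Aggregates over all rows plus the value of the newest row."""
    vals = [val for _, val in rows]
    newest = max(rows, key=lambda r: r[0])     # ORDER BY ts DESC LIMIT 1
    return {
        "avg_value": sum(vals) / len(vals),
        "min_value": min(vals),
        "max_value": max(vals),
        "lst_value": newest[1],
    }

summary = summarize(t1)
print(summary)
```

The sketch assumes at least one row; an empty table would need separate handling.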
