Illogical result using Order By along with Partition By - sql

Let's say there is a table dd:
id (integer) | name (varchar) | ts (date)
1            | first          | 2021-03-25
2            | first          | 2021-03-30
When I query this table with the following:
SELECT *, MAX(ts) OVER (PARTITION BY name ORDER BY ts) max_ts FROM dd;
Then the result is:
id (integer) | name (varchar) | ts (date)  | max_ts (date)
1            | first          | 2021-03-25 | 2021-03-25
2            | first          | 2021-03-30 | 2021-03-30
When I add "DESC" to the ORDER BY clause:
SELECT *, MAX(ts) OVER (PARTITION BY name ORDER BY ts DESC) max_ts FROM dd;
The result is:
id (integer) | name (varchar) | ts (date)  | max_ts (date)
2            | first          | 2021-03-30 | 2021-03-30
1            | first          | 2021-03-25 | 2021-03-30
This time the result is what I expect. Considering that I am partitioning records by name and then getting the max date from them, I expect the max_ts values to be the same (the max one) in both cases, since the order should not really matter when getting the max value from the group. But in fact, in the first case the result contains different max_ts values, not the maximum one.
Why does it work this way? Why does ordering affect the result?

This syntax:
MAX(ts) OVER (PARTITION BY name ORDER BY ts)
is a cumulative maximum ordered by ts. When a window function has an ORDER BY clause, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so the maximum is taken only over rows up to and including the current one. The frame starts with the smallest value of ts, and each subsequent row's ts is larger -- because the ORDER BY column is ts itself.
The result is therefore uninteresting: each row's ts is its own cumulative maximum when ordered by ts.
On the other hand:
MAX(ts) OVER (PARTITION BY name ORDER BY ts DESC)
is the cumulative maximum in reverse order. The first row in the window frame holds the maximum ts, so every subsequent row sees that same maximum.
This is not the most efficient way to express this, though. I think this better captures the logic you want:
MAX(ts) OVER (PARTITION BY name)
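The difference between the running maximum and the whole-partition maximum is easy to reproduce outside Postgres as well. Below is a minimal sketch using Python's bundled sqlite3 module (assumes SQLite 3.25+ for window-function support); the table and data follow the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dd (id INTEGER, name TEXT, ts TEXT);
    INSERT INTO dd VALUES (1, 'first', '2021-03-25'), (2, 'first', '2021-03-30');
""")

# With ORDER BY, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND
# CURRENT ROW, so MAX() is a running maximum up to the current row.
running = conn.execute("""
    SELECT id, MAX(ts) OVER (PARTITION BY name ORDER BY ts) FROM dd ORDER BY id
""").fetchall()
print(running)  # [(1, '2021-03-25'), (2, '2021-03-30')]

# Without ORDER BY, the frame is the whole partition: the true group maximum.
whole = conn.execute("""
    SELECT id, MAX(ts) OVER (PARTITION BY name) FROM dd ORDER BY id
""").fetchall()
print(whole)    # [(1, '2021-03-30'), (2, '2021-03-30')]
```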


dense_rank in sql partition by id and session id but ordered by timestamp

I have a table as follows:
User ID | Session ID   | Timestamp
100     | 7e938c4437a0 | 1:30:30
100     | 7e938c4437a0 | 1:30:33
100     | c1fcfd8b1a25 | 2:40:00
100     | 7b5e86d91103 | 3:20:00
200     | bda6c8743671 | 2:20:00
200     | bda6c8743671 | 2:25:00
200     | aac5d66421a0 | 3:10:00
200     | aac5d66421a0 | 3:11:00
I am trying to rank each session_id per user_id, ordered by timestamp. I want something like the second table.
I am doing the following but it does not order by timestamp:
dense_rank() over (partition by user_id order by session_id) as visit_number
It outputs the wrong order, and when I add the timestamp to the ORDER BY it behaves like a row_number() function.
Below is what I am really looking for to get as a result:
User ID | Session ID   | Timestamp | Rank
100     | 7e938c4437a0 | 1:30:30   | 1
100     | 7e938c4437a0 | 1:30:33   | 1
100     | c1fcfd8b1a25 | 2:40:00   | 2
100     | 7b5e86d91103 | 3:20:00   | 3
200     | bda6c8743671 | 2:20:00   | 1
200     | bda6c8743671 | 2:25:00   | 1
200     | aac5d66421a0 | 3:10:00   | 2
200     | aac5d66421a0 | 3:11:00   | 2
If you want to dense rank by the hour component of the timestamp, you can extract the hour. This should give the results you specify. In standard SQL, this looks like:
dense_rank() over (partition by user_id order by extract(hour from timestamp)) as visit_number
Of course, date/time functions are highly database dependent, so your database might have a different function for extracting the hour.
I wanted to do something similar and since I found the answer I thought I would come and post here. This is what I have learned you can do.
SELECT user_id, session_id, session_timestamp,
       -- This ranks the records by the earliest timestamp, which is the same
       -- for each user_id, session_id group
       DENSE_RANK() OVER (PARTITION BY tbl.user_id ORDER BY tbl.min_dt) AS rank
FROM (
    SELECT user_id, session_id, session_timestamp,
           -- Take the MIN (or MAX) session_timestamp per group. This keeps the
           -- ordering by timestamp while still grouping by user_id and session_id.
           MIN(session_timestamp) OVER (PARTITION BY user_id, session_id) AS min_dt
    FROM sessions) tbl
ORDER BY user_id, rank, session_timestamp
This produces results matching those asked for.
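The MIN-per-session approach above can be sketched end-to-end with Python's sqlite3 module (assumes SQLite 3.25+); names follow the question, with visit_number as the rank alias:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sessions (user_id INT, session_id TEXT, session_timestamp TEXT);
    INSERT INTO sessions VALUES
      (100, '7e938c4437a0', '1:30:30'), (100, '7e938c4437a0', '1:30:33'),
      (100, 'c1fcfd8b1a25', '2:40:00'), (100, '7b5e86d91103', '3:20:00'),
      (200, 'bda6c8743671', '2:20:00'), (200, 'bda6c8743671', '2:25:00'),
      (200, 'aac5d66421a0', '3:10:00'), (200, 'aac5d66421a0', '3:11:00');
""")

# Rank sessions per user by each session's earliest timestamp.
rows = conn.execute("""
    SELECT user_id, session_id, session_timestamp,
           DENSE_RANK() OVER (PARTITION BY user_id ORDER BY min_dt) AS visit_number
    FROM (SELECT user_id, session_id, session_timestamp,
                 MIN(session_timestamp) OVER (PARTITION BY user_id, session_id) AS min_dt
          FROM sessions) tbl
    ORDER BY user_id, visit_number, session_timestamp
""").fetchall()
for r in rows:
    print(r)
```

The ranks come out as 1, 1, 2, 3 for user 100 and 1, 1, 2, 2 for user 200, matching the desired table.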

Using the append model to do partial row updates in BigQuery

Suppose I have the following record in BQ:
id | name  | age | timestamp
1  | "tom" | 20  | 2019-01-01
I then perform two "updates" on this record by using the streaming API to 'append' additional data -- https://cloud.google.com/bigquery/streaming-data-into-bigquery. This is mainly to get around the update quota that BQ enforces (and it is a high-write application we have).
I then append two edits to the table, one update that just modifies the name, and then one update that just modifies the age. Here are the three records after the updates:
id | name  | age  | timestamp
1  | "tom" | 20   | 2019-01-01
1  | "Tom" | null | 2019-02-01
1  | null  | 21   | 2019-03-03
I then want to query this record to get the most "up-to-date" information. Here is how I have started:
SELECT id, **name**, **age**,max(timestamp)
FROM table
GROUP BY id
-- 1,"Tom",21,2019-03-03
How would I get the correct name and age here? Note that there could be thousands of updates to a record, so I don't want to have to write 1000 case statements, if at all possible.
For various other reasons, I usually won't have all the row data at one time; I will only have the RowID + FieldName + FieldValue.
I suppose plan B here is to do a query to get the current data and then add my changes to insert the new row, but I'm hoping there's a way to do this in one go without having to do two queries.
Below is for BigQuery Standard SQL
#standardSQL
SELECT id,
ARRAY_AGG(name IGNORE NULLS ORDER BY ts DESC LIMIT 1)[OFFSET(0)] name,
ARRAY_AGG(age IGNORE NULLS ORDER BY ts DESC LIMIT 1)[OFFSET(0)] age,
MAX(ts) ts
FROM `project.dataset.table`
GROUP BY id
You can test and play with the above using the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, "tom" name, 20 age, DATE '2019-01-01' ts UNION ALL
SELECT 1, "Tom", NULL, '2019-02-01' UNION ALL
SELECT 1, NULL, 21, '2019-03-03'
)
SELECT id,
ARRAY_AGG(name IGNORE NULLS ORDER BY ts DESC LIMIT 1)[OFFSET(0)] name,
ARRAY_AGG(age IGNORE NULLS ORDER BY ts DESC LIMIT 1)[OFFSET(0)] age,
MAX(ts) ts
FROM `project.dataset.table`
GROUP BY id
with result:
Row | id | name | age | ts
1   | 1  | Tom  | 21  | 2019-03-03
This is a classic case of application of analytic functions in Standard SQL.
Here is how you can achieve your results:
select id, name, age from (
select id, name, age, ts, rank() over (partition by id order by ts desc) rnk
from `yourdataset.yourtable`
)
where rnk = 1
This will sub-group your records based on id and pick the one with the most recent ts (the record most recently added for a given id).
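ARRAY_AGG(... IGNORE NULLS) is BigQuery-specific, but the same "latest non-NULL value per column" logic can be mimicked with correlated subqueries on engines that lack it. A runnable sketch using Python's sqlite3 (the table name t and sample data mirror the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INT, name TEXT, age INT, ts TEXT);
    INSERT INTO t VALUES
      (1, 'tom', 20, '2019-01-01'),
      (1, 'Tom', NULL, '2019-02-01'),
      (1, NULL, 21, '2019-03-03');
""")

# For each column, pick the most recent non-NULL value per id.
row = conn.execute("""
    SELECT id,
           (SELECT name FROM t t2 WHERE t2.id = t.id AND name IS NOT NULL
            ORDER BY ts DESC LIMIT 1) AS name,
           (SELECT age FROM t t2 WHERE t2.id = t.id AND age IS NOT NULL
            ORDER BY ts DESC LIMIT 1) AS age,
           MAX(ts) AS ts
    FROM t
    GROUP BY id
""").fetchone()
print(row)  # (1, 'Tom', 21, '2019-03-03')
```

One correlated subquery per column scales poorly with many columns, which is exactly why the ARRAY_AGG form is preferable where available.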

Max difference between update timestamps

I have a table:
id | updated_at
1 | 2018-10-22T21:00:00Z
2 | 2018-10-22T21:02:00Z
I'd like to find the largest delta for a given day between closest updated timestamps. For example, if there were 5 rows:
id | updated_at
1 | 2018-10-22T21:00:00Z
2 | 2018-10-22T21:02:00Z
3 | 2018-10-22T21:05:00Z
4 | 2018-10-22T21:06:00Z
5 | 2018-10-22T21:16:00Z
The largest delta is between 4 and 5 (10 minutes). Note that really when comparing, I just want to find the next closest updated_at timestamp and then give me the max of this. I feel like I'm messing up the subquery to do this.
with nearest_time(time_diff)
as
(
select datediff('minute', updated_at as u1, (select updated_at from table where updated_at > u1 limit 1) as u2)
group by updated_at::date
)
select max(select time_diff from nearest_time);
SELECT
    lead(updated_at) OVER (ORDER BY updated_at) - updated_at AS diff
FROM dates
ORDER BY diff DESC NULLS LAST
LIMIT 1;
The window function LEAD gives you a value from the next row: in this case, the next timestamp.
With that you can compute the difference, sort the results descending, and take the first value.
Use lag to get the updated_at from the previous row and then get the max difference per day.
select dt_updated_at, max(time_diff)
from (select updated_at::date as dt_updated_at,
             updated_at - lag(updated_at) over (partition by updated_at::date
                                                order by updated_at) as time_diff
      from tbl
     ) t
group by dt_updated_at
One more option uses DISTINCT ON (which only works on Postgres; as the question was initially tagged Postgres, keeping this answer):
select distinct on
(updated_at::date)
updated_at::date as dt_updated_at
,updated_at-lag(updated_at) over(partition by updated_at::date order by updated_at) as diff
from dates
order by updated_at::date,diff desc
nulls last
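The LEAD approach can be checked with Python's sqlite3 module. SQLite has no timestamp subtraction or interval type, so this sketch uses julianday() to get the gap in minutes; the NULL produced on the last row is filtered with a WHERE clause instead of NULLS LAST:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dates (id INT, updated_at TEXT);
    INSERT INTO dates VALUES
      (1, '2018-10-22T21:00:00Z'), (2, '2018-10-22T21:02:00Z'),
      (3, '2018-10-22T21:05:00Z'), (4, '2018-10-22T21:06:00Z'),
      (5, '2018-10-22T21:16:00Z');
""")

# julianday() returns days, so multiply by 24*60 to convert to minutes.
row = conn.execute("""
    SELECT id, diff_minutes
    FROM (SELECT id,
                 ROUND((julianday(LEAD(updated_at) OVER (ORDER BY updated_at))
                        - julianday(updated_at)) * 24 * 60) AS diff_minutes
          FROM dates)
    WHERE diff_minutes IS NOT NULL   -- the last row has no successor
    ORDER BY diff_minutes DESC
    LIMIT 1
""").fetchone()
print(row)  # (4, 10.0): the largest gap is between rows 4 and 5
```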

SQL How to order logged data according to the date and the user?

I have a situation where I need to create an ordered "event" or "touch" ranking based on the date and the user that touched a case from a historical log table. For example, I have a log table that looks like this:
case_id user_id log_date
------- ------- --------
1 5 06-29 12:05
1 5 06-29 12:10
1 5 06-30 9:12
1 3 06-30 9:15
And I want to get this:
case_id user_id log_date EventNumber
------- ------- -------- -----------
1 5 06-29 12:05 1
1 5 06-29 12:10 1
1 5 06-30 9:12 2
1 3 06-30 9:15 3
Basically either a change in the date or a change in the user that touched a case signifies that a new event has occurred. The closest I got so far is [EventNum] = DENSE_RANK() OVER (PARTITION BY case_id ORDER BY CONVERT(DATE, log_date), user_id)
The problem with this approach is that the secondary ordering, while correctly incrementing the rank when a different user touched the case, puts the second user first because that user_id happens to be the lower number. I can't figure out how to "partition" by users while maintaining the original logged order. Even the date break isn't essential - I would settle for breaking up the ranking only by users, provided the original logged order remains the same. Any advice?
This is a tricky question. You need to identify groups where the date and user are adjacent. One method is to use lag(), but it is not available in SQL Server 2008. Another method is a difference of row numbers.
The difference defines the group. You then need to get the minimum date for the final ordering. So:
select t.*,
       dense_rank() over (partition by caseid order by grp_log_date) as EventNum
from (select t.*, min(log_date) over (partition by caseid, grp) as grp_log_date
      from (select t.*,
                   (row_number() over (partition by caseid order by log_date) -
                    row_number() over (partition by caseid, userid, cast(log_date as date)
                                       order by log_date)
                   ) as grp
            from table t
           ) t
     ) t;
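The difference-of-row-numbers query runs on any engine with window functions. A runnable sketch in Python's sqlite3 (the table name logs is assumed, full dates are used so the strings sort correctly, and date(log_date) plays the role of CAST(log_date AS date)):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE logs (caseid INT, userid INT, log_date TEXT);
    INSERT INTO logs VALUES
      (1, 5, '2021-06-29 12:05'), (1, 5, '2021-06-29 12:10'),
      (1, 5, '2021-06-30 09:12'), (1, 3, '2021-06-30 09:15');
""")

# grp is constant within each run of rows sharing the same case, user and date.
rows = conn.execute("""
    SELECT caseid, userid, log_date,
           DENSE_RANK() OVER (PARTITION BY caseid ORDER BY grp_log_date) AS event_num
    FROM (SELECT t.*, MIN(log_date) OVER (PARTITION BY caseid, grp) AS grp_log_date
          FROM (SELECT t.*,
                       ROW_NUMBER() OVER (PARTITION BY caseid ORDER BY log_date)
                     - ROW_NUMBER() OVER (PARTITION BY caseid, userid, date(log_date)
                                          ORDER BY log_date) AS grp
                FROM logs t) t) t
    ORDER BY log_date
""").fetchall()
print([r[3] for r in rows])  # [1, 1, 2, 3]
```

The event numbers 1, 1, 2, 3 match the desired EventNumber column: a new event starts when either the date or the user changes.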

How to group following rows by not unique value

I have data like this:
table1
_____________
id way time
1 1 00:01
2 1 00:02
3 2 00:03
4 2 00:04
5 2 00:05
6 3 00:06
7 3 00:07
8 1 00:08
9 1 00:09
I would like to know in which time interval I was on which way:
desired output
_________________
id way from to
1 1 00:01 00:02
3 2 00:03 00:05
6 3 00:06 00:07
8 1 00:08 00:09
I tried to use a window function:
SELECT DISTINCT
first_value(id) OVER w AS id,
first_value(way) OVER w as way,
first_value(time) OVER w as from,
last_value(time) OVER w as to
FROM table1
WINDOW w AS (
PARTITION BY way ORDER BY ID
range between unbounded preceding and unbounded following);
What I get is:
ID way from to
1 1 00:01 00:09
3 2 00:03 00:05
6 3 00:06 00:07
And this is not correct, because on way 1 I wasn't from 00:01 to 00:09.
Is there a way to partition according to the order, i.e. to group only consecutive rows whose values are equal?
If your case is as simple as the example values suggest, Giorgos's answer serves nicely.
However, that's typically not the case. If the id column is a serial, you cannot rely on the assumption that a row with an earlier time also has a smaller id.
Also, time values (or timestamps, which you probably have) can easily contain duplicates, so you need to make the sort order unambiguous.
Assuming both can happen, and you want the id from the row with the earliest time per time slice (actually, the smallest id for the earliest time, there could be ties), this query would deal with the situation properly:
SELECT *
FROM (
SELECT DISTINCT ON (way, grp)
id, way, time AS time_from
, max(time) OVER (PARTITION BY way, grp) AS time_to
FROM (
SELECT *
, row_number() OVER (ORDER BY time, id) -- id as tie breaker
- row_number() OVER (PARTITION BY way ORDER BY time, id) AS grp
FROM table1
) t
ORDER BY way, grp, time, id
) sub
ORDER BY time_from, id;
ORDER BY time, id makes the order unambiguous: since time is not unique, adding the (assumed unique) id avoids arbitrary results that could change between queries in sneaky ways.
max(time) OVER (PARTITION BY way, grp): without ORDER BY, the window frame spans all rows of the PARTITION, so we get the absolute maximum per time slice.
The outer query layer is only necessary to produce the desired sort order in the result, since we are bound to a different ORDER BY in the subquery sub by using DISTINCT ON. Details:
Select first row in each GROUP BY group?
If you are looking to optimize performance, a plpgsql function could be faster in such a case. Closely related answer:
Group by repeating attribute
Aside: don't use the basic type name time as identifier (also a reserved word in standard SQL).
I think you want something like this:
select min(id), way,
min(time), max(time)
from (
select id, way, time,
ROW_NUMBER() OVER (ORDER BY id) -
ROW_NUMBER() OVER (PARTITION BY way ORDER BY time) AS grp
from table1 ) t
group by way, grp
grp identifies 'islands' of successive way values. Using this calculated field in an outer query, we can get start and end times of way intervals using MIN and MAX aggregate functions respectively.
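The gaps-and-islands trick above can be verified with a quick sqlite3 session. This sketch uses the GROUP BY variant, renaming the column to t to heed the aside about time being a reserved word:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INT, way INT, t TEXT);
    INSERT INTO table1 VALUES
      (1, 1, '00:01'), (2, 1, '00:02'), (3, 2, '00:03'), (4, 2, '00:04'),
      (5, 2, '00:05'), (6, 3, '00:06'), (7, 3, '00:07'), (8, 1, '00:08'),
      (9, 1, '00:09');
""")

# grp = global row number minus per-way row number: constant within each
# run of consecutive rows on the same way, so it identifies the 'islands'.
rows = conn.execute("""
    SELECT MIN(id) AS id, way, MIN(t) AS t_from, MAX(t) AS t_to
    FROM (SELECT id, way, t,
                 ROW_NUMBER() OVER (ORDER BY id)
               - ROW_NUMBER() OVER (PARTITION BY way ORDER BY id) AS grp
          FROM table1) x
    GROUP BY way, grp
    ORDER BY t_from
""").fetchall()
for r in rows:
    print(r)
```

The output reproduces the desired intervals, including the second visit to way 1 from 00:08 to 00:09 as its own row.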