I have data like this:
table1
_____________
id  way  time
 1   1   00:01
 2   1   00:02
 3   2   00:03
 4   2   00:04
 5   2   00:05
 6   3   00:06
 7   3   00:07
 8   1   00:08
 9   1   00:09
I would like to know in which time interval I was on which way:
desired output
_________________
id  way  from   to
 1   1   00:01  00:02
 3   2   00:03  00:05
 6   3   00:06  00:07
 8   1   00:08  00:09
I tried to use a window function:
SELECT DISTINCT
       first_value(id)   OVER w AS id,
       first_value(way)  OVER w AS way,
       first_value(time) OVER w AS from,
       last_value(time)  OVER w AS to
FROM   table1
WINDOW w AS (
       PARTITION BY way ORDER BY id
       RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);
What I get is:
ID  way  from   to
 1   1   00:01  00:09
 3   2   00:03  00:05
 6   3   00:06  00:07
And this is not correct, because I wasn't on way 1 from 00:01 to 00:09.
Is there a way to partition according to the order, i.e. to group only consecutive rows whose attributes are equal?
If your case is as simple as the example values suggest, @Giorgos' answer serves nicely.
However, that's typically not the case. If the id column is a serial, you cannot rely on the assumption that a row with an earlier time also has a smaller id.
Also, time values (or timestamps, as you probably have) can easily be duplicates, so you need to make the sort order unambiguous.
Assuming both can happen, and that you want the id from the row with the earliest time per time slice (actually, the smallest id for the earliest time, since there could be ties), this query deals with the situation properly:
SELECT *
FROM  (
   SELECT DISTINCT ON (way, grp)
          id, way, time AS time_from
        , max(time) OVER (PARTITION BY way, grp) AS time_to
   FROM  (
      SELECT *
           , row_number() OVER (ORDER BY time, id)  -- id as tie breaker
           - row_number() OVER (PARTITION BY way ORDER BY time, id) AS grp
      FROM   table1
      ) t
   ORDER  BY way, grp, time, id
   ) sub
ORDER  BY time_from, id;
ORDER BY time, id makes the sort order unambiguous: since time is not assumed to be unique, the (presumably unique) id is added to avoid arbitrary results, which could change between queries in sneaky ways.
max(time) OVER (PARTITION BY way, grp): without ORDER BY, the window frame spans all rows of the PARTITION, so we get the absolute maximum per time slice.
The outer query layer is only necessary to produce the desired sort order in the result, since we are bound to a different ORDER BY in the subquery sub by using DISTINCT ON. Details:
Select first row in each GROUP BY group?
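For readers new to DISTINCT ON, a minimal Postgres sketch using the columns from the question (not part of the original answer): it keeps the first row per way according to the ORDER BY.

SELECT DISTINCT ON (way)
       id, way, time
FROM   table1
ORDER  BY way, time, id;  -- one row per way: earliest time, smallest id as tie breaker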
SQL Fiddle demonstrating the use case.
If you are looking to optimize performance, a plpgsql function could be faster in such a case. Closely related answer:
Group by repeating attribute
Aside: don't use the basic type name time as an identifier (it is also a reserved word in standard SQL).
I think you want something like this:
select min(id), way,
       min(time), max(time)
from (
      select id, way, time,
             ROW_NUMBER() OVER (ORDER BY id) -
             ROW_NUMBER() OVER (PARTITION BY way ORDER BY time) AS grp
      from table1
     ) t
group by way, grp
grp identifies 'islands' of consecutive rows with the same way value. Using this calculated field in the outer query, we can get the start and end time of each way interval with the MIN and MAX aggregate functions respectively.
Demo here
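To see why grouping by (way, grp) isolates each stretch, here is a hand-worked sketch (not part of the original answer) of what the inner subquery produces for the sample data:

select id, way, time,
       ROW_NUMBER() OVER (ORDER BY id) -
       ROW_NUMBER() OVER (PARTITION BY way ORDER BY time) AS grp
from table1;
-- id | way | time  | grp
--  1 |  1  | 00:01 |  0
--  2 |  1  | 00:02 |  0
--  3 |  2  | 00:03 |  2
--  4 |  2  | 00:04 |  2
--  5 |  2  | 00:05 |  2
--  6 |  3  | 00:06 |  5
--  7 |  3  | 00:07 |  5
--  8 |  1  | 00:08 |  5
--  9 |  1  | 00:09 |  5
-- grp is constant within each island; note that way 3 and the second visit to way 1
-- happen to share grp = 5, which is why the outer query groups by (way, grp), not grp alone.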
I'm looking for an efficient way to assign sequence numbers within each group.
Record | Group | GroupSequence
-------|-------|--------------
1      | Car   | 1
2      | Car   | 2
3      | Bike  | 1
4      | Bus   | 1
5      | Bus   | 2
6      | Bus   | 3
I came across this question: How to add sequence number for groups in a SQL query without temp tables. But my use case is slightly different. Any ideas on how to accomplish this with a single query?
You are looking for row_number():
select t.*, row_number() over (partition by group order by record) as group_sequence
from t;
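Note that group is a reserved word in most SQL dialects, so with the question's column names it usually needs quoting; a minimal sketch against the sample data (double quotes here; SQL Server would use [group]):

select t.*,
       row_number() over (partition by "group" order by record) as group_sequence
from t;
-- record | group | group_sequence
-- 1      | Car   | 1
-- 2      | Car   | 2
-- 3      | Bike  | 1
-- 4      | Bus   | 1
-- 5      | Bus   | 2
-- 6      | Bus   | 3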
You can calculate this when you need it, so I see no reason to store it. However, you can update the values if you like:
update t
set group_sequence = tt.new_group_sequence
from (select t.*,
row_number() over (partition by group order by record) as new_group_sequence
from t
) tt
where tt.record = t.record;
Let's say there is a table dd:
id (integer) | name (varchar) | ts (date)
-------------|----------------|-----------
1            | first          | 2021-03-25
2            | first          | 2021-03-30
When I query this table with the following:
SELECT *, MAX(ts) OVER (PARTITION BY name ORDER BY ts) max_ts FROM dd;
Then the result is:
id (integer) | name (varchar) | ts (date)  | max_ts (date)
-------------|----------------|------------|--------------
1            | first          | 2021-03-25 | 2021-03-25
2            | first          | 2021-03-30 | 2021-03-30
When I add "DESC" to the ORDER BY clause:
SELECT *, MAX(ts) OVER (PARTITION BY name ORDER BY ts DESC) max_ts FROM dd;
The result is:
id (integer) | name (varchar) | ts (date)  | max_ts (date)
-------------|----------------|------------|--------------
2            | first          | 2021-03-30 | 2021-03-30
1            | first          | 2021-03-25 | 2021-03-30
This time the result is what I expect. Since I am partitioning the records by name and taking the max date within each partition, I would expect max_ts to hold the same (maximum) value in both cases; the order should not matter when taking the maximum of a group. But in the first case the result contains different max_ts values, not the maximum one.
Why does it work this way? Why does ordering affect the result?
This syntax:
MAX(ts) OVER (PARTITION BY name ORDER BY ts)
is a cumulative maximum ordered by ts: the default window frame runs from the start of the partition up to the current row. Because the ORDER BY column is ts itself, each subsequent row's ts is at least as large as all earlier ones.
So this is not very interesting: on each row, the cumulative maximum is simply that row's own ts.
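Spelled out with its default frame (a sketch of the same behavior, not a change), the first expression is equivalent to:

MAX(ts) OVER (PARTITION BY name
              ORDER BY ts
              RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)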
On the other hand:
MAX(ts) OVER (PARTITION BY name ORDER BY ts DESC)
This is the cumulative maximum in reverse order: the first row in the window frame already carries the maximum ts, so every subsequent row reports that same maximum.
This is not the most efficient way to express this, though. I think this better captures the logic you want:
MAX(ts) OVER (PARTITION BY name)
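Applied to the dd table from the question, a quick sketch of what that gives:

SELECT *, MAX(ts) OVER (PARTITION BY name) AS max_ts FROM dd;
-- id | name  | ts         | max_ts
--  1 | first | 2021-03-25 | 2021-03-30
--  2 | first | 2021-03-30 | 2021-03-30
-- both rows now report the partition-wide maximum, regardless of order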
I cannot completely understand the partitioning concept in Hive.
I understand what partitions are and how to create them. What I cannot get is why people write SELECT statements with a "partition by" clause, like it is done here: SQL most recent using row_number() over partition
SELECT user_id, page_name, recent_click
FROM (
SELECT user_id,
page_name,
row_number() over (partition by session_id order by ts desc) as recent_click
from clicks_data
) T
WHERE recent_click = 1
Why specify a partition key in a SELECT at all? The partition key was already defined during table creation, and the SELECT statement will use the partition scheme defined in the CREATE TABLE statement. So why add that over (partition by session_id order by ts desc)?
What happens if I skip over (partition by session_id order by ts desc)?
Read about Hive Windowing and Analytics Functions.
row_number() is an analytic function which numbers rows and requires over().
In the over() you can specify for which group (partition) it will be calculated.
partition by in the over() is not the same as partitioned by in the CREATE TABLE DDL; they have nothing in common. In CREATE TABLE it describes how the data is stored (each partition is a separate folder in Hive), and a partitioned table is used to optimize filtering or the loading of data.
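For contrast, a hypothetical Hive DDL sketch (table and column names are made up, not from the question): partitioned by here only controls how files are laid out on disk and has no effect on window functions.

create table clicks_data_part (
  user_id    int,
  session_id int,
  page_name  string,
  ts         string
)
partitioned by (dt string);  -- storage layout only; unrelated to over(partition by ...)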
partition by in the over() determines the group within which the function is calculated. It is similar to GROUP BY in a SELECT, but with the difference that an analytic function does not change the number of rows.
row_number re-initializes when it crosses a partition boundary and starts again at 1.
Also, row_number needs an order by in the over(); the order by determines the order in which rows are numbered.
If you do not specify partition by, row_number treats the whole dataset as a single partition: it produces a single 1, and the maximum number equals the number of rows in the whole dataset. Table partitioning does not affect the behavior of analytic functions.
If you do not specify order by, row_number numbers the rows in a non-deterministic order, and different rows may be marked 1 from run to run. This is why you need to specify order by. In your example, order by ts desc means that 1 is assigned to the row with the max ts (for each session_id).
Say there are three different session_ids and three clicks in each session with different ts (9 rows in total). Then row_number in your example assigns 1 to the last click of each session, and after filtering on recent_click = 1 you get 3 rows instead of the initial 9. row_number() over() without partition by would number all rows from 1 to 9 in an arbitrary order (possibly differing from run to run), and the same filter would keep only one arbitrary row; the other 8 rows, mixed from all 3 sessions, would be filtered out.
See also this answer https://stackoverflow.com/a/55909947/2700344 for more details on how it works in Hive; there is also a similar question about table partitions vs over() in the comments.
Try this example; it may be clearer than a long explanation:
with clicks_data as (
select stack (9,
--session1
1, 1, 'page1', '2020-01-01 01:01:01.123',
1, 1, 'page1', '2020-01-01 01:01:01.124',
1, 1, 'page2', '2020-01-01 01:01:01.125',
--session2
1, 2, 'page1', '2020-01-01 01:02:02.123',
1, 2, 'page2', '2020-01-01 01:02:02.124',
1, 2, 'page1', '2020-01-01 01:02:02.125',
--session 3
1, 3, 'page1', '2020-01-01 01:03:01.123',
1, 3, 'page2', '2020-01-01 01:03:01.124',
1, 3, 'page1', '2020-01-01 01:03:01.125'
) as(user_id, session_id, page_name, ts)
)
SELECT
user_id
,session_id
,page_name
,ts
,ROW_NUMBER() OVER (PARTITION BY session_id ORDER BY ts DESC) AS rn1
,ROW_NUMBER() OVER() AS rn2
FROM clicks_data
Result:
user_id  session_id  page_name  ts                       rn1  rn2
1        2           page1      2020-01-01 01:02:02.125  1    1
1        2           page2      2020-01-01 01:02:02.124  2    2
1        2           page1      2020-01-01 01:02:02.123  3    3
1        1           page2      2020-01-01 01:01:01.125  1    4
1        1           page1      2020-01-01 01:01:01.124  2    5
1        1           page1      2020-01-01 01:01:01.123  3    6
1        3           page1      2020-01-01 01:03:01.125  1    7
1        3           page2      2020-01-01 01:03:01.124  2    8
1        3           page1      2020-01-01 01:03:01.123  3    9
The first row_number assigned 1 to the row with the max timestamp in each session (partition). The second row_number, with no partition or order specified, numbered all rows from 1 to 9. Why is rn2 = 1 on the session 2 row with the max timestamp in that session; shouldn't it be random? Because, to calculate the first row_number, all rows were distributed by session_id and ordered by timestamp descending, and it happened that the second row_number received session 2 first (its file was read by the reducer before the other two files prepared by the mappers); and since the rows were already sorted for rn1, rn2 received them in the same order. Without the first row_number it could look "more random". The bigger the dataset, the more random the rn2 order will look.
I am creating ranks over partitions of my table. The partitioning is by the name column, ordered by the transaction value. While generating these ranks and checking the count for each rank, I get a different number in each rank on every run of the query.
select count(*) FROM (
--
-- Sort and ranks the element of RFM
--
SELECT
*,
RANK() OVER (PARTITION BY name ORDER BY date_since_last_trans desc) AS rfmrank_r
FROM (
SELECT
name,
id_customer,
cust_age,
gender,
DATE_DIFF(entity_max_date, customer_max_date, DAY ) AS date_since_last_trans,
txncnt,
txnval,
txnval / txncnt AS avg_txnval
FROM
(
SELECT
name,
id_customer,
MAX(cust_age) AS cust_age,
COALESCE(APPROX_TOP_COUNT(cust_gender,1)[OFFSET(0)].VALUE, MAX(cust_gender)) AS gender,
MAX(date_date) AS customer_max_date,
(SELECT MAX(date_date) FROM xxxxx) AS entity_max_date,
COUNT(purchase_amount) AS txncnt,
SUM(purchase_amount) AS txnval
FROM
xxxxx
WHERE
date_date > (
SELECT
DATE_SUB(MAX(date_date), INTERVAL 24 MONTH) AS max_date
FROM
xxxxx)
AND cust_age >= 15
AND cust_gender IN ('M','F')
GROUP BY
name,
id_customer
)
)
)
group by rfmrank_r
For the 1st run I get:
Row  f0
1    3970
2    3017
3    2116
4    2118
For the 2nd run I get:
Row  f0
1    4060
2    3233
3    2260
4    2145
What can be done if I need the same rows to fall into the same ranks (and so the same counts per rank) on every run?
Edit:
Sorry for the blurring of fields
This is the output of the query used to get this column.
The RANK window function determines the rank of a value in a group of values.
Each value is ranked within its partition. Rows with equal values for the ranking criteria receive the same rank. Drill adds the number of tied rows to the tied rank to calculate the next rank and thus the ranks might not be consecutive numbers.
For example, if two rows are ranked 1, the next rank is 3.
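A minimal illustration of that tie behavior (made-up values, BigQuery-flavored syntax):

SELECT v, RANK() OVER (ORDER BY v) AS rnk
FROM UNNEST([10, 10, 20]) AS v;
-- v  | rnk
-- 10 | 1
-- 10 | 1
-- 20 | 3
-- ties on the ORDER BY value share a rank, and the next rank skips ahead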
I have a situation where I need to create an ordered "event" or "touch" ranking based on the date and the user that touched a case from a historical log table. For example, I have a log table that looks like this:
case_id user_id log_date
------- ------- -----------
1       5       06-29 12:05
1       5       06-29 12:10
1       5       06-30 9:12
1       3       06-30 9:15
And I want to get this:
case_id user_id log_date    EventNumber
------- ------- ----------- -----------
1       5       06-29 12:05 1
1       5       06-29 12:10 1
1       5       06-30 9:12  2
1       3       06-30 9:15  3
Basically either a change in the date or a change in the user that touched a case signifies that a new event has occurred. The closest I got so far is [EventNum] = DENSE_RANK() OVER (PARTITION BY case_id ORDER BY CONVERT(DATE, log_date), user_id)
The problem with this approach is the secondary ordering: while it correctly increments the rank when a different user touches the case, it would put the second user first whenever that user_id happens to be a lower number. I can't figure out how to "partition" by users while maintaining the original logged order. Even the date break part isn't essential; I would settle for breaking up the ranking only by users, provided the original logged order remains the same. Any advice?
This is a tricky question. You need to identify groups where the date and user are adjacent. One method is to use lag(), but that is not available in SQL Server 2008. Another method is to use a difference of row numbers.
The difference defines the group. You then need to get the minimum date for the final ordering. So:
select t.*,
       dense_rank() over (partition by case_id order by grp_log_date) as EventNum
from (select t.*,
             min(log_date) over (partition by case_id, grp) as grp_log_date
      from (select t.*,
                   (row_number() over (partition by case_id order by log_date) -
                    row_number() over (partition by case_id, user_id, cast(log_date as date)
                                       order by log_date
                                      )
                   ) as grp  -- constant within each run of rows for the same user and day
            from table t
           ) t
     ) t;
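To check the logic against the sample rows, here is a hand-worked trace of the intermediate values (not part of the original answer; the column labels rn_all and rn_user_day are mine for the two row_number() calls):

case_id user_id log_date    rn_all rn_user_day grp grp_log_date EventNum
1       5       06-29 12:05 1      1           0   06-29 12:05  1
1       5       06-29 12:10 2      2           0   06-29 12:05  1
1       5       06-30 9:12  3      1           2   06-30 9:12   2
1       3       06-30 9:15  4      1           3   06-30 9:15   3

The dense_rank over grp_log_date yields exactly the EventNumber column requested in the question.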