Hive: why use partition by in selects? - sql

I do not fully understand the partitioning concept in Hive.
I understand what partitions are and how to create them. What I do not get is why people write select statements with a "partition by" clause, as is done here: SQL most recent using row_number() over partition
SELECT user_id, page_name, recent_click
FROM (
SELECT user_id,
page_name,
row_number() over (partition by session_id order by ts desc) as recent_click
from clicks_data
) T
WHERE recent_click = 1
Why specify the partition key in selects? The partition key was already defined when the table was created, and the select statement will use the partition scheme from the Create Table statement. So why add over (partition by session_id order by ts desc)?
What happens if I skip over (partition by session_id order by ts desc)?

Read about Hive Windowing and Analytics Functions.
row_number() is an analytics function which numbers rows and requires over().
In the over() you can specify the group (partition) over which it is calculated.
partition by in the over() is not the same as partitioned by in the create table DDL, and the two have nothing in common. In create table it defines how the data is stored (each partition is a separate folder in Hive); a partitioned table is used for optimizing filtering or loading data.
partition by in the over() determines the group in which the function is calculated. It is similar to GROUP BY in the select, with the difference that an analytics function does not change the number of rows.
row_number re-initializes when it crosses a partition boundary and starts again from 1.
row_number also needs order by in the over(). order by determines the order in which rows are numbered.
If you do not specify partition by, row_number works on the whole dataset as a single partition. Exactly one row gets 1, and the maximum number equals the number of rows in the whole dataset. Table partitioning does not affect analytics function behavior.
If you do not specify order by, then row_number will number rows in non-deterministic order and probably different rows will be marked 1 from run to run. This is why you need to specify order by. In your example, order by ts desc means that 1 will be assigned to row with max ts (for each session_id).
Say there are three different session_id values and three clicks in each session with different ts (9 rows in total). Then row_number in your example assigns 1 to the last click of each session, and after filtering on recent_click = 1 you get 3 rows instead of the initial 9. row_number() over() without partition by numbers all rows from 1 to 9 in a random order (which may differ from run to run), and the same filter gives you just 1 row, taken from an arbitrary session.
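The 9-row scenario can be checked outside Hive with a small emulation. This is only a sketch in Python, not Hive code: the helper row_number and the generated sample rows are made up for illustration, and the unpartitioned case is given an explicit order so that the result is reproducible (Hive gives no such guarantee without order by).

```python
from itertools import groupby

# Nine clicks as (session_id, ts): three sessions with three clicks each.
# The values are illustrative, not taken from a real table.
clicks = [(s, f"2020-01-01 01:0{s}:01.12{i}") for s in (1, 2, 3) for i in (3, 4, 5)]


def row_number(rows, partition_key, order_key, reverse=False):
    """Emulate ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...).

    Returns (row, n) pairs; n restarts at 1 in every partition.
    """
    numbered = []
    ordered = sorted(rows, key=partition_key)
    for _, part in groupby(ordered, key=partition_key):
        part_sorted = sorted(part, key=order_key, reverse=reverse)
        numbered.extend((row, n) for n, row in enumerate(part_sorted, start=1))
    return numbered


# PARTITION BY session_id ORDER BY ts DESC: one "1" per session.
per_session = row_number(clicks, lambda r: r[0], lambda r: r[1], reverse=True)
latest_per_session = [row for row, n in per_session if n == 1]

# No PARTITION BY: the whole dataset is a single partition, so only one "1".
# A constant partition key puts every row into the same group.
whole = row_number(clicks, lambda r: 0, lambda r: r[1], reverse=True)
latest_overall = [row for row, n in whole if n == 1]

print(len(latest_per_session), len(latest_overall))
```

The filter n == 1 keeps three rows in the partitioned case and a single row in the unpartitioned one.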
See also this answer https://stackoverflow.com/a/55909947/2700344 for more details on how it works in Hive; there is also a similar question about table partition vs over() in the comments.
Try this example, it may be better than reading a long explanation:
with clicks_data as (
select stack (9,
--session1
1, 1, 'page1', '2020-01-01 01:01:01.123',
1, 1, 'page1', '2020-01-01 01:01:01.124',
1, 1, 'page2', '2020-01-01 01:01:01.125',
--session2
1, 2, 'page1', '2020-01-01 01:02:02.123',
1, 2, 'page2', '2020-01-01 01:02:02.124',
1, 2, 'page1', '2020-01-01 01:02:02.125',
--session 3
1, 3, 'page1', '2020-01-01 01:03:01.123',
1, 3, 'page2', '2020-01-01 01:03:01.124',
1, 3, 'page1', '2020-01-01 01:03:01.125'
) as(user_id, session_id, page_name, ts)
)
SELECT
user_id
,session_id
,page_name
,ts
,ROW_NUMBER() OVER (PARTITION BY session_id ORDER BY ts DESC) AS rn1
,ROW_NUMBER() OVER() AS rn2
FROM clicks_data
Result:
user_id session_id page_name ts rn1 rn2
1 2 page1 2020-01-01 01:02:02.125 1 1
1 2 page2 2020-01-01 01:02:02.124 2 2
1 2 page1 2020-01-01 01:02:02.123 3 3
1 1 page2 2020-01-01 01:01:01.125 1 4
1 1 page1 2020-01-01 01:01:01.124 2 5
1 1 page1 2020-01-01 01:01:01.123 3 6
1 3 page1 2020-01-01 01:03:01.125 1 7
1 3 page2 2020-01-01 01:03:01.124 2 8
1 3 page1 2020-01-01 01:03:01.123 3 9
The first row_number assigned 1 to the row with the max timestamp in each session (partition). The second row_number, with no partition or order specified, numbered all rows from 1 to 9. Why does rn2=1 land on session 2 and its max timestamp; shouldn't it be random? Because to calculate the first row_number, all rows were distributed by session_id and ordered by timestamp desc. The second row_number happened to receive session 2 first (the reducer read it before the other two files prepared by the mapper), and since the rows were already sorted for rn1, rn2 received them in the same order. Without the first row_number, the order could be "more random". The bigger the dataset, the more random the rn2 order will look.

Related

RANK() function with over is creating ranks dynamically for every run

I am creating ranks for partitions of my table. The partitions are defined by the name column and ordered by its transaction value. When I generate these partitions and check the count for each rank, I get a different number in each rank on every query run.
select count(*) FROM (
--
-- Sort and ranks the element of RFM
--
SELECT
*,
RANK() OVER (PARTITION BY name ORDER BY date_since_last_trans desc) AS rfmrank_r,
FROM (
SELECT
name,
id_customer,
cust_age,
gender,
DATE_DIFF(entity_max_date, customer_max_date, DAY ) AS date_since_last_trans,
txncnt,
txnval,
txnval / txncnt AS avg_txnval
FROM
(
SELECT
name,
id_customer,
MAX(cust_age) AS cust_age,
COALESCE(APPROX_TOP_COUNT(cust_gender,1)[OFFSET(0)].VALUE, MAX(cust_gender)) AS gender,
MAX(date_date) AS customer_max_date,
(SELECT MAX(date_date) FROM xxxxx) AS entity_max_date,
COUNT(purchase_amount) AS txncnt,
SUM(purchase_amount) AS txnval
FROM
xxxxx
WHERE
date_date > (
SELECT
DATE_SUB(MAX(date_date), INTERVAL 24 MONTH) AS max_date
FROM
xxxxx)
AND cust_age >= 15
AND cust_gender IN ('M','F')
GROUP BY
name,
id_customer
)
)
)
group by rfmrank_r
For 1st run I am getting
Row f0
1 3970
2 3017
3 2116
4 2118
For 2nd run I am getting
Row f0
1 4060
2 3233
3 2260
4 2145
What can I do to get the same count for each rank on every run?
The RANK window function determines the rank of a value in a group of values.
Each value is ranked within its partition. Rows with equal values for the ranking criteria receive the same rank. Drill adds the number of tied rows to the tied rank to calculate the next rank and thus the ranks might not be consecutive numbers.
For example, if two rows are ranked 1, the next rank is 3.
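The tie rule described above can be sketched in a few lines of Python. The function rank below is a hypothetical emulation of RANK() OVER (ORDER BY ...) on a single partition, written only to show how tied values share a rank and leave a gap; it is not the engine's implementation.

```python
def rank(values, reverse=False):
    """Emulate RANK() OVER (ORDER BY value) for a single partition.

    Tied values share a rank; the next distinct value gets its 1-based
    position in the sort, so ranks can skip numbers.
    """
    ordered = sorted(values, reverse=reverse)
    ranks = []
    for position, value in enumerate(ordered, start=1):
        if position > 1 and value == ordered[position - 2]:
            ranks.append(ranks[-1])   # tie: reuse the previous rank
        else:
            ranks.append(position)    # new value: rank equals sort position
    return list(zip(ordered, ranks))


ranked = rank([30, 30, 12, 7], reverse=True)
print(ranked)  # [(30, 1), (30, 1), (12, 3), (7, 4)]
```

Two rows tie at rank 1, so rank 2 is never assigned and the next row gets rank 3.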

Max difference between update timestamps

I have a table:
id | updated_at
1 | 2018-10-22T21:00:00Z
2 | 2018-10-22T21:02:00Z
I'd like to find the largest delta for a given day between closest updated timestamps. For example, if there were 5 rows:
id | updated_at
1 | 2018-10-22T21:00:00Z
2 | 2018-10-22T21:02:00Z
3 | 2018-10-22T21:05:00Z
4 | 2018-10-22T21:06:00Z
5 | 2018-10-22T21:16:00Z
The largest delta is between 4 and 5 (10 minutes). Note that really when comparing, I just want to find the next closest updated_at timestamp and then give me the max of this. I feel like I'm messing up the subquery to do this.
with nearest_time(time_diff)
as
(
select datediff('minute', updated_at as u1, (select updated_at from table where updated_at > u1 limit 1) as u2)
group by updated_at::date
)
select max(select time_diff from nearest_time);
demo:db<>fiddle
SELECT
lead(updated) OVER (ORDER BY updated) - updated as diff
FROM dates
ORDER BY diff DESC NULLS LAST
LIMIT 1;
The window function LEAD gives you the value from the next row; in this case, the next timestamp.
With that you can do a subtraction, sort the results in descending order and take the first value.
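The same idea can be sketched in Python on the five sample rows from the question: pair each timestamp with the next one in sorted order, subtract, and take the maximum. The variable names are illustrative only.

```python
from datetime import datetime

# The five updated_at values from the question.
updated = [
    datetime(2018, 10, 22, 21, 0),
    datetime(2018, 10, 22, 21, 2),
    datetime(2018, 10, 22, 21, 5),
    datetime(2018, 10, 22, 21, 6),
    datetime(2018, 10, 22, 21, 16),
]

ordered = sorted(updated)
# zip pairs each row with the following one, which plays the role of
# lead(updated) OVER (ORDER BY updated). The last row has no successor.
diffs = [nxt - cur for cur, nxt in zip(ordered, ordered[1:])]
largest = max(diffs)

print(largest)  # 0:10:00
```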
Use lag to get the updated_at from the previous row and then get the max difference per day.
select dt_updated_at,max(time_diff)
from (select updated_at::date as dt_updated_at
,updated_at - lag(updated_at) over(partition by updated_at::date order by updated_at) as time_diff
from tbl
) t
group by dt_updated_at
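A rough Python equivalent of the per-day version, as a sketch: rows are grouped by calendar day and gaps are computed only inside each day, which mirrors partition by updated_at::date. The function name max_gap_per_day and the two rows on 2018-10-23 are invented for the example.

```python
from datetime import datetime
from itertools import groupby

# Sample rows from the question plus two made-up rows on the next day.
rows = [
    datetime(2018, 10, 22, 21, 0),
    datetime(2018, 10, 22, 21, 2),
    datetime(2018, 10, 22, 21, 5),
    datetime(2018, 10, 22, 21, 6),
    datetime(2018, 10, 22, 21, 16),
    datetime(2018, 10, 23, 8, 0),    # hypothetical
    datetime(2018, 10, 23, 8, 45),   # hypothetical
]


def max_gap_per_day(timestamps):
    """Largest gap between consecutive timestamps, computed per calendar day."""
    result = {}
    for day, group in groupby(sorted(timestamps), key=lambda t: t.date()):
        day_rows = list(group)
        # Each row minus the previous row of the same day, like lag() per partition.
        gaps = [cur - prev for prev, cur in zip(day_rows, day_rows[1:])]
        # A day with a single row has no gap (NULL in SQL, None here).
        result[day] = max(gaps, default=None)
    return result


gaps_by_day = max_gap_per_day(rows)
for day, gap in gaps_by_day.items():
    print(day, gap)
```

Because the partition restarts every day, the gap across midnight (21:16 to 08:00) is not counted.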
One more option uses DISTINCT ON (this only works on Postgres; the question was initially tagged Postgres, so I am keeping this answer):
select distinct on
(updated_at::date)
updated_at::date as dt_updated_at
,updated_at-lag(updated_at) over(partition by updated_at::date order by updated_at) as diff
from dates
order by updated_at::date,diff desc
nulls last

How to group following rows by not unique value

I have data like this:
table1
_____________
id way time
1 1 00:01
2 1 00:02
3 2 00:03
4 2 00:04
5 2 00:05
6 3 00:06
7 3 00:07
8 1 00:08
9 1 00:09
I would like to know in which time interval I was on which way:
desired output
_________________
id way from to
1 1 00:01 00:02
3 2 00:03 00:05
6 3 00:06 00:07
8 1 00:08 00:09
I tried to use a window function:
SELECT DISTINCT
first_value(id) OVER w AS id,
first_value(way) OVER w as way,
first_value(time) OVER w as from,
last_value(time) OVER w as to
FROM table1
WINDOW w AS (
PARTITION BY way ORDER BY ID
range between unbounded preceding and unbounded following);
What I get is:
ID way from to
1 1 00:01 00:09
3 2 00:03 00:05
6 3 00:06 00:07
And this is not correct, because I wasn't on way 1 from 00:01 to 00:09.
Is it possible to partition according to the order, i.e. to group only consecutive rows that have the same value?
If your case is as simple as the example values suggest, @Giorgos' answer serves nicely.
However, that's typically not the case. If the id column is a serial, you cannot rely on the assumption that a row with an earlier time also has a smaller id.
Also, time values (or timestamp like you probably have) can easily be duplicates, you need to make the sort order unambiguous.
Assuming both can happen, and you want the id from the row with the earliest time per time slice (actually, the smallest id for the earliest time, there could be ties), this query would deal with the situation properly:
SELECT *
FROM (
SELECT DISTINCT ON (way, grp)
id, way, time AS time_from
, max(time) OVER (PARTITION BY way, grp) AS time_to
FROM (
SELECT *
, row_number() OVER (ORDER BY time, id) -- id as tie breaker
- row_number() OVER (PARTITION BY way ORDER BY time, id) AS grp
FROM table1
) t
ORDER BY way, grp, time, id
) sub
ORDER BY time_from, id;
ORDER BY time, id to be unambiguous. Assuming time is not unique, add the (assumed unique) id to avoid arbitrary results - that could change between queries in sneaky ways.
max(time) OVER (PARTITION BY way, grp): without ORDER BY, the window frame spans all rows of the PARTITION, so we get the absolute maximum per time slice.
The outer query layer is only necessary to produce the desired sort order in the result, since we are bound to a different ORDER BY in the subquery sub by using DISTINCT ON. Details:
Select first row in each GROUP BY group?
SQL Fiddle demonstrating the use case.
If you are looking to optimize performance, a plpgsql function could be faster in such a case. Closely related answer:
Group by repeating attribute
Aside: don't use the basic type name time as identifier (also a reserved word in standard SQL).
I think you want something like this:
select min(id), way,
min(time), max(time)
from (
select id, way, time,
ROW_NUMBER() OVER (ORDER BY id) -
ROW_NUMBER() OVER (PARTITION BY way ORDER BY time) AS grp
from table1 ) t
group by way, grp
grp identifies 'islands' of successive way values. Using this calculated field in an outer query, we can get start and end times of way intervals using MIN and MAX aggregate functions respectively.
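To see why the difference of the two row numbers stays constant inside an island, the computation can be traced in Python on the sample data. This is a sketch of the technique, not a replacement for the query; the function name islands is made up, and it assumes, like the query above, that id and time sort the rows the same way.

```python
from collections import defaultdict

# (id, way, time) rows from the question.
rows = [
    (1, 1, "00:01"), (2, 1, "00:02"),
    (3, 2, "00:03"), (4, 2, "00:04"), (5, 2, "00:05"),
    (6, 3, "00:06"), (7, 3, "00:07"),
    (8, 1, "00:08"), (9, 1, "00:09"),
]


def islands(rows):
    """Group consecutive rows with the same way via two row numbers."""
    # ROW_NUMBER() OVER (ORDER BY id)
    overall = {r[0]: n for n, r in enumerate(sorted(rows, key=lambda r: r[0]), start=1)}
    per_way = defaultdict(int)   # ROW_NUMBER() OVER (PARTITION BY way ORDER BY time)
    groups = defaultdict(list)
    for row in sorted(rows, key=lambda r: r[2]):
        row_id, way, _ = row
        per_way[way] += 1
        grp = overall[row_id] - per_way[way]   # constant within one island
        groups[(way, grp)].append(row)
    # min(id), way, min(time), max(time) per (way, grp)
    result = [
        (min(r[0] for r in g), way, min(r[2] for r in g), max(r[2] for r in g))
        for (way, _), g in groups.items()
    ]
    return sorted(result)


for line in islands(rows):
    print(line)
```

On this data way 1 produces two separate groups, (1, 0) for ids 1-2 and (1, 5) for ids 8-9, which is what a plain PARTITION BY way cannot do.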
Demo here

PostgreSQL window function: partition by comparison

I'm trying to find the way of doing a comparison with the current row in the PARTITION BY clause in a WINDOW function in PostgreSQL query.
Imagine I have the short list of these 5 elements in the following query (in the real case, I have thousands or even millions of rows). For each row, I am trying to get the id of the next different element (event column) and the id of the previous different element.
WITH events AS(
SELECT 1 as id, 12 as event, '2014-03-19 08:00:00'::timestamp as date
UNION SELECT 2 as id, 12 as event, '2014-03-19 08:30:00'::timestamp as date
UNION SELECT 3 as id, 13 as event, '2014-03-19 09:00:00'::timestamp as date
UNION SELECT 4 as id, 13 as event, '2014-03-19 09:30:00'::timestamp as date
UNION SELECT 5 as id, 12 as event, '2014-03-19 10:00:00'::timestamp as date
)
SELECT lag(id) over w as previous_different, event
, lead(id) over w as next_different
FROM events ev
WINDOW w AS (PARTITION BY event!=ev.event ORDER BY date ASC);
I know the comparison event!=ev.event is incorrect but that's the point I want to reach.
The result I get is (the same as if I delete the PARTITION BY clause):
|12|2
1|12|3
2|13|4
3|13|5
4|12|
And the result I would like to get is:
|12|3
|12|3
2|13|5
2|13|5
4|12|
Does anyone know if this is possible, and how? Thank you very much!
EDIT: I know I can do it with two JOINs, an ORDER BY and a DISTINCT ON, but in the real case with millions of rows it is very inefficient:
WITH events AS(
SELECT 1 as id, 12 as event, '2014-03-19 08:00:00'::timestamp as date
UNION SELECT 2 as id, 12 as event, '2014-03-19 08:30:00'::timestamp as date
UNION SELECT 3 as id, 13 as event, '2014-03-19 09:00:00'::timestamp as date
UNION SELECT 4 as id, 13 as event, '2014-03-19 09:30:00'::timestamp as date
UNION SELECT 5 as id, 12 as event, '2014-03-19 10:00:00'::timestamp as date
)
SELECT DISTINCT ON (e.id, e.date) e1.id, e.event, e2.id
FROM events e
LEFT JOIN events e1 ON (e1.date<=e.date AND e1.id!=e.id AND e1.event!=e.event)
LEFT JOIN events e2 ON (e2.date>=e.date AND e2.id!=e.id AND e2.event!=e.event)
ORDER BY e.date ASC, e.id ASC, e1.date DESC, e1.id DESC, e2.date ASC, e2.id ASC
Using several different window functions and two subqueries, this should work decently fast:
WITH events(id, event, ts) AS (
VALUES
(1, 12, '2014-03-19 08:00:00'::timestamp)
,(2, 12, '2014-03-19 08:30:00')
,(3, 13, '2014-03-19 09:00:00')
,(4, 13, '2014-03-19 09:30:00')
,(5, 12, '2014-03-19 10:00:00')
)
SELECT first_value(pre_id) OVER (PARTITION BY grp ORDER BY ts) AS pre_id
, id, ts
, first_value(post_id) OVER (PARTITION BY grp ORDER BY ts DESC) AS post_id
FROM (
SELECT *, count(step) OVER w AS grp
FROM (
SELECT id, ts
, NULLIF(lag(event) OVER w, event) AS step
, lag(id) OVER w AS pre_id
, lead(id) OVER w AS post_id
FROM events
WINDOW w AS (ORDER BY ts)
) sub1
WINDOW w AS (ORDER BY ts)
) sub2
ORDER BY ts;
Using ts as name for the timestamp column.
Assuming ts to be unique - and indexed (a unique constraint does that automatically).
In a test with a real life table with 50k rows it only needed a single index scan. So, should be decently fast even with big tables. In comparison, your query with join / distinct did not finish after a minute (as expected).
Even an optimized version, dealing with one cross join at a time (the left join with hardly a limiting condition is effectively a limited cross join) did not finish after a minute.
For best performance with a big table, tune your memory settings, in particular for work_mem (for big sort operations). Consider setting it (much) higher for your session temporarily if you can spare the RAM. Read more here and here.
How?
In subquery sub1 look at the event from the previous row and only keep that if it has changed, thus marking the first element of a new group. At the same time, get the id of the previous and the next row (pre_id, post_id).
In subquery sub2, count() only counts non-null values. The resulting grp marks peers in blocks of consecutive same events.
In the final SELECT, take the first pre_id and the last post_id per group for each row to arrive at the desired result.
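The three steps can be traced in Python on the five sample events. This is a sketch that mirrors the logic only (change marker, running count as group number, then the neighbours of each group); the function name neighbours_of_different_event is made up, and, as in the answer, ts is assumed to be unique.

```python
# (id, event, ts) rows from the question.
events = [
    (1, 12, "2014-03-19 08:00:00"),
    (2, 12, "2014-03-19 08:30:00"),
    (3, 13, "2014-03-19 09:00:00"),
    (4, 13, "2014-03-19 09:30:00"),
    (5, 12, "2014-03-19 10:00:00"),
]


def neighbours_of_different_event(events):
    """For each row: id before its block, event, id after its block."""
    rows = sorted(events, key=lambda e: e[2])  # ORDER BY ts

    # sub1 + sub2: a running count of event changes numbers the blocks.
    groups, grp = [], 0
    for i, (_, event, _) in enumerate(rows):
        if i > 0 and rows[i - 1][1] != event:  # step is non-null: event changed
            grp += 1
        groups.append(grp)

    result = []
    for i, (_, event, _) in enumerate(rows):
        first = groups.index(groups[i])                           # first row of the block
        last = len(groups) - 1 - groups[::-1].index(groups[i])    # last row of the block
        pre_id = rows[first - 1][0] if first > 0 else None        # lag(id) of first row
        post_id = rows[last + 1][0] if last + 1 < len(rows) else None  # lead(id) of last row
        result.append((pre_id, event, post_id))
    return result


for line in neighbours_of_different_event(events):
    print(line)
```

The output matches the result the question asks for, with None standing in for the empty cells.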
Actually, this should be even faster in the outer SELECT:
last_value(post_id) OVER (PARTITION BY grp ORDER BY ts
RANGE BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING) AS post_id
... since the sort order of the window agrees with the window for pre_id, so only a single sort is needed. A quick test seems to confirm it. More about this frame definition.
SQL Fiddle.

Multiple filters on SQL query

I have been reading many topics about filtering SQL queries, but none seems to apply to my case, so I'm in need of a bit of help. I have the following data on a SQL table.
Date | item | quantity moved | quantity in stock | sequence
13-03-2012 16:51:00 | xpto | 2 | 2 | 1
13-03-2012 16:51:00 | xpto | -2 | 0 | 2
21-03-2012 15:31:21 | zyx | 4 | 6 | 1
21-03-2012 16:20:11 | zyx | 6 | 12 | 2
22-03-2012 12:51:12 | zyx | -3 | 9 | 1
These are the quantities moved in the warehouse. The problem is the first two rows, which record a reception and a return at the same time. I'm trying to write a query which gives me the stock of all items at a given time. I use max(date), but I don't get the right quantity in the result.
SELECT item, qty_in_stock
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY item ORDER BY item_date DESC, sequence DESC) rn
FROM mytable
WHERE item_date <= @date_of_stock
) q
WHERE rn = 1
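The effect of the sequence tie breaker in the query above can be checked with a small Python sketch on the sample rows. The function stock_at is hypothetical; it keeps, per item, the row with the greatest (date, sequence) pair at or before the cutoff, which is what rn = 1 selects.

```python
from datetime import datetime

# (date, item, quantity moved, quantity in stock, sequence) from the question.
movements = [
    (datetime(2012, 3, 13, 16, 51, 0), "xpto", 2, 2, 1),
    (datetime(2012, 3, 13, 16, 51, 0), "xpto", -2, 0, 2),
    (datetime(2012, 3, 21, 15, 31, 21), "zyx", 4, 6, 1),
    (datetime(2012, 3, 21, 16, 20, 11), "zyx", 6, 12, 2),
    (datetime(2012, 3, 22, 12, 51, 12), "zyx", -3, 9, 1),
]


def stock_at(movements, as_of):
    """Quantity in stock per item, taken from the latest row up to as_of."""
    latest = {}
    for date, item, _moved, in_stock, sequence in movements:
        if date > as_of:
            continue
        key = (date, sequence)  # sequence breaks ties between equal dates
        if item not in latest or key > latest[item][0]:
            latest[item] = (key, in_stock)
    return {item: in_stock for item, (_, in_stock) in latest.items()}


stock = stock_at(movements, datetime(2012, 3, 21, 23, 59, 59))
print(stock)  # {'xpto': 0, 'zyx': 12}
```

Without the sequence in the sort key, the two xpto rows with the identical timestamp would be indistinguishable and either 2 or 0 could be returned.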
If you are on SQL Server 2012, there are several nice features that were added.
You can use the LAST_VALUE() or the FIRST_VALUE() function in combination with a ROWS or RANGE window frame (see OVER clause):
SELECT DISTINCT
item,
LAST_VALUE(quantity_in_stock) OVER (PARTITION BY item
ORDER BY date, sequence
ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING)
AS quantity_in_stock
FROM tableX
WHERE date <= @date_of_stock
Add a where clause and do the summation:
select item, sum([quantity moved])
from t
where t.date <= @DESIREDDATETIME
group by item
If you pass a date without a time as the desired datetime, remember that it means midnight at the start of that day, so movements later that day are excluded.
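The midnight caveat can be illustrated in Python on the sample rows. This is a sketch; the function moved_until is made up, and it sums only the movements shown in the question's excerpt, so the totals are not the real stock levels.

```python
from datetime import datetime

# (date, item, quantity moved) from the question's excerpt.
movements = [
    (datetime(2012, 3, 13, 16, 51, 0), "xpto", 2),
    (datetime(2012, 3, 13, 16, 51, 0), "xpto", -2),
    (datetime(2012, 3, 21, 15, 31, 21), "zyx", 4),
    (datetime(2012, 3, 21, 16, 20, 11), "zyx", 6),
    (datetime(2012, 3, 22, 12, 51, 12), "zyx", -3),
]


def moved_until(movements, cutoff):
    """Sum of quantity moved per item for rows with date <= cutoff."""
    totals = {}
    for date, item, moved in movements:
        if date <= cutoff:
            totals[item] = totals.get(item, 0) + moved
    return totals


# A bare date is midnight at the start of the day: 2012-03-21 00:00:00.
at_midnight = moved_until(movements, datetime(2012, 3, 21))
# Including the time covers the movements made during that day.
at_end_of_day = moved_until(movements, datetime(2012, 3, 21, 23, 59, 59))

print(at_midnight)    # {'xpto': 0}
print(at_end_of_day)  # {'xpto': 0, 'zyx': 10}
```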