Windowing function in Hive - sql

I am exploring windowing functions in Hive and I am able to understand the functionality of all the UDFs. However, I am not able to understand the PARTITION BY and ORDER BY that we use with the other functions. Following is a structure very similar to the query which I am planning to build.
SELECT a, RANK() OVER(partition by b order by c) as d from xyz;
Just trying to understand the background process involved for both keywords.
Appreciate the help :)

RANK() analytic function assigns a rank to each row in each partition in the dataset.
PARTITION BY clause determines how the rows are distributed (between reducers, in the case of Hive).
ORDER BY determines how the rows are sorted within each partition.
First phase: distribution. All rows in the dataset are distributed into partitions. In map-reduce, each mapper groups rows according to the PARTITION BY and produces files for each partition. The mapper also does an initial sort of each partition part according to the ORDER BY.
Second phase: sorting. All rows are sorted inside each partition. In map-reduce, each reducer receives the partition files (parts of partitions) produced by the mappers and sorts the rows of the whole partition (merging the partial results) according to the ORDER BY.
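As a rough illustration only (a sketch of the idea, not the exact plan the engine produces), the first two phases behave much like an explicit DISTRIBUTE BY plus SORT BY in Hive:
SELECT a, b, c
FROM xyz
DISTRIBUTE BY b   -- rows with the same b go to the same reducer
SORT BY b, c;     -- rows are then sorted within each reducer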
Third phase: ranking. The rank function assigns a rank to each row in a partition, and is initialized once per partition.
For the first row in the partition, the rank starts with 1. For each subsequent row, rank = previous row's rank + 1. Rows with equal values (as specified in the ORDER BY) are given the same rank, and when two rows share a rank, the next rank is not consecutive (the intermediate value is skipped).
Different partitions can be processed in parallel on different reducers, and small partitions may end up on the same reducer. The rank function re-initializes when it crosses a partition boundary and starts with rank = 1 for each new partition.
Example (rows are already partitioned and sorted inside partitions):
SELECT a, RANK() OVER(partition by b order by c) as d from xyz;
a, b, c, d(rank)
----------------
1 1 1 1 --starts with 1
2 1 1 1 --the same c value, the same rank=1
3 1 2 3 --rank 2 is skipped because second row shares the same rank as first
4 2 3 1 --New partition starts with 1
5 2 4 2
6 2 5 3
If you need consecutive ranks, use dense_rank function. dense_rank will produce rank=2 for the third row in the above dataset.
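For illustration, here is the same sample data with dense_rank (the values follow from the rule above):
SELECT a, DENSE_RANK() OVER(partition by b order by c) as d from xyz;
a, b, c, d(dense_rank)
----------------
1 1 1 1 --starts with 1
2 1 1 1 --the same c value, the same rank=1
3 1 2 2 --no gap: the next distinct c value gets rank 2
4 2 3 1 --New partition starts with 1
5 2 4 2
6 2 5 3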
row_number function assigns a sequential position number to each row in the partition, starting with 1. Rows with equal values receive different, consecutive numbers.
SELECT a, ROW_NUMBER() OVER(partition by b order by c) as d from xyz;
a, b, c, d(row_number)
----------------
1 1 1 1 --starts with 1
2 1 1 2 --the same c value, row number=2
3 1 2 3 --row position=3
4 2 3 1 --New partition starts with 1
5 2 4 2
6 2 5 3
Important note: For rows with the same values, row_number or other such analytic functions may behave non-deterministically and produce different numbers from run to run. The first row in the above dataset may receive number 2 and the second row number 1, or vice versa, because their order is not determined unless you add one more column, such as a, to the ORDER BY clause. In that case all rows will always receive the same row_number from run to run, because their ordering values differ.
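For example, a deterministic variant of the query above (using a as the tie-breaker is just one choice; any unique column works):
SELECT a, ROW_NUMBER() OVER(partition by b order by c, a) as d from xyz;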

Related

Fetching a minimum of N rows, plus all peers of the last row

I have a sample table named assets which looks like this:
id | name   | block_no
----------------------
1  | asset1 | 2
2  | asset2 | 2
3  | asset3 | 3
There can be any number of assets in a specific block. I need a minimum of 100 rows from the table, but they must include every row of the last block_no that appears. For example, if there are 95 rows for block_no 2 and around 20 for block_no 3, I need all 20 rows of block_no 3 as well, as if I were fetching data in packets based on block_no.
Is this possible and feasible?
Postgres 13 or later
There is a dead simple solution using WITH TIES in Postgres 13 or later:
SELECT *
FROM assets
WHERE block_no >= 2 -- your starting block
ORDER BY block_no
FETCH FIRST 100 ROWS WITH TIES;
This will return at least 100 rows (if enough qualify), plus all peers of the 100th row.
If your table isn't trivially small, an index on (block_no) is essential for performance.
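A sketch of such an index (the name is illustrative):
CREATE INDEX assets_block_no_idx ON assets (block_no);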
See:
Get top row(s) with highest value, with ties
Older versions
Use the window function rank() in a subquery:
SELECT (a).*
FROM (
   SELECT a, rank() OVER (ORDER BY block_no) AS rnk
   FROM   assets a
   WHERE  block_no >= 2  -- your starting block
   ) sub
WHERE rnk <= 100;
Same result.
I use a little trick with the row type to strip the added rnk from the result. That's an optional addition.
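Without the trick, a sketch that lists the columns explicitly (column names taken from the sample table above):
SELECT id, name, block_no
FROM (
   SELECT a.*, rank() OVER (ORDER BY block_no) AS rnk
   FROM   assets a
   WHERE  block_no >= 2
   ) sub
WHERE rnk <= 100;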
See:
PostgreSQL equivalent for TOP n WITH TIES: LIMIT "with ties"?

SQL: Apply sequence number to a column based on nth occurrence of each distinct value

I have a table with a column of values where each value occurs a variable number of times (i.e., one value may occur 1 time, and another value may occur 3 times). I need to add a column that identifies the occurrence sequence # of its corresponding value.
Input table:
SOURCE_VAL
----------
a
a
b
c
c
c
Output table:
SEQUENCE_VAL | SOURCE_VAL
-------------------------
1            | a
2            | a
1            | b
1            | c
2            | c
3            | c
What would the SQL for this be to generate the SEQUENCE_VAL column based on SOURCE_VAL?
You are looking for row_number(). Without an ordering column, you can use:
select t.*,
row_number() over (partition by source_val order by source_val) as sequence_val
from t
order by source_val, sequence_val;
Note: This assumes that you do not care about the ordering of the values within each group. If you have another column that specifies the ordering for each source_val, then use that in the ORDER BY.
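For instance, assuming a hypothetical created_at column that defines the order of occurrences within each value:
select t.*,
       row_number() over (partition by source_val order by created_at) as sequence_val
from t
order by source_val, sequence_val;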

How to check max from range in cursor?

I have a problem with transferring an Excel formula to SQL. My excel formula is: =IF(P2<(MAX($P$2:P2));"Move";"").
The P column in excel is a sequence of numbers.
a  | b
--------
1  |
2  |
7  |
3  | MOVE
4  | MOVE
8  |
9  |
5  | MOVE
10 |
I created a cursor with a loop, but I don't know how to check the max over a range.
For example, when I process the fourth row, I have to check the max of rows 1-4, and so on.
No need for a cursor and a loop. Assuming that you have a column that defines the ordering of the rows (say, id), you can use window functions:
select t.*,
case when a < max(a) over(order by id) then 'MOVE' end as b
from mytable t
One option would be using the MAX() analytic function. But in any case, you need an extra column such as id for ordering, in order to determine the max value from the first row up to the current row, since SQL statements represent unordered sets. If you have such an id column with values ordered as in your sample data, then consider using:
WITH t2 AS
(
  SELECT MAX(a) OVER (ORDER BY id
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS max_upto_this_row,
         t.*
  FROM t
)
SELECT a, CASE WHEN max_upto_this_row > a THEN 'Move' END AS b
FROM t2
ORDER BY id;

Group rows into sets of 5

TableA
Col1
----------
1
2
3
4....all the way to 27
I want to add a second column that assigns a number to groups of 5.
Results
Col1 Col2
----- ------
1 1
2 1
3 1
4 1
5 1
6 2
7 2
8 2...and so on
The 6th group should have 2 rows in it.
NTILE doesn't accomplish what I want because of the way NTILE handles the groups if they aren't divisible by the integer.
If the number of rows in a partition is not divisible by integer_expression, this will cause groups of two sizes that differ by one member. Larger groups come before smaller groups in the order specified by the OVER clause. For example if the total number of rows is 53 and the number of groups is five, the first three groups will have 11 rows and the two remaining groups will have 10 rows each. If on the other hand the total number of rows is divisible by the number of groups, the rows will be evenly distributed among the groups. For example, if the total number of rows is 50, and there are five groups, each bucket will contain 10 rows.
This is clearly demonstrated in this SQL Fiddle: groups 4, 5, and 6 each have 4 rows while the rest have 5. I have started some solutions, but they were getting lengthy, and I feel like I'm missing something and that this could be done in a single line.
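For illustration, a sketch of the NTILE attempt (assuming the 27 sample rows and 6 buckets), which yields the 5/5/5/4/4/4 split described above rather than fixed groups of 5:
SELECT col1,
       NTILE(6) OVER (ORDER BY col1) AS col2  -- 27 rows split as 5,5,5,4,4,4
FROM TableA;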
You can use this:
;WITH CTE AS
(
SELECT col1,
RN = ROW_NUMBER() OVER(ORDER BY col1)
FROM TableA
)
SELECT col1, (RN-1)/5+1 col2
FROM CTE;
In your sample data, col1 is a consecutive sequence without gaps, so you could use it directly (if it's an INT) without ROW_NUMBER(). But in case it isn't, this answer works too. Here is the modified sqlfiddle.
A bit of math can go a long way. Subtracting 1 from all values puts the 5s (the edge cases) into the previous group, and the 6s into the next. Flooring the division by your group size and adding one gives the result you're looking for. Also, the SQLFiddle example here fixes your iterative insert; the table only went up to 27.
SELECT col1,
floor((col1-1)/5)+1 as grpNum
FROM tableA

sequential numbering of rows

I have to customize a SQL view to be able to perform a system integration test. One table keeps track of the transactions and the LINE ITEMS per transaction (for example, transaction 2 has three items, so there are three consecutive rows with the same transaction number). What I need to accomplish is to get a column that counts the items within THAT transaction, always starting from 1 for each transaction.
For example, the first column is TransactionNumber and the second is DesiredOutput:
1 1
1 2
2 1
2 2
2 3
3 1
4 1
4 2
ETC...
I know how to number consecutive rows on the WHOLE table, but I cannot find any reference to numbering that restarts based on the value of another column.
For SQL Server 2005+ and Oracle you can use the following:
SELECT TransNumber, ROW_NUMBER() OVER(PARTITION BY Transnumber ORDER BY Something) [DESIRED COLUMN OUTPUT]
FROM YourTable
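Here, Something stands in for whatever column defines the order of line items within a transaction (for example a line-item id or a timestamp); if no such column exists, the numbering within each transaction is arbitrary, but it still restarts at 1 for every TransNumber.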