Hive window functions: last value of previous partition - sql

Using Hive window functions, I would like to get the last value of the previous partition:
| name | rank | type |
| one | 1 | T1 |
| two | 2 | T2 |
| thr | 3 | T2 |
| fou | 4 | T1 |
| fiv | 5 | T2 |
| six | 6 | T2 |
| sev | 7 | T2 |
The following query:
SELECT
    name,
    rank,
    type,
    first_value(rank) over (partition by type order by rank) as new_rank
FROM my_table
Would give:
| name | rank | type | new_rank |
| one | 1 | T1 | 1 |
| two | 2 | T2 | 2 |
| thr | 3 | T2 | 2 |
| fou | 4 | T1 | 4 |
| fiv | 5 | T2 | 5 |
| six | 6 | T2 | 5 |
| sev | 7 | T2 | 5 |
But what I need is "the last value of the previous partition":
| name | rank | type | new_rank |
| one | 1 | T1 | NULL |
| two | 2 | T2 | 1 |
| thr | 3 | T2 | 1 |
| fou | 4 | T1 | 3 |
| fiv | 5 | T2 | 4 |
| six | 6 | T2 | 4 |
| sev | 7 | T2 | 4 |

This seems quite tricky. It is a variant of gaps-and-islands. Here is the idea:
1. Identify the "islands" where type stays the same (using the difference of row numbers).
2. Use lag() to bring the previous row's rank into each island.
3. Take the min() of that lagged rank over each island to get the new rank you want.
So:
with gi as (
      select t.*,
             (seqnum - seqnum_t) as grp
      from (select t.*,
                   row_number() over (partition by type order by rank) as seqnum_t,
                   row_number() over (order by rank) as seqnum
            from my_table t
           ) t
     ),
     gi2 as (
      select gi.*,
             lag(rank) over (order by gi.rank) as prev_rank
      from gi
     )
select gi2.*,
       min(prev_rank) over (partition by type, grp) as new_rank
from gi2
order by rank;
Here is a SQL Fiddle (albeit using Postgres).
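For Hive specifically, here is a sketch of the same idea that numbers the islands with a running sum of change flags instead of the difference of row numbers. It is written against the question's my_table; depending on the Hive version, the rank column may need backticks, since rank is also a function name.
with flagged as (
      -- flag the rows where type changes from the previous row (ordered by rank)
      -- and carry along the previous row's rank
      select t.*,
             case when lag(type) over (order by rank) = type then 0 else 1 end as chg,
             lag(rank) over (order by rank) as prev_rank
      from my_table t
     ),
     islands as (
      -- a running sum of the change flags numbers the islands
      select f.*,
             sum(chg) over (order by rank) as grp
      from flagged f
     )
select name, rank, type,
       -- the smallest carried-over rank in an island is the last rank of the previous island
       min(prev_rank) over (partition by grp) as new_rank
from islands
order by rank;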

Related

SQL group by changing column

Suppose I have a table sorted by date as so:
+-------------+--------+
| DATE | VALUE |
+-------------+--------+
| 01-09-2020 | 5 |
| 01-15-2020 | 5 |
| 01-17-2020 | 5 |
| 02-03-2020 | 8 |
| 02-13-2020 | 8 |
| 02-20-2020 | 8 |
| 02-23-2020 | 5 |
| 02-25-2020 | 5 |
| 02-28-2020 | 3 |
| 03-13-2020 | 3 |
| 03-18-2020 | 3 |
+-------------+--------+
I want to group by runs of the same value within that date range, and add a column whose value increments each time the value changes, to denote each group.
I have tried a number of different things, such as using the lag function:
SELECT value, value - lag(value) over (order by date) as count
GROUP BY value
In short, I want to take the table above and have it look like:
+-------------+--------+-------+
| DATE | VALUE | COUNT |
+-------------+--------+-------+
| 01-09-2020 | 5 | 1 |
| 01-15-2020 | 5 | 1 |
| 01-17-2020 | 5 | 1 |
| 02-03-2020 | 8 | 2 |
| 02-13-2020 | 8 | 2 |
| 02-20-2020 | 8 | 2 |
| 02-23-2020 | 5 | 3 |
| 02-25-2020 | 5 | 3 |
| 02-28-2020 | 3 | 4 |
| 03-13-2020 | 3 | 4 |
| 03-18-2020 | 3 | 4 |
+-------------+--------+-------+
I want to eventually have it all in one small table with the earliest date for each.
+-------------+--------+-------+
| DATE | VALUE | COUNT |
+-------------+--------+-------+
| 01-09-2020 | 5 | 1 |
| 02-03-2020 | 8 | 2 |
| 02-23-2020 | 5 | 3 |
| 02-28-2020 | 3 | 4 |
+-------------+--------+-------+
Any help would be very much appreciated.
You can use a combination of the ROW_NUMBER and DENSE_RANK functions to get the required results, like below:
;with cte as
(
    select t.DATE,
           t.VALUE,
           Dense_rank() over (partition by t.VALUE order by t.DATE) as d_rank,
           Row_number() over (partition by t.VALUE order by t.DATE) as r_num
    from t
)
select cte.DATE, cte.VALUE, d_rank as count
from cte
where r_num = 1
You can use lag(), a cumulative sum, and a subquery:
SELECT t.date, t.value,
       SUM(CASE WHEN prev_value = value THEN 0 ELSE 1 END) OVER (ORDER BY date) AS count
FROM (SELECT t.*, LAG(value) OVER (ORDER BY date) AS prev_value
      FROM t
     ) t
Here is a db<>fiddle.
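To get the condensed table with only the earliest date of each run, one option is to wrap the query above in one more level of aggregation. This is a sketch that keeps the t table name and the count alias from that answer:
-- the outer query keeps one row per run: its earliest date, value, and count
SELECT MIN(date) AS date, value, count
FROM (SELECT t.date, t.value,
             SUM(CASE WHEN prev_value = value THEN 0 ELSE 1 END) OVER (ORDER BY date) AS count
      FROM (SELECT t.*, LAG(value) OVER (ORDER BY date) AS prev_value
            FROM t
           ) t
     ) grouped
GROUP BY value, count
ORDER BY MIN(date);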
You can successively use the lag() and row_number() analytic functions, and then filter on inequalities between the values they return:
WITH t2 AS
(
    SELECT LAG(value, 1, value - 1) OVER (ORDER BY date) AS lg,
           t.*
    FROM t
)
SELECT t2.date, t2.value, ROW_NUMBER() OVER (ORDER BY t2.date) AS count
FROM t2
WHERE value - lg != 0
Demo
This returns the condensed table (one row per run of equal values, with its earliest date) directly.

Select one row inside a group according to a criterion in PostgreSQL

I have a table as such (tbl):
+----+-----+------+-----+
| pk | grp | attr | val |
+----+-----+------+-----+
| 0 | 0 | ohif | 4 |
| 1 | 0 | foha | 56 |
| 2 | 0 | slns | 2 |
| 3 | 1 | faso | 11 |
| 4 | 1 | tepj | 4 |
| 5 | 2 | bnda | 12 |
| 6 | 2 | ojdf | 9 |
| 7 | 2 | anaw | 1 |
+----+-----+------+-----+
I would like to select one row from each group, in particular that with the maximum val for each group.
I can easily select grp and val:
SELECT grp, MAX(val)
FROM tbl
GROUP BY grp
Yielding this table (tbl2):
+-----+-----+
| grp | val |
+-----+-----+
| 0 | 56 |
| 1 | 11 |
| 2 | 12 |
+-----+-----+
However, I want this table:
+----+-----+------+-----+
| pk | grp | attr | val |
+----+-----+------+-----+
| 1 | 0 | foha | 56 |
| 3 | 1 | faso | 11 |
| 5 | 2 | bnda | 12 |
+----+-----+------+-----+
Since (grp, val) constitutes a key, I could left-join tbl2 with tbl on same grp and val.
However, I was wondering if there is any other solution: in my real-world situation, tbl is a pretty complex and heavy derived table, and I have the design constraint of not being able to use temp tables. Is there any way to order the rows inside each group according to val and then take the first record of each group?
I'm on PostgreSQL 10, but a standard SQL solution would be the best.
In Postgres, the best approach is distinct on:
SELECT DISTINCT ON (t.grp) t.*
FROM tbl t
ORDER BY t.grp, t.val DESC;
In particular, this can take advantage of an index on (grp, val desc).
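Since the question also asks for a standard-SQL option, a portable sketch of the same idea is to rank the rows inside each group with ROW_NUMBER() and keep the first one (this works on PostgreSQL 10 and most other engines):
-- rank rows inside each group by val, highest first, and keep the top one
SELECT pk, grp, attr, val
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY grp ORDER BY val DESC) AS rn
      FROM tbl t
     ) ranked
WHERE rn = 1
ORDER BY grp;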

Hive - over (partition by ...) with a column not in group by

Is it possible to do something like:
select
    avg(count(distinct user_id)) over (partition by some_date) as average_users_per_day
from user_activity
group by user_type
(notably, the partition by column, some_date, is not in the group by columns)
The idea I'm going for is something like: the average users per day by user type.
I know how to do it using subqueries (see below), but I'd like to know if there is a nice way using only over (partition by ...) and group by.
Notes:
From reading this answer, my understanding (correct me if I'm wrong) is that the following query:
select
    avg(count(distinct a)) over (partition by b)
from foo
group by b
can be expanded equivalently to:
select
    avg(count_distinct_a)
from (
    select
        b,
        count(distinct a) as count_distinct_a
    from foo
    group by b
) t
group by b
And from that, I can tweak it a bit to achieve what I want:
select
    avg(count_distinct_user_id) as average_users_per_day
from (
    select
        user_type,
        count(distinct user_id) as count_distinct_user_id
    from user_activity
    group by user_type, some_date
) t
group by user_type
(notably, the inner group by user_type, some_date differs from the outer group by user_type)
I'd like to be able to tell the partition by-group by interaction to use a "sub-group-by" for the windowing part. Please let me know if my understanding of partition by/group by is completely off.
EDIT: Some sample data and desired output.
Source table:
+---------+-----------+-----------+
| user_id | user_type | some_date |
+---------+-----------+-----------+
| 1 | a | 1 |
| 1 | a | 2 |
| 2 | a | 1 |
| 3 | a | 2 |
| 3 | a | 2 |
| 4 | b | 2 |
| 5 | b | 1 |
| 5 | b | 3 |
| 5 | b | 3 |
| 6 | c | 1 |
| 7 | c | 1 |
| 8 | c | 4 |
| 9 | c | 2 |
| 9 | c | 3 |
| 9 | c | 4 |
+---------+-----------+-----------+
Sample intermediate table (for reasoning with):
+-----------+-----------+---------------------+
| user_type | some_date | distinct_user_count |
+-----------+-----------+---------------------+
| a | 1 | 2 |
| a | 2 | 2 |
| b | 1 | 1 |
| b | 2 | 1 |
| b | 3 | 1 |
| c | 1 | 2 |
| c | 2 | 1 |
| c | 3 | 1 |
| c | 4 | 2 |
+-----------+-----------+---------------------+
SQL is: select user_type, some_date, count(distinct user_id) from user_activity group by user_type, some_date.
Desired result:
+-----------+---------------------+
| user_type | average_daily_users |
+-----------+---------------------+
| a | 2 |
| b | 1 |
| c | 1.5 |
+-----------+---------------------+
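For reference, the asker's own two-level aggregation written out as a single runnable Hive query (the derived table needs an alias in Hive; names taken from the sample data) would look roughly like the sketch below, and should reproduce the desired result above:
select user_type,
       avg(count_distinct_user_id) as average_daily_users
from (-- one row per (user_type, some_date) with its distinct user count
      select user_type,
             some_date,
             count(distinct user_id) as count_distinct_user_id
      from user_activity
      group by user_type, some_date
     ) daily
group by user_type;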

how to select max of row number data in a table in sql?

I have data like this,
| ID | Client | Some_Value | Row_No |
| 1 | HP | 123 | 1 |
| 1 | HP | 1245 | 2 |
| 1 | Dell | 123445 | 3 |
| 2 | HP | 111 | 1 |
| 2 | HP | 223 | 2 |
| 3 | Dell | 34 | 1 |
| 3 | Dell | 5563 | 2 |
And I need output like this:
| ID | Client | Some_Value | Row_No |
| 1 | Dell | 123445 | 3 |
| 2 | HP | 223 | 2 |
| 3 | Dell | 5563 | 2 |
Please consider that I'm a beginner and explain the logic to me.
Use ROW_NUMBER() with PARTITION BY:
;With T AS
(
    SELECT
        ID,
        Client,
        Some_Value,
        Row_No,
        ROW_NUMBER() OVER (PARTITION BY ID ORDER BY Row_No DESC) AS PartNo
    FROM TableName
)
SELECT
    ID,
    Client,
    Some_Value,
    Row_No
FROM T
WHERE T.PartNo = 1
Update Statement Example:
;With T AS
(
    SELECT
        ID,
        Client,
        Some_Value,
        Row_No,
        ROW_NUMBER() OVER (PARTITION BY ID ORDER BY Row_No DESC) AS PartNo
    FROM TableName
)
-- copies the Client value from each ID's last row (highest Row_No) onto all rows of that ID
UPDATE TableName
SET Client = T.Client
FROM T
WHERE T.PartNo = 1
  AND TableName.ID = T.ID
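Since the title asks for the maximum Row_No per ID, the same result can also be written without a CTE by joining against a per-ID aggregate. A sketch using the same TableName placeholder:
-- m holds the largest Row_No per ID; the join keeps only that row
SELECT t.ID, t.Client, t.Some_Value, t.Row_No
FROM TableName t
JOIN (SELECT ID, MAX(Row_No) AS MaxRowNo
      FROM TableName
      GROUP BY ID
     ) m
  ON m.ID = t.ID
 AND m.MaxRowNo = t.Row_No;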

group by top two results based on order

I have been trying to get this to work with some row_number, group by, top, sort of things, but I am missing some fundamental concept. I have a table like so:
+-------+-------+-------+
| name | ord | f_id |
+-------+-------+-------+
| a | 1 | 2 |
| b | 5 | 2 |
| c | 6 | 2 |
| d | 2 | 1 |
| e | 4 | 1 |
| a | 2 | 3 |
| c | 50 | 4 |
+-------+-------+-------+
And my desired output would be:
+-------+---------+--------+-------+
| f_id | ord_n | ord | name |
+-------+---------+--------+-------+
| 2 | 1 | 1 | a |
| 2 | 2 | 5 | b |
| 1 | 1 | 2 | d |
| 1 | 2 | 4 | e |
| 3 | 1 | 2 | a |
| 4 | 1 | 50 | c |
+-------+---------+--------+-------+
The data should be ordered by the ord value, with at most two results per f_id. Should I be working on a stored procedure for this, or can I just do it with SQL? I have experimented with some SELECT TOP subqueries, but nothing has even come close.
Here are some statements to create the test table:
create table help(name varchar(255),ord tinyint,f_id tinyint);
insert into help values
('a',1,2),
('b',5,2),
('c',6,2),
('d',2,1),
('e',4,1),
('a',2,3),
('c',50,4);
You may use the RANK() or DENSE_RANK() function, together with ROW_NUMBER() for the ord_n column.
select A.name, A.ord_n, A.ord, A.f_id
from (
    select
        RANK() OVER (partition by f_id ORDER BY ord asc) AS "Rank",
        ROW_NUMBER() OVER (partition by f_id ORDER BY ord asc) AS "ord_n",
        help.*
    from help
) A
where A.rank <= 2
Sqlfiddle demo
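If this is SQL Server (the mention of TOP and stored procedures suggests it, though that is an assumption), the same two-rows-per-f_id result can also be written with CROSS APPLY and TOP (2) against the help table created above:
-- for each distinct f_id, TOP (2) keeps its two lowest-ord rows,
-- and ROW_NUMBER() numbers them 1 and 2
select x.f_id, x.ord_n, x.ord, x.name
from (select distinct f_id from help) f
cross apply (
    select top (2)
           f.f_id,
           row_number() over (order by h.ord) as ord_n,
           h.ord,
           h.name
    from help h
    where h.f_id = f.f_id
    order by h.ord
) x
order by x.f_id, x.ord_n;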