hive top K sum() records per group by key

For a table TBL with columns A, B, C, I want to group by A, B and, for each A, keep only the top K values of B ranked by sum(C).
Without the top-K limit, this is:
select A, B, sum(C) from TBL group by A, B
With the values
A | B | C
--+---+----
a | 1 | 10
a | 2 | 20
a | 1 | 5
a | 3 | 12
b | 3 | 100
b | 2 | 90
b | 1 | 120
c | 5 | 10
and a limit of 2, the results will be
A | B | sum(C)
--+---+-------
a | 1 | 15
a | 2 | 20
b | 1 | 120
b | 3 | 100
c | 5 | 10

select A
      ,B
      ,sum_C
from  (select A
             ,B
             ,sum(C) as sum_C
             ,row_number () over
              (
                  partition by A
                  order by sum(C) desc
              ) as rn
       from TBL
       group by A
               ,B
      ) t
where rn <= 2
+---+---+-------+
| a | b | sum_c |
+---+---+-------+
| a | 2 | 20 |
| a | 1 | 15 |
| b | 1 | 120 |
| b | 3 | 100 |
| c | 5 | 10 |
+---+---+-------+

You can use windowing functions to achieve this.
Query:
SELECT a, b, c
FROM (
    SELECT *
         , ROW_NUMBER() OVER (PARTITION BY a ORDER BY c DESC) AS rank
    FROM (
        SELECT A AS a
             , B AS b
             , SUM(C) AS c
        FROM db.table
        GROUP BY A, B ) x ) y
WHERE rank < 3
Output:
a b c
a 2 20
a 1 15
b 1 120
b 3 100
c 5 10
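
To try the windowing approach without a Hive cluster, here is a minimal sketch that runs the same shape of query against an in-memory SQLite database from Python. It assumes SQLite 3.25 or later (for window functions); the table name tbl, the function name top_k_per_group, and the extra tie-breaker on b are illustrative choices, not part of the answers above.

```python
import sqlite3

# Sample rows from the question: (A, B, C).
ROWS = [
    ("a", 1, 10), ("a", 2, 20), ("a", 1, 5), ("a", 3, 12),
    ("b", 3, 100), ("b", 2, 90), ("b", 1, 120), ("c", 5, 10),
]


def top_k_per_group(rows, k):
    """Return (a, b, sum_c) for the k largest sum(c) values within each a."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tbl (a TEXT, b INTEGER, c INTEGER)")
    conn.executemany("INSERT INTO tbl VALUES (?, ?, ?)", rows)
    query = """
        SELECT a, b, sum_c
        FROM (
            SELECT a, b, sum_c,
                   ROW_NUMBER() OVER (
                       PARTITION BY a
                       ORDER BY sum_c DESC, b  -- b only breaks ties (assumption)
                   ) AS rn
            FROM (
                -- Step 1: aggregate per (a, b)
                SELECT a, b, SUM(c) AS sum_c
                FROM tbl
                GROUP BY a, b
            ) AS agg
        ) AS ranked
        WHERE rn <= ?          -- Step 2: keep the first k rows of each a
        ORDER BY a, sum_c DESC
    """
    result = conn.execute(query, (k,)).fetchall()
    conn.close()
    return result


for row in top_k_per_group(ROWS, 2):
    print(row)
```

Aggregating first and ranking in an outer query keeps the two steps separate, which mirrors the second answer and should port to Hive with only the table name changed.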

Related

How to get second largest column value and column name

How can I get the second largest column value and its column name?
My current query is mostly correct, but when the largest and second largest values are the same I get wrong values.
select item_code, A, B, C,
greatest(A, B, C) as largest1,
greatest(case when largest1 = A then 0 else A end,
case when largest1 = B then 0 else B end,
case when largest1 = C then 0 else C end) as largest2,
(case largest1 when A then 'A'
when B then 'B'
when C then 'C' end) as largest1_column_name,
(case largest2 when A then 'A'
when B then 'B'
when C then 'C' else 'None' end) as largest2_column_name
from table1
Below is the sample table:
+-----------+----+----+----+
| item_code | A | B | C |
+-----------+----+----+----+
| p1 | 20 | 30 | 40 |
| p2 | 50 | 30 | 10 |
| p3 | 30 | 50 | 10 |
| p4 | 30 | 30 | 30 |
| p5 | 50 | 50 | 10 |
| p6 | 0 | 0 | 0 |
+-----------+----+----+----+
Below is expected output:
+-----------+----+----+----+----------+----------+----------------------+----------------------+
| item_code | A | B | C | largest1 | largest2 | largest1_column_name | largest2_column_name |
+-----------+----+----+----+----------+----------+----------------------+----------------------+
| p1 | 20 | 30 | 40 | 40 | 30 | C | B |
| p2 | 50 | 30 | 10 | 50 | 30 | A | B |
| p3 | 30 | 50 | 10 | 50 | 30 | B | A |
| p4 | 30 | 30 | 30 | 30 | 30 | A | B |
| p5 | 50 | 50 | 10 | 50 | 50 | A | B |
| p6 | 0 | 0 | 0 | 0 | 0 | A | B |
+-----------+----+----+----+----------+----------+----------------------+----------------------+
This is the output I am getting from my query (I have marked wrong as comment):
+-----------+----+----+----+----------+-------------+----------------------+----------------------+
| item_code | A | B | C | largest1 | largest2 | largest1_column_name | largest2_column_name |
+-----------+----+----+----+----------+-------------+----------------------+----------------------+
| p1 | 20 | 30 | 40 | 40 | 30 | C | B |
| p2 | 50 | 30 | 10 | 50 | 30 | A | B |
| p3 | 30 | 50 | 10 | 50 | 30 | B | A |
| p4 | 30 | 30 | 30 | 30 | 0/*wrong*/ | A | NULL/*wrong*/ |
| p5 | 50 | 50 | 10 | 50 | 10/*wrong*/ | A | C/*wrong*/ |
| p6 | 0 | 0 | 0 | 0 | 0/*wrong*/ | A | A/*wrong*/ |
+-----------+----+----+----+----------+-------------+----------------------+----------------------+

I tried a slight variation of this (listagg instead of string_agg) in Snowflake, and it seemed to give the expected result:
with cte (item_code, abc, id) as
(select item_code, a, 'a' from table1 union all
select item_code, b, 'b' from table1 union all
select item_code, c, 'c' from table1)
select item_code,
max(case when id='a' then abc end) a,
max(case when id='b' then abc end) b,
max(case when id='c' then abc end) c,
split_part(string_agg(abc::varchar,',' order by abc desc),',',1) largest1,
split_part(string_agg(abc::varchar,',' order by abc desc),',',2) largest2,
split_part(string_agg(id,',' order by abc desc),',',1) largest1_col,
split_part(string_agg(id,',' order by abc desc),',',2) largest2_col
from cte
group by item_code;

This might be more simply achieved by unpivoting the columns into rows, ranking the values, and then using conditional aggregation. Ties need row_number() with a tie-breaker on the column name rather than dense_rank(), so that equal values still occupy ranks 1 and 2. In Postgres, you could phrase this as:
select t1.*, x.*
from table1 t1
cross join lateral (
    select
        min(val) filter(where rn = 1) largest1,
        min(val) filter(where rn = 2) largest2,
        min(col) filter(where rn = 1) largest1_column_name,
        min(col) filter(where rn = 2) largest2_column_name
    from (
        select x.*, row_number() over(order by val desc, col) rn
        from (values ('A', a), ('B', b), ('C', c)) as x(col, val)
    ) x
) x
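
The unpivot, rank, and aggregate idea can be checked against the sample table with a small sketch in Python and SQLite. SQLite has no lateral join or FILTER-free shortcut here, so this version swaps in UNION ALL for the unpivot and CASE expressions for the conditional aggregation; it assumes SQLite 3.25 or later, and the function name two_largest_columns is made up for illustration.

```python
import sqlite3

# Sample table from the question: (item_code, A, B, C).
ITEMS = [
    ("p1", 20, 30, 40), ("p2", 50, 30, 10), ("p3", 30, 50, 10),
    ("p4", 30, 30, 30), ("p5", 50, 50, 10), ("p6", 0, 0, 0),
]


def two_largest_columns(items):
    """Return (item_code, largest1, largest2, largest1_col, largest2_col)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE table1 (item_code TEXT, a INTEGER, b INTEGER, c INTEGER)"
    )
    conn.executemany("INSERT INTO table1 VALUES (?, ?, ?, ?)", items)
    query = """
        WITH unpivoted AS (
            -- One row per (item, column name, value)
            SELECT item_code, 'A' AS col, a AS val FROM table1
            UNION ALL SELECT item_code, 'B', b FROM table1
            UNION ALL SELECT item_code, 'C', c FROM table1
        ),
        ranked AS (
            -- Ties are broken by column name, so equal values get ranks 1 and 2
            SELECT item_code, col, val,
                   ROW_NUMBER() OVER (
                       PARTITION BY item_code ORDER BY val DESC, col
                   ) AS rn
            FROM unpivoted
        )
        SELECT item_code,
               MAX(CASE WHEN rn = 1 THEN val END) AS largest1,
               MAX(CASE WHEN rn = 2 THEN val END) AS largest2,
               MAX(CASE WHEN rn = 1 THEN col END) AS largest1_column_name,
               MAX(CASE WHEN rn = 2 THEN col END) AS largest2_column_name
        FROM ranked
        GROUP BY item_code
        ORDER BY item_code
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


for row in two_largest_columns(ITEMS):
    print(row)
```

Because every item has exactly one row with rn = 1 and one with rn = 2, each CASE expression sees a single non-null value and MAX simply picks it out.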

Cte within Cte in SQL

I have run into a situation where I need to apply WHERE and GROUP BY conditions to the result of a CTE, inside another CTE.
Table 1 as follows
+---+---+---+---+
| x | y | z | w |
+---+---+---+---+
| 1 | 2 | 3 | 1 |
| 2 | 3 | 4 | 2 |
| 3 | 2 | 5 | 3 |
| 1 | 2 | 6 | 2 |
+---+---+---+---+
Table 2 as follows
+---+---+-----+---+
| a | b | c | d |
+---+---+-----+---+
| 1 | m | 100 | 1 |
| 2 | n | 23 | 2 |
| 4 | o | 34 | 4 |
| 1 | m | 23 | 2 |
+---+---+-----+---+
Assuming I have the data of following sql query in a table called TAB
with cte as (
select x,y,z,w from table1),
cte1 as (select a,b,c,d from table2)
select cte.x,cte.y,cte.z,cte.w,cte1.b,cte1.c from cte left join cte1 on cte.x=cte1.a and cte.w=cte1.d
Result of above CTE would be as follows
+---+---+---+---+---+-----+
| x | y | z | w | b | c |
+---+---+---+---+---+-----+
| 1 | 2 | 3 | 1 | m | 100 |
| 2 | 3 | 4 | 2 | n | 23 |
| 1 | 2 | 6 | 2 | m | 23 |
+---+---+---+---+---+-----+
I would like to query the following from the table TAB
select * from TAB where (X||b) in (select (X||b) from TAB group by (X||Y) having sum(c)=123)
I tried to formulate the SQL query as follows, but it does not work as I expected:
select * from (
with cte as (
select x,y,z from table1),
cte1 as (select a,b,c from table2)
select cte.x,cte1.y,cte1.z,cte2.b,cte2.c from cte left join cte1 on cte.x=cte.a) as TAB
where ((X||b) in (select (X||b) from TAB group by (X||Y) having sum(c)=123))
The final result must be as follows
+---+---+---+---+---+-----+
| x | y | z | w | b | c |
+---+---+---+---+---+-----+
| 1 | 2 | 3 | 1 | m | 100 |
| 1 | 2 | 6 | 2 | m | 23 |
+---+---+---+---+---+-----+

I don't think DB2 allows CTEs in subqueries or to be nested. Why not just write this using another CTE?
with cte as (
      select x,y,z,w
      from table1
     ),
     cte1 as (
      select a,b,c,d
      from table2
     ),
     tab as (
      select cte.x,cte.y,cte.z,cte.w,cte1.b,cte1.c
      from cte left join
           cte1
           on cte.x=cte1.a and cte.w=cte1.d
     )
select *
from TAB
where (X||b) in (select (X||b) from TAB group by (X||b) having sum(c)=123);
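
As a quick check of the chained-CTE idea, here is a sketch that runs it on the two sample tables in an in-memory SQLite database from Python. This is SQLite rather than DB2, so treat it only as a test of the query shape; the function name rows_in_groups_summing_to is hypothetical, and the subquery groups by the same x || b expression it selects.

```python
import sqlite3

TABLE1 = [(1, 2, 3, 1), (2, 3, 4, 2), (3, 2, 5, 3), (1, 2, 6, 2)]
TABLE2 = [(1, "m", 100, 1), (2, "n", 23, 2), (4, "o", 34, 4), (1, "m", 23, 2)]


def rows_in_groups_summing_to(target):
    """Return joined rows whose (x, b) group has sum(c) equal to target."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (x INTEGER, y INTEGER, z INTEGER, w INTEGER)")
    conn.execute("CREATE TABLE table2 (a INTEGER, b TEXT, c INTEGER, d INTEGER)")
    conn.executemany("INSERT INTO table1 VALUES (?, ?, ?, ?)", TABLE1)
    conn.executemany("INSERT INTO table2 VALUES (?, ?, ?, ?)", TABLE2)
    query = """
        WITH cte AS (
            SELECT x, y, z, w FROM table1
        ),
        cte1 AS (
            SELECT a, b, c, d FROM table2
        ),
        tab AS (
            -- A later CTE can read earlier ones, so no nesting is needed
            SELECT cte.x, cte.y, cte.z, cte.w, cte1.b, cte1.c
            FROM cte
            LEFT JOIN cte1 ON cte.x = cte1.a AND cte.w = cte1.d
        )
        SELECT *
        FROM tab
        WHERE (x || b) IN (
            SELECT x || b FROM tab GROUP BY x || b HAVING SUM(c) = ?
        )
        ORDER BY z
    """
    result = conn.execute(query, (target,)).fetchall()
    conn.close()
    return result


for row in rows_in_groups_summing_to(123):
    print(row)
```

The unmatched row with x = 3 has a NULL b, so x || b is NULL for it and the IN test drops it, which is why only the two x = 1 rows remain.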

group by and select non null value if present

Is there a way to group by and use the non-null value of a column, if one is present? For example:
a | b | c | d | e | f |
---------------------------------------------------
1 | 2 | 3 | x | test1 | 2019-07-01 07:17:01 |
1 | 2 | 3 | NULL | test2 | 2019-07-01 10:23:11 |
1 | 2 | 3 | NULL | test3 | 2019-07-01 22:00:51 |
1 | 2 | 7 | NULL | testTet | 2019-07-01 23:00:00 |
In the case above, if d is present for, say, a=1, b=2, c=3, it will always be x; otherwise it is null. So my query would be like
select a,
b,
c,
d,
count(distinct e) as something
from tableX
where f between '2019-07-01 00:00:00' and '2019-07-01 23:59:59.999'
group by a,
b,
c,
d
The results would be:
a | b | c | d | something |
------------------------------|
1 | 2 | 3 | x | 1 |
1 | 2 | 3 | NULL | 2 |
1 | 2 | 7 | NULL | 1 |
whereas I would like to get the following (for each group-by combination I know d is either null or that single value, if present):
a | b | c | d | something |
------------------------------|
1 | 2 | 3 | x | 3 |
1 | 2 | 7 | NULL | 1 |

From your sample data I think that you don't need d in the group by clause.
So get its max:
select
a, b, c,
max(d) d,
count(distinct e) as something
from tableX
where f between '2019-07-01 00:00:00' and '2019-07-01 23:59:59.999'
group by a, b, c
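
This works because MAX skips NULLs, so it returns the one non-null d in a group when there is one and NULL otherwise. A minimal sketch in Python with an in-memory SQLite database, using the sample rows from the question (the function name group_with_non_null_d is only for illustration):

```python
import sqlite3

# Sample rows from the question: (a, b, c, d, e, f).
SAMPLE = [
    (1, 2, 3, "x", "test1", "2019-07-01 07:17:01"),
    (1, 2, 3, None, "test2", "2019-07-01 10:23:11"),
    (1, 2, 3, None, "test3", "2019-07-01 22:00:51"),
    (1, 2, 7, None, "testTet", "2019-07-01 23:00:00"),
]


def group_with_non_null_d(rows):
    """Group by (a, b, c) and report the non-null d of each group, if any."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tableX "
        "(a INTEGER, b INTEGER, c INTEGER, d TEXT, e TEXT, f TEXT)"
    )
    conn.executemany("INSERT INTO tableX VALUES (?, ?, ?, ?, ?, ?)", rows)
    query = """
        SELECT a, b, c,
               MAX(d) AS d,                    -- NULLs are ignored by MAX
               COUNT(DISTINCT e) AS something
        FROM tableX
        WHERE f BETWEEN '2019-07-01 00:00:00' AND '2019-07-01 23:59:59.999'
        GROUP BY a, b, c
        ORDER BY a, b, c
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


for row in group_with_non_null_d(SAMPLE):
    print(row)
```

Timestamps are stored as text here, which is an assumption of the sketch; the BETWEEN filter still works because the format sorts lexicographically.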

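Try the following. Note that summing the per-d distinct counts equals the overall distinct count only when no value of e appears under both d = x and d = NULL for the same a, b, c.
with cte as (
    select a,
           b,
           c,
           d,
           count(distinct e) as something
    from tableX
    where f between '2019-07-01 00:00:00' and '2019-07-01 23:59:59.999'
    group by a,
             b,
             c,
             d
)
select a, b, c, max(d) as d, sum(something) as something
from cte
group by a, b, c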

Sum all sub group last value by group

Consider the following table:
ID | ITEM | GROUP_ID | VAL | COST
---+------+----------+-----------+-------
1 | A | 1 | 1 | 12
2 | B | 1 | 2 | 12
3 | C | 1 | 3 | 12
4 | D | 1 | 4 | 13
5 | D | 1 | 5 | 12
6 | E | 2 | 1 | 17
7 | E | 2 | 2 | 10
8 | E | 2 | 3 | 11
9 | E | 2 | 4 | 12
10 | F | 2 | 5 | 15
11 | F | 2 | 6 | 13
12 | F | 2 | 7 | 11
13 | F | 2 | 8 | 12
How can I get the following result?
GROUP_ID | VAL | COST
----------+-----------+-------
1 | 15 | 48
2 | 36 | 24
VAL is the sum of val per group id.
COST is the sum, per group id, of the last cost value of each item.

Use the analytic function ROW_NUMBER(), which is available in Postgres, Oracle, and SQL Server:
WITH last_item as (
SELECT group_id, sum(cost) as sum_cost
FROM (
SELECT t.*,
ROW_NUMBER() over (partition by item order by id desc) as rn
FROM Table1 t
) as t
WHERE rn = 1
GROUP BY t.group_id
),
val_sum as (
SELECT t.group_id, SUM(val) as sum_val
FROM Table1 t
GROUP BY t.group_id
)
SELECT v.group_id, v.sum_val, l.sum_cost
FROM val_sum v
INNER JOIN last_item l
ON v.group_id = l.group_id
OUTPUT
| group_id | sum_val | sum_cost |
|----------|---------|----------|
| 1 | 15 | 48 |
| 2 | 36 | 24 |
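
For anyone who wants to reproduce this locally, here is a minimal sketch of the same two-CTE query run from Python against an in-memory SQLite database (SQLite 3.25 or later is assumed for ROW_NUMBER()). The function name sum_val_and_last_cost is hypothetical, and "last" is taken to mean the row with the highest id per item, as in the query above.

```python
import sqlite3

# Sample rows from the question: (id, item, group_id, val, cost).
DATA = [
    (1, "A", 1, 1, 12), (2, "B", 1, 2, 12), (3, "C", 1, 3, 12),
    (4, "D", 1, 4, 13), (5, "D", 1, 5, 12), (6, "E", 2, 1, 17),
    (7, "E", 2, 2, 10), (8, "E", 2, 3, 11), (9, "E", 2, 4, 12),
    (10, "F", 2, 5, 15), (11, "F", 2, 6, 13), (12, "F", 2, 7, 11),
    (13, "F", 2, 8, 12),
]


def sum_val_and_last_cost(rows):
    """Return (group_id, sum of val, sum of each item's last cost)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE table1 "
        "(id INTEGER, item TEXT, group_id INTEGER, val INTEGER, cost INTEGER)"
    )
    conn.executemany("INSERT INTO table1 VALUES (?, ?, ?, ?, ?)", rows)
    query = """
        WITH last_item AS (
            -- rn = 1 marks the row with the highest id for each item
            SELECT group_id, SUM(cost) AS sum_cost
            FROM (
                SELECT t.*,
                       ROW_NUMBER() OVER (PARTITION BY item ORDER BY id DESC) AS rn
                FROM table1 AS t
            ) AS ranked
            WHERE rn = 1
            GROUP BY group_id
        ),
        val_sum AS (
            SELECT group_id, SUM(val) AS sum_val
            FROM table1
            GROUP BY group_id
        )
        SELECT v.group_id, v.sum_val, l.sum_cost
        FROM val_sum AS v
        INNER JOIN last_item AS l ON v.group_id = l.group_id
        ORDER BY v.group_id
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


for row in sum_val_and_last_cost(DATA):
    print(row)
```

Keeping the two sums in separate CTEs avoids mixing the rn = 1 filter into the val total, which has to include every row.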

Try this
WITH LastRow (id)
AS (
SELECT MAX(id)
FROM TheTable
GROUP BY item, group_id
)
SELECT group_Id, SUM(val), SUM(CASE WHEN B.id IS NULL THEN 0 ELSE cost END)
FROM TheTable A
LEFT OUTER JOIN LastRow B ON A.id = B.id
GROUP BY group_id
EDIT:
Thanks to @Juan Carlos Oropeza for creating the SQL Fiddle test data.

SQL Select top frequent records

I have the following table:
Table
+----+------+-------+
| ID | Name | Group |
+----+------+-------+
| 0 | a | 1 |
| 1 | a | 1 |
| 2 | a | 2 |
| 3 | a | 1 |
| 4 | b | 1 |
| 5 | b | 2 |
| 6 | b | 1 |
| 7 | c | 2 |
| 8 | c | 2 |
| 9 | c | 1 |
+----+------+-------+
I would like to select the top 20 distinct names from a specific group, ordered by how frequently each name occurs in that group. For this example, the result for group 1 would be a, b, c (a: 3 occurrences, b: 2 occurrences, c: 1 occurrence).
Thank you.

SELECT TOP(20) [Name], Count(*) FROM [Table]
WHERE [Group] = 1
GROUP BY [Name]
ORDER BY Count(*) DESC
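
The same filter, group, and order pattern can be tried on the sample table from Python with SQLite. This is only a sketch: SQLite has no TOP, so LIMIT is swapped in, the reserved words are quoted with double quotes instead of brackets, and the function name top_names_in_group and the tie-breaker on the name are illustrative additions.

```python
import sqlite3

# Sample rows from the question: (ID, Name, Group).
RECORDS = [
    (0, "a", 1), (1, "a", 1), (2, "a", 2), (3, "a", 1), (4, "b", 1),
    (5, "b", 2), (6, "b", 1), (7, "c", 2), (8, "c", 2), (9, "c", 1),
]


def top_names_in_group(records, group, limit=20):
    """Return (name, occurrences) for one group, most frequent first."""
    conn = sqlite3.connect(":memory:")
    # "Table" and "Group" are reserved words, so they are quoted.
    conn.execute('CREATE TABLE "Table" (ID INTEGER, Name TEXT, "Group" INTEGER)')
    conn.executemany('INSERT INTO "Table" VALUES (?, ?, ?)', records)
    query = """
        SELECT Name, COUNT(*) AS occurrences
        FROM "Table"
        WHERE "Group" = ?
        GROUP BY Name
        ORDER BY occurrences DESC, Name   -- Name only breaks ties (assumption)
        LIMIT ?
    """
    result = conn.execute(query, (group, limit)).fetchall()
    conn.close()
    return result


for row in top_names_in_group(RECORDS, 1):
    print(row)
```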

SELECT TOP(20)
    [Name], [Group], COUNT(*) AS occurrences
FROM yourtable
GROUP BY [Name], [Group]
ORDER BY COUNT(*) DESC

SELECT
    TOP 20
    Name,
    [Group],
    COUNT(1) Count
FROM
    MyTable
GROUP BY
    Name,
    [Group]
ORDER BY
    Count DESC
