hive top K sum() records per group by key - hive

For a table TBL with columns A, B, C, I want to group by A, B and, for each A, keep only the top K values of B ranked by sum(C).
Without the top-K limit, this is:
select A, B, sum(C) from TBL group by A, B
with the values
A | B | C
--+---+----
a | 1 | 10
a | 2 | 20
a | 1 | 5
a | 3 | 12
b | 3 | 100
b | 2 | 90
b | 1 | 120
c | 5 | 10
and limit of 2, the results will be
A | B | sum(C)
--+---+-------
a | 1 | 15
a | 2 | 20
b | 1 | 120
b | 3 | 100
c | 5 | 10

select A
,B
,sum_C
from (select A
,B
,sum(C) as sum_C
,row_number () over
(
partition by A
order by sum(C) desc
) as rn
from TBL
group by A
,B
) t
where rn <= 2
+---+---+-------+
| a | b | sum_c |
+---+---+-------+
| a | 2 | 20 |
| a | 1 | 15 |
| b | 1 | 120 |
| b | 3 | 100 |
| c | 5 | 10 |
+---+---+-------+

You can use windowing functions to achieve this.
Query:
SELECT a, b, c
FROM (
SELECT *
, ROW_NUMBER() OVER (PARTITION BY a ORDER BY c DESC) AS rank
FROM (
SELECT A AS a
, B AS b
, SUM(C) AS c
FROM db.table
GROUP BY A, B ) x ) y
WHERE rank < 3
Output:
a b c
a 2 20
a 1 15
b 1 120
b 3 100
c 5 10
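As a quick check of the windowing approach above, here is a minimal sketch that runs the same idea against the sample rows from the question. It assumes SQLite (via Python's sqlite3 module, SQLite 3.25 or newer for window functions) as a stand-in for Hive; the function name top_k_per_group and the tie-break on b are illustrative additions, not part of the original answers.

```python
import sqlite3

# Sample rows from the question: (A, B, C).
ROWS = [
    ("a", 1, 10), ("a", 2, 20), ("a", 1, 5), ("a", 3, 12),
    ("b", 3, 100), ("b", 2, 90), ("b", 1, 120), ("c", 5, 10),
]


def top_k_per_group(rows, k):
    """Return, for each A, the k (A, B) pairs with the largest SUM(C)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tbl (a TEXT, b INTEGER, c INTEGER)")
    conn.executemany("INSERT INTO tbl VALUES (?, ?, ?)", rows)
    query = """
        SELECT a, b, sum_c
        FROM (
            SELECT a, b, sum_c,
                   ROW_NUMBER() OVER (
                       PARTITION BY a
                       ORDER BY sum_c DESC, b  -- b as tie-break is an assumption
                   ) AS rn
            FROM (
                SELECT a, b, SUM(c) AS sum_c
                FROM tbl
                GROUP BY a, b
            ) AS grouped
        ) AS ranked
        WHERE rn <= ?
        ORDER BY a, sum_c DESC
    """
    result = conn.execute(query, (k,)).fetchall()
    conn.close()
    return result


for row in top_k_per_group(ROWS, 2):
    print(row)
```

Without a tie-break column, ROW_NUMBER() picks an arbitrary row among equal sums, so the result for tied groups may differ between runs or engines.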

Related

How to get second largest column value and column name

How can I get the second largest column value and its column name?
My current query is mostly correct, but when the largest and second largest values are the same I get wrong values.
select item_code, A, B, C,
greatest(A, B, C) as largest1,
greatest(case when largest1 = A then 0 else A end,
case when largest1 = B then 0 else B end,
case when largest1 = C then 0 else C end) as largest2,
(case largest1 when A then 'A'
when B then 'B'
when C then 'C' end) as largest1_column_name,
(case largest2 when A then 'A'
when B then 'B'
when C then 'C' else 'None' end) as largest2_column_name
from table1
Below is the sample table:
+-----------+----+----+----+
| item_code | A | B | C |
+-----------+----+----+----+
| p1 | 20 | 30 | 40 |
| p2 | 50 | 30 | 10 |
| p3 | 30 | 50 | 10 |
| p4 | 30 | 30 | 30 |
| p5 | 50 | 50 | 10 |
| p6 | 0 | 0 | 0 |
+-----------+----+----+----+
Below is expected output:
+-----------+----+----+----+----------+----------+----------------------+----------------------+
| item_code | A | B | C | largest1 | largest2 | largest1_column_name | largest2_column_name |
+-----------+----+----+----+----------+----------+----------------------+----------------------+
| p1 | 20 | 30 | 40 | 40 | 30 | C | B |
| p2 | 50 | 30 | 10 | 50 | 30 | A | B |
| p3 | 30 | 50 | 10 | 50 | 30 | B | A |
| p4 | 30 | 30 | 30 | 30 | 30 | A | B |
| p5 | 50 | 50 | 10 | 50 | 50 | A | B |
| p6 | 0 | 0 | 0 | 0 | 0 | A | B |
+-----------+----+----+----+----------+----------+----------------------+----------------------+
This is the output I am getting from my query (I have marked wrong as comment):
+-----------+----+----+----+----------+-------------+----------------------+----------------------+
| item_code | A | B | C | largest1 | largest2 | largest1_column_name | largest2_column_name |
+-----------+----+----+----+----------+-------------+----------------------+----------------------+
| p1 | 20 | 30 | 40 | 40 | 30 | C | B |
| p2 | 50 | 30 | 10 | 50 | 30 | A | B |
| p3 | 30 | 50 | 10 | 50 | 30 | B | A |
| p4 | 30 | 30 | 30 | 30 | 0/*wrong*/ | A | NULL/*wrong*/ |
| p5 | 50 | 50 | 10 | 50 | 10/*wrong*/ | A | C/*wrong*/ |
| p6 | 0 | 0 | 0 | 0 | 0/*wrong*/ | A | A/*wrong*/ |
+-----------+----+----+----+----------+-------------+----------------------+----------------------+
I tried a slight variation of this (listagg instead of string_agg) in Snowflake and it seemed to produce the expected result:
with cte (item_code, abc, id) as
(select item_code, a, 'a' from table1 union all
select item_code, b, 'b' from table1 union all
select item_code, c, 'c' from table1)
select item_code,
max(case when id='a' then abc end) a,
max(case when id='b' then abc end) b,
max(case when id='c' then abc end) c,
split_part(string_agg(abc::varchar,',' order by abc desc),',',1) largest1,
split_part(string_agg(abc::varchar,',' order by abc desc),',',2) largest2,
split_part(string_agg(id,',' order by abc desc),',',1) largest1_col,
split_part(string_agg(id,',' order by abc desc),',',2) largest2_col
from cte
group by item_code;
This might be more simply achieved by unpivoting the rows, ranking the values, and then using conditional aggregation. In Postgres, you could phrase this as:
select t.*, x.*
from table1 t
cross join lateral (
select
min(val) filter(where rn = 1) largest1,
min(val) filter(where rn = 2) largest2,
min(col) filter(where rn = 1) largest1_column_name,
min(col) filter(where rn = 2) largest2_column_name
from (
select x.*, row_number() over(order by val desc, col) rn
from (values ('a', a), ('b', b), ('c', c)) as x(col, val)
) x
) x
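The unpivot, rank, and conditional-aggregation steps can be sketched end to end on the sample table. This is a minimal sketch assuming SQLite through Python's sqlite3 module rather than Postgres or Snowflake, so the lateral join is replaced by a UNION ALL unpivot and the FILTER clause by CASE expressions. The function name two_largest is hypothetical. It uses ROW_NUMBER() with the column name as tie-break, which is what makes tied values (p4, p5, p6) land in both the first and second slot.

```python
import sqlite3

# Sample rows from the question: (item_code, A, B, C).
ROWS = [
    ("p1", 20, 30, 40), ("p2", 50, 30, 10), ("p3", 30, 50, 10),
    ("p4", 30, 30, 30), ("p5", 50, 50, 10), ("p6", 0, 0, 0),
]


def two_largest(rows):
    """Return (item_code, largest1, largest2, name1, name2) per item."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE table1 (item_code TEXT, a INTEGER, b INTEGER, c INTEGER)"
    )
    conn.executemany("INSERT INTO table1 VALUES (?, ?, ?, ?)", rows)
    query = """
        WITH unpivoted AS (
            -- One row per (item, column) pair.
            SELECT item_code, 'A' AS col, a AS val FROM table1
            UNION ALL
            SELECT item_code, 'B', b FROM table1
            UNION ALL
            SELECT item_code, 'C', c FROM table1
        ),
        ranked AS (
            -- Ties are broken by column name, so equal values still get
            -- distinct positions 1, 2, 3.
            SELECT item_code, col, val,
                   ROW_NUMBER() OVER (
                       PARTITION BY item_code
                       ORDER BY val DESC, col
                   ) AS rn
            FROM unpivoted
        )
        SELECT item_code,
               MAX(CASE WHEN rn = 1 THEN val END) AS largest1,
               MAX(CASE WHEN rn = 2 THEN val END) AS largest2,
               MAX(CASE WHEN rn = 1 THEN col END) AS largest1_column_name,
               MAX(CASE WHEN rn = 2 THEN col END) AS largest2_column_name
        FROM ranked
        GROUP BY item_code
        ORDER BY item_code
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


for row in two_largest(ROWS):
    print(row)
```

With DENSE_RANK() instead, tied values share rank 1, so the second slot would skip to the next distinct value (or be empty when all three columns are equal).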

Cte within Cte in SQL

I have encountered a situation where I need to apply a WHERE and GROUP BY condition to the result of a CTE, inside another CTE.
Table 1 as follows
+---+---+---+---+
| x | y | z | w |
+---+---+---+---+
| 1 | 2 | 3 | 1 |
| 2 | 3 | 4 | 2 |
| 3 | 2 | 5 | 3 |
| 1 | 2 | 6 | 2 |
+---+---+---+---+
Table 2 as follows
+---+---+-----+---+
| a | b | c | d |
+---+---+-----+---+
| 1 | m | 100 | 1 |
| 2 | n | 23 | 2 |
| 4 | o | 34 | 4 |
| 1 | m | 23 | 2 |
+---+---+-----+---+
Assume the result of the following SQL query is available as a table called TAB:
with cte as (
select x,y,z,w from table1),
cte1 as (select a,b,c,d from table2)
select cte.x,cte.y,cte.z,cte.w,cte1.b,cte1.c from cte left join cte1 on cte.x=cte1.a and cte.w=cte1.d
Result of above CTE would be as follows
+---+---+---+---+---+-----+
| x | y | z | w | b | c |
+---+---+---+---+---+-----+
| 1 | 2 | 3 | 1 | m | 100 |
| 2 | 3 | 4 | 2 | n | 23 |
| 1 | 2 | 6 | 2 | m | 23 |
+---+---+---+---+---+-----+
I would like to query the following from the table TAB
select * from TAB where (X||b) in (select (X||b) from TAB group by (X||Y) having sum(c)=123)
I'm trying to formulate the SQL query as follows, but it does not work as I expected:
select * from (
with cte as (
select x,y,z from table1),
cte1 as (select a,b,c from table2)
select cte.x,cte.y,cte.z,cte1.b,cte1.c from cte left join cte1 on cte.x=cte1.a) as TAB
where ((X||b) in (select (X||b) from TAB group by (X||Y) having sum(c)=123))
The final result must be as follows
+---+---+---+---+---+-----+
| x | y | z | w | b | c |
+---+---+---+---+---+-----+
| 1 | 2 | 3 | 1 | m | 100 |
| 1 | 2 | 6 | 2 | m | 23 |
+---+---+---+---+---+-----+
I don't think DB2 allows CTEs in subqueries or to be nested. Why not just write this using another CTE?
with cte as (
select x,y,z,w from
table1
),
cte1 as (
select a,b,c,d
from table2
),
tab as (
select cte.x,cte.y,cte.z,cte.w,cte1.b,cte1.c
from cte left join
cte1
on cte.x=cte1.a and cte.w=cte1.d
)
select *
from TAB
where (X||b) in (select (X||b) from TAB group by (X||Y) having sum(c)=123);
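To illustrate chaining a third CTE instead of nesting, here is a minimal sketch on the two sample tables. It assumes SQLite through Python's sqlite3 module rather than DB2, and the function name rows_in_groups_summing_to is hypothetical. One deliberate deviation, labelled as an assumption: the subquery groups by x, b (the columns it selects) instead of X||Y, so that the statement is valid in engines that reject non-grouped columns in the select list.

```python
import sqlite3

# Sample data from the question.
TABLE1 = [(1, 2, 3, 1), (2, 3, 4, 2), (3, 2, 5, 3), (1, 2, 6, 2)]
TABLE2 = [(1, "m", 100, 1), (2, "n", 23, 2), (4, "o", 34, 4), (1, "m", 23, 2)]


def rows_in_groups_summing_to(target):
    """Return TAB rows whose (x, b) group has SUM(c) equal to target."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE table1 (x INTEGER, y INTEGER, z INTEGER, w INTEGER)"
    )
    conn.execute(
        "CREATE TABLE table2 (a INTEGER, b TEXT, c INTEGER, d INTEGER)"
    )
    conn.executemany("INSERT INTO table1 VALUES (?, ?, ?, ?)", TABLE1)
    conn.executemany("INSERT INTO table2 VALUES (?, ?, ?, ?)", TABLE2)
    query = """
        WITH cte AS (SELECT x, y, z, w FROM table1),
             cte1 AS (SELECT a, b, c, d FROM table2),
             tab AS (
                 -- The joined result is just another CTE in the same chain.
                 SELECT cte.x, cte.y, cte.z, cte.w, cte1.b, cte1.c
                 FROM cte
                 LEFT JOIN cte1 ON cte.x = cte1.a AND cte.w = cte1.d
             )
        SELECT *
        FROM tab
        WHERE (x || b) IN (
            -- Assumption: group by the selected columns, not by x || y.
            SELECT x || b FROM tab GROUP BY x, b HAVING SUM(c) = ?
        )
        ORDER BY z
    """
    result = conn.execute(query, (target,)).fetchall()
    conn.close()
    return result


for row in rows_in_groups_summing_to(123):
    print(row)
```

The tab CTE is referenced twice, once in the outer query and once in the IN subquery, which is the point of defining it in the WITH chain rather than as a derived table.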

group by and select non null value if present

Is there a way to group by and use the non-null value of a column if one is present? For example:
a | b | c | d | e | f |
---------------------------------------------------
1 | 2 | 3 | x | test1 | 2019-07-01 07:17:01 |
1 | 2 | 3 | NULL | test2 | 2019-07-01 10:23:11 |
1 | 2 | 3 | NULL | test3 | 2019-07-01 22:00:51 |
1 | 2 | 7 | NULL | testTet | 2019-07-01 23:00:00 |
In the data above, if d is present for a given combination (say a=1, b=2, c=3), it is always the same value (here x); otherwise it is NULL. My query currently looks like this:
select a,
b,
c,
d,
count(distinct e) as something
from tableX
where f between '2019-07-01 00:00:00' and '2019-07-01 23:59:59.999'
group by a,
b,
c,
d
the results would be:
a | b | c | d | something |
------------------------------|
1 | 2 | 3 | x | 1 |
1 | 2 | 3 | NULL | 2 |
1 | 2 | 7 | NULL | 1 |
whereas I would like to get the following (since for each group-by combination I know d is either NULL or that one unique value):
a | b | c | d | something |
------------------------------|
1 | 2 | 3 | x | 3 |
1 | 2 | 7 | NULL | 1 |
From your sample data I think that you don't need d in the group by clause.
So get its max:
select
a, b, c,
max(d) d,
count(distinct e) as something
from tableX
where f between '2019-07-01 00:00:00' and '2019-07-01 23:59:59.999'
group by a, b, c
Try the following:
with cte as (select a,
b,
c,
d,
count(distinct e) as something
from tableX
where f between '2019-07-01 00:00:00' and '2019-07-01 23:59:59.999'
group by a,
b,
c,
d) select a,b,c,max(d) as d,sum(something) as something from cte group by a,b,c
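The MAX(d) approach relies on aggregate functions skipping NULLs: within a group, MAX returns the single non-null value if there is one and NULL otherwise. A minimal sketch on the sample rows, assuming SQLite through Python's sqlite3 module (the original engine is not stated) and a hypothetical function name grouped_counts:

```python
import sqlite3

# Sample rows from the question: (a, b, c, d, e, f).
ROWS = [
    (1, 2, 3, "x", "test1", "2019-07-01 07:17:01"),
    (1, 2, 3, None, "test2", "2019-07-01 10:23:11"),
    (1, 2, 3, None, "test3", "2019-07-01 22:00:51"),
    (1, 2, 7, None, "testTet", "2019-07-01 23:00:00"),
]


def grouped_counts(rows):
    """Group by (a, b, c) and carry along the non-null d, if any."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tableX "
        "(a INTEGER, b INTEGER, c INTEGER, d TEXT, e TEXT, f TEXT)"
    )
    conn.executemany("INSERT INTO tableX VALUES (?, ?, ?, ?, ?, ?)", rows)
    query = """
        SELECT a, b, c,
               MAX(d) AS d,                    -- NULLs are ignored by MAX
               COUNT(DISTINCT e) AS something
        FROM tableX
        WHERE f BETWEEN '2019-07-01 00:00:00' AND '2019-07-01 23:59:59.999'
        GROUP BY a, b, c
        ORDER BY a, b, c
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


for row in grouped_counts(ROWS):
    print(row)
```

Counting distinct e directly over the (a, b, c) group, rather than summing per-d counts, also stays correct if the same e value appears both with and without d.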

Sum all sub group last value by group

Consider the following table:
ID | ITEM | GROUP_ID | VAL | COST
---+------+----------+-----------+-------
1 | A | 1 | 1 | 12
2 | B | 1 | 2 | 12
3 | C | 1 | 3 | 12
4 | D | 1 | 4 | 13
5 | D | 1 | 5 | 12
6 | E | 2 | 1 | 17
7 | E | 2 | 2 | 10
8 | E | 2 | 3 | 11
9 | E | 2 | 4 | 12
10 | F | 2 | 5 | 15
11 | F | 2 | 6 | 13
12 | F | 2 | 7 | 11
13 | F | 2 | 8 | 12
How can I get the following result?
GROUP_ID | VAL | COST
----------+-----------+-------
1 | 15 | 48
2 | 36 | 24
VAL is the sum of VAL per GROUP_ID.
COST is the sum, per GROUP_ID, of the last COST value of each item.
Use the analytic function ROW_NUMBER(), available in Postgres, Oracle, and SQL Server:
WITH last_item as (
SELECT group_id, sum(cost) as sum_cost
FROM (
SELECT t.*,
ROW_NUMBER() over (partition by item order by id desc) as rn
FROM Table1 t
) as t
WHERE rn = 1
GROUP BY t.group_id
),
val_sum as (
SELECT t.group_id, SUM(val) as sum_val
FROM Table1 t
GROUP BY t.group_id
)
SELECT v.group_id, v.sum_val, l.sum_cost
FROM val_sum v
INNER JOIN last_item l
ON v.group_id = l.group_id
OUTPUT
| group_id | sum_val | sum_cost |
|----------|---------|----------|
| 1 | 15 | 48 |
| 2 | 36 | 24 |
Try this
WITH LastRow (id)
AS (
SELECT MAX(id)
FROM TheTable
GROUP BY item, group_id
)
SELECT group_Id, SUM(val), SUM(CASE WHEN B.id IS NULL THEN 0 ELSE cost END)
FROM TheTable A
LEFT OUTER JOIN LastRow B ON A.id = B.id
GROUP BY group_id
EDIT: Thanks to @Juan Carlos Oropeza for creating the SQL Fiddle test data.
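Both answers can be checked against the sample table. The following is a minimal sketch of the ROW_NUMBER() variant, assuming SQLite through Python's sqlite3 module (3.25 or newer) instead of the engines named above; the function name sum_val_and_last_cost is hypothetical. It also assumes, as the sample data suggests, that each item belongs to exactly one group and that "last" means the highest ID.

```python
import sqlite3

# Sample rows from the question: (id, item, group_id, val, cost).
ROWS = [
    (1, "A", 1, 1, 12), (2, "B", 1, 2, 12), (3, "C", 1, 3, 12),
    (4, "D", 1, 4, 13), (5, "D", 1, 5, 12), (6, "E", 2, 1, 17),
    (7, "E", 2, 2, 10), (8, "E", 2, 3, 11), (9, "E", 2, 4, 12),
    (10, "F", 2, 5, 15), (11, "F", 2, 6, 13), (12, "F", 2, 7, 11),
    (13, "F", 2, 8, 12),
]


def sum_val_and_last_cost(rows):
    """Per group: total val, plus the sum of each item's last cost."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE table1 "
        "(id INTEGER, item TEXT, group_id INTEGER, val INTEGER, cost INTEGER)"
    )
    conn.executemany("INSERT INTO table1 VALUES (?, ?, ?, ?, ?)", rows)
    query = """
        WITH ranked AS (
            -- rn = 1 marks the row with the highest id for each item.
            SELECT group_id, cost,
                   ROW_NUMBER() OVER (PARTITION BY item ORDER BY id DESC) AS rn
            FROM table1
        ),
        last_item AS (
            SELECT group_id, SUM(cost) AS sum_cost
            FROM ranked
            WHERE rn = 1
            GROUP BY group_id
        ),
        val_sum AS (
            SELECT group_id, SUM(val) AS sum_val
            FROM table1
            GROUP BY group_id
        )
        SELECT v.group_id, v.sum_val, l.sum_cost
        FROM val_sum v
        INNER JOIN last_item l ON v.group_id = l.group_id
        ORDER BY v.group_id
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


for row in sum_val_and_last_cost(ROWS):
    print(row)
```

If an item could appear in more than one group, the window would need PARTITION BY item, group_id, matching the GROUP BY item, group_id used in the MAX(id) answer.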

SQL Select top frequent records

I have the following table:
Table
+----+------+-------+
| ID | Name | Group |
+----+------+-------+
| 0 | a | 1 |
| 1 | a | 1 |
| 2 | a | 2 |
| 3 | a | 1 |
| 4 | b | 1 |
| 5 | b | 2 |
| 6 | b | 1 |
| 7 | c | 2 |
| 8 | c | 2 |
| 9 | c | 1 |
+----+------+-------+
I would like to select the top 20 distinct names from a specific group, ordered by how frequently each name occurs in that group. For this example, the result for group 1 would be a, b, c (a: 3 occurrences, b: 2 occurrences, c: 1 occurrence).
Thank you.
SELECT TOP(20) [Name], Count(*) FROM [Table]
WHERE [Group] = 1
GROUP BY [Name]
ORDER BY Count(*) DESC
SELECT Top(20)
[name], [group], count(*) as occurrences
FROM yourtable
GROUP BY [name], [group]
ORDER BY count(*) desc
SELECT
TOP 20
Name,
[Group],
COUNT(1) AS [Count]
FROM
MyTable
GROUP BY
Name,
[Group]
ORDER BY
[Count] DESC
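For comparison, here is a minimal sketch of the filter, group, order, and limit steps on the sample data. It assumes SQLite through Python's sqlite3 module, so TOP(n) is swapped for LIMIT and the reserved word is quoted as "group" instead of [Group]. The table name my_table, the function name top_names, and the tie-break on name are illustrative assumptions.

```python
import sqlite3

# Sample rows from the question: (ID, Name, Group).
ROWS = [
    (0, "a", 1), (1, "a", 1), (2, "a", 2), (3, "a", 1), (4, "b", 1),
    (5, "b", 2), (6, "b", 1), (7, "c", 2), (8, "c", 2), (9, "c", 1),
]


def top_names(rows, group, limit=20):
    """Return up to `limit` (name, occurrences) pairs for one group."""
    conn = sqlite3.connect(":memory:")
    # "group" is a reserved word, so it is quoted as an identifier.
    conn.execute(
        'CREATE TABLE my_table (id INTEGER, name TEXT, "group" INTEGER)'
    )
    conn.executemany("INSERT INTO my_table VALUES (?, ?, ?)", rows)
    query = """
        SELECT name, COUNT(*) AS occurrences
        FROM my_table
        WHERE "group" = ?
        GROUP BY name
        ORDER BY occurrences DESC, name  -- name as tie-break is an assumption
        LIMIT ?
    """
    result = conn.execute(query, (group, limit)).fetchall()
    conn.close()
    return result


print(top_names(ROWS, 1))
```

Filtering on the group in the WHERE clause, as the first answer does, keeps the ranking scoped to that group; grouping by name and group without a filter ranks every (name, group) pair together.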