hive - how to select top N elements for each match - hive

Please consider a hive table - TableA as mentioned below.
This basic SQL syntax works fine when we want to get "all" the rows that matches the condition in the where clause. I want to limit the returned rows to a number - say N - for each of the matches of where clause.
Let me explain with an example:
(1)
Consider this table:
TableA
c1 c2
1. a
1 b
1 c
2. d
2. e
2. f
(2) Consider this query:
SELECT c1, c2
FROM TableA
WHERE c1 in (1,2)
(3) As you can imagine, it would produce this result:
Actual Results:
c1 c2
1. a
1 b
1 c
2. d
2. e
2. f
(4)
Desired Result:
c1 c2
1. a
1 b
2. d
2. e
Question: How do I modify the query in #2) to get the desired output mention in #4).

You can use row_number function to do this.
select c1,c2
from (SELECT c1, c2, row_number() over(partition by c1 order by c2) as rnum
FROM TableA
--add a where clause as needed
) t
where rnum <= 2

Only 2 values for c1
SELECT c1, c2 FROM TableA WHERE c1 = 1 ORDER BY c2 LIMIT 2
UNION ALL
SELECT c1, c2 FROM TableA WHERE c1 = 2 ORDER BY c2 LIMIT 2
More than 2 values, use rank()
select c1,c2 from
(
select c1,c2,rank() over (partition by c1 order by c2) as rank
from TableA
) t
where rank < 3;

Related

Selecting nth top row based on number of occurrences of value in 3 tables

I have three tables let's say A, B and C. Each of them has column that's named differently, let's say D1, D2 and D3. In those columns I have values between 1 and 26. How do I count occurrences of those values and sort them by that count?
Example:
TableA.D1
1
2
1
1
3
TableB.D2
2
1
1
1
2
3
TableC.D3
2
1
3
So the output for 3rd most common value would look like this:
3 -- number 3 appeared only 3 times
Likewise, output for 2nd most common value would be:
2 -- number 2 appeared 4 times
And output for 1st most common value:
1 -- number 1 appeared 7 times
You probably want :
select top (3) d1
from ((select d1 from tablea ta) union all
(select d2 from tableb tb) union all
(select d3 from tablec tc)
) t
group by d1
order by count(*) desc;
SELECT DQ3.X, DQ3.CNT
(
SELECT DQ2.*, dense_rank() OVER (ORDER BY DQ2.CNT DESC) AS RN
(SELECT DS.X,COUNT(DS.X) CNT FROM
(select D1 as X FROM TableA UNION ALL SELECT D2 AS X FROM TABLE2 UNION ALL SELECT D3 AS X FROM TABLE3) AS DS
GROUP BY DS.X
) DQ2
) DQ3 WHERE DQ3.RN = 3 --the third in the order of commonness - note that 'ties' can be handled differently
One of the things about SQL scripts: they get difficult to read very easily. I'm a big fan of making things as readable as absolute possible. So I'd recommend something like:
declare #topThree TABLE(entry int, cnt int)
select TOP 3 entry,count(*) as cnt
from
(
select d1 as entry from tablea UNION ALL
select d2 as entry from tableb UNION ALL
select d3 as entry from tablec UNION ALL
) as allTablesCombinedSubquery
order by count(*)
select TOP 1 entry
from #topThree
order by cnt desc
... it's extremely readable, and doesn't use any concepts that are tough to grok.

Looking for an alternative SQL statement

Given the following table with 2 columns:
c1 c2
------------
a1 | b1
a1 | b1
a2 | b2
a2 | b3
a3 | b3
I want to return those values from column c2 where the value of c2 column appears multiple times for the same c1 value. I am doing the following SQL query to return the required result:
SELECT DISTINCT ( c2 ) AS c
FROM ( SELECT c1 , c2 , COUNT (*) AS rowcount
FROM table
GROUP BY c1 , c2 HAVING rowcount > 1 )
Result:
c
---
b1
Is there any alternative SQL statement of the above query?
Based on your description, you can use:
select distinct c1
from (select t.*, count(*) over (partition by c2) as cnt
from t
) t
where cnt >= 2;
Based on your sample results:
select c1
from t
group by c1
having count(*) >= 2;
And based on the revised question:
select c2
from t
group by c2
having count(*) >= 2;
Use count in having clause instead of using subquery:-
select c1
from table
group by c1
having count(c2) > 1
Most answers above will work if you want all the values in c1 that appear more than once in the table (even with the same value on c2).
If you want to measure only values of c1 that may have multiple DISTINCT values on c2 you can use:
SELECT c1
FROM table
GROUP BY c1
HAVING COUNT(DISTINCT c2) > 1

SQL Script to populate a new field with auto number based on groups of other columns

I need to populate a column (C3) with autoincrement IDs based on the values of two other columns (Unique ID for Unique C1-C2 values)
Current
C1 C2 C3
------------------------------
X A null
X A null
Y A null
Z B null
Z B null
Z B null
Desired
C1 C2 C3
------------------------------
X A 1
X A 1
Y A 2
Z B 3
Z B 3
Z B 3
Your result is described by:
select c1, c2, dense_rank() over (order by c1)
from t;
You might intend:
select c1, c2, dense_rank() over (order by c1, c2)
from t;
(But this is more complicated than needed for your sample data.)
This depends on the ordering of the values columns themselves. I am guessing that you have some sort of id and you want the rows ordered by that id. The same idea still holds, but you use the minimum id:
select c1, c2,
dense_rank() over (order by minid)
from (select t.*, min(id) over (partition by c1, c2) as minid
from t
) t;

Get specific items from partition which should not be in other partition

I have below table -
ID type group_name creation_date
1 A G1 C1
2 B G2 C2
3 C G2 C3
4 B G1 C4
I want to extract the old type items in each group, but if that type item is latest item in other partition , then i won't extract that.
So, for G1, I will have 2 items A and B where C1 > C4
For G2, I will have 2 items B and C where C2 > C3.
Ideally, B is older for group G1 and C is older for group G2
But i don't want to extract B for G1 since it is latest for G2. Hence
the output should be C only.
Could anyone help how can i achieve this ?
Query:
SELECT DISTINCT
type
FROM (
SELECT type,
rnk,
COUNT( CASE rnk WHEN 1 THEN 1 END ) OVER ( PARTITION BY type ) AS ct
FROM (
SELECT type,
RANK() OVER ( PARTITION BY group_name ORDER BY creation_date DESC ) AS rnk
FROM table_name
)
)
WHERE rnk > 1 AND ct = 0;
Output:
TYPE
----
C

Selecting the the last row in a partition in HIVE

I have a table t1:
c1 | c2 | c3| c4
1 1 1 A
1 1 2 B
1 1 3 C
1 1 4 D
1 1 4 E
1 1 4 F
2 2 1 A
2 2 2 A
2 2 3 A
I want to select the last row of each c1, c2 pair. So (1,1,4,F) and (2,2,3,A) in this case. My idea is to do something like this:
create table t2 as
select *, row_number() over (partition by c1, c2 order by c3) as rank
from t1
create table t3 as
select a.c1, a.c2, a.c3, a.c4
from t2 a
inner join
(select c1, c2, max(rank) as maxrank
from t2
group by c1, c2
)
on a.c1=b.c1 and a.c2=b.c1
where a.rank=b.maxrank
Would this work? (Having environment issues so can't test myself)
Just use a subquery:
select t1.*
from (select t1.*, row_number() over (partition by c1, c2 order by c3 desc) as rank
from t1
) t1
where rank = 1;
Note the use of desc for the order by.