Hive / SQL query for top n values per key

I want the top 2 values per key. The result would look like:
What should the Hive query be?

You can use a window function with an OVER() clause:
select col1,col2 from (SELECT col1,
col2,
ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY col2 DESC) AS row_num
FROM data)f
WHERE f.row_num < 3
order by col1,col2
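A minimal sketch of the same query, run through Python's built-in sqlite3 module as a stand-in for Hive. The sample rows and the helper name top_n_per_key are hypothetical, and window functions are assumed to be available (SQLite 3.25 or later). The filter is written as row_num <= n, which for n = 2 is the same as row_num < 3 above.

```python
import sqlite3

# Hypothetical sample data; the table and column names follow the query above.
sample_rows = [("a", 1), ("a", 5), ("a", 3), ("b", 7), ("b", 2), ("b", 9), ("c", 4)]

def top_n_per_key(rows, n=2):
    """Return the n largest col2 values for each col1, using ROW_NUMBER()."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data (col1 TEXT, col2 INTEGER)")
    conn.executemany("INSERT INTO data VALUES (?, ?)", rows)
    query = """
        SELECT col1, col2
        FROM (SELECT col1,
                     col2,
                     ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY col2 DESC) AS row_num
              FROM data) f
        WHERE f.row_num <= ?
        ORDER BY col1, col2
    """
    result = conn.execute(query, (n,)).fetchall()
    conn.close()
    return result

top2 = top_n_per_key(sample_rows, 2)
```

A key with fewer than n rows (here "c") simply contributes all of its rows.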

Related

How to select first-n/top-n rows from a query's resultant if its count is more than a given number?

I have a query that returns more than 1000 rows.
Step1:
with total_res as (
select table1.col1, table1.col2, table2.col3,... table2.coln
from table1 join table2
on table1.keycol=table2.keycol
where table1.col4='ABCD' and table2.col5 <= '02-02-2022'
order by table1.col1 desc)
In my requirement, I have to return the first 350 rows, ordered by col3 descending, if the output of the above query contains more than 1000 rows.
So I added a row number column, as below, to number the result set from above sequentially.
Step2:
select col1, col2, col2...coln, ROW_NUMBER() OVER (ORDER BY col3 desc) as number from total_res;
What I don't understand now is how I can check whether the output from step 2 contains more than 350 rows and, if so, select the first 350 rows.
Could anyone let me know how I can achieve this? Or is there a better way to do it than using row_number?
In the second step, check whether the highest row number is greater than 350, and then limit the result to 350 rows:
with cte as
(
select col1, col2, col2...coln, ROW_NUMBER() OVER (ORDER BY col3 desc) as number from total_res
) select * from cte
where 350 < ( select max(number) from cte)
order by number
limit 350
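The pattern above can be tried on a small scale with Python's sqlite3 module. This is only a sketch: the single-column total_res table, the helper name first_n_if_more_than, and the alias rn (used in place of number) are assumptions for illustration, and the threshold and limit are parameters so that a handful of rows is enough to see the behavior.

```python
import sqlite3

def first_n_if_more_than(values, threshold, n):
    """Return the n largest values, but only if there are more than `threshold` rows."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical stand-in for the total_res result set, reduced to one column.
    conn.execute("CREATE TABLE total_res (col3 INTEGER)")
    conn.executemany("INSERT INTO total_res VALUES (?)", [(v,) for v in values])
    query = """
        WITH cte AS (
            SELECT col3,
                   ROW_NUMBER() OVER (ORDER BY col3 DESC) AS rn
            FROM total_res
        )
        SELECT col3
        FROM cte
        WHERE (SELECT MAX(rn) FROM cte) > :threshold
        ORDER BY rn
        LIMIT :n
    """
    out = [row[0] for row in conn.execute(query, {"threshold": threshold, "n": n})]
    conn.close()
    return out

many = first_n_if_more_than(range(1, 11), threshold=5, n=3)  # 10 rows, above threshold
few = first_n_if_more_than(range(1, 4), threshold=5, n=3)    # 3 rows, below threshold
```

Note that when the row count does not exceed the threshold, this form of the query returns no rows at all rather than the full result set.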

Distinct over multiple columns in SQL Server

How can I apply distinct to multiple columns in SQL Server? The query that I have tried below does not work on SQL Server.
select distinct(column1, column2), column3
from table_name
select distinct applies to all columns in the row. So, you can do:
select distinct col1, col2, col3
from t;
If you only want col1 and col2 to be distinct, then group by works:
select col1, col2, min(col3)
from t
group by col1, col2;
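As a rough illustration of the difference between the two queries above, here is a sketch using Python's sqlite3 module in place of SQL Server. The rows in t are made up for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (col1 TEXT, col2 TEXT, col3 INTEGER)")
# Hypothetical rows: ("x", "p") appears with two different col3 values,
# and ("y", "p", 5) is an exact duplicate row.
db.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [("x", "p", 3), ("x", "p", 1), ("x", "q", 2), ("y", "p", 5), ("y", "p", 5)],
)

# DISTINCT considers the whole row, so ("x", "p") still shows up twice.
distinct_all = db.execute(
    "SELECT DISTINCT col1, col2, col3 FROM t ORDER BY col1, col2, col3"
).fetchall()

# GROUP BY col1, col2 yields one row per pair; MIN(col3) picks the value to keep.
distinct_pairs = db.execute(
    "SELECT col1, col2, MIN(col3) FROM t GROUP BY col1, col2 ORDER BY col1, col2"
).fetchall()
db.close()
```

Here distinct_all has four rows while distinct_pairs has three, one per (col1, col2) combination.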
Or if you want random rows, you can use row_number(). For instance:
select t.*
from (select t.*,
row_number() over (partition by col1, col2 order by newid()) as seqnum
from t
) t
where seqnum = 1;
A clever version of this doesn't require a subquery:
select top (1) with ties t.*
from t
order by row_number() over (partition by col1, col2 order by newid());

Alternative for count distinct

I want an alternative way to write the following query
SELECT COUNT(DISTINCT col1) FROM table.
I don't want to use distinct. Is there an alternative way?
Try GROUP BY in a subquery and COUNT() in the outer query. It would achieve the same result.
SELECT COUNT(*)
FROM
(
SELECT Col1
FROM Table
GROUP BY Col1
) tbl
Select count(col1) from table GROUP BY col1
Try this (the derived table needs an alias on SQL Server):
SELECT COUNT(Col1)
FROM (SELECT ROW_NUMBER() OVER (PARTITION BY Col1 ORDER BY Col1) As RNO, Col1
FROM Table_Name) t
WHERE RNO = 1
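The GROUP BY and ROW_NUMBER() alternatives can be compared side by side with a small sketch in Python's sqlite3 module. The table name tbl, the sample values, and the helper distinct_counts are hypothetical. One detail worth checking is NULL handling: COUNT(DISTINCT col1) ignores NULLs, whereas GROUP BY puts all NULLs into one group, so COUNT(*) over the grouped subquery counts that group as well.

```python
import sqlite3

def distinct_counts(values):
    """Run COUNT(DISTINCT ...) and three DISTINCT-free variants on the same data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tbl (col1 TEXT)")
    conn.executemany("INSERT INTO tbl VALUES (?)", [(v,) for v in values])
    queries = {
        "count_distinct": "SELECT COUNT(DISTINCT col1) FROM tbl",
        # COUNT(*) also counts the NULL group produced by GROUP BY.
        "group_by_count_star": "SELECT COUNT(*) FROM (SELECT col1 FROM tbl GROUP BY col1) g",
        # COUNT(col1) skips NULL, matching COUNT(DISTINCT col1).
        "group_by_count_col": "SELECT COUNT(col1) FROM (SELECT col1 FROM tbl GROUP BY col1) g",
        "row_number": """
            SELECT COUNT(col1)
            FROM (SELECT ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY col1) AS rno, col1
                  FROM tbl) r
            WHERE rno = 1
        """,
    }
    result = {name: conn.execute(sql).fetchone()[0] for name, sql in queries.items()}
    conn.close()
    return result

counts = distinct_counts(["a", "b", "a", None, "c", None])
```

If col1 can contain NULLs and the result must match COUNT(DISTINCT col1) exactly, counting the column in the outer query rather than using COUNT(*) is the safer choice.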

Give correct Sql Query

I have two queries,
SELECT col1,col2,col3
from scnd1
where col2<>''
group by col1,col2,col3
order by col1
SELECT ROWNUMBER() OVER (PARTITION BY COL1) AS RN FROM SCND1)AS A
WHERE RN > 1
For the same table, I need a single query that combines these two,
i.e. first I want to sort the table and remove the NULLs, and then delete the repeated rows using the second query.
Try a sub query, along the lines of this:
SELECT
main.col1,
main.col2,
main.col3
FROM scnd1 main
JOIN (SELECT
col1,
col2,
col3,
ROW_NUMBER() OVER (PARTITION BY(col1) ORDER BY col1) AS RN
FROM SCND1
WHERE col2<>''
) sub
ON sub.Col1 = main.Col1 AND sub.Col2 = main.col2 AND sub.Col3 = main.col3
WHERE RN > 1
GROUP BY main.col1,main.col2,main.col3
ORDER BY main.col1

Multiple rows match, but I only want one?

Sometimes I wish to perform a join in which I take the largest value of one column. To do this I have to use max() with a GROUP BY, which prevents me from retrieving the other columns of the row that held the max (because they are not contained in the GROUP BY or in an aggregate function).
To fix this, I join the max value back on the original data source, to get the other columns. However, my problem is that this sometimes returns more than one row.
So, so far I have something like:
SELECT * FROM
(SELECT Col1, Max(Col2) FROM Table GROUP BY Col1) tab1
JOIN
(SELECT Col1, Col2 FROM Table) tab2
ON tab1.Col2 = tab2.Col2
If the above query now returns three rows (which match the largest value for column2) I have a bit of a headache.
If there were an extra column, Col3, and of the rows returned by the above query I only wanted the one with, say, the minimum Col3 value, how would I do this?
If you are using SQL Server 2005 or later, you can do it like this:
CTE way
;WITH CTE
AS
(
SELECT
ROW_NUMBER() OVER(PARTITION BY Col1 ORDER BY Col2 DESC) AS RowNbr,
table.*
FROM
table
)
SELECT
*
FROM
CTE
WHERE
CTE.RowNbr=1
Subquery way
SELECT
*
FROM
(
SELECT
ROW_NUMBER() OVER(PARTITION BY Col1 ORDER BY Col2 DESC) AS RowNbr,
table.*
FROM
table
) AS T
WHERE
T.RowNbr=1
As I understand it, it could be something like this:
SELECT * FROM
(SELECT Col1, Max(Col2) AS Col2 FROM Table GROUP BY Col1) tab1
JOIN
(SELECT Col1, Col2, Col3 FROM Table) tab2
ON tab1.Col1 = tab2.Col1 and tab1.Col2 = tab2.Col2
and tab2.Col3 = (select min(t.Col3) from Table t where t.Col1 = tab1.Col1 and t.Col2 = tab1.Col2)
Assuming you are using SQL Server 2005 or later, you can make use of window functions here. I have chosen ROW_NUMBER(), but it is not the only option.
;WITH T AS
( SELECT *,
ROW_NUMBER() OVER(PARTITION BY Col1 ORDER BY Col2 DESC) [RowNumber]
FROM Table
)
SELECT *
FROM T
WHERE RowNumber = 1
The PARTITION BY within the OVER clause is equivalent to the GROUP BY in your subquery, and the ORDER BY determines the order in which the rows are numbered. In this case, Col2 DESC starts the numbering at the highest value of Col2 (equivalent to your MAX).
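The question also asks for the row with the minimum Col3 when several rows share the largest Col2. One way to express that, sketched here as an assumption rather than as part of the answers above, is to add Col3 ASC as a second sort key in the window's ORDER BY. The example uses Python's sqlite3 module in place of SQL Server; the table name tbl, the sample rows, and the helper one_row_per_group are hypothetical.

```python
import sqlite3

def one_row_per_group(rows):
    """Per Col1, keep the row with the largest Col2; ties go to the smallest Col3."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tbl (Col1 TEXT, Col2 INTEGER, Col3 INTEGER)")
    conn.executemany("INSERT INTO tbl VALUES (?, ?, ?)", rows)
    query = """
        WITH T AS (
            SELECT Col1, Col2, Col3,
                   ROW_NUMBER() OVER (PARTITION BY Col1
                                      ORDER BY Col2 DESC, Col3 ASC) AS RowNumber
            FROM tbl
        )
        SELECT Col1, Col2, Col3
        FROM T
        WHERE RowNumber = 1
        ORDER BY Col1
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result

# Hypothetical data: three rows in group "a" tie on the largest Col2 (10).
picked = one_row_per_group(
    [("a", 10, 7), ("a", 10, 2), ("a", 10, 5), ("a", 4, 1), ("b", 3, 9), ("b", 8, 6)]
)
```

Without the second sort key, ROW_NUMBER() would still return exactly one row per group, but which of the tied rows gets number 1 would not be defined.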