Multi-column Rank in SQL Server

I'm fairly new to SQL Server. I need to rank (RANK or DENSE_RANK) a dataset, but the ranking is based on 6 columns. What I have at the moment is:
SELECT col1, col2, col3, col4, col5, col6, col7,
RANK() OVER(ORDER BY col2 desc) as APPLICANT_RANK
FROM myTable
That works fine, but if there is a tie in col2, I get two records with the same rank. What I want is for a tie in col2 to be broken by the higher number in col3, then col4, and so on down the line to col6.
Thanks

You can include multiple columns in the order by clause in the rank function, just as you would when ordering the results of a whole query:
RANK() OVER(
ORDER BY col2 desc,col3 desc, col4 desc, col5 desc, col6 desc
) as APPLICANT_RANK
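To see the tie-breaking in action, here is a minimal sketch that runs the same ORDER BY list against made-up data. It uses Python's built-in sqlite3 module (which needs SQLite 3.25 or later for window functions) only so that the example is self-contained. The rows are invented for illustration; the RANK() expression itself is the one from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE myTable (col1 TEXT, col2 INT, col3 INT, col4 INT, col5 INT, col6 INT)"
)
# Invented sample data: "a", "b" and "d" all tie on col2.
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("a", 90, 5, 1, 1, 1),
        ("b", 90, 7, 1, 1, 1),  # beats "a" on col3
        ("c", 80, 9, 9, 9, 9),  # lower col2, so ranked last despite high col3
        ("d", 90, 7, 1, 1, 1),  # identical to "b" on every ranking column
    ],
)

rows = conn.execute(
    """
    SELECT col1,
           RANK() OVER (
               ORDER BY col2 DESC, col3 DESC, col4 DESC, col5 DESC, col6 DESC
           ) AS APPLICANT_RANK
    FROM myTable
    ORDER BY APPLICANT_RANK, col1
    """
).fetchall()

for row in rows:
    print(row)
```

Rows that are equal on every column in the ORDER BY list (here "b" and "d") still share a rank, so a fully deterministic ranking needs a final column that is unique.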

Related

Delete duplicate data that some columns equal zero

I have a SQL Server table with columns col1, col2, col3, col4, col5, col6, col7, col8, col9, col10.
I want to delete the duplicates based on col1, col2, col3.
The rows that should be deleted are those where col6 = 0 and col7 = 0 and col8 = 0.
We can use a deletable CTE here:
WITH cte AS (
SELECT *, COUNT(*) OVER (PARTITION BY col1, col2, col3) cnt
FROM yourTable
)
DELETE
FROM cte
WHERE cnt > 1 AND col6 = 0 AND col7 = 0 AND col8 = 0;
The CTE above identifies "duplicates" according to your definition, which is 2 or more records having the same values for col1, col2, and col3. Then we delete duplicates meeting the requirements on the other 3 columns.
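Deleting through a CTE is a SQL Server feature, so the following is only a sketch of the same logic in SQLite, run through Python's sqlite3 module to keep it self-contained. The id column and the sample rows are assumptions added for illustration; the window COUNT(*) and the filter are the ones from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "id" is a hypothetical key, needed here because SQLite cannot delete through a CTE.
conn.execute(
    "CREATE TABLE yourTable (id INTEGER PRIMARY KEY, "
    "col1 INT, col2 INT, col3 INT, col6 INT, col7 INT, col8 INT)"
)
conn.executemany(
    "INSERT INTO yourTable (col1, col2, col3, col6, col7, col8) VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 1, 1, 0, 0, 0),  # duplicate key, all zeros: deleted
        (1, 1, 1, 5, 0, 0),  # duplicate key, non-zero col6: kept
        (2, 2, 2, 0, 0, 0),  # all zeros but the key is unique: kept
        (3, 3, 3, 0, 0, 0),  # duplicate key, all zeros: deleted
        (3, 3, 3, 0, 0, 0),  # duplicate key, all zeros: deleted as well
    ],
)

conn.execute(
    """
    DELETE FROM yourTable
    WHERE id IN (
        SELECT id
        FROM (
            SELECT id, col6, col7, col8,
                   COUNT(*) OVER (PARTITION BY col1, col2, col3) AS cnt
            FROM yourTable
        )
        WHERE cnt > 1 AND col6 = 0 AND col7 = 0 AND col8 = 0
    )
    """
)

remaining = conn.execute(
    "SELECT col1, col6 FROM yourTable ORDER BY id"
).fetchall()
print(remaining)
```

The last two sample rows show a consequence worth checking against your data: when every row in a duplicate group has zeros in col6, col7 and col8, this condition removes the whole group rather than leaving one copy.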

SQL - Is column exclusion possible from 'SELECT' clause?

I run into this every time I do a lot of complex processing with many columns selected in a subquery, but in the end need to show only a few of them.
Is any SQL vendor (Oracle, Microsoft, or others) considering an extra clause that simply excludes the columns that are not required?
;with t as (
select col1, col2, col3, col4, col5, col6, col7, col8, col9, col10
from orders_tbl
where order_date > getdate() -- ex. T-sql
)
, s as (
select t.*, row_number() over(partition by col1 order by col8 DESC, col9) rn
from t
)
--
-- The problem is here: if I don't explicitly select the individual columns of "t", the output also includes the column "rn", which is not required.
--
select col1, col2, col3, col4, col5, col6, col7, col8, col9, col10
from s where rn = 1
order by col1, col2
Now, imagine something like this -
with t as (
select col1, col2, col3, col4, col5, col6, col7, col8, col9, col10
from orders_tbl
where order_date > getdate() -- ex. T-sql
)
, s as (
select t.*, row_number() over(partition by col1 order by col8 DESC, col9) rn
from t
)
--
-- Note: the imaginary clause "exclude"
--
select *
from s exclude (rn) where rn = 1
order by col1, col2
Your thoughts please?
It would be nice if MS SQL Server supported something like SELECT * EXCEPT (col) FROM tbl, as Google BigQuery does.
But currently that functionality isn't (yet?) implemented in MS SQL Server.
However, that SQL can be simplified to use only one CTE, since TOP 1 WITH TIES can be combined with ORDER BY ROW_NUMBER() OVER (...). The first row of every partition gets row number 1, so all of those rows tie for the top spot and WITH TIES returns exactly one row per col1.
That way there is no RN column to exclude from the final result.
with T as (
select TOP 1 WITH TIES
col1, col2, col3, col4, col5, col6, col7, col8, col9, col10
from orders_tbl
where order_date > getdate()
ORDER BY row_number() over(partition by col1 order by col8 DESC, col9)
)
select *
from T
order by col1, col2;
Note that the CTE is only needed here because the final result still has to be ordered by col1, col2.
Side-note One:
For simple queries, selecting the required columns in the outer query seems to be the more common approach.
with CTE as (
select *
, row_number() over(partition by col1 order by col8 DESC, col9) as rn
from orders_tbl
where order_date > getdate()
)
select col1, col2, col3, col4, col5, col6, col7, col8, col9, col10
from CTE
where rn = 1
order by col1, col2;
Side-note Two:
I would love to see something like Teradata's QUALIFY clause added someday to the SQL Standard. It's a nice thing to have when there's a need to filter based on a window function like ROW_NUMBER or DENSE_RANK.
In Teradata that SQL could be golf-coded like this:
select col1, col2, col3, col4, col5, col6, col7, col8, col9, col10
from orders_tbl
where order_date > current_timestamp
QUALIFY row_number() over(partition by col1 order by col8 DESC, col9) = 1
order by col1, col2
One way is to select into a new table and then drop the columns:
select col1, col2, col3, col4, col5, col6, col7, col8, col9, col10, row_number() over(partition by col1 order by col8 DESC, col9) rn
into #a
from orders_tbl
where order_date > getdate()
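delete from #a where rn > 1 -- keep only the first row per col1
alter table #a drop column rn -- drop the helper column so it is not displayed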
select * from #a
Note that this is not optimal for performance, since you read data and then delete part of it again. But it is handy for small datasets and ad hoc queries.

PERCENTILE_CONT() returns same value regardless of input parameter

I would like to get the 5th, 50th, and 95th percentiles of col4 in a table:
SELECT col1, col2, col3, AVG(col4), STD(col4),
PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY col4)
OVER (PARTITION BY col1, col2, col3) as 5th_percentile,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY col4)
OVER (PARTITION BY col1, col2, col3) as 50th_percentile,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY col4)
OVER (PARTITION BY col1, col2, col3) as 95th_percentile
FROM table
GROUP BY col1, col2, col3
LIMIT 100
What I end up getting back is 5th_percentile == 50th_percentile == 95th_percentile
AVG(col4)    STD(col4)   5th_percentile  50th_percentile  95th_percentile
300.000000   0.000000    300.000000      300.000000       300.000000
67.076600    16.968851   82.031792       82.031792        82.031792
66.166136    11.452172   78.348846       78.348846        78.348846
544.262809   68.269014   605.797302      605.797302       605.797302
22.523138    1.820358    24.000000       24.000000        24.000000
What's going on?
Edit: The db is MemSQL
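Window functions operate after the GROUP BY clause. The GROUP BY produces one row per group, so each window partition contains a single row, and every percentile of a single value is that value. That is why the PERCENTILE_CONT window functions all return the same value.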
You want to compute the window functions first, then GROUP BY afterwards. You can do this by putting the window functions in an inner subselect, and the GROUP BY in an outer select.
Here is documentation from PostgreSQL which explains how window functions relate to GROUP BY (this is standard ANSI SQL, and MemSQL does the same thing):
https://www.postgresql.org/docs/current/static/tutorial-window.html
The rows considered by a window function are those of the "virtual table" produced by the query's FROM clause as filtered by its WHERE, GROUP BY, and HAVING clauses if any. For example, a row removed because it does not meet the WHERE condition is not seen by any window function. A query can contain multiple window functions that slice up the data in different ways by means of different OVER clauses, but they all act on the same collection of rows defined by this virtual table.
Note that in MemSQL, if you use a column that isn't grouped or aggregated, such as col4 in your query, you get an arbitrary value out of the rows in the group, i.e. it behaves like an ANY_VALUE aggregate. In a future version of MemSQL, this query will instead return an error, to help you avoid writing queries with unintended behavior like this.
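The ordering described above (GROUP BY first, window functions afterwards) can be reproduced with a small sketch. SQLite has no PERCENTILE_CONT, so COUNT(*) OVER (...) stands in for it here as an assumption; the table and its values are invented, and Python's sqlite3 module (SQLite 3.25 or later) is used only to make the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 TEXT, col4 REAL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("x", 10.0), ("x", 20.0), ("x", 30.0), ("y", 5.0), ("y", 7.0)],
)

# Window function combined with GROUP BY: the window only sees the
# one-row-per-group result, so each partition holds a single row.
grouped = conn.execute(
    """
    SELECT col1,
           COUNT(*) AS rows_in_group,
           COUNT(*) OVER (PARTITION BY col1) AS rows_in_window
    FROM t
    GROUP BY col1
    ORDER BY col1
    """
).fetchall()

# Window function without GROUP BY: the window sees every source row,
# and DISTINCT collapses the repeated results afterwards.
ungrouped = conn.execute(
    """
    SELECT DISTINCT col1,
           COUNT(*) OVER (PARTITION BY col1) AS rows_in_window
    FROM t
    ORDER BY col1
    """
).fetchall()

print(grouped)
print(ungrouped)
```

In the grouped query the window count is 1 for every group even though the groups hold 3 and 2 rows, which is the same reason a percentile computed over that window can only return a single value.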
WITH a AS (
SELECT col1, col2, col3,
PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY col4)
OVER (PARTITION BY col1, col2, col3) as 5th_percentile,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY col4)
OVER (PARTITION BY col1, col2, col3) as 50th_percentile,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY col4)
OVER (PARTITION BY col1, col2, col3) as 95th_percentile
FROM table
)
SELECT DISTINCT col1, col2, col3, 5th_percentile, 50th_percentile, 95th_percentile
FROM a
LIMIT 100
This works. It looks like you can't use GROUP BY together with PERCENTILE_CONT.
PERCENTILE_CONT() -- at least in some databases -- can be either an aggregation function or a window function.
What I think is happening is that the value is being calculated after the aggregation -- I'm not sure why. To be honest, I would expect the code to get a syntax error, because col4 is not aggregated. In other words (ORDER BY MAX(col4)) should work, but not (ORDER BY col4) because the percentile is calculated after the aggregation.
But try without the OVER clause:
SELECT col1, col2, col3, AVG(col4), STD(col4),
PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY col4) as 5th_percentile,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY col4) as 50th_percentile,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY col4) as 95th_percentile
FROM table
GROUP BY col1, col2, col3
LIMIT 100;
EDIT:
Your database doesn't seem to support PERCENTILE_CONT() as an aggregation function. No accounting for taste. Most do.
The workaround is SELECT DISTINCT:
SELECT DISTINCT col1, col2, col3,
AVG(col4) OVER (PARTITION BY col1, col2, col3),
STD(col4) OVER (PARTITION BY col1, col2, col3),
PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY col4) OVER (PARTITION BY col1, col2, col3) as 5th_percentile,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY col4) OVER (PARTITION BY col1, col2, col3) as 50th_percentile,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY col4) OVER (PARTITION BY col1, col2, col3) as 95th_percentile
FROM table
LIMIT 100;
Or use a subquery.

Based on the column values add a row value to the column in sql

I have a table as below:
I want the result as displayed only when the col1 and col2 values are same in the two rows.
Can anyone provide the SQL query, please?
You didn't provide the details so I will suggest the following solution with pivoting:
with cte as (
select col1, col2, col3,
row_number() over(partition by col1, col2 order by col3) rn
from tablename
)
select col1, col2, [1] as col3, [2] as col4
from cte
pivot (max(col3) for rn in ([1], [2])) p

De-duplicating rows in a table with respect to certain columns and retaining the corresponding values in the other columns in HIVE

I need to create a temporary table in HIVE using an existing table that has 7 columns. I just want to get rid of duplicates with respect to the first three columns and also retain the corresponding values in the other 4 columns. I don't care which row is actually dropped while de-duplicating on the first three columns alone.
You could use something like the following if you are not concerned about ordering:
create table table2 as
select col1, col2, col3
-- split() takes a regular expression, so the pipe has to be escaped
,split(agg_col,'\\|')[0] as col4
,split(agg_col,'\\|')[1] as col5
,split(agg_col,'\\|')[2] as col6
,split(agg_col,'\\|')[3] as col7
from (Select col1, col2, col3,
max(concat(cast(col4 as string),"|",
cast(col5 as string),"|",
cast(col6 as string),"|",
cast(col7 as string))) as agg_col
from table1
group by col1,col2,col3 ) A;
Below is another approach, which gives more control over ordering but is slower than the approach above:
create table table2 as
select col1, col2, col3,max(col4), max(col5), max(col6), max(col7)
from (Select col1, col2, col3,col4, col5, col6, col7,
rank() over ( partition by col1, col2, col3
order by col4 desc, col5 desc, col6 desc, col7 desc ) as col_rank
from table1 ) A
where A.col_rank = 1
GROUP BY col1, col2, col3;
The rank() over(..) function returns more than one row with rank 1 if the order by columns are all equal. In our case, if there are 2 rows with exactly the same values in all seven columns, there will be duplicates when we filter on col_rank = 1. These duplicates can be eliminated using the max and group by clauses, as written in the query above.
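A small sketch of that behaviour, using invented rows and Python's sqlite3 module (SQLite 3.25 or later) in place of Hive so that it can run on its own. The query text mirrors the one above; only the data and the Python variable names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 (col1 INT, col2 INT, col3 INT, "
    "col4 INT, col5 INT, col6 INT, col7 INT)"
)
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 1, 1, 9, 9, 9, 9),
        (1, 1, 1, 9, 9, 9, 9),  # exact duplicate of the row above
        (1, 1, 1, 2, 3, 4, 5),
        (2, 2, 2, 1, 1, 1, 1),
    ],
)

ranked_sql = """
    SELECT col1, col2, col3, col4, col5, col6, col7,
           RANK() OVER (PARTITION BY col1, col2, col3
                        ORDER BY col4 DESC, col5 DESC, col6 DESC, col7 DESC) AS col_rank
    FROM table1
"""

# Filtering on rank alone still returns both copies of the duplicated row.
top_rows = conn.execute(
    f"SELECT * FROM ({ranked_sql}) AS A WHERE A.col_rank = 1 ORDER BY col1"
).fetchall()

# Adding MAX + GROUP BY collapses the tied rows into one row per key.
deduped = conn.execute(
    f"""
    SELECT col1, col2, col3, MAX(col4), MAX(col5), MAX(col6), MAX(col7)
    FROM ({ranked_sql}) AS A
    WHERE A.col_rank = 1
    GROUP BY col1, col2, col3
    ORDER BY col1
    """
).fetchall()

print(len(top_rows), deduped)
```

Because rows tied at rank 1 are equal on col4 through col7, taking MAX of each column over them does not mix values from different rows.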