sql query distinct with Row_Number - sql

I am struggling with the DISTINCT keyword in SQL.
I just want to display row numbers for the unique (distinct) values in a column, so I tried:
SELECT DISTINCT id, ROW_NUMBER() OVER (ORDER BY id) AS RowNum
FROM table
WHERE fid = 64
However, the code below does give me the distinct values:
SELECT distinct id FROM table WHERE fid = 64
But when I try it together with ROW_NUMBER(), it does not work.

This can be done very simply; you were pretty close already:
SELECT distinct id, DENSE_RANK() OVER (ORDER BY id) AS RowNum
FROM table
WHERE fid = 64

Use this:
SELECT *, ROW_NUMBER() OVER (ORDER BY id) AS RowNum FROM
(SELECT DISTINCT id FROM table WHERE fid = 64) Base
This feeds the "output" of one query in as the "input" of another.
Using a CTE:
; WITH Base AS (
SELECT DISTINCT id FROM table WHERE fid = 64
)
SELECT *, ROW_NUMBER() OVER (ORDER BY id) AS RowNum FROM Base
The two queries should be equivalent.
Technically you could
SELECT DISTINCT id, ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS RowNum
FROM table
WHERE fid = 64
but if you increase the number of DISTINCT fields, you have to put all these fields in the PARTITION BY, so for example
SELECT DISTINCT id, description,
ROW_NUMBER() OVER (PARTITION BY id, description ORDER BY id) AS RowNum
FROM table
WHERE fid = 64
Note also that you are going against standard naming conventions here: id should probably be a primary key, and therefore unique by definition, so a DISTINCT on it would be useless unless the query is combined with some JOINs or UNION ALL.
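As a rough check of the claim that the derived-table form and the CTE form are equivalent, here is a minimal sketch that runs both against an in-memory SQLite database through Python's sqlite3 module. The table name items and its rows are made up for illustration, and SQLite 3.25 or newer is assumed for window function support.

```python
import sqlite3

# Hypothetical sample data: the table name "items" and its rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, fid INTEGER)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(10, 64), (10, 64), (20, 64), (30, 64), (30, 64), (40, 7)],
)

# De-duplicate first in a derived table, then number the surviving rows.
derived_sql = """
SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS RowNum
FROM (SELECT DISTINCT id FROM items WHERE fid = 64) AS Base
ORDER BY id
"""

# Same idea, with the de-duplication step written as a CTE.
cte_sql = """
WITH Base AS (SELECT DISTINCT id FROM items WHERE fid = 64)
SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS RowNum
FROM Base
ORDER BY id
"""

derived_rows = conn.execute(derived_sql).fetchall()
cte_rows = conn.execute(cte_sql).fetchall()
print(derived_rows)  # [(10, 1), (20, 2), (30, 3)]
print(cte_rows == derived_rows)  # True
```

Both forms de-duplicate before numbering, which is why each distinct id gets exactly one consecutive row number.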

This article covers an interesting relationship between ROW_NUMBER() and DENSE_RANK() (the RANK() function is not treated specifically). When you need a generated ROW_NUMBER() on a SELECT DISTINCT statement, the ROW_NUMBER() will produce distinct values before they are removed by the DISTINCT keyword. E.g. this query
SELECT DISTINCT
v,
ROW_NUMBER() OVER (ORDER BY v) row_number
FROM t
ORDER BY v, row_number
... might produce this result (DISTINCT has no effect):
+---+------------+
| V | ROW_NUMBER |
+---+------------+
| a | 1 |
| a | 2 |
| a | 3 |
| b | 4 |
| c | 5 |
| c | 6 |
| d | 7 |
| e | 8 |
+---+------------+
Whereas this query:
SELECT DISTINCT
v,
DENSE_RANK() OVER (ORDER BY v) row_number
FROM t
ORDER BY v, row_number
... produces what you probably want in this case:
+---+------------+
| V | ROW_NUMBER |
+---+------------+
| a | 1 |
| b | 2 |
| c | 3 |
| d | 4 |
| e | 5 |
+---+------------+
Note that the ORDER BY clause of the DENSE_RANK() function will need all other columns from the SELECT DISTINCT clause to work properly.
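To illustrate the note about the ORDER BY clause, the following sketch compares DENSE_RANK() ordered by one column with DENSE_RANK() ordered by every column of the SELECT DISTINCT list. It uses SQLite through Python's sqlite3 module (3.25 or newer assumed); the second column w and all rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical two-column table; only v appears in the article's example.
conn.execute("CREATE TABLE t (v TEXT, w INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("a", 1), ("a", 1), ("a", 2), ("b", 1)],
)

# Ranking by v alone: distinct rows ('a', 1) and ('a', 2) share a number.
partial = conn.execute("""
    SELECT DISTINCT v, w, DENSE_RANK() OVER (ORDER BY v) AS rn
    FROM t
    ORDER BY v, w
""").fetchall()

# Ranking by all DISTINCT columns: every distinct row gets its own number.
full = conn.execute("""
    SELECT DISTINCT v, w, DENSE_RANK() OVER (ORDER BY v, w) AS rn
    FROM t
    ORDER BY v, w
""").fetchall()

print(partial)  # [('a', 1, 1), ('a', 2, 1), ('b', 1, 2)]
print(full)     # [('a', 1, 1), ('a', 2, 2), ('b', 1, 3)]
```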
All three functions in comparison
Using PostgreSQL / Sybase / SQL standard syntax (WINDOW clause):
SELECT
v,
ROW_NUMBER() OVER (window) row_number,
RANK() OVER (window) rank,
DENSE_RANK() OVER (window) dense_rank
FROM t
WINDOW window AS (ORDER BY v)
ORDER BY v
... you'll get:
+---+------------+------+------------+
| V | ROW_NUMBER | RANK | DENSE_RANK |
+---+------------+------+------------+
| a | 1 | 1 | 1 |
| a | 2 | 1 | 1 |
| a | 3 | 1 | 1 |
| b | 4 | 4 | 2 |
| c | 5 | 5 | 3 |
| c | 6 | 5 | 3 |
| d | 7 | 7 | 4 |
| e | 8 | 8 | 5 |
+---+------------+------+------------+
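The comparison table can be reproduced with a small script. This is only a sketch: it uses SQLite via Python's sqlite3 module (3.25 or newer assumed, which also accepts the WINDOW clause), and the short aliases rn, rnk and drnk are my own choice to avoid clashing with function names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v TEXT)")
# Same values as in the tables above: a, a, a, b, c, c, d, e
conn.executemany("INSERT INTO t VALUES (?)", [(c,) for c in "aaabccde"])

comparison = conn.execute("""
    SELECT v,
           ROW_NUMBER() OVER w AS rn,
           RANK()       OVER w AS rnk,
           DENSE_RANK() OVER w AS drnk
    FROM t
    WINDOW w AS (ORDER BY v)
    ORDER BY v, rn
""").fetchall()
for row in comparison:
    print(row)

# DISTINCT cannot collapse rows numbered by ROW_NUMBER(), but it can
# collapse rows numbered by DENSE_RANK().
distinct_row_number = conn.execute(
    "SELECT DISTINCT v, ROW_NUMBER() OVER (ORDER BY v) FROM t"
).fetchall()
distinct_dense_rank = conn.execute(
    "SELECT DISTINCT v, DENSE_RANK() OVER (ORDER BY v) FROM t"
).fetchall()
print(len(distinct_row_number), len(distinct_dense_rank))  # 8 5
```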

Using DISTINCT causes issues as you add fields and it can also mask problems in your select. Use GROUP BY as an alternative like this:
SELECT id
,ROW_NUMBER() OVER (ORDER BY id) AS RowNum
FROM table
where fid = 64
group by id
Then you can add other interesting information from your select like this:
,count(*) as thecount
or
,max(description) as description
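A minimal sketch of the GROUP BY variant with both extra columns added, run on SQLite via Python's sqlite3 module (3.25 or newer assumed). The table name items, the description values and all rows are hypothetical. It works because window functions are evaluated after grouping, so ROW_NUMBER() sees one row per id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and rows, for illustration only.
conn.execute("CREATE TABLE items (id INTEGER, fid INTEGER, description TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [(10, 64, "x"), (10, 64, "y"), (20, 64, "z"),
     (30, 64, "q"), (30, 64, "r"), (40, 7, "s")],
)

grouped = conn.execute("""
    SELECT id,
           ROW_NUMBER() OVER (ORDER BY id) AS RowNum,
           COUNT(*) AS thecount,
           MAX(description) AS description
    FROM items
    WHERE fid = 64
    GROUP BY id
    ORDER BY id
""").fetchall()
print(grouped)  # [(10, 1, 2, 'y'), (20, 2, 1, 'z'), (30, 3, 2, 'r')]
```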

How about something like
;WITH DistinctVals AS (
SELECT distinct id
FROM table
where fid = 64
)
SELECT id,
ROW_NUMBER() OVER (ORDER BY id) AS RowNum
FROM DistinctVals
You could also try
SELECT distinct id, DENSE_RANK() OVER (ORDER BY id) AS RowNum
FROM #mytable
where fid = 64

Try this:
;WITH CTE AS (
SELECT DISTINCT id FROM table WHERE fid = 64
)
SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS RowNum
FROM CTE

Try this
SELECT distinct id
FROM (SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS RowNum
FROM table
WHERE fid = 64) t
Or number the rows within each id with ROW_NUMBER() and keep only the first row of each partition:
SELECT id
FROM (SELECT id, ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS RowNum
FROM table
WHERE fid = 64) t
WHERE t.RowNum=1
This also returns the distinct ids

This question is old and my answer might not add much, but here are my two cents on making the query a little more useful:
;WITH DistinctRecords AS (
SELECT DISTINCT [col1,col2,col3,..]
FROM tableName
where [my condition]
),
serialize AS (
SELECT
ROW_NUMBER() OVER (PARTITION BY [colNameAsNeeded] ORDER BY [colNameNeeded]) AS Sr,*
FROM DistinctRecords
)
SELECT * FROM serialize
The usefulness of the two CTEs lies in the fact that you can now use the serialized records much more easily in your query, for example to do count(*).
DistinctRecords selects all distinct records, and serialize applies serial numbers to them. Afterwards you can use the final serialized result for your purposes without clutter.
PARTITION BY might not be needed in most cases.

Related

Is there a way to calculate average based on distinct rows without using a subquery?

If I have data like so:
+----+-------+
| id | value |
+----+-------+
| 1 | 10 |
| 1 | 10 |
| 2 | 20 |
| 3 | 30 |
| 2 | 20 |
+----+-------+
How do I calculate the average based on the distinct id WITHOUT using a subquery (i.e. querying the table directly)?
For the above example it would be (10+20+30)/3 = 20
I tried to do the following:
SELECT AVG(IF(id = LAG(id) OVER (ORDER BY id), NULL, value)) AS avg
FROM table
Basically I was thinking that if I order by id and check the previous row to see if it has the same id, the value should be NULL and thus it would not be counted into the calculation, but unfortunately I can't put analytical functions inside aggregate functions.
As far as I know, you can't do this without a subquery. I would use:
SELECT AVG(avg_value)
FROM
(
SELECT AVG(value) AS avg_value
FROM yourTable
GROUP BY id
) t;
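Alternatively, in a dialect that supports QUALIFY (for example Teradata or Snowflake), keep one row per id and average those:
WITH ranked AS (
SELECT t.*,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS rn
FROM yourTable t
QUALIFY rn = 1
)
SELECT
AVG(value)
FROM ranked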
A comment on the question adds: "The outer query will have other parameters that need to access all the data in the table."
I interpret this comment as wanting an average on every row -- rather than doing an aggregation. If so, you can use window functions:
select t.*,
avg(case when seqnum = 1 then value end) over () as overall_avg
from (select t.*,
row_number() over (partition by id order by id) as seqnum
from t
) t;
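Here is a small sketch of the window-function approach above, run against the question's sample data on SQLite via Python's sqlite3 module (3.25 or newer assumed). The table name samples is hypothetical. Every row keeps its own columns and additionally carries the average computed over one row per id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Sample rows from the question; the table name "samples" is an assumption.
conn.execute("CREATE TABLE samples (id INTEGER, value INTEGER)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [(1, 10), (1, 10), (2, 20), (3, 30), (2, 20)],
)

per_row = conn.execute("""
    SELECT s.id, s.value,
           AVG(CASE WHEN seqnum = 1 THEN value END) OVER () AS overall_avg
    FROM (SELECT id, value,
                 ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS seqnum
          FROM samples) AS s
    ORDER BY s.id
""").fetchall()
for row in per_row:
    print(row)  # each of the 5 rows ends with 20.0
```

Only rows with seqnum = 1 contribute a non-NULL value, and AVG ignores NULLs, so the result is (10+20+30)/3.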
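Yes, there is a way, provided that different ids never share the same value:
simply use DISTINCT inside the AVG function, as below. Note that this de-duplicates by value, not by id, so two ids with equal values would be counted only once.
select avg(distinct value) from tab;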
http://sqlfiddle.com/#!4/9d156/2/0
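A quick sketch of how avg(distinct value) behaves compared with an average over one row per id, using SQLite through Python's sqlite3 module. The table name samples and the extra row (4, 10) are hypothetical additions to the question's data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id INTEGER, value INTEGER)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [(1, 10), (1, 10), (2, 20), (3, 30), (2, 20)],
)

distinct_sql = "SELECT AVG(DISTINCT value) FROM samples"
per_id_sql = """
    SELECT AVG(v) FROM (SELECT MAX(value) AS v FROM samples GROUP BY id)
"""

# On the question's data both approaches agree.
before = (conn.execute(distinct_sql).fetchone()[0],
          conn.execute(per_id_sql).fetchone()[0])
print(before)  # (20.0, 20.0)

# Hypothetical extra row: a new id that reuses an existing value.
conn.execute("INSERT INTO samples VALUES (4, 10)")
after = (conn.execute(distinct_sql).fetchone()[0],
         conn.execute(per_id_sql).fetchone()[0])
print(after)  # (20.0, 17.5)
```

The two results diverge as soon as two different ids carry the same value, because DISTINCT inside AVG works on values only.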

How to SELECT in SQL based on a value from the same table column?

I have the following table
| id | date | team |
|----|------------|------|
| 1 | 2019-01-05 | A |
| 2 | 2019-01-05 | A |
| 3 | 2019-01-01 | A |
| 4 | 2019-01-04 | B |
| 5 | 2019-01-01 | B |
How can I query the table to receive the most recent values for the teams?
For example, the result for the above table would be ids 1,2,4.
In this case, you can use window functions:
select t.*
from (select t.*, rank() over (partition by team order by date desc) as seqnum
from t
) t
where seqnum = 1;
In some databases a correlated subquery is faster with the right indexes (I haven't tested this with Postgres):
select t.*
from t
where t.date = (select max(t2.date) from t t2 where t2.team = t.team);
And if you wanted only one row per team, then the canonical answer is:
select distinct on (t.team) t.*
from t
order by t.team, t.date desc;
However, that doesn't work in this case because you want all rows from the most recent date.
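As a sanity check, the rank() query and the correlated subquery can be run side by side on the question's data. This sketch uses SQLite via Python's sqlite3 module (3.25 or newer assumed) rather than Postgres, so DISTINCT ON is left out; the table name games is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Rows from the question; the table name "games" is an assumption.
conn.execute("CREATE TABLE games (id INTEGER, date TEXT, team TEXT)")
conn.executemany(
    "INSERT INTO games VALUES (?, ?, ?)",
    [(1, "2019-01-05", "A"), (2, "2019-01-05", "A"), (3, "2019-01-01", "A"),
     (4, "2019-01-04", "B"), (5, "2019-01-01", "B")],
)

# rank() keeps ties, so every row sharing the latest date gets seqnum 1.
rank_ids = [r[0] for r in conn.execute("""
    SELECT id
    FROM (SELECT g.*,
                 RANK() OVER (PARTITION BY team ORDER BY date DESC) AS seqnum
          FROM games AS g) AS ranked
    WHERE seqnum = 1
    ORDER BY id
""")]

# Correlated subquery: compare each row with its own team's latest date.
subquery_ids = [r[0] for r in conn.execute("""
    SELECT g.id
    FROM games AS g
    WHERE g.date = (SELECT MAX(g2.date) FROM games AS g2 WHERE g2.team = g.team)
    ORDER BY g.id
""")]

print(rank_ids, subquery_ids)  # [1, 2, 4] [1, 2, 4]
```

ISO-formatted date strings sort correctly as text, which is what makes the comparison work in SQLite here.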
If your dataset is large, consider the max analytic function in a subquery:
with cte as (
select
id, date, team,
max (date) over (partition by team) as max_date
from t
)
select id
from cte
where date = max_date
Notionally, max is O(n), so it should be pretty efficient. I don't pretend to know the actual implementation on PostgreSQL, but my guess is it's O(n).
One more possibility, generic:
select * from t join (select max(date) date,team from t
group by team) tt
using(date,team)
A window function is the best solution here.
select id
from (
select team, id, rank() over (partition by team order by date desc) as row_num
from table
) t
where row_num = 1
That query will return this table:
| id |
|----|
| 1 |
| 2 |
| 4 |
If you want one row per team, you need to use the array_agg function.
select team, array_agg(id) ids
from (
select team, id, rank() over (partition by team order by date desc) as row_num
from table
) t
where row_num = 1
group by team
That query will return this table:
| team | ids |
|------|--------|
| A | [1, 2] |
| B | [4] |
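The one-row-per-team shape can be sketched outside Postgres as well. SQLite has no array_agg, so this example swaps in group_concat and splits the string in Python; that substitution, the table name games, and the use of Python's sqlite3 module (SQLite 3.25 or newer) are all assumptions of the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE games (id INTEGER, date TEXT, team TEXT)")
conn.executemany(
    "INSERT INTO games VALUES (?, ?, ?)",
    [(1, "2019-01-05", "A"), (2, "2019-01-05", "A"), (3, "2019-01-01", "A"),
     (4, "2019-01-04", "B"), (5, "2019-01-01", "B")],
)

rows = conn.execute("""
    SELECT team, GROUP_CONCAT(id) AS ids  -- stand-in for array_agg(id)
    FROM (SELECT team, id,
                 RANK() OVER (PARTITION BY team ORDER BY date DESC) AS row_num
          FROM games) AS ranked
    WHERE row_num = 1
    GROUP BY team
""").fetchall()

# Turn the comma-separated string into a sorted list of integers per team.
ids_by_team = {team: sorted(int(i) for i in ids.split(",")) for team, ids in rows}
print(ids_by_team)
```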

sql aggregate data

This is not a question about a specific DBMS, but a generic SQL problem.
I have this dataset:
userid | objecteid| count
--------------------------
1 | 1 | 12
1 | 2 | 15
1 | 3 | 6
2 | 4 | 30
2 | 1 | 1
2 | 5 | 9
With one query I need to find, for each user, the object with the maximum count.
I am looking for a result like this:
userid | objecteid| count
--------------------------
1 | 2 | 15
2 | 4 | 30
because object 2 has the max count for user 1 and object 4 has the max count for user 2.
This can easily be solved using window functions.
The following is standard ANSI SQL:
select userid, objecteid, "count"
from (
select userid, objecteid, "count",
max("count") over (partition by userid) as max_cnt
from the_table
) t
where "count" = max_cnt;
If there are two objects with the same count, both will be returned.
Alternatively this can also be done using row_number() instead:
select userid, objecteid, "count"
from (
select userid, objecteid, "count",
row_number() over (partition by userid order by "count" desc) as rn
from the_table
) t
where rn = 1;
Unlike the first query, this will only pick one row if a user has more than one object with the same count. If you want those duplicates returned, use dense_rank() instead of row_number()
SQLFiddle: http://sqlfiddle.com/#!15/f02a9/1
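The difference in tie handling between max() over, row_number() and dense_rank() can be seen with a small experiment. This is a sketch on SQLite via Python's sqlite3 module (3.25 or newer assumed); the row (1, 7, 15) is a hypothetical addition that creates a tie for user 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE the_table (userid INTEGER, objecteid INTEGER, "count" INTEGER)')
conn.executemany(
    "INSERT INTO the_table VALUES (?, ?, ?)",
    [(1, 1, 12), (1, 2, 15), (1, 3, 6), (2, 4, 30), (2, 1, 1), (2, 5, 9),
     (1, 7, 15)],  # hypothetical extra row: ties with object 2 for user 1
)

max_rows = conn.execute("""
    SELECT userid, objecteid, "count"
    FROM (SELECT userid, objecteid, "count",
                 MAX("count") OVER (PARTITION BY userid) AS max_cnt
          FROM the_table) AS t
    WHERE "count" = max_cnt
    ORDER BY userid, objecteid
""").fetchall()

def top_rows(function_name):
    """Return the rows numbered 1 per user by the given ranking function."""
    return conn.execute(f"""
        SELECT userid, objecteid, "count"
        FROM (SELECT userid, objecteid, "count",
                     {function_name}() OVER (PARTITION BY userid
                                             ORDER BY "count" DESC) AS rn
              FROM the_table) AS t
        WHERE rn = 1
        ORDER BY userid, objecteid
    """).fetchall()

row_number_rows = top_rows("ROW_NUMBER")
dense_rank_rows = top_rows("DENSE_RANK")

print(max_rows)              # [(1, 2, 15), (1, 7, 15), (2, 4, 30)]
print(len(row_number_rows))  # 2: one arbitrary winner per user
print(dense_rank_rows == max_rows)  # True
```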
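Try this (the subquery is correlated on userid, so each row is compared only with its own user's maximum):
Select * from tableName t
where t.count = (
Select Max(t2.count)
from tableName t2
where t2.userid = t.userid
)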

Renumber dynamic column without update in SQL Server

I have this data
5 | Batman
5 | Superman
5 | Wonderwomen
6 | Green Lantern
6 | Green Arrow
7 | Cyborg
When I run a SELECT query, I want the first column renumbered to:
1 | Batman
1 | Superman
1 | Wonderwomen
2 | Green Lantern
2 | Green Arrow
3 | Cyborg
Any thoughts?
EDIT:
Thanks to vittore, I came up with this solution. I'm not sure if my query is good.
I apply ROW_NUMBER() twice. Even if my sequence id skips values, this query should renumber correctly.
WITH cte AS
(
SELECT ROW_NUMBER() OVER(PARTITION BY id ORDER BY id asc) AS CteId
FROM MyTable
)
SELECT
ROW_NUMBER() OVER(PARTITION BY CteId ORDER BY CteId asc) AS RenumberColumn
FROM cte
The RANK function is what you are looking for:
select RANK() OVER (ORDER BY id), name
from t
Check row_number() and dense_rank() as well when you read about it.
UPDATE: If you use rank alone, it will not give you the values you want ( 1 1 1 2 2 3 ), but the ranked values ( 1 1 1 4 4 6 ).
So in order to get the groups numbered (1 2 3), group, rank and join:
select a.r, t.name from t
inner join (select id, rank() over (order by id asc) r
from t group by id) a
on t.id = a.id
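The three options mentioned in this answer (rank alone, dense_rank, and the group/rank/join query) can be compared directly. The sketch below runs them on SQLite via Python's sqlite3 module (3.25 or newer assumed, not SQL Server); the table name heroes is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Data from the question; the table name "heroes" is an assumption.
conn.execute("CREATE TABLE heroes (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO heroes VALUES (?, ?)",
    [(5, "Batman"), (5, "Superman"), (5, "Wonderwomen"),
     (6, "Green Lantern"), (6, "Green Arrow"), (7, "Cyborg")],
)

# RANK() leaves gaps after ties.
rank_values = [r[0] for r in conn.execute(
    "SELECT RANK() OVER (ORDER BY id), name FROM heroes ORDER BY id, name")]

# DENSE_RANK() numbers the distinct ids consecutively.
dense_rows = conn.execute("""
    SELECT DENSE_RANK() OVER (ORDER BY id) AS r, name
    FROM heroes
    ORDER BY r, name
""").fetchall()

# Group, rank and join, as in the answer above.
join_rows = conn.execute("""
    SELECT a.r, t.name
    FROM heroes AS t
    INNER JOIN (SELECT id, RANK() OVER (ORDER BY id ASC) AS r
                FROM heroes GROUP BY id) AS a
            ON t.id = a.id
    ORDER BY a.r, t.name
""").fetchall()

print(rank_values)                   # [1, 1, 1, 4, 4, 6]
print([r for r, _ in dense_rows])    # [1, 1, 1, 2, 2, 3]
print(join_rows == dense_rows)       # True
```

On this data DENSE_RANK() gives the requested numbering in a single pass, without the extra join.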
If it's always -4, then:
Select (number-4), name
from table
But I doubt it's that simple.

Amazon Redshift mechanism for aggregating a column into a string [duplicate]

I have a data set in the form.
id | attribute
-----------------
1 | a
2 | b
2 | a
2 | a
3 | c
Desired output:
attribute| num
-------------------
a | 1
b,a | 1
c | 1
In MySQL, I would use:
select attribute, count(*) num
from
(select id, group_concat(distinct attribute) attribute from dataset group by id) as subquery
group by attribute;
I am not sure this can be done in Redshift, because it does not support group_concat or any of the PostgreSQL aggregate functions such as array_agg() or string_agg(). See this question.
An alternate solution that would work is if there was a way for me to pick a random attribute from each group instead of group_concat. How can this work in Redshift?
I found a way to pick up a random attribute for each id, but it's too tricky. Actually I don't think it's a good way, but it works.
SQL:
-- (1) uniq dataset
WITH uniq_dataset as (select * from dataset group by id, attr)
SELECT
uds.id, rds.attr
FROM
-- (2) generate random rank for each id
(select id, round((random() * ((select count(*) from uniq_dataset iuds where iuds.id = ouds.id) - 1))::numeric, 0) + 1 as random_rk from (select distinct id from uniq_dataset) ouds) uds,
-- (3) rank table
(select rank() over(partition by id order by attr) as rk, id ,attr from uniq_dataset) rds
WHERE
uds.id = rds.id
AND
uds.random_rk = rds.rk
ORDER BY
uds.id;
Result:
id | attr
----+------
1 | a
2 | a
3 | c
OR
id | attr
----+------
1 | a
2 | b
3 | c
Here are tables in this SQL.
-- dataset (original table)
id | attr
----+------
1 | a
2 | b
2 | a
2 | a
3 | c
-- (1) uniq dataset
id | attr
----+------
1 | a
2 | a
2 | b
3 | c
-- (2) generate random rank for each id
id | random_rk
----+----
1 | 1
2 | 1 <- 1 or 2
3 | 1
-- (3) rank table
rk | id | attr
----+----+------
1 | 1 | a
1 | 2 | a
2 | 2 | b
1 | 3 | c
This solution, inspired by Masashi, is simpler and accomplishes selecting a random element from a group in Redshift.
SELECT id, first_value AS attribute
FROM (SELECT id, FIRST_VALUE(attribute)
OVER (PARTITION BY id ORDER BY random()
ROWS BETWEEN unbounded preceding AND unbounded following) AS first_value
FROM dataset) t
GROUP BY id, attribute ORDER BY id;
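The FIRST_VALUE ... ORDER BY random() idea can be tried locally. This sketch runs it on SQLite through Python's sqlite3 module (3.25 or newer assumed) instead of Redshift, so it only illustrates the technique; the alias picked is my own naming.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (id INTEGER, attribute TEXT)")
conn.executemany(
    "INSERT INTO dataset VALUES (?, ?)",
    [(1, "a"), (2, "b"), (2, "a"), (2, "a"), (3, "c")],
)

# Shuffle each id's rows, take the first attribute of the shuffled
# partition for every row, then collapse to one row per id.
random_pick = conn.execute("""
    SELECT id, picked AS attribute
    FROM (SELECT id,
                 FIRST_VALUE(attribute) OVER (
                     PARTITION BY id ORDER BY random()
                     ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                 ) AS picked
          FROM dataset) AS s
    GROUP BY id, picked
    ORDER BY id
""").fetchall()
print(random_pick)  # id 2 comes back with either 'a' or 'b'
```

Because id 2 has two rows with 'a' and one with 'b', this pick is weighted by row count rather than uniform over distinct attributes.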
This is an answer for the related question here. That question is closed, so I am posting the answer here.
Here is a method to aggregate a column into a string:
select * from temp;
attribute
-----------
a
c
b
1) Give a unique rank to each row
with sub_table as(select attribute, rank() over (order by attribute) rnk from temp)
select * from sub_table;
attribute | rnk
-----------+-----
a | 1
b | 2
c | 3
2) Use concat operator || to combine in one line
with sub_table as(select attribute, rank() over (order by attribute) rnk from temp)
select (select attribute from sub_table where rnk = 1)||
(select attribute from sub_table where rnk = 2)||
(select attribute from sub_table where rnk = 3) res_string;
res_string
------------
abc
This only works for a fixed, known number of rows (X) in that column. These can be the first X rows ordered by some attribute in the "order by" clause. I'm guessing this is expensive.
A CASE expression can be used to deal with the NULLs that occur when a certain rank does not exist (concatenating NULL with || makes the whole result NULL):
with sub_table as(select attribute, rank() over (order by attribute) rnk from temp)
select (select attribute from sub_table where rnk = 1)||
(select attribute from sub_table where rnk = 2)||
(select attribute from sub_table where rnk = 3)||
(case when (select attribute from sub_table where rnk = 4) is NULL then ''
else (select attribute from sub_table where rnk = 4) end) as res_string;
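To see why the NULL guard is needed, here is a sketch on SQLite via Python's sqlite3 module (3.25 or newer assumed). The table name attrs is hypothetical, and COALESCE is used here as a shorter stand-in for the CASE expression; it is my substitution, not part of the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table name "attrs" holding the three sample attributes.
conn.execute("CREATE TABLE attrs (attribute TEXT)")
conn.executemany("INSERT INTO attrs VALUES (?)", [("a",), ("c",), ("b",)])

cte = """
WITH sub_table AS (
    SELECT attribute, RANK() OVER (ORDER BY attribute) AS rnk FROM attrs
)
"""

# Rank 4 does not exist, so its scalar subquery is NULL and the whole
# concatenation collapses to NULL.
unguarded = conn.execute(cte + """
    SELECT (SELECT attribute FROM sub_table WHERE rnk = 1) ||
           (SELECT attribute FROM sub_table WHERE rnk = 2) ||
           (SELECT attribute FROM sub_table WHERE rnk = 3) ||
           (SELECT attribute FROM sub_table WHERE rnk = 4)
""").fetchone()[0]

# Replacing the missing rank with an empty string keeps the result.
guarded = conn.execute(cte + """
    SELECT (SELECT attribute FROM sub_table WHERE rnk = 1) ||
           (SELECT attribute FROM sub_table WHERE rnk = 2) ||
           (SELECT attribute FROM sub_table WHERE rnk = 3) ||
           COALESCE((SELECT attribute FROM sub_table WHERE rnk = 4), '')
""").fetchone()[0]

print(unguarded)  # None
print(guarded)    # abc
```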
I haven't tested this query, but these functions are supported in Redshift:
select id, array_to_string(array(select attribute from mydataset m where m.id=d.id),',')
from mydataset d