first max second min ordering - sql

I have a table of non-unique float numbers, and I want to order them in a special way: the max element in 1st place, the min element in 2nd place, the second-largest element in 3rd place, and so on. For example,
1,2,3,4,5,6,7,8,9
I would like to order as
1,9,2,8,3,7,4,6,5
UPD:
A combination of ascending and descending orderings with row_number() can be a solution, e.g.
select c, a, d, abs(a - d)
from (select c,
             row_number() over (order by c) as a,
             row_number() over (order by c desc) as d
      from t) t
order by abs(a - d) desc, a
But keep in mind that you can run into problems with non-unique numbers; the solution above will NOT work for the example below:
c | a | d | abs(a - d)
4 | 1 | 4 | 3
4 | 2 | 5 | 3
5 | 3 | 1 | 2
5 | 4 | 2 | 2
5 | 5 | 3 | 2
This means that the expression in the OVER clause must define a unique, deterministic ordering; otherwise the two row_number() calls may enumerate tied rows in different orders.
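One possible fix (a sketch, assuming the table also has a unique key column, here called id) is to mirror that key as a tiebreaker so both enumerations pair up duplicates consistently:
select c
from (select c,
             row_number() over (order by c, id) as a,
             row_number() over (order by c desc, id desc) as d
      from t) t
order by abs(a - d) desc, a
With the tiebreakers mirrored (id ascending in one call, descending in the other), every row keeps a constant a + d sum, and the alternating extremes order is preserved.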

ANSI SQL supports row_number(). You can do this by using row_number() in a clever way:
select t.*
from (select t.*,
             row_number() over (order by col) as seqnum_asc,
             row_number() over (order by col desc) as seqnum_desc
      from t
     ) t
order by (case when seqnum_asc <= seqnum_desc then seqnum_asc else seqnum_desc end),
         col desc;
The case is really least(seqnum_asc, seqnum_desc), but not all databases support that construct.
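Where least() is available (e.g., Postgres, MySQL, Oracle), the ORDER BY can be written more compactly; a sketch using the same subquery:
order by least(seqnum_asc, seqnum_desc), col desc;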

Create column for the quantile number of a value in BigQuery

I have a table with two columns: id and score. I'd like to create a third column that equals the quantile that an individual's score falls in. I'd like to do this in BigQuery's standardSQL.
Here's my_table:
+----+--------+
| id | score |
+----+--------+
| 1 | 2 |
| 2 | 13 |
| 3 | -2 |
| 4 | 7 |
+----+--------+
and afterwards I'd like to have the following table (example shown with quartiles, but I'd be interested in quartiles/quintiles/deciles)
+----+--------+----------+
| id | score | quartile |
+----+--------+----------+
| 1 | 2 | 2 |
| 2 | 13 | 4 |
| 3 | -2 | 1 |
| 4 | 7 | 3 |
+----+--------+----------+
It would be excellent if this worked on 100 million rows. I've looked around and found a couple of solutions that seem to use legacy SQL, and the solutions using RANK() don't seem to work for really large datasets. Thanks!
If I understand correctly, you can use ntile(). For instance, if you wanted a value from 1-4, you can do:
select t.*, ntile(4) over (order by score) as tile
from t;
If you want to enumerate the values, then use rank() or dense_rank():
select t.*, rank() over (order by score) as tile
from t;
I see: your problem is getting the code to run at all, because BigQuery tends to run out of resources when a window function has no PARTITION BY. One method is to break the scores up into groups. I think this logic does what you want:
select *,
       ( (count(*) over (order by cast(score / 1000 as int64)) -
          count(*) over (partition by cast(score / 1000 as int64))
         ) +
         rank() over (partition by cast(score / 1000 as int64) order by score)
       ) as therank
       -- rank() over (order by score) as therank
from t;
This breaks the scores into buckets of 1,000 consecutive values (the divisor may need tuning for your data), and then reconstructs the overall ranking as the number of rows in earlier buckets plus the rank within the current bucket.
If your score has relatively low cardinality, then join with aggregation works:
select t.*, (running_cnt - cnt + 1) as therank
from t join
(select score, count(*) as cnt, sum(count(*)) over (order by score) as running_cnt
from t
group by score
) s
on t.score = s.score;
Once you have the rank() (or row_number()) you can easily calculate the tiles yourself (hint: division).
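For example, a sketch of that division (shown for quartiles, i.e. 4 tiles, and assuming a table small enough that the unpartitioned window is acceptable; otherwise substitute therank and a precomputed total count from the approaches above):
select id, score,
       cast(ceil(rank() over (order by score) * 4 / count(*) over ()) as int64) as quartile
from t;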
The desired output suggests rank():
SELECT *, RANK() OVER (ORDER BY score) as quantile
FROM my_table
ORDER BY id;

Count rows in partition with Order By

I was trying to understand PARTITION BY in postgres by writing a few sample queries. I have a test table on which I run my query.
id integer | num integer
___________|_____________
1 | 4
2 | 4
3 | 5
4 | 6
When I run the following query, I get the output as I expected.
SELECT id, COUNT(id) OVER(PARTITION BY num) from test;
id | count
___________|_____________
1 | 2
2 | 2
3 | 1
4 | 1
But, when I add ORDER BY to the partition,
SELECT id, COUNT(id) OVER(PARTITION BY num ORDER BY id) from test;
id | count
___________|_____________
1 | 1
2 | 2
3 | 1
4 | 1
My understanding is that COUNT is computed across all rows that fall into a partition. Here, I have partitioned the rows by num. The number of rows in the partition is the same, with or without an ORDER BY clause. Why is there a difference in the outputs?
When you add an ORDER BY to an aggregate used as a window function, that aggregate turns into a "running" aggregate (a running count in this case).
count(*) will then return the number of rows up to the "current" row, based on the order specified.
The following query shows the different results for aggregates used with an order by. With sum() instead of count() it's a bit easier to see (in my opinion).
with test (id, num, x) as (
values
(1, 4, 1),
(2, 4, 1),
(3, 5, 2),
(4, 6, 2)
)
select id,
num,
x,
count(*) over () as total_rows,
count(*) over (order by id) as rows_upto,
count(*) over (partition by x order by id) as rows_per_x,
sum(num) over (partition by x) as total_for_x,
sum(num) over (order by id) as sum_upto,
sum(num) over (partition by x order by id) as sum_for_x_upto
from test;
will result in:
id | num | x | total_rows | rows_upto | rows_per_x | total_for_x | sum_upto | sum_for_x_upto
---+-----+---+------------+-----------+------------+-------------+----------+---------------
1 | 4 | 1 | 4 | 1 | 1 | 8 | 4 | 4
2 | 4 | 1 | 4 | 2 | 2 | 8 | 8 | 8
3 | 5 | 2 | 4 | 3 | 1 | 11 | 13 | 5
4 | 6 | 2 | 4 | 4 | 2 | 11 | 19 | 11
There are more examples in the Postgres manual.
Your two expressions are:
COUNT(id) OVER (PARTITION BY num)
COUNT(id) OVER (PARTITION BY num ORDER BY id)
Why would you expect these to return the same values? The syntax is different for a reason.
The first returns the overall count for each num -- essentially joining back the aggregated value.
The second does a cumulative count: for each row, it counts the rows in the partition whose id value is up to and including that row's id value.
Note that such cumulative counts would normally be implemented using RANK() (or related functions).
The cumulative count is subtly different from RANK(). The cumulative count implements:
COUNT(id) OVER (PARTITION BY num ORDER BY id RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
RANK() is slightly different. The difference only matters when the ORDER BY keys have ties.
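A hypothetical illustration of that difference (values invented): for num = 4, 4, 5, 5, 5,
select id, num,
       rank() over (order by num) as rnk,
       count(*) over (order by num) as cume_cnt
from test;
rank() returns 1, 1, 3, 3, 3 (each row gets the rank of the first peer), while the cumulative count returns 2, 2, 5, 5, 5 (each row counts through the end of its peer group, because the default RANGE frame includes all peers of the current row).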
The "why" has already been explained by others. Sometimes you have an ordered window, and you have to do a count over the whole partition despite having an ORDER BY.
To do so, use an unbounded range with RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
create table search_log
(
    id bigint not null primary key,
    query varchar(255) not null,
    stemmed_query varchar(255) not null,
    created timestamp not null
);
SELECT query,
       created as seen_on,
       first_value(created) OVER query_window as last_seen,
       row_number() OVER query_window AS rn,
       count(*) OVER query_window AS occurrence
FROM search_log l
WINDOW query_window AS (PARTITION BY stemmed_query ORDER BY created DESC
                        RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);

TSQL Number Rows Based on change in fieldvalue and sorted on date with incremented numbers on duplicates

Say I have data like the following:
X | 2/2/2000
X | 2/3/2000
B | 2/4/2000
B | 2/10/2000
B | 2/10/2000
J | 2/11/2000
X | 3/1/2000
I would like to get a dataset like this:
1 | X | 2/2/2000
1 | X | 2/3/2000
2 | B | 2/4/2000
2 | B | 2/10/2000
2 | B | 2/10/2000
3 | J | 2/11/2000
4 | X | 3/1/2000
So far, everything I have tried has either reset the numbering on each field value change, or, as in the example above, left the last X as 1.
This is a gaps and islands problem. You can use a difference of row numbers:
select dense_rank() over (order by col1, seqnum_1 - seqnum_2) as col0,
col1, col2
from (select t.*,
row_number() over (order by col2) as seqnum_1,
row_number() over (partition by col1 order by col2) as seqnum_2
from t
) t;
Explaining why this works is a bit cumbersome. If you run the subquery, you will see how the sequence numbers are assigned and why the difference is what you want.
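For illustration (assuming col1 is the letter and col2 the date), the subquery assigns:
col1 | col2      | seqnum_1 | seqnum_2 | seqnum_1 - seqnum_2
X    | 2/2/2000  |    1     |    1     |    0
X    | 2/3/2000  |    2     |    2     |    0
B    | 2/4/2000  |    3     |    1     |    2
B    | 2/10/2000 |    4     |    2     |    2
B    | 2/10/2000 |    5     |    3     |    2
J    | 2/11/2000 |    6     |    1     |    5
X    | 3/1/2000  |    7     |    3     |    4
The difference is constant within each island, so dense_rank() over the (col1, difference) pairs assigns one number per island.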
You can query like this:
SELECT dense_rank() over (order by yourcolumn1), * FROM yourtable

Renumber dynamic column without update in SQL Server

I have this data
5 | Batman
5 | Superman
5 | Wonderwomen
6 | Green Lantern
6 | Green Arrow
7 | Cyborg
when I run a select query, I want it renumbered to
1 | Batman
1 | Superman
1 | Wonderwomen
2 | Green Lantern
2 | Green Arrow
3 | Cyborg
Thoughts?
EDIT:
Thanks to vittore, I came up with the solution below. I'm not sure if my query is good.
I apply ROW_NUMBER() twice; even when the sequence of ids has gaps, this query renumbers correctly.
WITH cte AS
(
    SELECT id, name,
           ROW_NUMBER() OVER(PARTITION BY id ORDER BY id ASC) AS CteId
    FROM MyTable
)
SELECT
    ROW_NUMBER() OVER(PARTITION BY CteId ORDER BY id ASC) AS RenumberColumn,
    name
FROM cte
The RANK function is what you are looking for:
select RANK() OVER (ORDER BY id), name
from t
Check out row_number() and dense_rank() as well when you read about it.
UPDATE: If you use rank alone, it will give you not the values you want (1 1 1 2 2 3), but gapped rank values (1 1 1 4 4 6).
So in order to get the (1 2 3) numbering, group, rank and join:
select a.r, t.name from t
inner join (select id, rank() over (order by id asc) r
from t group by id) a
on t.id = a.id
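Note that dense_rank() alone already produces the compact (1 1 1 2 2 3) numbering directly, without the join; a sketch, assuming the question's table is MyTable with columns id and name:
select dense_rank() over (order by id) as RenumberColumn, name
from MyTable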
If it's always -4, then:
Select (number-4), name
from table
But I doubt it's that simple.

Amazon Redshift mechanism for aggregating a column into a string [duplicate]

I have a data set of the form:
id | attribute
-----------------
1 | a
2 | b
2 | a
2 | a
3 | c
Desired output:
attribute| num
-------------------
a | 1
b,a | 1
c | 1
In MySQL, I would use:
select attribute, count(*) num
from
(select id, group_concat(distinct attribute) attribute from dataset group by id) as subquery
group by attribute;
I am not sure this can be done in Redshift, because it does not support group_concat or any of the PostgreSQL grouping aggregate functions like array_agg() or string_agg(). See this question.
An alternate solution that would work is if there was a way for me to pick a random attribute from each group instead of group_concat. How can this work in Redshift?
I found a way to pick a random attribute for each id, but it's quite tricky. I don't think it's a good way, but it works.
SQL:
-- (1) uniq dataset
WITH uniq_dataset as (select id, attr from dataset group by id, attr)
SELECT
  uds.id, rds.attr
FROM
  -- (2) generate a random rank for each id
  (select id,
          round((random() * ((select count(*) from uniq_dataset iuds where iuds.id = ouds.id) - 1))::numeric, 0) + 1 as random_rk
   from (select distinct id from uniq_dataset) ouds) uds,
  -- (3) rank table
  (select rank() over (partition by id order by attr) as rk, id, attr from uniq_dataset) rds
WHERE
  uds.id = rds.id
  AND uds.random_rk = rds.rk
ORDER BY
  uds.id;
Result:
id | attr
----+------
1 | a
2 | a
3 | c
OR
id | attr
----+------
1 | a
2 | b
3 | c
Here are the tables involved in this SQL:
-- dataset (original table)
id | attr
----+------
1 | a
2 | b
2 | a
2 | a
3 | c
-- (1) uniq dataset
id | attr
----+------
1 | a
2 | a
2 | b
3 | c
-- (2) generate random rank for each id
id | random_rk
----+----
1 | 1
2 | 1 <- 1 or 2
3 | 1
-- (3) rank table
rk | id | attr
----+----+------
1 | 1 | a
1 | 2 | a
2 | 2 | b
1 | 3 | c
This solution, inspired by Masashi's, is simpler and selects a random element from each group in Redshift:
SELECT id, attribute
FROM (SELECT id,
             FIRST_VALUE(attribute) OVER (PARTITION BY id ORDER BY random()
                                          ROWS BETWEEN unbounded preceding AND unbounded following) AS attribute
      FROM dataset) t
GROUP BY id, attribute
ORDER BY id;
This is an answer for the related question here. That question is closed, so I am posting the answer here.
Here is a method to aggregate a column into a string:
select * from temp;
attribute
-----------
a
c
b
1) Give a unique rank to each row
with sub_table as(select attribute, rank() over (order by attribute) rnk from temp)
select * from sub_table;
attribute | rnk
-----------+-----
a | 1
b | 2
c | 3
2) Use the concatenation operator || to combine them into one line
with sub_table as(select attribute, rank() over (order by attribute) rnk from temp)
select (select attribute from sub_table where rnk = 1)||
(select attribute from sub_table where rnk = 2)||
(select attribute from sub_table where rnk = 3) res_string;
res_string
------------
abc
This only works for a finite number of rows (X) in that column. It can be the first X rows, ordered by some attribute in the ORDER BY clause. I'm guessing this is expensive.
A CASE statement can be used to deal with the NULLs that occur when a given rank does not exist.
with sub_table as(select attribute, rank() over (order by attribute) rnk from temp)
select (select attribute from sub_table where rnk = 1)||
(select attribute from sub_table where rnk = 2)||
(select attribute from sub_table where rnk = 3)||
(case when (select attribute from sub_table where rnk = 4) is NULL then ''
else (select attribute from sub_table where rnk = 4) end) as res_string;
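Equivalently, coalesce() can shorten the NULL handling; a sketch of just the fourth term:
|| coalesce((select attribute from sub_table where rnk = 4), '')
since concatenating a NULL would otherwise make the entire result string NULL.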
I haven't tested this query, but these functions are supported in Redshift:
select id, array_to_string(array(select attribute from mydataset m where m.id = d.id), ',')
from mydataset d