Create column for the quantile number of a value in BigQuery - sql

I have a table with two columns: id and score. I'd like to create a third column that equals the quantile that an individual's score falls in. I'd like to do this in BigQuery's standardSQL.
Here's my_table:
+----+--------+
| id | score |
+----+--------+
| 1 | 2 |
| 2 | 13 |
| 3 | -2 |
| 4 | 7 |
+----+--------+
and afterwards I'd like to have the following table (example shown with quartiles, but I'd be interested in quartiles/quintiles/deciles)
+----+--------+----------+
| id | score | quaRtile |
+----+--------+----------+
| 1 | 2 | 2 |
| 2 | 13 | 4 |
| 3 | -2 | 1 |
| 4 | 7 | 3 |
+----+--------+----------+
It would be excellent if this were to work on 100 million rows. I've looked around to see a couple solutions that seem to use legacy sql, and the solutions using RANK() functions don't seem to work for really large datasets. Thanks!

If I understand correctly, you can use ntile(). For instance, if you wanted a value from 1-4, you can do:
select t.*, ntile(4) over (order by score) as tile
from t;
If you want to enumerate the values, then use rank() or dense_rank():
select t.*, rank() over (order by score) as tile
from t;
I see, your problem is getting the code to work, because BigQuery tends to run out of resources without a partition by. One method is to break up the score into different groups. I think this logic does what you want:
select *,
( (count(*) over (partition by cast(score / 1000 as int64) order by cast(score / 1000 as int64)) -
count(*) over (partition by cast(score / 1000 as int64))
) +
rank() over (partition by cast(score / 1000 as int64) order by regi_id)
) as therank,
-- rank() over (order by score) as therank
from t;
This breaks the score into 1000 groups (perhaps that is too many for an integer). And then reconstructs the ranking.
If your score has relatively low cardinality, then join with aggregation works:
select t.*, (running_cnt - cnt + 1) as therank
from t join
(select score, count(*) as cnt, sum(count(*)) over (order by score) as running_cnt
from t
group by score
) s
on t.score = s.score;
Once you have the rank() (or row_number()) you can easily calculate the tiles yourself (hint: division).

Output suggest me rank() :
SELECT *, RANK() OVER (ORDER BY score) as quantile
FROM table t
ORDER BY id;

Related

Is there a way to calculate average based on distinct rows without using a subquery?

If I have data like so:
+----+-------+
| id | value |
+----+-------+
| 1 | 10 |
| 1 | 10 |
| 2 | 20 |
| 3 | 30 |
| 2 | 20 |
+----+-------+
How do I calculate the average based on the distinct id WITHOUT using a subquery (i.e. querying the table directly)?
For the above example it would be (10+20+30)/3 = 20
I tried to do the following:
SELECT AVG(IF(id = LAG(id) OVER (ORDER BY id), NULL, value)) AS avg
FROM table
Basically I was thinking that if I order by id and check the previous row to see if it has the same id, the value should be NULL and thus it would not be counted into the calculation, but unfortunately I can't put analytical functions inside aggregate functions.
As far as I know, you can't do this without a subquery. I would use:
SELECT AVG(avg_value)
FROM
(
SELECT AVG(value) AS avg_value
FROM yourTable
GROUP BY id
) t;
WITH RANK AS (
Select *,
ROW_NUMBER() OVER(PARTITION BY ID) AS RANK
FROM
TABLE
QUALIFY RANK = 1
)
SELECT
AVG(VALUES)
FROM RANK
The outer query will have other parameters that need to access all the data in the table
I interpret this comment as wanting an average on every row -- rather than doing an aggregation. If so, you can use window functions:
select t.*,
avg(case when seqnum = 1 then value end) over () as overall_avg
from (select t.*,
row_number() over (partition by id order by id) as seqnum
from t
) t;
Yes there is a way,
Simply use distinct inside the avg function as below :
select avg(distinct value) from tab;
http://sqlfiddle.com/#!4/9d156/2/0

How to SELECT in SQL based on a value from the same table column?

I have the following table
| id | date | team |
|----|------------|------|
| 1 | 2019-01-05 | A |
| 2 | 2019-01-05 | A |
| 3 | 2019-01-01 | A |
| 4 | 2019-01-04 | B |
| 5 | 2019-01-01 | B |
How can I query the table to receive the most recent values for the teams?
For example, the result for the above table would be ids 1,2,4.
In this case, you can use window functions:
select t.*
from (select t.*, rank() over (partition by team order by date desc) as seqnum
from t
) t
where seqnum = 1;
In some databases a correlated subquery is faster with the right indexes (I haven't tested this with Postgres):
select t.*
from t
where t.date = (select max(t2.date) from t t2 where t2.team = t.team);
And if you wanted only one row per team, then the canonical answer is:
select distinct on (t.team) t.*
from t
order by t.team, t.date desc;
However, that doesn't work in this case because you want all rows from the most recent date.
If your dataset is large, consider the max analytic function in a subquery:
with cte as (
select
id, date, team,
max (date) over (partition by team) as max_date
from t
)
select id
from cte
where date = max_date
Notionally, max is O(n), so it should be pretty efficient. I don't pretend to know the actual implementation on PostgreSQL, but my guess is it's O(n).
One more possibility, generic:
select * from t join (select max(date) date,team from t
group by team) tt
using(date,team)
Window function is the best solution for you.
select id
from (
select team, id, rank() over (partition by team order by date desc) as row_num
from table
) t
where row_num = 1
That query will return this table:
| id |
|----|
| 1 |
| 2 |
| 4 |
If you to get it one row per team, you need to use array_agg function.
select team, array_agg(id) ids
from (
select team, id, rank() over (partition by team order by date desc) as row_num
from table
) t
where row_num = 1
group by team
That query will return this table:
| team | ids |
|------|--------|
| A | [1, 2] |
| B | [4] |

Count rows in partition with Order By

I was trying to understand PARTITION BY in postgres by writing a few sample queries. I have a test table on which I run my query.
id integer | num integer
___________|_____________
1 | 4
2 | 4
3 | 5
4 | 6
When I run the following query, I get the output as I expected.
SELECT id, COUNT(id) OVER(PARTITION BY num) from test;
id | count
___________|_____________
1 | 2
2 | 2
3 | 1
4 | 1
But, when I add ORDER BY to the partition,
SELECT id, COUNT(id) OVER(PARTITION BY num ORDER BY id) from test;
id | count
___________|_____________
1 | 1
2 | 2
3 | 1
4 | 1
My understanding is that COUNT is computed across all rows that fall into a partition. Here, I have partitioned the rows by num. The number of rows in the partition is the same, with or without an ORDER BY clause. Why is there a difference in the outputs?
When you add an order by to an aggregate used as a window function that aggregate turns into a "running count" (or whatever aggregate you use).
The count(*) will return the number of rows up until the "current one" based on the order specified.
The following query shows the different results for aggregates used with an order by. With sum() instead of count() it's a bit easier to see (in my opinion).
with test (id, num, x) as (
values
(1, 4, 1),
(2, 4, 1),
(3, 5, 2),
(4, 6, 2)
)
select id,
num,
x,
count(*) over () as total_rows,
count(*) over (order by id) as rows_upto,
count(*) over (partition by x order by id) as rows_per_x,
sum(num) over (partition by x) as total_for_x,
sum(num) over (order by id) as sum_upto,
sum(num) over (partition by x order by id) as sum_for_x_upto
from test;
will result in:
id | num | x | total_rows | rows_upto | rows_per_x | total_for_x | sum_upto | sum_for_x_upto
---+-----+---+------------+-----------+------------+-------------+----------+---------------
1 | 4 | 1 | 4 | 1 | 1 | 8 | 4 | 4
2 | 4 | 1 | 4 | 2 | 2 | 8 | 8 | 8
3 | 5 | 2 | 4 | 3 | 1 | 11 | 13 | 5
4 | 6 | 2 | 4 | 4 | 2 | 11 | 19 | 11
There are more examples in the Postgres manual
Your two expressions are:
COUNT(id) OVER (PARTITION BY num)
COUNT(id) OVER (PARTITION BY num ORDER BY id)
Why would you expect these to return the same values? The syntax is different for a reason.
The first returns the overall count for each num -- essentially joining back the aggregated value.
The second does a cumulative count. It does the COUNT() for each row of id, for all values up to that ids value.
Note that such cumulative counts would normally be implemented using RANK() (or related functions).
The cumulative count is subtly different from RANK(). The cumulative count implements:
COUNT(id) OVER (PARTITION BY num ORDER BY id RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
RANK() is slightly different. The difference only matters when the ORDER BY keys have ties.
The "why" has already been explained by others. Sometimes you have an ordered window, and you have to do a count over the whole partition despite having an ORDER BY.
To do so, use an unbounded range with RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
create table search_log
(
id bigint not null primary key,
query varchar(255) not null,
stemmed_query varchar(255) not null,
created timestamp not null,
);
SELECT query,
created as seen_on,
first_value(created) OVER query_window as last_seen,
row_number() OVER query_window AS rn,
count(*) OVER query_window AS occurence
FROM search_log l
WINDOW query_window AS (PARTITION BY stemmed_query ORDER BY created DESC
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)

Average and Median in one query in Sql Server 2012

I have the following data in my table:
SELECT category, value FROM test
| category | value |
+----------+-------+
| 1 | 1 |
| 1 | 3 |
| 1 | 4 |
| 1 | 8 |
Right now I am using two separate queries.
To get average:
SELECT category, avg(value) as Average
FROM test
GROUP BY category
| category | value |
+----------+-------+
| 1 | 4 |
To get median:
SELECT DISTINCT category,
PERCENTILE_CONT(0.5)
WITHIN GROUP (ORDER BY value)
OVER (partition BY category) AS Median
FROM test
| category | value |
+----------+-------+
| 1 | 3.5 |
Is there any way to merge them in one query?
Note: I know that I can also get median with two subqueries, but I prefer to use PERCENTILE_CONT function to get it.
AVG is also a windowed function:
select
distinct
category,
avg(value) over (partition by category) as average,
PERCENTILE_CONT(0.5)
WITHIN GROUP (ORDER BY value)
OVER (partition BY category) AS Median
from test
I would approach this in a slightly different way:
select category, avg(value) as avg,
avg(case when 2 * seqnum in (cnt, cnt + 1, cnt + 2) then value end) as median
from (select t.*, row_number() over (partition by category order by value) as seqnum,
count(*) over (partition by category) as cnt
from test t
) t
group by category;
I wanted a more thorough answer to this question, and after some digging found it in this exhaustive analysis of multiple methods from Dwain Camps, in case others find it useful:
Calculating the Median Value within a Partitioned Set Using T-SQL
I went with "his" 4th solution (he is combining/testing others' approaches), which was straightforward to understand and indeed performs well:
WITH Counts AS
(
SELECT ID, c=COUNT(*)
FROM #MedianValues
GROUP BY ID
)
SELECT a.ID, Median=AVG(0.+N)
FROM Counts a
CROSS APPLY
(
SELECT TOP(((a.c - 1) / 2) + (1 + (1 - a.c % 2)))
N, r=ROW_NUMBER() OVER (ORDER BY N)
FROM #MedianValues b
WHERE a.ID = b.ID
ORDER BY N
) p
WHERE r BETWEEN ((a.c - 1) / 2) + 1 AND (((a.c - 1) / 2) + (1 + (1 - a.c % 2)))
GROUP BY a.ID;

sql query distinct with Row_Number

I am fighting with the distinct keyword in sql.
I just want to display all row numbers of unique (distinct) values in a column & so I tried:
SELECT DISTINCT id, ROW_NUMBER() OVER (ORDER BY id) AS RowNum
FROM table
WHERE fid = 64
however the below code giving me the distinct values:
SELECT distinct id FROM table WHERE fid = 64
but when tried it with Row_Number.
then it is not working.
This can be done very simple, you were pretty close already
SELECT distinct id, DENSE_RANK() OVER (ORDER BY id) AS RowNum
FROM table
WHERE fid = 64
Use this:
SELECT *, ROW_NUMBER() OVER (ORDER BY id) AS RowNum FROM
(SELECT DISTINCT id FROM table WHERE fid = 64) Base
and put the "output" of a query as the "input" of another.
Using CTE:
; WITH Base AS (
SELECT DISTINCT id FROM table WHERE fid = 64
)
SELECT *, ROW_NUMBER() OVER (ORDER BY id) AS RowNum FROM Base
The two queries should be equivalent.
Technically you could
SELECT DISTINCT id, ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS RowNum
FROM table
WHERE fid = 64
but if you increase the number of DISTINCT fields, you have to put all these fields in the PARTITION BY, so for example
SELECT DISTINCT id, description,
ROW_NUMBER() OVER (PARTITION BY id, description ORDER BY id) AS RowNum
FROM table
WHERE fid = 64
I even hope you comprehend that you are going against standard naming conventions here, id should probably be a primary key, so unique by definition, so a DISTINCT would be useless on it, unless you coupled the query with some JOINs/UNION ALL...
This article covers an interesting relationship between ROW_NUMBER() and DENSE_RANK() (the RANK() function is not treated specifically). When you need a generated ROW_NUMBER() on a SELECT DISTINCT statement, the ROW_NUMBER() will produce distinct values before they are removed by the DISTINCT keyword. E.g. this query
SELECT DISTINCT
v,
ROW_NUMBER() OVER (ORDER BY v) row_number
FROM t
ORDER BY v, row_number
... might produce this result (DISTINCT has no effect):
+---+------------+
| V | ROW_NUMBER |
+---+------------+
| a | 1 |
| a | 2 |
| a | 3 |
| b | 4 |
| c | 5 |
| c | 6 |
| d | 7 |
| e | 8 |
+---+------------+
Whereas this query:
SELECT DISTINCT
v,
DENSE_RANK() OVER (ORDER BY v) row_number
FROM t
ORDER BY v, row_number
... produces what you probably want in this case:
+---+------------+
| V | ROW_NUMBER |
+---+------------+
| a | 1 |
| b | 2 |
| c | 3 |
| d | 4 |
| e | 5 |
+---+------------+
Note that the ORDER BY clause of the DENSE_RANK() function will need all other columns from the SELECT DISTINCT clause to work properly.
All three functions in comparison
Using PostgreSQL / Sybase / SQL standard syntax (WINDOW clause):
SELECT
v,
ROW_NUMBER() OVER (window) row_number,
RANK() OVER (window) rank,
DENSE_RANK() OVER (window) dense_rank
FROM t
WINDOW window AS (ORDER BY v)
ORDER BY v
... you'll get:
+---+------------+------+------------+
| V | ROW_NUMBER | RANK | DENSE_RANK |
+---+------------+------+------------+
| a | 1 | 1 | 1 |
| a | 2 | 1 | 1 |
| a | 3 | 1 | 1 |
| b | 4 | 4 | 2 |
| c | 5 | 5 | 3 |
| c | 6 | 5 | 3 |
| d | 7 | 7 | 4 |
| e | 8 | 8 | 5 |
+---+------------+------+------------+
Using DISTINCT causes issues as you add fields and it can also mask problems in your select. Use GROUP BY as an alternative like this:
SELECT id
,ROW_NUMBER() OVER (ORDER BY id) AS RowNum
FROM table
where fid = 64
group by id
Then you can add other interesting information from your select like this:
,count(*) as thecount
or
,max(description) as description
How about something like
;WITH DistinctVals AS (
SELECT distinct id
FROM table
where fid = 64
)
SELECT id,
ROW_NUMBER() OVER (ORDER BY id) AS RowNum
FROM DistinctVals
SQL Fiddle DEMO
You could also try
SELECT distinct id, DENSE_RANK() OVER (ORDER BY id) AS RowNum
FROM #mytable
where fid = 64
SQL Fiddle DEMO
Try this:
;WITH CTE AS (
SELECT DISTINCT id FROM table WHERE fid = 64
)
SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS RowNum
FROM cte
WHERE fid = 64
Try this
SELECT distinct id
FROM (SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS RowNum
FROM table
WHERE fid = 64) t
Or use RANK() instead of row number and select records DISTINCT rank
SELECT id
FROM (SELECT id, ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS RowNum
FROM table
WHERE fid = 64) t
WHERE t.RowNum=1
This also returns the distinct ids
Question is too old and my answer might not add much but here are my two cents for making query a little useful:
;WITH DistinctRecords AS (
SELECT DISTINCT [col1,col2,col3,..]
FROM tableName
where [my condition]
),
serialize AS (
SELECT
ROW_NUMBER() OVER (PARTITION BY [colNameAsNeeded] ORDER BY [colNameNeeded]) AS Sr,*
FROM DistinctRecords
)
SELECT * FROM serialize
Usefulness of using two cte's lies in the fact that now you can use serialized record much easily in your query and do count(*) etc very easily.
DistinctRecords will select all distinct records and serialize apply serial numbers to distinct records. after wards you can use final serialized result for your purposes without clutter.
Partition By might not be needed in most cases