I have the following data in my table:
SELECT category, value FROM test
| category | value |
+----------+-------+
| 1 | 1 |
| 1 | 3 |
| 1 | 4 |
| 1 | 8 |
Right now I am using two separate queries.
To get average:
SELECT category, avg(value) as Average
FROM test
GROUP BY category
| category | Average |
+----------+-------+
| 1 | 4 |
To get median:
SELECT DISTINCT category,
PERCENTILE_CONT(0.5)
WITHIN GROUP (ORDER BY value)
OVER (partition BY category) AS Median
FROM test
| category | Median |
+----------+-------+
| 1 | 3.5 |
Is there any way to merge them in one query?
Note: I know that I can also get median with two subqueries, but I prefer to use PERCENTILE_CONT function to get it.
AVG is also a windowed function:
select
distinct
category,
avg(value) over (partition by category) as average,
PERCENTILE_CONT(0.5)
WITHIN GROUP (ORDER BY value)
OVER (partition BY category) AS Median
from test
I would approach this in a slightly different way:
select category, avg(value) as avg,
avg(case when 2 * seqnum in (cnt, cnt + 1, cnt + 2) then value end) as median
from (select t.*, row_number() over (partition by category order by value) as seqnum,
count(*) over (partition by category) as cnt
from test t
) t
group by category;
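To try the query above without a SQL Server instance, here is a minimal sketch that runs the same logic through Python's built-in sqlite3 module. It assumes an SQLite build with window functions (3.25 or later); the function name avg_and_median is made up for illustration. Note that SQLite's AVG always returns a float, whereas SQL Server averages integers with integer arithmetic.

```python
import sqlite3

def avg_and_median(rows):
    """Return (category, average, median) per category from one grouped query."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE test (category INTEGER, value REAL)")
    con.executemany("INSERT INTO test VALUES (?, ?)", rows)
    query = """
        SELECT category,
               AVG(value) AS average,
               -- keep only the middle row (odd count) or the two middle rows (even count)
               AVG(CASE WHEN 2 * seqnum IN (cnt, cnt + 1, cnt + 2) THEN value END) AS median
        FROM (SELECT t.*,
                     ROW_NUMBER() OVER (PARTITION BY category ORDER BY value) AS seqnum,
                     COUNT(*) OVER (PARTITION BY category) AS cnt
              FROM test t) t
        GROUP BY category
        ORDER BY category
    """
    result = con.execute(query).fetchall()
    con.close()
    return result

if __name__ == "__main__":
    print(avg_and_median([(1, 1), (1, 3), (1, 4), (1, 8)]))
```

With the sample data from the question, the two middle rows (3 and 4) are the only ones that survive the CASE, so their average is the median.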
I wanted a more thorough answer to this question, and after some digging found it in this exhaustive analysis of multiple methods from Dwain Camps, in case others find it useful:
Calculating the Median Value within a Partitioned Set Using T-SQL
I went with "his" 4th solution (he is combining/testing others' approaches), which was straightforward to understand and indeed performs well:
WITH Counts AS
(
SELECT ID, c=COUNT(*)
FROM #MedianValues
GROUP BY ID
)
SELECT a.ID, Median=AVG(0.+N)
FROM Counts a
CROSS APPLY
(
SELECT TOP(((a.c - 1) / 2) + (1 + (1 - a.c % 2)))
N, r=ROW_NUMBER() OVER (ORDER BY N)
FROM #MedianValues b
WHERE a.ID = b.ID
ORDER BY N
) p
WHERE r BETWEEN ((a.c - 1) / 2) + 1 AND (((a.c - 1) / 2) + (1 + (1 - a.c % 2)))
GROUP BY a.ID;
If I have data like so:
+----+-------+
| id | value |
+----+-------+
| 1 | 10 |
| 1 | 10 |
| 2 | 20 |
| 3 | 30 |
| 2 | 20 |
+----+-------+
How do I calculate the average based on the distinct id WITHOUT using a subquery (i.e. querying the table directly)?
For the above example it would be (10+20+30)/3 = 20
I tried to do the following:
SELECT AVG(IF(id = LAG(id) OVER (ORDER BY id), NULL, value)) AS avg
FROM table
Basically I was thinking that if I order by id and check the previous row to see if it has the same id, the value should be NULL and thus it would not be counted into the calculation, but unfortunately I can't put analytical functions inside aggregate functions.
As far as I know, you can't do this without a subquery. I would use:
SELECT AVG(avg_value)
FROM
(
SELECT AVG(value) AS avg_value
FROM yourTable
GROUP BY id
) t;
WITH ranked AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS rn
FROM
yourTable
QUALIFY rn = 1
)
SELECT
AVG(value)
FROM ranked
The outer query will have other parameters that need to access all the data in the table
I interpret this comment as wanting an average on every row -- rather than doing an aggregation. If so, you can use window functions:
select t.*,
avg(case when seqnum = 1 then value end) over () as overall_avg
from (select t.*,
row_number() over (partition by id order by id) as seqnum
from t
) t;
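As a quick check of the window-function approach above, the following sketch runs it against the sample data using Python's sqlite3 module. It assumes SQLite 3.25 or later; the function name and the derived-table alias s are illustrative only.

```python
import sqlite3

def rows_with_distinct_id_avg(rows):
    """Return every row plus the average taken over one row per id."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (id INTEGER, value INTEGER)")
    con.executemany("INSERT INTO t VALUES (?, ?)", rows)
    query = """
        SELECT id, value,
               -- only the first row of each id contributes to the average
               AVG(CASE WHEN seqnum = 1 THEN value END) OVER () AS overall_avg
        FROM (SELECT t.*,
                     ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS seqnum
              FROM t) AS s
        ORDER BY id
    """
    result = con.execute(query).fetchall()
    con.close()
    return result

if __name__ == "__main__":
    sample = [(1, 10), (1, 10), (2, 20), (3, 30), (2, 20)]
    for row in rows_with_distinct_id_avg(sample):
        print(row)
```

All five input rows are kept, and each carries the same overall_avg computed from the three distinct ids.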
Yes there is a way,
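Simply use DISTINCT inside the AVG function as below. Note that this removes duplicate values rather than duplicate ids, so it matches the expected result only when different ids never share a value: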
select avg(distinct value) from tab;
http://sqlfiddle.com/#!4/9d156/2/0
I have a table similar to below:
+------------+-----------+
| CustomerID | OrderYear |
+------------+-----------+
| 1 | 2012 |
| 1 | 2013 |
| 1 | 2014 |
| 1 | 2017 |
| 1 | 2018 |
| 2 | 2012 |
| 2 | 2013 |
| 2 | 2014 |
| 2 | 2015 |
| 2 | 2017 |
+------------+-----------+
How would I identify which CustomerIDs have 4 consecutive years of giving? (In the above, only customer 2.) As you can see, some records will have gaps in order years.
I started down the road of trying to use some combination of ROW_NUMBER/LAG/LEAD, with no luck so far.
A very pared-down/modified attempt...
WITH CTE
AS
(
SELECT T.ConstituentLookupID,
T.FISCALYEAR,
COUNT(T.FISCALYEAR) OVER (PARTITION BY T.ConstituentLookupID) AS
YearCount,
FIRST_VALUE(T.FISCALYEAR) OVER(PARTITION BY T.ConstituentLookupID ORDER
BY T.FISCALYEAR DESC) - T.FISCALYEAR + 1 as X,
ROW_NUMBER() OVER(PARTITION BY T.ConstituentLookupID ORDER BY
T.FISCALYEAR DESC) AS RN
FROM #Temp AS T)
SELECT CTE.ConstituentLookupID,
CTE.FISCALYEAR,
CTE.YearCount,
CTE.X,
CTE.RN
FROM CTE
WHERE CTE.YearCount >= 4 --Have at least 4 years of giving
AND CTE.X - CTE.RN = 1 --Some kind of way to calculate consecutive years. Doesn't account for the current year or gaps...
Assuming no duplicates, you can use lag():
select distinct customerid
from (
select t.*,
lag(orderyear, 3) over(partition by customerid order by orderyear) orderyear3
from mytable t
) t
where orderyear = orderyear3 + 3
A more conventional approach is to use some gaps-and-islands technique. This is convenient if you want the start and end of each series. Here, an island is a series of rows with "adjacent" order years, and you want islands that are at least 4 years long. We can identify the islands by comparing the order year against an incrementing sequence, then use aggregation:
select customerid, min(orderyear) firstorderyear, max(orderyear) lastorderyear
from (
select t.*,
row_number() over(partition by customerid order by orderyear) rn
from mytable t
) t
group by customerid, orderyear - rn
having count(*) >= 4
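The gaps-and-islands query can be checked against the question's sample data with a short sketch using Python's sqlite3 module. This assumes SQLite 3.25 or later; the function name consecutive_runs and its min_years parameter are illustrative only.

```python
import sqlite3

def consecutive_runs(rows, min_years=4):
    """Return (customerid, first_year, last_year) for each run of consecutive years."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE mytable (customerid INTEGER, orderyear INTEGER)")
    con.executemany("INSERT INTO mytable VALUES (?, ?)", rows)
    query = """
        SELECT customerid, MIN(orderyear), MAX(orderyear)
        FROM (SELECT t.*,
                     ROW_NUMBER() OVER (PARTITION BY customerid ORDER BY orderyear) AS rn
              FROM mytable t) t
        -- orderyear - rn is constant within a run of adjacent years
        GROUP BY customerid, orderyear - rn
        HAVING COUNT(*) >= ?
        ORDER BY customerid
    """
    result = con.execute(query, (min_years,)).fetchall()
    con.close()
    return result

if __name__ == "__main__":
    sample = [(1, y) for y in (2012, 2013, 2014, 2017, 2018)]
    sample += [(2, y) for y in (2012, 2013, 2014, 2015, 2017)]
    print(consecutive_runs(sample))
```

For customer 1 the difference orderyear - rn is 2011 for three rows and 2013 for two, so no island reaches four rows; for customer 2 the years 2012 to 2015 share the same difference.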
Assuming you have no more than one row per customer and year, the simplest method is lag():
select customerid, orderyear
from (select t.*,
lag(orderyear, 3) over (partition by customerid order by orderyear) as prev3_year
from t
) t
where prev3_year = orderyear - 3;
The idea is to look 3 rows back. If that row's year is orderyear - 3, then there are four years in a row. If your data can have duplicates, there are tweaks to the logic (they make the query slightly more complicated).
This could return duplicates, so you might just want:
select distinct customerid
from (select t.*,
lag(orderyear, 3) over (partition by customerid order by orderyear) as prev3_year
from t
) t
where prev3_year = orderyear - 3;
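Here is a small sketch of the lag() idea run through Python's sqlite3 module, assuming SQLite 3.25 or later and at most one row per customer and year. The table name mytable and the function name are placeholders, and the filter compares against the orderyear column from the question.

```python
import sqlite3

def customers_with_four_in_a_row(rows):
    """Return customer ids that have four consecutive order years."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE mytable (customerid INTEGER, orderyear INTEGER)")
    con.executemany("INSERT INTO mytable VALUES (?, ?)", rows)
    query = """
        SELECT DISTINCT customerid
        FROM (SELECT t.*,
                     -- the order year found three rows earlier for the same customer
                     LAG(orderyear, 3) OVER (PARTITION BY customerid
                                             ORDER BY orderyear) AS prev3_year
              FROM mytable t) t
        WHERE prev3_year = orderyear - 3
        ORDER BY customerid
    """
    result = [row[0] for row in con.execute(query)]
    con.close()
    return result

if __name__ == "__main__":
    sample = [(1, y) for y in (2012, 2013, 2014, 2017, 2018)]
    sample += [(2, y) for y in (2012, 2013, 2014, 2015, 2017)]
    print(customers_with_four_in_a_row(sample))
```

Only customer 2 has a row (2015) whose third predecessor (2012) is exactly three years earlier.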
I have a simple solution using row number and group by
SELECT z.customerid,
Count(*)
FROM (SELECT customerid,
orderyear,
orderyear - Row_number()
OVER (
PARTITION BY customerid
ORDER BY orderyear) AS Grp
FROM mytable)z
GROUP BY z.customerid, z.grp
HAVING Count(*) >= 4
I have a table with two columns: id and score. I'd like to create a third column that equals the quantile that an individual's score falls in. I'd like to do this in BigQuery's standardSQL.
Here's my_table:
+----+--------+
| id | score |
+----+--------+
| 1 | 2 |
| 2 | 13 |
| 3 | -2 |
| 4 | 7 |
+----+--------+
and afterwards I'd like to have the following table (example shown with quartiles, but I'd be interested in quartiles/quintiles/deciles)
+----+--------+----------+
| id | score  | quartile |
+----+--------+----------+
| 1 | 2 | 2 |
| 2 | 13 | 4 |
| 3 | -2 | 1 |
| 4 | 7 | 3 |
+----+--------+----------+
It would be excellent if this were to work on 100 million rows. I've looked around to see a couple solutions that seem to use legacy sql, and the solutions using RANK() functions don't seem to work for really large datasets. Thanks!
If I understand correctly, you can use ntile(). For instance, if you wanted a value from 1-4, you can do:
select t.*, ntile(4) over (order by score) as tile
from t;
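For a quick look at what ntile() returns on the question's four rows, here is a sketch using Python's sqlite3 module as a stand-in for BigQuery. It assumes SQLite 3.25 or later; the table name my_table follows the question and the function name is made up. It says nothing about how the query scales to 100 million rows.

```python
import sqlite3

def score_tiles(rows, buckets=4):
    """Return (id, score, tile) with tiles numbered 1..buckets by ascending score."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE my_table (id INTEGER, score INTEGER)")
    con.executemany("INSERT INTO my_table VALUES (?, ?)", rows)
    # NTILE needs a literal positive integer, so it is formatted in after validation
    query = f"""
        SELECT id, score, NTILE({int(buckets)}) OVER (ORDER BY score) AS tile
        FROM my_table
        ORDER BY id
    """
    result = con.execute(query).fetchall()
    con.close()
    return result

if __name__ == "__main__":
    print(score_tiles([(1, 2), (2, 13), (3, -2), (4, 7)]))
```

Changing buckets to 5 or 10 gives quintiles or deciles in the same way.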
If you want to enumerate the values, then use rank() or dense_rank():
select t.*, rank() over (order by score) as tile
from t;
I see, your problem is getting the code to work, because BigQuery tends to run out of resources without a partition by. One method is to break up the score into different groups. I think this logic does what you want:
select *,
( (count(*) over (partition by cast(score / 1000 as int64) order by cast(score / 1000 as int64)) -
count(*) over (partition by cast(score / 1000 as int64))
) +
rank() over (partition by cast(score / 1000 as int64) order by score)
) as therank,
-- rank() over (order by score) as therank
from t;
This breaks the scores into groups by dividing by 1000 (perhaps that divisor is too large for integer scores) and then reconstructs the ranking.
If your score has relatively low cardinality, then join with aggregation works:
select t.*, (running_cnt - cnt + 1) as therank
from t join
(select score, count(*) as cnt, sum(count(*)) over (order by score) as running_cnt
from t
group by score
) s
on t.score = s.score;
Once you have the rank() (or row_number()) you can easily calculate the tiles yourself (hint: division).
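The "division" hint can be sketched as follows, building on the join-with-aggregation rank. This is only one possible convention, tile = (rank - 1) * n / total + 1 with integer division; it keeps tied scores in the same tile, so tile sizes can differ from what ntile() produces. The sketch uses Python's sqlite3 module (SQLite 3.25 or later assumed) rather than BigQuery, and the function name is illustrative.

```python
import sqlite3

def rank_and_tile(rows, buckets=4):
    """Return (id, score, rank, tile) using a per-score aggregate instead of RANK()."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (id INTEGER, score INTEGER)")
    con.executemany("INSERT INTO t VALUES (?, ?)", rows)
    query = """
        SELECT t.id, t.score,
               s.running_cnt - s.cnt + 1 AS therank,
               -- integer division maps ranks 1..total onto tiles 1..buckets
               (s.running_cnt - s.cnt) * ? / (SELECT COUNT(*) FROM t) + 1 AS tile
        FROM t
        JOIN (SELECT score,
                     COUNT(*) AS cnt,
                     SUM(COUNT(*)) OVER (ORDER BY score) AS running_cnt
              FROM t
              GROUP BY score) s
          ON t.score = s.score
        ORDER BY t.id
    """
    result = con.execute(query, (buckets,)).fetchall()
    con.close()
    return result

if __name__ == "__main__":
    print(rank_and_tile([(1, 2), (2, 13), (3, -2), (4, 7)]))
```

The window function here runs over one row per distinct score rather than over every row, which is why this form is suggested for low-cardinality scores.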
Your expected output suggests rank():
SELECT *, RANK() OVER (ORDER BY score) as quantile
FROM table t
ORDER BY id;
I have a table of non-unique float numbers and I want to order them in a special way: the max element in 1st place, the min element in 2nd place, the second-largest element in 3rd place, and so on. For example,
1,2,3,4,5,6,7,8,9
I would like to order as
1,9,2,8,3,7,4,6,5
UPD:
Combination of ordering by ascending and descending over row_number() can be a solution, e.g.
select c, a, d, abs(a - d)
from (select c,
row_number() over (order by c) as a,
row_number() over (order by c desc) as d
from t)
order by abs(a - d)
But keep in mind that non-unique numbers can cause problems; the solution above will NOT work for the example below:
c | a | d | abs(a - d)
4 | 1 | 4 | 3
4 | 2 | 5 | 3
5 | 3 | 1 | 2
5 | 4 | 2 | 2
5 | 5 | 3 | 2
This means that the ORDER BY expression inside the OVER clause should NOT leave room for several possible orderings, i.e. ties need to be broken deterministically.
ANSI SQL supports row_number(). You can do this by using row_number() in a clever way:
select t.*
from (select t.*,
row_number() over (order by col) as seqnum_asc,
row_number() over (order by col desc) as seqnum_desc
from table t
) t
order by (case when seqnum_asc <= seqnum_desc then seqnum_asc else seqnum_desc end),
col desc;
The case is really least(seqnum_asc, seqnum_desc), but not all databases support that construct.
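As a sanity check of the two-row_number() ordering, here is a small sketch run through Python's sqlite3 module (SQLite 3.25 or later assumed). The table name nums and the function name are placeholders, since table itself is a reserved word.

```python
import sqlite3

def alternate_extremes(values):
    """Return values ordered max, min, second max, second min, and so on."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE nums (col REAL)")
    con.executemany("INSERT INTO nums VALUES (?)", [(v,) for v in values])
    query = """
        SELECT col
        FROM (SELECT col,
                     ROW_NUMBER() OVER (ORDER BY col)      AS seqnum_asc,
                     ROW_NUMBER() OVER (ORDER BY col DESC) AS seqnum_desc
              FROM nums) AS s
        -- distance from the nearest end of the sorted list, larger value first
        ORDER BY CASE WHEN seqnum_asc <= seqnum_desc
                      THEN seqnum_asc ELSE seqnum_desc END,
                 col DESC
    """
    ordered = [row[0] for row in con.execute(query)]
    con.close()
    return ordered

if __name__ == "__main__":
    print(alternate_extremes(range(1, 10)))
```

With duplicate values the two row numbers are no longer guaranteed to mirror each other, so the result for tied inputs may vary between runs or engines.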
I am fighting with the distinct keyword in sql.
I just want to display all row numbers of unique (distinct) values in a column & so I tried:
SELECT DISTINCT id, ROW_NUMBER() OVER (ORDER BY id) AS RowNum
FROM table
WHERE fid = 64
However, the code below does give me the distinct values:
SELECT distinct id FROM table WHERE fid = 64
but when I try it with ROW_NUMBER, it does not work.
This can be done very simply; you were pretty close already:
SELECT distinct id, DENSE_RANK() OVER (ORDER BY id) AS RowNum
FROM table
WHERE fid = 64
Use this:
SELECT *, ROW_NUMBER() OVER (ORDER BY id) AS RowNum FROM
(SELECT DISTINCT id FROM table WHERE fid = 64) Base
and put the "output" of a query as the "input" of another.
Using CTE:
; WITH Base AS (
SELECT DISTINCT id FROM table WHERE fid = 64
)
SELECT *, ROW_NUMBER() OVER (ORDER BY id) AS RowNum FROM Base
The two queries should be equivalent.
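A minimal sketch to confirm that the derived-table form and the CTE form return the same numbering, using Python's sqlite3 module (SQLite 3.25 or later assumed). The table name items is a placeholder, because table is a reserved word, and the function name is made up.

```python
import sqlite3

DERIVED_TABLE_QUERY = """
    SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS RowNum
    FROM (SELECT DISTINCT id FROM items WHERE fid = ?) Base
    ORDER BY id
"""

CTE_QUERY = """
    WITH Base AS (
        SELECT DISTINCT id FROM items WHERE fid = ?
    )
    SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS RowNum
    FROM Base
    ORDER BY id
"""

def numbered_distinct_ids(rows, fid, query):
    """Deduplicate ids first, then number them, using the given query text."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE items (id INTEGER, fid INTEGER)")
    con.executemany("INSERT INTO items VALUES (?, ?)", rows)
    result = con.execute(query, (fid,)).fetchall()
    con.close()
    return result

if __name__ == "__main__":
    sample = [(3, 64), (1, 64), (3, 64), (2, 10), (1, 64)]
    print(numbered_distinct_ids(sample, 64, DERIVED_TABLE_QUERY))
    print(numbered_distinct_ids(sample, 64, CTE_QUERY))
```

In both forms the DISTINCT runs before ROW_NUMBER() is evaluated, which is what the single-level query in the question could not do.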
Technically you could
SELECT DISTINCT id, ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS RowNum
FROM table
WHERE fid = 64
but if you increase the number of DISTINCT fields, you have to put all these fields in the PARTITION BY, so for example
SELECT DISTINCT id, description,
ROW_NUMBER() OVER (PARTITION BY id, description ORDER BY id) AS RowNum
FROM table
WHERE fid = 64
I also hope you realize that you are going against standard naming conventions here: id should probably be a primary key, and therefore unique by definition, so a DISTINCT on it would be useless unless you combined the query with some JOINs or UNION ALL...
This article covers an interesting relationship between ROW_NUMBER() and DENSE_RANK() (the RANK() function is not treated specifically). When you need a generated ROW_NUMBER() on a SELECT DISTINCT statement, the ROW_NUMBER() will produce distinct values before they are removed by the DISTINCT keyword. E.g. this query
SELECT DISTINCT
v,
ROW_NUMBER() OVER (ORDER BY v) row_number
FROM t
ORDER BY v, row_number
... might produce this result (DISTINCT has no effect):
+---+------------+
| V | ROW_NUMBER |
+---+------------+
| a | 1 |
| a | 2 |
| a | 3 |
| b | 4 |
| c | 5 |
| c | 6 |
| d | 7 |
| e | 8 |
+---+------------+
Whereas this query:
SELECT DISTINCT
v,
DENSE_RANK() OVER (ORDER BY v) row_number
FROM t
ORDER BY v, row_number
... produces what you probably want in this case:
+---+------------+
| V | ROW_NUMBER |
+---+------------+
| a | 1 |
| b | 2 |
| c | 3 |
| d | 4 |
| e | 5 |
+---+------------+
Note that the ORDER BY clause of the DENSE_RANK() function will need all other columns from the SELECT DISTINCT clause to work properly.
All three functions in comparison
Using PostgreSQL / Sybase / SQL standard syntax (WINDOW clause):
SELECT
v,
ROW_NUMBER() OVER (w) row_number,
RANK() OVER (w) rank,
DENSE_RANK() OVER (w) dense_rank
FROM t
WINDOW w AS (ORDER BY v)
ORDER BY v
... you'll get:
+---+------------+------+------------+
| V | ROW_NUMBER | RANK | DENSE_RANK |
+---+------------+------+------------+
| a | 1 | 1 | 1 |
| a | 2 | 1 | 1 |
| a | 3 | 1 | 1 |
| b | 4 | 4 | 2 |
| c | 5 | 5 | 3 |
| c | 6 | 5 | 3 |
| d | 7 | 7 | 4 |
| e | 8 | 8 | 5 |
+---+------------+------+------------+
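The comparison above can be reproduced with a short sketch using Python's sqlite3 module, which also accepts the WINDOW clause (SQLite 3.25 or later assumed). The column aliases rn, rnk and drnk and the function names are chosen here for illustration.

```python
import sqlite3

def _load(values):
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (v TEXT)")
    con.executemany("INSERT INTO t VALUES (?)", [(v,) for v in values])
    return con

def compare_ranking_functions(values):
    """Return (v, row_number, rank, dense_rank) for every input row."""
    con = _load(values)
    query = """
        SELECT v,
               ROW_NUMBER() OVER w AS rn,
               RANK()       OVER w AS rnk,
               DENSE_RANK() OVER w AS drnk
        FROM t
        WINDOW w AS (ORDER BY v)
        ORDER BY v, rn
    """
    result = con.execute(query).fetchall()
    con.close()
    return result

def distinct_with_dense_rank(values):
    """DISTINCT collapses rows because DENSE_RANK is equal for equal v."""
    con = _load(values)
    query = """
        SELECT DISTINCT v, DENSE_RANK() OVER (ORDER BY v) AS drnk
        FROM t
        ORDER BY v
    """
    result = con.execute(query).fetchall()
    con.close()
    return result

if __name__ == "__main__":
    data = list("aaabccde")
    for row in compare_ranking_functions(data):
        print(row)
    print(distinct_with_dense_rank(data))
```

Because ROW_NUMBER() differs on every row, DISTINCT has nothing to remove, while DENSE_RANK() repeats for equal values and lets DISTINCT collapse them.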
Using DISTINCT causes issues as you add fields and it can also mask problems in your select. Use GROUP BY as an alternative like this:
SELECT id
,ROW_NUMBER() OVER (ORDER BY id) AS RowNum
FROM table
where fid = 64
group by id
Then you can add other interesting information from your select like this:
,count(*) as thecount
or
,max(description) as description
How about something like
;WITH DistinctVals AS (
SELECT distinct id
FROM table
where fid = 64
)
SELECT id,
ROW_NUMBER() OVER (ORDER BY id) AS RowNum
FROM DistinctVals
You could also try
SELECT distinct id, DENSE_RANK() OVER (ORDER BY id) AS RowNum
FROM #mytable
where fid = 64
Try this:
;WITH CTE AS (
SELECT DISTINCT id FROM table WHERE fid = 64
)
SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS RowNum
FROM cte
Try this
SELECT distinct id
FROM (SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS RowNum
FROM table
WHERE fid = 64) t
Or number the rows within each id using ROW_NUMBER() with PARTITION BY and keep only the first row per id:
SELECT id
FROM (SELECT id, ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS RowNum
FROM table
WHERE fid = 64) t
WHERE t.RowNum=1
This also returns the distinct ids
This question is old and my answer might not add much, but here are my two cents for making the query a little more useful:
;WITH DistinctRecords AS (
SELECT DISTINCT [col1,col2,col3,..]
FROM tableName
where [my condition]
),
serialize AS (
SELECT
ROW_NUMBER() OVER (PARTITION BY [colNameAsNeeded] ORDER BY [colNameNeeded]) AS Sr,*
FROM DistinctRecords
)
SELECT * FROM serialize
The benefit of using two CTEs is that you can then use the serialized records much more easily in the rest of your query, for example to do count(*).
DistinctRecords selects all distinct records, and serialize applies serial numbers to them. Afterwards you can use the final serialized result for your purposes without clutter.
PARTITION BY might not be needed in most cases.