T-SQL: Calculating the Nth Percentile Value from a Column

I have a column of data, some of which are NULL values, from which I wish to extract the single 90th percentile value:
ColA
-----
NULL
100
200
300
NULL
400
500
600
700
800
900
1000
For the above, I am looking for a technique which returns the value 900 when searching for the 90th percentile, 800 for the 80th percentile, etc. An analogous function would be AVG(ColA) which returns 550 for the above data, or MIN(ColA) which returns 100, etc.
Any suggestions?

If you want to get exactly the 90th percentile value, excluding NULLs, I would suggest doing the calculation directly. The following version calculates the row number and number of rows, and selects the appropriate value:
select max(case when rownum*1.0/numrows <= 0.9 then colA end) as percentile_90th
from (select colA,
             row_number() over (order by colA) as rownum,
             count(*) over (partition by NULL) as numrows
      from t
      where colA is not null
     ) t
I put the condition in the SELECT clause rather than the WHERE clause, so you can easily get the 50th percentile, 17th, or whatever values you want.
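Since the percentile condition sits in the SELECT clause, several percentiles can be read off in a single pass. A sketch reusing the same derived table (table and column names as above):
select max(case when rownum*1.0/numrows <= 0.5 then colA end) as percentile_50th,
       max(case when rownum*1.0/numrows <= 0.8 then colA end) as percentile_80th,
       max(case when rownum*1.0/numrows <= 0.9 then colA end) as percentile_90th
from (select colA,
             row_number() over (order by colA) as rownum,
             count(*) over (partition by NULL) as numrows
      from t
      where colA is not null
     ) t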

WITH percentiles AS
(
    SELECT
        NTILE(100) OVER (ORDER BY ColA) AS percentile,
        *
    FROM
        data
)
SELECT
    *
FROM
    percentiles
WHERE
    percentile = 90
Note: If the data has fewer than 100 observations, not all percentiles will have a value. Equally, if you have more than 100 observations, some percentiles will contain more than one value.
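As a quick illustration of that caveat, here is a sketch that runs the CTE above against the 12-row sample from the question: NTILE(100) only assigns buckets 1 through 12, so filtering on percentile = 90 returns nothing.
WITH data(ColA) AS (
    SELECT v.ColA
    FROM (VALUES (NULL),(100),(200),(300),
                 (NULL),(400),(500),(600),(700),
                 (800),(900),(1000)) v(ColA)
),
percentiles AS (
    SELECT NTILE(100) OVER (ORDER BY ColA) AS percentile, ColA
    FROM data
)
SELECT *
FROM percentiles
WHERE percentile = 90;  -- no rows: with 12 observations, NTILE(100) assigns only buckets 1-12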

Starting with SQL Server 2012, there are now the PERCENTILE_DISC and PERCENTILE_CONT inverse distribution functions. These are (so far) only available as window functions, not as aggregate functions, so you have to remove the redundant results caused by the lack of grouping, e.g. by using DISTINCT or TOP 1:
WITH t AS (
    SELECT *
    FROM (
        VALUES(NULL),(100),(200),(300),
              (NULL),(400),(500),(600),(700),
              (800),(900),(1000)
    ) t(ColA)
)
SELECT DISTINCT percentile_disc(0.9) WITHIN GROUP (ORDER BY ColA) OVER()
FROM t
;
I have blogged about percentiles in more detail here.
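For completeness, here is the same example using TOP 1 instead of DISTINCT, with PERCENTILE_CONT alongside for comparison. A sketch; note that PERCENTILE_CONT interpolates, so it may return a value that is not present in the column:
WITH t AS (
    SELECT *
    FROM (
        VALUES(NULL),(100),(200),(300),
              (NULL),(400),(500),(600),(700),
              (800),(900),(1000)
    ) t(ColA)
)
SELECT TOP 1
    PERCENTILE_DISC(0.9) WITHIN GROUP (ORDER BY ColA) OVER() AS disc_90th,
    PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY ColA) OVER() AS cont_90th
FROM t;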

Related

In Spark SQL how do I take 98% of the lowest values

I am using Spark SQL and I have some outliers that have incredibly high transaction counts in comparison to the rest. I only want the lowest 98% of the values and to cut off the top 2% outliers. How do I go about doing that? The TOP function is not being recognized in Spark SQL. This is a sample of the table but it is a very large table.
Date         ID       Name      Transactions
-----------  -------  --------  ------------
02/02/2022   ABC123   Bob       107
01/05/2022   ACD232   Emma      34
12/03/2022   HH254    Kirsten   23
12/11/2022   HH254    Kirsten   47
You need a couple of window functions to compute the relative rank; the row_number() will give absolute rank, but you won't know where to draw the cutoff line without a full record count to compute the percentile.
In an inner query,
Select t.*,
       row_number() Over (Order By Transactions, Date desc) * 100
           / count(*) Over (Rows Between Unbounded Preceding And Unbounded Following) as percentile
From myTable t
Then in an outer query just
Select * from (*inner query*)
Where percentile <= 98
The frame clause on the Count(*) can be shortened to Count(*) Over (): with no ORDER BY in the window, the default frame is already the whole partition.
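Put together, the whole thing might look like the following sketch (myTable and the column names are assumptions carried over from the sample above):
SELECT Date, ID, Name, Transactions
FROM (
    SELECT t.*,
           row_number() OVER (ORDER BY Transactions, Date DESC) * 100
               / count(*) OVER () AS percentile
    FROM myTable t
) ranked
WHERE percentile <= 98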
You can calculate the 98th percentile value for the Transactions column and then filter the rows where the value of Transactions is below the 98th percentile. You can use the following query to accomplish that:
WITH base_data AS (
    SELECT Date, ID, Name, Transactions
    FROM your_table
),
percentiles AS (
    SELECT percentile_approx(Transactions, 0.98) AS p
    FROM base_data
)
SELECT Date, ID, Name, Transactions
FROM base_data
JOIN percentiles
    ON Transactions <= p
The percentile_approx function is applied to the base_data CTE to obtain the 98th percentile value.
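Because percentile_approx(Transactions, 0.98) yields a single scalar, an uncorrelated scalar subquery is another option that avoids the join. A sketch against the same assumed table name:
SELECT Date, ID, Name, Transactions
FROM your_table
WHERE Transactions <= (
    SELECT percentile_approx(Transactions, 0.98)
    FROM your_table
)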

I get the same percentile value for all rows

For each borough, what is the 90th percentile number of people injured per intersection?
I have a borough column and several columns with numbers of injured people that I aggregate into one column. I need to find the 90th percentile of people injured. The query gives me just one value; I need to get a different value for each borough (or am I wrong?)
select distinct borough, count(num_of_injured) as count_all,
       PERCENTILE_CONT(num_of_injured, 0.9 RESPECT NULLS) OVER() AS percentile90
from `bigquery-public-data.new_york.nypd_mv_collisions` c
cross join unnest(array[number_of_persons_injured, number_of_pedestrians_injured,
                        number_of_motorist_killed, number_of_cyclist_injured]) num_of_injured
where borough != ''
group by borough, num_of_injured
order by count_all desc
limit 10;
Thank you for the help!
If you look at the counts by borough for each num_of_injured value, then over 90% of the rows are 0:
select borough, num_of_injured, count(*),
       count(*) / sum(count(*)) over (partition by borough)
from `bigquery-public-data.new_york.nypd_mv_collisions` c
cross join unnest(array[number_of_persons_injured, number_of_pedestrians_injured,
                        number_of_motorist_killed, number_of_cyclist_injured]) num_of_injured
group by 1, 2
order by 1, 2;
Hence, the 90th percentile is 0.
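If the goal is one value per borough rather than one global value, the same PERCENTILE_CONT can be partitioned by borough. A sketch reusing the question's query structure (it will still return 0 here, for the reason above):
select distinct borough,
       PERCENTILE_CONT(num_of_injured, 0.9) OVER (PARTITION BY borough) AS percentile90
from `bigquery-public-data.new_york.nypd_mv_collisions` c
cross join unnest(array[number_of_persons_injured, number_of_pedestrians_injured,
                        number_of_motorist_killed, number_of_cyclist_injured]) num_of_injured
where borough != ''
order by borough;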

SQLite: averaging a lot of rows into X averaged values between two desired rows

I have rows in SQLite, let's say 10,000, with the last column being a unix timestamp. I would like to average every value from today 00:00 to 23:59 into X averaged groups. If there are 1,000 records today and X is 10, then each group of 100 values is averaged and the result is 10 averaged values; if X is 20, the result is 50 averaged values. Those values come from sensors, like temperature, and I would like to be able to track what the temperature was today between X and Y hours, and so on, for each day.
What would be the most efficient way to do this? I'm using SQLite3 with C++. I could do it in C++ with more queries, but I would prefer to leave this to SQLite and just fetch the result, if that's possible. Visualization: https://i.ibb.co/grSTgrZ/sqlite.png
Any help on where to start with this is appreciated.
Thanks.
You can use the NTILE() window function to create the groups over which you will aggregate:
SELECT AVG(value) avg_value
FROM (
SELECT *, NTILE(3) OVER (ORDER BY id) grp
FROM tablename
)
GROUP BY grp
The number inside the parentheses of NTILE() corresponds to the number X in your requirement, i.e. the number of groups.
Id is the column by which the table should be ordered.
If you have a date column, then change it to:
SELECT AVG(value) avg_value
FROM (
SELECT *, NTILE(3) OVER (ORDER BY date) grp
FROM tablename
)
GROUP BY grp
See a simplified demo.
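To restrict the averaging to today's rows via the unix timestamp, something like this sketch could work (it assumes the timestamp column is named ts and stores unix seconds, the sensor reading is in value, and X = 10 groups):
SELECT MIN(ts) AS group_start, AVG(value) AS avg_value
FROM (
    SELECT *, NTILE(10) OVER (ORDER BY ts) grp   -- NTILE(10): 10 averaged groups for today
    FROM tablename
    WHERE ts >= CAST(strftime('%s', 'now', 'start of day') AS INTEGER)
      AND ts <  CAST(strftime('%s', 'now', 'start of day', '+1 day') AS INTEGER)
)
GROUP BY grp
ORDER BY group_start;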

Fetch ordered records between specific ranks

I ran a SQL query to order (sort) my records. Now, instead of fetching the top 1000 records with LIMIT, how can I fetch the records between ranks 500 and 800? I want to provide a specific range and get all records within that range of ranks.
If by "rank" you just mean row numbers, LIMIT & OFFSET will do:
SELECT * FROM tbl ORDER BY col OFFSET 499 LIMIT 301; -- "ranks of 500 to 800"
If you mean actual "rank" as implemented by the window functions rank() or dense_rank(), use the respective function in a subquery or CTE, as demonstrated in the answer below.
Pesky side effect: SELECT * cannot be used to get all columns of the table. You get the additional column "rank" from the subquery unless you spell out the definitive list of desired columns.
Use the row type of the underlying table to work around this:
SELECT (sub.t).* -- parentheses required!
FROM (
   SELECT t, rank() OVER (ORDER BY col1) AS rnk -- or dense_rank()?
   FROM   tbl t
   ) sub
WHERE  rnk BETWEEN 500 AND 800
ORDER  BY col1; -- repeat order (optional)
Use the rank or row_number window function (do some research on their differences and choose the one that suits you) and an outer query to filter the rows:
SELECT *
FROM
(
SELECT f1, f2, ..., RANK() OVER (ORDER BY fn, fm, ...) as r
FROM ...
WHERE ...
)
WHERE r between 500 and 800
Use offset.
-- Fetch rows 500 to 800 inclusive
select *
from generate_series(1, 1000)
order by 1
limit 301
offset 499
offset is the number of rows to skip: to start at row 500 you skip the first 499 rows. limit is 301 because there are 301 rows between 500 and 800 inclusive. Use 300 if you want row 800 excluded.

Calculate percentage value for each row based on total column value in SQL Server

I have a table like the one below. I need to calculate a percentage value for each row based on the total of the column. How can I achieve this with a SQL query?
I am expecting results like this:
Please find below the simple calculation to get the result:
(100 / 1000) * 100 = 10
for the first value.
(row value / grand total) * 100
Can you please help me get 10 as the result for the first row with a SQL query? Thanks in advance.
You can use the SUM() function as an analytic function over the entire table to compute the total sum. Then, just divide each region's cost by that sum to obtain the percentage.
SELECT
Region,
Cost,
100 * Cost / SUM(Cost) OVER () AS Percentage
FROM yourTable
Note that you could also have used a non-correlated subquery to find the total cost, e.g.
(SELECT SUM(Cost) FROM yourTable)
But the first version I gave might outperform it, if for no other reason than that it requires only a single query.
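Spelled out, that subquery variant would look like this sketch (same assumed table and column names):
SELECT
    Region,
    Cost,
    100 * Cost / (SELECT SUM(Cost) FROM yourTable) AS Percentage
FROM yourTable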
Update:
For your updated query I might use the following:
WITH cte AS (
SELECT
Region,
SUM(Cost) AS sum_cost
FROM yourTable
GROUP BY Region
)
SELECT
Region,
sum_cost,
100 * sum_cost / SUM(sum_cost) OVER () AS Percentage
FROM cte;
You can get the result with the query below.
;with Sales(region, cost, Total)
as
(
    select region, cost,
           sum(cost) over() as Total
    from YourTable
)
select Region, Cost, convert(numeric(18,0), Cost * 100.0 / Total) as Per from Sales