I get the same value percentile for all rows - sql

For each borough, what is the 90th percentile number of people injured per intersection?
I have a borough column and several injury-count columns, which I unnest into a single column, num_of_injured. I need the 90th percentile of that column, but my query returns the same value for every row. I expected a different value for each borough (or am I wrong?).
select distinct borough,
       count(num_of_injured) as count_all,
       PERCENTILE_CONT(num_of_injured, 0.9 RESPECT NULLS) OVER() AS percentile90
from `bigquery-public-data.new_york.nypd_mv_collisions` c cross join
     unnest(array[number_of_persons_injured, number_of_pedestrians_injured, number_of_motorist_killed, number_of_cyclist_injured]) num_of_injured
where borough != ''
group by borough, num_of_injured
order by count_all desc
limit 10;
Thank you for the help!

If you count the rows for each num_of_injured value within each borough, over 90% of them are 0:
select borough, num_of_injured, count(*),
count(*) / sum(count(*)) over (partition by borough)
from `bigquery-public-data.new_york.nypd_mv_collisions` c cross join
     unnest(array[number_of_persons_injured, number_of_pedestrians_injured, number_of_motorist_killed, number_of_cyclist_injured]) num_of_injured
group by 1, 2
order by 1, 2;
Hence, the 90th percentile is 0.
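To see why a column dominated by zeros ends up with a 90th percentile of 0, here is a minimal Python sketch of the linear-interpolation rule that PERCENTILE_CONT applies. The injury counts are made up for illustration and are not taken from the collisions table.

```python
def percentile_cont(values, fraction):
    """Linear-interpolation percentile, the rule PERCENTILE_CONT uses."""
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)  # zero-based fractional row
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight

# Hypothetical injury counts: 95 of 100 rows are zero.
injured = [0] * 95 + [1, 1, 2, 3, 5]
p90 = percentile_cont(injured, 0.9)
print(p90)
```

The fractional row for 0.9 lands among the zeros, so the few non-zero rows at the top never influence the result.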

Related

In Spark SQL how do I take 98% of the lowest values

I am using Spark SQL and I have some outliers that have incredibly high transaction counts in comparison to the rest. I only want the lowest 98% of the values and to cut off the top 2% outliers. How do I go about doing that? The TOP function is not being recognized in Spark SQL. This is a sample of the table but it is a very large table.
Date        ID      Name     Transactions
----------  ------  -------  ------------
02/02/2022  ABC123  Bob      107
01/05/2022  ACD232  Emma     34
12/03/2022  HH254   Kirsten  23
12/11/2022  HH254   Kirsten  47
You need a couple of window functions to compute the relative rank; the row_number() will give absolute rank, but you won't know where to draw the cutoff line without a full record count to compute the percentile.
In an inner query,
Select t.*,
row_number() Over (Order By Transactions, Date desc) * 100
/ count(*) Over () as percentile
From myTable t
Then in an outer query just
Select * from (*inner query*) q
Where percentile <= 98
An empty Over () is enough to make Count(*) span the whole table, so no frame clause is needed. The Over clause itself cannot be dropped: without it, Count(*) becomes a plain aggregate and cannot be selected alongside t.*.
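The row_number / count idea can be tried out on the four sample rows from the question. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for Spark SQL (an assumption, and it needs SQLite 3.25 or later for window functions), and it multiplies by 100.0 because SQLite would otherwise do integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE myTable (Date TEXT, ID TEXT, Name TEXT, Transactions INTEGER)"
)
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?, ?)",
    [
        ("02/02/2022", "ABC123", "Bob", 107),
        ("01/05/2022", "ACD232", "Emma", 34),
        ("12/03/2022", "HH254", "Kirsten", 23),
        ("12/11/2022", "HH254", "Kirsten", 47),
    ],
)

# Inner query ranks every row; outer query keeps the lowest 98%.
query = """
SELECT Name, Transactions
FROM (
    SELECT t.*,
           row_number() OVER (ORDER BY Transactions) * 100.0
               / count(*) OVER () AS percentile
    FROM myTable t
) ranked
WHERE percentile <= 98
ORDER BY Transactions
"""
kept = conn.execute(query).fetchall()
print(kept)
```

With only four rows the percentile steps are 25, 50, 75 and 100, so the cutoff removes the single largest row (Bob, 107). On a large table the same filter trims roughly the top 2%.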
You can calculate the 98th percentile value for the Transactions column and then keep only the rows whose Transactions value is at or below it. You can use the following query to accomplish that:
WITH base_data AS (
SELECT Date, ID, Name, Transactions
FROM your_table
),
percentiles AS (
SELECT percentile_approx(Transactions, 0.98) AS p
FROM base_data
)
SELECT Date, ID, Name, Transactions
FROM base_data
JOIN percentiles
ON Transactions <= p
The percentile_approx function returns an approximate 98th percentile of Transactions over base_data. Passing a single fraction rather than array(0.98) makes p a scalar, so it can be compared directly with Transactions in the join.

Group By 2 Columns and Find Percentage of Group In SQL

I have a Game table with two columns TeamZeroScore and TeamOneScore. I would like to calculate the % of games that end with each score variance. The max score one team can have is 5.
I have got the following code which selects each team score with an additional 2 columns to have the max and min of these two values in order. I did this because I thought the next step is to group by these two columns
SELECT TOP (100000) [TeamOneScore],[TeamZeroScore],
(SELECT Max(v)
FROM (VALUES ([TeamOneScore]), ([TeamZeroScore])) AS value(v)) as [MaxScore],
(SELECT Min(v)
FROM (VALUES ([TeamOneScore]), ([TeamZeroScore])) AS value(v)) as [MinScore]
FROM [Database].[dbo].[Game]
Below is the sample data I have for the code above.
How do I produce something similar to this? I think I need to Group By MaxScore, MinScore and then use Count on each group to calculate the percentage based on the total.
Select
Count(*) as "number",
100.0 * Count(*) / t as "percentage",
MinScore,
MaxScore
From
( Select
TeamOneScore as MinScore, TeamZeroScore as MaxScore
From Game
Where TeamOneScore <= TeamZeroScore
Union all
Select
TeamZeroScore, TeamOneScore
From Game
Where TeamOneScore > TeamZeroScore
) a,
(Select Count(*) as t
From Game) b
Group by
MinScore,
MaxScore,
t
Order by
MinScore,
MaxScore;
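The grouping logic is easier to check on a small example. Below is a minimal Python sketch of the same idea: order each score pair as (min, max), count each pair, and divide by the total. The game results are hypothetical, since the asker's sample data is not available here.

```python
from collections import Counter

# Hypothetical (TeamZeroScore, TeamOneScore) pairs; not the asker's data.
games = [(5, 3), (3, 5), (5, 0), (2, 5), (5, 3), (4, 5), (0, 5), (5, 4)]

# Order each pair as (MinScore, MaxScore) so 5-3 and 3-5 count together.
counts = Counter((min(a, b), max(a, b)) for a, b in games)
total = len(games)
percentages = {pair: 100.0 * n / total for pair, n in sorted(counts.items())}

for (low, high), pct in percentages.items():
    print(f"{low}-{high}: {pct:.1f}%")
```

Multiplying by 100.0 rather than 100 matters in SQL as well: with integer columns, many databases would otherwise truncate the percentage.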

How to extract median value?

I need to get median value in column "median". Any ideas, please?
SELECT
MIN(score) min, CAST(AVG(score) AS float) median, MAX(score) max
FROM result JOIN student ON student.id = result.student_id
I think the simplest method is PERCENTILE_CONT() or PERCENTILE_DISC():
SELECT MIN(score) as min_score,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY score) as median_score,
MAX(score) max_score
FROM result r JOIN
student s
ON s.id = r.student_id;
This assumes (reasonably) that score is numeric.
The difference between PERCENTILE_CONT() and PERCENTILE_DISC() is what happens when there are an even number of values. That is usually an unimportant consideration, unless you have a small amount of data.
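A small Python sketch may make the even-count difference concrete. The two helpers below are illustrative reimplementations of the standard definitions (interpolation for the continuous version, first value whose cumulative share reaches the fraction for the discrete one), and the scores are hypothetical.

```python
import math

def percentile_cont(values, fraction):
    """Interpolates between the two neighbouring rows."""
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight

def percentile_disc(values, fraction):
    """Smallest value whose cumulative share of rows reaches the fraction."""
    ordered = sorted(values)
    index = max(math.ceil(fraction * len(ordered)) - 1, 0)
    return ordered[index]

scores = [70, 80, 90, 100]  # hypothetical, even number of rows
median_cont = percentile_cont(scores, 0.5)
median_disc = percentile_disc(scores, 0.5)
print(median_cont, median_disc)
```

With four rows the continuous median is halfway between 80 and 90, while the discrete median is an actual row value, 80. With an odd number of rows both return the middle value.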
Average is not Median, you're right.
You can do it the exact way, with:
SELECT ( (SELECT MIN(score) FROM Results X
WHERE (SELECT COUNT(*) FROM Results Y WHERE Y.score <= X.score)
>= (SELECT COUNT(*) FROM Results ) / 2.0)
+ (SELECT MAX(score) FROM Results X
WHERE (SELECT COUNT(*) FROM Results Y WHERE Y.score >= X.score)
>= (SELECT COUNT(*) FROM Results ) / 2.0)
) / 2.0 AS median
This handles the case where the boundary between the upper and lower 50% falls between two values; it arbitrarily takes the halfway point between them as the median. There are arguments why that might be weighted slightly higher or lower, but any value in that interval correctly divides the population in two.
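The counting approach can be checked with Python's built-in sqlite3 module. This is a sketch under a few assumptions: the table is called Results with a single score column, and the halfway threshold is written as / 2.0 so that an odd row count is not truncated by integer division.

```python
import sqlite3

def exact_median(scores):
    """Run the counting-based median query on an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Results (score REAL)")
    conn.executemany("INSERT INTO Results VALUES (?)", [(s,) for s in scores])
    query = """
    SELECT ( (SELECT MIN(score) FROM Results X
              WHERE (SELECT COUNT(*) FROM Results Y WHERE Y.score <= X.score)
                    >= (SELECT COUNT(*) FROM Results) / 2.0)
           + (SELECT MAX(score) FROM Results X
              WHERE (SELECT COUNT(*) FROM Results Y WHERE Y.score >= X.score)
                    >= (SELECT COUNT(*) FROM Results) / 2.0)
           ) / 2.0 AS median
    """
    (median,) = conn.execute(query).fetchone()
    conn.close()
    return median

print(exact_median([10, 20, 30, 40]))      # even count
print(exact_median([10, 20, 30, 40, 50]))  # odd count
```

For the even case the two subqueries pick 20 and 30 and the result is their midpoint; for the odd case both pick the middle row. The correlated subqueries rescan the table for every row, so this is only practical on small tables.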
Or, if you are dealing with a hyperbolic distribution, there's a short-cut approximation:
SELECT SQRT(SUM(num) / SUM(1.0/num)) FROM List
Many other real-world distributions have a lot of little members and a few large members.
Having just hit SAVE and seen the prior answer: yes, SQL2003 now gives you something simpler :-)

Top 90% average (ASC) against group by in SQL

I have a huge database and need the average of the lowest 90% of values in every category, using GROUP BY.
For example, I have 300 locations and about 100k dockets, each with a TAT column. I need the average of the lowest 90% of TAT values for every location, in one query, grouped by location.
Most DBMSes support Windowed Aggregate Functions; you need PERCENT_RANK:
select location, avg(TAT)
from
(
select location, TAT,
-- assign a value between 0 (lowest TAT) and 1 (highest TAT) for each location
percent_rank() over (partition by location order by TAT) as pr
from tab
) as dt
where pr <= 0.9 -- exclude the highest TAT amounts
group by location
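As a quick check of the PERCENT_RANK filter, here is a sketch that runs the same query through Python's sqlite3 module (assuming SQLite 3.25 or later for window functions). The table name tab matches the answer; the locations and TAT values are invented, with one large outlier per location.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (location TEXT, TAT REAL)")

# Hypothetical data: each location has one extreme TAT value.
rows = [("A", float(v)) for v in range(1, 10)] + [("A", 1000.0)]
rows += [("B", 2.0), ("B", 4.0), ("B", 6.0), ("B", 600.0)]
conn.executemany("INSERT INTO tab VALUES (?, ?)", rows)

query = """
SELECT location, avg(TAT)
FROM (
    SELECT location, TAT,
           percent_rank() OVER (PARTITION BY location ORDER BY TAT) AS pr
    FROM tab
) AS dt
WHERE pr <= 0.9
GROUP BY location
"""
averages = dict(conn.execute(query).fetchall())
print(averages)
```

PERCENT_RANK is (rank - 1) / (rows - 1) within each partition, so the highest value in a location always gets 1.0 and is excluded by the filter. Note that in very small groups, such as location B here, this removes a larger share than 10%.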

T-SQL: Calculating the Nth Percentile Value from column

I have a column of data, some of which are NULL values, from which I wish to extract the single 90th percentile value:
ColA
-----
NULL
100
200
300
NULL
400
500
600
700
800
900
1000
For the above, I am looking for a technique which returns the value 900 when searching for the 90th percentile, 800 for the 80th percentile, etc. An analogous function would be AVG(ColA) which returns 550 for the above data, or MIN(ColA) which returns 100, etc.
Any suggestions?
If you want to get exactly the 90th percentile value, excluding NULLs, I would suggest doing the calculation directly. The following version calculates the row number and number of rows, and selects the appropriate value:
select max(case when rownum*1.0/numrows <= 0.9 then colA end) as percentile_90th
from (select colA,
row_number() over (order by colA) as rownum,
count(*) over (partition by NULL) as numrows
from t
where colA is not null
) t
I put the condition in the SELECT clause rather than the WHERE clause, so you can easily get the 50th percentile, 17th, or whatever values you want.
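The query can be run against the sample column from the question to confirm it returns 900 and 800. The sketch below uses Python's sqlite3 module rather than SQL Server (an assumption; it needs SQLite 3.25 or later), and the helper name nth_percentile is introduced only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (colA INTEGER)")
values = [None, 100, 200, 300, None, 400, 500, 600, 700, 800, 900, 1000]
conn.executemany("INSERT INTO t VALUES (?)", [(v,) for v in values])

def nth_percentile(fraction):
    """Largest non-NULL value whose row position is within the fraction."""
    query = """
    SELECT max(CASE WHEN rownum * 1.0 / numrows <= ? THEN colA END)
    FROM (SELECT colA,
                 row_number() OVER (ORDER BY colA) AS rownum,
                 count(*) OVER () AS numrows
          FROM t
          WHERE colA IS NOT NULL) ranked
    """
    return conn.execute(query, (fraction,)).fetchone()[0]

print(nth_percentile(0.9), nth_percentile(0.8))
```

Because the fraction is a parameter of the CASE expression rather than a WHERE filter, several percentiles can be computed in one pass by adding more max(CASE ...) columns.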
WITH
percentiles AS
(
SELECT
NTILE(100) OVER (ORDER BY ColA) AS percentile,
*
FROM
data
)
SELECT
*
FROM
percentiles
WHERE
percentile = 90
Note: If the data has less than 100 observations, not all percentiles will have a value. Equally, if you have more than 100 observations, some percentiles will contain more values.
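The caveat about small data sets is easy to reproduce. In this sketch (Python's sqlite3 module standing in for SQL Server, which is an assumption), the ten non-NULL values from the question are split with NTILE(100); the table name data follows the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (ColA INTEGER)")
conn.executemany(
    "INSERT INTO data VALUES (?)", [(v,) for v in range(100, 1100, 100)]
)

# Bucket number assigned to each of the ten rows.
buckets = conn.execute(
    "SELECT NTILE(100) OVER (ORDER BY ColA) AS percentile, ColA "
    "FROM data ORDER BY ColA"
).fetchall()

# Rows that land in bucket 90, as the answer's WHERE clause requests.
matches = [row for row in buckets if row[0] == 90]
print(buckets)
print(matches)
```

With ten rows, NTILE(100) only hands out bucket numbers 1 to 10, so filtering on percentile = 90 returns nothing. For small tables, scale the bucket count to the data (for example NTILE(10) and bucket 9) or use one of the other approaches.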
Starting with SQL Server 2012, there are now PERCENTILE_DISC and PERCENTILE_CONT inverse distribution functions. These are (so far) only available as window functions, not as aggregate functions, so you would have to remove redundant results because of the lacking grouping, e.g. by using DISTINCT or TOP 1:
WITH t AS (
SELECT *
FROM (
VALUES(NULL),(100),(200),(300),
(NULL),(400),(500),(600),(700),
(800),(900),(1000)
) t(ColA)
)
SELECT DISTINCT percentile_disc(0.9) WITHIN GROUP (ORDER BY ColA) OVER()
FROM t
;
I have blogged about percentiles in more detail here.