I have a huge database and need the average of the lowest 90% of values for every category, using GROUP BY.
For example, I have 300 locations and about 100k rows, with a TAT column for every docket. I need the average of the lowest 90% of TAT values for each location, in one query using GROUP BY location.
Most DBMSes support windowed aggregate functions; you need PERCENT_RANK:
select location, avg(TAT)
from
(
select location, TAT,
-- assign a value between 0 (lowest TAT) and 1 (highest TAT) for each location
percent_rank() over (partition by location order by TAT) as pr
from tab
) as dt
where pr <= 0.9 -- exclude the highest TAT amounts
group by location
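As a rough, runnable sketch of the same idea, the snippet below runs the query against an in-memory SQLite database through Python's sqlite3 module. SQLite (3.25 or later, for window functions) is only a stand-in for your DBMS, and the sample rows are invented for illustration: 11 dockets per location, ten ordinary TAT values plus one outlier.

```python
import sqlite3  # window functions need SQLite 3.25 or later

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (location TEXT, TAT REAL)")

# Hypothetical sample data: TAT 1..10 plus one outlier of 100 per location
sample = [(loc, float(t)) for loc in ("A", "B") for t in list(range(1, 11)) + [100]]
conn.executemany("INSERT INTO tab VALUES (?, ?)", sample)

query = """
SELECT location, AVG(TAT)
FROM (
    SELECT location, TAT,
           -- 0 for the lowest TAT in a location, 1 for the highest
           PERCENT_RANK() OVER (PARTITION BY location ORDER BY TAT) AS pr
    FROM tab
) AS dt
WHERE pr <= 0.9          -- drop the highest TAT values
GROUP BY location
ORDER BY location
"""
trimmed = conn.execute(query).fetchall()
print(trimmed)
```

With this sample the outlier row gets pr = 1.0 and is dropped, so each location averages the values 1 to 10, giving 5.5.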
I am using Spark SQL and I have some outliers that have incredibly high transaction counts in comparison to the rest. I only want the lowest 98% of the values and to cut off the top 2% outliers. How do I go about doing that? The TOP function is not being recognized in Spark SQL. This is a sample of the table but it is a very large table.
Date        ID      Name     Transactions
02/02/2022  ABC123  Bob      107
01/05/2022  ACD232  Emma     34
12/03/2022  HH254   Kirsten  23
12/11/2022  HH254   Kirsten  47
You need a couple of window functions to compute the relative rank; the row_number() will give absolute rank, but you won't know where to draw the cutoff line without a full record count to compute the percentile.
In an inner query,
Select t.*,
row_number() Over (Order By Transactions, Date desc) * 100
/ count(*) Over (Rows Between Unbounded Preceding And Unbounded Following) as percentile
From myTable t
Then in an outer query just
Select * from (*inner query*)
Where percentile <= 98
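You can probably shorten the frame on the Count(*) to an empty Over (), which also spans every row; the Over clause itself has to stay so that Count(*) remains a window function.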
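A minimal sketch of this row_number / count approach, run against in-memory SQLite through Python's sqlite3 module as a stand-in for Spark SQL. The table contents are made up (49 ordinary rows plus one outlier), the ordering uses ID as a tie-breaker instead of Date, and the total is taken with an empty OVER (). Note the 100.0: SQLite would otherwise do integer division, which Spark SQL's / operator does not.

```python
import sqlite3  # window functions need SQLite 3.25 or later

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (ID INTEGER, Transactions INTEGER)")

# Hypothetical data: 49 ordinary rows (1..49) and one extreme outlier
data = [(i, i) for i in range(1, 50)] + [(50, 5000)]
conn.executemany("INSERT INTO myTable VALUES (?, ?)", data)

query = """
SELECT ID, Transactions
FROM (
    SELECT t.*,
           -- absolute rank scaled by the total row count gives a 0-100 percentile
           ROW_NUMBER() OVER (ORDER BY Transactions, ID) * 100.0
               / COUNT(*) OVER () AS percentile
    FROM myTable t
) AS ranked
WHERE percentile <= 98   -- keep the lowest 98% of rows
ORDER BY Transactions
"""
kept = conn.execute(query).fetchall()
print(len(kept), max(t for _, t in kept))
```

With 50 rows, each row is worth 2 percentile points, so only the single highest row (the outlier) is cut.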
You can calculate the 98th percentile value for the Transactions column and then filter the rows where the value of Transactions is below the 98th percentile. You can use the following query to accomplish that:
WITH base_data AS (
SELECT Date, ID, Name, Transactions
FROM your_table
),
percentiles AS (
SELECT percentile_approx(Transactions, 0.98) AS p
FROM base_data
)
SELECT Date, ID, Name, Transactions
FROM base_data
JOIN percentiles
ON Transactions <= p
The percentile_approx function is applied to the base_data CTE to obtain an approximate 98th percentile value, and the join keeps only the rows at or below it.
For each borough, what is the 90th percentile number of people injured per intersection?
I have a borough column and several injury-count columns that I combine into a single column. I need the 90th percentile of the number of injured people, but my query gives me just one value. I expected a different value for each row (or am I wrong?).
select distinct borough,count(num_of_injured) as count_all, PERCENTILE_CONT(num_of_injured, 0.9 RESPECT NULLS) OVER() AS percentile90
from `bigquery-public-data.new_york.nypd_mv_collisions` c cross join
unnest(array[number_of_persons_injured,number_of_pedestrians_injured,number_of_motorist_killed,number_of_cyclist_injured]
)num_of_injured
where borough!=''
group by borough,num_of_injured
order by count_all desc
limit 10;
Thank you for the help!
If you look at the counts by borough for each num injured, then over 90% are 0:
select borough, num_of_injured, count(*),
count(*) / sum(count(*)) over (partition by borough)
from `bigquery-public-data.new_york.nypd_mv_collisions` c cross join
unnest(array[number_of_persons_injured,number_of_pedestrians_injured,number_of_motorist_killed,number_of_cyclist_injured]
) num_of_injured
group by 1, 2
order by 1, 2;
Hence, the 90th percentile is 0.
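To see why a distribution dominated by zeros gives a 90th percentile of 0, here is a small Python sketch of a continuous percentile with linear interpolation (the usual definition behind PERCENTILE_CONT). The helper name and the counts are invented for illustration and are not taken from the collisions dataset.

```python
import math

def percentile_cont(values, fraction):
    """Continuous percentile with linear interpolation between neighbours."""
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)   # fractional index into the sorted list
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight

# Hypothetical borough: 93 rows with 0 injured, 5 with 1, 2 with 2
injured = [0] * 93 + [1] * 5 + [2] * 2
p90 = percentile_cont(injured, 0.9)
print(p90)
```

The 90% position falls between two rows that are both 0, so the interpolated result is 0 no matter how large the few non-zero values are.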
I need to get the max value on a field of a table but I can't use max or any other aggregation function nor cursors. For example I need to get the max value of the field amount on the table Sales.
A couple of ways:
1. Sort the column descending and get the 1st row:
select top 1 amount from sales order by amount DESC
2. With NOT EXISTS:
select distinct s.amount
from sales s
where not exists (
select 1 from sales
where amount > s.amount
)
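Both variants can be tried quickly against an in-memory SQLite table via Python's sqlite3 module. This is only a sketch: the amounts are made up, and because SQLite has no TOP, the first variant uses LIMIT 1 instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (amount INTEGER)")

# Hypothetical amounts; the maximum (300) appears twice on purpose
conn.executemany("INSERT INTO sales VALUES (?)", [(120,), (75,), (300,), (300,), (42,)])

# Variant 1: sort descending and take the first row (LIMIT 1 stands in for TOP 1)
by_sort = conn.execute(
    "SELECT amount FROM sales ORDER BY amount DESC LIMIT 1"
).fetchall()

# Variant 2: keep the amount for which no larger amount exists
by_not_exists = conn.execute("""
    SELECT DISTINCT s.amount
    FROM sales s
    WHERE NOT EXISTS (
        SELECT 1 FROM sales
        WHERE amount > s.amount
    )
""").fetchall()

print(by_sort, by_not_exists)
```

The DISTINCT in the second variant matters when the maximum occurs more than once, as it does in this sample.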
I am creating ranks within partitions of my table. The partitions are defined by the name column and ordered by a transaction value. When I generate these partitions and count the rows at each rank, I get different counts every time I run the query.
select count(*) FROM (
--
-- Sort and ranks the element of RFM
--
SELECT
*,
RANK() OVER (PARTITION BY name ORDER BY date_since_last_trans desc) AS rfmrank_r,
FROM (
SELECT
name,
id_customer,
cust_age,
gender,
DATE_DIFF(entity_max_date, customer_max_date, DAY ) AS date_since_last_trans,
txncnt,
txnval,
txnval / txncnt AS avg_txnval
FROM
(
SELECT
name,
id_customer,
MAX(cust_age) AS cust_age,
COALESCE(APPROX_TOP_COUNT(cust_gender,1)[OFFSET(0)].VALUE, MAX(cust_gender)) AS gender,
MAX(date_date) AS customer_max_date,
(SELECT MAX(date_date) FROM xxxxx) AS entity_max_date,
COUNT(purchase_amount) AS txncnt,
SUM(purchase_amount) AS txnval
FROM
xxxxx
WHERE
date_date > (
SELECT
DATE_SUB(MAX(date_date), INTERVAL 24 MONTH) AS max_date
FROM
xxxxx)
AND cust_age >= 15
AND cust_gender IN ('M','F')
GROUP BY
name,
id_customer
)
)
)
group by rfmrank_r
For 1st run I am getting
Row f0
1 3970
2 3017
3 2116
4 2118
For 2nd run I am getting
Row f0
1 4060
2 3233
3 2260
4 2145
What can I do to get the same count for each rank on every run?
The RANK window function determines the rank of a value in a group of values.
Each value is ranked within its partition. Rows with equal values for the ranking criteria receive the same rank. Drill adds the number of tied rows to the tied rank to calculate the next rank and thus the ranks might not be consecutive numbers.
For example, if two rows are ranked 1, the next rank is 3.
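The gap after a tie is easy to reproduce. The sketch below uses in-memory SQLite through Python's sqlite3 module as a stand-in engine, with an invented scores table, and adds DENSE_RANK next to RANK for contrast.

```python
import sqlite3  # window functions need SQLite 3.25 or later

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")

# Hypothetical data: "a" and "b" tie for the top score
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("a", 90), ("b", 90), ("c", 80), ("d", 70)],
)

ranked = conn.execute("""
    SELECT name,
           RANK()       OVER (ORDER BY score DESC) AS rnk,       -- leaves gaps after ties
           DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk  -- no gaps
    FROM scores
    ORDER BY rnk, name
""").fetchall()

for row in ranked:
    print(row)
```

Here the two tied rows share rank 1, RANK jumps to 3 for the next row, and DENSE_RANK continues with 2.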
I have a table that contains patient locations. I'm trying to find the first patient location that is not the emergency department. I tried using MIN but since the locations have numbers in them it pulls the MIN location but not necessarily the first location. There is a datetime field associated with the location, but I'm not certain how to link the min datetime to the first location. Any help would be appreciated. My query looks something like this:
SELECT PatientName,
MRN,
CSN,
MIN (LOC) as FirstUnit,
MIN (DateTime)as FirstUnitTime
FROM Patients
WHERE LOC <> 'ED'
I presume that you want the first unit for each patient. If so, then you can use row_number():
select PatientName, MRN, CSN, LOC as FirstUnit, DateTime as FirstUnitTime
from (select p.*,
row_number() over (partition by PatientName, MRN, CSN
order by datetime asc) as seqnum
from Patients p
where loc <> 'ED'
) p
where seqnum = 1;
row_number() assigns a sequential number to a group of rows, where the group is specified by the partition by clause. The numbers are in order, as defined by the order by clause. So, the oldest (first) row in each group is assigned a value of 1. The outside query chooses this row.
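A small runnable sketch of this query, using in-memory SQLite through Python's sqlite3 module as a stand-in for the real database. The patient rows are invented, and the timestamps are stored as ISO-formatted text so that they sort chronologically.

```python
import sqlite3  # window functions need SQLite 3.25 or later

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Patients (PatientName TEXT, MRN TEXT, CSN TEXT, LOC TEXT, DateTime TEXT)"
)

# Hypothetical visits; note that MIN(LOC) for Ann would pick '2 East', not her first unit
conn.executemany("INSERT INTO Patients VALUES (?, ?, ?, ?, ?)", [
    ("Ann", "M1", "C1", "ED",     "2023-01-01 08:00"),
    ("Ann", "M1", "C1", "4 West", "2023-01-01 12:00"),
    ("Ann", "M1", "C1", "2 East", "2023-01-02 09:00"),
    ("Raj", "M2", "C2", "ED",     "2023-01-03 10:00"),
    ("Raj", "M2", "C2", "ICU",    "2023-01-03 15:00"),
])

first_units = conn.execute("""
    SELECT PatientName, MRN, CSN, LOC AS FirstUnit, DateTime AS FirstUnitTime
    FROM (SELECT p.*,
                 ROW_NUMBER() OVER (PARTITION BY PatientName, MRN, CSN
                                    ORDER BY DateTime ASC) AS seqnum
          FROM Patients p
          WHERE LOC <> 'ED'
         ) p
    WHERE seqnum = 1          -- earliest non-ED row per patient
    ORDER BY PatientName
""").fetchall()

for row in first_units:
    print(row)
```

Because the location and the timestamp come from the same row, they always belong together, which two separate MIN calls cannot guarantee.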