In Spark SQL, how do I take the lowest 98% of the values?

I am using Spark SQL and I have some outliers with incredibly high transaction counts compared to the rest. I only want the lowest 98% of the values and want to cut off the top 2% outliers. How do I go about doing that? The TOP function is not recognized in Spark SQL. Here is a sample of the table (the real table is very large):
Date       | ID     | Name    | Transactions
-----------|--------|---------|-------------
02/02/2022 | ABC123 | Bob     | 107
01/05/2022 | ACD232 | Emma    | 34
12/03/2022 | HH254  | Kirsten | 23
12/11/2022 | HH254  | Kirsten | 47

You need a couple of window functions to compute the relative rank: row_number() gives an absolute rank, but you won't know where to draw the cutoff line without the total record count to turn that rank into a percentile.
In an inner query,
Select t.*,
    row_number() Over (Order By Transactions, Date desc) * 100.0
        / count(*) Over () as percentile
From myTable t
Then in an outer query just
Select * from (*inner query*)
Where percentile <= 98
Note that the empty Over () on the Count(*) is needed: without it, Count(*) becomes an ordinary aggregate and the query would require a Group By.
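Putting the two steps together, a full version might look like this (myTable is the placeholder table name from above):
Select Date, ID, Name, Transactions
From (
    Select t.*,
        row_number() Over (Order By Transactions, Date desc) * 100.0
            / count(*) Over () as percentile
    From myTable t
)
Where percentile <= 98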

You can calculate the 98th percentile value for the Transactions column and then filter the rows where the value of Transactions is below the 98th percentile. You can use the following query to accomplish that:
WITH base_data AS (
    SELECT Date, ID, Name, Transactions
    FROM your_table
),
percentiles AS (
    SELECT percentile_approx(Transactions, 0.98) AS p
    FROM base_data
)
SELECT Date, ID, Name, Transactions
FROM base_data
JOIN percentiles
    ON Transactions <= p
The percentile_approx function computes an approximate 98th percentile of Transactions over the base_data CTE; the join condition then keeps only the rows at or below that value.
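If you need an exact cutoff rather than an approximation (at the cost of sorting the whole table), a percent_rank() variant along the same lines should also work; your_table is the same placeholder as above:
SELECT Date, ID, Name, Transactions
FROM (
    SELECT *,
        percent_rank() OVER (ORDER BY Transactions) AS pr
    FROM your_table
)
WHERE pr <= 0.98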

Related

SQLite: averaging lots of rows into X averaged values between two desired rows

I have rows in SQLite, let's say 10,000, with the last column holding a unix timestamp. I would like to average the values from today 00:00 to 23:59 into X averaged groups. If there are 1,000 records today and X is 10, then each group of 100 values is averaged, and the result is 10 averages of 100 records each. If X is 20, the result is 20 averages of 50 records each. The values come from sensors, such as temperature, and I would like to be able to track what the temperature was today between given hours, for each day.
What would be the most efficient way to do this? I'm using SQLite3 with C++. I could do it in C++ with more queries, but I would like to leave this to SQLite and only fetch the result, if possible. Visualization: https://i.ibb.co/grSTgrZ/sqlite.png
Any help appreciated on where I should start with this.
Thanks.
You can use the NTILE() window function to create the groups on which you will aggregate:
SELECT AVG(value) avg_value
FROM (
SELECT *, NTILE(3) OVER (ORDER BY id) grp
FROM tablename
)
GROUP BY grp
The number inside the parentheses of NTILE() corresponds to the number X in your requirement, and id is the column by which the table should be ordered.
If you have a date column then change to:
SELECT AVG(value) avg_value
FROM (
SELECT *, NTILE(3) OVER (ORDER BY date) grp
FROM tablename
)
GROUP BY grp
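To cover the "today 00:00 to 23:59" part of the question: assuming the unix-timestamp column is called ts (a hypothetical name) and X is 10, you could filter before grouping, roughly like this:
SELECT grp, AVG(value) AS avg_value
FROM (
    SELECT value,
        NTILE(10) OVER (ORDER BY ts) AS grp
    FROM tablename
    -- keep only today's rows: from midnight up to (but excluding) tomorrow's midnight
    WHERE ts >= CAST(strftime('%s', 'now', 'start of day') AS INTEGER)
      AND ts <  CAST(strftime('%s', 'now', 'start of day', '+1 day') AS INTEGER)
)
GROUP BY grp
ORDER BY grp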

RANK() function with over is creating ranks dynamically for every run

I am creating ranks for partitions of my table. The partitions are by the name column, ordered by transaction value. When I generate these partitions and check the count for each rank, I get a different number in each rank on every run of the query.
select count(*) FROM (
  --
  -- Sort and rank the elements of RFM
  --
  SELECT
    *,
    RANK() OVER (PARTITION BY name ORDER BY date_since_last_trans desc) AS rfmrank_r
  FROM (
    SELECT
      name,
      id_customer,
      cust_age,
      gender,
      DATE_DIFF(entity_max_date, customer_max_date, DAY) AS date_since_last_trans,
      txncnt,
      txnval,
      txnval / txncnt AS avg_txnval
    FROM (
      SELECT
        name,
        id_customer,
        MAX(cust_age) AS cust_age,
        COALESCE(APPROX_TOP_COUNT(cust_gender, 1)[OFFSET(0)].VALUE, MAX(cust_gender)) AS gender,
        MAX(date_date) AS customer_max_date,
        (SELECT MAX(date_date) FROM xxxxx) AS entity_max_date,
        COUNT(purchase_amount) AS txncnt,
        SUM(purchase_amount) AS txnval
      FROM xxxxx
      WHERE date_date > (
          SELECT DATE_SUB(MAX(date_date), INTERVAL 24 MONTH) AS max_date
          FROM xxxxx)
        AND cust_age >= 15
        AND cust_gender IN ('M','F')
      GROUP BY
        name,
        id_customer
    )
  )
)
group by rfmrank_r
For the 1st run I am getting:
Row | f0
1   | 3970
2   | 3017
3   | 2116
4   | 2118
For the 2nd run I am getting:
Row | f0
1   | 4060
2   | 3233
3   | 2260
4   | 2145
What can be done if I need the same number of rows to fall into each rank on every run?
Edit:
Sorry for blurring the fields. This is the output of the query used to get this column.
The RANK window function determines the rank of a value in a group of values.
Each value is ranked within its partition. Rows with equal values for the ranking criteria receive the same rank; the number of tied rows is added to the tied rank to calculate the next rank, so the ranks might not be consecutive numbers.
For example, if two rows are ranked 1, the next rank is 3.
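A quick illustration of the tie behavior (BigQuery syntax, made-up values):
SELECT v, RANK() OVER (ORDER BY v) AS rnk
FROM UNNEST([10, 20, 20, 30]) AS v
-- rnk comes out as 1, 2, 2, 4: the two tied rows share rank 2 and rank 3 is skipped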

Top 90% average (ASC) against group by in SQL

I have a huge database and I need to get the top-90% average for every category using GROUP BY.
For example, I have 300 locations and around 100k rows with a TAT column against all dockets. I need the average of the lowest 90% of TAT values for each location, in one query using GROUP BY location.
Most DBMSes support windowed aggregate functions; here you need PERCENT_RANK:
select location, avg(TAT)
from
(
select location, TAT,
-- assign a value between 0 (lowest TAT) and 1 (highest TAT) for each location
percent_rank() over (partition by location order by TAT) as pr
from tab
) as dt
where pr <= 0.9 -- exclude the highest TAT amounts
group by location
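PERCENT_RANK() evaluates to (rank - 1) / (rows in partition - 1), so pr <= 0.9 drops only the top tenth of each location's ranking. A small worked illustration (SQL Server / PostgreSQL VALUES syntax; the data is made up):
select location, TAT,
    percent_rank() over (partition by location order by TAT) as pr
from (values ('A', 10), ('A', 20), ('A', 30), ('A', 40), ('A', 999)) as t(location, TAT)
-- pr comes out as 0.00, 0.25, 0.50, 0.75, 1.00;
-- the outlier 999 (pr = 1.00) is the only row excluded by pr <= 0.9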

T-SQL calculate moving average

I am working with SQL Server 2008 R2, trying to calculate a moving average. For each record in my view, I would like to collect the values of the 250 previous records, and then calculate the average for this selection.
My view columns are as follows:
TransactionID | TimeStamp | Value | MovAvg
----------------------------------------------------
1 | 01.09.2014 10:00:12 | 5 |
2 | 01.09.2014 10:05:34 | 3 |
...
300 | 03.09.2014 09:00:23 | 4 |
TransactionID is unique. For each TransactionID, I would like to calculate the average of the Value column over the previous 250 records. So for TransactionID 300, collect the values from the previous 250 rows (the view is sorted descending by TransactionID) and write the average of those values into the MovAvg column. I am looking to collect data within a range of records.
The window functions in SQL Server 2008 are rather limited compared to later versions; if I remember correctly, you can only partition, and you can't use any rows/range frame limits. But I think this might be what you want:
;WITH cte (rn, transactionid, value) AS (
SELECT
rn = ROW_NUMBER() OVER (ORDER BY transactionid),
transactionid,
value
FROM your_table
)
SELECT
transactionid,
value,
movavg = (
SELECT AVG(value)
FROM cte AS inner_ref
-- the average is calculated over the 250 rows before the current row
WHERE inner_ref.rn BETWEEN outer_ref.rn - 250 AND outer_ref.rn - 1
)
FROM cte AS outer_ref
Note that it applies a correlated sub-query to every row and performance might not be great.
With the later versions you could have used window frame functions and done something like this:
SELECT
transactionid,
value,
-- avg over the 250 rows before the current row
AVG(value) OVER (ORDER BY transactionid
ROWS BETWEEN 250 PRECEDING AND 1 PRECEDING),
-- or the 250 rows counting from the current row
AVG(value) OVER (ORDER BY transactionid
ROWS BETWEEN 249 PRECEDING AND CURRENT ROW)
FROM your_table
Use a Common Table Expression (CTE) to include the rownum for each transaction, then join the CTE against itself on the row number so you can get the previous values to calculate the average with.
CREATE TABLE MyTable (TransactionId INT, Value INT)
;with Data as
(
SELECT TransactionId,
Value,
ROW_NUMBER() OVER (ORDER BY TransactionId ASC) as rownum
FROM MyTable
)
SELECT d.TransactionId , Avg(h.Value) as MovingAverage
FROM Data d
JOIN Data h on h.rownum between d.rownum-250 and d.rownum-1
GROUP BY d.TransactionId

T-SQL: Calculating the Nth Percentile Value from a column

I have a column of data, some of which are NULL values, from which I wish to extract the single 90th percentile value:
ColA
-----
NULL
100
200
300
NULL
400
500
600
700
800
900
1000
For the above, I am looking for a technique which returns the value 900 when searching for the 90th percentile, 800 for the 80th percentile, etc. An analogous function would be AVG(ColA) which returns 550 for the above data, or MIN(ColA) which returns 100, etc.
Any suggestions?
If you want to get exactly the 90th percentile value, excluding NULLs, I would suggest doing the calculation directly. The following version calculates the row number and number of rows, and selects the appropriate value:
select max(case when rownum*1.0/numrows <= 0.9 then colA end) as percentile_90th
from (select colA,
row_number() over (order by colA) as rownum,
count(*) over () as numrows
from t
where colA is not null
) t
I put the condition in the SELECT clause rather than the WHERE clause, so you can easily get the 50th percentile, 17th, or whatever values you want.
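As a quick sanity check, here is the same query run against the sample data inlined with a VALUES table constructor (hypothetical setup): the filter keeps rows 100 through 900, so MAX() returns 900; changing 0.9 to 0.5 gives 500.
select max(case when rownum*1.0/numrows <= 0.9 then colA end) as percentile_90th
from (select colA,
             row_number() over (order by colA) as rownum,
             count(*) over () as numrows
      from (values (100),(200),(300),(400),(500),(600),(700),(800),(900),(1000)) v(colA)
     ) t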
WITH
percentiles AS
(
SELECT
NTILE(100) OVER (ORDER BY ColA) AS percentile,
*
FROM
data
)
SELECT
*
FROM
percentiles
WHERE
percentile = 90
Note: if the data has fewer than 100 observations, not all percentiles will have a value. Equally, if you have more than 100 observations, some percentiles will contain more than one row.
Starting with SQL Server 2012, there are PERCENTILE_DISC and PERCENTILE_CONT inverse distribution functions. These are (so far) only available as window functions, not as aggregate functions, so you have to remove the redundant per-row results caused by the lack of grouping, e.g. by using DISTINCT or TOP 1:
WITH t AS (
SELECT *
FROM (
VALUES(NULL),(100),(200),(300),
(NULL),(400),(500),(600),(700),
(800),(900),(1000)
) t(ColA)
)
SELECT DISTINCT percentile_disc(0.9) WITHIN GROUP (ORDER BY ColA) OVER()
FROM t
;
I have blogged about percentiles in more detail.
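For comparison, PERCENTILE_DISC returns an actual value from the column, while PERCENTILE_CONT interpolates between values. With the sample data above I would expect 900 and 910 respectively:
SELECT DISTINCT
    percentile_disc(0.9) WITHIN GROUP (ORDER BY ColA) OVER() AS disc_90,
    percentile_cont(0.9) WITHIN GROUP (ORDER BY ColA) OVER() AS cont_90
FROM t
;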