Overview
Given the following T-SQL query, is it possible to use
PERCENT_RANK or a similar function to get the desired percentile? If not, what is the simplest way to accomplish this goal?
SELECT rank,
ROUND(
100 * PERCENT_RANK()
OVER (ORDER BY rank DESC),
2) as pctile,
desired_pctile
FROM ( VALUES (1, 100), (1, 100), (1, 100), (2, 62.5), (3, 50), (3, 50), (4, 25), (5, 12.5) ) as X(rank, desired_pctile)
rank  pctile  desired_pctile
----  ------  --------------
5     0       12.5
4     14.29   25
3     28.57   50
3     28.57   50
2     57.14   62.5
1     71.43   100
1     71.43   100
1     71.43   100
Details
Specifically, the pctile and desired_pctile calculations are defined slightly differently.
pctile: the percentage of other records with a rank less than the current record's rank.
desired_pctile: the percentage of records with a rank less than or equal to the current record's rank.
This means that there are two distinctions:
When multiple records share the same rank, the desired behavior is to take the maximum instead of the minimum (see Python for an example).
The denominator is N-1, where N is the record count; the desired denominator is N. Is there a way to multiply pctile by COUNT(*) / (COUNT(*) - 1) in the query?
Python Equivalent
In pandas, the rank function has a method parameter for specifying how to handle multiple records that share the same value. The following Python code obtains the desired result:
pd.Series([1,1,1,2,3,3,4,5]).rank(method='max', ascending=False, pct=True)
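The pandas call above boils down to: for each element, the fraction of elements at or beyond it in the descending sort order, with ties collapsed to the worst (maximum) position. A dependency-free sketch of the same computation in plain Python (the helper name is mine):

```python
def pct_rank_max_desc(values):
    """For each value, the fraction of records with value >= it.

    With descending order and ties taking the maximum position, this
    matches rank(method='max', ascending=False, pct=True).
    """
    n = len(values)
    return [sum(1 for w in values if w >= v) / n for v in values]

print(pct_rank_max_desc([1, 1, 1, 2, 3, 3, 4, 5]))
# [1.0, 1.0, 1.0, 0.625, 0.5, 0.5, 0.25, 0.125]
```

Scaled by 100, these are exactly the desired_pctile values from the table above.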
Here is a solution that calculates a windowed form of 1 - (rank() - 1) / count(), scaled and rounded to percentages. This is a slight improvement over what I originally posted in my comment, as it eliminates the duplicate count subexpression.
SELECT
rank,
ROUND(100 - 100.0 * (RANK() OVER (ORDER BY rank) - 1)
/ (COUNT(*) OVER())
, 2) as pctile,
desired_pctile
FROM (
VALUES
(1, 100), (1, 100), (1, 100), (2, 62.5),
(3, 50), (3, 50), (4, 25), (5, 12.5)
) as X(rank, desired_pctile)
Results:
rank  pctile            desired_pctile
----  ----------------  --------------
1     100.000000000000  100.0
1     100.000000000000  100.0
1     100.000000000000  100.0
2     62.500000000000   62.5
3     50.000000000000   50.0
3     50.000000000000   50.0
4     25.000000000000   25.0
5     12.500000000000   12.5
See this db<>fiddle.
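The formula is portable enough to sanity-check from Python against SQLite's window functions (3.25+); this is a sketch, not the T-SQL original:

```python
import sqlite3

con = sqlite3.connect(":memory:")
rows = con.execute("""
    WITH x(rnk) AS (VALUES (1), (1), (1), (2), (3), (3), (4), (5))
    SELECT rnk,
           -- windowed form of 1 - (rank() - 1) / count(), as percentages
           ROUND(100 - 100.0 * (RANK() OVER (ORDER BY rnk) - 1)
                     / COUNT(*) OVER (), 2) AS pctile
    FROM x
    ORDER BY rnk
""").fetchall()
print(rows)
# [(1, 100.0), (1, 100.0), (1, 100.0), (2, 62.5),
#  (3, 50.0), (3, 50.0), (4, 25.0), (5, 12.5)]
```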
Related
I'm trying to solve a SQL puzzle. The goal is to find subsequences in which the gap between consecutive rows is less than some maximum. Think of (say) searching for suspicious credit card behaviour by looking for n transactions within m minutes.
I'm using Postgres 9.6, but a correct solution to the puzzle sticks to ANSI SQL:2008.
Input
t   amt
--  ---
1   10
4   10
16  40
20  10
30  50
60  5
61  5
62  5
63  5
72  5
90  30
create table d(t int, amt int);
insert into d
values (1, 10),
(4, 10),
(16, 40),
(20, 10),
(30, 50),
(60, 5),
(61, 5),
(62, 5),
(63, 5),
(72, 5),
(90, 30);
Expected Output
All subsequences such that the difference in t from the previous row is less than 10.
start_t  end_t  cnt  total
-------  -----  ---  -----
1        4      2    20
16       20     2    50
30       30     1    50
60       72     5    25
90       90     1    30
Notes
I've tried the "difference of ROW_NUMBER" (Tabibitosan) method, but the fact that t is not necessarily consecutive foiled my efforts.
Thank you for your help!
Flag the start of each group, then aggregate the groups:
select min(t) start_t, max(t) end_t, count(*) cnt, sum(amt) total
from (
select t, amt, sum(flag) over(order by t) grp
from (
select t, amt, case when t - lag(t, 1, t-11) over(order by t) >= 10 then 1 end flag
from d
) t
) t
group by grp
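The flag-and-sum approach is plain ANSI SQL, so it can be sketched outside Postgres as well; here it is run through SQLite's window functions from Python, reproducing the expected output:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE d(t INT, amt INT)")
con.executemany("INSERT INTO d VALUES (?, ?)",
                [(1, 10), (4, 10), (16, 40), (20, 10), (30, 50), (60, 5),
                 (61, 5), (62, 5), (63, 5), (72, 5), (90, 30)])
rows = con.execute("""
    SELECT MIN(t) AS start_t, MAX(t) AS end_t, COUNT(*) AS cnt, SUM(amt) AS total
    FROM (
        -- running sum of flags assigns a group number to each island
        SELECT t, amt, SUM(flag) OVER (ORDER BY t) AS grp
        FROM (
            -- flag rows that start a new group (gap of 10 or more);
            -- the default t - 11 makes the very first row a group start
            SELECT t, amt,
                   CASE WHEN t - LAG(t, 1, t - 11) OVER (ORDER BY t) >= 10
                        THEN 1 END AS flag
            FROM d
        ) s
    ) g
    GROUP BY grp
    ORDER BY start_t
""").fetchall()
print(rows)
# [(1, 4, 2, 20), (16, 20, 2, 50), (30, 30, 1, 50), (60, 72, 5, 25), (90, 90, 1, 30)]
```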
Imagine a large table that contains receipt information. Since it holds so much data, you are required to return a subset of the data, excluding or consolidating rows where possible.
Here is the SQL and results table showing how the data should be returned.
create table table1
(RecieptNo smallint, Customer varchar(10), ReceiptDate date,
ItemDesc varchar(10), Amount smallint)
insert into table1 values
(100, 'Matt','2022-01-05','Ball', 10),
(101, 'Mark','2022-01-07','Hat', 20),
(101, 'Mark','2022-01-07','Jumper', -20),
(101, 'Mark','2022-01-14','Spoon', 30),
(102, 'Luke','2022-01-15','Fork', 15),
(102, 'Luke','2022-01-17','Spork', -10),
(103, 'John','2022-01-20','Orange', 13),
(103, 'John','2022-01-25','Pear', 12)
If there are rows on the same receipt where the negative and positive values cancel out, do not return either row.
If a receipt has a negative amount that does not exceed the positive amount, the negative amount should be deducted from the positive line.
RecieptNo  Customer  ReceiptDate  ItemDesc  Amount
---------  --------  -----------  --------  ------
100        Matt      2022-01-05   Ball      10
101        Mark      2022-01-14   Spoon     30
102        Luke      2022-01-15   Fork      5
103        John      2022-01-20   Orange    13
103        John      2022-01-25   Pear      12
This is proving tricky; any ideas?
Based on the table you provided, I suppose you want only the row with the earliest date when multiple rows on the same receipt produce a positive Amount after deduction.
;WITH cte AS (
SELECT *
, SUM( amount) OVER (PARTITION BY RecieptNo ORDER BY RecieptNo, ReceiptDate ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS ActualAmount
, ROW_NUMBER() OVER (PARTITION BY RecieptNo ORDER BY RecieptNo, ReceiptDate) AS rn
FROM table1)
SELECT RecieptNo, Customer, ReceiptDate, ItemDesc, ActualAmount
FROM cte
WHERE ActualAmount > 0 AND rn = 1
Do read up on window functions and CTEs, though.
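For reference, here is a sketch of what this CTE actually returns against the sample data, run through SQLite from Python (ItemDesc is omitted because ROW_NUMBER's tie-break between two rows sharing the same date is arbitrary; the whole-partition SUM is written without a frame clause, which is equivalent):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE table1
               (RecieptNo INT, Customer TEXT, ReceiptDate TEXT,
                ItemDesc TEXT, Amount INT)""")
con.executemany("INSERT INTO table1 VALUES (?, ?, ?, ?, ?)",
    [(100, 'Matt', '2022-01-05', 'Ball', 10),
     (101, 'Mark', '2022-01-07', 'Hat', 20),
     (101, 'Mark', '2022-01-07', 'Jumper', -20),
     (101, 'Mark', '2022-01-14', 'Spoon', 30),
     (102, 'Luke', '2022-01-15', 'Fork', 15),
     (102, 'Luke', '2022-01-17', 'Spork', -10),
     (103, 'John', '2022-01-20', 'Orange', 13),
     (103, 'John', '2022-01-25', 'Pear', 12)])
rows = con.execute("""
    WITH cte AS (
        SELECT *,
               -- net amount for the whole receipt
               SUM(Amount) OVER (PARTITION BY RecieptNo) AS ActualAmount,
               ROW_NUMBER() OVER (PARTITION BY RecieptNo
                                  ORDER BY ReceiptDate) AS rn
        FROM table1)
    SELECT RecieptNo, Customer, ReceiptDate, ActualAmount
    FROM cte
    WHERE ActualAmount > 0 AND rn = 1
    ORDER BY RecieptNo
""").fetchall()
print(rows)
```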
I'm having an issue where I'm using recursive subquery factoring to use the previous row's values as the next row's values. The problem is that I need to stop using the previous row's values when my product_key changes.
CREATE TABLE MAKE_IT_WORK
(
PRODUCT_KEY NUMBER,
WEEK NUMBER,
OPENING_STOCK NUMBER,
INTAKE NUMBER,
SALES NUMBER,
CLOSING_STOCK NUMBER,
FORWARD_COVER NUMBER
);
Insert into MAKE_IT_WORK (PRODUCT_KEY, WEEK)
Values (1, 1);
Insert into MAKE_IT_WORK (PRODUCT_KEY, WEEK, INTAKE, SALES)
Values (1, 2, 1000, 80);
Insert into MAKE_IT_WORK (PRODUCT_KEY, WEEK, SALES)
Values (1, 3, 70);
Insert into MAKE_IT_WORK (PRODUCT_KEY, WEEK, SALES)
Values (1, 4, 90);
Insert into MAKE_IT_WORK (PRODUCT_KEY, WEEK, SALES)
Values (2, 1, 0);
Insert into MAKE_IT_WORK (PRODUCT_KEY, WEEK, INTAKE, SALES)
Values (2, 2, 6000, 500);
Insert into MAKE_IT_WORK (PRODUCT_KEY, WEEK, SALES)
Values (2, 3, 70);
Insert into MAKE_IT_WORK (PRODUCT_KEY, WEEK, SALES)
Values (2, 4, 350);
CURRENT QUERY
with master
as(select product_key,week,opening_stock ,intake,sales,closing_stock,forward_cover,row_number()over( order by 1) lvl,product_key-1 pkey
from make_it_work),
bdw_knows_best(product_key,week,opening_stock,intake,sales,closing_stock,forward_cover,lvl,pkey) as
(select product_key
,week
,opening_stock
,nvl(intake,0)intake
,sales
,closing_stock
,forward_cover
,lvl
,pkey
from master
where lvl = 1
union all
select a.product_key
,a.week
,case when b.closing_stock < 0 then 0
else b.closing_stock
end opening_stock
,nvl(a.intake,0)intake
,nvl(a.sales,0) sales
,case when nvl(b.closing_stock,0) + nvl(a.intake,0) - nvl(a.sales,0) < 0 THEN 0
else nvl(b.closing_stock,0) + nvl(a.intake,0) - nvl(a.sales,0)
end closing_stock
,a.forward_cover
,b.lvl +1
,a.pkey pkey
from master a,
bdw_knows_best b
where a.lvl = b.lvl +1
)
select product_key,week,opening_stock,intake,sales,closing_stock,forward_cover,lvl,pkey from bdw_knows_best;
REQUIRED
When the product key changes from 1 to 2, I need to use the values from Product_Key 2 and not the last records from Product_Key 1. I need to somehow group by Product_Key buckets (so to speak).
Any help or ideas would be highly appreciated.
You don't need a recursive CTE. Window functions (the OVER clause) will produce the result you want. For example:
select product_key, week, opening_stock, intake, sales,
coalesce(opening_stock, 0)
+ sum(intake) over(partition by product_key order by week)
- sum(sales) over(partition by product_key order by week)
as closing_stock
from make_it_work
order by product_key, week;
Result:
PRODUCT_KEY WEEK OPENING_STOCK INTAKE SALES CLOSING_STOCK
------------ ----- -------------- ------- ------ -------------
1 1
1 2 1000 80 920
1 3 70 850
1 4 90 760
2 1 0
2 2 6000 500 5500
2 3 70 5430
2 4 350 5080
See running example at db<>fiddle.
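The same query can be sketched in SQLite from Python (the unused closing_stock and forward_cover columns are dropped here for brevity; PARTITION BY product_key is what resets the running sums at each product boundary):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE make_it_work
               (product_key INT, week INT, opening_stock INT,
                intake INT, sales INT)""")
con.executemany(
    "INSERT INTO make_it_work VALUES (?, ?, ?, ?, ?)",
    [(1, 1, None, None, None), (1, 2, None, 1000, 80),
     (1, 3, None, None, 70),   (1, 4, None, None, 90),
     (2, 1, None, None, 0),    (2, 2, None, 6000, 500),
     (2, 3, None, None, 70),   (2, 4, None, None, 350)])
rows = con.execute("""
    SELECT product_key, week,
           -- running intake minus running sales, restarted per product
           COALESCE(opening_stock, 0)
             + SUM(intake) OVER (PARTITION BY product_key ORDER BY week)
             - SUM(sales)  OVER (PARTITION BY product_key ORDER BY week)
             AS closing_stock
    FROM make_it_work
    ORDER BY product_key, week
""").fetchall()
print(rows)
# closing_stock is NULL in week 1 of each product, since no intake has occurred yet
```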
I have the following code with a non-nested CASE expression (as can be seen, there is only one CASE expression) with 13 branches, yet I get an error saying
"Case expressions may only be nested to level 10".
It does work when I use fewer branches, but that doesn't really help me because I need all of them for my required result set.
select ID,
CASE
when sum(Y) between 0 and 30 then 1
when sum(Y) between 31 and 70 then 2
when sum(Y) between 71 and 100 then 3
when sum(Y) between 101 and 200 then 4
when sum(Y) between 201 and 300 then 5
when sum(Y) between 301 and 400 then 6
when sum(Y) between 401 and 500 then 7
when sum(Y) between 501 and 600 then 8
when sum(Y) between 601 and 700 then 9
when sum(Y) between 701 and 800 then 10
when sum(Y) between 801 and 900 then 11
when sum(Y) between 901 and 1000 then 12
when sum(Y) > 1000 then 13
end
from X
group by ID
;
I did manage to solve the issue by splitting my CASE expression across two SELECT statements and using UNION between them to get my required result set, but it feels as if I could have done a better job, as the original CASE expression is not nested.
The easiest way to do this is to join to a VALUES table constructor (a virtual table).
Because the value is an aggregate, it may be easier to do this in a correlated subquery rather than putting the whole thing into a CTE:
select
ID,
(
SELECT v.val
FROM (VALUES
(0, 30, 1),
(31, 70, 2),
(71, 100, 3),
(101, 200, 4),
(201, 300, 5),
(301, 400, 6),
(401, 500, 7),
(501, 600, 8),
(601, 700, 9),
(701, 800, 10),
(801, 900, 11),
(901, 1000, 12),
(1001, 2147483647, 13)
) v(nStart, nEnd, val)
WHERE SUM(y) BETWEEN v.nStart AND v.nEnd
)
from X
group by ID;
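The same lookup-table idea can be sketched in SQLite from Python. SQLite does not accept column aliases on a derived VALUES table, so the bands go into a named-column CTE instead, and the id/y sample data here is invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE x(id INT, y INT)")
con.executemany("INSERT INTO x VALUES (?, ?)",
                [(1, 10), (1, 15), (2, 40), (2, 45), (3, 600), (4, 1500)])
rows = con.execute("""
    WITH totals AS (SELECT id, SUM(y) AS sy FROM x GROUP BY id),
         -- one row per band: (lower bound, upper bound, bucket value)
         v(nStart, nEnd, val) AS (VALUES
            (0, 30, 1), (31, 70, 2), (71, 100, 3), (101, 200, 4),
            (201, 300, 5), (301, 400, 6), (401, 500, 7), (501, 600, 8),
            (601, 700, 9), (701, 800, 10), (801, 900, 11),
            (901, 1000, 12), (1001, 2147483647, 13))
    SELECT t.id, t.sy, v.val
    FROM totals t
    JOIN v ON t.sy BETWEEN v.nStart AND v.nEnd
    ORDER BY t.id
""").fetchall()
print(rows)
# [(1, 25, 1), (2, 85, 3), (3, 600, 8), (4, 1500, 13)]
```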
This is a known problem with linked servers: SQL Server internally rewrites a CASE expression with many WHEN branches as nested CASE expressions, which can hit the level-10 limit even when your own code is not nested. One workaround in your case is to use arithmetic to simplify the logic:
select ID,
(case when sum(Y) between 0 and 30 then 1
when sum(Y) between 31 and 70 then 2
when sum(Y) between 71 and 100 then 3
when sum(Y) > 1000 then 13
else ceiling(sum(Y) / 100.0) + 2
end)
from X
group by ID
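The arithmetic works because above 100 every band is exactly 100 wide. A quick Python check of the mapping (bucket is a hypothetical helper mirroring the CASE expression above):

```python
import math

def bucket(s):
    """Mirror of the CASE expression: irregular bands below 100,
    then ceil(s / 100) + 2 for the regular 100-wide bands up to 1000."""
    if 0 <= s <= 30:
        return 1
    if 31 <= s <= 70:
        return 2
    if 71 <= s <= 100:
        return 3
    if s > 1000:
        return 13
    return math.ceil(s / 100) + 2

# band edges on both sides, to confirm the formula agrees with the CASE
print([bucket(s) for s in (30, 31, 100, 101, 200, 201, 1000, 1001)])
# [1, 2, 3, 4, 4, 5, 12, 13]
```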
I have a table that has the below data.
COUNTRY LEVEL NUM_OF_DUPLICATES
US 9 6
US 8 24
US 7 12
US 6 20
US 5 39
US 4 81
US 3 80
US 2 430
US 1 178
US 0 430
I wrote a query that will calculate the sum of cumulative rows and got the below output .
COUNTRY LEVEL NUM_OF_DUPLICATES POOL
US 9 6 6
US 8 24 30
US 7 12 42
US 6 20 62
US 5 39 101
US 4 81 182
US 3 80 262
US 2 130 392
US 1 178 570
US 0 254 824
Now I want to filter the data and take only rows where POOL <= 300; if no row has POOL exactly 300, I should also take the first row after 300. In the example above no row has POOL = 300, so we take the next value after 300, which is 392. So I need a query that pulls the records with POOL <= 392 (as per the example above), which yields this output:
COUNTRY LEVEL NUM_OF_DUPLICATES POOL
US 9 6 6
US 8 24 30
US 7 12 42
US 6 20 62
US 5 39 101
US 4 81 182
US 3 80 262
US 2 130 392
Please let me know your thoughts. Thanks in advance.
declare #t table(Country varchar(5), Level int, Num_of_Duplicates int)
insert into #t(Country, Level, Num_of_Duplicates)
values
('US', 9, 6),
('US', 8, 24),
('US', 7, 12),
('US', 6, 20),
('US', 5, 39),
('US', 4, 81),
('US', 3, 80),
('US', 2, 130/*-92*/),
('US', 1, 178),
('US', 0, 430);
select *, sum(Num_of_Duplicates) over(partition by country order by Level desc),
(sum(Num_of_Duplicates) over(partition by country order by Level desc)-Num_of_Duplicates) / 300 as flag,--any row which starts before 300 will have flag=0
--or
case when sum(Num_of_Duplicates) over(partition by country order by Level desc)-Num_of_Duplicates < 300 then 1 else 0 end as startsbefore300
from #t;
select *
from
(
select *, sum(Num_of_Duplicates) over(partition by country order by Level desc) as Pool
from #t
) as t
where Pool - Num_of_Duplicates < 300 ;
The logic here is quite simple:
Calculate the running-sum POOL value up to the current row.
Filter rows so that the previous row's running total is < 300; you can either subtract the current row's value or use a second SUM.
If the total up to the current row is exactly 300, the previous row's total is less, so this row is included.
If the current row's total is more than 300 but the previous row's is less, it is also included.
All later rows are excluded.
It's unclear what ordering you want; I've used the NUM_OF_DUPLICATES column ascending, but you may want something else.
SELECT
COUNTRY,
LEVEL,
NUM_OF_DUPLICATES,
POOL
FROM (
SELECT *,
POOL = SUM(NUM_OF_DUPLICATES) OVER (ORDER BY NUM_OF_DUPLICATES ROWS UNBOUNDED PRECEDING)
-- alternative calculation
    -- ,POOLPrev = SUM(NUM_OF_DUPLICATES) OVER (ORDER BY NUM_OF_DUPLICATES ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
FROM YourTable
) t
WHERE POOL - NUM_OF_DUPLICATES < 300;
-- you could also use POOLPrev above
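Here is a sketch of that filter in SQLite from Python, ordering by LEVEL descending to match the question's output and using the cumulative figures the asker's POOL column implies (130 and 254 for levels 2 and 0; the column names are abbreviated here):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t(country TEXT, level INT, num INT)")
con.executemany("INSERT INTO t VALUES ('US', ?, ?)",
                [(9, 6), (8, 24), (7, 12), (6, 20), (5, 39),
                 (4, 81), (3, 80), (2, 130), (1, 178), (0, 254)])
rows = con.execute("""
    SELECT country, level, num, pool
    FROM (SELECT *,
                 SUM(num) OVER (ORDER BY level DESC) AS pool
          FROM t)
    -- keep a row whenever the running total *before* it is under 300
    WHERE pool - num < 300
    ORDER BY level DESC
""").fetchall()
print([r[3] for r in rows])
# [6, 30, 42, 62, 101, 182, 262, 392]
```

The level-2 row (pool 392) is kept because the total before it is 262, while the level-1 row is the first whose preceding total already exceeds 300.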
I used two temp tables to get the answer.
DECLARE #t TABLE(Country VARCHAR(5), [Level] INT, Num_of_Duplicates INT)
INSERT INTO #t(Country, Level, Num_of_Duplicates)
VALUES ('US', 9, 6),
('US', 8, 24),
('US', 7, 12),
('US', 6, 20),
('US', 5, 39),
('US', 4, 81),
('US', 3, 80),
('US', 2, 130),
('US', 1, 178),
('US', 0, 254);
SELECT
Country
,Level
, Num_of_Duplicates
, SUM (Num_of_Duplicates) OVER (ORDER BY id) AS [POOL]
INTO #temp_table
FROM
(
SELECT
Country,
level,
Num_of_Duplicates,
ROW_NUMBER() OVER (ORDER BY country) AS id
FROM #t
) AS A
SELECT
[POOL],
ROW_NUMBER() OVER (ORDER BY [POOL] ) AS [rank]
INTO #Temp_2
FROM #temp_table
WHERE [POOL] >= 300
SELECT *
FROM #temp_table WHERE
[POOL] <= (SELECT [POOL] FROM #Temp_2 WHERE [rank] = 1 )
DROP TABLE #temp_table
DROP TABLE #Temp_2