SQL - Group Values by Percentile/Merge Rankings

I have multiple tables that contain the name of a company/attribute and a ranking.
I would like to write a piece of code that places a range of scores into specific groups based on each score's percentile relative to the table's score total. I have provided a very simple use case to demonstrate what I am looking for, splitting a group of 10 companies into 5 groups, but I would like to scale this so that the 5 groups can be applied to data sets with many rows WITHOUT having to specify values in a CASE statement.

You can use NTILE to divide the data into 5 buckets based on score. However, if the rows can't be divided evenly, or if there are ties, some buckets will have more members than others; for example, 11 rows split with NTILE(5) gives bucket sizes 3, 2, 2, 2, 2.
SELECT t.*, NTILE(5) OVER(ORDER BY score) as grp
FROM tablename t

NTILE(5) OVER(ORDER BY score) might actually put rows with the same value into different quantiles (this is probably not what you want; at least I never liked that).
It's quite similar to
5 * (row_number() over (order by score) - 1) / count(*) over ()
but when the number of rows can't be evenly divided, the remainder rows are added to the first quantiles when using NTILE, and more or less arbitrarily when using ROW_NUMBER.
To assign all the rows with the same value to the same quantile, you need to do your own calculation:
5 * (rank() over (order by score) - 1) / count(*) over ()
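Putting that together, a minimal sketch of the full query; the table and column names (companies, name, score) are assumptions for illustration, not from the original question:
SELECT c.name,
       c.score,
       -- integer division yields quantiles 0..4; tied scores share a rank and thus a quantile
       5 * (RANK() OVER (ORDER BY c.score) - 1) / COUNT(*) OVER () AS quantile
FROM companies c;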

You can try using ROW_NUMBER() and CEILING():
SELECT t.name, t.score,
       CEILING(ROW_NUMBER() OVER (ORDER BY t.score) / 2.0) AS grp
FROM YourTable t
This assigns every two consecutive rows (ordered by score) to the same group, using the ROW_NUMBER() result. Note the division by 2.0 rather than 2: ROW_NUMBER() returns an integer, so dividing by 2 would truncate before CEILING() runs; the alias grp also avoids the reserved word GROUP.
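To scale this to any row count without hard-coding the divisor, a hedged variant derives the group size from the total row count (same assumed table and columns as above):
SELECT t.name, t.score,
       -- group size = total rows / 5, so five groups regardless of table size
       CEILING(ROW_NUMBER() OVER (ORDER BY t.score)
               / (COUNT(*) OVER () / 5.0)) AS grp
FROM YourTable t;
With 10 rows the group size is 2.0 and the result matches the query above; with larger tables each group simply grows proportionally.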

Related

Count half of rest of a partition by from position

I'm trying to achieve the following results:
Currently, the group comes from
SUM(CASE WHEN seqnum <= (0.5 * seqnum_rev) THEN i.[P&L] END) OVER(PARTITION BY i.bracket_label ORDER BY i.event_id) AS [P&L 50%],
What I need is, for each row, to count the rows remaining from that position to the end of the partition (seq_inv) and to sum the [P&L] amounts for only the first half of those rows, starting from that position.
For example, when
seq = 2
seq_inv = 13; half of it is 6, so I need to sum the next 6 positions starting from seq = 2.
When seq = 4 there are 11 positions until the end (seq_inv = 11), so half is 5, and I want to sum 5 positions starting from seq = 4.
I hope this makes sense. I'm trying to come up with a rule that can adapt to my case, since the PARTITION BY is what gives me the numbers that need to be summed.
I was also wondering whether there is something like a PARTITION BY TOP 50 PERCENT, but I guess that doesn't exist.
I have the advantage that I've helped him before and have a little extra context.
That context is that this is just the later stage of a very long chain of common table expressions. That means self-joins and/or correlated sub-queries are unfortunately expensive.
Preferably, this should be answerable using window functions, as the data set is already available in the appropriate ordering and partitioning.
My reading is this...
The SUM(5:9) (meaning the sum of rows 5 to 9, inclusive) is equal to SUM(5:end) - SUM(10:end). For instance, with ten rows whose values equal their positions: SUM(5:9) = 5+6+7+8+9 = 35, and SUM(5:end) - SUM(10:end) = 45 - 10 = 35.
That leads me to this...
WITH
cumulative AS
(
    SELECT
        *,
        -- running total from each row to the end of its partition (note the DESC)
        SUM([P&L]) OVER (PARTITION BY bracket_label ORDER BY event_id DESC) AS cumulative_p_and_l
    FROM
        data
)
SELECT
    *,
    cumulative_p_and_l - LEAD(cumulative_p_and_l, seq_inv/2, 0) OVER (PARTITION BY bracket_label ORDER BY event_id) AS p_and_l_50_perc,
    cumulative_p_and_l - LEAD(cumulative_p_and_l, seq_inv/4, 0) OVER (PARTITION BY bracket_label ORDER BY event_id) AS p_and_l_25_perc
FROM
    cumulative
NOTE: Using characters such as & and % in column names is horrendous, don't do it ;)
EDIT: Corrected the ORDER BY in the cumulative sum.
I don't think that window functions can do what you want. You could use a correlated subquery instead, with the following logic:
select
    t.*,
    (
        select sum(t1.[P&L])
        from mytable t1
        where t1.bracket_label = t.bracket_label
          and t1.seq - t.seq between 0 and t.seq_inv/2 - 1
    ) as [P&L 50%]
from mytable t
The upper bound is seq_inv/2 - 1 so that exactly seq_inv/2 rows are summed, matching the example (seq = 2, seq_inv = 13 sums the 6 rows at offsets 0 through 5), and the bracket_label condition keeps the sum within the same partition.

SQL Islands and Gaps - Merge Contiguous Records if Relevant Fields Hold the Same Values

I have created a test case for my problem here: https://rextester.com/ZRXSQ14415
It's much easier to show the problem than to explain what I am trying to achieve.
I have a list of records across time and I wish to merge contiguous records into a single record.
Each record has a period date, risk levels, and a couple of flags. When the risks and flags are the same, the records should be merged; when they are different, they should remain separate rows.
In the Rextester example I have almost achieved my goal; however, look at rows 3 + 4 of the result.
What I want to achieve is that rows 3 + 4 would be combined, so that row 3 becomes:
StartDate End Date Name ... ...
17.03.2019 20.03.2019 CPWJ40-A ... ...
As all flags and risk levels are the same.
Change the SEQ expression to
..
ROW_NUMBER() OVER (ORDER BY PeriodDate) - ROW_NUMBER() OVER (PARTITION BY ImplicitRisk, QCReadyRisk, IsQualityControlReady, ActivePeriod ORDER BY PeriodDate) AS SEQ
..
This way you'll get the proper grouping of islands of ImplicitRisk, QCReadyRisk, IsQualityControlReady, ActivePeriod: the difference of the two ROW_NUMBER()s stays constant within each run of consecutive rows that share the same values, so it identifies each island.
This answer is purely to complement Serg's answer with the full query.
SELECT MIN(d.PeriodDate) AS StartDate,
       MAX(d.PeriodDate) AS EndDate,
       ImplicitRisk,
       QcReadyRisk,
       IsQualityControlReady,
       ActivePeriod,
       LocationEventName
FROM
(
    SELECT c.*,
           ROW_NUMBER() OVER (ORDER BY PeriodDate) - ROW_NUMBER() OVER (PARTITION BY LocationEventId, ImplicitRisk, QCReadyRisk, IsQualityControlReady, ActivePeriod ORDER BY PeriodDate) AS grp
    FROM tab c
    --order by PeriodDate
) d
GROUP BY ImplicitRisk, QcReadyRisk, IsQualityControlReady, ActivePeriod, LocationEventName, grp
ORDER BY 1

Retrieve the Median from a Decimal Column Using PERCENTILE_CONT

I have a table Prices like:
ID          PurchasePriceCalc
0146301     0.002875161
00006L00    0.00396
00087G03    NULL
00001G04    0.0020004
00006S      0.003689818
01580h01    NULL
00082EE00   0.002462687
00038R05    0.002237565
01666R01    0.002666667
I would like to get the median of PurchasePriceCalc and then subtract the median from each PurchasePriceCalc; put differently, the formula should be: (PurchasePriceCalc - Median(PurchasePriceCalc)).
I'm using the query below, but it is not working:
SELECT ID,PurchasePriceCalc, PurchasePriceCalc - PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY PurchasePriceCalc)
OVER (PARTITION BY ID) AS MediaCalc FROM Prices
This is how the output should look (the yellow column in my screenshot).
Any assistance or help would be really appreciated!
I guess the problem is with OVER (PARTITION BY ID). If ID is UNIQUE, then each group consists of only one row, which is why you get all values equal to 0/NULL.
You should remove the PARTITION BY part.
SELECT ID, PurchasePriceCalc,
       PurchasePriceCalc - PERCENTILE_CONT(0.5)
           WITHIN GROUP (ORDER BY PurchasePriceCalc) OVER () AS MediaCalc
FROM Prices;
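As a quick sanity check against the sample data above: PERCENTILE_CONT ignores NULLs, so the median of the seven non-NULL values is 0.002666667, and the row with ID 0146301 gets MediaCalc = 0.002875161 - 0.002666667 = 0.000208494.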

Splitting table PK values into roughly same-size ranges

I have a table in Postgres with about half a million rows and an integer primary key.
I'd like to split its entire PK space into N ranges of approximately same size for independent processing. How do I best do it?
I apparently can do it by fetching all PK values to a client and remembering every (count/N)-th value. But this does a full scan and a fetch of all the values, while I want no more than N+1 of them.
I can select the min and max values and cut the range into equal parts, but if the PKs are not distributed evenly, it may give me ranges of seriously different sizes.
I want the ranges for index-based access later on, so modulo-based tricks do not apply.
Is there any nice SQL-based solution that does not involve fetching all the keys to a client? Writing an N-specific query, e.g. with N clauses, is fine.
An example:
IDs in a range, say, from 1234 to 567890, N = 4.
I'd like to get 3 boundary values, say 127123, 254789, 379860, so that there are approximately 125k records in each of the ID ranges [1234, 127123], [127123, 254789], [254789, 379860], [379860, 567890].
Update:
I've come up with a solution like this:
select
percentile_disc(0.25) within group (order by c.id) over() as pct_25
,percentile_disc(0.50) within group (order by c.id) over() as pct_50
,percentile_disc(0.75) within group (order by c.id) over() as pct_75
from customer c
limit 1
;
It does a decent job of giving me the exact range boundaries, and runs only a few seconds, which is fine for my purposes.
What bothers me is that I have to add the limit 1 clause to get just one row. Without it, I receive identical rows, one per record in the table. Is there a better way to get just one row of the percentiles?
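For what it's worth, percentile_disc can also be called as a plain ordered-set aggregate, without an OVER clause, which collapses the result to a single row so no limit 1 is needed; a minimal sketch against the same customer table:
select
    percentile_disc(0.25) within group (order by c.id) as pct_25
    ,percentile_disc(0.50) within group (order by c.id) as pct_50
    ,percentile_disc(0.75) within group (order by c.id) as pct_75
from customer c;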
I think you can use row_number() for this purpose. Something like this:
select t.*,
floor((seqnum * N) / cnt) as range
from (select t.*,
row_number() over (order by pk) - 1 as seqnum,
count(*) over () as cnt
from t
) t;
This assumes that by "range" you mean ranges on pk values. You can also move the range expression to a where clause to select just one particular range.
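If only the boundary values themselves are needed rather than a label on every row, a hedged follow-up sketch (same assumed table t and column pk as above, with N = 4) picks the smallest pk in each computed range:
select min(pk) as range_start
from (select t.*,
             -- same formula as above with N = 4; integer division buckets rows into 4 ranges
             (row_number() over (order by pk) - 1) * 4 / count(*) over () as rng
      from t
     ) t
group by rng
order by rng;
The first row returned is the table's overall minimum; the remaining rows are the internal boundaries from the question's example.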

Dynamic Parameter for NTILE SQL Function

I'm attempting to provide a grouping id to a set of items using NTILE(). Basically, every 4 items should be grouped together with the same GroupID. The problem is that the total number of rows is different per id. Is this possible?
SELECT
ProductDescription AS LabelType1,
NTILE(FLOOR(COUNT(bc.Groupings) / 4)) OVER (ORDER BY s.OrderId) AS GroupNumber,
Barcode AS Barcode1
FROM
dbo.table1 s
INNER JOIN
#BoxCounts bc ON s.OrderId = bc.OrderId
This is an elaboration on Ben Thul's comment (because he has not answered the question).
NTILE() divides a set of rows into n almost equal-sized groups. The n is a constant.
You want to assign a grouping id to a fixed number of rows. That is a different problem and easily handled with row_number() or rank().
So, one method is:
SELECT ProductDescription AS LabelType1,
(ROW_NUMBER() OVER (ORDER BY s.OrderId) - 1) / 4 as GroupNumber,
Barcode AS Barcode1
FROM dbo.table1 s INNER JOIN
#BoxCounts bc
ON s.OrderId = bc.OrderId;
Note the - 1 in the calculation, so the first group has four elements. Also, SQL Server does integer division, so you don't have to worry about additional decimal places.
If you could have ties and want all rows with the same OrderId to be in the same group, then use dense_rank() (if you want all groups to have four different order ids) or rank() (if you want all groups to have approximately four order ids).
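For the tie-handling case, a hedged sketch of the dense_rank() variant (same tables and columns as the query above), which puts every four distinct OrderId values into one group:
SELECT ProductDescription AS LabelType1,
       -- ties on OrderId share a dense rank, so they always land in the same group
       (DENSE_RANK() OVER (ORDER BY s.OrderId) - 1) / 4 AS GroupNumber,
       Barcode AS Barcode1
FROM dbo.table1 s INNER JOIN
     #BoxCounts bc
     ON s.OrderId = bc.OrderId;
Swapping DENSE_RANK() for RANK() gives groups of approximately four order ids instead, as noted above.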