SQL Server 2012 buckets based on running total

For SQL Server 2012, I am trying to assign rows to sequential buckets based on a maximum bucket size (100 in the sample below) and the running total of a column. Most of the solutions I found partition by a known column value, e.g. partition by department id, but in this situation all I have is a sequential id and a size. The closest solution I have found is discussed in this thread for SQL Server 2008; I tried it, but the performance is very slow for large row sets, much worse than a cursor-based solution. https://dba.stackexchange.com/questions/45179/how-can-i-write-windowing-query-which-sums-a-column-to-create-discrete-buckets
This table can contain up to 10 million rows. Since SQL Server 2012 supports SUM OVER with framing as well as the LAG and LEAD functions, I am wondering if someone can suggest a solution based on 2012.
CREATE TABLE raw_data (
    id INT PRIMARY KEY
    , size INT NOT NULL
);
INSERT INTO raw_data
(id, size)
VALUES
( 1, 96) -- new bucket here, maximum bucket size is 100
, ( 2, 10) -- and here
, ( 3, 98) -- and here
, ( 4, 20)
, ( 5, 50)
, ( 6, 15)
, ( 7, 97)
, ( 8, 96) -- and here
;
--Expected output
--bucket_size is for illustration only, actual needed output is bucket only
id size bucket_size bucket
-----------------------------
1 100 100 1
2 10 10 2
3 98 98 3
4 20 85 4
5 50 85 4
6 15 85 4
7 97 98 5
8 1 98 5
TIA

You can achieve this quite easily in SQL Server 2012 using a window function and framing. The syntax looks quite complex, but the concept is simple - sum all the previous rows up to and including the current one. The cumulative_bucket_size column in this example is for demonstration purposes, as it is part of the equation used to derive the bucket number:
DECLARE @Bucket_Size AS INT;
SET @Bucket_Size = 100;

SELECT
    id,
    size,
    SUM(size) OVER (
        PARTITION BY 1 ORDER BY id ASC
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS cumulative_bucket_size,
    1 + SUM(size) OVER (
        PARTITION BY 1 ORDER BY id ASC
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) / @Bucket_Size AS bucket
FROM
    raw_data
The PARTITION BY clause is optional, but would be useful if you had different "bucket sets" for column groupings. I have added it here for completeness.
Results:
id size cumulative_bucket_size bucket
------------------------------------------
1 96 96 1
2 10 106 2
3 98 204 3
4 20 224 3
5 50 274 3
6 15 289 3
7 97 386 4
8 96 482 5
You can read more about window framing in the following article:
https://www.simple-talk.com/sql/learn-sql-server/window-functions-in-sql-server-part-2-the-frame/
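A related performance note, since the question mentions 10 million rows: when ROWS is omitted from a window frame, SQL Server 2012 defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which is generally slower because it is processed through an on-disk spool. The sketch below returns the same cumulative totals as the query above, but relies on that implicit RANGE frame, so the explicit ROWS form is usually the better choice on large tables:
-- Same cumulative total as above, but with the implicit (and usually
-- slower) RANGE frame that applies when ROWS is omitted.
SELECT
    id,
    size,
    SUM(size) OVER (ORDER BY id) AS cumulative_bucket_size
FROM raw_data;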

Before you can use the running-total method to assign bucket numbers, you need to generate that bucket_size column, because the bucket numbers are derived from it.
Based on your expected output, the bucket ranges are
1..10
11..85
86..100
You could use a simple CASE expression like this to generate a bucket_size column like in your example:
CASE
    WHEN size <= 10 THEN 10
    WHEN size <= 85 THEN 85
    ELSE 100
END
Then you would use LAG() to determine if a row starts a new sequence of sizes belonging to the same bucket:
CASE bucket_size
    WHEN LAG(bucket_size) OVER (ORDER BY id) THEN 0
    ELSE 1
END
These two calculations could be done in the same (sub)query with the help of CROSS APPLY:
SELECT
    d.id,
    d.size,
    x.bucket_size,  -- for illustration only
    is_new_seq = CASE x.bucket_size
                     WHEN LAG(x.bucket_size) OVER (ORDER BY d.id) THEN 0
                     ELSE 1
                 END
FROM dbo.raw_data AS d
CROSS APPLY
(
    SELECT
        CASE
            WHEN size <= 10 THEN 10
            WHEN size <= 85 THEN 85
            ELSE 100
        END
) AS x (bucket_size)
The above query would produce this output:
id size bucket_size is_new_seq
-- ---- ----------- ----------
1 96 100 1
2 10 10 1
3 98 100 1
4 20 85 1
5 50 85 0
6 15 85 0
7 97 100 1
8 96 100 0
Now use that result as a derived table and apply SUM() OVER to is_new_seq to produce the bucket numbers, like this:
SELECT
    id,
    size,
    bucket = SUM(is_new_seq) OVER (ORDER BY id)
FROM
(
    SELECT
        d.id,
        d.size,
        is_new_seq = CASE x.bucket_size
                         WHEN LAG(x.bucket_size) OVER (ORDER BY d.id) THEN 0
                         ELSE 1
                     END
    FROM dbo.raw_data AS d
    CROSS APPLY
    (
        SELECT
            CASE
                WHEN size <= 10 THEN 10
                WHEN size <= 85 THEN 85
                ELSE 100
            END
    ) AS x (bucket_size)
) AS s
;
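Running the combined query against the sample data, the is_new_seq values shown earlier (1, 1, 1, 1, 0, 0, 1, 0) accumulate into the bucket numbers from the question's expected output:
id  size  bucket
----------------
1   96    1
2   10    2
3   98    3
4   20    4
5   50    4
6   15    4
7   97    5
8   96    5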

Related

If value is 0 (zero) then increment it with max number +1

I need to UPDATE all newly inserted values of 0 with the highest value from the same column + 1; any value of zero should be updated with the highest value + 1. If the highest value in the Preference column below is 30, then the next value should be 31 for Id 11 and 32 for Id 12. New values are inserted every 30 seconds, possibly several at a time, from a source table that I have no access to, into the table below (table 1).
The UPDATE statement is executed when a user drags and drops a row in the web app.
UPDATE [DB].[dbo].[tbl1] SET
    Preference = @Preference
WHERE Id = @Id
I need to somehow add that logic to this statement described above. This is where I am lost.
Any ideas? Thank you for the help!!
For example:
ID  Preference  Account
--  ----------  -------
 3           7       22
 6           8       33
 7           9       44
 9           0       55
11           0       66
Required results:
ID  Preference  Account
--  ----------  -------
 3           7       22
 6           8       33
 7           9       44
 9          10       55
11          11       66
Gather the current maximum preference using a CROSS APPLY (or you could use a CROSS JOIN), and together with ROW_NUMBER() ordered by id you can increment the preference as described:
with CTE as (
    select id, preference, cp.maxpref,
           row_number() over (order by id) as rn
    from mytable
    cross apply (select max(preference) as maxpref
                 from mytable p) cp
    where preference = 0
)
update CTE
set preference = maxpref + rn
where preference = 0
Checking the table after the update:
select *
from mytable
order by id
ID  Preference  Account
--  ----------  -------
 3           7       22
 6           8       33
 7           9       44
 9          10       55
11          11       66
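As mentioned above, a CROSS JOIN works just as well as CROSS APPLY here; a minimal sketch of that variant, using the same table and column names:
with CTE as (
    select id, preference, m.maxpref,
           row_number() over (order by id) as rn
    from mytable
    -- the unfiltered subquery computes the max over the whole table,
    -- before the outer WHERE keeps only the zero rows
    cross join (select max(preference) as maxpref
                from mytable) m
    where preference = 0
)
update CTE
set preference = maxpref + rn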

How to find the SQL medians for a grouping

I am working with SQL Server 2008
If I have a Table as such:
Code Value
-----------------------
4 240
4 299
4 210
2 NULL
2 3
6 30
6 80
6 10
4 240
2 30
How can I find the median AND group by the Code column please?
To get a resultset like this:
Code Median
-----------------------
4 240
2 16.5
6 30
I really like this solution for median, but unfortunately it doesn't include Group By:
https://stackoverflow.com/a/2026609/106227
The solution using rank works nicely when you have an odd number of members in each group, i.e. the median exists within the sample. Where you have an even number of members, the rank method falls down, e.g. for
1
2
3
4
the median is 2.5 (i.e. half the group is smaller, and half the group is larger), but the rank method will return 3. To get around this, you essentially need to take the top value from the bottom half of the group and the bottom value from the top half of the group, and take an average of the two values.
WITH CTE AS
(   SELECT Code,
           Value,
           [half1] = NTILE(2) OVER(PARTITION BY Code ORDER BY Value),
           [half2] = NTILE(2) OVER(PARTITION BY Code ORDER BY Value DESC)
    FROM T
    WHERE Value IS NOT NULL
)
SELECT Code,
       Median = (MAX(CASE WHEN half1 = 1 THEN Value END) +
                 MIN(CASE WHEN half2 = 1 THEN Value END)) / 2.0
FROM CTE
GROUP BY Code;
In SQL Server 2012 you can use PERCENTILE_CONT
SELECT DISTINCT
Code,
Median = PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY Value) OVER(PARTITION BY Code)
FROM T;
SQL Server does not have a function to calculate medians, but you could use the ROW_NUMBER function like this:
WITH RankedTable AS (
    SELECT Code, Value,
           ROW_NUMBER() OVER (PARTITION BY Code ORDER BY Value) AS Rnk,
           COUNT(*) OVER (PARTITION BY Code) AS Cnt
    FROM MyTable
)
SELECT Code, Value
FROM RankedTable
WHERE Rnk = Cnt / 2 + 1
To elaborate a bit on this solution, consider the output of the RankedTable CTE:
Code Value Rnk Cnt
---------------------------
4    240    2   4
4    299    4   4
4    210    1   4
2    NULL   1   3
2    3      2   3  -- Median
6    30     2   3  -- Median
6    80     3   3
6    10     1   3
4    240    3   4  -- Median
2    30     3   3
Now from this result set, if you only return the rows where Rnk equals Cnt / 2 + 1 (integer division), you get one row per group carrying its middle value. Note that this does not average the two middle values for even-sized groups, and NULLs are counted, so for Code 2 it returns 3 rather than the expected 16.5.
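A hedged variation on the same ROW_NUMBER idea that fixes both issues, filtering out NULLs and averaging the two middle rows so that even-sized groups also come out right (a sketch against the same MyTable):
WITH RankedTable AS (
    SELECT Code, Value,
           ROW_NUMBER() OVER (PARTITION BY Code ORDER BY Value) AS Rnk,
           COUNT(*) OVER (PARTITION BY Code) AS Cnt
    FROM MyTable
    WHERE Value IS NOT NULL               -- ignore NULLs, as the expected output does
)
SELECT Code, AVG(1.0 * Value) AS Median   -- 1.0 * forces decimal division
FROM RankedTable
WHERE Rnk IN ((Cnt + 1) / 2, Cnt / 2 + 1) -- both middle rows; the same row when Cnt is odd
GROUP BY Code;
For the sample data this returns 240, 16.5 and 30, matching the expected resultset.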

Assign value to a column based on values of other columns in the same table

I have a table with columns Date and Order. I want to add a column named Batch to this table, filled as follows: for each Date, we start from the first Order and group every two orders into one batch.
That means that for the records with Date = 1 in this example (the first 4 records), the first two records (Order = 10 and Order = 30) get batch number Batch = 1, the next two records (Order = 80 and Order = 110) get Batch = 2, and so on.
If at the end the number of remaining records is less than the batch size (2 in this example), the remaining order(s) get a separate batch number. In the example below, the number of records with Date = 2 is odd, so the last record (the 5th record) gets Batch = 3.
Date Order
-----------
1 10
1 30
1 80
1 110
2 20
2 30
2 50
2 70
2 120
3 90
Date Order Batch
------------------
1 10 1
1 30 1
1 80 2
1 110 2
2 20 1
2 30 1
2 50 2
2 70 2
2 120 3
3 90 1
Use the analytic function row_number to get row numbers 1, 2, 3, ... within each date, then add one and divide by two. In T-SQL, integer division truncates on its own, so no trunc() wrapper is needed:
select
    dateid,
    orderid,
    (row_number() over (partition by dateid order by orderid) + 1) / 2 as batch
from mytable;
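A side note: Date and Order, the column names given in the question, are reserved words in T-SQL, so if the real table uses those names they would need to be bracketed. A sketch, with mytable standing in for the actual table name:
select
    [Date],
    [Order],
    (row_number() over (partition by [Date] order by [Order]) + 1) / 2 as batch
from mytable;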

How to track how many times a column changed its value?

I have a table called crewWork, as follows:
CREATE TABLE crewWork(
FloorNumber int, AptNumber int, WorkType int, simTime int )
After the table was populated, I need to know how many times a change in apt occurred and how many times a change in floor occurred. Usually I expect to find 10 rows for each apt and 40-50 for each floor.
I could just write a scalar function for that, but I was wondering if there's any way to do that in T-SQL without having to write scalar functions.
Thanks
The data will look like this:
FloorNumber AptNumber WorkType simTime
1 1 12 10
1 1 12 25
1 1 13 35
1 1 13 47
1 2 12 52
1 2 12 59
1 2 13 68
1 1 14 75
1 4 12 79
1 4 12 89
1 4 13 92
1 4 14 105
1 3 12 115
1 3 13 129
1 3 14 138
2 1 12 142
2 1 12 150
2 1 14 168
2 1 14 171
2 3 12 180
2 3 13 190
2 3 13 200
2 3 14 205
3 3 14 216
3 4 12 228
3 4 12 231
3 4 14 249
3 4 13 260
3 1 12 280
3 1 13 295
2 1 14 315
2 2 12 328
2 2 14 346
I need the information for a report, I don't need to store it anywhere.
If you use the accepted answer as written now (1/6/2023), you get correct results with the OP dataset, but I think you can get wrong results with other data.
CONFIRMED: ACCEPTED ANSWER HAS A MISTAKE (as of 1/6/2023)
I explain the potential for wrong results in my comments on the accepted answer.
In a db<>fiddle, I demonstrate the wrong results. I use a slightly modified form of the accepted answer (my syntax works in SQL Server and PostgreSQL) and a slightly modified form of the OP's data (I change two rows). I also demonstrate how the accepted answer can be changed slightly to produce correct results.
The accepted answer is clever but needs a small change to produce correct results (as demonstrated in the db<>fiddle and described here):
Instead of doing this, as seen in the accepted answer: COUNT(DISTINCT AptGroup)...
You should do this: COUNT(DISTINCT CONCAT(AptGroup, '_', AptNumber))...
DDL:
SELECT * INTO crewWork FROM (VALUES
-- data from question, with a couple changes to demonstrate problems with the accepted answer
-- https://stackoverflow.com/q/8666295/1175496
--FloorNumber AptNumber WorkType simTime
(1, 1, 12, 10 ),
-- (1, 1, 12, 25 ), -- original
(2, 1, 12, 25 ), -- new, changing FloorNumber 1->2->1
(1, 1, 13, 35 ),
(1, 1, 13, 47 ),
(1, 2, 12, 52 ),
(1, 2, 12, 59 ),
(1, 2, 13, 68 ),
(1, 1, 14, 75 ),
(1, 4, 12, 79 ),
-- (1, 4, 12, 89 ), -- original
(1, 1, 12, 89 ), -- new, changing AptNumber 4->1->4
(1, 4, 13, 92 ),
(1, 4, 14, 105 ),
(1, 3, 12, 115 ),
...
DML:
;
WITH groupedWithConcats AS (
    SELECT
        *,
        CONCAT(AptGroup, '_', AptNumber) AS AptCombo,
        CONCAT(FloorGroup, '_', FloorNumber) AS FloorCombo
    -- SQL Server doesn't have a TEMPORARY keyword; Postgres doesn't understand # for temp tables
    -- INTO TEMPORARY groupedWithConcats
    FROM
    (
        SELECT
            -- the columns shown in Andriy's answer:
            -- https://stackoverflow.com/a/8667477/1175496
            ROW_NUMBER() OVER (ORDER BY simTime) AS RN,
            -- AptNumber
            AptNumber,
            ROW_NUMBER() OVER (PARTITION BY AptNumber ORDER BY simTime) AS RN_Apt,
            ROW_NUMBER() OVER (ORDER BY simTime)
                - ROW_NUMBER() OVER (PARTITION BY AptNumber ORDER BY simTime) AS AptGroup,
            -- FloorNumber
            FloorNumber,
            ROW_NUMBER() OVER (PARTITION BY FloorNumber ORDER BY simTime) AS RN_Floor,
            ROW_NUMBER() OVER (ORDER BY simTime)
                - ROW_NUMBER() OVER (PARTITION BY FloorNumber ORDER BY simTime) AS FloorGroup
        FROM crewWork
    ) grouped
)
-- if you want to see how the groupings work:
-- SELECT * FROM groupedWithConcats
-- otherwise just run this query to see the counts of "changes":
SELECT
    COUNT(DISTINCT AptCombo) - 1 AS CountAptChangesWithConcat_Correct,
    COUNT(DISTINCT AptGroup) - 1 AS CountAptChangesWithoutConcat_Wrong,
    COUNT(DISTINCT FloorCombo) - 1 AS CountFloorChangesWithConcat_Correct,
    COUNT(DISTINCT FloorGroup) - 1 AS CountFloorChangesWithoutConcat_Wrong
FROM groupedWithConcats;
ALTERNATIVE ANSWER
The accepted answer may eventually get updated to remove the mistake. If that happens, I can remove my warning, but I still want to leave you with this alternative way to produce the answer.
My approach goes like this: check the previous row; if the value differs between the previous row and the current row, then there is a change. SQL doesn't have a notion of row order per se (at least not like Excel, for example).
Instead, SQL has window functions. With SQL's window functions, you can use ROW_NUMBER plus a self-JOIN technique to combine current-row values and previous-row values so you can compare them. My approach is pasted below.
The intermediate table, showing the columns which hold 1 if there is a change and 0 otherwise (i.e. FloorChange, AptChange), is shown at the bottom of the post...
DDL:
...same as above...
DML:
;
WITH rowNumbered AS (
    SELECT
        *,
        ROW_NUMBER() OVER (ORDER BY simTime) AS RN
    FROM crewWork
)
, joinedOnItself AS (
    SELECT
        rowNumbered.*,
        rowNumberedRowShift.FloorNumber AS FloorShift,
        rowNumberedRowShift.AptNumber AS AptShift,
        CASE WHEN rowNumbered.FloorNumber <> rowNumberedRowShift.FloorNumber THEN 1 ELSE 0 END AS FloorChange,
        CASE WHEN rowNumbered.AptNumber <> rowNumberedRowShift.AptNumber THEN 1 ELSE 0 END AS AptChange
    FROM rowNumbered
    LEFT OUTER JOIN rowNumbered AS rowNumberedRowShift
        ON rowNumbered.RN = (rowNumberedRowShift.RN + 1)
)
-- if you want to see:
-- SELECT * FROM joinedOnItself;
SELECT
    SUM(FloorChange) AS FloorChanges,
    SUM(AptChange) AS AptChanges
FROM joinedOnItself;
Below are the first few rows of the intermediate table (joinedOnItself), showing how my approach works. Note the last two columns, which have a value of 1 when there is a change in FloorNumber compared to FloorShift (noted in FloorChange), or a change in AptNumber compared to AptShift (noted in AptChange).
floornumber  aptnumber  worktype  simtime  rn  floorshift  aptshift  floorchange  aptchange
-----------  ---------  --------  -------  --  ----------  --------  -----------  ---------
          1          1        12       10   1        NULL      NULL            0          0
          2          1        12       25   2           1         1            1          0
          1          1        13       35   3           2         1            1          0
          1          1        13       47   4           1         1            0          0
          1          2        12       52   5           1         1            0          1
          1          2        12       59   6           1         2            0          0
          1          2        13       68   7           1         2            0          0
Note that instead of using ROW_NUMBER and a JOIN, you could use the window function LAG to compare values in the current row to the previous row directly (no need to JOIN). I don't have that solution here, but it is described in the Wikipedia article example:
Window functions allow access to data in the records right before and after the current record.
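For reference, a minimal sketch of that LAG variant (my addition, not the answerer's code; it assumes SQL Server 2012 or later, where LAG is available):
;
WITH lagged AS (
    SELECT
        FloorNumber,
        AptNumber,
        LAG(FloorNumber) OVER (ORDER BY simTime) AS PrevFloor,
        LAG(AptNumber)   OVER (ORDER BY simTime) AS PrevApt
    FROM crewWork
)
SELECT
    -- a change is counted whenever the current value differs from the previous row's;
    -- the first row compares against NULL and so is never counted as a change
    SUM(CASE WHEN PrevFloor <> FloorNumber THEN 1 ELSE 0 END) AS FloorChanges,
    SUM(CASE WHEN PrevApt   <> AptNumber   THEN 1 ELSE 0 END) AS AptChanges
FROM lagged;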
If I am not missing anything, you could use the following method to find the number of changes:
determine groups of sequential rows with identical values;
count those groups;
subtract 1.
Apply the method individually for AptNumber and for FloorNumber.
The groups could be determined like in this answer, only there isn't a Seq column in your case. Instead, another ROW_NUMBER() expression could be used. Here's an approximate solution:
;
WITH marked AS (
SELECT
FloorGroup = ROW_NUMBER() OVER ( ORDER BY simTime)
- ROW_NUMBER() OVER (PARTITION BY FloorNumber ORDER BY simTime),
AptGroup = ROW_NUMBER() OVER ( ORDER BY simTime)
- ROW_NUMBER() OVER (PARTITION BY AptNumber ORDER BY simTime)
FROM crewWork
)
SELECT
FloorChanges = COUNT(DISTINCT FloorGroup) - 1,
AptChanges = COUNT(DISTINCT AptGroup) - 1
FROM marked
(I'm assuming here that the simTime column defines the timeline of changes.)
UPDATE
Below is a table that shows how the distinct groups are obtained for AptNumber.
AptNumber RN RN_Apt AptGroup (= RN - RN_Apt)
--------- -- ------ ---------
1 1 1 0
1 2 2 0
1 3 3 0
1 4 4 0
2 5 1 4
2 6 2 4
2 7 3 4
1 8 5 3
4 9 1 8
4 10 2 8
4 11 3 8
4 12 4 8
3 13 1 12
3 14 2 12
3 15 3 12
1 16 6 10
… … … …
Here RN is a pseudo-column that stands for ROW_NUMBER() OVER (ORDER BY simTime). You can see that this is just a sequence of rankings starting from 1.
Another pseudo-column, RN_Apt, contains values produced by the other ROW_NUMBER, namely ROW_NUMBER() OVER (PARTITION BY AptNumber ORDER BY simTime). It contains rankings within individual groups of identical AptNumber values. You can see that, for a newly encountered value, the sequence starts over, and for a recurring one, it continues where it stopped last time.
You can also see from the table that if we subtract RN_Apt from RN (it could be the other way round; it doesn't matter in this situation), we get a value that uniquely identifies every distinct group of same AptNumber values. You might as well call that value a group ID.
So, now that we've got these IDs, it only remains for us to count them (count distinct values, of course). That will be the number of groups, and the number of changes is one less (assuming the first group is not counted as a change).
Add an extra column, changecount:
CREATE TABLE crewWork(
    FloorNumber int, AptNumber int, WorkType int, simTime int, changecount int)
Increment the changecount value on each update. If you want to know the count for each field, add a corresponding changecount column for each of them.
Assuming that each record represents a different change, you can find changes per floor by:
select FloorNumber, count(*)
from crewWork
group by FloorNumber
And changes per apartment (assuming AptNumber uniquely identifies apartment) by:
select AptNumber, count(*)
from crewWork
group by AptNumber
Or (assuming AptNumber and FloorNumber together uniquely identifies apartment) by:
select FloorNumber, AptNumber, count(*)
from crewWork
group by FloorNumber, AptNumber

Row_Number with complex conditions

Is it possible to use ROW_NUMBER() to number rows on something other than a simple partition, like the grouping done with GROUP BY?
This is my particular case:
Id Type Date
-- ---- ----
1 16 Some Date
2 16 Some Date
3 16 Some Date
4 32 Some Date
5 64 Some Date
6 64 Some Date
7 128 Some Date
8 256 Some Date
9 256 Some Date
10 256 Some Date
I want to partition the numbering in the following way (row numbering is sorted by date):
Id Type RowNb
-- ---- -----
6 64 1
4 32 2
5 64 3
9 256 1
3 16 2
1 16 3
8 256 4
7 128 5
2 16 6
10 256 7
i.e. every type other than 32 and 64 is numbered together. The numbering of types 32 and 64 is optional, because I only need to sort the other ones.
What I really want to do is retrieve all the rows with types 32 and 64, and only the row with the lowest date from the other types. The reason I'm asking this specific question is that in the future I may have to retrieve more than just the first column, and I think it will be easier if I can number my rows like that. If you have another solution, I'm all ears.
Of course you can have complex partitioning expressions, just like you can have complex grouping expressions. Here's one possible way to rank the rows in your example:
SELECT
    Id,
    Type,
    ROW_NUMBER() OVER (
        PARTITION BY CASE WHEN Type IN (32, 64) THEN 1 ELSE 2 END
        ORDER BY Date
    ) AS RowNb
FROM atable
But you'd have to repeat that condition later when filtering rows to display. So, if I were you, I might try something like this:
WITH marked AS (
    SELECT
        *,
        TypeGroup = CASE WHEN Type IN (32, 64) THEN 1 ELSE 2 END
    FROM atable
),
ranked AS (
    SELECT
        *,
        RowNb = ROW_NUMBER() OVER (PARTITION BY TypeGroup ORDER BY Date)
    FROM marked
)
SELECT
    Id,
    Type,
    RowNb
FROM ranked
WHERE TypeGroup = 1 OR RowNb = 1
ORDER BY TypeGroup, RowNb