Split a quantity into multiple rows with limit on quantity per row - sql

I have a table of ids and quantities that looks like this:
dbo.Quantity
id | qty
-------
1 | 3
2 | 6
I would like to split the quantity column into multiple lines and number them, but with a set limit (which can be arbitrary) on the maximum quantity allowed for each row.
So for the value of 2, expected output should be:
dbo.DesiredResult
id | qty | bucket
---------------
1 | 2 | 1
1 | 1 | 2
2 | 1 | 2
2 | 2 | 3
2 | 2 | 4
2 | 1 | 5
In other words,
Running SELECT id, SUM(qty) as qty FROM dbo.DesiredResult GROUP BY id should return the original table (dbo.Quantity).
Running
SELECT MIN(id) as id, SUM(qty) as qty, bucket FROM dbo.DesiredResult GROUP BY bucket
should give you this table.
id | qty | bucket
------------------
1 | 2 | 1
1 | 2 | 2
2 | 2 | 3
2 | 2 | 4
2 | 1 | 5
I feel I can do this imperatively with cursors, looping over each row and keeping a counter that increments and resets as the "max" for each bucket is filled. But this is very "anti-SQL", and I feel there is a better way around this.

One approach is a recursive CTE, which emulates a cursor going through the rows sequentially.
Another approach that comes to mind is to represent your data as intervals and intersections of intervals.
Represent this:
id | qty
-------
1 | 3
2 | 6
as intervals [0;3), [3;9) with ids being their labels
0123456789
|--|-----|
1 2 - id
It is easy to generate this set of intervals using running total SUM() OVER().
Represent your buckets also as intervals [0;2), [2;4), [4;6), etc. with their own labels
0123456789
|-|-|-|-|-|
1 2 3 4 5 - bucket
It is easy to generate this set of intervals using a table of numbers.
Intersect these two sets of intervals preserving information about their labels.
Working with sets should be possible in a set-based SQL query, rather than a sequential cursor or recursion.
It is a bit too much for me to write down the actual query right now. But it is quite possible that ideas similar to those discussed in Packing Intervals by Itzik Ben-Gan may be useful here.
Actually, once you have your quantities represented as intervals you can generate required number of rows/buckets on the fly from the table of numbers using CROSS APPLY.
Imagine we transformed your Quantity table into Intervals:
Start | End | ID
0 | 3 | 1
3 | 9 | 2
And we also have a table of numbers - a table Numbers with column Number with values from 0 to, say, 100K.
For each Start and End of the interval we can calculate the corresponding bucket number by dividing the value by the bucket size and rounding down or up.
Something along these lines:
SELECT
Intervals.ID
,A.qty
,A.Bucket
FROM
Intervals
CROSS APPLY
(
SELECT
Numbers.Number + 1 AS Bucket
,@BucketSize AS qty
-- it is equal to @BucketSize if the bucket is completely within the Start and End boundaries
-- it should be adjusted for the first and last buckets of the interval
FROM Numbers
WHERE
Numbers.Number >= [Start] / @BucketSize
AND Numbers.Number < [End] / @BucketSize + 1
) AS A
;
You'll need to check the formulas for off-by-one errors and adjust them.
You'll also need some CASE WHEN logic to calculate the correct qty for the buckets that fall on the lower and upper boundaries of the interval.
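To make the interval idea concrete, here is a minimal sketch of the whole query, wrapped in Python's sqlite3 module so it can be run as-is. It is not the T-SQL from the answer: the table name quantity, the columns start_pos and end_pos, the recursive numbers CTE standing in for a real numbers table, and the function name split_by_intervals are all assumptions made for illustration. The qty per row is the length of the overlap between the id interval and the bucket interval, which takes care of the first and last buckets without separate CASE logic.

```python
import sqlite3


def split_by_intervals(rows, bucket_size):
    """Split (id, qty) rows into buckets of at most bucket_size units.

    Hypothetical helper: SQLite stands in for SQL Server here.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE quantity (id INTEGER PRIMARY KEY, qty INTEGER)")
    con.executemany("INSERT INTO quantity VALUES (?, ?)", rows)
    sql = """
    WITH RECURSIVE
    intervals AS (      -- [start_pos; end_pos) per id, from a running total
        SELECT id,
               SUM(qty) OVER (ORDER BY id) - qty AS start_pos,
               SUM(qty) OVER (ORDER BY id)       AS end_pos
        FROM quantity
        WHERE qty > 0
    ),
    numbers(n) AS (     -- stand-in for a numbers table: 0 .. bucket count - 1
        SELECT 0
        UNION ALL
        SELECT n + 1 FROM numbers
        WHERE (n + 1) * :size < (SELECT SUM(qty) FROM quantity)
    )
    SELECT i.id,
           -- overlap of [start_pos; end_pos) with [n*size; (n+1)*size)
           MIN(i.end_pos, (n.n + 1) * :size) - MAX(i.start_pos, n.n * :size) AS qty,
           n.n + 1 AS bucket
    FROM intervals AS i
    JOIN numbers AS n
      ON n.n * :size < i.end_pos           -- bucket starts before the interval ends
     AND (n.n + 1) * :size > i.start_pos   -- bucket ends after the interval starts
    ORDER BY i.id, bucket
    """
    return con.execute(sql, {"size": bucket_size}).fetchall()
```

Called as split_by_intervals([(1, 3), (2, 6)], 2), this should reproduce the desired result from the question. In SQL Server the two-argument MIN and MAX would have to be written as CASE expressions (or LEAST and GREATEST where available).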

Use a recursive CTE:
with cte as (
select id, 1 as n, qty
from t
union all
select id, n + 1, qty
from cte
where n < qty
)
select id, n
from cte;
Here is a db<>fiddle.
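The recursive CTE only expands each row into unit rows; it does not assign buckets by itself. One possible way to finish the job, sketched here under my own assumptions rather than taken from the answer, is to number the unit rows globally, derive the bucket by integer division, and count the units per id and bucket. The sketch runs on SQLite through Python's sqlite3 module, uses n < qty so that each unit gets its own row, and the names numbered, bucket and split_by_units are hypothetical.

```python
import sqlite3


def split_by_units(rows, bucket_size):
    """Expand (id, qty) rows into single units, then regroup them into buckets."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, qty INTEGER)")
    con.executemany("INSERT INTO t VALUES (?, ?)", rows)
    sql = """
    WITH RECURSIVE cte(id, n, qty) AS (
        SELECT id, 1, qty FROM t WHERE qty > 0
        UNION ALL
        SELECT id, n + 1, qty FROM cte WHERE n < qty   -- one row per unit
    ),
    numbered AS (
        -- the global unit position decides the bucket (integer division)
        SELECT id,
               (ROW_NUMBER() OVER (ORDER BY id, n) - 1) / :size + 1 AS bucket
        FROM cte
    )
    SELECT id, COUNT(*) AS qty, bucket
    FROM numbered
    GROUP BY id, bucket
    ORDER BY id, bucket
    """
    return con.execute(sql, {"size": bucket_size}).fetchall()
```

This generates one row per unit, so it is only reasonable for modest quantities.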

Related

How to pivot column data into a row where a maximum qty total cannot be exceeded?

Introduction:
I have come across an unexpected challenge. I'm hoping someone can help and I am interested in the best method to go about manipulating the data in accordance to this problem.
Scenario:
I need to combine column data associated to two different ID columns. Each row that I have associates an item_id and the quantity for this item_id. Please see below for an example.
+-------+-------+-------+---+
|cust_id|pack_id|item_id|qty|
+-------+-------+-------+---+
| 1 | A | 1 | 1 |
| 1 | A | 2 | 1 |
| 1 | A | 3 | 4 |
| 1 | A | 4 | 0 |
| 1 | A | 5 | 0 |
+-------+-------+-------+---+
I need to manipulate the data shown above so that 24 rows (for 24 item_ids) is combined into a single row. In the example above I have chosen 5 items to make things easier. The selection format I wish to get, assuming 5 item_ids, can be seen below.
+---------+---------+---+---+---+---+---+
| cust_id | pack_id | 1 | 2 | 3 | 4 | 5 |
+---------+---------+---+---+---+---+---+
| 1 | A | 1 | 1 | 4 | 0 | 0 |
+---------+---------+---+---+---+---+---+
However, here's the condition that is making this troublesome. The maximum total quantity for each row must not exceed 5. If the total quantity exceeds 5 a new row associated to the cust_id and pack_id must be created for the rest of the item_id quantities. Please see below for the desired output.
+---------+---------+---+---+---+---+---+
| cust_id | pack_id | 1 | 2 | 3 | 4 | 5 |
+---------+---------+---+---+---+---+---+
| 1 | A | 1 | 1 | 3 | 0 | 0 |
| 1 | A | 0 | 0 | 1 | 0 | 0 |
+---------+---------+---+---+---+---+---+
Notice how the quantities of item_ids 1, 2 and 3 summed together equal 6. This exceeds the maximum total quantity of 5 for each row. For the second row the difference is created. In this case only item_id 3 has a single quantity remaining.
Note, if a 2nd row needs to be created that total quantity displayed in that row also cannot exceed 5. There is a known item_id limit of 24. But, there is no known limit of the quantity associated for each item_id.
Here's an approach which goes from left-field a bit.
One approach would have been to do a recursive CTE, building the rows one-by-one.
Instead, I've taken an approach where I:
1. Create a new (virtual) table with 1 row per item (so if there are 6 items, there will be 6 rows)
2. Group those items into groups of 5 (I've called these rn_batches)
3. Pivot those (based on counts per item per rn_batch)
For these, processing is relatively simple
Creating one row per item is done using INNER JOIN to a numbers table with n <= the relevant quantity.
The grouping then just assigns rn_batch = 1 for the first 5 items, rn_batch = 2 for the next 5 items, etc - until there are no more items left for that order (based on cust_id/pack_id).
Here is the code
/* Data setup */
CREATE TABLE #Order (cust_id int, pack_id varchar(1), item_id int, qty int, PRIMARY KEY (cust_id, pack_id, item_id))
INSERT INTO #Order (cust_id, pack_id, item_id, qty) VALUES
(1, 'A', 1, 1),
(1, 'A', 2, 1),
(1, 'A', 3, 4),
(1, 'A', 4, 0),
(1, 'A', 5, 0);
/* Pivot results */
WITH Nums(n) AS
(SELECT (c * 100) + (b * 10) + (a) + 1 AS n
FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) A(a)
CROSS JOIN (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) B(b)
CROSS JOIN (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) C(c)
),
ItemBatches AS
(SELECT cust_id, pack_id, item_id,
FLOOR((ROW_NUMBER() OVER (PARTITION BY cust_id, pack_id ORDER BY item_id, N.n)-1) / 5) + 1 AS rn_batch
FROM #Order O
INNER JOIN Nums N ON N.n <= O.qty
)
SELECT *
FROM (SELECT cust_id, pack_id, rn_batch, 'Item_' + LTRIM(STR(item_id)) AS item_desc
FROM ItemBatches
) src
PIVOT
(COUNT(item_desc) FOR item_desc IN ([Item_1], [Item_2], [Item_3], [Item_4], [Item_5])) pvt
ORDER BY cust_id, pack_id, rn_batch;
And here are results
cust_id pack_id rn_batch Item_1 Item_2 Item_3 Item_4 Item_5
1 A 1 1 1 3 0 0
1 A 2 0 0 1 0 0
Here's a db<>fiddle with
additional data in the #Orders table
the answer above, and also the processing with each step separated.
Notes
This approach (with the virtual numbers table) assumes a maximum of 1,000 for a given item in an order. If you need more, you can easily extend that numbers table by adding additional CROSS JOINs.
While I am in awe of the coders who made SQL Server and how it determines execution plans in milliseconds, for larger datasets I give SQL Server zero chance of accurately predicting how many rows will be in each step. As such, for performance, it may work better to split the code up into parts (including temp tables), similar to the db<>fiddle example.
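For engines without PIVOT, the same batching idea can be expressed with conditional aggregation. The following is a sketch only, run on SQLite through Python's sqlite3 module; it swaps PIVOT for SUM(CASE ...), and the table name orders, the item_N column names and the function pivot_in_batches are assumptions for illustration.

```python
import sqlite3


def pivot_in_batches(orders, max_per_row=5, item_ids=range(1, 6)):
    """One output row per (cust_id, pack_id, rn_batch), one column per item id."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE orders (cust_id INTEGER, pack_id TEXT, item_id INTEGER, qty INTEGER)"
    )
    con.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", orders)
    # Conditional aggregation replaces PIVOT: count the units of each item.
    item_cols = ",\n           ".join(
        f"SUM(CASE WHEN item_id = {int(i)} THEN 1 ELSE 0 END) AS item_{int(i)}"
        for i in item_ids
    )
    sql = f"""
    WITH RECURSIVE nums(n) AS (
        SELECT 1
        UNION ALL
        SELECT n + 1 FROM nums WHERE n < (SELECT MAX(qty) FROM orders)
    ),
    item_batches AS (
        -- one row per unit; every max_per_row units start a new batch
        SELECT o.cust_id, o.pack_id, o.item_id,
               (ROW_NUMBER() OVER (PARTITION BY o.cust_id, o.pack_id
                                   ORDER BY o.item_id, n.n) - 1) / :cap + 1 AS rn_batch
        FROM orders AS o
        JOIN nums AS n ON n.n <= o.qty
    )
    SELECT cust_id, pack_id, rn_batch,
           {item_cols}
    FROM item_batches
    GROUP BY cust_id, pack_id, rn_batch
    ORDER BY cust_id, pack_id, rn_batch
    """
    return con.execute(sql, {"cap": max_per_row}).fetchall()
```

With the five sample rows from the question this should return the same two rows as the PIVOT version. Here the numbers CTE is bounded by the largest qty instead of a fixed 1,000.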

Reference the output of a calculated column in Hive SQL

I have a self-referencing/recursive calculation in Excel that needs to be moved to Hive SQL. Basically the column needs to SUM the two values only if the total of the concrete column plus the result from the previous calculation is greater than 0.
The data is as follows, A is the value and B is the expected output:
| A | B |
|-----|-----|
| -1 | 0 |
| 2 | 2 |
| -2 | 0 |
| 2 | 2 |
| 2 | 4 |
| -1 | 3 |
| 2 | 5 |
In Excel it would be written in column B as:
=MAX(0,B1+A2)
The problem in SQL is you need to have the output of the current calculation. I think I've got it sorted in SQL as the following:
DECLARE @Numbers TABLE(A INT, Rn INT)
INSERT INTO @Numbers VALUES (-1,1),(2,2),(-2,3),(2,4),(2,5),(-1,6),(2,7);
WITH lagged AS
(
-- anchor: the first row has no previous B, so B = MAX(0, A)
SELECT A, CASE WHEN A > 0 THEN A ELSE 0 END AS B, Rn
FROM @Numbers
WHERE Rn = 1
UNION ALL
SELECT i.A,
-- MAX(0, previous B + current A)
CASE WHEN ((i.A + l.B) >= 0) THEN (i.A + l.B)
ELSE 0
END,
i.Rn
FROM @Numbers i INNER JOIN lagged l
ON i.Rn = l.Rn + 1
)
SELECT *
FROM lagged;
But this being Hive, it doesn't support recursive CTEs, so I need to dumb the SQL down a touch. Is that possible using LAG/LEAD? My brain is hurting having got this far!
I initially thought that it would help to first compute the running sum of all elements up to each rank and then fix the values somehow using the negative elements.
However, one big negative value that should zero the B column will carry forward in the sum and make all following elements negative.
It's as Gordon commented: whether 0 wins in the calculation =MAX(0,B1+A2) depends on the previous position where that happened, and it seems to be impossible to compute those positions in advance analytically.
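One possible way around the carry-forward problem, offered as a sketch and not tested on Hive: if S is the running sum of A, then the recurrence B = MAX(0, previous B + A), starting from 0, is equivalent to B = S - MIN(0, lowest S seen so far). Subtracting the running minimum undoes exactly the part of the sum that earlier resets should have thrown away. The query below uses only nested subqueries and window functions; it runs on SQLite through Python's sqlite3 module, and the names numbers, s, low and clamped_running_sum are assumptions for illustration.

```python
import sqlite3


def clamped_running_sum(values):
    """Return (A, B) pairs where B = max(0, previous B + A), computed set-based."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE numbers (rn INTEGER PRIMARY KEY, a INTEGER)")
    con.executemany("INSERT INTO numbers VALUES (?, ?)", enumerate(values, start=1))
    sql = """
    SELECT a,
           -- subtract the lowest prefix sum so far, but only if it went below 0
           s - CASE WHEN low < 0 THEN low ELSE 0 END AS b
    FROM (
        SELECT rn, a, s,
               MIN(s) OVER (ORDER BY rn) AS low      -- running minimum of the prefix sum
        FROM (
            SELECT rn, a,
                   SUM(a) OVER (ORDER BY rn) AS s    -- plain running sum
            FROM numbers
        ) AS prefix
    ) AS lows
    ORDER BY rn
    """
    return con.execute(sql).fetchall()
```

For the sample data, the running sums are -1, 1, -1, 1, 3, 2, 4 and the running minimum stays at -1, which gives B = 0, 2, 0, 2, 4, 3, 5 as expected.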

sql index same column two directions for traversing window functions

I'm trying to use windowing functions to group records close to each other (within the same partition) into sequential groups. There's probably a better way to solve the problem, but right now what I would like to try is running too slow to be useful. It involves an order by on the select:
order by person_id, rollup_class, rollup_concept_id, exp_num
and another order by in the window function:
lead(days_from_latest) over (partition by person_id, rollup_class, rollup_concept_id
order by exp_num DESC)
Because I have that last column (exp_num) ordered in opposite directions, the query takes forever. I even have two indexes on the table to handle the two directions:
create index deeIdx on results.drug_exposure_extra (person_id,rollup_class, rollup_concept_id,
exp_num);
create index deeIdx2 on results.drug_exposure_extra (person_id,rollup_class,rollup_concept_id,
exp_num desc);
But that doesn't help. So I'm trying one that orders exp_num in both directions:
create index deeIdx3 on results.drug_exposure_extra (person_id,rollup_class,rollup_concept_id,
exp_num, exp_num desc);
Does that even make sense? When the index finally finishes building, if it solves the problem, I'll answer my own question...
Nope.
Even with all three indexes, if the two order bys (in select and in over clause) go the same direction, the query runs super fast, if they go opposite directions the query runs super slow. So, at this point I guess I should explain my use case better and ask for ideas for a better approach.
I've got drug exposure records (this is for a cool open-source project http://www.ohdsi.org/, btw), and when a person has drug exposures that begin less than N days from the end of any previous exposure, it should be combined with the earlier ones into a single 'era'. Whenever there is a gap of more than N days, a new era begins.
Over the course of composing this question, it turns out I solved it. It raises some interesting issues, though, so I'll post it and answer it below.
Like asking a doctor, "It hurts when I move my arm like this, what should I do?" the answer is obviously, "Don't move your arm like that." So -- don't try to make windowing functions proceed in a different order from the main query (or probably from each other) -- there's probably a better solution.
Early in working on this I had somehow convinced myself that it would be easier to aggregate eras relative to their ending records rather than their starting records, but that was where I went wrong.
So the expression that gives me the era number I want looks like this:
sum(case when exp_num = 1 or days_from_latest > 30 then 1 else 0 end)
over (partition by person_id, rollup_class, rollup_concept_id
order by exp_num)
as era_num
Explanation: if it's the patient's first exposure to the drug (well, the combination of rollup_class and rollup_concept_id in this case), then that's the beginning of a drug era. It's also the beginning of a drug era if the exposure is more than N days from any earlier exposure. (This point is what makes it a little complicated: say exposure 1 starts at day 1 and is 60 days, exposure 2 starts at day 20 and is 10 days, exposure 3 starts at day 70: it's 40 days after the end of the most recent exposure, 2, which would put it in a new era, but it's only 10 days after exposure 1, which puts it in the same era with 1 and 2.) So, for each record that starts an era the case statement gives us a 1, the rest get 0s. Then we sum that, partitioning over the same partition we used in an earlier query to establish the exp_num, and order by exp_num. I could have specified the rows to sum explicitly by adding rows between unbounded preceding and current row, but that's the default behavior anyway. So the era number increments only at the beginning of new eras.
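The tricky case in the parenthesis can be checked with a small sketch. It assumes days_from_latest is the start day minus the latest end day among all earlier exposures in the partition, which is how I read the description above; the columns start_day and end_day, the table name exposure and the function assign_eras are hypothetical, and the partition is reduced to person_id for brevity. It runs on SQLite through Python's sqlite3 module.

```python
import sqlite3


def assign_eras(exposures, max_gap=30):
    """exposures: (person_id, exp_num, start_day, end_day) tuples."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE exposure "
        "(person_id INTEGER, exp_num INTEGER, start_day INTEGER, end_day INTEGER)"
    )
    con.executemany("INSERT INTO exposure VALUES (?, ?, ?, ?)", exposures)
    sql = """
    SELECT person_id, exp_num, days_from_latest,
           -- running count of era starts = era number
           SUM(CASE WHEN exp_num = 1 OR days_from_latest > :gap THEN 1 ELSE 0 END)
               OVER (PARTITION BY person_id ORDER BY exp_num) AS era_num
    FROM (
        SELECT person_id, exp_num,
               -- gap to the latest end among ALL earlier exposures, not just the previous one
               start_day - MAX(end_day) OVER (
                   PARTITION BY person_id ORDER BY exp_num
                   ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
               ) AS days_from_latest
        FROM exposure
    ) AS gaps
    ORDER BY person_id, exp_num
    """
    return con.execute(sql, {"gap": max_gap}).fetchall()
```

With exposures ending on days 60 and 30 and a third one starting on day 70, the third exposure is 10 days from the latest earlier end, so it stays in era 1. Both window functions run in the same direction, which is the point of the answer.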
Here is a much simplified example in response to gordon-linoff's comment below.
create table junk_numbers (x int);
insert into junk_numbers values (1),(2),(3),(5),(7),(9),(10),(15),(20),(25),(26),(28),(30);
-- start a new series wherever the gap to the previous value is greater than 1
select x, gap, 1+sum(case when gap > 1 then 1 else 0 end) over (order by x) as series_num
from (
select x, x - lag(x) over (order by x) as gap
from junk_numbers
) as x_and_gaps
order by x;
x | gap | series_num
----+-----+------------
1 | | 1
2 | 1 | 1
3 | 1 | 1
5 | 2 | 2
7 | 2 | 3
9 | 2 | 4
10 | 1 | 4
15 | 5 | 5
20 | 5 | 6
25 | 5 | 7
26 | 1 | 7
28 | 2 | 8
30 | 2 | 9
-- same query but bigger gaps:
select x, gap, 1+sum(case when gap > 4 then 1 else 0 end) over (order by x) as series_num
from (
select x, x - lag(x) over (order by x) as gap
from junk_numbers
) as x_and_gaps
order by x;
x | gap | series_num
----+-----+------------
1 | | 1
2 | 1 | 1
3 | 1 | 1
5 | 2 | 1
7 | 2 | 1
9 | 2 | 1
10 | 1 | 1
15 | 5 | 2
20 | 5 | 3
25 | 5 | 4
26 | 1 | 4
28 | 2 | 4
30 | 2 | 4
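Once each row has a series_num, collapsing a series into a single row (for example one row per era) is a plain GROUP BY. This is a sketch under my own assumptions, run on SQLite through Python's sqlite3 module; the output columns first_x, last_x and members and the function series_ranges are hypothetical names.

```python
import sqlite3


def series_ranges(values, threshold):
    """Collapse each series (a new one starts when gap > threshold) into one row."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE junk_numbers (x INTEGER)")
    con.executemany("INSERT INTO junk_numbers VALUES (?)", [(v,) for v in values])
    sql = """
    SELECT series_num, MIN(x) AS first_x, MAX(x) AS last_x, COUNT(*) AS members
    FROM (
        SELECT x,
               1 + SUM(CASE WHEN gap > :threshold THEN 1 ELSE 0 END)
                       OVER (ORDER BY x) AS series_num
        FROM (
            SELECT x, x - LAG(x) OVER (ORDER BY x) AS gap
            FROM junk_numbers
        ) AS x_and_gaps
    ) AS numbered
    GROUP BY series_num
    ORDER BY series_num
    """
    return con.execute(sql, {"threshold": threshold}).fetchall()
```

With the sample values and a threshold of 4, this should give four rows covering 1 to 10, 15, 20 and 25 to 30.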

Finding the maximum value between a given interval

Let's say I have a table like so, where the amount is some arbitrary amount of something (like fruit, but we don't care about the type)
row | amount
_______________
1 | 54
2 | 2
3 | 102
4 | 102
5 | 1
And I want to select the rows that have the maximum value within a given interval. For instance if I was only wanting to select from rows 2-5 what would be returned would be
row | amount
_______________
3 | 102
4 | 102
Because they both contain the max value within the interval, which is 102. Or if I chose to only look at rows 1-2 it would return
row | amount
_______________
1 | 54
Because the maximum value in the interval 1-2 only exists in row 1
I tried to use a variety of:
amount= (select MAX(amount) FROM arbitraryTable)
But that will only ever return
row | amount
_______________
3 | 102
4 | 102
Because 102 is the absolute max of the table. Can you find the maximum value between a given interval?
I would use rank() or max() as a window function:
select t.row, t.amount
from (select t.*, max(amount) over () as maxamount
from t
where row between 2 and 5
) t
where amount = maxamount;
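The rank() variant mentioned above is not shown, so here is a minimal sketch of it, run on SQLite through Python's sqlite3 module. The column is called row_num instead of row to avoid clashing with a keyword, and the function name max_rows_in_interval is hypothetical. RANK() gives every tied maximum the rank 1, so all rows holding the maximum are returned.

```python
import sqlite3


def max_rows_in_interval(rows, first_row, last_row):
    """Return the rows holding the maximum amount within [first_row, last_row]."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE arbitraryTable (row_num INTEGER PRIMARY KEY, amount INTEGER)")
    con.executemany("INSERT INTO arbitraryTable VALUES (?, ?)", rows)
    sql = """
    SELECT row_num, amount
    FROM (
        SELECT row_num, amount,
               RANK() OVER (ORDER BY amount DESC) AS rnk   -- ties share rank 1
        FROM arbitraryTable
        WHERE row_num BETWEEN :lo AND :hi                  -- restrict before ranking
    ) AS ranked
    WHERE rnk = 1
    ORDER BY row_num
    """
    return con.execute(sql, {"lo": first_row, "hi": last_row}).fetchall()
```

The interval filter sits inside the subquery, so the ranking only ever sees rows from the requested interval.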
You can use a subquery to get the max value and use it in WHERE clause:
SELECT
row,
amount
FROM
arbitraryTable
WHERE
row BETWEEN 2 AND 5 AND
amount = (
SELECT
MAX(amount)
FROM
arbitraryTable
WHERE
row BETWEEN 2 AND 5
);
Just remember to use the same conditions in the main and sub query: row BETWEEN 2 AND 5.

repeating / duplicating query entries based on a table value

Related to / copied from this PostgreSQL topic: so-link
Let's say I have a table with two rows
id | value |
----+-------+
1 | 2 |
2 | 3 |
I want to write a query that will duplicate (repeat) each row based on
the value. I want this result (5 rows total):
id | value |
----+-------+
1 | 2 |
1 | 2 |
2 | 3 |
2 | 3 |
2 | 3 |
How is this possible in SQL Anywhere (Sybase SQL)?
The easiest way to do this is to have a numbers table . . . one that generates integers. Perhaps you have one handy. There are other ways. For instance, using a recursive CTE:
with numbers as (
select 1 as n
union all
select n + 1
from numbers
where n < 100
)
select t.*
from yourtable t join
numbers n
on n.n <= value;
Not all versions of Sybase necessarily support recursive CTEs. There are other ways to generate such a table, or you might already have one handy.
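As a runnable illustration of the numbers-table join, here is a sketch on SQLite through Python's sqlite3 module, not on SQL Anywhere, so the syntax may need adjusting there. The numbers CTE is bounded by the largest value in the table instead of a fixed 100, which is my own variation, and the function name repeat_rows is hypothetical.

```python
import sqlite3


def repeat_rows(rows):
    """Repeat each (id, value) row `value` times."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE yourtable (id INTEGER PRIMARY KEY, value INTEGER)")
    con.executemany("INSERT INTO yourtable VALUES (?, ?)", rows)
    sql = """
    WITH RECURSIVE numbers(n) AS (
        SELECT 1
        UNION ALL
        SELECT n + 1 FROM numbers
        WHERE n < (SELECT MAX(value) FROM yourtable)   -- only as many as needed
    )
    SELECT t.id, t.value
    FROM yourtable AS t
    JOIN numbers AS n ON n.n <= t.value                -- value copies of each row
    ORDER BY t.id, n.n
    """
    return con.execute(sql).fetchall()
```

For the two rows in the question this should return five rows: two copies of (1, 2) and three copies of (2, 3).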