SQL increment between a range without duplicates - multithreaded - sql

I'll try to provide as much detail as I can on this. What is the best way to increment between a range of numbers in a high transaction environment, meaning, calls come in very fast from a web api?
The first table is a master table of the ranges that are available to use; it's shown below. I'm not quite sure this is the best way to implement it either, so I'm open to suggestions.
Company A initially gave me a range of 100-200. Over time, we started to run low so they gave us a new range. The new range is 201-300.
Company   Range     Inactive
A         100-200   X
B         100-200
C         200-350
A         201-300
The second table is the list of numbers that have been used between the ranges.
Company   Number   DateUsed
A         198      2017-11-30
B         199      2017-11-30
A         200      2017-11-30
B         105      2017-11-30
C         215      2017-11-30
A         201      2017-11-30
Once a range is used up, I need to be able to flag that range so it's not used anymore and move on to the next available range. I was thinking of adding a "Last Used" number to the first table and doing an UPDATE statement with an OUTPUT clause, plus a CASE expression on Inactive that flags the range inactive once it's exhausted.
The question I have is what is the best way to do this in a high transaction environment? I'm familiar with Scope_Identity, but I don't think this will work in this setup.

Having an inactive flag in the first table seems perfectly sane to me. I've written and tested a query that will update your flag, provided the Range column is split into a lower range column and upper range column. I'm calling the tables Ranges and RangeLog respectively.
UPDATE Ranges
SET Inactive = 'X'
WHERE EXISTS (SELECT tA.Company, RangeLow, RangeHigh, COUNT(*)
              FROM Ranges tA
              INNER JOIN (SELECT DISTINCT Company, Number
                          FROM RangeLog) tB
                      ON tA.Company = tB.Company
                     AND (Number BETWEEN tA.RangeLow AND tA.RangeHigh)
              GROUP BY tA.Company, RangeLow, RangeHigh
              HAVING RangeHigh - RangeLow + 1 = COUNT(*)
                 AND Ranges.Company = tA.Company
                 AND Ranges.RangeLow = tA.RangeLow
                 AND Ranges.RangeHigh = tA.RangeHigh)
Obviously this won't work with the schema as you have it now, but splitting the Range column makes the data more atomic and usable, and queries easier to write. And it's a fairly minor modification to your table.
Let us know how this looks to you!
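As a follow-up on the high-transaction part of the question: a common SQL Server pattern is to claim the next number with a single UPDATE ... OUTPUT statement, so no session ever reads the counter and writes it back in separate steps. This is only a rough sketch, not the poster's schema verbatim: it assumes the split RangeLow/RangeHigh columns from the answer above, the LastUsed column the question proposes, and that Inactive is NULL while a range is still active.

DECLARE @Company char(1) = 'A';
DECLARE @Claimed TABLE (Company char(1), Number int);

-- Claim the next number from the lowest still-active range for the company.
-- UPDLOCK/HOLDLOCK keeps two concurrent callers from claiming the same number.
UPDATE r
SET    LastUsed = ISNULL(r.LastUsed, r.RangeLow - 1) + 1,
       Inactive = CASE WHEN ISNULL(r.LastUsed, r.RangeLow - 1) + 1 >= r.RangeHigh
                       THEN 'X' ELSE r.Inactive END
OUTPUT inserted.Company, inserted.LastUsed INTO @Claimed (Company, Number)
FROM   Ranges r WITH (UPDLOCK, HOLDLOCK)
WHERE  r.Company = @Company
  AND  r.Inactive IS NULL
  AND  ISNULL(r.LastUsed, r.RangeLow - 1) < r.RangeHigh
  AND  r.RangeLow = (SELECT MIN(r2.RangeLow)
                     FROM   Ranges r2
                     WHERE  r2.Company = r.Company
                       AND  r2.Inactive IS NULL
                       AND  ISNULL(r2.LastUsed, r2.RangeLow - 1) < r2.RangeHigh);

-- Record the claimed number in the log table.
INSERT INTO RangeLog (Company, Number, DateUsed)
SELECT Company, Number, CAST(GETDATE() AS date)
FROM   @Claimed;

Because the read, the increment, and the inactive flagging happen in one statement, SCOPE_IDENTITY() isn't needed; the caller gets the assigned number back from the @Claimed table variable.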


Redshift SQL result set 100s of rows wide efficiency (long to wide)

Scenario: Medical records reporting to state government which requires a pipe delimited text file as input.
Challenge: Select hundreds of values from a fact table and produce a wide result set to be (Redshift) UNLOADed to disk.
What I have tried so far is a SQL that I want to make into a VIEW.
;WITH
CTE_patient_record AS
(
SELECT
record_id
FROM fact_patient_record
WHERE update_date = <yesterday>
)
,CTE_patient_record_item AS
(
SELECT
record_id
,record_item_name
,record_item_value
FROM fact_patient_record_item fpri
INNER JOIN CTE_patient_record cpr ON fpri.record_id = cpr.record_id
)
Note that fact_patient_record has 87M rows and fact_patient_record_item has 97M rows.
The above code runs in 2 seconds for 2 test records and the CTE_patient_record_item CTE has about 200 rows per record for a total of about 400.
Now, produce the result set:
,CTE_result AS
(
SELECT
cpr.record_id
,cpri002.record_item_value AS diagnosis_1
,cpri003.record_item_value AS diagnosis_2
,cpri004.record_item_value AS medication_1
...
FROM CTE_patient_record cpr
INNER JOIN CTE_patient_record_item cpri002 ON cpr.record_id = cpri002.record_id
AND cpri002.record_item_name = 'diagnosis_1'
INNER JOIN CTE_patient_record_item cpri003 ON cpr.record_id = cpri003.record_id
AND cpri003.record_item_name = 'diagnosis_2'
INNER JOIN CTE_patient_record_item cpri004 ON cpr.record_id = cpri004.record_id
AND cpri004.record_item_name = 'medication_1'
...
) SELECT * FROM CTE_result
Result set looks like this:
record_id diagnosis_1 diagnosis_2 medication_1 ...
100001 09 9B 88X ...
...and then I use the Redshift UNLOAD command to write the output to disk, pipe delimited.
I am testing this on a full production sized environment but only for 2 test records.
Those 2 test records have about 200 items each.
Processing output is 2 rows 200 columns wide.
It takes 30 to 40 minutes to process just the 2 records.
You might ask me why I am joining on the item name which is a string. Basically there is no item id, no integer, to join on. Long story.
I am looking for suggestions on how to improve performance. With only 2 records, 30 to 40 minutes is unacceptable. What will happen when I have 1000s of records?
I have also tried making the VIEW a MATERIALIZED VIEW; however, it also takes 30 to 40 minutes (not surprisingly) to build the materialized view.
I am not sure which route to take from here.
Stored procedure? I have experience with stored procs.
Create new tables so I can create integer id's to join on and indexes? However, my managers are "new table" averse.
?
I could just stop with the first two CTEs, pull the data down to Python, and process it with a pandas DataFrame, which I've done successfully before, but it would be nice if I could have an efficient query, just use Redshift UNLOAD, and be done with it.
Any help would be appreciated.
UPDATE: Many thanks to Paul Coulson and Bill Weiner for pointing me in the right direction! (Paul I am unable to upvote your answer as I am too new here).
Using (pseudo code):
MAX(CASE WHEN t1.name = 'somename' THEN t1.value END ) AS name
...
FROM table1 t1
reduced execution time from 30 minutes to 30 seconds.
The EXPLAIN plan for the original solution is 2,700 lines long; for the new solution using conditional aggregation it is 40 lines long.
Thanks guys.
Without more information it is impossible to know for sure what is going on, but what you are doing is likely not ideal. An explain plan and the execution time per step would help a bunch.
What I suspect is hurting you is that you are reading a 97M-row table 200 times. This will slow things down but shouldn't take 40 minutes. So I also suspect that record_item_name is not unique per value of record_id. That will lead to row replication and could be expanding the data set many fold. Also, is record_id unique in fact_patient_record? If not, then this will cause row replication too. If all of this is large enough to cause significant spill and significant network broadcasting, your 40-minute execution time is very plausible.
There is no need to join when all the data is in a single copy of the table. #PhilCoulson is correct that some sort of conditional aggregation can be applied, and the decode() syntax can save you some space if you don't like CASE. Several of the issues above that might be affecting your joins would also make this aggregation complicated. What are you looking for if there are several values of record_item_value for a given record_id and record_item_name pair? I expect you have some discovery of what your data holds in your future.
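To make the conditional-aggregation idea from the update concrete, here is a rough sketch using the table and column names from the question; the real view would simply repeat the MAX(CASE ...) expression once per record_item_name, and <yesterday> stays a placeholder as in the original:

SELECT fpri.record_id,
       MAX(CASE WHEN fpri.record_item_name = 'diagnosis_1'  THEN fpri.record_item_value END) AS diagnosis_1,
       MAX(CASE WHEN fpri.record_item_name = 'diagnosis_2'  THEN fpri.record_item_value END) AS diagnosis_2,
       MAX(CASE WHEN fpri.record_item_name = 'medication_1' THEN fpri.record_item_value END) AS medication_1
       -- ... one expression per item name, hundreds of them in the real view
FROM fact_patient_record_item fpri
INNER JOIN fact_patient_record fpr
        ON fpr.record_id = fpri.record_id
WHERE fpr.update_date = <yesterday>
GROUP BY fpri.record_id;

One scan of the item table replaces the ~200 self-joins, which is why the plan collapses from thousands of lines to a few dozen. If a record_id/record_item_name pair can carry several values, MAX() silently picks one of them, which ties back to the data-quality point raised above.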

Using Real numbers for explicit sorting in sql database

I'm facing a recurring problem: I have to let a user reorder a list that is stored in a database.
The first straightforward approach I can think of is to have a "position" column with the ordering saved as an integer, e.g.
Data, Order
A 1
B 2
C 3
D 4
The problem here is that if I have to insert FOO in position 2, my table becomes
Data, Order
A 1
FOO 2
B 3
C 4
D 5
So to insert a new row, I have to do one INSERT and three UPDATEs on a table of five elements.
So my new idea is to use real numbers instead of integers; my new table becomes
Data, Order
A 1.0
B 2.0
C 3.0
D 4.0
If I want to insert an element FOO after A, this becomes
Data, Order
A 1.0
FOO 1.5
B 2.0
C 3.0
D 4.0
With only one SQL query executed.
This would work fine with theoretical real numbers, but floating-point numbers have limited precision, and I'm wondering how feasible this is and whether and how I can optimize it to avoid exceeding double precision after a reasonable number of modifications (with IEEE doubles, repeatedly bisecting the same gap runs out of distinct representable values after roughly 50 insertions).
Edit:
This is how I've implemented it for now in Python:
from decimal import Decimal

@classmethod
def get_middle_priority(cls, p, n):
    # Return the shortest decimal that lies strictly between p and n.
    p = Decimal(str(p))
    n = Decimal(str(n))
    m = p + ((n - p) / 2)
    i = 0
    while True:
        m1 = round(m, i)
        if m1 > p and m1 < n:
            return m1
        else:
            i += 1

@classmethod
def create(cls, data, user):
    prev = data.get('prev')
    if prev is None or len(prev) < 1:
        # No predecessor given: put the new row before the current first row.
        first = cls.list().first()
        if first is None:
            priority = 1.0
        else:
            priority = first.priority - 1.0
    else:
        # Place the new row between the named predecessor and its successor.
        prev = cls.list().filter(Rotator.codice == prev).first()
        next = cls.list().filter(Rotator.priority > prev.priority).first()
        if next is None:
            priority = prev.priority + 1.0
        else:
            priority = cls.get_middle_priority(prev.priority, next.priority)
    r = cls(data.get('codice'), priority)
    DBSession.add(r)
    return r
If you want to control the position and there is no ORDER BY solution, then a rather simple and robust approach is to have each item point to the next or to the previous one. Updates/inserts/deletes (other than the first and last) will require 3 operations:
Insert the new item
Update the item prior to the new item
Update the item after the new item
After you have that established, you can use a recursive CTE (with a UNION ALL) to produce the sorted list, with no limit on how many times you can insert (sketched below).
I have seen rather large implementations of this that were done via triggers to keep the list in perfect form. I, however, am not a fan of triggers and would just put the logic for the entire operation in a stored procedure.
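A rough sketch of that CTE, assuming a hypothetical items table with an id, the payload, and a prev_id pointing at the preceding item (NULL for the head of the list); this is SQL Server syntax, other engines spell it WITH RECURSIVE:

WITH ordered AS (
    -- anchor member: the head of the list
    SELECT i.id, i.data, 1 AS position
    FROM items i
    WHERE i.prev_id IS NULL

    UNION ALL

    -- recursive member: the item whose prev_id points at the row just emitted
    SELECT i.id, i.data, o.position + 1
    FROM items i
    INNER JOIN ordered o ON i.prev_id = o.id
)
SELECT id, data
FROM ordered
ORDER BY position;

Inserting, moving, or deleting an item then only touches the pointers of its two neighbours, and there is no precision to run out of (for lists longer than 100 rows SQL Server also needs OPTION (MAXRECURSION 0)).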
You may use a string rather than numbers:
item order
A ffga
B ffgaa
C ffgb
Here, the problem of finite precision is handled by the ability to grow the string. String storage in the database is theoretically unlimited, bounded only by the size of the storage device. But there is no better solution for absolute ordering of items. Relative ordering, like a linked list, might work better (but then you can't do an ORDER BY query).
The linked-list idea is neat, but it's expensive to pull the data out in order. If you have a database which supports it, you can use something like CONNECT BY to pull it out. "linked list in sql" is a question dedicated to that problem.
If you don't, I was thinking about how one can achieve an infinitely divisible range, and thought of sections in a book. What about storing the list initially as
1
2
3
and then to insert between 1 and 2 you insert a "subsection under 1" so that your list becomes
1
1.1
2
3
If you want to insert another one between 1.1 and 2 you place a second subsection under 1 and get
1
1.1
1.2
2
3
and lastly if you want to add something between 1.1 and 1.2 you need to introduce a subsubsection and get
1
1.1
1.1.1
1.2
2
3
Maybe using letters instead of numbers would be less confusing.
I'm not sure if there is any standard lexicographic ordering in sql databases which could sort this type of list correctly. But I think you could roll your own with some "order by case" and substringing. Edit: I found a question pertaining to this: linky
Another downside is that the worst case field size of this solution would grow exponentially with the number of input items (You could get long rows like 1.1.1.1.1.1 etc). But in the best case it would be linear or almost constant (Rows like 1.934856.1).
This solution is also quite close to what you already had in mind, and I'm not sure that it's an improvement. A decimal number using the binary partitioning strategy that you mentioned will probably increase the number of decimal places by one with each insert, right? So you would get
1,2 -> 1,1.5,2 -> 1,1.25,1.5,2 -> 1,1.125,1.25,1.5,2
So the best case of the subsectioning-strategy seems better, but the worst case a lot worse.
I'm also not aware of any infinite precision decimal types for sql databases. But you could of course save your number as a string, in which case this solution becomes even more similar to your original one.
Set all rows to a unique number starting at 1 and incrementing by 1 at the start. When you insert a new row, set it to count(*) of the table + 1 (there are a variety of ways of doing this).
When the user updates the Order of a row, always do it by calling a stored procedure with the Id (PK) of the row to update and the new order. In the stored procedure:
update tableName set Order = Order + 1 where Order >= @updatedRowOrder;
update tableName set Order = @updatedRowOrder where Id = @pk;
That guarantees there will always be space and a continuous sequence with no duplicates. I haven't worked out what would happen if you passed in silly new Order numbers (e.g. <= 0), but probably bad things; that's for the front-end app to prevent.
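A rough sketch of that stored procedure, assuming SQL Server and hypothetical names (a table called Items with Id and Order columns; Order is bracketed below because it is a reserved word):

CREATE PROCEDURE dbo.SetItemOrder
    @pk int,
    @updatedRowOrder int
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRANSACTION;
        -- shift every row at or after the target position down by one
        UPDATE Items SET [Order] = [Order] + 1 WHERE [Order] >= @updatedRowOrder;
        -- drop the requested row into the gap that was just opened
        UPDATE Items SET [Order] = @updatedRowOrder WHERE Id = @pk;
    COMMIT TRANSACTION;
END

Wrapping both statements in one transaction keeps concurrent reorders from interleaving between the two updates.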
Cheers -

T-SQL query for SQL Server 2008 : how to query X # of rows where X is a value in a query while matching on another column

Summary:
I have a list of work items that I am attempting to assign to a list of workers. Each worker is allowed to have a max of 100 work items assigned to them. Each work item specifies the user that should work it (associated as an owner).
For example:
Jim works a total of 5 accounts, each with multiple work items. In total Jim already has 50 items assigned to him, so I am allowed to assign only 50 more.
My plight/goal:
I am using a temp table and a SELECT statement to get the number of items each owner currently has assigned to them, and I calculate the available slots for new items and store the values in a new column. I then need to select from the items table where the owner matches my list of owners and their available slots (in the temp table), retrieving only as many rows per user as that user has open slots. The query would return only 50 rows for Jim, even though there may be 200 matching the criteria, while Sam may get 0 rows because he has no available slots even though there are 30 items for him in the items table.
I realize I may be approaching this problem wrong. I want to avoid using a cursor.
Edit: Adding some example code
SELECT
nUserID_Owner
, CASE
WHEN COUNT(c.nWorkID) >= 100 THEN 0
ELSE 100 - COUNT(c.nWorkID)
END
,COUNT(c.nWorkID)
FROM tblAccounts cic
LEFT JOIN tblWorkItems c
ON c.sAccountNumber = cic.sAccountNumber
AND c.nUserID_WorkAssignedTo = cic.nUserID_Owner
AND c.nTeamID_WorkAssignedTo = cic.nTeamID_Owner
WHERE cic.nUserID_Collector IS NOT NULL
AND nUserID_CurrentOwner = 5288
AND c.bCompleted = 0
GROUP BY nUserID_Owner
This produces output values of 5288, 50, 50 (in Jim's scenario).
It took longer than I wanted it to but I found a solution.
I did use a sub-query as suggested above to produce the work items with a unique row number per user.
I used PARTITION BY to produce a unique row number for each worker and included in my HAVING clause that the row number must be < the count of available slots. I'd post the code, but it's beyond the character limit and I'd also have a lot of things to change to anonymize the system properly.
Originally I was approaching the problem incorrectly focusing on limiting the results rather than thinking about creating the necessary data to relate the result sets.
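For later readers, a rough sketch of that shape, using the table and column names from the question where they are known; vwUnassignedItems and the nUserID_Owner column on the candidate rows are hypothetical stand-ins for however the unassigned items are tied back to an owner in the real schema:

;WITH slots AS (
    -- open slots per owner, as in the question's query
    SELECT cic.nUserID_Owner,
           CASE WHEN COUNT(c.nWorkID) >= 100 THEN 0
                ELSE 100 - COUNT(c.nWorkID) END AS nAvailable
    FROM tblAccounts cic
    LEFT JOIN tblWorkItems c
           ON c.sAccountNumber = cic.sAccountNumber
          AND c.nUserID_WorkAssignedTo = cic.nUserID_Owner
          AND c.nTeamID_WorkAssignedTo = cic.nTeamID_Owner
          AND c.bCompleted = 0
    WHERE cic.nUserID_Collector IS NOT NULL
    GROUP BY cic.nUserID_Owner
),
candidates AS (
    -- number the not-yet-assigned items per owner
    SELECT i.*,
           ROW_NUMBER() OVER (PARTITION BY i.nUserID_Owner
                              ORDER BY i.nWorkID) AS nRowNum
    FROM vwUnassignedItems i
)
SELECT c.*
FROM candidates c
INNER JOIN slots s ON s.nUserID_Owner = c.nUserID_Owner
WHERE c.nRowNum <= s.nAvailable;

Note that bCompleted = 0 is moved into the join condition so owners with no open items still appear with 100 free slots; with it in the WHERE clause, the LEFT JOIN behaves like an inner join and drops them.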

MS SQL 2000 - How to efficiently walk through a set of previous records and process them in groups. Large table

I'd like to ask about one thing. I have a table in the DB. It has 2 columns and looks like this:
Name   bilance
Jane   +3
Jane   -5
Jane   0
Jane   -8
Jane   -2
Paul   -1
Paul   2
Paul   9
Paul   1
...
I have to walk through this table, and if I find a record with a different "name" than the previous row, I process all rows with the previous "name". (If I step onto the first Paul row, I process all the Jane rows.)
The processing goes like this:
Now I work only with Jane records and walk through them one by one. On each record I stop and compare it with all previous Jane rows one by one.
The task is to summarize the "bilance" column (within the scope of the current person) where the values have different signs.
Summary:
I loop through this table on 3 levels (nested loops):
1st level = search for changes in the "name" column
2nd level = if a change was found, get all rows with the previous "name" and walk through them
3rd level = on each row, stop and walk through all previous rows with the current "name"
Can this be solved only using CURSOR and FETCHING, or is there some smoother solution?
My real table has 30,000 rows and 1,500 people, and if I do the logic in PHP it takes many minutes and then times out. So I would like to rewrite it in MS SQL 2000 (no other DB is allowed). Are cursors a fast solution, or is it better to use something else?
Thank you for your opinions.
UPDATE:
There are lots of questions about my "summarization". The problem is a little more difficult than I explained; I simplified it just to describe my algorithm.
Each row of my table contains many more columns. The most important is the month; that's why there are multiple rows for each person, one per month.
"Bilances" are workers' overtime and arrears hours, and I need to net the + and - bilances against each other to neutralize them using values from previous months. I want to end up with as many zeroes as possible. The table must stay as it is; only the bilances are changed toward zero.
Example:
Row (Jane -5) will be netted against row (Jane +3). Instead of +3 I will get 0 and instead of -5 I will get -2, because I used the -5 to reduce the +3.
The next row (Jane 0) won't be affected.
The next row (Jane -8) cannot be used, because all the previous bilances are negative.
etc.
You can sum all the values per name using a single SQL statement:
select
name,
sum(bilance) as bilance_sum
from
my_table
group by
name
order by
name
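Against the sample data this returns two rows: Jane with a bilance_sum of -12 (3 - 5 + 0 - 8 - 2) and Paul with 11 (-1 + 2 + 9 + 1).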
On the face of it, it sounds like this should do what you want:
select Name, sum(bilance)
from table
group by Name
order by Name
If not, you might need to elaborate on how the Names are sorted and what you mean by "summarize".
I'm not sure what you mean by this line: "The task is to summarize the 'bilance' column (within the scope of the current person) where the values have different signs."
But, it may be possible to use a group by query to get a lot of what you need.
select name, case when bilance < 0 then 'negative' when bilance >= 0 then 'positive' end, count(*)
from table
group by name, bilance
That might not be perfect syntax for the case statement, but it should get you really close.
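If the month-by-month netting does need running context, one SQL Server 2000-friendly building block (no window functions there) is a correlated-subquery running total per person. A rough sketch, assuming a hypothetical work_hours table with Name, Month, and Bilance columns:

SELECT t.Name,
       t.Month,
       t.Bilance,
       (SELECT SUM(t2.Bilance)
        FROM work_hours t2
        WHERE t2.Name = t.Name
          AND t2.Month <= t.Month) AS RunningBilance
FROM work_hours t
ORDER BY t.Name, t.Month

Whether that is enough to drive the exact +/- pairing rules is a separate question, but it avoids comparing every row against all of its predecessors in application code.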

Group by run when there is no run number in data (was Show how changing the length of a production run affects time-to-build)

It would seem that there is a much simpler way to state the problem. Please see Edit 2, following the sample table.
I have a number of different products on a production line and the date that each product entered production. Each product has two identifiers: item number and serial number. I have the total number of labour hours for each product by item number and by serial number (i.e. I can tell you how many hours went into each object that was manufactured and what the average build time is for each kind of object).
I want to determine how (if) varying the length of production runs affects the average time it takes to build a product (item number). A production run is the sequential production of multiple serial numbers for a single item number. We have historical records going back several years with production runs varying in length from 1 to 30.
I think to achieve this, I need to be able to assign a 'run id'. To me, that means building a query that sorts by start date and calculates a new unique value at each change in item number. If I knew how to do that, I could solve the rest of the problem on my own.
So that suggests a series of related questions:
Am I thinking about this the right way?
If I am on the right track, how do I generate those run id values? Calculate and store is an option, although I have a (misguided?) preference for direct queries. I know exactly how I would generate the run numbers in Excel, but I have a (misguided?) preference to do this in the database.
If I'm not on the right track, where might I find that track? :)
Edit:
Table structure (simplified) with sample data:
AutoID   Item     Serial   StartDate    Hours   RunID (proposed calculation)
1        Legend   1234     2010-06-06   10      1
3        Legend   1235     2010-06-07   9       1
2        Legend   1237     2010-06-08   8       1
4        Apex     1236     2010-06-09   12      2
5        Apex     1240     2010-06-10   11      2
6        Legend   1239     2010-06-11   10      3
7        Legend   1238     2010-06-12   8       3
I have shown that start date, serial, and autoID are mutually unrelated. I have shown the expectation that labour goes down as the run length increases (but this is a 'fact' only via received wisdom, not data analysis). I have shown what I envision as the heart of the solution, that being a RunID that reflects sequential builds of a single item. I know that if I could get that runID, I could group by run to get counts, averages, totals, max, min, etc. In addition, I could do something like hours/ to get percentage change from the start of the run. At that point I could graph the trends associated with different run lengths either globally across all items or on a per item basis. (At least I think I could do all that. I might have to muck about a bit, but I think I could get it done.)
Edit 2: This problem would appear to be: how do I get the 'starting' member (earliest start date) of each run when I don't already have a runID? (The runID shown in the sample table does not exist and I was originally suggesting that being able to calculate runID was a potentially viable solution.)
AutoID Item
1 Legend
4 Apex
6 Legend
I'm assuming that having learned how to find the first member of each run that I would then be able to use what I've learned to find the last member of each run and then use those two results to get all other members of each run.
Edit 3: my version of a query that uses the AutoID of the first item in a run as the RunID for all units in a run. This was built entirely from samples and direction provided by Simon, who has the accepted answer. Using this as the basis for grouping by run, I can produce a variety of run statistics.
SELECT first_product_of_run.AutoID AS runID, run_sibling.AutoID AS itemID, run_sibling.Item, run_sibling.Serial, run_sibling.StartDate, run_sibling.Hours
FROM (SELECT first_of_run.AutoID, first_of_run.Item, first_of_run.Serial, first_of_run.StartDate, first_of_run.Hours
      FROM dbo.production AS first_of_run LEFT OUTER JOIN
           dbo.production AS earlier_in_run ON first_of_run.AutoID - 1 = earlier_in_run.AutoID AND
           first_of_run.Item = earlier_in_run.Item
      WHERE (earlier_in_run.AutoID IS NULL)) AS first_product_of_run LEFT OUTER JOIN
     dbo.production AS run_sibling ON first_product_of_run.Item = run_sibling.Item AND first_product_of_run.AutoID <> run_sibling.AutoID AND
     first_product_of_run.StartDate < run_sibling.StartDate LEFT OUTER JOIN
     dbo.production AS product_between ON first_product_of_run.Item <> product_between.Item AND
     first_product_of_run.StartDate < product_between.StartDate AND product_between.StartDate < run_sibling.StartDate
WHERE (product_between.AutoID IS NULL)
Could you describe your table structure some more? If the "date that each product entered production" is a full timestamp, or if there is a sequential identifier across products, you can write queries to identify the first and last products of a run. From that, you can assign IDs to the runs or calculate their lengths.
Edit:
Once you've identified 1, 4, and 6 as the starts of runs, you can use this query to find the other IDs in each run:
select first_product_of_run.AutoID, run_sibling.AutoID
from first_product_of_run
left join production run_sibling on first_product_of_run.Item = run_sibling.Item
and first_product_of_run.AutoID <> run_sibling.AutoID
and first_product_of_run.StartDate < run_sibling.StartDate
left join production product_between on first_product_of_run.Item <> product_between.Item
and first_product_of_run.StartDate < product_between.StartDate
and product_between.StartDate < run_sibling.StartDate
where product_between.AutoID is null
first_product_of_run can be a temp table, table variable, or sub-query that you used to find the start of a run. The key is the where product_between.AutoID is null. That restricts the results to only pairs where no different items were produced between them.
Edit 2, here's how to get the first of each run:
select first_of_run.AutoID
from
(
select product.AutoID, product.Item, MAX(previous_product.StartDate) as PreviousDate
from production product
left join production previous_product on product.AutoID <> previous_product.AutoID
and product.StartDate > previous_product.StartDate
group by product.AutoID, product.Item
) first_of_run
left join production earlier_in_run
on first_of_run.PreviousDate = earlier_in_run.StartDate
and first_of_run.Item = earlier_in_run.Item
where earlier_in_run.AutoID is null
It's not pretty, and will break if StartDate is not unique. The query could be simplified by adding a sequential and unique identifier with no gaps. In fact, that step will probably be necessary if StartDate is not unique. Here's how it would look:
select first_of_run.AutoID
from production first_of_run
left join production earlier_in_run
on (first_of_run.Sequence - 1) = earlier_in_run.Sequence
and first_of_run.Item = earlier_in_run.Item
where earlier_in_run.AutoID is null
Using outer joins to find where things aren't still twists my brain, but it's a very powerful technique.
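For what it's worth, on a database with window functions the run id the question asks for falls straight out of the standard gaps-and-islands trick; a rough sketch against the sample production table:

SELECT AutoID, Item, Serial, StartDate, Hours,
       -- both row numbers advance together within a consecutive run of the same Item,
       -- so their difference is constant within a run and jumps at each new run
       ROW_NUMBER() OVER (ORDER BY StartDate)
     - ROW_NUMBER() OVER (PARTITION BY Item ORDER BY StartDate) AS RunGroup
FROM dbo.production;

Grouping by Item and RunGroup then gives run length, total hours, average hours, and so on per run.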