Getting a Chunk of Work

Getting a Chunk of Work - sql

Recently I had to deal with a problem that I imagined would be pretty common: given a database table with a large (million+) number of rows to be processed, and various processors running in various machines / threads, how to safely allow each processor instance to get a chunk of work (say 100 items) without interfering with one another?
The reason I am getting a chunk at a time is for performance reasons - I don't want to go to the database for each item.

There are a few approaches - you could associate each processor a token, and have a SPROC that sets that token against the next [n] available items; perhaps something like:
(note - needs suitable isolation-level; perhaps serializable: SET TRANSACTION ISOLATION LEVEL SERIALIZABLE)
(edited to fix TSQL)
UPDATE TOP (1000) WORK
SET [Owner] = #processor, Expiry = #expiry
OUTPUT INSERTED.Id -- etc
WHERE [Owner] IS NULL
You'd also want a timeout (#expiry) on this, so that when a processor goes down you don't lose work. You'd also need a task to clear the owner on things that are past their Expiry.

You can have a special table to queue work up, where the consumers delete (or mark) work as being handled, or use a middleware queuing solution, like MSMQ or ActiveMQ.
Middleware comes with its own set of problems so, if possible, I'd stick with a special table (keep it as small as possible, hopefully just with an id so the workers can fetch the rest of the information by themselves on the rest of the database and not lock the queue table up for too long).
You'd fill this table up at regular intervals and let processors grab what they need from the top.
Related questions on SQL table queues:
Queue using table
Working out the SQL to query a priority queue table
Related questions on queuing middleware:
Building a high performance and automatically backupped queue
Messaging platform

You didn't say which database server you're using, but there are a couple of options.
MySQL includes an extension to SQL99's INSERT to limit the number of rows that are updated. You can assign each worker a unique token, update a number of rows, then query to get that worker's batch. Marc used the UPDATE TOP syntax, but didn't specify the database server.
Another option is to designate a table used for locking. Don't use the same table with the data, since you don't want to lock it for reading. Your lock table likely only needs a single row, with the next ID needing work. A worker locks the table, gets the current ID, increments it by whatever your batch size is, updates the table, then releases the lock. Then it can go query the data table and pull the rows it reserved. This option assumes the data table has a monotonically increasing ID, and isn't very fault-tolerant if a worker dies or otherwise can't finish a batch.

Quite similar to this question: SQL Server Process Queue Race Condition
You run a query to assign a 100 rows to a given processorid. If you use these locking hints then it's "safe" in the concurrency sense. And it's a single SQL statement with no SET statements needed.
This is taken from the other question:
UPDATE TOP (100)
foo
SET
ProcessorID = #PROCID
FROM
OrderTable foo WITH (ROWLOCK, READPAST, UPDLOCK)
WHERE
ProcessorID = 0 --Or whatever unassigned is

Related

How to establish read-only-once implement within SAP HANA?

Context: I am a long-time MSSQL developer... What I would like to know is how to implement a read-only-once select from SAP HANA.
High-level pseudo-code:
Collect request via db proc (query)
Call API with request
Store results of the request (response)
I have a table (A) that is the source of inputs to a process. Once a process has completed it will write results to another table (B).
Perhaps this is all solved if I just add a column to table A to avoid concurrent processors from selecting the same records from A?
I am wondering how to do this without adding the column to source table A.
What I have tried is a left outer join between tables A and B to get rows from A that have no corresponding rows (yet) in B. This doesn't work, or I haven't implemented such that rows are processed only 1 time by any of the processors.
I have a stored proc to handle batch selection:
/*
* getBatch.sql
*
* SYNOPSIS: Retrieve the next set of criteria to be used in a search
* request. Use left outer join between input source table
* and results table to determine the next set of inputs, and
* provide support so that concurrent processes may call this
* proc and get their inputs exclusively.
*/
alter procedure "ACOX"."getBatch" (
in in_limit int
,in in_run_group_id varchar(36)
,out ot_result table (
id bigint
,runGroupId varchar(36)
,sourceTableRefId integer
,name nvarchar(22)
,location nvarchar(13)
,regionCode nvarchar(3)
,countryCode nvarchar(3)
)
) language sqlscript sql security definer as
begin
-- insert new records:
insert into "ACOX"."search_result_v4" (
"RUN_GROUP_ID"
,"BEGIN_DATE_TS"
,"SOURCE_TABLE"
,"SOURCE_TABLE_REFID"
)
select
in_run_group_id as "RUN_GROUP_ID"
,CURRENT_TIMESTAMP as "BEGIN_DATE_TS"
,'acox.searchCriteria' as "SOURCE_TABLE"
,fp.descriptor_id as "SOURCE_TABLE_REFID"
from
acox.searchCriteria fp
left join "ACOX"."us_state_codes" st
on trim(fp.region) = trim(st.usps)
left outer join "ACOX"."search_result_v4" r
on fp.descriptor_id = r.source_table_refid
where
st.usps is not null
and r.BEGIN_DATE_TS is null
limit :in_limit;
-- select records inserted for return:
ot_result =
select
r.ID id
,r.RUN_GROUP_ID runGroupId
,fp.descriptor_id sourceTableRefId
,fp.merch_name name
,fp.Location location
,st.usps regionCode
,'USA' countryCode
from
acox.searchCriteria fp
left join "ACOX"."us_state_codes" st
on trim(fp.region) = trim(st.usps)
inner join "ACOX"."search_result_v4" r
on fp.descriptor_id = r.source_table_refid
and r.COMPLETE_DATE_TS is null
and r.RUN_GROUP_ID = in_run_group_id
where
st.usps is not null
limit :in_limit;
end;
When running 7 concurrent processors, I get a 35% overlap. That is to say that out of 5,000 input rows, the resulting row count is 6,755. Running time is about 7 mins.
Currently my solution includes adding a column to the source table. I wanted to avoid that but it seems to make a simpler implement. I will update the code shortly, but it includes an update statement prior to the insert.
Useful references:
SAP HANA Concurrency Control
Exactly-Once Semantics Are Possible: Here’s How Kafka Does It

First off: there is no "read-only-once" in any RDBMS, including MS SQL.
Literally, this would mean that a given record can only be read once and would then "disappear" for all subsequent reads. (that's effectively what a queue does, or the well-known special-case of a queue: the pipe)
I assume that that is not what you are looking for.
Instead, I believe you want to implement a processing-semantic analogous to "once-and-only-once" aka "exactly-once" message delivery. While this is impossible to achieve in potentially partitioned networks it is possible within the transaction context of databases.
This is a common requirement, e.g. with batch data loading jobs that should only load data that has not been loaded so far (i.e. the new data that was created after the last batch load job began).
Sorry for the long pre-text, but any solution for this will depend on being clear on what we want to actually achieve. I will get to an approach for that now.
The major RDBMS have long figured out that blocking readers is generally a terrible idea if the goal is to enable high transaction throughput. Consequently, HANA does not block readers - ever (ok, not ever-ever, but in the normal operation setup).
The main issue with the "exactly-once" processing requirement really is not the reading of the records, but the possibility of processing more than once or not at all.
Both of these potential issues can be addressed with the following approach:
SELECT ... FOR UPDATE ... the records that should be processed (based on e.g. unprocessed records, up to N records, even-odd-IDs, zip-code, ...). With this, the current session has an UPDATE TRANSACTION context and exclusive locks on the selected records. Other transactions can still read those records, but no other transaction can lock those records - neither for UPDATE, DELETE, nor for SELECT ... FOR UPDATE ... .
Now you do your processing - whatever this involves: merging, inserting, updating other tables, writing log-entries...
As the final step of the processing, you want to "mark" the records as processed. How exactly this is implemented, does not really matter.
One could create a processed-column in the table and set it to TRUE when records have been processed. Or one could have a separate table that contains the primary keys of the processed records (and maybe a load-job-id to keep track of multiple load jobs).
In whatever way this is implemented, this is the point in time, where this processed status needs to be captured.
COMMIT or ROLLBACK (in case something went wrong). This will COMMIT the records written to the target table, the processed-status information, and it will release the exclusive locks from the source table.
As you see, Step 1 takes care of the issue that records may be missed by selecting all wanted records that can be processed (i.e. they are not exclusively locked by any other process).
Step 3 takes care of the issue of records potentially be processed more than once by keeping track of the processed records. Obviously, this tracking has to be checked in Step 1 - both steps are interconnected, which is why I point them out explicitly. Finally, all the processing occurs within the same DB-transaction context, allowing for guaranteed COMMIT or ROLLBACK across the whole transaction. That means, that no "record marker" will ever be lost when the processing of the records was committed.
Now, why is this approach preferable to making records "un-readable"?
Because of the other processes in the system.
Maybe the source records are still read by the transaction system but never updated. This transaction system should not have to wait for the data load to finish.
Or maybe, somebody wants to do some analytics on the source data and also needs to read those records.
Or maybe you want to parallelise the data loading: it's easily possible to skip locked records and only work on the ones that are "available for update" right now. See e.g. Load balancing SQL reads while batch-processing? for that.
Ok, I guess you were hoping for something easier to consume; alas, that's my approach to this sort of requirement as I understood it.

Lock issues on large recordset

I have a database table that I use as a queue system, where separate process that talk to each other create and read entries in the table. For example, when a user initiates a search an entry is created, then another process that runs every second or two will pick up that new entry, update the status and then do a search, updating the entry again when the search is complete. This all seems to work well with thousands of searches per hour.
However, I have a master admin screen that lets me view the status of all of these 'jobs' but it runs very slowly. I basically return all entries in the table for the last hour so I can keep an eye on what's going on. I think that I am running into lock issues of some sort. I only need to read each entry, and don't really care if it the data is a little bit out of date. I just use a standard 'Select * from Table' statement so maybe it is waiting for other locks to expire before returning data as the jobs are constantly updating the data.
Would this be handled better by a certain kind of cursor to return each row one at a time, etc? Any other ideas?
Thanks

If you really don't care if the data is a bit out of date... or if you only need the data to be 99.99% accurate, consider using WITH (NOLOCK):
SELECT * FROM Table WITH (NOLOCK);
This will instruct your query to use the READ UNCOMMITTED ISOLATION LEVEL, which has the following behavior:
Specifies that dirty reads are allowed. No shared locks are issued to
prevent other transactions from modifying data read by the current
transaction, and exclusive locks set by other transactions do not
block the current transaction from reading the locked data.
Be aware that NOLOCK may cause some inaccuracies in your data, so it probably isn't a good idea to use it throughout the rest of your system.

You need FROM yourtable WITH (NOLOCK) table hint.
You may also want to look at transaction isolation in your update process, if you aren't already

An alternative to NOLOCK (which can lead to very bad things, such as missed rows or duplicated rows) is to allow read committed snapshot isolation at the database level and then issue your query with:
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;

Efficiently detecting concurrent insertions using standard SQL

The Requirements
I have a following table (pseudo DDL):
CREATE TABLE MESSAGE (
MESSAGE_GUID GUID PRIMARY KEY,
INSERT_TIME DATETIME
)
CREATE INDEX MESSAGE_IE1 ON MESSAGE (INSERT_TIME);
Several clients concurrently insert rows in that table, possibly many times per second. I need to design a "Monitor" application that will:
Initially, fetch all the rows currently in the table.
After that, periodically check if there are any new rows inserted and then fetch
these rows only.
There may be multiple Monitors concurrently running. All the Monitors need to see all the rows (i.e. when a row is inserted, it must be "detected" by all the currently running Monitors).
This application will be developed for Oracle initially, but we need to keep it portable to every major RDBMS and would like to avoid as much database-specific stuff as possible.
The Problem
The naive solution would be to simply find the maximal INSERT_TIME in rows selected in step 1 and then...
SELECT * FROM MESSAGE WHERE INSERT_TIME >= :max_insert_time_from_previous_select
...in step 2.
However, I'm worried this might lead to race conditions. Consider the following scenario:
Transaction A inserts a new row but does not yet commit.
Transaction B inserts a new row and commits.
The Monitor selects rows and sees that the maximal INSERT_TIME
is the one inserted by B.
Transaction A commits. At this point, A's INSERT_TIME is actually
earlier than the B's (A's INSERT was actually executed before
B's, before we even knew who is going to commit first).
The Monitor selects rows newer than B's INSERT_TIME (as a consequence of step 3). Since A's INSERT_TIME is earlier than B's insert time, A's row is skipped.
So, the row inserted by transaction A is never fetched.
Any ideas how to design the client SQL or even change the database schema (as long as it is mildly portable), so these kinds of concurrency problems are avoided, while still keeping a decent performance?
Thanks.

Without using any of the platform-specific change data capture (CDC) technologies, there are a couple of approaches.
Option 1
Each Monitor registers a sort of subscription to the MESSAGE table. The code that writes messages then writes each MESSAGE once per Monitor, i.e.
CREATE TABLE message_subscription (
message_subscription_id NUMBER PRIMARY KEY,
message_id RAW(32) NOT NULLL,
monitor_id NUMBER NOT NULL,
CONSTRAINT uk_message_sub UNIQUE (message_id, monitor_id)
);
INSERT INTO message_subscription
SELECT message_subscription_seq.nextval,
sys_guid,
monitor_id
FROM monitor_subscribers;
Each Monitor then deletes the message from its subscription once that is processed.
Option 2
Each Monitor maintains a cache of the recent messages it has processed that is at least as long as the longest-running transaction could be. If the Monitor maintained a cache of the messages it has processed for the last 5 minutes, for example, it would query your MESSAGE table for all messages later than its LAST_MONITOR_TIME. The Monitor would then be responsible for noting that some of the rows it had selected had already been processed. The Monitor would only process MESSAGE_ID values that were not in its cache.
Option 3
Just like Option 1, you set up subscriptions for each Monitor but you use some queuing technology to deliver the messages to the Monitor. This is less portable than the other two options but most databases can deliver messages to applications via queues of some sort (i.e. JMS queues if your Monitor is a Java application). This saves you from reinventing the wheel by building your own queue table and gives you a standard interface in the application tier to code against.

You need to be able to identify all rows added since the last time you checked (i.e. the monitor checks). You have a "time of insert" column. However, as you spell it out, that time of insert column cannot be used with "greater than [last check]" logic to reliably identify subsequently inserted new items. Commits do not occur in the same order as (initial) inserts. I am not aware of anything that works on all major RDBMSs that would clearly and safely apply such an "as of" tag at the actual time of commit. [This is not to say I would know it if such a thing existed, but it seems a pretty safe guess to me.] Thus, you will have to use something other than a "greater than last check" algorithm.
That leads to filtering. Upon insert, an item (row) is flagged as "not yet checked"; when a montior logs in, it reads all not yet checked items, returns that set, and flips the flag to "checked" (and if there are multiple monitors, this must all be done within its own transaction). The monitors' queries will have to read all the data and pick out which have not yet been checked. The implication is, however, that this will be a fairly small set of data, at least relative to the entire set of data. From here, I see two likely options:
Add a column, perhaps "Checked". Store a binary 1/0 value for is/isnot checked. The cardinality of this value will be extreme -- 99.9s Checked, 00,0s Unchecked, so it should be rather efficient. (Some RDBMSs provide filtered queries, such that the Checked rows won't even be in the index; once flipped to checked, a row will presumably never be flipped back, so the overhead to support this shouldn't be too great.)
Add a separate table identify those rows in the "primary" table that have not yet been checked. When a montior logs in, it reads and deletes the items from that table. This doesn't seem efficient... but again, if the data set involved is small, the overall performance pain might be acceptable.

You should use Oracle AQ with a multi-subscriber queue.
This is Oracle specific, but you can create an abstraction layer of stored procedures (or abstract in Java if you like) so that you have a common API to enqueue the new messages and have each subscriber (monitor) dequeue any pending messages. Behind that API, for Oracle you use AQ.
I am not sure if there is a queuing solution for other databases.
I don't think you will be able to come up with a totally database agnostic approach that meets your requirements. You could extend the example above that included the 'checked' column, to have a second table called monitor_checked - that would contain one row per message per monitor. That is basically what AQ does behind the scenes, so it is sort of reinventing the wheel.

With PostgreSQL, use PgQ. It has all those little details worked out for you.
I doubt you will find a robust and manageable database-agnostic solution for this.

Incremented DB Field

Let's say that I have an article on a website, and I want to track the number of views to the article. In the Articles table, there's the PK ID - int, Name - nvarchar(50), and ViewCount - int. Every time the the page is viewed, I increment the ViewCount field. I'm worried about collisions when updating the field. I can run it in a sproc with a transaction like:
CREATE PROCEDURE IncrementView
(
#ArticleID int
)
as
BEGIN TRANSACTION
UPDATE Article set ViewCount = ViewCount + 1 where ID = #ArticleID
IF ##ERROR <> 0
BEGIN
-- Rollback the transaction
ROLLBACK
-- Raise an error and return
RAISERROR ('Error Incrementing', 16, 1)
RETURN
END
COMMIT
My fear is that I'm going to end up with PageViews not being counted in this model. The other possible solution is a log type of model where I actually log views to the articles and use a combination of a function and view to grab data about number of views to an article.

Probably a better model is to cache the number of views hourly in the app somewhere, and then update them in a batch-style process.
-- Edit:
To to elaborate more, a simple model for you may be:
Each page load, for the given page, increment a static hashmap. Also on each load, check if enough time has elapsed since 'Last Update', and if so, perform an update.
Be tricky, and put the base value in the asp.net cache (http://msdn.microsoft.com/en-us/library/aa478965.aspx) and, when it times out, [implement the cache removal handler as described in the link] do the update. Set the timeout for an hour.
In both models, you'll have the static map of pages to counts; you'll update this each view, and you'll also use this - and the cached db amount - to get the current 'live' count.

The database should be able to handle a single digit increment atomically. Queries on the queue should be handled in order in the case where there might be a conflict. Your bigger issue, if there is enough volume will be handling all of the writes to the same row. Each write will block the reads and writes behind it. If you are worried, I would create a simple program that calls SQL updates in a row and run it with a few hundred concurrent threads (increase threads until your hardware is saturated). Make sure the attempts = the final result.
Finding a mechanism to cache and/or perform batch updates as silky suggests sounds like a winner.
Jacob

You don't need to worry about concurrency within a single update statement in SQL Server.
But if you are worried about 2 users hitting a table in the same tenth of a second, keep in mind that there are 864,000 10th of a seconds in a day. Doesn't sound like something that is going to be an issue for a page that serves up articles.

Have no fear!
This update is a single (atomic) transaction - you cannot get 'collisions'. Even if 5,000,000 calls to IncrementView all hit the database at the exact same moment, they will each be processed in a serial, queue like fashion - thats what you are using a database engine for - consistency. Each call will gain an exclusive update lock on the row (at least), so no subsequent queries can update the row until the current one has committed.
You don't even need to use BEGIN TRAN...COMMIT. If the update fails, there is nothing to rollback anyway.
I don't see the need for any app caching - there's no reason why this update would take a long time adn therefore should have no impact on the performance of your app.
[Assuming it's relatively well designed!]

Best practices for multithreaded processing of database records

I have a single process that queries a table for records where PROCESS_IND = 'N', does some processing, and then updates the PROCESS_IND to 'Y'.
I'd like to allow for multiple instances of this process to run, but don't know what the best practices are for avoiding concurrency problems.
Where should I start?

The pattern I'd use is as follows:
Create columns "lockedby" and "locktime" which are a thread/process/machine ID and timestamp respectively (you'll need the machine ID when you split the processing between several machines)
Each task would do a query such as:
UPDATE taskstable SET lockedby=(my id), locktime=now() WHERE lockedby IS NULL ORDER BY ID LIMIT 10
Where 10 is the "batch size".
Then each task does a SELECT to find out which rows it has "locked" for processing, and processes those
After each row is complete, you set lockedby and locktime back to NULL
All this is done in a loop for as many batches as exist.
A cron job or scheduled task, periodically resets the "lockedby" of any row whose locktime is too long ago, as they were presumably done by a task which has hung or crashed. Someone else will then pick them up
The LIMIT 10 is MySQL specific but other databases have equivalents. The ORDER BY is import to avoid the query being nondeterministic.

Although I understand the intention I would disagree on going to row level locking immediately. This will reduce your response time and may actually make your situation worse. If after testing you are seeing concurrency issues with APL you should do an iterative move to “datapage” locking first!
To really answer this question properly more information would be required about the table structure and the indexes involved, but to explain further.
DOL, datarow locking uses a lot more locks than allpage/page level locking. The overhead in managing all the locks and hence the decrease of available memory due to requests for more lock structures within the cache will decrease performance and counter any gains you may have by moving to a more concurrent approach.
Test your approach without the move first on APL (all page locking ‘default’) then if issues are seen move to DOL (datapage first then datarow). Keep in mind when you switch a table to DOL all responses on that table become slightly worse, the table uses more space and the table becomes more prone to fragmentation which requires regular maintenance.
So in short don’t move to datarows straight off try your concurrency approach first then if there are issues use datapage locking first then last resort datarows.

You should enable row level locking on the table with:
CREATE TABLE mytable (...) LOCK DATAROWS
Then you:
Begin the transaction
Select your row with FOR UPDATE option (which will lock it)
Do whatever you want.
No other process can do anything to this row until the transaction ends.
P. S. Some mention overhead problems that can result from using LOCK DATAROWS.
Yes, there is overhead, though i'd hardly call it a problem for a table like this.
But if you switch to DATAPAGES then you may lock only one row per PAGE (2k by default), and processes whose rows reside in one page will not be able to run concurrently.
If we are talking of table with dozen of rows being locked at once, there hardly will be any noticeable performance drop.
Process concurrency is of much more importance for design like that.

The most obvious way is locking, if your database doesn't have locks, you could implement it yourself by adding a "Locked" field.
Some of the ways to simplify the concurrency is to randomize the access to unprocessed items, so instead of competition on the first item, they distribute the access randomly.

Convert the procedure to a single SQL statement and process multiple rows as a single batch. This is how databases are supposed to work.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas