Best practices for multithreaded processing of database records - sql

I have a single process that queries a table for records where PROCESS_IND = 'N', does some processing, and then updates the PROCESS_IND to 'Y'.
I'd like to allow for multiple instances of this process to run, but don't know what the best practices are for avoiding concurrency problems.
Where should I start?

The pattern I'd use is as follows:
Create columns "lockedby" and "locktime" which are a thread/process/machine ID and timestamp respectively (you'll need the machine ID when you split the processing between several machines)
Each task would do a query such as:
UPDATE taskstable SET lockedby=(my id), locktime=now() WHERE lockedby IS NULL ORDER BY ID LIMIT 10
Where 10 is the "batch size".
Then each task does a SELECT to find out which rows it has "locked" for processing, and processes those
After each row is complete, you set lockedby and locktime back to NULL
All this is done in a loop for as many batches as exist.
A cron job or scheduled task periodically resets the "lockedby" of any row whose locktime is too long ago, as those rows were presumably claimed by a task which has hung or crashed. Someone else will then pick them up.
The LIMIT 10 is MySQL-specific, but other databases have equivalents. The ORDER BY is important to avoid the query being nondeterministic.
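Putting the pieces together, one worker's loop might look roughly like this in MySQL syntax (the 'worker-42' id, the literal id values, and the 10-minute timeout are invented for illustration):
-- claim a batch of unlocked rows
UPDATE taskstable
SET lockedby = 'worker-42', locktime = NOW()
WHERE lockedby IS NULL
ORDER BY id
LIMIT 10;
-- find out which rows we actually got
SELECT id FROM taskstable WHERE lockedby = 'worker-42';
-- ... process each row, then release it ...
UPDATE taskstable SET lockedby = NULL, locktime = NULL WHERE id = 123;
-- periodic cleanup job for hung or crashed workers
UPDATE taskstable
SET lockedby = NULL, locktime = NULL
WHERE lockedby IS NOT NULL AND locktime < NOW() - INTERVAL 10 MINUTE;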

Although I understand the intention, I would disagree with going to row-level locking immediately. This will hurt your response times and may actually make your situation worse. If, after testing, you are seeing concurrency issues with APL (all-page locking), you should make an iterative move to "datapage" locking first!
To answer this question properly, more information would be required about the table structure and the indexes involved, but to explain further:
DOL (datarow) locking uses a lot more locks than allpage/page-level locking. The overhead of managing all the locks, and hence the decrease in available memory due to requests for more lock structures within the cache, will decrease performance and counter any gains you may have from moving to a more concurrent approach.
Test your approach first without the move, on APL (all-page locking, the default); then, if issues are seen, move to DOL (datapage first, then datarow). Keep in mind that when you switch a table to DOL, all responses on that table become slightly worse, the table uses more space, and the table becomes more prone to fragmentation, which requires regular maintenance.
So in short: don't move to datarows straight off. Try your concurrency approach first; then, if there are issues, use datapage locking first, and datarows only as a last resort.
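For reference, the APL/DOL terminology above is Sybase ASE's, where changing the locking scheme of an existing table is a one-line statement each time (mytable is a placeholder):
-- first step if concurrency problems show up under all-page locking
alter table mytable lock datapages
-- last resort
alter table mytable lock datarows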

You should enable row level locking on the table with:
CREATE TABLE mytable (...) LOCK DATAROWS
Then you:
Begin the transaction
Select your row with FOR UPDATE option (which will lock it)
Do whatever you want.
No other process can do anything to this row until the transaction ends.
P.S. Some mention overhead problems that can result from using LOCK DATAROWS.
Yes, there is overhead, though I'd hardly call it a problem for a table like this.
But if you switch to DATAPAGES then locking any row locks its whole PAGE (2k by default), and processes whose rows reside in the same page will not be able to run concurrently.
If we are talking about a table with a dozen rows being locked at once, there will hardly be any noticeable performance drop.
Process concurrency is of much more importance for a design like that.
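A rough sketch of the flow being described (the FOR UPDATE syntax and the table/column names here are generic placeholders; the exact form varies by database):
begin transaction
-- lock the chosen row; other workers block on it until we commit
select id, payload
from mytable
where id = 42 and process_ind = 'N'
for update
-- ... do whatever processing is needed ...
update mytable set process_ind = 'Y' where id = 42
commit transaction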

The most obvious way is locking, if your database doesn't have locks, you could implement it yourself by adding a "Locked" field.
One way to reduce the contention is to randomize the access to unprocessed items, so that instead of every worker competing for the first item, they distribute their access randomly.
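A minimal sketch of that idea in MySQL syntax (the records table and its columns are placeholders; ORDER BY RAND() is fine for small pending sets but has its own cost on large ones):
-- pick a random unprocessed record instead of competing for the first one
SELECT id
FROM records
WHERE process_ind = 'N'
ORDER BY RAND()
LIMIT 1;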

Convert the procedure to a single SQL statement and process multiple rows as a single batch. This is how databases are supposed to work.
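If the per-row work can be expressed in SQL, the whole job collapses into one set-based statement. A sketch, where the SET clause is only a stand-in for whatever the real processing computes (table and column names are placeholders):
-- process and flag every pending row in a single statement
UPDATE records
SET processed_value = UPPER(raw_value),
    process_ind = 'Y'
WHERE process_ind = 'N';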

Related

Selecting 80% of rows and table lock

One of my colleagues came to me with this statement:
Having a SELECT on a table that fetches 80% of the rows, while having a WHERE clause on a column with an index, causes a table lock. So to avoid that, add a WITH (NOLOCK) in your FROM clause.
His only argument was: believe me, I've experienced it myself. I cannot find proper documentation for this.
As far as I know, WITH (NOLOCK) only affects the table by letting UPDATE and INSERT occur while selecting, and that can lead to dirty reads.
Is my colleague's assumption correct?
I think you're referring to lock escalation, https://technet.microsoft.com/en-us/library/ms184286(v=sql.105).aspx , combined with a table scan caused by an index with bad selectivity, and some possibilities for blocking.
If the statistics on a non-clustered index show that the number of rows returned from the table for a specific value exceeds some threshold, then the optimizer will choose a table scan to find the corresponding rows instead of an index seek with corresponding bookmark lookups, because bookmark lookups are slow in large quantities.
I typically tell people that you want that percentage to be 5% or lower, but sometimes it will still index seek up to 10% or so. At 80%, it's definitely going to table scan.
Also, since the query is doing a table scan, it has to be able to acquire some kind of lock on every single row in the table. If there are any other queries running updates, or otherwise preventing locks from being acquired on even a single row, the query will have to wait.
With lock escalation, it's not a percentage, but instead a specific magic number of 5,000. A query generally starts reading rows using row locks. If a single query reads 5,000 or more rows, it will escalate the locks that it is using against the table from row and/or page locks to full table locks.
This is when deadlocks happen, because another query may be trying to do the same thing.
These locks don't necessarily have anything to do with inserts/updates.
This is an actual thing. No, this does not mean that you should use NOLOCK. You'd be much better off looking at READPAST, TABLOCK, or TABLOCKX, https://msdn.microsoft.com/en-us/library/ms187373.aspx , if you're having issues with deadlocks.
Do not do any of these things just out of habit and only look into them for specific instances with highly transactional tables that are experiencing actual problems.
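For a queue-style workload, READPAST is often the most useful of those hints, because it simply skips rows that another session currently has locked instead of waiting on them. A sketch (dbo.TaskQueue and its columns are placeholders):
-- read pending work, skipping rows another session currently has locked
SELECT TOP (10) TaskId, Payload
FROM dbo.TaskQueue WITH (READPAST)
WHERE ProcessInd = 'N'
ORDER BY TaskId;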
By default writers have priority and readers will wait on writers to finish. WITH(NOLOCK) will allow readers to read uncommitted data, avoiding waits on writers to finish. For read-only queries against very large tables, this is ok if you are querying data such as an old partition, or pulling back data that is not going to change often and where changes are not critical to the presentation of the data. This is the same as using the SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED directive in stored procedures.
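For reference, the session-level equivalent of sprinkling NOLOCK on every table looks like this (dbo.BigTable is just a placeholder):
-- same effect as WITH (NOLOCK) on every table in the queries that follow
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT COUNT(*) FROM dbo.BigTable;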

What is the practical use of nolock in sql server

I was going through some documentation on MSDN and it said that "No shared locks are issued to prevent other transactions from modifying data read by the current transaction".
So in layman's terms (i.e. mine), this would cause the problem of dirty reads, which seems too dangerous; if so, then why use it?
Does anybody know of a practical scenario where it would be used?
In our case (previous job) we used this for getting ballpark numbers. So for example a table that holds millions of e-mails sent in a given day, if at 5:00 we want to see "where we are" we can say:
SELECT COUNT(*) FROM dbo.MessageTable WITH (NOLOCK)
WHERE CampaignID = x AND CustomerID = y;
A lot of people will suggest using this for COUNT(*) with no WHERE clause. I would argue that if you're willing to accept some inaccuracy in COUNT(*) you may as well do:
SELECT SUM(rows) FROM sys.partitions
WHERE [object_id] = OBJECT_ID('dbo.tablename')
AND index_id IN (0,1);
This number is similarly inaccurate due to in-flight transactions, but doesn't actually have to scan the table, so it is far more efficient. For our case we could use this even for subsets: with a filtered index in place (for other purposes) we could similarly query sys.partitions but use the index_id of the filtered index.
In most cases, though, using NOLOCK may feel like a turbo button, but the inaccuracy it may cause is very rarely worth it. Unless your system is already heavily tempdb-bound, you should consider READ_COMMITTED_SNAPSHOT for your current NOLOCK scenarios instead. See Pros and cons of READ_COMMITTED_SNAPSHOT
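For reference, turning that on is a single database-level setting (YourDatabase is a placeholder; the statement needs exclusive access to the database to complete, so it is typically run during a quiet period):
ALTER DATABASE YourDatabase SET READ_COMMITTED_SNAPSHOT ON;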
I usually use this to query a relatively busy table I normally use for logging.
SELECT TOP 10 * FROM dbo.MessageLog (NOLOCK) WHERE AppCode = 'DesktopApp' ORDER BY MessageDate DESC
The table's records are mainly written once, and never updated.
Depending on the circumstances it can provide faster access if you are willing to accept that some records might be out of date.
For example:
SELECT COUNT(*) FROM mytable (nolock)
will use less resources and generally be faster on a large table than
SELECT COUNT(*) FROM mytable
Not all data is created equal. Quite a lot of data isn't actually critical, and so it is ok, in some situations, to have dirty reads going on. For instance, a batch process could be bulk updating a number of different tables. You may want to read the number of Read Messages for a user. You'd prefer the number to be out by 1 or 2 rather than the user having to wait seconds, minutes, or hours for the locks to become available.
In other words, you're increasing concurrency when the exact data isn't actually required, and it's ok to have some (potentially) invalid data.
We have a table holding a queue that is only ever inserted to and updated; nothing is ever deleted. The rows have different flags that indicate what is occurring with the associated process. The production system only uses rowlocks, but dozens at a time are held on various rows, so that different processes can occur at once.
We check how many items are still being processed, to avoid overloading the system. When a user launches a new process, it waits to spawn sub-processes until there are fewer than 10 items currently being processed in the queue.
We need to avoid taking locks that the processes may need to update their own status, and we need to see the status of locked items. We use with (nolock) to avoid waiting to see what is going on in the queue, and to count how many items are not yet marked complete - which the logic guarantees only happens when a process completes.
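A sketch of that kind of check (the table and column names are invented for illustration):
-- see how busy the queue is without blocking, or being blocked by, the workers
SELECT COUNT(*)
FROM dbo.ProcessQueue WITH (NOLOCK)
WHERE CompleteFlag = 0;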
We use it on tables that have lots of reads but few writes. If the connections are just reading data, there is often no danger in doing dirty reads. This prevents blocking on the tables, which increases performance.
Check out https://stackoverflow.com/a/1453000/1038940 for some more info about nolock and when it can be good/bad
Quite easy for us - we fetch non-financial (and usually rarely changed) data (like customer addresses, goods descriptions or various configuration options) with the nolock hint, and financial or quantitative data (like prices or balances) with "normal" locking.
You are clearly a genius. It should never be used.
NOLOCK is often exploited as a magic way to speed up database reads, but I try to avoid using it wherever possible.
The result set can contain rows that have not yet been committed, that are often later rolled back.
The query can fail with an error, or the result set can be empty, be missing rows, or display the same row multiple times.
This is because other transactions are moving data at the same time you're reading it.
READ UNCOMMITTED (which is what NOLOCK gives you) adds an additional issue where data can be read in a corrupted state within a single column when multiple users change the same cell simultaneously.
There are other side-effects too, which result in sacrificing the speed increase you were hoping to gain in the first place.
Now you know, never use it.

SQL transaction affecting a big amount of rows

The situation is as follows:
A big production client/server system where one central database table has a certain column that used to have NULL as its default value but now has 0 as the default. All the rows created before that change of course still have NULL, and that generates a lot of unnecessary error messages in this system.
Solution is of course simple as that:
update theTable set theColumn = 0 where theColumn is null
But I guess it's gonna take a lot of time to complete this transaction? Apart from that, will there be any other issues I should think of before I do this? Will this big transaction block the whole database, or that particular table during the whole update process?
This particular table has about 550k rows, and 500k of them have a null value and will be affected by the above SQL statement.
The impact on the performance of other connected clients depends on:
How fast the servers hardware is
How many indexes containing the column your update statement has to update
Which transaction isolation settings the other clients use when connecting to the database
The db engine will acquire write locks, so when your clients only need read access to the table, it should not be a big problem.
500,000 records doesn't sound like too much to me, but as I said, the time and resources the update takes depend on many factors.
Do you have a similar test system, where you can try out the update?
Another solution is to split the one big update into many small ones and call them in a loop.
When you have clients writing frequently to that table, your update statement might get blocked "forever". I have seen databases where performing the update row by row was the only way of getting it through. But that was a table with about 200,000,000 records and about 500 very active clients!
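A sketch of that kind of batched loop in SQL Server syntax (the batch size of 10,000 is arbitrary and worth tuning):
WHILE 1 = 1
BEGIN
    -- each iteration is its own small transaction, so locks are released quickly
    UPDATE TOP (10000) theTable
    SET theColumn = 0
    WHERE theColumn IS NULL;

    IF @@ROWCOUNT = 0 BREAK;
END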
it's gonna take a lot of time to complete this transaction
there's no definite way to say this. Depends a lot on the hardware, number of concurrent sessions, whether the table has got locks, the number of interdependent triggers et al.
Will this big transaction block the whole database, or that particular table during the whole update process
If the "whole database" is dependent on this table then it might.
will there be any other issues I should think of before I do this
If the table has been locked by another transaction - you might run into a row-lock situation. In rare cases, perhaps a deadlock situation. Best would be to ensure that no one is utilizing the table, check for any pre-existing locks and then run the statement.
Locking issues are vendor specific.
Assuming no triggers on the table, half a million rows is not much for a dedicated database server, even with many indexes on the table.

What is the purpose of ROWLOCK on Delete and when should I use it?

Ex)
When should I use this statement:
DELETE TOP (@count)
FROM ProductInfo WITH (ROWLOCK)
WHERE ProductId = @productId_for_del;
And when should be just doing:
DELETE TOP (@count)
FROM ProductInfo
WHERE ProductId = @productId_for_del;
The with (rowlock) is a hint that instructs the database that it should keep locks at row scope. That means the database will avoid escalating locks to page or table scope.
You use the hint when only a single or only a few rows will be affected by the query, to keep the lock from locking rows that will not be deleted by the query. That will let another query read unrelated rows at the same time instead of having to wait for the delete to complete.
If you use it on a query that will delete a lot of rows, it may degrade the performance as the database will try to avoid escalating the locks to a larger scope, even if it would have been more efficient.
Normally you shouldn't need to add such hints to a query, because the database knows what kind of lock to use. It's only in situations where you get performance problems because the database made the wrong decision, that you should add such hints to a query.
Rowlock is a query hint that should be used with caution (as are all query hints).
Omitting it will likely still result in the exact same behaviour, and providing it will not guarantee that only row locks are used; it is only a hint after all. If you do not have a very in-depth knowledge of lock contention, chances are that the optimizer will pick the best possible locking strategy, and these things are usually best left to the database engine to decide.
ROWLOCK means that SQL will lock only the affected row, and not the entire table or the page in the table where the data is stored when performing the delete. This will only affect other people reading from the table at the same time as your delete is running.
If a table lock is used it will cause all queries to the table to wait until your delete has completed, with a row lock only selects reading the specific rows will be made to wait.
Deleting top N where N is a number of rows will most likely lock the table in any case.
SQL Server defaults to page locks. This is the most efficient way for SQL Server to process multiple data sets. But SQL Server is not multi-user friendly sometimes; therefore you may need to incorporate locking methods so you can get your data to flow in and out of the database. This is why people approach the problem by using locking hints.
If everyone designed their database tables so that everything processed each row at page width, the system would be very fast. But no one spends that detailed amount of time.
So, you might see people use with(nolock) on their SELECT statements and the use of with(rowlock) on their UPDATE and DELETE statements. An INSERT does not matter because it will lock the PAGE automatically. Sometimes by using with(rowlock), you can get better multi-user (multiple user connections) performance.
The problem with with(nolock) is that you can return the committed record already sitting in the DB, plus the dirty record that is about to update it; thus a double return of records to your SELECT statement. If you know the personality of your system and how the data runs through it, you can use with(nolock) to your advantage quite a bit though.
When do you know to use with(rowlock)? When your system isn't letting users play nicely with each other in the same table / record. Though, query rewrite / tune first, and then adjust your locking as a last resort.
But as a DBA, always blame the developer's code. It is your solemnly sworn duty to do such. If you are the developer writing this code, just blame yourself.

Deleting rows from a contended table

I have a DB table in which each row has a randomly generated primary key, a message and a user. Each user has about 10-100 messages but there are 10k-50k users.
I write the messages daily for each user in one go. I want to throw away the old messages for each user before writing the new ones to keep the table as small as possible.
Right now I effectively do this:
delete from table where user='mk'
Then write all the messages for that user. I'm seeing a lot of contention because I have lots of threads doing this at the same time.
I do have an additional requirement to retain the most recent set of messages for each user.
I don't have access to the DB directly. I'm trying to guess at the problem based on some second hand feedback. The reason I'm focusing on this scenario is that the delete query is showing a lot of wait time (again - to the best of my knowledge) plus it's a newly added bit of functionality.
Can anyone offer any advice?
Would it be better to:
select key from table where user='mk'
Then delete individual rows from there? I'm thinking that might lead to less brutal locking.
If you do this every day for every user, why not just delete every record from the table in a single statement? Or even
truncate table whatever reuse storage
/
edit
The reason why I suggest this approach is that the process looks like a daily batch upload of user messages preceded by a clearing out of the old messages. That is, the business rule seems to me to be "the table will hold only one day's worth of messages for any given user". If this process is done for every user then a single operation would be the most efficient.
However, if users do not get a fresh set of messages each day and there is a subsidiary rule which requires us to retain the most recent set of messages for each user then zapping the entire table would be wrong.
No, it is always better to perform a single SQL statement on a set of rows than a series of "row-by-row" (or what Tom Kyte calls "slow-by-slow") operations. When you say you are "seeing a lot of contention", what are you seeing exactly? An obvious question: is column USER indexed?
(Of course, the column name can't really be USER in an Oracle database, since it is a reserved word!)
EDIT: You have said that column USER is not indexed. This means that each delete will involve a full table scan of up to 50K*100 = 5 million rows (or at best 10K * 10 = 100,000 rows) to delete a mere 10-100 rows. Adding an index on USER may solve your problems.
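Something along these lines, assuming the real column isn't literally called USER (the names here are placeholders):
create index messages_user_idx on messages (username);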
Are you sure you're seeing lock contention? It seems more likely that you're seeing disk contention due to too many concurrent (but unrelated) updates. The solution to that is simply to reduce the number of threads you're using: less disk contention will mean higher total throughput.
I think you need to define your requirements a bit more clearly...
For instance: if you know all of the users you want to write messages for, insert their IDs into a temp table, index it on ID and batch delete. Then the threads you are firing off are doing two things: write the ID of the user to one temp table, and write the messages to another temp table. Then when the threads have finished executing, the main thread should
DELETE Messages FROM Messages INNER JOIN TEMP_MEMBERS ON ID = TEMP_ID
INSERT INTO Messages SELECT * FROM TEMP_MESSAGES
I'm not familiar with Oracle syntax, but that is the way I would approach it if the users' messages are all written in rapid succession.
Hope this helps
TALK TO YOUR DBA
He is there to help you. When we DBAs take access away from the developers for something such as this, it is assumed we will provide the support for you for that task. If your code is taking too long to complete and that time appears to be tied up in the database, your DBA will be able to look at exactly what is going on and offer suggestions or possibly even solve the problem without you changing anything.
Just glancing over your problem statement, it doesn't appear you'd be looking at contention issues, but I don't know anything about your underlying structure.
Really, talk to your DBA. He will probably enjoy looking at something fun instead of planning the latest CPU deployment.
This might speed things up:
Create a lookup table:
create table rowid_table (row_id ROWID ,user VARCHAR2(100));
create index rowid_table_ix1 on rowid_table (user);
Run a nightly job:
truncate table rowid_table;
insert /*+ append */ into rowid_table
select ROWID row_id , user
from table;
exec dbms_stats.gather_table_stats('SCHEMAOWNER','ROWID_TABLE');
Then when deleting the records:
delete from table
where ROWID IN (select row_id
from rowid_table
where user = 'mk');
Your own suggestion seems very sensible. Locking in small batches has two advantages:
the transactions will be smaller
locking will be limited to only a few rows at a time
Locking in batches should be a big improvement.
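A rough sketch of the batched delete in Oracle syntax (the names and the batch size of 100 are placeholders; repeat until the delete touches no rows):
-- delete at most 100 of this user's rows, commit, and repeat until none are left
delete from messages
where  username = 'mk'
and    rownum <= 100;
commit;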