Best implementation of a "counter" table in SQL Server

I'm working with a large SQL Server database that's built around the idea of a counter table for primary key values. Each table has a row in this counter table holding the PK name and the next value to be used as a primary key for that table. Our current method of getting a counter value is something like this:
BEGIN TRAN
UPDATE Counters SET CounterValue = CounterValue + 1 WHERE PKName = 'SomeTable'  -- table/column names simplified
SELECT CounterValue FROM Counters WHERE PKName = 'SomeTable'
COMMIT TRAN
That mostly works well, since starting a transaction and then updating the row locks the row/page/table (the level of locking isn't too important for this topic) until the transaction is committed.
The problem here is that if a transaction is held open for a long period of time, access to that table/page/row is locked for too long. We have situations where hundreds of inserts may occur in a single transaction (which needs access to this counter table).
One attempt to address this problem would be to always use a separate connection from your application that never holds a transaction open. The update and read would then be quick, so the counter table would generally stay available. The problem here is that triggers may also need access to these counter values, which makes that a fairly unreasonable rule to have. In other words, we have triggers that also need counter values, and those triggers sometimes run in the context of a larger parent transaction.
Another attempt to solve the problem is using a SQL Server app lock to serialize access to the table/row. That's OK most of the time too, but has downsides. One of the biggest downsides here also involves triggers. Since triggers run in the context of the triggering query, the app lock would be held until any parent transactions are completed.
So what I'm trying to figure out is a way to serialize access to a row/table that could be run from an application or from an SP / trigger and would never run in the context of a parent transaction. If a parent transaction rolls back, I don't need the counter value to roll back. Having always-available, fast access to a counter value is much more important than losing a few counter values should a parent transaction be rolled back.
I should point out that I completely realize that using GUID values or an identity column would solve a lot of my problems, but as I mentioned, we're talking about a massive system, with massive amounts of data that can't be changed in a reasonable time frame without a lot of pain for our clients (we're talking hundreds of tables with hundreds of millions of rows).
Any thoughts about the best way to implement such a counter table would be appreciated. Remember - access should be always available from many apps, services, triggers and other SPs, with very little blocking.
EDIT - we can assume SQL Server 2005+

The way the system currently works is unscalable. You have noticed that yourself. Here are some solutions in rough order of preference:
Use an IDENTITY column (You can set the IDENTITY property without rebuilding the table. Search the web to see how.)
Use a sequence
Use Hi-Lo ID generation (What's the Hi/Lo algorithm?). In short, consumers of IDs (application instances) check out big ranges of IDs (like 100) in a separate transaction. The overhead of that scheme is very low.
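As a rough T-SQL illustration of the Hi-Lo checkout (the table and column names here are assumptions, not from the original system): a consumer reserves a block of IDs in one short statement and then hands them out from memory.
DECLARE @RangeStart int
UPDATE CounterTable
SET @RangeStart = NextValue,          -- @RangeStart receives the pre-update value
    NextValue = NextValue + 100       -- reserve a block of 100 IDs in one atomic statement
WHERE PKName = 'MyTable'
-- the caller now owns IDs @RangeStart .. @RangeStart + 99 and assigns them in memory
-- on SQL Server 2012+ a native sequence with caching achieves much the same effect:
-- CREATE SEQUENCE dbo.MyTableSeq START WITH 1 INCREMENT BY 1 CACHE 100
-- SELECT NEXT VALUE FOR dbo.MyTableSeq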
Working with the constraints from your comment below: You can achieve scalable counter generation even with a single transaction and no application-level changes. This is kind of a last resort measure.
Stripe the counter. For each table, you have 100 counters. The counter N tracks IDs that conform to ID % 100 = N. So each counter tracks 1/100th of all IDs.
When you want to take an ID, you take it from a randomly chosen counter. The chance is good that this counter is not in use by a concurrent transaction. You will have little blocking due to row-level locking in SQL Server.
You initialize counter N to N and increment it by 100. This ensures that all counters generate distinct ID ranges.
Counter 0 generates 0, 100, 200, .... Counter 1 generates 1, 101, 201, .... And so on.
A disadvantage of this is that your IDs now are not sequential. In my opinion, an application should not rely on this anyway because it is not a reliable property.
You can abstract all of this into a single procedure call. The code complexity will actually not be much bigger; you basically just generate an additional random number and change the increment logic.
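A rough sketch of that striped counter in T-SQL, assuming 100 stripes and a counter table keyed on (PKName, Stripe); all names here are illustrative:
DECLARE @Stripe int, @NextKey int
SET @Stripe = ABS(CHECKSUM(NEWID())) % 100   -- pick a random stripe to minimize contention
UPDATE StripedCounter
SET @NextKey = NextValue,                    -- the value handed back to the caller
    NextValue = NextValue + 100              -- stripe N starts at N, so ranges N, N+100, N+200, ... never collide
WHERE PKName = 'MyTable' AND Stripe = @Stripe
SELECT @NextKey AS NextKey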

One way is to get and increment the counter value in one statement:
DECLARE @NextKey int
UPDATE Counter
SET @NextKey = NextKey + 1,
    NextKey = @NextKey
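On SQL Server 2005 and later, an OUTPUT clause can return the new value from the same single statement. This is a variation rather than the answer's original code, and it assumes the counter table has no triggers (otherwise OUTPUT needs an INTO target):
UPDATE Counter
SET NextKey = NextKey + 1
OUTPUT INSERTED.NextKey      -- returns the incremented value as a one-row result set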

How to consistently track all new rows in a SQL database table

What I am trying to do
I am developing a web service which runs in multiple server instances, all accessing the same RDBMS (PostgreSQL). While the database is needed for persistence, it contains very little data, which is why every server instance has a cache of all the data. Further, the application is really simple in that it only ever inserts new rows into rather simple tables and selects that data in a scheduled fashion from all server instances (no updates or changes... only inserts and reads).
The way it is currently implemented
Basically I have a table which roughly looks like this:
CREATE TABLE events (   -- the table name "events" is just for illustration
    id BIGSERIAL,
    creation_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    -- further data columns...
);
The server is doing something like this every couple of seconds (pseudocode):
get all rows with creation_timestamp > lastMaxTimestamp
lastMaxTimestamp = max timestamp for all data just retrieved
insert new rows into application cache
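Rendered as SQL, the polling step might look roughly like this, using the illustrative events table from above and an application-held watermark bound as a parameter:
SELECT *
FROM events
WHERE creation_timestamp > :lastMaxTimestamp   -- :lastMaxTimestamp is supplied by the application
ORDER BY creation_timestamp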
The issue I am running into
The application skips certain rows when updating the caches. I analyzed the issue and figured out that the problem is caused in the following way:
one server instance is creating a new row in the context of a transaction. An id for the new row is retrieved from the associated sequence (id=n) and the creation_timestamp (with value ts_1) is set.
another server does the same in the context of a different transaction. The new row in this transaction gets id=n+1 and a creation_timestamp ts_2 (where ts_1 < ts_2).
transaction 2 finishes before transaction 1
one of the servers executes a "select all rows with creation_timestamp > lastMaxTimestamp". It gets row n+1, but not row n. It sets lastMaxTimestamp to ts_2.
transaction 1 completes
some time later the server from step 4 executes "select all rows with creation_timestamp > lastMaxTimestamp" again. But since lastMaxTimestamp=ts_2 and ts_2>ts_1 the row n will never be read on that server.
Note: CURRENT_TIMESTAMP has the same value during a transaction, which is the transaction start time.
So the application gets inconsistent data into its cache and can't get new rows based on the insertion timestamp OR based on the sequence id. Transaction isolation levels don't really change anything about the situation, since the problem is created in essence by transaction 2 finishing before transaction 1.
My question
Am I missing something? I am thinking there must be a straightforward way to get all new rows of an RDBMS, but I can't come up with a simple solution... at least not a simple solution that is consistent. Extensive locking (e.g. of tables) wouldn't be acceptable for performance reasons. Simply trying to ensure that every id from the sequence is seen seems like a) a complicated solution and b) can't be done easily, since rollbacks during transactions can happen (which would lead to sequence ids not being used).
Does anyone have a solution?
After a lot of searching, I found the right keywords to google for... "transaction commit timestamp", which leads to all sorts of transaction timestamp tracking and system columns like xmin:
https://dba.stackexchange.com/questions/232273/is-there-way-to-get-transaction-commit-timestamp-in-postgres
This post has some more detailed information:
Questions about Postgres track_commit_timestamp (pg_xact_commit_timestamp)
In short:
you can turn on a PostgreSQL option to track the timestamps of commits and compare those instead of the current_timestamp/clock_timestamp values inside the transaction
it seems, though, that the timestamp is only tracked when a transaction is completed - not when it is committed - which makes the solution not bulletproof. There are also further issues to consider, like transaction id (xmin) rollover for example
logical decoding / replication is something to look into for a proper solution
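As a sketch of the commit-timestamp approach (this assumes track_commit_timestamp has been enabled in postgresql.conf, which requires a server restart, and reuses the illustrative events table from above):
-- compare commit timestamps instead of the in-transaction CURRENT_TIMESTAMP
SELECT pg_xact_commit_timestamp(xmin) AS committed_at, *
FROM events
WHERE pg_xact_commit_timestamp(xmin) > :lastMaxCommitTimestamp
ORDER BY committed_at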
Thanks to everyone trying to help me find an answer. I hope this summary is useful to someone in the future.

Getting deadlocks in MySQL

We're very frustratingly getting deadlocks in MySQL. It isn't because of exceeding a lock timeout as the deadlocks happen instantly when they do happen. Here's the SQL code that is executing on 2 separate threads (with 2 separate connections from the connection pool) that produces a deadlock:
UPDATE Sequences SET Counter = LAST_INSERT_ID(Counter + 1) WHERE Sequence IS NULL
Sequences table has 2 columns: Sequence and Counter
The LAST_INSERT_ID allows us to retrieve this updated counter value, as per MySQL's recommendation. That works perfectly for us, but we get these deadlocks! Why are we getting them and how can we avoid them?
Thanks so much for any help with this.
EDIT: this is all in a transaction (required since I'm using Hibernate) and AUTO_INCREMENT doesn't make sense here. I should've been more clear. The Sequences table holds many sequences (in our case about 100 million of them). I need to increment a counter and retrieve that value. AUTO_INCREMENT plays no role in all of this, this has nothing to do with Ids or PRIMARY KEYs.
Wrap your sql statements in a transaction. If you aren't using a transaction you will get a race condition on LAST_INSERT_ID.
But really, you should make the counter fields AUTO_INCREMENT, so you let MySQL handle this.
Your third solution is to use LOCK TABLES to lock the sequence table so no other process can access it concurrently. This is probably the slowest solution unless you are using InnoDB.
Deadlocks are a normal part of any transactional database, and can occur at any time. Generally, you are supposed to write your application code to handle them, as there is no surefire way to guarantee that you will never get a deadlock. That being said, there are situations that increase the likelihood of deadlocks occurring, such as the use of large transactions, and there are things you can do to mitigate their occurrence.
First thing, you should read this manual page to get a better understanding of how you can avoid them.
Second, if all you're doing is updating a counter, you should really, really, really be using an AUTO_INCREMENT column for Counter rather than relying on a "select then update" process, which as you have seen is a race condition that can produce deadlocks. Essentially, the AUTO_INCREMENT property of your table column will act as a counter for you.
Finally, I'm going to assume that you have that update statement inside a transaction, as this would produce frequent deadlocks. If you want to see it in action, try the experiment listed here. That's exactly what's happening with your code... two threads are attempting to update the same records at the same time before one of them is committed. Instant deadlock.
Your best solution is to figure out how to do it without a transaction, and AUTO_INCREMENT will let you do that.
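Setting aside for a moment the edit about 100 million distinct sequences, the AUTO_INCREMENT suggestion usually looks something like the following tiny "ticket" table; the table name is an assumption, not part of the question:
CREATE TABLE ticket (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
) ENGINE=InnoDB;
INSERT INTO ticket () VALUES ();
SELECT LAST_INSERT_ID();    -- the freshly generated counter value for this connection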
No other SQL involved? Seems a bit unlikely to me.
The 'where sequence is null' probably causes a full table scan, causing read locks to be acquired on every row/page/... .
This becomes a problem if (your particular engine does not use MVCC and) an INSERT preceded your update within the same transaction. That INSERT would have acquired an exclusive lock on some resource (row/page/...), which causes any other thread's attempt to acquire a read lock to wait. So two connections can first each do their insert, giving each of them an exclusive lock on some small portion of the table, and then they both try to do your update, requiring each of them to acquire a read lock on the entire table.
I managed to do this using a MyISAM table for the sequences.
I then have a function called getNextCounter that does the following:
performs a select: SELECT sequence_value FROM sequences WHERE sequence_name = 'test';
performs the update: UPDATE sequences SET sequence_value = LAST_INSERT_ID(last_retrieved_value + 1) WHERE sequence_name = 'test' AND sequence_value = last_retrieved_value;
repeats both queries in a loop until the update succeeds, then retrieves the last insert id (sketched below).
As it is a MyISAM table it won't be part of your transaction, so the operation won't cause any deadlocks.
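A sketch of that loop as a MySQL stored procedure, using the sequences table and column names from the answer; the procedure signature and the absence of error handling are my assumptions:
DELIMITER //
CREATE PROCEDURE getNextCounter(IN p_name VARCHAR(64), OUT p_value BIGINT)
BEGIN
    DECLARE v_current BIGINT;
    DECLARE v_done INT DEFAULT 0;
    REPEAT
        SELECT sequence_value INTO v_current
        FROM sequences
        WHERE sequence_name = p_name;
        UPDATE sequences
        SET sequence_value = LAST_INSERT_ID(v_current + 1)
        WHERE sequence_name = p_name
          AND sequence_value = v_current;   -- optimistic check: fails if another session moved the counter
        SET v_done = ROW_COUNT();
    UNTIL v_done > 0 END REPEAT;
    SET p_value = LAST_INSERT_ID();
END //
DELIMITER ;
Calling it would look like CALL getNextCounter('test', @next); followed by SELECT @next;.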

Incremented DB Field

Let's say that I have an article on a website, and I want to track the number of views to the article. In the Articles table, there's the PK ID - int, Name - nvarchar(50), and ViewCount - int. Every time the page is viewed, I increment the ViewCount field. I'm worried about collisions when updating the field. I can run it in a sproc with a transaction like:
CREATE PROCEDURE IncrementView
(
    @ArticleID int
)
AS
BEGIN TRANSACTION
    UPDATE Article SET ViewCount = ViewCount + 1 WHERE ID = @ArticleID
    IF @@ERROR <> 0
    BEGIN
        -- Rollback the transaction
        ROLLBACK
        -- Raise an error and return
        RAISERROR ('Error Incrementing', 16, 1)
        RETURN
    END
COMMIT
My fear is that I'm going to end up with PageViews not being counted in this model. The other possible solution is a log type of model where I actually log views to the articles and use a combination of a function and view to grab data about number of views to an article.
Probably a better model is to cache the number of views hourly in the app somewhere, and then update them in a batch-style process.
-- Edit:
To elaborate more, a simple model for you may be:
Each page load, for the given page, increment a static hashmap. Also on each load, check if enough time has elapsed since 'Last Update', and if so, perform an update.
Be tricky, and put the base value in the asp.net cache (http://msdn.microsoft.com/en-us/library/aa478965.aspx) and, when it times out, [implement the cache removal handler as described in the link] do the update. Set the timeout for an hour.
In both models, you'll have the static map of pages to counts; you'll update this each view, and you'll also use this - and the cached db amount - to get the current 'live' count.
The database should be able to handle a simple increment like this atomically, and queued statements will be handled in order where there might be a conflict. Your bigger issue, if there is enough volume, will be handling all of the writes to the same row. Each write will block the reads and writes behind it. If you are worried, I would create a simple program that issues these SQL updates back to back and run it with a few hundred concurrent threads (increasing threads until your hardware is saturated). Make sure the number of attempts equals the final count.
Finding a mechanism to cache and/or perform batch updates as silky suggests sounds like a winner.
Jacob
You don't need to worry about concurrency within a single update statement in SQL Server.
But if you are worried about 2 users hitting a table in the same tenth of a second, keep in mind that there are 864,000 tenths of a second in a day. That doesn't sound like something that is going to be an issue for a page that serves up articles.
Have no fear!
This update is a single (atomic) transaction - you cannot get 'collisions'. Even if 5,000,000 calls to IncrementView all hit the database at the exact same moment, they will each be processed in a serial, queue-like fashion - that's what you are using a database engine for: consistency. Each call will gain an exclusive update lock on the row (at least), so no subsequent query can update the row until the current one has committed.
You don't even need to use BEGIN TRAN...COMMIT. If the update fails, there is nothing to rollback anyway.
I don't see the need for any app caching - there's no reason why this update would take a long time, and therefore it should have no impact on the performance of your app.
[Assuming it's relatively well designed!]
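A minimal version along the lines of this answer, with the explicit transaction removed (the UPDATE is already atomic on its own), might look like this:
CREATE PROCEDURE IncrementView
    @ArticleID int
AS
    -- a single atomic UPDATE; SQL Server serializes concurrent increments on the row
    UPDATE Article SET ViewCount = ViewCount + 1 WHERE ID = @ArticleID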

Getting a Chunk of Work

Recently I had to deal with a problem that I imagined would be pretty common: given a database table with a large (million+) number of rows to be processed, and various processors running in various machines / threads, how to safely allow each processor instance to get a chunk of work (say 100 items) without interfering with one another?
The reason I am getting a chunk at a time is for performance reasons - I don't want to go to the database for each item.
There are a few approaches - you could associate each processor with a token, and have a SPROC that sets that token against the next [n] available items; perhaps something like:
(note - needs suitable isolation-level; perhaps serializable: SET TRANSACTION ISOLATION LEVEL SERIALIZABLE)
(edited to fix TSQL)
UPDATE TOP (1000) WORK
SET [Owner] = @processor, Expiry = @expiry
OUTPUT INSERTED.Id -- etc
WHERE [Owner] IS NULL
You'd also want a timeout (@expiry) on this, so that when a processor goes down you don't lose work. You'd also need a task to clear the owner on items that are past their Expiry.
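The clean-up task could be as simple as the following, reusing the WORK/Owner/Expiry names from the snippet above (which are themselves placeholders):
UPDATE WORK
SET [Owner] = NULL, Expiry = NULL
WHERE [Owner] IS NOT NULL
  AND Expiry < GETUTCDATE()    -- lease expired: the processor presumably died, so release the work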
You can have a special table to queue work up, where the consumers delete (or mark) work as being handled, or use a middleware queuing solution, like MSMQ or ActiveMQ.
Middleware comes with its own set of problems, so if possible I'd stick with a special table (keep it as small as possible, hopefully just with an id, so the workers can fetch the rest of the information from the rest of the database by themselves and not lock the queue table up for too long).
You'd fill this table up at regular intervals and let processors grab what they need from the top.
Related questions on SQL table queues:
Queue using table
Working out the SQL to query a priority queue table
Related questions on queuing middleware:
Building a high performance and automatically backupped queue
Messaging platform
You didn't say which database server you're using, but there are a couple of options.
MySQL includes an extension to SQL99's UPDATE to limit the number of rows that are updated. You can assign each worker a unique token, update a number of rows, then query to get that worker's batch. Marc used the UPDATE TOP syntax, but didn't specify the database server.
Another option is to designate a table used for locking. Don't use the same table with the data, since you don't want to lock it for reading. Your lock table likely only needs a single row, with the next ID needing work. A worker locks the table, gets the current ID, increments it by whatever your batch size is, updates the table, then releases the lock. Then it can go query the data table and pull the rows it reserved. This option assumes the data table has a monotonically increasing ID, and isn't very fault-tolerant if a worker dies or otherwise can't finish a batch.
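As a sketch of that lock-table idea in T-SQL (the answer doesn't name a database server, so the table and column names here are assumptions): one row holds the next unreserved ID, and each worker bumps it by its batch size inside a short transaction.
BEGIN TRAN
DECLARE @BatchStart int
SELECT @BatchStart = NextId
FROM BatchCursor WITH (UPDLOCK, HOLDLOCK)   -- single-row "lock" table; the hints keep other workers out
UPDATE BatchCursor SET NextId = NextId + 100
COMMIT TRAN
-- this worker now processes data rows with id >= @BatchStart and id < @BatchStart + 100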
Quite similar to this question: SQL Server Process Queue Race Condition
You run a query to assign 100 rows to a given ProcessorID. If you use these locking hints then it's "safe" in the concurrency sense. And it's a single SQL statement with no SET statements needed.
This is taken from the other question:
UPDATE TOP (100)
    foo
SET
    ProcessorID = @PROCID
FROM
    OrderTable foo WITH (ROWLOCK, READPAST, UPDLOCK)
WHERE
    ProcessorID = 0 --Or whatever unassigned is
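A small variation, if a second SELECT to fetch the claimed rows is undesirable, is to add an OUTPUT clause to the same statement; this assumes the OrderTable from the quoted snippet and no triggers on it:
UPDATE TOP (100)
    foo
SET
    ProcessorID = @PROCID
OUTPUT
    INSERTED.*    -- returns the rows this statement just claimed
FROM
    OrderTable foo WITH (ROWLOCK, READPAST, UPDLOCK)
WHERE
    ProcessorID = 0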

Best practices for multithreaded processing of database records

I have a single process that queries a table for records where PROCESS_IND = 'N', does some processing, and then updates the PROCESS_IND to 'Y'.
I'd like to allow for multiple instances of this process to run, but don't know what the best practices are for avoiding concurrency problems.
Where should I start?
The pattern I'd use is as follows:
Create columns "lockedby" and "locktime" which are a thread/process/machine ID and timestamp respectively (you'll need the machine ID when you split the processing between several machines)
Each task would do a query such as:
UPDATE taskstable SET lockedby=(my id), locktime=now() WHERE lockedby IS NULL ORDER BY ID LIMIT 10
Where 10 is the "batch size".
Then each task does a SELECT to find out which rows it has "locked" for processing, and processes those
After each row is complete, you set lockedby and locktime back to NULL
All this is done in a loop for as many batches as exist.
A cron job or scheduled task, periodically resets the "lockedby" of any row whose locktime is too long ago, as they were presumably done by a task which has hung or crashed. Someone else will then pick them up
The LIMIT 10 is MySQL-specific, but other databases have equivalents. The ORDER BY is important to avoid the query being nondeterministic.
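The surrounding steps the answer describes might look roughly like this in MySQL; the taskstable name and columns come from the snippet above, while the worker id, row id, and 10-minute lease are arbitrary placeholders:
-- fetch the rows this worker just locked
SELECT * FROM taskstable WHERE lockedby = 'worker-1';
-- release a row once it has been processed
UPDATE taskstable SET lockedby = NULL, locktime = NULL WHERE id = 42;
-- scheduled job: reclaim rows whose worker hung or crashed
UPDATE taskstable SET lockedby = NULL, locktime = NULL
WHERE lockedby IS NOT NULL AND locktime < NOW() - INTERVAL 10 MINUTE;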
Although I understand the intention, I would disagree with going to row-level locking immediately. This can hurt your response time and may actually make your situation worse. If, after testing, you are seeing concurrency issues with APL, you should make an iterative move to "datapage" locking first!
To really answer this question properly, more information would be required about the table structure and the indexes involved, but to explain further:
DOL / datarow locking uses a lot more locks than allpage/page-level locking. The overhead of managing all the locks, and hence the decrease in available memory due to requests for more lock structures within the cache, will decrease performance and counter any gains you may get by moving to a more concurrent approach.
Test your approach first without the move, on APL (all-page locking, the default); then, if issues are seen, move to DOL (datapage first, then datarow). Keep in mind that when you switch a table to DOL, all responses on that table become slightly worse, the table uses more space, and the table becomes more prone to fragmentation, which requires regular maintenance.
So, in short, don't move to datarows straight off. Try your concurrency approach first; then, if there are issues, use datapage locking first and, as a last resort, datarows.
You should enable row level locking on the table with:
CREATE TABLE mytable (...) LOCK DATAROWS
Then you:
Begin the transaction
Select your row with FOR UPDATE option (which will lock it)
Do whatever you want.
No other process can do anything to this row until the transaction ends.
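A sketch of those steps, assuming a database where SELECT ... FOR UPDATE is available and using the PROCESS_IND column from the question (the table name records is hypothetical):
BEGIN TRANSACTION
SELECT * FROM records WHERE PROCESS_IND = 'N' FOR UPDATE   -- locks the selected rows against other workers
-- ... process the rows ...
UPDATE records SET PROCESS_IND = 'Y' WHERE PROCESS_IND = 'N'
COMMIT TRANSACTION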
P.S. Some mention overhead problems that can result from using LOCK DATAROWS.
Yes, there is overhead, though I'd hardly call it a problem for a table like this.
But if you switch to DATAPAGES then you may lock only one row per page (2k by default), and processes whose rows reside in the same page will not be able to run concurrently.
If we are talking about a table with a dozen rows being locked at once, there will hardly be any noticeable performance drop.
Process concurrency is of much more importance for a design like that.
The most obvious way is locking. If your database doesn't have locks, you could implement it yourself by adding a "Locked" field.
One way to simplify the concurrency is to randomize the access to unprocessed items, so instead of competing for the first item, the processes distribute their access randomly.
Convert the procedure to a single SQL statement and process multiple rows as a single batch. This is how databases are supposed to work.
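For instance, if the per-row processing can be expressed in SQL, the whole batch collapses into one statement; the table and columns here are hypothetical stand-ins for the real processing logic:
UPDATE records
SET processed_total = amount * 1.1,   -- placeholder for the real per-row computation
    PROCESS_IND = 'Y'
WHERE PROCESS_IND = 'N'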