SQL Identity Column out of step - sql-server-2005

We have a set of databases that have a table defined with an identity column as the primary key. Because a subset of these databases is replicated to other servers, a seeding scheme was created so that the generated values could never clash: each database uses a different starting seed with a common increment of 50.
In this way the table on DB1 would generate 30001, 30051 etc, where Database2 would generate 30002, 30052 and so on.
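For illustration, the scheme described above would be declared something like this on each database (the table and column names here are invented):
-- on DB1
CREATE TABLE dbo.Items (ItemId INT IDENTITY(30001, 50) PRIMARY KEY, Name NVARCHAR(100) NOT NULL);
-- on Database2
CREATE TABLE dbo.Items (ItemId INT IDENTITY(30002, 50) PRIMARY KEY, Name NVARCHAR(100) NOT NULL);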
I am looking at adding another database into this system (it is split for scaling/load purposes) and have discovered that the identities have got out of sync on one or two of the databases - i.e. database 3, which should have numbers ending in 3, doesn't any more. The seeding and increments are still correct according to the table design.
I am obviously going to have to work around this problem somehow (probably by setting a high initial value), but can anyone tell me what would cause them to get out of sync like this? From a query on the DB I can see the sequence went as follows: 32403, 32453, 32456, 32474, 32524, 32574 and has continued in increments of 50 ever since it went wrong.
As far as I am aware no bulk-inserts or DTS or anything like that has put new data into these tables.
Second (bonus) question - how to reset the identity so that it goes back to what I want it to actually be!
EDIT:
I know the design is in principle a bit ropey - I didn't ask for criticism of it, I just wondered how it could have got out of sync. I inherited this system and changing the column to a GUID - whilst undoubtedly the best theoretical solution - is probably not going to happen. The system evolved from a single DB to multiple DBs when the load got too large (a few hundred GBs currently). Each ID in this table will be referenced in many other places - sometimes a few hundred thousand times each (multiplied by about 40,000 for each item). Updating all those will not be happening ;-)

Replication = GUID column.
To set the value of the next ID to be 1000:
DBCC CHECKIDENT (orders, RESEED, 999)

If you want to actually use Primary Keys for some meaningful purpose other than uniquely identify a row in a table, then it's not an Identity Column, and you need to assign them some other explicit way.
If you want to merge rows from multiple tables, then you are violating the intent of Identity, which is for one table. (A GUID column will use values that are unique enough to solve this problem. But you still can't impute a meaningful purpose to them.)

Perhaps somebody used:
SET IDENTITY_INSERT {tablename} ON
INSERT INTO {tablename} (ID, ...)
VALUES (32456, ....)
SET IDENTITY_INSERT {tablename} OFF
Or perhaps they used DBCC CHECKIDENT to change the identity. In any case, you can use the same to set it back.
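For example (the table name here is illustrative; note that after a RESEED the next generated value is the reseed value plus the table's increment, which is 50 in this case):
DBCC CHECKIDENT ('dbo.YourTable', NORESEED)          -- reports the current identity value without changing it
DBCC CHECKIDENT ('dbo.YourTable', RESEED, 32403)     -- the next insert will then generate 32403 + 50 = 32453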

It's too risky to rely on this kind of identity strategy, since it's (obviously) possible that it will get out of sync and wreck everything.
With replication, you really need to identify your data with GUIDs. It will probably be easier for you to migrate your data to a schema that uses GUIDs for PKs than to try and hack your way around IDENTITY issues.

To address your questions directly:
Why it got out of sync may be interesting to discuss, but the only result you could draw from the answer would be a way to prevent it in future - and that only treats the symptom. You will continue to have these and bigger problems unless you deal with the design, which has a fatal flaw.
How to set the existing values right is also (IMHO) the wrong question, because setting the values right won't solve your problem; you need to do something else entirely.
This isn't to disparage you; it's to help you the best way I can think of. Changing the design is less work both short term and long term. Not changing the design is the pathway to FAIL.

This doesn't really answer your core question, but one possibility to address the design would be to switch to a hi/lo algorithm. It wouldn't require changing the column away from an int, so it shouldn't be nearly as much work as changing to a GUID.
Hi/lo is used by the NHibernate ORM, but I couldn't find much documentation on it.
Basically, the way hi/lo works is that you have one central place where you keep track of your "hi" value: one table in one of the databases that every instance of your inserting application can see. Then you need some kind of service (object, web service, whatever) whose lifetime is somewhat longer than a single entity insert. When it starts up, this service goes to the hi table, grabs the current value, and advances the value in that table. Do the read-and-advance atomically (for example with a single UPDATE statement, or under an update lock) so that you won't get concurrency issues between instances of the service. You then use the service to get your next id value: internally it starts at the number it got from the db and increments by 1 each time it hands a value out, keeping track of the current value and the "range" it is allowed to hand out. A simplistic example would be this:
Service 1 gets 100 from the "hi_value" table in the db and increments the db value to 200.
Service 1 gets a request for a new ID and hands out 100.
Another instance of the service, service 2 (another thread, another middle-tier worker machine, etc.), spins up, gets 200 from the db, and increments the db value to 300.
Service 2 gets a request for a new id and hands out 200.
Service 1 gets a request for a new id and hands out 101.
If any of these services hands out its full 100 values before dying, it goes back to the db, gets the current value, increments it again, and starts over. Obviously there's some art to this - how big should your range be, etc.
A very simple variation on this is a single table in one of your DBs that just contains a "nextId" value - basically manually reproducing Oracle's sequence concept.
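As a rough T-SQL sketch of the read-and-advance step (the HiValue table, its NextHi column and the range size of 100 are all assumptions for illustration):
DECLARE @newHi INT;

-- claim the next range in a single atomic statement, so two service instances
-- can never be handed the same block of ids
UPDATE dbo.HiValue
SET @newHi = NextHi = NextHi + 100;

-- this service instance now owns ids (@newHi - 100) through (@newHi - 1)
SELECT @newHi - 100 AS RangeStart, @newHi - 1 AS RangeEnd;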

Related

Azure SQL Server identity key values jumping up by 1,000

I am using the SQL Server 2012 that Microsoft provides for Azure. My identity columns have a habit of sometimes jumping up by 1,000, even though they are declared as "INT IDENTITY (1, 1) NOT NULL" in my table.
Is there anything I can do to stop this happening? What about if I remove all rows from a table? Seems like even after I delete every row, when I add a new row it starts off with an ID that's more than 1,000.
Refer to this post and this answer. Basically, this is by design, and the argument as to why this is not too much of an issue is that Azure database limits will be exceeded before hitting the identity limit. There is also the option of using a bigint.
There is no explanation as to why the jump in seed is done when the database is bounced, but I am guessing it has something to do with concurrency problems that might result in the same identity being used for two records at the boundary between shutting down and restarting (for some reason I can't think of).
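If the jumps themselves are a problem, one commonly suggested workaround (not part of the answer above, and it assumes the key is a plain INT column rather than an IDENTITY) is to take values from a SEQUENCE created with NO CACHE, trading some insert throughput for avoiding the restart-related jumps; the object and table names below are invented:
CREATE SEQUENCE dbo.OrderIdSeq AS INT START WITH 1 INCREMENT BY 1 NO CACHE;

ALTER TABLE dbo.Orders
ADD CONSTRAINT DF_Orders_Id DEFAULT (NEXT VALUE FOR dbo.OrderIdSeq) FOR Id;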

Why do database designers not make IDENTITY columns start from the min value rather than 1?

As we know, in SQL Server, IDENTITY(n, m) means that the values will start from n with an increment of m, but I noticed that all database designers define identity columns as IDENTITY(1, 1), without taking advantage of the full range of the int data type, which runs from -2,147,483,648 to 2,147,483,647.
I am planning to define all identity columns as IDENTITY(-2147483648, 1) (the identity columns are hidden from the application user).
Is that a good idea ?
If you find that 2 billion values isn't enough, you're going to find that 4 billion isn't enough either (needing more than twice as many of anything over the lifetime of a project as it was first designed for is hardly rare*), so you need to take a different approach entirely (possibly long values, possibly something totally different).
Otherwise you're just being strange and unreadable for no gain.
Also, who doesn't have a database where they know that e.g. item 312 is the one with some nice characteristics for testing particular things? I know I have some arbitrary ids burned in my head. They may call it "so good they named it twice", but I'll always know New York as "city 657, covers most of our test cases". It's only a shorthand, but -2147482991 wouldn't be as handy.
*To add a bit to that: with some things you might say "ah, about 100" and find it's actually 110 - okay. With others you'll find it's actually 100,000 - you were out by orders of magnitude. The higher the number, the more often the mistake is of this sort, because the sort of problems that end up with estimates in the billions are different from those that end up with answers in the dozens. If you estimate 200 is your max in a given case, you should probably leave room for maybe a few hundred more. If you estimate 2 billion in a given case, you should probably leave room for a few quadrillion more. That said, the only time I saw someone actually start an id at minus 2 billion, they ended up having about 3,000 rows.
If you have a class that represents your table in your code (which is very likely), every time you create a new object it will be assigned the ID 0 by default. That could lead to mistakes that overwrite data in the database if ID 0 is already assigned to a row. It also makes it easy to determine whether an object is new or came from the database, simply by checking if (myObject.ID != 0).
On the SQL Server side, negative IDs are fine and are handled just like positive numbers, so you could do that.
The others are right that you should consider the other suggestions, but the major problem is the applications connected to your database.
Take the MS environment, for example:
.NET's DataSet uses negative IDs on auto-incremented columns to track changes in code, so you may run into trouble, because:
The negative keys are used for temporary instances of the rows.
Here is the reference: MSDN
So it is definitely not a good idea to design a database like this for MSSQL in an MS environment.
One of the 'nice' side effects of working with integers close to zero is that they are easy on the eye and easy for devs, testers etc to remember, especially with debugging and unit testing in mind.
Also, surrogate keys have a nasty habit of creeping into business terminology, e.g. users may be able to see the PK sitting in the URL querystring at the top of the browser - the more digits in the number, the more likely they are to misquote something in a helpdesk query.
So this is one of the reasons why I'm quite happy to seed my identities at 1, and not at -2147483231, and instead, as @Jon suggests, move up to a BIGINT if I should ever need more than 2 billion rows in my table.
As you transition from the negative numbers to positive ids, you will cross zero. That means (assuming you are actually inserting a couple of billion rows) that you will eventually have an identifier of zero. This is not intrinsically bad, but could present a potential edge case for ORM tools or simply sloppy application code that has difficulty differentiating between a zero and a null.
Negative IDs can be useful for testing in a live environment where dummy data needs to be mixed with real data for testing purposes, then disposed of once the testing is complete. This should be done only with very good reason - I've only used the technique once.
Negative IDs can also be useful for Administrator purposes in single-row, read-only tables (i.e., no transactions, no executable SQL run on the table).
Aside from those specific purposes, identity values <= 0 will generate more heat than light.
And just to add, according to MS Docs (https://learn.microsoft.com/en-us/sql/relational-databases/in-memory-oltp/implementing-identity-in-a-memory-optimized-table?view=sql-server-2014), IDENTITY(1, 1) is supported on a memory-optimized table. However, identity columns with definition of IDENTITY(x, y) where x != 1 or y != 1 are NOT supported on memory-optimized tables.
So in my opinion, and the other reasons alluded to by the other users, IDENTITY(1, 1) is more practical and memory efficient for memory-optimized tables.
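For reference, a minimal memory-optimized table using the supported IDENTITY(1, 1) form might look like this (it assumes the database already has a MEMORY_OPTIMIZED_DATA filegroup; the names are invented):
CREATE TABLE dbo.SampleMemOpt
(
    Id   INT IDENTITY(1, 1) NOT NULL PRIMARY KEY NONCLUSTERED,
    Name NVARCHAR(50) NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);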
Start with INT IDENTITY(1, 1); then, when you have maxed out, you can follow these steps to 'upgrade' to BIGINT:
Drop current PK constraint:
ALTER TABLE [dbo].[tbName] DROP CONSTRAINT [PK_tbName_Id]
GO
Upgrade PK to BIGINT:
ALTER TABLE [dbo].[tbName] ALTER COLUMN [Id] BIGINT NOT NULL
Recreate the PK Constraint:
ALTER TABLE [dbo].[tbName] ADD CONSTRAINT [PK_tbName_Id] PRIMARY KEY
CLUSTERED
(
[Id] ASC
)
Hope this helps, and good luck!

Some sort of “different auto-increment indexes” per primary key value

I have got a table which has an id (primary key with auto increment), a uid (a key referring to a user's id, for example) and something else which for my question won't matter.
I want to make, lets call it, different auto-increment keys on id for each uid entry.
So, I will add an entry with uid 10, and the id field for this entry will be 1 because there were no previous entries with a value of 10 in uid. Then I will add a new one with uid 4, and its id will be 3 because there were already two entries with uid 4.
...A very obvious explanation, but I am trying to be as clear as I can to demonstrate the idea... clearly.
What SQL engine can provide such functionality natively? (non-Microsoft/Oracle-based)
If there is none, how could I best replicate it? Triggers perhaps?
Does this functionality have a more suitable name?
In case you know of a non-SQL database engine providing such functionality, name it anyway; I am curious.
Thanks.
MySQL's MyISAM engine can do this. See their manual, in section Using AUTO_INCREMENT:
For MyISAM tables you can specify AUTO_INCREMENT on a secondary column in a multiple-column index. In this case, the generated value for the AUTO_INCREMENT column is calculated as MAX(auto_increment_column) + 1 WHERE prefix=given-prefix. This is useful when you want to put data into ordered groups.
The docs go on after that paragraph, showing an example.
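The manual's example is essentially the following (MyISAM only; each grp value gets its own 1, 2, 3, ... numbering in id):
CREATE TABLE animals (
    grp ENUM('fish','mammal','bird') NOT NULL,
    id MEDIUMINT NOT NULL AUTO_INCREMENT,
    name CHAR(30) NOT NULL,
    PRIMARY KEY (grp, id)
) ENGINE=MyISAM;

INSERT INTO animals (grp, name) VALUES
    ('mammal','dog'), ('mammal','cat'),
    ('bird','penguin'), ('fish','lax'),
    ('mammal','whale'), ('bird','ostrich');

SELECT * FROM animals ORDER BY grp, id;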
The InnoDB engine in MySQL does not support this feature, which is unfortunate because it's better to use InnoDB in almost all cases.
You can't emulate this behavior using triggers (or any SQL statements limited to transaction scope) without locking tables on INSERT. Consider this sequence of actions:
Mario starts transaction and inserts a new row for user 4.
Bill starts transaction and inserts a new row for user 4.
Mario's session fires a trigger to compute MAX(id)+1 for user 4. It gets 3.
Bill's session fires a trigger to compute MAX(id)+1 for user 4. It also gets 3.
Bill's session finishes his INSERT and commits.
Mario's session tries to finish his INSERT, but the row with (userid=4, id=3) now exists, so Mario gets a primary key conflict.
In general, you can't control the order of execution of these steps without some kind of synchronization.
The solutions to this are either:
Get an exclusive table lock. Before trying an INSERT, lock the table. This is necessary to prevent concurrent INSERTs from creating a race condition like in the example above. It's necessary to lock the whole table: since you're trying to restrict INSERT, there's no specific row to lock (if you were trying to govern access to a given row with UPDATE, you could lock just that row). But locking the table makes access to it serial, which limits your throughput. (See the sketch after this list.)
Do it outside transaction scope. Generate the id number in a way that won't be hidden from two concurrent transactions. By the way, this is what AUTO_INCREMENT does. Two concurrent sessions will each get a unique id value, regardless of their order of execution or order of commit. But tracking the last generated id per userid requires access to the database, or a duplicate data store. For example, a memcached key per userid, which can be incremented atomically.
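For reference, the table-lock option might look like this in MySQL (the table and column names are invented; the extra READ lock on the alias is needed because MySQL requires a lock to be taken for every name a table is referenced by in the statement):
LOCK TABLES entries WRITE, entries AS e READ;

INSERT INTO entries (userid, id, note)
SELECT 4, COALESCE(MAX(e.id), 0) + 1, 'new row'
FROM entries AS e
WHERE e.userid = 4;

UNLOCK TABLES;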
It's relatively easy to ensure that inserts get unique values. But it's hard to ensure they will get consecutive ordinal values. Also consider:
What happens if you INSERT in a transaction but then roll back? You've allocated id value 3 in that transaction, and then I allocated value 4, so if you roll back and I commit, now there's a gap.
What happens if an INSERT fails because of other constraints on the table (e.g. another column is NOT NULL)? You could get gaps this way too.
If you ever DELETE a row, do you need to renumber all the following rows for the same userid? What does that do to your memcached entries if you use that solution?
SQL Server should allow you to do this. If you can't implement this using a computed column (probably not - there are some restrictions), surely you can implement it in a trigger.
MySQL also would allow you to implement this via triggers.
In a comment you ask the question about efficiency. Unless you are dealing with extreme volumes, storing an 8 byte DATETIME isn't much of an overhead compared to using, for example, a 4 byte INT.
It also massively simplifies your data inserts, as well as being able to cope with records being deleted without creating 'holes' in your sequence.
If you DO need this, be careful with the field names. If you have uid and id in a table, I'd expect id to be unique in that table, and uid to refer to something else. Perhaps, instead, use the field names property_id and amendment_id.
In terms of implementation, there are generally two options.
1). A trigger
Implementations vary, but the logic remains the same. As you don't specify an RDBMS (other than NOT MS/Oracle) the general logic is simple...
Start a transaction (often one is implicitly already started inside triggers)
Find the MAX(amendment_id) for the property_id being inserted
Update the newly inserted value with MAX(amendment_id) + 1
Commit the transaction
Things to be aware of are...
- multiple records being inserted at the same time
- records being inserted with amendment_id being already populated
- updates altering existing records
2). A Stored Procedure
If you use a stored procedure to control writes to the table, you gain a lot more control.
Implicitly, you know you're only dealing with one record.
You simply don't provide a parameter for DEFAULT fields.
You know what updates / deletes can and can't happen.
You can implement all the business logic you like without hidden triggers
I personally recommend the Stored Procedure route, but triggers do work.
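As a rough illustration of the stored procedure route (T-SQL is used here only because the rest of this page is SQL Server-centric; the table, column and procedure names are invented, and the UPDLOCK/HOLDLOCK hints are one way to serialize concurrent inserts for the same property_id so that two callers cannot compute the same MAX):
CREATE PROCEDURE dbo.AddAmendment
    @property_id INT,
    @details NVARCHAR(200)
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRANSACTION;

    -- the range locks taken by UPDLOCK/HOLDLOCK make a second caller for the
    -- same property_id wait here until this transaction commits
    INSERT INTO dbo.Amendments (property_id, amendment_id, details)
    SELECT @property_id,
           COALESCE(MAX(amendment_id), 0) + 1,
           @details
    FROM dbo.Amendments WITH (UPDLOCK, HOLDLOCK)
    WHERE property_id = @property_id;

    COMMIT TRANSACTION;
END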
It is important to get your data types right.
What you are describing is a multi-part key. So use a multi-part key. Don't try to encode everything into a magic integer, you will poison the rest of your code.
If a record is identified by (entity_id,version_number) then embrace that description and use it directly instead of mangling the meaning of your keys. You will have to write queries which constrain the version number but that's OK. Databases are good at this sort of thing.
version_number could be a timestamp, as a_horse_with_no_name suggests. This is quite a good idea. There is no meaningful performance disadvantage to using timestamps instead of plain integers. What you gain is meaning, which is more important.
You could maintain a "latest version" table which contains, for each entity_id, only the record with the most-recent version_number. This will be more work for you, so only do it if you really need the performance.
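A sketch of what embracing the multi-part key directly can look like (generic SQL; all names are invented):
CREATE TABLE amendment (
    entity_id      INT NOT NULL,
    version_number INT NOT NULL,
    details        VARCHAR(200),
    PRIMARY KEY (entity_id, version_number)
);

-- "the latest version of entity 42"
SELECT *
FROM amendment
WHERE entity_id = 42
ORDER BY version_number DESC
LIMIT 1;   -- or TOP 1 / FETCH FIRST 1 ROW ONLY, depending on the RDBMS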

Reinserting rows with identity columns

I'm implementing a queue in SQL Server (2008 R2) containing jobs that are to be performed. Upon completion, the job is moved to a history table, setting a flag to success or failure. The queue table has an identity column as its primary key. The history table has a combination of this id and a timestamp as its PK.
If a job fails, I would like the option to re-run it, and the way this is designed, that means moving it back from the history table into the live queue. For traceability purposes, I would like the reinserted row to have the same ID as the original entry, which causes problems as this is an identity column.
I see two possible solutions:
1) Use IDENTITY_INSERT:
SET IDENTITY_INSERT TableName ON
-- Move from history to live queue
SET IDENTITY_INSERT TableName OFF
2) Create some custom logic to generate unique IDs, like getting the max ID value from both the live and history queue and adding one.
I don't see any real problems with option 2, apart from it being messy, possibly performing poorly, and making my neurotic skin crawl...
Option 1 I like, but I don't know the implications well enough. How will this perform? And I know that doing this to two tables at the same time will make things crash and burn. What happens if two threads do this to the same table at the same time?
Is this at all a good way to do this for semi-commonly used stored procedures, or should this technique just be used for batch inserting data once in a blue moon?
Any thoughts on which is the best option, or is there a better way?
I'd go with Option 1 - Use IDENTITY_INSERT
SET IDENTITY_INSERT TableName ON
-- Move from history to live queue
SET IDENTITY_INSERT TableName OFF
IDENTITY_INSERT is a setting that applies to the current connection - so if another connection is doing similar, it will have no impact. The only place you get an error with using it is if you attempt to set it ON on another table without first turning it OFF on the first table.
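To flesh out the commented-out step above, the move back might look something like this (the table and column names are assumptions, not from the question; note that an explicit column list including the identity column is required while IDENTITY_INSERT is ON):
DECLARE @JobId INT = 42;   -- the failed job to re-run

SET IDENTITY_INSERT dbo.JobQueue ON;

INSERT INTO dbo.JobQueue (JobId, Payload, CreatedAt)
SELECT h.JobId, h.Payload, h.CreatedAt
FROM dbo.JobHistory AS h
WHERE h.JobId = @JobId AND h.Succeeded = 0;

SET IDENTITY_INSERT dbo.JobQueue OFF;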
Can't you use the original (live) identity value to insert into the history table? You say you combine it with a timestamp anyway.
Assuming that the Queue's Identity column is the one assigning "Job IDs", I would think the simplest solution would be to add a new "OriginalJobID" nullable column, potentially with FK pointing to the history table. Then when you are rerunning a job, allow it to get a new ID as it is added to the queue, but have it keep a reference to the original job in this new column.
To answer "or should this technique just be used for batch inserting data once in a blue moon", I would say yes, definitely, that's exactly what it's for.
Oops, @Damien_The_Unbeliever is right, I'd forgotten that the IDENTITY_INSERT setting is per connection. It would be complicated to get yourself into real trouble with the identity insert approach (it would take something like MARS I guess, or bad error handling). Nonetheless, I think trying to reuse IDs is a mistake!
I can see a potential performance issue when reusing identity values, and that is if the identity column is the key of a clustered index.
A strictly increasing number causes inserted rows to always be added at the end of the clustered index, so no page splits occur.
If you start to insert reused (lower) numbers, you may cause page splits during those insertions.
Whether that is a problem depends on your domain.

Is this INSERT likely to cause any locking/concurrency issues?

In an effort to avoid auto sequence numbers and the like for one reason or another in this particular database, I wondered if anyone could see any problems with this:
INSERT INTO user (label, username, password, user_id)
SELECT 'Test', 'test', 'test', COALESCE(MAX(user_id)+1, 1) FROM user;
I'm using PostgreSQL (but am also trying to be as database-agnostic as possible).
EDIT:
There's two reasons for me wanting to do this.
Keeping dependency on any particular RDBMS low.
Not having to worry about updating sequences if the data is batch-updated to a central database.
Insert performance is not an issue as the only tables where this will be needed are set-up tables.
EDIT-2:
The idea I'm playing with is that each table in the database have a human-generated SiteCode as part of their key, so we always have a compound key. This effectively partitions the data on SiteCode and would allow taking the data from a particular site and putting it somewhere else (obviously on the same database structure). For instance, this would allow backing up of various operational sites onto one central database, but also allow that central database to have operational sites using it.
I could still use sequences, but it seems messy. The actual INSERT would look more like this:
INSERT INTO user (sitecode, label, username, password, user_id)
SELECT 'SITE001', 'Test', 'test', 'test', COALESCE(MAX(user_id)+1, 1)
FROM user
WHERE sitecode='SITE001';
If that makes sense..
I've done something similar before and it worked fine, however the central database in that case was never operational (it was more of a way of centrally viewing data / analyzing) so it did not need to generate ids.
EDIT-3:
I'm starting to think it'd be simpler to only ever allow the centralised database to be either active-only or backup-only, thus avoiding the problem completely and allowing a more simple design.
Oh well back to the drawing board!
There are a couple of points:
Postgres uses Multi-Version Concurrency Control (MVCC), so readers never wait on writers and vice versa. But there is of course a serialization that happens upon each write. If you are going to load a bulk of data into the system, then look at the COPY command. It is much faster than running a large batch of INSERT statements.
The MAX(user_id) can be answered with an index, and probably is, if there is an index on the user_id column. But the real problem is that if two transactions start at the same time, they will see the same MAX(user_id) value. It leads me to the next point:
The canonical way of handling numbers like user_id's is by using SEQUENCE's. These essentially are a place where you can draw the next user id from. If you are really worried about performance on generating the next sequence number, you can generate a batch of them per thread and then only request a new batch when it is exhausted (sometimes called a HiLo sequence).
You may want the user_ids packed up nice and tight as increasing numbers, but I think you should try to let go of that. The reason is that deleting a user_id will create a hole anyway. So I'd not worry too much if the sequence isn't gap-free.
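As a minimal sketch of the sequence route in PostgreSQL (the sequence name is invented, and "user" is quoted here because it is a reserved word in PostgreSQL):
CREATE SEQUENCE user_id_seq;

INSERT INTO "user" (label, username, password, user_id)
VALUES ('Test', 'test', 'test', nextval('user_id_seq'));

-- or make it the column default so inserts need not mention user_id at all
ALTER TABLE "user" ALTER COLUMN user_id SET DEFAULT nextval('user_id_seq');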
Yes, I can see a huge problem. Don't do it.
Multiple connections can get the EXACT SAME id at the same time. I was going to add "under load", but it doesn't even need that - it just needs the right timing between two queries.
To avoid it, you can use transactions or locking mechanisms or isolation levels specific to each DB, but once we get to that stage, you might as well use the dbms-specific sequence/identity/autonumber etc.
EDIT
For question EDIT-2: there is no reason to fear gaps in the user_id, so you could just have one sequence across all sites. If gaps are ok, some options are:
use guaranteed update statements, such as (in SQL Server)
DECLARE @nextnum INT;
UPDATE tblsitesequenceno SET @nextnum = nextnum = nextnum + 1;
Multiple callers to this statement are each guaranteed to get a unique number.
use a single table that produces the identity/sequence/autonumber (db specific)
If you cannot have gaps at all, consider using a transaction mechanism that will restrict access while you are running the max() query. Either that or use a proliferation of (sequences/tables with identity columns/tables with autonumber) that you manipulate using dynamic SQL using the same technique for a single sequence.
By all means use a sequence to generate unique numbers. They are fast, transaction safe and reliable.
Any self-written implementation of a "sequence generator" is either not scalable in a multi-user environment (because you need to do heavy locking) or simply not correct.
If you do need to be DBMS independent, then create an abstraction layer that uses sequences for those DBMS that support them (Postgres, Oracle, Firebird, DB2, Ingres, Informix, ...) and a self-written generator on those that don't.
Trying to create a system that is DBMS independent simply means it will run equally slowly on all systems if you don't exploit the advantages of each DBMS.
Your goal is a good one. Avoiding IDENTITY and AUTOINCREMENT columns means avoiding a whole plethora of administration problems. Here is just one example of the many.
However, most responders at SO will not appreciate it; the popular (as opposed to technical) response is "always stick an Id AUTOINCREMENT column on everything that moves".
A next-sequential number is fine, all vendors have optimised it.
As long as this code is inside a Transaction at an appropriate Isolation Level, as it should be, two users will not get the same MAX()+1 value. Isolation Level is a concept which needs to be understood when coding Transactions.
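To make the Isolation Level point concrete on the OP's platform: at PostgreSQL's default READ COMMITTED level two such transactions can still read the same MAX(), so the pattern is only safe at SERIALIZABLE (or with explicit locking), and the losing transaction must be retried. A sketch, reusing the OP's statement with "user" quoted because it is a reserved word in PostgreSQL:
BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

INSERT INTO "user" (label, username, password, user_id)
SELECT 'Test', 'test', 'test', COALESCE(MAX(user_id) + 1, 1) FROM "user";

COMMIT;   -- on a collision one session fails with a serialization error (SQLSTATE 40001) and should retry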
Getting away from user_id and onto a more meaningful key such as ShortName, or State plus UserNo, is even better (the former spreads the contention; the latter avoids the next-sequential contention altogether, which is relevant for high-volume systems).
What MVCC promises, and what it actually delivers, are two different things. Just surf the net or search SO to view the hundreds of problems re PostgreSQL/MVCC. In the realm of computers, the laws of physics apply; nothing is free. MVCC stores private copies of all rows touched and resolves collisions at the end of the Transaction, resulting in far more rollbacks, whereas 2PL blocks at the beginning of the Transaction and waits, without the massive storage of copies.
Most people with actual experience of MVCC do not recommend it for high-contention, high-volume systems.
The first example code block is fine.
As per the comments, this item no longer applies: the second example code block had an issue. "SITE001" is not a compound key, it is a compounded column. Do not do that; separate "SITE" and "001" into two discrete columns. And if "SITE" is a fixed, repeating value, it can be eliminated.
Different users can end up with the same user_id, because concurrent SELECT statements will see the same MAX(user_id).
If you don't want to use a SEQUENCE, you have to use an extra table with a single record and update this single record every time you need a new unique id:
CREATE TABLE my_sequence(id INT);
INSERT INTO my_sequence(id) VALUES (0);  -- seed the single record that holds the current value

BEGIN;
-- the UPDATE locks the single row, so concurrent transactions are serialized here
UPDATE my_sequence SET id = id + 1;
INSERT INTO user (label, username, password, user_id)
SELECT 'Test', 'test', 'test', id FROM my_sequence;
COMMIT;
I agree with maksymko, but not because I dislike sequences or autoincrementing numbers; they have their place. If you need a value to be unique throughout your "various operational sites", i.e. not only within the confines of a single database instance, a globally unique identifier is a robust, simple solution.