Is it a bad practice to use an identity column to determine the order of row creation? [duplicate]

Is it a bad practice to use an identity column to determine the order of row creation? [duplicate] - sql

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Can I use a SQL Server identity column to determine the inserted order of rows?
If an identity column is reseeded, then it can not be used be used to determine the order of row insertion, but I have no reason to ever reseed the identity.
Are there any reasons why I should not use the identity column to determine the order of creation?

Because it wouldn't be reliable would be the reason I would not use it. You might have two processes ask simultaneously for identity values and process 1 got the first value and process 2 got the second value but process 2 actually finished the transaction earlier and thus was inserted earlier. A datetime field for date inserted is the only reliable choice if you want to know the order that records were actually inserted.

It is not considered a good practice. For example, two processes doing inserts on a table in simultaneous transactions can in some servers have chunks of ids assigned to them, so any row inserted from one transaction will have a lesser id than any row inserted from the other transaction. Also, this can sometimes cause gaps in sequence of ids. And there may be also other scenarios something unexpected might happen.
In short, autoincremented ids are not always guaranteed to a be a continuous and ascending sequence.

Related

SEQUENCE number on every INSERT in MS SQL 2012

I am in the situation where multiple user inserting values from application to database via web service, have using stored procedure for validate and insert records.
Requirement is create unique number for each entries but strictly in SEQUENCE only. I added Identity column but its missed some of the number in between e.g. 25,26,27,29,34...
Our requirement is strictly generate next number only like we use for Invoice Number/ Order Number/ Receipt Number etc. 1,2,3,4,5...
I checked below link about Sequence Number but not sure if its surely resolve my issue. Can someone please assist in this.
Sequence Numbers

If you absolutely, positively cannot have gaps, then you need to use a trigger and your own logic. This puts a lot of overhead into inserts, but it is the only guarantee.
Basically, the parts of a database that protect the data get in the way of doing what you want. If a transaction uses a sequence number (or identity) and it is later rolled back, then what happens to the generated number? Well, that is one way that gaps appear.
Of course, you will have to figure out what to do in that case. I would just go for an identity column and work on educating users that gaps are possible. After all, if you don't want gaps on output, then row_number() is available to re-assign numbers.

SQL Server 2008 identity Key auto increment Issue

I recently had to move a database(sql server 2008) to a different server, and I have noticed that in one of the table, the value of identity column has started to get some unexpected values. its set as identity column with identity increment 1 and identity seed 1. After every 10 consecutive entry or so, it would start from another much higher number and increment by 1 for next 10 entries or so and then jump up to another higher number. I can't seem to figure out the issue.
Sorry for the layman language. I am not a DB person.

This is likely not an issue with your identity key but an issue with a framework being used to insert data, or a SP. If you have a stored procedure that inserts data but then is ROLLBACK'd, the ID was reserved but the row is 'deleted'.
So two places to check: One on the frameworks you're using (NHibernate, or Entity Framework, etc?)... those frameworks might be inserting rows then deleting them. Second place to check is INSERT statements in SPROCs and other places you might expect a ROLLBACK.
See: SQL Identity (autonumber) is Incremented Even with a Transaction Rollback
Another issue is that you may just be examining data without sorting it? And when you ported the data you assumed it would be inserted or retrieved always in-ID-order. But because the new table is not 'indexed' the same way you won't necessarily see items in primary-key order. This is less likely if rows appear sequential most of the time with gaps, but worth mentioning.

The following will shed some light on your issue.
Create a table with an identity auto increment column
Insert 10 rows of random data
You will see that the ID values are 1,2,3,4,5,6,7,8,9,10
delete from mytable where id=10
insert another row into the table
now you will see the ID values are 1,2,3,4,5,6,7,8,9,**11**

Sequence vs identity

SQL Server 2012 introduced Sequence as a new feature, same as in Oracle and Postgres. Where sequences are preferred over identities? And why do we need sequences?

I think you will find your answer here
Using the identity attribute for a column, you can easily generate
auto-incrementing numbers (which as often used as a primary key). With
Sequence, it will be a different object which you can attach to a
table column while inserting. Unlike identity, the next number for the
column value will be retrieved from memory rather than from the disk –
this makes Sequence significantly faster than Identity. We will see
this in coming examples.
And here:
Sequences: Sequences have been requested by the SQL Server community
for years, and it's included in this release. Sequence is a user
defined object that generates a sequence of a number. Here is an
example using Sequence.
and here as well:
A SQL Server sequence object generates sequence of numbers just like
an identity column in sql tables. But the advantage of sequence
numbers is the sequence number object is not limited with single sql
table.
and on msdn you can also read more about usage and why we need it (here):
A sequence is a user-defined schema-bound object that generates a
sequence of numeric values according to the specification with which
the sequence was created. The sequence of numeric values is generated
in an ascending or descending order at a defined interval and may
cycle (repeat) as requested. Sequences, unlike identity columns, are
not associated with tables. An application refers to a sequence object
to receive its next value. The relationship between sequences and
tables is controlled by the application. User applications can
reference a sequence object and coordinate the values keys across
multiple rows and tables.
A sequence is created independently of the tables by using the CREATE
SEQUENCE statement. Options enable you to control the increment,
maximum and minimum values, starting point, automatic restarting
capability, and caching to improve performance. For information about
the options, see CREATE SEQUENCE.
Unlike identity column values, which are generated when rows are
inserted, an application can obtain the next sequence number before
inserting the row by calling the NEXT VALUE FOR function. The sequence
number is allocated when NEXT VALUE FOR is called even if the number
is never inserted into a table. The NEXT VALUE FOR function can be
used as the default value for a column in a table definition. Use
sp_sequence_get_range to get a range of multiple sequence numbers at
once.
A sequence can be defined as any integer data type. If the data type
is not specified, a sequence defaults to bigint.

Sequence and identity both used to generate auto number but the major difference is Identity is a table dependant and Sequence is independent from table.
If you have a scenario where you need to maintain an auto number globally (in multiple tables), also you need to restart your interval after particular number and you need to cache it also for performance, here is the place where we need sequence and not identity.

Although sequences provide more flexibility than identity columns, I didn't find they had any performance benefits.
I found performance using identity was consistently 3x faster than using sequence for batch inserts.
I inserted approx 1.5M rows and performance was:
14 seconds for identity
45 seconds for sequence
I inserted the rows into a table which used sequence object via a table default:
NEXT VALUE for <seq> for <col_name>
and also tried specifying sequence value in select statement:
SELECT NEXT VALUE for <seq>, <other columns> from <table>
Both were the same factor slower than the identity method. I used the default cache option for the sequence.
The article referenced in Arion's first link shows performance for row-by-row insert and difference between identity and sequence was 16.6 seconds to 14.3 seconds for 10,000 inserts.
The Caching option has a big impact on performance, but identity is faster for higher volumes (+1M rows)
See this link for an indepth analysis as per utly4life's comment.

I know this is a little old, but wanted to add an observation that bit me.
I switched from identity to sequence to have my indexes in order. I later found out that sequence doesn't transfer with replication. I started getting key violations after I setup replication between two databases since the sequences were not in sync. just something to watch out for before you make a decision.

I find the best use of Sequences is not to replace an identity column but to create a "Order Number" type of field.
In other words, an Order Number is exposed to the end user and may have business rules along with it. You want it to be unique, but just using an Identity Column isn't really correct either.
For example, different order types might require a different sequence, so you might have a sequence for Internet Order, as opposed to In-house orders.
In other words, don't think of a Sequence as simple a replacement for identity, think of it as being useful in cases where an identity does not fit the business requirements.

Recently was bit by something to consider for identity vs sequence. Seems MSFT now suggests sequence if you may want to keep identity without gaps. We had an issue where there were huge gaps in the identity, but based on this statement highlighted would explain our issue that SQL cached the identity and after reboot we lost those numbers.
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-table-transact-sql-identity-property?view=sql-server-2017
Consecutive values after server restart or other failures – SQL Server might cache identity values for performance reasons and some of the assigned values can be lost during a database failure or server restart. This can result in gaps in the identity value upon insert. If gaps are not acceptable then the application should use its own mechanism to generate key values. Using a sequence generator with the NOCACHE option can limit the gaps to transactions that are never committed.

Custom sequence in postgres

I need to have a custom unique identifier (sequence). In my table there is a field ready_to_fetch_id that will be null by default and when my message is ready to be delivered then i make it update with unique max id, this is quite heavy process as load increasing.
So it it possible to have some sequence in postgres that allow null and unique ids.

Allowing NULL values has nothing todo with sequences. If your column definition allows NULLs you can put NULL values in the column. When you update the column you take the nextval from the sequence.
Notice that if you plan to use the ids to keep track of which rows you have already processed that it won't work perfectly. When two transactions are going to update the ready_to_fetch_id column simultaneous the transaction that started last might commit first which means that the higher id of the last transaction to start will become visible before the lower id the earlier transaction is using.

Some sort of “different auto-increment indexes” per a primary key values

I have got a table which has an id (primary key with auto increment), uid (key refering to users id for example) and something else which for my question won’t matter.
I want to make, lets call it, different auto-increment keys on id for each uid entry.
So, I will add an entry with uid 10, and the id field for this entry will have a 1 because there were no previous entries with a value of 10 in uid. I will add a new one with uid 4 and its id will be 3 because I there were already two entried with uid 4.
...Very obvious explanation, but I am trying to be as explainative an clear as I can to demonstrate the idea... clearly.
What SQL engine can provide such a functionality natively? (non Microsoft/Oracle based)
If there is none, how could I best replicate it? Triggers perhaps?
Does this functionality have a more suitable name?
In case you know about a non SQL database engine providing such a functioality, name it anyway, I am curious.
Thanks.

MySQL's MyISAM engine can do this. See their manual, in section Using AUTO_INCREMENT:
For MyISAM tables you can specify AUTO_INCREMENT on a secondary column in a multiple-column index. In this case, the generated value for the AUTO_INCREMENT column is calculated as MAX(auto_increment_column) + 1 WHERE prefix=given-prefix. This is useful when you want to put data into ordered groups.
The docs go on after that paragraph, showing an example.
The InnoDB engine in MySQL does not support this feature, which is unfortunate because it's better to use InnoDB in almost all cases.
You can't emulate this behavior using triggers (or any SQL statements limited to transaction scope) without locking tables on INSERT. Consider this sequence of actions:
Mario starts transaction and inserts a new row for user 4.
Bill starts transaction and inserts a new row for user 4.
Mario's session fires a trigger to computes MAX(id)+1 for user 4. You get 3.
Bill's session fires a trigger to compute MAX(id). I get 3.
Bill's session finishes his INSERT and commits.
Mario's session tries to finish his INSERT, but the row with (userid=4, id=3) now exists, so Mario gets a primary key conflict.
In general, you can't control the order of execution of these steps without some kind of synchronization.
The solutions to this are either:
Get an exclusive table lock. Before trying an INSERT, lock the table. This is necessary to prevent concurrent INSERTs from creating a race condition like in the example above. It's necessary to lock the whole table, since you're trying to restrict INSERT there's no specific row to lock (if you were trying to govern access to a given row with UPDATE, you could lock just the specific row). But locking the table causes access to the table to become serial, which limits your throughput.
Do it outside transaction scope. Generate the id number in a way that won't be hidden from two concurrent transactions. By the way, this is what AUTO_INCREMENT does. Two concurrent sessions will each get a unique id value, regardless of their order of execution or order of commit. But tracking the last generated id per userid requires access to the database, or a duplicate data store. For example, a memcached key per userid, which can be incremented atomically.
It's relatively easy to ensure that inserts get unique values. But it's hard to ensure they will get consecutive ordinal values. Also consider:
What happens if you INSERT in a transaction but then roll back? You've allocated id value 3 in that transaction, and then I allocated value 4, so if you roll back and I commit, now there's a gap.
What happens if an INSERT fails because of other constraints on the table (e.g. another column is NOT NULL)? You could get gaps this way too.
If you ever DELETE a row, do you need to renumber all the following rows for the same userid? What does that do to your memcached entries if you use that solution?

SQL Server should allow you to do this. If you can't implement this using a computed column (probably not - there are some restrictions), surely you can implement it in a trigger.
MySQL also would allow you to implement this via triggers.

In a comment you ask the question about efficiency. Unless you are dealing with extreme volumes, storing an 8 byte DATETIME isn't much of an overhead compared to using, for example, a 4 byte INT.
It also massively simplifies your data inserts, as well as being able to cope with records being deleted without creating 'holes' in your sequence.
If you DO need this, be careful with the field names. If you have uid and id in a table, I'd expect id to be unique in that table, and uid to refer to something else. Perhaps, instead, use the field names property_id and amendment_id.
In terms of implementation, there are generally two options.
1). A trigger
Implementations vary, but the logic remains the same. As you don't specify an RDBMS (other than NOT MS/Oracle) the general logic is simple...
Start a transaction (often this is Implicitly already started inside triggers)
Find the MAX(amendment_id) for the property_id being inserted
Update the newly inserted value with MAX(amendment_id) + 1
Commit the transaction
Things to be aware of are...
- multiple records being inserted at the same time
- records being inserted with amendment_id being already populated
- updates altering existing records
2). A Stored Procedure
If you use a stored procedure to control writes to the table, you gain a lot more control.
Implicitly, you know you're only dealing with one record.
You simply don't provide a parameter for DEFAULT fields.
You know what updates / deletes can and can't happen.
You can implement all the business logic you like without hidden triggers
I personally recommend the Stored Procedure route, but triggers do work.

It is important to get your data types right.
What you are describing is a multi-part key. So use a multi-part key. Don't try to encode everything into a magic integer, you will poison the rest of your code.
If a record is identified by (entity_id,version_number) then embrace that description and use it directly instead of mangling the meaning of your keys. You will have to write queries which constrain the version number but that's OK. Databases are good at this sort of thing.
version_number could be a timestamp, as a_horse_with_no_name suggests. This is quite a good idea. There is no meaningful performance disadvantage to using timestamps instead of plain integers. What you gain is meaning, which is more important.
You could maintain a "latest version" table which contains, for each entity_id, only the record with the most-recent version_number. This will be more work for you, so only do it if you really need the performance.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas