PostgreSQL wrong auto-increment for serial

I have a problem with PostgreSQL: either there is a bug in PostgreSQL, or I have implemented something incorrectly.
There is a table including colmn1(primary key), colmn2(unique), colmn3, ...
After inserting a row, if I try another insertion with an existing colmn2 value, I get a duplicate value error, as expected. But after this unsuccessful attempt, colmn1's next value is
incremented by 1 even though nothing was inserted, so I end up with rows whose ids run 1, 2, 4, 6, 9 (3, 5, 7 and 8 were used up by the unsuccessful attempts).
I need help from the ones who can explain this weird behaviour.
This information may be useful: I used "create unique index on tableName (lower(column1)) " query to set unique constraint.

See the PostgreSQL sequence FAQ:
Sequences are intended for generating unique identifiers — not
necessarily identifiers that are strictly sequential. If two
concurrent database clients both attempt to get a value from a
sequence (using nextval()), each client will get a different sequence
value. If one of those clients subsequently aborts their transaction,
the sequence value that was generated for that client will be unused,
creating a gap in the sequence.
This can't easily be fixed without incurring a significant performance
penalty. For more information, see Elein Mustain's "Gapless Sequences for Primary Keys" in the General Bits Newsletter.

From the manual:
Important: Because sequences are non-transactional, changes made by
setval are not undone if the transaction rolls back.
In other words, it's normal to have gaps. If you don't want gaps, don't use a sequence.
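To make the mechanism concrete, here is a minimal sketch in Python rather than PostgreSQL. It is only a toy model: the `Sequence` and `Table` classes are hypothetical stand-ins, and the one assumption that matters is that the id is taken from the counter before the unique check runs and is never given back.

```python
# Toy model of a non-transactional sequence feeding a serial primary key.
# All names are hypothetical; this is not PostgreSQL code.

class Sequence:
    def __init__(self):
        self.last = 0

    def nextval(self):
        # The counter advances immediately and is never rolled back.
        self.last += 1
        return self.last


class Table:
    def __init__(self):
        self.seq = Sequence()
        self.rows = {}        # id -> colmn2 value
        self.unique = set()   # lower-cased colmn2 values already stored

    def insert(self, colmn2):
        new_id = self.seq.nextval()   # the default is evaluated first
        key = colmn2.lower()
        if key in self.unique:        # the unique check fails afterwards
            raise ValueError(f"duplicate key value: {colmn2!r}")
        self.unique.add(key)
        self.rows[new_id] = colmn2
        return new_id


table = Table()
for value in ["a", "b", "A", "c", "B", "d", "a", "b", "e"]:
    try:
        table.insert(value)
    except ValueError:
        pass  # the failed insert still consumed a sequence value

ids = sorted(table.rows)
print(ids)  # [1, 2, 4, 6, 9]
```

The four rejected inserts used up 3, 5, 7 and 8, which matches the pattern described in the question.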

Related

Are SQL Server sequences guaranteed to always generate unique values even if called simultaneously from multiple connections?

This is a follow-up question to: Are SQL Server sequences thread safe?
I have two separate stored procedures that are calling the same sequence. The stored procedures are launched "in parallel" from an SSIS package. There is no synchronization of any kind between the two stored procedures (other than the fact that I guarantee that they'll never be updating the same rows, even though they are updating the same table). That being said, there's no particular reason that the sequence couldn't be called more or less simultaneously by the two stored procedures. My question is about exactly what would happen in this case.
In the case of the linked question, the OP had several producer applications that were simultaneously inserting into the table and wanted to know whether they could "count" on them being sequential between the processes (i.e. that if producer 2 called the sequence first, its ID would be smaller than producer 3). (This ended up not being the case due to a race condition due to the fact that generating the IDs and storing them were separate steps).
The same logic would presumably apply to my case (that I can't count on them to be in any particular "order" due to the fact that I also produce and store them in separate steps). In my case, however, I don't particularly care whether they're sequential, just that they're unique.
Can I count on that being the case? Are SQL Server sequences guaranteed to always produce unique values (even if called more or less simultaneously from different connections), or could there be some race condition here that would make this no longer be the case?
Edit: The same sequence number could ultimately be added to multiple rows if that matters (although it will always be added to at least one). I fetch the number from the sequence and then do an update query to add it to the rows that I want it to be part of.
If I read correctly, you are just ensuring they are unique (example: you want them to be a primary key?). If so, that is correct. As far as guaranteed order, you are correct that there are conditions, esp. under load, they will not be in a particular order. Does not sound like that is a big problem for you. As long as you are pulling the next value correctly, you are safe.
When I look at created sequences, I think of them like autonumbering in Oracle, where you have to pull the value and then utilize it, rather than IDENTITY in SQL Server (although there are ways to increment IDENTITY to "fill in the hole" later, so it can be utilized in the same/similar manner).
I have not examined the internals, but I would imagine the base sequence concept is utilized for IDENTITY underneath the hood, as the ideas are essentially the same, except IDENTITY is attached to a field in the table.
Yes, they are, and that is one of their most important features, as described in the documentation (emphasis is mine):
Identity columns can be used for generating key values. The identity
property on a column guarantees the following:
Each new value is generated based on the current seed & increment.
Each new value for a particular transaction is different from other
concurrent transactions on the table.
Disclaimer: there is no guarantee that values are sequential.
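As an illustration of "unique but not necessarily ordered", here is a rough Python sketch. It does not talk to SQL Server; the `Sequence` class is a hypothetical stand-in whose increment is atomic, and the two worker threads play the role of the two stored procedures that fetch a value and store it in separate steps.

```python
import threading


class Sequence:
    """Toy stand-in for a database sequence: the increment is atomic."""

    def __init__(self):
        self._last = 0
        self._lock = threading.Lock()

    def next_value(self):
        with self._lock:
            self._last += 1
            return self._last


seq = Sequence()
fetched = []                     # (worker name, value) in the order stored
fetched_lock = threading.Lock()


def worker(name, count):
    for _ in range(count):
        value = seq.next_value()      # step 1: generate the value
        with fetched_lock:            # step 2: store it (a separate step)
            fetched.append((name, value))


threads = [
    threading.Thread(target=worker, args=(f"proc{i}", 1000)) for i in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

values = [v for _, v in fetched]
print(len(values) == len(set(values)))  # True: no value was handed out twice
```

The stored order in `fetched` may or may not be ascending on a given run, because another worker can slip in between the two steps. Uniqueness holds either way.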

When should I use CYCLE in a sequence?

I'm using sequences in a PostgreSQL database to insert rows into tables.
When creating the sequences I have never used the CYCLE option on them. I mean they can generate pretty big numbers (on the order of 2^63 as far as I remember) and I don't really see why I would want a sequence to go back to zero. So my question is:
When should I use CYCLE while creating a sequence?
Do you have an example where it makes sense?
It seems a sequence can use CYCLE for other purposes rather than for primary key generation.
That is, in scenarios where uniqueness of its values is not required; quite the opposite, the values are expected to cycle back and repeat themselves after some time.
For example:
When generating numbers that must return to the initial value and repeat themselves at some point, for any reason (e.g. implementing a "Bingo" game).
When the sequence is a temporary identifier that will last for a short period of time and will be unique during its life.
When the field is small -- or can accept a limited number of values -- and it doesn't matter if they repeat themselves.
When there is another field in the entity that will identify it, and the sequence value is used for something else.
When an entity has a composite unique key and the sequence value is only a part of it.
When using the sequence value to generate a uniform distribution of values over a big set, though this is hardly a random assignment of values.
Any other cyclic number generation.
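The wrap-around behaviour can be sketched in a few lines of Python. This is a toy model, not PostgreSQL: `cycling_sequence` and its parameters are hypothetical names, and it assumes an ascending sequence that restarts at its minimum value once the maximum is passed.

```python
import itertools


def cycling_sequence(minvalue, maxvalue, start=None, increment=1):
    """Yield values like an ascending sequence created with CYCLE (toy model)."""
    value = minvalue if start is None else start
    while True:
        yield value
        value += increment
        if value > maxvalue:
            value = minvalue  # wrap around instead of raising an error


ticket = cycling_sequence(minvalue=1, maxvalue=5, start=3)
first_eight = list(itertools.islice(ticket, 8))
print(first_eight)  # [3, 4, 5, 1, 2, 3, 4, 5]
```

Values repeat by design, which is why such a sequence is a poor fit for a primary key on its own but fine for short-lived or composite identifiers.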

SQLPlus Sequence - multiple tables

I am trying to use Dennis' solution here as an implementation of auto_increment in Oracle database. Say I create one sequence as follows:
CREATE SEQUENCE auto_increment
START WITH 1
INCREMENT BY 1;
If I want auto_increment behavior in multiple tables, can I just use this sequence for all tables? Or do I need a separate sequence per table? That is, will the sequence increment for one table be affected by another table using the sequence?
Yes, the sequence accesses will be affecting each other if you use the same sequence. However the tone of your question makes me think that you expect the sequence to be continuous.
Don't be fooled, sequences are NOT gap-free. The only thing you are guaranteed is that the numbers retrieved are unique and in ascending order (in your case).
You can use the same sequence for many tables. It would be unconventional to do so, it would lead to more contention on the sequence, and it would make life a bit more difficult if you needed to reset the sequence value as a result of, say, an export and import between environments but it would work.
Of course, if the sequence gave a value of 1 for table A, it would never give that same value to a trigger defined on B. Since sequences do not generate gap-free sets of values (i.e. you can guarantee that there will be "missing" values in every table no matter how many sequences you create) that shouldn't be a major downside.
Sequences are sequential. However, there are many things that can cause gaps in the values a table ends up with, e.g. rollback, commit (because the sequence generator issues values irrespective of commits or rollbacks), and using the same sequence for multiple tables.
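A small Python sketch of what sharing one sequence between two tables looks like. `itertools.count` is only a stand-in for the `auto_increment` sequence above, and the two lists are hypothetical tables A and B.

```python
import itertools

# Stand-in for one shared sequence: START WITH 1, INCREMENT BY 1.
shared_seq = itertools.count(start=1)

table_a, table_b = [], []

# Inserts arrive interleaved across the two tables.
for target in ["a", "b", "b", "a", "b"]:
    new_id = next(shared_seq)
    (table_a if target == "a" else table_b).append(new_id)

print(table_a)  # [1, 4]
print(table_b)  # [2, 3, 5]
```

Each table sees ascending, unique ids, and no id appears in both tables, but neither table is gap-free.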

Sequence vs identity

SQL Server 2012 introduced Sequence as a new feature, same as in Oracle and Postgres. Where are sequences preferred over identities? And why do we need sequences?
I think you will find your answer here
Using the identity attribute for a column, you can easily generate
auto-incrementing numbers (which are often used as a primary key). With
Sequence, it will be a different object which you can attach to a
table column while inserting. Unlike identity, the next number for the
column value will be retrieved from memory rather than from the disk –
this makes Sequence significantly faster than Identity. We will see
this in coming examples.
And here:
Sequences: Sequences have been requested by the SQL Server community
for years, and it's included in this release. Sequence is a user
defined object that generates a sequence of a number. Here is an
example using Sequence.
and here as well:
A SQL Server sequence object generates sequence of numbers just like
an identity column in sql tables. But the advantage of sequence
numbers is the sequence number object is not limited with single sql
table.
and on msdn you can also read more about usage and why we need it (here):
A sequence is a user-defined schema-bound object that generates a
sequence of numeric values according to the specification with which
the sequence was created. The sequence of numeric values is generated
in an ascending or descending order at a defined interval and may
cycle (repeat) as requested. Sequences, unlike identity columns, are
not associated with tables. An application refers to a sequence object
to receive its next value. The relationship between sequences and
tables is controlled by the application. User applications can
reference a sequence object and coordinate the values keys across
multiple rows and tables.
A sequence is created independently of the tables by using the CREATE
SEQUENCE statement. Options enable you to control the increment,
maximum and minimum values, starting point, automatic restarting
capability, and caching to improve performance. For information about
the options, see CREATE SEQUENCE.
Unlike identity column values, which are generated when rows are
inserted, an application can obtain the next sequence number before
inserting the row by calling the NEXT VALUE FOR function. The sequence
number is allocated when NEXT VALUE FOR is called even if the number
is never inserted into a table. The NEXT VALUE FOR function can be
used as the default value for a column in a table definition. Use
sp_sequence_get_range to get a range of multiple sequence numbers at
once.
A sequence can be defined as any integer data type. If the data type
is not specified, a sequence defaults to bigint.
Sequence and identity are both used to generate auto numbers, but the major difference is that an identity is tied to a table while a sequence is independent of any table.
If you need to maintain an auto number globally (across multiple tables), restart the numbering after a particular value, and cache values for performance, that is where you need a sequence and not an identity.
Although sequences provide more flexibility than identity columns, I didn't find they had any performance benefits.
I found performance using identity was consistently 3x faster than using sequence for batch inserts.
I inserted approx 1.5M rows and performance was:
14 seconds for identity
45 seconds for sequence
I inserted the rows into a table which used a sequence object via a table default:
DEFAULT (NEXT VALUE FOR <seq>) FOR <col_name>
and also tried specifying the sequence value in the select statement:
SELECT NEXT VALUE FOR <seq>, <other columns> FROM <table>
Both were the same factor slower than the identity method. I used the default cache option for the sequence.
The article referenced in Arion's first link shows performance for row-by-row insert and difference between identity and sequence was 16.6 seconds to 14.3 seconds for 10,000 inserts.
The Caching option has a big impact on performance, but identity is faster for higher volumes (+1M rows)
See this link for an in-depth analysis as per utly4life's comment.
I know this is a little old, but wanted to add an observation that bit me.
I switched from identity to sequence to have my indexes in order. I later found out that sequences don't transfer with replication. I started getting key violations after I set up replication between two databases, since the sequences were not in sync. Just something to watch out for before you make a decision.
I find the best use of Sequences is not to replace an identity column but to create a "Order Number" type of field.
In other words, an Order Number is exposed to the end user and may have business rules along with it. You want it to be unique, but just using an Identity Column isn't really correct either.
For example, different order types might require a different sequence, so you might have a sequence for Internet Order, as opposed to In-house orders.
In other words, don't think of a Sequence as simply a replacement for identity; think of it as being useful in cases where an identity does not fit the business requirements.
I was recently bitten by something worth considering for identity vs sequence. It seems MSFT now suggests a sequence if gaps in the generated values are a concern. We had an issue with huge gaps in the identity values, and the statement quoted below would explain it: SQL Server cached the identity values, and after a reboot we lost those numbers.
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-table-transact-sql-identity-property?view=sql-server-2017
Consecutive values after server restart or other failures – SQL Server might cache identity values for performance reasons and some of the assigned values can be lost during a database failure or server restart. This can result in gaps in the identity value upon insert. If gaps are not acceptable then the application should use its own mechanism to generate key values. Using a sequence generator with the NOCACHE option can limit the gaps to transactions that are never committed.
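The effect of caching on gaps can be sketched with a toy model in Python. This is not how SQL Server is implemented internally; `CachedSequence` is a hypothetical class that assumes the server persists only the upper bound of each reserved block and resumes from that bound after a restart.

```python
class CachedSequence:
    """Toy model: values are reserved in blocks, only the block's upper
    bound is persisted, and a restart resumes after that bound."""

    def __init__(self, cache_size):
        self.cache_size = cache_size
        self.persisted = 0        # highest value recorded on "disk"
        self.next_in_memory = 1   # next value to hand out

    def next_value(self):
        if self.next_in_memory > self.persisted:
            # Cache exhausted: reserve a new block and persist its upper bound.
            self.persisted += self.cache_size
        value = self.next_in_memory
        self.next_in_memory += 1
        return value

    def crash_and_restart(self):
        # Unused cached values are lost; resume after the persisted bound.
        self.next_in_memory = self.persisted + 1


def run(cache_size):
    seq = CachedSequence(cache_size)
    before = [seq.next_value() for _ in range(3)]
    seq.crash_and_restart()
    after = [seq.next_value() for _ in range(2)]
    return before + after


print(run(cache_size=10))  # [1, 2, 3, 11, 12]
print(run(cache_size=1))   # [1, 2, 3, 4, 5]
```

With a block of 10 the restart skips 4 through 10. With a block of 1, which plays the role of NOCACHE here, the restart itself loses nothing, so the remaining gaps come only from values taken by transactions that never commit.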

Some sort of “different auto-increment indexes” per a primary key values

I have a table which has an id (primary key with auto increment), a uid (a key referring to a user's id, for example) and something else which doesn't matter for my question.
I want to have, let's call it, different auto-increment keys on id for each uid entry.
So, I will add an entry with uid 10, and the id field for this entry will be 1 because there were no previous entries with a value of 10 in uid. I will add a new one with uid 4 and its id will be 3 because there were already two entries with uid 4.
...A very obvious explanation, but I am trying to be as explanatory and clear as I can to demonstrate the idea... clearly.
What SQL engine can provide such a functionality natively? (non Microsoft/Oracle based)
If there is none, how could I best replicate it? Triggers perhaps?
Does this functionality have a more suitable name?
In case you know of a non-SQL database engine providing such functionality, name it anyway; I am curious.
Thanks.
MySQL's MyISAM engine can do this. See their manual, in section Using AUTO_INCREMENT:
For MyISAM tables you can specify AUTO_INCREMENT on a secondary column in a multiple-column index. In this case, the generated value for the AUTO_INCREMENT column is calculated as MAX(auto_increment_column) + 1 WHERE prefix=given-prefix. This is useful when you want to put data into ordered groups.
The docs go on after that paragraph, showing an example.
The InnoDB engine in MySQL does not support this feature, which is unfortunate because it's better to use InnoDB in almost all cases.
You can't emulate this behavior using triggers (or any SQL statements limited to transaction scope) without locking tables on INSERT. Consider this sequence of actions:
Mario starts transaction and inserts a new row for user 4.
Bill starts transaction and inserts a new row for user 4.
Mario's session fires a trigger to compute MAX(id)+1 for user 4. He gets 3.
Bill's session fires a trigger to compute MAX(id)+1 for user 4. He also gets 3, because Mario's row is not yet visible to him.
Bill's session finishes his INSERT and commits.
Mario's session tries to finish his INSERT, but the row with (userid=4, id=3) now exists, so Mario gets a primary key conflict.
In general, you can't control the order of execution of these steps without some kind of synchronization.
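The interleaving above can be replayed deterministically with Python's built-in sqlite3 module. This is only a sketch: a single connection stands in for both sessions, the `entries` table and `next_id` helper are hypothetical, and the point is the order of the steps rather than real concurrency.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (uid INTEGER, id INTEGER, PRIMARY KEY (uid, id))"
)
# User 4 already has two entries.
conn.executemany("INSERT INTO entries VALUES (?, ?)", [(4, 1), (4, 2)])


def next_id(uid):
    # What each session's trigger would compute: MAX(id) + 1 for this uid.
    row = conn.execute(
        "SELECT COALESCE(MAX(id), 0) + 1 FROM entries WHERE uid = ?", (uid,)
    ).fetchone()
    return row[0]


mario_id = next_id(4)  # Mario computes his id first
bill_id = next_id(4)   # Bill computes his before Mario has inserted

# Bill finishes his INSERT first.
conn.execute("INSERT INTO entries VALUES (?, ?)", (4, bill_id))

# Mario now tries to finish his INSERT with the same (uid, id) pair.
try:
    conn.execute("INSERT INTO entries VALUES (?, ?)", (4, mario_id))
    conflict = False
except sqlite3.IntegrityError:
    conflict = True

print(mario_id, bill_id, conflict)  # 3 3 True
```

Both sessions computed 3 from the same snapshot of the data, so whichever one inserts second hits the primary key conflict.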
The solutions to this are either:
Get an exclusive table lock. Before trying an INSERT, lock the table. This is necessary to prevent concurrent INSERTs from creating a race condition like in the example above. It's necessary to lock the whole table: since you're trying to restrict INSERT, there's no specific row to lock (if you were trying to govern access to a given row with UPDATE, you could lock just that row). But locking the table makes access to the table serial, which limits your throughput.
Do it outside transaction scope. Generate the id number in a way that won't be hidden from two concurrent transactions. By the way, this is what AUTO_INCREMENT does. Two concurrent sessions will each get a unique id value, regardless of their order of execution or order of commit. But tracking the last generated id per userid requires access to the database, or a duplicate data store. For example, a memcached key per userid, which can be incremented atomically.
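As a rough illustration of the second option, here is an in-process Python sketch of an atomic per-userid counter. `PerKeyCounter` is a hypothetical stand-in for something like an atomically incremented memcached key per userid; it makes no attempt to survive restarts or to handle rollbacks.

```python
import threading
from collections import defaultdict


class PerKeyCounter:
    """Toy stand-in for an external store with an atomic increment per key."""

    def __init__(self):
        self._counts = defaultdict(int)
        self._lock = threading.Lock()

    def incr(self, key):
        with self._lock:
            self._counts[key] += 1
            return self._counts[key]


counter = PerKeyCounter()
results = defaultdict(list)      # uid -> ids handed out
results_lock = threading.Lock()


def session(uid, inserts):
    for _ in range(inserts):
        new_id = counter.incr(uid)   # allocated outside any transaction
        with results_lock:
            results[uid].append(new_id)


# Two concurrent sessions for user 4, one for user 10.
threads = [threading.Thread(target=session, args=(uid, 500)) for uid in (4, 4, 10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results[4]) == list(range(1, 1001)))   # True
print(sorted(results[10]) == list(range(1, 501)))   # True
```

Every session gets a distinct id per uid regardless of scheduling. Values stay consecutive here only because nothing ever fails or rolls back after taking a number.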
It's relatively easy to ensure that inserts get unique values. But it's hard to ensure they will get consecutive ordinal values. Also consider:
What happens if you INSERT in a transaction but then roll back? You've allocated id value 3 in that transaction, and then I allocated value 4, so if you roll back and I commit, now there's a gap.
What happens if an INSERT fails because of other constraints on the table (e.g. another column is NOT NULL)? You could get gaps this way too.
If you ever DELETE a row, do you need to renumber all the following rows for the same userid? What does that do to your memcached entries if you use that solution?
SQL Server should allow you to do this. If you can't implement this using a computed column (probably not - there are some restrictions), surely you can implement it in a trigger.
MySQL also would allow you to implement this via triggers.
In a comment you ask the question about efficiency. Unless you are dealing with extreme volumes, storing an 8 byte DATETIME isn't much of an overhead compared to using, for example, a 4 byte INT.
It also massively simplifies your data inserts, as well as being able to cope with records being deleted without creating 'holes' in your sequence.
If you DO need this, be careful with the field names. If you have uid and id in a table, I'd expect id to be unique in that table, and uid to refer to something else. Perhaps, instead, use the field names property_id and amendment_id.
In terms of implementation, there are generally two options.
1). A trigger
Implementations vary, but the logic remains the same. As you don't specify an RDBMS (other than NOT MS/Oracle) the general logic is simple...
Start a transaction (often this is Implicitly already started inside triggers)
Find the MAX(amendment_id) for the property_id being inserted
Update the newly inserted value with MAX(amendment_id) + 1
Commit the transaction
Things to be aware of are...
- multiple records being inserted at the same time
- records being inserted with amendment_id being already populated
- updates altering existing records
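A minimal sketch of the trigger logic, using SQLite through Python's sqlite3 module only because it runs anywhere. The table, column and trigger names are hypothetical, trigger syntax differs between engines, and this does nothing about the concurrency concerns listed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE amendments (
        row_id       INTEGER PRIMARY KEY,
        property_id  INTEGER NOT NULL,
        amendment_id INTEGER,
        note         TEXT
    );

    -- Number the new row with MAX(amendment_id) + 1 for its property_id.
    -- Rows inserted with amendment_id already populated are left alone.
    CREATE TRIGGER number_amendment
    AFTER INSERT ON amendments
    FOR EACH ROW WHEN NEW.amendment_id IS NULL
    BEGIN
        UPDATE amendments
           SET amendment_id = (SELECT COALESCE(MAX(amendment_id), 0) + 1
                                 FROM amendments
                                WHERE property_id = NEW.property_id)
         WHERE row_id = NEW.row_id;
    END;
    """
)

for property_id in (10, 4, 4, 4):
    conn.execute(
        "INSERT INTO amendments (property_id, note) VALUES (?, 'x')",
        (property_id,),
    )

numbered = conn.execute(
    "SELECT property_id, amendment_id FROM amendments ORDER BY row_id"
).fetchall()
print(numbered)  # [(10, 1), (4, 1), (4, 2), (4, 3)]
```

The freshly inserted row still has a NULL amendment_id when the subquery runs, so MAX ignores it and the row receives the next number for its property.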
2). A Stored Procedure
If you use a stored procedure to control writes to the table, you gain a lot more control.
Implicitly, you know you're only dealing with one record.
You simply don't provide a parameter for DEFAULT fields.
You know what updates / deletes can and can't happen.
You can implement all the business logic you like without hidden triggers
I personally recommend the Stored Procedure route, but triggers do work.
It is important to get your data types right.
What you are describing is a multi-part key. So use a multi-part key. Don't try to encode everything into a magic integer; you will poison the rest of your code.
If a record is identified by (entity_id,version_number) then embrace that description and use it directly instead of mangling the meaning of your keys. You will have to write queries which constrain the version number but that's OK. Databases are good at this sort of thing.
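For example, a query constrained on the version number might look like the sketch below. It uses SQLite via Python's sqlite3 module purely for illustration; the `documents` table and its `body` column are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE documents (
        entity_id      INTEGER NOT NULL,
        version_number INTEGER NOT NULL,
        body           TEXT,
        PRIMARY KEY (entity_id, version_number)  -- the multi-part key
    )
    """
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [(1, 1, "draft"), (1, 2, "final"), (2, 1, "only")],
)

# Latest version of every entity: constrain version_number per entity_id.
latest = conn.execute(
    """
    SELECT d.entity_id, d.version_number, d.body
      FROM documents AS d
     WHERE d.version_number = (SELECT MAX(version_number)
                                 FROM documents
                                WHERE entity_id = d.entity_id)
     ORDER BY d.entity_id
    """
).fetchall()
print(latest)  # [(1, 2, 'final'), (2, 1, 'only')]
```

The composite primary key also gives the database an index that serves the per-entity MAX lookup directly.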
version_number could be a timestamp, as a_horse_with_no_name suggests. This is quite a good idea. There is no meaningful performance disadvantage to using timestamps instead of plain integers. What you gain is meaning, which is more important.
You could maintain a "latest version" table which contains, for each entity_id, only the record with the most-recent version_number. This will be more work for you, so only do it if you really need the performance.