I want to increment a sequence by some batchSize in PostgreSQL but I am not sure if this query is safe from concurrent calls made to the same sequence.
select setval('my_sequence', nextval('my_sequence') + batchSize);
What I am worried about, given the query above, is that the call to setval() would set the sequence back, or perhaps abort the query, if another thread has fetched a value with nextval() concurrently.
For example:
Thread 1. nextval = 1001 batchSize = 100 sets sequence to 1101
Thread 2. nextval = 1002 batchSize = 20 sets sequence to 1022
This would mean that the sequence could ultimately return duplicate sequence ids for ids between 1002 and 1101.
The same could be achieved using the generate_series() function. The drawback is that the returned values are not guaranteed to be contiguous, since other threads can be calling nextval() concurrently on the same sequence, which means that I have to fetch and inspect every value in the generated series.
From the PG Documentation
To avoid blocking concurrent transactions that obtain numbers from the same sequence, a nextval operation is never rolled back; that is, once a value has been fetched it is considered used and will not be returned again. This is true even if the surrounding transaction later aborts, or if the calling query ends up not using the value. For example an INSERT with an ON CONFLICT clause will compute the to-be-inserted tuple, including doing any required nextval calls, before detecting any conflict that would cause it to follow the ON CONFLICT rule instead. Such cases will leave unused “holes” in the sequence of assigned values. Thus, PostgreSQL sequence objects cannot be used to obtain “gapless” sequences.
Likewise, any sequence state changes made by setval are not undone if the transaction rolls back.
I would assume the above states that it is thread safe.
According to the PostgreSQL documentation, it is:
This is done atomically: even if multiple sessions execute nextval concurrently, each will safely receive a distinct sequence value
Which means it's totally safe to call NEXTVAL concurrently; every single invocation will get its own unique value, guaranteed not to repeat.
According to an answer on the PostgreSQL mailing lists, SELECT setval(nextval() + N) isn't safe in a concurrent scenario. A safe and still reasonably performant alternative would be:
SELECT nextval('my_seq')
FROM generate_series(1, N)
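For example, to reserve a batch of 100 values and keep every id that comes back (a sketch; the sequence name and batch size are placeholders):
SELECT array_agg(id) AS reserved_ids
FROM (
  SELECT nextval('my_sequence') AS id
  FROM generate_series(1, 100)
) AS batch;
-- the ids are unique but not necessarily contiguous: other sessions
-- may call nextval() on the same sequence while the batch is generated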
Related
Consider this table:
create table entries (
    sequence_number integer
        default nextval('entries_sequence_number_seq')
        primary key,
    time timestamp default now()
);
This table is used as an append-only stream of changes. There may be other tables involved in writes, but as the last SQL statement in each transaction, we insert a row to this table. In other words, our transactions may be large and time-consuming, but eventually we write this row and commit immediately.
Now we want one or more consumers that can track changes as they are appended to this table:
Each consumer needs to loop at regular intervals to fetch the next batch of changes in roughly chronological order — in other words, the delta of new rows appended to entries since the last time the consumer polled.
The consumer always goes forward in time, never backwards.
Each consumer gets all the data. There's no need for selective distribution.
Order of consumption is not important. However, the consumer eventually must see all committed entries: If an in-flight transaction commits a new entry to the table, it must be picked up.
We’d like to minimize the possibility of ever seeing the same row twice, but we can tolerate it if it happens.
Conceptually:
select * from entries where sequence_number > :high_watermark
…where the high_watermark is the highest number seen by the consumer.
However, since nextval() is computed before commit time, you can get into a situation where there are gaps caused by in-flight transactions that haven’t committed yet. You may have a race condition happen like so:
Assume world starts at sequence number 0.
Writer A txn: Inserts, gets sequence number 1.
Writer B txn: Inserts, gets sequence number 2.
Writer B txn commits.
Newest sequence number is now 2.
The consumer does its select on > 0, finds the entry with sequence number 2, and sets 2 as its high_watermark.
Writer A txn commits.
The consumer does its select on > 2, and thus never sees the entry with sequence number 1.
The window for this race condition is probably very small in the general case, but it's still a possibility, and the probability of it occurring increases with the load on the system.
So far the best, but certainly not elegant, solution that comes to mind is to always select with time:
select * from entries
where sequence_number > :newest_sequence_number
or time >= :newest_timestamp
This should theoretically — modulo leap seconds and drifting clocks — guarantee that older entries are seen, at the expense of re-reading rows that appeared in the last batch. The consumer would then maintain a hopefully small set of already-seen entries that it can ignore. Leap seconds and drifting clocks could be accounted for by padding the timestamp with some unscientific number of seconds. The downside is that it will be constantly reading a bunch of redundant rows. And it just feels a bit clunky and hand-wavy.
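As a sketch, the padded poll could look like this (the 5-second margin is an arbitrary, unscientific choice):
select * from entries
where sequence_number > :newest_sequence_number
   or time >= :newest_timestamp - interval '5 seconds';
-- the consumer still has to drop rows it has already processed,
-- e.g. by checking them against its set of already-seen sequence numbers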
A slightly blunter, but more deterministic approach would be to maintain an unlogged table of pending events, and always delete from it as we read from it. This has two downsides: One is performance, obviously. The other is that since there may be any number of consumers, we would have to produce one event per consumer, which in turn means we have to identify the consumers by some sort of unique ID at event-emitting time, and of course garbage-collect unused events when a consumer no longer exists.
It strikes me that a better approach than an unlogged table would be to use LISTEN/NOTIFY, with the ID of the entry as a payload. This has the advantage of avoiding polling in the first place, although that's not a huge win, since the object of the consumer in this application is to wake up only now and then and reduce work on the system. On the other hand, the only major downside I can see is that there is a limit (albeit a large one) to the number of messages that can be in flight, and that transactions will begin to fail if a notification cannot happen. This might be a reasonable compromise, however.
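A minimal sketch of the LISTEN/NOTIFY variant, assuming PostgreSQL 11+ syntax and made-up trigger and channel names:
create or replace function notify_entry() returns trigger as $$
begin
    -- send the new entry's id as the notification payload
    perform pg_notify('entries_channel', new.sequence_number::text);
    return new;
end;
$$ language plpgsql;

create trigger entries_notify
after insert on entries
for each row execute function notify_entry();
-- each consumer runs LISTEN entries_channel; and reads notifications as they arrive
Since notifications are only delivered once the inserting transaction commits, the in-flight-transaction race described above does not apply here.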
At the same time, something in the back of my mind is telling me that there must be a mathematically more elegant way of doing this with even less work.
Your improved idea with WHERE time >= :newest_timestamp is subject to the same race condition, because there is no guarantee that the timestamps are in commit order. Processes go to sleep occasionally.
Add a boolean field consumed_n for each consumer which is initialized to FALSE. Consumer n then uses:
UPDATE entries
SET consumed_n = TRUE
WHERE NOT consumed_n
RETURNING sequence_number, time;
It helps to have a partial index ON entries ((1)) WHERE NOT consumed_n for each consumer.
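For consumer 1, the column and partial index might look like this (names are illustrative):
ALTER TABLE entries ADD COLUMN consumed_1 boolean NOT NULL DEFAULT FALSE;
CREATE INDEX entries_unconsumed_1 ON entries ((1)) WHERE NOT consumed_1;
-- only the WHERE clause matters here; indexing the constant (1) keeps the index entries tiny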
If that takes up too much storage for your taste, use one bit(n) field with a bit for each consumer.
The consumers will lock each other out as long as the transaction that issues these statements remains open. So keep it short for good concurrency.
I am trying to use Dennis' solution here as an implementation of auto_increment in an Oracle database. Say I create one sequence as follows:
CREATE SEQUENCE auto_increment
START WITH 1
INCREMENT BY 1;
If I want auto_increment behavior in multiple tables, can I just use this sequence for all tables? Or do I need a separate sequence per table? That is, will the sequence increment for one table be affected by another table using the sequence?
Yes, the sequence accesses will affect each other if you use the same sequence. However, the tone of your question makes me think that you expect the sequence to be continuous.
Don't be fooled: sequences are NOT gap-free. The only thing that you are guaranteed is that the numbers retrieved are unique and in ascending order (in your case).
You can use the same sequence for many tables. It would be unconventional to do so, it would lead to more contention on the sequence, and it would make life a bit more difficult if you needed to reset the sequence value as a result of, say, an export and import between environments but it would work.
Of course, if the sequence gave a value of 1 for table A, it would never give that same value to a trigger defined on B. Since sequences do not generate gap-free sets of values (i.e. you can guarantee that there will be "missing" values in every table no matter how many sequences you create) that shouldn't be a major downside.
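For illustration, this is roughly how the shared sequence would be wired to a table with a before-insert trigger (the table and column names are hypothetical):
CREATE OR REPLACE TRIGGER table_a_bi
BEFORE INSERT ON table_a
FOR EACH ROW
BEGIN
  -- Oracle 11g+; older versions need SELECT auto_increment.NEXTVAL INTO :NEW.id FROM dual;
  :NEW.id := auto_increment.NEXTVAL;
END;
A trigger on table_b using the same sequence would simply draw the next value, so the two tables share one ascending, gappy number stream.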
Sequences are sequential. However, there are many things that can cause gaps in the sequence, e.g. rollbacks and commits (because the sequence generator issues values irrespective of commits or rollbacks), and using the same sequence for multiple tables.
I have created a table in APEX that has a PK that is incremented by a SQL sequence:
CREATE SEQUENCE seq_increment
MINVALUE 1
START WITH 880
INCREMENT BY 1
CACHE 10
This seems to work perfectly. The issue is that sometimes, usually when I get on in the morning and run a process to enter a new row, it skips a bunch of numbers. I only care because these numbers are being used as the ID# of documents in my company and losing/skipping blocks of numbers is not going to be acceptable when this tool goes live.
It does seem to jump to the next '10' number, i.e. yesterday my last test assigned 883 and this morning it assigned 890 as the next number. Looking at my code for creation of the sequence, I notice that I have set it up to cache 10 values so that it will process quicker. Is it possible that this cache is getting dumped overnight and that it is pulling 890 because it had 880-889 in its cache and that cache was dumped?
Are there other potential causes and solutions?
Sequences will not and cannot generate gap-free values. So you'd expect that numbers will occasionally be skipped. That's perfectly normal when you're using sequences.
As you've surmised, the most likely scenario is that the sequence cache is aging out of the shared pool overnight when the APEX application isn't being used. You can reduce the frequency of gaps by declaring your sequence NOCACHE, but that will decrease performance and it will not eliminate gaps; it will just make them less frequent.
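If fewer gaps matter more to you than throughput, the cache can be switched off after the fact (this reduces, but does not eliminate, gaps):
ALTER SEQUENCE seq_increment NOCACHE;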
Oracle sequences are never guaranteed to be contiguous. If you need an absolutely contiguous set of values, you'll need to implement a custom solution.
Odds are that CACHE 10 is why you're losing numbers in this case. The cache value is how many sequence values are stored in memory for future use. Rebooting will clear the cache and cause 10 new values to be retrieved. Similarly, if the sequence is not used for long enough, the current set of values may be flushed out of the shared pool, also causing a new set of values to be retrieved.
This is clearly not the case in your instance, but sequence numbers can also be lost due to rollbacks. A rolled back transaction involving one or more sequences discards the sequence value(s).
Some sequence numbers have been aged out of one of the in-memory structures (the shared pool, I think?). This is expected behaviour for sequences. The only guarantee that you have is that they are unique. If you need to present gap-free numbers, you'll have to do this at reporting time, e.g. using the rownum pseudo-column. It is made this way deliberately; otherwise you would have to serialise all inserts, i.e. lock the table. And even that wouldn't work properly if an insert was rolled back!
I'm trying to better understand a nuance of SQL Server transactions.
Say I have a query that updates 1,000 existing rows, updating one of the columns to have the values 1 through 1,000. It's possible to execute this query and, when completed, those rows would not be numbered sequentially. This is because it's possible for another query to modify one of those rows before my query finishes.
On the other hand, if I wrap those updates in a transaction, that guarantees that if any one update fails, I can fail all updates. But does it also mean that those rows would be guaranteed to be sequential when I'm done?
In other words, are transactions always atomic?
But does it also mean that those rows would be guaranteed to be sequential when I'm done?
No. This has nothing to do with transactions, because what you're asking for simply doesn't exist: relational tables have no order, and asking for 'sequential rows' is the wrong question to ask. You can rephrase the question as 'will the 1000 updated rows contain the entire sequence from 1 to 1000, without gaps?' Most likely yes, but the truth of the matter is that there could be gaps depending on the way you do the updates. Those gaps would appear not because updated rows are modified after the update and before the commit, but because an update can be a no-op (not update any row), which is a common problem of read-modify-write-back updates (the row 'vanishes' between the read and the write-back due to concurrent operations).
To answer more precisely whether your code is correct or not, you would have to post the exact code you're doing the update with, as well as the exact table structure, including all indexes.
Atomic means the operation(s) within the transaction either all occur, or they don't.
If one of the 1,000 statements fails, none of the operations within the transaction will commit. If you instead split the statements into smaller transactions -- say 100 per transaction -- the batches before the one containing the error (say the error is at the 501st statement) can be committed, the batch containing the failing statement won't, and the later batches still can.
But does it also mean that those rows would be guaranteed to be sequential when I'm done?
You'll have to provide more context about what you're doing in a transaction for the result to be "sequential".
The two points are unrelated.
Sequential
If you insert values 1 to 1000, they will read back sequentially as long as you use a WHERE and an ORDER BY on that column to limit you to these 1000 rows. That breaks down if there are duplicates, so you'd need a unique constraint.
If you rely on an IDENTITY, it isn't guaranteed: Do Inserted Records Always Receive Contiguous Identity Values.
Atomicity
All transactions are atomic:
Is neccessary to encapsulate a single merge statement (with insert, delete and update) in a transaction?
SQL Server and connection loss in the middle of a transaction
Does it delete partially if execute a delete statement without transaction?
SQL transactions, like transactions on all database platforms, put the data in isolation to cover the entire ACID acronym (atomic, consistent, isolated and durable). So the answer is yes.
A transaction guarantees atomicity. That is the point.
Your problem is that after you do the insert, the rows are only "sequential" until the next thing comes along and touches one of the new records.
If another step in your process requires them to still be sequential, then that step, too, needs to be within your original transaction.
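For concreteness, here is one way the question's 1-to-1,000 numbering could be done inside a single transaction; the table, the columns, and the ROW_NUMBER() ordering are assumptions for illustration, not the asker's actual code:
BEGIN TRANSACTION;

WITH numbered AS (
    SELECT seq_col,
           ROW_NUMBER() OVER (ORDER BY id) AS rn   -- assigns 1..1000 in id order
    FROM dbo.my_table
)
UPDATE numbered
SET seq_col = rn;

-- any later step that relies on these values staying untouched
-- belongs inside this same transaction

COMMIT TRANSACTION;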
We're very frustratingly getting deadlocks in MySQL. It isn't because of exceeding a lock timeout as the deadlocks happen instantly when they do happen. Here's the SQL code that is executing on 2 separate threads (with 2 separate connections from the connection pool) that produces a deadlock:
UPDATE Sequences SET Counter = LAST_INSERT_ID(Counter + 1) WHERE Sequence IS NULL
The Sequences table has 2 columns: Sequence and Counter.
The LAST_INSERT_ID allows us to retrieve this updated counter value, as per MySQL's recommendation. That works perfectly for us, but we get these deadlocks! Why are we getting them and how can we avoid them?
Thanks so much for any help with this.
EDIT: this is all in a transaction (required since I'm using Hibernate) and AUTO_INCREMENT doesn't make sense here. I should've been more clear. The Sequences table holds many sequences (in our case about 100 million of them). I need to increment a counter and retrieve that value. AUTO_INCREMENT plays no role in all of this, this has nothing to do with Ids or PRIMARY KEYs.
Wrap your sql statements in a transaction. If you aren't using a transaction you will get a race condition on LAST_INSERT_ID.
But really, you should make the counter fields AUTO_INCREMENT, so you let MySQL handle this.
Your third solution is to use LOCK TABLES to lock the sequence table so no other process can access it concurrently. This is probably the slowest solution unless you are using InnoDB.
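A sketch of that table-locking variant (note that LOCK TABLES implicitly commits any open transaction, so it does not mix well with the Hibernate transaction mentioned in the question):
LOCK TABLES Sequences WRITE;
UPDATE Sequences SET Counter = LAST_INSERT_ID(Counter + 1) WHERE Sequence IS NULL;
SELECT LAST_INSERT_ID();   -- the counter value just allocated
UNLOCK TABLES;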
Deadlocks are a normal part of any transactional database, and can occur at any time. Generally, you are supposed to write your application code to handle them, as there is no surefire way to guarantee that you will never get a deadlock. That being said, there are situations that increase the likelihood of deadlocks occurring, such as the use of large transactions, and there are things you can do to mitigate their occurrence.
First thing, you should read this manual page to get a better understanding of how you can avoid them.
Second, if all you're doing is updating a counter, you should really, really, really be using an AUTO_INCREMENT column for Counter rather than relying on a "select then update" process, which as you have seen is a race condition that can produce deadlocks. Essentially, the AUTO_INCREMENT property of your table column will act as a counter for you.
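A sketch of that AUTO_INCREMENT-as-counter approach (the table name is made up):
CREATE TABLE ticket_counter (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
) ENGINE = InnoDB;

INSERT INTO ticket_counter () VALUES ();
SELECT LAST_INSERT_ID();   -- the freshly generated counter value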
Finally, I'm going to assume that you have that update statement inside a transaction, as this would produce frequent deadlocks. If you want to see it in action, try the experiment listed here. That's exactly what's happening with your code... two threads are attempting to update the same records at the same time before one of them is committed. Instant deadlock.
Your best solution is to figure out how to do it without a transaction, and AUTO_INCREMENT will let you do that.
No other SQL involved? Seems a bit unlikely to me.
The 'where sequence is null' probably causes a full table scan, causing read locks to be acquired on every row/page/...
This becomes a problem if your particular engine does not use MVCC and there was an INSERT that preceded your update within the same transaction. That INSERT would have acquired an exclusive lock on some resource (row/page/...), which causes the acquisition of a read lock by any other thread to wait. So two connections can each do their insert first, causing each of them to hold an exclusive lock on some small portion of the table, and then both try to do your update, requiring each of them to acquire a read lock on the entire table.
I managed to do this using a MyISAM table for the sequences.
I then have a function called getNextCounter that does the following:
performs a SELECT sequence_value FROM sequences WHERE sequence_name = 'test';
performs the update: UPDATE sequences SET sequence_value = LAST_INSERT_ID(last_retrieved_value + 1) WHERE sequence_name = 'test' AND sequence_value = last_retrieved_value;
repeats this pair in a loop until the UPDATE actually changes a row, then retrieves the last insert id (see the sketch below).
As it is a MyISAM table it won't be part of your transaction, so the operation won't cause any deadlocks.
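Spelled out, one iteration of that loop might look like this (the sequence name and the example value 41 are placeholders):
SELECT sequence_value FROM sequences WHERE sequence_name = 'test';
-- suppose this returned 41

UPDATE sequences
SET sequence_value = LAST_INSERT_ID(41 + 1)
WHERE sequence_name = 'test'
  AND sequence_value = 41;
-- if the UPDATE matched 0 rows, another client won the race: re-run the SELECT and try again

SELECT LAST_INSERT_ID();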