SQL: Uniqueness constraints and cycling rows

I am writing a library that (with the help of an ORM) will allow people to store rows that contain a position column, and move rows from one position to another at a later time. The idea is thus to allow users to re-order the stored pieces of data.
In a simple implementation, it would be enough to just make this an integer column and to perform two UPDATE statements: one moving the desired row from its old position to its new position, and a second one shifting all rows in between one position up or down (depending on the direction of the movement).
However, this goes wrong when multiple of these query pairs run at the same time, even if they are wrapped in TRANSACTIONs, because the transaction isolation level is usually not set to 'Serializable'; in those cases, two rows will end up with the same position value.
One idea to solve this was to use a UNIQUE constraint, but that would mean both of the above steps need to happen in a single UPDATE query. This is possible, using a modulo operation (something like position = (((position - beginning) + 1) % length + beginning)).
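For illustration, a rough sketch of such a single-statement rotation, assuming a hypothetical items table with an integer position column, a window starting at position 3 with length 5 (positions 3 through 7), and the row at position 7 being moved to position 3:

UPDATE items
SET position = ((position - 3 + 1) % 5) + 3   -- (((position - beginning) + 1) % length) + beginning
WHERE position BETWEEN 3 AND 7;
-- the row at 7 wraps around to 3, and the rows at 3..6 shift to 4..7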
However, it turns out that neither Postgres, MySQL nor SQLite is able to figure out that the constraint would still hold after the UPDATE statement. Instead (for efficiency, or for simplicity of implementation?), they report an error as soon as the first row is incremented to an already-existing position.
One 'solution' that someone mentioned to me was to make the constraint DEFERRED, which means it would only be checked at the end of the transaction. However, I don't believe this is widely supported across SQL databases (and I'd like the library to work in a database-agnostic way).
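For reference, in PostgreSQL that suggestion would look roughly like this (hypothetical table and constraint names); the check is then postponed until COMMIT:

ALTER TABLE items
  ADD CONSTRAINT items_position_key UNIQUE (position) DEFERRABLE INITIALLY DEFERRED;
-- inside a transaction, the two-statement move may now create temporary
-- duplicates; the uniqueness check only runs at COMMIT time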
Are there other options to ensure database integrity for this problem?

Related

SQL Server - resume cursor from last updated row

I need to do a cursor update of a table (millions of rows). The script should resume from the last updated row if it is started again (e.g. in case of a server restart).
What is the best way to resolve this? Create a new table with the last saved id? Use the table's extended properties to save this info?
I would add an "UpdateDate" or "LastProcessDate" or some similarly named datetime column to your table and use this. When running your update, simply process any number of records that aren't the max UpdateDate or are null:
where UpdateDate < (select max(UpdateDate) from MyTable) or UpdateDate is null
It's probably a good idea to grab the max UpdateDate (@maxUpdateDate?) at the beginning of your process/loop so it does not change during a batch, and similarly to get a new UpdateDate (@newUpdateDate?) at the beginning of your process to apply to each row as you go. A UTC date will work best to avoid DST time changes.
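A rough T-SQL sketch of that approach; MyTable and UpdateDate come from above, while the batch size, SomeColumn and the variable names are purely illustrative:

DECLARE @maxUpdateDate datetime2 = (SELECT MAX(UpdateDate) FROM MyTable);
DECLARE @newUpdateDate datetime2 = SYSUTCDATETIME();   -- UTC, as noted above

WHILE 1 = 1
BEGIN
    UPDATE TOP (10000) MyTable
    SET SomeColumn = SomeColumn,               -- placeholder for the real update logic
        UpdateDate = @newUpdateDate
    WHERE UpdateDate < @maxUpdateDate OR UpdateDate IS NULL;

    IF @@ROWCOUNT = 0 BREAK;                   -- everything processed; safe to restart at any time
END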
This data would now be a real attribute of your entity, not metadata or a temporary placeholder, and this would seem to be the best way to be transactionally consistent and otherwise fully ACID. It would also be more self-documenting than other methods, and can be indexed should the need arise. A date can also hold important temporal information about your data, whereas IDs and flags do not.
Doing it in this way would make storing data in other tables or extended properties redundant.
Some other thoughts:
- Don't use a temp table, which can disappear in many of the scenarios where you haven't processed all rows (connection loss, server restart, etc.).
- Don't use an identity or other ID that can have gaps filled, be reseeded, be truncated back to 0, etc.
- The idea of having a max value stored in another table (essentially rolling your own sequence object) has generally been frowned upon and shown to be a dubious practice in SQL Server from what I've read, though I'm oddly having trouble locating a good article right now.
- If at all possible, avoid cursors in favor of batches, and generally avoid batches in favor of full set-based updates.
- sp_updateextendedproperty does seem to behave correctly with a rollback, though I'm not sure how locking works with that -- just FYI if you ultimately decide to go down that path.

DB2 Optimistic Concurrency - RID, Row Version, Timestamp Clarifications

I'm currently reading over implementing optimistic concurrency checks in DB2. I've been mainly reading http://www.ibm.com/developerworks/data/library/techarticle/dm-0801schuetz/ and http://pic.dhe.ibm.com/infocenter/db2luw/v10r5/index.jsp?topic=%2Fcom.ibm.db2.luw.admin.dbobj.doc%2Fdoc%2Fc0051496.html (as well as some other IBM docs).
Is RID necessary when you already have an ID column? In the two links they always mention using RID and the row change token, but RID is just a row ID, so I'm not clear why I need it when the row change token seems similar to SQL Server's rowversion (except that it applies to the page rather than the row).
It seems that as long as I have a row-change-timestamp column, my row change token granularity will be good enough to prevent most false positives.
Thanks.
The way I read the first article is that you can use any of those features; you don't need to use all of them. In particular, it appears that the row-change-timestamp is derived from RID() and the ROW CHANGE TOKEN:
Time-based update detection:
This feature is added to SQL using the RID_BIT() and ROW CHANGE TOKEN. To support this feature, the table needs to have a new generated column defined to store the timestamp values. This can be added to existing tables using the ALTER TABLE statement, or the column can be defined when creating a new table. The column's existence also affects the behavior of optimistic locking in that the column is used to improve the granularity of the ROW CHANGE TOKEN from page level to row level, which could greatly benefit optimistic locking applications.
... among other things, the timestamp actually increases the granularity compared to the ROW CHANGE TOKEN, so it makes it easier to deal with updates.
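As a small sketch of what adding and using such a column looks like (hypothetical ORDERS table and column names, DB2 LUW syntax):

-- add a row change timestamp column to get row-level granularity
ALTER TABLE orders
  ADD COLUMN changed TIMESTAMP NOT NULL
      GENERATED ALWAYS FOR EACH ROW ON UPDATE AS ROW CHANGE TIMESTAMP;

-- optimistic update: only succeeds if the row hasn't changed since it was read
UPDATE orders
   SET qty = 5
 WHERE order_id = 42
   AND changed = ?   -- the timestamp value fetched with the original SELECT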
For a number of reasons, please make sure to set the db time to UTC, as DB2 doesn't track timezone (so if you're somewhere that uses DST, the same timestamp can happen twice).
(As a side note, RID() isn't stable on all platforms. On the iSeries version, at least, it changes if somebody re-orgs the table, and you may not always get the results you expect when using it with joins. I'm also not sure about use with mirroring...)
Are you aware that if you update multiple rows in the same SQL statement execution, they will get the same timestamp (if the timestamp is updated in that statement)?
This means that a timestamp column is probably a bad choice for a unique row identifier.

Postgresql wrong auto-increment for serial

I have a problem with PostgreSQL; I think either there is a bug in PostgreSQL or I have implemented something wrongly.
There is a table with column1 (primary key), column2 (unique), column3, ...
After inserting a row, if I try another insertion with an existing column2 value I get a duplicate value error, as I expected. But after this unsuccessful try, column1's next value is still incremented by 1 even though nothing was inserted, so I am getting rows with id sequences like 1, 2, 4, 6, 9 (3, 5, 6, 7, 8 went to the unsuccessful attempts).
I would appreciate it if someone could explain this weird behaviour.
This information may be useful: I used the query "create unique index on tableName (lower(column1))" to set the unique constraint.
See the PostgreSQL sequence FAQ:
Sequences are intended for generating unique identifiers — not necessarily identifiers that are strictly sequential. If two concurrent database clients both attempt to get a value from a sequence (using nextval()), each client will get a different sequence value. If one of those clients subsequently aborts their transaction, the sequence value that was generated for that client will be unused, creating a gap in the sequence.
This can't easily be fixed without incurring a significant performance penalty. For more information, see Elein Mustain's "Gapless Sequences for Primary Keys" in the General Bits Newsletter.
From the manual:
Important: Because sequences are non-transactional, changes made by setval are not undone if the transaction rolls back.
In other words, it's normal to have gaps. If you don't want gaps, don't use a sequence.
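You can see the behaviour directly with a minimal example (hypothetical table and values; the failed insert still consumes a sequence value):

CREATE TABLE t (id serial PRIMARY KEY, email text UNIQUE);
INSERT INTO t (email) VALUES ('a@example.com');  -- gets id 1
INSERT INTO t (email) VALUES ('a@example.com');  -- fails with a duplicate key error,
                                                 -- but nextval() was already called
INSERT INTO t (email) VALUES ('b@example.com');  -- gets id 3, leaving a gap at 2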

Some sort of “different auto-increment indexes” per a primary key values

I have got a table which has an id (primary key with auto increment), a uid (a key referring to a user's id, for example) and something else which doesn't matter for my question.
I want to make, let's call it, different auto-increment keys on id for each uid entry.
So, I will add an entry with uid 10, and the id field for this entry will be 1 because there were no previous entries with a value of 10 in uid. Then I will add a new one with uid 4 and its id will be 3 because there were already two entries with uid 4.
...A very obvious explanation, but I am trying to be as explicit and clear as I can to demonstrate the idea... clearly.
What SQL engine can provide such a functionality natively? (non Microsoft/Oracle based)
If there is none, how could I best replicate it? Triggers perhaps?
Does this functionality have a more suitable name?
In case you know about a non-SQL database engine providing such functionality, name it anyway; I am curious.
Thanks.
MySQL's MyISAM engine can do this. See their manual, in section Using AUTO_INCREMENT:
For MyISAM tables you can specify AUTO_INCREMENT on a secondary column in a multiple-column index. In this case, the generated value for the AUTO_INCREMENT column is calculated as MAX(auto_increment_column) + 1 WHERE prefix=given-prefix. This is useful when you want to put data into ordered groups.
The docs go on after that paragraph, showing an example.
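Applied to the uid/id layout in the question, the table definition might look roughly like this (hypothetical column names, modeled on the manual's example):

CREATE TABLE entries (
  uid  INT NOT NULL,
  id   INT NOT NULL AUTO_INCREMENT,
  data VARCHAR(100),
  PRIMARY KEY (uid, id)           -- AUTO_INCREMENT on the second column of the key
) ENGINE=MyISAM;
-- each INSERT gets id = MAX(id) + 1 within its own uid group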
The InnoDB engine in MySQL does not support this feature, which is unfortunate because it's better to use InnoDB in almost all cases.
You can't emulate this behavior using triggers (or any SQL statements limited to transaction scope) without locking tables on INSERT. Consider this sequence of actions:
Mario starts transaction and inserts a new row for user 4.
Bill starts transaction and inserts a new row for user 4.
Mario's session fires a trigger to compute MAX(id)+1 for user 4. He gets 3.
Bill's session fires a trigger to compute MAX(id)+1 for user 4. He also gets 3.
Bill's session finishes his INSERT and commits.
Mario's session tries to finish his INSERT, but the row with (userid=4, id=3) now exists, so Mario gets a primary key conflict.
In general, you can't control the order of execution of these steps without some kind of synchronization.
The solutions to this are either:
Get an exclusive table lock. Before trying an INSERT, lock the table. This is necessary to prevent concurrent INSERTs from creating a race condition like in the example above. It's necessary to lock the whole table: since you're trying to restrict INSERT, there's no specific row to lock (if you were trying to govern access to a given row with UPDATE, you could lock just that row). But locking the table causes access to the table to become serial, which limits your throughput. A sketch of this approach is shown below the second option.
Do it outside transaction scope. Generate the id number in a way that won't be hidden from two concurrent transactions. By the way, this is what AUTO_INCREMENT does. Two concurrent sessions will each get a unique id value, regardless of their order of execution or order of commit. But tracking the last generated id per userid requires access to the database, or a duplicate data store. For example, a memcached key per userid, which can be incremented atomically.
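A sketch of option 1 in MySQL syntax (hypothetical table and column names; the table lock serializes concurrent inserts):

LOCK TABLES entries WRITE;
SET @next_id = (SELECT COALESCE(MAX(id), 0) + 1 FROM entries WHERE uid = 4);
INSERT INTO entries (uid, id, data) VALUES (4, @next_id, 'some data');
UNLOCK TABLES;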
It's relatively easy to ensure that inserts get unique values. But it's hard to ensure they will get consecutive ordinal values. Also consider:
What happens if you INSERT in a transaction but then roll back? You've allocated id value 3 in that transaction, and then I allocated value 4, so if you roll back and I commit, now there's a gap.
What happens if an INSERT fails because of other constraints on the table (e.g. another column is NOT NULL)? You could get gaps this way too.
If you ever DELETE a row, do you need to renumber all the following rows for the same userid? What does that do to your memcached entries if you use that solution?
SQL Server should allow you to do this. If you can't implement this using a computed column (probably not - there are some restrictions), surely you can implement it in a trigger.
MySQL also would allow you to implement this via triggers.
In a comment you asked about efficiency. Unless you are dealing with extreme volumes, storing an 8-byte DATETIME isn't much of an overhead compared to using, for example, a 4-byte INT.
It also massively simplifies your data inserts, as well as being able to cope with records being deleted without creating 'holes' in your sequence.
If you DO need this, be careful with the field names. If you have uid and id in a table, I'd expect id to be unique in that table, and uid to refer to something else. Perhaps, instead, use the field names property_id and amendment_id.
In terms of implementation, there are generally two options.
1). A trigger
Implementations vary, but the logic remains the same. As you don't specify an RDBMS (other than NOT MS/Oracle) the general logic is simple...
- Start a transaction (often one is implicitly already started inside triggers)
- Find the MAX(amendment_id) for the property_id being inserted
- Update the newly inserted value with MAX(amendment_id) + 1
- Commit the transaction
A rough sketch of this logic follows after the caveats below.
Things to be aware of are...
- multiple records being inserted at the same time
- records being inserted with amendment_id being already populated
- updates altering existing records
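Here is what such a trigger might look like in MySQL, with hypothetical table and column names, and setting NEW.amendment_id in a BEFORE INSERT trigger rather than updating the row afterwards. As the other answer points out, this is still subject to race conditions under concurrent inserts unless the table is locked:

DELIMITER //
CREATE TRIGGER amendments_before_insert
BEFORE INSERT ON amendments
FOR EACH ROW
BEGIN
  -- only fill amendment_id if the caller didn't supply one
  IF NEW.amendment_id IS NULL THEN
    SET NEW.amendment_id = (SELECT COALESCE(MAX(amendment_id), 0) + 1
                            FROM amendments
                            WHERE property_id = NEW.property_id);
  END IF;
END//
DELIMITER ;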
2). A Stored Procedure
If you use a stored procedure to control writes to the table, you gain a lot more control.
- Implicitly, you know you're only dealing with one record.
- You simply don't provide a parameter for DEFAULT fields.
- You know what updates / deletes can and can't happen.
- You can implement all the business logic you like without hidden triggers.
I personally recommend the Stored Procedure route, but triggers do work.
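A minimal sketch of the stored procedure route, again in MySQL with hypothetical table, column and procedure names; the SELECT ... FOR UPDATE serializes concurrent callers for the same property_id (assuming InnoDB):

DELIMITER //
CREATE PROCEDURE add_amendment(IN p_property_id INT, IN p_data VARCHAR(100))
BEGIN
  DECLARE next_id INT;
  START TRANSACTION;
    SELECT COALESCE(MAX(amendment_id), 0) + 1 INTO next_id
      FROM amendments
     WHERE property_id = p_property_id
       FOR UPDATE;                   -- lock the existing rows for this property_id
    INSERT INTO amendments (property_id, amendment_id, data)
    VALUES (p_property_id, next_id, p_data);
  COMMIT;
END//
DELIMITER ;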
It is important to get your data types right.
What you are describing is a multi-part key. So use a multi-part key. Don't try to encode everything into a magic integer; you will poison the rest of your code.
If a record is identified by (entity_id,version_number) then embrace that description and use it directly instead of mangling the meaning of your keys. You will have to write queries which constrain the version number but that's OK. Databases are good at this sort of thing.
version_number could be a timestamp, as a_horse_with_no_name suggests. This is quite a good idea. There is no meaningful performance disadvantage to using timestamps instead of plain integers. What you gain is meaning, which is more important.
You could maintain a "latest version" table which contains, for each entity_id, only the record with the most-recent version_number. This will be more work for you, so only do it if you really need the performance.

Is there a best-practice method to swap primary key values using SQLite?

As a bit of background, I'm working with a SQLite database that is being consumed by a closed-source UI that doesn't order the results by the handy timestamp column (gee, thanks Nokia!) - it just uses the default ordering, which corresponds to the primary key, a vanilla auto-incrementing 'id' column.
I can easily produce a map of the current and desired id values, but applying the mapping is my current problem. It seems I cannot simply swap the values, as an UPDATE processes rows one at a time, which would temporarily result in a duplicate value. I've tried a single UPDATE statement with CASE clauses and a temporary out-of-sequence value, but as each row is only processed once this obviously doesn't work. Thus I've reached the point of needing three UPDATE statements to swap a pair of values, which is far from ideal as I want this to scale well.
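For concreteness, the three-statement swap described above looks roughly like this (hypothetical table name and id values, using a value outside the normal id range as a placeholder):

UPDATE mytable SET id = -1 WHERE id = 10;   -- park one row on an unused value
UPDATE mytable SET id = 10 WHERE id = 20;
UPDATE mytable SET id = 20 WHERE id = -1;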
Compounding this, there are a number of triggers set up, which makes adding/deleting rows via a new table a complex problem unless I can disable them for the duration of any additions/deletions resulting from table duplication and deletion, which is why I haven't pursued that avenue yet.
I'm thinking my next line of enquiry will be a new column holding the new ids, then finding a way to move the primary index to it before removing the old column, but I'm throwing this out there in case anyone can offer up a better solution that will save me some time :)