I want to implement concurrency control on some of the tables in my SQL Server database. To achieve row versioning, I am thinking of adding an integer column named RowVersion and incrementing its value whenever I update a row.
There is another option: using a timestamp column. With a timestamp column, SQL Server automatically generates a new unique value whenever the row is updated.
I want to know the advantages of these options. I think that adding an int column to store the row version is the more generic approach, while a timestamp column is SQL Server specific.
What is more, an integer row version is more human-readable than a timestamp value.
I would like to know any other advantages or disadvantages of choosing an integer column for the row version field.
Thanks.
If you let SQL Server do it for you, you get guaranteed atomicity for free. (Concurrent updates on the same table are guaranteed to produce different row versions.) If you roll your own row-versioning scheme, you have to do whatever it takes to guarantee atomicity yourself.
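For illustration, here is a minimal sketch of what the hand-rolled integer approach typically looks like, assuming a hypothetical table Foo with Id, Name and RowVersion columns, where @Id, @NewName and @OldRowVersion come from the values the client originally read. The rowversion/timestamp alternative needs none of this bookkeeping, because SQL Server assigns the new value itself:

-- Optimistic concurrency check with a manually maintained int column.
-- Table and column names are hypothetical.
UPDATE Foo
SET    Name       = @NewName,
       RowVersion = RowVersion + 1
WHERE  Id = @Id
  AND  RowVersion = @OldRowVersion;   -- only succeeds if nobody changed the row in between

IF @@ROWCOUNT = 0
    RAISERROR('Row was modified by another user.', 16, 1);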
Related
I googled around and found no answer. The question is whether, for an existing SQL table (assume any of H2, MySQL, or Postgres)...
Is there a way to get the last-update timestamp for a given table row? That is, without explicitly declaring a new column (altering the table) and/or adding triggers that update a timestamp column.
I'm using a JDBC driver, preparing statements, getting ResultSets and so forth. I need to be able to determine whether the data has changed recently or not, and for this a timestamp would help. If possible I want to avoid adding timestamp columns across all tables in the system.
There is no implicit, standard approach to this problem. The standard way is to have an explicit column and logic in a DB trigger or application function...
As mentioned, there are ways to do it through the logs, but it's hard and usually won't be very accurate. For example, in Postgres you can enable commit timestamps in postgresql.conf and check the last update time, but those values are approximate and are not kept for long...
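As a rough sketch of that Postgres option (table and column names are made up; it requires track_commit_timestamp = on in postgresql.conf plus a restart, only covers transactions committed after the setting was enabled, and the timestamps are eventually discarded):

-- Approximate last-change time per row, via the commit timestamp of xmin
-- (the transaction that last wrote the row). Table "orders" is hypothetical.
SELECT id,
       pg_xact_commit_timestamp(xmin) AS last_changed
FROM   orders
ORDER  BY last_changed DESC NULLS LAST;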
I need some help in knowing whether we have an option of sorting the rows of a table in Oracle based on the time of insertion.
For example, do we use any sort of function-based indexes?
I would like to perform this auto-sorting without having to declare any new column for recording the time.
Does the Oracle server keep track of that information, which I could use for sorting?
Thanks in advance
Oracle doesn't record the "time of insertion" for you, so you must add such a column yourself, set it to the current time (e.g. SYSTIMESTAMP) on every insert, and then ORDER BY that column when you query the table.
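A minimal sketch of that approach, with a hypothetical table t (the DEFAULT only stamps rows inserted after the column is added, so existing rows won't carry a meaningful value):

-- Add an insertion-time column and sort on it. Table name is hypothetical.
ALTER TABLE t ADD (inserted_at TIMESTAMP DEFAULT SYSTIMESTAMP);

SELECT *
FROM   t
ORDER  BY inserted_at;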
The general answer is no.
However...
The ORA_ROWSCN pseudocolumn returns the conservative upper-bound system change number (SCN) of the most recent change to the row. This pseudocolumn is useful for determining approximately when a row was last updated. It is not absolutely precise, because Oracle tracks SCNs by transaction committed for the block in which the row resides. You can obtain a more fine-grained approximation of the SCN by creating your tables with row-level dependency tracking.
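For example (row-level dependency tracking has to be requested when the table is created, and SCN_TO_TIMESTAMP only maps reasonably recent SCNs, so treat the result as approximate):

-- Hypothetical table created with row-level dependency tracking.
CREATE TABLE demo_rows (id NUMBER PRIMARY KEY, val VARCHAR2(100)) ROWDEPENDENCIES;

-- Approximate time of the last committed change to each row.
SELECT id,
       ORA_ROWSCN,
       SCN_TO_TIMESTAMP(ORA_ROWSCN) AS approx_last_change
FROM   demo_rows
ORDER  BY ORA_ROWSCN;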
There have been various similar questions, but they were either too specific to one DB or assumed unsorted data.
In my case, the SQL should be portable if possible. The index column in question is a clustered PK containing a timestamp.
The timestamp is, 99% of the time, larger than the previously inserted value. On rare occasions, however, it can be smaller or collide with an existing value.
I'm currently using this code to insert new values:
IF NOT EXISTS (SELECT * FROM Foo WHERE [Timestamp] = @ts)
BEGIN
    INSERT INTO Foo ([Timestamp]) VALUES (@ts);
END
ELSE
BEGIN
    INSERT INTO Foo ([Timestamp]) VALUES (
        (SELECT MAX(t1.[Timestamp]) - 1
         FROM Foo t1
         WHERE t1.[Timestamp] < @ts
           AND NOT EXISTS (SELECT * FROM Foo t2 WHERE t2.[Timestamp] = t1.[Timestamp] - 1))
    );
END;
If the value is not used yet, just insert it. Otherwise, find the closest free value that is smaller, using an EXISTS check.
I am a novice when it comes to databases, so I'm not sure whether there is a better way. I'm open to any ideas to make the code simpler and/or faster (around 100-1000 insertions per second), or to use a different approach altogether.
Edit: Thank you for your comments and answers so far.
To explain the nature of my case: the timestamp is the only value ever used to sort the data; minor inconsistencies can be neglected. There are no FK relationships.
However, I agree that my approach is flawed, and that its drawbacks outweigh the reasons for the presented idea in the first place. If I understand correctly, a simple way to fix the design is to have a regular, auto-incremented PK column in combination with the known (and renamed) timestamp column, which will be clustered.
From a performance POV, I don't see how this could be worse than the initial approach. It also simplifies the code a lot.
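A sketch of that fixed design (table and column names are placeholders; the clustered index on the time column is deliberately non-unique, so colliding values are simply allowed):

-- Surrogate key as the (nonclustered) primary key; the time column is clustered but not unique.
CREATE TABLE Foo
(
    Id        bigint IDENTITY(1,1) NOT NULL,
    EventTime datetime2 NOT NULL,            -- the renamed timestamp column
    CONSTRAINT PK_Foo PRIMARY KEY NONCLUSTERED (Id)
);

CREATE CLUSTERED INDEX IX_Foo_EventTime ON Foo (EventTime);

-- Inserts no longer need the EXISTS/fallback logic:
INSERT INTO Foo (EventTime) VALUES (@ts);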
This method is a prescription for disaster. In the first place, you will have race conditions, which will cause user annoyance when their inserts don't work. Even worse, if you are adding rows to another table using that value as the foreign key and the whole thing is not in one transaction, you may be adding child data to the wrong record.
Further, looking for the lowest unused value is a recipe for further data-integrity messes if you have not properly set up foreign key relationships and you delete a record without getting all of its child records. Now you have just joined records which don't belong with the new record.
This manual method is flawed and unreliable. All the major databases have a way to create an autogenerated value. Use that instead; the problems have been worked out and tested.
Timestamp, BTW, is a SQL Server reserved word and should never be used as a field name.
If you can't guarantee that your PK values are unique, then it's not a good PK candidate. Especially if it's a timestamp - I'm sure Goldman Sachs would love it if their high-frequency trading programs could cause collisions on an insert and get inserted 1 microsecond earlier because the system fiddles with the timestamp of their trade.
Since you can't guarantee uniqueness of the timestamps, a better choice would be to use a plain-jane auto-increment int/bigint column, which takes care of the collision problem, gives you a nice method of getting insertion order, and you can still sort on the timestamp field to get a nice straight timeline if need be.
One idea would be to add a surrogate identity/autonumber/sequence key, so the primary key becomes (timestamp, newkey).
This way, you preserve row order and uniqueness without extra code.
To run the code above, you'd need to fiddle with lock granularity and concurrency hints, or use TRY/CATCH to retry with the alternate value (SQL Server specific, which removes portability). And under heavy load you might have to keep retrying, because the alternate value may already exist.
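A sketch of that composite-key idea, with made-up names (the identity column breaks ties, so duplicate timestamp values are no longer a problem and no retry logic is needed):

-- Composite clustered primary key: time column first, surrogate key as tie-breaker.
CREATE TABLE Foo
(
    EventTime datetime2 NOT NULL,
    Id        bigint IDENTITY(1,1) NOT NULL,
    CONSTRAINT PK_Foo PRIMARY KEY CLUSTERED (EventTime, Id)
);

INSERT INTO Foo (EventTime) VALUES (@ts);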
A timestamp as a key? Really? Every time a row is updated, its timestamp is modified. The SQL Server timestamp data type is intended for versioning rows. It is not the same as the ANSI/ISO SQL timestamp, which is the equivalent of SQL Server's datetime data type.
As far as "sorting" on a timestamp column goes: the only thing that is guaranteed with a timestamp is that every time a row is inserted or updated it gets a new timestamp value, and that value is a unique 8-octet binary value, different from any value previously assigned to the row. There is no guarantee that the value has any correlation to the system clock.
I'm converting data from one schema to another. Each table in the source schema has a 'status' column (default NULL). When a record has been converted, I update the status column to 1. Afterwards, I can report on the # of records that are (not) converted.
While the conversion routines are still under development, I'd like to be able to quickly reset all values for status to NULL again.
An UPDATE statement on the tables is too slow (there are too many records). Does anyone know a fast alternative way to accomplish this?
The fastest way to reset a column would be to SET UNUSED the column, then add a column with the same name and datatype.
This will be the fastest way, since neither operation touches the actual table data (only a dictionary update).
As in Nivas' answer, the actual ordering of the columns will change (the reset column becomes the last column). If your code relies on the ordering of the columns (it should not!), you can create a view that has the columns in the right order (rename the table, create a view with the same name as the old table, revoke grants from the base table, add grants to the view).
The SET UNUSED method will not reclaim the space used by the column (whereas dropping the column will free space in each block).
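A sketch of the dictionary-only reset, with a made-up table name (the unused column's storage is only reclaimed later, if at all):

-- Drop the column logically (dictionary update only), then re-add it empty.
ALTER TABLE source_table SET UNUSED (status);
ALTER TABLE source_table ADD (status NUMBER(1));

-- Later, to actually reclaim the space:
-- ALTER TABLE source_table DROP UNUSED COLUMNS;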
If the column is nullable (since default is NULL, I think this is the case), drop and add the column again?
While the conversion routines are still under development, I'd like to be able to quickly reset all values for status to NULL again.
If you are in development why do you need 70 million records? Why not develop against a subset of the data?
Have you tried using flashback table?
For example:
select current_scn from v$database;
-- 5607722
-- do a bunch of work
flashback table TABLE_NAME to scn 5607722;
What this does is ensure that the table you are working on is IDENTICAL each time you run your tests. Of course, you need to ensure you have sufficient UNDO to hold your changes.
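Note that FLASHBACK TABLE also needs row movement enabled on the table beforehand, for example:

ALTER TABLE TABLE_NAME ENABLE ROW MOVEMENT;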
Hm. Maybe add an index to the status column.
Or, alternately, add a new table with only the primary key in it. Then insert into that table when the record is converted, and TRUNCATE that table to reset...
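A sketch of that tracking-table idea, with hypothetical names (the big table is never updated, so resetting is just a truncate; :source_id stands for the converted record's key):

-- Small side table keyed by the source table's primary key.
CREATE TABLE converted_records (record_id NUMBER PRIMARY KEY);

-- Mark a record as converted:
INSERT INTO converted_records (record_id) VALUES (:source_id);

-- Reset everything for the next development run:
TRUNCATE TABLE converted_records;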
I like some of the other answers, but I just read in a tuning book that for several reasons it's often quicker to recreate the table than to do massive updates on the table. In this case, it seems ideal, since you would be writing the CREATE TABLE X AS SELECT with hopefully very few columns.
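For example, roughly (Oracle-style syntax, hypothetical names; constraints, indexes and grants would have to be recreated afterwards):

-- Rebuild the table with status reset to NULL instead of updating millions of rows.
CREATE TABLE source_table_new AS
SELECT col1, col2, CAST(NULL AS NUMBER(1)) AS status
FROM   source_table;

DROP TABLE source_table;
RENAME source_table_new TO source_table;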
I have a simple SSIS package where I import data from a flat file into a SQL Server table (SQL Server 2005). The file contains 70k rows and the table has no primary key. The import is successful, but when I open the SQL Server table, the order of the rows is different from that of the file. After observing closely, I see that the data in the table is sorted by default by the first column. Why is this happening, and how can I avoid the default sort?
Thanks.
You cannot rely on ordering unless you specify order by in your SQL query. SQL is a relational algebra that works with sets. Those sets are unordered. Database tables do not have an intrinsic ordering.
It may well be that the sets are ordered due to the way the data is retrieved from the tables. This may be based on primary key, order of insertion, clustered key, seemingly random order based on the execution plan of the query or the actual data in the table or even the phase of the moon.
Bottom line, if you want a specific order, use order by. If you don't want a specific order, the DBMS is free to deliver your rows in any order, including one based on the first column.
If you really want them sorted according to their position in the import file, you should add another column to the table to store an increasing number based on each row's position in that file, then ORDER BY that column. But that's a pretty arbitrary sort order; you're generally better off choosing one that makes more sense for the data (transaction ID, date/time, customer number, or whatever else you have).
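A sketch of that, with made-up names (it assumes the SSIS data flow inserts the rows sequentially in file order; the IDENTITY value then records arrival order, which stands in for the file position):

-- Target table with a column that records insertion order.
CREATE TABLE ImportedData
(
    FileRowNumber int IDENTITY(1,1) NOT NULL,
    Col1          varchar(100) NULL,
    Col2          varchar(100) NULL
);

-- After the SSIS load, read the rows back in file order:
SELECT Col1, Col2
FROM   ImportedData
ORDER  BY FileRowNumber;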
If you want to avoid the default sort (however variable that may be), use a specific sort.
In general, no order is applied if there is no ordering in the SELECT query.
What I have noticed is that the results might come back in the order of the primary key, but this is not guaranteed either.
So, all in all, if you do not specify an ordering, no ordering can be assumed.