SQL Server rowversion without adding a column

Typically to expose version data you'd have to add a column of type rowversion, but this operation would take quite a while on a large table. I did it anyway in a dev sandbox environment, and indeed it took a while, but I also noticed that the column was populated with some meaningful-looking initial value. I expected it to be all 0's or 1's to indicate that each row is in some sort of "initial" state (after all, there was no history before this), but what I saw were what looked like accurate values for each row (they were all different, non-default-looking values).
Where did they come from? It seems like the rowversion is being tracked behind the scenes anyway, regardless of whether you've exposed it in a column. If so, can I get at it directly without adding the column? Like maybe some kind of system function I can call directly? I really want to avoid downtime, and I also have a huge number of existing queries so migration to a different table/view/combo is not an option (as suggested in other related questions).

The rowversion value is generated when a table with a rowversion (a.k.a. timestamp) column is modified. The rowversion counter is database-scoped, and the last generated value can be retrieved via @@DBTS.
Since the value is incremented only when a table with a rowversion column is modified, I don't think you'll be able to use @@DBTS to avoid the downtime.
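A quick way to see this behavior, assuming a hypothetical dbo.MyTable (names are placeholders, not from the question):

SELECT @@DBTS;                                  -- last rowversion value generated in this database

ALTER TABLE dbo.MyTable ADD RowVer rowversion;  -- the long-running change the question wants to avoid

UPDATE dbo.MyTable SET SomeCol = SomeCol WHERE Id = 1;

SELECT @@DBTS;                                  -- advanced, because a table with a rowversion column was modified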

Related

SQL Server - resume cursor from last updated row

I need to do a cursor update of a table (millions of rows). The script should resume from the last updated row if it is started again (e.g. in case of a server restart).
What is the best way to resolve this? Create a new table with the last saved id? Use the table's extended properties to save this info?
I would add an "UpdateDate" or "LastProcessDate" or some similarly named datetime column to your table and use this. When running your update, simply process any number of records whose UpdateDate is not the max UpdateDate or is null:
where UpdateDate < (select max(UpdateDate) from MyTable) or UpdateDate is null
It's probably a good idea to grab the max UpdateDate (@maxUpdateDate?) at the beginning of your process/loop so it does not change during a batch, and similarly get a new UpdateDate (@newUpdateDate?) at the beginning of your process to update each row as you go. A UTC date will work best to avoid DST time changes.
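A rough sketch of that resumable batch loop, with placeholder table/column names (UPDATE TOP requires SQL Server 2005 or later):

DECLARE @maxUpdateDate datetime, @newUpdateDate datetime;
SELECT @maxUpdateDate = MAX(UpdateDate) FROM dbo.MyTable;
SET @newUpdateDate = GETUTCDATE();

WHILE 1 = 1
BEGIN
    -- Process the next batch of rows that have not yet been touched in this run
    UPDATE TOP (10000) dbo.MyTable
    SET SomeColumn = UPPER(SomeColumn),      -- placeholder for the real per-row work
        UpdateDate = @newUpdateDate
    WHERE UpdateDate < @maxUpdateDate OR UpdateDate IS NULL;

    IF @@ROWCOUNT = 0 BREAK;                 -- nothing left to do; safe to rerun after a restart
END;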
This data would now be a real attribute of your entity, not metadata or a temporary placeholder, and this would seem to be the best way to be transactionally consistent and otherwise fully ACID. It would also be more self-documenting than other methods, and can be indexed should the need arise. A date can also hold important temporal information about your data, whereas IDs and flags do not.
Doing it in this way would make storing data in other tables or extended properties redundant.
Some other thoughts:
Don't use a temp table that can disappear in many of the scenarios where you haven't processed all rows (connection loss, server restart, etc.).
Don't use an identity or other ID that can have gaps filled, be reseeded, truncated back to 0, etc.
The idea of having a max value stored in another table (essentially rolling your own sequence object) has generally been frowned upon and shown to be a dubious practice in SQL Server from what I've read, though I'm oddly having trouble locating a good article right now.
If at all possible, avoid cursors in favor of batches, and generally avoid batches in favor of full set-based updates.
sp_updateextendedproperty does seem to behave correctly with a rollback, though I'm not sure how locking works with that -- just FYI if you ultimately decide to go down that path.
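If you do go the extended-property route, the checkpoint could be stored roughly like this (property, schema, and table names are just examples):

-- Create the checkpoint once...
EXEC sys.sp_addextendedproperty
     @name = N'LastProcessedId', @value = 0,
     @level0type = N'SCHEMA', @level0name = N'dbo',
     @level1type = N'TABLE',  @level1name = N'MyTable';

-- ...then update it inside the same transaction as each batch
EXEC sys.sp_updateextendedproperty
     @name = N'LastProcessedId', @value = 12345,
     @level0type = N'SCHEMA', @level0name = N'dbo',
     @level1type = N'TABLE',  @level1name = N'MyTable';

-- Read it back with:
SELECT value
FROM sys.extended_properties
WHERE major_id = OBJECT_ID('dbo.MyTable') AND minor_id = 0 AND name = N'LastProcessedId';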

Increasing column length in a large table

Our production server is SQL Server 2005, and we have a very large table of 103 million records. We want to increase the length of one particular field from varchar(20) to varchar(30). Although I said it's just a metadata change, since it's an increase in the column length, my manager says he doesn't want to alter such a huge table. Please advise on the best option. I am thinking of creating a new column and updating it with the old column's values.
I looked at many blogs; some say the alter will have an impact and some say it will not.
As you said, it is a metadata-only operation and this is the way to go. Prove to your manager (and to yourself!) through testing that you are right.
You should test any advice first, unless it comes from a SQL Server MVP who might actually know the details of what happens.
However, changing the varchar length from 20 to 30 does not affect the layout of any existing data in the table. That is, the on-disk layout is exactly the same for both lengths. That means that the data does not have to change when you alter the table.
This offers optimism that the change would be "easy".
The data page does contain some information about types -- at least the length of the type in the record. I don't know if this includes the maximum length of a character type. It is possible that the data pages would need to be changed.
This is a bit of pessimism.
Almost any other change will require changes to every record and/or data page. For instance, changing from int to bigint is moving from a 4-byte field to an 8-byte field. All the records are affected by this change in data layout. Big change.
Changing from varchar() to either nvarchar() or char() would have the same impact.
On the other hand, changing a field from being NULLABLE to NOT NULLABLE (or vice versa) would not affect the record storage on each page. But that information is stored on the page in the NULL bitmap, so all the pages would need to be updated.
So, there is some possibility that the change would not cause any data to be rewritten. But test on a smaller table to see what happens.
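For example, one way to convince yourself (and your manager) on a copy of the data; all object names here are placeholders:

-- Build a reasonably large test copy
SELECT TOP 1000000 * INTO dbo.BigTable_Test FROM dbo.BigTable;

-- Time the change; a metadata-only operation should complete almost instantly
SET STATISTICS TIME ON;
ALTER TABLE dbo.BigTable_Test ALTER COLUMN SomeField varchar(30) NOT NULL;  -- match the column's existing NULLability (changing it would force a data check)
SET STATISTICS TIME OFF;

DROP TABLE dbo.BigTable_Test;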

DB2 Optimistic Concurrency - RID, Row Version, Timestamp Clarifications

I'm currently reading over implementing optimistic concurrency checks in DB2. I've been mainly reading http://www.ibm.com/developerworks/data/library/techarticle/dm-0801schuetz/ and http://pic.dhe.ibm.com/infocenter/db2luw/v10r5/index.jsp?topic=%2Fcom.ibm.db2.luw.admin.dbobj.doc%2Fdoc%2Fc0051496.html (as well as some other IBM docs).
Is RID necessary when you have an ID column already? In the two links they always mention using RID and the row change token; however, RID is just the row ID, so I'm not clear why I need it when the row change token seems like SQL Server's rowversion (except that it is per page rather than per row).
It seems as long as I have a row-change-timestamp column, then my row change token granularity will be good enough to prevent most false positives.
Thanks.
The way I read the first article is that you can use any of those features, you don't need to use all of them. In particular, it appears that the row-change-timestamp is derived from RID() and ROW CHANGE TOKEN:
Time-based update detection: This feature is added to SQL using the RID_BIT() and ROW CHANGE TOKEN. To support this feature, the table needs to have a new generated column defined to store the timestamp values. This can be added to existing tables using the ALTER TABLE statement, or the column can be defined when creating a new table. The column's existence also affects the behavior of optimistic locking, in that the column is used to improve the granularity of the ROW CHANGE TOKEN from page level to row level, which could greatly benefit optimistic locking applications.
... among other things, the timestamp actually increases the granularity compared to the ROW CHANGE TOKEN, so it makes it easier to deal with updates.
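A minimal sketch of how the pieces fit together in DB2 LUW; table and column names are made up, and :token stands for the value your application captured on read:

-- Add a row-change-timestamp column so ROW CHANGE TOKEN becomes row-level rather than page-level
ALTER TABLE orders
  ADD COLUMN row_change_ts TIMESTAMP NOT NULL
      GENERATED ALWAYS FOR EACH ROW ON UPDATE AS ROW CHANGE TIMESTAMP;

-- Read: capture the token alongside the data
SELECT order_id, status, ROW CHANGE TOKEN FOR orders AS token
FROM orders
WHERE order_id = 42;

-- Write: succeed only if nobody else changed the row in the meantime
-- (an update count of 0 means the optimistic check failed)
UPDATE orders
SET status = 'SHIPPED'
WHERE order_id = 42
  AND ROW CHANGE TOKEN FOR orders = :token;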
For a number of reasons, please make sure to set the db time to UTC, as DB2 doesn't track timezone (so if you're somewhere that uses DST, the same timestamp can happen twice).
(As a side note, RID() isn't stable on all platforms. On the iSeries version, at least, it changes if somebody re-orgs the table, and you may not always get the results you expect when using it with joins. I'm also not sure about use with mirroring...)
Are you aware that if you update multiple rows in the same SQL statement execution, they will get the same timestamp (if the timestamp is updated in that statement)?
This means that a timestamp column is probably a bad choice for a unique row identifier.

SQL Server - get only updated records

I am using SQL Server 2000. I need to get only the updated records from a remote server and insert them into my local server on a daily basis. But that table does not have a created date or modified date field.
Use Transactional Replication.
Update
If you cannot do administrative operations on the source then you're going to have to read all the data every day. Since you cannot detect changes (and keep in mind that even if you had a timestamp you still wouldn't be able to detect changes, because there is no way to detect deletes with a timestamp), you have to read every row every time you sync. And if you read every row, then the simplest solution is to just replace all the data you have with the new snapshot.
You need one of the following
a column in the table which flags new or updated records in one fashion or another (lastupdate_timestamp, incremental update counter...)
some trigger on INSERT and UPDATE on the table which produces some side effect, such as adding the corresponding row id into a separate table
You can also compare row-by-row the data from the remote server against that of the production server to get the list of new or updated rows... Such a differential update can also be produced by comparing some hash value, one per row, computed from the values of all columns for the row.
Barring one of the above, and barring some MS-SQL built-in replication setup, the only other possibility I can think of is [not pretty]:
parsing the SQL log to identify updates and additions to the table. This requires specialized software; I'm not even sure if the log file format is published/documented, though I have seen these types of tools. Frankly, this approach is more one for forensic-type situations...
If you can't change the remote server's database, your best option may be to come up with some sort of hash function on the values of a given row, compare the old and new tables, and pull only the ones where function(oldrow) != function(newrow).
You can also just do a direct comparison of the columns in question, and copy that record over when not all the columns in question are the same between old and new.
This means that you cannot modify values in the new table, or they'll get overwritten daily from the old. If this is an issue, you'll need another table in which to cache the old table's values from the day before; then you'll be able to tell whether old, new, or both were modified in the interim.
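A sketch of that comparison using BINARY_CHECKSUM, which is available on SQL Server 2000; the linked-server, table, and column names are invented, and checksums can collide, so treat this as a cheap change filter rather than a guarantee:

SELECT r.KeyId, r.Col1, r.Col2
FROM RemoteServer.RemoteDb.dbo.SourceTable AS r
LEFT JOIN dbo.LocalCopy AS l ON l.KeyId = r.KeyId
WHERE l.KeyId IS NULL                                                      -- new rows
   OR BINARY_CHECKSUM(r.Col1, r.Col2) <> BINARY_CHECKSUM(l.Col1, l.Col2);  -- changed rows
-- (deletes still need a separate pass in the other direction)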
I solved this by using the tablediff utility, which compares the data in two tables for non-convergence and is particularly useful for troubleshooting non-convergence in a replication topology. See the tablediff utility documentation for details.
To sum up:
You have an older remote db server that you can't modify anything in (such as tables, triggers, etc).
You can't use replication.
The data itself has no indication of date/time it was last modified.
You don't want to pull the entire table down each time.
That leaves us with an impossible situation.
Your only option, if the first 3 items above are true, is to pull the entire table. Even if it did have a modified date/time column, you wouldn't detect deletes, which leaves us back at square one.
Go talk to your boss and ask for better requirements. Maybe something that can be done this time.

SQL Identity Column out of step

We have a set of databases that have a table defined with an identity column as the primary key. As a subset of these is replicated to other servers, a seeding system was created so that they could never clash: each database uses a different starting seed with an increment of 50.
In this way the table on DB1 would generate 30001, 30051, etc., while Database2 would generate 30002, 30052, and so on.
I am looking at adding another database into this system (it is split for scaling/load purposes) and have discovered that the identities have got out of sync on one or two of the databases - i.e. database 3, which should have numbers ending in 3, doesn't anymore. The seeding and increments are still correct according to the table design.
I am obviously going to have to work around this problem somehow (probably by setting a high initial value), but can anyone tell me what would cause them to get out of sync like this? From a query on the DB I can see the sequence went as follows: 32403, 32453, 32456, 32474, 32524, 32574 and has continued in increments of 50 ever since it went wrong.
As far as I am aware no bulk-inserts or DTS or anything like that has put new data into these tables.
Second (bonus) question - how to reset the identity so that it goes back to what I want it to actually be!
EDIT:
I know the design is in principle a bit ropey - I didn't ask for criticism of it, I just wondered how it could have got out of sync. I inherited this system and changing the column to a GUID - whilst undoubtedly the best theoretical solution - is probably not going to happen. The system evolved from a single DB to multiple DBs when the load got too large (a few hundred GBs currently). Each ID in this table will be referenced in many other places - sometimes a few hundred thousand times each (multiplied by about 40,000 for each item). Updating all those will not be happening ;-)
Replication = GUID column.
To set the value of the next ID to be 1000:
DBCC CHECKIDENT (orders, RESEED, 999)
If you want to actually use Primary Keys for some meaningful purpose other than uniquely identify a row in a table, then it's not an Identity Column, and you need to assign them some other explicit way.
If you want to merge rows from multiple tables, then you are violating the intent of Identity, which is for one table. (A GUID column will use values that are unique enough to solve this problem. But you still can't impute a meaningful purpose to them.)
Perhaps somebody used:
SET IDENTITY_INSERT {tablename} ON
INSERT INTO {tablename} (ID, ...)
VALUES (32456, ....)
SET IDENTITY_INSERT {tablename} OFF
Or perhaps they used DBCC CHECKIDENT to change the identity. In any case, you can use the same to set it back.
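For reference, checking and resetting use the same command; the table name and reseed value below are placeholders:

DBCC CHECKIDENT ('dbo.MyTable', NORESEED);       -- report the current identity value without changing it
DBCC CHECKIDENT ('dbo.MyTable', RESEED, 32553);  -- the next insert then gets 32553 + the increment (50 here)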
It's too risky to rely on this kind of identity strategy, since it's (obviously) possible that it will get out of sync and wreck everything.
With replication, you really need to identify your data with GUIDs. It will probably be easier for you to migrate your data to a schema that uses GUIDs for PKs than to try and hack your way around IDENTITY issues.
To address your question directly,
Why it got out of sync may be interesting to discuss, but the only result you could draw from the answer would be a way to prevent it in the future within the current design, which is a bad course of action. You will continue to have these and bigger problems unless you deal with the design, which has a fatal flaw.
How to set the existing values right is also (IMHO) an invalid question, because you need to do something other than set the values right - it won't solve your problem.
This isn't to disparage you, it's to help you the best way I can think of. Changing the design is less work both short term and long term. Not to change the design is the pathway to FAIL.
This doesn't really answer your core question, but one possibility to address the design would be to switch to a hi/lo algorithm. It wouldn't require changing the column away from an int, so it shouldn't be nearly as much work as changing to a GUID.
Hi/lo is used by the NHibernate ORM, but I couldn't find much documentation on it.
Basically, the way hi/lo works is that you have one central place where you keep track of your "hi" value: one table in one of the databases that every instance of your insert application can see. Then you have some kind of a service (object, web service, whatever) that lives somewhat longer than a single entity insert. When this service starts up, it goes to the hi table, grabs the current value, and increments the value in that table; do this under a lock so you won't get any concurrency issues with other instances of the service. You then use the service to get your next id value: it starts at the number it got from the db and, each time it hands a value out, increments by 1, keeping track of the current value and the range it's allowed to hand out. A simplistic example would be this (a minimal T-SQL sketch follows the walkthrough):
Service 1 gets 100 from the "hi_value" table in the db and increments the db value to 200.
Service 1 gets a request for a new ID and hands out 100.
Another instance of the service, service 2 (another thread, another middle-tier worker machine, etc.), spins up, gets 200 from the db, and increments the db value to 300.
Service 2 gets a request for a new id and hands out 200.
Service 1 gets a request for a new id and hands out 101.
If any of these ever gets to handing out more than 100 values before dying, it will go back to the db, get the current value, increment it, and start over. Obviously there's some art to this: how big should your range be, etc.
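Here is a minimal T-SQL sketch of the central hi table and the block-reservation step. All names are hypothetical, the hand-out of 100, 101, ... 199 lives in your service code, and the UPDLOCK/HOLDLOCK hints are one way (stronger than plain read committed) to serialize concurrent callers:

CREATE TABLE dbo.HiValue (Hi int NOT NULL);
INSERT INTO dbo.HiValue (Hi) VALUES (100);
GO

CREATE PROCEDURE dbo.ReserveHiBlock
    @blockSize int,
    @hi        int OUTPUT
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRAN;
        -- UPDLOCK/HOLDLOCK ensure two services never reserve the same block
        SELECT @hi = Hi FROM dbo.HiValue WITH (UPDLOCK, HOLDLOCK);
        UPDATE dbo.HiValue SET Hi = Hi + @blockSize;
    COMMIT;
END;
GO

-- Each service instance then calls, e.g.:
DECLARE @start int;
EXEC dbo.ReserveHiBlock @blockSize = 100, @hi = @start OUTPUT;
-- and hands out @start, @start + 1, ... @start + 99 from memory.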
A very simple variation on this is a single table in one of your dbs that just contains the "nextId" value, basically manually reproducing Oracle's sequence concept.