ETL Strategies: Identity Insert vs. Use Identity Logic

ETL Strategies: Identity Insert vs. Use Identity Logic - sql

One of my ETL moves about 18 Million rows from one server to another for further processing. I am using the FAST LOAD option.
For the Identity Column, I have two options:
Use IDENTITY INSERT
Don't set any input for the Identity Column, thereby forcing SQL Server to generate a new IDENTITY for each row inserted
The value of the Identity column does not matter.
Which option should I choose for the best performance?

Based on what you've told us, the value of the identity column doesn't matter and you have no reason for it to agree with the original table's value, I would choose the second option. There you are using SQL Server's natural method of setting identity values, you eliminate gaps in the values and the key will be ascending based on the order you choose when you insert them.

Related

IDENTITY SEED is incrementing based on other tables seed values

I'm using SQL SERVER 2017 and using SSMS. I have created a few tables whose Primary Key is int and enabled Is Identity and set Identity Increment = 1 and Identity Seed=1 For all the tables I have used the same method. But When I added one record in a table say Lead it's ID was 2, Then added value to the table say Followup then its ID was 3.
Here I'm adding the screenshots for a better understanding
Lead Table
Followup Table
Is there any option available to avoid this? can we keep the identity individual for each table?

The documentation is quite specific about what identity does not guarantee:
The identity property on a column does not guarantee the following:
Uniqueness of the value . . .
Consecutive values within a transaction . . .
Consecutive values after server restart or other failures . . .
Reuse of values
In general, the "uniqueness" property is a non-issue, because identity columns are usually the primary key (or routinely declared at least unique), which does guarantee uniqueness.
The purpose of an identity column is to provide a unique numeric identifier different from other rows, so it can be readily used as a primary key. There are no other guarantees. And for performance SQL Server has lots of short-cuts that result in gaps.
If you want no gaps, then the simplest way is to assign a value when querying:
row_number() over (order by <identity column>)
That is not 100% satisfying, because deletions can affect the value. I also think that on parallel systems, inserts can as well (because identities might be cached on individual nodes).
If you do not care about performance, you can use a sequence for assigning a value. This is less performant than using an identity. Basically, it requires serializing all the inserts to guarantee the properties of the insert.
I should note that even with a sequence, a failed insert can still produce gaps, so it still might not do what you want.

IDENTITY statement issue

QUESTION:
Is it possible to roll back the maximum value of a column(using IDENTITY) after deleting some rows?
PROBLEM:
The value of column(id) doesn't reseed/roll back to the current maximum value of the column, it keeps incrementing with the last known max.

I'm assuming you're using SQL Server.
The behaviour you described is the way IDENTITY columns work by design. The value for new rows is only ever incremented. This is normally used to ensure uniqueness of generated values. If you delete records, it leaves a gap in numbers. There's even a remark on TechNet:
If an identity column exists for a table with frequent deletions, gaps can occur between identity values. If this is a concern, do not use the IDENTITY property.
You can reseed identity if needed or you can enter any value explicitely into an idenity column when using SET IDENTITY_INSERT ON but that's not a standard usage of those columns.

TSQL Auto Increment on Update

SQL Server 2008+
I have a table with an auto-increment column which I would like to have increment not only on insert but also update. This column is not the primary key, but there is also a primary key which is a GUID created automatically via newid().
As far as I can tell, there are two ways to do this.
1.) Delete the existing row and insert a new row with indentical values (plus any updates).
or
2.) Update the existing row and use the following to get the "next" identity value:
IDENT_CURRENT('myTable') + IDENT_INCR('myTable')
In either case, I'm forced to allow identity inserts. (With option 1, because the primary key for the table needs to remain the same, and with option 2 because I'm updating the auto-increment column with a specific value.) I'm not sure what the locking/performance consequences of this are.
Any thoughts on this? Is there a better approach? The goal here is to maintain an always increasing set of integer values in the column whenever a row is inserted or updated.

I think a column of type rowversion (formerly known as "timestamp") might be your simplest choice, although at 8 bytes these can amount to fairly large integers. The "timestamp" syntax is deprecated in favor of rowversion (since ISO SQL has a timestamp datatype).
If you stay with the Identity column approach, you would probably want to put your logic into an UPDATE trigger, which would effectively replace the UPDATE with the INSERT and DELETE combination you've described.
Note that Identity column values are not guaranteed to be sequential, only increasing.

Does it need to be an integer column? A timestamp column will provide you the functionality you are looking for out of the box.

Columns with an identity property can't be updated. Once the column with an identity property on it has been assigned a value, either automatically, or with identity_insert on, it is an invariant value. Further the identity property may not be disabled or removed via alter column.
I believe what you want to look at is a SQL Server TIMESTAMP (now called rowversion in SQL Server 2008). It is fundamentally an auto-incrementing binary value. Each database has a unique rowversion counter. Each row insert/update in a table with a timestamp/rowversion column results in the counter being ticked up and the new value assigned to the inserted/modified row.

SQL Server 2005 - Working with IDENTITY column

I am working on SQL Server 2005 tables. Here I need to add a column named ‘ID’ as ‘IDENTITY’ column (With starting and incrementing values as 1, 1).
Now my problem is that these tables already have thousands of records. So could you please suggest the best and easy way to perform this job?
Many Thanks,
Regards.
Anusha.

If you add an identity column all the existing records will get calculated incremental values based on the seed value you establish on the new column.

First make sure you havea a good backup.
Here is an example:
alter table mydatabase.dbo.mytest
add id int identity (1,1)
Table will be locked up until it finishes adding the tidentity columns so don;t do this during high peak hours. As always test of dev first.
If you want to change an existing column to an identity that is harder.

identity column in Sql server

Why does Sql server doesn't allow more than one IDENTITY column in a table?? Any specific reasons.

Why would you need it? SQL Server keeps track of a single value (current identity value) for each table with IDENTITY column so it can have just one identity column per table.

An Identity column is a column ( also known as a field ) in a database table that :-
Uniquely identifies every row in the table
Is made up of values generated by the database
This is much like an AutoNumber field in Microsoft Access or a sequence in Oracle.
An identity column differs from a primary key in that its values are managed by the server and ( except in rare cases ) can't be modified. In many cases an identity column is used as a primary key, however this is not always the case.
SQL server uses the identity column as the key value to refer to a particular row. So only a single identity column can be created. Also if no identity columns are explicitly stated, Sql server internally stores a separate column which contains key value for each row. As stated if you want more than one column to be having unique value, you can make use of UNIQUE keyword.

The SQL Server stores the identity in an internal table, using the id of the table as it's key. So it's impossible for the SQL Server to have more than one Identity column per table.

Because MS realized that better than 80% of users would only want one auto-increment column per table and the work-around to have a second (or more) is simple enough i.e. create an IDENTITY with seed = 1, increment = 1 then a calculated column multiplying the auto-generated value by a factor to change the increment and adding an offset to change the seed.

Yes , Sequences allow more than one identity like columns in atable , but there are some issues here . In a typical development scenario i have seen developers manually inserting valid values in a column (which is suppose to be inserted through sequence) . Later on when a sequence try inserting value in to the table , it may fail due to unique key violation.
Also , in a multi developer / multi vendor scenario, developers might use the same sequence for more than one table (as sequences are not linked to tables) . This might lead to missing values in one of the table . ie tableA might get the value 1 while tableB might use value 2 and tableA will get 3. This means that tableA will have 1 and 3 (missing 2).
Apart from this , there is another scenario where you have a table which is truncated every day . Since Sequences are not having any link with table , the truncated table will continue to use the Seq.NextVal again (unless you manually reset the sequence) leading to missing values or even more dangerous arthmetic overflow error after sometime.
Owing to above reason , i feel that both Oracle sequences and SQL server identity column are good for their purposes. I would prefer oracle implementing the concept of Identity column and SQL Server implementing the sequence concept so that developers can implement either of the two as per their requirement.

The whole purpose of an identity column is that it will contain a unique value for each row in the table. So why would you need more than one of them in any given table?
Perhaps you need to clarify your question, if you have a real need for more than one.

An identity column is used to uniquely identify a single row of a table. If you want other columns to be unique, you can create a UNIQUE index for each "identity" column that you may need.

I've always seen this as an arbitrary and bad limitation for SQL Server. Yes, you only want one identity column to actually identify a row, but there are valid reasons why you would want the database to auto-generate a number for more than one field in the database.
That's the nice thing about sequences in Oracle. They're not tied to a table. You can use several different sequences to populate as many fields as you like in the same table. You could also have more than one table share the same sequence, although that's probably a really bad decision. But the point is you could. It's more granular and gives you more flexibility.
The bad thing about sequences is that you have to write code to actually increment them, whether it's in your insert statement or in an on-insert trigger on the table. The nice thing about SQL Server identity is that all you have to do is change a property or add a keyword to your table creation and you're done.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

ETL Strategies: Identity Insert vs. Use Identity Logic - sql

Related

IDENTITY SEED is incrementing based on other tables seed values

IDENTITY statement issue

TSQL Auto Increment on Update

SQL Server 2005 - Working with IDENTITY column

identity column in Sql server

Categories

Resources