Slow progress when adding sequential identity column - sql

We have an 8 million row table and we need to add a sequential id column to it. It is used for data warehousing.
From testing, we know that if we remove all the indexes, including the primary key index, adding a new sequential id column is about 10x faster. I still haven't figured out why dropping the indexes would help when adding an identity column.
Here is the SQL that adds the identity column:
ALTER TABLE MyTable ADD MyTableSeqId BIGINT IDENTITY(1,1)
However, the table in question has dependencies, so I cannot drop the primary key index unless I remove all the FK constraints. As a result, adding the identity column is slow.
Are there other ways to improve the speed of adding an identity column, so that client downtime is minimal?
or
Is there a way to add an identity column without locking the table, so that the table can still be accessed, or at least queried?
The database is SQL Server 2005 Standard Edition.

Adding a new column to a table will acquire a Sch-M (schema modification) lock, which prevents all access to the table for the duration of the operation.
You may get some benefit from switching the database to the bulk-logged or simple recovery model for the duration of the operation, but of course, do so only if you're aware of the effects this will have on your backup/restore strategy.
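If you try that route, the recovery model switch itself is trivial; a minimal sketch, assuming the database is named MyDw (a placeholder name) and normally runs under the FULL recovery model:

-- take a log backup first, then switch to a minimally logged model for the operation
ALTER DATABASE MyDw SET RECOVERY BULK_LOGGED;

ALTER TABLE MyTable ADD MyTableSeqId BIGINT IDENTITY(1,1);

-- switch back and take a full or log backup to re-establish the log chain
ALTER DATABASE MyDw SET RECOVERY FULL;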

Related

SQL Server database replication

I am working on a db optimization (planning for future project growth) and need some help.
Currently, every table uses a uniqueidentifier column as the PK (clustered index) and we have high index fragmentation (99%). For new tables we started using bigint as the PK, but I don't want a nightmare when the bidirectional replication phase comes.
I did my research and uniqueidentifier itself is not a huge problem (except for memory); the problem is the clustered index on that column (http://www.sqlskills.com/blogs/kimberly/guids-as-primary-keys-andor-the-clustering-key/).
Can we do this to solve our fragmentation problem and any replication nightmare:
Add ROWGUIDCOL to the PK column
Add another identity column to every table and move the clustered index (not the PK) to that column
Would this new identity column cause the same replication problems as a bigint PK would?
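For clarity, the change described above would look roughly like this (a sketch only; Orders, OrderId and the constraint/index names are made up):

-- keep the existing uniqueidentifier PK, mark it ROWGUIDCOL, and make it nonclustered
-- (any FKs referencing PK_Orders would have to be dropped and re-created around this)
ALTER TABLE Orders ALTER COLUMN OrderId ADD ROWGUIDCOL;
ALTER TABLE Orders DROP CONSTRAINT PK_Orders;
ALTER TABLE Orders ADD CONSTRAINT PK_Orders PRIMARY KEY NONCLUSTERED (OrderId);

-- add an ever-increasing identity column and cluster on it instead
ALTER TABLE Orders ADD OrderSeqId BIGINT IDENTITY(1,1) NOT NULL;
CREATE UNIQUE CLUSTERED INDEX CIX_Orders_OrderSeqId ON Orders (OrderSeqId);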
I know SharePoint database primary keys are GUIDs, but I am not happy with its performance, and security probably played some part in that decision.
We could reduce fragmentation by using sequential GUIDs, but we can't create them on the application side or return them with SCOPE_IDENTITY(). The OUTPUT inserted.ID workaround would be time consuming because we would need to rewrite the complete application DAL.
Finally, is there any valid solution for this problem? Can we use bigint without any replication problem?
Thanks

Checking foreign key constraint "online"

If we have a giant fact table and want to add a new dimension, we can do it like this:
BEGIN TRANSACTION
ALTER TABLE [GiantFactTable]
ADD NewDimValueId INT NOT NULL
CONSTRAINT [temp_DF_NewDimValueId] DEFAULT (-1)
WITH VALUES -- table is not actually rebuilt!
ALTER TABLE [GiantFactTable]
WITH NOCHECK
ADD CONSTRAINT [FK_GiantFactTable_NewDimValue]
FOREIGN KEY ([NewDimValueId])
REFERENCES [NewDimValue] ([Id])
-- drop the default constraint, new INSERTs will specify a value for NewDimValueId column
ALTER TABLE [GiantFactTable]
DROP CONSTRAINT [temp_DF_NewDimValueId]
COMMIT TRANSACTION
NB: all of the above only manipulate table metadata and should be fast regardless of table size.
Then we can run a job to backfill GiantFactTable.NewDimValueId in small transactions, such that the FK is not violated. (At this point any INSERTs/UPDATEs - e.g. backfill operation - are verified by the FK since it's enabled, but not "trusted")
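The backfill job itself looks roughly like this (a sketch; [NewDimValueMapping] and NaturalKey are stand-ins for however the new dimension value is actually derived):

DECLARE @rows INT;
SET @rows = 1;
WHILE @rows > 0
BEGIN
    -- small batches keep each transaction (and its locks) short
    UPDATE TOP (10000) f
    SET    f.NewDimValueId = m.NewDimValueId
    FROM   [GiantFactTable] AS f
    JOIN   [NewDimValueMapping] AS m ON m.NaturalKey = f.NaturalKey
    WHERE  f.NewDimValueId = -1;

    SET @rows = @@ROWCOUNT;
END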
After the backfill we know the data is consistent; my question is, how can the SQL engine become enlightened too, without taking the table offline?
This command will make the FK trusted, but it requires a schema modification (Sch-M) lock and will likely take hours (days?), taking the table offline:
ALTER TABLE [GiantFactTable]
WITH CHECK CHECK CONSTRAINT [FK_GiantFactTable_NewDimValue]
About the workload: Table has a few hundred partitions (fixed number), data is appended to one partition at a time (in a round-robin fashion), never deleted. There is also a constant read workload that uses the clustering key to get a (relatively small) range of rows from one partition at a time.
Checking one partition at a time, taking it offline, would be acceptable. But I can't find any syntax to do this. Any other ideas?
A few ideas come to mind but they aren't pretty:
Redirect workloads and run check constraint offline
Create a new table with the same structure.
Change the "insert" workload to insert into the new table
Copy the data from the partition used by the "read" workload to the new table (or a third table with the same structure)
Change the "read" workload to use the new table
Run alter table to check the constraint and let it take as long as it needs
Change both workloads back to the main table.
Insert the new rows back into the main table
Drop new table(s)
A variation on the above is to switch the relevant partition to the new table in step 3. That should be faster than copying the data but I think you will have to copy (and not just switch) the data back after the constraint has been checked.
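The switch itself is metadata-only and would look something like this (a sketch; the partition number and staging table name are placeholders, and the staging table must match the main table's structure, constraints and filegroup):

ALTER TABLE [GiantFactTable]
SWITCH PARTITION 42 TO [GiantFactTable_Staging];

-- after checking the constraint on the main table, copy (not switch) the rows back:
INSERT INTO [GiantFactTable] SELECT * FROM [GiantFactTable_Staging];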
Insert all the data into a new table
Create a new table with the same structure and constraint enabled
Change the "insert" workload to the new table
Copy all the data from old to new table in batches and wait as long as it takes to complete
Change the "read" workload to the new table. If step 3 takes too long and the "read" workload needs rows that have only been inserted into the new table, you will have to manage this changeover manually.
Drop old table
Use index to speed up constraint check?
I have no idea if this works but you can try to create a non-clustered index on the foreign key column. Also make sure there's an index on the relevant unique key on the table referenced by the foreign key. The alter table command might be able to use them to speed up the check (at least by minimizing IO compared to doing a full table scan). The indexes, of course, can be created online to avoid any disruption.
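For instance (a sketch; the index name is made up, and ONLINE = ON requires Enterprise Edition):

CREATE NONCLUSTERED INDEX IX_GiantFactTable_NewDimValueId
ON [GiantFactTable] ([NewDimValueId])
WITH (ONLINE = ON);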

Best approach for multi-tenant primary keys

I have a database used by several clients. I don't really want surrogate incremental key values to bleed between clients. I want the numbering to start from 1 and be client specific.
I'll use a two-part composite key of the tenant_id as well as an incremental id.
What is the best way to create an incremental key per tenant?
I am using SQL Azure. I'm concerned about locking tables, duplicate keys, etc. I'd typically set the primary key to IDENTITY and move on.
Thanks
Are you planning on using SQL Azure Federations in the future? If so, the current version of SQL Azure Federations does not support the use of IDENTITY as part of a clustered index. See What alternatives exist to using guid as clustered index on tables in SQL Azure (Federations) for more details.
If you haven't looked at Federations yet, you might want to check it out as it provides an interesting way to both shard the database and for tenant isolation within the database.
Depending upon your end goal, using Federations you might be able to use a GUID as the primary clustered index on the table and also use an incremental INT IDENTITY field on the table. This INT IDENTITY field could be shown to end-users. If you are federating on the TenantID, each "Tenant table" effectively becomes a silo (as I understand it at least), so an IDENTITY field within that table would effectively be an ever-increasing, auto-generated value that increments within a given Tenant.
When/if data is merged together (combining data from multiple Tenants), you would wind up with collisions on this INT IDENTITY field (hence why IDENTITY isn't supported as a primary key in federations), but as long as you aren't using this field as a unique identifier within the system at large, you should be OK.
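To illustrate the shape of such a table (a sketch only; Invoices and all of its columns are made up):

CREATE TABLE Invoices
(
    InvoiceId     UNIQUEIDENTIFIER NOT NULL
                  CONSTRAINT DF_Invoices_InvoiceId DEFAULT NEWID()
                  CONSTRAINT PK_Invoices PRIMARY KEY CLUSTERED,
    TenantId      INT NOT NULL,
    DisplayNumber INT IDENTITY(1,1) NOT NULL,  -- shown to end-users; unique within a tenant silo only
    CreatedOn     DATETIME NOT NULL CONSTRAINT DF_Invoices_CreatedOn DEFAULT GETUTCDATE()
);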
If you're looking to duplicate the convenience of having an automatically assigned unique INT key upon insert, you could add an INSTEAD OF INSERT trigger that uses MAX of the existing column +1 to determine the next value.
If the column with the identity value is the first key in an index, the MAX query will be a simple index seek, very efficient.
Transactions will ensure that unique values are assigned but this approach will have different locking semantics than the standard identity column. IIRC, SQL Server can allocate a different identity value for each transaction that requests it in parallel and if a transaction is rolled back, the value(s) allocated to it are discarded. The MAX approach would only allow one transaction to insert rows into the table at a time.
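Here is roughly what such a trigger could look like (a sketch; the Orders table and its columns are made up, and a (TenantId, OrderNo) key is assumed):

CREATE TRIGGER trg_Orders_AssignOrderNo
ON Orders
INSTEAD OF INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- the UPDLOCK/HOLDLOCK hints on the MAX lookup are what serialize concurrent inserts per tenant
    INSERT INTO Orders (TenantId, OrderNo, OrderDate)
    SELECT  i.TenantId,
            ISNULL(x.MaxNo, 0)
                + ROW_NUMBER() OVER (PARTITION BY i.TenantId ORDER BY (SELECT NULL)),
            i.OrderDate
    FROM    inserted AS i
    CROSS APPLY (SELECT MAX(o.OrderNo) AS MaxNo
                 FROM   Orders AS o WITH (UPDLOCK, HOLDLOCK)
                 WHERE  o.TenantId = i.TenantId) AS x;
END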
A related approach could be to have a dedicated key value table keyed by the table name, tenant ID and current identity value. It would require the same INSTEAD OF INSERT trigger and more boilerplate to query and keep that key table updated. It wouldn't improve parallel operations though; the lock would just be on a different table's record.
One possibility to fix the locking bottleneck would be to include the current SPID in the key's value (the identity key is then a combination of a sequential int and whatever SPID happened to allocate it, not simply sequential), use the dedicated identity value table, and insert records there per SPID as necessary; the identity table's PK would be (table name, tenant, SPID), with a non-key column holding the current sequential value. That way, each SPID would have its own dynamically allocated identity pool and would only ever lock its own SPID-specific records.
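Roughly like this (a sketch; all names are invented, and each (table, tenant, SPID) row is assumed to be inserted on first use):

CREATE TABLE KeyAllocation
(
    TableName    SYSNAME NOT NULL,
    TenantId     INT     NOT NULL,
    Spid         INT     NOT NULL,
    CurrentValue INT     NOT NULL,
    CONSTRAINT PK_KeyAllocation PRIMARY KEY (TableName, TenantId, Spid)
);

-- each session bumps and reads only its own row, so sessions do not block each other
DECLARE @TenantId INT;
SET @TenantId = 1;

UPDATE KeyAllocation
SET    CurrentValue = CurrentValue + 1
OUTPUT inserted.CurrentValue
WHERE  TableName = 'Orders'
  AND  TenantId  = @TenantId
  AND  Spid      = @@SPID;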
Another downside is maintaining triggers that have to be updated whenever you change the columns in any of the special identity tables.

SQL Server table (or likely any SQL table) - does not having a primary key impact performance?

I have a table where I haven't explicitly defined a primary key, it's not really required for functionality... however a coworker suggested I add a column as a unique primary key to improve performance as the database grows...
Can anyone explain how this improves performance?
There is no indexing being used. (I know I could add indexes to improve performance; what's not clear is how a primary key would also improve performance.)
The specifics
The main table is a log of user activity. It has an auto-incrementing column for each entry, so it's already unique, but it isn't set as a primary key.
This log table references activity tables which detail the specific activity, referenced by that auto-incrementing entry in the main table. So the value is only unique in the main log table; there could be 100 entries in an activity table that reference that value as an identifier (i.e. for session 212, Niall did these 500 things).
As you might guess the bulk of data is in the activity tables.
As Kimberly Tripp (the Queen of Indexing) clearly shows in her excellent blog post, The Clustered Index Debate Continues..., having a clustered index on your SQL Server table is beneficial - for all operations - yes, even for inserts!
To quote Kimberly:
Inserts are faster in a clustered table (but only in the "right" clustered table) than compared to a heap. The primary problem here is that lookups in the IAM/PFS to determine the insert location in a heap are slower than in a clustered table (where insert location is known, defined by the clustered key). Inserts are faster when inserted into a table where order is defined (CL) and where that order is ever-increasing.
Since your primary key will by default automatically create a clustered index on that column you define, I would argue that yes, having a primary (clustering) key on your SQL Server table - even a log table - does have positive performance effects.
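In your case that could be as simple as promoting the existing auto-incrementing column to a clustered primary key (a sketch; UserActivityLog and LogId are placeholder names):

ALTER TABLE UserActivityLog
ADD CONSTRAINT PK_UserActivityLog PRIMARY KEY CLUSTERED (LogId);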
Primary keys can help performance - it tells SQL Server something important about that field - that it's unique and NOT NULL. This can help create more efficient execution plans.
This MSDN reference on Improving SQL Server Performance is worth a read.
Quote:
When primary and foreign keys are defined as constraints in the database schema, the server can use that information to create optimal execution plans.
A primary key automatically creates an index on the primary key column. Adding indexes to your table will increase the performance of your queries.
You don't need to set a primary key to speed up performance, but you should add indexes to your table that will speed up your queries.
Which indexes make sense and which don't depends on your queries and your table.
To add to the above - generally, if you frequently search on a field, it is a good candidate for an index. Also, searching on an integer ID is usually faster than on a string, for example.
Indexes take more storage space, but can increase search performance on that field.
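For example (a sketch; the table and column names are placeholders for an activity table that is frequently searched by its log entry id):

CREATE NONCLUSTERED INDEX IX_UserActivity_LogId
ON UserActivity (LogId);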

VB.NET LINQ to SQL Delete All Records

I am having problems deleting all records in a table with VB.NET. I am using this code to delete all records in the Contacts table:
For Each contact In database.Contacts
database.Contacts.DeleteOnSubmit(contact)
Next
But I get this error:
Can't perform Create, Update or Delete operations on 'Table(Contact)' because it has no primary key.
Anyone have any suggestions?
You should probably have a primary key on your table. This will make working with your table much easier. If you don't have a primary key, try finding a suitable candidate key to set as the primary key. If you have no suitable columns then you may wish to consider adding an auto incrementing surrogate key (called an identity in SQL Server). If you already have a primary key, make sure your LINQ to SQL classes are updated.
However if you just want to delete all values you may find that this method is too slow. An alternative is to execute SQL directly using DataContext.ExecuteCommand:
database.ExecuteCommand("DELETE Contacts");
This doesn't require that the table has a primary key. Note that this will irretrievably delete all rows in your table, so be careful. Even faster is the TRUNCATE command, but note that this requires greater privileges:
database.ExecuteCommand("TRUNCATE TABLE Contacts");
Again, be careful with this command. It will delete all rows from your table.