SQL Server database replication - sql

I am working on a db optimization (planning for future project growth) and need some help.
Currently, every table is using a uniqueidentifier column as PK (clustered index) and we have high index fragmentation (99%). For the new tables we started using bigint as pk but I don't want a nightmare when bidirectional replication phase comes.
I did my research and uniqueidentifier itself is not a huge problem (except for memory); the problem is the clustered index on that column (http://www.sqlskills.com/blogs/kimberly/guids-as-primary-keys-andor-the-clustering-key/).
Can we do this to solve our fragmentation problem and any replication nightmare:
Add a ROWGUIDCOL to PK column
Add another identity column in every table and move that clustered index (not pk) to that column
Would this new identity column cause the same replication problems as if that was bigint PK?
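The two steps above can be sketched roughly as follows. This is only an illustration: the table dbo.Customer, its columns, and the constraint and index names are hypothetical, and any foreign keys that reference the existing primary key would have to be dropped first and re-created afterwards.

```sql
-- Hypothetical table: dbo.Customer with a clustered PK named PK_Customer
-- on CustomerId (uniqueidentifier). All names here are assumptions.

-- Mark the existing GUID column as the ROWGUIDCOL.
ALTER TABLE dbo.Customer ALTER COLUMN CustomerId ADD ROWGUIDCOL;

-- Drop the clustered primary key; the table temporarily becomes a heap.
ALTER TABLE dbo.Customer DROP CONSTRAINT PK_Customer;

-- Add an ever-increasing surrogate column.
ALTER TABLE dbo.Customer ADD CustomerSeq BIGINT IDENTITY(1,1) NOT NULL;

-- Cluster the table on the new identity column (not the PK).
CREATE UNIQUE CLUSTERED INDEX CIX_Customer_CustomerSeq
    ON dbo.Customer (CustomerSeq);

-- Re-create the primary key on the GUID as a non-clustered index.
ALTER TABLE dbo.Customer
    ADD CONSTRAINT PK_Customer PRIMARY KEY NONCLUSTERED (CustomerId);
```

Creating the clustered index before the non-clustered primary key avoids building the non-clustered index twice.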
I know SharePoint database primary keys are GUIDs, but I am not happy with its performance, and security probably played some part in that decision.
We could reduce fragmentation by using sequential GUIDs, but we can't create them on the application side or return them with SCOPE_IDENTITY(). The OUTPUT inserted.ID hack would be time-consuming because we would need to rewrite the complete application DAL.
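For reference, a minimal sketch of the sequential-GUID approach mentioned here, assuming a hypothetical table dbo.OrderHeader. NEWSEQUENTIALID() can only be used in a DEFAULT constraint, which is why the generated value has to be read back with the OUTPUT clause rather than SCOPE_IDENTITY().

```sql
-- Hypothetical table; the names are assumptions for illustration only.
CREATE TABLE dbo.OrderHeader (
    Id UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_OrderHeader_Id DEFAULT NEWSEQUENTIALID()
        CONSTRAINT PK_OrderHeader PRIMARY KEY CLUSTERED,
    CustomerName NVARCHAR(100) NOT NULL
);

-- Table variable that receives the server-generated key.
DECLARE @NewIds TABLE (Id UNIQUEIDENTIFIER);

-- The GUID is generated by the default; OUTPUT captures it.
INSERT INTO dbo.OrderHeader (CustomerName)
OUTPUT inserted.Id INTO @NewIds (Id)
VALUES (N'Example');

-- Return the new key to the caller.
SELECT Id FROM @NewIds;
```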
Finally, is there any valid solution for this problem? Can we use bigint without any replication problem?
Thanks

Related

SQL Server : big data replication primary key

I know that this subject has been discussed many times, but there are different opinions about it. My scenario is this: I have created a database which is going to be filled with 4 billion records, and between 1 and 2 million records will be added each year.
We have servers in the USA and Europe, and we replicate the database to keep it in sync across these servers, similar to what Facebook does with replication.
My question is this: what should I use as the primary key of the tables, BigInt or Uniqueidentifier? Or does it make no difference for replication?
Should I create a non-clustered uniqueidentifier primary key and then add another clustered bigInt column?
Or
Should I create a clustered bigint primary key?
Without a doubt, go with a Uniqueidentifier.
Do not add a bigint column, you don't need it.
If you use merge replication and you don't have a uniqueidentifier then the server is going to add that column anyway.
By using a GUID, you now have the capability of setting up a multi-master DB architecture. If you use a bigint as an identity field then you either force yourself to only use a single master (to control the bigint) or you then have to come up with a scheme to keep multiple servers from colliding with each other. Further by using GUIDs you get away from guessable IDs - which is generally a good thing.
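One common scheme for keeping bigint identities from colliding across two masters is to interleave the values with different seeds and a shared increment. The sketch below assumes exactly two servers and a hypothetical dbo.Event table; each CREATE TABLE statement is meant to be run on a different server, not in the same database.

```sql
-- Run on server A (for example, the USA server): generates 1, 3, 5, ...
CREATE TABLE dbo.Event (
    EventId BIGINT IDENTITY(1, 2) NOT FOR REPLICATION NOT NULL
        CONSTRAINT PK_Event PRIMARY KEY CLUSTERED,
    Payload NVARCHAR(200) NOT NULL
);

-- Run on server B (for example, the Europe server): generates 2, 4, 6, ...
CREATE TABLE dbo.Event (
    EventId BIGINT IDENTITY(2, 2) NOT FOR REPLICATION NOT NULL
        CONSTRAINT PK_Event PRIMARY KEY CLUSTERED,
    Payload NVARCHAR(200) NOT NULL
);
```

NOT FOR REPLICATION keeps the local identity counter from being advanced when a replication agent inserts rows that originated on the other server. The increment fixes the maximum number of servers up front, which is the kind of planning a GUID key avoids.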
My own testing in the hundred million record range with millions added / deleted daily showed no performance drop when using GUIDs vs ints for ids.
Final note - most places base64 encode the guid when calling web services or if it is going to be displayed anywhere - like in the address bar.
I would argue just the other option: I would try to AVOID uniqueidentifier columns - MOST DEFINITELY as your clustering key!
The clustering key is the most replicated data structure in SQL Server - and with millions and millions of rows, it does make a huge difference if your clustering key is 8 or 16 bytes in size. Not to mention the number of page splits a uniqueidentifier clustering key would introduce - which you can totally avoid with a clustering key of BIGINT type.
If you're really interested - you must read all these articles from Kimberly Tripp - the "Queen of Indexing" in the SQL Server space - that clearly show just how bad and counter-productive a GUID as your clustering key can be:
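To put rough numbers on the 8-versus-16-byte argument, here is a back-of-the-envelope calculation. The row count comes from the question; the number of non-clustered indexes is an assumption, and the result covers only the key bytes themselves, ignoring row and page overhead.

```sql
DECLARE @Rows BIGINT;              -- row count taken from the question
DECLARE @NonClusteredIndexes INT;  -- assumed value, for illustration only
SET @Rows = 4000000000;
SET @NonClusteredIndexes = 3;

-- The clustering key is stored once in the clustered index and once
-- in every non-clustered index entry, hence (@NonClusteredIndexes + 1).
SELECT
    @Rows * 16 * (@NonClusteredIndexes + 1) / 1073741824.0 AS GuidKeyGiB,
    @Rows * 8  * (@NonClusteredIndexes + 1) / 1073741824.0 AS BigintKeyGiB;
```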
GUIDs as PRIMARY KEYs and/or the clustering key
The clustered index debate continues...
More considerations for the clustering key – the clustered index debate continues!
Ever-increasing clustering key – the Clustered Index Debate……….again!
Disk space is cheap .....

SQL Server table - ( or likely any SQL table) Does not having a primary key impact performance?

I have a table where I haven't explicitly defined a primary key, it's not really required for functionality... however a coworker suggested I add a column as a unique primary key to improve performance as the database grows...
Can anyone explain how this improves performance?
There is no indexing being used (I know I could add indexes to improve performance, what's not clear is how a primary key would also improve performance.)
The specifics
The main table is a log of user activity. It has an auto-incrementing column for each entry, so it's already unique, but it isn't set as a primary key.
This log table references activity tables which detail the specific activity, referenced by that autoincrementing entry in the main table. So the value is only unique in the main log table, there could be 100 entries in an activity table that reference that value as an identifier (ie. for session 212 Niall did these 500 things).
As you might guess the bulk of data is in the activity tables.
As Kimberly Tripp (the Queen of Indexing) clearly shows in her excellent blog post, The Clustered Index Debate Continues..., having a clustered index on your SQL Server table is beneficial - for all operations - yes, even for inserts!
To quote Kimberly:
Inserts are faster in a clustered table (but only in the "right" clustered table) than compared to a heap. The primary problem here is
that lookups in the IAM/PFS to determine the insert location in a heap
are slower than in a clustered table (where insert location is known,
defined by the clustered key). Inserts are faster when inserted into a
table where order is defined (CL) and where that order is
ever-increasing.
Since your primary key will by default automatically create a clustered index on that column you define, I would argue that yes, having a primary (clustering) key on your SQL Server table - even a log table - does have positive performance effects.
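For the log table described in the question, that could look something like the sketch below. The table and column names (dbo.UserActivityLog, dbo.ActivityDetail, LogId) are made up for illustration, since the question does not give them.

```sql
-- Promote the existing auto-incrementing column to a clustered primary key.
ALTER TABLE dbo.UserActivityLog
    ADD CONSTRAINT PK_UserActivityLog PRIMARY KEY CLUSTERED (LogId);

-- Declare the relationship from an activity table back to the log table.
ALTER TABLE dbo.ActivityDetail
    ADD CONSTRAINT FK_ActivityDetail_UserActivityLog
    FOREIGN KEY (LogId) REFERENCES dbo.UserActivityLog (LogId);
```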
Primary keys can help performance - it tells SQL Server something important about that field - that it's unique and NOT NULL. This can help create more efficient execution plans.
This MSDN reference on Improving SQL Server Performance is worth a read.
Quote:
When primary and foreign keys are defined as constraints in the
database schema, the server can use that information to create optimal
execution plans.
A primary key automatically creates an index on the primary key column. Adding an index to your table can increase the performance of your queries.
You don't need to set a primary key to speed up your performance, but you should add indexes to your table that will speed up your queries.
It depends on your queries and table what indexes make sense and which don't.
To add to the above - generally if you frequently search on a field, it is a good candidate for an index. Also, searching on an integer ID is usually faster than a string, for example.
Indexes take more storage space, but can increase search performance on that field.
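As a rough illustration of indexing a frequently searched field, the activity tables in the question are looked up by the log entry they reference. The names dbo.ActivityDetail and LogId are hypothetical; the value 212 is the session number from the question.

```sql
-- Index the column that the activity table is searched and joined on.
CREATE NONCLUSTERED INDEX IX_ActivityDetail_LogId
    ON dbo.ActivityDetail (LogId);

-- A lookup like this can then seek on the index instead of scanning the table.
SELECT *
FROM dbo.ActivityDetail
WHERE LogId = 212;
```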

Slow progress when adding sequential identity column

We have an 8-million-row table and we need to add a sequential id column to it. It is used for data warehousing.
From testing, we know that if we remove all the indexes, including the primary key index, adding a new sequential id column is about 10x faster. I still haven't figured out why dropping the indexes would help when adding an identity column.
Here is the SQL that add identity column:
ALTER TABLE MyTable ADD MyTableSeqId BIGINT IDENTITY(1,1)
However, the table in question has dependencies, so I cannot drop the primary key index unless I remove all the FK constraints. As a result, adding the identity column is slow.
Are there other ways to improve the speed of adding an identity column, so that client downtime is minimal?
or
Is there a way to add an identity column without locking the table, so that the table can be accessed, or at least queried?
The database is SQL Server 2005 Standard Edition.
Adding a new column to a table will acquire a Sch-M (schema modification) lock, which prevents all access to the table for the duration of the operation.
You may get some benefit from switching the database into bulk-logged or simple mode for the duration of the operation, but of course, do so only if you're aware of the effects this will have on your backup / restore strategy.
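A sketch of what that switch might look like, assuming the database is named MyDatabase (a placeholder) and normally runs in the FULL recovery model. Whether this particular operation actually gets faster would need to be tested.

```sql
-- Temporarily reduce logging for the duration of the change.
ALTER DATABASE MyDatabase SET RECOVERY BULK_LOGGED;

-- The statement from the question.
ALTER TABLE MyTable ADD MyTableSeqId BIGINT IDENTITY(1,1);

-- Switch back once the column has been added.
ALTER DATABASE MyDatabase SET RECOVERY FULL;

-- Take a log backup afterwards, as your backup strategy requires.
-- If SIMPLE is used instead of BULK_LOGGED, the log backup chain is broken
-- and a full or differential backup is needed to restart it.
```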

primary key datatype in sql server database

I see that after installing the ASP.NET membership tables, they use the data type "uniqueidentifier" for all of the primary key fields.
I have been using "int" data type and doing increment by one on inserts and declaring the column as IDENTITY.
Are there any particular benefits to using the uniqueidentifier data type compared to my current model of using int and auto increments on new inserts?
I personally use INT IDENTITY for most of my primary and clustering keys. I think it's rather unfortunate that Microsoft chose to use Uniqueidentifier in their ASP.NET membership tables - lots of people take that database as a "template" for others.
You need to keep apart the primary key, which is a logical construct - it uniquely identifies your rows, and it has to be unique, stable and NOT NULL. A GUID works well for a primary key, too - since it's guaranteed to be unique. A GUID as your primary key is a good choice if you use SQL Server replication, since in that case, you need a uniquely identifying GUID column anyway.
The clustering key in SQL Server is a physical construct that is used for the physical ordering of the data, and it is a lot more difficult to get right. According to the Queen of Indexing on SQL Server, Kimberly Tripp, a good clustering key should be unique, stable, as narrow as possible, and ideally ever-increasing (which an INT IDENTITY is).
See her articles on indexing here:
GUIDs as PRIMARY KEYs and/or the clustering key
The Clustered Index Debate Continues...
Ever-increasing clustering key - the Clustered Index Debate..........again!
and also see Jimmy Nilsson's The Cost of GUIDs as Primary Key
A GUID is a really bad choice for a clustering key, since it's wide, totally random, and thus leads to bad index fragmentation and poor performance. Also, the clustering key column(s) are stored in each and every entry of each and every non-clustered (additional) index, so you really want to keep it small - a GUID is 16 bytes vs. 4 bytes for an INT, and with several non-clustered indices and several million rows, this makes a HUGE difference.
In SQL Server, your primary key is by default your clustering key - but it doesn't have to be. You can easily use a GUID as your NON-Clustered primary key, and an INT IDENTITY as your clustering key - it just takes a bit of being aware of it.
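A minimal sketch of that split between primary key and clustering key, using a hypothetical dbo.Member table (the table, columns, and constraint names are assumptions):

```sql
CREATE TABLE dbo.Member (
    -- Logical key: a GUID, enforced by a NON-clustered primary key.
    MemberId UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_Member_MemberId DEFAULT NEWID()
        CONSTRAINT PK_Member PRIMARY KEY NONCLUSTERED,
    -- Physical ordering: narrow and ever-increasing.
    MemberSeq INT IDENTITY(1,1) NOT NULL,
    UserName NVARCHAR(100) NOT NULL
);

-- The clustering key lives on the identity column, not on the primary key.
CREATE UNIQUE CLUSTERED INDEX CIX_Member_MemberSeq
    ON dbo.Member (MemberSeq);
```

The "being aware of it" part is the explicit NONCLUSTERED keyword: without it, the primary key constraint would take the clustered index by default.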
uniqueidentifier solves problems with replication. It's possible for two replicated versions of a table to insert rows with the same integer value for the key, but it's effectively impossible for them to both insert the same uniqueidentifier, assuming the value of the column is set with NEWID().
I have been using "int" data type and doing increment by one on inserts.
In SQL Server the way to get an auto incrementing column is to use IDENTITY. I'm not sure if that is what you meant by the above so I thought I would clarify this just in case.
The advantage of using an INT column with IDENTITY is that it is smaller so joins will be slightly faster. But for most purposes it won't be a significant improvement. There are other things you should worry about first, like choosing the correct indexes for your tables.

is it good to have primary keys as Identity field

I have read a lot of articles about whether we should have primary keys that are identity columns, but I'm still confused.
There are advantages to making columns identity columns, as it gives better performance in joins and provides data consistency. But there is a major drawback associated with identity: when an INSERT statement fails, the IDENTITY value still increases. If a transaction is rolled back, the new IDENTITY column value isn't rolled back, so we end up with gaps in the sequence. I can use GUIDs (by using NEWSEQUENTIALID) but that reduces performance.
Gaps should not matter: the identity column is internal and not for end user usage or recognition.
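The gap behaviour described in the question is easy to reproduce with a temporary table. This is just a small demonstration; #IdentityDemo is a throwaway name.

```sql
CREATE TABLE #IdentityDemo (
    Id INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Val VARCHAR(10) NOT NULL
);

INSERT INTO #IdentityDemo (Val) VALUES ('a');   -- gets Id 1

BEGIN TRANSACTION;
INSERT INTO #IdentityDemo (Val) VALUES ('b');   -- consumes Id 2
ROLLBACK TRANSACTION;                           -- row is undone, Id 2 is not reused

INSERT INTO #IdentityDemo (Val) VALUES ('c');   -- gets Id 3

-- Returns Ids 1 and 3; the rolled-back value 2 is a gap.
SELECT Id, Val FROM #IdentityDemo ORDER BY Id;

DROP TABLE #IdentityDemo;
```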
GUIDs will kill performance, even sequential ones, because of the 16 byte width.
An identity column should be chosen to respect the physical implementation after modelling your data and working out what your natural keys are. That is, the chosen natural key is the logical key but you choose a surrogate key (identity) because you know how the engine works.
Or you use an ORM and let the client tail wag the database dog...
For all practical purposes, integers are ideal for primary keys and auto increment is a perfect way to generate them. As long as your PK is meaningless (surrogate) it will be protected from the creativity of your customers and serve its main purpose (to identify a row in a table) just fine. Indexes are packed, joins are as fast as it gets, and it is easy to partition tables.
If you happen to need GUID, that's fine too; however, think auto-increment integer first.
I would say that it depends on your needs. We use only Guids as primary keys (with the default set to NewID) because we develop a distributed system with many Sql Server instances, so we have to be sure that every Sql Server generates unique primary key values.
But when using a Guid column as PK, be sure not to use it as your clustered index (thanks to marc_s for the link)
Advantage of the Guid type:
You can create unique values on different locations without synchronization
Disadvantage:
It's a large datatype (16 bytes) and needs significantly more space
It creates index fragmentation (at least when using the newid() function)
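To see how much fragmentation a given table actually has, something like the following query can be used. dbo.MyTable is a placeholder name; the dynamic management function and the ALTER INDEX syntax are available from SQL Server 2005 onwards.

```sql
-- Report fragmentation for every index on one table in the current database.
SELECT
    i.name AS IndexName,
    ps.index_type_desc,
    ps.avg_fragmentation_in_percent,
    ps.page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.MyTable'), NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
    ON i.object_id = ps.object_id
   AND i.index_id = ps.index_id
ORDER BY ps.avg_fragmentation_in_percent DESC;

-- Rebuilding removes the fragmentation, but with NEWID() keys it will
-- build up again as new rows land on random pages.
ALTER INDEX ALL ON dbo.MyTable REBUILD;
```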
Dataconsistency is not an issue with primary keys independent of the datatype because a primary key has to be unique by definition!
I don't believe that an identity column has better join performance. Overall, performance is a matter of having the right indexes. A primary key is a constraint, not an index.
Do you need a primary key of type int with no gaps? Gaps shouldn't normally be a problem.
"yes, it KILLS performance - totally. I went from a legacy system with GUID as PK/CK and 99.5% index fragmentation on a daily basis to using INT IDENTITY - HUGE difference. Hardly any index fragmentation anymore, performance is significantly better. GUIDs as Clustering Index on your SQL Server table are BAD BAD BAD - period."
Might be true, but I see no logical reasoning according to which this leads me to conclude that GUIDs PER SE are also BAD BAD BAD.
Maybe you should consider using other types of indexes on such data. And if your dbms does not offer you a choice between several types of index, then perhaps you should consider getting yourself a better dbms.