SQL Server: big data replication primary key

I know that this subject has been discussed many times, but there are different opinions about it. My scenario is this: I have created a database which is going to be filled with 4 billion records, and each year between 1 and 2 million records will be added.
We have servers in the USA and Europe, and we replicate the database to keep it in sync across these servers, much as Facebook does with replication.
My question is this: what should I use as the primary key of the tables - BigInt or Uniqueidentifier? Or does it make no difference for the replication which one I use?
Should I create a non-clustered uniqueidentifier primary key and then add another clustered bigInt column?
Or
Should I create a clustered bigint primary key?

Without a doubt, go with a Uniqueidentifier.
Do not add a bigint column, you don't need it.
If you use merge replication and you don't have a uniqueidentifier then the server is going to add that column anyway.
By using a GUID, you now have the capability of setting up a multi-master DB architecture. If you use a bigint as an identity field then you either force yourself to only use a single master (to control the bigint) or you then have to come up with a scheme to keep multiple servers from colliding with each other. Further by using GUIDs you get away from guessable IDs - which is generally a good thing.
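One common shape for such a collision-avoidance scheme is to give every server its own starting value and a shared step, which SQL Server can express through the seed and increment of IDENTITY(seed, increment). The following is only a minimal Python sketch of that idea; the function name identity_values and the two-server setup are illustrative assumptions, not something taken from the question.

```python
def identity_values(seed, increment, count):
    """Return the first `count` values an IDENTITY(seed, increment) column would hand out."""
    return [seed + i * increment for i in range(count)]


SERVER_COUNT = 2  # assumed: one server in the USA, one in Europe

# Server k (1-based) uses IDENTITY(k, SERVER_COUNT), so the ranges interleave.
usa = identity_values(1, SERVER_COUNT, 5)      # 1, 3, 5, 7, 9
europe = identity_values(2, SERVER_COUNT, 5)   # 2, 4, 6, 8, 10

print(usa, europe)
print(set(usa) & set(europe))  # empty set: the two servers never hand out the same id
```

The cost of this kind of scheme is that the step has to be agreed on up front, so adding a server later means re-planning the ranges. That coordination is what a GUID key avoids.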
My own testing in the hundred million record range with millions added / deleted daily showed no performance drop when using GUIDs vs ints for ids.
Final note - most places base64 encode the guid when calling web services or if it is going to be displayed anywhere - like in the address bar.
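As a rough illustration of that last point, here is a small Python sketch that turns the 16 raw bytes of a GUID into a 22-character URL-safe string and back. The helper names are made up for this example. Note that Python's uuid.bytes uses a different byte order than .NET's Guid.ToByteArray(), so both ends of a real system would have to agree on one layout.

```python
import base64
import uuid


def guid_to_base64(guid):
    """Encode a GUID's 16 raw bytes as URL-safe base64 without '=' padding."""
    return base64.urlsafe_b64encode(guid.bytes).rstrip(b"=").decode("ascii")


def base64_to_guid(text):
    """Reverse guid_to_base64 by restoring the padding before decoding."""
    padded = text + "=" * (-len(text) % 4)
    return uuid.UUID(bytes=base64.urlsafe_b64decode(padded))


guid = uuid.uuid4()                  # a random GUID, comparable to NEWID()
short = guid_to_base64(guid)
print(len(str(guid)), len(short))    # 36 22
assert base64_to_guid(short) == guid
```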

I would argue just the other option: I would try to AVOID uniqueidentifier columns - MOST DEFINITELY as your clustering key!
The clustering key is the most replicated data structure in SQL Server - and with millions and millions of rows, it does make a huge difference whether your clustering key is 8 or 16 bytes in size. Not to mention the number of page splits a uniqueidentifier clustering key would introduce - which you can totally avoid with a clustering key of BIGINT type.
If you're really interested, read these articles by Kimberly Tripp - the "Queen of Indexing" in the SQL Server space - which clearly show just how bad and counter-productive a GUID as your clustering key can be:
GUIDs as PRIMARY KEYs and/or the clustering key
The clustered index debate continues...
More considerations for the clustering key – the clustered index debate continues!
Ever-increasing clustering key – the Clustered Index Debate……….again!
Disk space is cheap .....

Related

Using UNIQUE IDENTIFIERS as PRIMARY KEY in a High-Availability, Active-Active Environment

Our company is moving our databases to a High-Availability, Active-Active (HA/AA) environment. The middleware tool we chose makes it VERY painful to migrate Identity columns between live instances. As such, I and others want to move to using uniqueidentifiers (i.e. GUIDs) as primary keys for all new tables.
CONSIDERATIONS:
Some tables will be quite shallow
Other tables will grow extremely large (over time)
Many legacy tables already contain millions of records
THE PROPOSED SOLUTION:
Use sequential uniqueidentifier as primary key for tables
THE CONCERN:
The effect of having millions of records using a uniqueidentifier as primary key for tables.
QUESTION: In general, how well have uniqueidentifiers performed for you in these situations?
UPDATE:
By primary key I mean surrogate.
GUID may seem to be a natural choice for your primary key - and if you really must, you could probably argue to use it for the PRIMARY KEY of the table. What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default, unless you specifically tell it not to.
You really need to keep two issues apart:
the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.
the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.
By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way! I've personally seen massive performance gains when breaking up the previous GUID-based primary / clustered key into two separate keys - the primary (logical) key on the GUID, and the clustering (ordering) key on a separate INT IDENTITY(1,1) column.
As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.
Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so.
Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT with 2+ billion rows should be sufficient for the vast majority of tables - and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.
Quick calculation - using INT vs. GUID as primary and clustering key:
Base Table with 1'000'000 rows (3.81 MB vs. 15.26 MB)
6 nonclustered indexes (22.89 MB vs. 91.55 MB)
TOTAL: 26.7 MB vs. 106.8 MB - and that's just on a single table!
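The arithmetic behind this calculation can be reproduced with a few lines of Python. This is a sketch under simplifying assumptions: it counts only the key bytes (4 for an INT, 16 for a GUID), ignores row and page overhead, and treats a megabyte as 1024 * 1024 bytes, which is what the per-table figures correspond to. The function name is invented for the example.

```python
MB = 1024 * 1024  # binary megabytes


def key_storage_mb(rows, key_bytes, nonclustered_indexes):
    """Return (base table, nonclustered indexes, total) key storage in MB."""
    base = rows * key_bytes / MB
    # Every entry of every nonclustered index carries the clustering key as well.
    indexes = base * nonclustered_indexes
    return base, indexes, base + indexes


for name, key_bytes in (("INT", 4), ("GUID", 16)):
    base, indexes, total = key_storage_mb(1_000_000, key_bytes, 6)
    print(f"{name:4}  base {base:6.2f} MB  indexes {indexes:6.2f} MB  total {total:7.2f} MB")
```

Changing the row count or the number of indexes shows how quickly the gap grows on larger tables.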
Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.
GUIDs as PRIMARY KEY and/or clustered key
The clustered index debate continues
Ever-increasing clustering key - the Clustered Index Debate..........again!
Disk space is cheap - that's not the point!
Read about the GUIDs vs INT debate here https://blogs.msdn.microsoft.com/sqlserverfaq/2010/05/27/guid-vs-int-debate/
A sequential GUID is not as much worse than INT/BIGINT as many folks think. It has advantages over BIGINT when it comes to merging data, which is what the question is mostly about.
A sequential GUID also has stable fragmentation, and loses only a bit of performance against BIGINT.
Note: Identity columns have some issues when unexpected switching between the nodes occurs.
http://sqlblog.com/blogs/kalen_delaney/archive/2014/06/17/lost-identity.aspx
I personally had to use trace flag 272 (-T272) in such an environment (AlwaysOn) because of gaps appearing in the IDs. Sometimes the business logic depends on the Identity keys.
So the question really does become a matter of debate.
However, it DEPENDS!

What are the advantages / disadvantages in using an int or newsequentialid for a clustered table primary key?

I have a new application and I would like to use a GUID for the primary key of my clustered table. I heard there were disadvantages to using newid(), so I would like to know more about newsequentialid().
Specifically, is there any advantage in using an int over a GUID generated with newsequentialid()?
GUID may seem to be a natural choice for your primary key - and if you really must, you could probably argue to use it for the PRIMARY KEY of the table. What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default, unless you specifically tell it not to.
You really need to keep two issues apart:
the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.
the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.
By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way! I've personally seen massive performance gains when breaking up the previous GUID-based primary / clustered key into two separate keys - the primary (logical) key on the GUID, and the clustering (ordering) key on a separate INT IDENTITY(1,1) column.
As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.
Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so.
Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT with 2+ billion rows should be sufficient for the vast majority of tables - and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.
Quick calculation - using INT vs. GUID as primary and clustering key:
Base Table with 1'000'000 rows (3.81 MB vs. 15.26 MB)
6 nonclustered indexes (22.89 MB vs. 91.55 MB)
TOTAL: 26.7 MB vs. 106.8 MB - and that's just on a single table!
Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.
GUIDs as PRIMARY KEY and/or clustered key
The clustered index debate continues
Ever-increasing clustering key - the Clustered Index Debate..........again!
Disk space is cheap - that's not the point!
Marc

Deciding on a primary key according to value size in SQL Server

I want to ask a question about optimizing SQL Server performance. Assume I have an entity - say Item - and I must assign a primary key for it. It has several columns, two of which are expected to be unique; one of them is expected to be longer than the other by tens of characters.
How should I decide on the primary key?
Should one of them be PK, if so which one, or both, or should I create an Identity number as PK? This is important for me because the entity "Item" would have relations with some other entities and I think the complexity of PK would affect the performance of SQL Server queries.
Personally, I would go with an IDENTITY primary key, with unique constraints on both of the mentioned unique columns and indexes for additional lookups.
You have to remember that by default SQL Server creates the primary key as the clustered index, which affects how the data is stored on disk. If new items arrive in random key order, there could be a lot of fragmentation with either of those columns as the primary key.
Also, unless cascades and foreign keys are switched on, you would have to manually maintain the relational integrity of the data (unless you use IDENTITY).
Well, the primary key is really only used to uniquely identify each row - so the only requirements for it are: it has to be unique and typically also should not contain NULL.
Anything else is most likely more relevant for the clustering key in SQL Server - the column (or set of columns) by which the data is physically ordered on disk. By default, the primary key is also the clustering key in SQL Server.
The clustering key is the most important choice in SQL Server because it has far reaching performance implications. A good clustering key is
narrow
unique
stable
if possible ever-increasing
It has to be unique so that it can be added to each and every single nonclustered index for lookup into the actual data tables - if you pick a non-unique column (or set of columns), SQL Server will add a 4-byte "uniquefier" for you.
It should be as narrow as possible, since it's stored in a lot of places. Try to stick to 4 bytes for an INT or 8 bytes for a BIGINT - avoid long and variable length VARCHAR columns since those are both too wide, and the variable length also carries additional overhead. Because of this, sets of columns are also rather rarely a good choice.
The clustering key should be stable - value shouldn't change over time - since every time a value changes, potentially a lot of index entries (in the clustered index itself, and every single nonclustered index, too) need to be updated which causes a lot of unnecessary overhead.
And if it's ever-increasing (like an INT IDENTITY), you also can avoid most page splits - an extremely expensive and involved procedure that happens if you use random values (like GUIDs) as your clustering key.
So in brief: an INT IDENTITY is ideal - GUIDs, variable length strings, or combinations of columns are typically less of a good choice.
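The page-split argument can be made tangible with a toy simulation. The Python sketch below is not how SQL Server is implemented. It assumes a fixed number of rows per page (100, chosen arbitrarily), uses random 128-bit integers as a stand-in for GUIDs, and models only two behaviours: appending a fresh page when a key lands beyond the end of the index, and a 50/50 split when a key lands inside a full page.

```python
import bisect
import random

PAGE_CAPACITY = 100  # rows per page; an arbitrary illustrative value


def simulate_inserts(keys, capacity=PAGE_CAPACITY):
    """Insert keys into a toy clustered index; return (page_count, mid_index_splits)."""
    pages = [[]]       # each page is a sorted list of keys
    first_keys = []    # first_keys[i] is the lowest key on pages[i + 1]
    splits = 0
    for key in keys:
        idx = bisect.bisect_right(first_keys, key)   # page whose range holds the key
        page = pages[idx]
        if len(page) >= capacity:
            if idx == len(pages) - 1 and key > page[-1]:
                # Key goes past the end of the index: start a new page, move nothing.
                pages.append([key])
                first_keys.append(key)
                continue
            # Key belongs inside a full page: move the upper half to a new page.
            half = len(page) // 2
            new_page = page[half:]
            del page[half:]
            pages.insert(idx + 1, new_page)
            first_keys.insert(idx, new_page[0])
            splits += 1
            if key >= new_page[0]:
                page = new_page
        bisect.insort(page, key)
    return len(pages), splits


ROWS = 10_000
rng = random.Random(42)  # fixed seed so runs are repeatable
sequential_keys = range(ROWS)                              # like INT IDENTITY
random_keys = [rng.getrandbits(128) for _ in range(ROWS)]  # GUID-like random values

print("sequential:", simulate_inserts(sequential_keys))
print("random:    ", simulate_inserts(random_keys))
```

In this model the ever-increasing keys fill every page completely and never trigger a split, while the random keys split pages repeatedly and leave them partly empty, so the same number of rows occupies more pages.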
Choose the one you will use to identify the records in queries and joins to other tables. Size is relative: it is a consideration, but usually not an issue, since the PK will be indexed and the other unique column can also make use of a unique index.
A uniqueidentifier, for example, is 16 bytes wide (36 characters in its string representation) and performs fine as a primary key under the majority of circumstances.

Using a SQL Server FILESTREAM GUID as a primary key

I am creating a simple table called Photo to store photos for people/groups defined in a User table. I'm using Microsoft SQL Server's FILESTREAM feature, since all other user data is already stored in SQL Server, and it makes more sense to me than programming a separate way to manually retrieve objects from the disk when they're directly related to entries in the database.
Each user can have only one photo associated with them at a time (for now, but this could change in the future), and FILESTREAM requires a GUID column to reference the files it stores to disk, so this is the model I've come up with for Photo:
UserID int NOT NULL UNIQUE
PhotoID uniqueidentifier ROWGUIDCOL NOT NULL
PhotoBitmap varbinary(MAX) FILESTREAM NULL
My question is (if this model is correct for my application), should I use the PhotoID as the primary key, seeing as how it's already unique and is required? It seems to me that it would be simpler than creating a separate INT column just for the primary key, but I don't know if it is "correct".
I personally use INT IDENTITY for most of my primary and clustering keys.
You need to keep apart the primary key, which is a logical construct - it uniquely identifies your rows, it has to be unique and stable and NOT NULL. A GUID works well for a primary key, too - since it's guaranteed to be unique. A GUID as your primary key is a good choice if you use SQL Server replication, since in that case, you need a uniquely identifying GUID column anyway.
The clustering key in SQL Server is a physical construct that is used for the physical ordering of the data, and is a lot more difficult to get right. Typically, the Queen of Indexing on SQL Server, Kimberly Tripp, also requires a good clustering key to be unique, stable, as narrow as possible, and ideally ever-increasing (which an INT IDENTITY is).
See her articles on indexing here:
GUIDs as PRIMARY KEYs and/or the clustering key
The Clustered Index Debate Continues...
Ever-increasing clustering key - the Clustered Index Debate..........again!
Disk space is cheap - that's not the point!
and also see Jimmy Nilsson's The Cost of GUIDs as Primary Key
A GUID is a really bad choice for a clustering key, since it's wide, totally random, and thus leads to bad index fragmentation and poor performance. Also, the clustering key column(s) are stored in each and every entry of each and every non-clustered (additional) index, so you really want to keep it small - a GUID is 16 bytes vs. 4 bytes for an INT, and with several non-clustered indices and several million rows, this makes a HUGE difference.
In SQL Server, your primary key is by default your clustering key - but it doesn't have to be. You can easily use a GUID as your NON-Clustered primary key, and an INT IDENTITY as your clustering key - it just takes a bit of being aware of it.

primary key datatype in sql server database

I see that after installing the ASP.NET membership tables, they use the data type "uniqueidentifier" for all of the primary key fields.
I have been using the "int" data type, incrementing by one on inserts and declaring the column as IDENTITY.
Are there any particular benefits to using the uniqueidentifier data type compared to my current model of using int and auto-increment on new inserts?
I personally use INT IDENTITY for most of my primary and clustering keys. I think it's rather unfortunate that Microsoft chose to use Uniqueidentifier in their ASP.NET membership tables - lots of people take that database as a "template" for other.....
You need to keep apart the primary key, which is a logical construct - it uniquely identifies your rows, it has to be unique and stable and NOT NULL. A GUID works well for a primary key, too - since it's guaranteed to be unique. A GUID as your primary key is a good choice if you use SQL Server replication, since in that case, you need a uniquely identifying GUID column anyway.
The clustering key in SQL Server is a physical construct that is used for the physical ordering of the data, and is a lot more difficult to get right. Typically, the Queen of Indexing on SQL Server, Kimberly Tripp, also requires a good clustering key to be unique, stable, as narrow as possible, and ideally ever-increasing (which an INT IDENTITY is).
See her articles on indexing here:
GUIDs as PRIMARY KEYs and/or the clustering key
The Clustered Index Debate Continues...
Ever-increasing clustering key - the Clustered Index Debate..........again!
and also see Jimmy Nilsson's The Cost of GUIDs as Primary Key
A GUID is a really bad choice for a clustering key, since it's wide, totally random, and thus leads to bad index fragmentation and poor performance. Also, the clustering key column(s) are stored in each and every entry of each and every non-clustered (additional) index, so you really want to keep it small - a GUID is 16 bytes vs. 4 bytes for an INT, and with several non-clustered indices and several million rows, this makes a HUGE difference.
In SQL Server, your primary key is by default your clustering key - but it doesn't have to be. You can easily use a GUID as your NON-Clustered primary key, and an INT IDENTITY as your clustering key - it just takes a bit of being aware of it.
uniqueidentifier solves problems with replication. It's possible for two replicated versions of a table to insert rows with the same integer value for the key, but it's practically impossible for them to both insert the same uniqueidentifier, assuming the value of the column is set with newid().
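The difference can be illustrated with two simulated replicas in Python. This is only a sketch: the replicas are plain sets, the row count is arbitrary, and uuid.uuid4() stands in for a column default of newid().

```python
import uuid

ROWS_PER_REPLICA = 1000  # arbitrary illustrative size

# Each replica runs its own IDENTITY(1, 1) counter, unaware of the other.
replica_a_ints = set(range(1, ROWS_PER_REPLICA + 1))
replica_b_ints = set(range(1, ROWS_PER_REPLICA + 1))

# Each replica generates random GUIDs locally.
replica_a_guids = {uuid.uuid4() for _ in range(ROWS_PER_REPLICA)}
replica_b_guids = {uuid.uuid4() for _ in range(ROWS_PER_REPLICA)}

int_clashes = replica_a_ints & replica_b_ints
guid_clashes = replica_a_guids & replica_b_guids

print(len(int_clashes))    # 1000: every integer key exists on both replicas
print(len(guid_clashes))   # 0 in practice; a clash is astronomically unlikely
```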
I have been using "int" data type and doing increment by one on inserts.
In SQL Server the way to get an auto incrementing column is to use IDENTITY. I'm not sure if that is what you meant by the above so I thought I would clarify this just in case.
The advantage of using an INT column with IDENTITY is that it is smaller so joins will be slightly faster. But for most purposes it won't be a significant improvement. There are other things you should worry about first, like choosing the correct indexes for your tables.