Using a SQL Server FILESTREAM GUID as a primary key - sql

I am creating a simple table called Photo to store photos for people/groups defined in a User table. I'm using Microsoft SQL Server's FILESTREAM feature, since all other user data is already stored in SQL Server, and it makes more sense to me than programming a separate way to manually retrieve objects from the disk when they're directly related to entries in the database.
Each user can have only one photo associated with them at a time (for now, but this could change in the future), and FILESTREAM requires a GUID column to reference the files it stores to disk, so this is the model I've come up with for Photo:
UserID int NOT NULL UNIQUE
PhotoID uniqueidentifier ROWGUIDCOL NOT NULL
PhotoBitmap varbinary(MAX) FILESTREAM NULL
My question is (if this model is correct for my application), should I use the PhotoID as the primary key, seeing as how it's already unique and is required? It seems to me that it would be simpler than creating a separate INT column just for the primary key, but I don't know if it is "correct".

I personally use INT IDENTITY for most of my primary and clustering keys.
You need to keep two things apart. The primary key is a logical construct: it uniquely identifies your rows, so it has to be unique, stable, and NOT NULL. A GUID works well for a primary key, too, since it's practically guaranteed to be unique. A GUID as your primary key is a good choice if you use SQL Server replication, since in that case you need a uniquely identifying GUID column anyway.
The clustering key in SQL Server is a physical construct that is used for the physical ordering of the data, and it is a lot more difficult to get right. Kimberly Tripp, the Queen of Indexing on SQL Server, requires a good clustering key to be unique, stable, as narrow as possible, and ideally ever-increasing (which an INT IDENTITY is).
See her articles on indexing here:
GUIDs as PRIMARY KEYs and/or the clustering key
The Clustered Index Debate Continues...
Ever-increasing clustering key - the Clustered Index Debate..........again!
Disk space is cheap - that's not the point!
and also see Jimmy Nilsson's The Cost of GUIDs as Primary Key
A GUID is a really bad choice for a clustering key, since it's wide and totally random, and thus leads to bad index fragmentation and poor performance. Also, the clustering key column(s) are stored in each and every entry of each and every non-clustered (additional) index, so you really want to keep the key small: a GUID is 16 bytes vs. 4 bytes for an INT, and with several non-clustered indexes and several million rows, this makes a HUGE difference.
In SQL Server, your primary key is by default your clustering key - but it doesn't have to be. You can easily use a GUID as your NON-Clustered primary key, and an INT IDENTITY as your clustering key - it just takes a bit of being aware of it.
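As a rough sketch of that split applied to the Photo table from the question: the GUID stays the primary key (nonclustered), and a separate INT IDENTITY column carries the clustered index. The column name PhotoRowID, all constraint and index names, and the NEWID() default are illustrative assumptions, and the script assumes the database already has a FILESTREAM filegroup.

```sql
-- Assumption: the database already has a FILESTREAM filegroup configured.
-- PhotoRowID and all constraint/index names are hypothetical.
CREATE TABLE dbo.Photo
(
    PhotoRowID  INT IDENTITY(1,1) NOT NULL,          -- narrow, ever-increasing clustering key
    PhotoID     UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL
        CONSTRAINT DF_Photo_PhotoID DEFAULT NEWID(), -- GUID column required by FILESTREAM
    UserID      INT NOT NULL,
    PhotoBitmap VARBINARY(MAX) FILESTREAM NULL,
    -- Logical primary key on the GUID, explicitly NOT the clustering key
    CONSTRAINT PK_Photo PRIMARY KEY NONCLUSTERED (PhotoID),
    -- One photo per user, as in the question's model
    CONSTRAINT UQ_Photo_UserID UNIQUE NONCLUSTERED (UserID)
);

-- Physical ordering of the rows follows the INT IDENTITY column
CREATE UNIQUE CLUSTERED INDEX CIX_Photo_PhotoRowID
    ON dbo.Photo (PhotoRowID);
```

The PRIMARY KEY on PhotoID also satisfies FILESTREAM's need for a unique ROWGUIDCOL column, so no extra unique constraint is needed for that.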

SQL Server : big data replication primary key

I know that this subject has been discussed many times, but there are different opinions about it. My scenario is this: I have created a database which is going to be filled with 4 billion records, with between 1 and 2 million records added each year.
We have servers in the USA and Europe, and we replicate the database to keep it in sync across these servers, similar to what Facebook does with replication.
My question is this: what should I use as the primary key of the tables, BigInt or Uniqueidentifier? Or does it not make any difference for replication which one I use?
Should I create a non-clustered uniqueidentifier primary key and then add another clustered bigInt column?
Or
Should I create a clustered bigint primary key?
Without a doubt, go with a Uniqueidentifier.
Do not add a bigint column, you don't need it.
If you use merge replication and you don't have a uniqueidentifier then the server is going to add that column anyway.
By using a GUID, you now have the capability of setting up a multi-master DB architecture. If you use a bigint as an identity field then you either force yourself to only use a single master (to control the bigint) or you then have to come up with a scheme to keep multiple servers from colliding with each other. Further by using GUIDs you get away from guessable IDs - which is generally a good thing.
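The answer does not say which scheme it has in mind for keeping BIGINT identities from colliding. One common approach, shown here purely as an illustration, is to give each master its own seed and a shared increment so the generated ranges interleave. The table name dbo.Event and its columns are hypothetical.

```sql
-- Hypothetical two-master setup: run each CREATE TABLE on its own server.

-- On server A: generates 1, 3, 5, ...
CREATE TABLE dbo.Event
(
    EventID BIGINT IDENTITY(1, 2) NOT NULL
        CONSTRAINT PK_Event PRIMARY KEY CLUSTERED,
    Payload NVARCHAR(200) NOT NULL
);

-- On server B: generates 2, 4, 6, ...
-- CREATE TABLE dbo.Event
-- (
--     EventID BIGINT IDENTITY(2, 2) NOT NULL
--         CONSTRAINT PK_Event PRIMARY KEY CLUSTERED,
--     Payload NVARCHAR(200) NOT NULL
-- );
```

The drawback the answer hints at is visible here: the increment fixes the number of masters up front, so adding a third server later means re-planning the ranges.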
My own testing in the hundred million record range with millions added / deleted daily showed no performance drop when using GUIDs vs ints for ids.
Final note - most places base64 encode the guid when calling web services or if it is going to be displayed anywhere - like in the address bar.
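For illustration only, here is one way to produce a base64 form of a GUID inside T-SQL, using the XQuery xs:base64Binary function. The variable names are made up, and in practice this encoding is often done in the application layer instead.

```sql
-- Sketch: encode the 16 bytes of a GUID as base64 text.
DECLARE @id  UNIQUEIDENTIFIER = NEWID();
DECLARE @bin VARBINARY(16)    = CAST(@id AS VARBINARY(16));

SELECT @id AS guid_text,
       -- 16 bytes encode to 24 base64 characters, including padding
       CAST(N'' AS XML).value('xs:base64Binary(sql:variable("@bin"))',
                              'VARCHAR(24)') AS guid_base64;
```

Standard base64 can contain + and / characters, so a value destined for a URL usually needs a URL-safe variant or additional escaping.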
I would argue just the other option: I would try to AVOID uniqueidentifier columns - MOST DEFINITELY as your clustering key!
The clustering key is the most replicated data structure in SQL Server, and with millions and millions of rows, it does make a huge difference whether your clustering key is 8 or 16 bytes in size. Not to mention the number of page splits a uniqueidentifier clustering key would introduce, which you can totally avoid with a clustering key of BIGINT type.
If you're really interested, you must read all these articles from Kimberly Tripp, the "Queen of Indexing" in the SQL Server space, that clearly show just how bad and counter-productive a GUID as your clustering key can be:
GUIDs as PRIMARY KEYs and/or the clustering key
The clustered index debate continues...
More considerations for the clustering key – the clustered index debate continues!
Ever-increasing clustering key – the Clustered Index Debate……….again!
Disk space is cheap .....
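If you want to check the fragmentation argument against your own data rather than take it on faith, a query along these lines reports fragmentation per index. The table name dbo.YourTable is a placeholder.

```sql
-- Sketch: fragmentation and size per index for one table in the current database.
-- dbo.YourTable is a placeholder name.
SELECT i.name                           AS index_name,
       ips.index_type_desc,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.YourTable'), NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id
 AND i.index_id  = ips.index_id
ORDER BY ips.avg_fragmentation_in_percent DESC;
```

Running it on two otherwise identical test tables, one clustered on a GUID and one on a BIGINT IDENTITY, after the same insert workload gives a direct comparison.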

Using UNIQUE IDENTIFIERS as PRIMARY KEY in a High-Availability, Active-Active Environment

Our company is moving our databases to a High-Availability, Active-Active (HA/AA) environment. The middleware tool we chose makes it VERY painful to migrate Identity columns between live instances. As such, I and other folks want to move to using uniqueidentifiers (i.e. GUIDs) for primary keys for all new tables.
CONSIDERATIONS:
Some tables will be quite shallow
Other tables will grow extremely large (over time)
Many legacy tables already contain millions of records
THE PROPOSED SOLUTION:
Use sequential uniqueidentifier as primary key for tables
THE CONCERN:
The effect of having millions of records using a uniqueidentifier as primary key for tables.
QUESTION: In general, how well have uniqueidentifier performed for you in these situations?
UPDATE:
By primary key I mean surrogate.
GUID may seem to be a natural choice for your primary key - and if you really must, you could probably argue to use it for the PRIMARY KEY of the table. What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default, unless you specifically tell it not to.
You really need to keep two issues apart:
the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.
the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.
By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way! I've personally seen massive performance gains when breaking up the previous GUID-based primary / clustered key into two separate keys - the primary (logical) key on the GUID, and the clustering (ordering) key on a separate INT IDENTITY(1,1) column.
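A minimal sketch of that kind of split, assuming a hypothetical table dbo.Orders that started out with a clustered primary key on a GUID column. All object names are invented for the example, and on a real table any foreign keys referencing the primary key would have to be dropped and recreated around these steps.

```sql
-- Starting point (hypothetical): GUID as primary AND clustering key
CREATE TABLE dbo.Orders
(
    OrderGuid UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED,
    Amount    DECIMAL(10, 2) NOT NULL
);

-- 1. Drop the clustered primary key (the table becomes a heap)
ALTER TABLE dbo.Orders DROP CONSTRAINT PK_Orders;

-- 2. Add a narrow, ever-increasing column for the clustering key
ALTER TABLE dbo.Orders ADD OrderRowID INT IDENTITY(1,1) NOT NULL;

-- 3. Cluster on the new column
CREATE UNIQUE CLUSTERED INDEX CIX_Orders_OrderRowID
    ON dbo.Orders (OrderRowID);

-- 4. Re-create the primary key on the GUID, this time nonclustered
ALTER TABLE dbo.Orders
    ADD CONSTRAINT PK_Orders PRIMARY KEY NONCLUSTERED (OrderGuid);
```

On a large table each step rewrites or scans a lot of data, so this is something to test and schedule rather than run ad hoc.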
As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.
Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so.
Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT with 2+ billion rows should be sufficient for the vast majority of tables - and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.
Quick calculation - using INT vs. GUID as primary and clustering key:
Base Table with 1'000'000 rows (3.8 MB vs. 15.26 MB)
6 nonclustered indexes (22.89 MB vs. 91.55 MB)
TOTAL: about 26.7 MB vs. 106.8 MB - and that's just on a single table!
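The arithmetic behind these figures can be reproduced with a short query. It counts only the key bytes themselves (4 for INT, 16 for UNIQUEIDENTIFIER) and ignores row and page overhead, which is the same simplification the calculation above makes. The variable names are illustrative.

```sql
-- Sketch: bytes spent on the clustering key alone, in MB (1 MB = 1,048,576 bytes).
DECLARE @rows         BIGINT         = 1000000;
DECLARE @nc_indexes   INT            = 6;
DECLARE @bytes_per_mb DECIMAL(18, 2) = 1048576;

SELECT k.key_type,
       -- key stored once per row in the base table
       CAST(@rows * k.key_bytes / @bytes_per_mb AS DECIMAL(10, 2)) AS base_table_mb,
       -- key repeated in every entry of every nonclustered index
       CAST(@rows * k.key_bytes * @nc_indexes / @bytes_per_mb AS DECIMAL(10, 2)) AS nonclustered_mb,
       CAST(@rows * k.key_bytes * (1 + @nc_indexes) / @bytes_per_mb AS DECIMAL(10, 2)) AS total_mb
FROM (VALUES ('INT', 4), ('UNIQUEIDENTIFIER', 16)) AS k (key_type, key_bytes);
```

Changing @rows and @nc_indexes shows how quickly the gap grows with table size and index count.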
Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.
GUIDs as PRIMARY KEY and/or clustered key
The clustered index debate continues
Ever-increasing clustering key - the Clustered Index Debate..........again!
Disk space is cheap - that's not the point!
Read about the GUIDs vs INT debate here https://blogs.msdn.microsoft.com/sqlserverfaq/2010/05/27/guid-vs-int-debate/
A sequential GUID is not as much worse than INT/BIGINT as many folks think. It has advantages over BIGINT when it comes to merging data, which is what the question is really about.
A sequential GUID also has stable fragmentation, and loses only a bit of performance against BIGINT.
Note: identity columns have some issues when unexpected switching between the nodes occurs.
http://sqlblog.com/blogs/kalen_delaney/archive/2014/06/17/lost-identity.aspx
I personally had to use trace flag 272 (startup parameter -T272) in such an environment (AlwaysOn) because of gaps appearing in the IDs. Sometimes the business logic can be tied to the identity keys.
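For reference, a few statements that relate to this identity behaviour. Trace flag 272 turns off identity pre-allocation for the whole instance; the database-scoped IDENTITY_CACHE option assumes SQL Server 2017 or later and is not something the answer itself mentions. The table name is a placeholder.

```sql
-- Report whether trace flag 272 is enabled globally
DBCC TRACESTATUS (272, -1);

-- Assumption: SQL Server 2017 or later. Per-database alternative to the trace flag.
ALTER DATABASE SCOPED CONFIGURATION SET IDENTITY_CACHE = OFF;

-- Show the current identity value of a table without changing it
-- (dbo.YourTable is a placeholder name)
DBCC CHECKIDENT (N'dbo.YourTable', NORESEED);
```

Neither setting makes IDENTITY gap-free in general; rolled-back inserts still consume values.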
So the question becomes really debate-like.
However, it DEPENDS!

What are the advantages / disadvantages in using an int or newsequentialid for a clustered table primary key?

I have a new application and I would like to use a GUID for the primary key of my clustered table. I heard there are disadvantages to using newid(), so I would like to know more about newsequentialid().
Specifically, is there any advantage in using an int over a GUID generated with newsequentialid()?
GUID may seem to be a natural choice for your primary key - and if you really must, you could probably argue to use it for the PRIMARY KEY of the table. What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default, unless you specifically tell it not to.
You really need to keep two issues apart:
the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.
the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.
By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way! I've personally seen massive performance gains when breaking up the previous GUID-based primary / clustered key into two separate keys - the primary (logical) key on the GUID, and the clustering (ordering) key on a separate INT IDENTITY(1,1) column.
As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.
Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so.
Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT with 2+ billion rows should be sufficient for the vast majority of tables - and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.
Quick calculation - using INT vs. GUID as primary and clustering key:
Base Table with 1'000'000 rows (3.8 MB vs. 15.26 MB)
6 nonclustered indexes (22.89 MB vs. 91.55 MB)
TOTAL: about 26.7 MB vs. 106.8 MB - and that's just on a single table!
Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.
GUIDs as PRIMARY KEY and/or clustered key
The clustered index debate continues
Ever-increasing clustering key - the Clustered Index Debate..........again!
Disk space is cheap - that's not the point!
Marc

SQL Guid Primary Key Join Performance

I'm currently using GUIDs as a NONCLUSTERED PRIMARY KEY alongside an INT IDENTITY column.
The GUIDs are required to allow offline creation of data and synchronisation - which is how the entire database is populated.
I'm aware of the implications of using a GUID as a clustered primary key, hence the integer clustered index. But does using a GUID as a primary key, and therefore as foreign keys on other tables, have significant performance implications?
Would a better option be to use an integer primary/foreign key, and use the GUID as a client ID which has a UNIQUE INDEX on each table? My concern there is that Entity Framework would require loading the navigation properties in order to get the GUID of the related entity, unless the existing code is significantly altered.
The database/hardware in question is SQL Azure.
You can also create foreign keys against unique key constraints, which then gives you the option to foreign key to the ID identity as an alternative to the Guid.
i.e.
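CREATE TABLE SomeTable
(
    UUID UNIQUEIDENTIFIER NOT NULL,
    ID   INT IDENTITY(1,1) NOT NULL,
    -- The GUID stays the logical primary key, but is not the clustering key
    CONSTRAINT PK_SomeTable PRIMARY KEY NONCLUSTERED (UUID),
    -- A unique constraint is enough for other tables to reference ID
    CONSTRAINT UQ_SomeTable_ID UNIQUE (ID)
)
GO
CREATE TABLE AnotherTable
(
    SomeTableID INT,
    -- Foreign key against the unique constraint instead of the primary key
    FOREIGN KEY (SomeTableID) REFERENCES SomeTable(ID)
)
GO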
Edit
Assuming that your centralized database is a mart, and that only batch ETL is done from the source databases: if you do your ETL directly to the central database (i.e. not via Entity Framework), all your tables will have UUID FKs after re-population from the distributed databases. You'll then need to either map the INT unique keys during ETL or fix them up after the import (which would require a temporary NOCHECK constraint step on the INT FKs).
Once ETL is loaded and INT keys are mapped, I would suggest you ignore / remove the UUID's from your ORM model - you would need to regenerate your EF navigation on the INT keys.
A different solution would be required if you update the central database directly or do continual ETL and do use EF for the ETL itself. In this case, it might be less total I/O just to leave the PK GUID as FKs for RI, drop the INT FK's altogether, and choose other suitable columns for clustering (minimizing page reads).
GUIDs have important implications, yes. Your index is nonclustered, but the index itself will quickly become fragmented, and so will the indexes on the foreign keys. The size is also a concern: 16 bytes instead of 4 bytes for an integer.
You can use the NEWSEQUENTIALID() function as the default value for your column to make it less random and diminish fragmentation.
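A small sketch of what that looks like. NEWSEQUENTIALID() can only be used in a DEFAULT constraint on a uniqueidentifier column, and the default only applies to rows where the client does not supply its own GUID, which matters here because the question generates GUIDs offline. The table and constraint names are hypothetical.

```sql
-- Hypothetical table: GUID primary key with a sequential default
CREATE TABLE dbo.SyncItem
(
    SyncItemID UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_SyncItem_SyncItemID DEFAULT NEWSEQUENTIALID()
        CONSTRAINT PK_SyncItem PRIMARY KEY NONCLUSTERED,
    Name       NVARCHAR(100) NOT NULL
);

-- Rows inserted without a GUID get server-generated sequential values
INSERT INTO dbo.SyncItem (Name) VALUES (N'first'), (N'second'), (N'third');

SELECT SyncItemID, Name
FROM dbo.SyncItem
ORDER BY SyncItemID;
```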
But yes, I'd say that using an integer as your primary key and for references will be the best solution.
Generally speaking, it is preferable to use INT for Primary Key / Foreign Key fields, whether or not these fields are the leading field in Clustered indexes. The issue has to do with JOIN performance: even if you use UNIQUEIDENTIFIER as NonClustered, and even if you use NEWSEQUENTIALID() to reduce fragmentation, it will be more scalable to JOIN between INT fields as the tables get larger. (Please note that I am not saying that PK / FK fields should always be INT, as sometimes there are perfectly valid natural keys to use.)
In your case, given the concern about Entity Framework and generating the GUIDs in the app and not in the DB, go with your alternate suggestion of using INT as the PK / FK fields, but rather than have the UNIQUEIDENTIFIER in all tables, only put it in the main user / customer info table. I would think that you should be able to do a one-time lookup of the customer INT identifier based on the GUID, cache that value, and then use the INT value for all remaining operations. And yes, be sure there is a UNIQUE, NONCLUSTERED index on the GUID field.
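The lookup described above might look roughly like this. The table dbo.Customer, its columns, the index name, and the sample GUID value are all invented for the example.

```sql
-- Hypothetical main table: INT key for joins, GUID only for external identification
CREATE TABLE dbo.Customer
(
    CustomerID   INT IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED,
    CustomerGuid UNIQUEIDENTIFIER NOT NULL,
    Name         NVARCHAR(100) NOT NULL
);

-- Unique, nonclustered index so the GUID lookup is a single seek
CREATE UNIQUE NONCLUSTERED INDEX UX_Customer_CustomerGuid
    ON dbo.Customer (CustomerGuid);

-- One-time lookup: translate the client-side GUID into the INT key,
-- then cache CustomerID in the application and use it for everything else
DECLARE @CustomerGuid UNIQUEIDENTIFIER = '6F9619FF-8B86-D011-B42D-00C04FC964FF';

SELECT CustomerID
FROM dbo.Customer
WHERE CustomerGuid = @CustomerGuid;
```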
That all being said, if your tables will never (and I mean NEVER as opposed to just not in the first 2 years) grow beyond maybe 100,000 rows each, then using UNIQUEIDENTIFIER is less of a concern as small volumes of rows generally perform ok (given moderately decent hardware that is not overburdened with other processes or low on memory). Obviously, the point at which JOIN performance degrades due to using UNIQUEIDENTIFIER will greatly depend on the specifics of the system: hardware as well as what types of queries, how the queries are written, and how much load on the system.

primary key datatype in sql server database

I see that after installing the ASP.NET membership tables, they use the data type "uniqueidentifier" for all of the primary key fields.
I have been using "int" data type and doing increment by one on inserts and declaring the column as IDENTITY.
Is there any particular benefits to using the uniqueIdentifier data type compared to my current model of using int and auto increments on new inserts ?
I personally use INT IDENTITY for most of my primary and clustering keys. I think it's rather unfortunate that Microsoft chose to use Uniqueidentifier in their ASP.NET membership tables - lots of people take that database as a "template" for other.....
You need to keep two things apart. The primary key is a logical construct: it uniquely identifies your rows, so it has to be unique, stable, and NOT NULL. A GUID works well for a primary key, too, since it's practically guaranteed to be unique. A GUID as your primary key is a good choice if you use SQL Server replication, since in that case you need a uniquely identifying GUID column anyway.
The clustering key in SQL Server is a physical construct that is used for the physical ordering of the data, and it is a lot more difficult to get right. Kimberly Tripp, the Queen of Indexing on SQL Server, requires a good clustering key to be unique, stable, as narrow as possible, and ideally ever-increasing (which an INT IDENTITY is).
See her articles on indexing here:
GUIDs as PRIMARY KEYs and/or the clustering key
The Clustered Index Debate Continues...
Ever-increasing clustering key - the Clustered Index Debate..........again!
and also see Jimmy Nilsson's The Cost of GUIDs as Primary Key
A GUID is a really bad choice for a clustering key, since it's wide and totally random, and thus leads to bad index fragmentation and poor performance. Also, the clustering key column(s) are stored in each and every entry of each and every non-clustered (additional) index, so you really want to keep the key small: a GUID is 16 bytes vs. 4 bytes for an INT, and with several non-clustered indexes and several million rows, this makes a HUGE difference.
In SQL Server, your primary key is by default your clustering key - but it doesn't have to be. You can easily use a GUID as your NON-Clustered primary key, and an INT IDENTITY as your clustering key - it just takes a bit of being aware of it.
uniqueidentifier solves problems with replication. It's possible for two replicated versions of a table to insert rows with the same integer value for the key, but it's practically impossible for them to both insert the same uniqueidentifier, assuming the value of the column is generated with NEWID().
I have been using "int" data type and doing increment by one on inserts.
In SQL Server the way to get an auto incrementing column is to use IDENTITY. I'm not sure if that is what you meant by the above so I thought I would clarify this just in case.
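In case it helps, a minimal example of an auto-incrementing key declared with IDENTITY, where SQL Server generates the value so the application never computes "last value plus one" itself. The table and column names are placeholders.

```sql
-- Hypothetical table with an auto-incrementing INT key
CREATE TABLE dbo.Widget
(
    WidgetID INT IDENTITY(1,1) NOT NULL   -- starts at 1, increments by 1
        CONSTRAINT PK_Widget PRIMARY KEY CLUSTERED,
    Name     NVARCHAR(100) NOT NULL
);

-- The key column is left out of the INSERT; the server assigns it
INSERT INTO dbo.Widget (Name) VALUES (N'first widget');

-- Identity value generated by the last insert in this scope
SELECT SCOPE_IDENTITY() AS NewWidgetID;
```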
The advantage of using an INT column with IDENTITY is that it is smaller so joins will be slightly faster. But for most purposes it won't be a significant improvement. There are other things you should worry about first, like choosing the correct indexes for your tables.