I'm currently using GUIDs as a NONCLUSTERED PRIMARY KEY alongside an INT IDENTITY column.
The GUIDs are required to allow offline creation of data and synchronisation - which is how the entire database is populated.
I'm aware of the implications of using a GUID as a clustered primary key, hence the integer clustered index. But does using a GUID as the primary key, and therefore as the foreign key on other tables, have significant performance implications?
Would a better option be to use an integer primary/foreign key, and keep the GUID as a client ID with a UNIQUE INDEX on each table? My concern there is that, without significant alteration to the existing code, Entity Framework would need to load the navigation properties in order to get the GUID of the related entity.
The database/hardware in question is SQL Azure.
You can also create foreign keys against unique key constraints, which then gives you the option to foreign key to the ID identity as an alternative to the Guid.
i.e.
Create Table SomeTable
(
UUID UNIQUEIDENTIFIER NOT NULL,
ID INT IDENTITY(1,1) NOT NULL,
CONSTRAINT PK PRIMARY KEY NONCLUSTERED (UUID),
CONSTRAINT UQ UNIQUE (ID)
)
GO
Create Table AnotherTable
(
SomeTableID INT,
FOREIGN KEY (SomeTableID) REFERENCES SomeTable(ID)
)
GO
Edit
Assuming that your centralized database is a Mart, and that only batch ETL is done from the source databases: if you do your ETL directly to the central database (i.e. not via Entity Framework), then all your tables will have UUID FKs after re-population from the distributed databases. You'll need to either map the INT UKCs during ETL or fix them up after the import (which would require a temporary NOCHECK constraint step on the INT FKs).
Once the ETL is loaded and the INT keys are mapped, I would suggest you ignore / remove the UUIDs from your ORM model - you would need to regenerate your EF navigation on the INT keys.
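The fix-up step described above might look roughly like the following. This is only a sketch: the tables dbo.Parent and dbo.Child, their columns, the placeholder default of 0 and all constraint names are hypothetical, chosen to show why the INT FK has to be switched off while the INT keys are not yet mapped.

```sql
-- Hypothetical schema: every table carries a GUID (from the clients) and an INT.
CREATE TABLE dbo.Parent
(
    UUID UNIQUEIDENTIFIER NOT NULL,
    ID   INT IDENTITY(1,1) NOT NULL,
    CONSTRAINT PK_Parent PRIMARY KEY NONCLUSTERED (UUID),
    CONSTRAINT UQ_Parent_ID UNIQUE CLUSTERED (ID)
);
CREATE TABLE dbo.Child
(
    UUID       UNIQUEIDENTIFIER NOT NULL,
    ID         INT IDENTITY(1,1) NOT NULL,
    ParentUUID UNIQUEIDENTIFIER NOT NULL,
    -- Placeholder value (assumption) until the real INT key is mapped.
    ParentID   INT NOT NULL CONSTRAINT DF_Child_ParentID DEFAULT (0),
    CONSTRAINT PK_Child PRIMARY KEY NONCLUSTERED (UUID),
    CONSTRAINT UQ_Child_ID UNIQUE CLUSTERED (ID),
    CONSTRAINT FK_Child_Parent_ID FOREIGN KEY (ParentID) REFERENCES dbo.Parent (ID)
);
GO
-- 1. Stop checking the INT foreign key so rows can arrive with the placeholder.
ALTER TABLE dbo.Child NOCHECK CONSTRAINT FK_Child_Parent_ID;
GO
-- 2. The bulk load of dbo.Parent and dbo.Child (GUIDs only) would happen here.

-- 3. Map each child's INT key from the parent's GUID.
UPDATE c
SET    c.ParentID = p.ID
FROM   dbo.Child  AS c
JOIN   dbo.Parent AS p
       ON p.UUID = c.ParentUUID;
GO
-- 4. Re-enable the constraint and re-validate existing rows so it is trusted again.
ALTER TABLE dbo.Child WITH CHECK CHECK CONSTRAINT FK_Child_Parent_ID;
GO
```

If the INT column were nullable and simply left NULL during the load, the NOCHECK step would probably not be needed, since NULL foreign key values are not checked.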
A different solution would be required if you update the central database directly, or if you do continual ETL and use EF for the ETL itself. In this case, it might be less total I/O just to leave the PK GUIDs as FKs for RI, drop the INT FKs altogether, and choose other suitable columns for clustering (minimizing page reads).
Yes, GUIDs have important implications. Your index is nonclustered, but the index itself will fragment quickly, and so will the indexes on the foreign keys. Size is also a concern: 16 bytes instead of 4 bytes for an integer.
You can use the NEWSEQUENTIALID() function as the default value for your column to make it less random and diminish fragmentation.
But yes, I'd say that using an integer as your primary key and for references will be the best solution.
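For the NEWSEQUENTIALID() suggestion above, a minimal sketch could look like this. The table dbo.Widget, its columns and the constraint names are hypothetical. NEWSEQUENTIALID() can only be used in a DEFAULT constraint on a uniqueidentifier column, so it only helps for rows where the server generates the value.

```sql
-- Hypothetical table: GUID primary key (nonclustered), INT clustering key.
CREATE TABLE dbo.Widget
(
    ID   INT IDENTITY(1,1) NOT NULL,
    UUID UNIQUEIDENTIFIER NOT NULL
         CONSTRAINT DF_Widget_UUID DEFAULT (NEWSEQUENTIALID()),
    Name NVARCHAR(100) NOT NULL,
    CONSTRAINT PK_Widget PRIMARY KEY NONCLUSTERED (UUID),
    CONSTRAINT UQ_Widget_ID UNIQUE CLUSTERED (ID)
);
GO
-- No UUID supplied: the default produces sequential GUIDs on the server.
INSERT INTO dbo.Widget (Name) VALUES (N'first'), (N'second');

-- UUID supplied by the caller: the default is bypassed and the value is random.
INSERT INTO dbo.Widget (UUID, Name) VALUES (NEWID(), N'created elsewhere');
GO
```

Note that GUIDs created offline by clients, as in the question, bypass the default entirely, so they would not benefit from it.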
Generally speaking, it is preferable to use INT for Primary Key / Foreign Key fields, whether or not these fields are the leading field in Clustered indexes. The issue has to do with JOIN performance: even if you use UNIQUEIDENTIFIER as NonClustered, or even if you use NEWSEQUENTIALID() to reduce fragmentation, as the tables get larger it will be more scalable to JOIN between INT fields. (Please note that I am not saying that PK / FK fields should always be INT, as sometimes there are perfectly valid natural keys to use.)
In your case, given the concern about Entity Framework and generating the GUIDs in the app and not in the DB, go with your alternate suggestion of using INT as the PK / FK fields, but rather than have the UNIQUEIDENTIFIER in all tables, only put it in the main user / customer info table. I would think that you should be able to do a one-time lookup of the customer INT identifier based on the GUID, cache that value, and then use the INT value for all remaining operations. And yes, be sure there is a UNIQUE, NONCLUSTERED index on the GUID field.
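A rough sketch of that one-time lookup, assuming a hypothetical dbo.Customer table (the table, column and index names, and the sample GUID value, are made up for illustration):

```sql
-- Hypothetical main table: INT is the PK / FK target, the GUID is only a lookup key.
CREATE TABLE dbo.Customer
(
    CustomerID   INT IDENTITY(1,1) NOT NULL
                 CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED,
    CustomerGuid UNIQUEIDENTIFIER NOT NULL,
    Name         NVARCHAR(100) NOT NULL
);
GO
-- The UNIQUE, NONCLUSTERED index on the GUID field mentioned above.
CREATE UNIQUE NONCLUSTERED INDEX UX_Customer_CustomerGuid
    ON dbo.Customer (CustomerGuid);
GO
-- One-time lookup: translate the app-generated GUID into the INT key,
-- which the application can then cache and use for all other operations.
DECLARE @CustomerGuid UNIQUEIDENTIFIER = '6F9619FF-8B86-D011-B42D-00C04FC964FF';
DECLARE @CustomerID   INT;

SELECT @CustomerID = CustomerID
FROM   dbo.Customer
WHERE  CustomerGuid = @CustomerGuid;

SELECT @CustomerID AS CustomerID;  -- NULL if no customer has that GUID
GO
```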
That all being said, if your tables will never (and I mean NEVER as opposed to just not in the first 2 years) grow beyond maybe 100,000 rows each, then using UNIQUEIDENTIFIER is less of a concern as small volumes of rows generally perform ok (given moderately decent hardware that is not overburdened with other processes or low on memory). Obviously, the point at which JOIN performance degrades due to using UNIQUEIDENTIFIER will greatly depend on the specifics of the system: hardware as well as what types of queries, how the queries are written, and how much load on the system.
I was told to create an autID identity column in the table with GUID varchar(40) as the primary key and use the autID column as a reference key to help in the join process. But is that a good approach?
This causes a lot of problems like this
CREATE TABLE OauthClientInfo
(
autAppID INT IDENTITY(1,1) NOT NULL UNIQUE, -- must be unique to be referenced by a foreign key
strClientID VARCHAR(40), -- GUID
strClientSecret VARCHAR(40)
)
CREATE TABLE OAuth_AuthToken
(
autID INT IDENTITY(1,1),
strAuthToken VARCHAR(40),
autAppID_fk INT
FOREIGN KEY REFERENCES OauthClientInfo(autAppID)
)
I was told that having autAppID_fk helps joins compared with having a strClientID_fk of varchar(40), but my counter-argument is that we are unnecessarily adding a new ID as the reference, which sometimes forces extra joins.
For example, to find out which strClientID a strAuthToken belongs to: if strClientID_fk were the reference key, the data in the OAuth_AuthToken table would make much more sense to me on its own. Please share your views on this.
I was told to create an autID identity column in the table with GUID varchar(40) as the primary key and use the autID column as a reference key to help in the join process. But is that a good approach?
You were told this by someone that confuses clustering and primary keys. They are not one and the same, despite the confusing implementation of the database engine that "helps" the lazy developer.
You might get arguments about adding an identity column to every table and designating it as the primary key. I'll disagree with all of this. One does not BLINDLY do anything of this type in a schema. You do the proper analysis, identify (and enforce) any natural keys, and then you decide whether a synthetic key is both useful and needed. And then you determine which columns to use for the clustered index (because you only have one). And then you verify the appropriateness of your decisions by testing how efficient and effective your schema is under load. There are no absolute rules about how to implement your schema.
Many of your indexing (and again note - indexing and primary key are completely separate things) choices will be affected by how your tables are updated over time. Do you have hotspots that need to be minimized? Does your table experience lots of random inserts, updates, and deletes over time? Or maybe just lots of inserts but relatively few updates or deletes? These are just some of the factors that guide your decision.
You need to use the UNIQUEIDENTIFIER data type for GUID columns, not VARCHAR.
As far as I have read, an auto-increment INT is the most suitable column for a clustered index.
And strClientID is the worst candidate for the PK or the clustered index.
Most importantly, you haven't mentioned the purpose of strClientID. What kind of data does it hold, and how does it get populated?
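To illustrate the data type point above, here is a sketch of the table from the question with the GUID stored natively and the INT as the clustered primary key. The constraint names and the NOT NULL choices are assumptions, not part of the original schema.

```sql
-- Sketch: strClientID as UNIQUEIDENTIFIER instead of VARCHAR(40).
CREATE TABLE OauthClientInfo
(
    autAppID        INT IDENTITY(1,1) NOT NULL
                    CONSTRAINT PK_OauthClientInfo PRIMARY KEY CLUSTERED,
    strClientID     UNIQUEIDENTIFIER NOT NULL
                    CONSTRAINT UQ_OauthClientInfo_strClientID UNIQUE NONCLUSTERED,
    strClientSecret VARCHAR(40) NOT NULL
);
GO
-- Compare the storage size of the same GUID held natively and as text.
DECLARE @g UNIQUEIDENTIFIER = NEWID();

SELECT DATALENGTH(@g)                       AS GuidBytes,    -- 16
       DATALENGTH(CONVERT(VARCHAR(40), @g)) AS VarcharBytes; -- 36
GO
```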
I am creating a simple table called Photo to store photos for people/groups defined in a User table. I'm using Microsoft SQL Server's FILESTREAM feature, since all other user data is already stored in SQL Server, and it makes more sense to me than programming a separate way to manually retrieve objects from the disk when they're directly related to entries in the database.
Each user can have only one photo associated with them at a time (for now, but this could change in the future), and FILESTREAM requires a GUID column to reference the files it stores to disk, so this is the model I've come up with for Photo:
UserID int NOT NULL UNIQUE
PhotoID uniqueidentifier ROWGUIDCOL NOT NULL
PhotoBitmap varbinary(MAX) FILESTREAM NULL
My question is (if this model is correct for my application), should I use the PhotoID as the primary key, seeing as how it's already unique and is required? It seems to me that it would be simpler than creating a separate INT column just for the primary key, but I don't know if it is "correct".
I personally use INT IDENTITY for most of my primary and clustering keys.
You need to keep two things apart. The primary key is a logical construct: it uniquely identifies your rows, and it has to be unique, stable and NOT NULL. A GUID works well for a primary key, too, since it's guaranteed to be unique. A GUID as your primary key is a good choice if you use SQL Server replication, since in that case you need a uniquely identifying GUID column anyway.
The clustering key in SQL Server is a physical construct that is used for the physical ordering of the data, and it is a lot more difficult to get right. The Queen of Indexing on SQL Server, Kimberly Tripp, requires a good clustering key to be unique, stable, as narrow as possible, and ideally ever-increasing (which an INT IDENTITY is).
See her articles on indexing here:
GUIDs as PRIMARY KEYs and/or the clustering key
The Clustered Index Debate Continues...
Ever-increasing clustering key - the Clustered Index Debate..........again!
Disk space is cheap - that's not the point!
and also see Jimmy Nilsson's The Cost of GUIDs as Primary Key
A GUID is a really bad choice for a clustering key, since it's wide and totally random, and thus leads to bad index fragmentation and poor performance. Also, the clustering key column(s) are stored in each and every entry of each and every non-clustered (additional) index, so you really want to keep it small: a GUID is 16 bytes vs. 4 bytes for an INT, and with several non-clustered indexes and several million rows, this makes a HUGE difference.
In SQL Server, your primary key is by default your clustering key - but it doesn't have to be. You can easily use a GUID as your NON-Clustered primary key, and an INT IDENTITY as your clustering key - it just takes a bit of being aware of it.
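A minimal sketch of that split, with the GUID as a NONCLUSTERED primary key and an INT IDENTITY as the clustering key. The table dbo.Person, its columns and the index names are hypothetical.

```sql
CREATE TABLE dbo.Person
(
    PersonID   INT IDENTITY(1,1) NOT NULL,
    PersonGuid UNIQUEIDENTIFIER NOT NULL
               CONSTRAINT DF_Person_PersonGuid DEFAULT (NEWID()),
    Name       NVARCHAR(100) NOT NULL,
    -- Logical identity: the GUID is the primary key, explicitly NONCLUSTERED.
    CONSTRAINT PK_Person PRIMARY KEY NONCLUSTERED (PersonGuid)
);
GO
-- Physical ordering: narrow, unique, ever-increasing INT as the clustering key.
CREATE UNIQUE CLUSTERED INDEX CIX_Person_PersonID
    ON dbo.Person (PersonID);
GO
-- Check which index ended up clustered and which one backs the primary key.
SELECT name, type_desc, is_primary_key
FROM   sys.indexes
WHERE  object_id = OBJECT_ID(N'dbo.Person');
GO
```

The "being aware of it" part is the explicit NONCLUSTERED keyword: without it, and with no other clustered index on the table, the primary key would be created as the clustered index.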
I see that, after installing the ASP.NET membership tables, they use the data type "uniqueidentifier" for all of the primary key fields.
I have been using the "int" data type, incrementing by one on inserts and declaring the column as IDENTITY.
Are there any particular benefits to using the uniqueidentifier data type compared to my current model of using int and auto-increment on new inserts?
I personally use INT IDENTITY for most of my primary and clustering keys. I think it's rather unfortunate that Microsoft chose to use uniqueidentifier in their ASP.NET membership tables - lots of people take that database as a "template" for others...
You need to keep two things apart. The primary key is a logical construct: it uniquely identifies your rows, and it has to be unique, stable and NOT NULL. A GUID works well for a primary key, too, since it's guaranteed to be unique. A GUID as your primary key is a good choice if you use SQL Server replication, since in that case you need a uniquely identifying GUID column anyway.
The clustering key in SQL Server is a physical construct that is used for the physical ordering of the data, and it is a lot more difficult to get right. The Queen of Indexing on SQL Server, Kimberly Tripp, requires a good clustering key to be unique, stable, as narrow as possible, and ideally ever-increasing (which an INT IDENTITY is).
See her articles on indexing here:
GUIDs as PRIMARY KEYs and/or the clustering key
The Clustered Index Debate Continues...
Ever-increasing clustering key - the Clustered Index Debate..........again!
and also see Jimmy Nilsson's The Cost of GUIDs as Primary Key
A GUID is a really bad choice for a clustering key, since it's wide and totally random, and thus leads to bad index fragmentation and poor performance. Also, the clustering key column(s) are stored in each and every entry of each and every non-clustered (additional) index, so you really want to keep it small: a GUID is 16 bytes vs. 4 bytes for an INT, and with several non-clustered indexes and several million rows, this makes a HUGE difference.
In SQL Server, your primary key is by default your clustering key - but it doesn't have to be. You can easily use a GUID as your NON-Clustered primary key, and an INT IDENTITY as your clustering key - it just takes a bit of being aware of it.
uniqueidentifier solves problems with replication. It's possible for two replicated versions of a table to insert rows with the same integer value for the key, but it's practically impossible for them both to insert the same uniqueidentifier, assuming the value of the column is set with NEWID().
I have been using "int" data type and doing increment by one on inserts.
In SQL Server the way to get an auto incrementing column is to use IDENTITY. I'm not sure if that is what you meant by the above so I thought I would clarify this just in case.
The advantage of using an INT column with IDENTITY is that it is smaller so joins will be slightly faster. But for most purposes it won't be a significant improvement. There are other things you should worry about first, like choosing the correct indexes for your tables.
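For reference, a small sketch of an auto-incrementing column declared with IDENTITY. The table dbo.Item and its columns are hypothetical.

```sql
-- IDENTITY(1,1): start at 1, increment by 1; SQL Server assigns the value.
CREATE TABLE dbo.Item
(
    ItemID INT IDENTITY(1,1) NOT NULL
           CONSTRAINT PK_Item PRIMARY KEY,
    Name   NVARCHAR(50) NOT NULL
);
GO
-- The identity column is omitted from the INSERT column list.
INSERT INTO dbo.Item (Name) VALUES (N'example');

-- SCOPE_IDENTITY() returns the identity value generated by the INSERT above.
SELECT SCOPE_IDENTITY() AS NewItemID;
GO
```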
I have a table with 16 columns. It will be the most frequently used table in a web application and will contain a few hundred thousand rows. The database is on SQL Server 2008.
My question is about the choice of primary key. Which is quicker? I can use a composite primary key of two bigints, or I can use a single varchar value, but then I would need to concatenate it afterwards.
There are many more factors you must consider:
the prevalent data access pattern: how are you going to access the table?
how many non-clustered indexes?
frequency of updates
pattern of updates (sequential inserts, random)
pattern of deletes
All these factors, and especially the first two, should drive your choice of the clustered key. Note that the primary key and the clustered key are different concepts, often confused. Read my answer on Should I design a table with a primary key of varchar or int? for a lengthier discussion of the criteria that drive a clustered key choice.
Without any information on your access patterns I can answer very briefly and concisely, and actually correctly: the narrower key is always quicker (for reasons of IO). However, this response bears absolutely no value. The only thing that will make your application faster is to choose a key that is going to be used by the query execution plans.
A primary key which does not rely on any underlying values (called a surrogate key) is a good choice. That way, if the row changes, the ID doesn't have to, and any tables referring to it (foreign keys) will not need to change. I would choose an autonumber (i.e. IDENTITY) column for the primary key column.
In terms of performance, a shorter, integer based primary key is best.
You can still create your clustered index on multiple columns.
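As a sketch of that combination for the two-bigint case in the question: a short surrogate primary key, with the clustered index on the two columns. The table dbo.Reading and its column names are hypothetical, and the UNIQUE keyword assumes the pair of bigints really is unique.

```sql
CREATE TABLE dbo.Reading
(
    ReadingID INT IDENTITY(1,1) NOT NULL,
    KeyPartA  BIGINT NOT NULL,
    KeyPartB  BIGINT NOT NULL,
    Payload   NVARCHAR(200) NULL,
    -- Short, integer-based surrogate key for references from other tables.
    CONSTRAINT PK_Reading PRIMARY KEY NONCLUSTERED (ReadingID)
);
GO
-- Clustered index on multiple columns (drop UNIQUE if the pair can repeat).
CREATE UNIQUE CLUSTERED INDEX CIX_Reading_KeyParts
    ON dbo.Reading (KeyPartA, KeyPartB);
GO
```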
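Why not just a single auto-generated INT primary key? INT is a signed 32-bit type, so an IDENTITY starting at 1 can handle over 2 billion records (about 4 billion if you also use the negative range).
CREATE TABLE Records (
recordId INT IDENTITY(1,1) NOT NULL PRIMARY KEY,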
...
);
A surrogate key might be a fine idea if there are foreign key relationships on this table. Using a surrogate will save tables that refer to it from having to duplicate all those columns in their tables.
Another important consideration is indexes on columns that you'll be using in WHERE clauses. Your performance will suffer if you don't have them. Make sure that you add appropriate indexes, over and above the primary key, to avoid table scans.
What do you mean by quicker? If you need to search more quickly, you can create an index on any column, or use full-text search. The primary key just makes sure you do not have duplicate records.
The decision depends on how the table is used. If you are mostly using the table to save data rather than retrieve it, then a simple key will do. If you are mostly querying the data, and it is mostly static data where the key values will not change, your index strategy needs to be optimized for the most frequent query. Personally, I like the idea of using GUIDs for the primary key and an int for the clustered index. That allows for easy data imports. But it really depends upon your needs.
Lots of variables you haven’t mentioned: whether the data in the two columns is “natural” and there is a benefit in identifying records by a logical ID, whether disclosure of the key via a UI poses a risk, and how important performance is (a few hundred thousand rows is pretty minimal).
If you’re not too fussy, go the auto number path for speed and simplicity. Also take a look at all the posts on the site about SQL primary key types. Heaps of info here.
Is it an ER model or a dimensional model? In an ER model, the two columns should be kept separate and should not be surrogated. The entire record could have a single surrogate for easy reference in URLs etc. This could be a hash of all parts of the composite key, or an identity.
In a dimensional model, they must also be separate, and they should all be surrogated.
Please clear up my doubt about this: in SQL Server (2000 and above), is the primary key automatically a clustered index, or do we have the choice of a non-clustered index on the primary key?
Nope, it can be nonclustered. However, if you don't explicitly define it as nonclustered and there is no clustered index on the table, it'll be created as clustered.
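The default described above can be seen with a small sketch like this one (all table and constraint names are hypothetical):

```sql
-- 1. Nothing specified and no other clustered index: the PK is created as clustered.
CREATE TABLE dbo.PkDefault
(
    ID INT NOT NULL CONSTRAINT PK_PkDefault PRIMARY KEY
);

-- 2. Explicit NONCLUSTERED: the PK is nonclustered and the table stays a heap.
CREATE TABLE dbo.PkExplicit
(
    ID INT NOT NULL CONSTRAINT PK_PkExplicit PRIMARY KEY NONCLUSTERED
);

-- 3. Another constraint claims the clustered index: the PK defaults to nonclustered.
CREATE TABLE dbo.PkBesideClustered
(
    ID        INT NOT NULL CONSTRAINT PK_PkBesideClustered PRIMARY KEY,
    CreatedAt DATETIME NOT NULL
              CONSTRAINT UQ_PkBesideClustered_CreatedAt UNIQUE CLUSTERED
);
GO
-- List how each primary key index was actually created.
SELECT OBJECT_NAME(i.object_id) AS table_name,
       i.name                   AS index_name,
       i.type_desc
FROM   sys.indexes AS i
WHERE  i.is_primary_key = 1
  AND  i.object_id IN (OBJECT_ID(N'dbo.PkDefault'),
                       OBJECT_ID(N'dbo.PkExplicit'),
                       OBJECT_ID(N'dbo.PkBesideClustered'));
GO
```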
One might also add that frequently it's BAD to allow the primary key to be clustered. In particular, when the primary key is assigned by an IDENTITY, it has no intrinsic meaning, so any effort to keep the table arranged accordingly would be wasted.
Consider a table Product, with ProductID INT IDENTITY PRIMARY KEY. If this is clustered, then products that are related in some way are likely to be spread all over the disk. It might be better to cluster by something that we're likely to query based on, like the ManufacturerID or the CategoryID. In either of these cases, a clustered index would (other things being equal) make the corresponding query much more efficient.
On the other hand, the foreign key in a child table that points to this might be a good candidate for clustering (my objection is to the column that actually has the IDENTITY attribute, not its relatives). So in my example above, it's likely that ManufacturerID is a foreign key to a Manufacturer table, where it is set as an IDENTITY. That column shouldn't be clustered, but the column in Product that references it might do so to good advantage.
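A sketch of the Product example above, with the IDENTITY primary keys left nonclustered and Product clustered on the foreign key column. The constraint and index names, the Name columns and the value 42 are assumptions made for illustration.

```sql
CREATE TABLE dbo.Manufacturer
(
    -- IDENTITY key with no intrinsic meaning: kept NONCLUSTERED.
    ManufacturerID INT IDENTITY(1,1) NOT NULL
                   CONSTRAINT PK_Manufacturer PRIMARY KEY NONCLUSTERED,
    Name           NVARCHAR(100) NOT NULL
);
CREATE TABLE dbo.Product
(
    ProductID      INT IDENTITY(1,1) NOT NULL
                   CONSTRAINT PK_Product PRIMARY KEY NONCLUSTERED,
    ManufacturerID INT NOT NULL
                   CONSTRAINT FK_Product_Manufacturer
                   REFERENCES dbo.Manufacturer (ManufacturerID),
    Name           NVARCHAR(100) NOT NULL
);
GO
-- Cluster Product by the column the queries filter on, so one manufacturer's
-- products are stored together. The index is non-unique on purpose.
CREATE CLUSTERED INDEX CIX_Product_ManufacturerID
    ON dbo.Product (ManufacturerID);
GO
-- The kind of query this layout is meant to serve.
SELECT ProductID, Name
FROM   dbo.Product
WHERE  ManufacturerID = 42;
GO
```

Because this clustered index is not unique, SQL Server adds a hidden uniquifier to rows with duplicate key values, which is part of the cost to weigh when testing this layout.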