Deciding on a primary key according to value size in SQL Server - sql

I want to ask a question to optimize SQL Server performance. Assume I have an entity - say Item - and I must assign a primary key for it. It has columns and two of them are expected to be unique, one of them is expected to be bigger than the other as tens of characters.
How should I decide primary key?
Should one of them be PK, if so which one, or both, or should I create an Identity number as PK? This is important for me because the entity "Item" would have relations with some other entities and I think the complexity of PK would affect the performance of SQL Server queries.

Personally, I would go with an IDENTITY Primary Key with unique constraints on both the mentioned unique keys and indexes for additonal lookups.
You have to remember that by default SQL Server creates the primary key as the clustered index, which impacts how it is stored on disc. If the new ITEMS came in at random, variance there could be a lot of fragmentation on either the primary keys.
Also, unless cascades and foreign keys are switched on, you would have to manually maintain the relational integrety of the data (unless you use IDENTITY)

Well, the primary key is really only used to uniquely identify each row - so the only requirements for it are: it has to be unique and typically also should not contain NULL.
Anything else is most likely more relevant for the clustering key in SQL Server - the column (or set of columns) by which the data is physically ordered on disk. By default, the primary key is also the clustering key in SQL Server.
The clustering key is the most important choice in SQL Server because it has far reaching performance implications. A good clustering key is
narrow
unique
stable
if possible ever-increasing
It has to be unique so that it can be added to each and every single nonclustered index for lookup into the actual data tables - if you pick a non-unique column (or set of columns), SQL Server will add a 4-byte "uniquefier" for you.
It should be as narrow as possible, since it's stored in a lot of places. Try to stick to 4 bytes for an INT or 8 bytes for a BIGINT - avoid long and variable length VARCHAR columns since those are both too wide, and the variable length also carries additional overhead. Because of this, sets of columns are also rather rarely a good choice.
The clustering key should be stable - value shouldn't change over time - since every time a value changes, potentially a lot of index entries (in the clustered index itself, and every single nonclustered index, too) need to be updated which causes a lot of unnecessary overhead.
And if it's ever-increasing (like an INT IDENTITY), you also can avoid most page splits - an extremely expensive and involved procedure that happens if you use random values (like GUID's) as your clustering key.
So in brief: an INT IDENTITY is ideal - GUIDs, variable length strings, or combinations of columns are typically less of a good choice.

Choose the one you will use to identify the records in queries and joins to other tables. Size is relative, and whilst a consideration usually not an issue since the PK will be indexed and the other unique column can make use also of a unique index.
The uniqueidentifier data type for e.g. is a 36 character long string representation and performs fine as a primary key under the majority of circumstances.

Related

Primary Key Needed on Fact Tables

I am currently developing a very complicated database schema and was wondering if the fact tables should have primary keys. Each fact table has 50+ columns of data and the only way to make a primary key would be to add an auto incrementing count to each tuple. I am just not sure what this information gets us in the long term, especially since the data will be deleted after 12 months.
My dimension tables of course will have primary keys, just wanting to know what is best practice.
I am a fan of putting an identity column on all tables. This makes it easier to identify specific rows for updating and deleting.
On a fact table with lots of dimensions, of course, such a column can seem superfluous. However, there is still usually a primary key -- which is the combination of dimensions.
I would encourage you to have a primary key on the table, either an identity column or a combination of existing rows. If you use a composite primary key, you should be careful about the ordering of the keys. SQL Server defaults to using the primary key as a clustered index, and if you put the keys in the wrong order, then your table is subject to fragmentation. Identity keys don't have this issue.
It is always good to go for a clustering key, which will leads to easily seeking the data, when we need. Clustering key is not only used for clustered index queries. It is also being stored in every non-clustered index leaf page, for seeking back to the data pages, when there is key-lookup.
Characteristics of good clustering key:
unique (no need for adding uniquefier to make value unique)
incrementing (reduces fragmentation)
narrow (less number of bytes to store in the tree pages of clustered index & in the leaf pages of non-clustered index)
Static (reduces fragmentation)
non-nullable (avoids null blocks)
fixed width (avoids variable blocks)
Read more on Kimberly Tripp Post on clustering key
Identity satisy all these clauses. They are good candidates for clustered index.
If you are going to hold data longer, you can go for Bigint and if you are going to hold for one year and purge, you can go for int datatype itself.

Is a Primary Key necessary in SQL Server?

This may be a pretty naive and stupid question, but I'm going to ask it anyway
I have a table with several fields, none of which are unique, and a primary key, which obviously is.
This table is accessed via the non-unique fields regularly, but no user SP or process access data via the primary key. Is the primary key necessary then? Is it used behind the scenes? Will removing it affect performance Positively or Negatively?
Necessary? No. Used behind the scenes? Well, it's saved to disk and kept in the row cache, etc. Removing will slightly increase your performance (use a watch with millisecond precision to notice).
But ... the next time someone needs to create references to this table, they will curse you. If they are brave, they will add a PK (and wait for a long time for the DB to create the column). If they are not brave or dumb, they will start creating references using the business key (i.e. the data columns) which will cause a maintenance nightmare.
Conclusion: Since the cost of having a PK (even if it's not used ATM) is so small, let it be.
Do you have any foreign keys, do you ever join on the PK?
If the answer to this is no, and your app never retrieves an item from the table by its PK, and no query ever uses it in a where clause, therefore you just added an IDENTITY column to have a PK, then:
the PK in itself adds no value, but does no damage either
the fact that the PK is very likely the clustered index too is .. it depends.
If you have NC indexes, then the fact that you have a narrow artificial clustered key (the IDENTITY PK) is helpful in keeping those indexes narrow (the CDX key is reproduced in every NC leaf slots). So a PK, even if never used, is helpful if you have significant NC indexes.
On the other hand, if you have a prevalent access pattern, a certain query that outweighs all the other is frequency and importance, or which is part of a critical time code path (eg. is the query run on every page visit on your site, or every second by and app etc) then that query is a good candidate to dictate the clustered key order.
And finally, if the table is seldom queried but often written to then it may be a good candidate for a HEAP (no clustered key at all) since heaps are so much better at inserts. See Comparing Tables Organized with Clustered Indexes versus Heaps.
The primary key is behind the scenes a clustered index (by default unless generated as a non clustered index) and holds all the data for the table. If the PK is an identity column the inserts will happen sequentially and no page splits will occur.
But if you don't access the id column at all then you probably want to add some indexes on the other columns. Also when you have a PK you can setup FK relationships
In the logical model, a table must have at least one key. There is no reason to arbitarily specify that one of the keys is 'primary'; all keys are equal. Although the concept of 'primary key' can be traced back to Ted Codd's early work, the mistake was picked up early on has long been corrected in relational theory.
Sadly, PRIMARY KEY found its way into SQL and we've had to live with it ever since. SQL tables can have duplicate rows and, if you consider the resultset of a SELECT query to also be a table, then SQL tables can have duplicate rows too. Relational theorists dislike SQL a lot. However, just because SQL lets you do all kinds of wacky non-relational things, that doesn't mean that you have to actually do them. It is good practice to ensure that every SQL table has at least one key.
In SQL, using PRIMARY KEY on its own has implications e.g. NOT NULL, UNIQUE, the table's default reference for foreign keys. In SQL Server, using PRIMARY KEY on its own has implications e.g. the table's clustered index. However, in all these cases, the implicit behavior can be made explicit using specific syntax.
You can use UNIQUE (constraint rather than index) and NOT NULL in combination to enforce keys in SQL. Therefore, no, a primary key (or even PRIMARY KEY) is not necessary for SQL Server.
I would never have a table without a primary key. Suppose you ever need to remove a duplicate - how would you identify which one to remove and which to keep?
A PK is not necessary.
But you should consider to place a non-unique index on the columns that you use for querying (i.e. that appear in the WHERE-clause). This will considerably boost lookup performance.
The primary key when defined will help improve performance within the database for indexing and relationships.
I always tend to define a primary key as an auto incrementing integer in all my tables, regardless of if I access it or not, this is because when you start to scale up your application, you may find you do actually need it, and it makes life a lot simpler.
A primary key is really a property of your domain model, and it uniquely identifies an instance of a domain object.
Having a clustered index on a montonically increasing column (such as an identity column) will mean page splits will not occur, BUT insertions will unbalance the index over time and therefore rebuilding indexes needs to be done regulary (or when fragmentation reaches a certain threshold).
I have to have a very good reason to create a table without a primary key.
As SQLMenace said, the clustered index is an important column for the physical layout of the table. In addition, having a clustered index, especially a well chosen one on a skinny column like an integer pk, actually increases insert performance.
If you are accessing them via non-key fields the performance probably will not change. However it might be nice to keep the PK for future enhancements or interfaces to these tables. Does your application only use this one table?

Optimize SQL databases by adding index columns

Say I have a database looking like this;
Product with columns [ProductName] [Price] [Misc] [Etc]
Order with columns [OrderID] [ProductName] [Quantity] [Misc] [Etc]
ProductName is primary key of Product, of some string type and unique.
OrderID is primary key and of some integer type, and ProductName being a foreign key.
Say I change the primary key of Product to a new column of integer type ie [ProductID].
Would this reduce the database size and optimize lookups joining these two tables (and likewise operations), or are these optimizations performed automatically by (most/general/main) SQL database implementations?
Technically, using (String) ProductName as primary key in Product, a database should be able to implement the ProductName column in Order as simply a pointer to a row in Product, and perform a JOIN as quicly as having an integer as a foreign key, is this a standard way of implementing SQL.
Update:
This question is about how SQL servers handles foreign keys, not whether a product table needs a serial number, or how I handle to product name change in a database.
A string primary key is a bad idea, so changing it to an INT will help performance. most databases uses the primary key index for lookups and comparisons, choose a brief primary key—one column, if possible. You use primary key columns for joins (combining data from two or more tables based on common values in join columns), for query retrieval, and for grouping or sorting a query result set. The briefer the index entries are, the faster the database can perform the lookups and comparisons.
Not to mention, if the name of the product changes, how can you handle that? update all rows that contain the product name as a Foreign Key?
I couldn't have said it any better, so check out this answer: Should I design a table with a primary key of varchar or int, quote from that answer:
Using a VARCHAR(10) or (20) just uses
up too much space - 10 or 20 bytes
instead of 4, and what a lot of folks
don't know - the clustering key value
will be repeated on every single index
entry on every single non-clustered
index on the table, so potentially,
you're wasting a lot of space (not
just on disk - that's cheap - but also
in SQL Server's main memory). Also,
since it's variable (might be 4, might
be 20 chars) it's harder to SQL server
to properly maintain a good index
structure
integer column acts better than string in joins
integer autoinc columns as primary clustered key is good for inserts
I won't reduce database size (presumably you'll keep the product name field), but should definitely improve lookup performance.
Integer datatype in most implementations will be less in size than the string (CHAR, VARCHAR etc.), this will make your index smaller in size.
In addition, there are some issues with comparing the strings:
Some databases, namely MySQL, compress the string keys which can make the searches less efficient.
String B-Trees that use natural language identifiers tend to be less concurrency balanced than integer B-Trees. Since the natural language words are not distributed evenly across the alphabet, more updates and inserts will go to the same block, increasing the number of page splits and ultimately increasing the index size. To work around this, Oracle supports REVERSE clause in indexes.
When comparing two strings, a collation should be taken into account. Normally, it does not matter much, however, it does add some overhead.
Primary keys should be unique, exist at time of row creation and be as immutable as possible. IMO, discussions about whether to use a surrogate key should be secondary to issues of data integrity.
If for example a product had a serial number stamped on the item, which had to exist at the time the row in the database was entered and was guaranteed to be unique, then IMO that would make a good primary key. The reason is this value will be used as the foreign key in other tables and it saves you the expense of an additional lookup to get the product's serial number. The additional storage space is inconsequential until you get into the many millions of rows. However, if the serial number was stamped by some other manufacturer so you had no guarantees of uniqueness ("it is probably unique" is not good enough), then a surrogate is appropriate. In fact, I would go so far as to say a good portion if not most "products" tables use surrogate keys because no value that is guaranteed to be available at time of entry, guaranteed to be unique and will be relatively immutable is available as a key.
However, many developers that use surrogate keys overlook the need that every table that has a surrogate key should also have another key (i.e. a unique constraint). Thus, in your case with products, even if you add an integer primary key, you should still have a unique constraint on product name. The unique constraint on product name creates what is called a candidate key with the integer value being the primary key.
Surrogate keys are meant to be behind-the-scenes goo. While integer keys perform the best and are easy to create they have one downside: it is easy, tempting even, for application developers to show the key value to users. This is a mistake IMO. Users should never see the key value or they will come to rely on the value itself which creates problems if you need to re-sequence the values (like say with a database merge) or if you use values that were created in gaps created by the Identity value and they rely on the values being sequential. As long as you never show the value to users, using an integer PK is fine.

primary key datatype in sql server database

i see after installing the asp.net membership tables, they use the data type "uniqueidentifier" for all of the primary key fields.
I have been using "int" data type and doing increment by one on inserts and declaring the column as IDENTITY.
Is there any particular benefits to using the uniqueIdentifier data type compared to my current model of using int and auto increments on new inserts ?
I personally use INT IDENTITY for most of my primary and clustering keys. I think it's rather unfortunate that Microsoft chose to use Uniqueidentifier in their ASP.NET membership tables - lots of people take that database as a "template" for other.....
You need to keep apart the primary key which is a logical construct - it uniquely identifies your rows, it has to be unique and stable and NOT NULL. A GUID works well for a primary key, too - since it's guaranteed to be unique. A GUID as your primary key is a good choice if you use SQL Server replication, since in that case, you need an uniquely identifying GUID column anyway.
The clustering key in SQL Server is a physical construct is used for the physical ordering of the data, and is a lot more difficult to get right. Typically, the Queen of Indexing on SQL Server, Kimberly Tripp, also requires a good clustering key to be unique, stable, as narrow as possible, and ideally ever-increasing (which a INT IDENTITY is).
See her articles on indexing here:
GUIDs as PRIMARY KEYs and/or the clustering key
The Clustered Index Debate Continues...
Ever-increasing clustering key - the Clustered Index Debate..........again!
and also see Jimmy Nilsson's The Cost of GUIDs as Primary Key
A GUID is a really bad choice for a clustering key, since it's wide, totally random, and thus leads to bad index fragmentation and poor performance. Also, the clustering key row(s) is also stored in each and every entry of each and every non-clustered (additional) index, so you really want to keep it small - GUID is 16 byte vs. INT is 4 byte, and with several non-clustered indices and several million rows, this makes a HUGE difference.
In SQL Server, your primary key is by default your clustering key - but it doesn't have to be. You can easily use a GUID as your NON-Clustered primary key, and an INT IDENTITY as your clustering key - it just takes a bit of being aware of it.
uniqueidenfitier solves problems with replication. It's possible for two replicated versions of a table to insert rows with the same integer value for the key, but it's impossible for them to both insert using the same uniqueidentifier, assuming the value of the column is set to newid.
I have been using "int" data type and doing increment by one on inserts.
In SQL Server the way to get an auto incrementing column is to use IDENTITY. I'm not sure if that is what you meant by the above so I thought I would clarify this just in case.
The advantage of using an INT column with IDENTITY is that it is smaller so joins will be slightly faster. But for most purposes it won't be a significant improvement. There are other things you should worry about first, like choosing the correct indexes for your tables.

is it good to have primary keys as Identity field

I have read a lot of articles about whether we should have primary keys that are identity columns, but I'm still confused.
There are advantages of making columns are identity as it would give better performance in joins and provides data consistency. But there is a major drawback associated with identity ,i.e.When INSERT statement fails, still the IDENTITY value increases If a transaction is rolled back, the new IDENTITY column value isn't rolled back, so we end up with gaps in sequencing. I can use GUIDs (by using NEWSEQUENTIALID) but it reduces performance.
Gaps should not matter: the identity column is internal and not for end user usage or recognition.
GUIDs will kill performance, even sequential ones, because of the 16 byte width.
An identity column should be chosen to respect the physical implementation after modelling your data and working out what your natural keys are. That is, the chosen natural key is the logical key but you choose a surrogate key (identity) because you know how the engine works.
Or you use an ORM and let the client tail wag the database dog...
For all practical purposes, integers are ideal for primary keys and auto increment is a perfect way to generate them. As long as your PK is meaningless (surrogate) it will be protected from creativity of you customers and serve its main purpose (to identify a row in a table) just fine. Indexes are packed, joins fast as it gets, and it is easy to partition tables.
If you happen to need GUID, that's fine too; however, think auto-increment integer first.
I would like to say that depends on your needs. We use only Guids as primary keys (with default set to NewID) because we develop a distributed system with many Sql Server instances, so we have to be sure that every Sql Server generate unique primary key values.
But when using a Guid column as PK, be sure not to use it as your clustered index (thanks to marc_s for the link)
Advantage of the Guid type:
You can create unique values on different locations without synchronization
Disadvantage:
Its a large datatype (16 Bytes) and needs significant more space
It creates index fragmentation (at least when using the newid() function)
Dataconsistency is not an issue with primary keys independent of the datatype because a primary key has to be unique by definition!
I don't believe that an identity column has better join performance. At all, performance is a matter of the right indexes. A primary key is a constraint not an index.
Is your need to have a primary key of typ int with no gaps? This should'nt be a problem normally.
"yes, it KILLS performance - totally. I went from a legacy system with GUID as PK/CK and 99.5% index fragmentation on a daily basis to using INT IDENTITY - HUGE difference. Hardly any index fragmentation anymore, performance is significantly better. GUIDs as Clustering Index on your SQL Server table are BAD BAD BAD - period."
Might be true, but I see no logical reasoning according to which this leads me to conclude that GUIDs PER SE are also BAD BAD BAD.
Maybe you should consider using other types of indexes on such data. And if your dbms does not offer you a choice between several types of index, then perhaps you should consider getting yourself a better dbms.