Primary Key Needed on Fact Tables - sql

I am currently developing a very complicated database schema and was wondering if the fact tables should have primary keys. Each fact table has 50+ columns of data and the only way to make a primary key would be to add an auto incrementing count to each tuple. I am just not sure what this information gets us in the long term, especially since the data will be deleted after 12 months.
My dimension tables of course will have primary keys, just wanting to know what is best practice.

I am a fan of putting an identity column on all tables. This makes it easier to identify specific rows for updating and deleting.
On a fact table with lots of dimensions, of course, such a column can seem superfluous. However, there is still usually a primary key -- which is the combination of dimensions.
I would encourage you to have a primary key on the table, either an identity column or a combination of existing rows. If you use a composite primary key, you should be careful about the ordering of the keys. SQL Server defaults to using the primary key as a clustered index, and if you put the keys in the wrong order, then your table is subject to fragmentation. Identity keys don't have this issue.

It is always good to go for a clustering key, which will leads to easily seeking the data, when we need. Clustering key is not only used for clustered index queries. It is also being stored in every non-clustered index leaf page, for seeking back to the data pages, when there is key-lookup.
Characteristics of good clustering key:
unique (no need for adding uniquefier to make value unique)
incrementing (reduces fragmentation)
narrow (less number of bytes to store in the tree pages of clustered index & in the leaf pages of non-clustered index)
Static (reduces fragmentation)
non-nullable (avoids null blocks)
fixed width (avoids variable blocks)
Read more on Kimberly Tripp Post on clustering key
Identity satisy all these clauses. They are good candidates for clustered index.
If you are going to hold data longer, you can go for Bigint and if you are going to hold for one year and purge, you can go for int datatype itself.

Related

Deciding on a primary key according to value size in SQL Server

I want to ask a question to optimize SQL Server performance. Assume I have an entity - say Item - and I must assign a primary key for it. It has columns and two of them are expected to be unique, one of them is expected to be bigger than the other as tens of characters.
How should I decide primary key?
Should one of them be PK, if so which one, or both, or should I create an Identity number as PK? This is important for me because the entity "Item" would have relations with some other entities and I think the complexity of PK would affect the performance of SQL Server queries.
Personally, I would go with an IDENTITY Primary Key with unique constraints on both the mentioned unique keys and indexes for additonal lookups.
You have to remember that by default SQL Server creates the primary key as the clustered index, which impacts how it is stored on disc. If the new ITEMS came in at random, variance there could be a lot of fragmentation on either the primary keys.
Also, unless cascades and foreign keys are switched on, you would have to manually maintain the relational integrety of the data (unless you use IDENTITY)
Well, the primary key is really only used to uniquely identify each row - so the only requirements for it are: it has to be unique and typically also should not contain NULL.
Anything else is most likely more relevant for the clustering key in SQL Server - the column (or set of columns) by which the data is physically ordered on disk. By default, the primary key is also the clustering key in SQL Server.
The clustering key is the most important choice in SQL Server because it has far reaching performance implications. A good clustering key is
narrow
unique
stable
if possible ever-increasing
It has to be unique so that it can be added to each and every single nonclustered index for lookup into the actual data tables - if you pick a non-unique column (or set of columns), SQL Server will add a 4-byte "uniquefier" for you.
It should be as narrow as possible, since it's stored in a lot of places. Try to stick to 4 bytes for an INT or 8 bytes for a BIGINT - avoid long and variable length VARCHAR columns since those are both too wide, and the variable length also carries additional overhead. Because of this, sets of columns are also rather rarely a good choice.
The clustering key should be stable - value shouldn't change over time - since every time a value changes, potentially a lot of index entries (in the clustered index itself, and every single nonclustered index, too) need to be updated which causes a lot of unnecessary overhead.
And if it's ever-increasing (like an INT IDENTITY), you also can avoid most page splits - an extremely expensive and involved procedure that happens if you use random values (like GUID's) as your clustering key.
So in brief: an INT IDENTITY is ideal - GUIDs, variable length strings, or combinations of columns are typically less of a good choice.
Choose the one you will use to identify the records in queries and joins to other tables. Size is relative, and whilst a consideration usually not an issue since the PK will be indexed and the other unique column can make use also of a unique index.
The uniqueidentifier data type for e.g. is a 36 character long string representation and performs fine as a primary key under the majority of circumstances.

What’s the difference between a primary key and a clustered index? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Performance difference between Primary Key and Unique Clustered Index in SQL Server
I make sure that I searched this forum but nobody asked this question before and I couldn't find any answer in anywhere too.
My question is = "What’s the difference between a primary key and a clustered index?"
Well, for starters, one is a key, and the other one is an index.
In most database lingo, key is something that somehow identifies the data, with no explicit relation to the storage or performance of the data. And a primary key is a piece of data that uniquely identifies that data.
An index on the other hand is something that describes a (faster) way to access data. It does not (generally) concern itself with the integrity and meaning of the data, it's just concerned with performance and storage. In SQL Server specifically, a clustered index is an index that dictates the physical order of storage of the rows. The things that it does are quite complex, but a useful approximation is that the rows are ordered by the value of the clustered index. This means that when you do not specify a order clause, the data is likely to be sorted by the value of the clustered index.
So, they are completely different things, that kinda-sorta compliment each other. That is why SQL Server, when you create a primary key via the designer, throws in a free clustered index along with it.
Before you can ask the difference between primary key and clustered index, you have to know that a key and an index are not the same thing.
A key can be a primary key or a foreign key. There can be only one primary key per table (but it might be more than one column). A key is a logical thing, it serves the business logic and defines the integrity of data. A foreign key is a reference to a primary key of another table.
Indexes helps to speed up your queries, because it builds references to columns of your choice. So it creates separate files that helps your queries that use indexed columns.
A clustered index is a special index that defines the physical order of your table (it should be a sequential data).
I tried to explain this with my own words, but you'll find all resources you need with a google search (and I definitely recommend that you read a lot of this ! )
Primary key is unique identifier for record. It's responsible for unique value of this field. It's simply existing or specially created field or group of fields that uniquely identifies row.
And clustered index is data structure that improves speed of data retrieval operations through an access of ordered records. Index is copy of one part of table. It takes additional physical place on hard disk.
In much of the RDBMS, as far as I know, when you create PK the engine in back creates clustered index. PK used for Entity integrity when clustered index sets data order and used for performance.

Should primary keys be always assigned as clustered index

I have a SQLServer table that stores employee details, the column ID is of GUID type while the column EmployeeNumber of INT type. Most of the time I will be dealing with EmployeeNumber while doing joins and select criteria's.
My question is, whether is it sensible to assign PrimaryKey to ID column while ClusteredIndex to EmployeeNumber?
Yes, it is possible to have a non-clustered primary key, and it is possible to have a clustered key that is completely unrelated to the primary key. By default a primary keys gets to be the clustered index key too, but this is not a requirement.
The primary key is a logical concept: is the key used in your data model to reference entities.
The clustered index key is a physical concept: is the order in which you want the rows to be stored on disk.
Choosing a different clustered key is driven by a variety of factors, like key width when you desire a narrower clustered key than the primary key (because the clustered key gets replicated in every non-clustered index. Or support for frequent range scans (common in time series) when the data is frequently accessed with queries like date between '20100101' and '20100201' (a clustered index key on date would be appropriate).
This subject has been discussed here ad nauseam before, see also What column should the clustered index be put on?.
The ideal clustered index key is:
Sequential
Selective (no dupes, unique for each record)
Narrow
Used in Queries
In general it is a very bad idea to use a GUID as a clustered index key, since it leads to mucho fragmentation as rows are added.
EDIT FOR CLARITY:
PK and Clustered key are indeed separate concepts. Your PK does not need to be your clustered index key.
In practical applications in my own experience, the same field that is your PK should/would be your clustered key since it meets the same criteria listed above.
First, I have to say that I have misgivings about the choice of a GUID as the primary key for this table. I am of the opinion that EmployeeNumber would probably be a better choice, and something naturally unique about the employee would be better than that, such as an SSN (or ATIN), which employers must legally obtain anyway (at least in the US).
Putting that aside, you should never base a clustered index on a GUID column. The clustered index specifies the physical order of rows in the table. Since GUID values are (in theory) completely random, every new row will fall at a random location. This is very bad for performance. There is something called 'sequential' GUIDs, but I would consider this a bit of a hack.
Using a clustured index on something else than the primary key will improve performance on SELECT query which will take advantage of this index.
But you will loose performance on UPDATE query, because in most scenario, they rely on the primary key to found the specific row you want to update.
CREATE query could also loose performance because when you add a new row in the middle of the index a lot of row have to be moved (physically). This won't happen on a primary key with an increment as new record will always be added in the end and won't make move any other row.
If you don't know what kind of operation need the most performance, I recommend to leave the clustered Index on the primary key and use nonclustered index on common search criteria.
Clustered indexes cause the data to be physically stored in that order. For this reason when testing for ranges of consecutive rows, clustered indexes help a lot.
GUID's are really bad clustered indexes since their order is not in a sensible pattern to order on. Int Identity columns aren't much better unless order of entry helps (e.g. most recent hires)
Since you're probably not looking for ranges of employees it probably doesn't matter much which is the Clustered index, unless you can segment blocks of employees that you often aren't interested in (e.g. Termination Dates)
Since EmployeeNumber is unique, I would make it the PK. In SQL Server, a PK is often a clustered index.
Joins on GUIDs is just horrible. #JNK answers this well.

When should I use primary key or index?

When should I use a primary key or an index?
What are their differences and which is the best?
Basically, a primary key is (at the implementation level) a special kind of index. Specifically:
A table can have only one primary key, and with very few exceptions, every table should have one.
A primary key is implicitly UNIQUE - you cannot have more than one row with the same primary key, since its purpose is to uniquely identify rows.
A primary key can never be NULL, so the row(s) it consists of must be NOT NULL
A table can have multiple indexes, and indexes are not necessarily UNIQUE. Indexes exist for two reasons:
To enforce a uniquness constraint (these can be created implicitly when you declare a column UNIQUE)
To improve performance. Comparisons for equality or "greater/smaller than" in WHERE clauses, as well as JOINs, are much faster on columns that have an index. But note that each index decreases update/insert/delete performance, so you should only have them where they're actually needed.
Differences
A table can only have one primary key, but several indexes.
A primary key is unique, whereas an index does not have to be unique. Therefore, the value of the primary key identifies a record in a table, the value of the index not necessarily.
Primary keys usually are automatically indexed - if you create a primary key, no need to create an index on the same column(s).
When to use what
Each table should have a primary key. Define a primary key that is guaranteed to uniquely identify each record.
If there are other columns you often use in joins or in where conditions, an index may speed up your queries. However, indexes have an overhead when creating and deleting records - something to keep in mind if you do huge amounts of inserts and deletes.
Which is best?
None really - each one has its purpose. And it's not that you really can choose the one or the other.
I recommend to always ask yourself first what the primary key of a table is and to define it.
Add indexes by your personal experience, or if performance is declining. Measure the difference, and if you work with SQL Server learn how to read execution plans.
This might help Back to the Basics: Difference between Primary Key and Unique Index
The differences between the two are:
Column(s) that make the Primary Key of a table cannot be NULL since by definition, the Primary Key cannot be NULL since it helps uniquely identify the record in the table. The column(s) that make up the unique index can be nullable. A note worth mentioning over here is that different RDBMS treat this differently –> while SQL Server and DB2 do not allow more than one NULL value in a unique index column, Oracle allows multiple NULL values. That is one of the things to look out for when designing/developing/porting applications across RDBMS.
There can be only one Primary Key defined on the table where as you can have many unique indexes defined on the table (if needed).
Also, in the case of SQL Server, if you go with the default options then a Primary Key is created as a clustered index while the unique index (constraint) is created as a non-clustered index. This is just the default behavior though and can be changed at creation time, if needed.
Keys and indexes are quite different concepts that achieve different things. A key is a logical constraint which requires tuples to be unique. An index is a performance optimisation feature of a database and is therefore a physical rather than a logical feature of the database.
The distinction between the two is sometimes blurred because often a similar or identical syntax is used for specifying constraints and indexes. Many DBMSs will create an index by default when key constraints are created. The potential for confusion between key and index is unfortunate because separating logical and physical concerns is a highly important aspect of data management.
As regards "primary" keys. They are not a "special" type of key. A primary key is just any one candidate key of a table. There are at least two ways to create candidate keys in most SQL DBMSs and that is either using the PRIMARY KEY constraint or using a UNIQUE constraint on NOT NULL columns. It is a very widely observed convention that every SQL table has a PRIMARY KEY constraint on it. Using a PRIMARY KEY constraint is conventional wisdom and a perfectly reasonable thing to do but it generally makes no practical or logical difference because most DBMSs treat all keys as equal. Certainly every table ought to enforce at least one candidate key but whether those key(s) are enforced by PRIMARY KEY or UNIQUE constraints doesn't usually matter. In principle it is candidate keys that are important, not "primary" keys.
The primary key is by definition unique: it identifies each individual row. You always want a primary key on your table, since it's the only way to identify rows.
An index is basically a dictionary for a field or set of fields. When you ask the database to find the record where some field is equal to some specific value, it can look in the dictionary (index) to find the right rows. This is very fast, because just like a dictionary, the entries are sorted in the index allowing for a binary search. Without the index, the database has to read each row in the table and check the value.
You generally want to add an index to each column you need to filter on. If you search on a specific combination of columns, you can create a single index containing all of those columns. If you do so, the same index can be used to search for any prefix of the list of columns in your index. Put simply (if a bit inaccurately), the dictionary holds entries consisting of the concatenation of the values used in the columns, in the specified order, so the database can look for entries which start with a specific value and still use efficient binary search for this.
For example, if you have an index on the columns (A, B, C), this index can be used even if you only filter on A, because that is the first column in the index. Similarly, it can be used if you filter on both A and B. It cannot, however, be used if you only filter on B or C, because they are not a prefix in the list of columns - you need another index to accomodate that.
A primary key also serves as an index, so you don't need to add an index convering the same columns as your primary key.
Every table should have a PRIMARY KEY.
Many types of queries are sped up by the judicious choice of an INDEX. It may be that the best index is the primary key. My point is that the query is the main factor in whether to use the PK for its index.

SQL primary key - complex primary or string with concatenation?

I have a table with 16 columns. It will be most frequently used table in web aplication and it will contain about few hundred tousand rows. Database is created on sql server 2008.
My question is choice for primary key. What is quicker? I can use complex primary key with two bigint-s or i can use one varchar value but i will need to concatenate it after?
There are many more factors you must consider:
data access prevalent pattern, how are you going to access the table?
how many non-clustered indexes?
frequency of updates
pattern of updates (sequential inserts, random)
pattern of deletes
All these factors, and specially the first two, should drive your choice of the clustered key. Note that the primary key and clustered key are different concepts, often confused. Read up my answer on Should I design a table with a primary key of varchar or int? for a lengthier discussion on the criteria that drive a clustered key choice.
Without any information on your access patterns I can answer very briefly and concise, and actually correct: the narrower key is always quicker (for reasons of IO). However, this response bares absolutely no value. The only thing that will make your application faster is to choose a key that is going to be used by the query execution plans.
A primary key which does not rely on any underlying values (called a surrogate key) is a good choice. That way if the row changes, the ID doesn't have to, and any tables referring to it (Foriegn Keys) will not need to change. I would choose an autonumber (i.e. IDENTITY) column for the primary key column.
In terms of performance, a shorter, integer based primary key is best.
You can still create your clustered index on multiple columns.
Why not just a single INT auto-generated primary key? INT is 32-bit, so it can handle over 4 billion records.
CREATE TABLE Records (
recordId INT NOT NULL PRIMARY KEY,
...
);
A surrogate key might be a fine idea if there are foreign key relationships on this table. Using a surrogate will save tables that refer to it from having to duplicate all those columns in their tables.
Another important consideration is indexes on columns that you'll be using in WHERE clauses. Your performance will suffer if you don't. Make sure that you add appropriate indexes, over and above the primary key, to avoid table scans.
What do you mean quicker? if you need to search quicker, you can create index for any column or create full text search. the primary key just make sure you do not have duplicated records.
The decision relies upon its use. If you are using the table to save data mostly and not retrieve it, then a simple key. If you are mostly querying the data and it is mostly static data where the key values will not change, your index strategy needs to optimize the data to the most frequent query that will be used. Personally, I like the idea of using GUIDs for the primary key and an int for the clustered index. That allows for easy data imports. But, it really depends upon your needs.
Lot’s of variables you haven’t mentioned; whether the data in the two columns is “natural” and there is a benefit in identifying records by a logical ID, if disclosure of the key via a UI poses a risk, how important performance is (a few hundred thousand rows is pretty minimal).
If you’re not too fussy, go the auto number path for speed and simplicity. Also take a look at all the posts on the site about SQL primary key types. Heaps of info here.
Is it a ER Model or Dimensional Model. In ER Model, they should be separate and should not be surrogated. The entire record could have a single surrogate for easy references in URLs etc. This could be a hash of all parts of the composite key or an Identity.
In Dimensional Model, also they must be separate and they all should be surrogated.