Primary Key as composite only on Primary table - sql

There is a table in our environment. Recently, it was discovered that performance was greatly improved by sorting datetime, which the dba wanted to make the primary key. Since he can't guarantee uniqueness with the datetime, he added the id that was once the primary key, into his new composite key.
So there is a table with the primary key as datetime / id and also the clustered index is defined this way. All the pk / fk relationships are still set up properly and exist on the id to id paradigm one would expect.
what could be the possible problem of a lopsided primary key?
And performance is considerably improved with this change.
however, in the schema, the actual "primary key" is two columns. what could possibly go wrong?

Do not do that! Set up a unique index with the two fields. It does not have to be primary key. In fact if you want the original key to remain unique, then this is a terrible idea.

EDIT: This answer was assuming Sql Server. If it turns out that it is not then I will delete my answer.
You do not list details so I will have to give a very general answer. In my research I have found that most will recommend a short primary key / clustered index.
The real key here though is what you mean by increased performance. Is it just one query? In other words does this change have beneficial or at least insignificant performance impact on all operations of this data? User interfaces, all reports, and so on. Or is this robbing peter to pay paul?
If this were a reporting database or data warehouse where the majority of reports are based on date, I could see why people might recommend having the clustered index setup in such a way that it would benefit all reports, or the most important ones.
In any other situation I can think of having a non-clustered index would provide almost the same level or performance increase without increasing the size of the PK, which is used in all lookups (more bytes read = slower performance) as well as taking up more space on your data pages.
EDIT:
This article explains this topic better than I could.
https://www.simple-talk.com/sql/learn-sql-server/effective-clustered-indexes/

The performance advantage you are currently seeing(if genuine) is due to the clustered index associated with the primary key, and not the primary key itself. If you are happy with the current index, but are concerned about uniqueness you should keep the unique datetime / id as your clustered index but revert to your old unique id as the primary key.
This also addresses the problem where other tables referencing this primary key would have required the creation of a likely inappropriate datetime column to create a foreign key relationship.

Related

What’s the difference between a primary key and a clustered index? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Performance difference between Primary Key and Unique Clustered Index in SQL Server
I make sure that I searched this forum but nobody asked this question before and I couldn't find any answer in anywhere too.
My question is = "What’s the difference between a primary key and a clustered index?"
Well, for starters, one is a key, and the other one is an index.
In most database lingo, key is something that somehow identifies the data, with no explicit relation to the storage or performance of the data. And a primary key is a piece of data that uniquely identifies that data.
An index on the other hand is something that describes a (faster) way to access data. It does not (generally) concern itself with the integrity and meaning of the data, it's just concerned with performance and storage. In SQL Server specifically, a clustered index is an index that dictates the physical order of storage of the rows. The things that it does are quite complex, but a useful approximation is that the rows are ordered by the value of the clustered index. This means that when you do not specify a order clause, the data is likely to be sorted by the value of the clustered index.
So, they are completely different things, that kinda-sorta compliment each other. That is why SQL Server, when you create a primary key via the designer, throws in a free clustered index along with it.
Before you can ask the difference between primary key and clustered index, you have to know that a key and an index are not the same thing.
A key can be a primary key or a foreign key. There can be only one primary key per table (but it might be more than one column). A key is a logical thing, it serves the business logic and defines the integrity of data. A foreign key is a reference to a primary key of another table.
Indexes helps to speed up your queries, because it builds references to columns of your choice. So it creates separate files that helps your queries that use indexed columns.
A clustered index is a special index that defines the physical order of your table (it should be a sequential data).
I tried to explain this with my own words, but you'll find all resources you need with a google search (and I definitely recommend that you read a lot of this ! )
Primary key is unique identifier for record. It's responsible for unique value of this field. It's simply existing or specially created field or group of fields that uniquely identifies row.
And clustered index is data structure that improves speed of data retrieval operations through an access of ordered records. Index is copy of one part of table. It takes additional physical place on hard disk.
In much of the RDBMS, as far as I know, when you create PK the engine in back creates clustered index. PK used for Entity integrity when clustered index sets data order and used for performance.

Should primary keys be always assigned as clustered index

I have a SQLServer table that stores employee details, the column ID is of GUID type while the column EmployeeNumber of INT type. Most of the time I will be dealing with EmployeeNumber while doing joins and select criteria's.
My question is, whether is it sensible to assign PrimaryKey to ID column while ClusteredIndex to EmployeeNumber?
Yes, it is possible to have a non-clustered primary key, and it is possible to have a clustered key that is completely unrelated to the primary key. By default a primary keys gets to be the clustered index key too, but this is not a requirement.
The primary key is a logical concept: is the key used in your data model to reference entities.
The clustered index key is a physical concept: is the order in which you want the rows to be stored on disk.
Choosing a different clustered key is driven by a variety of factors, like key width when you desire a narrower clustered key than the primary key (because the clustered key gets replicated in every non-clustered index. Or support for frequent range scans (common in time series) when the data is frequently accessed with queries like date between '20100101' and '20100201' (a clustered index key on date would be appropriate).
This subject has been discussed here ad nauseam before, see also What column should the clustered index be put on?.
The ideal clustered index key is:
Sequential
Selective (no dupes, unique for each record)
Narrow
Used in Queries
In general it is a very bad idea to use a GUID as a clustered index key, since it leads to mucho fragmentation as rows are added.
EDIT FOR CLARITY:
PK and Clustered key are indeed separate concepts. Your PK does not need to be your clustered index key.
In practical applications in my own experience, the same field that is your PK should/would be your clustered key since it meets the same criteria listed above.
First, I have to say that I have misgivings about the choice of a GUID as the primary key for this table. I am of the opinion that EmployeeNumber would probably be a better choice, and something naturally unique about the employee would be better than that, such as an SSN (or ATIN), which employers must legally obtain anyway (at least in the US).
Putting that aside, you should never base a clustered index on a GUID column. The clustered index specifies the physical order of rows in the table. Since GUID values are (in theory) completely random, every new row will fall at a random location. This is very bad for performance. There is something called 'sequential' GUIDs, but I would consider this a bit of a hack.
Using a clustured index on something else than the primary key will improve performance on SELECT query which will take advantage of this index.
But you will loose performance on UPDATE query, because in most scenario, they rely on the primary key to found the specific row you want to update.
CREATE query could also loose performance because when you add a new row in the middle of the index a lot of row have to be moved (physically). This won't happen on a primary key with an increment as new record will always be added in the end and won't make move any other row.
If you don't know what kind of operation need the most performance, I recommend to leave the clustered Index on the primary key and use nonclustered index on common search criteria.
Clustered indexes cause the data to be physically stored in that order. For this reason when testing for ranges of consecutive rows, clustered indexes help a lot.
GUID's are really bad clustered indexes since their order is not in a sensible pattern to order on. Int Identity columns aren't much better unless order of entry helps (e.g. most recent hires)
Since you're probably not looking for ranges of employees it probably doesn't matter much which is the Clustered index, unless you can segment blocks of employees that you often aren't interested in (e.g. Termination Dates)
Since EmployeeNumber is unique, I would make it the PK. In SQL Server, a PK is often a clustered index.
Joins on GUIDs is just horrible. #JNK answers this well.

Is a Primary Key necessary in SQL Server?

This may be a pretty naive and stupid question, but I'm going to ask it anyway
I have a table with several fields, none of which are unique, and a primary key, which obviously is.
This table is accessed via the non-unique fields regularly, but no user SP or process access data via the primary key. Is the primary key necessary then? Is it used behind the scenes? Will removing it affect performance Positively or Negatively?
Necessary? No. Used behind the scenes? Well, it's saved to disk and kept in the row cache, etc. Removing will slightly increase your performance (use a watch with millisecond precision to notice).
But ... the next time someone needs to create references to this table, they will curse you. If they are brave, they will add a PK (and wait for a long time for the DB to create the column). If they are not brave or dumb, they will start creating references using the business key (i.e. the data columns) which will cause a maintenance nightmare.
Conclusion: Since the cost of having a PK (even if it's not used ATM) is so small, let it be.
Do you have any foreign keys, do you ever join on the PK?
If the answer to this is no, and your app never retrieves an item from the table by its PK, and no query ever uses it in a where clause, therefore you just added an IDENTITY column to have a PK, then:
the PK in itself adds no value, but does no damage either
the fact that the PK is very likely the clustered index too is .. it depends.
If you have NC indexes, then the fact that you have a narrow artificial clustered key (the IDENTITY PK) is helpful in keeping those indexes narrow (the CDX key is reproduced in every NC leaf slots). So a PK, even if never used, is helpful if you have significant NC indexes.
On the other hand, if you have a prevalent access pattern, a certain query that outweighs all the other is frequency and importance, or which is part of a critical time code path (eg. is the query run on every page visit on your site, or every second by and app etc) then that query is a good candidate to dictate the clustered key order.
And finally, if the table is seldom queried but often written to then it may be a good candidate for a HEAP (no clustered key at all) since heaps are so much better at inserts. See Comparing Tables Organized with Clustered Indexes versus Heaps.
The primary key is behind the scenes a clustered index (by default unless generated as a non clustered index) and holds all the data for the table. If the PK is an identity column the inserts will happen sequentially and no page splits will occur.
But if you don't access the id column at all then you probably want to add some indexes on the other columns. Also when you have a PK you can setup FK relationships
In the logical model, a table must have at least one key. There is no reason to arbitarily specify that one of the keys is 'primary'; all keys are equal. Although the concept of 'primary key' can be traced back to Ted Codd's early work, the mistake was picked up early on has long been corrected in relational theory.
Sadly, PRIMARY KEY found its way into SQL and we've had to live with it ever since. SQL tables can have duplicate rows and, if you consider the resultset of a SELECT query to also be a table, then SQL tables can have duplicate rows too. Relational theorists dislike SQL a lot. However, just because SQL lets you do all kinds of wacky non-relational things, that doesn't mean that you have to actually do them. It is good practice to ensure that every SQL table has at least one key.
In SQL, using PRIMARY KEY on its own has implications e.g. NOT NULL, UNIQUE, the table's default reference for foreign keys. In SQL Server, using PRIMARY KEY on its own has implications e.g. the table's clustered index. However, in all these cases, the implicit behavior can be made explicit using specific syntax.
You can use UNIQUE (constraint rather than index) and NOT NULL in combination to enforce keys in SQL. Therefore, no, a primary key (or even PRIMARY KEY) is not necessary for SQL Server.
I would never have a table without a primary key. Suppose you ever need to remove a duplicate - how would you identify which one to remove and which to keep?
A PK is not necessary.
But you should consider to place a non-unique index on the columns that you use for querying (i.e. that appear in the WHERE-clause). This will considerably boost lookup performance.
The primary key when defined will help improve performance within the database for indexing and relationships.
I always tend to define a primary key as an auto incrementing integer in all my tables, regardless of if I access it or not, this is because when you start to scale up your application, you may find you do actually need it, and it makes life a lot simpler.
A primary key is really a property of your domain model, and it uniquely identifies an instance of a domain object.
Having a clustered index on a montonically increasing column (such as an identity column) will mean page splits will not occur, BUT insertions will unbalance the index over time and therefore rebuilding indexes needs to be done regulary (or when fragmentation reaches a certain threshold).
I have to have a very good reason to create a table without a primary key.
As SQLMenace said, the clustered index is an important column for the physical layout of the table. In addition, having a clustered index, especially a well chosen one on a skinny column like an integer pk, actually increases insert performance.
If you are accessing them via non-key fields the performance probably will not change. However it might be nice to keep the PK for future enhancements or interfaces to these tables. Does your application only use this one table?

SQL primary key - complex primary or string with concatenation?

I have a table with 16 columns. It will be most frequently used table in web aplication and it will contain about few hundred tousand rows. Database is created on sql server 2008.
My question is choice for primary key. What is quicker? I can use complex primary key with two bigint-s or i can use one varchar value but i will need to concatenate it after?
There are many more factors you must consider:
data access prevalent pattern, how are you going to access the table?
how many non-clustered indexes?
frequency of updates
pattern of updates (sequential inserts, random)
pattern of deletes
All these factors, and specially the first two, should drive your choice of the clustered key. Note that the primary key and clustered key are different concepts, often confused. Read up my answer on Should I design a table with a primary key of varchar or int? for a lengthier discussion on the criteria that drive a clustered key choice.
Without any information on your access patterns I can answer very briefly and concise, and actually correct: the narrower key is always quicker (for reasons of IO). However, this response bares absolutely no value. The only thing that will make your application faster is to choose a key that is going to be used by the query execution plans.
A primary key which does not rely on any underlying values (called a surrogate key) is a good choice. That way if the row changes, the ID doesn't have to, and any tables referring to it (Foriegn Keys) will not need to change. I would choose an autonumber (i.e. IDENTITY) column for the primary key column.
In terms of performance, a shorter, integer based primary key is best.
You can still create your clustered index on multiple columns.
Why not just a single INT auto-generated primary key? INT is 32-bit, so it can handle over 4 billion records.
CREATE TABLE Records (
recordId INT NOT NULL PRIMARY KEY,
...
);
A surrogate key might be a fine idea if there are foreign key relationships on this table. Using a surrogate will save tables that refer to it from having to duplicate all those columns in their tables.
Another important consideration is indexes on columns that you'll be using in WHERE clauses. Your performance will suffer if you don't. Make sure that you add appropriate indexes, over and above the primary key, to avoid table scans.
What do you mean quicker? if you need to search quicker, you can create index for any column or create full text search. the primary key just make sure you do not have duplicated records.
The decision relies upon its use. If you are using the table to save data mostly and not retrieve it, then a simple key. If you are mostly querying the data and it is mostly static data where the key values will not change, your index strategy needs to optimize the data to the most frequent query that will be used. Personally, I like the idea of using GUIDs for the primary key and an int for the clustered index. That allows for easy data imports. But, it really depends upon your needs.
Lot’s of variables you haven’t mentioned; whether the data in the two columns is “natural” and there is a benefit in identifying records by a logical ID, if disclosure of the key via a UI poses a risk, how important performance is (a few hundred thousand rows is pretty minimal).
If you’re not too fussy, go the auto number path for speed and simplicity. Also take a look at all the posts on the site about SQL primary key types. Heaps of info here.
Is it a ER Model or Dimensional Model. In ER Model, they should be separate and should not be surrogated. The entire record could have a single surrogate for easy references in URLs etc. This could be a hash of all parts of the composite key or an Identity.
In Dimensional Model, also they must be separate and they all should be surrogated.

Is primary key always clustered?

Please clear my doubt about this, In SQL Server (2000 and above) is primary key automatically cluster indexed or do we have choice to have non-clustered index on primary key?
Nope, it can be nonclustered. However, if you don't explicitly define it as nonclustered and there is no clustered index on the table, it'll be created as clustered.
One might also add that frequently it's BAD to allow the primary key to be clustered. In particular, when the primary key is assigned by an IDENTITY, it has no intrinsic meaning, so any effort to keep the table arranged accordingly would be wasted.
Consider a table Product, with ProductID INT IDENTITY PRIMARY KEY. If this is clustered, then products that are related in some way are likely to be spread all over the disk. It might be better to cluster by something that we're likely to query based on, like the ManufacturerID or the CategoryID. In either of these cases, a clustered index would (other things being equal) make the corresponding query much more efficient.
On the other hand, the foreign key in a child table that points to this might be a good candidate for clustering (my objection is to the column that actually has the IDENTITY attribute, not its relatives). So in my example above, it's likely that ManufacturerID is a foreign key to a Manufacturer table, where it is set as an IDENTITY. That column shouldn't be clustered, but the column in Product that references it might do so to good advantage.