performance improvements to database table - sql

I have the following sql server 2008 database table:
CREATE TABLE [dbo].[cache](
[cache_key] [nvarchar](50) NOT NULL,
[cache_data] [nvarchar](max) NOT NULL,
[expiry_date] [datetime] NOT NULL) ON [PRIMARY]
I want to add a primary key to it, i.e. make the cache_key column the primary key. This column contains unique strings. My question is, are there any implications to making an nvarchar(50) column a primary key? And is it possible to add a primary key to a column that already contains data, provided the cache_key values are unique?
I also have another script that runs each day that removes data from the table based on the expiry_date column. This could mean up to 5000 records deleted based upon comparison to this field. Would it help performance if I created an index on this field?

You can make a primary key out of anything that's indexable and unique. An nvarchar(50) is no problem. You can define a primary key after the fact, as long as every record has a unique value in that column. You won't be allowed to "primary-ize" a column (or set of columns) that isn't unique.
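As a rough T-SQL sketch of the "after the fact" case, using the cache table from the question: the constraint name PK_cache is made up for illustration, and the duplicate check is just one way to confirm the column is unique before adding the key.

```sql
-- List any cache_key values that occur more than once.
-- An empty result means the column can take a primary key.
SELECT cache_key, COUNT(*) AS occurrences
FROM dbo.cache
GROUP BY cache_key
HAVING COUNT(*) > 1;

-- Add the primary key to the existing, populated table.
-- PK_cache is a hypothetical name; the statement fails if duplicates exist.
ALTER TABLE dbo.cache
    ADD CONSTRAINT PK_cache PRIMARY KEY (cache_key);
```

Without an explicit NONCLUSTERED keyword, SQL Server makes this primary key the clustered index when the table has no clustered index yet.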
As for the index, if it only ever gets referenced in a single delete query that runs once a day, then don't bother indexing it. The overhead of maintaining the index through every single insert/update on the table won't be worth the microscopic time savings you'd get on the once-a-day delete. On the other hand, if that field is used frequently in where/join clauses in other queries, then go ahead and put on an index - you'll most likely improve performance.
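If you do decide the index is worth it, a minimal sketch could look like the following. The index name and the exact delete predicate are assumptions, since the question doesn't show the daily script.

```sql
-- Hypothetical index name; only worth having if expiry_date
-- is filtered on by more than the once-a-day purge.
CREATE NONCLUSTERED INDEX IX_cache_expiry_date
    ON dbo.cache (expiry_date);

-- The kind of daily purge described in the question (predicate assumed).
DELETE FROM dbo.cache
WHERE expiry_date < GETDATE();
```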

Technically, you can make any column (or set of columns) whose maximum size is at most 900 bytes your primary key. An NVARCHAR(2000) (up to 4000 bytes) is too wide for a key, but an nvarchar(50) (up to 100 bytes) works.
The requirements for the primary key are:
must be unique
must not be NULL
If those requirements are met - you're good to go.
One thing to keep in mind is this: your primary key is - by default - also your clustering key, the key by which the table's contents are physically ordered (slightly simplified). As such, that clustering key is like the address or pointer of your data row in the table, and it will be included in each and every non-clustered index you have on your table, too.
If you have a table that has no non-clustered indices, or just a single one - no worries. But if your table has quite a few non-clustered indices (like a Customer table which might have four, five indices or even more), then having such a wide clustering key (up to 100 bytes) of variable width is not ideal. In this case, you're better off using something like an INT IDENTITY as your surrogate key, and putting your primary key / clustered index onto that column. It will save you a lot of disk space and make your table perform much better.
Read more about what makes a good clustering key (on a busy, large table) in Kimberly Tripp's blog post Ever-increasing clustering key - the Clustered Index Debate..........again! - highly educational!
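For illustration only, the surrogate-key layout described above might be sketched like this. The table name, the cache_id column and the constraint names are all invented for the example; the other columns are the ones from the question.

```sql
CREATE TABLE dbo.cache_with_surrogate (
    cache_id    INT IDENTITY(1,1) NOT NULL,   -- hypothetical narrow surrogate key
    cache_key   NVARCHAR(50)      NOT NULL,
    cache_data  NVARCHAR(MAX)     NOT NULL,
    expiry_date DATETIME          NOT NULL,
    -- Narrow, fixed-width, ever-increasing clustering key (4 bytes).
    CONSTRAINT PK_cache_with_surrogate
        PRIMARY KEY CLUSTERED (cache_id),
    -- Uniqueness of the natural key is still enforced, via a non-clustered index.
    CONSTRAINT UQ_cache_with_surrogate_key
        UNIQUE NONCLUSTERED (cache_key)
);
```

With this layout, every additional non-clustered index carries the 4-byte cache_id rather than the up-to-100-byte cache_key.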

Related

Primary Key Needed on Fact Tables

I am currently developing a very complicated database schema and was wondering if the fact tables should have primary keys. Each fact table has 50+ columns of data and the only way to make a primary key would be to add an auto incrementing count to each tuple. I am just not sure what this information gets us in the long term, especially since the data will be deleted after 12 months.
My dimension tables of course will have primary keys, just wanting to know what is best practice.
I am a fan of putting an identity column on all tables. This makes it easier to identify specific rows for updating and deleting.
On a fact table with lots of dimensions, of course, such a column can seem superfluous. However, there is still usually a primary key -- which is the combination of dimensions.
I would encourage you to have a primary key on the table, either an identity column or a combination of existing columns. If you use a composite primary key, you should be careful about the ordering of the keys. SQL Server defaults to using the primary key as a clustered index, and if you put the keys in the wrong order, then your table is subject to fragmentation. Identity keys don't have this issue.
It is always good to have a clustering key, which makes it easy to seek to the data when we need it. The clustering key is not only used for queries against the clustered index. It is also stored in every non-clustered index leaf page, so that the engine can seek back to the data pages when there is a key lookup.
Characteristics of good clustering key:
unique (no need for adding uniquefier to make value unique)
incrementing (reduces fragmentation)
narrow (fewer bytes to store in the tree pages of the clustered index and in the leaf pages of non-clustered indexes)
static (reduces fragmentation)
non-nullable (avoids null blocks)
fixed width (avoids variable blocks)
Read more in Kimberly Tripp's post on clustering keys.
Identity columns satisfy all of these criteria, so they are good candidates for the clustered index.
If you are going to hold data for longer, you can go for bigint; if you are going to hold it for one year and then purge, the int data type should be enough.
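Since fragmentation comes up in both answers, here is a hedged sketch of how one might measure it in SQL Server with the sys.dm_db_index_physical_stats function. The table name dbo.FactSales is a placeholder for your own fact table.

```sql
-- Report fragmentation for every index on a (hypothetical) fact table.
SELECT i.name AS index_name,
       ps.index_type_desc,
       ps.avg_fragmentation_in_percent,
       ps.page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(),                       -- current database
         OBJECT_ID(N'dbo.FactSales'),   -- placeholder table name
         NULL,                          -- all indexes
         NULL,                          -- all partitions
         'LIMITED') AS ps               -- cheapest scan mode
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id  = ps.index_id;
```

Comparing the numbers for an identity-keyed table against a composite-keyed one after a load is a simple way to check the claims above on your own data.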

Database indexing - what is the purpose of indexing primary keys

From what I have read, indexing is like writing an index page at the front of a book, so that the db doesn't have to go through all the pages.
If the primary key is indexed, wouldn't that be exactly the same as going through the entire book? The values are all unique anyway, so the index on the primary key has as many entries as there are documents. If so, what is the purpose of indexing primary keys if there is no performance benefit?
The primary key is an index -- keys are indexes! It's just a special name for a special kind of index which is always unique, and which may have an automatically assigned value.
In some databases, the rows are sometimes (or always) stored in the same order as the primary key. In these situations, the primary key may not need to be separately indexed -- the order of the rows is enough of an index on its own.
In some other databases, the primary key is not treated differently. The rows are stored in an arbitrary order -- perhaps in the order they were last modified, for example. In these situations, an index is needed on the primary key to look up the rows.
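As one concrete example (SQL Server, and assuming a table named dbo.cache that already has a primary key), you can see the index that backs the key by querying the catalog:

```sql
-- Each primary key shows up as a unique index on the table.
SELECT i.name,
       i.type_desc,        -- CLUSTERED or NONCLUSTERED
       i.is_primary_key,
       i.is_unique
FROM sys.indexes AS i
WHERE i.object_id = OBJECT_ID(N'dbo.cache');
```

Other database systems expose the same information through their own catalogs, so treat this as an illustration rather than a general recipe.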

Composite Index primary key vs Unique auto increment primary key

We have a transaction table of over 111m rows that has a clustered composite primary key of...
RevenueCentreID int
DateOfSale smalldatetime
SaleItemID int
SaleTypeID int
...in a SQL 2008 R2 database.
We are going to be truncating and refilling the table soon for an archiving project, so the opportunity to get the indexes right will be once the table has been truncated.
Would it be better to keep the composite primary key or should we move to a unique auto increment primary key?
Most searches on the table are done using the DateOfSale and RevenueCentreID columns. We also often join to the SaleItemID column. We hardly ever use the SaleTypeID column; in fact it is only included in the primary key for uniqueness. We don't care about how long it takes to insert and delete sales figures (done overnight), but rather about the speed of returning reports.
A surrogate key serves no purpose here. I suggest a clustered primary key on the columns as listed, and an index on SaleItemID.
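A sketch of that suggestion in T-SQL, assuming the table is called dbo.SalesTransaction (the question doesn't name it) and with made-up constraint and index names:

```sql
-- Clustered primary key on the columns in the order listed in the question.
ALTER TABLE dbo.SalesTransaction
    ADD CONSTRAINT PK_SalesTransaction
    PRIMARY KEY CLUSTERED (RevenueCentreID, DateOfSale, SaleItemID, SaleTypeID);

-- Separate index to support the joins on SaleItemID.
CREATE NONCLUSTERED INDEX IX_SalesTransaction_SaleItemID
    ON dbo.SalesTransaction (SaleItemID);
```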
I have learned that you want and need both a natural key and a surrogate key.
The natural key keeps the business keys unique and is perfect for indexing, whereas the surrogate key will help with queries and development.
So in your case a surrogate auto-incrementing key is good in that it will help keep all the rows of data intact. And a natural key of DateOfSale, RevenueCentreID and maybe ClientID would be a great way of ensuring no duplicate records are stored, and would speed up querying because you can index the natural key.
If you don't care about the speed of inserts and deletions, then you probably want multiple indexes which target the queries precisely.
You could create an auto increment primary key as you suggest, but also create indexes as required to cover the reporting queries. Create a unique constraint on the columns you currently have in the key to enforce uniqueness.
Index tuning wizard will help with defining the optimum set of indexes, but it's better to create your own.
Rule of thumb - you can define columns to index, and also "include" columns.
If your report has an ORDER BY or a WHERE clause on a column, then the index needs to be defined on those columns. Any other fields returned in the select should be included columns.
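Put together, the surrogate-key variant with a unique constraint and a covering index might look roughly like this. The table name, the SaleID column and all constraint/index names are invented for the example; the four remaining columns come from the question.

```sql
CREATE TABLE dbo.SalesTransaction_Surrogate (
    SaleID          INT IDENTITY(1,1) NOT NULL,  -- hypothetical auto-increment key
    RevenueCentreID INT               NOT NULL,
    DateOfSale      SMALLDATETIME     NOT NULL,
    SaleItemID      INT               NOT NULL,
    SaleTypeID      INT               NOT NULL,
    CONSTRAINT PK_SalesTransaction_Surrogate
        PRIMARY KEY CLUSTERED (SaleID),
    -- Keeps the old composite key unique.
    CONSTRAINT UQ_SalesTransaction_Surrogate_NaturalKey
        UNIQUE NONCLUSTERED (RevenueCentreID, DateOfSale, SaleItemID, SaleTypeID)
);

-- Key columns match the WHERE/ORDER BY of the reports;
-- SaleItemID is only returned, so it goes in INCLUDE.
CREATE NONCLUSTERED INDEX IX_SalesTransaction_Surrogate_Report
    ON dbo.SalesTransaction_Surrogate (DateOfSale, RevenueCentreID)
    INCLUDE (SaleItemID);
```

Which columns belong in the key versus the INCLUDE list depends on the actual report queries, which the question doesn't show.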

Should primary keys be always assigned as clustered index

I have a SQL Server table that stores employee details. The column ID is of GUID type, while the column EmployeeNumber is of INT type. Most of the time I will be dealing with EmployeeNumber in joins and select criteria.
My question is, is it sensible to put the primary key on the ID column and the clustered index on EmployeeNumber?
Yes, it is possible to have a non-clustered primary key, and it is possible to have a clustered key that is completely unrelated to the primary key. By default a primary key becomes the clustered index key too, but this is not a requirement.
The primary key is a logical concept: it is the key used in your data model to reference entities.
The clustered index key is a physical concept: it is the order in which you want the rows to be stored on disk.
Choosing a different clustered key is driven by a variety of factors, like key width, when you desire a narrower clustered key than the primary key (because the clustered key gets replicated in every non-clustered index), or support for frequent range scans (common in time series), when the data is frequently accessed with queries like date between '20100101' and '20100201' (a clustered index key on date would be appropriate).
This subject has been discussed here ad nauseam before, see also What column should the clustered index be put on?.
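For the table in the question, the split could be sketched as follows. Constraint and index names are made up, and the remaining employee columns are omitted.

```sql
CREATE TABLE dbo.Employee (
    ID             UNIQUEIDENTIFIER NOT NULL,
    EmployeeNumber INT              NOT NULL,
    -- Logical key: the GUID, enforced by a non-clustered index.
    CONSTRAINT PK_Employee PRIMARY KEY NONCLUSTERED (ID)
);

-- Physical order: the narrow INT column used in joins and filters.
CREATE UNIQUE CLUSTERED INDEX CIX_Employee_EmployeeNumber
    ON dbo.Employee (EmployeeNumber);
```

This assumes EmployeeNumber is unique; if it isn't, drop the UNIQUE keyword and SQL Server will add a hidden uniquifier to the clustering key.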
The ideal clustered index key is:
Sequential
Selective (no dupes, unique for each record)
Narrow
Used in Queries
In general it is a very bad idea to use a GUID as a clustered index key, since it leads to mucho fragmentation as rows are added.
EDIT FOR CLARITY:
PK and Clustered key are indeed separate concepts. Your PK does not need to be your clustered index key.
In practical applications in my own experience, the same field that is your PK should/would be your clustered key since it meets the same criteria listed above.
First, I have to say that I have misgivings about the choice of a GUID as the primary key for this table. I am of the opinion that EmployeeNumber would probably be a better choice, and something naturally unique about the employee would be better than that, such as an SSN (or ATIN), which employers must legally obtain anyway (at least in the US).
Putting that aside, you should never base a clustered index on a GUID column. The clustered index specifies the physical order of rows in the table. Since GUID values are (in theory) completely random, every new row will fall at a random location. This is very bad for performance. There is something called 'sequential' GUIDs, but I would consider this a bit of a hack.
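For reference, the 'sequential' GUIDs mentioned above are generated in SQL Server by NEWSEQUENTIALID(), which can only be used as a column default. A minimal sketch, with an invented table name and constraint names:

```sql
CREATE TABLE dbo.EmployeeSequentialGuid (
    ID UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_EmployeeSequentialGuid_ID DEFAULT NEWSEQUENTIALID(),
    EmployeeNumber INT NOT NULL,
    CONSTRAINT PK_EmployeeSequentialGuid PRIMARY KEY CLUSTERED (ID)
);

-- ID is filled in by the default; values increase on a given machine,
-- so new rows tend to land at the end of the clustered index.
INSERT INTO dbo.EmployeeSequentialGuid (EmployeeNumber) VALUES (1001);
```

The key is still 16 bytes wide, so this only addresses the random-insert problem, not the width concern.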
Putting a clustered index on something other than the primary key will improve performance for SELECT queries that can take advantage of that index.
But you may lose performance on UPDATE queries, because in most scenarios they rely on the primary key to find the specific row you want to update.
INSERT queries could also lose performance, because when you add a new row in the middle of the index, a lot of rows may have to be moved (physically). This won't happen with an auto-incrementing primary key, as new records are always added at the end and no other rows have to move.
If you don't know which kind of operation needs the most performance, I recommend leaving the clustered index on the primary key and using nonclustered indexes on common search criteria.
Clustered indexes cause the data to be physically stored in that order. For this reason when testing for ranges of consecutive rows, clustered indexes help a lot.
GUIDs make really bad clustered index keys, since their order is not a sensible pattern to sort on. Int Identity columns aren't much better unless order of entry helps (e.g. most recent hires).
Since you're probably not looking for ranges of employees it probably doesn't matter much which is the Clustered index, unless you can segment blocks of employees that you often aren't interested in (e.g. Termination Dates)
Since EmployeeNumber is unique, I would make it the PK. In SQL Server, a PK is often a clustered index.
Joins on GUIDs are just horrible. @JNK answers this well.

Database Design and the use of non-numeric Primary Keys

I'm currently in the process of designing the database tables for a customer & website management application. My question is in regards to the use of primary keys as functional parts of a table (and not assigning "ID" numbers to every table just because).
For example, here are four related tables from the database so far, one of which uses the traditional primary key number, the others which use unique names as the primary key:
--
-- website
--
CREATE TABLE IF NOT EXISTS `website` (
`name` varchar(126) NOT NULL,
`client_id` int(11) NOT NULL,
`date_created` timestamp NOT NULL default CURRENT_TIMESTAMP,
`notes` text NOT NULL,
`website_status` varchar(26) NOT NULL,
PRIMARY KEY (`name`),
KEY `client_id` (`client_id`),
KEY `website_status` (`website_status`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
--
-- website_status
--
CREATE TABLE IF NOT EXISTS `website_status` (
`name` varchar(26) NOT NULL,
PRIMARY KEY (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
INSERT INTO `website_status` (`name`) VALUES
('demo'),
('disabled'),
('live'),
('purchased'),
('transfered');
--
-- client
--
CREATE TABLE IF NOT EXISTS `client` (
`id` int(11) NOT NULL auto_increment,
`date_created` timestamp NOT NULL default CURRENT_TIMESTAMP,
`client_status` varchar(26) NOT NULL,
`firstname` varchar(26) NOT NULL,
`lastname` varchar(46) NOT NULL,
`address` varchar(78) NOT NULL,
`city` varchar(56) NOT NULL,
`state` varchar(2) NOT NULL,
`zip` int(11) NOT NULL,
`country` varchar(3) NOT NULL,
`phone` text NOT NULL,
`email` varchar(78) NOT NULL,
`notes` text NOT NULL,
PRIMARY KEY (`id`),
KEY `client_status` (`client_status`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=4 ;
--
-- client_status
--
CREATE TABLE IF NOT EXISTS `client_status` (
`name` varchar(26) NOT NULL,
PRIMARY KEY (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
INSERT INTO `client_status` (`name`) VALUES
('affiliate'),
('customer'),
('demo'),
('disabled'),
('reseller');
As you can see, 3 of the 4 tables use their 'name' as the primary key. I know that these will always be unique. In 2 of the cases (the *_status tables) I am basically using a dynamic replacement for ENUM, since status options could change in the future, and for the 'website' table, I know that the 'name' of the website will always be unique.
I'm wondering if this is sound logic, getting rid of table ID's when I know the name is always going to be a unique identifier, or a recipe for disaster? I'm not a seasoned DBA so any feedback, critique, etc. would be extremely helpful.
Thanks for taking the time to read this!
There are 2 reasons I would always add an ID number to a lookup / ENUM table:
If you are referencing a single column table with the name then you may be better served by using a constraint
What happens if you wanted to rename one of the client_status entries? e.g. if you wanted to change the name from 'affiliate' to 'affiliate user' you would need to update the client table which should not be necessary. The ID number serves as the reference and the name is the description.
In the website table, if you are confident that the name will be unique then it is fine to use as a primary key. Personally I would still assign a numeric ID as it reduces the space used in foreign key tables and I find it easier to manage.
EDIT:
As stated above, you will run into problems if the website name is renamed. By making this the primary key you will be making it very difficult if not impossible for this to be changed at a later date.
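A rough MySQL sketch of the lookup table with an ID, in the style of the question's own DDL. The `id` column and the `client_status_id` column mentioned in the comment are additions for illustration, not part of the original schema.

```sql
CREATE TABLE IF NOT EXISTS `client_status` (
  `id` int(11) NOT NULL auto_increment,
  `name` varchar(26) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `name` (`name`)       -- the name stays unique, but is no longer the key
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

-- Renaming a status touches one row here; rows in `client` that store
-- the status id (a hypothetical `client_status_id` column) are unchanged.
UPDATE `client_status`
SET `name` = 'affiliate user'
WHERE `name` = 'affiliate';
```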
When making natural PRIMARY KEY's, make sure their uniqueness is under your control.
If you're absolutely sure you will never ever have uniqueness violation, then it's OK to use these values as PRIMARY KEY's.
Since website_status and client_status seem to be generated and used by you and only by you, it's acceptable to use them as a PRIMARY KEY, though having a long key may impact performance.
The website name seems to be under the control of the outer world, which is why I'd make it a plain field. What if they want to rename their website?
The counterexamples would be SSN and ZIP codes: it's not you who generates them, and there is no guarantee that they won't ever be duplicated.
Kimberly Tripp has an excellent series of blog articles (GUIDs as PRIMARY KEYs and/or the clustering key and The Clustered Index Debate Continues) on the issue of creating clustered indexes, and choosing the primary key (related issues, but not always exactly the same). Her recommendation is that a clustered index/primary key should be:
Unique (otherwise useless as a key)
Narrow (the key is used in all non-clustered indexes, and in foreign-key relationships)
Static (you don't want to have to change all related records)
Always Increasing (so new records always get added to the end of the table, and don't have to be inserted in the middle)
Using "Name" as your key, while it seems to satisfy #1, doesn't satisfy ANY of the other three.
Even for your "lookup" table, what if your boss decides to change all affiliates to partners instead? You'll have to modify all rows in the database that use this value.
From a performance perspective, I'm probably most concerned that a key be narrow. If your website name is actually a long URL, then that could really bloat the size of any non-clustered indexes, and all tables that use it as a foreign key.
Besides all the other excellent points that have already been made, I would add one more word of caution against using large fields as clustering keys in SQL Server (if you're not using SQL Server, then this probably doesn't apply to you).
I add this because in SQL Server, the primary key on a table is by default also the clustering key (you can change that, if you want to and know about it, but in most cases it's not done).
The clustering key that determines the physical ordering of the SQL Server table is also being added to every single non-clustered index on that table. If you have only a few hundred to a few thousand rows and one or two indices, that's not a big deal. But if you have really large tables with millions of rows, and potentially lots of indices to speed up the queries, this will indeed cause a lot of disk space and server memory to be wasted unnecessarily.
E.g. if your table has 10 million rows, 10 non-clustered indices, and your clustering key is 26 bytes instead of 4 (for an INT), then you're wasting 10 million x 10 x 22 bytes, for a total of 2.2 billion bytes (or 2.2 GBytes approx.) - that's not peanuts anymore!
Again - this only applies to SQL Server, and only if you have really large tables with lots of non-clustered indices on them.
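The arithmetic above can be reproduced directly in T-SQL. The figures are the ones from the example; the cast to BIGINT is needed because the result no longer fits in an INT.

```sql
-- rows x non-clustered indexes x extra bytes per key (26 - 4)
SELECT CAST(10000000 AS BIGINT) * 10 * (26 - 4) AS extra_bytes;
-- returns 2200000000
```

This is only the raw key overhead and ignores page structure and any compression, so treat it as an order-of-magnitude estimate.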
Marc
"If you're absolutely sure you will never ever have uniqueness violation, then it's OK to use these values as PRIMARY KEY's."
If you're absolutely sure you will never ever have uniqueness violation, then don't bother to define the key.
Personally, I think you will run into trouble using this idea. As you end up with more parent-child relationships, you end up with a huge amount of work when the names change (as they always will sooner or later). There can be a big performance hit when having to update a child table that has thousands of rows when the name of the website changes. And you have to plan for how to make sure that those changes happen. Otherwise, website name changes (oops, we let the name expire and someone else bought it) either break because of the foreign key constraint, or you need to put in an automated way (cascade update) to propagate the change through the system. If you use cascading updates, then you can suddenly bring your system to a dead halt while a large change is processed. This is not considered to be a good thing. It really is more effective and efficient to use ids for relationships and then put unique indexes on the name field to ensure they stay unique. Database design needs to consider maintenance of the data integrity and how that will affect performance.
Another thing to consider is that website names tend to be longer than a few characters. This means the performance difference between using an id field for joins and the name for joins could be quite significant. You have to think of these things at the design phase, as it is too late to change to an ID when you have a production system with millions of records that is timing out and the fix is to completely restructure the database and rewrite all of the SQL code. Not something you can fix in fifteen minutes to get the site working again.
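To make the cascade-update scenario concrete, here is a hedged MySQL sketch. The `website_page` child table, its columns and the example domain names are all hypothetical; only the `website` table and its `name` key come from the question.

```sql
-- Hypothetical child table keyed on the website's natural key.
CREATE TABLE IF NOT EXISTS `website_page` (
  `id` int(11) NOT NULL auto_increment,
  `website_name` varchar(126) NOT NULL,
  `path` varchar(255) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `website_name` (`website_name`),
  CONSTRAINT `fk_website_page_website`
    FOREIGN KEY (`website_name`) REFERENCES `website` (`name`)
    ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

-- One logical rename: every matching row in `website_page`
-- is rewritten as part of this single statement.
UPDATE `website`
SET `name` = 'new-example.com'
WHERE `name` = 'old-example.com';
```

With a numeric id as the foreign key and a unique index on `name`, the same rename would update exactly one row.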
This just seems like a really bad idea. What if you need to change the value of the enum? The idea is to make it a relational database and not a set of flat files. At this point, why have the client_status table? Moreover, if you are using the data in an application, by using a type like a GUID or INT, you can validate the type and avoid bad data (in so far as validating the type). Thus, it is another of many lines to deter hacking.
I would argue that a database that is resistant to corruption, even if it runs a little slower, is better than one that isn’t.
In general, surrogate keys (such as arbitrary numeric identifiers) undermine the integrity of the database. Primary keys are the main way of identifying rows in the database; if the primary key values are not meaningful, the constraint is not meaningful. Any foreign keys that refer to surrogate primary keys are therefore also suspect. Whenever you have to retrieve, update or delete individual rows (and be guaranteed of affecting only one), the primary key (or another candidate key) is what you must use; having to work out what a surrogate key value is when there is a meaningful alternative key is a redundant and potentially dangerous step for users and applications.
Even if it means using a composite key to ensure uniqueness, I would advocate using a meaningful, natural set of attributes as the primary key, whenever possible. If you need to record the attributes anyway, why add another one? That said, surrogate keys are fine when there is no natural, stable, concise, guaranteed-to-be-unique key (e.g. for people).
You could also consider using index key compression, if your DBMS supports it. This can be very effective, especially for indexes on composite keys (think trie data structures), and especially if the least selective attributes can appear first in the index.
I think I am in agreement with cheduardo. It has been 25 years since I took a course in database design but I recall being told that database engines can more efficiently manage and load indexes that use character keys. The comments about the database having to update thousands of records when a key is changed and on all of the added space being taken up by the longer keys and then having to be transferred across systems, assumes that the key is actually stored in the records and that it does not have to be transferred across systems anyway. If you create an index on a column(s) of a table, I do not think the value is stored in the records of the table (unless you set some option to do so).
If you have a natural key for a table, even if it is changed occasionally, creating another key creates a redundancy that could result in data integrity issues and actually creates even more information that needs to be stored and transferred across systems. I work for a team that decided to store the local application settings in the database. They have an identity column for each setting, a section name, a key name, and a key value. They have a stored procedure (another holy war) to save a setting that ensures it does not appear twice. I have yet to find a case where I would use a setting's ID. I have, however, ended up with multiple records with the same section and key name that caused my application to fail. And yes, I know that could have been avoided by defining a constraint on the columns.
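The constraint mentioned in the last sentence could be sketched like this, in T-SQL-style syntax. The table and column names are guesses based on the description above, not the team's actual schema.

```sql
CREATE TABLE dbo.app_setting (
    setting_id   INT IDENTITY(1,1) NOT NULL,   -- the identity column described above
    section_name NVARCHAR(50)      NOT NULL,
    key_name     NVARCHAR(50)      NOT NULL,
    key_value    NVARCHAR(255)     NOT NULL,
    CONSTRAINT PK_app_setting PRIMARY KEY (setting_id),
    -- Rejects a second row with the same section and key name,
    -- regardless of how the row is inserted.
    CONSTRAINT UQ_app_setting_section_key UNIQUE (section_name, key_name)
);
```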
Here are a few points to consider before deciding on keys for a table:
A numeric key is more suitable when you use references (foreign keys). Since you are not using foreign keys, it is OK in your case to use a non-numeric key.
A non-numeric key uses more space than a numeric key, which can decrease performance.
Numeric keys make the db simpler to understand (you can roughly tell the number of rows just by looking at the last row).
You NEVER know when the company you work for suddenly explodes in growth and you have to hire 5 developers overnight. Your best bet is to use numeric (integer) primary keys because they will be much easier for the entire team to work with AND will help your performance if and when the database grows. If you have to break records out and partition them, you might want to use the primary key. If you are adding records with a datetime stamp (as every table should), and there is an error somewhere in the code that updates that field incorrectly, the only way to confirm whether the record was entered in the proper sequence is to check the primary keys. There are probably 10 more TSQL or debugging reasons to use INT primary keys, not the least of which is writing a simple query to select the last 5 records entered into the table.
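The "last 5 records" query mentioned above, written in MySQL against the `client` table from the question. It assumes ids are handed out in insertion order, which is the usual behaviour of auto_increment but not a strict guarantee.

```sql
-- Most recently inserted clients, newest first.
SELECT *
FROM `client`
ORDER BY `id` DESC
LIMIT 5;
```

In T-SQL the equivalent would use SELECT TOP (5) instead of LIMIT.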