SQL Index - Difference Between char and int - sql-server-2005

I have a table in a SQL Server 2005 database.
The primary key field of the table is a code number.
As a standard, the code must contain exactly 4 numeric digits. For example: 1234, 7834, ...
Would you suggest char(4), int, or numeric(4) as the field type for efficient SELECT operations?
Would indexing the table on any one of these types differ from indexing it on the others?

Integer / Identity columns are often used for primary keys in database tables for a number of reasons. Primary key columns must be unique, should not be updatable, and really should be meaningless. This makes an identity column a pretty good choice because the server will get the next value for you, they must be unique, and integers are relatively small and useable (compared to a GUID).
Some database architects will argue that other data types should be used for primary key values and the "meaningless" and "not updatable" criteria can be argued convincingly on both sides. Regardless, integer / identity fields are pretty convenient and many database designers find that they make suitable key values for referential integrity.
The best choice for a primary key is an integer data type, since integer values are processed faster than character values. Character values have to be compared character by character under the column's collation rules, whereas two integers are compared in a single step.
Fetching a record by primary key will be faster with an integer key whenever the integer is narrower than the character key, because more index records fit on a single page, so the total search time decreases. Joins will be faster for the same reason. This applies when your query uses a clustered index seek rather than a scan, and only one table is involved. In the case of a scan, not having an additional column means more rows fit on one data page.
Hopefully this will help you!
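The "more index records per page" argument can be made concrete with rough arithmetic. This is only a sketch: it assumes an 8 KB page with about 8,096 usable bytes, and the 7-byte per-entry overhead is an illustrative assumption, not a measured SQL Server figure. Note that for the 4-digit code in the question, int and char(4) have the same width, so this particular argument only favours int over wider character keys.

```python
PAGE_USABLE_BYTES = 8096    # assumption: 8 KB page minus a 96-byte page header
ENTRY_OVERHEAD_BYTES = 7    # assumption: illustrative fixed overhead per index entry


def entries_per_page(key_bytes: int) -> int:
    """Rough number of index entries that fit on one page for a given key width."""
    return PAGE_USABLE_BYTES // (key_bytes + ENTRY_OVERHEAD_BYTES)


for type_name, width in [("int", 4), ("char(4)", 4), ("char(20)", 20)]:
    print(f"{type_name:>8}: about {entries_per_page(width)} entries per page")
```

Fewer entries per page means more pages to read for the same number of keys, which is where the extra I/O comes from.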

I advocate a SMALLINT column, simply because it is the smallest sensible data type that fits the required range (up to 32,767, well in excess of 4 digits). Use a check constraint to enforce the 4-digit limit and a computed column to return the char(4) value.
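A minimal sketch of that idea, using Python's built-in sqlite3 module as a stand-in for SQL Server so it can be run anywhere. The table name `item`, the view `item_display`, and the helper `try_insert` are hypothetical; in SQL Server the column would be a SMALLINT and the padded text would come from a computed column rather than a view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (
        -- numeric key, limited to at most 4 digits by the CHECK constraint
        code INTEGER PRIMARY KEY CHECK (code BETWEEN 0 AND 9999),
        name TEXT NOT NULL
    );
    -- the view plays the role of the computed char(4) column
    CREATE VIEW item_display AS
        SELECT code, printf('%04d', code) AS code_text, name FROM item;
""")
conn.execute("INSERT INTO item (code, name) VALUES (?, ?)", (42, "widget"))


def try_insert(code: int) -> bool:
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute("INSERT INTO item (code, name) VALUES (?, ?)", (code, "x"))
        return True
    except sqlite3.IntegrityError:
        return False


padded = conn.execute(
    "SELECT code_text FROM item_display WHERE code = 42"
).fetchone()[0]
print(padded)
```

The key stays a compact number for indexing and joins, while callers that need the zero-padded code read the derived text.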

If I remember correctly, ints take up less storage than chars, so you should go with int.
These two links say the same:
http://www.eggheadcafe.com/software/aspnet/31759030/varcharschars-vs-intbigint-as-keys.aspx
http://sql-server-performance.com/Community/forums/p/16020/94489.aspx

"It depends"
In this case, char(4) captures the data stored correctly with no storage overhead (4 bytes each). And 0001 is not the same as 1 of course.
You do have some overhead for processing collation etc if you have non-numeric digits, but it shouldn't matter for reasonably sized databases. And with a 4 digit code you do have an upper bound for number of rows especially if numeric (10k).
If your new codes are not strictly increasing, then you get the page split issue associated with GUID clustered keys.
If they are strictly increasing, then use int and add a computed column that adds the leading zeros.

Related

Deciding on a primary key according to value size in SQL Server

I want to ask a question about optimizing SQL Server performance. Assume I have an entity, say Item, and I must assign a primary key to it. It has several columns, two of which are expected to be unique, and one of those is expected to be tens of characters longer than the other.
How should I decide primary key?
Should one of them be PK, if so which one, or both, or should I create an Identity number as PK? This is important for me because the entity "Item" would have relations with some other entities and I think the complexity of PK would affect the performance of SQL Server queries.
Personally, I would go with an IDENTITY primary key, with unique constraints on both of the unique columns you mention and indexes for additional lookups.
You have to remember that by default SQL Server creates the primary key as the clustered index, which determines how the data is stored on disk. If new items arrive with key values in random order, using either of those columns as the primary key could cause a lot of fragmentation.
Also, unless cascades and foreign keys are switched on, you would have to manually maintain the relational integrity of the data (unless you use IDENTITY).
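The layout suggested in this answer (a surrogate key plus unique constraints on both natural candidates) might look like the following sketch. It uses Python's sqlite3 module as a stand-in for SQL Server, with AUTOINCREMENT playing the role of IDENTITY; the column names `short_code` and `long_name` and the helper `insert_item` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE item (
        item_id    INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        short_code TEXT NOT NULL UNIQUE,               -- shorter unique column
        long_name  TEXT NOT NULL UNIQUE                -- longer unique column
    )
""")
conn.execute(
    "INSERT INTO item (short_code, long_name) VALUES (?, ?)",
    ("A1", "First item with a much longer descriptive name"),
)


def insert_item(short_code: str, long_name: str):
    """Insert a row; return its generated id, or None if a unique constraint fails."""
    try:
        cur = conn.execute(
            "INSERT INTO item (short_code, long_name) VALUES (?, ?)",
            (short_code, long_name),
        )
        return cur.lastrowid
    except sqlite3.IntegrityError:
        return None
```

Other tables would reference `item_id`, while the two unique constraints keep the natural identifiers honest.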
Well, the primary key is really only used to uniquely identify each row - so the only requirements for it are: it has to be unique and typically also should not contain NULL.
Anything else is most likely more relevant for the clustering key in SQL Server - the column (or set of columns) by which the data is physically ordered on disk. By default, the primary key is also the clustering key in SQL Server.
The clustering key is the most important choice in SQL Server because it has far reaching performance implications. A good clustering key is
narrow
unique
stable
if possible ever-increasing
It has to be unique so that it can be added to each and every single nonclustered index for lookup into the actual data tables - if you pick a non-unique column (or set of columns), SQL Server will add a 4-byte "uniquefier" for you.
It should be as narrow as possible, since it's stored in a lot of places. Try to stick to 4 bytes for an INT or 8 bytes for a BIGINT - avoid long and variable length VARCHAR columns since those are both too wide, and the variable length also carries additional overhead. Because of this, sets of columns are also rather rarely a good choice.
The clustering key should be stable - value shouldn't change over time - since every time a value changes, potentially a lot of index entries (in the clustered index itself, and every single nonclustered index, too) need to be updated which causes a lot of unnecessary overhead.
And if it's ever-increasing (like an INT IDENTITY), you can also avoid most page splits, an extremely expensive and involved procedure that happens if you use random values (like GUIDs) as your clustering key.
So in brief: an INT IDENTITY is ideal - GUIDs, variable length strings, or combinations of columns are typically less of a good choice.
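Because the clustering key is repeated in every nonclustered index entry, its width multiplies quickly. A back-of-the-envelope sketch, assuming one copy of the key per row in the clustered index plus one copy per row in each nonclustered index, and ignoring row overhead, non-leaf levels, and compression; the function name and the sample figures are made up for illustration.

```python
def clustering_key_bytes(rows: int, nonclustered_indexes: int, key_bytes: int) -> int:
    """Bytes spent on clustering-key copies across the table and its indexes."""
    copies_per_row = 1 + nonclustered_indexes
    return rows * key_bytes * copies_per_row


rows, indexes = 1_000_000, 4
int_cost = clustering_key_bytes(rows, indexes, 4)     # INT key
guid_cost = clustering_key_bytes(rows, indexes, 16)   # 16-byte GUID key
print(f"INT:  {int_cost / 1_000_000:.0f} MB")
print(f"GUID: {guid_cost / 1_000_000:.0f} MB")
```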
Choose the one you will use to identify the records in queries and in joins to other tables. Size is relative: it is a consideration, but usually not an issue, since the PK will be indexed and the other unique column can also have a unique index.
The uniqueidentifier data type, for example, is stored as 16 bytes (its string representation is 36 characters long) and performs fine as a primary key under the majority of circumstances.

Using VARCHAR as PRIMARY KEY for an 'ORPHAN' table

I'm to create an orphan table (no relationships with any other table whatsoever) that contains 3 columns.
Col1 - String field - VARCHAR(32) - Contains unique data not more than 32 characters
Col2 - String field - TEXT - Contains larger non-unique data of characters
Col3 - Numeric (Bool) - INT(1) - 0/1 for Flagging
I'm thinking of using Col1 as my PRIMARY KEY. I have done some research and see people argue that using a meaningless INT column as a PRIMARY KEY to avoid Foreign Key/Storage issues is the way to go.
However, IMO, since this is an orphan table, it should not matter. Besides, I would require to place an INDEX on Col1 anyway.
As a side note, I'm not expecting more than ~1000 rows in this table.
Thoughts please.
I'd still just use an INT PK and put an index on COL1. I suppose you could use COL1 as the index if you can ensure that nothing will ever be joined to that table, but if nothing else the index will give you an idea of the order in which items are added/deleted from the table. I also like to add an IsActive boolean so that you never delete anything and a DateCreated datetime to almost every table.
If col1 is your real primary key, there is no reason not to use it. Especially if the table is that tiny.
You would need to maintain a unique index on that column anyway, so by adding an artificial primary key you just add more overhead for insert and delete operations (as two indexes must be maintained).
Unless you are referencing that PK from really, really many other rows (and other tables) you should just go with what is the natural primary for your business rules.
I see where you are coming from, but it just makes sense to index the first column anyway. It may be because I am used to Excel, but an initial integer column used as the primary key also gives the rows an order and makes them easier to read while debugging or capturing data. If you use a more randomly generated value, you would still be searching through a few hundred rows looking for a hard-to-distinguish key. In the end I highly recommend the extra column of ints. It is well worth it.
Whenever I create database tables I keep my INT column.
I believe it's faster to compare numbers than strings.
So it all depends on how often you will query the database and compare strings there.
I'm still unclear what the question is. Judging from the answers, I've deduced it down to two plausible questions:
Is it okay to use a VARCHAR instead of an INTEGER as a primary key?
Yes it is okay to use a VARCHAR instead.
In many cases it is preferred, especially if your table is expected to grow beyond 2,147,483,647 records (yes this happens). Performance-wise, even if INTs had a minimal speed advantage, on a ~1000 record table, you would not see it. Designated PKs are indexed by default. The one problem is that you'll lose any auto-generating sequence that the database can do for you.
Is it okay to use your unique COL1 field as a primary key, instead of some other unique ID field?
Yes it is okay.
The whole notion of having a primary key is to establish a unique field. What you're losing, though, may be some intrinsic comprehension. When other users want to join on that table, it's far easier to understand that id is a unique field, whereas col1 (some varchar) may or may not be unique.
In your given scenario, it should be okay. If the scope does grow, you can always introduce an auto_increment PK column. Just make sure that your field is both indexed and unique.

Can you use GUID for a column if column is not a primary key?

I have two columns in a database. Both are ids. I am making one a numeric auto-increment column, and this is the one I chose as the primary key. But the other column has to contain unique auto-generated ids as well. They have to be different from the ids of the primary key (i.e. the numeric auto-increment column). Thus I was thinking of using a GUID for this column. BUT:
Can you use GUID for a column if column is not a primary key?
Is it good practice?
Clearly, the GUID column will be unique, but it is rather big for joining purposes. So, it can be desirable to use the numeric column within the database for foreign key joins (joining on 4 byte or 8 byte integer quantities) rather than the 16 byte GUID column (unless you store it as a string, in which case it has 32 hex digits plus 4 dashes). The indexes will be smaller for the numeric column; the referencing tables will be smaller; the join comparisons will be quicker.
So, there might well be good reasons to have both, and to have the GUID as a secondary (alternative) unique key on the table rather than as the primary key.
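The size figures quoted in this answer can be checked with Python's standard uuid module (used here only to illustrate the sizes, not how SQL Server stores the value):

```python
import uuid

guid = uuid.uuid4()

binary_size = len(guid.bytes)   # raw GUID, as stored in a binary/uniqueidentifier column
text_size = len(str(guid))      # canonical text form: hex digits plus dashes
hex_digits = len(guid.hex)      # hex digits only
dash_count = str(guid).count("-")

print(binary_size, text_size, hex_digits, dash_count)
```

Stored as text, the same identifier is more than twice as wide as the binary form, and nine times as wide as a 4-byte integer.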
This is possible and seems no problem except data size and some performance overhead.
As for me, this is normal practice if you need to sync data in the table with some external system. You then use the GUID as a global record identifier for synchronization and the numeric primary key for relations inside the database, because integers are faster than GUIDs.
Yes, you can. Whether it's a good idea depends on circumstance, but in general there's no issue here. Furthermore, you should probably create a unique key on that column to enforce uniqueness (doing this will also allow you to create foreign keys that reference this column).
Just make sure you don't create a clustered index on that column!

Optimize SQL databases by adding index columns

Say I have a database looking like this:
Product with columns [ProductName] [Price] [Misc] [Etc]
Order with columns [OrderID] [ProductName] [Quantity] [Misc] [Etc]
ProductName is primary key of Product, of some string type and unique.
OrderID is primary key and of some integer type, and ProductName being a foreign key.
Say I change the primary key of Product to a new column of integer type ie [ProductID].
Would this reduce the database size and optimize lookups joining these two tables (and likewise operations), or are these optimizations performed automatically by (most/general/main) SQL database implementations?
Technically, with the (string) ProductName as primary key in Product, a database should be able to implement the ProductName column in Order as simply a pointer to a row in Product, and perform a JOIN as quickly as with an integer foreign key. Is this a standard way of implementing SQL?
Update:
This question is about how SQL servers handle foreign keys, not whether a product table needs a serial number, or how I handle a product name change in a database.
A string primary key is a bad idea, so changing it to an INT will help performance. Most databases use the primary key index for lookups and comparisons, so choose a brief primary key (one column, if possible). You use primary key columns for joins (combining data from two or more tables based on common values in join columns), for query retrieval, and for grouping or sorting a query result set. The briefer the index entries are, the faster the database can perform the lookups and comparisons.
Not to mention, if the name of the product changes, how can you handle that? update all rows that contain the product name as a Foreign Key?
I couldn't have said it any better, so check out this answer: Should I design a table with a primary key of varchar or int, quote from that answer:
"Using a VARCHAR(10) or (20) just uses up too much space - 10 or 20 bytes instead of 4, and what a lot of folks don't know - the clustering key value will be repeated on every single index entry on every single non-clustered index on the table, so potentially, you're wasting a lot of space (not just on disk - that's cheap - but also in SQL Server's main memory). Also, since it's variable (might be 4, might be 20 chars) it's harder to SQL server to properly maintain a good index structure"
integer column acts better than string in joins
integer autoinc columns as primary clustered key is good for inserts
It won't reduce database size (presumably you'll keep the product name field), but it should definitely improve lookup performance.
Integer datatype in most implementations will be less in size than the string (CHAR, VARCHAR etc.), this will make your index smaller in size.
In addition, there are some issues with comparing the strings:
Some databases, namely MySQL, compress the string keys which can make the searches less efficient.
String B-Trees that use natural language identifiers tend to be less concurrency balanced than integer B-Trees. Since the natural language words are not distributed evenly across the alphabet, more updates and inserts will go to the same block, increasing the number of page splits and ultimately increasing the index size. To work around this, Oracle supports REVERSE clause in indexes.
When comparing two strings, a collation should be taken into account. Normally, it does not matter much, however, it does add some overhead.
Primary keys should be unique, exist at time of row creation and be as immutable as possible. IMO, discussions about whether to use a surrogate key should be secondary to issues of data integrity.
If for example a product had a serial number stamped on the item, which had to exist at the time the row in the database was entered and was guaranteed to be unique, then IMO that would make a good primary key. The reason is this value will be used as the foreign key in other tables and it saves you the expense of an additional lookup to get the product's serial number. The additional storage space is inconsequential until you get into the many millions of rows. However, if the serial number was stamped by some other manufacturer so you had no guarantees of uniqueness ("it is probably unique" is not good enough), then a surrogate is appropriate. In fact, I would go so far as to say a good portion if not most "products" tables use surrogate keys because no value that is guaranteed to be available at time of entry, guaranteed to be unique and will be relatively immutable is available as a key.
However, many developers that use surrogate keys overlook the need that every table that has a surrogate key should also have another key (i.e. a unique constraint). Thus, in your case with products, even if you add an integer primary key, you should still have a unique constraint on product name. The unique constraint on product name creates what is called a candidate key with the integer value being the primary key.
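A sketch of the Product/Order example with an integer surrogate key and a unique constraint on the product name, using Python's sqlite3 module as a stand-in database. The table is called `orders` because ORDER is a reserved word, and the sample rows are invented. It also shows that renaming a product touches a single row, since orders reference the integer key rather than the name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE product (
        product_id   INTEGER PRIMARY KEY,       -- surrogate key
        product_name TEXT NOT NULL UNIQUE,      -- candidate key, still enforced
        price        REAL NOT NULL
    );
    CREATE TABLE orders (
        order_id   INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL REFERENCES product(product_id),
        quantity   INTEGER NOT NULL
    );
    INSERT INTO product VALUES (1, 'Widget', 2.50);
    INSERT INTO orders VALUES (100, 1, 3), (101, 1, 5);
""")

# Renaming the product updates one row; no order rows need to change.
renamed_rows = conn.execute(
    "UPDATE product SET product_name = 'Gadget' WHERE product_id = 1"
).rowcount

# The join runs on the integer key and picks up the new name.
joined = conn.execute("""
    SELECT o.order_id, p.product_name
    FROM orders AS o
    JOIN product AS p ON p.product_id = o.product_id
    ORDER BY o.order_id
""").fetchall()
print(renamed_rows, joined)
```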
Surrogate keys are meant to be behind-the-scenes goo. While integer keys perform the best and are easy to create they have one downside: it is easy, tempting even, for application developers to show the key value to users. This is a mistake IMO. Users should never see the key value or they will come to rely on the value itself which creates problems if you need to re-sequence the values (like say with a database merge) or if you use values that were created in gaps created by the Identity value and they rely on the values being sequential. As long as you never show the value to users, using an integer PK is fine.

Should I use integer primary IDs?

For example, I always generate an auto-increment field for the users table, but I also specify a UNIQUE index on their usernames. There are situations that I first need to get the userId for a given username and then execute the desired query, or use a JOIN in the desired query. It's 2 trips to the database or a JOIN vs. a varchar index.
Should I use integer primary IDs?
Is there a real performance benefit on INT over small VARCHAR indexes?
There are several advantages of having a surrogate primary key, including:
When you have a foreign key in another table, if it is an integer it takes up only a few bytes extra space and can be joined quickly. If you use the username as the primary key it will have to be stored in both tables - taking up more space and it takes longer to compare when you need to join.
If a user wishes to change their username, you will have big problems if you have used it as a primary key. While it is possible to update a primary key, it is very unwise to do so and can cause all sorts of problems as this key might have been sent out to all sorts of other systems, used in links, saved in backups, logs that have been archived, etc. You can't easily update all these places.
It's not just about performance. You should never key on a meaningful value, for reasons that are well documented elsewhere.
By the way, I often scale the type of int to the size of the table. When I know that a table will not exceed 255 rows, I use a tinyint key, and the same for smallint.
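Scaling the key type to the expected range can be sketched as a small lookup. The ranges below are those of SQL Server's integer types; the helper `smallest_int_type` is hypothetical.

```python
# (type name, minimum value, maximum value) for SQL Server integer types
INT_TYPES = [
    ("tinyint", 0, 255),
    ("smallint", -32768, 32767),
    ("int", -2**31, 2**31 - 1),
    ("bigint", -2**63, 2**63 - 1),
]


def smallest_int_type(min_value: int, max_value: int) -> str:
    """Return the narrowest integer type whose range covers [min_value, max_value]."""
    for name, low, high in INT_TYPES:
        if low <= min_value and max_value <= high:
            return name
    raise ValueError("range does not fit any integer type")


print(smallest_int_type(0, 200))     # small lookup table
print(smallest_int_type(0, 9999))    # 4-digit codes
```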
In addition to what others have said, you need to think about the clustering of the table.
In SQL Server for instance (and possibly other vendors), if the primary key is also used as the clustered index of the table (which is quite common), an incrementing integer has an advantage over other field types. This is because new rows are entered with a primary key that is always greater than the previous rows, meaning that the new row can be stored at the end of the table instead of in the middle (this same scenario can be created with other field types for the primary key, but an integer type lends itself better).
Compare this with a guid primary key - new rows have to be inserted into the middle of the table because guids are non-sequential, making inserts very inefficient.
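The difference between appending and inserting into the middle can be simulated with a sorted list standing in for the clustered index. This is only an illustration of key ordering: it does not model pages or actual page splits, and the helper `mid_table_inserts` is hypothetical.

```python
import bisect
import random
import uuid


def mid_table_inserts(keys) -> int:
    """Count inserts that land before the current end of a sorted key list."""
    ordered, count = [], 0
    for key in keys:
        position = bisect.bisect_right(ordered, key)
        if position < len(ordered):
            count += 1          # the new key belongs somewhere in the middle
        ordered.insert(position, key)
    return count


sequential_keys = list(range(1, 1001))                  # like an identity column
rng = random.Random(0)                                  # seeded for repeatability
random_guids = [uuid.UUID(int=rng.getrandbits(128)).bytes for _ in range(1000)]

print("sequential:", mid_table_inserts(sequential_keys))
print("random GUIDs:", mid_table_inserts(random_guids))
```

With sequential keys every insert lands at the end; with random keys nearly every insert lands somewhere inside the existing data.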
First, as is obvious, on small tables, it will make no difference with respect to performance. Only on very large tables (how large depends on numerous factors), can it make a difference for a handful of reasons:
Using a 32-bit integer will only consume 4 bytes of space. Presumably, your usernames will be longer than four non-Unicode characters and thus consume more than 4 bytes of space. The more space used, the fewer pieces of data fit on a page, the fatter the index and the more IO you incur.
Your character columns are going to require the use of varchar over char unless you force everyone to have usernames of identical size. This too will have a tiny performance and storage impact.
Unless you are using a binary sort collation, the system has to do relatively sophisticated matching when comparing two strings. Do the two columns use the same collation? For each character, are they cased the same? What are the casing and accent rules in terms of matching? And so on. While this can be done quickly, it is more work which, in very large tables, can make a difference in comparison to matching on an integer.
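To get a feel for the extra work a non-binary collation implies, here is a rough Python sketch of a case-insensitive, accent-insensitive comparison. It is an assumption-laden stand-in, not SQL Server's actual collation algorithm: the helpers `collation_key` and `collation_equal` are hypothetical and simply normalize both strings before comparing them.

```python
import unicodedata


def collation_key(text: str) -> str:
    """Normalize a string so that case and accents no longer matter."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_accents.casefold()


def collation_equal(a: str, b: str) -> bool:
    """Compare two strings the way a case/accent-insensitive collation might."""
    return collation_key(a) == collation_key(b)


print("José" == "JOSE")                  # binary comparison
print(collation_equal("José", "JOSE"))   # normalized comparison
```

Two integers need none of this: they are compared directly, which is the point the answer is making.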
I'm not sure why you would ever have to do two trips to the database or join on a varchar column. Why couldn't you do one trip to the database (where creation returns your new PK) where you join to the users table on the integer PK?