Using VARCHAR as PRIMARY KEY for an 'ORPHAN' table - sql

I'm to create an orphan table (no relationships with any other table whatsoever) that contains 3 columns.
Col1 - String field - VARCHAR(32) - Contains unique data not more than 32 characters
Col2 - String field - TEXT - Contains larger non-unique data of characters
Col3 - Numeric (Bool) - INT(1) - 0/1 for Flagging
I'm thinking of using Col1 as my PRIMARY KEY. I have done some research and see people argue that using a meaningless INT column as a PRIMARY KEY to avoid Foreign Key/Storage issues is the way to go.
However, IMO, since this is an orphan table, it should not matter. Besides, I would require to place an INDEX on Col1 anyway.
As a side note, I'm not expecting more than ~1000 rows in this table.
Thoughts please.

I'd still just use an INT PK and put an index on COL1. I suppose you could use COL1 as the key if you can ensure that nothing will ever be joined to that table, but if nothing else the INT key will give you an idea of the order in which items are added/deleted from the table. I also like to add an IsActive boolean so that you never delete anything and a DateCreated datetime to almost every table.

If col1 is your real primary key, there is no reason not to use it. Especially if the table is that tiny.
You would need to maintain a unique index on that column anyway, so by adding an artificial primary key you just add more overhead for insert and delete operations (as two indexes must be maintained).
Unless you are referencing that PK from really, really many other rows (and other tables) you should just go with what is the natural primary for your business rules.
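A minimal sketch of the natural-key approach, using Python's built-in sqlite3 module purely for illustration (the asker's database is not stated; the table name orphan and the sample values are made up). It shows that declaring Col1 as the primary key already gives you the unique index, so only one index has to be maintained. Note that SQLite does not enforce the VARCHAR(32) length limit; other databases would.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table mirroring the three columns from the question.
conn.execute(
    """
    CREATE TABLE orphan (
        col1 VARCHAR(32) NOT NULL PRIMARY KEY,          -- natural key
        col2 TEXT,                                      -- larger, non-unique data
        col3 INTEGER NOT NULL DEFAULT 0 CHECK (col3 IN (0, 1))  -- 0/1 flag
    )
    """
)
conn.execute("INSERT INTO orphan VALUES ('alpha', 'some longer text', 1)")

# A second row with the same col1 violates the primary key.
try:
    conn.execute("INSERT INTO orphan VALUES ('alpha', 'duplicate key', 0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# The primary key is backed by a single automatically created unique index.
indexes = conn.execute("PRAGMA index_list(orphan)").fetchall()
print(duplicate_rejected, len(indexes))
```

With a surrogate INT key plus a unique index on col1, the same table would carry two unique structures instead of one.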

I see where you are coming from, but it just makes sense to index the first column anyway. It may be because I am used to Excel, but an initial integer column used as the primary key also gives the rows an order and makes them easier to read while debugging or capturing data. If you use a more randomly generated value, you would still be searching through a few hundred rows looking for a hard-to-distinguish key. In the end I highly recommend the extra column of ints. It is well worth it.

Whenever I design database tables, I keep my INT column.
I believe it's faster to compare numbers than strings.
So it all depends on how often you will query the database and compare strings there.

I'm still unclear what the question is. Judging from the answers, I've deduced it down to two plausible questions:
Is it okay to use a VARCHAR instead of an INTEGER as a primary key?
Yes it is okay to use a VARCHAR instead.
In many cases it is preferred, especially if your table is expected to grow beyond 2,147,483,647 records (yes this happens). Performance-wise, even if INTs had a minimal speed advantage, on a ~1000 record table, you would not see it. Designated PKs are indexed by default. The one problem is that you'll lose any auto-generating sequence that the database can do for you.
Is it okay to use your unique COL1 field as a primary key, instead of some other unique ID field?
Yes it is okay.
The whole notion of having a primary key is to establish a unique field. What you're losing, though, may be some intrinsic comprehension. When other users want to join on that table, it's far easier to understand that id is a unique field, whereas col1 (some varchar) may or may not be unique.

In your given scenario, it should be okay. If the scope does grow, you can always introduce an auto_increment PK column later. Just make sure that your field is both indexed and unique.

Related

SQL Server: How to allow duplicate records on small table

I have a small table "ImgViews" that only contains two columns, an ID column called "imgID" + a count column called "viewed", both set up as int.
The idea is to use this table only as a counter so that I can track how often an image with a certain ID is viewed / clicked.
The table has no primary or foreign keys and no relationships.
However, when I enter some data for testing and try entering the same imgID multiple times it always appears greyed out and with a red error icon.
Usually this makes sense, as you don't want duplicate records, but as the purpose is different here, duplicates do make sense for me.
Can someone tell me how I can achieve this or work around it ? What would be a common way to do this ?
Many thanks in advance, Tim.
To address your requirement to store non-unique values, simply remove primary keys, unique constraints, and unique indexes. I expect you may still want a non-unique clustered index on ImgID to improve performance of aggregate queries that would otherwise require a scan of the entire table and a sort. I suggest you store an insert timestamp, not to provide uniqueness, but to facilitate purging data by date, should the need arise in the future.
You must have some unique index on that table. Make sure there is no unique index and no unique or primary key constraint.
Or, SSMS simply doesn't know how to identify the row that was just inserted because it has no key.
It is generally not best practice to have a table without a (logical) primary key. In your case, I'd make the image id the primary key and increment the counter. The MERGE statement is well-suited for performing an insert or update at the same time. Alternatives exist.
If you don't like that, create a surrogate primary key (an identity column set as the primary key).
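The "image id as primary key, increment the counter" idea can be sketched as follows. This is not the SQL Server MERGE statement the answer mentions: it uses Python's sqlite3 module and a plain INSERT OR IGNORE followed by an UPDATE as a stand-in, and the helper name record_view is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per image; imgID is the primary key, viewed is the running counter.
conn.execute(
    "CREATE TABLE ImgViews ("
    " imgID INTEGER PRIMARY KEY,"
    " viewed INTEGER NOT NULL DEFAULT 0)"
)


def record_view(conn, img_id):
    """Insert the image row if it is missing, then bump its counter."""
    conn.execute(
        "INSERT OR IGNORE INTO ImgViews (imgID, viewed) VALUES (?, 0)",
        (img_id,),
    )
    conn.execute(
        "UPDATE ImgViews SET viewed = viewed + 1 WHERE imgID = ?",
        (img_id,),
    )


# Simulate four clicks: three on image 7, one on image 9.
for img_id in (7, 7, 9, 7):
    record_view(conn, img_id)

counts = dict(conn.execute("SELECT imgID, viewed FROM ImgViews"))
print(counts)
```

The table stays at one row per image, so no duplicates are needed to track the count.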
At the moment you have no way of addressing a specific row. That makes the table a little unwieldy.
If you allow multiple rows to be absolutely identical, how would you update or delete one of those rows?
How would you expect the database to "know" which row you are referring to?
At the very least, add a separate identity column (preferably as the clustered index, too).
As a side note: it's odd that you "like to avoid unneeded data" but at the same time insert duplicates over and over again instead of simply adding up the click count per image...
Use SQL statements, not the GUI, if the table has no primary key or unique constraint.

Should a table with only one column have a primary key?

For example, I have a table called programs and another called format. The format table contains a single column called format, which has three possible values: zip, rar and exe. Should the format table have a primary key?
Think what happens if your table would contain:
zip
rar
exe
exe
If you see no problem then your table does not need a PK.
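To make the thought experiment concrete, here is a small sketch using Python's sqlite3 module (an assumption; the question does not name a database). The table names format_no_pk and format_pk are made up for the comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE format_no_pk (format TEXT)")
conn.execute("CREATE TABLE format_pk (format TEXT NOT NULL PRIMARY KEY)")

values = ["zip", "rar", "exe", "exe"]

# Without a key, the duplicate 'exe' is accepted silently.
conn.executemany(
    "INSERT INTO format_no_pk VALUES (?)", [(v,) for v in values]
)

# With a primary key, the second 'exe' is rejected.
rejected = []
for v in values:
    try:
        conn.execute("INSERT INTO format_pk VALUES (?)", (v,))
    except sqlite3.IntegrityError:
        rejected.append(v)

no_pk_rows = conn.execute("SELECT COUNT(*) FROM format_no_pk").fetchone()[0]
pk_rows = conn.execute("SELECT COUNT(*) FROM format_pk").fetchone()[0]
print(no_pk_rows, pk_rows, rejected)
```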
If the values are unique, then perhaps not. However, if you do not have a unique index on the values, you could have duplicates, and that would make it difficult to identify single rows. Also, using a PK with a known, fixed length (such as an integer ID) will help with performance and keep index fragmentation down. Plus, if you need to join to this table, you're better off referencing the ID of a format rather than the value, by using a foreign key (which in some RDBMS requires a PK).
If you have a table called Programs and a table called Format with a bunch of values, why not just have a FormatId column in the Program table and use one of the (currently three) foreign key values in the Format table?
See these notes about Normalisation: http://en.wikipedia.org/wiki/Database_normalization (I'm sure you know this, but maybe somebody else reading the question might not).
A table with only one column would have unique values in most cases, since repeating a value in a single column does not make sense.
So, in your case, the format would not repeat (that's what I think), and there is no harm in making it the primary key, but:
1) It imposes an index on that field (do you want that?)
2) If you plan to link this to another table in the future, make sure you use this column as the primary key rather than introducing something like format_id. If you do intend to introduce a format_id, then please don't make this column the primary key right now.

Primary key on a column

I have a table with 2 columns and 2 records. Column1 will never change; Column2 might change, but the table will only ever have 2 records.
Column1
Missing
Invalid
Column2
\\sqlserver\destination\missing
\\sqlserver\destination\invalid
I am a little confused about which column to put the primary key on, as there is no Id column. Which column should be the primary key? Or do I have to add one more column with identity and put the primary key on that?
Thanks
Yes. The PK column can never contain duplicates. It doesn't have to be an integer however, but it needs to be a unique non-null column.
The criteria for choosing candidate and primary keys are:
uniqueness, irreducibility, stability, simplicity and familiarity
From what you have written, Column1 is definitely a candidate key. It has all 5 of the above criteria.
Column2 might be a candidate key if the two values in the table must always be unique. However, it is not stable so Column1 is a better key to choose for foreign key references to the table (primary key).
You could create a 3rd numeric column. Since you constrain the table to 2 rows, it makes little difference whether the new column has a system-maintained sequence (identity attribute).
Column1 has familiarity and the new column would not. At a logical level of discourse, both Column1 and this new column are equally simple. Physically, a 7 character string is at least as large as a 64-bit number so a 32-bit number occupies less space.
However, if you choose to add a new column due to physical size, I would consider a char(1) column with 'M' for missing or 'I' for invalid, which would still have all 5 criteria while occupying less physical space in referencing tables.
IMHO, you should "always" put an id key that then becomes the primary key.
"always" is in quotes because it's possible to argue for cases when it's not required, but generally this is the way to go, and certainly it's safe to say it is the default approach, and any deviation from it should be investigated rigorously for its benefits.
There is an argument for "natural" keys; that is to say, you put the primary key on the field that is guaranteed to be unique and never change. But, in my experience, almost everything does end up changing, so it's safer to go with an inbuilt default auto-incrementing ID.
You don't need a primary key on a table that has only two records. Primary key is meant for increasing query speed; with 2 records you will hardly see any difference.
Edit:
In response to the comments, I'd like to point out that no mainstream DB vendor enforces the use of primary keys. There is a reason for them being optional: unless the primary key is required by functionality, it doesn't belong there; YAGNI.
On a table that small the creation of an index would actually slow things down. Indexes are stored as a B-tree, so a lookup on a table this small, unless it's a clustered index, will cause more reads than a table scan. A clustered index would actually "be" the table, but again on such a small table the cost/benefit is "moot".
I like always having unique rows, but in this case just leaving the table unindexed (known as a "heap") might actually be the most efficient. I wouldn't throw an index on the table unless you need to enforce constraints. Indexing it for query performance isn't going to do anything for you with a table this small.
If you are required to put a PK on the table for some other reason, then I would say put it on the first column, as it is the shortest and the least likely to change, and it looks like this table is basically just used as a look-up anyway...

Should I use integer primary IDs?

For example, I always generate an auto-increment field for the users table, but I also specify a UNIQUE index on their usernames. There are situations that I first need to get the userId for a given username and then execute the desired query, or use a JOIN in the desired query. It's 2 trips to the database or a JOIN vs. a varchar index.
Should I use integer primary IDs?
Is there a real performance benefit on INT over small VARCHAR indexes?
There are several advantages of having a surrogate primary key, including:
When you have a foreign key in another table, if it is an integer it takes up only a few bytes extra space and can be joined quickly. If you use the username as the primary key it will have to be stored in both tables - taking up more space and it takes longer to compare when you need to join.
If a user wishes to change their username, you will have big problems if you have used it as a primary key. While it is possible to update a primary key, it is very unwise to do so and can cause all sorts of problems as this key might have been sent out to all sorts of other systems, used in links, saved in backups, logs that have been archived, etc. You can't easily update all these places.
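A rough sketch of these two points, using Python's sqlite3 module for illustration (the question does not name a database). The posts table and the sample usernames are hypothetical. Referencing rows store only the integer userId, so a username change touches one row, and a lookup by username is a single query with a JOIN rather than two round trips.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce FKs

conn.execute(
    "CREATE TABLE users ("
    " userId INTEGER PRIMARY KEY AUTOINCREMENT,"
    " username TEXT NOT NULL UNIQUE)"
)
# Hypothetical referencing table: it stores only the integer key.
conn.execute(
    "CREATE TABLE posts ("
    " postId INTEGER PRIMARY KEY,"
    " userId INTEGER NOT NULL REFERENCES users(userId),"
    " title TEXT NOT NULL)"
)

cur = conn.execute("INSERT INTO users (username) VALUES ('alice')")
user_id = cur.lastrowid
conn.execute(
    "INSERT INTO posts (userId, title) VALUES (?, 'hello')", (user_id,)
)

# Renaming the user updates one row; posts is untouched.
conn.execute(
    "UPDATE users SET username = 'alice2' WHERE userId = ?", (user_id,)
)

# One trip to the database: look up by username and join on the integer key.
rows = conn.execute(
    "SELECT p.title FROM posts p"
    " JOIN users u ON u.userId = p.userId"
    " WHERE u.username = ?",
    ("alice2",),
).fetchall()
print(rows)
```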
It's not just about performance. You should never key on a meaningful value, for reasons that are well documented elsewhere.
By the way, I often scale the type of int to the size of the table. When I know that a table will not exceed 255 rows, I use a tinyint key, and the same for smallint.
In addition to what others have said, you need to think about the clustering of the table.
In SQL Server for instance (and possibly other vendors), if the primary key is also used as the clustered index of the table (which is quite common), an incrementing integer benefits over other field types. This is because new rows are entered with a primary key that is always greater than the previous rows, meaning that the new row can be stored at the end of the table instead of in the middle (this same scenario can be created with other field types for the primary key, but an integer type lends itself better).
Compare this with a guid primary key - new rows have to be inserted into the middle of the table because guids are non-sequential, making inserts very inefficient.
First, as is obvious, on small tables, it will make no difference with respect to performance. Only on very large tables (how large depends on numerous factors), can it make a difference for a handful of reasons:
Using a 32-bit integer will only consume 4 bytes of space. Presumably, your usernames will be longer than four non-Unicode characters and thus consume more than 4 bytes of space. The more space used, the fewer pieces of data fit on a page, the fatter the index and the more IO you incur.
Your character columns are going to require the use of varchar over char unless you force everyone to have usernames of identical size. This too will have a tiny performance and storage impact.
Unless you are using a binary sort collation, the system has to do relatively sophisticated matching when comparing two strings. Do the two columns use the same collation? For each character, are they cased the same? What are the casing and accent rules in terms of matching? And so on. While this can be done quickly, it is more work which, in very large tables, can make a difference in comparison to matching on an integer.
I'm not sure why you would ever have to do two trips to the database or join on a varchar column. Why couldn't you do one trip to the database (where creation returns your new PK) where you join to the users table on the integer PK?

Should i have a primary ID? i am indexing another field

Using SQLite, I need a table to hold a blob that stores an MD5 hash, plus a 4-byte int. I plan to index the int, but this value will not be unique.
Do I need a primary key for this table? And is there an issue with indexing a non-unique value? (I assume there is no issue, nor any reason for one.)
Personally, I like to have a unique primary id on all tables. It makes finding unique records for updating/deleting easier.
How are you going to reference on a SELECT * FROM Table WHERE or an UPDATE ... WHERE? Are you sure you want each one?
You already have one.
SQLite automatically creates an integer ROWID column for every row of every ordinary table (tables declared WITHOUT ROWID are the exception). This can function as a primary key if you don't declare your own.
In general it's a good idea to declare your own primary key column. In the particular instance you mentioned, ROWID will probably be fine for you.
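A short sketch of the implicit ROWID in action, using Python's sqlite3 module. The table name hashes, the column names, and the index name are made up, and the 16-byte blobs are placeholders standing in for real MD5 digests.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# No primary key declared; the int column gets a plain, non-unique index.
conn.execute("CREATE TABLE hashes (md5 BLOB NOT NULL, bucket INTEGER NOT NULL)")
conn.execute("CREATE INDEX idx_hashes_bucket ON hashes (bucket)")

# Placeholder 16-byte values instead of real MD5 digests.
for i, bucket in ((1, 1), (2, 1), (3, 2)):
    conn.execute("INSERT INTO hashes VALUES (?, ?)", (bytes([i]) * 16, bucket))

# The hidden rowid is selectable and identifies each row.
rows = conn.execute(
    "SELECT rowid, bucket FROM hashes ORDER BY rowid"
).fetchall()

# It can also be used to address a single row for UPDATE or DELETE.
conn.execute("DELETE FROM hashes WHERE rowid = ?", (2,))
remaining = conn.execute("SELECT COUNT(*) FROM hashes").fetchone()[0]
print(rows, remaining)
```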
My advice is to go with a primary key if you want to have referential integrity. However, there is no issue with indexing a non-unique value. The only thing is that your performance will degrade a little.
What are the consequences of letting two identical rows somehow get into this table?
One consequence is, of course, wasted space. But I'm talking about something more fundamental, here. There are times when duplicate rows in data give you wrong results. For example, if you grouped by the int column (field), and listed the count of rows in each group, a duplicate row (record) might throw you off, depending on what you are really looking for.
Relational databases work better if they are based on relations. Relations are always in first normal form. The primary reason for declaring a primary key is to prevent the table from getting out of first normal form, and thus not representing a relation.
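The grouping example above can be sketched like this, again with Python's sqlite3 module. The table name hashes_dup and the short text values standing in for hashes are illustrative only. One accidental duplicate row changes the per-group count unless duplicates are removed first (or prevented by a key).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hashes_dup (md5 TEXT NOT NULL, bucket INTEGER NOT NULL)")

# ('bb', 1) was inserted twice by mistake; nothing stops it without a key.
conn.executemany(
    "INSERT INTO hashes_dup VALUES (?, ?)",
    [("aa", 1), ("bb", 1), ("bb", 1), ("cc", 2)],
)

# Raw counts include the duplicate row.
raw_counts = conn.execute(
    "SELECT bucket, COUNT(*) FROM hashes_dup GROUP BY bucket ORDER BY bucket"
).fetchall()

# Counting distinct rows gives the figure you probably wanted.
distinct_counts = conn.execute(
    "SELECT bucket, COUNT(*) FROM"
    " (SELECT DISTINCT md5, bucket FROM hashes_dup)"
    " GROUP BY bucket ORDER BY bucket"
).fetchall()
print(raw_counts, distinct_counts)
```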