I want to use an id as a primary key for my table. In each record, I am also storing an id from an other source, but these ids are in no way sequential.
Should I add an (auto-incremented) column with a "new" id? It is very important that queries by the id are as fast as possible.
Some info:
The content of my table is only stored "temporary", The table gets often cleared (TRUNCATE) and than filled with new content.
It's a sql-server 2008
After writing content to the table, I create an index for the id column
Thanks!
As long as you are sure the supplied id's are unique, there's no need to create another (surrogate) id to use as primary key.
Under most circumstances, an index on the existing id should be sufficient. You can make it slightly faster by declaring it as a primary key.
From what you describe a new id is not necessary for performance. If you do add one, the table will be slightly larger, which has a (very small) negative effect on performance.
If the existing id is not numeric (or not an integer), then there might be a small gain from using a more efficient type for the index. But, your best bet is to make the existing id a primary key (although this might affect load performance).
Note: I usually prefer synthetic primary keys, so this answer is very specific to your question.
If you are after speed I would join the two IDs together (either from the application or stored proc) and then put them in one column
Related
I have my user table (pseudo sql, because I use an ORM and I must support several different DB types):
id: INTEGER, PK, AUTOINCREMENT
UUID : BINARY(16) (inserted by an update, it's a hash(id) )
I am currently using id for FK in all other tables.
However, in my REST API, I have to serve informations with the UUID, which causes a problem later to query.
Should I:
FK on the UUID instead?
just lookup id(UUID) each time (fast thanks to cache mechanism after a while)?
In general, it is better to use the auto-incremented id for the foreign key reference rather than some other combination of unique columns.
One important reason is that indexes on a single integer are more efficient than indexes on other column types -- if for no other reason than the index being smaller, so it occupies less disk and less memory. Also, there is additional overhead to storing the longer UUID in secondary tables.
This is not the only consideration. Another consideration is that you could change the UUID, if necessary, without changing the foreign key references. For instance, you may wake up one day and say "that id has to start with AAA". You can alter the table and update the table and be done with it -- or you could worry about foreign key references as well. Or, you might add an organization column and decide that the unique key is a combination of the UUID and organization. These operations are much harder/slower if the UUID is being used as a foreign key reference.
When you have composite primary keys (more than one column), using the auto-incremented id is an even better idea. In this case, using the id for joins prevents mistakes where one of the join conditions might be left out.
As you point out, looking up the UUID for a given id should be a fast operation with the correct indexes. There may be some borderline cases where you would not want to have an id, but in general, it is a good idea.
I have a small table "ImgViews" that only contains two columns, an ID column called "imgID" + a count column called "viewed", both set up as int.
The idea is to use this table only as a counter so that I can track how often an image with a certain ID is viewed / clicked.
The table has no primary or foreign keys and no relationships.
However, when I enter some data for testing and try entering the same imgID multiple times it always appears greyed out and with a red error icon.
Usually this makes sense as you don't want duplicate records but as the purpose is different here it does make sense for me.
Can someone tell me how I can achieve this or work around it ? What would be a common way to do this ?
Many thanks in advance, Tim.
To address your requirement to store non-unique values, simply remove primary keys, unique constraints, and unique indexes. I expect you may still want a non-unique clustered index on ImgID to improve performance of aggregate queries that would otherwise require a scan the entire table and sort. I suggest you store an insert timestamp, not to provide uniqueness, but to facilitate purging data by date, should the need arise in the future.
You must have some unique index on that table. Make sure there is no unique index and no unique or primary key constraint.
Or, SSMS simply doesn't know how to identify the row that was just inserted because it has no key.
It is generally not best practice to have a table without a (logical) primary key. In your case, I'd make the image id the primary key and increment the counter. The MERGE statement is well-suited for performing and insert or update at the same time. Alternatives exist.
If you don't like that, create a surrogate primary key (an identity column set as the primary key).
At the moment you have no way of addressing a specific row. That makes the table a little unwieldy.
If you allow multiple rows being absolutely identical, how would you update/delete one of those rows?
How would you expect the database being able to "know" what row you referred to??
At the very least add a separate identity column (preferred being the clustered index, too).
As a side note: It's weird that you "like to avoid unneeded data" but at the same time insert duplicates over and over again instead of simply add up the click count per single image...
Use SQL statements, not GUI, if the table has not primary key or unique constraint.
For example, I have a table called programs and another called format. The format tables contains a single column called format, which has three possible values: zip, rar and exe. Should the format table have a primary key?
Think what happens if your table would contain:
zip
rar
exe
exe
If you see no problem then your table does not need a PK.
If the values are unique, then perhaps not. However, if you do not have a unique index on the values, you could have duplicates, and that would make it difficult to identify single rows. Also, using a PK with a known, fixed length (such as an integer ID) will help with performance and keep index fragmentation down. Plus, if you need to join to this table, you're better off referencing the ID of a format rather than the value, by using a foreign key (which in some RDBMS requires a PK).
If you have a table called Programs and a table called Format with a bunch of values, why not just have a FormatId column in the Program table and use one of the (currently three) foreign key values in the Format table?
See these notes about Normalisation: http://en.wikipedia.org/wiki/Database_normalization (I'm sure you know this, but maybe somebody else reading the question might not).
The table with only one column would have unique values in most of the cases, as repeating value and that too in a single column does not make sense.
So, in your cases, the format would not be repeating (thats what I think), so there is no harm in making primary key but
1) It imposes an indexing on that field (do you want that?)
2) In future, if you are planning to link this to another table, then make sure you use this as a primary key and not introducing anything like format_id for the primary key. If you do so, then please dont make this as a primary key right now.
Traditionally I have always used an ID column in SQL (mostly mysql and postgresql).
However I am wondering if it is really necessary if the rest of the columns in each row make in unique. In my latest project I have the "ID" column set as my primary key, however I never call it or use it in any way, as the data in the row makes it unique and is much more useful for me.
So, if every row in a SQL table is unique, does it need a primary key ID table, and are there ant performance changes with or without one?
Thanks!
EDIT/Additional info:
The specific example that made me ask this question is a table I am using for a many-to-many-to-many-to-many table (if we still call it that at that point) it has 4 columns (plus ID) each of which represents an ID of an external table, and each row will always be numeric and unique. only one of the columns is allowed to be null.
I understand that for normal tables an ID primary key column is a VERY good thing to have. But I get the feeling on this particular table it just wastes space and slows down adding new rows.
If you really do have some pre-existing column in your data set that already does uniquely identify your row - then no, there's no need for an extra ID column. The primary key however must be unique (in ALL circumstances) and cannot be empty (must be NOT NULL).
In my 20+ years of experience in database design, however, this is almost never truly the case. Most "natural" ID's that appear to be unique aren't - ultimately. US Social Security Numbers aren't guaranteed to be unique, and most other "natural" keys end up being almost unique - and that's just not good enough for a database system.
So if you really do have a proper, unique key in your data already - use it! But most of the time, it's easier and more convenient to have just a single surrogate ID that you can guarantee will be unique over all rows.
Don't confuse the logical model with the implementation.
The logical model shows a candidate key (all columns) which could makes your primary key.
Great. However...
In practice, having a multi column primary key has downsides: it's wide, not good when clustered etc. There is plenty of information out there and in the "related" questions list on the right
So, you'd typically
add a surrogate key (ID column)
add a unique constraint to keep the other columns unique
the ID column will be the clustered key (can be only one per table)
You can make either key the primary key now
The main exception is link or many-to-many tables that link 2 ID columns: a surrogate isn't needed (unless you have a braindead ORM)
Edit, a link: "What should I choose for my primary key?"
Edit2
For many-many tables: SQL: Do you need an auto-incremental primary key for Many-Many tables?
Yes, you could have many attributes (values) in a record (row) that you could use to make a record unique. This would be called a composite primary key.
However it will be much slower in general because the construction of the primary index will be much more expensive. The primary index is used by relational database management systems (RDBMS) not only to determine uniqueness, but also in how they order and structure records on disk.
A simple primary key of one incrementing value is generally the most performant and the easiest solution for the RDBMS to manage.
You should have one column in every table that is unique.
EDITED...
This is one of the fundamentals of database table design. It's the row identifier - the identifier identifies which row(s) are being acted upon (updated/deleted etc). Relying on column combinations that are "unique", eg (first_name, last_name, city), as your key can quickly lead to problems when two John Smiths exist, or worse when John Smith moves city and you get a collision.
In most cases, it's best to use a an artificial key that's guaranteed to be unique - like an auto increment integer. That's why they are so popular - they're needed. Commonly, the key column is simply called id, or sometimes <tablename>_id. (I prefer id)
If natural data is available that is unique and present for every row (perhaps retinal scan data for people), you can use that, but all-to-often, such data isn't available for every row.
Ideally, you should have only one unique column. That is, there should only be one key.
Using IDs to key tables means you can change the content as needed without having to repoint things
Ex. if every row points to a unique user, what would happen if he/she changed his name to let say John Blblblbe which had already been in db? And then again, what would happen if you software wants to pick up John Blblblbe's details, whose details would be picked up? the old John's or the one ho has changed his name? Well if answer for bot questions is 'nothing special gonna happen' then, yep, you don't really need "ID" column :]
Important:
Also, having a numeric ID column with numbers is much more faster when you're looking for an exact row even when the table hasn't got any indexing keys or have more than one unique
If you are sure that any other column is going to have unique data for every row and isn't going to have NULL at any time then there is no need of separate ID column to distinguish each row from others, you can make that existing column primary key for your table.
No, single-attribute keys are not essential and nor are surrogate keys. Keys should have as many attributes as are necessary for data integrity: to ensure that uniqueness is maintained, to represent accurately the universe of discourse and to allow users to identify the data of interest to them. If you have already identified a suitable key and if you don't find any real need to create another one then it would make no sense to add redundant attributes and indexes to your table.
An ID can be more meaningful, for an example an employee id can represent from which department he is, year of he join and so on. Apart from that RDBMS supports lots operations with ID's.
I have a table with 16 columns. It will be most frequently used table in web aplication and it will contain about few hundred tousand rows. Database is created on sql server 2008.
My question is choice for primary key. What is quicker? I can use complex primary key with two bigint-s or i can use one varchar value but i will need to concatenate it after?
There are many more factors you must consider:
data access prevalent pattern, how are you going to access the table?
how many non-clustered indexes?
frequency of updates
pattern of updates (sequential inserts, random)
pattern of deletes
All these factors, and specially the first two, should drive your choice of the clustered key. Note that the primary key and clustered key are different concepts, often confused. Read up my answer on Should I design a table with a primary key of varchar or int? for a lengthier discussion on the criteria that drive a clustered key choice.
Without any information on your access patterns I can answer very briefly and concise, and actually correct: the narrower key is always quicker (for reasons of IO). However, this response bares absolutely no value. The only thing that will make your application faster is to choose a key that is going to be used by the query execution plans.
A primary key which does not rely on any underlying values (called a surrogate key) is a good choice. That way if the row changes, the ID doesn't have to, and any tables referring to it (Foriegn Keys) will not need to change. I would choose an autonumber (i.e. IDENTITY) column for the primary key column.
In terms of performance, a shorter, integer based primary key is best.
You can still create your clustered index on multiple columns.
Why not just a single INT auto-generated primary key? INT is 32-bit, so it can handle over 4 billion records.
CREATE TABLE Records (
recordId INT NOT NULL PRIMARY KEY,
...
);
A surrogate key might be a fine idea if there are foreign key relationships on this table. Using a surrogate will save tables that refer to it from having to duplicate all those columns in their tables.
Another important consideration is indexes on columns that you'll be using in WHERE clauses. Your performance will suffer if you don't. Make sure that you add appropriate indexes, over and above the primary key, to avoid table scans.
What do you mean quicker? if you need to search quicker, you can create index for any column or create full text search. the primary key just make sure you do not have duplicated records.
The decision relies upon its use. If you are using the table to save data mostly and not retrieve it, then a simple key. If you are mostly querying the data and it is mostly static data where the key values will not change, your index strategy needs to optimize the data to the most frequent query that will be used. Personally, I like the idea of using GUIDs for the primary key and an int for the clustered index. That allows for easy data imports. But, it really depends upon your needs.
Lot’s of variables you haven’t mentioned; whether the data in the two columns is “natural” and there is a benefit in identifying records by a logical ID, if disclosure of the key via a UI poses a risk, how important performance is (a few hundred thousand rows is pretty minimal).
If you’re not too fussy, go the auto number path for speed and simplicity. Also take a look at all the posts on the site about SQL primary key types. Heaps of info here.
Is it a ER Model or Dimensional Model. In ER Model, they should be separate and should not be surrogated. The entire record could have a single surrogate for easy references in URLs etc. This could be a hash of all parts of the composite key or an Identity.
In Dimensional Model, also they must be separate and they all should be surrogated.