Why do most SQL databases allow defining the same index (or constraint) twice?
For example in MySQL I can do:
CREATE TABLE testkey(id VARCHAR(10) NOT NULL, PRIMARY KEY(id));
ALTER TABLE testkey ADD KEY (id);
ALTER TABLE testkey ADD KEY (id);
SHOW CREATE TABLE testkey;
CREATE TABLE `testkey` (
`id` varchar(10) NOT NULL,
PRIMARY KEY (`id`),
KEY `id` (`id`),
KEY `id_2` (`id`)
)
I do not see any use case for having the same index or constraint twice, and I would like SQL databases not to allow me to do so.
I also do not see the point of naming indexes or constraints, as I could reference them for deletion just as I created them.
Several reasons come to mind. In the case of a database product which supports multiple index types, you might want to have the same field or combination of fields indexed multiple times, with each index having a different type depending on intended usage. For example, some (perhaps most) database products have a tree-structured index which is good for both direct lookups (e.g. KEY_FIELD = 1) and range scans (e.g. KEY_FIELD > 0 AND KEY_FIELD < 5). In addition, some (but definitely not all) database products also support a hashed index type, which is only useful for direct lookups but which is very fast (it would work for a comparison such as KEY_FIELD = 1 but could not be used for a range comparison). If you need very fast direct lookup times but still need to provide for range comparisons, it might be useful to create both a tree-structured index and a hashed index on the same column.
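As a concrete, hedged sketch: MySQL's MEMORY engine supports both index types, so the same column can carry a HASH index for direct lookups and a BTREE index for range scans (the table and column names here are hypothetical):
-- MySQL MEMORY engine: the same column indexed twice, once per index type.
CREATE TABLE session_cache (
    key_field INT NOT NULL,
    payload   VARCHAR(100),
    KEY idx_hash  USING HASH  (key_field),  -- direct lookups: key_field = 1
    KEY idx_btree USING BTREE (key_field)   -- range scans: key_field > 0 AND key_field < 5
) ENGINE=MEMORY;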
Some database products do prevent you from having multiple primary key constraints on a table. However, preventing all possible duplicates might require more effort on the part of the database vendor than they feel can be justified. In the case of an open source database the principal developers might take the view that if a given feature is a big enough deal to a given user it should be up to that user to send in a code patch to enable whatever feature it is that is wanted. Open source is not a euphemism for "I use your open-source product; therefore, you are now my slave and must implement every feature I might ever want!".
In the end I think it's fair to say that a product which is intended for use by software developers can take it as a given that the user should be expected to exercise reasonable care when using the product.
All programming languages allow you to write redundancies:
<?php
$foo = 'bar';
$foo = 'bar';
That's just an example, you could obviously have duplicate code, duplicate functions, or duplicate data structures that are much more wasteful.
It's up to you to write good code, and this depends on the situation. Maybe there's a good reason in some rare case to write something that seems redundant. In that case, you'd be just as put out if the technology didn't allow you to do it.
You might be interested in a tool called Maatkit, which is a collection of indispensable tools for MySQL users. One of its tools checks for duplicate keys:
http://www.maatkit.org/doc/mk-duplicate-key-checker.html
If you're a MySQL developer, novice or expert, you should download Maatkit right away and set aside a full day to read the docs, try out each tool in the set, and learn how to integrate them into your daily development tasks. You'll kick yourself for not doing it sooner.
As for naming indexes, it allows you to do this:
ALTER TABLE testkey DROP KEY `id`, DROP KEY `id_2`;
If they weren't named, you'd have no way to drop individual indexes. You'd have to drop the whole table and recreate it without the indexes.
There are only two good reasons - that I can think of - for allowing the same index to be defined twice:
for compatibility with existing scripts that do define the same index twice;
because changing the implementation would require work that the vendor is neither willing to do nor pay for.
Some databases do prevent duplicate indexes. Oracle Database raises an error for them (see https://www.techonthenet.com/oracle/errors/ora01408.php), while other databases like MySQL and PostgreSQL do not.
You shouldn't be in a scenario where a table has so many indexes that you can't just quickly look and see if the index is already there.
As for naming constraints and indexes, I only really ever name constraints. I will name a constraint FK_CurrentTable_ForeignKeyedColumn, just so things are more visible when quickly looking through lists of them.
Because of composite (multi-column) indexes, which databases such as Oracle, MySQL, and SQL Server support. A composite index spans two or more columns, and the column list is processed left to right when deciding whether the index can be used.
So if I define a composite index on columns 1, 2 and 3, my queries need to use, at a minimum, column 1 to use the index. The next possible combination is columns 1 and 2, and finally columns 1, 2 and 3.
So what about my queries that only use column 3? Without the other two columns, the composite index can't be used. It's the same issue for queries that use only column 2. In either case, that's a situation where I would consider separate indexes on columns 2 and 3, as sketched below.
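A minimal sketch of that layout, using a hypothetical orders table and column names:
CREATE INDEX idx_composite ON orders (col1, col2, col3);
CREATE INDEX idx_col2 ON orders (col2);
CREATE INDEX idx_col3 ON orders (col3);
SELECT * FROM orders WHERE col1 = 1 AND col2 = 2;  -- can use idx_composite (leftmost prefix satisfied)
SELECT * FROM orders WHERE col3 = 3;               -- cannot use idx_composite; idx_col3 applies instead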
Related
What are the core differences between a table created with constraints like a PK and a table that only has indexes created on it (no PK)? When it comes to creating a table, which route would anyone prefer to implement? I have worked with both, but I am just curious to know what separates them. Thank you.
They are quite different.
Constraints check that the data being inserted or updated meets some criteria (for example, "not null"). If the data does not meet the criteria, the INSERT or UPDATE is rejected and fails. Constraints help you to maintain the quality of the data.
Indexes improve (most of the time) the speed of a query, and are usually beneficial to SELECT, UPDATE, and DELETE operations. Indexes improve database performance.
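A small sketch of that split, with hypothetical names: the constraints decide what data is allowed in, while the index only changes how fast it is found.
CREATE TABLE person (
    id    INT NOT NULL,
    email VARCHAR(100) NOT NULL,           -- constraint: NULLs are rejected
    CONSTRAINT pk_person PRIMARY KEY (id)  -- constraint: duplicate ids are rejected
);
CREATE INDEX idx_person_email ON person (email);  -- index: changes speed, never behavior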
An index has no effect on how a query behaves or on the schema definition; it only affects performance. (Some SQL servers do implement features using indexes, particularly the unique constraint.) The SQL standard doesn't even mention indexes, because they're considered an implementation issue.
A primary key constraint very much does have an effect on behavior and schema definition. It says this column must be unique and not null. Most databases also happen to index it for obvious performance reasons.
Declaring a primary key, rather than manually saying unique not null, also lets the person reading your schema know that this is the primary key. They will know what its purpose is. It also lets the database know this is the primary key, which might allow it to do some extra optimizations.
I am a PHP developer with little Oracle experience who is tasked to work with an Oracle database.
The first thing I have noticed is that the tables don't seem to have an auto number index as I am used to seeing in MySQL. Instead they seem to create an index out of two fields.
For example I noticed that one of the indexes is a combination of a Date Field and foreign key ID field. The Date field seems to store the entire date and timestamp so the combination is fairly unique.
If the index name was PLAYER_TABLE_IDX how would I go about using this index in my PHP code?
I want to reference a unique record by this index (rather than using two AND clauses in the WHERE portion of my SQL query)
Any advice Oracle/PHP gurus?
I want to reference a unique record by this index (rather than using two AND clauses in the WHERE portion of my SQL query)
There's no way around it: you have to reference all the columns in a composite primary key to get a unique row.
You can't use an index directly in a SQL query.
In Oracle, you use hint syntax to suggest an index that should be used, but the only way to give an index a chance of being used is to specify the column(s) associated with it in the SELECT, JOIN, WHERE, and ORDER BY clauses.
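For illustration, a hedged sketch of that hint syntax using the index name from the question; the table name, alias, column names, and bind variables are assumptions:
SELECT /*+ INDEX(p PLAYER_TABLE_IDX) */ *
FROM   player_table p
WHERE  p.game_date = :game_date
AND    p.player_id = :player_id;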
The first thing I have noticed is that the tables don't seem to have an auto number index as I am used to seeing in MySQL.
Oracle (and PostgreSQL) have what are called "sequences". They're separate objects from the table, but are used for functionality similar to MySQL's auto_increment. Unlike MySQL's auto_increment, you can have more than one sequence used per table (a sequence is never tied to a specific table), and you can control each one individually.
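A minimal Oracle sketch of a sequence standing in for auto_increment; the sequence, table, and column names are hypothetical:
CREATE SEQUENCE player_seq START WITH 1 INCREMENT BY 1;
INSERT INTO player (id, name)
VALUES (player_seq.NEXTVAL, 'Alice');  -- NEXTVAL supplies the next id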
Instead they seem to create an index out of two fields.
That's what the table design was, nothing specifically Oracle about it.
But I think it's time to address the fact that "index" has a different meaning in a database than how you are using the term. An index is an additional structure that makes SELECTing data out of a table faster (but makes INSERT/UPDATE/DELETE slower, because the index has to be maintained).
What you're talking about is actually called a primary key, and in this example it'd be called a composite key because it involves more than one column. One of the columns, either the DATE (consider it DATETIME) or the foreign key, can have duplicates in this case. But because of the key being based on both columns, it's the combination of the two values that makes them the key to a unique record in the table.
http://use-the-index-luke.com/ is my web book that explains how to use indexes in Oracle.
It's overkill for your question, but it is probably worth reading if you want to understand how things work.
This may be a pretty naive and stupid question, but I'm going to ask it anyway.
I have a table with several fields, none of which are unique, and a primary key, which obviously is.
This table is accessed via the non-unique fields regularly, but no user SP or process accesses data via the primary key. Is the primary key necessary then? Is it used behind the scenes? Will removing it affect performance positively or negatively?
Necessary? No. Used behind the scenes? Well, it's saved to disk and kept in the row cache, etc. Removing it will slightly increase your performance (use a watch with millisecond precision to notice).
But ... the next time someone needs to create references to this table, they will curse you. If they are brave, they will add a PK (and wait a long time for the DB to create the column). If they are not brave, or not wise, they will start creating references using the business key (i.e. the data columns), which will cause a maintenance nightmare.
Conclusion: Since the cost of having a PK (even if it's not used ATM) is so small, let it be.
Do you have any foreign keys, do you ever join on the PK?
If the answer to this is no, and your app never retrieves an item from the table by its PK, and no query ever uses it in a WHERE clause - in other words, you just added an IDENTITY column to have a PK - then:
the PK in itself adds no value, but does no damage either
as for the fact that the PK is very likely the clustered index too ... it depends.
If you have NC indexes, then the fact that you have a narrow artificial clustered key (the IDENTITY PK) is helpful in keeping those indexes narrow (the CDX key is reproduced in every NC leaf slot). So a PK, even if never used, is helpful if you have significant NC indexes.
On the other hand, if you have a prevalent access pattern (a certain query that outweighs all the others in frequency and importance, or one that is part of a time-critical code path, e.g. a query run on every page visit on your site, or every second by an app), then that query is a good candidate to dictate the clustered key order.
And finally, if the table is seldom queried but often written to then it may be a good candidate for a HEAP (no clustered key at all) since heaps are so much better at inserts. See Comparing Tables Organized with Clustered Indexes versus Heaps.
Behind the scenes, the primary key is a clustered index (by default, unless created as a nonclustered index) and holds all the data for the table. If the PK is an identity column, inserts will happen sequentially and no page splits will occur.
But if you don't access the id column at all, then you probably want to add some indexes on the other columns. Also, when you have a PK you can set up FK relationships.
In the logical model, a table must have at least one key. There is no reason to arbitrarily specify that one of the keys is 'primary'; all keys are equal. Although the concept of 'primary key' can be traced back to Ted Codd's early work, the mistake was picked up early on and has long been corrected in relational theory.
Sadly, PRIMARY KEY found its way into SQL and we've had to live with it ever since. SQL tables can have duplicate rows and, if you consider the resultset of a SELECT query to also be a table, then query results can have duplicate rows too. Relational theorists dislike SQL a lot. However, just because SQL lets you do all kinds of wacky non-relational things, that doesn't mean you have to actually do them. It is good practice to ensure that every SQL table has at least one key.
In SQL, using PRIMARY KEY on its own has implications e.g. NOT NULL, UNIQUE, the table's default reference for foreign keys. In SQL Server, using PRIMARY KEY on its own has implications e.g. the table's clustered index. However, in all these cases, the implicit behavior can be made explicit using specific syntax.
You can use UNIQUE (constraint rather than index) and NOT NULL in combination to enforce keys in SQL. Therefore, no, a primary key (or even PRIMARY KEY) is not necessary for SQL Server.
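A tiny sketch of that point: both tables below enforce a key on id, but only the first carries the extra PRIMARY KEY implications described above.
CREATE TABLE t1 (id INT NOT NULL PRIMARY KEY);  -- key plus PRIMARY KEY implications
CREATE TABLE t2 (id INT NOT NULL UNIQUE);       -- same key, enforced without PRIMARY KEY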
I would never have a table without a primary key. Suppose you ever need to remove a duplicate - how would you identify which one to remove and which to keep?
A PK is not necessary.
But you should consider placing a non-unique index on the columns that you use for querying (i.e. those that appear in the WHERE clause). This will considerably boost lookup performance.
A primary key, when defined, helps the database improve performance for indexing and relationships.
I always tend to define a primary key as an auto-incrementing integer in all my tables, regardless of whether I access it or not, because when you start to scale up your application you may find you actually do need it, and it makes life a lot simpler.
A primary key is really a property of your domain model, and it uniquely identifies an instance of a domain object.
Having a clustered index on a monotonically increasing column (such as an identity column) will mean page splits will not occur, BUT insertions will unbalance the index over time, and therefore rebuilding indexes needs to be done regularly (or when fragmentation reaches a certain threshold).
I have to have a very good reason to create a table without a primary key.
As SQLMenace said, the clustered index is an important column for the physical layout of the table. In addition, having a clustered index, especially a well chosen one on a skinny column like an integer pk, actually increases insert performance.
If you are accessing them via non-key fields the performance probably will not change. However it might be nice to keep the PK for future enhancements or interfaces to these tables. Does your application only use this one table?
I know this is subjective, but I'd like to know people's opinions, and hopefully some best practices that I can apply when designing SQL Server table structures.
I personally feel that keying a table on a fixed (max) length varchar is a no-no, because it means having to propagate the same fixed length across any other tables that use this as a foreign key. Using an int avoids having to apply the same length across the board, which is bound to lead to human error, i.e. one table has varchar(10) and another varchar(20).
This sounds like a nightmare to set up initially, and it means future maintenance of the tables is cumbersome too. For example, say the keyed varchar column suddenly becomes 12 chars instead of 10. You now have to go and update all the other tables, which could be a huge task years down the line.
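To make the propagation problem concrete, a minimal sketch with hypothetical names; the length in the child table must match the parent's, forever:
CREATE TABLE customer (
    customer_code VARCHAR(10) NOT NULL PRIMARY KEY
);
CREATE TABLE invoice (
    invoice_id    INT NOT NULL PRIMARY KEY,
    customer_code VARCHAR(10) NOT NULL          -- must stay in sync with the parent's length
        REFERENCES customer (customer_code)
);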
Am I wrong? Have I missed something here? I'd like to know what others think of this, and whether sticking with int for primary keys is the best way to avoid maintenance nightmares.
When choosing the primary key, you usually also choose the clustered key. The two are often confused, but you have to understand the difference.
Primary keys are logical business elements. The primary key is used by your application to identify an entity, and the discussion about primary keys is largely about whether to use natural keys or surrogate keys. The links go into much more detail, but the basic idea is that natural keys are derived from an existing entity property like ssn or phone number, while surrogate keys have no meaning whatsoever with regard to the business entity, like id or rowid, and they are usually of type IDENTITY or some sort of uuid. My personal opinion is that surrogate keys are superior to natural keys, and the choice should always be identity values for local-only applications and guids for any sort of distributed data. A primary key never changes during the lifetime of the entity.
Clustered keys are the keys that define the physical storage of rows in the table. Most times they overlap with the primary key (the logical entity identifier), but that is not actually enforced nor required. When the two are different, it means there is a non-clustered unique index on the table that implements the primary key. Clustered key values can actually change during the lifetime of the row, resulting in the row being physically moved in the table to a new location. If you have to separate the primary key from the clustered key (and sometimes you do), choosing a good clustered key is significantly harder than choosing a primary key. There are two primary factors that drive your clustered key design:
The prevalent data access pattern.
The storage considerations.
Data Access Pattern. By this I mean the way the table is queried and updated. Remember that clustered keys determine the actual order of the rows in the table. For certain access patterns, some layouts make all the difference in the world in regard to query speed or update concurrency:
Current vs. archive data. In many applications the data belonging to the current month is frequently accessed, while older data is seldom accessed. In such cases the table design uses table partitioning by transaction date, often using a sliding window algorithm. The current month's partition is kept on a filegroup located on hot, fast disk; the archived old data is moved to filegroups hosted on cheaper but slower storage. Obviously in this case the clustered key (date) is not the primary key (transaction id). The separation of the two is driven by the scale requirements, as the query optimizer will be able to detect that the queries are only interested in the current partition and not even look at the historic ones.
FIFO queue style processing. In this case the table has two hot spots: the tail where inserts occur (enqueue), and the head where deletes occur (dequeue). The clustered key has to take this into account and organize the table so as to physically separate the tail and head locations on disk, in order to allow concurrency between enqueue and dequeue, e.g. by using an enqueue order key. In pure queues this clustered key is the only key, since there is no primary key on the table (it contains messages, not entities). But most times the queue is not pure; it also acts as the storage for the entities, and the line between the queue and the table is blurred. In this case there is also a primary key, which cannot be the clustered key: entities may be re-enqueued, thus changing the enqueue order clustered key value, but they cannot change the primary key value. Failure to see the separation is the primary reason why user-table-backed queues are so notoriously hard to get right and riddled with deadlocks: the enqueues and dequeues occur interleaved throughout the table, instead of localized at the tail and head of the queue.
Correlated processing. When the application is well designed, it partitions the processing of correlated items between its worker threads. For instance, a processor is designed to have 8 worker threads (say, to match the 8 CPUs on the server), so the workers partition the data amongst themselves; e.g. worker 1 picks up only accounts named A to E, worker 2 F to J, etc. In such cases the table should actually be clustered by the account name (or by a composite key that has, in the leftmost position, the first letter of the account name), so that workers localize their queries and updates in the table. Such a table would have 8 distinct hot spots, around the areas where the workers are concentrating at the moment, but the important thing is that they don't overlap (no blocking). This kind of design is prevalent in high-throughput OLTP designs and in TPC-C benchmark loads, where this kind of partitioning also reflects in the memory location of the pages loaded in the buffer pool (NUMA locality), but I digress.
Storage Considerations. The clustered key width has huge repercussions for the storage of the table. For one, the key occupies space in every non-leaf page of the b-tree, so a large key occupies more space. Second, and often more important, the clustered key is used as the lookup key by every non-clustered index, so every non-clustered index has to store the full width of the clustered key for each row. This is what makes large clustered keys like varchar(256) and guids poor choices for clustered index keys.
Also, the choice of the key has an impact on clustered index fragmentation, sometimes drastically affecting performance.
These two forces can sometimes be antagonistic: the data access pattern may require a certain large clustered key which will cause storage problems. In such cases a balance is needed, but there is no magic formula. You measure and you test to get to the sweet spot.
So what do we make of all this? Always start by considering a clustered key that is also the primary key, of the form entity_id IDENTITY(1,1) NOT NULL. Separate the two and organize the table accordingly (e.g. partition by date) when appropriate.
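A hedged SQL Server sketch of that separation, with hypothetical names: the primary key stays on the surrogate id but is declared nonclustered, while the clustered index goes on the transaction date.
CREATE TABLE txn (
    txn_id   INT IDENTITY(1,1) NOT NULL,
    txn_date DATETIME NOT NULL,
    amount   MONEY NULL,
    CONSTRAINT PK_txn PRIMARY KEY NONCLUSTERED (txn_id)  -- logical identifier
);
CREATE CLUSTERED INDEX CIX_txn_date ON txn (txn_date);   -- physical order by date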
I would definitely recommend using an INT NOT NULL IDENTITY(1,1) field in each table as the primary key.
With an IDENTITY field, you can let the database handle all the details of making sure it's really unique and all, and the INT datatype is just 4 bytes, and fixed, so it's easier and more suited to be used for the primary (and clustering) key in your table.
And you're right - an INT is an INT is an INT - it will not change its size or anything, so you won't ever have to go recreate and/or update your foreign key relations.
Using a VARCHAR(10) or (20) just uses up too much space - 10 or 20 bytes instead of 4 - and, something a lot of folks don't know, the clustering key value is repeated in every single index entry of every single non-clustered index on the table, so potentially you're wasting a lot of space (not just on disk - that's cheap - but also in SQL Server's main memory). Also, since it's variable (might be 4 chars, might be 20), it's harder for SQL Server to properly maintain a good index structure.
Marc
I'd agree that in general an INT (or identity) field type is the best choice in most "normal" database designs:
it requires no "algorithm" to generate the id/key/value
you have fast(er) joins and the optimizer can work a lot harder over ranges and such under the hood
you're following a de facto standard
That said, you also need to know your data. If you're going to blow through a signed 32-bit int, you need to think about unsigned. If you're going to blow through that, maybe 64-bit ints are what you want. Or maybe you need a UUID/hash to make syncing between database instances/shards easier.
Unfortunately, it depends and YMMV but I'd definitely use an int/identity unless you have a good reason not to.
Like you said, consistency is key. I personally use unsigned ints. You're not going to run out of them unless you are working with ludicrous amounts of data, and you can always know any key column needs to be that type and you never have to go looking for the right value for individual columns.
Based on going through this exercise countless times and then supporting the system with the results, there are some caveats to the blanket statement that INT is always better. In general, unless there is a reason, I would go along with that. However, in the trenches, here are some pros and cons.
INT
Use unless good reason not to do so.
GUID
Uniqueness - One example is the case where there is one-way communication between remote pieces of the program, and the side that needs to initiate is not the side with the database. In that case, setting a Guid on the remote side is safe where selecting an INT is not.
Uniqueness Again - A more far-fetched scenario is a system where multiple customers coexist in separate databases and there is migration between customers, like similar users using a suite of programs. If such a user signs up for another program, their user record can be used there without conflict. Another scenario is if customers acquire entities from each other. If both are on the same system, they will often expect that migration to be easier. Essentially, any frequent migration between customers.
Hard to Use - Even an experienced programmer cannot remember a guid. When troubleshooting, it is often frustrating to have to copy and paste identifiers for queries, especially if the support is being done with a remote access tool. It is much easier to constantly refer to SELECT * FROM Xxx WHERE ID = 7 than SELECT * FROM Xxx WHERE ID = 'DF63F4BD-7DC1-4DEB-959B-4D19012A6306'
Indexing - using a clustered index for a guid field requires constant rearrangement of the data pages and is not as efficient to index as INTs or even short strings. It can kill performance - don't do it.
CHAR
Readability - Although conventional wisdom is that nobody should be in the database, the reality of systems is that people will have access - hopefully personnel from your organization. When those people are not savvy with join syntax, a normalized table with ints or guids is not clear without many other queries. The same normalized table with SOME string keys can be much more usable for troubleshooting. I tend to use this for the type of table where I supply the records at installation time so they do not vary. Things like StatusID on a major table is much more usable for support when the key is 'Closed' or 'Pending' than a digit. Using traditional keys in these areas can turn an easily resolved issue to something that requires developer assistance. Bottlenecks like that are bad even when caused by letting questionable personnel access to the database.
Constrain - Even if you use strings, keep them fixed length, which speeds indexing and add a constraint or foreign key to keep garbage out. Sometimes using this string can allow you to remove the lookup table and maintain the selection as a simple Enum in the code - it is still important to constrain the data going into this field.
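As a hedged sketch of that last point, with hypothetical names: a fixed-length, human-readable string key, constrained by a foreign key to a lookup table so garbage stays out.
CREATE TABLE order_status (
    status_code CHAR(10) NOT NULL PRIMARY KEY   -- fixed length, readable values like 'Closed'
);
CREATE TABLE orders (
    order_id    INT NOT NULL PRIMARY KEY,
    status_code CHAR(10) NOT NULL
        REFERENCES order_status (status_code)   -- keeps garbage out
);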
For best performance, 99.999% of the time the primary key should be a single integer field.
Unless you require the primary key to be unique across multiple tables in a database, or across multiple databases. I am assuming that you are asking about MS SQL Server, because that is how your question was tagged. In that case, consider using a GUID field instead. Though better than a varchar, GUID performance is not as good as an integer's.
Use INT. Your points are all valid; I would prioritize as:
Ease of using SQL's auto increment capability - why reinvent the wheel?
Managability - you don't want to have to change the key field.
Performance
Disk Space
1 & 2 require the developer's time/energy/effort. 3 & 4 you can throw hardware at.
If Joe Celko was on here, he would have some harsh words... ;-)
I want to point out that INTs as a hard and fast rule aren't always appropriate. Say you have a vehicle table with all types of cars, trucks, etc. Now say you have a VehicleType table. If you wanted to get all trucks, you might do this (with an INT identity seed):
SELECT V.Make, V.Model
FROM Vehicle as V
INNER JOIN VehicleType as VT
ON V.VehicleTypeID = VT.VehicleTypeID
WHERE VT.VehicleTypeName = 'Truck'
Now, with a Varchar PK on VehicleType:
SELECT Make, Model
FROM Vehicle
WHERE VehicleTypeName = 'Truck'
The code is a little cleaner and you avoid a join. Perhaps the join isn't the end of the world, but if you only have one tool in your toolbox, you're missing some opportunities for performance gains and cleaner schemas.
Just a thought. :-)
While INT is generally recommended, it really depends on your situation.
If you're concerned with maintainability, then other types are just as feasible. For example, you could use a Guid very effectively as a primary key. There are reasons for not doing this, but consistency is not one of them.
But yes, unless you have a good reason not to, an int is the simplest to use, and the least likely to cause you any problems.
With PostgreSQL I generally use the "Serial" or "BigSerial" 'data type' for generating primary keys. The values are auto incremented and I always find integers to be easy to work with. They are essentially equivalent to a MySQL integer field that is set to "auto_increment".
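A minimal PostgreSQL sketch (the table is hypothetical); SERIAL creates the backing sequence and wires it up for you:
CREATE TABLE account (
    id   SERIAL PRIMARY KEY,  -- backed by an auto-created sequence
    name TEXT NOT NULL
);
INSERT INTO account (name) VALUES ('Alice');  -- id is assigned automatically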
One should think hard about whether a 32-bit range is enough for what you're doing. Twitter's status IDs were 32-bit INTs and they had trouble when they ran out.
Whether to use a BIGINT or a UUID/GUID in that situation is debatable and I'm not a hardcore database guy, but UUIDs can be stored in a fixed-length VARCHAR without worrying that you'll need to change the field size.
We have to keep in mind that the primary key of a table should not carry "business logic"; it should only be an identity for the record it belongs to. Following this simple rule, an int, and especially an identity int, is a very good solution. By asking about varchar, I guess you mean using, for example, the "Full Name" as a key to the "people" table. But what if we want to change the name from "George Something" to "George A. Something"? And what size will the field be? If we change the size, we have to change the size on all foreign tables too. So we should avoid logic in keys. Sometimes we can use the social ID (an integer value) as a key, but I avoid that too. Now, if a project has prospects to scale up, you should consider using Guids too (the uniqueidentifier SQL type).
Keeping in mind that this is quite an old question, I still want to make the case for using varchar surrogate keys for future readers:
An environment with several replicated machines
Scenarios where it is required that the ID of a to be inserted row is known before it is actually inserted (i.e., the client assigns this ID, not the database)
Just a quick database design question: Do you ALWAYS use an ID field in EVERY table, or just most of them? Clearly most of your tables will benefit, but are there ever tables that you might not want to use an ID field?
For example, I want to add the ability to add tags to objects in another table (foo). So I've got a table FooTag with a varchar field to hold the tag, and a fooID field to refer to the row in foo. Do I really need to create a clustered index around an essentially arbitrary ID field? Wouldn't it be more efficient to use fooID and my text field as the clustered index, since I will almost always be searching by fooID anyway? Plus using my text in the clustered index would keep the data sorted, making sorting easier when I have to query my data. The downside is that inserts would take longer, but wouldn't that be offset by the gains during selection, which would happen far more often?
What are your thoughts on ID fields? Bendable rule, or unbreakable law?
edit: I am aware that the example provided is not normalized. If tagging is to be a major part of the project, with multiple tables being tagged, and other 'extras', a two-table solution would be a clear answer. However, in this simplest case, would normalization be worthwhile? It would save some space, but would require an extra join when running queries.
As in much of programming: rule, not law.
Proof by exception: Some two-column tables exist only to form relationships between other more meaningful tables.
If you are making tables that bridge between two or more other tables, and the only fields you need are the dual PK/FKs, then I don't know why you would need an ID column in there as well.
ID columns generally can be very helpful, but that doesn't mean you should go peppering them in at every occasion.
As others have said, it's a general, rather than absolute, rule and there are plenty of exceptions (tables with composite keys for example).
There are some occasional but useful cases where you might want to create an artificial ID in a table that already has a (usually composite) unique identifier. For example, in one system I've created a table to store part numbers; although the part numbers are unique, they may actually change - so we add an arbitrary integer PartID. Not so common, but a typical real-world example.
In general, what you really want is to be able, if at all possible, to have some way to uniquely identify a record. It could be an id field or it could be a unique index (which does not have to be on just one field). Any time I thought I could get away without creating a way to uniquely identify a record, I have been proven wrong. Not all tables have a natural key, though, and if they do not, you really need to have an id field of some kind. If you have a natural key, you could use that instead, but I find that even then I need an id field in most cases to prevent having to do too much updating when the natural key changes (it always seems to change). Plus, having worked with literally hundreds of databases concerning many different topics, I can tell you that a true natural key is rare. As others have mentioned, there is no need for an id field in a table that is simply there to join two tables that have a many-to-many relationship, but even that should have a unique index.
If you need to retrieve records from that table by a unique id, then yes. If you will retrieve them by some other composite key made up of foreign keys, then no. The last thing you need is fields, data, and indexes that you do not use.
A clustered index does not need to be on primary key or a surrogate (identity column) either.
Your design, however, is not normalized. Typically for tagging, I use two tables: a table of tags (with a surrogate key) and a table of links from the tags to the subject table(s), using the surrogate key in the tag table and the primary key in the subject table. This allows your tags to apply to different entities (photos, articles, employees, locations, products, whatever). It allows you to enforce foreign key relationships to multiple tables, and also allows you to invent tag hierarchies and other things about the tag table.
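A hedged sketch of that two-table design; apart from foo and fooID from the question, the names are hypothetical:
CREATE TABLE tag (
    tag_id INT IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key for the tag
    name   VARCHAR(50) NOT NULL UNIQUE
);
CREATE TABLE foo_tag (
    fooID  INT NOT NULL REFERENCES foo (fooID),
    tag_id INT NOT NULL REFERENCES tag (tag_id),
    PRIMARY KEY (fooID, tag_id)  -- composite key; no surrogate id needed on the link table
);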
As far as the indexes on this design, it will be dictated by the usage patterns.
In general, developers love having an ID field on all tables except for 'linking' tables, because it makes development much easier, and I am no exception to this. DBAs, on the other hand, see no problem with making natural primary keys made up of 3 or 4 columns. It can be a butting of heads to try to get a good database design.