I have a table:
id:int
revision:int
text:ntext
In general I will need to retrieve the latest revision of text for a particular id, or (considerably less frequently) add a new row containing a new revision for a particular id. Bearing this in mind, it would be a good idea to put indexes on the id and revision columns. I don't have a problem with implementing this, but I'm wondering if this is a situation where it would be sensible to use a composite (multi-field) index/key composed of both id and revision, or if there is any other strategy that would be appropriate for my use case?
I don't think the performance difference between a composite index and two separate indexes would be noticeable, but, as usual, I suggest trying both and profiling if the absolute best performance is needed.
You are likely to always be querying on both fields, with a definite id and an unknown revision occasionally (when needing to find the max revision for an id). If your composite index is (id,revision) then this use case is supported by the index. Querying on id alone with no care for revision also works.
If it is ever likely that you will be querying on revision only without regard to id then you will need two separate indexes.
You will also want to analyze the impact that either index has on insert performance. The composite index will cluster the table on both fields, whereas with two separate indexes only the one on id can be the clustering key.
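For concreteness, a minimal sketch of the composite approach, assuming SQL Server; Revisions is a hypothetical table name built from the columns in the question:

    CREATE TABLE Revisions (
        id int NOT NULL,
        revision int NOT NULL,
        [text] ntext NULL, -- bracketed because text is a reserved word
        CONSTRAINT PK_Revisions PRIMARY KEY (id, revision) -- the composite key doubles as the index
    );

    -- Latest revision for a given id; the (id, revision) index supports this directly.
    DECLARE @id int = 42;
    SELECT TOP 1 [text]
    FROM Revisions
    WHERE id = @id
    ORDER BY revision DESC;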
It seems that in the majority of cases you will be selecting the record based on both id and revision - therefore, for the quickest lookups, you should make id and revision your composite primary key.
If id is the primary key, it's already indexed (I don't use SQL Server).
It seems that your revision is unique too,
so I think it would be better to use separate indexes and put a unique constraint on revision (if required).
I want to use an id as a primary key for my table. In each record, I am also storing an id from another source, but these ids are in no way sequential.
Should I add an (auto-incremented) column with a "new" id? It is very important that queries by the id are as fast as possible.
Some info:
The content of my table is only stored temporarily; the table often gets cleared (TRUNCATE) and then filled with new content.
It's SQL Server 2008.
After writing content to the table, I create an index on the id column.
Thanks!
As long as you are sure the supplied ids are unique, there's no need to create another (surrogate) id to use as the primary key.
Under most circumstances, an index on the existing id should be sufficient. You can make it slightly faster by declaring it as a primary key.
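A minimal sketch of that, assuming the supplied id is an int (table and column names are illustrative):

    CREATE TABLE ImportedData (
        id int NOT NULL PRIMARY KEY, -- the externally supplied id, used directly as the key
        payload nvarchar(max) NULL
    );

    -- The periodic clear from the question; the PK index survives the truncate,
    -- so there is no need to drop and recreate it on each load.
    TRUNCATE TABLE ImportedData;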
From what you describe a new id is not necessary for performance. If you do add one, the table will be slightly larger, which has a (very small) negative effect on performance.
If the existing id is not numeric (or not an integer), then there might be a small gain from using a more efficient type for the index. But, your best bet is to make the existing id a primary key (although this might affect load performance).
Note: I usually prefer synthetic primary keys, so this answer is very specific to your question.
If you are after speed, I would concatenate the two IDs (either in the application or a stored proc) and then put them in one column.
There are a couple of similar questions already out there and the consensus seemed to be that a primary key should always be created.
But what if you have a single row table for storing settings (and let's not turn this into a discussion about why it might be good/bad to create a single row table please)?
Surely having a primary key on a single row table becomes completely useless?
It may seem completely useless, but it's also completely harmless, and I'd vote for harmless with good design principles vs. useless with no design principles every time.
Other people have commented, rightly, that you don't know how you're going to use the table in a year or five years... what if someone comes along and decides they want to duplicate the configuration -- move it to a distributed environment or add a test environment by using a duplicate configuration string or whatever. Having a field that acts like a primary key means that whenever you query the table, if you use the key, you'll be certain, no matter what anyone else may do to your table, that you're getting the correct record.
You're right, there are a million other aspects -- surrogate keys vs. intelligent keys, indexing, partitioning (silly on a single row table, I know), whatever... but without getting into that I'd vote to add the key rather than not. You could have added it in the time it took to read this thread.
Short answer: no key, duplicate records possible. You're planning a single row now, but what about six months in the future when your single row multiplies? Put a primary key on the table, even for a single row.
You could always base your primary key on the name of the setting. Then your table would become a key-value store.
But no, in many RDBMS you are not REQUIRED to have a primary key per table.
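A hedged sketch of that key-value shape (all names here are hypothetical):

    CREATE TABLE Settings (
        name nvarchar(100) NOT NULL PRIMARY KEY, -- the setting name acts as the key
        value nvarchar(max) NULL
    );

    SELECT value FROM Settings WHERE name = 'SiteTitle';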
Having declared a primary key on a single row table in SQL will ensure that there will be no duplicates. Whether it is useless depends on your requirements. Usually it is a good idea to avoid duplicates.
Let's say we have a table to store users' favourite pictures, with a composite primary key (UserId, PictureId). Books normally say in this case you need a composite index based on (UserId, PictureId), which normally appears in the WHERE clause as (UserId=103 AND PictureId=1234). But I think the database engine should be smart enough to use two individual indexes based on the two columns separately: just get the set of row numbers from each index and find the ones that are present in both sets. That way, a composite index is not necessary.
So, in reality can database engines do that?
There'd be no advantage to using the two separate single-column indexes; the engine would be better off doing a table scan.
The point of using an index is to make access faster. If the engine used two indexes, it would have to sort at least one set of data from one of the indexes and merge the results from the two indexes. That would be a lot more work than reading just one composite index, especially since the composite index allows for an index-only scan.
Most database engines will require the composite index to enforce the primary key. As such, it's a "free" index that you're going to have anyway - why worry about it?
There may be some benefit (if the index is on UserID,PictureID) to adding a second index just on PictureID. Any query on just UserID will be able to use the composite index, whereas a query just using PictureID would be unable to do so.
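A sketch of that combination; UserFavourite is a hypothetical name for the table in the question:

    CREATE TABLE UserFavourite (
        UserId int NOT NULL,
        PictureId int NOT NULL,
        CONSTRAINT PK_UserFavourite PRIMARY KEY (UserId, PictureId) -- the "free" composite index
    );

    -- The composite index already serves queries on UserId alone or on both columns;
    -- this second index is only needed for queries on PictureId alone.
    CREATE NONCLUSTERED INDEX IX_UserFavourite_PictureId
        ON UserFavourite (PictureId);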
I think in the use case you describe, the composite index is not necessary. That would be useful if you were doing a query on, say, a given set of user IDs plus a given set of picture IDs. But when would you ever need that? You'd be more likely to query all a user's pictures in a given date range, or lookup a specific picture by ID. This would suggest an index structure of one composite user id + date index, and another picture id only index.
It always depends on the distribution of records in your database, and the types of queries you will be running most frequently.
PRIMARY KEY or UNIQUE constraints are abstract, theoretical concepts.
An INDEX is a practical, physical thing that lives in the real world.
In practice, indexes can be used to enforce PK or UNIQUE constraints, but other techniques could also be used (e.g. for a small domain, a bitmap).
What you describe would be significantly more expensive than using a composite index.
First a set of rows would need to be identified from the first index, then a set of rows from the second and finally the set-intersection performed between the two.
Note that this is the price you would pay for every INSERT/UPDATE and every foreign key check, not just SELECT.
Also, there may be concurrency issues involved - depending on how the DBMS is implemented, enforcing uniqueness through a single unique composite index might require less/simpler locking than enforcing uniqueness through two non-unique, non-composite indexes.
And of course, if you intend to cluster your table, the primary index will typically also be the clustering index, and contain all columns anyway, so there isn't much purpose in leaving anything out from the "sorting" portion of the index.
To my knowledge SQL Server 2008 will only allow one clustered index per table. For the sake of this question let's say I have a list of user-submitted stories that contains the following columns.
ID (int, primary key)
Title (nvarchar)
Url (nvarchar)
UniqueName (nvarchar) This is the url slug (blah-blah-blah)
CategoryID (int, FK to Category table)
Most of the time stories will never be queried by ID. Most of the queries will be done either by the CategoryID or by the UniqueName.
I'm new to indexing, so I assumed that it would be best to place 2 nonclustered indexes on this table: one on UniqueName and one on CategoryID. After doing some reading about indexes, it seems like having a clustered index on UniqueName would be very beneficial. Considering UniqueName is... unique, would it be advantageous to place the primary key on UniqueName and get rid of the ID field? As for CategoryID, I assume a nonclustered index will do just fine.
Thanks.
In the first place, you can put the clustered index on the unique name; it doesn't have to be on the id field. If you do little or no joining to this table you could get rid of the id. In any event I would put a unique index on the unique name field (you may find in doing so that it isn't as unique as you thought it would be!).
If you do a lot of joining though, I would keep the id field, it is smaller and more efficient to join on.
Since you say you are new at indexing, I will point out that while primary keys have an index created automatically when they are defined, foreign keys do not. You almost always want to index your foreign key fields.
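For instance, a hedged sketch against the Stories table from the question (index names are illustrative):

    -- Enforce and index the slug; this will fail if duplicate UniqueNames already exist.
    CREATE UNIQUE NONCLUSTERED INDEX IX_Stories_UniqueName
        ON Stories (UniqueName);

    -- Foreign key columns get no index automatically, so add one for CategoryID.
    CREATE NONCLUSTERED INDEX IX_Stories_CategoryID
        ON Stories (CategoryID);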
Just out of habit, I always create an Identity field "ID" like you have as the PK. It makes things consistent. If all "master" tables have a field named "ID" that is INT Identity, then it's always obvious what the PK is. Additionally, if I need to make a bridge entity, I'll be storing two (or more) columns of type INT instead of type nvarchar(). So in your example, I would keep ID as the PK and create a unique index on UniqueName.
Data is stored in the order of the clustered key, so if you are going to key retrieval of data on one of those fields it would be advantageous to cluster on it - assuming the inserted values aren't significantly fragmented, since fragmentation can slow down insert performance.
On the other hand, if this table is joined to a lot on the ID, it probably makes more sense to keep the clustered key on the PK.
Generally it's best to key a table on an identity column and use that as the clustered index. There's a simple rule of thumb here
Don't use a meaningful column as primary index
The reason for this is that using a PK on a meaningful column tends to give rise to maintenance issues. It's a rule of thumb, so it can be overridden should circumstances dictate, but usually it's best to work from the assumed default position of each table being indexed by a (clustered) meaningless identity column. Such a column tends to be more efficient for joins, and since it's the default design most DBAs will adopt, it won't raise eyebrows or cause issues because the system isn't as the next DBA might assume. Meaningless PKs are invariably more flexible and can be adapted more easily to changing circumstances than otherwise.
When to override the rule? Only if you envisage performance issues. For most databases with reasonable loads on modern hardware, suitably indexed, you will not have any trouble if you're not squeezing the last millisecond of performance out of them by clustering the optimal index. DBA and programmer cycles are much more expensive than CPU cycles, and if you'll only shave the odd millisecond or so off your queries by adopting a different strategy then it's just not worth it. However, should you be looking at a table approaching a million rows, that's a different matter. It depends very much on circumstances, but generally if I'm designing a database with tables of less than 100,000 rows I will lean heavily towards designing for flexibility, ease of writing stable queries, and along the principles any other designer would expect to see. Over a million rows, I design for performance. Between 100k and a million it's a matter of judgement.
There is no requirement or necessity to have a clustered index at all, primary key or otherwise. It's a performance optimisation tool, like all indexing strategies, and should be applied when an improvement can be gained by using it.
As already mentioned, because the table is physically sorted according to the clustered index key, it's a Highlander situation: there can only be one!
Clustered indexes are mostly useful for situations such as:
you regularly need to retrieve a set of rows whose values for a given column are in a range, so columns that are often the subject of a BETWEEN clause are interesting (see the sketch after this list); or
most of your single-row hits in the table occur in an area that can be described by a subset of the values of a key.
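To illustrate the range case, a hedged sketch (Orders, OrderDate, OrderID and Total are hypothetical names):

    -- Clustering on the date column keeps rows in a date range physically adjacent.
    CREATE CLUSTERED INDEX IX_Orders_OrderDate ON Orders (OrderDate);

    SELECT OrderID, Total
    FROM Orders
    WHERE OrderDate BETWEEN '2010-01-01' AND '2010-01-31';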
I thought that they were particularly un-useful for situations like when you have high-volume transaction systems with very frequent inserts when a sequential key is the clustered column. You'll get a gang of processes all trying to insert at the same physical location (a "hot-spot"). Turns out, as was commented here before this edit, that I'm sadly out-of-date and showing my age. See this post on the topic by Kimberley Tripp which says it all much better.
Sequential numeric "ID" columns are generally not good candidate columns. Names can be good, dates likewise - if carefully considered.
Just a quick database design question: Do you ALWAYS use an ID field in EVERY table, or just most of them? Clearly most of your tables will benefit, but are there ever tables that you might not want to use an ID field?
For example, I want to add the ability to add tags to objects in another table (foo). So I've got a table FooTag with a varchar field to hold the tag, and a fooID field to refer to the row in foo. Do I really need to create a clustered index around an essentially arbitrary ID field? Wouldn't it be more efficient to use fooID and my text field as the clustered index, since I will almost always be searching by fooID anyway? Plus, using my text field in the clustered index would keep the data sorted, making sorting easier when I have to query my data. The downside is that inserts would take longer, but wouldn't that be offset by the gains during selection, which would happen far more often?
What are your thoughts on ID fields? Bendable rule, or unbreakable law?
edit: I am aware that the example provided is not normalized. If tagging is to be a major part of the project, with multiple tables being tagged, and other 'extras', a two-table solution would be a clear answer. However in this simplest case, would normalization be worthwhile? It would save some space, but require an extra join when running queries
As in much of programming: rule, not law.
Proof by exception: Some two-column tables exist only to form relationships between other more meaningful tables.
If you are making tables that bridge between two or more other tables and the only fields you need are the dual PK/FK's, then I don't know why you would need ID column in there as well.
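For example, a minimal sketch of such a bridge table, assuming a Foo table whose int primary key is id:

    CREATE TABLE FooTag (
        fooID int NOT NULL REFERENCES Foo (id), -- FK to the tagged row
        tag varchar(50) NOT NULL, -- the tag text itself
        CONSTRAINT PK_FooTag PRIMARY KEY (fooID, tag) -- composite key; no surrogate ID column needed
    );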
ID columns generally can be very helpful, but that doesn't mean you should go peppering them in at every occasion.
As others have said, it's a general, rather than absolute, rule and there are plenty of exceptions (tables with composite keys for example).
There are occasional but useful cases where you might want to create an artificial ID in a table that already has a (usually composite) unique identifier. For example, in one system I've created a table to store part numbers; although the part numbers are unique, they may actually change - so we add an arbitrary integer PartID. Not so common, but it's a typical real-world example.
In general what you really want is, if at all possible, some way to uniquely identify a record. It could be an id field or it could be a unique index (which does not have to be on just one field). Anytime I thought I could get away without creating a way to uniquely identify a record, I have been proven wrong. Not all tables have a natural key, though, and if they do not, you really need to have an id field of some kind. If you have a natural key, you could use that instead, but I find that even then I need an id field in most cases to prevent having to do too much updating when the natural key changes (it always seems to change). Plus, having worked with literally hundreds of databases on many, many different topics, I can tell you that a true natural key is rare. As others have mentioned, there is no need for an id field in a table that is simply there to join two tables that have a many-to-many relationship, but even this should have a unique index.
If you need to retrieve records from that table with unique id then yes. If you will retrieve them by some other composite key made up of foreign keys then no. The last thing you need is fields, data, and indexes that you do not use.
A clustered index does not need to be on primary key or a surrogate (identity column) either.
Your design, however, is not normalized. Typically for tagging, I use two tables: a table of tags (with a surrogate key) and a table of links from the tags to the subject table(s), using the surrogate key from the tag table and the primary key from the subject table. This allows your tags to apply to different entities (photos, articles, employees, locations, products, whatever). It allows you to enforce foreign key relationships to multiple tables, and also allows you to invent tag hierarchies and other things about the tag table.
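A sketch of that two-table design under hypothetical names, with Photo standing in for the subject table:

    CREATE TABLE Tag (
        TagID int IDENTITY PRIMARY KEY, -- surrogate key for the tag
        Name nvarchar(50) NOT NULL UNIQUE
    );

    CREATE TABLE PhotoTag (
        PhotoID int NOT NULL REFERENCES Photo (PhotoID), -- the subject table's primary key
        TagID int NOT NULL REFERENCES Tag (TagID), -- the tag table's surrogate key
        CONSTRAINT PK_PhotoTag PRIMARY KEY (PhotoID, TagID)
    );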
As far as the indexes on this design, it will be dictated by the usage patterns.
In general, developers love having an ID field on all tables except for 'linking' tables because it makes development much easier, and I am no exception to this. DBAs, on the other hand, see no problem with making natural primary keys made up of 3 or 4 columns. It can be a butting of heads to try and get a good database design.