What is the optimal indexing strategy for a relation table? - sql

A relation table is the common solution to representing a many-to-many (m:n) relationship.
In its simplest form, it combines foreign keys referencing the two related tables into a new composite primary key:
A       AtoB    B
----    ----    ----
*id     *Aid    *id
data    *Bid    data
How should it be indexed to provide optimal performance in every JOIN situation?
1. clustered index over (Aid ASC, Bid ASC) (this is mandatory anyway, I guess)
2. option #1 plus an additional index over (Bid ASC, Aid ASC)
3. or option #1 plus an additional index over (Bid ASC)
4. any other options? Vendor-specific stuff, maybe?

I made some tests, and here is the update:
To cover all possible cases, you'll need to have:
CLUSTERED INDEX (a, b)
INDEX (b)
This will cover all JOIN situations and ORDER BY.
Note that an index on b is actually sorted on (b, a), since it references clustered rows.
As long as your a and b tables have PRIMARY KEYs on their id columns, you don't need to create additional indexes to handle ORDER BY ASC, DESC.
See the entry in my blog for more details:
Indexing a link table
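As a minimal sketch of the layout described above, assuming SQL Server syntax: the table and column names come from the question's diagram, while the data types and the constraint and index names (PK_AtoB, IX_AtoB_Bid) are made up for illustration.

```sql
-- Hypothetical parent tables; names follow the question's diagram.
CREATE TABLE A (id INT NOT NULL PRIMARY KEY, data VARCHAR(100));
CREATE TABLE B (id INT NOT NULL PRIMARY KEY, data VARCHAR(100));

-- Link table: the composite primary key doubles as the clustered index.
CREATE TABLE AtoB (
    Aid INT NOT NULL REFERENCES A (id),
    Bid INT NOT NULL REFERENCES B (id),
    CONSTRAINT PK_AtoB PRIMARY KEY CLUSTERED (Aid, Bid)
);

-- Secondary index for lookups and joins that start from B.
CREATE NONCLUSTERED INDEX IX_AtoB_Bid ON AtoB (Bid);
```

Whether the optimizer actually picks IX_AtoB_Bid depends on the data and the query, so it is worth confirming with an execution plan.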

I guess solution 2 is optimal. I'd choose the column order of the clustered index by looking at the values and estimating which column has more distinct values. That one goes first. It's also important to have unique or primary key indexes on the parent tables.
Depending on the DBMS, number 3 might work as well as number 2. The DBMS might or might not be smart enough to use the values stored in the nonclustered index (the key of the clustered index) for anything other than referring to the actual row. If it can use them, then number 3 would be better.

I have done some quick and dirty tests by examining the execution plans in SQL Server 2005.
The plans showed that SQL Server uses the clustered index on (Aid, Bid) for most queries. Adding an index on Bid (ASC) shows that it's used for queries of this type:
select * from A
inner join AtoB on Aid = A.id
inner join B on Bid = B.id
where Bid = 1
So I'm voting for solution #3.

Related

Does a composite primary key also create an index for each column separately?

In Teradata, I create a table with a unique primary key made of two varchar columns, A and B. I will write queries that need to filter on one or both of these columns.
For best performance, should I submit a create index statement for each of the two columns (the table would have 3 indexes: the unique primary key(column A, B), non-unique column A, and non-unique column B)?
On this table, I only care about read performance and not insert/update performance.
In Teradata, if you specify a PRIMARY KEY clause when you create the table, then the table will automatically be created with a UNIQUE PRIMARY INDEX (UPI) on those PK columns. Although Teradata supports keys, it is more of an index-based DBMS.
In your case, you will have very, very fast reads (i.e. UPI access - single AMP, single row) only when you specify all of the fields in your PK. This applies to equality access as mentioned in the previous comments (thanks Dieter).
If you access the table on some but not ALL of the PK / UPI columns, then your query won't use the UPI access path. You'd need to define separate indexes or other optimization strategies, depending on your queries.
If you only care about read performance, then it makes sense to create secondary indexes on the separate columns. Just run the EXPLAIN on your query to make sure the indexes are actually being used by the Optimizer.
Another option is to ditch the PK specification altogether, especially if you never access the table on that group of columns. If there is one column you access more than the other, specify that one as your PRIMARY INDEX (non-unique) and create a secondary index on the other one. Something like:
CREATE TABLE mytable (
A INTEGER,
B INTEGER,
C VARCHAR(10)
)
PRIMARY INDEX(A) -- Non-unique primary index
;
CREATE INDEX (B) ON mytable; -- Create secondary index
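To check whether the secondary index is actually picked up, the EXPLAIN mentioned above can be run against a query that filters only on B. This is a small sketch against the mytable example; the literal 42 is an arbitrary value chosen for illustration.

```sql
-- Teradata: prefix the query with EXPLAIN to see the planned access path
-- without executing the query.
EXPLAIN
SELECT A, B, C
FROM mytable
WHERE B = 42;
```

If the plan text does not mention the secondary index, the Optimizer has chosen another access path, which can happen when statistics are missing or the index is not selective enough.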
You only need two indexes.
If you have a primary key on (A, B), then this also works for (A). If you want to filter on B, then you want an index on (B).
You might want to make it (B, A) so the index can handle cases such as:
where B = ? and A in (?, ?, ?)

stored data sorting: nonclustered primary key overrides clustered index

I need to create a table with a nonclustered primary key (to set foreign keys on other tables to it) and a clustered index to store the data in the intended order.
However, the resulting stored data is sorted in the primary key's order as opposed to the index's.
Is there a way to prevent this from occurring? Here is an example (SQL Server 14.0 RTM):
create table dbo.a (
x nvarchar(50) not null
,y nvarchar(100) not null
,index ix_a clustered (y)
,constraint pk_a primary key nonclustered (x)
)
insert dbo.a
values
('d','p')
,('c','q');
select * from dbo.a
The result should be sorted with p first, then q. However, q is in the first row and p is in the second row.
In a similar case, this approach worked when the primary key was in 2 columns as opposed to only 1 column.
You are confused. This query:
select *
from dbo.a
does not tell you anything about the "ordering" of a table. A SQL query with no ORDER BY returns rows in an indeterminate order. I freely admit that with a handful of rows in the table, the output would be highly correlated with the actual ordering of the data, but I strongly discourage you from thinking along those lines.
If you want to know the actual ordering, you need to peek at the data pages. Or you can perhaps use an execution plan to see if an index is being used instead of a sort.
I think that what you are seeing is that SQL Server is choosing to return rows from the query using the primary key index. With two rows in the table, the actual execution plan doesn't really matter.
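If the rows have to come back in a particular order, the only reliable way is to ask for it. A minimal sketch against the dbo.a table from the question:

```sql
-- An explicit ORDER BY is the only guarantee of output order;
-- the clustered index on y merely lets the engine avoid a sort.
SELECT x, y
FROM dbo.a
ORDER BY y;
```

With the two sample rows, this returns ('d', 'p') before ('c', 'q'), regardless of which index the engine reads.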

SQL: Multiple columns in primary key - how to achieve correct order?

In my solution I have multiple queries for my table. Let's assume that it's a table with multiple columns, but columns a and b form the PrimaryKey. Sometimes I query by a values, sometimes by b values. Currently I have PRIMARY KEY CLUSTERED ([a] ASC, [b] ASC).
When I'm trying to query by column b it is very slow and I constantly get a timeout from the database.
Having two clustered primary keys would be great...
What should I do? Will creating a new index on column b make b queries more efficient?
Primary key is a logical concept. Performance, on the other hand, depends on the physical organization of data. The reason why WHERE a = ? is fast is not that PK exists, but that an index was automatically created together with the PK.
So, if you want to make another query fast, just add an appropriate index.
Assuming your table looks similar to this...
CREATE TABLE T (
a int,
b int,
c int,
PRIMARY KEY (a, b)
);
You have more or less the following options for creating an index to speed up WHERE b = ?:
CREATE INDEX T_I1 ON T(b);
CREATE UNIQUE INDEX T_I2 ON T(b, a);
CREATE UNIQUE INDEX T_I3 ON T(b, a, c);
-- or: CREATE UNIQUE INDEX T_I2 ON T(b, a) INCLUDE (c);
Which of these options you choose depends largely on how much of your query you want to cover.
Also, beware that...
CREATE INDEX T_I1 ON T(b);
...might actually be equivalent to...
CREATE INDEX T_I1 ON T(b) INCLUDE (a);
...if your table is clustered (which it is in your case).
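To make the covering point concrete, here is a hedged sketch assuming SQL Server, the table T from above (clustered on its primary key), and only the plain index T_I1 on (b). The literal 42 is arbitrary.

```sql
-- Likely answered from T_I1 alone: the nonclustered index rows
-- already carry the clustering key (a, b).
SELECT a, b
FROM T
WHERE b = 42;

-- Column c is not in T_I1, so each match needs an extra lookup
-- into the clustered index (or the optimizer may prefer a scan).
SELECT a, b, c
FROM T
WHERE b = 42;
```

Adding c as a key or INCLUDE column, as in the T_I3 variants, removes that extra lookup at the cost of a larger index.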

SQL: Add Primary Key to Non-Unique Index

Let's say a query is filtering on two fields and returning primary key values.
SELECT RowIdentifier
FROM Table
WHERE QualifierA = 'exampleA' AND QualifierB = 'exampleB'
Assuming the clustered index is not the PrimaryKey, would a non-unique index that contains QualifierA and QualifierB be best served by adding the RowIdentifier as a key column (Scenario A and Scenario B)? Or would it be more appropriate to simply include it (Scenario C)?
Scenario A: Non-Unique, Non-Clustered
CREATE NONCLUSTERED INDEX IX_Table_QualifierA
ON [dbo].[Table] ([QualifierA],[QualifierB],[RowIdentifier])
Scenario B: Unique, Non-Clustered
CREATE UNIQUE NONCLUSTERED INDEX IX_Table_QualifierA
ON [dbo].[Table] ([QualifierA],[QualifierB],[RowIdentifier])
Scenario C:
CREATE NONCLUSTERED INDEX IX_Table_QualifierA
ON [dbo].[Table] ([QualifierA],[QualifierB])
INCLUDE ([RowIdentifier])
Finally, I'm assuming that if the PrimaryKey were the clustered index, neither would be necessary. Is this accurate?
If there is a CLUSTERED index, its key is automatically included in all nonclustered indexes on the table. You can explicitly include it, but it is not required.
The UNIQUE index simply enforces uniqueness. The PK should already have this constraint. You do not need to re-enforce it in every index.
If you are including the PK in your where clause, it will almost certainly use the PK index to find that row because it is guaranteed to return the fewest results, so including it in your index gains you nothing for lookups. It could also potentially skew the cardinality estimator and make SQL Server think the index is more distinct than it really is.
For the above reasons, I would select Scenario C:
CREATE NONCLUSTERED INDEX IX_Table_QualifierA
ON [dbo].[Table] ([QualifierA],[QualifierB])
INCLUDE ([RowIdentifier])
I would use this regardless of which column is clustered. This will give you the performance, ensure the index will continue to perform regardless of the CLUSTERED INDEX, and make it explicit what the index is used for.
I'm wondering what's more appropriate? A non-clustered unique index incorporating all three fields, or a non-clustered non-unique index incorporating just the two fields(QualifierA & QualifierB) but including the PrimaryKey.
There's a third option. A non-clustered, non-unique index incorporating all three fields.
When you make an index, the fields in the index are duplicated to another place in memory so the server can go after those fields with ease. If you only have QualifierA and QualifierB in the index, it will find the rows in that index that meet your criteria and then go back to the main table to pick up the RowIdentifier. Instead, include all three in there to improve performance.
Remember, make sure you put QualifierA and QualifierB before RowIdentifier in your index. The order of the columns determines how the data is ordered.
Try it out with some test data if you like, and look at the query plan to see what it's doing.
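One way to look at the plan, assuming SQL Server, is SET SHOWPLAN_TEXT, which returns the estimated plan as text instead of running the query. The query below is the one from the question.

```sql
-- SET SHOWPLAN_TEXT must be the only statement in its batch,
-- hence the GO separators.
SET SHOWPLAN_TEXT ON;
GO
SELECT RowIdentifier
FROM [dbo].[Table]
WHERE QualifierA = 'exampleA' AND QualifierB = 'exampleB';
GO
SET SHOWPLAN_TEXT OFF;
GO
```

An index seek on IX_Table_QualifierA with no additional lookup operator would suggest the index covers the query.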

primary key with additional data

I've read somewhere that there is an option to store additional data on the leaves of the tree created by the primary key.
For example,
If I have a table with columns: row_id, customer_id
and I need to display customer_name, I can do a join between my table and the customers table. But I can also store the customer_name with the primary key of the customers table (with customer_id), and the SQL engine wouldn't have to load the entire customer row in order to find the customer name.
Can someone describe it better?
How can I implement that?
For SQL Server 2005+, it sounds like you're talking about included columns, but that only works when all of the columns are in one table.
CREATE INDEX IX_Customers_RowCust
ON Customers (customer_id)
INCLUDE (customer_name);
But, I think you're describing a situation where (row_id, customer_id) are in one table, and customer_name is in a second table. For that situation, you'd want to create an indexed view.
CREATE VIEW dbo.vwCust WITH SCHEMABINDING AS
SELECT t.row_id, t.customer_id, c.customer_name
FROM dbo.SomeTable t -- schema-bound views need two-part names; dbo is assumed here
INNER JOIN dbo.Customers c
ON t.customer_id = c.customer_id
GO
CREATE UNIQUE CLUSTERED INDEX vwCustRow ON dbo.vwCust (row_id)
GO
The MSDN article explains it very well
http://msdn.microsoft.com/en-us/library/ms190806.aspx
Basically, when it fetches the data from the index (based on your where clause), the index also brings back the columns that were included in it, instead of having to hit the table again to get the additional data.
It is important to note that the included columns do not make up the index key used for search purposes, but they do increase the size of the index, so it will take up more space.
Joe has got the syntax that you need to implement it.
FYI: Note that you cannot add included columns on a clustered index, since a clustered index isn't really an index in the first place - it's just the b-tree of the data. In some cases, you may be better off with a heap and several efficient covering indexes (which may have included columns).
So if your primary key is also the clustered index, no included columns...
Only a non-clustered index can be made covering by INCLUDEs. Here, the included columns are in the lowest level of the index. This avoids what is known as a key lookup (bookmark lookup in SQL Server 2000) into the clustered index.
A clustered index is covering automatically: the lowest leaf level of the index is the data. By default, a PK is clustered in SQL Server.
This applies to the same table.
To do this across tables you need an indexed view (and see Joe's answer)