primary key with additional data - sql

I've read somewhere that there is an option to store additional data on the leaves of the tree created by the primary key.
For example,
If I have a table with columns: row_id, customer_id
and I need to display customer_name, I can do a join between my table and the customers table. But I can also store the customer_name with the primary key of the customers table (with customer_id), so the SQL engine wouldn't have to load the entire customer row in order to find the customer name.
Can someone describe it better?
How can I implement that?

For SQL Server 2005+, it sounds like you're talking about included columns, but that only works when all of the columns are in one table.
CREATE INDEX IX_Customers_RowCust
ON Customers (customer_id)
INCLUDE (customer_name);
But, I think you're describing a situation where (row_id, customer_id) are in one table, and customer_name is in a second table. For that situation, you'd want to create an indexed view.
-- SCHEMABINDING requires two-part (schema.object) names; the dbo schema is assumed here
CREATE VIEW dbo.vwCust WITH SCHEMABINDING AS
SELECT t.row_id, t.customer_id, c.customer_name
FROM dbo.SomeTable t
INNER JOIN dbo.Customers c
ON t.customer_id = c.customer_id
GO
CREATE UNIQUE CLUSTERED INDEX vwCustRow ON dbo.vwCust (row_id)
GO

The MSDN article explains it very well
http://msdn.microsoft.com/en-us/library/ms190806.aspx
Basically, when the engine fetches rows from the index (based on your WHERE clause), the included columns come back with them, so it does not have to hit the table again to get the additional data.
It is important to note that the included columns are not part of the index key used for searching, but they do increase the size of the index, so it takes up more storage and memory.
Joe has got the syntax that you need to implement it.

FYI: Note that you cannot add included columns on a clustered index, since a clustered index isn't really an index in the first place - it's just the b-tree of the data. In some cases, you may be better off with a heap and several efficient covering indexes (which may have included columns).
So if your primary key is also the clustered index, no included columns...

Only a non-clustered index can be made covering by INCLUDEs. Here, the included columns are stored in the lowest (leaf) level of the index. This avoids what is known as a key lookup (called a bookmark lookup in SQL Server 2000) into the clustered index.
A clustered index is covering automatically: the lowest leaf level of the index is the data. By default, a PK is clustered in SQL Server.
This applies to the same table.
To do this across tables you need an indexed view (and see Joe's answer)
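To see the covering effect without a SQL Server instance, here is a rough sketch using Python's built-in sqlite3 module. SQLite has no INCLUDE clause, so the extra column is simply appended to the index key, which is only an approximation of an included column. The index names and the sample data are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (
        customer_id   INTEGER NOT NULL,
        customer_name TEXT    NOT NULL,
        address       TEXT
    );
    -- Key-only index: it locates the row, but the name still lives in the table.
    CREATE INDEX IX_Customers_Id ON Customers (customer_id);
""")
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?, ?)",
    [(i, f"name{i}", f"street {i}") for i in range(1000)],
)

def plan(sql):
    """Return the detail column of EXPLAIN QUERY PLAN as one string."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT customer_name FROM Customers WHERE customer_id = 42"
plan_narrow = plan(query)

# Swap in a wider index that also carries customer_name
# (SQLite's closest equivalent to INCLUDE (customer_name)).
conn.execute("DROP INDEX IX_Customers_Id")
conn.execute(
    "CREATE INDEX IX_Customers_IdName ON Customers (customer_id, customer_name)"
)
plan_wide = plan(query)
name = conn.execute(query).fetchone()[0]

print(plan_narrow)  # index search followed by a table lookup
print(plan_wide)    # reported as a COVERING INDEX search
```

With the narrow index the engine still has to visit the table row for the name; with the wider one the plan is reported as a covering index search, which is the saving the answers above describe.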

Related

Are there performance differences in queries with UNIQUE NON NULL indexes and Primary keys?

I want to search a DB with either the PK or a unique non null field that is indexed. Are there any performance differences between those? I am using Postgres as my DB. But a general DB-independent answer would be good too.
In PostgreSQL, all indexes are secondary (unclustered) indexes. That means the index points to the heap, the data structure holding the actual column data. So a primary key's index doesn't have any structural advantage over a UNIQUE index: SELECTs using the index for filtering must then bounce over to the heap for the data.
In fact, it might be the other way around, because PostgreSQL indexes (version 11 and later) can have INCLUDE clauses.
For example consider a table with uniqueid, a, b, and c columns. If your workload is heavy with SELECT b FROM tbl WHERE uniqueid = something queries, you can declare this covering index.
CREATE UNIQUE INDEX uniq ON tbl(uniqueid) INCLUDE (b);
Your whole query can then be satisfied from the index. That saves the extra trip to the heap, and so saves IO and CPU time.
MySQL (with the InnoDB engine) and SQL Server (by default), on the other hand, use clustered indexes for their primary keys. That is, the table's data is stored in the primary key's index. So the PK is, automatically, basically an index created like this.
CREATE UNIQUE INDEX pk ON tbl(uniqueid) INCLUDE (a, b, c);
In those databases the PK's index does have an advantage over a separate UNIQUE index, which necessarily is a secondary or unclustered index. (Note: MySQL's indexes don't have INCLUDE() clauses.)

What is the most efficient strategy for lookups on a large, static table which is already in sorted order (sqlite)?

I have a basic reverse lookup table in which the ids are already sorted in ascending numerical order:
id INT NOT NULL,
value INT NOT NULL
The ids are not unique; each id has from 5 to 25,000 associated values. Each id is independent, i.e., no relationships between the ids.
The table is static. Read only, no inserts or updates ever. The table has 100-200 million records. The database itself will be around 7-12gb. Sqlite.
I will do frequent lookups in this table and want the fastest response time for each query. Lookups are one-direction only, unordered, and always of the form:
SELECT value WHERE id IN (x,y,z)
What advantages does the pre-sorted order give me in terms of database efficiency? What should I do differently than I would with typical unordered tables? How do I tell sql that it's an ordered list?
What about indices: is it necessary or even helpful to create an index on id?
[Updated for clustered comment thanks to Gordon Linoff]. As far as I can tell, sqlite doesn't support clustered indices directly. The wiki says: "Are [clustered indices] supported? No, but if you use INTEGER PRIMARY KEY it acts as a clustered index." In my situation, the column id is not unique...
Assuming that space is not an issue, you should create an index on (id, value). This should be sufficient for your purposes.
However, if the table is static, then I would recommend that you create a clustered index when you create the table. The index would have the same keys, (id, value).
If the table happens to be sorted, the database does not know about this, so you'd still need an index.
It is a better idea to use a WITHOUT ROWID table (what other DBs call a clustered index):
CREATE TABLE MyLittleLookupTable (
id INTEGER,
value INTEGER,
PRIMARY KEY (id, value)
) WITHOUT ROWID;
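A quick way to check how such a table behaves is to build a small copy and look at the query plan. The sketch below uses Python's built-in sqlite3 module with made-up sample data, and it assumes every (id, value) pair is unique, which the composite primary key requires.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE MyLittleLookupTable (
        id    INTEGER NOT NULL,
        value INTEGER NOT NULL,
        PRIMARY KEY (id, value)
    ) WITHOUT ROWID
""")

# Hypothetical data: 50 ids with 5 distinct values each.
rows = [(i, i * 100 + k) for i in range(1, 51) for k in range(5)]
conn.executemany("INSERT INTO MyLittleLookupTable VALUES (?, ?)", rows)

query = "SELECT value FROM MyLittleLookupTable WHERE id IN (3, 7, 11)"

# The detail column (index 3) of EXPLAIN QUERY PLAN describes the access path.
plan = [r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + query)]
values = sorted(v for (v,) in conn.execute(query))

print(plan)
print(values)
```

The plan should show a search on the primary key, meaning the lookup walks the table's own B-tree and needs no separate index on id.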

SQL: Add Primary Key to Non-Unique Index

Let's say a query is filtering on two fields and returning primary key values.
SELECT RowIdentifier
FROM Table
WHERE QualifierA = 'exampleA' AND QualifierB = 'exampleB'
Assuming the clustered index is not the PrimaryKey, would a non-unique index that contains QualifierA and QualifierB be best served by adding the RowIdentifier as a key column (Scenario A and Scenario B)? Or would it be more appropriate to simply include it (Scenario C)?
Scenario A: Non-Unique, Non-Clustered
CREATE NONCLUSTERED INDEX IX_Table_QualifierA
ON [dbo].[Table] ([QualifierA],[QualifierB],[RowIdentifier])
Scenario B: Unique, Non-Clustered
CREATE UNIQUE NONCLUSTERED INDEX IX_Table_QualifierA
ON [dbo].[Table] ([QualifierA],[QualifierB],[RowIdentifier])
Scenario C:
CREATE NONCLUSTERED INDEX IX_Table_QualifierA
ON [dbo].[Table] ([QualifierA],[QualifierB])
INCLUDE ([RowIdentifier])
Finally I'm assuming that if the PrimaryKey were the clustered index that neither is necessary, is this accurate?
If there is a CLUSTERED index, its key columns are automatically carried in every nonclustered index on the table. You can explicitly include them, but it is not required.
The UNIQUE index simply enforces uniqueness. The PK should already have this constraint. You do not need to re-enforce it in every index.
If you are including the PK in your where clause, it will almost certainly use the PK index to find that row because it is guaranteed to return the fewest results, so including in your index gains you nothing for lookups. It could also potentially skew the cardinality engine and make SQL think the index is more distinct than it really is.
For the above reasons, I would select Option C
CREATE NONCLUSTERED INDEX IX_Table_QualifierA
ON [dbo].[Table] ([QualifierA],[QualifierB])
INCLUDE ([RowIdentifier])
I would use this regardless of which column is clustered. This will give you the performance, ensure the index continues to perform regardless of the CLUSTERED INDEX, and make it explicit what the index is used for.
I'm wondering what's more appropriate? A non-clustered unique index incorporating all three fields, or a non-clustered non-unique index incorporating just the two fields(QualifierA & QualifierB) but including the PrimaryKey.
There's a third option. A non-clustered, non-unique index incorporating all three fields.
When you make an index, the fields in the index are duplicated into a separate structure so the server can go after those fields with ease. If you only have QualifierA and QualifierB in the index, it will find the entries in that index that meet your criteria and then go back to the main table to pick up the RowIdentifier. Instead, include all three in there to improve performance.
Remember, make sure you put QualifierA and QualifierB before RowIdentifier in your index. The order of the columns determines how the data is ordered.
Try it out with some test data if you like, and look at the query plan to see what it's doing.
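As one way of trying it out, here is a small sketch in Python's built-in sqlite3 module rather than SQL Server, so treat it as an analogy only. In SQLite an INTEGER PRIMARY KEY column is the table's B-tree key and is carried in every secondary index, which loosely mirrors the point that the clustered key rides along in nonclustered indexes. The table name T, the Payload column, and the data are invented for the test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T (
        RowIdentifier INTEGER PRIMARY KEY,  -- alias for SQLite's rowid
        QualifierA    TEXT NOT NULL,
        QualifierB    TEXT NOT NULL,
        Payload       TEXT
    );
    -- Only the two qualifiers are named in the index.
    CREATE INDEX IX_T_Qualifiers ON T (QualifierA, QualifierB);
""")
conn.executemany(
    "INSERT INTO T (QualifierA, QualifierB, Payload) VALUES (?, ?, ?)",
    [(f"a{i % 10}", f"b{i % 7}", "x" * 20) for i in range(500)],
)

def plan(sql):
    """Return the detail column of EXPLAIN QUERY PLAN as one string."""
    return " | ".join(r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

where = " WHERE QualifierA = 'a3' AND QualifierB = 'b3'"

# Needs only the row key, which the index already holds.
plan_key_only = plan("SELECT RowIdentifier FROM T" + where)
# Needs a column outside the index, so the table row must be visited.
plan_payload = plan("SELECT Payload FROM T" + where)

ids = sorted(r[0] for r in conn.execute("SELECT RowIdentifier FROM T" + where))

print(plan_key_only)
print(plan_payload)
```

The first plan is reported as a covering index search even though RowIdentifier was never listed in the index, while the second has to go back to the table. On SQL Server the actual execution plan is the thing to check.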

Create an index only on certain rows in mysql

So, I have this funny requirement of creating an index on a table only on a certain set of rows.
This is what my table looks like:
USER: userid, friendid, created, blah0, blah1, ..., blahN
Now, I'd like to create an index on:
(userid, friendid, created)
but only on those rows where userid = friendid. The reason being that this index is only going to be used to satisfy queries where the WHERE clause contains "userid = friendid". There will be many rows where this is NOT the case, and I really don't want to waste all that extra space on the index.
Another option would be to create a table (query table) which is populated on insert/update of this table and create a trigger to do so, but again I am guessing an index on that table would mean that the data would be stored twice.
How does mysql store Primary Keys? I mean is the table ordered on the Primary Key or is it ordered by insert order and the PK is like a normal unique index?
I checked up on clustered indexes (http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html), but it seems only InnoDB supports them. I am using MyISAM (I mention this because then I could have created a clustered index on these 3 fields in the query table).
I am basically looking for something like this:
ALTER TABLE USERS ADD INDEX (userid, friendid, created) WHERE userid=friendid
Regarding the conditional index:
You can't do this. MySQL has no such thing.
Regarding the primary key:
It depends on the storage engine. MySQL does not define how data is stored or retrieved, that's left up to the storage engine.
MyISAM does not enforce any order on how rows are stored; they're appended to the end of the table but gaps from deleting can be reused and UPDATE queries can leave things out of order even without any DELETEs.
InnoDB stores rows in order of their primary keys.
It's hard to tell what you're actually trying to do here (why would a user need to be his own friend?), but it seems to me a simple rethinking of your database schema would resolve this problem.
table 1: USER: userid, created, blah0, blah1, ...,
table 2: userIsFriend (user1,user2,...)
and just do your indexing on table 2 (whose elements presumably have foreign key constraint on table 1)
BTW, you should probably be using InnoDB if you want to do anything semi-serious with MySQL anyway, IMHO.
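For comparison only: some other engines do accept an index with a WHERE clause (usually called a partial or filtered index). The sketch below uses SQLite through Python's built-in sqlite3 module, not MySQL, and the index name and sample rows are invented. It shows the kind of index the question is asking for and does not change the answer for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        userid   INTEGER NOT NULL,
        friendid INTEGER NOT NULL,
        created  TEXT    NOT NULL,
        blah0    TEXT
    );
    -- Partial index: only rows with userid = friendid get an index entry.
    CREATE INDEX ix_users_self ON users (userid, friendid, created)
        WHERE userid = friendid;
""")

# Hypothetical data: 400 rows, of which 20 have userid = friendid.
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [(u, f, f"2010-01-{(u + f) % 28 + 1:02d}", "filler")
     for u in range(1, 21) for f in range(1, 21)],
)

query = "SELECT userid, created FROM users WHERE userid = friendid ORDER BY userid"
self_rows = conn.execute(query).fetchall()

# Printed rather than assumed: whether the planner picks the partial index
# depends on the SQLite version and the shape of the query.
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row[3])
print(len(self_rows), self_rows[0])
```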

What is the optimal indexing strategy for a relation table?

A relation table is the common solution to representing a many-to-many (m:n) relationship.
In the simplest form, it combines foreign keys referencing the two relating tables to a new composite primary key:
A       AtoB    B
----    ----    ----
*id     *Aid    *id
data    *Bid    data
How should it be indexed to provide optimal performance in every JOIN situation?
1. clustered index over (Aid ASC, Bid ASC) (this is mandatory anyway, I guess)
2. option #1 plus an additional index over (Bid ASC, Aid ASC)
3. option #1 plus an additional index over (Bid ASC)
4. any other options? Vendor-specific stuff, maybe?
I made some tests, and here is the update:
To cover all possible cases, you'll need to have:
CLUSTERED INDEX (a, b)
INDEX (b)
This will cover all JOIN situations AND ORDER BY.
Note that an index on B is actually sorted on (B, A) since it references clustered rows.
As long as your a and b tables have PRIMARY KEYs on their ids, you don't need to create additional indexes to handle ORDER BY ASC, DESC.
See the entry in my blog for more details:
Indexing a link table
I guess solution 2 is optimal. I'd choose the order of the clustered index by looking at the values and expecting which one has more distinct rows. That one goes first. Also it's important to have unique or primary key indexes on parent tables.
Depending on the DBMS, number 3 might work as well as number 2. The engine might or might not be smart enough to use the values stored in the nonclustered index (the key of the clustered index) for anything other than referring to the actual row. If it can use them, then number 3 would be better.
I have done some quick and dirty tests by examining the execution plans in SQL Server 2005.
The plans showed that SQL uses the clustered index on Aid,Bid for most queries. Adding an index on Bid (ASC) shows that it's used for queries of type
select * from A
inner join AtoB on Aid = A.id
inner join B on Bid = B.id
where Bid = 1
So I'm voting for solution #3.
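A similar quick check can be run on SQLite through Python's built-in sqlite3 module. This is only an analogy to the SQL Server tests above: a WITHOUT ROWID table stands in for the clustered index on (Aid, Bid), and the index name and data are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER PRIMARY KEY, data TEXT);
    CREATE TABLE B (id INTEGER PRIMARY KEY, data TEXT);
    -- WITHOUT ROWID stores the rows in primary-key order,
    -- playing the role of the clustered index over (Aid, Bid).
    CREATE TABLE AtoB (
        Aid INTEGER NOT NULL REFERENCES A(id),
        Bid INTEGER NOT NULL REFERENCES B(id),
        PRIMARY KEY (Aid, Bid)
    ) WITHOUT ROWID;
    -- Option 3: an additional index over Bid only.
    CREATE INDEX ix_AtoB_Bid ON AtoB (Bid);
""")
conn.executemany("INSERT INTO A VALUES (?, ?)", [(i, f"a{i}") for i in range(1, 31)])
conn.executemany("INSERT INTO B VALUES (?, ?)", [(i, f"b{i}") for i in range(1, 31)])
conn.executemany(
    "INSERT INTO AtoB VALUES (?, ?)",
    [(a, b) for a in range(1, 31) for b in range(1, 31) if (a + b) % 3 == 0],
)

def plan(sql):
    """Return the detail column of EXPLAIN QUERY PLAN as one string."""
    return " | ".join(r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

plan_by_a = plan("SELECT Bid FROM AtoB WHERE Aid = 1")
plan_by_b = plan("SELECT Aid FROM AtoB WHERE Bid = 1")
aids = sorted(r[0] for r in conn.execute("SELECT Aid FROM AtoB WHERE Bid = 1"))

print(plan_by_a)  # lookup through the primary key
print(plan_by_b)  # lookup through ix_AtoB_Bid
```

Here the lookup by Aid goes through the primary key and the lookup by Bid goes through the single-column index, which fits the observation that an index on Bid alone can be enough when it implicitly carries the clustered key.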