So, I have this funny requirement of creating an index on a table only on a certain set of rows.
This is what my table looks like:
USER: userid, friendid, created, blah0, blah1, ..., blahN
Now, I'd like to create an index on:
(userid, friendid, created)
but only on those rows where userid = friendid. The reason being that this index is only going to be used to satisfy queries where the WHERE clause contains "userid = friendid". There will be many rows where this is NOT the case, and I really don't want to waste all that extra space on the index.
Another option would be to create a table (query table) which is populated on insert/update of this table and create a trigger to do so, but again I am guessing an index on that table would mean that the data would be stored twice.
How does MySQL store primary keys? I mean, is the table ordered on the primary key, or is it ordered by insert order with the PK acting like a normal unique index?
I checked up on clustered indexes (http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html), but it seems only InnoDB supports them. I am using MyISAM (I mention this because then I could have created a clustered index on these 3 fields in the query table).
I am basically looking for something like this:
ALTER TABLE USERS ADD INDEX (userid, friendid, created) WHERE userid=friendid
Regarding the conditional index:
You can't do this. MySQL has no partial (conditional) indexes: an index always covers every row of the table.
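For comparison, some other engines do accept a WHERE clause on CREATE INDEX. The following is only a sketch of what the asker's wish looks like in SQLite (via Python's built-in sqlite3 module, assuming SQLite 3.8.0 or later); it does not work in MySQL. The table layout follows the question, while the index name idx_self and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        userid   INTEGER NOT NULL,
        friendid INTEGER NOT NULL,
        created  TEXT    NOT NULL,
        blah0    TEXT
    );
    -- Partial index: only rows with userid = friendid get an index entry.
    -- (SQLite syntax; MySQL rejects the WHERE clause.)
    CREATE INDEX idx_self ON users (userid, friendid, created)
        WHERE userid = friendid;
""")

# Hypothetical sample data: two "self" rows, two ordinary rows.
sample_rows = [
    (1, 1, "2009-01-01", "x"),
    (1, 2, "2009-01-02", "x"),
    (2, 2, "2009-01-03", "x"),
    (2, 3, "2009-01-04", "x"),
]
conn.executemany("INSERT INTO users VALUES (?, ?, ?, ?)", sample_rows)


def self_rows(connection):
    """Return the rows where a user is listed as their own friend."""
    cur = connection.execute(
        "SELECT userid, friendid, created FROM users "
        "WHERE userid = friendid ORDER BY userid"
    )
    return cur.fetchall()


print(self_rows(conn))
```

The query repeats the index's condition (userid = friendid) verbatim, which is what allows the planner to consider the partial index at all.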
Regarding the primary key:
It depends on the storage engine. MySQL does not define how data is stored or retrieved, that's left up to the storage engine.
MyISAM does not enforce any order on how rows are stored; they're appended to the end of the table but gaps from deleting can be reused and UPDATE queries can leave things out of order even without any DELETEs.
InnoDB stores rows in order of their primary keys.
It's hard to tell what you're actually trying to do here (why would a user need to be his own friend?), but it seems to me that a simple rethinking of your database schema would resolve this problem.
table 1: USER: userid, created, blah0, blah1, ...,
table 2: userIsFriend (user1,user2,...)
and just do your indexing on table 2 (whose elements presumably have foreign key constraint on table 1)
By the way, you should probably be using InnoDB if you want to do anything semi-serious with MySQL anyway, IMHO.
I want to search a DB with either the PK or a unique non null field that is indexed. Are there any performance differences between those? I am using Postgres as my DB. But a general DB-independent answer would be good too.
In PostgreSQL, all indexes are secondary (unclustered) indexes. That means the index points to the heap, the data structure holding the actual column data. So, a primary key's index doesn't have any structural advantage over a UNIQUE index: SELECTs using the index for filtering must then bounce over to the heap for the data.
In fact, it might be the other way around, because PostgreSQL indexes (version 11 and later) can have INCLUDE clauses.
For example consider a table with uniqueid, a, b, and c columns. If your workload is heavy with SELECT b FROM tbl WHERE uniqueid = something queries, you can declare this covering index.
CREATE UNIQUE INDEX uniq ON tbl(uniqueid) INCLUDE (b);
Your whole query can then be satisfied from the index. That saves the extra trip to the heap, and so saves IO and CPU time.
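To see the "satisfied from the index" effect without a PostgreSQL server, here is a rough sketch using SQLite through Python's sqlite3 module. SQLite has no INCLUDE clause, so this sketch appends b to the index key instead; that is an emulation, not the PostgreSQL statement above, and it does not enforce uniqueness of uniqueid. The index name idx_cover and the generated rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl (uniqueid INTEGER NOT NULL, a TEXT, b TEXT, c TEXT);
    -- SQLite has no INCLUDE clause, so b is appended to the key instead.
    CREATE INDEX idx_cover ON tbl (uniqueid, b);
""")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?)",
    [(i, f"a{i}", f"b{i}", f"c{i}") for i in range(100)],
)

query = "SELECT b FROM tbl WHERE uniqueid = ?"

# The fourth column of each EXPLAIN QUERY PLAN row is the readable detail.
plan = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query, (42,))]
result = conn.execute(query, (42,)).fetchall()

print(plan)    # the detail text should mention a covering index
print(result)
```

If the query also asked for column a or c, the index would no longer cover it and the engine would have to visit the table row as well.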
MySQL (with InnoDB) and SQL Server (by default), on the other hand, use clustered indexes for their primary keys. That is, the table's data is stored in the primary key's index. So the PK is, in effect, an index created like this:
CREATE UNIQUE INDEX pk ON tbl(uniqueid) INCLUDE (a, b, c);
In those databases the PK's index does have an advantage over a separate UNIQUE index, which necessarily is a secondary or unclustered index. (Note: MySQL's indexes don't have INCLUDE() clauses.)
In Teradata, I created a table with a unique primary key made up of two varchar columns, A and B. I will write queries that need to filter on one or both of these columns.
For best performance, should I submit a create index statement for each of the two columns (the table would have 3 indexes: the unique primary key(column A, B), non-unique column A, and non-unique column B)?
On this table, I only care about read performance and not insert/update performance.
In Teradata, if you specify a PRIMARY KEY clause when you create the table, then the table will automatically be created with a UNIQUE PRIMARY INDEX (UPI) on those PK columns. Although Teradata supports keys, it is more of an index-based DBMS.
In your case, you will have very, very fast reads (i.e. UPI access - single AMP, single row) only when you specify all of the fields in your PK. This applies to equality access as mentioned in the previous comments (thanks Dieter).
If you access the table on some but not ALL of the PK / UPI columns, then your query won't use the UPI access path. You'd need to define separate indexes or other optimization strategies, depending on your queries.
If you only care about read performance, then it makes sense to create secondary indexes on the separate columns. Just run the EXPLAIN on your query to make sure the indexes are actually being used by the Optimizer.
Another option is to ditch the PK specification altogether, especially if you never access the table on that group of columns. If there is one column you access more than the other, specify that one as your PRIMARY INDEX (non-unique) and create a secondary index on the other one. Something like:
CREATE TABLE mytable (
A INTEGER,
B INTEGER,
C VARCHAR(10)
)
PRIMARY INDEX(A) -- Non-unique primary index
;
CREATE INDEX (B) ON mytable; -- Create secondary index
You only need two indexes.
If you have a primary key on (A, B), then this also works for (A). If you want to filter on B, then you want an index on (B).
You might want to make it (B, A) so the index can handle cases such as:
where B = ? and A in (?, ?, ?)
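The column-order argument in this answer is ordinary B-tree reasoning, and it can be tried out in any B-tree engine. Below is a small sketch in SQLite via Python's sqlite3 module; note that this is an assumption-laden stand-in, since Teradata's hashed primary index behaves differently (as the earlier answer says, it needs all of its columns). The index name idx_b_a and the sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (
        a INTEGER NOT NULL,
        b INTEGER NOT NULL,
        c TEXT,
        PRIMARY KEY (a, b)
    );
    -- Second index with b leading, so filters on b alone can use it.
    CREATE INDEX idx_b_a ON mytable (b, a);
""")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [(a, b, f"row-{a}-{b}") for a in range(1, 6) for b in range(1, 4)],
)


def plan_for(sql, params):
    """Return the 'detail' text of each EXPLAIN QUERY PLAN row."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]


by_b = "SELECT c FROM mytable WHERE b = ?"
by_b_and_a = "SELECT c FROM mytable WHERE b = ? AND a IN (?, ?, ?) ORDER BY a"

plan_by_b = plan_for(by_b, (2,))
rows_by_b_and_a = conn.execute(by_b_and_a, (2, 1, 3, 5)).fetchall()

print(plan_by_b)
print(rows_by_b_and_a)
```

Which index the planner picks for the combined filter can vary by engine and statistics, so it is still worth checking the plan on the real system.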
Let's take an RDBMS such as PostgreSQL or MySQL,
and create a table with a primary key and a primary index on it.
The primary index is intended to speed up SELECT operations with a clause such as WHERE primary_key_column = ...
It relies on the entries being sorted by primary_key_column.
What I want to clarify is: does the RDBMS keep the table's entries in sorted order?
If not, how can it perform a fast SELECT on unordered data?
The index is created on the primary key column and is usually structured as a B+ tree or a hash table. Its entries point to the table's rows, so the rows themselves do not need to be stored in sorted order for a lookup to be fast.
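As a toy illustration of that idea (not how any particular engine is implemented), the sketch below keeps rows in arbitrary order and builds a separate sorted list of (key, position) pairs that stands in for the leaf level of a B+ tree. All names and sample rows here are hypothetical.

```python
import bisect

# "Heap": rows stored in arbitrary (insertion) order.
heap = [(42, "carol"), (7, "alice"), (99, "dave"), (13, "bob")]

# "Index": sorted (key, heap position) pairs, standing in for B+ tree leaves.
index = sorted((key, pos) for pos, (key, _) in enumerate(heap))
keys = [key for key, _ in index]


def lookup(key):
    """Binary-search the sorted index, then follow the pointer into the heap."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return heap[index[i][1]]
    return None


print(lookup(13))  # (13, 'bob')
print(lookup(5))   # None
```

Only the index is sorted; the lookup costs a logarithmic search plus one jump to the row, regardless of where the row sits in the heap.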
I have a basic reverse lookup table in which the ids are already sorted in ascending numerical order:
id INT NOT NULL,
value INT NOT NULL
The ids are not unique; each id has from 5 to 25,000 associated values. Each id is independent, i.e., no relationships between the ids.
The table is static: read only, no inserts or updates ever. The table has 100-200 million records. The database itself will be around 7-12 GB. The engine is SQLite.
I will do frequent lookups in this table and want the fastest response time for each query. Lookups are one-direction only, unordered, and always of the form:
SELECT value WHERE id IN (x,y,z)
What advantages does the pre-sorted order give me in terms of database efficiency? What should I do differently than I would with typical unordered tables? How do I tell sql that it's an ordered list?
What about indices: is it necessary or even helpful to create an index on id?
[Updated for clustered comment thanks to Gordon Linoff]. As far as I can tell, sqlite doesn't support clustered indices directly. The wiki says: "Are [clustered indices] supported? No, but if you use INTEGER PRIMARY KEY it acts as a clustered index." In my situation, the column id is not unique...
Assuming that space is not an issue, you should create an index on (id, value). This should be sufficient for your purposes.
However, if the table is static, then I would recommend that you create a clustered index when you create the table. The index would have the same keys, (id, value).
If the table happens to be sorted, the database does not know about this, so you'd still need an index.
It is a better idea to use a WITHOUT ROWID table (what other DBs call a clustered index):
CREATE TABLE MyLittleLookupTable (
id INTEGER,
value INTEGER,
PRIMARY KEY (id, value)
) WITHOUT ROWID;
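A quick way to try this is through Python's sqlite3 module. The sketch below assumes SQLite 3.8.2 or later (where WITHOUT ROWID was introduced) and assumes that (id, value) pairs are unique, which the primary key requires; the helper values_for and the sample rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE MyLittleLookupTable (
        id    INTEGER,
        value INTEGER,
        PRIMARY KEY (id, value)
    ) WITHOUT ROWID
""")

# Hypothetical sample data; rows are stored in primary-key order on disk.
conn.executemany(
    "INSERT INTO MyLittleLookupTable VALUES (?, ?)",
    [(1, 10), (1, 11), (2, 20), (3, 30), (3, 31)],
)


def values_for(connection, ids):
    """Return all values for the given ids, sorted for a stable result."""
    placeholders = ", ".join("?" for _ in ids)
    cur = connection.execute(
        f"SELECT value FROM MyLittleLookupTable WHERE id IN ({placeholders})",
        list(ids),
    )
    return sorted(row[0] for row in cur)


print(values_for(conn, [1, 3]))  # [10, 11, 30, 31]
```

Because the table itself is the primary-key B-tree, there is no separate index to maintain and no second lookup from index to table.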
I've read somewhere that there is an option to store additional data on the leaves of the tree created by the primary key.
For example,
If I have a table with columns: row_id, customer_id
and I need to display customer_name, I can do a join between my table and the customers table. But I can also store customer_name with the primary key of the customers table (customer_id), so that the SQL engine wouldn't have to load the entire customer row in order to find the customer name.
Can someone describe it better?
How can I implement that?
For SQL Server 2005+, it sounds like you're talking about included columns, but that only works when all of the columns are in one table.
CREATE INDEX IX_Customers_RowCust
ON Customers (customer_id)
INCLUDE (customer_name);
But, I think you're describing a situation where (row_id, customer_id) are in one table, and customer_name is in a second table. For that situation, you'd want to create an indexed view.
-- Schema-bound views must use two-part (schema.object) names; the dbo schema is assumed here.
CREATE VIEW dbo.vwCust WITH SCHEMABINDING AS
SELECT t.row_id, t.customer_id, c.customer_name
FROM dbo.SomeTable t
INNER JOIN dbo.Customers c
ON t.customer_id = c.customer_id
GO
CREATE UNIQUE CLUSTERED INDEX vwCustRow ON dbo.vwCust (row_id)
GO
The MSDN article explains it very well
http://msdn.microsoft.com/en-us/library/ms190806.aspx
Basically, when the query fetches data through the index (based on your WHERE clause), the included columns come back from the index itself, so the engine does not have to hit the table again to get the additional data.
It is important to note that the included columns are not part of the index key used for searching, but they do increase the size of the index, so it will take up more storage and memory.
Joe has got the syntax that you need to implement it.
FYI: Note that you cannot add included columns on a clustered index, since a clustered index isn't really an index in the first place - it's just the b-tree of the data. In some cases, you may be better off with a heap and several efficient covering indexes (which may have included columns).
So if your primary key is also the clustered index, no included columns...
Only a non-clustered index can be made covering by INCLUDEs. Here, the included columns are in the lowest level of the index. This avoids what is known as a key lookup (a bookmark lookup in SQL Server 2000) into the clustered index.
A clustered index is covering automatically: the lowest leaf level of the index is the data. By default, a PK is clustered in SQL Server.
This applies to the same table.
To do this across tables, you need an indexed view (see Joe's answer).