MySQL Index Being Ignored - sql

EXPLAIN SELECT
*
FROM
content_link link
STRAIGHT_JOIN
content
ON
link.content_id = content.id
WHERE
link.content_id = 1
LIMIT 10;
+----+-------------+---------+-------+---------------+------------+---------+-------+------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+-------+---------------+------------+---------+-------+------+-------+
| 1 | SIMPLE | link | ref | content_id | content_id | 4 | const | 1 | |
| 1 | SIMPLE | content | const | PRIMARY | PRIMARY | 4 | const | 1 | |
+----+-------------+---------+-------+---------------+------------+---------+-------+------+-------+
However, when I remove the WHERE clause, the query stops using the key (even when I explicitly force it to):
EXPLAIN SELECT
*
FROM
content_link link FORCE KEY (content_id)
STRAIGHT_JOIN
content
ON
link.content_id = content.id
LIMIT 10;
+----+-------------+---------+--------+---------------+---------+---------+------------------------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+--------+---------------+---------+---------+------------------------+---------+-------------+
| 1 | SIMPLE | link | index | content_id | PRIMARY | 7 | NULL | 4555299 | Using index |
| 1 | SIMPLE | content | eq_ref | PRIMARY | PRIMARY | 4 | ft_dir.link.content_id | 1 | |
+----+-------------+---------+--------+---------------+---------+---------+------------------------+---------+-------------+
Are there any work-arounds to this?
I realize I'm selecting the entire table in the second example, but why does MySQL suddenly decide to ignore my FORCE anyway and not use the key? Without the key the query takes around 10 minutes.. ugh.

FORCE is a bit of a misnomer. Here's what the MySQL docs say (emphasis mine):
You can also use FORCE INDEX, which acts like USE INDEX (index_list) but with the addition that a table scan is assumed to be very expensive. In other words, a table scan is used only if there is no way to use one of the given indexes to find rows in the table.
Since you aren't actually "finding" any rows (you are selecting them all), a table scan is always going to be fastest, and the optimizer is smart enough to know that in spite of what you are telling it.
ETA:
Try adding an ORDER BY on the primary key once and I bet it'll use the index.
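For example, a sketch of that suggestion (the exact primary-key column of content_link isn't shown in the question, so ordering by the indexed content_id column is assumed here):
EXPLAIN SELECT *
FROM content_link link FORCE KEY (content_id)
STRAIGHT_JOIN content ON link.content_id = content.id
ORDER BY link.content_id
LIMIT 10;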

An index helps you search quickly inside a table, but it just slows things down if you select the entire table. So MySQL is correct in ignoring the index.
In your case, maybe the index has a hidden side effect that's not known to MySQL. For example, if the inner join holds only for a few rows, an index would speed things up. But MySQL can't know that without an explicit hint.
There is an exception: when every column you select is contained in the index (a covering index), the index is still useful even if you select every row. For example, if you have an index on LastName, the following query still benefits from the index:
select LastName from Orders
But this one won't:
select * from Orders
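A quick way to verify the difference is to compare the plans (the Orders table is the one from the example; the index name here is made up):
CREATE INDEX idx_lastname ON Orders (LastName);
EXPLAIN SELECT LastName FROM Orders; -- Extra should show "Using index" (covering index)
EXPLAIN SELECT * FROM Orders;        -- full table scan; the index is ignored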

Your content_id column seems to accept NULL values.
The MySQL optimizer thinks there is no guarantee that the query can return all rows by using only the index (though in fact there is such a guarantee, since you use the column in a JOIN).
That's why it reverts to a full table scan.
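You can confirm whether the column really is nullable with a standard MySQL command:
SHOW COLUMNS FROM content_link LIKE 'content_id';
The Null column of the output shows YES or NO.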
Either add a NOT NULL condition:
SELECT *
FROM content_link link FORCE KEY (content_id)
STRAIGHT_JOIN
content
ON content.id = link.content_id
WHERE link.content_id IS NOT NULL
LIMIT 10;
or mark your column as NOT NULL (note that MODIFY requires repeating the column's full definition; INT is assumed here):
ALTER TABLE content_link MODIFY content_id INT NOT NULL;
Update:
This is verified bug 45314 in MySQL.

Related

Optimize GROUP BY after ranged index query

I have a content application that needs to count responses in a time slice, then order them by number of responses. It currently works great with a small data set, but needs to scale to millions of rows. My current query won't work.
mysql> describe Responses;
+---------------+---------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------------+---------------------+------+-----+---------+-------+
| site_id | int(10) unsigned | NO | MUL | NULL | |
| content_id | bigint(20) unsigned | NO | PRI | NULL | |
| response_id | bigint(20) unsigned | NO | PRI | NULL | |
| date | int(10) unsigned | NO | | NULL | |
+---------------+---------------------+------+-----+---------+-------+
The table type is InnoDB, the primary key is on (content_id, response_id). There is an additional index on (content_id, date) used to find responses to a piece of content, and another additional index on (site_id, date) used in the query I am having trouble with:
mysql> explain SELECT content_id id, COUNT(response_id) num_responses
FROM Responses
WHERE site_id = 1
AND date > 1234567890
AND date < 1293579867
GROUP BY content_id
ORDER BY num_responses DESC
LIMIT 0, 10;
+----+-------------+-----------+-------+---------------+------+---------+------+------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+---------------+------+---------+------+------+-----------------------------------------------------------+
| 1 | SIMPLE | Responses | range | date | date | 8 | NULL | 102 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+-----------+-------+---------------+------+---------+------+------+-----------------------------------------------------------+
That's the best I've been able to come up with, but it will end up counting millions of rows and then sorting tens of thousands, just to pull in a handful of results.
I can't think of a way to precalculate the count either, as the date range is arbitrary. I have some liberty with changing the primary key: it can be composed of content_id, response_id, and site_id in any order, but cannot contain date.
The application is developed mostly in PHP, so if there is a quicker way to accomplish the same results by splitting the query into subqueries, using temporary tables, or doing things on the application side, I'm open to suggestions.
(Reposted from comments by request)
Set up a table that has three columns: id, date, and num_responses. The column num_responses consists of the number of responses for the given id on the given date. Backfill the table appropriately, and then at around midnight (or later) each night, run a script that adds a new row for the previous day.
Then, to get the rows you want, you can merely query the table mentioned above.
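A rough sketch of that approach (the table name, column types, and date boundaries below are illustrative assumptions, not from the question; site_id would also need to be carried along to keep the WHERE site_id = 1 filter working):
CREATE TABLE ResponseCounts (
    content_id    BIGINT UNSIGNED NOT NULL,
    date          DATE            NOT NULL,
    num_responses INT UNSIGNED    NOT NULL,
    PRIMARY KEY (content_id, date)
);

-- Nightly job: roll up the previous day's responses (Responses.date is a unix timestamp).
INSERT INTO ResponseCounts (content_id, date, num_responses)
SELECT content_id, CURDATE() - INTERVAL 1 DAY, COUNT(*)
FROM Responses
WHERE date >= UNIX_TIMESTAMP(CURDATE() - INTERVAL 1 DAY)
  AND date <  UNIX_TIMESTAMP(CURDATE())
GROUP BY content_id;

-- The original query then sums pre-aggregated daily counts instead of raw rows.
SELECT content_id AS id, SUM(num_responses) AS num_responses
FROM ResponseCounts
WHERE date >= '2009-02-13' AND date < '2010-12-28'
GROUP BY content_id
ORDER BY num_responses DESC
LIMIT 0, 10;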
Rather than calculating the count each time, how about caching the count as of the last query, and then updating the cache by adding only the increment, using a date condition in the WHERE clause?
Have you considered partitioning the table by date? Are there any indices on the table?

group by date of timestamp

How can you group by the date portion of a timestamp column in MySQL and still take advantage of an index on that column?
Of course you can use solutions like
select date(thetime) from mytable group by date(thetime)
However, this query will not be able to use an index on thetime; it will instead require a temporary table, because you are transforming the column with a function before grouping by it.
+----+-------------+-------------+-------+---------------+-------------+---------+------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+-------+---------------+-------------+---------+------+-------+----------------------------------------------+
| 1 | SIMPLE | mytable | index | NULL | thetime | 4 | NULL | 48183 | Using index; Using temporary; Using filesort |
+----+-------------+-------------+-------+---------------+-------------+---------+------+-------+----------------------------------------------+
Theoretically there's no reason why it shouldn't be able to use a range scan on an index on that column and avoid a temporary table. Is there any syntax that can persuade the query optimizer to execute the query that way?
From what I can understand, the database already does what you want: the index is being scanned (rather than the table itself), as is evident in the EXPLAIN output.
But you cannot get around the fact that an intermediate table is needed to hold the dates during the grouping (distinct) operation.

Strange use of the index in MySQL

Explain
SELECT `feed_objects`.*
FROM `feed_objects`
WHERE (`feed_objects`.feed_id IN
( 165,160,159,158,157,153,152,151,150,149,148,147,129,128,127,126,125,124,122,
121, 120,119,118,117,116,115,114,113,111,110)) ;
+----+-------------+--------------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | feed_objects | ALL | by_feed_id | NULL | NULL | NULL | 188 | Using where |
+----+-------------+--------------+------+---------------+------+---------+------+------+-------------+
The index by_feed_id is not used.
But when I specify fewer values in the WHERE clause, everything works as expected:
Explain
SELECT `feed_objects`.*
FROM `feed_objects`
WHERE (`feed_objects`.feed_id IN
(165,160,159,158,157,153,152,151,150,149,148,147,129,128,127,125,124)) ;
+----+-------------+--------------+-------+---------------+------------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+-------+---------------+------------+---------+------+------+-------------+
| 1 | SIMPLE | feed_objects | range | by_feed_id | by_feed_id | 9 | NULL | 18 | Using where |
+----+-------------+--------------+-------+---------------+------------+---------+------+------+-------------+
The index by_feed_id is used.
What is the problem?
The MySQL optimizer makes a lot of decisions that sometimes look strange. In this particular case, I believe that you have a very small table (188 rows total from the looks of the first EXPLAIN), and that is affecting the optimizer's decision.
The "How MySQL Uses Indexes" manual pages offers this snippet of info:
Sometimes MySQL does not use an index,
even if one is available. One
circumstance under which this occurs
is when the optimizer estimates that
using the index would require MySQL to
access a very large percentage of the
rows in the table. (In this case, a
table scan is likely to be much faster
because it requires fewer seeks.)
Because the number of ids in your first WHERE clause is relatively large compared to the table size, MySQL has determined that it is faster to scan the data than to consult the index, as the data scan is likely to result in less disk access time.
You could test this by adding rows to the table and re-running the first EXPLAIN query. At a certain point, MySQL will start using the index for it as well as the second query.
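As a quick side check (not part of the original answer), you could also force the index on the larger IN list and compare the row estimates of the two plans; only names from the question are used here:
EXPLAIN
SELECT `feed_objects`.*
FROM `feed_objects` FORCE INDEX (by_feed_id)
WHERE (`feed_objects`.feed_id IN
( 165,160,159,158,157,153,152,151,150,149,148,147,129,128,127,126,125,124,122,
121, 120,119,118,117,116,115,114,113,111,110)) ;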

MySQL not using indexes

I just enabled the slow query log (plus logging of queries not using indexes) and I'm getting hundreds of entries for the same kind of query (only the user changes):
SELECT id
, name
FROM `all`
WHERE id NOT IN(SELECT id
FROM `picks`
WHERE user=999)
ORDER BY name ASC;
EXPLAIN gives:
+----+--------------------+-------------------+-------+------------------+--------+---------+------------+------+------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------------------+-------+------------------+--------+---------+------------+------+------------------------------------------+
| 1 | PRIMARY | all | index | NULL | name | 156 | NULL | 209 | Using where; Using index; Using filesort |
| 2 | DEPENDENT SUBQUERY | picks | ref | user,user_2,pick | user_2 | 8 | const,func | 1 | Using where; Using index |
+----+--------------------+-------------------+-------+------------------+--------+---------+------------+------+------------------------------------------+
Any idea about how to optimize this query? I've tried a bunch of different indexes on different fields, but nothing has helped.
I don't necessarily agree that 'not in' and 'exists' are ALWAYS bad performance choices, however, it could be in this situation.
You might be able to get your results using a much simpler query:
SELECT `all`.id
, `all`.name
FROM `all`
, `picks`
WHERE `all`.id = picks.id
AND picks.user <> 999
ORDER BY name ASC;
"not in" and "exists" always bad choices for performance. May be left join with cheking "NULL" will be better try it.
This is probably the best way to write the query. Select everything from `all` and try to find matching rows from picks that share the same id and have user = 999. If such a row doesn't exist, picks.id will be NULL, because it's a left outer join. Then you can filter the results to return only those rows.
SELECT `all`.id, `all`.name
FROM
`all`
LEFT JOIN picks ON picks.id = `all`.id AND picks.user = 999
WHERE picks.id IS NULL
ORDER BY `all`.name ASC

A subquery that should be independent is not. Why?

I have a table files with files and a table reades with read accesses to these files. In the table reades there is a column file_id which refers to the corresponding column in files.
Now I would like to list all files which have not been accessed and tried this:
SELECT * FROM files WHERE file_id NOT IN (SELECT file_id FROM reades)
This is terribly slow. The reason is that MySQL thinks that the subquery is dependent on the outer query:
+----+--------------------+--------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+--------+------+---------------+------+---------+------+------+----------+-------------+
| 1 | PRIMARY | files | ALL | NULL | NULL | NULL | NULL | 1053 | 100.00 | Using where |
| 2 | DEPENDENT SUBQUERY | reades | ALL | NULL | NULL | NULL | NULL | 3242 | 100.00 | Using where |
+----+--------------------+--------+------+---------------+------+---------+------+------+----------+-------------+
But why? The subquery is completely independent and more or less just meant to return a list of ids.
(To be precise: Each file_id can appear multiple times in reades, of course, as there can be arbitrarily many read operations for each file.)
Try replacing the subquery with a join:
SELECT *
FROM files f
LEFT OUTER JOIN reades r on r.file_id = f.file_id
WHERE r.file_id IS NULL
Here's a link to an article about this problem. The writer of that article wrote a stored procedure to force MySQL to evaluate subqueries as independent. I doubt that's necessary in this case though.
I've seen this before. It's a bug in MySQL. Try this (note that MySQL requires an alias on the derived table):
SELECT * FROM files WHERE file_id NOT IN (SELECT * FROM (SELECT file_id FROM reades) AS t)
The bug report is here: http://bugs.mysql.com/bug.php?id=25926
Try:
SELECT * FROM files WHERE file_id NOT IN (SELECT reades.file_id FROM reades)
That is: if it's coming up as dependent, perhaps that's because of ambiguity in what file_id refers to, so let's try fully qualifying it.
If that doesn't work, just do:
SELECT files.*
FROM files
LEFT JOIN reades
USING (file_id)
WHERE reades.file_id IS NULL
Does MySQL support EXISTS in the same way that MSSQL would?
If so, you could rewrite the query as
SELECT * FROM files AS f WHERE NOT EXISTS (SELECT 1 FROM reades r WHERE r.file_id = f.file_id)
Using IN is horribly inefficient as it runs that subquery for each row in the parent query.
Looking at this page I found two possible solutions which both work. Just for completeness I add one of those, similar to the answers with JOINs shown above, but it is fast even without using foreign keys:
SELECT * FROM files AS f
INNER JOIN (SELECT DISTINCT file_id FROM reades) AS r
ON f.file_id = r.file_id
This solves the problem, but still this does not answer my question :)
EDIT: If I interpret the EXPLAIN output correctly, this is fast because the optimizer generates a temporary index:
+----+-------------+------------+--------+---------------+---------+---------+-----------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+---------------+---------+---------+-----------+------+--------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 843 | |
| 1 | PRIMARY | f | eq_ref | PRIMARY | PRIMARY | 4 | r.file_id | 1 | |
| 2 | DERIVED | reades | range | NULL | file_id | 5 | NULL | 811 | Using index for group-by |
+----+-------------+------------+--------+---------------+---------+---------+-----------+------+--------------------------+
In MySQL 5.5 and earlier, IN subqueries are converted to EXISTS subqueries. The given query will be converted to the following query:
SELECT * FROM files
WHERE NOT EXISTS (SELECT 1 FROM reades WHERE reades.file_id = files.file_id)
As you see, the subquery is actually dependent.
MySQL 5.6 may choose to materialize the subquery. That is, first, run the inner query and store the result in a temporary table (removing duplicates). Then, it can use a join-like operation between the outer table (i.e., files) and the temporary table to find the rows with no match. This way of executing the query will probably be more optimal if reades.file_id is not indexed.
However, if reades.file_id is indexed, the traditional IN-to-EXISTS execution strategy is actually pretty efficient. In that case, I would not expect any significant performance improvement from converting the query into a join as suggested in other answers. MySQL 5.6 optimizer makes a cost-based choice between materialization and IN-to-EXISTS execution.
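On 5.6 the chosen strategy is visible in the plan itself: a materialized subquery shows up with select_type MATERIALIZED rather than DEPENDENT SUBQUERY when you re-run the EXPLAIN from the question:
EXPLAIN SELECT * FROM files WHERE file_id NOT IN (SELECT file_id FROM reades);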