A subquery that should be independent is not. Why? - sql

I have a table files with files and a table reades with read accesses to these files. The table reades has a column file_id which refers to the respective column in files.
Now I would like to list all files which have not been accessed and tried this:
SELECT * FROM files WHERE file_id NOT IN (SELECT file_id FROM reades)
This is terribly slow. The reason is that MySQL thinks that the subquery is dependent on the outer query:
+----+--------------------+--------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+--------+------+---------------+------+---------+------+------+----------+-------------+
| 1 | PRIMARY | files | ALL | NULL | NULL | NULL | NULL | 1053 | 100.00 | Using where |
| 2 | DEPENDENT SUBQUERY | reades | ALL | NULL | NULL | NULL | NULL | 3242 | 100.00 | Using where |
+----+--------------------+--------+------+---------------+------+---------+------+------+----------+-------------+
But why? The subquery is completely independent and more or less just meant to return a list of ids.
(To be precise: Each file_id can appear multiple times in reades, of course, as there can be arbitrarily many read operations for each file.)

Try replacing the subquery with a join:
SELECT *
FROM files f
LEFT OUTER JOIN reades r on r.file_id = f.file_id
WHERE r.file_id IS NULL
Here's a link to an article about this problem. The writer of that article wrote a stored procedure to force MySQL to evaluate subqueries as independent. I doubt that's necessary in this case though.

I've seen this before. It's a bug in MySQL. Try this (note that MySQL requires an alias on the derived table):
SELECT * FROM files WHERE file_id NOT IN (SELECT * FROM (SELECT file_id FROM reades) AS r)
The bug report is here: http://bugs.mysql.com/bug.php?id=25926

Try:
SELECT * FROM files WHERE file_id NOT IN (SELECT reades.file_id FROM reades)
That is: if it's coming up as dependent, perhaps that's because of ambiguity in what file_id refers to, so let's try fully qualifying it.
If that doesn't work, just do:
SELECT files.*
FROM files
LEFT JOIN reades
USING (file_id)
WHERE reades.file_id IS NULL

Does MySQL support EXISTS in the same way that MSSQL would?
If so, you could rewrite the query as
SELECT * FROM files AS f WHERE NOT EXISTS (SELECT 1 FROM reades r WHERE r.file_id = f.file_id)
Using IN is horribly inefficient as it runs that subquery for each row in the parent query.

Looking at this page I found two possible solutions which both work. Just for completeness I add one of those, similar to the answers with JOINs shown above, but it is fast even without using foreign keys:
SELECT * FROM files AS f
INNER JOIN (SELECT DISTINCT file_id FROM reades) AS r
ON f.file_id = r.file_id
This solves the problem, but still this does not answer my question :)
EDIT: If I interpret the EXPLAIN output correctly, this is fast, because the interpreter generates a temporary index:
+----+-------------+------------+--------+---------------+---------+---------+-----------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+---------------+---------+---------+-----------+------+--------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 843 | |
| 1 | PRIMARY | f | eq_ref | PRIMARY | PRIMARY | 4 | r.file_id | 1 | |
| 2 | DERIVED | reades | range | NULL | file_id | 5 | NULL | 811 | Using index for group-by |
+----+-------------+------------+--------+---------------+---------+---------+-----------+------+--------------------------+

In MySQL 5.5 and earlier, IN subqueries are converted to EXISTS subqueries. The given query will be converted to the following query:
SELECT * FROM files
WHERE NOT EXISTS (SELECT 1 FROM reades WHERE reades.file_id = files.file_id)
As you see, the subquery is actually dependent.
MySQL 5.6 may choose to materialize the subquery. That is, first, run the inner query and store the result in a temporary table (removing duplicates). Then, it can use a join-like operation between the outer table (i.e., files) and the temporary table to find the rows with no match. This way of executing the query will probably be more optimal if reades.file_id is not indexed.
However, if reades.file_id is indexed, the traditional IN-to-EXISTS execution strategy is actually pretty efficient. In that case, I would not expect any significant performance improvement from converting the query into a join as suggested in other answers. The MySQL 5.6 optimizer makes a cost-based choice between materialization and IN-to-EXISTS execution.
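To make the rewrites discussed in this thread concrete, here is a minimal sketch that checks that the NOT IN, NOT EXISTS and LEFT JOIN ... IS NULL forms return the same rows. It uses Python's sqlite3 module as a stand-in for MySQL, so it says nothing about MySQL's execution plans; the table contents and the read_id column are invented for illustration.

```python
import sqlite3

# SQLite stands in for MySQL here; the rows below are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files  (file_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE reades (read_id INTEGER PRIMARY KEY, file_id INTEGER NOT NULL);
    INSERT INTO files VALUES (1, 'a.txt'), (2, 'b.txt'), (3, 'c.txt'), (4, 'd.txt');
    -- file 1 was read twice, file 3 once, files 2 and 4 never
    INSERT INTO reades (file_id) VALUES (1), (1), (3);
""")

queries = {
    # The original formulation from the question.
    "not_in": """
        SELECT file_id FROM files
        WHERE file_id NOT IN (SELECT file_id FROM reades)
        ORDER BY file_id""",
    # The correlated form that older MySQL versions rewrite it into.
    "not_exists": """
        SELECT file_id FROM files
        WHERE NOT EXISTS (SELECT 1 FROM reades WHERE reades.file_id = files.file_id)
        ORDER BY file_id""",
    # The anti-join suggested in the answers.
    "left_join": """
        SELECT f.file_id FROM files f
        LEFT JOIN reades r ON r.file_id = f.file_id
        WHERE r.file_id IS NULL
        ORDER BY f.file_id""",
}

results = {name: [row[0] for row in conn.execute(sql)] for name, sql in queries.items()}
for name, ids in results.items():
    print(name, ids)
```

All three forms agree here because reades.file_id is declared NOT NULL; with NULLs in the subquery result, NOT IN behaves differently from the other two.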

Related

MariaDB: How to use "INSERT ... SELECT" with a WITH statement

Note: This involves ColumnStore.
At work, we have a big SQL statement that takes too much memory to execute on prod. I'm currently working on reducing the memory the query consumes. I've tried different approaches, but nothing has solved the issue so far, except for WITH ... AS (...), for some reason. However, I need to combine this with an INSERT INTO ....
This is the code I'm trying to get working
TRUNCATE db1.myTable;
INSERT INTO db1.myTable(`all`, `needed`, `columns`)
(WITH everything AS (
SELECT all, needed, columns
FROM db1.mainTable T1
JOIN db1.secondTable T2
ON (T1.someCol = T2.someCol)
JOIN db2.thirdTable T3
ON (T1.anotherCol = T3.anotherCol)
LEFT JOIN db1.fourthTable T4
ON (T4.anotherCol = T1.anotherCol)
WHERE T2.yetAnotherCol >= (some_SELECT_subquery)
AND T1.valid = 1
) SELECT * FROM everything);
EXPLAIN (WITH everything AS ... returns
+------+-------------+-----------------------+------+---------------+------+---------+------+------+-------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-----------------------+------+---------------+------+---------+------+------+-------------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 16000000000000 | |
| 2 | PRIMARY | T1 | ALL | NULL | NULL | NULL | NULL | 2000 | Using where with pushed condition |
| 2 | PRIMARY | T2 | ALL | NULL | NULL | NULL | NULL | 2000 | Using where; Using join buffer (flat, BNL join) |
| 2 | PRIMARY | T3 | ALL | NULL | NULL | NULL | NULL | 2000 | Using where; Using join buffer (flat, BNL join) |
| 2 | PRIMARY | T4 | ALL | NULL | NULL | NULL | NULL | 2000 | Using where |
| 3 | SUBQUERY | some_SELECT_subquery | ALL | NULL | NULL | NULL | NULL | 2000 | Using where with pushed condition |
+------+-------------+-----------------------+------+---------------+------+---------+------+------+-------------------------------------------------+
5 rows in set (0,21 sec)
If I only use the WITH-statement, I can get it to work. As in, I don't use the INSERT INTO. No issues at all, and the query is even faster this way. I also did a quick test where I tried to divide the query into several WITHs, but gave up since I believe I messed up the syntax. I'm not too good with SQL, and even less so with JOINs (junior developer).
When I combine the WITH-statement with an INSERT INTO ..., MariaDB responds with ERROR 1064 (42000) at line 3: You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near ') SELECT * FROM everything)' at line 1. I've also tried adding a semicolon after ... valid = 1, merging the two last lines, positioning the open parentheses after ... AS on a new line, and some other things I could think of that might be syntax related. No luck.
My current thought is that you can't combine INSERT INTO ... SELECT ... with a WITH .... At least not having the WITH at the beginning, where the SELECT should be. This is what I can gather from the docs.
So, in short, my question is: can I combine INSERT INTO ... SELECT with a WITH-statement at all? If not, can I achieve something similar with another technique?
Are there any other ways I can improve the memory utilization of my query? I'd rather not mess with configuration options for MariaDB or Docker, but if that's the only possibility, I'll consider it.
Have you tried this?
TRUNCATE db1.myTable;
WITH everything AS (
SELECT all, needed, columns
FROM db1.mainTable T1
JOIN db1.secondTable T2
ON (T1.someCol = T2.someCol)
JOIN db2.thirdTable T3
ON (T1.anotherCol = T3.anotherCol)
LEFT JOIN db1.fourthTable T4
ON (T4.anotherCol = T1.anotherCol)
WHERE T2.yetAnotherCol >= (some_SELECT_subquery)
AND T1.valid = 1
) INSERT INTO db1.myTable SELECT * FROM everything;
Although I didn't find an answer to my original question, we decided to work around the problem by reducing the amount of data gathered in the subquery. I didn't disclose this in the original question because that wasn't a solution I was aware of when posting the question. We'll just call the SQL from a Python script where we can loop over the week numbers we'd like to fetch.
WHERE T2.ID >= (SELECT ID - {week_number} FROM db1.secondTable WHERE NOW() BETWEEN monday AND sunday) AND T1.valid = 1);

Need alternate SQL

I am currently working with an H2 database and I have written the following SQL, however the H2 database engine does not support the NOT IN being performed on a multiple column sub-query.
DELETE FROM AllowedParam_map
WHERE (AllowedParam_map.famid,AllowedParam_map.paramid) NOT IN (
SELECT famid,paramid
FROM macros
LEFT JOIN macrodata
ON macros.id != macrodata.macroid
ORDER BY famid)
Essentially I want to remove rows from allowedparam_map wherever it has the same combination of famid and paramid as the sub-query.
Edit: To clarify, the sub-query is specifically trying to find famid/paramid combinations that are NOT present in macrodata, in an effort to weed out the allowedparam_map, hence the ON macros.id != macrodata.macroid. I'm also terrible at SQL so this might be completely the wrong way to do it.
Edit 2: Here is some more info about the pertinent schema:
Macros
| ID | NAME | FAMID |
| 0 | foo | 1 |
| 1 | bar | 1 |
| 2 | baz | 1 |
MacroData
| ID | MACROID | PARAMID | VALUE |
| 0 | 0 | 1 | 1024 |
| 1 | 0 | 2 | 200 |
| 2 | 0 | 3 | 89.85 |
AllowedParam_Map
| ID | FAMID | PARAMID |
| 0 | 1 | 1 |
| 1 | 1 | 2 |
| 2 | 1 | 3 |
| 3 | 1 | 4 |
The parameters are allowed on a per-family basis. Notice how the allowedParam_map table contains an entry for famid=1 and paramid=4, even though macro 0, aka "foo", does not have an entry for paramid=4. If we expand this, there might be another famid=1 macro that has paramid=4, but we can't be sure. I want to cull from the allowedParam_map table any unused parameters, based on the data in the macrodata table.
IN and NOT IN can always be replaced with EXISTS and NOT EXISTS.
Some points first:
You are using an ORDER BY in your subquery, which is of course superfluous.
You are outer-joining a table, which should have no effect when asking for existence. So either you need to look up a field in the outer-joined table, in which case inner-join it, or you don't, in which case remove it from the query. (It is odd to join every non-related record (macros.id != macrodata.macroid) anyway.)
You originally said in the comments section that both famid and paramid reside in table macros, in which case the outer join to macrodata could simply be removed from the query.
As you now say that famid is in table macros and paramid is in table macrodata, and you want to look up pairs that exist in AllowedParam_map but not in the aforementioned tables, you seem to be looking for a simple inner join:
DELETE FROM AllowedParam_map
WHERE NOT EXISTS
(
SELECT *
FROM macros m
JOIN macrodata md ON md.macroid = m.id
WHERE m.famid = AllowedParam_map.famid
AND md.paramid = AllowedParam_map.paramid
);
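As a rough check of the DELETE above, the following sketch loads the sample rows from the question into an in-memory SQLite database (used here as a stand-in for H2, so H2-specific behavior is not verified) and runs the same statement. The column types are assumptions; only the column names and values come from the question.

```python
import sqlite3

# SQLite stands in for H2; column types are assumed, values are from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE macros (id INTEGER PRIMARY KEY, name TEXT, famid INTEGER);
    CREATE TABLE macrodata (id INTEGER PRIMARY KEY, macroid INTEGER, paramid INTEGER, value REAL);
    CREATE TABLE AllowedParam_map (id INTEGER PRIMARY KEY, famid INTEGER, paramid INTEGER);
    INSERT INTO macros VALUES (0, 'foo', 1), (1, 'bar', 1), (2, 'baz', 1);
    INSERT INTO macrodata VALUES (0, 0, 1, 1024), (1, 0, 2, 200), (2, 0, 3, 89.85);
    INSERT INTO AllowedParam_map VALUES (0, 1, 1), (1, 1, 2), (2, 1, 3), (3, 1, 4);
""")

# Delete every allowed (famid, paramid) pair that no macro of that family uses.
conn.execute("""
    DELETE FROM AllowedParam_map
    WHERE NOT EXISTS (
        SELECT *
        FROM macros m
        JOIN macrodata md ON md.macroid = m.id
        WHERE m.famid = AllowedParam_map.famid
          AND md.paramid = AllowedParam_map.paramid
    )
""")

remaining = conn.execute(
    "SELECT famid, paramid FROM AllowedParam_map ORDER BY paramid"
).fetchall()
print(remaining)
```

On this sample data only the unused pair famid=1, paramid=4 is removed.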
You can use not exists instead:
DELETE FROM AllowedParam_map m
WHERE NOT EXISTS (SELECT 1
FROM macros LEFT JOIN
macrodata
ON macros.id <> macrodata.macroid -- I strongly suspect this should be =
WHERE m.famid = ?.famid and m.paramid = ?.paramid -- add the appropriate table aliases
);
Notes:
I strongly suspect the <> should be =. <> does not make sense in this context.
Replace the ? with the appropriate table alias.
NOT EXISTS is better than NOT IN anyway. It does what you expect if one of the values is NULL.
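The NULL remark can be illustrated with a small sketch. It uses Python's sqlite3 module and two throwaway tables, t and s, which are hypothetical names chosen for this example; the three-valued logic shown is standard SQL behavior rather than anything specific to one engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (x INTEGER);
    CREATE TABLE s (y INTEGER);
    INSERT INTO t VALUES (1), (2), (3);
    INSERT INTO s VALUES (1), (NULL);   -- one NULL in the subquery result
""")

# x NOT IN (1, NULL) is never TRUE: for x = 2 it evaluates to NULL (unknown),
# so the row is filtered out along with every other row.
not_in = [row[0] for row in conn.execute(
    "SELECT x FROM t WHERE x NOT IN (SELECT y FROM s) ORDER BY x")]

# NOT EXISTS only asks whether a matching row exists; the NULL never matches.
not_exists = [row[0] for row in conn.execute(
    "SELECT x FROM t WHERE NOT EXISTS (SELECT 1 FROM s WHERE s.y = t.x) ORDER BY x")]

print("NOT IN:    ", not_in)
print("NOT EXISTS:", not_exists)
```

A single NULL in the subquery result makes NOT IN return no rows at all, while NOT EXISTS still returns the unmatched values.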

Why does select statement influence query execution and performance in MySQL?

I'm encountering a strange behavior of MySQL.
Query execution (i.e. the usage of indexes as shown by explain [QUERY]) and time needed for execution are dependent on the elements of the where clause.
Here is a query where the problem occurs:
select distinct
e1.idx, el1.idx, r1.fk_cat, r2.fk_cat
from ent e1, ent_leng el1, rel_c r1, _tax_c t1, rel_c r2, _tax_c t2
where el1.fk_ent=e1.idx
and r1.fk_ent=e1.idx and ((r1.fk_cat=43) or (r1.fk_cat=t1.fk_cat1 and t1.fk_cat2=43))
and r2.fk_ent=e1.idx and ((r2.fk_cat=10) or (r2.fk_cat=t2.fk_cat1 and t2.fk_cat2=10))
The corresponding explain output is:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
+----+-------------+-------+--------+-------------------------+---------+---------+---------------+-------+------------------------------------
| 1 | SIMPLE | el1 | index | fk_ent | fk_ent | 4 | NULL | 15002 | Using index; Using temporary
| 1 | SIMPLE | e1 | eq_ref | PRIMARY | PRIMARY | 4 | DB.el1.fk_ent | 1 | Using index
| 1 | SIMPLE | r1 | ref | fk_ent,fk_cat,fks | fks | 4 | DB.e1.idx | 1 | Using where; Using index
| 1 | SIMPLE | r2 | ref | fk_ent,fk_cat,fks | fks | 4 | DB.el1.fk_ent | 1 | Using index
| 1 | SIMPLE | t1 | index | fk_cat1,fk_cat2,fk_cats | fk_cats | 8 | NULL | 69 | Using where; Using index; Distinct; Using join buffer
| 1 | SIMPLE | t2 | index | fk_cat1,fk_cat2,fk_cats | fk_cats | 8 | NULL | 69 | Using where; Using index; Distinct; Using join buffer
As you can see a one-column index has the same name as the column it belongs to. I also added some useless indexes along with the used ones, just to see if they change the execution (which they don't).
The execution takes ~4.5 seconds.
When I add the column el1.name to the select part (nothing else changed), the index fk_ent in el1 cannot be used any more:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
+----+-------------+-------+--------+-------------------------+---------+---------+---------------+-------+------------------------------------
| 1 | SIMPLE | el1 | ALL | fk_ent | NULL | NULL | NULL | 15002 | Using temporary
The execution now takes ~8.5 seconds.
I always thought that the select part of a query does not influence the usage of indexes by the engine and doesn't affect performance in such a way.
Leaving out the attribute isn't a solution, and there are even more attributes that I have to select.
Even worse, the query in the used form is even a bit more complex and that makes the performance issue a big problem.
So my questions are:
1) What is the reason for this strange behavior?
2) How can I solve the performance problem?
Thanks for your help!
Gred
It's the DISTINCT restriction. You can think of that as another WHERE restriction. When you change the select list, you are really changing the WHERE clause for the DISTINCT restriction, and now the optimizer decides that it has to do a table scan anyway, so it might as well not use your index.
EDIT:
Not sure if this helps, but if I am understanding your data correctly, I think you can get rid of the DISTINCT restriction like this:
select
e1.idx, el1.idx, r1.fk_cat, r2.fk_cat
from ent e1
Inner Join ent_leng el1 ON el1.fk_ent=e1.idx
Inner Join rel_c r1 ON r1.fk_ent=e1.idx
Inner Join rel_c r2 ON r2.fk_ent=e1.idx
where
((r1.fk_cat=43) or Exists(Select 1 From _tax_c t1 Where r1.fk_cat=t1.fk_cat1 and t1.fk_cat2=43))
and
((r2.fk_cat=10) or Exists(Select 1 From _tax_c t2 Where r2.fk_cat=t2.fk_cat1 and t2.fk_cat2=10))
MySQL will return data from an index if possible, saving the entire row from being loaded. In this way, the selected columns can influence the index selection.
With this in mind, it can be much more efficient to add all required columns to an index, especially when selecting only a small subset of columns.
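The effect of the select list on index use can be sketched with SQLite's EXPLAIN QUERY PLAN. This is only an analogy: SQLite's planner is not MySQL's, and the table layout and index name idx_fk_ent below are assumptions. In MySQL the corresponding signal is "Using index" in the Extra column of EXPLAIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed layout, loosely modelled on ent_leng from the question.
    CREATE TABLE ent_leng (idx INTEGER PRIMARY KEY, fk_ent INTEGER NOT NULL, name TEXT);
    CREATE INDEX idx_fk_ent ON ent_leng (fk_ent);
""")

def plan(sql):
    """Return the detail column of SQLite's EXPLAIN QUERY PLAN output as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# Only indexed columns are selected: the index alone can answer the query.
covered = plan("SELECT fk_ent FROM ent_leng WHERE fk_ent = 1")

# Adding name forces a lookup into the table for every matching row.
not_covered = plan("SELECT fk_ent, name FROM ent_leng WHERE fk_ent = 1")

print(covered)
print(not_covered)
```

The first plan reports a covering index and the second does not, although the WHERE clause is identical in both queries.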

MySQL not using indexes

I just enabled the slow-log (+not using indexes) and I'm getting hundreds of entries for the same kind of query (only user changes)
SELECT id
, name
FROM `all`
WHERE id NOT IN(SELECT id
FROM `picks`
WHERE user=999)
ORDER BY name ASC;
EXPLAIN gives:
+----+--------------------+-------------------+-------+------------------+--------+---------+------------+------+------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------------------+-------+------------------+--------+---------+------------+------+------------------------------------------+
| 1 | PRIMARY | all | index | NULL | name | 156 | NULL | 209 | Using where; Using index; Using filesort |
| 2 | DEPENDENT SUBQUERY | picks | ref | user,user_2,pick | user_2 | 8 | const,func | 1 | Using where; Using index |
+----+--------------------+-------------------+-------+------------------+--------+---------+------------+------+------------------------------------------+
Any idea about how to optimize this query? I've tried with a bunch of different indexes on different fields but nothing.
I don't necessarily agree that 'not in' and 'exists' are ALWAYS bad choices for performance; however, they could be in this situation.
You might be able to get your results using a much simpler query:
SELECT `all`.id
, `all`.name
FROM `all`
, `picks`
WHERE `all`.id = picks.id
AND picks.user <> 999
ORDER BY `all`.name ASC;
"NOT IN" and "EXISTS" are always bad choices for performance. A LEFT JOIN with a check for NULL may be better; try it.
This is probably the best way to write the query. Select everything from all and try to find matching rows from picks that share the same id and user is 999. If such a row doesn't exist, picks.id will be NULL, because it's using a left outer join. Then you can filter the results to return only those rows.
SELECT `all`.id, `all`.name
FROM
`all`
LEFT JOIN picks ON picks.id=`all`.id AND picks.user=999
WHERE picks.id IS NULL
ORDER BY `all`.name ASC
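The rewrites suggested in this thread are not all equivalent to the original NOT IN query, which a small sketch can show. It uses Python's sqlite3 module as a stand-in for MySQL, with invented rows; SQLite accepts the same backtick quoting for the reserved table name `all`.

```python
import sqlite3

# SQLite stands in for MySQL; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE `all` (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE picks (id INTEGER, user INTEGER);
    INSERT INTO `all` VALUES (1, 'a'), (2, 'b'), (3, 'c');
    -- item 1 was picked by users 999 and 5, item 2 by user 5, item 3 by nobody
    INSERT INTO picks VALUES (1, 999), (1, 5), (2, 5);
""")

def ids(sql):
    return [row[0] for row in conn.execute(sql)]

# Original query: items that user 999 has not picked.
not_in = ids("""
    SELECT id FROM `all`
    WHERE id NOT IN (SELECT id FROM picks WHERE user = 999)
    ORDER BY id""")

# Anti-join rewrite: same result.
left_join = ids("""
    SELECT `all`.id FROM `all`
    LEFT JOIN picks ON picks.id = `all`.id AND picks.user = 999
    WHERE picks.id IS NULL
    ORDER BY `all`.id""")

# Comma join with <>: items picked by someone other than user 999.
inner_join = ids("""
    SELECT `all`.id FROM `all`, picks
    WHERE `all`.id = picks.id AND picks.user <> 999
    ORDER BY `all`.id""")

print(not_in, left_join, inner_join)
```

On this data the comma join wrongly includes item 1 (picked by user 999 and by someone else) and drops item 3 (picked by nobody), so only the LEFT JOIN form is a drop-in replacement.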

MySQL Index Being Ignored

EXPLAIN SELECT
*
FROM
content_link link
STRAIGHT_JOIN
content
ON
link.content_id = content.id
WHERE
link.content_id = 1
LIMIT 10;
+----+-------------+---------+-------+---------------+------------+---------+-------+------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+-------+---------------+------------+---------+-------+------+-------+
| 1 | SIMPLE | link | ref | content_id | content_id | 4 | const | 1 | |
| 1 | SIMPLE | content | const | PRIMARY | PRIMARY | 4 | const | 1 | |
+----+-------------+---------+-------+---------------+------------+---------+-------+------+-------+
However, when I remove the WHERE, the query stops using the key (even when I explicitly force it to):
EXPLAIN SELECT
*
FROM
content_link link FORCE KEY (content_id)
STRAIGHT_JOIN
content
ON
link.content_id = content.id
LIMIT 10;
+----+-------------+---------+--------+---------------+---------+---------+------------------------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+--------+---------------+---------+---------+------------------------+---------+-------------+
| 1 | SIMPLE | link | index | content_id | PRIMARY | 7 | NULL | 4555299 | Using index |
| 1 | SIMPLE | content | eq_ref | PRIMARY | PRIMARY | 4 | ft_dir.link.content_id | 1 | |
+----+-------------+---------+--------+---------------+---------+---------+------------------------+---------+-------------+
Are there any work-arounds to this?
I realize I'm selecting the entire table in the second example, but why does MySQL suddenly decide that it's going to ignore my FORCE anyway and not use the key? Without the key the query takes like 10 minutes... ugh.
FORCE is a bit of a misnomer. Here's what the MySQL docs say (emphasis mine):
You can also use FORCE INDEX, which acts like USE INDEX (index_list) but with the addition that a table scan is assumed to be very expensive. In other words, a table scan is used only if there is no way to use one of the given indexes to find rows in the table.
Since you aren't actually "finding" any rows (you are selecting them all), a table scan is always going to be fastest, and the optimizer is smart enough to know that in spite of what you are telling it.
ETA:
Try adding an ORDER BY on the primary key once and I bet it'll use the index.
An index helps search quickly inside a table, but it just slows things down if you select the entire table. So MySQL is correct in ignoring the index.
In your case, maybe the index has a hidden side effect that's not known to MySQL. For example, if the inner join holds only for a few rows, an index would speed things up. But MySQL can't know that without an explicit hint.
There is an exception: when every column you select is inside the index, the index is still useful if you select every row. For example, if you have an index on LastName, the following query still benefits from the index:
select LastName from orders
But this one won't:
select * from Orders
Your content_id seems to accept NULL values.
The MySQL optimizer thinks there is no guarantee that your query will return all values by using only the index (though there actually is a guarantee, since you use the column in a JOIN).
That's why it reverts to full table scan.
Either add a NOT NULL condition:
SELECT *
FROM content_link link FORCE KEY (content_id)
STRAIGHT_JOIN
content
ON content.id = link.content_id
WHERE link.content_id IS NOT NULL
LIMIT 10;
or mark your column as NOT NULL:
ALTER TABLE content_link MODIFY content_id INT NOT NULL; -- MODIFY needs the full column definition; INT is assumed here, use the column's actual type
Update:
This is verified bug 45314 in MySQL.