MySQL, reading this EXPLAIN statement - sql

I have a query which is starting to cause some concern in my application. I'm trying to understand this EXPLAIN statement better to understand where indexes are potentially missing:
+----+-------------+-------+--------+---------------+------------+---------+-------------------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+------------+---------+-------------------------------+------+---------------------------------+
| 1 | SIMPLE | s | ref | client_id | client_id | 4 | const | 102 | Using temporary; Using filesort |
| 1 | SIMPLE | u | eq_ref | PRIMARY | PRIMARY | 4 | www_foo_com.s.user_id | 1 | |
| 1 | SIMPLE | a | ref | session_id | session_id | 4 | www_foo_com.s.session_id | 1 | Using index |
| 1 | SIMPLE | h | ref | email_id | email_id | 4 | www_foo_com.a.email_id | 10 | Using index |
| 1 | SIMPLE | ph | ref | session_id | session_id | 4 | www_foo_com.s.session_id | 1 | Using index |
| 1 | SIMPLE | em | ref | session_id | session_id | 4 | www_foo_com.s.session_id | 1 | |
| 1 | SIMPLE | pho | ref | session_id | session_id | 4 | www_foo_com.s.session_id | 1 | |
| 1 | SIMPLE | c | ALL | userfield | NULL | NULL | NULL | 1108 | |
+----+-------------+-------+--------+---------------+------------+---------+-------------------------------+------+---------------------------------+
8 rows in set (0.00 sec)
I'm trying to understand where my indexes are missing by reading this EXPLAIN statement. Is it fair to say that one can understand how to optimize this query without seeing the query at all and just look at the results of the EXPLAIN?
It appears that the ALL scan against the 'c' table is the achilles heel. What's the best way to index this based on constant values as recommended on MySQL's documentation? |
Note, I also added an index to userfield in the cdr table and that hasn't done much good either.
Thanks.
--- edit ---
Here's the query, sorry -- don't know why I neglected to include it the first pass through.
SELECT s.`session_id` id,
DATE_FORMAT(s.`created`,'%m/%d/%Y') date,
u.`name`,
COUNT(DISTINCT c.id) calls,
COUNT(DISTINCT h.id) emails,
SEC_TO_TIME(MAX(DISTINCT c.duration)) duration,
(COUNT(DISTINCT em.email_id) + COUNT(DISTINCT pho.phone_id) > 0) status
FROM `fa_sessions` s
LEFT JOIN `fa_users` u ON s.`user_id`=u.`user_id`
LEFT JOIN `fa_email_aliases` a ON a.session_id = s.session_id
LEFT JOIN `fa_email_headers` h ON h.email_id = a.email_id
LEFT JOIN `fa_phones` ph ON ph.session_id = s.session_id
LEFT JOIN `fa_email_aliases` em ON em.session_id = s.session_id AND em.status = 1
LEFT JOIN `fa_phones` pho ON pho.session_id = s.session_id AND pho.status = 1
LEFT JOIN `cdr` c ON c.userfield = ph.phone_id
WHERE s.`partner_id`=1
GROUP BY s.`session_id`

I assume you've looked here to get more info about what it is telling you. Obviously the ALL means its going through all of them. The using temporary and using filesort are talked about on that page. You might want to look at that.
From the page:
Using filesort
MySQL must do an extra pass to find
out how to retrieve the rows in sorted
order. The sort is done by going
through all rows according to the join
type and storing the sort key and
pointer to the row for all rows that
match the WHERE clause. The keys then
are sorted and the rows are retrieved
in sorted order. See Section 7.2.12,
“ORDER BY Optimization”.
Using temporary
To resolve the query, MySQL needs to
create a temporary table to hold the
result. This typically happens if the
query contains GROUP BY and ORDER BY
clauses that list columns differently.
I agree that seeing the query might help to figure things out better.

My advice?
Break the query into 2 and use a temporary table in the middle.
Reasonning
The problem appears to be that table c is being table scanned, and that this is the last table in the query. This is probably bad: if you have a table scan, you want to do it at the start of the query, so it's only done once.
I'm not a MySQL guru, but I have spent a whole lot of time optimising queries on other DBs. It looks to me like the optimiser hasn't worked out that it should start with c and work backwards.
The other thing that strikes me is that there are probably too many tables in the join. Most optimisers struggle with more than 4 tables (because the number of possible table orders is growing exponentially, so checking them all becomes impractical).
Having too many tables in a join is the root of 90% of performance problems I have seen.
Give it a go, and let us know how you get on. If it doesn't help, please post the SQL, table definitions and indeces, and I'll take another look.
General Tips
Feel free to look at this answer I gave on general performance tips.
A great resource
MySQL Documentation for EXPLAIN

Well looking at the query would be useful, but there's at least one thing that's obviously worth looking into - the final line shows the ALL type for that part of the query, which is generally not great to see. If the suggested possible key (userfield) makes sense as an added index to table c, it might be worth adding it and seeing if that reduces the rows returned for that table in the search.

Query Plan
The query plan we might hope the optimiser would choose would be something like:
start with sessions where partner_id=1 , possibly using an index on partner_id,
join sessions to users, using an index on user_id
join sessions to phones, where status=1, using an index on session_id and possibly status
join sessions to phones again using an index on session_id and phone_id **
join phones to cdr using an index on userfield
join sessions to email_aliases, where status=1 using an index on session_id and possibly status
join sessions to email_aliases again using an index on session_id and email_id **
join email_aliases to email_headers using an index on email_id
** by putting 2 fields in these indeces, we enable the optimiser to join to the table using session_id, and immediately find out the associated phone_id or email_id without having to read the underlying table. This technique saves us a read, and can save a lot of time.
Indeces I would create:
The above query plan suggests these indeces:
fa_sessions ( partner_id, session_id )
fa_users ( user_id )
fa_email_aliases ( session_id, email_id )
fa_email_headers ( email_id )
fa_email_aliases ( session_id, status )
fa_phones ( session_id, status, phone_id )
cdr ( userfield )
Notes
You will almost certainly get acceptable performance without creating all of these.
If any of the tables are small ( less than 100 rows ) then it's probably not worth creating an index.
fa_email_aliases might work with ( session_id, status, email_id ), depending on how the optimiser works.

Related

Simple JOIN slow, very fast with STRAIGHT_JOIN but not for all cases

To place a video in one or more categories I have the following tables:
Video (id, url, title, viewCount) ~1,000,000 rows
VideoCategory (id, videoId, categoryId ~6,000,000 rows
Category (id, name) ~200 rows
Index on VideoCategory(categoryId, videoId)
Unique Index on Category(name)
The following query, to get 10 most viewed videos in the 'Cars' category is too slow (~5.5 sec). The 'Cars' category contains 200,000 videos.
SELECT v.* FROM Video v
JOIN VideoCategory vc ON vc.videoId = v.id
JOIN Category c ON vc.categoryId = c.id
WHERE c.name = 'Cars'
ORDER BY v.viewCount DESC
LIMIT 10
When querying a category that only contains 100 videos it takes ~0.05 sec.
EXPLAIN of query for 'Cars' category.
+------+-------------+-------+--------+------------------------------------+--------------------+---------+-----------------------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+--------+------------------------------------+--------------------+---------+-----------------------+--------+----------------------------------------------+
| 1 | SIMPLE | c | const | name_UNIQUE | name_UNIQUE | 322 | const | 1 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | vc | ref | fk_Category_idx,category_video_idx | category_video_idx | 8 | const | 493988 | Using index |
| 1 | SIMPLE | v | eq_ref | PRIMARY | PRIMARY | 8 | VideoDB.vc.videoId | 1 | Using where |
+------+-------------+-------+--------+------------------------------------+--------------------+---------+-----------------------+--------+----------------------------------------------+
How can I speed it up (hopefully 0.1 sec. or less)? Leaving out the ORDER BY has a massive impact but it obviously doesn't give me the result I want.
Update
I played around with STRAIGHT_JOIN and observed some interesting things with the following query:
SELECT v.* FROM Video v
STRAIGHT_JOIN VideoCategory vc ON v.id = vc.videoId
WHERE vc.categoryId = (SELECT id FROM Category WHERE name = 'Cars')
ORDER BY v.viewCount ASC
LIMIT 10
It returns in 0.011 sec! I also removed the JOIN Category c and replaced it with WHERE vc.categoryId = (SELECT id FROM Category WHERE name = 'Cars') which returns immediately with 0 results if the Category.name doesn't exist.
Unfortunately just throwing in a STRAIGHT_JOIN doesn't fix everything, it is only fast(er) for categories with ~1000+ videos, the more videos the faster it seems. For a category containing less than 100 videos it gets extremely slow, taking out the STRAIGHT_JOIN makes it very fast again.
For a query this simple I expected the planner to find the optimal path. What's going on here?
Another observation is that changing the order from ASC to DESC will make the query slower again for some categories but not for other? (e.g. ASC will take 0.01 sec. and DESC will take 0.8 sec.)
That order by is going to cost a lot for 200k records. One idea is to create a top 10 table that gets updated periodically, so you can just show entries from the new top 10 table which would be way quicker for the user and easier on the db load.
The optimal resolution of your origin query is to lookup the Category.id for "cars" which is done.
The query plan then falls over due to the lack of indexing on VideoCategory.
As I assume lookups are from categoryId -> videoId there should be a categoryId index. But ideally as a joining table (categoryId, videoId) is probably the most suitable composite primary key that would improve your query. If there reverse resolution from videoId -> categoryId is required, a secondary key videoId. The primary key is already included at the end of secondary keys.
The last problem of viewCount is a hard one. Some query optimization like rowid_filter might help.

SQL script runs VERY slowly with small change

I am relatively new to SQL. I have a script that used to run very quickly (<0.5 seconds) but runs very slowly (>120 seconds) if I add one change - and I can't see why this change makes such a difference. Any help would be hugely appreciated!
This is the script and it runs quickly if I do NOT include "tt2.bulk_cnt
" in line 26:
with bulksum1 as
(
select t1.membercode,
t1.schemecode,
t1.transdate
from mina_raw2 t1
where t1.transactiontype in ('RSP','SP','UNTV','ASTR','CN','TVIN','UCON','TRAS')
group by t1.membercode,
t1.schemecode,
t1.transdate
),
bulksum2 as
(
select t1.schemecode,
t1.transdate,
count(*) as bulk_cnt
from bulksum1 t1
group by t1.schemecode,
t1.transdate
having count(*) >= 10
),
results as
(
select t1.*, tt2.bulk_cnt
from mina_raw2 t1
inner join bulksum2 tt2
on t1.schemecode = tt2.schemecode and t1.transdate = tt2.transdate
where t1.transactiontype in ('RSP','SP','UNTV','ASTR','CN','TVIN','UCON','TRAS')
)
select * from results
EDIT: I apologise for not putting enough detail in here previously - although I can use basic SQL code, I am a complete novice when it comes to databases.
Database: Oracle (I'm not sure which version, sorry)
Execution plans:
QUICK query:
Plan hash value: 1712123489
---------------------------------------------
| Id | Operation | Name |
---------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | HASH JOIN | |
| 2 | VIEW | |
| 3 | FILTER | |
| 4 | HASH GROUP BY | |
| 5 | VIEW | VM_NWVW_0 |
| 6 | HASH GROUP BY | |
| 7 | TABLE ACCESS FULL| MINA_RAW2 |
| 8 | TABLE ACCESS FULL | MINA_RAW2 |
---------------------------------------------
SLOW query:
Plan hash value: 1298175315
--------------------------------------------
| Id | Operation | Name |
--------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | FILTER | |
| 2 | HASH GROUP BY | |
| 3 | HASH JOIN | |
| 4 | VIEW | VM_NWVW_0 |
| 5 | HASH GROUP BY | |
| 6 | TABLE ACCESS FULL| MINA_RAW2 |
| 7 | TABLE ACCESS FULL | MINA_RAW2 |
--------------------------------------------
A few observations, and then some things to do:
1) More information is needed. In particular, how many rows are there in the MINA_RAW2 table, what indexes exist on this table, and when was the last time it was analyzed? To determine the answers to these questions, run:
SELECT COUNT(*) FROM MINA_RAW2;
SELECT TABLE_NAME, LAST_ANALYZED, NUM_ROWS
FROM USER_TABLES
WHERE TABLE_NAME = 'MINA_RAW2';
From looking at the plan output it looks like the database is doing two FULL SCANs on MINA_RAW2 - it would be nice if this could be reduced to no more than one, and hopefully none. It's always tough to tell without very detailed information about the data in the table, but at first blush it appears that an index on TRANSACTIONTYPE might be helpful. If such an index doesn't exist you might want to consider adding it.
2) Assuming that the statistics are out-of-date (as in, old, nonexistent, or a significant amount of data (> 10%) has been added, deleted, or updated since the last analysis) run the following:
BEGIN
DBMS_STATS.GATHER_TABLE_STATS(owner => 'YOUR-SCHEMA-NAME',
table_name => 'MINA_RAW2');
END;
substituting the correct schema name for "YOUR-SCHEMA-NAME" above. Remember to capitalize the schema name! If you don't know if you should or shouldn't gather statistics, err on the side of caution and do it. It shouldn't take much time.
3) Re-try your existing query after updating the table statistics. I think there's a fair chance that having up-to-date statistics in the database will solve your issues. If not:
4) This query is doing a GROUP BY on the results of a GROUP BY. This doesn't appear to be necessary as the initial GROUP BY doesn't do any grouping - instead, it appears this is being done to get the unique combinations of MEMBERCODE, SCHEMECODE, and TRANSDATE so that the count of the members by scheme and date can be determined. I think the whole query can be simplified to:
WITH cteWORKING_TRANS AS (SELECT *
FROM MINA_RAW2
WHERE TRANSACTIONTYPE IN ('RSP','SP','UNTV',
'ASTR','CN','TVIN',
'UCON','TRAS')),
cteBULKSUM AS (SELECT a.SCHEMECODE,
a.TRANSDATE,
COUNT(*) AS BULK_CNT
FROM (SELECT DISTINCT MEMBERCODE,
SCHEMECODE,
TRANSDATE
FROM cteWORKING_TRANS) a
GROUP BY a.SCHEMECODE,
a.TRANSDATE)
SELECT t.*, b.BULK_CNT
FROM cteWORKING_TRANS t
INNER JOIN cteBULKSUM b
ON b.SCHEMECODE = t.SCHEMECODE AND
b.TRANSDATE = t.TRANSDATE
I managed to remove an unnecessary subquery, but this syntax with distinct inside count may not work outside of PostgreSQL or may not be the desired result. I know I've certainly used it there.
select t1.*, tt2.bulk_cnt
from mina_raw2 t1
inner join (select t2.schemecode,
t2.transdate,
count(DISTINCT membercode) as bulk_cnt
from mina_raw2 t2
where t2.transactiontype in ('RSP','SP','UNTV','ASTR','CN','TVIN','UCON','TRAS')
group by t2.schemecode,
t2.transdate
having count(DISTINCT membercode) >= 10) tt2
on t1.schemecode = tt2.schemecode and t1.transdate = tt2.transdate
where t1.transactiontype in ('RSP','SP','UNTV','ASTR','CN','TVIN','UCON','TRAS')
When you use those with queries, instead of subqueries when you don't need to, you're kneecapping the query optimizer.

SELECT TOP 1 ...Some stuff... ORDER BY DES gives different result

SELECT TOP 1 Col1,col2
FROM table ... JOIN table2
...Some stuff...
ORDER BY DESC
gives different result. compared to
SELECT Col1,col2
FROM table ... JOIN table2
...Some stuff...
ORDER BY DESC
2nd query gives me some rows , When I want the Top 1 of this result I write the 1st query with TOP 1 clause. These both give different results.
why is this behavior different
This isn't very clear, but I guess you mean the row returned by the first query isn't the same as the first row returned by the second query. This could be because your order by has duplicate values in it.
Say, for example, you had a table called Test
+-----+------+
| Seq | Name |
+-----+------+
| 1 | A |
| 1 | B |
| 2 | C |
+-----+------+
If you did Select * From Test Order By Seq, either of these is valid
+-----+------+
| Seq | Name |
+-----+------+
| 1 | A |
| 1 | B |
| 2 | C |
+-----+------+
+-----+------+
| Seq | Name |
+-----+------+
| 1 | B |
| 1 | A |
| 2 | C |
+-----+------+
With the top, you could get either row.
Having the top 1 clause could mean the query optimizer uses a completely different approach to generate the results.
I'm going to assume that you're working in SQL Server, so Laurence's answer is probably accurate. But for completeness, this also depends on what database technology you are using.
Typically, index-based databases, like SQL Server, will return results that are sorted by the index, depending on how the execution plan is created. But not all databases utilize indices.
Netezza, for example, keeps track of where data lives in the system without the concept of an index (Netezza's system architecture is quite a bit different). As a result, selecting the 1st record of a query will result in a random record from the result set floating to the top. Executing the same query multiple times will likely result in a different order each time.
If you have a requirement to order data, then it is in your best interest to enforce the ordering yourself instead of relying on the arbitrary ordering that the database will use when creating its execution plan. This will make your results more predictable.
Your 1st query will get one table's top row and compare with another table with condition. So it will return different values compare to normal join.

Join vs. sub-query

I am an old-school MySQL user and have always preferred JOIN over sub-query. But nowadays everyone uses sub-query, and I hate it; I don't know why.
I lack the theoretical knowledge to judge for myself if there is any difference. Is a sub-query as good as a JOIN and therefore is there nothing to worry about?
Sub-queries are the logically correct way to solve problems of the form, "Get facts from A, conditional on facts from B". In such instances, it makes more logical sense to stick B in a sub-query than to do a join. It is also safer, in a practical sense, since you don't have to be cautious about getting duplicated facts from A due to multiple matches against B.
Practically speaking, however, the answer usually comes down to performance. Some optimisers suck lemons when given a join vs a sub-query, and some suck lemons the other way, and this is optimiser-specific, DBMS-version-specific and query-specific.
Historically, explicit joins usually win, hence the established wisdom that joins are better, but optimisers are getting better all the time, and so I prefer to write queries first in a logically coherent way, and then restructure if performance constraints warrant this.
In most cases JOINs are faster than sub-queries and it is very rare for a sub-query to be faster.
In JOINs RDBMS can create an execution plan that is better for your query and can predict what data should be loaded to be processed and save time, unlike the sub-query where it will run all the queries and load all their data to do the processing.
The good thing in sub-queries is that they are more readable than JOINs: that's why most new SQL people prefer them; it is the easy way; but when it comes to performance, JOINS are better in most cases even though they are not hard to read too.
Taken from the MySQL manual (13.2.10.11 Rewriting Subqueries as Joins):
A LEFT [OUTER] JOIN can be faster than an equivalent subquery because the server might be able to optimize it better—a fact that is not specific to MySQL Server alone.
So subqueries can be slower than LEFT [OUTER] JOIN, but in my opinion their strength is slightly higher readability.
In the year 2010 I would have joined the author of this questions and would have strongly voted for JOIN, but with much more experience (especially in MySQL) I can state: Yes subqueries can be better. I've read multiple answers here; some stated subqueries are faster, but it lacked a good explanation. I hope I can provide one with this (very) late answer:
First of all, let me say the most important: There are different forms of sub-queries
And the second important statement: Size matters
If you use sub-queries, you should be aware of how the DB-Server executes the sub-query. Especially if the sub-query is evaluated once or for every row!
On the other side, a modern DB-Server is able to optimize a lot. In some cases a subquery helps optimizing a query, but a newer version of the DB-Server might make the optimization obsolete.
Sub-queries in Select-Fields
SELECT moo, (SELECT roger FROM wilco WHERE moo = me) AS bar FROM foo
Be aware that a sub-query is executed for every resulting row from foo.
Avoid this if possible; it may drastically slow down your query on huge datasets. However, if the sub-query has no reference to foo it can be optimized by the DB-server as static content and could be evaluated only once.
Sub-queries in the Where-statement
SELECT moo FROM foo WHERE bar = (SELECT roger FROM wilco WHERE moo = me)
If you are lucky, the DB optimizes this internally into a JOIN. If not, your query will become very, very slow on huge datasets because it will execute the sub-query for every row in foo, not just the results like in the select-type.
Sub-queries in the Join-statement
SELECT moo, bar
FROM foo
LEFT JOIN (
SELECT MIN(bar), me FROM wilco GROUP BY me
) ON moo = me
This is interesting. We combine JOIN with a sub-query. And here we get the real strength of sub-queries. Imagine a dataset with millions of rows in wilco but only a few distinct me. Instead of joining against a huge table, we have now a smaller temporary table to join against. This can result in much faster queries depending on database size. You can have the same effect with CREATE TEMPORARY TABLE ... and INSERT INTO ... SELECT ..., which might provide better readability on very complex queries (but can lock datasets in a repeatable read isolation level).
Nested sub-queries
SELECT VARIANCE(moo)
FROM (
SELECT moo, CONCAT(roger, wilco) AS bar
FROM foo
HAVING bar LIKE 'SpaceQ%'
) AS temp_foo
GROUP BY moo
You can nest sub-queries in multiple levels. This can help on huge datasets if you have to group or change the results. Usually the DB-Server creates a temporary table for this, but sometimes you do not need some operations on the whole table, only on the resultset. This might provide a much better performance depending on the size of the table.
Conclusion
Sub-queries are no replacement for a JOIN and you should not use them like this (although possible). In my humble opinion, the correct use of a sub-query is the use as a quick replacement of CREATE TEMPORARY TABLE .... A good sub-query reduces a dataset in a way you cannot accomplish in an ON statement of a JOIN. If a sub-query has one of the keywords GROUP BY or DISTINCT and is preferably not situated in the select fields or the where statement, then it might improve performance a lot.
Use EXPLAIN to see how your database executes the query on your data. There is a huge "it depends" in this answer...
PostgreSQL can rewrite a subquery to a join or a join to a subquery when it thinks one is faster than the other. It all depends on the data, indexes, correlation, amount of data, query, etc.
First of all, to compare the two first you should distinguish queries with subqueries to:
a class of subqueries that always have corresponding equivalent query written with joins
a class of subqueries that can not be rewritten using joins
For the first class of queries a good RDBMS will see joins and subqueries as equivalent and will produce same query plans.
These days even mysql does that.
Still, sometimes it does not, but this does not mean that joins will always win - I had cases when using subqueries in mysql improved performance. (For example if there is something preventing mysql planner to correctly estimate the cost and if the planner doesn't see the join-variant and subquery-variant as same then subqueries can outperform the joins by forcing a certain path).
Conclusion is that you should test your queries for both join and subquery variants if you want to be sure which one will perform better.
For the second class the comparison makes no sense as those queries can not be rewritten using joins and in these cases subqueries are natural way to do the required tasks and you should not discriminate against them.
I think what has been under-emphasized in the cited answers is the issue of duplicates and problematic results that may arise from specific (use) cases.
(although Marcelo Cantos does mention it)
I will cite the example from Stanford's Lagunita courses on SQL.
Student Table
+------+--------+------+--------+
| sID | sName | GPA | sizeHS |
+------+--------+------+--------+
| 123 | Amy | 3.9 | 1000 |
| 234 | Bob | 3.6 | 1500 |
| 345 | Craig | 3.5 | 500 |
| 456 | Doris | 3.9 | 1000 |
| 567 | Edward | 2.9 | 2000 |
| 678 | Fay | 3.8 | 200 |
| 789 | Gary | 3.4 | 800 |
| 987 | Helen | 3.7 | 800 |
| 876 | Irene | 3.9 | 400 |
| 765 | Jay | 2.9 | 1500 |
| 654 | Amy | 3.9 | 1000 |
| 543 | Craig | 3.4 | 2000 |
+------+--------+------+--------+
Apply Table
(applications made to specific universities and majors)
+------+----------+----------------+----------+
| sID | cName | major | decision |
+------+----------+----------------+----------+
| 123 | Stanford | CS | Y |
| 123 | Stanford | EE | N |
| 123 | Berkeley | CS | Y |
| 123 | Cornell | EE | Y |
| 234 | Berkeley | biology | N |
| 345 | MIT | bioengineering | Y |
| 345 | Cornell | bioengineering | N |
| 345 | Cornell | CS | Y |
| 345 | Cornell | EE | N |
| 678 | Stanford | history | Y |
| 987 | Stanford | CS | Y |
| 987 | Berkeley | CS | Y |
| 876 | Stanford | CS | N |
| 876 | MIT | biology | Y |
| 876 | MIT | marine biology | N |
| 765 | Stanford | history | Y |
| 765 | Cornell | history | N |
| 765 | Cornell | psychology | Y |
| 543 | MIT | CS | N |
+------+----------+----------------+----------+
Let's try to find the GPA scores for students that have applied to CS major (regardless of the university)
Using a subquery:
select GPA from Student where sID in (select sID from Apply where major = 'CS');
+------+
| GPA |
+------+
| 3.9 |
| 3.5 |
| 3.7 |
| 3.9 |
| 3.4 |
+------+
The average value for this resultset is:
select avg(GPA) from Student where sID in (select sID from Apply where major = 'CS');
+--------------------+
| avg(GPA) |
+--------------------+
| 3.6800000000000006 |
+--------------------+
Using a join:
select GPA from Student, Apply where Student.sID = Apply.sID and Apply.major = 'CS';
+------+
| GPA |
+------+
| 3.9 |
| 3.9 |
| 3.5 |
| 3.7 |
| 3.7 |
| 3.9 |
| 3.4 |
+------+
average value for this resultset:
select avg(GPA) from Student, Apply where Student.sID = Apply.sID and Apply.major = 'CS';
+-------------------+
| avg(GPA) |
+-------------------+
| 3.714285714285714 |
+-------------------+
It is obvious that the second attempt yields misleading results in our use case, given that it counts duplicates for the computation of the average value.
It is also evident that usage of distinct with the join - based statement will not eliminate the problem, given that it will erroneously keep one out of three occurrences of the 3.9 score. The correct case is to account for TWO (2) occurrences of the 3.9 score given that we actually have TWO (2) students with that score that comply with our query criteria.
It seems that in some cases a sub-query is the safest way to go, besides any performance issues.
MSDN Documentation for SQL Server says
Many Transact-SQL statements that include subqueries can be alternatively formulated as joins. Other questions can be posed only with subqueries. In Transact-SQL, there is usually no performance difference between a statement that includes a subquery and a semantically equivalent version that does not. However, in some cases where existence must be checked, a join yields better performance. Otherwise, the nested query must be processed for each result of the outer query to ensure elimination of duplicates. In such cases, a join approach would yield better results.
so if you need something like
select * from t1 where exists select * from t2 where t2.parent=t1.id
try to use join instead. In other cases, it makes no difference.
I say: Creating functions for subqueries eliminate the problem of cluttter and allows you to implement additional logic to subqueries. So I recommend creating functions for subqueries whenever possible.
Clutter in code is a big problem and the industry has been working on avoiding it for decades.
As per my observation like two cases, if a table has less then 100,000 records then the join will work fast.
But in the case that a table has more than 100,000 records then a subquery is best result.
I have one table that has 500,000 records on that I created below query and its result time is like
SELECT *
FROM crv.workorder_details wd
inner join crv.workorder wr on wr.workorder_id = wd.workorder_id;
Result : 13.3 Seconds
select *
from crv.workorder_details
where workorder_id in (select workorder_id from crv.workorder)
Result : 1.65 Seconds
Run on a very large database from an old Mambo CMS:
SELECT id, alias
FROM
mos_categories
WHERE
id IN (
SELECT
DISTINCT catid
FROM mos_content
);
0 seconds
SELECT
DISTINCT mos_content.catid,
mos_categories.alias
FROM
mos_content, mos_categories
WHERE
mos_content.catid = mos_categories.id;
~3 seconds
An EXPLAIN shows that they examine the exact same number of rows, but one takes 3 seconds and one is near instant. Moral of the story? If performance is important (when isn't it?), try it multiple ways and see which one is fastest.
And...
SELECT
DISTINCT mos_categories.id,
mos_categories.alias
FROM
mos_content, mos_categories
WHERE
mos_content.catid = mos_categories.id;
0 seconds
Again, same results, same number of rows examined. My guess is that DISTINCT mos_content.catid takes far longer to figure out than DISTINCT mos_categories.id does.
A general rule is that joins are faster in most cases (99%).
The more data tables have, the subqueries are slower.
The less data tables have, the subqueries have equivalent speed as joins.
The subqueries are simpler, easier to understand, and easier to read.
Most of the web and app frameworks and their "ORM"s and "Active record"s generate queries with subqueries, because with subqueries are easier to split responsibility, maintain code, etc.
For smaller web sites or apps subqueries are OK, but for larger web sites and apps you will often have to rewrite generated queries to join queries, especial if a query uses many subqueries in the query.
Some people say "some RDBMS can rewrite a subquery to a join or a join to a subquery when it thinks one is faster than the other.", but this statement applies to simple cases, surely not for complicated queries with subqueries which actually cause a problems in performance.
Subqueries are generally used to return a single row as an atomic value, though they may be used to compare values against multiple rows with the IN keyword. They are allowed at nearly any meaningful point in a SQL statement, including the target list, the WHERE clause, and so on. A simple sub-query could be used as a search condition. For example, between a pair of tables:
SELECT title
FROM books
WHERE author_id = (
SELECT id
FROM authors
WHERE last_name = 'Bar' AND first_name = 'Foo'
);
Note that using a normal value operator on the results of a sub-query requires that only one field must be returned. If you're interested in checking for the existence of a single value within a set of other values, use IN:
SELECT title
FROM books
WHERE author_id IN (
SELECT id FROM authors WHERE last_name ~ '^[A-E]'
);
This is obviously different from say a LEFT-JOIN where you just want to join stuff from table A and B even if the join-condition doesn't find any matching record in table B, etc.
If you're just worried about speed you'll have to check with your database and write a good query and see if there's any significant difference in performance.
MySQL version: 5.5.28-0ubuntu0.12.04.2-log
I was also under the impression that JOIN is always better than a sub-query in MySQL, but EXPLAIN is a better way to make a judgment. Here is an example where sub queries work better than JOINs.
Here is my query with 3 sub-queries:
EXPLAIN SELECT vrl.list_id,vrl.ontology_id,vrl.position,l.name AS list_name, vrlih.position AS previous_position, vrl.moved_date
FROM `vote-ranked-listory` vrl
INNER JOIN lists l ON l.list_id = vrl.list_id
INNER JOIN `vote-ranked-list-item-history` vrlih ON vrl.list_id = vrlih.list_id AND vrl.ontology_id=vrlih.ontology_id AND vrlih.type='PREVIOUS_POSITION'
INNER JOIN list_burial_state lbs ON lbs.list_id = vrl.list_id AND lbs.burial_score < 0.5
WHERE vrl.position <= 15 AND l.status='ACTIVE' AND l.is_public=1 AND vrl.ontology_id < 1000000000
AND (SELECT list_id FROM list_tag WHERE list_id=l.list_id AND tag_id=43) IS NULL
AND (SELECT list_id FROM list_tag WHERE list_id=l.list_id AND tag_id=55) IS NULL
AND (SELECT list_id FROM list_tag WHERE list_id=l.list_id AND tag_id=246403) IS NOT NULL
ORDER BY vrl.moved_date DESC LIMIT 200;
EXPLAIN shows:
+----+--------------------+----------+--------+-----------------------------------------------------+--------------+---------+-------------------------------------------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+----------+--------+-----------------------------------------------------+--------------+---------+-------------------------------------------------+------+--------------------------+
| 1 | PRIMARY | vrl | index | PRIMARY | moved_date | 8 | NULL | 200 | Using where |
| 1 | PRIMARY | l | eq_ref | PRIMARY,status,ispublic,idx_lookup,is_public_status | PRIMARY | 4 | ranker.vrl.list_id | 1 | Using where |
| 1 | PRIMARY | vrlih | eq_ref | PRIMARY | PRIMARY | 9 | ranker.vrl.list_id,ranker.vrl.ontology_id,const | 1 | Using where |
| 1 | PRIMARY | lbs | eq_ref | PRIMARY,idx_list_burial_state,burial_score | PRIMARY | 4 | ranker.vrl.list_id | 1 | Using where |
| 4 | DEPENDENT SUBQUERY | list_tag | ref | list_tag_key,list_id,tag_id | list_tag_key | 9 | ranker.l.list_id,const | 1 | Using where; Using index |
| 3 | DEPENDENT SUBQUERY | list_tag | ref | list_tag_key,list_id,tag_id | list_tag_key | 9 | ranker.l.list_id,const | 1 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | list_tag | ref | list_tag_key,list_id,tag_id | list_tag_key | 9 | ranker.l.list_id,const | 1 | Using where; Using index |
+----+--------------------+----------+--------+-----------------------------------------------------+--------------+---------+-------------------------------------------------+------+--------------------------+
The same query with JOINs is:
EXPLAIN SELECT vrl.list_id,vrl.ontology_id,vrl.position,l.name AS list_name, vrlih.position AS previous_position, vrl.moved_date
FROM `vote-ranked-listory` vrl
INNER JOIN lists l ON l.list_id = vrl.list_id
INNER JOIN `vote-ranked-list-item-history` vrlih ON vrl.list_id = vrlih.list_id AND vrl.ontology_id=vrlih.ontology_id AND vrlih.type='PREVIOUS_POSITION'
INNER JOIN list_burial_state lbs ON lbs.list_id = vrl.list_id AND lbs.burial_score < 0.5
LEFT JOIN list_tag lt1 ON lt1.list_id = vrl.list_id AND lt1.tag_id = 43
LEFT JOIN list_tag lt2 ON lt2.list_id = vrl.list_id AND lt2.tag_id = 55
INNER JOIN list_tag lt3 ON lt3.list_id = vrl.list_id AND lt3.tag_id = 246403
WHERE vrl.position <= 15 AND l.status='ACTIVE' AND l.is_public=1 AND vrl.ontology_id < 1000000000
AND lt1.list_id IS NULL AND lt2.tag_id IS NULL
ORDER BY vrl.moved_date DESC LIMIT 200;
and the output is:
+----+-------------+-------+--------+-----------------------------------------------------+--------------+---------+---------------------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-----------------------------------------------------+--------------+---------+---------------------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | lt3 | ref | list_tag_key,list_id,tag_id | tag_id | 5 | const | 2386 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | l | eq_ref | PRIMARY,status,ispublic,idx_lookup,is_public_status | PRIMARY | 4 | ranker.lt3.list_id | 1 | Using where |
| 1 | SIMPLE | vrlih | ref | PRIMARY | PRIMARY | 4 | ranker.lt3.list_id | 103 | Using where |
| 1 | SIMPLE | vrl | ref | PRIMARY | PRIMARY | 8 | ranker.lt3.list_id,ranker.vrlih.ontology_id | 65 | Using where |
| 1 | SIMPLE | lt1 | ref | list_tag_key,list_id,tag_id | list_tag_key | 9 | ranker.lt3.list_id,const | 1 | Using where; Using index; Not exists |
| 1 | SIMPLE | lbs | eq_ref | PRIMARY,idx_list_burial_state,burial_score | PRIMARY | 4 | ranker.vrl.list_id | 1 | Using where |
| 1 | SIMPLE | lt2 | ref | list_tag_key,list_id,tag_id | list_tag_key | 9 | ranker.lt3.list_id,const | 1 | Using where; Using index |
+----+-------------+-------+--------+-----------------------------------------------------+--------------+---------+---------------------------------------------+------+----------------------------------------------+
A comparison of the rows column tells the difference and the query with JOINs is using Using temporary; Using filesort.
Of course when I run both the queries, the first one is done in 0.02 secs, the second one does not complete even after 1 min, so EXPLAIN explained these queries properly.
If I do not have the INNER JOIN on the list_tag table i.e. if I remove
AND (SELECT list_id FROM list_tag WHERE list_id=l.list_id AND tag_id=246403) IS NOT NULL
from the first query and correspondingly:
INNER JOIN list_tag lt3 ON lt3.list_id = vrl.list_id AND lt3.tag_id = 246403
from the second query, then EXPLAIN returns the same number of rows for both queries and both these queries run equally fast.
Subqueries have ability to calculate aggregation functions on a fly.
E.g. Find minimal price of the book and get all books which are sold with this price.
1) Using Subqueries:
SELECT titles, price
FROM Books, Orders
WHERE price =
(SELECT MIN(price)
FROM Orders) AND (Books.ID=Orders.ID);
2) using JOINs
SELECT MIN(price)
FROM Orders;
-----------------
2.99
SELECT titles, price
FROM Books b
INNER JOIN Orders o
ON b.ID = o.ID
WHERE o.price = 2.99;
The difference is only seen when the second joining table has significantly more data than the primary table. I had an experience like below...
We had a users table of one hundred thousand entries and their membership data (friendship) about 3 hundred thousand entries. It was a join statement in order to take friends and their data, but with a great delay. But it was working fine where there was only a small amount of data in the membership table. Once we changed it to use a sub-query it worked fine.
But in the mean time the join queries are working with other tables that have fewer entries than the primary table.
So I think the join and sub query statements are working fine and it depends on the data and the situation.
These days, many dbs can optimize subqueries and joins. Thus, you just gotto examine your query using explain and see which one is faster. If there is not much difference in performance, I prefer to use subquery as they are simple and easier to understand.
I am not a relational database expert, so take this with a grain of salt.
The general idea about sub queries vs joins is the path the evaluation of the larger query takes.
In order to perform the larger query, every individual subquery has to be executed first, and then the resultset is stored as a temporary table that the larger query interacts with.
This temporary table is unindexed, so, any comparison requires scanning the whole resultset.
In contrast, when you use a join, all indexes are in use and so, comparison require traversing index trees (or hash tables), which is way less expensive in terms of speed.
Now, what I don't know if newer versions of the most popular relational engines execute the evaluation on reverse, and just load the necessary elements in the temporary table, as an optimization method.
I just thinking about the same problem, but I am using subquery in the FROM part.
I need connect and query from large tables, the "slave" table have 28 million record but the result is only 128 so small result big data! I am using MAX() function on it.
First I am using LEFT JOIN because I think that is the correct way, the mysql can optimalize etc.
Second time just for testing, I rewrite to sub-select against the JOIN.
LEFT JOIN runtime: 1.12s
SUB-SELECT runtime: 0.06s
18 times faster the subselect than the join! Just in the chokito adv. The subselect looks terrible but the result ...
It depends on several factors, including the specific query you're running, the amount of data in your database. Subquery runs the internal queries first and then from the result set again filter out the actual results. Whereas in join runs the and produces the result in one go.
The best strategy is that you should test both the join solution and the subquery solution to get the optimized solution.
If you want to speed up your query using join:
For "inner join/join",
Don't use where condition instead use it in "ON" condition.
Eg:
select id,name from table1 a
join table2 b on a.name=b.name
where id='123'
Try,
select id,name from table1 a
join table2 b on a.name=b.name and a.id='123'
For "Left/Right Join",
Don't use in "ON" condition, Because if you use left/right join it will get all rows for any one table.So, No use of using it in "On". So, Try to use "Where" condition

How can I speed up a count(*) which is already using indexes? (MyISAM)

I have a 3 large tables (10k, 10k, and 100M rows) and am trying to do a simple count on a join of them, where all the joined columns are indexed. Why does the COUNT(*) take so long, and how can I speed it up (without triggers and a running summary)?
mysql> describe SELECT COUNT(*) FROM `metaward_alias` INNER JOIN `metaward_achiever` ON (`metaward_alias`.`id` = `metaward_achiever`.`alias_id`) INNER JOIN `metaward_award` ON (`metaward_achiever`.`award_id` = `metaward_award`.`id`) WHERE `metaward_award`.`owner_id` = 8;
+----+-------------+-------------------+--------+-------------------------------------------------------+----------------------------+---------+---------------------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------------+--------+-------------------------------------------------------+----------------------------+---------+---------------------------------+------+-------------+
| 1 | SIMPLE | metaward_award | ref | PRIMARY,metaward_award_owner_id | metaward_award_owner_id | 4 | const | 1552 | |
| 1 | SIMPLE | metaward_achiever | ref | metaward_achiever_award_id,metaward_achiever_alias_id | metaward_achiever_award_id | 4 | paul.metaward_award.id | 2498 | |
| 1 | SIMPLE | metaward_alias | eq_ref | PRIMARY | PRIMARY | 4 | paul.metaward_achiever.alias_id | 1 | Using index |
+----+-------------+-------------------+--------+-------------------------------------------------------+----------------------------+---------+---------------------------------+------+-------------+
3 rows in set (0.00 sec)
But actually running the query takes about 10 minutes, and I'm on MyISAM so the tables are fully locked down for that duration
I guess the reason is that you do a huge join over three tables (without applying where clause first, the result would be 10k * 10k * 100M = 1016 rows). Try to reorder joins (for example start with metaward_award, then join only metaward_achiever see how long that takes, then try to plug metaward_alias, possibly using subquery to force your preferred evaluation order).
If that does not help you might have to denormalize your data, for example by storing number of aliases for particular metaward_achiever. Then you'd get rid of one join altogether. Maybe you can even cache the sums for metaward_award, depending on how and how often is your data updated.
Other thing that might help is getting all your database content into RAM :-)
Make sure you have indexes on:
metaward_alias id
metaward_achiever alias_id
metaward_achiever award_id
metaward_award id
metaward_award owner_id
I'm sure many people will also suggest to count on a specific column, but in MySql this doesn't make any difference for your query.
UPDATE:
You could also try to set the condition on the main table instead of one of the joined tables. That would give you the same result, but it could be faster (I don't know how clever MySql is):
SELECT COUNT(*) FROM `metaward_award`
INNER JOIN `metaward_achiever`
ON (`metaward_achiever`.`award_id` = `metaward_award`.`id`)
INNER JOIN `metaward_alias`
ON (`metaward_alias`.`id` = `metaward_achiever`.`alias_id`)
WHERE `metaward_award`.`owner_id` = 8
10 minutes is way too long for that query. I think you must have a really small key cache. You can get its size in bytes with:
SELECT ##key_buffer_size
First off, you should run ANALYZE TABLE or OPTIMIZE TABLE. They'll sort your index and can slightly improve the performance.
You should also see if you can use more compact types for your columns. For instance, if you're not going to have more than 16 millions owners or awards or aliases, you can change your INT columns into MEDIUMINT (UNSIGNED, of course). Perhaps even SMALLINT in some cases? That will reduce your index footprint and you'll fit more of it in the cache.