Consider the following query:
select FEE_NUMBER
from CARRIER_FEE CF
left outer join CONTYPE_FEE_LIST cfl on CF.CAR_FEE_ID=cfl.CAR_FEE_ID and cfl.CONT_TYPE_ID=3
where CF.SEQ_NO = (
select max(CF2.SEQ_NO) from CARRIER_FEE CF2
where CF2.FEE_NUMBER=CF.FEE_NUMBER
and CF2.COMPANY_ID=CF.COMPANY_ID
group by CF2.FEE_NUMBER)
group by CF.CAR_FEE_ID
On my laptop this returns no results. Using exactly the same (dumped) database on my servers it does return results.
If I run an EXPLAIN on my laptop I get this
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------+-------+---------------------------------------------+-----------------------+---------+------------------------+------+----------------------------------------------+
| 1 | PRIMARY | CF | index | NULL | PRIMARY | 8 | NULL | 132 | Using where |
| 1 | PRIMARY | cfl | ref | FK_CONTYPE_FEE_LIST_1,FK_CONTYPE_FEE_LIST_2 | FK_CONTYPE_FEE_LIST_1 | 8 | odysseyB.CF.CAR_FEE_ID | 6 | |
| 2 | DEPENDENT SUBQUERY | CF2 | ref | FK_SURCHARGE_1 | FK_SURCHARGE_1 | 8 | func | 66 | Using where; Using temporary; Using filesort |
Whereas on all of my other servers it gives this (note the difference in the ref column)
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------+-------+---------------------------------------------+-----------------------+---------+------------------------+------+----------------------------------------------+
| 1 | PRIMARY | CF | index | NULL | PRIMARY | 8 | NULL | 132 | Using where |
| 1 | PRIMARY | cfl | ref | FK_CONTYPE_FEE_LIST_1,FK_CONTYPE_FEE_LIST_2 | FK_CONTYPE_FEE_LIST_1 | 8 | odysseyB.CF.CAR_FEE_ID | 6 | |
| 2 | DEPENDENT SUBQUERY | CF2 | ref | FK_SURCHARGE_1 | FK_SURCHARGE_1 | 8 | odysseyB.CF.COMPANY_ID | 66 | Using where; Using temporary; Using filesort |
If I remove either the join, the subquery or the last group-by then I get the expected results.
I'm assuming that this is a configuration issue, however it's not one that I've seen before. Does anybody know what might cause this?
My laptop is running OSX 10.6 with MySQL 5.0.41. Another laptop running OSX 10.5.7 and MySQL 5.0.37 works fine, as do the Linux servers running MySQL 5.0.27.
Can anyone explain the difference between one explain plan using ref=func and the other using ref=odysseyB.CF.COMPANY_ID?
Thanks.
On both machines:
mysql> SHOW CREATE TABLE CARRIER_FEE CF;
Make sure that both table ENGINE types are the same.
Also, since you are using OS X 10.6 on the machine having the error? Perhaps the data types on that OS have different qualities than 10.5.x.
Seems like people are having compatibility problems with snow leopard. Try installing MySQL 5.4 on your 10.6 laptop.
http://forums.mysql.com/read.php?10,278942,278942#msg-278942
I don't know why it's giving different results. You don't have exactly the same data dump, since the row counts reported in your EXPLAIN reports are different. I'd recommend doing some simpler queries to test your assumptions.
Also double-check that you're really executing the exact same SQL query on both servers. For instance, if you inadvertently changed your left outer join to an inner join, that could make the whole query return no results.
BTW, tangential to your question but I solve these "greatest row per group" types of queries with an outer join:
select FEE_NUMBER
from CARRIER_FEE CF
left outer join CARRIER_FEE CF2
on CF.FEE_NUMBER = CF2.FEE_NUMBER and CF.COMPANY_ID = CF.COMPANY_ID
and CF.SEQ_NO < cf2.SEQ_NO
left outer join CONTYPE_FEE_LIST cfl
on CF.CAR_FEE_ID=cfl.CAR_FEE_ID and cfl.CONT_TYPE_ID=3
where CF2.SEQ_NO IS NULL
group by CF.CAR_FEE_ID;
This type of solution is often much faster than the correlated subquery solution you're currently using. I wouldn't think that could change the result of the query, I'm just offering it as an option.
Related
SELECT c.customers_lastname,
cg.customers_group_name,
dctc.coupons_id AS coupId,
dcto.coupons_id AS coupIdUsed,
dc.coupons_date_start AS coupStart,
count(DISTINCT o.orders_id) AS totalorders,
sum(op.products_quantity * op.final_price) AS ordersum
from
customers c LEFT JOIN customers_groups cg ON cg.customers_group_id = c.customers_group_id
LEFT JOIN (discount_coupons_to_customers dctc
LEFT JOIN discount_coupons dc ON dc.coupons_id = dctc.coupons_id
LEFT JOIN discount_coupons_to_orders dcto ON dcto.coupons_id = dctc.coupons_id
) ON c.customers_id = dctc.customers_id, orders_products op, orders o
WHERE c.customers_id = o.customers_id
AND c.customers_promotions = '0'
AND o.orders_id = op.orders_id
GROUP BY c.customers_id
ORDER BY ordersum DESC LIMIT 0, 10
The above query returns all customers that ever bought anything in our webshop (and some extra data), sorted by total order amount. It runs fine on localhost (a few seconds) but takes up to a minute on remote server. To make matters worse, the query can be modified via a form to include extra bits in the GROUP BY clause like:
HAVING (sum(op.products_quantity * op.final_price) >= 1000
AND/OR count(DISTINCT o.orders_id) > 2)
which doesn't exactly speed things up. there's about 5000 customers and 3000 orders at present. I added a time constraint WHERE order not older than one year but things didnt speed up after that.
i compared my local server and the online one.
localhost linux kernel is 3.2, online 2.6,
localhost php 5.4.4, online 5.3.26,
localhost mysql 5.5, online 5.1,
localhost php memory limit 128M, online 126M.
is there an obvious bottleneck? i sent an email to my webhost but didnt get a response. if I need to swap hosts I will but would like to know what to look out for. cheers,
Edit
using explain: (not sure how to format this, and no idea what it means) returns
`+----+-------------+-------+--------+-----------------------------+--------------+---------+---------------------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-----------------------------+--------------+---------+---------------------+-------+----------------------------------------------+
| 1 | SIMPLE | c | ALL | PRIMARY | NULL | NULL | NULL | 5541 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | cp | eq_ref | PRIMARY | PRIMARY | 4 | rpc.c.customers_id | 1 | |
| 1 | SIMPLE | dctc | eq_ref | customers_id,customers_id_2 | customers_id | 4 | rpc.c.customers_id | 1 | |
| 1 | SIMPLE | dcto | ref | PRIMARY | PRIMARY | 34 | rpc.dctc.coupons_id | 0 | Using index |
| 1 | SIMPLE | dc | ALL | PRIMARY | NULL | NULL | NULL | 1 | |
| 1 | SIMPLE | cg | ALL | PRIMARY | NULL | NULL | NULL | 5 | Using where; Using join buffer |
| 1 | SIMPLE | o | ALL | PRIMARY | NULL | NULL | NULL | 5010 | Using where; Using join buffer |
| 1 | SIMPLE | op | ALL | NULL | NULL | NULL | NULL | 10675 | Using where; Using join buffer |
+----+-------------+-------+--------+-----------------------------+--------------+---------+---------------------+-------+----------------------------------------------+`
This question already has answers here:
Closed 12 years ago.
The community reviewed whether to reopen this question 2 months ago and left it closed:
Original close reason(s) were not resolved
Possible Duplicate:
Why is SELECT * considered harmful?
Probably a database nOOb question.
Our application has a table like the following
TABLE WF
Field | Type | Null | Key | Default | Extra |
+--------------------+-------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| children | text | YES | | NULL | |
| w_id | int(11) | YES | | NULL | |
| f_id | int(11) | YES | | NULL | |
| filterable | tinyint(1) | YES | | 1 | |
| created_at | datetime | YES | | NULL | |
| updated_at | datetime | YES | | NULL | |
| status | smallint(6) | YES | | 1 | |
| visible | tinyint(1) | YES | | 1 | |
| weight | int(11) | YES | | NULL | |
| root | tinyint(1) | YES | | 0 | |
| mfr | tinyint(1) | YES | | 0 | |
+--------------------+-------------+------+-----+---------+----------------+
This table is expected to be upwards of ten million records. The schema is not expected to change much. I need to retrieve the columns f_id, children, status, visible, weight, root, mfr.
Which approach is faster for data retrieval?
1) Select * from WF where w_id = 1 AND status = 1;
I will strip the unnecessary columns in the application layer.
2) Select children,f_id,status,visible,weight,root,mfr from WF where w_id = 1 AND status = 1;
There is no need to strip the unnecessary columns as its pre-selected in the query.
Does any one have a real life benchmark as to which is faster. I know some say Select * is evil, but will MySQL respond faster while trying to get the whole chunk as opposed to retrieving selective columns?
I am using MySQL version: 5.1.37-1ubuntu5 (Ubuntu) and the application is Rails3 app.
As an example of how a select statement that includes a subset of columns can be significantly faster, it can use a covering index on the table that includes just those columns, potentially resulting in much better query performance.
If you return fewer columns there is less data to go across the network and less data for the database to process and it will almost always return faster. Databases also tend to be slower using select * because the database then has to go figure out what the columns are and thus do more work than when you specify. Further select * will often return bad results if the structure changes significantly. It may end up showing the user fields you don;t want them to see or if someone is silly enough to rearrange the columns, then the application may actually appear to show things in the wrong order or if doing an insert from the data, put them in the wrong column. It is almost alawys a poor practice to use selct * in production code.
I'm encountering a strange behavior of MySQL.
Query execution (i.e. the usage of indexes as shown by explain [QUERY]) and time needed for execution are dependent on the elements of the where clause.
Here is a query where the problem occurs:
select distinct
e1.idx, el1.idx, r1.fk_cat, r2.fk_cat
from ent e1, ent_leng el1, rel_c r1, _tax_c t1, rel_c r2, _tax_c t2
where el1.fk_ent=e1.idx
and r1.fk_ent=e1.idx and ((r1.fk_cat=43) or (r1.fk_cat=t1.fk_cat1 and t1.fk_cat2=43))
and r2.fk_ent=e1.idx and ((r2.fk_cat=10) or (r2.fk_cat=t2.fk_cat1 and t2.fk_cat2=10))
The corresponding explain output is:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
+----+-------------+-------+--------+-------------------------+---------+---------+---------------+-------+------------------------------------
| 1 | SIMPLE | el1 | index | fk_ent | fk_ent | 4 | NULL | 15002 | Using index; Using temporary
| 1 | SIMPLE | e1 | eq_ref | PRIMARY | PRIMARY | 4 | DB.el1.fk_ent | 1 | Using index
| 1 | SIMPLE | r1 | ref | fk_ent,fk_cat,fks | fks | 4 | DB.e1.idx | 1 | Using where; Using index
| 1 | SIMPLE | r2 | ref | fk_ent,fk_cat,fks | fks | 4 | DB.el1.fk_ent | 1 | Using index
| 1 | SIMPLE | t1 | index | fk_cat1,fk_cat2,fk_cats | fk_cats | 8 | NULL | 69 | Using where; Using index; Distinct;
| | | | | | | | | | Using join buffer
| 1 | SIMPLE | t2 | index | fk_cat1,fk_cat2,fk_cats | fk_cats | 8 | NULL | 69 | Using where; Using index; Distinct;
| Using join buffer
As you can see a one-column index has the same name as the column it belongs to. I also added some useless indexes along with the used ones, just to see if they change the execution (which they don't).
The execution takes ~4.5 seconds.
When I add the column entl1.name to the select part (nothing else changed), the index fk_ent in el1 cannot be used any more:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
+----+-------------+-------+--------+-------------------------+---------+---------+---------------+-------+------------------------------------
| 1 | SIMPLE | el1 | ALL | fk_ent | NULL | NULL | NULL | 15002 | Using temporary
The execution now takes ~8.5 seconds.
I always thought that the select part of a query does not influence the usage of indexes by the engine and doesn't affect performance in such a way.
Leaving out the attribute isn't a solution, and there are even more attributes that i have to select.
Even worse, the query in the used form is even a bit more complex and that makes the performance issue a big problem.
So my questions are:
1) What is the reason for this strange behavior?
2) How can I solve the performance problem?
Thanks for your help!
Gred
It's the DISTINCT restriction. You can think of that as another WHERE restriction. When you change the select list, you are really changing the WHERE clause for the DISTINCT restriction, and now the optimizer decides that it has to do a table scan anyway, so it might as well not use your index.
EDIT:
Not sure if this helps, but if I am understanding your data correctly, I think you can get rid of the DISTINCT restriction like this:
select
e1.idx, el1.idx, r1.fk_cat, r2.fk_cat
from ent e1
Inner Join ent_leng el1 ON el1.fk_ent=e1.idx
Inner Join rel_c r1 ON r1.fk_ent=e1.idx
Inner Join rel_c r2 ON r2.fk_ent=e1.idx
where
((r1.fk_cat=43) or Exists(Select 1 From _tax_c t1 Where r1.fk_cat=t1.fk_cat1 and t1.fk_cat2=43))
and
((r2.fk_cat=10) or Exists(Select 1 From _tax_c t2 Where r2.fk_cat=t2.fk_cat1 and t2.fk_cat2=10))
MySQL will return data from an index if possible, saving the entire row from being loaded. In this way, the selected columns can influence the index selection.
With this in mind, it can much more efficient to add all required columns to an index, especially in the case of only selecting a small subset of columns.
Description
According to the explain command, there is a range that is causing a query to perform a full table scan (160k rows). How do I keep the range condition and reduce the scanning? I expect the culprit to be:
Y.YEAR BETWEEN 1900 AND 2009 AND
Code
Here is the code that has the range condition (the STATION_DISTRICT is likely superfluous).
SELECT
COUNT(1) as MEASUREMENTS,
AVG(D.AMOUNT) as AMOUNT,
Y.YEAR as YEAR,
MAKEDATE(Y.YEAR,1) as AMOUNT_DATE
FROM
CITY C,
STATION S,
STATION_DISTRICT SD,
YEAR_REF Y FORCE INDEX(YEAR_IDX),
MONTH_REF M,
DAILY D
WHERE
-- For a specific city ...
--
C.ID = 10663 AND
-- Find all the stations within a specific unit radius ...
--
6371.009 *
SQRT(
POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
(COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= 50 AND
-- Get the station district identification for the matching station.
--
S.STATION_DISTRICT_ID = SD.ID AND
-- Gather all known years for that station ...
--
Y.STATION_DISTRICT_ID = SD.ID AND
-- The data before 1900 is shaky; insufficient after 2009.
--
Y.YEAR BETWEEN 1900 AND 2009 AND
-- Filtered by all known months ...
--
M.YEAR_REF_ID = Y.ID AND
-- Whittled down by category ...
--
M.CATEGORY_ID = '003' AND
-- Into the valid daily climate data.
--
M.ID = D.MONTH_REF_ID AND
D.DAILY_FLAG_ID <> 'M'
GROUP BY
Y.YEAR
Update
The SQL is performing a full table scan, which results in MySQL performing a "copy to tmp table", as shown here:
+----+-------------+-------+--------+-----------------------------------+--------------+---------+-------------------------------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-----------------------------------+--------------+---------+-------------------------------+--------+-------------+
| 1 | SIMPLE | C | const | PRIMARY | PRIMARY | 4 | const | 1 | |
| 1 | SIMPLE | Y | range | YEAR_IDX | YEAR_IDX | 4 | NULL | 160422 | Using where |
| 1 | SIMPLE | SD | eq_ref | PRIMARY | PRIMARY | 4 | climate.Y.STATION_DISTRICT_ID | 1 | Using index |
| 1 | SIMPLE | S | eq_ref | PRIMARY | PRIMARY | 4 | climate.SD.ID | 1 | Using where |
| 1 | SIMPLE | M | ref | PRIMARY,YEAR_REF_IDX,CATEGORY_IDX | YEAR_REF_IDX | 8 | climate.Y.ID | 54 | Using where |
| 1 | SIMPLE | D | ref | INDEX | INDEX | 8 | climate.M.ID | 11 | Using where |
+----+-------------+-------+--------+-----------------------------------+--------------+---------+-------------------------------+--------+-------------+
Answer
After using the STRAIGHT_JOIN:
+----+-------------+-------+--------+-----------------------------------+---------------+---------+-------------------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-----------------------------------+---------------+---------+-------------------------------+------+---------------------------------+
| 1 | SIMPLE | C | const | PRIMARY | PRIMARY | 4 | const | 1 | Using temporary; Using filesort |
| 1 | SIMPLE | S | ALL | PRIMARY | NULL | NULL | NULL | 7795 | Using where |
| 1 | SIMPLE | SD | eq_ref | PRIMARY | PRIMARY | 4 | climate.S.STATION_DISTRICT_ID | 1 | Using index |
| 1 | SIMPLE | Y | ref | PRIMARY,STAT_YEAR_IDX | STAT_YEAR_IDX | 4 | climate.S.STATION_DISTRICT_ID | 1650 | Using where |
| 1 | SIMPLE | M | ref | PRIMARY,YEAR_REF_IDX,CATEGORY_IDX | YEAR_REF_IDX | 8 | climate.Y.ID | 54 | Using where |
| 1 | SIMPLE | D | ref | INDEX | INDEX | 8 | climate.M.ID | 11 | Using where |
+----+-------------+-------+--------+-----------------------------------+---------------+---------+-------------------------------+------+---------------------------------+
Related
http://dev.mysql.com/doc/refman/5.0/en/how-to-avoid-table-scan.html
http://dev.mysql.com/doc/refman/5.0/en/where-optimizations.html
Optimize SQL that uses between clause
Thank you!
ONE Request... It looks like you KNOW your data. Add the keyword "STRAIGHT_JOIN" and see the results...
SELECT STRAIGHT_JOIN ... the rest of your query...
Straight-join tells MySql to DO IT AS I HAVE LISTED. So, your CITY table is the first in the FROM list, thus indicating you expect that to be your primary... Additionally, your WHERE clause of the CITY is the immediate filter. With that being said, it will probably fly through the rest of the query...
Hope it helps... Its worked for me with gov't data of millions of records queried and joined to 10+ lookup tables where mySql was trying to think for me.
in order to do efficient between queries you are going to want a b tree index on your YEAR column. for example:
CREATE INDEX id_index USING BTREE ON YEAR_REF (YEAR);
BTREE indexes allow for efficient range queries, if this is in fact the root problem then having an index like this should get rid of the full table scan and have it only scan the part of the table that is in the range. read more about btrees on wikipedia
However, as with any optimisation advice, you should measure to make sure that you don't do more harm than good.
Can you change from searching within a radius to search in a bounding box?
You know the city so you can calculate a bounding box in your application.
Perhaps this
S.LATITUDE_DECIMAL >= latitude_lower and
S.LATITUDE_DECIMAL <= latitude_upper and
S.LONGITUDE_DECIMAL >= longitude_lower and
S.LONGITUDE_DECIMAL <= longitude_upper
could be a little faster?
I have a table files with files and a table reades with read accesses to these files. In the table reades there is a column file_id where refers to the respective column in files.
Now I would like to list all files which have not been accessed and tried this:
SELECT * FROM files WHERE file_id NOT IN (SELECT file_id FROM reades)
This is terribly slow. The reason is that mySQL thinks that the subquery is dependent on the query:
+----+--------------------+--------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+--------+------+---------------+------+---------+------+------+----------+-------------+
| 1 | PRIMARY | files | ALL | NULL | NULL | NULL | NULL | 1053 | 100.00 | Using where |
| 2 | DEPENDENT SUBQUERY | reades | ALL | NULL | NULL | NULL | NULL | 3242 | 100.00 | Using where |
+----+--------------------+--------+------+---------------+------+---------+------+------+----------+-------------+
But why? The subquery is completely independent and more or less just meant to return a list of ids.
(To be precise: Each file_id can appear multiple times in reades, of course, as there can be arbitrarily many read operations for each file.)
Try replacing the subquery with a join:
SELECT *
FROM files f
LEFT OUTER JOIN reades r on r.file_id = f.file_id
WHERE r.file_id IS NULL
Here's a link to an article about this problem. The writer of that article wrote a stored procedure to force MySQL to evaluate subqueries as independant. I doubt that's necessary in this case though.
i've seen this before. it's a bug in mysql. try this:
SELECT * FROM files WHERE file_id NOT IN (SELECT * FROM (SELECT file_id FROM reades))
there bug report is here: http://bugs.mysql.com/bug.php?id=25926
Try:
SELECT * FROM files WHERE file_id NOT IN (SELECT reades.file_id FROM reades)
That is: if it's coming up as dependent, perhaps that's because of ambiguity in what file_id refers to, so let's try fully qualifying it.
If that doesn't work, just do:
SELECT files.*
FROM files
LEFT JOIN reades
USING (file_id)
WHERE reades.file_id IS NULL
Does MySQL support EXISTS in the same way that MSSQL would?
If so, you could rewrite the query as
SELECT * FROM files as f WHERE file_id NOT EXISTS (SELECT 1 FROM reades r WHERE r.file_id = f.file_id)
Using IN is horribly inefficient as it runs that subquery for each row in the parent query.
Looking at this page I found two possible solutions which both work. Just for completeness I add one of those, similar to the answers with JOINs shown above, but it is fast even without using foreign keys:
SELECT * FROM files AS f
INNER JOIN (SELECT DISTINCT file_id FROM reades) AS r
ON f.file_id = r.file_id
This solves the problem, but still this does not answer my question :)
EDIT: If I interpret the EXPLAIN output correctly, this is fast, because the interpreter generates a temporary index:
+----+-------------+------------+--------+---------------+---------+---------+-----------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+---------------+---------+---------+-----------+------+--------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 843 | |
| 1 | PRIMARY | f | eq_ref | PRIMARY | PRIMARY | 4 | r.file_id | 1 | |
| 2 | DERIVED | reades | range | NULL | file_id | 5 | NULL | 811 | Using index for group-by |
+----+-------------+------------+--------+---------------+---------+---------+-----------+------+--------------------------+
IN-subqueries are in MySQL 5.5 and earlier converted to EXIST subqueries. The given query will be converted to the following query:
SELECT * FROM files
WHERE NOT EXISTS (SELECT 1 FROM reades WHERE reades.filed_id = files.file_id)
As you see, the subquery is actually dependent.
MySQL 5.6 may choose to materialize the subquery. That is, first, run the inner query and store the result in a temporary table (removing duplicates). Then, it can use a join-like operation between the outer table (i.e., files) and the temporary table to find the rows with no match. This way of executing the query will probably be more optimal if reades.file_id is not indexed.
However, if reades.file_id is indexed, the traditional IN-to-EXISTS execution strategy is actually pretty efficient. In that case, I would not expect any significant performance improvement from converting the query into a join as suggested in other answers. MySQL 5.6 optimizer makes a cost-based choice between materialization and IN-to-EXISTS execution.