MySQL join with sort on datetime in joined table

I have two large MySQL tables: Articles and ArticleTopics. I want to query the DB and retrieve the last 30 articles published for a given topicId. My current query is rather slow. Any ideas on how to improve it?
More details:
The tables:
Articles (~1 million rows)
+-----------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+--------------+------+-----+---------+----------------+
| articleId | int(11) | NO | PRI | NULL | auto_increment |
| title | varchar(255) | NO | | NULL | |
| content | longtext | NO | | NULL | |
| pubDate | datetime | NO | MUL | NULL | |
+-----------+--------------+------+-----+---------+----------------+
ArticleTopics (~10 million rows)
+-----------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+--------------+------+-----+---------+-------+
| articleId | int(11) | NO | MUL | NULL | |
| topicId | int(11) | NO | MUL | NULL | |
+-----------+--------------+------+-----+---------+-------+
And my query:
SELECT a.articleId, a.pubDate
FROM Articles a, ArticleTopics t
WHERE t.articleId=a.articleId AND t.topicId=3364
ORDER BY a.pubDate DESC LIMIT 30;
And the EXPLAIN of the query:
+----+-------------+-------+--------+-------------------------------------+-------------------+---------+-------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-------------------------------------+-------------------+---------+-------------------+------+----------------------------------------------+
| 1 | SIMPLE | t | ref | articleId,topicId,topicId_articleId | topicId_articleId | 4 | const | 4281 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | a | eq_ref | PRIMARY,articleId_pubDate | PRIMARY | 4 | t.articleId | 1 | |
+----+-------------+-------+--------+-------------------------------------+-------------------+---------+-------------------+------+----------------------------------------------+
The slowness, I believe, comes from the ORDER BY a.pubDate DESC. I can greatly improve performance by faking it a bit: ordering by t.articleId DESC instead, with an index on ArticleTopics covering both articleId and topicId, since articleIds are generally in the same order as pubDates. They are not always, however, so it's not ideal. I'd like to be able to sort on pubDate.
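For reference, a sketch of that faster-but-approximate variant (assuming the composite (topicId, articleId) index described above):
SELECT a.articleId, a.pubDate
FROM Articles a
INNER JOIN ArticleTopics t ON t.articleId = a.articleId
WHERE t.topicId = 3364
ORDER BY t.articleId DESC LIMIT 30;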
Update: Added EXPLAIN.

You can rewrite the query in various ways to see if it speeds things up:
SELECT a.articleId, a.pubDate
FROM Articles a
WHERE a.articleId in (
select articleId
from ArticleTopics
where topicId = 3364
)
ORDER BY a.pubDate DESC LIMIT 30;
Or:
SELECT a.articleId, a.pubDate
FROM Articles a
INNER JOIN ArticleTopics t ON t.articleId = a.articleId
WHERE t.topicId = 3364
ORDER BY a.pubDate DESC LIMIT 30;
The important index for both queries is on Articles, with articleId as its first field.
If Articles is a large table, with say an entire PDF stored as a blob in each row, you can create an index that fully covers the query. Full coverage means all selected fields are part of the index; for this query, a fully covering index would be (articleId, pubDate).
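A minimal sketch of that covering index; the EXPLAIN above suggests it may already exist under the name articleId_pubDate:
CREATE INDEX articleId_pubDate ON Articles (articleId, pubDate);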

At this point, do you have an index on topicId? If so, does that index contain only the topicId field?
It would also help if you could post the output of EXPLAIN for the query.
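If the current index covers only topicId, the composite index being hinted at might look like this (a sketch; the EXPLAIN above shows a topicId_articleId key along these lines):
CREATE INDEX topicId_articleId ON ArticleTopics (topicId, articleId);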

Related

Response slow when ORDER BY is added to my SQL query

I have the following job_requests table schema, as shown here:
+-------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| available_to | integer[] | NO | | | |
| available_type | varchar(255) | NO | | NULL | |
| start_at | varchar(255) | NO | | NULL | |
+-------------+--------------+------+-----+---------+----------------+
I have the following query to return a list of records, ordered by the type_of_pool value:
WITH matching_jobs AS (
SELECT
job_requests_with_distance.*,
CASE WHEN (users.id = ANY (available_to) AND available_type = 0) THEN 'favourite'
ELSE 'normal'
END AS type_of_pool
FROM (
SELECT
job_requests.*,
users.id AS user_id
FROM
job_requests,
users) AS job_requests_with_distance
LEFT JOIN users ON users.id = user_id
WHERE start_at > NOW() at time zone 'Asia/Kuala_Lumpur'
AND user_id = 491
AND (user_id != ALL(coalesce(unavailable_to, array[]::int[])))
)
SELECT
*
FROM
matching_jobs
WHERE (type_of_pool != 'normal')::BOOLEAN
ORDER BY
array_position(ARRAY['favourite','exclusive','normal']::text[], type_of_pool)
LIMIT 30
If I remove the ORDER BY clause, the query takes about 3 ms, but when I add the ORDER BY it takes about 1.3 seconds to run.
I'm not sure how to optimize this query to make it faster. I have read about using indexes, but I'm not sure how an index would help in this scenario.
Any help is appreciated.
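One observation, offered as a sketch rather than a definitive answer: type_of_pool is computed per row inside the CTE, so no index can serve that ORDER BY directly; an index can only shrink the set of rows that reaches the sort. Assuming the start_at filter is selective (PostgreSQL syntax, index name illustrative):
CREATE INDEX idx_job_requests_start_at ON job_requests (start_at);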

Why is the query still so fast when I filter on a non-indexed column?

I am learning about database indexing.
Here are the indexes on a table. The table has 330k records.
mysql> show index from employee;
+----------+------------+-------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+----------+------------+-------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| employee | 0 | PRIMARY | 1 | id | A | 297383 | NULL | NULL | | BTREE | | | YES | NULL |
| employee | 0 | ak_employee | 1 | personal_code | A | 297383 | NULL | NULL | | BTREE | | | YES | NULL |
| employee | 1 | idx_email | 1 | email | A | 297383 | NULL | NULL | | BTREE | | | YES | NULL |
+----------+------------+-------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
As you can see, there are only three indexes on this table.
Now I want to run a query with a WHERE condition on the birth_date column. I expected it to be very slow because there is no index on birth_date, but when I tried the query, I found it was very fast.
mysql> select sql_no_cache *
-> from employee
-> where birth_date > '1955-11-11'
-> limit 100
-> ;
100 rows in set, 1 warning (0.04 sec)
So I am confused:
Why is it still so fast without an index?
And if it's this fast anyway, why do we need indexes at all?
This is your query:
select sql_no_cache *
from employee
where birth_date > '1955-11-11'
limit 100
There is no index on birth_date, so the query reads straight from the data pages. For each record, it compares the birth_date and returns the row if it matches. When it has found 100 rows (due to the LIMIT), it stops.
Presumably, it finds 100 rows quite quickly. After all, the median age in the United States is about 38 -- which is (as I write this) a birth year of 1981. By far, most people were born after 1955.
The query would be much slower if you had an ORDER BY or GROUP BY. That would require reading all the data before returning anything.
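A sketch of that contrast, with a hypothetical index that would restore the fast path for the sorted version:
-- Without an index, MySQL must read and sort every matching row before
-- returning the first one:
SELECT SQL_NO_CACHE *
FROM employee
WHERE birth_date > '1955-11-11'
ORDER BY birth_date
LIMIT 100;
-- A B-tree index on birth_date lets it read rows already in sorted order and
-- stop after 100 (the index name is illustrative):
CREATE INDEX idx_birth_date ON employee (birth_date);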

SQL select query too slow on webhost, fine on localhost

SELECT c.customers_lastname,
cg.customers_group_name,
dctc.coupons_id AS coupId,
dcto.coupons_id AS coupIdUsed,
dc.coupons_date_start AS coupStart,
count(DISTINCT o.orders_id) AS totalorders,
sum(op.products_quantity * op.final_price) AS ordersum
from
customers c LEFT JOIN customers_groups cg ON cg.customers_group_id = c.customers_group_id
LEFT JOIN (discount_coupons_to_customers dctc
LEFT JOIN discount_coupons dc ON dc.coupons_id = dctc.coupons_id
LEFT JOIN discount_coupons_to_orders dcto ON dcto.coupons_id = dctc.coupons_id
) ON c.customers_id = dctc.customers_id, orders_products op, orders o
WHERE c.customers_id = o.customers_id
AND c.customers_promotions = '0'
AND o.orders_id = op.orders_id
GROUP BY c.customers_id
ORDER BY ordersum DESC LIMIT 0, 10
The above query returns all customers that ever bought anything in our webshop (plus some extra data), sorted by total order amount. It runs fine on localhost (a few seconds) but takes up to a minute on the remote server. To make matters worse, the query can be modified via a form to append extra conditions to the GROUP BY clause, like:
HAVING (sum(op.products_quantity * op.final_price) >= 1000
AND/OR count(DISTINCT o.orders_id) > 2)
which doesn't exactly speed things up. There are about 5,000 customers and 3,000 orders at present. I added a time constraint (orders no older than one year) but things didn't speed up after that.
I compared my local server and the online one:
localhost Linux kernel 3.2, online 2.6;
localhost PHP 5.4.4, online 5.3.26;
localhost MySQL 5.5, online 5.1;
localhost PHP memory limit 128M, online 126M.
Is there an obvious bottleneck? I sent an email to my webhost but didn't get a response. If I need to swap hosts I will, but I'd like to know what to look out for. Cheers,
Edit
Using EXPLAIN (not sure how to format this, and no idea what it means) returns:
+----+-------------+-------+--------+-----------------------------+--------------+---------+---------------------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-----------------------------+--------------+---------+---------------------+-------+----------------------------------------------+
| 1 | SIMPLE | c | ALL | PRIMARY | NULL | NULL | NULL | 5541 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | cp | eq_ref | PRIMARY | PRIMARY | 4 | rpc.c.customers_id | 1 | |
| 1 | SIMPLE | dctc | eq_ref | customers_id,customers_id_2 | customers_id | 4 | rpc.c.customers_id | 1 | |
| 1 | SIMPLE | dcto | ref | PRIMARY | PRIMARY | 34 | rpc.dctc.coupons_id | 0 | Using index |
| 1 | SIMPLE | dc | ALL | PRIMARY | NULL | NULL | NULL | 1 | |
| 1 | SIMPLE | cg | ALL | PRIMARY | NULL | NULL | NULL | 5 | Using where; Using join buffer |
| 1 | SIMPLE | o | ALL | PRIMARY | NULL | NULL | NULL | 5010 | Using where; Using join buffer |
| 1 | SIMPLE | op | ALL | NULL | NULL | NULL | NULL | 10675 | Using where; Using join buffer |
+----+-------------+-------+--------+-----------------------------+--------------+---------+---------------------+-------+----------------------------------------------+
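For what it's worth, the EXPLAIN shows full scans (type ALL, no usable key) on o and op. A hedged sketch of indexes that would give those joins something to use (names are illustrative; compare SHOW INDEX output on both servers first, since localhost may simply have indexes the remote is missing):
CREATE INDEX idx_orders_customers_id ON orders (customers_id);
CREATE INDEX idx_orders_products_orders_id ON orders_products (orders_id);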

Rewriting this subquery?

I am trying to build a new table from the values in the existing table that are NOT contained in another table (although, obviously, the following query checks for containment). Following is my table structure:
mysql> explain t1;
+-----------+---------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+---------------------+------+-----+---------+-------+
| id | int(11) | YES | | NULL | |
| point | bigint(20) unsigned | NO | MUL | 0 | |
+-----------+---------------------+------+-----+---------+-------+
mysql> explain whitelist;
+-------------+---------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+---------------------+------+-----+---------+----------------+
| id | bigint(20) unsigned | NO | PRI | NULL | auto_increment |
| x | bigint(20) unsigned | YES | | NULL | |
| y | bigint(20) unsigned | YES | | NULL | |
| geonetwork | linestring | NO | MUL | NULL | |
+-------------+---------------------+------+-----+---------+----------------+
My query looks like this:
SELECT point
FROM t1
WHERE EXISTS(SELECT source
FROM whitelist
WHERE MBRContains(geonetwork, GeomFromText(CONCAT('POINT(', t1.point, ' 0)'))));
Explain:
+----+--------------------+--------------------+-------+-------------------+-----------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+--------------------+-------+-------------------+-----------+---------+------+------+--------------------------+
| 1 | PRIMARY | t1 | index | NULL | point | 8 | NULL | 1001 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | whitelist | ALL | _geonetwork | NULL | NULL | NULL | 3257 | Using where |
+----+--------------------+--------------------+-------+-------------------+-----------+---------+------+------+--------------------------+
The query is taking 6 seconds to execute for 1000 records in t1 which is unacceptable for me. How can I rewrite this query using Joins (or perhaps a faster way if that exists) if I don't have a column to join on? Even a stored procedure is acceptable I guess in the worst case. My goal is to finally create a new table containing entries from t1. Any suggestions?
Unless the query optimizer is failing, a WHERE EXISTS construct should result in the same plan as a join with a GROUP BY clause. Look at optimizing MBRContains(geonetwork, GeomFromText(CONCAT('POINT(', t1.point, ' 0)'))); that's probably where your query is spending all its time. I don't have a suggestion for that, but here's your query written with a JOIN:
Select t1.point
from t1
join whitelist on MBRContains(whitelist.geonetwork, GeomFromText(CONCAT('POINT(', t1.point, ' 0)')))
group by t1.point
;
or to get the points in t1 not in whitelist:
Select t1.point
from t1
left join whitelist on MBRContains(whitelist.geonetwork, GeomFromText(CONCAT('POINT(', t1.point, ' 0)')))
where whitelist.id is null
;
This seems like a case where de-normalizing t1 might be beneficial: adding a GeomFrmTxt column holding the value of GeomFromText(CONCAT('POINT(', t1.point, ' 0)')) could speed up the query you already have.
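A sketch of that denormalization (column and index names are illustrative; note that older MySQL versions support SPATIAL indexes only on MyISAM tables):
ALTER TABLE t1 ADD COLUMN geom_point POINT;
UPDATE t1 SET geom_point = GeomFromText(CONCAT('POINT(', point, ' 0)'));
ALTER TABLE t1 MODIFY geom_point POINT NOT NULL;
CREATE SPATIAL INDEX idx_t1_geom_point ON t1 (geom_point);
The join condition then becomes MBRContains(whitelist.geonetwork, t1.geom_point), with no per-row text parsing.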

Eliminate full table scan due to BETWEEN (and GROUP BY)

Description
According to the EXPLAIN command, there is a range condition that is causing the query to perform a full table scan (160k rows). How do I keep the range condition and reduce the scanning? I expect the culprit to be:
Y.YEAR BETWEEN 1900 AND 2009 AND
Code
Here is the code that has the range condition (the STATION_DISTRICT is likely superfluous).
SELECT
COUNT(1) as MEASUREMENTS,
AVG(D.AMOUNT) as AMOUNT,
Y.YEAR as YEAR,
MAKEDATE(Y.YEAR,1) as AMOUNT_DATE
FROM
CITY C,
STATION S,
STATION_DISTRICT SD,
YEAR_REF Y FORCE INDEX(YEAR_IDX),
MONTH_REF M,
DAILY D
WHERE
-- For a specific city ...
--
C.ID = 10663 AND
-- Find all the stations within a specific unit radius ...
--
6371.009 *
SQRT(
POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
(COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= 50 AND
-- Get the station district identification for the matching station.
--
S.STATION_DISTRICT_ID = SD.ID AND
-- Gather all known years for that station ...
--
Y.STATION_DISTRICT_ID = SD.ID AND
-- The data before 1900 is shaky; insufficient after 2009.
--
Y.YEAR BETWEEN 1900 AND 2009 AND
-- Filtered by all known months ...
--
M.YEAR_REF_ID = Y.ID AND
-- Whittled down by category ...
--
M.CATEGORY_ID = '003' AND
-- Into the valid daily climate data.
--
M.ID = D.MONTH_REF_ID AND
D.DAILY_FLAG_ID <> 'M'
GROUP BY
Y.YEAR
Update
The SQL is performing a full table scan, which results in MySQL performing a "copy to tmp table", as shown here:
+----+-------------+-------+--------+-----------------------------------+--------------+---------+-------------------------------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-----------------------------------+--------------+---------+-------------------------------+--------+-------------+
| 1 | SIMPLE | C | const | PRIMARY | PRIMARY | 4 | const | 1 | |
| 1 | SIMPLE | Y | range | YEAR_IDX | YEAR_IDX | 4 | NULL | 160422 | Using where |
| 1 | SIMPLE | SD | eq_ref | PRIMARY | PRIMARY | 4 | climate.Y.STATION_DISTRICT_ID | 1 | Using index |
| 1 | SIMPLE | S | eq_ref | PRIMARY | PRIMARY | 4 | climate.SD.ID | 1 | Using where |
| 1 | SIMPLE | M | ref | PRIMARY,YEAR_REF_IDX,CATEGORY_IDX | YEAR_REF_IDX | 8 | climate.Y.ID | 54 | Using where |
| 1 | SIMPLE | D | ref | INDEX | INDEX | 8 | climate.M.ID | 11 | Using where |
+----+-------------+-------+--------+-----------------------------------+--------------+---------+-------------------------------+--------+-------------+
Answer
After using the STRAIGHT_JOIN:
+----+-------------+-------+--------+-----------------------------------+---------------+---------+-------------------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-----------------------------------+---------------+---------+-------------------------------+------+---------------------------------+
| 1 | SIMPLE | C | const | PRIMARY | PRIMARY | 4 | const | 1 | Using temporary; Using filesort |
| 1 | SIMPLE | S | ALL | PRIMARY | NULL | NULL | NULL | 7795 | Using where |
| 1 | SIMPLE | SD | eq_ref | PRIMARY | PRIMARY | 4 | climate.S.STATION_DISTRICT_ID | 1 | Using index |
| 1 | SIMPLE | Y | ref | PRIMARY,STAT_YEAR_IDX | STAT_YEAR_IDX | 4 | climate.S.STATION_DISTRICT_ID | 1650 | Using where |
| 1 | SIMPLE | M | ref | PRIMARY,YEAR_REF_IDX,CATEGORY_IDX | YEAR_REF_IDX | 8 | climate.Y.ID | 54 | Using where |
| 1 | SIMPLE | D | ref | INDEX | INDEX | 8 | climate.M.ID | 11 | Using where |
+----+-------------+-------+--------+-----------------------------------+---------------+---------+-------------------------------+------+---------------------------------+
Related
http://dev.mysql.com/doc/refman/5.0/en/how-to-avoid-table-scan.html
http://dev.mysql.com/doc/refman/5.0/en/where-optimizations.html
Optimize SQL that uses between clause
Thank you!
ONE request... It looks like you KNOW your data. Add the keyword STRAIGHT_JOIN and see the results...
SELECT STRAIGHT_JOIN ... the rest of your query...
STRAIGHT_JOIN tells MySQL to join the tables in exactly the order listed. So, your CITY table is first in the FROM list, indicating you expect it to be your primary table... Additionally, your WHERE clause on CITY is the immediate filter. With that being said, it will probably fly through the rest of the query...
Hope it helps... It's worked for me with gov't data of millions of records queried and joined to 10+ lookup tables where MySQL was trying to think for me.
In order to do efficient BETWEEN queries, you are going to want a B-tree index on your YEAR column. For example:
CREATE INDEX id_index USING BTREE ON YEAR_REF (YEAR);
BTREE indexes allow for efficient range queries. If this is in fact the root problem, an index like this should get rid of the full table scan and have it scan only the part of the table that falls within the range. Read more about B-trees on Wikipedia.
However, as with any optimisation advice, you should measure to make sure that you don't do more harm than good.
Can you change from searching within a radius to searching within a bounding box?
You know the city, so you can calculate a bounding box in your application.
Perhaps this:
S.LATITUDE_DECIMAL >= latitude_lower and
S.LATITUDE_DECIMAL <= latitude_upper and
S.LONGITUDE_DECIMAL >= longitude_lower and
S.LONGITUDE_DECIMAL <= longitude_upper
could be a little faster?
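A sketch of deriving those bounds in SQL, using the same 6371.009 mean Earth radius and 50-unit radius as the original query (variable names are illustrative):
-- km per degree of latitude (~111.2)
SET @km_per_deg = 6371.009 * PI() / 180;
SET @lat_delta = 50 / @km_per_deg;
-- longitude degrees shrink with latitude
SET @lon_delta = 50 / (@km_per_deg * COS(RADIANS(@city_lat)));
-- latitude_lower = @city_lat - @lat_delta; latitude_upper = @city_lat + @lat_delta
-- longitude_lower = @city_lon - @lon_delta; longitude_upper = @city_lon + @lon_delta
The four comparisons can then use plain B-tree indexes on LATITUDE_DECIMAL and LONGITUDE_DECIMAL, with the exact radius formula applied only to the rows that survive the box.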