I need to compare integers in a MySQL table. Pretty simple, but the table is fairly large, so queries take a long time. No problem, I can use an index. According to the MySQL documentation, I should be able to use an index for comparison operators:
"A B-tree index can be used for column comparisons in expressions that use the =, >, >=, <, <=, or BETWEEN"
However, when I try this it has no effect on performance, and according to EXPLAIN the index is not used:
SELECT * FROM Node n WHERE n.X < 800000
That results in extremely poor performance, and EXPLAIN lists our "Rectangle_Index" among the possible_keys but shows that the key actually used was NULL. Here's our CREATE TABLE statement:
CREATE TABLE `Visual_Node` (
`Id` bigint(20) NOT NULL AUTO_INCREMENT,
`X` bigint(20) NOT NULL,
`Y` bigint(20) NOT NULL,
`X_plus_Width` bigint(20) DEFAULT NULL,
`Y_plus_Height` bigint(20) DEFAULT NULL,
PRIMARY KEY (`Id`),
KEY `Rectangle_Index` (`X`,`X_plus_Width`,`Y`,`Y_plus_Height`)
) ENGINE=InnoDB AUTO_INCREMENT=4340743 DEFAULT CHARSET=latin1
Can anyone help with this query? The actual query I want to run is the following:
SELECT * FROM Node n WHERE 800000 BETWEEN n.X and n.X_plus_Width AND 1234567 BETWEEN n.Y and n.Y_plus_Height
Update (asked in one of the answers below)
Altering the table structure is very difficult for me. Here's the output of EXPLAIN for the basic query:
mysql> explain select * from Node n where n.X < 800000;
+----+-------------+-------+------+-----------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+-----------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | n | ALL | Rectangle_Index | NULL | NULL | NULL | 173952 | Using where |
+----+-------------+-------+------+-----------------+------+---------+------+--------+-------------+
1 row in set (0.02 sec)
If you rewrite your query as
SELECT *
FROM Node n
WHERE
n.X <= 800000 AND
n.X_plus_Width >= 800000 AND
n.Y <= 1234567 AND
n.Y_plus_Height >= 1234567
MySQL could use the index for only one column: it can't use an index for more than one range condition, and you have four of them.
I suggest you take a look at the Spatial Extensions.
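To illustrate why only the first range condition can narrow the search, here is a minimal Python sketch. It is not MySQL's actual implementation: a sorted list of tuples stands in for the four-column B-tree, and the sample rows are made up for the illustration.

```python
from bisect import bisect_right

# Hypothetical rows: (X, X_plus_Width, Y, Y_plus_Height)
rows = [
    (100000, 900000, 1000000, 3000000),
    (200000, 300000, 1200000, 1300000),
    (500000, 1500000, 1000000, 2000000),
    (700000, 750000, 1234000, 1235000),
    (900000, 950000, 1000000, 2000000),
]

# Sorting by the full tuple mimics the ordering of a composite index.
index = sorted(rows)

# The range on the leading column (X <= 800000) is a single binary
# search that yields one contiguous slice of the index.
xs = [r[0] for r in index]
candidates = index[:bisect_right(xs, 800000)]

# Inside that slice the second column is in no particular order, so
# X_plus_Width >= 800000 cannot be answered by another binary search.
widths = [r[1] for r in candidates]
second_column_sorted = widths == sorted(widths)

# The remaining three conditions have to be checked row by row.
matches = [
    r for r in candidates
    if r[1] >= 800000 and r[2] <= 1234567 and r[3] >= 1234567
]

print(len(candidates), second_column_sorted, len(matches))
```

With this data the first range leaves four candidate rows, and the other three conditions are applied as a plain filter over them. This matches the key_len of 8 (one bigint) that shows up in EXPLAIN output for this kind of query.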
Have you checked the details of multiple-column indexes, specifically the part about how the optimizer is (or is not) able to use them? Here's a quote from this page:
"If the table has a multiple-column index, any leftmost prefix of the index can be used by the optimizer to find rows. For example, if you have a three-column index on (col1, col2, col3), you have indexed search capabilities on (col1), (col1, col2), and (col1, col2, col3)."
Perhaps you could try creating multiple single-column indexes, rather than one multiple-column index?
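As a rough illustration of the leftmost-prefix rule quoted above, here is a small sketch using Python's built-in sqlite3 module. SQLite is only a stand-in here (its planner and plan output differ from MySQL's), and the table t, the payload column and the index name idx_123 are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 INT, col2 INT, col3 INT, payload TEXT)")
conn.execute("CREATE INDEX idx_123 ON t (col1, col2, col3)")


def plan(sql):
    """Return the detail text of SQLite's query plan for the given SQL."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


# (col1, col2) is a leftmost prefix of the index, so it can be searched.
prefix_plan = plan("SELECT payload FROM t WHERE col1 = 1 AND col2 = 2")

# col3 on its own is not a leftmost prefix, so the index does not help.
non_prefix_plan = plan("SELECT payload FROM t WHERE col3 = 3")

print(prefix_plan)
print(non_prefix_plan)
```

The first plan should mention a search on idx_123, while the second should fall back to scanning the table. The same reasoning explains why a query on Y alone gets no help from Rectangle_Index as a range lookup.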
EDIT 1:
I put together a simple test on my copy of MySQL (version 5.0.51a-24+lenny3). It shows that when using both your proper query, and your test query, your Rectangle_Index is being used. However, when using the proper query, the key_len is 8, suggesting that not all the parts of the multi-column index are being used. Perhaps the output from your version of MySQL differs in this respect.
As you'll see from the output below, even when additional indexes are added, the Rectangle_Index index is still chosen in all cases, except when only the Y column is referenced in the query:
CREATE TABLE `Visual_Node` (
`Id` bigint(20) NOT NULL AUTO_INCREMENT,
`X` bigint(20) NOT NULL,
`Y` bigint(20) NOT NULL,
`X_plus_Width` bigint(20) DEFAULT NULL,
`Y_plus_Height` bigint(20) DEFAULT NULL,
PRIMARY KEY (`Id`),
KEY `Rectangle_Index` (`X`,`X_plus_Width`,`Y`,`Y_plus_Height`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
INSERT INTO `Visual_Node` VALUES
(1, 100000, 1000000, 3000000, 3000000),
(2, 200000, 2000000, 4000000, 4000000),
(3, 300000, 3000000, 5000000, 5000000),
(4, 400000, 4000000, 6000000, 6000000),
(5, 500000, 5000000, 7000000, 7000000),
(6, 600000, 6000000, 8000000, 8000000),
(7, 700000, 7000000, 9000000, 9000000),
(8, 800000, 8000000, 10000000, 10000000),
(9, 900000, 9000000, 11000000, 11000000),
(10, 1000000, 10000000, 12000000, 12000000);
EXPLAIN SELECT * FROM Visual_Node n WHERE n.X < 800000;
+----+-------------+-------+-------+-----------------+-----------------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+-----------------+-----------------+---------+------+------+--------------------------+
| 1 | SIMPLE | n | range | Rectangle_Index | Rectangle_Index | 8 | NULL | 5 | Using where; Using index |
+----+-------------+-------+-------+-----------------+-----------------+---------+------+------+--------------------------+
EXPLAIN SELECT * FROM Visual_Node n WHERE n.Y < 800000;
+----+-------------+-------+-------+---------------+-----------------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+-----------------+---------+------+------+--------------------------+
| 1 | SIMPLE | n | index | NULL | Rectangle_Index | 34 | NULL | 10 | Using where; Using index |
+----+-------------+-------+-------+---------------+-----------------+---------+------+------+--------------------------+
EXPLAIN SELECT * FROM Visual_Node n
WHERE 800000 BETWEEN n.X and n.X_plus_Width
AND 1234567 BETWEEN n.Y and n.Y_plus_Height;
+----+-------------+-------+-------+-----------------+-----------------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+-----------------+-----------------+---------+------+------+--------------------------+
| 1 | SIMPLE | n | range | Rectangle_Index | Rectangle_Index | 8 | NULL | 5 | Using where; Using index |
+----+-------------+-------+-------+-----------------+-----------------+---------+------+------+--------------------------+
ALTER TABLE `Visual_Node` ADD INDEX `X_Index` (`X`,`X_plus_Width`);
ALTER TABLE `Visual_Node` ADD INDEX `Y_Index` (`Y`,`Y_plus_Height`);
EXPLAIN SELECT * FROM Visual_Node n WHERE n.X < 800000;
+----+-------------+-------+-------+-------------------------+-----------------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+-------------------------+-----------------+---------+------+------+--------------------------+
| 1 | SIMPLE | n | range | Rectangle_Index,X_Index | Rectangle_Index | 8 | NULL | 5 | Using where; Using index |
+----+-------------+-------+-------+-------------------------+-----------------+---------+------+------+--------------------------+
EXPLAIN SELECT * FROM Visual_Node n WHERE n.Y < 800000;
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
| 1 | SIMPLE | n | range | Y_Index | Y_Index | 8 | NULL | 1 | Using where |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
EXPLAIN SELECT * FROM Visual_Node n
WHERE 800000 BETWEEN n.X and n.X_plus_Width
AND 1234567 BETWEEN n.Y and n.Y_plus_Height;
+----+-------------+-------+-------+---------------------------------+-----------------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------------------------+-----------------+---------+------+------+--------------------------+
| 1 | SIMPLE | n | range | Rectangle_Index,X_Index,Y_Index | Rectangle_Index | 8 | NULL | 5 | Using where; Using index |
+----+-------------+-------+-------+---------------------------------+-----------------+---------+------+------+--------------------------+
ALTER TABLE `Visual_Node` ADD INDEX `X` (`X`,`X_plus_Width`);
ALTER TABLE `Visual_Node` ADD INDEX `X_plus_Width` (`X_plus_Width`);
ALTER TABLE `Visual_Node` ADD INDEX `Y` (`Y`);
ALTER TABLE `Visual_Node` ADD INDEX `Y_plus_Height` (`Y_plus_Height`);
EXPLAIN SELECT * FROM Visual_Node n WHERE n.X < 800000;
+----+-------------+-------+-------+---------------------------+-----------------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------------------+-----------------+---------+------+------+--------------------------+
| 1 | SIMPLE | n | range | Rectangle_Index,X_Index,X | Rectangle_Index | 8 | NULL | 5 | Using where; Using index |
+----+-------------+-------+-------+---------------------------+-----------------+---------+------+------+--------------------------+
EXPLAIN SELECT * FROM Visual_Node n WHERE n.Y < 800000;
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
| 1 | SIMPLE | n | range | Y_Index,Y | Y_Index | 8 | NULL | 1 | Using where |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
EXPLAIN SELECT * FROM Visual_Node n
WHERE 800000 BETWEEN n.X and n.X_plus_Width
AND 1234567 BETWEEN n.Y and n.Y_plus_Height;
+----+-------------+-------+-------+----------------------------------------------------------------+-----------------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+----------------------------------------------------------------+-----------------+---------+------+------+--------------------------+
| 1 | SIMPLE | n | range | Rectangle_Index,X_Index,Y_Index,X,X_plus_Width,Y,Y_plus_Height | Rectangle_Index | 8 | NULL | 5 | Using where; Using index |
+----+-------------+-------+-------+----------------------------------------------------------------+-----------------+---------+------+------+--------------------------+
Can you post the output from your EXPLAIN query?
What version of MySQL are you using?
EDIT 2:
The Spatial Extensions, as suggested by Naktibalda, are really cool. I'd not used these before, but if you are able to alter your table structure to use them, they may solve your problem.
Curious, I did a little research, and here's the result of my test scripts:
CREATE TABLE `Spatial_Node` (
`Id` bigint(20) NOT NULL AUTO_INCREMENT,
`Rectangle` POLYGON NOT NULL,
PRIMARY KEY (`Id`),
SPATIAL KEY `Rectangle` (`Rectangle`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
INSERT INTO `Spatial_Node` (`Rectangle`)
SELECT Polygon(LineString(
Point(X, Y),
Point(X_plus_Width, Y),
Point(X_plus_Width, Y_plus_Height),
Point(X, Y_plus_Height),
Point(X, Y)
))
FROM Visual_Node;
SELECT AsText(`Rectangle`) FROM Spatial_Node
WHERE MBRContains(Rectangle, Point(100001, 1000001));
+-----------------------------------------------------------------------------------------+
| AsText(`Rectangle`) |
+-----------------------------------------------------------------------------------------+
| POLYGON((100000 1000000,3000000 1000000,3000000 3000000,100000 3000000,100000 1000000)) |
+-----------------------------------------------------------------------------------------+
EXPLAIN SELECT AsText(`Rectangle`) FROM Spatial_Node
WHERE MBRContains(Rectangle, Point(100001, 1000001));
+----+-------------+--------------+-------+---------------+-----------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+-------+---------------+-----------+---------+------+------+-------------+
| 1 | SIMPLE | Spatial_Node | range | Rectangle | Rectangle | 32 | NULL | 1 | Using where |
+----+-------------+--------------+-------+---------------+-----------+---------+------+------+-------------+
I have no idea how the speed will compare, but I've definitely learned something new and exciting today. Thanks Naktibalda :-)
Have you tried changing the index to:
CREATE TABLE `Visual_Node` (
`Id` bigint(20) NOT NULL AUTO_INCREMENT,
`X` bigint(20) NOT NULL,
`Y` bigint(20) NOT NULL,
`X_plus_Width` bigint(20) DEFAULT NULL,
`Y_plus_Height` bigint(20) DEFAULT NULL,
PRIMARY KEY (`Id`),
KEY `X_Index` (`X`),
KEY `Y_Index` (`Y`),
KEY `X_Width_Index` (`X_plus_Width`),
KEY `Y_Height_Index` (`Y_plus_Height`)
) ENGINE=InnoDB AUTO_INCREMENT=4340743 DEFAULT CHARSET=latin1
Judging by your AUTO_INCREMENT value, you'll probably want to test this with a smaller set of data.
Related
I'm trying to run an update query on the answer_date column of table P. I want to fill answer_date in each row of P with the unique date from the create_date column of H, where P.ID1 matches H.ID1 and where P.acceptance_date is not empty.
The query takes a long while to run, so I checked answer_date for interim changes, but the entire column is still empty, just as it was when it was created.
B-tree indexes exist on all the mentioned columns.
Is there something wrong with the query?
UPDATE P
SET answer_date = subquery.date
FROM (SELECT DISTINCT H.create_date as date
FROM H, P
where H.postid=P.acceptance_id
) AS subquery
WHERE P.acceptance_id is not null;
Table schema is as follows:
Table "public.P"
Column | Type | Modifiers | Storage | Stats target | Description
-----------------------+-----------------------------+-----------+----------+--------------+-------------
id | integer | not null | plain | |
acceptance_id | integer | | plain | |
answer_date | timestamp without time zone | | plain | |
Indexes:
"posts_pkey" PRIMARY KEY, btree (id)
"posts_accepted_answer_id_idx" btree (acceptance_id) WITH (fillfactor='100')
and
Table "public.H"
Column | Type | Modifiers | Storage | Stats target | Description
-------------------+-----------------------------+-----------+----------+--------------+-------------
id | integer | not null | plain | |
postid | integer | | plain | |
create_date | timestamp without time zone | not null | plain | |
Indexes:
"H_pkey" PRIMARY KEY, btree (id)
"ph_creation_date_idx" btree (create_date) WITH (fillfactor='100')
Table P has 70 million rows and H has 220 million rows.
The Postgres version is 9.6.
The hardware is a Windows laptop with 8 GB of RAM.
I'm trying to execute the following query several times:
SELECT st2.stop_id AS to_stop_id,
TIME_TO_SEC(
ADDTIME(TIMEDIFF(MIN(st1.time), %time),
TIMEDIFF(st2.time, st2.time))) AS duration
FROM stop_times st1,
stop_times st2,
trips tr,
calendar cal
WHERE tr.service_id = cal.service_id
AND tr.trip_id = st1.trip_id
AND st1.trip_id = st2.trip_id
AND st1.stop_id = %sid
AND st1.stop_seq +1 = st2.stop_seq
AND st1.time > %time
AND DATE(NOW()) BETWEEN cal.start_date AND
cal.end_date
GROUP BY st2.stop_id
However, it runs extremely slowly. I indexed the following attributes:
+------------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+------------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| stop_times | 0 | st_id | 1 | st_id | A | 11431583 | NULL | NULL | | BTREE | | |
| stop_times | 1 | fk_tid_s | 1 | trip_id | A | 1039234 | NULL | NULL | YES | BTREE | | |
| stop_times | 1 | st_per_sid | 1 | stop_id | A | 33135 | NULL | NULL | YES | BTREE | | |
| calendar | 0 | PRIMARY | 1 | service_id | A | 5206 | NULL | NULL | | BTREE | | |
| calendar | 0 | PRIMARY | 1 | service_id | A | 5206 | NULL | NULL | | BTREE | | |
| trips | 0 | PRIMARY | 1 | trip_id | A | 449489 | NULL | NULL | | BTREE | | |
| trips | 1 | fk_rid | 1 | route_id | A | 1937 | NULL | NULL | YES | BTREE | | |
| trips | 1 | fk_sid | 1 | service_id | A | 7749 | NULL | NULL | YES | BTREE | | |
+------------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
(For some reason, st_id is not shown as a PRIMARY KEY, but it is one. I don't know if that's important, but just in case.)
I ran EXPLAIN on this query and it gave me the following output:
+------+-------------+-------+--------+-------------------------------------------------+---------------------+---------+------------------------------+------+---------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+--------+-------------------------------------------------+---------------------+---------+------------------------------+------+---------------------------------------------------------------------+
| 1 | SIMPLE | st1 | range | comp_uniq_st_seq,st_per_sid,comp_uniq_stid_time | comp_uniq_stid_time | 9 | NULL | 1396 | Using index condition; Using where; Using temporary; Using filesort |
| 1 | SIMPLE | tr | eq_ref | PRIMARY,fk_sid | PRIMARY | 8 | reseau_ratp.st1.trip_id | 1 | Using where |
| 1 | SIMPLE | cal | eq_ref | PRIMARY,comp_sid_date_en,comp_sid_date_st | PRIMARY | 4 | reseau_ratp.tr.service_id | 1 | Using where |
| 1 | SIMPLE | st2 | ref | comp_uniq_st_seq | comp_uniq_st_seq | 14 | reseau_ratp.st1.trip_id,func | 1 | Using index condition |
+------+-------------+-------+--------+-------------------------------------------------+---------------------+---------+------------------------------+------+---------------------------------------------------------------------+
What should I do to get this query running faster?
EDIT :
Query using the requested syntax :
SELECT st2.stop_id AS to_stop_id,
TIME_TO_SEC(
ADDTIME(TIMEDIFF(MIN(st1.time), %time),
TIMEDIFF(st2.time, st2.time))) AS duration
FROM stop_times st1
INNER JOIN stop_times st2
ON st1.trip_id = st2.trip_id AND st1.stop_seq + 1 = st2.stop_seq
INNER JOIN trips tr
ON tr.trip_id = st1.trip_id
INNER JOIN calendar cal
ON tr.service_id = cal.service_id
WHERE st1.stop_id = %sid
AND st1.time > %time
AND cal.start_date <= NOW()
AND cal.end_date >= NOW()
GROUP BY st2.stop_id
Here is SHOW CREATE TABLE for stop_times:
CREATE TABLE `stop_times` (
`trip_id` bigint(10) unsigned DEFAULT NULL,
`stop_id` int(10) DEFAULT NULL,
`time` time DEFAULT NULL,
`stop_seq` int(10) unsigned DEFAULT NULL,
UNIQUE KEY `comp_uniq_st_seq` (`trip_id`,`stop_seq`),
KEY `comp_uniq_stid_time` (`stop_id`,`time`),
CONSTRAINT `fk_sid_s` FOREIGN KEY (`stop_id`) REFERENCES `stops` (`stop_id`),
CONSTRAINT `fk_tid_s` FOREIGN KEY (`trip_id`) REFERENCES `trips` (`trip_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
For calendar:
CREATE TABLE `calendar` (
`service_id` int(10) unsigned NOT NULL,
`start_date` date DEFAULT NULL,
`end_date` date DEFAULT NULL,
PRIMARY KEY (`service_id`),
KEY `comp_sid_date_en` (`service_id`,`end_date`),
KEY `comp_sid_date_st` (`service_id`,`start_date`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
And for trips:
CREATE TABLE `trips` (
`trip_id` bigint(10) unsigned NOT NULL DEFAULT '0',
`route_id` int(10) unsigned DEFAULT NULL,
`service_id` int(10) unsigned DEFAULT NULL,
`trip_headsign` varchar(15) DEFAULT NULL,
`trip_short_name` varchar(15) DEFAULT NULL,
`direction_id` tinyint(1) DEFAULT NULL,
PRIMARY KEY (`trip_id`),
KEY `fk_rid` (`route_id`),
KEY `fk_sid` (`service_id`),
CONSTRAINT `fk_rid` FOREIGN KEY (`route_id`) REFERENCES `routes` (`route_id`),
CONSTRAINT `fk_sid` FOREIGN KEY (`service_id`) REFERENCES `calendar` (`service_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
st1 needs this composite index: INDEX(stop_id, time)
Please use the JOIN ... ON syntax.
Please provide SHOW CREATE TABLE.
Here is a Cookbook on creating INDEXes from a SELECT.
(Edit)
Calendar is trickier to handle, and there is no "good" index. These may help:
INDEX(service_id, start_date)
INDEX(service_id, end_date)
plus, reformulate AND DATE(NOW()) BETWEEN cal.start_date AND cal.end_date into
AND cal.start_date <= NOW()
AND cal.end_date >= NOW()
(Edit 2)
Wherever practical, say NOT NULL. This is probably especially important in stop_times which does not have a PRIMARY KEY. Change the two columns in UNIQUE KEY comp_uniq_st_seq (trip_id,stop_seq) to be NOT NULL and turn it into PRIMARY KEY (trip_id, stop_seq). This will allow the performance benefits of "the PK is clustered with the data" to kick in.
Now that I see the CREATE TABLE for Calendar, and that service_id is the PRIMARY KEY, the two indexes I suggested for it are probably useless. (Again, this relates to "clustering".)
My Cookbook for building indexes may come in handy.
I am trying to build a new table from the values in an existing table that are NOT contained in another table (the query below obviously checks for "contained" rather than "not contained"). Following is my table structure:
mysql> explain t1;
+-----------+---------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+---------------------+------+-----+---------+-------+
| id | int(11) | YES | | NULL | |
| point | bigint(20) unsigned | NO | MUL | 0 | |
+-----------+---------------------+------+-----+---------+-------+
mysql> explain whitelist;
+-------------+---------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+---------------------+------+-----+---------+----------------+
| id | bigint(20) unsigned | NO | PRI | NULL | auto_increment |
| x | bigint(20) unsigned | YES | | NULL | |
| y | bigint(20) unsigned | YES | | NULL | |
| geonetwork | linestring | NO | MUL | NULL | |
+-------------+---------------------+------+-----+---------+----------------+
My query looks like this:
SELECT point
FROM t1
WHERE EXISTS(SELECT source
FROM whitelist
WHERE MBRContains(geonetwork, GeomFromText(CONCAT('POINT(', t1.point, ' 0)'))));
Explain:
+----+--------------------+--------------------+-------+-------------------+-----------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+--------------------+-------+-------------------+-----------+---------+------+------+--------------------------+
| 1 | PRIMARY | t1 | index | NULL | point | 8 | NULL | 1001 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | whitelist | ALL | _geonetwork | NULL | NULL | NULL | 3257 | Using where |
+----+--------------------+--------------------+-------+-------------------+-----------+---------+------+------+--------------------------+
The query is taking 6 seconds to execute for 1000 records in t1 which is unacceptable for me. How can I rewrite this query using Joins (or perhaps a faster way if that exists) if I don't have a column to join on? Even a stored procedure is acceptable I guess in the worst case. My goal is to finally create a new table containing entries from t1. Any suggestions?
Unless the query optimizer is failing, a WHERE EXISTS construct should result in the same plan as a join with a GROUP BY clause. Look at optimizing MBRContains(geonetwork, GeomFromText(CONCAT('POINT(', t1.point, ' 0)'))); that's probably where your query is spending all its time. I don't have a suggestion for that, but here's your query written with a JOIN:
Select t1.point
from t1
join whitelist on MBRContains(whitelist.geonetwork, GeomFromText(CONCAT('POINT(', t1.point, ' 0)')))
group by t1.point
;
or to get the points in t1 not in whitelist:
Select t1.point
from t1
left join whitelist on MBRContains(whitelist.geonetwork, GeomFromText(CONCAT('POINT(', t1.point, ' 0)')))
where whitelist.id is null
;
This seems like a case where denormalizing t1 might be beneficial. Adding a GeomFrmTxt column holding the value of GeomFromText(CONCAT('POINT(', t1.point, ' 0)')) could speed up the query you already have.
I have three columns in my table.
+-----------+-----------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+-----------------------+------+-----+---------+-------+
| hash | mediumint(8) unsigned | NO | PRI | 0 | |
| nums | int(10) unsigned | NO | PRI | 0 | |
| acc | smallint(5) unsigned | NO | PRI | 0 | |
+-----------+-----------------------+------+-----+---------+-------+
I am expecting duplicates in my data so I went ahead and added a unique constraint:
ALTER TABLE nt_accs ADD UNIQUE(hash,nums,acc);
I have about 500 million rows to insert into this table, and the table has been partitioned using a RANGE on nums into about 20 partitions.
Does the unique constraint slow down inserts? How does this differ from just making them a primary key instead of imposing a unique constraint?
I have a lot of GROUP BY type queries using both the hash and nums columns. Should I add a covering index on both, or just add individual indexes?
EDIT:
Explain plan after partitioning and inserting some test data
1. mysql> explain partitions select * from nt_accs;
+----+-------------+-----------+---------------------------------------------------------------------------+-------+---------------+----------+---------+------+------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+---------------------------------------------------------------------------+-------+---------------+----------+---------+------+------+-------------+
| 1 | SIMPLE | nt_accs | p0,p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15,p16,p17,p18,p19,p20 | index | NULL | hash | 7 | NULL | 10 | Using index |
+----+-------------+-----------+---------------------------------------------------------------------------+-------+---------------+----------+---------+------+------+-------------+
1 row in set (0.00 sec)
2. mysql> explain partitions select * from nt_accs WHERE nums=1504887570;
+----+-------------+-----------+------------+-------+---------------+----------+---------+------+------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+------------+-------+---------------+----------+---------+------+------+--------------------------+
| 1 | SIMPLE | nt_accs | p7 | index | NULL | hash | 7 | NULL | 10 | Using where; Using index |
+----+-------------+-----------+------------+-------+---------------+----------+---------+------+------+--------------------------+
1 row in set (0.00 sec)
3. mysql> explain partitions select * from nt_accs WHERE hash=2347200;
+----+-------------+-----------+---------------------------------------------------------------------------+------+---------------+----------+---------+-------+------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+---------------------------------------------------------------------------+------+---------------+----------+---------+-------+------+-------------+
| 1 | SIMPLE | nt_accs | p0,p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15,p16,p17,p18,p19,p20 | ref | hash | hash | 3 | const | 27 | Using index |
+----+-------------+-----------+---------------------------------------------------------------------------+------+---------------+----------+---------+-------+------+-------------+
1 row in set (0.00 sec)
4. mysql> EXPLAIN PARTITIONS SELECT hash, count(distinct nums) FROM nt_accs GROUP BY hash;
+----+-------------+-----------+---------------------------------------------------------------------------+-------+---------------+----------+---------+------+------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+---------------------------------------------------------------------------+-------+---------------+----------+---------+------+------+-------------+
| 1 | SIMPLE | nt_accs | p0,p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15,p16,p17,p18,p19,p20 | index | NULL | hash | 7 | NULL | 10 | Using index |
+----+-------------+-----------+---------------------------------------------------------------------------+-------+---------------+----------+---------+------+------+-------------+
1 row in set (0.00 sec)
5. mysql> EXPLAIN PARTITIONS SELECT nums, count(distinct hash) FROM nt_accs GROUP BY nums;
+----+-------------+-----------+---------------------------------------------------------------------------+-------+---------------+----------+---------+------+------+-----------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+---------------------------------------------------------------------------+-------+---------------+----------+---------+------+------+-----------------------------+
| 1 | SIMPLE | nt_accs | p0,p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15,p16,p17,p18,p19,p20 | index | NULL | hash | 7 | NULL | 10 | Using index; Using filesort |
+----+-------------+-----------+---------------------------------------------------------------------------+-------+---------------+----------+---------+------+------+-----------------------------+
1 row in set (0.00 sec)
I am perfectly fine with the first and second queries but I'm not sure about the performance of the 3rd, 4th and 5th. Is there anything else I can do at this point to optimize this?
Does the unique constraint slow down inserts? How does this differ from just making them a primary key instead of imposing a unique constraint?
Yes, an index (MySQL implements a unique constraint as an index) will slow down inserts.
The same goes for a primary key, which is why tables expecting high insertion loads (e.g. for logging) often do not have a primary key defined: it makes insertions faster.
I have a lot of GROUP BY type queries using both the hash and nums columns. Should I add a covering index on both, or just add individual indexes?
The only way to definitely know is to test & check the EXPLAIN plan.
UPDATE
In light of the provided EXPLAIN plans, I don't see a concern with the 3rd and 4th queries. MySQL generally uses only one index per table in each SELECT. The fifth query might benefit from a covering index.
Addendum
Just want to make sure that you are aware that:
ALTER TABLE nt_accs ADD UNIQUE(hash, nums, acc);
...means the combination of the three column values will be unique. For example, these rows are all valid and the unique constraint will allow them:
hash nums acc
----------------
1 1 1
1 1 2
1 2 1
2 1 1
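The behaviour described above can be checked with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made only so the example is self-contained); the table and column names follow the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE nt_accs (
        hash INT NOT NULL,
        nums INT NOT NULL,
        acc  INT NOT NULL,
        UNIQUE (hash, nums, acc)
    )
    """
)

# The four rows from the table above differ in at least one column,
# so the composite unique constraint accepts all of them.
rows = [(1, 1, 1), (1, 1, 2), (1, 2, 1), (2, 1, 1)]
conn.executemany("INSERT INTO nt_accs VALUES (?, ?, ?)", rows)

# Only an exact repeat of all three values is rejected.
try:
    conn.execute("INSERT INTO nt_accs VALUES (1, 1, 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM nt_accs").fetchone()[0]
print(duplicate_rejected, row_count)
```

If the intent was for hash, nums or acc to be unique individually, each column would need its own unique constraint instead.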
I have two large MySQL tables: Articles and ArticleTopics. I want to query the DB and retrieve the last 30 articles published for a given topicId. My current query is rather slow. Any ideas on how to improve it?
More details:
The tables:
Articles (~1 million rows)
+-----------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+--------------+------+-----+---------+----------------+
| articleId | int(11) | NO | PRI | NULL | auto_increment |
| title | varchar(255) | NO | | NULL | |
| content | longtext | NO | | NULL | |
| pubDate | datetime | NO | MUL | NULL | |
+-----------+--------------+------+-----+---------+----------------+
ArticleTopics (~10 million rows)
+-----------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+--------------+------+-----+---------+-------+
| articleId | int(11) | NO | MUL | NULL | |
| topicId | int(11) | NO | MUL | NULL | |
+-----------+--------------+------+-----+---------+-------+
And my query:
SELECT a.articleId, a.pubDate
FROM Articles a, ArticleTopics t
WHERE t.articleId=a.articleId AND t.topicId=3364
ORDER BY a.pubDate DESC LIMIT 30;
And the EXPLAIN of the query:
+----+-------------+-------+--------+-------------------------------------+-------------------+---------+-------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-------------------------------------+-------------------+---------+-------------------+------+----------------------------------------------+
| 1 | SIMPLE | t | ref | articleId,topicId,topicId_articleId | topicId_articleId | 4 | const | 4281 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | a | eq_ref | PRIMARY,articleId_pubDate | PRIMARY | 4 | t.articleId | 1 | |
+----+-------------+-------+--------+-------------------------------------+-------------------+---------+-------------------+------+----------------------------------------------+
The slowness, I believe, is coming from the ORDER BY a.pubDate DESC. I can greatly improve performance by faking it a bit by instead doing an ORDER BY t.articleId DESC and having an index in ArticleTopics on both articleId & topicId, since in general, the articleIds are in the same order as pubDates. They are not always, however, so it's not ideal. I'd like to be able to sort it on the pubDate.
Update: Added EXPLAIN.
You can rewrite the query in various ways to see if it speeds things up:
SELECT a.articleId, a.pubDate
FROM Articles a
WHERE a.articleId in (
select articleId
from ArticleTopics
where topicId = 3364
)
ORDER BY a.pubDate DESC LIMIT 30;
Or:
SELECT a.articleId, a.pubDate
FROM Articles a
INNER JOIN ArticleTopics t ON t.articleId = a.articleId
WHERE t.topicId = 3364
ORDER BY a.pubDate DESC LIMIT 30;
The important index for both queries is on Articles, and contains articleId as first field.
If article is a large table, with say the entire PDF in binary, you can create an index that fully covers the query. Full coverage means all selected fields are part of the index. For this query, a fully covering index would be (articleId, pubDate).
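To make the idea of a fully covering index concrete, here is a minimal sketch using Python's built-in sqlite3 module. SQLite stands in for MySQL purely for illustration (MySQL reports the same situation as "Using index" in the Extra column), the table is a simplified version of Articles without a primary key, and the index name is borrowed from the EXPLAIN output in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Articles ("
    "articleId INT NOT NULL, title TEXT, content TEXT, pubDate TEXT)"
)
conn.execute("CREATE INDEX articleId_pubDate ON Articles (articleId, pubDate)")


def plan(sql):
    """Return the detail text of SQLite's query plan for the given SQL."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


# Every selected column is part of the index, so the table rows
# (including the large content column) never need to be read.
covered = plan("SELECT articleId, pubDate FROM Articles WHERE articleId = 42")

# title is not in the index, so each match needs a lookup in the table.
not_covered = plan("SELECT articleId, title FROM Articles WHERE articleId = 42")

print(covered)
print(not_covered)
```

The first plan should report a covering index, while the second uses the same index only to locate rows and then visits the table for title.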
At this point, do you have an index on topicId? If so, does the index contain only the topicId field?
And maybe you can post the output of the EXPLAIN query.