Does adding a unique constraint slow down things? - sql

I have three columns in my table.
+-----------+-----------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+-----------------------+------+-----+---------+-------+
| hash | mediumint(8) unsigned | NO | PRI | 0 | |
| nums | int(10) unsigned | NO | PRI | 0 | |
| acc | smallint(5) unsigned | NO | PRI | 0 | |
+-----------+-----------------------+------+-----+---------+-------+
I am expecting duplicates in my data so I went ahead and added a unique constraint:
ALTER TABLE nt_accs ADD UNIQUE(hash,nums,acc);
I have about 500 million rows to insert into this table, and the table has been partitioned using a RANGE on nums into about 20 partitions.
Does the unique constraint slow down inserts? How does this differ from just making the columns a primary key instead of imposing a unique constraint?
I have a lot of GROUP BY type queries using both the hash and nums columns. Should I add a covering index on both columns, or just add individual indexes?
EDIT:
Explain plan after partitioning and inserting some test data
1. mysql> explain partitions select * from nt_accs;
+----+-------------+-----------+---------------------------------------------------------------------------+-------+---------------+----------+---------+------+------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+---------------------------------------------------------------------------+-------+---------------+----------+---------+------+------+-------------+
| 1 | SIMPLE | nt_accs | p0,p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15,p16,p17,p18,p19,p20 | index | NULL | hash | 7 | NULL | 10 | Using index |
+----+-------------+-----------+---------------------------------------------------------------------------+-------+---------------+----------+---------+------+------+-------------+
1 row in set (0.00 sec)
2. mysql> explain partitions select * from nt_accs WHERE nums=1504887570;
+----+-------------+-----------+------------+-------+---------------+----------+---------+------+------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+------------+-------+---------------+----------+---------+------+------+--------------------------+
| 1 | SIMPLE | nt_accs | p7 | index | NULL | hash | 7 | NULL | 10 | Using where; Using index |
+----+-------------+-----------+------------+-------+---------------+----------+---------+------+------+--------------------------+
1 row in set (0.00 sec)
3. mysql> explain partitions select * from nt_accs WHERE hash=2347200;
+----+-------------+-----------+---------------------------------------------------------------------------+------+---------------+----------+---------+-------+------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+---------------------------------------------------------------------------+------+---------------+----------+---------+-------+------+-------------+
| 1 | SIMPLE | nt_accs | p0,p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15,p16,p17,p18,p19,p20 | ref | hash | hash | 3 | const | 27 | Using index |
+----+-------------+-----------+---------------------------------------------------------------------------+------+---------------+----------+---------+-------+------+-------------+
1 row in set (0.00 sec)
4. mysql> EXPLAIN PARTITIONS SELECT hash, count(distinct nums) FROM nt_accs GROUP BY hash;
+----+-------------+-----------+---------------------------------------------------------------------------+-------+---------------+----------+---------+------+------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+---------------------------------------------------------------------------+-------+---------------+----------+---------+------+------+-------------+
| 1 | SIMPLE | nt_accs | p0,p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15,p16,p17,p18,p19,p20 | index | NULL | hash | 7 | NULL | 10 | Using index |
+----+-------------+-----------+---------------------------------------------------------------------------+-------+---------------+----------+---------+------+------+-------------+
1 row in set (0.00 sec)
5. mysql> EXPLAIN PARTITIONS SELECT nums, count(distinct hash) FROM nt_accs GROUP BY nums;
+----+-------------+-----------+---------------------------------------------------------------------------+-------+---------------+----------+---------+------+------+-----------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+---------------------------------------------------------------------------+-------+---------------+----------+---------+------+------+-----------------------------+
| 1 | SIMPLE | nt_accs | p0,p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15,p16,p17,p18,p19,p20 | index | NULL | hash | 7 | NULL | 10 | Using index; Using filesort |
+----+-------------+-----------+---------------------------------------------------------------------------+-------+---------------+----------+---------+------+------+-----------------------------+
1 row in set (0.00 sec)
I am perfectly fine with the first and second queries but I'm not sure about the performance of the 3rd, 4th and 5th. Is there anything else I can do at this point to optimize this?

Does the unique constraint slow down inserts? How does this differ from just making the columns a primary key instead of imposing a unique constraint?
Yes, an index (MySQL implements a unique constraint as an index) will slow down inserts.
The same goes for a primary key, which is why tables expecting high insertion loads (e.g. for logging) are sometimes created without a primary key, to make insertions faster.
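Since the question expects duplicates in the incoming data, here is a minimal sketch of loading rows with the unique key in place. It assumes that silently skipping duplicates is acceptable; the literal values are made up for illustration.

```sql
-- Rows that collide with the unique key on (hash, nums, acc) are skipped
-- with a warning instead of aborting the whole statement.
INSERT IGNORE INTO nt_accs (hash, nums, acc)
VALUES (2347200, 1504887570, 1),
       (2347200, 1504887570, 1),  -- duplicate of the row above; skipped
       (2347200, 1504887570, 2);

-- Lists one warning per skipped row.
SHOW WARNINGS;
```

Each inserted row still pays the cost of the uniqueness check against the index, so this does not remove the insert overhead described above.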
I have a lot of GROUP BY type queries using both the hash and nums columns. Should I add a covering index on both columns, or just add individual indexes?
The only way to definitely know is to test & check the EXPLAIN plan.
UPDATE
In light of the provided explain plans, I don't see the concern for the 3rd & 4th queries. MySQL generally uses only one index per table in each SELECT. The fifth query might benefit from a covering index.
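One way to test that suggestion for the fifth query is sketched below. The index name idx_nums_hash is hypothetical, and putting nums first is an assumption based on that query grouping by nums. Whether the filesort actually disappears has to be checked in the plan.

```sql
-- Hypothetical index name; the GROUP BY column leads, and hash is included
-- so that the query can be answered from the index alone.
ALTER TABLE nt_accs ADD INDEX idx_nums_hash (nums, hash);

-- Re-run the plan for the fifth query and compare the key and Extra columns
-- with the earlier output.
EXPLAIN PARTITIONS
SELECT nums, COUNT(DISTINCT hash)
FROM nt_accs
GROUP BY nums;
```

Every additional index adds to the insert cost, so with about 500 million rows it is worth measuring this on a test copy first.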
Addendum
Just want to make sure that you are aware that:
ALTER TABLE nt_accs ADD UNIQUE(hash, nums, acc);
...means the combination of the three column values will be unique. In other words, the unique constraint will allow all of these rows:
hash nums acc
----------------
1 1 1
1 1 2
1 2 1
2 1 1
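The same point as a runnable sketch, using a scratch table (the names nt_accs_demo and uq_hash_nums_acc are hypothetical) with the column types from the question:

```sql
-- Scratch table mirroring the three columns of nt_accs.
CREATE TABLE nt_accs_demo (
  hash MEDIUMINT UNSIGNED NOT NULL DEFAULT 0,
  nums INT UNSIGNED NOT NULL DEFAULT 0,
  acc  SMALLINT UNSIGNED NOT NULL DEFAULT 0,
  UNIQUE KEY uq_hash_nums_acc (hash, nums, acc)
);

-- All four rows are accepted: each (hash, nums, acc) combination is distinct,
-- even though individual column values repeat.
INSERT INTO nt_accs_demo (hash, nums, acc)
VALUES (1, 1, 1), (1, 1, 2), (1, 2, 1), (2, 1, 1);

-- Rejected with a duplicate-entry error (MySQL error 1062):
-- the full combination (1, 1, 1) already exists.
INSERT INTO nt_accs_demo (hash, nums, acc) VALUES (1, 1, 1);
```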

Related

Why is the query still so fast when I filter on a non-indexed column?

I am learning about database indexing.
Here are the indexes of a table. The table has 330k records.
mysql> show index from employee;
+----------+------------+-------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+----------+------------+-------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| employee | 0 | PRIMARY | 1 | id | A | 297383 | NULL | NULL | | BTREE | | | YES | NULL |
| employee | 0 | ak_employee | 1 | personal_code | A | 297383 | NULL | NULL | | BTREE | | | YES | NULL |
| employee | 1 | idx_email | 1 | email | A | 297383 | NULL | NULL | | BTREE | | | YES | NULL |
+----------+------------+-------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
As you can see, there are only three indexes on this table.
Now I want to query with a WHERE on the birth_date column. I thought it would be very slow because there is no index on birth_date, but when I tried the query, I found it is very fast.
mysql> select sql_no_cache *
-> from employee
-> where birth_date > '1955-11-11'
-> limit 100
-> ;
100 rows in set, 1 warning (0.04 sec)
So I am confused:
Why is it still so fast without an index?
Since it is still fast, why do we need indexes at all?
This is your query:
select sql_no_cache *
from employee
where birth_date > '1955-11-11'
limit 100
There is no index on birth_date, so the query starts reading rows from the data pages. For each record, it compares birth_date and returns the row if it matches. When it has found 100 rows (due to the limit), it stops.
Presumably, it finds 100 rows quite quickly. After all, the median age of the United States is about 38 -- which is (as I write this) a birth year of 1981. By far, most people were born after 1955.
The query would be much slower if you had an order by or group by. That would require reading all the data before returning anything.
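To see the difference, a sketch that compares the plan before and after adding an index. The index name idx_birth_date is hypothetical, and the exact plans depend on the MySQL version and the data.

```sql
-- Without an index on birth_date, the ORDER BY forces MySQL to read and sort
-- all matching rows before it can return the first 100.
EXPLAIN
SELECT *
FROM employee
WHERE birth_date > '1955-11-11'
ORDER BY birth_date
LIMIT 100;

-- Hypothetical index on the filtered and sorted column.
CREATE INDEX idx_birth_date ON employee (birth_date);

-- With the index, the optimizer can read rows in birth_date order
-- and stop after 100; check the key and Extra columns to confirm.
EXPLAIN
SELECT *
FROM employee
WHERE birth_date > '1955-11-11'
ORDER BY birth_date
LIMIT 100;
```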

Rewriting this subquery?

I am trying to build a new table such that the values in the existing table are NOT contained (but obviously the following checks for contained) in another table. Following is my table structure:
mysql> explain t1;
+-----------+---------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+---------------------+------+-----+---------+-------+
| id | int(11) | YES | | NULL | |
| point | bigint(20) unsigned | NO | MUL | 0 | |
+-----------+---------------------+------+-----+---------+-------+
mysql> explain whitelist;
+-------------+---------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+---------------------+------+-----+---------+----------------+
| id | bigint(20) unsigned | NO | PRI | NULL | auto_increment |
| x | bigint(20) unsigned | YES | | NULL | |
| y | bigint(20) unsigned | YES | | NULL | |
| geonetwork | linestring | NO | MUL | NULL | |
+-------------+---------------------+------+-----+---------+----------------+
My query looks like this:
SELECT point
FROM t1
WHERE EXISTS(SELECT source
FROM whitelist
WHERE MBRContains(geonetwork, GeomFromText(CONCAT('POINT(', t1.point, ' 0)'))));
Explain:
+----+--------------------+--------------------+-------+-------------------+-----------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+--------------------+-------+-------------------+-----------+---------+------+------+--------------------------+
| 1 | PRIMARY | t1 | index | NULL | point | 8 | NULL | 1001 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | whitelist | ALL | _geonetwork | NULL | NULL | NULL | 3257 | Using where |
+----+--------------------+--------------------+-------+-------------------+-----------+---------+------+------+--------------------------+
The query is taking 6 seconds to execute for 1000 records in t1 which is unacceptable for me. How can I rewrite this query using Joins (or perhaps a faster way if that exists) if I don't have a column to join on? Even a stored procedure is acceptable I guess in the worst case. My goal is to finally create a new table containing entries from t1. Any suggestions?
Unless the query optimizer is failing, a WHERE EXISTS construct should result in the same plan as a join with a GROUP BY clause. Look at optimizing MBRContains(geonetwork, GeomFromText(CONCAT('POINT(', t1.point, ' 0)'))); that's probably where your query is spending all its time. I don't have a suggestion for that, but here's your query written with a JOIN:
Select t1.point
from t1
join whitelist on MBRContains(whitelist.geonetwork, GeomFromText(CONCAT('POINT(', t1.point, ' 0)')))
group by t1.point
;
or to get the points in t1 not in whitelist:
Select t1.point
from t1
left join whitelist on MBRContains(whitelist.geonetwork, GeomFromText(CONCAT('POINT(', t1.point, ' 0)')))
where whitelist.id is null
;
This seems like a case where denormalizing t1 might be beneficial. Adding a GeomFrmTxt column holding the value of GeomFromText(CONCAT('POINT(', t1.point, ' 0)')) could speed up the query you already have.
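A minimal sketch of that denormalization, assuming a MySQL version where GeomFromText is available (as in the question; newer versions use ST_GeomFromText instead). The column type POINT and the use of SELECT 1 in the subquery are assumptions.

```sql
-- Store the geometry once per row instead of rebuilding it from text
-- for every whitelist row the subquery inspects.
ALTER TABLE t1 ADD COLUMN GeomFrmTxt POINT NULL;

UPDATE t1
SET GeomFrmTxt = GeomFromText(CONCAT('POINT(', t1.point, ' 0)'));

-- The original EXISTS query, now comparing against the stored geometry.
SELECT t1.point
FROM t1
WHERE EXISTS (SELECT 1
              FROM whitelist w
              WHERE MBRContains(w.geonetwork, t1.GeomFrmTxt));
```

This removes the per-comparison string building and parsing. Whether the index on geonetwork gets used still needs to be checked with EXPLAIN.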

Why does select statement influence query execution and performance in MySQL?

I'm encountering a strange behavior of MySQL.
Query execution (i.e. the use of indexes as shown by explain [QUERY]) and the time needed for execution depend on the elements of the select clause.
Here is a query where the problem occurs:
select distinct
e1.idx, el1.idx, r1.fk_cat, r2.fk_cat
from ent e1, ent_leng el1, rel_c r1, _tax_c t1, rel_c r2, _tax_c t2
where el1.fk_ent=e1.idx
and r1.fk_ent=e1.idx and ((r1.fk_cat=43) or (r1.fk_cat=t1.fk_cat1 and t1.fk_cat2=43))
and r2.fk_ent=e1.idx and ((r2.fk_cat=10) or (r2.fk_cat=t2.fk_cat1 and t2.fk_cat2=10))
The corresponding explain output is:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
+----+-------------+-------+--------+-------------------------+---------+---------+---------------+-------+------------------------------------
| 1 | SIMPLE | el1 | index | fk_ent | fk_ent | 4 | NULL | 15002 | Using index; Using temporary
| 1 | SIMPLE | e1 | eq_ref | PRIMARY | PRIMARY | 4 | DB.el1.fk_ent | 1 | Using index
| 1 | SIMPLE | r1 | ref | fk_ent,fk_cat,fks | fks | 4 | DB.e1.idx | 1 | Using where; Using index
| 1 | SIMPLE | r2 | ref | fk_ent,fk_cat,fks | fks | 4 | DB.el1.fk_ent | 1 | Using index
| 1 | SIMPLE | t1 | index | fk_cat1,fk_cat2,fk_cats | fk_cats | 8 | NULL | 69 | Using where; Using index; Distinct; Using join buffer
| 1 | SIMPLE | t2 | index | fk_cat1,fk_cat2,fk_cats | fk_cats | 8 | NULL | 69 | Using where; Using index; Distinct; Using join buffer
As you can see a one-column index has the same name as the column it belongs to. I also added some useless indexes along with the used ones, just to see if they change the execution (which they don't).
The execution takes ~4.5 seconds.
When I add the column el1.name to the select part (nothing else changed), the index fk_ent in el1 cannot be used any more:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
+----+-------------+-------+--------+-------------------------+---------+---------+---------------+-------+------------------------------------
| 1 | SIMPLE | el1 | ALL | fk_ent | NULL | NULL | NULL | 15002 | Using temporary
The execution now takes ~8.5 seconds.
I always thought that the select part of a query does not influence the usage of indexes by the engine and doesn't affect performance in such a way.
Leaving out the attribute isn't a solution, and there are even more attributes that I have to select.
Even worse, the query in the used form is even a bit more complex and that makes the performance issue a big problem.
So my questions are:
1) What is the reason for this strange behavior?
2) How can I solve the performance problem?
Thanks for your help!
Gred
It's the DISTINCT restriction. You can think of that as another WHERE restriction. When you change the select list, you are really changing the WHERE clause for the DISTINCT restriction, and now the optimizer decides that it has to do a table scan anyway, so it might as well not use your index.
EDIT:
Not sure if this helps, but if I am understanding your data correctly, I think you can get rid of the DISTINCT restriction like this:
select
e1.idx, el1.idx, r1.fk_cat, r2.fk_cat
from ent e1
Inner Join ent_leng el1 ON el1.fk_ent=e1.idx
Inner Join rel_c r1 ON r1.fk_ent=e1.idx
Inner Join rel_c r2 ON r2.fk_ent=e1.idx
where
((r1.fk_cat=43) or Exists(Select 1 From _tax_c t1 Where r1.fk_cat=t1.fk_cat1 and t1.fk_cat2=43))
and
((r2.fk_cat=10) or Exists(Select 1 From _tax_c t2 Where r2.fk_cat=t2.fk_cat1 and t2.fk_cat2=10))
MySQL will return data from an index if possible, saving the entire row from being loaded. In this way, the selected columns can influence the index selection.
With this in mind, it can be much more efficient to add all required columns to an index, especially when selecting only a small subset of columns.
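Applied to the question, that could look like the sketch below. The index name is hypothetical, and it assumes that name is short enough to be indexed in full (a prefix index would not cover the column).

```sql
-- Hypothetical covering index: the join column first, then the columns
-- the query selects from ent_leng.
ALTER TABLE ent_leng ADD INDEX idx_fk_ent_idx_name (fk_ent, idx, name);

-- Re-check the plan with el1.name in the select list; if the index covers
-- the query, the el1 row should show "Using index" again.
EXPLAIN
select distinct
e1.idx, el1.idx, el1.name, r1.fk_cat, r2.fk_cat
from ent e1, ent_leng el1, rel_c r1, _tax_c t1, rel_c r2, _tax_c t2
where el1.fk_ent=e1.idx
and r1.fk_ent=e1.idx and ((r1.fk_cat=43) or (r1.fk_cat=t1.fk_cat1 and t1.fk_cat2=43))
and r2.fk_ent=e1.idx and ((r2.fk_cat=10) or (r2.fk_cat=t2.fk_cat1 and t2.fk_cat2=10));
```

The more columns the select list needs, the wider such an index becomes, so this trade-off is worth measuring rather than assuming.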

SQL LIKE question

I was wondering if there's a drawback (other than bad practice) to using something like this
SELECT * FROM my_table WHERE id LIKE '1';
where id is an integer. I know you're supposed to use id=1, but I am writing a Java program and if everything can use LIKE it'll be a lot easier for me. Also, so far, everything works fine; I get the correct query results, so if there is no drawback I will continue doing it like this.
edit: I am using MySQL.
MySQL will allow it, but will ignore the index:
mysql> describe METADATA_44;
+---------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------+--------------+------+-----+---------+-------+
| AtextId | int(11) | NO | PRI | NULL | |
| num | varchar(128) | YES | | NULL | |
| title | varchar(128) | YES | | NULL | |
| file | varchar(128) | YES | | NULL | |
| context | varchar(128) | YES | | NULL | |
| source | varchar(128) | YES | | NULL | |
+---------+--------------+------+-----+---------+-------+
6 rows in set (0.00 sec)
mysql> explain select * from METADATA_44 where Atextid like '7';
+----+-------------+-------------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | METADATA_44 | ALL | PRIMARY | NULL | NULL | NULL | 591 | Using where |
+----+-------------+-------------+------+---------------+------+---------+------+------+-------------+
mysql> explain select * from METADATA_44 where Atextid=7;
+----+-------------+-------------+-------+---------------+---------+---------+-------+------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+-------+---------------+---------+---------+-------+------+-------+
| 1 | SIMPLE | METADATA_44 | const | PRIMARY | PRIMARY | 4 | const | 1 | |
+----+-------------+-------------+-------+---------------+---------+---------+-------+------+-------+
1 row in set (0.00 sec)
You'd need to look at the Query Execution Plan on your RDBMS to verify that LIKE with no wildcards is treated as efficiently as an = would be. A quick test in SQL Server shows that it would give you an index scan rather than a seek so I guess it doesn't look at that when generating the plan and for SQL Server using = would be much more efficient. I don't have a MySQL install to test against.
Edit: Just to update this SQL Server seems to handle it fine and do a seek when the data type is varchar. When it is run against an int column though you get the scan. This is because it does an implicit conversion to varchar on the int column so can't use the index.
You are better off writing your query as
SELECT * FROM my_table WHERE id = 1;
otherwise MySQL has to compare as strings: LIKE is a string operator, so the integer column id is converted to a string for every row, which is why the index is ignored.
So there is a performance penalty. When you know the type of the column, supply the value according to that type.
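The EXPLAIN output earlier in this thread shows the primary key being ignored for the LIKE form. A sketch of why, assuming MySQL evaluates LIKE as a string comparison and therefore converts the integer column for every row; the explicit CAST below is only an illustration of that implicit conversion.

```sql
-- Roughly what "id LIKE '1'" asks for: the column is wrapped in a
-- conversion, so an index on the raw integer values cannot be used
-- for a lookup.
EXPLAIN SELECT * FROM my_table WHERE CAST(id AS CHAR) LIKE '1';

-- Comparing against a value of the column's own type leaves the column
-- untouched, so the primary key can be used.
EXPLAIN SELECT * FROM my_table WHERE id = 1;
```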
Speed.
Without any wildcards, LIKE should be fine for your needs if speed and efficiency are not a concern.

Eliminate full table scan due to BETWEEN (and GROUP BY)

Description
According to the explain command, there is a range that is causing a query to perform a full table scan (160k rows). How do I keep the range condition and reduce the scanning? I expect the culprit to be:
Y.YEAR BETWEEN 1900 AND 2009 AND
Code
Here is the code that has the range condition (the STATION_DISTRICT is likely superfluous).
SELECT
COUNT(1) as MEASUREMENTS,
AVG(D.AMOUNT) as AMOUNT,
Y.YEAR as YEAR,
MAKEDATE(Y.YEAR,1) as AMOUNT_DATE
FROM
CITY C,
STATION S,
STATION_DISTRICT SD,
YEAR_REF Y FORCE INDEX(YEAR_IDX),
MONTH_REF M,
DAILY D
WHERE
-- For a specific city ...
--
C.ID = 10663 AND
-- Find all the stations within a specific unit radius ...
--
6371.009 *
SQRT(
POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
(COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= 50 AND
-- Get the station district identification for the matching station.
--
S.STATION_DISTRICT_ID = SD.ID AND
-- Gather all known years for that station ...
--
Y.STATION_DISTRICT_ID = SD.ID AND
-- The data before 1900 is shaky; insufficient after 2009.
--
Y.YEAR BETWEEN 1900 AND 2009 AND
-- Filtered by all known months ...
--
M.YEAR_REF_ID = Y.ID AND
-- Whittled down by category ...
--
M.CATEGORY_ID = '003' AND
-- Into the valid daily climate data.
--
M.ID = D.MONTH_REF_ID AND
D.DAILY_FLAG_ID <> 'M'
GROUP BY
Y.YEAR
Update
The SQL is performing a full table scan, which results in MySQL performing a "copy to tmp table", as shown here:
+----+-------------+-------+--------+-----------------------------------+--------------+---------+-------------------------------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-----------------------------------+--------------+---------+-------------------------------+--------+-------------+
| 1 | SIMPLE | C | const | PRIMARY | PRIMARY | 4 | const | 1 | |
| 1 | SIMPLE | Y | range | YEAR_IDX | YEAR_IDX | 4 | NULL | 160422 | Using where |
| 1 | SIMPLE | SD | eq_ref | PRIMARY | PRIMARY | 4 | climate.Y.STATION_DISTRICT_ID | 1 | Using index |
| 1 | SIMPLE | S | eq_ref | PRIMARY | PRIMARY | 4 | climate.SD.ID | 1 | Using where |
| 1 | SIMPLE | M | ref | PRIMARY,YEAR_REF_IDX,CATEGORY_IDX | YEAR_REF_IDX | 8 | climate.Y.ID | 54 | Using where |
| 1 | SIMPLE | D | ref | INDEX | INDEX | 8 | climate.M.ID | 11 | Using where |
+----+-------------+-------+--------+-----------------------------------+--------------+---------+-------------------------------+--------+-------------+
Answer
After using the STRAIGHT_JOIN:
+----+-------------+-------+--------+-----------------------------------+---------------+---------+-------------------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-----------------------------------+---------------+---------+-------------------------------+------+---------------------------------+
| 1 | SIMPLE | C | const | PRIMARY | PRIMARY | 4 | const | 1 | Using temporary; Using filesort |
| 1 | SIMPLE | S | ALL | PRIMARY | NULL | NULL | NULL | 7795 | Using where |
| 1 | SIMPLE | SD | eq_ref | PRIMARY | PRIMARY | 4 | climate.S.STATION_DISTRICT_ID | 1 | Using index |
| 1 | SIMPLE | Y | ref | PRIMARY,STAT_YEAR_IDX | STAT_YEAR_IDX | 4 | climate.S.STATION_DISTRICT_ID | 1650 | Using where |
| 1 | SIMPLE | M | ref | PRIMARY,YEAR_REF_IDX,CATEGORY_IDX | YEAR_REF_IDX | 8 | climate.Y.ID | 54 | Using where |
| 1 | SIMPLE | D | ref | INDEX | INDEX | 8 | climate.M.ID | 11 | Using where |
+----+-------------+-------+--------+-----------------------------------+---------------+---------+-------------------------------+------+---------------------------------+
Related
http://dev.mysql.com/doc/refman/5.0/en/how-to-avoid-table-scan.html
http://dev.mysql.com/doc/refman/5.0/en/where-optimizations.html
Optimize SQL that uses between clause
Thank you!
One request: it looks like you know your data. Add the keyword STRAIGHT_JOIN and see the results:
SELECT STRAIGHT_JOIN ... the rest of your query...
STRAIGHT_JOIN tells MySQL to join the tables in the order you have listed them. Your CITY table is first in the FROM list, indicating you expect it to be the driving table, and your WHERE condition on CITY is the immediate filter. With that in place, it will probably fly through the rest of the query.
Hope it helps. It's worked for me with government data of millions of records queried and joined to 10+ lookup tables, where MySQL was trying to think for me.
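Spelled out against the query from the question, the suggestion might look like this. Dropping the FORCE INDEX(YEAR_IDX) hint is an assumption, made because the plan shown after the change uses STAT_YEAR_IDX instead.

```sql
-- STRAIGHT_JOIN makes MySQL join the tables in the order of the FROM list:
-- CITY, then STATION, STATION_DISTRICT, YEAR_REF, MONTH_REF, DAILY.
SELECT STRAIGHT_JOIN
  COUNT(1) AS MEASUREMENTS,
  AVG(D.AMOUNT) AS AMOUNT,
  Y.YEAR AS YEAR,
  MAKEDATE(Y.YEAR, 1) AS AMOUNT_DATE
FROM
  CITY C,
  STATION S,
  STATION_DISTRICT SD,
  YEAR_REF Y,
  MONTH_REF M,
  DAILY D
WHERE
  C.ID = 10663 AND
  -- Stations within 50 units of the city.
  6371.009 * SQRT(
    POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
    (COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
     POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2))) <= 50 AND
  S.STATION_DISTRICT_ID = SD.ID AND
  Y.STATION_DISTRICT_ID = SD.ID AND
  Y.YEAR BETWEEN 1900 AND 2009 AND
  M.YEAR_REF_ID = Y.ID AND
  M.CATEGORY_ID = '003' AND
  M.ID = D.MONTH_REF_ID AND
  D.DAILY_FLAG_ID <> 'M'
GROUP BY
  Y.YEAR;
```

Because the join order is now fixed by hand, it is worth re-running EXPLAIN whenever the data distribution changes.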
To do efficient BETWEEN queries, you are going to want a B-tree index on your YEAR column. For example:
CREATE INDEX id_index USING BTREE ON YEAR_REF (YEAR);
BTREE indexes allow for efficient range queries. If this is in fact the root problem, an index like this should get rid of the full table scan so that only the part of the table within the range is scanned. Read more about B-trees on Wikipedia.
However, as with any optimisation advice, you should measure to make sure that you don't do more harm than good.
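A related option to measure is a composite index that serves both the join on STATION_DISTRICT_ID and the range on YEAR. The index name below is hypothetical; the plan in the question mentions an index called STAT_YEAR_IDX, but its definition is not shown, so this is only a guess at something similar.

```sql
-- Equality column first, range column second, so the index can narrow by
-- district and then scan only the years inside the range.
CREATE INDEX station_year_idx USING BTREE
  ON YEAR_REF (STATION_DISTRICT_ID, YEAR);

-- Check whether the optimizer picks it up for the range condition.
EXPLAIN
SELECT Y.ID
FROM YEAR_REF Y
WHERE Y.STATION_DISTRICT_ID = 1   -- placeholder district id
  AND Y.YEAR BETWEEN 1900 AND 2009;
```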
Can you change from searching within a radius to search in a bounding box?
You know the city so you can calculate a bounding box in your application.
Perhaps this
S.LATITUDE_DECIMAL >= latitude_lower and
S.LATITUDE_DECIMAL <= latitude_upper and
S.LONGITUDE_DECIMAL >= longitude_lower and
S.LONGITUDE_DECIMAL <= longitude_upper
could be a little faster?
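A sketch of how the box limits could be computed, here with MySQL user variables rather than in the application. It assumes the 50 in the original query is kilometres (given the 6371.009 Earth radius) and ignores edge cases near the poles and the 180° meridian.

```sql
SET @radius_km = 50;
SET @earth_km  = 6371.009;

-- Look up the city's coordinates once.
SELECT LATITUDE_DECIMAL, LONGITUDE_DECIMAL
INTO @lat, @lon
FROM CITY
WHERE ID = 10663;

-- Half-height and half-width of the box in degrees; a degree of longitude
-- shrinks with the cosine of the latitude.
SET @dlat = DEGREES(@radius_km / @earth_km);
SET @dlon = @dlat / COS(RADIANS(@lat));

-- Plain range comparisons, which an index on the coordinate columns
-- could serve, unlike the SQRT/POW expression.
SELECT S.*
FROM STATION S
WHERE S.LATITUDE_DECIMAL  BETWEEN @lat - @dlat AND @lat + @dlat
  AND S.LONGITUDE_DECIMAL BETWEEN @lon - @dlon AND @lon + @dlon;
```

The box is slightly larger than the circle, so the original distance condition can be kept as a second filter if exact radius matching matters.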