Rewriting this subquery? - sql

I am trying to build a new table such that the values in the existing table are NOT contained (but obviously the following checks for contained) in another table. Following is my table structure:
mysql> explain t1;
+-----------+---------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+---------------------+------+-----+---------+-------+
| id | int(11) | YES | | NULL | |
| point | bigint(20) unsigned | NO | MUL | 0 | |
+-----------+---------------------+------+-----+---------+-------+
mysql> explain whitelist;
+-------------+---------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+---------------------+------+-----+---------+----------------+
| id | bigint(20) unsigned | NO | PRI | NULL | auto_increment |
| x | bigint(20) unsigned | YES | | NULL | |
| y | bigint(20) unsigned | YES | | NULL | |
| geonetwork | linestring | NO | MUL | NULL | |
+-------------+---------------------+------+-----+---------+----------------+
My query looks like this:
SELECT point
FROM t1
WHERE EXISTS(SELECT source
FROM whitelist
WHERE MBRContains(geonetwork, GeomFromText(CONCAT('POINT(', t1.point, ' 0)'))));
Explain:
+----+--------------------+--------------------+-------+-------------------+-----------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+--------------------+-------+-------------------+-----------+---------+------+------+--------------------------+
| 1 | PRIMARY | t1 | index | NULL | point | 8 | NULL | 1001 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | whitelist | ALL | _geonetwork | NULL | NULL | NULL | 3257 | Using where |
+----+--------------------+--------------------+-------+-------------------+-----------+---------+------+------+--------------------------+
The query is taking 6 seconds to execute for 1000 records in t1 which is unacceptable for me. How can I rewrite this query using Joins (or perhaps a faster way if that exists) if I don't have a column to join on? Even a stored procedure is acceptable I guess in the worst case. My goal is to finally create a new table containing entries from t1. Any suggestions?

Unless the query optimizer is failing, a WHERE EXISTS construct should result in the same plan as a join with a GROUP clause. Look at optimizing MBRContains(geonetwork, GeomFromText(CONCAT('POINT(', t1.point, ' 0)')))), that's probably where your query is spending all its time. I don't have a suggestion for that, but here's your query written with a JOIN:
Select t1.point
from t1
join whitelist on MBRContains(whitelist.geonetwork, GeomFromText(CONCAT('POINT(', t1.point, ' 0)'))))
group by t1.point
;
or to get the points in t1 not in whitelist:
Select t1.point
from t1
left join whitelist on MBRContains(whitelist.geonetwork, GeomFromText(CONCAT('POINT(', t1.point, ' 0)'))))
where whitelist.id is null
;

This seems like a case where de-nomalizing t1 might be beneficial. Adding a GeomFrmTxt column with a value of GeomFromText(CONCAT('POINT(', t1.point, ' 0)')) could speed up the query you already have.

Related

Replacing for loop by sql

I have SQL for example
show tables from mydb;
It shows the list of table
|table1|
|table2|
|table3|
Then,I use sql sentence for each table.
such as "show full columns from table1 ;"
+----------+--------+-----------+------+-----+---------+----------------+---------------------------------+---------+
| Field | Type | Collation | Null | Key | Default | Extra | Privileges | Comment |
+----------+--------+-----------+------+-----+---------+----------------+---------------------------------+---------+
| id | bigint | NULL | NO | PRI | NULL | auto_increment | select,insert,update,references | |
| user_id | bigint | NULL | NO | MUL | NULL | | select,insert,update,references | |
| group_id | int | NULL | NO | MUL | NULL | | select,insert,update,references | |
+----------+--------+-----------+------+-----+---------+----------------+---------------------------------+---------+
So in this case I can use programming language such as .(this is not correct code just showing the flow)
tables = "show tables from mydb;"
for t in tables:
cmd.execute("show full columns from {t} ;")
However is it possible to do this in sql only?
If you are using MySQL you can use the system view - INFORMATION_SCHEMA.
It contains table name and column name (and other details). No loop is require and you can easily filter by other information, too.
SELECT *
FROM INFORMATION_SCHEMA.COLUMNS
If you are using Microsoft SQL Server, you can use the above command

Why query is still so fast when I operate a non-indexing column?

I am learning indexing of database.
here are indexings of a table. And this table has 330k records.
mysql> show index from employee;
+----------+------------+-------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+----------+------------+-------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| employee | 0 | PRIMARY | 1 | id | A | 297383 | NULL | NULL | | BTREE | | | YES | NULL |
| employee | 0 | ak_employee | 1 | personal_code | A | 297383 | NULL | NULL | | BTREE | | | YES | NULL |
| employee | 1 | idx_email | 1 | email | A | 297383 | NULL | NULL | | BTREE | | | YES | NULL |
+----------+------------+-------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
as you can see, there are only three indexing on this table.
Now I want to query with where on birth_date column, I think it will be very slow because there is no indexing on birth-date column, I when I try query, I found it is very fast.
mysql> select sql_no_cache *
-> from employee
-> where birth_date > '1955-11-11'
-> limit 100
-> ;
100 rows in set, 1 warning (0.04 sec)
So I am confused:
why it is still so fast without indexing?
due to its still fast, why do we still need indexing?
This is your query:
select sql_no_cache *
from employee
where birth_date > '1955-11-11'
limit 100
There are no indexes so the query starts reading the data from the data pages. On each record, it compares the birthdate and returns the row. When it finds 100 (due to the limit) it stops.
Presumably, it finds 100 rows quite quickly. After all, the median age of the United States is about 38 -- which is (as I write this) a birth year of 1981. By far, most people were born after 1955.
The query would be much slower if you had an order by or group by. That would require reading all the data before returning anything.

SQL LIKE question

I was wondering if there's a drawback (other than bad practice) to using something like this
SELECT * FROM my_table WHERE id LIKE '1';
where id is an integer. I know you're supposed to use id=1 but I am writing a java program and if everything can use LIKE it'll be a lot easier for me. Also, so far, everything works fine; I get the correct query results, so if there is no drawback I will continue doing it like this.
edit: I am using MySQL.
MySQL will allow it, but will ignore the index:
mysql> describe METADATA_44;
+---------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------+--------------+------+-----+---------+-------+
| AtextId | int(11) | NO | PRI | NULL | |
| num | varchar(128) | YES | | NULL | |
| title | varchar(128) | YES | | NULL | |
| file | varchar(128) | YES | | NULL | |
| context | varchar(128) | YES | | NULL | |
| source | varchar(128) | YES | | NULL | |
+---------+--------------+------+-----+---------+-------+
6 rows in set (0.00 sec)
mysql> explain select * from METADATA_44 where Atextid like '7';
+----+-------------+-------------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | METADATA_44 | ALL | PRIMARY | NULL | NULL | NULL | 591 | Using where |
+----+-------------+-------------+------+---------------+------+---------+------+------+-------------+
mysql> explain select * from METADATA_44 where Atextid=7;
+----+-------------+-------------+-------+---------------+---------+---------+-------+------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+-------+---------------+---------+---------+-------+------+-------+
| 1 | SIMPLE | METADATA_44 | const | PRIMARY | PRIMARY | 4 | const | 1 | |
+----+-------------+-------------+-------+---------------+---------+---------+-------+------+-------+
1 row in set (0.00 sec)
You'd need to look at the Query Execution Plan on your RDBMS to verify that LIKE with no wildcards is treated as efficiently as an = would be. A quick test in SQL Server shows that it would give you an index scan rather than a seek so I guess it doesn't look at that when generating the plan and for SQL Server using = would be much more efficient. I don't have a MySQL install to test against.
Edit: Just to update this SQL Server seems to handle it fine and do a seek when the data type is varchar. When it is run against an int column though you get the scan. This is because it does an implicit conversion to varchar on the int column so can't use the index.
You are better off writing your query as
SELECT * FROM my_table WHERE id = 1;
otherwise mysql will have to typecast '1' to int which is the type of the column id
so obviously there is a small performance penalty, when u know the type of the column supply the value according to that type
Speed. [15-char filler as there's not much more to say]
Without using any wildcards with LIKE, is should be fine for your needs if the speed/efficiency is something you don't bother with.

Eliminate full table scan due to BETWEEN (and GROUP BY)

Description
According to the explain command, there is a range that is causing a query to perform a full table scan (160k rows). How do I keep the range condition and reduce the scanning? I expect the culprit to be:
Y.YEAR BETWEEN 1900 AND 2009 AND
Code
Here is the code that has the range condition (the STATION_DISTRICT is likely superfluous).
SELECT
COUNT(1) as MEASUREMENTS,
AVG(D.AMOUNT) as AMOUNT,
Y.YEAR as YEAR,
MAKEDATE(Y.YEAR,1) as AMOUNT_DATE
FROM
CITY C,
STATION S,
STATION_DISTRICT SD,
YEAR_REF Y FORCE INDEX(YEAR_IDX),
MONTH_REF M,
DAILY D
WHERE
-- For a specific city ...
--
C.ID = 10663 AND
-- Find all the stations within a specific unit radius ...
--
6371.009 *
SQRT(
POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
(COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= 50 AND
-- Get the station district identification for the matching station.
--
S.STATION_DISTRICT_ID = SD.ID AND
-- Gather all known years for that station ...
--
Y.STATION_DISTRICT_ID = SD.ID AND
-- The data before 1900 is shaky; insufficient after 2009.
--
Y.YEAR BETWEEN 1900 AND 2009 AND
-- Filtered by all known months ...
--
M.YEAR_REF_ID = Y.ID AND
-- Whittled down by category ...
--
M.CATEGORY_ID = '003' AND
-- Into the valid daily climate data.
--
M.ID = D.MONTH_REF_ID AND
D.DAILY_FLAG_ID <> 'M'
GROUP BY
Y.YEAR
Update
The SQL is performing a full table scan, which results in MySQL performing a "copy to tmp table", as shown here:
+----+-------------+-------+--------+-----------------------------------+--------------+---------+-------------------------------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-----------------------------------+--------------+---------+-------------------------------+--------+-------------+
| 1 | SIMPLE | C | const | PRIMARY | PRIMARY | 4 | const | 1 | |
| 1 | SIMPLE | Y | range | YEAR_IDX | YEAR_IDX | 4 | NULL | 160422 | Using where |
| 1 | SIMPLE | SD | eq_ref | PRIMARY | PRIMARY | 4 | climate.Y.STATION_DISTRICT_ID | 1 | Using index |
| 1 | SIMPLE | S | eq_ref | PRIMARY | PRIMARY | 4 | climate.SD.ID | 1 | Using where |
| 1 | SIMPLE | M | ref | PRIMARY,YEAR_REF_IDX,CATEGORY_IDX | YEAR_REF_IDX | 8 | climate.Y.ID | 54 | Using where |
| 1 | SIMPLE | D | ref | INDEX | INDEX | 8 | climate.M.ID | 11 | Using where |
+----+-------------+-------+--------+-----------------------------------+--------------+---------+-------------------------------+--------+-------------+
Answer
After using the STRAIGHT_JOIN:
+----+-------------+-------+--------+-----------------------------------+---------------+---------+-------------------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-----------------------------------+---------------+---------+-------------------------------+------+---------------------------------+
| 1 | SIMPLE | C | const | PRIMARY | PRIMARY | 4 | const | 1 | Using temporary; Using filesort |
| 1 | SIMPLE | S | ALL | PRIMARY | NULL | NULL | NULL | 7795 | Using where |
| 1 | SIMPLE | SD | eq_ref | PRIMARY | PRIMARY | 4 | climate.S.STATION_DISTRICT_ID | 1 | Using index |
| 1 | SIMPLE | Y | ref | PRIMARY,STAT_YEAR_IDX | STAT_YEAR_IDX | 4 | climate.S.STATION_DISTRICT_ID | 1650 | Using where |
| 1 | SIMPLE | M | ref | PRIMARY,YEAR_REF_IDX,CATEGORY_IDX | YEAR_REF_IDX | 8 | climate.Y.ID | 54 | Using where |
| 1 | SIMPLE | D | ref | INDEX | INDEX | 8 | climate.M.ID | 11 | Using where |
+----+-------------+-------+--------+-----------------------------------+---------------+---------+-------------------------------+------+---------------------------------+
Related
http://dev.mysql.com/doc/refman/5.0/en/how-to-avoid-table-scan.html
http://dev.mysql.com/doc/refman/5.0/en/where-optimizations.html
Optimize SQL that uses between clause
Thank you!
ONE Request... It looks like you KNOW your data. Add the keyword "STRAIGHT_JOIN" and see the results...
SELECT STRAIGHT_JOIN ... the rest of your query...
Straight-join tells MySql to DO IT AS I HAVE LISTED. So, your CITY table is the first in the FROM list, thus indicating you expect that to be your primary... Additionally, your WHERE clause of the CITY is the immediate filter. With that being said, it will probably fly through the rest of the query...
Hope it helps... Its worked for me with gov't data of millions of records queried and joined to 10+ lookup tables where mySql was trying to think for me.
in order to do efficient between queries you are going to want a b tree index on your YEAR column. for example:
CREATE INDEX id_index USING BTREE ON YEAR_REF (YEAR);
BTREE indexes allow for efficient range queries, if this is in fact the root problem then having an index like this should get rid of the full table scan and have it only scan the part of the table that is in the range. read more about btrees on wikipedia
However, as with any optimisation advice, you should measure to make sure that you don't do more harm than good.
Can you change from searching within a radius to search in a bounding box?
You know the city so you can calculate a bounding box in your application.
Perhaps this
S.LATITUDE_DECIMAL >= latitude_lower and
S.LATITUDE_DECIMAL <= latitude_upper and
S.LONGITUDE_DECIMAL >= longitude_lower and
S.LONGITUDE_DECIMAL <= longitude_upper
could be a little faster?

MySQL syntax for Join Update

I have two tables that look like this
Train
+----------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------+-------------+------+-----+---------+-------+
| TrainID | varchar(11) | NO | PRI | NULL | |
| Capacity | int(11) | NO | | 50 | |
+----------+-------------+------+-----+---------+-------+
Reservations
+---------------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------------+-------------+------+-----+---------+----------------+
| ReservationID | int(11) | NO | PRI | NULL | auto_increment |
| FirstName | varchar(30) | NO | | NULL | |
| LastName | varchar(30) | NO | | NULL | |
| DDate | date | NO | | NULL | |
| NoSeats | int(2) | NO | | NULL | |
| Route | varchar(11) | NO | | NULL | |
| Train | varchar(11) | NO | | NULL | |
+---------------+-------------+------+-----+---------+----------------+
Currently, I'm trying to create a query that will increment the capacity on a Train if a reservation is cancelled. I know I have to perform a Join, but I'm not sure how to do it in an Update statement. For Example, I know how to get the capacity of a Train with given a certain ReservationID, like so:
select Capacity
from Train
Join Reservations on Train.TrainID = Reservations.Train
where ReservationID = "15";
But I'd like to construct the query that does this -
Increment Train.Capacity by ReservationTable.NoSeats given a ReservationID
If possible, I'd like to know also how to Increment by an arbitrary number of seats. As an aside, I'm planning on deleting the reservation after I perform the increment in a Java transaction. Will the delete effect the transaction?
Thanks for the help!
MySQL supports a multi-table UPDATE syntax, which would look approximately like this:
UPDATE Reservations r JOIN Train t ON (r.Train = t.TrainID)
SET t.Capacity = t.Capacity + r.NoSeats
WHERE r.ReservationID = ?;
You can update the Train table and delete from the Reservations table in the same transaction. As long as you do the update first and then do the delete second, it should work.
Here is another example of an UPDATE statement that contains joins to determine the value that is being updated. In this case, I want to update the transactions.payee_id with the related account payment id, if the payee_id is zero (wasn't assigned).
UPDATE transactions t
JOIN account a ON a.id = t.account_id
JOIN account ap ON ap.id = a.pmt_act_id
SET t.payee_id = a.pmt_act_id
WHERE t.payee_id = 0