How can I optimize this query...? - sql

I have two tables, one for routes and one for airports.
Routes contains just over 9000 rows and I have indexed every column.
Airports contains only 2000 rows, and I have also indexed every column.
When I run this query it can take up to 35 seconds to return 300 rows:
SELECT routes_build.*, a1.name as origin_name, a2.name as destination_name FROM routes_build
LEFT JOIN airports a1 ON a1.IATA = routes_build.origin
LEFT JOIN airports a2 ON a2.IATA = routes_build.destination
WHERE routes_build.carrier = "Carrier Name"
Running it with "DESCRIBE" I get the following info, but I'm not 100% sure what it's telling me.
id | Select Type | Table        | Type | possible_keys     | Key     | Key_len | ref   | rows | Extra
---+-------------+--------------+------+-------------------+---------+---------+-------+------+------------
 1 | SIMPLE      | routes_build | ref  | carrier,carrier_2 | carrier | 678     | const | 26   | Using where
 1 | SIMPLE      | a1           | ALL  | NULL              | NULL    | NULL    | NULL  | 5389 |
 1 | SIMPLE      | a2           | ALL  | NULL              | NULL    | NULL    | NULL  | 5389 |
The only alternative I can think of is to run two separate queries and join them up with PHP, although I can't believe a query like this could kill a MySQL server. So, as usual, I suspect I'm doing something stupid. SQL is my number 1 weakness.

Personally, I would start by removing the left joins and replacing them with inner joins, as each route must have a start and end point.
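A minimal sketch of that change, assuming the table is routes_build as shown in the EXPLAIN output above:
SELECT routes_build.*, a1.name as origin_name, a2.name as destination_name
FROM routes_build
INNER JOIN airports a1 ON a1.IATA = routes_build.origin
INNER JOIN airports a2 ON a2.IATA = routes_build.destination
WHERE routes_build.carrier = "Carrier Name"
Note that the inner joins will drop any route whose origin or destination code has no matching airports row, which is exactly the assumption being made here.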

It's telling you that it's not using an index for joining on the airports table. See how the "rows" column is so huge, 5000-odd? That's how many rows it has to read to answer your query.
I don't know why, as you have said you have indexed every column. What is IATA? Is it unique? I believe that if MySQL decides an index is inefficient, it may ignore it.
EDIT: if IATA is a unique string, maybe try indexing only part of it? (You can select how many characters to index.) That may give MySQL an index it can use.
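For reference, a prefix index is declared by giving a character count after the column name; a hypothetical sketch, assuming IATA is a short string column (the index name is made up):
ALTER TABLE airports ADD INDEX idx_iata_prefix (IATA(2));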

SELECT routes_build.*, a1.name as origin_name, a2.name as destination_name
FROM routes_build
LEFT JOIN airports a1 ON a1.IATA = routes_build.origin
LEFT JOIN airports a2 ON a2.IATA = routes_build.destination
WHERE routes_build.carrier = "Carrier Name"
From your EXPLAIN PLAN I can see that you don't have an index on airports.IATA.
You should create it for the query to work fast.
Name also suggests that it should be a UNIQUE index, since IATA codes are unique.
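A sketch of the index that advice points to (the index name here is just a placeholder):
CREATE UNIQUE INDEX idx_airports_iata ON airports (IATA);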
Update:
Please post your table definition. Issue this query to show it:
SHOW CREATE TABLE airports
Also, I should note that your FULLTEXT index on IATA is useless unless you have set ft_min_word_len in the MySQL configuration to 3 or less.
By default, it's 4.
IATA codes are 3 characters long, and with the default settings MySQL won't index such short words for FULLTEXT search.
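You can check the current value with:
SHOW VARIABLES LIKE 'ft_min_word_len';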

After you implement Martin Robins's excellent advice (i.e. remove every instance of the word LEFT from your query), try giving routes_build a compound index on carrier, origin, and destination.
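A sketch of that compound index (the name is arbitrary; if the combined key exceeds MySQL's index length limit, you may need prefix lengths on the wider columns):
ALTER TABLE routes_build ADD INDEX idx_carrier_origin_dest (carrier, origin, destination);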

It really depends on what information you're trying to get to. You probably don't need to join airports twice and you probably don't need to use left joins. Also, if you can search on a numeric field rather than a text field, that would speed things up as well.
So what are you trying to fetch?

Related

Access query, if two values exist in one column, omit one

I have a series of queries that generate reports that contain chemical data. There are two compounds A and B where A is the total amount and B is a speciated amount (like total iron and ferrous iron, for example).
There are about one hundred total compounds in the query result, and I need a criterion to filter the results such that if both Compounds A and B are present, only Compound B is displayed. So far I've tried adding a few IIf statements to the criteria section in the query builder, with no luck.
Here is what I have so far:
SELECT Table1.KEY_ANLT
FROM Table1
WHERE (((Table1.KEY_ANLT)=IIf([Table1].[KEY_ANLT]=1223 And [Table1].[KEY_ANLT]=70,70,1223)));
This filters out Compound A but does not include the rest of the compounds. How can I modify the query to also include the other compounds?
So, to clarify some of the comments above, the problem here is that you don't have (or haven't specified above) a way to identify values that go together. You gave 70 and 1223 as an example, but if you gave us a list of all the numbers, how would we be able to identify which ones go together? You might say "chemistry expertise", but that's based on another column with the compounds' names, right? So really, your query should use that column.
But then there's still the problem of how to connect associated names (e.g., "total iron" and "ferrous iron" might be connected because they both have the word "iron", but what about "permanganate" and "manganese"?). In short, you need another column to specify the thing these separate rows have in common, whether it's element, ion, charge, etc. You would also need a column identifying which row in each "group" you want to include in your query (or which ones to exclude). For example:
+----------+-----------------+---------+---------+
| KEY_ANLT | Compound        | Element | Primary |
+----------+-----------------+---------+---------+
|       70 | total iron      | Fe      | Y       |
|     1223 | ferrous iron    | Fe      |         |
|     1224 | ferric iron     | Fe      |         |
|      900 | total manganese | Mn      | Y       |
|      901 | permanganate    | Mn      |         |
+----------+-----------------+---------+---------+
Then, to get a query that shows just the "primary" rows, it's pretty trivial:
SELECT * FROM Table1 WHERE Primary='Y';
Without that [Primary] column, you'd have to decide how to choose each row. Perhaps you'd want the one with the smallest KEY_ANLT?
SELECT Table1.*
FROM
(SELECT Element, min(KEY_ANLT) AS MinKey FROM Table1 GROUP BY Element) AS Subquery
INNER JOIN Table1 ON
Subquery.Element=Table1.Element AND
Subquery.MinKey=Table1.KEY_ANLT
The reason your query doesn't work is that the WHERE clause operates row-by-row, and doesn't compare different rows to one another. So in your SQL:
IIf([Table1].[KEY_ANLT]=1223 And [Table1].[KEY_ANLT]=70,70,1223)
NONE of the rows will evaluate this as 70, because no single row has KEY_ANLT=1223 AND KEY_ANLT=70. Each row only has one value for KEY_ANLT. So then that IIF expression evaluates as 1223 for every row, and your condition will only return rows where KEY_ANLT=1223 (compound B).
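To see this concretely, here is a hypothetical query that just echoes what that IIf expression evaluates to for each row:
SELECT KEY_ANLT,
       IIf([KEY_ANLT]=1223 And [KEY_ANLT]=70, 70, 1223) AS IIfResult
FROM Table1;
Every row comes back with IIfResult = 1223, because no single row can equal both values at once.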

Postgres matching against an array of regular expressions

My client wants to be able to match a set of data against an array of regular expressions, meaning:
table:
name | officeId (foreignkey)
--------
bob | 1
alice | 1
alicia | 2
walter | 2
and he wants to do something along these lines:
get me all records of offices (officeId) where there is a member with
ANY name ~ ANY[.*ob, ali.*]
meaning
ANY of[alicia, walter] ~ ANY of [.*ob, ali.*] results in true
Sadly, I could not figure it out by myself.
Edit
The real problem was missing from the original description:
I cannot use select distinct officeId .. where name ~ ANY[.*ob, ali.*], because:
This application stores data in Postgres XML columns, which means that (after evaluating (xpath('/data/clients/name/text()'))::text[]) I in fact have:
table:
name | officeId (foreignkey)
-----------------------------------------
[bob, alice] | 1
[anthony, walter] | 2
[alicia, walter] | 3
That is the problem. And "you don't do that, that is horrible, why would you do it like this, store it like it is meant to be stored in a relational database, use a NoSQL database for document-based storage, use JSON" are not options.
I am stuck with this data model.
This looks pretty horrific, but the only way I can think of doing such a thing would be a hybrid of a cross-join and a semi join. On small data sets this would probably work pretty well. On large datasets, I imagine the cross-join component could hit you pretty hard.
Check it out and let me know if it works against your real data:
with patterns as (
    select unnest(array['.*ob', 'ali.*']) as pattern
)
select
    o.name, o.officeid
from
    office o
where exists (
    select null
    from patterns p
    where o.name ~ p.pattern
)
The semi-join helps protect you from cases where a name like "alicia nob" matches multiple search patterns; without it, such a row would come back once for every pattern it matches.
You could cast the array to text.
SELECT * FROM workers WHERE (xpath('/data/clients/name/text()', xml_field))::text ~ ANY(ARRAY['wal','ant']);
When a string array is cast to text, strings containing special characters or consisting of keywords are enclosed in double quotes, so {jimmy,"walter, james"} is two entries. Also, when matching with ~, the pattern is matched against any part of the string, unlike LIKE, where the pattern must match the whole string.
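A quick illustration of that last point, using throwaway literals:
SELECT 'walter, james' ~ 'wal';      -- true: the pattern only has to match part of the string
SELECT 'walter, james' LIKE 'wal';   -- false: LIKE must match the whole string
SELECT 'walter, james' LIKE 'wal%';  -- true: a wildcard is needed to cover the rest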
Here is what I did in my test database:
test=# select id, (xpath('/data/clients/name/text()', name))::text[] as xss, officeid from workers WHERE (xpath('/data/clients/name/text()', name))::text ~ ANY(ARRAY['wal','ant']);
 id |           xss           | officeid
----+-------------------------+----------
  2 | {anthony,walter}        |        2
  3 | {alicia,walter}         |        3
  4 | {"walter, james"}       |        5
  5 | {jimmy,"walter, james"} |        4
(4 rows)

For Sql performances, several equals or one between

For a new development, I will have a big SQL table (~100M rows).
4 fields will be used to query the data.
Is it better to query one concatenated field with BETWEEN, or several equals?
Example:
MainTable
PkId | Label | FkId1 | FkId2 | FkId3 | FkId4
1 | test | 1 | 4 | 3 | 1
Data in the FK tables is static, for example:
FkTable1
Id | Value
1 | a
2 | b
3 | c
To query the data, the classic SQL query is:
select Label, FkId1, FkId2, FkId3, FkId4
from MainTable
where FkId1=1 and FkId2=2 and FkId3 in(2, 3)
The idea to optimize performance is to add one field, "UniqueId", calculated in the backend before the insert:
UniqueId = FkId1*1000000 + FkId2*10000 + FkId3*100 + FkId4
PkId | Label | FkId1 | FkId2 | FkId3 | FkId4 | UniqueId
1 | test | 1 | 4 | 3 | 1 | 1040301
select Label, FkId1, FkId2, FkId3, FkId4
from MainTable
where UniqueId between 1020200 and 1040000
Moreover, with the UniqueId field, an index on this field only will be sufficient.
What do you think?
Thanks
For this query:
select Label, FkId1, FkId2, FkId3, FkId4
from MainTable
where FkId1 = 1 and FkId2 = 2 and FkId3 in (2, 3)
The optimal index is on MainTable(FkId1, FkId2, FkId3). You can also add Label and FkId4 to the index if you want a covering index (so the index can handle the entire query without referring to the original data pages).
There is no need for a computed field for the example you provided.
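A sketch of both options (index names are placeholders):
-- minimal index covering the WHERE clause
CREATE INDEX ix_main_fk123 ON MainTable (FkId1, FkId2, FkId3);
-- covering index: includes the selected columns as well
CREATE INDEX ix_main_fk123_cover ON MainTable (FkId1, FkId2, FkId3, Label, FkId4);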
Since you will have 100M rows, thinking about optimisations from the start seems sensible to me.
However, your proposed solution will not work as described:
Your formula has to use a different factor for each FK, i.e. different powers of 10; reusing the same factor would map different FK combinations to the same UniqueId.
Your select example has an IN clause (FkId3 in (2, 3)). This will only work if just one of the FKs is queried this way, and that FK should be the one with no factor in the formula for computing UniqueId (i.e. the one that provides the least significant digits of UniqueId).
Now, seeing Gordon's answer, I agree with him: using a combined index may be good enough for you (though your solution would probably be slightly faster). However, the combined index has a similar constraint: the FK field being queried with the IN clause should be the last field in the index.

Need lowest price in each region in a mysql query

I am trying to write a query for WordPress which will give me all the post_ids with the lowest fromprice field for each region. The trick is that these are custom fields in WordPress, so the information is stored row-based; there are no region and fromprice columns.
So the data I have is (but of course containing a lot more rows):
Post_ID | Meta_Key | Meta_Value
1 | Region | Location1
1 | FromPrice | 150
2 | Region | Location1
2 | FromPrice | 160
3 | Region | Location2
3 | FromPrice | 145
The query I am endeavoring to build should return the post_id of the "lowest priced" matching post grouped by each region with results like:
Post_ID | Region | From Price
1 | Location1 | 150
3 | Location2 | 145
This will allow me to easily iterate over the post_ids and print the required information. In fact, I would be happy with just the post_ids if the rest is harder; I can then fetch the information independently if need be.
Thanks a lot, I'm tearing my hair out over this one; I don't often have to think about pivoting results from row-based to column-based, but this time I need it!
So you get an idea of the table structure I have, you can use the below as a guide. I thought I had it, but it turned out that while this query prints out each distinct region WITH the lowest from price found in that region, the post_id is completely incorrect. I don't know why; it seems to just be taking the first post_id it finds and using that.
SELECT pm.post_id,
pm2.meta_value as region,
MIN(pm.meta_value) as price
FROM `wp_postmeta` pm
inner join `wp_postmeta` pm2
on pm2.post_id = pm.post_id
AND pm2.meta_key = 'region'
AND pm.meta_key = 'fromprice'
group by region
I suggest changing MIN(pm.meta_value) in your query to be MIN(CAST(pm.meta_value AS DECIMAL)). Meta_value is a character field, so your existing query will be returning the minimum string value, not the minimum numeric value; for example, "100" will be deemed to be lower than "21".
EDIT - amended CAST syntax.
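A sketch of the original query with only that expression changed (the DECIMAL precision is an arbitrary choice here):
SELECT pm.post_id,
       pm2.meta_value as region,
       MIN(CAST(pm.meta_value AS DECIMAL(10,2))) as price
FROM `wp_postmeta` pm
inner join `wp_postmeta` pm2
   on pm2.post_id = pm.post_id
  AND pm2.meta_key = 'region'
  AND pm.meta_key = 'fromprice'
group by region
This addresses the string-versus-numeric comparison; the wrong post_id noted in the question is a separate issue, since pm.post_id is not part of the GROUP BY.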
It's hard to figure out without being able to execute the query, but would it help to just change your group by to:
group by pm.post_id, region

SQL magic - query shouldn't take 15 hours, but it does

OK, so I have one really monstrous MySQL table (900k records, 180 MB total), and I want to extract, from each subgroup, the records with the most recent date_updated and calculate a weighted average in each group. The calculation runs for ~15 hours, and I have a strong feeling I'm doing it wrong.
First, monstrous table layout:
category
element_id
date_updated
value
weight
source_prefix
source_name
The only key here is on element_id (BTREE, ~8k unique elements).
And the calculation process:
Make a hash for each group and subgroup.
CREATE TEMPORARY TABLE `temp1` (INDEX ( `ds_hash` ))
SELECT `category`,
`element_id`,
`source_prefix`,
`source_name`,
`date_updated`,
`value`,
`weight`,
MD5(CONCAT(`category`, `element_id`, `source_prefix`, `source_name`)) AS `subcat_hash`,
MD5(CONCAT(`category`, `element_id`, `date_updated`)) AS `cat_hash`
FROM `bigbigtable` WHERE `date_updated` <= '2009-04-28'
I really don't understand this fuss with hashes, but it worked faster this way. Dark magic, I presume.
Find maximum date for each subgroup
CREATE TEMPORARY TABLE `temp2` (INDEX ( `subcat_hash` ))
SELECT MAX(`date_updated`) AS `maxdate` , `subcat_hash`
FROM `temp1`
GROUP BY `subcat_hash`;
Join temp1 with temp2 to find weighted average values for categories
CREATE TEMPORARY TABLE `valuebycats` (INDEX ( `category` ))
SELECT `temp1`.`element_id`,
`temp1`.`category`,
`temp1`.`source_prefix`,
`temp1`.`source_name`,
`temp1`.`date_updated`,
AVG(`temp1`.`value`) AS `avg_value`,
SUM(`temp1`.`value` * `temp1`.`weight`) / SUM(`weight`) AS `rating`
FROM `temp1` LEFT JOIN `temp2` ON `temp1`.`subcat_hash` = `temp2`.`subcat_hash`
WHERE `temp2`.`subcat_hash` = `temp1`.`subcat_hash`
AND `temp1`.`date_updated` = `temp2`.`maxdate`
GROUP BY `temp1`.`cat_hash`;
(Now that I have looked through it and written it all down, it seems to me that I should use an INNER JOIN in that last query, to avoid a 900k x 900k temp table.)
Still, is there a normal way to do so?
UPD: some picture for reference:
removed dead ImageShack link
UPD: EXPLAIN for proposed solution:
+----+-------------+-------+------+---------------+------------+---------+--------------------------------------------------------------------------------------+--------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------+---------------+------------+---------+--------------------------------------------------------------------------------------+--------+----------+----------------------------------------------+
| 1 | SIMPLE | cur | ALL | NULL | NULL | NULL | NULL | 893085 | 100.00 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | next | ref | prefix | prefix | 1074 | bigbigtable.cur.source_prefix,bigbigtable.cur.source_name,bigbigtable.cur.element_id | 1 | 100.00 | Using where |
+----+-------------+-------+------+---------------+------------+---------+--------------------------------------------------------------------------------------+--------+----------+----------------------------------------------+
Using hashes is one of the ways in which a database engine can execute a join. It should be very rare that you have to write your own hash-based join, and this certainly doesn't look like one of those cases: a 900k-row table with some aggregates.
Based on your comment, this query might do what you are looking for:
SELECT cur.source_prefix,
       cur.source_name,
       cur.category,
       cur.element_id,
       MAX(cur.date_updated) AS DateUpdated,
       AVG(cur.value) AS AvgValue,
       SUM(cur.value * cur.weight) / SUM(cur.weight) AS Rating
FROM eev0 cur
LEFT JOIN eev0 next
       ON next.date_updated < '2009-05-01'
      AND next.source_prefix = cur.source_prefix
      AND next.source_name = cur.source_name
      AND next.element_id = cur.element_id
      AND next.date_updated > cur.date_updated
WHERE cur.date_updated < '2009-05-01'
  AND next.category IS NULL
GROUP BY cur.source_prefix, cur.source_name,
         cur.category, cur.element_id
The GROUP BY performs the calculations per source+category+element.
The JOIN is there to filter out old entries. It looks for later entries, and then the WHERE statement filters out the rows for which a later entry exists. A join like this benefits from an index on (source_prefix, source_name, element_id, date_updated).
There are many ways of filtering out old entries, but this one tends to perform reasonably well.
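A sketch of that index, assuming the table is named eev0 as in the query above (the index name is a placeholder):
CREATE INDEX ix_eev0_latest ON eev0 (source_prefix, source_name, element_id, date_updated);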
OK, so 900K rows isn't a massive table; it's reasonably big, but your queries really shouldn't be taking that long.
First things first, which of the 3 statements above is taking the most time?
The first problem I see is with your first query. Your WHERE clause doesn't include an indexed column. So this means that it has to do a full table scan on the entire table.
Create an index on the date_updated column, then run the query again and see what that does for you.
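For example (the index name is a placeholder):
ALTER TABLE bigbigtable ADD INDEX ix_date_updated (date_updated);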
If you don't need the hashes and are only using them for the dark magic, then remove them completely.
Edit: Someone with more SQL-fu than me will probably reduce your whole set of logic into one SQL statement without the use of the temporary tables.
Edit: My SQL is a little rusty, but are you joining twice in the third SQL statement? Maybe it won't make a difference, but shouldn't it be:
SELECT temp1.element_id,
temp1.category,
temp1.source_prefix,
temp1.source_name,
temp1.date_updated,
AVG(temp1.value) AS avg_value,
SUM(temp1.value * temp1.weight) / SUM(weight) AS rating
FROM temp1 LEFT JOIN temp2 ON temp1.subcat_hash = temp2.subcat_hash
WHERE temp1.date_updated = temp2.maxdate
GROUP BY temp1.cat_hash;
or
SELECT temp1.element_id,
temp1.category,
temp1.source_prefix,
temp1.source_name,
temp1.date_updated,
AVG(temp1.value) AS avg_value,
SUM(temp1.value * temp1.weight) / SUM(weight) AS rating
FROM temp1, temp2
WHERE temp2.subcat_hash = temp1.subcat_hash
AND temp1.date_updated = temp2.maxdate
GROUP BY temp1.cat_hash;