How can I speed up a MySQL query with multiple joins?

Here is my issue: I am selecting and doing multiple joins to get the correct items. It pulls in a fair number of rows, above 100,000, and this query takes more than 5 minutes when the date range is set to one year.
I am afraid that the user might extend the date range to something like ten years and crash it.
I don't know if it's even possible, but does anyone know how I can speed this up? Here is the query.
SELECT DISTINCT t1.first_name, t1.last_name, t1.email
FROM table1 AS t1
INNER JOIN table2 AS t2 ON t1.CU_id = t2.O_cid
INNER JOIN table3 AS t3 ON t2.O_ref = t3.I_oref
INNER JOIN table4 AS t4 ON t3.I_pid = t4.P_id
INNER JOIN table5 AS t5 ON t4.P_cat = t5.C_id
WHERE t1.subscribe = 1
AND t1.Cdate >= $startDate
AND t1.Cdate <= $endDate
AND t5.store = 2
I am not the greatest with MySQL, so any help would be appreciated!
Thanks in advance!
UPDATE
Here is the explain you asked for
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE t5 ref PRIMARY,C_store_type,C_id,C_store_type_2 C_store_type_2 1 const 101 Using temporary
1 SIMPLE t4 ref PRIMARY,P_cat P_cat 5 alphacom.t5.C_id 326 Using where
1 SIMPLE t3 ref I_pid,I_oref I_pid 4 alphacom.t4.P_id 31
1 SIMPLE t2 eq_ref O_ref,O_cid O_ref 28 alphacom.t3.I_oref 1
1 SIMPLE t1 eq_ref PRIMARY PRIMARY 4 alphacom.t2.O_cid 1 Using where
Also, I added indexes to table5 and table4 because their rows don't really change; the other tables, however, get around 500-1000 new entries a month. I heard you should add an index to a table that gets that many new entries... is this true?

I'd try the following:
First, ensure there are indexes on the following tables and columns (each set of columns in parentheses should be a separate index):
table1: (subscribe, Cdate), (CU_id)
table2: (O_cid), (O_ref)
table3: (I_oref), (I_pid)
table4: (P_id), (P_cat)
table5: (C_id, store)
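In MySQL those can be created along these lines (a sketch; the index names are illustrative, not from the original post):
-- One CREATE INDEX per set of parentheses above.
CREATE INDEX idx_t1_sub_cdate ON table1 (subscribe, Cdate);
CREATE INDEX idx_t1_cuid ON table1 (CU_id);
CREATE INDEX idx_t2_ocid ON table2 (O_cid);
CREATE INDEX idx_t2_oref ON table2 (O_ref);
CREATE INDEX idx_t3_ioref ON table3 (I_oref);
CREATE INDEX idx_t3_ipid ON table3 (I_pid);
CREATE INDEX idx_t4_pid ON table4 (P_id);
CREATE INDEX idx_t4_pcat ON table4 (P_cat);
CREATE INDEX idx_t5_cid_store ON table5 (C_id, store);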
Second, if adding the above indexes didn't improve things as much as you'd like, try rewriting the query as
SELECT DISTINCT t1.first_name, t1.last_name, t1.email FROM
(SELECT CU_id, first_name, last_name, email
FROM table1
WHERE subscribe = 1 AND
Cdate >= $startDate AND
Cdate <= $endDate) AS t1
INNER JOIN table2 AS t2
ON t1.CU_id = t2.O_cid
INNER JOIN table3 AS t3
ON t2.O_ref = t3.I_oref
INNER JOIN table4 AS t4
ON t3.I_pid = t4.P_id
INNER JOIN (SELECT C_id FROM table5 WHERE store = 2) AS t5
ON t4.P_cat = t5.C_id
I'm hoping here that the first sub-select cuts down significantly on the number of rows to be considered for joining, making the subsequent joins do less work. Ditto the reasoning behind the second sub-select on table5.
In any case, mess with it. Ultimately it's just a SELECT, so you can't really hurt anything with it. Examine the plans generated by each permutation and try to figure out what's good or bad about each.
Share and enjoy.

Make sure your date columns and all the columns you are joining on are indexed.
Using range operators (>=, <=) on your dates means the server must examine a whole range of rows rather than jump to a single matching value, which is inherently slower than an equality comparison.
Also, DISTINCT adds an extra de-duplication pass to the plan your optimizer runs behind the scenes. Eliminate it if possible.
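A quick way to verify that the indexes are actually used is to run the query through EXPLAIN (a sketch; the index name and the literal dates are illustrative):
-- Hypothetical composite index covering the filter on table1.
CREATE INDEX idx_subscribe_cdate ON table1 (subscribe, Cdate);
-- The "type" column should now show "range" rather than "ALL" (full scan).
EXPLAIN SELECT CU_id FROM table1
WHERE subscribe = 1 AND Cdate >= '2009-01-01' AND Cdate <= '2009-12-31';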

Well, first, make a subquery to decimate table1 down to just the records you actually want to go to all the trouble of joining...
SELECT DISTINCT t1.first_name, t1.last_name, t1.email
FROM (
SELECT first_name, last_name, email, CU_id FROM table1 WHERE
table1.subscribe = 1
AND table1.Cdate >= $startDate
AND table1.Cdate <= $endDate
) AS t1
INNER JOIN table2 AS t2 ON t1.CU_id = t2.O_cid
INNER JOIN table3 AS t3 ON t2.O_ref = t3.I_oref
INNER JOIN table4 AS t4 ON t3.I_pid = t4.P_id
INNER JOIN table5 AS t5 ON t4.P_cat = t5.C_id
WHERE t5.store = 2
Then start looking at modifying the join order.
Additionally, if t5.store is only very rarely 2, flip this idea around: construct the t5 subquery first, then join outward from it, table by table.
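A sketch of that inversion, reusing the aliases from the question:
-- Start from the (presumably small) set of store-2 categories and fan out.
SELECT DISTINCT t1.first_name, t1.last_name, t1.email
FROM (SELECT C_id FROM table5 WHERE store = 2) AS t5
INNER JOIN table4 AS t4 ON t4.P_cat = t5.C_id
INNER JOIN table3 AS t3 ON t3.I_pid = t4.P_id
INNER JOIN table2 AS t2 ON t2.O_ref = t3.I_oref
INNER JOIN table1 AS t1 ON t1.CU_id = t2.O_cid
WHERE t1.subscribe = 1
AND t1.Cdate >= $startDate
AND t1.Cdate <= $endDate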

At present, your query returns all matching rows from table2-table5 just to establish whether t5.store = 2. If any of table2-table5 has a significantly higher row count than table1, this may greatly increase the number of rows processed; consequently, the following query may perform significantly better:
SELECT DISTINCT t1.first_name, t1.last_name, t1.email
FROM table1 AS t1
WHERE t1.subscribe = 1
AND t1.Cdate >= $startDate
AND t1.Cdate <= $endDate
AND EXISTS
(SELECT NULL FROM table2 AS t2
INNER JOIN table3 AS t3 ON t2.O_ref = t3.I_oref
INNER JOIN table4 AS t4 ON t3.I_pid = t4.P_id
INNER JOIN table5 AS t5 ON t4.P_cat = t5.C_id AND t5.store = 2
WHERE t1.CU_id = t2.O_cid);

Try adding indexes on the fields that you join on. It may or may not improve performance.
It also depends on the storage engine you are using. If you are using InnoDB, check your configuration parameters. I faced a similar problem, as the default InnoDB configuration does not scale nearly as well as MyISAM's defaults.
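For example, the InnoDB buffer pool is usually the first parameter to look at (a sketch; the 1 GB value is illustrative and should be sized to your available RAM):
-- See how much memory InnoDB currently has for caching data and indexes.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
-- Recent MySQL versions can change this at runtime; older ones require
-- setting innodb_buffer_pool_size in my.cnf and restarting the server.
SET GLOBAL innodb_buffer_pool_size = 1073741824; -- 1 GB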

As everyone says, make sure you have indexes.
You can also check whether your server is configured so that it can hold more of the dataset, or maybe all of it, in memory.
Without an EXPLAIN there's not much to go by. Also keep in mind that MySQL examines your JOINs and iterates through all possible join orders before executing the query, which can itself take time. Once you have the optimal join order from the EXPLAIN, you can force that order in your query, eliminating this step for the optimizer.
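In MySQL the order can be forced with STRAIGHT_JOIN, which always reads the left table before the right one (a sketch on the question's query; only worth doing once EXPLAIN has shown you a good order):
-- Each STRAIGHT_JOIN pins the join order instead of letting the
-- optimizer search permutations.
SELECT DISTINCT t1.first_name, t1.last_name, t1.email
FROM table1 AS t1
STRAIGHT_JOIN table2 AS t2 ON t1.CU_id = t2.O_cid
STRAIGHT_JOIN table3 AS t3 ON t2.O_ref = t3.I_oref
STRAIGHT_JOIN table4 AS t4 ON t3.I_pid = t4.P_id
STRAIGHT_JOIN table5 AS t5 ON t4.P_cat = t5.C_id
WHERE t1.subscribe = 1
AND t1.Cdate >= $startDate
AND t1.Cdate <= $endDate
AND t5.store = 2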

It sounds like you should think about delivering subsets (paging), or limiting the results some other way, unless there is a reason the users need every possible row at once. Typically, 100K rows is more than the average person can digest.
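Paging the original query might look like this (a sketch; the page size and sort order are arbitrary choices):
-- Fetch the third page of 50 rows; ORDER BY keeps the paging deterministic.
SELECT DISTINCT t1.first_name, t1.last_name, t1.email
FROM table1 AS t1
INNER JOIN table2 AS t2 ON t1.CU_id = t2.O_cid
INNER JOIN table3 AS t3 ON t2.O_ref = t3.I_oref
INNER JOIN table4 AS t4 ON t3.I_pid = t4.P_id
INNER JOIN table5 AS t5 ON t4.P_cat = t5.C_id
WHERE t1.subscribe = 1
AND t1.Cdate >= $startDate
AND t1.Cdate <= $endDate
AND t5.store = 2
ORDER BY t1.last_name, t1.first_name
LIMIT 50 OFFSET 100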

Related

In joining 2 tables, do the WHERE clauses reduce the table sizes before or after the join occurs?

For example, does the first query get processed differently than the second query?
Query 1
SELECT t1.var1, t2.var2 FROM table1 t1
INNER JOIN table2 t2
ON t1.key = t2.key
WHERE t2.ID = 'ABCD'
Query 2
SELECT t1.var1, t2.var2 FROM table1 t1
INNER JOIN (
SELECT var2, key FROM table2
WHERE ID = 'ABCD'
) t2
ON t1.key = t2.key
At a glance, it seems as if the second query would be more efficient: table2 is reduced before the join begins, whereas the first query appears to join the tables first and reduce later. I'm using Teradata, if it matters.
It depends on the vendor, version, and configuration.
An older Teradata version or legacy configuration might spool the sub-query as a first stage of Query 2, leading to reduced performance compared with Query 1, depending on the tables' primary indexes and the join algorithm.
I would suggest avoiding this kind of "optimization".
P.S.
Check whether you get the same execution plan for both queries or two different plans.
Check the query log for AMPCPUTime (for a start).
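In Teradata you can prefix a statement with EXPLAIN to get its plan without executing it, and compare AMPCPUTime in the DBQL log (a sketch; the log query assumes query logging is enabled):
EXPLAIN
SELECT t1.var1, t2.var2 FROM table1 t1
INNER JOIN table2 t2 ON t1.key = t2.key
WHERE t2.ID = 'ABCD';
-- Compare the CPU cost of recently run statements.
SELECT QueryText, AMPCPUTime
FROM DBC.DBQLogTbl
ORDER BY StartTime DESC;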

Optimization of DB2 query which uses joins and takes 1.5 hours to execute

When I run a SELECT statement on my view, it takes around 1.5 hours to run. What can I do to optimize it?
Below is a sample of what my view's structure looks like:
CREATE VIEW SCHEMANAME.VIEWNAME
(COL, COL1, COL2, COL3)
AS SELECT
COST.ETA,
CASE
WHEN VOL.CURR IS NOT NULL
THEN COALESCE(VOL.COMM, 0)
END,
CASE
WHEN ...
END
FROM TABLE1 t1 INNER JOIN TABLE2 t2 ON t1.ETA = t2.ETA
INNER JOIN TABLE3 t3 ON t2.ETA = t3.ETA
LEFT OUTER JOIN TABLE4 t4 ON t2.ETA = t4.ETA
This is your query:
SELECT COST.ETA,
(CASE WHEN VOL.CURR IS NOT NULL THEN COALESCE(VOL.COMM, 0)
END) as ??,
. . .
FROM TABLE1 t1 inner join
TABLE2 t2
ON t1.ETA = t2.ETA INNER JOIN
TABLE3 t3
on t2.ETA = t3.ETA LEFT OUTER JOIN
TABLE4 t4
on t2.ETA = t4.ETA;
First, I will note that the select clause references tables (COST, VOL) that are not in the from clause. I assume this is a typo.
Second, you should be able to use indexes to improve this query: table1(eta), table2(eta), table3(eta), and table4(eta).
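If any of them are missing, they can be added along these lines (a sketch; the index names are illustrative):
CREATE INDEX idx_table1_eta ON TABLE1 (ETA);
CREATE INDEX idx_table2_eta ON TABLE2 (ETA);
CREATE INDEX idx_table3_eta ON TABLE3 (ETA);
CREATE INDEX idx_table4_eta ON TABLE4 (ETA);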
Third, I am highly suspicious when I see the same column used to join so many tables. I suspect that you have Cartesian products occurring, because there are multiple rows for any given eta in several of the tables. If that is the case, you need to fix the query to better reflect what you really need; ask another question with sample data and desired results, because your query is probably not correct.

SQL query, where = value of another table

I want to make a query that simply does this. It may sound really dumb, but I did a lot of research and couldn't understand any of it.
Imagine that I have two tables (table1 and table2) and two columns (table1.column1 and table2.column2).
What I want to do is basically this:
SELECT column1 FROM table1 where table2.column2 = '0'
I don't know if this is possible.
Thanks in advance,
You need to apply a join between the two tables, and then your where clause will do the work for you:
select column1 from table1
inner join table2 on table1.column = table2.column
where table2.column2 = '0'
For more info on joins, you can see this:
Reading this original article on The Code Project will help you a lot: Visual Representation of SQL Joins.
Find the original at: Difference between JOIN and OUTER JOIN in MySQL.
SELECT column1 FROM table1 t1
where exists (select 1 from table2 t2
where t1.id = t2.table1_id and t2.column2 = '0')
This assumes table1_id in table2 is a foreign key referring to id, the primary key of table1.
You don't have any kind of natural join between the two tables.
You're asking for:
Select Houses.DoorColour from Houses, Cars where Cars.AreFourWheelDrive = '1'
You'd need to think about why you're selecting anything from the first table; there must be a shared piece of information between tables 1 and 2, otherwise a join is pointless and probably dangerous.

SQL Server query performance - removing need for Hash Match (Inner Join)

I have the following query, which is doing very little and is an example of the kind of joins I am doing throughout the system.
select t1.PrimaryKeyId, t1.AdditionalColumnId
from TableOne t1
join TableTwo t2 on t1.ForeignKeyId = t2.PrimaryKeyId
join TableThree t3 on t1.PrimaryKeyId = t3.ForeignKeyId
join TableFour t4 on t3.ForeignKeyId = t4.PrimaryKeyId
join TableFive t5 on t4.ForeignKeyId = t5.PrimaryKeyId
where
t1.StatusId = 1
and t5.TypeId = 68
There are indexes on all the join columns; however, the performance is not great. Inspecting the query plan reveals a lot of Hash Match (Inner Join) operators when really I want to see Nested Loops joins.
The number of records in each table is as follows:
select count(*) from TableOne
= 64393
select count(*) from TableTwo
= 87245
select count(*) from TableThree
= 97141
select count(*) from TableFour
= 116480
select count(*) from TableFive
= 62
What is the best way in which to improve the performance of this type of query?
First thoughts:
Change to EXISTS (changes equi-join to semi-join)
You need to have indexes on t1.StatusId and t5.TypeId, and INCLUDE t1.AdditionalColumnId
I wouldn't worry about your join method yet...
Personally, I've never used a JOIN hint. They only work for the data, indexes and statistics you have at that point in time. As these change, your JOIN hint limits the optimiser
select t1.PrimaryKeyId, t1.AdditionalColumnId
from
TableOne t1
where
t1.StatusId = 1
AND EXISTS (SELECT *
FROM
TableThree t3
join TableFour t4 on t3.ForeignKeyId = t4.PrimaryKeyId
join TableFive t5 on t4.ForeignKeyId = t5.PrimaryKeyId
WHERE
t1.PrimaryKeyId = t3.ForeignKeyId
AND
t5.TypeId = 68)
AND EXISTS (SELECT *
FROM
TableTwo t2
WHERE
t1.ForeignKeyId = t2.PrimaryKeyId)
Index for TableOne... one of:
(StatusId, ForeignKeyId) INCLUDE (AdditionalColumnId)
(ForeignKeyId, StatusId) INCLUDE (AdditionalColumnId)
Index for TableFive... probably (TypeId, PrimaryKeyId)
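In T-SQL those would look something like this (a sketch; the index names are illustrative):
-- Covering index: seeks on StatusId, carries the join key and output column.
CREATE INDEX IX_TableOne_StatusId
ON TableOne (StatusId, ForeignKeyId)
INCLUDE (AdditionalColumnId);
-- Lets the TypeId filter seek and return the join key without a lookup.
CREATE INDEX IX_TableFive_TypeId
ON TableFive (TypeId, PrimaryKeyId);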
Edit: updated JOINS and EXISTS to match question fixes
SQL Server is pretty good at optimizing queries, but it's also conservative: it optimizes queries for the worst case. A loop join typically results in an index lookup and a bookmark lookup for every row. Because loop joins cause dramatic degradation on large sets, SQL Server is hesitant to use them unless it's sure about the number of rows.
You can use the forceseek query hint to force an index lookup:
inner join TableTwo t2 with (FORCESEEK) on t1.ForeignKeyId = t2.PrimaryKeyId
Alternatively, you can force a loop join with the loop keyword:
inner LOOP join TableTwo t2 on t1.ForeignKeyId = t2.PrimaryKeyId
Query hints limit SQL Server's freedom, so it can no longer adapt to changed circumstances. It's best practice to avoid query hints unless there is a business need that cannot be met without them.

Many SQL queries vs 1 complex query

I have a database with two tables that are similar to:
table1
------
Id : long, auto increment
Title : string(50)
ParentId : long
table2
------
Id : long, auto increment
FirstName : string(20)
LastName : string(30)
Zip : string(5)
table2 has a one-to-many relationship with table1, where "many" includes zero.
I also have the following query (it works correctly, so ignore typos and the like; it is just an example):
SELECT t1.Id AS tid, t1.Title, t2.Id AS oid, t2.FirstName, t2.LastName
FROM table1 t1
INNER JOIN table2 t2 ON t1.ParentId = t2.Id
WHERE t2.Id IN
(SELECT Id FROM table2
WHERE Zip IN ('zip1', 'zip2', 'etc'))
ORDER BY t2.Id DESC
The query finds all items in table1 that belong to a person in table2, where the person is in one of the listed zip codes.
The problem I have now is: I want to show all the users (with their items if available) in the listed zip codes, not just the ones with items.
So, I am wondering, should I just do something simple with a lot more queries, like:
SELECT Id AS oid, FirstName, LastName FROM table2 WHERE Zip IN ('zip1', 'zip2', 'etc')
foreach (result) {
SELECT Id AS tid, Title FROM table1 WHERE ParentId = oid
}
Or should I come up with a more elaborate single SQL statement? And if so, can I get a little help? Thanks!
If I understand correctly, changing your INNER JOIN to a RIGHT JOIN should return all users regardless of whether they have an item or not; the item columns will just be null for those that don't.
Look into Right Joins and Group By. That will most likely get you the query you are after.
I agree with (and have upvoted) @Lee D and @Bueller. However, I generally advocate LEFT OUTER JOINs, because I find it easier to conceptualize what's going on with them, particularly when you are joining three or more tables. Consider it like so:
Start with what you know you want in the final result set:
FROM table2 t2
and then add in the "optional" data.
FROM table2 t2
left outer join table1 t1
on t1.ParentId = t2.Id
Whether or not matches are found, whatever gets selected from table2 will always appear.
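Putting those pieces together with the zip filter, the whole query might read like this (a sketch reusing the question's columns):
SELECT t1.Id AS tid, t1.Title, t2.Id AS oid, t2.FirstName, t2.LastName
FROM table2 t2
LEFT OUTER JOIN table1 t1 ON t1.ParentId = t2.Id
WHERE t2.Zip IN ('zip1', 'zip2', 'etc')
ORDER BY t2.Id DESC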
In general, you should prefer the "many queries" approach if (and only if)
it gets you simpler code in total
it is fast enough (which you should find out by testing)
In this case, I suspect, both conditions may not apply.
You should come up with a more elaborate single SQL statement and then process the results with your favorite programming language.
What you've described is called an N + 1 query. You have 1 initial query that returns N results, then 1 query for each of your N results. If N is small, the performance difference may not be noticeable - but there will be a larger and larger performance hit as N grows.
If I understand correctly, I think you are looking for something like this:
SELECT t1.Id AS tid, t1.Title, t2.Id AS oid, t2.FirstName, t2.LastName
FROM table1 t1
RIGHT OUTER JOIN table2 t2 ON t1.ParentId = t2.Id AND t2.Zip IN ('zip1', 'zip2', 'etc')
ORDER BY t2.Id DESC
You can have multiple conditions in your JOIN, and a RIGHT OUTER join will give you all the rows in table2 even if they don't match anything in table1.