Execute MySQL update query on 750k rows

I've added a field to a MySQL table. I need to populate the new column with the value from another table. Here is the query that I'd like to run:
UPDATE table1 t1
SET t1.user_id =
(
    SELECT t2.user_id
    FROM table2 t2
    WHERE t2.usr_id = t1.usr_id
)
I ran that query locally on 239K rows and it took about 10 minutes. Before I do that on the live environment I wanted to ask if what I am doing looks OK, i.e. does 10 minutes sound reasonable? Or should I do it another way: a PHP loop? A better query?

Use an UPDATE JOIN! This gives you a native inner join to update from, rather than running the subquery for every bloody row. It tends to be much faster.
UPDATE table1 t1
INNER JOIN table2 t2
    ON t1.usr_id = t2.usr_id
SET t1.user_id = t2.user_id
Ensure that you have an index on each of the usr_id columns, too. That will speed things up quite a bit.
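For example (a sketch, assuming neither table has an index on usr_id yet; the index names are made up):
CREATE INDEX idx_table1_usr_id ON table1 (usr_id);
CREATE INDEX idx_table2_usr_id ON table2 (usr_id);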
If you have some rows that don't match up and you want to set t1.user_id = NULL, you will need to do a LEFT JOIN in lieu of an INNER JOIN. If the column is NULL already and you're just looking to update it to the values in t2, use an INNER JOIN, since it's faster.
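A sketch of that LEFT JOIN variant; rows in table1 with no match in table2 get user_id set to NULL:
UPDATE table1 t1
LEFT JOIN table2 t2
    ON t1.usr_id = t2.usr_id
SET t1.user_id = t2.user_id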
I should mention, for posterity, that this is MySQL syntax only. Other RDBMSs have different ways of doing an update join.

There are two rather important pieces of information missing:
What type of tables are they?
What indexes exist on them?
If table2 has an index that contains user_id and usr_id as the first two columns and table1 is indexed on user_id, it shouldn't be that bad.
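For example, a covering index along those lines, assuming it doesn't exist yet (the name is made up). Leading with usr_id lets the subquery's WHERE clause seek on it, and carrying user_id means the value can be read straight from the index:
CREATE INDEX idx_table2_usr_user ON table2 (usr_id, user_id);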

You don't have an index on t2.usr_id.
Create this index and run your query again, or use the multiple-table UPDATE proposed by @Eric (with a LEFT JOIN, of course).
Note that MySQL has no join methods other than nested loops, so it's the index that matters, not the UPDATE syntax.
However, the multiple-table UPDATE is more readable.


Avoid multiple SELECTs while updating a table's column relative to another table's

I am quite a newbie with SQL queries, but I need to modify a column of a table relative to a column of another table. For now I have the following query working:
UPDATE table1
SET date1 = (
    SELECT last_day(max(date2)) + 1
    FROM table2
    WHERE id = 123
)
WHERE id = 123
AND date1 = to_date('31/12/9999', 'dd/mm/yyyy');
The problem with this structure is that, I suppose, the SELECT query will be executed for every row of table1. So I tried to write another query, but this one has a syntax error somewhere after the FROM keyword:
UPDATE t1
SET t1.date1=last_day(max(t2.date2))+1
FROM table1 t1
INNER JOIN table2 t2
ON t1.id=t2.id
WHERE t1.id=123
AND t1.date1=to_date('31/12/9999', 'dd/mm/yyyy');
And besides that, I don't even know if this one is faster than the first one...
Do you have any idea how I can handle this issue?
Thanks a lot!
Kind regards,
Julien
The first query you wrote is fine. It won't be executed for every row of table1 as you fear. It will do the following:
it will run the subquery once to find the value you want to use in your UPDATE statement, searching through table2; since you have stated the exact id from the table, it should be as fast as possible, as long as you have created an index on that (I guess a primary key) column
it will run the outer query, finding the single row you want to update; as before, it should be as fast as possible since you have stated the exact id, as long as there is an index on that column
To summarize: if those IDs are unique, both your subquery and your outer query should return only one row and execute as fast as possible. If you think execution is not fast enough (that it takes longer than the amount of data would justify), check whether those columns have unique values and unique indexes on them.
In fact, it would be best to add those indexes regardless of this problem, if they do not exist and these columns have unique values, as it would drastically improve the performance of everything on these tables that searches through these id columns.
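A sketch of such an index, assuming Oracle (the to_date and MERGE syntax in this thread suggests it) and that table2.id is not already covered by a primary key; the index name is made up:
CREATE UNIQUE INDEX ux_table2_id ON table2 (id);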
Please try to use MERGE:
MERGE INTO (
    SELECT id,
           date1
    FROM table1
    WHERE date1 = to_date('31/12/9999', 'dd/mm/yyyy')
    AND id = 123
) t1
USING (
    SELECT id,
           last_day(max(date2)) + 1 max_date
    FROM table2
    WHERE id = 123
    GROUP BY id
) t2 ON (t1.id = t2.id)
WHEN MATCHED THEN
    UPDATE SET t1.date1 = t2.max_date;

Performance of join vs pre-select on MS SQL

I can do the same query in two ways, as follows. Will #2 be more efficient, since it doesn't have a join?
1
select table1.* from table1
inner join table2 on table1.key = table2.key
where table2.id = 1
2
select * from table1
where key = (select key from table2 where id=1)
These are doing two different things. The second will return an error if more than one row is returned by the subquery.
In practice, my guess is that you have an index on table2(id) or table2(id, key), and that id is unique in table2. In that case, both should be doing index lookups and the performance should be very comparable.
And, the general answer to performance question is: try them on your servers with your data. That is really the only way to know if the performance difference makes a difference in your environment.
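One low-effort way to run that comparison on SQL Server (substituting your real table and column names) is to turn on I/O statistics, run both forms back to back, and compare the logical reads reported in the Messages tab:
SET STATISTICS IO ON;

SELECT table1.* FROM table1
INNER JOIN table2 ON table1.[key] = table2.[key]
WHERE table2.id = 1;

SELECT * FROM table1
WHERE [key] = (SELECT [key] FROM table2 WHERE id = 1);

SET STATISTICS IO OFF;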
I executed these two statements after running SET STATISTICS IO ON (on SQL Server 2008 R2 Enterprise, which supposedly has the best optimization compared to Standard):
select top 5 * from x2 inner join ##prices on
x2.LIST_PRICE = ##prices.i1
and
select top 5 * from x2 where LIST_PRICE in (select i1 from ##prices)
and the statistics matched exactly. I have always preferred the first type of join, but the second lets me select just that part and see what rows are being returned.
I was taught that joins and subqueries are mostly equivalent when it comes to performance. I would also look at the resulting query plans to see if one is better than the other; here, the query plans matched exactly.
MS SQL Server is smart enough to understand that it is the same action in such a simple query.
However, if you have more than one record in the subquery, you'll probably use IN. IN is a slow operation and will never be faster than a JOIN; it can match it, but never beat it.
The best option for your case is to use EXISTS. It will always be at least as fast as a JOIN or an IN operation. Example:
select * from table1 t1
where EXISTS (select * from table2 t2 where id=1 AND t1.key = t2.key)

How do I put multiple criteria for a column in a where clause?

I have five results to retrieve from a table, and I want to write a stored procedure that will return all the desired rows.
I can write the query like this temporarily:
Select * from Table where Id = 1 OR Id = 2 or Id = 3
I suppose I need to receive a list of Ids to split, but how do I write the WHERE clause?
So, if you're just trying to learn SQL, this is a short, good example for getting to know the IN operator. Assuming the IDs you want live in a second table, the following query has the same result as your attempt:
SELECT *
FROM TABLE
WHERE ID IN (SELECT ID FROM TABLE2)
This translates to the same thing as your attempt, and judging by your attempt, it might be the simplest version for you to understand. Although, in the future, I would recommend using a JOIN.
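If the IDs are just a handful of known values rather than rows in another table, IN also accepts a literal list, which maps one-to-one onto your chain of ORs:
SELECT *
FROM TABLE
WHERE ID IN (1, 2, 3)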
A JOIN has the same functionality as the subquery version, but will often be a better alternative. If you are curious to read more about JOINs, the Wikipedia article on joins is a good starting point, along with any of the visual representations of how the different types of JOIN work.
Another way to do it: the inner join will only include rows from T1 that match up with a row from T2 via the Id field.
select T1.* from T1 inner join T2 on T1.Id = T2.Id
In practice, inner joins are usually preferable to subqueries for performance reasons.

SQL: Optimization problem, has rows?

I've got a query with five joins on some rather large tables (the largest has 10 million records), and I want to know if rows exist. So far I've done this to check if rows exist:
SELECT TOP 1 tbl.Id
FROM table tbl
INNER JOIN ... ON ... = ... (x5)
WHERE tbl.xxx = ...
Using this query in a stored procedure takes 22 seconds, and I would like it to be close to "instant". Is this even possible? What can I do to speed it up?
I got indexes on the fields that I'm joining on and the fields in the WHERE clause.
Any ideas?
Switch to the EXISTS predicate. In general I have found it to be faster than selecting the top 1, etc.
So you could write something like IF EXISTS (SELECT * FROM table tbl INNER JOIN table tbl2 ...) and then do your stuff; a sketch follows.
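A minimal sketch of that shape, with made-up table and column names standing in for the real five-join query:
IF EXISTS (SELECT *
           FROM mytable tbl
           INNER JOIN anothertable tbl2 ON tbl.fk_id = tbl2.id
           -- ...repeat for the other four joins...
           WHERE tbl.xxx = 42)
BEGIN
    PRINT 'rows exist';  -- do your stuff here
END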
Depending on your RDBMS, you can check which parts of the query are taking a long time and which indexes are being used (so you can verify they're being used properly).
In MSSQL, you can see a diagram of the execution path of any query you submit.
In Oracle and MySQL you can use the EXPLAIN keyword to get details about how the query is working.
But it might just be that 22 seconds is the best you can do with your query. We can't answer that; only the execution details provided by your RDBMS can. If you tell us which RDBMS you're using, we can tell you how to find the information you need to see where the bottleneck is.
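In MySQL, for instance, you just prefix the query with the keyword (made-up names again, since we don't have the real query):
EXPLAIN
SELECT tbl.Id
FROM mytable tbl
INNER JOIN anothertable t2 ON tbl.fk_id = t2.id
WHERE tbl.xxx = 42;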
4 options
Try COUNT(*) in place of TOP 1 tbl.id
An index per column may not be good enough: you may need to use composite indexes
Are you on SQL Server 2005? If so, you can find missing indexes (see the sketch after this list), or try the Database Tuning Advisor
Also, it's possible that you don't need 5 joins.
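On the missing-indexes point, a sketch of the lookup, assuming SQL Server 2005 or later; note these dynamic management views only reflect queries run since the last restart:
SELECT d.statement AS table_name,
       d.equality_columns,
       d.inequality_columns,
       d.included_columns
FROM sys.dm_db_missing_index_details d;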
As for not needing all five joins: assuming a parent-child-grandchild hierarchy, grandchild rows can't exist without the parent rows (assuming you have foreign keys).
So your query could become
SELECT TOP 1
    tbl.Id  -- or COUNT(*)
FROM
    grandchildtable tbl
    INNER JOIN anothertable ON ... = ...
WHERE
    tbl.xxx = ...
Try EXISTS.
For either the five tables or the assumed hierarchy:
SELECT TOP 1  -- or COUNT(*)
    tbl.Id
FROM
    grandchildtable tbl
WHERE
    tbl.xxx = ...
    AND EXISTS (SELECT *
                FROM anothertable T2
                WHERE tbl.key = T2.key /* AND T2 condition */)

-- or

SELECT TOP 1  -- or COUNT(*)
    tbl.Id
FROM
    mytable tbl
WHERE
    tbl.xxx = ...
    AND EXISTS (SELECT *
                FROM anothertable T2
                WHERE tbl.key = T2.key /* AND T2 condition */)
    AND EXISTS (SELECT *
                FROM yetanothertable T3
                WHERE tbl.key = T3.key /* AND T3 condition */)
Applying a filter early, in your first select, will help if you can do it; if you filter the data in the first step, all the joins will operate on reduced data.
SELECT TOP 1 tbl1.id
FROM
(
    SELECT TOP 1 *
    FROM table tbl1
    WHERE Key = Key  -- your filter here
) tbl1
INNER JOIN ...
Beyond that, you will likely need to provide more of the query for anyone to understand how it works.
Maybe you could offload/cache this fact-finding mission. If it doesn't need to be done dynamically or at runtime, just cache the result into a much smaller table and then query that. Also, make sure all the tables you're querying have the appropriate clustered indexes. Granted, you may be using these tables for other types of queries, but for the absolute fastest result you can tune all your clustered indexes for this one query.
Edit: Yes, what other people said. Measure, measure, measure! Your query plan estimate can show you where your bottleneck is.
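A sketch of that caching idea, with made-up names; populate the summary table on whatever schedule keeps it fresh enough, and the runtime check becomes a probe of one small indexed table:
-- populate once (or on a schedule), not per lookup
SELECT tbl.Id
INTO ExistingMatches
FROM grandchildtable tbl
INNER JOIN anothertable t2 ON tbl.fk_id = t2.id
WHERE tbl.xxx = 42;

CREATE CLUSTERED INDEX IX_ExistingMatches_Id ON ExistingMatches (Id);

-- runtime check
IF EXISTS (SELECT * FROM ExistingMatches) PRINT 'rows exist';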
Put the table with the most rows first in every join. If there is more than one condition in the WHERE clause, the sequence of the conditions matters: lead with the condition that does the most filtering.
Use filters very carefully when optimizing a query.

Is it better to put more logic in your ON clause, or should it only have the minimum necessary?

Given these two queries:
Select t1.id, t2.companyName
from table1 t1
INNER JOIN table2 t2 on t2.id = t1.fkId
WHERE t2.aField <> 'C'
OR:
Select t1.id, t2.companyName
from table1 t1
INNER JOIN table2 t2 on t2.id = t1.fkId and t2.aField <> 'C'
Is there a demonstrable difference between the two? It seems to me that the clause t2.aField <> 'C' will run on every row in t2 that meets the join criteria regardless. Am I incorrect?
Update: I did an "Include Actual Execution Plan" in SQL Server. The two queries were identical.
I prefer to use the JOIN criteria for explaining how the tables are joined together, so I would place the additional clause in the WHERE section.
I hope (although I have no stats) that SQL Server is clever enough to find the optimal query plan regardless of the syntax you use.
HOWEVER, if you have indexes which include both id and aField, I would suggest placing both conditions in the INNER JOIN criteria.
It would be interesting to see the query plans in these two (or three) scenarios and see what happens. Nice question.
There is a difference. You should do an EXPLAIN PLAN for both of the selects and see it in detail.
As for a simpler explanation:
The WHERE clause gets executed only after the joining of the two tables, so it executes for each row returned from the join and not necessarily for every one from table2.
Performance-wise, it's best to eliminate unwanted results early on, so there are fewer rows for joins, WHERE clauses, or other operations to deal with later on.
In the second example, there are two columns that have to line up for the rows to be joined together, so with an outer join it can give different results than the first one (see the example below).
It depends.
SELECT
t1.foo,
t2.bar
FROM
table1 t1
LEFT JOIN table2 t2 ON t1.SomeId = t2.SomeId
WHERE
t2.SomeValue IS NULL
is different from
SELECT
t1.foo,
t2.bar
FROM
table1 t1
LEFT JOIN table2 t2 ON t1.SomeId = t2.SomeId AND t2.SomeValue IS NULL
It is different because the former keeps only the t1 rows whose match in t2 has NULL in t2.SomeValue (or that have no match in t2 at all), whereas the latter keeps every t1 row and simply leaves the t2 columns empty whenever the extra condition fails.
Just use the ON clause for the join condition and the WHERE clause for the filter.
Unless moving the join condition to the WHERE clause changes the meaning of the query (as in the LEFT JOIN example above), it doesn't matter where you put them. SQL will rearrange them, and as long as they are provably equivalent, you'll get the same query.
That being said, I think it's more of a logical / readability thing. I usually put anything that relates two tables in the join, and anything that filters in the where.
I'd prefer the first query. SQL Server will use the best join type for your query based on the indexes you have, and after that it will apply the WHERE clause. But you can run both queries at the same time, look at the execution plans, compare, and choose the fastest (optimizing by adding indexes as well).
Unless you are working on a single-user app or something similarly small that creates trivial load, the only considerations that mean anything are how the server will process your query.
The answers that mention query plans give good advice.
In addition, set statistics io on to get an idea of how many reads your query will generate (I especially love Azder's post).
Think of every DB server as a pump of data from disk to client. That pump goes faster if it performs only the I/O needed to get the job done. If the data is in cache, it will be even faster. But you don't want to read more than you need from disk: that will crowd useful data out of your cache for no good reason.