Performance of updating one big table using values from one small table - sql

First, I know that the sql statement to update table_a using values from table_b is in the form of:
Oracle:
UPDATE table_a
SET (col1, col2) = (SELECT cola, colb
                    FROM table_b
                    WHERE table_a.key = table_b.key)
WHERE EXISTS (SELECT *
              FROM table_b
              WHERE table_a.key = table_b.key)
MySQL:
UPDATE table_a
INNER JOIN table_b ON table_a.key = table_b.key
SET table_a.col1 = table_b.cola,
table_a.col2 = table_b.colb
What I understand is the database engine will go through records in table_a and update them with values from matching records in table_b.
So, if I have 10 million records in table_a and only 10 records in table_b:
Does that mean the engine will do 10 million iterations through table_a just to update 10 records? Are Oracle/MySQL/etc. smart enough to do only 10 iterations through table_b?
Is there a way to force the engine to actually iterate through records in table_b instead of table_a to do the update? Is there an alternative syntax for the sql statement?
Assume that table_a.key and table_b.key are indexed.

Either engine should be smart enough to optimize the query based on the fact that there are only ten rows in table_b. How the engine determines what to do is based on factors like indexes and statistics.
If the "key" column is the primary key and/or is indexed, the engine will have to do very little work to run this query. It will basically already sort of "know" where the matching rows are, and look them up very quickly. It won't have to "iterate" at all.
If there is no index on the key column, the engine will have to do a "table scan" (roughly the equivalent of "iterate") to find the right values and match them up. This means it will have to scan through 10 million rows.
Do a little reading on what's called an Execution Plan. This is basically an explanation of what work the engine had to do in order to run your query (some databases show it in text only, some have the option of seeing it graphically). Learning how to interpret an Execution Plan will give you great insight into adding indexes to your tables and optimizing your queries.
Look these up if they don't work (it's been a while), but it's something like:
In MySQL, put the word "EXPLAIN" in front of your SELECT statement
In Oracle, run "SET AUTOTRACE ON" before you run your SELECT statement
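As a concrete sketch of both (exact output varies by version; note that EXPLAIN on UPDATE statements requires MySQL 5.6 or later):

```sql
-- MySQL: prefix the statement with EXPLAIN to see the plan
EXPLAIN
UPDATE table_a
INNER JOIN table_b ON table_a.key = table_b.key
SET table_a.col1 = table_b.cola,
    table_a.col2 = table_b.colb;

-- Oracle (SQL*Plus): turn on AUTOTRACE, then run the statement;
-- the execution plan and run statistics print after it completes
SET AUTOTRACE ON
```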
I think the first (Oracle) query would be better written with a JOIN instead of a WHERE EXISTS. The engine may be smart enough to optimize it properly either way. Once you get the hang of interpreting an execution plan, you can run it both ways and see for yourself. :)

Okay, I know answering your own question is usually frowned upon, but I already accepted another answer and won't unaccept it, so here it is...
I've discovered a much better alternative that I'd like to share with anyone who encounters the same scenario: the MERGE statement.
Apparently, newer Oracle versions introduced this MERGE statement, and it blows the UPDATE approach away. Not only is the performance much better in most cases, the syntax is so simple and makes so much sense that I feel stupid for having used the UPDATE statement! Here it comes...
MERGE INTO table_a
USING table_b
ON (table_a.key = table_b.key)
WHEN MATCHED THEN UPDATE SET
table_a.col1 = table_b.cola,
table_a.col2 = table_b.colb;
What's more, I can extend the statement to include an INSERT action for records in table_b that have no matching records in table_a:
MERGE INTO table_a
USING table_b
ON (table_a.key = table_b.key)
WHEN MATCHED THEN UPDATE SET
table_a.col1 = table_b.cola,
table_a.col2 = table_b.colb
WHEN NOT MATCHED THEN INSERT
(key, col1, col2)
VALUES (table_b.key, table_b.cola, table_b.colb);
This new statement type made my day the day I discovered it :)
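For completeness, since the MERGE shown above is Oracle syntax: MySQL has no MERGE, but the closest equivalent is INSERT ... ON DUPLICATE KEY UPDATE, sketched here on the assumption that key is the primary (or a unique) key of table_a:

```sql
-- `key` is a reserved word in MySQL, hence the backticks
INSERT INTO table_a (`key`, col1, col2)
SELECT `key`, cola, colb
FROM table_b
ON DUPLICATE KEY UPDATE
    col1 = table_b.cola,
    col2 = table_b.colb;
```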

Related

SQL reduce data in join or where

I want to know which is faster, assuming I have the following queries and they retrieve the same data:
select * from tableA a inner join tableB b on a.id = b.id where b.columnX = value
or
select * from tableA a inner join (select * from tableB where columnX = value) b on a.id = b.id
I think it makes sense to reduce the dataset from tableB in advance, but I don't find anything to back up my perception.
In a database such as Teradata, the two should have exactly the same performance characteristics.
SQL is not a procedural language. A SQL query describes the result set. It does not specify the sequence of actions.
SQL engines process a query in three steps:
Parse the query.
Optimize the parsed query.
Execute the optimized query.
The second step gives the engine a lot of flexibility. And most query engines will be quite intelligent about ignoring subqueries, using indexes and partitions based on where clauses, and so on.
Most SQL dialects compile your query into an execution plan. Teradata and most SQL systems show the expected execution plan with the "explain" command. Teradata also has a visual explain, which is simple to learn from.
Whether either method is advantageous depends on the data volumes and key types in each table.
Most SQL compilers will work this out correctly using the current table statistics (data size and spread)
In some SQL systems your second form could be worse, as it may force a full temporary table build containing ALL the fields of tableB.
It should be (not that I recommend this query style at all):
select * from tableA a inner join (select id from tableB where columnX = value) b on a.id = b.id
In most cases, don't worry about this, unless you have a specific performance issue, and then use the explain commands to work out why
A better way in general is to use common table expressions (CTEs) to break the problem down. This leads to better queries that can be tested and maintained over the long term.
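As a sketch, the same query expressed with a CTE (supported by Teradata, and by MySQL 8.0+):

```sql
WITH filtered_b AS (
    SELECT id, columnX
    FROM tableB
    WHERE columnX = value
)
SELECT a.*, b.columnX
FROM tableA a
INNER JOIN filtered_b b ON a.id = b.id;
```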
Whenever you come across such a scenario in Teradata, where you wonder which query would yield results faster, use the EXPLAIN plan, which will show how the PE (parsing engine) is going to retrieve the records. If you are using Teradata SQL Assistant, you can select the query and press F6.
The DBMS decides the access path that will be used to resolve the query, you can't decide it, but you can do certain things like declaring indexes so that the DBMS takes those indexes into consideration when deciding which access path it will use to resolve the query, and then you will get a better performance.
For instance, in this example you are filtering tableB by b.columnX, normally if there are no indexes declared for tableB the DBMS will have to do a full table scan to determine which rows fulfill that condition, but suppose you declare an index on tableB by columnX, in that case the DBMS will probably consider that index and determine an access path that makes use of the index, getting a much better performance than a full table scan, specially if the table is big.
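For instance, declaring such an index is a one-liner (the index name is illustrative), and in Teradata you would typically also collect statistics on the column so the optimiser can cost the access path:

```sql
-- Standard syntax (Teradata's secondary-index syntax differs slightly:
-- CREATE INDEX idx_b_x (columnX) ON tableB;)
CREATE INDEX idx_b_x ON tableB (columnX);

-- Teradata also relies on collected statistics for costing
-- (exact syntax varies by version):
COLLECT STATISTICS ON tableB COLUMN (columnX);
```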

Oracle picks wrong index when joining table

I have a long query, but similar to the short version here:
select * from table_a a
left join table_b b on
b.id = a.id and
b.name = 'CONSTANT';
There are 2 indexes on table_b for both id and name, idx_id has fewer duplicates and idx_name has a lot of duplicates. This is quite a large table (20M+ records). And the join is taking 10min+.
A simple explain plan shows a lot of memory use on the join part, and it shows that the index on name is used as opposed to the one on id.
How to solve this issue? How to force using the idx_id index?
I was thinking of moving b.name='CONSTANT' into the WHERE clause, but this is a left join, and a WHERE filter would remove all the rows from table_a that have no match in table_b.
Updated explain plan. Sorry cannot paste the whole plan.
Explain plan with b.name='CONSTANT':
Explain plan when commenting b.name clause:
Add an optimizer hint to your query.
Without knowing your 'long' query, it's difficult to know whether Oracle is using the wrong index, or whether your interpretation (idx_id has fewer duplicates than idx_name, therefore it must be quicker for this query) is correct.
To add a hint the syntax is
select /*+ index(table_name index_name) */ * from ....;
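Applied to the query in the question - assuming the index on table_b.id really is named idx_id; note the hint must reference the table's alias (b), not its name, once an alias is used:

```sql
SELECT /*+ index(b idx_id) */ *
FROM table_a a
LEFT JOIN table_b b
    ON b.id = a.id
   AND b.name = 'CONSTANT';
```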
What is the size of TABLE_A relative to TABLE_B? It wouldn't make sense to use the ID index unless TABLE_A had significantly fewer rows than TABLE_B.
Index range scans are generally only useful when they access a small percentage of the rows in a table. Oracle reads the index one-block-at-a-time, and then still has to pull the relevant row from the table. If the index isn't very selective, that process can be slower than a multi-block full table scan.
Also, it might help if you can post the full explain plan using this text format:
explain plan for select ... ;
select * from table(dbms_xplan.display);

how to improve the performance on this type of Query?

I have a query that searches for a range of user accounts. Every time I run it, I pass in the first 5 digits of multiple ID numbers and search based on those. Is there another way to re-write this query when searching on more than 10 user IDs? Is there going to be a huge performance hit with this kind of search?
example:
select A.col1,B.col2,B.col3
from table1 A,table2 B
where A.col2=B.col2
and (B.col_id like '12345%'
OR B.col_id like '47474%'
OR B.col_id like '59598%');
I am using Oracle11g.
Actually, it is not important how many user IDs you pass to the query. The most important consideration is the selectivity of your query. In other words: how many rows will your query return, and how many rows are there in your tables? If the number of returned rows is relatively small, then it is a good idea to create an index on column B.col_id.

There is also nothing bad about using OR conditions. Basically, each OR will add one more INDEX RANGE SCAN to the execution plan, with a final CONCATENATION (but check your actual plan to be sure). If the total cost of all those operations is lower than a full table scan, the Oracle CBO will use your index. Otherwise, if you select >= 20-30% of the data at once, a full table scan is most likely to happen, and you should worry even less about the ORs, because all the data will be read anyway and comparing each value against your multiple conditions won't add much overhead.
Generally, a LIKE pattern with a leading wildcard makes it impossible for Oracle to use an index; a trailing wildcard such as '12345%' can still use an index range scan.
If the query is going to be reused, consider creating a synthetic column with the first 5 characters of COL_ID. Put a non-unique index on it. Put your search keys in a table and join that to that column.
There may be a way to do it with a functional index on the first 5 characters.
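A sketch of that function-based-index approach (the index name is assumed; the prefix length of 5 comes from the question):

```sql
-- Function-based index on the first 5 characters of col_id
CREATE INDEX idx_b_colid_pfx ON table2 (SUBSTR(col_id, 1, 5));

-- The predicate must match the indexed expression exactly for the
-- index to be considered:
SELECT A.col1, B.col2, B.col3
FROM table1 A, table2 B
WHERE A.col2 = B.col2
  AND SUBSTR(B.col_id, 1, 5) IN ('12345', '47474', '59598');
```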
I don't know if the performance will be better or not, but another way to write this is with a union:
select A.col1,B.col2,B.col3
from table1 A,table2 B
where A.col2=B.col2
and (A.col_id like '12345%')
union all
select A.col1,B.col2,B.col3
from table1 A,table2 B
where A.col2=B.col2
and (A.col_id like '47474%') -- you did mean A.col_id again, right?
union all
select A.col1,B.col2,B.col3
from table1 A,table2 B
where A.col2=B.col2
and (A.col_id like '59598%'); -- and A.col_id here too?

Many to one joins

Let's say I have the following tables:
create table table_a
(
id_a,
name_a,
primary_key (id_a)
);
create table table_b
(
id_b,
id_a not null, -- (Edit)
name_b,
primary_key (id_b),
foreign_key (id_a) references table_a (id_a)
);
We can create a join view on these tables in a number of ways:
create view join_1 as
(
select
b.id_b,
b.id_a,
b.name_b,
a.name_a
from table_a a, table_b b
where a.id_a = b.id_a
);
create view join_2 as
(
select
b.id_b,
b.id_a,
b.name_b,
a.name_a
from table_b b left outer join table_a a
on a.id_a = b.id_a
);
create view join_3 as
(
select
b.id_b,
b.id_a,
b.name_b,
(select a.name_a from table_a a where b.id_a = a.id_a) as name_a
from table_b b
);
Here we know:
(1) There must be at least one entry from table_a with id_a (due to the foreign key in table B) AND
(2) At most one entry from table_a with id_a (due to the primary key on table A)
so we know that there is exactly one entry in table_a that links in with the join.
Now consider the following SQL:
select id_b, name_b from join_X;
Note that this selects no columns from table_a, and because we know each row of table_b joins to exactly one row of table_a, we really shouldn't have to look at table_a when performing the above select.
So what is the best way to write the above join view?
Should I just use the standard join_1, and hope the optimizer figures out that there is no need to access table_a based on the primary and foreign keys?
Or is it better to write it like join_2 or even join_3, which makes it more explicit that there is exactly one row from the join for each row from table_b?
Edit + additional question
Is there any time I should prefer a sub-select (as in join_3) over a normal join (as in join_1)?
Why are you using views at all for this purpose? If you want to get data from the table, get it from the table.
Or, if you need a view to translate some columns in the tables (such as coalescing NULLs into zeros), create a view over just the B table. This will also apply if some DBA wanna-be has implemented a policy that all selects must be via views rather than tables :-)
In both those cases, you don't have to worry about multi-table access.
Intuitively, I'd think join_1 will perform slightly slower, because your assumption that the optimiser can transform the join away is wrong as you didn't declare the table_b.id_a column to be NOT NULL. In fact, this means that (1) is wrong. table_b.id_a can be NULL. Even if you know it can't be, the optimiser doesn't know that.
As far as join_2 and join_3 are concerned, depending on your database, optimisation might be possible. The best way to find out is to run (Oracle syntax):
EXPLAIN select id_b, name_b from join_X;
And study the execution plan. It will tell you whether table_a was joined or not. On the other hand, if your view should be reusable, then I'd go for the plain join and forget about premature optimisation. You can achieve better results with proper statistics and indexes, as a join operation is not always so expensive. But that depends on your statistics, of course.
1 + 2 are effectively identical under SQL Server.
I have never used [3], but it looks quite odd. I would strongly suspect the optimizer will make it equivalent to the other 2.
It's a good exercise to run all 3 statements and compare the execution plans produced.
So given identical performance, the clearest to read gets my vote - [2] is the standard where it is supported otherwise [1].
In your case, if you don't want any columns from A, why include Table_A in the statement anyway?
If it's simply a filter - i.e. only include rows from Table B where a row exists in Table A, even though I don't want any cols from Table A - then all 3 syntaxes are fine, although you may find that use of WHERE EXISTS is more performant in some DBs:
SELECT * from Table_B b WHERE EXISTS (SELECT 1 FROM Table_A a WHERE b.id_b = a.id_a)
although in my experience this is usually equivalent in performance to any of the others.
You also ask, then would you choose a subquery over the other expressions. This boils down to whether it's a CORRELATED subquery or not.
Basically - a correlated subquery has to be run once for every row in the outer statement - this is true of the above - for every row in Table B you must run the subquery against Table A.
If the subquery can be run just once
SELECT * from Table_B b WHERE b.id_a IN (SELECT a.id_a FROM Table_A a WHERE a.id_a > 10)
Then the subquery is generally more performant than a join - although I suspect that some optimizers will still be able to spot this and reduce both to the same execution plan.
Again, the best thing to do is to run both statements, and compare execution plans.
Finally and most simply - given the FK you could just write:
SELECT * From Table_B b WHERE b.id_a IS NOT NULL
It will depend on the platform.
SQL Server both analyses the logical implications of constraints (foreign keys, primary keys, etc) and expands VIEWs inline. This means that the 'irrelevant' portion of the VIEW's code is obsoleted by the optimiser. SQL Server would give the exact same execution plan for all three cases. (Note; there is a limit to the complexity that the optimiser can handle, but it can certainly handle this.)
Not all platforms are created equal, however.
- Some may not analyse constraints in the same way, assuming you coded the join for a reason
- Some may pre-compile the VIEW's execution/explain plan
As such, to determine behaviour, you must be aware of the specific platform's capabilities. In the vast majority of situations the optimiser is a complex beast, so the best test is simply to try it and see.
EDIT
In response to your extra question - are correlated sub-queries ever preferred? There is no simple answer, as it depends on the data and the logic you are trying to implement.
There are certainly cases in the past where I have used them, both to simplify a query's structure (for maintainability), but also to enable specific logic.
If the field table_b.id_a referenced many entries in table_a you may only want the name from the latest one. And you could implement that using (SELECT TOP 1 name_a FROM table_a WHERE id_a = table_b.id_a ORDER BY id_a DESC).
In short, it depends.
- On the query's logic
- On the data's structure
- On the code's final layout
More often than not I find it's not needed, but measurably often I find that it is a positive choice.
Note:
Depending on the correlated sub-query, it doesn't actually always get executed 'once for every record'. SQL Server, for example, expands the required logic to be executed in-line with the rest of the query. It's important to note that SQL code is processed/compiled/whatever before being executed. SQL is simply a method for articulating set based logic, which is then transformed into traditional loops, etc, using the most optimal algorithms available to the optimiser.
Other RDBMS may perform differently, due to the capabilities or limitations of the optimiser. Some RDBMS perform well when using IN (SELECT blah FROM blah), or when using EXISTS (SELECT * FROM blah), but some perform terribly. The same applies to correlated sub-queries: some handle them exceptionally well, some don't perform so well, but most handle them very well in my experience.

How can I speed up a joined update in SQL? My statement seems to run indefinitely

I have two tables: a source table and a target table. The target table will have a subset of the columns of the source table. I need to update a single column in the target table by joining with the source table based on another column. The update statement is as follows:
UPDATE target_table tt
SET special_id = ( SELECT source_special_id
FROM source_table st
WHERE tt.another_id = st.another_id )
For some reason, this statement seems to run indefinitely. The inner select completes almost immediately when executed by itself. The table has roughly 50,000 records and it's hosted on a powerful machine (resources are not an issue).
Am I doing this correctly? Any reasons the above wouldn't work in a timely manner? Any better way to do this?
Your initial query executes the inner subquery once for every row in the outer table. See if Oracle likes this better:
UPDATE tt
SET special_id = st.source_special_id
FROM target_table tt
INNER JOIN source_table st
    ON tt.another_id = st.another_id
(edited after posted query was corrected)
Add:
If the join syntax doesn't work on Oracle, how about:
UPDATE tt
SET special_id = st.source_special_id
FROM target_table tt, source_table st
WHERE tt.another_id = st.another_id
The point is to join the two tables rather than using the outer query syntax you are currently using.
Is there an index on source_table(another_id)? If not, source_table will be fully scanned once for each row in target_table. This will take some time if target_table is big.
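If that index is missing, creating it (the name is illustrative) is usually the first thing to try:

```sql
CREATE INDEX ix_source_another_id ON source_table (another_id);
```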
Is it possible for there to be no match in source_table for some target_table rows? If so, your update will set special_id to null for those rows. If you want to avoid that do this:
UPDATE target_table tt
SET special_id = ( SELECT source_special_id
FROM source_table st
WHERE tt.another_id = st.another_id )
WHERE EXISTS( SELECT NULL
FROM source_table st
WHERE tt.another_id = st.another_id );
If target_table.another_id was declared as a foreign key referencing source_table.another_id (unlikely in this scenario), this would work:
UPDATE
( SELECT tt.primary_key, tt.special_id, st.source_special_id
FROM target_table tt
JOIN source_table st ON st.another_id = tt.another_id
)
SET special_id = source_special_id;
Are you actually sure that it's running?
Have you looked for blocking locks? indefinitely is a long time and that's usually only achieved via something stalling execution.
Update: Ok, now that the query has been fixed -- I've done this exact thing many times, on unindexed tables well over 50K rows, and it worked fine in Oracle 10g and 9i. So something else is going on here; yes, you are calling for nested loops, but no, it shouldn't run forever, even so. What are the primary keys on these tables? Do you by any chance have multiple rows from the second table matching the same row for the first table? You could be trying to rewrite the whole table over and over, throwing the locking system into a fit.
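A quick sanity check for the multiple-match situation described above - any rows returned here mean the subquery can yield more than one row per target row:

```sql
SELECT another_id, COUNT(*) AS matches
FROM source_table
GROUP BY another_id
HAVING COUNT(*) > 1;
```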
Original Response
That statement doesn't really make sense -- you are telling it to update all the rows where ids match, to the same id (meaning, no change happens!).
I imagine the real statement looks a bit different?
Please also provide table schema information (primary keys for the 2 tables, any available indexes, etc).
Not sure what Oracle has available, but MS SQL Server has a tuning advisor that you can feed your queries into, and it will give recommendations for adding indexes, etc. I would assume Oracle has something similar.
That would be the quickest way to pinpoint the issue.
I don't know Oracle, but the MS SQL Server optimizer would have no problem converting the subquery into a join for you.
It sounds like you might be doing a data import against a short-lived or newly created table. It is easy to overlook indexing these kinds of tables. I'd make sure there's an index on source_table.another_id - or a covering index on (another_id, source_special_id) (order matters; another_id should be first).
In cases such as these (a query running unexpectedly for longer than 1 second), it is best to figure out how your environment provides query plans. Examine that plan and the problem will become clear.
You see, there is no such thing as "optimized SQL code". SQL code is never executed directly - query plans are generated from the code, and then those plans are executed.
Check that the statistics are up to date on the tables - see this question
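In Oracle, refreshing the statistics is a one-liner per table (EXEC is SQL*Plus shorthand; the schema name is a placeholder):

```sql
-- Gather fresh optimizer statistics so the planner can cost the join correctly
EXEC DBMS_STATS.GATHER_TABLE_STATS('MY_SCHEMA', 'TARGET_TABLE');
EXEC DBMS_STATS.GATHER_TABLE_STATS('MY_SCHEMA', 'SOURCE_TABLE');
```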
I had the same problem, and got a "SQL command not properly ended" when trying Codewerks' answer in Oracle 11g. A bit of Googling turned up the Oracle MERGE statement, which I adapted as follows:
MERGE INTO target_table tt
USING source_table st
ON (tt.another_id = st.another_id)
WHEN MATCHED THEN
UPDATE SET tt.special_id = st.source_special_id;
If you're not sure all the values of another_id will be in source_table then you can use the WHEN NOT MATCHED THEN clause to handle that case.
If you have constraints that let Oracle know there is a 1-1 relationship when you join on "another_id", I think this might work well:
UPDATE ( SELECT tt.rowid tt_rowid, tt.another_id, tt.special_id, st.source_special_id
FROM target_table tt, source_table st
WHERE tt.another_id = st.another_id
ORDER BY tt.rowid )
SET special_id = source_special_id
Ordering by ROWID is important when you are updating an updatable view with a join.