Oracle SQL Update Query, looping to limit rows updated at a time

I have a query that updates all the past history for a newly added column. It pulls the values from a source table with a corresponding ID. It also compares an update time against the current time; that condition may change, but for now it guarantees this will run on every row.
UPDATE table1
SET table1.comment =
    (SELECT table2.comment
     FROM table2
     WHERE table1.ID = table2.ID)
WHERE (SELECT table2.updateTime
       FROM table2
       WHERE table1.ID = table2.ID) < sysdate
There are millions of rows in production, and I need to limit this in a loop or something so that it only updates so many rows at a time. I am fairly new to SQL and have not been able to find any documentation on how a loop would limit the number of rows updated. How does a loop even know anything about the rows in the tables being used?

A couple of things... First, if your tables have primary keys, then this would probably be the preferred update methodology:
update (
    select
        t1.comment c1, t2.comment c2
    from
        table1 t1,
        table2 t2
    where
        t1.id = t2.id and
        t2.updateTime < sysdate
)
set
    c1 = c2
Second, assuming updateTime means what I would think it means, wouldn't it always be less than sysdate? Is there a reason to do this?
Third, to minimize the number of unnecessary updates, I would think you could add this. Assuming only a percentage of rows actually require an update, this should dramatically improve performance.
update (
    select
        t1.comment c1, t2.comment c2
    from
        table1 t1,
        table2 t2
    where
        t1.id = t2.id and
        t2.updateTime < sysdate and
        ((t1.comment is null and t2.comment is not null) or
         (t1.comment is not null and t2.comment is null) or
         t1.comment != t2.comment)
)
set
    c1 = c2
Finally, I'm not saying a Loop would NEVER help, but I am saying it's generally the wrong approach. Oracle is tuned to do this sort of thing. If your update query is slow, it's doubtful wrapping it in a procedural loop will make it run faster. Updating millions of rows should not be an issue for a well tuned Oracle database.
I understand the thought behind doing this, and it makes sense in a human world. I've tried it myself only to have a wise Oracle man tell me I was wrong. When he tuned the update query, it turns out he was quite right.
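That said, if you genuinely need to cap how many rows each transaction touches (for example, to keep undo usage manageable), the usual PL/SQL pattern is a batched loop. This is only a sketch against the tables from the question; the 50,000 batch size is arbitrary, and it only touches rows whose source comment is non-null and different, so that each pass makes progress and the loop terminates:
-- Sketch only: batch the update and commit between batches.
-- Assumes table1/table2 from the question; 50,000 is an arbitrary batch size.
BEGIN
  LOOP
    UPDATE table1 t1
       SET t1.comment = (SELECT t2.comment
                           FROM table2 t2
                          WHERE t2.id = t1.id)
     WHERE EXISTS (SELECT 1
                     FROM table2 t2
                    WHERE t2.id = t1.id
                      AND t2.updateTime < sysdate
                      AND t2.comment IS NOT NULL
                      AND (t1.comment IS NULL OR t1.comment != t2.comment))
       AND ROWNUM <= 50000;

    EXIT WHEN SQL%ROWCOUNT = 0;   -- nothing left to update
    COMMIT;                       -- release undo/locks between batches
  END LOOP;
  COMMIT;
END;
/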

Related

Avoid multiple SELECT while updating a table's column relatively to another table's one

I am quite a newbie with SQL queries, but I need to modify a column of a table relative to a column of another table. For now I have the following query working:
UPDATE table1
SET date1 = (
    SELECT last_day(max(date2)) + 1
    FROM table2
    WHERE id = 123
)
WHERE id = 123
AND date1 = to_date('31/12/9999', 'dd/mm/yyyy');
The problem with this structure is that, I suppose, the SELECT query will be executed for every row of table1. So I tried to write another query, but this one has a syntax error somewhere after the FROM keyword:
UPDATE t1
SET t1.date1=last_day(max(t2.date2))+1
FROM table1 t1
INNER JOIN table2 t2
ON t1.id=t2.id
WHERE t1.id=123
AND t1.date1=to_date('31/12/9999', 'dd/mm/yyyy');
And besides that, I don't even know if this one is faster than the first one...
Do you have any idea how I can handle this issue?
Thanks a lot!
Kind regards,
Julien
The first query you wrote is fine. It won't be executed for every row of table1 as you fear. It will do the following:
- It will run the subquery to find the value you want to use in your UPDATE statement, searching through table2. Since you have stated the exact id from the table, it should be as fast as possible, as long as you have created an index on that (I guess a primary key) column.
- It will run the outer query, finding the single row you want to update. As before, it should be as fast as possible since you have stated the exact id, as long as there is an index on that column.
To summarize, if those IDs are unique, both your subquery and your outer query should return only one row, and the statement should execute as fast as possible. If you think execution is not fast enough (that it takes longer than the amount of data would justify), check whether those columns have unique values and whether they have unique indexes on them.
In fact, it would be best to add those indexes regardless of this problem, if they do not already exist and the columns do hold unique values, as that would drastically improve the performance of every query on these tables that searches through these id columns.
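For example, something along these lines (a sketch; the index names are invented, and you would skip any column that is already covered by a primary key):
-- Sketch only: unique indexes on the id columns, if the values really are unique
-- and the columns are not already primary keys.
CREATE UNIQUE INDEX table1_id_ux ON table1 (id);
CREATE UNIQUE INDEX table2_id_ux ON table2 (id);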
Please try to use MERGE
MERGE INTO (
    SELECT id,
           date1
    FROM table1
    WHERE date1 = to_date('31/12/9999', 'dd/mm/yyyy')
    AND id = 123
) t1
USING (
    SELECT id,
           last_day(max(date2)) + 1 max_date
    FROM table2
    WHERE id = 123
    GROUP BY id
) t2 ON (t1.id = t2.id)
WHEN MATCHED THEN
    UPDATE SET t1.date1 = t2.max_date;

How to redesign a database to find distinct values more effectively?

I often have a need to select a set of distinct values from a column with low selectivity in a big table while joining it to some other table, where I can't really filter the entries in the resulting set down to a reasonable amount.
For example, I have a table with 20M rows with a column someID that has 200 unique values. I join this table with some other result set on another column and filter the 20M rows down to, say, 10M rows (still a lot), and then need to find the distinct someID values. So I end up with a 10M-row scan no matter what, which is a pain.
In this join there is no way to filter the results further; 10M records really is the set I need to find distinct someID in.
Is there any standard approach to redesign the tables or create some additional table to make this work better?
Your basic query is:
select distinct t1.someID
from table1 t1 join
table2 t2
on t1.col1 = t2.col1;
The optimal indexes for this query are table1(col1, someId) and table2(col1).
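In DDL terms that would be something like this (a sketch; the index names are made up):
-- Sketch: indexes to support the join and the distinct on someID.
create index ix_table1_col1_someid on table1 (col1, someID);
create index ix_table2_col1 on table2 (col1);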
Here is another version of the query:
select distinct t1.someId
from table1 t1
where exists (select 1 from table2 t2 where t1.col1 = t2.col1);
In this case, the optimal index would be table1(someid, col1). It is possible that SQL Server will be intelligent in this case and stop looking for an exists value when it encounters a match (although I am a bit skeptical). You would have to investigate the execution plans generated on your data.
Another idea extends this even further:
select s.someId
from someIdtable s
where exists (select 1
              from table1 t1 join
                   table2 t2
                   on t1.col1 = t2.col1 and t1.someId = s.someId);
This removes the outer distinct, depending only on the semi-join in the exists clause. The optimal index would be table1(someid, col1).
Under some circumstances, this version would probably have the best performance -- for instance, if all the someIds were in the result set. On the other hand, if very few are, this might have poor performance.
I'm stealing the "basic query" from Gordon's answer:
select t1.someID
from table1 t1
join table2 t2 on t1.col1 = t2.col1
group by t1.someID
This query fits the requirements for indexed views, so you can index it. Running it will then result in a simple clustered index scan, which is as cheap as it gets.
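A rough sketch of what that could look like in SQL Server (the view, schema, and index names are assumptions; indexed views require SCHEMABINDING, two-part table names, and COUNT_BIG(*) when grouping):
-- Sketch only: an indexed view over the grouped join, assuming the dbo schema.
CREATE VIEW dbo.vw_table1_someID
WITH SCHEMABINDING
AS
SELECT t1.someID, COUNT_BIG(*) AS row_cnt
FROM dbo.table1 t1
JOIN dbo.table2 t2 ON t1.col1 = t2.col1
GROUP BY t1.someID;
GO

-- The unique clustered index is what materializes the view.
CREATE UNIQUE CLUSTERED INDEX IX_vw_table1_someID
ON dbo.vw_table1_someID (someID);
GO
On editions other than Enterprise you may need to query the view WITH (NOEXPAND) for the optimizer to use it.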

To execute SQL query takes a lot of time

I have two tables. Table 2 contains more recent records.
Table 1 has 900K records and Table 2 about the same.
The query below takes about 10 minutes to execute. While it is running, most other queries against Table 1 get a timeout exception.
DELETE T1
FROM Table1 T1 WITH(NOLOCK)
LEFT OUTER JOIN Table2 T2
ON T1.ID = T2.ID
WHERE T2.ID IS NULL AND T1.ID IS NOT NULL
Could someone help me to optimize the query above or write something more efficient?
Also, how can I fix the timeout issue?
The optimizer will likely choose to lock the whole table, since that is easier when it needs to delete that many rows. In cases like this I delete in chunks.
while (1 = 1)
begin
    with cte as
    (
        select *
        from Table1
        where Id not in (select Id from Table2)
    )
    delete top(1000) cte

    if @@rowcount = 0
        break

    waitfor delay '00:00:01' -- give it some rest :)
end
So the query deletes 1000 rows at a time. The optimizer will likely lock just a page to delete the rows, not the whole table.
The total time of this query execution will be longer, but it will not block other callers.
Disclaimer: assumed MS SQL.
Another approach is to use a SNAPSHOT transaction. That way, table readers will not be blocked while the rows are being deleted.
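Roughly like this (a sketch, assuming SQL Server; YourDatabase is a placeholder for the actual database name):
-- Sketch: allow snapshot isolation on the database (one-time setting),
-- then readers can opt in and will not be blocked by the delete.
ALTER DATABASE YourDatabase SET ALLOW_SNAPSHOT_ISOLATION ON;

SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRAN;
SELECT COUNT(*) FROM Table1;   -- reads a consistent snapshot, not blocked by the delete
COMMIT;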
Wait a second, are you trying to do this...
DELETE Table1 WHERE ID NOT IN (SELECT ID FROM Table2)
?
If so, that's how I would write it.
You could also try to update the statistics on both tables. And of course indexes on Table1.ID and Table2.ID could speed things up considerably.
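For example (a sketch; the index names are invented, and you would skip anything that already exists):
-- Sketch: refresh statistics and make sure the join columns are indexed.
UPDATE STATISTICS Table1;
UPDATE STATISTICS Table2;

CREATE INDEX IX_Table1_ID ON Table1 (ID);
CREATE INDEX IX_Table2_ID ON Table2 (ID);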
EDIT: If you're getting timeouts from the designer, increase the "Designer" timeout value in SSMS (default is 30 seconds). Tools -> Options -> Designers -> "Override connection string time-out value for table designer updates" -> enter reasonable number (in seconds).
Both ID columns need an index. Then use simpler SQL:
DELETE Table1 WHERE NOT EXISTS (SELECT * FROM Table2 WHERE Table1.ID = Table2.ID)

Select proper columns from JOIN statement

I have two tables: table1, table2. Table1 has 10 columns, table2 has 2 columns.
SELECT * FROM table1 AS T1 INNER JOIN table2 AS T2 ON T1.ID = T2.ID
I want to select all columns from table1 and only one column from table2. Is it possible to do that without enumerating all the columns from table1?
Yes, you can do the following:
SELECT t1.*, t2.my_col FROM table1 AS T1 INNER JOIN table2 AS T2 ON T1.ID = T2.ID
Even though you can do the t1.*, t2.col1 thing, I would not recommend it in production code.
I would never ever use a SELECT * in production - why?
- You're telling SQL Server to get all columns - do you really, really need all of them?
- By not specifying the column names, SQL Server has to go figure that out itself - it has to consult the data dictionary to find out which columns are present, which costs a little bit of performance.
- Most importantly: you don't know what you're getting back. Suddenly the table changes and another column or two are added. If you have any code that relies on, e.g., the sequence or the number of columns in the table without explicitly checking for it, your code can break.
My recommendation for production code: always (no exceptions!) specify exactly those columns you really need - and even if you need all of them, spell it out explicitly. Fewer surprises, fewer bugs to hunt for, if anything ever changes in the underlying table.
Use table1.* in place of all columns of table1 ;)

SQL: how do I speed up this query

Here is the situation. I have one table that contains records based on records in many different tables (t1 below). t2 is one of the tables that has information pulled from it into t1.
t1
  table_oid  -- which table the id is a FK to
  id         -- FK to the other table
  store_num  -- field
t2
  t2_id
Here is what I need to find: I need the largest t2_id where the store_num is not null in the corresponding record of t1. Here is the query I wrote:
select max(id) from t1
join t2 on t2.t2_id = t1.id
where store_num is not null
and table_oid = 1234;
However, this takes fairly long, and I think it should be a fast query. All the _id columns have indexes on them (t1.id/t1.table_oid, t2.t2_id). The vast majority of entries in t1 have a store_num.
Mentally, I would get the t2_ids in descending order, then one by one try them against t1 until I found the first one that had a store_num.
select t2_id from t2 order by t2_id desc;
has an explain cost of 25612
select t1.* from t1 where table_oid = 1234
and id in (select max(t2_id) from t2);
has an explain cost of 8.
So why wouldn't the above query be a cost of at most 25612*8 = 204896? When I explain it, it comes back as more than 3 times that.
Really, my question is how do I re-write that query to run faster.
NOTE: I am using Oracle.
EDIT:
t2 has 11,895,731 rows
t1 has 473,235,192 rows
EDIT 2:
As I've tried different things, the part of the query that is taking the longest is the full scan on t1 looking for the store_num. Is there a way to keep this from doing a full scan, since I only need the biggest entry?
You say:
all _ids have indexes for them
But your query is:
...
where store_num is not null
and table_oid = 1234;
All of your _id indexes are useless for this query unless store_num and table_oid are also indexed, and are the first columns in said index.
So of course it has to do a full scan; it can give you back max(id) instantly without any filter conditions, but as soon as you put in the filter, it can't use the id index anymore because it doesn't know which part of the index matches those store_num is not null entries - not without a scan.
To speed the query up, you need to create an index on (store_num, table_oid, id). Standard disclaimers about creating indexes for a single ad-hoc query apply; having too many indexes will hurt insert/update performance.
It really doesn't matter how you "rewrite" your query - this isn't like application code, the optimizer is going to rearrange all of the pieces of your query anyway. Unless you have sufficiently-selective indexes on your seek columns or the entire query is completely covered by a single index, it's going to be slow.
Not sure if these apply to Oracle. Do you have an index on the FK id column for the join? Also, if you can avoid it, NOT IN is a non-sargable predicate, which slows down a query.
Another option, which might be slower, is doing an outer join and then checking for NULL on that column (not sure whether that applies only to SQL Server either):
select max(id) from t1
left outer join t2 on t2.t2_id = t1.id
where t1... IS NULL
and table_oid = 1234;
The best way I can think of to have this run fast is to create an index on (TABLE_OID, ID DESC, COVERED_ENTITY_ID), in that order. Why?
- table_oid: this is your primary access condition
- id: so you don't have to access a data block to read it, and you get higher ID values first
- covered_entity_id: you're filtering the data based on this, null vs not null
That should prevent the need to access the 473m rows in T1 at all.
Ensure that there's an index on T2_ID.
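In DDL terms, those two indexes might look roughly like this (the index names are invented):
-- Sketch: composite index so the query can be answered from the index alone,
-- plus an index on the join column in t2.
CREATE INDEX t1_oid_id_covid_ix ON t1 (table_oid, id DESC, covered_entity_id);
CREATE INDEX t2_t2id_ix         ON t2 (t2_id);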
If all that's in place, a query like:
select max(id)
from t1
inner join t2
on t2.t2_id = t1.id
where covered_entity_id is not null
and table_oid = 1234;
should be able (the optimizer is a finicky beast) to do a semi-join driven by a fast full scan against the index on T1, never scanning the data blocks. Also consider writing it manually as:
select max(id)
from t1
where covered_entity_id is not null
and table_oid = 1234
and exists (select null
            from t2
            where t1.id = t2.t2_id);

or:

select max(id)
from t1
where covered_entity_id is not null
and table_oid = 1234
and id in (select t2_id from t2);

The optimizer may come up with slightly different plans for each of those.
In the following I assume covered_entity_id is the same as store_num - it would really make things easier for us if you were consistent in your naming.
"The vast majority of entries in t1 have a store_num."
Given that this is the case, the following clause shouldn't have any impact on the performance of your query ...
where covered_entity_id is not null
However, you go on to say that "the part of the query that is taking the longest is the full scan on t1 looking for the store_num".
This suggests the query is looking for covered_entity_id is not null first rather than the presumably far more selective table_oid = 1234. The solution might be as simple as re-writing the query like this ...
where table_oid = 1234
and covered_entity_id is not null;
... although I suspect not. You could try hinting to get the query to use the index on table_oid.
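For example (a sketch; t1_table_oid_ix is an assumed name, so substitute whatever index actually exists on table_oid):
-- Sketch: force the optimizer to drive from the index on table_oid.
select /*+ INDEX(t1 t1_table_oid_ix) */ max(t1.id)
from   t1
join   t2 on t2.t2_id = t1.id
where  t1.table_oid = 1234
and    t1.covered_entity_id is not null;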
The other thing is, how fresh are the statistics? When the optimizer chooses a radically bad execution plan it is often because the stats are out of date.
Incidentally, why are you joining to T2 at all? Your requirements could be met by selecting max(id) from T1 (unless you don't have a foreign key enforcing T1.ID references T2.T2_ID, and hence need to be sure).
Edit:
To check your statistics run this query:
select table_name
     , num_rows
     , last_analyzed
from   user_tables
where  table_name in ('T1', 'T2')
/
If the results show num_rows widely divergent from the values you gave in your first edit, then you should re-gather statistics. If last_analyzed is something like the day you went live, then you definitely should re-gather. You may want to export your statistics first; refreshing the statistics can affect the execution plans (that is the object of the exercise), usually for the better, but sometimes things can get worse.
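Re-gathering could look roughly like this (a sketch using DBMS_STATS; CASCADE => TRUE also refreshes the index statistics):
-- Sketch: re-gather optimizer statistics for both tables in the current schema.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'T1', cascade => TRUE);
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'T2', cascade => TRUE);
END;
/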