Oracle SQL - How to do massive updates more efficiently and faster?

I'm trying to update 500,000 rows at once. I have a table with products like this:
+------------+----------------+--------------+-------+
| PRODUCT_ID | SUB_PRODUCT_ID | DESCRIPTION  | CLASS |
+------------+----------------+--------------+-------+
| A001       | ACC1           | coffeemaker  | A     |
| A002       | ACC1           | toaster      | A     |
| A003       | ACC2           | coffee table | A     |
| A004       | ACC5           | couch        | A     |
+------------+----------------+--------------+-------+
I have sets of individual statements, for example:
update products set class = 'A' where product_id = 'A001';
update products set class = 'B' where product_id = 'A005';
update products set class = 'Z' where product_id = 'A150';
I'm building a script by putting one update statement below the other and adding a commit every 1,000 rows.
It works fine (slow, but fine), but I want to do it better if that's possible in any way.
Is there a better way to do this, more efficiently and faster?

One approach would be to create a temporary table holding your update information:
new_product_class:
product_id   class
==========   =====
A001         A
A005         B
A150         Z
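A minimal sketch of creating that table (the column types are assumptions based on the sample data; in Oracle it could also be a global temporary table):
CREATE TABLE new_product_class (
    product_id VARCHAR2(10) PRIMARY KEY,
    class      VARCHAR2(1)  NOT NULL
);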
product_id should be an indexed primary key on this new table. Then you can do an UPDATE or a MERGE on the old table joined to this temporary table:
UPDATE (SELECT p.class AS old_class, n.class AS new_class
        FROM product p
        JOIN new_product_class n ON (p.product_id = n.product_id))
SET old_class = new_class
or
MERGE INTO product p
USING new_product_class n
ON (p.product_id = n.product_id)
WHEN MATCHED THEN
UPDATE SET p.class = n.class
A MERGE should be fast. Other things you could look into, depending on your environment: creating a new table based on the old one with NOLOGGING followed by some renaming (take a backup before and after), or bulk updates.
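For the NOLOGGING-and-rename route, a rough sketch (table and column names are taken from the question; indexes, constraints, and grants would need to be recreated on the new table before the swap):
CREATE TABLE product_new NOLOGGING AS
SELECT p.product_id,
       p.sub_product_id,
       p.description,
       COALESCE(n.class, p.class) AS class
FROM product p
LEFT JOIN new_product_class n ON n.product_id = p.product_id;

-- recreate indexes, constraints, and grants on product_new, then:
ALTER TABLE product RENAME TO product_old;
ALTER TABLE product_new RENAME TO product;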

Unless you have an index, each of your update statements scans the entire table. Even if you do have an index, there is a cost associated with the compilation and execution of each statement.
If you have a lot of conditions, and those conditions can vary, then I think Glenn's solution is clearly the way to go. This does everything in a single transaction, and there is no reason to run batches of 1,000 rows -- just do everything all at once.
If the number of conditions is relatively finite (as in your example), and they don't change very often, then you can also do this with a simple case expression:
update products
set class =
    case product_id
        when 'A001' then 'A'
        when 'A005' then 'B'
        when 'A150' then 'Z'
    end
where product_id in ('A001', 'A005', 'A150')
If it's possible that your class field is already set to the correct value, then there is also great value in adding a condition to make sure you are not updating a row to the same value. For example, if this:
update products set class = 'A' where product_id = 'A001';
Updates 5,000 records, 4,000 of which are already set to 'A', then this would be significantly more efficient:
update products
set class = 'A'
where product_id = 'A001'
  and (class is null or class != 'A')

Related

Update if not exists in SQL Server?

Is there any way to run an update statement in SQL Server that skips rows that already exist in the target?
For instance, I have a view vw_BranchCaseCurrent which contains a CaseID and a BranchID, in addition to an auto-incremented ID. I want to do this:
update a
set a.CaseID = @NewCaseID
from vw_BranchCaseCurrent a
where a.CaseID = @OldCaseID;
But the problem is, if there is already an existing row in vw_BranchCaseCurrent for the new CaseID and the existing BranchID then this SQL will crash because it is violating the unique constraint on the backing table. So I'd need to skip that row when performing the update.
I was thinking maybe I could use a merge statement but I'm not entirely familiar with how those work...
There are about a dozen other views that need to be updated so I'm looking for something simple, if possible...
edit: let me clarify with an example:
| CaseID | BranchID |
|--------|----------|
| 42     | 8008     |
| 42     | 9001     |
| 86     | 9001     |
So I want to merge case 42 into case 86 by updating the CaseID field in this view. I want to change the first CaseID from 42 to 86. But I can't do anything with the second row, because the BranchID of 9001 already exists for CaseID 86. So I leave that one alone.
This is a simple example; some of the other views I need to merge have multiple ID fields in addition to the CaseID...
You can express this using not exists, not in, or a left join . . . or even using window functions.
with toupdate as (
    select bcc.*,
           sum(case when CaseID = @NewCaseID then 1 else 0 end)
               over (partition by BranchID) as num_new
    from vw_BranchCaseCurrent bcc
)
update toupdate
set CaseID = @NewCaseID
where CaseID = @OldCaseID and num_new = 0;
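For comparison, the same skip-the-conflicts logic written with not exists (a sketch against the same view and variables):
update a
set a.CaseID = @NewCaseID
from vw_BranchCaseCurrent a
where a.CaseID = @OldCaseID
  and not exists (select 1
                  from vw_BranchCaseCurrent b
                  where b.CaseID = @NewCaseID
                    and b.BranchID = a.BranchID);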

update a single column with join lookups

I have a table adjustments with columns adjustable_id | adjustable_type | order_id
order_id is the target column to fill with values; the values should come from another table, line_items, which has an order_id column.
adjustable_id (int) and adjustable_type (varchar) reference that table.
table: adjustments
id  | adjustable_id | adjustable_type | order_id
------------------------------------------------
100 | 1             | line_item       | NULL
101 | 2             | line_item       | NULL
table: line_items
id | order_id | other | columns
--------------------------------
1  | 10       | bla   | bla
2  | 20       | bla   | bla
In the case above, I guess I need a join query to update adjustments.order_id: the first row with value 10, the second row with 20, and so on for the other rows, using Postgres 9.3+.
In case the lookup fails, I need to delete the invalid adjustments rows, which have no corresponding line_items.
There are two ways to do this. The first one uses a correlated subquery:
update adjustments a
set order_id = (select l.order_id
                from line_items l
                where l.id = a.adjustable_id)
where a.adjustable_type = 'line_item';
This is standard ANSI SQL, as the SQL standard does not define a join condition for the UPDATE statement.
The second way is using a join, which is a Postgres extension to the SQL standard (other DBMS also support that but with different semantics and syntax).
update adjustments a
set order_id = l.order_id
from line_items l
where l.id = a.adjustable_id
and a.adjustable_type = 'line_item';
The join is probably the faster one. Note that both versions (especially the first one) will only work if the join between line_items and adjustments always returns exactly one row from the line_items table. If that is not the case, they will fail.
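The question also asks to delete adjustments rows whose lookup fails; that part is not covered above, but a minimal sketch using NOT EXISTS could look like this (running the delete first also keeps the correlated-subquery version from setting order_id to NULL for rows without a match):
delete from adjustments a
where a.adjustable_type = 'line_item'
  and not exists (select 1
                  from line_items l
                  where l.id = a.adjustable_id);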
The reason why Arockia's query was "eating your RAM" is that the query creates a cross join between table1 and table1, which is then joined against table2.
The Postgres manual contains a warning about that:
Note that the target table must not appear in the from_list, unless you intend a self-join
update a
set A.name = B.name
from table1 A
join table2 B on A.id = B.id

Multiple records in a table matched with a column

The architecture of my DB involves records in a Tags table. Each record in the Tags table has a string, which is a Name, and a foreign key to the PrimaryIDs of records in another Worker table.
Records in the Worker table have tags. Every time we create a Tag for a worker, we add a new row in the Tags table with the inputted Name and foreign key to the worker's PrimaryID. Therefore, we can have multiple Tags with different names per same worker.
Worker Table
ID | Worker Name | Other Information
------------------------------------------------
1  | Worker1     | ..........................
2  | Worker2     | ..........................
3  | Worker3     | ..........................
4  | Worker4     | ..........................
Tags Table
ID | Foreign Key (WorkerID) | Name
----------------------------------
1  | 1                      | foo
2  | 1                      | bar
3  | 2                      | foo
5  | 3                      | foo
6  | 3                      | bar
7  | 3                      | baz
8  | 1                      | qux
My goal is to filter WorkerIDs based on an input table of strings: I want the set of WorkerIDs that have all of the input tags. For example, if the input strings are foo and bar, I would like to return WorkerIDs 1 and 3. Any idea how to do this? I was thinking something to do with GROUP BY or joining tables. I am new to SQL and can't seem to figure it out.
This is a variant of relational division. Here's one attempt:
select workerid
from tags
where name in ('foo', 'bar')
group by workerid
having count(distinct name) = 2
You can use the following:
select WorkerID
from tags
where name in ('foo', 'bar')
group by WorkerID
having count(*) = 2
and this will retrieve your desired result.
Regards.
This article is an excellent resource on the subject.
While the answer from @Lennart works fine in Query Analyzer, you're not going to be able to duplicate that in a stored procedure or from a consuming application without opening yourself up to SQL injection attacks. To extend the solution, you'll want to look into passing your list of tags as a table-valued parameter, since SQL Server doesn't support arrays.
Essentially, you create a custom type in the database that mimics a table with only one column:
CREATE TYPE list_of_tags AS TABLE (t varchar(50) NOT NULL PRIMARY KEY)
Then you populate an instance of that type in memory:
DECLARE @mylist list_of_tags
INSERT @mylist (t) VALUES ('foo'), ('bar')
Then you can select against that as a join using the GROUP BY/HAVING described in the previous answers:
select workerid
from tags
inner join @mylist on name = t
group by workerid
having count(distinct name) = 2
*Note: I'm not at a computer where I can test the query. If someone sees a flaw in my query, please let me know and I'll happily correct it and thank them.
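To call this from an application, the list would be passed in as a table-valued parameter, which must be declared READONLY. A sketch (the procedure name is made up, and the hard-coded 2 is replaced by the list size so it works for any number of tags):
CREATE PROCEDURE find_workers_by_tags
    @taglist list_of_tags READONLY
AS
BEGIN
    SELECT workerid
    FROM tags
    INNER JOIN @taglist ON name = t
    GROUP BY workerid
    HAVING COUNT(DISTINCT name) = (SELECT COUNT(*) FROM @taglist);
END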

How to get sum of values per id and update existing records in other table

I have two tables like:
ID      | TRAFFIC
--------|--------
fd56756 | 4398
645effa | 567899
894fac6 | 611900
894fac6 | 567899
and
USER   | ID      | TRAFFIC
-------|---------|--------
andrew | fd56756 | 0
peter  | 645effa | 0
john   | 894fac6 | 0
I need to get SUM("TRAFFIC") per ID from the first table and write it into the TRAFFIC column of the second table where the first table's ID equals the second table's ID. IDs in the first table are not unique and can be duplicated.
How can I do this?
Table names are from your later comment. Chances are, you are reporting table and column names incorrectly.
UPDATE users u
SET "TRAFFIC" = sub.sum_traffic
FROM (
    SELECT "ID", sum("TRAFFIC") AS sum_traffic
    FROM stats.traffic
    GROUP BY 1
) sub
WHERE u."ID" = sub."ID";
Aside: It's unwise to use mixed-case identifiers in Postgres. Use legal, lower-case identifiers, which do not need to be double-quoted, to make your life easier. Start by reading the manual here.
Something like this?
UPDATE users t2
SET traffic = t1.sum_traffic
FROM (SELECT id, sum(traffic) AS sum_traffic
      FROM stats.traffic
      GROUP BY id) t1
WHERE t1.id = t2.id;

Delete data from child tables

I have 2 tables:
"customers" and "addresses". A customer can have several addresses, so they have an "n:m" relationship.
For this reason, I also have the table "customer-addr".
This is what my tables look like:
+-----------+       +---------------+       +-----------+
| customers |       | customer_addr |       | addresses |
+-----------+       +---------------+       +-----------+
| id        |       | id            |       | id        |
| name      |       | cid           |       | address   |
+-----------+       | aid           |       +-----------+
                    +---------------+

customers.id <---> customer_addr.cid
customer_addr.aid <---> addresses.id
I need to update all customer data, including all addresses. For this reason, I thought about deleting all existing addresses first, then updating the customer table, and after that, creating every address anew.
My question: how can I delete all existing addresses of one customer efficiently? (I have to remove rows from two tables.)
Is there a single statement I can use? (Without the cascade method; that's too risky.)
Or can I do it with two statements, without using subselects?
What's the best approach for this?
Note that I'm using PostgreSQL.
Edit:
My whole database design is more complex, and the address table is not only a child of "customers" but also of "suppliers", "bulkbuyers", ...
Every address belongs to only one customer OR one supplier OR one bulkbuyer.
(No address is used by more than one parent / no address sharing.)
Every customer/supplier/... can have multiple addresses.
For this reason, the edited solution from zebediah49 won't work, because it would also delete all addresses from every supplier/bulkbuyer/...
I would use a writable CTE (also called a data-modifying CTE), available in PostgreSQL 9.1 or later:
WITH del AS (
    DELETE FROM customer_addr
    WHERE cid = $kill_this_cid
    RETURNING aid
)
DELETE FROM addresses a
USING (SELECT DISTINCT aid FROM del) d
WHERE a.id = d.aid;
This should be fastest and safest.
If (cid, aid) is defined UNIQUE in customer_addr you don't need the DISTINCT step:
...
DELETE FROM addresses a
USING del d
WHERE a.id = d.aid;
EDIT:
Got it; this is safer because of the risk of two customers sharing an address anyway:
DELETE FROM customer_addr WHERE cid = $TARGET_CID;
DELETE FROM addresses WHERE id NOT IN (SELECT aid FROM customer_addr);
First, delete all references, then delete all unreferenced addresses.
Note that you could, for example, only do the first step, and run the "cleanup" second step at a later time.
I would suggest a two-step transaction:
DELETE FROM addresses
WHERE id IN (SELECT ca.aid
             FROM customers c
             LEFT JOIN customer_addr ca ON ca.cid = c.id
             WHERE c.name = '$NAME_TO_DELETE');
DELETE FROM customer_addr
WHERE cid = (SELECT id FROM customers WHERE name = '$NAME_TO_DELETE');
If you have the customer ID already (EDIT: You do), you can skip most of that:
DELETE FROM addresses WHERE id IN (SELECT aid FROM customer_addr WHERE cid=$TARGET_CID);
DELETE FROM customer_addr WHERE cid = $TARGET_CID;
Wrap those with the appropriate transactional BEGIN/END, to make sure that you don't end up in an inconsistent state, and you should be set.
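A sketch of the wrapped version, using the two statements above verbatim:
BEGIN;
DELETE FROM addresses
WHERE id IN (SELECT aid FROM customer_addr WHERE cid = $TARGET_CID);
DELETE FROM customer_addr WHERE cid = $TARGET_CID;
COMMIT;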