I have a task on my dev databases to update some customer PII. Some of these tables have many rows and are quite large (on the order of 51 million+ rows and 25 GB in size). Every row in the table will need to be updated, so right now I'm just running a simple update statement without a where clause on a single column, and this query has, at the time of this post, been running for 35+ minutes. Is there a faster way to update large tables, or a better way to mask PII data?
Current query is just
update mytable
set mycolumn = 'some text'
I have to execute the following UPDATE statement on a large table, partitioned using a timestamp field with a resolution of one day:
UPDATE my_huge_table
SET column1 = 'an-uuid4'
WHERE customer_id = 'a-customer-uuid4'
  AND organization_id = 'an-organization-uuid4'
  AND column1 IN ('another-uuid4', 'yet-another-uuid4', ...)
where customer_id, organization_id and column1 columns are all properly indexed.
That statement is executed by an Aurora database (PostgreSQL) on AWS.
It turns out that, sometimes, that query times out. According to the performance insights provided by AWS, the likely reason is that an inefficient plan is used to execute the query, so a lot of data must be read from the file system in one shot.
My guess is that the database is loading a huge amount of table records in memory as the column being updated is the same used for the selection.
I wonder whether there's a more efficient way, performance-wise, to accomplish the same task.
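Since the table is partitioned by day, one option might be to constrain each statement to a single day's range so that only one partition is touched at a time. This is only a sketch; the timestamp column name (created_at) and the literal dates are assumptions:

UPDATE my_huge_table
SET column1 = 'an-uuid4'
WHERE customer_id = 'a-customer-uuid4'
  AND organization_id = 'an-organization-uuid4'
  AND column1 IN ('another-uuid4', 'yet-another-uuid4')
  AND created_at >= DATE '2024-01-01'  -- repeat per day that can contain matching rows
  AND created_at <  DATE '2024-01-02';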
I need to update a huge table (over 200 million records, 20+ columns)
I tried to update one column:
update Table1 set [Customer]=Null where [Customer]='-' or len([customer])=0
And it took over 2 hours.
I tried it on all columns and it's still running, for over 5 days now.
update Table1 set [Name]=Null where [Name]='-' or len([Name])=0
update Table1 set [Email]=Null where [Email]='-' or len([Email])=0
...
BTW - the table does not have any indexes or triggers, only data. The DB is not in use and the recovery model is simple.
Is there any more efficient way to update big tables?
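One commonly suggested pattern is to break the work into batches so each transaction stays small; a minimal T-SQL sketch, assuming the same column names as above and an illustrative batch size (without an index on the column, each pass still scans, so this mainly keeps the log and locks small):

declare @rows int = 1;
while @rows > 0
begin
    -- each pass updates at most 100,000 qualifying rows
    update top (100000) Table1
    set [Customer] = NULL
    where [Customer] = '-' or len([Customer]) = 0;

    set @rows = @@ROWCOUNT;
end;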
I have a very large table people with 60M rows indexed on id, and I wish to populate a field newid for every record based on a lookup table id_conversion (1M rows) which contains id and newid, indexed on id.
When I run
update people p set p.newid=(select l.newid from id_conversion l where l.id=p.id)
it runs for an hour or so and then I get an archiver error (ORA-00257).
Any suggestions for either running the update in sections or a better SQL command?
If your update statement hits every single row of the table, you are likely better off running a create table as select (CTAS) query, which avoids writing to Oracle's undo log for every changed row; that undo being generated across 60 million rows is likely the issue you're running into. You can then drop the old table and rename the new table to the old table's name.
Something like:
create table new_people as
select p.id,
       l.newid,
       p.col2,
       p.col3,
       p.col4,
       p.col5
from people p
join id_conversion l
  on p.id = l.id;
drop table people;
-- rebuild any constraints and indexes
-- from the old people table on the new table
alter table new_people rename to people;
For reference, read some of the tips here: http://www.dba-oracle.com/t_efficient_update_sql_dml_tips.htm
If you are basically creating a new table, rather than just updating some of the rows of an existing table, this will likely prove to be the faster method.
I doubt you will be able to get this to run in seconds. Your query, as written, needs to update all 60 million rows.
My first advice is to add an index on id_conversion(id, newid), to make the subquery more efficient. If that doesn't help, then doing the update in batches might be the best way to go.
I should add: because you are updating all the rows, it might be faster to take the following approach:
Copy the data into a new table with the new values.
Truncate the original table.
Insert the new data into the old table.
Inserts are faster than updates.
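A rough sketch of that approach for the people/id_conversion case above (the extra columns col2 and col3 are placeholders for the real column list):

-- 1. Copy the data into a new table with the new values already applied.
create table people_new as
select p.id, l.newid, p.col2, p.col3
from people p
left join id_conversion l on l.id = p.id;

-- 2. Truncate the original table.
truncate table people;

-- 3. Insert the new data back into the old table (direct-path insert keeps it cheap).
insert /*+ append */ into people (id, newid, col2, col3)
select id, newid, col2, col3 from people_new;
commit;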
In addition to the answers above, which will probably work better in this case, you should know about the MERGE statement:
http://docs.oracle.com/cd/B28359_01/server.111/b28286/statements_9016.htm
It is used for updating one table based on another table and is far faster than an update driven by a correlated subquery.
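For the newid population above, a MERGE sketch might look like this (column names taken from the question):

merge into people p
using id_conversion l
on (p.id = l.id)
when matched then
  update set p.newid = l.newid;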
I'm currently using Oracle 11g and let's say I have a table with the following columns (more or less)
Table1
ID varchar(64)
Status int(1)
Transaction_date date
tons of other columns
And this table has about 1 Billion rows. I would want to update the status column with a specific where clause, let's say
where transaction_date = somedatehere
What other alternatives can I use rather than just the normal UPDATE statement?
Currently what I'm trying to do is use CTAS or INSERT INTO ... SELECT to get the rows that I want to update and put them in another table, using AS COLUMN_NAME so the values are already updated in the new/temporary table, which looks something like this:
INSERT INTO TABLE1_TEMPORARY (
ID,
STATUS,
TRANSACTION_DATE,
TONS_OF_OTHER_COLUMNS)
SELECT
ID,
3 AS STATUS,
TRANSACTION_DATE,
TONS_OF_OTHER_COLUMNS
FROM TABLE1
WHERE
TRANSACTION_DATE = SOMEDATE
So far everything seems to work faster than the normal update statement. The problem now is that I want to get the remaining data from the original table, which I do not need to update but which does need to be included in my updated table/list.
What I tried to do at first was use DELETE on the original table with the same where clause, so that, in theory, everything left in that table would be the data I do not need to update, leaving me with two tables:
TABLE1 -- which now contains the rows that I did not need to update
TABLE1_TEMPORARY -- which contains the data I updated
But the DELETE statement by itself is also too slow, about as slow as the original UPDATE statement, so skipping the delete step brings me to this point:
TABLE1 --which contains BOTH the data that I want to update and do not want to update
TABLE1_TEMPORARY --which contains the data I updated
What other alternatives can I use to get the data that's the opposite of my WHERE clause? (Take note that the where clause in this example has been simplified, so I'm not looking for an answer of NOT EXISTS/NOT IN/NOT EQUALS; those clauses are also slower than positive clauses.)
I have ruled out deletion by partition, since the data I need to update and the data I don't can exist in different partitions, as well as TRUNCATE, since I'm not updating all of the data, just part of it.
Is there some kind of JOIN statement I use with my TABLE1 and TABLE1_TEMPORARY in order to filter out the data that does not need to be updated?
I would also like to achieve this with as little REDO/UNDO/LOGGING as possible.
Thanks in advance.
I'm assuming this is not a one-time operation, but you are trying to design for a repeatable procedure.
Partition/subpartition the table so that the rows touched are not spread over all partitions but are confined to a few of them.
Ensure your transactions wouldn't use these partitions for now.
For each partition/subpartition you would normally UPDATE, perform a CTAS of all the rows (I mean even the rows which stay the same go to TABLE1_TEMPORARY). Then EXCHANGE PARTITION and rebuild the index partitions.
At the end rebuild global indexes.
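A rough sketch of that per-partition step, with a hypothetical partition name and the assumption that the table is partitioned on transaction_date (somedatehere is the question's placeholder):

-- Build the replacement data for one partition, untouched rows included.
create table table1_temporary nologging as
select id,
       case when transaction_date = somedatehere then 3 else status end as status,
       transaction_date
       -- , tons_of_other_columns
from table1 partition (p_20150101);

-- Swap it in place of the partition.
alter table table1
  exchange partition p_20150101 with table table1_temporary
  without validation;
-- then rebuild the index partitions for p_20150101 and, at the end, global indexes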
If you don't have Oracle Enterprise Edition, you would need either to CTAS the entire billion rows (followed by ALTER TABLE RENAME instead of ALTER TABLE EXCHANGE PARTITION) or to prepare some kind of "poor man's partitioning" using a view (SELECT UNION ALL SELECT UNION ALL SELECT etc.) and a bunch of tables.
There is some chance that this mess would actually be faster than UPDATE.
I'm not saying that this is elegant or optimal, I'm saying that this is the canonical way of speeding up large UPDATE operations in Oracle.
How about keeping the UPDATE on the same table, but breaking it into multiple small chunks?
UPDATE .. WHERE transaction_date = somedatehere AND id BETWEEN 0000000 and 0999999
COMMIT
UPDATE .. WHERE transaction_date = somedatehere AND id BETWEEN 1000000 and 1999999
COMMIT
UPDATE .. WHERE transaction_date = somedatehere AND id BETWEEN 2000000 and 2999999
COMMIT
This could help if the total workload is manageable but doing it all in one chunk is the problem; this approach breaks it into modest-sized pieces.
Doing it this way could, for example, enable other apps to keep running and give other workloads a look-in, and it avoids needing a single humongous transaction in the logs.
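If you don't want to write each range by hand, a PL/SQL sketch of the same idea, assuming id can be ranged numerically as in the chunks above and using an illustrative chunk size (somedatehere is again just a placeholder):

declare
  v_date date := somedatehere; -- stand-in for the real filter date
begin
  for i in 0 .. 999 loop -- 1,000 chunks of 1,000,000 ids each
    update table1
    set    status = 3
    where  transaction_date = v_date
    and    id between i * 1000000 and (i + 1) * 1000000 - 1;
    commit; -- keep each transaction, and its undo, small
  end loop;
end;
/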
I have a few jobs that insert large data sets from a text file. The data is loaded via .NET's SqlBulkCopy.
Currently, I load all the data into a temp table and then insert it from there into the production table. This was an improvement over importing straight into production; the resulting T-SQL insert was a lot faster. Data is only loaded via this method; there are no other inserts or deletes.
However, I'm back to timeouts because of locks while the job is executing. The job consists of the following steps:
load data into temp table
start transaction
delete current and future dated rows
insert from temp table
commit
This happens once every hour. This portion takes 70 seconds. I need to get that to the smallest number possible.
The production table has about 20 million records and each import is about 70K rows. The table is not accessed at night, so I use that time to do all required maintenance (rebuild stats, indexes, etc.). Of the 70K rows added, ~4K are kept from day to day - that is, the table grows by about 4K rows a day.
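For concreteness, a sketch of the hourly delete/insert step described above (table and column names are hypothetical):

-- the bulk-copied rows are already in #TempImport at this point
begin transaction;

-- delete current and future dated rows
delete from ProductionTable
where EffectiveDate >= cast(getdate() as date);

-- insert them back from the freshly loaded temp table
insert into ProductionTable (EffectiveDate, Col1, Col2)
select EffectiveDate, Col1, Col2
from #TempImport;

commit transaction;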
I'm thinking a 2 part solution:
The job will turn into a copy/rename job: I insert all current data into the temp table, create stats & indexes, rename the tables, and drop the old table.
Create a history table to break out older data. The "current" table would have a rolling 6 months of data, about 990K records. This would make the delete/insert table smaller and [hopefully] more performant. I would prefer not to do this; the table is well designed with the right indexes, and queries are plenty fast. But eventually it might be required.
Edit: Using Windows 2003, SQL Server 2008
Thoughts? Other suggestions?
Well, one really sneaky way is to rename the current table to TableA and set up a second table, TableB, with the same structure and the same data. Then set up a view with the original table's name and exactly the fields in TableA. Now all your existing code will use the view instead of the table directly. The view starts out looking at TableA.
In your load process, load to TableB, then refresh the view definition so it points at TableB. Your users are down for less than a second. Then load the same data to TableA and record, somewhere in a database table, which table you should start with. Next time, load to TableA first, change the view to point to TableA, and then reload TableB.
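A minimal sketch of the view swap, assuming the reading code references the original table name (names here are hypothetical, and the real view should list the exact columns rather than use *):

-- reads go through the view; it currently points at TableA
create view ProductionTable as select * from TableA;
go

-- after loading the new data into TableB, repoint the view
alter view ProductionTable as select * from TableB;
go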
The answer should be that the queries reading from this table use READ UNCOMMITTED, since your data load is the only place that changes data. With READ UNCOMMITTED, the SELECT queries won't take locks.
http://msdn.microsoft.com/en-us/library/ms173763.aspx
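As a minimal illustration (the table name is hypothetical):

-- session-level: every select in this session reads without taking shared locks
set transaction isolation level read uncommitted;
select count(*) from ProductionTable;

-- or per query, with a table hint
select count(*) from ProductionTable with (nolock);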
You should look into using partitioned tables. The basic idea is that you can load all of your new data into a new table, then switch that table into the original as a new partition. This is orders of magnitude faster than inserting into the existing table.
Later on, you can merge multiple partitions into a single larger partition.
More information here: http://msdn.microsoft.com/en-us/library/ms191160.aspx
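A rough sketch of the switch-in step, assuming the production table is already partitioned, the staging table matches its structure and filegroup, and a check constraint on the staging table covers the target partition's boundary (all names and the partition number are hypothetical):

-- load the new rows into StagingTable, then swap it in as partition 5
alter table StagingTable
    switch to ProductionTable partition 5;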
Get better hardware. Using 3 threads and batches of 35,000 items, I import around 90,000 items per second using this approach.
Sorry, but at some point hardware decides insert speed. Important: SSDs for the logs, mirrored ;)
Another trick you could use is to have a delta table for the updates. You'd have two tables with a view over them to merge them: one table, TableBig, holds the old data, and the second table, TableDelta, holds deltas that you add to the rows in TableBig.
You create a view over them that adds them up. A simple example:
For instance, your old data in TableBig (20M rows, lots of indexes, etc.):
ID Col1 Col2
123 5 10
124 6 20
And you want to update 1 row and add a new one so that afterwards the table looks like this:
ID Col1 Col2
123 5 10 -- unchanged
124 8 30 -- this row is updated
125 9 60 -- this one added
Then in the TableDelta you insert these two rows:
ID Col1 Col2
124 2 10 -- these will be added to give the right number
125 9 60 -- this row is new
and the view is
select ID,
       sum(col1) col1, -- the old value plus the delta gives the correct value
       sum(col2) col2
from (
    select id, col1, col2 from TableBig
    union all
    select id, col1, col2 from TableDelta
) t -- the derived table needs an alias in SQL Server
group by ID
At night you can merge TableDelta into TableBig, rebuild indexes, etc.
This way you can leave the big table alone completely during the day, and TableDelta will not have many rows, so overall query performance is very good. Getting the data from TableBig benefits from indexing, getting rows from TableDelta is no issue because it's small, and summing them up is cheap compared to looking for data on disk. Pumping data into TableDelta is very cheap because you can just insert at the end.
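A sketch of that nightly consolidation, using the names from the example above:

begin transaction;

-- fold the deltas into rows that already exist in TableBig
update b
set    b.Col1 = b.Col1 + d.Col1,
       b.Col2 = b.Col2 + d.Col2
from   TableBig b
join   TableDelta d on d.ID = b.ID;

-- add the rows that are new
insert into TableBig (ID, Col1, Col2)
select d.ID, d.Col1, d.Col2
from   TableDelta d
where  not exists (select 1 from TableBig b where b.ID = d.ID);

-- empty the delta table for the next day
truncate table TableDelta;

commit transaction;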
Regards Gert-Jan
Update for text columns:
You could try something similar with two tables, but instead of adding, you would substitute. Like this:
select isnull(b.ID, o.ID) ID,
       isnull(o.Col1, b.Col1) Col1, -- take the overwrite value when present, otherwise the original
       isnull(o.Col2, b.Col2) Col2
from TableBig b
full join TableOverWrite o on b.ID = o.ID
The basic idea is the same: a big table with indexes and a small table for updates that doesn't need them.
Your solution seems sound to me. Another alternative would be to build the final data in a temp table, all outside of a transaction, and then, inside the transaction, truncate the target table and load it from your temp table. Something along those lines might be worth trying too.
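A sketch of that alternative (table and column names are hypothetical):

-- build the complete final data set outside of any transaction
select EffectiveDate, Col1, Col2
into   #FinalData
from   #TempImport; -- plus whatever existing rows must be kept

-- then swap the contents inside a short transaction
begin transaction;
truncate table ProductionTable;
insert into ProductionTable (EffectiveDate, Col1, Col2)
select EffectiveDate, Col1, Col2 from #FinalData;
commit transaction;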