How to copy a 110GB table with BLOBs from one schema to another on ORACLE - sql

I have a table containing 110GB in BLOBs in one schema and I want to copy it to another schema to a different table.
I only want to copy one column of the source table, so I am using an UPDATE statement, but it takes 2.5 hours to copy 3 GB of data.
Is there a faster way to do this?
Update:
The code I am using is very simple:
update schema1.A a set blobA = (select blobB from schema2.B b where b.IDB = a.IDA);
IDA and IDB are both indexed.

Check whether indexes on the destination table are causing the performance issue. If so, temporarily disable them, then rebuild them after the data has been copied from the source column to the destination column.
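A minimal sketch of that idea in Oracle SQL. The index name a_some_idx is hypothetical; list the real indexes first. Note that this only helps for non-unique indexes: DML against a table with an unusable unique index raises an error instead of skipping it.

```sql
-- List the indexes on the destination table and their current state.
SELECT index_name, uniqueness, status
FROM   all_indexes
WHERE  owner = 'SCHEMA1'
AND    table_name = 'A';

-- Mark a (hypothetical) non-unique index unusable so DML stops maintaining it.
ALTER INDEX schema1.a_some_idx UNUSABLE;

-- Make sure this session ignores unusable indexes instead of failing.
ALTER SESSION SET skip_unusable_indexes = TRUE;

-- ... run the copy here ...

-- Rebuild the index once the data is in place.
ALTER INDEX schema1.a_some_idx REBUILD;
```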

If you are on Oracle 10 or 11, check ADDM to see what is causing the problem. It is probably an I/O or transaction log problem.
What kind of disk storage is this? Did you try copying a 110 GB file from one place to another on that disk system? How long does it take?

I don't know whether Oracle automatically grows the database size. If it does, then before running your query, increase the space allocated to the database so that it exceeds the amount you are about to add.
I know that in SQL Server, under the default setup, it automatically allocates an additional 10% of the database size as you start filling it up. When that fills up, it stops everything and allocates another 10%. When running queries that bulk load data, this can seriously slow the query down.
Also, as zendar pointed out, check the disk I/O. If it has a high queue length, you may be constrained by how fast the drives work.
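For the Oracle side of this, a rough sketch of checking and pre-allocating space. The tablespace name USERS and the file path are assumptions, not taken from the question. A datafile in a regular (smallfile) tablespace with 8 KB blocks tops out at roughly 32 GB, so pre-allocating 110 GB usually means adding several files rather than resizing one.

```sql
-- See how large the datafiles are and whether they autoextend.
SELECT file_name,
       bytes / 1024 / 1024 / 1024 AS size_gb,
       autoextensible
FROM   dba_data_files
WHERE  tablespace_name = 'USERS';

-- Pre-allocate space by adding a datafile (hypothetical path); repeat as needed.
ALTER TABLESPACE users
  ADD DATAFILE '/u01/oradata/ORCL/users02.dbf' SIZE 30G
  AUTOEXTEND ON NEXT 1G MAXSIZE UNLIMITED;
```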

Postgres extracting data from a huge table based on non-indexed column

We have a production table that has been around for quite some time and is huge (close to 3 TB). Most of its data is stale and unused, so we plan to get rid of the historical rows that are not referenced anywhere.
There is a boolean column "active" that we can use to identify this data; however, the column is not indexed.
Given the size of the table, I am not sure that creating a new index would help. I tried deleting the inactive rows incrementally, 100K at a time, but the volume is so large that this would take months.
The primary key of the table is of type UUID. I thought of creating a new table and inserting only the rows with active = true:
insert into mytable_active
select * from mytable
where is_active = true;
But as expected, this approach also fails because of the volume; it just keeps running forever.
Any suggestions or approaches would be most welcome.
When you need to delete a lot of rows quickly, partitioning is great......... when the table is already partitioned.
If there is no index on the column you need, then at least one full table scan will be required, unless you can use another index like "date" or something to narrow it down.
I mean, you could create an index "WHERE active" but that would also require the full table scan you're trying to avoid, so... meh.
First, DELETE. Just don't, not even in small bits with LIMIT. Not only will it write most of the table (3TB writes) but it will also write it to the WAL (3 more TB) and it will also update the indexes, and write that to the WAL too. This will take forever, and the random IO from index updates will nuke your performance. And if it ever finishes, you will still have a 3TB file, with most of it unallocated. Plus indexes.
So, no DELETE. Uh, wait.
Scenario with DELETE:
Swap the table with a view "SELECT * FROM humongous WHERE active=true" and add triggers or rules on the view to redirect updates/inserts/delete to the underlying table. Make sure triggers set all new rows with active=true.
Re-create each index (concurrently) except the primary key, adding "WHERE active=true". This will require a full table scan for the first index, even if you create the index on "active", because CREATE INDEX WHERE doesn't seem to be able to use another index to speed up when a WHERE is specified.
Drop the old indices
Note the purpose of the view is only to ensure absolutely all queries have "active=true" in the WHERE, because otherwise, they wouldn't be able to use the conditional indices we just created, so each query would be a full table scan, and that would be undesirable.
And now, you can DELETE, bit by bit, with your delete from mytable where id in ( select id from mytable where active = false limit 100000);
It's a tradeoff, you'll have a large number of table scans to recreate indices, but you'll avoid the random IO from index update due to a huge delete, which is the real reason why you say it will take months.
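The index step could look roughly like this in PostgreSQL. The existing index mytable_created_at_idx and its column created_at are made up for illustration; repeat the pattern for each real secondary index. CREATE INDEX CONCURRENTLY cannot run inside a transaction block.

```sql
-- Build a partial replacement that only covers the rows worth keeping.
CREATE INDEX CONCURRENTLY mytable_created_at_active_idx
    ON mytable (created_at)
    WHERE active = true;

-- Once the replacement is valid, drop the full index it replaces.
DROP INDEX CONCURRENTLY mytable_created_at_idx;
```

With only partial indexes left (apart from the primary key), deleting an inactive row touches the heap and the primary key index, not every secondary index.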
Scenario with INSERT INTO new_table SELECT...
If you have inserts and updates running on this huge table, then you have a problem, because these will not be transferred to the new table during the operation. So a solution would be to:
turn off all the scripts and services that run long queries
lock everything
create new_table
rename huge_table to huge_old
create a view that is a UNION ALL of new_table and huge_old. From the application's point of view, this view replaces huge_table. It must handle priority, i.e. if a row is present in the new table, a row with the same id present in the old table should be ignored... so it will have to have a JOIN. This step should be tested carefully beforehand.
unlock
Then, let it run for a while, see if the view does not destroy your performance. At this point, if it breaks, you can easily go back by dropping the view and renaming the table back to its old self. I said to turn off all the scripts and services that run long queries because these might fail with the view, and you don't want to take a big lock while one long query is running, because that will halt everything until it's done.
add insert/update/delete triggers on the view to redirect the writes to new_table. Inserts go directly to the new table, updates will have to transfer the row, deletes will have to hit both tables, and UNIQUE constraints will be... interesting. This will be a bit complicated.
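A sketch of what the priority-handling view might look like, assuming both tables have identical column layouts and a primary key column named id (an assumption). It uses NOT EXISTS (an anti-join) rather than an explicit JOIN; the triggers are not shown.

```sql
-- The view takes over the name the application already uses.
CREATE VIEW huge_table AS
SELECT n.*
FROM   new_table AS n
UNION ALL
-- Old rows are visible only while no row with the same id exists in new_table.
SELECT o.*
FROM   huge_old AS o
WHERE  NOT EXISTS (
           SELECT 1
           FROM   new_table AS n
           WHERE  n.id = o.id
       );
```

Every lookup through the view now probes both tables, which is why it is worth watching performance for a while before going further.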
Now to transfer the data.
Even if it takes a while, who cares? It will finish eventually. I suppose if you have a 3TB table, you must have some decent storage, even if that's these old spinning things that we used to put data on, it shouldn't take more than a few hours if the IO is not random. So the idea is to only use linear IO.
Fingers crossed hoping the table does not have a big text column that is stored in separate TOAST table that is going to require one random access per row. Did you check?
Now, you might actually want it to run for longer so it uses less IO bandwidth, both for reads and writes, and especially WAL writes. It doesn't matter how long the query runs as long as it doesn't degrade performance for the rest of the users.
Postgres will probably go for a parallel table scan to use all the cores and all the IO in the box, so maybe disable that first.
Then I think you should try to avoid the hilarious (for onlookers) scenario where it reads from the table for half a day, not finding any rows that match, so the disks handle the reads just fine, then it finds all the rows that match at the end and proceeds to write 300GB to the WAL and the destination table, causing huge write contention, and you have to Ctrl-C it when you know, you just know it in your gut that it was THIS CLOSE to finishing.
So:
create table bogus_table (like mytable);  -- LIKE copies the column definitions but no indexes
insert into bogus_table select * from mytable;
10% of "active" rows is still 300GB so better check the server can handle writing a 300GB table without slowing down. Watch vmstat and check if iowait goes crazy, watch number of transactions per second, query latency, web server responsiveness, the usual database health stuff. If the phone rings, hit Ctrl-C and say "Fixed!"
After it's done a few checkpoints, Ctrl-C. Time to do the real thing.
Now to make this query take much longer (and therefore destroy much less IO bandwidth) you can add this to the columns in your select:
pg_sleep((random()<0.000001)::INTEGER * 0.1)
That will make it sleep for 0.1s every million rows on average. Adjust to taste while looking at vmstat.
You can also monitor query progress using hacks.
It should work fine.
Once the interesting rows have been extracted from the accursed table, you could move the old data to a data warehouse or something, or to cold storage, or have fun loading it into clickhouse if you want to run some analytics.
Maybe partitioning the new table would also be a good idea, before it grows back to 3TB. Or periodically moving old rows.
Now, I wonder how you backup this thing...
-- EDIT
OK, I have another idea, maybe simpler, but you'll need a box.
Get a second server with fast storage and setup logical replication. On this replica server, create an empty UNLOGGED replica of the huge table with only one index on the primary key. Logical replication will copy the entire table, so it will take a while. A second network card in the original server or some QoS tuning would help not blowing up the ethernet connection you actually use to serve queries.
Logical replication is row based and identifies rows by primary key, so you absolutely need to manually create that PK index on the slave.
I've tested it on my home box right now and it works very well. The initial data transfer was a bit slow, but that may be my network. Pausing then resuming replication transferred rows inserted or updated on the master during the pause. However, renaming the table seems to break it, so you won't be able to do INSERT INTO SELECT, you'll have to DELETE on the replica. With SSDs, only one PK index, the table set to UNLOGGED, it should not take forever. Maybe using btrfs would turn the random index write IO into linear IO due to its copy on write nature. Or, if the PK index fits in shared_buffers, just YOLO it and set checkpoint_timeout to "7 days" so it doesn't actually write anything. You'll probably need to do the delete in chunks so the replicated updates keep up.
When I dropped the PK index to speed up the deletion, then recreated it before re-enabling replication, it didn't catch up on the updates. So you can't drop the index.
But is there a way to only transfer the rows you want to keep instead of transferring everything and deleting, while also having the replica keep up with the master's updates?... It's possible to do it for inserts (just disable the initial data copy) but not for updates unfortunately. You'd need an integer primary key so you could generate bogus rows on the replica that would then be updated during replication... but you can't do that with your UUID PK.
Anyway. Once this is done, set the number of WAL segments to be kept on the master server to a very high value, to resume replication later without missing updates.
And now you can run your big DELETE on the replica. When it's done, vacuum, maybe CLUSTER, re-create all indexes, etc, and set the table to LOGGED.
Then you can failover to the new server. Or if you're feeling adventurous, you could replicate the replica's table back on the master, since it will have the same name it should be in another schema.
That should allow for very little downtime since all updates are replicated, the replica will always be up to date.
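A rough outline of the replication setup in SQL. The publication and subscription names, the connection string, and the three columns shown are placeholders; the real replica table needs every column of the source table. The source server must run with wal_level = logical.

```sql
-- On the source server:
CREATE PUBLICATION mytable_pub FOR TABLE mytable;

-- On the new server: an unlogged copy with only the primary key index.
CREATE UNLOGGED TABLE mytable (
    id      uuid PRIMARY KEY,
    active  boolean,
    payload text            -- stand-in for the real columns
);

-- Starts the initial copy, then streams changes.
CREATE SUBSCRIPTION mytable_sub
    CONNECTION 'host=source.example dbname=prod user=replicator'
    PUBLICATION mytable_pub;

-- Pause and resume replication around the cleanup.
ALTER SUBSCRIPTION mytable_sub DISABLE;
ALTER SUBSCRIPTION mytable_sub ENABLE;

-- After the DELETE, vacuum and index rebuilds are done:
ALTER TABLE mytable SET LOGGED;
```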
I would suggest:
Copy the active records to a temporary table
Drop the main table
Rename the temporary table to the main table name
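In PostgreSQL those three steps can be wrapped in one transaction, since DDL is transactional there. This is only a sketch: CREATE TABLE AS copies no indexes, constraints or defaults, DROP TABLE fails if views or foreign keys depend on the table, and any writes that arrive while the copy runs are lost unless writers are stopped first.

```sql
BEGIN;

-- Copy only the rows to keep (one full scan of the big table).
CREATE TABLE mytable_active AS
SELECT * FROM mytable WHERE active = true;

-- Swap the new table in under the old name.
DROP TABLE mytable;
ALTER TABLE mytable_active RENAME TO mytable;

COMMIT;

-- Recreate the primary key and indexes afterwards, e.g.:
-- ALTER TABLE mytable ADD PRIMARY KEY (id);
```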

Running Updates on a large, heavily used table

I have a large table (~170 million rows, 2 nvarchar and 7 int columns) in SQL Server 2005 that is constantly being inserted into. Everything works ok with it from a performance perspective, but every once in a while I have to update a set of rows in the table which causes problems. It works fine if I update a small set of data, but if I have to update a set of 40,000 records or so it takes around 3 minutes and blocks on the table which causes problems since the inserts start failing.
If I just run a select to get back the data that needs to be updated I get back the 40k records in about 2 seconds. It's just the updates that take forever. This is reflected in the execution plan for the update where the clustered index update takes up 90% of the cost and the index seek and top operator to get the rows take up 10% of the cost. The column I'm updating is not part of any index key, so it's not like it reorganizing anything.
Does anyone have any ideas on how this could be sped up? My thought now is to write a service that will just see when these updates have to happen, pull back the records that have to be updated, and then loop through and update them one by one. This will satisfy my business needs but it's another module to maintain and I would love if I could fix this from just a DBA side of things.
Thanks for any thoughts!
Actually it might reorganise pages if you update the nvarchar columns.
Depending on what the update does to these columns they might cause the record to grow bigger than the space reserved for it before the update.
(See the explanation of how nvarchar is stored at http://www.databasejournal.com/features/mssql/physical-database-design-consideration.html.)
So say a record has a string of 20 characters saved in the nvarchar - this takes 20*2+2(2 for the pointer) bytes in space. This is written at the initial insert into your table (based on the index structure). SQL Server will only use as much space as your nvarchar really takes.
Now comes the update and inserts a string of 40 characters. And oops, the space for the record within your leaf structure of your index is suddenly too small. So off goes the record to a different physical place with a pointer in the old place pointing to the actual place of the updated record.
This then causes your index to go stale and because the whole physical structure requires changing you see a lot of index work going on behind the scenes. Very likely causing an exclusive table lock escalation.
Not sure how best to deal with this. Personally if possible I take an exclusive table lock, drop the index, do the updates, reindex. Because your updates sometimes cause the index to go stale this might be the fastest option. However this requires a maintenance window.
You should batch up your update into several updates (say 10000 at a time, TEST!) rather than one large one of 40k rows.
This way you will avoid a table lock. SQL Server will only take out about 5000 locks (page or row) before escalating to a table lock, and even this is not very predictable (memory pressure etc.). Smaller updates made in this fashion will at least avoid the concurrency issues you are experiencing.
You can batch the updates using a service or firehose cursor.
Read this for more info:
http://msdn.microsoft.com/en-us/library/ms184286.aspx
Hope this helps
Robert
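A batching loop along these lines might look as follows in T-SQL that runs on SQL Server 2005. The table and column names (dbo.BigTable, Status, NeedsUpdate) are invented, and the batch size of 4000 is just a value under the escalation threshold mentioned above, so test it. The WHERE clause must stop matching rows once they are updated, otherwise the loop never ends.

```sql
DECLARE @rows INT;
SET @rows = 1;

WHILE @rows > 0
BEGIN
    -- Each UPDATE commits on its own, so locks are released between batches.
    UPDATE TOP (4000) dbo.BigTable
    SET    Status = 2,
           NeedsUpdate = 0
    WHERE  NeedsUpdate = 1;

    SET @rows = @@ROWCOUNT;

    -- Short pause so the concurrent inserts get a turn.
    WAITFOR DELAY '00:00:01';
END
```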
The most brute-force (and simplest) way is to have a basic service, as you mentioned. That has the advantage of being able to scale with the load on the server and/or the data load.
For example, if you have a set of updates that must happen ASAP, then you could turn up the batch size. Conversely, for less important updates, you could have the update "server" slow down if each update is taking "too long" to relieve some of the pressure on the DB.
This sort of "heartbeat" process is rather common in systems and can be very powerful in the right situations.
It's weird that your analyzer says the time goes into updating the clustered index. Did the size of the data change when you updated? It seems the varchar column is forcing the data to be reorganized, which may require updates to index pointers (as KMB has already pointed out). In that case you might want to increase the percentage of free space on the data and index pages so that they can grow without relinking or reallocation. Since an update is an I/O-intensive operation (unlike a read, which can be buffered), performance also depends on several factors:
1) Are your tables partitioned by data?
2) Does the entire table lie on the same SAN disk (or is the SAN striped well)?
3) How verbose is the transaction logging? Can the buffer size of the transaction log be increased to support larger writes to the log for massive inserts?
The API/language you are using also matters; e.g., JDBC supports a batch update feature that makes multiple updates a little more efficient.
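Leaving free space on the pages is done through the fill factor, which is only applied when an index is built or rebuilt. A sketch with a made-up index name and an arbitrary 80% value; a plain rebuild locks the table, so it needs a quiet period.

```sql
-- Rebuild the clustered index leaving about 20% of each leaf page empty,
-- so rows that grow during an update are less likely to cause page splits.
ALTER INDEX PK_BigTable ON dbo.BigTable
REBUILD WITH (FILLFACTOR = 80);
```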

Free space in MySQL after deleting tables & columns?

I have a database of around 20GB. I need to delete 5 tables & drop a few columns in some other 3 tables.
Dropping the 5 tables will free some 3 GB, and dropping columns in the other tables should free another 8 GB.
How do I reclaim this space from MySQL?
I've read that dumping the database and restoring it is one solution, but I'm not really sure how that works. I am not even sure whether it only works for deleting the entire database or also for parts of it.
Please suggest how to go about this. Thanks.
From the comments, it sounds like you're using InnoDB without the file per table option.
Reclaiming space from the innodb tablespace is not generally possible in this mode. Your only course of action is to dump the whole database, turn on file-per-table mode, and reload it (with a completely clean mysql instance). This is going to take a long time with a large database; mk-parallel-dump and restore tools might be a bit quicker, but it will still take a while. Be sure to test this process on a non-production server first.
EDIT: Doesn't apply without file_per_table, Mark is right there.
What's going on is that once MySQL takes space, it won't give it back. This is so that if you delete 500 rows and then immediately insert 500, it doesn't have to give that space back to the file system and then request it back. It's an optimization to avoid filesystem overhead, and it works well when you delete little bits.
If you delete a large amount, it will take a long time to end up using all that space again, which can be annoying. This can be fixed two ways: dropping the table and reloading the contents, or optimizing the table (which I believe basically reloads the table internally).
All you have to do to get space back from a table is:
OPTIMIZE TABLE my_big_table;
Note that this can take a while; it's not a near-instant operation. Basically, plan for some downtime. If your tables are just a few gigs, it shouldn't take too long (probably a few minutes). This also rebuilds the indexes and does some other housekeeping.
You can see more about OPTIMIZE on the MySQL site. Here is its advice:
OPTIMIZE TABLE should be used if you have deleted a large part of a table or if you have made many changes to a table with variable-length rows (tables that have VARCHAR, VARBINARY, BLOB, or TEXT columns). Deleted rows are maintained in a linked list and subsequent INSERT operations reuse old row positions. You can use OPTIMIZE TABLE to reclaim the unused space and to defragment the data file.
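Before optimizing anything, it may help to check which mode the server is in and where the free space sits. A small sketch; the schema name mydb is a placeholder. For InnoDB tables living in the shared tablespace, data_free reports the free space of that shared tablespace rather than a per-table figure.

```sql
-- Is each InnoDB table stored in its own file?
SHOW VARIABLES LIKE 'innodb_file_per_table';

-- Size and reusable free space per table, largest free space first.
SELECT table_name,
       ROUND(data_length  / 1024 / 1024) AS data_mb,
       ROUND(index_length / 1024 / 1024) AS index_mb,
       ROUND(data_free    / 1024 / 1024) AS free_mb
FROM   information_schema.tables
WHERE  table_schema = 'mydb'
ORDER  BY data_free DESC;
```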

Fastest way to do mass update

Let’s say you have a table with about 5 million records and a nvarchar(max) column populated with large text data. You want to set this column to NULL if SomeOtherColumn = 1 in the fastest possible way.
The brute force UPDATE does not work very well here because it will create large implicit transaction and take forever.
Doing updates in small batches of 50K records at a time works, but it still takes 47 hours to complete on a beefy 32-core/64GB server.
Is there any way to do this update faster? Are there any magic query hints / table options that sacrifices something else (like concurrency) in exchange for speed?
NOTE: Creating temp table or temp column is not an option because this nvarchar(max) column involves lots of data and so consumes lots of space!
PS: Yes, SomeOtherColumn is already indexed.
From everything I can see it does not look like your problems are related to indexes.
The key seems to be in the fact that your nvarchar(max) field contains "lots" of data. Think about what SQL has to do in order to perform this update.
Since the column you are updating is likely more than 8000 characters it is stored off-page, which implies additional effort in reading this column when it is not NULL.
When you run a batch of 50000 updates SQL has to place this in an implicit transaction in order to make it possible to roll back in case of any problems. In order to roll back it has to store the original value of the column in the transaction log.
Assuming (for simplicity's sake) that each column contains on average 10,000 bytes of data, 50,000 rows will contain around 500MB of data, which has to be stored temporarily (in simple recovery mode) or permanently (in full recovery mode).
There is no way to disable the logs as it will compromise the database integrity.
I ran a quick test on my dog slow desktop, and running batches of even 10,000 becomes prohibitively slow, but bringing the size down to 1000 rows, which implies a temporary log size of around 10MB, worked just nicely.
I loaded a table with 350,000 rows and marked 50,000 of them for update. This completed in around 4 minutes, and since it scales linearly, updating all 5 million rows (100 times as many) would take around 6 to 7 hours on my dog-slow, 1-processor, 2GB desktop, so I would expect something much better on your beefy server backed by a SAN or something.
You may want to run your update statement as a select, selecting only the primary key and the large nvarchar column, and ensure this runs as fast as you expect.
Of course the bottleneck may be other users locking things or contention on your storage or memory on the server, but since you did not mention other users I will assume you have the DB in single user mode for this.
As an optimization you should ensure that the transaction logs are on a different physical disk /disk group than the data to minimize seek times.
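For illustration, the 1000-row batches could be written like this in T-SQL. dbo.BigTable and BigText are assumed names; SomeOtherColumn comes from the question. The IS NOT NULL guard is what lets the loop terminate, because rows that have already been cleared stop matching.

```sql
DECLARE @rows INT;
SET @rows = 1;

WHILE @rows > 0
BEGIN
    -- Small batches keep the amount of logged "before" data per transaction low.
    UPDATE TOP (1000) dbo.BigTable
    SET    BigText = NULL
    WHERE  SomeOtherColumn = 1
      AND  BigText IS NOT NULL;

    SET @rows = @@ROWCOUNT;
END
```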
Hopefully you already dropped any indexes on the column you are setting to null, including full text indexes. As said before, turning off transactions and the log file temporarily would do the trick. Backing up your data will usually truncate your log files too.
You could set the database recovery mode to Simple to reduce logging, BUT do not do this without considering the full implications for a production environment.
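If you do go that route, the switch itself is short. A sketch with a placeholder database name and backup path; after returning to full recovery, a full backup is needed before log backups can resume.

```sql
-- Check the current recovery model.
SELECT name, recovery_model_desc
FROM   sys.databases
WHERE  name = 'MyDb';

-- Switch to simple recovery for the duration of the mass update.
ALTER DATABASE MyDb SET RECOVERY SIMPLE;

-- ... run the batched update ...

-- Switch back and restart the backup chain.
ALTER DATABASE MyDb SET RECOVERY FULL;
BACKUP DATABASE MyDb TO DISK = 'D:\backup\MyDb.bak';
```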
What indexes are in place on the table? Given that batch updates of approx. 50,000 rows take so long, I would say you require an index.
Have you tried placing an index or statistics on someOtherColumn?
This really helped me. I went from 2 hours to 20 minutes with this.
/* I'm using database recovery mode to Simple */
/* Update table statistics */
set transaction isolation level read uncommitted
/* Your 50k update, just to have a measures of the time it will take */
set transaction isolation level READ COMMITTED
In my experience, working in MSSQL 2005, moving everyday (automatically) 4 Million 46-byte-records (no nvarchar(max) though) from one table in a database to another table in a different database takes around 20 minutes in a QuadCore 8GB, 2Ghz server and it doesn't hurt application performance. By moving I mean INSERT INTO SELECT and then DELETE. The CPU usage never goes over 30 %, even when the table being deleted has 28M records and it constantly makes around 4K insert per minute but no updates. Well, that's my case, it may vary depending on your server load.
READ UNCOMMITTED
"Specifies that statements (your updates) can read rows that have been modified by other transactions but not yet committed." In my case, the records are readonly.
I don't know what rg-tsql means but here you'll find info about transaction isolation levels in MSSQL.
Try indexing 'SomeOtherColumn'...50K records should update in a snap. If there is already an index in place see if the index needs to be reorganized and that statistics have been collected for it.
If you are running a production environment with not enough space to duplicate all your tables, I believe that you are looking for trouble sooner or later.
If you provide some info about the number of rows with SomeOtherColumn=1, perhaps we can think another way, but I suggest:
0) Backup your table
1) Index the flag column
2) Set the table option to "no log transactions" ... if possible
3) write a stored procedure to run the updates

Dropping an entire table

I plan to delete an entire table with over 930,000 rows of data. What is the best way to do it without increasing the log size or the DB size?
I am on a live site and my hosting provider has given me 150 MB of space. I am already using 125 MB, so I need to be careful about the DB size, since growth in the log will increase the size of my DB.
If you do not wish or need to fully log the deletion activity (i.e. you do not need to be able to recover your database to a specific point in time), then you can flush/deallocate the contents of the table by using the TRUNCATE TABLE command.
If on the other hand you wish to log the event fully then you should delete the data in batches in order to maximise performance, see the following article for details on how to do this:
Performing Fast SQL Server Delete Operations
Use TRUNCATE TABLE.
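Assuming this is SQL Server (the linked article suggests so) and a hypothetical table dbo.MyTable, a sketch that also shows how much space the table holds before and after. TRUNCATE TABLE is refused if a foreign key references the table, and if the whole table (not just its rows) should go, DROP TABLE is the statement to use instead.

```sql
-- Rows and reserved space before.
EXEC sp_spaceused 'dbo.MyTable';

-- Deallocates the data pages instead of logging each deleted row.
TRUNCATE TABLE dbo.MyTable;

-- Rows and reserved space after.
EXEC sp_spaceused 'dbo.MyTable';
```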