I am disabling the indexes before inserting data into the staging table and re-enabling them before inserting data into the destination table (using a MERGE statement). While the functionality works, my program takes far too long, as much as 10 hours to complete. This is how I'm doing it in the code:
1. Disable the indexes of the staging table.
2. Load data into the staging table using SQL*Loader.
3. Enable the indexes of the staging table.
4. Insert data into the destination table using MERGE (merging into the destination table from the staging table).
5. Update errors, if any, back to the staging table.
NOTE: The staging table already has nearly 400 million rows. I was trying to insert 23 rows into the staging table and eventually into the destination table. The insertion into the staging table is quick (through step 2), but rebuilding the indexes and everything from step 3 onward takes 10 hours!
Is my approach correct? How do I improve the performance?
Using the facts you mentioned:
1. The table already has 400M rows;
2. It's a staging table;
3. New inserts can be massive;
4. You didn't specify whether you need to keep the rows in the staging table, so I will assume you do.
In this scenario I would create 3 tables:
TABLE_STAGING
TABLE_DESTINATION
TABLE_TEMP
1- First, disable the indexes of TABLE_TEMP;
2- Load data into TABLE_TEMP using SQL*Loader (read about the APPEND hint and direct-path load);
3- Enable the indexes on TABLE_TEMP;
4- Insert data into TABLE_DESTINATION using MERGE on TABLE_TEMP (see the sketch below);
5- Insert data into TABLE_STAGING from TABLE_TEMP - here you correct the errors you found:
INSERT INTO TABLE_STAGING SELECT * FROM TABLE_TEMP;
6- Truncate table TABLE_TEMP;
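For illustration, here is a hedged sketch of steps 2 and 4, assuming hypothetical columns ID and VAL and a comma-separated data file (the names, columns and load options are placeholders, not your actual schema):

-- Step 2 (sketch): direct-path load into TABLE_TEMP with SQL*Loader
-- temp_load.ctl (control file):
--   LOAD DATA INFILE 'new_rows.dat'
--   APPEND INTO TABLE table_temp
--   FIELDS TERMINATED BY ',' (id, val)
-- invoked from the shell:
--   sqlldr userid=your_user control=temp_load.ctl direct=true

-- Step 4 (sketch): merge the small TABLE_TEMP into TABLE_DESTINATION
MERGE INTO table_destination d
USING table_temp t
   ON (d.id = t.id)
 WHEN MATCHED THEN
      UPDATE SET d.val = t.val
 WHEN NOT MATCHED THEN
      INSERT (id, val) VALUES (t.id, t.val);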
Rebuilding an index over 400M rows every time is not ideal; it is massive CPU work to visit every row to rebuild the index. Staging tables should be empty all the time, or use temporary tables.
I have a database with a single table. This table will need to be updated every few weeks. We need to ingest third-party data into it and it will contain 100-120 million rows. So the flow is basically:
Get the raw data from the source
Detect inserts, updates & deletes
Make updates and ingest into the database
What's the best way of detecting and performing updates?
Some options are:
Compare incoming data with the current database row by row and make single updates. This seems very slow and not feasible.
Ingest the incoming data into a new table, then swap the old table for the new table.
Bulk updates in-place in the current table. Not sure how to do this.
Which do you suggest is the best option, or is there a different option out there?
Postgres has a helpful guide for improving the performance of bulk loads. From your description, you need to perform a bulk INSERT in addition to a bulk UPDATE and DELETE. Below is a rough step-by-step guide for making this efficient:
Configure Global Database Configuration Variables Before the Operation
ALTER SYSTEM SET max_wal_size = <size>;
You can additionally reduce WAL logging to the minimum. With wal_level = 'minimal', a COPY into a table created or truncated in the same transaction can skip WAL entirely.
ALTER SYSTEM SET wal_level = 'minimal';
ALTER SYSTEM SET archive_mode = 'off';
ALTER SYSTEM SET max_wal_senders = 0;
Note that these changes will require a database restart to take effect.
Start a Transaction
You want all work to be done in a single transaction in case anything goes wrong. Running COPY in parallel across multiple connections does not usually increase performance as disk is usually the limiting factor.
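In psql, for example, start the transaction explicitly before running the steps below:

BEGIN;
-- all of the following bulk-load steps run inside this single transaction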
Optimize Other Configuration Variables at the Transaction level
SET LOCAL maintenance_work_mem = <size>
...
You may need to set other configuration parameters if you are doing any additional special processing of the data inside Postgres (work_mem is usually the most important there, especially if you are using the PostGIS extension). See this guide for the most important configuration variables for performance.
CREATE a TEMPORARY table with no constraints.
CREATE TEMPORARY TABLE changes(
  id bigint,
  data text
) ON COMMIT DROP; -- ensures this table will be dropped at end of transaction
Bulk Insert Into changes using COPY FROM
Use the COPY FROM Command to bulk insert the raw data into the temporary table.
COPY changes(id,data) FROM ..
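For example, assuming a CSV file at a hypothetical path (server-side COPY; use psql's \copy to stream from the client instead):

COPY changes(id, data) FROM '/path/to/changes.csv' WITH (FORMAT csv, HEADER);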
DROP Relations That Can Slow Processing
On the target table, DROP all foreign key constraints, indexes and triggers (where possible). Don't drop your PRIMARY KEY, as the upsert's ON CONFLICT clause relies on it.
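For example, with hypothetical constraint/index/trigger names (note down exactly what you drop so you can recreate it later):

ALTER TABLE target DROP CONSTRAINT target_other_id_fkey;   -- hypothetical foreign key
DROP INDEX IF EXISTS target_data_idx;                      -- hypothetical secondary index
ALTER TABLE target DISABLE TRIGGER target_audit_trigger;   -- hypothetical trigger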
Add a Tracking Column to target Table
Add a column to the target table to determine whether each row was present in the changes table:
ALTER TABLE target ADD COLUMN seen boolean;
UPSERT from the changes table into the target table:
UPSERTs are performed by adding an ON CONFLICT clause to a standard INSERT statement. This avoids the need to perform two separate operations.
INSERT INTO target(id,data,seen)
SELECT
id,
data,
true
FROM
changes
ON CONFLICT (id) DO UPDATE SET data = EXCLUDED.data, seen = true;
DELETE Rows Not In changes Table
DELETE FROM target WHERE seen IS NOT TRUE;
DROP Tracking Column and Temporary changes Table
DROP TABLE changes;
ALTER TABLE target DROP COLUMN seen;
Add Back Relations You Dropped For Performance
Add back all constraints, triggers and indexes that were dropped to improve bulk upsert performance.
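Mirroring the hypothetical objects dropped earlier, for example (other_id and other_table are placeholders):

CREATE INDEX target_data_idx ON target (data);
ALTER TABLE target ADD CONSTRAINT target_other_id_fkey
    FOREIGN KEY (other_id) REFERENCES other_table (id);
ALTER TABLE target ENABLE TRIGGER target_audit_trigger;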
Commit Transaction
The bulk upsert/delete is complete and the following commands should be performed outside of a transaction.
Run VACUUM ANALYZE on the target Table.
This will allow the query planner to make appropriate inferences about the table and reclaim space taken up by dead tuples.
SET maintenance_work_mem = <size>
VACUUM ANALYZE target;
SET maintenance_work_mem = <original size>
Restore Original Values of Database Configuration Variables
ALTER SYSTEM SET max_wal_size = <size>;
...
You may need to restart your database again for these settings to take effect.
I am using MonetDB (MDB) for OLAP queries. I store the source data in PostgreSQL (PGSQL) and sync it to MonetDB in batches using a Python script.
In PGSQL there is a wide table with an ID (non-unique) and a few columns. Every few seconds the Python script takes a batch of 10k records changed in PGSQL and uploads them to MDB.
The process of upload to MDB is as follows:
Create a staging table in MDB.
Use the COPY command to upload the 10k records into the staging table.
DELETE from the destination table all IDs that are in the staging table.
INSERT into the destination table all rows from the staging table.
So it is basically a DELETE & INSERT. I cannot use a MERGE statement because I do not have a PK - one ID can have multiple rows in the destination. So I need to do a delete and a full insert for all IDs currently being synced.
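In SQL terms the sync step is roughly the following, assuming the staging and destination tables have identical column lists:

DELETE FROM destination
 WHERE id IN (SELECT id FROM staging);

INSERT INTO destination
SELECT * FROM staging;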
Now to the problem: the DELETE is slow.
When I do a DELETE on the destination table, deleting 10k records from a table of 25M rows, it takes 500 ms.
However, if I first run a simple SELECT * FROM destination WHERE id = 1 and THEN do the DELETE, it takes 2 ms.
I think that it has something to do with automatic creation of auxiliary indices. But this is where my knowledge ends.
I tried to solve this by doing the lookup myself as a "pre-heat", and it works - but only for the first DELETE after the pre-heat.
Once I do the DELETE and INSERT, the next DELETE is slow again. And pre-heating before each DELETE does not make sense, because the pre-heat itself takes 500 ms.
Is there any way to sync data to MDB without invalidating the auxiliary indices that were already built? Or to make the DELETE faster without the pre-heat? Or should I use a different technique to sync data into MDB without a PK (does MERGE have the same problem)?
Thanks!
I want to tune my merge query, which inserts and updates a table in Oracle based on a source table in SQL Server. The table size is around 120 million rows, and normally around 120k records are inserted/updated daily. The merge takes around 1.5 hours to run. It uses a nested loop and the primary key index to perform the insert and update.
There is no record-update date in the source table to use, so all records are compared.
Merge into abc tgt
using
(
  select ref_id, a, b, c
  from sourcetable@sqlserver_remote
) src
on (tgt.ref_id = src.ref_id)
when matched then
  update set
  .......
  where
    decode(tgt.a, src.a, 1, 0) = 0
    or ......
when not matched then
  insert (....) values (.....);
commit;
Since the table is huge and growing every day, I partitioned the table in DEV based on ref_id (10 groups) and created a local index on ref_id.
Now it uses a hash join and a full table scan, and it runs longer than the existing process.
When I changed from a local to a global index (on ref_id), it uses nested loops but still takes longer to run than the existing process.
Is there a way to performance-tune this process?
Thanks...
I'd be wary of joining/merging huge tables over a database link. I'd try to copy over the complete source table first (for instance with a non-atomic materialized view refresh, possibly compressed, possibly sorted, and certainly only the columns you'll need). After gathering statistics, I'd merge the target table with the local copy. Afterwards, the local copy can be truncated.
I wouldn't be surprised if partitioning speeds up the merge from the local copy into your target table.
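A hedged sketch of that approach, assuming a DB link named sqlserver_remote and columns ref_id, a, b, c as in the question (adapt the column list, names and refresh schedule; keep your existing decode filter in the WHEN MATCHED branch):

-- one-time setup: local copy of the remote table, only the columns needed
CREATE MATERIALIZED VIEW mv_src
  REFRESH COMPLETE ON DEMAND
  AS SELECT ref_id, a, b, c FROM sourcetable@sqlserver_remote;

-- each run: non-atomic complete refresh (TRUNCATE + direct-path INSERT), then stats
EXEC DBMS_MVIEW.REFRESH('MV_SRC', method => 'C', atomic_refresh => FALSE);
EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'MV_SRC');

-- then merge locally, with no database link involved
MERGE INTO abc tgt
USING mv_src src
   ON (tgt.ref_id = src.ref_id)
 WHEN MATCHED THEN
      UPDATE SET tgt.a = src.a, tgt.b = src.b, tgt.c = src.c
 WHEN NOT MATCHED THEN
      INSERT (ref_id, a, b, c) VALUES (src.ref_id, src.a, src.b, src.c);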
I have a PL/SQL file that at one point needs to delete an entire table. The challenge is:
TRUNCATE TABLE cannot be used, since the DB user in question cannot be granted rights to execute DDL commands (owing to SOX compliance)
DELETE works well when the table has a small number of records. However, the table might have millions of records, in which case the amount of redo generated grows drastically, causing performance issues
The following solutions work:
Increasing the redo log file size
Deleting records in batch mode (within nested transactions)
Is there a better and more efficient way to address this issue?
If the redo log file size is the problem, you can delete in portions with COMMIT after each delete. For example:
BEGIN
  LOOP
    -- delete up to 1,000,000 rows in each iteration
    DELETE FROM tmp_test_table
     WHERE 1=1 --:WHERE_CONDITION
       AND rownum <= 1000000;
    -- exit when there is nothing left to delete
    EXIT WHEN SQL%ROWCOUNT = 0;
    COMMIT;
  END LOOP;
END;
/
There are some approaches you can use:
Use partitioning: The fastest way to do a mass delete is to drop an Oracle partition.
Tune the delete subquery: Many Oracle deletes use a where clause subquery and optimizing the subquery will improve the SQL delete speed.
Use bulk deletes: Oracle PL/SQL has a bulk delete operator that often is faster than a standard SQL delete.
Drop indexes & constraints: If you are tuning a delete in a nightly batch job, consider dropping the indexes and rebuilding them after the delete job has completed.
Small pctused: For tuning mass deletes you can reduce freelist overhead by setting a low value for PCTUSED, so that Oracle only re-adds a block to the freelists when the block is completely empty.
Parallelize the delete job: You can run a massive delete in parallel with the parallel hint. If you have 36 processors, the full scan can run up to 35 times faster (cpu_count - 1).
Consider NOARCHIVELOG: Take a full backup first, bounce the database into NOARCHIVELOG mode for the delete, and bounce it back into ARCHIVELOG mode afterwards.
Use CTAS: Another option is to create a new table using CTAS, where the SELECT statement filters out the rows that you want to delete. Then rename the original table, rename the new table into place, and transfer the constraints and indexes (see the sketch after this list).
Lastly, resist the temptation to do "soft" deletes, a brain-dead approach that can be fatal.
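For illustration, a hedged sketch of the CTAS route from the list above; big_table and keep_flag are placeholders, and the WHERE clause keeps the rows you want to retain:

CREATE TABLE big_table_keep NOLOGGING PARALLEL 8 AS
SELECT * FROM big_table
 WHERE keep_flag = 'Y';          -- i.e. everything you do NOT want to delete

-- swap the tables, then re-create indexes, constraints and grants on the new one
RENAME big_table TO big_table_old;
RENAME big_table_keep TO big_table;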
This method (re-creating the table) has advantages:
You can create the new table in the required tablespace, without fragmentation, on the required part of the disk.
Physically re-creating the table removes table fragmentation and chained rows.
If you need to remove 60-99% of the rows from a table emp, make a new table emp_new.
Create the new (empty) table:
create table emp_new as select * from emp where 1 = 0;
Copy the required rows to the new table:
insert into emp_new select * from emp where date_insert > sysdate - 30;
Create the indexes on the new table:
create index pk_emp on emp_new (<key column>);
Drop the old emp table:
drop table emp;
Rename the new table to the old name:
rename emp_new to emp;
In Oracle PL/SQL, DDL statements must be run through EXECUTE IMMEDIATE. Hence you should use:
execute immediate 'truncate table schema.tablename';
Alternatively, you can also use
DBMS_UTILITY.EXEC_DDL_STATEMENT('TRUNCATE TABLE tablename');
Try either one; it may work in your case.
In my project there is a table with 23 million records, and around 6 fields of that table are indexed.
Earlier I tried to add a delta column for Thinking Sphinx search, but it ended up holding a lock on the whole database for an hour. Afterwards, once the column was added and I tried to rebuild the indexes, this is the query that held the database lock for around 4 hours:
"update user_messages set delta = false where delta = true"
To get the server back up, I created a new database from a DB dump and promoted it as the database so the server could go live again.
What I am looking for now: is it possible to add the delta column to my table without a table lock? And once the delta column is added, why is the above query executed when I run the index-rebuild command, and why does it block the server for so long?
PS.: I am on Heroku and using Postgres with ika db model.
Postgres 11 or later
Since Postgres 11, only volatile default values still require a table rewrite. The manual:
Adding a column with a volatile DEFAULT or changing the type of an existing column will require the entire table and its indexes to be rewritten.
Bold emphasis mine. false is immutable. So just add the column with DEFAULT false. Super fast, job done:
ALTER TABLE tbl ADD column delta boolean DEFAULT false;
Postgres 10 or older, or for volatile DEFAULT
Adding a new column without DEFAULT or DEFAULT NULL will not normally force a table rewrite and is very cheap. Only writing actual values to it creates new rows. But, quoting the manual:
Adding a column with a DEFAULT clause or changing the type of an existing column will require the entire table and its indexes to be rewritten.
UPDATE in PostgreSQL writes a new version of the row. Your question does not provide all the information, but that probably means writing millions of new rows.
While doing the UPDATE in place, if a major portion of the table is affected and you are free to lock the table exclusively, remove all indexes before doing the mass UPDATE and recreate them afterwards. It's faster this way. Related advice in the manual.
If your data model and available disk space allow for it, CREATE a new table in the background and then, in one transaction: DROP the old table, and RENAME the new one. Related:
Best way to populate a new column in a large table?
While creating the new table in the background: Apply all changes to the same row at once. Repeated updates create new row versions and leave dead tuples behind.
If you cannot remove the original table because of constraints, another fast way is to build a temporary table, TRUNCATE the original one and mass INSERT the new rows - sorted, if that helps performance. All in one transaction. Something like this:
BEGIN;
SET temp_buffers = '1000MB'; -- or whatever you can spare temporarily
-- write-lock table here to prevent concurrent writes - if needed
LOCK TABLE tbl IN SHARE MODE;
CREATE TEMP TABLE tmp AS
SELECT *, false AS delta
FROM tbl; -- copy existing rows plus new value
-- ORDER BY ??? -- opportune moment to cluster rows
-- DROP all indexes here
TRUNCATE tbl; -- empty table - truncate is super fast
ALTER TABLE tbl ADD column delta boolean DEFAULT FALSE; -- NOT NULL?
INSERT INTO tbl
TABLE tmp; -- insert back surviving rows.
-- recreate all indexes here
COMMIT;
You could add another table with just that one column; there won't be any such long locks. Of course, it should also have another column, a foreign key referencing the first table.
For the indexes, you could use CREATE INDEX CONCURRENTLY; it doesn't take heavy locks on the table: http://www.postgresql.org/docs/9.1/static/sql-createindex.html
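A hedged sketch of that idea, assuming the table from the question is user_messages with an id primary key (the names and types are assumptions):

-- side table holding only the delta flag, one row per message
CREATE TABLE user_message_deltas (
    user_message_id bigint PRIMARY KEY REFERENCES user_messages (id),
    delta           boolean NOT NULL DEFAULT false
);

-- build the index without taking a long lock on the table
-- (CREATE INDEX CONCURRENTLY cannot run inside a transaction block)
CREATE INDEX CONCURRENTLY user_message_deltas_delta_idx
    ON user_message_deltas (delta);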