I am processing large amounts of data in iterations; each iteration processes around 10,000-50,000 records. Because of the large number of records, I insert them into a global temporary table first and then process them. Each iteration usually takes 5-10 seconds.
Would it be wise to truncate the global temporary table after each iteration so that each iteration can start off with an empty table? There are around 5000 iterations.
No! The whole idea of a Global Temporary Table is that the data disappears automatically when you no longer need it.
For example, if you want the data to disappear when you COMMIT, you should use the ON COMMIT DELETE ROWS option when originally creating the table.
That way, you don't need to do a TRUNCATE - you just COMMIT, and the table is all fresh and empty and ready to be reused.
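For illustration, here is a minimal sketch of that approach, assuming Oracle syntax and hypothetical table and column names (staging_records, source_table, etc. are made up):

-- rows vanish automatically at COMMIT, so no TRUNCATE is ever needed
CREATE GLOBAL TEMPORARY TABLE staging_records (
    record_id NUMBER         NOT NULL,
    payload   VARCHAR2(4000)
) ON COMMIT DELETE ROWS;

-- load and process one iteration, then just COMMIT
INSERT INTO staging_records (record_id, payload)
SELECT id, data
FROM   source_table
WHERE  batch_id = 1;

COMMIT;   -- the table is now empty again, ready for the next iteration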
5,000 iterations at up to 50,000 records per run? If you need to do that much processing, surely you can optimise your processing logic to run more efficiently. That would gain you far more speed than worrying about truncating tables.
However, if you are finished with the data in a temp table, you should truncate it, or at least make sure that the next process to use the table isn't re-processing the same data again.
E.g. have a 'processed' flag so new processes don't pick up the existing data, or simply remove the data when it is no longer needed.
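A rough sketch of both options, assuming a hypothetical work_queue table with processed and iteration_id columns (none of these names come from the question):

-- option 1: flag rows as processed so later runs skip them
UPDATE work_queue
SET    processed = 1
WHERE  iteration_id = 42;          -- the iteration that has just finished

SELECT record_id, payload
FROM   work_queue
WHERE  processed = 0;              -- subsequent processes only read unprocessed rows

-- option 2: simply remove rows once they are no longer needed
DELETE FROM work_queue
WHERE  processed = 1;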
I've got a seemingly simple stored procedure that is taking too long to run (25 minutes on about 1 million records). I was wondering what I can do to speed it up. It's just deleting records in a given set of statuses.
Here's the entire procedure:
ALTER PROCEDURE [dbo].[spTTWFilters]
AS
BEGIN
DELETE FROM TMW
WHERE STATUS IN ('AVAIL', 'CANCL', 'CONTACTED', 'EDI-IN', 'NOFRGHT', 'QUOTE');
END
I can obviously beef up my Azure SQL instance to run faster, but are there other ways to improve? Is my syntax not ideal? Do I need to index the STATUS column? Thanks!
So the answer, as is generally the case with large data-modification operations, is to break the work up into several smaller batches.
Every DML statement runs in a transaction, whether one is explicitly declared or not. When a delete that affects a large number of rows runs as a single statement, locks are held on the indexes and the base table for the duration of the operation, and the log file continues to grow, internally creating new VLFs for the entire transaction.
Moreover, if the delete is aborted before it completes, the rollback may well take considerably longer to complete, since rollbacks are always single-threaded.
Breaking the work into batches, usually with some form of loop working progressively through a range of key values, allows the deletes to occur in smaller, more manageable chunks. In this case, having a range of different status values to delete separately appears to be enough to effect a worthwhile improvement.
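A minimal sketch of one way to do that batching, using the TMW table and statuses from the question and an arbitrary chunk size of 10,000 rows:

-- delete in batches so each transaction stays small and log space can be reused
DECLARE @rows INT = 1;

WHILE @rows > 0
BEGIN
    DELETE TOP (10000) FROM TMW
    WHERE STATUS IN ('AVAIL', 'CANCL', 'CONTACTED', 'EDI-IN', 'NOFRGHT', 'QUOTE');

    SET @rows = @@ROWCOUNT;   -- stop once a batch deletes nothing
END

A nonclustered index on STATUS should also help each batch find its rows without scanning the whole table.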
You can use the TOP keyword to delete a large amount of data in a loop, or use an = comparison for each status instead of the IN keyword.
I have 2 tables which need to be refreshed every hour. One table is a truncate-and-load and the other is an incremental load. The total process takes around 30 seconds to complete. There are a couple of applications hitting these tables on a continuous basis, and I can't have the applications showing blank data at any moment. Any idea what could be done so that operations on these tables (including the truncate/load) don't affect the output on the UI? I am thinking of creating a materialized view on these tables, but is there a better approach?
Convert the TRUNCATE to a DELETE and make the whole process one transaction. If the current process only takes 30 seconds, the extra overhead of the DELETE and conventional inserts shouldn't be too bad.
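A rough sketch of the full-refresh table done that way, with hypothetical table and column names (RefreshTarget, SourceData, id, value are made up):

BEGIN TRANSACTION;

    -- DELETE instead of TRUNCATE so the emptying is part of the same transaction
    DELETE FROM dbo.RefreshTarget;

    INSERT INTO dbo.RefreshTarget (id, value)
    SELECT id, value
    FROM dbo.SourceData;

COMMIT TRANSACTION;
-- readers either block briefly or (under read committed snapshot / snapshot isolation)
-- keep seeing the previous rows; they never see an empty table

The incremental load can be included in the same transaction if it needs to stay consistent with the full refresh.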
I am using Pentaho DI to insert data into a fact table. The problem is that the source table I populate from contains 10,000 records and is growing daily.
If the source table contains 10,000 records and 200 new records are added, I need to run the KTR again; but when I run the KTR file it truncates all 10,000 rows in the fact table and starts inserting the full 10,200 records.
To avoid this I unchecked the truncate option in the Table output step, made one key unique in the fact table, and checked the 'Ignore insert errors' option. Now it works fine and inserts only the 200 new records, but it takes the same execution time.
I also tried a Stream lookup step in the KTR, but there was no change in my execution time.
Can anyone help me solve this problem? Thanks in advance.
If you need to capture all of the inserts, updates, and deletes, a Merge rows (diff) step followed by a Synchronize after merge step will do this, and it typically does it very quickly.
I have a stored procedure written in C# (SQL CLR) which performs calculations on around 2 million rows. The calculations take about 3 minutes. For each row, a result is generated in the form of three numbers.
Those results are inserted into a temporary table which is later processed further.
The results are added in chunks, and the inserting sometimes takes over 200 minutes (yes, over 3 hours!). Sometimes it takes "only" 50 minutes.
I have modified it so the results are kept in memory until the end and then all 2 million are dumped in one loop inside one transaction. Still, it takes around 20 minutes.
A similar loop written in SQL with a transaction begin/commit takes less than 30 seconds.
Does anyone have an idea where the problem is?
Processing the 2 million rows (selecting them, etc.) takes 3 minutes; inserting the results, even in the best solution, takes 20 minutes.
UPDATE: this table has one clustered index on an identity column (to ensure that rows are physically appended at the end), no triggers, no other indexes, and no other process is accessing it.
As long as we are all being vague, here is a vague answer. If inserting 2mil rows takes that long, I would check four problems in this order:
Validating foreign key references or uniqueness constraints. You shouldn't need any of these on your temporary table. Do the validation in your CLR before the record gets to the insert step.
Complicated triggers. Please tell me you don't have any triggers on the temporary table. Finish your inserts and then do more processing after everything is in.
Trying to recalculate indexes after each insert. Try dropping (or disabling) the indexes before the insert step and recreating them after; see the sketch after this list.
If these aren't it, you might be dealing with record locking. Do you have other processes that hit the temporary table that might be getting in the way? Can you stop them during your insert?
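On the third point, if the target table did have nonclustered indexes, a minimal sketch of turning them off for the load (the table and index names here are hypothetical, not from the question):

-- disable the nonclustered index so it is not maintained row by row during the insert
-- (do not disable the clustered index; that makes the table inaccessible)
ALTER INDEX IX_Staging_Result ON dbo.ResultStaging DISABLE;

-- ... run the CLR procedure's insert of the 2 million result rows here ...

-- rebuild once, after everything is in
ALTER INDEX IX_Staging_Result ON dbo.ResultStaging REBUILD;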
I have a long-running job. The records to be processed are in a table with around 100K records.
During the whole job, whenever this table is queried, the query runs against all 100K records.
After processing, the status of every record is updated in the same table.
I want to know whether it would be better to add another table where I can update each record's status, and keep deleting processed records from the master table, so that as the job progresses the number of records in the master table decreases and query performance improves.
EDIT: The master table is basically used for this load only. I receive a flat file, which I upload as-is before processing. After doing validations on this table, I pick one record at a time and move the data to the appropriate system tables.
I had a similar performance problem where a table generally has a few million rows but I only need to process what has changed since the start of my last execution. My target table has an IDENTITY column, so when my batch process begins, I select the set of rows whose IDs are greater than the value recorded by my previous batch execution and note the highest IDENTITY value in that set. Then, upon successful completion of the batch job, I add a record to a separate table indicating this highest IDENTITY value that was successfully processed, and use it as the starting input for the next batch invocation. (I'll also add that my bookmark table is general purpose, so I have multiple different jobs using it, each with a unique job name.)
If you are experiencing locking issues because your processing time per record takes a long time you can use the approach I used above, but break your sets into 1,000 rows (or whatever row chunk size your system can process in a timely fashion) so you're only locking smaller sets at any given time.
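A stripped-down sketch of that bookmark pattern in T-SQL; the table, column, and job names (BatchBookmark, MasterTable, 'MasterLoad', etc.) are hypothetical:

-- one row per successful run, recording how far that run got
CREATE TABLE dbo.BatchBookmark (
    JobName         VARCHAR(100) NOT NULL,
    LastProcessedId BIGINT       NOT NULL,
    CompletedAt     DATETIME2    NOT NULL DEFAULT SYSUTCDATETIME()
);

DECLARE @lastId BIGINT, @maxId BIGINT;

-- where the previous successful run left off (0 on the very first run)
SELECT @lastId = COALESCE(MAX(LastProcessedId), 0)
FROM dbo.BatchBookmark
WHERE JobName = 'MasterLoad';

-- the highest identity value in the slice about to be processed
SELECT @maxId = MAX(Id)
FROM dbo.MasterTable
WHERE Id > @lastId;

IF @maxId IS NOT NULL
BEGIN
    -- ... process the rows WHERE Id > @lastId AND Id <= @maxId ...

    -- record the new watermark only after the batch completes successfully
    INSERT INTO dbo.BatchBookmark (JobName, LastProcessedId)
    VALUES ('MasterLoad', @maxId);
END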
A few pointers (my two cents):
Consider splitting that table, similar to the "slowly changing dimension" technique, into a few "intermediate" tables depending on the "system table" destination; then bulk load your system tables instead of going record by record.
Drop the "input" table before the bulk load, and re-create it to get rid of indexes, etc.
Do not put unnecessary keys or indexes on that table before the load.
Consider switching the DB "recovery model" to bulk-logged mode, so that bulk transactions are minimally logged (see the sketch after this list).
Can you use an SSIS (ETL) task for loading, cleaning and validating?
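On the recovery-model point, a minimal sketch; the database name is hypothetical, and this assumes an on-premises SQL Server where changing the recovery model is allowed (take a log backup after switching back):

-- bulk operations are minimally logged under the BULK_LOGGED model
ALTER DATABASE StagingDb SET RECOVERY BULK_LOGGED;

-- ... run the bulk load (BULK INSERT / bcp / SSIS fast load) here ...

ALTER DATABASE StagingDb SET RECOVERY FULL;   -- restore the normal model afterwards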
UPDATE:
Here is a typical ETL scenario -- well, it depends on who you talk to.
1. Extract to flat_file_1 (you have that)
2. Clean flat_file_1 --> SSIS --> flat_file_2 (you can validate here)
3. Conform flat_file_2 --> SSIS --> flat_file_3 (apply all company standards)
4. Deliver flat_file_3 --> SSIS (bulk) --> db.ETL.StagingTables (several, one per your destination)
4B. insert into destination_table select * from db.ETL.StagingTable (bulk load your final destination)
This way, if a process (1-4) times out you can always restart from the intermediate file. You can also inspect each stage and create report files from SSIS for each stage to control your data quality. Operations 1-3 are essentially slow; here they happen outside of the database and can be done on a separate server. If you archive flat_file(1-3) you also have an audit trail of what's going on -- good for debugging too. :)
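If you prefer to do steps 4 and 4B in T-SQL rather than from the SSIS package, a rough equivalent looks like this; the file path and column names are hypothetical:

-- 4. bulk load the cleaned/conformed file into the staging table
BULK INSERT db.ETL.StagingTable
FROM 'C:\etl\flat_file_3.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);

-- 4B. move from staging into the final destination
INSERT INTO db.dbo.destination_table (col1, col2, col3)
SELECT col1, col2, col3
FROM db.ETL.StagingTable;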