I have two tables that need to be refreshed every hour. One is a truncate-and-load and the other is an incremental load. The whole process takes around 30 seconds to complete. There are a couple of applications hitting these tables continuously, and I can't have the applications see blank data at any moment. Any idea what could be done so that operations on these tables (including the truncate/load) don't affect the output on the UI? I am thinking of creating an MV on these tables, but is there a better approach?
Convert the TRUNCATE to a DELETE and make the whole process one transaction. If the current process only takes 30 seconds, the extra overhead of deleting and doing conventional inserts shouldn't be too bad.
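A minimal T-SQL sketch of that idea, assuming hypothetical table names (dbo.ReportTable is what the applications read, dbo.SourceTable holds the fresh data):

BEGIN TRANSACTION;

-- Replaces the TRUNCATE: DELETE is fully logged and transactional
DELETE FROM dbo.ReportTable;

-- Conventional reload of the fresh data
INSERT INTO dbo.ReportTable (Col1, Col2)
SELECT Col1, Col2
FROM dbo.SourceTable;

COMMIT TRANSACTION;

Under the default READ COMMITTED isolation, readers will block for the duration of the transaction rather than see an empty table; with READ_COMMITTED_SNAPSHOT enabled they keep seeing the old rows until the COMMIT.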
Last year, I wrote a PowerShell script with a DMV query that pulls results into a variable and then bulk-copies those results into another table I had created to hold this data before exporting it to a CSV file. We use AutoSys for scheduling jobs, so in this case we set the script to run in 5-minute increments to stay up to date with usage data. So, technically, the table is getting updated with about 5-10 records every run.
The next day, we exported a CSV file with all the data in that table for the prior day, then truncated the table.
Then we received a requirement to retain usage information for up to 2 years (long story).
So we altered the truncate statement in the script to delete records only if they are older than 2 years. We tested it, of course, and it worked fine in Dev/Test (with a shorter time span).
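For context, a minimal sketch of the retention DELETE described here; the table and column names are hypothetical:

-- Keep only the last 2 years of usage data
DELETE FROM dbo.UsageStats
WHERE CapturedAt < DATEADD(YEAR, -2, GETDATE());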
Ever since, the table has grown to over 5 million records in 2 months, and recently we started seeing the AutoSys job failing/restarting often.
So the question is: is it possible that, because the table is now growing fast with millions of records, the bulk copy query is hitting a performance issue? Does the query "get tired", maybe, because there is a bunch of transactions at once?
As far as I know, a table can have millions of records without there being any issue with continuous, nonstop inserts. But maybe bulk copy is different from a regular insert? I'm thinking of increasing the job schedule from 5 minutes to 10 minutes, but I'm afraid we might lose some usage data in that extra gap - and it might not even be a solution to the issue...
I am inserting large amounts of data into a table.
For example once every 15 minutes, N records of data become available to be inserted into the table.
My question is: what should I do if inserting N records takes more than 15 minutes? That is, the next insertion cannot begin because the previous one is still in progress.
Please assume that I've used the most affordable hardware, and that even dropping indexes before starting the insert does not get it under 15 minutes.
My preference is not to drop the indexes anyway, because the table is queried at the same time. What's the best practice in such a scenario?
P.S. I don't have any actual code; I am just thinking through a possible scenario.
If you are receiving/loading a large quantity of data every quarter hour, you have an operational requirement, not an application requirement, so use an operational solution.
All databases have a "bulk insert" utility, and SQL Server is no exception - it even calls the function BULK INSERT:
BULK INSERT mytable FROM 'my_data_file.dat'
Such utilities are built for raw speed and will outstrip any alternative application solution.
Write a shell script to receive the data into a file, formatting it as required using shell utilities, and invoke BULK INSERT.
Wire the process up to crontab (or the equivalent Windows scheduler such as AT if you are running on Windows).
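A slightly fuller sketch of the BULK INSERT call; the file path, delimiters and batch size below are assumptions to adapt to your data:

BULK INSERT mytable
FROM 'C:\loads\my_data_file.dat'
WITH (
    FIELDTERMINATOR = ',',      -- column delimiter in the file
    ROWTERMINATOR   = '\n',     -- row delimiter
    BATCHSIZE       = 100000,   -- commit every 100,000 rows
    TABLOCK                     -- allows minimal logging where the table qualifies
);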
The first thing is to look for basic insert optimizations.
You can find many posts about it:
What is the fastest way to insert large number of rows
Insert 2 million rows into SQL Server quickly
The second thing is to see why it takes more than 15 minutes. Many things can explain that - locks, isolation level, etc. So try to challenge it (for example, can some portion of the queries read uncommitted records?).
The third thing is finding the right batch size for the insert, and considering splitting the load into several smaller chunks of data with intermediate commits. Many inserts in one transaction without committing can have a bad effect on the server (log-file- and lock-wise, the server must be able to roll back the entire transaction).
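A sketch of the chunked approach, with hypothetical table names: each DELETE ... OUTPUT statement moves one chunk and commits on its own (in autocommit mode), so the log never has to hold the whole load:

WHILE 1 = 1
BEGIN
    -- Move up to 50,000 rows from staging into the target in one small transaction
    DELETE TOP (50000) FROM dbo.IncomingStage
    OUTPUT deleted.Id, deleted.Payload
    INTO dbo.Target (Id, Payload);

    IF @@ROWCOUNT = 0 BREAK;   -- staging table drained
END;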
I receive about 8 huge delimited flat files to be loaded into a SQL Server (2012) table once every week. The total number of rows across all the files is about 150 million, and each file has a different number of rows. I have a simple SSIS package which loads the data from the flat files (using a foreach container) into a history table. Then a select query runs on this history table to pick out the current week's data and load it into a staging table.
We ran into problems as the history table grew very large (8 billion rows), so I decided to back up the data in the history table and truncate it. Before the truncation, the package execution time had ranged from 15 hours up to 63 hours. We hoped that after the truncation it would go back to 15 hours or less. But to my surprise, even after 20+ hours the package is still running. The worst part is that it is still loading the history table - the latest count is around 120 million. It still has to load the staging data, and that might take just as long.
Neither the history table nor the staging table has any indexes, which is why the select query on the history table used to take most of the execution time. But loading from all the flat files into the history table was always under 3 hours.
I hope I'm making sense. Can someone help me understand what could be the reason behind this unusual execution time this week? Thanks.
Note: the biggest file (8 GB) was read at the flat file source in 3 minutes, so I'm thinking the source is not the bottleneck here.
There's no good reason, IMHO, why that server should take that long to load that much data. Are you saying that the process which used to take 3 hours now takes 60+? Is it the first (data-load) or the second (history-table) portion that has suddenly become slow? Or both at once?
I think the first thing that I would do is "trust, but verify" that there are no indexes at play here. The second thing I'd look at is the storage allocation for this table's filegroup ... is it running out of room, such that SQL Server has to do a bunch of extra calisthenics to obtain and maintain storage? How does this process COMMIT? After every row? Can you prove that the package definition has not changed in the slightest recently?
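The index check is easy to script; assuming the history table is called dbo.HistoryTable:

-- "Trust, but verify": list anything other than the bare heap on the table
SELECT i.name, i.type_desc
FROM sys.indexes AS i
WHERE i.object_id = OBJECT_ID('dbo.HistoryTable')
  AND i.type_desc <> 'HEAP';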
Obviously, "150 million rows" is not a lot of data, these days; neither is 8GB. If you were "simply" moving those rows into an un-indexed table, "3 hours" would be a generous expectation. Obviously, the only credible root-cause of this kind of behavior is that the disk-I/O load has increased dramatically, and I am healthily suspicious that "excessive COMMITs" might well be part of the cause: re-writing instead of "lazy-writing," re-reading instead of caching.
I am processing large amounts of data in iterations; each iteration processes around 10,000-50,000 records. Because of this large number of records, I am inserting them into a global temporary table first, and then processing it. Usually, each iteration takes 5-10 seconds.
Would it be wise to truncate the global temporary table after each iteration so that each iteration can start off with an empty table? There are around 5000 iterations.
No! The whole idea of a Global Temporary Table is that the data disappears automatically when you no longer need it.
For example, if you want the data to disappear when you COMMIT, you should use the ON COMMIT DELETE ROWS option when originally creating the table.
That way, you don't need to do a TRUNCATE - you just COMMIT, and the table is all fresh and empty and ready to be reused.
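For reference, the Oracle DDL for that option looks like this (table and column names are illustrative):

CREATE GLOBAL TEMPORARY TABLE work_batch (
    rec_id   NUMBER,
    payload  VARCHAR2(4000)
) ON COMMIT DELETE ROWS;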
5000 iterations of 50,000 records per run? If you need to do that much processing, surely you can optimise your processing logic to run more efficiently; that would gain you more speed than truncating tables would.
However, if you are finished with the data in a temp table, you should truncate it, or at least ensure that the next process to use the table isn't re-processing the same data over again.
E.g. have a 'processed' flag, so new processes don't use the existing data (see the sketch after this list).
OR
remove data when no longer needed.
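A minimal sketch of the 'processed' flag idea, with illustrative names:

-- Each run picks up only rows no previous run has handled
SELECT rec_id, payload
FROM work_queue
WHERE processed = 'N';

-- ... process them, then mark them off
UPDATE work_queue
SET processed = 'Y'
WHERE processed = 'N';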
I am working on a SQL Job which involves 5 procs, a few while loops and a lot of Inserts and Updates.
This job processes around 75000 records.
Now, the job works fine for the first 10,000-20,000 records, at a speed of around 500 records/min. After around 20,000 records, execution just dies: it loads around 3,000 records every 30 minutes and stays at that speed.
I was suspecting the network, but I don't know for sure. These kinds of queries are difficult to analyze through SQL Performance Monitor, and I'm not very sure where to start.
Also, there is a single cursor in one of the procs, which executes for very few records.
Any suggestions on how to speed this process up on the full-size data set?
I would check whether your updates are all inside one transaction. If they are, it could explain why the job dies after a certain amount of "modified" data. You might check how large your tempdb gets as an indicator.
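A quick way to watch that, assuming SQL Server:

-- Current size of tempdb's files in MB (size is stored in 8 KB pages)
SELECT name,
       size * 8 / 1024 AS size_mb
FROM tempdb.sys.database_files;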
I have also seen cases where, during long-running transactions, the database would die when there were other "usages" at the same time - again because of transactionality and improper isolation levels.
If you can split your job into independent, non-overlapping chunks, you might want to do so: for example, do the job in chunks by dates, ID ranges of "root" objects, etc.
I suspect your whole process is flawed. I import a data file that contains 20,000,000 records, hits many more tables and does some very complex processing, in less time than you are describing for 75,000 records. Remember: looping is every bit as bad as using cursors.
I think if you set this up as an SSIS package you might be surprised to find the whole thing can run in just a few minutes.
With your current set-up, consider whether you are running out of room in the temp database, or whether it is trying to grow and can't grow fast enough. Also consider whether some other job that might be causing blocking is running at the time the slowdown starts. And get rid of the loops and process things in a set-based manner.
Okay... so here's what I am doing, in steps:
Loading a file into a TEMP table, just as an intermediary.
Doing some validations on all records using set-based operations.
The actual processing starts NOW.
TRANSACTION BEGINS HERE......
LOOP STARTS HERE
a. Pick records based on the TEMP table's PK (say, customer A).
b. Retrieve data from existing tables (e.g. employer information).
c. Validate the information received/retrieved.
d. Check if the record already exists - UPDATE, else INSERT. (THIS HAPPENS IN A SEPARATE PROCEDURE; a set-based sketch follows these steps.)
e. Find ALL of Customer A's family members (PROCESS THEM ALL IN ANOTHER **LOOP** - SEPARATE PROC).
f. Update the status for Customer A and his family members.
LOOP ENDS HERE
TRANSACTION ENDS HERE
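To illustrate the set-based alternative suggested above for step d, here is a rough MERGE sketch; all table and column names are hypothetical:

-- One statement upserts every row from the TEMP table instead of looping per customer
MERGE dbo.Customers AS tgt
USING #TempLoad AS src
    ON tgt.CustomerId = src.CustomerId
WHEN MATCHED THEN
    UPDATE SET tgt.Name = src.Name, tgt.Status = src.Status
WHEN NOT MATCHED THEN
    INSERT (CustomerId, Name, Status)
    VALUES (src.CustomerId, src.Name, src.Status);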