LINQ Faster way to insert a large amount of data into SQL? - sql

I have a large amount of data in some files, the data cannot even be in memory. As of now I am parsing the data, and after each entity is parsed, the parsing class raises an event, in this event I am, using LINQ, inserting an item into the corresponding Database table. When a whole file (have also tried all files) is parsed, any inserts are submitted to the database. The problem is that this takes way too long. I have left the program running over night and it did not even finish. The data is about 1.5gb of data on disk. How can I speed up my insertions? I am leveraging parallelisation for parsing, and it takes not time at all to parse, it is the insertions that are creating a huge bottleneck.

bulk insert helped, along with a pipeline model, and general code efficiency. It's never going to take less than 30 minutes unless you're on a fast machine with fast internet...

Related

What to do if the speed of data availability is faster than insert?

I am inserting large amounts of data into a table.
For example once every 15 minutes, N records of data become available to be inserted into the table.
My question is, what should I do if inserting N records takes more than 15 minutes? That's, the next insertion cannot begin because the previous one is still in progress.
Please assume that I've used the most affordable hardware and even dropping indexes before starting to insert data does not make inserting faster than 15 minutes.
My preference is not to drop indexes though, because at the same time, the table is queried. What's the best practice in such scenario?
P.S. I don't have any actual code. I am just thinking of and questioning about a possible scenario.
If you are receiving/loading a large quantity of data every quarter hour, you have an operational requirement, not an application requirement, so use an operational solution.
All database have a "bulk insert" utility, sql server is no exception and even calls the function BULK INSERT:
BULK INSERT mytable FROM 'my_data_file.dat'
Such utilities are built for raw speed and will outstrip any alternative application solution.
Write a shell script to receive the data into a file, formatting it as required using shell utilities, and invoke BULK INSERT.
Wire the process up to crontab (or the equivalent Windows scheduler such as AT if you are running on Windows).
First thing is to look for basic optimizations for inserts.
You can find many posts about it:
What is the fastest way to insert large number of rows
Insert 2 million rows into SQL Server quickly
Second thing is to see why it takes more than 15 minutes? Many things can explain that - locks, isolation level etc. So try to challenge it (for example can some portion of the queries can read uncommitted records?).
Third thing - finding the right quota for insert, and consider splitting to several smaller chunks of data, with intermediate commits. Many inserts in one transaction without committing may have a bad affect on the server (log file/locks wise - you need to be able to rollback the entire transaction).

INSERT INTO goes much slower with time in SQL Server 2012

We have a very big database WriteDB, which store raw trading data and we use this table to fast writes. Then with sql scripts I import data from WriteDB into ReadDB in comparatively the same table, but extended with some extra values + relation added. Import script is like that:
TRUNCATE TABLE [ReadDB].[dbo].[Price]
GO
INSERT INTO [ReadDB].[dbo].[Price]
SELECT a.*, 0 as ValueUSD, 0 as ValueEUR
from [WriteDB].[dbo].[Price] a
JOIN [ReadDB].[dbo].[Companies] b ON a.QuoteId = b.QuoteID
So initially there is around 130 mil. rows in this table (~50GB). Each day some of them added, some of them changes, so right now we decide not over complicate logic and just re-import all data. The problem that for some reason with time this script works longer and longer, on the almost same amount of data. First run it's take ~1h, now it's already taken 3h
Also SQL Server after import work not well. After import (or during it) if I try to run different queries, even the simplest they often fail with timeout errors.
What is the reason of such bad behavior and how to fix this?
One theory is that your first 50GB dataset has filled available memory for caching. Upon truncating the table, your cache is now effectively empty. This alternating behavior makes effective use of the cache difficult and incurs a substantial number of cache misses / increased IO time.
Consider the following sequence of events:
You load your initial dataset into WriteDb. During the load operation, pages in WriteDb are cached. There's very little memory contention because there's only one copy of the dataset and sufficient memory.
You initially populate ReadDb. The pages required to populate ReadDb (the data in WriteDb) are already largely cached. Fewer reads are required from disk, and your IO time can be dedicated to writing the inserted data for ReadDb. (This is your fast first run.)
You load your second dataset into WriteDb. During the load operation, there is insufficient memory to cache both existing data in ReadDb and new data written to WriteDb. This memory contention leads to fewer pages of WriteDb cached.
You truncate ReadDb. This invalidates a substantial portion of your cache (i.e. the 50GB of ReadDb data that was cached).
You then attempt your second load of ReadDb. Here you have very little of WriteDb cached, so your IO time is split between reading pages of WriteDb (your query) and writing pages of ReadDb (your insert). (This is your slow second run.)
You could test this theory by comparing the SQL Server cache miss ratio during your first and second load operations.
Some ways to improve performance might be to:
Use separate disk arrays for ReadDb / WriteDb to increase parallel IO performance.
Increase the available cache (amount of server memory) to accomodate the combined size of ReadDb + WriteDb and minimize cache misses.
Minimize the impact of each load operation on existing cached pages by using a MERGE statement instead of dumping / loading 50GB of data at a time.

Loading huge flatfiles to SQL table is too slow via SSIS package

I receive about 8 huge delimited flatfiles to be loaded into an SQL server (2012)table once every week. Total number of rows in all the files would be about 150 million and each file has different number of rows. I have a simple SSIS package which loads data from flatfiles(using foreach container) into a history table. And then a select query runs on this history table to select current weeks data and loads into a staging table.
We ran into problems as history table grew very large(8 billion rows). So I decided to back up the data in history table and truncate. Before truncation the package execution time ranged from 15hrs to 63 hrs in that order.We hoped after truncation it should go back to 15hrs or less.But to my surprise even after 20+ hours the package is still running. The worst part is that it is still loading the history table. Latest count is around 120 million. It still has to load the staging data and it might take just as long.
Neither history table nor staging tables have any indexes, which is why select query on the history table used to take most of the execution time. But loading from all the flatfiles to history table was always under 3 hrs.
I hope i'm making sense. Can someone help me understand what could be the reason behind this unusual execution time for this week? Thanks.
Note: The biggest file(8GB) was read at flatfile source in 3 minutes. So I'm thinking source is not the bottle neck here.
There's no good reason, IMHO, why that server should take that long to load that much data. Are you saying that the process which used to take 3 hours, now takes 60+? Is it the first (data-load) or the second (history-table) portion that has suddenly become slow? Or, both at once?
I think the first thing that I would do is to "trust, but verify" that there are no indexes at play here. The second thing I'd look at is the storage allocation for this tablespace ... is it running out of room, such that the SQL server is having to do a bunch of extra calesthenics to obtain and to maintain storage? How does this process COMMIT? After every row? Can you prove that the package definition has not changed in the slightest, recently?
Obviously, "150 million rows" is not a lot of data, these days; neither is 8GB. If you were "simply" moving those rows into an un-indexed table, "3 hours" would be a generous expectation. Obviously, the only credible root-cause of this kind of behavior is that the disk-I/O load has increased dramatically, and I am healthily suspicious that "excessive COMMITs" might well be part of the cause: re-writing instead of "lazy-writing," re-reading instead of caching.

Speed-up SQL Insert Statements

I am facing an issue with an ever slowing process which runs every hour and inserts around 3-4 million rows daily into an SQL Server 2008 Database.
The schema consists of a large table which contains all of the above data and has a clustered index on a datetime field (by day), a unique index on a combination of fields in order to exclude duplicate inserts, and a couple more indexes on 2 varchar fields.
The typical behavior as of late, is that the insert statements get suspended for a while before they complete. The overall process used to take 4-5 mins and now it's usually well over 40 mins.
The inserts are executed by a .net service which parses a series of xml files, performs some data transformations and then inserts the data to the DB. The service has not changed at all, it's just that the inserts take longer than they use to.
At this point I'm willing to try everything. Please, let me know whether you need any more info and feel free to suggest anything.
Thanks in advance.
Sounds like you have exhausted the buffer pools ability to cache all the pages needed for the insert process. Append-style inserts (like with your date table) have a very small working set of just a few pages. Random-style inserts have basically the entire index as their working set. If you insert a row at a random location the existing page that row is supposed to be written to must be read first.
This probably means tons of disk seeks for inserts.
Make sure to insert all rows in one statement. Use bulk insert or TVPs. This allows SQL Server to optimize the query plan by sorting the inserts by key value making IO much more efficient.
This will, however, not realize a big speedup (I have seen 5x in similar situations). To regain the original performance you must bring the working set back into memory. Add RAM, purge old data, or partition such that you only need to touch very few partitions.
drop index's before insert and set them up on completion

What is more efficient INSERT command or SQL Loader for bulk upload - ORACLE 11g R2

As part of a new process requirement, we will be creating table and which will contain approximately 3000 - 4000 records. We have a copy of these records in plain text on a txt file.
Loading these records in the table leaves me with two choices
Use a shell script to generate SQL file containing INSERT Statements for these records
with the use of awk, shell variables, and loops to create a sql and script execution of this sql, we can be performed with ease
Use of SQL Loader.
Realignment of the record list and ctl file generation the only dependency.
Which of the above two options would be most efficient, in terms of taking up DB resources, utilisation on the client server on which this is to be performed.
I do realise the number of records are rather small, but we may have to repeat this activity with higher number of records (close to 60,000) in which case I would like to have the best possible option configured from the start.
SQL*Loader is the more efficient method. It gives you more control. You have an option do DIRECT load and NOLOGGING, which will reduce redo log generation, and when indexes have been disabled (as part of direct loading), the loading goes faster. Downside, is if load is interupted, indexes are left unusable.
But, considering the advantages, SQL*Loader is the best approach. And you will feel the difference, when you have millions of records, and having so many loading jobs running in parallel. I heard DBA complaining about the log size, when we do CONVENTIONAL INSERT statement loading, with 200+ such jobs, running in parallel. The larger the data volume, the larger the difference you'll see in performance.
SQL*Loader will be more efficient than thousands of individual INSERT statements. Even with 60,000 rows, though, both approaches should complete in a matter of seconds.
Of the two options you mentioned, SQL*Loader is definitely the way to go - much faster and more efficient.
However, I'd choose another approach - external tables. Has all the benefits of SQL*Loader, and allows you to treat your external csv file like an ordinary database table.