Take advantage of multiple cores executing SQL statements - sql

I have a small application that reads XML files and inserts the information into a SQL database.
There are ~300,000 files to import, each one with ~1,000 records.
I started the application on 20% of the files and it has been running for 18 hours now; I hope I can improve this time for the rest of the files.
I'm not using a multi-threaded approach, but since the computer I'm running the process on has 4 cores, I was thinking of using one to get some improvement in performance (although I guess the main problem is the I/O and not only the processing).
I was thinking of using the BeginExecuteNonQuery() method on the SqlCommand object I create for each insertion, but I don't know if I should limit the maximum number of simultaneous threads (nor do I know how to do it).
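To make that concrete, this is roughly the kind of throttling I had in mind - only a sketch, with a placeholder table, columns, and connection string, and using the Task-based ExecuteNonQueryAsync rather than BeginExecuteNonQuery:

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Threading;
using System.Threading.Tasks;

class ThrottledInserter
{
    // Placeholder connection string and table; adjust for the real schema.
    private const string ConnectionString = "Server=.;Database=ImportDb;Integrated Security=true";

    public static async Task InsertAllAsync(IEnumerable<(int Id, string Payload)> records, int maxConcurrency = 4)
    {
        var gate = new SemaphoreSlim(maxConcurrency);
        var tasks = new List<Task>();

        foreach (var record in records)
        {
            await gate.WaitAsync();          // waits when maxConcurrency inserts are already in flight
            tasks.Add(Task.Run(async () =>
            {
                try
                {
                    using (var conn = new SqlConnection(ConnectionString))
                    using (var cmd = new SqlCommand(
                        "INSERT INTO ImportedRecords (Id, Payload) VALUES (@id, @payload)", conn))
                    {
                        cmd.Parameters.AddWithValue("@id", record.Id);
                        cmd.Parameters.AddWithValue("@payload", record.Payload);
                        await conn.OpenAsync();
                        await cmd.ExecuteNonQueryAsync();
                    }
                }
                finally
                {
                    gate.Release();          // free a slot for the next insert
                }
            }));
        }

        await Task.WhenAll(tasks);
    }
}
```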
What's your advice to get the best CPU utilization?
Thanks

If I understand you correctly, you are reading those files on the same machine that runs the database. Although I don't know much about your machine, I bet that your bottleneck is disk IO. This doesn't sound terribly computation intensive to me.

Have you tried using SqlBulkCopy? Basically, you load your data into a DataTable instance, then use the SqlBulkCopy class to load it to SQL Server. Should offer a HUGE performance increase without as much change to your current process as using bcp or another utility.
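Roughly, the pattern looks like this; the destination table, column layout, and connection string below are just placeholders:

```csharp
using System.Data;
using System.Data.SqlClient;

class BulkLoader
{
    // Build a DataTable whose columns mirror the destination table.
    public static DataTable BuildTable()
    {
        var table = new DataTable();
        table.Columns.Add("Id", typeof(int));
        table.Columns.Add("Payload", typeof(string));
        // table.Rows.Add(1, "value parsed from one XML record");
        return table;
    }

    public static void Load(DataTable table, string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            using (var bulk = new SqlBulkCopy(conn))
            {
                bulk.DestinationTableName = "dbo.ImportedRecords";
                bulk.BatchSize = 5000;      // ship rows to the server in chunks
                bulk.BulkCopyTimeout = 0;   // no timeout for large loads
                bulk.WriteToServer(table);
            }
        }
    }
}
```

Loading one DataTable per XML file (about 1,000 rows each) keeps memory use modest while still eliminating the per-row round trips.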

Look into bulk insert.
Imports a data file into a database table or view in a user-specified format.
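As a rough illustration, the statement can be issued from client code like this; the server-side file path, table name, and WITH options are placeholders, and since your source is XML you would first have to write the records out to a flat file that the SQL Server instance itself can read:

```csharp
using System.Data.SqlClient;

class BulkInsertRunner
{
    public static void Run(string connectionString)
    {
        // Placeholder path and table; the file must sit where the SQL Server
        // service can read it, not just where the client application runs.
        const string sql = @"
            BULK INSERT dbo.ImportedRecords
            FROM 'C:\import\records.csv'
            WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.CommandTimeout = 0;   // large loads can exceed the default timeout
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}
```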

Related

Best practices for batch inserts in Hibernate (large insertions)

I have a job that runs and inserts over 20,000 records after parsing JSON; the whole application connects to an Oracle DB using Hibernate. It takes around 1 hour because it also involves JSON calls and parsing, whereas just printing the parsed fields to the logs takes a minute or two. My question is: is there a way to optimize the insertion process using Hibernate?
I tried the suggestions from Hibernate batch size confusion, but it still feels very slow.
I tried increasing the batch size.
I tried disabling the second-level cache.
I also flushed and cleared my session at each batch boundary.
I am planning to move to JDBC batch insertions, but I want to try optimizing with Hibernate first.
I hope this can give amateur programmers a general overview of the best practices.

Loading large amounts of data to an Oracle SQL Database

I was wondering if anyone had any experience with what I am about to embark on. I have several CSV files, each around a GB in size, and I need to load them into an Oracle database. While most of my work after loading will be read-only, I will need to load updates from time to time. Basically I just need a good tool for loading several rows of data at a time into my DB.
Here is what I have found so far:
I could use SQL*Loader to do a lot of the work
I could use Bulk-Insert commands
Some sort of batch insert.
Using a prepared statement somehow might be a good idea. I guess I was wondering what everyone thinks is the fastest way to get this insert done. Any tips?
I would be very surprised if you could roll your own utility that will outperform SQL*Loader Direct Path Loads. Oracle built this utility for exactly this purpose - the likelihood of building something more efficient is practically nil. There is also the Parallel Direct Path Load, which allows you to have multiple direct path load processes running concurrently.
From the manual:
Instead of filling a bind array buffer and passing it to the Oracle database with a SQL INSERT statement, a direct path load uses the direct path API to pass the data to be loaded to the load engine in the server. The load engine builds a column array structure from the data passed to it.
The direct path load engine uses the column array structure to format Oracle data blocks and build index keys. The newly formatted database blocks are written directly to the database (multiple blocks per I/O request using asynchronous writes if the host platform supports asynchronous I/O).
Internally, multiple buffers are used for the formatted blocks. While one buffer is being filled, one or more buffers are being written if asynchronous I/O is available on the host platform. Overlapping computation with I/O increases load performance.
There are cases where Direct Path Load cannot be used.
With that amount of data, you'd better be sure of your backing store - the free space on the datafile (DBF) disks.
sqlldr is script driven and very efficient, generally more efficient than a SQL script.
The only thing I wonder about is the magnitude of the data. I personally would consider running several to many sqlldr processes, assigning each one a subset of the data, and letting the processes run in parallel.
You said you wanted to load a few records at a time? That may take a lot longer than you think. Did you mean a few files at a time?
You may be able to create an external table on the CSV files and load them by SELECTing from the external table into another table. I'm not sure whether this method will be quicker, but it might save you the hassle of getting SQL*Loader to work, especially when you have criteria for UPDATEs.

Database Disk Queue too high, what can be done?

I have a problem with a large database I am working with, which resides on a single drive. This database contains around a dozen tables; the two main ones are around 1GB each and cannot be made smaller. My problem is that the disk queue for the database drive is at 96% to 100% even when the website that uses the DB is idle. What optimisation could be done, or what is the source of the problem? The DB on disk is 16GB in total and almost all the data is required - transaction data, customer information and stock details.
What are the reasons why the disk queue is always high no matter the website traffic?
What can be done to help improve performance on a database this size?
Any suggestions would be appreciated!
The database is an MS SQL 2000 database running on Windows Server 2003 and, as stated, 16GB in size (data file size on disk).
Thanks
Well, how much memory do you have on the machine? If you can't store the pages in memory, SQL Server is going to have to go to the disk to get its information. If your memory is low, you might want to consider upgrading it.
Since the database is so big, you might want to consider adding two separate physical drives and then putting the transaction log on one drive and partitioning some of the other tables onto the other drive (you have to do some analysis to see what the best split between tables is).
In doing this, you are allowing IO accesses to occur in parallel, instead of in serial, which should give you some more performance from your DB.
Before buying more disks and shifting things around, you might also update statistics and check your queries - if you are doing lots of table scans and so forth you will be creating unnecessary work for the hardware.
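For example (a sketch, with a placeholder connection string), statistics for all user tables can be refreshed with the built-in sp_updatestats procedure, ideally during a quiet period:

```csharp
using System.Data;
using System.Data.SqlClient;

class StatsRefresher
{
    public static void Run(string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("sp_updatestats", conn))
        {
            // sp_updatestats refreshes statistics on every user table in the database.
            cmd.CommandType = CommandType.StoredProcedure;
            cmd.CommandTimeout = 0;   // can take a while on a 16GB database
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}
```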
Your database isn't that big after all - I'd first look at tuning your queries. Have you profiled what sort of queries are hitting the database?
If your disk activity is that high while your site is idle, I would look for other processes that might be running that could be affecting it. For example, are you sure there aren't any scheduled backups running? Especially with a large DB, these could run for a long time.
As Mike W pointed out, there is usually a lot you can do with query optimization with existing hardware. Isolate your slow-running queries and find ways to optimize them first. In one of our applications, we spent literally 2 months doing this and managed to improve the performance of the application, and the hardware utilization, dramatically.

Will it be faster to use several threads to update the same database?

I wrote a Java program to add and retrieve data from an MS Access database. At present it goes sequentially through ~200K insert queries in ~3 minutes, which I think is slow. I plan to rewrite it using threads, with 3-4 threads handling different parts of the hundreds of thousands of records. I have a compound question:
Will this help speed up the program because of the divided workload or would it be the same because the threads still have to access the database sequentially?
What strategy do you think would speed up this process (apart from query optimization, which I have already done in addition to using Java's PreparedStatement)?
Don't know. Without knowing more about what the bottleneck is, I can't comment on whether it will make it faster. If the database is the limiter, then chances are more threads will slow it down.
I would dump the Access database to a flat file and then bulk load that file. Bulk loading allows for optimizations which are far, far faster than running multiple insert queries.
First, don't use Access. Move your data anywhere else -- SQL/Server -- MySQL -- anything. The DB engine inside access (called Jet) is pitifully slow. It's not a real database; it's for personal projects that involve small amounts of data. It doesn't scale at all.
Second, threads rarely help.
The JDBC-to-Database connection is a process-wide resource. All threads share the one connection.
"But wait," you say, "I'll create a unique Connection object in each thread."
Noble, but sometimes doomed to failure. Why? Operating System processing between your JVM and the database may involve a socket that's a single, process-wide resource, shared by all your threads.
If you have a single OS-level I/O resource that's shared across all threads, you won't see much improvement. In this case, the ODBC connection is one bottleneck. And MS-Access is the other.
With MS Access as the backend database, you'll probably get better insert performance if you do an import from within MS Access. Another option (since you're using Java) is to directly manipulate the MDB file (if you're creating it from scratch and there are no other concurrent users - which MS Access doesn't handle very well) with a library like Jackcess.
If none of these are solutions for you, then I'd recommend using a profiler on your Java application and see if it is spending most of its time waiting for the database (in which case adding threads probably won't help much) or if it is doing processing and parallelizing will help.
Stimms' bulk load approach will probably be your best bet, but everything is worth trying once. Note that your bottleneck is going to be disk IO, and multiple threads may slow things down. MS Access can also fall apart when multiple users are banging on the file, and that is exactly what your multi-threaded approach will act like (make a backup!). If performance continues to be an issue, consider upgrading to SQL Server Express.
MS Access to SQL Server Migrations docs.
Good luck.
I would agree that dumping Access would be the best first step. Having said that...
In a .NET and SQL environment I have definitely seen threads aid in maximizing INSERT throughputs.
I have an application that accepts asynchronous file drops and then processes them into tables in a database.
I created a loader that parsed the file and placed the data into a queue. The queue was served by one or more threads whose maximum count I could tune with a parameter. I found that even on a single-core CPU with your typical 7200RPM drive, the ideal number of worker threads was 3. It shortened the load time by an almost proportional amount. The key is to balance it such that the CPU bottleneck and the disk I/O bottleneck are balanced.
So in cases where a bulk copy is not an option, threads should be considered.
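A rough sketch of that queue-plus-workers pattern; the record type, table, columns, and worker count below are placeholders:

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Threading.Tasks;

class QueuedLoader
{
    public class Record { public int Id; public string Payload; }

    public static void Load(IEnumerable<Record> parsedRecords, string connectionString, int workerCount = 3)
    {
        // Bounded queue keeps the parser from racing far ahead of the inserts.
        var queue = new BlockingCollection<Record>(boundedCapacity: 10000);

        // Worker tasks drain the queue and insert rows until the producer finishes.
        var workers = new Task[workerCount];
        for (int i = 0; i < workerCount; i++)
        {
            workers[i] = Task.Run(() =>
            {
                using (var conn = new SqlConnection(connectionString))
                {
                    conn.Open();
                    foreach (var record in queue.GetConsumingEnumerable())
                    {
                        using (var cmd = new SqlCommand(
                            "INSERT INTO ImportedRecords (Id, Payload) VALUES (@id, @payload)", conn))
                        {
                            cmd.Parameters.AddWithValue("@id", record.Id);
                            cmd.Parameters.AddWithValue("@payload", record.Payload);
                            cmd.ExecuteNonQuery();
                        }
                    }
                }
            });
        }

        // Producer: the file parser feeds the queue, then signals completion.
        foreach (var record in parsedRecords)
            queue.Add(record);
        queue.CompleteAdding();

        Task.WaitAll(workers);
    }
}
```

Tuning workerCount is the balancing act described above: enough workers to keep the disk busy while the parser runs, but not so many that they contend for it.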
On modern multi-core machines, using multiple threads to populate a database can make a difference. It depends on the database and its hardware. Try it and see.
Just try it and see if it helps. I would guess not because the bottleneck is likely to be in the disk access and locking of the tables, unless you can figure out a way to split the load across multiple tables and/or disks.
IIRC, Access doesn't allow multiple connections to the same file because of the locking policy it uses.
And I totally agree about dumping Access for SQL Server.

How do you speed up CSV file process? (5 million or more records)

I wrote a VB.net console program to process CSV records that come in a text file. I'm using the FileHelpers library
along with Microsoft Enterprise Library 4 to read the records one at a time and insert them into the database.
It took about 3-4 hours to process 5+ million records in the text file.
Is there any way to speed up the process? Has anyone dealt with such a large number of records before, and how would you update those records when there is new data to apply?
edit: Can someone recommend a profiler? I'd prefer open source or free.
read the records one at a time and insert them into the database
Read them in batches and insert them in batches.
Use a profiler - find out where the time is going.
Short of a real profiler, try the following:
Time how long it takes to just read the files line by line, without doing anything with them
Take a sample line, and time how long it takes just to parse it and do whatever processing you need, 5+ million times
Generate random data and insert it into the database, and time that
My guess is that the database will be the bottleneck. You should look into doing a batch insert - if you're inserting just a single record at a time, that's likely to be a lot slower than batch inserting.
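As a sketch of what batching might look like (the CSV layout, table, and column names are placeholders): read the file in chunks and commit each chunk in one transaction, instead of paying a round trip and a commit per row.

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;
using System.IO;

class BatchedCsvLoader
{
    public static void Load(string csvPath, string connectionString, int batchSize = 1000)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            var batch = new List<string[]>(batchSize);

            foreach (var line in File.ReadLines(csvPath))
            {
                batch.Add(line.Split(','));
                if (batch.Count == batchSize)
                {
                    InsertBatch(conn, batch);
                    batch.Clear();
                }
            }
            if (batch.Count > 0)
                InsertBatch(conn, batch);   // remainder
        }
    }

    private static void InsertBatch(SqlConnection conn, List<string[]> rows)
    {
        using (var tx = conn.BeginTransaction())
        {
            using (var cmd = new SqlCommand(
                "INSERT INTO CsvRecords (Col1, Col2) VALUES (@c1, @c2)", conn, tx))
            {
                var p1 = cmd.Parameters.Add("@c1", System.Data.SqlDbType.NVarChar, 255);
                var p2 = cmd.Parameters.Add("@c2", System.Data.SqlDbType.NVarChar, 255);
                foreach (var row in rows)
                {
                    p1.Value = row[0];
                    p2.Value = row[1];
                    cmd.ExecuteNonQuery();   // one commit per batch, not per row
                }
            }
            tx.Commit();
        }
    }
}
```

SqlBulkCopy (mentioned in other answers) will usually beat even this, but batching alone already removes most of the per-row overhead.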
I have done many applications like this in the past and there are a number of ways that you can look at optimizing.
Ensure that the code you are writing is properly managing memory; with something like this, one little mistake can slow the process to a crawl.
Think about making the database calls async, as they may be the bottleneck, so a bit of queuing could be OK.
Consider dropping indexes, doing the import, and then re-creating the indexes.
Consider using SSIS to do the import; it is already optimized and does this kind of thing out of the box.
Why not just insert the data directly into the SQL Server database using Microsoft SQL Server Management Studio or the command line - SQLCMD? It does know how to process CSV files.
The BulkInsert property should be set to True on your database.
If it has to be modified, you can insert it into a temporary table and then apply your modifications with T-SQL.
Your best bet would be to try using a profiler with a relatively small sample -- this could identify where the actual hold-ups are.
Load it into memory and then insert into the DB. 5 million rows shouldn't tax your memory. The problem is you are essentially thrashing your disk--both reading the CSV and writing to the DB.
I'd speed it up the same way I'd speed anything up: by running it through a profiler and figuring out what's taking the longest.
There is absolutely no way to guess what the bottleneck here is -- maybe there is a bug in the code which parses the CSV file, resulting in polynomial runtimes? Maybe there is some very complex logic used to process each row? Who knows!
Also, for the "record", 5 million rows isn't all THAT heavy -- an off-the-top-of-my-head guess says that a reasonable program should be able to churn through that in half an hour, and a good program in much less.
Finally, if you find that the database is your bottleneck, check to see if a transaction is being committed after each insert. That can lead to some nontrivial slowdown...
Not sure what you're doing with them, but have you considered Perl? I recently rewrote a VB script that was doing something similar - processing thousands of records - and the time went from about an hour for the VB script to about 15 seconds for Perl.
After reading all records from the file (I would read the entire file in one pass, or in blocks), use the SqlBulkCopy class to import your records into the DB. SqlBulkCopy is, as far as I know, the fastest approach to importing a block of records. There are a number of tutorials online.
As others have suggested, profile the app first.
That said, you will probably gain from doing batch inserts. This was the case for one app I worked with, and it had a high impact.
Consider that 5 million round trips are a lot, especially if each of them is a simple insert.
In a similar situation we saw considerable performance improvement by switching from one-row-at-a-time inserts to using the SqlBulkCopy API.
There is a good article here.
You need to bulk load the data into your database, assuming it has that facility. In SQL Server you'd be looking at BCP, DTS or SSIS - BCP is the oldest but maybe the fastest. OTOH, if that's not possible in your DB, turn off all indexes before doing the run. I'm guessing it's the DB that's causing problems, not the .NET code.