How do you speed up CSV file process? (5 million or more records)

I wrote a VB.net console program to process CSV records that come in a text file. I'm using the FileHelpers library
along with MSFT Enterprise Library 4 to read the records one at a time and insert them into the database.
It took about 3 - 4 hours to process 5+ million records in the text file.
Is there any way to speed up the process? Has anyone dealt with such a large number of records before, and how would you update such records if there is new data to be updated?
edit: Can someone recommend a profiler? Preferably open source or free.

"read the record one at the time and insert into the database"
Read them in batches and insert them in batches.
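A minimal sketch of the batching idea, written in Python with the standard-library sqlite3 module as a stand-in for the VB.NET and SQL Server setup in the question. The table name `records`, its columns, and the batch size of 1000 are assumptions made for illustration only.

```python
import sqlite3
from itertools import islice


def insert_in_batches(conn, rows, batch_size=1000):
    """Insert rows batch by batch, with one transaction per batch."""
    it = iter(rows)
    total = 0
    while True:
        batch = list(islice(it, batch_size))  # pull the next chunk from the source
        if not batch:
            break
        with conn:  # commits once per batch instead of once per row
            conn.executemany(
                "INSERT INTO records (id, name) VALUES (?, ?)", batch
            )
        total += len(batch)
    return total


# Hypothetical table and generated rows, standing in for the parsed CSV.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER, name TEXT)")
inserted = insert_in_batches(conn, ((i, f"row{i}") for i in range(2500)))
print(inserted)
```

Because the rows come from an iterator, the whole file never has to sit in memory at once; only one batch does.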

Use a profiler - find out where the time is going.
Short of a real profiler, try the following:
Time how long it takes to just read the files line by line, without doing anything with them
Take a sample line, and time how long it takes just to parse it and do whatever processing you need, 5+ million times
Generate random data and insert it into the database, and time that
My guess is that the database will be the bottleneck. You should look into doing a batch insert - if you're inserting just a single record at a time, that's likely to be a lot slower than batch inserting.
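The three measurements above could be sketched roughly like this. It is Python with sqlite3 standing in for the real database; the sample line, the table layout, and the row count of 50,000 are all made up for illustration, and the numbers it prints only mean something relative to each other on your own machine.

```python
import csv
import os
import random
import sqlite3
import tempfile
import time

SAMPLE_LINE = "42,Jane Doe,2009-01-15,199.95"  # hypothetical record


def profile_stages(n=50_000):
    timings = {}

    # Stage 1: just read the file line by line, doing nothing with the lines.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.csv")
        with open(path, "w", newline="") as f:
            for _ in range(n):
                f.write(SAMPLE_LINE + "\n")
        start = time.perf_counter()
        with open(path, newline="") as f:
            line_count = sum(1 for _ in f)
        timings["read"] = time.perf_counter() - start

    # Stage 2: parse one sample line n times, with no I/O involved.
    start = time.perf_counter()
    for _ in range(n):
        fields = next(csv.reader([SAMPLE_LINE]))
    timings["parse"] = time.perf_counter() - start

    # Stage 3: insert generated data, with no file reading or parsing.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, amount REAL)")
    rows = [(i, random.random()) for i in range(n)]
    start = time.perf_counter()
    with conn:
        conn.executemany("INSERT INTO t (id, amount) VALUES (?, ?)", rows)
    timings["insert"] = time.perf_counter() - start
    inserted = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    conn.close()

    return timings, line_count, len(fields), inserted


timings, line_count, field_count, inserted = profile_stages()
for stage, seconds in timings.items():
    print(f"{stage}: {seconds:.3f}s")
```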

I have done many applications like this in the past and there are a number of ways that you can look at optimizing.
Ensure that the code you are writing is managing memory properly. With something like this, one little mistake can slow the process to a crawl.
Think about making the database calls async, as the database may be the bottleneck, so a bit of queuing could be OK.
Consider dropping indexes, doing the import, then re-creating the indexes.
Consider using SSIS to do the import; it is already optimized and does this kind of thing out of the box.
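The index suggestion in the list above is usually read as: drop the indexes, run the import, then rebuild the indexes once at the end. A rough sketch in Python with sqlite3 as a stand-in for SQL Server; the table `records` and the index name `idx_records_name` are hypothetical.

```python
import sqlite3


def import_without_indexes(conn, rows):
    # Drop the index so it is not updated on every inserted row.
    conn.execute("DROP INDEX IF EXISTS idx_records_name")
    with conn:  # single transaction for the whole import
        conn.executemany("INSERT INTO records (id, name) VALUES (?, ?)", rows)
    # Rebuild the index once, after all rows are in place.
    conn.execute("CREATE INDEX idx_records_name ON records (name)")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER, name TEXT)")
conn.execute("CREATE INDEX idx_records_name ON records (name)")
import_without_indexes(conn, [(i, f"name{i}") for i in range(10_000)])

row_count = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
index_names = [
    r[0]
    for r in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
]
print(row_count, index_names)
```

Whether this wins depends on how many rows are already in the table: rebuilding an index over a very large existing table can cost more than maintaining it during a comparatively small import.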

Why not just insert the data directly into the SQL Server database using Microsoft SQL Server Management Studio or the command line (SQLCMD)? It does know how to process CSV files.
BulkInsert property should be set to True on your database.
If it has to be modified, you can insert it into a temporary table and then apply your modifications with T-SQL.
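The temporary-table approach might look something like the following sketch. It uses Python and SQLite's SQL dialect rather than T-SQL, and the table names `staging` and `customers`, the columns, and the cleanup rules (trim names, skip blank ones, convert the amount to a number) are invented for illustration.

```python
import sqlite3


def load_via_staging(conn, raw_rows):
    # Load the raw text values untouched into a temporary staging table.
    conn.execute("CREATE TEMP TABLE staging (name TEXT, amount TEXT)")
    with conn:
        conn.executemany("INSERT INTO staging VALUES (?, ?)", raw_rows)
        # Apply the modifications set-based, in SQL, while copying across.
        conn.execute(
            """
            INSERT INTO customers (name, amount)
            SELECT TRIM(name), CAST(amount AS REAL)
            FROM staging
            WHERE TRIM(name) <> ''
            """
        )
    conn.execute("DROP TABLE staging")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, amount REAL)")
load_via_staging(conn, [(" Ann ", "10.5"), ("", "3"), ("Bob", "2")])
customers = conn.execute(
    "SELECT name, amount FROM customers ORDER BY name"
).fetchall()
print(customers)
```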

Best bet would be to try using a profiler with a relatively small sample -- this could identify where the actual hold-ups are.

Load it into memory and then insert into the DB. 5 million rows shouldn't tax your memory. The problem is you are essentially thrashing your disk--both reading the CSV and writing to the DB.

I'd speed it up the same way I'd speed anything up: by running it through a profiler and figuring out what's taking the longest.
There is absolutely no way to guess what the bottleneck here is -- maybe there is a bug in the code which parses the CSV file, resulting in polynomial runtimes? Maybe there is some very complex logic used to process each row? Who knows!
Also, for the "record", 5 million rows isn't all THAT heavy -- an off-the-top-of-my-head guess says that a reasonable program should be able to churn through that in half an hour, and a good program in much less.
Finally, if you find that the database is your bottleneck, check to see if a transaction is being committed after each insert. That can lead to some nontrivial slowdown...
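To see what a commit after each insert costs, the two strategies can be timed side by side. This is only a sketch in Python with sqlite3; the table `t` and the row count are arbitrary, and with an in-memory database the gap is small. Pointing `sqlite3.connect` at a file on disk, where every commit has to be flushed, makes the difference much more visible.

```python
import sqlite3
import time


def insert_commit_per_row(conn, rows):
    for row in rows:
        conn.execute("INSERT INTO t (id, name) VALUES (?, ?)", row)
        conn.commit()  # one transaction per row


def insert_single_transaction(conn, rows):
    with conn:  # one transaction for all rows, committed on exit
        for row in rows:
            conn.execute("INSERT INTO t (id, name) VALUES (?, ?)", row)


def run(strategy, n=2000):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
    rows = [(i, f"name{i}") for i in range(n)]
    start = time.perf_counter()
    strategy(conn, rows)
    elapsed = time.perf_counter() - start
    count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    conn.close()
    return count, elapsed


per_row_count, per_row_time = run(insert_commit_per_row)
single_count, single_time = run(insert_single_transaction)
print(f"commit per row:     {per_row_time:.4f}s")
print(f"single transaction: {single_time:.4f}s")
```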

Not sure what you're doing with them, but have you considered perl? I recently re-wrote a vb script that was doing something similar - processing thousands of records - and the time went from about an hour for the vb script to about 15 seconds for perl.

After reading all records from the file (I would read the entire file in one pass, or in blocks), use the SqlBulkCopy class to import your records into the DB. SqlBulkCopy is, as far as I know, the fastest approach to importing a block of records. There are a number of tutorials online.

As others have suggested, profile the app first.
That said, you will probably gain from doing batch inserts. This was the case for one app I worked with, and it had a high impact.
Consider that 5 million round trips are a lot, especially if each of them is for a simple insert.

In a similar situation we saw considerable performance improvement by switching from one-row-at-time inserts to using the SqlBulkCopy API.
There is a good article here.

You need to bulk load the data into your database, assuming it has that facility. In SQL Server you'd be looking at BCP, DTS or SSIS - BCP is the oldest but maybe the fastest. OTOH, if that's not possible in your DB, turn off all indexes before doing the run. I'm guessing it's the DB that's causing problems, not the .NET code.

Related

best practises of batch insert in hibernate(large insertions)

I have a job that parses JSON and inserts over 20000 records. My whole application connects to an Oracle DB using Hibernate. The job takes around 1 hour because it also involves JSON calls and JSON parsing, whereas just printing the parsed fields in the logs takes a minute or 2. My question is: is there a way to optimize the insertion process using Hibernate?
I tried suggestions from Hibernate batch size confusion, but it still feels very slow.
I tried increasing the batch size.
I tried disabling the second-level cache.
I also flushed and cleared my session depending on the batch size.
I am planning to move to JDBC batch insertions, but I want to try optimizing with Hibernate first.
I hope this gives most amateur programmers a general overview and helps them with the best practices.

What Database for extensive logfile analysis?

The task is to filter and analyze a huge amount of logfiles (around 8TB) from a finished research project. The idea is to fill a database with the data to be able to run different analysis tasks later.
The values are stored comma separated. In principle the values are tuples of up to 5 values:
id, timestamp, type, v1, v2, v3, v4, v5
In a first try using MySQL I used one table with one log entry per row. So there is no direct relation between the log values. The downside here is slow querying of subsets.
Because there is no relation I looked into alternatives like NoSQL databases, and column based tables like hbase or cassandra seemed to be a perfect fit for this kind of data. But these systems are made for huge distributed systems, which we do not have. In our case the analysis will run on a single machine or perhaps some VMs.
Which kind of database would fit this task? Is it worth to setup a single machine instance with hadoop+hbase... or is this all a bit over-sized?
What database would you choose to do high-performance logfile analysis?
EDIT: Maybe out of my question it is not clear that we cannot spend money for cloud services or new hardware. The Question is if there are benefits in using noSQL approaches instead of mySQL (especially for this data). If there are none, or if they are so small that the effort of setting up a noSQL system is not worth the benefit we can use our ESXi infrastructure and MySQL.
EDIT2: I'm still having the problem here. I did further experiments with MySQL and just inserted a quarter of all available data. The insert has now been running for over 2 days and is not yet finished. Currently there are 2,147,483,647 rows in my single table db. With indexes this takes 211,2 GiB of disk space. And this is just a quarter of all logging data...
A query of the form
SELECT * FROM `table` WHERE `timestamp`>=1342105200000 AND `timestamp`<=1342126800000 AND `logid`=123456 AND `unit`="UNIT40";
takes 761 seconds to complete, in this case returning one row.
There is a combined index on timestamp, logid, unit.
So I think this is not the way to go, because later in analysis I will have to get all entries in a time range and compare the datapoints.
I read about MongoDB and Redis, but the problem with them is that they are in-memory databases.
In the later analysis process there will be a very small amount of concurrent database access. In fact the analysis will be run from one single machine.
I do not need redundancy. I would be able to regenerate the database in case of a failure.
Once the database is completely written, there would also be no need to update or add further rows.
What do you think about alternatives like Redis, MongoDB and so on? If I understand this right, I would need RAM in the dimension of my data...
Is this task even somehow possible with a single node system or with maybe two nodes?

Well, I personally would prefer the faster solution, as you said you need high-performance analysis. The problem is, if you have to set up a whole new system to do so and the performance improvement would be minor in relation to the additional effort you'd need, then stay with SQL.
In our company, we have a quite small database containing not even half a GB of data on a VM. The problem is that as soon as you use a VM, you will have major performance issues; when opening the database on the VM you can go for a coffee in the meantime ;)
But if the time until the database is loaded into cache is not so important, it doesn't matter. It all depends on how much faster you think the new system will be, and how much effort you will have to put into it, but as I said, I'd prefer the faster solution if you have to go for "high-performance analysis".

web application receiving millions of requests and leads to generating millions of row inserts per 30 seconds in SQL Server 2008

I am currently addressing a situation where our web application receives at least a Million requests per 30 seconds. So these requests will lead to generating 3-5 Million row inserts between 5 tables. This is pretty heavy load to handle. Currently we are using multi threading to handle this situation (which is a bit faster but unable to get a better CPU throughput). However the load will definitely increase in future and we will have to account for that too. After 6 months from now we are looking at double the load size we are currently receiving and I am currently looking at a possible new solution that is scalable and should be easy enough to accommodate any further increase to this load.
Currently with multi threading we are making the whole debugging scenario quite complicated and sometimes we are having problem with tracing issues.
FYI we are already utilizing the SQL Bulk Insert/Copy that is mentioned in this previous post
Sql server 2008 - performance tuning features for insert large amount of data
However I am looking for a more capable solution (which I think there should be one) that will address this situation.
Note: I am not looking for any code snippets or code examples. I am just looking for a big picture of a concept that I could possibly use and I am sure that I can take that further to an elegant solution :)
Also the solution should have a better utilization of the threads and processes. And I do not want my threads/processes to even wait to execute something because of some other resource.
Any suggestions will be deeply appreciated.
Update: Not every request will lead to an insert... however most of them will lead to some SQL operation. The application performs different types of transactions and these will lead to a lot of bulk SQL operations. I am more concerned about inserts and updates.
These operations need not be real time; there can be a bit of lag... however processing them in real time would be very helpful.

I think your problem looks more towards getting a better CPU throughput which will lead to a better performance. So I would probably look at something like an Asynchronous Processing where in a thread will never sit idle and you will probably have to maintain a queue in the form of a linked list or any other data structure that will suit your programming model.
The way this would work is your threads will try to perform a given job immediately and if there is anything that would stop them from doing it then they will push that job into the queue and these pushed items will be processed based on how it stores the items in the container/queue.
In your case since you are already using bulk sql operations you should be good to go with this strategy.
lemme know if this helps you.

Can you partition the database so that the inserts are spread around? How is this data used after insert? Is there a natural partition to the data by client or geography or some other factor?
Since you are using SQL Server, I would suggest you get several of the books on high availability and high performance for SQL Server. The internals book might help as well. Amazon has a bunch of these. This is a complex subject and requires too much depth for a simple answer on a bulletin board. But basically there are several keys to high performance design including hardware choices, partitioning, correct indexing, correct queries, etc. To do this effectively, you have to understand in depth what SQL Server does under the hood and how changes can make a big difference in performance.

Since you do not need to have your inserts/updates in real time, you might consider having two databases: one for reads and one for writes. This is similar to having an OLTP db and an OLAP db:
Read Database:
Indexed as much as needed to maximize read performance.
Possibly denormalized if performance requires it.
Not always up to date.
Insert/Update database:
No indexes at all. This will help maximize insert/update performance
Try to normalize as much as possible.
Always up to date.
You would basically direct all insert/update actions to the Insert/Update db. You would then create a publication process that would move data over to the read database at certain time intervals. When I have seen this in the past, the data is usually moved over on a nightly basis when few people will be using the site. There are a number of options for moving the data over, but I would start by looking at SSIS.
This will depend on your ability to do a few things:
have read data be up to one day out of date
complete your nightly Read db update process in a reasonable amount of time.
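A very reduced sketch of such a publication step, in Python with two sqlite3 databases standing in for the two SQL Server databases (where SSIS would normally do this job). The `events` table, its columns, and the "copy everything with a higher id than the read side has seen" rule are assumptions; the sketch only moves new inserts and does not handle updates to existing rows.

```python
import sqlite3


def make_write_db():
    conn = sqlite3.connect(":memory:")
    # No secondary indexes; in SQLite an INTEGER PRIMARY KEY is just the rowid.
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
    return conn


def make_read_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
    # Indexed as needed for the read queries.
    conn.execute("CREATE INDEX idx_events_payload ON events (payload)")
    return conn


def publish(write_conn, read_conn):
    """Copy rows the read database has not seen yet; returns how many moved."""
    last_id = read_conn.execute(
        "SELECT COALESCE(MAX(id), 0) FROM events"
    ).fetchone()[0]
    new_rows = write_conn.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()
    with read_conn:
        read_conn.executemany(
            "INSERT INTO events (id, payload) VALUES (?, ?)", new_rows
        )
    return len(new_rows)


write_db, read_db = make_write_db(), make_read_db()
with write_db:
    write_db.executemany(
        "INSERT INTO events (payload) VALUES (?)", [("a",), ("b",), ("c",)]
    )
first_run = publish(write_db, read_db)   # e.g. the first nightly run
with write_db:
    write_db.executemany("INSERT INTO events (payload) VALUES (?)", [("d",), ("e",)])
second_run = publish(write_db, read_db)  # the next nightly run
third_run = publish(write_db, read_db)   # nothing new to move
read_total = read_db.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(first_run, second_run, third_run, read_total)
```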

Take advantage of multiple cores executing SQL statements

I have a small application that reads XML files and inserts the information on a SQL DB.
There are ~ 300 000 files to import, each one with ~ 1000 records.
I started the application on 20% of the files and it has been running for 18 hours now, I hope I can improve this time for the rest of the files.
I'm not using a multi-thread approach, but since the computer I'm running the process on has 4 cores I was thinking on doing it to get some improvement on the performance (although I guess the main problem is the I/O and not only the processing).
I was thinking of using the BeginExecuteNonQuery() method on the SqlCommand object I create for each insertion, but I don't know if I should limit the max number of simultaneous threads (nor do I know how to do it).
What's your advice to get the best CPU utilization?
Thanks
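On the "how to limit the number of simultaneous threads" part, one common pattern is a fixed-size worker pool. The sketch below is Python, not the .NET `SqlCommand` API from the question; `import_file` is a hypothetical placeholder for "parse one XML file and insert its records", and the counters exist only to show that the cap holds.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4  # e.g. one per core; tune by measurement

active = 0
peak = 0
lock = threading.Lock()


def import_file(file_id):
    """Placeholder for parsing one file and inserting its records."""
    global active, peak
    with lock:
        active += 1
        peak = max(peak, active)  # track the highest concurrency seen
    try:
        time.sleep(0.01)  # stands in for the real parse + insert work
        return file_id, 1000  # pretend each file held 1000 records
    finally:
        with lock:
            active -= 1


# The pool never runs more than MAX_WORKERS imports at the same time;
# the remaining files wait in its internal queue.
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    results = list(pool.map(import_file, range(20)))

total_records = sum(count for _, count in results)
print(len(results), total_records, peak)
```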

If I understand you correctly, you are reading those files on the same machine that runs the database. Although I don't know much about your machine, I bet that your bottleneck is disk IO. This doesn't sound terribly computation intensive to me.

Have you tried using SqlBulkCopy? Basically, you load your data into a DataTable instance, then use the SqlBulkCopy class to load it to SQL Server. Should offer a HUGE performance increase without as much change to your current process as using bcp or another utility.

Look into bulk insert.
Imports a data file into a database table or view in a user-specified format.

Will it be faster to use several threads to update the same database?

I wrote a Java program to add and retrieve data from an MS Access database. At present it goes sequentially through ~200K insert queries in ~3 minutes, which I think is slow. I plan to rewrite it using threads, with 3-4 threads handling different parts of the hundreds of thousands of records. I have a compound question:
Will this help speed up the program because of the divided workload or would it be the same because the threads still have to access the database sequentially?
What strategy do you think would speed up this process (except for query optimization, which I already did in addition to using Java's PreparedStatement)?

Don't know. Without knowing more about what the bottleneck is, I can't say whether it will make it faster. If the database is the limiter then chances are more threads will slow it down.
I would dump the Access database to a flat file and then bulk load that file. Bulk loading allows for optimizations which are far, far faster than running multiple insert queries.

First, don't use Access. Move your data anywhere else -- SQL/Server -- MySQL -- anything. The DB engine inside access (called Jet) is pitifully slow. It's not a real database; it's for personal projects that involve small amounts of data. It doesn't scale at all.
Second, threads rarely help.
The JDBC-to-Database connection is a process-wide resource. All threads share the one connection.
"But wait," you say, "I'll create a unique Connection object in each thread."
Noble, but sometimes doomed to failure. Why? Operating System processing between your JVM and the database may involve a socket that's a single, process-wide resource, shared by all your threads.
If you have a single OS-level I/O resource that's shared across all threads, you won't see much improvement. In this case, the ODBC connection is one bottleneck. And MS-Access is the other.

With MS Access as the backend database, you'll probably get better insert performance if you do an import from within MS Access. Another option (since you're using Java) is to directly manipulate the MDB file (if you're creating it from scratch and there are no other concurrent users - which MS Access doesn't handle very well) with a library like Jackcess.
If none of these are solutions for you, then I'd recommend using a profiler on your Java application and see if it is spending most of its time waiting for the database (in which case adding threads probably won't help much) or if it is doing processing and parallelizing will help.

Stimms' bulk load approach will probably be your best bet, but everything is worth trying once. Note that your bottleneck is going to be disk IO and multiple threads may slow things down. MS Access can also fall apart when multiple users are banging on the file, and that is exactly what your multi-threaded approach will act like (make a backup!). If performance continues to be an issue, consider upgrading to SQL Server Express.
MS Access to SQL Server Migrations docs.
Good luck.

I would agree that dumping Access would be the best first step. Having said that...
In a .NET and SQL environment I have definitely seen threads aid in maximizing INSERT throughputs.
I have an application that accepts asynchronous file drops and then processes them into tables in a database.
I created a loader that parsed the file and placed the data into a queue. The queue was served by one or more threads whose max I could tune with a parameter. I found that even on a single core CPU with your typical 7200RPM drive, the ideal number of worker threads was 3. It shortened the load time an almost proportional amount. The key is to balance it such that the CPU bottleneck and the Disk I/O bottleneck are balanced.
So in cases where a bulk copy is not an option, threads should be considered.

On modern multi-core machines, using multiple threads to populate a database can make a difference. It depends on the database and its hardware. Try it and see.

Just try it and see if it helps. I would guess not because the bottleneck is likely to be in the disk access and locking of the tables, unless you can figure out a way to split the load across multiple tables and/or disks.

IIRC Access doesn't allow multiple connections to the same file because of the locking policy it uses.
And I agree totally about dumping access for sql.
