Best practices for batch inserts in Hibernate (large insertions) - sql

I have a job that parses a JSON feed and inserts over 20,000 records; the whole application talks to an Oracle DB through Hibernate. The job takes around an hour, and that includes the JSON calls and parsing, but just parsing and printing the fields to the logs takes only a minute or two. My question is: is there a way to optimize the insertion process using Hibernate?
I tried the suggestions from Hibernate batch size confusion, but it still feels very slow.
I tried increasing batch size.
I tried disabling second level cache.
I also flushed and cleared my session depending on the batch size
I am planning to move to JDBC batch inserts, but I want to give optimizing with Hibernate a try first.
I hope this also gives amateur programmers a general overview of the best practices.
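For reference, a minimal sketch of the flush-and-clear batching pattern described above, in plain Hibernate (the Record entity, the batch size of 50, and the surrounding class are placeholders rather than code from the actual job):

    import java.util.List;
    import org.hibernate.Session;
    import org.hibernate.SessionFactory;
    import org.hibernate.Transaction;

    public class BatchInserter {

        // Assumed Hibernate settings (property names as in the Hibernate docs):
        //   hibernate.jdbc.batch_size=50
        //   hibernate.order_inserts=true
        //   hibernate.cache.use_second_level_cache=false
        private static final int BATCH_SIZE = 50; // keep in sync with hibernate.jdbc.batch_size

        public void insertAll(SessionFactory sessionFactory, List<Record> records) {
            Session session = sessionFactory.openSession();
            Transaction tx = session.beginTransaction();
            try {
                for (int i = 0; i < records.size(); i++) {
                    session.save(records.get(i));
                    if (i > 0 && i % BATCH_SIZE == 0) {
                        session.flush();  // push the current batch of INSERTs to the JDBC driver
                        session.clear();  // detach the saved entities so the session stays small
                    }
                }
                tx.commit();
            } catch (RuntimeException e) {
                tx.rollback();
                throw e;
            } finally {
                session.close();
            }
        }
    }

One thing worth checking on Oracle: JDBC insert batching only kicks in with a sequence- or table-based ID generator; identity-style generation forces Hibernate to execute each INSERT immediately.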

Related

Web application receiving millions of requests that lead to millions of row inserts per 30 seconds in SQL Server 2008

I am currently addressing a situation where our web application receives at least a million requests per 30 seconds. These requests lead to 3-5 million row inserts across 5 tables, which is a pretty heavy load to handle. We currently use multi-threading to cope (it is a bit faster, but we are unable to get better CPU throughput). The load will definitely increase in the future and we have to account for that too: six months from now we are looking at double the current volume, so I am looking for a new solution that is scalable and can easily accommodate further increases in load.
Multi-threading also makes debugging quite complicated, and we sometimes have problems tracing issues.
FYI, we are already using the SQL Bulk Insert/Copy mentioned in this previous post
Sql server 2008 - performance tuning features for insert large amount of data
However, I am looking for a more capable solution (and I believe there should be one) that will address this situation.
Note: I am not looking for any code snippets or code examples. I am just looking for a big picture of a concept that I could possibly use and I am sure that I can take that further to an elegant solution :)
The solution should also make better use of threads and processes, and I do not want my threads/processes to sit waiting on some other resource before they can execute.
Any suggestions will be deeply appreciated.
Update: not every request leads to an insert, but most of them lead to some SQL operation. The application performs different types of transactions, and these generate a lot of bulk SQL operations. I am most concerned with inserts and updates.
These operations do not need to be real time (a bit of lag is acceptable), although processing them in real time would be very helpful.
Your problem looks more like one of getting better CPU throughput, which will lead to better overall performance. I would look at something like asynchronous processing, where a thread never sits idle; you will probably need to maintain a queue, in the form of a linked list or whatever other data structure suits your programming model.
The way this works is that your threads try to perform a given job immediately; if anything prevents them from doing so, they push that job onto the queue, and the queued items are processed later in whatever order the container/queue stores them.
In your case, since you are already using bulk SQL operations, you should be good to go with this strategy.
Let me know if this helps.
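The original system is .NET, but the queue-plus-worker idea is language-agnostic; as a rough sketch (in Java, to match the Hibernate thread above), with InsertJob, the queue bound, and the batch size all invented for illustration:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class AsyncInsertPipeline {

        // Placeholder for whatever unit of work a request produces.
        public static class InsertJob { /* fields omitted */ }

        private final BlockingQueue<InsertJob> queue = new LinkedBlockingQueue<>(100_000);

        // Request threads call this instead of touching the database directly.
        public void submit(InsertJob job) throws InterruptedException {
            queue.put(job); // blocks only when the queue is full (back-pressure)
        }

        // One or more worker threads run this loop, draining jobs in batches
        // and handing each batch to the bulk insert/copy code already in place.
        public void workerLoop() throws InterruptedException {
            List<InsertJob> batch = new ArrayList<>(1000);
            while (true) {
                batch.add(queue.take());    // wait until at least one job is available
                queue.drainTo(batch, 999);  // grab up to 999 more without blocking
                bulkInsert(batch);
                batch.clear();
            }
        }

        private void bulkInsert(List<InsertJob> batch) {
            // placeholder: hand the batch to the existing SQL bulk insert/copy
        }
    }

The point is that request threads never wait on the database; they only enqueue, and a small pool of workers turns many tiny inserts into a few bulk operations.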
Can you partition the database so that the inserts are spread around? How is this data used after insert? Is there a natural partition of the data by client, geography, or some other factor?
Since you are using SQL Server, I would suggest you get several of the books on high availability and high performance for SQL Server. The internals book might help as well; Amazon has a bunch of these. This is a complex subject that requires too much depth for a simple answer on a bulletin board, but basically there are several keys to high-performance design, including hardware choices, partitioning, correct indexing, correct queries, etc. To do this effectively, you have to understand in depth what SQL Server does under the hood and how changes can make a big difference in performance.
Since you do not need your inserts/updates to be real time, you might consider having two databases: one for reads and one for writes, similar to having an OLTP db and an OLAP db:
Read Database:
Indexed as much as needed to maximize read performance.
Possibly denormalized if performance requires it.
Not always up to date.
Insert/Update database:
No indexes at all. This will help maximize insert/update performance
Try to normalize as much as possible.
Always up to date.
You would basically direct all insert/update actions to the insert/update db. You would then create a publication process that moves data over to the read database at certain time intervals. Where I have seen this done in the past, the data is usually moved over on a nightly basis, when few people are using the site. There are a number of options for moving the data over, but I would start by looking at SSIS.
This will depend on your ability to do a few things:
tolerate read data being up to one day out of date
complete your nightly Read db update process in a reasonable amount of time.

SQL DB performance and repeated queries at short intervals

If a query is constantly sent to a database at short intervals, say every 5 seconds, could the number of reads generated cause problems in terms of performance or availability? If the database is Oracle, are there any tricks that can be used to avoid a performance hit? If the queries are coming from an application, is there a way to reduce any impact through software design?
Unless your query is very intensive or horribly written, it won't cause any noticeable issues running once every few seconds. That's not very often for queries whose run times are generally measured in milliseconds.
You may still want to optimize it, though, simply because there are better ways to do it. With Oracle and ADO.NET you can use an OracleDependency for the command that ran the query the first time, and then subscribe to its OnChange event, which is raised automatically whenever the underlying data changes in a way that would change the query results.
It depends on the query. I assume the reason you want to execute it periodically is that the data being returned changes frequently. If that's the case, then application-level caching is obviously not an option.
Past that, is this query "big" in terms of the number of rows returned, tables joined, data aggregated / calculated? If so, it could be a problem if:
You are querying faster than it takes to execute the query. If you are calling it once a second, but it takes 2 seconds to run, that's going to become a problem.
If the query is touching a lot of data and you have a lot of other queries accessing the same tables, you could run into lock escalation issues.
As with most performance questions, the only real answer is to test. In this case test with realistic data in the DB and run this query concurrent with the other query load you expect on the system.
Along the lines of Samuel's suggestion, Oracle provides facilities in JDBC to do database change notification so that your application can subscribe to changes in the underlying data rather than re-running the query every few seconds. If the data is changing less frequently than you're running the query, this can be a major performance benefit.
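To make that concrete, a rough sketch of registering for change notification through the Oracle JDBC driver (oracle.jdbc.dcn); it assumes the ojdbc driver, the CHANGE NOTIFICATION privilege, and an invented orders table:

    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.Properties;
    import oracle.jdbc.OracleConnection;
    import oracle.jdbc.OracleStatement;
    import oracle.jdbc.dcn.DatabaseChangeEvent;
    import oracle.jdbc.dcn.DatabaseChangeListener;
    import oracle.jdbc.dcn.DatabaseChangeRegistration;

    public class ChangeNotificationExample {

        // conn is an already open connection obtained from the Oracle JDBC driver.
        public void register(OracleConnection conn) throws Exception {
            Properties options = new Properties();
            options.setProperty(OracleConnection.DCN_NOTIFY_ROWIDS, "true");

            DatabaseChangeRegistration reg = conn.registerDatabaseChangeNotification(options);
            reg.addListener(new DatabaseChangeListener() {
                @Override
                public void onDatabaseChangeNotification(DatabaseChangeEvent event) {
                    // Re-run the query (or refresh a cache) only when the data actually changed.
                    System.out.println("Change received: " + event);
                }
            });

            // Any query executed on a statement tied to the registration adds its
            // tables to the set being watched.
            Statement stmt = conn.createStatement();
            ((OracleStatement) stmt).setDatabaseChangeRegistration(reg);
            try (ResultSet rs = stmt.executeQuery("SELECT id, status FROM orders")) {
                while (rs.next()) { /* consume the initial result set */ }
            }
            stmt.close();
            // When no longer needed: conn.unregisterDatabaseChangeNotification(reg);
        }
    }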
Another option would be to use Oracle TimesTen as an in memory cache of the data on the middle tier machine(s). That will reduce the network round-trips and it will go through a very optimized retrieval path.
Finally, I'd take a look at using the query result cache to have Oracle cache the results.

NHibernate: Batching and StatelessSession

I was experimenting with just setting the batch size in the config file, and I see a visible benefit in using it: inserting 25,000 entries takes less time than without batching. My question is, what are the contraindications, or dangers, of using batching? As far as I can see there are only benefits to setting a batch size and activating it.
Another question is regarding StatelessSession. I was also testing this, and I've noticed that when I do a scope.Insert it takes more time than doing a scope.Save on a regular Session, but when I do a commit it's lightning fast. Is there any reason for an Insert from a StatelessSession to take more time than a Save from a regular Session?
Thanks in advance
I can only speak to the first issue. A possible downside of a large batch size is the amount of SQL being sent across the wire in one go.
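The question is about NHibernate, but since the main thread here is Java Hibernate, here is the analogous stateless pattern on that side (NHibernate's IStatelessSession behaves much the same); the Record entity is a placeholder:

    import java.util.List;
    import org.hibernate.SessionFactory;
    import org.hibernate.StatelessSession;
    import org.hibernate.Transaction;

    public class StatelessBatchInsert {

        public void insertAll(SessionFactory sessionFactory, List<Record> records) {
            StatelessSession session = sessionFactory.openStatelessSession();
            Transaction tx = session.beginTransaction();
            try {
                for (Record r : records) {
                    // insert() goes straight to the database; there is no first-level
                    // cache to fill up, so no flush()/clear() bookkeeping is needed.
                    session.insert(r);
                }
                tx.commit();
            } catch (RuntimeException e) {
                tx.rollback();
                throw e;
            } finally {
                session.close();
            }
        }
    }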

Take advantage of multiple cores executing SQL statements

I have a small application that reads XML files and inserts the information into a SQL DB.
There are ~ 300 000 files to import, each one with ~ 1000 records.
I started the application on 20% of the files and it has been running for 18 hours now; I hope I can improve on this time for the rest of the files.
I'm not using a multi-threaded approach, but since the computer I'm running the process on has 4 cores, I was thinking of doing so to get some performance improvement (although I guess the main problem is the I/O, not just the processing).
I was thinking of using the BeginExecuteNonQuery() method on the SqlCommand object I create for each insertion, but I don't know if I should limit the maximum number of simultaneous threads (nor do I know how to do it).
What's your advice to get the best CPU utilization?
Thanks
If I understand you correctly, you are reading those files on the same machine that runs the database. Although I don't know much about your machine, I bet that your bottleneck is disk IO. This doesn't sound terribly computation intensive to me.
Have you tried using SqlBulkCopy? Basically, you load your data into a DataTable instance, then use the SqlBulkCopy class to load it to SQL Server. Should offer a HUGE performance increase without as much change to your current process as using bcp or another utility.
Look into bulk insert.
Imports a data file into a database table or view in a user-specified format.

How do you speed up CSV file process? (5 million or more records)

I wrote a VB.NET console program to process CSV records that come in a text file. I'm using the FileHelpers library along with MSFT Enterprise Library 4 to read the records one at a time and insert them into the database.
It took about 3-4 hours to process the 5+ million records in the text file.
Is there any way to speed up the process? Has anyone dealt with such a large number of records before, and how would you update those records if there is new data to apply?
Edit: can someone recommend a profiler? I'd prefer open source or free.
read the records one at a time and insert them into the database
Read them in batches and insert them in batches.
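The question is VB.NET, but the pattern is the same in any language. As a rough sketch in plain JDBC (which is also the fallback the Hibernate poster at the top mentioned), with the connection URL, table, and columns invented and only naive comma-splitting for the CSV:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class CsvBatchLoader {

        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:sqlserver://dbhost:1433;databaseName=Imports", "user", "pass");
                 PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO imported_rows (col_a, col_b) VALUES (?, ?)");
                 BufferedReader reader = new BufferedReader(new FileReader("input.csv"))) {

                conn.setAutoCommit(false);          // one commit per batch, not per row
                int count = 0;
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] fields = line.split(",");   // naive CSV parsing, illustration only
                    ps.setString(1, fields[0]);
                    ps.setString(2, fields[1]);
                    ps.addBatch();
                    if (++count % 1000 == 0) {
                        ps.executeBatch();          // one round trip for 1,000 rows
                        conn.commit();
                    }
                }
                ps.executeBatch();                  // flush the final partial batch
                conn.commit();
            }
        }
    }

Each executeBatch() turns a thousand single-row round trips into one, which is usually where most of the time goes.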
Use a profiler - find out where the time is going.
Short of a real profiler, try the following:
Time how long it takes to just read the files line by line, without doing anything with them
Take a sample line, and time how long it takes just to parse it and do whatever processing you need, 5+ million times
Generate random data and insert it into the database, and time that
My guess is that the database will be the bottleneck. You should look into doing a batch insert - if you're inserting just a single record at a time, that's likely to be a lot slower than batch inserting.
I have done many applications like this in the past, and there are a number of things you can look at to optimize.
Ensure that the code you are writing manages memory properly; with something like this, one little mistake can slow the process to a crawl.
Think about making the database calls asynchronous, as they may be the bottleneck, so a bit of queuing could be OK.
Consider dropping the indexes, doing the import, then re-creating the indexes.
Consider using SSIS to do the import; it is already optimized and does this kind of thing out of the box.
Why not just insert the data directly into the SQL Server database using Microsoft SQL Server Management Studio or the command line (SQLCMD)? It knows how to process CSV files.
The BulkInsert property should be set to True on your database.
If the data has to be modified, you can insert it into a temporary table and then apply your modifications with T-SQL.
Your best bet would be to try using a profiler with a relatively small sample -- this could identify where the actual hold-ups are.
Load it into memory and then insert into the DB. 5 million rows shouldn't tax your memory. The problem is you are essentially thrashing your disk--both reading the CSV and writing to the DB.
I'd speed it up the same way I'd speed anything up: by running it through a profiler and figuring out what's taking the longest.
There is absolutely no way to guess what the bottleneck here is -- maybe there is a bug in the code which parses the CSV file, resulting in polynomial runtimes? Maybe there is some very complex logic used to process each row? Who knows!
Also, for the "record", 5 million rows isn't all THAT heavy -- an off-the-top-of-my-head guess says that a reasonable program should be able to churn through that in half an hour, and a good program in much less.
Finally, if you find that the database is your bottleneck, check to see if a transaction is being committed after each insert. That can lead to some nontrivial slowdown...
Not sure what you're doing with them, but have you considered Perl? I recently rewrote a VB script that was doing something similar - processing thousands of records - and the time went from about an hour for the VB script to about 15 seconds for Perl.
After reading all the records from the file (I would read the entire file in one pass, or in blocks), use the SqlBulkCopy class to import your records into the DB. SqlBulkCopy is, as far as I know, the fastest approach to importing a block of records. There are a number of tutorials online.
As others have suggested, profile the app first.
That said, you will probably gain from doing batch inserts. This was the case for one app I worked with, and the impact was significant.
Consider that 5 million round trips is a lot, especially if each of them is a simple insert.
In a similar situation we saw considerable performance improvement by switching from one-row-at-time inserts to using the SqlBulkCopy API.
There is a good article here.
You need to bulk load the data into your database, assuming it has that facility. In SQL Server you'd be looking at BCP, DTS, or SSIS - BCP is the oldest but maybe the fastest. On the other hand, if that's not possible in your DB, turn off all indexes before doing the run. I'm guessing it's the DB that's causing problems, not the .NET code.