I have an object with a cascaded list which is mapped in the following way:
HasMany(x => x.Products).Cascade.AllDeleteOrphan(); //.BatchSize(10000);
After adding 20,000 products to the list, the commit takes more than 30 seconds (while it should take at most about 3 seconds).
What I need is a kind of bulk insert. I could follow this approach: Speed up bulk insert operations with NHibernate. I know that solution uses the StatelessSession, but my hope is to configure these things in my mapping, add objects directly to the list in my entity, and let NHibernate take care of the rest. Setting the BatchSize on the mapping of the list seems to have no effect.
Is there any way to accomplish this task in an acceptable time?
I think that the batch size in the mapping is only related to fetching. You can try using this setting in your NHibernate config:
<property name="adonet.batch_size">250</property>
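If you build the NHibernate Configuration in code instead of XML, the same setting can be applied programmatically; a minimal sketch, assuming you have access to the Configuration object before the session factory is built:
// Equivalent of <property name="adonet.batch_size">250</property>;
// Environment.BatchSize is NHibernate's constant for "adonet.batch_size".
var cfg = new NHibernate.Cfg.Configuration();
cfg.Configure(); // read the usual hibernate.cfg.xml / app.config settings
cfg.SetProperty(NHibernate.Cfg.Environment.BatchSize, "250");
var sessionFactory = cfg.BuildSessionFactory();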
The only way to really speed things up is to use a stateless session
(read this: Inserts of stateless session of NHibernate are slow); a bare-bones sketch follows below.
Also, set the following to make it even faster:
cfg.AutoCommentSql = false;
cfg.LogFormattedSql = false;
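A stateless-session insert looks roughly like this (sketch only, assuming an existing sessionFactory, a products collection, and the types from the NHibernate namespace; note that a stateless session bypasses the first-level cache, events and cascades, so the children are inserted explicitly):
// IStatelessSession skips change tracking and cascades; combined with
// adonet.batch_size this keeps round trips and memory usage low.
using (IStatelessSession session = sessionFactory.OpenStatelessSession())
using (ITransaction tx = session.BeginTransaction())
{
    foreach (var product in products)
    {
        session.Insert(product);
    }
    tx.Commit();
}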
I have a table with 10 columns, and in that table I have thousands/millions of rows.
In some scenarios I want to update more than 10K records at a time. Currently my code works sequentially, like:
for i in (primary key ids for all records to be updated)
    executeupdate(i)
What I thought is: instead of running the same query 10K times, I will put all the ids in a string and run a single update query, like:
executeupdate(all ids)
The actual DB queries could be like this:
suppose I have primary key ids like
10001,10002,10003,10004,10005
In the first case my queries will be:
update tab1 set status="xyz" where Id="10001"
update tab1 set status="xyz" where Id="10002"
update tab1 set status="xyz" where Id="10003"
update tab1 set status="xyz" where Id="10004"
update tab1 set status="xyz" where Id="10005"
and my bulk update query will be:
update tab1 set status="xyz" where id in ("10001","10002","10003","10004","10005")
So my question is: will I get any performance improvement (in execution time) by doing a bulk update,
or will the total execution time be the same, since an index scan and an update will happen for each record either way?
Note: I am using DB2 9.5 as the database.
Thanks.
In general, a "bulk" update will be faster, regardless of database. Of course, you can test the performance of the two, and report back.
Each call to update requires a bunch of overhead, in terms of processing the query, setting up locks on tables/pages/rows. Doing a single update consolidates this overhead.
The downside to a single update is that it might be faster overall, but it might lock underlying resources for longer periods of time. For instance, the single updates might take 10 milliseconds each, for an elapsed time of 10 seconds for 1,000 of them. However, no resource is locked for more than 10 milliseconds. The bulk update might take 5 seconds, but the resources would be locked for more of this period.
To speed these updates, be sure that id is indexed.
I should note that this is a general principle. I have not specifically tested single versus multiple update performance on DB2.
You will definitely see a performance improvement, because you will reduce the number of roundtrips.
However, this approach does not scale very well; thousands of IDs in one statement could get a bit tricky. Also, there is a limit on the size of your query (it could be 64k). You could consider 'paging' through your table and updating, say, 100 records per update statement; a rough sketch follows.
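Something along these lines, using plain ADO.NET (sketch only; it assumes an open IDbConnection to DB2, an ids list, and a provider that accepts named parameter markers like @p0; switch to positional '?' markers if yours does not):
// Pages through the ids and issues one UPDATE ... WHERE id IN (...) per chunk.
// Requires System.Linq for Skip/Take/Select.
const int pageSize = 100;
for (int offset = 0; offset < ids.Count; offset += pageSize)
{
    var page = ids.Skip(offset).Take(pageSize).ToList();
    using (var cmd = connection.CreateCommand())
    {
        var markers = string.Join(",", page.Select((_, i) => "@p" + i));
        cmd.CommandText = "UPDATE tab1 SET status = 'xyz' WHERE id IN (" + markers + ")";
        for (int i = 0; i < page.Count; i++)
        {
            var p = cmd.CreateParameter();
            p.ParameterName = "@p" + i;
            p.Value = page[i];
            cmd.Parameters.Add(p);
        }
        cmd.ExecuteNonQuery();
    }
}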
I came here with the same question a week back. Then I faced a situation where I had to update a table with around 3500 rows in a MySQL database through JDBC.
I updated the same table twice: once through a for loop, by iterating through a collection of objects, and once using a bulk update query. Here are my findings:
When I updated the data in the database through iteration, it took around 7.945 seconds to execute completely.
When I came up with a rather gigantic (where 'gigantic' means 183 pages long) update query and executed it, it took around 2.24 seconds to complete the update process.
Clearly, the bulk update wins by a huge margin.
Why this difference?
To answer this, let's see how a query actually gets executed in a DBMS.
Unlike in procedural languages, you tell the DBMS what to do, but not how to do it. The DBMS then does the following:
Syntax checking, more commonly called 'parsing', which comprises steps like lexical analysis, syntactic analysis and semantic parsing.
A series of optimizations (although what exactly 'optimization' covers seems to vary from product to product; at least, that's what I learned from reading around, and I don't have deep knowledge of it).
Execution.
Now, when you update a table in the database row by row, each of the queries you execute goes through parsing, optimization and execution. If instead you write a loop to build one rather long query and then execute it, it is parsed only once. The amount of time you save by using a batch update in place of the iterative approach increases almost linearly with the number of rows you update.
A few tips that might come in handy while updating data in your database:
It is always good practice to reference indexed columns in any query you write.
Try to use integers or numbers rather than strings for sorting or searching data in the database; your server is far more comfortable comparing two numbers than two strings.
Avoid views and the 'IN' clause; they make your task easier, but they slow down your database. Use joins instead.
If you are using .NET (and there's probably a similar option in other languages like Java), there is an option you can use on your DB2Connection class called BeginChain, which will greatly improve performance.
Basically, when you have the chain option activated, your DB2 client will keep all of the commands in a queue. When you call EndChain, the queue will be sent to the server at once, and processed at one time.
The documentation says that this should perform much better than non-chained UPDATE/INSERT/DELETEs (and this is what we've seen in my shop), but there are some differences you might need to be aware of:
No exceptions will be thrown on individual statements. They will all be batched up in one DB2Exception, which will contain multiple errors in the DB2Error property.
ExecuteNonQuery will return -1 when chaining is active.
Additionally, performance can be improved further by using a query with parameter markers instead of separate individual queries (assuming status can change as well; otherwise you might just use a literal):
UPDATE tab1
SET status = #status
WHERE id = #id
Edit for comment: I'm not sure if the confusion is in using Parameter Markers (which are just placeholders for values in a query, see the link for more details), or in the actual usage of chaining. If it is the second, then here is some example code (I didn't verify that it works, so use at your own risk :)):
//Below is a function that returns an open DB2Connection
//object. How you get one can vary by shop, so use whatever you normally do.
using (var conn = (DB2Connection) GetConnection())
{
    using (var trans = conn.BeginTransaction())
    {
        var sb = new StringBuilder();
        sb.AppendLine("UPDATE tab1 ");
        sb.AppendLine("   SET status = 'HISTORY' ");
        sb.AppendLine(" WHERE id = #id");

        trans.Connection.BeginChain();
        using (var cmd = trans.Connection.CreateCommand())
        {
            cmd.CommandText = sb.ToString();
            cmd.Transaction = trans;

            foreach (var id in ids)
            {
                cmd.Parameters.Clear();
                cmd.Parameters.Add("#id", id);
                cmd.ExecuteNonQuery();
            }
        }
        trans.Connection.EndChain();

        trans.Commit();
    }
}
One other aspect I would like to point out is the commit interval. If a single update statement updates a few hundred thousand rows, the transaction log also grows accordingly, and the statement can become slower. I have seen a reduction in total time when using ETL tools like Informatica, which fired sets of per-record update statements followed by a commit, compared to a single conditional update statement that does it all in one go. This was counter-intuitive for me.
As per the performance tip for the Rhom API of Rhomobile,
we should prepare the whole data set first and then call create/update_attributes once, for better performance over preparing a single record and calling create inside a loop.
As far as I know, the create method takes the object for a single record, like this:
#account = Account.create(
{"name" => "some new record", "industry" => "electronics"}
)
So I wonder: how do I create/update multiple records in a single call?
Thanks in advance.
First, I have no idea how much this will actually affect performance, whether positively or negatively, and have never measured it.
That said, you can wrap all the CRUD calls in a transaction, to minimise the DB connections opened and closed. This can also help you with maintaining referential integrity, by rolling back changes if some record is causing a problem with your new dataset.
# Load all DB Models, to ensure they are available before first time import
Rho::RHO.load_all_sources();
# Get instance of DB to work transactions with
db = ::Rho::RHO.get_db_partitions()['local'] # Get reference to model db
db.start_transaction() # BEGIN transaction
... Do all your create/update/deletes
if (was_import_successful)
  db.commit     # COMMIT transaction
else
  db.rollback() # ROLLBACK transaction
end
Using Rhom, you can still write SQL queries for the underlying SQLite engine. But you need to understand which table format you're using.
The default PropertyBag data models are all stored as a key-value store in a single table; if you're looking for maximum performance, you'd better switch to FixedSchema data models. In that case you lose some flexibility, but you gain performance and save some space.
My suggestion is to use transactions, as you're already doing, switch to FixedSchema data models and see if that is fast enough. If you really need to increase the speed, maybe you can achieve what you want in a different way, such as importing a SQLite database created on the server side.
This is the method that RhoConnect uses for the bulk synchronization.
If I have an entity Foo which I'm persisting using NHibernate and I retrieve it using Linq like so:
var foos = FooRepository.GetFoos.Where(x => x.LikesWearingFunnyHats == true);
I then dirty each Foo like so:
foo.LikesWearingFunnyHats = false;
And tell the repository to save it:
FooRepository.Save(foo);
I'm currently finding that the Foos are NOT being saved. Is this because I'm retrieving them using Linq rather than by ID or through an association?
It shouldn't be because you're using Linq. The Linq expression tree is just translated into the same ANTLR primitives that an IQuery or HQL string would be, and from there to SQL.
Make sure you're flushing the Session, and/or committing the Transaction associated with that Session, after completing a persistence operation. You should understand that NHibernate provides a layer of separation between you and the database, and is designed to decide when to send updates to the DB, to economize on round trips to the DB server. Usually it does so in "batches"; it'll collect ten or twenty statements before pushing them as one batch to SQL Server. If you're doing smaller units of work than that at one time, you must override its decision to hold off on sending the update, by forcing the session to perform the SQL update with the Flush() method.
It would be best to make this externally controllable, by exposing methods of the Session or its containing Repository to create, commit and rollback transactions within the Session.
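For example, a minimal sketch of that pattern (illustrative names only; it assumes the repository exposes the underlying NHibernate ISession as session):
// An explicit transaction around the unit of work; Commit() flushes the
// session (with the default FlushMode) and sends the UPDATEs to the database.
using (ITransaction tx = session.BeginTransaction())
{
    foreach (var foo in foos)
    {
        foo.LikesWearingFunnyHats = false;
        session.SaveOrUpdate(foo);
    }
    tx.Commit();
}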
Why do you think it is not actually saved? Try calling session.Flush() and make sure you commit the transaction, to see if NH issues the proper command. Even the strategy for fetching the entity should not matter.
I have a fairly large EF4 model, using POCO code gen. I've got lots of instances where I select a single entity from whichever table by its ID.
However on some tables, this takes 2 minutes or more, where on most tables it takes less than a second. I'm out of ideas as to where to look now, because I can't see any reason. It's always the same tables that cause problems, but I can query them directly against the database without problems, so it must be somewhere in Entity Framework territory that the problem is coming from.
The offending line is quite innocuous:
Dim newProd As New Product
newProd.ShippingSize = Entities.ShippingSizes.Single(Function(ss) ss.Id = id)
id is simply an integer passed in from the UI; Id on my entity is the primary key, which is indexed in the database.
Entities is a freshly created instance of my Entity Framework data context.
This is not the first query executed against the context, but it is the first query against this EntitySet.
I have re-indexed all tables, having seen posts suggesting that a corrupt index could cause slow access; that hasn't made any difference.
The exact same line of code against other tables runs almost instantly; it's only certain tables.
This particular table is tiny - it only has 4 rows in it.
Any suggestions as to where to even start?
--edit - I'd oversimplified the code in the question to the point where the problem disappeared!
Where to start?
Print or log the actual SQL string that's being sent to the database.
Execute that literal string on the server and measure its performance.
Use your server's EXPLAIN plan system to see what the server's actually doing.
Compare the raw SQL performance to your EF performance.
That should tell you whether you have a database problem or an EF problem.
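To get the literal SQL from code in EF4, one option is to cast the LINQ query to an ObjectQuery and call ToTraceString(); a sketch in C# (ShippingSizes, Entities and id are the names from the question; Where is used instead of Single so the query object is still available to inspect):
// Prints the store command EF4 would send, without executing it.
// Requires System.Data.Objects and System.Linq.
var query = (ObjectQuery<ShippingSize>)Entities.ShippingSizes.Where(ss => ss.Id == id);
Console.WriteLine(query.ToTraceString()); // run this SQL directly and compare timings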
Seems like this is a function of the POCO template's Fixup behaviour in combination with lazy loading.
Because the entity has already been loaded via Single, subsequent operations seem to be happening in memory rather than against the database. The Fixup method by default makes Contains() calls, which is where everything grinds to a halt while 10s of thousands of items get retrieved, initialised as proxies, and evaluated in memory.
I tried changing this Contains() to a Where(Function(x) x.Id = id).Count > 0 (will do logically the same thing, but trying to force a quick DB operation instead of the slow in-memory one). The query was still performed in-memory, and just as slow.
I switched from POCO to the standard EntityGenerator, and this problem just disappeared with no other changes. Say what you will about patterns/practices, but this is a nasty problem to have - I didn't spot this until I switched from fakes and small test databases to a full size database. Entity Generator saves the day for now.
I'm in the process of optimizing my SQL queries on my Heroku server so I can speed up one particular request. Right now I'm mainly looking at condensing all the INSERT queries into as few queries as possible.
At some point in my code I have this:
jobs.each do |j|
  Delayed::Job.enqueue j
end
I found out that every iteration sends a BEGIN, INSERT, COMMIT to the db. That jobs array can contain from a few to a couple hundred objects. I have looked for a way to batch-insert delayed jobs but couldn't find anything. Any idea how to achieve that?
I've been using AR-Extensions for a long time to insert bulk data from models into the database.
That was on Rails 2.3.x though; be careful, as there are now different versions depending on the Rails version: http://www.continuousthinking.com/tags/arext
I'm not sure how Delayed::Job works, but guessing from your example, I'd assume it inserts a record per job into a table which then serves as the queue. You could extend that, using AR-Extensions, to collect all those models and insert all jobs at once.
I ended up enqueuing my User object instead, which had a jobs attribute. So 1 insert instead of jobs.length inserts.