Data loading is slow while using "Insert/Update" step in pentaho 4.4.0
I am using Pentaho 4.4.0. With the "Insert/Update" step in Kettle the data load is very slow compared to a plain MySQL insert. The step looks up each incoming record in the target table before writing: if the record exists it performs an update, otherwise an insert. What can be done to optimize the performance of "Insert/Update"? The step runs at about 4 rows/s and I have over 1 lakh (100,000) records, so the whole process takes about two and a half hours.
Based on your comments it sounds like you want the Merge rows (diff) step followed by a Synchronize after merge. Check the Pentaho wiki to see how these steps work.
Another thing that makes a big difference is how many of the rows result in an upsert versus the total number of rows. If more than roughly 40% of the rows result in writes, #carexcer's last comment may be a better approach; if it's less, definitely try the Merge rows (diff) step.
4-25 rows per second is far too slow. Whichever step you choose, make sure the fields you marked as keys are indexed.
If most of the rows result in an upsert, you may be better off with a full refresh. In that case, check out the MySQL bulk loaders: Pentaho has both a batch and a streaming bulk loader, though I don't know how good they are.
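(For reference, the fast path such loaders rely on, as far as I know, is MySQL's LOAD DATA statement; a minimal sketch with made-up file and table names:)
-- Made-up file and table names: load a CSV straight into a staging table.
LOAD DATA LOCAL INFILE '/tmp/staging.csv'
INTO TABLE staging_table
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';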
Try setting a large value in the Transaction Size (Commit) field; how large depends on how many rows you will upsert.
This improved performance a great deal in my case.
For example, 500 would be too small if you are upserting 100,000 rows, because the transformation would have to commit 200 times. Fewer commits, faster execution.
You can try editing the connection and adding these MySQL JDBC driver options; they increase performance:
useServerPrepStmts = false
useCursorFetch = true
useCompression = true
Double-click the database connection -> Options tab -> add the parameters above.
With MySQL none of the other options worked well for me, so I used a small trick. Instead of issuing one insert per row:
Insert into A(a,b) values (1,2);
Insert into A(a,b) values (2,2);
Insert into A(a,b) values (3,2);
change it to a single multi-row insert:
Insert into A(a,b) values (1,2), (2,2), (3,2);
It works very well.
Related
I am inserting large amounts of data into a table.
For example once every 15 minutes, N records of data become available to be inserted into the table.
My question is: what should I do if inserting N records takes more than 15 minutes? That is, the next insertion cannot begin because the previous one is still in progress.
Please assume that I am already using the most affordable hardware, and that even dropping the indexes before inserting does not bring the load under 15 minutes.
My preference is not to drop the indexes anyway, because the table is queried at the same time. What is the best practice in such a scenario?
P.S. I don't have any actual code; I am just thinking through a possible scenario.
If you are receiving/loading a large quantity of data every quarter hour, you have an operational requirement, not an application requirement, so use an operational solution.
All databases have a bulk-insert utility; SQL Server is no exception and even calls the command BULK INSERT:
BULK INSERT mytable FROM 'my_data_file.dat'
Such utilities are built for raw speed and will outstrip any alternative application solution.
Write a shell script to receive the data into a file, formatting it as required using shell utilities, and invoke BULK INSERT.
Wire the process up to crontab (or the equivalent Windows scheduler such as AT if you are running on Windows).
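For the record, the WITH options matter in practice; a slightly fuller form of the same command might look like this (the path, table name and delimiters are placeholders that depend on your file):
-- Placeholder path, table and delimiters; adjust to the file layout.
BULK INSERT dbo.mytable
FROM 'C:\loads\my_data_file.dat'
WITH (
    FIELDTERMINATOR = '|',   -- column delimiter in the file
    ROWTERMINATOR = '\n',    -- row delimiter
    BATCHSIZE = 50000,       -- commit every 50,000 rows
    TABLOCK                  -- table lock helps get a minimally logged load
);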
First thing is to look for basic optimizations for inserts.
You can find many posts about it:
What is the fastest way to insert large number of rows
Insert 2 million rows into SQL Server quickly
Second, look into why it takes more than 15 minutes. Many things can explain that: locks, isolation level, etc. Try to challenge each of them (for example, can some portion of the queries read uncommitted records?).
Third, find the right batch size for the inserts and consider splitting the load into several smaller chunks with intermediate commits. Many inserts in one transaction without committing can have a bad effect on the server (log-file- and lock-wise, because it must be able to roll back the entire transaction).
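As a rough illustration of the chunking idea (all table and column names here are invented), something like this moves rows from a staging table in batches, with a commit per batch:
-- Invented staging/target tables; insert in batches of 10,000 with a commit each time.
DECLARE @batch int = 10000;
DECLARE @rows int = 1;

WHILE @rows > 0
BEGIN
    BEGIN TRANSACTION;

    INSERT INTO dbo.TargetTable (Id, Payload)
    SELECT TOP (@batch) s.Id, s.Payload
    FROM dbo.StagingTable AS s
    WHERE NOT EXISTS (SELECT 1 FROM dbo.TargetTable AS t WHERE t.Id = s.Id);

    SET @rows = @@ROWCOUNT;   -- capture the count before anything else resets it

    COMMIT TRANSACTION;
END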
I am facing an issue with an ever slowing process which runs every hour and inserts around 3-4 million rows daily into an SQL Server 2008 Database.
The schema consists of a large table which contains all of the above data and has a clustered index on a datetime field (by day), a unique index on a combination of fields in order to exclude duplicate inserts, and a couple more indexes on 2 varchar fields.
The typical behavior of late is that the insert statements get suspended for a while before they complete. The overall process used to take 4-5 minutes and now usually takes well over 40 minutes.
The inserts are executed by a .NET service which parses a series of XML files, performs some data transformations and then inserts the data into the DB. The service has not changed at all; the inserts simply take longer than they used to.
At this point I'm willing to try everything. Please, let me know whether you need any more info and feel free to suggest anything.
Thanks in advance.
It sounds like you have exhausted the buffer pool's ability to cache all the pages needed for the insert process. Append-style inserts (such as those following your datetime clustered index) have a very small working set of just a few pages. Random-style inserts have essentially the entire index as their working set: if you insert a row at a random location, the existing page that row belongs on must be read first.
This probably means tons of disk seeks for the inserts.
Make sure to insert all rows in one statement, using bulk insert or TVPs. This allows SQL Server to optimize the query plan by sorting the inserts by key value, making the IO much more efficient.
This will not amount to a huge speedup on its own, however (I have seen about 5x in similar situations). To regain the original performance you must bring the working set back into memory: add RAM, purge old data, or partition so that you only need to touch very few partitions.
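For illustration, the TVP route could look roughly like this (the type, procedure and table names are invented; TVPs require SQL Server 2008 or later):
-- Invented names: a table type, plus a procedure that inserts the whole batch at once.
CREATE TYPE dbo.RowBatch AS TABLE
(
    EventTime datetime NOT NULL,
    Payload varchar(200) NOT NULL
);
GO

CREATE PROCEDURE dbo.InsertRowBatch
    @Rows dbo.RowBatch READONLY
AS
BEGIN
    -- One set-based insert lets SQL Server sort the rows by the clustering key.
    INSERT INTO dbo.BigTable (EventTime, Payload)
    SELECT EventTime, Payload
    FROM @Rows;
END;
GO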
Drop the indexes before the insert and recreate them on completion.
I am using Pentaho DI to insert data into a fact table. The source table currently contains 10,000 records and grows daily.
Say the source table has 10,000 records and 200 new records are added; when I then run the .ktr, the transformation truncates all 10,000 rows from the fact table and re-inserts the full 10,200 records.
To avoid this I unchecked the truncate option in the Table output step, made one key unique in the fact table, and enabled the "Ignore insert errors" option. It now works and inserts only the 200 new records, but it takes the same execution time.
I also tried a Stream lookup step in the .ktr, but the execution time did not change.
Can anyone help me solve this problem?
Thanks in advance.
If you need to capture all of the inserts, updates, and deletes, the Merge rows (diff) step followed by a Synchronize after merge step will do this, and will typically do it very quickly.
Which is better?
1) A cursor that loops over 30,000 records and performs the updates one by one
2) A script that contains 30,000 UPDATE commands
Thanks
Both should take about the same time, mainly depending on how the cursor is declared.
Why? Because either way you have 30,000 individual updates, and that is usually the dominant factor.
Note that 30,000 individual UPDATEs in one batch will probably fail because of batch size and compile time anyway...
SQL is a set-based language, and you can most likely do a single UPDATE that touches all the rows in one go. If you can't, it is for one of two reasons:
You need "per row" logic: this can usually be achieved with CASE expressions, UDFs, etc. (see the sketch below)
You don't understand sets and SQL
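For example, per-row logic can often be folded into one statement along these lines (table and column names are made up):
-- Made-up table/columns: one set-based UPDATE instead of 30,000 single-row updates.
UPDATE dbo.Accounts
SET Status = CASE
                 WHEN Balance < 0 THEN 'OVERDRAWN'
                 WHEN LastLogin < '2012-01-01' THEN 'DORMANT'
                 ELSE 'ACTIVE'
             END
WHERE Status <> 'CLOSED';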
With more information (the SQL and logic) we could help you more...
There is a very easy way to tell: do it and measure the time.
Other than that, having 30,000 lines does not make a lot of sense when you could have just 10.
Making updates this way for reasons other than data migration or maintenance doesn't sound wise either, and in those cases performance is not the issue, whereas maintainability and legibility always are.
You know, that depends on context.
It helps, though, to learn SQL, for example. You are at too low a level to see the real optimizations possible here. SQL is a lot more than UPDATE, INSERT and simple SELECT statements.
1) A cursor that loops over 30,000 records and performs the updates one by one
Linear, step-by-step processing. There is no way to parallelize it, as SQL itself exposes no threading mechanisms to the user, and optimization happens one statement at a time, i.e. the query optimizer looks at one statement at a time.
2) A script that contains 30,000 UPDATE commands
Assuming the script is external, it could split the work and run it concurrently over multiple connections, i.e. run more than one update in parallel.
But there is more:
Make a script that calculates the new values.
Bulk import them into a temporary table using the bulk copy API.
Issue ONE update statement that takes the updated values from the temporary table into the final one (sketched below).
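A minimal sketch of that last step, assuming the new values have been bulk-loaded into a temporary table #NewValues (all names are placeholders):
-- Placeholder names: one statement pulls the new values into the final table.
UPDATE t
SET t.Amount = n.Amount,
    t.UpdatedAt = GETDATE()
FROM dbo.TargetTable AS t
JOIN #NewValues AS n
    ON n.Id = t.Id;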
Or have the script issue a MERGE statement for the multi-row update. There are tons of variations once you know more of the SQL surface than "update, open cursor, simple select".
I do exactly that, though with a lot more data (batches of 50,000 rows, sometimes 4-6 of them at the same time). The catch is that SQL bulk copy has some overhead, but I still manage about 75,000 inserts per second this way.
A lot depends on the business question and the complexity of the logic. If it is simple updates, the question is: are the new values calculated or externally supplied? Multiplying a value by 2 is calculated; updating addresses is data driven (i.e. you need the new data from somewhere).
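And the MERGE variant mentioned above might look roughly like this (same placeholder names; MERGE is available from SQL Server 2008 on):
-- Placeholder names: update matching rows, insert the rest, in one statement.
MERGE dbo.TargetTable AS t
USING #NewValues AS n
    ON n.Id = t.Id
WHEN MATCHED THEN
    UPDATE SET t.Amount = n.Amount
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Id, Amount) VALUES (n.Id, n.Amount);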
I'm currently building a script that I know will need to insert multiple rows. I'm doing this in Perl, so in terms of parameterization it's much easier to insert each row individually. In terms of speed, I'm guessing that running a single INSERT statement will be faster (although latency is relatively low, as I'm quite close to the database itself). I expect about 20-40 rows per run of the script on average. That said, what would be the approximate performance difference between running one INSERT INTO statement versus running one for each row? Note: the server is running SQL Server 2008.
[EDIT] Since there seems to be a lot of confusion, I'd like to clarify that what I'm really asking about is the theory behind how a multi-row insert is handled by SQL Server 2008. Does it essentially convert it internally into a bunch of individual insert statements and run those over one connection, or does it do something more intelligent?
Yes, I know I can run timed loops. No, that's not what I'm asking for. [/EDIT]
Combining multiple inserts into one command will always execute much more quickly than executing separate inserts. The reasons are:
A lot of work goes into parsing the SQL; with the multi-row version there is only one parsing effort.
More work goes into checking permissions; again, it is only done once.
Database connections are "chatty"; with the multi-row version the handshaking is only done once. You really notice this when using a poor network connection.
Finally, the multi-row version gives the server an opportunity to optimize the operation as a whole.
There is a general principle of letting the SQL database do its thing rather than treating it as some sort of disk read. I've seen many cases where a developer reads from one table, then another, or runs a general query and then loops through each row to see if it's the one they want. Generally, it's better to let the database do that work.
In this case, I can't really see a big advantage of a single multi-row insert over multiple single-row inserts. I guess there might be some, because you don't have to do multiple prepares and commits.
It shouldn't be too difficult to actually create a test database and try this out. Create a table with two columns and have a program generate data to toss into it. Give yourself a realistic amount of work: how many items will this table eventually hold, and how many do you expect to insert at once? Say, create a table of 1,000,000 items, then insert into it 1,000 items at a time, 100 at a time, and one at a time, generating the data with a simple increment. There may be a "sweet spot" for the number of items you can insert at once.
In my unbiased and always correct opinion, you'll probably find that the difference isn't worth fretting over, and you should instead use the method that makes your code the easiest to maintain.
I have a programming dictum: the place where you want to optimize your code is probably the wrong place. We like efficiency, but we usually attack the wrong item, and whatever we squeeze out in efficiency we end up paying back in maintenance.
So just write what is easiest to understand and don't fret about being overly efficient.
Just to add a couple of other performance differentiators to think about for inserts:
Foreign keys - If the table you are inserting into has foreign keys, SQL Server effectively needs to join to the referenced tables on insert to validate them. When you do your inserts in one query, SQL Server can be more efficient in doing those joins.
Transactions - As you don't mention transactions, I assume you are using SQL Server's auto-commit mode. With such a small number of rows, the overhead of creating 40 transactions instead of 1 is likely higher than the cost of maintaining the log to allow rollback. However, if you were inserting 400,000 rows, it would likely be more expensive to insert them in one statement/transaction than as separate batches, because the cost of being prepared to roll back up to 400,000 rows is very high (for loads of that size it is usually best to insert in batches; the optimal batch size can be determined through testing). Also, above a certain row count it may become more efficient to disable the foreign keys, insert the rows, then re-enable them, as sketched below.
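If you do go down the route of disabling foreign keys for a very large load, the pattern is roughly this (hypothetical table name; the final WITH CHECK revalidates the rows so the constraints remain trusted):
-- Hypothetical table: switch constraints off for the load, then back on with revalidation.
ALTER TABLE dbo.BigTable NOCHECK CONSTRAINT ALL;

-- ... perform the batched inserts here ...

ALTER TABLE dbo.BigTable WITH CHECK CHECK CONSTRAINT ALL;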