Runtime optimisation of a Pentaho PDI transformation

I'm running this transformation in PDI and it takes hours to load a table that contains 18,000 records.
SQL query in the Table input step:
Is this normal, or should I try to optimize the SQL query or add more steps?

I think your SQL is inefficient. You should not cast the column; cast only the parameter. Casting a column can be expensive, because it usually prevents the database from using an index on that column.
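The original query isn't shown, so this is only a minimal sketch of the idea, with a hypothetical orders table, a DATE column order_date, and a parameter placeholder (?) supplied by the Table input step: cast the parameter, not the column.

    -- Slow: the cast is evaluated for every row and usually defeats an index on order_date
    SELECT *
    FROM orders
    WHERE CAST(order_date AS VARCHAR(10)) = ?;

    -- Faster: cast only the parameter; the column stays indexable
    SELECT *
    FROM orders
    WHERE order_date = CAST(? AS DATE);

The same point holds whether the parameter comes from "Insert data from step" or from a variable.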

Most likely it's your output step that is slow, not the input step.
Here's how you can check if that's the case:
Add a Dummy step between those two steps;
Disable the hop going OUT of the Dummy step;
Run the transformation.
It will now run just the read part, which should be reasonably fast. If it is, then it's your output step that is slow, not the input. You can also check by running the entire transformation and looking at the input/output values in the Step Metrics. A slow step will have a full input buffer (meaning it's slower than the steps upstream) and an almost empty output buffer (slower than the steps downstream).
If you confirm it's indeed your input step that is slow: your query can't make any use of indexes, as the condition is rather complex. You may need to refactor it.
If the input step isn't the problem: the Insert/Update step is rather slow, as each row of data requires a round trip to look up the keys before deciding whether to insert or update.
You can avoid that, to some extent, by doing the following:
Add a unique constraint on the keys in the output table (see the SQL sketch after this explanation);
Use a Table output step (which only inserts). Any rows that try to insert duplicate keys will error out.
Add an Update step after the table output. When connecting it, choose "Error handling of step" as the hop type.
With this:
any new keys will be inserted by the Table output;
any repeated keys will error out on Insert and will be sent to the Error handling hop;
The update step will then update only the rows that need updating, without trying to figure out if they should be inserted or updated.
This can significantly boost your performance. But in any case, updating large volumes of data is more often than not a slow operation.
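As a sketch of the unique constraint from the list above, with hypothetical table and key names (use whatever columns the Insert/Update step currently treats as lookup keys):

    -- Hypothetical names; the key columns are the ones used to decide insert vs. update
    ALTER TABLE target_table
      ADD CONSTRAINT uq_target_business_key UNIQUE (customer_id, order_date);

With this in place, the Table output step fails fast on duplicates and the error-handling hop carries only the rows that need the Update step.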

ALTER stored procedure (with no changes) causes it to run 20x slower

I've been doing some speed tests on a stored procedure that does a bulk insert. It takes some JSON as a parameter, creates a table variable, weeds out any duplicates in the table variable from what's already in the destination table, and then copies the data from the table variable to the destination table.
In running these tests, I was seeing some wildly different speed results that were driving me nuts. They made no sense. I finally pinned down the issue and I'm able to reproduce it consistently.
Here's the process:
1. Delete all data from the destination table.
2. Run the stored procedure and pass in a JSON record of 50,000 rows.
3. It executes in about 1.5 seconds.
4. Repeat the process. This time it has existing data it needs to parse looking for duplicates. Same result: less than 2 seconds.
5. Repeat step 4 N times, always with the same results.
6. Run an ALTER on the SP without having made ANY changes to the SP itself.
7. Repeat step 4. This time it takes 30-40 seconds!
8. Delete the data in the destination table and repeat all the steps: same results.
I've been reading up on parameter sniffing and trying things like copying the passed-in parameters to local variables and adding WITH RECOMPILE, but so far the results are all the same.
If this were to happen in prod, it would be unacceptable. Does anyone have any ideas?
Thanks!
This is a bit long for a comment.
SQL Server caches the plans for queries in a stored procedure when they are first run. In your case, the first run has an empty table so the query plan is based on an empty table. That seems to be a good query plan for your problem.
When you alter the stored procedure, you do have one effect: it forgets the cached query plan. So a new plan is generated, one that uses the current size of the table.
For whatever reason, this second query plan is much worse than the first. I don't know why. Usually the problem is the other way around (the query plan on the empty table is the worse one).
I would suggest that you either figure out how to get the query to use the right plan when the table has data, or have the code in the stored procedure recompiled each time it is run. The latter might be overkill, but it adds only a little overhead.
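Since procedure-level WITH RECOMPILE reportedly didn't help, one narrower thing to try is a statement-level OPTION (RECOMPILE) on the insert that is sensitive to the destination's size. This is only a sketch with hypothetical object and column names, not the poster's actual procedure:

    ALTER PROCEDURE dbo.LoadDestinationFromJson
        @json NVARCHAR(MAX)
    AS
    BEGIN
        SET NOCOUNT ON;

        INSERT INTO dbo.Destination (Id, Payload)
        SELECT j.Id, j.Payload
        FROM OPENJSON(@json)
             WITH (Id INT '$.id', Payload NVARCHAR(200) '$.payload') AS j
        WHERE NOT EXISTS (SELECT 1 FROM dbo.Destination AS d WHERE d.Id = j.Id)
        OPTION (RECOMPILE);  -- plan is rebuilt on each run using the current size of dbo.Destination
    END;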

How to load 15,000,000 records into a table with Pentaho?

I have created an ETL process with Pentaho that selects data from a table in one database and loads it into another database.
The main problem I'm facing is that 1,500,000 rows take 6 hours. The full table is 15,000,000 rows, and I have to load 5 tables like that.
Can anyone explain how one is supposed to load a large volume of data with Pentaho?
Thank you.
I have never had volume problems with Pentaho PDI. Check the following, in order.
Can you check whether the problem is really coming from Pentaho: what happens if you run the query in SQL Developer, Toad, or any fancy JDBC-compliant SQL IDE?
In principle, PDI is meant to import data with a SELECT * FROM ... WHERE ... and do all the rest in the transformation. I have a set of transformations here which took hours to execute because they ran complex queries. The problem was not PDI but the complexity of the queries. The solution was to move the GROUP BY and SELECT FROM (SELECT...) logic into PDI steps, which can start processing before the query result is complete. The result went from about 4 hours to 56 seconds. No joke.
What is your memory size? It is defined in spoon.bat / spoon.sh.
Near the end there is a line which looks like PENTAHO_DI_JAVA_OPTIONS="-Xms1024m" "-Xmx4096m" "-XX:MaxPermSize=256m". The important parameter is -Xmx.... If it is -Xmx256K, your JVM has only 256KB of RAM to work with.
Change it to 1/2 or 3/4 of the available memory, in order to leave room for the other processes.
Is the output step the bottleneck? Check by disabling it and watching your clock during the run.
If it is slow, increase the commit size and allow batch inserts.
Disable all the indexes and constraints and restore them once the data is loaded (see the SQL sketch after this list). There are nice SQL script executor steps to automate that, but check it manually first, then in a job; otherwise the index rebuild may be triggered before the load begins.
You also have to check that you do not lock yourself: as PDI launches all the steps together, you may have truncates waiting on another truncate to unlock. Even if you are not in a never-ending lock, it may take quite a while before the db is able to cascade everything.
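A sketch of the drop/recreate scripts for the index-and-constraint point above, with hypothetical PostgreSQL-style names; put the first script in a SQL step before the load and the second in a SQL step after it:

    -- Before the load: drop the secondary index and the foreign key (hypothetical names)
    DROP INDEX IF EXISTS idx_target_customer;
    ALTER TABLE target_table DROP CONSTRAINT fk_target_customer;

    -- After the load: recreate them
    CREATE INDEX idx_target_customer ON target_table (customer_id);
    ALTER TABLE target_table
      ADD CONSTRAINT fk_target_customer
      FOREIGN KEY (customer_id) REFERENCES customers (id);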
There's no fixed answer covering all possible performance issues. You'll need to identify the bottlenecks and solve them in your environment.
If you look at the Metrics tab while running the job in Spoon, you can often see at which step the rows/s rate drops. It will be the one with the full input buffer and empty output buffer.
To get some idea of the maximum performance of the job, you can test each component individually.
Connect the Table Input to a dummy step only and see how many rows/s it reaches.
Define a Generate Rows step with all the fields that go to your destination and some representative data and connect it to the Table Output step. Again, check the rows/s to see the destination database's throughput.
Start connecting more steps/transformations to your Table Input and see where performance goes down.
Once you know your bottlenecks, you'll need to figure out the solutions. Bulk load steps often help the output rate. If network lag is holding you back, you might want to dump data to compressed files first and copy those locally. If your Table input has joins or where clauses, make sure the source database has the correct indexes to use, or change your query.
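For example, if the Table input query joins and filters on a date column, you can check the plan on the source database and add a supporting index. The query and names below are hypothetical, and the EXPLAIN syntax shown is PostgreSQL/MySQL-style:

    -- Inspect how the source database executes the Table input query
    EXPLAIN
    SELECT o.id, o.amount, c.name
    FROM   orders o
    JOIN   customers c ON c.id = o.customer_id
    WHERE  o.created_at >= DATE '2017-01-01';

    -- If the plan shows a full scan driven by the date filter, an index may help
    CREATE INDEX idx_orders_created_at ON orders (created_at);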

Pentaho Data Integration (PDI): How to use the PostgreSQL bulk loader? My transformation runs forever

I'm new to PDI and using PDI 7. I have an Excel input with 6 rows and want to insert it into a Postgres DB. My transformation is: Excel input --> Postgres Bulk Loader (2 steps only).
Condition 1: When I run the transformation, the Postgres Bulk Loader never stops and nothing is inserted into my Postgres DB.
Condition 2: So I add an Insert/Update step after the Postgres Bulk Loader, and all the data is inserted into the Postgres DB, which means success, but the bulk loader keeps running.
My transformation
From all the sources I could find, they only need the input and the Bulk Loader step, and after the transformation finishes, the bulk loader shows "Finished" (mine shows "Running"). So I want to ask how to do this properly for Postgres. Did I skip something important? Thanks.
The PostgreSQL bulk loader used to be only experimental. Haven't tried it in some time. Are you sure you need it? If you're loading from Excel, it's unlikely you'll have enough rows to warrant use of a bulk loader.
Try just the regular Table Output step. If you're only inserting, you shouldn't need the Insert/Update step either.
To insert just 7 rows you don't need the bulk loader.
The bulk loader is designed to load huge amounts of data. It uses the native psql client, which transfers data much faster since it uses the full native protocol without the restrictions of the JDBC specification. JDBC is used in other steps like Table Output. Most of the time, Table Output is sufficient.
The Postgres Bulk Loader step just builds the incoming data into CSV format in memory and pipes it to the psql client.
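In other words, the step's work boils down to something like a COPY fed from a stream; a rough sketch with hypothetical table and columns (the real command, delimiter, and encoding are generated from the step's settings):

    COPY target_table (id, name, amount)
    FROM STDIN
    WITH (FORMAT csv, DELIMITER ';');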
I did some experiments.
Environment:
DB: PostgreSQL 9.5 x64
PDI Kettle v5.2.0
PDI Kettle default JVM settings (512 MB)
Data source: DBF file with over 2,215,000 rows
Both PDI and the database on the same localhost
Table truncated on each run
PDI Kettle restarted on each run (to avoid heavy CPU load from GC due to the huge number of rows)
The results below should help you make a decision.
Bulk loader: average over 150,000 rows per second; total around 13-15 s.
Table output (SQL inserts): average 11,500 rows per second; total around 3 min 18 s.
Table output (batch inserts, batch size 10,000): average 28,000 rows per second; total around 1 min 30 s.
Table output (batch inserts in 5 threads, batch size 3,000): average 7,600 rows per second per thread, i.e. around 37,000 rows per second overall; total around 59 s.
The advantage of the bulk loader is that it doesn't fill the JVM's memory: all data is streamed to the psql process immediately.
Table Output fills the JVM memory with data. After around 1,600,000 rows the memory is full and GC starts; CPU load then goes up to 100% and the speed slows down significantly. That is why it's worth playing with the batch size to find the value that gives the best performance (bigger is better, up to the point where it causes GC overhead).
Last experiment: give the JVM enough memory to hold the data. This can be tweaked via the PENTAHO_DI_JAVA_OPTIONS variable. I set the JVM heap size to 1024 MB and increased the batch size.
Table output (batch inserts in 5 threads, batch size 10,000): average 12,500 rows per second per thread, i.e. around 60,000 rows per second overall; total around 35 s.
Now it is much easier to make a decision. But note that Kettle/PDI and the database were on the same host; if the hosts are different, network bandwidth can play a role in performance.
Slow insert/update step
Why should you avoid the Insert/Update step when you process a huge amount of data or are limited by time?
Let's look at the documentation:
The Insert/Update step first looks up a row in a table using one or more lookup keys. If the row can't be found, it inserts the row. If it can be found and the fields to update are the same, nothing is done. If they are not all the same, the row in the table is updated.
As stated above, for each row in the stream the step executes two queries: first a lookup, then an update or an insert. The PDI Kettle source shows that a PreparedStatement is used for all of them: insert, update, and lookup.
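Per row, the generated traffic has roughly this shape (hypothetical table and key; the real statements are built by the step from its configuration):

    -- 1) Lookup using the configured key(s)
    SELECT name, amount FROM target_table WHERE id = ?;

    -- 2a) Row found and a compared field differs: update
    UPDATE target_table SET name = ?, amount = ? WHERE id = ?;

    -- 2b) Row not found: insert
    INSERT INTO target_table (id, name, amount) VALUES (?, ?, ?);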
So if this step is the bottleneck, try to figure out what exactly is slow.
Is the lookup slow? (Run the lookup query manually on the database with sample data. Is it slow? Is there an index on the columns used to find the corresponding row?)
Is the update slow? (Run the update query manually on the database with sample data. Is it slow? Does the update's WHERE clause use an index on the lookup fields?)
In any case this step is slow, since it requires a lot of network round trips and data processing in Kettle.
The only way to make it significantly faster is to load all the data into a "temp" table in the database and call a function that upserts the data, or just use a simple SQL step in the job to do the same.
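Since the tests above ran on PostgreSQL 9.5, the upsert from the temp/staging table can be a single statement. This is a sketch with hypothetical names; it assumes target_table has a primary key or unique constraint on id:

    -- Run once after the staging table has been loaded (e.g. with the bulk loader)
    INSERT INTO target_table (id, name, amount)
    SELECT id, name, amount
    FROM   staging_table
    ON CONFLICT (id) DO UPDATE
       SET name   = EXCLUDED.name,
           amount = EXCLUDED.amount;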

Loading a fact table using Pentaho Data Integration - reduce the runtime of the ktr

I am using Pentaho DI to insert data into a fact table. The source table contains 10,000 records and grows daily.
If the source table contains 10,000 records and 200 new records are added, then when I run the ktr it truncates all 10,000 rows from the fact table and starts inserting the full 10,200 records again.
To avoid this I unchecked the truncate option in the Table output step, made one key unique in the fact table, and checked the Ignore insert errors option. Now it works and inserts only the 200 new records, but it takes the same execution time.
I also tried a Stream lookup step in the ktr, but there was no change in execution time.
Can anyone help me solve this problem?
Thanks in advance.
If you need to capture all of the inserts, updates, and deletes, a Merge Rows (diff) step followed by a Synchronize after merge step will do this, and it typically does it very quickly.

Duplicate rows on OLE DB destination

I have quite a complex scenario where the same package can be run in parallel. In some situations both executions can end up trying to insert the same row into the destination, which causes a primary key violation.
There is currently a lookup that checks the destination table to see if the record exists, so the insert is done on its "no match" output. It doesn't prevent the error, because the lookup cache is loaded at package start; both packages therefore see the same data, and if the same row arrives in both, each considers it a "new" row, so the first insert succeeds and the second fails.
Can anything be done to avoid this scenario? Pretty much ignore the "duplicate rows" on the OLE DB destination? I can't use MAX ERROR COUNT, because the duplicate row is in a batch among other rows that were not in the first package and should still be inserted.
The default lookup behaviour is to employ Full Cache mode. As you have observed, during the package validation stage it pulls all the lookup values into a local memory cache and uses that, which means it misses later updates to the table.
For your scenario, I would try changing the cache mode to None (Partial is the other option). None means an actual query is fired against the target database for every row that passes through. Depending on your data volume, or with a poorly performing query, that can have a not-insignificant impact on the destination. It still won't guarantee that the parallel instance isn't trying to load the exact same record (or that the parallel run has already satisfied its lookup and is ready to write to the target table), but it should improve the situation.
If you cannot control the package executions so that the concurrent data flows never overlap, then you should look at re-architecting the approach (write to partitions and swap them in, use something to lock resources, stage all the data and use a T-SQL MERGE, etc.).
Just a thought: how about writing the new records to a temp table and merging them in intermittently? That gives you an opportunity to filter out duplicates.
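As a sketch of that idea, and of the T-SQL MERGE mentioned in the previous answer, with hypothetical table and column names:

    -- Run periodically after the package has filled dbo.StagingRows
    MERGE dbo.Destination WITH (HOLDLOCK) AS tgt   -- HOLDLOCK reduces the race between parallel runs
    USING (SELECT DISTINCT Id, Payload FROM dbo.StagingRows) AS src
        ON tgt.Id = src.Id
    WHEN MATCHED THEN
        UPDATE SET tgt.Payload = src.Payload
    WHEN NOT MATCHED THEN
        INSERT (Id, Payload) VALUES (src.Id, src.Payload);

    TRUNCATE TABLE dbo.StagingRows;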