Recently I ran into a problem. The transformation that causes it has an "Insert/Update" step that operates on a table with more than 200 million records. After the connection to the database server is lost and I re-run the transformation manually, I can see in the log window that the step re-checks the records it had already downloaded before the connection loss. I understand that this is the logical behavior of the step, but it leaves me no chance to load all the records. Sometimes the process stops after 15 million records, sometimes after 50 million.
How can I deal with this problem? I thought about auto-incrementing the primary key value and saving the last primary key value after a connection loss, or sorting the records of the target table by primary key, finding the gaps, and resuming the load with the values in those gaps. But are there mechanisms in Pentaho that could do the job?
Pentaho has checkpoints that you can enable for jobs, which allow you to restart a job at the checkpoint where it stopped, for whatever reason. https://help.pentaho.com/Documentation/8.2/Products/Data_Integration/Data_Integration_Perspective/Job_Checkpoints
However, this isn't available at the transformation level. Your idea of using a sequence, or an auto-incrementing field, is probably your best bet.
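As a rough sketch of that idea (table, column and variable names below are made up, and it assumes a numeric, ever-increasing key): read the highest key already loaded into the target, then use it to filter the Table Input query so a re-run only fetches the missing rows.

    -- 1. Find the highest key that already made it into the target table
    SELECT COALESCE(MAX(id), 0) AS last_loaded_id
    FROM target_table;

    -- 2. Use that value in the Table Input query (e.g. via a Kettle variable)
    --    so the re-run only fetches the rows that are still missing
    SELECT *
    FROM source_table
    WHERE id > ${LAST_LOADED_ID}
    ORDER BY id;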
Related
I have created an ETL process with Pentaho that selects data from a table in one database and loads it into another database.
The main problem I am facing is that it takes 6 hours for 1,500,000 rows. The full table has 15,000,000 rows, and I have to load 5 tables like that.
Can anyone explain how one is supposed to load a large volume of data with Pentaho?
Thank you.
I have never had a volume problem with Pentaho PDI. Check the following, in order.
Can you check whether the problem is really coming from Pentaho: what happens if you drop the query into SQL Developer, Toad or any JDBC-compliant SQL IDE?
In principle, PDI is meant to import data with a SELECT * FROM ... WHERE ... and do all the rest in the transformation. I have a set of transformations here which take hours to execute because they run complex queries. The problem is not PDI but the complexity of the query. The solution is to move the GROUP BY and SELECT FROM (SELECT...) into PDI steps, which can start working before the query result is finished. The result went from something like 4 hours to 56 seconds. No joke.
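As an illustration (table and column names are invented), the idea is to replace a heavy aggregating query with a plain extract and let PDI do the aggregation:

    -- Heavy query: PDI waits until the database finishes the whole GROUP BY
    SELECT customer_id, SUM(amount) AS total_amount
    FROM sales
    GROUP BY customer_id;

    -- Lighter query: rows start streaming into the transformation immediately,
    -- and the aggregation is done in a Group By (or Memory Group By) step
    SELECT customer_id, amount
    FROM sales;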
What is your memory size? It is defined in spoon.bat / spoon.sh.
Near the end there is a line which looks like PENTAHO_DI_JAVA_OPTIONS="-Xms1024m" "-Xmx4096m" "-XX:MaxPermSize=256m". The important parameter is -Xmx.... If it is -Xmx256K, your JVM has only 256 KB of RAM to work with.
Change it to 1/2 or 3/4 of the available memory, in order to leave room for the other processes.
Is the output step the bottleneck? Check by disabling it and watching your clock during the run.
If it is the bottleneck, increase the commit size and enable batch inserts.
Disable all the indexes and constraints and restore them once the data is loaded. You have nice SQL script executor steps to automate that, but check it manually first and then in a job; otherwise the index rebuild may trigger before the load begins.
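For example (PostgreSQL syntax, object names are made up), the scripts run by those steps could look like this, one before the load and one after:

    -- Before the load: drop constraints and indexes on the target
    ALTER TABLE target_table DROP CONSTRAINT IF EXISTS target_table_pkey;
    DROP INDEX IF EXISTS idx_target_table_customer;

    -- After the load: restore them
    ALTER TABLE target_table ADD CONSTRAINT target_table_pkey PRIMARY KEY (id);
    CREATE INDEX idx_target_table_customer ON target_table (customer_id);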
You also have to check that you do not lock yourself: as PDI launches the steps all together, you may have truncates which are waiting on another truncate to unlock. If you are not in a never-ending block, it may take quite a while before the db is able to cascade everything.
There's no fixed answer covering all possible performance issues. You'll need to identify the bottlenecks and solve them in your environment.
If you look at the Metrics tab while running the job in Spoon, you can often see at which step the rows/s rate drops. It will be the one with the full input buffer and empty output buffer.
To get some idea of the maximum performance of the job, you can test each component individually.
Connect the Table Input to a dummy step only and see how many rows/s it reaches.
Define a Generate Rows step with all the fields that go to your destination and some representative data and connect it to the Table Output step. Again, check the rows/s to see the destination database's throughput.
Start connecting more steps/transformations to your Table Input and see where performance goes down.
Once you know your bottlenecks, you'll need to figure out the solutions. Bulk load steps often help the output rate. If network lag is holding you back, you might want to dump data to compressed files first and copy those locally. If your Table input has joins or where clauses, make sure the source database has the correct indexes to use, or change your query.
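For instance, if the Table Input query filters on a date column (hypothetical names), a supporting index on the source database can help a lot; check the execution plan to confirm it is actually used:

    -- Hypothetical Table Input query
    SELECT id, customer_id, amount, order_date
    FROM orders
    WHERE order_date >= '2019-01-01';

    -- Supporting index on the source database
    CREATE INDEX idx_orders_order_date ON orders (order_date);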
I'm new to PDI and using PDI 7. I have an Excel input with 6 rows and want to insert it into a Postgres DB. My transformation is: EXCEL INPUT --> Postgres Bulk Loader (2 steps only).
Condition 1: When I run the transformation, the Postgres Bulk Loader does not stop and does not insert anything into my Postgres DB.
Condition 2: So I added an "Insert/Update" step after the Postgres Bulk Loader, and all the data was inserted into the Postgres DB, which means success, but the bulk loader is still running.
My transformation
From all the sources I can find, they only need the input and the Bulk Loader step, and after the transformation finishes, the bulk loader is "finished" (mine says "running"). So I want to ask how to do this properly for Postgres. Did I skip something important? Thanks.
The PostgreSQL bulk loader used to be only experimental, and I haven't tried it in some time. Are you sure you need it? If you're loading from Excel, it's unlikely you'll have enough rows to warrant the use of a bulk loader.
Try just the regular Table Output step. If you're only inserting, you shouldn't need the Insert/Update step either.
To insert just 7 rows you don't need a bulk loader.
The bulk loader is designed to load huge amounts of data. It uses the native psql client, which transfers data much faster since it uses all the features of the binary protocol without the restrictions of the JDBC specification. JDBC is used in other steps, like Table Output. Most of the time Table Output is sufficient.
The Postgres Bulk Loader step just builds up the incoming data in memory in CSV format and passes it to the psql client.
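In other words, what the step does is roughly equivalent to streaming CSV into a COPY command through psql. A simplified sketch (the table, column list and delimiter are illustrative and depend on the step configuration):

    -- Roughly what the Postgres Bulk Loader drives through the psql client
    COPY my_table (id, name, amount)
    FROM STDIN
    WITH (FORMAT csv, DELIMITER ';');
    -- ...followed by the CSV rows streamed from the incoming step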
I did some experiments.
Environment:
DB: Postgres v9.5 x64
PDI Kettle v5.2.0
PDI Kettle default JVM settings (512 MB)
Data source: DBF file with over 2,215,000 rows
Both PDI and the database on the same localhost
Table truncated on each run
PDI Kettle restarted on each run (to avoid heavy CPU load from GC runs caused by the huge number of rows)
The results are below, to help you make a decision.
Bulk loader: average over 150,000 rows per second; total around 13-15 s.
Table output (SQL inserts): average 11,500 rows per second; total around 3 min 18 s.
Table output (batch inserts, batch size 10,000): average 28,000 rows per second; total around 1 min 30 s.
Table output (batch inserts in 5 threads, batch size 3,000): average 7,600 rows per second per thread, which means around 37,000 rows per second; total time around 59 s.
The advantage of the bulk loader is that it doesn't fill the memory of the JVM; all data is streamed into the psql process immediately.
Table Output fills the JVM memory with data. After around 1,600,000 rows the memory is full and GC kicks in; the CPU is then loaded up to 100% and the speed slows down significantly. That is why it is worth playing with the batch size, to find the value which gives the best performance (bigger is better), though at some point it causes GC overhead.
Last experiment: the memory given to the JVM is enough to hold the data. This can be tweaked in the variable PENTAHO_DI_JAVA_OPTIONS. I set the JVM heap size to 1024 MB and increased the batch size.
Table output (batch inserts in 5 threads, batch size 10,000): average 12,500 rows per second per thread, which means around 60,000 rows per second in total; total time around 35 s.
Now it is much easier to make a decision. But you have to note that Kettle PDI and the database were located on the same host. If the hosts are different, network bandwidth can play some role in performance.
Slow insert/update step
Why should you avoid using Insert/Update (when a huge amount of data is processed or you are limited by time)?
Let's look at the documentation:
The Insert/Update step first looks up a row in a table using one or more lookup keys. If the row can't be found, it inserts the row. If it can be found and the fields to update are the same, nothing is done. If they are not all the same, the row in the table is updated.
This means that for each row in the stream, the step will execute two queries: first a lookup, and then an update or insert. The PDI Kettle source shows that a PreparedStatement is used for all queries: insert, update and lookup.
So if this step is the bottleneck, try to figure out what exactly is slow.
Is the lookup slow? (Run the lookup query manually against the database on sample data. Is it slow? Do the columns used to find the corresponding row have an index?)
Is the update slow? (Run the update query manually against the database on sample data. Is it slow? Does the update's where clause use an index on the lookup fields? See the sketch below.)
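A quick way to check both (PostgreSQL-style syntax, invented names) is to run the generated queries with EXPLAIN ANALYZE and see whether they hit an index or fall back to a sequential scan:

    -- The lookup the step fires for every incoming row
    EXPLAIN ANALYZE
    SELECT id, amount, updated_at
    FROM target_table
    WHERE business_key = 'SAMPLE-KEY-001';

    -- If it does a sequential scan, add an index on the lookup key
    CREATE INDEX idx_target_business_key ON target_table (business_key);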
Either way, this step is slow since it requires a lot of network round trips and data processing in Kettle.
The only way to make it faster is to load all the data into a "temp" table in the database and call a function which upserts the data, or just use a simple SQL step in the job to do the same.
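A minimal sketch of that approach, assuming a PostgreSQL 9.5+ target and made-up table names: bulk load everything into a staging table first, then run one set-based upsert from an SQL job entry:

    -- All incoming rows have already been loaded into stage_sales
    -- (Table Output or Bulk Loader); now upsert them in one statement.
    -- Requires a unique constraint on fact_sales(business_key).
    INSERT INTO fact_sales (business_key, amount, updated_at)
    SELECT business_key, amount, updated_at
    FROM stage_sales
    ON CONFLICT (business_key)
    DO UPDATE SET amount     = EXCLUDED.amount,
                  updated_at = EXCLUDED.updated_at;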
I am using Pentaho DI to insert data into a fact table. The thing is, the table I am populating from contains 10,000 records and is growing on a daily basis.
My source table contains 10,000 records, and when 200 new records are added I need to run the ktr. If I run the ktr file, it again truncates all 10,000 records in the fact table and starts inserting the 10,200 records.
To avoid this I unchecked the truncate option in the Table Output step, made one key unique in the fact table, and checked the Ignore insert errors option. Now it's working fine and inserting only the 200 records, but it takes the same execution time.
I also tried with the Stream Lookup step in the ktr, but there was no change in my execution time.
Can anyone please help me solve this problem?
Thanks in advance.
If you need to capture all of the inserts, updates, and deletes, the Merge rows (diff) step followed by a Synchronize after merge step will do this, and typically does it very quickly.
I have quite a complex scenario where the same package can be run in parallel. In some situations both executions can end up trying to insert the same row into the destination, which causes a primary key violation error.
There is currently a lookup that checks the destination table to see if the record exists, so the insert is done on its "no match" output. It doesn't prevent the error, because the lookup is loaded at package start, so both packages get the same data in it; if a row comes in to both of them, both will consider it a "new" row, so the first one succeeds and the second fails.
Is there anything that can be done to avoid this scenario? Pretty much ignore the "duplicate rows" on the OLE DB destination? I can't use MAX ERROR COUNT because the duplicate row is in a batch among other rows that were not in the first package and should be inserted.
The default lookup behaviour is to use Full Cache mode. As you have observed, during the package validation stage it pulls all the lookup values into a local memory cache and uses that, which means it misses later updates to the table.
For your scenario, I would try changing the cache mode to None (Partial is the other option). None means an actual query is fired off to the target database for every row that passes through. Depending on your data volume, or with a poorly performing query, that can have a not-insignificant impact on the destination. It still won't guarantee that the parallel instance isn't trying to load the exact same record (or that the parallel run hasn't already passed its lookup and is about to write to the target table), but it should improve the situation.
If you cannot control the package executions to prevent the concurrent data flows from firing, then you should look at re-architecting the approach (write to partitions and swap in, use something to lock resources, stage all the data and use a T-SQL MERGE, etc.).
Just a thought ... how about writing the new records to a temp table and merging them in intermittently? That would give you an opportunity to filter out duplicates.
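A rough sketch of that idea in T-SQL (all names are made up): each package writes its rows to a staging table, and a single MERGE moves them into the destination, skipping keys that already exist:

    -- Consolidate staged rows into the destination without PK violations
    MERGE dbo.DestinationTable AS tgt
    USING dbo.StagingTable AS src
        ON tgt.BusinessKey = src.BusinessKey
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (BusinessKey, Amount, LoadDate)
        VALUES (src.BusinessKey, src.Amount, src.LoadDate);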
I have a long-running job. The records to be processed are in a table with around 100K records.
Now, during the whole job, whenever this table is queried it queries against those 100K records.
After processing, the status of every record is updated in the same table.
I want to know if it would be better to add another table where I can update record statuses, and keep deleting whatever records are processed, so that as the job goes forward the number of records in the master table decreases, increasing query performance.
EDIT: The master table is basically used for this load only. I receive a flat file, which I upload as-is before processing. After doing validations on this table, I pick one record at a time and move the data to the appropriate system tables.
I had a similar performance problem where a table generally has a few million rows but I only need to process what has changed since the start of my last execution. My target table has an IDENTITY column, so when my batch process begins, I get the highest IDENTITY value from the set I select, where the IDs are greater than those of my previous batch execution. Then, upon successful completion of the batch job, I add a record to a separate table indicating the highest IDENTITY value which was successfully processed, and I use this as the starting input for the next batch invocation. (I'll also add that my bookmark table is general purpose, so I have multiple different jobs using it, each with unique job names.)
If you are experiencing locking issues because your processing time per record is long, you can use the approach above but break your sets into chunks of 1,000 rows (or whatever chunk size your system can process in a timely fashion), so you're only locking smaller sets at any given time.
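A sketch of the bookmark pattern described above in SQL (table, column and job names are illustrative; the @variables are just placeholders for values the job passes around):

    -- 1. Read the watermark left by the previous successful run
    SELECT MAX(last_processed_id) AS watermark
    FROM batch_bookmark
    WHERE job_name = 'load_fact_sales';

    -- 2. Process only rows newer than the watermark
    SELECT *
    FROM source_table
    WHERE id > @watermark
    ORDER BY id;

    -- 3. On success, record the new high-water mark for the next run
    INSERT INTO batch_bookmark (job_name, last_processed_id, processed_at)
    VALUES ('load_fact_sales', @new_max_id, CURRENT_TIMESTAMP);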
A few pointers (my two cents):
Consider splitting that table, similar to the "slowly changing dimension" technique, into a few "intermediate" tables depending on the "system table" destination; then bulk load your system tables instead of going record by record.
Drop the "input" table before the bulk load and re-create it, to get rid of indexes, etc.
Do not put unnecessary (key) indexes on that table before the load.
Consider switching the DB recovery model to bulk-logged mode, so that bulk transactions are not fully logged (see the sketch after this list).
Can you use an SSIS (ETL) task for loading, cleaning and validating?
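For the recovery model pointer above, a sketch in SQL Server syntax (the database name is made up); switch back and take a log backup once the load is done:

    -- Before the bulk load: minimally log bulk operations
    ALTER DATABASE MyEtlDb SET RECOVERY BULK_LOGGED;

    -- ...run the bulk load...

    -- After the load: restore full logging
    ALTER DATABASE MyEtlDb SET RECOVERY FULL;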
UPDATE:
Here is a typical ETL scenario -- well, depends on who you talk to.
1. Extract to flat_file_1 (you have that)
2. Clean flat_file_1 --> SSIS --> flat_file_2 (you can validate here)
3. Conform flat_file_2 --> SSIS --> flat_file_3 (apply all company standards)
4. Deliver flat_file_3 --> SSIS (bulk) --> db.ETL.StagingTables (several, one per your destination)
4B. insert into destination_table select * from db.ETL.StagingTable (bulk load your final destination)
This way, if a process (1-4) times out you can always restart from the intermediate file. You can also inspect each stage and create report files from SSIS for each stage to control your data quality. Operations 1-3 are essentially slow; here they happen outside of the database and can be done on a separate server. If you archive flat_file(1-3) you also have an audit trail of what's going on -- good for debugging too. :)
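For steps 4 and 4B, a minimal sketch in SQL Server syntax (the file path, table names and column layout are made up):

    -- 4.  Bulk load the conformed flat file into the staging table
    BULK INSERT db.ETL.StagingTable
    FROM 'D:\etl\flat_file_3.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2, TABLOCK);

    -- 4B. Move the staged rows into the final destination in one set-based insert
    INSERT INTO destination_table
    SELECT * FROM db.ETL.StagingTable;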