Loading a fact table using Pentaho Data Integration - reduce runtime of a KTR - pentaho

I am using Pentaho DI to insert data into a fact table. The problem is that the source table I populate from contains 10,000 records and is growing daily.
Say the source table contains 10,000 records and 200 new records are added; when I run the KTR, it again truncates all 10,000 rows from the fact table and starts inserting the full 10,200 records.
To avoid this I unchecked the truncate option in the Table output step, made one key unique in the fact table, and checked the "Ignore insert errors" option. Now it works and inserts only the 200 new records, but it takes the same execution time.
I also tried a Stream lookup step in the KTR, but there was no change in execution time.
Can anyone help me solve this problem?
Thanks in advance.

If you need to capture all of the inserts, updates, and deletes, the Merge rows (diff) step followed by a Synchronize after merge step will do this, and typically does it very quickly.
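For intuition, this is roughly what that step pair accomplishes, expressed here as a single SQL Server MERGE statement; the table and column names (stg_sales, fact_sales, sale_id, amount) are hypothetical:

    -- Compare the staged rows with the fact table and apply the differences.
    MERGE dbo.fact_sales AS tgt
    USING dbo.stg_sales AS src
        ON tgt.sale_id = src.sale_id
    WHEN MATCHED AND tgt.amount <> src.amount THEN
        UPDATE SET amount = src.amount                  -- changed rows
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (sale_id, amount)
        VALUES (src.sale_id, src.amount)                -- new rows
    WHEN NOT MATCHED BY SOURCE THEN
        DELETE;                                         -- rows gone from the source

In PDI itself, Merge rows (diff) compares the two sorted input streams and adds a flag field (identical, changed, new, deleted), and Synchronize after merge then performs the corresponding insert, update, or delete.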


Run time optimisation of Pentaho PDI transformation

I'm running this transformation in PDI and it takes hours to load a table that contains 18,000 records.
The SQL query in the Table input step:
Is this normal, or should I try to optimize the SQL query or add more steps?
I think your SQL is inefficient. You should not cast the column; cast only the parameter. Casting a column can be expensive, because it prevents the database from using an index on that column.
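As a minimal illustration, assuming a hypothetical orders table with an indexed customer_id column:

    -- Slow: the cast runs on every row and defeats any index on customer_id.
    SELECT * FROM orders WHERE CAST(customer_id AS VARCHAR(20)) = '12345';

    -- Faster: cast only the parameter, so the indexed column is compared as-is.
    SELECT * FROM orders WHERE customer_id = CAST('12345' AS INT);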
Most likely it's your output step that is slow, not the input step.
Here's how you can check if that's the case:
Add a Dummy step between those two steps;
Disable the hop going OUT of the Dummy step;
Run the transformation.
It will now run just the read part, which should be reasonably fast. If that's the case, then it's your output step that is slow, not the input. You can also check by running the entire transformation and looking at the Input/Output value on the Step metrics. A slow step will have its input buffer full (meaning it's slower than the steps upstream) and an almost empty output buffer (slower than the steps downstream).
If you confirm it's indeed your input step that is slow: your query can't make any use of indexes, as the condition is rather complex. You may need to refactor it.
If the input step isn't the problem: the Insert/Update step is rather slow, as each row of data requires a round trip to lookup the keys before trying to either insert or update.
You can avoid that, to some extent, by doing the following:
Add a unique constraint on the keys in the output table;
Use a Table output step (which only inserts). Any rows that try to insert duplicate keys will error out.
Add an Update step after the table output. When connecting it, choose "Error handling of step" as the hop type.
With this:
any new keys will be inserted by the Table output;
any repeated keys will error out on Insert and will be sent to the Error handling hop;
The update step will then update only the rows that need updating, without trying to figure out if they should be inserted or updated.
This can significantly boost your performance. But in any case, updating large volumes of data is more often than not a slow operation.
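If it helps, adding the unique constraint is a one-liner; the table and key column names below (fact_sales, order_id, line_number) are hypothetical:

    -- Duplicate keys will now fail the insert and flow down the error-handling hop.
    ALTER TABLE fact_sales
        ADD CONSTRAINT uq_fact_sales_bk UNIQUE (order_id, line_number);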

BigQuery: updating multiple tables

I'm holding huge amounts of transaction data in daily tables, sharded by business date:
transaction_20140101
transaction_20140102
transaction_20140103
...
The process flow is:
1. I load the batch of newly arrived files into a temp table.
2. I group by the transaction_date field to determine which date each row belongs to; for each date, I query the temp table for that date and insert the rows into the proper transaction_YYYYMMDD table.
3. I run step 2 in parallel to save time, because the temp table might contain data belonging to 20 different days.
My challenge is what to do if some of these processes fail and others do not.
I can't simply run it all again, since that would create duplicates in the tables that were already updated successfully.
I currently solve this by tracking which updates succeeded, but it seems too complex.
Is this a good practice for dealing with multiple tables?
I would be glad to hear how others handle loading data into multiple tables by business date, rather than just by insert date (which is the easy case).
You could add an extra step in the middle: instead of moving directly from today's temp table into the permanent business-date tables, extract into temporary daily tables and then copy the data over to the permanent tables.
1. Query from today's temp table, sharded by day, into tmp_transaction_YYYYMMDD tables. Use the WRITE_EMPTY or WRITE_TRUNCATE write disposition so that this step is idempotent.
2. Verify that all expected tmp_transaction_YYYYMMDD tables exist. If not, debug the failures and go back to step 1.
3. Run parallel copy jobs from each tmp_transaction_YYYYMMDD table to append to the corresponding permanent transaction_YYYYMMDD table.
4. Verify that the copy jobs succeeded. If not, retry the individual failures from step 3.
5. Delete the tmp_transaction_YYYYMMDD tables.
The advantage of this is that you can catch query errors before affecting any of the end destination tables, then copy all the added data over at once. You may still have the same issue if the copy jobs fail, but individual copy jobs are easier to debug and retry.
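As a rough sketch of steps 1 and 3 in BigQuery standard SQL (the answer above uses query/copy jobs with write dispositions; CREATE OR REPLACE TABLE and INSERT ... SELECT give similar semantics), with hypothetical dataset and table names (staging, warehouse, temp_batch):

    -- Step 1: idempotent staging. CREATE OR REPLACE behaves like WRITE_TRUNCATE,
    -- so rerunning this after a failure cannot create duplicates.
    CREATE OR REPLACE TABLE staging.tmp_transaction_20140101 AS
    SELECT *
    FROM staging.temp_batch
    WHERE transaction_date = DATE '2014-01-01';

    -- Step 3: append into the permanent daily shard, only after every expected
    -- tmp_transaction_YYYYMMDD table has been verified.
    INSERT INTO warehouse.transaction_20140101
    SELECT * FROM staging.tmp_transaction_20140101;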
Our incentive for incremental loading is cost, so we are interested in touching each record only once.
We use table decorators to identify the increment. We manage the increment timestamps independently and add them to the query at run time. It requires some logic to maintain, but nothing too complicated.
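For reference, a range decorator in BigQuery legacy SQL looks roughly like this (table and column names are hypothetical; decorators only cover about the last 7 days of data):

    -- Rows added in roughly the last hour: a negative offset is milliseconds
    -- relative to now, and omitting the end of the range means "up to now".
    SELECT transaction_date, amount
    FROM [mydataset.transactions@-3600000-]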

SQL update command and table locking

I have a SQL table and a VB.NET application.
The application loads the SQL table into a DataTable and then starts updating records by fetching some websites; it takes an average of 1.4 seconds to fill a DataTable row with new data.
I was wondering whether it is OK to use the SQL UPDATE command to update a single record in the SQL table each time a record is updated, which means running an UPDATE for a single record every 1.4 seconds.
The problem is that other applications use this table at the same time, and one of them writes to the same table but to other columns. Will the table get locked for the other applications during this process?
SQL won't lock the whole table by default, but you should probably lock it while updating if those other apps are making alterations, to prevent data corruption. Performance will take a small hit, yes, but better that than having to rebuild the table because it got messed up. This is a good explanation of locking:
http://www.developerfusion.com/article/84509/managing-database-locks-in-sql-server/
If the other applications are just querying the table while you're updating, there shouldn't be any impact, but they might get some odd results if they query it mid-update. Locking is mainly about the risk of two sessions modifying the same record at the same time.
You need to find out why it takes 1.4 seconds to update a single record. Chances are it's because VB.NET needs to do some processing (while it fetches the websites). For example, it could be taking 1.3 seconds to do the necessary work on the client side and 0.1 seconds to update a single record on the server. In that case, you could perform the updates in batches to minimize database access time.
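A minimal T-SQL sketch of the batching idea, assuming hypothetical sites and sites_staging tables: the application accumulates the fetched values into the staging table, then applies them in one set-based statement instead of one round trip per row.

    -- Apply all pending changes at once.
    UPDATE t
    SET    t.page_title   = s.page_title,
           t.last_checked = s.last_checked
    FROM   dbo.sites AS t
    JOIN   dbo.sites_staging AS s
           ON s.site_id = t.site_id;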
The table will get locked, but only for a short time, so in general you don't need to worry about that.

Oracle - truncating a global temporary table

I am processing large amounts of data in iterations; each iteration processes around 10,000-50,000 records. Because of this large number of records, I insert them into a global temporary table first and then process them. Usually, each iteration takes 5-10 seconds.
Would it be wise to truncate the global temporary table after each iteration so that each iteration can start with an empty table? There are around 5,000 iterations.
No! The whole idea of a global temporary table is that the data disappears automatically when you no longer need it.
For example, if you want the data to disappear when you COMMIT, use the ON COMMIT DELETE ROWS option when originally creating the table.
That way you don't need a TRUNCATE: you just COMMIT, and the table is fresh, empty, and ready to be reused.
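A minimal sketch, with a hypothetical table name and columns:

    -- Rows inserted during an iteration vanish automatically at COMMIT,
    -- so no TRUNCATE is needed between iterations.
    CREATE GLOBAL TEMPORARY TABLE gtt_batch_rows (
        id      NUMBER,
        payload VARCHAR2(4000)
    ) ON COMMIT DELETE ROWS;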
5,000 iterations of up to 50,000 records per run? If you need to do that much processing, surely you can optimise your processing logic to run more efficiently. That would give you more speed than truncating tables.
However, if you are finished with the data in a temp table, you should truncate it, or at least ensure that the next process to use the table isn't re-processing the same data over again.
E.g. have a 'processed' flag, so new processes don't use the existing data (a sketch follows below),
OR
remove data when it is no longer needed.
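A minimal sketch of the 'processed' flag idea, assuming a hypothetical staging_rows table with a processed column:

    -- Each run picks up only unprocessed rows ...
    SELECT id, payload
    FROM   staging_rows
    WHERE  processed = 'N';

    -- ... and marks them afterwards so the next run skips them.
    UPDATE staging_rows
    SET    processed = 'Y'
    WHERE  processed = 'N';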

Query Performance help

I have a long-running job. The records to be processed are in a table with around 100K records.
During the whole job, whenever this table is queried, the query runs against all 100K records.
After processing, the status of each record is updated in the same table.
I want to know whether it would be better to add another table where I track record status and keep deleting records as they are processed, so that as the job progresses the number of records in the master table decreases and query performance improves.
EDIT: The master table is used only for this load. I receive a flat file, which I upload as-is before processing. After running validations on this table, I pick one record at a time and move the data to the appropriate system tables.
I had a similar performance problem where a table generally has a few million rows but I only need to process what has changed since my last execution. My target table has an IDENTITY column, so when my batch process begins I select the set of rows whose IDs are greater than the bookmark from my previous batch execution and note the highest IDENTITY value in that set. Upon successful completion of the batch job, I add a record to a separate bookmark table recording that highest IDENTITY value, and use it as the starting point for the next batch invocation. (My bookmark table is general purpose, so multiple jobs use it, each with a unique job name.)
If you are experiencing locking issues because your per-record processing time is long, you can use the same approach but break your sets into chunks of 1,000 rows (or whatever chunk size your system can process in a timely fashion), so you're only locking smaller sets at any given time.
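A minimal T-SQL sketch of that high-water-mark pattern; the table and column names (job_bookmark, master_table, last_processed_id, 'load_master') are hypothetical:

    -- Where did the previous successful run stop? (0 on the first run)
    DECLARE @last_id BIGINT =
        ISNULL((SELECT MAX(last_processed_id)
                FROM dbo.job_bookmark
                WHERE job_name = 'load_master'), 0);

    -- Process only the rows added since then.
    SELECT id, payload
    FROM   dbo.master_table
    WHERE  id > @last_id;

    -- After the batch completes successfully, record the new high-water mark.
    INSERT INTO dbo.job_bookmark (job_name, last_processed_id, run_date)
    SELECT 'load_master', MAX(id), GETDATE()
    FROM   dbo.master_table
    WHERE  id > @last_id;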
A few pointers (my two cents):
Consider splitting that table, similar to the "slowly changing dimension" technique, into a few "intermediate" tables, one per "system table" destination; then bulk load your system tables instead of loading record by record.
Drop the "input" table before the bulk load and re-create it, to get rid of indexes, etc.
Do not put unnecessary keys or indexes on that table before the load.
Consider switching the DB recovery model to bulk-logged mode, so that bulk transactions are not fully logged (see the sketch after this list).
Could you use an SSIS (ETL) task for the loading, cleaning and validating?
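A minimal sketch of the recovery-model switch in SQL Server, assuming a hypothetical StagingDB database; remember to switch back afterwards, since point-in-time recovery is limited while in BULK_LOGGED:

    -- Minimally log bulk operations for the duration of the load.
    ALTER DATABASE StagingDB SET RECOVERY BULK_LOGGED;

    -- ... run the bulk load here ...

    -- Restore full logging (and take a log backup) once the load is done.
    ALTER DATABASE StagingDB SET RECOVERY FULL;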
UPDATE:
Here is a typical ETL scenario -- well, it depends on who you talk to.
1. Extract to flat_file_1 (you have that).
2. Clean: flat_file_1 --> SSIS --> flat_file_2 (you can validate here).
3. Conform: flat_file_2 --> SSIS --> flat_file_3 (apply all company standards).
4. Deliver: flat_file_3 --> SSIS (bulk) --> db.ETL.StagingTables (several, one per destination).
4B. INSERT INTO destination_table SELECT * FROM db.ETL.StagingTable (bulk load your final destination).
This way, if a step (1-4) times out, you can always restart from the intermediate file. You can also inspect each stage and have SSIS produce report files for each stage to control your data quality. Operations 1-3 are essentially slow; here they happen outside of the database and can run on a separate server. If you archive flat_file(1-3) you also have an audit trail of what's going on -- good for debugging too. :)
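If the bulk delivery in step 4 were done in T-SQL rather than from an SSIS data flow, it might look roughly like this (the file path, table name, and options are hypothetical):

    -- Load the conformed flat file straight into a staging table.
    BULK INSERT db.ETL.StagingOrders
    FROM 'C:\etl\flat_file_3.csv'
    WITH (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR   = '\n',
        FIRSTROW        = 2,      -- skip the header row
        TABLOCK                   -- allows minimal logging under BULK_LOGGED
    );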