I would like to understand the following in the Pentaho environment:
1) What is a rowset? Is it just a collection of records from the input step, or what is the exact meaning?
I see in the Transformation settings section that the "No. of rows in rowset" value defaults to
10,000. What is the optimum value? For example, if my input step delivers 100 rows, what should the value be here, and if the input data set
is greater than 10,000 rows, how will that affect performance?
2) The "Manage thread priorities" option: how does this work for the above scenario?
Hi, how are you? I'll try to help with some explanations, but maybe someone can improve them later.
First of all, the most important thing to remember when designing a transformation is that (most of the time) all steps run in parallel. So in that scenario, how do you control the rows flowing through to make sure they are processed in the shortest time? The two options you pointed out are the keys to solving that.
Row set size
Every step has its own row set. It's like a sign saying "Max. persons allowed inside: 10,000", but instead of persons there are rows. So when one step has the maximum allowed rows inside, it locks the door and doesn't let any rows in until some rows get out on the other side.
That's the main concept, but you may find steps that work differently, like the Blocking step, Sort rows, Memory Group by, etc. They have to work differently because of their function (Sort rows needs to read all of its input before it can guarantee the output is ordered correctly).
Manage thread priorities
Remember that all the Pentaho transformation steps run in parallel? And that the row set may cause a step to lock its doors and let no one in until it gets someone out? Well, if all the steps have the same priority, that can cause a transformation to block all the time and take too long to run, and that's where thread priorities come in. If that flag is enabled, you let Pentaho decide that a specific step should use more CPU and/or memory to finish its job quickly and let other rows come in.
OK, with that said, what's the best row set size to use?
A tricky question indeed. It will depend on how many rows you process and how you designed your transformations (some designs may block rows more than others). Usually I test a lot of configurations to make sure I'm running my transformation with the best performance possible.
In some cases I use 300,000 rows, 5,000,000 rows and even 500 rows. Some people (and the official wiki) don't encourage using a high row set size:
In a lot of cases a smaller row set size actually improves performance since it forces rows through all steps of a (parallel executing) transformation.
But in the end, you should test until you find a good setup. =)
I hope this helps.
I have read many articles and posts about how a cursor is a massive performance hindrance compared to the equivalent single set-based query.
However, with a cursor, you are able to perform the desired operation successfully on all rows that did not err, and provide an error message for each row that did.
Is there some other way I can achieve this row granularity with set operations?
No, a set-based operation works - as the name tells us - on a set. It will succeed or fail as a whole.
A CURSOR (or any other procedural approach like WHILE or an external program) can be the best choice in this case.
If performance matters, I would prefer to use a tolerant staging table for the first set-based import. Then do some quality/cleaning actions there to ensure a successful transfer, and shift the cleaned data into your target tables (set-based again).
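A rough sketch of that pattern, in generic SQL with made-up table and column names (STAGE_ORDERS is the permissive staging table, ORDERS the constrained target):

-- Tolerant set-based import into the staging table (nullable columns, no constraints).
INSERT INTO stage_orders (order_id, customer_id, amount, order_date)
SELECT order_id, customer_id, amount, order_date FROM source_orders;

-- Quality/cleaning pass: flag rows that would violate the target's rules
-- (assume is_valid defaults to 1).
UPDATE stage_orders SET is_valid = 0
WHERE customer_id IS NULL OR amount < 0;

-- Shift only the clean rows into the target, again set-based.
INSERT INTO orders (order_id, customer_id, amount, order_date)
SELECT order_id, customer_id, amount, order_date FROM stage_orders WHERE is_valid = 1;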
This depends on the data, your business rules and - of course - the amount of rows.
I have created an ETL process with Pentaho that selects data from a table in one database and loads it into another database.
The main problem I am facing is that it takes 6 hours for 1,500,000 rows. The full table has 15,000,000 rows, and I have to load 5 tables like that.
Can anyone explain how one is supposed to load a large volume of data with Pentaho?
Thank you.
I never had a problem with volume in Pentaho PDI. Check the following, in order.
Check whether the problem really comes from Pentaho: what happens if you drop the query into SQL Developer, Toad or your fancy JDBC-compliant SQL IDE?
In principle, PDI is meant to import data with a SELECT * FROM ... WHERE ... and do all the rest in the transformation. I have a set of transformations here that used to take hours to execute because they ran complex queries. The problem was not PDI but the complexity of the queries. The solution is to move the GROUP BY and the SELECT FROM (SELECT ...) into PDI steps, which can start working before the query result is complete. The result was something like 4 hours down to 56 seconds. No joke.
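To illustrate the idea (hypothetical table and column names), instead of pushing the whole aggregation to the database like this:

SELECT customer_id, SUM(amount) AS total
FROM (SELECT customer_id, amount FROM sales WHERE sale_date >= DATE '2014-01-01')
GROUP BY customer_id;

you leave only the plain select in the Table Input step:

SELECT customer_id, amount FROM sales WHERE sale_date >= DATE '2014-01-01';

and do the aggregation in a Sort rows + Group by step (or a Memory Group by) inside the transformation, so rows start streaming into PDI as soon as the database produces them.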
What is your memory size? It is defined in spoon.bat / spoon.sh.
Near the end there is a line that looks like PENTAHO_DI_JAVA_OPTIONS="-Xms1024m" "-Xmx4096m" "-XX:MaxPermSize=256m". The important parameter is -Xmx.... If it is -Xmx256K, your JVM has only 256 KB of RAM to work with.
Change it to 1/2 or 3/4 of the available memory, in order to leave room for the other processes.
Is the output step the bottleneck? Check by disabling it and watching your clock during the run.
If it turns out to be the slow part, increase the commit size and allow batch inserts.
Disable all the indexes and constraints and restore them once the data is loaded. There are nice SQL script executor steps to automate that, but test the scripts manually first and then in a job, otherwise the index rebuild may trigger before the load begins.
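For example, if the target happens to be Oracle, the before/after scripts could look roughly like this (index and constraint names are placeholders):

-- Before the load:
ALTER TABLE target_table DISABLE CONSTRAINT target_table_fk1;
ALTER INDEX target_table_ix1 UNUSABLE;
ALTER SESSION SET skip_unusable_indexes = TRUE;

-- After the load:
ALTER INDEX target_table_ix1 REBUILD;
ALTER TABLE target_table ENABLE CONSTRAINT target_table_fk1;

Other databases have their own equivalents; the point is to pay the index maintenance cost once at the end instead of on every inserted row.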
You also have to check that you do not lock yourself out: as PDI launches all the steps together, you may have truncates that are waiting for another truncate to release its lock. Even if it is not a never-ending deadlock, it may take quite a while before the database is able to cascade everything.
There's no fixed answer covering all possible performance issues. You'll need to identify the bottlenecks and solve them in your environment.
If you look at the Metrics tab while running the job in Spoon, you can often see at which step the rows/s rate drops. It will be the one with the full input buffer and empty output buffer.
To get some idea of the maximum performance of the job, you can test each component individually.
Connect the Table Input to a dummy step only and see how many rows/s it reaches.
Define a Generate Rows step with all the fields that go to your destination and some representative data and connect it to the Table Output step. Again, check the rows/s to see the destination database's throughput.
Start connecting more steps/transformations to your Table Input and see where performance goes down.
Once you know your bottlenecks, you'll need to figure out the solutions. Bulk load steps often help the output rate. If network lag is holding you back, you might want to dump data to compressed files first and copy those locally. If your Table input has joins or where clauses, make sure the source database has the correct indexes to use, or change your query.
I have a ZODB catalog query with a start and end date. I want to sort the result on end_date first and then start_date second.
Sorting on either end_date or start_date works fine.
I tried with a tuple (start_date,end_date), but with no luck.
Is there a way to achieve this, or does one have to employ some custom logic afterwards?
The generalized answer ought to be: post-hoc sort your entire result set of catalog brains, using zope.sequencesort (available via PyPI, but already shipped with Plone) or something similar.
The more complex answer is a rabbit-hole of optimizations that you should only go down if you know you need to and know what you are doing:
Make sure when you do sort the brains that your user gets a sticky session to the same instance, at least for cache-affinity to get the same catalog indexes and brains (metadata);
You might want to cache across requests (thread-global), under a unique session id, the sequence of catalog RID (integer) values for your entire sorted request, should you expect the user to come back and need subsequent batches. Of course, RIDs need to be re-constituted into ZCatalog's lazy sequences of brains, and this requires some know-how (or reading the source).
Finally, for large result sets (many thousands), I would suggest that it is reasonable to make application-specific compromises that approximate a correct sort by post-hoc sorting from the current batch through to the end of the n batches after it, where n is inversely proportional to len(site.portal_catalog.uniqueValuesFor(indexnamehere)). For a large set of results, the correctness of an approximated secondary sort is high when the secondary value has high variability, and low when it has low variability (many items with the same secondary value, such that their count is much larger than the batch size, can make this frustrating).
Do not optimize as such unless you are dealing with particularly large result sets.
It should go without saying: if you do optimize, you need to verify that you are actually getting a superior result (profile and benchmark). If you cannot justify investing the time to do this, you cannot justify optimizing.
I have a program that needs to run queries on a number of very large Oracle tables (the largest with tens of millions of rows). The output of these queries is fed into another process which (as a side effect) can record the progress of the query (i.e., the last row fetched).
It would be nice if, in the event that the task stopped half way through for some reason, it could be restarted. For this to happen, the query has to return rows in a consistent order, so it has to be sorted. The obvious thing to do is to sort on the primary key; however, there is probably going to be a penalty for this in terms of performance (an index access) versus a non-sorted solution. Given that a restart may never happen, this is not desirable.
Is there some trick to ensure consistent ordering in another way? Any other suggestions for maintaining performance in this case?
EDIT: I have been looking around and seen "order by rowid" mentioned. Is this useful or even possible?
EDIT2: I am adding some benchmarks:
With no order by: 17 seconds.
With order by PK: 46 seconds.
With order by rowid: 43 seconds.
So any order by has a savage effect on performance, and using rowid makes little difference. Accepted answer is - there is no easy way to do it.
The best advice I can think of is to reduce the chance of a problem occurring that might stop the process, and that means keeping the code simple. No cursors, no commits, no trying to move part of the data, just straight SQL statements.
Unless a complete restart would be a completely unacceptable disaster, I'd go for simplicity without any part-way restart code at all.
If you want some order and the queried data is unsorted, then you need to sort it anyway and spend some resources on that sorting.
So, there are at least two variants for optimization:
Minimize resources spent on sorting;
Query already sorted data.
For the first variant, Oracle on its own calculates the best plan to minimize data access and overall query time. It may be possible to choose a sort order that matches a unique index the optimizer already uses, but it's a very questionable tactic.
The second variant is about index-organized tables and about forcing Oracle with hints to use some specific index. It seems OK if you need to process nearly all records in a specific table, but if the query is highly selective it significantly slows the process, even on a single table.
Think about a table with a surrogate primary key that holds a 10-year transaction history. If you need data only for the previous year and you force ordering by the primary key, then Oracle needs to walk the records of all 10 years one by one to find the records belonging to that single year.
But if you need data for 9 of those years, then a full table scan may be faster than the index-based approach.
So the selectivity of your query is the key to choosing between a full table scan (with a sort on the result) and index-ordered access.
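As a rough illustration of the second variant (table, index and column names are made up), forcing ordered access through the primary key index looks something like:

SELECT /*+ INDEX_ASC(t transactions_pk) */ t.tx_id, t.tx_date, t.amount
FROM transactions t
WHERE t.tx_date >= DATE '2013-01-01'
ORDER BY t.tx_id;

Whether this beats a full scan followed by a sort depends entirely on how large a fraction of the table the WHERE clause keeps, which is exactly the selectivity point above.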
For storing results and restarting the query, a good solution is to use Oracle Streams Advanced Queuing to feed the other process.
All unprocessed messages in the queue are redirected to an exception queue, where they may be processed separately.
Because you don't specify an exact ordering for the selected messages, I suppose you need ordering only to keep track of the unprocessed part of the records. If that's true, then with AQ you don't need ordering at all and may even process records in parallel.
So, finally, from my point of view a buffered queue is what you really need.
You could skip ordering and just update the records you processed with something like SET is_processed = 'Y' or SET date_processed = sysdate. Complete restartability and no ordering.
For performance you can partition by is_processed. Yes, partition key changes might be slow, but it is all about trade-offs.
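A minimal sketch of that approach (table and column names are hypothetical):

-- Pick up only the rows a previous run has not handled yet.
SELECT * FROM big_table WHERE is_processed = 'N';

-- After the downstream process has handled a batch of rows, flag them
-- so a restart simply skips them.
UPDATE big_table
SET is_processed = 'Y', date_processed = SYSDATE
WHERE id IN (/* ids of the rows just handled */);
COMMIT;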
Which is better?
1) A cursor that loops over 30,000 records and performs the updates one by one
2) A script that contains 30,000 UPDATE commands
Thanks
Both should take about the same time, mainly subject to how the CURSOR is declared.
Reason? You have 30,000 individual updates, which is usually the main factor.
Note that 30,000 individual UPDATES in one batch will probably fail because of batch size and compile time anyway...
SQL is a set-based language and you can most likely do a single UPDATE to update all rows in one go. If you can't, it is for one of 2 reasons:
You need "per row" logic: this can usually be achieved with CASE expressions, UDFs, etc.
You don't understand sets and SQL
With more information (the SQL and logic) we could help you more...
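For example, per-row logic that you might otherwise write as a cursor loop can often be folded into one set-based statement with a CASE expression (illustrative names only):

UPDATE orders
SET discount = CASE
                 WHEN customer_category = 'GOLD'   THEN amount * 0.10
                 WHEN customer_category = 'SILVER' THEN amount * 0.05
                 ELSE 0
               END
WHERE order_date >= '2014-01-01';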
There is a very easy way to tell: Do it and measure the time.
Other than that, having 30,000 lines does not make a lot of sense when you can have just 10.
Making updates this way for reasons other than data migration or maintenance doesn't sound wise either, and in those cases performance is not the issue - but maintainability and legibility always are.
You know, that depends on context.
It helps, though, to learn - SQL, for example. You are working at too low a level to see the real optimizations possible here. SQL is a lot more than just UPDATE, INSERT and simple SELECT statements.
1) A cursor that loops over 30,000 records and performs the updates one by one
Linear, step-by-step processing. There is no way to parallelize, as SQL itself has no threading mechanisms available to the user; optimizations are per statement - i.e. the query optimizer looks at one statement at a time.
2) A script that contains 30,000 UPDATE commands
Assuming the script is external, it could split the work and run it concurrently on multiple connections, i.e. run more than one statement in parallel.
But there is more:
Make a script that calculates the new values.
Bulk import them into a temporary table using the bulk copy API.
Issue ONE UPDATE statement that takes the updated values from the temporary table to the final one.
Or maybe have the script issue a MERGE statement for the multi-row update? There are tons of variations here once you know more of the SQL API than "update, open cursor, simple select".
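A rough, T-SQL-flavoured sketch of that staging approach (all names are made up; #new_prices is filled by the external script through the bulk copy API):

CREATE TABLE #new_prices (product_id INT PRIMARY KEY, price DECIMAL(10,2));

-- ... bulk copy the 30,000 calculated values into #new_prices here ...

-- One set-based statement applies them all at once.
UPDATE p
SET p.price = n.price
FROM prices p
JOIN #new_prices n ON n.product_id = p.product_id;

-- Or, if some rows may need to be inserted as well as updated:
MERGE prices AS p
USING #new_prices AS n ON n.product_id = p.product_id
WHEN MATCHED THEN UPDATE SET p.price = n.price
WHEN NOT MATCHED THEN INSERT (product_id, price) VALUES (n.product_id, n.price);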
I do that myself - though with a lot more data (batches of 50,000, sometimes 4-6 at the same time). The one problem is that SQL bulk copy has some overhead, but I manage 75,000 inserts per second that way.
A lot depends on the business question and the complexity of the logic - if it's simple updates, the question is: calculated or externally driven? Multiplying values by 2 = calculated; updating addresses = data driven (i.e. you need the new data from somewhere).