Pentaho Data Integration Import large dataset from DB

I'm trying to import a large set of data from one DB to another (MSSQL to MySQL).
The transformation does this: it gets a subset of the data, checks whether each row is an update or an insert by comparing hashes, maps the data and inserts it into the MySQL DB with an API call.
The subset part is strictly manual for the moment. Is there a way to make Pentaho do it for me, some kind of iteration?
The query I'm using to get the subset is
select t1.*
from (
select *, ROW_NUMBER() over (order by id) as RowNum
from mytable
) t1
where RowNum between #offset and #offset + #limit;
Is there a way for PDI to set the offset and repeat the whole process?
Thanks

You can (despite the warnings) create a loop in a parent job, incrementing the offset variable each iteration in a JavaScript step. I've used such a setup to consume web services with an unknown number of results, shifting the offset each time after I get a full page and stopping when I get less.
Setting up the variables
In the job properties, define parameters Offset and Limit, so you can (re)start at any offset, or even invoke the job from the command line with a specific offset and limit. It can be done with a variables step too, but parameters do all the same things, plus you can set defaults for testing.
Processing in the transformation
The main transformation(s) should have "pass parameter values to subtransformation" enabled, as it is by default.
Inside the transformation (see lower half of the image) you start with a Table Input that uses variable substitution, putting ${Offset} and ${Limit} where you have #offset and #limit.
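As a minimal sketch (assuming the mytable/id names from the question), the Table Input query could look like this once variable substitution is enabled in the step; the bounds are written so each pass returns exactly ${Limit} rows, which matches the exit test described below:
-- Table Input query, with variable substitution enabled in the step
select t1.*
from (
  select *, ROW_NUMBER() over (order by id) as RowNum
  from mytable
) t1
where RowNum between ${Offset} + 1 and ${Offset} + ${Limit};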
The stream from the Table Input then goes on to processing, but is also copied to a Group By step for counting rows. Leave the group field empty and create a field that counts all rows. Check the box to always give back a result row.
Send the stream from Group By to a Set Variables step and set the NumRows variable in the scope of the parent job.
Looping back
In the main job, go from the transformations to a Simple Evaluation step to compare the NumRows variable to the Limit. If NumRows is smaller than ${Limit}, you've reached the last batch, success!
If not, proceed to a Javascript step to increment the Offset like this:
// read the current values, increment the offset by one page, and write it back for the next iteration
var offset = parseInt(parent_job.getVariable("Offset"), 10);
var limit = parseInt(parent_job.getVariable("Limit"), 10);
offset = offset + limit;
parent_job.setVariable("Offset", offset);
true;
The job flow then proceeds to the dummy step and then the transformation again, with the new offset value.
Notes
Unlike in a transformation, you can set and use a variable within the same job.
The JS step needs "true;" as the last statement so it reports success to the job.

Related

SAP HANA Sequence reset by Max or Max +1

In SAP HANA we use sequences.
However, I am not sure what to define for the RESET BY clause:
do I use select max(ID) from tbl or select max(ID) + 1 from tbl?
Recently we got a unique constraint violation on the ID field.
The sequence is currently defined with reset by select max(ID) from tbl.
Also, is it even better to avoid the RESET BY option altogether?
The common logic for the RESET BY clause is to check the current value (max(ID)) and add an offset (e.g. +1) to avoid a double allocation of a key value.
Not using the option effectively disables the ability to automatically set the current sequence value to a value that will not collide with existing stored values.
To provide some context: usually the sequence number generator uses a cache (even though it's not set up by default) to allow for high-speed consumption of sequence numbers.
In case of a system failure, the numbers in the cache that have not yet been consumed are "lost", in the sense that the database doesn't retain, in a recoverable fashion, the information about which numbers from the cache had already been handed out.
By using the RESET BY clause, the "loss" of numbers can be reduced as the sequence gets set back to the last actually used sequence number.
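A minimal sketch of the max(ID) + 1 variant, assuming a table tbl with key column ID and a made-up sequence name (the IFNULL guards against an empty table):
-- on restart, continue just past the highest ID actually stored
CREATE SEQUENCE tbl_id_seq START WITH 1
  RESET BY SELECT IFNULL(MAX(ID), 0) + 1 FROM tbl;
With the +1, a restart positions the sequence's next value beyond the highest stored ID, which avoids the unique constraint violation described in the question.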

Performance for selecting multiple out-params from deterministic SUDF

I am about to test the DETERMINISTIC flag for SUDFs that return multiple values (follow-up question to this). The DETERMINISTIC flag should cache the results for the same inputs to improve performance. However, I can't figure out how to do this for multiple return values. My SUDF looks as follows:
CREATE FUNCTION DET_TEST(IN col BIGINT)
RETURNS a int, b int, c int, d int DETERMINISTIC
AS BEGIN
a = 1;
b = 2;
c = 3;
d = 4;
END;
Now when I execute the following select statements:
1) select DET_TEST(XL_ID).a from XL;
2) select DET_TEST(XL_ID).a, DET_TEST(XL_ID).b from XL;
3) select DET_TEST(XL_ID).a, DET_TEST(XL_ID).b,
DET_TEST(XL_ID).c, DET_TEST(XL_ID).d from XL;
I get the corresponding server processing times:
1) Statement 'select DET_TEST(XL_ID).a from XL'
successfully executed in 1.791 seconds (server processing time: 1.671 seconds)
2) Statement 'select DET_TEST(XL_ID).a, DET_TEST(XL_ID).b from XL'
successfully executed in 2.415 seconds (server processing time: 2.298 seconds)
3) Statement 'select DET_TEST(XL_ID).a, DET_TEST(XL_ID).b, DET_TEST(XL_ID).c, ...'
successfully executed in 4.884 seconds (server processing time: 4.674 seconds)
As you can see the processing time increases even though I call the function with the same input. So is this a bug or is it possible that only a single value is stored in cache but not the whole list of return parameters?
I will try out MAP_MERGE next.
I did some tests with your scenario and can confirm that the response time goes up considerably with every additional result parameter retrieved from the function.
The DETERMINISTIC flag helps here, but not as much as one would hope for, since only the result values for distinct input parameters are saved.
So, if the same value(s) are fed into the function and it has been executed with these value(s) before, the result is taken from a cache.
This cache, however, is only valid during a statement. That means: for repeated function evaluations with the same value during a single statement, the DETERMINISTIC function can skip the evaluation of the function and reuse the result.
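As a hypothetical illustration, reusing the DET_TEST function from the question: if the function is called with a constant instead of a column, every row supplies the same input value, so within that one statement the cached result can be reused; with a column of distinct values, each distinct input gets its own evaluation:
-- same input for every row: evaluated once, then served from the statement-local cache
SELECT DET_TEST(42).a FROM XL;
-- distinct XL_ID values: one evaluation per distinct input
SELECT DET_TEST(XL_ID).a FROM XL;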
This doesn't mean that all output parameters get evaluated once and are then available for reuse. With different output parameters, HANA practically has to execute different evaluation graphs. In that sense, asking for different parameters is closer to executing different functions than, say, calling a matrix operation.
So, sorry about raising the hope for a massive improvement with DETERMINISTIC functions in the other thread. At least for your use case, that doesn't really help a lot.
Concerning the MAP_MERGE function, it's important to see that this really helps with horizontal partitioning of data, like one would have it in e.g. classic map-reduce situations.
The use case you presented is actually not doing that but tries to create multiple results for a single input.
During my tests, I actually found it quicker to just define four independent functions and call those in my SELECT statement against my source table.
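A minimal sketch of that four-function alternative, reusing the definitions from the question (the names are made up; with a single output parameter the function result can be used directly in the select list):
CREATE FUNCTION DET_TEST_A(IN col BIGINT) RETURNS a INT DETERMINISTIC
AS BEGIN a = 1; END;

CREATE FUNCTION DET_TEST_B(IN col BIGINT) RETURNS b INT DETERMINISTIC
AS BEGIN b = 2; END;

-- DET_TEST_C and DET_TEST_D analogously

SELECT DET_TEST_A(XL_ID) AS a, DET_TEST_B(XL_ID) AS b FROM XL;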
Depending on the complexity of the calculations you'd like to do and the amount of data, I would probably look into using the Application Function Library (AFL) SDK for SAP HANA. For details on this, one has to check the relevant SAP notes.

Target Based commit point while updating into table

One of my mappings is running for a really long time (2 hours). From the session log I can see the statement "Timeout based commit point", which is taking most of the time, and the busy percentage for the SQL transformation is very high (I ran the SQL query manually in the DB and it works fine). Basically there is a Router which splits the records between insert and update, and the update stream is the one taking long. It has an SQL transformation, an Update Strategy and an Aggregator. I added a Sorter before the Aggregator, but no luck.
I also changed the commit interval, Line Sequential Buffer Length and Maximum Memory Allowed after checking some other blogs. Could you please help me with this?
If possible, try to avoid the transformations which create a cache, because if the number of input records grows in the future, the cache size will grow with it and the throughput will drop.
1) Aggregator: try to do the aggregation in the SQL override itself
2) Sorter: try to do the sorting in the SQL override itself
Generally the SQL transformation is slow for huge data loads, because for each input record an SQL session is invoked, a connection to the database is established and the row is fetched. If, for example, there are 1 million records, 1 million SQL sessions are initiated in the backend and the database is called each time.
What is the SQL transformation doing? Is it just generating a surrogate key, or is it fetching a value from a table based on a value derived from the stream?
For fetching a value from a table based on a value derived from the stream:
try to use a Lookup instead.
For generating a surrogate key, use an Oracle sequence instead (see the sketch below).
Let me know if its purpose is anything other than that.
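A minimal sketch of the sequence approach, assuming an Oracle target and a made-up sequence name; the NEXTVAL call replaces the per-row work of the SQL transformation:
-- created once in the target schema
CREATE SEQUENCE emp_sk_seq START WITH 1 INCREMENT BY 1 CACHE 1000;
-- drawn wherever the surrogate key is needed, e.g. in a SQL override
SELECT emp_sk_seq.NEXTVAL FROM dual;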
Also do the below checks
Sort the session log on thread and just make a note of the start and end times of the following:
1) Lookup cache creation (time between "Query issued" --> "First row returned" --> "Cache creation completed")
2) Reader thread first row return time
Regards,
Raj

looping in a Kettle transformation

I want to repeatedly execute an SQL query that looks like this:
SELECT '${date.i}' AS d,
COUNT(DISTINCT xid) AS n
FROM table
WHERE date
BETWEEN DATE_SUB('${date.i}', INTERVAL 6 DAY)
AND '${date.i}'
;
It is basically a grouping by time spans, except that these overlap, which prevents the use of GROUP BY.
That is why I want to execute the query repeatedly, once for every day in a certain time span. But I am not sure how I should implement the loop. What solution would you suggest?
The Kettle variable date.i is initialized from a global variable. The transformation is just one of several in the same transformation bundle. The "stop trafo" condition would perhaps be implemented implicitly by simply not re-entering the loop.
Here's the flow chart:
Flow of the transformation:
In step "INPUT" I create a result set with three identical fields keeping the dates from ${date.from} until ${date.until} (Kettle variables). (for details on this technique check out my article on it - Generating virtual tables for JOIN operations in MySQL).
In step "SELECT" I set the data source to be used ("INPUT") and that I want "SELECT" to be executed for every row in the served result set. Because Kettle maps parameters 1 on 1 by a faceless question-mark I have to serve three times the same paramter - for each usage.
The "text file output" finally outputs the result in a generical fashion. Just a filename has to be set.
Content of the resulting text output for 2013-01-01 until 2013-01-05:
d;n
2013/01/01 00:00:00.000;3038
2013/01/02 00:00:00.000;2405
2013/01/03 00:00:00.000;2055
2013/01/04 00:00:00.000;2796
2013/01/05 00:00:00.000;2687
I am not sure if this is the slickest solution but it does the trick.
In Kettle you want to avoid loops, as they can cause real trouble in transformations. Instead you should do this by adding a step that puts a row in the stream for each date you want (with the value stored in a field) and then using that field value in the query.
ETA: The stream is the thing that moves rows (records) between steps. It may help to think of it as consisting of a table at each hop that temporarily holds rows between steps.
You want to avoid loops because a Kettle transformation is only sequential at the row level: rows may be processed in parallel and out of order, and the only guarantee is that each row passes through the steps in order. Because of this, a loop in a transformation does not behave as you would intuitively expect.
FYI, it also sounds like you might need to go through some of the Kettle tutorials if you are still unclear about what the stream is.

Can we access user variable in query of Source in DFT?

I am working on optimizing a Data Flow Task. I have an ADO.NET source firing a query like the one below.
Select Emp_id, EmpName, Salary from Employee.
After the source I have a Derived Column transform which adds a derived column with the user variable value @[User::TestVariable].
Now I guess this Derived Column transform takes at least some time, so I was wondering if I can save that time by doing something like the below at the source.
Select Emp_id, EmpName, Salary, DerivColumn as @[User::TestVariable]
from Employee
Is it possible to do something of this kind? If yes, how?
Above is the DFT I am working on. How can I find out which component took how much time, so I can look into optimizing that?
You can use the variable in the ADO.NET Source.
1. In the property window of the DFT task, click the Expressions property and select the ADO.NET Source SQL command.
2. In the expression, write your SQL query:
"Select LoginId, JobTitle, " + (DT_WSTR, 10) @[User::TestVariable] + " as DerivedColumn from HumanResources.Employee"
I don't think that your Derived Column is adding any overhead, as it is a non-blocking component (but there are some exceptions to that).
In order to find the speed of the individual components:
1. Calculate the overall execution time for the package, which you can find in the execution results tab.
Overall Execution Speed = Source Speed + Transformation Speed
2. Remove the derived column component and connect the source directly to the row transformation. Now check the execution time again; this gives you the source speed.
Overall Execution Speed - Source Speed = Transformation Speed
SSIS is an in-memory pipeline, so all its transformations occur in memory. It relies heavily on buffers. In your case, the SSIS buffer carries 196,602 rows. This value is controlled by 2 properties,
DefaultBufferMaxRows and DefaultBufferSize; the maximum buffer size is 100 MB. Now you need to calculate the estimated row size by adding up the column sizes in your table. Suppose adding up your data type lengths comes to around 40 bytes; then the amount in bytes for 196,602 rows is
196,602 * 40 = 7,864,080 bytes, roughly 7.5 MB,
which is less than the DefaultBufferSize of 10 MB. You can try increasing DefaultBufferMaxRows to increase the speed, but then again you need to do all your performance testing before coming to a conclusion.
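If the source is SQL Server, a rough (upper-bound) way to estimate the bytes per row from catalog metadata is sketched below; max_length ignores actual variable-length usage and reports -1 for MAX types, so treat the result as an approximation:
-- upper-bound estimate of bytes per row for the source table
SELECT SUM(max_length) AS est_row_bytes
FROM sys.columns
WHERE object_id = OBJECT_ID('HumanResources.Employee');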
I suggest you read this article to get a complete picture about SSIS performance
So you wish to add a new column to your dataset with a fixed value (contained in @[User::TestVariable]) to be inserted later into a destination, right? No, you can't do what you are thinking, because the scope is the database (where you execute the query) while the variable lives in the package.
Are you sure this Derived Column operation is taking that long? It shouldn't. If it is, you could use an Execute SQL Task to insert this value into a temp table in the DB and then use it in your query:
declare @aux int
select @aux = your_config_column from your_temp_table
Select Emp_id, EmpName, Salary, @aux as DerivColumn
from Employee
it is kind of a messy solution, but it is worth it if the derived column is really taking that long