pentaho set variable to jobs

I am new to Pentaho. I have a job with 3 transformations, and all 3 transformations are similar. Each transformation has a SQL query something like
select * from table1 where table1.col1='XXX' and table1.col2='YYYY'
The value of col1 stays the same in all of them. I want to pass it as a variable from the job instead of replacing it in each transformation. What are the steps to do that?

You split your transformation in two:
one transformation sets the variable;
the second uses the variable set in step #1.
Please refer to the Pentaho documentation available online:
http://wiki.pentaho.com/display/COM/Using+Variables+in+Kettle
http://wiki.pentaho.com/display/EAI/Set+Variables
You cannot use variables in the same transformation where you set them.
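For instance, in the second transformation a Table Input step (with "Replace variables in script?" checked) can reference the variable; the variable name COL1_VALUE below is just an example:

-- Table Input step, "Replace variables in script?" checked.
-- Kettle substitutes ${COL1_VALUE} before the query is sent to the database.
select * from table1 where table1.col1 = '${COL1_VALUE}' and table1.col2 = 'YYYY'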

Related

pentaho spoon transformation to delete data from table

I'm trying to write an ETL job using the Pentaho Data Integration tool, in Spoon. I used the "Delete" step and provided the target table details, but the rows are not getting deleted, and there is no error. I have access to the schema. Please suggest.
In order to use the "Delete" step, you first need a data source from which PDI will read the keys to look for in the table. In my example, the first step queries the origin table for a list of ids to be deleted and then passes them to the Delete step as the keys for the delete condition.
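A minimal sketch of that layout (table, field, and condition names are illustrative):

Table Input  -->  Delete
Table Input SQL:  select id from origin_table where purge_flag = 'Y'
Delete step:      Target table: target_table,  key condition: id = id
-- for each incoming row, PDI issues: DELETE FROM target_table WHERE id = ?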

Table output name from command line in pentaho kettle

There is a case in my ETL where I am trying to take the "Table Output" name from the command line. The table name does not correspond to any stream field's name. Is there any way to get this done in Pentaho Kettle?
Pentaho DI is a metadata-based tool. I assume you are trying to pass the output table name from the command line, like below:
.../pan.sh -file:"/home/user/sample.ktr" -param:table_output=SOMETABLE
Assuming that command is what you are running:
Firstly, change the transformation settings of sample.ktr (just an example) and add the parameter name "table_output" to the Parameters section.
Next, in the Table Output step, use this parameter in the format ${table_output} in place of the table name. This should solve your query.
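In short (the parameter name matches the command above):

Transformation settings -> Parameters:  table_output  (optionally with a default value)
Table Output step -> Target table:     ${table_output}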
In case you are passing the parameters to a job: as mentioned above, the first section (adding the parameters) remains the same.
Next, for a separate transformation (.ktr) file inside a job, double-click the ktr entry (from the job file) and you will find a Parameters section; add the parameters there.
Thirdly, inside the .ktr file, repeat the step from the first section and use a Set Variables or Table Output step. A Set Variables step will ensure that the parameter is available across the entire job. It mostly depends on your requirement.
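To run the job with the parameter from the command line, kitchen.sh takes the same -param syntax (the file name here is just an example):

.../kitchen.sh -file:"/home/user/sample.kjb" -param:table_output=SOMETABLE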
Hope it helps :)
This should give you an idea of how to do it. Since transformations are just XML, you can read the metadata from them: basically, you find the Table Output step and set its target table from a variable, in this case "TABLE".
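As a sketch of reading that metadata, assuming a standard .ktr layout where the Table Output step stores its target table in a <table> element, something like xmllint can pull it out:

# illustrative only: print the Table Output step's target table from the .ktr XML
xmllint --xpath '//step[type="TableOutput"]/table/text()' /home/user/sample.ktr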

SSIS Script Component - only to change variables

I have a series of tasks that are very similar:
SELECT a,b FROM c
Look up the value in another table and change the value in column b.
Save the new value back to c; if there is no match, send the row on to an error table.
That part is pretty straightforward and illustrated here:
Source ==> Lookup =match=> SQL Update command
                  =no match=> SQL Save Error command
(Hope you understand what I mean - but it works!)
I now have to repeat this a number of times, where my source-sql changes. So what I want to do is to insert a Script Component in front of the Source and set my User::Sql variable like:
Variables.Sql = "SELECT d, e FROM f"
All of the above is contained in a Data Flow. Once I have created one, I can copy it and only change the Sql variable in the script, and then it should all work.
My problem is: when I insert the Script Component, it asks me whether it is a Source, a Destination, or a Transformation script. And by only setting the variable, it does not produce any rows for output and cannot connect to my Source.
Anyone know how to make that work?
(I have simplified the above. I actually want to update multiple variables and use those in my Source, Lookup, and Error update as well; therefore it is not simpler just to change the SQL script in the initial Source! But if I can do the above, I will be able to achieve what I want :-))
You should set the variable containing the SQL query in the control flow, before you execute the data flow.
Then you need to use that variable as an expression in your data flow. You can parameterize the query used in the Lookup or any other part of your data flow.
If your data flows really all have the same structure, you could even generate a list of queries and call your data flow task in a loop, avoiding the duplication of near-identical tasks.
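One way to wire that up (the variable name is from the question; the rest is one typical setup, not the only one):

Control flow:   Script Task / Execute SQL Task sets @[User::Sql]
Data Flow Task -> Properties -> Expressions:
    [OLE DB Source].[SqlCommand]  =  @[User::Sql]

Alternatively, set the OLE DB Source's data access mode to "SQL command from variable" and point it directly at User::Sql.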

Pentaho Data Integration (PDI): After selecting records I need to update the field value in the table using a Pentaho transformation

I have a requirement to create a transformation where I have to run a select statement. After selecting the values, it should update the status so it doesn't process the same record again.
Select file_id, location, name, status
from files
OUTPUT:
1, c/user/, abc, PROCESS
Updated output should be:
1, c/user/, abc, INPROCESS
Is it possible to do a database select and cache the records so the same record isn't reprocessed, all in a single transformation in PDI? Then I wouldn't need to update the status in the database. Something similar to a dynamic lookup in Informatica. If not, what's the best way to update the database after doing the select?
You wouldn't do this in a single transformation, because of the multi-threaded execution model of PDI transformations: you can't count on a variable being set until the transformation ends.
The way to do it is to put two transformations in a job and create a variable in the job. The first transformation runs your select and flows the result into a Set Variables step; configure it to set the variable you created in the job. Then you run the second transformation, which contains your Excel Input step, and specify the job-level variable as the file name.
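A sketch of that layout (transformation and variable names are made up):

Job:  START --> get_value.ktr --> use_value.ktr
get_value.ktr:  Table Input --> Set Variables   (sets ${FILE_NAME}, scope "Valid in the parent job")
use_value.ktr:  Microsoft Excel Input           (Filename: ${FILE_NAME})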
If the select gives more than one result, you can store the file names in the job's file results area. You do this with a Set files in result step. Then you can configure the job to run the second transformation once for each result file.

Pentaho Data Integration: How to select the output of a SQL query as a filename for Microsoft Excel Input

I have files abc.xlsx, 1234.xlsx, and xyz.xlsx in some folder. My requirement is to develop a transformation where the Microsoft Excel Input step in PDI (Pentaho Data Integration) should only pick a file based on the output of a SQL query. If the query output is abc.xlsx, the Microsoft Excel Input should pick up abc.xlsx for further processing. How do I achieve this? Would really appreciate your help. Thanks.
Steps in a Kettle transformation run asynchronously (all in parallel), so you're probably going to need a job for this.
Files to create
Create a transformation that performs the SQL query you're looking for and populates a variable based on the result
Create a transformation that pulls data from the Excel file, using the variable populated as the filename
Create a job that executes the first transformation, then steps into the second transformation
Jobs run sequentially, so it will execute the first transformation, perform the query, get the result, and set a variable. Variables need to be set and retrieved in different transformations because of their asynchronous nature. This is the reason for the second transformation; the job won't step into the second transformation until the first one is done running (therefore, not until the variable is populated).
This is all assuming you only want to run the transformation once, expecting a single result from the query. If you want to loop it, pulling data from a set, then the setup is a little bit different.
The Excel Input step has an "Accept filenames from previous step" option. You can have a Table Input step build the full path of the file you want to read (or you build it later, knowing the base dir and the short filename), pass the filename to the Excel Input, tick that box, and specify the step and the field you want to use for the filename.
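A sketch of that variant (the path, table, and field names are made up):

Table Input  -->  Microsoft Excel Input
Table Input SQL:  select '/data/in/' || file_name as full_path from files where status = 'PROCESS'
Excel Input:      [x] Accept filenames from previous step
                  step: Table Input,  field: full_path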