Add more data to a step "copy rows to result" - pentaho

I'm building a transformation in Kettle and need to send data to another transformation. For this I use a "Copy rows to result" step, but that step sits halfway through the processing, and I need to add more data to it at the end of the transformation. How can I do that?
Greetings and thanks
EDIT 24-06-2014
This image is the example of my transformation:

The only thing you need to make sure when merging the results from multiple streams in Kettle is that the columns from both hops to Copia Filas (Copy rows to result in the English version) are completely identical, meaning:
The number of columns sent to the final step is equal;
The column types are completely identical;
The column names are completely identical;
Kettle should warn you if they are not.
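The three conditions above can be sketched outside Kettle as a small check. This is only an illustration in Python, not anything Kettle runs: the `Layout` type, the `layout_differences` helper and the sample column names are all assumptions made for the example.

```python
from typing import List, Tuple

# A row layout is an ordered list of (column name, column type) pairs.
# This representation is an assumption for the sketch, not Kettle's own.
Layout = List[Tuple[str, str]]


def layout_differences(a: Layout, b: Layout) -> List[str]:
    """Return human-readable differences between two stream layouts."""
    problems = []
    # Condition 1: same number of columns.
    if len(a) != len(b):
        problems.append(f"column count differs: {len(a)} vs {len(b)}")
    # Conditions 2 and 3: same type and same name at every position.
    for i, ((name_a, type_a), (name_b, type_b)) in enumerate(zip(a, b), start=1):
        if name_a != name_b:
            problems.append(f"column {i}: name {name_a!r} vs {name_b!r}")
        if type_a != type_b:
            problems.append(f"column {i}: type {type_a!r} vs {type_b!r}")
    return problems


# Hypothetical layouts of the two hops arriving at the final step.
stream_1 = [("id", "Integer"), ("name", "String")]
stream_2 = [("id", "Integer"), ("nombre", "String")]
differences = layout_differences(stream_1, stream_2)
```

An empty list means the two hops can be merged safely. In Spoon itself, Show output fields on each of the two preceding steps gives the same information.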
If this is not the issue, you should check that the second stream in the data distribution actually generates the data you want to merge into the final step of the transformation, depending on what you'd like to do there.
Use the preview and debug options, as well as the right click (on a step) ==> Show output fields in Kettle to make sure all of the above are happening.
I hope this helps a bit.

Related

Pentaho PDI: execute transformation for each line from CSV?

Here's a distilled version of what we're trying to do. The transformation step is a "Table Input":
SELECT DISTINCT ${SRCFIELD} FROM ${SRCTABLE}
We want to run that SQL with variables/parameters set from each line in our CSV:
SRCFIELD,SRCTABLE
carols_key,carols_table
mikes_ix,mikes_rec
their_field,their_table
In this case we'd want it to run the transformation three times, one for each data line in the CSV, to pull unique values from those fields in those tables. I'm hoping there's a simple way to do this.
I think the only difficulty is, we haven't stumbled across the right step/entry and the right settings.
Poking around in a "parent" transformation, the highest hopes we had were:
We tried chaining CSV file input to Set Variables (hoping to feed it to Transformation Executor one line at a time) but that gripes when we have more than one line from the CSV.
We tried piping CSV file input directly to Transformation Executor but that only sends TE's "static input value" to the sub-transformation.
We also explored using a job with a Transformation entry. We were very hopeful about the "Execute every input row" option, but haven't figured out how to pipe data to it one row at a time.
Suggestions?
Aha!
To do this, we must create a JOB with TWO TRANSFORMATIONS. The first reads "parameters" from the CSV and the second does its duty once for each row of CSV data from the first.
In the JOB, the first transformation is set up like this:
Options/Logging/Arguments/Parameters tabs are all left as default
In the transformation itself (right click, open referenced object->transformation):
Step1: CSV file input
Step2: Copy rows to result <== that's the magic part
Back in the JOB, the second transformation is set up like so:
Options: "Execute every input row" is checked
Logging/Arguments tabs are left as default
Parameters:
Copy results to parameters, is checked
Pass parameter values to sub transformation, is checked
Parameter: SRCFIELD; Parameter to use: SRCFIELD
Parameter: SRCTABLE; Parameter to use: SRCTABLE
In the transformation itself (right click, open referenced object->transformation):
Table input "SELECT DISTINCT ${SRCFIELD} code FROM ${SRCTABLE}"
Note: "Replace variables in script" must be checked
So the first transformation gathers the "config" data from the CSV and, one-record-at-a-time, passes those values to the second transformation (since "Execute every input row" is checked).
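What the job does can be pictured with a rough Python sketch: one pass over the CSV rows, with the variables substituted into the SQL for each row. This only simulates the behaviour and is not how Kettle is implemented. The `statements_per_row` helper is a name invented for the example; the CSV and SQL are the ones from the question.

```python
import csv
import io
from string import Template

# The CSV from the question; in the job this is read by the first transformation.
CSV_TEXT = """SRCFIELD,SRCTABLE
carols_key,carols_table
mikes_ix,mikes_rec
their_field,their_table
"""

# The Table input SQL; Template understands the same ${NAME} placeholders.
SQL = Template("SELECT DISTINCT ${SRCFIELD} code FROM ${SRCTABLE}")


def statements_per_row(csv_text: str) -> list:
    """Build one SQL statement per CSV data row, like one run per result row."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [SQL.substitute(row) for row in rows]


statements = statements_per_row(CSV_TEXT)
for statement in statements:
    print(statement)
```

Each printed statement corresponds to one execution of the second transformation.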
So now with a CSV like this:
SRCTABLE,SRCFIELD
person_rec,country
person_rec,sex
application_rec,major1
application_rec,conc1
status_rec,cur_stat
We can pull distinct values for all those specific fields, and lots more. And it's easy to maintain which tables and which fields are examined.
Expanding this idea to a data-flow where the second transformation updates code fields in a datamart, isn't much of a stretch:
SRCTABLE,SRCFIELD,TARGETTABLE,TARGETFIELD
person_rec,country,dim_country,country_code
person_rec,sex,dim_sex,sex_code
application_rec,major1,dim_major,major_code
application_rec,conc1,dim_concentration,concentration_code
status_rec,cur_stat,dim_current_status,cur_stat_code
We'd need to pull unique ${TARGETTABLE}.${TARGETFIELD} values as well, use a Merge rows (diff) step, use a Filter rows step to find only the 'new' ones, and then an Execute SQL script step to update the targets.
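The "find only the new ones" part boils down to a set difference between the source codes and the codes already in the target. A minimal Python sketch, assuming made-up sample values and a hypothetical `new_codes` helper (in Kettle the Merge rows (diff) and Filter rows steps would do this work):

```python
def new_codes(source_values, target_values):
    """Return the source codes that are absent from the target, sorted."""
    return sorted(set(source_values) - set(target_values))


# Hypothetical data: distinct values from person_rec.country and the codes
# already present in dim_country.country_code.
source = ["US", "FR", "DE", "US"]
target = ["US", "DE"]
to_insert = new_codes(source, target)
```

Only the values in `to_insert` would need to be written to the target table.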
Exciting!

How can I avoid seeing the previous fields in a Table output step?

I have a transformation in Pentaho Kettle called Test.
This ETL process should load three different tables of a single database, each of which has its source in a different table of another database.
To do this I use three Table input steps. Each one connects to a Value Mapper, then to a Select values step, then a Data Validator, an Add sequence step and finally a Table output.
In summary, I have a total of six steps per table load.
While editing the final steps I found something I would like to solve: the fields of the previous table loads are carried along.
For example, the load of table A has the field bank_id, which does not exist in the second table, yet in the Table output step of the second load I can select it even though I do not want it.
Is there any option to hide the previous fields? That way I would avoid easy mistakes, especially when the tables have a field with the same name.
Thank you
EDIT
The screenshot clarifies the situation immensely, so now the answer is simple:
Delete the diagonal hops (arrows) between the rows.
Transformations in PDI don't have a single starting or ending point, so you don't need to connect all the steps in a single line. Having three separate streams is just fine.
All steps in a transformation start in parallel, then wait and process rows as they come in (or in the case of input steps, start reading data and generating rows into their output hop). That means your three streams will execute in parallel following their own hops from input to output.
Add a Select values step. I often add filter steps to "clean" the flow.

How can I use an additional condition by getting data from xls-file input in Pentaho spoon?

I have just started learning Pentaho Spoon and am stuck on one task. I need to transform the data from an xls file and load it into a database. The problem is that my input file looks like this: table-description
And I cannot find how to solve two problems:
For my next step I need to save not only the table itself (range A8:D11), but also the date (cell A5). When I try to do it in Pentaho with the Microsoft Excel Input step, it works only when I select cell A8 as the start row, but then the date is not saved.
In the Microsoft Excel Input step I must always select a start row in order to generate a table and use it in the next steps, and I must do it manually, that is, state that my table starts at cell A8. In my case I cannot always be sure that the table starts at cell A8. I only know that the start cell is the cell in column A whose value is "Date". The Microsoft Excel Input step will be the first step in my transformation because I must read the data before changing it. That is why I think I cannot use a Java Script step before it.
I have not found a solution to these two problems and I do not know whether one is possible. I will be grateful for any help.
I am not sure what you mean by converting an Excel file to a database, but if you can convert the xls into csv and read that file, then you know from which row you need to filter the data. Basically, you can use a simple filter step to keep the data from the row that matches the column name onwards. I hope this helps.
Use two Microsoft Excel Input steps. One step reads the table (A8:D11). The other step reads the date (A5). Then merge the two streams, for example using a Join Rows (cartesian product) step.
Read everything. Then use a Javascript step with two script tabs. For one of the tabs: Right-click and choose Set start script. Code : var start = 0; The other tab should be kept as a transformation script. Pseudocode: if(FieldA equals "Date") {start = 1;}. Now you will have an additional field in the stream called start. If start equals 0, then you know that your tabular data hasn't started yet, and you can filter out the row.
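The flag logic described above can be sketched in plain Python to show the idea. This is not the script you would paste into the Javascript step: the `rows_from_header` helper and the sample sheet contents are assumptions for illustration, and only the "Date" marker in column A comes from the question.

```python
def rows_from_header(rows, marker="Date"):
    """Keep rows from the first one whose column A equals the marker."""
    started = False  # plays the role of the "start" variable in the start script
    kept = []
    for row in rows:
        # Flip the flag once the header row is reached.
        if not started and row and row[0] == marker:
            started = True
        # Rows seen before the flag flips are filtered out.
        if started:
            kept.append(row)
    return kept


# Hypothetical sheet: some lines above the table, then the header and data.
sheet = [
    ["Report"],
    ["2014-06-24"],
    [],
    ["Date", "Item", "Qty", "Price"],
    ["2014-06-01", "A", 1, 2.5],
]
table = rows_from_header(sheet)
```

As in the pseudocode, the header row itself is kept, because the flag is set on that row. The table is found wherever it starts, so the start row does not have to be entered by hand.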

Passing data from one Pentaho transformation to another in a job?

Fairly straightforward question I think, I just haven't been able to find a clear example. I have a very complex transformation that I'm breaking down into a job. Having never created a job before, I'm struggling to send the data from one transformation to another. I used Copy Rows to Result in the first one and Get Rows From Result in the second one, but I feel like I'm still missing something. When I used Get Rows, I had to specify the row names - there was no sort of Get Fields button. I also can't preview the data in the transformation without running the job and having it save to an Excel file. When I did that, ALL of the fields were in the output file -- instead of just the ones I'd specified in the second transformation.
I've searched through the documentation and tried Googling but I can't find a clear walkthrough just on how to smoothly move data from one transformation to another. Any responses would be appreciated even if it's just pointing me towards something I've overlooked.
Thanks!
The most common way is to use Copy rows to result at the end of one KTR and use Get rows from result as the starting point for the next one. Although you can't really "see" the result while working in the next KTR, what you can do to make it easier to read is set a preview window and leave it open to see all the column names and data.
However, if you want to send just a few values through to the next KTR, you can use Set variables as the final step of the first KTR and pick up those variables at any time in the second one using Get Variables steps. Don't forget that if you do so, you need to set the variables in the parent KJB (the job that called the first KTR) with no default value, and the Variable scope type of the Set variables step has to be set to Valid in the parent job.
The best way is to create the KTRs and run/test each one separately. This way you can examine the resulting data and then integrate all the individual transformations into the final job.

Pentaho - Having multiple Copy rows to result results in Get rows from result empty

I'm trying to process some data and store it in a data warehouse. To do so, I wanted to store the dimensions in one transformation and the fact (I only have one) in another transformation, so that I can use a job to execute the first one, copy rows to result and get them in the second transformation.
In the first transformation, I read an Excel file and split the data into several streams. It is data from a baptism, so I have one stream for the person, another for the parents, another for the sponsors, and so on. At the end of each stream, I insert the data into the database and return the autogenerated PK (an auto-increment id).
In the second one, I only have Get rows from result and want to write the rows to a txt file (just to see that it was done correctly). The problem is that the file is created but it is empty. I assumed that if I leave the fields in Get rows from result empty, it gets all fields.
What am I doing wrong?
In the end, what I want is to have one Copy rows to result at the end of each stream in the first transformation and get all this data in the second one.
In "Insert Pare Padrina" I return id_pare_padrina, which is autogenerated, and the same with "Insert Mare Padrina" (I have more streams that I also have to include in the result). This transformation is not executed per row because I need values from other rows.
Thank you!
In order to pass the data from the first transformation to the second transformation, you need to set certain parameters like:
1. First of all, in the settings of the second transformation (at the job level), check the items shown in the image below:
Copy previous results to parameters: ensures that the results/data from the "Copy rows to result" step are passed properly to the next level.
Execute for every input row: executes the second transformation once for every row produced by the first transformation. This is optional, depending on your requirement.
2. In the same transformation settings, define the parameters in the Parameters tab. Check the image below:
Here, NAME is the parameter I have defined. So when you are using "Get rows from result", you can define these parameter names.
3. Instead of using "Get rows from result", you can alternatively use the "Get Variables" step to fetch all the variables coming from the previous step. All you need to do is define the parameter names inside the ktr file (CTRL + T). (This is how I implemented it in practice, and it worked for me.)
4. Since the "Copy rows to result" step uses heap memory, defining multiple instances of this step might exhaust the memory quickly and cause trouble. Ideally, use a single instance of this step.
But if your data is only one row, the best option would be to use the "Set variables" step.
I assume you might have missed some of these sections in the job.
You can read more on copy rows to result in here.
Hope it helps :)