run transformation kettle for each row xml data dynamically - pentaho

How can I dynamically get each element from the Get data from XML step separately, as input to another transformation that parses the message (the value node XML)? My main idea is how to run a Kettle transformation for each row of XML data dynamically.
*Dynamically means that the number of elements is unknown.
Question in the Pentaho community forum: http://forums.pentaho.com/showthread.php?204226-run-transformation-kettle-for-each-row-xml-data-dynamically

It's a bit dated, but it sounds like this is what you're looking for:
Run Kettle Job for each Row
Essentially, you get the data from your XML file with a transform (Get data from XML) and flow it into a Copy rows to result step. Then in your job, add a Transformation step and in its options, on the advanced tab, check the "Copy previous results to parameters" and "Execute for every input row" check boxes.
You will have to set up parameters for the Transformation step to match the metadata of your XML data rows.
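For illustration (the field, table, and query here are assumptions, not from the original post): if each XML row carried a MESSAGE_ID field mapped to a parameter of the same name, the sub-transformation could use it in a Table input step with "Replace variables in script" checked:
SELECT * FROM messages WHERE message_id = '${MESSAGE_ID}'
The point is simply that each result row from the parent transformation becomes one parameterized run of the child transformation.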
Note that this will be pretty slow if you have a large number of message IDs and relatively little child data for each message. If that's the case, you might want to try a lookup from the XML data in the first Transformation instead.

Related

Pentaho PDI: execute transformation for each line from CSV?

Here's a distilled version of what we're trying to do. The transformation step is a "Table Input":
SELECT DISTINCT ${SRCFIELD} FROM ${SRCTABLE}
We want to run that SQL with variables/parameters set from each line in our CSV:
SRCFIELD,SRCTABLE
carols_key,carols_table
mikes_ix,mikes_rec
their_field,their_table
In this case we'd want it to run the transformation three times, one for each data line in the CSV, to pull unique values from those fields in those tables. I'm hoping there's a simple way to do this.
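For example, with variable substitution the first CSV line above would make the Table Input run:
SELECT DISTINCT carols_key FROM carols_table
and likewise for the other two lines.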
I think the only difficulty is that we haven't stumbled across the right step/entry and the right settings.
Poking around in a "parent" transformation, the highest hopes we had were:
We tried chaining CSV file input to Set Variables (hoping to feed it to a Transformation Executor one line at a time), but that gripes when we have more than one line from the CSV.
We tried piping CSV file input directly to Transformation Executor, but that only sends TE's "static input value" to the sub-transformation.
We also explored using a job with a Transformation entry; we had high hopes of discovering what "Execute every input row" applied to, but haven't figured out how to pipe data to it one row at a time.
Suggestions?
Aha!
To do this, we must create a JOB with TWO TRANSFORMATIONS. The first reads "parameters" from the CSV and the second does its duty once for each row of CSV data from the first.
In the JOB, the first transformation is set up like this:
Options/Logging/Arguments/Parameters tabs are all left as default
In the transformation itself (right click, open referenced object->transformation):
Step1: CSV file input
Step2: Copy rows to result <== that's the magic part
Back in the JOB, the second transformation is set up like so:
Options: "Execute every input row" is checked
Logging/Arguments tabs are left as default
Parameters:
"Copy results to parameters" is checked
"Pass parameter values to sub transformation" is checked
Parameter: SRCFIELD; Parameter to use: SRCFIELD
Parameter: SRCTABLE; Parameter to use: SRCTABLE
In the transformation itself (right click, open referenced object->transformation):
Table input "SELECT DISTINCT ${SRCFIELD} code FROM ${SRCTABLE}"
Note: "Replace variables in script" must be checked
So the first transformation gathers the "config" data from the CSV and, one-record-at-a-time, passes those values to the second transformation (since "Execute every input row" is checked).
So now with a CSV like this:
SRCTABLE,SRCFIELD
person_rec,country
person_rec,sex
application_rec,major1
application_rec,conc1
status_rec,cur_stat
We can pull distinct values for all those specific fields, and lots more. And it's easy to maintain which tables and which fields are examined.
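For instance, when "Execute every input row" kicks in, the first data line above resolves the Table input query to:
SELECT DISTINCT country code FROM person_rec
(i.e. the distinct values of person_rec.country, aliased as code), and so on for each remaining line.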
Expanding this idea to a data flow where the second transformation updates code fields in a datamart isn't much of a stretch:
SRCTABLE,SRCFIELD,TARGETTABLE,TARGETFIELD
person_rec,country,dim_country,country_code
person_rec,sex,dim_sex,sex_code
application_rec,major1,dim_major,major_code
application_rec,conc1,dim_concentration,concentration_code
status_rec,cur_stat,dim_current_status,cur_stat_code
We'd need to pull unique ${TARGETTABLE}.${TARGETFIELD} values as well, use a Merge rows (diff) step, use a Filter rows step to find only the 'new' ones, and then an Execute SQL script step to update the targets.
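As a rough sketch (anything beyond the CSV column names is an assumption), the Execute SQL script step, with "Execute for each row?" and variable substitution enabled, could run something like:
INSERT INTO ${TARGETTABLE} (${TARGETFIELD}) VALUES (?)
with the new code value from the filtered stream bound to the ? placeholder.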
Exciting!

Excel to CSV Plugin for Kettle

I am trying to develop a reusable component in Pentaho which will take an Excel file and convert it to a CSV with an encoding option.
In short, I need to develop a transformation that has an Excel input and a CSV output.
I don't know the columns in advance. The columns have to be dynamically injected into the Excel input.
That's a perfect candidate for Pentaho Metadata Injection.
You should have a template transformation which contains the basic workflow (read from the Excel file, write to the text file), but without specifying the input and/or output formats. Then you should store your metadata (the list of columns and their properties) somewhere. In Pentaho's example an Excel spreadsheet is used, but you're not limited to that. For example, I've used a couple of database tables to store the metadata: one for the input format and another for the output format.
You also need a transformation with the Metadata Injection step to "inject" the metadata into the template transformation. What it basically does is create a new transformation at runtime, using the template and the fields you set to be populated, and then run it.
Pentaho's example is pretty clear if you follow it step by step, and from there you can create a more elaborate solution.
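For example (just one possible layout; the table and column names are illustrative, not part of Pentaho's sample), the metadata for the Excel input side could live in a table like:
CREATE TABLE excel_input_meta (
  field_name   VARCHAR(64),   -- column name as it appears in the source Excel file
  field_type   VARCHAR(16),   -- PDI type, e.g. String, Number, Date
  field_format VARCHAR(32),   -- optional format mask
  field_order  INT            -- position of the column
);
A similar table would describe the Text file output fields; the injecting transformation reads these rows and feeds them into the Metadata Injection step.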
You'll need at least two steps in a transformation:
Input step: Microsoft Excel input
Output step: Text file output
So, here is the solution. In your Microsoft Excel input step, in the Fields section, define the maximum number of fields that can appear in any Excel file. Then route the input to the appropriate Text file output based on the number of fields actually present. You need a Switch / Case step to do that routing.

Update multiple Excel sheets of one document within one Pentaho Kettle transformation

I am researching the standard sample from the Pentaho DI package: GetXMLData - Read parent children rows. It reads parent rows and child rows separately from the same XML input. I need to do the same and update two different sheets of the same MS Excel document.
My understanding is that the normal way to achieve this is to put the first sequence in one transformation file with Excel Output or Writer, the second in a second one, and at the end create a job that chains Start, the 1st transformation, and the 2nd transformation.
My problems are:
When I try to chain the above sequences I lose the content of the first updated Excel sheet in the final document;
At the end I need just one file, either a job or a transformation, without dependencies (in the scenario proposed above I would have 1 KJB job + 2 KTR transformation files).
Questions are:
Is it possible to join the 2 sequences from the above sample with some wait node before starting to update the 2nd Excel sheet?
If the above doesn't work: is it possible to embed the transformations in the job instead of referencing them from external files?
And extra question: What is better to use: Excel Output or Excel Writer?
=================
UPDATE:
Based on #AlainD proposal I have tried to put Block node in-between. Here is a result:
It looks like the Block step can be an option, but somehow it doesn't work as expected with the Excel Output / Writer steps (or I am doing something wrong). What I have observed is that Pentaho tries to execute the steps after the Block before the Excel file is closed properly by the previous step. That leads to one of the following: I either get an Excel file with one empty sheet, or the generated result file is malformed.
My input XML file (from Pentaho distribution) & test playground transformation are: HERE
NOTE: While playing do not forget to remove generated MS Excel files between runs.
Screenshot:
Any suggestions how to fix my transformation?
The pattern goes as follows:
read the data: 1 row per child, with the parent data in one or more columns
group the data: 1 row per parent; forget the children, keep the parent data. Transform and save as needed.
back in the original data, look up each row (child) and fetch the parent from the grouped data flow.
the result is one row per child plus the needed columns of the transformed parent. Transform and save as needed.
It is a pattern; you may want to change the flow and/or sort to speed it up. But it will not lock, nor fill up memory: the group-by and lookup are pretty reliable.
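A rough SQL analogy of the same pattern (table and column names are purely illustrative):
-- grouped flow: one row per parent
SELECT parent_id, MAX(parent_name) AS parent_name
FROM parent_child_rows
GROUP BY parent_id;
-- lookup flow: one row per child, with the parent data fetched back in
SELECT c.child_id, c.child_value, p.parent_name
FROM parent_child_rows c
JOIN (SELECT parent_id, MAX(parent_name) AS parent_name
      FROM parent_child_rows GROUP BY parent_id) p
  ON c.parent_id = p.parent_id;
In the transformation this typically corresponds to a Group by step on one hop and a Stream lookup step on the other.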
Question 1: Yes, the step you are looking for is named Block until this (other) step finishes, or Blocking Step (until all rows are processed).
Question 2: Yes, you can pass the rows from one transformation to another via the job. But it would be wiser to first produce the parent sheet and, when finished, read it again in the second transformation. You can also pass the rows to a sub-transformation, or use other architecture strategies...
Question 3: (Short answer) The Excel Writer appends data (a new sheet or new rows) to an existing Excel file, while the Excel Output creates and feeds a one-sheet Excel file.

Add more data to a step "copy rows to result"

I'm building a transformation in Kettle and I need to send data to another transformation. For this I use a Copy rows to result step, but at that step I have only done half the processing, and I need to add more data at the end of the transformation. How might I do that?
Greetings and thanks
EDIT 24-06-2014
This image is the example of my transformation:
The only thing you need to make sure of when merging the results from multiple streams in Kettle is that the columns from both hops into Copia Filas (Copy rows to result in the English version) are completely identical, meaning:
The number of columns sent to the final step is equal;
The column types are completely identical;
The column names are completely identical;
Kettle should alert you if they are not.
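For intuition, it is the same constraint a SQL UNION imposes (an illustrative example, not part of the transformation itself):
SELECT name, age FROM stream_a
UNION ALL
SELECT name, age FROM stream_b
-- both branches must produce the same number of columns, with matching names and types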
If this is not the issue, you should check that the second stream in the data distribution actually generates the data you want to merge into the final step of the transformation, depending on what you'd like to do there.
Use the preview and debug options, as well as the right click (on a step) ==> Show output fields in Kettle to make sure all of the above are happening.
I hope this helps a bit.

SSIS script component question - no output columns defined, but I do get output

I have an SSIS package that runs a script component as one step in a series that transforms data from a flat file to a SQL table. The script component itself is pretty straightforward, but I had a question about its input and output columns.
The script appears to have no output columns defined (there are a number of input columns, of course.) Yet when I run the SSIS package, data comes out of this script component -- it's then used as input for a data conversion component, and from there pushed into a SQL table.
Is there a default setting I'm not aware of, where a script component with no defined output columns defaults to using the input columns? Thanks for helping me clear this up.
The Output Columns section is for defining columns that you are adding to the output after the script completes. In other words, if you are taking several values from the data flow and, based upon their values, calculating a new value to be output in a new column, then that column would be defined in the script component as an output. Otherwise, the buffer that is input into the script component is simply output from the script component.