Passing data from one Pentaho transformation to another in a job? - pentaho

Fairly straightforward question I think, I just haven't been able to find a clear example. I have a very complex transformation that I'm breaking down into a job. Having never created a job before, I'm struggling to send the data from one transformation to another. I used Copy Rows to Result in the first one and Get Rows From Result in the second one, but I feel like I'm still missing something. When I used Get Rows, I had to specify the row names - there was no sort of Get Fields button. I also can't preview the data in the transformation without running the job and having it save to an Excel file. When I did that, ALL of the fields were in the output file -- instead of just the ones I'd specified in the second transformation.
I've searched through the documentation and tried Googling but I can't find a clear walkthrough just on how to smoothly move data from one transformation to another. Any responses would be appreciated even if it's just pointing me towards something I've overlooked.
Thanks!

The most commom way is to use copy rows to result at the end of one KTR and use get rows from result as the starting point for the next one. Though you really can't "see" the result while operating in the next KTR, what you can do to ease the reading is set a preview window and leave it open to see all the columns names and data.
Whoever if you want to set just a few lines of code through to the next KTR you can use Set variables as the ending step of the first KTR and capture those variables at anytime in the second using Get Variables steps. Don't forget that if you do so you need to set the variables in the parent KJB(the Job that called the first KTR) with no Default value, and the Variable scope type of the Set variables step has to be set to Valid in the parent job.

The best way is to create KTR's, run/test each. This way you can examine resulting data and then integrate all individual transformations into the final job.

Related

How can I map each specific row value to an ID in Pentaho?

I’m new to Pentaho and I’m currently having an issue with mapping specific row values to an ID.
I have a data file with around 30 columns, one of which is for currencies (USD, GBP, AUD, etc).
The main objective is to have the user select up to 8 (minimum of 1) currencies and map them to a corresponding ID 1-8. All other currencies not in the specified 8 will be mapped with an ID of 9.
The final step is to then output the original data set, along with the IDs.
I’m pretty sure I’m making this way harder than it should, but here is what I have at the moment.
I have created a job where the first step is to set the variables for my 8 currencies, selectionOne -> AUD, selectionTwo -> GBP, …, selectionEight -> JPY.
I then have a transformation to read the data from the file and use the copy rows to result step.
Following that I have a second job called for-each which is my loop for checking the current currency in the row.
Within this job I have two transformations, one called set-current, one called map-currencies.
set-current simply uses the get rows from result step (to grab the data from the first transformation). I then use the set variable step to set the current currency to the value in currency field. This works fine, as each pass through in the loop changes the current variable to the correct value.
Map-currencies is where I’m having the most issues.
The goal is to use the filter row step to compare the current currency against the original 8 selected currencies, and then using the value mapper step to map it to an ID, before outputting the csv file.
The main issue here, is that I can’t use my original variables in the filter or value mapper.
So, what I’ve done here is use the get variables step to retrieve the variables and named them: one, two, three, …, eight. This allows me to bypass the filtering issue, but they don’t seem to work for the value mapper, which is the all important step.
The second issue is that when the file is output it only outputs one value (because of the loop), selecting the append option works, but this could be a problem if the job is run more than once.
However, the priority here is the mapping issue.
I understand that this is rather long, and perhaps a tad confusing, but I will greatly appreciate any help on this, even if it’s an entirely new approach 😊.
Like I said, I’m probably making it harder than it should be.
Thanks for your time.
Edit for AlainD
Input example
Output example
This should be doable in a single transformation using the Stream Lookup step.
Text File Input is your main file, Property input reads your property file into Key and Value columns. You could use a normal text file with two columns instead of the property input.
Below are the settings of the Stream lookup. Note the default value "9" for records that are not found in the lookup stream.

Pentaho - Having multiple Copy rows to result results in Get rows from result empty

I'm trying to process some data and store it in a datawarehouse. For doing it, I wanted to store dimensions in one transformation and fact (only have one) in another transformation. So I can use a job for execute the first one, copy rows to result and get them into the second transformation.
In the first transformation, I read some Excel file and separate this data into some streams. It is data from a baptism, so I have one stream for the person, another one for parents, another one for sponsors, and so on... At the end of each stream, I insert data into database and return PK autogenerated (it is an id autoincrement).
In the second one, I only have Get rows from result and want to set them into a txt file (just for see it is been done correctly). The problem is that the file is created but it is empty. I suppose that if I let fields in Get rows from result empty, it gets all fields.
What am I doing wrong?
At the end what I want is to have one Copy rows to result at the end of each stream in the first transformation and get all this data in the second one.
In "Insert Pare Padrina" I return id_pare_padrina which is autogenerated, and the same with "Insert Mare Padrina" (I have more streams which I also have to include them into result). This transformation is not executed per row because I need values of other rows.
Thank you!
In order to pass the data from the first transformation to the second transformation, you need to set certain parameters like:
1. First of all, in the transformation settings of the second transformation (at the Job Level), check on the items as image below:
Copy Previous results to parameters will ensure that all the results/data in the "Copy Rows to Result" step is getting properly passed to the next level.
Execute for every input row : will execute the second transformation for every rows in the first transformation file. This is optional based on your requirement.
2. In the same transformation settings, define the "Parameters" in the Parameters tabs. Check the image below:
Here, NAME is the parameter i have defined. So when you are using the "Get rows from result", you can define these parameter names.
3. Instead of using "Get rows from result", you can alternately use "Get Variables" step to fetch all the variables coming from the previous step. All you need to do is to define the parameter names inside the ktr file (CTRL + T). (Actually i have practically implemented in that fashion and it worked for me.)
4. Since "copy rows to result" step uses heap memory, defining multiple instances of this step might exhaust the memory space quickly and your code might fall in trouble. Ideally use a single instance of this step.
But if your data interation is only one row, best option would be to use "set variables" step.
I assume you might have missed some of these sections in the job.
You can read more on copy rows to result in here.
Hope it helps :)

Adding two extra columns to input data - Pentaho Kettle

I am working on a transformation step for Pentaho Kettle. It selects several input columns and based on that adds two new columns during transformation. I am unable to understand (based on code from other plugins), how I can add the two new columns so that 1) steps downstream are aware of these columns and 2) i can push the transformed data into these columns.
Thanks in advance.
You might need to override meta.getStepFields() to add new ValueMetaInterface objects to the RowMetaInterface passed in. This is the standard way to add columns at runtime; however, the row's metadata (i.e. list of ValueMetaInterface objects) must be the same from row to row or else the next step in your transformation will complain.
Often when doing data-driven custom plugins, you consume as many rows as you need (using getRow()) in order to figure out what the outgoing row format/metadata will be, then you can construct a RowMetaInterface (usually using meta.getStepFields()) that will be passed into the putRow() call. If you intend to pass through the incoming fields, do something like:
RowMetaInterface outputRowMeta = getInputRowMeta().clone();
If you're creating new rows use this:
RowMetaInterface outputRowMeta = new RowMeta();
Either way when you call meta.getStepFields(outputRowMeta, ...) it should populate outputRowMeta with the appropriate fields, by adding/changing/removing ValueMetaInterface objects from outputRowMeta.
I've got a blog post using Groovy to add/replace fields in the incoming rows here:
http://funpdi.blogspot.com/2014/10/flatten-json-to-key-value-pairs-in-pdi.html
Not sure if that is similar to your use case or not. If you have more questions, feel free to find me on IRC at ##pentaho (my nick is usually mburgess_pdi)
IF i have understood your question correctly, i think you are trying to create an output file with dynamic column. So you can do this by checking on the "fast dumping" option in Text File Output Step. While doing so , donot define any column names in the "Fields" tab
Check my image below:
Hope it helps :)

Get list of columns of source flat file in SSIS

We get weekly data files (flat files) from our vendor to import into SQL, and at times the column names change or new columns are added.
What we have currently is an SSIS package to import columns that have been defined. Since we've assigned the mapping, SSIS only throws up an error when a column is absent. However when a new column is added (apart from the existing ones), it doesn't get imported at all, as it is not named. This is a concern for us.
What we'd like is to get the list of all the columns present in the flat file so that we can check whether any new columns are present before we import the file.
I am relatively new to SSIS, so a detailed help would be much appreciated.
Thanks!
Exactly how to code this will depend on the rules for the flat file layout, but I would approach this by writing a script task that reads the flat file using the file system object and a StreamReader object, and looks at the columns, which are hopefully named in the first line of the file.
However, about all you can do if the columns have changed is send an alert. I know of no way to dynamically change your data transformation task to accomodate new columns. It will have to be edited to handle them. And frankly, if all you're going to do is send an alert, you might as well just use the error handler to do it, and save yourself the trouble of pre-reading the column list.
I agree with the answer provided by #TabAlleman. SSIS can't natively handle dynamic columns (and niether can your SQL destination).
May I propose an alternative? You can detect a change in headers without using a C# Script Tasks. One way to do this would be to create a flafile connection that reads the entire row as a single column. Use a Conditional Split to discard anything other than the header row. Save that row to a RecordSet object. Any change? Send Email.
The "Get Header Row" DataFlow would look like this. Row Number if needed.
The Control Flow level would look like this. Use a ForEach ADO RecordSet object to assign the header row value to an SSIS variable CurrentHeader..
Above, the precedent constraints (fx icons ) of
[#ExpectedHeader] == [#CurrentHeader]
[#ExpectedHeader] != [#CurrentHeader]
determine whether you load data or send email.
Hope this helps!
i have worked for banking clients. And for banks to randomly add columns to a db is not possible due to fed requirements and rules. That said I get your not fed regulated bizz. So here are some steps
This is not a code issue but more of soft skills and working with other teams(yours and your vendors).
Steps you can take are:
(1) reach a solid columns structure that you always require. Because for newer columns older data rows will carry NULL.
(2) if a new column is going to be sent by the vendor. You or your team needs to make the DDL/DML changes to the table were data will be inserted. Ofcouse of correct data type.
(3) document this change in data dictanary as over time you or another member will do analysis on this data and would like to know what is the use of each attribute or column.
(4) long-term you do not wish to keep changing table structure monthly because one of your many vendors decided to change the style the send you data. Some clients push back very aggresively other not so much.
If a third-party tool is an option for you, check out CozyRoc's Data Flow Task Plus. It handles variable columns in sources.
SSIS cannot make the columns dynamic,
one thing, i always do, is use a script task to read the first and last lines of a file.
if it is not an expected list of csv columns i mark file as errored and continue/fail as required.
Headers are obviously important, but so are footers. Files can through any unknown issue be partially built. Requesting the header be placed at the rear of the file it is a double check.
I also do not know if SSIS can do this dynamically, but it never ceases to amaze me how people add/change order of columns and assume things will still work.
1-SSIS Does not provide dynamic source and destination mapping.But some third party component such as Data flow task plus , supporting this feature
2-We can achieve this using ssis script task.
3-If the Header is correct process further for migration else fail the package before DFT execute.
4-Read the line from the header using script task and store in array or list object
5-Then compare those array values to user defined variables declare earlier contained default value as column name.
6-If values are matching exactly then progress further else fail it.

How to create conditional destinations in SSIS Data Flow?

Goal: I have a Db source. Depending on a variable, i need to store it into a fixed width file OR a delimited file.
How do I do this in a data flow? I tried creating a conditional split, with two conditions. One condition going to a fixed width destination, and one to a delimited condition. Problem is that conditional split executed BOTH conditions even if no data comes in one condition. Becuase filename is same, so it errors out.
I would keep your solution with the folowing tweeks.
Write out to two Filename-fixed.txt and filename-delim.txt. Before those steps add row count tasks.
Then in your control flow you have two Success paths. Edit the success paths to look for both success and expression. Add an expression that checks the count from the new row count tasks in your dataflow. If you have file system tasks as your end point have them rename your fixed or delim file to the correct file name.
Note: I didn't try this and the pics all have red x's because I find it helpful to have the picture to figure out the logic not because I actually coded the solution.
Use two different data flows and do the diversion from with in the control flow. If you want to do it within the data-flow itself I guess you will have to use different filenames.