Read variable column names from Excel in Pentaho

I am new to Pentaho.
I have an Excel input file with a fixed number of columns, but the column names change. I want to capture the column names. I tried using the "Metadata Structure of Stream" step as well as a UDJC (User Defined Java Class) step:
RowMetaInterface inputRowMeta = getInputRowMeta();   // row metadata as defined at design time
String[] fieldNames = inputRowMeta.getFieldNames();  // returns the names from the Fields tab, not the actual headers
In both cases I get the field names that were defined for the first Excel file. So whatever is defined in the "Fields" tab of the "Microsoft Excel Input" step comes out as the output of "Metadata Structure of Stream". What I am looking for is this: if the column names in the input Excel file change, the metadata output should change as well. Is there a way I can do that?

If you don't know the field names at design time, you must treat the column headers as data. Metadata injection can then be used to turn that data back into metadata. You will find a demo of this feature in your Kettle samples folder.
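As a rough illustration of "headers as data" (a sketch, not part of the original answer): if the Microsoft Excel Input step is configured without a header row, the real column names arrive as the values of the first data row, and a User Defined Java Class step can turn them into one row per column name for a downstream ETL Metadata Injection step. The output field column_name below is an assumed name declared on the UDJC Fields tab.

// Sketch of a UDJC processRow body: emit one output row per real header value.
// Assumes the upstream Excel input has no header row and that a String output
// field "column_name" is declared on the UDJC Fields tab.
public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException {
    Object[] r = getRow();
    if (r == null) {                      // no more input rows
        setOutputDone();
        return false;
    }
    if (first) {
        first = false;
        RowMetaInterface inMeta = getInputRowMeta();
        // The first physical row holds the real Excel headers.
        for (int i = 0; i < inMeta.size(); i++) {
            Object[] out = createOutputRow(r, data.outputRowMeta.size());
            get(Fields.Out, "column_name").setValue(out, inMeta.getString(r, i));
            putRow(data.outputRowMeta, out);
        }
    }
    // Later rows are the actual data and are not forwarded by this step.
    return true;
}

Rows produced this way are exactly the kind of data that the Metadata Injection step can turn back into metadata.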

Related

Dynamic filename in Mapping Data Flow sink without the column in file

The way I understand it, if you want dynamic filenames when writing to blob storage from a Mapping Data Flow, the solution is to set "As data in column" in the file name options on the sink. This then uses the contents of a column as the filename for each row. To set the filename in the row, you can have a derived column that contains the expression.
With auto mapping enabled on the sink, this results in the file containing a column that holds the filename.
With auto mapping turned off, I could map all columns except this one, but since I also want schema drift enabled on the source and want to keep any extra columns in the destination, I can't have a fixed set of output columns.
How can I dynamically set the filename that gets generated without including it as a column in the file?
Or, if we assume every row will have the same filename, is there another way to set a filename dynamically? I've struggled to find any documentation on the file name options, but "Pattern" looks like it just adds a number, and "Single file" looks like a fixed value.
When you choose the "Output to single file" option, you can create a parameter in the Data Flow and use it as the file name, then pass the value from the pipeline to the Data Flow.
My test:
1. Add a parameter in the Data Flow.
2. Use that parameter as the file name.
3. Pass the value to the parameter from the pipeline.

Excel to CSV Plugin for Kettle

I am trying to develop a reusable component in Pentaho which will take an Excel file and convert it to a CSV with an encoding option.
In short, I need to develop a transformation that has an Excel input and a CSV output.
I don't know the columns in advance. The columns have to be dynamically injected into the Excel input step.
That's a perfect candidate for Pentaho Metadata Injection.
You should have a template transformation which contains the basic workflow (read from the Excel file, write to the text file), but without specifying the input and/or output formats. Then you should store your metadata (the list of columns and their properties) somewhere. In Pentaho's example an Excel spreadsheet is used, but you're not limited to that. I've used a couple of database tables to store the metadata, for example: one for the input format and another for the output format.
You also need a transformation with the Metadata Injection step to "inject" the metadata into the template transformation. What it basically does is create a new transformation at runtime, using the template and the fields you set to be populated, and then run it.
Pentaho's example is pretty clear if you follow it step by step, and from that you can then create a more elaborate solution.
You'll need at least two steps in a transformation:
Input step: Microsoft Excel input
Output step: Text file output
Here is another approach. In your Excel Input step, in the Fields section, define the maximum number of fields that can appear in any of the Excel files. Then route the input to the appropriate Text file output based on the number of fields actually present; you can use a Switch/Case step for this.

BigQuery: schema autodetection of JSON couldn't recognize a field appearing later in the JSON input file

I found that BigQuery's schema auto-detection doesn't recognize a field if it doesn't appear near the beginning of the input JSON file.
I have a field named "details" which is a record type. In the first 2K rows of the JSON input file, this field doesn't have any sub-fields. But at row 2,698 of the input file, this field has a "report" sub-field for the first time. If I move that line to the top of the JSON file, it works fine.
How can I solve this issue? Explicitly specifying the schema is one way, but I am wondering if there is a way to make the auto-detection scan more rows, or something like that.
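Since the question already mentions explicit schema specification as one option, here is a minimal sketch of it with the google-cloud-bigquery Java client. As far as I know, auto-detection only samples a limited prefix of the input and cannot be told to scan more rows. The dataset, table, bucket, and the extra id field below are assumptions; only details and its report sub-field come from the question.

import com.google.cloud.bigquery.*;

public class LoadJsonWithExplicitSchema {
    public static void main(String[] args) throws Exception {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Declare "details" as a RECORD up front, including the "report"
        // sub-field that only appears late in the file.
        Schema schema = Schema.of(
            Field.of("id", StandardSQLTypeName.STRING),
            Field.of("details", StandardSQLTypeName.STRUCT,
                Field.of("report", StandardSQLTypeName.STRING)));

        TableId tableId = TableId.of("my_dataset", "my_table");   // hypothetical names
        String sourceUri = "gs://my-bucket/input.json";           // hypothetical URI

        LoadJobConfiguration config = LoadJobConfiguration.newBuilder(tableId, sourceUri)
            .setFormatOptions(FormatOptions.json())               // newline-delimited JSON
            .setSchema(schema)                                    // explicit schema instead of autodetect
            .build();

        Job job = bigquery.create(JobInfo.of(config)).waitFor();
        if (job.getStatus().getError() != null) {
            throw new RuntimeException(job.getStatus().getError().toString());
        }
    }
}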

Index function : Pentaho Data Integration

I need guidance on the most appropriate approach to perform an index (lookup) function using Pentaho Data Integration (Kettle).
My situation is as follows:
Using the GLOBAL VoIP system report, I stored all the data in a MySQL database, which gives me several ID numbers plus first and last names, but without the department name.
Each department has its own Excel reports, which can be identified by the group file name; this is not available in the global file.
What I am trying to achieve is a lookup on each identification number to identify the department it belongs to, using the report filename, and to store it in the appropriate column.
Any help will be appreciated.
Assuming you're using the Excel File Input step, there is an option on the Additional Output Fields tab that lets you specify the Full Filename Field. You can name this whatever you want, and it will add an additional column to your incoming Excel data containing the name of the file. You may need to do some regex cleanup on that field, since it's the full file path, not just the filename.
As far as doing the lookup, there are many lookup options to merge streams in the Lookup category of the design tab. I think the Stream Lookup is the step you'll want.
As far as I understand your need, you first have to build a "mapping table" of two columns: the department (i.e., the start of the xls filename) and the employee (i.e., its ID).
This table does not need to be materialized and can stay as a stream within the PDI transformation. So:
Read all the xls files with a Microsoft Excel Input step. In case you do not know how to do it: browse to any of these files, press the Add button, then in the Selected files table remove the filename to keep only its directory path, and write .*\.xls in the Regex wildcard. Check that you are selecting the appropriate files with the Show filename button.
In the same step, define the Sheet to be "Fiche technique" (assuming they are all the same). Define the fields to be "A" with type String (an empty column) and "ID" also with type String (otherwise you'll get an untrappable error on "Agent ID" and "Total"). Also follow #eicherjc's suggestion and keep the filename, although I suggest you keep the Short file name and call it filename.
You should get a two-column stream, ID and filename, which needs a bit of data massaging before it can be used: the ID contains non-integer values and the filename contains extra characters.
The simplest way to do this is with a Modified Java Script Value step. I suggest this code:
var ID = Number(ID);                           // convert the ID so it can match the numeric ID in MySQL
var regex = filename.match(/(.*)__\d+\.xls/);  // capture everything before "__<number>.xls"
if (regex) filename = regex[1];                // keep only the department part of the filename
Do not forget to specify that ID now has type Integer and to put a "Y" in the "Replace value in field" column of the Fields table at the bottom.
The first line converts any numeric value to a number; non-numeric values end up as 0, which is an ID that does not exist.
The next lines extract the department from the filename with a regex. If you do not like regexes, you may use filename = filename.substr(0, filename.indexOf('__')), or any formula that does the job.
Now you have a stream ready to be used, except that some employees may, rightly or wrongly, be in more than one department. If it does not matter which one, leave it like that. Otherwise you have to add some logic to pick the correct department.
You can now use a Stream Lookup step to read the department of each employee. The Lookup step is the Modified Java Script Value (or whatever name you gave to that step). The field to look up is the ID field from your MySQL table. The Lookup field is the ID (or whatever name you gave to column B of your xls files). And the field to retrieve is the filename (or, more precisely, the department name extracted from the filename).
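For what it's worth, the Stream Lookup step behaves like a hash map built from the lookup stream. The standalone Java sketch below (all class, field, and sample values are illustrative and not from the original answer) shows the same logic outside PDI: index the Excel-derived ID-to-department stream, then probe it for each MySQL row.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DepartmentLookupSketch {
    // One record of the Excel-derived mapping stream: employee ID -> department.
    record Mapping(long employeeId, String department) {}
    // One record of the main stream coming from MySQL.
    record Employee(long id, String firstName, String lastName) {}

    public static void main(String[] args) {
        // Illustrative data standing in for the two PDI streams.
        List<Mapping> mappingStream = List.of(
            new Mapping(1001L, "Sales"),
            new Mapping(1002L, "Support"));
        List<Employee> mysqlStream = List.of(
            new Employee(1001L, "Ada", "Lovelace"),
            new Employee(1003L, "Grace", "Hopper"));

        // What Stream Lookup does internally: read the lookup stream completely
        // and index it by the key field.
        Map<Long, String> departmentById = new HashMap<>();
        for (Mapping m : mappingStream) {
            departmentById.put(m.employeeId(), m.department());
        }

        // For each main-stream row, retrieve the department; unknown IDs get a
        // default, like the default value of the Stream Lookup step.
        for (Employee e : mysqlStream) {
            String department = departmentById.getOrDefault(e.id(), "UNKNOWN");
            System.out.printf("%d %s %s -> %s%n", e.id(), e.firstName(), e.lastName(), department);
        }
    }
}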

How to map input to output fields from Excel to CSV in Pentaho?

How to map input to output fields from Excel to CSV in Pentaho?
How do I transform this in Pentaho? Where do I map the values of the input columns to the output columns, given that the positions and names differ between input and output?
You can rename the fields right in your MS-Excel-Input step, and you can reorder the fields in Text-File-Output. Also, a Select-Values step allows you to rename and reorder fields in one sweep on the Select & Alter tab.
The Select Values step allows you to change the column names and position (as well as type).
Note that the column names in the Excel Input step are arbitrary and do not need to be related to the actual names in the Excel file, so you can rename them at will. You can even copy/paste the names into the Fields list.
Note also that the order of the columns in the CSV output file is defined in the Fields tab of the Text file output step. You can change it with the Ctrl+Arrow keys.
If you need to industrialize the process and have the new column names and order come from, for example, a set of files or a database table, then you need Metadata Injection. Have a look at Diethard's or Jens' examples.