Dynamic filename in Mapping Data Flow sink without the column in file - azure-data-factory-2

The way I understand it, if you want dynamic filenames when writing to blob storage from a mapping data flow, the solution is to set "As data in column" in the file name options on the sink. This uses the contents of a column as the filename for each row; to set the filename in the row, you can add a derived column containing the expression.
With auto mapping enabled on the sink, this results in the file containing a column that holds the filename.
With auto mapping turned off, I could map every column except that one, but since I also want schema drift enabled on the source and want to keep any extra columns in the destination, I can't have a fixed set of output columns.
How can I dynamically set the filename that gets generated without including it as a column in the file?
Or, if we assume every row will have the same filename, is there another way to set a filename dynamically? I've struggled to find any documentation on the file name options, but Pattern looks like it just adds a number, and Single file looks like a fixed value.

When you choose the 'Output to single file' option, you can create a parameter in the Data Flow and use it as the file name, then pass the value from the pipeline to the Data Flow like this:
My test:
1. Add a parameter in the Data Flow.
2. Use that parameter as the file name in the sink.
3. Pass the value from the pipeline to the parameter.
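A minimal sketch of how those pieces fit together (the parameter name and the expression are only illustrative, not taken from the original post):

Data Flow:  add a string parameter, e.g. fileName
Sink:       File name option = 'Output to single file'
            File name        = $fileName        (data flow expression referencing the parameter)
Pipeline:   Data Flow activity > Parameters > fileName =
            @concat('export_', formatDateTime(utcNow(), 'yyyyMMddHHmmss'), '.csv')

Depending on your data flow, the sink's Optimize tab may also need to be set to single partition for single-file output.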

Related

BigQuery: schema autodetection of JSON couldn't recognize a field appearing later in the JSON input file

I found that BigQuery's schema autodetection doesn't recognize a field if it doesn't appear near the beginning of an input JSON file.
I have a field named "details" which is a RECORD type. In the first 2K rows of the JSON input file, this field doesn't have any sub-fields. Then, around row 2,698 of the input file, the field has a "report" sub-field for the first time. If I move that line to the top of the JSON file, it works fine.
How can I solve this issue? Explicitly specifying the schema is one way, but I am wondering if there is a way to make the autodetection scan more rows, or something like that.

Index function : Pentaho Data Integration

I need guidance regarding the most appropriate approach to perform an index function using Pentaho Data Integration (Kettle).
My situation is as follows:
Using the GLOBAL VoIP system report, I stored all data in a MySQL database, which gives me several ID numbers plus first and last names, but without the department name.
Each department has its own Excel reports, which can be identified by the group file name; that name is not available in the GLOBAL file.
What I am trying to achieve is a lookup on each identification number to identify the department it belongs to, using the report filename, and to store it in the appropriate column.
Any help will be appreciated.
Assuming you're using the Excel File Input step, there is an option on the Additional Output Fields tab that allows you to specify a Full Filename Field. You can name this whatever you want, and it will add an additional column to your incoming Excel data containing the name of the file. You may need to do some regex cleanup on that field, since it's the full file path, not just the filename.
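If you go the regex route, one way to do that cleanup is a Modified Javascript Value step. A rough sketch, assuming the Full Filename Field was named full_filename:

// keep only the file name, dropping the directory path (handles both / and \ separators)
var filename = full_filename.replace(/^.*[\\\/]/, '');

Add filename in the step's Fields table so it comes out as a new column in the stream.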
As far as doing the lookup, there are many lookup options to merge streams in the Lookup category of the design tab. I think the Stream Lookup is the step you'll want.
As far as I understand your need, you first have to build a "mapping table" with two columns: the department (i.e., the start of the xls filename) and the employee (i.e., their ID).
This table does not need to be materialized and may stay as a step in the PDI transformation. So:
Read all the xls files with a Microsoft Excel Input step. In case you do not know how to do it: browse to any of these files, press the Add button, then in the Selected files table remove the filename to keep only its directory path, and write .*\.xls in the Regex wildcard. Check that you are selecting the appropriate files with the Show filename button.
In the same step, define the Sheet to be "Fiche technique" (assuming they are all the same). Define the fields to be "A" with type String (an empty column) and "ID", also with type String (otherwise you'll get an untrappable error on "Agent ID" and "Total"). Also follow @eicherjc's suggestion and keep the filename, although I suggest you keep the Short filename and call it filename.
You should get a two-column stream, ID and filename, which needs a bit of data massaging before it can be used: the ID contains non-integer values and the filename contains extra characters.
The simplest way to do this is with a Modified Javascript Value step. I suggest the following code:
// Convert the ID to its numeric value; non-numeric IDs become 0, an ID that does not exist.
var ID = Number(ID);
if (isNaN(ID)) ID = 0;
// Keep only the department part of the filename, e.g. "Sales__01.xls" -> "Sales".
var regex = filename.match(/(.*)__\d+\.xls/);
if (regex) filename = regex[1];
and do not forget to specify that the ID now has type Integer, and to put a "Y" in the Replace value in field column of the Fields table at the bottom.
The first lines convert the ID to its numeric value; any non-numeric value becomes 0, which is an ID that does not exist.
The next lines extract the department from the filename with a regex. If you do not like regexes, you may use filename = filename.substr(0, filename.indexOf('__')), or any formula that does the job.
Now you have a stream ready to be used, except that some employees may, rightly or wrongly, be in more than one department. If it does not matter which one, leave it like that. Otherwise you will have to add some logic to filter out the correct department.
You can now use a Stream Lookup step to read the department of each employee. The Lookup step is the Modified Javascript Value (or whatever name you gave that step). The field to look up is the ID field from your MySQL data. The Lookup field is the ID (or whatever name you gave to column B of your xls files). And the field to retrieve is the filename (or, more precisely, the department name extracted from the filename).

How to map input to output fields from Excel to CSV in Pentaho?

How do I map input to output fields from Excel to CSV in Pentaho?
How do I transform this in Pentaho? Where do I map the values of input columns to output columns, given that the positions and names differ between input and output?
You can rename the fields right in your MS-Excel-Input step, and you can reorder the fields in the Text-File-Output step. Also, a Select-Values step allows you to rename and reorder fields in one sweep on the Select & Alter tab.
The Select Values step allows you to change the column names and positions (as well as types).
Note that the column names in the Excel Input step are arbitrary and do not need to match the actual names in the Excel file, so you can rename them as you wish. You can even copy/paste the names into the Fields list.
Note also that the order of the columns in the CSV output file is defined in the Fields tab of the Text file output step. You can change it with the Ctrl+Arrow keys.
If you need to industrialize the process and have the new column names and order in, for example, a set of files or a database table, then you need Metadata Injection. Have a look at Diethard's or Jens's examples.

Read variable column names from excel in pentaho

I am new to Pentaho.
I have an Excel input file with a fixed number of columns, but the column names change. I want to capture the column names. I tried using the "Metadata Structure of Stream" step as well as a UDJC step:
RowMetaInterface inputRowMeta = getInputRowMeta();
String[] fieldNames = inputRowMeta.getFieldNames();
In both cases I get the field names as they were defined for the first Excel file. So whatever is defined in the "Fields" tab of the "Microsoft Excel Input" step comes out as the output of "Metadata Structure of Stream". What I am looking for is: if the column names in the input Excel file change, the metadata output should change as well. Is there a way I can do this?
If you don't know the field names at design time, you must treat the column headers as data. Metadata injection can then be used to convert that data into metadata. You will find a demo of this feature in your Kettle samples folder.

Can you set Fixed File Input column definitions dynamically in Pentaho data-integration (PDI)?

I have a metadata file which contains the column name, starting position, and length. I would like to read these values and define my columns within a FIXED FILE INPUT step.
Is there a way to do this in PDI? My file contains over 200 columns at fixed widths, and manually entering the information would be very time-consuming, especially if the definition changes over time.
Use the ETL Metadata Injection step to inject the metadata into the prescribed step; refer to Matt Casters' example on figuring out a delimited file and to the Metadata Injection description.