Pentaho: use an Excel Input with fields from a previous step

I'm working with Excel files in Pentaho.
I do some preprocessing on directories because the information is stored in this way:
/[year_dir]/[month_dir]/[store_id]_[day_of_month].xls
For example, /2017/01/4567_3.xls contains the sales of store 4567 on 03/01/2017.
I pass the filename to an Excel Input step, but when I add the year, day and store_id fields, their column names are added at the beginning, shifting the rest of the column names but not the data of the Excel file.

The easiest way is to include the filename (whole path) in your output data stream, then use a regex to split it into the various bits and pieces you need, extracting the date and store id from there.
You can later use a Select values step to re-order the fields if order is important.
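For example, here is a minimal sketch of that split in a Modified Java Script Value step, assuming the full path arrives in a field called filename (the field name and the exact regex are assumptions based on the layout described in the question):

var m = filename.match(/(\d{4})\/(\d{2})\/(\d+)_(\d+)\.xls$/);   // e.g. /2017/01/4567_3.xls
var year     = m ? m[1] : null;   // "2017"
var month    = m ? m[2] : null;   // "01"
var store_id = m ? m[3] : null;   // "4567"
var day      = m ? m[4] : null;   // "3"

Declare year, month, day and store_id in the Fields table of the step so they become new columns in the stream.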

Related

Issues Exporting from T-SQL to Excel including column names & sorting on a column that's not exported

I have a SQL table with ERP data containing over 1 million records for seven different companies. This data needs to be exported to Excel files breaking the records by company, and for each company by a date range. Based on the companies and date ranges it will produce > 100 separate files.
The results are being imported into a second ERP system which requires the import file to be Excel format and must include the column names as the first row. The records need to be sorted by a [PostDate] column which must NOT be exported to the final record set.
The client's SQL server does not have the MS Jet Engine components installed in order for me to export directly to Excel. Client will not allow install of the MS Jet Engine on this server.
I already had code from a similar project that exported to CSV files so have been attempting that. My T-SQL script loops through the table, parses out the records by company and by date range for each iteration, and uses BCP to export to CSV files. The BCP command includes the column names as the first row by using an initial SELECT 'Col_1_Name', 'Col_2_Name', etc. with a UNION SELECT [Col_1], [Col_2],...
This UNION works AS LONG AS I INCLUDE THE [PostDate] column needed for the ORDER BY in the SELECT. This exports the records I need in proper order, with column names, but also includes the [PostDate] column which the import routine will not accept. If I remove [PostDate] from the SELECT, then the UNION with ORDER BY [PostDate] fails.
We don't want to have the consultant spend the time to delete the unwanted column from each file for 100+ files.
Furthermore, one of the VarChar columns being exported ([Department]) contains rows that have a leading zero, "0999999" for example.
The user opens the CSV files by double-clicking on the file in Windows file explorer to review the data, notes the [Department] column values are ok with leading zero displayed, and then saves as Excel and closes the file to launch the import into the second ERP system. This process causes the leading zeros to be dropped from [Department] resulting in import failure.
How can I (1) export directly to Excel, (2) including column names in row 1, (3) sorting the rows by [PostDate] column, (4) excluding [PostDate] column from the exported results, and (5) preserving the leading zeros in [Department] column?
You could expand my answer to the question "SSMS: Automatically save multiple result sets from same SQL script into separate tabs in Excel?" by adding the sort functionality you require.
Alternatively, a better approach would be to use SSIS.

Exporting long numbers to CSV file for Excel

This is perhaps one of those many-times-discussed questions whose solutions tend to be specific to the actual system that outputs the data into a CSV file.
Is there a simple way to export data like 3332401187555, 9992401187000 into a CSV file in a way that later when opened in Excel, the columns won't show them in "scientific" format? Should this be important, the data is retrieved directly by an SQL SELECT statement from any DBMS.
This also means that I've tried solutions like surrounding the values with apostrophes, '3332401187555', and the Excel cell recognizes those as text and doesn't do any conversion/masking. I was wondering if there is a more elegant way without it being a pre-set Excel template with text data fields.
1. Try exporting the numbers prefixed with a single quote. Example: '3332401187555.
2. In Excel, select the column containing the number values and then select Number in Format Cells.
You just have to save your file from Excel using the CSV file option, and you have the file in the requested format.

Is there any way to exclude columns from a source file/table in Pentaho using "like" or any other function?

I have a CSV file with more than 700 columns, and I only want 175 of them to be inserted into an RDBMS table or a flat file using Pentaho (PDI). The source CSV file has variable columns, i.e. columns can keep being added or deleted, but they contain some specific keywords that remain constant throughout. I have the list of keywords present in the column names that have to be excluded, e.g. starts_with("avgbal_"), starts_with("emi_"), starts_with("delinq_prin_"), starts_with("total_utilization_"), starts_with("min_overdue_"), starts_with("payment_received_")
Any column containing the above keywords has to be excluded and should not pass into my RDBMS table or flat file. Is there any way to remove the above columns by writing some SQL query in PDI? Selecting the specific 175 columns is not possible as they are variable in nature.
I think your case is a good fit for metadata injection; you can refer to the example shared below:
https://help.pentaho.com/Documentation/7.1/0L0/0Y0/0K0/ETL_Metadata_Injection
There are two things you need to be careful about:
Maintain the list of columns you need to push in.
Since the column names keep changing, you may also face issues with the valid columns you want to import or work with. To handle this, make sure you regenerate the metadata file every time, so you are always sure about the column names you want to push out from the flat file.
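As an illustration only, not a full metadata injection setup, here is a hypothetical Modified Java Script Value sketch that builds the list of columns to keep from the CSV header line; the input field name header and the output field columns_to_keep are assumptions:

// 'header' is assumed to hold the first line of the CSV (the column names).
var excluded = ["avgbal_", "emi_", "delinq_prin_", "total_utilization_",
                "min_overdue_", "payment_received_"];
var names = ("" + header).split(",");   // force a JavaScript string before splitting
var keep = [];
for (var i = 0; i < names.length; i++) {
  var drop = false;
  for (var j = 0; j < excluded.length; j++) {
    if (names[i].indexOf(excluded[j]) == 0) { drop = true; break; }   // prefix match
  }
  if (!drop) keep.push(names[i]);
}
var columns_to_keep = keep.join(",");   // feed this list into the injection template

The same filtering could of course be done in whatever step you use to prepare the metadata rows; the point is only to show how the prefix list translates into a keep-list.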

Index function: Pentaho Data Integration

I need guidance regarding the most appropriate approach to perform an index function using Pentaho Data Integration (Kettle).
My situation is as follows:
Using the GLOBAL VoIP system report, I stored all data in a MySQL database, which gives me several ID numbers plus first and last names, but without the department name.
Each department has its own Excel reports, which can be identified by the group file name; this is not available in the Global file.
What I am trying to achieve is a lookup for each identification number to identify the department it belongs to, using the report filename, and to store it in the appropriate column.
Any help will be appreciated.
Assuming you're using the Excel File Input step, there is an option on the Additional Output Fields tab that will allow you to specify the Full Filename Field. You can name this whatever you want, and it will add an additional column to your incoming Excel data that contains the name of the file. You may need to do some regex cleanup on that field, since it's the full file path, not just the filename.
As far as doing the lookup, there are many lookup options to merge streams in the Lookup category of the design tab. I think the Stream Lookup is the step you'll want.
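If the full path gets in the way, here is a small sketch of that regex cleanup in a Modified Java Script Value step (the field name full_path is an assumption; use whatever name you typed in the Additional Output Fields tab):

var short_name = ("" + full_path).replace(/^.*[\\\/]/, "");   // strip the directory part, keep only the file name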
As far as I understood your need, you first have to build a "mapping table" with two columns: the department (i.e. the start of the xls filename) and the employee (i.e. its ID).
This table does not need to be materialized and may stay in a step of the PDI transformation. So:
Read all the xls files with a Microsoft Excel Input step. In case you do not know how to do it: browse to any of these files, press the Add button, then in the Selected files table remove the filename to keep only its directory path, and write .*\.xls in the Regex wildcard. Check that you select the appropriate files with the Show filename button.
In the same step, define the Sheet to be "Fiche technique" (assuming they are all the same). Define the fields to be "A" with type String (an empty column) and "ID" also with type String (otherwise you'll get an un-trappable error on "Agent ID" and "Total"). Also follow #eicherjc's suggestion and keep the filename, although I suggest you keep the Short filename and call it filename.
You should get a two-column stream, ID and filename, which needs a bit of data massaging before it can be used. The ID contains non-integer values and the filename contains extra characters.
The simplest way to do this is with a Modified Java Script Value step. I may suggest the following code:
var ID = Number(ID);                            // convert the ID string to a number
var regex = filename.match(/(.*)__\d+\.xls/);   // capture everything before "__<digits>.xls"
if (regex) filename = regex[1];                 // keep only the department part
Do not forget to specify that ID now has type Integer and to put a "Y" in the Replace value in field column of the Fields table at the bottom.
The first line converts any number to its numeric value, and any non-number to 0, which is an ID that does not exist.
The next lines extract the department from the filename with a regex. If you do not like regexes, you may use filename = filename.substr(0, filename.indexOf('__')), or any formula that will do the job.
Now you have a stream ready to be used, except that some employees may, rightly or wrongly, be in more than one department. If it does not matter which one, leave it like that. Otherwise you have to provide some logic to filter on the correct department.
You can now use a Stream Lookup step to read the department of each employee. The Lookup step is the Modified Java Script Value (or whatever name you gave to that step). The field to look up is the ID field from your MySQL data. The Lookup field is the ID (or whatever name you gave to column B of your xls files). And the field to retrieve is the filename (or, more precisely, the department name extracted from the filename).

How to map input to output fields from Excel to CSV in Pentaho?

How do I transform this in Pentaho? Where do I map the values from input to output columns, given that the positions and names are different between input and output?
You can rename the fields right in your MS-Excel-Input step, and you can reorder the fields in Text-File-Output. Also, a Select-Values step allows you to rename and reorder fields in one sweep on the Select & Alter tab.
The Select Values step allows you to change the column names and position (as well as type).
Note that the column names in the Excel Input are arbitrary and do not need to be related to the actual names in the Excel file, so you can rename them as you wish. You can even copy/paste the names into the Fields list.
Note also that the order of the columns in the CSV output file is defined in the Fields tab of the Text file output step. You can change it with the Ctrl+Arrow keys.
If you need to industrialize the process and have the new column names and order stored in, for example, a set of files or a database table, then you need metadata injection. Have a look at Diethard's or Jens' examples.