Index function: Pentaho Data Integration

I need guidance on the most appropriate approach to perform an index function using Pentaho Data Integration (Kettle).
My situation is as follows:
Using the global VoIP system report, I stored all the data in a MySQL database, which gives me several ID numbers plus first and last names, but without the department name.
Each department has its own Excel reports, which can be identified by the group file name; that name is not available in the global file.
What I am trying to achieve is a lookup on each identification number to find the department it belongs to, using the report filename, and to store that value in the appropriate column.
Any help will be appreciated.

Assuming you're using the Excel File Input step, there is an option on the Additional Output Fields tab that allows you to specify the Full Filename Field. You can name this whatever you want, and it will add an additional column to your incoming Excel data containing the name of the file. You may need to do some regex cleanup on that field, since it's the full file path, not just the filename.
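For that cleanup, one option is a Modified Javascript Value step (or a Regex Evaluation step). A minimal sketch, assuming the field you defined on the Additional Output Fields tab is called full_filename (the field names and sample value below are assumptions, not anything from your data):

// full_filename is the field added on the Additional Output Fields tab (name assumed)
var full_filename = 'C:\\reports\\Sales__2021.xls';   // sample value for illustration only
// Keep only the file name, dropping the directory path (handles both / and \ separators)
var filename = full_filename.replace(/^.*[\/\\]/, '');
// Optionally drop the extension as well, to keep just the group/department part
var groupname = filename.replace(/\.xlsx?$/i, '');

You would then declare filename (and/or groupname) in the Fields table at the bottom of the step so it is added to the output row.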
As far as doing the lookup, there are many options for merging streams in the Lookup category of the Design tab. I think the Stream lookup is the step you'll want.

As far as I understood your need, you first have to build a "mapping table" of two columns: the department (aka the start of the xls filename) and the employee (aka its ID).
This table does not need to be materialized and may stay in a step of the PDI transformation. So:
Read all the xls files with a Microsoft Excel Input step. In case you do not know how to do it: browse to any of these files, press the Add button, then in the Selected files table remove the filename to keep only its directory path, and write .*\.xls in the Regex wildcard column. Check that you select the appropriate files with the Show filename(s) button.
In the same step, define the Sheet to be "Fiche technique" (assuming they are all the same). Define the fields to be "A" with type String (an empty column) and "ID" also with type String (otherwise you'll get an untrappable error on "Agent ID" and "Total"). Also follow eicherjc's suggestion and keep the filename, although I suggest you keep the Short file name and call it filename.
You should get a two-column stream, ID and filename, which needs a bit of data massaging before it can be used: the ID contains non-integer values and the filename contains extra characters.
The simplest way to do this is with a Modified Javascript Value step. I suggest the following code:
var ID = Number(ID) || 0;                       // non-numeric IDs become 0, i.e. an ID that does not exist
var regex = filename.match(/(.*)__\d+\.xls/);   // keep only the part of the filename before "__<number>.xls"
if (regex) filename = regex[1];
Do not forget to specify that ID now has the type Integer and to put a "Y" in the Replace value column of the Fields table at the bottom.
The first line converts any numeric ID to its number, and any non-numeric value to 0, which is an ID that does not exist.
The next lines extract the department from the filename with a regex. If you do not like regexes, you may use filename = filename.substr(0, filename.indexOf('__')), or any formula that does the job.
Now you have a stream ready to be used, except that some employees may, rightly or wrongly, be in more than one department. If it does not matter which one, leave it like that. Otherwise you have to provide some logic to filter out the correct department.
You can now use a Stream lookup to read the department of each employee. The Lookup step is the Modified Javascript value (or whatever name you gave to that step). The field to look up is the ID field of your MySQL data. The Lookup field is the ID (or whatever name you gave to column B of your xls files). And the field to retrieve is the filename (or, more precisely, the department name extracted from the filename).
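If it helps to picture the join, here is a purely illustrative JavaScript sketch of what the Stream lookup produces; the IDs, names, and departments below are invented sample data, not PDI settings:

// Lookup stream: ID -> department, as built from the xls filenames (sample data)
var departmentById = { 1001: 'Sales', 1002: 'Support' };

// Main stream from MySQL (sample data); each row receives the department of its ID,
// or undefined when the ID is unknown.
var mysqlRows = [ { ID: 1001, name: 'Doe' }, { ID: 1003, name: 'Roe' } ];
for (var i = 0; i < mysqlRows.length; i++) {
    mysqlRows[i].department = departmentById[mysqlRows[i].ID];
}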

Related

How to get the column index number of a specific field name in a staged file on Snowflake?

I need to get the column number of a staged file on Snowflake.
The main idea behind it is that I need to automate getting this field in other queries rather than using t.$3, where 3 is the position of the field, which might change because we have expandable surveys (more or fewer questions depending on the situation).
So what I need is something like this:
SELECT COL_NUMBER FROM @my_stage/myfile.csv WHERE value = 'my_column_name'
-- Without any file format to read the header
And then this COL_NUMBER could be used as t.$"+COL_NUMBER+" inside merge queries.

Dynamic filename in Mapping Data Flow sink without the column in file

The way I understand it, if you want to have dynamic filenames when writing to blob storage from a Mapping Data Flow, the solution is to set "As data in column" in the file name options on the sink. This then uses the contents of a column as the filename for each row. To set the filename on the row, you can have a derived column that contains the expression.
With auto mapping enabled on the sink this then results in having a column in the file containing the filename.
With auto mapping turned off, I could map all columns except this one, but since I also want schema drift enabled on the source and to keep any extra columns in the destination, I can't have a fixed set of output columns.
How can I dynamically set the filename that gets generated without including it as a column in the file?
Or, if we assume every row will have the same filename, is there another way to dynamically set a filename? I've struggled to find any documentation on the file name options, but "Pattern" looks like it just adds a number, and "Single file" looks like a fixed value.
When you choose the 'Output to single file' option, you can create a parameter in the Data Flow and use it as the file name, then pass the value from the pipeline to the Data Flow:
My test:
1. Add a parameter in the Data Flow.
2. Use that parameter as the file name.
3. Pass the value to the parameter.

PDI /Kettle - Passing data from previous hop to database query

I'm new to PDI and Kettle, and what I thought was a simple experiment to teach myself some basics has turned into a lot of frustration.
I want to check a database to see if a particular record exists (i.e. vendor). I would like to get the name of the vendor from reading a flat file (.CSV).
My first hurdle is selecting only the vendor name from the 8 fields in the CSV.
The second hurdle is how to use that vendor name as a variable in a database query.
My third issue is what type of step to use for the database lookup.
I tried a dynamic SQL query, but I couldn't determine how to build the query using a variable, then how to pass the desired value to the variable.
The database table (VendorRatings) has 30 fields, one of which is vendor. The CSV also has 8 fields, one of which is also vendor.
My best effort was to use a dynamic query using:
SELECT * FROM VENDORRATINGS WHERE VENDOR = ?
How do I programmatically assign the desired value to "?" in the query? Specifically, how do I link the output of a specific field from Text File Input to the "vendor = ?" SQL query?
The best practice is a Stream lookup. For each record in the main flow (VendorRating), look up the vendor details (the lookup fields) in the reference file (the CSV), based on its identifier (possibly its number, or name, or firstname+lastname).
First "hurdle" : Once the path of the csv file defined, press the Get field button.
It will take the first line as header to know the field names and explore the first 100 (customizable) record to determine the field types.
If the name is not on the first line, uncheck the Header row present, press the Get field button, and then change the name on the panel.
If there is more than one header row or other complexities, use the Text file input.
The same is valid for the lookup step: use the Get lookup field button and delete the fields you do not need.
Due to the fact that there is at most one VendorRating per vendor, and that you have to do something if there is no match, I suggest the following flow:
Read the CSV and, for each row, look up in the table (i.e. the lookup table is the SQL table rather than the CSV file), and put a default value when there is no match. I suggest something really visible like "--- NO MATCH ---".
Then, in case of no match, a filter redirects the flow to the alternative action (here: insert into the SQL table). The two flows are then merged into the downstream flow.
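Conceptually (this is not PDI code, just an illustration with invented names and sample data), the lookup-with-default plus filter logic behaves like this:

// Invented sample data: the VendorRatings table keyed by vendor name
var vendorRatings = { 'ACME': 4, 'Globex': 3 };
var vendor = 'Initech';                          // vendor name read from the CSV row

// Database lookup with a default value when the vendor is not found
var rating = vendorRatings[vendor];
if (rating === undefined) {
    rating = '--- NO MATCH ---';
}

// Filter rows: rows carrying the default go to the "insert into the SQL table" branch,
// all other rows continue down the main flow; the two branches are merged afterwards.
var needsInsert = (rating === '--- NO MATCH ---');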

How to map input to output fields from Excel to CSV in Pentaho?

How do I transform this in Pentaho? Where do I map the values of the input columns to the output columns, given that the positions and names differ between input and output?
You can rename the fields right in your MS-Excel-Input step, and you can reorder the fields in Text-File-Output. Also, a Select-Values step allows you to rename and reorder fields in one sweep on the Select & Alter tab.
The Select values step allows you to change the column names and positions (as well as types).
Note that the column names in the Excel Input step are arbitrary and do not need to match the actual names in the Excel file, so you can rename them as you wish. You can even copy/paste the names into the Fields list.
Note also that the order of the columns in the CSV output file is defined in the Fields tab of the Text file output step. You can change it with the Ctrl+Arrow keys.
If you need to industrialize the process and have the new column names and order in, for example, a set of files or a database table, then you need Metadata Injection. Have a look at Diethard's or Jens's examples.

Excel data table (SQL query): once-deleted column no longer shows up

I have the following problem:
I have a data table that is fed by data from a SQL query.
The query works just fine, but not all the data is displayed. I deleted one of the columns earlier and now wanted to re-add it, but it does not show up.
Is there a way to get this to work?
Basically, I have those columns:
Name, First name, birthday, gender
Now I deleted gender:
Name, First name, birthday
After a while, I wanted to re-add gender, but the data table shows the following:
Name, first name, birthday
It does work if I change the column name from gender to sex in the SQL query, but that is not a solution I can live with.
If I change the name, then rename the column header, on the next refresh, the name is reinstated. If I rename the column header, then change the column name in the SQL query, the column disappears on the next refresh.
Anyone with a solution?
I'm guessing you have Preserve column/sort/filter/layout checked in the External Data Properties dialog (right-click > Table > External Data Properties). Try unchecking it, refreshing, and then checking it again. Save first!
I had the same issue, and finally found an easy solution for adding columns. Click on the table, then Query > Edit > Advanced Editor (under the Home tab).
You should see the source code for the query. In the first line of code, you will see Columns= (followed by your number of columns).
You need to change this number to reflect the correct number of columns in the new CSV file. I originally had 17 columns. I added two data columns, so I changed this number to 19.
Close the editor and refresh, and you should be all set.
For anyone who needs it:
I did not manage to follow the method described by Sullivan, so I found a different one. It requires editing the XML inside the unzipped XLSX file.
1. Add a column to your table and rename it to the column name you want to restore.
2. Open in Notepad the queryTable[#].xml file extracted from the XLSX (you have to find the proper one).
3. Open (in Notepad or IE) the proper table[#].xml containing your table, find your column by name, and remember its ID as your column ID.
4. Find the tag with the name of the column you need to restore and remove that tag.
5. Find the column with the text tableColumnId="[your column ID]".
6. Add the attribute name="[column name]" and delete the attribute dataBound="0".
7. Save queryTable[#].xml, zip all the folders back into one file, and rename it to .xlsx (never zip the single folder that contains everything; you need to select all the objects and zip them).
The [#] of the queryTable file is not always the same as the # of the table file.
The relation is described in xl\tables\_rels\table[#].xml.rels.