PDI/Kettle - Passing data from previous hop to database query - pentaho

I'm new to PDI and Kettle, and what I thought was a simple experiment to teach myself some basics has turned into a lot of frustration.
I want to check a database to see whether a particular record (e.g. a vendor) exists. I would like to get the name of the vendor by reading a flat file (.CSV).
My first hurdle is selecting only the vendor name from the 8 fields in the CSV.
The second hurdle is how to use that vendor name as a variable in a database query.
My third issue is what type of step to use for the database lookup.
I tried a dynamic SQL query, but I couldn't determine how to build the query using a variable, then how to pass the desired value to the variable.
The database table (VendorRatings) has 30 fields, one of which is vendor. The CSV has 8 fields, one of which is also vendor.
My best effort was to use a dynamic query using:
SELECT * FROM VENDORRATINGS WHERE VENDOR = ?
How do I programmatically assign the desired value to "?" in the query? Specifically, how do I link the output of a specific field from Text File Input to the "vendor = ?" SQL query?

The best practice is a Stream lookup: for each record in the main flow (VendorRatings), look up the vendor details (the lookup fields) in the reference file (the CSV), based on its identifier (possibly its number, its name, or firstname+lastname).
First "hurdle": once the path of the CSV file is defined, press the Get Fields button.
It takes the first line as a header to get the field names, and scans the first 100 records (customizable) to determine the field types.
If the names are not on the first line, uncheck Header row present, press the Get Fields button, and then change the names in the panel.
If there is more than one header row, or some other complexity, use the Text file input step.
The same applies to the lookup step: use the Get lookup fields button and delete the fields you do not need.

Given that there is at most one VendorRatings row per vendor, and that you have to do something when there is no match, I suggest the following flow:
Read the CSV and, for each row, look up the vendor in the table (i.e. the lookup table is the SQL table rather than the CSV file), and set a default value when nothing matches. I suggest something really visible like "--- NO MATCH ---".
Then, in case of no match, a Filter step redirects the flow to the alternative action (here: insert into the SQL table). The two flows are then merged into the downstream flow.
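The flow above boils down to a keyed lookup with a visible default, followed by a filter on that default. Here is a minimal sketch of that logic in plain JavaScript (not PDI code), assuming the CSV rows and the VendorRatings rows are already in memory. The function name lookupVendors, the field names vendor and rating, and the sample data are all made up for illustration:

```javascript
// Sentinel written into the looked-up field when no table row matches.
const NO_MATCH = "--- NO MATCH ---";

// csvRows: rows read from the CSV; tableRows: rows of the SQL table.
// Returns the rows split into the two branches a Filter step would produce.
function lookupVendors(csvRows, tableRows) {
  // Index the table by vendor (at most one rating row per vendor is assumed).
  const byVendor = new Map(tableRows.map((r) => [r.vendor, r]));

  const matched = [];
  const toInsert = [];
  for (const row of csvRows) {
    const hit = byVendor.get(row.vendor);
    const out = { ...row, rating: hit ? hit.rating : NO_MATCH };
    // Filter: rows carrying the sentinel go to the alternative action.
    (out.rating === NO_MATCH ? toInsert : matched).push(out);
  }
  return { matched, toInsert };
}

// Hypothetical sample data.
const result = lookupVendors(
  [{ vendor: "Acme" }, { vendor: "Globex" }],
  [{ vendor: "Acme", rating: "A" }]
);
console.log(result.matched, result.toInsert);
```

In PDI itself the lookup, the default value, and the filter are configured in the steps rather than coded; the sketch only shows what each row goes through.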

Related

How to get the column index number of a specific field name in a staged file on Snowflake?

I need to get the column number of a staged file on Snowflake.
The main idea is that I need to get this field automatically in other queries rather than using t.$3, where 3 is the position of the field. That position might change because we have expandable surveys (more or fewer questions depending on the situation).
So what I need is something like this:
SELECT COL_NUMBER FROM @my_stage/myfile.csv WHERE value = 'my_column_name'
-- Without any file format to read the header
And then this COL_NUMBER could be used as t.$"+COL_NUMBER+" inside merge queries.

Create table schema and load data in bigquery table using source google drive

I am creating a table using Google Drive as the source and Google Sheets as the format.
I selected "Drive" as the value for "Create table from" and Google Sheet as the file format.
I also selected schema auto-detection and input parameters.
It creates the table, but the first row of the sheet is loaded as data instead of being used as the column names.
Please tell me what I need to do so that the first row of the sheet is used as the table's column names rather than loaded as data.
It would have been helpful if you could include a screenshot of the top few rows of the file you're trying to upload at least to see the data types you have in there. BigQuery, at least as of when this response was composed, cannot differentiate between column names and data rows if both have similar datatypes while schema auto detection is used. For instance, if your data looks like this:
headerA, headerB
row1a, row1b
row2a, row2b
row3a, row3b
BigQuery would not be able to detect the column names (at least automatically using the UI options alone) since all the headers and row data are Strings. The "Header rows to skip" option would not help with this.
Schema auto detection should be able to detect and differentiate column names from data rows when you have different data types for different columns though.
You have an option to skip header rows under Advanced options. Simply enter 1 as the number of rows to skip (your first row is where your header is). It will skip the first row as data and use its values as the column names.
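If the table is created through the API rather than the UI, the same setting is, as far as I know, the skipLeadingRows field under googleSheetsOptions in the table's externalDataConfiguration. The sketch below only builds that part of the request body as a plain object; the helper name buildSheetConfig and the sheet URL are placeholders, not anything from the question:

```javascript
// Builds the externalDataConfiguration part of a BigQuery table resource
// for a table backed by a Google Sheet. Nothing is sent to BigQuery here.
function buildSheetConfig(sheetUrl, headerRows) {
  return {
    sourceFormat: "GOOGLE_SHEETS",
    sourceUris: [sheetUrl],
    // Same as ticking schema auto-detection in the UI.
    autodetect: true,
    googleSheetsOptions: {
      // Same as "Header rows to skip"; int64 values travel as strings in JSON.
      skipLeadingRows: String(headerRows),
    },
  };
}

// Placeholder URL, for illustration only.
const sheetConfig = buildSheetConfig(
  "https://docs.google.com/spreadsheets/d/SHEET_ID",
  1
);
console.log(JSON.stringify(sheetConfig, null, 2));
```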

Index function : Pentaho Data Integration

I need guidance on the most appropriate approach to perform an index (lookup) function using Pentaho Data Integration (Kettle).
My situation is as follows:
Using the GLOBAL VoIP system report, I stored all the data in a MySQL database, which gives me several ID numbers with first and last names, but without the department name.
Each department has its own Excel reports, which can be identified by the group file name; that name is not available in the global file.
What I am trying to achieve is a lookup for each identification number that finds the department the person belongs to, using the report filename, and stores it in the appropriate column.
Any help will be appreciated.
Assuming you're using the Excel File Input step, there is an option on the Additional Output Fields tab that lets you specify the Full Filename Field. You can name this whatever you want, and it will add a column to your incoming Excel data that holds the name of the file. You may need to do some regex cleanup on that field since it's the full file path, not just the filename.
As far as doing the lookup, there are many lookup options to merge streams in the Lookup category of the design tab. I think the Stream Lookup is the step you'll want.
As far as I understood your need, you have to first build a "mapping table" of two columns: the department (aka the start of the xls filename) and the employee (aka its ID).
This table does not need to be materialized and may stay in a step of the PDI transformation. So:
Read all the xls files with a Microsoft Excel Input step. In case you do not know how to do it: browse to any of these files, press the Add button, then in the Selected files table, remove the filename to keep only its directory path and write .*\.xls in the Regex wildcard column. Check that you select the appropriate files with the Show filename button.
In the same step, define the Sheet to be "Fiche technique" (assuming they are all the same). Define the fields to be "A" with type String (an empty column) and "ID", also with type String (otherwise you'll get an un-trappable error on "Agent ID" and "Total"). Also follow @eicherjc's suggestion and keep the filename, although I suggest you keep the Short file name and call it filename.
You should get a two-column stream, ID and filename, which needs a bit of data massaging before it can be used. The ID column contains non-integer values and the filename contains extra characters.
The simplest way to do this is with a Modified Javascript Value. I may suggest the code:
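var ID = Number(ID) || 0;  // non-numeric text such as "Total" gives NaN, which becomes 0
var regex = filename.match(/(.*)__\d+\.xls/);  // capture the part before "__<digits>.xls"
if(regex) filename = regex[1];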
Do not forget to specify that ID now has type Integer, and to put a "Y" in the Replace value column of the Fields table at the bottom.
The first line converts numeric text to its number and anything non-numeric to 0, which is an ID that does not exist.
The next lines extract the department from the filename with a regex. If you do not like regexes, you may use filename = filename.substr(0, filename.indexOf('__')), or any formula that does the job.
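To try this cleanup outside PDI, the same logic can be wrapped in a small function. This is only a sketch: the helper name cleanRow and the sample filenames are invented, and the "department__number.xls" naming is assumed from the regex above:

```javascript
// Hypothetical helper mirroring the Modified Javascript Value step.
function cleanRow(row) {
  // Numeric text becomes a number; anything else (NaN) falls back to 0.
  const id = Number(row.ID) || 0;
  // Keep only the part of the filename before "__<digits>.xls".
  const m = String(row.filename).match(/(.*)__\d+\.xls/);
  return { ID: id, filename: m ? m[1] : row.filename };
}

// Made-up sample rows: one real ID and one "Total" row.
const cleaned = [
  { ID: "1042", filename: "Sales__2016.xls" },
  { ID: "Total", filename: "Support__7.xls" },
].map(cleanRow);
console.log(cleaned);
```

Rows that end up with ID 0 (headers, totals) never match a real employee, so they drop out of the lookup on their own.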
Now you have a stream ready to be used, except that some employees may, rightly or wrongly, be in more than one department. If it does not matter which one is picked, leave it like that. Otherwise you have to provide some logic to select the correct department.
You can now use a Stream lookup to read the department of each employee. The Lookup step is the Modified Javascript Value (or whatever name you gave to this step). The field to look up is the ID field from your MySQL table. The Lookup field is the ID (or whatever name you gave to column B of your xls files). And the field to retrieve is the filename (or, more precisely, the department name extracted from the filename).
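Conceptually, the lookup described above does the following. This is a sketch in plain JavaScript, not PDI code; the function name addDepartment and the field names id, name and department are assumptions. When an ID appears under several departments the sketch keeps the first one seen, which is an arbitrary choice made here and not a statement about how PDI resolves duplicates:

```javascript
// employees: rows from MySQL; mapping: cleaned (ID, filename) rows from the xls files.
function addDepartment(employees, mapping, fallback = null) {
  const deptById = new Map();
  for (const m of mapping) {
    // Skip ID 0 (non-employee rows) and keep the first department seen per ID.
    if (m.ID !== 0 && !deptById.has(m.ID)) deptById.set(m.ID, m.filename);
  }
  // Add the department to each employee, or the fallback when nothing matches.
  return employees.map((e) => ({
    ...e,
    department: deptById.has(e.id) ? deptById.get(e.id) : fallback,
  }));
}

// Made-up sample data.
const withDept = addDepartment(
  [{ id: 1042, name: "Ann" }, { id: 2001, name: "Bob" }],
  [{ ID: 1042, filename: "Sales" }, { ID: 0, filename: "Sales" }, { ID: 1042, filename: "Support" }]
);
console.log(withDept);
```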

Spotfire dynamic filtering

I have a file that contains a few part numbers. Using this file, I need to exclude data in a dashboard from another table that also has part numbers. How can I filter data out of the table based on the part numbers present in the file, given that the part numbers in the file can change over time?
When you import the file with a list of part numbers, add a calculated column under transformations (also make sure that it's not reading the first record of your part list file as a header row--I don't know what your file looks like). In the expression box, just enter something simple like 1. Call this new dataset something like part_list. This column represents a flag that we will add to the table that is already in your dashboard. Let's suppose that table is called data.
Once the file is imported, click Insert > Columns... and ensure that data is selected in the "Add columns to data table:" drop down box, and that part_list is selected in the "Add columns from:" menu. Click Next. Match the part number column in both tables, and click next. Add the flag column to data with a left outer join (assuming this makes sense with your data). Once the column is added, you can filter out the 1's.
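The flag-and-filter idea above is independent of Spotfire and can be sketched in plain JavaScript. The function name excludeListedParts, the column name partNumber and the sample rows are invented for illustration:

```javascript
// data: rows of the dashboard table; partList: rows imported from the file.
function excludeListedParts(data, partList) {
  // Every part number in the file would carry flag = 1 after the import.
  const flagged = new Set(partList.map((p) => p.partNumber));
  // Left outer join: every data row is kept, flag is 1 or null.
  const joined = data.map((row) => ({
    ...row,
    flag: flagged.has(row.partNumber) ? 1 : null,
  }));
  // Filtering out the 1's leaves only the parts not in the file.
  return joined.filter((row) => row.flag !== 1);
}

// Made-up sample data.
const kept = excludeListedParts(
  [{ partNumber: "A1" }, { partNumber: "B2" }, { partNumber: "C3" }],
  [{ partNumber: "B2" }]
);
console.log(kept.map((r) => r.partNumber));
```

Because the flag is recomputed from whatever the file currently contains, reloading the file is enough to pick up a changed part list.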
If this does not answer your question, consider providing more details about what your data looks like.

Qlikview Current selections box to use labels rather than table.fieldnames

In my QlikView document I want to change the Current Selections information to use the label applied to the field rather than the table.field format.
For example
PartsTable.PartNo
Would be
Part Number
Unfortunately there's no way to do this in the Current Selections object. However, you may have some alternatives depending on your requirements.
First of all, if you're just happy with seeing the current selections as text (rather than having the functionality of the Current Selections object), you can create a Text object and use the expression:
=replace(GetCurrentSelections(), 'PartsTable.PartNo', 'Part Number')
The Text object then shows the current selections with Part Number in place of PartsTable.PartNo.
The other alternative is to use the RENAME statement in the load script after all your table loads are complete. RENAME allows you to rename a single field or a collection of fields by using a mapping table. The syntax for a single field is shown below:
RENAME FIELD oldname to newname
If you should need to rename more than one field at a time, you can expand this to:
RENAME FIELD oldname1 to newname1, oldname2 to newname2,...
More detail on the syntax including using a mapping table can be found in the QlikView installed help file.
For your example, I put together a small demo:
QUALIFY *;
PartsTable:
LOAD * INLINE [
PartNo
100
200
300
];
UNQUALIFY *;
RENAME FIELD PartsTable.PartNo to [Part Number];
The field then appears as Part Number in the Current Selections box.
RENAME is similar to the alias (AS) statement, except that you can first load all of your data and then do the rename at the end. The field then appears under its new name in any front-end controls (e.g. Current Selections). However, this may not be suitable for you if you already have an existing field named Part Number in your script.