Pentaho compare values from table to a number from REST api - pentaho

I need to make a dimension for a datawarehouse using pentaho.
I need to compare a number in a table with the number I get from a REST call.
If the number is not in the table, I need to set it to a default (999). I was thinking to use table input step with a select statement, and a javascript step that if the result is null to set it to 999. The problem is if there is no result, there is nothing passed through. How can this be done? Another idea was to get all values from that table and somehow convert it to something so I can read id as an array in javascript. I'm very new to pentaho DI but I've did some research but couldn't find what I was looking for. Anyone know how to solve this? If you need information, or want to see my transformation let me know!
Steps something like this:
Load number from api
Get Numbers from table
A) If number not in table -> set number to value 999
B) If number is in table -> do nothing
Continue with transformation with that number
I have this atm:
But the problem is if the number is not in the table, it returns nothing. I was trying to check in javascript if number = null or 0 then set it to 999.
Thanks in advance!

Replace the Input rain-type table by a lookup stream.
You read the main input with a rest, and the dimension table with an Input table, then make a Stream Lookup in which you specify that the lookup step is the dimension input table. In this step you can also specify a default value of 999.
The lookup stream works like this: for each row coming in from the main stream, the steps looks if it exists in the reference step and adds the reference fields to the row. So there is always one and exactly one passing by.

Related

Azure Data Factory CDS Connector Random Records

I have an online implementation of Dynamics 365 where I need to use Azure Data factory to do the following:
1 - pull some records based on a fetchxml criteria
2 - select random 100 records out of these
3 - insert them into a different entity in D365.
I am using the ADF CDS connector which only supports copy activity (does not support data flows as yet)
What I am hoping I can do is the following:
Task 1 - copy all records into a csv file and add an extra column that contains a random ID as an extra column
Issue here: When I do this, and use the rand() function, all the numbers returned are the same:
The same issue happens if I try to use #guid() all values come back the same.
Question 1 - Is there a reason why rand() and guid() are returning the same values for all records and is there a way to work around it.
Question 2 - Is there another way that I can't think of to achieve what I am trying to do: pick random x number of records from a dataset?
Is there a reason why rand() and guid() are returning the same values
for all records and is there a way to work around it?
This is because ADF will execute this expression first and just one time. Then use it's result as the column value. So you get the same data. As a workaround, you need to copy records into csv file first. Then use the csv file as source in Data Flow, then use Derived Column to add randID column. You can also create an Azure Function to add randID column and invoke it in ADF.

Add new column to existing table Pentaho

I have a table input and I need to add the calculation to it i.e. add a new column. I have tried:
to do the calculation and then, feed back. Obviously, it stuck the new data to the old data.
to do the calculation and then feed back but truncate the table. As the process got stuck at some point, I assume what happens is that I was truncating the table while the data was still getting extracted from it.
to use stream lookup and then, feed back. Of course, it also stuck the data on the top of the existing data.
to use stream lookup where I pull the data from the table input, do the calculation, at the same time, pull the data from the same table and do a lookup based on the unique combination of date and id. And use the 'Update' step.
As it is has been running for a while, I am positive it is not the option but I exhausted my options.
It's seems that you need to update the table where your data came from with this new field. Use the Update step with fields A and B as keys.
actully once you connect the hope, result of 1st step is automatically carried forward to the next step. so let's say you have table input step and then you add calculator where you are creating 3rd column. after writing logic right click on calculator step and click on preview you will get the result with all 3 columns
I'd say your issue is not ONLY in Pentaho implementation, there are somethings you can do before reaching Data Staging in Pentaho.
'Workin Hard' is correct when he says you shouldn't use the same table, but instead leave the input untouched, and just upload / insert the new values into a new table, doesn't have to be a new table EVERYTIME, but instead of truncating the original, you truncate the staging table (output table).
How many 'new columns' will you need ? Will every iteration of this run create a new column in the output ? Or you will always have a 'C' Column which is always A+B or some other calculation ? I'm sorry but this isn't clear. If the case is the later, you don't need Pentaho for transformations, Updating 'C' Column with a math or function considering A+B, this can be done directly in most relational DBMS with a simple UPDATE clause. Yes, it can be done in Pentaho, but you're putting a lot of overhead and processing time.

PDI /Kettle - Passing data from previous hop to database query

I'm new to PDI and Kettle, and what I thought was a simple experiment to teach myself some basics has turned into a lot of frustration.
I want to check a database to see if a particular record exists (i.e. vendor). I would like to get the name of the vendor from reading a flat file (.CSV).
My first hurdle selecting only the vendor name from 8 fields in the CSV
The second hurdle is how to use that vendor name as a variable in a database query.
My third issue is what type of step to use for the database lookup.
I tried a dynamic SQL query, but I couldn't determine how to build the query using a variable, then how to pass the desired value to the variable.
The database table (VendorRatings) has 30 fields, one of which is vendor. The CSV also has 8 fields, one of which is also vendor.
My best effort was to use a dynamic query using:
SELECT * FROM VENDORRATINGS WHERE VENDOR = ?
How do I programmatically assign the desired value to "?" in the query? Specifically, how do I link the output of a specific field from Text File Input to the "vendor = ?" SQL query?
The best practice is a Stream lookup. For each record in the main flow (VendorRating) lookup in the reference file (the CSV) for the vendor details (lookup fields), based on its identifier (possibly its number or name or firstname+lastname).
First "hurdle" : Once the path of the csv file defined, press the Get field button.
It will take the first line as header to know the field names and explore the first 100 (customizable) record to determine the field types.
If the name is not on the first line, uncheck the Header row present, press the Get field button, and then change the name on the panel.
If there is more than one header row or other complexities, use the Text file input.
The same is valid for the lookup step: use the Get lookup field button and delete the fields you do not need.
Due to the fact that
There is at most one vendorrating per vendor.
You have to do something if there is no match.
I suggest the following flow:
Read the CSV and for each row look up in the table (i.e.: the lookup table is the SQL table rather that the CSV file). And put default upon not matching. I suggest something really visible like "--- NO MATCH ---".
Then, in case of no match, the filter redirect the flow to the alternative action (here: insert into the SQL table). Then the two flows and merged into the downstream flow.

How can I map each specific row value to an ID in Pentaho?

I’m new to Pentaho and I’m currently having an issue with mapping specific row values to an ID.
I have a data file with around 30 columns, one of which is for currencies (USD, GBP, AUD, etc).
The main objective is to have the user select up to 8 (minimum of 1) currencies and map them to a corresponding ID 1-8. All other currencies not in the specified 8 will be mapped with an ID of 9.
The final step is to then output the original data set, along with the IDs.
I’m pretty sure I’m making this way harder than it should, but here is what I have at the moment.
I have created a job where the first step is to set the variables for my 8 currencies, selectionOne -> AUD, selectionTwo -> GBP, …, selectionEight -> JPY.
I then have a transformation to read the data from the file and use the copy rows to result step.
Following that I have a second job called for-each which is my loop for checking the current currency in the row.
Within this job I have two transformations, one called set-current, one called map-currencies.
set-current simply uses the get rows from result step (to grab the data from the first transformation). I then use the set variable step to set the current currency to the value in currency field. This works fine, as each pass through in the loop changes the current variable to the correct value.
Map-currencies is where I’m having the most issues.
The goal is to use the filter row step to compare the current currency against the original 8 selected currencies, and then using the value mapper step to map it to an ID, before outputting the csv file.
The main issue here, is that I can’t use my original variables in the filter or value mapper.
So, what I’ve done here is use the get variables step to retrieve the variables and named them: one, two, three, …, eight. This allows me to bypass the filtering issue, but they don’t seem to work for the value mapper, which is the all important step.
The second issue is that when the file is output it only outputs one value (because of the loop), selecting the append option works, but this could be a problem if the job is run more than once.
However, the priority here is the mapping issue.
I understand that this is rather long, and perhaps a tad confusing, but I will greatly appreciate any help on this, even if it’s an entirely new approach 😊.
Like I said, I’m probably making it harder than it should be.
Thanks for your time.
Edit for AlainD
Input example
Output example
This should be doable in a single transformation using the Stream Lookup step.
Text File Input is your main file, Property input reads your property file into Key and Value columns. You could use a normal text file with two columns instead of the property input.
Below are the settings of the Stream lookup. Note the default value "9" for records that are not found in the lookup stream.

how to remove duplicate row from output table using Pentaho DI?

I am creating a transformation that take input from CSV file and output to a table. That is running correctly but the problem is if I run that transformation more then one time. Then the output table contain the duplicate rows again and again.
Now I want to remove all duplicate row from the output table.
And if I run the transformation repeatedly it should not affect the output table until it don't have a new row.
How I can solve this?
Two solutions come to my mind:
Use Insert / Update step instead of Table input step to store data into output table. It will try to search row in output table that matches incoming record stream row according to key fields (all fields / columns in you case) you define. It works like this:
If the row can't be found, it inserts the row. If it can be found and the fields to update are the same, nothing is done. If they are not all the same, the row in the table is updated.
Use following parameters:
The keys to look up the values: tableField1 = streamField1; tableField2 = streamField2; tableField3 = streamField3; and so on..
Update fields: tableField1, streamField1, N; tableField2, streamField2, N; tableField3, streamField3, N; and so on..
After storing duplicite values to the output table, you can remove duplicites using this concept:
Use Execute SQL step where you define SQL which removes duplicite entries and keeps only unique rows. You can inspire here to create such a SQL: How can I remove duplicate rows?
Another way is to use the Merge rows (diff) step, followed by a Synchronize after merge step.
As long as the number of rows in your CSV that are different from your target table are below 20 - 25% of the total, this is usually the most performance friendly option.
Merge rows (diff) takes two input streams that must be sorted on its key fields (by a compatible collation), and generates the union of the two inputs with each row marked as "new", "changed", "deleted", or "identical". This means you'll have to put Sort rows steps on the CSV input and possibly the input from the target table if you can't use an ORDER BY clause. Mark the CSV input as the "Compare" row origin and the target table as the "Reference".
The Synchronize after merge step then applies the changes marked in the rows to the target table. Note that Synchronize after merge is the only step in PDI (I believe) that requires input be entered in the Advanced tab. There you set the flag field and the values that identify the row operation. After applying the changes the target table will contain the exact same data as the input CSV.
Note also that you can use a Switch/Case or Filter Rows step to do things like remove deletes or updates if you want. I often flow off the "identical" rows and write the rest to a text file so I can examine only the changes.
I looked for visual answers, but the answers were text, so adding this visual-answer for any kettle-newbie like me
Case
user-updateslog.csv (has dup values) ---> users_table , store only latest user detail.
Solution
Step 1: Connect csv to insert/update as in the below Transformation.
Step 2: In Insert/Update, add condition to compare keys to find the candidate row, and choose "Y" fields to update.