Pentaho step - Use SQL functions to add a column to data before dumping it into the DB

I am fairly new to Pentaho, and while working on it I have stumbled across a problem. This is my flow:
1. Read input from a file. Let's say this has 5 columns.
2. Make some modifications to the existing columns (filter, modify and so on).
3. Add a new column whose value is an SQL function of the current row's data; for example, it could be sum(id, id+1).
4. Dump to the database.
Steps 1, 2 and 4 are already in place and working fine. It's step 3 where I am stuck. I've tried the Execute SQL step, but that is only for modifying DDL and doesn't return data. Table input needs the data to be in a table already, which isn't the case for me.
I have a workaround: I can insert all rows into the DB and then fire an update query, but I was hoping there is a better way to do this.
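If you do end up with that workaround, the post-load update is a single set-based statement; a minimal sketch, with illustrative table and column names:

-- Hypothetical post-load update: derive the new column from the row's
-- existing values (here, the row-wise sum(id, id+1) from the question).
UPDATE my_target_table
SET new_col = id + (id + 1);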

You can add a Formula step and, in the formula column, specify what you want to achieve: for example, [your other column]+1, saved into a new field, or you can also replace the existing field's value.

Related

Add new column to existing table in Pentaho

I have a table input and I need to add a calculation to it, i.e. add a new column. I have tried:
to do the calculation and then feed it back. Obviously, that appended the new data to the old data.
to do the calculation and then feed it back, but truncating the table first. As the process got stuck at some point, I assume what happened is that I was truncating the table while data was still being extracted from it.
to use a stream lookup and then feed back. Of course, that also appended the data on top of the existing data.
to use a stream lookup where I pull the data from the table input, do the calculation, and at the same time pull the data from the same table and do a lookup based on the unique combination of date and id, then use the 'Update' step.
As that last one has been running for a while, I am positive it is not the option either, but I have exhausted my options.
It seems that you need to update the table your data came from with this new field. Use the Update step with fields A and B as keys.
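For what it's worth, the Update step effectively issues a keyed statement like the following for each incoming row (names are illustrative; in your case the keys would be date and id):

-- Per-row update as the PDI Update step performs it: the key fields
-- locate the row, the remaining fields are overwritten from the stream.
UPDATE my_table
SET calculated_col = ?  -- value computed in the transformation
WHERE a = ?             -- key field A
  AND b = ?;            -- key field B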
Actually, once you connect the hop, the result of the first step is automatically carried forward to the next step. So let's say you have a Table input step and then add a Calculator step where you create the third column. After writing the logic, right-click the Calculator step and click Preview; you will get the result with all three columns.
I'd say your issue is not ONLY in the Pentaho implementation; there are some things you can do before reaching data staging in Pentaho.
'Workin Hard' is correct when he says you shouldn't use the same table: leave the input untouched and just upload / insert the new values into a new table. It doesn't have to be a new table EVERY TIME; instead of truncating the original, you truncate the staging (output) table.
How many 'new columns' will you need? Will every iteration of this run create a new column in the output? Or will you always have a 'C' column which is always A+B or some other calculation? I'm sorry, but this isn't clear. If it's the latter, you don't need Pentaho for the transformation: updating the 'C' column with a math function over A and B can be done directly in most relational DBMSs with a simple UPDATE clause, as sketched below. Yes, it can be done in Pentaho, but you're adding a lot of overhead and processing time.
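A minimal sketch of that in-database approach, assuming a table that already has numeric columns A and B and an existing C column (all names illustrative):

-- One set-based statement computes C for every row,
-- with no row-by-row ETL overhead.
UPDATE my_table
SET C = A + B;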

Retrieve results from a batch of SQL queries in Pentaho or Postgres?

I'm still relatively new to SQL and Pentaho.
I've pulled a table with two different IDs and need to run a query for each specific instance.
For example,
SELECT *
FROM Table
WHERE RecordA = 'value in column A'
AND RecordB = 'value in column B'
I need the results back, either appended to new columns in the original table or part of their own text file output.
I was initially looking at using a formula for this inside Pentaho, but couldn't quite figure it out. Since I have the query written, I threw it into Excel and got the concatenated results (so a string of 350 or so queries that I need to run). I'm just not sure how to accomplish this - I tried the Execute SQL Script step inside Pentaho, but it doesn't seem to produce output?
Any direction would be useful. I've searched a little but have come up short so far, possibly because I am still pretty new to this platform.
You can accomplish this behavior in a lot of ways, with a "Database Lookup" step for example, but I usually do it in a quite easy way. Here is an example for you to test; I hope it helps.
The idea here is to have two Table input steps. The first one fetches the IDs we want to look at, using a simple SELECT of the ID values. The result will be a one-column stream of rows.
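For instance, something along these lines (table and column names are illustrative; for your two-ID case, just select both columns):

-- First Table input: produce the stream of lookup values.
SELECT DISTINCT RecordA
FROM id_source_table;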
Next we have a second Table input that reads the rows received and executes its query once for each incoming row. In that step's options, set 'Insert data from step' to the first Table input and tick 'Execute for each row?'.
What it does is replace each '?' placeholder with the data that is received. If you need two columns, use two '?'s, but remember that it replaces the first '?' with the first column and the second '?' with the second column.
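Applied to the query from your question, the second Table input would contain something like this (assuming the incoming stream carries the two ID columns in this order):

-- Second Table input: '?' placeholders are filled from the incoming row.
SELECT *
FROM Table
WHERE RecordA = ?  -- replaced with the first stream column
  AND RecordB = ?; -- replaced with the second stream column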
And you are good to go. Test it a couple of times and good luck.

SSIS Check Excel source rows redirect rows to another table on 'x' number of field matches

I work in a sales-based environment and our data consists of 'leads'.
Let's say we record CompanyName, PhoneNumber, Address1 & PostCode (ZIP). These rows are seeded with a unique ID in the schema.
The leads come in from various sources, are compiled onto a spreadsheet, and are then imported into SQL 2012 using SSIS.
After a validation check to see if a file exists, we then use a simple data flow which consists of an Excel source, Derived Column, Data Conversion and finally an OLE DB Destination.
I'm sure my requirement has a relatively simple solution; understanding what I need to achieve is the first step. I need to take a sample of data from the last rolling two months, and if 2 or more fields in the source Excel file match the corresponding fields in the destination SQL table, I want to redirect those rows to another table.
I am unsure which combination of components I could use to achieve this. I believe Fuzzy Lookup may not be what I am looking for, as I need exact field matches; I have looked at the Lookup component, but I am unsure if it is the way to go.
Could anyone please provide some advice on how I can best achieve this as simply as possible?
You can use the Lookup to check for matches in your existing table. However, it will be fairly complicated to implement the requirement of checking for any two or more fields matching. Your expression would be long and complex, basically consisting of:
(using pseudocode for readability)
IIF((src.a = dst.a AND src.b = dst.b) OR (src.a = dst.a AND src.c = dst.c) OR (src.b = dst.b AND src.c = dst.c) OR ... and so on)
for every combination of two columns you want to test.
I would do this by importing the entire spreadsheet to a staging table, and doing the existing rows check in a SQL stored proc that moves the data to the desired destination table.
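A sketch of that existing-rows check, assuming the lead fields from the question and a hypothetical CreatedDate column for the rolling two-month window (all object names are illustrative):

-- Rows from staging that match an existing lead on 2+ fields get
-- redirected; each CASE contributes 1 when its field matches.
INSERT INTO dbo.LeadMatches (CompanyName, PhoneNumber, Address1, PostCode)
SELECT s.CompanyName, s.PhoneNumber, s.Address1, s.PostCode
FROM dbo.LeadStaging AS s
WHERE EXISTS (
    SELECT 1
    FROM dbo.Leads AS d
    WHERE d.CreatedDate >= DATEADD(MONTH, -2, GETDATE())
      AND (CASE WHEN s.CompanyName = d.CompanyName THEN 1 ELSE 0 END
         + CASE WHEN s.PhoneNumber = d.PhoneNumber THEN 1 ELSE 0 END
         + CASE WHEN s.Address1    = d.Address1    THEN 1 ELSE 0 END
         + CASE WHEN s.PostCode    = d.PostCode    THEN 1 ELSE 0 END) >= 2
);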

In SQL how do I update a table with a similar table?

In my current database I have a table whose data is either entered manually or comes in as an Excel sheet every week. Before we had the "manual entry" option, the table would be dropped and replaced by the Excel version.
Now, because there is data that only exists in the original table, this can no longer be done.
I'm trying to find a way to update the original table with the changes and additions from the (Excel) table while preserving all rows not in the new sheet.
I've been attempting to simply use an insert query and an update query, but I can't find a way to detect changes in a record.
Any suggestions? I can provide the current sql if you'd find that helpful.
Based on what I have read so far, I think I can offer some suggestions:
It appears you have control of the MS Access database. I would suggest adding a field called "source" to your data table. Modify the form in the Access database to store something like "m" (for manual entry) in the source field; when you import the Excel sheet, store an "e" (for Excel) instead.
You would need to do a one time scrub of the data to mark existing records as manual entries or excel entries. There are a couple of ways you can do it through automation/queries that I can explain in detail if you want.
Once past these steps, your Excel process is fairly simple: you can delete all records with source = "e" and then do a full Excel import. Manual records remain unchanged.
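A minimal sketch of that flow in SQL (table and field names are illustrative; note that Access's SQL view does not accept comments, so strip them before pasting):

-- One-time: add the source flag to the table.
ALTER TABLE DataTable ADD COLUMN source TEXT(1);

-- Each weekly import: clear out last week's Excel rows only; rows
-- marked 'm' (manual) survive untouched, then re-import the sheet.
DELETE FROM DataTable WHERE source = 'e';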
This concept will allow you to add new sources and codes and handle each one differently if needed. You just need to spend some time cleaning up your old data; I think you will find it worth it in the end.
Good Luck.

Splitting data by column value into an indefinite number of tables using an ETL tool

I'm trying to split a table into multiple tables based on the value of a given column using Talend Open Studio. Let's say this column can contain any of the integer values 1, 2, 3, etc.; according to this value, the rows should go to table_1, table_2, table_3, etc.
It would be best if I could solve this when the number of distinct values in that column is not known in advance, but for now we can assume that all the output tables already exist. The bottom line is that the number of distinct values, and therefore the number of output tables, is high enough that setting up the individual filters manually is not an option.
Is it possible to solve this using Talend Open Studio or a similar open source ETL tool like Pentaho Kettle?
Of course, I could just write a simple script myself, but I would prefer to use a proper ETL tool since the complete ETL process is quite complex.
In PDI, or Pentaho Kettle, you could do this with partitioning (a right-click option on the step, IIRC). Partitioning in PDI is designed for exactly this sort of problem.
Yes, it's possible to split the data into different tables based on a single column, but for that you need to create the table names dynamically: tFileInputDelimited -> tFlowToIterate -> tFixedFlowInput, then use globalMap() to get the column values and use them to separate the data into different tables, i.e. use globalMap(<column used to separate the data>) in the table name.
The first solution that came to my mind was using the replicator to transport the current row to three filters, which act as guards and only let through rows with either 1, 2 or 3 in the given column. Pic: http://i.imgur.com/FmvwU.png
But you could also build the table name dynamically, if that is what you want. Pic: http://i.imgur.com/8LR7Q.png
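For reference, each of those guard filters routes the same rows that this set-based SQL would; a sketch with illustrative names, one statement per known value:

-- Manual per-value routing: exactly what the dynamic approaches avoid
-- having to spell out for every distinct value.
INSERT INTO table_1 SELECT * FROM source_table WHERE split_col = 1;
INSERT INTO table_2 SELECT * FROM source_table WHERE split_col = 2;
INSERT INTO table_3 SELECT * FROM source_table WHERE split_col = 3;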