Kettle - Two csv inputs into PostgreSQL output - pentaho

I have a class project using Pentaho. I need to create a dashboard fed by two different inputs loaded into a PostgreSQL output. My problem is that, using Kettle, I have to match two different .csv files that go into Postgres. One of the CSVs is about crimes, the other is about weather. I manually added two columns to the weather file so that the two files have matching columns: 'Month' and 'Year'.
My question is: how can I use these matching columns (or does doing that even make sense), so that I can later create the dashboard and run queries like 'What crimes were committed when it was raining?'
Sorry if I'm not being very precise; I'm a bit lost using Pentaho. If anyone could give me some help I would be thankful.

If your intent is to join the two CSV files on those columns, please check the Join step (in recent Kettle versions, 'Merge Join', which requires both input streams to be sorted on the join keys first).
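Once both files are loaded into PostgreSQL, those matching columns let you answer exactly that kind of question with a plain join. A minimal sketch, assuming hypothetical table and column names (crimes, weather, a conditions column, plus your added month/year columns):

SELECT c.*
FROM crimes c
JOIN weather w
  ON c.month = w.month
 AND c.year = w.year
WHERE w.conditions = 'rain';

Note that joining on month and year means every crime in a rainy month counts as 'when it was raining'; for day-level accuracy you would need a full date column in both files.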

Related

Retrieve results from a batch of SQL queries in Pentaho or Postgres?

I'm still relatively new to SQL and Pentaho.
I've pulled a table with two different IDs and need to run a query for each specific instance.
For example,
SELECT *
FROM Table
WHERE RecordA = 'value in column A'
AND RecordB = 'value in column B'
I need the results back, either appended to new columns in the original table or part of their own text file output.
I was initially looking at using a formula for this inside of Pentaho, but couldn't quite figure it out. Since I already had the query written, I threw it into Excel and generated the concatenated results (a string of 350 or so queries that I need to run). I'm just not sure how to accomplish this - I tried the Execute SQL Script step inside Pentaho, but it doesn't seem to produce any output.
Any direction would be useful. I've searched a little but have come up short so far, possibly because I am still pretty new to this platform.
You can accomplish this in a number of ways, with a "Database Lookup" step for example, but I usually do it in a fairly simple way. Here is an example for your tests; I hope it helps.
The idea is to have two Table Input steps. The first one fetches the IDs we want to look up, using a SQL query like the one sketched below. The result is a stream of rows carrying the key columns.
Next we have a second Table Input that reads the incoming rows and executes its own query once for each of them.
What it does is replace each '?' placeholder in the query with the data it receives. If you need two columns, use two '?'s, but remember that the first '?' is replaced with the first incoming column and the second with the second.
And you are good to go. Test it a couple of times and good luck.
And the config for the second Table Input: set 'Insert data from step' to the first Table Input and tick 'Execute for each row?'.
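A minimal sketch of the two queries, using the hypothetical table name id_list for the list of pairs to look up:

-- first Table Input: the key pairs
SELECT RecordA, RecordB FROM id_list;

-- second Table Input ('Execute for each row?' enabled): one lookup per incoming pair
SELECT *
FROM Table
WHERE RecordA = ?
AND RecordB = ?;

Each incoming row fills the two '?' placeholders in order, so the second step runs once per pair - about 350 times in your case.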

Filter certain SQL data formatted in one column into a new column

Before I begin, I found this to be the most relevant result of the research I have done:
How to split the data from one column into separate columns using the contents of another column in SQL
Attached are pictures of my progress so far. How can I display this information as it is shown in the Excel file without disrupting the GROUP BY clause in my query?
It's a Fishbowl database (newest version), and I am running the queries through FlameRobin, which you can see in the picture. I'm trying to organize the query so it displays correctly, so that I can format it in 'iReports' and export it into an Excel spreadsheet like the one shown. Maybe some part of this would be better done in Excel?
Notice the numbers for Qty are different, that's ok right now.
My reputation is too low to post pictures, I'm sorry. Here are the two JPGs in my Dropbox. I really appreciate the help.
https://www.dropbox.com/sh/r2rw5r2awsyvzs9/AAAXXg27CMPOYtZFqPX3Dx6la?dl=0
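Without the screenshots the exact schema is a guess, but "split one column's data into new columns based on the contents of another column" is usually done with conditional aggregation, which coexists with GROUP BY. A hedged sketch with hypothetical column names (Fishbowl runs on Firebird, which supports CASE):

SELECT partnumber,
       SUM(CASE WHEN location = 'Stock' THEN qty ELSE 0 END) AS stock_qty,
       SUM(CASE WHEN location = 'Shipping' THEN qty ELSE 0 END) AS shipping_qty
FROM inventory
GROUP BY partnumber;

Each CASE routes a row's qty into one output column, so values that used to share a single column land in separate columns without breaking the grouping.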

SSIS Check Excel source rows redirect rows to another table on 'x' number of field matches

I work in a sales based environment and our data consists of 'leads'.
Let's say we record CompanyName, PhoneNumber, Address1 & PostCode (ZIP). These rows are seeded with a unique ID in the schema.
The leads come in from various sources, are compiled into a spreadsheet, and are then imported into SQL Server 2012 using SSIS.
After a validation check to see if a file exists, we use a simple data flow consisting of an Excel Source, a Derived Column, a Data Conversion, and finally an OLE DB Destination.
I'm sure my requirement has a relatively simple solution, and I understand the first step of what I need to achieve: take a sample of data from the last rolling two months, and if 2 or more fields in the source Excel file match the corresponding fields in the destination SQL table, redirect the row to another table.
I am unsure which combination of components I could use to achieve this. I believe Fuzzy Lookup may not be what I am looking for, as I need exact field matches; I have looked at the Lookup component, but I am unsure if it is the way to go.
Could anyone please advise on how I can best achieve this as simply as possible?
You can use the Lookup component to check for matches in your existing table. However, it will be fairly complicated to implement the requirement that any two or more fields match: your match expression would be long and complex, basically consisting of:
(using pseudocode for readability, where src is the Excel source and dst the destination table)
IIF((src.a = dst.a AND src.b = dst.b) OR (src.a = dst.a AND src.c = dst.c) OR (src.b = dst.b AND src.c = dst.c) OR ... and so on
for every combination of two columns you want to test
I would do this by importing the entire spreadsheet into a staging table and doing the existing-rows check in a SQL stored procedure that moves the data to the desired destination table, as sketched below.
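A minimal sketch of that check, with hypothetical table names (LeadsStaging, Leads, LeadsDuplicates) and an assumed CreatedDate column for the rolling two-month window; summing per-field CASE tests avoids spelling out every column pair:

-- redirect staging rows where 2 or more fields match a lead from the last two months
INSERT INTO LeadsDuplicates (CompanyName, PhoneNumber, Address1, PostCode)
SELECT s.CompanyName, s.PhoneNumber, s.Address1, s.PostCode
FROM LeadsStaging s
WHERE EXISTS (
    SELECT 1
    FROM Leads l
    WHERE l.CreatedDate >= DATEADD(MONTH, -2, GETDATE())
      AND (CASE WHEN s.CompanyName = l.CompanyName THEN 1 ELSE 0 END
         + CASE WHEN s.PhoneNumber = l.PhoneNumber THEN 1 ELSE 0 END
         + CASE WHEN s.Address1 = l.Address1 THEN 1 ELSE 0 END
         + CASE WHEN s.PostCode = l.PostCode THEN 1 ELSE 0 END) >= 2
);

Rows that don't hit the threshold can then be moved into the main Leads table in a second statement.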

SQL Split functions and imported data

Ok, I've got a database table where data gets dumped by this horrid little program that I despise, but can't change at the moment. It has merchant data in there, names, addresses, and a set of categories that are pipe-delimited. What I need is a clean way to split these out, so I have one row for each merchant/category pair. From there, I can easily get it into the new data structure. This will need to be a repeatable process for a short period of time. I realize the optimal solution is to rid myself of this structure, but I've wracked my brain trying to figure out how to do this cleanly in sql.
I already have a function in the database that will split a delimited string and return a table.
This is in SQL Server 2008, btw.
Edit (for clarity):
Basically, the following might be a merchant, with the categories attached (other fields redacted for simplicity; commas are used as field delimiters here).
Jimbo's Bait Shoppe, Bait|Sports Gear|Sandwiches
What I need is:
Jimbo's Bait Shoppe, Bait
Jimbo's Bait Shoppe, Sports Gear
Jimbo's Bait Shoppe, Sandwiches
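Since you already have a split function that returns a table, CROSS APPLY gives exactly that one-row-per-pair output in SQL Server 2008. A sketch where dbo.SplitString(@input, @delimiter) stands in for your function and RawMerchants for the dump table (both names hypothetical):

-- one output row per merchant/category pair
SELECT m.MerchantName, s.Item AS Category
FROM RawMerchants m
CROSS APPLY dbo.SplitString(m.Categories, '|') AS s;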
If you have already written a function that splits the string and returns a table, you can use a trigger.
Create a trigger on INSERT on the table where the "horrid" program dumps the data. The trigger then takes the unformatted data and populates two clean tables (in your case, probably one for merchants and one for their categories, linked in a one-to-many relationship via a MerchantID).
In this setup the table with the unformatted data serves as a "dirty" staging table, which you can cleanse straight after the "horrid" program imports a file.
Please comment if you need help with the triggers.
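A sketch of such a trigger, reusing the hypothetical names from the CROSS APPLY example above:

CREATE TRIGGER trg_RawMerchants_Split
ON RawMerchants
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- one clean row per merchant/category pair from the rows just dumped in
    INSERT INTO MerchantCategory (MerchantName, Category)
    SELECT i.MerchantName, s.Item
    FROM inserted i
    CROSS APPLY dbo.SplitString(i.Categories, '|') AS s;
END;

This simplified version writes straight into one clean table; with the two-table design described above you would first upsert the merchant row, then insert the categories using the resulting MerchantID.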

Splitting data by column value into an indefinite number of tables using an ETL tool

I'm trying to split a table into multiple tables based on the value of a given column, using Talend Open Studio. Let's say this column can contain any of the integer values 1, 2, 3, etc.; according to this value, the rows should go to table_1, table_2, table_3, etc.
It would be best if I could solve this when the number of distinct values in that column is not known in advance, but for now we can assume that all the output tables already exist. The bottom line is that the number of distinct values, and therefore the number of tables, is high enough that setting up the individual filters manually is not an option.
Is it possible to solve this using Talend Open Studio or a similar open-source ETL tool like Pentaho Kettle?
Of course, I could just write a simple script myself, but I would prefer to use a proper ETL tool since the complete ETL process is quite complex.
In PDI (Pentaho Kettle) you can do this with partitioning (a right-click option on the step, IIRC). Partitioning in PDI is designed for exactly this sort of problem.
Yes, it's possible to split the data into different tables on the basis of a single column, but for that you need to create the tables dynamically:
tFileInputDelimited -> tFlowToIterate -> tFixedFlowInput -> output
You can then use globalMap() to get the column values and use them to separate the data into the different tables, i.e. use globalMap(column used to separate the data) in the table name.
The first solution that came to my mind was using the replicator to send the current row to three filters which act as guards and only let through rows with a 1, 2 or 3 in the given column. Pic: http://i.imgur.com/FmvwU.png
But you could also build the table name dynamically, if that is what you want. Pic: http://i.imgur.com/8LR7Q.png
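If the rows end up in a database anyway (the first question on this page loads into PostgreSQL), the unknown-number-of-tables variant is also easy to script in SQL after a single bulk load. A sketch in PL/pgSQL, assuming a staging table named staging with routing column split_col (both names hypothetical):

DO $$
DECLARE
    v integer;
BEGIN
    FOR v IN SELECT DISTINCT split_col FROM staging LOOP
        -- create table_<value> on first sight, then route the matching rows into it
        EXECUTE format('CREATE TABLE IF NOT EXISTS table_%s (LIKE staging INCLUDING ALL)', v);
        EXECUTE format('INSERT INTO table_%s SELECT * FROM staging WHERE split_col = %s', v, v);
    END LOOP;
END $$;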