Pentaho - What step to use to flip parts of a table?

I have a table with 4 columns (Date, Asset, Buy, Sell) and would like to transform it into a new one containing Date, Asset, Type (of transaction) and Unit, while preserving all the relevant values. I am struggling to find out how to do this simply using Pentaho Data Integration, if it is possible at all. I have attached a screenshot of my input (top table) along with my desired output (bottom table). Any idea on how I could proceed? Any help would be greatly appreciated :)
(Screenshot: initial table vs. final table)

The step you need is the Row Normaliser. Inside the Pentaho installation there is a folder named samples with examples of how to use the different steps and some common techniques. It contains two examples of how to use the Row Normaliser step (conveniently named Row Normalis/zer - xxxx).
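If it helps to see the reshaping outside PDI, here is a minimal sketch of the same unpivot in Python/pandas (the column names Date, Asset, Buy, Sell come from the question; the sample values are made up). In the Row Normaliser step itself the configuration is analogous, if memory serves: Type as the type field, with Buy and Sell both mapped to a new field Unit.

```python
import pandas as pd

# Sample input matching the question's layout (values are made up)
df = pd.DataFrame({
    "Date":  ["2020-01-01", "2020-01-02"],
    "Asset": ["AAPL", "MSFT"],
    "Buy":   [10, 0],
    "Sell":  [0, 5],
})

# melt() does what PDI's Row Normaliser does: the Buy/Sell column
# headers become values of a new "Type" field, and their cell
# contents move into a single "Unit" field.
out = df.melt(
    id_vars=["Date", "Asset"],   # columns to keep as-is
    value_vars=["Buy", "Sell"],  # columns to normalise
    var_name="Type",             # new field holding the old column name
    value_name="Unit",           # new field holding the old column value
)
print(out)
```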

Related

Kettle - Two CSV inputs into PostgreSQL output

I have a class project using Pentaho. I need to create a dashboard from 2 different inputs going into a PostgreSQL output. My problem is that, using Kettle, I have to match two different .csv files before they go into Postgres. One of the CSVs is about crimes, the other about weather. I manually added two columns to the weather one, so they have two matching columns: 'Month' and 'Year'.
My question is how I can use these matching columns (or whether doing that makes any sense) so I can later create the dashboard and run queries like 'What crimes were committed when it was raining?'.
Sorry if I'm not being very precise, I'm a bit lost using Pentaho. If anyone could give me some help I would be thankful.
If your intent is to join two CSV files, please check the Join step.
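As a sketch of the idea (pandas is used here only to illustrate the join; in PDI the usual choice is the Merge Join step, which expects both inputs sorted on the join keys), joining on the shared Month/Year columns looks like this. The file names and the "Weather" column are assumptions based on the question:

```python
import pandas as pd

# Illustrative file names; the real ones come from the question's setup
crimes = pd.read_csv("crimes.csv")    # must contain Month and Year columns
weather = pd.read_csv("weather.csv")  # manually given Month and Year columns

# Inner join on the two matching columns, so every crime row is paired
# with the weather observed in the same month and year.
joined = crimes.merge(weather, on=["Month", "Year"], how="inner")

# After loading the result into PostgreSQL, queries such as
# "what crimes were committed when it was raining?" become simple filters.
print(joined[joined["Weather"] == "rain"].head())  # "Weather" is hypothetical
```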

How to avoid seeing the previous fields in the Table output step?

I have a transformation in Pentaho Kettle called Test.
This ETL process should load three different tables of a single database, where each one has its source in a different table of another database.
To do this I use three Table input steps. Each one connects to a Value Mapper, that to a Select values step, then a Data Validator, an Add sequence step and finally a Table output.
Summarising, I have a total of six steps per table load.
While editing the final steps I noticed something I would like to solve: I can still pick up the fields of the previous table loads.
For example, the load for table A has the field bank_id; it does not exist in the second table, yet in the Table output step of the second load process I can still select it, even though I do not want to.
Is there any option to hide the previous fields? That way I would avoid easy mistakes, especially when the tables have a field with the same name.
Thank you
EDIT
The screenshot clarifies the situation immensely, so now the answer is simple:
Delete the diagonal hops (arrows) between the rows.
Transformations in PDI don't have a single starting or ending point, so you don't need to connect all the steps in a single line. Having three separate streams is just fine.
All steps in a transformation start in parallel, then wait for and process rows as they come in (or, in the case of input steps, start reading data and generating rows into their output hop). That means your three streams will execute in parallel, each following its own hops from input to output.
Add a Select values step. I often add filter steps like this to "clean" the flow.

(Excel-VBA) Specific data import (on the background) in the active sheet

Would you please help me (a total beginner) to prepare a VBA macro that would open a sheet in the background and import a specific selection as shown below:
Let's say we have a wordcount analysis (xlsx) like this one, downloaded from a CAT tool for testing.
Now I would need to add a macro to my main sheet that would read the lines starting (in column A) with "All". If "All", then I'd need to store the columns of that line (specifically columns A - O) in an array / hashtable.
Please take a look at this image that sums it all up (better than me explaining it :-)
Let me know in case you need more details.
All tips / suggestions are greatly appreciated.
Many thanks!
My suggestion (I'm a beginner too) would be to use the Macro Recorder. It's a great tool to learn from (example).
start recording
filter for 'All'
copy/paste the cells
stop recording
Then have a look at the recorded code and adjust it :)
Looking at your data and the final layout you are looking for, using a Pivot Table would provide you with all of the flexibility you need.
You can:
filter which data to display
generate calculated values based on data in other columns
choose what order your columns are displayed
dynamically change the layout if you decide you want a different view
From your data, I was able to generate a Pivot Table matching your layout in about 15 minutes.
There are several good, simple tutorials on building Pivot Tables. A Google search will turn up plenty.
Things you will need to learn about for your particular problem:
Classic display (I used the classic display to get this particular layout)
Calculated Fields (many of the columns in the pivot table are calculated based on your spec). There is a maximum string length of 255 characters for a field calculation, so you may need to rename some of the columns in the original data set.
Of course, the basics of Pivot Tables
Loading new data and updating your pivot table
Good Luck!
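For comparison only (this is not Excel, just a compact way to show what the Pivot Table does), the same filter-and-summarise reshaping could be sketched in pandas. The column names Filter, File, Category and Words are hypothetical stand-ins for the CAT-tool export:

```python
import pandas as pd

# Hypothetical columns standing in for the CAT-tool wordcount export
df = pd.read_excel("wordcount.xlsx")

# Keep only the summary rows, as in the question (column A == "All")
summary = df[df["Filter"] == "All"]

# One row per file, one column per match category, summing the word
# counts: the same operation the Excel Pivot Table performs.
pivot = summary.pivot_table(
    index="File",        # hypothetical row label
    columns="Category",  # hypothetical column field
    values="Words",      # hypothetical value field
    aggfunc="sum",
)
print(pivot)
```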

Splitting data by column value into an indefinite number of tables using an ETL tool

I'm trying to split a table into multiple tables based on the value of a given column using Talend Open Studio. Let's say this column can contain any of the integer values 1, 2, 3, etc.; then, according to this value, these rows should go to table_1, table_2, table_3, etc.
It would be best if I could solve this when the number of different values in that column is not known in advance, but for now we can assume that all these output tables exist already. The bottom line is that the number of different values, and therefore the number of different tables, is high enough that setting up the individual filters manually is not an option.
Is it possible to solve this using Talend Open Studio or any similar open source ETL tool like Pentaho Kettle?
Of course, I could just write a simple script myself, but I would prefer to use a proper ETL tool since the complete ETL process is quite complex.
In PDI (Pentaho Kettle) you could do this with partitioning (a right-click option on the step, IIRC). Partitioning in PDI is designed for exactly this sort of problem.
Yes, it is possible to split the data into different tables based on a single column, but for that you need to create the tables dynamically:
tFileInputDelimited -> tFlowToIterate -> tFixedFlowInput -> then you can use globalMap() to get the column values and use them to separate the data into different tables, by using globalMap(column used to separate the data) in the table name.
The first solution that came to my mind was using the replicator to send the current row to three filters which act as guards and only let rows through with either 1, 2 or 3 in the given column. pic: http://i.imgur.com/FmvwU.png
But you could also build the table name dynamically, if that is what you want. pic: http://i.imgur.com/8LR7Q.png
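To make the dynamic-table-name idea concrete, here is a minimal sketch in Python standing in for the ETL tool; sqlite3, the sample rows and the table_<value> naming are all illustrative:

```python
import sqlite3

conn = sqlite3.connect("target.db")  # illustrative target database

rows = [  # stand-in for the incoming stream of rows
    (1, "alpha"), (2, "beta"), (1, "gamma"), (3, "delta"),
]

for key, payload in rows:
    table = f"table_{key}"  # table name derived from the column value
    # Create the target table on first sight of a new key value;
    # this is what "creating the table dynamically" amounts to.
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (payload TEXT)")
    conn.execute(f"INSERT INTO {table} (payload) VALUES (?)", (payload,))

conn.commit()
conn.close()
```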

SSIS - Column names as Variable/Changed Programmatically

I'm hoping someone might be able to help me out with this one - I have 24 files in CSV format, they all have the same layout and need to be joined onto some pre-existing data. Each file has a single column that needs to be joined onto the rest of the data, but those columns all have the same names in the original files. I need the columns automatically renamed to the filename as part of the join.
The final name of the column needs to be: Filename - data from another column.
My current approach is to use a Foreach Loop container and use the variable it generates to name the column, but there's nowhere I can input that value in the join; and even if there were, it would mess up the output mappings, because the column names would differ.
Does anyone have any thoughts about how to get around these issues? Whoever has an idea will be saving my neck!
EDIT In case some more detail helps with this... the SSIS version is 2008 and there are only a few hundred rows per file. It's basically a one-time task to collect a full billing history from several bills which are issued monthly.
The source data has three columns: the product number, the product type and the cost.
The destination needs to have 24*3 columns, each of which holds a monthly cost for a given product category. There are three product categories and 24 bills (in separate files), hence 24*3.
So hopefully I'm being a bit clearer - all I really need to know is how to change the name of a column using a variable passed in from the Foreach file container.
I think the easiest approach is to create a temp database (aka staging db), load the data from the files into it, and define stored procedures where you can pass params (e.g. file names etc.) and build your own logic...
Cheers, Mario