Splitting data by column value into an indefinite number of tables using an ETL tool - sql

I'm trying to split a table into multiple tables based on the value of a given column using Talend Open Studio. Let's say this column can contain any of the integer values 1, 2, 3, and so on; depending on its value, each row should go to table_1, table_2, table_3, etc.
It would be best if I could solve this when the number of different values in that column is not known in advance, but for now we can assume that all of these output tables already exist. The bottom line is that the number of different values, and therefore the number of different tables, is high enough that setting up the individual filters manually is not an option.
Is it possible to solve this using Talend Open Studio or a similar open source ETL tool like Pentaho Kettle?
Of course, I could just write a simple script myself, but I would prefer to use a proper ETL tool since the complete ETL process is quite complex.
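To make the problem concrete, this is roughly what the manual filters amount to in plain SQL (table and column names are just placeholders); one such statement per distinct value is exactly what I want to avoid maintaining by hand:

-- one INSERT per distinct value of the splitting column (here called "flag")
INSERT INTO table_1 SELECT * FROM source_table WHERE flag = 1;
INSERT INTO table_2 SELECT * FROM source_table WHERE flag = 2;
INSERT INTO table_3 SELECT * FROM source_table WHERE flag = 3;
-- ...and so on for every value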

In PDI (Pentaho Kettle) you can do this with partitioning (a right-click option on the step, IIRC). Partitioning in PDI is designed for exactly this sort of problem.

Yes, it's possible to split the data into different tables based on a single column, but for that you need to address the target table dynamically:
tFileInputDelimited -> tFlowToIterate -> tFixedFlowInput -> database output component
Use globalMap to get the value of the splitting column for the current row, and then use that same globalMap value in the output component's table name (for example, an expression along the lines of "table_" + globalMap.get("row1.flag"), where flag stands for the column used to separate the data).

The first solution that came to my mind was using the replicator to send the current row to three filters, which act as guards and only let through rows with either 1, 2, or 3 in the given column. Pic: http://i.imgur.com/FmvwU.png
But you could also build the table name dynamically, if that is what you want. Pic: http://i.imgur.com/8LR7Q.png

Related

Add new column to existing table Pentaho

I have a table input and I need to add a calculation to it, i.e. add a new column. I have tried:
-doing the calculation and then feeding back. Obviously, that stacked the new data on top of the old data.
-doing the calculation and then feeding back, but truncating the table first. The process got stuck at some point, so I assume I was truncating the table while the data was still being extracted from it.
-using a Stream lookup and then feeding back. Of course, it also stacked the data on top of the existing data.
-using a Stream lookup where I pull the data from the table input, do the calculation, and at the same time pull the data from the same table and do a lookup based on the unique combination of date and id, then use the Update step.
As the last attempt has been running for a while now, I am fairly sure it is not the right option either, but I have exhausted my ideas.
It seems that you need to update the table your data came from with this new field. Use the Update step with fields A and B as keys.
Actually, once you connect the hop, the result of the first step is automatically carried forward to the next step. So let's say you have a Table input step and then add a Calculator step where you create the third column. After writing the logic, right-click on the Calculator step and click Preview; you will get the result with all three columns.
I'd say your issue is not ONLY in the Pentaho implementation; there are some things you can do before reaching data staging in Pentaho.
'Workin Hard' is correct when he says you shouldn't use the same table: leave the input untouched and insert the new values into a new table. It doesn't have to be a new table EVERY TIME; instead of truncating the original, you truncate the staging table (the output table).
How many 'new columns' will you need? Will every run create a new column in the output, or will you always have a 'C' column which is always A+B or some other calculation? I'm sorry, but this isn't clear. If it's the latter, you don't need Pentaho for the transformation: updating the 'C' column with a calculation based on A and B can be done directly in most relational DBMSs with a simple UPDATE statement. Yes, it can be done in Pentaho, but you're adding a lot of overhead and processing time.
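A minimal sketch of that UPDATE, assuming a table named my_table where the new column C should always hold A + B (all names hypothetical):

-- recompute C for every row directly in the database,
-- with no round trip through the ETL tool
UPDATE my_table
SET C = A + B;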

Kettle - Two csv inputs into PostgreSQL output

I have a class project using Pentaho. I need to create a dashboard using 2 different inputs into a PostgreSQL output. My problem is that, using Kettle, I have to match two different .csv files that go into Postgres. One of the CSVs is about crimes, the other about weather. I manually added two columns to the weather one, so they have two matching columns: 'Month' and 'Year'.
My question is how I can use these matching columns (or whether doing that makes any sense) so I can later create the dashboard and run queries like 'What crimes were committed when it was raining?'.
Sorry if I'm not very precise; I'm a bit lost using Pentaho. If anyone could give me some help I would be thankful.
If your intent is to join two CSV files, please check the Join step.
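Once both files are loaded into PostgreSQL, the shared columns let you answer exactly that kind of question. A rough sketch, assuming tables crimes and weather with the added month and year columns plus a weather condition column (all names are guesses based on the description):

-- "What crimes were committed when it was raining?"
SELECT c.crime_type, COUNT(*) AS occurrences
FROM crimes c
JOIN weather w ON c.month = w.month AND c.year = w.year
WHERE w.condition = 'Rain'
GROUP BY c.crime_type;

Note that a month/year join only links each crime to month-level weather; answering finer-grained questions would need a finer-grained shared key, such as a full date.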

Merging database data from two different spreadsheets

Newbie here, working on something a bit complicated. Not sure how to start or what the best approach is, so I'm looking for some advice and tips.
We have 2 systems running MS Dynamics POS 2009 and an extract of all data (inventory/stock) in spreadsheets. Both databases have pretty much the same items, but because they have been run separately, all the naming and Part Numbers are in different formats.
I need to create one database (one Excel file) from both, where partial matches on Part Number are identified and "merged" (keeping the Part Number and Description from sheet1 and updating the Stock to sheet1 stock + sheet2 stock).
The problem is that the Part Numbers are written in completely different styles (by different people) and can only be matched partially (I guess on the last 3-6 characters of the Part Number).
I am not an Excel expert, so any advice and tips would be appreciated.
I have also thought of loading those Excel sheets into 2 separate SQL databases and doing it from SSMS, as I am not sure Excel can cope with this.
Thanks
I'm not 100% sure of the source data, but based on the available information here are some possible steps:
-Create a new database in SSMS.
-Load the data from your Excel extracts with the Import Data tool (right-click on your newly-created database, Tasks, Import Data). This pulls up a wizard that turns your Excel spreadsheet into a table in SQL Server. Do this for all spreadsheets.
http://searchsqlserver.techtarget.com/feature/The-SQL-Server-Import-and-Export-Wizard-how-to-guide
-You may be able to do some matching based on the start/end characters and use a MERGE statement to combine the data. The MERGE statement lets you set a match criterion and then take different actions on a positive or negative match. For example, if your two POS systems have two spreadsheets of products with some overlap, but also some products unique to each system, you could start with a source table from the first system and insert only the products that are unique to the other system, doing nothing when there is a match. Something like:
MERGE ProductA A
USING ProductB B
ON RIGHT(A.ProductID, 5) = RIGHT(B.ProductID, 5)
WHEN NOT MATCHED BY TARGET THEN
INSERT (ProductID, Description)
VALUES (B.ProductID, B.Description);
https://www.simple-talk.com/sql/learn-sql-server/the-merge-statement-in-sql-server-2008/
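Since the goal described above also includes summing the stock from both sheets on a partial match, a hedged extension of the same statement (assuming both tables carry a numeric Stock column, which wasn't shown in the question) could add a WHEN MATCHED clause:

-- sketch only: on a partial Part Number match, keep A's row but add B's stock;
-- otherwise pull B's unique products across
MERGE ProductA A
USING ProductB B
ON RIGHT(A.ProductID, 5) = RIGHT(B.ProductID, 5)
WHEN MATCHED THEN
UPDATE SET Stock = A.Stock + B.Stock
WHEN NOT MATCHED BY TARGET THEN
INSERT (ProductID, Description, Stock)
VALUES (B.ProductID, B.Description, B.Stock);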

SSIS: check Excel source rows and redirect rows to another table on 'x' number of field matches

I work in a sales based environment and our data consists of 'leads'.
Let's say we record CompanyName, PhoneNumber, Address1 & PostCode (ZIP). These rows are seeded with a unique ID in the schema.
The leads come in from various sources and are compiled onto a spreadsheet, then imported into SQL 2012 using SSIS.
After a validation check to see if a file exists we then use a simple data flow which consists of an Excel source, Derived Column, Data Conversion and finally an OLE DB Destination.
My requirement, I'm sure, has a relatively simple solution, and understanding what I need to achieve is the first step. I need to take a sample of data from the last rolling two months; if 2 or more fields in the source Excel file match the corresponding fields in the destination SQL table, then I want to redirect those rows to another table.
I am unsure which combination of components I could use to achieve this. I believe Fuzzy Lookup may not be what I am looking for, as I need exact field matches; I have looked at the Lookup component but am unsure if this is the way to go.
Could anyone please provide some advice on how I can best achieve this as simply as possible?
You can use the Lookup component to check for matches in your existing table. However, it will be fairly complicated to implement the requirement that two or more fields match. Your expression would be long and complex, basically consisting of (in pseudocode for readability, where each comparison is between a source column and the looked-up destination column):
IIF((a = a AND b = b) OR (a = a AND c = c) OR (b = b AND c = c) OR ...)
and so on for every combination of two columns you want to test.
I would do this by importing the entire spreadsheet to a staging table, and doing the existing rows check in a SQL stored proc that moves the data to the desired destination table.
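As a rough sketch of that stored-proc check (every table and column name here is an assumption based on the question), counting field matches with CASE expressions keeps the "two or more" condition manageable:

-- rows from staging that match an existing lead from the last two months
-- on at least 2 of the 4 fields are redirected to a review table
INSERT INTO dbo.LeadsReview (CompanyName, PhoneNumber, Address1, PostCode)
SELECT s.CompanyName, s.PhoneNumber, s.Address1, s.PostCode
FROM dbo.LeadsStaging s
WHERE EXISTS (
    SELECT 1
    FROM dbo.Leads l
    WHERE l.CreatedDate >= DATEADD(MONTH, -2, GETDATE())
      AND (CASE WHEN s.CompanyName = l.CompanyName THEN 1 ELSE 0 END
         + CASE WHEN s.PhoneNumber = l.PhoneNumber THEN 1 ELSE 0 END
         + CASE WHEN s.Address1    = l.Address1    THEN 1 ELSE 0 END
         + CASE WHEN s.PostCode    = l.PostCode    THEN 1 ELSE 0 END) >= 2
);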

Summing different parts of a column in SQL

I have a database extract in Excel and want to create a custom value in Tableau using their Create Calculation feature, which I believe is SQL-based.
Basically I have a large number of feeds which all show up different amounts in a column. For example:
feed 1
feed 1
feed 2
feed 3
feed 4
feed 4
feed 4
And I want to have a sum for feed 1, feed 2, and feed 4. But in my actual DB there are about 100 feeds, all with different numbers of appearances. I'm having trouble finding a good way to do this, if there even is one. Any help or direction would be appreciated!
I'm assuming that your list is a single column and you need a count of the number of occurrences of each feed. For the sake of example, since column and table names were not supplied, let's call them colname and tablename.
select colname, count(*) as Ct from tablename group by colname
It would be easier to give an exact answer if you posted a small simplified subset of your spreadsheet. But assuming you have a column called "feed_name" which takes on values like "feed 1", "feed 2" etc. depending on the row, the feed_name column should be a discrete dimension in Tableau.
Then just put the feed_name pill on a shelf, say the row shelf. And put the "Number of Records" field on another shelf, say the column shelf.
You don't need to write SQL to do this (or most tasks) in Tableau. It helps to understand SQL concepts, and it's very helpful to drop down to the SQL level when needed to solve tricky issues. But for most situations, you can just interactively explore the data by moving fields around and writing some simple calculations, and let Tableau take care of generating the SQL necessary to retrieve the data needed to build the visualization you requested.
Tableau supports SQL and some NoSQL data sources, along with some cubes, and it does that quite well and in multiple ways. You can just work more quickly and efficiently by using Tableau's visual manipulations in most cases, and then drop to the lower level of detail when needed. It just takes getting used to how Tableau operates.