I have a fixed-width flat file that needs to be loaded into multiple Oracle tables (one row needs to be split into multiple rows).
The numbers above each column are its width, and my desired output should look like the one shown below.
Flat file data (fixed width):
3 6 3 11 3 10 3 10 3
ID NAME AGE CTY1 ST1 CTY2 ST2 CTY3 ST3
200JOHN 46 LOSANGELES CA HOUSTON TX CHARLOTTE NC
201TIMBER54 PHOENIX AZ CHICAGO IL
202DAVID 32 ATLANTA GA PORTLAND AZ
The number of occurrences may vary; it can grow up to 20-30.
DESIRED OUTPUT:
TABLE1
ID NAME AGE
200JOHN 46
201TIMBER54
202DAVID 32
TABLE2
ID SEQ CTY ST
200 1 LOSANGELES CA
200 2 HOUSTON TX
200 3 CHARLOTTE NC
201 1 PHOENIX AZ
201 2 CHICAGO IL
202 1 ATLANTA GA
202 2 PORTLAND AZ
Can someone help me out?
Thanks!
I would listen to the advice given by #bilinkc first and attempt to solve this with an unpivot.
Click here for details on how to use the SSIS Unpivot Data Flow Transformation.
However, if that does not work out for some reason and you really want to solve this with SSIS, I am (kind of) happy to say it is technically feasible to solve the problem using SSIS and one data flow.
Below is an abbreviated list of steps:
1) Add a Data Flow Task to your package
2) Add a Flat File Source to your Data Flow Task
3) Configure the Flat file Source with a Connection Manager for your flat file
4) Add a Multicast Data Flow Transformation to your Data Flow Task
5) Connect your Flat File Source with the Multicast Data Flow Transformation
Now the "fun" part (copy and paste can save you time here)...
6) Add 30 Conditional Split Data Flow Transformations to your Data Flow Task
7) Connect the Multicast Data Flow Transformation to each Conditional Split Data Flow
8) Configure each Conditional Split N to pull the row subset where State N and City N have values
Example: Conditional Split 1
Output Name: CTY1_ST1
Condition: [CTY1] != "" && [ST1] != ""
9) Add 30 Derived Column Data Flow Transformations to your data flow
10) Connect each one to your 30 Conditional Splits
11) Configure each with a Derived Column Name SEQ and a value 1 to 30
12) Add a Union All Data Flow Transformation and Union All 30 of the data pipes back together
Now the "easy" part...
13) Add your first Sort Transformation to your Data Flow Task
14) Connect a 31st Multicast pipe to your first Sort Transformation
15) Put a check mark next to ID and sort by it (hopefully ID:NAME and ID:AGE are 1:1)
16) Check Remove rows with duplicate sort values
17) Add your second Multicast Data Flow Transformation
18) Add a second Sort Transformation to your Data Flow Task
19) Connect your Union All to your second Sort Transformation and sort by ID
20) Add a Merge Join to your Data Flow Task
21) Connect your second Multicast Data Flow Transformation as the Left Input
22) Connect your second Sort Transformation to your Merge Join as your Right Input
23) Configure your Merge Join as Join Type = Inner Join and select columns ID, SEQ, CTY, ST
24) Add your first OLE DB Destination to your data flow and connect your Merge Join to it (the result is TABLE2)
25) Add a second OLE DB Destination to your data flow and connect your second Multicast Data Flow Transformation to it (the result is TABLE1)
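For completeness, the splitting that the 30 Conditional Split / Derived Column pairs perform can also be sketched in a few lines of script. This is a hedged Python sketch, not SSIS; the field widths are taken from the question's header (3 6 3 11 3 10 3 10 3) and the sample line is padded to those widths:

```python
# Sketch: split one fixed-width row into a TABLE1 tuple and TABLE2 rows.
# Field widths from the question: ID, NAME, AGE, then repeating (CTY, ST) pairs.
WIDTHS = [3, 6, 3, 11, 3, 10, 3, 10, 3]

def split_fixed(line):
    fields, pos = [], 0
    for w in WIDTHS:
        fields.append(line[pos:pos + w].strip())
        pos += w
    return fields

def split_row(line):
    f = split_fixed(line)
    table1 = (f[0], f[1], f[2])            # ID, NAME, AGE
    table2 = []
    seq = 0
    for cty, st in zip(f[3::2], f[4::2]):  # repeating (CTY, ST) pairs
        if cty and st:                     # same test as the Conditional Splits
            seq += 1
            table2.append((f[0], seq, cty, st))
    return table1, table2

line = "200JOHN  46 LOSANGELES CA HOUSTON   TX CHARLOTTE NC"
table1, table2 = split_row(line)
```

Growing to 20-30 occurrences then only means extending WIDTHS, instead of adding more Conditional Split and Derived Column pairs.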
Related
I'm trying to find a solution in DataStage (or in SQL), without having to use a bunch of if/else conditions, where I can map the value of one column based on the value of another column.
Example -
Source File -
ID  Header1  Value1  Header2  Value2
1   Length   10      Height   15
2   Weight   200     Length   20
Target Output -
ID  Length  Height  Weight
1   10      15
2   20              200
I can do this using the INDEX/MATCH functions of Excel. I was wondering if DataStage or Snowflake can look at all these fields similarly and automatically populate each value into the column for its corresponding header!
I think the best solution in DataStage would be a Pivot stage followed by a Transformer stage to strip out the hard-coded source column names.
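To see the mapping outside DataStage, here is a minimal Python sketch of the same idea: the header value itself is used as the output column name, so no per-header if/else is needed (the dict layout is illustrative, not a DataStage artifact):

```python
# Sketch: use each HeaderN value as the output column name for ValueN,
# so no per-header if/else conditions are needed.
rows = [
    {"ID": 1, "Header1": "Length", "Value1": 10, "Header2": "Height", "Value2": 15},
    {"ID": 2, "Header1": "Weight", "Value1": 200, "Header2": "Length", "Value2": 20},
]

def pivot(row, pairs=2):
    out = {"ID": row["ID"]}
    for n in range(1, pairs + 1):
        out[row[f"Header{n}"]] = row[f"Value{n}"]  # header becomes the column
    return out

result = [pivot(r) for r in rows]
```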
Using Pentaho PDI 8.3.
After REST calls with quite complex data structures, I was able to extract the data with one row per data element in a REST result. E.g.:
DataCenterClusterAbstract
1
UK1
Datacenter (auto generated)
Company
29
0
39
15
DATAUPDATEJOB
2016-04-09T21:34:31.18
DataCenterClusterAbstract
2
UK1_Murex
Datacenter (auto generated)
Company
0
0
0
0
DATAUPDATEJOB
2016-04-09T21:34:31.18
DataCenterClusterAbstract
3
UK1_UNIX
Notice that there are 8 data elements that are spread out into separate rows. I would like to condense these 8 data elements into one row per record in Pentaho. Is this possible? And can I assign field names?
Row flattener
Condense the 8 data elements into one row (as columns); each of these 8 data elements repeats.
(1) Add a Row flattener step
(2) Assign field names for the rows coming in - so if you have 10 data attributes in rows, specify a field name for each row
(3) In the table output, use space as separator
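Outside PDI, the flattening itself is just chunking the incoming rows. A small Python sketch, under the assumption of a fixed number of elements per record (the three field names are made up for illustration, not taken from the REST payload):

```python
# Sketch: condense a stream of one-value-per-row elements into one row per
# record, given a fixed element count and a field name for each position.
def flatten(rows, per_record, field_names):
    records = []
    for i in range(0, len(rows), per_record):
        records.append(dict(zip(field_names, rows[i:i + per_record])))
    return records

stream = ["DataCenterClusterAbstract", 1, "UK1",
          "DataCenterClusterAbstract", 2, "UK1_Murex"]
records = flatten(stream, 3, ["type", "id", "name"])
```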
I have a problem regarding making a data table that incorporates data of two other data tables, depending on what the input is in the input sheet.
These are my sheets:
sheet 1) Data table 1
sheet 2) Data table 2
sheet 3) Input sheet:
In this sheet one fills in the origin, destination, and month.
sheet 4) Output sheet:
Row(s) with characteristics that are a combination of the data in data table 1 and data table 2: 1 column for each characteristic in the row:
(General; Month; Origin; feature 1; feature 2; month max; month min; Transit point; feature 1; feature 2; feature 3; month max; month min; Destination; feature 1; feature 2; month max; month min;) => feature 3 of origin and destination don't have to be incorporated in the output!
Depending on the month, origin and destination filled in on the input sheet, the output has to list all the possible rows (routes) with that origin and that destination, and the temperatures in that month at the origin, transit point and destination.
I have tried VLOOKUP(MATCH), but that only helps for one row, not if I want to list all possible rows.
I don't think this problem is that difficult, but I am really a rookie in Excel. Maybe it could work with a simple macro.
I'm a little unclear about some of your question, but perhaps you could adapt this solution to work for you?
http://thinketg.com/how-to-return-multiple-match-values-in-excel-using-index-match-or-vlookup/
I think this is what you want.
ColA ColB
a    1
b    2
c    3
a    4
b    5
c    6
a    7
b    8
     9
     10
     11
     7
     8
     9
     9
     16
     17
     18
     19
     20
In Cell E1, enter c (this is the value you are looking up).
In Cell F1, enter the function below and hit Ctrl+Shift+Enter.
=IF(ROWS(B$1:B1)<=COUNTIF($A$1:$A$20,$E$1),INDEX($B$1:$B$20,SMALL(IF($A$1:$A$20=$E$1,ROW($A$1:$A$20)-ROW($E$1)+1),ROWS(B$1:B1))),"")
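For comparison, the logic of that array formula (return the Nth ColB value whose ColA entry matches the lookup value) sketched in Python, using the first eight sample rows:

```python
# Sketch of the array formula's logic: collect every ColB value whose
# ColA entry equals the lookup value, in row order.
col_a = ["a", "b", "c", "a", "b", "c", "a", "b"]
col_b = [1, 2, 3, 4, 5, 6, 7, 8]

def lookup_all(key):
    return [b for a, b in zip(col_a, col_b) if a == key]
```

Dragging the formula down corresponds to taking successive elements of this list, with `""` once the matches run out.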
I'm a rookie at Kettle/Pentaho and queries. What I'm trying to do is check whether a value in file 1 is also in file 2.
I've got 2 files that I export from my DB:
File 1:
Row1, Row2
A 3
B 5
C 99
Z 65
File 2:
Row1, Row2
A 3
D 11
E 22
Z 65
And I want to create one output file:
File Output
Row1, Row2
A 3
Z 65
What I'm doing: 2 file inputs and a merge join, but no file output. Something is missing here.
Any suggestion would be great!
You can have the two streams joined by a "Merge join" step, which allows you to set freely the join keys (in your case it seems like you want both fields to be used), and also what type of join, Inner, Left Outer, Right outer or Full outer.
You can use stream lookup for that. Start with the file input for file 1, then create a stream lookup step that uses the input stream for file 2 as its lookup stream. Now just match the columns and you can add the column from file 2 to your data stream.
Sort both files in ascending order.
Use the Merge Join step to join both tables on the sorted field (in this case Row1).
Use the Select values step to delete the unwanted fields produced as a result of the join.
Output your result using the Dummy step or whatever output you prefer.
This should work fine.
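Those steps amount to a sorted inner join; as a quick sanity check against the sample data, the same result sketched in Python (matching on the whole row, i.e. both fields):

```python
# Sketch of the inner join: keep only the rows present in both files,
# matching on both fields (Row1 and Row2), output in sorted order.
file1 = [("A", 3), ("B", 5), ("C", 99), ("Z", 65)]
file2 = [("A", 3), ("D", 11), ("E", 22), ("Z", 65)]

output = sorted(set(file1) & set(file2))  # inner join on the whole row
```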
I have 48 matrices of dimensions 1,000 rows and 300,000 columns where each column has a respective ID, and each row is a measurement at one time point. Each of the 48 matrices is of the same dimension and their column IDs are all the same.
The way I have the matrices stored now is as RData objects and also as text files. I guess for SQL I'd have to transpose and store by ID, and in such case now the matrix would be of dimensions 300,000 rows and 1,000 columns.
I guess if I transpose it a small version of the data would look like this:
id1 1.5 3.4 10 8.6 .... 10 (with 1,000 columns and 300,000 rows now)
I want to store them in a way such that I can use R to retrieve a few of the rows (~ 5 to 100 each time).
The general strategy I have in mind is as follows:
(1) Create a database in sqlite3 using R that I will use to store the matrices (in different tables)
For file 1 to 48 (each file is of dim 1,000 rows and 300,000 columns):
(2) Read in file into R
(3) Store the file as a matrix in R
(4) Transpose the matrix (now it's of dimensions 300,000 rows and 1,000 columns). Each row is now keyed by the unique id in the sqlite table.
(5) Dump/write the matrix into the sqlite3 database created in (1) (dump it into a new table probably?)
Steps 1-5 are to create the DB.
Next, I need step 6 to read-in the database:
(6) Read some rows (at most 100 or so at a time) into R as a (sub)matrix.
A simple example code doing steps 1-6 would be best.
Some Thoughts:
I have used SQL before, but it was mostly to store tabular data where each column had a name; in this case each column is just one point of the data matrix. I guess I could just name them col1 ... col1000? Or are there better tricks?
If I look at: http://sandymuspratt.blogspot.com/2012/11/r-and-sqlite-part-1.html they show this example:
dbSendQuery(conn = db,
"CREATE TABLE School
(SchID INTEGER,
Location TEXT,
Authority TEXT,
SchSize TEXT)")
But in my case this would look like:
dbSendQuery(conn = db,
"CREATE TABLE mymatrixdata
(myid TEXT,
col1 float,
col2 float,
.... etc.....
col1000 float)")
I.e., I would have to type col1 through col1000 manually, which doesn't sound very smart. This is where I am mostly stuck. A code snippet would help me.
Then I need to dump the text files into the SQLite database. Again, I'm unsure how to do this from R.
Seems I could do something like this:
setwd(<directory where to save the database>)
db <- dbConnect(SQLite(), dbname="myDBname")
mymatrix.df = read.table(<full name to my text file containing one of the matrices>)
mymatrix = as.matrix(mymatrix.df)
Here I need to know the code for how to dump this into the database...
Finally,
How to fast retrieve the values (without having to read the entire matrices each time) for some of the rows (by ID) using R?
From the tutorial it'd look like this:
sqldf("SELECT id1,id2,id30 FROM mymatrixdata", dbname = "Test2.sqlite")
But the id1, id2, id30 are hardcoded in the code, and I need to obtain them dynamically. I.e., sometimes I may want id1, id2, id10, id100; and another time I may want id80, id90, id250000, etc.
Something like this would be more appropriate for my needs:
cols.i.want = c("id1","id2","id30")
sqldf("SELECT cols.i.want FROM mymatrixdata", dbname = "Test2.sqlite")
Again, unsure how to proceed here. Code snippets would also help.
A simple example would help me a lot here, no need to code the whole 48 files, etc. just a simple example would be great!
Note: I am using a Linux server, SQLite 3 and R 2.13 (I could update it as well).
In the comments the poster explained that it is only necessary to retrieve specific rows, not columns:
library(RSQLite)
m <- matrix(1:24, 6, dimnames = list(LETTERS[1:6], NULL)) # test matrix
con <- dbConnect(SQLite()) # could add dbname= arg. Here use in-memory so not needed.
dbWriteTable(con, "m", as.data.frame(m)) # write
dbGetQuery(con, "create unique index mi on m(row_names)")
# retrieve submatrix back as m2
m2.df <- dbGetQuery(con, "select * from m where row_names in ('A', 'C')
order by row_names")
m2 <- as.matrix(m2.df[-1])
rownames(m2) <- m2.df$row_names
Note that relational databases are set based and the order that the rows are stored in is not guaranteed. We have used order by row_names to get out a specific order. If that is not good enough then add a column giving the row index: 1, 2, 3, ... .
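For comparison, the same store-index-retrieve pattern sketched with Python's sqlite3 module (table and column names here are illustrative); the IN list is built from placeholders, so the wanted IDs can be chosen at run time rather than hardcoded:

```python
import sqlite3

# Same pattern as the R answer: store each matrix row under its ID,
# index the ID column, then fetch only the wanted rows in a fixed order.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE m (row_names TEXT, c1 REAL, c2 REAL)")
con.executemany("INSERT INTO m VALUES (?, ?, ?)",
                [("A", 1.0, 2.0), ("B", 3.0, 4.0), ("C", 5.0, 6.0)])
con.execute("CREATE UNIQUE INDEX mi ON m(row_names)")

wanted = ["A", "C"]                  # chosen dynamically at run time
marks = ",".join("?" * len(wanted))  # one placeholder per wanted ID
rows = con.execute(
    "SELECT * FROM m WHERE row_names IN (%s) ORDER BY row_names" % marks,
    wanted).fetchall()
```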
REVISED based on comments.