Pentaho Join table values - sql

I'm a rookie at Pentaho Kettle and queries. What I'm trying to do is check whether a value in file 1 is also in file 2.
I've got two files that I export from my DB:
File 1:
Row1, Row2
A 3
B 5
C 99
Z 65
File 2:
Row1, Row2
A 3
D 11
E 22
Z 65
And I want to create one output file:
File Output
Row1, Row2
A 3
Z 65
What I'm doing: two file inputs and a Merge Join step, but no file output. Something is missing here.
Any suggestion would be great!

You can join the two streams with a "Merge join" step, which lets you freely set the join keys (in your case it seems you want both fields used) and also the join type: Inner, Left outer, Right outer, or Full outer.

You can use a Stream Lookup step for that. Start with the file input for file 1, then add a Stream Lookup step that uses the input stream for file 2 as its lookup stream. Now just match the key columns and you can add the columns from file 2 to your data stream.

Sort both files in ascending order.
Use the Merge Join step to join both streams on the sorted field (in this case Row1).
Use a Select Values step to delete the unwanted fields produced by the join.
Output your result using the Dummy step or whatever output you prefer.
This should work fine.
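
For reference, the inner-join result the question asks for can be sketched outside Kettle in a few lines of pandas (the names below are illustrative, not part of any Kettle transformation):

import pandas as pd

# The two exported files, using the column names from the question.
file1 = pd.DataFrame({"Row1": ["A", "B", "C", "Z"], "Row2": [3, 5, 99, 65]})
file2 = pd.DataFrame({"Row1": ["A", "D", "E", "Z"], "Row2": [3, 11, 22, 65]})

# An inner join on both fields keeps only the rows present in both files.
result = pd.merge(file1, file2, on=["Row1", "Row2"], how="inner")
print(result)  # two rows: (A, 3) and (Z, 65)

This is what the Merge Join step with join type Inner should produce for the sample data.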

Storing a large array into a table with 10,000 columns in SQLite

I want to be able to store some 100x100 matrices in a table within my database (covariance matrices). A good first step for me would be to flatten each matrix and store the matrix structure (among other things) in a parent table.
However, creating such a table would require making a table with about 10,000 or so columns. Writing out so many field names would make my SQL code extraordinarily large, and I wouldn't know where to start if I wanted to query for a matrix.
Is there a neat way to specify such a table in SQL? Is there a neat way for me to set or get a particular (set of) matrix (matrices) from my database using such a table? Is there a better way?
I am using SQLite for my databases.
Any table with a large number of same-typed columns can be rotated.
For example, if you have a table A like this:
row col1 col2 col3 ...
1 1 2 3
2 11 12 13
You can simply rotate it into a table with 3 columns:
row col value
1 1 1
1 2 2
1 3 3
2 1 11
2 2 12
2 3 13
So instead of writing a big SQL statement like
select col1, col2, col3 ...... from A where row = 2
you write SQL like
select value from A where row = 2 order by col
The result set that was originally horizontal now becomes vertical: it has been rotated and is easier to handle.
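
If it helps, here is a minimal sketch of this rotated layout applied to the matrix question, using Python's sqlite3 module (the table and column names are illustrative):

import sqlite3

conn = sqlite3.connect(":memory:")  # use your database file instead
# One row per matrix cell instead of ~10,000 columns.
conn.execute("""CREATE TABLE matrix_values (
    matrix_id INTEGER,
    row_idx INTEGER,
    col_idx INTEGER,
    value REAL,
    PRIMARY KEY (matrix_id, row_idx, col_idx))""")

# Store a matrix by flattening it into (row, col, value) triples.
matrix = [[1.0, 2.0], [3.0, 4.0]]
conn.executemany(
    "INSERT INTO matrix_values VALUES (1, ?, ?, ?)",
    [(r, c, v) for r, row in enumerate(matrix) for c, v in enumerate(row)])

# Fetch matrix 1 back, ordered so it can be reshaped into rows.
cells = conn.execute(
    "SELECT row_idx, col_idx, value FROM matrix_values "
    "WHERE matrix_id = 1 ORDER BY row_idx, col_idx").fetchall()

A 100x100 covariance matrix is then just 10,000 rows for one matrix_id, and the parent table can hold the structure (dimensions, etc.).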

How can I read and parse files with varying spaces as delimiters?

I need help solving this problem:
I have a directory full of .txt files that look like this:
file1.no
file2.no
file3.no
And every file has the following structure (I only care about the first two "columns" in the .txt):
#POS SEQ SCORE QQ-INTERVAL STD MSA DATA
#The alpha parameter 0.75858
#The likelihood of the data given alpha and the tree is:
#LL=-4797.62
1 M 0.3821 [0.01331,0.5465] 0.4421 7/7
2 E 0.4508 [0.05393,0.6788] 0.5331 7/7
3 L 0.5334 [0.05393,0.6788] 0.6279 7/7
4 G 0.5339 [0.05393,0.6788] 0.624 7/7
And I want to parse all of them into one DataFrame, while also converting the columns into lists for each row (i.e., the first column should be converted into a string like this: ["MELG"]).
But now I am running into two issues:
How to read the different files and append all of them to a single DataFrame, while also making a single column out of all the rows inside each file
How to parse these files, given that the spaces between the columns vary for almost all of them
My output should look like this:
|File |SEQ |SCORE|
| --- | ---| --- |
|File1|MELG|0.3821,0.4508,0.5334,0.5339|
|File2|AAHG|0.5412,1.2345,0.0241,0.5901|
|File3|LLKM|0.9812,0.2145,0.4142,0.4921|
So, the first column for the first file (file1.no), the one with single letters, is now in a list, in a row with all the information from that file, and the DataFrame has one row for each file.
Any help is welcome, thanks in advance.
Here is example code that should work for you:
using DataFrames

function parsefile(filename)
    l = readlines(filename)
    # drop the comment/header lines, which all start with "#"
    filter!(x -> !startswith(x, "#"), l)
    # split on runs of whitespace, so the varying spacing
    # between columns does not matter
    sl = split.(l)
    return (File=filename,
            SEQ=join(getindex.(sl, 2)),
            SCORE=parse.(Float64, getindex.(sl, 3)))
end

df = DataFrame()
foreach(fn -> push!(df, parsefile(fn)), ["file$i.no" for i in 1:3])
Your result will be in the df data frame (for the sample data shown, the first row would have File = "file1.no", SEQ = "MELG", and SCORE = [0.3821, 0.4508, 0.5334, 0.5339]).

How to transpose columns when they encode multiple "records"?

I have a spreadsheet I have imported into OpenRefine. The creator encoded groups of information (records) in columns. I need to bring each of those groups of columns into its own row, along with all the relevant columns.
Using a simplified example, how would I go from this:
id foo1 foo2 foo3 bar1 bar2 bar3
1 4 6 a 7 9 b
2 5 5 a 8 8 b
3 6 4 a 9 7 b
To this:
id foobar1 foobar2 foobar3
1 4 6 a
1 7 9 b
2 5 5 a
2 8 8 b
3 6 4 a
3 9 7 b
I've been trying to think of a way forward with intermediate columns, but there are 6 groups of 5 columns and I'm currently stuck.
I found a solution. The steps are:
Concat each group of columns into a single column (FOO_CONCAT, BAR_CONCAT)
Delete the now-unneeded columns (foo1..3, bar1..3)
Transpose your CONCAT columns into a single column, with no prefix, ignoring blanks and filling down the other columns
Now the FOO_CONCATs and BAR_CONCATs are all in the same column
Split that column into several columns, using the separator you used in step 1
Rename the columns
Strip out the prefixes (I had foo1:4, bar2:8, etc. for clarity)
Transform to numbers (Edit cells -> Common transforms -> To number)
Now you're ready to transpose, facet, etc.
I think this is essentially the same as the solution you describe, but possibly with some shortcuts to avoid all the steps.
Given the example data you post I would:
On the "id" column, select Edit column -> Add column based on this column from the menu
Make the new column name "foobar"
Use the GREL forEach(row.columnNames,cn,if(cn.startsWith("foo"),cells[cn].value,null)).join("|")+"~"+forEach(row.columnNames,cn,if(cn.startsWith("bar"),cells[cn].value,null)).join("|")
Once the new "foobar" column exists, on this column use the menu option Edit cells -> Split multi-valued cells with the "~" character (as used in the GREL above)
Then, also on the "foobar" column, use the menu option Edit column -> Split into several columns, using the "|" character as in the GREL above
Finally, on the "id" column use Edit cells -> Fill down
This should result in the output you describe - if you don't need the original columns at this point you can either remove them, or (sometimes quicker) export the first X columns that have the reconfigured data using the custom tabular exporter, and then import that data into a new project.
You can modify the GREL to deal with the exact column groupings you have. In my example I've used the column naming to group the values, but if that isn't the reality of the data you are dealing with you can use GREL like:
forEach(row.columnNames.slice(1,4),cn,cells[cn].value).join("|")+"~"+forEach(row.columnNames.slice(4,8),cn,cells[cn].value).join("|")
Which uses the 'slice' function to select certain columns rather than using some aspect of the column name to select them.
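
For comparison outside OpenRefine, the same reshape can be sketched in pandas (hypothetical code, using the simplified example data from the question):

import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3],
                   "foo1": [4, 5, 6], "foo2": [6, 5, 4], "foo3": ["a", "a", "a"],
                   "bar1": [7, 8, 9], "bar2": [9, 8, 7], "bar3": ["b", "b", "b"]})

# Take each group of columns, give them the shared output names,
# then stack the groups so each group becomes its own row.
groups = []
for prefix in ["foo", "bar"]:
    g = df[["id"] + [f"{prefix}{i}" for i in range(1, 4)]].copy()
    g.columns = ["id", "foobar1", "foobar2", "foobar3"]
    groups.append(g)
result = pd.concat(groups).sort_values("id", kind="stable").reset_index(drop=True)

This produces one row per (id, group), matching the desired output.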

Working of Merge in SAS (with IN=)

I have two datasets, data1 and data2:
data data1;
input sn id $;
datalines;
1 a
2 a
3 a
;
run;
data data2;
input id $ sales x $;
datalines;
a 10 x
a 20 y
a 30 z
a 40 q
;
run;
I am merging them with the code below:
data join;
merge data1(in=a) data2(in=b);
by id;
if a and b;
run;
Result (I was expecting an inner-join result, which this is not):
1 a 10 x
2 a 20 y
3 a 30 z
3 a 40 q
Result from a proc sql inner join:
proc sql;
select data1.id, sn, sales, x from data2 inner join data1 on data1.id = data2.id;
quit;
Result (as expected from an inner join):
a 1 10 x
a 1 20 y
a 1 30 z
a 1 40 q
a 2 10 x
a 2 20 y
a 2 30 z
a 2 40 q
a 3 10 x
a 3 20 y
a 3 30 z
a 3 40 q
I want to understand the concept and the STEP BY STEP working of the MERGE statement in SAS with IN=, and how it produces the result above.
PS: I have read this, and it says
An obvious use for these variables is to control what kind of 'merge' will occur, using if statements. For example, if ThisRecordIsFromYourData and ThisRecordIsFromOtherData; will make SAS only include rows that match on the by variables from both input data sets (like an inner join).
which, I guess, is not always the case; it is not always "like an inner join".
Basically, this is a result of the difference in how the SAS data step and SQL process their respective joins/merges.
SQL creates a separate record for each possible combination of keys. This is a Cartesian product (at the key level).
The SAS data step, however, processes merges very differently. MERGE is really nothing more than a special case of SET. It still processes rows iteratively, one at a time - it never goes back, and it never has more than one row from any dataset in the PDV at once. Thus, it cannot create a Cartesian product in its normal process - that would require random access, which the SAS data step doesn't normally do.
What it does:
For each unique BY value
Take the next record from the left side dataset, if one exists with that BY value
Take the next record from the right side dataset, if one exists with that BY value
Output a row
Continue until both datasets are exhausted for that BY value
With BY values that yield unique records per value on either side (or both), it is effectively identical to SQL. However, with BY values that yield duplicates on BOTH sides, you get what you have there: a side-by-side merge, and if one runs out before the other, the values from the last row of the shorter dataset (for that by value) are more-or-less copied down. (They're actually RETAINED, so if you overwrite them with changes, they will not reset on new records from the longer dataset).
So, if left has 3 records and right has 4 records for key value a, like in your example, then you get data from the following records (assuming you don't alter the data after):
left right
1 1
2 2
3 3
3 4
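
To make that pairing concrete, here is a rough Python simulation of the data step's behavior for the single BY group in the question (purely illustrative, not how SAS is implemented, but it shows the retained values):

from itertools import zip_longest

# One BY group (id = 'a'): 3 rows from data1, 4 rows from data2.
data1 = [{"sn": 1}, {"sn": 2}, {"sn": 3}]
data2 = [{"sales": 10, "x": "x"}, {"sales": 20, "x": "y"},
         {"sales": 30, "x": "z"}, {"sales": 40, "x": "q"}]

# Walk both lists side by side; when the shorter one runs out,
# its last row's values are retained rather than re-read or cleared.
merged = []
last_left, last_right = {}, {}
for left, right in zip_longest(data1, data2):
    last_left = left if left is not None else last_left
    last_right = right if right is not None else last_right
    merged.append({"id": "a", **last_left, **last_right})

for row in merged:
    print(row)
# prints 1/10/x, 2/20/y, 3/30/z, 3/40/q -- the data step result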

Split fixed width row into multiple rows in SSIS

I have a fixed-width flat file that needs to be loaded into multiple Oracle tables (one row needs to be split into multiple rows).
The numbers on top of each column are their widths,
and my desired output should look like what is shown below.
Flat file data (fixed width):
3 6 3 11 3 10 3 10 3
ID NAME AGE CTY1 ST1 CTY2 ST2 CTY3 ST3
200JOHN 46 LOSANGELES CA HOUSTON TX CHARLOTTE NC
201TIMBER54 PHOENIX AZ CHICAGO IL
202DAVID 32 ATLANTA GA PORTLAND AZ
The number of occurrences may vary; it can grow up to 20-30.
DESIRED OUTPUT:
TABLE1
ID NAME AGE
200JOHN 46
201TIMBER54
202DAVID 32
TABLE2
ID SEQ CTY ST
200 1 LOSANGELES CA
200 2 HOUSTON TX
200 3 CHARLOTTE NC
201 1 PHOENIX AZ
201 2 CHICAGO IL
202 1 ATLANTA GA
202 2 PORTLAND AZ
Can someone help me out?
Thanks!
I would listen to the advice given by @bilinkc first and attempt to solve this with an unpivot.
See the documentation on the SSIS Unpivot Data Flow Transformation for details on how to use it.
However, if that does not work out for some reason and you really want to solve this with SSIS, I am (kind of) happy to say it is technically feasible to solve the problem using SSIS and one data flow.
Below is an abbreviated list of steps:
1) Add a Data Flow Task to your package
2) Add a Flat File Source to your Data Flow Task
3) Configure the Flat file Source with a Connection Manager for your flat file
4) Add a Multicast Data Flow Transformation to your Data Flow Task
5) Connect your Flat File Source with the Multicast Data Flow Transformation
Now the "fun" part (copy and paste can save you time here)...
6) Add 30 Conditional Split Data Flow Transformations to your Data Flow Task
7) Connect the Multicast Data Flow Transformation to each Conditional Split Data Flow
8) Configure each Conditional Split N to pull the row subset where State N and City N have values
Example: Conditional Split 1
Output Name: CTY1_ST1
Condition: [CTY1] != "" && [ST1] != ""
9) Add 30 Derived Column Data Flow Transformations to your data flow
10) Connect each one to your 30 Conditional Splits
11) Configure each with a Derived Column Name of SEQ and a value of 1 to 30, matching its Conditional Split number
12) Add a Union All Data Flow Transformation and Union All 30 of the data pipes back together
Now the "easy" part...
13) Add your first Sort Transformation to your Data Flow Task
14) Connect a 31st Multicast pipe to your first Sort Transformation
15) Put a check mark next to ID and sort by it (hopefully ID:NAME and ID:AGE are 1:1)
16) Check Remove rows with duplicate sort values
17) Add your second Multicast Data Flow Transformation
18) Add a second Sort Transformation to your Data Flow Task
19) Connect your Union All to your second Sort Transformation and sort by ID
20) Add a Merge Join to your Data Flow Task
21) Connect your second Multicast Data Flow Transformation as the Left Input
22) Connect your second Sort Transformation to your Merge Join as your Right Input
23) Configure your Merge Join as Join Type = Inner Join and select columns ID, SEQ, CTY, ST
24) Add your first OLE DB Destination to your data flow and connect your Merge Join to it (the result is TABLE2)
25) Add a second OLE DB Destination to your data flow and connect your second Multicast Data Flow Transformation to it (the result is TABLE1)
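
For reference, the row-splitting logic this whole data flow implements can be sketched outside SSIS in a few lines of Python (the file name is hypothetical, and the widths come from the question; extend the list if the occurrences grow to 20-30):

# Field widths from the question: ID=3, NAME=6, AGE=3,
# then repeating (CTY, ST) pairs of widths 11/3, 10/3, 10/3, ...
widths = [3, 6, 3, 11, 3, 10, 3, 10, 3]

def split_fixed(line, widths):
    fields, pos = [], 0
    for w in widths:
        fields.append(line[pos:pos + w].strip())
        pos += w
    return fields

table1, table2 = [], []
with open("flatfile.txt") as f:
    for line in f:
        fields = split_fixed(line.rstrip("\n"), widths)
        rec_id, name, age = fields[0], fields[1], fields[2]
        table1.append((rec_id, name, age))
        # Each (CTY, ST) pair becomes its own numbered row for TABLE2.
        for seq, (cty, st) in enumerate(zip(fields[3::2], fields[4::2]), start=1):
            if cty:  # skip empty trailing occurrences
                table2.append((rec_id, seq, cty, st))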