Pentaho Data Integration Multiway Merge Join - pentaho

I want to use the Multiway Merge Join step in Pentaho. The documentation is woefully lacking, and the step is not doing what I intuitively thought it would do.
I have the following tables defined in Oracle:
JOE1:
A B C
1 NY 3
2 NJ 1
3 NJ 3
4 CT 7
JOE2:
B D
CT Connecticut
NJ New Jersey
NY New York
JOE3:
C E
1 one
3 three
7 seven
Here's the metadata from my Multiway Merge Join step in my .ktr:
Step name: Multiway Merge Join
Input Table1: JOE1 Join Keys: B,C
Input Table2: JOE2 Join Keys: B
Input Table3: JOE3 Join Keys: C
Join Type: INNER
I would've expected my .ktr to produce something like this:
A B C B_1 D C_1 E
1 NY 3 NY New York 3 three
2 NJ 1 NJ New Jersey 1 one
3 NJ 3 NJ New Jersey 3 three
4 CT 7 CT Connecticut 7 seven
But, instead, I get the following error:
**2018/10/12 14:44:25 - Multiway Merge Join.0 - Unexpected conversion error while converting value [B String(2)] to an Integer
2018/10/12 14:44:25 - Multiway Merge Join.0 -
2018/10/12 14:44:25 - Multiway Merge Join.0 - B String(2) : couldn't convert String to Integer
2018/10/12 14:44:25 - Multiway Merge Join.0 -
2018/10/12 14:44:25 - Multiway Merge Join.0 - B String(2) : couldn't convert String to number : non-numeric character found at position 1 for value [CT]**
This is an indication that it's not joining on the field I defined to join on in the .ktr.
Unfortunately, my company's firewall prevents me from sending a link to any files or images. I'm hoping I've provided enough information for someone to advise me if I've done something wrong or even if my behavioral expectations are accurate.

The Multiway Merge Join is not like a SQL join. It is a merge, closer to a sorted SQL UNION: it takes two flows (JOE1 and JOE2) and emits their records one after the other, always taking the lowest record next. In particular, the flow metadata (column names, types and order) must be the same, something PDI should have warned you about (unless you previously pressed the "Don't tell me anymore" button).
You can use the Join Rows (cartesian product) step. Don't worry, it does not have to be a cartesian product, because you can specify a condition such as JOE1.B = JOE2.B (and more). PDI will remind you to sort the incoming flows first (unless you previously pressed the "Don't tell me anymore" button). Of course you have to do this twice: once to join JOE1 with JOE2, and once to join the resulting stream to JOE3.
In your case, however, you are not after joins but after lookups. For each JOE1.B you are searching for exactly one JOE2.B, and for each JOE1.C you are looking for exactly one JOE3.C. Just like in the attached picture, in which the first lookup is open so that you can see the parameters. [Do not forget to specify the type of the returned column!]
Note that you can always put all this in SQL: SELECT * FROM joe1 JOIN joe2 ON joe2.B=joe1.B JOIN joe3 ON joe3.C=joe1.C. But this will be harder to maintain, and if the queries are complex (lots of joins and many cross-table relations), it may be slower than PDI.
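As a quick sanity check of the SQL alternative, here is a minimal sketch that loads the three sample tables into an in-memory SQLite database and runs the two joins. SQLite stands in for Oracle here purely for illustration, and the explicit column list (instead of SELECT *) is my own choice to avoid the duplicated B and C columns.

```python
import sqlite3

# In-memory SQLite database standing in for the Oracle tables (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE joe1 (A INTEGER, B TEXT, C INTEGER);
CREATE TABLE joe2 (B TEXT, D TEXT);
CREATE TABLE joe3 (C INTEGER, E TEXT);
INSERT INTO joe1 VALUES (1,'NY',3),(2,'NJ',1),(3,'NJ',3),(4,'CT',7);
INSERT INTO joe2 VALUES ('CT','Connecticut'),('NJ','New Jersey'),('NY','New York');
INSERT INTO joe3 VALUES (1,'one'),(3,'three'),(7,'seven');
""")

# Each joe1 row picks up exactly one joe2 row (on B) and one joe3 row (on C).
joined_rows = conn.execute("""
SELECT joe1.A, joe1.B, joe1.C, joe2.D, joe3.E
FROM joe1
JOIN joe2 ON joe2.B = joe1.B
JOIN joe3 ON joe3.C = joe1.C
ORDER BY joe1.A
""").fetchall()

for row in joined_rows:
    print(row)
```

The four rows match the output the question expected, minus the repeated B_1 and C_1 key columns.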

It would appear that the join has to be done on the same field(s) for all input streams. It doesn't have to have the same field name(s) but, conceptually, has to have the same data content.
Thanks to AlainD for the verification and the detailed explanation!!

Related

Merge SQL Rows in Subquery

I am trying to work with two tables on BigQuery. From table1 I want to find the accession ID of all records that are "World", and then from each of those accession numbers I want to create a column with every name in a separate row. Unfortunately, when I run this:
Select name
From `table2`
Where acc IN (Select acc
From `table1`
WHERE source = 'World')
Instead of getting something like this:
Acc1 Acc2 Acc3
Jeff Jeff Ted
Chris Ted Blake
Rob Jack Jack
I get something more like this:
row name
1 Jeff
2 Chris
3 Rob
4 Jack
5 Jeff
6 Jack
7 Ted
8 Blake
Ultimately, I am hoping to download the data and somehow use python or something to take each name and count the number of times it shows up with each other name at a given accession number, and furthermore measure the degree to which each pairing is also found with third names in any given column, i.e. the degree to which they share a cohort. So I need to preserve the groupings which exist with each accession number, but I am struggling to find info on how one might do this.
Could anybody point me in the right direction for this, or tell me whether the way I am going about this is wise given my end goal?
Thanks!
This is not a direct answer to the question you asked. In general, it is easier to handle multiple rows rather than multiple columns.
So, I would recommend that you put each acc value in a separate row and then list the names as an array:
select t2.acc, array_agg(t2.name order by t2.name) as names
from `table2` t2
where t2.acc in (Select t1.acc
From `table1` t1
where t1.source = 'World'
)
group by t2.acc;
Otherwise, you are going to have a challenge just naming the columns in your result set.
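Once the names are grouped per accession, the co-occurrence counting mentioned in the question becomes a short loop. The following is only a sketch: names_by_acc is a hypothetical stand-in for the downloaded (acc, names) rows, filled with the groupings from the question's desired output.

```python
from collections import Counter
from itertools import combinations

# Hypothetical data shaped like the (acc, names) rows the query would return.
names_by_acc = {
    "Acc1": ["Chris", "Jeff", "Rob"],
    "Acc2": ["Jack", "Jeff", "Ted"],
    "Acc3": ["Blake", "Jack", "Ted"],
}

pair_counts = Counter()
for names in names_by_acc.values():
    # sorted(set(...)) counts each pair once per accession, in a stable order.
    for pair in combinations(sorted(set(names)), 2):
        pair_counts[pair] += 1

for pair, count in pair_counts.most_common():
    print(pair, count)
```

Keeping one row per accession is what makes this easy: each array is one cohort, so no column naming or reshaping is needed before counting.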

Trying to display id of table 1 as it relates to table 3

I have three tables: exfillocation, phishkit, snapshot. We need to be able to query exfillocation.filename and print the related snapshot.id, which requires traversing the phishkit table.
exfillocation.phishkit_id is related to phishkit.id as a foreign key.
Table exfillocation schema:
id exfil_location phishkit_id
== ========= ============
1 ['open.txt'] 7442
2 ['bot.txt'] 9931
phishkit.snapshot_id is related to snapshot.id as a foreign key.
Phishkit schema:
id snapshot_id md5
=== ============ =====
7442 1492 f4a3954e39b90c02f4a3954e39b90c02
9931 1661 e048f240ad0845b50abe8df9124ce3fb
Snapshot schema:
id asn url
=== ====== =============
1661 123 badwebsite.malicious.com
1492 31 haxx0rs.hacking.com
I've tried reading postgresql's four different JOIN methods as well as the UNION method, but I don't seem to get the snapshot_id column returned.
I tried something awkward like this:
SELECT exfil_location, found_in_file, phishkit_id
FROM public.lookup_exfillocation
FULL OUTER JOIN public.lookup_phishkit
ON public.lookup_exfillocation.phishkit_id = public.lookup_phishkit.id
FULL OUTER JOIN public.lookup_snapshot
ON public.lookup_phishkit.snapshot_id = public.lookup_snapshot.id WHERE exfil_location::text NOT LIKE ('__script.txt__') ORDER BY phishkit_id;
I expected to see the related lookup_snapshot.id and the related lookup_phishkit.id, which neither showed.
I accidentally found the solution. It came down to what columns I was SELECTing. Using a * showed all columns in the JOIN statements. Then I picked from the columns needed. The query looks like:
FROM public.lookup_exfillocation
FULL OUTER JOIN public.lookup_phishkit
ON public.lookup_exfillocation.phishkit_id = public.lookup_phishkit.id
FULL OUTER JOIN public.lookup_snapshot
ON public.lookup_phishkit.snapshot_id = public.lookup_snapshot.id WHERE exfil_location::text NOT LIKE ('__script.txt__') ORDER BY phishkit_id;
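To illustrate why the SELECT list was the deciding factor, here is a small sketch using the sample rows from the question in an in-memory SQLite database. SQLite replaces PostgreSQL only for illustration, plain inner joins replace the FULL OUTER JOINs (an assumption that every exfil location has a phishkit and a snapshot), and the aliases e, p and s are mine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lookup_snapshot (id INTEGER PRIMARY KEY, asn INTEGER, url TEXT);
CREATE TABLE lookup_phishkit (id INTEGER PRIMARY KEY, snapshot_id INTEGER, md5 TEXT);
CREATE TABLE lookup_exfillocation (id INTEGER PRIMARY KEY, exfil_location TEXT, phishkit_id INTEGER);
""")
conn.executemany("INSERT INTO lookup_snapshot VALUES (?, ?, ?)", [
    (1661, 123, "badwebsite.malicious.com"),
    (1492, 31, "haxx0rs.hacking.com"),
])
conn.executemany("INSERT INTO lookup_phishkit VALUES (?, ?, ?)", [
    (7442, 1492, "f4a3954e39b90c02f4a3954e39b90c02"),
    (9931, 1661, "e048f240ad0845b50abe8df9124ce3fb"),
])
conn.executemany("INSERT INTO lookup_exfillocation VALUES (?, ?, ?)", [
    (1, "['open.txt']", 7442),
    (2, "['bot.txt']", 9931),
])

# Qualify the id columns explicitly so both related ids appear in the result.
exfil_rows = conn.execute("""
SELECT e.exfil_location, p.id AS phishkit_id, s.id AS snapshot_id
FROM lookup_exfillocation AS e
JOIN lookup_phishkit AS p ON e.phishkit_id = p.id
JOIN lookup_snapshot AS s ON p.snapshot_id = s.id
ORDER BY p.id
""").fetchall()

for row in exfil_rows:
    print(row)
```

The joins were already traversing phishkit correctly; a column only shows up if it is named in the SELECT list (or pulled in with *).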

Storing data about objects with a variable number of ordered subparts in Access Database

The situation: I have a database storing biological specimen data. One table contains data about each specimen. Each specimen has between 1 and 8 parts, which are ordered.
I would like to enumerate each subpart in a query, using the specimen id and the number of parts. So if I have 2 specimens, A and B, and A has 2 parts and B has 3 parts, I want the result:
Parts:
A - 1
A - 2
B - 1
B - 2
B - 3
I realize that this is probably a trivial task, but I don't know the correct terminology to talk about it in a way that help pages and Google will understand. Thank you.
Edit to add thoughts: If I were dealing with something like this in a non-SQL context, I'd use a for loop to iterate the enumeration process over each specimen, but I don't understand how to implement anything remotely similar in SQL.
You mentioned "main table" which implies there's some other table for the sub parts. What you're after is likely a simple JOIN:
SELECT
*
FROM
maintable
INNER JOIN
subtable
ON
subtable.mainid = maintable.id
If you want an exact query, post a screenshot of your database tables and their column names and any relationships
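If the part count is stored on the specimen row rather than in a separate parts table, one common way to get the same enumeration is to join against a small helper table of integers. This is a different technique from the JOIN above, shown only as a sketch: the specimen and numbers tables and their columns are hypothetical, and SQLite is used so the example runs on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical tables: one row per specimen, plus a helper list 1..8.
CREATE TABLE specimen (specimen_id TEXT PRIMARY KEY, part_count INTEGER);
CREATE TABLE numbers (n INTEGER PRIMARY KEY);
INSERT INTO specimen VALUES ('A', 2), ('B', 3);
INSERT INTO numbers VALUES (1),(2),(3),(4),(5),(6),(7),(8);
""")

# Pair every specimen with each number up to its part count.
part_rows = conn.execute("""
SELECT s.specimen_id, n.n AS part_no
FROM specimen AS s, numbers AS n
WHERE n.n <= s.part_count
ORDER BY s.specimen_id, n.n
""").fetchall()

for specimen_id, part_no in part_rows:
    print(f"{specimen_id} - {part_no}")
```

The WHERE condition plays the role of the for loop described in the question: the numbers table supplies the loop counter, and the part count supplies the upper bound.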

Efficient way to query a table with data from another table using 3 keys

Let's say I have Table A and Table B. Both tables contain about 500,000 records. Cat, Dog and Mouse hold exactly the same data type in both tables, but data present in one table may not be in the other.
Table Zoo:
Cat | Dog | Mouse | Bird
xyz dfg sdhf 123
dfr kjf asdc 456
zxc abc qwrt 789
Table Pet_Store:
Cat | Dog | Mouse | Pig
ghf dsa dfre 12
dfr gfr qwy5 19
zxc abc dfgr 21
Desired Result:
Cat | Dog | Mouse
dfr kjf asdc
zxc abc qwrt
I want to query every record where either Cat, Dog or Mouse are the same. There is no unique key here to connect both tables the only way we can draw a connection is with those 3 fields. If at least one match is present return Cat, Dog and Mouse. I did a select statement myself but considering the data I am working with is very large this process is taking a long time so I don't think I am being efficient. Any suggestions?:
select n.Cat, n.Dog, n.Mouse
from Zoo n, Pet_Store t
where
(n.Cat =t.Cat or n.Dog =t.Dog or n.Mouse =t.Mouse)
edit: Sorry I should have included a little more clarity. My brain is fried at the moment so I apologize for that. If any of the fields I do a check on match, pull the fields Cat, Dog, Mouse from the Zoo table.
Depending on how much you care about duplicates, you could do something like
select z.cat, z.dog, z.mouse from zoo z inner join pet_store p on z.cat = p.cat
union all
select z.cat, z.dog, z.mouse from zoo z inner join pet_store p on z.dog = p.dog
union all
select z.cat, z.dog, z.mouse from zoo z inner join pet_store p on z.mouse = p.mouse
This will allow index usage on all columns (assuming you have the proper indexes on both tables).
Well you have not told us much but given what you have told us this is how I would do it.
SELECT A.Cat, A.Dog, A.Mouse
FROM Zoo A
LEFT JOIN Pet_Store B1 ON A.Cat = B1.Cat
LEFT JOIN Pet_Store B2 ON A.Dog = B2.Dog
LEFT JOIN Pet_Store B3 ON A.Mouse = B3.Mouse
WHERE COALESCE(B1.Cat, B2.Dog, B3.Mouse) IS NOT NULL
Since we don't know anything about the structure of the data or other information about the columns or the tables, I know of no way to improve this query. HOWEVER, if you do have any indexes at all, this query will use them in the best possible way.
For example an index on B.Mouse could be used in this query but not used in your example query.
There's nothing really wrong with your query; you're dealing with no indexes and table scans on a reasonably large table. You will see a slight improvement by refactoring the query slightly, but you would see much more significant performance improvements by adding indexes.
SELECT z.Cat, z.Dog, z.Mouse
FROM Zoo z
INNER JOIN Pet_Store p ON
z.Cat = p.Cat OR
z.Dog = p.Dog OR
z.Mouse = p.Mouse
That will return the data you're looking for - there's no need to join the tables multiple times.
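The remark about duplicates in the UNION ALL answer can be checked against the sample rows. This is a minimal sketch using an in-memory SQLite database; the {op} placeholder is only my way of running the same three-part query once with UNION ALL and once with UNION.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE zoo (cat TEXT, dog TEXT, mouse TEXT, bird INTEGER);
CREATE TABLE pet_store (cat TEXT, dog TEXT, mouse TEXT, pig INTEGER);
INSERT INTO zoo VALUES ('xyz','dfg','sdhf',123),('dfr','kjf','asdc',456),('zxc','abc','qwrt',789);
INSERT INTO pet_store VALUES ('ghf','dsa','dfre',12),('dfr','gfr','qwy5',19),('zxc','abc','dfgr',21);
""")

# One equality join per column, combined with a set operator chosen below.
per_column_sql = """
SELECT z.cat, z.dog, z.mouse FROM zoo z INNER JOIN pet_store p ON z.cat = p.cat
{op}
SELECT z.cat, z.dog, z.mouse FROM zoo z INNER JOIN pet_store p ON z.dog = p.dog
{op}
SELECT z.cat, z.dog, z.mouse FROM zoo z INNER JOIN pet_store p ON z.mouse = p.mouse
"""

union_all_rows = conn.execute(per_column_sql.format(op="UNION ALL")).fetchall()
union_rows = conn.execute(per_column_sql.format(op="UNION")).fetchall()

print(len(union_all_rows), "rows with UNION ALL")
print(sorted(union_rows))
```

With this data the zxc row matches on both Cat and Dog, so UNION ALL returns it twice, while UNION returns exactly the two rows of the desired result at the cost of a de-duplication step.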

Access 2010 doubling the sum in query

I know this question has been asked and answered. I understand the problem and I understand the underlying cause and I understand the solution. What I DON'T understand is how to implement the solution.
I'll try to be detailed....
Background: Each material is grouped on WellID (I work in oil and gas) and SandType, which is my primary key in each table; these come from 2 lookup tables, one for each.
I have 3 tables that store material (sand) weights at 3 different stages in the job process. Basically the weight from the engineer's DESIGN, what was DELIVERED and what is in INVENTORY.
I know that the join is messed up and adding the total for each row in each table. Sometimes double triple etc.
I am grouping on WellID and SandID.
Now I don't want someone to do the work for me. I just don't know how or where in Access to restrict it to what I want, or whether modifying the SQL is the proper way to write the code. My current workaround is 3 separate sum queries, one for each table, but that is going to get inefficient and adds steps.
My whole database purpose and subsequent reports hinge off math on these 3 numbers, so my show stopper here is putting the fat lady on stage, and is about to become a deal breaker at the end of the line!
I need some advice, direction, criticism, wisdom, witty euphemisms or a new job!
The 3 tables look as follows
Design:
T_DESIGN
DesignID WellID Sand_ID Weight_DES Time_DES
89 201 1 100 4/21/2014 6:46:02 AM
98 201 2 100 4/21/2014 7:01:22 AM
86 201 4 100 4/21/2014 6:28:01 AM
93 228 5 100 4/21/2014 6:53:34 AM
91 228 1 100 4/21/2014 6:51:23 AM
92 228 1 100 4/21/2014 6:53:30 AM
Delivered:
T_BOL
BOLID WellID_BOL SandID_BOL Weight_BOL
279 201 1 100
280 201 1 100
281 228 2 5
282 228 1 10
283 228 9 100
Inventory:
T_BIN
StrapID WellID_BIN SandID_BIN Weight_BIN
11 201 1 100
13 228 1 10
14 228 1 0
17 228 1 103
19 201 1 50
The Query Results:
Test Query99
WellID
WellID SandID Sum Of Weight_DES Sum Of Weight_BOL Sum Of Weight_BIN
201 1 400 400 300
228 1 600 60 226
SQL:
SELECT DISTINCTROW L_WELL.WellID, L_SAND.SandID,
Sum(T_DESIGN.Weight_DES) AS [Sum Of Weight_DES],
Sum(T_BOL.Weight_BOL) AS [Sum Of Weight_BOL],
Sum(T_BIN.Weight_BIN) AS [Sum Of Weight_BIN]
FROM ((L_SAND INNER JOIN
(L_WELL INNER JOIN T_DESIGN ON L_WELL.[WellID] = T_DESIGN.[WellID_DES])
ON L_SAND.SandID = T_DESIGN.[SandID_DES])
INNER JOIN T_BIN
ON (L_WELL.WellID = T_BIN.WellID_BIN)
AND (L_SAND.SandID = T_BIN.SandID_BIN))
INNER JOIN T_BOL
ON (L_WELL.WellID = T_BOL.WellID_BOL) AND (L_SAND.SandID = T_BOL.SandID_BOL)
GROUP BY L_WELL.WellID, L_SAND.SandID;
The two lookup tables are for Well Names and Sand Types. (The Well table has been abbreviated due to size.)
L_Well:
WellID WellName_WELL
3 AAGVIK 1-35H
4 AARON 1-22
5 ACHILLES 5301 41-12B
6 ACKLINS 6092 12-18H
7 ADDY 5992 43-21 #1H
8 AERABELLE 5502 43-7T
9 AGNES 1-13H
10 AL 5493 44-23B
11 ALDER 6092 43-8H
12 AMELIA FEDERAL 5201 41-11B
13 AMERADA STATE 1-16X
14 ANDERSMADSON 5201 41-13H
15 ANDERSON 1-13H
16 ANDERSON 7-18H
17 ANDRE 5501 13-4H
18 ANDRE 5501 14-5 3B
19 ANDRE SHEPHERD 5501 14-7 1T
Sand Lookup:
LSand
SandID SandType_Sand
1 100 Mesh
2 20/40 EP
3 20/40 RC
4 20/40 W
5 30/50 Ceramic
6 30/50 EP
7 30/50 RC
8 40/70 EP
9 40/70 W
10 NA See Notes
Querying and Joining Aggregation Data through an MS Access Database
I noticed your concern for pointers on how to implement some of the theory behind your aggregation queries. While SQL queries are good power-tools to get to the core of a difficult analysis problem, it might also be useful to show some of the steps on how to bring things together using the built-in design tools of MS Access.
This solution was developed on MS Access 2010.
Comments on Previous Solutions
#xQbert had a solid start with the following SQL statement. The sub query approach could be visualized as individual query objects created in Access:
FROM
(SELECT WellID, Sand_ID, Sum(weight_DES) as sumWeightDES
FROM T_DESGN) A
INNER JOIN
(SELECT WellID_BOL, Sum(Weight_BOL) as SUMWEIGHTBOL
FROM T_BOL B) B
ON A.Well_ID = B.WellID_BOL
INNER JOIN
(SELECT WellID_BIN, sum(Weight_Bin) as SumWeightBin
FROM T_BIN) C
ON C.Well_ID_BIN = B.Well_ID_BOL
Depending on the actual rules of the business data, the following assumptions made in this query may not necessarily be true:
Will the tables of T_DESIGN, T_BOL and T_BIN be populated at the same time? The sample data has mixed values, i.e., there are WellID and SandID combinations which do not have values for all three of these categories.
INNER type joins assume all three tables have records for each dimension value (Well-Sand combination)
#Frazz improved on the query design by suggesting that whatever is selected as the "base" joining table (T_DESIGN in this case), this table must be populated with all the relevant dimensional values (WellID and SandID combinations).
SELECT
WellID_DES AS WellID,
SandID_DES AS SandID,
SUM(Weight_DES) AS Weight_DES,
(SELECT SUM(Weight_BOL) FROM T_BOL WHERE T_BOL.WellID_BOL=d.WellID_DES
AND T_BOL.SandID_BOL=d.SandID_DES) AS Weight_BOL,
(SELECT SUM(Weight_BIN) FROM T_BIN WHERE T_BIN.WellID_BIN=d.WellID_DES
AND T_BIN.SandID_BIN=d.SandID_DES) AS Weight_BIN
FROM T_DESIGN;
(... note: a group-by statement should be here...)
This was an improvement because now all joins originate from a single point. If a key-value does not exist in either T_BOL or T_BIN, results will still come back and the entire record of the query would not be lost.
Again, it may be possible that there are no T_DESIGN records matching to values stored in the other tables.
Building Aggregation Sub Query Objects
The presented data does not suggest that there is any direct interaction between the data in each of the three tables aside from lining up their results in the end for presentation based on a common key-value pair (WellID and SandID). Since we are using Access, there is a chance to do these calculations separately.
This query was designed using the "summarizing" feature of the Access query design tool. Its output, after pointing to the T_DESIGN table, looked like this:
Making Dimension Table Through a Cartesian Product
There are mixed opinions out there about cartesian products, but they do actually have a purpose.
Most of the concern is that a runaway cartesian product query will make millions and millions of nonsensical data values. In this query, it's specifically designed to simulate a real business condition.
The Case for a Cartesian Product
Picking from the sample data provided:
Some of the Sand Types: "20/40 EP", "30/50 Ceramic", "40/70 EP", and "30/50 RC" that are moved between their respective wells, are these sand types found at these wells consistently throughout the year?
Without an anchoring dimension for the key-values, Wells would not be found anywhere in the database via querying. It's not that they do not exist... it's just that there is no recorded data (i.e., Sand Type Weights delivered) for them.
A Reference Dimension Query Product
A dimension query is simple to produce. By referencing the two sources of keys: L_WELL and L_SAND (both look up tables or dimensional tables) without identifying a join condition, all the different combinations of the two key-values (WellID and SandID) are made:
The shortcut in SQL looks like this:
SELECT L_WELL.WellID, L_SAND.SandID, L_WELL.WellName, L_SAND.SandType
FROM L_SAND, L_WELL;
The resulting data looks like this:
Instead of using any of the operational data tables: T_DESIGN, T_BOL, or T_BIN as sources of data for a static dimension such as a list of Oil Wells, or a catalog of Sand Types, that data has been predetermined and can even be transferred to a real table since it probably will not change much once it is created.
Correlating Sub Query Results from Different Sources
After repeating the process and creating the summary tables for the other two sources (T_BOL and T_BIN), you can finally arrange the results through a simple query and join process.
The actual JOIN operations are between the dimension table/query: QSUB_WELL_SAND and all three of the summary queries: QSUB_DES, QSUB_BOL, and QSUB_BIN.
I have chosen to implement LEFT OUTER joins. If you are not sure of the difference between the different "outer" joins, this is the choice I made through the Access Query Design dialogue:
QSUB_WELL_SAND is defined as our anchor dimension. It will always have more records than any of the other tables. An OUTER JOIN should be defined to KEEP all reference dimension records... and all Summary Table query results, regardless if there is a match between the two Query results.
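The anchoring effect of the dimension query can be sketched outside Access. In the example below, SQLite views stand in for the saved Access queries QSUB_WELL_SAND and QSUB_BOL; the lookup tables are cut down to two wells and two sand types (an assumption to keep the output short), and only the T_BOL summary is joined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE L_WELL (WellID INTEGER PRIMARY KEY);
CREATE TABLE L_SAND (SandID INTEGER PRIMARY KEY);
CREATE TABLE T_BOL (BOLID INTEGER, WellID_BOL INTEGER, SandID_BOL INTEGER, Weight_BOL INTEGER);
INSERT INTO L_WELL VALUES (201),(228);
INSERT INTO L_SAND VALUES (1),(2);
INSERT INTO T_BOL VALUES (279,201,1,100),(280,201,1,100),(281,228,2,5),(282,228,1,10);

-- Dimension query: every Well/Sand combination (no join condition).
CREATE VIEW QSUB_WELL_SAND AS
SELECT L_WELL.WellID, L_SAND.SandID FROM L_SAND, L_WELL;

-- Summary query: one total per Well/Sand combination that has deliveries.
CREATE VIEW QSUB_BOL AS
SELECT WellID_BOL, SandID_BOL, SUM(Weight_BOL) AS SumWeightBOL
FROM T_BOL GROUP BY WellID_BOL, SandID_BOL;
""")

# LEFT JOIN keeps every combination, with NULL where nothing was delivered.
anchored_rows = conn.execute("""
SELECT ws.WellID, ws.SandID, bol.SumWeightBOL
FROM QSUB_WELL_SAND AS ws
LEFT JOIN QSUB_BOL AS bol
  ON bol.WellID_BOL = ws.WellID AND bol.SandID_BOL = ws.SandID
ORDER BY ws.WellID, ws.SandID
""").fetchall()

for row in anchored_rows:
    print(row)
```

The (201, 2) combination has no deliveries, so it comes back with None instead of disappearing, which is the behaviour the LEFT OUTER join was chosen for.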
QSUB_WEIGHTS/ The Query to Combine All Sub Query Results
This is what the design of the final output query looks like:
This is what the data output looks like when this query design is executed:
Conclusions and Clean Up: Some Closing Thoughts
With respect to the join to the dimension query, there is a lot of empty space where there are no records or data to report on. This is where a cleverly placed filter or query criteria can shrink the output to exactly what you care to look at the most. Here's how mine looked after I added additional ending query criteria:
My data was based on what was supplied by the OP, except where the ID's assigned to the Well Type attribute did not match the sample data. The values I assigned instead are posted below as well.
Access supports a different style of database operations. Step-wise queries can be developed to hold pre-processed, special sets of data that can be reintroduced to the other data tables and query results to develop complex query criteria.
All this being said, programming in SQL can also be just as rewarding. Be sure to explore some of the differences between the results and the capabilities you can tap into by using one approach (SQL coding), the other approach (Access design wizards) or both of the approaches. There's definitely a lot of room to grow and discover new capabilities from just the example provided here.
Hopefully I haven't stolen all the fun from developing a solution for your situation. I read into your comment about "building more on top" as the harbinger of more fun to come, so I don't feel so bad...! Happy Developing!
Data Modifications from the Sample Set
Without understanding L_SAND and L_WELL this is the best I could come up with..
use sub selects to get the sums first so you don't compound the data issues on the joins.
SELECT A.WellID, A.Sand_ID, A.sumWeightDES, B.sumWeightBOL, C.sumWeightBIN
FROM
((SELECT WellID, Sand_ID, Sum(Weight_DES) AS sumWeightDES
FROM T_DESIGN
GROUP BY WellID, Sand_ID) A
INNER JOIN
(SELECT WellID_BOL, SandID_BOL, Sum(Weight_BOL) AS sumWeightBOL
FROM T_BOL
GROUP BY WellID_BOL, SandID_BOL) B
ON A.WellID = B.WellID_BOL AND A.Sand_ID = B.SandID_BOL)
INNER JOIN
(SELECT WellID_BIN, SandID_BIN, Sum(Weight_BIN) AS sumWeightBIN
FROM T_BIN
GROUP BY WellID_BIN, SandID_BIN) C
ON A.WellID = C.WellID_BIN AND A.Sand_ID = C.SandID_BIN;
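The doubling described in the question, and the effect of summing in sub-selects first, can be reproduced with the sample rows. This sketch uses an in-memory SQLite database rather than Access, and it assumes the T_DESIGN key columns are named WellID_DES and SandID_DES as in the question's SQL (the table listing shows them as WellID and Sand_ID).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE T_DESIGN (DesignID INTEGER, WellID_DES INTEGER, SandID_DES INTEGER, Weight_DES INTEGER);
CREATE TABLE T_BOL (BOLID INTEGER, WellID_BOL INTEGER, SandID_BOL INTEGER, Weight_BOL INTEGER);
CREATE TABLE T_BIN (StrapID INTEGER, WellID_BIN INTEGER, SandID_BIN INTEGER, Weight_BIN INTEGER);
INSERT INTO T_DESIGN VALUES (89,201,1,100),(98,201,2,100),(86,201,4,100),
                            (93,228,5,100),(91,228,1,100),(92,228,1,100);
INSERT INTO T_BOL VALUES (279,201,1,100),(280,201,1,100),(281,228,2,5),(282,228,1,10),(283,228,9,100);
INSERT INTO T_BIN VALUES (11,201,1,100),(13,228,1,10),(14,228,1,0),(17,228,1,103),(19,201,1,50);
""")

# Joining the detail rows first multiplies them before SUM runs.
fanned_out = conn.execute("""
SELECT d.WellID_DES, d.SandID_DES,
       SUM(d.Weight_DES), SUM(b.Weight_BOL), SUM(i.Weight_BIN)
FROM T_DESIGN d
JOIN T_BOL b ON b.WellID_BOL = d.WellID_DES AND b.SandID_BOL = d.SandID_DES
JOIN T_BIN i ON i.WellID_BIN = d.WellID_DES AND i.SandID_BIN = d.SandID_DES
GROUP BY d.WellID_DES, d.SandID_DES
ORDER BY d.WellID_DES
""").fetchall()

# Summing each table on its own first leaves one row per key to join.
pre_aggregated = conn.execute("""
SELECT d.WellID_DES, d.SandID_DES, d.sum_des, b.sum_bol, i.sum_bin
FROM (SELECT WellID_DES, SandID_DES, SUM(Weight_DES) AS sum_des
      FROM T_DESIGN GROUP BY WellID_DES, SandID_DES) d
JOIN (SELECT WellID_BOL, SandID_BOL, SUM(Weight_BOL) AS sum_bol
      FROM T_BOL GROUP BY WellID_BOL, SandID_BOL) b
  ON b.WellID_BOL = d.WellID_DES AND b.SandID_BOL = d.SandID_DES
JOIN (SELECT WellID_BIN, SandID_BIN, SUM(Weight_BIN) AS sum_bin
      FROM T_BIN GROUP BY WellID_BIN, SandID_BIN) i
  ON i.WellID_BIN = d.WellID_DES AND i.SandID_BIN = d.SandID_DES
ORDER BY d.WellID_DES
""").fetchall()

print("joined then summed:", fanned_out)
print("summed then joined:", pre_aggregated)
```

The first query reproduces the inflated figures from the question (400/400/300 and 600/60/226). For well 201 and sand 1, one design row meets two BOL rows and two BIN rows, so every weight is counted across four joined rows.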
I would simplify it by excluding L_WELL and L_SAND. If you are just interested in IDs, then they really shouldn't be necessary joins. If all the other 3 tables have the WellID and SandID columns, then pick the one that is sure to have all combos.
Supposing it's the Design table, then:
SELECT
WellID_DES AS WellID,
SandID_DES AS SandID,
SUM(Weight_DES) AS Weight_DES,
(SELECT SUM(Weight_BOL) FROM T_BOL WHERE T_BOL.WellID_BOL=d.WellID_DES AND T_BOL.SandID_BOL=d.SandID_DES) AS Weight_BOL,
(SELECT SUM(Weight_BIN) FROM T_BIN WHERE T_BIN.WellID_BIN=d.WellID_DES AND T_BIN.SandID_BIN=d.SandID_DES) AS Weight_BIN
FROM T_DESIGN AS d
GROUP BY WellID_DES, SandID_DES;
... and make sure all your tables have an index on WellID and SandID.
Just to be clear: I don't think it's a good idea to start the join from the lookup tables, or from their cartesian product. You can always left join them to fetch descriptions and other data. But the main query should be the one with all the combos of WellID and SandID... or if not all, at least the most. Things get difficult if none of the 3 tables (DESIGN, BOL and BIN) has all combos. In that case (and I'd say only in that case) you might as well start with the cartesian product of the two lookup tables. You could also do a UNION, but I doubt that would be more efficient.