Rule Based Join in pig script - apache-pig

I have a rules_table with the following data:
Ruleid,leftColumn,rightColumn
1,c1,c1
2,c2,c3
3,c4,c4
rules_table contains the column names of left_table and right_table to give a hint about the join keys.
Left_table
schema: c1,c2,c3,c4,c5,c6,c7,c8,c9
Right_table
schema: c1,c2,c3,c4,c10,c12,c13,c14
I need to join the left_table and right_table according to the rules_table, applying the rules one by one (the processing must be sequential, as rule_id is the rule priority). After each rule I need to get a matched_set and an unmatched_set. The unmatched_set data has to flow into the next rule, and so on. The final output will have 2 separate datasets:
matched_set,rule_id
unmatched_set
Right now I am using a Unix script to read the rules table in Hive and call the Pig script repeatedly to generate the matched_set and unmatched_set. But it is taking too much time, because Pig's initial setup and the store step take too long.
Can anybody please suggest an optimal way to do this in a single Pig script execution?

You can't do it directly, but you can generate a single Pig script that will look something like this:
LeftTable = load ...;
RightTable = load ...;
joined1 = join LeftTable by c1 full, RightTable by c1;
SPLIT joined1 INTO Matched_rule1_raw IF LeftTable::c1 is not null and RightTable::c1 is not null, UnMatched_rule1 IF LeftTable::c1 is null or RightTable::c1 is null;
Matched_rule1 = foreach Matched_rule1_raw generate 1 as rule_id, ..;
At the end you can UNION all the matched sets together.
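For example, for the rules_table above (rule 1: c1 = c1, rule 2: c2 = c3, rule 3: c4 = c4) the generated script could follow the pattern below. This is only an untested sketch: the load paths and the columns projected into the matched sets are placeholders, and it assumes the unmatched rows are split back into pure left rows and pure right rows before feeding the next rule.

LeftTable  = LOAD 'left_table'  AS (c1,c2,c3,c4,c5,c6,c7,c8,c9);
RightTable = LOAD 'right_table' AS (c1,c2,c3,c4,c10,c12,c13,c14);

-- rule 1: c1 = c1
joined1 = JOIN LeftTable BY c1 FULL OUTER, RightTable BY c1;
SPLIT joined1 INTO
    Matched1_raw   IF LeftTable::c1 IS NOT NULL AND RightTable::c1 IS NOT NULL,
    UnMatched1_raw IF LeftTable::c1 IS NULL OR RightTable::c1 IS NULL;
Matched1 = FOREACH Matched1_raw GENERATE 1 AS rule_id, LeftTable::c1 AS left_c1, RightTable::c1 AS right_c1;

-- recover the unmatched left and right rows so they can flow into rule 2
Left2  = FOREACH (FILTER UnMatched1_raw BY RightTable::c1 IS NULL)
         GENERATE LeftTable::c1 AS c1, LeftTable::c2 AS c2, LeftTable::c3 AS c3, LeftTable::c4 AS c4;
Right2 = FOREACH (FILTER UnMatched1_raw BY LeftTable::c1 IS NULL)
         GENERATE RightTable::c1 AS c1, RightTable::c3 AS c3, RightTable::c4 AS c4;

-- rule 2: c2 = c3
joined2 = JOIN Left2 BY c2 FULL OUTER, Right2 BY c3;
SPLIT joined2 INTO
    Matched2_raw   IF Left2::c2 IS NOT NULL AND Right2::c3 IS NOT NULL,
    UnMatched2_raw IF Left2::c2 IS NULL OR Right2::c3 IS NULL;
Matched2 = FOREACH Matched2_raw GENERATE 2 AS rule_id, Left2::c1 AS left_c1, Right2::c1 AS right_c1;

-- ... rule 3 follows the same pattern on UnMatched2_raw ...

matched_set = UNION Matched1, Matched2;   -- add Matched3 here as well
STORE matched_set INTO 'matched_set';
-- the unmatched leftovers of the last rule form the final unmatched_set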

Related

Should I use an SQL full outer join for this?

Consider the following tables:
Table A:
DOC_NUM
DOC_TYPE
RELATED_DOC_NUM
NEXT_STATUS
...
Table B:
DOC_NUM
DOC_TYPE
RELATED_DOC_NUM
NEXT_STATUS
...
The DOC_TYPE and NEXT_STATUS columns have different meanings between the two tables, although a NEXT_STATUS = 999 means "closed" in both. Also, under certain conditions, there will be a record in each table, with a reference to a corresponding entry in the other table (i.e. the RELATED_DOC_NUM columns).
I am trying to create a query that will get data from both tables that meet the following conditions:
A.RELATED_DOC_NUM = B.DOC_NUM
A.DOC_TYPE = "ST"
B.DOC_TYPE = "OT"
A.NEXT_STATUS < 999 OR B.NEXT_STATUS < 999
A.DOC_TYPE = "ST" represents a transfer order to transfer inventory from one plant to another. B.DOC_TYPE = "OT" represents a corresponding receipt of the transferred inventory at the receiving plant.
We want to get records from either table where there is an ST/OT pair where either or both entries are not closed (i.e. NEXT_STATUS < 999).
I am assuming that I need to use a FULL OUTER join to accomplish this. If this is the wrong assumption, please let me know what I should be doing instead.
UPDATE (11/30/2021):
I believe that @Caius Jard is correct in that this does not need to be an outer join. There should always be an ST/OT pair.
With that I have written my query as follows:
SELECT <columns>
FROM A LEFT JOIN B
ON
A.RELATED_DOC_NUM = B.DOC_NUM
WHERE
A.DOC_TYPE IN ('ST') AND
B.DOC_TYPE IN ('OT') AND
(A.NEXT_STATUS < 999 OR B.NEXT_STATUS < 999)
Does this make sense?
UPDATE 2 (11/30/2021):
The reality is that these are DB2 database tables being used by the JD Edwards ERP application. The only way I know of to see the table definitions is by using the web site http://www.jdetables.com/, entering the table ID and hitting return to run the search. It comes back with a ton of information about the table and its columns.
Table A is really F4211 and table B is really F4311.
Right now, I've simplified the query to keep it simple and keep variables to a minimum. This is what I have currently:
SELECT CAST(F4211.SDDOCO AS VARCHAR(8)) AS SO_NUM,
F4211.SDRORN AS RELATED_PO,
F4211.SDDCTO AS SO_DOC_TYPE,
F4211.SDNXTR AS SO_NEXT_STATUS,
CAST(F4311.PDDOCO AS VARCHAR(8)) AS PO_NUM,
F4311.PDRORN AS RELATED_SO,
F4311.PDDCTO AS PO_DOC_TYPE,
F4311.PDNXTR AS PO_NEXT_STATUS
FROM PROD2DTA.F4211 AS F4211
INNER JOIN PROD2DTA.F4311 AS F4311
ON F4211.SDRORN = CAST(F4311.PDDOCO AS VARCHAR(8))
WHERE F4211.SDDCTO IN ( 'ST' )
AND F4311.PDDCTO IN ( 'OT' )
The other part of the story is that I'm using a reporting package that allows you to define "virtual" views of the data. Virtual views allow the report developer to specify the SQL to use. This is the application where I am using the SQL. When I set up the SQL, there is a validation step that must be performed. It will return a limited set of results if the SQL is validated.
When I enter the query above and validate it, it says that there are no results, which makes no sense. I'm guessing the data casting is causing the issue, but not sure.
UPDATE 3 (11/30/2021):
One more twist to the story. The related doc number is not only defined as a string value, but it contains leading zeros. This is true in both tables. The main doc number (in both tables) is defined as a numeric value and therefore has no leading zeros. I have no idea why those who developed JDE would have done this, but that is what is there.
So, there are matching records between the two tables that meet the criteria, but I think I'm getting no results because when I convert the numeric to a string, it does not match, because one value is, say "12345", while the other is "00012345".
Can I pad the numeric -> string value with zeros before doing the equals check?
UPDATE 4 (12/2/2021):
Was able to finally get the query to work by converting the numeric doc num to a left zero padded string.
SELECT <columns>
FROM PROD2DTA.F4211 AS F4211
INNER JOIN PROD2DTA.F4311 AS F4311
ON F4211.SDRORN = RIGHT(CONCAT('00000000', CAST(F4311.PDDOCO AS VARCHAR(8))), 8)
WHERE F4211.SDDCTO IN ( 'ST' )
AND F4311.PDDCTO IN ( 'OT' )
AND ( F4211.SDNXTR < 999
OR F4311.PDNXTR < 999 )
You should write your query as follows:
SELECT <columns>
FROM A INNER JOIN B
ON
A.RELATED_DOC_NUM = B.DOC_NUM
WHERE
A.DOC_TYPE IN ('ST') AND
B.DOC_TYPE IN ('OT') AND
(A.NEXT_STATUS < 999 OR B.NEXT_STATUS < 999)
LEFT JOIN is a type of OUTER join (LEFT JOIN is typically a contraction of LEFT OUTER JOIN). OUTER means "one side might have NULLs in every column because there was no match". Most critically, the code as posted in the question (a LEFT JOIN, but with a WHERE clause on a column from the right table) runs as an INNER join, because any NULLs introduced by the LEFT OUTER process are then quashed by the WHERE clause.
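To make that concrete, compare these two forms (using the question's table and column names, with the column list trimmed):

-- behaves like an INNER join: an unmatched A row gets NULL in B.DOC_TYPE,
-- and NULL IN ('OT') is not true, so the WHERE clause discards the row
SELECT A.DOC_NUM, B.DOC_NUM
FROM A LEFT JOIN B ON A.RELATED_DOC_NUM = B.DOC_NUM
WHERE B.DOC_TYPE IN ('OT');

-- keeps the OUTER behaviour: the right-side predicate is part of the join,
-- so unmatched A rows survive with NULLs in B's columns
SELECT A.DOC_NUM, B.DOC_NUM
FROM A LEFT JOIN B ON A.RELATED_DOC_NUM = B.DOC_NUM
                  AND B.DOC_TYPE IN ('OT');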
See Update 4 for details of how I resolved the "data conversion or mapping" error.

MS Access - Return Record Below Current Record

I am very new to Access, and what I am trying to do seems like it should be very simple, but I can't seem to get it.
I am a structural engineer by trade and am making a database to design buildings.
My Diaphragm Analysis Table includes the fields "Floor_Name", "Story_Number", "Wall_Left", and "Wall_Right". I want to write a new query that looks in another query called "Shear_Wall_incremental_Deflection" and pulls information from it based on input from Diaphragm Analysis. I want to take the value in "Wall_Right" (SW01), find the corresponding value in "Shear_Wall_incremental_Deflection", and report the "Elastic_Deflection" corresponding to the "Story_Below" instead of the "Story_Number" in the Diaphragm Analysis Table. In the case where "Story_Number" = 1, "Story_Below" will be 0 and I want the output to be 0.
Same procedure for "Wall_Left", but I'm just taking it one step at a time.
It seems that I need to use a "DLookup" in the expression builder with TWO criteria, one that Wall_Right = Shear_Wall and one that Story_Number = Story_Below, but when I try this I just get errors.
"Shear_Wall_incremental_Deflection" includes shearwalls for all three stories, i.e. it starts at SW01 and goes through SWW for Story Number 3 and then starts again at SW01 for Story Number 2, and so on until Story Number 1. I only show a part of the query results in the image, but rest assured, there are "Elastic_Deflection" values for story numbers below 3.
Here is my attempt in the Expression Builder:
Right_Defl_in: IIf(IsNull([Diaphragm_Analysis]![Wall_Right]),0,DLookUp("[Elastic_Deflection_in]","[Shear_Wall_incremental_Deflection]","[Shear_Wall_incremental_Deflection]![Story_Below]=" & [Diaphragm_Analysis]![Story_Number]))
I know my join from Diaphragm_Analysis "Wall_Left" and "Wall_Right" must include all records from Diaphragm_Analysis and only those from "Shear_Wall_incremental_Deflection"![Shear_Walls] where the joined fields are equal, but that's about all I know.
Please let me know if I need to include more information or send out the database file.
Thanks for your help.
Diaphragm Analysis (Input Table)
Shear_Wall_incremental_Deflection (Partial Image of Query)
I think what you are missing is that you can and should join to Diaphragm_Analysis twice: the first time to get the Story_Below value, and the second to use it to get the corresponding Elastic_Deflection value.
To handle the special case where Story_Below is zero, I would write a separate query (only requires one join this time) and 'OR together' the two queries using the UNION set operation (note the following SQL is untested):
SELECT swid.Floor_Name,
swid.Story_Number,
swid.Wall_Left,
da2.Elastic_Deflection AS Story_Below_Elastic_Deflection
FROM ( Shear_Wall_incremental_Deflection swid
INNER JOIN Diaphragm_Analysis da1
ON da1.ShearWall = swid.Wall_Left )
INNER JOIN Diaphragm_Analysis da2
ON da2.ShearWall = swid.Wall_Left
AND da2.Story_Number = da1.Story_Below
UNION
SELECT swid.Floor_Name,
swid.Story_Number,
swid.Wall_Left,
0 AS Story_Below_Elastic_Deflection
FROM Shear_Wall_incremental_Deflection swid
INNER JOIN Diaphragm_Analysis da1
ON da1.ShearWall = swid.Wall_Left
WHERE da1.Story_Below = 0;
I've assumed that there is no data where Story_Number is zero.

In Pig, how to find the union between two datasets or relations?

I have a text file which I process using some rules and come up with two separate relations
dump A;
A=
({(18),(17),(16),(15)})
({(4),(1)})
({(7),(6)})
({(9),(2)})
({(13),(11)})
dump B;
B =
({(4),(3)})
I want to join these based on the values they hold, i.e. (4),(1) of A and (4),(3) of B should join, and their union should be output as (4),(1),(3).
The output should be like this:
({(18),(17),(16),(15)})
({(4),(1),(3)})
({(7),(6)})
({(9),(2)})
({(13),(11)})
Thanks in advance
There is a bag join in datafu: http://datafu.incubator.apache.org/docs/datafu/guide/bag-operations.html
Once joined, you can apply DISTINCT.
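For example, once you have a relation where each row carries the combined bag, the deduplication can be done in a nested FOREACH. Here 'joined' and 'combined' are just hypothetical names for that post-join relation and its bag field:

C = FOREACH joined {
        -- DISTINCT inside the nested FOREACH removes the duplicate (4)
        deduped = DISTINCT combined;
        GENERATE deduped AS merged_values;
};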

How to avoid the same joining for two fields?

I admit the title of this question is not clear. If someone could reword it after reading my question, that will be great.
Anyway, I have a pair of fields which are IDs of words. Now I want to replace them with their text. Right now I am doing two joins and FOREACHs, like the following:
WordIDs = LOAD 'wordID.txt' AS (wordID1:long, wordID2:long);
WordTexts = LOAD 'wordText.txt' AS (wordID:long, wordText:chararray);
Join1 = JOIN WordIDs BY wordID1, WordTexts BY wordID;
Replaced1 = FOREACH Join1 GENERATE WordTexts::wordText AS wordText1, WordIDs::wordID2 AS wordID2;
Join2 = JOIN Replaced1 BY wordID2, WordTexts BY wordID;
Replaced2 = FOREACH Join2 GENERATE Replaced1::wordText1 AS wordText1, WordTexts::wordText AS wordText2;
Is there any way of doing this with fewer statements (like one join instead of two)?
I think your current code will generate 2 separate MapReduce jobs. To avoid that, use a replicated join: it will not change the number of JOIN statements, but both joins become map-side joins, so only one MapReduce job is needed. The code should look like this (I did not run it yet):
WordIDs = LOAD 'wordID.txt' AS (wordID1:long, wordID2:long);
WordTexts = LOAD 'wordText.txt' AS (wordID:long, wordText:chararray);
Join1 = JOIN WordIDs BY wordID1, WordTexts BY wordID USING 'replicated';
-- project after the first join so the second join's field references stay unambiguous
Replaced1 = FOREACH Join1 GENERATE WordTexts::wordText AS wordText1, WordIDs::wordID2 AS wordID2;
Join2 = JOIN Replaced1 BY wordID2, WordTexts BY wordID USING 'replicated';
Replaced2 = FOREACH Join2 GENERATE Replaced1::wordText1 AS wordText1, WordTexts::wordText AS wordText2;

Apache Pig load entire relationship into UDF

I have a Pig script that involves 2 relations, let's say A and B. A is a small relation, and B is a big one. My UDF should load all of A into memory on each machine and then use it while processing B. Currently I do it like this:
A = foreach smallRelation Generate ...
B = foreach largeRelation Generate propertyOfB;
store A into 'templocation';
C = foreach B Generate CustomUdf(propertyOfB);
I then have every machine load from 'templocation' to get A. This works, but I have two problems with it.
1. My understanding is that I should be using the HDFS cache somehow, but I'm not sure how to load a relation directly into the HDFS cache.
2. When I reload the file in my UDF, I have to write logic to parse the output of A that was written to the file, when I'd rather be working directly with bags and tuples (is there a built-in Pig Java function to parse Strings back into Bag/Tuple form?).
Does anyone know how it should be done?
Here's a trick that will work for you.
You do a GROUP ALL on A first, which "bags" all the data in A into one field. Then artificially add a common field to both A and B and join them. This way, for each tuple in the enhanced B, you will have the full data of A available for your UDF to use.
It's like this:
(say originally in A, you have fields fa1, fa2, fa3, in B you have fb1, fb2)
-- add an artificial join key with value 'xx'
B_aux = FOREACH B GENERATE 'xx' AS join_key, fb1, fb2;
A_all = GROUP A ALL;
A_aux = FOREACH A_all GENERATE 'xx' AS join_key, $1 AS a_bag; -- $1 is the bag holding every tuple of A
A_B_JOINED = JOIN B_aux BY join_key, A_aux BY join_key USING 'replicated';
C = FOREACH A_B_JOINED GENERATE CustomUdf(fb1, fb2, a_bag);
Since this is a replicated join, it is also a map-side-only join.