How to avoid the same joining for two fields? - apache-pig

I admit the title of this question is not clear. If someone could reword it after reading my question, that will be great.
Anyway I have a pair of fields which are IDs of words. Now I want to replace them by their text. Right now I am doing two joins and foreach like the followings:
WordIDs = LOAD wordID.txt AS (wordID1:long, wordID2:long);
WordTexts = LOAD wordText.txt AS (wordID:long, wordText:chararray);
Join1 = JOIN WordIDs BY wordID1, WordTexts BY wordID;
Replaced1 = FOREACH Join1 GENERATE WordTexts::wordText As wordText1, WordIDs::wordID2;
Join2 = JOIN Replaced1 BY wordID2, WordTexts BY wordID;
Replaced2 = FOREACH Join2 GENERATE Replaced1::wordText1 As wordText1, WordTexts::wordText::wordText2;
Is there any way of doing this with less number of statements (like one join instead of two joins)?

I think your current code will generate 2 separate map reduce jobs, to avoid it use replicated join, it will not change the number of join statements, but will use just one map side join, only one map reduce job. Code should look like that (I did not run it yet):
WordIDs = LOAD wordID.txt AS (wordID1:long, wordID2:long);
WordTexts = LOAD wordText.txt AS (wordID:long, wordText:chararray);
Join1 = JOIN WordIDs BY wordID1, WordTexts BY wordID USING 'replicated';
Join2 = JOIN Join1 BY wordID2, WordTexts BY wordID USING 'replicated';
Replaced = FOREACH Join2 GENERATE Join1::WordTexts::wordText As wordText1, Join2::wordTexts::wordText as wordText2;

Related

SQL Joins are not working properly

I am new to JOINS and testing my query, but it's just not working for me...
The situation:
The database has got the following columns:
links (contains unique data)
cl_link (contains the relation between links & cats)
cats (cat. descriptions
images (contains multiple images of one link)
cfvalues (contains the values of the multiple custom fiels
customfields (contains the multiple customfields)
I am using the following query, but the Joins are not working for me. Because I only get one image while sometimes there are multiple. And I only get one customfield instead of multiple and I get none cfvalues.
I guess something is wrong with the JOINS, but I am not sure. Can somebody help me out here?
The SQL
SELECT DISTINCT
rqypj_mt_links.link_name,
rqypj_mt_links.link_desc,
rqypj_mt_links.address,
rqypj_mt_links.city,
rqypj_mt_links.state,
rqypj_mt_links.country,
rqypj_mt_links.postcode,
rqypj_mt_links.telephone,
rqypj_mt_links.fax,
rqypj_mt_links.email,
rqypj_mt_links.website,
rqypj_mt_links.price,
rqypj_mt_links.lat,
rqypj_mt_links.lng,
rqypj_mt_links.zoom,
rqypj_mt_cats.cat_name,
rqypj_mt_images.filename,
rqypj_mt_cfvalues.value,
rqypj_mt_customfields.caption
FROM rqypj_mt_links
LEFT JOIN rqypj_mt_cl
ON rqypj_mt_links.link_id = rqypj_mt_cl.link_id
LEFT JOIN rqypj_mt_cats
ON rqypj_mt_cl.cat_id = rqypj_mt_cats.cat_id
LEFT JOIN rqypj_mt_images
ON rqypj_mt_links.link_id = rqypj_mt_images.link_id
LEFT JOIN rqypj_mt_cfvalues
ON rqypj_mt_links.link_id = rqypj_mt_cfvalues.link_id
LEFT JOIN rqypj_mt_customfields
ON rqypj_mt_customfields.cf_id = rqypj_mt_customfields.cf_id LIMIT 100
Thanks in advance!
Jelte
your last condition doesn't look right:
on rqypj_mt_customfields.cf_id = rqypj_mt_customfields.cf_id
translates to 1=1
Shouldn't it be:
on rqypj_mt_customfields.cf_id = rqypj_mt_cfvalues.cf_id
Probably because you don't have an order by and are using limit.
Change it to
order by rqypj_mt_links.link_id, rqypj_mt_cl.cat_id
limit 100
and then your multiple pictures for the same link should be together.
Also please consider use of alias to make your code easier to read:
SELECT DISTINCT
links.link_name,
links.link_desc,
links.address,
links.city,
links.state,
links.country,
links.postcode,
links.telephone,
links.fax,
links.email,
links.website,
links.price,
links.lat,
links.lng,
links.zoom,
cats.cat_name,
images.filename,
cfvalues.value,
--custom.caption
FROM rqypj_mt_links links
LEFT JOIN rqypj_mt_cl cl ON links.link_id = cl.link_id
LEFT JOIN rqypj_mt_cats cats ON cl.cat_id = cats.cat_id
LEFT JOIN rqypj_mt_images images ON links.link_id = images.link_id
LEFT JOIN rqypj_mt_cfvalues cfvalues ON links.link_id = cfvalues.link_id
--LEFT JOIN rqypj_mt_customfields custom ON custom.cf_id = custom.cf_id
ORDER BY links.link_id, cats.cat_id
LIMIT 100

Changing old join to new join

I am new to SQL so any help is greatly appreciated. I have a query that seems to be working that has old style joins, and I need to change it to new style joins. the current query is like:
SELECT
STAR.V_DISASTER_DIMENSIONS .DISASTER_NUMBER,
STAR.PA_PROJECT_DIMENSIONS .PW_NUMBER,
STAR.PA_PROJECT_SITE_DIMENSIONS.SITE_NUMBER,
STAR.PA_PROJECT_FACTS .PROJECT_AMOUNT,
STAR.PA_MITIGATION_DIMENSIONS .MITIGATION_ACTIVITY_STATUS
FROM
STAR.V_DISASTER_DIMENSIONS,
STAR.PA_PROJECT_DIMENSIONS,
STAR.PA_PROJECT_SITE_DIMENSIONS,
STAR.PA_MITIGATION_DIMENSIONS,
STAR.PA_PROJECT_FACTS,
STAR.PA_PROJECT_SITE_FACTS
WHERE
( STAR.PA_PROJECT_DIMENSIONS.PA_PROJECT_ID = STAR.PA_PROJECT_FACTS.PA_PROJECT_ID )
AND
( STAR.PA_PROJECT_FACTS.DISASTER_ID = STAR.V_DISASTER_DIMENSIONS.DISASTER_ID )
AND
( STAR.PA_MITIGATION_DIMENSIONS.PA_MITIGATION_ID = STAR.PA_PROJECT_FACTS.PA_PROJECT_ID )
AND
( STAR.PA_PROJECT_SITE_FACTS.PA_PROJECT_ID = STAR.PA_MITIGATION_DIMENSIONS.PA_MITIGATION_ID )
AND
( STAR.PA_PROJECT_SITE_FACTS.DISASTER_ID = STAR.V_DISASTER_DIMENSIONS.DISASTER_ID )
AND
( STAR.PA_PROJECT_SITE_FACTS.PA_PROJECT_ID = STAR.PA_PROJECT_DIMENSIONS.PA_PROJECT_ID )
AND
( STAR.PA_PROJECT_SITE_FACTS.PA_PROJECT_SITE_ID = STAR.PA_PROJECT_SITE_DIMENSIONS.PA_PROJECT_SITE_ID )
My attempt to convert is below. I don't know where to put the extra conditions because they are not 1 to 1 with tables.
FROM
STAR.V_DISASTER_DIMENSIONS
JOIN STAR.PA_PROJECT_SITE_FACTS ON STAR.PA_PROJECT_SITE_FACTS.DISASTER_ID = STAR.V_DISASTER_DIMENSIONS.DISASTER_ID
JOIN STAR.PA_PROJECT_DIMENSIONS ON STAR.PA_PROJECT_SITE_FACTS.PA_PROJECT_ID = STAR.PA_PROJECT_DIMENSIONS.PA_PROJECT_ID
JOIN STAR.PA_PROJECT_SITE_DIMENSIONS ON STAR.PA_PROJECT_SITE_FACTS.PA_PROJECT_SITE_ID = STAR.PA_PROJECT_SITE_DIMENSIONS.PA_PROJECT_SITE_ID
JOIN STAR.PA_MITIGATION_DIMENSIONS ON STAR.PA_PROJECT_SITE_FACTS.PA_PROJECT_ID = STAR.PA_MITIGATION_DIMENSIONS.PA_MITIGATION_ID
JOIN STAR.PA_PROJECT_FACTS ON (
STAR.PA_PROJECT_FACTS .DISASTER_ID = STAR.V_DISASTER_DIMENSIONS.DISASTER_ID AND
STAR.PA_MITIGATION_DIMENSIONS.PA_MITIGATION_ID = STAR.PA_PROJECT_FACTS .PA_PROJECT_ID AND
STAR.PA_PROJECT_DIMENSIONS .PA_PROJECT_ID = STAR.PA_PROJECT_FACTS .PA_PROJECT_ID
)
Change , to INNER JOINs with ON condition:
SELECT
STAR.V_DISASTER_DIMENSIONS.DISASTER_NUMBER,
STAR.PA_PROJECT_DIMENSIONS.PW_NUMBER,
STAR.PA_PROJECT_SITE_DIMENSIONS.SITE_NUMBER,
STAR.PA_PROJECT_FACTS.PROJECT_AMOUNT,
STAR.PA_MITIGATION_DIMENSIONS.MITIGATION_ACTIVITY_STATUS
FROM
STAR.PA_PROJECT_DIMENSIONS PD
INNER JOIN STAR.PA_PROJECT_FACTS PF ON PD.PA_PROJECT_ID=PF.PA_PROJECT_ID
INNER JOIN STAR.V_DISASTER_DIMENSIONS DD ON DD.DISASTER_ID=PF.DISASTER_ID
INNER JOIN STAR.PA_MITIGATION_DIMENSIONS MD ON MD.PA_MITIGATION_ID=PF.PA_PROJECT_ID
INNER JOIN STAR.PA_PROJECT_SITE_FACTS PSF ON PSF.PA_PROJECT_ID=MD.PA_MITIGATION_ID
AND PSF.DISASTER_ID=DD.DISASTER_ID
AND PSF.PA_PROJECT_ID=PD.PA_PROJECT_ID
INNER JOIN STAR.PA_PROJECT_SITE_DIMENSIONS PSD ON PSD.PA_PROJECT_SITE_ID=PSF.PA_PROJECT_SITE_ID
Select * from
a,b
where a.z = b.y
would be written as
Select * from
a
INNER JOIN
b
ON a.z = b.y
It is easy. Just start with the facts table and join related tables on foreign key = key.
First of all you should use table aliases to get the query more readable. Also use some lowercase letters, too.
Then just write the table names (or the aliases) on paper and draw a line for each condition from one table to the other. Then pick one table to start with, e.g. pa_project_site_dimensions which is only linked to one table.
SELECT
dd.disaster_number,
pd.pw_number,
psd.site_number,
psf.project_amount,
md.mitigation_activity_status
FROM star.pa_project_site_dimensions psd
JOIN star.pa_project_site_facts psf ON psf.pa_project_site_id = psd.pa_project_site_id
JOIN star.v_disaster_dimensions dd ON dd.disaster_id = psf.disaster_id
JOIN star.pa_mitigation_dimensions md ON md.pa_mitigation_id = psf.pa_project_id
JOIN star.pa_project_dimensions pd ON pd.pa_project_id = psf.pa_project_id
JOIN star.pa_project_facts pf ON pf.disaster_id = dd.disaster_id
AND pf.pa_project_id = md.pa_mitigation_id
AND pf.pa_project_id = pd.pa_project_id
;
However, this is a strange query. First of all there is no limiting condition, you simply join all records, rather than retrieving data for, say, one particular project.
Moreover, you deal with several dimensions. Obviously a project has facts (pa_project_facts) and dimensions (pa_project_dimensions). With 5 facts and 3 dimensions you'd get 15 rows with all their combinations. Then there are also project sites it seems (maybe a table pa_project_sites we don't see in the query). Either that project site has facts on its own (pa_project_site_facts) that you also combine with all rows, or a project site is linked to a project fact via pa_project_site_facts, but then pa_project_facts wouldn't have to be joined by pa_project_id only, but also by some fact ID.
Also this looks strange: md.pa_mitigation_id = psf.pa_project_id. Is a mitigation the same as a project?
So after all have a look at all columns that need to be joined on. Think about how the tables are related and if you are not building combinations that make no sense.

Rule Based Join in pig script

I have a rules_table data
Ruleid,leftColumn,rightColumn
1,c1,c1
2,c2,c3
3,c4,c4
rules_table contains the column names of left_table and right_table to give hint about the join keys.
Left_table
Schema : c1,c2,c3,c4,c5,c6,c7,c8,c9
Right_table
schema : c1,c2,c3,c4,c10,c12,c13,c14
i need to join the left_table and right_table according to the rules_table applying rules one by one(it should be sequential as the rule_id is the rule priority) . After each rule i need to get a matched_set and unmatched_set. Unmatched_Set data has to flow into next rule and go on like that. Final output will have 2 seperate datasets
matched_set,rule_id
unmatched_set
Right now I am using unix_script to read the rules table in hive and call the pig-script repeatedly to generate the matched_set and unmatched_set. But it is taking too much time as the pig initial set_up and store is taking too much time.
Can any body please suggest an optimal solution to do this in pig_script with single execution ?
You can't do it directly, but you can generate single pig script that will look somthing like that:
LeftTable = load ...;
RightTable = load ...;
joined1 = join LeftTable by c1 full, RightTable by c2;
SPLIT joined1 INTO Matched_rule1_raw IF LeftTable::c1 is not null and RightTable::c2 is not null, UnMatched_rule1 IF LeftTable::c1 is null or RightTable::c2 is null;
Matched_rule1 = foreach Matched_rule1_raw generate 1 as rule_id, ..;
At the end you can do union matched.

How to combine multiple rows in a relation into a tuple to perform calculations in PIG Latin

I have the following code:
pitcher_res = UNION pitcher_total_salary,pitcher_total_appearances;
dump pitcher_res;
The output is:
(8965000.0)
(22.0)
However, I want to calculate 8965000.0/22.0, so I need something like:
res = FOREACH some_relation GENERATE $0/$1;
Therefore I need to have some_relation = (8965000.0,22.0). How can I perform such a conversion?
You can do a CROSS.
Computes the cross product of two or more relations.
https://pig.apache.org/docs/r0.11.1/basic.html#cross
Ideally you would have a unique identifier for each entry in your source relations. Then you can perform a join based on this identifier which results in the kind of relation you want to have.
Salary relation
salaries: pitcher_id, pitcher_total_salary
Total appearances relation
appearances: pitcher_id, pitcher_total_appearances
Join
pitcher_relation = join salaries by pitcher_id, appearances by pitcher_id;
Calculation
res = FOREACH pitcher_relation GENERATE pitcher_total_salary/pitcher_total_apperances;
The below pig latin scripts will surely come to your rescue:
load the salary file
salary = load '/home/abhishek/Work/pigInput/pitcher_total_salary' as (salary:long);
load the appearances file
appearances = load '/home/abhishek/Work/pigInput/pitcher_total_appearances' as (appearances:long);
Now, use the CROSS command
C = cross salary, appearances
Then, the final output
res = foreach C generate salary/appearances;
Output
dump res
407500
Hope this helps

Apache Pig load entire relationship into UDF

I have a pig script that pertains to 2 Pig relations, lets say A and B. A is a small relationship, and B is a big one. My UDF should load all of A into memory on each machine and then use it while processing B. Currently I do it like this.
A = foreach smallRelation Generate ...
B = foreach largeRelation Generate propertyOfB;
store A into 'templocation';
C = foreach B Generate CustomUdf(propertyOfB);
I then have every machine load from 'templocation' to get A.This works, but I have two problems with it.
My understanding is I should be using the HDFS cache somehow, but I'm not sure how to load a relationship directly into the HDFS cache.
When I reload the file in my UDF I got to write logic to parse the output from A that was outputted to file when I'd rather be directly using bags and tuples (is there a built in Pig java function to parse Strings back into Bag/Tuple form?).
Does anyone know how it should be done?
Here's a trick that will work for you.
You do a GROUP ALL on A first which "bags" all data in A into one field. Then artificially add a common field on both A and B and join them. This way, foreach tuple in the enhanced B, you will have the full data of A for your UDF to use.
It's like this:
(say originally in A, you have fields fa1, fa2, fa3, in B you have fb1, fb2)
-- add an artificial join key with value 'xx'
B_aux = FOREACH B GENERATE 'xx' AS join_key, fb1, fb2;
A_all = GROUP A ALL;
A_aux = FOREACH A GENERATE 'xx' AS join_key, $1;
A_B_JOINED = JOIN B_aux BY join_key, A_aux BY join_key USING 'replicated';
C = FOREACH A_B_JOINED GENERATE CustomUdf(fb1, fb2, A_all);
since this is replicated join, it's also only map-side join.