Multiple JOIN in Pig Latin - hive

In HQL, we have
JOIN weather ON (weather.Year = flight.Year AND weather.Month = flight.Month and weather.Day=flight.DayofMonth)
In Pig Latin, is it possible to fit this into one query, or do I have to do the joins separately and combine them?

It's possible; see here:
You can also join on multiple keys. In all cases you must have the
same number of keys, and they must be of the same or compatible types
Example :
weather = load '/weather/files/' as (Year,Month,Day,Fieldx);
flight = load '/flight/files/' as (Year,Month,Day,Fieldy);
jnd = join weather by (Year,Month,Day), flight by (Year,Month,Day);
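To make the semantics of the multi-key join concrete, here is a rough Python model of what a join on (Year, Month, Day) computes. The sample rows and the helper name join_on_keys are made up for illustration, and this is only a sketch of the result, not of how Pig executes the join.

```python
from collections import defaultdict


def join_on_keys(left, right, left_keys, right_keys):
    """Inner-join two lists of dicts on a composite key (hypothetical helper)."""
    if len(left_keys) != len(right_keys):
        raise ValueError("both sides must use the same number of keys")
    # Index the right side by its composite key.
    index = defaultdict(list)
    for row in right:
        index[tuple(row[k] for k in right_keys)].append(row)
    # Pair every left row with all right rows that share the same key.
    joined = []
    for lrow in left:
        key = tuple(lrow[k] for k in left_keys)
        for rrow in index.get(key, []):
            joined.append((lrow, rrow))
    return joined


# Made-up sample data; the column names follow the question.
weather = [
    {"Year": 2008, "Month": 1, "Day": 3, "Fieldx": "snow"},
    {"Year": 2008, "Month": 1, "Day": 4, "Fieldx": "clear"},
]
flight = [
    {"Year": 2008, "Month": 1, "DayofMonth": 3, "Fieldy": "WN335"},
    {"Year": 2008, "Month": 2, "DayofMonth": 3, "Fieldy": "WN448"},
]

jnd = join_on_keys(
    weather, flight,
    ["Year", "Month", "Day"],
    ["Year", "Month", "DayofMonth"],
)
for w, f in jnd:
    print(w["Fieldx"], f["Fieldy"])
```

Only the rows that agree on all three key parts are paired, which is why the key lists on both sides must have the same length.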

Related

Why is "Acute HIV Infection" not classified as a sexually transmitted disease in SNOMED-CT?

I'm trying to compile a list of sexually transmitted diseases using SNOMED-CT (I happen to be using OHDSI/OMOP concept tables in Databricks as my source for SNOMED-CT).
Querying the SNOMED-CT data from OHDSI Athena in OHDSI/OMOP using the query below, I get the results shown below. Notably missing from the ancestors is any indication that HIV is a sexually transmitted disease/infection.
Is there a way to use SNOMED-CT to create a somewhat comprehensive list of sexually transmitted diseases? Is there a better way to get to a list of codes (SNOMED or other, e.g. ICD) for sexually transmitted diseases?
select distinct
parent.concept_id,
parent.concept_code,
parent.vocabulary_id,
parent.concept_name,
an.max_levels_of_separation,
an.min_levels_of_separation
from
concept con
join concept_ancestor an on 1=1
and an.descendant_concept_id = con.concept_id
join concept parent on 1=1
and parent.concept_id = an.ancestor_concept_id
where 1=1
and con.vocabulary_id = 'SNOMED'
and lower(con.concept_name) = 'acute hiv infection'
and con.domain_id = 'Condition'
order by parent.concept_name
;
This finding seems to be confirmed using other SNOMED-CT browsers, for example:
https://browser.ihtsdotools.org/?perspective=full&conceptId1=62479008&edition=MAIN/SNOMEDCT-US/2022-03-01&release=&languages=en

Rule Based Join in pig script

I have the following rules_table data:
Ruleid,leftColumn,rightColumn
1,c1,c1
2,c2,c3
3,c4,c4
rules_table contains the column names of left_table and right_table, as a hint about the join keys.
Left_table
Schema : c1,c2,c3,c4,c5,c6,c7,c8,c9
Right_table
schema : c1,c2,c3,c4,c10,c12,c13,c14
I need to join left_table and right_table according to rules_table, applying the rules one by one (sequentially, since the rule ID is the rule priority). After each rule I need a matched_set and an unmatched_set. The unmatched_set data has to flow into the next rule, and so on. The final output will have two separate datasets:
matched_set,rule_id
unmatched_set
Right now I am using a Unix script to read the rules table in Hive and call the Pig script repeatedly to generate the matched_set and unmatched_set. This takes too long, because Pig's initial setup and the store step are slow on every run.
Can anybody suggest an optimal way to do this in a Pig script with a single execution?
You can't do it directly, but you can generate a single Pig script that looks something like this:
LeftTable = load ...;
RightTable = load ...;
joined1 = join LeftTable by c1 full, RightTable by c1;
SPLIT joined1 INTO
    Matched_rule1_raw IF (LeftTable::c1 is not null and RightTable::c1 is not null),
    UnMatched_rule1 IF (LeftTable::c1 is null or RightTable::c1 is null);
Matched_rule1 = foreach Matched_rule1_raw generate 1 as rule_id, ..;
At the end you can UNION the matched relations from all the rules.
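The cascade described in the question (each rule only sees what earlier rules left unmatched) can be sketched in plain Python as below. This is a model of the intended data flow, not Pig code: the function name apply_rules and the sample rows are invented, and it assumes that unmatched rows from both tables are carried forward and that null keys never match.

```python
def apply_rules(left_rows, right_rows, rules):
    """Apply (rule_id, left_col, right_col) rules in rule_id order.

    Returns (matched, left_unmatched, right_unmatched), where matched is a
    list of (rule_id, left_row, right_row) tuples.
    """
    matched = []
    left_remaining = list(left_rows)
    right_remaining = list(right_rows)
    for rule_id, left_col, right_col in sorted(rules):
        # Index the still-unmatched right rows by this rule's join column.
        index = {}
        for pos, rrow in enumerate(right_remaining):
            if rrow[right_col] is not None:
                index.setdefault(rrow[right_col], []).append(pos)
        used_right = set()
        still_left = []
        for lrow in left_remaining:
            positions = index.get(lrow[left_col], [])
            if positions:
                for pos in positions:
                    matched.append((rule_id, lrow, right_remaining[pos]))
                    used_right.add(pos)
            else:
                still_left.append(lrow)  # flows into the next rule
        left_remaining = still_left
        right_remaining = [
            r for pos, r in enumerate(right_remaining) if pos not in used_right
        ]
    return matched, left_remaining, right_remaining


# Rules from the question; the table rows are made up.
rules = [(1, "c1", "c1"), (2, "c2", "c3"), (3, "c4", "c4")]
left_table = [
    {"c1": "a", "c2": "x", "c4": "p"},
    {"c1": "b", "c2": "y", "c4": "q"},
    {"c1": "c", "c2": "z", "c4": "r"},
]
right_table = [
    {"c1": "a", "c3": "m", "c4": "n"},
    {"c1": "k", "c3": "y", "c4": "o"},
    {"c1": "j", "c3": "w", "c4": "s"},
]

matched_set, left_unmatched, right_unmatched = apply_rules(
    left_table, right_table, rules
)
print([rule_id for rule_id, _, _ in matched_set])
print(left_unmatched, right_unmatched)
```

A generated Pig script would follow the same shape: one JOIN plus SPLIT per rule, each reading the unmatched output of the previous one.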

How to retrieve data with the highest version number?

I use the SQL below to join DRAP and DRAT.
SELECT * INTO CORRESPONDING FIELDS OF WA_DOC_LOG
FROM DRAP
INNER JOIN DRAT ON DRAP~DOKNR = DRAT~DOKNR
AND DRAP~DOKAR = DRAT~DOKAR
WHERE DRAP~DOKNR IN S_DOKNR
AND DRAP~DOKAR IN S_DOKAR
AND DRAP~DOKST IN S_DOKST
AND DRAP~DATUM IN S_DATUM.
But when I display the data, I only want to show the record with the highest version number (DRAP~DOKVR). What are the possible ways to eliminate the records with lower versions?
I would probably off-load this to the application server:
SORT documents BY dokar doknr dokvr DESCENDING. " careful with doktl!
DELETE ADJACENT DUPLICATES FROM documents COMPARING dokar doknr.
Using SQL to achieve this might be possible, but I'd be very careful since this might turn out to be rather expensive unless done right.
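For readers who do not know ABAP, the sort-then-deduplicate idea can be modeled in Python as follows. The field names mirror the DRAP columns from the question, but the sample records and the function name keep_highest_version are made up, and versions are assumed to be zero-padded strings so that string comparison orders them correctly.

```python
from itertools import groupby


def keep_highest_version(documents):
    """Keep one record per (dokar, doknr): the one with the highest dokvr."""
    # Sort by version descending first, then stably by document key, so the
    # highest version is the first record within each key group.
    by_version = sorted(documents, key=lambda d: d["dokvr"], reverse=True)
    ordered = sorted(by_version, key=lambda d: (d["dokar"], d["doknr"]))
    # Equivalent of deleting adjacent duplicates: keep the first of each group.
    return [
        next(group)
        for _, group in groupby(ordered, key=lambda d: (d["dokar"], d["doknr"]))
    ]


# Made-up sample records.
documents = [
    {"dokar": "DRW", "doknr": "1000", "dokvr": "00"},
    {"dokar": "DRW", "doknr": "1000", "dokvr": "02"},
    {"dokar": "DRW", "doknr": "1000", "dokvr": "01"},
    {"dokar": "DRW", "doknr": "2000", "dokvr": "00"},
]

latest = keep_highest_version(documents)
print(latest)
```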
Try this SELECT statement:
SELECT *
INTO CORRESPONDING FIELDS OF TABLE WA_DOC_LOG
FROM DRAP AS d
INNER JOIN DRAT AS t
ON d~DOKNR = t~DOKNR
AND d~DOKAR = t~DOKAR
WHERE d~dokvr = ( SELECT MAX( dokvr ) FROM drap WHERE doknr = d~doknr AND dokar = d~dokar ).
This is not very efficient, but it works. You should check the performance on your own system.

How to combine multiple rows in a relation into a tuple to perform calculations in PIG Latin

I have the following code:
pitcher_res = UNION pitcher_total_salary,pitcher_total_appearances;
dump pitcher_res;
The output is:
(8965000.0)
(22.0)
However, I want to calculate 8965000.0/22.0, so I need something like:
res = FOREACH some_relation GENERATE $0/$1;
Therefore I need to have some_relation = (8965000.0,22.0). How can I perform such a conversion?
You can do a CROSS.
Computes the cross product of two or more relations.
https://pig.apache.org/docs/r0.11.1/basic.html#cross
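As a quick illustration of why CROSS helps here: with exactly one row on each side, the cross product is a single row holding both values. The Python below only models that result, using the two values from the question; it is not Pig code.

```python
from itertools import product

# Each relation holds exactly one single-field tuple, as in the question.
pitcher_total_salary = [(8965000.0,)]
pitcher_total_appearances = [(22.0,)]

# Cross product: every salary row paired with every appearances row.
some_relation = [
    s + a for s, a in product(pitcher_total_salary, pitcher_total_appearances)
]
res = [row[0] / row[1] for row in some_relation]

print(some_relation)
print(res)
```

If either relation had more than one row, the cross product would contain every combination, so this only gives a single result when both sides are single-row aggregates.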
Ideally you would have a unique identifier for each entry in your source relations. Then you can perform a join based on this identifier which results in the kind of relation you want to have.
Salary relation
salaries: pitcher_id, pitcher_total_salary
Total appearances relation
appearances: pitcher_id, pitcher_total_appearances
Join
pitcher_relation = join salaries by pitcher_id, appearances by pitcher_id;
Calculation
res = FOREACH pitcher_relation GENERATE pitcher_total_salary/pitcher_total_appearances;
The Pig Latin script below should do it:
load the salary file
salary = load '/home/abhishek/Work/pigInput/pitcher_total_salary' as (salary:long);
load the appearances file
appearances = load '/home/abhishek/Work/pigInput/pitcher_total_appearances' as (appearances:long);
Now, use the CROSS command
C = cross salary, appearances;
Then, the final output
res = foreach C generate salary/appearances;
Output
dump res;
(407500)
Hope this helps

How to avoid the same joining for two fields?

I admit the title of this question is not clear. If someone could reword it after reading my question, that would be great.
Anyway, I have a pair of fields which are IDs of words, and I want to replace them with their text. Right now I am doing two joins and two FOREACH statements, like the following:
WordIDs = LOAD 'wordID.txt' AS (wordID1:long, wordID2:long);
WordTexts = LOAD 'wordText.txt' AS (wordID:long, wordText:chararray);
Join1 = JOIN WordIDs BY wordID1, WordTexts BY wordID;
Replaced1 = FOREACH Join1 GENERATE WordTexts::wordText AS wordText1, WordIDs::wordID2;
Join2 = JOIN Replaced1 BY wordID2, WordTexts BY wordID;
Replaced2 = FOREACH Join2 GENERATE Replaced1::wordText1 AS wordText1, WordTexts::wordText AS wordText2;
Is there any way of doing this with fewer statements (for example, one join instead of two)?
I think your current code will generate two separate MapReduce jobs. To avoid that, use a replicated join: it will not change the number of JOIN statements, but both joins run map-side, so there is only one MapReduce job. This requires WordTexts to be small enough to fit in memory. The code should look like this (I have not run it yet):
WordIDs = LOAD 'wordID.txt' AS (wordID1:long, wordID2:long);
WordTexts = LOAD 'wordText.txt' AS (wordID:long, wordText:chararray);
Join1 = JOIN WordIDs BY wordID1, WordTexts BY wordID USING 'replicated';
Join2 = JOIN Join1 BY wordID2, WordTexts BY wordID USING 'replicated';
Replaced = FOREACH Join2 GENERATE Join1::WordTexts::wordText AS wordText1, WordTexts::wordText AS wordText2;
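The idea behind a replicated (map-side) join is that the small relation is held in memory as a lookup table, so both ID columns can be resolved in a single pass over the large relation. A minimal Python model of that idea, with made-up sample data, might look like this; it illustrates the concept rather than what Pig does internally.

```python
# Small relation, held in memory as a lookup table (made-up sample data).
word_texts = {1: "apple", 2: "banana", 3: "cherry"}

# Large relation: pairs of word IDs.
word_ids = [(1, 2), (3, 1), (2, 4)]

# One pass over the large relation resolves both IDs. Pairs with an unknown
# ID are dropped, which matches the behavior of two inner joins.
replaced = [
    (word_texts[id1], word_texts[id2])
    for id1, id2 in word_ids
    if id1 in word_texts and id2 in word_texts
]

print(replaced)
```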