Hive Pig anti join - apache-pig

I am trying to merge two datasets: Events and mortality by patientid. Events have all the patientid including the patientid in mortality, but didn't have the timestamp and label (always equal 1) in mortality. I want to extract all records in events but not in mortality.
-- load events file
events = LOAD '../sample_test/sample_events.csv' USING PigStorage(',') AS (patientid:int, eventid:chararray, eventdesc:chararray, timestamp:chararray, value:float);
-- select required columns from events
events = FOREACH events GENERATE patientid, eventid, ToDate(timestamp, 'yyyy-MM-dd') AS etimestamp, value;
-- load mortality file
mortality = LOAD '../sample_test/sample_mortality.csv' USING PigStorage(',') as (patientid:int, timestamp:chararray, label:int);
mortality = FOREACH mortality GENERATE patientid, ToDate(timestamp, 'yyyy-MM-dd') AS mtimestamp, label;
--To display the relation, use the dump command e.g. DUMP mortality;
-- ***************************************************************************
-- Compute the index dates for dead and alive patients
-- ***************************************************************************
eventswithmort = JOIN events BY patientid, mortality BY patientid;-- perform join of events and mortality by patientid;
eventswithmort = FOREACH eventswithmort GENERATE events.patientid, events.eventid, events.etimestamp, events.value, mortality.mtimestamp, mortality.label;
aliveevents = FILTER eventswithmort BY mortality::label != 1;
My question is for last row of code should I use label != 1 or label is null? It seems that I always got an empty dataset.

eventswithmort = FOREACH eventswithmort GENERATE events.patientid, events.eventid, events.etimestamp, events.value, mortality.mtimestamp, mortality.label;
not totally sure this is the syntax for generating the data, instead of . should you not use ::
if you want rows in events but NOT in mortality then the join should be left outer, pig default is an inner join.
eventswithmort = JOIN events BY patientid LEFT OUTER, mortality BY
patientid;
then for each eventswithmort you can FILTER by mortality::label IS NULL;

Related

PIG How do I combine 2 files based one not equal condition

I am trying to find the player that played on the most teams in one year. I have one file wit the the schema of PlayerID, yearID, teamID. I brought the file in twice to try to join where the PlayerID and yearID are equal but the teamID is not. How do I do in in PIG? Can I do a <> in a join statement? Do I need to group them and them compare? I know sql i could join based on the PlayerID and yearID being equal and the teamID not being equal but not sure how to do that in PIG.
I tried this but it is no the right syntax"
batters = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' USING PigStorage(',') AS
(id:chararray,yearid:int, teamid:chararray);
batters1 = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' USING PigStorage(',') AS ` (id:chararray,yearid:int, teamid:chararray);
batter_fltr = FILTER batters BY (yearid > 0) AND (teamid> ' ');
batter1_fltr = FILTER batters1 BY (yearid>0) AND (teamid> ' ');
multi_playr = JOIN batter_fltr BY (yearid,id), batter1_fltr BY(yearid,id) ,LEFT OUTER BY(teamid);
You wanted to find the player that played on the most teams in one year. Therefore, you should group by player & year, then you can count the number of teams per player per year. Finally, order the data by the count descending - the first result will be your answer. There's no need to load the data twice or do a join.
batters = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' USING PigStorage(',') AS
(id:chararray, yearid:int, teamid:chararray);
-- Apply filtering as needed here
teams_per_year = FOREACH (GROUP batters BY (id, yearid))
GENERATE
group.id AS id,
group.yearid AS yearid,
COUNT(batters.teamid) AS num_teams;
ordered_results = ORDER teams_per_year BY num_teams DESC;
DUMP ordered_results;
If you need the distinct number of teams, add a nested DISTINCT:
teams_per_year = FOREACH (GROUP batters BY (id, yearid)) {
dist_teams = DISTINCT batters.teamid;
GENERATE
group.id AS id,
group.yearid AS yearid,
COUNT(dist_teams) AS num_teams;
}

Multiple distinct columns (Pig)

I have a list of flights, with the following attributes:
day: day they were flown
flight_number: their flight number
origin_airport: origin city airport
dest_airport: destination city airport
carrier_code: the airline carrier's code (eg. delta: DL)
I am trying to find the number of flights operated by each carrier. In doing so, for every day, I need to find distinct flight_number, origin_airport, dest_airport and carrier_code since one of the conditions is that "An airplane may be scheduled to fly from A to B and then from B to C with the same flight number on the same day. However, we consider these two journeys as two separate flights."
This is what I have that is not running:
desiredattributes = FOREACH jnd GENERATE day, flight_number, origin_airport_id, dest_airport_id, carrier_code;
distinctflights = FOREACH (GROUP desiredattributes BY day)
{
a = carriers.(carrier_code, flight_number, origin_airport_id, dest_airport_id);
b = DISTINCT a;
};
DUMP distinctflights;
Any help or guidance is appreciated! I am new to pig
You should list the fields from desiredattributes.
desiredattributes = FOREACH jnd GENERATE day, flight_number, origin_airport_id, dest_airport_id, carrier_code;
dayflights = GROUP desiredattributes BY day;
alldayflights = FOREACH dayflights GENERATE FLATTEN(group) as day,desiredattributes.flight_number,desiredattributes.origin_airport_id,desiredattributes.dest_airport_id, desiredattributes.carrier_code;
distinctflights = DISTINCT alldayflights;
DUMP distinctflights;

Take MIN EFF_DT and MAX_CANC_dt from data in PIG

Schema :
TYP|ID|RECORD|SEX|EFF_DT|CANC_DT
DMF|1234567|98765432|M|2011-08-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31
Suppose i have multiple records like this. I only want to display records that have minimum eff_dt and maximum cancel date.
I only want to display just This 1 record
DMF|1234567|98765432|M|2011-04-30|9999-12-31
Thank you
Get min eff_dt and max canc_dt and use it to filter the relation.Assuming you have a relation A
B = GROUP A ALL;
X = FOREACH B GENERATE MIN(A.EFF_DT);
Y = FOREACH B GENERATE MAX(A.CANC_DT);
C = FILTER A BY ((EFF_DT == X.$0) AND (CANC_DT == Y.$0));
D = DISTINCT C;
DUMP D;
Let's say you have this data (sample here):
DMF|1234567|98765432|M|2011-08-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31
DMX|1234567|98765432|M|2011-12-30|9999-12-31
DMX|1234567|98765432|M|2011-04-30|9999-12-31
DMX|1234567|98765432|M|2011-04-01|9999-12-31
Perform these steps:
-- 1. Read data, if you have not
A = load 'data.txt' using PigStorage('|') as (typ: chararray, id:chararray, record:chararray, sex:chararray, eff_dt:datetime, canc_dt:datetime);
-- 2. Group data by the attribute you like to, in this case it is TYP
grouped = group A by typ;
-- 3. Now, generate MIN/MAX for each group. Also, only keep relevant fields
min_max = foreach grouped generate group, MIN(A.eff_dt) as min_eff_dt, MAX(A.canc_dt) as max_canc_dt;
--
dump min_max;
(DMF,2011-04-30T00:00:00.000Z,9999-12-31T00:00:00.000Z)
(DMX,2011-04-01T00:00:00.000Z,9999-12-31T00:00:00.000Z)
If you need to, change datetime to charrary.
Note: there are different ways of doing this, what I am showing, except the load step, it produces the desired result in 2 steps: GROUP and FOREACH.

Reuse Pig Groups in nested FOREACH statement

I'm trying to group records together, calculate the average of SCORE1, filter out the lower half of the scores, and compute their average of SCORE2. Obviously I can calculate the summary statistics, and rejoin them to the original dataset, but I'd prefer to use the intermediate grouped values.
Example Input
ID,GROUPBY,SCORE1,SCORE2
1,A,58.8,67.3
2,A,85.2,76.3
3,B,49.1,90.7
4,B,78.3,99.8
Pig Script
records = load 'example.csv' Using PigStorage(',') AS (ID,GROUPBY,SCORE1,SCORE2);
grouped = group records by GROUPBY;
avgscore = foreach grouped GENERATE group AS GROUPBY, AVG(records.SCORE1) AS AVGSCORE;
joined = join grouped BY group, avgscore BY GROUPBY USING 'replicated';
results = foreach joined {
scores = foreach records generate SCORE1,SCORE2;
low = FILTER scores by SCORE1 < avgscore.AVGSCORE;
GENERATE GROUPBY, AVG(low.SCORE2);
};
dump results;
Desired Output
A 67.3
B 90.7
However this gives me a result of java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (A,72.0), 2nd :(B,63.7)
You are essentially grouping two different data structures in line 4.
You are joining grouped (which is grouped) with avgscore (which should be flattened).
You should be doing:
joined = join records BY GROUPBY, avgscore BY GROUPBY USING 'replicated';
edit:
I would rewrite like this to avoid confusion (since there will be two GROUPBYs)
records = load 'example.csv' Using PigStorage(',') AS (ID,GROUPBY,SCORE1,SCORE2);
grouped = group records by GROUPBY;
avgscore = foreach grouped GENERATE group AS GROUPBY, AVG(records.SCORE1) AS AVGSCORE;
joined = join records BY GROUPBY, avgscore BY GROUPBY USING 'replicated';
joined_reduced = foreach joined generate ID, records::GROUPBY as GROUPBY, AVGSCORE, SCORE1, SCORE2;
filter_joined = filter joined_reduced by (SCORE1 > AVGSCORE);
grouped2 = group filter_joined by GROUPBY;
result = foreach grouped2 generate flatten (group), AVG(filter_joined.SCORE2) as low_avg;
dump result;

Pig split and join

I have a requirement to propagate field values from one row to another given type of record
for example my raw input is
1,firefox,p
1,,q
1,,r
1,,s
2,ie,p
2,,s
3,chrome,p
3,,r
3,,s
4,netscape,p
the desired result
1,firefox,p
1,firefox,q
1,firefox,r
1,firefox,s
2,ie,p
2,ie,s
3,chrome,p
3,chrome,r
3,chrome,s
4,netscape,p
I tried
A = LOAD 'file1.txt' using PigStorage(',') AS (id:int,browser:chararray,type:chararray);
SPLIT A INTO B IF (type =='p'), C IF (type!='p' );
joined = JOIN B BY id FULL, C BY id;
joinedFields = FOREACH joined GENERATE B::id, B::type, B::browser, C::id, C::type;
dump joinedFields;
the result I got was
(,,,1,p )
(,,,1,q)
(,,,1,r)
(,,,1,s)
(2,p,ie,2,s)
(3,p,chrome,3,r)
(3,p,chrome,3,s)
(4,p,netscape,,)
Any help is appreciated, Thanks.
PIG is not exactly SQL, it is built with data flows, MapReduce and groups in mind (joins are also there). You can get the result using a GROUP BY, FILTER nested in the FOREACH and FLATTEN.
inpt = LOAD 'file1.txt' using PigStorage(',') AS (id:int,browser:chararray,type:chararray);
grp = GROUP inpt BY id;
Result = FOREACH grp {
P = FILTER inpt BY type == 'p'; -- leave the record that contain p for the id
PL = LIMIT P 1; -- make sure there is just one
GENERATE FLATTEN(inpt.(id,type)), FLATTEN(PL.browser); -- convert bags produced by group by back to rows
};