PIG How do I combine 2 files based one not equal condition - apache-pig

I am trying to find the player that played on the most teams in one year. I have one file wit the the schema of PlayerID, yearID, teamID. I brought the file in twice to try to join where the PlayerID and yearID are equal but the teamID is not. How do I do in in PIG? Can I do a <> in a join statement? Do I need to group them and them compare? I know sql i could join based on the PlayerID and yearID being equal and the teamID not being equal but not sure how to do that in PIG.
I tried this but it is no the right syntax"
batters = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' USING PigStorage(',') AS
(id:chararray,yearid:int, teamid:chararray);
batters1 = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' USING PigStorage(',') AS ` (id:chararray,yearid:int, teamid:chararray);
batter_fltr = FILTER batters BY (yearid > 0) AND (teamid> ' ');
batter1_fltr = FILTER batters1 BY (yearid>0) AND (teamid> ' ');
multi_playr = JOIN batter_fltr BY (yearid,id), batter1_fltr BY(yearid,id) ,LEFT OUTER BY(teamid);

You wanted to find the player that played on the most teams in one year. Therefore, you should group by player & year, then you can count the number of teams per player per year. Finally, order the data by the count descending - the first result will be your answer. There's no need to load the data twice or do a join.
batters = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' USING PigStorage(',') AS
(id:chararray, yearid:int, teamid:chararray);
-- Apply filtering as needed here
teams_per_year = FOREACH (GROUP batters BY (id, yearid))
GENERATE
group.id AS id,
group.yearid AS yearid,
COUNT(batters.teamid) AS num_teams;
ordered_results = ORDER teams_per_year BY num_teams DESC;
DUMP ordered_results;
If you need the distinct number of teams, add a nested DISTINCT:
teams_per_year = FOREACH (GROUP batters BY (id, yearid)) {
dist_teams = DISTINCT batters.teamid;
GENERATE
group.id AS id,
group.yearid AS yearid,
COUNT(dist_teams) AS num_teams;
}

Related

PIG need to find max

I am new to Pig and working on a problem where I need to find the the player in this dataset with the max weight. Here is a sample of the data:
id, weight,id,year, triples
(bayja01,210,bayja01,2005,6)
(crawfca02,225,crawfca02,2005,15)
(damonjo01,205,damonjo01,2005,6)
(dejesda01,190,dejesda01,2005,6)
(eckstda01,170,eckstda01,2005,7)
and here is my pig script:
batters = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' using PigStorage(',');
realbatters = FILTER batters BY $1==2005;
triphitters = FILTER realbatters BY $9>5;
tripids = FOREACH triphitters GENERATE $0 AS id,$1 AS YEAR, $9 AS Trips;
names = LOAD 'hdfs:/user/maria_dev/pigtest/Master.csv'
using PigStorage(',');
weights = FOREACH names GENERATE $0 AS id, $16 AS weight;
get_ids = JOIN weights BY (id), tripids BY(id);
wts = FOREACH get_ids GENERATE MAX(get_ids.weight)as wgt;
DUMP wts;
the second to last line did not work of course. It told me I had to use an explicit cast. I have the filtering etc figured out - jsut can't figure out how to get the final answer.
The MAX function in Pig expects a Bag of values and will return the highest value in the bag. In order to create a Bag, you must first GROUP your data:
get_ids = JOIN weights BY id, tripids BY id;
-- Drop columns we no longer need and rename for ease
just_ids_weights = FOREACH get_ids GENERATE
weights::id AS id,
weights:: weight AS weight;
-- Group the data by id value
gp_by_ids = GROUP just_ids_weights BY id;
-- Find maximum weight by id
wts = FOREACH gp_by_ids GENERATE
group AS id,
MAX(just_ids_weights.weight) AS wgt;
If you wanted the maximum weight in all of the data, you can put all of your data in a single bag using GROUP ALL:
gp_all = GROUP just_ids_weights ALL;
was = FOREACH gp_all GENERATE
MAX(just_ids_weights.weight) AS wgt;

How can I select the highest counts attributes from different groups?

So I have a table with players data(name, team, etc..) and a table with goals (player who scored it, local team, etc...). What I need to do is, get from each team the highest scorer. So the result I'm getting is something like:
germany - whatever name - 1
germany - another dude - 5
spain - another name - 8
italy - one more name - 6
As you can see teams repeat, and I want them not to, just get the highest scorer of each team.
Right now I have this:
SELECT P.TEAM_PLAYER, G.PLAYER_GOAL, COUNT(*) AS "TOTAL GOALS" FROM PLAYER P, GOAL G
WHERE TO_CHAR(G.DATE_GOAL, 'YYYY')=2002
AND P.NAME = G.PLAYER_GOAL
GROUP BY G.PLAYER_GOAL, P.TEAM_PLAYER
HAVING COUNT(*)>=ALL (SELECT COUNT(*) FROM PLAYER P2 where P.TEAM_PLAYER = P2.TEAM_PLAYER GROUP BY P2.TEAM_PLAYER)
ORDER BY COUNT(*) DESC;
I am 100% sure I'm close, and I'm pretty sure I have to do this with the HAVING feature, but I can't get it right.
Without the HAVING it returns a list of all the players, their teams and how many goals have they scored, now I want to cut it down to only one player for each team.
PD: the teams in the table GOAL are local and visiting team, so I have to use the Player table to get the team. Also the Goal table is not a list of the players and how many goals they have scored, but a list of every individual goal and the player who scored it.
If I understand correctly you can try this query.
just get MAX of PLAYER_GOAL column,SUM(G.PLAYER_GOAL) instead of COUNT(*)
SELECT P.TEAM_PLAYER,
MAX(G.PLAYER_GOAL) "PLAYER_GOAL",
SUM(G.PLAYER_GOAL) AS "TOTAL GOALS"
FROM PLAYER P
INNER JOIN GOAL G
ON P.NAME = G.PLAYER_NAME
WHERE TO_CHAR(G.DATE_GOAL, 'YYYY')=2002
GROUP BY P.TEAM_PLAYER
ORDER BY SUM(G.PLAYER_GOAL) DESC;
NOTE :
Avoid using commas to join tables it's a old join style, You can use inner-join instead.
Edit
I don't know your table schema, but this query might be work.
use a subquery to contain your current result set. then get MAX function and GROUP BY
SELECT T.TEAM_PLAYER,
T.PLAYER_GOAL,
MAX(TOTAL_GOALS) AS "TOTAL GOALS"
FROM
(
SELECT P.TEAM_PLAYER, G.PLAYER_GOAL, COUNT(*) AS "TOTAL_GOALS" FROM
PLAYER P, GOAL G
WHERE TO_CHAR(G.DATE_GOAL, 'YYYY')=2002
AND P.NAME = G.PLAYER_GOAL
GROUP BY G.PLAYER_GOAL, P.TEAM_PLAYER
HAVING COUNT(*)>=ALL (SELECT COUNT(*) FROM PLAYER P2 where P.TEAM_PLAYER = P2.TEAM_PLAYER GROUP BY P2.TEAM_PLAYER)
) T
GROUP BY T.TEAM_PLAYER,
T.PLAYER_GOAL
ORDER BY MAX(TOTAL_GOALS) DESC

how to find the maximum and its associated values from a grouped relation in pig?

Below is my input
$ cat people.csv
Steve,US,M,football,6.5
Alex,US,M,football,5.5
Ted,UK,M,football,6.0
Mary,UK,F,baseball,5.5
Ellen,UK,F,football,5.0
I Need to group my data based on the Country.
people = LOAD 'people.csv' USING PigStorage(',') AS (name:chararray,country:chararray,gender:chararray, sport:chararray,height:float);
grouped = GROUP people BY country;
Now i have to find the maximum height of the person and his details from the grouped data.
So i tried the below
a = FOREACH grouped GENERATE group AS country, MAX(people.height) as height, people.name as name;
which gives the output as
(UK,6.0,{(Ellen),(Mary),(Ted)})
(US,6.5,{(Alex),(Steve)})
But i need my output should be
(UK,6.0,Ted)
(US,6.5,Steve)
Could someone please help me to achieve this ?
This code will help you .
As per this code, If there are two players with max height under the same country then you will get both those players details
records = LOAD '/home/user/footbal.txt' USING PigStorage(',') AS(name:chararray,country:chararray,gender:chararray,sport:chararray,height:double);
records_grp = GROUP records BY (country);
records_each = foreach records_grp generate group as temp_country, MAX(records.height) as max_height;
records_join = join records by (country,height), records_each by (temp_country,max_height);
records_output = foreach records_join generate country, max_height, name;
dump records_output;
OutPut :
(UK,6.0,Ted)
(US,6.5,Steve)

Reuse Pig Groups in nested FOREACH statement

I'm trying to group records together, calculate the average of SCORE1, filter out the lower half of the scores, and compute their average of SCORE2. Obviously I can calculate the summary statistics, and rejoin them to the original dataset, but I'd prefer to use the intermediate grouped values.
Example Input
ID,GROUPBY,SCORE1,SCORE2
1,A,58.8,67.3
2,A,85.2,76.3
3,B,49.1,90.7
4,B,78.3,99.8
Pig Script
records = load 'example.csv' Using PigStorage(',') AS (ID,GROUPBY,SCORE1,SCORE2);
grouped = group records by GROUPBY;
avgscore = foreach grouped GENERATE group AS GROUPBY, AVG(records.SCORE1) AS AVGSCORE;
joined = join grouped BY group, avgscore BY GROUPBY USING 'replicated';
results = foreach joined {
scores = foreach records generate SCORE1,SCORE2;
low = FILTER scores by SCORE1 < avgscore.AVGSCORE;
GENERATE GROUPBY, AVG(low.SCORE2);
};
dump results;
Desired Output
A 67.3
B 90.7
However this gives me a result of java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (A,72.0), 2nd :(B,63.7)
You are essentially grouping two different data structures in line 4.
You are joining grouped (which is grouped) with avgscore (which should be flattened).
You should be doing:
joined = join records BY GROUPBY, avgscore BY GROUPBY USING 'replicated';
edit:
I would rewrite like this to avoid confusion (since there will be two GROUPBYs)
records = load 'example.csv' Using PigStorage(',') AS (ID,GROUPBY,SCORE1,SCORE2);
grouped = group records by GROUPBY;
avgscore = foreach grouped GENERATE group AS GROUPBY, AVG(records.SCORE1) AS AVGSCORE;
joined = join records BY GROUPBY, avgscore BY GROUPBY USING 'replicated';
joined_reduced = foreach joined generate ID, records::GROUPBY as GROUPBY, AVGSCORE, SCORE1, SCORE2;
filter_joined = filter joined_reduced by (SCORE1 > AVGSCORE);
grouped2 = group filter_joined by GROUPBY;
result = foreach grouped2 generate flatten (group), AVG(filter_joined.SCORE2) as low_avg;
dump result;

Pig split and join

I have a requirement to propagate field values from one row to another given type of record
for example my raw input is
1,firefox,p
1,,q
1,,r
1,,s
2,ie,p
2,,s
3,chrome,p
3,,r
3,,s
4,netscape,p
the desired result
1,firefox,p
1,firefox,q
1,firefox,r
1,firefox,s
2,ie,p
2,ie,s
3,chrome,p
3,chrome,r
3,chrome,s
4,netscape,p
I tried
A = LOAD 'file1.txt' using PigStorage(',') AS (id:int,browser:chararray,type:chararray);
SPLIT A INTO B IF (type =='p'), C IF (type!='p' );
joined = JOIN B BY id FULL, C BY id;
joinedFields = FOREACH joined GENERATE B::id, B::type, B::browser, C::id, C::type;
dump joinedFields;
the result I got was
(,,,1,p )
(,,,1,q)
(,,,1,r)
(,,,1,s)
(2,p,ie,2,s)
(3,p,chrome,3,r)
(3,p,chrome,3,s)
(4,p,netscape,,)
Any help is appreciated, Thanks.
PIG is not exactly SQL, it is built with data flows, MapReduce and groups in mind (joins are also there). You can get the result using a GROUP BY, FILTER nested in the FOREACH and FLATTEN.
inpt = LOAD 'file1.txt' using PigStorage(',') AS (id:int,browser:chararray,type:chararray);
grp = GROUP inpt BY id;
Result = FOREACH grp {
P = FILTER inpt BY type == 'p'; -- leave the record that contain p for the id
PL = LIMIT P 1; -- make sure there is just one
GENERATE FLATTEN(inpt.(id,type)), FLATTEN(PL.browser); -- convert bags produced by group by back to rows
};