Multiple distinct columns (Pig) - apache-pig

I have a list of flights, with the following attributes:
day: day they were flown
flight_number: their flight number
origin_airport: origin city airport
dest_airport: destination city airport
carrier_code: the airline carrier's code (e.g. Delta: DL)
I am trying to find the number of flights operated by each carrier. In doing so, for every day, I need to find distinct flight_number, origin_airport, dest_airport and carrier_code since one of the conditions is that "An airplane may be scheduled to fly from A to B and then from B to C with the same flight number on the same day. However, we consider these two journeys as two separate flights."
This is what I have that is not running:
desiredattributes = FOREACH jnd GENERATE day, flight_number, origin_airport_id, dest_airport_id, carrier_code;
distinctflights = FOREACH (GROUP desiredattributes BY day)
{
a = carriers.(carrier_code, flight_number, origin_airport_id, dest_airport_id);
b = DISTINCT a;
};
DUMP distinctflights;
Any help or guidance is appreciated! I am new to Pig.

You should list the fields from desiredattributes (the relation you grouped), not from carriers, which is not defined in your script.
desiredattributes = FOREACH jnd GENERATE day, flight_number, origin_airport_id, dest_airport_id, carrier_code;
dayflights = GROUP desiredattributes BY day;
alldayflights = FOREACH dayflights GENERATE FLATTEN(group) as day,desiredattributes.flight_number,desiredattributes.origin_airport_id,desiredattributes.dest_airport_id, desiredattributes.carrier_code;
distinctflights = DISTINCT alldayflights;
DUMP distinctflights;
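Since day is already one of the projected columns, a DISTINCT over the whole relation removes duplicates per day, so the GROUP BY day step may not be needed at all. The following is a minimal sketch, assuming jnd has the fields shown in the question; the aliases by_carrier and flights_per_carrier are made up for illustration.

```pig
-- Keep only the columns that identify one flight on one day.
desiredattributes = FOREACH jnd GENERATE day, flight_number, origin_airport_id, dest_airport_id, carrier_code;

-- DISTINCT compares whole tuples, so A->B and B->C with the same
-- flight number on the same day remain two separate rows.
distinctflights = DISTINCT desiredattributes;

-- Count the remaining flights per carrier (hypothetical aliases).
by_carrier = GROUP distinctflights BY carrier_code;
flights_per_carrier = FOREACH by_carrier GENERATE group AS carrier_code, COUNT(distinctflights) AS num_flights;
DUMP flights_per_carrier;
```

Note that COUNT skips tuples whose first field is null, so this assumes day is always populated.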

Related

Pig: How do I combine 2 files based on a not-equal condition

I am trying to find the player that played on the most teams in one year. I have one file with the schema PlayerID, yearID, teamID. I brought the file in twice to try to join where the PlayerID and yearID are equal but the teamID is not. How do I do this in Pig? Can I use <> in a join statement? Do I need to group them and then compare? I know that in SQL I could join on the PlayerID and yearID being equal and the teamID not being equal, but I am not sure how to do that in Pig.
I tried this, but it is not the right syntax:
batters = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' USING PigStorage(',') AS
(id:chararray,yearid:int, teamid:chararray);
batters1 = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' USING PigStorage(',') AS (id:chararray,yearid:int, teamid:chararray);
batter_fltr = FILTER batters BY (yearid > 0) AND (teamid> ' ');
batter1_fltr = FILTER batters1 BY (yearid>0) AND (teamid> ' ');
multi_playr = JOIN batter_fltr BY (yearid,id), batter1_fltr BY(yearid,id) ,LEFT OUTER BY(teamid);
You wanted to find the player that played on the most teams in one year. Therefore, you should group by player & year, then you can count the number of teams per player per year. Finally, order the data by the count descending - the first result will be your answer. There's no need to load the data twice or do a join.
batters = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' USING PigStorage(',') AS
(id:chararray, yearid:int, teamid:chararray);
-- Apply filtering as needed here
teams_per_year = FOREACH (GROUP batters BY (id, yearid))
GENERATE
group.id AS id,
group.yearid AS yearid,
COUNT(batters.teamid) AS num_teams;
ordered_results = ORDER teams_per_year BY num_teams DESC;
DUMP ordered_results;
If you need the distinct number of teams, add a nested DISTINCT:
teams_per_year = FOREACH (GROUP batters BY (id, yearid)) {
dist_teams = DISTINCT batters.teamid;
GENERATE
group.id AS id,
group.yearid AS yearid,
COUNT(dist_teams) AS num_teams;
}

How to find the maximum and its associated values from a grouped relation in Pig?

Below is my input
$ cat people.csv
Steve,US,M,football,6.5
Alex,US,M,football,5.5
Ted,UK,M,football,6.0
Mary,UK,F,baseball,5.5
Ellen,UK,F,football,5.0
I need to group my data by country.
people = LOAD 'people.csv' USING PigStorage(',') AS (name:chararray,country:chararray,gender:chararray, sport:chararray,height:float);
grouped = GROUP people BY country;
Now I have to find the tallest person in each country, together with that person's details, from the grouped data.
So I tried the following:
a = FOREACH grouped GENERATE group AS country, MAX(people.height) as height, people.name as name;
which gives the output as
(UK,6.0,{(Ellen),(Mary),(Ted)})
(US,6.5,{(Alex),(Steve)})
But I need the output to be
(UK,6.0,Ted)
(US,6.5,Steve)
Could someone please help me achieve this?
The code below should help.
Note that if two people in the same country share the maximum height, you will get the details of both.
records = LOAD '/home/user/footbal.txt' USING PigStorage(',') AS(name:chararray,country:chararray,gender:chararray,sport:chararray,height:double);
records_grp = GROUP records BY (country);
records_each = foreach records_grp generate group as temp_country, MAX(records.height) as max_height;
records_join = join records by (country,height), records_each by (temp_country,max_height);
records_output = foreach records_join generate country, max_height, name;
dump records_output;
Output:
(UK,6.0,Ted)
(US,6.5,Steve)
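An alternative that avoids the join is a nested ORDER and LIMIT inside the FOREACH. This is only a sketch built on the question's people.csv schema; the aliases sorted, tallest and tallest_per_country are made up for illustration. Unlike the join above, it keeps exactly one row per country, so ties on height are broken arbitrarily.

```pig
people = LOAD 'people.csv' USING PigStorage(',') AS (name:chararray, country:chararray, gender:chararray, sport:chararray, height:float);
grouped = GROUP people BY country;

tallest_per_country = FOREACH grouped {
    -- Sort this country's bag by height, tallest first.
    sorted = ORDER people BY height DESC;
    -- Keep only the top tuple of the sorted bag.
    tallest = LIMIT sorted 1;
    -- Flatten the one-tuple bag into (country, height, name).
    GENERATE group AS country, FLATTEN(tallest.(height, name));
};
DUMP tallest_per_country;
```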

How to check COUNT of filtered elements in PIG

I have the following data set in which I need to perform some steps based on the Car's company name.
(23,Nissan,12.43)
(23,Nissan Car,16.43)
(23,Honda Car,13.23)
(23,Toyota Car,17.0)
(24,Honda,45.0)
(24,Toyota,12.43)
(24,Nissan Car,12.43)
A = LOAD 'data.txt' AS (code:int, name:chararray, rating:double);
G = GROUP A by (code, REGEX_EXTRACT(name,'(?i)(^.+?\\b)\\s*(Car)*$',1));
DUMP G;
I am grouping cars by code and base company name, so that all the 'Nissan' and 'Nissan Car' records fall into one group, and similarly for the others.
/* Grouped data based on code and company's first name*/
((23,Nissan),{(23,Nissan,12.43),(23,Nissan Car,16.43)})
((23,Honda),{(23,Honda Car,13.23)})
((23,Toyota),{(23,Toyota Car,17.0)})
((24,Nissan),{(24,Nissan Car,12.43)})
((24,Honda),{(24,Honda,45.0)})
((24,Toyota),{(24,Toyota,12.43)})
Now, I want to filter out the groups based on whether they contain a tuple corresponding to group's name. If yes, take that tuple from that group and ignore others and if no such tuple exists then take all the tuples for that group.
The Output should be:
((23,Nissan),{(23,Nissan,12.43)}) // Since this group contains a row with group's name i.e. Nissan
((23,Honda),{(23,Honda Car,13.23)})
((23,Toyota),{(23,Toyota Car,17.0)})
((24,Nissan),{(24,Nissan Car,12.43)})
((24,Honda),{(24,Honda,45.0)})
((24,Toyota),{(24,Toyota,12.43)})
R = FOREACH G { OW = FILTER A BY name==group.$1; IF COUNT(OW) > 0}
Could anybody please help me with this? After filtering by the group's name, how can I find the count of the filtered tuples and get the required data?
OK. Let's take the records below as your input.
23,Nissan,12.43
23,Nissan Car,16.43
23,Honda Car,13.23
23,Toyota Car,17.0
24,Honda,45.0
24,Toyota,12.43
25,Toyato Car,23.8
25,Toyato Car,17.2
24,Nissan Car,12.43
For the above input, let's say the intermediate output is:
((23,Honda),{(23,Honda,Honda Car,13.23)})
((23,Nissan),{(23,Nissan,Nissan,12.43),(23,Nissan,Nissan Car,16.43)})
((23,Toyota),{(23,Toyota,Toyota Car,17.0)})
((24,Honda),{(24,Honda,Honda,45.0)})
((24,Nissan),{(24,Nissan,Nissan Car,12.43)})
((24,Toyota),{(24,Toyota,Toyota,12.43)})
((25,Toyato),{(25,Toyato,Toyato Car,23.8),(25,Toyato,Toyato Car,17.2)})
From that intermediate output, suppose you are looking for the output below, as per your requirement.
(23,Honda,1)
(23,Nissan,1)
(23,Toyota,1)
(24,Honda,1)
(24,Nissan,1)
(24,Toyota,1)
(25,Toyato,2)
Below is the code:
nissan_load = LOAD '/user/cloudera/inputfiles/nissan.txt' USING PigStorage(',') as(code:int,name:chararray,rating:double);
nissan_each = FOREACH nissan_load GENERATE code,TRIM(REGEX_EXTRACT(name,'(?i)(^.+?\\b)\\s*(Car)*$',1)) as brand_name,name,rating;
nissan_grp = GROUP nissan_each by (code,brand_name);
nissan_final_each =FOREACH nissan_grp {
A = FOREACH nissan_each GENERATE (brand_name == TRIM(name) ? 1 :0) as cnt;
B = (int)SUM(A);
C = FOREACH nissan_each GENERATE (brand_name != TRIM(name) ?1: 0) as extra_cnt;
D = SUM(C);
generate flatten(group) as(code,brand_name), (SUM(A.cnt) != 0 ? B : D) as final_cnt;
};
dump nissan_final_each;
Try this code with different inputs as well.
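If the goal is the filtered bags from the question rather than counts, one possible sketch is to flag exact-name rows before grouping and then pick the bag with a bincond. It assumes the question's data.txt and default tab delimiter; the aliases cars, flagged, exact_rows and kept are made up for illustration, and it relies on both branches of the bincond being bags with the same schema.

```pig
A = LOAD 'data.txt' AS (code:int, name:chararray, rating:double);

-- Derive the base brand name with the regex from the question.
cars = FOREACH A GENERATE code, name, rating,
       TRIM(REGEX_EXTRACT(name, '(?i)(^.+?\\b)\\s*(Car)*$', 1)) AS brand;

-- Flag rows whose name is exactly the brand (e.g. 'Nissan', not 'Nissan Car').
flagged = FOREACH cars GENERATE code, brand, name, rating,
          (name == brand ? 1 : 0) AS exact;

G = GROUP flagged BY (code, brand);

R = FOREACH G {
    -- Rows in this group whose name matches the group's brand.
    exact_rows = FILTER flagged BY exact == 1;
    -- Keep only the exact rows if any exist, otherwise keep the whole group.
    GENERATE group, (IsEmpty(exact_rows) ? flagged : exact_rows) AS kept;
};
DUMP R;
```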

Pig: Summing Fields

I have some census data in which each line has a number denoting the county and fields for the number of people in a certain age range (e.g. 5 and under, 5 to 17, etc.). After some initial processing in which I removed the unneeded columns, I grouped the data as follows (filtered_data is of the schema {county: chararray,pop1: int,pop2: int,pop3: int,pop4: int,pop5: int,pop6: int,pop7: int,pop8: int}):
grouped_data = GROUP filtered_data BY county;
So grouped_data is of the schema
{group: chararray,filtered_data: {(county: chararray,pop1: int,pop2: int,pop3: int,pop4: int,pop5: int,pop6: int,pop7: int,pop8: int)}}
Now I would like to sum up all of the pop fields for each county, yielding the total population of each county. I'm pretty sure the command to do this will be of the form
pop_sums = FOREACH grouped_data GENERATE group, SUM(something about the pop fields);
but I've been unable to get this to work. Thanks in advance!
I don't know if this is helpful, but the following is a representative entry of grouped_data:
(147,{(147,385,1005,283,468,649,738,933,977),(147,229,655,178,288,394,499,579,481)})
Note that the 147 entries are actually county codes, not populations. They are therefore of type chararray.
Can you try the below approach?
Sample input:
147,1,1,1,1,1,1,1,1
147,2,2,2,2,2,2,2,2
145,5,5,5,5,5,5,5,5
PigScript:
A = LOAD 'input' USING PigStorage(',') AS(country:chararray,pop1:int,pop2:int,pop3:int,pop4:int,pop5:int,pop6:int,pop7:int,pop8:int);
B = GROUP A BY country;
C = FOREACH B GENERATE group,(SUM(A.pop1)+SUM(A.pop2)+SUM(A.pop3)+SUM(A.pop4)+SUM(A.pop5)+SUM(A.pop6)+SUM(A.pop7)+SUM(A.pop8)) AS totalPopulation;
DUMP C;
Output:
(145,40)
(147,24)
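An equivalent way to get the same totals is to add the eight fields within each row first and then call SUM once per group. This is a sketch using the question's field name county; the aliases row_totals, by_county and county_totals and the field row_pop are made up for illustration.

```pig
A = LOAD 'input' USING PigStorage(',') AS (county:chararray, pop1:int, pop2:int, pop3:int, pop4:int, pop5:int, pop6:int, pop7:int, pop8:int);

-- Per-row total across all age ranges.
row_totals = FOREACH A GENERATE county,
             (pop1 + pop2 + pop3 + pop4 + pop5 + pop6 + pop7 + pop8) AS row_pop;

-- One SUM per county over the per-row totals.
by_county = GROUP row_totals BY county;
county_totals = FOREACH by_county GENERATE group AS county, SUM(row_totals.row_pop) AS totalPopulation;
DUMP county_totals;
```

One difference to keep in mind: if any pop field in a row is null, that row's total becomes null and SUM skips the whole row, whereas summing each column separately skips only the null cells.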

Join two relations (candidates) and display difference between votes?

I first split the relation into those who won and those who lost.
I have trouble joining all the candidates back together and making tuples with the last names of both candidates (elected and defeated) and the difference between their vote totals (only tuples where the difference is less than 10).
--load the data
raw = LOAD '.../data2.csv' USING PigStorage(',') AS (
date, type:chararray, parl:int, prov:chararray, riding:chararray,
lastname:chararray, firstname:chararray, gender:chararray,
occupation:chararray, party:chararray, votes:int,
percent:double, elected:int);
fltrd = FILTER raw by votes > 100 ;
spltrd = SPLIT fltrd INTO won IF elected > 0, lost IF elected == 0;
jnd = JOIN won BY lastname AS lastname_won, lost BY lastname AS lastname_lost;
To display the difference between the votes, this is the idea I had, but it's not working:
jnd = JOIN won BY lastname AS lastname_won, vote AS vote_won, lost BY lastname AS lastname_lost, vote AS vote_lost;
gen = foreach jnd generate lastname_won, lastname_lost,(vote_won - vote_lost) as diffVotes;
What exactly is it that is not working?
At first glance, your JOIN will probably not work because of the AS you are using. You cannot rename columns in a JOIN; you need a FOREACH for that. Use :: to distinguish between the fields from the two relations:
jnd2 = FOREACH jnd GENERATE won::lastname AS lastname_won, won::vote AS vote_won etc.
If you join on multiple columns change the syntax as follows:
jnd = JOIN won BY (lastname, vote), lost BY (lastname, vote);
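Putting those pieces together, the flow might look like the sketch below. Two things here are assumptions rather than part of the original answer: the join key (parl, riding), chosen so that a winner is paired with the losers of the same contest, and the aliases diffs and close_races. The field is written as votes to match the LOAD schema, and SPLIT is written without an assignment.

```pig
fltrd = FILTER raw BY votes > 100;

-- SPLIT defines the output relations itself; it is not assigned to an alias.
SPLIT fltrd INTO won IF elected > 0, lost IF elected == 0;

-- Assumed join key: same parliament and same riding.
jnd = JOIN won BY (parl, riding), lost BY (parl, riding);

-- Rename with :: after the join and compute the vote difference.
diffs = FOREACH jnd GENERATE
        won::lastname AS lastname_won,
        lost::lastname AS lastname_lost,
        (won::votes - lost::votes) AS diffVotes;

-- Keep only the close contests.
close_races = FILTER diffs BY diffVotes < 10;
DUMP close_races;
```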