Filtering Dates Using Apache Pig - apache-pig

I have a list of movies with the release date. I want to get a list of movies that are newer than a given year e.g. 1982, so movies in 1983, 1984 and so on using Apache Pig.
The dates are in the format 01-Jan-1995. I can load the data correctly but my FILTER operation states there is a type mismatch.
I have tried converting the chararray to a datetime; however, the result is a date in the format 1995-01-01T00:00:00.000-08:00.
1) How do I retrieve only the year?
2) How do I filter only the values that are newer than the selected year?
ratings = LOAD '/user/maria_dev/ml-100k/u.data' AS (userID:int, movieID:int, rating:int, ratingTime:int);
metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING PigStorage ('|') AS (movieID:int, movieTitle:chararray, releaseDate:chararray, imdbLink: chararray);
nameLookup = FOREACH metadata GENERATE movieID, movieTitle, ToDate(releaseDate, 'dd-MMM-yyyy') AS releaseYear;
nameLookupYear = FOREACH nameLookup GENERATE movieID, movieTitle, ToString(releaseYear, 'yyyy') AS movieYear;
oldMovies = FILTER nameLookupYear by movieYear < ('1982');
DUMP oldMovies;

Use GetYear() to extract the year from the datetime object. It returns an int, so compare it with a number rather than a quoted string. If you want movies newer than 1982, the filter should be movieYear > 1982:
nameLookupYear = FOREACH nameLookup GENERATE movieID, movieTitle, GetYear(releaseYear) AS movieYear;
oldMovies = FILTER nameLookupYear by movieYear > 1982;
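To make the logic of the answer concrete outside Pig, here is a minimal plain-Python sketch of the same two steps: parse a dd-MMM-yyyy string, take the year as an integer, and keep rows strictly newer than a given year. The rows and the helper names release_year and newer_than are hypothetical, made up for illustration; they are not taken from u.item or from any Pig API.

```python
from datetime import datetime

# Hypothetical rows shaped like (movieID, movieTitle, releaseDate).
movies = [
    (1, "Movie A", "01-Jan-1995"),
    (2, "Movie B", "15-Jun-1982"),
    (3, "Movie C", "24-Dec-1979"),
    (4, "Movie D", "03-Mar-1983"),
]


def release_year(release_date):
    """Parse a 'dd-MMM-yyyy' string and return the year as an int."""
    # %b matches English month abbreviations under Python's default C locale.
    return datetime.strptime(release_date, "%d-%b-%Y").year


def newer_than(rows, year):
    """Keep rows whose release year is strictly greater than `year`."""
    result = []
    for movie_id, title, release_date in rows:
        movie_year = release_year(release_date)
        if movie_year > year:  # integer comparison, like movieYear > 1982
            result.append((movie_id, title, movie_year))
    return result


new_movies = newer_than(movies, 1982)
print(new_movies)
```

The point mirrored from the answer is that the year is compared as a number, so a movie released in 1982 itself is excluded.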

Related

PIG need to find max

I am new to Pig and working on a problem where I need to find the player in this dataset with the max weight. Here is a sample of the data:
id, weight,id,year, triples
(bayja01,210,bayja01,2005,6)
(crawfca02,225,crawfca02,2005,15)
(damonjo01,205,damonjo01,2005,6)
(dejesda01,190,dejesda01,2005,6)
(eckstda01,170,eckstda01,2005,7)
and here is my pig script:
batters = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' using PigStorage(',');
realbatters = FILTER batters BY $1==2005;
triphitters = FILTER realbatters BY $9>5;
tripids = FOREACH triphitters GENERATE $0 AS id,$1 AS YEAR, $9 AS Trips;
names = LOAD 'hdfs:/user/maria_dev/pigtest/Master.csv'
using PigStorage(',');
weights = FOREACH names GENERATE $0 AS id, $16 AS weight;
get_ids = JOIN weights BY (id), tripids BY(id);
wts = FOREACH get_ids GENERATE MAX(get_ids.weight)as wgt;
DUMP wts;
The second-to-last line did not work, of course. It told me I had to use an explicit cast. I have the filtering etc. figured out; I just can't figure out how to get the final answer.
The MAX function in Pig expects a Bag of values and will return the highest value in the bag. In order to create a Bag, you must first GROUP your data:
get_ids = JOIN weights BY id, tripids BY id;
-- Drop columns we no longer need and rename for ease
just_ids_weights = FOREACH get_ids GENERATE
    weights::id AS id,
    weights::weight AS weight;
-- Group the data by id value
gp_by_ids = GROUP just_ids_weights BY id;
-- Find maximum weight by id
wts = FOREACH gp_by_ids GENERATE
group AS id,
MAX(just_ids_weights.weight) AS wgt;
If you wanted the maximum weight in all of the data, you can put all of your data in a single bag using GROUP ALL:
gp_all = GROUP just_ids_weights ALL;
wts = FOREACH gp_all GENERATE
    MAX(just_ids_weights.weight) AS wgt;
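The difference between grouping by id and GROUP ALL can be sketched in plain Python. This is only an illustration of the semantics, using the (id, weight) pairs from the sample rows in the question; the function names are hypothetical and the code is not a translation of Pig's MAX implementation.

```python
from collections import defaultdict

# (id, weight) pairs taken from the sample rows in the question.
just_ids_weights = [
    ("bayja01", 210),
    ("crawfca02", 225),
    ("damonjo01", 205),
    ("dejesda01", 190),
    ("eckstda01", 170),
]


def max_by_id(rows):
    """Like GROUP ... BY id followed by MAX: one maximum per id."""
    bags = defaultdict(list)
    for player_id, weight in rows:
        bags[player_id].append(weight)
    return {player_id: max(weights) for player_id, weights in bags.items()}


def max_overall(rows):
    """Like GROUP ... ALL followed by MAX: one maximum for the whole relation."""
    return max(weight for _, weight in rows)


def heaviest_player(rows):
    """Return the (id, weight) row that carries the overall maximum weight."""
    return max(rows, key=lambda row: row[1])


print(max_overall(just_ids_weights))
print(heaviest_player(just_ids_weights))
```

Grouping by id yields one value per player, which for this sample is just each player's own weight; the single-bag version is the one that answers "who is heaviest".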

Multiple distinct columns (Pig)

I have a list of flights, with the following attributes:
day: day they were flown
flight_number: their flight number
origin_airport: origin city airport
dest_airport: destination city airport
carrier_code: the airline carrier's code (eg. delta: DL)
I am trying to find the number of flights operated by each carrier. In doing so, for every day, I need to find distinct flight_number, origin_airport, dest_airport and carrier_code since one of the conditions is that "An airplane may be scheduled to fly from A to B and then from B to C with the same flight number on the same day. However, we consider these two journeys as two separate flights."
This is what I have that is not running:
desiredattributes = FOREACH jnd GENERATE day, flight_number, origin_airport_id, dest_airport_id, carrier_code;
distinctflights = FOREACH (GROUP desiredattributes BY day)
{
a = carriers.(carrier_code, flight_number, origin_airport_id, dest_airport_id);
b = DISTINCT a;
};
DUMP distinctflights;
Any help or guidance is appreciated! I am new to Pig.
The nested block refers to carriers, but the bag produced by the GROUP is named desiredattributes. List the fields from desiredattributes instead:
desiredattributes = FOREACH jnd GENERATE day, flight_number, origin_airport_id, dest_airport_id, carrier_code;
dayflights = GROUP desiredattributes BY day;
alldayflights = FOREACH dayflights GENERATE
    FLATTEN(group) AS day,
    desiredattributes.flight_number,
    desiredattributes.origin_airport_id,
    desiredattributes.dest_airport_id,
    desiredattributes.carrier_code;
distinctflights = DISTINCT alldayflights;
DUMP distinctflights;
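As a rough illustration of the goal stated in the question (count flights per carrier, where a flight is a distinct combination of day, flight number, origin, and destination), here is a small plain-Python sketch. The rows are hypothetical, and flights_per_carrier is a made-up helper; it shows the intended result and does not reproduce the Pig answer above.

```python
from collections import Counter

# Hypothetical rows:
# (day, flight_number, origin_airport_id, dest_airport_id, carrier_code)
flights = [
    (1, 100, "A", "B", "DL"),
    (1, 100, "B", "C", "DL"),  # same flight number, different leg: a separate flight
    (1, 100, "A", "B", "DL"),  # exact duplicate of the first row: counted once
    (2, 100, "A", "B", "DL"),  # same leg on another day: a separate flight
    (1, 200, "A", "C", "AA"),
]


def flights_per_carrier(rows):
    """Deduplicate whole rows, then count the remaining rows per carrier code."""
    distinct_rows = set(rows)
    return dict(Counter(row[4] for row in distinct_rows))


print(flights_per_carrier(flights))
```

Deduplicating the full tuple keeps the A-to-B and B-to-C legs apart, which is the condition quoted in the question.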

Count distinct values in a group using pig

My problem in a general sense is that I'd like to group my data and then count the unique values for a field.
Specifically, for the data below, I want to group by 'category' and 'year' and then count the unique values for 'food'.
category,id,mydate,mystore,food
catA,myid_1,2014-03-11 13:13:13,store1,apple
catA,myid_2,2014-03-11 12:12:12,store1,milk
catA,myid_3,2014-08-11 10:13:13,store1,apple
catA,myid_4,2014-09-11 09:12:12,store1,milk
catA,myid_5,2015-09-01 10:10:10,store1,milk
catB,myid_6,2014-03-12 03:03:03,store2,milk
catB,myid_7,2014-03-12 05:55:55,store2,apple
This is as far as I can get, which is just picking out the values and using some of the neat pig date functions:
a = load '$input' using PigStorage(',') as (category:chararray,id:chararray,mydate:chararray,mystore:chararray,food:chararray);
b = foreach a generate category, id, ToDate(mydate,'yyyy-MM-dd HH:mm:ss') as myDt:DateTime, mystore,food;
c = foreach b generate category, GetYear(myDt) as year:int, mystore,food;
dump c;
The output from the alias 'c' is:
(catA,2014,store1,apple)
(catA,2014,store1,milk)
(catA,2014,store1,apple)
(catA,2014,store1,milk)
(catA,2015,store1,milk)
(catB,2014,store2,milk)
(catB,2014,store2,apple)
I want in the end:
catA, 2014, {(apple, 2), (milk, 2)}
catA, 2015, {(milk, 1)}
catB, 2014, {(apple, 1), (milk, 1)}
I've seen some examples of generating value counts, but grouping by category and year is tripping me up.
Input:
category,id,mydate,mystore,food
catA,myid_1,2014-03-11 13:13:13,store1,apple
catA,myid_2,2014-03-11 12:12:12,store1,milk
catA,myid_3,2014-08-11 10:13:13,store1,apple
catA,myid_4,2014-09-11 09:12:12,store1,milk
catA,myid_5,2015-09-01 10:10:10,store1,milk
catB,myid_6,2014-03-12 03:03:03,store2,milk
catB,myid_7,2014-03-12 05:55:55,store2,apple
Yes, you can use a nested FOREACH after your grouping. Inside the nested FOREACH, apply DISTINCT to the foods and then count the result.
Pig Script:
list = LOAD 'user/cloudera/apple.txt' USING PigStorage(',') AS(category:chararray,id:chararray,mydate:chararray,my_store:chararray,food:chararray);
list_each = FOREACH list GENERATE category,SUBSTRING(mydate,0,4) as my_year, my_store, food;
list_grp = GROUP list_each BY (category,my_year);
list_nested_each = FOREACH list_grp
{
    list_inner_each = FOREACH list_each GENERATE food;
    list_inner_dist = DISTINCT list_inner_each;
    GENERATE flatten(group) as (category,my_year), COUNT(list_inner_dist) as no_of_uniq_foods;
};
dump list_nested_each;
Output:
(catA,2014,2)
(catA,2015,1)
(catB,2014,2)
Appending to the code in the question:
d = group c by (category, year, food);
e = foreach d generate FLATTEN(group), COUNT(c) as count;
will produce:
(catA,2014,milk,2)
(catA,2014,apple,2)
(catA,2015,milk,1)
(catB,2014,milk,1)
(catB,2014,apple,1)
The key is to group by 'food' as well. Interesting. Any other insight is welcomed.
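The two answers compute different things, and a short plain-Python sketch may make the contrast easier to see. It uses the (category, year, food) values from the output of alias c in the question; the function names are hypothetical.

```python
from collections import Counter, defaultdict

# (category, year, food) taken from the output of alias `c`, store column dropped.
rows = [
    ("catA", 2014, "apple"),
    ("catA", 2014, "milk"),
    ("catA", 2014, "apple"),
    ("catA", 2014, "milk"),
    ("catA", 2015, "milk"),
    ("catB", 2014, "milk"),
    ("catB", 2014, "apple"),
]


def unique_food_counts(data):
    """Number of distinct foods per (category, year), like DISTINCT then COUNT."""
    foods = defaultdict(set)
    for category, year, food in data:
        foods[(category, year)].add(food)
    return {key: len(values) for key, values in foods.items()}


def food_counts(data):
    """Number of rows per (category, year, food), like grouping by all three."""
    return dict(Counter(data))


print(unique_food_counts(rows))
print(food_counts(rows))
```

The first function matches the nested FOREACH answer (how many different foods), while the second matches the group-by-food answer (how often each food occurs), which is closer to the output the question asks for.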

Pig: Summing Fields

I have some census data in which each line has a number denoting the county and fields for the number of people in a certain age range (eg, 5 and under, 5 to 17, etc.). After some initial processing in which I removed the unneeded columns, I grouped the labeled data as follows (labeled_data is of the schema {county: chararray,pop1: int,pop2: int,pop3: int,pop4: int,pop5: int,pop6: int,pop7: int,pop8: int}):
grouped_data = GROUP filtered_data BY county;
So grouped_data is of the schema
{group: chararray,filtered_data: {(county: chararray,pop1: int,pop2: int,pop3: int,pop4: int,pop5: int,pop6: int,pop7: int,pop8: int)}}
Now I would like to sum up all of the pop fields for each county, yielding the total population of each county. I'm pretty sure the command to do this will be of the form
pop_sums = FOREACH grouped_data GENERATE group, SUM(something about the pop fields);
but I've been unable to get this to work. Thanks in advance!
I don't know if this is helpful, but the following is a representative entry of grouped_data:
(147,{(147,385,1005,283,468,649,738,933,977),(147,229,655,178,288,394,499,579,481)})
Note that the 147 entries are actually county codes, not populations. They are therefore of type chararray.
Can you try the below approach?
Sample input:
147,1,1,1,1,1,1,1,1
147,2,2,2,2,2,2,2,2
145,5,5,5,5,5,5,5,5
PigScript:
A = LOAD 'input' USING PigStorage(',') AS (county:chararray,pop1:int,pop2:int,pop3:int,pop4:int,pop5:int,pop6:int,pop7:int,pop8:int);
B = GROUP A BY county;
C = FOREACH B GENERATE group,(SUM(A.pop1)+SUM(A.pop2)+SUM(A.pop3)+SUM(A.pop4)+SUM(A.pop5)+SUM(A.pop6)+SUM(A.pop7)+SUM(A.pop8)) AS totalPopulation;
DUMP C;
Output:
(145,40)
(147,24)
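The arithmetic behind this output can be checked with a small plain-Python sketch that sums every pop field of every row for each county. It uses the sample input from the answer; total_population is a hypothetical helper name.

```python
from collections import defaultdict

# Sample input from the answer: county code followed by pop1..pop8.
rows = [
    ("147", 1, 1, 1, 1, 1, 1, 1, 1),
    ("147", 2, 2, 2, 2, 2, 2, 2, 2),
    ("145", 5, 5, 5, 5, 5, 5, 5, 5),
]


def total_population(data):
    """Sum all pop fields of all rows that share a county code."""
    totals = defaultdict(int)
    for county, *pops in data:
        totals[county] += sum(pops)
    return dict(totals)


print(total_population(rows))
```

Adding the per-column sums, as the Pig script does, gives the same total as summing each row first and then adding the rows.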

PIG: sum and division, creating an object

I am writing a Pig program that loads a file that separates its entries with tabs
ex: name TAB year TAB count TAB...
file = LOAD 'file.csv' USING PigStorage('\t') as (type: chararray, year: chararray,
match_count: float, volume_count: float);
-- Group by type
grouped = GROUP file BY type;
-- Flatten
by_type = FOREACH grouped GENERATE FLATTEN(group) AS (type, year, match_count, volume_count);
group_operat = FOREACH by_type GENERATE
SUM(match_count) AS sum_m,
SUM(volume_count) AS sum_v,
(float)sum_m/sm_v;
DUMP group_operat;
The issue lies in the group operations object I am trying to create. I want to sum all the match counts, sum all the volume counts, and divide the summed match counts by the summed volume counts.
What am I doing wrong in my arithmetic operations/object creation?
An error I receive is line 7, column 11> pig script failed to validate: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1031: Incompatable schema: left is "type:NULL,year:NULL,match_count:NULL,volume_count:NULL", right is "group:chararray"
Thank you.
Try it like this; it returns the type and the ratio of the two sums.
UPDATED: working code below.
input.txt
A 2001 10 2
A 2002 20 3
B 2003 30 4
B 2004 40 1
PigScript:
file = LOAD 'input.txt' USING PigStorage() AS (type: chararray, year: chararray,
match_count: float, volume_count: float);
grouped = GROUP file BY type;
group_operat = FOREACH grouped {
    sum_m = SUM(file.match_count);
    sum_v = SUM(file.volume_count);
    GENERATE group,(float)(sum_m/sum_v) as sum_mv;
}
DUMP group_operat;
Output:
(A,6.0)
(B,14.0)
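The numbers in this output can be reproduced with a short plain-Python sketch: accumulate both sums per type, then divide once per group. It uses the rows from input.txt above; ratio_by_type is a hypothetical helper name, and the sketch assumes no group has a volume sum of zero.

```python
# Rows from input.txt: (type, year, match_count, volume_count)
rows = [
    ("A", "2001", 10.0, 2.0),
    ("A", "2002", 20.0, 3.0),
    ("B", "2003", 30.0, 4.0),
    ("B", "2004", 40.0, 1.0),
]


def ratio_by_type(data):
    """Sum match_count and volume_count per type, then divide the two sums."""
    sums = {}
    for type_, _year, match_count, volume_count in data:
        sum_m, sum_v = sums.get(type_, (0.0, 0.0))
        sums[type_] = (sum_m + match_count, sum_v + volume_count)
    # Assumes sum_v is never zero for the data at hand.
    return {type_: sum_m / sum_v for type_, (sum_m, sum_v) in sums.items()}


print(ratio_by_type(rows))
```

The division happens after aggregation, which is why the sums have to be computed on the grouped bag rather than on individual rows.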
Try this:
file = LOAD 'file.csv' USING PigStorage('\t') as (type: chararray, year: chararray,
match_count: float, volume_count: float);
grouped = GROUP file BY (type,year);
group_operat = FOREACH grouped GENERATE group,
SUM(file.match_count) AS sum_m,
SUM(file.volume_count) AS sum_v,
(float)(SUM(file.match_count)/SUM(file.volume_count)) as sum_mv;
The script above groups by type and year. If you want to group only by type, remove year from the GROUP statement:
grouped = GROUP file BY type;
group_operat = FOREACH grouped GENERATE group,file.year,
SUM(file.match_count) AS sum_m,
SUM(file.volume_count) AS sum_v,
(float)(SUM(file.match_count)/SUM(file.volume_count)) as sum_mv;