Pig :: Get MAX value from the COUNTs of groups

I have a file with bank name, location, and a few other fields. I want to find the bank with the most branches.
A = LOAD 'bank.txt';
B = GROUP A by $0;
C = FOREACH B GENERATE COUNT($1);
I got the bank-wise counts. Now I am stuck on how to refer to C to get the bank with the MAX number of branches.

Since you are grouping by bank, generate the group key together with a count of the rows in each group (one row per branch), then order by the count in descending order and take the top row. Inside the FOREACH, the grouped bag is named after the original relation (A), not the grouped relation (B).
A = LOAD 'bank.txt';
B = GROUP A by $0;
C = FOREACH B GENERATE group AS Bank, COUNT(A) AS cnt;
D = ORDER C BY cnt DESC;
E = LIMIT D 1;
DUMP E;
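ORDER plus LIMIT 1 returns a single row even when several banks tie for the highest count. If ties matter, one possible variant is to compute the maximum count once and filter on it. This is only a sketch: the field names bank and location and all alias names are assumptions, since the original LOAD declares no schema.

```pig
-- Assumed schema: the first two tab-separated fields are bank name and location.
banks      = LOAD 'bank.txt' AS (bank:chararray, location:chararray);
by_bank    = GROUP banks BY bank;
counts     = FOREACH by_bank GENERATE group AS bank, COUNT(banks) AS cnt;

-- Collapse all per-bank counts into one group to find the largest count.
all_counts = GROUP counts ALL;
max_count  = FOREACH all_counts GENERATE MAX(counts.cnt) AS max_cnt;

-- max_count holds a single row, so it can be used as a scalar in the filter.
top_banks  = FILTER counts BY cnt == max_count.max_cnt;
DUMP top_banks;
```

Every bank whose count equals the maximum is kept, so ties produce more than one output row.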

Related

PIG Group into bag by distinct value

recipe,ingredient,inventor
Tacos,Beef,Alex
Tacos,Lettuce,Alex
Tacos,Cheese,Alex
TomatoSoup,Tomatoes,Steve
TomatoSoup,Milk,Steve
I want to group the records by recipe and collect the ingredients into a bag, keeping the inventor, like this:
(Tacos,{Beef,Lettuce,Cheese},Alex)
(TomatoSoup,{Tomatoes,Milk},Steve)
Group by recipe and inventor and then order the columns as per your requirement.
A = LOAD 'data.txt' USING PigStorage(',') AS (recipe:chararray,ingredient:chararray,inventor:chararray);
B = GROUP A BY (recipe,inventor);
C = FOREACH B GENERATE FLATTEN(group) AS (recipe,inventor),A.ingredient;
D = FOREACH C GENERATE recipe,ingredient,inventor;
DUMP D;

Pig - Convert rows into multiple columns

Can we convert input rows into multiple columns, with each output row ending at the line that matches Three.*?
Here is one naive solution.
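Group every three rows using RANK and GROUP, then FILTER by each of the three patterns. RANK numbers rows from 1, so (rank_A+2)/3 with integer division maps ranks 1 to 3 to id 1, ranks 4 to 6 to id 2, and so on.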
My Pig Script
A = Load '/path_to_data/data' as (c1 : chararray);
B = RANK A;
C = FOREACH B GENERATE (rank_A+2)/3 as id, c1;
D = FOREACH (GROUP C BY id) {
    ONE = FILTER C BY c1 matches 'One:.*';
    TWO = FILTER C BY c1 matches 'Two:.*';
    THREE = FILTER C BY c1 matches 'Three:.*';
    GENERATE
        group as id
        , FLATTEN(ONE.c1) as c1_one
        , FLATTEN(TWO.c1) as c1_two
        , FLATTEN(THREE.c1) as c1_three
    ;
};
DUMP D;
My Result
(1,One:"A",Two:"2",Three:"last")
(2,One:"B",Two:"1",Three:"first")

Apache Pig Distinct and Count

I'm trying to figure out the following question.
How many female users provided at least one rating of 4? I think my join and filters are correct, but I can't figure out the distinct count part. I have tried numerous versions of the code below.
a = load '/user/pig/movie' AS (userid:int, movieid:int, rating:int, timestamp:chararray);
b = load '/user/pig/reviewer' using PigStorage('|') AS (userid:int, age:int, gender:chararray, occupation:chararray, zip:chararray);
a1 = filter a by rating == 4;
b1 = filter b by gender == 'F';
c = join a1 by userid, b1 by userid;
d = FOREACH c GENERATE COUNT(DISTINCT(userid));
dump d;
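You have to GROUP before COUNT. Ref: COUNT requires a preceding GROUP ALL statement for global counts and a GROUP BY statement for group counts. For a distinct count, project the user id from the join, apply DISTINCT, and then count over GROUP ALL.
d = FOREACH c GENERATE b1::userid AS userid;
e = DISTINCT d;
f = GROUP e ALL;
g = FOREACH f GENERATE COUNT(e);
dump g;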

How to compute subtraction of the data from different aliases in Pig?

dump count_a;
(10)
dump count_b;
(20)
dump count_c;
(30)
Now I want to calculate: count_c - count_b - count_a. How can I achieve this in a Pig script?
You'll need to join the three aliases together, for which you'll need a common field to join upon. Assuming they are all single record aliases you can create a field n to join on:
prep_a = FOREACH count_a GENERATE 1 AS n, $0 AS a;
prep_b = FOREACH count_b GENERATE 1 AS n, $0 AS b;
prep_c = FOREACH count_c GENERATE 1 AS n, $0 AS c;
Then you can join a & b & c all by the common field n:
ab = JOIN prep_a by n FULL, prep_b by n;
abc = JOIN ab by prep_a::n FULL, prep_c by n;
Then finally calculate the final result:
result = FOREACH abc GENERATE (c - b - a) AS result;
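When each alias holds exactly one single-field row, as the dumps in the question suggest, a shorter alternative is to treat the other relations as scalars inside the FOREACH. This is a sketch under that assumption; the field name diff is made up for illustration.

```pig
-- Assumes count_a, count_b and count_c each contain exactly one tuple with one field.
-- count_b.$0 and count_a.$0 read that single value as a scalar.
result = FOREACH count_c GENERATE ($0 - count_b.$0 - count_a.$0) AS diff;
DUMP result;
```

With the values shown in the question (10, 20, 30) this evaluates 30 - 20 - 10, so the dumped tuple should be (0). A scalar projection is only meaningful for single-row relations; if one of the aliases had several rows, the script would be expected to fail at run time rather than pick one of them.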

Pig Latin - distinct count and string comparison in for loop

I'm trying to filter users, keeping those who have at least two countries in their profile or who are from the US. I tried this in Pig:
B = group A by userid;
C = foreach B {
count = $1.country;
count2 = distinct count;
GENERATE (((SIZE(count2) > 1 OR count2.$0 != 'USA') ? group : null)));
}
but it came with this error
incompatible types in NotEqual Operator left hand side:bag :tuple(country:chararray) right hand side:chararray
I tried various other combinations, but no luck.
Try this:
C = foreach (group A by userid) {
    countries = distinct A.country;
    generate
        group as userid,
        COUNT(countries) AS count,
        FLATTEN(countries) as country;
};
D = filter C by count > 1 OR country == 'US';
C is a relation with schema {userid:chararray, count:long, country:chararray}, where count is the number of distinct countries that userid is associated with. C has one row per (userid, country) pair, and D is filtered according to your criteria, so a user with several countries appears once per country.