Pig Script output - apache-pig

I am new to Pig and came across the following Pig script. I understood the first two lines, but I am not clear on what happens from "C = GROUP B BY ..." onward. Can someone please explain the code?
A = LOAD 'page_views' AS (user, action, timespent, query_term, ip_addr,
    timestamp, estimated_revenue, page_info, page_links);
B = FOREACH A GENERATE user, estimated_revenue;
C = GROUP B BY user PARALLEL 40;
D = FOREACH C {
    E = ORDER B BY estimated_revenue;
    F = E.estimated_revenue;
    GENERATE group, SUM(F);
};
STORE D INTO 'L16out';
Thanks in advance.
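A hedged reading of the script, assuming that the field names contain underscores that were lost in formatting and that FOREACH ... ENDFOR stands for Pig's nested FOREACH { ... } block. The comments describe what each statement does:

```pig
-- Load the page-view log; fields declared without a type default to bytearray.
A = LOAD 'page_views' AS (user, action, timespent, query_term, ip_addr,
    timestamp, estimated_revenue, page_info, page_links);
-- Keep only the two columns that are needed.
B = FOREACH A GENERATE user, estimated_revenue;
-- One row per user: (group, {bag of that user's rows from B}).
-- PARALLEL 40 requests 40 reduce tasks for this step.
C = GROUP B BY user PARALLEL 40;
-- Nested FOREACH: the statements in the braces run once per group,
-- and B inside the braces is that group's bag.
D = FOREACH C {
    E = ORDER B BY estimated_revenue;   -- sort this user's rows by revenue
    F = E.estimated_revenue;            -- project only the revenue column
    GENERATE group, SUM(F);             -- emit (user, total revenue)
};
STORE D INTO 'L16out';
```

The ORDER step does not change the sum, so the result is one (user, total estimated revenue) row per user.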

Related

Pig count query

I have to find the number of students who scored less than 5.
I have loaded the file.
I am using a filter for grade < 5.
I don't see how to get the count now.
Can anyone please help?
Refer to COUNT.
A = LOAD 'student.csv' using PigStorage(',') as (name:chararray,grade:int);
B = FILTER A by grade < 5;
C = GROUP B ALL; -- a single group holding every filtered row, so COUNT yields one total
D = FOREACH C GENERATE COUNT(B);
DUMP D;

How to check COUNT of filtered elements in PIG

I have the following data set, on which I need to perform some steps based on the car's company name.
(23,Nissan,12.43)
(23,Nissan Car,16.43)
(23,Honda Car,13.23)
(23,Toyota Car,17.0)
(24,Honda,45.0)
(24,Toyota,12.43)
(24,Nissan Car,12.43)
A = LOAD 'data.txt' AS (code:int, name:chararray, rating:double);
G = GROUP A by (code, REGEX_EXTRACT(name,'(?i)(^.+?\\b)\\s*(Car)*$',1));
DUMP G;
I am grouping cars by code and base company name, so that all the 'Nissan' and 'Nissan Car' records fall into one group, and similarly for the others.
/* Grouped data based on code and company's first name*/
((23,Nissan),{(23,Nissan,12.43),(23,Nissan Car,16.43)})
((23,Honda),{(23,Honda Car,13.23)})
((23,Toyota),{(23,Toyota Car,17.0)})
((24,Nissan),{(24,Nissan Car,12.43)})
((24,Honda),{(24,Honda,45.0)})
((24,Toyota),{(24,Toyota,12.43)})
Now I want to filter each group based on whether it contains a tuple whose name equals the group's name. If it does, keep only that tuple and ignore the others; if no such tuple exists, keep all the tuples in the group.
The output should be:
((23,Nissan),{(23,Nissan,12.43)}) // Since this group contains a row with group's name i.e. Nissan
((23,Honda),{(23,Honda Car,13.23)})
((23,Toyota),{(23,Toyota Car,17.0)})
((24,Nissan),{(24,Nissan Car,12.43)})
((24,Honda),{(24,Honda,45.0)})
((24,Toyota),{(24,Toyota,12.43)})
R = FOREACH G { OW = FILTER A BY name==group.$1; IF COUNT(OW) > 0}
Could anybody please help me do this? After filtering by the group's name, how can I find the count of the filtered tuples and get the required data?
OK. Let's take the records below as your input.
23,Nissan,12.43
23,Nissan Car,16.43
23,Honda Car,13.23
23,Toyota Car,17.0
24,Honda,45.0
24,Toyota,12.43
25,Toyato Car,23.8
25,Toyato Car,17.2
24,Nissan Car,12.43
For the above input, let's say the following is the intermediate output:
((23,Honda),{(23,Honda,Honda Car,13.23)})
((23,Nissan),{(23,Nissan,Nissan,12.43),(23,Nissan,Nissan Car,16.43)})
((23,Toyota),{(23,Toyota,Toyota Car,17.0)})
((24,Honda),{(24,Honda,Honda,45.0)})
((24,Nissan),{(24,Nissan,Nissan Car,12.43)})
((24,Toyota),{(24,Toyota,Toyota,12.43)})
((25,Toyato),{(25,Toyato,Toyato Car,23.8),(25,Toyato,Toyato Car,17.2)})
From that intermediate output, suppose the output you are looking for is:
(23,Honda,1)
(23,Nissan,1)
(23,Toyota,1)
(24,Honda,1)
(24,Nissan,1)
(24,Toyota,1)
(25,Toyato,2)
Below is the code.
nissan_load = LOAD '/user/cloudera/inputfiles/nissan.txt' USING PigStorage(',') as(code:int,name:chararray,rating:double);
nissan_each = FOREACH nissan_load GENERATE code,TRIM(REGEX_EXTRACT(name,'(?i)(^.+?\\b)\\s*(Car)*$',1)) as brand_name,name,rating;
nissan_grp = GROUP nissan_each by (code,brand_name);
nissan_final_each = FOREACH nissan_grp {
    A = FOREACH nissan_each GENERATE (brand_name == TRIM(name) ? 1 : 0) as cnt;
    B = (int)SUM(A);
    C = FOREACH nissan_each GENERATE (brand_name != TRIM(name) ? 1 : 0) as extra_cnt;
    D = SUM(C);
    generate flatten(group) as (code,brand_name), (SUM(A.cnt) != 0 ? B : D) as final_cnt;
};
dump nissan_final_each;
Try this code with different inputs as well.

Pig Grouping Functions

I would like to find the item each person bought most recently. Assume the same person can buy many items.
Below are the input details:
kumar,2014-09-30,television
kumar,2014-07-27,smartphone
Andrew,2014-06-21,camera
Andrew,2014-05-20,car
I need the output as below:
kumar,2014-09-30,television
Andrew,2014-06-21,camera
I wrote the Pig script up to this point, but I don't know how to proceed from there. Can somebody help me?
A = LOAD 'records.txt' USING PigStorage(',') AS(name:chararray,date:chararray,item:chararray);
B = GROUP A BY name;
C = FOREACH B GENERATE group,MAX(A.date);
But I need to get the item that each person purchased most recently. How do I get that? As I understand it, after a GROUP I can only use aggregate functions in Pig.
How do I get the respective item that was purchased?
Use bags and ORDER BY in a nested FOREACH; it needs only one MapReduce job and is more in the Apache Pig style.
A = LOAD 'input.txt' USING PigStorage(',') AS(name:chararray,date:chararray,item:chararray);
B = GROUP A BY name;
C = FOREACH B {
    ordered = ORDER A BY date DESC; -- this will cause a secondary sort to optimise the execution
    latest = LIMIT ordered 1;
    GENERATE FLATTEN(latest); -- an advantage of Pig: all columns are preserved rather than dropped as in a SQL GROUP BY
};
DUMP C;
Also, using $0, $1, etc. is convenient, but imagine a script with hundreds of lines and tens of GROUP BY and JOIN operations that all project using '$': following the flow of columns through such a script is a nightmare, and a great deal of time is wasted maintaining and changing it.
I hope this works for you.
input.txt
kumar,2014-09-30,television
kumar,2014-07-27,smartphone
Andrew,2014-06-21,camera
Andrew,2014-05-20,car
PigScript:
A = LOAD 'input.txt' USING PigStorage(',') AS(name:chararray,date:chararray,item:chararray);
B = GROUP A BY name;
C = FOREACH B GENERATE group,FLATTEN(MAX($1.date));
D = JOIN A BY (name, date), C BY ($0, $1); -- join on name as well, so equal dates for different people do not cross-match
E = FOREACH D GENERATE $0,$1,$2;
DUMP E;
Output:
(Andrew,2014-06-21,camera)
(kumar,2014-09-30,television)

Pig - Calculating percentage of total for a field

I am trying to calculate the percentage of the total for each value in a field.
For example, for data (name, ct)
(john, 1000)
(Dan, 2000)
(liz, 2000)
I want the output to be (name, fraction of ct out of the total):
(john, .2)
(Dan, .4)
(liz, .4)
data = load 'fakedata.txt' as (name:chararray,sqr:chararray,ct:int);
A = foreach data generate name, ct;
A = FILTER A by ct is not null;
B = group A all;
C = foreach B generate SUM(A.ct) as tot;
D = foreach A generate name, ct/(double)C.tot;
dump D;
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Invalid alias: C in {name: bytearray,ct: int}
I am following exactly what is given at http://pig.apache.org/docs/r0.10.0/basic.html,
in the example code in the section "Casting Relations to Scalars".
If I run DUMP C, the output is correctly generated as 5000, so the problem is in D. Any help is greatly appreciated.
The code below works for me without any error. It is basically the same as what you have, so I am not sure why you are getting this error. Which version of Pig are you using?
data = load 'StackData' as (name:chararray, marks:int);
grp = GROUP data all;
allcount = foreach grp generate SUM(data.marks) as total;
perc = foreach data generate name, marks/(double)allcount.total;
dump perc;
In relation D, you are looping over relation A again, and it knows nothing about C.
I'd suggest calculating the SUM and then doing a JOIN so that each entry contains the sum. That way you'll be able to calculate the percentage of the total for each entry.
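A minimal sketch of that suggestion, reusing the file name and schema from the question. It swaps the JOIN for a CROSS, which is an assumption on my part: the total is a one-row relation with no key to join on, so CROSS simply appends it to every row. The aliases A1, E and pct are illustrative names, not from the original post.

```pig
data = LOAD 'fakedata.txt' AS (name:chararray, sqr:chararray, ct:int);
A = FOREACH data GENERATE name, ct;
A1 = FILTER A BY ct IS NOT NULL;
-- Collapse everything into one group and compute the grand total.
B = GROUP A1 ALL;
C = FOREACH B GENERATE SUM(A1.ct) AS tot;
-- C has exactly one row, so CROSS attaches tot to every row of A1.
D = CROSS A1, C;
-- Cast before dividing so the result is a fraction rather than an integer quotient.
E = FOREACH D GENERATE A1::name AS name, (double)A1::ct / (double)C::tot AS pct;
DUMP E;
```

Because this avoids the relation-to-scalar cast, it should not depend on the Pig version supporting that feature.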

Pig split and join

I have a requirement to propagate field values from one row to others, depending on the record type.
For example, my raw input is:
1,firefox,p
1,,q
1,,r
1,,s
2,ie,p
2,,s
3,chrome,p
3,,r
3,,s
4,netscape,p
The desired result is:
1,firefox,p
1,firefox,q
1,firefox,r
1,firefox,s
2,ie,p
2,ie,s
3,chrome,p
3,chrome,r
3,chrome,s
4,netscape,p
I tried:
A = LOAD 'file1.txt' using PigStorage(',') AS (id:int,browser:chararray,type:chararray);
SPLIT A INTO B IF (type =='p'), C IF (type!='p' );
joined = JOIN B BY id FULL, C BY id;
joinedFields = FOREACH joined GENERATE B::id, B::type, B::browser, C::id, C::type;
dump joinedFields;
The result I got was:
(,,,1,p )
(,,,1,q)
(,,,1,r)
(,,,1,s)
(2,p,ie,2,s)
(3,p,chrome,3,r)
(3,p,chrome,3,s)
(4,p,netscape,,)
Any help is appreciated, Thanks.
Pig is not exactly SQL; it is built with data flows, MapReduce and groups in mind (joins are there too). You can get the result using a GROUP BY, a FILTER nested in the FOREACH, and FLATTEN.
inpt = LOAD 'file1.txt' using PigStorage(',') AS (id:int,browser:chararray,type:chararray);
grp = GROUP inpt BY id;
Result = FOREACH grp {
    P = FILTER inpt BY type == 'p'; -- keep only the records of type p for this id
    PL = LIMIT P 1; -- make sure there is just one
    GENERATE FLATTEN(inpt.(id,type)), FLATTEN(PL.browser); -- convert the bags produced by GROUP BY back into rows
};