Deriving fields data from grouped data in Pig

I am new to Pig. Could someone please help me? Below is the code:
a = load 'stage.temp' USING org.apache.hive.hcatalog.pig.HCatLoader();
b = limit a 10;
c = group b by $0;
dump c;
(74409607,{(74409607,a,2),(74409607,b,1)})
(74409607,{(74409607,c,4),(74409607,d,5)})
(74409735,{(74409735,NA,159),(74409735,,158)})
How can we generate the following from relation c above?
(74409607,{(2,a),(1,b),(4,c),(5,d)})
(74409735,{(159,NA),(158,)})

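Swap the second and third columns while enumerating each group. Because c was built by grouping b, the bag inside c is named b, not a, and the projection has to happen in a nested foreach to keep each pair together in one bag:
d = foreach c { swapped = foreach b generate $2, $1; generate group, swapped; };
(Writing generate group, b.$2, b.$1 instead would produce two separate bags per group rather than one bag of pairs.)
Note: When you group by $0 and dump c, you should be getting: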
(74409607,{(74409607,a,2),(74409607,b,1),(74409607,c,4),(74409607,d,5)})
(74409735,{(74409735,NA,159),(74409735,,158)})

Are you sure that the output you posted is what you obtained from that dump statement?
From what I see, this should be the output:
(74409607,{(74409607,d,5),(74409607,c,4),(74409607,b,1),(74409607,a,2)})
(74409735,{(74409735,,158),(74409735,NA,159)})
Coming to your question: the GROUP operator in Pig outputs one tuple per key, made up of the grouped column (the field named group) and a bag holding all the rows that have that same value.
You can, however, fetch the required fields from that bag through relation c.

Related

transform data with multiple fields in pig

I have some data in the following form:
(102,(727,103,895))
(102,(105,255))
Does anyone know how to transform this data into the following form in Pig?
(102,727)
(102,103)
(102,895)
(102,105)
(102,255)
Use FLATTEN(). Assuming you have relation B with two fields, the second being a bag:
C = foreach B generate $0, FLATTEN($1);
DUMP C;
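FLATTEN emits one row per element only when the flattened field is a bag; applied to a tuple, it expands the tuple into extra columns of the same row. A minimal sketch of the bag case, assuming a hypothetical tab-separated file data.txt whose second field is written as a bag (the file name, field names and schema are illustrative, not taken from the question):

```pig
-- Hypothetical input 'data.txt' (tab-separated), second field stored as a bag:
-- 102	{(727),(103),(895)}
-- 102	{(105),(255)}
B = LOAD 'data.txt' AS (id:int, vals:bag{t:tuple(v:int)});
-- FLATTEN on a bag emits one output row per tuple in the bag
C = FOREACH B GENERATE id, FLATTEN(vals);
DUMP C;
```

If the second field really is a tuple, it would have to be turned into a bag first, for example with TOBAG when the number of elements is fixed.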

PIG filter out rows with improper number of columns

I have simple data loaded in a:
dump a
ahoeh,1,e32
hello,2,10
ho,3
I need to filter out all rows with number of columns/fields different than 3. How to do it?
In other words result should be:
dump results
ahoeh,1,e32
hello,2,10
I know there should be a FILTER built-in function. However I cannot figure out what condition (number of columns =3) should be defined.
Thanks!
Can you try this?
input
ahoeh,1,e32
hello,2,10
ho,3
3,te,0
aa,3,b
y,h,3
3,3,3
3,3,3,1,2,3,3,,,,,,4,44,6
PigScript1:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,','));
C = FOREACH B GENERATE COUNT(TOBAG(*)),$0..;
D = FILTER C BY $0==3;
E = FOREACH D GENERATE $1..;
DUMP E;
PigScript2:
A = LOAD 'input' USING PigStorage(',');
B = FOREACH A GENERATE COUNT(TOBAG(*)),$0..;
C = FILTER B BY (int)$0==3;
D = FOREACH C GENERATE $1..;
DUMP D;
Output:
(ahoeh,1,e32)
(hello,2,10)
(3,te,0)
(aa,3,b)
(y,h,3)
(3,3,3)
(It seems that I don't have enough karma to comment; that's why this is posted as a new answer.)
The accepted answer doesn't quite behave as expected if null/empty string is a valid field value; you need to use COUNT_STAR instead of COUNT to count empty/null fields in your schema.
See: https://pig.apache.org/docs/r0.9.1/func.html#count-star
For example, given the following input data:
1,2,3
1,,3
and this Pig script:
a = load 'input' USING PigStorage(',');
counted = foreach a generate COUNT_STAR(TOBAG(*)), $0..;
filtered = filter counted by $0 == 3;
result = foreach filtered generate $1..;
The filtered alias will contain both rows. The difference is that COUNT({(1),(),(3)}) returns 2 while COUNT_STAR({(1),(),(3)}) returns 3.
I see two ways to do this:
First, I think you can rephrase the filter, as it boils down to: give me all lines that do not contain a NULL value. For lots of columns, writing this filter statement is rather tedious.
Second, you could convert your columns into a bag per line using TOBAG (http://pig.apache.org/docs/r0.12.1/func.html#tobag), then write a UDF that checks the input bag for null tuples and returns true or false, and use that UDF in the filter statement.
Either way, some tedium is required, I think.
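The first option could look roughly like the sketch below. It assumes a three-field schema with illustrative field names f1 to f3; with a declared schema, Pig pads short rows with nulls, so a row such as ho,3 fails the null check.

```pig
-- Field names f1..f3 are illustrative; the question does not name its columns
a = LOAD 'input' USING PigStorage(',') AS (f1:chararray, f2:chararray, f3:chararray);
-- Rows with fewer than three fields are padded with nulls and rejected here
complete = FILTER a BY f1 IS NOT NULL AND f2 IS NOT NULL AND f3 IS NOT NULL;
DUMP complete;
```

This has limits: rows with more than three fields are truncated to three and kept, and rows with a genuinely empty field (for example 1,,3) are dropped, so it only fits data where neither case occurs.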

Pig Grouping Functions

I would like to find the item most recently bought by each person. Assume that the same person can buy many items.
Below are the input details:
kumar,2014-09-30,television
kumar,2014-07-27,smartphone
Andrew,2014-06-21,camera
Andrew,2014-05-20,car
I need the output as below:
kumar,2014-09-30,television
Andrew,2014-06-21,camera
I wrote a Pig script up to this point, but I don't know how to proceed after that. Can somebody help me?
A = LOAD 'records.txt' USING PigStorage(',') AS(name:chararray,date:chararray,item:chararray);
B = GROUP A BY name;
C = FOREACH B GENERATE group,MAX(A.date);
But I need to get the item that was most recently purchased by each person. How do I get that? If I apply GROUP, I am supposed to use only aggregate functions in Pig.
How do I get the respective item that was purchased?
Use bags and ORDER BY in a nested foreach. It will use only one MR job and is more in the Apache Pig style.
A = LOAD 'input.txt' USING PigStorage(',') AS(name:chararray,date:chararray,item:chararray);
B = GROUP A BY name;
C = FOREACH B {
    ordered = ORDER A BY date DESC; -- this triggers a secondary sort, which optimises the execution
    latest = LIMIT ordered 1;
    GENERATE FLATTEN(latest); -- advantage of Pig: all columns are preserved, not dropped as in a SQL GROUP BY
};
DUMP C;
Also, using $0, $1 etc. is convenient, but imagine a script with hundreds of lines and tens of GROUP BY and JOIN operations that project using '$': it is a nightmare to follow the flow of information/columns through such a script. The time wasted on maintaining and changing such scripts is huge.
I hope this works for you.
input.txt
kumar,2014-09-30,television
kumar,2014-07-27,smartphone
Andrew,2014-06-21,camera
Andrew,2014-05-20,car
PigScript:
A = LOAD 'input.txt' USING PigStorage(',') AS(name:chararray,date:chararray,item:chararray);
B = GROUP A BY name;
C = FOREACH B GENERATE group,FLATTEN(MAX($1.date));
D = JOIN A BY (name, date), C BY ($0, $1); -- join on name as well, so equal dates of different people cannot cross-match
E = FOREACH D GENERATE $0,$1,$2;
DUMP E;
Output:
(Andrew,2014-06-21,camera)
(kumar,2014-09-30,television)

Sending relation to UDF functions

Can I send a relation to a Pig UDF as input? A relation can have multiple tuples in it. How do we read each tuple one by one in a Pig UDF?
OK. Below is my sample input file.
Surender,HDFC,60000,CTS
Raja,AXIS,80000,TCS
Raj,HDFC,70000,TCS
Kumar,AXIS,70000,CTS
Remya,AXIS,40000,CTS
Arun,SBI,30000,TCS
Vimal,SBI,10000,TCS
Ankur,HDFC,80000,CTS
Karthic,HDFC,95000,CTS
Sandhya,AXIS,60000,CTS
Amit,SBI,70000,CTS
myinput = LOAD '/home/cloudera/surender/laurela/balance.txt' USING PigStorage(',') AS(name:chararray,bank:chararray,amt:long,company:chararray);
grouped = GROUP myinput BY company;
All I need is the details of the highest-paid employee in each company. How do I use a UDF for that?
I need something like this:
CTS Karthic,HDFC,95000,CTS
TCS Raja,AXIS,80000,TCS
Can someone help me with this?
This script will give you the results you want:
A = LOAD '/home/cloudera/surender/laurela/balance.txt' USING PigStorage(',') AS(name:chararray,bank:chararray,amt:long,company:chararray);
B = GROUP A BY (company);
topResults = FOREACH B {result = TOP(1, 2, A); GENERATE FLATTEN(result);}
dump topResults;
Explanation:
First we group A on the basis of company. So B is:
(CTS,{(Surender,HDFC,60000,CTS),(Kumar,AXIS,70000,CTS),(Remya,AXIS,40000,CTS),(Ankur,HDFC,80000,CTS),(Karthic,HDFC,95000,CTS),(Sandhya,AXIS,60000,CTS),(Amit,SBI,70000,CTS)})
(TCS,{(Raja,AXIS,80000,TCS),(Raj,HDFC,70000,TCS),(Arun,SBI,30000,TCS),(Vimal,SBI,10000,TCS)})
Then, for each tuple in B, we generate result, which is the top 1 record of the bag A inside B, ranked by the value of column number 2, i.e. amt. Columns are numbered from 0.
Note
Your data has extra spaces after the company name. Please remove the extra spaces or use the following data:
Surender,HDFC,60000,CTS
Raja,AXIS,80000,TCS
Raj,HDFC,70000,TCS
Kumar,AXIS,70000,CTS
Remya,AXIS,40000,CTS
Arun,SBI,30000,TCS
Vimal,SBI,10000,TCS
Ankur,HDFC,80000,CTS
Karthic,HDFC,95000,CTS
Sandhya,AXIS,60000,CTS
Amit,SBI,70000,CTS
You don't need to write a UDF to do this; you can simply use Pig's built-in TOP function: http://pig.apache.org/docs/r0.11.0/func.html#topx
Here is an example of code that should work (not tested):
grouped = GROUP myinput BY company;
result = FOREACH grouped GENERATE group, FLATTEN(TOP(1, 2, myinput));

How can I split a Pig tuple into subtuples?

Here are the tuples that I am manipulating in a Pig script:
DUMP A
(4,20,53,31)
(21,3,40,16)
(15,51,12,3)
I would like to transform this relation (A) into another relation (B) such as:
DUMP B
(4,20)
(4,53)
(4,31)
(21,3)
(21,40)
(21,16)
(15,51)
(15,12)
(15,3)
That means keeping the first field in every tuple and producing one tuple for each of the other fields. In the previous example, each tuple must give 3 new tuples. I have a solution that gets:
DUMP B
(4,20)
(21,3)
(15,51)
(4,53)
(21,40)
(15,12)
(4,31)
(21,16)
(15,3)
This is the right result but in the wrong order (I use the FOREACH operator once per field). I could get the right order by adding a field to each tuple and then using the ORDER operator, but I think there is a simpler way to do so.
Any idea?
Thank you.
You could do:
-- T is the name of the tuple, and v[1-4] are the values in the tuple
B = FOREACH A GENERATE T.v1, FLATTEN(TOBAG(T.v2, T.v3, T.v4)) ;
If the values do not have names, you could also do:
B = FOREACH A GENERATE T.$0, FLATTEN(TOBAG(T.$1, T.$2, T.$3)) ;
Output:
(4,20)
(4,53)
(4,31)
(21,3)
(21,40)
(21,16)
(15,51)
(15,12)
(15,3)
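If the four values are top-level fields of A rather than members of a tuple field T, the same idea applies without the T. prefix. A minimal sketch, assuming a comma-separated input file and illustrative field names v1 to v4 (neither appears in the question):

```pig
-- 'input' and the names v1..v4 are assumptions made for illustration
A = LOAD 'input' USING PigStorage(',') AS (v1:int, v2:int, v3:int, v4:int);
-- TOBAG wraps the three remaining values in a bag; FLATTEN then emits
-- one row per value, each paired with v1
B = FOREACH A GENERATE v1, FLATTEN(TOBAG(v2, v3, v4));
DUMP B;
```

Because each input row is expanded in place, the rows derived from one tuple stay together, which avoids the extra ORDER step. Bags are formally unordered, though, so the order of the three rows within each group should not be relied on.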