Multiple records to a single record in PIG - apache-pig

I have an input file like the one below.
1,Cust_name1,addr_type,Addr1
1,Cust_name1,addr_type,Addr2
2,Cust_name3,addr_type,Addr1
2,Cust_name3,addr_type,Addr3
I want to convert this to Avro format.
The output should look like:
1,Cust_name1,{(addr_type,Addr1),(addr_type,Addr2)}
2,Cust_name3,{(addr_type,Addr1),(addr_type,Addr3)}
For each customer I want to generate a single Avro message, with the repeated elements stored in an array.

GROUP by id and customer name. To store the result in Avro format, use the AvroStorage class from piggybank.jar, which has to be registered in your script.
REGISTER /path/piggybank.jar;
A = LOAD 'data.txt' USING PigStorage(',') AS (id:int, name:chararray, addrtype:chararray, addr:chararray);
B = GROUP A BY (id, name);
STORE B INTO '/path/' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
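Storing B directly keeps the composite group key as a nested tuple and repeats id and name inside every bag entry. To get closer to the shape asked for in the question, a projection can be added before the STORE. This is a minimal sketch, assuming the field names id, name, addrtype and addr; the alias addresses is only illustrative, and the paths are the placeholders from the answer above.

```pig
REGISTER /path/piggybank.jar;
A = LOAD 'data.txt' USING PigStorage(',')
    AS (id:int, name:chararray, addrtype:chararray, addr:chararray);
B = GROUP A BY (id, name);
-- Unnest the (id, name) group key and keep only the address columns in the bag.
C = FOREACH B GENERATE FLATTEN(group) AS (id, name),
                       A.(addrtype, addr) AS addresses;
STORE C INTO '/path/' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
```

Each record of C then has the form (id, name, {(addrtype, addr), ...}), so every customer becomes one record with the addresses held in a single bag.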

Related

Deriving fields data from grouped data in pig

I am new to Pig. Could someone please help me? Below is the code:
a = load 'stage.temp' USING org.apache.hive.hcatalog.pig.HCatLoader();
b = limit a 10;
c = group b by $0;
dump c;
(74409607,{(74409607,a,2),(74409607,b,1)})
(74409607,{(74409607,c,4),(74409607,d,5)})
(74409735,{(74409735,NA,159),(74409735,,158)})
How can we generate the following from relation c above?
(74409607,{(2,a),(1,b),(4,c),(5,d)})
(74409735,{(159,NA),(158,)})
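Swap the second and third columns while projecting from the bag in each group. The bag is named after the grouped relation (b here), and both columns have to be projected together so that they stay paired in one bag:
d = foreach c generate group, b.($2, $1);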
Note: When you group by $0 and dump c you should be getting
(74409607,{(74409607,a,2),(74409607,b,1),(74409607,c,4),(74409607,d,5)})
(74409735,{(74409735,NA,159),(74409735,,158)})
Are you sure that the output you posted is what you obtained from that dump statement?
From what I can see, this should be the output:
(74409607,{(74409607,d,5),(74409607,c,4),(74409607,b,1),(74409607,a,2)})
(74409735,{(74409735,,158),(74409735,NA,159)})
Coming to your question: the GROUP operator in Pig produces one record per key, made up of the grouping column followed by a bag of all the rows that share that value.
You can, however, fetch just the required fields from that bag through relation c.

transform data with multiple fields in pig

I have some data in the following way:
(102,(727,103,895))
(102,(105,255))
Does anyone know how to transform this data into the following form in Pig?
(102,727)
(102,103)
(102,895)
(102,105)
(102,255)
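Use FLATTEN(). Assuming you have a relation B with two fields:
C = FOREACH B GENERATE $0, FLATTEN($1);
DUMP C;
Note that FLATTEN produces one output row per element only when the second field is a bag. Applied to a tuple, it expands the tuple's fields into columns of the same row instead.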

PIG: How to Dynamically Pass N parameter values

I have a growing list of IDs that I would like to pass dynamically into a Pig script to help process data for a single column.
I'm currently passing the param values manually, which isn't scalable.
Example command:
pig --param id1=123 id2=456 id3=789 get foo.pig
Example script:
A = load '$INPUT' using AvroStorage();
B = foreach A generate value.rawData#'id' as user_id:chararray;
C = FILTER B BY user_id == '$id1' or user_id == '$id2' OR user_id == '$id3';
DUMP C;
How can one dynamically pass N parameter values and have them applied to a relational operator for the same column?
If I had to solve this problem, I would:
1> Create a simple text file (say id.txt) and keep appending new ids to it.
2> Use id.txt inside the Pig script to JOIN with the $INPUT file, so that records are filtered out automatically if their id is not found:
A = load '$INPUT' using AvroStorage();
A = foreach A generate value.rawData#'id' as user_id:chararray;
B = load 'id.txt' using PigStorage() as (userId:chararray);
C = JOIN A by user_id, B by userId;
-- after above JOIN C will only contain records which has user_id in both files
DUMP C;
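If id.txt stays small enough to fit in memory, the same join can be run as a replicated (map-side) join. The following is a sketch under that assumption, also assuming one id per line in id.txt; the alias A_ids is hypothetical and only avoids reusing A.

```pig
A = LOAD '$INPUT' USING AvroStorage();
A_ids = FOREACH A GENERATE value.rawData#'id' AS user_id:chararray;
-- Assumption: id.txt holds one id per line and fits in memory.
B = LOAD 'id.txt' USING PigStorage() AS (userId:chararray);
-- In a replicated join the small relation must be listed last.
C = JOIN A_ids BY user_id, B BY userId USING 'replicated';
DUMP C;
```

With this approach, adding an id only means appending a line to id.txt; neither the script nor the command line has to change.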

Sending relation to UDF functions

Can I send a relation to a Pig UDF as input? A relation can have multiple tuples in it. How do we read each tuple one by one in a Pig UDF?
Below is my sample input file.
Surender,HDFC,60000,CTS
Raja,AXIS,80000,TCS
Raj,HDFC,70000,TCS
Kumar,AXIS,70000,CTS
Remya,AXIS,40000,CTS
Arun,SBI,30000,TCS
Vimal,SBI,10000,TCS
Ankur,HDFC,80000,CTS
Karthic,HDFC,95000,CTS
Sandhya,AXIS,60000,CTS
Amit,SBI,70000,CTS
myinput = LOAD '/home/cloudera/surender/laurela/balance.txt' USING PigStorage(',') AS(name:chararray,bank:chararray,amt:long,company:chararray);
grouped = GROUP myinput BY company;
All I need is the details of the highest paid employee in each company. How do I use a UDF for that?
I need something like this:
CTS Karthic,HDFC,95000,CTS
TCS Raja,AXIS,80000,TCS
Can someone help me with this?
This script will give you the results you want :
A = LOAD '/home/cloudera/surender/laurela/balance.txt' USING PigStorage(',') AS(name:chararray,bank:chararray,amt:long,company:chararray);
B = GROUP A BY (company);
topResults = FOREACH B {result = TOP(1, 2, A); GENERATE FLATTEN(result);}
dump topResults;
Explanation:
First we group A by company. So B is:
(CTS,{(Surender,HDFC,60000,CTS),(Kumar,AXIS,70000,CTS),(Remya,AXIS,40000,CTS),(Ankur,HDFC,80000,CTS),(Karthic,HDFC,95000,CTS),(Sandhya,AXIS,60000,CTS),(Amit,SBI,70000,CTS)})
(TCS,{(Raja,AXIS,80000,TCS),(Raj,HDFC,70000,TCS),(Arun,SBI,30000,TCS),(Vimal,SBI,10000,TCS)})
Then, for each tuple in B, we generate another tuple, result, which is the top 1 record from the bag A inside B, ranked by the value of column number 2, i.e. amt. Columns are numbered from 0.
Note
Your data has extra spaces after the company name. Please remove the extra spaces or use the following data:
Surender,HDFC,60000,CTS
Raja,AXIS,80000,TCS
Raj,HDFC,70000,TCS
Kumar,AXIS,70000,CTS
Remya,AXIS,40000,CTS
Arun,SBI,30000,TCS
Vimal,SBI,10000,TCS
Ankur,HDFC,80000,CTS
Karthic,HDFC,95000,CTS
Sandhya,AXIS,60000,CTS
Amit,SBI,70000,CTS
You don't need to write a UDF for this; you can simply use Pig's built-in TOP function: http://pig.apache.org/docs/r0.11.0/func.html#topx
Here is an example of code that should work (not tested):
grouped = GROUP myinput BY company;
result = FOREACH grouped GENERATE group, FLATTEN(TOP(1, 2, myinput));
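An alternative to TOP is a nested ORDER and LIMIT inside the FOREACH block, which refers to amt by name rather than by column position. This is a minimal, untested sketch that reuses the path and schema from the question; the aliases sorted, top1 and highest are illustrative.

```pig
myinput = LOAD '/home/cloudera/surender/laurela/balance.txt' USING PigStorage(',')
          AS (name:chararray, bank:chararray, amt:long, company:chararray);
grouped = GROUP myinput BY company;
highest = FOREACH grouped {
    -- Sort each company's bag by salary, highest first, and keep one row.
    sorted = ORDER myinput BY amt DESC;
    top1 = LIMIT sorted 1;
    GENERATE group, FLATTEN(top1);
};
DUMP highest;
```

If two employees in a company share the highest amount, only one of them is kept, and which one is not specified.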

Apache Pig load entire relationship into UDF

I have a Pig script that involves two Pig relations, let's say A and B. A is a small relation and B is a big one. My UDF should load all of A into memory on each machine and then use it while processing B. Currently I do it like this:
A = foreach smallRelation Generate ...
B = foreach largeRelation Generate propertyOfB;
store A into 'templocation';
C = foreach B Generate CustomUdf(propertyOfB);
I then have every machine load from 'templocation' to get A. This works, but I have two problems with it.
My understanding is that I should be using the HDFS cache somehow, but I'm not sure how to load a relation directly into the HDFS cache.
When I reload the file in my UDF, I have to write logic to parse the output of A that was written to file, when I'd rather be working directly with bags and tuples (is there a built-in Pig Java function to parse strings back into Bag/Tuple form?).
Does anyone know how it should be done?
Here's a trick that will work for you.
First do a GROUP ALL on A, which "bags" all the data in A into one field. Then artificially add a common field to both A and B and join them on it. This way, for each tuple in the enhanced B, you will have the full data of A available for your UDF to use.
It's like this:
(say originally A has fields fa1, fa2, fa3, and B has fb1, fb2)
-- add an artificial join key with value 'xx'
B_aux = FOREACH B GENERATE 'xx' AS join_key, fb1, fb2;
A_all = GROUP A ALL;
-- $1 of A_all is the bag holding every tuple of A
A_aux = FOREACH A_all GENERATE 'xx' AS join_key, $1 AS a_bag;
A_B_JOINED = JOIN B_aux BY join_key, A_aux BY join_key USING 'replicated';
C = FOREACH A_B_JOINED GENERATE CustomUdf(fb1, fb2, a_bag);
Since this is a replicated join, it is also executed entirely on the map side.