In Pig Latin, I am not able to load data as multiple tuples, please advise - apache-pig

I am not able to load the data as multiple tuples and am not sure what mistake I am making. Please advise.
data.txt
vineet 1 pass Govt
hisham 2 pass Prvt
raj 3 fail Prvt
I want to load them as 2 tuples.
A = LOAD 'data.txt' USING PigStorage('\t') AS (T1:tuple(name:bytearray, no:int), T2:tuple(result:chararray, school:chararray));
OR
A = LOAD 'data.txt' USING PigStorage('\t') AS (T1:(name:bytearray, no:int), T2:(result:chararray, school:chararray));
dump A;
The output below is displayed, one line per record. I don't know why I am not able to read the actual data from data.txt.
(,)
(,)
(,)

As the input data is not stored as tuples, we won't be able to read it directly into a tuple.
One feasible approach is to read the flat fields first and then form tuples from the required fields.
Pig Script :
A = LOAD 'a.csv' USING PigStorage('\t') AS (name:chararray,no:int,result:chararray,school:chararray);
B = FOREACH A GENERATE (name,no) AS T1:tuple(name:chararray, no:int), (result,school) AS T2:tuple(result:chararray, school:chararray);
DUMP B;
Input : a.csv
vineet 1 pass Govt
hisham 2 pass Prvt
raj 3 fail Prvt
Output : DUMP B:
((vineet,1),(pass,Govt))
((hisham,2),(pass,Prvt))
((raj,3),(fail,Prvt))
Output : DESCRIBE B :
B: {T1: (name: chararray,no: int),T2: (result: chararray,school: chararray)}
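The same grouping can also be written with the built-in TOTUPLE function instead of parenthesised field lists. This is a minimal sketch, assuming the same tab-separated a.csv as above; the aliases A and B are simply carried over from the script.

```pig
-- Load the flat, tab-separated fields first.
A = LOAD 'a.csv' USING PigStorage('\t') AS (name:chararray, no:int, result:chararray, school:chararray);
-- TOTUPLE packs its arguments into one tuple per call.
B = FOREACH A GENERATE TOTUPLE(name, no) AS T1, TOTUPLE(result, school) AS T2;
DUMP B;
```

Loading straight into T1 and T2 should only work if each field in the file were already written in Pig's tuple text form, for example (vineet,1) followed by a tab and then (pass,Govt).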

Related

Pig Load with Schema giving error

I have a file called data_tuple_bag.txt on hdfs with the following content:
10,{(1,2),(2,3)}
11,{(4,5),(6,7)}
I am creating a relation as below :
D = LOAD '/user/pig_demo/data_tuple_bag.txt' AS (f1:int,B:{T:(t1:int,t2:int)});
When I DUMP it, I get ACCESSING_NON_EXISTENT_FIELD 2 time(s) as well as FIELD_DISCARDED_TYPE_CONVERSION_FAILED 2 time(s) and an empty output.
I changed the relation to :
D = LOAD '/user/pig_demo/data_tuple_bag.txt' USING PigStorage(',') AS (f1:int,B:{T:(t1:int,t2:int)});
Now it's only giving FIELD_DISCARDED_TYPE_CONVERSION_FAILED 2 time(s) and output as :
(10,)
(11,)
I have another file, data_only_bag.txt, with the following in it:
{(1,2),(2,3)}
{(4,5),(6,7)}
The relation is defined as :
A = LOAD '/user/pig_demo/data_only_bag.txt' AS (B:{T:(t1:int,t2:int)});
And it works.
Now I am updating the data_only_bag.txt as below:
10,{(1,2),(2,3)}
11,{(4,5),(6,7)}
And the relation is :
A = LOAD '/user/pig_demo/data_only_bag.txt' AS (f1:int,B:{T:(t1:int,t2:int)});
I am getting :
(,)
(,)
When I DUMP it, I get ACCESSING_NON_EXISTENT_FIELD 2 time(s) as well as FIELD_DISCARDED_TYPE_CONVERSION_FAILED 2 time(s) and an empty output.
Now I am updating the relation to :
A = LOAD '/user/pig_demo/data_only_bag.txt' USING PigStorage(',') AS (f1:int,B:{T:(t1:int,t2:int)});
Now it's only giving FIELD_DISCARDED_TYPE_CONVERSION_FAILED 2 time(s) and output as :
(10,)
(11,)
Same as before.
Can anybody tell me what I am doing wrong here?
Thanks in Advance.
Pig failed to parse the input with the provided schema.
Try this:
D = LOAD '/user/pig_demo/data_tuple_bag.txt' USING PigStorage(',')
AS (f1:int, B: {T1: (t1:int, t2:int),T2: (t1:int, t2:int)});
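A likely cause of the (10,) output, offered as an assumption rather than a confirmed diagnosis: PigStorage(',') splits each line on every comma, including the commas inside the bag, so the second field arrives as the fragment {(1 and cannot be converted to a bag. That would match the observation in the question that the bag-only file loads correctly with the default tab delimiter. If this is what is happening, one way around it is to separate the top-level fields with a character that does not occur inside the bag. The sketch below assumes a hypothetical copy of the file, data_tuple_bag_tab.txt, in which the first comma on each line has been replaced by a tab.

```pig
-- Hypothetical file contents, with <TAB> standing for a literal tab character:
--   10<TAB>{(1,2),(2,3)}
--   11<TAB>{(4,5),(6,7)}
D = LOAD '/user/pig_demo/data_tuple_bag_tab.txt' USING PigStorage('\t')
    AS (f1:int, B:{T:(t1:int, t2:int)});
DUMP D;
```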

How to loop through tuples in a Bag, Pig

I am new to pig scripting.
I have an input, (A,B,{(XYZ,123,CDE)})
I am looking to loop through the bag inside and print the following records.
(A,B,XYZ)
(A,B,123)
(A,B,CDE)
Can someone please help me out!
Let's say X is your relation and it contains (A,B,{(XYZ,123,CDE)}). ToBag converts its arguments into a bag, and FLATTEN unnests tuples and bags.
Y = FOREACH X GENERATE $0,$1,ToBag(FLATTEN($2));
Solved!!
Let us load the file below (tab-separated):
A B {(XYZ,123,CDE)}
input_plus_bag = load '' USING PigStorage() AS (entry1:chararray, entry2:chararray, bag1:bag{(te1:chararray, te2:int, te3:chararray)});
intermed_output = foreach input_plus_bag generate entry1, entry2, FLATTEN(bag1);
Dump intermed_output;
This will give
(A,B,XYZ,123,CDE)
DESCRIBE intermed_output;
intermed_output: {entry1: chararray,entry2: chararray,bag1::te1: chararray,bag1::te2: int,bag1::te3: chararray}
Now perform the TOBAG operation:
intermed2_output = foreach intermed_output generate entry1, entry2, TOBAG(bag1::te1,bag1::te2,bag1::te3);
DUMP intermed2_output;
This results in the output below:
(A,B,{(XYZ),(123),(CDE)})
The final step is to FLATTEN the bag:
final_output = foreach intermed2_output generate entry1, entry2, FLATTEN($2);
And we have our desired output:
(A,B,XYZ)
(A,B,123)
(A,B,CDE)
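The TOBAG and FLATTEN steps can probably be folded into a single statement, since FLATTEN accepts a bag-valued expression at the top level of GENERATE. This is a sketch that reuses the intermed_output relation from above; final_output_short is a name made up here.

```pig
-- Build the bag of single-field tuples and unnest it in one projection.
final_output_short = FOREACH intermed_output GENERATE entry1, entry2,
    FLATTEN(TOBAG(bag1::te1, bag1::te2, bag1::te3));
DUMP final_output_short;
```

This should yield the same three records as the step-by-step version, at the cost of not being able to inspect the intermediate bag.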

Apache Pig reading name value pairs in data file

I have a sample Pig script that reads a CSV file and dumps it to the screen. However, my real data consists of name-value pairs. How can I read in a line of name-value pairs and split each pair, using the name as the field name and the value as the field value?
data:
1,Smith,Bob,Business Development
2,Doe,John,Developer
3,Jane,Sally,Tester
script:
data = LOAD 'example-data.txt' USING PigStorage(',')
AS (id:chararray, last_name:chararray,
first_name:chararray, role:chararray);
DESCRIBE data;
DUMP data;
output:
data: {id: chararray,last_name: chararray,first_name: chararray,role: chararray}
(1,Smith,Bob,Business Development)
(2,Doe,John,Developer)
(3,Jane,Sally,Tester)
However, given the following input (as name-value pairs), how could I process the data to get the same "data object"?
id=1,last_name=Smith,first_name=Bob,role=Business Development
id=2,last_name=Doe,first_name=John,role=Developer
id=3,last_name=Jane,first_name=Sally,role=Tester
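Refer to STRSPLIT. Load each name=value pair as one chararray field, then split each pair on '=' and keep only the values: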
A = LOAD 'example-data.txt' USING PigStorage(',') AS (f1:chararray,f2:chararray,f3:chararray, f4:chararray);
B = FOREACH A GENERATE
FLATTEN(STRSPLIT(f1,'=',2)) as (n1:chararray,v1:chararray),
FLATTEN(STRSPLIT(f2,'=',2)) as (n2:chararray,v2:chararray),
FLATTEN(STRSPLIT(f3,'=',2)) as (n3:chararray,v3:chararray),
FLATTEN(STRSPLIT(f4,'=',2)) as (n4:chararray,v4:chararray);
C = FOREACH B GENERATE v1,v2,v3,v4;
DUMP C;
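To end up with the same field names as the original schema, the values can be renamed in the last projection. This is a sketch that builds on relation B from the script above and assumes every line lists the pairs in the same order (id, last_name, first_name, role); the relation name named is made up here.

```pig
-- Keep only the value halves and give them the names from the original schema.
named = FOREACH B GENERATE v1 AS id, v2 AS last_name, v3 AS first_name, v4 AS role;
DESCRIBE named;
DUMP named;
```

The names n1 to n4 are discarded, so a line whose pairs appear in a different order would put values into the wrong columns.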

Pig Flatten without nulls

I have a pig bag with
(1139-50052,Aquatic,Consumer,6,makarina,2,{(),(Unknown)})
(1139-50052,Aquatic,Consumer,6,jabong,2,{(),(),(),(Unknown)})
I need to flatten it without nulls.
(1139-50052,Aquatic,Consumer,6,makarina,2,Unknown)
(1139-50052,Aquatic,Consumer,6,jabong,2,Unknown)
Please advise.
One option is to pass the bag to the BagToString() function, so that null values are discarded, and then split the resulting string on the delimiter '_'.
FLATTEN(STRSPLIT(BagToString(BagName),'_+'))
It also works for combinations other than your input; see the example below.
input
1139-50052 Aquatic Consumer 6 makarina 2 {(),(Unknown)}
1139-50052 Aquatic Consumer 6 jabong 2 {(),(),(),(Unknown)}
1139-50052 Aquatic Consumer 6 test1 2 {(unknown1),(),(),(Unknown2)}
1139-50052 Aquatic Consumer 6 test2 2 {(unknown1),(unknown2),(),(Unknown3)}
PigScript:
A = LOAD 'input' USING PigStorage() AS (f0,f1,f2,f3,f4,f5,B:{T:(f7)});
B = FOREACH A GENERATE f0,f1,f2,f3,f4,f5,FLATTEN(STRSPLIT(BagToString(B),'_+'));
DUMP B;
Output:
(1139-50052,Aquatic,Consumer,6,makarina,2,Unknown)
(1139-50052,Aquatic,Consumer,6,jabong,2,Unknown)
(1139-50052,Aquatic,Consumer,6,test1,2,unknown1,Unknown2)
(1139-50052,Aquatic,Consumer,6,test2,2,unknown1,unknown2,Unknown3)
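If one row per non-null value is preferred over one wide row, a nested FOREACH with a FILTER is another option. This is a sketch, assuming the empty tuples () are read with a null f7 (not verified here against the data above); the names C and non_null are made up for illustration.

```pig
A = LOAD 'input' USING PigStorage() AS (f0,f1,f2,f3,f4,f5,B:{T:(f7)});
C = FOREACH A {
    -- Keep only the tuples of bag B whose single field is non-null.
    non_null = FILTER B BY f7 IS NOT NULL;
    -- FLATTEN emits one output row per remaining tuple.
    GENERATE f0, f1, f2, f3, f4, f5, FLATTEN(non_null);
};
DUMP C;
```

Note that a record whose bag contains no non-null value produces no output row at all, because flattening an empty bag drops the record.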

PIG - Defining the delimiter used for a bag after a GROUP function

In Pig, I'm loading and grouping two files. I end up with something like this:
A = LOAD 'File1' Using PigStorage('\t');
B = LOAD 'File2' Using PigStorage('\t');
C = COGROUP A BY $0, B BY $0;
STORE C INTO 'Output' USING PigStorage('\t');
Output:
123 {(123,XYZ,456)} {(123,QRS,889,QWER)}
Where the first field is the group key, the first bag is from File1, and the next bag is from File2. These three sections are delimited from each other using whatever I identified in the PigStorage('\t') clause.
Question: How do I force Pig to delimit the bags by something other than a comma? In my real data, there are commas present and so I need to delimit by tabs instead.
Desired output:
123 {(123\tXYZ\t456)} {(123\tQRS\t889\tQWER)}
This seems to be an open issue (as of June 2013) in Pig. See the corresponding JIRA for more details. Until the issue is fixed, you can change your input data.
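One way to read "change your input data" is to replace the commas embedded in field values before grouping, so that the commas Pig writes inside bags stay unambiguous. This is only a sketch: the field names k, name and val are hypothetical, File1 is assumed to have three columns as in the sample output, and the semicolon is an arbitrary replacement character.

```pig
-- Hypothetical schema for File1; adjust to the real columns.
A = LOAD 'File1' USING PigStorage('\t') AS (k:chararray, name:chararray, val:chararray);
-- Swap embedded commas for semicolons before the COGROUP.
A_clean = FOREACH A GENERATE k, REPLACE(name, ',', ';') AS name, REPLACE(val, ',', ';') AS val;
```

File2 would need the same treatment, and A_clean would then take the place of A in the COGROUP statement.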