Can you please let me know if I have {',') delimited field and having Bag Data ,How to read. I am getting below error.
Input Data.
Jorge Posada Yankees|{(Catcher),(Designated_hitter)}|[games#1594,hit_by_pitch#65,grand_slams#7]
Landon Powell Oakland|{(Catcher),(First_baseman)}|[on_base_percentage#0.297,games#26,home_runs#7]
Martin Prado Atlanta|{(Second_baseman),(Infielder),(Left_fielder)},[games#258,hit_by_pitch#3]
bfile= LOAD '/home/cloudera/basketball.txt' using PigStorage('|')as(name:chararray,team:chararray,pos:bag{t:(p:chararray)},bat:map[]);
grunt> players = load 'basketball.txt' using PigStorage('|')as (name:chararray, team:chararray,position:bag{t:(p:chararray)}, bat:map[]);
2014-11-13 04:49:48,144 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 27, column 117> mismatched input ';' expecting RIGHT_PAREN
Details at logfile: /home/cloudera/pig_1415835089181.log
Sanjeeb
For the above input, regex is not required, you can access all the values using the existing schema itself.
input.txt
Jorge Posada |Yankees|{(Catcher),(Designated_hitter)}|[games#1594,hit_by_pitch#65,grand_slams#7]
Landon Powell |Oakland|{(Catcher),(First_baseman)}|[on_base_percentage#0.297,games#26,home_runs#7]
Martin Prado |Atlanta|{(Second_baseman),(Infielder),(Left_fielder)}|[games#258,hit_by_pitch#3]
Pigscript:
bfile= LOAD 'input.txt' using PigStorage('|') as (name:chararray,team:chararray,pos:bag{t:(p:chararray)},bat:map[]);
--Print the name and team
B = FOREACH bfile GENERATE name,team;
--DUMP B;
--Print the player and his position
C = FOREACH bfile GENERATE name,pos.(p);
--DUMP C;
--Print the player and key/value of games and hit_by_pitch
D = FOREACH bfile GENERATE name,bat#'games',bat#'hit_by_pitch';
--DUMP D;
Output of DUMP B:
(Jorge Posada ,Yankees)
(Landon Powell ,Oakland)
(Martin Prado ,Atlanta)
Output Of DUMP C:
(Jorge Posada ,{(Catcher),(Designated_hitter)})
(Landon Powell ,{(Catcher),(First_baseman)})
(Martin Prado ,{(Second_baseman),(Infielder),(Left_fielder)})
Output Of DUMP D:
(Jorge Posada ,1594,65)
(Landon Powell ,26,)
(Martin Prado ,258,3)
In the bag, if you need multiple fields then declare and access like this
pos:bag{t:(p:chararray,q:charrarray)}
FOREACH bfile GENERATE name,pos.(p,q);
Related
I use the exact sample from official document:
I have data.txt:
(3,8,9) (mary,19)
(1,4,7) (john,18)
(2,5,8) (joe,18)
I run:
A = LOAD 'data.txt' AS (F:tuple(f1:int,f2:int,f3:int),T:tuple(t1:chararray,t2:int));
dump A
I always got:
((3,8,9),)
((1,4,7),)
((2,5,8),)
The second nested tuple never got loaded. I tried in both versions of 0.16.0 and 0.17.0.
The problem should be with the data file you created. There should be tab in between both tuples as separator in the data file while creating it. If there was a space then we need to change the load query accordingly.
a)With tab(\t) as delimiter or separator.
grunt> A = LOAD '/home/ec2-user/data' AS (F:tuple(f1:int,f2:int,f3:int),T:tuple(t1:chararray,t2:int));
grunt> DESCRIBE A;
A: {F: (f1: int,f2: int,f3: int),T: (t1: chararray,t2: int)}
grunt> dump A;
((3,8,9),(mary,19))
((1,4,7),(john,18))
((2,5,8),(joe,18))
b)With single space( ) as delimiter or seperator.
grunt> A = LOAD '/home/ec2-user/data' AS (F:tuple(f1:int,f2:int,f3:int),T:tuple(t1:chararray,t2:int));
grunt> DESCRIBE A;
A: {F: (f1: int,f2: int,f3: int),T: (t1: chararray,t2: int)}
grunt> dump A;
((3,8,9),)
((1,4,7),)
((2,5,8),)
#Use PigStorage(' ') in case if you still want to use space as delimiter for file.
grunt> A = LOAD '/home/ec2-user/data' USING PigStorage(' ') AS (F:tuple(f1:int,f2:int,f3:int),T:tuple(t1:chararray,t2:int));
grunt> DESCRIBE A;
A: {F: (f1: int,f2: int,f3: int),T: (t1: chararray,t2: int)}
grunt> dump A;
((3,8,9),(mary,19))
((1,4,7),(john,18))
((2,5,8),(joe,18))
I am new to pig scripting.
I have an input, (A,B,{(XYZ,123,CDE)})
I am looking to loop through the bag inside and print the following records.
(A,B,XYZ)
(A,B,123)
(A,B,CDE)
Can someone please help me out!
Lets say X is your relation and it has (A,B,{(XYZ,123,CDE)}).ToBag converts the expression into bags and FLATTEN unnests the tuples,bag.
Y = FOREACH X GENERATE $0,$1,ToBag(FLATTEN($2));
Solved!!
Let us load below file (Tab separated)
A B {(XYZ,123,CDE)}
input_plus_bag = load '' USING PigStorage() AS (entry1:chararray, entry2:chararray, bag1:bag{(te1:chararray, te2:int, te3:chararray)});
intermed_output = foreach input_plus_bag generate entry1, entry2, FLATTEN(bag1);
Dump intermed_output;
This will give
(A,B,XYZ,123,CDE)
DESCRIBE intermed_output;
intermed_output: {entry1: chararray,entry2: chararray,bag1::te1: chararray,bag1::te2: int,bag1::te3: chararray}
Now perform TOBAG operation
intermed2_output = foreach intermed_output generate entry1, entry2, TOBAG(bag1::te1,bag1::te2,bag1::te3);
DUMP intermed2_output;
This will result in below output:-
(A,B,{(XYZ),(123),(CDE)})
Now final step is FLATTEN the bag
final_output = foreach intermed2_output generate entry1, entry2, FLATTEN($2);
And we have our desired output:-
(A,B,XYZ)
(A,B,123)
(A,B,CDE)
i have a sample pig script with data that will read a csv file and dump it ot screen; however, my data has name value pairs. how can i read in a line of name value pairs and split the pairs using the name for the field and the value for the value?
data:
1,Smith,Bob,Business Development
2,Doe,John,Developer
3,Jane,Sally,Tester
script:
data = LOAD 'example-data.txt' USING PigStorage(',')
AS (id:chararray, last_name:chararray,
first_name:chararray, role:chararray);
DESCRIBE data;
DUMP data;
output:
data: {id: chararray,last_name: chararray,first_name: chararray,role: chararray}
(1,Smith,Bob,Business Development)
(2,Doe,John,Developer)
(3,Jane,Sally,Tester)
however, given the following input (as name value pairs); how could i process the data to get the same "data object"?
id=1,last_name=Smith,first_name=Bob,role=Business Development
id=2,last_name=Doe,first_name=John,role=Developer
id=3,last_name=Jane,first_name=Sally,role=Tester
Refer to STRSPLIT
A = LOAD 'example-data.txt' USING PigStorage(',') AS (f1:chararray,f2:chararray,f3:chararray, f4:chararray);
B = FOREACH A GENERATE
FLATTEN(STRSPLIT(f1,'=',2)) as (n1:chararray,v1:chararray),
FLATTEN(STRSPLIT(f2,'=',2)) as (n2:chararray,v2:chararray),
FLATTEN(STRSPLIT(f3,'=',2)) as (n3:chararray,v3:chararray),
FLATTEN(STRSPLIT(f4,'=',2)) as (n4:chararray,v4:chararray);
C = FOREACH B GENERATE v1,v2,v3,v4;
DUMP C;
I am not able load the data as multiple tuples, am not sure what mistake am doing, please advise.
data.txt
vineet 1 pass Govt
hisham 2 pass Prvt
raj 3 fail Prvt
I want to load them as 2 touples.
A = LOAD 'data.txt' USING PigStorage('\t') AS (T1:tuple(name:bytearray, no:int), T2:tuple(result:chararray, school:chararray));
OR
A = LOAD 'data.txt' USING PigStorage('\t') AS (T1:(name:bytearray, no:int), T2:(result:chararray, school:chararray));
dump A;
the below data is displayed in the form of new line, i dont know why am not able to read actual data from data.txt.
(,)
(,)
(,)
As the input data is not stored as tuple we wont be able to read it directly in to a tuple.
One feasible approach is to read the data and then form a tuple with required fields.
Pig Script :
A = LOAD 'a.csv' USING PigStorage('\t') AS (name:chararray,no:int,result:chararray,school:chararray);
B = FOREACH A GENERATE (name,no) AS T1:tuple(name:chararray, no:int), (result,school) AS T2:tuple(result:chararray, school:chararray);
DUMP B;
Input : a.csv
vineet 1 pass Govt
hisham 2 pass Prvt
raj 3 fail Prvt
Output : DUMP B:
((vineet,1),(pass,Govt))
((hisham,2),(pass,Prvt))
((raj,3),(fail,Prvt))
Output : DESCRIBE B :
B: {T1: (name: chararray,no: int),T2: (result: chararray,school: chararray)}
I am trying to figure out the best way to parse key value pair with Pig in a dataset with mixed delimiters as below
My sample dataset is in the format below
a|b|c|k1=v1 k2=v2 k3=v3
The final output which i require here is
k1,v1,k2,v2,k3,v3
I guess one way to do this is to
A = load 'sample' PigStorage('|') as (a1,b1,c1,d1);
B = foreach A generate d1;
and here i get (k1=v1 k2=v2 k3=v3) for B
Is there any way i can further parse this by "" so as to get 3 fields k1=v1,k2=v2 and K3=v3 which can then be further split into k1,v1,k2,v2,k3,v3 using Strsplit and Flatten on "=".
Thanks for the help!
San
If you know beforehand how many key=value pair are in each record, try this:
A = load 'sample' PigStorage('|') as (a1,b1,c1,d1);
B = foreach A generate d1;
C = FOREACH B GENERATE STRSPLIT($0,'=',6); -- 6= no. of key=value pairs
D = FOREACH C GENERATE FLATTEN($0);
DUMP D
output:
(k1,v1, k2,v2, k3,v3)
If you dont know the # of key=value pair, use ' ' as delimiter and remove the unwanted prefix from $0 column.
A = LOAD 'sample' USING PigStorage(' ') as (a:chararray,b:chararray,c:chararray);
B = FOREACH A GENERATE STRSPLIT(SUBSTRING(a, LAST_INDEX_OF(a,'|')+1, (int)SIZE(a)),'=',2),STRSPLIT(b,'=',2),STRSPLIT(c,'=',2);
C = FOREACH B GENERATE FLATTEN($0), FLATTEN($1), FLATTEN($2);
DUMP C;
output:
(k1,v1, k2,v2, k3,v3)