Apache Pig only loads the first nested tuple

I am using the exact sample from the official documentation.
I have data.txt:
(3,8,9) (mary,19)
(1,4,7) (john,18)
(2,5,8) (joe,18)
I run:
A = LOAD 'data.txt' AS (F:tuple(f1:int,f2:int,f3:int),T:tuple(t1:chararray,t2:int));
dump A
I always got:
((3,8,9),)
((1,4,7),)
((2,5,8),)
The second nested tuple is never loaded. I tried both 0.16.0 and 0.17.0.

The problem is most likely in the data file you created. The two tuples must be separated by a tab, because PigStorage, the default load function, splits fields on tab. If the file uses a space instead, the LOAD statement has to be changed accordingly.
a) With tab (\t) as the delimiter:
grunt> A = LOAD '/home/ec2-user/data' AS (F:tuple(f1:int,f2:int,f3:int),T:tuple(t1:chararray,t2:int));
grunt> DESCRIBE A;
A: {F: (f1: int,f2: int,f3: int),T: (t1: chararray,t2: int)}
grunt> dump A;
((3,8,9),(mary,19))
((1,4,7),(john,18))
((2,5,8),(joe,18))
b) With a single space ( ) as the delimiter:
grunt> A = LOAD '/home/ec2-user/data' AS (F:tuple(f1:int,f2:int,f3:int),T:tuple(t1:chararray,t2:int));
grunt> DESCRIBE A;
A: {F: (f1: int,f2: int,f3: int),T: (t1: chararray,t2: int)}
grunt> dump A;
((3,8,9),)
((1,4,7),)
((2,5,8),)
Use PigStorage(' ') if you still want to keep a space as the delimiter in the file:
grunt> A = LOAD '/home/ec2-user/data' USING PigStorage(' ') AS (F:tuple(f1:int,f2:int,f3:int),T:tuple(t1:chararray,t2:int));
grunt> DESCRIBE A;
A: {F: (f1: int,f2: int,f3: int),T: (t1: chararray,t2: int)}
grunt> dump A;
((3,8,9),(mary,19))
((1,4,7),(john,18))
((2,5,8),(joe,18))
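To see why the delimiter matters, here is a rough Python model of delimiter-based field splitting. It is only an illustration, not Pig's implementation; the helper name split_fields is hypothetical, and the one assumption taken from the answer above is that lines are split on tab unless another delimiter is given.

```python
# Rough model of delimiter-based field splitting (not Pig's actual code).
# split_fields is a hypothetical helper used only for illustration.
def split_fields(line, delimiter="\t"):
    """Split one input line into fields on the given delimiter."""
    return line.split(delimiter)


space_line = "(3,8,9) (mary,19)"   # tuples separated by a single space
tab_line = "(3,8,9)\t(mary,19)"    # tuples separated by a tab

# Default (tab) delimiter on space-separated data: the whole line is one
# field, so nothing is left over to fill the second tuple.
one_field = split_fields(space_line)

# Tab-separated data, or an explicit space delimiter, yields two fields.
two_fields_tab = split_fields(tab_line)
two_fields_space = split_fields(space_line, " ")

print(len(one_field), len(two_fields_tab), len(two_fields_space))
```

With only one field available, the second tuple in the schema has no input to read, which is consistent with the empty second slot in the dump output of case b).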

Related

Filter data after join using PIG

I would like to filter the records after two files are joined.
The file BX-Books.csv contains the book data, and the file BX-Book-Ratings.csv contains the book rating data; ISBN is the column common to both files. The inner join between the files is done on this column.
I would like to get the books that are published in the year 2002.
I have used the following script, but I am getting 0 records.
grunt> BookXRecords = LOAD '/user/pradeep/BX-Books.csv' USING PigStorage(';') AS (ISBN:chararray,BookTitle:chararray,BookAuthor:chararray,YearOfPublication:chararray, Publisher:chararray,ImageURLS:chararray,ImageURLM:chararray,ImageURLL:chararray);
grunt> BookXRating = LOAD '/user/pradeep/BX-Book-Ratings.csv' USING PigStorage(';') AS (user:chararray,ISBN:chararray,rating:chararray);
grunt> BxJoin = JOIN BookXRecords BY ISBN, BookXRating BY ISBN;
grunt> BxJoin_Mod = FOREACH BxJoin GENERATE $0 AS ISBN, $1, $2, $3, $4;
grunt> FLTRBx2002 = FILTER BxJoin_Mod BY $3 == '2002';
I created test.csv and test-rating.csv and ran a Pig script against them. It worked perfectly fine.
test.csv
1;abc;author1;2002
2;xyz;author2;2003
test-rating.csv
user1;1;3
user2;2;5
Pig Script :
A = LOAD 'test.csv' USING PigStorage(';') AS (ISBN:chararray,BookTitle:chararray,BookAuthor:chararray,YearOfPublication:chararray);
describe A;
dump A;
B = LOAD 'test-rating.csv' USING PigStorage(';') AS (user:chararray,ISBN:chararray,rating:chararray);
describe B;
dump B;
C = JOIN A BY ISBN, B BY ISBN;
describe C;
dump C;
D = FOREACH C GENERATE $0 as ISBN,$1,$2,$3;
describe D;
dump D;
E = FILTER D BY $3 == '2002';
describe E;
dump E;
Output:
A: {ISBN: chararray,BookTitle: chararray,BookAuthor: chararray,YearOfPublication: chararray}
(1,abc,author1,2002)
(2,xyz,author2,2003)
B: {user: chararray,ISBN: chararray,rating: chararray}
(user1,1,3)
(user2,2,5)
C: {A::ISBN: chararray,A::BookTitle: chararray,A::BookAuthor: chararray,A::YearOfPublication: chararray,B::user: chararray,B::ISBN: chararray,B::rating: chararray}
(1,abc,author1,2002,user1,1,3)
(2,xyz,author2,2003,user2,2,5)
D: {ISBN: chararray,A::BookTitle: chararray,A::BookAuthor: chararray,A::YearOfPublication: chararray}
(1,abc,author1,2002)
(2,xyz,author2,2003)
E: {ISBN: chararray,A::BookTitle: chararray,A::BookAuthor: chararray,A::YearOfPublication: chararray}
(1,abc,author1,2002)
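The same join-then-filter flow can be mirrored in plain Python as a sanity check on the logic. This is only a sketch over the test data shown above; the variable names are illustrative and nothing here is Pig API.

```python
# Plain-Python mirror of the join-then-filter flow, using the test data above.
books = [
    ("1", "abc", "author1", "2002"),
    ("2", "xyz", "author2", "2003"),
]
ratings = [
    ("user1", "1", "3"),
    ("user2", "2", "5"),
]

# Inner join on ISBN: column 0 of a book row matches column 1 of a rating row.
joined = [b + r for b in books for r in ratings if b[0] == r[1]]

# Keep the four book columns, then filter on the year compared as a string.
projected = [row[:4] for row in joined]
books_2002 = [row for row in projected if row[3] == "2002"]

print(books_2002)
```

Because the year is compared as a string, the comparison only matches when the stored value is exactly 2002 with no extra characters around it.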
Requirement: get the books that were published in the year 2002.
Two data sets are not required for this.
It can be achieved with "BookXRecords" alone.
grunt> BookXRecords = LOAD '/user/pradeep/BX-Books.csv' USING PigStorage(';') AS (ISBN:chararray,BookTitle:chararray,BookAuthor:chararray,YearOfPublication:chararray, Publisher:chararray,ImageURLS:chararray,ImageURLM:chararray,ImageURLL:chararray);
grunt> A = FILTER BookXRecords BY YearOfPublication == '2002';
grunt> dump A;

Apache Pig reading name value pairs in data file

I have a sample Pig script that reads a CSV file and dumps it to the screen; however, my data has name-value pairs. How can I read in a line of name-value pairs and split each pair, using the name for the field and the value for the value?
data:
1,Smith,Bob,Business Development
2,Doe,John,Developer
3,Jane,Sally,Tester
script:
data = LOAD 'example-data.txt' USING PigStorage(',')
AS (id:chararray, last_name:chararray,
first_name:chararray, role:chararray);
DESCRIBE data;
DUMP data;
output:
data: {id: chararray,last_name: chararray,first_name: chararray,role: chararray}
(1,Smith,Bob,Business Development)
(2,Doe,John,Developer)
(3,Jane,Sally,Tester)
However, given the following input (as name-value pairs), how could I process the data to get the same "data object"?
id=1,last_name=Smith,first_name=Bob,role=Business Development
id=2,last_name=Doe,first_name=John,role=Developer
id=3,last_name=Jane,first_name=Sally,role=Tester
Refer to STRSPLIT
A = LOAD 'example-data.txt' USING PigStorage(',') AS (f1:chararray,f2:chararray,f3:chararray, f4:chararray);
B = FOREACH A GENERATE
FLATTEN(STRSPLIT(f1,'=',2)) as (n1:chararray,v1:chararray),
FLATTEN(STRSPLIT(f2,'=',2)) as (n2:chararray,v2:chararray),
FLATTEN(STRSPLIT(f3,'=',2)) as (n3:chararray,v3:chararray),
FLATTEN(STRSPLIT(f4,'=',2)) as (n4:chararray,v4:chararray);
C = FOREACH B GENERATE v1,v2,v3,v4;
DUMP C;
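For comparison, the split-on-the-first-"=" idea can be sketched in plain Python. This is an illustration only: parse_pairs is a hypothetical helper, and split("=", 1) is used as the rough counterpart of STRSPLIT with a limit of 2.

```python
def parse_pairs(line):
    """Turn 'k1=v1,k2=v2' into a dict, splitting each pair on the first '='."""
    record = {}
    for pair in line.split(","):
        # maxsplit=1 keeps any later '=' characters inside the value.
        name, value = pair.split("=", 1)
        record[name] = value
    return record


lines = [
    "id=1,last_name=Smith,first_name=Bob,role=Business Development",
    "id=2,last_name=Doe,first_name=John,role=Developer",
    "id=3,last_name=Jane,first_name=Sally,role=Tester",
]
records = [parse_pairs(line) for line in lines]

for record in records:
    print(record["id"], record["last_name"], record["first_name"], record["role"])
```

Keying each value by its name means the result does not depend on the order of the pairs within a line, unlike the positional f1..f4 approach.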

Pig and Parsing issue

I am trying to figure out the best way to parse key-value pairs with Pig in a dataset with mixed delimiters.
My sample dataset is in the format below:
a|b|c|k1=v1 k2=v2 k3=v3
The final output I require is:
k1,v1,k2,v2,k3,v3
I guess one way to do this is:
A = load 'sample' using PigStorage('|') as (a1,b1,c1,d1);
B = foreach A generate d1;
and here I get (k1=v1 k2=v2 k3=v3) for B.
Is there any way I can further parse this by " " so as to get 3 fields, k1=v1, k2=v2 and k3=v3, which can then be further split into k1,v1,k2,v2,k3,v3 using STRSPLIT and FLATTEN on "="?
Thanks for the help!
San
If you know beforehand how many key=value pairs are in each record, try this:
A = load 'sample' using PigStorage('|') as (a1,b1,c1,d1);
B = foreach A generate d1;
C = FOREACH B GENERATE STRSPLIT($0,'=',6); -- 6 = maximum number of tokens returned (3 pairs x 2)
D = FOREACH C GENERATE FLATTEN($0);
DUMP D;
output:
(k1,v1, k2,v2, k3,v3)
If you don't know the number of key=value pairs, use ' ' as the delimiter and remove the unwanted prefix from the $0 column.
A = LOAD 'sample' USING PigStorage(' ') as (a:chararray,b:chararray,c:chararray);
B = FOREACH A GENERATE STRSPLIT(SUBSTRING(a, LAST_INDEX_OF(a,'|')+1, (int)SIZE(a)),'=',2),STRSPLIT(b,'=',2),STRSPLIT(c,'=',2);
C = FOREACH B GENERATE FLATTEN($0), FLATTEN($1), FLATTEN($2);
DUMP C;
output:
(k1,v1, k2,v2, k3,v3)
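The two-stage split (first on "|", then on whitespace, then on "=") can be sketched in plain Python for any number of pairs. This is an illustration under the assumption that the key=value pairs always sit in the last "|"-separated field; flatten_pairs is a hypothetical helper, not a Pig function.

```python
def flatten_pairs(line):
    """Return [k1, v1, k2, v2, ...] from the last '|'-separated field."""
    last_field = line.split("|")[-1]
    flat = []
    # split() with no argument splits on any run of whitespace.
    for pair in last_field.split():
        key, value = pair.split("=", 1)
        flat.extend([key, value])
    return flat


result = flatten_pairs("a|b|c|k1=v1 k2=v2 k3=v3")
print(",".join(result))
```

Because the loop runs over however many pairs the field contains, this version does not need the pair count to be known in advance.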

How can I ignore " (double quotes) while loading a file in Pig?

I have the following data in a file:
"a","b","1","2"
"a","b","4","3"
"a","b","3","1"
I am reading this file using the command below:
File1 = LOAD '/path' using PigStorage (',') as (f1:chararray,f2:chararray,f3:int,f4:int);
But it ignores the data in fields 3 and 4.
How can I read this file correctly, or is there any way to make Pig skip '"'?
Additional information: I am using Apache Pig version 0.10.0.
You may use the REPLACE function (it won't be done in one pass, though):
file1 = load 'your.csv' using PigStorage(',');
data = foreach file1 generate $0 as (f1:chararray), $1 as (f2:chararray), REPLACE($2, '\\"', '') as (f3:int), REPLACE($3, '\\"', '') as (f4:int);
You may also use regexes with REGEX_EXTRACT :
file1 = load 'your.csv' using PigStorage(',');
data = foreach file1 generate $0, $1, REGEX_EXTRACT($2, '([0-9]+)', 1), REGEX_EXTRACT($3, '([0-9]+)', 1);
Of course, you could erase " for f1 and f2 the same way.
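The two cleanup ideas above (remove the quote character, or extract only the digits) behave as follows in plain Python. This is just a sketch of the string logic; strip_quotes and extract_int are hypothetical helpers and do not correspond to Pig built-ins.

```python
import re


def strip_quotes(field):
    """Remove every double quote from the field (the REPLACE idea)."""
    return field.replace('"', "")


def extract_int(field):
    """Return the first run of digits as an int (the REGEX_EXTRACT idea)."""
    match = re.search(r"([0-9]+)", field)
    return int(match.group(1)) if match else None


row = '"a","b","1","2"'.split(",")
cleaned = (
    strip_quotes(row[0]),
    strip_quotes(row[1]),
    extract_int(row[2]),
    extract_int(row[3]),
)
print(cleaned)
```

Note that the raw fields still carry their quotes after the comma split, which is why a direct conversion of "1" (quotes included) to an integer cannot succeed.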
Try the following (no need to escape or replace double quotes):
using org.apache.pig.piggybank.storage.CSVExcelStorage()
If you have Jython installed, you could deploy a simple UDF to accomplish the job.
Python UDF:
#!/usr/bin/env python
'''
udf.py
'''
@outputSchema("out:chararray")
def formatter(item):
    chars = 'abcdefghijklmnopqrstuvwxyz'
    nums = '1234567890'
    # Take the text between the first pair of double quotes, e.g. '"a"' -> 'a'.
    new_item = item.split('"')[1]
    output = None  # returned when the value is neither letters nor digits
    if new_item in chars:
        output = str(new_item)
    elif new_item in nums:
        output = int(new_item)
    return output
pig script
REGISTER 'udf.py' USING jython as udf;
data = load 'file' USING PigStorage(',') AS (col1:chararray, col2:chararray,
col3:chararray, col4:chararray);
out = foreach data generate udf.formatter(col1) as a, udf.formatter(col3) as b;
dump out
(a,1)
(a,4)
(a,3)
How about using REPLACE, if the case is this simple?
data = LOAD 'YOUR_DATA' Using PigStorage(',') AS (a:chararray, b:chararray, c:chararray, d:chararray) ;
new_data = foreach data generate
REPLACE(a, '"', '') AS a,
REPLACE(b, '"', '') AS b,
(int)REPLACE(c, '"', '') AS c:int,
(int)REPLACE(d, '"', '') AS d:int;
One more tip: if you are loading a CSV file, setting the correct number format in an Excel-like tool might also help.
You can use the CSVExcelStorage loader from Pig.
This loader handles the double quotes in the data.
You have to register the Piggybank jar to use this loader.
Register ${jar_location}/piggybank-0.15.0.jar;
load_data = load '${data_location}' using
org.apache.pig.piggybank.storage.CSVExcelStorage(',');
Hope this helps.
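As a point of comparison, a quote-aware CSV reader removes the quotes during parsing, so no REPLACE step is needed afterwards. The sketch below uses Python's standard csv module on the sample data from the question; it illustrates the idea of a quote-aware loader and says nothing about how CSVExcelStorage is implemented.

```python
import csv
import io

# Sample data from the question, held in memory instead of a file.
raw = '"a","b","1","2"\n"a","b","4","3"\n"a","b","3","1"\n'

# csv.reader strips the surrounding double quotes while parsing each field.
reader = csv.reader(io.StringIO(raw))
rows = [(f1, f2, int(f3), int(f4)) for f1, f2, f3, f4 in reader]

for row in rows:
    print(row)
```

Once the quotes are gone at parse time, the third and fourth fields convert to integers without further cleanup.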

Pig Latin split columns to rows

Is there any solution in Pig Latin to transform columns to rows to get the output below?
Input:
id|column1|column2
1|a,b,c|1,2,3
2|d,e,f|4,5,6
required output:
id|column1|column2
1|a|1
1|b|2
1|c|3
2|d|4
2|e|5
2|f|6
thanks
I'm willing to bet this is not the best way to do this however ...
data = load 'input' using PigStorage('|') as (id:chararray, col1:chararray,
col2:chararray);
A = foreach data generate id, flatten(TOKENIZE(col1));
B = foreach data generate id, flatten(TOKENIZE(col2));
RA = RANK A;
RB = RANK B;
store RA into 'ra_temp' using PigStorage(',');
store RB into 'rb_temp' using PigStorage(',');
data_a = load 'ra_temp/part-m-00000' using PigStorage(',');
data_b = load 'rb_temp/part-m-00000' using PigStorage(',');
jed = JOIN data_a BY $0, data_b BY $0;
final = foreach jed generate $1, $2, $5;
dump final;
(1,a,1)
(1,b,2)
(1,c,3)
(2,d,4)
(2,e,5)
(2,f,6)
store final into '~/some_dir' using PigStorage('|');
EDIT: I really like this question and was discussing it with a co-worker and he came up with a much simpler and more elegant solution. If you have Jython installed ...
# create a file called udf.py
@outputSchema("innerBag:bag{innerTuple:(column1:chararray, column2:chararray)}")
def pigzip(column1, column2):
    # Split both comma-separated columns and pair their elements by position.
    c1 = column1.split(',')
    c2 = column2.split(',')
    innerBag = zip(c1, c2)
    return innerBag
Then in Pig
$ pig -x local
register udf.py using jython as udf;
data = load 'input' using PigStorage('|') as (id:chararray, column1:chararray,
column2:chararray);
result = foreach data generate id, flatten(udf.pigzip(column1, column2));
dump result;
store result into 'output' using PigStorage('|');
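The zip-based idea can also be tried outside Pig to check the expected rows. The following is a standalone Python 3 sketch, not the UDF itself: it wraps zip in list() because Python 3 returns an iterator, and explode is a hypothetical helper that plays the role of the FOREACH ... FLATTEN step.

```python
def pigzip(column1, column2):
    """Pair the comma-separated elements of two columns by position."""
    return list(zip(column1.split(","), column2.split(",")))


def explode(line):
    """Turn one 'id|a,b,c|1,2,3' line into one (id, col1, col2) row per pair."""
    row_id, column1, column2 = line.split("|")
    return [(row_id, a, b) for a, b in pigzip(column1, column2)]


rows = []
for line in ["1|a,b,c|1,2,3", "2|d,e,f|4,5,6"]:
    rows.extend(explode(line))

for row in rows:
    print("|".join(row))
```

One design point to keep in mind: zip stops at the shorter list, so a row whose two columns have different element counts would silently lose the extra elements.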