How can I ignore " (double quotes) while loading a file in Pig?

I have the following data in a file:
"a","b","1","2"
"a","b","4","3"
"a","b","3","1"
I am reading this file with the command below:
File1 = LOAD '/path' USING PigStorage(',') AS (f1:chararray, f2:chararray, f3:int, f4:int);
But it ignores the data in fields 3 and 4.
How can I read this file correctly, or is there any way to make Pig skip the '"' characters?
Additional information: I am using Apache Pig version 0.10.0.

You may use the REPLACE function (it won't be in one pass though):
file1 = load 'your.csv' using PigStorage(',');
data = foreach file1 generate (chararray)$0 as f1, (chararray)$1 as f2, (int)REPLACE((chararray)$2, '\\"', '') as f3, (int)REPLACE((chararray)$3, '\\"', '') as f4;
You may also use regexes with REGEX_EXTRACT:
file1 = load 'your.csv' using PigStorage(',');
data = foreach file1 generate $0, $1, REGEX_EXTRACT($2, '([0-9]+)', 1), REGEX_EXTRACT($3, '([0-9]+)', 1);
Of course, you could remove the " from f1 and f2 the same way.
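To see what the two expressions above do to a single field, here is a minimal sketch in plain Python (not Pig). The helper names strip_quotes and first_digits are made up for illustration, and the sketch only approximates the Pig built-ins: one helper deletes every double quote, the other pulls out the first run of digits.

```python
import re

def strip_quotes(field):
    # Rough counterpart of REPLACE(field, '\\"', ''): delete every double quote.
    return field.replace('"', '')

def first_digits(field):
    # Rough counterpart of REGEX_EXTRACT(field, '([0-9]+)', 1):
    # return the first run of digits, or None when there is none.
    match = re.search(r'([0-9]+)', field)
    return match.group(1) if match else None

line = '"a","b","1","2"'
fields = line.split(',')

# Clean the two text columns and cast the two numeric columns.
row = (strip_quotes(fields[0]), strip_quotes(fields[1]),
       int(strip_quotes(fields[2])), int(first_digits(fields[3])))
print(row)
```

Either way, the quotes have to be gone before the value can be cast to an integer, which is why the typed load in the question does not give usable numbers.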

Try the following (no need to escape or replace the double quotes):
using org.apache.pig.piggybank.storage.CSVExcelStorage()

If you have Jython installed, you could deploy a simple UDF to do the job.
Python UDF:
#!/usr/bin/env python
'''
udf.py
'''
@outputSchema("out:chararray")
def formatter(item):
    # Keep only the text between the first pair of double quotes.
    new_item = item.split('"')[1]
    # The declared schema is chararray, so always return a string;
    # cast to int in the Pig script if a number is needed.
    return new_item
Pig script:
REGISTER 'udf.py' USING jython as udf;
data = load 'file' USING PigStorage(',') AS (col1:chararray, col2:chararray,
col3:chararray, col4:chararray);
out = foreach data generate udf.formatter(col1) as a, udf.formatter(col3) as b;
dump out;
(a,1)
(a,4)
(a,3)

How about using REPLACE, if the case is this simple?
data = LOAD 'YOUR_DATA' Using PigStorage(',') AS (a:chararray, b:chararray, c:chararray, d:chararray) ;
new_data = foreach data generate
REPLACE(a, '"', '') AS a,
REPLACE(b, '"', '') AS b,
(int)REPLACE(c, '"', '') AS c:int,
(int)REPLACE(d, '"', '') AS d:int;
One more tip: if you are loading a CSV file, setting the correct number format in an Excel-like tool might also help.

You can use the CSVExcelStorage loader from Pig.
This loader handles the double quotes in the data.
You have to register the Piggybank jar to use this loader.
Register ${jar_location}/piggybank-0.15.0.jar;
load_data = load '${data_location}' using
org.apache.pig.piggybank.storage.CSVExcelStorage(',');
Hope this helps.
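As a rough illustration of what a quote-aware CSV loader does with the sample data, the sketch below parses the same three lines with Python's standard csv module. It is only an analogy for the loader's behaviour, not Pig code, and the int conversion of the last two columns is an assumption that mirrors the schema in the question.

```python
import csv
import io

# The three sample lines from the question.
sample = '"a","b","1","2"\n"a","b","4","3"\n"a","b","3","1"\n'

rows = []
for f1, f2, f3, f4 in csv.reader(io.StringIO(sample)):
    # csv.reader has already removed the surrounding double quotes,
    # so the numeric columns can be cast directly.
    rows.append((f1, f2, int(f3), int(f4)))

for row in rows:
    print(row)
```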

Related

I have a seemingly simple Pig generate and then filter issue

I am trying to run a simple Pig script on a simple CSV file and I cannot get FILTER to do what I want. I have a test.csv file that looks like this:
john,12,44,,0
bob,14,56,5,7
dave,13,40,5,5
jill,8,,,6
Here is my script that does not work:
people = LOAD 'hdfs:/whatever/test.csv' using PigStorage(',');
data = FOREACH people GENERATE $0 AS name:chararray, $1 AS first:int, $4 AS second:int;
filtered = FILTER data BY first == 13;
DUMP filtered;
When I dump data, everything looks good. I get the name and the first and last integer as expected. When I describe the data, everything looks good:
data: {name: bytearray,first: int,second: int}
When I try and filter out data by the first value being 13, I get nothing. DUMP filtered simply returns nothing. Oddly enough, if I change it to first > 13, then all "rows" will print out.
However, this script works:
peopletwo = LOAD 'hdfs:/whatever/test.csv' using PigStorage(',') AS (f1:chararray,f2:int,f3:int,f4:int,f5:int);
datatwo = FOREACH peopletwo GENERATE $0 AS name:chararray, $1 AS first:int, $4 AS second:int;
filteredtwo = FILTER datatwo BY first == 13;
DUMP filteredtwo;
What is the difference between filteredtwo and filtered (or data and datatwo for that matter)? I want to know why the new relation obtained using GENERATE (i.e. data) won't filter in the first script as one would expect.
Specify the data types in the LOAD itself. See below:
people = LOAD 'test5.csv' USING PigStorage(',') as (f1:chararray,f2:int,f3:int,f4:int,f5:int);
filtered = FILTER people BY f2 == 13;
DUMP filtered;
Output
Changing the filter to use > gives
filtered = FILTER people BY f2 > 13;
Output
EDIT
When converting from bytearray, you have to explicitly cast the values of the fields in the FOREACH. This works:
people = LOAD 'test5.csv' USING PigStorage(',');
data = FOREACH people GENERATE $0 AS name:chararray,(int)$1 AS f1,(int)$4 AS f2;
filtered = FILTER data BY f1 == 13;
DUMP filtered;
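A loose analogy in Python (not Pig, and only meant to illustrate why the explicit cast matters): if the fields stay as raw bytes, an equality test against the number 13 never matches, while casting first returns the expected row. The data is the test.csv from the question; to_int is a hypothetical helper, and Pig's own runtime behaviour for uncast bytearray fields may differ in detail.

```python
raw_lines = [b"john,12,44,,0", b"bob,14,56,5,7", b"dave,13,40,5,5", b"jill,8,,,6"]

# No cast: every field is still raw bytes, comparable to a Pig bytearray.
uncast = [line.split(b",") for line in raw_lines]
no_match = [row for row in uncast if row[1] == 13]

def to_int(value):
    # Empty fields become None, similar to a null in Pig.
    return int(value) if value else None

# Explicit cast before filtering, comparable to (int)$1 in the FOREACH.
cast = [(row[0].decode(), to_int(row[1]), to_int(row[4])) for row in uncast]
matched = [row for row in cast if row[1] == 13]

print(no_match)
print(matched)
```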

Pig and Parsing issue

I am trying to figure out the best way to parse key-value pairs with Pig in a dataset with mixed delimiters.
My sample dataset is in the format below:
a|b|c|k1=v1 k2=v2 k3=v3
The final output I require is:
k1,v1,k2,v2,k3,v3
I guess one way to do this is:
A = load 'sample' using PigStorage('|') as (a1,b1,c1,d1);
B = foreach A generate d1;
Here I get (k1=v1 k2=v2 k3=v3) for B.
Is there any way I can further parse this by " " (space) to get the 3 fields k1=v1, k2=v2 and k3=v3, which can then be split into k1,v1,k2,v2,k3,v3 using STRSPLIT and FLATTEN on "="?
Thanks for the help!
San
If you know beforehand how many key=value pairs are in each record, try this:
A = load 'sample' using PigStorage('|') as (a1,b1,c1,d1);
B = foreach A generate d1;
C = FOREACH B GENERATE STRSPLIT($0,'[= ]',6); -- split on '=' or ' '; 6 = 3 key=value pairs x 2 fields
D = FOREACH C GENERATE FLATTEN($0);
DUMP D;
output:
(k1,v1,k2,v2,k3,v3)
If you don't know the number of key=value pairs, use ' ' as the delimiter and remove the unwanted prefix from the $0 column.
A = LOAD 'sample' USING PigStorage(' ') as (a:chararray,b:chararray,c:chararray);
B = FOREACH A GENERATE STRSPLIT(SUBSTRING(a, LAST_INDEX_OF(a,'|')+1, (int)SIZE(a)),'=',2),STRSPLIT(b,'=',2),STRSPLIT(c,'=',2);
C = FOREACH B GENERATE FLATTEN($0), FLATTEN($1), FLATTEN($2);
DUMP C;
output:
(k1,v1,k2,v2,k3,v3)
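The parsing steps can be checked outside Pig with a small Python sketch. It assumes, as in the sample, that the key=value pairs sit in the last '|'-separated field and are separated by spaces; parse_record is a hypothetical helper name, and unlike the Pig scripts it does not need to know the number of pairs in advance.

```python
def parse_record(line):
    # The key=value pairs live in the last '|'-separated field.
    kv_field = line.split('|')[-1]
    flat = []
    for pair in kv_field.split():
        # Split each pair on the first '=' only, so values may contain '='.
        key, value = pair.split('=', 1)
        flat.extend([key, value])
    return flat

print(','.join(parse_record('a|b|c|k1=v1 k2=v2 k3=v3')))
```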

UDF for formatting numbers into strings in Pig

In Pig, I want to take a numeric column, say 12345, and cast it to a string with formatting like "$12,345".
Are there existing UDFs to help with standard formatting such as adding dollar signs, commas, percent signs, etc.? I haven't seen any in the docs.
Here is my Python UDF that you can leverage.
#!/usr/bin/python
@outputSchema("formatted:chararray")
def toDol(number):
    # Drop any fractional part and work on the digit string.
    s = '%d' % number
    groups = []
    # Peel off groups of three digits from the right.
    while s and s[-1].isdigit():
        groups.append(s[-3:])
        s = s[:-3]
    res = s + ','.join(reversed(groups))
    res = '$' + res
    return res
This is what your Pig script will look like:
Register 'locale_udf.py' using jython as myfuncs;
DT = LOAD 'sample_data.txt' Using PigStorage() as (dol:float);
DTR = FOREACH DT GENERATE dol,myfuncs.toDol(dol) as formattedstring;
dump DTR;
This should work for you.
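For comparison, a shorter way to get the same grouping in plain Python is the ',' format specifier. This is a minimal sketch outside Pig: to_dollars is a hypothetical name, the value is truncated to a whole number as in the UDF above, and whether this format specifier is available depends on the Jython version you run (it needs Python 2.7 or later semantics).

```python
def to_dollars(number):
    # Truncate to a whole number, then let the ',' format spec
    # insert the thousands separators.
    return '${:,}'.format(int(number))

for value in (5, 12345, 1234567.89):
    print(to_dollars(value))
```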

Pig Latin split columns to rows

Is there any way in Pig Latin to transform columns into rows, as shown below?
Input:
id|column1|column2
1|a,b,c|1,2,3
2|d,e,f|4,5,6
required output:
id|column1|column2
1|a|1
1|b|2
1|c|3
2|d|4
2|e|5
2|f|6
thanks
I'm willing to bet this is not the best way to do this however ...
data = load 'input' using PigStorage('|') as (id:chararray, col1:chararray,
col2:chararray);
A = foreach data generate id, flatten(TOKENIZE(col1));
B = foreach data generate id, flatten(TOKENIZE(col2));
RA = RANK A;
RB = RANK B;
store RA into 'ra_temp' using PigStorage(',');
store RB into 'rb_temp' using PigStorage(',');
data_a = load 'ra_temp/part-m-00000' using PigStorage(',');
data_b = load 'rb_temp/part-m-00000' using PigStorage(',');
jed = JOIN data_a BY $0, data_b BY $0;
final = foreach jed generate $1, $2, $5;
dump final;
(1,a,1)
(1,b,2)
(1,c,3)
(2,d,4)
(2,e,5)
(2,f,6)
store final into '~/some_dir' using PigStorage('|');
EDIT: I really like this question. I was discussing it with a co-worker, who came up with a much simpler and more elegant solution. If you have Jython installed ...
# create file called udf.py
@outputSchema("innerBag:bag{innerTuple:(column1:chararray, column2:chararray)}")
def pigzip(column1, column2):
    # Split both comma-separated columns and pair them up position by position.
    c1 = column1.split(',')
    c2 = column2.split(',')
    innerBag = zip(c1, c2)
    return innerBag
Then in Pig:
$ pig -x local
register udf.py using jython as udf;
data = load 'input' using PigStorage('|') as (id:chararray, column1:chararray,
column2:chararray);
result = foreach data generate id, flatten(udf.pigzip(column1, column2));
dump result;
store result into 'output' using PigStorage('|');
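The pairing logic of the UDF can be tried without Pig. The sketch below is plain Python under the assumption that both columns always hold the same number of comma-separated items; the list comprehension stands in for FLATTEN, and the records list is just the sample input from the question.

```python
def pigzip(column1, column2):
    # Pair the i-th item of column1 with the i-th item of column2.
    return list(zip(column1.split(','), column2.split(',')))

records = [('1', 'a,b,c', '1,2,3'), ('2', 'd,e,f', '4,5,6')]

# Emulate FLATTEN: emit one output row per (id, pair).
rows = [(rec_id, c1, c2)
        for rec_id, col1, col2 in records
        for c1, c2 in pigzip(col1, col2)]

for row in rows:
    print('|'.join(row))
```

Note that zip stops at the shorter list, so a record with mismatched column lengths would silently lose items.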

PIG - Defining the delimiter used for a bag after a GROUP function

In Pig, I'm loading and grouping two files. I end up with something like this:
A = LOAD 'File1' Using PigStorage('\t');
B = LOAD 'File2' Using PigStorage('\t');
C = COGROUP A BY $0, B BY $0;
STORE C INTO 'Output' USING PigStorage('\t');
Output:
123 {(123,XYZ,456)} {(123,QRS,889,QWER)}
Where the first field is the group key, the first bag is from File1, and the next bag is from File2. These three sections are delimited from each other using whatever I identified in the PigStorage('\t') clause.
Question: How do I force Pig to delimit the bags by something other than a comma? In my real data, there are commas present and so I need to delimit by tabs instead.
Desired output:
123 {(123\tXYZ\t456)} {(123\tQRS\t889\tQWER)}
This seems to be an open issue (as of June 2013) in Pig. See the corresponding JIRA for more details. Until the issue is fixed, you can change your input data.