How to remove extra entries from a tuple in Apache Pig

How do I get rid of the extra commas (empty entries) in the tuple?
ab = load "/path/file1.txt" USING PigStorage(',') AS (id1:chararray, id2:chararray, dt:chararray, qty:int);
Current output:
(F1,S9,12/09/2011,2,,,)
Expected output:
(F1,S9,12/09/2011,2)
Should I change the contents of my text file, or is the problem somewhere else?

Enclose the path in single quotes ('') in the LOAD statement.
Example:
ab = load '/path/file1.txt' USING PigStorage(',') AS (id1:chararray, id2:chararray, dt:chararray, qty:int);

Related

How to load data into pig using different PigStorage operator

I am new to Apache Pig and am trying to load test Twitter data to find the number of tweets per user name. Below is my data format (twitterId,comment,userRefId):
Sample Data
When I load the data into Pig using PigStorage(','), it also splits my comment field into multiple fields, because comments can contain ',' as well. Please let me know how to load this data properly in Pig. I am using the command below:
data = LOAD '/home/vinita/Desktop/Material/PIG/test.csv' using PigStorage(',') AS (id:chararray,comment:chararray,refId:chararray);
Load each record as a single line, then replace ," with | and ", with |. This ensures the three fields are separated by |, after which STRSPLIT can be used to get them.
A = LOAD 'data.txt' AS (line:chararray);
B = FOREACH A GENERATE REPLACE(REPLACE(line,',"','|'),'",','|');
C = FOREACH B GENERATE STRSPLIT($0,'\\|',3);
DUMP C;
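To make the replace-then-split idea easier to check by hand, here is a minimal Python sketch of the same three steps. It is an illustration of the logic, not Pig code; the helper name split_record and the sample record are hypothetical, and it assumes the comment is the only quoted field and contains no | character.

```python
def split_record(line: str) -> list[str]:
    """Mimic the REPLACE / REPLACE / STRSPLIT steps on one input line."""
    # Turn the ," before the comment and the ", after it into | separators.
    piped = line.replace(',"', '|').replace('",', '|')
    # Split on | into at most 3 fields (maxsplit=2 yields up to 3 pieces).
    return piped.split('|', 2)


# Hypothetical record in the (twitterId,comment,userRefId) format.
sample = '101,"great talk, thanks",u42'
print(split_record(sample))  # ['101', 'great talk, thanks', 'u42']
```

The comma inside the comment survives because only the quote-adjacent commas are rewritten. A comment that itself contains | or the sequence ," would break this approach.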
EDIT:
I ran the script on sample text and it works fine.
If changing the separator in your source data is an option, I would go that route. It will probably make it a lot easier to get started and to track down issues.
If you change your separator to a |, your code could look like:
data = LOAD '/home/vinita/Desktop/Material/PIG/test.csv' using PigStorage('|') AS (id:chararray,comment:chararray,refId:chararray);

Unable to load data using Apache Pig

I have CSV data in the following format:
id,name,price,information
12,Pants,50.00,{Clothes & Shoes: 5}
And here is my Pig script:
grunt> sample = LOAD 'data.csv' USING PigStorage (',') AS (id:int, name:chararray, price:double, information:chararray);
The problem is that when I load information as a chararray, I can't access the category or the quantity on its own. I tried to do something like:
information:tuple(category:chararray, quantity:int)
But it didn't work.
What should I do?
What is the best way to load information so that I can access both the category and the quantity?
Thanks
What you have is a bag, not a tuple. For reference, this is how the two are written:
( ) A tuple is enclosed in parentheses ( ).
{ } An inner bag is enclosed in curly brackets { }.
You can load it like this:
sample = LOAD 'data.csv' USING PigStorage (',') AS (id:int, name:chararray, price:double, information:bag{});
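If the goal is simply to reach the category and the quantity, another option is to keep information as a chararray and pull the two parts out with a regular expression (in Pig, the built-in REGEX_EXTRACT function could play that role). The Python sketch below only illustrates the extraction logic under the assumption that every value has the shape {category: quantity}; the names INFO_PATTERN and parse_information are hypothetical.

```python
import re

# Matches text shaped like "{<category>: <quantity>}" (assumed format).
INFO_PATTERN = re.compile(r'\{(.+):\s*(\d+)\}')


def parse_information(info: str):
    """Return (category, quantity), or None if the text has another shape."""
    match = INFO_PATTERN.fullmatch(info.strip())
    if match is None:
        return None
    return match.group(1), int(match.group(2))


print(parse_information('{Clothes & Shoes: 5}'))  # ('Clothes & Shoes', 5)
```

Values that do not follow the assumed shape return None rather than raising, so malformed rows can be filtered out afterwards.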

apache pig load data with multiple delimiters

Hi everyone, I have a problem loading data using Apache Pig. The file format looks like this:
"1","2","xx,yy","a,sd","3"
So I want to load it using the multi-character delimiter "," (two double quotes with a comma between them), like this:
A = LOAD 'file.csv' USING PigStorage('","') AS (f1,f2,f3,f4,f5);
But PigStorage doesn't accept the multi-character delimiter ",". How can I do it? Thank you very much!
PigStorage takes a single character as its delimiter. You will have to use the functions from PiggyBank. Download piggybank.jar and save it in the same folder as your Pig script, then register the jar in the script.
REGISTER piggybank.jar;
DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
A = LOAD 'test1.txt' USING CSVLoader(',') AS (f1:int,f2:int,f3:chararray,f4:chararray,f5:int);
B = FOREACH A GENERATE f1,f2,f3,f4,f5;
DUMP B;
An alternative is to load each record as a single line and then use STRSPLIT:
A = LOAD 'test1.txt' USING TextLoader() AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line, '","'));
DUMP B;
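One detail worth checking with the split-on-"," approach is what happens to the quote at the very start and the very end of the line. The Python sketch below is not Pig; it only mirrors the string logic on the sample line from the question, and the variable names are hypothetical.

```python
import csv

line = '"1","2","xx,yy","a,sd","3"'

# Splitting on the three-character sequence "," keeps the outermost quotes.
naive = line.split('","')
print(naive)    # ['"1', '2', 'xx,yy', 'a,sd', '3"']

# Stripping the leading and trailing quote first gives clean fields.
cleaned = line.strip('"').split('","')
print(cleaned)  # ['1', '2', 'xx,yy', 'a,sd', '3']

# A CSV parser applies the quoting rules directly and gives the same fields.
parsed = next(csv.reader([line]))
print(parsed)   # ['1', '2', 'xx,yy', 'a,sd', '3']
```

If the split in Pig behaves the same way, the first and last fields would still carry a stray quote and need one more cleanup step, which a CSV-aware loader avoids.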

Error loading pig script

I am having difficulty loading data with an Apache Pig script:
cat data15.txt
1,(2,3)
2,(3,4)
grunt>a = load 'nikhil/data15.txt' using PigStorage(',') as (x:int, y:tuple(y1:int,y2:int));
grunt>dump a;
(1,)
(2,)
I know it's too late to answer this.
The problem is that the fields inside the tuple and the top-level fields use the same delimiter, ','. Pig fails to do the schema conversion.
You can try something like this; you need to change the delimiter:
1:(5,7,7)
3:(7,9,4)
5:(5,9,7)
and run the Pig script as:
A = load 'file.txt' using PigStorage(':') as (t1:int,t2:tuple(x:int,y:int,z:int));
dump A;
The output is:
(1,(5,7,7))
(3,(7,9,4))
(5,(5,9,7))
You can change the delimiter in the input file with the sed command and then load the file.
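The delimiter change itself is a one-line text transformation. As a rough sketch of what such a preprocessing step has to do, here it is in Python rather than sed; the function name change_outer_delimiter is hypothetical, and it assumes the first field never contains a comma.

```python
def change_outer_delimiter(line: str, old: str = ',', new: str = ':') -> str:
    """Replace only the first delimiter so commas inside the tuple are kept."""
    return line.replace(old, new, 1)


# Lines in the shape of data15.txt from the question.
for raw in ['1,(2,3)', '2,(3,4)']:
    print(change_outer_delimiter(raw))
# 1:(2,3)
# 2:(3,4)
```

Replacing every comma instead of just the first would also rewrite the tuple's own separators, which is why the replacement count is limited to 1.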

How to use apache pig filter to find '.PDF'

I have a file /pigmix.txt in HDFS that contains a list of files in different formats such as .PDF, .DOC, .PPT, etc. I want to keep only the .PDF entries. How can I use the Apache Pig filter function for this?
Can you try the filter command below?
Input:
file1.txt
file2.PDF
file3.doc
file4.ppt
file5.pdf
PigScript:
A = LOAD 'input' USING PigStorage() AS (filename:chararray);
B = FILTER A BY filename matches '.*\\.(pdf|PDF)$';
DUMP B;
Output:
(file2.PDF)
(file5.pdf)
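The pattern can be tried outside Pig as well. The Python sketch below assumes the doubled backslash in the Pig string reaches the regex engine as \. and that matches requires the whole string to match, which re.fullmatch mirrors; the names PDF_PATTERN, PDF_ANY_CASE and is_pdf are hypothetical.

```python
import re

# Same pattern as in the FILTER statement, written as a Python raw string.
PDF_PATTERN = re.compile(r'.*\.(pdf|PDF)$')
# Hypothetical variant that also accepts mixed-case extensions such as .Pdf.
PDF_ANY_CASE = re.compile(r'.*\.pdf$', re.IGNORECASE)


def is_pdf(name: str, pattern: re.Pattern = PDF_PATTERN) -> bool:
    """True when the whole file name matches the given pattern."""
    return pattern.fullmatch(name) is not None


names = ['file1.txt', 'file2.PDF', 'file3.doc', 'file4.ppt', 'file5.pdf']
print([n for n in names if is_pdf(n)])  # ['file2.PDF', 'file5.pdf']
```

Note that the original pattern lists only the all-lowercase and all-uppercase spellings, so a name like report.Pdf is rejected unless a case-insensitive pattern is used.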