How to load data into Pig using a different PigStorage delimiter

I am new to Apache Pig and am trying to load test Twitter data to find the number of tweets by each user name. Below is my data
format (twitterId,comment,userRefId):
Sample Data
When I load the data into Pig using PigStorage(','), the comment field is also split into multiple fields, because comments can contain ',' as well. Please let me know how to load this data properly in Pig. I am using the command below:
data = LOAD '/home/vinita/Desktop/Material/PIG/test.csv' using PigStorage(',') AS (id:chararray,comment:chararray,refId:chararray);

Load each record as a single line, then replace ," with | and ", with |. This assumes the comment is wrapped in double quotes and that no field contains a | already. The fields are then cleanly separated, and STRSPLIT can extract the 3 fields.
A = LOAD 'data.txt' AS (line:chararray);
B = FOREACH A GENERATE REPLACE(REPLACE(line,',"','|'),'",','|');
C = FOREACH B GENERATE STRSPLIT($0,'\\|',3);
DUMP C;
EDIT:
I ran the script against sample text and it works fine.
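To get from the split fields to the per-user tweet count that the question asks for, the tuple can be flattened and grouped. This is a rough sketch, assuming every comment is wrapped in double quotes and no field contains a | character; the aliases D and E and the field name tweets are made up for illustration.

```pig
-- Assumed input shape: id,"comment text, possibly with commas",userRefId
A = LOAD 'data.txt' USING TextLoader() AS (line:chararray);
-- Turn the two quote-adjacent commas into pipes so only real field breaks remain.
B = FOREACH A GENERATE REPLACE(REPLACE(line, ',"', '|'), '",', '|') AS piped;
-- Split into at most 3 parts and name the resulting fields.
C = FOREACH B GENERATE FLATTEN(STRSPLIT(piped, '\\|', 3))
        AS (id:chararray, comment:chararray, refId:chararray);
-- Count tweets per user reference id.
D = GROUP C BY refId;
E = FOREACH D GENERATE group AS refId, COUNT(C) AS tweets;
DUMP E;
```

TextLoader is used here so that a tab inside a comment does not cut the line short.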

If changing the separator in your source data is an option, I would go that route. It will probably make it much easier to get started and to track down issues.
If you change your separator to |, your code could look like this:
data = LOAD '/home/vinita/Desktop/Material/PIG/test.csv' using PigStorage('|') AS (id:chararray,comment:chararray,refId:chararray);

Related

How to remove extra entries from the tuple in apache pig

How do I get rid of the extra commas (empty entries) in the tuple?
ab = load "/path/file1.txt" USING PigStorage(',') AS (id1:chararray, id2:chararray, dt:chararray, qty:int);
Current output:
(F1,S9,12/09/2011,2,,,)
Expected output:
(F1,S9,12/09/2011,2)
Should I change the text in my file1.txt, or is it something else?
Write the path between single quotes ('') in the LOAD statement.
Example:
ab = load '/path/file1.txt' USING PigStorage(',') AS (id1:chararray, id2:chararray, dt:chararray, qty:int);

Apache Pig - Numeric data missing while loading in a pig relation

I am learning Apache Pig and am trying to load some data into Pig. When I view the text file in the vi editor, I see the following (sample) row.
[ABBOTT,DEEDEE W GRADES 9-12 TEACHER 52,122.10 0 LBOE
ATLANTA INDEPENDENT SCHOOL SYSTEM 2010].
I use the following command to load data into a pig relation.
A = LOAD 'salaryTravelReport_sample.txt' USING PigStorage() as (name:chararray,
prof:chararray,max_sal:float,travel:float,board:chararray,state:chararray,year:int);
However, when I do a dump in pig in the distributed environment, I find the following result (for the row mentioned above):
(ABBOTT,DEEDEE W,GRADES 9-12 TEACHER,,0.0,LBOE,ATLANTA INDEPENDENT
SCHOOL SYSTEM,2010).
The numeric data "52,122.10 " seems to be missing.
Please help.
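PigStorage() is a built-in Pig load function that takes the field delimiter as its argument. Here the delimiter is a tab (\t), which is also the default. Note that 52,122.10 contains a comma, so the cast to float fails and the field loads as null.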
A = LOAD 'salaryTravelReport_sample.txt' USING PigStorage('\t') as (name:chararray,
prof:chararray,max_sal:float,travel:float,board:chararray,state:chararray,year:int);
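If max_sal still comes back empty, one likely cause is the thousands separator in 52,122.10, which prevents the cast to float. A minimal sketch, assuming the file is tab-delimited and only that column carries commas: load it as chararray, strip the comma, then cast. The alias raw and the field name max_sal_raw are invented for this example.

```pig
-- Load the salary column as text so the value is not lost during the cast.
raw = LOAD 'salaryTravelReport_sample.txt' USING PigStorage('\t') AS (name:chararray,
    prof:chararray, max_sal_raw:chararray, travel:float, board:chararray,
    state:chararray, year:int);
-- Drop surrounding spaces and thousands separators, then cast to float.
A = FOREACH raw GENERATE name, prof,
    (float) REPLACE(TRIM(max_sal_raw), ',', '') AS max_sal,
    travel, board, state, year;
DUMP A;
```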

Unable to load data using Apache Pig

I have CSV data in the following format:
id,name,price,information
12,Pants,50.00,{Clothes & Shoes: 5}
And here is my pig script:
grunt> sample = LOAD 'data.csv' USING PigStorage (',') AS (id:int, name:chararray, price:double, information:chararray);
The problem is that when I load information as a chararray, I can't access the category or the quantity itself. I tried to do something like:
information:tuple(category:chararray, quantity:int)
But it didn't work..
What should I do?
What is the best way to load information so I can have access to both category and quantity..
Thanks
What you have is a bag, not a tuple.
( ) A tuple is enclosed in parentheses ( ).
{ } An inner bag is enclosed in curly brackets { }.
You can load it like this:
sample = LOAD 'data.csv' USING PigStorage (',') AS (id:int, name:chararray, price:double, information:bag{});
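If loading the field as a bag does not give access to the two values, another option is to keep information as a chararray and pull the parts out with REGEX_EXTRACT. This is only a sketch, assuming every value has the form {category: quantity}; the aliases items and parsed are made up for illustration (items is used instead of sample because SAMPLE is also the name of a Pig operator).

```pig
items = LOAD 'data.csv' USING PigStorage(',')
        AS (id:int, name:chararray, price:double, information:chararray);
-- Group 1 is the text before the colon, group 2 the digits after it.
parsed = FOREACH items GENERATE id, name, price,
    REGEX_EXTRACT(information, '\\{([^:]*):\\s*(\\d+)\\}', 1) AS category,
    (int) REGEX_EXTRACT(information, '\\{([^:]*):\\s*(\\d+)\\}', 2) AS quantity;
DUMP parsed;
```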

Removing HTML tags from thousands of rows in a CSV in PIG

I have a large collection of data from Stack Overflow which I obtained by querying the DB using the data explorer.
I am loading the data into HDFS and I would like to remove all HTML tags from every row of a certain column using pig.
Before loading the data I tried a Ctrl F and replace all "<*>" with "" but Excel couldn't do this for 250000 rows of data and crashed.
How could I go about doing this in Pig? So far this is all I have, which is not a lot:
StackOverflow = load 'StackOverflow.csv' using PigStorage(',');
noHTML = FOREACH StackOverflow REPLACE(%STRING%, '<*>', '""')
What argument can I use in %String% to tell PIG to do this for each row?
You have to refer to the column that needs to be modified. Assuming you have 3 columns and want to remove the HTML tags from the 2nd column, you would use the script below, where $1 refers to the 2nd column. REPLACE takes a regular expression, so the pattern '<[^>]*>' is used to match a whole tag.
StackOverflow = load 'StackOverflow.csv' using PigStorage(',');
noHTML = FOREACH StackOverflow GENERATE $0,REPLACE($1, '<[^>]*>', '') as f2_new,$2;
DUMP noHTML;
Or by using column names:
StackOverflow = load 'StackOverflow.csv' using PigStorage(',') as (f1:chararray,f2:chararray,f3:chararray);
noHTML = FOREACH StackOverflow GENERATE f1,REPLACE(f2, '<[^>]*>', '') as f2_new,f3;
DUMP noHTML;
There are a lot of other ways to do it. Trying to do it in a word processor or spreadsheet won't help; this is a text-processing job. You can use Perl, but the simplest way is to use Unix/Linux tools such as sed and grep. Leave the replacement part empty so the matched text is deleted:
sed -i -e 's/<string you want to delete>//g' filename

apache pig load data with multiple delimiters

Hi everyone, I have a problem loading data with Apache Pig. The file format looks like this:
"1","2","xx,yy","a,sd","3"
So I want to load it using the multi-character delimiter "," (two double quotes with a comma between them), like this:
A = LOAD 'file.csv' USING PigStorage('","') AS (f1,f2,f3,f4,f5);
But PigStorage doesn't accept the multi-character delimiter ",". How can I do it? Thank you very much!
PigStorage takes a single character as its delimiter. You will have to use a loader from Piggybank instead. Download piggybank.jar, save it in the same folder as your Pig script, and register the jar in the script.
REGISTER piggybank.jar;
DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
A = LOAD 'test1.txt' USING CSVLoader(',') AS (f1:int,f2:int,f3:chararray,f4:chararray,f5:int);
B = FOREACH A GENERATE f1,f2,f3,f4,f5;
DUMP B;
An alternative is to load each record as a single line and then use STRSPLIT:
A = LOAD 'test1.txt' USING TextLoader() AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line, '","'));
DUMP B;
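Splitting on "," alone leaves the opening quote on the first field and the closing quote on the last one. A small sketch of one way to handle that, assuming every line starts and ends with a double quote and has exactly five fields; the alias C and the step that strips the outer quotes are additions for illustration.

```pig
A = LOAD 'test1.txt' USING TextLoader() AS (line:chararray);
-- Remove the opening and closing quote so the first and last fields come out clean.
B = FOREACH A GENERATE REPLACE(line, '^"|"$', '') AS line;
-- Split on the three-character sequence "," and name the resulting fields.
C = FOREACH B GENERATE FLATTEN(STRSPLIT(line, '","'))
        AS (f1:chararray, f2:chararray, f3:chararray, f4:chararray, f5:chararray);
DUMP C;
```

The fields are left as chararray here; casts such as (int) f1 can be added in a later FOREACH if numeric types are needed.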