Load CSV file in PIG - apache-pig

In PIG, When we load a CSV file using LOAD statement without mentioning schema & with default PIGSTORAGE (\t), what happens? Will the Load work fine and can we dump the data? Else will it throw error since the file has ',' and the pigstorage is '/t'? Please advice

When you load a csv file without defining a schema using PigStorage('\t'), since there are no tabs in each line of the input file, the whole line will be treated as one tuple. You will not be able to access the individual words in the line.
Example:
Input file:
john,smith,nyu,NY
jim,young,osu,OH
robert,cernera,mu,NJ
a = LOAD 'input' USING PigStorage('\t');
dump a;
OUTPUT:
(john,smith,nyu,NY)
(jim,young,osu,OH)
(robert,cernera,mu,NJ)
b = foreach a generate $0, $1, $2;
dump b;
(john,smith,nyu,NY,,)
(jim,young,osu,OH,,)
(robert,cernera,mu,NJ,,)
Ideally, b should have been:
(john,smith,nyu)
(jim,young,osu)
(robert,cernera,mu)
if the delimiter was a comma. But since the delimiter was a tab and a tab does not exist in the input records, the whole line was treated as one field. Pig doe snot complain if a field is null- It just outputs nothing when there is a null. Hence you see only the commas when you dump b.
Hope that was useful.

Related

How to clean bad data from huge csv file

So I have huge csv file (assume 5 GB) and I want to insert the data to the table but it return error that the length of the data is not the same
I found that some data has more columns than I want
For example the correct data I have has 8 columns but some data has 9 (it can be human/system error)
I want to take only 8 columns data, but because the data is so huge, I can not do it manually or using parsing in python
Any recommendation of a way to do it?
I am using linux, so any linux command also welcome
In sql I am using COPY ... FROM ... CSV HEADER; command to import the csv into table
You can use awk for this purpose. Assuming you field delimiter is comma (,) this code can do the work:
awk -F\, 'NF==8 {print}' input_file >output_file
A fast and dirty php solution as single command line:
php -r '$f=fopen("a.csv","rb"); $g=fopen("b.csv","wb"); while ( $r=fgetcsv($f) ) { $r = array_slice($r,0,8); fputcsv($g,$r); }'
It reads file a.csv and writes b.csv.

How to load data into pig using different PigStorage operator

I am new to Apache Pig and trying to load test twitter data to find out the number of tweets by each user name. Below is my data
format(twitterId,comment,userRefId):
Sample Data
When I am trying to load data into Pig using PigStorage as (',') it is separating my comment section also into multiple fields because comments could also have','. Please let me know how to load this data properly in Pig. I am using below command:
data = LOAD '/home/vinita/Desktop/Material/PIG/test.csv' using PigStorage(',') AS (id:chararray,comment:chararray,refId:chararray);
Load the record into a line,then replace ," with | and ", with |.This will ensure the fields are separated and then use STRSPLIT to get the 3 fields.
A = LOAD 'data.txt' AS (line:chararray);
B = FOREACH A GENERATE REPLACE(REPLACE(line,',"','|'),'",','|');
C = FOREACH B GENERATE STRSPLIT($0,'\\|',3);
DUMP C;
EDIT:
I used sample text to run the script and works fine.See below
If changing the separator in your source data is an option, I would go that route. Makes it probably a lot easier to get started and to track down issues.
If you change your separator to a |, your code could look like:
data = LOAD '/home/vinita/Desktop/Material/PIG/test.csv' using PigStorage('|') AS (id:chararray,comment:chararray,refId:chararray);

Removing HTML tags from thousands of rows in a CSV in PIG

I have a large collection of data from Stack Overflow which I obtained by querying the DB using the data explorer.
I am loading the data into HDFS and I would like to remove all HTML tags from every row of a certain column using pig.
Before loading the data I tried a Ctrl F and replace all "<*>" with "" but Excel couldn't do this for 250000 rows of data and crashed.
How could I go about doing this in PIG, so far this is what I have which is not a lot:
StackOverflow = load 'StackOverflow.csv' using PigStorage(',');
noHTML = FOREACH StackOverflow REPLACE(%STRING%, '<*>', '""')
What argument can I use in %String% to tell PIG to do this for each row?
You have to refer to the column data that needs to be modified.Assuming you have 3 columns and you would want to replace the html tags in the 2nd column,you would use the below script.$1 refers to the 2nd column
StackOverflow = load 'StackOverflow.csv' using PigStorage(',')
noHTML = FOREACH StackOverflow GENERATE $0,REPLACE($1, '<*>', '') as f2_new,$1;
DUMP noHTML;
Or by using column names
StackOverflow = load 'StackOverflow.csv' using PigStorage(',') as (f1:chararray,f2:chararray,f3:chararray);
noHTML = FOREACH StackOverflow GENERATE f1,REPLACE(f2, '<*>', '') as f2_new,f3;
DUMP noHTML;
There are lot of other ways you can do it. Trying to do it in a word file wouldn't help. You need word processing. You can use perl to do this. The smartest way you can do it is using Unix/Linux tools like sed, grep etc.
sed -i -e 's/<string you want to delete>/""/g' filename

apache pig load data with multiple delimiters

Hi everyone I have a problem about loading data using apache pig, the file format is like:
"1","2","xx,yy","a,sd","3"
So I want to load it by using the multiple delimiter "," 2double quotes and one comma like:
A = LOAD 'file.csv' USING PigStorage('","') AS (f1,f2,f3,f4,f5);
but the PigStorage doesn't accept the multiple delimiter ",".How I can do it? Thank you very much!
PigStorage takes single character as delimiter.You will have use builtin functions from PiggyBank. Download piggybank.jar and save in the same folder as your pigscript.Register the jar in your pigscript.
REGISTER piggybank.jar;
DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
A = LOAD 'test1.txt' USING CSVLoader(',') AS (f1:int,f2:int,f3:chararray,f4:chararray,f5:int);
B = FOREACH A GENERATE f1,f2,f3,f4,f5;
DUMP B;
Alternate option is to load the data into a line and then use STRSPLIT
A = LOAD 'test1.txt' USING TextLoader() AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line, '","'));
DUMP B;

Error loading pig script

I am having difficulty loading data using apache pig script
cat data15.txt
1,(2,3)
2,(3,4)
grunt>a = load 'nikhil/data15.txt' using PigStorage(',') as (x:int, y:tuple(y1:int,y2:int));
grunt>dump a;
(1,)
(2,)
I know its too late to answer this
The problem is that the tuple and the other field have the same delimiter as ','. Pig fails to do the schema conversion.
you can try something like thisyou need to change the delimiter
1:(5,7,7)
3:(7,9,4)
5:(5,9,7)
and run the pig script as
A = load 'file.txt' using PigStorage(':') as (t1:int,t2:tuple(x:int,y:int,z:int));
dump A;
the output is
(1,(5,7,7))
(3,(7,9,4))
(5,(5,9,7))
you can change the delimiter using sed command in the input file and then load the file.