Apache Pig - Numeric data missing while loading in a pig relation - apache-pig

I am learning Apache Pig and am trying to load some data into Pig. When I view the txt file in the vi editor, I find the following (sample) row:
[ABBOTT,DEEDEE W GRADES 9-12 TEACHER 52,122.10 0 LBOE
ATLANTA INDEPENDENT SCHOOL SYSTEM 2010].
I use the following command to load data into a pig relation.
A = LOAD 'salaryTravelReport_sample.txt' USING PigStorage() as (name:chararray,
prof:chararray,max_sal:float,travel:float,board:chararray,state:chararray,year:int);
However, when I do a dump in Pig in the distributed environment, I get the following result (for the row mentioned above):
(ABBOTT,DEEDEE W,GRADES 9-12 TEACHER,,0.0,LBOE,ATLANTA INDEPENDENT
SCHOOL SYSTEM,2010).
The numeric value "52,122.10" seems to be missing.
Please help.

PigStorage() is a built-in Pig function that takes the field delimiter as its argument; here it is the tab character --> \t.
A = LOAD 'salaryTravelReport_sample.txt' USING PigStorage('\t') as (name:chararray,
prof:chararray,max_sal:float,travel:float,board:chararray,state:chararray,year:int);
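A hedged side note, separate from the answer above: a value like 52,122.10 cannot be cast to float because of the thousands separator, and Pig turns a failed cast into null, which is why that column comes out empty. A minimal sketch of one workaround, assuming the fields really are tab-delimited, is to load the salary as chararray, strip the comma, and cast afterwards:
-- Hedged sketch: load max_sal as chararray, drop the thousands separator, then cast to float.
A = LOAD 'salaryTravelReport_sample.txt' USING PigStorage('\t') AS (name:chararray,
prof:chararray,max_sal:chararray,travel:float,board:chararray,state:chararray,year:int);
B = FOREACH A GENERATE name, prof, (float)REPLACE(max_sal, ',', '') AS max_sal,
travel, board, state, year;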

Related

Pig Latin: no error, but output is blank while executing a FOREACH..GENERATE statement

I am getting incorrect output while executing a FOREACH statement.
Step 1) Copied a CSV file from S3 to HDFS without the header.
Step 2) In HDFS mode, I tried to load the same file into an alias in Pig. (Dump works fine up to this point.)
grunt> rec = Load '/home/Output/' using PigStorage(',') AS (Student:chararray,School:chararray,Year:int,Awards:int);
grunt> dump rec;
Step 3) Then I grouped it and tried to count the number of awards.
grunt> rec2 = FOREACH rec1 GENERATE group AS Country, SUM(rec.Awards) AS Award_count;
When I dump rec2, there is no error, but the output is (,).
The same commands work perfectly in local mode; there I get the desired output.
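For reference, the GROUP statement that produces rec1 is not shown above; a minimal sketch of the usual shape of such a pipeline would be the following (the grouping column School is an assumption, not taken from the question):
-- Hypothetical sketch: group the loaded records, then sum the Awards column.
rec = LOAD '/home/Output/' USING PigStorage(',') AS (Student:chararray,School:chararray,Year:int,Awards:int);
rec1 = GROUP rec BY School;
rec2 = FOREACH rec1 GENERATE group AS Country, SUM(rec.Awards) AS Award_count;
DUMP rec2;
If the fields do not split or cast as declared in the distributed run (for example because of stray carriage returns in the copied file), the aggregate will come back as null.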

How to load data into pig using different PigStorage operator

I am new to Apache Pig and am trying to load test Twitter data to find the number of tweets per user name. Below is my data format (twitterId,comment,userRefId):
Sample Data
When I try to load the data into Pig using PigStorage(','), it also splits my comment section into multiple fields, because comments can contain ','. Please let me know how to load this data properly in Pig. I am using the command below:
data = LOAD '/home/vinita/Desktop/Material/PIG/test.csv' using PigStorage(',') AS (id:chararray,comment:chararray,refId:chararray);
Load each record as a single line, then replace ," with | and ", with |. This ensures the fields are separated; then use STRSPLIT to get the 3 fields.
A = LOAD 'data.txt' AS (line:chararray);
B = FOREACH A GENERATE REPLACE(REPLACE(line,',"','|'),'",','|');
C = FOREACH B GENERATE STRSPLIT($0,'\\|',3);
DUMP C;
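A hedged follow-up that is not part of the original answer: C holds a single tuple per record, so flattening it yields three named, typed fields:
-- Hypothetical extra step: flatten the STRSPLIT tuple into separate, typed columns.
D = FOREACH C GENERATE FLATTEN($0) AS (id:chararray, comment:chararray, refId:chararray);
DUMP D;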
EDIT:
I ran the script on sample text and it works fine.
If changing the separator in your source data is an option, I would go that route. It will probably make it a lot easier to get started and to track down issues.
If you change your separator to |, your code could look like this:
data = LOAD '/home/vinita/Desktop/Material/PIG/test.csv' using PigStorage('|') AS (id:chararray,comment:chararray,refId:chararray);

Removing HTML tags from thousands of rows in a CSV in PIG

I have a large collection of data from Stack Overflow which I obtained by querying the DB using the data explorer.
I am loading the data into HDFS and I would like to remove all HTML tags from every row of a certain column using Pig.
Before loading the data I tried a Ctrl+F to replace all "<*>" with "", but Excel couldn't handle 250,000 rows of data and crashed.
How could I go about doing this in Pig? So far this is what I have, which is not a lot:
StackOverflow = load 'StackOverflow.csv' using PigStorage(',');
noHTML = FOREACH StackOverflow REPLACE(%STRING%, '<*>', '""')
What argument can I use in place of %STRING% to tell Pig to do this for each row?
You have to refer to the column that needs to be modified. Assuming you have 3 columns and you want to replace the HTML tags in the 2nd column, you would use the script below. $1 refers to the 2nd column. Note that REPLACE takes a regular expression, so a pattern like '<[^>]*>' matches a full HTML tag.
StackOverflow = load 'StackOverflow.csv' using PigStorage(',')
noHTML = FOREACH StackOverflow GENERATE $0, REPLACE($1, '<[^>]*>', '') AS f2_new, $2;
DUMP noHTML;
Or by using column names
StackOverflow = load 'StackOverflow.csv' using PigStorage(',') as (f1:chararray,f2:chararray,f3:chararray);
noHTML = FOREACH StackOverflow GENERATE f1, REPLACE(f2, '<[^>]*>', '') AS f2_new, f3;
DUMP noHTML;
There are lots of other ways you can do it. Trying to do it in a word processor or Excel won't help; you need text processing. You can use Perl for this, but the simplest way is to use Unix/Linux tools like sed, grep, etc.
sed -i -e 's/<string you want to delete>/""/g' filename

Error loading pig script

I am having difficulty loading data using an Apache Pig script.
cat data15.txt
1,(2,3)
2,(3,4)
grunt> a = load 'nikhil/data15.txt' using PigStorage(',') as (x:int, y:tuple(y1:int,y2:int));
grunt> dump a;
(1,)
(2,)
I know it's too late to answer this.
The problem is that the tuple and the other field use the same delimiter, ','. Pig fails to do the schema conversion.
You can try something like this; you need to change the delimiter:
1:(5,7,7)
3:(7,9,4)
5:(5,9,7)
and run the pig script as
A = load 'file.txt' using PigStorage(':') as (t1:int,t2:tuple(x:int,y:int,z:int));
dump A;
the output is
(1,(5,7,7))
(3,(7,9,4))
(5,(5,9,7))
You can change the delimiter in the input file using the sed command and then load the file.

How do I set one variable equal to another in Pig Latin

I would like to do
register s3n://uw-cse344-code/myudfs.jar
-- load the test file into Pig
--raw = LOAD 's3n://uw-cse344-test/cse344-test-file' USING TextLoader as (line:chararray);
-- later you will load to other files, example:
raw = LOAD 's3n://uw-cse344/btc-2010-chunk-000' USING TextLoader as (line:chararray);
-- parse each line into ntriples
ntriples = foreach raw generate FLATTEN(myudfs.RDFSplit3(line)) as (subject:chararray,predicate:chararray,object:chararray);
--filter 1
subjects1 = filter ntriples by subject matches '.*rdfabout\\.com.*' PARALLEL 50;
--filter 2
subjects2 = subjects1;
but I get the error:
2012-03-10 01:19:18,039 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: mismatched input ';' expecting LEFT_PAREN
Details at logfile: /home/hadoop/pig_1331342327467.log
So it seems Pig doesn't like that. How do I accomplish this?
I don't think that kind of 'typical' assignment works in Pig. It's not really a programming language in the strict sense; it's a high-level language on top of Hadoop with some specialized functions.
I think you'll need to simply re-project the data from subjects1 to subjects2, such as:
subjects2 = foreach subjects1 generate $0, $1, $2;
Another approach might be to use the LIMIT operator with some absurdly high parameter:
subjects2 = LIMIT subjects1 100000000;
There could be a lot of reasons why that doesn't make sense, but it's a thought.
I sense you are considering doing things as you would in a programming language. I have found that rarely works out the way you want, but you can always get the job done once you think like Pig.
As I understand it, your example is from the Data Science Coursera course.
It's strange, but I ran into the same problem: the code works on one amount of data and doesn't on another.
Because we need to change the parameters, I used this code:
filtered2 = foreach filtered generate subject as subject2, predicate as predicate2, object as object2;