how to remove a field from tuple in apache pig - apache-pig

i have a code as follow:
B_t= LOAD 'test.csv' USING PigStorage('\t') as (id:chararray,usr_id:chararray,weed:chararray,ip:chararray);
in above, i have field with name weed, i would like remove this field from record with filter command without use codes as follow:
B_f = FOREACH B_t GENERATE id , usr_id, ip
or
B_t= LOAD 'test.csv' USING PigStorage('\t') $0 as id, ....;
have anyone idea???

Why in the world u want to use Filter only. Filter is used emit the fields based on condition, not remove the entire field. Foreach -- Generate is the best way.

Related

How to ignore "," in data fields

I am trying to generate following ...
Input
396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift, but at the same it is a curse",Obey_Jony09
396124440112951296,"00:00 #MAW",WesleyBitton
A = LOAD '/user/root/data/tweets.csv' USING PigStorage(',') as (users:chararray, tweets:chararray);
B = FILTER A by users == '396124436476092416';
output truncated
(396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift)
Output excepting
(396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift, but at the same it is a curse")
I do not want to read row as line.
You can use CSVLoader for loading data
however if you do not wish to do that here is the work around in Apache Pig itself for that :
--Load your Data
A = LOAD 'your/path/users.csv' USING TextLoader() AS (unparsed:chararray);
--Replace your " string with | so as to separate your tweets
B = FOREACH A GENERATE REPLACE(unparsed, '\\"', '|') AS parsed:chararray;
--store your temporary parsed data into your location
STORE B INTO 'your/path/parsed_users.csv' USING PigStorage('|');
--load your parsed data
C = LOAD 'your/path/parsed_users.csv' USING PigStorage('|') AS (users:chararray, tweets:chararray);
--Dump your data , how ever this will still contain one extra comma(,) but you can replace it by using the replace function you get the point.
DUMP C;
Thats fit in the csv standardization, so you need just to use CSVLoader which
supports double-quoted fields that contain commas and other
double-quotes escaped with backslashes.
This is how to use it :
register file:/home/hadoop/lib/pig/piggybank.jar
DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
A = LOAD '/user/root/data/tweets.csv' USING CSVLoader AS (users:chararray, tweets:chararray);
B = FILTER A by users == '396124436476092416';

store data into a path based on a field's value in the tuple

{'aaa', 'bbb', 'ccc'}
....
Let's suppose above tupple gets loaded with following schema:
as (firstField:chararray, secondField:chararray, thirdField:chararray)
I want to store the tuple in HDFS with the path based on 2nd field (which is 'bbb' in above example). So the above tuple would get stored in the path
/SomeBaseDir/bbb/testoutput.txt
Any help would be appreciated.
To load the file use below command. Remember the file input data should be tab separated.If you are using any other separator like comma then change the parameter passes in PigStorage funtion. It should be PigStorage(',')
A = load '/home/abhishek/Work/pigInput/data' using PigStorage('\t') as (firstField:chararray, secondField:chararray, thirdField:chararray);
Now, to get the second element simple use:
result = foreach A generate secondField;
Result
dump result
('bbb')
You can store it using below command
store result into 'provide the path';
I think you want to use MultiStorage (https://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/MultiStorage.html). This should do pretty much what you want. Specify base path, then the field on which subdirectories should be based.
The SPLIT operator in PigLatin would do the job here.
For an input data (in) which is loaded, it could be split into different output variable as following:
loadedData = load ' ' as (.. ,somefield, ) using ... ;
SPLIT loadedData INTO
segmentA IF (somefield=='A'),
segmentB IF (somefield=='B'),
OtherSources OTHERWISE;
store segmentA into 'hdfs://<path for data segmentA >' using ....;
store segmentB into 'hdfs://<path for data segmentB >' using ....;

REGEX_EXTRACT error in PIG

I have a CSV file with 3 columns: tweetid , tweet, and Userid. However within the tweet column there are comma separated values.
i.e. of 1 row of data:
`396124437168537600`,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.",savava143
I want to extract all 3 fields individually, but REGEX_EXTRACT is giving me an error with this code:
a = LOAD tweets USING PigStorage(',') AS (f1,f2,f3);
b = FILTER a BY REGEX_EXTRACT(f1,'(.*)\\"(.*)',1);
The error is:
error: Filter's condition must evaluate to boolean.
In the use case shared, reading the data using PigStrorage(',') will result in missing savava143 (last field value)
A = LOAD '/Users/muralirao/learning/pig/a.csv' USING PigStorage(',') AS (f1,f2,f3);
DUMP A;
Output : A : Observe that the last field value is missing.
(396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.")
For the use case shared, to extract all the values from CSV file with field values having ',' we can use either CSVExcelStorage or CSVLoader.
Approach 1 : Using CSVExcelStorage
Ref : http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html
Input : a.csv
396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.",savava143
Pig Script :
REGISTER piggybank.jar;
A = LOAD 'a.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS (f1,f2,f3);
DUMP A;
Output : A
(396124437168537600,I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.,savava143)
Approach 2 : Using CSVLoader
Ref : http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/CSVLoader.html
Below script makes use of CSVLoader(), DUMP A will result in the same output seen earlier.
A = LOAD 'a.csv' USING org.apache.pig.piggybank.storage.CSVLoader() AS (f1,f2,f3);
The error is that you do not want to FILTER based on a regex but GENERATE new fields based on a regex. To filter, you need to know if the line have to be filtered, hence the boolean requirement.
Therefore, you have to use :
b = FOREACH a GENERATE REGEX_EXTRACT(FIELD, REGEX, HOW_MANY_GROUPS_TO_RETURN);
However, as #Murali Rao said, your values are not just coma separated but CSV (think how you will handle a coma in tweet : it is not a field separator, just some content).

How to remove characters from a field in Pig

Data:
someId,+1 5552221234
someId2,+1 3331114321
I want to remove the +1 from the second field below
I first load the data
A= LOAD 'Data' USING PigStroage(,) as (Id:chararray, Phone:chararray)
Now i want to have the following Data
Desired Output:
someId, 5552221234
someId2, 3331114321
How would i go about doing this. I was using the following but it doesn't work:
mss_demographic_data3= FOREACH mss_demographic_data2 GENERATE *, REGEX_EXTRACT_ALL(Phone, '[0-9]{9}$') as newPhone;
Use the substring function. (easiest way)
mss_demographic_data3= FOREACH mss_demographic_data2 GENERATE Id,SUBSTRING(Phone,3,12);
NOTE- You have this (substring function) only if you are using pig 0.8.0 or above. If you are using older version of pig, you may need to write an udf.

Apache Pig: Merging list of attributes into a single tuple

I receive data in the form
id1|attribute1a,attribute1b|attribute2a|attribute3a,attribute3b,attribute3c....
id2||attribute2b,attribute2c|..
I'm trying to merge it all into a form where I just have a bag of tuples of an id field followed by a tuple containing a list of all my other fields merged together.
(id1,(attribute1a,attribute1b,attribute2a,attribute3a,attribute3b,attribute3c...))
(id2,(attribute2b,attribute2c...))
Currently I fetch it like
my_data = load '$input' USING PigStorage(|) as
(id:chararray, attribute1:chararray, attribute2:chararray)...
then I've tried all combinations of FLATTEN, TOKENIZE, GENERATE, TOTUPLE, BagConcat, etc. to massage it into the form I want, but I'm new to pig and just can't figure it out. Can anyone help? Any open source UDF libraries are fair game.
Load each line as an entire string, and then use the features of the built-in STRPLIT UDF to achieve the desired result. This relies on there being no tabs in your list of attributes, and assumes that | and , are not to be treated any differently in separating out the different attributes. Also, I modified your input a little bit to show more edge cases.
input.txt:
id1|attribute1a,attribute1b|attribute2a|,|attribute3a,attribute3b,attribute3c
id2||attribute2b,attribute2c,|attribute4a|,attribute5a
test.pig:
my_data = LOAD '$input' AS (str:chararray);
split1 = FOREACH my_data GENERATE FLATTEN(STRSPLIT(str, '\\|', 2)) AS (id:chararray, attr:chararray);
split2 = FOREACH split1 GENERATE id, STRSPLIT(attr, '[,|]') AS attributes;
DUMP split2;
Output of pig -x local -p input=input.txt test.pig:
(id1,(attribute1a,attribute1b,attribute2a,,,attribute3a,attribute3b,attribute3c))
(id2,(,attribute2b,attribute2c,,attribute4a,,attribute5a))