REGEX_EXTRACT error in PIG - apache-pig

I have a CSV file with 3 columns: tweetid, tweet, and Userid. However, the tweet column itself contains commas.
For example, one row of data:
396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.",savava143
I want to extract all 3 fields individually, but REGEX_EXTRACT is giving me an error with this code:
a = LOAD tweets USING PigStorage(',') AS (f1,f2,f3);
b = FILTER a BY REGEX_EXTRACT(f1,'(.*)\\"(.*)',1);
The error is:
error: Filter's condition must evaluate to boolean.

In the use case shared, reading the data using PigStorage(',') drops savava143 (the last field value), because the comma inside the tweet is treated as a field separator:
A = LOAD '/Users/muralirao/learning/pig/a.csv' USING PigStorage(',') AS (f1,f2,f3);
DUMP A;
Output : A : Observe that the last field value is missing.
(396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.")
For the use case shared, to extract all the values from a CSV file whose field values contain ',', we can use either CSVExcelStorage or CSVLoader.
Approach 1 : Using CSVExcelStorage
Ref : http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html
Input : a.csv
396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.",savava143
Pig Script :
REGISTER piggybank.jar;
A = LOAD 'a.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS (f1,f2,f3);
DUMP A;
Output : A
(396124437168537600,I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.,savava143)
Approach 2 : Using CSVLoader
Ref : http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/CSVLoader.html
The script below uses CSVLoader() (with piggybank.jar registered as in Approach 1); DUMP A produces the same output as seen earlier.
A = LOAD 'a.csv' USING org.apache.pig.piggybank.storage.CSVLoader() AS (f1,f2,f3);
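The difference between a plain delimiter split and a CSV-aware loader can be illustrated outside Pig. The following is only a Python sketch of the idea, using the standard csv module on the sample row from the question; it is not how PigStorage or CSVLoader are implemented.

```python
import csv
import io

# Sample row copied from the question.
row = ('396124437168537600,"I really wish I didn\'t give up everything I did '
       'for you, I\'m so mad at my self for even letting it get as far as it '
       'did.",savava143')

# Naive splitting treats the comma inside the quoted tweet as a separator,
# so the tweet is cut in two and the row yields one field too many.
naive_fields = row.split(',')

# A CSV-aware parser keeps the quoted tweet together as a single field.
csv_fields = next(csv.reader(io.StringIO(row)))

print(len(naive_fields))  # 4
print(len(csv_fields))    # 3
print(csv_fields[2])      # savava143
```

With a schema of only three fields, the fourth piece from the naive split is the one that gets lost, which matches the missing savava143 above.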

The error is that you do not want to FILTER based on a regex but to GENERATE new fields based on a regex. FILTER needs to know whether each line should be kept or dropped, hence the boolean requirement; REGEX_EXTRACT returns the matched text, not a boolean.
Therefore, you have to use:
b = FOREACH a GENERATE REGEX_EXTRACT(FIELD, REGEX, INDEX_OF_GROUP_TO_RETURN);
However, as @Murali Rao said, your values are not just comma separated but CSV (think about how you will handle a comma inside a tweet: it is not a field separator, just some content).
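The distinction between extracting a capture group and testing a condition can be shown with a small Python analogy. This is only a sketch: the line and the pattern are made up for illustration, and Python's re module is not the Java regex engine Pig uses.

```python
import re

# Hypothetical input line and pattern, for illustration only.
line = '396124437168537600,"some tweet, with a comma",savava143'
pattern = re.compile(r'^(\d+),"(.*)",(\w+)$')

match = pattern.match(line)

# Extraction: each capture group is returned as text, selected by its index.
tweet_id = match.group(1)
tweet = match.group(2)
user_id = match.group(3)

# Filtering: a condition must be true or false, not a piece of text.
keep_line = match is not None

print(tweet_id, user_id, keep_line)
```

Extraction belongs in a projection step (FOREACH ... GENERATE), while a true/false test belongs in FILTER.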

Related

how to remove a field from tuple in apache pig

I have the following code:
B_t= LOAD 'test.csv' USING PigStorage('\t') as (id:chararray,usr_id:chararray,weed:chararray,ip:chararray);
In the above, I have a field named weed. I would like to remove this field from each record using the FILTER command, without using code like the following:
B_f = FOREACH B_t GENERATE id , usr_id, ip;
or
B_t= LOAD 'test.csv' USING PigStorage('\t') $0 as id, ....;
Does anyone have an idea?
Why do you want to use FILTER only? FILTER selects whole records based on a condition; it does not remove a field from them. FOREACH ... GENERATE is the best way.

How to ignore "," in data fields

I am trying to generate following ...
Input
396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift, but at the same it is a curse",Obey_Jony09
396124440112951296,"00:00 #MAW",WesleyBitton
A = LOAD '/user/root/data/tweets.csv' USING PigStorage(',') as (users:chararray, tweets:chararray);
B = FILTER A by users == '396124436476092416';
The output is truncated:
(396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift)
Expected output:
(396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift, but at the same it is a curse")
I do not want to read each row as a single line.
You can use CSVLoader to load the data.
However, if you do not wish to do that, here is a workaround in Apache Pig itself:
--Load your Data
A = LOAD 'your/path/users.csv' USING TextLoader() AS (unparsed:chararray);
--Replace your " string with | so as to separate your tweets
B = FOREACH A GENERATE REPLACE(unparsed, '\\"', '|') AS parsed:chararray;
--store your temporary parsed data into your location
STORE B INTO 'your/path/parsed_users.csv' USING PigStorage('|');
--load your parsed data
C = LOAD 'your/path/parsed_users.csv' USING PigStorage('|') AS (users:chararray, tweets:chararray);
--Dump your data. Each field will still contain one extra comma (,), which you can remove with the REPLACE function in the same way.
DUMP C;
That fits the CSV standard, so you just need to use CSVLoader, which supports double-quoted fields that contain commas and other double-quotes escaped with backslashes.
This is how to use it:
register file:/home/hadoop/lib/pig/piggybank.jar
DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
A = LOAD '/user/root/data/tweets.csv' USING CSVLoader AS (users:chararray, tweets:chararray);
B = FILTER A by users == '396124436476092416';

How to use Bioproject ID, for example, PRJNA12997, in biopython?

I have an Excel file listing more than 2000 organisms, each of which has an associated Bioproject ID (like PRJNA12997). The idea is to use these IDs to get the sequences for a later multiple alignment with five other sequences that I have in a text file.
Can anyone help me understand how I can do this using biopython? At least the part with the Bioproject ID.
You can first get the info using Bio.Entrez:
from Bio import Entrez
Entrez.email = "Your.Name.Here@example.org"
# This call to efetch fails sometimes with a 400 error.
handle = Entrez.efetch(db="bioproject", id="PRJNA12997")
I've been trying, and Entrez.read(handle) doesn't seem to work. But if you do record_xml = handle.read() you'll get the XML entry for this record. In this XML you can find the ID for the organism, in this case 12997.
handle = Entrez.esearch(db="nuccore", term="12997[BioProject]")
search_results = Entrez.read(handle)
Now you can efetch from your search results. At this point you should use Biopython to parse whatever you get in the efetch step, playing with the rettype (see http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/):
for result in search_results["IdList"]:
    entry = Entrez.efetch(db="nuccore", id=result, rettype="fasta")
    this_seq_in_fasta = entry.read()
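The step of pulling the numeric ID out of record_xml is not shown above. A minimal sketch with the standard library follows. The XML fragment is an assumed, simplified stand-in for the real response (the actual record is larger and its layout should be checked), and numeric_id_for is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: a simplified stand-in for the XML returned by handle.read().
# The real BioProject record contains many more elements.
SAMPLE_XML = """<RecordSet>
  <DocumentSummary uid="12997">
    <Project>
      <ProjectID>
        <ArchiveID accession="PRJNA12997" archive="NCBI" id="12997"/>
      </ProjectID>
    </Project>
  </DocumentSummary>
</RecordSet>"""


def numeric_id_for(record_xml, accession):
    """Return the 'id' attribute of the element carrying this accession."""
    root = ET.fromstring(record_xml)
    for element in root.iter():
        if element.get("accession") == accession:
            return element.get("id")
    return None  # accession not present in this record


numeric_id = numeric_id_for(SAMPLE_XML, "PRJNA12997")
# Build the esearch term used in the answer above.
search_term = "{}[BioProject]".format(numeric_id)
print(search_term)  # 12997[BioProject]
```

Searching every element for the accession attribute, rather than hard-coding a path, keeps the sketch tolerant of differences between this assumed fragment and the real record.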

store data into a path based on a field's value in the tuple

{'aaa', 'bbb', 'ccc'}
....
Let's suppose the above tuple gets loaded with the following schema:
as (firstField:chararray, secondField:chararray, thirdField:chararray)
I want to store the tuple in HDFS with the path based on 2nd field (which is 'bbb' in above example). So the above tuple would get stored in the path
/SomeBaseDir/bbb/testoutput.txt
Any help would be appreciated.
To load the file, use the command below. Remember that the input data should be tab separated. If you are using any other separator, such as a comma, change the parameter passed to the PigStorage function; for a comma it should be PigStorage(',').
A = load '/home/abhishek/Work/pigInput/data' using PigStorage('\t') as (firstField:chararray, secondField:chararray, thirdField:chararray);
Now, to get the second element, simply use:
result = foreach A generate secondField;
Result
dump result;
('bbb')
You can store it using the command below:
store result into 'provide the path';
I think you want to use MultiStorage (https://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/MultiStorage.html). This should do pretty much what you want. Specify base path, then the field on which subdirectories should be based.
The SPLIT operator in PigLatin would do the job here.
Once the input data is loaded, it can be split into different output relations as follows:
loadedData = load ' ' using ... as (.. ,somefield, );
SPLIT loadedData INTO
segmentA IF (somefield=='A'),
segmentB IF (somefield=='B'),
OtherSources OTHERWISE;
store segmentA into 'hdfs://<path for data segmentA >' using ....;
store segmentB into 'hdfs://<path for data segmentB >' using ....;
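The routing idea behind both suggestions, deriving the output location from a field's value, can be sketched in plain Python. This only illustrates the grouping logic and is not what MultiStorage or SPLIT do internally. The helper name group_by_second_field and the two extra records are hypothetical; the base directory and file name come from the question.

```python
import posixpath
from collections import defaultdict


def group_by_second_field(records, base_dir):
    """Map each output path to the records whose second field selects it."""
    grouped = defaultdict(list)
    for record in records:
        # The second field (index 1) becomes the subdirectory name.
        path = posixpath.join(base_dir, record[1], "testoutput.txt")
        grouped[path].append(record)
    return dict(grouped)


# First tuple is from the question; the other two are made up for illustration.
records = [
    ("aaa", "bbb", "ccc"),
    ("ddd", "bbb", "eee"),
    ("fff", "ggg", "hhh"),
]

plan = group_by_second_field(records, "/SomeBaseDir")
for path in sorted(plan):
    print(path, len(plan[path]))
```

SPLIT needs every value listed up front, whereas grouping by the field value handles values that are not known in advance, which is the case MultiStorage is aimed at.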

Conditions in Pig storage

Say I have an input file containing maps.
sample.txt
[1#"anything",2#"something",3#"anotherthing"]
[2#"kish"]
[3#"mad"]
[4#"sun"]
[1#"moon"]
[1#"world"]
When there are no values for the specified key, I do not want to save anything to a file. Is there a conditional statement that I can include with the STORE ... INTO statement? Please help me with this; the Pig script follows.
A = LOAD 'sample.txt';
B = FOREACH A GENERATE $0#'5' AS temp;
C = FILTER B BY temp is not null;
-- This actually generates an empty part-r-X file
-- Is there a conditional statement I can include so that if C is empty, nothing is stored?
STORE C INTO '/user/logs/output';
Thanks
Am I going wrong somewhere? Please correct me if I am wrong.
From Chapter 9 of Programming Pig,
Pig Latin is a dataflow language. Unlike general purpose programming languages, it does not include control flow constructs like if and for.
Thus, it is impossible to do this using just Pig.
I'm inclined to say you could achieve this using a combination of a custom StoreFunc and a custom OutputFormat, but that seems like it would be too much added overhead.
One way to solve this would be to just delete the output file if no records are written. This is not too difficult using embedded Pig. For example, using Python embedding:
from org.apache.pig.scripting import Pig
P = Pig.compile("""
A = load 'sample.txt';
B = foreach A generate $0#'5' AS temp;
C = filter B by temp is not null;
store C into 'output/foo/bar';
""")
bound = P.bind()
stats = bound.runSingle()
if not stats.isSuccessful():
    raise RuntimeError(stats.getErrorMessage())
result = stats.result('C')
if result.getNumberRecords() < 1:
    print 'Removing empty output directory'
    Pig.fs('rmr ' + result.getLocation())
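The same "store, then remove the output if it is empty" pattern can be tried locally without Pig. The sketch below uses only the Python standard library on the local filesystem. store_unless_empty is a hypothetical helper, and the part-00000 file name only mimics Hadoop-style output; it is not produced by Pig here.

```python
import os
import shutil
import tempfile


def store_unless_empty(records, output_dir):
    """Write records to output_dir, then delete the directory if none were written."""
    os.makedirs(output_dir)
    part_file = os.path.join(output_dir, "part-00000")
    with open(part_file, "w") as handle:
        for record in records:
            handle.write(record + "\n")
    if len(records) < 1:
        # Mirrors the 'rmr' call above: drop the empty output directory.
        shutil.rmtree(output_dir)
        return False
    return True


base = tempfile.mkdtemp()
kept_dir = os.path.join(base, "has_rows")
empty_dir = os.path.join(base, "empty")

kept = store_unless_empty(["moon", "world"], kept_dir)
removed = store_unless_empty([], empty_dir)
print(kept, os.path.isdir(kept_dir))
print(removed, os.path.exists(empty_dir))
```

As in the embedded Pig version, the check happens after the write, so an empty output exists briefly before it is removed.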