Pig CSVExcelStorage remove header - apache-pig

I've seen that there is a constructor which accepts header control parameter
CSVExcelStorage(String delimiter, String multilineTreatmentStr, String eolTreatmentStr, String headerTreatmentStr)
However I haven't found what is the value of "SKIP_INPUT_HEADER" constant.

I dont know why you want the constant value of SKIP_INPUT_HEADER but if your intention is to remove the header during load, then please check the below example
input.csv
Name,Age,Location
a,10,chennai
b,20,banglore
PigScript:(With SKIP_INPUT_HEADER)
REGISTER '/tmp/piggybank.jar';
A = LOAD 'input.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER');
DUMP A;
Output:
(a,10,chennai)
(b,20,banglore)
PigScript:(Without SKIP_INPUT_HEADER)
REGISTER '/tmp/piggybank.jar';
A = LOAD 'input.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX');
DUMP A;
OutPut:
(Name,Age,Location)
(a,10,chennai)
(b,20,banglore)

Related

check if nextflow channel is empty

I am trying to figure out how to check if a channel is empty or not.
For instance, I have two processes. The first process runs only if a combination of parameters/flags are set and if so, checks also if its input file from another process (input via a channel) is not empty, then it creates a new input file for a second process (to eventually replace the default one). As a simplified example:
.....
.....
// create the channel here to force nextflow to wait for the first process
_chNewInputForProcessTwo = Channel.create()
process processOne {
when:
params.conditionOne && parameters.conditionTwo
input:
file inputFile from _channelUpstreamProcess
output:
file("my.output.file") into _chNewInputForProcessTwo
script:
"""
# check if we need to produce new input for second process (i.e., input file not empty)
if [ -s ${inputFIle} ]
then
<super_command_to_generate_new_fancy_input_for_second_process> > "my.output.file"
else
echo "No need to create new input"
fi
"""
}
// and here I would like to check if new input was generated or leave the "default" one
_chInputProcessTwo = Channel.from(_chNewInputForProcessTwo).ifEmpty(Channel.value(params.defaultInputProcessTwo))
process secondProcess {
input:
file inputFile from _chInputProcessTwo
......
......
etc.
When I try running with this approach it fails because the channel _chNewInputForProcessTwo contains DataflowQueue(queue=[]) therefore, not being actually empty.
I've tried several things looking at the documentation and the threads on google groups and on gitter. trying to set it to empty, but then it complains i am trying to use the channel twice. putting create().close(), etc.
Is there a clean/reasonable way to do this? I could do it using a value channel and have the first process output some string on the stdout to be picked up and checked by the second process, but that seems pretty dirty to me.
Any suggestions/feedback is appreciated. Thank you in advance!
Marius
Best to avoid trying to check if the channel is empty. If your channel could be empty and you need a default value in your channel, you can use the ifEmpty operator to supply one. Note that a single value is implicitly a value channel. I think all you need is:
myDefaultInputFile = file(params.defaultInputProcessTwo)
chInputProcessTwo = chNewInputForProcessTwo.ifEmpty(myDefaultInputFile)
Also, calling Channel.create() is usually unnecessary.

How to load data into pig using different PigStorage operator

I am new to Apache Pig and trying to load test twitter data to find out the number of tweets by each user name. Below is my data
format(twitterId,comment,userRefId):
Sample Data
When I am trying to load data into Pig using PigStorage as (',') it is separating my comment section also into multiple fields because comments could also have','. Please let me know how to load this data properly in Pig. I am using below command:
data = LOAD '/home/vinita/Desktop/Material/PIG/test.csv' using PigStorage(',') AS (id:chararray,comment:chararray,refId:chararray);
Load the record into a line,then replace ," with | and ", with |.This will ensure the fields are separated and then use STRSPLIT to get the 3 fields.
A = LOAD 'data.txt' AS (line:chararray);
B = FOREACH A GENERATE REPLACE(REPLACE(line,',"','|'),'",','|');
C = FOREACH B GENERATE STRSPLIT($0,'\\|',3);
DUMP C;
EDIT:
I used sample text to run the script and works fine.See below
If changing the separator in your source data is an option, I would go that route. Makes it probably a lot easier to get started and to track down issues.
If you change your separator to a |, your code could look like:
data = LOAD '/home/vinita/Desktop/Material/PIG/test.csv' using PigStorage('|') AS (id:chararray,comment:chararray,refId:chararray);

Load CSV file in PIG

In PIG, When we load a CSV file using LOAD statement without mentioning schema & with default PIGSTORAGE (\t), what happens? Will the Load work fine and can we dump the data? Else will it throw error since the file has ',' and the pigstorage is '/t'? Please advice
When you load a csv file without defining a schema using PigStorage('\t'), since there are no tabs in each line of the input file, the whole line will be treated as one tuple. You will not be able to access the individual words in the line.
Example:
Input file:
john,smith,nyu,NY
jim,young,osu,OH
robert,cernera,mu,NJ
a = LOAD 'input' USING PigStorage('\t');
dump a;
OUTPUT:
(john,smith,nyu,NY)
(jim,young,osu,OH)
(robert,cernera,mu,NJ)
b = foreach a generate $0, $1, $2;
dump b;
(john,smith,nyu,NY,,)
(jim,young,osu,OH,,)
(robert,cernera,mu,NJ,,)
Ideally, b should have been:
(john,smith,nyu)
(jim,young,osu)
(robert,cernera,mu)
if the delimiter was a comma. But since the delimiter was a tab and a tab does not exist in the input records, the whole line was treated as one field. Pig doe snot complain if a field is null- It just outputs nothing when there is a null. Hence you see only the commas when you dump b.
Hope that was useful.

apache pig load data with multiple delimiters

Hi everyone I have a problem about loading data using apache pig, the file format is like:
"1","2","xx,yy","a,sd","3"
So I want to load it by using the multiple delimiter "," 2double quotes and one comma like:
A = LOAD 'file.csv' USING PigStorage('","') AS (f1,f2,f3,f4,f5);
but the PigStorage doesn't accept the multiple delimiter ",".How I can do it? Thank you very much!
PigStorage takes single character as delimiter.You will have use builtin functions from PiggyBank. Download piggybank.jar and save in the same folder as your pigscript.Register the jar in your pigscript.
REGISTER piggybank.jar;
DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
A = LOAD 'test1.txt' USING CSVLoader(',') AS (f1:int,f2:int,f3:chararray,f4:chararray,f5:int);
B = FOREACH A GENERATE f1,f2,f3,f4,f5;
DUMP B;
Alternate option is to load the data into a line and then use STRSPLIT
A = LOAD 'test1.txt' USING TextLoader() AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line, '","'));
DUMP B;

How can I incorporate the current input filename into my Pig Latin script?

I am processing data from a set of files which contain a date stamp as part of the filename. The data within the file does not contain the date stamp. I would like to process the filename and add it to one of the data structures within the script. Is there a way to do that within Pig Latin (an extension to PigStorage maybe?) or do I need to preprocess all of the files using Perl or the like beforehand?
I envision something like the following:
-- Load two fields from file, then generate a third from the filename
rawdata = LOAD '/directory/of/files/' USING PigStorage AS (field1:chararray, field2:int, field3:filename);
-- Reformat the filename into a datestamp
annotated = FOREACH rawdata GENERATE
REGEX_EXTRACT(field3,'*-(20\d{6})-*',1) AS datestamp,
field1, field2;
Note the special "filename" datatype in the LOAD statement. Seems like it would have to happen there as once the data has been loaded it's too late to get back to the source filename.
You can use PigStorage by specify -tagsource as following
A = LOAD 'input' using PigStorage(',','-tagsource');
B = foreach A generate INPUT_FILE_NAME;
The first field in each Tuple will contain input path (INPUT_FILE_NAME)
According to API doc http://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/PigStorage.html
Dan
The Pig wiki as an example of PigStorageWithInputPath which had the filename in an additional chararray field:
Example
A = load '/directory/of/files/*' using PigStorageWithInputPath()
as (field1:chararray, field2:int, field3:chararray);
UDF
// Note that there are several versions of Path and FileSplit. These are intended:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.builtin.PigStorage;
import org.apache.pig.data.Tuple;
public class PigStorageWithInputPath extends PigStorage {
Path path = null;
#Override
public void prepareToRead(RecordReader reader, PigSplit split) {
super.prepareToRead(reader, split);
path = ((FileSplit)split.getWrappedSplit()).getPath();
}
#Override
public Tuple getNext() throws IOException {
Tuple myTuple = super.getNext();
if (myTuple != null)
myTuple.append(path.toString());
return myTuple;
}
}
-tagSource is deprecated in Pig 0.12.0 .
Instead use
-tagFile - Appends input source file name to beginning of each tuple.
-tagPath - Appends input source file path to beginning of each tuple.
A = LOAD '/user/myFile.TXT' using PigStorage(',','-tagPath');
DUMP A ;
will give you the full file path as first column
( hdfs://myserver/user/blo/input/2015.TXT,439,43,05,4,NAVI,PO,P&C,P&CR,UC,40)
Refrence: http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/builtin/PigStorage.html
A way to do this in Bash and PigLatin can be found at: How Can I Load Every File In a Folder Using PIG?.
What I've been doing lately though, and find to be much cleaner is embedding Pig in Python. That let's you throw all sorts of variables and such between the two. A simple example is:
#!/path/to/jython.jar
# explicitly import Pig class
from org.apache.pig.scripting import Pig
# COMPILE: compile method returns a Pig object that represents the pipeline
P = Pig.compile(
"a = load '$in'; store a into '$out';")
input = '/path/to/some/file.txt'
output = '/path/to/some/output/on/hdfs'
# BIND and RUN
results = P.bind({'in':input, 'out':output}).runSingle()
if results.isSuccessful() :
print 'Pig job succeeded'
else :
raise 'Pig job failed'
Have a look at Julien Le Dem's great slides as an introduction to this, if you're interested. There's also a ton of documentation at http://pig.apache.org/docs/r0.9.2/cont.pdf.