How can I incorporate the current input filename into my Pig Latin script? - apache-pig

I am processing data from a set of files which contain a date stamp as part of the filename. The data within the file does not contain the date stamp. I would like to process the filename and add it to one of the data structures within the script. Is there a way to do that within Pig Latin (an extension to PigStorage maybe?) or do I need to preprocess all of the files using Perl or the like beforehand?
I envision something like the following:
-- Load two fields from file, then generate a third from the filename
rawdata = LOAD '/directory/of/files/' USING PigStorage AS (field1:chararray, field2:int, field3:filename);
-- Reformat the filename into a datestamp
annotated = FOREACH rawdata GENERATE
    REGEX_EXTRACT(field3, '*-(20\d{6})-*', 1) AS datestamp,
    field1, field2;
Note the special "filename" datatype in the LOAD statement. It seems like it would have to happen there, as once the data has been loaded it's too late to get back to the source filename.

You can use PigStorage by specifying -tagsource as follows:
A = LOAD 'input' using PigStorage(',','-tagsource');
B = foreach A generate INPUT_FILE_NAME;
The first field in each tuple will contain the input path (INPUT_FILE_NAME).
See the API doc: http://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/PigStorage.html
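To get the date stamp the original question asks for, you can project the tagged field through REGEX_EXTRACT. A minimal sketch, assuming the file names contain a yyyymmdd stamp as in the question (the aliases and the comma delimiter here are just placeholders):
A = LOAD '/directory/of/files/' USING PigStorage(',', '-tagsource')
        AS (filename:chararray, field1:chararray, field2:int);
B = FOREACH A GENERATE
        REGEX_EXTRACT(filename, '(20\\d{6})', 1) AS datestamp,
        field1, field2;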
Dan

The Pig wiki has an example of PigStorageWithInputPath, which puts the filename in an additional chararray field:
Example
A = load '/directory/of/files/*' using PigStorageWithInputPath()
as (field1:chararray, field2:int, field3:chararray);
UDF
// Note that there are several versions of Path and FileSplit. These are intended:
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.builtin.PigStorage;
import org.apache.pig.data.Tuple;

public class PigStorageWithInputPath extends PigStorage {
    Path path = null;

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) {
        super.prepareToRead(reader, split);
        path = ((FileSplit) split.getWrappedSplit()).getPath();
    }

    @Override
    public Tuple getNext() throws IOException {
        // Append the source file path to the end of each tuple
        Tuple myTuple = super.getNext();
        if (myTuple != null)
            myTuple.append(path.toString());
        return myTuple;
    }
}
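To use the UDF you would compile the class into a jar and register that jar before the LOAD; the jar name below is only a placeholder:
REGISTER pig-storage-with-input-path.jar;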

-tagsource is deprecated in Pig 0.12.0. Instead, use:
-tagFile - Appends input source file name to beginning of each tuple.
-tagPath - Appends input source file path to beginning of each tuple.
A = LOAD '/user/myFile.TXT' using PigStorage(',','-tagPath');
DUMP A ;
will give you the full file path as the first column:
( hdfs://myserver/user/blo/input/2015.TXT,439,43,05,4,NAVI,PO,P&C,P&CR,UC,40)
Reference: http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/builtin/PigStorage.html

A way to do this in Bash and Pig Latin can be found at: How Can I Load Every File In a Folder Using PIG?.
What I've been doing lately, though, and find to be much cleaner, is embedding Pig in Python. That lets you pass all sorts of variables and such between the two. A simple example is:
#!/path/to/jython.jar
# explicitly import Pig class
from org.apache.pig.scripting import Pig

# COMPILE: compile method returns a Pig object that represents the pipeline
P = Pig.compile(
    "a = load '$in'; store a into '$out';")

input = '/path/to/some/file.txt'
output = '/path/to/some/output/on/hdfs'

# BIND and RUN
results = P.bind({'in': input, 'out': output}).runSingle()

if results.isSuccessful():
    print 'Pig job succeeded'
else:
    raise Exception('Pig job failed')
Have a look at Julien Le Dem's great slides as an introduction to this, if you're interested. There's also a ton of documentation at http://pig.apache.org/docs/r0.9.2/cont.pdf.
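Tying this back to the original question, the embedding approach lets you pull the date stamp out of each filename in Python and hand it to Pig as a bound parameter. A rough sketch, with hypothetical file names and output paths:
#!/path/to/jython.jar
# Hypothetical sketch: extract the date stamp from each filename in Python,
# then bind it into the Pig pipeline as a parameter.
import re
from org.apache.pig.scripting import Pig

P = Pig.compile("""
a = load '$in' using PigStorage() as (field1:chararray, field2:int);
b = foreach a generate '$datestamp' as datestamp, field1, field2;
store b into '$out';
""")

for filename in ['data-20120101.txt', 'data-20120102.txt']:   # example names only
    m = re.search(r'(20\d{6})', filename)
    datestamp = m.group(1) if m else 'unknown'
    result = P.bind({'in': '/directory/of/files/' + filename,
                     'out': '/path/to/output/' + datestamp,
                     'datestamp': datestamp}).runSingle()
    if not result.isSuccessful():
        raise Exception('Pig job failed for ' + filename)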

Related

How to write specific data from JMeter execution output to CSV / Notepad using beanshell scripting

We are working on a web services automation project using JMeter 4.0. In our response, JMeter returns data in JSON format, but we would like to store only specific data (Account ID, Customer ID, or Account inquiry fields) from that JSON into a CSV file; currently it stores the data in the CSV file unformatted.
Looking for a workaround on this.
We are using the following code:
import java.io.File;
import org.apache.jmeter.services.FileServer;
Result = "FAIL";
Responce = prev.getResponseDataAsString():
if(Responce.contains("data"))
Result = "PASS";
f = new FileOutputStream("C:/Users/Amar.pawar/Desktop/testoup.csv",true);
p = new PrintStream(f);
p.println(vars.get("/ds1odmc") + "," + Result):
p.close();
f.close():
The following error is encountered:
Error invoking bsh method: eval In file: inline evaluation of: ``import java.io.File; import org.apache.jmeter.services.FileServer; Result = "FA . . . '' Encountered ":" at line 5, column 42.
We are looking to save specific data in a CSV (or txt) file instead of the complete unformatted output. Please look into the matter and suggest a fix.
Looks like a typo. You use : instead of ; in three lines:
Responce = prev.getResponseDataAsString():
...
p.println(vars.get("/ds1odmc") + "," + Result):
...
f.close():
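Putting those fixes together, the corrected script would look like this (the output path and the /ds1odmc variable are taken from the question):
import java.io.File;
import org.apache.jmeter.services.FileServer;

Result = "FAIL";
Responce = prev.getResponseDataAsString();
if (Responce.contains("data"))
    Result = "PASS";

f = new FileOutputStream("C:/Users/Amar.pawar/Desktop/testoup.csv", true);
p = new PrintStream(f);
p.println(vars.get("/ds1odmc") + "," + Result);
p.close();
f.close();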
And if the problem is still not solved, it may be useful to check the article about testing complex logic with JMeter Beanshell.

Python - Use %s in value of config file

I use a config file (type .ini) to save my SQL queries, then I get a query by its key. All works fine until I create a query with parameters, for example:
;the ini file
product_by_cat = select * from products where cat =%s
I use:
config = configparser.ConfigParser()
args= ('cat1')
config.read(path_to_ini_file)
query= config.get(section_where_are_stored_thequeries,key_of_the_query)
complete_query= query%args
I get the error:
TypeError: not all arguments converted during string formatting
So it tries to format the string when retrieving the value from the ini file.
Any suggestions for my problem?
You can use the format function like this:
ini file
product_by_cat = select * from products where cat ={}
python:
complete_query= query.format(args)
Depending on the version of ConfigParser (Python 2 or Python 3), you may need to double the % like this, or it throws an error:
product_by_cat = select * from products where cat =%%s
A better way, though, would be to use the raw version of the config parser, so the % char isn't interpreted:
config = configparser.RawConfigParser()
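Putting it together, a minimal sketch using RawConfigParser and % formatting (the file path and section name are placeholders):
import configparser

# RawConfigParser leaves '%' alone, so the stored query keeps its '%s' placeholder
config = configparser.RawConfigParser()
config.read('queries.ini')                        # hypothetical path
query = config.get('queries', 'product_by_cat')   # hypothetical section name

args = ('cat1',)            # trailing comma makes this a one-element tuple
complete_query = query % args
print(complete_query)       # select * from products where cat =cat1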

Issue automating CSV import to an RSQLite DB

I'm trying to automate writing CSV files to an RSQLite DB.
I am doing so by indexing csvFiles, which is a list of data.frame variables stored in the environment.
I can't seem to figure out why my dbWriteTable() code works perfectly fine when I enter it manually but not when I try to index the name and value fields.
### CREATE DB ###
mydb <- dbConnect(RSQLite::SQLite(),"")
# FOR LOOP TO BATCH IMPORT DATA INTO DATABASE
for (i in 1:length(csvFiles)) {
dbWriteTable(mydb,name = csvFiles[i], value = csvFiles[i], overwrite=T)
i=i+1
}
# EXAMPLE CODE THAT SUCCESSFULLY MANUAL IMPORTS INTO mydb
dbWriteTable(mydb,"DEPARTMENT",DEPARTMENT)
When I run the for loop above, I'm given this error:
"Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'DEPARTMENT': No such file or directory
# note that 'DEPARTMENT' is the value of csvFiles[1]
Here's the dput output of csvFiles:
c("DEPARTMENT", "EMPLOYEE_PHONE", "PRODUCT", "EMPLOYEE", "SALES_ORDER_LINE",
"SALES_ORDER", "CUSTOMER", "INVOICES", "STOCK_TOTAL")
I've researched this error and it seems to be related to my working directory; however, I don't really understand what to change, as I'm not even trying to manipulate files from my computer, simply data.frames already in my environment.
Please help!
Simply use get() for the value argument as you are passing a string value when a dataframe object is expected. Notice your manual version does not have DEPARTMENT quoted for value.
# FOR LOOP TO BATCH IMPORT DATA INTO DATABASE
for (i in seq_along(csvFiles)) {
dbWriteTable(mydb,name = csvFiles[i], value = get(csvFiles[i]), overwrite=T)
}
Alternatively, consider building a list of named dataframes with mget and looping element-wise over the list's names and dataframe elements with Map:
dfs <- mget(csvFiles)
output <- Map(function(n, d) dbWriteTable(mydb, name = n, value = d, overwrite=T), names(dfs), dfs)

apache pig load data with multiple delimiters

Hi everyone, I have a problem loading data using Apache Pig. The file format is like:
"1","2","xx,yy","a,sd","3"
So I want to load it using the multi-character delimiter "," (two double quotes and one comma), like:
A = LOAD 'file.csv' USING PigStorage('","') AS (f1,f2,f3,f4,f5);
but PigStorage doesn't accept the multi-character delimiter ",". How can I do it? Thank you very much!
PigStorage takes a single character as the delimiter. You will have to use the CSVLoader from PiggyBank. Download piggybank.jar and save it in the same folder as your Pig script. Register the jar in your Pig script:
REGISTER piggybank.jar;
DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
A = LOAD 'test1.txt' USING CSVLoader AS (f1:int,f2:int,f3:chararray,f4:chararray,f5:int);
B = FOREACH A GENERATE f1,f2,f3,f4,f5;
DUMP B;
An alternate option is to load each line as a single chararray and then use STRSPLIT:
A = LOAD 'test1.txt' USING TextLoader() AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line, '","'));
DUMP B;
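Note that splitting on '","' leaves a stray leading quote on the first field and a trailing quote on the last field ("1 and 3" in this example). A rough sketch of one way to strip them with REPLACE:
-- Strip the leftover double quotes from the first and last fields after the split
C = FOREACH B GENERATE
        REPLACE($0, '"', '') AS f1, $1 AS f2, $2 AS f3, $3 AS f4,
        REPLACE($4, '"', '') AS f5;
DUMP C;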

Pig CSVExcelStorage remove header

I've seen that there is a constructor which accepts a header control parameter:
CSVExcelStorage(String delimiter, String multilineTreatmentStr, String eolTreatmentStr, String headerTreatmentStr)
However, I haven't found what the value of the "SKIP_INPUT_HEADER" constant is.
I don't know why you want the constant value of SKIP_INPUT_HEADER, but if your intention is to remove the header during load, then please check the example below.
input.csv
Name,Age,Location
a,10,chennai
b,20,banglore
PigScript:(With SKIP_INPUT_HEADER)
REGISTER '/tmp/piggybank.jar';
A = LOAD 'input.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER');
DUMP A;
Output:
(a,10,chennai)
(b,20,banglore)
PigScript:(Without SKIP_INPUT_HEADER)
REGISTER '/tmp/piggybank.jar';
A = LOAD 'input.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX');
DUMP A;
Output:
(Name,Age,Location)
(a,10,chennai)
(b,20,banglore)