Path folder/filename
The folder I want to save the file in is "output", and "ScheduleTest.csv" is the filename I'd like to use, but I want to add a timestamp to it. This is on a File Write module. Does anyone know how to do that in expression mode?
<file:write doc:name="Write" doc:id="bba24eb0-8f63-4b6c-9c40-b5529325b4ea" config-ref="File_Config" path="output/ScheduleTest.csv" mode="APPEND">
<file:content><![CDATA[#[output application/csv header=false --- payload]]]></file:content>
</file:write>
In expression mode you can use a DataWeave expression to build the timestamp into the path. Note that now() already returns a DateTime, and the timestamp belongs before the .csv extension so the file keeps its suffix:
path="#['output/ScheduleTest-' ++ now() as String {format: 'yyyyMMddHHmmss'} ++ '.csv']"
Related
I am trying to place a SQL query that reads data from a Kudu table in an application.yaml file, where a string literal is used.
While running the program it gives the parsing error below:
EL1043E: Unexpected token. Expected 'rcurly(})' but was 'identifier'
Below is the query used, where "FINNACE" is a literal:
application.yaml:
filequery.map : '{
  "filequery" : "SELECT FILE,
                 project
                 FROM
                 ( SELECT FILE AS file_or_table_group,
                   "FINNACE" AS project
                   FROM finnance_data ) AS finnancedata"
}'
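A likely culprit here: the unescaped double quotes around FINNACE terminate the double-quoted "filequery" string early, so the expression parser then sees FINNACE as a stray identifier where it expects the closing brace, which matches the EL1043E message. A sketch of one possible fix, assuming nothing else about the config: use a single-quoted SQL string literal, doubling the single quotes because the whole value sits inside a single-quoted YAML scalar. In SQL, double quotes denote identifiers, so 'FINNACE' is also the correct way to write the literal:

filequery.map : '{
  "filequery" : "SELECT FILE, project
                 FROM ( SELECT FILE AS file_or_table_group,
                        ''FINNACE'' AS project
                        FROM finnance_data ) AS finnancedata"
}'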
I have a few .txt files with data in JSON to be loaded into a Google BigQuery table. Along with the columns in the text files I will need to insert the filename and the current timestamp for each row. This is in GCP Dataflow with Python 3.7.
I accessed the FileMetadata containing the filepath and size using GCSFileSystem.match and metadata_list.
I believe I need to get the pipeline code to run in a loop, pass the filepath to ReadFromText, and call a FileNameReadFunction ParDo.
(p
 | "read from file" >> ReadFromText(known_args.input)
 | "parse" >> beam.Map(json.loads)
 | "Add FileName" >> beam.ParDo(AddFilenamesFn(), GCSFilePath)
 | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
       known_args.output,
       write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
)
I followed the steps in Dataflow/apache beam - how to access current filename when passing in pattern?, but I can't quite make it work.
Any help is appreciated.
You can use textio.ReadFromTextWithFilename instead of ReadFromText. That will produce a PCollection of (filename, line) tuples.
To include the filename and timestamp in your output JSON record, you could change your "parse" step as follows (note that beam.Map is capitalized, and Python 3 no longer allows tuple unpacking in lambda parameters, so the element is indexed instead):
| "parse" >> beam.map(lambda (file, line): {
**json.loads(line),
"filename": file,
"timestamp": datetime.now()})
I am using Apache::Log::Parser to parse Apache log files.
I extracted the date from the log file using the following code.
my $parser = Apache::Log::Parser->new(fast=>1);
my $log = $parser->parse($data);
$t = $log->{date};
Now I'm trying to use Time::Piece to parse the date, but I'm unable to do it.
print "$t->day_of_month";
But it's not working. How do I use Time::Piece to parse the date?
You cannot call methods on objects inside of string interpolation. It will probably output something like this:
Sat Feb 18 12:44:47 2017->day_of_month
Remove the double quotes "" to call the method.
print $t->day_of_month;
Now the output is:
18
Note that you need to create a Time::Piece object first: with localtime or gmtime if you have an epoch value in your log, or with strptime if the date is a formatted string.
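For example, a minimal sketch assuming $log->{date} holds the raw common-log timestamp, e.g. "18/Feb/2017:12:44:47 +0530" (the offset is split off because strptime's %z support varies by platform):

use Time::Piece;

my ($stamp) = $log->{date} =~ m{^(\S+)};   # drop the " +0530" offset
my $t = Time::Piece->strptime($stamp, '%d/%b/%Y:%H:%M:%S');
print $t->day_of_month;                    # prints 18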
I've seen that there is a constructor which accepts a header control parameter:
CSVExcelStorage(String delimiter, String multilineTreatmentStr, String eolTreatmentStr, String headerTreatmentStr)
However, I haven't found what the value of the "SKIP_INPUT_HEADER" constant is.
I don't know why you want the constant value of SKIP_INPUT_HEADER, but if your intention is to remove the header during load, then check the example below.
input.csv
Name,Age,Location
a,10,chennai
b,20,banglore
Pig script (with SKIP_INPUT_HEADER):
REGISTER '/tmp/piggybank.jar';
A = LOAD 'input.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER');
DUMP A;
Output:
(a,10,chennai)
(b,20,banglore)
Pig script (without SKIP_INPUT_HEADER):
REGISTER '/tmp/piggybank.jar';
A = LOAD 'input.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX');
DUMP A;
Output:
(Name,Age,Location)
(a,10,chennai)
(b,20,banglore)
I am processing data from a set of files which contain a date stamp as part of the filename. The data within the file does not contain the date stamp. I would like to process the filename and add it to one of the data structures within the script. Is there a way to do that within Pig Latin (an extension to PigStorage maybe?) or do I need to preprocess all of the files using Perl or the like beforehand?
I envision something like the following:
-- Load two fields from file, then generate a third from the filename
rawdata = LOAD '/directory/of/files/' USING PigStorage AS (field1:chararray, field2:int, field3:filename);
-- Reformat the filename into a datestamp
annotated = FOREACH rawdata GENERATE
REGEX_EXTRACT(field3, '.*-(20\\d{6})-.*', 1) AS datestamp,
field1, field2;
Note the special "filename" datatype in the LOAD statement. It seems like it would have to happen there, since once the data has been loaded it's too late to get back to the source filename.
You can use PigStorage by specifying -tagsource as follows:
A = LOAD 'input' using PigStorage(',','-tagsource');
B = foreach A generate INPUT_FILE_NAME;
The first field in each tuple will contain the input path (INPUT_FILE_NAME), according to the API doc: http://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/PigStorage.html
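If you declare a schema on the LOAD, remember that the file name arrives as an extra first field; a sketch with illustrative field names:

A = LOAD 'input' USING PigStorage(',', '-tagsource')
    AS (filename:chararray, col1:chararray, col2:int);
B = FOREACH A GENERATE filename, col1;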
The Pig wiki has an example of PigStorageWithInputPath, which includes the filename in an additional chararray field:
Example
A = load '/directory/of/files/*' using PigStorageWithInputPath()
as (field1:chararray, field2:int, field3:chararray);
UDF
// Note that there are several versions of Path and FileSplit. These are intended:
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.builtin.PigStorage;
import org.apache.pig.data.Tuple;

public class PigStorageWithInputPath extends PigStorage {
    Path path = null;

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) {
        super.prepareToRead(reader, split);
        // Remember the path of the split currently being read.
        path = ((FileSplit) split.getWrappedSplit()).getPath();
    }

    @Override
    public Tuple getNext() throws IOException {
        Tuple myTuple = super.getNext();
        // Append the source file path as an extra last field.
        if (myTuple != null)
            myTuple.append(path.toString());
        return myTuple;
    }
}
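If you build this into a jar, register it in the script before the LOAD (the jar path here is a placeholder):

REGISTER '/path/to/PigStorageWithInputPath.jar';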
-tagsource is deprecated as of Pig 0.12.0. Instead use:
-tagFile - Appends input source file name to beginning of each tuple.
-tagPath - Appends input source file path to beginning of each tuple.
A = LOAD '/user/myFile.TXT' using PigStorage(',','-tagPath');
DUMP A ;
will give you the full file path as the first column:
( hdfs://myserver/user/blo/input/2015.TXT,439,43,05,4,NAVI,PO,P&C,P&CR,UC,40)
Reference: http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/builtin/PigStorage.html
A way to do this in Bash and Pig Latin can be found at: How Can I Load Every File In a Folder Using PIG?.
What I've been doing lately, though, and find much cleaner, is embedding Pig in Python. That lets you pass all sorts of variables between the two. A simple example is:
#!/path/to/jython.jar
# explicitly import Pig class
from org.apache.pig.scripting import Pig

# COMPILE: compile method returns a Pig object that represents the pipeline
P = Pig.compile("a = load '$in'; store a into '$out';")

input = '/path/to/some/file.txt'
output = '/path/to/some/output/on/hdfs'

# BIND and RUN
results = P.bind({'in': input, 'out': output}).runSingle()

if results.isSuccessful():
    print 'Pig job succeeded'
else:
    # raising a bare string is not valid; raise an exception object instead
    raise Exception('Pig job failed')
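To run the same pipeline over several parameter sets, bind() also accepts a list of dictionaries and run() executes one job per entry; a sketch, assuming the same P as above (paths are placeholders):

# one Pig job per bound dictionary; run() returns a list of results
stats = P.bind([
    {'in': '/path/to/a.txt', 'out': '/out/a'},
    {'in': '/path/to/b.txt', 'out': '/out/b'},
]).run()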
Have a look at Julien Le Dem's great slides as an introduction to this, if you're interested. There's also a ton of documentation at http://pig.apache.org/docs/r0.9.2/cont.pdf.