dse pig datetime functions - apache-pig

Can someone give a full example of the datetime functions, including the 'register' of the jar? I have been trying to get CurrentTime() and ToDate() running without much success. I have the piggybank jar in the classpath and have registered it, but it always says the function has to be defined before usage.
I read this question on comparing datetimes in Pig before asking.

Datetime functions are built into native Pig; there is no need for the piggybank jar.
Example:
In this example I will read a set of dates from the input file, get the current datetime, and calculate the total number of days between each previous date and the current date.
input.txt
2014-10-12T10:20:47
2014-08-12T10:20:47
2014-07-12T10:20:47
PigScript:
A = LOAD 'input.txt' AS (mydate:chararray);
B = FOREACH A GENERATE ToDate(mydate) AS prevDate,CurrentTime() AS currentDate,DaysBetween(CurrentTime(),ToDate(mydate)) AS diffDays;
DUMP B;
Output:
(2014-10-12T10:20:47.000+05:30, 2014-12-12T10:39:15.455+05:30, 61)
(2014-08-12T10:20:47.000+05:30, 2014-12-12T10:39:15.455+05:30, 122)
(2014-07-12T10:20:47.000+05:30, 2014-12-12T10:39:15.455+05:30, 153)
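As a sanity check, the same day differences can be reproduced outside Pig. A minimal Python sketch, with the "current" time pinned to the run timestamp shown in the DUMP output above (an assumption for reproducibility; DaysBetween counts whole days, like timedelta.days):

```python
from datetime import datetime

# Pin "current" to the run timestamp shown in the DUMP output above
current = datetime(2014, 12, 12, 10, 39, 15)

for prev_str in ["2014-10-12T10:20:47",
                 "2014-08-12T10:20:47",
                 "2014-07-12T10:20:47"]:
    prev = datetime.strptime(prev_str, "%Y-%m-%dT%H:%M:%S")
    # Whole days between the two instants, matching DaysBetween
    print(prev_str, (current - prev).days)
```

This prints 61, 122 and 153, matching the Pig output.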
You can refer to a few more examples from my older posts:
Human readable String date converted to date using Pig?
Storing Date and Time In PIG
how to convert UTC time to IST using pig

Related

getting error when trying to ingest data from firehose to redshift

getting errors 1206 and 1205 when ingesting data from Firehose to Redshift using a COPY command
Below is the raw data on firehose
{
"Name": "yoyo",
"a_timestamp": "2021-05-11T15:02:02.426729Z",
"a_date": "2021-05-11T00:00:00Z"
}
below is the copy command
COPY pqr_table FROM 's3://xyz/<manifest>' CREDENTIALS 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>' MANIFEST json 's3://xyz/abc.json' DATEFORMAT 'YYYY-MM-DD' ;
below is the DDL command
create table events (
Name varchar(8),
a_timestamp timestamp,
a_date date)
It would be great if anyone can please help me with this
Those are errors for bad timestamp and date formats. You need to have "timeformat" specified with that string as it is not Redshift's default format. I'd first try 'auto' for both of these and see if Redshift can work things out.
dateformat as 'auto'
timeformat as 'auto'
Also, having a time component in your date value may create some confusion; you may need to specify the format manually, or ingest the column as a timestamp and then cast it to a date. I'd first see if 'auto' does the trick.
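For reference, both values in the sample record are full ISO 8601 timestamps with a time component, which is why a bare 'YYYY-MM-DD' DATEFORMAT cannot parse them. A quick illustrative Python check (not part of the COPY itself):

```python
from datetime import datetime

# Sample values from the Firehose record above; both carry a time part,
# which is why a plain 'YYYY-MM-DD' pattern cannot parse them.
a_timestamp = "2021-05-11T15:02:02.426729Z"
a_date = "2021-05-11T00:00:00Z"

ts = datetime.strptime(a_timestamp, "%Y-%m-%dT%H:%M:%S.%fZ")
d = datetime.strptime(a_date, "%Y-%m-%dT%H:%M:%SZ").date()
print(ts)  # 2021-05-11 15:02:02.426729
print(d)   # 2021-05-11
```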

bq load error due to datetime with milliseconds

Is there an option for bq load to specify datetime format to parse? I'm getting an error when using bq load due to a datetime with milliseconds in it.
Sample file below:
ID|Card|Status|ExpiryDate|IssuedDate
1105|9902|Expired|2015-12-31 00:00:00|2014-07-04 14:43:41.963000000
Command used below:
bq load --source_format=CSV --skip_leading_rows 1 --field_delimiter "|" --replace mytable $GSPATH
It is not possible to control/change date or datetime formatting when loading data into BigQuery.
As a solution, I would try to load the datetime field as a STRING and then try to use the PARSE_DATETIME function or something else to postprocess and convert the string to datetime.
An example of the code to parse the string to datetime:
select PARSE_DATETIME('%Y-%m-%d %H:%M:%E*S','2014-07-04 14:43:41.963000000');
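If you postprocess outside BigQuery instead, note that nine-digit fractional seconds trip up many parsers too. A hedged Python sketch (the helper name is mine) that truncates to microsecond precision before parsing:

```python
from datetime import datetime

def parse_with_nanos(s):
    """Parse a timestamp whose fractional seconds may exceed 6 digits
    by truncating to microsecond precision (Python's %f limit)."""
    if "." in s:
        head, frac = s.split(".")
        return datetime.strptime(head + "." + frac[:6], "%Y-%m-%d %H:%M:%S.%f")
    return datetime.strptime(s, "%Y-%m-%d %H:%M:%S")

print(parse_with_nanos("2014-07-04 14:43:41.963000000"))
# 2014-07-04 14:43:41.963000
```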

Apache Pig - Numeric data missing while loading in a pig relation

I am learning Apache Pig and trying to load some data into Pig. When I view the txt file in the vi editor, I find the following (sample) row:
[ABBOTT,DEEDEE W GRADES 9-12 TEACHER 52,122.10 0 LBOE
ATLANTA INDEPENDENT SCHOOL SYSTEM 2010].
I use the following command to load data into a pig relation.
A = LOAD 'salaryTravelReport_sample.txt' USING PigStorage() as (name:chararray,
prof:chararray,max_sal:float,travel:float,board:chararray,state:chararray,year:int);
However, when I do a dump in pig in the distributed environment, I find the following result (for the row mentioned above):
(ABBOTT,DEEDEE W,GRADES 9-12 TEACHER,,0.0,LBOE,ATLANTA INDEPENDENT
SCHOOL SYSTEM,2010).
The numeric data "52,122.10 " seems to be missing.
Please help.
PigStorage() is a built-in Pig function that takes the field delimiter as an argument; here it is a tab: '\t'.
A = LOAD 'salaryTravelReport_sample.txt' USING PigStorage('\t') as (name:chararray,
prof:chararray,max_sal:float,travel:float,board:chararray,state:chararray,year:int);
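Note that even once the fields split correctly, the value "52,122.10" contains a thousands-separator comma and will not cast cleanly to a float, which is likely why Pig emits a null there. A small Python illustration of the same problem:

```python
# "52,122.10" cannot be converted to a float directly because of the
# thousands separator -- stripping it first makes the cast succeed.
raw = "52,122.10"
try:
    value = float(raw)
except ValueError:
    value = float(raw.replace(",", ""))
print(value)  # 52122.1
```

In Pig you would similarly REPLACE the comma in the chararray before casting to float.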

Using Time::Piece with Apache::Log::Parser

I am using Apache::Log::Parser to parse Apache log files.
I extracted the date from log file using the following code.
my $parser = Apache::Log::Parser->new(fast=>1);
my $log = $parser->parse($data);
$t = $log->{date};
Now, I tried to use Time::Piece to parse the date, but I'm unable to do it.
print "$t->day_of_month";
But, it's not working. How to use Time::Piece to parse date?
You cannot call methods on objects inside a string interpolation. It will probably output something like this:
Sat Feb 18 12:44:47 2017->day_of_month
Remove the double quotes to call the method:
print $t->day_of_month;
Now the output is:
18
Note that you need to create a Time::Piece object with localtime or gmtime if you have an epoch value in your log, or with strptime if the date is some other kind of timestamp.
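For comparison, the same parse expressed in Python (the strptime pattern maps across directly; the date string is the one shown in the output above):

```python
from datetime import datetime

# Same log date as above, parsed with an explicit strptime pattern
t = datetime.strptime("Sat Feb 18 12:44:47 2017", "%a %b %d %H:%M:%S %Y")
print(t.day)  # 18
```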

how do I set one variable equal to another in pig latin

I would like to do
register s3n://uw-cse344-code/myudfs.jar
-- load the test file into Pig
--raw = LOAD 's3n://uw-cse344-test/cse344-test-file' USING TextLoader as (line:chararray);
-- later you will load to other files, example:
raw = LOAD 's3n://uw-cse344/btc-2010-chunk-000' USING TextLoader as (line:chararray);
-- parse each line into ntriples
ntriples = foreach raw generate FLATTEN(myudfs.RDFSplit3(line)) as (subject:chararray,predicate:chararray,object:chararray);
--filter 1
subjects1 = filter ntriples by subject matches '.*rdfabout\\.com.*' PARALLEL 50;
--filter 2
subjects2 = subjects1;
but I get the error:
2012-03-10 01:19:18,039 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: mismatched input ';' expecting LEFT_PAREN
Details at logfile: /home/hadoop/pig_1331342327467.log
so it seems pig doesn't like that. How do I accomplish this?
I don't think that kind of 'typical' assignment works in Pig. It's not really a programming language in the strict sense - it's a high-level language on top of Hadoop with some specialized functions.
I think you'll need to simply re-project the data from subjects1 to subjects2, such as:
subjects2 = foreach subjects1 generate $0, $1, $2;
Another approach might be to use the LIMIT operator with some absurdly high parameter:
subjects2 = LIMIT subjects1 100000000;
There could be a lot of reasons why that doesn't make sense, but it's a thought.
I sense you are considering doing things as you would in a programming language, and I have found that rarely works out the way you want it to - but you can always get the job done once you think like Pig.
As I understand it, your example is from the Data Science Coursera course.
It's strange, but I ran into the same problem: this code works on one amount of data and doesn't on another.
Because we need to rename the fields, I used this code:
filtered2 = foreach filtered generate subject as subject2, predicate as predicate2, object as object2;