pig ToDate function not working properly - apache-pig

I am trying to cast a field to a datetime with the ToDate function.
raw_data = LOAD '/user/cloudera/Chicago_Traffic_Tracker_- _Historical_Congestion_Estimates_by_Region.csv' USING PigStorage(',') AS ( TIME :chararray,REGION_ID:int,BUS_COUNT:int,NUMBER_OF_READS:int,SPEED:double);
raw_clean = FOREACH raw_data GENERATE ToDate(raw_data.TIME,'yyyy/MM/dd HH:mm:ss')as date_time:DateTime ;
I get the following error:
Scalar has more than one row in the output. 1st :
(01/29/2015 01:40:35 PM,22,33,429,25.23), 2nd :(01/05/2015 01:10:46 PM,18,58,1058,21.14)
Input
01/29/2015 01:40:35 PM,22,33,429,25.23,a61e11c83f811b63e1dc64362f799dcac322fca8
01/05/2015 01:10:46 PM,18,58,1058,21.14,39c63427d0e1401a06f967fd43c30e291140c26e

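I have not run this, but two things stand out.
First, your input dates look like 01/29/2015 01:40:35 PM, i.e. MM/dd/yyyy hh:mm:ss a (12-hour clock with an AM/PM marker), whereas you have specified 'yyyy/MM/dd HH:mm:ss'.
Second, the "Scalar has more than one row in the output" error comes from writing raw_data.TIME inside a FOREACH over raw_data. Pig treats relation.field as a scalar projection, which only works when the relation has exactly one row. Refer to the field simply as TIME.
Try something like:
raw_clean = FOREACH raw_data GENERATE ToDate(TIME, 'MM/dd/yyyy hh:mm:ss a') AS date_time;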

Related

Store only non null values result in the file

I have also provided the output for every statement. I just wanted the end result to write into a file only when there is a value.
NEW_roles = LOAD '/folder/newata/roles.gz' USING PigStorage('\t') AS (id,name,is_active,created_at,updated_at);
NEW_roles_t = FOREACH NEW_roles generate id,name,is_active,ToString(created_at ,'yyyy-MM-dd hh:mm:ss') as created_at:chararray,ToString(updated_at ,'yyyy-MM-dd hh:mm:ss') as updated_at:chararray;
OLD_roles = LOAD '/folder//old-data/roles.gz' USING PigStorage('\t') AS (id,name,is_active,created_at,updated_at);
OLD_roles_t = FOREACH OLD_roles generate id,name,is_active,ToString(created_at ,'yyyy-MM-dd hh:mm:ss') as created_at:chararray,ToString(updated_at ,'yyyy-MM-dd hh:mm:ss') as updated_at:chararray;
co_group = COGROUP NEW_roles_t by id , OLD_roles_t by id;
dump co_group;
(1,{(1,CCCCO Admin,t,,)},{(1,CCCCO Admin,t,,)})
(2,{(2,COCI Read Only,t,,)},{(2,COCI Read Only,t,,)})
(3,{(3,College Admin,t,,)},{(3,College Admin,t,,)})
(4,{(4,College Submitter,t,,)},{(4,College Submitter,t,,)})
(5,{(5,CCCCO Reviewer,t,,)},{(5,CCCCO Reviewer,t,,)})
(6,{},{(6,Test,t,,)})
(7,{},{(7,Test1,t,,)})
(8,{},{(8,Test2,t,,)})
filtered_delete = FILTER co_group BY IsEmpty($1);
DUMP filtered_delete;
(6,{},{(6,Test,t,,)})
(7,{},{(7,Test1,t,,)})
(8,{},{(8,Test2,t,,)})
flat_delete = FOREACH filtered_delete GENERATE FLATTEN((IsEmpty(NEW_roles_t)? OLD_roles_t: null));
DUMP flat_delete;
(6,Test,t,,)
(7,Test1,t,,)
(8,Test2,t,,)
intermediate_delete = FOREACH flat_delete GENERATE $0 as id,$1 as name,$2 as is_active,$3 as created_at,$4 as updated_at;
dump intermediate_delete;
(6,Test,t,,)
(7,Test1,t,,)
(8,Test2,t,,)
/* WRITE IT TO A FILE */
/* If intermediate_delete has no records, no file should be created.
A condition based on the record count of intermediate_delete could achieve this.
I used the COUNT function to get the number of records,
but I could not work out how to write conditional logic that stores the relation only when the record count is greater than 0.
*/
STORE intermediate_insert INTO '/folder/refreshed-data/roles_diff.txt' USING PigStorage('\t');

data processing in pig , with tab separate

I am very new to Pig and am running into issues with some very basic processing. I need to:
1- Load a tab-separated file using Pig.
2- Filter records based on date: each line has 2 columns, col_1 and col_2 (assume both are chararray), and I need only the records where col_1 and col_2 are 1 day apart.
3- Finally, store the filtered records in a Hive table.
Input file (tab separated):
2016-01-01T16:31:40.000+01:00 2016-01-02T16:31:40.000+01:00
2017-01-01T16:31:40.000+01:00 2017-01-02T16:31:40.000+01:00
When I try
A = LOAD '/user/inp.txt' USING PigStorage('\t') as (col_1:chararray,col_2:chararray);
The result I am getting like below :-
DUMP A;
(,2016-01-03T19:28:58.000+01:00,2016-01-02T16:31:40.000+01:00)
(,2017-01-03T19:28:58.000+01:00,2017-01-02T16:31:40.000+01:00)
I am not sure why.
Can someone please help me parse the tab-separated file, convert the chararray columns to dates, and filter on the day difference?
Thanks
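Convert the columns to datetime objects using ToDate and compute the difference with DaysBetween. Keep the rows where the difference is 1, then store the result in Hive. Your values are already ISO 8601 strings, so ToDate can parse them without a format string; a pattern such as 'yyyy-MM-dd HH:mm:ss' would not match the T separator, the milliseconds, or the offset. DaysBetween expects the later datetime as its first argument, so col_2 goes first.
A = LOAD '/user/inp.txt' USING PigStorage('\t') as (col_1:chararray,col_2:chararray);
B = FOREACH A GENERATE col_1, col_2, DaysBetween(ToDate(col_2),ToDate(col_1)) as day_diff;
C = FILTER B BY (day_diff == 1);
STORE C INTO 'your_hive_partition' USING org.apache.hive.hcatalog.pig.HCatStorer();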

How to convert time and date to Unix timestamp on Apache PIG?

I have a tuple containing (date, time, ip, id)
(23/04/2014, 19:14:30,192.168.5.28, al00000)
and I need to convert date and time to Unix timestamp
(1398280470, 192.168.5.28, al00000)
How can I do that?
Ref : http://pig.apache.org/docs/r0.11.1/func.html#datetime-functions
Input :
23/04/2014,19:14:30,192.168.5.28,al00000
Pig Script :
A = LOAD 'input_data.csv' USING PigStorage(',') AS (date:chararray,time:chararray,ip:chararray,id:chararray);
B = FOREACH A GENERATE ToUnixTime(ToDate(CONCAT(date, time),'dd/MM/yyyyHH:mm:ss', 'GMT')) AS unix_time, ip, id;
Output :
(1398280470,192.168.5.28,al00000)

ToDate function provides unexpected output

I used ToDate(userinput, 'MM/dd/yyyy') to convert my chararray field to a date, but the output is not what I expected.
Here is the code:
l_dat = load 'textfile' using PigStorage('|') as (first:chararray,last:chararray,dob:chararray);
c_dat = foreach l_dat generate ToDate(dob,'MM/dd/yyyy') as mydate;
describe c_dat;
dump c_dat;
data looks like this:
(firstname1,lastname1,02/02/1967)
(John,deloy,05/26/1967)
(frank,fun,05/18/1967)
Output looks like this:
c_dat: {mydate: datetime}
(1967-05-26T00:00:00.000-04:00)
(1967-05-18T00:00:00.000-04:00)
(1967-02-02T00:00:00.000-05:00)
The output I was expecting was date objects like these:
(05/26/1967)
(05/18/1967)
(02/02/1967)
Please advise if I am doing anything wrong.
Ref : http://pig.apache.org/docs/r0.12.0/func.html#to-date. The return type of the ToDate function is a DateTime object, which always carries a time and a time zone offset. You can see this in the schema description in your output:
c_dat: {mydate: datetime}
If the date is already in the required format, you do not need any conversion.
c_dat = foreach l_dat generate dob as mydate;
If you want to convert the chararray date to another format, apply the ToString() function after getting the DateTime object.
Step 1: Convert the date chararray to a DateTime object using ToDate(datestring, inputformat).
Step 2: Use ToString(DateTime object, requiredformat) to get the date as a string in the required format.
Both steps can be combined into one expression:
ToString(ToDate(date,inputformat),requiredformat);
Ref : http://pig.apache.org/docs/r0.12.0/func.html#to-string for details.
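Applied to the relation from the question, the combined expression might look like the sketch below. It assumes the same pipe-delimited 'textfile' and field names as in the question; s_dat and the target pattern 'yyyy-MM-dd' are chosen here purely for illustration.

```pig
-- Load the pipe-delimited records; dob arrives as a chararray such as 05/26/1967.
l_dat = LOAD 'textfile' USING PigStorage('|') AS (first:chararray, last:chararray, dob:chararray);

-- Parse dob into a datetime, then format it back into a chararray.
-- The second pattern is an assumption; substitute whichever output format you need.
s_dat = FOREACH l_dat GENERATE ToString(ToDate(dob, 'MM/dd/yyyy'), 'yyyy-MM-dd') AS mydate:chararray;

DESCRIBE s_dat;
DUMP s_dat;
```

Note that the result is a chararray, not a datetime: Pig has no date-only type, so a value without a time part has to be kept as a string.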

Storing Date and Time In PIG

I am trying to store a txt file that has two columns date and time respectively.
Something like this:
1999-01-01 12:08:56
Now I want to perform some date operations using Pig, but I want to store the date and time like this:
1999-01-01T12:08:56 (I checked this link):
http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html
What format can I use so that my date and time are in one column that I can feed to Pig, and how do I then load that value into Pig? I know it has to be converted to datetime, but my attempts produce errors. Can somebody tell me how to load date and time data together? An example would be of great help.
Please let me know if this works for you.
input.txt
1999-01-01 12:08:56
1999-01-02 12:08:57
1999-01-03 12:08:58
1999-01-04 12:08:59
PigScript:
A = LOAD 'input.txt' using PigStorage(' ') as(date:chararray,time:chararray);
B = FOREACH A GENERATE CONCAT(date,'T',time) as myDateString;
C = FOREACH B GENERATE ToDate(myDateString);
dump C;
Output:
(1999-01-01T12:08:56.000+05:30)
(1999-01-02T12:08:57.000+05:30)
(1999-01-03T12:08:58.000+05:30)
(1999-01-04T12:08:59.000+05:30)
The field in C is now a datetime object, so you can process it with all the built-in date functions.
If you want to store the output in this format
(1999-01-01T12:08:56)
(1999-01-02T12:08:57)
(1999-01-03T12:08:58)
(1999-01-04T12:08:59)
you can use REGEX_EXTRACT to keep everything before the ".", something like this:
D = FOREACH C GENERATE ToString($0) as temp;
E = FOREACH D GENERATE REGEX_EXTRACT(temp, '(.*)\\.(.*)', 1);
dump E;
Output:
(1999-01-01T12:08:56)
(1999-01-02T12:08:57)
(1999-01-03T12:08:58)
(1999-01-04T12:08:59)
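As an alternative to the regular expression, ToString also accepts a format pattern, so the milliseconds and offset can simply be left out when formatting. The sketch below assumes the same input.txt as above; the names dt, F and iso_no_zone are introduced here for illustration, and building the value from two ToString calls is just one way to avoid escaping the literal T inside a pattern.

```pig
-- Load the space-separated date and time columns as strings.
A = LOAD 'input.txt' USING PigStorage(' ') AS (date:chararray, time:chararray);

-- Join them into an ISO 8601 string and parse it into a datetime.
C = FOREACH A GENERATE ToDate(CONCAT(date, 'T', time)) AS dt;

-- Format the date and time parts separately and join them with a literal 'T',
-- which drops the milliseconds and the time zone offset.
F = FOREACH C GENERATE CONCAT(ToString(dt, 'yyyy-MM-dd'), 'T', ToString(dt, 'HH:mm:ss')) AS iso_no_zone:chararray;

DUMP F;
```

Like the script above, this relies on CONCAT accepting more than two arguments, which older Pig releases may not support; nesting two-argument CONCAT calls is a fallback.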