I have a tuple containing (date, time, ip, id):
(23/04/2014, 19:14:30, 192.168.5.28, al00000)
I need to convert the date and time to a Unix timestamp:
(1398280470, 192.168.5.28, al00000)
How can I do that?
Ref : http://pig.apache.org/docs/r0.11.1/func.html#datetime-functions
Input :
23/04/2014,19:14:30,192.168.5.28,al00000
Pig Script :
A = LOAD 'input_data.csv' USING PigStorage(',') AS (date:chararray,time:chararray,ip:chararray,id:chararray);
B = FOREACH A GENERATE ToUnixTime(ToDate(CONCAT(date, time),'dd/MM/yyyyHH:mm:ss', 'GMT')) AS unix_time, ip, id;
Output :
(1398280470,192.168.5.28,al00000)
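To cross-check the expected value outside Pig, here is a minimal Python sketch of the same steps (concatenate, parse with a matching pattern, read as GMT). The variable names are illustrative and not part of the original script.

```python
from datetime import datetime, timezone

# Same pieces as the Pig script: concatenate date and time, parse with the
# matching pattern, and interpret the result as GMT/UTC.
date_part, time_part = "23/04/2014", "19:14:30"
parsed = datetime.strptime(date_part + time_part, "%d/%m/%Y%H:%M:%S")

# Attach UTC explicitly so the local timezone of the machine does not matter.
unix_time = int(parsed.replace(tzinfo=timezone.utc).timestamp())
print(unix_time)  # 1398280470
```

If the source times were recorded in another zone, the timezone argument would need to change accordingly, in Pig as well as here.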
Related
I am new to Pig scripting but good with SQL. I want the Pig equivalent of this SQL line:
SELECT * FROM Orders WHERE Date='2008-11-11'
Basically, I want to load the data for a single id or date. How do I do that?
I used FILTER in Pig and it worked; I got the desired results.
`ivr_src = LOAD '/raw/prod/...;
info = foreach ivr_src generate timeEpochMillisUTC as time, cSId as id;
Filter_table= FILTER info BY id == '700000';
sorted_filter_table = Order Filter_table BY $1;
store sorted_filter_table into 'sorted_filter_table1' USING PigStorage('\t', '-schema');`
I am very new to Pig, so I am facing some issues while trying to do some very basic processing in Pig.
1- Load that file using Pig.
2- Write processing logic to filter records by date. For example, the lines have 2 columns, col_1 and col_2 (assume both are chararray), and I need only the records where col_1 and col_2 are 1 day apart.
3- Finally, store the filtered records in a Hive table.
Input file (tab separated):
2016-01-01T16:31:40.000+01:00 2016-01-02T16:31:40.000+01:00
2017-01-01T16:31:40.000+01:00 2017-01-02T16:31:40.000+01:00
When I try
A = LOAD '/user/inp.txt' USING PigStorage('\t') as (col_1:chararray,col_2:chararray);
The result I get is:
DUMP A;
(,2016-01-03T19:28:58.000+01:00,2016-01-02T16:31:40.000+01:00)
(,2017-01-03T19:28:58.000+01:00,2017-01-02T16:31:40.000+01:00)
Not sure why.
Can someone please help me with how to parse a tab-separated file, convert the chararray columns to dates, and filter on the day difference?
Thanks
Convert the columns to datetime objects using ToDate and use DaysBetween. This gives the difference; keep the rows where the difference == 1. Finally, store the result in Hive.
A = LOAD '/user/inp.txt' USING PigStorage('\t') as (col_1:chararray,col_2:chararray);
-- The sample values are ISO 8601 strings, so the one-argument ToDate can parse them.
-- DaysBetween subtracts its second argument from its first, so the later date (col_2) goes first.
B = FOREACH A GENERATE col_1, col_2, DaysBetween(ToDate(col_2),ToDate(col_1)) as day_diff;
C = FILTER B BY (day_diff == 1);
STORE C INTO 'your_hive_partition' USING org.apache.hive.hcatalog.pig.HCatStorer();
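As a quick check of what a one-day difference looks like for the sample rows, here is a small Python sketch that mirrors the parse, diff, and filter steps. The helper name `day_diff` and the list names are illustrative only, and this is not a replacement for the Pig script.

```python
from datetime import datetime

# The two sample rows from the question (col_1, col_2).
rows = [
    ("2016-01-01T16:31:40.000+01:00", "2016-01-02T16:31:40.000+01:00"),
    ("2017-01-01T16:31:40.000+01:00", "2017-01-02T16:31:40.000+01:00"),
]

def day_diff(earlier: str, later: str) -> int:
    """Whole days from `earlier` to `later`, both ISO 8601 strings with offsets."""
    return (datetime.fromisoformat(later) - datetime.fromisoformat(earlier)).days

# Keep only the rows whose two columns are exactly one day apart.
kept = [row for row in rows if day_diff(*row) == 1]
print(len(kept))  # 2
```

Note that the sign depends on which value is subtracted from which, so swapping the arguments gives -1 for these rows.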
I am trying to cast a field using a date function.
raw_data = LOAD '/user/cloudera/Chicago_Traffic_Tracker_-_Historical_Congestion_Estimates_by_Region.csv' USING PigStorage(',') AS (TIME:chararray,REGION_ID:int,BUS_COUNT:int,NUMBER_OF_READS:int,SPEED:double);
raw_clean = FOREACH raw_data GENERATE ToDate(raw_data.TIME,'yyyy/MM/dd HH:mm:ss')as date_time:DateTime ;
I get the below error
Scalar has more than one row in the output. 1st :
(01/29/2015 01:40:35 PM,22,33,429,25.23), 2nd :(01/05/2015 01:10:46 PM,18,58,1058,21.14)
Input
01/29/2015 01:40:35 PM,22,33,429,25.23,a61e11c83f811b63e1dc64362f799dcac322fca8
01/05/2015 01:10:46 PM,18,58,1058,21.14,39c63427d0e1401a06f967fd43c30e291140c26e
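I haven't tested this, but your input date is in the format 01/29/2015 01:40:35 PM, i.e. MM/dd/yyyy hh:mm:ss a, whereas you specified 'yyyy/MM/dd HH:mm:ss'. Also refer to the field as TIME rather than raw_data.TIME inside the FOREACH: the relation-qualified form is read as a scalar lookup, which is what produces the "Scalar has more than one row" error.
Try something like:
raw_clean = FOREACH raw_data GENERATE ToDate(TIME,'MM/dd/yyyy hh:mm:ss a');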
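The sample value carries an AM/PM marker, so the pattern needs a 12-hour field plus the marker. The Python sketch below shows the equivalent parse for one sample value; Python's directives differ from Pig's pattern letters, so this only illustrates the shape of the format and is not Pig syntax.

```python
from datetime import datetime

# One TIME value from the sample input.
raw = "01/29/2015 01:40:35 PM"

# Month/day/year, then a 12-hour clock (%I) with an AM/PM marker (%p).
parsed = datetime.strptime(raw, "%m/%d/%Y %I:%M:%S %p")
print(parsed.isoformat())  # 2015-01-29T13:40:35
```

Parsing the same string with a 24-hour field and no marker would either fail or silently read 01:40 as early morning, which is why the marker matters.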
I'm trying to list salaries in descending order, but the output is not correct. I'm running Pig in local mode.
My input is as below:
a,a#xyz.com,5000
b,b#xyz.com,3000
c,c#xyz.com,10000
a,a1#xyz.com,2000
c,c1#xyz.com,40000
d,d#xyz.com,7000
e,e#xyz.com,1000
f,f#xyz.com,9000
f,f1#xyz.com,110000
I need the email and salary (in descending order), so here is what I did:
A = load '/local_input_path' USING PigStorage(',');
B = foreach A generate $1,$2;
c = ORDER B by $1 DESC;
But the output is not as expected:
(f#xyz.com,9000)
(d#xyz.com,7000)
(a#xyz.com,5000)
(c1#xyz.com,40000)
(b#xyz.com,3000)
(a1#xyz.com,2000)
(f1#xyz.com,110000)
(c#xyz.com,10000)
(e#xyz.com,1000)
When I leave out B = foreach A generate $1,$2; and proceed, the output is as expected.
Any suggestions?
Cast the bytearray to int and then order.
Try this code:
a = LOAD '/local_input_path' using PigStorage(',');
b = FOREACH a GENERATE $1,(int)$2;
c = order b by $1 DESC;
dump c;
It's treating your numbers as strings and performing a lexicographical sort instead of numeric. When you're loading, assign names and types to help prevent this and make your code more readable/maintainable.
...USING PigStorage(',') AS (letter:chararray, email:chararray, salary:int)
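The difference between the two orderings is easy to reproduce outside Pig. This Python sketch sorts the salary values from the question once as text and once as integers; the variable names are illustrative.

```python
# Salary column from the sample input, as untyped text.
salaries = ["5000", "3000", "10000", "2000", "40000",
            "7000", "1000", "9000", "110000"]

# Sorting text compares character by character, so "9000" beats "40000".
as_text = sorted(salaries, reverse=True)

# Casting to int first gives the numeric order the question expects.
as_numbers = sorted((int(s) for s in salaries), reverse=True)

print(as_text)
print(as_numbers)
```

The text ordering matches the unexpected output shown in the question, which is consistent with the values not being compared as numbers.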
I am trying to store a txt file that has two columns, date and time respectively.
Something like this:
1999-01-01 12:08:56
Now I want to perform some date operations using Pig, but I want to store the date and time like this:
1999-01-01T12:08:56 (I checked this link):
http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html
What I want to know is which format I can use so that my date and time are in one column that I can feed to Pig, and then how to load that date into Pig. I know we convert it to datetime, but it's showing errors. Can somebody tell me how to load date and time data together? An example would be a great help.
Please let me know if this works for you.
input.txt
1999-01-01 12:08:56
1999-01-02 12:08:57
1999-01-03 12:08:58
1999-01-04 12:08:59
PigScript:
A = LOAD 'input.txt' using PigStorage(' ') as(date:chararray,time:chararray);
B = FOREACH A GENERATE CONCAT(date,'T',time) as myDateString;
C = FOREACH B GENERATE ToDate(myDateString);
dump C;
Output:
(1999-01-01T12:08:56.000+05:30)
(1999-01-02T12:08:57.000+05:30)
(1999-01-03T12:08:58.000+05:30)
(1999-01-04T12:08:59.000+05:30)
Now the date string has been converted to a datetime object, so you can process this data using all the built-in date functions.
In case you want to store the output in this format
(1999-01-01T12:08:56)
(1999-01-02T12:08:57)
(1999-01-03T12:08:58)
(1999-01-04T12:08:59)
you can use REGEX_EXTRACT to keep each value up to the ".", something like this:
D = FOREACH C GENERATE ToString($0) as temp;
E = FOREACH D GENERATE REGEX_EXTRACT(temp, '(.*)\\.(.*)', 1);
dump E;
Output:
(1999-01-01T12:08:56)
(1999-01-02T12:08:57)
(1999-01-03T12:08:58)
(1999-01-04T12:08:59)
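To see what the pattern captures, here is the same regular expression applied in Python to the datetime strings from the earlier output. The list names are illustrative, and the only assumption is that each value contains a "." before the milliseconds.

```python
import re

# String forms of the datetimes, as printed by the first dump.
values = [
    "1999-01-01T12:08:56.000+05:30",
    "1999-01-02T12:08:57.000+05:30",
    "1999-01-03T12:08:58.000+05:30",
    "1999-01-04T12:08:59.000+05:30",
]

# Group 1 is everything before the last "."; group 2 is the rest.
pattern = re.compile(r"(.*)\.(.*)")
trimmed = [pattern.match(v).group(1) for v in values]

print(trimmed[0])  # 1999-01-01T12:08:56
```

Because the first group is greedy, a value containing more than one "." would be cut at the last one, not the first.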