data processing in pig , with tab separate - apache-pig

I am very new to Pig , so facing some issues while trying to perform very basic processing in Pig.
1- Load that file using Pig
2- Write a processing logic to filter records based on Date , for example the lines have 2 columns col_1 and col_2 ( assume the columns are chararray ) and I need to get only the records which are having 1 day difference between col_1 and col_2.
3- Finally store that filtered record in Hive table .
Input file ( tab separated ) :-
2016-01-01T16:31:40.000+01:00 2016-01-02T16:31:40.000+01:00
2017-01-01T16:31:40.000+01:00 2017-01-02T16:31:40.000+01:00
When I try
A = LOAD '/user/inp.txt' USING PigStorage('\t') as (col_1:chararray,col_2:chararray);
The result I am getting like below :-
DUMP A;
(,2016-01-03T19:28:58.000+01:00,2016-01-02T16:31:40.000+01:00)
(,2017-01-03T19:28:58.000+01:00,2017-01-02T16:31:40.000+01:00)
Not sure Why ?
Please can some one help me in this how to parse tab separated file and how to covert that chararray to Date and filter based on Day difference ?
Thanks

Convert the columns to datetime object using ToDate and use DaysBetween.This should give the difference and if the difference == 1 then filter.Finally load it hive.
A = LOAD '/user/inp.txt' USING PigStorage('\t') as (col_1:chararray,col_2:chararray);
B = FOREACH A GENERATE DaysBetween(ToDate(col_1,'yyyy-MM-dd HH:mm:ss'),ToDate(col_2,'yyyy-MM-dd HH:mm:ss')) as day_diff;
C = FILTER B BY (day_diff == 1);
STORE C INTO 'your_hive_partition' USING org.apache.hive.hcatalog.pig.HCatStorer();

Related

In pig how do i write a date, like in sql we write where date =''

I am new to Pig scripting but good with SQL. I wanted the pig equivalent for this SQL line :
SELECT * FROM Orders WHERE Date='2008-11-11'.
Basically I want to load data for one id or date how do I do that?
I did this and it worked, used FILTER in pig, and got the desired results.
`ivr_src = LOAD '/raw/prod/...;
info = foreach ivr_src generate timeEpochMillisUTC as time, cSId as id;
Filter_table= FILTER info BY id == '700000';
sorted_filter_table = Order Filter_table BY $1;
store sorted_filter_table into 'sorted_filter_table1' USING PigStorage('\t', '-
schema');`

Apache pig Store based on condition

I'm reading from a csv file and after grouping those datas I'm doing a count operation . Is there any way to store the datas into a folder name bad if the count is 0 and to good if the count is > 0 . I tried with the below code but it is not happening .
CODE :
STORE countVal INTO '/user/cloudera/good' IF countVal > 0 ;
USE function SPLIT. Refer :
https://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#SPLIT
SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6);
There are a couple of ways this :
1)Use the split function to perform the split based on the criteria.
SPLIT data into good if count>0, bad if (count==0);
2)Use a FOREACH loop to separate the data based on a criteria, using a BinCond operator.
X = FOREACH A GENERATE , data, (count>0?"good":"bad");

How to make multiple rows from single row in pig for movie lens dataset

I want to divide a single row into multiple row on the basis of a field in pig.
Example:
Consider one of the row in movie Data Set as follows:
(31807, Dot the I (2003), Drama|Film-Noir|Thriller)
each field is separated by ','.
Desired Output is as follows in 3 different rows:
31807,Dot the I (2003),Drama
31807,Dot the I (2003),Film-Noir
31807,Dot the I (2003),Thriller
Can anyone please help me to get the desired output in pig.
The below logic will help you .
/* Input
(31807,Dot the I (2003),Drama|Film-Noir|Thriller)
*/
list = LOAD '/user/cloudera/movies.txt' USING PigStorage(',') AS(id:int,name:chararray,generes:chararray);
list_each = FOREACH list GENERATE id,name, flatten(TOKENIZE(generes,'|'));
dump list_each;
/* Output
(31807,Dot the I (2003),Drama)
(31807,Dot the I (2003),Film-Noir)
(31807,Dot the I (2003),Thriller)
*/

Storing Date and Time In PIG

I am trying to store a txt file that has two columns date and time respectively.
Something like this:
1999-01-01 12:08:56
Now I want to perform some Date operations using PIG, but i want to store date and time like this
1999-01-01T12:08:56 ( I checked this link):
http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html
What I want to know is that what kind of format can I use in which my date and time are in one column, so that I can feed it to PIG, and then how to load that date into pig. I know we change it into datetime, but its showing errors. Can somebody kindly tell me how to load Date&Time data together. An example would be of great help.
Please let me know if this works for you.
input.txt
1999-01-01 12:08:56
1999-01-02 12:08:57
1999-01-03 12:08:58
1999-01-04 12:08:59
PigScript:
A = LOAD 'input.txt' using PigStorage(' ') as(date:chararray,time:chararray);
B = FOREACH A GENERATE CONCAT(date,'T',time) as myDateString;
C = FOREACH B GENERATE ToDate(myDateString);
dump C;
Output:
(1999-01-01T12:08:56.000+05:30)
(1999-01-02T12:08:57.000+05:30)
(1999-01-03T12:08:58.000+05:30)
(1999-01-04T12:08:59.000+05:30)
Now the myDateString is in date object, you can process this data using all the build in date functions.
Incase if you want to store the output as in this format
(1999-01-01T12:08:56)
(1999-01-02T12:08:57)
(1999-01-03T12:08:58)
(1999-01-04T12:08:59)
you can use REGEX_EXTRACT to parse the each data till "." something like this
D = FOREACH C GENERATE ToString($0) as temp;
E = FOREACH D GENERATE REGEX_EXTRACT(temp, '(.*)\\.(.*)', 1);
dump E;
Output:
(1999-01-01T12:08:56)
(1999-01-02T12:08:57)
(1999-01-03T12:08:58)
(1999-01-04T12:08:59)

Convesion from Hive to PigLatin

I am trying to convert the below Hive statement to Pig:
max(substr(case when url like 'http:%' then '' else url end,1,50))
My pig statement for the above is:
url_group = GROUP data by (uid);
max_substr_url= FOREACH url_group generate SUBSTRING(MAX(((Coalesce(data.url) matches '.*http:%.*') ? '' : Coalesce(data.url))), 0, 49);
For some of the data, the url can be null. So I have written a pig UDF called Coalesce(String) which returns an empty string if the data is either null or empty. If the data is not null or not empty it returns the string back.
The above pig statement is giving me lot of trouble and tried n different options/ways but nothing worked. Anyone got any ideas on how to implement this? Please help me.
Thanks in advance
You are going to want to use a nested FOREACH so that you can do the substring transformation on each tuple in the data bag then take the MAX of the transformed bag.
A = GROUP data by (uid);
B = FOREACH url_group {
-- MAX needs a one column bag
transformed = FOREACH data
GENERATE SUBSTRING((Coalesce(url) matches '.*http:.*' ? '' : Coalesce(url)), 0, 49);
GENERATE group AS uid, MAX(transformed) ;
}