Store the result in a file only when it is non-empty - apache-pig

I have included the output of every statement. I want the end result to be written to a file only when it contains at least one record.
NEW_roles = LOAD '/folder/newata/roles.gz' USING PigStorage('\t') AS (id,name,is_active,created_at,updated_at);
NEW_roles_t = FOREACH NEW_roles generate id,name,is_active,ToString(created_at ,'yyyy-MM-dd hh:mm:ss') as created_at:chararray,ToString(updated_at ,'yyyy-MM-dd hh:mm:ss') as updated_at:chararray;
OLD_roles = LOAD '/folder//old-data/roles.gz' USING PigStorage('\t') AS (id,name,is_active,created_at,updated_at);
OLD_roles_t = FOREACH OLD_roles generate id,name,is_active,ToString(created_at ,'yyyy-MM-dd hh:mm:ss') as created_at:chararray,ToString(updated_at ,'yyyy-MM-dd hh:mm:ss') as updated_at:chararray;
co_group = COGROUP NEW_roles_t by id , OLD_roles_t by id;
dump co_group;
(1,{(1,CCCCO Admin,t,,)},{(1,CCCCO Admin,t,,)})
(2,{(2,COCI Read Only,t,,)},{(2,COCI Read Only,t,,)})
(3,{(3,College Admin,t,,)},{(3,College Admin,t,,)})
(4,{(4,College Submitter,t,,)},{(4,College Submitter,t,,)})
(5,{(5,CCCCO Reviewer,t,,)},{(5,CCCCO Reviewer,t,,)})
(6,{},{(6,Test,t,,)})
(7,{},{(7,Test1,t,,)})
(8,{},{(8,Test2,t,,)})
filtered_delete = FILTER co_group BY IsEmpty($1);
DUMP filtered_delete;
(6,{},{(6,Test,t,,)})
(7,{},{(7,Test1,t,,)})
(8,{},{(8,Test2,t,,)})
flat_delete = FOREACH filtered_delete GENERATE FLATTEN((IsEmpty(NEW_roles_t)? OLD_roles_t: null));
DUMP flat_delete;
(6,Test,t,,)
(7,Test1,t,,)
(8,Test2,t,,)
intermediate_delete = FOREACH flat_delete GENERATE $0 as id,$1 as name,$2 as is_active,$3 as created_at,$4 as updated_at;
dump intermediate_delete;
(6,Test,t,,)
(7,Test1,t,,)
(8,Test2,t,,)
/*WRITE IT IN FILE*/
/*If intermediate_delete has no records, we are not supposed to create a file.
A condition based on the number of records in intermediate_delete could achieve this,
but I don't know how to write it. I used the COUNT function to get the number of records,
but after that I could not write any condition-based logic that stores the file only when the record count is greater than 0.
*/
STORE intermediate_delete INTO '/folder/refreshed-data/roles_diff.txt' USING PigStorage('\t');
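
As far as I know, Pig Latin has no conditional form of STORE, so one workaround is to make the decision outside the script. The sketch below is plain Python and only models that decision: it writes a tab-separated file when there is at least one record and otherwise creates nothing. The function name `store_if_nonempty` and the local-file output are assumptions for illustration; on a cluster the same check would have to be made by whatever wrapper launches the Pig job.

```python
import csv
import os
import tempfile


def store_if_nonempty(records, path):
    """Write records as tab-separated lines; create no file when there are none.

    Returns True when a file was written, False otherwise.
    (Hypothetical helper, not part of Pig.)
    """
    rows = list(records)
    if not rows:
        # Nothing to store, so do not even create the file.
        return False
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)
    return True


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()
    deleted = [(6, "Test", "t", "", ""), (7, "Test1", "t", "", "")]
    print(store_if_nonempty(deleted, os.path.join(out_dir, "roles_diff.txt")))
    print(store_if_nonempty([], os.path.join(out_dir, "empty_diff.txt")))
```

The return value tells the caller whether an output exists, which is the same information the record count was meant to provide.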

Related

In Pig, how do I filter by a date, like WHERE date = '' in SQL?

I am new to Pig scripting but good with SQL. I want the Pig equivalent of this SQL line:
SELECT * FROM Orders WHERE Date='2008-11-11'.
Basically, I want to load data for one id or one date. How do I do that?
I did this and it worked: I used FILTER in Pig and got the desired results.
`ivr_src = LOAD '/raw/prod/...;
info = foreach ivr_src generate timeEpochMillisUTC as time, cSId as id;
Filter_table= FILTER info BY id == '700000';
sorted_filter_table = Order Filter_table BY $1;
store sorted_filter_table into 'sorted_filter_table1' USING PigStorage('\t', '-schema');`

Data processing in Pig with a tab-separated file

I am very new to Pig, so I am facing some issues with very basic processing. I want to:
1- Load a file using Pig.
2- Write processing logic to filter records based on date. For example, the lines have two columns, col_1 and col_2 (assume both are chararray), and I need only the records where col_1 and col_2 are one day apart.
3- Finally, store the filtered records in a Hive table.
Input file (tab separated):
2016-01-01T16:31:40.000+01:00 2016-01-02T16:31:40.000+01:00
2017-01-01T16:31:40.000+01:00 2017-01-02T16:31:40.000+01:00
When I try
A = LOAD '/user/inp.txt' USING PigStorage('\t') as (col_1:chararray,col_2:chararray);
The result I get is:
DUMP A;
(,2016-01-03T19:28:58.000+01:00,2016-01-02T16:31:40.000+01:00)
(,2017-01-03T19:28:58.000+01:00,2017-01-02T16:31:40.000+01:00)
Not sure why.
Can someone please help me parse a tab-separated file, convert the chararray values to dates, and filter on the day difference?
Thanks
Convert the columns to datetime objects using ToDate and use DaysBetween. This gives the difference in days; keep the records where the difference == 1. Finally, store the result in Hive.
A = LOAD '/user/inp.txt' USING PigStorage('\t') as (col_1:chararray,col_2:chararray);
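-- The input is ISO-8601, which the one-argument ToDate parses; as I understand DaysBetween, it returns the first argument minus the second, so the later column goes first.
B = FOREACH A GENERATE col_1, col_2, DaysBetween(ToDate(col_2), ToDate(col_1)) AS day_diff;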
C = FILTER B BY (day_diff == 1);
STORE C INTO 'your_hive_partition' USING org.apache.hive.hcatalog.pig.HCatStorer();
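
The sample timestamps are ISO-8601 strings with a `T` separator, milliseconds and a UTC offset, so a pattern such as 'yyyy-MM-dd HH:mm:ss' does not describe them, and the order of the two arguments decides the sign of the difference. The sketch below shows the same filter in plain Python as a way to check the expected result; the function names are made up for illustration and it assumes exactly two tab-separated columns per line.

```python
from datetime import datetime


def days_between(later, earlier):
    """Whole days from `earlier` to `later`; both are ISO-8601 strings with offsets."""
    delta = datetime.fromisoformat(later) - datetime.fromisoformat(earlier)
    return delta.days


def one_day_apart(lines):
    """Keep (col_1, col_2) rows where col_2 is one whole day after col_1.

    Assumes each line holds exactly two tab-separated timestamps.
    """
    kept = []
    for line in lines:
        col_1, col_2 = line.rstrip("\n").split("\t")
        if days_between(col_2, col_1) == 1:
            kept.append((col_1, col_2))
    return kept


if __name__ == "__main__":
    sample = [
        "2016-01-01T16:31:40.000+01:00\t2016-01-02T16:31:40.000+01:00",
        "2017-01-01T16:31:40.000+01:00\t2017-01-02T16:31:40.000+01:00",
    ]
    for row in one_day_apart(sample):
        print(row)
```

Swapping the two arguments of `days_between` gives -1 for the sample rows, which is why a test for `== 1` can silently drop every record.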

How to filter out empty output files?

This is my pig script.
data = load 's3a://sessionlog/2016-05-28/' using SegmentationDataLoader() as (cookie:chararray,tags_and_pageref:map[]);
tags_data = foreach data generate cookie, tags_and_pageref#'tags' as score_tag_bag;
flattened_data = FOREACH tags_data GENERATE cookie, FLATTEN(score_tag_bag) as score_tag;
converted_flattened_data = FOREACH flattened_data GENERATE cookie, (long)score_tag#'score' as score, score_tag#'tag' as tag;
-- dump converted_flattened_data;
-- tuple_data = FOREACH flattened_data GENERATE cookie, TOTUPLE(tags) as tag_tuple;
-- splited_data = FOREACH flattened_data GENERATE cookie, TOKENIZE(tags) as score_tags:bag{t1:(),t2:()};
grouped_data = group converted_flattened_data by (cookie,tag);
acc_data = foreach grouped_data generate group.cookie as cookie, group.tag as tag,SUM(converted_flattened_data.score) as score;
pageref_data = foreach data generate cookie, tags_and_pageref#'pageref' as pageref_bag;
flattened_pageref_data = FOREACH pageref_data GENERATE cookie, FLATTEN(pageref_bag) as score_tag;
filtered_data = FILTER flattened_pageref_data BY score_tag is not null and not IsEmpty(score_tag);
store acc_data into 'segmentation/2016-05-28/4' using PigStorage(',');
store filtered_data into 'pagerefdata/2016-05-28/4' using PigStorage(',');
However, the output files for pagerefdata are all empty. How can I filter them out? Since they are all empty, I do not want any output.
Thanks in advance.
Not sure I fully understand your question here, but 3 lines from the bottom you seem to be trying to filter out the empty ones by testing for null. Have you tried the following?:
filtered_data = FILTER flattened_pageref_data BY SIZE(TRIM(score_tag)) > 0;
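
FILTER keeps the records for which the condition is true, so a test for a non-zero trimmed length keeps the populated values, while a test for length zero keeps only the blank ones. A minimal Python sketch of the "keep non-blank" predicate, assuming the values are plain strings or nulls (the function name is made up for illustration):

```python
def non_blank(values):
    """Keep values that are not None and not empty after trimming whitespace."""
    return [v for v in values if v is not None and len(v.strip()) > 0]


if __name__ == "__main__":
    print(non_blank(["a", "  ", "", None, " b "]))
```

Note that this only removes blank records; it does not by itself stop a job from creating empty output files when every record is filtered out.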

Apache Pig: Add one field dataset to another one as a new column

Let's say we have this scenario:
dataset1.csv :
datefield
field11, field12, field13
field21, field22, field23
field31, field32, field33
What is the best way to get this?:
field11, field12, field13, datefield
field21, field22, field23, datefield
field31, field32, field33, datefield
I tried to generate one dataset (relation1) with just these columns (after loading and generating):
field11, field12, field13
field21, field22, field23
field31, field32, field33
another one (relation2) with just this column (after loading and generating):
datefield
and then doing this:
finalResult = FOREACH dataset1 GENERATE UDFFunction1(relation1::f1) as firstFields, UDFFunction2(relation2::f2) as lastField
But I'm getting 'A column needs to be projected from a relation for it to be used as a scalar'
The problem is with the second field (the one with the datefield).
I would like to avoid joins, since it would be a little messy workaround.
Any suggestions?
Please, forget my UDF functions. They just format the input Tuples accordingly.
Adding the pig script:
register 's3://bucketName/lib/MyJar.jar';
define ParseOutFilesUDF packageName.ParseOutFiles;
define FormatTimestartedUDF packageName.FormatTimestarted;
outFile = LOAD 's3://bucketName/input/' USING PigStorage ('|');
--This UDF just reformats each tuple, adding a String to each Tuple and returning a new one.
resultAll = FOREACH outFile GENERATE ParseOutFilesUDF(*) as initial;
--load the same csv again to get the TIMESTARTED field
timestarted = LOAD 's3://bucketName/input/' USING PigStorage ('|') as f1;
--filter to get only one record, which is something like TIMESTARTED=20160101
filetered = FILTER timestarted BY (f1 matches '.*TIMESTARTED.*');
timestarted = foreach filetered GENERATE $0 as fechaStarted;
-- the FormatTimestartedUDF just gets rid of 'TIMESTARTED=' in order to get the date '20160101'
-- this FOREACH statement is where it fails with the 'A column needs to be projected...'
finalResult = FOREACH outFile GENERATE f1, FormatTimestartedUDF(timestarted) as f2;
STORE finalResult INTO 's3://bucketName/output/';
You are getting the error because you reference f1, which does not exist in outFile, and because timestarted is a relation, not a field. You should also use the fields from resultAll and filtered.
outFile = LOAD 's3://bucketName/input/' USING PigStorage ('|');
resultAll = FOREACH outFile GENERATE ParseOutFilesUDF(*) as initial;
timestarted = LOAD 's3://bucketName/input/' USING PigStorage ('|') as f1;
filtered = FILTER timestarted BY (f1 matches '.*TIMESTARTED.*');
finalResult = FOREACH resultAll GENERATE initial, FormatTimestartedUDF(filtered.$0);
STORE finalResult INTO 's3://bucketName/output/';
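
Using `relation.field` inside a FOREACH over a different relation treats that relation as a scalar, which only works when it holds exactly one row. The Python sketch below models that behaviour: one value taken from a single-row relation is appended to every row of another. `append_scalar` and `format_timestarted` are hypothetical stand-ins; the second only imitates what the question says FormatTimestartedUDF does (dropping the 'TIMESTARTED=' prefix).

```python
def format_timestarted(raw):
    """Stand-in for FormatTimestartedUDF: 'TIMESTARTED=20160101' -> '20160101'."""
    return raw.split("=", 1)[1]


def append_scalar(rows, single_row_relation):
    """Append the only value of a one-row relation to every row.

    Raises ValueError when the relation does not hold exactly one row,
    mirroring the requirement for a scalar projection.
    """
    if len(single_row_relation) != 1:
        raise ValueError("scalar relation must contain exactly one row")
    value = single_row_relation[0][0]
    return [tuple(row) + (value,) for row in rows]


if __name__ == "__main__":
    rows = [("field11", "field12", "field13"), ("field21", "field22", "field23")]
    filtered = [(format_timestarted("TIMESTARTED=20160101"),)]
    for row in append_scalar(rows, filtered):
        print(row)
```

If the filter can match more than one line, the scalar approach stops working and a CROSS or a join becomes the safer choice.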

Pig script for creating an ID x CITY matrix from a given CSV file

I have an input file that contains ID, CITY and COUNT information, as below, and I want to create a CSV file with one row per ID and one count column per CITY. The count should be written as '0' if the ID has no record for that CITY. I tried to write a Pig script using GROUP BY, COGROUP and FLATTEN, but could not get it to produce this sample output.
How can I write a pig script for this?
INPUT(ID,CITY,COUNT):
00004589,IZMIR,2
00005275,KOCAELI,1
00005275,ISTANBUL,1
00008523,ESKISEHIR,2
OUTPUT:
ID,IZMIR,ISTANBUL,ESKISEHIR,KOCAELI
00004589,2,0,0,0
00005275,0,1,0,1
00008523,0,0,2,0
You can use the script below to create the matrix:
DATA = load '/tmp/data.csv' Using PigStorage(',') as (ID:chararray,CITY:chararray,COUNT:chararray);
ESKISEHIR = filter DATA by (CITY matches 'ESKISEHIR');
ISTANBUL = filter DATA by (CITY matches 'ISTANBUL');
IZMIR = filter DATA by (CITY matches 'IZMIR');
KOCAELI = filter DATA by (CITY matches 'KOCAELI');
IDLIST = foreach DATA generate $0 as ID;
IDLIST = distinct IDLIST;
COGROUPED = cogroup IDLIST by $0,IZMIR by $0,ISTANBUL by $0,ESKISEHIR by $0,KOCAELI by $0;
CG_CITY = foreach COGROUPED generate
    FLATTEN($1),
    FLATTEN((IsEmpty($2.$2) ? {('0')} : $2.$2)),
    FLATTEN((IsEmpty($3.$2) ? {('0')} : $3.$2)),
    FLATTEN((IsEmpty($4.$2) ? {('0')} : $4.$2)),
    FLATTEN((IsEmpty($5.$2) ? {('0')} : $5.$2));
STORE CG_CITY INTO '/tmp/id_city_matrix' USING PigStorage(',');
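
To check the output of the script against the expected matrix, the same pivot can be sketched in a few lines of Python. This is only a reference model, not a replacement for the Pig job: `city_matrix` is a made-up name, the city order is passed in explicitly, and repeated (ID, CITY) pairs are summed, which is an assumption the question does not state.

```python
def city_matrix(rows, cities):
    """Pivot (id, city, count) rows into one row per id with a column per city.

    Missing (id, city) combinations are filled with 0; repeated ones are summed.
    """
    counts = {}
    for record_id, city, count in rows:
        per_city = counts.setdefault(record_id, {})
        per_city[city] = per_city.get(city, 0) + count
    matrix = [["ID"] + list(cities)]
    for record_id in sorted(counts):
        per_city = counts[record_id]
        matrix.append([record_id] + [per_city.get(city, 0) for city in cities])
    return matrix


if __name__ == "__main__":
    data = [
        ("00004589", "IZMIR", 2),
        ("00005275", "KOCAELI", 1),
        ("00005275", "ISTANBUL", 1),
        ("00008523", "ESKISEHIR", 2),
    ]
    for line in city_matrix(data, ["IZMIR", "ISTANBUL", "ESKISEHIR", "KOCAELI"]):
        print(",".join(str(cell) for cell in line))
```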