How to filter empty output file? - apache-pig

This is my Pig script:
data = load 's3a://sessionlog/2016-05-28/' using SegmentationDataLoader() as (cookie:chararray,tags_and_pageref:map[]);
tags_data = foreach data generate cookie, tags_and_pageref#'tags' as score_tag_bag;
flattened_data = FOREACH tags_data GENERATE cookie, FLATTEN(score_tag_bag) as score_tag;
converted_flattened_data = FOREACH flattened_data GENERATE cookie, (long)score_tag#'score' as score, score_tag#'tag' as tag;
-- dump converted_flattened_data;
-- tuple_data = FOREACH flattened_data GENERATE cookie, TOTUPLE(tags) as tag_tuple;
-- splited_data = FOREACH flattened_data GENERATE cookie, TOKENIZE(tags) as score_tags:bag{t1:(),t2:()};
grouped_data = group converted_flattened_data by (cookie,tag);
acc_data = foreach grouped_data generate group.cookie as cookie, group.tag as tag,SUM(converted_flattened_data.score) as score;
pageref_data = foreach data generate cookie, tags_and_pageref#'pageref' as pageref_bag;
flattened_pageref_data = FOREACH pageref_data GENERATE cookie, FLATTEN(pageref_bag) as score_tag;
filtered_data = FILTER flattened_pageref_data BY score_tag is not null and not IsEmpty(score_tag);
store acc_data into 'segmentation/2016-05-28/4' using PigStorage(',');
store filtered_data into 'pagerefdata/2016-05-28/4' using PigStorage(',');
However, the pagerefdata output consists entirely of empty files. How can I filter them out? When all of the files are empty, I do not want any output at all.
Thanks in advance.

I'm not sure I fully understand your question, but three lines from the bottom you seem to be trying to filter out the empty ones by testing for null. Have you tried the following?
filtered_data = FILTER flattened_pageref_data BY SIZE(TRIM(score_tag)) > 0;

Related

Store only non null values result in the file

I have also provided the output of every statement. I want the end result to be written to a file only when it contains records.
NEW_roles = LOAD '/folder/newata/roles.gz' USING PigStorage('\t') AS (id,name,is_active,created_at,updated_at);
NEW_roles_t = FOREACH NEW_roles generate id,name,is_active,ToString(created_at ,'yyyy-MM-dd hh:mm:ss') as created_at:chararray,ToString(updated_at ,'yyyy-MM-dd hh:mm:ss') as updated_at:chararray;
OLD_roles = LOAD '/folder//old-data/roles.gz' USING PigStorage('\t') AS (id,name,is_active,created_at,updated_at);
OLD_roles_t = FOREACH OLD_roles generate id,name,is_active,ToString(created_at ,'yyyy-MM-dd hh:mm:ss') as created_at:chararray,ToString(updated_at ,'yyyy-MM-dd hh:mm:ss') as updated_at:chararray;
co_group = COGROUP NEW_roles_t by id , OLD_roles_t by id;
dump co_group;
(1,{(1,CCCCO Admin,t,,)},{(1,CCCCO Admin,t,,)})
(2,{(2,COCI Read Only,t,,)},{(2,COCI Read Only,t,,)})
(3,{(3,College Admin,t,,)},{(3,College Admin,t,,)})
(4,{(4,College Submitter,t,,)},{(4,College Submitter,t,,)})
(5,{(5,CCCCO Reviewer,t,,)},{(5,CCCCO Reviewer,t,,)})
(6,{},{(6,Test,t,,)})
(7,{},{(7,Test1,t,,)})
(8,{},{(8,Test2,t,,)})
filtered_delete = FILTER co_group BY IsEmpty($1);
DUMP filtered_delete;
(6,{},{(6,Test,t,,)})
(7,{},{(7,Test1,t,,)})
(8,{},{(8,Test2,t,,)})
flat_delete = FOREACH filtered_delete GENERATE FLATTEN((IsEmpty(NEW_roles_t)? OLD_roles_t: null));
DUMP flat_delete;
(6,Test,t,,)
(7,Test1,t,,)
(8,Test2,t,,)
intermediate_delete = FOREACH flat_delete GENERATE $0 as id,$1 as name,$2 as is_active,$3 as created_at,$4 as updated_at;
dump intermediate_delete;
(6,Test,t,,)
(7,Test1,t,,)
(8,Test2,t,,)
/*WRITE IT IN FILE*/
/* If intermediate_delete has no records, no file should be created.
A condition based on the record count of intermediate_delete could achieve this,
but I don't know how to write it. I used the COUNT function to get the number of records,
but I could not write any conditional logic that stores to a file only when the record count is greater than 0.
*/
STORE intermediate_insert INTO '/folder/refreshed-data/roles_diff.txt' USING PigStorage('\t');

Pig Latin fetching first field

I have the Pig code below for this sample file:
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
I load the above file using the Pig LOAD command and then iterate over it to get the first three fields, as follows:
students = LOAD '/user/4965056e873066f2abe966b4129918/Pig_Data/students.txt' USING PigStorage(',') as (id:int,fname:chararray,lname:chararray,age:int,mob:chararray,city:chararray);
each1 = foreach students generate (id,fname,lname);
Output of each1:
((1,Rajiv,Reddy)) etc.
Now I want to get the first field of each1, i.e. the id. How can I get it? I tried the code below, but it raises an error:
each2 = foreach each1 generate(students.id)
I need the each2 relation to contain only the first field.
The extra parentheses in the each1 statement wrap the fields in a nested tuple; simply remove them:
students = LOAD '/user/4965056e873066f2abe966b4129918/Pig_Data/students.txt' USING PigStorage(',') as (id:int,fname:chararray,lname:chararray,age:int,mob:chararray,city:chararray);
each1 = foreach students generate id,fname,lname;
And you will get something like:
(1,Rajiv,Reddy)
For the each2 relation, you can get any field of each1 without the students qualifier; use the field name or the field position, like this:
each2 = foreach each1 generate id;
each2 = foreach each1 generate $0;
And you will get something like:
(1)

Pig error in local mode

I'm trying to list the salaries in descending order, but the output is not correct. I'm running Pig in local mode.
My input is as follows:
a,a#xyz.com,5000
b,b#xyz.com,3000
c,c#xyz.com,10000
a,a1#xyz.com,2000
c,c1#xyz.com,40000
d,d#xyz.com,7000
e,e#xyz.com,1000
f,f#xyz.com,9000
f,f1#xyz.com,110000
Since I need the email and the salary (in descending order), here is what I did:
A = load '/local_input_path' USING PigStorage(',');
B = foreach A generate $1,$2;
c = ORDER B by $1 DESC;
But the output is not as expected:
(f#xyz.com,9000)
(d#xyz.com,7000)
(a#xyz.com,5000)
(c1#xyz.com,40000)
(b#xyz.com,3000)
(a1#xyz.com,2000)
(f1#xyz.com,110000)
(c#xyz.com,10000)
(e#xyz.com,1000)
When I leave out B = foreach A generate $1,$2; and proceed, the output is as expected.
Any suggestions on this?
Cast the bytearray to int and then order.
Try this code:
a = LOAD '/local_input_path' using PigStorage(',');
b = FOREACH a GENERATE $1,(int)$2;
c = order b by $1 DESC;
dump c;
It's treating your numbers as strings and performing a lexicographical sort instead of numeric. When you're loading, assign names and types to help prevent this and make your code more readable/maintainable.
...USING PigStorage(',') AS (letter:chararray, email:chararray, salary:int)
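The difference between the two orderings is not specific to Pig and can be reproduced in a few lines. The sketch below uses Python rather than Pig Latin, purely as an illustration: it sorts the salary values from the question first as text and then as integers. The variable names are assumptions made for this example. The text ordering it produces matches the unexpected output shown in the question.

```python
# Salary column from the question, as the untyped text a loader would hand over.
salaries = ["5000", "3000", "10000", "2000", "40000", "7000", "1000", "9000", "110000"]

# A descending sort on the raw strings compares character by character,
# so "9000" ranks above "110000" because "9" > "1".
as_text = sorted(salaries, reverse=True)

# Casting to int first gives the numeric ordering the asker expected.
as_numbers = sorted((int(s) for s in salaries), reverse=True)

print(as_text)
print(as_numbers)
```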

PIG script for creating IDxCITY matrix from given csv file

I have an input file containing ID, CITY and COUNT information, as shown below, and I want to create a CSV file that lists, for each ID, the count for every CITY. The count should be written as '0' if the ID has no entry for that CITY. I tried to write a Pig script using GROUP BY, COGROUP and FLATTEN, but could not get it to produce this sample output.
How can I write a Pig script for this?
INPUT(ID,CITY,COUNT):
00004589,IZMIR,2
00005275,KOCAELI,1
00005275,ISTANBUL,1
00008523,ESKISEHIR,2
OUTPUT:
ID,IZMIR,ISTANBUL,ESKISEHIR,KOCAELI
00004589,2,0,0,0
00005275,0,1,0,1
00008523,0,0,2,0
You can use the script below to create the matrix:
DATA = load '/tmp/data.csv' Using PigStorage(',') as (ID:chararray,CITY:chararray,COUNT:chararray);
ESKISEHIR = filter DATA by (CITY matches 'ESKISEHIR');
ISTANBUL = filter DATA by (CITY matches 'ISTANBUL');
IZMIR = filter DATA by (CITY matches 'IZMIR');
KOCAELI = filter DATA by (CITY matches 'KOCAELI');
IDLIST = foreach DATA generate $0 as ID;
IDLIST = distinct IDLIST;
COGROUPED = cogroup IDLIST by $0,IZMIR by $0,ISTANBUL by $0,ESKISEHIR by $0,KOCAELI by $0;
CG_CITY = foreach COGROUPED generate FLATTEN($1),FLATTEN ((IsEmpty($2.$2) ? {('0')} : $2.$2)),FLATTEN ((IsEmpty($3.$2) ? {('0')} : $3.$2)),FLATTEN ((IsEmpty($4.$2) ? {('0')} : $4.$2)),FLATTEN ((IsEmpty($5.$2) ? {('0')} : $5.$2));
STORE CG_CITY INTO '/tmp/id_city_matrix' USING PigStorage(',');
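To make the intended result of the script easier to check, here is a minimal Python sketch (not Pig Latin) of the same zero-filled ID x CITY pivot, run on the sample rows from the question. It assumes each (ID, CITY) pair occurs at most once and that the city list is known in advance, as it is in the script above; the variable names are hypothetical.

```python
import csv
import io

# Sample rows from the question: (ID, CITY, COUNT).
rows = [
    ("00004589", "IZMIR", 2),
    ("00005275", "KOCAELI", 1),
    ("00005275", "ISTANBUL", 1),
    ("00008523", "ESKISEHIR", 2),
]

# Column order taken from the expected output in the question.
cities = ["IZMIR", "ISTANBUL", "ESKISEHIR", "KOCAELI"]

# Start every ID with a zero for each city, then fill in the known counts.
matrix = {}
for row_id, city, count in rows:
    matrix.setdefault(row_id, dict.fromkeys(cities, 0))[city] = count

# Write the header and one line per ID, sorted by ID for a stable order.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["ID"] + cities)
for row_id in sorted(matrix):
    writer.writerow([row_id] + [matrix[row_id][c] for c in cities])

result = buffer.getvalue()
print(result)
```

The printed text reproduces the OUTPUT block from the question, which is a quick way to compare against whatever the Pig job stores.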

Pig Programming Logic

10278929012|HDFC1001|SBI|2014-08-03|8000|S
10278929012|HDFC1001|HDFC|2014-08-17|500|S
I need to find out whether the atm_id belongs to the same bank as bank_name, and produce an indicator accordingly.
I need output like this:
10278929012|HDFC1001|SBI|diff_bank
10278929012|HDFC1001|HDFC|same_bank
atm_trans = LOAD '/user/cloudera/inputfiles/atm_trans.txt' USING PigStorage('|') as(accnt_no:long,atm_id:chararray,bank_name :chararray,date:chararray,amt:chararray,status:chararray);
atm_trans_each = foreach atm_trans generate accnt_no,atm_id,bank_name,(bank_name matches atm_id ?'same_bank' : 'diff_bank') as ind;
dump atm_trans_each;
But I am getting a syntax error. Can somebody correct it and give me the right statement to get this output?
Can you try this?
input.txt
10278929012|HDFC1001|SBI|2014-08-03|8000|S
10278929012|HDFC1001|HDFC|2014-08-17|500|S
PigScript:
atm_trans = LOAD 'input.txt' USING PigStorage('|') as(accnt_no:long,atm_id:chararray,bank_name:chararray,date:chararray,amt:chararray,status:chararray);
atm_trans_each = foreach atm_trans generate accnt_no,atm_id,bank_name,((STARTSWITH(atm_id,bank_name)== true)?'same_bank':'diff_bank') as ind;
STORE atm_trans_each INTO 'output' USING PigStorage('|');
Update: for version 0.8:
atm_trans_each = foreach atm_trans generate accnt_no,atm_id,bank_name,((REGEX_EXTRACT(atm_id,'([A-Za-z]+)',1) == bank_name)?'same_bank':'diff_bank');
output:
10278929012|HDFC1001|SBI|diff_bank
10278929012|HDFC1001|HDFC|same_bank
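The two checks used above (a prefix test and a comparison against the leading letters of atm_id) can be sketched outside Pig to confirm they agree on the sample data. The following is a Python illustration rather than Pig Latin, and the function names are hypothetical.

```python
import re

# Sample records from the question: (accnt_no, atm_id, bank_name).
records = [
    ("10278929012", "HDFC1001", "SBI"),
    ("10278929012", "HDFC1001", "HDFC"),
]

def indicator_by_prefix(atm_id, bank_name):
    # Mirrors the STARTSWITH check: does atm_id begin with bank_name?
    return "same_bank" if atm_id.startswith(bank_name) else "diff_bank"

def indicator_by_regex(atm_id, bank_name):
    # Mirrors the REGEX_EXTRACT check: compare the first run of letters in atm_id.
    match = re.search(r"[A-Za-z]+", atm_id)
    return "same_bank" if match and match.group(0) == bank_name else "diff_bank"

lines = ["|".join((acc, atm, bank, indicator_by_prefix(atm, bank))) for acc, atm, bank in records]
print("\n".join(lines))
```

Note that the prefix test would also report same_bank for a bank_name that is only a leading fragment of the letters in atm_id (for example "HD"), whereas the regex comparison requires the full letter run to match.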