Store results in a file only when non-null values exist - apache-pig
I have included the output of every statement below. I want the end result written to a file only when there is at least one record.
NEW_roles = LOAD '/folder/newata/roles.gz' USING PigStorage('\t') AS (id,name,is_active,created_at,updated_at);
NEW_roles_t = FOREACH NEW_roles GENERATE id, name, is_active, ToString(created_at,'yyyy-MM-dd hh:mm:ss') AS created_at:chararray, ToString(updated_at,'yyyy-MM-dd hh:mm:ss') AS updated_at:chararray;
OLD_roles = LOAD '/folder//old-data/roles.gz' USING PigStorage('\t') AS (id,name,is_active,created_at,updated_at);
OLD_roles_t = FOREACH OLD_roles GENERATE id, name, is_active, ToString(created_at,'yyyy-MM-dd hh:mm:ss') AS created_at:chararray, ToString(updated_at,'yyyy-MM-dd hh:mm:ss') AS updated_at:chararray;
co_group = COGROUP NEW_roles_t BY id, OLD_roles_t BY id;
DUMP co_group;
(1,{(1,CCCCO Admin,t,,)},{(1,CCCCO Admin,t,,)})
(2,{(2,COCI Read Only,t,,)},{(2,COCI Read Only,t,,)})
(3,{(3,College Admin,t,,)},{(3,College Admin,t,,)})
(4,{(4,College Submitter,t,,)},{(4,College Submitter,t,,)})
(5,{(5,CCCCO Reviewer,t,,)},{(5,CCCCO Reviewer,t,,)})
(6,{},{(6,Test,t,,)})
(7,{},{(7,Test1,t,,)})
(8,{},{(8,Test2,t,,)})
filtered_delete = FILTER co_group BY IsEmpty($1);
DUMP filtered_delete;
(6,{},{(6,Test,t,,)})
(7,{},{(7,Test1,t,,)})
(8,{},{(8,Test2,t,,)})
flat_delete = FOREACH filtered_delete GENERATE FLATTEN((IsEmpty(NEW_roles_t)? OLD_roles_t: null));
DUMP flat_delete;
(6,Test,t,,)
(7,Test1,t,,)
(8,Test2,t,,)
intermediate_delete = FOREACH flat_delete GENERATE $0 as id,$1 as name,$2 as is_active,$3 as created_at,$4 as updated_at;
dump intermediate_delete;
(6,Test,t,,)
(7,Test1,t,,)
(8,Test2,t,,)
/* WRITE IT TO A FILE */
/* If intermediate_delete has no records, we should not create a file.
Based on the record count of intermediate_delete we could write a condition to achieve this.
I managed to use the COUNT function to get the number of records,
but after that I was not able to write any conditional logic that stores to a file only when the record count is greater than 0.
*/
STORE intermediate_delete INTO '/folder/refreshed-data/roles_diff.txt' USING PigStorage('\t');
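Pig Latin itself has no conditional statement, so a STORE cannot be guarded inside the script; the record count has to be computed in Pig and the decision made by an outer driver (for example a wrapper shell script, or Pig embedded in Python via org.apache.pig.scripting, which does support control flow). A sketch of the counting step, using the standard GROUP ALL idiom:

```pig
-- GROUP ALL collapses the relation into a single group so COUNT sees every record
grouped_all  = GROUP intermediate_delete ALL;
delete_count = FOREACH grouped_all GENERATE COUNT(intermediate_delete) AS cnt;
DUMP delete_count;
```

An embedded or wrapper driver can then inspect cnt and issue the STORE only when it is greater than zero; a plain Pig script run on its own will always create the output directory, even when empty.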
Related
In Pig, how do I filter by a date, like SQL's WHERE date = ''?
I am new to Pig scripting but comfortable with SQL. I want the Pig equivalent of this SQL line: SELECT * FROM Orders WHERE Date='2008-11-11'. Basically, I want to load data for one id or date. How do I do that?
I did this and it worked: I used FILTER in Pig and got the desired results.

ivr_src = LOAD '/raw/prod/...;
info = FOREACH ivr_src GENERATE timeEpochMillisUTC AS time, cSId AS id;
Filter_table = FILTER info BY id == '700000';
sorted_filter_table = ORDER Filter_table BY $1;
STORE sorted_filter_table INTO 'sorted_filter_table1' USING PigStorage('\t', '-schema');
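The same FILTER pattern answers the original date question directly, assuming the date is stored as text; the file path and field names below are made up for illustration:

```pig
-- hypothetical input: tab-separated rows with an id and a 'yyyy-MM-dd' date column
orders = LOAD '/data/orders.tsv' USING PigStorage('\t') AS (id:chararray, order_date:chararray);
-- plain string equality works when dates are stored as 'yyyy-MM-dd' text
nov_11 = FILTER orders BY order_date == '2008-11-11';
DUMP nov_11;
```

For real datetime comparisons, convert with ToDate first and compare DateTime values instead of strings.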
Data processing in Pig with tab-separated input
I am very new to Pig, so I am facing some issues while trying to perform very basic processing. 1. Load a file using Pig. 2. Write processing logic to filter records based on date; for example, the lines have two columns col_1 and col_2 (assume both are chararray) and I need only the records with a one-day difference between col_1 and col_2. 3. Finally, store the filtered records in a Hive table.

Input file (tab separated):

2016-01-01T16:31:40.000+01:00	2016-01-02T16:31:40.000+01:00
2017-01-01T16:31:40.000+01:00	2017-01-02T16:31:40.000+01:00

When I try

A = LOAD '/user/inp.txt' USING PigStorage('\t') as (col_1:chararray,col_2:chararray);

the result I am getting is:

DUMP A;
(,2016-01-03T19:28:58.000+01:00,2016-01-02T16:31:40.000+01:00)
(,2017-01-03T19:28:58.000+01:00,2017-01-02T16:31:40.000+01:00)

Not sure why. Can someone help me with how to parse a tab-separated file, how to convert the chararray to a date, and how to filter based on the day difference? Thanks
Convert the columns to datetime objects using ToDate and use DaysBetween. This gives the difference; filter where the difference == 1, and finally store to Hive. Note that the ToDate pattern must match the input's ISO timestamp format, and DaysBetween subtracts its second argument from its first, so the later timestamp (col_2 in the sample) goes first:

A = LOAD '/user/inp.txt' USING PigStorage('\t') as (col_1:chararray,col_2:chararray);
B = FOREACH A GENERATE DaysBetween(ToDate(col_2,'yyyy-MM-dd\'T\'HH:mm:ss.SSSZ'),ToDate(col_1,'yyyy-MM-dd\'T\'HH:mm:ss.SSSZ')) as day_diff;
C = FILTER B BY (day_diff == 1);
STORE C INTO 'your_hive_partition' USING org.apache.hive.hcatalog.pig.HCatStorer();
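If the ordering of the two timestamps within a row is not guaranteed, a sketch using the ABS builtin makes the filter symmetric; column names follow the question, and the pattern string is the Joda-Time format matching the sample input:

```pig
A = LOAD '/user/inp.txt' USING PigStorage('\t') as (col_1:chararray, col_2:chararray);
-- keep the original columns so the stored output still contains the timestamps
B = FOREACH A GENERATE col_1, col_2,
    ABS(DaysBetween(ToDate(col_1,'yyyy-MM-dd\'T\'HH:mm:ss.SSSZ'),
                    ToDate(col_2,'yyyy-MM-dd\'T\'HH:mm:ss.SSSZ'))) AS day_diff;
C = FILTER B BY day_diff == 1;
-- drop the helper column before storing
D = FOREACH C GENERATE col_1, col_2;
```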
How to filter empty output file?
This is my Pig script.

data = load 's3a://sessionlog/2016-05-28/' using SegmentationDataLoader() as (cookie:chararray, tags_and_pageref:map[]);
tags_data = foreach data generate cookie, tags_and_pageref#'tags' as score_tag_bag;
flattened_data = FOREACH tags_data GENERATE cookie, FLATTEN(score_tag_bag) as score_tag;
converted_flattened_data = FOREACH flattened_data GENERATE cookie, (long)score_tag#'score' as score, score_tag#'tag' as tag;
-- dump converted_flattened_data;
-- tuple_data = FOREACH flattened_data GENERATE cookie, TOTUPLE(tags) as tag_tuple;
-- splited_data = FOREACH flattened_data GENERATE cookie, TOKENIZE(tags) as score_tags:bag{t1:(),t2:()};
grouped_data = group converted_flattened_data by (cookie,tag);
acc_data = foreach grouped_data generate group.cookie as cookie, group.tag as tag, SUM(converted_flattened_data.score) as score;
pageref_data = foreach data generate cookie, tags_and_pageref#'pageref' as pageref_bag;
flattened_pageref_data = FOREACH pageref_data GENERATE cookie, FLATTEN(pageref_bag) as score_tag;
filtered_data = FILTER flattened_pageref_data BY score_tag is not null and not IsEmpty(score_tag);
store acc_data into 'segmentation/2016-05-28/4' using PigStorage(',');
store filtered_data into 'pagerefdata/2016-05-28/4' using PigStorage(',');

However, the output of pagerefdata is all empty files. How can I filter it? As they are all empty, I do not want any output. Thanks in advance.
Not sure I fully understand your question here, but three lines from the bottom you seem to be trying to filter out the empty values by testing for null. Have you tried the following? Note the comparison must be > 0 to keep only the non-empty values:

FILTER flattened_pageref_data BY SIZE(TRIM(score_tag)) > 0;
Apache Pig: Add a field from one dataset to another as a new column
Let's say we have this scenario. dataset1.csv:

datefield
field11, field12, field13
field21, field22, field23
field31, field32, field33

What is the best way to get this?

field11, field12, field13, datefield
field21, field22, field23, datefield
field31, field32, field33, datefield

I tried to generate one dataset (relation1) with just these columns (after loading and generating):

field11, field12, field13
field21, field22, field23
field31, field32, field33

and another one (relation2) with just this column (after loading and generating):

datefield

and then doing this:

finalResult = FOREACH dataset1 GENERATE UDFFunction1(relation1::f1) as firstFields, UDFFunction2(relation2::f2) as lastField

But I'm getting 'A column needs to be projected from a relation for it to be used as a scalar'. The problem is with the second field (the one with the datefield). I would like to avoid joins, since a join would be a messy workaround. Any suggestions? Please ignore my UDF functions; they just format the input tuples accordingly.

Adding the Pig script:

register 's3://bucketName/lib/MyJar.jar';
define ParseOutFilesUDF packageName.ParseOutFiles;
define FormatTimestartedUDF packageName.FormatTimestarted;
outFile = LOAD 's3://bucketName/input/' USING PigStorage ('|');
--This UDF just reformats each tuple, adding a String to each tuple and returning a new one.
resultAll = FOREACH outFile GENERATE ParseOutFilesUDF(*) as initial;
--load the same csv again to get the TIMESTARTED field
timestarted = LOAD 's3://bucketName/input/' USING PigStorage ('|') as f1;
--filter to get only one record, which is something like TIMESTARTED=20160101
filetered = FILTER timestarted BY (f1 matches '.*TIMESTARTED.*');
timestarted = foreach filetered GENERATE $0 as fechaStarted;
--FormatTimestartedUDF just gets rid of 'TIMESTARTED=' to leave the date '20160101'
--this FOREACH is where it fails with 'A column needs to be projected...'
finalResult = FOREACH outFile GENERATE f1, FormatTimestartedUDF(timestarted) as f2;
STORE finalResult INTO 's3://bucketName/output/';
You are getting the error because you are referencing f1, which doesn't exist in outFile, and because timestarted is a relation, not a field. Also, you should be using the fields from resultAll and filtered:

outFile = LOAD 's3://bucketName/input/' USING PigStorage ('|');
resultAll = FOREACH outFile GENERATE ParseOutFilesUDF(*) as initial;
timestarted = LOAD 's3://bucketName/input/' USING PigStorage ('|') as f1;
filtered = FILTER timestarted BY (f1 matches '.*TIMESTARTED.*');
finalResult = FOREACH resultAll GENERATE resultAll.initial, FormatTimestartedUDF(filtered.$0);
STORE finalResult INTO 's3://bucketName/output/';
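This approach relies on Pig's scalar projection: a relation that contains a single row may be referenced as relation.field inside a FOREACH over another relation. A minimal sketch with illustrative file paths and names:

```pig
big     = LOAD '/data/big.tsv' USING PigStorage('\t') AS (a:chararray, b:chararray);
single  = LOAD '/data/single.tsv' USING PigStorage('\t') AS (d:chararray);
one_row = LIMIT single 1;
-- one_row.d is treated as a scalar; Pig fails at runtime if one_row holds more than one tuple
with_date = FOREACH big GENERATE a, b, one_row.d;
```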
Pig script for creating an ID x CITY matrix from a given csv file
I have an input file with ID, CITY, and COUNT information as below, and I want to create a csv file that holds the ID and the count for each CITY. The count should be written as '0' if the ID was not seen with that CITY. I tried to write a Pig script using GROUP BY, COGROUP, and FLATTEN but couldn't make it produce this sample output. How can I write a Pig script for this?

INPUT (ID,CITY,COUNT):
00004589,IZMIR,2
00005275,KOCAELI,1
00005275,ISTANBUL,1
00008523,ESKISEHIR,2

OUTPUT:
ID,IZMIR,ISTANBUL,ESKISEHIR,KOCAELI
00004589,2,0,0,0
00005275,0,1,0,1
00008523,0,0,2,0
You can use the script below to create the matrix:

DATA = load '/tmp/data.csv' Using PigStorage(',') as (ID:chararray,CITY:chararray,COUNT:chararray);
ESKISEHIR = filter DATA by (CITY matches 'ESKISEHIR');
ISTANBUL = filter DATA by (CITY matches 'ISTANBUL');
IZMIR = filter DATA by (CITY matches 'IZMIR');
KOCAELI = filter DATA by (CITY matches 'KOCAELI');
IDLIST = foreach DATA generate $0 as ID;
IDLIST = distinct IDLIST;
COGROUPED = cogroup IDLIST by $0, IZMIR by $0, ISTANBUL by $0, ESKISEHIR by $0, KOCAELI by $0;
CG_CITY = foreach COGROUPED generate FLATTEN($1), FLATTEN((IsEmpty($2.$2) ? {('0')} : $2.$2)), FLATTEN((IsEmpty($3.$2) ? {('0')} : $3.$2)), FLATTEN((IsEmpty($4.$2) ? {('0')} : $4.$2)), FLATTEN((IsEmpty($5.$2) ? {('0')} : $5.$2));
STORE CG_CITY INTO '/tmp/id_city_matrix' USING PigStorage(',');