Filter by length of array in Pig - apache-pig

I have data stored in Avro format. One of the fields of each record (array_field, say) is an array. Using Pig, how do I obtain only the records that have arrays with, for example, length(array_field) >= 2, and then store the results in Avro files using the same schema as the original input?

This should be doable with something like the code below:
A = LOAD '$INPUT' USING AvroStorage();
-- keep only records whose array field has at least 2 elements
B = FILTER A BY SIZE(array_field) >= 2;
-- pass the original writer schema so the output schema matches the input
STORE B INTO '$OUTPUT' USING AvroStorage('schema', '<schema_here>');
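Note that AvroStorage loads Avro arrays as Pig bags, and SIZE on a bag returns its number of elements, so the filter keeps only records whose array has at least two entries.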

Related

Dynamic list of variables in process in Azure Data Factory

I have a lookup config table that stores 1) the source table and 2) the list of variables to process, for example:
SQL Lookup Table:
tableA, variableX,variableY,variableZ <-- tableA has more than these 3 variables, i.e. it has other variables such as variableV and variableW, but they do not need to be processed
tableB, variableA,variableB <-- tableB has more than these 2 variables
Hence, I will need to dynamically connect to each table and process the specific variables in each table. The processing step is to convert the Julian date (in integer format) to a standard date (date format). Example SQL query:
select dateadd(dd, (variableX - ((variableX/1000) * 1000)) - 1, dateadd(yy, variableX/1000, 0)) FROM [dbo].[tableA]
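For instance, assuming the CYYDDD-style Julian format this formula implies, the value 121032 means day 32 of year 2021 (1900 + 121), which the expression converts to 2021-02-01.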
The problem is that after setting up the Lookup and ForEach in ADF, I am unsure how to loop through the variable array (or string, since SQL DB does not allow me to store array results) and convert all these variables into the standard date format.
The return result should be a processed dataset to be exported to a sink.
Hence, I would like to check: what is the best way to achieve this in ADF?
Thank you!
I have reproduced this in my local environment. Please see the steps below.
Using a Lookup activity, first get the list of tables from the control table.
Pass the lookup output to a ForEach activity.
Inside the ForEach activity, add a Lookup activity to get the variables list from the control table, where the table name is the current item of the ForEach activity.
@concat('select table_variables from control_tb where table_name = ''',item().table_name,'''')
Convert the Lookup2 activity's output value to an array using a Set Variable activity.
@split(activity('Lookup2').output.firstRow.table_variables,',')
Create another pipeline (pipeline2) with 2 parameters (table name (string) and variables (array)), and add a ForEach activity in pipeline2.
Pass the array parameter to the ForEach activity in pipeline2, and use a Copy activity to copy data from source to sink.
Back in pipeline1, connect an Execute Pipeline activity inside the ForEach activity to call pipeline2, passing the table name and the variables array.
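Inside pipeline2's ForEach (iterating over the variables array), the Copy activity's source query can be built dynamically. A minimal sketch of the dynamic content, assuming the pipeline parameters are named table_name and variables as above, and reusing the question's dateadd expression:
@concat('select dateadd(dd, (', item(), ' - ((', item(), '/1000) * 1000)) - 1, dateadd(yy, ', item(), '/1000, 0)) as ', item(), ' from [dbo].[', pipeline().parameters.table_name, ']')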

Pig ParquetLoader : Column Pruning

I read Parquet files which have a schema of 12 columns.
I do a group by and a sum aggregation over a single long column,
then join on another dataset. After the join I only take a single column (the summed one) from the Parquet dataset.
But Pig constantly keeps giving me this error:
"ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2000: Error processing rule ColumnMapKeyPrune. Try -t ColumnMapKeyPrune"
Does the Pig Parquet loader not support column pruning?
If I run with column pruning disabled, the job works.
Pseudocode for what I am trying to achieve:
REGISTER /<path>/parquet*.jar;
res1 = load '<path>' using parquet.pig.ParquetLoader() as (c1:chararray,c2:chararray,c3:int, c4:int, c5:chararray, c6:chararray, c7:chararray, c8:chararray, c9:chararray, c10:chararray, c11:chararray, c12:long);
res2 = group res1 by (c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11);
res3 = foreach res2 generate flatten(group) as (c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11),SUM(res1.c12) as counts;
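As the error message itself suggests, the optimizer rule can be turned off for the whole script from the command line, which matches the observation that the job works with pruning disabled (myscript.pig is a placeholder name):
pig -t ColumnMapKeyPrune myscript.pig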

Pig Optimization on Group by

Let's assume that I have a large data set with the schema layout below:
id,name,city
---------------
100,Ajay,Chennai
101,John,Bangalore
102,Zach,Chennai
103,Deep,Bangalore
....
...
I have two styles of Pig code giving me the same output.
Style 1 :
records = load 'user/inputfiles/records.txt' Using PigStorage(',') as (id:int,name:chararray,city:chararray);
records_grp = group records by city;
records_each = foreach records_grp generate group as city,COUNT(records.id) as emp_cnt;
dump records_each;
Style 2 :
records = load 'user/inputfiles/records.txt' Using PigStorage(',') as (id:int,name:chararray,city:chararray);
records_each = foreach (group records by city) generate group as city,COUNT(records.id) as emp_cnt;
dump records_each;
In the second style I used a nested FOREACH. Does the style 2 code run faster than the style 1 code or not?
I would like to reduce the total time taken to complete the Pig job.
So does the style 2 code achieve that, or is there no impact on the total time taken?
If somebody can confirm this, I can run similar code on my cluster with a very large dataset.
The two solutions have the same cost.
However, if records_grp is not used elsewhere, style 2 lets you skip declaring that alias and your script is shorter.
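A quick way to confirm this before running on the cluster is to compare the plans Pig compiles for each style: replace the dump with EXPLAIN in each script and the printed plans should be identical.
EXPLAIN records_each;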

How to split a data in particular column into two other columns using pig scripts?

Hi, I am working in big data. Since I am new to Pig programming, please help me get the required output. I have a CSV file which has many columns; one of the columns is price, which has data like the following:
(10 Lacs)
(20 to 30 Lacs)
And I need this to be split as:
price min max
10 null null
null 20 30
I have tried the following code:
a = LOAD '/user/folder1/filename.csv' USING PigStorage(',') AS (SourceWebsite:chararray, PropertyType:chararray, PropertyId:chararray, title:chararray, bedroom:int, bathroom:int, Balconies:chararray, price:chararray, pricepersqft:chararray, builtuparea:chararray, address:chararray, otherdetails:chararray, description:chararray, posted:chararray, Features:chararray, ContactDetails:chararray);
b = FOREACH a GENERATE STRSPLIT(price, 'to');
c = FOREACH b GENERATE FLATTEN(STRSPLIT(Price,',')) AS (MAX:int,MIN:int);
dump c;
Any help will be appreciated.
I just ran into the same issue, and here is how I managed to solve it.
Suppose the column output_raw.output_line_raw looks like this:
abc|def
gh|j
Then I split it into multiple columns like so:
output_in_columns = FOREACH output_raw GENERATE
FLATTEN(STRSPLIT(output_line_raw,'\\|'));
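(The pipe is escaped as '\\|' because STRSPLIT's delimiter argument is a regular expression, and an unescaped | would mean alternation.)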
To test whether it succeeded, I dumped the result after referring to the columns:
output_selection = FOREACH output_in_columns GENERATE
$0,
$1;
DUMP output_selection;
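Applied to the price column from the question, one way to get the price/min/max layout is REGEX_EXTRACT, which returns null when its pattern does not match. Below is a minimal sketch, reusing the alias a from the question and assuming the values always look like '(10 Lacs)' or '(20 to 30 Lacs)':
d = FOREACH a GENERATE
    (int)REGEX_EXTRACT(price, '\\((\\d+) Lacs\\)', 1) AS price, -- matches only the single-value form
    (int)REGEX_EXTRACT(price, '\\((\\d+) to', 1) AS min,        -- lower bound of a range
    (int)REGEX_EXTRACT(price, 'to (\\d+) Lacs\\)', 1) AS max;   -- upper bound of a range
dump d;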

finding the sum of columns for each row in pig

I need to find the sum of the columns in every row.
Consider the data set
A,1,5,45,25,20
B,5,50,5,23,12
C,1,25,4,15,23
I am trying to get the output as below
(A,96)
(B,95)
(C,68)
I cannot use the built-in SUM function directly for this. Should I write a custom UDF, or is there another way to do this?
You can define the schema and try the approach below.
input:
A,1,5,45,25,20
B,5,50,5,23,12
C,1,25,4,15,23
PigScript:
A = LOAD 'input' USING PigStorage(',') AS (f1:chararray, f2:int, f3:int, f4:int, f5:int, f6:int);
B = FOREACH A GENERATE f1, SUM(TOBAG(f2..));
DUMP B;
Output:
(A,96)
(B,95)
(C,68)
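This works because SUM expects a bag, and TOBAG(f2..) packs the range of fields f2 through f6 into a single bag per row, so no custom UDF is needed. For a fixed, small set of columns you could also just add the fields; note that unlike SUM, which ignores nulls, the + operator returns null if any field is null:
B = FOREACH A GENERATE f1, f2 + f3 + f4 + f5 + f6;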