Finding the sum of columns for each row in Pig

I need to find the sum of the columns in every row.
Consider this data set:
A,1,5,45,25,20
B,5,50,5,23,12
C,1,25,4,15,23
I am trying to get the output below:
(A,96)
(B,95)
(C,68)
I cannot use the built-in SUM function for this. Should I write a custom UDF, or is there another way to do this?

You can define the schema and try the below approach.
input:
A,1,5,45,25,20
B,5,50,5,23,12
C,1,25,4,15,23
PigScript:
A = LOAD 'input' USING PigStorage(',') AS(f1:chararray,f2:int,f3:int,f4:int,f5:int,f6:int);
B = FOREACH A GENERATE f1,SUM(TOBAG(f2..));
DUMP B;
Output:
(A,96)
(B,95)
(C,68)
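If you would rather avoid SUM altogether, plain addition is another option. The sketch below reuses the schema from the script above; the alias total is only an illustrative name. One difference to keep in mind: + returns null as soon as any operand is null, whereas SUM skips nulls in the bag.

```pig
A = LOAD 'input' USING PigStorage(',') AS (f1:chararray, f2:int, f3:int, f4:int, f5:int, f6:int);
-- Add the numeric columns directly; the result is null if any of f2..f6 is null.
B = FOREACH A GENERATE f1, f2 + f3 + f4 + f5 + f6 AS total;
DUMP B;
```

For the sample input this gives the same totals as the SUM(TOBAG(...)) version, but every column has to be listed by hand.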

Related

Pig calculating avg of delay fails

I have a file of airplane data that includes each flight's destination and delay (the delay can be a negative or positive number).
A = load ‘flightdelays’ using Pigstorage(‘,’);
B = foreach a generate $14 as delay:int, $17 as dest:chararray;
C = group b all; -- this is failing for cast error, also get an error failed to read data from input file..
D =foreach c generate b.dest, AVG(b.delay);
When I execute this, I get 0 records read from the source file and the MapReduce job fails.
Why is it not able to calculate AVG?
Check the extension and path of the file. Is your file comma-separated? There are also several case issues in your script.
The s in PigStorage is lowercase in your load statement. It should be:
A = load 'flightdelays' using PigStorage(',');
Your next statement still refers to a:
B = foreach a generate $14 as delay:int, $17 as dest:chararray;
There are no relations called a, b, or c. You loaded the data into relation A, so the following statements must refer to A, B, and C.
First, A and a are treated as different names, because relation names in Pig are case-sensitive. Second, when you apply an aggregate function to a grouped relation, the FOREACH should generate the grouping attribute and the aggregate function.
In this scenario you grouped by ALL, so b.dest next to the aggregate gives you a bag of every destination rather than one value per group.
If you want the average delay per destination, group by dest.
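Putting those points together, a per-destination average might look like the sketch below. The column positions $14 and $17 come from the question; the aliases flights, delays, by_dest, and result are hypothetical names, and the explicit casts are an assumption made so that the types do not depend on how a given Pig version treats typed AS clauses in FOREACH.

```pig
flights = LOAD 'flightdelays' USING PigStorage(',');
-- Cast explicitly, since fields loaded without a schema are bytearrays.
delays = FOREACH flights GENERATE (int)$14 AS delay, (chararray)$17 AS dest;
-- Group by the attribute you want one average for.
by_dest = GROUP delays BY dest;
result = FOREACH by_dest GENERATE group AS dest, AVG(delays.delay) AS mean_delay;
DUMP result;
```

Note that every alias is referenced with exactly the same case in which it was defined.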

Filter by length of array in Pig

I have data stored in avro format. One of the fields of each record (array_field, say) is an array. Using Pig how do I obtain only the records that have arrays with, for example, length(array_field) >= 2 and then store the results in avro files using the same schema as the original input?
This should be doable with something like the code below:
A = LOAD '$INPUT' USING AvroStorage();
B = FILTER A BY SIZE(array_field) >= 2;
STORE B INTO '$OUTPUT' USING AvroStorage('schema', '<schema_here>');

Pig Optimization on Group by

Let's assume that I have a large data set with the schema layout below:
id,name,city
---------------
100,Ajay,Chennai
101,John,Bangalore
102,Zach,Chennai
103,Deep,Bangalore
....
...
I have two styles of Pig code that give me the same output.
Style 1 :
records = load 'user/inputfiles/records.txt' Using PigStorage(',') as (id:int,name:chararray,city:chararray);
records_grp = group records by city;
records_each = foreach records_grp generate group as city,COUNT(records.id) as emp_cnt;
dump records_each;
Style 2 :
records = load 'user/inputfiles/records.txt' Using PigStorage(',') as (id:int,name:chararray,city:chararray);
records_each = foreach (group records by city) generate group as city,COUNT(records.id) as emp_cnt;
dump records_each ;
In the second style I used a nested FOREACH. Does the style 2 code run faster than the style 1 code?
I would like to reduce the total time taken to complete this Pig job.
Does the style 2 code achieve that, or is there no impact on the total time taken?
If somebody can confirm this, I can run similar code on my cluster with a very large data set.
The two solutions have the same cost.
However, if records_grp is not used elsewhere, style 2 lets you skip declaring an intermediate alias, so your script is shorter.
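One way to check this yourself, rather than taking it on trust, is to ask Pig for the execution plan of each style and compare them. The sketch below does this for style 2; running EXPLAIN on records_each from style 1 should let you compare the two side by side.

```pig
records = LOAD 'user/inputfiles/records.txt' USING PigStorage(',') AS (id:int, name:chararray, city:chararray);
records_each = FOREACH (GROUP records BY city) GENERATE group AS city, COUNT(records.id) AS emp_cnt;
-- Print the logical, physical, and MapReduce plans instead of running the job.
EXPLAIN records_each;
```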

How to split a data in particular column into two other columns using pig scripts?

Hi, I am working in big data and am a newbie to Pig programming, so please help me get the required output. I have a CSV file with many columns. One of the columns is price, which has data like the following:
(10 Lacs)
(20 to 30 Lacs)
And I need this to be split as:
price  min   max
10     null  null
null   20    30
I have tried the following code
a = LOAD '/user/folder1/filename.csv' using PigStorage(',')as(SourceWebsite:chararray,PropertyType:chararray,PropertyId:chararray,title:chararray,bedroom:int,bathroom:int,Balconies:chararray,price:chararray,pricepersqft:chararray,builtuparea:chararray,address:chararray,otherdetails:chararray,description:chararray,posted:chararray,Features:chararray,ContactDetails:chararray);
b = FOREACH a GENERATE STRSPLIT(price, 'to');
c = FOREACH b GENERATE FLATTEN(STRSPLIT(Price,',')) AS (MAX:int,MIN:int);
dump c;
Any help will be appreciated.
I just ran into the same issue, and here is how I managed to solve it.
Suppose the column called output_raw.output_line_raw looks like this:
abc|def
gh|j
Then I split it into multiple columns like so:
output_in_columns = FOREACH output_raw GENERATE
    FLATTEN(STRSPLIT(output_line_raw,'\\|'));
To test whether it succeeded, I dumped the result after referring to the columns:
output_selection = FOREACH output_in_columns GENERATE
    $0,
    $1;
DUMP output_selection;
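For the price column in the question, a regular expression may be easier than splitting, because the two formats need different output columns. This is only a sketch under assumptions: prices.txt is a hypothetical one-column file, the values are assumed to look exactly like "10 Lacs" or "20 to 30 Lacs" (with or without the surrounding parentheses), and the aliases raw, parsed, min_price, and max_price are illustrative names.

```pig
raw = LOAD 'prices.txt' AS (price:chararray);
-- REGEX_EXTRACT returns null when the pattern does not match,
-- so each row fills either price or the min/max pair.
parsed = FOREACH raw GENERATE
    (int)REGEX_EXTRACT(price, '^\\(?(\\d+) Lacs\\)?$', 1) AS price,
    (int)REGEX_EXTRACT(price, '^\\(?(\\d+) to \\d+ Lacs\\)?$', 1) AS min_price,
    (int)REGEX_EXTRACT(price, '^\\(?\\d+ to (\\d+) Lacs\\)?$', 1) AS max_price;
DUMP parsed;
```

In the real script the same three expressions would be applied to the price field of relation a instead of a separate file.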

Pig: summing column b of rows with the same column a

I'm trying to count the number of tweets with a certain hashtag over a period of time but I'm getting an error when trying to use the built-in SUM function.
Example:
data = LOAD 'tweets_2.csv' USING PigStorage('\t') AS (date:float,hashtag:chararray,count:int, year:int, month:int, day:int, hour:int, minute:int, second:int);
NBLNabilVoto_count = FILTER data BY hashtag == 'NBLNabilaVoto';
NBLNabilVoto_group = GROUP NBLNabilVoto by count;
X = FOREACH NBLNabilVoto GENERATE group, SUM(data.count);
Error:
<line 22, column 47> Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast.
First load the data, then filter for the time interval you want to process. Group the records by hashtag. Use the COUNT() function to count the number of tweets for each hashtag.
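Those steps could be sketched roughly as follows. The LOAD statement is copied from the question; the year and month values in the filter are placeholders for whatever interval you need, and the aliases in_window, by_tag, and tag_totals are hypothetical names. COUNT gives the number of rows per hashtag, while SUM over the count column gives the total if each row already carries a tweet count.

```pig
data = LOAD 'tweets_2.csv' USING PigStorage('\t') AS (date:float,hashtag:chararray,count:int, year:int, month:int, day:int, hour:int, minute:int, second:int);
-- Placeholder time window: adjust the year and month to the period of interest.
in_window = FILTER data BY year == 2013 AND month == 1;
by_tag = GROUP in_window BY hashtag;
-- The bag inside each group is named after the grouped relation (in_window).
tag_totals = FOREACH by_tag GENERATE group AS hashtag, COUNT(in_window) AS num_rows, SUM(in_window.count) AS total;
DUMP tag_totals;
```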
I am not sure that the code is doing what you think or want it to do, but the error you are getting is because you are calling SUM on the wrong thing. You need to do this:
X = FOREACH NBLNabilVoto_group GENERATE group, SUM(NBLNabilVoto_count.count);
After the GROUP, NBLNabilVoto_count is the name of the bag that holds the tuples for each group.
I think you are using the wrong relation in your SUM: you should SUM over NBLNabilVoto_count, not the data relation. I also have a question: why are you grouping by count?
If you want to count all your tweets with the hashtag NBLNabilVoto, I think the code should look like this:
data = LOAD 'tweets_2.csv' USING PigStorage('\t') AS (date:float,hashtag:chararray,count:int, year:int, month:int, day:int, hour:int, minute:int, second:int);
NBLNabilVoto_count = FILTER data BY hashtag == 'NBLNabilaVoto';
NBLNabilVoto_group = GROUP NBLNabilVoto_count ALL;
X = FOREACH NBLNabilVoto_group GENERATE group, SUM(NBLNabilVoto_count.count);