Average all columns in Pig [duplicate] - apache-pig

I have to loop over 30 variables in a list
[var1,var2, ... , var30]
and for each variable I use some PIG group by statement such as
grouped = GROUP data by var1;
data_var1 = FOREACH grouped{
GENERATE group as mygroup,
COUNT(data) as count;
};
Is there a way to loop over the list of variables, or am I forced to repeat the code above manually 30 times?
Thanks!

I think what you're looking for is a Pig macro.
Create a relation for your 30 variables, iterate over them with FOREACH, and call a macro that takes two parameters: your data relation and the variable you want to group by.
Check the example in the link; the macro there is very similar to what you'd like to do.
UPDATE & code
So here's the macro you can use:
DEFINE my_cnt(data, group_field) RETURNS C {
$C = FOREACH (GROUP $data by $group_field) GENERATE
group AS mygroup,
COUNT($data) AS count;
};
Use the macro:
IMPORT 'cnt.macro';
data = LOAD 'data.txt' USING PigStorage(',') AS (field:chararray, value:chararray);
DESCRIBE data;
e = my_cnt(data,'the_field_you_group_by');
DESCRIBE e;
DUMP e;
I'm still thinking about how you can iterate over the fields you'd like to group by. My original suggestion, to FOREACH through a relation that contains the field names, is not correct. (Writing a UDF for this always works.) Let me think about it.
But this macro works as is if you call it once for each field name you want to group by.
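Since Pig Latin itself has no loop construct, one way to avoid writing the 30 calls by hand is to generate them. The following is a minimal sketch in plain Python, assuming the macro above is saved as cnt.macro and that the fields are named var1 to var30 as in the question. The helper name build_script and the output aliases (data_var1, ...) are hypothetical, the call style simply mirrors the my_cnt call shown above, and the LOAD statement for data still has to be added to the generated text.

```python
def build_script(variables, relation="data", macro="my_cnt"):
    """Return Pig Latin text with one macro call per grouping field."""
    lines = ["IMPORT 'cnt.macro';"]
    for var in variables:
        # One output alias per field, e.g. data_var1 = my_cnt(data, 'var1');
        lines.append(f"{relation}_{var} = {macro}({relation}, '{var}');")
    return "\n".join(lines)


# Field names assumed from the question: var1 .. var30
variables = [f"var{i}" for i in range(1, 31)]
script = build_script(variables)
print(script)
```

The printed text can be written to a file and run with pig like any other script.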

Related

how to assign name to count got from apache pig script?

students = LOAD 'hdfs://localhost:9000/pig_data/students.txt' USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray, cgpa:int);
group_all = Group students All;
student_count = foreach group_all Generate COUNT(students.cgpa);
Dump student_count;
This is a simple program to get the count of the students. How can I get a variable name beside the count, like anyvariablename studentcountvalue?
DUMP 'value',student_count.$0;
When you dump you should see the name of the column, it will be something like _c0
If you want to rename, you could define a new variable, something like this:
student_count_named = FOREACH student_count GENERATE $0 AS a;
You could also try putting an AS directly after the COUNT, e.g. GENERATE COUNT(students.cgpa) AS a; but I have not tried that.

How to check COUNT of filtered elements in PIG

I have the following data set in which I need to perform some steps based on the Car's company name.
(23,Nissan,12.43)
(23,Nissan Car,16.43)
(23,Honda Car,13.23)
(23,Toyota Car,17.0)
(24,Honda,45.0)
(24,Toyota,12.43)
(24,Nissan Car,12.43)
A = LOAD 'data.txt' AS (code:int, name:chararray, rating:double);
G = GROUP A by (code, REGEX_EXTRACT(name,'(?i)(^.+?\\b)\\s*(Car)*$',1));
DUMP G;
I am grouping cars by code and base company name, so that all the 'Nissan' and 'Nissan Car' records fall into one group, and similarly for the others.
/* Grouped data based on code and company's first name*/
((23,Nissan),{(23,Nissan,12.43),(23,Nissan Car,16.43)})
((23,Honda),{(23,Honda Car,13.23)})
((23,Toyota),{(23,Toyota Car,17.0)})
((24,Nissan),{(24,Nissan Car,12.43)})
((24,Honda),{(24,Honda,45.0)})
((24,Toyota),{(24,Toyota,12.43)})
Now I want to filter each group based on whether it contains a tuple whose name equals the group's name. If it does, keep only that tuple and ignore the others; if no such tuple exists, keep all the tuples of that group.
The Output should be:
((23,Nissan),{(23,Nissan,12.43)}) // Since this group contains a row with group's name i.e. Nissan
((23,Honda),{(23,Honda Car,13.23)})
((23,Toyota),{(23,Toyota Car,17.0)})
((24,Nissan),{(24,Nissan Car,12.43)})
((24,Honda),{(24,Honda,45.0)})
((24,Toyota),{(24,Toyota,12.43)})
R = FOREACH G { OW = FILTER A BY name==group.$1; IF COUNT(OW) > 0}
Could anybody please help with how I can do this? After filtering by the group's name, how can I find the count of the filtered tuples and get the required data?
OK, let's consider the records below as your input.
23,Nissan,12.43
23,Nissan Car,16.43
23,Honda Car,13.23
23,Toyota Car,17.0
24,Honda,45.0
24,Toyota,12.43
25,Toyato Car,23.8
25,Toyato Car,17.2
24,Nissan Car,12.43
For the above input, let's say the following is the intermediate output:
((23,Honda),{(23,Honda,Honda Car,13.23)})
((23,Nissan),{(23,Nissan,Nissan,12.43),(23,Nissan,Nissan Car,16.43)})
((23,Toyota),{(23,Toyota,Toyota Car,17.0)})
((24,Honda),{(24,Honda,Honda,45.0)})
((24,Nissan),{(24,Nissan,Nissan Car,12.43)})
((24,Toyota),{(24,Toyota,Toyota,12.43)})
((25,Toyato),{(25,Toyato,Toyato Car,23.8),(25,Toyato,Toyato Car,17.2)})
From the above intermediate output, assume you are looking for the output below, as per your requirement.
(23,Honda,1)
(23,Nissan,1)
(23,Toyota,1)
(24,Honda,1)
(24,Nissan,1)
(24,Toyota,1)
(25,Toyato,2)
Below is the code.
nissan_load = LOAD '/user/cloudera/inputfiles/nissan.txt' USING PigStorage(',') as(code:int,name:chararray,rating:double);
nissan_each = FOREACH nissan_load GENERATE code,TRIM(REGEX_EXTRACT(name,'(?i)(^.+?\\b)\\s*(Car)*$',1)) as brand_name,name,rating;
nissan_grp = GROUP nissan_each by (code,brand_name);
nissan_final_each =FOREACH nissan_grp {
A = FOREACH nissan_each GENERATE (brand_name == TRIM(name) ? 1 :0) as cnt;
B = (int)SUM(A);
C = FOREACH nissan_each GENERATE (brand_name != TRIM(name) ?1: 0) as extra_cnt;
D = SUM(C);
generate flatten(group) as(code,brand_name), (SUM(A.cnt) != 0 ? B : D) as final_cnt;
};
dump nissan_final_each;
Try this code with different inputs as well.
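For reference, the selection rule asked for in the question (keep only the tuple whose name equals the group's brand if one exists, otherwise keep the whole group) can be sketched outside Pig. This is a plain-Python model of the logic, not Pig code; the function names brand and select are hypothetical, and the brand is derived by stripping an optional trailing "Car", which is what the regex in the question is meant to do.

```python
import re
from collections import defaultdict

# Sample rows from the question: (code, name, rating)
rows = [
    (23, "Nissan", 12.43),
    (23, "Nissan Car", 16.43),
    (23, "Honda Car", 13.23),
    (23, "Toyota Car", 17.0),
    (24, "Honda", 45.0),
    (24, "Toyota", 12.43),
    (24, "Nissan Car", 12.43),
]


def brand(name):
    """Strip an optional trailing ' Car' (case-insensitive) from a name."""
    return re.sub(r"(?i)\s+car$", "", name).strip()


def select(rows):
    """Group by (code, brand); prefer exact-name tuples, else keep all."""
    groups = defaultdict(list)
    for code, name, rating in rows:
        groups[(code, brand(name))].append((code, name, rating))
    result = {}
    for key, bag in groups.items():
        exact = [t for t in bag if t[1] == key[1]]
        result[key] = exact if exact else bag
    return result


selected = select(rows)
for key, bag in sorted(selected.items()):
    print(key, bag)
```

In Pig the same rule would need a nested FILTER plus a test on whether the filtered bag is empty; the sketch only pins down the expected result for each group.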

PIG filter out rows with improper number of columns

I have simple data loaded in a:
dump a
ahoeh,1,e32
hello,2,10
ho,3
I need to filter out all rows whose number of columns/fields is different from 3. How do I do it?
In other words result should be:
dump results
ahoeh,1,e32
hello,2,10
I know there is a built-in FILTER operator. However, I cannot figure out how to express the condition (number of columns = 3).
Thanks!
Can you try this?
input
ahoeh,1,e32
hello,2,10
ho,3
3,te,0
aa,3,b
y,h,3
3,3,3
3,3,3,1,2,3,3,,,,,,4,44,6
PigScript1:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,','));
C = FOREACH B GENERATE COUNT(TOBAG(*)),$0..;
D = FILTER C BY $0==3;
E = FOREACH D GENERATE $1..;
DUMP E;
PigScript2:
A = LOAD 'input' USING PigStorage(',');
B = FOREACH A GENERATE COUNT(TOBAG(*)),$0..;
C = FILTER B BY (int)$0==3;
D = FOREACH C GENERATE $1..;
DUMP D;
Output:
(ahoeh,1,e32)
(hello,2,10)
(3,te,0)
(aa,3,b)
(y,h,3)
(3,3,3)
(It seems that I don't have enough karma to comment; that's why this is posted as a new answer.)
The accepted answer doesn't quite behave as expected if null/empty string is a valid field value; you need to use COUNT_STAR instead of COUNT to count empty/null fields in your schema.
See: https://pig.apache.org/docs/r0.9.1/func.html#count-star
For example, given the following input data:
1,2,3
1,,3
and this Pig script:
a = load 'input' USING PigStorage(',');
counted = foreach a generate COUNT_STAR(TOBAG(*)), $0..;
filtered = filter counted by $0 == 3;
result = foreach filtered generate $1..;
The filtered alias will contain both rows. The difference is that COUNT({(1),(),(3)}) returns 2 while COUNT_STAR({(1),(),(3)}) returns 3.
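The difference between the two counts can be modelled in a few lines. This is a plain-Python sketch of the semantics, not Pig code: the function names are hypothetical, and None stands in for the null that an empty field is assumed to load as, matching the behaviour described above.

```python
def to_bag(line, sep=","):
    """Model TOBAG(*) on a delimited line: one single-field tuple per field.

    Assumption: an empty field is loaded as null, represented here by None.
    """
    return [(field if field != "" else None,) for field in line.split(sep)]


def pig_count(bag):
    """Model of COUNT: a tuple is skipped when its first field is null."""
    return sum(1 for t in bag if t and t[0] is not None)


def pig_count_star(bag):
    """Model of COUNT_STAR: every tuple is counted, nulls included."""
    return len(bag)


for line in ["1,2,3", "1,,3", "ho,3"]:
    bag = to_bag(line)
    print(line, pig_count(bag), pig_count_star(bag))
```

With this model the row 1,,3 is counted as 2 fields by COUNT and as 3 by COUNT_STAR, so only the second filter keeps it.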
I see two ways to do this:
First, I think you can rephrase the filter, as it boils down to: give me all lines that do not contain a NULL value. For lots of columns, writing this filter statement is rather tedious.
Second, you could convert the columns of each line into a bag using TOBAG (http://pig.apache.org/docs/r0.12.1/func.html#tobag), then write a UDF that checks the bag for null tuples and returns true or false, and use that UDF in the filter statement.
Either way, I think some tedium is unavoidable.

Sending relation to UDF functions

Can I send a relation to a Pig UDF as input? A relation can have multiple tuples in it. How do we read each tuple one by one in the UDF?
Below is my sample input file.
Surender,HDFC,60000,CTS
Raja,AXIS,80000,TCS
Raj,HDFC,70000,TCS
Kumar,AXIS,70000,CTS
Remya,AXIS,40000,CTS
Arun,SBI,30000,TCS
Vimal,SBI,10000,TCS
Ankur,HDFC,80000,CTS
Karthic,HDFC,95000,CTS
Sandhya,AXIS,60000,CTS
Amit,SBI,70000,CTS
myinput = LOAD '/home/cloudera/surender/laurela/balance.txt' USING PigStorage(',') AS(name:chararray,bank:chararray,amt:long,company:chararray);
grouped = GROUP myinput BY company;
All I need is the details of the highest-paid employee in each company. How do I use a UDF for that?
I need something like this:
CTS Karthic,HDFC,95000,CTS
TCS Raja,AXIS,80000,TCS
Can someone help me with this?
This script will give you the results you want :
A = LOAD '/home/cloudera/surender/laurela/balance.txt' USING PigStorage(',') AS(name:chararray,bank:chararray,amt:long,company:chararray);
B = GROUP A BY (company);
topResults = FOREACH B {result = TOP(1, 2, A); GENERATE FLATTEN(result);}
dump topResults;
Explanation:
First we group A by company, so B is:
(CTS,{(Surender,HDFC,60000,CTS),(Kumar,AXIS,70000,CTS),(Remya,AXIS,40000,CTS),(Ankur,HDFC,80000,CTS),(Karthic,HDFC,95000,CTS),(Sandhya,AXIS,60000,CTS),(Amit,SBI,70000,CTS)})
(TCS,{(Raja,AXIS,80000,TCS),(Raj,HDFC,70000,TCS),(Arun,SBI,30000,TCS),(Vimal,SBI,10000,TCS)})
Then, for each tuple in B, we generate result, which is the top 1 record from the bag A inside B, ranked by the value of column number 2, i.e. amt. Columns are numbered from 0.
Note
Your data has extra spaces after the company name. Please remove the extra spaces or use the following data:
Surender,HDFC,60000,CTS
Raja,AXIS,80000,TCS
Raj,HDFC,70000,TCS
Kumar,AXIS,70000,CTS
Remya,AXIS,40000,CTS
Arun,SBI,30000,TCS
Vimal,SBI,10000,TCS
Ankur,HDFC,80000,CTS
Karthic,HDFC,95000,CTS
Sandhya,AXIS,60000,CTS
Amit,SBI,70000,CTS
You don't need to write a UDF to do this; you can simply use Pig's TOP function: http://pig.apache.org/docs/r0.11.0/func.html#topx
Here is an example of code that should work (not tested):
grouped = GROUP myinput BY company;
result = FOREACH grouped GENERATE group AS company, FLATTEN(TOP(1, 2, myinput));
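To make explicit what TOP(1, 2, ...) per group is expected to return for this data, here is a plain-Python model of the same computation. It is only a sketch for checking results, not Pig code; the function name top_per_company is hypothetical, and the rows are the cleaned sample data from the question.

```python
from collections import defaultdict

# (name, bank, amt, company), as in the sample input
rows = [
    ("Surender", "HDFC", 60000, "CTS"),
    ("Raja", "AXIS", 80000, "TCS"),
    ("Raj", "HDFC", 70000, "TCS"),
    ("Kumar", "AXIS", 70000, "CTS"),
    ("Remya", "AXIS", 40000, "CTS"),
    ("Arun", "SBI", 30000, "TCS"),
    ("Vimal", "SBI", 10000, "TCS"),
    ("Ankur", "HDFC", 80000, "CTS"),
    ("Karthic", "HDFC", 95000, "CTS"),
    ("Sandhya", "AXIS", 60000, "CTS"),
    ("Amit", "SBI", 70000, "CTS"),
]


def top_per_company(rows, n=1, column=2):
    """Group rows by company (index 3) and keep the n largest by `column`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[3]].append(row)
    return {
        company: sorted(bag, key=lambda r: r[column], reverse=True)[:n]
        for company, bag in groups.items()
    }


top = top_per_company(rows)
for company, best in sorted(top.items()):
    print(company, best)
```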

Using pig, how do I parse a mixed format line into tuples and a bag of tuples?

I'm new to pig, and I'm having an issue parsing my input and getting it into a format that I can use. The input file contains lines that have both fixed fields and KV pairs as follows:
FF1|FF2|FF3|FF4|KVP1|KVP2|...|KVPn
My goal here is to count the number of unique fixed field combinations for each of the KV Pairs. So considering the following input lines:
1|2|3|4|key1=value1|key2=value2
2|3|4|5|key1=value7|key2=value2|key3=value3
When I'm done, I'd like to be able to generate the following results (the output format doesn't really matter at this point, I'm just showing you what I'd like the results to be):
key1=value1 : 1
key1=value7 : 1
key2=value2 : 2
key3=value3 : 1
It seems like I should be able to do this by grouping the fixed fields and flattening a bag of the KV Pairs to generate the cross product
I've tried reading this in with something like:
data = load 'myfile' using PigStorage('|');
A = foreach data generate $0 as ff1:chararray, $1 as ff2:long, $2 as ff3:chararray, $3 as ff4:chararray, TOBAG($4..) as kvpairs:bag{kvpair:tuple()};
B = foreach A { sorted = order A by ff2; lim = limit sorted 1; generate group.ff1, group.ff4, flatten( lim.kvpairs ); };
C = filter B by ff3 matches 'somevalue';
D = foreach C generate ff1, ff4, flatten( kvpairs ) as kvpair;
E = group D by (ff1, ff4, kvpair);
F = foreach E generate group, COUNT(E);
This generates records with a schema as follows:
A: {date: long,hms: long,id: long,ff1: chararray,ff2: long,ff3: chararray,ff4: chararray,kvpairs: {kvpair: (NULL)}}
While this gets me the schema that I want, there are several problems that I can't seem to solve:
By using the TOBAG with .., no schema can be applied to my kvpairs, so I can't ever filter on kvpair, and I don't seem to be able to cast this at any point, so it's an all or nothing query.
The filter in statement 'C' seems to return no data regardless of what value I use, even if I use something like '.*' or '.+'. I don't know if this is because there is no schema, or if this is actually a bug in pig. If I dump some data from statement B, I definitely see data there that would match those expressions.
So I've tried approaching the problem differently, by loading the data using:
data = load 'myfile' using PigStorage('\n') as (line:chararray);
init_parse = foreach data generate FLATTEN( STRSPLIT( line, '\\|', 4 ) ) as (ff1:chararray, ff2:chararray, ff3:chararray, ff4:chararray, kvpairsStr:chararray);
A = foreach mc_bk_data generate ff1, ff2, ff3, ff4, TOBAG( STRSPLIT( kvpairsStr, '\\|', 500 ) ) as kvpairs:bag{t:(kvpair:chararray)};
The issue here is that the TOBAG( STRSPLIT( ... ) ) results in a bag of a single tuple, with each of the kvpairs being a field in that tuple. I really need the bag to contain, each of the individual kvpairs as a tuple of one field so that when I flatten the bag, I get the cross product of the bag and the group that I'm interested in.
I'm open to other ways of attacking this problem as well, but I can't seem to find a good way to transform my tuple of multiple fields into a bag of tuples, each with a single field.
I'm using Apache Pig version 0.11.1.1.3.0.0-107
Thanks in advance.
Your second approach is on the right track. Unfortunately, you'll need a UDF to convert a tuple to a bag, and as far as I know there is no builtin to do this. It's a simple matter to write one, however.
You won't want to group on the fixed fields, but rather on the key-value pairs themselves. So you only need to keep the tuple of key-value pairs; you can completely ignore the fixed fields.
The UDF is pretty simple. In Java, you can just do something like this in your exec method:
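DataBag b = new DefaultDataBag();
// The UDF's single argument is the tuple produced by STRSPLIT.
Tuple t = (Tuple) input.get(0);
for (int i = 0; i < t.size(); i++) {
    Object o = t.get(i);
    // Wrap each field in its own one-field tuple.
    Tuple e = TupleFactory.getInstance().newTuple(o);
    b.add(e);
}
return b;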
Once you have that, turn the tuple from STRSPLIT into a bag, flatten it, and then do the grouping and counting.
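The whole pipeline (split, turn the key-value fields into a bag, flatten, group, count) can be checked against the expected numbers with a plain-Python model. This is a sketch of the logic rather than Pig code; the function name count_kv_pairs is hypothetical. Unlike the suggestion to ignore the fixed fields entirely, it counts distinct fixed-field combinations per key-value pair, which is what the question asks for; on the sample input both approaches give the same numbers.

```python
from collections import Counter

lines = [
    "1|2|3|4|key1=value1|key2=value2",
    "2|3|4|5|key1=value7|key2=value2|key3=value3",
]


def count_kv_pairs(lines, fixed=4, sep="|"):
    """Count distinct fixed-field combinations for each key=value pair."""
    seen = set()
    for line in lines:
        parts = line.split(sep)
        combo = tuple(parts[:fixed])
        # Each remaining field becomes its own row, paired with the
        # fixed fields: the equivalent of tuple-to-bag plus FLATTEN.
        for kv in parts[fixed:]:
            seen.add((kv, combo))
    return Counter(kv for kv, _ in seen)


counts = count_kv_pairs(lines)
for kv, n in sorted(counts.items()):
    print(kv, ":", n)
```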