How do I call log? - apache-pig

I got this error:
Could not infer the matching function for org.apache.pig.builtin.LOG
as multiple or none of them fit. Please use an explicit cast
From this code:
> describe A;
A: {p: long,k: chararray,count: double}
> foreach (group A by p) generate SUM(A.count * LOG(A.count));
What am I doing wrong?

I suppose LOG works on a double, not on a bag of doubles. In your expression you are handing it a bag (A.count), just as in SUM(A.count), but unlike LOG, SUM is designed to take a bag.
Try preparing your data before the bag aggregation, something like:
computed = foreach A generate p, (count * LOG(count)) as multiplied;
summed = foreach (group computed by p) generate group as p, SUM(computed.multiplied);
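If you prefer to keep it to a single statement, a nested foreach should also work on Pig 0.10 or later (untested sketch; the inner foreach builds the bag of products that SUM then consumes):
summed = foreach (group A by p) {
multiplied = foreach A generate count * LOG(count);
generate group as p, SUM(multiplied);
};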

Related

PIG ERROR 1070 when Count grouped data

I just want to count how many players each team had in 2011.
It works fine when I group by tmID. However, when I try to count the grouped data, ERROR 1070 comes out.
load_file = load 'Assignment2/basketball_players.csv' using PigStorage(',');
temp = foreach load_file generate
(chararray)$3 AS tmID,
(int)$1 AS year,
(chararray)$0 AS playerID;
fil_data = filter temp by year == 2011;
group_data = group fil_data by tmID;
count_data = foreach group_data generate group, count($1);
dump count_data;
The error message is shown below.
<file script.pig, line 8, column 48> Failed to generate logical plan. Nested exception: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve count using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Could anyone help me with this problem? Thanks!
The COUNT function is case sensitive. Ref: http://pig.apache.org/docs/r0.12.0/func.html#count
Try this :
count_data = foreach group_data generate group, COUNT($1);
Instead of $1, I would suggest using the alias fil_data, as it's more readable.
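For example, the same statement using the alias instead of the positional reference (assuming the grouping above):
count_data = foreach group_data generate group, COUNT(fil_data);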
All built-in function names need to be in upper case. Keywords like foreach, generate, group by, etc. can be in either case.

How do I combine two pig statements into one?

This two-stage Pig processing works:
my_out = foreach (group my_in by id) {
grouped = BagGroup(my_in.(keyword,weight),my_in.keyword);
generate
group as id,
CountEach(my_in.domain) as domains,
grouped as grouped;
};
my_out1 = foreach my_out {
keywords = foreach grouped generate group as keyword, SUM($1.weight) as weight;
generate id, domains, keywords;
};
However, when I combine them:
my_out = foreach (foreach (group my_in by id) {
grouped = BagGroup(my_in.(keyword,weight),my_in.keyword);
generate
group as id,
CountEach(my_in.domain) as domains,
grouped as grouped;
}) {
keywords = foreach grouped generate group as keyword, SUM($1.weight) as weight;
generate id, domains, keywords;
};
I get an error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered " <IDENTIFIER> "generate "" at line 1, column 5.
My questions are:
How do I avoid this error?
Does it even make sense what I am trying to do?
Even if I manage to accomplish this, will this save me an MR pass?
In general, Pig's ability to parse complicated nested expressions is unreliable. Another common error when the nesting gets to be too much to handle is ERROR 1000: Error during parsing. Lexical error at line XXXX, column 0. Encountered: <EOF> after : ""
I often try to do this to avoid having to come up with a bunch of names for aliases that have no meaning except as intermediate steps in a computation. But sometimes it's not possible, as you have found out. My guess is that nesting a nested foreach is a no-go. But in your case, it looks like the first nested foreach is not necessary. Try this:
my_out = foreach (foreach (group my_in by id)
generate
group as id,
CountEach(my_in.domain) as domains,
BagGroup(my_in.(keyword,weight),my_in.keyword) as grouped
) {
keywords = foreach grouped generate group as keyword, SUM($1.weight) as weight;
generate id, domains, keywords;
};
As for your last question: no, this will make no difference to the eventual MR plan. This is purely a matter of how Pig parses your script; the map-reduce logic is unchanged by grouping the commands in this way.

LINQ to SQL - select query

I need help. I've some types defined:
class Class1 { int ID; double Price; }
class Class2 { int ID; Class1 myClass1; }
class Class3 { int ID; List<Class2> Class2List; }
Now I have a list List<Class3> class3List, from which I need to take only the minimum double value (the minimum Price). Is this possible with LINQ to SQL, or do I need to use a foreach loop?
var min = class3List.SelectMany(x => x.Class2List).Min(x => x.myClass1.Price);
Use the SelectMany method to flatten the nested lists (effectively a List<List<Class2>>) into a single sequence of Class2, and then take the minimum of the prices produced by the simple selector x => x.myClass1.Price.

Error 1045 on sum function in pig latin with an int

The following Pig Latin script:
data = load 'access_log_Jul95' using PigStorage(' ') as (ip:chararray, dash1:chararray, dash2:chararray, date:chararray, date1:chararray, getRequset:chararray, location:chararray, http:chararray, code:int, size:int);
splitDate = foreach data generate size as size:int , ip as ip, FLATTEN(STRSPLIT(date, ':')) as h;
groupedIp = group splitDate by h.$1;
a = foreach groupedIp{
added = foreach splitDate generate SUM(size); --
generate added;
};
describe a;
gives me the error:
ERROR 1045:
<file 3.pig, line 10, column 39> Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast.
This error makes me think I need to cast size as an int, but if I describe my groupedIp relation, I get the following schema:
groupedIp: {group: bytearray,splitDate: {(size: int,ip: chararray,h: bytearray)}} which indicates that size is an int and should be usable by the SUM function.
Am I calling the SUM function incorrectly? Let me know if you would like to see anything else, such as the input file.
SUM takes a bag as input, but you are passing it the scalar field 'size'.
Try to eliminate the nested foreach and use:
a = foreach groupedIp generate SUM(splitDate.size);
Do some dumps of your data. I'll bet some of the values in the size column are non-integer, and Pig runs into those and dies. You could also write your own isInteger UDF to check this before the rest of your processing and throw out any rows that aren't integers.
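A lighter-weight check (a sketch, no UDF needed): values that fail the int cast declared in the load schema typically come through as null, so you can drop them before grouping:
clean = filter data by size is not null; -- discard rows whose size did not parse as an int
splitDate = foreach clean generate size, ip, FLATTEN(STRSPLIT(date, ':')) as h;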
SUM, AVG and COUNT are functions that always work on a bag, so group the data first and then apply the function to the grouped bag, as below:
A = load 'nyse_data.txt' as (exchange:chararray, symbol:chararray, date:chararray, open:float, high:float, low:float, close:float, volume:int, adj_close:float);
G = group A by symbol;
C = foreach G generate group, SUM(A.open);

How do I specify a field name for a wrapped tuple in Pig?

I have tuples with schema (a:int, b:int, c:int) stored in an alias first. I want to convert each tuple so that I get a new relation second with a schema like this:
(d: (a:int, b:int, c:int))
Basically, I've wrapped my initial tuple in another tuple and named the field. This is in preparation for a cross operation where I want to cross two relations but keep each one in a named field.
Here is what I would expect it to look like, except there's an error:
second = FOREACH first GENERATE TOTUPLE(*) AS (d:tuple);
This errors out too:
second = FOREACH first GENERATE TOTUPLE(*) AS (d:tuple (a:int, b:int, c:int));
Thanks!
Uri
What about:
second = FOREACH first GENERATE TOTUPLE(*) AS d;
describe second;
second: {d: (a: int,b: int,c: int)}
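For the cross you mentioned, the same trick can be applied to both relations (a sketch; other is a hypothetical second relation with its own schema):
wrapped_other = FOREACH other GENERATE TOTUPLE(*) AS e;
crossed = CROSS second, wrapped_other;
-- each tuple of crossed now carries the two named tuple fields d and e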