How to get the max value from a column in pig? - apache-pig

I have a table and I want to query the sum of a column. Below is the detailed table information:
grunt> teams_raw = load '/usr/input/Teams.csv' using org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER');
grunt> teams = foreach teams_raw generate $0 as year:int, $1 as lgID, $2 as tmID, $8 as g:float, $9 as w:float, $11 as t:float, $18 as name;
grunt> describe teams
teams: {year: bytearray,lgID: bytearray,tmID: bytearray,g: bytearray,w: bytearray,t: bytearray,name: bytearray};
grunt> grp_by_team = group teams by tmID;
I get the error below when trying to compute the sum of g from the teams table:
grunt> win = foreach grp_by_team generate group, SUM(teams.g) as win;
grunt> DUMP win;
17/05/06 15:32:14 ERROR mapreduce.MRPigStatsUtil: 1 map reduce job(s) failed!
17/05/06 15:32:14 ERROR grunt.Grunt: ERROR 1066: Unable to open iterator for alias win
Details at logfile: /Users/joey/dev/bigdata/pig_1494048371690.log
In the log file, I see the exception below.
================================================================================
Pig Stack Trace
---------------
ERROR 1066: Unable to open iterator for alias win
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias win
at org.apache.pig.PigServer.openIterator(PigServer.java:1019)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:747)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:376)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:231)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:206)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
at org.apache.pig.Main.run(Main.java:564)
at org.apache.pig.Main.main(Main.java:176)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
Caused by: java.io.IOException: Job terminated with anomalous status FAILED
at org.apache.pig.PigServer.openIterator(PigServer.java:1011)
... 13 more
================================================================================
Below is the dumped data of teams and grp_by_team:
grunt> dump teams;
...
(1994,NHL,TBL,48,17,3,Tampa Bay Lightning)
(1994,NHL,TOR,48,21,8,Toronto Maple Leafs)
(1994,NHL,VAN,48,18,12,Vancouver Canucks)
(1994,NHL,WAS,48,22,8,Washington Capitals)
(1994,NHL,WIN,48,16,7,Winnipeg Jets)
(1995,NHL,ANA,82,35,8,Mighty Ducks of Anaheim)
(1995,NHL,BOS,82,40,11,Boston Bruins)
(1995,NHL,BUF,82,33,7,Buffalo Sabres)
(1995,NHL,CAL,82,34,11,Calgary Flames)
...
grunt> dump grp_by_team;
...
(1912,NHA,TBS,20,9,0,Toronto Blueshirts),(1916,NHA,TBS,14,7,0,Toronto Blueshirts),(1914,NHA,TBS,20,8,0,Toronto Blueshirts)})
(TO1,{(1912,NHA,TO1,20,7,0,Toronto Tecumsehs)})
(TOA,{(1917,NHL,TOA,22,13,0,Toronto Arenas),(1918,NHL,TOA,18,5,0,Toronto Arenas)})
(TOB,{(1916,NHA,TOB,14,7,0,228th Battalion)})
(TOO,{(1913,NHA,TOO,20,4,0,Toronto Ontarios),(1914,NHA,TOO,20,7,0,Toronto Ontarios/Shamrocks)})
...
I don't know what's wrong with my code.
Below are the Hadoop and Pig versions I am using:
$ pig --version
Apache Pig version 0.16.0 (r1746530)
compiled Jun 01 2016, 23:10:49
$ hadoop version
Hadoop 2.8.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 91f2b7a13d1e97be65db92ddabc627cc29ac0009
Compiled by jdu on 2017-03-17T04:12Z
Compiled with protoc 2.5.0
From source with checksum 60125541c2b3e266cbf3becc5bda666
This command was run using /usr/local/Cellar/hadoop/2.8.0/libexec/share/hadoop/common/hadoop-common-2.8.0.jar

win = foreach grp_by_team generate group, SUM(teams.g) as win;
In your code, the data type of column g is bytearray.
SUM works with the following data types: int, long, float, double, bigdecimal, biginteger, or bytearray cast as double. Here you need to cast the bytearray to a numeric type. Please refer to the Pig documentation for more information.
The schema you defined in teams = foreach teams_raw generate $0 as year:int, $1 as lgID, $2 as tmID, $8 as g:float, $9 as w:float, $11 as t:float, $18 as name; is not picked up, as your describe output shows (every field is still bytearray). So it is better to specify the schema along with the load statement.
For example: A = LOAD 'data' AS (a:chararray, b:int, c:int);
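Alternatively, keep the foreach but make the casts explicit, since an AS clause alone does not convert the underlying bytearray. A minimal sketch of the rewritten script, assuming the column positions from the question:
teams_raw = load '/usr/input/Teams.csv' using org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER');
-- explicit casts convert each bytearray field to the intended type
teams = foreach teams_raw generate (int)$0 as year, (chararray)$1 as lgID, (chararray)$2 as tmID, (float)$8 as g, (float)$9 as w, (float)$11 as t, (chararray)$18 as name;
grp_by_team = group teams by tmID;
-- SUM now receives float values instead of bytearray, so the job can succeed
win = foreach grp_by_team generate group, SUM(teams.g) as win;
dump win;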

Related

ERROR 1066: Unable to open iterator for alias ~ ERROR

The following code works fine:
S4 = FOREACH S3 GENERATE group AS page_i,
COUNT(S2) AS outlinks,
FLATTEN(S2.rank);
DUMP S4
The results look like this:
(Computer engineering,1,0.1111111111111111)
(Outline of computer science,1,0.1111111111111111)
However, when I try to create one more relation using division:
S44 = FOREACH S4 GENERATE group as page_i, outlinks/2, ...
It fails like this:
Failed Jobs:
JobId Alias Feature Message Outputs
job_local125575051_0033 S6,S66 DISTINCT Message: Job failed! Error - NA
file:/tmp/temp-847036156/tmp-1908150009,
Input(s):
Successfully read records from: "/home/song/workspace/FinalProject/output/part-m-00000"
Successfully read records from: "/home/song/workspace/FinalProject/output/part-m-00000"
Output(s):
Failed to produce result in "file:/tmp/temp-847036156/tmp-1908150009"
Job DAG:
job_local777342816_0028 -> job_local39708124_0029,
job_local39708124_0029 -> job_local495952123_0030,job_local268178801_0032,
job_local495952123_0030 -> job_local1880869927_0031,
job_local1880869927_0031 -> job_local268178801_0032,
job_local268178801_0032 -> job_local125575051_0033,
job_local125575051_0033
2015-04-29 07:45:49,133 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Some jobs have failed! Stop running all dependent jobs
2015-04-29 07:45:49,136 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias S66
Details at logfile: /home/song/workspace/FinalProject/pig_1430315000370.log
It seems that you are trying to project a field called group for each tuple in relation S4, while the schema contains only these fields:
page_i
outlinks
an unnamed field resulting from FLATTEN(S2.rank).
So I guess that all you need to do is to replace the invalid line with:
S44 = FOREACH S4 GENERATE page_i as page_i, outlinks/2, ...
Or maybe just:
S44 = FOREACH S4 GENERATE page_i, outlinks/2, ...
Hope that this was the only issue here...
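If in doubt, describing the relation shows which field names survived the first FOREACH (a quick check using the question's aliases; the exact types depend on the earlier part of the script, which isn't shown):
DESCRIBE S4;
-- expect page_i, outlinks (a long, from COUNT), and the flattened rank field, but no field named 'group'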

Stitch function in PIG

The following Pig code is not working:
grunt> Register /usr/lib/pig/lib/piggybank.jar ;
grunt> define Stitch org.apache.pig.piggybank.evaluation.Stitch();
grunt> data = load 'a' using PigStorage('|') ;
grunt> B = Stitch(data,data);
Error:
2015-01-06 12:03:57,730 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1200: <line 12> Cannot expand macro 'Stitch'.
Reason: Macro must be defined before expansion.
Details at logfile: /home/hduser/nikhil/pig_1420524859398.log
Can someone explain what's going wrong here?
There are two issues in your code:
1. You can't directly assign the output of Stitch to a relation; it should be projected as part of a FOREACH statement. Because Stitch(data,data) appears where a relational operator is expected, Pig tries to parse it as a macro invocation, hence the "Cannot expand macro 'Stitch'" message.
2. Stitch takes only bags as input parameters, but you are passing the entire relation.
Fix the above two issues and retry your script.
Example:
input:
{(a,b),(e,f)} {(c,d),(g,h)}
PigScript:
grunt> REGISTER /tmp/piggybank.jar;
grunt> DEFINE MyStitch org.apache.pig.piggybank.evaluation.Stitch();
grunt> A = LOAD 'input' USING PigStorage() AS (B1:{T:(t1:chararray,t2:chararray)},B2:{T1:(t3:chararray,t4:chararray)});
grunt> B = FOREACH A GENERATE MyStitch(B1,B2);
grunt> DUMP B;
Output:
({(a,b,c,d),(e,f,g,h)})
Reference:
http://pig.apache.org/docs/r0.13.0/api/org/apache/pig/piggybank/evaluation/Stitch.html

Pig script fails with java.io.EOFException: Unexpected end of input stream

I have a Pig script that picks up a set of fields using a regular expression and stores the data to a Hive table.
--Load data
cisoFortiGateDataAll = LOAD '/user/root/cisodata/Logs/Fortigate/ec-pix-log.20140625.gz' USING TextLoader AS (line:chararray);
--There are two types of data, filter type1 - The field dst_country seems unique there
cisoFortiGateDataType1 = FILTER cisoFortiGateDataAll BY (line matches '.*dst_country.*');
--Parse each line and pick up the required fields
cisoFortiGateDataType1Required = FOREACH cisoFortiGateDataType1 GENERATE
FLATTEN(
REGEX_EXTRACT_ALL(line, '(.*?)\\s(.*?)\\s(.*?)\\s(.*?)\\sdate=(.*?)\\s+time=(.*?)\\sdevname=(.*?)\\sdevice_id=(.*?)\\slog_id=(.*?)\\stype=(.*?)\\ssubtype=(.*?)\\spri=(.*?)\\svd=(.*?)\\ssrc=(.*?)\\ssrc_port=(.*?)\\ssrc_int=(.*?)\\sdst=(.*?)\\sdst_port=(.*?)\\sdst_int=(.*?)\\sSN=(.*?)\\sstatus=(.*?)\\spolicyid=(.*?)\\sdst_country=(.*?)\\ssrc_country=(.*?)\\s(.*?\\s.*)+')
) AS (
rmonth:charArray, rdate:charArray, rtime:charArray, ip:charArray, date:charArray, time:charArray,
devname:charArray, deviceid:charArray, logid:charArray, type:charArray, subtype:charArray,
pri:charArray, vd:charArray, src:charArray, srcport:charArray, srcint:charArray, dst:charArray,
dstport:charArray, dstint:charArray, sn:charArray, status:charArray, policyid:charArray,
dstcountry:charArray, srccountry:charArray, rest:charArray );
--Store to hive table
STORE cisoFortiGateDataType1Required INTO 'ciso_db.fortigate_type1_1_table' USING org.apache.hcatalog.pig.HCatStorer();
The script works fine on a small file but breaks with the following exception on a bigger file (750 MB). Any idea how I can debug and find the root cause?
2014-09-03 15:31:33,562 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher - java.io.EOFException: Unexpected end of input stream
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:145)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:149)
at org.apache.pig.builtin.TextLoader.getNext(TextLoader.java:58)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:211)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:533)
at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
Check the size of the text you are loading into line:chararray. If the size is greater than the HDFS block size (64 MB), then you will get an error.
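A quick way to find the maximum line length, reusing the load statement from the question (a sketch; SIZE on a chararray returns its length, and the inline GROUP ... ALL collapses everything to a single tuple). Note that if the gzip archive itself is truncated, this check will hit the same EOFException, which would point to a corrupt file rather than oversized lines:
cisoFortiGateDataAll = LOAD '/user/root/cisodata/Logs/Fortigate/ec-pix-log.20140625.gz' USING TextLoader AS (line:chararray);
-- measure each line, then take the maximum over the whole relation
lineLengths = FOREACH cisoFortiGateDataAll GENERATE SIZE(line) AS len;
maxLen = FOREACH (GROUP lineLengths ALL) GENERATE MAX(lineLengths.len);
DUMP maxLen;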

Pig process multiple file error: ERROR 0: Error while executing ForEach at []

I have 4 files A, B, C, D under the directory /user/bizlog/cpc on HDFS, and the record looks like this:
87465422^C376832^C27786^C21161214^Ckey
Here is my pig script:
cpc_all = load '/user/bizlog/cpc' using PigStorage('\u0003') as (cpcid, accountid, cpcplanid, cpcgrpid, key);
cpc = foreach cpc_all generate accountid, key;
account_group = group cpc by accountid;
account_sort = order account_group by group;
account_key = foreach account_sort generate group, BagToTuple(cpc.key);
store account_key into 'last' using PigStorage('\u0003');
It produces results such as:
376832^Ckey1^Ckey2
The above script is supposed to process all 4 files, but I get this error:
Backend error message
---------------------
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: account_key: New For Each(false,false)[bag] - scope-18 Operator Key: scope-18): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error while executing ForEach at []
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:289)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:242)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:464)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:432)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:412)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.
Pig Stack Trace
---------------
ERROR 0: Exception while executing (Name: account_key: New For Each(false,false)[bag] - scope-18 Operator Key: scope-18): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error while executing ForEach at []
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: account_key: New For Each(false,false)[bag] - scope-18 Operator Key: scope-18): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error while executing ForEach at []
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:289)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:242)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:464)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:432)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:412)
================================================================================
Oddly, if I load one single file, such as load '/user/bizlog/cpc/A', the script succeeds.
If I load each file first and then union them, it works fine too.
If I put the sort step last, the error goes away as well.
The Hadoop version is 0.20.2 and the Pig version is 0.12.1; any help will be appreciated.
As mentioned in the comments:
I put the sort step last and the error goes away
Though I did not find much on the topic, it appears that Pig does not like the grouped relation itself being rearranged. As such, the 'solution' is to order the output of what is generated for the group, instead of ordering the grouped relation itself, as the sketch below shows.
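A sketch of that reordering, using the script from the question (only the position of the order step changes):
cpc_all = load '/user/bizlog/cpc' using PigStorage('\u0003') as (cpcid, accountid, cpcplanid, cpcgrpid, key);
cpc = foreach cpc_all generate accountid, key;
account_group = group cpc by accountid;
-- generate first, then order the generated relation rather than the grouped one
account_key = foreach account_group generate group, BagToTuple(cpc.key);
account_sort = order account_key by group;
store account_sort into 'last' using PigStorage('\u0003');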

Unable to open iterator for alias while using CONCAT

I am trying to split a column off originaldata and need to join it back.
For that I created a row id along with originaldata and separated a column from originaldata, concatenating the row id to it.
originaldata = load '$input' using PigStorage('$delimiter');
rankedoriginaldata = rank originaldata;
numericdata = foreach rankedoriginaldata generate CONCAT($0,$split);
But I am not able to execute this statement:
numericdata = foreach rankedoriginaldata generate CONCAT($0,$split);
Command
pig -x local -f seperator.pig -param input=data/StringNum.csv -param output=OUT/Numericfile -param delimiter="," -param split='$3'
It shows the following stack trace:
Pig Stack Trace
---------------
ERROR 1066: Unable to open iterator for alias numericdata
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias numericdata
at org.apache.pig.PigServer.openIterator(PigServer.java:838)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:696)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:320)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:475)
at org.apache.pig.Main.main(Main.java:157)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: java.io.IOException: Job terminated with anomalous status FAILED
at org.apache.pig.PigServer.openIterator(PigServer.java:830)
... 12 more
================================================================================
But when I did
numericdata = foreach originaldata generate CONCAT($0,$split);
I get the expected output.
Doubt: while loading data, does the order of the fields within a tuple change?
If we load data such as
1,4,6
3,8,9
2,4,5
how will it be ordered? Does it shuffle the fields, like
1,6,4
8,9,3
...
Try casting your arguments for CONCAT to chararray first:
numericdata = foreach originaldata generate CONCAT((chararray)$0,(chararray)$split);
I think the cast is necessary because CONCAT expects two chararrays. RANK however produces a Long (which you pass as $0 to CONCAT).
Concerning your doubt: the order of fields within your tuples is not going to change. The order of tuples in the relation may change, however.
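Putting the casts back into the ranked pipeline from the question (a sketch; $split and the other parameters are still passed on the command line as before):
originaldata = load '$input' using PigStorage('$delimiter');
rankedoriginaldata = rank originaldata;
-- RANK prepends a long rank field as $0; cast both arguments so CONCAT sees chararrays
numericdata = foreach rankedoriginaldata generate CONCAT((chararray)$0, (chararray)$split);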