I have a table and I want to query the sum of one of its columns. Below is the detailed table information:
grunt>teams_raw = load '/usr/input/Teams.csv' using org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER');
grunt>teams = foreach teams_raw generate $0 as year:int, $1 as lgID, $2 as tmID, $8 as g:float, $9 as w:float, $11 as t:float, $18 as name;
grunt> describe teams
teams: {year: bytearray,lgID: bytearray,tmID: bytearray,g: bytearray,w: bytearray,t: bytearray,name: bytearray};
grunt> grp_by_team = group teams by tmID;
I get the error below when trying to compute the sum of g from the teams table:
grunt> win = foreach grp_by_team generate group, SUM(teams.g) as win;
grunt>DUMP win
17/05/06 15:32:14 ERROR mapreduce.MRPigStatsUtil: 1 map reduce job(s) failed!
17/05/06 15:32:14 ERROR grunt.Grunt: ERROR 1066: Unable to open iterator for alias win
Details at logfile: /Users/joey/dev/bigdata/pig_1494048371690.log
In the log file, I see the exception below.
================================================================================
Pig Stack Trace
---------------
ERROR 1066: Unable to open iterator for alias win
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias win
at org.apache.pig.PigServer.openIterator(PigServer.java:1019)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:747)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:376)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:231)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:206)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
at org.apache.pig.Main.run(Main.java:564)
at org.apache.pig.Main.main(Main.java:176)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
Caused by: java.io.IOException: Job terminated with anomalous status FAILED
at org.apache.pig.PigServer.openIterator(PigServer.java:1011)
... 13 more
================================================================================
Below is the dumped data of teams and grp_by_team:
grunt>dump teams
...
(1994,NHL,TBL,48,17,3,Tampa Bay Lightning)
(1994,NHL,TOR,48,21,8,Toronto Maple Leafs)
(1994,NHL,VAN,48,18,12,Vancouver Canucks)
(1994,NHL,WAS,48,22,8,Washington Capitals)
(1994,NHL,WIN,48,16,7,Winnipeg Jets)
(1995,NHL,ANA,82,35,8,Mighty Ducks of Anaheim)
(1995,NHL,BOS,82,40,11,Boston Bruins)
(1995,NHL,BUF,82,33,7,Buffalo Sabres)
(1995,NHL,CAL,82,34,11,Calgary Flames)
...
grunt>dump grp_by_team
...
(1912,NHA,TBS,20,9,0,Toronto Blueshirts),(1916,NHA,TBS,14,7,0,Toronto Blueshirts),(1914,NHA,TBS,20,8,0,Toronto Blueshirts)})
(TO1,{(1912,NHA,TO1,20,7,0,Toronto Tecumsehs)})
(TOA,{(1917,NHL,TOA,22,13,0,Toronto Arenas),(1918,NHL,TOA,18,5,0,Toronto Arenas)})
(TOB,{(1916,NHA,TOB,14,7,0,228th Battalion)})
(TOO,{(1913,NHA,TOO,20,4,0,Toronto Ontarios),(1914,NHA,TOO,20,7,0,Toronto Ontarios/Shamrocks)})
...
I don't know what's wrong with my code.
Below are the Hadoop and Pig versions I am using:
$ pig --version
Apache Pig version 0.16.0 (r1746530)
compiled Jun 01 2016, 23:10:49
$ hadoop version
Hadoop 2.8.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 91f2b7a13d1e97be65db92ddabc627cc29ac0009
Compiled by jdu on 2017-03-17T04:12Z
Compiled with protoc 2.5.0
From source with checksum 60125541c2b3e266cbf3becc5bda666
This command was run using /usr/local/Cellar/hadoop/2.8.0/libexec/share/hadoop/common/hadoop-common-2.8.0.jar
win = foreach grp_by_team generate group, SUM(teams.g) as win;
In your code, column g's data type is bytearray.
SUM works with the following data types: int, long, float, double, bigdecimal, biginteger, or a bytearray cast as double. Here you need to cast the bytearray as double; please refer to the Pig documentation for more information.
The schema you defined in grunt>teams = foreach teams_raw generate $0 as year:int, $1 as lgID, $2 as tmID, $8 as g:float, $9 as w:float, $11 as t:float, $18 as name; is not picked up, so it is better to specify the schema along with the load statement.
For example: A = LOAD 'data' AS (a:chararray, b:int, c:int);
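Since the question doesn't list all of the file's columns, here is a sketch of the fix using explicit casts in the FOREACH instead of a full LOAD schema; in the question's session the `as g:float` annotation did not take effect (describe still shows bytearray), whereas an explicit cast does convert the field:

```pig
-- Cast the bytearrays explicitly; describe teams will then show real types
teams = foreach teams_raw generate (int)$0 as year, $1 as lgID, $2 as tmID,
        (float)$8 as g, (float)$9 as w, (float)$11 as t, $18 as name;
grp_by_team = group teams by tmID;
win = foreach grp_by_team generate group, SUM(teams.g) as win;
```

With g typed as float, SUM no longer needs a cast to double.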
I am trying to extract pipe-delimited data in Pig. Following is my command:
L = LOAD 'entirepath_in_HDFS/b.txt/part-m*' USING PigStorage('||');
I am getting the following error:
2016-08-04 23:58:21,122 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
<line 1, column 4> pig script failed to validate: java.lang.RuntimeException: could not instantiate 'PigStorage' with arguments '[||]'
My input sample file has exactly 5 lines as following
POS_TIBCO||HDFS||POS_LOG||1||7806||2016-07-18||1||993||0
POS_TIBCO||HDFS||POS_LOG||2||7806||2016-07-18||1||0||0
POS_TIBCO||HDFS||POS_LOG||3||7806||2016-07-18||1||0||5
POS_TIBCO||HDFS||POS_LOG||4||7806||2016-07-18||1||0||0
POS_TIBCO||HDFS||POS_LOG||5||7806||2016-07-18||1||0||19.99
I tried several options, like putting a backslash before the delimiter (\||, \|\|), but everything failed. I also tried with a schema and got the same error. I am using Hortonworks (HDP 2.2.4) and Pig (0.14.0).
Any help is appreciated. Please let me know if you need any further details.
I have faced this case, and from checking the PigStorage source code, I believe its delimiter argument must be a single character.
So we can use this code instead:
L0 = LOAD 'entirepath_in_HDFS/b.txt/part-m*' USING PigStorage('|');
L = FOREACH L0 GENERATE $0,$2,$4,$6,$8,$10,$12,$14,$16;
Splitting on a single '|' turns each '||' into an empty field, so the real columns land at the even positions $0, $2, $4, and so on. This helps if you know how many columns you have, and it does not hurt performance because the projection happens on the map side.
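To see why the data sits at every other position, split a sample row from the question on a single '|' (a quick illustration outside Pig; awk numbers fields from 1, so awk's $1, $3, $5 correspond to Pig's $0, $2, $4):

```shell
# Each '||' becomes an empty field when split on a single '|',
# leaving the real values at every other position.
echo 'POS_TIBCO||HDFS||POS_LOG||1||7806' | awk -F'|' '{print $1, $3, $5, $7, $9}'
# prints: POS_TIBCO HDFS POS_LOG 1 7806
```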
When you load data using PigStorage, it expects only a single character as the delimiter.
However, if you still want to split on the two-character delimiter, you can use MyRegExLoader, which takes a regular expression whose capture groups become the fields:
REGISTER '/path/to/piggybank.jar';
A = LOAD '/path/to/dataset' USING org.apache.pig.piggybank.storage.MyRegExLoader('(.*?)\\|\\|(.*?)\\|\\|(.*)')
as (movieid:int, title:chararray, genre:chararray);
I have a Pig script that picks up a set of fields using a regular expression and stores the data to a Hive table.
--Load data
cisoFortiGateDataAll = LOAD '/user/root/cisodata/Logs/Fortigate/ec-pix-log.20140625.gz' USING TextLoader AS (line:chararray);
--There are two types of data, filter type1 - The field dst_country seems unique there
cisoFortiGateDataType1 = FILTER cisoFortiGateDataAll BY (line matches '.*dst_country.*');
--Parse each line and pick up the required fields
cisoFortiGateDataType1Required = FOREACH cisoFortiGateDataType1 GENERATE
FLATTEN(
REGEX_EXTRACT_ALL(line, '(.*?)\\s(.*?)\\s(.*?)\\s(.*?)\\sdate=(.*?)\\s+time=(.*?)\\sdevname=(.*?)\\sdevice_id=(.*?)\\slog_id=(.*?)\\stype=(.*?)\\ssubtype=(.*?)\\spri=(.*?)\\svd=(.*?)\\ssrc=(.*?)\\ssrc_port=(.*?)\\ssrc_int=(.*?)\\sdst=(.*?)\\sdst_port=(.*?)\\sdst_int=(.*?)\\sSN=(.*?)\\sstatus=(.*?)\\spolicyid=(.*?)\\sdst_country=(.*?)\\ssrc_country=(.*?)\\s(.*?\\s.*)+')
) AS (
rmonth:charArray, rdate:charArray, rtime:charArray, ip:charArray, date:charArray, time:charArray,
devname:charArray, deviceid:charArray, logid:charArray, type:charArray, subtype:charArray,
pri:charArray, vd:charArray, src:charArray, srcport:charArray, srcint:charArray, dst:charArray,
dstport:charArray, dstint:charArray, sn:charArray, status:charArray, policyid:charArray,
dstcountry:charArray, srccountry:charArray, rest:charArray );
--Store to hive table
STORE cisoFortiGateDataType1Required INTO 'ciso_db.fortigate_type1_1_table' USING org.apache.hcatalog.pig.HCatStorer();
The script works fine on a small file but breaks with the following exception on a bigger file (750 MB). Any idea how I can debug this and find the root cause?
2014-09-03 15:31:33,562 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher - java.io.EOFException: Unexpected end of input stream
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:145)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:149)
at org.apache.pig.builtin.TextLoader.getNext(TextLoader.java:58)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:211)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:533)
at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
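One preliminary check worth making (an assumption based on the stack trace, not something from the original thread): an EOFException from DecompressorStream is the error a truncated or corrupt gzip archive produces, and the symptom is easy to reproduce locally:

```shell
# An intact archive passes gzip's integrity test; a truncated copy fails
# with the same "unexpected end of input" condition Hadoop reports.
printf 'line1\nline2\n' | gzip -c > /tmp/sample.gz
gunzip -t /tmp/sample.gz && echo "intact: OK"
head -c 12 /tmp/sample.gz > /tmp/truncated.gz   # cut the archive short
gunzip -t /tmp/truncated.gz 2>/dev/null || echo "truncated: FAILS"
```

If `gunzip -t` fails on a local copy of the real .gz file, re-upload it to HDFS before changing the script.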
Check the size of the text you are loading into line:chararray. If the size is greater than the HDFS block size (64 MB), you will get an error.
I have 4 files A, B, C, D under the directory /user/bizlog/cpc on HDFS, and the record looks like this:
87465422^C376832^C27786^C21161214^Ckey
Here is my pig script:
cpc_all = load '/user/bizlog/cpc' using PigStorage('\u0003') as (cpcid, accountid, cpcplanid, cpcgrpid, key);
cpc = foreach cpc_all generate accountid, key;
account_group = group cpc by accountid;
account_sort = order account_group by group;
account_key = foreach account_sort generate group, BagToTuple(cpc.key);
store account_key into 'last' using PigStorage('\u0003');
It produces results such as:
376832^Ckey1^Ckey2
The script above is supposed to process all 4 files, but I get this error:
Backend error message
---------------------
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: account_key: New For Each(false,false)[bag] - scope-18 Operator Key: scope-18): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error while executing ForEach at []
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:289)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:242)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:464)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:432)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:412)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.
Pig Stack Trace
---------------
ERROR 0: Exception while executing (Name: account_key: New For Each(false,false)[bag] - scope-18 Operator Key: scope-18): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error while executing ForEach at []
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: account_key: New For Each(false,false)[bag] - scope-18 Operator Key: scope-18): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error while executing ForEach at []
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:289)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:242)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:464)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:432)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:412)
================================================================================
Oddly, if I load one single file, such as load '/user/bizlog/cpc/A', then the script succeeds.
If I load each file separately and then union them, it works fine too.
If I put the sort step last, the error also goes away.
The Hadoop version is 0.20.2 and the Pig version is 0.12.1; any help will be appreciated.
As mentioned in the comments:
I put the sort step last and the error goes away
Though I did not find much on the topic, it appears that Pig does not like to reorder the grouped relation itself.
As such, the 'solution' is to rearrange the output generated for each group, instead of ordering the grouped relation.
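Concretely, the reordered script from the question would look like this (a sketch of the workaround, moving ORDER BY after the FOREACH):

```pig
cpc_all = load '/user/bizlog/cpc' using PigStorage('\u0003') as (cpcid, accountid, cpcplanid, cpcgrpid, key);
cpc = foreach cpc_all generate accountid, key;
account_group = group cpc by accountid;
-- generate first, then sort the generated relation rather than the grouped one
account_key = foreach account_group generate group, BagToTuple(cpc.key);
account_sort = order account_key by group;
store account_sort into 'last' using PigStorage('\u0003');
```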
I am trying to run this command in the Pig environment:
grunt> A = LOAD inp;
But I am getting this error in the log files:
Pig Stack Trace:
ERROR 1200: mismatched input 'inp' expecting QUOTEDSTRING
Failed to parse: mismatched input 'inp' expecting QUOTEDSTRING
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:226)
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:168)
at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1565)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1538)
at org.apache.pig.PigServer.registerQuery(PigServer.java:540)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:970)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:490)
at org.apache.pig.Main.main(Main.java:111)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
And in the console I am getting this:
grunt> A = LOAD inp;
2012-10-26 12:18:34,627 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: mismatched input 'inp' expecting QUOTEDSTRING
Details at logfile: /usr/local/hadoop/pig_1351232517175.log
Can anybody provide an appropriate solution for this?
The syntax for LOAD has been used incorrectly. Check out the correct example provided here:
http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#LOAD
Suppose we have a data file called myfile.txt. The fields are tab-delimited. The records are newline-separated.
1 2 3
4 2 1
8 3 4
In this example the default load function, PigStorage, loads data from myfile.txt to form relation A. The two LOAD statements are equivalent. Note that, because no schema is specified, the fields are not named and all fields default to type bytearray.
A = LOAD 'myfile.txt';
A = LOAD 'myfile.txt' USING PigStorage('\t');
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
Example from http://pig.apache.org/docs
I believe the error log is self-explanatory; it says: expecting QUOTEDSTRING.
Put the file name in single quotes to solve the issue.
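Applied to the statement from the question, that means quoting the path:

```pig
A = LOAD 'inp';  -- the path must be a QUOTEDSTRING
```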