Reading Multiple Orc Files in Pig - apache-pig

I am trying to read/load multiple ORC files present in a directory using Pig's OrcStorage(). I tried to use the glob technique, but it was not working for me: it throws an error saying the file does not exist, whereas it is available. Please let me know how I can implement this functionality in Pig.
Sample Files Used:
hadoop fs -ls /sandbox/sandbox28/pig_demo/input/ORC/data_dt={2015111900,2015111901}
Found 2 items
-rw-r--r-- 3 as303e hdfs 302986 2015-11-19 05:12 /sandbox/sandbox28/pig_demo/input/ORC/data_dt=2015111900/000000_0
-rw-r--r-- 3 as303e hdfs 302986 2015-11-19 05:12 /sandbox/sandbox28/pig_demo/input/ORC/data_dt=2015111900/000001_0
Found 2 items
-rw-r--r-- 3 as303e ksndbx28 302986 2015-11-25 04:34 /sandbox/sandbox28/pig_demo/input/ORC/data_dt=2015111901/000000_0
-rw-r--r-- 3 as303e ksndbx28 302986 2015-11-25 04:34 /sandbox/sandbox28/pig_demo/input/ORC/data_dt=2015111901/000001_0
Code Used:
A = LOAD '/sandbox/sandbox28/pig_demo/input/ORC/data_dt={2015111900,2015111901}' USING OrcStorage();
B = LIMIT A 2;
DUMP B;
Error log:
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: B: Store(hdfs://localhost:8020/tmp/temp666047359/tmp808921130:org.apache.pig.impl.io.InterStorage) - scope-5 Operator Key: scope-5): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: B: Limit - scope-4 Operator Key: scope-4): org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to setup the load function.
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:316)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getNextTuple(POStore.java:159)
at org.apache.pig.backend.hadoop.executionengine.fetch.FetchLauncher.runPipeline(FetchLauncher.java:161)
at org.apache.pig.backend.hadoop.executionengine.fetch.FetchLauncher.launchPig(FetchLauncher.java:81)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:278)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1390)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1375)
at org.apache.pig.PigServer.storeEx(PigServer.java:1034)
... 15 more
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: B: Limit - scope-4 Operator Key: scope-4): org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to setup the load function.
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:316)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLimit.getNextTuple(POLimit.java:122)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:307)
... 22 more
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to setup the load function.
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.getNextTuple(POLoad.java:131)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:307)
... 24 more
Caused by: org.apache.hadoop.mapred.InvalidInputException: File does not exist: hdfs://localhost:8020/sandbox/sandbox28/pig_demo/input/ORC/data_dt={2015111900,2015111901}
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:961)
at org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat.getSplits(OrcNewInputFormat.java:121)
at org.apache.pig.impl.io.ReadToEndLoader.init(ReadToEndLoader.java:190)
at org.apache.pig.impl.io.ReadToEndLoader.<init>(ReadToEndLoader.java:146)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.setUp(POLoad.java:99)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.getNextTuple(POLoad.java:127)
... 25 more
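The stack trace shows OrcInputFormat.generateSplitsInfo failing on the literal brace pattern, which suggests that the ORC split generation does not expand the glob. One untested workaround sketch is to load each partition directory explicitly (paths taken from the listing above) and combine them with UNION:
A0 = LOAD '/sandbox/sandbox28/pig_demo/input/ORC/data_dt=2015111900' USING OrcStorage();
A1 = LOAD '/sandbox/sandbox28/pig_demo/input/ORC/data_dt=2015111901' USING OrcStorage();
A = UNION A0, A1; -- combine both partitions into one relation
B = LIMIT A 2;
DUMP B;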

Related

Avro : java.lang.RuntimeException: Unsupported type in record

Input: test.csv
100
101
102
Pig Script:
-- required jars are registered here with REGISTER
A = LOAD 'test.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS (code:chararray);
STORE A INTO 'test' USING org.apache.pig.piggybank.storage.avro.AvroStorage
('schema',
'{"namespace":"com.pig.test.avro","type":"record","name":"Avro_Test","doc":"Avro Test Schema",
"fields":[
{"name":"code","type":["string","null"],"default":null}
]}'
);
I am getting a runtime error during the STORE. Any inputs on resolving this?
Error Log:
ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: org.apache.avro.file.DataFileWriter$AppendWriteException: java.lang.RuntimeException: Unsupported type in record:class java.lang.String
at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:263)
at org.apache.pig.piggybank.storage.avro.PigAvroRecordWriter.write(PigAvroRecordWriter.java:49)
at org.apache.pig.piggybank.storage.avro.AvroStorage.putNext(AvroStorage.java:722)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:558)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:106)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.collect(PigMap
2015-06-02 23:06:03,934 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2015-06-02 23:06:03,934 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
Looks like this is a bug: https://issues.apache.org/jira/browse/PIG-3358
If you can, try to update to Pig 0.14; according to the comments on that ticket, this has been fixed.
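If an upgrade is possible, one alternative (my suggestion, not part of the original answer) is the AvroStorage bundled with Pig itself from 0.14 onward, which can take the Avro schema JSON directly as its argument:
-- assumes Pig 0.14+; org.apache.pig.builtin.AvroStorage ships with Pig, no piggybank jar needed
STORE A INTO 'test' USING org.apache.pig.builtin.AvroStorage('{"namespace":"com.pig.test.avro","type":"record","name":"Avro_Test","doc":"Avro Test Schema","fields":[{"name":"code","type":["string","null"],"default":null}]}');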

java.io.IOException: Mkdirs failed to create file:/pig/deidentifiedDir/

When I try to store data using the Pig command
STORE D INTO '/deidentifiedDir';
I get this error:
2015-05-06 20:17:14,587 [Thread-96] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0009
java.io.IOException: Mkdirs failed to create file:/pig/deidentifiedDir/_temporary/_attempt_local_0009_m_000000_0
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:366)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:484)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:465)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:372)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextOutputFormat.getRecordWriter(PigTextOutputFormat.java:98)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.getRecordWriter(PigOutputFormat.java:83)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:488)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:610)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
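The file:/ scheme and LocalJobRunner in the log suggest that Pig is running in local mode, so '/deidentifiedDir' resolves against the local filesystem root, where an ordinary user usually cannot create directories. Two hedged things to try (assumptions based on that reading of the log, not confirmed in the post):
-- store to a location the local user can actually write to:
STORE D INTO '/tmp/deidentifiedDir';
-- or launch Pig against the cluster instead of local mode:
-- pig -x mapreduce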

ERROR 1066: Unable to open iterator for alias ~ ERROR

The following code works fine:
S4 = FOREACH S3 GENERATE group AS page_i,
COUNT(S2) AS outlinks,
FLATTEN(S2.rank);
DUMP S4;
The results look like this:
(Computer engineering,1,0.1111111111111111)
(Outline of computer science,1,0.1111111111111111)
However, when I try to create one more relation using a divide:
S44 = FOREACH S4 GENERATE group as page_i, outlinks/2, ...
it fails like this:
Failed Jobs:
JobId Alias Feature Message Outputs
job_local125575051_0033 S6,S66 DISTINCT Message: Job failed! Error - NA
file:/tmp/temp-847036156/tmp-1908150009,
Input(s):
Successfully read records from: "/home/song/workspace/FinalProject/output/part-m-00000"
Successfully read records from: "/home/song/workspace/FinalProject/output/part-m-00000"
Output(s):
Failed to produce result in "file:/tmp/temp-847036156/tmp-1908150009"
Job DAG:
job_local777342816_0028 -> job_local39708124_0029,
job_local39708124_0029 -> job_local495952123_0030,job_local268178801_0032,
job_local495952123_0030 -> job_local1880869927_0031,
job_local1880869927_0031 -> job_local268178801_0032,
job_local268178801_0032 -> job_local125575051_0033,
job_local125575051_0033
2015-04-29 07:45:49,133 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Some jobs have failed! Stop running all dependent jobs
2015-04-29 07:45:49,136 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias S66
Details at logfile: /home/song/workspace/FinalProject/pig_1430315000370.log
It seems that you are trying to project a field called group from each tuple in relation S4, while its schema contains only these fields:
page_i
outlinks
an unnamed field resulting from FLATTEN(S2.rank).
So I guess that all you need to do is replace the invalid line with:
S44 = FOREACH S4 GENERATE page_i as page_i, outlinks/2, ...
Or maybe just:
S44 = FOREACH S4 GENERATE page_i, outlinks/2, ...
Hope that this was the only issue here...
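As a quick sanity check (a standard Pig diagnostic, not part of the original answer), DESCRIBE prints a relation's schema, so you can see exactly which field names exist in S4:
DESCRIBE S4;
-- prints S4's fields, e.g. page_i, outlinks, and the unnamed flattened rank column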

Pig process multiple file error: ERROR 0: Error while executing ForEach at []

I have 4 files A, B, C, D under the directory /user/bizlog/cpc on HDFS, and the records look like this:
87465422^C376832^C27786^C21161214^Ckey
Here is my pig script:
cpc_all = load '/user/bizlog/cpc' using PigStorage('\u0003') as (cpcid, accountid, cpcplanid, cpcgrpid, key);
cpc = foreach cpc_all generate accountid, key;
account_group = group cpc by accountid;
account_sort = order account_group by group;
account_key = foreach account_sort generate group, BagToTuple(cpc.key);
store account_key into 'last' using PigStorage('\u0003');
It should produce results such as:
376832^Ckey1^Ckey2
The above script is supposed to process all 4 files, but I get this error:
Backend error message
---------------------
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: account_key: New For Each(false,false)[bag] - scope-18 Operator Key: scope-18): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error while executing ForEach at []
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:289)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:242)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:464)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:432)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:412)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.
Pig Stack Trace
---------------
ERROR 0: Exception while executing (Name: account_key: New For Each(false,false)[bag] - scope-18 Operator Key: scope-18): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error while executing ForEach at []
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: account_key: New For Each(false,false)[bag] - scope-18 Operator Key: scope-18): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error while executing ForEach at []
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:289)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:242)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:464)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:432)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:412)
================================================================================
Oddly, if I load one single file, such as load '/user/bizlog/cpc/A', then the script succeeds.
If I load each file first and then union them, it works fine too.
If I put the sort step last, the error goes away.
The Hadoop version is 0.20.2 and the Pig version is 0.12.1; any help will be appreciated.
As mentioned in the comments:
I put the sort step at the last and the error goes away
Though I did not find much on the topic, it appears that Pig does not like to rearrange the grouped relation itself.
As such, the 'solution' is to rearrange the output of what is generated for the group, instead of ordering the grouped relation itself, as sketched below.
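Following that comment, a sketch of the reordered script (the same statements as in the question, with the ORDER moved after the FOREACH that flattens the group):
cpc_all = load '/user/bizlog/cpc' using PigStorage('\u0003') as (cpcid, accountid, cpcplanid, cpcgrpid, key);
cpc = foreach cpc_all generate accountid, key;
account_group = group cpc by accountid;
-- generate first, then order the generated output rather than the grouped relation
account_key = foreach account_group generate group, BagToTuple(cpc.key);
account_sort = order account_key by group;
store account_sort into 'last' using PigStorage('\u0003');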

pig file load error

I am trying to run this command in the Pig (grunt) environment:
grunt> A = LOAD inp;
But I am getting this error in the log files:
Pig Stack Trace:
ERROR 1200: mismatched input 'inp' expecting QUOTEDSTRING
Failed to parse: mismatched input 'inp' expecting QUOTEDSTRING
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:226)
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:168)
at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1565)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1538)
at org.apache.pig.PigServer.registerQuery(PigServer.java:540)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:970)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:490)
at org.apache.pig.Main.main(Main.java:111)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
And on the console I am getting this:
grunt> A = LOAD inp;
2012-10-26 12:18:34,627 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: mismatched input 'inp' expecting QUOTEDSTRING
Details at logfile: /usr/local/hadoop/pig_1351232517175.log
Can anybody provide an appropriate solution for this?
The LOAD syntax has been used incorrectly. Check out the correct example provided here:
http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#LOAD
Suppose we have a data file called myfile.txt. The fields are tab-delimited. The records are newline-separated.
1 2 3
4 2 1
8 3 4
In this example the default load function, PigStorage, loads data from myfile.txt to form relation A. The two LOAD statements are equivalent. Note that, because no schema is specified, the fields are not named and all fields default to type bytearray.
A = LOAD 'myfile.txt';
A = LOAD 'myfile.txt' USING PigStorage('\t');
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
Example from http://pig.apache.org/docs
I believe the error log is self-explanatory: it says expecting QUOTEDSTRING.
Please put the file name in single quotes to solve this issue.
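In other words, assuming inp is meant to name a file called inp in the working directory:
grunt> A = LOAD 'inp'; -- quoting the path gives the parser the QUOTEDSTRING it expects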