Unable to select count of rows of an ORC table through Hive Beeline command - hive

I am using the following components: Hadoop 3.1.4, Hive 3.1.3, and Tez 0.9.2.
I am trying to get the row count of an ORC table with select count(*) from ORC_TABLE, and this throws the set of exceptions below:
Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1670915386694_0182_1_00, diagnostics=[Vertex vertex_1670915386694_0182_1_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: jio_ar_consumer_events initializer failed, vertex=vertex_1670915386694_0182_1_00 [Map 1], java.lang.RuntimeException: ORC split generation failed with exception: java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1851)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1939)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:519)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:765)
at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:243)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
Caused by: java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1790)
... 17 more
Caused by: java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
at org.apache.hadoop.hive.ql.io.AcidUtils.lambda$getAcidState$0(AcidUtils.java:1117)
at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
at java.util.TimSort.sort(TimSort.java:220)
at java.util.Arrays.sort(Arrays.java:1512)
at java.util.ArrayList.sort(ArrayList.java:1464)
at java.util.Collections.sort(Collections.java:177)
at org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:1115)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.callInternal(OrcInputFormat.java:1207)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.access$1500(OrcInputFormat.java:1142)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator$1.run(OrcInputFormat.java:1179)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator$1.run(OrcInputFormat.java:1176)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:1176)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:1142)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
The same problem is described in another question, ORC Split Generation issue with Hive Table, but there is no solution there yet.
I also tried running CONCATENATE on the ORC table, but that didn't help either.
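For reference, the compaction attempt looked roughly like this (the JDBC URL is a placeholder):
# merge the table's small ORC files via Beeline; this did not fix the count query
beeline -u "jdbc:hive2://<host>:10000" -e "ALTER TABLE ORC_TABLE CONCATENATE;"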
What does work is select * from ORC_TABLE, with or without LIMIT; that extracts the records fine. I reckon the issue must be limited to aggregate functions, or maybe I don't understand it yet.
I am also using Spark 3.3.1, and through Spark SQL I can get the same count and fetch the rows as well. No issues on the Spark front.
Adding to that, when I change the execution engine to MR, the query works. It fails only when I run it on the Tez engine.
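For reference, switching the engine for the session looks roughly like this (again, the JDBC URL is a placeholder):
# the same count succeeds when the session runs on MapReduce instead of Tez
beeline -u "jdbc:hive2://<host>:10000" -e "SET hive.execution.engine=mr; SELECT COUNT(*) FROM ORC_TABLE;"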
Any leads to resolve this issue are much appreciated.

The issue was resolved by the steps below, based on my earlier analysis:
The class org.apache.hadoop.fs.FileStatus comes as part of the hadoop-common jar.
We were using Hadoop 3.1.4 and Tez 0.9.2.
Tez 0.9.2 ships a tez.tar.gz that has to be placed on an HDFS location. That tez.tar.gz contains hadoop-common-2.7.2.jar, which does not have the compareTo method named in the NoSuchMethodError above.
Solution 1:
We extracted tez.tar.gz and replaced all Hadoop 2.7.2 jars with their Hadoop 3.1.4 equivalents. Do this if you don't want to reconfigure with a new Tez version; otherwise follow Solution 2 below.
We then recreated the tarball and placed it in all dependent locations, including HDFS. For us that was /user/tez/share/tez.tar.gz; the exact path depends on your setup.
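A rough shell sketch of the repackaging, assuming the usual tez.tar.gz layout with dependency jars under lib/ (jar paths are illustrative; adjust them to your installation):
# unpack the Tez archive that ships with 0.9.2
mkdir tez-repack && tar -xzf tez.tar.gz -C tez-repack
# drop the bundled Hadoop 2.7.2 jars and copy in their 3.1.4 equivalents
rm tez-repack/lib/hadoop-*2.7.2*.jar
cp ${HADOOP_HOME}/share/hadoop/common/hadoop-common-3.1.4.jar tez-repack/lib/
# ...repeat for the other hadoop-* jars that were removed
# rebuild the archive and push it to the location Tez reads it from
(cd tez-repack && tar -czf ../tez-hadoop3.tar.gz .)
hdfs dfs -put -f tez-hadoop3.tar.gz /user/tez/share/tez.tar.gz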
The error disappeared after I followed these steps, and I am now able to count records on any table.
Solution 2:
The easier alternative is to use a Tez 0.10.x release, which bundles Hadoop 3.x libraries, instead of Tez 0.9.2, which is built against Hadoop 2.7.x.
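A minimal sketch of that route, assuming the standard Apache Tez binary distribution layout (version number and paths are illustrative):
# upload the Hadoop-3-compatible Tez archive to HDFS
hdfs dfs -put -f apache-tez-0.10.2-bin/share/tez.tar.gz /user/tez/share/tez.tar.gz
# then point tez.lib.uris in tez-site.xml at it, e.g.
#   <name>tez.lib.uris</name>
#   <value>${fs.defaultFS}/user/tez/share/tez.tar.gz</value>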

Related

Getting FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask exception while access Hive views

I am trying to access views in Hive and am getting the following exception:
Getting log thread is interrupted, since query is done!
Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2)
java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
at org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:349)
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:251)
at org.apache.hive.beeline.Commands.executeInternal(Commands.java:988)
at org.apache.hive.beeline.Commands.execute(Commands.java:1160)
at org.apache.hive.beeline.Commands.sql(Commands.java:1074)
at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:1145)
at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:976)
at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:886)
at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:502)
at org.apache.hive.beeline.BeeLine.main(BeeLine.java:485)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Here is my hive query:
select * from sample_view;
I have added the SPARK_HOME/jars path in $HIVE_HOME/bin/hive like this:
# append every Spark jar to Hive's classpath
for f in ${SPARK_HOME}/jars/*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done
I have tried setting hive.execution.engine to both mr and spark, but no luck.
Please help me out.
TIA
When I have seen this, it has been for a few different reasons; it can be a red-herring error that lumps several failures together. Without seeing the table DDL or the executor logs, this is the best answer I can offer.
(1) Java error: navigate to the YARN logs for this job instance and read the executor logs (see the sketch after this list). If it broke because of a relatively rare error, you will find it there. Good luck; this can be painful.
(2) A background service is misbehaving: restart the Hadoop and Hive services and rerun the command.
(3) Try to read the underlying data from another process. This will show whether the data doesn't match the DDL or is corrupted.
(4) Repair and invalidate the table:
msck repair table <table-name>
invalidate metadata <table-name>
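For step (1), a minimal sketch of pulling the executor logs, assuming YARN log aggregation is enabled (the application id is a placeholder):
# find the failed query's application id, then fetch its aggregated logs
yarn application -list -appStates FAILED,KILLED | grep -i hive
APP_ID=application_0000000000000_0000   # replace with the id found above
yarn logs -applicationId "$APP_ID" > "${APP_ID}.log"
grep -inE "exception|error" "${APP_ID}.log" | head -50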
Good luck.

Hadoop replication Issue during hive inserts

I've been getting this error over and over when running a test that inserts a batch of records into a table using Hive. The nature of the error makes it look like a system issue rather than something in my code, but I can't figure out what it might be. The code does about 100 individual inserts into a table via Hive JDBC.
org.apache.ibatis.exceptions.PersistenceException:
### Error updating database. Cause: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException [Error 10293]: Unable to create temp file for insert values File /tmp/hive/hive/4a7308b4-34e5-4fee-91c8-3e8e946ebfd5/_tmp_space.db/Values__Tmp__Table__242/data_file could only be replicated to 0 nodes instead of minReplication (=1). There are 4 datanode(s) running and no node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1641)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3198)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3122)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:843)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:500)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)
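Since the message points at HDFS block placement rather than Hive itself, a minimal diagnostic sketch (standard HDFS commands, not from the original question) is to check that the datanodes are live and have free space, and that the Hive scratch directory is healthy:
# live/dead datanodes and remaining capacity per node
hdfs dfsadmin -report | head -60
# health of the scratch space Hive writes temp files to
hdfs fsck /tmp/hive -files -blocks | tail -20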

Why does my Dataflow output "timeout value is negative" on insertion to BigQuery?

I have a Dataflow job consisting of ReadSource, ParDo, Windowing, Insert (into a date-partitioned table in BigQuery).
It basically:
Reads text files from a Google Storage bucket using a glob
Processes each line by splitting on a delimiter and changing some values, then gives each column a name and data type and outputs it as a BigQuery table row together with a timestamp derived from the data
Windows into daily windows using the timestamp from step 2
Writes to BigQuery, using the window table and the "dataset$datepartition" syntax to specify the table and partition, with the create disposition set to CREATE_IF_NEEDED and the write disposition set to WRITE_APPEND
The first three steps seem to run fine, but in most cases the job runs into problems on the last insert step, which produces exceptions in the log:
java.lang.IllegalArgumentException: timeout value is negative at java.lang.Thread.sleep(Native Method)
at com.google.cloud.dataflow.sdk.util.BigQueryTableInserter.insertAll(BigQueryTableInserter.java:287)
at com.google.cloud.dataflow.sdk.io.BigQueryIO$StreamingWriteFn.flushRows(BigQueryIO.java:2446)
at com.google.cloud.dataflow.sdk.io.BigQueryIO$StreamingWriteFn.finishBundle(BigQueryIO.java:2404)
at com.google.cloud.dataflow.sdk.util.DoFnRunnerBase.finishBundle(DoFnRunnerBase.java:158)
at com.google.cloud.dataflow.sdk.runners.worker.SimpleParDoFn.finishBundle(SimpleParDoFn.java:196)
at com.google.cloud.dataflow.sdk.runners.worker.ForwardingParDoFn.finishBundle(ForwardingParDoFn.java:47)
at com.google.cloud.dataflow.sdk.util.common.worker.ParDoOperation.finish(ParDoOperation.java:65)
at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:80)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.executeWork(DataflowWorker.java:287)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.doWork(DataflowWorker.java:223)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:173)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.doWork(DataflowWorkerHarness.java:193)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:173)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:160)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
This exception is repeated ten times.
Finally I get "Workflow failed" as below:
Workflow failed. Causes: S04:Insert/DataflowPipelineRunner.BatchBigQueryIOWrite/BigQueryIO.StreamWithDeDup/Reshuffle/
GroupByKey/Read+Insert/DataflowPipelineRunner.BatchBigQueryIOWrite/BigQueryIO.StreamWithDeDup/Reshuffle/GroupByKey/
GroupByWindow+Insert/DataflowPipelineRunner.BatchBigQueryIOWrite/BigQueryIO.StreamWithDeDup/Reshuffle/
ExpandIterable+Insert/DataflowPipelineRunner.BatchBigQueryIOWrite/BigQueryIO.StreamWithDeDup/ParDo(StreamingWrite)
failed.
Sometimes the same job with the same input runs without problems, which makes this quite hard to debug. Where should I start?
This is a known issue with the BigQueryIO streaming write operation in Dataflow SDK for Java 1.7.0. It is fixed in the GitHub HEAD and the fix will be included in the 1.8.0 release of the Dataflow Java SDK.
For more details, see Issue #451 on the DataflowJavaSDK GitHub repository.

not able to select the data from a table in hive

When I create a table in Hive I get OK, and I also get OK when I load values into it from a text file. But when I select values from that table, I get the error message below:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.mapred.JobConf.unset(Ljava/lang/String;)
This is due to a version inconsistency; you need to switch to a newer Hadoop 2.x.x version.
https://stackoverflow.com/questions/27842004/issue-in-apache-hive-0-14-running-dml-queries
You can check which Hadoop version works with which Hive version on the Hive download page.
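A quick sketch for confirming which versions are actually on the path before comparing against the compatibility table on the download page:
# print the Hadoop and Hive versions the shell resolves to
hadoop version | head -1
hive --version | head -1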

accessing Views created in Hive using HCatLoader in Pig

I was just trying something with Hive and HCatLoader in Pig. I created a view in Hive and then tried to load data from that view into Pig using HCatLoader, but it does not seem to work. I just wanted to confirm whether there is any way to do this. I get the following error when I try to load the view in Pig using HCatLoader:
events = LOAD 'ViewName' USING org.apache.hcatalog.pig.HCatLoader();
DUMP events;
When I use any table name instead of the view, it works, and it does not give a metastore error either: the LOAD statement reports a successful connection to the metastore, but when it gets to DUMP it crashes with the following error.
Any Pointers will be helpful.
Thanks,
Atul
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias events
at org.apache.pig.PigServer.openIterator(PigServer.java:857)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:682)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:555)
at org.apache.pig.Main.main(Main.java:111)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: org.apache.pig.PigException: ERROR 1002: Unable to store alias events
at org.apache.pig.PigServer.storeEx(PigServer.java:956)
at org.apache.pig.PigServer.store(PigServer.java:919)
at org.apache.pig.PigServer.openIterator(PigServer.java:832)
... 12 more
Caused by: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobCreationException: ERROR 2017: Internal error creating job configuration.
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:731)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:259)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:180)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1270)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1255)
at org.apache.pig.PigServer.storeEx(PigServer.java:952)
The response I received after posting this on another forum:
"HCatLoader does not support reading views in Hive. The issue is that a view is defined as a query on a table (create view V as select x, y from t).
Pig doesn't speak SQL,
and
HCat doesn't contain Hive's execution engine
so it cannot execute the query either. Reading Hive views from Pig and MR will require much tighter integration of the products than we currently have."
I found the same issue the hard way today. Pig (via HCatLoader) cannot read Hive views, and it lacks good exception handling on this point.
For the record (for anybody else running into this problem), this is how the current version behaves: on Hortonworks HDP 2.3 with Pig 0.15 I only got the following error in the log:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2017: Internal error
creating job configuration.
Pig fails this way because there is no file to load (we attempted to load from a view). Since Pig loads data from files in Hadoop, reading data from a view (which has no physical file) may not work.
Maybe if we could create a file for the view in Hadoop, at least a virtual pointer to the actual data files, Pig would be able to load it. Not sure whether this is possible or has been thought through.
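One common workaround along those lines (not from this thread; the snapshot table name is illustrative) is to materialize the view into a real table and point HCatLoader at that table instead:
# snapshot the view into a plain table so there are physical files to read
hive -e "CREATE TABLE viewname_snapshot AS SELECT * FROM ViewName;"
# load the snapshot from Pig; HCatLoader can read tables, just not views
pig -useHCatalog -e "events = LOAD 'viewname_snapshot' USING org.apache.hcatalog.pig.HCatLoader(); DUMP events;"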