Accessing views created in Hive using HCatLoader in Pig

I was experimenting with Hive and HCatLoader in Pig. I created a view in Hive and then tried to load data through that view into Pig using HCatLoader, but it does not seem to work. Is there any way to do this? I get the error below when I try to load the view in Pig using HCatLoader:
events=Load 'ViewName' using org.apache.hcatalog.pig.HCatLoader();
dump events;
When I use a Hive table name instead of the view, it works. It also does not look like a metastore error: the load statement reports that it connected to the metastore successfully, and the crash only happens at dump, with the following error.
Any Pointers will be helpful.
Thanks,
Atul
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias events
at org.apache.pig.PigServer.openIterator(PigServer.java:857)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:682)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:555)
at org.apache.pig.Main.main(Main.java:111)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: org.apache.pig.PigException: ERROR 1002: Unable to store alias events
at org.apache.pig.PigServer.storeEx(PigServer.java:956)
at org.apache.pig.PigServer.store(PigServer.java:919)
at org.apache.pig.PigServer.openIterator(PigServer.java:832)
... 12 more
Caused by: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobCreationException: ERROR 2017: Internal error creating job configuration.
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:731)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:259)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:180)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1270)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1255)
at org.apache.pig.PigServer.storeEx(PigServer.java:952)

Response I received after posting this on another forum:
"HCatLoader does not support reading views in Hive. The issue is that a view is defined as a query on a table (create view V as select x, y from t).
Pig doesn't speak SQL, and HCat doesn't contain Hive's execution engine, so it cannot execute the query either. Reading Hive views from Pig and MR will require much tighter integration of the products than we currently have."

I found the same issue the hard way today. Pig cannot read Hive views through HCatLoader (and lacks good exception handling for this case).
For the record (for anybody else running into this problem), this is how the current version behaves: on Hortonworks 2.3 with Pig 0.15 I only got the following error in the log:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2017: Internal error creating job configuration.
Pig fails this way because there is no file to load (as we attempted to load from a View).

Since Pig loads data from files in Hadoop, reading data from a view (which has no physical file) may not work.
Maybe if we could create a file for the view in Hadoop, Pig would be able to load it, or at least a virtual pointer file to the actual data file.
I am not sure whether this is possible or has been thought through.
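One way to act on the idea above of giving the view a physical file is to materialize the view into an ordinary Hive table and point HCatLoader at that table instead. The sketch below is a hypothetical Python helper (not part of Hive, Pig, or HCatalog) that only builds the statements to run. The names events_view and events_snapshot are made up for illustration, and the snapshot has to be rebuilt whenever the underlying data changes.

```python
import re

# Conservative identifier check so the generated statements cannot be
# altered by odd input. This rule is an assumption of the sketch, not a
# Hive requirement.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _check_identifier(name):
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def materialize_view_statements(view_name, table_name):
    """Return HiveQL statements that copy a view's rows into a real table."""
    view = _check_identifier(view_name)
    table = _check_identifier(table_name)
    return [
        f"DROP TABLE IF EXISTS {table}",
        f"CREATE TABLE {table} AS SELECT * FROM {view}",
    ]


def pig_load_statement(alias, table_name):
    """Return a Pig LOAD statement for the materialized table.

    The loader class name is the one used in the question; newer
    HCatalog releases may package it differently.
    """
    table = _check_identifier(table_name)
    return (
        f"{_check_identifier(alias)} = LOAD '{table}' "
        "USING org.apache.hcatalog.pig.HCatLoader();"
    )


if __name__ == "__main__":
    for statement in materialize_view_statements("events_view", "events_snapshot"):
        print(statement + ";")
    print(pig_load_statement("events", "events_snapshot"))
```

Run the HiveQL statements in Hive first, then the LOAD statement in Pig. The trade-off is extra storage and a refresh step, since the table is a copy rather than a live view.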

Related

Unable to select count of rows of an ORC table through Hive Beeline command

I am using the following components - Hadoop 3.1.4 , Hive 3.1.3 and Tez 0.9.2
I have an ORC table and am trying to get its row count. Running select count(*) from ORC_TABLE throws the exceptions below:
Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1670915386694_0182_1_00, diagnostics=[Vertex vertex_1670915386694_0182_1_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: jio_ar_consumer_events initializer failed, vertex=vertex_1670915386694_0182_1_00 [Map 1], java.lang.RuntimeException: ORC split generation failed with exception: java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1851)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1939)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:519)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:765)
at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:243)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
Caused by: java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1790)
... 17 more
Caused by: java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
at org.apache.hadoop.hive.ql.io.AcidUtils.lambda$getAcidState$0(AcidUtils.java:1117)
at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
at java.util.TimSort.sort(TimSort.java:220)
at java.util.Arrays.sort(Arrays.java:1512)
at java.util.ArrayList.sort(ArrayList.java:1464)
at java.util.Collections.sort(Collections.java:177)
at org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:1115)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.callInternal(OrcInputFormat.java:1207)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.access$1500(OrcInputFormat.java:1142)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator$1.run(OrcInputFormat.java:1179)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator$1.run(OrcInputFormat.java:1176)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:1176)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:1142)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
The same problem is described in another question, "ORC Split Generation issue with Hive Table", but it has no solution yet.
I also tried running CONCATENATE on the ORC table, but that didn't help either.
What does work is select * from ORC_TABLE, with or without LIMIT; that returns the records. I suspect the issue only affects aggregate functions, or maybe I don't understand the issue yet.
I am also using Spark 3.3.1, and through Spark SQL I can get the same count and fetch the rows as well. There are no issues on the Spark side.
In addition, when I change the execution engine to MR, the query works. It fails only on the Tez engine.
Any leads on resolving this issue are much appreciated.
The issue was resolved with the steps below, based on my earlier analysis:
The class org.apache.hadoop.fs.FileStatus is part of the hadoop-common jar.
We were using Hadoop 3.1.4 and Tez 0.9.2.
Tez 0.9.2 ships a tez.tar.gz that has to be placed on HDFS. This tez.tar.gz contained hadoop-common-2.7.2.jar, which does not have the compareTo(FileStatus) method named in the NoSuchMethodError above.
Solution 1:
We extracted tez.tar.gz and replaced all Hadoop 2.7.2 jars with their Hadoop 3.1.4 counterparts. Do this if you don't want to reconfigure everything for a new Tez version; otherwise follow Solution 2 below.
We then recreated the tarball and placed it in every location that depends on it, including HDFS. For us that was /user/tez/share/tez.tar.gz; the path will differ per installation.
After these steps the error disappeared, and I can now count records on any table.
Solution 2:
An easier alternative is to use Tez 0.10.x, which contains libraries for Hadoop 3.x, rather than Tez 0.9.2, which is compatible with Hadoop 2.7.x.
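Before repacking anything, it can help to confirm that the Tez tarball really bundles Hadoop jars from a different release than the cluster. The following is a minimal sketch using Python's standard tarfile module; the function name is hypothetical, and it assumes the jars follow the usual hadoop-<module>-<version>.jar naming. Tez's own hadoop-shim jars carry the Tez version rather than the Hadoop version, so the sketch skips them (an assumption about the tarball layout).

```python
import re
import tarfile

# Matches file names such as hadoop-common-2.7.2.jar and captures the version.
_HADOOP_JAR = re.compile(r"(hadoop-[a-z0-9-]+?)-(\d+(?:\.\d+)+)\.jar$")


def mismatched_hadoop_jars(tarball_path, expected_version):
    """List (member name, version) for bundled Hadoop jars whose version
    differs from expected_version."""
    mismatches = []
    with tarfile.open(tarball_path, "r:gz") as archive:
        for member in archive.getmembers():
            match = _HADOOP_JAR.search(member.name)
            if match is None:
                continue
            module, version = match.group(1), match.group(2)
            # hadoop-shim jars belong to Tez and use the Tez version number.
            if module.startswith("hadoop-shim"):
                continue
            if version != expected_version:
                mismatches.append((member.name, version))
    return sorted(mismatches)


if __name__ == "__main__":
    import sys

    # Usage: python check_tez_tarball.py tez.tar.gz 3.1.4
    for name, version in mismatched_hadoop_jars(sys.argv[1], sys.argv[2]):
        print(f"{name}: bundled Hadoop {version}, expected {sys.argv[2]}")
```

An empty result suggests the tarball already matches the cluster's Hadoop version; any lines printed point at jars that would need replacing under Solution 1.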

Getting FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask exception while accessing Hive views

I am trying to query a view in Hive and get the following exception:
Getting log thread is interrupted, since query is done!
Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2)
java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
at org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:349)
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:251)
at org.apache.hive.beeline.Commands.executeInternal(Commands.java:988)
at org.apache.hive.beeline.Commands.execute(Commands.java:1160)
at org.apache.hive.beeline.Commands.sql(Commands.java:1074)
at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:1145)
at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:976)
at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:886)
at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:502)
at org.apache.hive.beeline.BeeLine.main(BeeLine.java:485)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Here is my hive query:
select * from sample_view;
I have added SPARK_HOME/jars path to $HIVE_HOME/bin/hive like:
for f in ${SPARK_HOME}/jars/*.jar; do
CLASSPATH=${CLASSPATH}:$f;
done
I have tried setting hive.execution.engine to mr as well as to spark, but no luck.
Please help me out.
TIA
I have seen this error for several reasons; it can be a red herring that lumps multiple failures together. Without seeing the table DDL or the executor logs, this is the best answer I can offer.
(1) Java error: navigate to the YARN logs for this job instance and read the executor logs. If the query broke because of a relatively rare error, you will find it there. Good luck, this can be painful.
(2) A background service is misbehaving: restart the Hadoop and Hive services and rerun the command.
(3) Try to read the underlying data from another process. This shows whether the data doesn't match the DDL or is corrupted.
(4) Repair the table and, if you also query it through Impala, invalidate its metadata (invalidate metadata is an Impala statement, not HiveQL):
msck repair table <table-name>
invalidate metadata <table-name>
Good luck.

Cannot load jdbc driver class org.apache.hive.jdbc.hivedriver in Kylo

I am trying to create a Data Ingest Feed but all the jobs are failing. I checked Nifi and there are error marks saying that "org.apache.hive.jdbc.hivedriver" was not found. I checked the nifi logs and found the following error :
So where exactly do I need to put the hivedriver jar?
Based on the comments, this seems to be the solution, as mentioned by @Greg Hart:
Have you tried using a Data Transformation feed? The Data Ingest
template is for loading data into Hive, but it looks like you're using
it to move data from one Hive table into another.

Can't create ORC external tables on HAWQ PXF

I'm using Pivotal HAWQ with Ambari, and I'm now trying to run some queries over ORC Hive tables with HAWQ.
Previously I was able to run external queries in psql using SELECT * FROM hcatalog.hive-db-name.hive-table-name distributed randomly;
But now every time I get the error:
Exception report message java.lang.Exception: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.
Can you help me get past this?
I believe you have missed a step to update your pxf-profiles.xml file that's required after upgrading to HDB 2.2. Please see the instructions listed here:
http://hdb.docs.pivotal.io/220/hdb/install/install-ambari.html#post-install-212-req

Redshift drop/create/select query failing in Data Pipeline

I'm trying to run a daily migration script in Redshift using Data Pipeline.
The script works as expected when I run it directly using SQL Workbench/J, but fails when triggered through Data Pipeline.
I have reproduced the problem with this simple code:
drop table if exists image_stg;
create table image_stg (like image_full);
select * from image_stg;
When I run it in Data Pipeline, I get this error:
[Amazon](500310) Invalid operation: relation "image_stg" does not exist;
I also got this error once, for the exact same code, without changing anything:
[Amazon](500310) Invalid operation: Relation with OID 108425 does not exist.;
Here's a screenshot of the two error messages:
I've found this thread on the AWS forums, but it didn't help: Pipeline started failing on simple Redshift SqlActivity and temp table
What is causing this error? Is there a workaround?
I've contacted Amazon, and it looks like a problem in Data Pipeline.
They did suggest a workaround that seems to work in my case: Change the JDBC connection string from jdbc:redshift://… to jdbc:postgresql://… .
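If the connection string is produced by a script or template, the workaround above amounts to swapping the URL prefix. A minimal sketch, assuming a hypothetical helper name and a made-up host and database in the usage line; the rest of the URL is passed through unchanged:

```python
REDSHIFT_PREFIX = "jdbc:redshift://"
POSTGRESQL_PREFIX = "jdbc:postgresql://"


def use_postgresql_scheme(jdbc_url):
    """Swap the jdbc:redshift:// prefix for jdbc:postgresql://.

    URLs that do not start with the Redshift prefix are returned unchanged.
    """
    if not jdbc_url.startswith(REDSHIFT_PREFIX):
        return jdbc_url
    return POSTGRESQL_PREFIX + jdbc_url[len(REDSHIFT_PREFIX):]


if __name__ == "__main__":
    # example-cluster-host and dev are placeholders, not real endpoints.
    print(use_postgresql_scheme("jdbc:redshift://example-cluster-host:5439/dev"))
```

Note that a jdbc:postgresql:// URL is handled by the PostgreSQL JDBC driver rather than the Amazon Redshift one, so behavior may differ in other respects.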
I had the same problem when creating a temporary table in Redshift via Data Pipeline, but changing the connection string from jdbc:redshift://… to jdbc:postgresql://… didn't work for me. My last resort is to create the table as a physical table and drop it after use, all through the pipeline.