Prestodb (AWS EMR) load partition metadata - amazon-emr

I have an AWS EMR cluster with PrestoDB v0.238.3.
Everything works fine except one thing: when I call the procedure that syncs partition metadata, it always fails. I tried executing the command via DataGrip (with the JDBC driver) and via presto-cli on the master node of the cluster, with no difference:
CALL system.metadata.sync_partition_metadata('test', 'table_test', 'FULL')
[29] Query failed (#20201124_193910_00688_37bpw):
Procedure not registered: system.sync_partition_metadata com.facebook.presto.spi.PrestoException:
Procedure not registered: system.sync_partition_metadata
What am I doing wrong?

That function is available starting with Presto 304: https://trino.io/docs/current/release/release-304.html?highlight=sync_partition_metadata#hive-connector-changes
The documentation is available here: https://trino.io/docs/current/connector/hive.html?highlight=sync_partition_metadata#procedures
In EMR, it's available with EMR version 6.1.0, which includes Presto 338:
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-6x.html

This function is available starting with version 0.241 in PrestoDB.
You can either bootstrap your EMR with a later version of PrestoDB, or use a managed service like https://ahana.io/ahana-cloud/
Doc: https://prestodb.io/docs/current/connector/hive.html#procedures
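If you want to check a cluster's version against that threshold before calling the procedure, here is a minimal sketch. It assumes the 0.241 minimum quoted above and applies only to PrestoDB version strings (the PrestoSQL/Trino numbers such as 304 or 338 use a different numbering scheme); the function names are made up for illustration.

```python
# Minimum PrestoDB release for sync_partition_metadata, as quoted in the answer above.
MIN_PRESTODB_FOR_SYNC = (0, 241)


def parse_version(text):
    """Turn a dotted version string such as '0.238.3' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def supports_sync_partition_metadata(prestodb_version):
    """Return True if the given PrestoDB version is at or above the threshold."""
    return parse_version(prestodb_version) >= MIN_PRESTODB_FOR_SYNC


if __name__ == "__main__":
    print(supports_sync_partition_metadata("0.238.3"))  # the asker's version
    print(supports_sync_partition_metadata("0.241"))
```

For the version in the question (0.238.3) this returns False, which matches the "Procedure not registered" error.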

Related

Alluxio + Hive on EMR

I have Alluxio 1.8 installed on an EMR 5.19.0 cluster, and can see my S3 tables using /usr/local/alluxio/bin/alluxio fs ls /.
However, when I start up hive and issue
hive> [[DDL w/ LOCATION = alluxio://master_host:19998/my_table ]], I get the following:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.lang.RuntimeException: java.lang.ClassNotFoundException: Class alluxio.hadoop.FileSystem not found
Is there a way of getting past this? I've tried starting hive with --auxpath pointing to both /usr/local/alluxio/client/alluxio-1.8.1-client.jar and a copy of the jar on hdfs without any success.
Any help?
I posted a blog talking about the reasons for the error message java.lang.ClassNotFoundException: Class alluxio.hadoop.FileSystem not found. Here are some tips, hope they can help:
For Hive, set environment variable HIVE_AUX_JARS_PATH in conf/hive-env.sh:
export HIVE_AUX_JARS_PATH=/<PATH_TO_ALLUXIO>/client/alluxio-1.8.1-client.jar:${HIVE_AUX_JARS_PATH}
which I guess is equivalent to what you have done to set --auxpath.
Depending on your Hive setup (e.g., Hive on MR, Spark, or Tez), you may also need to make sure the runtime can access the client jar. Taking Hive on MR as an example, you may also need to append the path of the Alluxio client jar to mapreduce.application.classpath or yarn.application.classpath so that each task of the MR jobs can access this jar.
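Before changing classpaths, it can help to confirm that the jar you are pointing at really contains the missing class. A rough sketch using only the Python standard library (a jar is a zip archive); the helper name is my own, and the jar path is the one from the question, so adjust it to your install:

```python
import os
import zipfile


def jar_contains_class(jar_path, class_name):
    """Return True if the jar has a .class entry for the fully qualified class_name."""
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()


if __name__ == "__main__":
    # Path taken from the question; it may differ on your cluster.
    jar = "/usr/local/alluxio/client/alluxio-1.8.1-client.jar"
    if os.path.exists(jar):
        print(jar_contains_class(jar, "alluxio.hadoop.FileSystem"))
    else:
        print("jar not found at", jar)
```

If this prints True but Hive still fails, the jar is fine and the problem is that the process throwing the exception (the metastore, judging by the MetaException) does not have it on its classpath.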

AWS EMR s3a filesystem not found

I am running an EMR instance. It was working fine, but it suddenly started giving the error below when I try to access S3 files from a Python Spark script:
py4j.protocol.Py4JJavaError: An error occurred while calling o36.json.:
java.lang.RuntimeException:
java.lang.ClassNotFoundException:
Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
How can we resolve this?
Thanks in advance.
It was an issue with Spark's dependencies. I had to add the jars config in spark-defaults.conf:
spark.jars.packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2
Please follow the link below:
https://gist.github.com/eddies/f37d696567f15b33029277ee9084c4a0
Download the hadoop-aws-3.2.1.jar (or any version above 2.7.10 based on your EMR version) and put it in /usr/lib/spark/jars
Download the latest aws SDK and put it in /usr/lib/spark/jars
Update /usr/lib/spark/conf/spark-defaults.conf
Update spark.driver.extraClassPath: at the end, add the full paths of these 2 new jars, separated by a colon
Run spark-submit after that
Note: I used AWS EMR version 6.0+
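The classpath step above amounts to appending entries to a colon-separated string. A small sketch of that, assuming the jars were copied to /usr/lib/spark/jars as described; the helper name and the AWS SDK jar file name are placeholders, since the answer does not name the SDK jar:

```python
def extend_classpath(existing, new_jars):
    """Append jar paths to a colon-separated classpath, skipping duplicates."""
    entries = [entry for entry in existing.split(":") if entry]
    for jar in new_jars:
        if jar not in entries:
            entries.append(jar)
    return ":".join(entries)


if __name__ == "__main__":
    current = "/usr/lib/hadoop/lib/*"  # placeholder for the value already in spark-defaults.conf
    new_jars = [
        "/usr/lib/spark/jars/hadoop-aws-3.2.1.jar",
        "/usr/lib/spark/jars/aws-java-sdk-bundle.jar",  # hypothetical file name
    ]
    print("spark.driver.extraClassPath " + extend_classpath(current, new_jars))
```

The printed line is what would go into spark-defaults.conf, replacing the existing spark.driver.extraClassPath entry.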
For Amazon EMR, use the "s3:" prefix. The S3A connector is the ASF's open source one; Amazon have their own (closed source) connector, which is the only one they support
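If your script has s3a:// paths hard-coded, a tiny helper can switch them to the s3:// scheme when running on EMR. This is only a sketch; the function name and the bucket and key below are made up:

```python
def to_emr_s3_uri(uri):
    """Rewrite an s3a:// URI to s3://; any other URI is returned unchanged."""
    prefix = "s3a://"
    if uri.startswith(prefix):
        return "s3://" + uri[len(prefix):]
    return uri


if __name__ == "__main__":
    # Hypothetical bucket and key, for illustration only.
    print(to_emr_s3_uri("s3a://my-bucket/data/events.json"))
```

The rewritten path can then be passed to spark.read.json(...) as usual.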

HIVE on Spark Issue

I am trying to configure Hive on Spark, but even after trying for 5 days I have not found a solution.
Steps followed:
1. After installing Spark, I went into the Hive console and set the properties below:
set hive.execution.engine=spark;
set spark.master=spark://INBBRDSSVM294:7077;
set spark.executor.memory=2g;
set spark.serializer=org.apache.spark.serializer.KryoSerializer;
2. Added the spark-assembly jar to Hive's lib directory.
3. When running select count(*) from table_name, I get the error below:
2016-08-08 15:17:30,207 ERROR [main]: spark.SparkTask (SparkTask.java:execute(131))
- Failed to execute spark task, with exception
'org.apache.hadoop.hive.ql.metadata.HiveException (Failed to create spark client.)'
Hive version: 1.2.1
Spark version: tried with 1.6.1,1.3.1 and 2.0.0
I would appreciate it if anyone could suggest something.
You can download the spark-1.3.1 source from the Spark download website and try to build Spark 1.3.1 without Hive using:
./make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.4" -Dhadoop.version=2.7.1 -Dyarn.version=2.7.1 -DskipTests
Then copy spark-assembly-1.3.1-hadoop2.7.1.jar to hive/lib folder.
And follow https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started#HiveonSpark:GettingStarted-SparkInstallation to set necessary properties.
First of all, you need to pay attention to which versions are compatible. If you choose Hive 1.2.1, I advise you to use Spark 1.3.1. You can see the version compatibility list here.
The error you are getting is a generic one. You need to start Spark and see what errors the Spark workers report. Also, have you already copied hive-site.xml to spark/conf?

Cannot Load Hive Table into Pig via HCatalog

I am currently configuring a Cloudera HDP dev image using this tutorial on CentOS 6.5, installing the base and then adding the different components as I need them. Currently, I am installing / testing HCatalog using this section of the tutorial linked above.
I have successfully installed the package and am now testing HCatalog integration with Pig with the following script:
A = LOAD 'groups' USING org.apache.hcatalog.pig.HCatLoader();
DESCRIBE A;
I have previously created and populated a 'groups' table in Hive before running the command. When I run the script with the command pig -useHCatalog test.pig I get an exception rather than the expected output. Below is the initial part of the stacktrace:
Pig Stack Trace
---------------
ERROR 2245: Cannot get schema from loadFunc org.apache.hcatalog.pig.HCatLoader
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. Cannot get schema from loadFunc org.apache.hcatalog.pig.HCatLoader
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1608)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1547)
at org.apache.pig.PigServer.registerQuery(PigServer.java:518)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:991)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:412)
...
Has anyone encountered this error before? Any help would be much appreciated. I would be happy to provide more information if you need it.
The error was caused by HBase's Thrift server not being properly configured. I installed/configured Thrift and added the following to my hive-site.xml, with the proper server information filled in:
<property>
<name>hive.metastore.uris</name>
<value>thrift://<!--URL of Your Server-->:9083</value>
<description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
I thought the snippet above was not required since I am running Cloudera HDP in pseudo-distributed mode. It turns out that both it and HBase Thrift are required to use HCatalog with Pig.
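To confirm the property is actually set in the file Hive and Pig are reading, you can parse the XML and look it up. A minimal sketch with the standard library; the helper name is my own and the host name in the sample is a placeholder, not a real server:

```python
import xml.etree.ElementTree as ET


def get_hive_property(hive_site_xml, name):
    """Return the <value> text of the named property, or None if it is absent."""
    root = ET.fromstring(hive_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


# Sample content; metastore.example.com is a placeholder host.
SAMPLE = """<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore.example.com:9083</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    print(get_hive_property(SAMPLE, "hive.metastore.uris"))
```

For a real check, read the contents of your hive-site.xml into a string and pass that instead of SAMPLE. A result of None means the property is missing from that file.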

In what mode is Hive installed?

Does a Hive installation have any specific mode?
For example, a Hadoop installation has 3 modes: standalone, pseudo-distributed and fully distributed.
Similarly, does Hive have any specific type of distribution?
Can Hive be installed in distributed mode?
Hive actually gives you the option to run queries in 2 modes:
1. Map-Reduce mode
2. Local mode
Normally the Hive compiler generates map-reduce jobs for most queries under the hood. These jobs are then submitted to the Map-Reduce cluster indicated by the variable:
mapred.job.tracker
While this usually points to a map-reduce cluster with multiple nodes, Hadoop also gives you the ability to run map-reduce jobs locally on your standalone workstation. To run Hive queries in local mode, you need to do this:
hive> SET mapred.job.tracker=local;
Details can be found here.