Setting Remote hive metastore on postgresql for EMR - amazon-emr

I am trying to setup postgresql db as external Hive metastore for AWS EMR.
I have tried hosting it on both EC2 and RDS.
I have already tried the steps given here.
But it doesn't go through; EMR fails in the provisioning step with the message:
On the master instance (instance-id), application provisioning failed
I could not decipher anything from the failure log.
I also copied postgresql jdbc jar in paths
/usr/lib/hive/lib/ and /usr/lib/hive/jdbc/
in case EMR doesn't already have it, but still no help!
Then I set up the system by manually editing hive-site.xml and setting these properties:
javax.jdo.option.ConnectionURL
javax.jdo.option.ConnectionDriverName
javax.jdo.option.ConnectionUserName
javax.jdo.option.ConnectionPassword
datanucleus.fixedDatastore
datanucleus.schema.autoCreateTables
and had to run hive --service metatool -listFSRoot.
After these manual settings I was able to get EMR to use the Postgres database as the remote metastore.
Is there any way I can make it work using the configuration file as mentioned in official documentation?
Edit:
The configuration setting I am using for the remote MySQL metastore:
classification=hive-site,properties=[javax.jdo.option.ConnectionURL=jdbc:mysql://[host]:3306/[dbname]?createDatabaseIfNotExist=true,javax.jdo.option.ConnectionDriverName=org.mariadb.jdbc.Driver,javax.jdo.option.ConnectionUserName=[user],javax.jdo.option.ConnectionPassword=[pass]]
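A PostgreSQL version of that classification would presumably look like the following. This is an untested sketch: the host, database name, user, and password are placeholders, 5432 is PostgreSQL's default port, and org.postgresql.Driver is the driver class shipped in the PostgreSQL JDBC jar. A small Python helper that builds the JSON form of the classification:

```python
import json

def hive_site_classification(host, dbname, user, password):
    """Build the hive-site configuration classification for a PostgreSQL
    metastore, mirroring the MySQL example above. All arguments are
    placeholders to be replaced with real connection details."""
    return {
        "Classification": "hive-site",
        "Properties": {
            "javax.jdo.option.ConnectionURL": f"jdbc:postgresql://{host}:5432/{dbname}",
            "javax.jdo.option.ConnectionDriverName": "org.postgresql.Driver",
            "javax.jdo.option.ConnectionUserName": user,
            "javax.jdo.option.ConnectionPassword": password,
        },
    }

# Example: serialize for the --configurations option of `aws emr create-cluster`
print(json.dumps([hive_site_classification("metastore-host", "hive", "hive", "secret")], indent=2))
```

The resulting dict goes into the JSON list passed via `--configurations` when creating the cluster. Whether EMR's schema initialization then succeeds against Postgres is a separate issue, as the answer below this question discusses.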

I could never find a clean approach to configure this at the time of EMR startup itself.
The main problem is that EMR initializes the schema for MySQL using the command:
/usr/lib/hive/bin/schematool -initSchema -dbType MySQL
whereas the dbType should be postgres in our case.
The following manual steps allow you to configure Postgres as the external metastore:
1) Start EMR cluster with hive application, with default configurations.
2) Stop Hive using the command:
sudo stop hive-server2
3) Copy the postgresql-jdbc jar (stored in some S3 location) to /usr/lib/hive/lib/ on EMR
4) Overwrite the default hive-site.xml in /usr/lib/hive/conf/ with a custom one containing the JDO configuration for the PostgreSQL instance running on the EC2 node
5) Execute the command:
sudo /usr/lib/hive/bin/schematool -upgradeSchema -dbType postgres

Related

AWS EMR - how to copy files to all the nodes?

Is there a way to copy a file to all the nodes in an EMR cluster through the EMR command line? I am working with Presto and have created my custom plugin. The problem is I have to install this plugin on all the nodes. I don't want to log in to all the nodes and copy it.
You can add it as a bootstrap script to let this happen during the launch of the cluster.
@Sanket9394 Thanks for the edit!
If you have the option to bring up a new EMR cluster, you should consider using an EMR bootstrap script.
But in case you want to do it on an existing EMR cluster (bootstrap actions only run at launch time),
you can do it with the help of AWS Systems Manager (SSM) and the built-in EMR client.
Something like (python):
emr_client = boto3.client('emr')
ssm_client = boto3.client('ssm')
You can get the list of core instances using emr_client.list_instances,
and finally send a command to each of these instances using ssm_client.send_command.
Ref: check the last detailed example, "Example: Installing Libraries on Core Nodes of a Running Cluster", on https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub-install-kernels-libs.html#emr-jupyterhub-install-libs
Note: if you go with SSM, the proper SSM IAM policy must be attached to the IAM role of your master node.
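Putting those pieces together, a minimal sketch (untested; the clients are passed in as parameters so the function can be exercised with stubs, and in real use would come from boto3.client("emr") and boto3.client("ssm"); AWS-RunShellScript is the built-in SSM document for running shell commands, and the S3 path and destination directory are placeholders):

```python
def copy_to_core_nodes(emr_client, ssm_client, cluster_id, s3_path, dest_dir):
    """Send an `aws s3 cp` command to every CORE instance of a running
    EMR cluster via SSM. The nodes must run the SSM agent and have an
    instance profile allowing both SSM and the S3 read."""
    # List only the CORE instance group of the cluster.
    resp = emr_client.list_instances(
        ClusterId=cluster_id, InstanceGroupTypes=["CORE"]
    )
    instance_ids = [i["Ec2InstanceId"] for i in resp["Instances"]]

    # Run the copy command on all of those instances at once.
    return ssm_client.send_command(
        InstanceIds=instance_ids,
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": [f"aws s3 cp {s3_path} {dest_dir}"]},
    )

# Real usage (requires boto3 and AWS credentials):
# import boto3
# copy_to_core_nodes(boto3.client("emr"), boto3.client("ssm"),
#                    "j-XXXXXXXXXXXX", "s3://my-bucket/my-plugin.jar",
#                    "/usr/lib/presto/plugin/")
```

Passing the clients in also makes the helper easy to test without touching AWS.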

Access Glue Catalog from Dev endpoint and local Zeppelin with spark sql

I have set up a local Zeppelin notebook to access Glue Dev endpoint. I'm able to run spark and pyspark code and access the Glue catalog. But when I try spark.sql("show databases").show() or %sql show databases only default is returned.
When spinning up an EMR cluster we have to choose "Use for Hive table metadata" to enable this, but I hoped this would be the default setting for a Glue development endpoint, which seems not to be the case. Is there any workaround for this?

getting the hive meta-store url to use in other systems

When I go into hive in command line, is there a way to get the hive metastore url that is being used?
I'm trying to connect another system to hive but can't seem to figure out what the metastore url is.
Here is the command.
hive> set hive.metastore.uris;
Here is the output
hive.metastore.uris=thrift://sandbox.hortonworks.com:9083
Using set you can get all the Hadoop and Hive parameters in use when the Hive CLI is launched.
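To point another system at the metastore, you typically need the host and port separately. A small sketch that pulls them out of the thrift URI shown in the output above, using Python's standard URL parser:

```python
from urllib.parse import urlparse

# The value reported by `set hive.metastore.uris` above.
uri = "thrift://sandbox.hortonworks.com:9083"

parsed = urlparse(uri)
print(parsed.hostname, parsed.port)  # sandbox.hortonworks.com 9083
```

Note that hive.metastore.uris may contain a comma-separated list of URIs when multiple metastore servers are configured; in that case, split on commas first.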

Hadoop distcp command says can't connect to server

I want to download data from S3 to HDFS. I tried s3cmd, but it's not parallel and thus slow. I am trying to make hadoop distcp work like this:
hadoop distcp -Dfs.s3n.awsAccessKeyId=[Access Key] -Dfs.s3n.awsSecretAccessKey=[Secret Key] s3n://[account-name]/[bucket]/folder /data
but it gives me:
ipc.Client: Retrying connect to server:
ec2-[ip].compute-1.amazonaws.com/[internal-ip]:9001. Already tried 0 time(s)
distcp is a MapReduce-based job. Make sure the JobTracker service is started. Try:
hadoop/bin/start-all.sh

Amazon EMR Spark Cluster: output/result not visible

I am running a Spark cluster on Amazon EMR. I am running the PageRank example programs on the cluster.
While running the programs on my local machine, I am able to see the output properly. But the same doesn't work on EMR. The S3 folder only shows empty files.
The commands I am using:
For starting the cluster:
aws emr create-cluster --name SparkCluster --ami-version 3.2 --instance-type m3.xlarge --instance-count 2 \
--ec2-attributes KeyName=sparkproj --applications Name=Hive \
--bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark \
--log-uri s3://sampleapp-amahajan/output/ \
--steps Name=SparkHistoryServer,Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=s3://support.elasticmapreduce/spark/start-history-server
For adding the job:
aws emr add-steps --cluster-id j-9AWEFYP835GI --steps \
Name=PageRank,Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn-cluster,--class,SparkPageRank,s3://sampleapp-amahajan/pagerank_2.10-1.0.jar,s3://sampleapp-amahajan/web-Google.txt,2],ActionOnFailure=CONTINUE
After a few unsuccessful attempts... I made a text file for the output of the job, and it is successfully created on my local machine. But I am unable to view the same when I SSH into the cluster. I tried FoxyProxy to view the logs for the instances, and nothing shows up there either.
Could you please let me know where I am going wrong?
Thanks!
How are you writing the text file locally? Generally, EMR jobs save their output to S3, so you could use something like outputRDD.saveAsTextFile("s3n://<MY_BUCKET>"). You could also save the output to HDFS, but storing the results in S3 is useful for "ephemeral" clusters, where you provision an EMR cluster, submit a job, and terminate upon completion.
"While running the programs on my local machine, I am able to see the
output properly. But the same doesn't work on EMR. The S3 folder only
shows empty files"
For the benefit of newbies:
If you are printing output to the console, it will be displayed in local mode, but when you execute on an EMR cluster, the reduce operation is performed on the worker nodes, and they can't write to the console of the master/driver node!
With the proper path you should be able to write the results to S3.