Access Glue Catalog from Dev endpoint and local Zeppelin with spark sql - amazon-emr

I have set up a local Zeppelin notebook to access a Glue Dev endpoint. I'm able to run Spark and PySpark code and access the Glue catalog. But when I try spark.sql("show databases").show() or %sql show databases, only default is returned.
When spinning up an EMR cluster we have to choose "Use for Hive table metadata" to enable this. I hoped this would be the default setting for a Glue development endpoint, but that does not seem to be the case. Is there any workaround for this?
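One workaround that is sometimes suggested (hedged: the Glue client factory class has to be available on the dev endpoint's classpath, and in Zeppelin the setting generally needs to be applied before the Spark session is created, e.g. via the interpreter settings) is to point Spark's Hive metastore client at the Glue Data Catalog when building the session:

from pyspark.sql import SparkSession

# Point the Hive metastore client at the Glue Data Catalog. The factory class below is
# the AWS Glue Data Catalog client shipped with EMR/Glue; whether it is on the classpath
# of a given dev endpoint is an assumption.
spark = (
    SparkSession.builder
    .config("spark.hadoop.hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("show databases").show()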

Related

What is the best approach to sync data from an AWS S3 bucket to Azure Data Lake Gen 2

Currently, I download csv files from AWS S3 to my local computer using:
aws s3 sync s3://<cloud_source> c:/<local_destination> --profile aws_profile. Now, I would like to use the same process to sync the files from AWS to Azure Data Lake Storage Gen2 (one-way sync) on a daily basis. [Note: I only have read/download permissions for the S3 data source.]
I thought about 5 potential paths to solving this problem:
1. Use AWS CLI commands within Azure. I'm not entirely sure how to do that without running an Azure VM. I would also like my AWS profile credentials to persist.
2. Use Python's subprocess library to run AWS CLI commands. I run into similar issues as with option 1, namely a) maintaining a persistent install of the AWS CLI, b) passing AWS profile credentials, and c) running without an Azure VM.
3. Use Python's Boto3 library to access AWS services. In the past, it appears that Boto3 didn't support the AWS sync command, so developers like #raydel-miranda developed their own. [see Sync two buckets through boto3] However, it now appears that there is a DataSync class for Boto3. [see DataSync | Boto3 Docs 1.17.27 documentation] Would I still need to run this in an Azure VM, or could I use Azure Data Factory?
4. Use Azure Data Factory to copy data from the AWS S3 bucket. [see Copy data from Amazon Simple Storage Service by using Azure Data Factory] My concern is that I want to sync rather than copy. I believe Azure Data Factory can check whether a file already exists, but what if the file has been deleted from the AWS S3 data source?
5. Use an Azure Data Science Virtual Machine to: a) install the AWS CLI, b) create my AWS profile to store the access credentials, and c) run the aws s3 sync... command.
Any tips, suggestions, or ideas on automating this process are greatly appreciated.
Adding one more to the list :)
6. Please also look into the AzCopy option: https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-s3?toc=/azure/storage/blobs/toc.json
I am not aware of any tool that helps with syncing the data; more or less all of them will do a copy, so I think you will have to implement the sync yourself. A couple of quick thoughts:
#3) You can run this from a batch service. You can initiate that from Azure Data Factory. Also, since we are talking about Python, you can also run it from Azure Databricks.
#4) ADF does not have any sync logic for files that have been deleted. We can implement that using the Get Metadata activity: https://learn.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-activity
AzReplicate is another option, especially for very large containers: https://learn.microsoft.com/en-us/samples/azure/azreplicate/azreplicate/
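If you do end up implementing the sync yourself (per option 3 above), a rough outline in Python could look like the sketch below. This is only a sketch under assumptions: it uses boto3 to list and read the S3 objects and the azure-storage-file-datalake package to write them, the account, container, and bucket names are placeholders, and it copies everything without handling files deleted on the S3 side.

import boto3
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholders: substitute your own bucket, storage account, container, and credential.
S3_BUCKET = "cloud-source-bucket"
ADLS_ACCOUNT_URL = "https://<storage-account>.dfs.core.windows.net"
ADLS_CONTAINER = "datalake-container"

s3 = boto3.session.Session(profile_name="aws_profile").client("s3")
adls = DataLakeServiceClient(account_url=ADLS_ACCOUNT_URL, credential="<account-key>")
fs = adls.get_file_system_client(ADLS_CONTAINER)

# Walk every object in the bucket and upload it to ADLS Gen2 under the same key/path.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=S3_BUCKET):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=S3_BUCKET, Key=obj["Key"])["Body"].read()
        fs.get_file_client(obj["Key"]).upload_data(body, overwrite=True)

A true one-way sync would additionally need to compare timestamps/sizes and delete ADLS files that no longer exist in S3, which is the gap the Get Metadata comment above points at.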

AWS EMR - how to copy files to all the nodes?

Is there a way to copy a file to all the nodes in an EMR cluster through the EMR command line? I am working with Presto and have created my custom plugin. The problem is I have to install this plugin on all the nodes, and I don't want to log in to every node and copy it.
You can add it as a bootstrap script to let this happen during the launch of the cluster.
#Sanket9394 Thanks for the edit!
If you have the option to bring up a new EMR cluster, then you should consider using the bootstrap script of the EMR.
But in case you want to do it on an existing EMR cluster (bootstrap is only available at launch time),
you can do this with the help of AWS Systems Manager (SSM) and the built-in EMR client.
Something like this (Python):
import boto3

emr_client = boto3.client('emr')  # to list the cluster's instances
ssm_client = boto3.client('ssm')  # to send commands to those instances
You can get the list of core instances using emr_client.list_instances,
and finally send a command to each of these instances using ssm_client.send_command (see the sketch below).
Ref: check the last detailed example, "Installing Libraries on Core Nodes of a Running Cluster", at https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub-install-kernels-libs.html#emr-jupyterhub-install-libs
Note: if you are going with SSM, you need to have the proper SSM IAM policy attached to the IAM role of your master node.
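Putting those two calls together, a rough sketch might look like the following; the cluster ID, bucket, and copy command are placeholders, and it assumes the SSM agent and IAM permissions from the note above are in place.

import boto3

emr_client = boto3.client('emr')
ssm_client = boto3.client('ssm')

CLUSTER_ID = 'j-XXXXXXXXXXXXX'  # placeholder cluster ID

# Collect the EC2 instance IDs of the core nodes (add 'TASK'/'MASTER' if needed).
instances = emr_client.list_instances(
    ClusterId=CLUSTER_ID,
    InstanceGroupTypes=['CORE'],
)['Instances']
instance_ids = [i['Ec2InstanceId'] for i in instances]

# Run a shell command on each node via SSM; here it pulls the plugin down from S3.
ssm_client.send_command(
    InstanceIds=instance_ids,
    DocumentName='AWS-RunShellScript',
    Parameters={'commands': ['aws s3 cp s3://<my-bucket>/presto-plugin.zip /tmp/']},
)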

Failure to start a Neptune notebook

I can't seem to create a Neptune notebook; every time I try I get the following error:
Notebook Instance Lifecycle Config 'arn:aws:sagemaker:us-west-2:XXXXXXXX:notebook-instance-lifecycle-config/aws-neptune-tutorial-lc'
for Notebook Instance 'arn:aws:sagemaker:us-west-2:XXXXXXXXX:notebook-instance/aws-neptune-tutorial'
took longer than 5 minutes.
Please check your CloudWatch logs for more details if your Notebook Instance has Internet access.
Note that the CloudWatch logs it suggests looking at don't exist.
The neptune database was created using this cloudformation template: https://github.com/awslabs/aws-cloudformation-templates/blob/master/aws/services/NeptuneDB/Neptune.yaml
which created the Neptune cluster in the default VPC.
The notebook instance was created using this cloudformation template: https://s3.amazonaws.com/aws-neptune-customer-samples/neptune-sagemaker/cloudformation-templates/neptune-sagemaker/neptune-sagemaker-nested-stack.json
passing in the relevant values from the created Neptune stack.
Has anyone seen this type of error and know how to get past it?
I had to go in and modify the predefined install script used by Neptune and add a nohup command to the final section of the install, as described here: https://aws.amazon.com/premiumsupport/knowledge-center/sagemaker-lifecycle-script-timeout/
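As a hedged illustration of that fix (the lifecycle config name comes from the error above, but the script path is hypothetical, and whether OnCreate or OnStart is the hook that times out depends on the stack), the idea is to detach the slow part of the install with nohup so the hook itself returns within the 5-minute limit:

import base64
import boto3

LC_NAME = 'aws-neptune-tutorial-lc'  # name taken from the error message above

# Hypothetical wrapper script: run the real install detached so the hook exits quickly.
on_create_script = """#!/bin/bash
set -e
nohup /home/ec2-user/neptune-install.sh > /var/log/neptune-install.log 2>&1 &
"""

sm = boto3.client('sagemaker')
sm.update_notebook_instance_lifecycle_config(
    NotebookInstanceLifecycleConfigName=LC_NAME,
    OnCreate=[{'Content': base64.b64encode(on_create_script.encode()).decode()}],
)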
Probably what is happening is that your notebook instance does not have access to the internet. Check the NAT configuration for your VPC and make sure its security groups allow the required outbound rules.

What S3 bucket does DBFS use? How can I get the S3 location of a DBFS path?

I am trying to migrate my Hive metadata to Glue. While migrating the delta table, when I provide the same DBFS path, I get the error "Cannot create table: The associated location is not empty."
When I try to create the same delta table on the S3 location, it works properly.
Is there a way to find the S3 location for the DBFS path the database is pointed at?
First configure Databricks Runtime to use AWS Glue Data Catalog as its metastore and then migrate the delta table.
Every Databricks deployment has a central Hive metastore accessible by all clusters to persist table metadata. Instead of using the Databricks Hive metastore, you have the option to use an existing external Hive metastore instance or the AWS Glue Catalog.
External Apache Hive Metastore
Using AWS Glue Data Catalog as the Metastore for Databricks Runtime
Databricks File System (DBFS) is a distributed file system mounted into a Databricks workspace and available on Databricks clusters. DBFS is an abstraction on top of scalable object storage and offers the following benefits:
Allows you to mount storage objects so that you can seamlessly access data without requiring credentials.
Allows you to interact with object storage using directory and file semantics instead of storage URLs.
Persists files to object storage, so you won’t lose data after you terminate a cluster.
Is there a way to find the S3 location for the DBFS path the database is pointed at?
You can access an AWS S3 bucket by mounting it using DBFS or directly using the APIs.
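To answer the quoted question directly, two hedged pointers (run from a Databricks notebook, where spark and dbutils are predefined; the database name is a placeholder): dbutils.fs.mounts() lists each DBFS mount point together with its underlying source (e.g. an s3a:// URI), and DESCRIBE DATABASE EXTENDED shows the location a database resolves to.

# List DBFS mount points and the object-store sources (e.g. s3a:// URIs) behind them.
for m in dbutils.fs.mounts():
    print(m.mountPoint, '->', m.source)

# Show the storage location that a database (placeholder name) is pointed at.
spark.sql('DESCRIBE DATABASE EXTENDED my_database').show(truncate=False)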
Reference: "Databricks - Amazon S3"
Hope this helps.

Setting Remote hive metastore on postgresql for EMR

I am trying to set up a PostgreSQL DB as an external Hive metastore for AWS EMR.
I have tried hosting it on both EC2 and RDS.
I have already tried the steps given here.
But it doesn't go through; EMR fails in the provisioning step with the message
On the master instance (instance-id), application provisioning failed
I could not decipher anything from the failure log.
I also copied the PostgreSQL JDBC jar into the paths
/usr/lib/hive/lib/ and /usr/lib/hive/jdbc/
in case EMR doesn't already have it, but still no help!
Then I set up the system by manually editing hive-site.xml and setting the properties:
javax.jdo.option.ConnectionURL
javax.jdo.option.ConnectionDriverName
javax.jdo.option.ConnectionUserName
javax.jdo.option.ConnectionPassword
datanucleus.fixedDatastore
datanucleus.schema.autoCreateTables
and had to run hive --service metatool -listFSRoot.
After these manual settings, I was able to get EMR to use the Postgres DB as the remote metastore.
Is there any way I can make it work using the configuration file as mentioned in official documentation?
Edit:
Configuration setting I am using for the remote MySQL metastore:
classification=hive-site,properties=[javax.jdo.option.ConnectionURL=jdbc:mysql://[host]:3306/[dbname]?createDatabaseIfNotExist=true,javax.jdo.option.ConnectionDriverName=org.mariadb.jdbc.Driver,javax.jdo.option.ConnectionUserName=[user],javax.jdo.option.ConnectionPassword=[pass]]
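For reference, the same classification can be passed programmatically when launching the cluster; the sketch below does that with boto3. The cluster name, release, roles, and instance settings are placeholders, the properties mirror the MySQL classification above, and (as the answer below explains) simply swapping in the Postgres URL/driver is not enough on its own because EMR still runs schematool with -dbType MySQL.

import boto3

emr = boto3.client('emr')

hive_site = {
    'Classification': 'hive-site',
    'Properties': {
        'javax.jdo.option.ConnectionURL': 'jdbc:mysql://[host]:3306/[dbname]?createDatabaseIfNotExist=true',
        'javax.jdo.option.ConnectionDriverName': 'org.mariadb.jdbc.Driver',
        'javax.jdo.option.ConnectionUserName': '[user]',
        'javax.jdo.option.ConnectionPassword': '[pass]',
    },
}

response = emr.run_job_flow(
    Name='hive-external-metastore-test',   # placeholder name
    ReleaseLabel='emr-5.30.0',              # placeholder release label
    Applications=[{'Name': 'Hive'}],
    Configurations=[hive_site],
    Instances={
        'MasterInstanceType': 'm5.xlarge',  # placeholder sizing
        'InstanceCount': 1,
        'KeepJobFlowAliveWhenNoSteps': True,
    },
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
)
print(response['JobFlowId'])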
I could never find a clean approach to configure this at the time of EMR startup itself.
The main problem is that EMR initializes the schema for MySQL using the command:
/usr/lib/hive/bin/schematool -initSchema -dbType MySQL
which should be postgres for our case.
The following manual steps allow you to configure Postgres as the external metastore:
1) Start the EMR cluster with the Hive application, with default configurations.
2) Stop Hive using the command:
sudo stop hive-server2
3) Copy the postgresql-jdbc jar (stored in some S3 location) to /usr/lib/hive/lib/ on EMR.
4) Overwrite the default hive-site.xml in /usr/lib/hive/conf/ with a custom one containing the JDO configuration for the PostgreSQL instance running on the EC2 node.
5) Execute the command:
sudo /usr/lib/hive/bin/schematool -upgradeSchema -dbType postgres