Redshift External tables via Hive metastore

I have a Redshift DB set up, and we do periodic archival of the data into S3. I would like to create Redshift external tables on top of these archived files. The AWS documentation suggests that this can be done either via Athena or via a Hive metastore. Since Athena is quite expensive, I would like to get this done via the Hive metastore, but I'm struggling with the connectivity.
Below are the links of the steps that I followed:
https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_SCHEMA.html
https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html
Creating the external schema works fine, but while creating the table I get the following error:
Invalid operation: Hive Metastore error. HOST: XX.XXX.XXX.XX PORT: 9083 ERROR: Default TException.
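For reference, the statements I am running look roughly like this (the metastore host, IAM role, bucket, and column definitions are placeholders, not the real values):

-- External schema backed by the Hive metastore (placeholder host and role)
CREATE EXTERNAL SCHEMA hive_archive
FROM HIVE METASTORE
DATABASE 'archive_db'
URI 'XX.XXX.XXX.XX' PORT 9083
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftSpectrumRole';

-- External table over the archived files in S3 (placeholder columns and location)
CREATE EXTERNAL TABLE hive_archive.orders_archive (
    order_id INT,
    order_ts TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://my-archive-bucket/orders/';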
Any idea what can be done here?

Related

What S3 bucket does DBFS use? How can I get the S3 location of a DBFS path?

I am trying to migrate my Hive metadata to Glue. While migrating a Delta table, when I provide the same DBFS path, I get the error "Cannot create table: The associated location is not empty."
When I try to create the same Delta table on the S3 location, it works properly.
Is there a way to find the S3 location for the DBFS path the database is pointed at?
First configure the Databricks Runtime to use the AWS Glue Data Catalog as its metastore, and then migrate the Delta table.
Every Databricks deployment has a central Hive metastore accessible by all clusters to persist table metadata. Instead of using the Databricks Hive metastore, you have the option to use an existing external Hive metastore instance or the AWS Glue Catalog.
External Apache Hive Metastore
Using AWS Glue Data Catalog as the Metastore for Databricks Runtime
Databricks File System (DBFS) is a distributed file system mounted into a Databricks workspace and available on Databricks clusters. DBFS is an abstraction on top of scalable object storage and offers the following benefits:
Allows you to mount storage objects so that you can seamlessly access data without requiring credentials.
Allows you to interact with object storage using directory and file semantics instead of storage URLs.
Persists files to object storage, so you won’t lose data after you terminate a cluster.
Is there a way to find the S3 location for the DBFS path the database is pointed at?
You can access an AWS S3 bucket by mounting it through DBFS or by using the S3 APIs directly.
Reference: "Databricks - Amazon S3"
Hope this helps.

Moving data from RDS -> S3 using AWS Glue

I'm trying to load an entire table from a MySQL Database (RDS) to S3, through AWS Glue.
I have already configured the RDS connection and created the Glue table with a crawler.
Now I have to run an ETL job to load the table from RDS to S3. However, after following the procedure in the AWS documentation, I find some files in the S3 directory, but none of them is a JSON file (with headers and data) as I requested in the Glue job.
Where am I going wrong?
Thank you.

Setting Remote hive metastore on postgresql for EMR

I am trying to set up a PostgreSQL DB as an external Hive metastore for AWS EMR.
I have tried hosting it on both EC2 and RDS.
I have already tried the steps given here.
But it doesn't go through; EMR fails in the provisioning step with the message
On the master instance (instance-id), application provisioning failed
I could not decipher anything from the failure log.
I also copied the PostgreSQL JDBC jar into the paths
/usr/lib/hive/lib/ and /usr/lib/hive/jdbc/
in case EMR doesn't already have it, but still no help!
Then I set up the system by manually editing hive-site.xml and setting these properties:
javax.jdo.option.ConnectionURL
javax.jdo.option.ConnectionDriverName
javax.jdo.option.ConnectionUserName
javax.jdo.option.ConnectionPassword
datanucleus.fixedDatastore
datanucleus.schema.autoCreateTables
and had to run hive --service metatool -listFSRoot.
After these manual settings I was able to get EMR to use postgres db as remote metastore.
Is there any way I can make it work using the configuration file as mentioned in official documentation?
Edit:
The configuration I am using for the remote MySQL metastore:
classification=hive-site,properties=[javax.jdo.option.ConnectionURL=jdbc:mysql://[host]:3306/[dbname]?createDatabaseIfNotExist=true,javax.jdo.option.ConnectionDriverName=org.mariadb.jdbc.Driver,javax.jdo.option.ConnectionUserName=[user],javax.jdo.option.ConnectionPassword=[pass]]
I could never find a clean approach to configure this at the time of EMR startup itself.
The main problem is that EMR initializes the schema for MySQL using the command:
/usr/lib/hive/bin/schematool -initSchema -dbType MySQL
whereas it should use postgres in our case.
The following manual steps allow you to configure Postgres as the external metastore:
1) Start the EMR cluster with the Hive application and default configurations.
2) Stop Hive using the command:
sudo stop hive-server2
3) Copy the postgresql-jdbc jar (stored in some S3 location) to /usr/lib/hive/lib/ on EMR.
4) Overwrite the default hive-site.xml in /usr/lib/hive/conf/ with a custom one containing the JDO configuration for the PostgreSQL instance running on the EC2 node (a sketch of these properties follows after the steps).
5) Execute the command:
sudo /usr/lib/hive/bin/schematool -upgradeSchema -dbType postgres
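For reference, the JDO section of the custom hive-site.xml in step 4 looks roughly like this (host, database name, and credentials are placeholders mirroring the property names listed above; the driver class and port assume the standard PostgreSQL JDBC driver):

<!-- JDBC connection to the PostgreSQL metastore database (placeholder host/dbname) -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:postgresql://[host]:5432/[dbname]</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.postgresql.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>[user]</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>[pass]</value>
</property>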

Getting the Hive metastore URL to use in other systems

When I go into Hive on the command line, is there a way to get the Hive metastore URL that is being used?
I'm trying to connect another system to Hive but can't seem to figure out what the metastore URL is.
Here is the command:
hive> set hive.metastore.uris;
Here is the output:
hive.metastore.uris=thrift://sandbox.hortonworks.com:9083
Using set you can see all the Hadoop and Hive parameters that are in effect when the Hive CLI is launched.
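If you then want another system (for example, another Hive or Spark installation) to talk to the same metastore, that value typically goes into that system's hive-site.xml; a minimal sketch, reusing the URI from the output above:

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://sandbox.hortonworks.com:9083</value>
</property>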

How to copy a Hive external table to Redshift without using Data Pipeline

I'd like to upload a Hive external table to AWS Redshift directly via the command line. I don't want to use Data Pipeline. Do I have to upload the table to S3 first and then copy it into Redshift, or is there any way to do it directly?
You can load Redshift directly from remote hosts using SSH or, if you're using their EMR version of Hadoop, you can load directly from the HDFS file system.
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
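For example, the COPY-from-EMR form looks roughly like this (the cluster ID, table, HDFS path, delimiter, and IAM role below are placeholders):

copy orders_archive
from 'emr://j-EXAMPLECLUSTERID/hive/warehouse/orders/part-*'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
delimiter '\t';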