AWS Glue Data Catalog as Metastore for external services like Databricks

Let's say the data lake is on AWS, using S3 as storage and Glue as the data catalog.
We can then easily use Athena, Redshift, or EMR to query the data on S3 with Glue as the metastore.
My question is: is it possible to expose the Glue Data Catalog as the metastore for external services like Databricks hosted on AWS?

Databricks now provides documentation for using the Glue Data Catalog as the metastore. It is done in the following steps:
Create an IAM role and policy to access a Glue Data Catalog
Create a policy for the target Glue Catalog
Look up the IAM role used to create the Databricks deployment
Add the Glue Catalog IAM role to the EC2 policy
Add the Glue Catalog IAM role to a Databricks workspace
Launch a cluster with the Glue Catalog IAM role
Reference: https://docs.databricks.com/data/metastores/aws-glue-metastore.html.
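Once a cluster is launched this way with the documented Spark setting spark.databricks.hive.metastore.glueCatalog.enabled true, Spark resolves databases and tables through Glue. A minimal sanity check from a notebook might look like the sketch below; the database and table names are placeholders, not anything from the original question.

# Sanity check from a Databricks notebook after enabling the Glue Data Catalog
# as the metastore (spark.databricks.hive.metastore.glueCatalog.enabled true).
# "my_glue_db" and "my_table" are placeholders for objects in your Glue catalog.
spark.sql("SHOW DATABASES").show(truncate=False)
spark.sql("DESCRIBE TABLE my_glue_db.my_table").show(truncate=False)
spark.sql("SELECT COUNT(*) AS row_count FROM my_glue_db.my_table").show()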

There have been a couple of decent documentation/write-up pieces provided by Databricks (see the docs and the blog post), though they cover custom/legacy Hive metastore integration, not Glue itself.
Also, as a Plan B, it should be possible to inspect the table/partition definitions you have in the Databricks metastore and do one-way replication to Glue through the Java SDK (or maybe the other way around as well, mapping AWS API responses to sequences of CREATE TABLE / ADD PARTITION statements). Of course this is riddled with rather complex corner cases, like cascading partition/table deletions and such, but for some simple create-only scenarios it seems approachable at least; a rough sketch of the Glue-bound direction follows.
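The answer mentions the Java SDK; the same idea sketched with boto3 (Python) might look like the following. The database, table, columns, and S3 location are made-up placeholders, and this only covers the simple create-only case.

import boto3

# One-way, create-only replication sketch: register a table definition
# (as read from the Databricks/Hive metastore) in the Glue Data Catalog.
# All names, columns and the S3 location below are placeholders.
glue = boto3.client("glue", region_name="us-east-1")

glue.create_table(
    DatabaseName="analytics",  # must already exist in Glue (use create_database otherwise)
    TableInput={
        "Name": "events",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "PartitionKeys": [{"Name": "dt", "Type": "string"}],
        "StorageDescriptor": {
            "Columns": [
                {"Name": "event_id", "Type": "string"},
                {"Name": "event_ts", "Type": "timestamp"},
            ],
            "Location": "s3://my-datalake-bucket/events/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)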

Related

How to connect to AWS S3 from Spotfire

We have our data in an AWS S3 bucket and the respective schema defined in the Glue Catalog.
Right now Athena queries against the S3 bucket (with a well-defined schema) are possible.
We need to visualize this data from Spotfire.
What are the possible ways of achieving this?
I am a newbie to Spotfire.

Google Cloud Data Fusion: how can I load many tables to BigQuery in one pipeline

I want to load many tables that are in an AWS RDS MySQL server by using Cloud Data Fusion. Each table is larger than about 1 GB. I found the "multiple database table" plugin to load multiple tables, but it failed. Also, when I use the regular database source I can check my tables' schema; however, with the multiple database table plugin I can't find how to check a table's schema. How can I use this plugin, or is there any other way to load many tables in the Data Fusion service?
My pipeline settings were as follows.
I'm posting this Community Wiki as the OP didn't provide enough details to reproduce, but the information below might help someone.
There are a few ways to get your data using Cloud Data Fusion: you can use a pipeline, a plugin, a driver, and a few others depending on your needs.
On the internet you can find two very well described guides with examples.
If you would like information about Cloud Data Fusion with GCP products, you should read Bahadir Bulut's guide - How I used Google Cloud Data Fusion to create a data warehouse - Part 1 and Part 2. Data Fusion also provides 150+ preconfigured connectors and transformations, such as Amazon S3, SQS, Azure services, and many more.
Another well-described guide (which I guess would help the OP) covers configuring both Amazon and GCP resources and using pipelines: Building a Simple Batch Data Pipeline from AWS RDS to Google BigQuery - Part 1: Setting Up AWS Data Pipeline and Part 2: Setting up BigQuery Transfer Service and Scheduled Query. In short, this guide describes 2 main steps:
Extract data from MySQL RDS into S3 using the AWS Data Pipeline service
From S3, load the files into BigQuery using the BigQuery Transfer Service (a sketch follows below)
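For the second step, a minimal sketch with the google-cloud-bigquery-datatransfer Python client might look like this. The project, dataset, bucket path, credentials, and the S3 data-source parameter keys are assumptions here; check the Transfer Service documentation for the exact parameters of the amazon_s3 data source.

from google.cloud import bigquery_datatransfer

# Sketch: schedule a recurring S3 -> BigQuery transfer. Project, dataset,
# table, S3 path and AWS credentials are placeholders; the "params" keys are
# assumptions based on the amazon_s3 data source and should be verified.
client = bigquery_datatransfer.DataTransferServiceClient()
parent = client.common_project_path("my-gcp-project")

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",
    display_name="rds-export-from-s3",
    data_source_id="amazon_s3",
    params={
        "destination_table_name_template": "orders",
        "data_path": "s3://my-export-bucket/orders/*.csv",
        "access_key_id": "AKIA...",
        "secret_access_key": "REPLACE_ME",
        "file_format": "CSV",
    },
    schedule="every 24 hours",
)

created = client.create_transfer_config(parent=parent, transfer_config=transfer_config)
print("Created transfer:", created.name)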

Hive ORC ACID table on Azure Blob Storage possible for MERGE

On HDFS, a Hive ORC ACID table works for Hive MERGE with no issue.
On S3 it is not possible.
For Azure HDInsight, it is not clear to me from the docs whether such a table on Azure Blob Storage is possible. Seeking confirmation or otherwise.
I am pretty sure it is a no-go; see the update I added to the answer, however.
According to the official Azure HDInsight 4.0 overview documentation (see the component table there):
As far as I know, Hive MERGE requires MapReduce, but HDInsight does not support MapReduce for Hive, so it is also not possible.
UPDATE by question poster
"HDInsight 4.0 doesn't support MapReduce for Apache Hive. Use Apache Tez instead." So with Tez it will still work, and according to https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-version-release, Spark with Hive 3 and the Hive Warehouse Connector are also options.
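To illustrate the kind of statement this thread is about, a MERGE against a transactional ORC table issued from Python via PyHive might look like the sketch below. The host, credentials, and table names are placeholders; on HDInsight, HiveServer2 usually sits behind the cluster gateway (HTTPS), so the connection details will differ, and both tables must be ACID (transactional ORC) for MERGE to be accepted.

from pyhive import hive

# Placeholder connection; HDInsight normally requires HTTP transport through
# the cluster gateway rather than the plain binary port shown here.
conn = hive.Connection(host="headnode.example.com", port=10000, username="hive")
cur = conn.cursor()

# MERGE only works against ACID (transactional) ORC tables.
cur.execute("""
    MERGE INTO target_acid_orc AS t
    USING staging_updates AS u
    ON t.id = u.id
    WHEN MATCHED THEN UPDATE SET val = u.val
    WHEN NOT MATCHED THEN INSERT VALUES (u.id, u.val)
""")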

Amazon Athena and Confluent Schema Registry

We are planning to offload events from Kafka to S3 (e.g. via Kafka Connect). The target is to spin up a service (e.g. Amazon Athena) and provide a query interface on top of the exported Avro events. The obstacle is that the Amazon Athena Avro SerDe (which uses org.apache.hadoop.hive.serde2.avro.AvroSerDe) does not support the magic bytes that the Schema Registry uses for storing the schema ID. Do you know of any alternative that can play nicely with the Confluent Schema Registry?
Thanks!
Using S3 Connect's AvroConverter does not put any schema ID in the file; in fact, after the message is written, you lose the schema ID entirely.
We have lots of Hive tables that are working fine with these files, and users are querying them using Athena, Presto, SparkSQL, etc.
Note: if you want to use AWS Glue, S3 Connect doesn't (currently, as of 5.x) offer automatic Hive partition creation like the HDFS Connector, so you might want to look for alternatives if you want to use it that way. A sample connector configuration is sketched below.
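As one illustration of the setup being described, registering a Confluent S3 sink that writes plain Avro files (readable by Athena/Hive without the Schema Registry wire format) might look roughly like this. The Connect URL, topic, bucket, and Schema Registry address are placeholders.

import requests

# Sketch: create an S3 sink connector through the Kafka Connect REST API.
# Topic, bucket, region and the Schema Registry URL are placeholders.
connector = {
    "name": "s3-sink-events",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "events",
        "s3.bucket.name": "my-export-bucket",
        "s3.region": "us-east-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
        "flush.size": "1000",
        "key.converter": "org.apache.kafka.connect.storage.StringConverter",
        "value.converter": "io.confluent.connect.avro.AvroConverter",
        "value.converter.schema.registry.url": "http://schema-registry:8081",
    },
}

resp = requests.post("http://connect:8083/connectors", json=connector)
resp.raise_for_status()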

Importing data from AWS Athena to RDS instance

Currently I'm listening to events from AWS Kinesis and writing them to S3. Then I query them using AWS Glue and Athena.
Is there a way to import that data, possibly with some transformation, to an RDS instance?
There are several general approaches to this task.
Read data from an Athena query into a custom ETL script (using a JDBC connection) and load it into the database (a sketch of this approach follows the list)
Mount the S3 bucket holding the data as a file system (perhaps using s3fs-fuse), read the data with a custom ETL script, and push it to the RDS instance(s)
Download the data to be uploaded to the RDS instance to a filesystem using the AWS CLI or the SDK, process it locally, and then push it to RDS
As you suggest, use AWS Glue to import the data from Athena to the RDS instance. If you are building an application that is tightly coupled with AWS (and if you are using Kinesis and Athena, you are), such a solution makes sense.
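The first approach mentions a JDBC connection; a Python analogue of that custom ETL script, using boto3 for Athena and psycopg2 for an RDS Postgres target, could look like this sketch. The query, Athena output location, target table, and connection details are all placeholders.

import time
import boto3
import psycopg2

# Sketch of a small Athena -> RDS (Postgres) ETL. The query, the Athena output
# location, and the RDS connection details are all placeholders.
ATHENA_OUTPUT = "s3://my-athena-results/adhoc/"
QUERY = "SELECT event_id, event_ts FROM events_db.events WHERE dt = '2021-05-10'"

athena = boto3.client("athena", region_name="us-east-1")
s3 = boto3.client("s3", region_name="us-east-1")

# 1. Run the Athena query and wait for it to finish
qid = athena.start_query_execution(
    QueryString=QUERY,
    ResultConfiguration={"OutputLocation": ATHENA_OUTPUT},
)["QueryExecutionId"]
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)
assert state == "SUCCEEDED", state

# 2. Download the CSV result that Athena wrote to S3
bucket, prefix = ATHENA_OUTPUT.replace("s3://", "").split("/", 1)
s3.download_file(bucket, prefix + qid + ".csv", "/tmp/result.csv")

# 3. Bulk-load the CSV into the RDS Postgres instance
conn = psycopg2.connect(host="my-rds.xxxx.us-east-1.rds.amazonaws.com",
                        dbname="analytics", user="etl", password="REPLACE_ME")
with conn, conn.cursor() as cur, open("/tmp/result.csv") as f:
    cur.copy_expert("COPY events (event_id, event_ts) FROM STDIN WITH CSV HEADER", f)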
When connecting Glue to RDS, there are a couple of things to keep in mind (mostly on the networking side); a boto3 sketch of both follows the list:
Ensure that DNS hostnames are enabled in the VPC hosting the target RDS instance
You'll need to set up a self-referencing rule in the security group associated with the target RDS instance
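Both of these can be applied with boto3; the VPC and security-group IDs below are placeholders.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Enable DNS hostnames on the VPC that hosts the RDS instance (placeholder ID)
ec2.modify_vpc_attribute(VpcId="vpc-0123456789abcdef0",
                         EnableDnsHostnames={"Value": True})

# Self-referencing rule: allow the RDS security group to talk to itself on all
# TCP ports, which is what Glue's ENIs need (placeholder group ID)
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 0,
        "ToPort": 65535,
        "UserIdGroupPairs": [{"GroupId": "sg-0123456789abcdef0"}],
    }],
)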
For some examples of code targeting a relational database, see the following tutorials.
One approach for Postgres:
Install the S3 extension in Postgres:
psql=> CREATE EXTENSION aws_s3 CASCADE;
Run the query in Athena and find the CSV result file location in S3 (the S3 output location is in the Athena settings; you can also inspect the "Download results" button to get the S3 path)
Create your table in Postgres
Import from S3:
SELECT aws_s3.table_import_from_s3(
'newtable', '', '(format csv, header true)',
aws_commons.create_s3_uri('bucketname', 'reports/Unsaved/2021/05/10/aa9f04b0-d082-328g-5c9d-27982d345484.csv', 'us-east-1')
);
If you want to convert empty values to null, you can use this: (format csv, FORCE_NULL (columnname), header true)
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/PostgreSQL.Procedural.Importing.html