Is it possible to access metadata for data on AWS S3 (and possibly other data sources) in Presto without using Hive or Athena? I did not see any specific documentation.
Thank you.
Related
We have our data in an AWS S3 bucket, with the corresponding schema defined in the Glue Catalog.
Right now, Athena queries against the S3 bucket (with its well-defined schema) are possible.
We need to visualize this data from Spotfire.
What are the possible ways of achieving this?
I am a newbie to Spotfire.
We are planning to offload events from Kafka to S3 (e.g. via Kafka Connect). The target is to spin up a service (e.g. like Amazon Athena) and provide a query interface on top of the exported Avro events. The obstacle is that the Amazon Athena Avro SerDe (which uses org.apache.hadoop.hive.serde2.avro.AvroSerDe) does not support the magic bytes that Schema Registry uses for storing the schema ID. Do you know of any alternative that plays nicely with the Confluent Schema Registry?
Thanks!
Using S3 Connect's AvroConverter does not put any schema ID in the file. In fact, after the message is written, you lose the schema ID entirely.
We have lots of Hive tables that are working fine with these files, and users are querying them using Athena, Presto, SparkSQL, etc.
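For reference, here is a rough sketch of what an Athena table over the connector's plain Avro output can look like. The table name, columns, schema literal, and S3 path below are made up for illustration; adjust them to your data.
-- Hypothetical example: adjust columns, schema literal, and location to your topic's layout
CREATE EXTERNAL TABLE IF NOT EXISTS events (
  id     string,
  amount double,
  ts     bigint
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
  'avro.schema.literal' = '{"type":"record","name":"Event","fields":[{"name":"id","type":"string"},{"name":"amount","type":"double"},{"name":"ts","type":"long"}]}'
)
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 's3://my-bucket/topics/events/';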
Note: If you wanted to use AWS Glue, S3 Connect doesn't (currently, as of 5.x) offer automatic Hive partition creation like the HDFS Connector, so you might want to look for alternatives if you wanted to use it that way.
Currently I’m listening to events from AWS Kinesis and writing them to S3. Then I query them using AWS Glue and Athena.
Is there a way to import that data, possibly with some transformation, to an RDS instance?
There are several general approaches you can take for this task.
Read data from an Athena query into a custom ETL script (using a JDBC connection) and load it into the database
Mount the S3 bucket holding the data to a file system (perhaps using s3fs-fuse), read the data using a custom ETL script, and push it to the RDS instance(s)
Download the data to a local filesystem using the AWS CLI or the SDK, process it locally, and then push it to the RDS instance (see the Athena CTAS sketch at the end of this answer for the export/transform step)
As you suggest, use AWS Glue to import the data from Athena to the RDS instance. If you are building an application that is tightly coupled with AWS (and since you are using Kinesis and Athena, you are), then such a solution makes sense.
When connecting Glue to RDS, there are a couple of things to keep in mind (mostly on the networking side):
Ensure that DNS hostnames are enabled in the VPC hosting the target RDS instance
You'll need to set up a self-referencing rule in the security group associated with the target RDS instance
For some examples of code targeting a relational database, see the tutorials in the AWS Glue documentation.
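If the transformation itself can be expressed in SQL, one option is to run it as an Athena CTAS query that writes the result as CSV to a dedicated S3 prefix, which you can then download or import into RDS. A minimal sketch; the table names, columns, and bucket path below are made up:
CREATE TABLE transformed_events
WITH (
  format = 'TEXTFILE',            -- plain delimited text output
  field_delimiter = ',',          -- comma-separated (no quoting or header row)
  external_location = 's3://my-bucket/exports/transformed_events/'
) AS
SELECT id,
       from_unixtime(ts) AS event_time,   -- example transformation
       amount
FROM raw_events
WHERE amount > 0;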
One approach for Postgres:
Install the S3 extension in Postgres:
psql=> CREATE EXTENSION aws_s3 CASCADE;
Run the query in Athena and find the CSV result file location in S3 (the S3 output location is configured in the Athena settings). You can also inspect the "Download results" button to get the S3 path.
Create your table in Postgres
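For example, assuming the Athena result has three columns (the names and types below are made up, so match them to your CSV):
CREATE TABLE newtable (
  id         text,
  event_time timestamp,
  amount     numeric
);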
Import from S3:
SELECT aws_s3.table_import_from_s3(
  'newtable',                      -- target table name
  '',                              -- column list ('' means all columns)
  '(format csv, header true)',     -- COPY options
  aws_commons.create_s3_uri('bucketname', 'reports/Unsaved/2021/05/10/aa9f04b0-d082-328g-5c9d-27982d345484.csv', 'us-east-1')
);
If you want to convert empty values to null, you can use this: (format csv, FORCE_NULL (columnname), header true)
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/PostgreSQL.Procedural.Importing.html
Let's say the data lake is on AWS, using S3 as storage and Glue as the data catalog.
So we can easily use Athena, Redshift, or EMR to query the data on S3 using Glue as the metastore.
My question is: is it possible to expose the Glue Data Catalog as a metastore for external services like Databricks hosted on AWS?
Databricks now provides documentation for using the Glue Data Catalog as the metastore. It can be done by following these steps:
Create an IAM role and policy to access a Glue Data Catalog
Create a policy for the target Glue Catalog
Look up the IAM role used to create the Databricks deployment
Add the Glue Catalog IAM role to the EC2 policy
Add the Glue Catalog IAM role to a Databricks workspace
Launch a cluster with the Glue Catalog IAM role
Reference: https://docs.databricks.com/data/metastores/aws-glue-metastore.html.
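Once a cluster is up with the Glue Catalog IAM role attached, the Glue databases and tables should appear as ordinary metastore objects in Spark SQL. A quick sanity check from a notebook (the database/table names below are hypothetical):
SHOW DATABASES;
SHOW TABLES IN my_glue_database;
SELECT * FROM my_glue_database.events LIMIT 10;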
There have been a couple of decent documentation/write-up pieces provided by Databricks (see the docs and the blog post), though they cover custom/legacy Hive metastore integration, not Glue itself.
Also, as a Plan B, it should be possible to inspect the table/partition definitions you have in the Databricks metastore and do a one-way replication to Glue through the Java SDK (or maybe the other way around as well, mapping AWS API responses to sequences of create table / create partition statements). Of course, this is riddled with rather complex corner cases, like cascading partition/table deletions and such, but for some simple create-only scenarios it seems approachable at least.
I'm trying to enable basic SQL querying of CSV files located in an S3 directory. Presto seemed like a natural fit (the files are tens of GB). As I went through the Presto setup, I tried creating a table using the Hive connector. It was not clear to me whether I only needed the Hive metastore to save my table configurations for Presto, or whether I had to create the tables there first.
The documentation makes it seem that you can use Presto without having to CONFIGURE Hive, while still using Hive syntax. Is that accurate? My experience so far is that I have not been able to connect to AWS S3.
Presto syntax is similar to Hive syntax. For most simple queries, the identical syntax would function in both. However, there are some key differences that make Presto and Hive not entirely the same thing. For example, in Hive, you might use LATERAL VIEW EXPLODE, whereas in Presto you'd use CROSS JOIN UNNEST. There are many such examples of nuanced syntactical differences between the two.
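For example, to flatten an array column (the table and column names below are made up):
-- Hive
SELECT t.id, x.item
FROM events t
LATERAL VIEW EXPLODE(t.items) x AS item;

-- Presto
SELECT t.id, x.item
FROM events t
CROSS JOIN UNNEST(t.items) AS x(item);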
It is not possible to use vanilla Presto to analyze data on S3 without Hive. Presto provides only a distributed SQL execution engine; it lacks metadata information about tables. Thus, the Presto coordinator needs the Hive metastore to retrieve table metadata in order to parse and execute a query.
However, you can use AWS Athena, which is a managed Presto service, to run queries on top of S3.
Another option: the recent 0.198 release of Presto adds the capability to connect to AWS Glue and retrieve table metadata for files in S3.
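Once a metastore (Hive or Glue) is wired up for the hive catalog, registering the CSV files is just a DDL statement in Presto. A rough sketch with a made-up schema, columns, and bucket path (newer Presto releases offer a CSV storage format, which expects all columns to be varchar):
CREATE TABLE hive.default.my_csv_data (
  id     varchar,
  name   varchar,
  amount varchar
)
WITH (
  format = 'CSV',
  external_location = 's3://my-bucket/csv-data/'
);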
I know it's been a while, but if this question is still outstanding, have you considered using Spark? Spark connects easily with out-of-the-box methods and can query/process data living in S3/CSV formats.
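A minimal Spark SQL sketch of that idea, assuming S3 access is already configured on the cluster (the bucket path and table name are made up):
-- Register the CSV files as a table (schema is inferred here for brevity)
CREATE TABLE my_csv_data
USING csv
OPTIONS (path 's3a://my-bucket/csv-data/', header 'true', inferSchema 'true');

-- Then query as usual
SELECT count(*) FROM my_csv_data;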
Also, I'm curious: what solution did you end up implementing to resolve your issue?