Presto and Hive on Amazon S3

I'm trying to enable basic SQL querying of CSV files located in an S3 directory. Presto seemed like a natural fit, since the files are tens of GB. As I went through the Presto setup, I tried creating a table using the Hive connector. It was not clear to me whether I only need the Hive metastore to store my table definitions for Presto, or whether I have to create the tables in Hive first.
The documentation makes it seem that you can use Presto without configuring Hive itself, just using Hive syntax. Is that accurate? In my experience so far, I have not been able to connect to AWS S3.

Presto's syntax is similar to Hive's; most simple queries will run unchanged in both. However, there are some key differences that make Presto and Hive not entirely interchangeable. For example, where Hive uses LATERAL VIEW EXPLODE, Presto uses CROSS JOIN UNNEST, and there are many such nuanced syntactic differences between the two.
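As a rough sketch (the table events and its array column tags are invented for illustration), the same flattening query looks like this in each dialect:

    -- Hive: flatten an array column with LATERAL VIEW EXPLODE
    SELECT e.id, x.tag
    FROM events e
    LATERAL VIEW EXPLODE(e.tags) x AS tag;

    -- Presto: the equivalent uses CROSS JOIN UNNEST
    SELECT e.id, t.tag
    FROM events e
    CROSS JOIN UNNEST(e.tags) AS t (tag);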

It is not possible to use vanilla Presto to analyze data on S3 without Hive. Presto provides only the distributed execution engine; it holds no metadata about tables. The Presto coordinator therefore needs the Hive metastore to retrieve table metadata in order to parse and execute a query.
However, you can use AWS Athena, which is managed Presto, to run queries on top of S3.
Another option: as of the recent 0.198 release, Presto can connect to AWS Glue and retrieve table metadata for files in S3.
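For concreteness, here is a hedged sketch of what the catalog configuration can look like. In Presto, a Hive catalog is a properties file; the hostnames below are placeholders, not values from the question:

    # etc/catalog/hive.properties -- metadata served by a Hive metastore
    connector.name=hive-hadoop2
    hive.metastore.uri=thrift://metastore-host:9083

    # Alternative (Presto 0.198+): let AWS Glue serve the metadata instead
    # connector.name=hive-hadoop2
    # hive.metastore=glue

In both cases Presto only needs the metadata service; the Hive execution engine itself is not involved.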

I know it's been a while, but if this question is still outstanding, have you considered using Spark? Spark connects to S3 with out-of-the-box methods and can query and process CSV data living there directly.
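If you want to stay in SQL, a minimal sketch of that approach with Spark SQL could look like the following (the bucket path, options, and column name are assumptions for illustration):

    -- Register CSV files under an S3 prefix as a temporary view
    CREATE TEMPORARY VIEW my_csv
    USING csv
    OPTIONS (path 's3a://my-bucket/data/', header 'true', inferSchema 'true');

    -- Then query them like any table
    SELECT col_a, COUNT(*) AS n
    FROM my_csv
    GROUP BY col_a;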
Also, I'm curious: what solution did you end up implementing to resolve your issue?

Related

Is it possible to reduce Redshift load with Presto?

I have a quite highly utilized Redshift cluster and a PrestoDB cluster.
Let's assume that it's impossible to rescale the Redshift cluster in my case.
Does it make sense to set up the Redshift connector for Presto and run some complex queries on Presto instead of Redshift?
Would Presto push the whole query down to Redshift, or just load the data from it (quite a cheap operation, I guess) and do the joins and aggregations on the Presto side?
It appears that the Redshift connector for Presto simply issues queries against an Amazon Redshift cluster. Therefore, it would create load on both Presto and Redshift.
It would be useful when trying to join Redshift data to some other type of data that is accessible to Presto, but it would not reduce the load on a Redshift cluster.
One option, instead, would be to store the data in S3 rather than inside the Redshift cluster. That data could then be accessed both as a Redshift external table and as a Presto table, which would allow you to "shift load" between the two systems.
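A minimal sketch of the Redshift side of that setup, assuming the data is already in S3 and an IAM role with S3 access is attached to the cluster (all names, ARNs, and paths are placeholders):

    -- Expose an S3-backed schema to Redshift via Spectrum
    CREATE EXTERNAL SCHEMA spectrum_schema
    FROM DATA CATALOG
    DATABASE 'spectrum_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;

    -- Define a table over the files (Parquet here; other formats work too)
    CREATE EXTERNAL TABLE spectrum_schema.sales (
        sale_id   BIGINT,
        amount    DOUBLE PRECISION,
        sale_date DATE
    )
    STORED AS PARQUET
    LOCATION 's3://my-bucket/sales/';

The same S3 location can then be registered as a table in Presto's Hive catalog, so both engines read one copy of the data.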

Can dbt connect to different databases in the same project?

It seems dbt only works with a single database.
If my data is in a different database, will that still work? For example, if my data lake uses Delta, but I want to run dbt on Redshift, would dbt still work for this case?
To use dbt, you need to already be able to select from your raw data in your warehouse.
In general, dbt is not an ETL tool:
[dbt] doesn’t extract or load data, but it’s extremely good at transforming data that’s already loaded into your warehouse. This “transform after load” architecture is becoming known as ELT (extract, load, transform). dbt is the T in ELT. [reference]
So no, you cannot use dbt with Redshift and Deltalake at the same time. Instead, use a separate service to extract and load data into your Redshift cluster — dbt is agnostic about which tool you use to do this.
There is a nuance to this answer: you could use dbt to select from external files in S3 or GCS, so long as you've set up your data warehouse to be able to read those files. For Redshift, this means setting up Redshift Spectrum. (For Snowflake, this means setting up an external table; on BigQuery, you can also query cloud storage data directly.)
So if the data in your Delta lake lives in S3, and you set up your Redshift cluster to be able to read it, you can use dbt to transform that data!
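Since a dbt model is just a SELECT statement, a model over such an external table might look like this (the schema and table names are illustrative, not from the question):

    -- models/stg_sales.sql -- a hypothetical dbt model reading a
    -- Redshift Spectrum external table; dbt materializes the result
    SELECT
        sale_id,
        amount,
        sale_date
    FROM spectrum_schema.sales
    WHERE sale_date >= '2020-01-01'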
You can use Trino with dbt to connect to multiple databases in the same project.
The GitHub example project https://github.com/victorcouste/trino-dbt-demo contains a fully working setup that you can replicate and adapt to your needs.
I would say that dbt doesn't have an execution engine, so you cannot use it to move data from one source to another; it doesn't process data itself, it only sends SQL commands to the database.
In any case, if you want to move data from S3 to Redshift, you could use Redshift Spectrum, which lets you query S3 as external tables. That way you'll be able to use dbt on S3 and Redshift data from the same system.
@Willie Chen, the short answer is yes, you can. The more accurate answer is that this is not the intent of dbt: as a tool it is meant for the transform step, operating on data that already exists in a data warehouse. I agree that you should use Redshift Spectrum for this.
Luther

Query against two separate Redshift clusters in a single query?

I am designing my database structure, and am wondering whether it is possible to run a single query against two separate Redshift clusters.
If it is possible, are there any limitations regarding regions, availability zones, VPC groups, etc.?
No, it's not possible in Redshift directly. Additionally, you cannot query across multiple databases on the same cluster.
UPDATE: Redshift announced a preview for cross database queries on 2020-10-15 - https://docs.aws.amazon.com/redshift/latest/dg/cross-database-overview.html
You could use an external tool such as Amazon Athena or Presto running on an EMR cluster to do this. You would define each Redshift cluster as an external data source. Be careful though, you will lose most of Redshift's performance optimizations and a lot of data will have to be pulled back into Athena / Presto to answer your queries.
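For example, in Presto each cluster would be registered as its own catalog (the catalog, schema, and table names below are placeholders), after which one query can join across them:

    -- etc/catalog/redshift_a.properties and redshift_b.properties each
    -- point the Redshift connector at one cluster; then:
    SELECT a.customer_id, b.order_total
    FROM redshift_a.public.customers AS a
    JOIN redshift_b.public.orders AS b
      ON a.customer_id = b.customer_id;

Note that both tables are pulled into Presto to perform the join, which is exactly the performance caveat described above.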
As an alternative to cross-cluster queries, consider placing your data onto S3 in well partitioned Parquet or ORC files and using Redshift Spectrum (or Amazon Athena) to query them. This approach allows multiple clusters to query a common data set while maintaining good query performance. https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/
Using federated queries in Amazon Redshift, a second cluster's tables can be accessed as an external schema.
You can refer to the documentation: https://docs.aws.amazon.com/redshift/latest/dg/federated_query_example.html
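A hedged sketch of that setup: because Redshift speaks the PostgreSQL protocol, the external schema is declared with FROM POSTGRES (the endpoint, ARNs, and names are placeholders; check the linked documentation for the officially supported targets):

    CREATE EXTERNAL SCHEMA other_cluster
    FROM POSTGRES
    DATABASE 'dev' SCHEMA 'public'
    URI 'other-cluster.example.us-east-1.redshift.amazonaws.com' PORT 5439
    IAM_ROLE 'arn:aws:iam::123456789012:role/federated-role'
    SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:my-secret';

    SELECT * FROM other_cluster.some_table LIMIT 10;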

Use hive metastore service WITHOUT Hadoop/HDFS

I know the question is a little bit strange. I love Hadoop and HDFS, but have recently been working on SparkSQL with the Hive metastore.
I want to use SparkSQL as a vertical SQL engine to run OLAP queries across different data sources like relational databases, Mongo, and Elastic, without an ETL process. I then register the different schemas as external tables in the metastore with the corresponding Hive storage handlers.
Moreover, HDFS is not used as a data source in my work, and MapReduce has already been replaced by the Spark engine. It sounds to me as if Hadoop/HDFS is useless except as a base for installing Hive, and I don't want to deploy it all.
I wonder: if I only start the Hive metastore service, without Hadoop/HDFS, to support SparkSQL, what kind of issues will happen? Would I be putting myself into the jungle?
What you need is "Hive Local Mode" (search for "Hive, Map-Reduce and Local-Mode" on that page).
Also, this may help.
This configuration is only suggested if you are experimenting locally, but in that case you only need the metastore.
Also, from here:
Spark SQL uses the Hive metastore, even when we don't configure it to. When not configured, it uses a default Derby DB as the metastore.
So this seems to be entirely legitimate:
Arrange your metastore in Hive.
Start Hive in local mode.
Make Spark use the Hive metastore.
Use Spark as an SQL engine for all data sources supported by Hive (a minimal configuration sketch follows).
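Under those assumptions, the relevant settings (standard Hive/Spark keys, shown as key = value pairs with placeholder values) would be roughly:

    # hive-site.xml: metastore without HDFS -- metadata in Derby (or any
    # RDBMS), warehouse directory on the local filesystem
    javax.jdo.option.ConnectionURL = jdbc:derby:;databaseName=metastore_db;create=true
    hive.metastore.warehouse.dir   = file:///tmp/hive/warehouse

    # Spark side: point SparkSQL at the running metastore service
    spark.sql.catalogImplementation = hive
    hive.metastore.uris             = thrift://localhost:9083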

How to set up Hive metastore off Redshift

I couldn't find a way to set up a Hive metastore on Redshift. I am wondering if anyone has tried this. Since Redshift supports PostgreSQL, maybe it is possible. Please share if you have any experience.
I am new to Hive and am using CDH5.4.
Redshift is a decision-support (OLAP) system and by definition isn't suitable for hosting the Hive metastore, which needs a transactional database. Use the RDS service for that purpose.
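For example, a hive-site.xml sketch pointing the metastore database at an RDS PostgreSQL instance (shown as key = value pairs; the endpoint and credentials are placeholders):

    javax.jdo.option.ConnectionURL        = jdbc:postgresql://my-rds-endpoint.us-east-1.rds.amazonaws.com:5432/metastore
    javax.jdo.option.ConnectionDriverName = org.postgresql.Driver
    javax.jdo.option.ConnectionUserName   = hive
    javax.jdo.option.ConnectionPassword   = ********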