I have quite highly utilised Redshift and PrestoDB clusters.
Let's assume that it's impossible to rescale the Redshift cluster in my case.
Does it make sense to set up the Redshift connector for Presto and run some complex queries on Presto instead of Redshift?
Would Presto propagate the whole query to Redshift, or just load the data from it (quite a cheap operation, I guess) and do the joins and aggregations on the Presto side?
It appears that the Redshift connector for Presto simply issues queries to an Amazon Redshift cluster over JDBC: Redshift scans the tables, while Presto pulls the rows back and performs joins and aggregations on its own workers. Therefore, it would be creating load on both Presto and Redshift.
It would be useful when trying to join Redshift data to some other type of data that is accessible to Presto, but it would not reduce the load on a Redshift cluster.
Instead, one option would be to store data in S3 instead of inside the Redshift cluster. This data could then be accessed as a Redshift external table and as a Presto table. This would allow you to "shift load" between the two systems.
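For the join use case mentioned above, here is a minimal sketch of a cross-source query, assuming a Presto catalog named redshift configured with the Redshift connector and a hive catalog over S3; the schema, table and column names are placeholders:
    -- Presto pulls the Redshift rows through the connector and performs the
    -- join and aggregation on its own workers.
    SELECT c.customer_id, sum(o.amount) AS total_spent
    FROM redshift.public.customers AS c
    JOIN hive.default.orders AS o
      ON c.customer_id = o.customer_id
    GROUP BY c.customer_id;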
I have huge data from different DB sources (Oracle, Mongo, Cassandra) and also event data available in Kafka. I am using Tableau for analytics and facing performance issues with the huge data. So I am planning to store the data in some other way and keep using Tableau for visualization. I have multiple options now and need some help to finalize the approach.
Option 1:-
Read the DB data, store it in Parquet files, expose it over Spark SQL, HiveQL or Presto SQL, and let Tableau connect to that SQL layer.
Option 2:-
Read the DB data, store it in Parquet files in S3, use AWS Athena for analytics, and let Tableau connect to Athena.
Option 3:-
Read the DB data, store it in Parquet files in S3, then load it into Redshift for analytics, and let Tableau connect to Redshift.
I am not sure whether any of the above approaches will also be a good solution for streaming data (Kafka) analytics.
Note:- I have multiple big tables and need joins between them.
I understand that you have huge data from different sources and access to AWS, and that you plan to use this data for analytics and dashboarding via Tableau.
Options 1 and 2
Your Options 1 and 2 are basically the same, as AWS Athena and Hive are based on the same principle: creating tables over flat files via a metastore that stores the table definitions. Both Athena's Presto engine and Spark are distributed and highly efficient on huge data (terabytes of data). The main difference is the pricing model (Athena is serverless and priced per data scanned per request, whereas Spark implies an infrastructure cost).
However, both options may not perform well for self-service BI, as they are not OLAP systems; they are better used for ad hoc queries over huge data.
You may also have trouble managing your data model with flat files and tables or views over them (data storage and compression won't be optimized for each table, which may impact Tableau performance).
Option 3
Option 3 is better as it is based on Redshift, which is designed as an OLAP system. You can connect Tableau directly to Redshift, but you may suffer from latency and have trouble managing your cluster load depending on the number of users and/or requests. Still, it can work the way you describe it.
Then, if you run into performance issues, you can create Tableau data source extracts from Redshift later on. You can also add an intermediate database that stores pre-aggregated queries (i.e. data marts) and connect Tableau directly to it, which avoids running the same query on Redshift each time a dashboard is opened in Tableau (Redshift also caches query results).
Then, as you need to perform multiple joins, you can optimize Redshift's data layout for such queries by choosing the right distribution and sort keys.
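For example, a minimal sketch of a Redshift table tuned for join-heavy queries; the table, columns and key choices are placeholders to adapt to your own data:
    CREATE TABLE sales (
      customer_id bigint,
      order_date  date,
      amount      numeric(12,2)
    )
    DISTKEY (customer_id)   -- co-locates rows that are joined on customer_id
    SORTKEY (order_date);   -- speeds up range filters and merge joins on order_date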
To conclude, you can also access flat files directly from Redshift using Redshift Spectrum (via the Athena/Glue metastore).
Documentation:
https://docs.aws.amazon.com/redshift/latest/dg/best-practices.html
https://aws.amazon.com/fr/athena/pricing/
From a user perspective, Athena and BigQuery both accept a SQL-like query, they both query files stored on disk (without needing a relational database to be set up first), and they both return results (usually very quickly). Do such technologies have a name? I.e. is there a generic term for technologies like AWS Athena and GCP BigQuery?
They are both distributed SQL Query Engines for big [in-place] data. Athena is based on Presto, which declares itself to be a Distributed SQL Query Engine for Big Data.
Apache Drill was inspired by the design behind BigQuery (Google's Dremel) and defines itself as a Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage.
The three things that define them are the ability to run SQL, their distributed nature so they can operate at scale for interactive queries, and the ability to query data without having to ingest it first.
Note that in the case of BigQuery, the data initially had to be ingested, and ingestion is still the preferred way of working, even though querying data directly from GCS has been available for a number of years. Athena, by contrast, only works with external tables.
Google BigQuery is a serverless data warehouse that supports super-fast SQL queries for analyzing data in parallel. Amazon Athena is a serverless interactive query service that allows you to conveniently analyze data stored in Amazon Simple Storage Service (S3) using standard SQL, also in parallel.
Both technologies could be considered MPP (massively parallel processing) systems, as both process analytical queries in parallel.
I have huge CSV files in zipped format in S3. I need just a subset of the columns from the data for machine learning purposes. How should I extract those columns into EMR and then into Redshift without transferring the whole files?
My idea is to process all the files in EMR, extract the subset, and push the required columns into Redshift. But this is taking a lot of time. Please let me know if there is a more optimized way of handling this data.
Edit: I am trying to automate this pipeline using Kafka. Let's say a new folder is added to S3; it should be processed in EMR using Spark and stored into Redshift without any manual intervention.
Edit 2: Thanks for the input, everyone. I was able to create a pipeline from S3 to Redshift using PySpark in EMR. Currently, I am trying to integrate Kafka into this pipeline.
I would suggest:
Create an external table in Amazon Athena (an AWS Glue crawler can do this for you) that points to where your data is stored
Use CREATE TABLE AS to select the desired columns and store them in a new table (with the data automatically stored in Amazon S3), as sketched below
Amazon Athena can handle gzip format, but you'll have to check whether this includes zip format.
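A rough sketch of those two steps, assuming hypothetical bucket, database and column names that you would adapt to your data:
    -- 1. External table over the raw CSV files (a Glue crawler can generate this DDL for you).
    CREATE EXTERNAL TABLE raw_events (
      id string,
      col_a string,
      col_b string,
      col_c string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://my-bucket/raw/'
    TBLPROPERTIES ('skip.header.line.count' = '1');

    -- 2. CTAS keeps only the columns you need and writes them back to S3 as Parquet.
    CREATE TABLE ml_subset
    WITH (
      format = 'PARQUET',
      external_location = 's3://my-bucket/ml-subset/'
    ) AS
    SELECT id, col_a, col_b
    FROM raw_events;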
See:
CREATE TABLE - Amazon Athena
Examples of CTAS Queries - Amazon Athena
Compression Formats - Amazon Athena
If the goal is to materialise a subset of the file columns in a table in Redshift, then one option you have is Redshift Spectrum, which will allow you to define an "external table" over the CSV files in S3.
You can then select the relevant columns from the external tables and insert them into actual Redshift tables.
You'll have an initial cost hit when Spectrum scans the CSV files to query them, which will vary depending on how big the files are, but that's likely to be significantly less than spinning up an EMR cluster to process the data.
Getting Started with Amazon Redshift Spectrum
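A rough sketch of that approach, assuming an existing Glue database and IAM role; the schema, table, column and bucket names below are placeholders:
    -- External schema backed by the Glue/Athena data catalog.
    CREATE EXTERNAL SCHEMA spectrum
    FROM DATA CATALOG
    DATABASE 'my_glue_db'
    IAM_ROLE 'arn:aws:iam::<account-id>:role/MySpectrumRole';

    -- External table over the CSV files in S3 (Spectrum reads them in place).
    CREATE EXTERNAL TABLE spectrum.raw_events (
      id varchar(64),
      col_a varchar(256),
      col_b varchar(256)
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 's3://my-bucket/raw/';

    -- Materialise only the needed columns into a regular Redshift table.
    CREATE TABLE ml_subset AS
    SELECT id, col_a
    FROM spectrum.raw_events;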
I am designing my DB structure, and wondering if it is possible to run a single query against two separate Redshift clusters?
If it is possible, are there any limitations on regions, availability zones, VPC groups, etc.?
No, it's not possible in Redshift directly. Additionally, you cannot query across multiple databases on the same cluster.
UPDATE: Redshift announced a preview of cross-database queries on 2020-10-15 - https://docs.aws.amazon.com/redshift/latest/dg/cross-database-overview.html
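With that preview feature, a query run in one database can reference tables in another database on the same cluster using three-part names; the names below are placeholders:
    -- database.schema.table notation across databases on the same cluster.
    SELECT count(*)
    FROM other_db.public.customers;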
You could use an external tool such as Amazon Athena or Presto running on an EMR cluster to do this. You would define each Redshift cluster as an external data source. Be careful though, you will lose most of Redshift's performance optimizations and a lot of data will have to be pulled back into Athena / Presto to answer your queries.
As an alternative to cross-cluster queries, consider placing your data onto S3 in well partitioned Parquet or ORC files and using Redshift Spectrum (or Amazon Athena) to query them. This approach allows multiple clusters to query a common data set while maintaining good query performance. https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/
Using federated queries in Amazon Redshift, a second cluster's tables can be accessed as an external schema.
You can refer to the documentation: https://docs.aws.amazon.com/redshift/latest/dg/federated_query_example.html
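A rough sketch following the shape of the statement in that example; the endpoint, database, IAM role and secret ARN are placeholders, and you should verify that this works against your second cluster:
    CREATE EXTERNAL SCHEMA second_cluster
    FROM POSTGRES
    DATABASE 'dev'
    SCHEMA 'public'
    URI 'second-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com'
    PORT 5439
    IAM_ROLE 'arn:aws:iam::<account-id>:role/MyFederatedQueryRole'
    SECRET_ARN 'arn:aws:secretsmanager:us-east-1:<account-id>:secret:second-cluster-creds';

    -- The second cluster's tables then appear under the external schema:
    SELECT count(*) FROM second_cluster.my_table;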
I'm trying to enable basic SQL querying of CSV files located in an S3 directory. Presto seemed like a natural fit (the files are tens of GB). As I went through the Presto setup, I tried creating a table using the Hive connector. It was not clear to me whether I only needed the Hive metastore to save my table definitions for Presto, or whether I had to create the tables in Hive first.
The documentation makes it seem that you can use Presto without having to configure Hive, but using Hive syntax. Is that accurate? My experience so far is that I have not been able to connect to AWS S3.
Presto syntax is similar to Hive syntax. For most simple queries, the identical syntax would function in both. However, there are some key differences that make Presto and Hive not entirely the same thing. For example, in Hive, you might use LATERAL VIEW EXPLODE, whereas in Presto you'd use CROSS JOIN UNNEST. There are many such examples of nuanced syntactical differences between the two.
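For instance, exploding an array column (assuming a table events with an id column and an array column items):
    -- Hive
    SELECT id, item
    FROM events
    LATERAL VIEW EXPLODE(items) t AS item;

    -- Presto
    SELECT id, item
    FROM events
    CROSS JOIN UNNEST(items) AS t (item);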
It is not possible to use vanilla Presto to analyze data on S3 without a Hive metastore. Presto provides only the distributed execution engine; it lacks metadata information about tables. Thus, the Presto coordinator needs the Hive metastore to retrieve table metadata in order to parse and execute a query.
However, you can use AWS Athena, which is a managed service based on Presto, to run queries on top of S3.
Another option: since the 0.198 release, Presto can connect to the AWS Glue Data Catalog and retrieve table metadata for files in S3.
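Once a metastore (Hive or Glue) is configured behind Presto's Hive connector, registering the CSV files looks roughly like this; the catalog, schema, bucket and column names are assumptions, and note that the Hive connector's CSV format expects varchar columns:
    CREATE TABLE hive.default.my_csv (
      col_a varchar,
      col_b varchar,
      col_c varchar
    )
    WITH (
      format = 'CSV',
      external_location = 's3://my-bucket/csv/'
    );

    -- The files can then be queried with regular Presto SQL:
    SELECT col_a, count(*) FROM hive.default.my_csv GROUP BY col_a;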
I know it's been a while, but if this question is still outstanding, have you considered using Spark? Spark connects easily with out-of-the-box methods and can query and process CSV data living in S3.
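If you go that route, a minimal Spark SQL sketch (the path and options are placeholders, and the S3A credentials/Hadoop configuration are assumed to be in place):
    CREATE TEMPORARY VIEW my_csv
    USING csv
    OPTIONS (path 's3a://my-bucket/csv/', header 'true', inferSchema 'true');

    SELECT col_a, count(*) FROM my_csv GROUP BY col_a;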
Also, I'm curious: what solution did you end up implementing to resolve your issue?