What is the generic term for technologies like AWS Athena (Presto) and GCP BigQuery? - google-bigquery

From a user perspective, Athena and BigQuery both accept a sql-like query, they both query files stored on disk (without needing to have set up a relational database), and they both return results (usually very quickly). Do such technologies have a name? i.e. is there a generic term for technologies like AWS Athena and GCP BigQuery?

They are both distributed SQL Query Engines for big [in-place] data. Athena is based on Presto, which declares itself to be a Distributed SQL Query Engine for Big Data.
Apache Drill was based on the original BigQuery design and defines itself as a Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage
The three things that define them are the possibility of running SQL, their distributed nature so they can operate at scale for interactive queries, and the power to query data without having to ingest it first.
Note in the case of BigQuery, initially the data would need to be ingested and it is still the preferred way of working, even if querying data directly from GCS has been available for a number of years. Athena only works with external tables.

Google BigQuery is a serverless data warehouse that supports super-fast SQL queries for analyze data in parallel. Amazon Athena is a serverless interactive query service that allows you to conveniently analyze data stored in Amazon Simple Storage Service (S3) by using basic SQL in parallel.
Both technologies could be considered as MPP (massively parallel processing) systems as both technologies process data analytics in parallel.

Related

Replicate data from cloud SQL postgres to bigQuery

I am looking for the recommended way of streaming database change from cloud SQL (postgres) to bigQuery ? I am seeing that CDC streaming does not seems available for postgres, does anyone know the timeline of this feature ?
Thanks a lot for you help.
Jonathan.
With Datastream for BigQuery, you can now replicate data and schema updates from operational databases directly into BigQuery.
Datastream reads and delivers every change—insert, update, and delete—from your MySQL, PostgreSQL, AlloyDB, and Oracle databases into BigQuery with minimal latency. The source database can be hosted on-premises, on Google Cloud services such as Cloud SQL or Bare Metal Solution for Oracle, or anywhere else on any cloud.
https://cloud.google.com/datastream-for-bigquery
You have to create an ETL process. That will allow you to automatically transform data from Postgres into BigQuery. You can do that using many ways, but I will point you to the two main approaches that I've already implemented:
Way 1:
Set Up the ETL Process manually:
Create your ETL using open source tools...
This method involves the use of the COPY command to migrate data from PostgreSQL tables and standard file-system files. It can be used as a normal SQL statement with SQL functions or PL/pgSQL procedures which gives a lot of flexibility to extract data as a full dump or incrementally. You need to know that it is a time-consuming process and would need you to invest in engineering bandwidth!
Also, you could try different tech stacks to implement the above, and I recommended this one Java Spring Data Flow
Way 2:
Using DataFlow
You can automate the ETL process using GCP's DataFlow without coding your own solution. It is faster and it cost, of course.
DataFlow is Unified stream and batch data processing that's
serverless, fast, and cost-effective.
Check more details and learn in a minute here
Also check this

How is AWS Athena so different from a database

AWS Athena can read tables, create new tables and insert into tables (among other things). Is it fair to say that Athena can replace the basic functions of a database , or is it fair to say that Athena can be seen as a lightweight serverless replacement of a relational DB? What are its major differences compared to a DB ?
Athena is not a database and it's a query engine.In Athena compute and storage are separate where as in the case of a database both are tightly coupled.
Athena is more of a OLAP solution where as relational DB is OLTP and it is fully managed service. It will also not maintain the metadata and instead use Glue catalog to store and reteive it.
Also found this article which talks little more about the differences. You can have a read.

Tableau visualization - Performance issue with huge data

I have huge data from different DB sources ( Oracle, Mongo, Cassandra ) and also eventing data available in Kafka. Using Tableau for analytics and facing performance issue with huge data. So, planning to store data in some other way and use Tableau for visualization also. Have multiple options now and need some help to finalize the approach.
Option 1:-
Read DB data and store them in Parquet file and then expose it over Spark SQL or HiveQL or Presto SQL and let Tableau connect to this SQL.
Option 2:-
Read DB data and store them in Parquet file in S3 and then use AWS Athena for analytics and let Tableau connect to Athena.
Option 3:-
Read DB data and store them in Parquet file in S3 and then move to Redshift for analytics and let Tableau connect to Redshift.
Not sure if any of the above approach will be a good solution for streaming data( Kafka ) analytics as well.
Note:- I have multiple big tables and need joins b/w them.
I understand you have huge data from different sources, and you also have access to AWS. Then, you plan to use this data for analytics and dashboarding via Tableau.
Option 1 and 2
Your Options 1 and 2 are basically the same, as AWS Athena and Hive are based on the same principle of creating tables over flat files via a metastore which stores table definition. Both Athena's Presto engine and Spark are distributed and highly efficient on huge data (TB data). The main difference is the pricing model (Athena is based on price per data processed per request and is serverless, whereas Spark may imply infrastructure cost).
Then, both options may not perform well as they are not OLAP systems designed for self service BI (they are better use for ad hoc queries over huge data regarding).
Then, you may have trouble in managing your data model using flat files and table or views over them (data storage and compression won't be optimized for each table which may impact Tableau performance).
Option 3
Option 3 is better as it is based on Redshift which is designed to support OLAP system. You can connect Tableau directly to Redshift but you'll suffer from latency and you may have trouble managing your cluster load depending on the number of users and/or requests. But it can work the way you describe it.
Then, if you have performance issues, you'll be able to create data source extracts from Redshift to Tableau later on. You can also implement an intermediate database to store pre-aggregated queries (= datamarts) and connect Tableau directly to it which will avoid performing the same query on Redshift each time a dashboard is opened in Tableau (in that case Redshift also caches queries).
Then, as you need to perform multiple joins, you'll be able to optimize data storage for such queries using Redshift by setting the right partition and sort keys.
To conclude, you can also directly access flat files from Redshift using Redshift Spectrum (via Athena/Glue metastore).
Documentations:
https://docs.aws.amazon.com/redshift/latest/dg/best-practices.html
https://aws.amazon.com/fr/athena/pricing/

How to minimize cost per SQL query execution in BigQuery

I am new to BigQuery and GCP. I am working with a (big) public data set available in BigQuery on which I am running a SQL query - it selects a bunch of data from one of the tables in the dataset, based on a simple where clause.
I then proceed to perform additional operations on the obtained data. I only need to run this query once a month, the other operations need to be run more often (hourly).
My problem is that every time I do this, it causes BigQuery to process 4+ million rows of data, and the cost of running this query is quickly adding up for me.
Is there a way I can run the SQL query and export the data to another
table/database in GCP, and then run my operations on that exported
data?
Am I correct in assuming (and I could be wrong here) that once I
export data to standard SQL DB in GCP, the cost per query will be
less in that exported database than it is in BigQuery?
Thanks!
Is there a way I can run the SQL query and export the data to another table/database in GCP, and then run my operations on that exported data?
You can run your SQL queries and therefore export the data into another table/databases in GCP by using the Client Libraries for BigQuery. You can also refer to this documentation about how to export table data using BigQuery.
As for the most efficient way to do it, I will proceed by using both BigQuery and Cloud SQL (for the other table/database) APIs.
The BigQuery documentation has an API example for extracting a BigQuery table to your Cloud Storage Bucket.
Once the data is in Cloud Storage, you can use the Cloud SQL Admin API to import the data into your desired database/table. I attached documentation regarding the best practices on how to import/export data within Cloud SQL.
Once the data is exported you can delete the residual files from your Cloud Storage Bucket, using the console, or interacting with the Cloud Storage. API
Am I correct in assuming (and I could be wrong here) that once I export data to standard SQL DB in GCP, the cost per query will be less in that exported database than it is in BigQuery?
As for the prices, you will find here how to estimate storage and query costs within BigQuery. As for other databases like Cloud SQL, here you will find more information about the Cloud SQL pricing.
Nonetheless, as Maxim point out, you can refer to both the best practices within BigQuery in order to maximize efficiency and therefore minimizing cost, and also the best practices for using Cloud SQL.
Both can greatly help you minimize cost and be more efficient in your queries or imports.
I hope this helps.

Query against two separate Redshift clusters in a single query?

I am designing my db structure, and wondering if possible to run a single query against two separate Redshift clusters?
If possible, any limitation on the region, availability zones, VPC groups, etc.?
No, it's no possible in Redshift directly. Additionally you cannot query across multiple databases on the same cluster.
UPDATE: Redshift announced a preview for cross database queries on 2020-10-15 - https://docs.aws.amazon.com/redshift/latest/dg/cross-database-overview.html
You could use an external tool such as Amazon Athena or Presto running on an EMR cluster to do this. You would define each Redshift cluster as an external data source. Be careful though, you will lose most of Redshift's performance optimizations and a lot of data will have to be pulled back into Athena / Presto to answer your queries.
As an alternative to cross-cluster queries, consider placing your data onto S3 in well partitioned Parquet or ORC files and using Redshift Spectrum (or Amazon Athena) to query them. This approach allows multiple clusters to query a common data set while maintaining good query performance. https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/
Using federated queries in Amazon Redshift, a second cluster tables can be accessed as an external schema
You can refer to documentation https://docs.aws.amazon.com/redshift/latest/dg/federated_query_example.html