Google BigQuery vs Spark and Parquet

How does Google BigQuery compare to Apache Spark SQL and Parquet?
Is it correct to say that BigQuery is actually Storage & Analysis, and that you could therefore split the product into BigQuery Storage and BigQuery Analysis?
I understand there are plenty of other storage mechanisms and processing engines, but, to pick two "pairs"...
For my understanding, is it correct to say that BigQuery Storage is comparable with Apache Parquet and BigQuery Analysis is comparable with Spark SQL?
Is it correct to say that BigQuery storage is actually called Capacitor... "BigQuery’s next-generation columnar storage format"?
Is it also correct to say that Apache Parquet and BigQuery Storage both provide an implementation of Dremel?

Capacitor is the columnar file format used by BigQuery, while the storage layer is the whole distributed system that hosts the files and data. Dremel is the underlying execution engine. Here is an introduction (https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood).
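To make the storage/engine split concrete, here is a minimal sketch (Python, with hypothetical project, dataset, and table names) that reads rows straight out of BigQuery's managed storage via the Storage Read API, without going through the Dremel engine at all:

```python
# Minimal sketch: read rows directly from BigQuery managed storage,
# without issuing a SQL query (no Dremel execution involved).
# "my-project", "my_dataset" and "my_table" are hypothetical names.
from google.cloud.bigquery_storage import BigQueryReadClient, types

client = BigQueryReadClient()

requested_session = types.ReadSession(
    table="projects/my-project/datasets/my_dataset/tables/my_table",
    data_format=types.DataFormat.AVRO,  # Capacitor itself is internal and
                                        # never exposed; rows are served
                                        # back as Avro or Arrow
)
session = client.create_read_session(
    parent="projects/my-project",
    read_session=requested_session,
    max_stream_count=1,
)

# Stream and print the rows from the first (and only) stream.
reader = client.read_rows(session.streams[0].name)
for row in reader.rows(session):
    print(row)
```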

Related

SQL database to Bigquery or SQL database to GCS to BigQuery

In the book Data Engineering with Google Cloud Platform by Adi Wijaya, to load data from a SQL database into BigQuery, the author always loads the data from the SQL database into Google Cloud Storage first, uses it as a staging environment, and only then loads the data into BigQuery.
What are the advantages of going through the GCS step rather than loading straight into BigQuery? In which cases would you load data directly from the SQL database into BigQuery?
As mentioned in this post, BigQuery doesn't support loading data directly from Cloud SQL in its native format. You can follow either of the procedures below:
You can use a BigQuery Cloud SQL federated query to import data directly into BigQuery from Cloud SQL (see the sketch after this list).
Based on this documentation, you can instead generate CSV or JSON files from the Cloud SQL database, persist those files to Cloud Storage, and load the data into BigQuery from there.
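As a rough sketch of the first procedure, a federated EXTERNAL_QUERY run through the Python client (the project, dataset, and connection names below are hypothetical placeholders):

```python
# Sketch: pull rows from Cloud SQL into BigQuery with EXTERNAL_QUERY.
# The connection "my-project.us.my-cloudsql-connection" and the table
# names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

sql = """
INSERT INTO my_dataset.orders
SELECT *
FROM EXTERNAL_QUERY(
  'my-project.us.my-cloudsql-connection',
  'SELECT id, amount, created_at FROM orders;'
)
"""
client.query(sql).result()  # blocks until the query job finishes
```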
The advantages when loading data from Cloud SQL to Cloud Storage to BigQuery are:
Cloud Storage provides services like resumable uploads, whereas combining the load job and the data transfer in one step means you would need to be more careful about managing any issues with jobs and to handle transient failures yourself.
According to this documentation, using Cloud Storage you can take advantage of long term storage:
When you load data into BigQuery from Cloud Storage, you are not charged for the load operation, but you do incur charges for storing the data in Cloud Storage.
As mentioned by @John Hanley, I agree that the advantage of loading data through Google Cloud Storage into BigQuery is that it is faster and you keep a consistent copy or backup that can be recovered in the event of a primary data failure.
The BigQuery table can be deleted when not in use and re-imported when needed, and the load is less likely to fail when creating a table.
As additional information, the cost of storage in BigQuery is higher than in Cloud Storage, and you are subject to the documented limitations when you load data into BigQuery from a Cloud Storage bucket.
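And a minimal sketch of the second procedure's final step, loading staged CSV exports from a hypothetical bucket into BigQuery:

```python
# Sketch: load CSV files exported from Cloud SQL out of a GCS bucket
# into a BigQuery table. Bucket and table names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row
    autodetect=True,       # infer the schema from the files
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/export/orders-*.csv",  # wildcard over staged exports
    "my_dataset.orders",
    job_config=job_config,
)
load_job.result()  # the load operation itself is not billed
```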
To suggest the best strategy, your question would need more detail; it ultimately depends on your use case. More information on loading data can be found in the BigQuery documentation.

What is the difference between BigQuery and Storage on GCP?

Hi, I am using GCP for the first time, and while walking through a project's Cloud Function example with mock data, I got confused about the similarities and differences between BigQuery and Cloud Storage. I would like more clarity on what makes them different, because to me they seem very similar.
BigQuery is a data warehouse and a SQL engine. You can use it to store tabular data in datasets and tables. In those tables you can also store more complex structures such as arrays and JSON, but not files, for example.
Cloud Storage is blob storage, with functionality similar to what you know from your Linux/Windows machine (saving files and folders, deleting, copying). Of course, the backend is nothing like your local file system.
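A quick sketch of how differently the two are used from code (the bucket, dataset, and table names are hypothetical):

```python
# Cloud Storage: object (blob) operations, much like files and folders.
from google.cloud import bigquery, storage

gcs = storage.Client()
bucket = gcs.bucket("my-bucket")
bucket.blob("raw/events.json").upload_from_filename("events.json")

# BigQuery: tabular data queried with SQL.
bq = bigquery.Client()
query = """
    SELECT user_id, COUNT(*) AS n
    FROM my_dataset.events
    GROUP BY user_id
"""
for row in bq.query(query).result():
    print(row.user_id, row.n)
```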
BigQuery is a fully managed and serverless data warehouse. It's like Snowflake or Redshift.
Google Cloud Storage (GCS) is like Amazon S3 or Azure Storage. As the name suggests, storage services are for storing data.
You usually use BigQuery to analyze & query data in order to draw some insights. BigQuery is an analytical engine.
You can store images, videos, logs, and other files in GCS (Google Cloud Storage), but you can't in BigQuery.
Google BigQuery belongs to "Big Data as a Service" category of the tech stack, while Google Cloud Storage can be primarily classified under "Cloud Storage".
Some of the features offered by Google BigQuery are:
• All behind the scenes: your queries can execute asynchronously in the background and can be polled for status.
• Import data with ease: bulk-load your data using Google Cloud Storage, or stream it in bursts of up to 1,000 rows per second (see the sketch after this list).
• Affordable big data: the first terabyte of data processed each month is free.
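As an illustration of the streaming bullet above, a minimal sketch using the Python client's streaming inserts; the table name is a hypothetical placeholder:

```python
# Sketch: stream individual rows into a BigQuery table as they arrive,
# instead of bulk-loading from Cloud Storage. Table name is hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

rows = [
    {"user_id": 1, "event": "click"},
    {"user_id": 2, "event": "view"},
]
errors = client.insert_rows_json("my-project.my_dataset.events", rows)
if errors:
    print("Streaming insert failed:", errors)
```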
On the other hand, Google Cloud Storage provides the following key features:
• High Capacity and Scalability
• Strong Data Consistency
• Google Developers Console Projects
"High Performance" is the primary reason why developers consider Google BigQuery over the competitors, whereas "Scalable" was stated as the key factor in picking Google Cloud Storage.

Tableau visualization - Performance issue with huge data

I have huge data from different DB sources (Oracle, Mongo, Cassandra) and also event data available in Kafka. I am using Tableau for analytics and facing performance issues with this volume of data, so I am planning to store the data in some other way and keep using Tableau for visualization. I have multiple options now and need some help to finalize the approach.
Option 1:
Read the DB data and store it in Parquet files, then expose them over Spark SQL, HiveQL, or Presto, and let Tableau connect to that SQL endpoint.
Option 2:
Read the DB data and store it in Parquet files in S3, then use AWS Athena for analytics and let Tableau connect to Athena.
Option 3:
Read the DB data and store it in Parquet files in S3, then move it to Redshift for analytics and let Tableau connect to Redshift.
I am not sure whether any of the above approaches would also be a good solution for streaming (Kafka) analytics.
Note: I have multiple big tables and need joins between them.
I understand you have huge data from different sources, and you also have access to AWS. Then, you plan to use this data for analytics and dashboarding via Tableau.
Option 1 and 2
Your Options 1 and 2 are basically the same, as AWS Athena and Hive are based on the same principle: creating tables over flat files via a metastore which stores the table definitions. Both Athena's Presto engine and Spark are distributed and highly efficient on huge data (TB-scale). The main difference is the pricing model (Athena is serverless and priced per data processed per request, whereas Spark may imply infrastructure cost).
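To make the "tables over flat files via a metastore" principle concrete, here is a rough sketch that registers Parquet files on S3 as an Athena external table (the bucket, database, and table names are hypothetical):

```python
# Sketch: define an Athena external table over Parquet files in S3.
# Athena stores only the table definition; the data stays in S3.
# Bucket, database, and output locations are hypothetical.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS analytics.orders (
  id         bigint,
  amount     double,
  created_at timestamp
)
STORED AS PARQUET
LOCATION 's3://my-bucket/warehouse/orders/'
"""

athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```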
That said, both options may not perform well, as they are not OLAP systems designed for self-service BI (they are better used for ad hoc queries over huge data).
You may also have trouble managing your data model using flat files and tables or views over them (data storage and compression won't be optimized for each table, which may impact Tableau performance).
Option 3
Option 3 is better, as it is based on Redshift, which is designed to support OLAP workloads. You can connect Tableau directly to Redshift, but you'll suffer from some latency, and you may have trouble managing your cluster load depending on the number of users and/or requests. Still, it can work the way you describe it.
Then, if you have performance issues, you'll be able to create data source extracts from Redshift to Tableau later on. You can also implement an intermediate database to store pre-aggregated queries (i.e. datamarts) and connect Tableau directly to it, which avoids running the same query on Redshift each time a dashboard is opened in Tableau (Redshift also caches queries).
Then, as you need to perform multiple joins, you'll be able to optimize data storage for such queries in Redshift by setting the right distribution and sort keys, as sketched below.
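A minimal sketch of such a table definition, submitted over a standard PostgreSQL driver (the cluster endpoint, credentials, and table are hypothetical):

```python
# Sketch: Redshift DDL that optimizes for joins on user_id and
# time-range filters. Connection details and table are hypothetical.
import psycopg2

ddl = """
CREATE TABLE orders (
  order_id   BIGINT,
  user_id    BIGINT,
  amount     DECIMAL(12, 2),
  created_at TIMESTAMP
)
DISTKEY (user_id)      -- collocate rows that are joined on user_id
SORTKEY (created_at);  -- prune blocks on time-range filters
"""

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="...",
)
with conn, conn.cursor() as cur:
    cur.execute(ddl)
```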
To conclude, you can also access flat files directly from Redshift using Redshift Spectrum (via the Athena/Glue metastore).
Documentations:
https://docs.aws.amazon.com/redshift/latest/dg/best-practices.html
https://aws.amazon.com/fr/athena/pricing/

What is the generic term for technologies like AWS Athena (Presto) and GCP BigQuery?

From a user perspective, Athena and BigQuery both accept a SQL-like query, they both query files stored on disk (without needing a relational database to be set up first), and they both return results, usually very quickly. Do such technologies have a name? That is, is there a generic term for technologies like AWS Athena and GCP BigQuery?
They are both distributed SQL Query Engines for big [in-place] data. Athena is based on Presto, which declares itself to be a Distributed SQL Query Engine for Big Data.
Apache Drill was inspired by Dremel, the original design behind BigQuery, and defines itself as a Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage.
The three things that define them are the possibility of running SQL, their distributed nature so they can operate at scale for interactive queries, and the power to query data without having to ingest it first.
Note that in the case of BigQuery, the data initially had to be ingested, and that is still the preferred way of working, even though querying data directly from GCS has been available for a number of years. Athena, by contrast, works only with external tables.
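As a sketch of that direct-from-GCS path in BigQuery, a temporary external table defined at query time (the bucket and file paths are hypothetical):

```python
# Sketch: query CSV files in GCS directly, without loading them into
# BigQuery managed storage first. Bucket and file paths are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-bucket/raw/events-*.csv"]
external_config.autodetect = True  # infer the schema from the files

job_config = bigquery.QueryJobConfig(
    table_definitions={"events": external_config}  # temporary external table
)
rows = client.query(
    "SELECT COUNT(*) AS n FROM events", job_config=job_config
).result()
for row in rows:
    print(row.n)
```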
Google BigQuery is a serverless data warehouse that supports super-fast SQL queries to analyze data in parallel. Amazon Athena is a serverless interactive query service that lets you conveniently analyze data stored in Amazon Simple Storage Service (S3) using standard SQL, also in parallel.
Both technologies could be considered as MPP (massively parallel processing) systems as both technologies process data analytics in parallel.

Where the data will be stored by BigQuery

I am using BigQueryIO to publish data into BigQuery from a Google Dataflow job.
AFAIK, BigQuery can be used to query data from Google Cloud Storage, Google Drive and Google Sheets.
But when we store data using BigQueryIO, where will the data be stored? Is it in Google Cloud Storage?
Short answer: BigQueryIO writes to and reads from BigQuery tables.
To go a little deeper:
BigQuery stores data in the Capacitor columnar data format, and offers the standard database concepts of tables, partitions, columns, and rows.
It manages the technical aspects of storing your structured data, including compression, encryption, replication, performance tuning, and scaling.
You can read more about BigQuery's different components in the BigQuery overview.
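For reference, a minimal Apache Beam (Python) sketch of the BigQueryIO write path; the project, dataset, and schema are hypothetical:

```python
# Sketch: write a couple of rows into a BigQuery table with BigQueryIO.
# The rows land in BigQuery managed (Capacitor) storage, not in
# Cloud Storage. Project/dataset/table names are hypothetical.
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "CreateRows" >> beam.Create([
            {"user_id": 1, "event": "click"},
            {"user_id": 2, "event": "view"},
        ])
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.events",
            schema="user_id:INTEGER,event:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```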
Cloud Storage is a separate service from Big Query. Internally, Big Query manages its own storage.
So, if you save your data to Cloud Storage, and then use the bq command to load a Big Query table from a file in Cloud Storage, there are now 2 copies of the data.
Consequences include:
If you delete the Cloud Storage copy, the data will still be in Big Query.
Fees include a price for each copy. I think that as of April 2017, long-term storage in BigQuery is around $0.01/GB, and Cloud Storage is around $0.01-$0.026/GB depending on the storage class.
If the same data is in both GCS and BQ, you are paying twice. Whether it is worthwhile to have a backup copy of data is up to you.
BigQuery is a managed data warehouse; simply put, it's a database.
So your data will be stored in BigQuery, and you can access it using SQL queries.