SQL database to BigQuery or SQL database to GCS to BigQuery - google-bigquery

In the book Data Engineering with Google Cloud Platform by Adi Wijaya, when loading data from a SQL database into BigQuery, the author always loads the data from the SQL database into Google Cloud Storage first, uses it as a staging environment, and only after that loads the data into BigQuery.
What are the advantages of going through the GCS step rather than loading straight into BigQuery? In which cases would you load data directly from the SQL database into BigQuery?

As mentioned in this post, BigQuery doesn't support the SQL (dump) format, so you cannot directly load data from Cloud SQL into BigQuery. You can follow one of the procedures below:
You can use a BigQuery Cloud SQL federated query to import data directly into BigQuery from Cloud SQL.
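A minimal sketch of what that looks like from Python, assuming a Cloud SQL connection resource has already been created in BigQuery (the connection ID, table, and column names below are made up):

```python
# Hedged sketch: run a Cloud SQL federated query via the BigQuery client.
# Assumes a connection resource "my_connection" exists in us-central1;
# all identifiers below are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT *
FROM EXTERNAL_QUERY(
  'my-project.us-central1.my_connection',        -- placeholder connection ID
  'SELECT id, name, created_at FROM customers;'  -- runs inside Cloud SQL
)
"""
for row in client.query(sql).result():  # result() waits for completion
    print(row["id"], row["name"])
```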
Alternatively, based on this documentation, you can first export CSV or JSON files from the Cloud SQL database, persist those files to Cloud Storage, and then load the data into BigQuery.
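For that second path, a hedged sketch of the final load step, assuming the CSV export from Cloud SQL (e.g. via gcloud sql export csv) has already landed in a bucket; bucket, dataset, and table names are placeholders:

```python
# Load a CSV that is already staged in Cloud Storage into BigQuery.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the CSV header row
    autodetect=True,      # infer the schema from the file
)

load_job = client.load_table_from_uri(
    "gs://my-staging-bucket/exports/customers.csv",  # placeholder URI
    "my-project.my_dataset.customers",               # placeholder table
    job_config=job_config,
)
load_job.result()  # block until the load job completes
print(f"Loaded {load_job.output_rows} rows.")
```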
The advantages of going through Cloud Storage when loading data from Cloud SQL to BigQuery are:
Cloud Storage provides features like resumable uploads, whereas combining the extract and the load into a single step means you need to be more careful about managing job failures and transient issues yourself.
According to this documentation, using Cloud Storage you can take advantage of long-term storage:
When you load data into BigQuery from Cloud Storage, you are not charged for the load operation, but you do incur charges for storing the data in Cloud Storage.
As mentioned by @John Hanley, I agree that the advantages of loading data into Google Cloud Storage first and then into BigQuery are that it is faster, and that you keep a consistent copy or backup that can be recovered in the event of a primary data failure.
The BigQuery table can be deleted when not in use and re-imported when needed, and the load is less likely to fail when creating the table.
As additional information, the cost of storing data in BigQuery is higher than in Cloud Storage, and you are subject to certain limitations when you load data into BigQuery from a Cloud Storage bucket.
To suggest the best strategy, your question would need more information; it really depends on your use case. More information on loading data can be found in the BigQuery documentation.

Related

Oracle Cloud to Azure Cloud storage

We have a requirement to move data from Oracle Cloud storage to Azure Cloud storage.
The requirement is basically to move data from an Oracle ADW database (hosted on Oracle Cloud) to a Snowflake database (hosted on Azure).
Since the data volume in the tables is huge (some with 60M+ records), we do not wish to use any ETL tool and instead want to set up a pipeline as below:
Oracle ADW database -> Store data in Oracle Cloud storage -> Move data to Azure Cloud storage -> Load into Snowflake using Snowpipe or similar Snowflake utilities.
How should I go about this implementation?
Also, please share your views on whether we can use Oracle FastConnect and Azure ExpressRoute to directly pull data from Oracle Cloud into Snowflake (or into Azure storage).
I am looking for the same thing, with the simplest method, from Oracle (on-prem, but could be cloud) into Snowflake. It looks like data must be exported or dropped to external tables, shifted to Azure Blob storage (analogous to AWS S3), then pushed into Snowflake using COPY INTO - basically copying the on-disk external tables. This is what Snowpipe does:
"Snowpipe copies the files into a queue, from which they are loaded into the target table in a continuous, serverless fashion based on parameters defined in a specified pipe object. The following table indicates the cloud storage service support for automated Snowpipe from Snowflake accounts hosted on each cloud platform:"
It's been a while since I have worked with this. The other option is GoldenGate, which was not expensive the last time I looked into it:
https://www.snowflake.com/blog/continuous-data-replication-into-snowflake-with-oracle-goldengate/
Easy, simple, fast. Any better ideas would be appreciated.

What is the difference between BigQuery and Storage on GCP?

Hi guys, I am using GCP for the first time, and while walking through a project's Cloud Function example with mock data, I got confused about the similarities/differences between the two. I would like more clarity on what makes them different, because to me they seem so similar.
BigQuery is a data warehouse and a SQL engine. You can use it to store tabular data in datasets and tables. In the tables you may also store more complex structures like arrays and JSON, but not files, for example.
Cloud Storage is a blob store, with functionality similar to what you know from your Linux/Windows machine (saving files and folders, deleting, copying). Of course, on the backend it's nothing like your local file system.
BigQuery is a fully managed and serverless data warehouse. It's like Snowflake or Redshift.
Google Cloud Storage (GCS) is like Amazon S3 or Azure Storage. Storage services are for storing data, as the name suggests.
You usually use BigQuery to analyze & query data in order to draw some insights. BigQuery is an analytical engine.
You can store images, videos, logs, files, etc. in GCS (Google Cloud Storage), but BigQuery can't hold those.
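To make the contrast concrete, a small sketch with placeholder names: arbitrary files go into a Cloud Storage bucket, while BigQuery answers SQL over tabular data:

```python
# Cloud Storage vs. BigQuery in a few lines; all names are placeholders.
from google.cloud import bigquery, storage

# Cloud Storage: save an arbitrary file (an image, a log, anything).
bucket = storage.Client().bucket("my-bucket")
bucket.blob("photos/cat.png").upload_from_filename("cat.png")

# BigQuery: run SQL over structured, tabular data.
rows = bigquery.Client().query(
    "SELECT name, COUNT(*) AS n FROM `my-project.my_dataset.events` "
    "GROUP BY name ORDER BY n DESC"
).result()
for row in rows:
    print(row["name"], row["n"])
```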
Google BigQuery belongs to the "Big Data as a Service" category of the tech stack, while Google Cloud Storage can be primarily classified under "Cloud Storage".
Some of the features offered by Google BigQuery are:
• All behind the scenes: your queries can execute asynchronously in the background and can be polled for status (see the sketch after this list).
• Import data with ease: bulk load your data using Google Cloud Storage or stream it in bursts of up to 1,000 rows per second.
• Affordable big data: the first terabyte of data processed each month is free.
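As a sketch of the first bullet, the Python client returns a job handle that can be polled while the query runs in the background (the table name is a placeholder):

```python
# Poll an asynchronous BigQuery query job for status.
import time
from google.cloud import bigquery

client = bigquery.Client()
job = client.query("SELECT COUNT(*) AS n FROM `my-project.my_dataset.events`")

while not job.done():           # poll the job's status
    print("state:", job.state)  # e.g. PENDING or RUNNING
    time.sleep(1)

print(list(job.result())[0]["n"])
```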
On the other hand, Google Cloud Storage provides the following key features:
• High Capacity and Scalability
• Strong Data Consistency
• Google Developers Console Projects
"High Performance" is the primary reason why developers consider Google BigQuery over the competitors, whereas "Scalable" was stated as the key factor in picking Google Cloud Storage.

What is the generic term for technologies like AWS Athena (Presto) and GCP BigQuery?

From a user perspective, Athena and BigQuery both accept a SQL-like query, they both query files stored on disk (without needing a relational database to be set up), and they both return results (usually very quickly). Do such technologies have a name? I.e., is there a generic term for technologies like AWS Athena and GCP BigQuery?
They are both distributed SQL Query Engines for big [in-place] data. Athena is based on Presto, which declares itself to be a Distributed SQL Query Engine for Big Data.
Apache Drill was based on the original BigQuery design and defines itself as a "Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage".
The three things that define them are the ability to run SQL, their distributed nature so they can operate at scale for interactive queries, and the ability to query data without having to ingest it first.
Note that in the case of BigQuery, the data initially had to be ingested, and that is still the preferred way of working, even though querying data directly from GCS has been available for a number of years. Athena, by contrast, only works with external tables.
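As a hedged sketch of that "query in place" idea, BigQuery can read CSV files directly from GCS through a temporary external table definition, without loading them first (the URIs and names below are made up):

```python
# Query GCS files in place via a temporary external table definition.
from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-bucket/exports/*.csv"]  # placeholder
external_config.autodetect = True  # infer the schema from the files

job_config = bigquery.QueryJobConfig(
    table_definitions={"events_ext": external_config}
)
rows = client.query(
    "SELECT COUNT(*) AS n FROM events_ext", job_config=job_config
).result()
print(list(rows)[0]["n"])
```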
Google BigQuery is a serverless data warehouse that supports super-fast SQL queries for analyzing data in parallel. Amazon Athena is a serverless interactive query service that allows you to conveniently analyze data stored in Amazon Simple Storage Service (S3) using basic SQL, in parallel.
Both technologies can be considered MPP (massively parallel processing) systems, as both process data analytics in parallel.

Google BigQuery vs Spark and Parquet

How does Google BigQuery compare to Apache Spark SQL and Parquet?
Is it correct to say that BigQuery is actually Storage & Analysis, and that you could therefore split the product into BigQuery Storage and BigQuery Analysis?
I understand there are plenty of other storage mechanisms and processing engines, but, to pick two "pairs"...
To check my understanding, is it correct to say that BigQuery Storage is comparable with Apache Parquet, and BigQuery Analysis is comparable with Spark SQL?
Is it correct to say that BigQuery storage is actually called Capacitor... "BigQuery’s next-generation columnar storage format"?
Is it also correct to say that Apache Parquet and BigQuery Storage both provide an implementation of Dremel?
Capacitor is the file format used by BigQuery, while the storage is the whole distributed system that hosts the files and data. Dremel is the underlying execution engine. Here is some introduction: https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood
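To make the pairing from the question concrete, a hedged PySpark sketch where Parquet plays the storage role and Spark SQL the analysis role (the file path and column names are made up):

```python
# Parquet as columnar storage, Spark SQL as the query engine.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-sql").getOrCreate()

# Parquet: columnar files on disk (BigQuery's rough analogue is Capacitor).
df = spark.read.parquet("/data/events.parquet")
df.createOrReplaceTempView("events")

# Spark SQL: the execution engine (BigQuery's rough analogue is Dremel).
spark.sql("SELECT name, COUNT(*) AS n FROM events GROUP BY name").show()
```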

Where will the data be stored by BigQuery?

I am using BigQueryIO to publish data into BigQuery from a Google Dataflow job.
AFAIK, BigQuery can be used to query data from Google Cloud Storage, Google Drive and Google Sheets.
But when we store data using BigQueryIO, where will the data be stored? Is it in Google Cloud Storage?
Short answer: BigQueryIO writes to / reads from a BigQuery table.
To go a little deeper:
BigQuery stores data in the Capacitor columnar data format, and offers the standard database concepts of tables, partitions, columns, and rows.
It manages the technical aspects of storing your structured data, including compression, encryption, replication, performance tuning, and scaling.
You can read more about BigQuery's different components in the BigQuery Overview.
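For reference, a minimal sketch of the write path using the Beam Python SDK (your pipeline may be in Java; all names below are placeholders): rows written with WriteToBigQuery land in a BigQuery table backed by BigQuery's own storage.

```python
# Write a tiny PCollection to a BigQuery table via BigQueryIO.
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | beam.Create([{"name": "alice", "score": 10}])
        | beam.io.WriteToBigQuery(
            "my-project:my_dataset.scores",  # placeholder table spec
            schema="name:STRING,score:INTEGER",
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```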
Cloud Storage is a separate service from BigQuery. Internally, BigQuery manages its own storage.
So, if you save your data to Cloud Storage and then use the bq command to load a BigQuery table from a file in Cloud Storage, there are now two copies of the data.
Consequences include:
If you delete the Cloud Storage copy, the data will still be in BigQuery.
Fees include a price for each copy. I think that as of April 2017, long-term storage in BigQuery was around $0.01/GB, and Cloud Storage around $0.01-$0.026/GB depending on the storage class.
If the same data is in both GCS and BQ, you are paying twice. Whether it is worthwhile to have a backup copy of data is up to you.
BigQuery is a managed data warehouse; simply put, it's a database.
So your data will be stored in BigQuery, and you can access it using SQL queries.