What's the best way to pull the data from cloud spanner to BigQuery for Data Analysis?
thanks
You can use the Google provided dataflow template for pulling data from Spanner to GCS, then run a load job to load it into Bigquery.
Export Spanner database
Cloud Spanner to GCS AVRO
check this Link1. you dont need any external service to migrate data. you can directly read all spanner data through Bigquery and load it.
Related
In the book Data Engineering with Google Cloud Platform by Adi Wijaya, to load the data from a sql database to BigQuery, the author always load the data from sql to Google Cloud Storage first, and use it as staging environment, and only after that would he load data to BigQuery
What are the advantage of going through the GCS step and not straight away into BigQuery? In which case would you load directly data from SQL db to BigQuery?
BigQuery doesn't support the SQL format as mentioned in this post to directly load data from Cloud SQL to BigQuery. You can follow the below procedures:
You can use BigQuery Cloud SQL federated query importing data directly into BigQuery from Cloud SQL.
Based on this documentation, you should first generate CSV or JSON from the Cloud SQL Database and persist those files to Cloud Storage and load data into BigQuery.
The advantages when loading data from Cloud SQL to Cloud Storage to BigQuery are:
Cloud storage provides services like resumable uploads, whereas combining the job and data means you'd need to be more careful about managing any issues with jobs, and concerning yourself with transient issues.
According to this documentation, using Cloud Storage you can take advantage of long term storage:
When you load data into BigQuery from Cloud Storage, you are not charged for the load operation, but you do incur charges for storing the data in Cloud Storage.
And as mentioned by #John Hanley, I agree that the advantage of loading data to Google Cloud storage to BigQuery it is faster and you can ensure a consistent copy or backup to be recovered in the event of a primary data failure.
BigQuery table can be deleted when not in use and imported when needed. And less likely to fail when creating a table.
Additional information, the cost of storing in BigQuery is higher than in Cloud storage. And you are subject to the following limitations when you load data into BigQuery from a Cloud Storage bucket.
To suggest the best strategy, your question needs more information. Still it depends on your use case. And for more information on loading data can be found in the BigQuery documentation.
I need to setup a data pipeline from some source databases like Oracle, MySQL and load the data to BigQuery.
How can I use google-cloud-dataflow to read data from a database(jdbc connection) and write to BigQuery tables using Python.
Also, I have some hive tables in an on-premise Hadoop cluster, how do I transfer this data to BigQuery.
I couldn't find the right documentation or examples to achieve this.
Can you please point me in the right direction.
I applied a solution in my project to provide such thing, you need to follow these steps:
Load data from Google Cloud SQL to Google Cloud storage in CSV by following this link.
Load the CSV data from Google cloud storage directly into BigQuery by following this link.
I am using BigQueryIO to publish data into BigQuery from a Google Dataflow job.
AFAIK, BigQuery can be used to query data from Google Cloud Storage, Google Drive and Google Sheets.
But when we store data using BigQueryIO, where the data will stored? Is it in Google Cloud Storage?
Short answer - BigQueryIO Write/Read to/from BigQuery Table
To go a little deeper:
BigQuery stores data in the Capacitor columnar data format, and offers the standard database concepts of tables, partitions, columns, and rows.
It manages the technical aspects of storing your structured data, including compression, encryption, replication, performance tuning, and scaling.
You can read more about BigQuery different components in BigQuery Overview
Cloud Storage is a separate service from Big Query. Internally, Big Query manages its own storage.
So, if you save your data to Cloud Storage, and then use the bq command to load a Big Query table from a file in Cloud Storage, there are now 2 copies of the data.
Consequences include:
If you delete the Cloud Storage copy, the data will still be in Big Query.
Fees include a price for each copy. I think in April 2017 long term storage in BQ is around $0.01/GB, and in cloud storage around $0.01-$0.026/GB depending on storage class.
If the same data is in both GCS and BQ, you are paying twice. Whether it is worthwhile to have a backup copy of data is up to you.
BigQuery is a managed data warehouse, simply say it's a database.
So your data will be stored in BigQuery, and you can acccess it by using SQL queries.
I have created a BigQuery table by loading CSV file from Google cloud storage.
In this case, does BigQuery table reference the CSV file in cloud storage or it copies data to its own storage?
When you load file from Cloud Storage to BigQuery - this loads data into BigQuery "own" storage that is totally different from Cloud Storage.
Note: BigQuery supports querying data directly from Google Cloud Storage and Google Drive. See details at Creating and Querying Federated Data Sources
Does BigQuery have a feature to import data from s3?
If not then whats the best alternative path which you can suggest?
BigQuery doesn't support direct ingestion of data from S3 buckets. However, it is easy to move data from S3 buckets to Google Cloud Storage using the gsutil command line tool. I would suggest moving the data to Cloud Storage, then ingesting into BigQuery from there.
https://developers.google.com/storage/docs/gsutil