I have created a BigQuery table by loading a CSV file from Google Cloud Storage.
In this case, does the BigQuery table reference the CSV file in Cloud Storage, or does it copy the data into its own storage?
When you load a file from Cloud Storage into BigQuery, the data is copied into BigQuery's own storage, which is entirely separate from Cloud Storage.
Note: BigQuery also supports querying data directly from Google Cloud Storage and Google Drive. See details in Creating and Querying Federated Data Sources.
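As a minimal sketch of such a load job using the Python client library (the bucket, file, and table names here are placeholders):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Placeholder names for illustration.
    table_id = "my-project.my_dataset.my_table"
    uri = "gs://my-bucket/data.csv"

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # skip the CSV header row
        autodetect=True,      # infer the schema from the file
    )

    # The load job copies the file's contents into BigQuery's own storage;
    # deleting the CSV in GCS afterwards does not affect the table.
    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # block until the load completes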
In the book Data Engineering with Google Cloud Platform by Adi Wijaya, when loading data from a SQL database to BigQuery, the author always loads the data from SQL into Google Cloud Storage first, uses it as a staging environment, and only after that loads the data into BigQuery.
What are the advantages of going through the GCS step rather than loading straight into BigQuery? In which cases would you load data directly from the SQL database into BigQuery?
As mentioned in this post, BigQuery doesn't support the SQL dump format for loading data directly from Cloud SQL into BigQuery. You can follow either of the procedures below:
You can use a BigQuery Cloud SQL federated query to import data directly into BigQuery from Cloud SQL (a sketch follows after this list).
Based on this documentation, you can instead generate CSV or JSON files from the Cloud SQL database, persist those files to Cloud Storage, and load the data into BigQuery from there.
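As a rough sketch of the federated-query option, assuming you have already created a Cloud SQL connection resource in BigQuery (the connection ID and table names below are hypothetical):

    from google.cloud import bigquery

    client = bigquery.Client()

    # EXTERNAL_QUERY runs the inner statement on the Cloud SQL instance
    # behind the named connection and returns the result to BigQuery.
    sql = """
    SELECT *
    FROM EXTERNAL_QUERY(
        'my-project.us.my-cloudsql-connection',  -- hypothetical connection ID
        'SELECT id, name, created_at FROM customers;')
    """
    for row in client.query(sql).result():
        print(row)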
The advantages of loading data from Cloud SQL to Cloud Storage and then to BigQuery are:
Cloud Storage provides services like resumable uploads, whereas combining the job and the data means you would need to be more careful about managing job failures and transient issues yourself.
According to this documentation, using Cloud Storage lets you take advantage of long-term storage:
When you load data into BigQuery from Cloud Storage, you are not charged for the load operation, but you do incur charges for storing the data in Cloud Storage.
And as mentioned by @John Hanley, I agree that the advantage of loading data into BigQuery via Google Cloud Storage is that it is faster, and you get a consistent copy or backup that can be recovered in the event of a primary data failure.
A BigQuery table can be deleted when not in use and re-imported when needed, and creating the table this way is less likely to fail.
As additional information, the cost of storing data in BigQuery is higher than in Cloud Storage, and you are subject to the following limitations when you load data into BigQuery from a Cloud Storage bucket.
To suggest the best strategy, your question would need more information; it depends on your use case. More information on loading data can be found in the BigQuery documentation.
What's the best way to pull data from Cloud Spanner into BigQuery for data analysis?
Thanks!
You can use the Google-provided Dataflow template to pull data from Spanner into GCS, then run a load job to load it into BigQuery (see the sketch after the template links below).
Export Spanner database
Cloud Spanner to GCS AVRO
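Once the Avro files produced by the template land in GCS, the load half might look roughly like this (bucket and table names are placeholders):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Avro files carry their own schema, so no autodetect flag is needed.
    job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)

    load_job = client.load_table_from_uri(
        "gs://my-bucket/spanner-export/*.avro",  # placeholder export path
        "my-project.my_dataset.spanner_table",   # placeholder table
        job_config=job_config,
    )
    load_job.result()  # wait for the load to finish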
Check this Link1. You don't need any external service to migrate the data; you can read all of the Spanner data directly through BigQuery and load it.
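If you go the direct route, a Cloud Spanner federated query might look roughly like this; the connection ID and table names are hypothetical, and you must first create the Spanner connection in BigQuery:

    from google.cloud import bigquery

    client = bigquery.Client()

    # The inner query runs on Spanner through the named connection;
    # the result is written to a regular BigQuery table.
    sql = """
    SELECT *
    FROM EXTERNAL_QUERY(
        'my-project.us.my-spanner-connection',  -- hypothetical connection ID
        'SELECT * FROM Singers')
    """
    job_config = bigquery.QueryJobConfig(
        destination="my-project.my_dataset.spanner_copy"  # placeholder table
    )
    client.query(sql, job_config=job_config).result()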
I have a MySQL DB in AWS. Can I use that database as a data source in BigQuery?
At the moment I am uploading CSVs to a Google Cloud Storage bucket and loading from there.
I would like to keep it synchronized by pointing BigQuery at the data source itself rather than loading it every time.
You can create a permanent external table in BigQuery that is connected to Cloud Storage. BigQuery is then just the query interface while the data resides in GCS. The table can point at a single CSV file, which you are free to update or overwrite, and you can also use a wildcard in the source URI (for example gs://bucket/dir/*.csv) to match multiple files; a sketch follows after the link.
Anyway, have a look here: https://cloud.google.com/bigquery/external-data-cloud-storage
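A minimal sketch of creating such a permanent external table with the Python client (all names are placeholders):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Describe the external data: CSV files in GCS, schema auto-detected.
    external_config = bigquery.ExternalConfig("CSV")
    external_config.source_uris = ["gs://my-bucket/exports/*.csv"]  # wildcard matches many files
    external_config.autodetect = True
    external_config.options.skip_leading_rows = 1  # skip header rows

    # The table stores no data itself; queries read the files at query time,
    # so overwriting the files in GCS updates what BigQuery sees.
    table = bigquery.Table("my-project.my_dataset.mysql_export")
    table.external_data_configuration = external_config
    client.create_table(table)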
I pull data into Google BigQuery tables and also generate some new datasets based on that data daily.
I save both the original data and the generated datasets in Google Cloud Storage, for two purposes:
They are the backup copy of my Google BigQuery data.
Some of the datasets saved in Google Cloud Storage are also bulk-loaded into AWS Elasticsearch (so they are the backup copy for AWS Elasticsearch as well).
BigQuery and AWS Elasticsearch may only keep 2 months to 1 year of data, so for anything older than that I have only one copy, in Google Cloud Storage. (I need some backup option, such as 1-month snapshots of Google Cloud Storage, that I can go back to if needed.)
My questions are:
How can I keep a backup or snapshot of the Google Cloud Storage data to prevent data loss in Google Cloud Storage, such that I can go back at least 7 days or 1 month?
That way, in the case of data loss (accidentally deleting data, etc.), I can go back a few days and get the data back.
Thanks!
You can back up your cloud data to local storage; CloudBerry has a "Cloud to Local" option.
I can recommend the software I am using myself, CloudBerry Backup, which can back up cloud storage to local storage or to another cloud storage. The tool supports various cloud storage providers, e.g. Amazon, Google, and Azure. You can also download and upload data with the tool, so it's best to install it on a Google VM.
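If you would rather script a cloud-to-local copy yourself instead of using a third-party tool, a minimal sketch with the Python storage client could look like this (the bucket name and target directory are placeholders):

    import os
    from google.cloud import storage

    client = storage.Client()
    bucket_name = "my-bucket"      # placeholder bucket
    backup_root = "/backups/gcs"   # placeholder local directory

    # Download every object, preserving the bucket's pseudo-directory layout.
    for blob in client.list_blobs(bucket_name):
        if blob.name.endswith("/"):  # skip directory placeholder objects
            continue
        local_path = os.path.join(backup_root, blob.name)
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        blob.download_to_filename(local_path)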
Does BigQuery have a feature to import data from S3?
If not, what's the best alternative path you can suggest?
BigQuery doesn't support direct ingestion of data from S3 buckets. However, it is easy to move data from S3 buckets to Google Cloud Storage using the gsutil command-line tool. I would suggest moving the data to Cloud Storage, then ingesting it into BigQuery from there; a scripted sketch follows after the link.
https://developers.google.com/storage/docs/gsutil
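If you prefer to script the copy rather than run gsutil by hand, a rough sketch with boto3 and the Google Cloud clients (all names are placeholders, and credentials for both clouds must already be configured):

    import boto3
    from google.cloud import bigquery, storage

    # Placeholder names.
    S3_BUCKET, KEY = "my-s3-bucket", "data.csv"
    GCS_BUCKET = "my-gcs-bucket"
    TABLE_ID = "my-project.my_dataset.my_table"

    # Step 1: copy the object from S3 to Cloud Storage via a local temp file.
    boto3.client("s3").download_file(S3_BUCKET, KEY, "/tmp/data.csv")
    storage.Client().bucket(GCS_BUCKET).blob(KEY).upload_from_filename("/tmp/data.csv")

    # Step 2: load from Cloud Storage into BigQuery.
    bq = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    bq.load_table_from_uri(
        f"gs://{GCS_BUCKET}/{KEY}", TABLE_ID, job_config=job_config
    ).result()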