Backup options or snapshots of Google Cloud Storage data?

I pull data into Google BigQuery tables and also generate new datasets from that data daily.
I save both the original data and the generated datasets in Google Cloud Storage, for two purposes:
They are the backup copy of my Google BigQuery data.
Some of the datasets saved in Google Cloud Storage are also bulk loaded into AWS Elasticsearch (so they are the backup copy for AWS Elasticsearch as well).
BigQuery and AWS Elasticsearch may only keep 2 months to 1 year of data, so for anything older than that, the only copy I have is in Google Cloud Storage. (I need some backup option, such as 1 month of snapshots of Google Cloud Storage that I can go back to if needed.)
My question is:
How can I keep a backup or snapshot of Google Cloud Storage data to protect against data loss in Google Cloud Storage, so that I can go back at least 7 days or 1 month?
That way, if data is lost (accidentally deleted, etc.), I can go back a few days and recover it.
Thanks!

You can back up your cloud data to local storage; CloudBerry has a "Cloud to Local" option.

I can recommend the software I use myself: CloudBerry Backup, which can back up cloud storage to local storage or to other cloud storage. The tool supports various cloud storage providers, e.g. Amazon, Google, Azure. You can also download and upload data with it, so it is better to install it on a Google VM.
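If you would rather script the "cloud storage to other cloud storage" idea yourself, here is a minimal sketch in Python, not a tested backup solution. It assumes the `google-cloud-storage` client library, a second bucket reserved for backups, and a hypothetical `snapshots/YYYY-MM-DD/` naming scheme; the function names and the retention logic are illustrative only.

```python
from datetime import date, timedelta

SNAPSHOT_ROOT = "snapshots"  # hypothetical prefix inside the backup bucket


def snapshot_prefix(day: date) -> str:
    """Return the folder-like prefix for one day's snapshot."""
    return f"{SNAPSHOT_ROOT}/{day.isoformat()}/"


def expired_prefixes(existing: list[str], today: date, keep_days: int) -> list[str]:
    """Return the snapshot prefixes that are older than the retention window."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for prefix in existing:
        # The last path component of the prefix is the ISO date.
        day = date.fromisoformat(prefix.rstrip("/").rsplit("/", 1)[-1])
        if day < cutoff:
            expired.append(prefix)
    return expired


def copy_snapshot(source_bucket: str, backup_bucket: str, today: date) -> int:
    """Copy every object in source_bucket under today's snapshot prefix."""
    # Imported lazily so the helpers above work without the client library.
    from google.cloud import storage

    client = storage.Client()
    source = client.bucket(source_bucket)
    backup = client.bucket(backup_bucket)
    copied = 0
    for blob in client.list_blobs(source_bucket):
        source.copy_blob(blob, backup, new_name=snapshot_prefix(today) + blob.name)
        copied += 1
    return copied
```

Run `copy_snapshot` once a day from a scheduler, then delete whatever `expired_prefixes` returns with `keep_days=7` or `keep_days=30` to get the 7-day or 1-month window from the question. Bucket names are placeholders you would supply.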

Related

SQL database to Bigquery or SQL database to GCS to BigQuery

In the book Data Engineering with Google Cloud Platform by Adi Wijaya, when loading data from a SQL database into BigQuery, the author always loads the data from SQL into Google Cloud Storage first, uses it as a staging environment, and only then loads the data into BigQuery.
What are the advantages of going through the GCS step rather than straight into BigQuery? In which cases would you load data directly from a SQL database into BigQuery?
As mentioned in this post, BigQuery doesn't support the SQL format for loading data directly from Cloud SQL. You can use one of the following procedures:
You can use a BigQuery Cloud SQL federated query to import data directly into BigQuery from Cloud SQL.
Or, based on this documentation, you first generate CSV or JSON files from the Cloud SQL database, persist those files to Cloud Storage, and then load the data into BigQuery.
The advantages when loading data from Cloud SQL to Cloud Storage to BigQuery are:
Cloud Storage provides services like resumable uploads, whereas combining the job and the data means you need to be more careful about handling job failures and transient issues.
According to this documentation, using Cloud Storage you can take advantage of long term storage:
When you load data into BigQuery from Cloud Storage, you are not charged for the load operation, but you do incur charges for storing the data in Cloud Storage.
I also agree with John Hanley that loading data into BigQuery via Cloud Storage is faster, and that it gives you a consistent copy or backup to recover from in the event of a primary data failure.
A BigQuery table can be deleted when not in use and imported again when needed. The load is also less likely to fail when creating a table.
Additionally, storing data in BigQuery costs more than storing it in Cloud Storage. You are also subject to certain limitations when you load data into BigQuery from a Cloud Storage bucket.
To suggest the best strategy, your question needs more information; it still depends on your use case. More information on loading data can be found in the BigQuery documentation.
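For the staged route (Cloud SQL export to Cloud Storage, then into BigQuery), the final load step could look roughly like the sketch below. It assumes the `google-cloud-bigquery` client library and a CSV export with a header row; the helper names and the bucket, object, and table names in the usage note are hypothetical.

```python
def gcs_uri(bucket: str, object_name: str) -> str:
    """Build the gs:// URI that BigQuery load jobs expect."""
    return f"gs://{bucket}/{object_name.lstrip('/')}"


def load_csv_from_gcs(uri: str, table_id: str) -> int:
    """Load a staged CSV file into a BigQuery table and return its row count."""
    # Imported lazily so gcs_uri works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # assumption: the export has a header row
        autodetect=True,      # let BigQuery infer the schema
    )
    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # wait for the job; raises if the load failed
    return client.get_table(table_id).num_rows
```

You would call it as `load_csv_from_gcs(gcs_uri("my-staging-bucket", "exports/orders.csv"), "my-project.my_dataset.orders")`. Because the file stays in the bucket, a failed load can simply be retried against the same URI.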

What is the difference between BigQuery and Storage on GCP?

Hi guys, I am using GCP for the first time, and while walking through a project's Cloud Function example with mock data, I got confused about the similarities and differences between the two. I would like more clarity on what makes them different, because to me they seem so similar.
BigQuery is a data warehouse and a SQL engine. You can use it to store tabular data in datasets and tables. Tables can also hold more complex structures such as arrays and JSON, but not files, for example.
Cloud Storage is blob storage, with functionality similar to the file system on your Linux/Windows machine (saving files, folders, deleting, copying). Of course, the backend is nothing like your local file system.
BigQuery is a fully managed and serverless data warehouse. It's like Snowflake or Redshift.
Google Cloud Storage (GCS) is like Amazon S3 or Azure Storage. As the name suggests, storage services are for storing data.
You usually use BigQuery to analyze and query data in order to draw some insights. BigQuery is an analytical engine.
You can store images, videos, logs, files, etc. in GCS (Google Cloud Storage), but not in BigQuery.
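To make the contrast concrete, here is a rough Python sketch of what you typically do with each service. It assumes the official client libraries, with clients created as `bigquery.Client()` and `storage.Client()`; the two function names are made up for illustration.

```python
def count_rows(bq_client, table_id: str) -> int:
    """BigQuery: ask a SQL question about tabular data."""
    sql = f"SELECT COUNT(*) AS n FROM `{table_id}`"
    rows = bq_client.query(sql).result()  # run the query and wait for rows
    return next(iter(rows))["n"]


def store_file(gcs_client, bucket_name: str, object_name: str, local_path: str) -> str:
    """Cloud Storage: keep an arbitrary file (image, log, PDF) as an object."""
    blob = gcs_client.bucket(bucket_name).blob(object_name)
    blob.upload_from_filename(local_path)
    return f"gs://{bucket_name}/{object_name}"


# Typical setup (requires google-cloud-bigquery and google-cloud-storage):
#   from google.cloud import bigquery, storage
#   count_rows(bigquery.Client(), "my-project.my_dataset.my_table")
#   store_file(storage.Client(), "my-bucket", "img/logo.png", "logo.png")
```

The first function only makes sense for rows and columns; the second accepts any bytes and never interprets them.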
Google BigQuery belongs to "Big Data as a Service" category of the tech stack, while Google Cloud Storage can be primarily classified under "Cloud Storage".
Some of the features offered by Google BigQuery are:
• All behind the scenes - Your queries can execute asynchronously in the background, and can be polled for status.
• Import data with ease - Bulk load your data using Google Cloud Storage or stream it in bursts of up to 1,000 rows per second.
• Affordable big data - The first Terabyte of data processed each month is free.
On the other hand, Google Cloud Storage provides the following key features:
• High Capacity and Scalability
• Strong Data Consistency
• Google Developers Console Projects
"High Performance" is the primary reason why developers consider Google BigQuery over the competitors, whereas "Scalable" was stated as the key factor in picking Google Cloud Storage.

Syncing Azure BLOB Storage to Amazon S3

We're storing about 4 million files (4 TB or so) of miscellaneous files, mainly Word and PDF, in Azure BLOB storage. I'm looking to replicate this data in a different cloud for disaster recovery and peace of mind, and Amazon S3 seems as good a candidate as any.
Trouble is, I don't have a local server large enough to hold a local copy of these files. Ideally, I'd want to sync right from Azure Blob to S3. We're adding new files continually, so the sync would need to be frequent as well (multiple times per day).
I see lots of options for download from Azure to local => upload from local to S3, but very little for direct Azure => S3 sync. What are some good options here?
You can migrate Azure Storage data to Amazon S3 using a Node.js package.
You can see the full description provided here.
You can also use Azure Data Factory to replicate the data, as it provides a copy tool that can be adjusted to your needs and transfer settings.
You can refer to this document on Azure data factory and copy tool.
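Whichever tool you pick, a frequent sync boils down to "list both sides, copy what is new or changed". Below is a minimal Python sketch of that logic, offered as an illustration rather than a replacement for the tools above. Comparing by object size is an assumption (a checksum would be stricter), and the copy helper assumes an `azure-storage-blob` `ContainerClient` and a `boto3` S3 client are passed in.

```python
def plan_sync(source: dict[str, int], target: dict[str, int]) -> list[str]:
    """Return names missing from the target or differing in size, sorted.

    Both arguments map object name -> size in bytes, e.g. built from the
    listing calls of each provider's SDK.
    """
    return sorted(name for name, size in source.items() if target.get(name) != size)


def copy_blob_to_s3(container_client, s3_client, bucket: str, name: str) -> None:
    """Copy one blob from Azure to S3 without touching local disk."""
    # readall() holds the whole blob in memory, which suits Word/PDF-sized
    # files; very large blobs would need a chunked or multipart upload.
    data = container_client.download_blob(name).readall()
    s3_client.put_object(Bucket=bucket, Key=name, Body=data)
```

Running this on a small VM or a scheduled job avoids the need for a local server big enough to hold all 4 TB, since each file only passes through memory.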

Where the data will be stored by BigQuery

I am using BigQueryIO to publish data into BigQuery from a Google Dataflow job.
AFAIK, BigQuery can be used to query data from Google Cloud Storage, Google Drive and Google Sheets.
But when we store data using BigQueryIO, where will the data be stored? Is it in Google Cloud Storage?
Short answer: BigQueryIO writes to and reads from BigQuery tables.
To go a little deeper:
BigQuery stores data in the Capacitor columnar data format, and offers the standard database concepts of tables, partitions, columns, and rows.
It manages the technical aspects of storing your structured data, including compression, encryption, replication, performance tuning, and scaling.
You can read more about BigQuery's different components in the BigQuery Overview.
Cloud Storage is a separate service from BigQuery. Internally, BigQuery manages its own storage.
So, if you save your data to Cloud Storage and then use the bq command to load a BigQuery table from a file in Cloud Storage, there are now 2 copies of the data.
Consequences include:
If you delete the Cloud Storage copy, the data will still be in BigQuery.
Fees include a price for each copy. As of April 2017, I think long-term storage in BQ was around $0.01/GB, and Cloud Storage around $0.01-$0.026/GB depending on storage class.
If the same data is in both GCS and BQ, you are paying twice. Whether it is worthwhile to have a backup copy of data is up to you.
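As a quick worked example of the "paying twice" point, the snippet below plugs in the approximate 2017 figures quoted above. Treating them as per-GB-per-month rates is my assumption, and current pricing will differ, so read this as arithmetic only.

```python
# Approximate figures quoted above (USD per GB; assumed to be monthly rates).
BQ_LONG_TERM = 0.01
GCS_LOW, GCS_HIGH = 0.01, 0.026


def storage_cost(size_gb: float, rate_per_gb: float) -> float:
    """Cost of keeping size_gb at the given rate, rounded to cents."""
    return round(size_gb * rate_per_gb, 2)


def two_copy_cost_range(size_gb: float) -> tuple[float, float]:
    """Cost range when the same data sits in both BigQuery and Cloud Storage."""
    bq = storage_cost(size_gb, BQ_LONG_TERM)
    low = round(bq + storage_cost(size_gb, GCS_LOW), 2)
    high = round(bq + storage_cost(size_gb, GCS_HIGH), 2)
    return low, high


print(two_copy_cost_range(1000))  # (20.0, 36.0)
```

So for 1,000 GB, keeping both copies comes to roughly $20 to $36 under these rates, versus about $10 for the BigQuery copy alone.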
BigQuery is a managed data warehouse; simply put, it's a database.
So your data will be stored in BigQuery, and you can access it using SQL queries.

Does Google BigQuery and Google cloud storage share files between them?

I have created a BigQuery table by loading a CSV file from Google Cloud Storage.
In this case, does the BigQuery table reference the CSV file in Cloud Storage, or does it copy the data to its own storage?
When you load a file from Cloud Storage into BigQuery, the data is loaded into BigQuery's own storage, which is entirely separate from Cloud Storage.
Note: BigQuery supports querying data directly from Google Cloud Storage and Google Drive. See details at Creating and Querying Federated Data Sources.