How to take a backup of a dataset in BigQuery? - google-bigquery

We want to create a backup copy of a BigQuery dataset in case a table is accidentally dropped, since a dropped table is only recoverable within 7 days.
Is there a way to extend the duration of the recovery period? If not, how can we create a backup of a dataset with a retention period of 30 days in BigQuery?

It is currently not possible to extend the duration of the recovery period. As Katayoon commented, a feature request for the ability to extend the recovery period has already been filed.
Here is a public link to monitor the progress on that issue: https://issuetracker.google.com/120038872
To back up datasets in BigQuery you could either make copies of your dataset or, as a more workable solution, export the data to Cloud Storage so you can import it back at a later time. Cloud Storage lets you set a retention period and a lifecycle policy, which together make sure the data stays undisturbed for the desired amount of time and is removed automatically afterwards, should you wish to save on storage costs.
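For the dataset-copy route, a minimal sketch with the BigQuery Python client could look like this (the project, source dataset, and backup dataset names are placeholders, and it assumes the backup dataset already exists):

from google.cloud import bigquery

client = bigquery.Client()

# copy every table of the source dataset into a backup dataset
for table in client.list_tables("my_dataset"):
    client.copy_table(
        table.reference,
        f"my-project.my_dataset_backup.{table.table_id}",
    ).result()   # wait for each copy job to finish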
How to export data from BigQuery:
You can export tables as Avro, JSON, or CSV files to Cloud Storage via the web UI, the command line, the API, or client libraries in various languages such as C#, Go, Python, and Java, as long as the dataset and the bucket are in the same location. There are other limitations to exporting a table, such as file size, integer encoding, data compression, etc.
Link to table export and limitations:
https://cloud.google.com/bigquery/docs/exporting-data
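As a rough illustration of the export step with the BigQuery Python client (the table and bucket names below are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# extract a table to Cloud Storage as Avro; the wildcard lets BigQuery shard large exports
job_config = bigquery.ExtractJobConfig(destination_format=bigquery.DestinationFormat.AVRO)
client.extract_table(
    "my-project.my_dataset.my_table",
    "gs://my-backup-bucket/my_dataset/my_table-*.avro",
    job_config=job_config,
).result()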
You can find the instructions on the procedures here:
Retention Policies and Bucket Lock: https://cloud.google.com/storage/docs/using-bucket-lock#lock-bucket
Object Lifecycle Management:
https://cloud.google.com/storage/docs/managing-lifecycles
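A minimal sketch of setting both on a bucket with the Cloud Storage Python client (the bucket name and the 30-day figures are just examples):

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-backup-bucket")   # placeholder bucket name

# retention policy: objects cannot be deleted or overwritten for 30 days
bucket.retention_period = 30 * 24 * 3600
bucket.patch()

# lifecycle rule: delete objects automatically once they are 30 days old
bucket.add_lifecycle_delete_rule(age=30)
bucket.patch()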
Loading data into BigQuery can be done using various file formats, such as CSV, JSON, Avro, Parquet, or ORC. At the moment you can load directly only from local storage or from Cloud Storage. More on loading data, file formats, data sources and limitations by following the link: https://cloud.google.com/bigquery/docs/loading-data
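And the corresponding restore step, again as a hedged sketch with placeholder names:

from google.cloud import bigquery

client = bigquery.Client()

# load the exported Avro files from Cloud Storage back into a (new) BigQuery table
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)
client.load_table_from_uri(
    "gs://my-backup-bucket/my_dataset/my_table-*.avro",
    "my-project.my_dataset.my_table_restored",
    job_config=job_config,
).result()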
More information on
Exporting tables: https://cloud.google.com/bigquery/docs/exporting-data
Export limitations: https://cloud.google.com/bigquery/docs/exporting-data#export_limitations
Loading data into BigQuery: https://cloud.google.com/bigquery/docs/loading-data
Wildcards: https://cloud.google.com/storage/docs/gsutil/addlhelp/WildcardNames
Merging the file: https://cloud.google.com/storage/docs/gsutil/commands/compose

You can take a snapshot of a table using either SQL or CLI:
SQL
CREATE SNAPSHOT TABLE
myproject.library_backup.books
CLONE myproject.library.books
OPTIONS(expiration_timestamp = TIMESTAMP "2022-04-27 12:30:00.00-08:00")
CLI
bq cp --snapshot --no_clobber --expiration=86400 library.books library_backup.books

You can back up and restore using the tools in https://github.com/GoogleCloudPlatform/bigquery-oreilly-book/tree/master/blogs/bigquery_backup:
Back up a table to GCS
./bq_backup.py --input dataset.tablename --output gs://BUCKET/backup
This saves a schema.json, a tabledef.json, and extracted data in AVRO format to GCS.
You can also back up all the tables in a dataset:
./bq_backup.py --input dataset --output gs://BUCKET/backup
Restore tables one by one by specifying a destination dataset
./bq_restore.py --input gs://BUCKET/backup/fromdataset/fromtable --output destdataset
For views, the backup stores the view definition and the restore creates a view.

Related

Which file format should I use that supports appending?

Currently we use the ORC file format to store incoming traffic in S3 for fraud detection analysis.
We chose the ORC file format for the following reasons:
compression
and the ability to query the data using Athena
Problem:
ORC files are read-only, and we want to update the file contents every 20 minutes,
which implies we
need to download the ORC files from S3,
read the files,
write to the end of each file,
and finally upload them back to S3.
This was not a problem at first, but the data grows significantly (~2 GB every day), so it is a very costly process to download ~10 GB of files, read them, append to them, and upload them again.
Question:
Is there any way to use another file format which also offers appends/inserts and can be queried by Athena?
This article says Avro is such a file format, but I'm not sure
if Athena can be used for querying it?
or if there are any other issues?
Note: my skill level with big data technologies is beginner.
If your table is not partitioned, you can simply copy (aws s3 cp) your new ORC files to the target S3 path for the table and they will be available instantly for querying via Athena.
If your table is partitioned, you can copy new files to the paths corresponding to your specific partitions. After copying new files to a partition, you need to add or update that partition in Athena's metastore.
For example, if your table is partitioned by date, then you need to run this query to ensure your partition gets added/updated:
alter table dataset.tablename add if not exists
partition (date = YYYYMMDD)
location 's3://your-bucket/path_to_table/date=YYYYMMDD/'
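If you want to script both steps, here is a minimal Python/boto3 sketch (the bucket, table, database, and output-location names are placeholders):

import boto3

s3 = boto3.client("s3")
athena = boto3.client("athena")

# 1) copy the new ORC file into the partition's S3 prefix
s3.upload_file("new_data.orc", "your-bucket", "path_to_table/date=20240101/new_data.orc")

# 2) register (or refresh) the partition in Athena's metastore
ddl = (
    "ALTER TABLE dataset.tablename ADD IF NOT EXISTS "
    "PARTITION (date = 20240101) "
    "LOCATION 's3://your-bucket/path_to_table/date=20240101/'"
)
athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "dataset"},
    ResultConfiguration={"OutputLocation": "s3://your-bucket/athena-query-results/"},
)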

Automatic ETL of data before loading into BigQuery

I have CSV files added to a GCS bucket daily or weekly, and each file name contains a date plus a specific parameter.
The files contain two columns (id, name), and we need to automatically load/ingest these files into a BigQuery table so that the final table has 4 columns (id, name, date, specific parameter).
We have tried Dataflow templates, but we couldn't get the date and specific parameter from the file name into the Dataflow job.
And we tried a Cloud Function (we can get the date and specific parameter value from the file name), but we couldn't add them as columns during ingestion.
Any suggestions?
Disclaimer: I have authored an article about this kind of problem using Cloud Workflows, for the case when you want to extract parts of the filename to use in the table definition later.
We will create a Cloud Workflow to load data from Google Storage into BigQuery. This linked article is a complete guide on how to work with workflows, connecting any Google Cloud APIs, working with subworkflows, arrays, extracting segments, and calling BigQuery load jobs.
Let’s assume we have all our source files in Google Storage. Files are organized in buckets, folders, and could be versioned.
Our workflow definition will have multiple steps.
(1) We will start by using the GCS API to list files in a bucket, by using a folder as a filter.
(2) For each file then, we will further use parts from the filename to use in BigQuery’s generated table name.
(3) The workflow’s last step will be to load the GCS file into the indicated BigQuery table.
We are going to use BigQuery query syntax to parse and extract the segments from the URL and return them as a single row result. This way we will have an intermediate lesson on how to query from BigQuery and process the results.
Full article with lots of Code Samples is here: Using Cloud Workflows to load Cloud Storage files into BigQuery
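If you would rather stay with a Cloud Function, a rough Python sketch of the same idea follows; the file-name pattern, project, dataset, and table names are made up for illustration, and this is not the Workflows definition from the article:

import re
from google.cloud import bigquery

def load_gcs_file(bucket_name: str, file_name: str) -> None:
    # hypothetical file-name pattern: "<parameter>_<YYYY-MM-DD>.csv"
    match = re.match(r"(?P<param>\w+)_(?P<date>\d{4}-\d{2}-\d{2})\.csv$", file_name)
    param, date = match.group("param"), match.group("date")

    client = bigquery.Client()

    # 1) load the two-column CSV into a staging table
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        schema=[
            bigquery.SchemaField("id", "STRING"),
            bigquery.SchemaField("name", "STRING"),
        ],
        write_disposition="WRITE_TRUNCATE",
    )
    client.load_table_from_uri(
        f"gs://{bucket_name}/{file_name}",
        "my-project.my_dataset.staging",
        job_config=job_config,
    ).result()

    # 2) append to the final 4-column table, adding the values parsed from the file name
    client.query(
        f"""
        INSERT INTO `my-project.my_dataset.final` (id, name, file_date, parameter)
        SELECT id, name, DATE '{date}', '{param}'
        FROM `my-project.my_dataset.staging`
        """
    ).result()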

BigQuery Table creation options

When we create a table under a particular dataset, we have 5 options, such as Empty table, Google Cloud Storage, Upload, etc. My question is: if the source is Cloud Storage, where does this table get created, in BigQuery or in Cloud Storage? My intention is to dump the data into Cloud Storage and then load it into BigQuery. The same goes for the empty table: since we explicitly define the schema, I understand the table will reside in BQ.
I have loaded the data with the script below:
bq load --source_format=CSV --skip_leading_rows=1 --autodetect --ignore_unknown_values \
commerce.balltoball gs://balltoballbucket/head_usa_names.csv
I suppose balltoballbucket refers to the storage bucket, whereas commerce.balltoball is the BigQuery reference.
Apologies for newbie question. Thanks for your help.
If your bq load works, then the UI should work for you too. The documentation is here:
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv#loading_csv_data_into_a_table (then pick Console tab)
Select file from GCS bucket: gs://balltoballbucket/head_usa_names.csv
File Format: CSV
Dataset Name: commerce
Table Name: balltoball
Other options can be seen on that page:
(Optional) Click Advanced options.
As to where the table is stored: if you pick Native table as the Table type, it is stored inside BigQuery storage; pick External table to let the data stay on GCS and only be read from there when a query hits the table.
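To make the native vs. external distinction concrete, here is a hedged Python sketch of both options (the project id is a placeholder; the bucket and dataset names are taken from the question):

from google.cloud import bigquery

client = bigquery.Client()

# Native table: the CSV is copied into BigQuery's own storage
load_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
client.load_table_from_uri(
    "gs://balltoballbucket/head_usa_names.csv",
    "my-project.commerce.balltoball",
    job_config=load_config,
).result()

# External table: the data stays in GCS and is read from there at query time
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://balltoballbucket/head_usa_names.csv"]
external_config.autodetect = True
table = bigquery.Table("my-project.commerce.balltoball_external")
table.external_data_configuration = external_config
client.create_table(table)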

DataBricks - save changes back to DataLake (ADLS Gen2)

I have legacy data stored as CSV in an Azure DataLake Gen2 storage account. I'm able to connect to this and interrogate it using DataBricks. I have a requirement to remove certain records once their retention period expires, or if a GDPR "right to be forgotten" needs applying to the data.
Using Delta I can load a CSV into a Delta table and use SQL to locate and delete the required rows, but what is the best way to save these changes? Ideally back to the original file, so that the data is removed from the original. I've used the LOCATION option when creating the Delta table to persist the generated Parquet format files to the DataLake but it would be nice to keep it in the original CSV format.
Any advice appreciated.
I'd be careful here. Right to be forgotten means you need to delete the data. Delta doesn't actually delete it from the original file (initially at least) - this will only happen once the data is vacuumed.
The safest way to delete data is to read all the data into a dataframe, filter off the records you do not want, and then write it back using overwrite. This will ensure the data is removed and the same structure is re-written.
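A minimal PySpark sketch of that approach (the path and filter predicate are hypothetical; on Databricks, spark is already available in a notebook):

# read the legacy CSV from the lake (hypothetical ADLS Gen2 path)
src = "abfss://legacy@youraccount.dfs.core.windows.net/data/customers.csv"
df = spark.read.option("header", True).csv(src)

# keep only the rows that are NOT subject to deletion (hypothetical predicate)
kept = df.filter(df["customer_id"] != "12345")

# write the filtered data to a staging folder first, then swap it in for the
# original, so you never overwrite the path you are still reading from
staging = "abfss://legacy@youraccount.dfs.core.windows.net/data/customers_cleaned"
kept.write.mode("overwrite").option("header", True).csv(staging)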
Convert Parquet to CSV in ADF
The versioned parquet files created in the ADLS Gen2 location can be converted to CSV using the Copy Data task in an Azure Data Factory pipeline.
So, you could read the CSV data into a Delta table (with LOCATION pointing to a data lake folder), perform the required changes using SQL, and then convert the Parquet files to CSV format using ADF.
I have tried this and it works. The only hurdle might be detecting the column headers while reading the CSV file to Delta. You could read it to a dataframe and create a Delta table from it.
If you are running the delete operations periodically, then it is costly to keep saving the file as CSV: every time you read the file, transform the dataframe to Delta, query it, and finally, after filtering the records, save it back to CSV and delete the Delta table.
So my suggestion here would be: transform the CSV to Delta once, perform the deletes periodically, and generate CSV only when it's needed.
The advantage here is that Delta internally stores data in Parquet, a binary format that allows better compression and encoding/decoding of data.

Exporting query results as JSON via Google BigQuery API

I've got jobs/queries that return a few hundred thousand rows. I'd like to get the results of the query and write them as json in a storage bucket.
Is there any straightforward way of doing this? Right now the only method I can think of is:
set allowLargeResults to true
set a randomly named destination table to hold the query output
create a 2nd job to extract the data in the "temporary" destination table to a file in a storage bucket
delete the random "temporary" table.
This just seems a bit messy and roundabout. I'm going to be wrapping all this in a service hooked up to a UI that would have lots of users hitting it and would rather not be in the business of managing all these temporary tables.
1) As you mention, the steps are good. You need to use Google Cloud Storage for your export job. Exporting data from BigQuery is explained here; check also the variants for different path syntaxes.
Then you can download the files from GCS to your local storage.
The gsutil tool can help you download the files from GCS to your local machine.
With this approach you first need to export to GCS, then transfer to the local machine. If you have a message queue system (like Beanstalkd) in place to drive all this, it's easy to do a chain of operations: submit the job, monitor its state, initiate the export to GCS when it's done, then delete the temporary table.
Please also know that you can update a table via the API and set the expirationTime property; with this approach you don't need to delete it yourself.
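Putting approach 1) together with the BigQuery Python client might look roughly like this (the project, dataset, table, and bucket names are placeholders):

import datetime
from google.cloud import bigquery

client = bigquery.Client()
dest = "my-project.tmp_dataset.query_export"   # placeholder destination table

# 1) run the query into an explicit destination table
job_config = bigquery.QueryJobConfig(destination=dest)
client.query(
    "SELECT * FROM `bigquery-public-data.samples.shakespeare`",
    job_config=job_config,
).result()

# 2) set an expiration so BigQuery deletes the table for you (no manual cleanup)
table = client.get_table(dest)
table.expires = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(hours=1)
client.update_table(table, ["expires"])

# 3) export the destination table to GCS as newline-delimited JSON
extract_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
)
client.extract_table(
    dest,
    "gs://my-export-bucket/results/result-*.json",
    job_config=extract_config,
).result()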
2) If you use the bq CLI tool, you can set the output format to JSON and redirect it to a file. This way you can achieve some export locally, but it has certain other limits.
This exports the first 1000 lines as JSON:
bq --format=prettyjson query --n=1000 "SELECT * from publicdata:samples.shakespeare" > export.json