Recover overwritten Bigquery table with older table schema - google-bigquery

I accidentally overwrote an existing table by using it as a temporary table to store the result of another SELECT. Is there a way to roll it back if the old table and the new table have different table structures? Is it possible to prevent someone from overwriting a particular table, to avoid this in future?
There is a comment on the following question which says it is not possible to recover the table if the schema is different. I'm not sure whether that has changed recently.
Is it possible to recover overwritten data in BigQuery

First, overwrite your table again with something (anything) that has exactly the same schema as your "lost" table.
Then follow the same steps as in the referenced post, which is:
SELECT * FROM [yourproject:yourdataset.yourtable@<time>]
You can use @0 if your table was not changed for the last week or so, or since creation.
Or, to avoid the query cost, do bq cp ....

You could restore in SQL, but this loses each column's mode (nullability) and description, and it incurs query costs:
bq query --use_legacy_sql=false "CREATE OR REPLACE TABLE project.dataset.table AS SELECT * FROM project.dataset.table FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 MINUTE)"
I recently found this to be more effective:
Get a Unix timestamp in milliseconds and overwrite the table with a copy of its own snapshot using bq cp:
bq query --use_legacy_sql=false "SELECT UNIX_MILLIS(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 100 MINUTE))"
bq cp project:dataset.table@1625288152215 project:dataset.table
Before you do it, you can compare the two schemas as follows:
bq show --schema --format=prettyjson project:dataset.table@1625288152215 > schema-a.json
bq show --schema --format=prettyjson project:dataset.table > schema-b.json
diff schema-a.json schema-b.json
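The millisecond timestamp can also be computed locally instead of through a query. The following is a minimal sketch, assuming a `date` command that supports `+%s` and assuming the bq CLI snapshot decorator is written as `@<millis>`. The helper names `millis_ago` and `restore_cmd` are hypothetical, and the script only prints the `bq cp` command so that it can be reviewed before anything is overwritten.

```shell
# Sketch: build the snapshot-decorated `bq cp` command without running it.
# Assumption: `date +%s` is available (GNU and BSD date both provide it).

# Print the epoch time in milliseconds for N minutes ago.
millis_ago() {
  minutes=$1
  now_s=$(date +%s)
  echo $(( (now_s - minutes * 60) * 1000 ))
}

# Print the `bq cp` command that copies TABLE as of MILLIS over TABLE itself.
restore_cmd() {
  table=$1
  millis=$2
  echo "bq cp ${table}@${millis} ${table}"
}

# Example: the table as it was 100 minutes ago (placeholder table name).
restore_cmd "project:dataset.table" "$(millis_ago 100)"
```

Once the printed command looks right, it can be pasted into the shell and run.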

Related

BigQuery: (backup) copy of sharded Google Analytics tables

When Google Analytics is exported to BigQuery, the data is put in so-called sharded tables, one for each day. Their names all start with ga_sessions_ followed by a date suffix.
I want to make a backup copy of these sharded tables.
How do I do that?
If you want to back up the tables to a Cloud Storage bucket, you can try the following.
Query the table metadata to get the tables to export.
SELECT
table_name
FROM
`MyDataSet.INFORMATION_SCHEMA.TABLES`
WHERE
table_name LIKE 'ga_sessions_%'
Use the BigQuery export function to export the tables to the bucket.
-- If the tables are nested use json/avro/parquet
-- But be aware of the data type conversions:
-- https://cloud.google.com/bigquery/docs/exporting-data#exporting_data_stored_in
-- header and field_delimiter apply to CSV exports only, so they are omitted for JSON
EXPORT DATA OPTIONS(
uri='gs://bucket/folder/ga_sessions_<date>_*.json',
format='JSON',
overwrite=true) AS
SELECT * FROM mydataset.ga_sessions_<date>
Put it together in a BQ script with a loop, using FORMAT to create the query and EXECUTE IMMEDIATE to run the query.
BEGIN
DECLARE backup_date STRING DEFAULT CAST(CURRENT_DATE('UTC') AS STRING);
FOR record IN
(
SELECT
table_name
FROM
`MyDataSet.INFORMATION_SCHEMA.TABLES`
WHERE
table_name LIKE 'ga_sessions_%')
DO
EXECUTE IMMEDIATE
FORMAT("""
EXPORT DATA
OPTIONS(
uri=CONCAT('gs://your_backup_bucket/path/to/folder/', '%s','/', '%s','_*.json'),
format='JSON',
overwrite=true)
AS SELECT * FROM my_project.my_data_set.%s
"""
, backup_date, record.table_name, record.table_name);
END FOR;
END;
backup_date is used to create a 'folder' with the export date as a name in the bucket for the tables.
The * in the URI allows a table to be exported into multiple files. This only matters if the exported table is bigger than 1 GB (see the BigQuery export documentation).
Set a lifecycle rule on your storage bucket to archive files after an appropriate time, or set the bucket to the archive class by default if it's only for backup purposes (accessed less than once a year, see storage classes).
Props to Tim Lou for this article on using table meta data.
This answer relates to creating the backup in BQ:
So a sharded table such as ga_sessions_ basically consists of many tables, one for each day. We need to copy all those tables separately.
The answer below is copied from the following article, and all credit goes to its author:
https://medium.com/@Nayana22/playing-with-sharded-tables-in-bigquery-123e1ec5e453
So you can do as follows:
1. Get the list of all the days that you wish to copy by running the SQL command below. [You can modify this as per your need]
SELECT
REPLACE(STRING_AGG(CONCAT('"', partition_name,'"') ORDER BY partition_name ), ","," ")
FROM
(
SELECT
DISTINCT date AS partition_name
FROM
`[Project_ID].[DATASET_NAME].ga_sessions_*`
ORDER BY PARSE_DATE("%Y%m%d", date)
)
Output:
"20180322" "20180323" "20180324" "20180325" "20180326" "20180327"...
2. Now, go to the Google Cloud console, select the project and click on Activate Cloud Shell.
3. Check the current project using echo $DEVSHELL_PROJECT_ID. If you’re in an incorrect project then change it and move to step 4.
4. Create a bash script and specify the days that you just got from the above query.
tables=("20180322" "20180323" "20180324" "20180325" "20180326" "20180327"…)
5. Iterate through all the days available in the tables variable and use BigQuery’s copy command to copy each table from source to destination.
Syntax: [Note: There is a space between the source and destination table name]
bq cp -a source_table destination_table
Our script file looks like this:
tables=("20180322" "20180323" "20180324" "20180325" "20180326" "20180327"…)
for val in "${tables[@]}"; do
bq cp -a [source_dataset].ga_sessions_$val [destination_dataset].ga_sessions_backup_$val
done
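Before copying many shards, it can help to print the commands first and check them. This is a minimal dry-run sketch of the loop above. The function name `plan_copies` and the dataset names are hypothetical placeholders, and nothing is sent to BigQuery.

```shell
# Sketch: print one `bq cp` command per shard suffix instead of running it.
# Usage: plan_copies SRC_DATASET DST_DATASET SUFFIX...
plan_copies() {
  src=$1
  dst=$2
  shift 2
  for day in "$@"; do
    # -a appends to the destination table, as in the script above.
    echo "bq cp -a ${src}.ga_sessions_${day} ${dst}.ga_sessions_backup_${day}"
  done
}

# Example with two of the day suffixes returned by the query.
plan_copies source_dataset destination_dataset 20180322 20180323
```

If the printed commands look right, the same output can be piped to `sh` to run them.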
How to validate whether all the tables are copied or not?
WITH first_ga_session AS (
SELECT MIN(PARSE_DATE("%Y%m%d", REGEXP_EXTRACT(table_id, '20[0-9]{6}'))) AS day
FROM `[PROJECT_ID].[DATASET_NAME].__TABLES__` AS ga_tables
WHERE table_id LIKE 'ga_sessions_backup_2%'
),
all_days AS (
SELECT period
FROM
UNNEST(GENERATE_DATE_ARRAY((SELECT day from first_ga_session),
CURRENT_DATE())) AS period
),
available_ga_sessions AS (
SELECT PARSE_DATE("%Y%m%d", REGEXP_EXTRACT(table_id, '20[0-9]{6}')) AS ga_day
FROM `[PROJECT_ID].[DATASET_NAME].__TABLES__` AS ga_tables
WHERE table_id LIKE 'ga_sessions_backup_2%'
)
SELECT A.period AS Day, B.ga_day AS Available_session
FROM all_days A
LEFT JOIN available_ga_sessions B
ON A.period = B.ga_day
WHERE B.ga_day IS NULL
The above query will give us all the days that are missing in the destination table.
My bash script ended up looking like this (note that I'm cloning the tables with the --clone flag):
tables=("20220331" "20220401")
bq --location=eu mk --dataset destination_project_id:destination_dataset
for val in "${tables[@]}"; do
echo "source_project_id:source_dataset.ga_sessions_$val"
bq cp --clone source_project_id:source_dataset.ga_sessions_$val destination_project_id:destination_dataset.ga_sessions_backup_$val
done

how to view delta log after creating table

I have created a table in Delta format and have not ingested any data;
it is just an empty table. When I try using
DESCRIBE HISTORY table_name
it shows:
DESCRIBE HISTORY is only supported for Delta tables
even though my table is a Delta table.
But if I ingest any data, it works perfectly.
Use the below syntax:
DESCRIBE HISTORY table_identifier
table_identifier
[database_name.] table_name: A table name, optionally qualified with a database name.
delta.`<path-to-table>`: The location of an existing Delta table.
Refer: https://docs.databricks.com/delta/delta-utility.html#delta-history, https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-describe-history.html

How to modify CTAS query to append query results to table based on if new partition doesn't exist? - Athena

I have a query that I want to execute daily, partitioned by the date it's executed. The results of this query should be appended to the same table.
My idea was ideally having something similar to the CREATE TABLE IF NOT EXISTS command for adding data by a new partition every day to the existing table if the partition doesn't already exist, but I can't figure out how I'd be able to integrate this in my query.
My query:
CREATE TABLE IF NOT EXISTS db_name.table_name
WITH (
external_location = 's3://my-query-results-location/',
format = 'PARQUET',
parquet_compression = 'SNAPPY',
partitioned_by = ARRAY['date_executed'])
AS
SELECT
{columns_that_I_am_selecting_here_including_'date_executed'}
What this does is create a new table for the first day it's executed but nothing happens for subsequent days, I'm assuming because of the CREATE TABLE IF NOT EXISTS validating that the table already exists and not proceeding with the logic.
Is there a way to modify my query to create a table for the first day executed and append the results by a new partition for each subsequent day?
I'm quite sure ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION would not apply to my use case here as I'm running a CTAS query.
You can simply use INSERT INTO existing_table SELECT....
Presumably your table is already partitioned, so include that partition column in the SELECT and Amazon Athena will automatically put the data in the correct directory.
For example, you might include the column like this: SELECT ... CURRENT_DATE AS date_executed
See: INSERT INTO - Amazon Athena
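As a rough illustration of the daily append, the sketch below prints an INSERT INTO statement of that shape. The columns col_a and col_b and the table db_name.source_table are hypothetical placeholders, and the helper name `daily_insert_sql` is made up for this example. The partition column is placed last in the SELECT list, which is where Athena expects partition columns for INSERT INTO.

```shell
# Sketch: print the daily append statement for an existing partitioned table.
# Assumption: db_name.table_name was created once by the CTAS query above.
daily_insert_sql() {
  cat <<'SQL'
INSERT INTO db_name.table_name
SELECT col_a, col_b, CURRENT_DATE AS date_executed
FROM db_name.source_table
SQL
}

# Show the statement; it can then be submitted from the Athena console or a scheduler.
daily_insert_sql
```

Running the CTAS once and this statement on every later day gives one new date_executed partition per run. Note that running it twice on the same day would append the rows twice.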

Is there a way to restore a BigQuery table to an earlier state?

Is it possible to restore a BigQuery table to an earlier state, such as its state at a given timestamp?
Per https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax,
the documentation mentions that a query can return a historical version of the table.
"
The following query returns a historical version of the table at an absolute point in time.
SELECT *
FROM t
FOR SYSTEM_TIME AS OF '2017-01-01 10:00:00-07:00';
".
If a restore statement is available in BigQuery, then I think I can capture SYSTEM_TIME and restore a table to that timestamp.
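One option would be to recreate the table from its historical version with CREATE OR REPLACE TABLE t AS SELECT * FROM t FOR SYSTEM_TIME AS OF '2017-01-01 10:00:00-07:00'. The historical version can only be queried while the table still exists, so do not delete the table first.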
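A more cautious variant is to materialise the historical version under a separate name and swap it in only after checking it. The sketch below prints such a statement. The helper name `restore_sql` and the table name dataset.t_restored are hypothetical, and the timestamp has to fall inside the table's time travel window (seven days by default) for the query to succeed.

```shell
# Sketch: print a statement that copies a historical version of a table
# into a new table, leaving the current table untouched.
# Usage: restore_sql SOURCE_TABLE RESTORED_TABLE TIMESTAMP
restore_sql() {
  src=$1
  restored=$2
  ts=$3
  printf "CREATE TABLE %s AS SELECT * FROM %s FOR SYSTEM_TIME AS OF TIMESTAMP '%s'\n" \
    "$restored" "$src" "$ts"
}

# Example using the timestamp from the documentation excerpt above.
restore_sql dataset.t dataset.t_restored '2017-01-01 10:00:00-07:00'
```

The printed statement can be run with bq query --use_legacy_sql=false once the names and timestamp have been checked.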

BigQuery Equivalent of "CREATE TABLE my_table (LIKE your_table)"

I want to create a table whose schema is exactly the same as that of another table. In other SQL engines, I think I was able to use "CREATE TABLE my_table (LIKE your_table)" or some variation.
I couldn't find the equivalent in BigQuery yet. Is this possible in some fashion?
Use this form:
CREATE TABLE dataset.new_table AS
SELECT *
FROM dataset.existing_table
LIMIT 0
This creates a new table with the same schema as the old one, and there is no cost due to the LIMIT 0.
Note that this does not preserve partitioning, table description, etc., however. Another option is to use the CLI (or API), making a copy of the table and then overwriting its contents, e.g.:
$ bq cp dataset.existing_table dataset.new_table
$ bq query --use_legacy_sql --replace --destination_table=dataset.new_table \
"SELECT * FROM dataset.new_table LIMIT 0;"
Now the new table has the same structure and attributes as the original did.
To create a partitioned and/or clustered table the syntax would be:
CREATE TABLE project.dataset.clustered_table
PARTITION BY DATE(created_time)
CLUSTER BY
account_id
AS SELECT * FROM project.dataset.example_table LIMIT 0
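Another possibility, assuming your BigQuery environment supports the CREATE TABLE ... LIKE DDL statement, is to ask for an empty table with the same metadata directly. The sketch below only prints the bq command. The helper name `like_cmd` is hypothetical, and the table names are the placeholders used in the answers above.

```shell
# Sketch: print a bq command that creates an empty table with the same
# schema and metadata as an existing one.
# Assumption: the CREATE TABLE ... LIKE DDL statement is available.
like_cmd() {
  new=$1
  existing=$2
  echo "bq query --use_legacy_sql=false 'CREATE TABLE ${new} LIKE ${existing}'"
}

like_cmd dataset.new_table dataset.existing_table
```

Unlike the LIMIT 0 form, this does not require restating the PARTITION BY and CLUSTER BY clauses by hand.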