Creating an empty table based on another table's schema in BigQuery - google-bigquery

We have a BigQuery dataset that contains a long list of tables (with data). Since I am taking over a data pipeline that I want to familiarize myself with by running tests, I want to duplicate the dataset and its tables without copying and then truncating them. Essentially, I want to re-create those tables in a test dataset using their schemas. How can this be done with the bq client?

You have a couple of options, since you want to copy only the schema and not the data:
1. Extract the schema of each table, then create new, empty tables from it.
$ bq show --schema --format=prettyjson [PROJECT_ID]:[DATASET].[TABLE] > [SCHEMA_FILE]
$ bq mk --table [PROJECT_ID]:[NEW_DATASET].[TABLE] [SCHEMA_FILE]
2. Run a query with LIMIT 0 and set a destination table.
$ bq query --destination_table [NEW_DATASET].[TABLE] "SELECT * FROM [DATASET].[TABLE] LIMIT 0"
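Since the dataset holds many tables, option 1 can be wrapped in a loop. The following is a minimal sketch rather than a tested tool: clone_schemas is a hypothetical helper name, the dataset arguments are placeholders, and it assumes that bq ls prints its default text table with the table name in the first column and the type in the second. Note that bq show --schema returns only the column definitions, so settings such as partitioning would not be carried over.

```shell
#!/bin/sh
# Sketch: re-create every table of one dataset, empty, in another dataset.
# clone_schemas is a hypothetical helper; it assumes the default text output
# of `bq ls` (table name in column 1, type in column 2).
clone_schemas() {
  src="$1"                      # e.g. myproject:prod_dataset (placeholder)
  dst="$2"                      # e.g. myproject:test_dataset (placeholder)
  schema_file=$(mktemp) || return 1
  # Keep only rows whose type is TABLE (skips the header, views, etc.).
  bq ls -n 10000 "$src" | awk '$2 == "TABLE" { print $1 }' |
  while read -r table; do
    # Dump the schema as JSON, then create an empty table from it.
    bq show --schema --format=prettyjson "$src.$table" < /dev/null > "$schema_file" &&
      bq mk --table "$dst.$table" "$schema_file" < /dev/null
  done
  rm -f "$schema_file"
}
```

Once the test dataset exists, it could be called as: clone_schemas myproject:prod_dataset myproject:test_dataset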

Related

Get table names in dataset after dataset truncate

It seems that the BigQuery CLI supports restoring tables in a dataset after they have been deleted by using BigQuery Time Travel functionality -- as in:
bq cp dataset.table@TIME_AGO_UNIX dataset.table
However, this assumes we know the names of the tables. I want to write a script to iterate over all the tables that were in the dataset at TIME_AGO_UNIX time.
How would I go about finding those tables at that time?
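The restore loop itself is simple once the table names are known. The sketch below does not answer how to discover the names; it assumes they are available from some other source as a plain text file, and that the snapshot time is given in milliseconds since the Unix epoch. restore_tables and the names file are hypothetical.

```shell
#!/bin/sh
# Sketch: restore a known list of tables to their state at one point in time.
# restore_tables and the names file are hypothetical; the snapshot time is
# given in milliseconds since the Unix epoch.
restore_tables() {
  dataset="$1"        # e.g. mydataset (placeholder)
  snapshot_ms="$2"    # e.g. 1700000000000 (placeholder)
  names_file="$3"     # one table name per line
  while read -r table; do
    [ -n "$table" ] || continue          # skip blank lines
    bq cp "$dataset.$table@$snapshot_ms" "$dataset.$table" < /dev/null
  done < "$names_file"
}
```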

Duplicate several tables in bigquery project at once

In our BQ export schema, we have one table for each day.
I want to copy the tables dated before a certain date (2021-02-07). I know how to copy one day at a time via the UI, but is there a way to write code in the cloud console that copies the selected date range all at once? Or maybe an SQL command run directly from a query window?
I think you should convert your sharded tables into a partitioned table, so that you can handle all of them with a single query. As mentioned in the official documentation, partitioned tables perform better than sharded tables.
To make the conversion, you can execute the following command in the console.
bq partition \
--time_partitioning_type=DAY \
--time_partitioning_expiration 259200 \
mydataset.sourcetable_ \
mydataset.mytable_partitioned
This converts your sharded tables sourcetable_(xxx) into a single partitioned table, mytable_partitioned, which lets you query your entire set of data entries with just one query.
SELECT
*
FROM
`myprojectid.mydataset.mytable_partitioned`
WHERE
_PARTITIONTIME BETWEEN TIMESTAMP('2022-01-01') AND TIMESTAMP('2022-01-03')
For more details about the conversion command, see the official documentation on converting date-sharded tables into partitioned tables. I also recommend the documentation on querying partitioned tables and on partitioned tables in general.
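If the goal is only to copy the daily shards dated before a given day, without converting them, a loop over the bq ls output is one option. This is a minimal sketch, assuming the shard names end in a YYYYMMDD suffix and that bq ls prints its default text table; copy_shards_before and all dataset and prefix names are hypothetical.

```shell
#!/bin/sh
# Sketch: copy every date-sharded table whose YYYYMMDD suffix is before a cutoff.
copy_shards_before() {
  src="$1"      # source dataset, e.g. mydataset (placeholder)
  dst="$2"      # destination dataset, e.g. mybackup (placeholder)
  prefix="$3"   # shard prefix, e.g. sourcetable_ (placeholder)
  cutoff="$4"   # exclusive cutoff as YYYYMMDD, e.g. 20210207
  bq ls -n 10000 "$src" |
  awk -v p="$prefix" -v c="$cutoff" '
    $2 == "TABLE" && index($1, p) == 1 {
      s = substr($1, length(p) + 1)
      # Eight digits compare correctly as plain numbers.
      if (s ~ /^[0-9]+$/ && length(s) == 8 && s + 0 < c + 0) print $1
    }' |
  while read -r table; do
    bq cp "$src.$table" "$dst.$table" < /dev/null
  done
}
```

For the date in the question, the call might look like: copy_shards_before mydataset mybackup sourcetable_ 20210207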

Partitioning a table in BigQuery by file

I would like to create a table that is partitioned based on the filename. For example, let's say I have a thousand sales files, one for each date, such as:
Files/Sales_2014-01-01.csv, Files/Sales_2014-01-02.csv, ...
I would like to partition the table based on the filename (which is essentially the date). Is there a way to do this in BQ? For example, I want to do a load job similar to the following (in pseudocode):
bq load gs://Files/Sales*.csv PARTITION BY filename
What would be the closest thing I could do to that?
When the table has a TIMESTAMP, DATE, or DATETIME column, first create a partitioned table that uses time-unit column partitioning. When you load data into the table, BigQuery automatically puts each row into the correct partition, based on the value in that column. To create an empty time-unit column-partitioned table with the bq CLI, use a command like the following:
bq mk -t \
--schema 'ts:DATE,qtr:STRING,sales:FLOAT' \
--time_partitioning_field ts \
--time_partitioning_type DAY \
mydataset.mytable
Then load all your sales files into that time-unit column-partitioned table; each row is placed into the correct partition automatically. The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset, with schema auto-detection enabled. See the bq load documentation for more information.
bq load \
--autodetect \
--source_format=CSV \
mydataset.mytable \
gs://mybucket/mydata*.csv
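If the files contain no date column and the date exists only in the file name, a different technique from the one above may be closer to the original request: loading each file into a specific partition through a partition decorator (table$YYYYMMDD). This is a minimal sketch, assuming file names of the form Sales_YYYY-MM-DD.csv and a target table that already exists as an ingestion-time partitioned table (created with --time_partitioning_type=DAY and no partitioning field). Both helper names are hypothetical.

```shell
#!/bin/sh
# Derive a YYYYMMDD partition id from a name such as Files/Sales_2014-01-01.csv.
partition_id_from_name() {
  name=${1##*/}          # drop the directory or bucket part
  name=${name%.csv}      # drop the extension
  printf '%s\n' "${name##*_}" | sed 's/-//g'   # keep the date, drop dashes
}

# Load one file into the partition named by its file name.
load_file_into_partition() {
  table="$1"             # e.g. mydataset.mytable (placeholder)
  uri="$2"               # e.g. gs://mybucket/Sales_2014-01-01.csv (placeholder)
  bq load --source_format=CSV "$table\$$(partition_id_from_name "$uri")" "$uri"
}
```

Calling load_file_into_partition once per file, for example from a loop over a file listing, would place each day's sales in its own partition.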

Migrating Non Partitioned Streaming Table to Partitioned Table Bigquery

I have a legacy unpartitioned BigQuery table that receives streamed logs from various sources (let's say Table BigOldA). The aim is to move it to a new day-partitioned table (let's say PartByDay), which I did with the help of the following link:
https://cloud.google.com/bigquery/docs/creating-column-partitions#creating_a_partitioned_table_from_a_query_result
bq query \
--allow_large_results \
--replace=true \
--destination_table <project>:<data-set>.<PartByDay> \
--time_partitioning_field REQUEST_DATETIME \
--use_legacy_sql=false 'SELECT * FROM `<project>.<data-set>.<BigOldA>`'
I have migrated the historical data to the new table, but I cannot delete it from Table BigOldA because DML statements are not yet supported on tables with a streaming buffer:
Error: UPDATE or DELETE DML statements are not supported over
table <project>:<data-set>.BigOldA with streaming buffer
I was planning to run batch jobs every day that transfer T-1 data from Table BigOldA to Table PartByDay and periodically delete the transferred rows, so that I could keep the streaming buffer data in Table BigOldA and start using Table PartByDay for analytics. Now I am not sure whether that is achievable.
I am looking for an alternative solution or best practice for periodically transferring data from a table with a streaming buffer to a partitioned table. Also, because the data is streamed from independent production sources, it is not possible to point all sources at PartByDay, and the streamingBuffer property returned by tables.get is never null.
You could just delete the original table and then rename the migrated table to the original name after you've run your history job. This assumes your streaming component to BigQuery is fault tolerant. If it's designed well, you shouldn't lose any data: whatever is streaming to BigQuery should be able to store events until the table comes back online. Nothing should change for the components that are streaming once the table is partitioned.
If anyone is interested in the script, here you go.
#!/bin/sh
# This script:
# 1. deletes any earlier copy of the partitioned destination table
# 2. copies the source data into a partitioned destination table
# 3. deletes the unpartitioned source table
# 4. copies the partitioned table back under the original table name
# TODO 5. delete the intermediate partitioned copy
set -e
source_project="<source-project>"
source_dataset="<source-dataset>"
source_table="<source-table-to-partition>"
destination_project="<destination-project>"
destination_dataset="<destination-dataset>"
partition_field="<timestamp-partition-field>"
destination_table="<table-copy-partition>"
# Standard SQL uses project.dataset.table; the bq CLI uses project:dataset.table.
source_path="$source_project.$source_dataset.$source_table"
source_l_path="$source_project:$source_dataset.$source_table"
destination_path="$destination_project:$destination_dataset.$destination_table"
query="SELECT * FROM \`$source_path\`"
echo "deleting old copy of $destination_path"
bq rm -f -t "$destination_path"
echo "copying table from $source_path to $destination_path"
echo "running the query: $query"
bq query --quiet=true --use_legacy_sql=false --apilog=stderr --allow_large_results --replace=true --destination_table "$destination_path" --time_partitioning_field "$partition_field" "$query"
echo "removing the original table: $source_path"
bq rm -f -t "$source_l_path"
echo "table deleted"
echo "copying the partitioned table to the original source path"
# -f overwrites the destination without prompting.
bq cp -f "$destination_path" "$source_l_path"

Google BigQuery Partitioned Tables - How to create tables automatically daily?

The question is how to let Google BigQuery automatically create partitioned tables on a daily basis (one day -> one table, etc.)?
I've used the following command in the command line to create the table:
bq mk --time_partitioning_type=DAY testtable1
testtable1 appeared in the dataset, but how do I create tables for every day automatically?
From the partitioned table documentation, you need to run the command to create the table only once. After that, you specify the partition to which you want to write as the destination table of the query, such as testtable1$20170919.
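As a minimal sketch of that last step, the wrapper below writes query results to one daily partition. write_to_partition is a hypothetical name, and the dataset, table, and query are placeholders. The $ in the decorator must be escaped or single-quoted so that the shell does not treat it as the start of a variable.

```shell
#!/bin/sh
# Sketch: write the result of a query into one daily partition of a table.
write_to_partition() {
  table="$1"   # e.g. mydataset.testtable1 (placeholder)
  day="$2"     # partition id as YYYYMMDD, e.g. 20170919
  sql="$3"     # standard SQL query text (placeholder)
  # \$ keeps the dollar sign literal: mydataset.testtable1$20170919
  bq query --use_legacy_sql=false --replace \
    --destination_table "$table\$$day" "$sql"
}
```

A daily scheduled job could then pass the current date, for example "$(date +%Y%m%d)", as the second argument. With --replace, the query result overwrites whatever that partition already holds.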