I have a legacy unpartitioned BigQuery table that streams logs from various sources (let's call it Table BigOldA). The aim is to transfer it to a new day-partitioned table (let's call it PartByDay), which I did by following this link:
https://cloud.google.com/bigquery/docs/creating-column-partitions#creating_a_partitioned_table_from_a_query_result
bq query \
--allow_large_results \
--replace=true \
--destination_table <project>:<data-set>.<PartByDay> \
--time_partitioning_field REQUEST_DATETIME \
--use_legacy_sql=false 'SELECT * FROM `<project>.<data-set>.<BigOldA>`'
I have migrated the historical data to the new table, but I cannot delete it from Table BigOldA because running DML on tables with an active streaming buffer is not supported yet:
Error: UPDATE or DELETE DML statements are not supported over
table <project>:<data-set>.BigOldA with streaming buffer
I was planning to run batch jobs every day, transferring T-1 data from Table BigOldA to Table PartByDay and deleting it periodically, so that I can still keep the streaming-buffer data in Table BigOldA and start using Table PartByDay for analytics. Now I am not sure if that's achievable.
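For reference, the T-1 transfer step of that plan would look roughly like the sketch below (the date filter and flags are illustrative; it is the subsequent DELETE on BigOldA that fails):
# Append yesterday's rows from BigOldA to PartByDay; the column partitioning
# on REQUEST_DATETIME routes them to the correct partition automatically.
bq query \
--use_legacy_sql=false \
--append_table \
--destination_table <project>:<data-set>.PartByDay \
'SELECT * FROM `<project>.<data-set>.BigOldA`
 WHERE DATE(REQUEST_DATETIME) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)'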
I am looking for an alternative solution or best practice for periodically transferring data from a streaming-buffer table to a partitioned table and keeping the two in sync. Also, since the data is streamed from independent production sources, it is not possible to point all sources at PartByDay, and the streamingBuffer property from tables.get is never null.
You could just delete the original table and then rename the migrated table to the original name after you've run your history job. This assumes the component streaming to BigQuery is fault tolerant: if it's designed well, you shouldn't lose any data, because it should be able to hold events until the table comes back online. Nothing should change for your streaming components once the table is partitioned.
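One thing worth doing before dropping BigOldA is to confirm the streaming buffer has actually drained (as the question notes, tables.get reports a streamingBuffer section while rows are still buffered). A quick check from the CLI, with the project and dataset as placeholders:
# Prints the streamingBuffer block if rows are still buffered; no output means it has drained.
bq show --format=prettyjson <project>:<data-set>.BigOldA | grep -A 3 '"streamingBuffer"'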
If anyone is interested in the script, here you go.
#!/bin/sh
# This script
# 1. copies the data into a partitioned table
# 2. deletes the original unpartitioned table
# 3. copies the partitioned table back to the original dataset/table name
# TODO 4. delete the intermediate partitioned copy
set -e
source_project="<source-project>"
source_dataset="<source-dataset>"
source_table="<source-table-to-partition>"
destination_project="<destination-project>"
destination_dataset="<destination-dataset>"
partition_field="<timestamp-partition-field>"
destination_table="<table-copy-partition>"
source_path="$source_project.$source_dataset.$source_table"
source_l_path="$source_project:$source_dataset.$source_table"
destination_path="$destination_project:$destination_dataset.$destination_table"
echo "copying table from $source_path to $destination_path"
query=$(cat <<-END
SELECT * FROM \`$source_path\`
END
)
echo "deleting old table"
bq rm -f -t $destination_path
echo "running the query: $query"
bq query --quiet=true --use_legacy_sql=false --apilog=stderr --allow_large_results --replace=true --destination_table $destination_path --time_partitioning_field $partition_field "$query"
echo "removing the original table: $source_path"
bq rm -f -t $source_l_path
echo "table deleted"
echo "copying the partition table to the original source path"
bq cp -f -n $destination_path $source_l_path
Related
I would like to create a table that is partitioned based on the filename. For example, let's say I have a thousand sales files, one for each date, such as:
Files/Sales_2014-01-01.csv, Files/Sales_2014-01-02.csv, ...
I would like to partition the table based on the filename (which is essentially the date). Is there a way to do this in BQ? For example, I want to do a load job similar to the following (in pseudocode):
bq load gs://Files/Sales*.csv PARTITION BY filename
What would be the closest thing I could do to that?
When you have a TIMESTAMP, DATE, or DATETIME column in a table, first create a time-unit column-partitioned table. When you load data into the table, BigQuery automatically puts the data into the correct partitions based on the values in that column. To create an empty time-unit column-partitioned table with the bq CLI, use a command like the following:
bq mk -t \
--schema 'ts:DATE,qtr:STRING,sales:FLOAT' \
--time_partitioning_field ts \
--time_partitioning_type DAY \
mydataset.mytable
Then load all your sales files into that time-unit column-partitioned table; BigQuery will automatically put the data into the correct partitions. The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset, with the schema auto-detected. Please refer to the BigQuery documentation on loading CSV data for more information.
bq load \
--autodetect \
--source_format=CSV \
mydataset.mytable \
gs://mybucket/mydata*.csv
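If your files do not actually contain a usable date column, so time-unit column partitioning is not an option, a sketch of an alternative is to create an ingestion-time partitioned table and load each file into its daily partition via the $ decorator derived from the filename. The table name below is illustrative; the bucket and naming pattern are the ones from the question:
# Assumes an ingestion-time partitioned table created beforehand with:
#   bq mk --table --time_partitioning_type DAY mydataset.sales
for uri in $(gsutil ls gs://Files/Sales_*.csv); do
  day=$(basename "$uri" .csv | sed 's/^Sales_//; s/-//g')   # Sales_2014-01-01.csv -> 20140101
  bq load --source_format=CSV --autodetect "mydataset.sales\$$day" "$uri"
done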
We have a BigQuery dataset that has a long list of tables (with data) in it. Since I am taking over a data pipeline, which I want to familiarize myself with by doing tests, I want to duplicate those tables without copying and then truncating them. Essentially, I want to re-create those tables in a test dataset using only their schemas. How can this be done with the bq client?
You have a couple of options, considering you want to copy the schema but not the data:
1. Extract the schema of each table and then create a new, empty table from it:
$ bq show --schema --format=prettyjson [PROJECT_ID]:[DATASET].[TABLE] > [SCHEMA_FILE]
$ bq mk --table [PROJECT_ID]:[NEW_DATASET].[TABLE] [SCHEMA_FILE]
2. Run a query with LIMIT 0 and set a destination table:
bq query "SELECT * FROM [DATASET].[TABLE] LIMIT 0" --destination_table [NEW_DATASET].[TABLE]
I don't want to delete tables one by one.
What is the fastest way to do it?
Basically, you need to remove all the partitions in order for the partitioned BQ table to be dropped.
Assuming you already have gcloud installed, do the following:
In the terminal, check/set the GCP project you are logged into:
$> gcloud config list - to check whether you are using the proper GCP project.
$> gcloud config set project <your_project_id> - to set the required project
Export variables:
$> export MY_DATASET_ID=dataset_name;
$> export MY_PART_TABLE_NAME=table_name_; - specify the table name without the partition date/value, so the real partitioned table name for this example looks like "table_name_20200818"
Double-check that you are going to delete the correct table/partitions by running this (it just lists all the partitions for your table):
for table in `bq ls --max_results=10000000 $MY_DATASET_ID | grep TABLE | grep $MY_PART_TABLE_NAME | awk '{print $1}'`; do echo $MY_DATASET_ID.$table; done
After checking, run almost the same command, plus the bq rm command parameterized by that iteration, to actually DELETE all the partitions and, eventually, the table itself:
for table in `bq ls --max_results=10000000 $MY_DATASET_ID | grep TABLE | grep $MY_PART_TABLE_NAME | awk '{print $1}'`; do echo $MY_DATASET_ID.$table; bq rm -f -t $MY_DATASET_ID.$table; done
The process for deleting a time-partitioned table and all the partitions in it is the same as the process for deleting a standard table.
So if you delete a partitioned table without specifying a partition, it deletes all of its partitions. You don't have to delete them one by one.
DROP TABLE <tablename>
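For example, from the bq CLI (dataset and table names are placeholders):
bq query --use_legacy_sql=false 'DROP TABLE mydataset.mytable'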
You can also delete partitions programmatically (e.g. in Java).
Use the sample code in DeleteTable.java and change the flow to iterate over a list of all the tables and partitions to be deleted.
If you only need to delete specific partitions, you can refer to a table partition (e.g. a daily partition) in the following way:
String mainTable = "TABLE_NAME"; // i.e. my_table
String partitionId = "YYYYMMDD"; // i.e. 20200430
String decorator = "$";
String tableName = mainTable+decorator+partitionId;
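The same decorated name also works from the bq CLI if you would rather not go through Java; for example (table name and date are placeholders):
# Deletes only the 2020-04-30 partition; the quotes stop the shell from expanding the $.
bq rm -f -t 'mydataset.my_table$20200430'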
Here is the guide to running the Java BigQuery samples; make sure to set your project in Cloud Shell:
gcloud config set project <project-id>
Is there a way to copy a date-sharded table to another dataset via the bq utility?
My current solution is to generate a bash script that copies each day one by one and splits up the work, but it would be more efficient to do everything in parallel (see the sketch after the script below):
#!/bin/sh
bq cp old_dataset.table_20140101 new_dataset.table_20140101
..
bq cp old_dataset.table_20171001 new_dataset.table_20171001
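A parallelized version of the same idea would be something like the sketch below (the list of shard dates is illustrative; in practice it would be generated from bq ls):
# Launch the per-day copies in the background and wait for them all to finish.
for day in 20140101 20140102 20171001; do
  bq cp "old_dataset.table_$day" "new_dataset.table_$day" &
done
wait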
You can specify multiple source tables but only a single destination table (refer to this question), so this may not work for you. However, if your data is date-partitioned (instead of sharded) then you can copy the table in one command.
I recommend you convert the sharded tables into a date-partitioned table, which effectively copies all the sharded tables into a new table. You can do this with the following command:
bq partition old_dataset.table_ new_dataset.partitioned
I'm using the Google SDK bq command. How do I change the name of a table? I'm not seeing this at https://cloud.google.com/bigquery/bq-command-line-tool
You have to copy to a new table and delete the original.
$ bq cp dataset.old_table dataset.new_table
$ bq rm -f -t dataset.old_table
I don't think there is a way to just rename a table.
What you can do is COPY the table to a new table with the desired name (copying is free of charge) and then delete the original table.
The only drawback I see with this is that if you have long-term stored data, I think you will lose the long-term storage discount (50%) for that data.