I have a dataset with around 200k tables that I'm trying to delete. I've been using the command-line tool to run bq rm -r -f datasetID, but it has only deleted about 4% in 24 hours. (I can only estimate progress by logging into the web UI and seeing which tables are left.) Is there a faster way to get this done?
Quite late, but here is how I did it:
Install jq and GNU parallel first. Substitute PROJECT_ID with your project's ID.
bq ls --project_id PROJECT_ID --max_results=100000 --format=prettyjson | jq '.[] | .id' | parallel --bar -P 10 bq --project_id PROJECT_ID rm -r -f -d
You might need to tune the -P parameter's value for a better deletion rate.
Warning: this will end up deleting all the tables and datasets in your project. You can perform a dry run with echo, analyze the output, and then run the above command for real:
bq ls --project_id PROJECT_ID --max_results=100000 --format=prettyjson | jq '.[] | .id' | parallel --bar -P 10 echo bq --project_id PROJECT_ID rm -r -f -d
Deleted 100K tables across 9K datasets in 15 minutes.
One way to do this would be to iterate through the tables and delete them individually (possibly in parallel). An even faster way could be to set an expiration time on the tables that is only a very short time in the future; see the sketch below.
This is not a highly optimized path, since we don't often get users who want to delete that many tables at once.
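A minimal sketch of the expiration approach with the bq CLI (mydataset is a placeholder here; --expiration sets the table's lifetime in seconds from now, so each listed table is scheduled for deletion almost immediately):
# give every table in the dataset a 1-second expiration;
# BigQuery then removes the expired tables in the background
for table in $(bq ls --max_results=1000000 mydataset | grep TABLE | awk '{print $1}'); do
  bq update --expiration 1 "mydataset.${table}"
done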
I have a partitioned table in BigQuery which needs to be unpartitioned. It is empty at the moment, so I do not need to worry about losing information.
I deleted the table in the web UI by clicking on it > choosing Delete > typing delete to confirm,
but the table is still there even after I refreshed the page and waited 30 minutes.
Then I tried the cloud shell terminal:
bq rm --table project:dataset.table_name
then it asked me to confirm and reported the table as deleted, but the table is still there!
When I try to create an unpartitioned table with the same name, it gives an error that a table with this name already exists!
I have done this many times before; I'm not sure why the table does not get removed.
Deleting partitions from the Google Cloud Console is not supported. You need to execute the bq rm command with the -t flag in Cloud Shell.
Before deleting the partitioned table, I suggest verifying that you are going to delete the correct tables (partitions).
You can execute these commands:
for table in `bq ls --max_results=10000000 $MY_DATASET_ID | grep TABLE | grep $MY_PART_TABLE_NAME | awk '{print $1}'`; do echo $MY_DATASET_ID.$table; done
For the variable $MY_PART_TABLE_NAME, use the table name without the partition date/value; for example, for the partition "table_name_20200818" the variable would be "table_name_".
After verifying that these are the correct partitions, you need to execute this command:
for table in `bq ls --max_results=10000000 $MY_DATASET_ID | grep TABLE | grep $MY_PART_TABLE_NAME | awk '{print $1}'`; do echo $MY_DATASET_ID.$table; bq rm -f -t $MY_DATASET_ID.$table; done
My Python script accidentally created a table named "ext_data_content_modec --replace", which we want to delete.
However, BQ does not seem to recognize a table name containing spaces and flag-like text (--replace).
We have tried many variants of bq rm, and also tried deleting the table from the BQ console, but nothing works.
For example, see below (etlt_dsc is the dataset name).
$ bq rm 'etlt_dsc.ext_data_content_modec --replace'
BigQuery error in rm operation: Not found: Table boeing-prod-atm-next-dsc:etlt_dsc.ext_data_content_modec --replace
Besides the above, we tried the commands below, but nothing worked:
bq rm "etlt_dsc.ext_data_content_modec --replace"
bq rm [etlt_dsc.ext_data_content_modec --replace]
bq rm [etlt_dsc.ext_data_content_modec --replace']
bq rm etlt_dsc.ext_data_content_modec \--replace
Would anyone have any input for us, please?
You can try this:
$ bq ls mydataset
tableId Type Labels Time Partitioning Clustered Fields
---------------------------------- ------- -------- ------------------- ------------------
ext_data_content_modec --replace TABLE
$
$ bq rm "mydataset.ext_data_content_modec --replace"
rm: remove table 'data-lab:mydataset.ext_data_content_modec --replace'? (y/N) y
$
$ bq ls mydataset
$
I was able to figure out the solution.
bq rm didn't work, but we were able to DROP the table from the BQ console.
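For anyone hitting the same issue, here is a sketch of that DROP approach from the CLI instead of the console (the project id is a placeholder; the odd table name must be quoted with backticks so the space and the --replace text are treated as part of the identifier):
# drop the oddly named table with a standard SQL statement instead of bq rm
bq query --use_legacy_sql=false 'DROP TABLE `my-project.etlt_dsc.ext_data_content_modec --replace`'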
I don't want to delete tables one by one.
What is the fastest way to do it?
Basically, you need to remove all partitions for the partitioned BQ table to be dropped.
Assuming you already have gcloud installed, do the following:
In the terminal, check/set the GCP project you are logged in to:
$> gcloud config list - to check whether you are using the proper GCP project.
$> gcloud config set project <your_project_id> - to set the required project.
Export variables:
$> export MY_DATASET_ID=dataset_name;
$> export MY_PART_TABLE_NAME=table_name_; - specify the table name without the partition date/value, so the real partition table name for this example looks like "table_name_20200818".
Double-check that you are going to delete the correct table/partitions by running this (it will just list all partitions of your table):
for table in `bq ls --max_results=10000000 $MY_DATASET_ID | grep TABLE | grep $MY_PART_TABLE_NAME | awk '{print $1}'`; do echo $MY_DATASET_ID.$table; done
After checking, run almost the same command, with the bq rm command added inside the loop to actually DELETE all partitions and, eventually, the table itself:
for table in `bq ls --max_results=10000000 $MY_DATASET_ID | grep TABLE | grep $MY_PART_TABLE_NAME | awk '{print $1}'`; do echo $MY_DATASET_ID.$table; bq rm -f -t $MY_DATASET_ID.$table; done
The process for deleting a time-partitioned table and all the partitions in it is the same as the process for deleting a standard table.
So if you delete a partitioned table without specifying a partition, it will delete all of its partitions. You don't have to delete them one by one.
DROP TABLE <tablename>
You can also delete the table programmatically (e.g. in Java).
Use the sample code from DeleteTable.java and change the flow to build a list of all the tables and partitions you want to delete.
If you only need to delete specific partitions, you can refer to a table partition (e.g. a daily partition) in the following way:
String mainTable = "TABLE_NAME"; // i.e. my_table
String partitionId = "YYYYMMDD"; // i.e. 20200430
String decorator = "$";
String tableName = mainTable + decorator + partitionId;
Here is the guide to running the Java BigQuery samples; make sure to set your project in Cloud Shell:
gcloud config set project <project-id>
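For reference, the same $ decorator can also be used from the bq CLI to remove a single partition (dataset and table names below are placeholders); quote the argument so the shell does not try to expand $20200430 as a variable:
# delete only the 2020-04-30 partition of my_table
bq rm -f -t 'mydataset.my_table$20200430'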
I have tried the method in this question, but it does not work since I'm working in cluster mode, and Redis told me:
(error) CROSSSLOT Keys in request don't hash to the same slot
The answers to that question try to remove multiple keys in a single DEL. However, keys matching the given pattern might NOT be located in the same slot, and Redis Cluster DOES NOT support multi-key commands when the keys don't belong to the same slot. That's why you get the error message.
In order to fix this problem, you need to DEL these keys one-by-one:
redis-cli --scan --pattern "foo*" |xargs -L 1 redis-cli del
The -L option of xargs specifies how many input lines (keys) are passed to each del command; you need to set it to 1.
In order to remove all keys matching the pattern, you also need to run the above command against every master node in your cluster.
NOTE
With this command you have to delete the keys one by one, and that might be very slow. Consider re-designing your database and using hash tags so that keys matching the pattern belong to the same slot; then you can remove them in a single DEL (see the sketch after this note).
Both SCAN and KEYS are inefficient, and KEYS in particular should not be used in production. Consider building an index for these keys instead.
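A rough illustration of the hash-tag idea (the key names are made up): Redis Cluster hashes only the part inside {...}, so keys sharing the tag land in the same slot and a multi-key DEL is allowed:
redis-cli -c set "{foo}:bar:1" v1
redis-cli -c set "{foo}:bar:2" v2
# both keys hash on "foo", so they share a slot and can be removed in one command
redis-cli -c del "{foo}:bar:1" "{foo}:bar:2"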
Building on for_stack's answer, you can speed up mass deletion quite a bit using redis-cli --pipe, and reduce the performance impact with UNLINK instead of DEL if you're using redis 4 or higher.
redis-cli --scan --pattern "foo*" | xargs -L 1 echo UNLINK | redis-cli --pipe
Output will look something like this:
All data transferred. Waiting for the last reply...
Last reply received from server.
errors: 0, replies: 107003
You do still need to run this against every master node in your cluster. If you have a large number of nodes, it's probably possible to automate the process further by parsing the output of CLUSTER NODES.
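A rough sketch of that automation, assuming you can reach one node of the cluster (the host and port below are placeholders): it extracts the ip:port of every master from CLUSTER NODES and runs the scan/UNLINK pipe against each one.
# loop over all master nodes reported by CLUSTER NODES
redis-cli -h 10.0.0.1 -p 6379 cluster nodes \
  | awk '/master/ {split($2, a, "@"); print a[1]}' \
  | while IFS=: read -r host port; do
      redis-cli -h "$host" -p "$port" --scan --pattern "foo*" \
        | xargs -L 1 echo UNLINK \
        | redis-cli -h "$host" -p "$port" --pipe
    done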
redis-cli provides a -c option to follow MOVED redirects. However, keys should still be deleted one at a time, because you cannot guarantee that two keys will hash to the same slot.
redis-cli -h myredis.internal --scan --pattern 'mycachekey::*' | \
xargs -L 1 -d'\n' redis-cli -h myredis.internal -c del
The first part provides the list of keys; --scan avoids blocking Redis the way KEYS would. xargs -L 1 runs the command for one entry at a time. -d'\n' disables quote processing, so a key like "SimpleKey[hello world]" is passed as a single argument; otherwise the space would make xargs treat it as two keys.
Do the REINDEX statements below help with the restore operation or file size of the dump?
I could not find a question about this anywhere on SO or the web. I'm using Postgres 9.4 and cleaning out a very large database with both TRUNCATE and DELETE statements on various tables.
The table data varies in type and size.
After this clean-up operation, I immediately execute a pg_dump, then tar and upload the dump, then pg_restore. I'm using the directory format with 12 parallel jobs for the dump and 8 for the restore.
For example, these queries first:
TRUNCATE users;
DELETE FROM users_email WHERE active = 1;
REINDEX TABLE users;
REINDEX TABLE users_email;
Then:
$ pg_dump_9.4 --compress=0 -F directory -j 12 $DB_EXPORT_NAME -f $DB_DUMP_FOLDER 2>> operations.log
$ # do tar and upload with dump then:
$ pg_restore_9.4 -d $DB_IMPORT_NAME -j 8 $DB_DUMP_FOLDER 2>> operations.log
This will make no difference at all for pg_dump or pg_restore.
pg_dump doesn't use the index at all; it just writes its definition as a CREATE INDEX statement into the dump. The table itself is scanned sequentially.
pg_restore creates the index using the CREATE INDEX from the dump.
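If you want to confirm that the dump contains only index definitions rather than index data, you can list the archive's table of contents; this is just a sanity-check sketch reusing the paths from the question:
# index entries appear as "INDEX ..." items, i.e. definitions that pg_restore will re-create
pg_restore_9.4 -l $DB_DUMP_FOLDER | grep -i INDEX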