I want to replace the latest partition of my BQ table with data that is already available in an ad hoc table.
Could anyone please help me do this?
The command below is not helping:
bq query \
--use_legacy_sql=false \
--replace \
--destination_table 'mydataset.table1$20160301' \
'SELECT
column1,
column2
FROM
mydataset.mytable'
I guess you need to use bq cp instead of bq query.
From here:
To copy a partition, use the bq command-line tool's bq cp (copy)
command with a partition decorator ($date) such as $20160201.
Optional flags can be used to control the write disposition of the
destination partition:
-a or --append_table appends the data from the source partition to an existing table or partition in the destination dataset.
-f or --force overwrites an existing table or partition in the destination dataset and doesn't prompt you for confirmation.
-n or --no_clobber returns the following error message if the table or partition exists in the destination dataset: Table
'project_id:dataset.table or table$date' already
exists, skipping. If -n is not specified, the default behavior is to
prompt you to choose whether to replace the destination table or
partition.
bq --location=location cp \
-f \
project_id:dataset.source_table$source_partition \
project_id:dataset.destination_table$destination_partition
So don't forget the -f flag, which overwrites the old partition.
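Applied to your example, a minimal sketch (assuming mydataset.mytable is the ad hoc table and its schema matches table1; note the \$ escape so bash does not expand the partition decorator as a shell variable):
bq cp -f \
mydataset.mytable \
mydataset.table1\$20160301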
Related
I would like to create a table that is partitioned based on the filename. For example, let's say I have a thousand sales files, one for each date, such as:
Files/Sales_2014-01-01.csv, Files/Sales_2014-01-02.csv, ...
I would like to partition the table based on the filename (which is essentially the date). Is there a way to do this in BQ? For example, I want to do a load job similar to the following (in pseudocode):
bq load gs://Files/Sales*.csv PARTITION BY filename
What would be the closest thing I could do to that?
When you have a TIMESTAMP, DATE, or DATETIME column in a table, first create a partitioned table using time-unit column partitioning. When you load data into the table, BigQuery automatically puts it into the correct partitions based on the values in that column. To create an empty time-unit column-partitioned table with the bq CLI, refer to the command below:
bq mk -t \
--schema 'ts:DATE,qtr:STRING,sales:FLOAT' \
--time_partitioning_field ts \
--time_partitioning_type DAY \
mydataset.mytable
Then load all your sales files into that time-unit column-partitioned table; BigQuery will automatically put each row into the correct partition. The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset. The schema will be auto-detected. Please refer to this link for more information.
bq load \
--autodetect \
--source_format=CSV \
mydataset.mytable \
gs://mybucket/mydata*.csv
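To confirm the rows landed in the expected partitions, a quick check (a sketch; this assumes the INFORMATION_SCHEMA.PARTITIONS view is available for your dataset):
bq query --use_legacy_sql=false "
SELECT partition_id, total_rows
FROM mydataset.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = 'mytable'
ORDER BY partition_id
"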
According to the BigQuery docs, I should be able to export a single partition of a partitioned table:
Exporting all data from a partitioned table is the same process as exporting data from a non-partitioned table. For more information, see Exporting table data. To export data from an individual partition, append the partition decorator, $date, to the table name. For example: mytable$20160201.
However, running the following extract command extracts the entire table, not just one partition. It is driving me nuts! What am I doing wrong?
bq --location=europe-west2 extract \
--destination_format NEWLINE_DELIMITED_JSON \
--compression GZIP \
bq-project-name:dataset.table_name$20200405 \
"gs://bucket-name/test_ga_sessions*.json.gz"
For reference, the source table is partitioned by day, and I have also confirmed that the partition I am attempting to extract exists:
#legacySQL
SELECT
partition_id,
creation_time,
creation_timestamp,
last_modified_time,
last_modified_timestamp
FROM
[dataset.tablename$__PARTITIONS_SUMMARY__]
WHERE partition_id = '20200405'
Because I was running the bq extract command in a bash shell, the partition decorator $20200405 was being interpreted as a variable and an empty one at that. Therefore the full partition identifier of bq-project-name:dataset.table_name$20200405 was being interpreted as bq-project-name:dataset.table_name by the time the request reached BigQuery.
In order to get this command to run correctly, all I had to do was escape the $ character of the partition decorator with a backslash as follows:
bq --location=europe-west2 extract \
--destination_format NEWLINE_DELIMITED_JSON \
--compression GZIP \
bq-project-name:dataset.table_name\$20200405 \
"gs://bucket-name/test_ga_sessions*.json.gz"
I don't want to delete tables one by one.
What is the fastest way to do it?
Basically, you need to remove all partitions for the partitioned BQ table to be dropped.
Assuming you already have gcloud installed, do the following:
Using the terminal, check/set the GCP project you are logged in to:
$> gcloud config list - to check if you are using the proper GCP project.
$> gcloud config set project <your_project_id> - to set the required project
Export variables:
$> export MY_DATASET_ID=dataset_name;
$> export MY_PART_TABLE_NAME=table_name_; - specify the table name without the partition date/value, so the real partition table name for this example looks like "table_name_20200818"
Double-check that you are going to delete the correct table/partitions by running this (it will just list all partitions for your table):
for table in `bq ls --max_results=10000000 $MY_DATASET_ID | grep TABLE | grep $MY_PART_TABLE_NAME | awk '{print $1}'`; do echo $MY_DATASET_ID.$table; done
After checking, run almost the same loop, this time with a bq rm command inside the iteration to actually DELETE all partitions and, eventually, the table itself:
for table in `bq ls --max_results=10000000 $MY_DATASET_ID | grep TABLE | grep $MY_PART_TABLE_NAME | awk '{print $1}'`; do echo $MY_DATASET_ID.$table; bq rm -f -t $MY_DATASET_ID.$table; done
The process for deleting a time-partitioned table and all the
partitions in it is the same as the process for deleting a standard
table.
So if you delete a partitioned table without specifying a partition, it will delete all of its partitions. You don't have to delete them one by one.
DROP TABLE <tablename>
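With the bq CLI the equivalent is a single command (a sketch, assuming a partitioned table mydataset.mytable):
bq rm -f -t mydataset.mytable
This removes the table together with all of its partitions.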
You can also delete programmatically (e.g. in Java).
Use the sample code of DeleteTable.java and change the flow to build a list of all your tables and partitions to be deleted.
If you only need to delete specific partitions, you can refer to a table partition (e.g. a daily partition) in the following way:
String mainTable = "TABLE_NAME"; // e.g. my_table
String partitionId = "YYYYMMDD"; // e.g. 20200430
String decorator = "$";
String tableName = mainTable + decorator + partitionId;
Here is the guide to running the Java BigQuery samples; make sure to set your project in Cloud Shell:
gcloud config set project <project-id>
It seems the -f or --force=true flag does not work for views.
It still outputs the following error:
could not be created; a table with this name already exists.
Below is part of the command I use:
bq mk --use_legacy_sql=false -f --description "View on reporting table ..." --view
You can use a CREATE OR REPLACE VIEW statement, e.g.
bq query --use_legacy_sql=false "
CREATE OR REPLACE VIEW dataset.view
OPTIONS (description='View on reporting table ...') AS
SELECT ...
"
See the DDL documentation for more reading.
Actually, as per some tests I have been running, this option does not do what the documentation suggests ([...] and overwrite the table without prompting) even for tables:
$ bq mk test_dataset.test
Table 'PROJECT:test_dataset.test' successfully created.
$ bq mk test_dataset.test
BigQuery error in mk operation: Table 'PROJECT:test_dataset.test' could not be created; a table with this name already exists.
$ bq mk -f test_dataset.test
Table 'PROJECT:test_dataset.test' could not be created; a table with this name already exists.
Also, when looking at the description of the CLI tool, the explanation is not the same as in the documentation:
$ bq mk --help
[...]
-f,--[no]force: Ignore errors reporting that the object already exists.
(default: 'false')
And in fact, if we look at the exit status of the command with and without the -f flag, we see a significant difference:
$ bq mk test_dataset.test
BigQuery error in mk operation: Table 'PROJECT:test_dataset.test' could not be created; a table with this name already exists.
$ echo $?
1
$ bq mk -f test_dataset.test
Table 'PROJECT:test_dataset.test' could not be created; a table with this name already exists.
$ echo $?
0
So I believe the functionality itself is correct (also, as you can see, without the flag the output includes an additional BigQuery error in mk operation message that is not present with the flag); it is the documentation that does not reflect the real behavior of the flag.
Therefore, I have already reported this internally so that the necessary change can be made to the documentation.
As for achieving what you were attempting with this flag, you can use any of the workarounds proposed in the other answers and comments, which all seem like good options.
Just to provide some final context to this post, the documentation has already been changed in order to reflect the real functionality of the -f flag:
--force or -f
When specified, if a resource already exists, the exit code is 0. The
default value is false.
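In practice this makes the flag useful mainly for scripting; a minimal sketch (reusing the test_dataset.test table from above) that relies on the exit code:
set -e
# with -f, an "already exists" result is not treated as an error (exit code 0),
# so a script running under set -e continues instead of aborting here
bq mk -f test_dataset.test
echo "test_dataset.test is present (created or already existed)"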
Based on the documentation, the flag only covers forcing table creation when the table already exists; it does not say anything about views:
--force or -f
When specified, ignore already exists errors and overwrite the table without prompting. The default value is false.
I was trying to create a BigQuery external table with Parquet files on GCS. It shows a wrong format error.
But using the same files to create a native table works fine. Why must it be a native table?
If I use a native table, how can I import more data into it? I don't want to delete and recreate the table every time I get new data.
Any help will be appreciated.
This appears to be supported now, at least in beta. This only works in us-central1 as far as I can tell.
Simply select 'External Table' and set 'Parquet' as your file type.
The current Google documentation might be a bit tricky to understand. It is a two-step process: first create a definition file, then use it as input to create the table.
To create the definition file, if you are dealing with unpartitioned folders:
bq mkdef \
--source_format=PARQUET \
"<path/to/parquet/folder>/*.parquet" > "<definition/file/path>"
Otherwise, if you are dealing with a Hive-partitioned table:
bq mkdef \
--autodetect \
--source_format=PARQUET \
--hive_partitioning_mode=AUTO \
--hive_partitioning_source_uri_prefix="<path/to/hive/table/folder>" \
"<path/to/hive/table/folder>/*.parquet" > "<definition/file/path>"
Note: path/to/hive/table/folder should not include the partition
folder
E.g., if your table is laid out as:
gs://project-name/tablename/year=2009/part-000.parquet
bq mkdef \
--autodetect \
--source_format=PARQUET \
--hive_partitioning_mode=AUTO \
--hive_partitioning_source_uri_prefix="gs://project-name/tablename" \
"gs://project-name/tablename/*.parquet" > "def_file_name"
Finally, the table can be created from the definition file with:
bq mk --external_table_definition="<definition/file/path>" "<project_id>:<dataset>.<table_name>"
Parquet is not currently a supported data format for federated tables. You can repeatedly load more data into the same table as long as you append (instead of overwriting) the current contents.
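For the "import more data" part, an append-style load is enough; a minimal sketch (assuming a native table mydataset.mytable and a hypothetical GCS path; bq load appends by default unless --replace is specified):
bq load \
--source_format=PARQUET \
mydataset.mytable \
"gs://mybucket/new_sales/*.parquet"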