I would like to create a table that is partitioned based on the filename. For example, let's say I have a thousand sales files, one for each date, such as:
Files/Sales_2014-01-01.csv, Files/Sales_2014-01-02.csv, ...
I would like to partition the table based on the filename (which is essentially the date). Is there a way to do this in BQ? For example, I want to do a load job similar to the following (in pseudocode):
bq load gs://Files/Sales*.csv PARTITION BY filename
What would be the closest thing I could do to that?
When your data has a TIMESTAMP, DATE, or DATETIME column, first create a table that uses time-unit column partitioning on that column. When you load data into the table, BigQuery automatically puts it into the correct partitions based on the values in that column. To create an empty time-unit column-partitioned table with the bq CLI, use a command like the following:
bq mk -t \
--schema 'ts:DATE,qtr:STRING,sales:FLOAT' \
--time_partitioning_field ts \
--time_partitioning_type DAY \
mydataset.mytable
Then load all your sales files into that time-unit column-partitioned table; BigQuery will automatically put each row into the correct partition. The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset, with the schema auto-detected. Please refer to this link for more information.
bq load \
--autodetect \
--source_format=CSV \
mydataset.mytable \
gs://mybucket/mydata*.csv
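If the date exists only in the filename and not as a column inside the CSVs, a closer match to the pseudocode is an ingestion-time partitioned table loaded one file at a time via the partition decorator. A rough sketch, assuming the file layout from the question; mydataset.sales and schema.json are hypothetical names:

# Create an ingestion-time (DAY) partitioned table; the date lives in the
# partition itself, not in a column. schema.json is a hypothetical schema file.
bq mk -t \
  --time_partitioning_type DAY \
  --schema ./schema.json \
  mydataset.sales

# Load each file into the partition named after the date in its filename,
# e.g. Sales_2014-01-01.csv -> partition 20140101.
# Add --skip_leading_rows=1 if the files have a header row.
for f in $(gsutil ls "gs://Files/Sales_*.csv"); do
  d=$(basename "$f" .csv | sed 's/Sales_//; s/-//g')
  bq load --source_format=CSV "mydataset.sales\$${d}" "$f"
done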
We have a BigQuery dataset that contains a long list of tables (with data). I am taking over a data pipeline and want to familiarize myself with it by running tests, so I want to duplicate those tables without copying and then truncating them. Essentially, I want to re-create those tables in a test dataset using their schemas. How can this be done with the bq client?
You have a couple of options, considering you want to copy only the schema and not the data:
1.- Extract the schema of each table and then create a new, empty table from it:
$ bq show --schema --format=prettyjson [PROJECT_ID]:[DATASET].[TABLE] > [SCHEMA_FILE]
$ bq mk --table [PROJECT_ID]:[NEW_DATASET].[TABLE] [SCHEMA_FILE]
2.- Run a query with LIMIT 0 and set a destination table:
bq query "SELECT * FROM [DATASET].[TABLE] LIMIT 0" --destination_table [NEW_DATASET].[TABLE]
According to the BigQuery docs, I should be able to export a single partition of a partitioned table:
Exporting all data from a partitioned table is the same process as exporting data from a non-partitioned table. For more information, see Exporting table data. To export data from an individual partition, append the partition decorator, $date, to the table name. For example: mytable$20160201.
However, running the following extract command extracts the entire table, not just one partition. It is driving me nuts! What am I doing wrong?
bq --location=europe-west2 extract \
--destination_format NEWLINE_DELIMITED_JSON \
--compression GZIP \
bq-project-name:dataset.table_name$20200405 \
"gs://bucket-name/test_ga_sessions*.json.gz"
The source table is partitioned (I am omitting its partitioning details here), and I have also confirmed that the partition I am attempting to extract exists:
#legacySQL
SELECT
partition_id,
creation_time,
creation_timestamp,
last_modified_time,
last_modified_timestamp
FROM
[dataset.tablename$__PARTITIONS_SUMMARY__]
WHERE partition_id = '20200405'
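For reference, a roughly equivalent check in standard SQL, assuming the INFORMATION_SCHEMA.PARTITIONS view is available in your project and region:

bq query --nouse_legacy_sql \
  "SELECT partition_id, total_rows, last_modified_time
   FROM dataset.INFORMATION_SCHEMA.PARTITIONS
   WHERE table_name = 'tablename' AND partition_id = '20200405'"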
Because I was running the bq extract command in a bash shell, the partition decorator $20200405 was being interpreted as a variable and an empty one at that. Therefore the full partition identifier of bq-project-name:dataset.table_name$20200405 was being interpreted as bq-project-name:dataset.table_name by the time the request reached BigQuery.
In order to get this command to run correctly, all I had to do was escape the $ character of the partition decorator with a backslash as follows:
bq --location=europe-west2 extract \
--destination_format NEWLINE_DELIMITED_JSON \
--compression GZIP \
bq-project-name:dataset.table_name\$20200405 \
"gs://bucket-name/test_ga_sessions*.json.gz"
I want to upload a CSV file to BigQuery via the CLI, with the table partitioned by a column. For example, the table should be partitioned by a column called "Time-Key".
Here is my current code:
bq load \
--source_format=CSV \
{projectname:Datasetname.tablename} \
{cloud storage path for csv file} \
./{SchemaNames.json}
How do I add a partitioning parameter to this? (e.g. PARTITION BY TIME_KEY)
From the docs:
--time_partitioning_field
The field used to determine how to create a time-based partition. If time-based partitioning is enabled without this value, the table is partitioned based on the load time.
https://cloud.google.com/bigquery/docs/reference/bq-cli-reference#bq_load
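A sketch of the same load with time-unit column partitioning added. It assumes the schema file declares a DATE or TIMESTAMP column named Time_Key (BigQuery column names cannot contain hyphens, so "Time-Key" would need renaming); the other names are the placeholders from the question:

bq load \
  --source_format=CSV \
  --time_partitioning_type=DAY \
  --time_partitioning_field=Time_Key \
  projectname:Datasetname.tablename \
  gs://bucket/path/file.csv \
  ./SchemaNames.json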
I was trying to create a BigQuery external table from Parquet files on GCS, but it shows a wrong-format error.
Using the same files to create a native table works fine. Why must it be a native table?
If I use a native table, how can I import more data into it? I don't want to delete and re-create the table every time I get new data.
Any help will be appreciated.
This appears to be supported now, at least in beta. This only works in us-central1 as far as I can tell.
Simply select 'External Table' and set 'Parquet' as your file type.
The current Google documentation can be a bit tricky to follow. It is a two-step process: first create a definition file, then use that file as input to create the table.
To create the definition file when you are dealing with unpartitioned folders:
bq mkdef \
--source_format=PARQUET \
"<path/to/parquet/folder>/*.parquet" > "<definition/file/path>"
Otherwise, if you are dealing with a Hive-partitioned table:
bq mkdef \
--autodetect \
--source_format=PARQUET \
--hive_partitioning_mode=AUTO \
--hive_partitioning_source_uri_prefix="<path/to/hive/table/folder>" \
"<path/to/hive/table/folder>/*.parquet" > "<definition/file/path>"
Note: <path/to/hive/table/folder> should not include the partition folder.
E.g., if your files are laid out as
gs://project-name/tablename/year=2009/part-000.parquet
bq mkdef \
--autodetect \
--source_format=PARQUET \
--hive_partitioning_mode=AUTO \
--hive_partitioning_source_uri_prefix="gs://project-name/tablename" \
"gs://project-name/tablename/*.parquet" > "def_file_name"
Finally, the table can be created from the definition file with:
bq mk --external_table_definition="<definition/file/path>" "<project_id>:<dataset>.<table_name>"
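Once created, the external table can be queried like any other table. For example, using the same placeholder names as above:

bq query --nouse_legacy_sql \
  'SELECT COUNT(*) FROM `project_id.dataset.table_name`'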
Parquet is not currently a supported data format for federated tables. You can repeatedly load more data into the same table as long as you append (instead of overwriting) the current contents.
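If you do go with a native table, note that bq load appends to the existing contents by default, so importing a new batch of Parquet files is just another load job. A sketch with placeholder names:

# Appends to mydataset.mytable; add --replace only if you want to overwrite.
bq load \
  --source_format=PARQUET \
  mydataset.mytable \
  "gs://mybucket/new_batch/*.parquet"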
Note: this is nearly a duplicate of this question with the distinction that in this case, the source table is date partitioned and the destination table does not yet exist. Also, the accepted solution to that question didn't work in this case.
I'm trying to copy a single day's worth of data from one date-partitioned table into a new date-partitioned table that I have not yet created. My hope is that BigQuery would simply create the date-partitioned destination table for me like it usually does for the non-date-partitioned case.
Using BigQuery CLI, here's my command:
bq cp mydataset.sourcetable\$20161231 mydataset.desttable\$20161231
Here's the output of that command:
BigQuery error in cp operation: Error processing job
'myproject:bqjob_bqjobid': Partitioning specification must be provided
in order to create partitioned table
I've tried doing something similar using the Python SDK: running a SELECT on a date-partitioned table (which selects data from only one date partition) and saving the results into a new destination table (which I hope would also be date-partitioned). The job fails with the same error:
{u'message': u'Partitioning specification must be provided in order to
create partitioned table', u'reason': u'invalid'}
Clearly I need to add a partitioning specification, but I couldn't find any documentation on how to do so.
You need to create the partitioned destination table first (as per the docs):
If you want to copy a partitioned table into another partitioned
table, the partition specifications for the source and destination
tables must match.
So, just create the partitioned destination table before you start copying. If you can't be bothered specifying the schema, you can create it like so:
bq mk --time_partitioning_type=DAY mydataset.temps
Then, use a query instead of a copy to write to the destination table. The schema will be copied with it:
bq query --nouse_legacy_sql --allow_large_results --replace \
  --destination_table 'mydataset.temps$20160101' \
  'SELECT * FROM `source`'
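To restrict the copy to a single day, the query can filter on the _PARTITIONTIME pseudo column of the (ingestion-time partitioned) source table. Roughly, reusing the source table from the question and the temps destination created above:

bq query --nouse_legacy_sql --allow_large_results --replace \
  --destination_table 'mydataset.temps$20161231' \
  'SELECT * FROM `mydataset.sourcetable`
   WHERE _PARTITIONTIME = TIMESTAMP("2016-12-31")'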