How can I extract a single partition from a partitioned BigQuery table? - google-bigquery

According to the BigQuery docs, I should be able to export a single partition of a partitioned table:
Exporting all data from a partitioned table is the same process as exporting data from a non-partitioned table. For more information, see Exporting table data. To export data from an individual partition, append the partition decorator, $date, to the table name. For example: mytable$20160201.
However, running the following extract command extracts the entire table, not just one partition. It is driving me nuts! What am I doing wrong?
bq --location=europe-west2 extract \
--destination_format NEWLINE_DELIMITED_JSON \
--compression GZIP \
bq-project-name:dataset.table_name$20200405 \
"gs://bucket-name/test_ga_sessions*.json.gz"
[Screenshot: partitioning information of the source table]
I have also confirmed that the partition I am attempting to extract exists:
#legacySQL
SELECT
partition_id,
creation_time,
creation_timestamp,
last_modified_time,
last_modified_timestamp
FROM
[dataset.tablename$__PARTITIONS_SUMMARY__]
WHERE partition_id = '20200405'
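The same check can also be done in standard SQL via INFORMATION_SCHEMA.PARTITIONS (a sketch, assuming the dataset and table names used in the extract command above):
#standardSQL
SELECT
partition_id,
total_rows,
last_modified_time
FROM
`dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'table_name'
AND partition_id = '20200405'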

Because I was running the bq extract command in a bash shell, the partition decorator $20200405 was being interpreted as a variable and an empty one at that. Therefore the full partition identifier of bq-project-name:dataset.table_name$20200405 was being interpreted as bq-project-name:dataset.table_name by the time the request reached BigQuery.
In order to get this command to run correctly, all I had to do was escape the $ character of the partition decorator with a backslash as follows:
bq --location=europe-west2 extract \
--destination_format NEWLINE_DELIMITED_JSON \
--compression GZIP \
bq-project-name:dataset.table_name\$20200405 \
"gs://bucket-name/test_ga_sessions*.json.gz"

Related

Partitioning a table in BigQuery by file

I would like to create a table that is partitioned based on the filename. For example, let's say I have a thousand sales files, one for each date, such as:
Files/Sales_2014-01-01.csv, Files/Sales_2014-01-02.csv, ...
I would like to partition the table based on the filename (which is essentially the date). Is there a way to do this in BQ? For example, I want to do a load job similar to the following (in pseudocode):
bq load gs://Files/Sales*.csv PARTITION BY filename
What would be the closest thing I could do to that?
When you have a TIMESTAMP, DATE, or DATETIME column in a table, first create a table that uses time-unit column partitioning. When you load data into the table, BigQuery automatically puts it into the correct partitions based on the values in that column. To create an empty time-unit column-partitioned table with the bq CLI, use a command like the following:
bq mk -t \
--schema 'ts:DATE,qtr:STRING,sales:FLOAT' \
--time_partitioning_field ts \
--time_partitioning_type DAY \
mydataset.mytable
Then load all your sales files into that time-unit column-partitioned table; BigQuery will automatically put each row into the correct partition. The following command loads data from multiple files in gs://mybucket/ into a table named mytable in mydataset, letting BigQuery auto-detect the schema (see the bq load documentation for more information):
bq load \
--autodetect \
--source_format=CSV \
mydataset.mytable \
gs://mybucket/mydata*.csv
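Applied to the sales files from the question (a sketch; the bucket name is a placeholder, and it assumes each CSV contains the ts date column from the schema above):
bq load \
--source_format=CSV \
mydataset.mytable \
"gs://my-bucket/Files/Sales_*.csv"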

How to replace a partition in BQ using queries?

I want to replace the latest partition of my BQ table with data that is already available in an adhoc table.
Could anyone please help me do this?
The command below is not helping:
bq query \
--use_legacy_sql=false \
--replace \
--destination_table 'mydataset.table1$20160301' \
'SELECT
column1,
column2
FROM
mydataset.mytable'
I guess you need to use bq cp instead of bq query.
From the documentation:
To copy a partition, use the bq command-line tool's bq cp (copy) command with a partition decorator ($date) such as $20160201. Optional flags can be used to control the write disposition of the destination partition:
-a or --append_table appends the data from the source partition to an existing table or partition in the destination dataset.
-f or --force overwrites an existing table or partition in the destination dataset and doesn't prompt you for confirmation.
-n or --no_clobber returns the following error message if the table or partition exists in the destination dataset: Table 'project_id:dataset.table or table$date' already exists, skipping. If -n is not specified, the default behavior is to prompt you to choose whether to replace the destination table or partition.
bq --location=location cp \
-f \
project_id:dataset.source_table$source_partition \
project_id:dataset.destination_table$destination_partition
So don't forget to use the -f flag to overwrite the old partition.
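Applied to the question, a minimal sketch (adhoc_table is a hypothetical name for the adhoc table, its schema is assumed to match table1, and the partition decorator is quoted so the shell does not expand it, as in the first answer above):
bq cp -f \
mydataset.adhoc_table \
'mydataset.table1$20160301'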

how to upload a CSV(partitioned by a column name) from cloud storage to BigQuery via CLI?

I want to upload a CSV file (partitioned by a column name) to BigQuery via the CLI. For example, the table should be partitioned by a column called "Time-Key".
Here is my current code:
bq load \
--source_format=CSV \
{projectname:Datasetname.tablename} \
{cloud storage path for csv file} \
./{SchemaNames.json}
How do I add a partitioning parameter to this? (e.g. PARTITION BY TIME_KEY)
From the docs:
--time_partitioning_field
The field used to determine how to create a time-based partition. If time-based partitioning is enabled without this value, the table is partitioned based on the load time.
https://cloud.google.com/bigquery/docs/reference/bq-cli-reference#bq_load
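Putting that together with the command from the question (a sketch; the brace placeholders are the asker's, and it assumes the Time-Key column appears as time_key in the schema file):
bq load \
--source_format=CSV \
--time_partitioning_field=time_key \
--time_partitioning_type=DAY \
{projectname:Datasetname.tablename} \
{cloud storage path for csv file} \
./{SchemaNames.json}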

When creating a new big query external table with parquet files on gcs. Showing error

I was trying to create a BigQuery external table with Parquet files on GCS, but it shows a wrong-format error.
Using the same files to create a native table works fine. Why must it be a native table?
If I use a native table, how can I import more data into it? I don't want to delete and re-create the table every time I get new data.
Any help will be appreciated.
This appears to be supported now, at least in beta. This only works in us-central1 as far as I can tell.
Simply select 'External Table' and set 'Parquet' as your file type.
The current Google documentation might be a bit tricky to understand. It is a two-step process: first create a definition file, then use that as input to create the table.
Creating the definition file, if you are dealing with unpartitioned folders:
bq mkdef \
--source_format=PARQUET \
"<path/to/parquet/folder>/*.parquet" > "<definition/file/path>"
Otherwise, if you are dealing with a Hive-partitioned table:
bq mkdef \
--autodetect \
--source_format=PARQUET \
--hive_partitioning_mode=AUTO \
--hive_partitioning_source_uri_prefix="<path/to/hive/table/folder>" \
"<path/to/hive/table/folder>/*.parquet" > "<definition/file/path>"
Note: <path/to/hive/table/folder> should not include the partition folder.
E.g., if your table files are laid out as gs://project-name/tablename/year=2009/part-000.parquet, create the definition file with:
bq mkdef \
--autodetect \
--source_format=PARQUET \
--hive_partitioning_mode=AUTO \
--hive_partitioning_source_uri_prefix="gs://project-name/tablename" \
"gs://project-name/tablename/*.parquet" > "def_file_name"
Finally, the table can be created from the definition file by running:
bq mk --external_table_definition="<definition/file/path>" "<project_id>:<dataset>.<table_name>"
Parquet is not currently a supported data format for federated tables. You can repeatedly load more data into the same table as long as you append (instead of overwriting) the current contents.
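For example, appending more Parquet files to an existing native table is just another load job; bq load appends to the current contents by default unless --replace is passed (a sketch with placeholder paths):
bq load \
--source_format=PARQUET \
mydataset.mytable \
"gs://mybucket/new_data/*.parquet"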

Incremental updates in HIVE using SQOOP appends data into middle of the table

I am trying to append new data from SQL Server to Hive using the following command:
sqoop import --connect 'jdbc:sqlserver://10.1.1.12;database=testdb' --username uname --password passwd --table testable --where "ID > 11854" --hive-import -hive-table hivedb.hivetesttable --fields-terminated-by ',' -m 1
This command appends the data.
But when I run
select * from hivetesttable;
it does not show the new data at the end.
This is because the Sqoop import that appends the new data writes its mapper output as part-m-00000-copy, so the files in the Hive table directory look like:
part-m-00000
part-m-00000-copy
part-m-00001
part-m-00002
Is there any way to append the data at the end by changing the name of the mapper output?
Hive, like any other relational database, doesn't guarantee any order unless you explicitly use an ORDER BY clause.
You're correct in your analysis - the reason the data appears in the "middle" is that Hive reads one file after another in lexicographical order, and Sqoop simply names the files such that they end up somewhere in the middle of that list.
However, this operation is fully valid - Sqoop appended data to the Hive table, and because your query doesn't have an explicit ORDER BY clause the result carries no guarantees with regard to order. In fact, Hive itself could change this behavior and read files based on creation time without breaking any compatibility.
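If a stable order is needed, it has to be requested explicitly, for example by ordering on the ID column used in the Sqoop --where clause (a sketch, assuming that column exists in the Hive table):
SELECT * FROM hivedb.hivetesttable ORDER BY id;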
I'm also interested in how this affects your use case. I'm assuming the query that lists all rows is just a test. Do you have any issues with actual production queries?