How to efficiently get the partitions of a BigQuery table - google-bigquery

Is there a way of getting a list of the partitions in a BigQuery date-partitioned table? Right now the best way I have found to do this is using the _PARTITIONTIME meta-column, but this needs to scan all the rows in all the partitions. Is there an equivalent to a show partitions call or maybe something in the bq command-line tool?

To list partitions in a table, query the table's summary partition by using the partition decorator separator ($) followed by __PARTITIONS_SUMMARY__. This meta-table is only available in legacy SQL. For example, the following query retrieves the partition IDs for table1:
#legacySQL
SELECT partition_id FROM [mydataset.table1$__PARTITIONS_SUMMARY__];
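A minimal sketch of assembling that query from Python and preparing a bq command line for it. The helper names partitions_summary_query and bq_command are hypothetical, and the dataset and table names are placeholders. The --use_legacy_sql=true flag is there because the summary meta-table is a legacy SQL feature.

```python
def partitions_summary_query(dataset, table):
    """Build the legacy SQL query that lists partition IDs for a table."""
    # Hypothetical helper: dataset and table are caller-supplied names.
    return f"SELECT partition_id FROM [{dataset}.{table}$__PARTITIONS_SUMMARY__]"


def bq_command(dataset, table):
    """Return an argv list for the bq CLI (built only, not executed here)."""
    return ["bq", "query", "--use_legacy_sql=true",
            partitions_summary_query(dataset, table)]


if __name__ == "__main__":
    print(partitions_summary_query("mydataset", "table1"))
```

The argv list could be passed to subprocess.run on a machine where the bq tool is installed and authenticated.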

Related

Athena (Hive/Presto) query partitioned table IN statement

I have the following partitioned table in Athena (HIVE/Presto):
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable (
id STRING,
data STRING
)
PARTITIONED BY (
year string,
month string,
day string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3://mybucket';
Data is stored in s3 organized in a path structure like s3://mybucket/year=2020/month=01/day=30/.
I would like to know if the following query would leverage partitioning optimization:
SELECT
*
FROM
mydb.mytable
WHERE
(year='2020' AND month='08' AND day IN ('10', '11', '12')) OR
(year='2020' AND month='07' AND day IN ('29', '30', '31'));
I am assuming that, since the IN operator will be transformed into a series of OR conditions, this query will still benefit from partitioning. Am I correct?
Unfortunately Athena does not expose information that would make it easier to understand how to optimise queries. Currently the only thing you can do is to run different variations of queries and look at the statistics returned in the GetQueryExecution API call.
One way to figure out if Athena will make use of partitioning in a query is to run the query with different values for the partition column and check whether the amount of data scanned changes. If the amount of data differs, Athena was able to prune partitions during query planning.
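A rough sketch of that comparison, assuming you already have the GetQueryExecution responses for two runs as Python dicts (for example from boto3's get_query_execution call). The helper names are hypothetical and the byte counts in the demo are made up; only the QueryExecution.Statistics.DataScannedInBytes field is read.

```python
def data_scanned_bytes(query_execution_response):
    """Extract DataScannedInBytes from a GetQueryExecution response dict."""
    stats = query_execution_response["QueryExecution"]["Statistics"]
    return stats["DataScannedInBytes"]


def partitions_were_pruned(filtered_response, unfiltered_response):
    """Heuristic: the filtered query scanned strictly less data than the full scan."""
    return data_scanned_bytes(filtered_response) < data_scanned_bytes(unfiltered_response)


if __name__ == "__main__":
    # Made-up responses, trimmed to the one field used here.
    full = {"QueryExecution": {"Statistics": {"DataScannedInBytes": 5_000_000}}}
    filtered = {"QueryExecution": {"Statistics": {"DataScannedInBytes": 120_000}}}
    print(partitions_were_pruned(filtered, full))
```

This is only a heuristic: it tells you that less data was read, not which partitions were selected.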
Yes, it's also mentioned in the documentation.
When Athena runs a query on a partitioned table, it checks to see if any partitioned columns are used in the WHERE clause of the query. If partitioned columns are used, Athena requests the AWS Glue Data Catalog to return the partition specification matching the specified partition columns. The partition specification includes the LOCATION property that tells Athena which Amazon S3 prefix to use when reading data. In this case, only data stored in this prefix is scanned. If you do not use partitioned columns in the WHERE clause, Athena scans all the files that belong to the table's partitions.
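To make the effect concrete, here is a small conceptual model (not Athena's actual implementation) of which S3 prefixes the query in the question would be limited to. It assumes each partition is registered with the Hive-style location shown in the question. The function names are hypothetical.

```python
def partition_prefixes(bucket, year, month, days):
    """Return Hive-style S3 prefixes for one year/month and a list of days."""
    return [f"s3://{bucket}/year={year}/month={month}/day={d}/" for d in days]


def prefixes_for_question_query():
    """Prefixes matching the two OR branches of the WHERE clause in the question."""
    return (partition_prefixes("mybucket", "2020", "08", ["10", "11", "12"])
            + partition_prefixes("mybucket", "2020", "07", ["29", "30", "31"]))


if __name__ == "__main__":
    for prefix in prefixes_for_question_query():
        print(prefix)
```

Under this model, six prefixes are read instead of every prefix under s3://mybucket.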

How to disallow loading duplicate rows to BigQuery?

I was wondering if there is a way to disallow duplicates from BigQuery?
Based on this article I can deduplicate a whole table or a single partition of a table.
To deduplicate a whole table:
CREATE OR REPLACE TABLE `transactions.testdata`
PARTITION BY date
AS SELECT DISTINCT * FROM `transactions.testdata`;
To deduplicate a table based on partitions defined in a WHERE clause:
MERGE `transactions.testdata` t
USING (
SELECT DISTINCT *
FROM `transactions.testdata`
WHERE date=CURRENT_DATE()
)
ON FALSE
WHEN NOT MATCHED BY SOURCE AND date=CURRENT_DATE() THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
If there is no way to disallow duplicates then is this a reasonable approach to deduplicate a table?
BigQuery doesn't enforce constraints the way a traditional DBMS does. In other words, you can't rely on a primary key or anything like it to reject duplicates, because BigQuery is focused not on transactions but on fast analysis and scalability. You should think of it as a data lake rather than a database with a uniqueness property.
If you have an existing table and need to de-duplicate it, the approaches you mentioned will work. If you need your table to have unique rows by default and want to programmatically insert unique rows into it without resorting to external resources, I can suggest a workaround:
First, insert your data into a temporary table.
Then, run a query on your temporary table and save the results into your actual table. This step can be done programmatically in a couple of ways:
Using the approach you mentioned as a scheduled query
Using a bq command such as bq query --use_legacy_sql=false --destination_table=<dataset.actual_table> 'select distinct * from <dataset.temporary_table>' that will query the distinct values in your temporary table and load the results into the target table named by the --destination_table flag. It's important to mention that this approach will also work for partitioned tables.
Finally, drop the temporary table. Like the previous step, this can be done using either a scheduled query or a bq command.
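The three steps above can be sketched end to end. This illustration uses Python's built-in sqlite3 module as a stand-in, not BigQuery, so that it runs anywhere. The table names mirror the placeholders above, and the two-column schema is an assumption made for the demo.

```python
import sqlite3


def load_deduplicated(conn, rows):
    """Stage rows, copy only distinct ones into the target, then drop the stage."""
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS actual_table (id TEXT, data TEXT)")
    # Step 1: insert the incoming data into a temporary (staging) table.
    cur.execute("CREATE TABLE temporary_table (id TEXT, data TEXT)")
    cur.executemany("INSERT INTO temporary_table VALUES (?, ?)", rows)
    # Step 2: save the distinct rows into the actual table.
    cur.execute("INSERT INTO actual_table SELECT DISTINCT * FROM temporary_table")
    # Step 3: drop the temporary table.
    cur.execute("DROP TABLE temporary_table")
    conn.commit()
    return cur.execute("SELECT id, data FROM actual_table ORDER BY id").fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(load_deduplicated(conn, [("1", "a"), ("1", "a"), ("2", "b")]))
```

Note that SELECT DISTINCT only removes duplicates within the staged batch. Rows that already exist in the actual table are not checked, so repeated loads of the same data would still need the MERGE approach from the question.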
I hope it helps

How to know whether data is coming from a partitioned filegroup or the query is reading all records in the table

I partitioned a MSSQL table on a DateTime column.
I created a partition function, a partition scheme, and 4 filegroups, and specified the boundary values.
I then queried the table with a WHERE condition on the partitioning column.
How can I tell whether the query reads all the records in the table or only the relevant filegroup?
In other words, how do I know whether the query is using the partitions?
One way is with the actual query execution plan. The Actual Partition Count of the seek/scan operator will show the actual number of partitions touched.
Another method is to run the query with SET STATISTICS IO ON, where the scan count of the table will reflect the number of partitions used.
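To judge whether the partition count you observe is the one you expect, it helps to know which partition a given value falls into. The sketch below models the RANGE LEFT and RANGE RIGHT boundary rules in plain Python. The boundary dates are hypothetical (three boundaries giving four partitions), and the function name is an assumption for illustration.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime


def partition_number(value, boundaries, range_right=True):
    """Return the 1-based partition a value maps to for sorted boundary values."""
    if range_right:
        # RANGE RIGHT: a boundary value belongs to the partition on its right.
        return bisect_right(boundaries, value) + 1
    # RANGE LEFT: a boundary value belongs to the partition on its left.
    return bisect_left(boundaries, value) + 1


if __name__ == "__main__":
    # Hypothetical boundaries: three values give four partitions.
    boundaries = [datetime(2020, 1, 1), datetime(2020, 4, 1), datetime(2020, 7, 1)]
    print(partition_number(datetime(2020, 4, 1), boundaries, range_right=True))
    print(partition_number(datetime(2020, 4, 1), boundaries, range_right=False))
```

On the server itself, the $PARTITION function applied to your partition function name gives the authoritative partition number for a value.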

BigQuery - Max Partition Date for a Custom Partitioned Table

Is there a metadata operation that can give me the max partition date/timestamp in use (for a custom partitioned table, not ingestion-time partitioning), so that I do not need to scan the whole table with the MAX function? Or some other clever SQL way? Our source table is very large and gets a fresh snapshot of data most days, generally for current_date()-1. All in all, I can't rely on much except a query that tells me the max partition in use, and it shouldn't cost the earth on a large table. Any thoughts?
SELECT MAX(custom_partition_field) FROM Y
#legacySQL
SELECT MAX(partition_id)
FROM [project:dataset.your_table$__PARTITIONS_SUMMARY__]
This is documented under "Listing partitions in partitioned tables".
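One caveat worth guarding against: MAX over partition_id compares strings, so if the listing ever contains a special ID such as __NULL__ or __UNPARTITIONED__ (an assumption here; check what your table actually returns), that ID would sort after any date. A minimal sketch of picking the latest dated partition on the client side, with a hypothetical function name:

```python
def latest_partition_id(partition_ids):
    """Return the greatest all-digit partition ID, or None if there is none."""
    # Assumption: special IDs such as __NULL__ or __UNPARTITIONED__ may appear
    # in the listing; they are skipped because they sort after digit strings.
    dated = [p for p in partition_ids if p.isdigit()]
    # String comparison is chronological when all IDs share one fixed-width format.
    return max(dated) if dated else None


if __name__ == "__main__":
    ids = ["20200130", "__NULL__", "20200131", "__UNPARTITIONED__"]
    print(latest_partition_id(ids))
```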

Apache hive - How to limit partitions in show command

Is there any way to limit the number of Hive partitions while listing the partitions in show command?
I have a Hive table which has around 500 partitions, and I want only the latest partition. The show command lists all the partitions. I am using this partition to find out the location details. I do not have access to the metastore to query the details, and the partition location is where the actual data resides.
I tried set hive.limit.query.max.table.partition=1 but this does not affect the metastore query. So, is there any other way to limit the partitions listed?
Thank you,
Revathy.
Are you running from the command line?
If so, you can get the desired result with something like this:
hive -e "set hive.cli.print.header=false;show partitions table_name;" | tail -1
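The tail -1 approach assumes the last line of the listing is the latest partition. If you would rather not depend on the output order, a small script can pick the greatest partition spec itself. This is a sketch with a hypothetical function name. It assumes the partition values are zero-padded, so that string comparison matches chronological order.

```python
def latest_partition(show_partitions_output):
    """Return the greatest partition spec from `show partitions` text output."""
    # Keep only lines that look like partition specs, e.g. "year=2020/month=08".
    specs = [line.strip() for line in show_partitions_output.splitlines()
             if "=" in line]
    if not specs:
        return None

    def sort_key(spec):
        # "year=2020/month=08/day=12" -> ("2020", "08", "12")
        return tuple(part.split("=", 1)[1] for part in spec.split("/"))

    return max(specs, key=sort_key)


if __name__ == "__main__":
    sample = "dt=2020-01-01\ndt=2020-03-01\ndt=2020-02-01\n"
    print(latest_partition(sample))
```

The text passed in would be the captured standard output of the hive -e command above.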
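There is a "BAD" way to obtain what you want: treat the partition columns like ordinary columns and select them with a LIMIT query. This may read the table's data rather than just its metadata, so it can be slow. Sort in descending order to get the latest partition:
SELECT DISTINCT partition_column
FROM partitioned_table
ORDER BY partition_column DESC
LIMIT 1;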
The only way to filter SHOW PARTITIONS is with a PARTITION clause:
SHOW PARTITIONS partitioned_table PARTITION ( partitioned_column = "somevalue" );