Listing all the partitions of a BigQuery partitioned table with require_partition_filter - google-bigquery

I am trying to find a way to list the partitions of a table created with require_partition_filter = true, but I have not been able to find one yet.
This is the table creation script:
CREATE TABLE mydataset.partitionedtable_partitiontime
(
x INT64
)
PARTITION BY DATE(_PARTITIONTIME)
OPTIONS(
require_partition_filter = true
);
Some test rows:
INSERT INTO mydataset.partitionedtable_partitiontime (_PARTITIONTIME, x) SELECT TIMESTAMP("2017-05-01"), 10;
INSERT INTO mydataset.partitionedtable_partitiontime (_PARTITIONTIME, x) SELECT TIMESTAMP("2017-04-01"), 20;
INSERT INTO mydataset.partitionedtable_partitiontime (_PARTITIONTIME, x) SELECT TIMESTAMP("2017-03-01"), 30;
As expected, if I try the following query to get the partitions, I get an error because I need to use a filter on the partitioning column:
SELECT _PARTITIONTIME as pt, FORMAT_TIMESTAMP("%Y%m%d", _PARTITIONTIME) as partition_id
FROM `mydataset.partitionedtable_partitiontime`
GROUP BY _PARTITIONTIME
ORDER BY _PARTITIONTIME
Error
Cannot query over table 'mydataset.partitionedtable_partitiontime' without a filter over column(s) '_PARTITION_LOAD_TIME', '_PARTITIONDATE', '_PARTITIONTIME' that can be used for partition elimination
Any ideas on how to list the partitions?
EDIT: I know that it is possible to add the filter, but I am looking for a solution like Hive's "SHOW PARTITIONS TABLENAME" that lists all the partitions (which are essentially metadata).
Thanks!

Here is the way to do it with Legacy SQL (the __PARTITIONS_SUMMARY__ meta table is only available there, and Legacy SQL uses square brackets rather than backticks):
SELECT * FROM [mydataset.partitionedtable_partitiontime$__PARTITIONS_SUMMARY__]
The bigquery.jobs.create permission is required.
EDIT: It is now possible to get this information using Standard SQL:
SELECT * FROM `myproject.mydataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'partitionedtable_partitiontime'
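For example, to see each partition's ID, row count, and last modification time (a sketch; partition_id, total_rows, and last_modified_time are standard columns of the INFORMATION_SCHEMA.PARTITIONS view):
SELECT partition_id, total_rows, last_modified_time
FROM `myproject.mydataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'partitionedtable_partitiontime'
ORDER BY partition_id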

As mentioned by hlagos, you can get this data by querying the _PARTITIONTIME pseudo column if you are using Standard SQL, or the __PARTITIONS_SUMMARY__ meta table with Legacy SQL.
You can take a look at this GCP documentation, which contains detailed information about working with partitioned table metadata.

Related

I need to make a backup of a partitioned table (Hive)

I need to back up data from a partitioned table which has over 500 partitions.
The table is partitioned by date_part, e.g. date_part = 20221101, date_part = 20221102, etc.
I need to take the 30 partitions from 20221101 to 20221130 and copy them to a new backup table.
If I do something like this:
create table <backup_table> as
select * from <data_table> where date_part between 20221101 and 20221130
the output is a non-partitioned <backup_table>. I don't know whether that is a good approach, but I guess a partitioned <backup_table> would be better.
If I try to do:
create table <bacup_table> like <data_table>;
insert overwrite table <backup_table> partition (`date_part`)
select * from <data_table> where date_part between 20221101 and 20221130;
I get an error saying that I need to specify the partition columns...
If I go another way:
create table <bacup_table> like <data_table>;
insert overwrite table <backup_table> partition (`date_part`)
select field1, field2...,
date_part
from <data_table> where date_part between 20221101 and 20221130;
I get other errors like "error running query" or "...nonstrict mode..." or something else.
I've tried a lot of Hive settings but it still does not work :(
That's why I need your help to do it correctly.
Enable dynamic partitioning and copy the data:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.mapred.mode = nonstrict;
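Putting the pieces together, a sketch of the full backup flow under those settings (field1 and field2 stand in for the table's real columns):
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
-- clone the schema, including the date_part partitioning
CREATE TABLE backup_table LIKE data_table;
-- the dynamic partition column must be the last column in the SELECT
INSERT OVERWRITE TABLE backup_table PARTITION (date_part)
SELECT field1, field2, date_part
FROM data_table
WHERE date_part BETWEEN 20221101 AND 20221130;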

Hive to BigQuery: converting INSERT OVERWRITE TABLE with PARTITION on an integer column

I am trying to convert the following Hive query to BigQuery with little luck. The idea is to remove the records from the specified partition and insert new records into the partition without touching other partitions. I have seen Google's documentation on using a DML statement to add rows to an ingestion-time partitioned table, but this isn't what I'm trying to accomplish.
INSERT OVERWRITE TABLE mytable PARTITION (integer_id = 100) select tmp.*, NULL as value from (select * from mytable2) as tmp;
Any help would be greatly appreciated!
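No answer was posted here, but one possible approach, sketched under the assumption that mytable is integer-range partitioned on integer_id and that (as in the Hive layout) value and integer_id are its last two columns, is to delete the partition's rows with DML and re-insert the fresh ones:
-- emulate INSERT OVERWRITE for the single partition integer_id = 100
DELETE FROM mydataset.mytable
WHERE integer_id = 100;
INSERT INTO mydataset.mytable
SELECT tmp.*, NULL AS value, 100 AS integer_id
FROM (SELECT * FROM mydataset.mytable2) AS tmp;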

Create BigQuery table from existing table, including _PARTITIONTIME

I want to create a new table from an existing one and add one column. But, and this seems to make it tricky, I want it to be partitioned by _PARTITIONTIME.
I know I can create a table from an existing table, like so:
CREATE OR REPLACE TABLE `mydataset.mytable_new`
AS SELECT * FROM `mydataset.mytable`
--JOIN the new column here
LIMIT 0
I also know that I can create a partitioned table, like so:
CREATE OR REPLACE TABLE `mydataset.mytable_new`
(
date DATE,
var1 STRING,
var2 INT64,
--add new column here
)
PARTITION BY DATE(_PARTITIONTIME);
But: How can I combine the two? I tried this:
CREATE OR REPLACE TABLE `mydataset.mytable_new`
PARTITION BY DATE(_PARTITIONTIME)
AS SELECT * FROM `mydataset.mytable`
-- JOIN new column here
LIMIT 0
However, this gives me the error 'Unrecognized name: _PARTITIONTIME'.
Any hints are greatly appreciated!
This is a documented limitation of the partitioning expression of the CREATE TABLE syntax:
https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#partition_expression
_PARTITIONDATE. Partition by ingestion time with daily partitions. This syntax cannot be used with the AS query_statement clause.
I believe you should be able to split the work. Use a statement to create the new table, then issue INSERT statement(s) to populate from the original table.
However, if you're already dealing with a sizable table, you may want to re-consider this partitioning scheme. By default, all the data from the original table would land in a single partition (the current date).
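A sketch of that split, reusing the column names from the question (new_col is a hypothetical added column). Note the INSERT can preserve each row's original partition by writing _PARTITIONTIME explicitly, the same way the INSERT statements at the top of this page do:
-- step 1: create the ingestion-time partitioned shell (no AS clause allowed here)
CREATE OR REPLACE TABLE `mydataset.mytable_new`
(
date DATE,
var1 STRING,
var2 INT64,
new_col STRING
)
PARTITION BY DATE(_PARTITIONTIME);
-- step 2: populate it, carrying each row's original partition along
INSERT INTO `mydataset.mytable_new` (_PARTITIONTIME, date, var1, var2, new_col)
SELECT _PARTITIONTIME, date, var1, var2, CAST(NULL AS STRING)
FROM `mydataset.mytable`;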
You may try something like:
CREATE TABLE
mydataset.newtable (transaction_id INT64, transaction_date DATE)
PARTITION BY
transaction_date
AS SELECT transaction_id, transaction_date FROM mydataset.mytable
From the docs: https://cloud.google.com/bigquery/docs/creating-partitioned-tables#sql
I've had a similar problem and found it is possible to use three statements to get the desired approach. In my case, I wanted to both omit columns from a seed table, but also add a new column. Given that I had a large number of columns and I had to repeat this process for 10+ tables, I didn't want to go through the effort of discovering the types for every column. Here were my requirements:
Use the existing schema, omitting columns I don't want
Modify the type of an existing column (Date)
Add a new column (metadata)
Final table will be partitioned by _PARTITIONDATE
Can copy selected data into the final table
The result, left fully intact for clarity:
-- Create base table
CREATE TEMP TABLE p_ad_group_performance
AS
SELECT
TIMESTAMP(Date) as Date,
PartnerName,
Advertiser,
Campaign,
AdGroup,
DeviceType,
AdvertiserCurrencyCode,
CreativeDurationInSeconds,
Creative,
PartnerCurrencyCode,
Impressions,
Clicks,
Player25Complete,
Player50Complete,
Player75Complete,
PlayerCompletedViews,
TotalSecondsInView,
CompanionClicks,
CompanionImpressions,
PartnerCostPartnerCurrency,
SampledViewedImpressions,
SampledTrackedImpressions,
PlayerStarts,
_01ClickConversion,
_01ClickConversionRevenue,
_01TotalClickViewConversions,
_01ViewThroughConversion,
_01ViewThroughConversionRevenue,
_01TotalClickViewConversionRevenue,
STRUCT('migrated' as filename, 'migrated' as location, _PARTITIONTIME as uploaded_at, _PARTITIONTIME as last_modified ) as metadata
FROM trade_desk.p_ad_group_performance WHERE DATE(_PARTITIONTIME) = "2022-12-06" LIMIT 1;
-- Create partitioned version of base table
CREATE TABLE IF NOT EXISTS trade_desk.p_ad_group_performance
LIKE p_ad_group_performance
PARTITION BY _PARTITIONDATE;
-- Populate the final table with the seed data
INSERT trade_desk.p_ad_group_performance (
Date,
PartnerName,
Advertiser,
Campaign,
AdGroup,
DeviceType,
AdvertiserCurrencyCode,
CreativeDurationInSeconds,
Creative,
PartnerCurrencyCode,
Impressions,
Clicks,
Player25Complete,
Player50Complete,
Player75Complete,
PlayerCompletedViews,
TotalSecondsInView,
CompanionClicks,
CompanionImpressions,
PartnerCostPartnerCurrency,
SampledViewedImpressions,
SampledTrackedImpressions,
PlayerStarts,
_01ClickConversion,
_01ClickConversionRevenue,
_01TotalClickViewConversions,
_01ViewThroughConversion,
_01ViewThroughConversionRevenue,
_01TotalClickViewConversionRevenue,
metadata
)
SELECT
TIMESTAMP(Date) as Date,
PartnerName,
Advertiser,
Campaign,
AdGroup,
DeviceType,
AdvertiserCurrencyCode,
CreativeDurationInSeconds,
Creative,
PartnerCurrencyCode,
Impressions,
Clicks,
Player25Complete,
Player50Complete,
Player75Complete,
PlayerCompletedViews,
TotalSecondsInView,
CompanionClicks,
CompanionImpressions,
PartnerCostPartnerCurrency,
SampledViewedImpressions,
SampledTrackedImpressions,
PlayerStarts,
_01ClickConversion,
_01ClickConversionRevenue,
_01TotalClickViewConversions,
_01ViewThroughConversion,
_01ViewThroughConversionRevenue,
_01TotalClickViewConversionRevenue,
STRUCT('migrated' as filename, 'migrated' as location, _PARTITIONTIME as uploaded_at, _PARTITIONTIME as last_modified ) as metadata
FROM trade_desk.p_ad_group_performance WHERE DATE(_PARTITIONTIME) = "2022-12-06";
Honestly, it's more lines of code than I would really like, but it seems to be the only way to get around the restrictions of using _PARTITIONDATE as the partition. Most of it is simply copying and pasting the same column references. The original tables for me were 60+ columns; if you were only skipping one or two, you could simply use the EXCEPT keyword, as sketched below.
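For instance, a sketch of that EXCEPT shortcut (the unwanted_col names are placeholders):
SELECT * EXCEPT (unwanted_col1, unwanted_col2)
FROM trade_desk.p_ad_group_performance
WHERE DATE(_PARTITIONTIME) = "2022-12-06"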
Hope this helps!

BigQuery Equivalent of "CREATE TABLE my_table (LIKE your_table)"

I want to create a table whose schema is exactly the same as another table's. In other SQL engines, I was able to use "CREATE TABLE my_table (LIKE your_table)" or some variation.
I couldn't find the equivalent in BigQuery yet. Is this possible in some fashion?
Use this form:
CREATE TABLE dataset.new_table AS
SELECT *
FROM dataset.existing_table
LIMIT 0
This creates a new table with the same schema as the old one, and there is no cost due to the LIMIT 0.
Note that this does not preserve partitioning, table description, etc., however. Another option is to use the CLI (or API), making a copy of the table and then overwriting its contents, e.g.:
$ bq cp dataset.existing_table dataset.new_table
$ bq query --use_legacy_sql --replace --destination_table=dataset.new_table \
"SELECT * FROM dataset.new_table LIMIT 0;"
Now the new table has the same structure and attributes as the original did.
To create a partitioned and/or clustered table the syntax would be:
CREATE TABLE project.dataset.clustered_table
PARTITION BY DATE(created_time)
CLUSTER BY
account_id
AS SELECT * FROM project.dataset.example_table LIMIT 0
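BigQuery DDL has since gained a direct LIKE clause (the same construct used in the previous question's answer), which copies the schema without any data:
CREATE TABLE mydataset.new_table
LIKE mydataset.existing_table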

How to return record count in PostgreSQL

I have a query with a limit and an offset. For example:
select * from tbl
limit 10 offset 100;
How can I keep track of the total record count without running a second query like:
select count(*) from tbl;
I think this answers my question, but I need it for PostgreSQL. Any ideas?
I have found a solution and I want to share it. I create a temp table from my real table with the filters applied, then I select from the temp table with a limit and offset (no filters, so the performance is good), then select count(*) from the temp table (again no filters), then whatever else I need, and finally I drop the temp table.
select * into tmp_tbl from tbl where [limitations];
select * from tmp_tbl offset 10 limit 10;
select count(*) from tmp_tbl;
select other_stuff from tmp_tbl;
drop table tmp_tbl;
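A variant of the same idea, if the calls can share one transaction: marking the temp table ON COMMIT DROP makes the cleanup automatic (a sketch, with the filter left as a placeholder):
begin;
-- temp table vanishes automatically when the transaction commits
create temp table tmp_tbl on commit drop as
select * from tbl where [limitations];
select * from tmp_tbl offset 10 limit 10;
select count(*) from tmp_tbl;
commit;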
I haven't tried this, but according to the section titled Obtaining the Result Status in the PL/pgSQL documentation, you can use the GET DIAGNOSTICS command to determine the effect of a command.
GET DIAGNOSTICS number_of_rows = ROW_COUNT;
From the documentation:
This command allows retrieval of system status indicators. Each item
is a key word identifying a state value to be assigned to the
specified variable (which should be of the right data type to receive
it). The currently available status items are ROW_COUNT, the number of
rows processed by the last SQL command sent down to the SQL engine,
and RESULT_OID, the OID of the last row inserted by the most recent
SQL command. Note that RESULT_OID is only useful after an INSERT
command into a table containing OIDs.
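A minimal PL/pgSQL sketch of that command (GET DIAGNOSTICS only works inside a PL/pgSQL block; note that ROW_COUNT reflects the rows the last command itself processed, so a LIMIT query reports at most the page size, not the total):
DO $$
DECLARE
    number_of_rows integer;
BEGIN
    -- PERFORM runs the query and discards the result rows
    PERFORM 1 FROM tbl LIMIT 10 OFFSET 100;
    -- ROW_COUNT is the number of rows processed by the last command
    GET DIAGNOSTICS number_of_rows = ROW_COUNT;
    RAISE NOTICE 'rows on this page: %', number_of_rows;
END $$;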
It depends on whether you need it from the psql CLI or whether you're accessing the database from something like an HTTP server. I am using Postgres from my Node server with node-postgres. The result set is returned as an array called 'rows' on the result object, so I can just do
console.log(results.rows.length)
to get the row count.