BigQuery : Data processed when running Select on Custom Partitioning Field - google-bigquery

I have a table that is partitioned by day on a TIMESTAMP field my_partition_field from the schema (and not on the ingestion time _PARTITIONTIME).
When I execute the following query:
SELECT my_partition_field FROM MY_TABLE;
BigQuery tells me that "This query will process XX MB when run". The amount of data processed is the same as if the field were not the partitioning field.
However, if I have the same table partitioned by ingestion time and I run the following query:
SELECT _PARTITIONTIME FROM MY_TABLE_2;
BigQuery tells me that "This query will process 0 B when run."
Why is there a difference in the data processed (and billed :) ) between these two cases?

When you create a partitioned table in BigQuery, your charges are based on how much data is stored in the partitions and on the queries you run against that data[1]. Many partitioned-table operations are free, and reading _PARTITIONTIME is one of them[2]. The difference comes from how the two columns are stored: _PARTITIONTIME is a pseudo column whose values live in partition metadata rather than in the table itself, so selecting it alone processes 0 bytes. A custom partitioning column such as my_partition_field is an ordinary column in a time-unit partitioned table (based on a TIMESTAMP, DATE or DATETIME value), so its data is physically stored and selecting it reads, and bills, the full column like any other field.
[1]https://cloud.google.com/bigquery/docs/partitioned-tables#pricing
[2]https://cloud.google.com/bigquery/pricing#free
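As a practical consequence, the way to cut bytes billed on a column-partitioned table is to filter on the partitioning column so BigQuery can prune partitions. A sketch, reusing my_partition_field and MY_TABLE from the question (dates are illustrative):
-- Only the partition for this one day is scanned,
-- instead of the whole my_partition_field column
SELECT my_partition_field
FROM MY_TABLE
WHERE my_partition_field >= TIMESTAMP('2024-01-01')
  AND my_partition_field <  TIMESTAMP('2024-01-02');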

Related

Ingestion-time partitioned table in BQ was "made available" for queries almost 14 hours after the beginning of the day?

We set up a streaming insert into a BQ table that is partitioned like this (ingestion-time partitioned):
Table Type Partitioned
Partitioned by DAY
Partitioned on field _PARTITIONTIME
Partition expiration
Partition filter Required
We know the table had fresh data today because a preview in the BigQuery console showed rows being added to the table.
We tried the following query, which estimated 0 B, for several hours after the start of the UTC day:
select * from `MQTT_trackers_partitioned`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2022-06-01') AND TIMESTAMP('2022-06-02');
At about 13:30 UTC today, the same query showed "This query will process 285.99 KB when run." and worked fine.
Why did BQ take so long (about 13 hours!) to make the partitioned table data available to this query? We are inserting data every minute, 24x7, into this dataset, so I would expect closer to real-time availability given these frequent streaming inserts. Are we missing some other detail to make this work?

Is it possible to set expiration time for records in BigQuery

Is it possible to set a time to live for a column in BigQuery?
If a table has the columns payment_details and timestamp, rows in the BigQuery table should be deleted automatically once current time minus timestamp is greater than 90 days.
Solution 1:
BigQuery has a partition expiration feature. You can leverage that for your use case.
Essentially you need to create a partitioned table, and set the partition_expiration_days option to 90 days.
CREATE TABLE mydataset.newtable (
  transaction_id INT64,
  transaction_date DATE
)
PARTITION BY transaction_date
OPTIONS (
  partition_expiration_days = 90
);
or if you have a table partitioned already by the right column
ALTER TABLE mydataset.mytable
SET OPTIONS (
  -- Sets partition expiration to 90 days
  partition_expiration_days = 90
);
When a partition expires, BigQuery deletes the data in that partition.
Solution 2:
You can set up a scheduled query that prunes, hourly or daily, the data older than 90 days. By writing a DELETE query you get more control and can combine other business logic, e.g. delete only duplicate rows but keep the most recent entry even if it is older than 90 days.
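A minimal sketch of such a scheduled DELETE, assuming the transaction_date partitioning column from Solution 1 (substitute your own column):
-- Run hourly/daily as a scheduled query
DELETE FROM mydataset.mytable
WHERE transaction_date < DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY);
Because the filter is on the partitioning column, the DELETE only touches the expired partitions rather than scanning the whole table.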
Solution 3:
If a larger business process drives the 90-day pruning based on other external factors, such as an API response or some conditional evaluation, you can leverage Cloud Workflows to build and invoke a workflow regularly to automate the pruning of your data. See the article Automate the execution of BigQuery queries with Cloud Workflows, which can guide you through this.

Get the most recent Timestamp value

I have a pipeline which reads from a BigQuery table, performs some processing on the data and saves it into a new BigQuery table. This is a batch process performed weekly through a cron. Entries keep being added to the source table, so whenever I start the ETL process I want it to process only the rows added since the last time the ETL job ran.
To achieve this, I have thought about querying my sink table for the most recent timestamp it contains. Then, as a data source, I will run another query against the source table, filtering for entries with a timestamp higher than the one I just recovered. Both my source and sink tables are time-partitioned.
The query I am using for getting the latest entry on my sink table is the following one:
SELECT Timestamp
FROM `myproject.mydataset.mytable`
ORDER BY Timestamp DESC
LIMIT 1
It gives me the correct value, but I feel it is not the most efficient way of querying it. Does this query take advantage of my table's partitioning? Is there a better way of retrieving the most recent timestamp from my table?
I'm going to refer to the timestamp field as ts_field for your example.
To get the latest timestamp, I would run the following query:
SELECT max(ts_field)
FROM `myproject.mydataset.mytable`
If your table is also partitioned on the timestamp field, you can do something like this to scan even fewer bytes:
SELECT max(ts_field)
FROM `myproject.mydataset.mytable`
WHERE date(ts_field) = current_date()
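Note that if no rows have arrived today, the current_date() filter returns NULL. One workaround is to widen the window to the last few partitions, for example (the 7-day window is an arbitrary choice):
SELECT max(ts_field)
FROM `myproject.mydataset.mytable`
WHERE date(ts_field) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
This still prunes almost all partitions while tolerating gaps of up to a week in the data.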

BigQuery - Max Partiton Date for a Custom Partitioned Table

Is there a metadata operation that can give me the max partition date/timestamp in use (for a column-partitioned table, not ingestion-time partitioning), so that I do not need to scan the whole table with the MAX function, or some other clever SQL way? Our source table is very large, and it gets a fresh snapshot of data most days, but that data is generally for current_date()-1. All in all, I can't rely on much except a query that tells me the max partition in use without costing the earth on a large table. Thoughts? The naive, full-scan approach would be:
SELECT MAX(custom_partition_field) FROM Y
#legacySQL
SELECT MAX(partition_id)
FROM [project:dataset.your_table$__PARTITIONS_SUMMARY__]
This is documented under Listing partitions in partitioned tables.
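In standard SQL, the INFORMATION_SCHEMA.PARTITIONS view exposes the same metadata (this is a later addition, not in the original answer; substitute your own project, dataset and table names):
SELECT MAX(partition_id) AS max_partition
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'your_table'
  AND partition_id NOT IN ('__NULL__', '__UNPARTITIONED__')
Like the legacy-SQL variant, this reads only partition metadata, not the table's data.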

How to query for data in streaming buffer ONLY in BigQuery?

We have a table partitioned by day in BigQuery, which is updated by streaming inserts.
The doc says that: "when streaming to a partitioned table, data in the streaming buffer has a NULL value for the _PARTITIONTIME pseudo column"
But if I query select count(*) from table where _PARTITIONTIME is NULL, it always returns 0, even though bq show tells me that there are a lot of rows in the streaming buffer.
Does this mean that the pseudo column is not present at all for rows in streaming buffer? In any case, how can I query for the data ONLY in the streaming buffer without it becoming a full table scan?
Thanks in advance
Data in the streaming buffer has a NULL value for the _PARTITIONTIME column.
SELECT
fields
FROM
`dataset.partitioned_table_name`
WHERE
_PARTITIONTIME IS NULL
https://cloud.google.com/bigquery/docs/partitioned-tables#copying_to_partitioned_tables
When you stream data to BQ there is usually a "warming-up" period: the time it takes for the streamed data to become available for operations such as querying, copying and exporting.
The doc also states that after a period of up to 90 minutes the pseudo column _PARTITIONTIME receives a non-null value, which means your streamed data is fully ready for any type of operation you want to run on it (being able to run queries usually takes only a few seconds).
That means you normally don't query partitioned tables looking for rows where this field is null; instead, you do it like so:
SELECT
fields
FROM
`dataset.partitioned_table_name`
WHERE
_PARTITIONTIME = TIMESTAMP('2017-01-20')
In this example, you would be querying only the data streamed into the 2017-01-20 partition (which avoids a full table scan).
You can also select a range of dates; you would just change the WHERE clause to:
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2017-01-20') AND TIMESTAMP('2017-01-22')
which queries the partitions in that date range.