Get the most recent Timestamp value - sql

I have a pipeline which reads from a BigQuery table, performs some processing on the data and saves it into a new BigQuery table. This is a batch process performed on a weekly basis through a cron job. Entries keep being added to the source table, so whenever I start the ETL process I want it to process only the new rows that have been added since the last time the ETL job was launched.
To achieve this, I have thought about querying my sink table for the most recent timestamp it contains. Then, as a data source, I would run another query against the source table, filtering for the entries whose timestamp is higher than the one I have just retrieved. Both my source and sink tables are time-partitioned.
The query I am using to get the latest entry in my sink table is the following:
SELECT Timestamp
FROM `myproject.mydataset.mytable`
ORDER BY Timestamp DESC
LIMIT 1
It gives me the correct value, but I feel like it is not the most efficient way of querying it. Does this query take advantage of my table being partitioned? Is there a better way of retrieving the most recent timestamp from my table?

I'm going to refer to the timestamp field as ts_field for your example.
To get the latest timestamp, I would run the following query:
SELECT max(ts_field)
FROM `myproject.mydataset.mytable`
If your table is also partitioned on the timestamp field, you can do something like this to scan even fewer bytes:
SELECT max(ts_field)
FROM `myproject.mydataset.mytable`
WHERE date(ts_field) = current_date()
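To tie this back to the original pipeline: once the latest timestamp has been retrieved from the sink table, it can be used to filter the source query. A minimal sketch, assuming the source table is also partitioned on ts_field and using placeholder table names (not from the question):
-- Read only the source rows newer than the latest row already in the sink.
-- Table names are placeholders; confirm with a dry run that the filter
-- actually prunes partitions in your setup.
DECLARE last_loaded TIMESTAMP DEFAULT (
  SELECT MAX(ts_field) FROM `myproject.mydataset.sink_table`
);

SELECT *
FROM `myproject.mydataset.source_table`
WHERE ts_field > last_loaded;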

Related

Querying a partitioned BigQuery table across multiple far-apart _PARTITIONDATE days?

I have a number of very big tables that are partitioned by _PARTITIONDATE which I'd like to query regularly in an efficient way. Each time I run the query, I only need to search across a small number of dates, but these dates will change every run and may be months or years apart from one another.
To capture these dates, I could do _PARTITIONDATE >= '2015-01-01', but this makes the queries run very slow as there are millions of rows in each partition. I could also do _PARTITIONDATE BETWEEN '2015-01-01' AND '2017-01-01', but the exact date range will change every run. What I'd like to do is something like _PARTITIONDATE IN ("2015-03-10", "2016-01-24", "2016-03-22", "2017-06-14"), so that the query only needs to run on the dates provided, which from my testing appears to work.
The problem I'm running into is that the list of dates changes every time, requiring me to first join in the list of dates from a temp table. When I do that, like source._PARTITIONDATE IN (datelist.date), it does not work and runs into an error if that's the only WHERE condition when querying a partition-required table.
Any advice on how I might get this to work, or on another approach to querying specific partitions that aren't back to back, without having to scan the whole table?
I've been reading through the BigQuery documentation but I don't see an answer to this question. I do see it says that the following "doesn't limit the scanned partitions, because it uses table values, which are dynamic." So possibly what I'm trying to do is impossible with the current BigQuery limitations?
_PARTITIONTIME = (SELECT MAX(timestamp) from dataset.table1)
A script is a possible solution:
DECLARE max_date DEFAULT (SELECT MAX(...) FROM ...);
SELECT .... FROM ... WHERE _PARTITIONDATE = max_date;
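The same idea extends to the changing list of dates from the question: declare them as an array variable so that only the list changes between runs. A sketch with made-up dates and a placeholder table name; whether this prunes partitions as well as literal dates do should be checked with a dry run:
-- Hypothetical date list; replace with the dates needed for this run.
DECLARE date_list ARRAY<DATE> DEFAULT [DATE '2015-03-10', DATE '2016-01-24', DATE '2017-06-14'];

SELECT *
FROM `mydataset.mytable`
WHERE _PARTITIONDATE IN UNNEST(date_list);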

BigQuery: Data processed when running Select on Custom Partitioning Field

I have a table that is partitioned by day using a TIMESTAMP field from the schema, my_partition_field (and not the ingestion time _PARTITIONTIME).
When I execute the following query:
SELECT my_partition_field FROM MY_TABLE;
BigQuery tells me that "This query will process XX MB when run". The amount of data processed is the same as if the field was not the partitioning field.
However, if I have the same table partitioned by ingestion time and I run the following query :
SELECT _PARTITIONTIME FROM MY_TABLE_2;
BigQuery tells me that "This query will process 0 B when run."
Why is there a difference in the data processed (and billed :) ) between these two cases?
When you create a partitioned table in BigQuery, your charges are based on how much data is stored in the partitions and on the queries you run against that data [1]. Many partitioned-table operations are free, and so is querying _PARTITIONTIME [2]: it is a pseudo-column rather than a column stored in the table, so selecting only it processes 0 bytes. my_partition_field, on the other hand, is a regular column of the table, so selecting it is billed like any other column, even though it is the partitioning column. Beyond that there is no difference in how the data is processed; the data in the two tables may simply differ because a time-unit column-partitioned table is partitioned on a TIMESTAMP, DATE or DATETIME column in the table, whereas ingestion-time tables are partitioned based on the timestamp at which BigQuery ingests the data.
[1] https://cloud.google.com/bigquery/docs/partitioned-tables#pricing
[2] https://cloud.google.com/bigquery/pricing#free

Is it possible to set expiration time for records in BigQuery

Is it possible to set a time to live for a column in BigQuery?
If the table has two columns, payment_details and timestamp, the data in the BigQuery table should be deleted automatically whenever current time minus timestamp is greater than 90 days.
Solution 1:
BigQuery has a partition expiration feature. You can leverage that for your use case.
Essentially you need to create a partitioned table, and set the partition_expiration_days option to 90 days.
CREATE TABLE
  mydataset.newtable (transaction_id INT64, transaction_date DATE)
PARTITION BY
  transaction_date
OPTIONS (
  partition_expiration_days = 90
)
or if you have a table partitioned already by the right column
ALTER TABLE mydataset.mytable
SET OPTIONS (
  -- Sets partition expiration to 90 days
  partition_expiration_days = 90
)
When a partition expires, BigQuery deletes the data in that partition.
Solution 2:
You can set up a Scheduled Query that prunes data older than 90 days on an hourly or daily basis. By writing a DELETE query you have more control and can combine other business logic, for example deleting only duplicate rows while keeping the most recent entry even if it is older than 90 days.
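A minimal sketch of such a pruning query, assuming the timestamp column from the question and a placeholder table name:
-- Prune rows older than 90 days; the table name is a placeholder.
DELETE FROM `mydataset.payment_table`
WHERE `timestamp` < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY);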
Solution 3:
If you have a larger business process that does the 90-day pruning based on other external factors, such as an API response or conditional evaluation, you can leverage Cloud Workflows to build a workflow and invoke it regularly to automate the pruning of your data. See the Automate the execution of BigQuery queries with Cloud Workflows article, which can guide you through this.

Specify the partition # based on date range for that pkey value

We have a DW query that needs to extract data from a very large table, around 10 TB, which is partitioned by a datetime column (let's say time) so that data can be purged based on this column every day. So my understanding is that each partition holds about a day's worth of data. From the Storage tab in the SSMS GUI I see the number of partitions is 1995.
There is no clustered index on this table, as it is mostly intended for write operations. It's just the vendor's design.
SELECT
a.*
FROM dbo.VLTB AS a
CROSS APPLY
(
VALUES($PARTITION.a_func(a.time))
) AS c (pid)
WHERE c.pid = 1896;
The query currently being submitted is:
SELECT * from dbo.VLTB
WHERE time >= convert(datetime,'20210601',112)
AND time < convert(datetime,'20210602',112)
So replacing the inequality predicates with an equality on the partition number, to look only in that day's partition, might help. Users can control the dates they send via the app, but how will they manage if we want them to use a partition number, as in the first query?
Question
How do I find a way, in the query above, to determine the partition number for a given day rather than hard-coding it (for 06/01 I had to supply partition number 1896)? Is there a better way to have the script find the partition number, so that not all partitions are scanned, and insert the correct number into the WHERE clause?
Thank you
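One way to avoid hard-coding the partition number would be to let SQL Server compute it from the date itself, since $PARTITION accepts a literal as well as a column. A sketch, assuming the partition function is really named a_func as in the query above:
-- Compute the partition number for the requested day, then reuse the
-- CROSS APPLY pattern from above.
DECLARE @pid int = $PARTITION.a_func(CONVERT(datetime, '20210601', 112));

SELECT
a.*
FROM dbo.VLTB AS a
CROSS APPLY
(
VALUES($PARTITION.a_func(a.time))
) AS c (pid)
WHERE c.pid = @pid;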

BigQuery: querying the latest partition, bytes to be processed vs. actually processed

I'm struggling to query the last partition of a table efficiently, using a date or datetime field. My first approach was to filter like this:
SELECT *
FROM my_table
WHERE observation_date = (SELECT MAX(observation_date) FROM my_table)
But that, according to BigQuery's processing estimate, scans the entire table and does not use the partitions. Google even states in its documentation that this happens. It does work if I use an exact value for the partition:
SELECT *
FROM my_table
WHERE observation_date = CURRENT_DATE
But if the table is not up to date, then the query will not return any results and my automated processes will fail. If I include an offset like observation_date = DATE_SUB(CURRENT_DATE, INTERVAL 2 DAY), I will likely miss the latest partition.
What is the best practice to get the latest partition efficiently?
What makes this worse is that BigQuery's estimate of the bytes to be processed for the active query does not match what was actually processed, unless I'm misinterpreting those numbers. Find below a screenshot of the mismatching values.
BigQuery screen with apparently mismatching processed bytes
Finally, a couple of scenarios that I also tested:
If I first store a max_date with a DECLARE statement, as suggested in this post, the estimate seems to work, but it is not clear why (sketched after these scenarios). However, the bytes actually processed after running the query are no different from the case that filters the latest partition in the WHERE clause.
Using the same declared max_date on a table that is both partitioned and clustered, the estimate works only when filtering on the partition, and fails if I also include a filter on the clustering column.
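For reference, the DECLARE workaround mentioned in the first scenario looks roughly like this; a sketch using the question's table and column names, the point being that the variable is resolved before the SELECT runs, so the partition filter is a constant by the time the query executes:
-- Sketch of the scripted workaround; my_table and observation_date are from the question.
DECLARE max_date DATE DEFAULT (SELECT MAX(observation_date) FROM my_table);

SELECT *
FROM my_table
WHERE observation_date = max_date;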
After some iterations I got an answer from Google and, although it doesn't resolve the issue, it acknowledges that it happens.
Tables partitioned with DATE or DATETIME fields cannot be efficiently queried for their latest partition. The best practice remains to filter with something like WHERE observation_date = (SELECT MAX(observation_date) FROM my_table) and that will scan the whole table.
They made notes to try to improve this in the future, but we have to deal with it for now. I hope this helps somebody who was trying to do the same thing.