Is it possible to set expiration time for records in BigQuery - google-bigquery

Is it possible to set a time to live for a column in BigQuery?
If a table has the columns payment_details and timestamp, the row should be deleted automatically from the BigQuery table once the current time minus the timestamp is greater than 90 days.

Solution 1:
BigQuery has a partition expiration feature. You can leverage that for your use case.
Essentially you need to create a partitioned table, and set the partition_expiration_days option to 90 days.
CREATE TABLE mydataset.newtable (
  transaction_id INT64,
  transaction_date DATE
)
PARTITION BY transaction_date
OPTIONS (
  partition_expiration_days = 90
);
Or, if you already have a table partitioned by the right column:
ALTER TABLE mydataset.mytable
SET OPTIONS (
  -- Sets partition expiration to 90 days
  partition_expiration_days = 90
);
When a partition expires, BigQuery deletes the data in that partition.
Solution 2:
You can set up a scheduled query that prunes, hourly or daily, any data older than 90 days. By writing a DELETE query you get more control and can combine other business logic, for example deleting only duplicate rows while keeping the most recent entry even if it is older than 90 days.
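A minimal sketch of such a scheduled DELETE, assuming a placeholder table mydataset.payments with a TIMESTAMP column ts (both names are assumptions, not from the question):
DELETE FROM mydataset.payments
-- Prune everything older than 90 days on each scheduled run
WHERE ts < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY);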
Solution 3:
If you have a larger business process that does the 90-day pruning based on other external factors, such as an API response or conditional evaluation, you can leverage Cloud Workflows to build a workflow and invoke it regularly to automate the pruning of your data. See the article Automate the execution of BigQuery queries with Cloud Workflows, which can guide you through this.

Related

Get the most recent Timestamp value

I have a pipeline which reads from a BigQuery table, performs some processing on the data and saves it into a new BigQuery table. This is a batch process performed on a weekly basis through a cron. Entries keep being added to the source table, so whenever I start the ETL process I want it to process only the new rows that have been added since the last time the ETL job was launched.
In order to achieve this, I have thought about querying my sink table for the most recent timestamp it contains. Then, as a data source, I will run another query against the source table, filtering for entries with a timestamp higher than the one I have just recovered. Both my source and sink tables are time-partitioned.
The query I am using for getting the latest entry on my sink table is the following one:
SELECT Timestamp
FROM `myproject.mydataset.mytable`
ORDER BY Timestamp DESC
LIMIT 1
It gives me the correct value, but I feel like it is not the most efficient way of querying it. Does this query take advantage of my table being partitioned? Is there a better way of retrieving the most recent timestamp from my table?
I'm going to refer to the timestamp field as ts_field for your example.
To get the latest timestamp, I would run the following query:
SELECT max(ts_field)
FROM `myproject.mydataset.mytable`
If your table is also partitioned on the timestamp field, you can do something like this to scan even fewer bytes:
SELECT max(ts_field)
FROM `myproject.mydataset.mytable`
WHERE date(ts_field) = current_date()
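One caveat: if no rows have arrived today, the query above returns NULL. A hedged variation that widens the lookback window (the 7-day value is an assumption) while still pruning partitions:
SELECT max(ts_field)
FROM `myproject.mydataset.mytable`
-- Assumed lookback; only the last 7 days of partitions are scanned
WHERE date(ts_field) >= date_sub(current_date(), interval 7 day)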

Specify the partition # based on date range for that pkey value

We have a DW query that needs to extract data from a very large table, around 10 TB, which is partitioned by a datetime column (let's say time) so data can be purged based on this column every day. So my understanding is that each partition holds a day's worth of data. From the storage tab (SSMS GUI) I see the number of partitions is 1995.
There is no clustered index on this table, as it is mostly intended for write operations. That is just the vendor's design.
SELECT a.*
FROM dbo.VLTB AS a
CROSS APPLY
(
    VALUES ($PARTITION.a_func(a.time))
) AS c (pid)
WHERE c.pid = 1896;
The query currently submitted is:
SELECT * from dbo.VLTB
WHERE time >= convert(datetime,'20210601',112)
AND time < convert(datetime,'20210602',112)
So replacing the inequality predicates with an equality that targets that day's specific partition might help. Users can control the dates sent via the app, but how will they manage if we want them to use the partition number, as in the first query?
Question
How do I find a way, in the above query, to determine the partition number for that day rather than hard-coding it (for 06/01 I had to supply partition number 1896)? Is there a better way to have the script find the partition number, so that not all partitions are scanned, and insert the correct partition number into the WHERE clause?
Thank you
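A minimal T-SQL sketch of that idea, assuming a_func is the partition function applied to the time column: $PARTITION can compute the partition number from a date variable, so nobody has to hard-code it. Whether the optimizer eliminates partitions as aggressively as a literal number is worth verifying in the actual execution plan.
-- Let SQL Server map the requested day to its partition number
DECLARE @day datetime = CONVERT(datetime, '20210601', 112);

SELECT a.*
FROM dbo.VLTB AS a
WHERE $PARTITION.a_func(a.time) = $PARTITION.a_func(@day);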

BigQuery - Create view with Partition but base table doesn't have

This may sound crazy, but I want to implement something like a view with a partition.
Background:
I had a table with a date partition on a column, and the table is really huge in size. We are running data ingestion into this table at a 2-minute interval. All the data loads are append-only, and every load inserts 10k+ rows. After some time, we encountered the partition limitation issue.
message: "Quota exceeded: Your table exceeded quota for Number of partition modifications to a column partitioned table. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors"
Root cause (from the GCP support team):
The root cause under the hood was that, because your partitioned tables have a pretty granular partition (for instance by minute, hour or date), when the loaded data covers a wide range of partition periods, the number of partitions modified gets high and goes above 4000. As per internal documentation, users who run into this issue are advised to consider a less granular partition, for instance changing a date/hour/minute-based partitioned table to a week-based partitioned table. Alternatively, split the load into multiple loads and hence limit the data range to cover fewer partitions. This is the best recommendation I could have now.
So I'm planning to keep this table un-partitioned and create a view (we need a view for eliminating duplicates), and that view should have a partition. Is this possible? Or is there any other alternative solution for this?
You can't partition a view; it's not physically materialized. Partitioning by day can be limiting with the 4000-partition cap. Would year work? Then you can use an integer partition:
CREATE OR REPLACE TABLE BI.test
PARTITION BY RANGE_BUCKET(Year, GENERATE_ARRAY(2000, 3000, 1)) AS
SELECT 2000 AS Year, 1 AS value
UNION ALL
SELECT 2001 AS Year, 1 AS value
UNION ALL
SELECT 2002 AS Year, 1 AS value
Alternatively, I've used month (YYYYMM) or week (YYYYWW) as an integer field to partition by, which gets you around 40 years:
RANGE_BUCKET(monthasintegerfield, GENERATE_ARRAY(201612, 205712, 1))
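A fuller sketch of the month-based variant; the table name BI.test_monthly and the column month_id are placeholders, and the array bounds simply mirror the fragment above:
CREATE OR REPLACE TABLE BI.test_monthly
PARTITION BY RANGE_BUCKET(month_id, GENERATE_ARRAY(201612, 205712, 1)) AS
-- YYYYMM derived from a date, e.g. 2021-06-01 becomes 202106
SELECT CAST(FORMAT_DATE('%Y%m', DATE '2021-06-01') AS INT64) AS month_id,
       1 AS value;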

ClickHouse TTL on materialized column

I am trying to upgrade the ClickHouse cluster from version 18.8 to 19.9.2. Previously, I had a cron job that deleted old data from the database. I want to start using the TTL feature instead.
Simplified table definition:
CREATE TABLE myTimeseries (
    timestamp_ns Int64,
    source_id String,
    data String,
    date Date MATERIALIZED toDate(timestamp_ns / 1e9),
    time DateTime MATERIALIZED toDateTime(timestamp_ns / 1e9)
)
ENGINE = MergeTree()
PARTITION BY (source_id, toStartOfHour(time))
TTL date + toIntervalDay(7)
SETTINGS index_granularity = 8192, merge_with_ttl_timeout = 43200
The problem is, it does not delete old data. I could not find anything in the documentation that would help debug this issue.
Questions:
How can I debug this issue? (Is there a way to see when the data will be cleared in the future?)
Might this be because the date field is materialized? I have another table where date is not a materialized field and everything works fine.
Yes, you can use materialized fields with the TTL feature.
I've attached a simple query that creates a table with a 5-minute TTL.
It works fine with ClickHouse server version 20.4.5.
CREATE TABLE IF NOT EXISTS test.profiling
(
    headtime UInt64,
    date DateTime MATERIALIZED toDateTime(headtime),
    id Int64,
    operation_name String,
    duration Int64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (date, id)
TTL date + INTERVAL 5 MINUTE
And an important note from the ClickHouse documentation:
Data with an expired TTL is removed when ClickHouse merges data parts.
When ClickHouse sees that data is expired, it performs an off-schedule
merge. To control the frequency of such merges, you can set
merge_with_ttl_timeout. If the value is too low, it will perform many
off-schedule merges that may consume a lot of resources.
If you perform the SELECT query between merges, you may get expired
data. To avoid it, use the OPTIMIZE query before SELECT.
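To verify that expired rows actually disappear, one option (assuming the test.profiling table above) is to force the merge by hand before reading:
-- Force an off-schedule merge so expired rows are dropped before reading
OPTIMIZE TABLE test.profiling FINAL;
SELECT count() FROM test.profiling;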

Insert to clustered hive table from spark

I'm trying to do some performance optimization on the data storage. The idea is to use Hive's bucketing/clustering to bucket the available devices (based on the column id). My current approach is inserting data from an external table based on Parquet files into the bucketed table. As a result, the bucketing is applied.
INSERT INTO TABLE bucketed_table PARTITION (year, month, day)
SELECT id, feature, value, year, month, day
FROM parquet_table ;
I would like to get rid of this intermediate step by ingesting the data directly into that table from PySpark 2.1.
Executing the same statement using Spark SQL leads to different results. Adding the CLUSTER BY clause
INSERT INTO TABLE bucketed_table PARTITION (year, month, day)
SELECT id, feature, value, year, month, day
FROM parquet_table cluster by id ;
still leads to different output files.
This leads to two questions:
1) What is the right way to insert into a clustered Hive table from Spark?
2) Does writing with the CLUSTER BY clause enable the benefits of the Hive metastore on the data?
I don't believe it's supported as of yet. I'm currently using Spark 2.3 and it fails, rather than succeeding and corrupting your data store.
Check out the JIRA ticket here if you want to track its progress.