I'm running a query job on a large dataset in BigQuery. The job results are stored in a destinationTable. I want the tables to expire either within 1 day or 1 hour (historical data vs. today's data).
Is there an option to set expirationTime on each table?
I am aware that I can set a defaultExpirationTime on the entire dataset, but since I have different expiration times, this is not an ideal solution.
Check the table's expirationTime property:
expirationTime (long, optional): The time when this table expires, in milliseconds since the epoch. If not present, the table will persist indefinitely. Expired tables will be deleted and their storage reclaimed.
You need to set it using the tables.patch API after the table is created or updated (depending on your logic).
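If DDL is more convenient than calling the API, the same property can also be set per table once the destination table exists. This is only a sketch; the project, dataset, and table names below are placeholders:

-- Today's data: expire the destination table one hour from now
ALTER TABLE `my_project.my_dataset.todays_results`
SET OPTIONS (
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
);

-- Historical data: expire one day from now
ALTER TABLE `my_project.my_dataset.historical_results`
SET OPTIONS (
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
);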
I see in https://stackoverflow.com/a/49105008/6191970 that partitions which have passed their expiration still take some unknown amount of time to be deleted, though they are no longer included in queries after expiration. I experimented with setting the partition expiration on a table which is partitioned hourly, like so:
ALTER TABLE `my.table`
SET OPTIONS ( partition_expiration_days=.1 )
I was surprised that even after a few hours, when I set the expiration back to its original limit of 90 days, all of the data was still there.
Is there any way to force deletion specifically of all expired partitions?
If not, what time frame should I expect for this data to be cleared out?
This is a sensitive data security problem for my use case where we do not want old data to exist.
BigQuery retains data for 7 days; officially it offers this as a time travel feature. Factor that into your organization's policy.
https://cloud.google.com/bigquery/docs/time-travel
I need to copy the production dataset in BigQuery into the testing environment and use it to simulate the pipeline processing with new changes.
However, the production dataset is huge, so I usually want to keep only its most recent data for testing.
To do that, I would like to truncate all partitioned data that is older than 30 days in my dataset.
I tried setting the partition expiration at the dataset level, but it doesn't work.
So how could I do that?
I did some tests about this and confirmed the following:
When you set a default partition expiration at the dataset level, it only applies to new tables.
For existing partitioned tables, you need to set the partition expiration at the individual table level to expire their partitions.
For example:
ALTER TABLE `gcp_A.dataset_1.measurements`
SET OPTIONS (
-- Sets partition expiration to 30 days
partition_expiration_days=30
);
select min(stamp) from `gcp_A.dataset_1.measurements`
-- [result]
-- 2021-06-15 00:00:00 UTC
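For comparison, the dataset-level default (which, as noted, only applies to tables created after it is set) can be configured with a similar statement. This is just a sketch reusing the project and dataset names from the example above:

-- Applies only to tables created in the dataset from this point on
ALTER SCHEMA `gcp_A.dataset_1`
SET OPTIONS (
  default_partition_expiration_days = 30
);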
I am using the following query to populate my fact table:
Select sh.isbn_l, sh.id_c, sh.id_s, sh.data, sh.quantity, b.price
from Book as b
inner join Sales as sh
on b.isbn = sh.isbn_l
The main thing is that I want to load the table from a specific time to a specific time. So if I load today, I will get all the records from the last time I loaded up until today.
And if I load it the day after tomorrow, I will get the data from after today's load time up until the day after tomorrow.
What I mean is NO DUPLICATED ROWS or DATA. What should I do?
Any ideas, please?
Thank you in advance.
Streams (and maybe Tasks) are your friend here.
A Snowflake Stream records the delta of change data capture (CDC) information for a table (such as a staging table), including inserts and other DML changes. A stream allows querying and consuming a set of changes to a table, at the row level, between two transactional points of time.
In a continuous data pipeline, table streams record when staging tables and any downstream tables are populated with data from business applications using continuous data loading and are ready for further processing using SQL statements.
Snowflake Tasks may optionally use table streams to provide a convenient way to continuously process new or changed data. A task can transform new or changed rows that a stream surfaces. Each time a task is scheduled to run, it can verify whether a stream contains change data for a table (using SYSTEM$STREAM_HAS_DATA) and either consume the change data or skip the current run if no change data exists.
Users can define a simple tree-like structure of tasks that executes consecutive SQL statements to process data and move it to various destination tables.
https://docs.snowflake.com/en/user-guide/data-pipelines-intro.html
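As a rough sketch of how the pieces fit together for the query above (the fact table name, warehouse, and schedule are assumptions, not something from your setup):

-- Stream that records new rows landing in the Sales table
create or replace stream sales_stream on table Sales;

-- Task that loads only the new rows into the fact table once per hour,
-- skipping the run entirely when the stream has no change data
create or replace task load_fact_sales
  warehouse = my_wh
  schedule = '60 minute'
  when system$stream_has_data('SALES_STREAM')
as
  insert into fact_sales (isbn_l, id_c, id_s, data, quantity, price)
  select sh.isbn_l, sh.id_c, sh.id_s, sh.data, sh.quantity, b.price
  from sales_stream sh
  inner join Book b on b.isbn = sh.isbn_l;

-- Tasks are created suspended; resume the task to start the schedule
alter task load_fact_sales resume;

Because the insert reads from the stream, each run consumes the stream's offset, so each row is loaded exactly once and no duplicates are produced.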
I am checking the feasibility of moving from Redshift to BigQuery. I need help in implementing the below use case on BigQuery.
We have a by day product performance table which is a date partitioned table. It is called product_performance_by_day. There is a row for every product that was sold each day. Every day we process the data at the end of the day and put it in the partition for that day. Then we aggregate this by day performance data over the last 30 days and put it in the table called product_performance_last30days. This aggregation saves querying time and in the case of BigQuery will save the cost as well since it will scan less data.
Here is how we do it in Redshift currently -
We put the aggregated data in a new table e.g. product_performance_last30days_temp. Then drop the product_performance_last30days table and rename product_performance_last30days_temp to product_performance_last30days. So there is very minimal downtime for product_performance_last30days table.
How can we do the same thing in the BigQuery?
Currently, BigQuery does not support renaming tables, materialized views, or table aliases. And since we want to save the aggregated data in the same table every day, we cannot use a destination table if the table is not empty.
You can overwrite the same table by using writeDisposition, which specifies the action that occurs if the destination table already exists.
The following values are supported:
WRITE_TRUNCATE: If the table already exists, BigQuery overwrites the table data.
WRITE_APPEND: If the table already exists, BigQuery appends the data to the table.
WRITE_EMPTY: If the table already exists and contains data, a 'duplicate' error is returned in the job result.
The default value is WRITE_EMPTY.
Each action is atomic and only occurs if BigQuery is able to complete the job successfully. Creation, truncation and append actions occur as one atomic update upon job completion.
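If you drive the aggregation with SQL (for example from a scheduled query) rather than the jobs API, a DDL alternative with the same effect as WRITE_TRUNCATE is CREATE OR REPLACE TABLE, which rebuilds the rollup in place without any window where the table is missing. The sketch below reuses the table names from the question, but the dataset name and the product_id, units_sold, revenue, and sale_date columns are assumptions:

-- Atomically replaces the contents of the 30-day rollup
CREATE OR REPLACE TABLE `mydataset.product_performance_last30days` AS
SELECT
  product_id,
  SUM(units_sold) AS units_sold,
  SUM(revenue)    AS revenue
FROM `mydataset.product_performance_by_day`
WHERE sale_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY product_id;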
For RENAMING tables, look at this answer.
In Cassandra I am used to the USING TTL clause of upserts which sets a number of seconds after which the upserted data will be deleted.
Does Oracle have a feature like this? I haven't been able to find anything about it.
There are ways to implement this feature, but I don't believe it is built in. The easiest way is to have a CreatedAt column in the table that specifies when a row has been inserted. Then, create a view to get the most recent rows, so for the most recent day:
create view v_my_table as
select t.*
from my_table t
where t.CreatedAt >= sysdate - 1;
This fixes the data access side. To actually delete the rows, you need an additional job to periodically delete old rows in the table.
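A hedged sketch of such a cleanup job using DBMS_SCHEDULER, keyed to the same one-day cutoff as the view (the job and table names are placeholders):

begin
  dbms_scheduler.create_job(
    job_name        => 'purge_expired_rows',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'begin delete from my_table where CreatedAt < sysdate - 1; commit; end;',
    start_date      => systimestamp,
    repeat_interval => 'FREQ=DAILY',
    enabled         => TRUE
  );
end;
/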
I think it depends on what you mean by "Oracle".
Oracle relational database does not have it out of the box. We have usually scheduled a stored procedure to perform such tasks.
Oracle NoSQL database has it:
Time to Live (TTL) is a mechanism that allows you to automatically expire table rows. TTL is expressed as the amount of time data is allowed to live in the store. Data which has reached its expiration timeout value can no longer be retrieved, and will not appear in any store statistics. Whether the data is physically removed from the store is determined by an internal mechanism that is not user-controllable.
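For example, in Oracle NoSQL the default TTL can be declared when the table is created; this is only a sketch with hypothetical table and column names:

CREATE TABLE session_data (
  session_id INTEGER,
  payload    STRING,
  PRIMARY KEY (session_id)
) USING TTL 5 days;

Per-row overrides of the table default are also possible through the client APIs when writing rows.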