How can I force expired BigQuery partitions to be deleted immediately? - google-bigquery

I see in https://stackoverflow.com/a/49105008/6191970 that partitions whose expiration has passed still take some unknown amount of time to be deleted, though they are no longer included in queries after expiration. I experimented with setting the partition expiration on a table that is partitioned hourly, like so:
ALTER TABLE `my.table`
SET OPTIONS ( partition_expiration_days=.1 )
And I was surprised that even after a few hours, and after setting the expiration back to its original limit of 90 days, all of the data was still there.
Is there any way to force deletion specifically of all expired partitions?
If not, what time frame is to be expected for this data to clear out?
This is a sensitive data security problem for my use case where we do not want old data to exist.

BigQuery retains deleted and expired data for up to 7 days through its time travel feature, so expired partitions are not physically removed right away. Factor that retention window into your organization's data policy.
https://cloud.google.com/bigquery/docs/time-travel
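If the goal is to make the data unrecoverable as soon as possible, a hedged sketch of two things you can combine is below, using the table name from the question and a hypothetical dataset name my_dataset: delete the expired rows yourself with DML (this assumes the table is partitioned on ingestion time, i.e. it has a _PARTITIONTIME pseudo-column), and shrink the dataset's time travel window to its 48-hour minimum so deleted data falls out of retention sooner.
-- Explicitly delete rows in partitions older than the intended expiration.
DELETE FROM `my.table`
WHERE _PARTITIONTIME < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 3 HOUR);

-- Reduce how long deleted data stays recoverable via time travel (48 hours is the minimum the option accepts).
ALTER SCHEMA `my_dataset`
SET OPTIONS ( max_time_travel_hours = 48 );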

Related

Determining inactive records in a table

In this scenario, "inactive" can generally refer to data that has not been accessed in the last month by users from the web server.
Knowing the "inactive" status of records can be used to optimize queries for the active data as the database table grows larger.
I know one approach could be to:
1. Update a record with a last_accessed timestamp each time it is accessed.
2. Monthly, when there is low traffic, have the web server tell the database to update an inactive flag for records that have or have not been accessed in the past month.
But two major issues with this approach are:
1. Updating when the client is just trying to select data has a performance impact.
2. If there are too many records, the monthly update may take too long and cause issues, like locking rows.
Wondering what a better, or alternative approach could be.
Here is one approach.
You could write a query that essentially checks whether last_accessed_date falls within the last 30 days (e.g. CASE WHEN last_accessed_date < SYSDATE - 30) and creates an is_active indicator. That would essentially allow you to mark the historic records as active or inactive.
Then, once you have this done, you would need to run this script routinely (daily, weekly, or monthly) to check the status of these items. Monthly might be a good idea, and do it during a non-business hour so it does not affect performance much (Saturday mornings at 3:00AM). I'm sure you could schedule this with your team and also have your communications team send out a notice to the end users that they may see usability delays during this timeframe (first Saturday of the month from 3:00 AM - 6:00 AM or something).
Additionally, you could add a second condition. Whenever someone accesses a record, a small logic check could say: "Change the last_accessed_date to today. If is_active is currently No, switch it to Yes." That would keep your database more up to date.
One last piece, for optimization: if you choose to include the second option (the logic check), you could add a Last_Updated_Indicator field holding the date the indicator was last changed. If a record's indicator was updated more recently than the last run of your full database update, that record can be skipped. This would drastically cut down on the performance impact of that update procedure.
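A rough sketch of that monthly pass in Oracle-flavored SQL, with hypothetical names (a records table with last_accessed_date, is_active, and last_updated_indicator columns):
-- Monthly maintenance: mark rows untouched for 30+ days as inactive,
-- skipping rows whose indicator was already refreshed since the last run.
UPDATE records
SET    is_active              = CASE WHEN last_accessed_date < SYSDATE - 30 THEN 'N' ELSE 'Y' END,
       last_updated_indicator = SYSDATE
WHERE  last_updated_indicator IS NULL
   OR  last_updated_indicator < SYSDATE - 30;
COMMIT;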

Does Oracle have a time-to-live feature for rows?

In Cassandra I am used to the USING TTL clause of upserts which sets a number of seconds after which the upserted data will be deleted.
Does Oracle have a feature like this? I haven't been able to find anything about it.
There are ways to implement this feature, but I don't believe it is built in. The easiest way is to have a CreatedAt column in the table that specifies when a row has been inserted. Then, create a view to get the most recent rows, so for the most recent day:
create view v_table as
select t.*
from my_table t  -- "my_table" stands in for your table name; "table" itself is a reserved word
where t.CreatedAt >= sysdate - 1;
This fixes the data access side. To actually delete the rows, you need an additional job to periodically delete old rows in the table.
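A minimal sketch of that cleanup job using DBMS_SCHEDULER, assuming the table behind the view above is named my_table and rows older than a day should be removed:
BEGIN
  DBMS_SCHEDULER.CREATE_JOB(
    job_name        => 'purge_old_rows',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'BEGIN DELETE FROM my_table WHERE CreatedAt < SYSDATE - 1; COMMIT; END;',
    repeat_interval => 'FREQ=DAILY;BYHOUR=3',  -- run nightly at 3 AM
    enabled         => TRUE);
END;
/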
I think it depends on what you mean by "Oracle".
Oracle relational database does not have it out of the box. We have usually scheduled a stored procedure to perform such tasks.
Oracle NoSQL database has it:
Time to Live (TTL) is a mechanism that allows you to automatically expire table rows. TTL is expressed as the amount of time data is allowed to live in the store. Data which has reached its expiration timeout value can no longer be retrieved, and will not appear in any store statistics. Whether the data is physically removed from the store is determined by an internal mechanism that is not user-controllable.
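For completeness, in Oracle NoSQL the TTL is declared in the DDL itself, so rows expire roughly five days after they are written or updated; a small sketch with a made-up table:
CREATE TABLE sensor_readings (
  id      INTEGER,
  payload JSON,
  PRIMARY KEY (id)
) USING TTL 5 DAYS;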

BigQuery: Check for duplications during stream

We have some data generated by our devices installed on the clients' side. Duplicate data exists by design, which means we are not able to eliminate duplicates in the data-generating phase. We are now looking into the possibility of avoiding duplication while streaming into BigQuery (rather than cleaning the data later by doing a table copy and delete). That is to say, for every record that is ready to be streamed, we first check whether it is already in BigQuery; if not, we stream it in, and if it already exists, we skip it.
But here's the concern (quoted from https://developers.google.com/bigquery/streaming-data-into-bigquery):
Data availability
The first time a streaming insert occurs, the streamed data is inaccessible for a warm-up period of up to two minutes. After the warm-up period, all streamed data added during and after the warm-up period is immediately queryable. After several hours of inactivity, the warm-up period will occur again during the next insert.
Data can take up to 90 minutes to become available for copy and export operations.
Our data will go into different BigQuery tables (the table name is dynamically generated from the data's date_time). What does "the first time a streaming insert occurs" mean? Is it per table?
Does the above doc mean that we cannot rely on the query result to check for duplications in the process of streaming?
If you provide an insert id, BigQuery will automatically do the de-duplication for you, as long as the duplicates are within the de-duplication window. The official docs don't mention how long the de-duplication window is, but it is generally from 5 minutes to 90 minutes (if you write data very quickly to a table, it will be closer to 5 than 90, but if data is trickled in, it will last longer in the de-duplication buffers).
Regarding "the first time a streaming insert occurs", this is per table. If you have a new table and start streaming to it, it may take a few minutes for that data to be available for querying. Once you've started streaming, however, new data will be available immediately.

The order of records in a regularly updated BigQuery database

I am going to be maintaining a local copy of a database on BigQuery. I will be using the API and tabledata:list. This database is not my own, and is regularly updated by the maintainers by appending new data (say, every hour).
First, can I assume that when this data is appended, it will definitely be added to the end of the database?
Now, let's assume that the database currently has 1,000,000 rows and I am downloading all of these by paging through tabledata:list. Also, let's assume that the database is updated partway through (with 10,000 new rows). By using the page tokens, can I be assured that I will only download the 1M rows that were present when I started, in the order they appear in the database?
Finally, now let's say that I come to update my copy. If I initiate the tabledata:list with a startIndex of 1,000,000 and I use a maxResults of 1000, will I get 10 pages containing the updated data that I am expecting?
I suppose all these questions boil down to whether BigQuery respects the order the data is in, whether this order is used by tabledata:list, and whether appended data is guaranteed to follow previous data.
As there is a column whose values are unique, and I can perform a simple select count(1) from table to get the length of the table, I can of course check that my local copy is complete by comparing the length of my local DB with that of the remote one. However, if the above weren't guaranteed and I ended up with holes in my data, it would be quite impractical to remedy, as the primary key is not sequential (otherwise I could just fill in the missing rows) and the database is very large.
When you append data, we will append it to the end of the table data list. However, BigQuery may periodically coalesce data, which does not respect ordering. We have been discussing being able to preserve the ordering, or at least having a way of accessing the most recent data, but this is not yet implemented or designed. If it is an important feature for you, let us know and we'll prioritize it accordingly.
If you use page tokens, you are assured of a stable listing. If the table gets updated in the middle of paging through the data, you'll still only see the data that was in the table when you created the page token. Note that because of this, page tokens are only valid for 24 hours.
This should work as long as no coalesce has occurred since you last updated the table.
You can get the number of rows in the table by calling tables.get, which is usually simpler and faster than running a query.

ALTER PARTITION FUNCTION to include 1.5TB worth of data for a quick switch

I inherited an unmaintained database in which the partition function was set on a date field and expired on the first of the year. The data is largely historic, and I can control the jobs that import new data into this table.
My question relates to setting up or altering partitioning to include so much data, roughly 1.5 TB counting indexes. This is on a live system, and I don't know what kind of impact it will have with so many users connecting to it at once. I will test this on a non-prod system, but then I can't get real usage load on there. My alternative solution was to kill all the users hitting the DB and quickly rename the table, swapping in a table that does have a proper partitioning scheme.
I wanted to:
- Keep the same partition function but extend it so that all 2011 data up to a certain date (let's say Nov 22nd 2011) stays on one partition, and all data coming in after that gets put into its own new partitions.
- Do a quick switch of the specific partition which holds the full year's worth of data.
Does anyone know if altering a partition on a live system to include a new partition for a full year's worth of data, roughly 5-6 billion records and 1.5 TB, is plausible? Any pitfalls? I will share my test results once I complete them, but I want any input. Thanks!
A partition switch is a metadata-only operation, and the size of the partition being switched in or out does not matter; it can be 1 KB or 1 TB, and it takes exactly the same amount of time (i.e. it is very fast).
However, what you're describing is not a partition switch operation but a partition split: you want to split the last partition of the table into two partitions, one containing all the existing data and a new, empty one. Splitting a partition has to split the data, and unfortunately this is an offline, size-of-data operation.
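For reference, the split itself is just two DDL statements; a sketch with hypothetical object names (pf_ImportDate / ps_ImportDate), keeping in mind the caveat above that splitting a partition that already contains data moves that data and is effectively offline:
-- Tell the partition scheme which filegroup the new partition should use.
ALTER PARTITION SCHEME ps_ImportDate NEXT USED [PRIMARY];

-- Add a new boundary so rows after Nov 22nd 2011 land in their own partition.
ALTER PARTITION FUNCTION pf_ImportDate() SPLIT RANGE ('2011-11-22');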