I have a table in which rows must expire after a month or so. After a row is created, it can be updated. I was curious whether an update would bypass the TTL so that the row exists forever, since the update does not specify any TTL. This is what the docs say:
https://docs.scylladb.com/stable/cql/time-to-live.html
Here a TTL of 10 minutes is applied to all rows, however, keep in mind
that TTL is stored on a per column level for non-primary key columns.
I don't understand what this means. What does setting a TTL mean for primary key columns? I just want my rows to be deleted after a certain time.
My question is: will an insert/update without a TTL override the table's TTL? Will setting a TTL simply delete the row after a certain time? Will the TTL reset for the row after an update statement (without a TTL)?
Thanks.
This is an interesting question, and answering it accurately requires some subtleties.
First you need to know that in CQL, the expiration time is stored together with each cell (column value), not for an entire row. So if you INSERT a row with a=1, b=2 and c=3 with some TTL X, the TTL isn't stored once for the entire row - rather, each of the three columns will be stored with TTL X separately. If you later UPDATE only column a with TTL Y, you'll now have the a value expiring at one time and b and c expiring at a different time (this is fine for CQL - you are allowed to have a row where some of its columns are undefined (null)). If you update a column without specifying a TTL, the new value will not remember the previous TTL: it gets the table's default_time_to_live if one is configured, and otherwise it never expires.
You may be asking yourself why updating one column (or all of the columns) of an existing row doesn't just keep the previous TTL value. Well, this is because one of the design goals of Scylla (and Cassandra) is to make writes as fast as possible. Scylla does writes fast precisely because it does not need to read the old value first. When you update x=1, Scylla just writes down that update ("mutation") and doesn't need to read the previous value. Scylla only needs to reconcile the different versions of the value (the one with the highest timestamp wins) during a read or in a special "compaction" step that happens periodically. With this in mind, when you set x=1 with TTL y (or with no TTL at all), this will be the new value of this column - the older TTL value isn't available during this update.
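To make the "highest timestamp wins" reconciliation concrete, here is a toy Python model. It is only a sketch and not how Scylla is implemented: the function name reconcile, the tuple layout, and the integer clock are all assumptions for illustration, and it assumes the table has no default TTL.

```python
import math


def reconcile(mutations, now):
    """Return the live column values at time `now`.

    Each mutation is (column, value, write_ts, ttl); ttl=None means no expiration.
    """
    latest = {}
    for column, value, write_ts, ttl in mutations:
        # Last write wins: only the highest timestamp per column survives,
        # and it brings its own expiration with it (the old TTL is never consulted).
        if column not in latest or write_ts > latest[column][1]:
            expires_at = math.inf if ttl is None else write_ts + ttl
            latest[column] = (value, write_ts, expires_at)
    return {c: v for c, (v, _, expires_at) in latest.items() if expires_at > now}


# INSERT a=1, b=2, c=3 with TTL 100 at time 0, then UPDATE a=9 without a TTL at time 10.
mutations = [("a", 1, 0, 100), ("b", 2, 0, 100), ("c", 3, 0, 100), ("a", 9, 10, None)]
print(reconcile(mutations, now=50))
print(reconcile(mutations, now=150))
```

At time 50 all three columns are still visible (with a=9); at time 150 only a is left, because the untimed update replaced its expiration while b and c kept theirs.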
To answer your question about primary keys there's something else you need to know: In CQL, a row "exists" by virtue of having some live non-key columns. For example, if your primary key is p, when you insert p=1,x=2 you are basically inserting x=2 (non-key) into the row at p=1. If the x=2 expires, the entire row disappears. That's why the TTL is relevant only to non-key columns.
Well, I actually cheated a little bit in the last paragraph, and there's another subtlety here that I didn't explain. Maybe you noticed that it is possible to INSERT a row and then use UPDATE to delete each one of its columns individually, and you are left with an empty row, which still exists but is empty. How does this work, when I said that a row needs non-key cells to exist? Well, the trick is that the INSERT not only adds the specific columns you asked for (x=2), it also adds another invisible empty-named column (called the "CQL row marker"). When you later delete the individual columns, the row-marker column remains undeleted and keeps the row alive. When you INSERT a row with a TTL x, this command not only sets the TTL of each specified column to x, it also sets the TTL of the hidden row-marker column to x, so when x comes the entire row disappears because all its columns (including the row-marker column) have disappeared. Note that only INSERT, not UPDATE, adds this row marker. This means that if you want to change the TTL of the row marker, you must do this by doing an INSERT, not an UPDATE. For example, if you INSERT data with TTL x, and later UPDATE overwriting all its columns to an earlier expiration time, you'll end up with the data columns expiring early but the row marker remaining until its original expiration - and until then an empty row is visible.
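The row-marker behaviour can be mimicked with a small toy model. This is a hedged sketch in Python, not Scylla code: the class ToyRow, its method names, and the integer clock are invented for illustration, and only expiration times are tracked (values are omitted).

```python
import math

ROW_MARKER = ""  # stands in for the hidden, empty-named row-marker column


class ToyRow:
    def __init__(self):
        self.expiry = {}  # column name -> expiration time

    def _write(self, columns, now, ttl):
        expires_at = math.inf if ttl is None else now + ttl
        for name in columns:
            self.expiry[name] = expires_at

    def insert(self, columns, now, ttl=None):
        # INSERT writes the requested columns and the row marker.
        self._write([ROW_MARKER, *columns], now, ttl)

    def update(self, columns, now, ttl=None):
        # UPDATE writes only the requested columns; the marker is untouched.
        self._write(columns, now, ttl)

    def live_columns(self, now):
        return sorted(n for n, exp in self.expiry.items()
                      if exp > now and n != ROW_MARKER)

    def exists(self, now):
        return any(exp > now for exp in self.expiry.values())


row = ToyRow()
row.insert(["x", "y"], now=0, ttl=100)  # marker and data expire at 100
row.update(["x", "y"], now=5, ttl=10)   # data now expire at 15, marker still at 100
print(row.live_columns(50), row.exists(50))
print(row.exists(150))
```

At time 50 the model reports no live data columns but an existing row, which is the empty-but-visible row described above; at time 150 the marker has expired too and the row is gone.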
Related
I have a large table whose rows get updated/inserted/merged periodically from a few different queries. I need a scheduled process to run (via API) to periodically check for which rows in that table were updated since the last check. So here are my issues...
When I run the merge query, I don't see a way for it to return which records were updated... otherwise, I could be copying those updated rows to a special updated_records table.
There are no triggers so I can't keep track of mutations that way.
I could add a last_updated timestamp column to keep track that way, but then repeatedly querying the entire table all day for that would be a huge amount of data billed (expensive).
I'm wondering if I'm overlooking something obvious or if maybe there's some kind of special BQ metadata that could help?
The reason I'm attempting this is that I'm wanting to extract and synchronize a smaller subset of this table into my PostgreSQL instance because the latency for querying BQ is just too much for smaller queries.
Any ideas? Thanks!
One way is to periodically save intermediate state of the table using the time travel feature. Or store only the diffs. I just want to leave this option here:
FOR SYSTEM_TIME AS OF references the historical versions of the table definition and rows that were current at timestamp_expression.
The value of timestamp_expression has to be within last 7 days.
The following query returns a historical version of the table from one hour ago.
SELECT * FROM table
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
The following query returns a historical version of the table at an absolute point in time.
SELECT * FROM table
FOR SYSTEM_TIME AS OF '2017-01-01 10:00:00-07:00';
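If the historical version and the current version are both loaded keyed by primary key, "storing only the diffs" could look roughly like the following Python sketch. The dict-of-rows representation and the name diff_snapshots are assumptions for illustration; nothing here is a BigQuery API.

```python
def diff_snapshots(old, new):
    """Compare two snapshots given as {primary_key: row} dicts."""
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {k: new[k] for k in new.keys() & old.keys() if new[k] != old[k]}
    return inserted, updated, deleted


# Hypothetical rows: key -> (name, amount)
an_hour_ago = {1: ("alice", 10), 2: ("bob", 20), 3: ("carol", 30)}
current = {1: ("alice", 10), 2: ("bob", 25), 4: ("dave", 40)}

inserted, updated, deleted = diff_snapshots(an_hour_ago, current)
print(inserted, updated, deleted)
```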
An approach would be to have 3 tables:
a base table in "append only" mode: only inserts are added, and each update is written as a new full row, so this table holds every version of every record, like a versioning system.
a table to hold deletes (or this can be incorporated as a soft delete if there is a special column kept in the first table)
a live table where you hold the current data (in this table you would do your MERGE statements, most probably from the first base table).
If you choose partitioning and clustering, you can benefit from the discounted long-term storage price and scan less data.
If the table is large but the amount of data updated per day is modest, then you can partition and/or cluster the table on the last_updated_date column. There are some edge cases: for example, the first check of the day should filter for last_updated_date being either today or yesterday.
Depending on how modest the amount of data updated throughout a day is, even repeatedly querying the entire table all day could be affordable, because the BQ engine will scan only one daily partition.
P.S.
Detailed explanation
I could add a last_updated timestamp column to keep track that way
I inferred from that the last_updated column is not there yet (so the check-for-updates statement cannot currently distinguish between updated rows and non-updated ones) but you can modify the table UPDATE statements so that this column will be added to the newly modified rows.
Therefore I assumed you can modify the updates further to set the additional last_updated_date column which will contain the date portion of the timestamp stored in the last_updated column.
but then repeatedly querying the entire table all day
From here I inferred there are multiple checks throughout the day.
but the data being updated can be for any time frame
Sure, but as soon as a row is updated, no matter how old this row is, it will acquire two new columns, last_updated and last_updated_date - unless both columns have already been added by a previous update, in which case the two columns will be updated rather than added. If there are several updates to the same row between the update checks, the latest update will still leave the row discoverable by the checks that use the logic described below.
The check-for-update statement will (conceptually, not literally):
filter rows to ensure last_updated_date=today AND last_updated>last_checked. The datetime of the previous update check is stored in last_checked; where this piece of data is held (table, durable config) is implementation dependent.
discover whether the current check is the first check of the day. If so, additionally search for last_updated_date=yesterday AND last_updated>last_checked.
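The two steps above can be sketched in Python as follows. This is only a conceptual model over in-memory rows, not a BigQuery query. The function names are made up, and the date range is a slight generalization: it covers every date from the previous check up to today, which reduces to "today" or "yesterday and today" in the cases described.

```python
from datetime import date, datetime, timedelta


def partitions_to_check(last_checked, now):
    """Dates whose last_updated_date partitions may hold rows changed since last_checked."""
    first, last = last_checked.date(), now.date()
    return [first + timedelta(days=i) for i in range((last - first).days + 1)]


def find_updated(rows, last_checked, now):
    """rows: iterable of dicts with last_updated (datetime) and last_updated_date (date)."""
    wanted = set(partitions_to_check(last_checked, now))
    return [
        r for r in rows
        # Never-updated rows have last_updated_date=None and are skipped here.
        if r["last_updated_date"] in wanted and r["last_updated"] > last_checked
    ]


# Hypothetical data: the previous check ran late yesterday, this is today's first check.
rows = [
    {"id": 1, "last_updated": datetime(2024, 5, 1, 23, 50), "last_updated_date": date(2024, 5, 1)},
    {"id": 2, "last_updated": datetime(2024, 5, 1, 10, 0), "last_updated_date": date(2024, 5, 1)},
    {"id": 3, "last_updated": None, "last_updated_date": None},
]
last_checked = datetime(2024, 5, 1, 23, 0)
now = datetime(2024, 5, 2, 0, 30)
print([r["id"] for r in find_updated(rows, last_checked, now)])
```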
Note 1: If the table is partitioned and/or clustered on the last_updated_date column, then the above update checks will not cause table scan. And subject to ‘modest’ assumption made at the very beginning of my answer, the checks will satisfy your 3rd bullet point.
Note 2: The downside of this approach is that the checks for updates will not find rows that had been updated before the table UPDATE statements were modified to include the two extra columns. (Such rows will be in the __NULL__ partition with rows that never were updated.) But I assume until the changes to the UPDATE statements are made it will be impossible to distinguish between updated rows and non-updated ones anyway.
Note 3: This is an explanatory concept. In the real implementation you might need one extra column instead of two. And you will need to check which approach works better: partitioning or clustering (with partitioning on a fake column) or both.
The detailed explanation of the initial answer (i.e. the part above the P.S.) ends here.
Note 4
clustering only helps performance
From the point of view of table scan avoidance and achieving a reduction in the data usage/costs, clustering alone (with fake partitioning) could be as potent as partitioning.
Note 5
In the comment you mentioned there is already some partitioning in place. I’d suggest examining whether the existing partitioning is indispensable or whether it can be replaced with clustering.
Some good ideas posted here. Thanks to those who responded. Essentially, there are multiple approaches to tackling this.
But anyway, here's how I solved my particular problem...
Suppose the data needs to ultimately end up in a table called MyData. I created two additional tables, MyDataStaging and MyDataUpdate. These two tables have a structure identical to MyData, except that MyDataStaging has an additional Timestamp field, "batch_timestamp". This timestamp allows me to determine which rows are the latest versions in case I end up with multiple versions before the table is processed.
Dataflow pushes data directly to MyDataStaging, along with a Timestamp ("batch_timestamp") value indicating when the process ran.
A scheduled process then upserts/merges MyDataStaging into MyDataUpdate (MyDataUpdate will now always contain only a unique list of rows/values that have been changed). Then the process upserts/merges from MyDataUpdate into MyData, and the same rows are exported and downloaded to be loaded into PostgreSQL. Finally, the staging/update tables are emptied appropriately.
Now I'm not constantly querying the massive table to check for changes.
NOTE: When merging to the main big table, I filter the update on unique dates from within the source table to limit the bytes processed.
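The "keep only the latest version per row" step between MyDataStaging and MyDataUpdate can be illustrated with a small Python sketch. It works on plain dicts rather than on BigQuery tables; the function name latest_versions and the key column "id" are assumptions, and only "batch_timestamp" comes from the description above.

```python
def latest_versions(staging_rows, key="id"):
    """Collapse staging rows to one row per key, keeping the newest batch_timestamp."""
    latest = {}
    for row in staging_rows:
        current = latest.get(row[key])
        if current is None or row["batch_timestamp"] > current["batch_timestamp"]:
            latest[row[key]] = row
    # Drop the staging-only column so the result has the same shape as MyDataUpdate.
    return [
        {k: v for k, v in row.items() if k != "batch_timestamp"}
        for row in latest.values()
    ]


# Hypothetical staging content: row 1 arrived in two batches.
staging = [
    {"id": 1, "value": "old", "batch_timestamp": 100},
    {"id": 2, "value": "only", "batch_timestamp": 100},
    {"id": 1, "value": "new", "batch_timestamp": 200},
]
print(latest_versions(staging))
```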
If I have an integer column in H2 and use auto increment for it, is it guaranteed to always continue incrementing from the last inserted value?
-- If some intermediate row is deleted.
-- If last inserted row is deleted.
-- If all rows are deleted using delete from myTable
EDIT: The reason I need the numbering continued (and I would expect this to be the normal behavior) is I am looking to archive older data to keep current tables short.
The auto-incremented value is guaranteed to always be larger than the previous value. It is not guaranteed to always be exactly one more than the last successfully inserted value.
The latter requires much more overhead and is not (generally) worth the additional effort.
Each time I restart my DB2 services, the auto increment field always changes by itself.
For example: before I restart, the auto increment value is at 13 and is incremented by 1; after I restart, it always becomes 31 and is always incremented by 20.
Any idea what may cause this?
Each time I restart my Db2 service, I have to execute this command:
ALTER TABLE <table> ALTER COLUMN <column> RESTART WITH 1
DB2 keeps a cache of generated values in order to reduce the overhead of generating them (it reduces the I/O). This cache lives in memory, and values are assigned from it as requested.
Take a look at the cache option when creating / altering the table. By default the cache value is 20.
It is important to understand how sequences work in DB2. Sequences share many concepts with generated values / identity columns.
Create table http://publib.boulder.ibm.com/infocenter/db2luw/v10r1/topic/com.ibm.db2.luw.sql.ref.doc/doc/r0000927.html
Alter table http://publib.boulder.ibm.com/infocenter/db2luw/v10r1/topic/com.ibm.db2.luw.sql.ref.doc/doc/r0000888.html
Sequences http://publib.boulder.ibm.com/infocenter/db2luw/v10r1/topic/com.ibm.db2.luw.admin.dbobj.doc/doc/c0023175.html
From W3schools:
"Auto-increment allows a unique number to be generated when a new record is inserted into a table."
This is the only thing you may expect: unique (=non-conflicting) numbers. How these are generated is left to the DBMS. You must not expect a number sequence without any gaps.
For instance, a DBMS might choose to "pre-allocate" blocks of ten numbers (23..32, 33..42, ...) for performance reasons, so that the auto-increment field must only be incremented for every (up to) ten records. If you have an INSERT statement that inserts only 5 records into a newly created table, it can "acquire a block of 10 numbers" (0..9), use the first five values (0..4) of it and leave the rest unused. By acquiring this one block of numbers, the counter was incremented from 0 to 10. So the next INSERT statement that fetches a block will get the numbers ranging from 10 to 19.
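The block pre-allocation described above can be mimicked with a few lines of Python. This is a toy model, not the algorithm of any particular DBMS: the class BlockAllocator and its restart() method are invented for illustration, and restart() simply stands in for whatever event makes the engine abandon the unused part of a block.

```python
class BlockAllocator:
    def __init__(self, block_size=10, start=0):
        self.block_size = block_size
        self.counter = start  # persistent counter: first value not yet reserved
        self.cached = []      # reserved values held in memory only

    def next_value(self):
        if not self.cached:
            # Reserve a whole block at once; the counter moves only here.
            self.cached = list(range(self.counter, self.counter + self.block_size))
            self.counter += self.block_size
        return self.cached.pop(0)

    def restart(self):
        # The in-memory block is lost; the persistent counter is not rolled back.
        self.cached = []


alloc = BlockAllocator(block_size=10)
first_five = [alloc.next_value() for _ in range(5)]
alloc.restart()
after_restart = alloc.next_value()
print(first_five, after_restart)
```

The first five values are 0 to 4, and the value handed out after the restart is 10: the values 5 to 9 are never used, which is exactly the kind of gap you must be prepared for.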
I need to have a custom unique identifier (sequence). In my table there is a field ready_to_fetch_id that is null by default; when a message is ready to be delivered, I update it with a unique id derived from the current max id. This is quite a heavy process as the load increases.
So is it possible to have some sequence in Postgres that allows nulls and unique ids?
Allowing NULL values has nothing to do with sequences. If your column definition allows NULLs, you can put NULL values in the column. When you update the column, you take the nextval from the sequence.
Note that if you plan to use the ids to keep track of which rows you have already processed, it won't work perfectly. When two transactions update the ready_to_fetch_id column simultaneously, the transaction that started last might commit first, which means that the higher id of the later transaction becomes visible before the lower id that the earlier transaction is using.
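The visibility problem can be reproduced with a tiny simulation. This is a plain Python sketch with no database involved: the set of "visible" ids, the poll() consumer, and the high-water-mark strategy are all assumptions about how such a reader might be written.

```python
def poll(visible_ids, last_seen):
    """Consumer that fetches every committed id greater than the last one it processed."""
    fresh = sorted(i for i in visible_ids if i > last_seen)
    return fresh, (fresh[-1] if fresh else last_seen)


def simulate():
    visible = set()  # ids whose transactions have committed
    processed, last_seen = [], 0

    # T1 takes nextval -> 1 and T2 takes nextval -> 2, but T2 commits first.
    visible.add(2)
    fresh, last_seen = poll(visible, last_seen)
    processed += fresh

    # T1 commits afterwards; its id is already below the consumer's high-water mark.
    visible.add(1)
    fresh, last_seen = poll(visible, last_seen)
    processed += fresh
    return processed


print(simulate())
```

The consumer ends up having processed only id 2; the row with id 1 is silently skipped because it became visible after a higher id had already been seen.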
As a bit of background, I'm working with a SQLite database that is being consumed by a closed-source UI that doesn't order the results by the handy timestamp column (gee, thanks Nokia!) - it just uses the default ordering, which corresponds to the primary key, which is a vanilla auto-incrementing 'id' column.
I easily have a map of the current and desired id values, but applying the mapping is my current problem. It seems I cannot swap the values as an update processes rows one at a time, which would temporarily result in a duplicate value. I've tried using an update statement with case clauses using a temporary out-of-sequence value, but as each row is only processed once this obviously doesn't work. Thus I've reached the point of needing 3 update statements to swap a pair of values, which is far from ideal as I want this to scale well.
Compounding this, there are a number of triggers set up, which makes adding/deleting rows into a new table a complex problem unless I can disable them for the duration of any additions/deletions resulting from table duplication & deletion, which is why I haven't pursued that avenue yet.
I'm thinking my next line of enquiry will be a new column with the new ids then finding a way to move the primary index to it before removing the old column, but I'm throwing this out there in case anyone can offer up a better solution that will save me some time :)
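One possible way around the transient duplicates, sketched here with Python's sqlite3 module, is to apply the whole mapping in two UPDATE statements: first move every mapped row to the negative of its target id, then flip the sign. This is a hedged sketch, not a tested fix for the database in question: the table name msgs, the helper remap_ids, and the id_map temp table are made up, it assumes all existing ids are positive and that the targets do not clash with unmapped rows, and any UPDATE triggers on the table would still fire.

```python
import sqlite3


def remap_ids(conn, table, mapping):
    """Apply {old_id: new_id} to table.id in two passes (table must be a trusted name)."""
    conn.execute(
        "CREATE TEMP TABLE id_map (old_id INTEGER PRIMARY KEY, new_id INTEGER NOT NULL)"
    )
    with conn:  # commit both passes together, or roll back on error
        conn.executemany("INSERT INTO id_map VALUES (?, ?)", mapping.items())
        # Pass 1: park each mapped row at the negative of its target id.
        conn.execute(
            f"UPDATE {table} "
            f"SET id = -(SELECT new_id FROM id_map WHERE old_id = {table}.id) "
            f"WHERE id IN (SELECT old_id FROM id_map)"
        )
        # Pass 2: flip the parked rows to their final positive ids.
        conn.execute(f"UPDATE {table} SET id = -id WHERE id < 0")
    conn.execute("DROP TABLE id_map")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE msgs (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO msgs VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
conn.commit()

remap_ids(conn, "msgs", {1: 2, 2: 1})  # swap ids 1 and 2
print(conn.execute("SELECT id, body FROM msgs ORDER BY id").fetchall())
```

With this approach the number of statements stays constant no matter how many pairs are swapped, since the per-row mapping lives in the temporary table.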