How to guarantee a monotonically increasing timeuuid when selecting from Scylla

I have a table with timeuuid as a clustering key.
CREATE TABLE event (
    domain TEXT,
    createdAt TIMEUUID,
    kind TEXT,
    PRIMARY KEY (domain, createdAt)
);
I wish to select the data in order of this clustering key with the following guarantee: if I have selected something, there will be NO inserts before those records in the future (so I could iterate through the records, checking what new events have happened, without the risk of skipping any).
SELECT kind FROM event WHERE domain = ? AND createdAt > lastCreatedAtWeAreAwareOf
If I generate the timeuuid on the client and insert into Scylla in parallel, it is technically possible that a recent timeuuid will get inserted before several older ones (say, due to some networking issue), and I might miss those records in my selects.
What are possible ways to resolve this?
I tried using the currentTimeUUID function and it seems to work (monotonically increasing within the same partition key), but it creates a lot of duplicates (20-40 per partition key), i.e. I end up with lots of records with exactly the same currentTimeUUID. I would really like a way to avoid duplicates; they complicate the select process and consume unnecessary resources.
I'm also curious whether there is a threat of backward clock jumps when using the currentTimeUUID function.

EDITED
It seems that there's a bug in Scylla where currentTimeUUID always generates duplicates for writes done at the same time through the same coordinator. I created an issue here. Thanks for bringing this up.
PREVIOUS ANSWER BELOW
If I generate the timeuuid on the client and insert into Scylla in parallel, it is technically possible that a recent timeuuid will get inserted before several older ones (say, due to some networking issue), and I might miss those records in my selects.
Just to clarify: all writes will be stored in the right order. There will be a point in time after which writes that are old enough can be read in the right order. This means that one possible solution would be to make sure the select does not query data that is too recent, leaving a window for 'late' writes to arrive and take their place in line. For example, you could use a select like this:
SELECT kind FROM event WHERE domain = ? AND createdAt > lastCreatedAtWeAreAwareOf AND createdAt < now() - 30s
I don't know whether it's OK for you to impose such a delay, though. This strategy won't give you complete certainty either, because any write that gets delayed by more than 30s will still be missed.
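The now() - 30s part above is illustrative rather than literal CQL; a portable way to apply the idea is to compute the cutoff timeuuid on the client. Below is a minimal sketch with the Python driver, where the contact point, the keyspace name "ks" and the 30-second grace window are all assumptions:
import time
from cassandra.cluster import Cluster          # DataStax/Scylla Python driver
from cassandra.util import max_uuid_from_time  # largest timeuuid for a given instant

# Assumptions: a local node and a keyspace "ks" containing the event table above.
session = Cluster(["127.0.0.1"]).connect("ks")
select = session.prepare(
    "SELECT createdAt, kind FROM event "
    "WHERE domain = ? AND createdAt > ? AND createdAt < ?"
)

def poll(domain, last_seen_timeuuid, grace_seconds=30):
    # Upper bound = "now minus a grace window", leaving room for late writes.
    cutoff = max_uuid_from_time(time.time() - grace_seconds)
    return list(session.execute(select, (domain, last_seen_timeuuid, cutoff)))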
I tried using the currentTimeUUID function and it seems to work (monotonically increasing within the same partition key), but it creates a lot of duplicates (20-40 per partition key), i.e. I end up with lots of records with exactly the same currentTimeUUID. I would really like a way to avoid duplicates; they complicate the select process and consume unnecessary resources.
You can reduce the chance of clustering key duplication by introducing an additional clustering key column, like:
CREATE TABLE event (
    domain TEXT,
    createdAt TIMEUUID,
    randomBit UUID,   -- or an INT, generated on the client
    kind TEXT,
    PRIMARY KEY (domain, createdAt, randomBit)
);
and generate a value for it on the client in some good random way. Maybe there's some aspect of the record that you know is guaranteed to be unique and could be used as a clustering key column; that would work even better than a random field.
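For illustration, a hedged sketch of what the insert could look like with that tie-breaker column; the keyspace name and the choice of uuid4() for randomBit are assumptions, and createdAt is still generated server-side with currentTimeUUID():
import uuid
from cassandra.cluster import Cluster

# Assumption: keyspace "ks" with the widened table (domain, createdAt, randomBit, kind).
session = Cluster(["127.0.0.1"]).connect("ks")
insert = session.prepare(
    "INSERT INTO event (domain, createdAt, randomBit, kind) "
    "VALUES (?, currentTimeUUID(), ?, ?)"
)

def record_event(domain, kind):
    # randomBit only breaks ties between rows that receive the same currentTimeUUID;
    # ordering within the partition is still driven by createdAt.
    session.execute(insert, (domain, uuid.uuid4(), kind))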

Related

Is using a timestamp as a hash key on a GSI in DynamoDB a good approach

I have a large (2B + records) DynamoDB table.
I want to implement a distributed locking process by adding a new field, 'index_due_at' when an item is created or updated. After the create/update, I will do some further processing on the item and then remove the 'index_due_at' field.
I'd like to create a sweeper job which will periodically extract any records with an outstanding 'index_due_at' field (on the assumption that something about the above process failed) to give those records further treatment. I would anticipate at most 100s of records in this state at any one time, more likely 10s.
To optimise the performance of the sweeper, I want to create a GSI including the new field (and project the key data into it).
It seems that using a timestamp (in millis) as the GSI HASH key ought to give a good distribution. And I don't need to query on this field's value, just on its presence. Can anyone identify any drawbacks in this approach and if so, suggest an alternative?
Issues I can anticipate include:
* Non-uniqueness in timestamps at milli level.
* Possible hash key problems with numeric values?
* Possible hash key problems with numeric values that don't vary much in the most significant digits.
This is less of a problem than you might be thinking. GSI hash keys don't actually have to be unique, so you're fine on that front.
You probably already know this, but your GSI will only contain items with GSI keys, so your GSI should be pretty small (100s of items).
One thought I have is that the index_due_at might actually be better as a GSI sort key rather than hash key. Data is sorted within a partition by sort key. So you could have a GSI hash key of index_due_at_flag which would be Y if present, then a sort key of index_due_at. This would mean all your data would be sorted naturally, so you could process it in date order.
That said, you are probably never going to Query this GSI, so I suspect your choice of keys hardly matters at all. Presumably you will just do a Scan, get all the items and try and process them all. In which case you would never even use the keys. Just having a key attribute present would put the item in the GSI.
Another thought is that you need to handle the fact that GSIs are not perfectly synchronous with the base table. It's possible (admittedly unlikely) that an item in your GSI has actually just been processed. Therefore, if your sweeper script picks up an item from the GSI, you should handle the possibility that it has already been updated in the base table (e.g. by checking the base table item before attempting to process it).
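For illustration only, a rough boto3 sketch of that sweep; the table name, index name and key attribute name are assumptions, not something from the question:
import boto3

# Assumed names: base table "items" with hash key "id", sparse GSI "index_due_at-index".
table = boto3.resource("dynamodb").Table("items")

def sweep():
    # The GSI only contains items that still carry index_due_at, so this Scan stays small.
    for stale in table.scan(IndexName="index_due_at-index").get("Items", []):
        # GSIs lag the base table slightly: re-read the item and confirm it still
        # needs processing before doing any work on it.
        current = table.get_item(Key={"id": stale["id"]}).get("Item")
        if current and "index_due_at" in current:
            handle(current)   # hypothetical downstream processing function

def handle(item):
    print("reprocessing", item["id"])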
Good luck with it. I answered because I liked your bio! Hope staying on the right side of barrel shaped is working out :)
This should be a perfect scenario for using DynamoDB Sparse Index
Use 'index_due_at' as the sort key in the GSI, and only the items you are interested in will be in the index, greatly reducing the space needed and improving performance.
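One possible shape for that sparse index, sketched with boto3 (the flag attribute and index name are assumptions): rows are written with an index_due_flag of "Y" alongside index_due_at, and the sweeper's Query comes back already sorted by due date.
import boto3
from boto3.dynamodb.conditions import Key

# Assumptions: table "items" with a sparse GSI "due-index" whose hash key is
# index_due_flag (always "Y" when present) and whose range key is index_due_at.
table = boto3.resource("dynamodb").Table("items")

def due_items():
    resp = table.query(
        IndexName="due-index",
        KeyConditionExpression=Key("index_due_flag").eq("Y"),
        ScanIndexForward=True,   # oldest due date first
    )
    return resp.get("Items", [])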

Get rows inserted since last check?

I am implementing a CQRS pattern where one or more processes are inserting records into the database and one or more processes are pulling them at a different pace.
I'd like consumer processes to poll the database for new records that were inserted since last check, but I'm not sure how to (safely) implement this.
You can assume that rows will not change once they are inserted. It seems it isn't enough for each row to have a unique id, and a timestamp indicating when it was inserted.
If I query for records with a timestamp greater than the last row I saw then I run into problems if multiple records were inserted at the same time (having the same timestamp).
If I query for records with an id greater than the last row I saw, then I run into problems where concurrent transactions may commit IDs out of order (e.g. PostgreSQL sessions allocate and cache sequence IDs ahead of time to improve performance).
Ideally, I am looking for a DBMS-agnostic solution and want to be able to consume data as close to real time as possible. Any ideas?
Clarification: Each row should be consumed multiple times, once per consumer. Meaning, just because one consumer processes a row should not prevent other consumers from doing so. Each consumer will do something different with the same data.
Since you have a lot of data coming in and might have multiple records for the last timestamp, you need a way to keep track of the data you have already read. Here are a few different approaches with their pros and cons:
You can wait for all the data for a given timestamp to arrive. You would do this by not reading rows with the MAX(timestamp), so you get everything from the table except the most recent timestamp, for which data might still be coming in.
Pro: Simple design
Con: Not real time processing
You can store the ids you have already read for the latest timestamp. When getting new data, you can use a query like: (timestamp = lasttimestamp AND id NOT IN (set of ids)) OR timestamp > lasttimestamp (see the sketch after this list)
Pro: Almost real time
Con: Additional storage required
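A minimal, DBMS-agnostic sketch of that second approach (sqlite3 is used here purely to keep the example self-contained; the table and column names are assumptions):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, ts INTEGER, payload TEXT)")

def fetch_new(last_ts, seen_ids):
    # Rows at the last-seen timestamp that we have not processed yet, plus anything newer.
    placeholders = ",".join("?" for _ in seen_ids) or "-1"   # -1 means "no ids seen yet"
    sql = ("SELECT id, ts, payload FROM events "
           f"WHERE (ts = ? AND id NOT IN ({placeholders})) OR ts > ? "
           "ORDER BY ts, id")
    return conn.execute(sql, (last_ts, *seen_ids, last_ts)).fetchall()

# The consumer remembers the max ts it saw and the ids it saw at that ts between polls.
rows = fetch_new(last_ts=0, seen_ids=[])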
If you don't use sharding or similar:
You can use optimistic locking.
For this you can create an order column with a unique index on the records table (the Log). Before each insertion, the producer queries the Log for the greatest order value, increments it, and inserts the next record with that order.
If a concurrency exception occurs (e.g. Duplicate entry '12345' for key 'order') then you retry the entire process (query, increment, insert).
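A minimal sketch of that retry loop (sqlite3 only to keep it self-contained; the column is named ord since ORDER is a reserved word, and the unique index on it is what turns a lost race into a catchable error):
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)   # autocommit for brevity
conn.execute("CREATE TABLE log (ord INTEGER NOT NULL UNIQUE, payload TEXT)")

def append(payload, retries=10):
    for _ in range(retries):
        # Read the current maximum order and try to claim the next value.
        (current,) = conn.execute("SELECT COALESCE(MAX(ord), 0) FROM log").fetchone()
        try:
            conn.execute("INSERT INTO log (ord, payload) VALUES (?, ?)",
                         (current + 1, payload))
            return current + 1
        except sqlite3.IntegrityError:
            continue   # another producer won the race for this order value; retry
    raise RuntimeError("could not append after %d retries" % retries)

append("first event")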
If you use sharding or similar:
Then you will need an additional service/table that will generate a new, unique, always-increasing order integer every time it is asked to do so.
This has the disadvantage that there is another piece that must be managed, a single point of failure that must be highly-available.
P.S.
"sharding or similar" means that you can't have unique indexes on the entire table because you use sharding or you write to multiple tables.
you can't rely on timestamps or anything that relates to physical time, because the system time may be adjusted by an automated service (NTP) or by a human operator.

AWS DynamoDB v2: Do I need secondary index for alternative queries?

I need to create a table that would contain a slice of data produced by a continuously running process. This process generates messages that contain two mandatory components, among other things: a globally unique message UUID, and a message timestamp.
Those messages would be later retrieved by the UUID.
In addition, on a regular basis I would need to delete all messages from that table that are too old, i.e. whose timestamps are more than X away from the current time.
I've been reading the DynamoDB v2 documentation (e.g. Local Secondary Indexes) trying to figure out how to organize my table and whether or not I need a secondary index to perform searches for messages to delete. There might be a simple answer to my question, but I am somehow confused...
So should I just create a table with the UUID as the hash and messageTimestamp as the range key (together with a "message" attribute that would contain the actual message), and then not create any secondary indices? In the examples that I've seen, the hash was something that was not unique (e.g. ForumName under the above link). In my case, the hash would be unique. I am not sure whether that makes any difference.
And if I create the table with hash and range as described, and without a secondary index, then how would I query for all messages that are in a certain timerange regardless of their UUIDs?
DynamoDB has introduced Global Secondary Indexes, which would solve this problem.
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html
We've wrestled with this as well. The best solution we've come up with is to create a second table for storing the time series data. To do this:
1) Use the date plus "bucket" id for a hash key
You could just use the date, but then I'm guessing today's date would become a "hot" key - one that is written with excessive frequency. This can create a serious bottleneck, as the total throughput for a particular DynamoDB partition is equal to the total provisioned throughput divided by the number of partitions - that means if all your writes are to a single key (today's key) and you have a throughput of 20 writes per second, then with 20 partitions, your total throughput would be 1 write per second. Any requests beyond this would be throttled. Not a good situation.
The bucket can be a random number from 1 to n, where n should be greater than the number of partitions used by the underlying DB. Determining n is a bit tricky of course because Dynamo does not reveal how many partitions it uses. But we are currently working with the upper limit of 200 based on the example found here. The writeup at this link was the basis for our thinking in coming up with this approach.
2) Use the UUID for the range key
3) Query records by issuing queries for each day and bucket.
This may seem tedious, but it is more efficient than a full scan. Another possibility is to use Elastic Map Reduce jobs, but I have not tried that myself yet so cannot say how easy/effective it is to work with.
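For what it's worth, the per-day, per-bucket read could look roughly like this in boto3 (the table name, the day_bucket key format and NUM_BUCKETS are assumptions; NUM_BUCKETS must match the n used when writing):
import boto3
from boto3.dynamodb.conditions import Key

NUM_BUCKETS = 200                       # the "n" chosen when writing
table = boto3.resource("dynamodb").Table("events_by_day")

def records_for_day(day):               # day as "YYYY-MM-DD"
    items = []
    # One Query per (day, bucket) hash key; writes were spread over the buckets
    # so no single partition becomes hot.
    for bucket in range(NUM_BUCKETS):
        resp = table.query(
            KeyConditionExpression=Key("day_bucket").eq(f"{day}#{bucket}")
        )
        items.extend(resp.get("Items", []))
    return items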
We are still figuring this out ourselves, so I'm interested to hear others' comments. I also found this presentation very helpful in thinking through how best to use Dynamo:
Falling In and Out Of Love with Dynamo
-John
In short, you cannot. All DynamoDB queries MUST include the table's hash key. Optionally, you can also use the range key and/or a local secondary index. With the current DynamoDB functionality you won't be able to use an LSI as an alternative to the primary index. You are also not able to issue a query with only the range key (you can test this out easily in the AWS Console).
A (costly) workaround that I can think of is to issue a scan of the table, adding a filter based on the timestamp value in order to find out which items to delete. Note that filtering will not reduce the consumed capacity of the operation, as it still reads the whole table.
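A hedged boto3 sketch of that workaround, using the question's UUID/messageTimestamp key pair (the table name and cutoff are assumptions; the filter trims results only after the read, so the whole table is still consumed):
import time
import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("messages")   # assumed table name
MAX_AGE_SECONDS = 7 * 24 * 3600                        # "too old" threshold, assumed

def delete_old():
    cutoff = int(time.time()) - MAX_AGE_SECONDS
    # FilterExpression reduces what comes back, not the read capacity consumed.
    resp = table.scan(FilterExpression=Attr("messageTimestamp").lt(cutoff))
    for item in resp.get("Items", []):
        table.delete_item(Key={"UUID": item["UUID"],
                               "messageTimestamp": item["messageTimestamp"]})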

Is it safe to use ROWID to locate a Row/Record in Oracle?

I'm looking at a client application which retrieves several columns including ROWID, and
later uses ROWID to identify rows it needs to update:
update some_table t set col1=value1
where t.rowid = :selected_rowid
Is it safe to do so? As the table is being modified, can ROWID of a row change?
"From Oracle 8 the ROWID format and size changed from 8 to 10 bytes. Note that ROWID's will change when you reorganize or export/import a table. In case of a partitioned table, it also changes if the row migrates from a partition to another one during an UPDATE."
http://www.orafaq.com/wiki/ROWID
I'd say no. This could be safe if, for instance, the application stores the ROWID temporarily (say, generating a list of selectable items, each identified by ROWID, but the list is routinely regenerated and not stored). But if the ROWID is used in any persistent way, it's not safe.
Assuming that you are using the ROWID a short period of time after you SELECT it, that the table is a standard heap-organized table, and that the DBA isn't doing something to the table (which is a reasonably safe assumption if the application is online), the ROWID will be stable. It would be preferable to use the primary key but when the primary key isn't available, plenty of Oracle-developed tools and frameworks will use the ROWID for short periods of time. It would not be safe if you intended to use the ROWID a long period of time after you SELECT it-- for example, if you allow users to edit data locally and then synchronize with the master database some arbitrary length of time later.
The ROWID is just a physical location of a row so anything that causes that location to change will change the ROWID.
If you are using index-organized tables or partitioned tables, updates to the row can change where the row is physically located which will change the ROWID.
If a row is deleted from a heap-organized table, a subsequent INSERT might put a completely different row in the same place, which would then have the same ROWID the deleted row previously had.
Various administrative tasks can cause the ROWID to change. Exporting and importing the table will change the ROWID for example, but so will doing something like the new-ish online shrink command. These administrative tasks will not normally be done while the application is up, however, and will almost certainly not be done during the day. But it could lead to problems if the application isn't shut down when a DBA does this sort of thing or if the application persists the data.
Over time, it has become more and more common for new features to introduce new possibilities for ROWIDs to change. Index-organized tables and the online shrink option, for example, are relatively new features. In the future, it is likely that there will be more features that will involve the potential at least for a ROWID to change.
Of course, if we're being pedantic, it's also not safe to rely on the primary key. It is perfectly possible that some other session comes along and updates the primary key of the row after you read it or that some other session deletes the row after you select it and inserts a new row with the same data and a different primary key. In either case, it helps to have some local knowledge about what the applications using the database are actually supposed to be doing. It would be extremely uncommon, for example, to allow updates to primary keys or to reuse primary keys so you can generally determine that it's safe to use a primary key. Similarly, it is relatively common to conclude that given the way you're using partitioning or given the way you've defined the index in your index-organized table that updates won't actually change the ROWID. If you know that the table is partitioned by the LOAD_DATE, for example, and that you never update the LOAD_DATE, you won't actually experience changes to the ROWID because of an update. If you know that the table is index-organized but that you're not updating a column that is part of that index, the ROWID won't change on an UPDATE.
I do not think it is safe to do so. In theory it will not change; that is, of course, until someone "accidentally" deletes something on the actual DB...
I would just use the PK; it makes a lot more sense.

Some sort of “different auto-increment indexes” per primary key value

I have got a table which has an id (primary key with auto increment), uid (a key referring to a user's id, for example) and something else which for my question won't matter.
I want to make, let's call it, different auto-increment keys on id for each uid entry.
So, I will add an entry with uid 10, and the id field for this entry will be 1 because there were no previous entries with a value of 10 in uid. Then I will add a new one with uid 4, and its id will be 3 because there were already two entries with uid 4.
...A very obvious explanation, but I am trying to be as explanatory and clear as I can to demonstrate the idea... clearly.
What SQL engine can provide such a functionality natively? (non Microsoft/Oracle based)
If there is none, how could I best replicate it? Triggers perhaps?
Does this functionality have a more suitable name?
In case you know about a non-SQL database engine providing such functionality, name it anyway, I am curious.
Thanks.
MySQL's MyISAM engine can do this. See their manual, in section Using AUTO_INCREMENT:
For MyISAM tables you can specify AUTO_INCREMENT on a secondary column in a multiple-column index. In this case, the generated value for the AUTO_INCREMENT column is calculated as MAX(auto_increment_column) + 1 WHERE prefix=given-prefix. This is useful when you want to put data into ordered groups.
The docs go on after that paragraph, showing an example.
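A hedged sketch of that pattern applied to the question's uid/id pair (connection details and table name are invented; the essential parts are the composite PRIMARY KEY (uid, id), AUTO_INCREMENT on the second column, and ENGINE=MyISAM):
import mysql.connector   # assumes a reachable MySQL server

conn = mysql.connector.connect(user="app", password="secret",
                               host="127.0.0.1", database="test")
cur = conn.cursor()

# With MyISAM and a composite key, MySQL computes id as MAX(id)+1 per distinct uid.
cur.execute("""
    CREATE TABLE per_uid_log (
        uid  INT NOT NULL,
        id   MEDIUMINT NOT NULL AUTO_INCREMENT,
        note VARCHAR(100),
        PRIMARY KEY (uid, id)
    ) ENGINE=MyISAM
""")

cur.execute("INSERT INTO per_uid_log (uid, note) VALUES (10, 'a')")  # stored as (10, 1)
cur.execute("INSERT INTO per_uid_log (uid, note) VALUES (4,  'b')")  # stored as (4, 1)
cur.execute("INSERT INTO per_uid_log (uid, note) VALUES (4,  'c')")  # stored as (4, 2)
conn.commit()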
The InnoDB engine in MySQL does not support this feature, which is unfortunate because it's better to use InnoDB in almost all cases.
You can't emulate this behavior using triggers (or any SQL statements limited to transaction scope) without locking tables on INSERT. Consider this sequence of actions:
Mario starts transaction and inserts a new row for user 4.
Bill starts transaction and inserts a new row for user 4.
Mario's session fires a trigger to compute MAX(id)+1 for user 4. It gets 3.
Bill's session fires a trigger to compute MAX(id)+1 for user 4. It also gets 3.
Bill's session finishes his INSERT and commits.
Mario's session tries to finish his INSERT, but the row with (userid=4, id=3) now exists, so Mario gets a primary key conflict.
In general, you can't control the order of execution of these steps without some kind of synchronization.
The solutions to this are either:
Get an exclusive table lock. Before trying an INSERT, lock the table. This is necessary to prevent concurrent INSERTs from creating a race condition like in the example above. It's necessary to lock the whole table, since you're trying to restrict INSERT; there's no specific row to lock (if you were trying to govern access to a given row with UPDATE, you could lock just that row). But locking the table causes access to the table to become serial, which limits your throughput.
Do it outside transaction scope. Generate the id number in a way that won't be hidden from two concurrent transactions. By the way, this is what AUTO_INCREMENT does. Two concurrent sessions will each get a unique id value, regardless of their order of execution or order of commit. But tracking the last generated id per userid requires access to the database, or a duplicate data store. For example, a memcached key per userid, which can be incremented atomically.
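A small sketch of the first option, with sqlite3's write lock standing in for an exclusive table lock (in MySQL the equivalent would be LOCK TABLES ... WRITE or SELECT ... FOR UPDATE on a per-user counter row; table and column names are illustrative):
import sqlite3

conn = sqlite3.connect("app.db", isolation_level=None)   # transactions managed explicitly
conn.execute("""CREATE TABLE IF NOT EXISTS per_user_rows (
                    userid INTEGER NOT NULL,
                    id     INTEGER NOT NULL,
                    data   TEXT,
                    PRIMARY KEY (userid, id))""")

def insert_row(userid, data):
    # Take the write lock before reading MAX(id), so no concurrent INSERT can
    # compute the same "next" value. Throughput becomes serial, as noted above.
    conn.execute("BEGIN IMMEDIATE")
    try:
        (nxt,) = conn.execute(
            "SELECT COALESCE(MAX(id), 0) + 1 FROM per_user_rows WHERE userid = ?",
            (userid,)).fetchone()
        conn.execute("INSERT INTO per_user_rows (userid, id, data) VALUES (?, ?, ?)",
                     (userid, nxt, data))
        conn.execute("COMMIT")
        return nxt
    except Exception:
        conn.execute("ROLLBACK")
        raise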
It's relatively easy to ensure that inserts get unique values. But it's hard to ensure they will get consecutive ordinal values. Also consider:
What happens if you INSERT in a transaction but then roll back? You've allocated id value 3 in that transaction, and then I allocated value 4, so if you roll back and I commit, now there's a gap.
What happens if an INSERT fails because of other constraints on the table (e.g. another column is NOT NULL)? You could get gaps this way too.
If you ever DELETE a row, do you need to renumber all the following rows for the same userid? What does that do to your memcached entries if you use that solution?
SQL Server should allow you to do this. If you can't implement this using a computed column (probably not - there are some restrictions), surely you can implement it in a trigger.
MySQL also would allow you to implement this via triggers.
In a comment you ask the question about efficiency. Unless you are dealing with extreme volumes, storing an 8 byte DATETIME isn't much of an overhead compared to using, for example, a 4 byte INT.
It also massively simplifies your data inserts, as well as being able to cope with records being deleted without creating 'holes' in your sequence.
If you DO need this, be careful with the field names. If you have uid and id in a table, I'd expect id to be unique in that table, and uid to refer to something else. Perhaps, instead, use the field names property_id and amendment_id.
In terms of implementation, there are generally two options.
1). A trigger
Implementations vary, but the logic remains the same. As you don't specify an RDBMS (other than NOT MS/Oracle) the general logic is simple...
Start a transaction (often one is implicitly already started inside triggers)
Find the MAX(amendment_id) for the property_id being inserted
Update the newly inserted value with MAX(amendment_id) + 1
Commit the transaction
Things to be aware of are...
- multiple records being inserted at the same time
- records being inserted with amendment_id being already populated
- updates altering existing records
2). A Stored Procedure
If you use a stored procedure to control writes to the table, you gain a lot more control.
Implicitly, you know you're only dealing with one record.
You simply don't provide a parameter for DEFAULT fields.
You know what updates / deletes can and can't happen.
You can implement all the business logic you like without hidden triggers
I personally recommend the Stored Procedure route, but triggers do work.
It is important to get your data types right.
What you are describing is a multi-part key. So use a multi-part key. Don't try to encode everything into a magic integer; you will poison the rest of your code.
If a record is identified by (entity_id,version_number) then embrace that description and use it directly instead of mangling the meaning of your keys. You will have to write queries which constrain the version number but that's OK. Databases are good at this sort of thing.
version_number could be a timestamp, as a_horse_with_no_name suggests. This is quite a good idea. There is no meaningful performance disadvantage to using timestamps instead of plain integers. What you gain is meaning, which is more important.
You could maintain a "latest version" table which contains, for each entity_id, only the record with the most-recent version_number. This will be more work for you, so only do it if you really need the performance.
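To make the multi-part key concrete, a small sqlite3 sketch (entity_id/version_number follow the wording above; the last query is the "latest version" lookup you would otherwise materialize into a separate table):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE amendments (
        entity_id      INTEGER NOT NULL,
        version_number INTEGER NOT NULL,
        body           TEXT,
        PRIMARY KEY (entity_id, version_number)
    );
    INSERT INTO amendments VALUES (10, 1, 'first'), (10, 2, 'second'), (4, 1, 'only');
""")

# Constraining the version number is just another predicate.
row = conn.execute(
    "SELECT body FROM amendments WHERE entity_id = ? AND version_number = ?",
    (10, 2)).fetchone()

# Most recent version per entity; cache this in a "latest version" table only
# if it ever becomes a measured hot path.
latest = conn.execute(
    "SELECT entity_id, MAX(version_number) FROM amendments GROUP BY entity_id"
).fetchall()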