How to trim a list/set in ScyllaDB to a specific size?

Is there a way to trim a list/set to a specific size (in terms of number of elements)?
Something similar to the LTRIM command in Redis (https://redis.io/commands/ltrim).
The goal is to insert an element into a list/set while ensuring that its final size is always <= X (discarding old entries).
Example of what I would like to be able to do:
CREATE TABLE images (
    name text PRIMARY KEY,
    owner text,
    tags set<text> // A set of text values
);
-- single command
UPDATE images SET tags = ltrim(tags + { 'gray', 'cuddly' }, 10) WHERE name = 'cat.jpg';
-- two commands (Redis style)
UPDATE images SET tags = tags + { 'gray', 'cuddly' } WHERE name = 'cat.jpg';
UPDATE images SET tags = ltrim(tags, 10) WHERE name = 'cat.jpg';

No, there is no such operation in Scylla (or in Cassandra).
The first reason is efficiency: as you may be aware, one reason why writes in Scylla are so efficient is that they do not do a read. Appending an element to a list just writes that single item to a sequential file (a so-called "sstable"); it does not need to read the existing list and check which elements it already has. The operation you propose would need to read the existing row before writing, slowing it down significantly.
The second reason is consistency: what happens if multiple operations like the one you propose run in parallel, reaching different coordinators and replicas in different orders? What happens if, after earlier problems, one of the replicas is missing one of the values? There is no magic way to solve these problems, and the general solution that Scylla offers for concurrent read-modify-write operations is LWT (Lightweight Transactions). You can emulate your ltrim operation using LWT, but it will be significantly slower than ordinary writes: you will need to read the list to the client, modify it (append, ltrim, etc.) and then write it back with an LWT, with the extra condition that it still has its old value, or using an additional "version number" column.
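For illustration, a minimal sketch of that client-side read-modify-write loop with the Python driver (the contact point, keyspace, retry loop and trimming policy are all assumptions; there is no built-in notion of "old entries" in a CQL set):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # contact point/keyspace are assumptions

def add_tags_trimmed(name, new_tags, max_size=10):
    while True:
        row = session.execute(
            "SELECT tags FROM images WHERE name = %s", (name,)
        ).one()
        old_tags = set(row.tags) if row and row.tags else set()

        merged = old_tags | set(new_tags)
        # Trimming policy is up to you; CQL sets are unordered, so "oldest"
        # is not well defined. Here we simply keep the max_size smallest values.
        trimmed = set(sorted(merged)[:max_size])

        # Conditional (LWT) write: only applied if nobody changed the set
        # since we read it. An empty/absent set compares as null in CQL.
        result = session.execute(
            "UPDATE images SET tags = %s WHERE name = %s IF tags = %s",
            (trimmed, name, old_tags if old_tags else None),
        )
        if result.was_applied:
            return trimmed
        # Someone else won the race -- re-read and retry.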

Related

REDIS SISMEMBERS Performance: LUA script on multiple keys vs single transaction

I have a dozen Redis keys of type SET, say
PUBSUB_USER_SET-1-1668985588478915880,
PUBSUB_USER_SET-2-1668985588478915880,
PUBSUB_USER_SET-3-1668988644477632747,
...
PUBSUB_USER_SET-10-1668983464477632083
Each set contains user IDs, and the problem is to check whether a given user is present in any of these sets.
The solution I tried is to fetch all the keys, join them with a comma delimiter, and pass the result as an argument to a Lua script, which splits the keys with gmatch and runs SISMEMBER on each one until there is a hit.
-- KEYS[1]: comma-separated list of set names, KEYS[2]: the user id to look for
local vals = KEYS[1]
for match in (vals..","):gmatch("(.-)"..",") do
    local exist = redis.call('sismember', match, KEYS[2])
    if exist == 1 then
        return 1
    end
end
return 0
Now, as the number of keys grows to PUBSUB_USER_SET-20 or PUBSUB_USER_SET-30, I see latency increase and throughput degrade.
Is this a reasonable way to do it, or is it better to batch the Lua script calls (passing 10 keys per call instead of all 30 and returning as soon as the user is found), or is there a better way to do this?
I would propose a different solution: instead of storing user IDs in whichever set happens to apply, place each user ID in exactly one set and query only that set to check whether it is there or not.
Let's say we have N sets numbered s-0, s-1, s-2, ..., s-19.
You put each user ID into one of these sets based on its hash, which means you only need to query one set instead of checking all of them. You can use any hashing algorithm.
To make it further interesting you can try consistent hashing.
You can also use a Redis pipeline with batching (10 keys per iteration) to improve performance.
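A minimal Python sketch of the hash-based placement (set names follow the answer; the choice of CRC32 as the stable hash is an assumption -- any deterministic hash will do):

import redis
import zlib

r = redis.Redis()
NUM_SETS = 20

def set_for(user_id):
    # Stable hash -> one of the N sets. Avoid Python's built-in hash(),
    # which is randomized per process.
    return "s-%d" % (zlib.crc32(user_id.encode()) % NUM_SETS)

def add_user(user_id):
    r.sadd(set_for(user_id), user_id)

def is_present(user_id):
    # Only one SISMEMBER call, no matter how many sets exist.
    return bool(r.sismember(set_for(user_id), user_id))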

DynamoDB: Have sequencing within Items

I am developing forums on DynamoDB.
There is a table posts which contains all the posts in a thread.
I need to have a notion of sequence in the posts, i.e. I need to know which post came first and which came later.
My service would be running in a distributed env.
I am not sure if using a timestamp is the best solution for deciding the sequence, as the hosts might have slightly different times and might be off by milliseconds or seconds.
Is there another way to do this?
Can I get DynamoDB to populate the date so it is consistent?
Or is there a sequence generator that I can use in a distributed env?
You can't get DynamoDB to auto-populate dates. You can use other services that provide auto-generated numbers, or use DynamoDB's atomic increment to create your own IDs.
This can become a bottleneck if your forum is very successful (needs lots of numbers per second). I think you should start with a timestamp and later add complexity to your ID generation (concatenate timestamp+uuid or timestamp+atomic counter).
It is always a best practice to keep your servers' clocks in sync (e.g. with ntpd).
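For what it's worth, a tiny sketch of the timestamp+uuid concatenation in Python (the exact format is an assumption):

import time
import uuid

def make_post_id():
    # Zero-padded millisecond timestamp keeps IDs roughly time-ordered as
    # strings; the UUID suffix keeps them unique across hosts.
    millis = int(time.time() * 1000)
    return "%013d-%s" % (millis, uuid.uuid4())

# e.g. '1668985588478-3f0c9a1e-...' -- sortable by creation time, up to the
# clock skew between hosts that the answer mentions.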
Use a dedicated sequence table. If you have only one sequence (say, PostId), then there's going to be only one row with two attributes in the table.
Yes, there's extra cost and effort of managing another table, but this is the best solution I know by far and haven't seen any one else mentioning it.
The table should have a key attribute as primary partition key, and a numeric value attribute with initial value of 1 (or whatever you want the initial value to be).
Every time you want to get the next available key, you tell DynamoDB to do this:
Increment the value where key = PostId by 1, and return the value before incrementing.
Note that this is one single atomic operation. DynamoDB handles the auto-incrementing, so there's no concurrency issues.
In code, there is more than one way of implementing this. Here's one example:
// Atomically add 1 to the "value" attribute of the "PostId" row in the
// "Sequences" table, returning the attributes as they were before the update.
Map<String, AttributeValue> key = new HashMap<>();
key.put("key", new AttributeValue("PostId"));

Map<String, AttributeValueUpdate> item = new HashMap<>();
item.put("value", new AttributeValueUpdate()
        .withAction(AttributeAction.ADD)
        .withValue(new AttributeValue().withN("1")));

UpdateItemRequest request = new UpdateItemRequest("Sequences", key, item)
        .withReturnValues(ReturnValue.ALL_OLD);
UpdateItemResult result = dynamoDBClient.updateItem(request);

Integer postId = Integer.parseInt(result.getAttributes().get("value").getN()); // <- this is the sequential ID you want to set to your post
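For reference, roughly the same atomic increment in Python with boto3 (a sketch assuming the same "Sequences" table; note that "value" is a DynamoDB reserved word, so it has to go through ExpressionAttributeNames):

import boto3

table = boto3.resource("dynamodb").Table("Sequences")

result = table.update_item(
    Key={"key": "PostId"},                     # assumes this row was seeded with an initial value
    UpdateExpression="ADD #v :inc",
    ExpressionAttributeNames={"#v": "value"},  # "value" is a reserved word
    ExpressionAttributeValues={":inc": 1},
    ReturnValues="UPDATED_OLD",                # value as it was before the increment
)
post_id = int(result["Attributes"]["value"])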
Another variation of Chen's suggestion is to have strict ordering of posts within a given forum thread, as opposed to globally across all threads. One way to do this is to have a Reply table with a hash key of ThreadId and a range key of ReplyId. The ReplyId would be a Number attribute starting at 0. Every time someone replies, your app does a Query on the Reply table for the single most recent reply on that thread (ScanIndexForward: false, Limit: 1, with the given ThreadId). To insert your new reply, use the ReplyId of the one returned by the Query, plus 1. Then use PutItem with a conditional write, so that if someone else replies at the same time, an error is returned and your app can start again with the Query.
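A sketch of that Query + conditional PutItem flow in Python with boto3 (table and attribute names follow the answer; the Body attribute and the retry loop are illustrative):

import boto3
from boto3.dynamodb.conditions import Key
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("Reply")

def add_reply(thread_id, body):
    while True:
        # Most recent reply in this thread, if any.
        latest = table.query(
            KeyConditionExpression=Key("ThreadId").eq(thread_id),
            ScanIndexForward=False,
            Limit=1,
        )["Items"]
        next_id = int(latest[0]["ReplyId"]) + 1 if latest else 0

        try:
            table.put_item(
                Item={"ThreadId": thread_id, "ReplyId": next_id, "Body": body},
                # Fails if a concurrent writer already took this ReplyId.
                ConditionExpression="attribute_not_exists(ReplyId)",
            )
            return next_id
        except ClientError as e:
            if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
                raise
            # Lost the race -- re-query and try again.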
If you want the simplest possible starting point, then the timestamp+uuid concatenation Chen suggests is the way to go. A global atomic counter item will be a scaling bottleneck, as Chen mentions, and based on what you've described, a global sequence number isn't required for your app.

Suggestions/Opinions for implementing a fast and efficient way to search a list of items in a very large dataset

Please comment and critique the approach.
Scenario: I have a large dataset (200 million entries) in a flat file. The data is of the form: a 10-digit phone number followed by 5-6 binary fields.
Every week I will be getting a delta file which will only contain changes to the data.
Problem: Given a list of items, I need to figure out whether each item (which will be a 10-digit number) is present in the dataset.
The approach I have planned:
1. Parse the dataset and put it in a DB (to be done at the start of the week) like MySQL or Postgres. The reason I want an RDBMS as the first step is that I want to keep the full time-series data.
2. Then generate some kind of key-value store out of this database with the latest valid data, one that supports checking whether each item is present in the dataset or not (thinking of some kind of NoSQL DB here, like Redis, optimised for search; it should have persistence and be distributed). This data structure will be read-only.
3. Query this key-value store to find out whether each item is present (if possible, match a list of values all at once instead of matching one item at a time). I want this to be blazing fast; I will be using this functionality as the back-end to a REST API.
Side note: my language of preference is Python.
A few considerations for the fast lookup:
If you want to check a set of numbers at a time, you could use Redis SINTER, which performs set intersection.
You might benefit from a grid structure that distributes number ranges over some hash function, such as the first digit of the phone number (there are probably better ones; you have to experiment). With a well-balanced hash and 10 nodes, this would reduce the size per node to around 20 million entries.
If you expect duplicate requests, which is quite likely, you could cache the last n requested phone numbers in a smaller set and query that one first.
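A rough Python sketch of the lookup side combining these ideas (key names and the shard-by-first-digit rule are illustrative):

import redis

r = redis.Redis()

def shard_key(phone):
    return "phones:" + phone[0]          # one set per leading digit

def load(numbers):
    pipe = r.pipeline()
    for n in numbers:
        pipe.sadd(shard_key(n), n)
    pipe.execute()

def check_many(numbers):
    # Pipeline the SISMEMBER calls so the whole batch costs one round trip.
    pipe = r.pipeline()
    for n in numbers:
        pipe.sismember(shard_key(n), n)
    return dict(zip(numbers, map(bool, pipe.execute())))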

Store data in DB so primary key is known before insertion based on data to be inserted

The basic gist of what I'm trying to accomplish is setting up an image processing server. As the page is generated in ColdFusion, multiple images on the page may need to be resized and thumbnailed to appropriate sizes, each to a possibly different size and each with a possibly different algorithm.
The basic gist of how it works: using a simple img tag, the src attribute points to the image server, along the lines of the following.
<img src="http://imageserver.com/<clientname>/<primarykey>.jpg">
This allows the image resizing to occur asynchronously, and on a different server, thus not slowing down the current page call.
When the image processing server receives the call, it first checks whether that file exists. If Apache determines the file exists, it serves it right away; otherwise, it invokes ColdFusion, which reads an entry from the database using the primary key passed to it to get the URL of the image to be processed and any associated parameters (in this case width, height, method, url, client, but possibly more in the future).
Currently I'm doing this using a hash system where the parameters are ordered alphabetically and then hashed. Is that a reasonable system, or will hash collisions eventually occur even though the data being hashed is quite small (between 50 and 200 characters)? Each client could likely store up to 10,000 images (in their own folder, so hash collisions would not be a problem across clients).
To reduce DB calls, as the page processes, each time a processed image is needed I add that image's information to an array. At the end of the page I make two calls to the DB: first to check whether the rows in my array already exist, then, if necessary, to add any rows that do not exist (storing their various parameters). The dilemma here is that the primary key (or what goes in the image tag) must be known before the row is actually inserted into the DB; this way I'm not doing a check for every single image, since some pages could have hundreds of images on them and that would be very inefficient.
Are hash collisions not a concern with this sample size (10k images per client generated by 50-200 character strings)? What about if I did something simple like <width>_<height>_<hash>.jpg or put the images in folders like /<client>/<width>x<height>/<hash>.jpg because that would further reduce the possibility of hash collisions (although not remove them)?
Any advice?
How are you hashing? Use SHA-512 for the hashing algorithm and you'll get a 128-character hex string. You may not want a URL that long, but the idea is that you can minimize collisions via stronger algorithms.
http://help.adobe.com/en_US/ColdFusion/9.0/CFMLRef/WSc3ff6d0ea77859461172e0811cbec22c24-7c52.html
Even though I doubt you would have to worry about hash collisions, you may want to just use a UUID.
http://help.adobe.com/en_US/ColdFusion/9.0/CFMLRef/WSc3ff6d0ea77859461172e0811cbec22c24-70de.html
EDIT: Or use a uniqueidentifier as the primary key of the table you are storing the file in. Then, after an insert, you can use the OUTPUT clause of the query to return the key to be used however you want.
The way I resolved this was by hashing not only the filename but also its parameters, such as width and height. With that, the probability of a hash collision is essentially zero until we hit millions (billions?) of records. So far we have had no hash collisions.
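For concreteness, a minimal sketch of that kind of deterministic parameter hashing in Python (field names are illustrative; with a 128-bit or larger digest, the birthday bound puts the collision probability for 10,000 items around 10,000^2 / 2^129, which is negligible):

import hashlib

def image_key(params):
    # Order the parameters alphabetically so the same inputs always hash the same.
    canonical = "|".join("%s=%s" % (k, params[k]) for k in sorted(params))
    return hashlib.sha256(canonical.encode()).hexdigest()

key = image_key({
    "url": "http://example.com/cat.jpg",
    "width": 200,
    "height": 150,
    "method": "crop",
    "client": "acme",
})
# e.g. use key + ".jpg" as the filename / primary key known before insertion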

Represent Ordering in a Relational Database

I have a collection of objects in a database. Images in a photo gallery, products in a catalog, chapters in a book, etc. Each object is represented as a row. I want to be able to arbitrarily order these objects, storing that ordering in the database, so that when I display them they will be in the right order.
For example, let's say I'm writing a book, and each chapter is an object. I write my book, and put the chapters in the following order:
Introduction, Accessibility, Form vs. Function, Errors, Consistency, Conclusion, Index
It goes to the editor, and comes back with the following suggested order:
Introduction, Form, Function, Accessibility, Consistency, Errors, Conclusion, Index
How can I store this ordering in the database in a robust, efficient way?
I've had the following ideas, but I'm not thrilled with any of them:
Array. Each row has an ordering ID; when the order is changed (via a removal followed by an insertion), the ordering IDs are updated. This makes retrieval easy, since it's just ORDER BY, but it seems easy to break.
// REMOVAL
UPDATE ... SET orderingID=NULL WHERE orderingID=removedID
UPDATE ... SET orderingID=orderingID-1 WHERE orderingID > removedID
// INSERTION
UPDATE ... SET orderingID=orderingID+1 WHERE orderingID > insertionID
UPDATE ... SET orderingID=insertionID WHERE ID=addedID
Linked list. Each row has a column for the id of the next row in the ordering. Traversal seems costly here, though there may be some way to use ORDER BY that I'm not thinking of.
Spaced array. Set the orderingID (as used in #1) with large gaps, so the first object is 100, the second is 200, etc. Then when an insertion happens, you just place it at (objectBefore + objectAfter)/2. Of course, this would need to be rebalanced occasionally so you don't have things too close together (even with floats, you'd eventually run into rounding errors).
None of these seem particularly elegant to me. Does anyone have a better way to do it?
Another alternative would be (if your RDBMS supports it) to use columns of type array. While this breaks the normalization rules, it can be useful in situations like this. One database I know of that has arrays is PostgreSQL.
The acts_as_list mixin in Rails handles this basically the way you outlined in #1. It looks for an INTEGER column called position (whose name you can override, of course) and uses that for an ORDER BY. When you want to re-order things, you update the positions. It has served me just fine every time I've used it.
As a side note, you can remove the need to always re-position on INSERTs/DELETEs by using sparse numbering, kind of like BASIC line numbers back in the day... you can number your positions 10, 20, 30, etc., and if you need to insert something between 10 and 20 you just insert it with a position of 15. Likewise, when deleting you can just delete the row and leave the gap. You only need to re-number when you actually change the order, or if you try to do an insert and there is no appropriate gap to insert into.
Of course depending on your particular situation (e.g. whether you have the other rows already loaded into memory or not) it may or may not make sense to use the gap approach.
If the objects aren't heavily keyed by other tables, and the lists are short, deleting everything in the domain and just re-inserting the correct list is the easiest. But that's not practical if the lists are large and you have lots of constraints to slow down the delete. I think your first method is really the cleanest. If you run it in a transaction you can be sure nothing odd happens while you're in the middle of the update to screw up the order.
Just a thought considering option #1 vs #3: doesn't the spaced array option (#3) only postpone the problem of the normal array (#1)? Whatever algorithm you choose, either it's broken, and you'll run into problems with #3 later, or it works, and then #1 should work just as well.
I did this in my last project, but it was for a table that only occasionally needed to be specifically ordered, and wasn't accessed too often. I think the spaced array would be the best option, because reordering would be cheapest in the average case, involving just a change to one value and a query on two.
Also, I would imagine ORDER BY would be pretty heavily optimized by database vendors, so leveraging that function would be advantageous for performance as opposed to the linked list implementation.
Use a floating point number to represent the position of each item:
Item 1 -> 0.0
Item 2 -> 1.0
Item 3 -> 2.0
Item 4 -> 3.0
You can place any item between any other two items by simple bisection:
Item 1 -> 0.0
Item 4 -> 0.5
Item 2 -> 1.0
Item 3 -> 2.0
(Moved item 4 between items 1 and 2).
The bisection process can continue for a long time due to the way floating point numbers are encoded, although with 64-bit floats you only get around 50 bisections between two adjacent values in the worst case before you run out of precision, so occasional renumbering is still needed.
Item 4 -> 0.5
Item 1 -> 0.75
Item 2 -> 1.0
Item 3 -> 2.0
(Move item 1 to the position just after Item 4)
Since I've mostly run into this with Django, I've found this solution to be the most workable. It seems that there isn't any "right way" to do this in a relational database.
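A minimal Python sketch of the position arithmetic (storage is left to whatever ORM/DB you use; the neighbour-handling rules are assumptions):

from typing import Optional

def position_between(before: Optional[float], after: Optional[float]) -> float:
    """Return a sort position strictly between two neighbouring positions."""
    if before is None and after is None:
        return 0.0                 # first item in an empty list
    if before is None:
        return after - 1.0         # insert at the front
    if after is None:
        return before + 1.0        # append at the end
    return (before + after) / 2.0  # bisect between the neighbours

# Moving item 4 between item 1 (pos 0.0) and item 2 (pos 1.0):
new_pos = position_between(0.0, 1.0)   # -> 0.5
# After roughly 50 bisections between the same two neighbours, 64-bit float
# precision runs out, so renumber the rows (0.0, 1.0, 2.0, ...) occasionally.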
I'd do a consecutive number, with a trigger on the table that "makes room" for a priority if it already exists.
I had this problem as well. I was under heavy time pressure (aren't we all) and I went with option #1, and only updated rows that changed.
If you swap item 1 with item 10, just do two updates to update the order numbers of item 1 and item 10. I know it is algorithmically simple, and it is O(n) worst case, but that worst case is when you have a total permutation of the list. How often is that going to happen? That's for you to answer.
I had the same issue and probably spent at least a week worrying about the proper data modeling, but I think I've finally got it. Using the array datatype in PostgreSQL, you can store the primary key of each ordered item and update that array with insertions or deletions when your order changes. Referencing a single row will allow you to map all your objects based on the ordering in the array column.
It's still a bit of a choppy solution, but it will likely work better than option #1, since option #1 requires updating the order number of all the other rows when the ordering changes.
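A rough sketch of that array-column approach with psycopg2 (table and column names are illustrative; psycopg2 adapts Python lists to PostgreSQL integer[] columns, and array_position needs PostgreSQL 9.5+):

import psycopg2

conn = psycopg2.connect("dbname=mydb")

def save_chapter_order(book_id, ordered_chapter_ids):
    # Store the full ordering as a single integer[] column on the parent row.
    with conn, conn.cursor() as cur:
        cur.execute(
            "UPDATE books SET chapter_order = %s WHERE id = %s",
            (ordered_chapter_ids, book_id),
        )

def load_chapters_in_order(book_id):
    with conn.cursor() as cur:
        # Sort each chapter by its position in the parent's ordering array.
        cur.execute(
            """
            SELECT c.*
            FROM chapters c
            JOIN books b ON b.id = %s
            WHERE c.id = ANY(b.chapter_order)
            ORDER BY array_position(b.chapter_order, c.id)
            """,
            (book_id,),
        )
        return cur.fetchall()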
Scheme #1 and Scheme #3 have the same complexity in every operation except INSERT writes. Scheme #1 has O(n) writes on INSERT and Scheme #3 has O(1) writes on INSERT.
For every other database operation, the complexity is the same.
Scheme #2 should not even be considered because its DELETE requires O(n) reads and writes. Scheme #1 and Scheme #3 have O(1) DELETE for both read and write.
New method
If your elements have a distinct parent element (i.e. they share a foreign key row), then you can try the following ...
Django offers a database-agnostic way of storing a list of integers in a CharField(). One drawback is that the max length of the stored string can't be greater than max_length, whose limit is DB-dependent.
In terms of complexity, this would give Scheme #1 O(1) writes for INSERT, because the ordering information would be stored as a single field in the parent element's row.
Another drawback is that a JOIN to the parent row is now required to update ordering.
https://docs.djangoproject.com/en/dev/ref/validators/#django.core.validators.validate_comma_separated_integer_list
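For concreteness, a sketch of what that might look like as Django models (model and field names are my own, not from the answer):

from django.core.validators import validate_comma_separated_integer_list
from django.db import models

class Book(models.Model):
    title = models.TextField()
    # Ordered child (Chapter) primary keys, e.g. "3,1,7,2".
    chapter_order = models.CharField(
        max_length=1024,
        blank=True,
        validators=[validate_comma_separated_integer_list],
    )

    def ordered_chapters(self):
        ids = [int(pk) for pk in self.chapter_order.split(",") if pk]
        chapters = self.chapter_set.in_bulk(ids)   # one query, keyed by pk
        return [chapters[pk] for pk in ids if pk in chapters]

class Chapter(models.Model):
    book = models.ForeignKey(Book, on_delete=models.CASCADE)
    title = models.TextField()

Inserting or deleting a chapter only touches the parent's chapter_order field (O(1) row writes), at the cost of the extra lookup against the parent row noted above.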