How to implement a scalable, unordered collection in DynamoDB? - indexing

I am looking into implementing a scalable unordered collection of objects on top of Amazon DynamoDB. So far the following options have been considered:
Use DynamoDB document data types (map, list) and use a document path to access stand-alone items. One obvious drawback is that the collection is limited to 400KB of data, meaning perhaps 1K-10K objects depending on their size. A less obvious drawback is that the cost of inserting a new object into such a collection is going to be huge: Amazon specifies that write capacity is deducted based on the total item size, not just the newly added object -- so inserting a 1KB object costs ~400 write capacity units once the item approaches the size limit. I consider this option ruled out.
Use a composite primary hash + range key, where the hash key remains the same for all objects in the collection, and the range key is something random or an atomic counter. The obvious drawback is that identical hash keys result in bad key distribution -- cardinality is low when a collection holds a large number of objects. This means bad partitioning and a scaling issue, with all reads/writes for the same collection hitting a single shard and becoming subject to DynamoDB's limit of 3,000 reads / 1,000 writes per second per partition.
Use a global secondary index with a secondary hash + range key, where the hash key remains the same for all objects belonging to the same collection, and the range key is something random or an atomic counter. Similar to the above, partitioning becomes poor for the GSI, and it becomes a bottleneck: too many identical hashes rapidly drain all of the capacity provisioned for the index. I haven't found how the GSI is implemented exactly, so I'm not sure how badly it suffers from low cardinality.
The question is whether I could live with (2) or (3) and accept the non-ideal key distribution, whether there is another way of implementing the collection that I have overlooked, or whether I should consider another NoSQL database engine altogether.
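For concreteness, here is a rough boto3 sketch of option (2); the table name "collections" and the attribute names are just placeholders for illustration:

```python
# Sketch of option (2): all objects of a collection share one hash key,
# the range key is random. Table/attribute names are hypothetical.
import uuid
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("collections")

def add_object(collection_id, payload):
    # Every write for the same collection hits the same partition key,
    # which is exactly the hot-key problem described above.
    table.put_item(Item={
        "collection_id": collection_id,   # hash key, identical for the whole collection
        "object_id": str(uuid.uuid4()),   # range key, random
        "payload": payload,
    })

def read_collection(collection_id):
    # Query requires an equality condition on the hash key,
    # so all reads for one collection are also pinned to one partition.
    resp = table.query(KeyConditionExpression=Key("collection_id").eq(collection_id))
    return resp["Items"]
```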

This is a "shooting from the hip" answer, what you end up doing may depend on how much and what type of reading and writing you do.
Two things the dynamo docs encourage you to avoid are hot keys and, in general, scans. You noted that in cases (2) and (3), you end up with a hot key. If you expect this to scale (large collections), the hot key will probably hurt more and more, especially if this is a write-intensive application.
The docs on Query and Scan operations (http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html) say that, for a query, "you must specify the hash key attribute name and value as an equality condition." So if you want to avoid scans, this might still force your hand and put you back into that hot key situation.
Maybe one route would be to embrace doing a scan operation, but have just one table devoted to your collection. Then you could have a fully random (well-distributed) hash key and do a scan every time. This assumes you always want everything from the collection (you didn't say). Scans will still hurt if you scale up to a large collection, but if you always want the full set back, you'll have to deal with that pain regardless. If you just want a subset, you can add a Limit parameter. This helps performance, but you will always get back the same subset unless you use the last evaluated key and keep paging. The docs also mention parallel scans.
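For example, a scan with a Limit and LastEvaluatedKey paging might look roughly like this in boto3 (the table name is made up):

```python
import boto3

table = boto3.resource("dynamodb").Table("my_collection")

def scan_all(page_size=100):
    # Paginate through the whole table; Limit caps the items read per request,
    # and LastEvaluatedKey lets the next request resume where the previous one stopped.
    kwargs = {"Limit": page_size}
    while True:
        resp = table.scan(**kwargs)
        for item in resp["Items"]:
            yield item
        if "LastEvaluatedKey" not in resp:
            break
        kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]
```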
If you are using AWS, ElastiCache/Redis might be another route to try. The first pass might code up a lot faster/cleaner than option (1) that you mentioned.
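If you go that way, an unordered collection maps naturally onto a Redis set; a minimal redis-py sketch (the host and key naming are assumptions):

```python
import redis

r = redis.Redis(host="my-cache.example.com", port=6379)

def add_object(collection_id, object_id):
    # SADD is O(1) per element and the set is unordered by definition.
    r.sadd(f"collection:{collection_id}", object_id)

def read_collection(collection_id):
    # SMEMBERS returns the whole set; use SSCAN for very large collections.
    return r.smembers(f"collection:{collection_id}")
```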

Related

Is using a timestamp as a hash key on a GSI in DynamoDB a good approach

I have a large (2B + records) DynamoDB table.
I want to implement a distributed locking process by adding a new field, 'index_due_at' when an item is created or updated. After the create/update, I will do some further processing on the item and then remove the 'index_due_at' field.
I'd like to create a sweeper job which will periodically extract any records with an outstanding 'index_due_at' field (on the assumption that something about the above process failed) to give those records further treatment. I would anticipate at most 100s of records in this state at any one time, more likely 10s.
To optimise the performance of the sweeper, I want to create a GSI including the new field (and project the key data into it).
It seems that using a timestamp (in millis) as the GSI HASH key ought to give a good distribution. And I don't need to query on this field's value, just on its presence. Can anyone identify any drawbacks in this approach and if so, suggest an alternative?
Issues I can anticipate include:
* Non-uniqueness in timestamps at milli level.
* Possible hash key problems with numeric values?
* Possible hash key problems with numeric values that don't vary much in the most significant digits.
This is less of a problem than you might be thinking. GSI hash keys don't actually have to be unique, so you're fine on that front.
You probably already know this, but your GSI will only contain items with GSI keys, so your GSI should be pretty small (100s of items).
One thought I have is that the index_due_at might actually be better as a GSI sort key rather than hash key. Data is sorted within a partition by sort key. So you could have a GSI hash key of index_due_at_flag which would be Y if present, then a sort key of index_due_at. This would mean all your data would be sorted naturally, so you could process it in date order.
That said, you are probably never going to Query this GSI, so I suspect your choice of keys hardly matters at all. Presumably you will just do a Scan, get all the items and try and process them all. In which case you would never even use the keys. Just having a key attribute present would put the item in the GSI.
Another thought is that you need to handle the fact that GSIs are not perfectly synchronous with the base table. It's possible (admittedly unlikely) that an item in your GSI has actually just been processed. Therefore, if your sweeper script picks up an item from the GSI, you should handle the possibility that it has already been updated in the base table (e.g. by checking the base table item before attempting to process it).
Good luck with it. I answered because I liked your bio! Hope staying on the right side of barrel shaped is working out :)
This should be a perfect scenario for using a DynamoDB sparse index.
Use 'index_due_at' as the sort key in a GSI: only the items you are interested in will be in the index, which greatly reduces the space needed and improves sweep performance.
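Roughly, with boto3 (table, index and attribute names are placeholders; the GSI is assumed to use index_due_at as one of its keys, so only flagged items appear in it):

```python
import time
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("my_table")

def process(item):
    # Placeholder for your actual re-processing logic.
    pass

def mark_for_indexing(item_id):
    # Adding the attribute makes the item appear in the sparse GSI.
    table.update_item(
        Key={"id": item_id},
        UpdateExpression="SET index_due_at = :ts",
        ExpressionAttributeValues={":ts": int(time.time() * 1000)},
    )

def clear_flag(item_id):
    # Removing the attribute drops the item from the sparse GSI again.
    # The condition guards against the GSI lagging behind the base table.
    try:
        table.update_item(
            Key={"id": item_id},
            UpdateExpression="REMOVE index_due_at",
            ConditionExpression="attribute_exists(index_due_at)",
        )
    except ClientError as e:
        if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise

def sweep():
    # The sparse GSI only contains the handful of unprocessed items,
    # so scanning it is cheap regardless of base-table size.
    resp = table.scan(IndexName="index_due_at-index")
    for item in resp["Items"]:
        process(item)
        clear_flag(item["id"])
```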

Aerospike - Store *small quantity* of large values

Scenario
Let's say I am storing up to 5 byte arrays, each 50kB, per user.
Possible Implementations:
1) One byte array per record, indexed by secondary key.
Pros: Fast read/write.
Cons: High cardinality query (up to 5 results per query). Bad for horizontal scaling, if byte arrays are frequently accessed.
2) All byte arrays in single record in separate bins
Pros: Fast read
Neutral: Blocksize must be greater than 250kB
Cons: Slow write (one change means rewriting all byte arrays).
3) Store byte arrays in a LLIST LDT
Pros: Avoid the cons of solution (1) and (2)
Cons: LDTs are generally slow
4) Store each byte array in a separate record, keyed to a UUID. Store a UUID list in another record.
Pros: Writes to each byte array do not require rewriting all arrays. No low-cardinality concern of secondary indexes. Avoids use of LDTs.
Cons: A client read is 2-stage: Get list of UUIDs from meta record, then multi-get for each UUID (very slow?)
5) Store each byte array as a separate record, using a pre-determined primary key scheme (e.g. userid_index, e.g. 123_0, 123_1, 123_2, 123_3, 123_4)
Pros: Avoid 2-stage read
Cons: Theoretical collision possibility with another user (e.g. user1_index1 and user2_index2 produce the same hash). I know this is (very, very) low-probability, but avoidance is still preferred (imagine one user being able to read the byte array of another user due to a collision).
My Evaluation
For balanced read/write OR high read/low write situations, use #2 (One record, multiple bins). A rewrite is more costly, but avoids other cons (LDT penalty, 2-stage read).
For a high (re)write/low read situation, use #3 (LDT). This avoids having to rewrite all byte arrays when one of them is updated, due to the fact that records are copy-on-write.
Question
Which implementation is preferable, given the current data pattern (small quantity, large objects)? Do you agree with my evaluation (above)?
Here is some input. (I want to disclose that I do work at Aerospike).
Do avoid #3. Do not use LDT as the feature is definitely not as mature as the rest of the platform, especially when it comes to performance / reliability during cluster rebalance (migrations) situations when nodes leave/join a cluster.
I would try to stick as much as possible with basic key/value transactions. That should always be the fastest and most scalable. As you pointed out, option #1 would not scale. Secondary indexes also have an overhead in memory and currently do not allow for fast restart (enterprise edition only anyway).
You are also correct on #2 for high write loads, especially if you are going to always update 1 bin...
So, this leaves options #4 and #5. For option #5, the collision will not happen in practice. You can go over the math; it will simply not happen. If it does, you will get famous and can publish a paper :) (there may even be a prize for having found a collision). Also, note that you have the option to store the key along with the record, which will give you a 'key check' on writes that should be very cheap (since records are read anyway before being written). Option #4 would work as well; it will just do an extra read (which should be super fast).
It all depends on where you want the bit of extra complexity. You can do some simple benchmarking between the two options if you have that luxury before deciding.
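For illustration, a rough sketch of option #5 with the Aerospike Python client, storing the key along with the record to get the cheap 'key check' on writes (the namespace, set name and the userid_index scheme are placeholders):

```python
import aerospike
from aerospike import exception as ex

client = aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()

# Send the original key with the record so the server stores it and can
# verify it on writes (the cheap 'key check' mentioned above).
WRITE_POLICY = {"key": aerospike.POLICY_KEY_SEND}

def put_blob(user_id, index, data):
    # Pre-determined key scheme, e.g. "123_0" .. "123_4" (option #5).
    key = ("test", "user_blobs", f"{user_id}_{index}")
    client.put(key, {"data": bytearray(data)}, policy=WRITE_POLICY)

def get_blobs(user_id, count=5):
    # One single-record read per slot; no secondary index, no LDT.
    blobs = []
    for i in range(count):
        try:
            _, _, bins = client.get(("test", "user_blobs", f"{user_id}_{i}"))
            blobs.append(bins["data"])
        except ex.RecordNotFound:
            break  # fewer than `count` arrays stored for this user
    return blobs
```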

Java EE/SQL: Is there a significant performance lag between primary key types?

Currently I am involved in learning some basics of the Java EE technology. I encountered a particular project and took a deeper look into the underlying database structure.
On server-side I investigated a Java function that creates a primary key with a length of 32 characters (based on concatenating the time, a random hash, and an additional cryptographic nonce).
I am interested in an estimate of the performance loss caused by using such a primary key. If there is no security reason to create this kind of unique ID, wouldn't it be much better to let the underlying database generate new, increasing primary keys, starting at 0?
Wouldn't a SQL/JQL search be much faster when using numbers instead of strings?
Using numbers will probably be faster, but you should measure it with a test case if you need the performance ratio between both options.
I don't think number comparison vs string comparison will give a big performance advantage by itself. However:
larger fields typically mean less data per table block, so you have to read more blocks from the DB in the case of a full scan (it will be slower)
accordingly, larger keys typically mean fewer keys per index block, so you have to read more index blocks in the case of index scans (it will be slower)
larger fields are, well, larger, so by definition they are less space-efficient.
Note that we are talking about data size and not data type: most likely an 8-byte integer will not be significantly more efficient than an 8-byte string.
Note also that using random IDs is usually more "clusterable" than using sequence numbers, as sequences / autonumerics need to be administered centrally (although this can be mitigated using techniques such as the Hi-Lo algorithm; most current persistence frameworks support this technique).
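To illustrate the Hi-Lo idea, a minimal Python sketch (the allocate_hi stand-in represents whatever central sequence or table your framework uses; this is a sketch of the concept, not any particular framework's implementation):

```python
import itertools

# Stand-in for the centrally administered sequence: each call reserves
# the next "hi" block. In a real system this would be a DB sequence or table.
_hi_source = itertools.count()

def allocate_hi():
    return next(_hi_source)

class HiLoGenerator:
    """Hands out ids locally from a reserved block; the central sequence
    is only contacted once per BLOCK_SIZE ids."""
    BLOCK_SIZE = 100

    def __init__(self):
        self._hi = allocate_hi()
        self._lo = 0

    def next_id(self):
        if self._lo >= self.BLOCK_SIZE:
            self._hi = allocate_hi()   # reserve a new block
            self._lo = 0
        value = self._hi * self.BLOCK_SIZE + self._lo
        self._lo += 1
        return value
```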

What kind of overhead do non clustered indexes add?

If you are talking about B-trees, I wouldn't imagine that the additional overhead of a non-clustered index (not counting things like full-text search or other kinds of string indexing) is even measurable, except in an extremely high-volume, high-write scenario.
What kind of overhead are we actually talking about? Why would it be a bad idea to just index everything? Is this implementation specific? (in that case, I am mostly interested in answers around pg)
EDIT: To explain the reasoning behind this a bit more...
We are looking to specifically improve performance across the board right now, and one of the key things we are looking at is query performance. I have read the things mentioned here, that indexes will increase db size on disk and will slow down writes. The question came up today when one pair did some pre-emptive indexing on a new table, since we usually apply indexes in a more reactive way. Their argument was that they weren't indexing string fields and they weren't creating clustered indexes, so the negative impact of possibly redundant indexes should be barely measurable.
Now, I am far from an expert in such things, and those arguments made a lot of sense to me based on what I understand.
Now, I am sure there are other reasons, or I am misunderstanding something. I know a redundant index will have a negative effect; what I want to know is how bad it will be (because it seems negligible). Indexing every field is a worst-case scenario, but I figured that if people could tell me what that would do to my db, it would help me understand the concerns around being conservative with indexing versus just throwing indexes out there when they might help.
Random thoughts
Indexes benefit reads of course
You should index where you get the most bang for your buck
Most DBs are > 95% read (think about updates, FK checks, duplicate checks etc = reads)
"Everything" is pointless: most indexed need to be composite with includes
Define high volume we have 15-20 million new rows per day with indexes
Introduction to Indices
In short, an index, whether clustered or non-, adds extra "branches" to the "tree" in which data is stored by most current DBMSes. This makes finding values with a single unique combination of the index logarithmic-time instead of linear-time. This reduction in access time speeds up many common tasks the DB does; however, when performing tasks other than that, it can slow it down because the data must be accessed through the tree. Filtering based on non-indexed columns, for instance, requires the engine to iterate through the tree, and because the ratio of branch nodes (containing only pointers to somewhere else in the tree) to leaf nodes has been reduced, this will take longer than if the index were not present.
In addition, non-clustered indices separate data based on column values, but if those column values are not very unique across all table rows (like a flag indicating "yes" or "no"), then the index adds an extra level of complexity that doesn't actually help the search; in fact, it hinders it because in navigating from root to leaves of the tree, an extra branch is encountered.
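As a toy illustration of the logarithmic-vs-linear point (a sorted Python list standing in for the index tree; real DBMS storage is of course different):

```python
import bisect

# "Table": a list of rows; "index": sorted (key, position) pairs standing in
# for the tree a real index uses.
rows = [{"id": i, "flag": "Y" if i % 2 else "N"} for i in range(100_000)]
id_index = sorted((row["id"], pos) for pos, row in enumerate(rows))
index_keys = [k for k, _ in id_index]

def lookup_by_id(wanted_id):
    # Binary search over the index: O(log n), like walking down an index tree.
    i = bisect.bisect_left(index_keys, wanted_id)
    if i < len(index_keys) and index_keys[i] == wanted_id:
        return rows[id_index[i][1]]
    return None

def lookup_by_flag(wanted_flag):
    # Filtering on a column with no useful index: every row is visited, O(n).
    # A two-valued "flag" column is also the low-cardinality case where an
    # index would not narrow the search much anyway.
    return [row for row in rows if row["flag"] == wanted_flag]
```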
I am sure the exact overhead is probably implementation specific, but off the top of my head some points:
Increased Disk Space requirements.
All writes (inserts, updates, deletes) cost more as all indexes must be updated.
Increased transaction locking overhead (all indexes must be updated within a transaction, leading to more locks being required, etc).
Potentially increased complexity for the query optimizer (choosing which index is most likely to perform best; Also potential for one index to be chosen when another index would actually be better).

Efficient Hashmap Use

What is the more efficient approach for using hashmaps?
A) Use multiple smaller hashmaps, or
B) store all objects in one giant hashmap?
(Assume that the hashing algorithm for the keys is fairly efficient, resulting in few collisions)
CLARIFICATION: Option A implies segregation by primary key -- i.e. no additional lookup is necessary to determine which actual hashmap to use. (For example, if the lookup keys are alphanumeric, Hashmap 1 stores the A's, Hashmap 2 stores the B's, and so on.)
Definitely B. The advantage of hash tables is that the average number of comparisons per lookup is independent of the size.
If you split your map into N smaller hashmaps, you will have to search half of them on average for each lookup. If the smaller hashmaps have the same load factor that the larger map would have had, you will increase the total number of comparisons by a factor of approximately N/2.
And if the smaller hashmaps have a smaller load factor, you are wasting memory.
All that is assuming you distribute the keys randomly between the smaller hashmaps. If you distribute them according to some function of the key (e.g. a string prefix) then what you have created is a trie, which is efficient for some applications (e.g. auto-complete in web forms.)
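A small Python sketch of the two layouts (dicts standing in for hashmaps); note that with prefix segregation each lookup still touches only one map, the prefix just acts as the first level of a very shallow trie:

```python
# Layout B: one big map.
big_map = {}

def put_b(key, value):
    big_map[key] = value

def get_b(key):
    return big_map.get(key)

# Layout A: one map per first character of the key (prefix segregation).
small_maps = {}

def put_a(key, value):
    small_maps.setdefault(key[0], {})[key] = value

def get_a(key):
    # The prefix tells us which map to use, so there is still exactly one
    # probe per lookup -- effectively the first level of a trie.
    return small_maps.get(key[0], {}).get(key)
```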
Are these maps used in logically distinct places? For instance, I wouldn't have one map containing users, cached query results, loggers etc, just because you happen to know the keys won't clash. However, I equally wouldn't split up a single map into multiple maps.
Keep one hashmap for each logical mapping from key to value.
In addition to @Jon's answer, there can be practical reasons why you want to maintain separate hash tables.
If you have separate tables for different mappings you can 'clear' each of the mappings independently; e.g. by calling 'clear' or getting rid of the reference to the corresponding table.
If the separate tables hold mappings to cached entries, you can use different strategies to 'age' the respective entries.
If the application is multi-threaded, using separate tables may reduce lock contention, and may (for some processor architectures) increase processor memory cache hit ratios.
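To illustrate the lock-contention point, a minimal Python sketch with two independently locked tables (the class and table names are made up):

```python
import threading

class LockedTable:
    """A dict guarded by its own lock; two instances never contend."""
    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def clear(self):
        # Each table can be cleared (or aged) independently of the others.
        with self._lock:
            self._data.clear()

users = LockedTable()
query_cache = LockedTable()   # clearing this never blocks lookups in `users`
```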