How does Bigtable serve write requests?

I'm reading Google's Bigtable paper. I noticed that in section 5.3, it says:
Updates are committed to a commit log that stores redo records. Of these updates, the recently committed ones are stored in memory in a sorted buffer called a memtable; the older updates are stored in a sequence of SSTables.
What confuses me is that, according to this answer, an SSTable should store sorted key-value pairs. But the text quoted above gives me the feeling that both the memtable and the SSTables store update operations rather than the actual values. So what does Bigtable actually do when a write request comes in?

According to official documentation [1]:
"A Cloud Bigtable table is sharded into blocks of contiguous rows, called tablets, to help balance the workload of queries. (Tablets are similar to HBase regions.) Tablets are stored on Colossus, Google's file system, in SSTable format. An SSTable provides a persistent, ordered immutable map from keys to values, where both keys and values are arbitrary byte strings. Each tablet is associated with a specific Cloud Bigtable node. In addition to the SSTable files, all writes are stored in Colossus's shared log as soon as they are acknowledged by Cloud Bigtable, providing increased durability."
The official documentation links to this document, where it is explained in more detail [2]:
“A "Sorted String Table" then is exactly what it sounds like, it is a file which contains a set of arbitrary, sorted key-value pairs inside. Duplicate keys are fine, there is no need for "padding" for keys or values, and keys and values are arbitrary blobs.
If we need to preserve the fast read access which SSTables give us, but we also want to support fast random writes, turns out, we already have all the necessary pieces: random writes are fast when the SSTable is in memory, that is the definition of the memtable.”
In effect, what happens during a write is that the Tablet Server (Cloud Bigtable node) generates a commit log entry describing the mutation, plus a modification to the row in the memtable. Once the memtable grows too large, it is compacted into immutable SSTables, partitioned by locality group (column family), and each SSTable is then added to the stack of SSTables for its locality group.
Note that an individual SSTable does not contain the cell values for all rows in the locality group, only the updates that were in the memtable when it was flushed. Reads may need to merge updates from one or more SSTables in the locality group to construct a response.
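To make the write path concrete, here is a minimal LSM-style sketch in Python. It is only an illustration under assumptions, not Bigtable's actual implementation: class and parameter names (Tablet, flush_threshold) are made up, and the real system persists the log and SSTables to Colossus rather than keeping Python lists in memory.

    import bisect

    class Tablet:
        """Toy LSM-style write path: commit log + sorted memtable + immutable SSTables."""

        def __init__(self, flush_threshold=4):
            self.commit_log = []       # redo records, appended first for durability
            self.memtable = {}         # recent mutations; sorted when flushed
            self.sstables = []         # immutable, sorted runs produced by flushes
            self.flush_threshold = flush_threshold

        def write(self, key, value):
            self.commit_log.append((key, value))  # 1. log the redo record
            self.memtable[key] = value            # 2. apply the mutation to the memtable
            if len(self.memtable) >= self.flush_threshold:
                self._flush()                     # 3. minor compaction: freeze the memtable

        def _flush(self):
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

        def read(self, key):
            # Newest data wins: memtable first, then SSTables from newest to oldest.
            if key in self.memtable:
                return self.memtable[key]
            for run in reversed(self.sstables):
                i = bisect.bisect_left(run, (key,))
                if i < len(run) and run[i][0] == key:
                    return run[i][1]
            return None

The point of the sketch is the ordering: the mutation is durable as soon as it is in the log, the memtable absorbs it immediately, and SSTables only ever appear as whole, sorted, immutable files.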
See section "5.4 Compactions" in the paper [3] for more information on how mutations may be moved around to increase performance. Furthermore, see the heading "Locality Groups" under section "6 Refinements" for more information on the implications of using locality groups.
[1] https://cloud.google.com/bigtable/docs/overview#architecture
[2] https://www.igvita.com/2012/02/06/sstable-and-log-structured-storage-leveldb/
[3] https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf

Does Google BigTable support range scan?

I am student learning Google BigTable design.
I am confused: each SSTable is sorted internally, but keys are not sorted across two different SSTables. In that case, it seems Bigtable can't support an efficient range scan on the primary key? For example, for "select * where id between 100 and 200", Bigtable may need to scan all the SSTables to get the result.
My understanding of why an SSTable is sorted is that, for a single primary-key query, we can do a binary search within an SSTable.
Another question I have is: is the memtable sorted? If yes, how? The memtable needs to be updated frequently. If it uses a data structure like a tree, do we need to traverse the tree when we write the memtable out to an SSTable?
It sounds like you've at least been through an overview of the original Bigtable paper, but here's a reference if you haven't read the whole thing; your questions can mostly be answered by a closer read: https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf
Your intuitions about Bigtable are spot on. Both the SSTables on disk and the memtable are sorted by the primary key, and any read (not just a scan) requires consulting all of them to produce a merged view. However, note that they are all sorted on the same key, so this amounts to a parallel traversal: we seek to the beginning of the range to be read in each SSTable and in the memtable, and traverse them in parallel from there.
This process is alluded to in section 5.3: "A valid read operation is executed on a merged view of the sequence of SSTables and the memtable. Since the SSTables and the memtable are lexicographically sorted data structures, the merged view can be formed efficiently."
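As a rough illustration (not the actual Bigtable code), here is what such a merged range read looks like in Python: heapq.merge performs exactly this kind of parallel traversal over already-sorted runs. Deduplication of a key that appears in several runs (newest wins) and deletion markers are omitted for brevity, and the sample data is made up.

    import heapq

    # Hypothetical sorted runs: two on-disk SSTables and the in-memory memtable,
    # each already sorted by key.
    sstable_1 = [(100, 'a'), (150, 'b'), (300, 'c')]
    sstable_2 = [(120, 'x'), (180, 'y')]
    memtable  = [(110, 'm'), (200, 'n')]

    def range_scan(runs, lo, hi):
        # Merge the sorted runs lazily and keep only keys in [lo, hi].
        for key, value in heapq.merge(*runs):
            if key > hi:
                break
            if key >= lo:
                yield key, value

    print(list(range_scan([sstable_1, sstable_2, memtable], 100, 200)))
    # [(100, 'a'), (110, 'm'), (120, 'x'), (150, 'b'), (180, 'y'), (200, 'n')]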
Some of these lookups can be mitigated using Bloom Filters as described in section 6 of the paper, but not all.
Of course, if there are too many sstables in a tablet, this will still become inefficient. Section 5.4 of the paper goes into a bit more detail about how this is solved, which is to periodically merge the "stack" of sstables for a tablet into a smaller number of new sstables. If a particular tablet stops receiving writes, this will eventually quiesce its state down to a single file.
Regarding the efficiency of the memtable, the paper does not prescribe a particular data structure. However, suffice it to say that there are many examples of efficient in-memory sorted data structures. In addition, Section 5.4 mentions that there can actually be multiple memtables in a given tablet. By the time we scan a memtable to write it out to disk, we have already installed a new memtable at the top of the tablet "stack" and are serving the latest incoming reads from there.

Which approach is better when using Redis?

I'm facing following problem:
I want to keep track of tasks given to users, and I want to store this state in Redis.
I can do:
1) create one list called "dispatched_tasks" holding many objects (username, task)
2) create many (potentially thousands of) lists called dispatched_tasks:username, each usually holding a few objects (task)
Which approach is better? If I only thought of my own convenience, I would choose the second one, since from time to time I will have to search for a particular user's tasks, and the second approach gives me that for free.
But how about Redis? Which approach will be more performant?
Thanks for any help.
Redis supports different kinds of data structures as shown here. There are different approaches you can take:
Scenario 1:
Using a list data type, your list will contain all the task/user combinations for your problem. However, accessing and deleting a task runs in O(n) time complexity (it has to traverse the list to get to the element). This can have an impact on performance if your users have a lot of tasks.
Using sets:
Similar to lists, but you can add/delete/check for existence in O(1), and set elements are unique. So if you add another username/task pair that already exists, it won't be added again.
Scenario 2:
The data types do not change. The only difference is that there will be a lot more keys in Redis, which can increase the memory footprint.
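A minimal redis-py sketch of the two layouts, assuming a local Redis instance and made-up member names, could look like this:

    import redis

    r = redis.Redis(host='localhost', port=6379, db=0)

    # Scenario 1: one big collection of (username, task) pairs.
    # A set gives O(1) add/remove/membership checks and de-duplicates entries.
    r.sadd('dispatched_tasks', 'alice:task42', 'bob:task7')
    r.srem('dispatched_tasks', 'bob:task7')

    # Scenario 2: one small set per user. Many more keys, but "tasks of user X"
    # becomes a single key lookup plus an SMEMBERS over a small set.
    r.sadd('dispatched_tasks:alice', 'task42')
    r.sadd('dispatched_tasks:bob', 'task7')
    print(r.smembers('dispatched_tasks:alice'))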
From the FAQ:
What is the maximum number of keys a single Redis instance can hold? And what is the max number of elements in a Hash, List, Set, Sorted Set?
Redis can handle up to 2^32 keys, and was tested in practice to handle at least 250 million keys per instance.
Every hash, list, set, and sorted set can hold 2^32 elements.
In other words your limit is likely the available memory in your system.
What's the Redis memory footprint?
To give you a few examples (all obtained using 64-bit instances):
An empty instance uses ~ 3MB of memory.
1 Million small Keys -> String Value pairs use ~ 85MB of memory.
1 Million Keys -> Hash value, representing an object with 5 fields, use ~ 160 MB of memory.
To test your use case is trivial using the redis-benchmark utility to generate random data sets and check with the INFO memory command the space used.
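For example, with redis-py you can load a batch of test keys and compare used_memory before and after; the key names below are just for illustration:

    import redis

    r = redis.Redis()
    before = r.info('memory')['used_memory']

    # Load dummy keys in a pipeline, similar in spirit to what redis-benchmark does.
    pipe = r.pipeline()
    for i in range(100000):
        pipe.set('test:key:%d' % i, 'x' * 10)
    pipe.execute()

    after = r.info('memory')['used_memory']
    print('~%.1f MB for 100k small string keys' % ((after - before) / (1024.0 * 1024)))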

Does redis key size also include the data size for that key or just the key itself?

I'm trying to analyze the DB size for our Redis database and tweak the storage of our data per a few articles such as https://davidcel.is/posts/the-story-of-my-redis-database/
and https://engineering.instagram.com/storing-hundreds-of-millions-of-simple-key-value-pairs-in-redis-1091ae80f74c
I've read documentation about "key sizes" (i.e. https://redis.io/commands/object)
and tried running various tools like:
redis-cli --bigkeys
and also tried to read the output from the redis-cli:
INFO memory
The size semantics are not clear to me.
Does the reported size reflect ONLY the size of the key itself, i.e. if my key is "abc" and the value is "value1", is the reported size for the "abc" portion only? The same question applies to complex data structures for that key, such as a hash, array, or list.
Trial and error doesn't seem to give me a clear result.
Different tools give different answers.
First, read about --bigkeys: it reports big value sizes in the keyspace, excluding the space taken by the key's name. Note that in this case the size of the value means something different for each data type, i.e. Strings are sized by their STRLEN (bytes), whereas all other types are sized by the number of their nested elements.
So it gives little indication of actual memory usage, but rather does what it is intended to do: find big keys (not big key names, only estimated big values).
INFO MEMORY is a different story. The used_memory is reported in bytes and reflects the entire RAM consumption of key names, their values and all associated overheads of the internal data structures.
There is also DEBUG OBJECT, but note that its output is not a reliable way to measure the memory consumption of a key in Redis: the serializedlength field is the number of bytes needed for persisting the object, not the actual footprint in memory, which includes various administrative overheads on top of the data itself.
Lastly, as of v4 we have the MEMORY USAGE command that does a much better job - see https://github.com/antirez/redis-doc/pull/851 for the details.
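To see the difference in practice, here is a quick redis-py sketch (assuming Redis 4.0+ for MEMORY USAGE and a throwaway key; if your client version lacks the memory_usage/debug_object helpers, execute_command works as well):

    import redis

    r = redis.Redis()
    r.set('user:1000:name', 'some fairly long value ' * 10)

    # MEMORY USAGE: actual RAM used by the key name, its value and internal overheads.
    print(r.memory_usage('user:1000:name'))

    # DEBUG OBJECT serializedlength: bytes needed to persist the value,
    # not its in-memory footprint, so it is usually smaller.
    print(r.debug_object('user:1000:name')['serializedlength'])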

Affinity Key in Aerospike

In the Aerospike documentation, it is mentioned that Aerospike has 4096 logical partitions; each key is hashed and eventually mapped to one of the partitions between 1 and 4096, which determines on which node the data for that key is stored.
However if we have two keys "A" and "AB" and we want to store them in the same node, is there a way?
In Redis it can be achieved by making the keys as "A" and "{A}B" that will make sure that the key "{A}B" will go to a node where "A" is hashed and stored.
In Apache Ignite, same can be done using "AffinityKey".
Does a similar idea exist in Aerospike?
Thanks
Aerospike was designed as a distributed database. Redis was designed to run on a single node, and natively lacks concepts such as data distribution, clustering, replication, and failover. I'm aware that you can use various application-side shenanigans to make it into an ad-hoc cluster.
Don't worry about the implementation details of Aerospike's data distribution. Those happen automatically between the client and cluster, and don't require you to do anything on the application side. Instead, think about your access patterns.
First, your Aerospike cluster will make sure the data is evenly distributed. Because work is directly proportional to data, you should make sure the nodes are homogeneous. You can then expect multi-node operations to wrap up in roughly the same amount of time on each node.
You can create a secondary index on the fields that you'll be querying often to enhance the speed of the query. Release 3.12 adds predicate filtering, allowing you to create more complex query predicates on top of the initial secondary index based filter (also see the Java client's PredExp class).
If you don't want to use secondary indexes (there are several valid reasons), you can create your own lookup using external records. In a set called country-school you can have a record for each country (keys such as 'india', 'luxembourg') with the value being a list containing the IDs of the schools in that country. You can get the list with a single get (or a batch-get if it spans several records, such as india-1, india-2, ... , india-9999), then use the results to compose a batch-get operation for the schools. Batch reads return results in the order you asked, so you can get a large batch, check whether the last element is null, and if not, get another batch.
('ns1', 'country-school', 'us-california') => [ 1, 2, 3, 5, 8, 11, .. ]
Similarly, you can create permutations such as country-state-city, (example, us-california-oakland) with smaller lists. This costs some extra space, but gives you faster (key-value based) retrieval without spending memory on secondary indexes.
('ns1', 'country-school', 'us-california-oakland') => [ 1, 5, 42, .. ]
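A rough sketch of that lookup-record pattern with the Aerospike Python client (the set names, bin names, and record layout here are made up for the example):

    import aerospike

    config = {'hosts': [('127.0.0.1', 3000)]}
    client = aerospike.client(config).connect()

    # 1. Fetch the lookup record for the region; its 'schools' bin holds school IDs.
    _, _, lookup = client.get(('ns1', 'country-school', 'us-california'))
    school_ids = lookup['schools']                 # e.g. [1, 2, 3, 5, 8, 11, ...]

    # 2. Batch-read the actual school records in one round trip.
    keys = [('ns1', 'schools', school_id) for school_id in school_ids]
    records = client.get_many(keys)                # results come back in the order requested

    for _, meta, bins in records:
        if bins is not None:                       # a missing record comes back as None
            print(bins['name'])

    client.close()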

How to implement a scalable, unordered collection in DynamoDB?

I am looking into implementing a scalable unordered collection of objects on top of Amazon DynamoDB. So far the following options have been considered:
Use DynamoDB document data types (map, list) and use document paths to access stand-alone items. This has one obvious drawback: the collection is limited to 400KB of data, meaning perhaps 1..10K objects depending on their size. A less obvious drawback is that the cost of inserting a new object into such a collection is going to be huge: Amazon specifies that write capacity is deducted based on the total item size, not just the newly added object, therefore ~400 capacity units for inserting a 1KB object when approaching the size limit. So I consider this one ruled out.
Use a composite primary hash + range key, where the primary hash remains the same for all objects in the collection, and the range key is just something random or an atomic counter. The obvious drawback is that an identical hash key results in bad key distribution: cardinality is low when there are collections with a large number of objects. This means bad partitioning, and a scaling issue, with all reads/writes on the same collection being stuck on one shard and becoming subject to the 3,000 reads / 1,000 writes per second limitation of a DynamoDB partition.
Use a global secondary index with a secondary hash + range key, where the hash key remains the same for all objects belonging to the same collection, and the range key is just something random or an atomic counter. Similar to the above, partitioning becomes poor for the GSI, and it will become a bottleneck, with too many identical hashes rapidly draining all the capacity provisioned for the index. I couldn't find how the GSI is implemented exactly, so I'm not sure how badly it suffers from low cardinality.
The question is whether I could live with (2) or (3) and suffer from non-ideal key distribution, whether there is another way of implementing the collection that I overlooked, or whether I should consider looking into another NoSQL database engine altogether.
This is a "shooting from the hip" answer, what you end up doing may depend on how much and what type of reading and writing you do.
Two things the dynamo docs encourage you to avoid are hot keys and, in general, scans. You noted that in cases (2) and (3), you end up with a hot key. If you expect this to scale (large collections), the hot key will probably hurt more and more, especially if this is a write-intensive application.
The docs on Query and Scan operations (http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html) say that, for a query, "you must specify the hash key attribute name and value as an equality condition." So if you want to avoid scans, this might still force your hand and put you back into that hot key situation.
Maybe one route would be to embrace doing a scan operation, but just have one table devoted to your collection. Then you could just have a fully random (well distributed) hash key and do a scan every time. This assumes you always want everything from the collection (you didn't say). This will still hurt if you scale up to a large collection, but if you always want the full set back, you'll have to deal with that pain regardless. If you just want a subset, you can add a limit parameter. This would help performance, but you will always get back the same subset (or you can use the last evaluated key and keep going). The docs also mention parallel scans.
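If you do go the scan route, a boto3 sketch of paging through such a table with Limit and LastEvaluatedKey might look roughly like this (the table name and page size are hypothetical):

    import boto3

    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('my_collection')   # one table devoted to the collection

    def scan_all(page_size=100):
        """Yield every item, following LastEvaluatedKey until the scan is exhausted."""
        kwargs = {'Limit': page_size}
        while True:
            response = table.scan(**kwargs)
            for item in response.get('Items', []):
                yield item
            last_key = response.get('LastEvaluatedKey')
            if last_key is None:
                break
            kwargs['ExclusiveStartKey'] = last_key   # resume where the last page stopped

    for obj in scan_all():
        print(obj)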
If you are using AWS, ElastiCache/Redis might be another route to try. The first pass might code up a lot faster/cleaner than option (1) that you mentioned.