Does redis key size also include the data size for that key or just the key itself?

I'm trying to analyze the size of my Redis DB and tweak how we store our data, per a few articles such as https://davidcel.is/posts/the-story-of-my-redis-database/
and https://engineering.instagram.com/storing-hundreds-of-millions-of-simple-key-value-pairs-in-redis-1091ae80f74c
I've read documentation about "key sizes" (e.g. https://redis.io/commands/object)
and tried running various tools like:
redis-cli --bigkeys
and also tried to read the output from the redis-cli:
INFO memory
The size semantics are not clear to me.
Does the reported size reflect ONLY the size of the key itself, i.e. if my key is "abc" and the value is "value1", is the reported size just for the "abc" portion? The same question applies to complex data structures stored at that key, such as a hash, array, or list.
Trial and error doesn't seem to give me a clear result.

Different tools give different answers.
First, about --bigkeys: it reports big value sizes in the keyspace, excluding the space taken by the key's name. Note that in this case the size of the value means something different for each data type, i.e. Strings are sized by their STRLEN (bytes), whereas all others are sized by the number of their nested elements.
So it basically gives little indication of actual usage, but rather does what it is intended to do - find big keys (not big key names, only estimated big values).
INFO MEMORY is a different story. The used_memory is reported in bytes and reflects the entire RAM consumption of key names, their values and all associated overheads of the internal data structures.
There is also DEBUG OBJECT, but note that its output is not a reliable way to measure the memory consumption of a key in Redis - the serializedlength field is given in bytes needed for persisting the object, not the actual footprint in memory, which includes various administrative overheads on top of the data itself.
Lastly, as of v4 we have the MEMORY USAGE command that does a much better job - see https://github.com/antirez/redis-doc/pull/851 for the details.
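To see the difference in practice, here is a minimal sketch (assuming the redis-py client against a local instance; note that DEBUG OBJECT may be disabled on managed services) that prints the different numbers for a single string key:

import redis

# Minimal sketch: compare the size metrics discussed above for one key.
# Assumes a local Redis >= 4.0 and the redis-py client.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)
r.set("abc", "value1")

# STRLEN: bytes of the string value only (what --bigkeys reports for strings)
print("STRLEN:", r.strlen("abc"))

# MEMORY USAGE: RAM footprint of key name + value + per-key overheads (v4+)
print("MEMORY USAGE:", r.memory_usage("abc"))

# DEBUG OBJECT: serializedlength is the persisted size, not the RAM footprint
print("serializedlength:", r.debug_object("abc")["serializedlength"])

MEMORY USAGE will typically be noticeably larger than STRLEN, because it also counts the key name and the per-key bookkeeping.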

Related

Redis cluster command to get memory usage by key pattern

I have a Redis cluster where there are 3-4 different types of keys, with a prefix for each. I want to get the percentage distribution of value sizes for a particular key type.
e.g. Redis keys catalog:styleId1, catalog:styleId2, size:styleId1, size:styleId2, etc.
expected output
9% catalog:style 7560 bytes
12% catalog:style x bytes
Basically even a range would work, but I need a simple way to analyze the data size of a particular key type.
I tried --bigkeys, but it doesn't work on a particular key pattern.
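For what it's worth, one way to approximate such a breakdown is to SCAN each prefix and sum MEMORY USAGE per key. A rough sketch, assuming redis-py against a single node (in a cluster you would repeat this per master) and reusing the prefixes from the question:

import redis

# Rough sketch: per-prefix memory distribution via SCAN + MEMORY USAGE.
# Assumes redis-py and a single node; run it against each master in a cluster.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)
prefixes = ["catalog:", "size:"]  # prefixes taken from the question

totals = {}
for prefix in prefixes:
    total = 0
    # MATCH filters the results of each SCAN step, so this still walks the keyspace
    for key in r.scan_iter(match=prefix + "*", count=1000):
        total += r.memory_usage(key) or 0
    totals[prefix] = total

grand_total = sum(totals.values()) or 1
for prefix, total in totals.items():
    print(f"{100 * total / grand_total:.1f}% {prefix} {total} bytes")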

bigtable: how does Bigtable serve write requests?

I'm reading Google's Bigtable paper. I noticed that in section 5.3, it says
Updates are committed to a commit log that stores redo records. Of these updates, the recently committed ones are stored in memory in a sorted buffer called a memtable; the older updates are stored in a sequence of SSTables.
What confuses me is that, according to this answer, an SSTable should store sorted key-value pairs. But from the text quoted above, I get the feeling that both the memtable and the SSTables store update operations instead of the actual values. So what does Bigtable actually do when a write request comes in?
According to official documentation [1]:
"A Cloud Bigtable table is sharded into blocks of contiguous rows, called tablets, to help balance the workload of queries. (Tablets are similar to HBase regions.) Tablets are stored on Colossus, Google's file system, in SSTable format. An SSTable provides a persistent, ordered immutable map from keys to values, where both keys and values are arbitrary byte strings. Each tablet is associated with a specific Cloud Bigtable node. In addition to the SSTable files, all writes are stored in Colossus's shared log as soon as they are acknowledged by Cloud Bigtable, providing increased durability."
The official documentation links to this document, where it is explained in more detail [2]:
“A "Sorted String Table" then is exactly what it sounds like, it is a file which contains a set of arbitrary, sorted key-value pairs inside. Duplicate keys are fine, there is no need for "padding" for keys or values, and keys and values are arbitrary blobs.
If we need to preserve the fast read access which SSTables give us, but we also want to support fast random writes, turns out, we already have all the necessary pieces: random writes are fast when the SSTable is in memory, that is the definition of the memtable.”
In effect, what happens during a write is that the Tablet Server (Cloud Bigtable node) generates a commit log entry describing the mutation, plus a modification to the row in the memtable. Once the memtable grows too large, it is compacted into immutable SSTables, partitioned by locality group (column family), and each one is added to the respective stack of SSTables for its locality group.
Note that each SSTable does not contain the cell values for all rows in the locality group, only the most recent updates. Reads may need to merge updates from one or more SSTables in the locality group to construct a response.
See section "5.4 Compactions" in the paper [3] for more information on how mutations may be moved around to increase performance. Furthermore, see the heading "Locality Groups" under section "6 Refinements" for more information on the implications of using locality groups.
[1] https://cloud.google.com/bigtable/docs/overview#architecture
[2] https://www.igvita.com/2012/02/06/sstable-and-log-structured-storage-leveldb/
[3] https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf
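As a toy illustration of that write path (the general log-structured pattern, not Bigtable's actual implementation): a write first appends a redo record to the commit log for durability, then updates the in-memory memtable; once the memtable grows past a threshold it is frozen and written out as an immutable, sorted SSTable.

# Toy sketch of the log-structured write path described above; an illustration
# of the general pattern only, not Bigtable's implementation.
class ToyTablet:
    def __init__(self, memtable_limit=4):
        self.commit_log = []            # append-only redo records
        self.memtable = {}              # recent writes, sorted when flushed
        self.sstables = []              # immutable, sorted key/value maps
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append(("set", key, value))   # durability first
        self.memtable[key] = value                    # then the in-memory buffer
        if len(self.memtable) >= self.memtable_limit:
            self._flush_memtable()

    def _flush_memtable(self):
        # Freeze the memtable into an immutable SSTable of sorted key/value pairs
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        # Newest data wins: check the memtable, then SSTables from newest to oldest
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            if key in sstable:
                return sstable[key]
        return None

Reads consult the memtable first and then the SSTables from newest to oldest, which is why a response may need to combine data from several of them.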

Storing 13 million floats and integers in Redis

I have a file with 13 million floats, and each of them has an associated integer index. The original size of the file is 80MB.
We want to pass multiple indexes to get the float data. The only reason I needed a hashmap (field and value) is that a List does not support passing multiple indexes to get values.
I stored them as a hashmap in Redis, with the index as the field and the float as the value. On checking memory usage, it was about 970MB.
Storing the 13 million values as a list uses 280MB.
Is there any optimization I can use?
Thanks in advance
Running on ElastiCache.
You can get a really good optimization by creating buckets of index-to-float values.
Hashes are very memory-optimized internally.
So assume your data in original file looks like this:
index, float_value
2,3.44
5,6.55
6,7.33
8,34.55
And currently you have stored them one index to one float value, in a hash or a list.
You can do this optimization of bucketing the values:
The hash key would be index/1000 (discarding the remainder), the sub-key (hash field) would be the index, and the value would be the float value.
More details here as well:
At first, we decided to use Redis in the simplest way possible: for each ID, the key would be the media ID, and the value would be the user ID:
SET media:1155315 939
GET media:1155315
939
While prototyping this solution, however, we found that Redis needed about 70 MB to store 1,000,000 keys this way. Extrapolating to the 300,000,000 we would eventually need, it was looking to be around 21GB worth of data — already bigger than the 17GB instance type on Amazon EC2.
We asked the always-helpful Pieter Noordhuis, one of Redis’ core developers, for input, and he suggested we use Redis hashes. Hashes in Redis are dictionaries that can be encoded in memory very efficiently; the Redis setting ‘hash-zipmap-max-entries’ configures the maximum number of entries a hash can have while still being encoded efficiently. We found this setting was best around 1000; any higher and the HSET commands would cause noticeable CPU activity. For more details, you can check out the zipmap source file.
To take advantage of the hash type, we bucket all our Media IDs into buckets of 1000 (we just take the ID, divide by 1000 and discard the remainder). That determines which key we fall into; next, within the hash that lives at that key, the Media ID is the lookup key within the hash, and the user ID is the value. An example, given a Media ID of 1155315, which means it falls into bucket 1155 (1155315 / 1000 = 1155):
HSET "mediabucket:1155" "1155315" "939" HGET "mediabucket:1155"
"1155315"
"939" The size difference was pretty striking; with our 1,000,000 key prototype (encoded into 1,000 hashes of 1,000 sub-keys each),
Redis only needs 16MB to store the information. Expanding to 300
million keys, the total is just under 5GB — which in fact, even fits
in the much cheaper m1.large instance type on Amazon, about 1/3 of the
cost of the larger instance we would have needed otherwise. Best of
all, lookups in hashes are still O(1), making them very quick.
If you’re interested in trying these combinations out, the script we
used to run these tests is available as a Gist on GitHub (we also
included Memcached in the script, for comparison — it took about 52MB
for the million keys)
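A minimal sketch of the bucketing idea, assuming redis-py; the "f:" key prefix and the bucket size of 1000 are arbitrary choices, and hash-max-ziplist-entries (hash-zipmap-max-entries on old versions) needs to be at least the bucket size for the compact hash encoding to apply:

import redis

# Minimal sketch: hash key = bucket number (index // 1000), field = index.
# Assumes redis-py; the "f:" prefix and bucket size of 1000 are arbitrary.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)
BUCKET = 1000

def store(index, value):
    r.hset(f"f:{index // BUCKET}", index, value)

def fetch_many(indexes):
    # Group the requested indexes by bucket so each bucket needs only one HMGET
    by_bucket = {}
    for i in indexes:
        by_bucket.setdefault(i // BUCKET, []).append(i)
    result = {}
    for bucket, fields in by_bucket.items():
        values = r.hmget(f"f:{bucket}", fields)
        for i, v in zip(fields, values):
            result[i] = float(v) if v is not None else None
    return result

store(1155315, 3.44)
print(fetch_many([1155315]))

Grouping the requested indexes by bucket keeps a multi-index lookup down to one HMGET per bucket.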

Which approach is better when using Redis?

I'm facing the following problem:
I want to keep track of tasks given to users, and I want to store this state in Redis.
I can do:
1) create a list called "dispatched_tasks" holding many objects (username, task)
2) create many (potentially thousands of) lists called dispatched_tasks:username, each usually holding a few objects (task)
Which approach is better? If I only considered my own convenience, I would choose the second one, as from time to time I will have to search for a particular user's tasks, and the second approach gives me that for free.
But how about Redis? Which approach will be more performant?
Thanks for any help.
Redis supports different kinds of data structures as shown here. There are different approaches you can take:
Scenario 1:
Using a list data type, your list will contain all the task/user combinations for your problem. However, accessing and deleting a task runs in O(n) time complexity (it has to traverse the list to get to the element). This can have an impact on performance if your users have a lot of tasks.
Using sets:
Similar to lists, but you can add/delete/check for existence in O(1), and set elements are unique. So if you add a username/task pair that already exists, it won't be added again.
Scenario 2:
The data types do not change. The only difference is that there will be a lot more keys in Redis, which can increase the memory footprint.
From the FAQ:
What is the maximum number of keys a single Redis instance can hold? And what is the max number of elements in a Hash, List, Set, Sorted Set?
Redis can handle up to 2^32 keys, and was tested in practice to handle at least 250 million keys per instance.
Every hash, list, set, and sorted set, can hold 2^32 elements.
In other words your limit is likely the available memory in your system.
What's the Redis memory footprint?
To give you a few examples (all obtained using 64-bit instances):
An empty instance uses ~ 3MB of memory.
1 Million small Keys -> String Value pairs use ~ 85MB of memory.
1 Million Keys -> Hash value, representing an object with 5 fields, use ~ 160 MB of memory.
To test your use case is trivial using the redis-benchmark utility to generate random data sets and check with the INFO memory command the space used.
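As a rough illustration of the two scenarios using sets as suggested above (assuming redis-py; the key names are made up for illustration):

import redis

# Rough sketch of both scenarios using sets, as discussed above.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Scenario 1: one global set of "username:task" members
r.sadd("dispatched_tasks", "alice:task42")
r.srem("dispatched_tasks", "alice:task42")       # O(1) removal

# Scenario 2: one set per user, which makes per-user lookups free
r.sadd("dispatched_tasks:alice", "task42")
print(r.smembers("dispatched_tasks:alice"))      # all tasks for one user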

Reasonable message length to be stored in Oracle Database

I have a complex process that interacts with multiple systems.
Each of these systems may produce error messages that I would like to store in a table of my Oracle database (note that I have statuses but the nature of the process is such that the errors may not always be predefined).
We are talking about hundreds of thousands of transactions each day, where 1% may result in various errors.
1) What is a reasonable/acceptable length for the database field, and how big a message should I be storing?
2) Memory-wise, does it really matter how large the field is defined in the database?
"Reasonable" and "acceptable" depends on the application. Assuming that you want to define the database column as a VARCHAR2 rather than a CLOB, and assuming that you aren't using 12.1 or later, you can declare the column to hold up to 4000 bytes. Is that enough for whatever error messages you need to support? Is there a lower limit on the length of an error message that you can establish? If you're producing error messages that are designed to be shown to a user, you're probably going to be generating shorter messages. If you're producing and storing stack traces, you may need to declare the column as a CLOB because 4000 bytes may not be sufficient.
What sort of memory are we talking about? On disk, a VARCHAR2 will only allocate the space that is actually required to store the data. When the block is read into the buffer cache, it will also only use the space required to store the data. If you start allocating local variables in PL/SQL, depending on the size of the field, Oracle may allocate more space than is required to store the particular data for that local variable in order to try to avoid the cost of growing and shrinking the allocation when you modify the string. If you return the data to a client application (including a middle tier application server), that client may allocate a buffer in memory based on the maximum size of the column rather than based on the actual size of the data.