I have around 256 keys. Against each key I have to store a large number of non-repeating integers.
The following are the top 7 keys, with the total number of values (entries) stored against each key. Each value is a unique, large integer.
Key      No. of integers (values) in the list
Key 1    3394967
Key 2    3385081
Key 3    2172866
Key 4    2171779
Key 5    1776702
Key 6    1772936
Key 7    1748858
By default Redis consumes a lot of memory storing this data. I read that changing the following parameters can reduce memory usage considerably.
list-max-ziplist-entries 512
list-max-ziplist-value 64
Can anyone please explain these configuration settings (are 512 and 64 in bytes?) and what changes I can make to them in my case to reduce memory usage?
What should be kept in mind when choosing the entries and value settings above?
list-max-ziplist-entries 512
list-max-ziplist-value 64
The first setting is a number of entries and the second is a size in bytes. If the number of entries in a list exceeds 512, or if any element in the list is larger than 64 bytes, Redis switches to a less memory-efficient in-memory storage structure. More specifically, below those thresholds it uses a ziplist, and above them it uses a linked list.
So in your case, you would need an entries value greater than 1748858 to see any change (and then only for key 7 and the smaller keys after it). Also note that for Redis to re-encode existing lists into the smaller representation, you need to make the change in the config and restart Redis, because it does not re-encode downward automatically.
To verify whether a given key is using a ziplist or a linked list, use the OBJECT command (OBJECT ENCODING <key>).
For more details, see Redis Memory Optimization
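As a rough illustration of the threshold rule described in this answer, here is a small Python model. It is only a sketch of the decision, not Redis code; the function name and the returned labels are made up for the example.

```python
def list_encoding(entry_sizes, max_entries=512, max_value=64):
    """Model of the threshold rule: compact encoding only while both limits hold.

    entry_sizes: iterable of element sizes in bytes.
    max_entries: plays the role of the "...-entries" setting (a count).
    max_value:   plays the role of the "...-value" setting (bytes per element).
    """
    sizes = list(entry_sizes)
    if len(sizes) > max_entries:
        return "linkedlist"
    if any(size > max_value for size in sizes):
        return "linkedlist"
    return "ziplist"


print(list_encoding([8] * 100))          # small list, small elements
print(list_encoding([8] * 1_748_858))    # far above the default entry limit
print(list_encoding([8] * 100 + [100]))  # one element larger than 64 bytes
```

With the defaults, only the first call stays in the compact encoding, which is why lists with millions of entries see no benefit unless the entries limit is raised above their size.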
IMO you can't benefit from Redis' memory optimization here. In your case each list/set has around 3 million entries, so to get the compact encoding you would have to set list-max-ziplist-entries to about 3 million.
Redis doc says,
This operation is very fast for small values, but if you change the
setting in order to use specially encoded values for much larger
aggregate types the suggestion is to run some benchmark and test to
check the conversion time.
According to this, encoding and decoding will take more time and CPU for such a huge number of entries, so it is better to run a benchmark and then decide.
One alternative suggestion: if you only look up these sets to check whether a value is present, you can change the structure to a bucket-like layout.
For example, the value 123456 for key1 can be stored like this:
SADD key1:bucket:123 456
where 123 = 123456 / 1000 (integer division)
and 456 = 123456 % 1000
Note that this won't work well if you want to retrieve all the values for key1: in that case you would be looping through thousands of sets. Similarly, to get the total size of key1 you have to loop through all of its bucket keys.
But the memory usage will be reduced by about a factor of 10.
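The bucket split can be sketched in Python like this. The function names and the bucket_size parameter are just for illustration; only the key layout key1:bucket:123 comes from the example above.

```python
def bucket_for(key, value, bucket_size=1000):
    """Return (set_name, member) for storing `value` under `key`."""
    bucket, member = divmod(value, bucket_size)
    return f"{key}:bucket:{bucket}", member


def original_value(set_name, member, bucket_size=1000):
    """Rebuild the stored integer from the bucket set name and the member."""
    bucket = int(set_name.rsplit(":", 1)[1])
    return bucket * bucket_size + member


set_name, member = bucket_for("key1", 123456)
print(f"SADD {set_name} {member}")       # SADD key1:bucket:123 456
print(original_value(set_name, member))  # 123456
```

A membership check then becomes SISMEMBER on the computed set name with the computed member.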
In my case I upload a lot of records to a Redis sorted set, but I only need to keep the 10 highest-scored items. I cannot influence the data being uploaded (to sort or limit it before uploading).
Currently I just execute
ZREMRANGEBYRANK key 0 -11
after the upload finishes, but this approach does not look optimal (it's slow, and it would be better if Redis could handle it).
So does Redis provide anything out of the box to limit the number of items in a sorted set?
No, Redis does not provide any such functionality apart from ZREMRANGEBYRANK.
There is a similar problem of keeping a Redis list at a constant size, say 10 elements, when elements are pushed from the left using LPUSH.
The solution lies in optimizing the pruning operation.
Truncate your sorted set once in a while, not every time.
Methods:
Run ZREMRANGEBYRANK with a probability of 1/5 on each insert, using a random integer.
Use a Redis pipeline or Lua scripting to do this; it even saves the second network call that would otherwise happen on roughly every 5th insert.
This is optimal enough for the purpose mentioned.
Algorithm example:
ZADD key score1 member1
random_int = some random integer between 1 and 5
if random_int == 1:  # trim the sorted set with a 1/5 chance
    ZREMRANGEBYRANK key 0 -11
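Here is a minimal Python sketch of the same idea. It assumes a client object with redis-py style zadd(name, mapping) and zremrangebyrank(name, min, max) methods; the InMemoryClient class is only a stand-in so the example runs without a server, and all names are made up for the illustration.

```python
import random


def add_and_maybe_trim(client, key, member, score, keep=10, trim_chance=0.2, rng=random):
    """ZADD one member, then trim to the `keep` highest scores with probability `trim_chance`."""
    client.zadd(key, {member: score})
    if rng.random() < trim_chance:
        # Ranks 0 .. -(keep + 1) cover everything except the top `keep` members.
        client.zremrangebyrank(key, 0, -(keep + 1))
        return True
    return False


class InMemoryClient:
    """Tiny stand-in for a Redis client (assumption: mimics only the two calls used here)."""

    def __init__(self):
        self.sets = {}

    def zadd(self, key, mapping):
        self.sets.setdefault(key, {}).update(mapping)

    def zremrangebyrank(self, key, start, stop):
        ranked = sorted(self.sets.get(key, {}).items(), key=lambda kv: (kv[1], kv[0]))
        count = len(ranked)
        if start < 0:
            start = max(count + start, 0)
        if stop < 0:
            stop = count + stop
        if stop < start:
            return 0
        doomed = ranked[start:stop + 1]
        for member, _score in doomed:
            del self.sets[key][member]
        return len(doomed)


client = InMemoryClient()
rng = random.Random(0)
for i in range(1000):
    add_and_maybe_trim(client, "scores", f"member{i}", i, rng=rng)
client.zremrangebyrank("scores", 0, -11)  # one final trim after the upload finishes
print(sorted(client.sets["scores"].values()))
```

Between trims the set can temporarily hold more than 10 members, so a final ZREMRANGEBYRANK is still needed once the upload is done.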
I have key-value pairs like the following example:
KEY VALUE
key1 1
key2 2
key3 3
. .
. .
keyN N
Each of my keys needs to map to a unique number, so I map my keys to auto-incremented numbers and insert them into Redis via Redis mass insertion, which works very well. I then use the GET command for internal processing of all the key-value mappings.
But I have more than 1 billion keys, so I was wondering whether there is an even more efficient (mainly in terms of memory usage) way of using Redis for this scenario?
Thanks
You can pipeline commands into Redis to avoid the round-trip times like this:
{ for ((i=0;i<10000000;i++)) ; do printf "set key$i $i\r\n"; done ; sleep 1; } | nc localhost 6379
That takes 80 seconds to set 10,000,000 keys.
Or, if you want to avoid creating all those processes for printf, generate the data in a single awk process:
{ awk 'BEGIN{for(i=0;i<10000000;i++){printf("set key%d %d\r\n",i,i)}}'; sleep 1; } | nc localhost 6379
That now takes 17 seconds to set 10,000,000 keys.
An auto-increment key allows a unique number to be generated when a new record is inserted into a table or into Redis.
The other way is to use a UUID.
But I think auto-increment is far better, for reasons such as: a UUID needs about four times more space, ordering cannot be done based on the key, etc.
I'm doing exactly the same thing.
Here is a simple example.
If you have a better one, you're welcome to discuss :)
1. Connect to Redis
import redis
pool = redis.ConnectionPool(host='localhost', port=6379)  # replace with your host and port
r = redis.Redis(connection_pool=pool)
2. Define a function that assigns the next number, using a pipeline
def my_incr(pipe):
    # Runs while 'myhash' is watched, so HLEN is executed immediately.
    next_value = pipe.hlen('myhash')
    pipe.multi()
    # HSETNX only sets the field if it does not exist yet.
    pipe.hsetnx(name='myhash', key=newkey, value=next_value)
3. Run the function as a transaction
newkey = 'key1'
r.transaction(my_incr, 'myhash')
To be more memory efficient, you can use a HASH to store these key-value pairs. Redis has a special encoding for small hashes, which can save you a lot of memory.
In your case, you can shard your keys into many small hashes, each with fewer than hash-max-ziplist-entries entries. See the doc for details.
By the way, with the INCR command you can use Redis to create auto-incremented numbers.
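One way to pick the small hash for each key is to hash the key name and take it modulo the number of shards. This is only a sketch: the CRC32 choice, the shard:<n> naming and the target of about 250 fields per hash are my assumptions, not something Redis prescribes.

```python
import zlib
from collections import Counter


def shard_for(key, num_shards):
    """Pick the small hash that stores `key` (hypothetical 'shard:<n>' naming)."""
    shard = zlib.crc32(key.encode("utf-8")) % num_shards
    return f"shard:{shard}"


# With N keys and a target of about 250 fields per hash, use N // 250 shards.
keys = [f"key{i}" for i in range(100_000)]
num_shards = len(keys) // 250
sizes = Counter(shard_for(k, num_shards) for k in keys)
print(len(sizes), "hashes, largest has", max(sizes.values()), "fields")
# Each pair would then be written as: HSET <shard name> <key> <number>
```

Keeping the average well below the hash-max-ziplist-entries limit leaves room for uneven shard sizes, since a hash that crosses the limit loses the compact encoding.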
I would like to answer my own question.
If you have sorted key values, the most efficient way to bulk insert and then read them is using a B-Tree based database.
For instance, with MapDB I am able to insert it very quickly and it takes up less memory.
I am using Redis to save JSON Web Tokens. I am a little confused about the memory consumption of each record.
Let's say I have an instance on Google Cloud with 4 GB of memory allocated to it; I want to know how many records it can handle,
given that a record has on average one string value (excluding the identifier) and every string has on average 200 characters.
It's all about how you store them: using hashes (sized properly) or plain key-value pairs.
Do read this doc for more info: http://redis.io/topics/memory-optimization
For 1 million keys (simple key-value pairs) with 200-character values it takes about 300 MB, so in 4 GB you can store roughly 14 million keys, I guess. To verify this, install Redis on your machine, run a simple Java snippet (using Jedis), and check the memory consumption before and after the insertion.
import redis.clients.jedis.Jedis;
Jedis jedis = new Jedis("localhost");
for (int i = 0; i < N; i++) {
    jedis.set("Key_" + i, string);  // string is a 200-character value
}
Redis wraps strings into sds struct, which requires 3 extra bytes (or more) for each string.
Each sds is stored in a redisObject struct (using a pointer pointing to that sds object). It takes about 16 extra bytes if you're on a 64-bit machine.
You may also consider the entries in the hash table. Each one takes 24 bytes.
So you can assume each of your strings occupies about 243 bytes (200 + 3 + 16 + 24). 1 million strings will use more than 250 MB (Redis itself needs memory too).
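The arithmetic above can be written out as a small Python estimate. The constants are the overheads quoted in this answer; the function names are made up, and the result ignores key names and allocator overhead, so treat it as a lower bound on real usage.

```python
# Per-string overheads quoted in the answer above (64-bit build).
SDS_OVERHEAD = 3
REDIS_OBJECT_OVERHEAD = 16
HASH_TABLE_ENTRY = 24


def bytes_per_value(length):
    """Estimated bytes for one string value of `length` characters."""
    return length + SDS_OVERHEAD + REDIS_OBJECT_OVERHEAD + HASH_TABLE_ENTRY


def estimated_capacity(memory_bytes, length):
    """How many such values fit, ignoring key names and allocator overhead."""
    return memory_bytes // bytes_per_value(length)


per_value = bytes_per_value(200)
print(per_value)                          # 243
print(per_value * 1_000_000 / 1e6, "MB")  # 243.0 MB
print(estimated_capacity(4 * 1024**3, 200))
```

Measuring with a real insert test, as suggested above, is still the more reliable way to size the instance.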
Does anyone know the maximum value size you can store in Redis? I want to use Redis as a message queue with Celery to store some small documents that need to be processed by a worker on another server, and I want to make sure the documents aren't going to be too big.
I found one page with a reference to 1GB, but when I followed the link on the page for where they got that answer the link wasn't valid anymore. Here is the link:
http://news.ycombinator.com/item?id=1182005
All string values are limited to 512 MiB. This is the size limit you probably care most about.
EDIT: Because keys in Redis are strings, the maximum key size is 512 MiB. The maximum number of keys is 2^32 - 1 = 4,294,967,295.
Values, on the other hand, can vary in size depending on their type. For aggregate data types (i.e. hash, list, set, and sorted set), the maximum value size is 512 MiB for each element, although the data structure itself can have up to 2^32 - 1 elements.
https://redis.io/topics/data-types
https://redis.io/topics/faq#what-is-the-maximum-number-of-keys-a-single-redis-instance-can-hold-and-what-is-the-max-number-of-elements-in-a-hash-list-set-sorted-set
http://groups.google.com/group/redis-db/browse_thread/thread/1c7e33fbc98734b3?fwc=2
An article about Redis memory usage can help you roughly determine how much memory your database would take.
It's on the order of the amount of RAM you have, at least, so unless you plan on putting multi-gigabyte objects in there I wouldn't worry. I've had sets that were hundreds of megabytes big without a problem, but I don't know the exact limits.
A string value can hold at most 512 MB. But according to this link, the size can be increased.
I have a hash table that I want to store to disk. The list looks like this:
<16-byte key > <1-byte result>
a7b4903def8764941bac7485d97e4f76 04
b859de04f2f2ff76496879bda875aecf 03
etc...
There are 1-5 million entries. Currently I'm just storing them in one file, 17-bytes per entry times the number of entries. That file is tens of megabytes. My goal is to store them in a way that optimizes first for space on the disk and then for lookup time. Insertion time is unimportant.
What is the best way to do this? I'd like the file to be as small as possible. Multiple files would be okay, too. Patricia trie? Radix trie?
Whatever good suggestions I get, I'll be implementing and testing. I'll post the results here for all to see.
You could just sort entries by key and do a binary search.
Fixed size keys and data entries means you can very quickly jump from row to row, and storing only the key and data means you're not wasting any space on meta data.
I don't think you'll do any better on disk space, and lookup times are O(log(n)). Insertion times are crazy long, but you said that didn't matter.
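A minimal sketch of that lookup in Python, assuming the file holds 17-byte records (16-byte key plus 1-byte result) sorted by key. io.BytesIO stands in for a real file opened in binary mode; the function and constant names are made up for the example.

```python
import io

KEY_SIZE = 16
RECORD_SIZE = 17  # 16-byte key + 1-byte result


def lookup(f, key, record_count):
    """Binary search a sorted file of fixed-size records; return the result byte or None."""
    lo, hi = 0, record_count
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid * RECORD_SIZE)
        record = f.read(RECORD_SIZE)
        if record[:KEY_SIZE] < key:
            lo = mid + 1
        else:
            hi = mid
    if lo < record_count:
        f.seek(lo * RECORD_SIZE)
        record = f.read(RECORD_SIZE)
        if record[:KEY_SIZE] == key:
            return record[KEY_SIZE]
    return None


rows = sorted([
    bytes.fromhex("a7b4903def8764941bac7485d97e4f76") + b"\x04",
    bytes.fromhex("b859de04f2f2ff76496879bda875aecf") + b"\x03",
])
table = io.BytesIO(b"".join(rows))
print(lookup(table, bytes.fromhex("a7b4903def8764941bac7485d97e4f76"), len(rows)))  # 4
print(lookup(table, bytes.fromhex("b859de04f2f2ff76496879bda875aecf"), len(rows)))  # 3
print(lookup(table, bytes(16), len(rows)))                                          # None
```

Each probe is one seek and one 17-byte read, so a lookup over 5 million entries takes about 23 reads.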
If you're really willing to tolerate long access times, do sort the table, but then chunk it into blocks of some size and compress them. Store the offset* and start/end keys of each block in a section at the start of the file. Using this scheme, you can find the block containing the key you need in time linear in the number of blocks, and then perform a binary search within the decompressed block. Choose the block size based on how much of the file you're willing to load into memory at once.
Using an off-the-shelf compression scheme (like GZIP) you can tune the compression ratio as needed; larger files will presumably have quicker lookup times.
I doubt that the space savings will be all that great, as your structure seems to be mostly hashes. If they are actually hashes, they're random and won't compress terribly well. Sorting will help increase the compression ratio, but not by a ton.
*Use the header to look up the offset of the block to decompress and use.
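A rough Python sketch of the block scheme, under a few simplifications that are mine: the blocks and their index are kept in lists instead of being written to one file with a header, only the first key of each block is indexed, and the index is searched with bisect rather than a linear scan. The test keys are truncated SHA-256 digests, used only as stand-ins for the real 16-byte keys.

```python
import bisect
import hashlib
import zlib

KEY_SIZE = 16
RECORD_SIZE = 17


def build_blocks(sorted_records, records_per_block):
    """Return (index, blobs): index[i] is the first key of compressed block blobs[i]."""
    index, blobs = [], []
    for start in range(0, len(sorted_records), records_per_block):
        chunk = sorted_records[start:start + records_per_block]
        index.append(chunk[0][:KEY_SIZE])
        blobs.append(zlib.compress(b"".join(chunk)))
    return index, blobs


def lookup(index, blobs, key):
    """Find the block that could hold `key`, decompress it, binary search inside it."""
    block_no = bisect.bisect_right(index, key) - 1
    if block_no < 0:
        return None
    data = zlib.decompress(blobs[block_no])
    lo, hi = 0, len(data) // RECORD_SIZE
    while lo < hi:
        mid = (lo + hi) // 2
        if data[mid * RECORD_SIZE:mid * RECORD_SIZE + KEY_SIZE] < key:
            lo = mid + 1
        else:
            hi = mid
    record = data[lo * RECORD_SIZE:(lo + 1) * RECORD_SIZE]
    if len(record) == RECORD_SIZE and record[:KEY_SIZE] == key:
        return record[KEY_SIZE]
    return None


def fake_key(i):
    """Stand-in 16-byte key for the demo (assumption: real keys look random like this)."""
    return hashlib.sha256(str(i).encode()).digest()[:KEY_SIZE]


records = sorted(fake_key(i) + bytes([i % 5]) for i in range(1000))
index, blobs = build_blocks(records, records_per_block=64)
print(lookup(index, blobs, fake_key(42)))  # 2, because 42 % 5 == 2
print(sum(len(b) for b in blobs), "compressed bytes vs", len(records) * RECORD_SIZE)
```

The last line gives a feel for how little random-looking keys compress, which is the caveat raised above.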
5 million records are about 81 MB, which is acceptable to work with as an array in memory.
As you described the problem, these are unique keys rather than hash values.
Try using a hash table to access the values (look at this link).
If I have misunderstood and these are real hashes, try building a second hash level above them.
A hash table can be organized successfully on disk too (e.g. as a separate file).
Addition
A solution with good search performance and little overhead is:
1. Define a hash function that produces integer values from keys.
2. Sort the records in the file according to the values produced by this function.
3. Store the file offsets where each hash value starts.
4. To locate a value:
4.1. compute its hash with the function
4.2. look up the offset in the file
4.3. read records from the file starting at this position until the key is found, the offset of the next hash value is reached, or end of file is reached.
Some additional points:
The hash function must be fast to be effective.
The hash function must produce uniformly (or nearly uniformly) distributed values.
The table of hash value offsets can be placed in a separate file.
The table of hash value offsets can also be built dynamically at application start by reading the whole sorted file sequentially, and then kept in memory.
At step 4.3, records must be read in blocks, not one by one, to be effective. Ideally, read all records with the computed hash into memory at once.
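The steps above can be sketched in Python as follows. The hash function here simply takes the leading bits of the key, which is my assumption and only reasonable if the keys are already evenly distributed; the data is kept in a bytes object instead of a file, and all names are made up for the example.

```python
import hashlib

KEY_SIZE = 16
RECORD_SIZE = 17


def bucket_of(key, bucket_bits=8):
    """Hypothetical hash function: the leading `bucket_bits` bits of the key (max 16)."""
    return int.from_bytes(key[:2], "big") >> (16 - bucket_bits)


def build(records, bucket_bits=8):
    """Sort records by hash value and record the byte offset where each value starts."""
    ordered = sorted(records, key=lambda r: bucket_of(r[:KEY_SIZE], bucket_bits))
    data = b"".join(ordered)
    offsets, pos = [], 0
    for bucket in range(1 << bucket_bits):
        offsets.append(pos * RECORD_SIZE)
        while pos < len(ordered) and bucket_of(ordered[pos][:KEY_SIZE], bucket_bits) == bucket:
            pos += 1
    offsets.append(len(data))  # sentinel: end of the last bucket
    return data, offsets


def find(data, offsets, key, bucket_bits=8):
    """Compute the hash, jump to its offset, read the whole bucket, scan it."""
    b = bucket_of(key, bucket_bits)
    block = data[offsets[b]:offsets[b + 1]]
    for i in range(0, len(block), RECORD_SIZE):
        if block[i:i + KEY_SIZE] == key:
            return block[i + KEY_SIZE]
    return None


def fake_key(i):
    """Stand-in 16-byte key for the demo."""
    return hashlib.sha256(str(i).encode()).digest()[:KEY_SIZE]


records = [fake_key(i) + bytes([i % 7]) for i in range(2000)]
data, offsets = build(records)
print(find(data, offsets, fake_key(10)))  # 3, because 10 % 7 == 3
```

With a real file, offsets[b] and offsets[b + 1] would be the seek position and the end of the single block read mentioned for step 4.3.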
You can find some examples of hash functions here.
Would the simple approach work: store them in a SQLite database? I don't suppose it'll get any smaller, but you should get very good lookup performance, and it's very easy to implement.
First of all, multiple files are not OK if you want to optimize for disk space, because of cluster size: when you create a file of ~100 bytes, free disk space decreases by a whole cluster, 2 kB for example.
Secondly, in your case I would store the whole table in a single binary file, sorted in ascending order by the byte values of the keys. That gives you a file whose length is exactly entriesNumber*17, which is minimal unless you use compression, and it allows a very quick search in about log2(entriesNumber) steps: split the file into two parts and compare the key on their border with the key you need. If the "border key" is bigger, take the first part of the file; if it is smaller, take the second part. Then split that part in two again, and so on.
So you will need about log2(entriesNumber) read operations to search for a single key.
Your key is 128 bits, but if you have max 10^7 entries, it only takes 24 bits to index it.
You could make a hash table, or
use a Bentley-style unrolled binary search (at most 24 comparisons).
Here's the unrolled loop (with 32-bit ints).
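unsigned key[4];        /* 128-bit search key as four 32-bit words */
unsigned a[1<<24][4];   /* sorted table; pad unused rows with all-ones keys */
/* Lexicographic compare of key with row i: negative, zero or positive. */
static int compare(const unsigned *k, int i) {
    for (int w = 0; w < 4; w++)
        if (k[w] != a[i][w]) return k[w] < a[i][w] ? -1 : 1;
    return 0;
}
#define COMPARE(key, i) compare((key), (i))
int i = 0;
if (COMPARE(key, (i+(1<<23))) >= 0) i += (1<<23);
if (COMPARE(key, (i+(1<<22))) >= 0) i += (1<<22);
if (COMPARE(key, (i+(1<<21))) >= 0) i += (1<<21);
...
if (COMPARE(key, (i+(1<<3))) >= 0) i += (1<<3);
if (COMPARE(key, (i+(1<<2))) >= 0) i += (1<<2);
if (COMPARE(key, (i+(1<<1))) >= 0) i += (1<<1);
if (COMPARE(key, (i+(1<<0))) >= 0) i += (1<<0);
/* a[i] is now the last row <= key; it is a hit only if COMPARE(key, i) == 0 */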
As always with file design, the more you know (and tell us) about the distribution of data the better. On the assumption that your key values are evenly distributed across the set of all 16-byte keys -- which should be true if you are storing a hash table -- I suggest a combination of what others have already suggested:
binary data such as this belongs in a binary file; don't let the fact that the easy representation of your hashes and values are as strings of hexadecimal digits fool you into thinking that this is string data;
file size is such that the whole shebang can be kept in memory on any modern PC or server and a lot of other devices too;
the leading 4 hex digits (2 bytes) of your keys divide the set of possible keys into 16^4 (= 65536) subsets; if your keys are evenly distributed and you have 5x10^6 entries, that's about 76 entries per subset; so create a file with space for, say, 100 entries per subset; then:
at offset 0 start writing all the entries with leading 2 bytes 0x0000; pad to the total of 100 entries (1700 bytes I think) with 0s;
at offset 1700 start writing all the entries with leading 2 bytes 0x0001, pad,
repeat until you've written all the data.
Now your lookup becomes a calculation to figure out the offset into the file followed by a scan of up to 100 entries to find the one you want. If this isn't fast enough then use 16^5 subsets, allowing about 6 entries per subset (6x16^5 = 6291456). I guess that this will be faster than binary search -- but it is only a guess.
Insertion is a bit of a problem; it's up to you, with your knowledge of your data, to decide whether new entries (a) necessitate the re-sorting of a subset or (b) can simply be added at the end of the list of entries at that index (which means scanning the entire subset on every lookup).
If space is very important you can, of course, drop the leading 2 bytes from your entries, since they are implied by the offset into the file.
What I'm describing, not terribly well, is a hash table.
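To make the layout concrete, here is a small Python sketch of the padded-subset file. It is an illustration under my own assumptions: the table lives in a bytes object rather than a file, an all-zero slot is treated as padding (so an all-zero key could not be stored), and the demo uses a 1-byte prefix to keep the table small, whereas the description above uses 2 bytes.

```python
KEY_SIZE = 16
RECORD_SIZE = 17
SLOTS = 100                         # entries reserved per subset
BUCKET_BYTES = SLOTS * RECORD_SIZE  # 1700 bytes per subset


def bucket_offset(key, prefix_bytes=2):
    """Byte offset of the subset that holds `key`."""
    return int.from_bytes(key[:prefix_bytes], "big") * BUCKET_BYTES


def build_table(records, prefix_bytes=2):
    """Place every record in the next free slot of its subset; unused slots stay zero."""
    buckets = 256 ** prefix_bytes
    table = bytearray(buckets * BUCKET_BYTES)
    used = [0] * buckets
    for rec in records:
        b = int.from_bytes(rec[:prefix_bytes], "big")
        if used[b] == SLOTS:
            raise ValueError("subset overflow; reserve more slots")
        start = b * BUCKET_BYTES + used[b] * RECORD_SIZE
        table[start:start + RECORD_SIZE] = rec
        used[b] += 1
    return bytes(table)


def find(table, key, prefix_bytes=2):
    """Jump to the subset by calculation, then scan its slots."""
    start = bucket_offset(key, prefix_bytes)
    for off in range(start, start + BUCKET_BYTES, RECORD_SIZE):
        if table[off:off + KEY_SIZE] == key:
            return table[off + KEY_SIZE]
    return None


k1 = bytes.fromhex("a7b4903def8764941bac7485d97e4f76")
k2 = bytes.fromhex("b859de04f2f2ff76496879bda875aecf")
table = build_table([k1 + b"\x04", k2 + b"\x03"], prefix_bytes=1)
print(find(table, k1, prefix_bytes=1))  # 4
print(find(table, k2, prefix_bytes=1))  # 3
print(bucket_offset(bytes.fromhex("0001") + bytes(14)))  # 1700
```

With the 2-byte prefix the full table is 65536 x 1700 bytes, about 111 MB, so the padding costs some space compared with the plain sorted file.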