I have six types of keys in total, say a, b, ..., f, each having around a million sub-keys, like a1, a2, ..., a99999 (different in each bucket). Which is the faster way to access them?
1) Having separate keys by combining the bucket name and key, like a_a1, b_b1, etc.
2) Using a hash for each of the 6 keys as buckets, and then having around 1 million sub-keys in each?
I searched Stack Overflow and couldn't find such a comparison for the case of a few buckets with a huge number of keys!
Edit 1: Every key and value is a string of at most 100 characters. I would access them using the Jedis library for Java, making transactions.
Your question reminds me of this article. It doesn't contain performance benchmarks, but it seems like your second case (with buckets of keys) will have good performance and a small memory footprint.
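For what it's worth, a minimal Jedis sketch of the two layouts (key names and values here are made up):

import redis.clients.jedis.Jedis;

public class BucketLayouts {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Case 1: flat keys that combine the bucket name and sub-key
            jedis.set("a_a1", "value-1");
            String flat = jedis.get("a_a1");

            // Case 2: one hash per bucket, with the sub-key as the hash field
            jedis.hset("a", "a1", "value-1");
            String bucketed = jedis.hget("a", "a1");

            System.out.println(flat + " / " + bucketed);
        }
    }
}

With around a million fields per bucket, note that the hash-max-ziplist-entries setting discussed further down matters: very large hashes fall back to the regular hashtable encoding.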
Related
I have a file with 13 million floats; each of them has an associated integer index. The original size of the file is 80MB.
We want to pass multiple indexes to get the float data. The only reason I needed a hash (field and value) is that a List does not support passing multiple indexes to get.
I stored them as a hash in Redis, with the index as the field and the float as the value. On checking memory usage it was about 970MB.
Storing the 13 million values as a list uses 280MB.
Is there any optimization I can use?
Thanks in advance.
Running on ElastiCache.
You can get a really good optimization by creating buckets of index vs. float values.
Hashes are very memory-optimized internally.
So assume your data in the original file looks like this:
index, float_value
2,3.44
5,6.55
6,7.33
8,34.55
And you have currently stored them one index to one float value, in a hash or a list.
You can do this optimization of bucketing the values:
The hash key would be index / 1000 (discarding the remainder, as in the Instagram write-up quoted below), the sub-key would be the index, and the value would be the float value.
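A minimal Jedis sketch of that bucketing, assuming a bucket size of 1000 and an illustrative "float_bucket:" key prefix:

import redis.clients.jedis.Jedis;

public class FloatBuckets {
    private static final int BUCKET_SIZE = 1000; // keep each hash small enough for the compact encoding

    static void put(Jedis jedis, long index, float value) {
        // bucket = index / 1000, field = index, value = float
        jedis.hset("float_bucket:" + (index / BUCKET_SIZE), Long.toString(index), Float.toString(value));
    }

    static Float get(Jedis jedis, long index) {
        String raw = jedis.hget("float_bucket:" + (index / BUCKET_SIZE), Long.toString(index));
        return raw == null ? null : Float.parseFloat(raw);
    }
}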
More details here as well:
At first, we decided to use Redis in the simplest way possible: for
each ID, the key would be the media ID, and the value would be the
user ID:
SET media:1155315 939
GET media:1155315
939
While prototyping this solution, however, we found that Redis needed about 70 MB to store 1,000,000 keys this way. Extrapolating to
the 300,000,000 we would eventually need, it was looking to be around
21GB worth of data — already bigger than the 17GB instance type on
Amazon EC2.
We asked the always-helpful Pieter Noordhuis, one of Redis’ core
developers, for input, and he suggested we use Redis hashes. Hashes in
Redis are dictionaries that can be encoded in memory very
efficiently; the Redis setting ‘hash-zipmap-max-entries’ configures
the maximum number of entries a hash can have while still being
encoded efficiently. We found this setting was best around 1000; any
higher and the HSET commands would cause noticeable CPU activity. For
more details, you can check out the zipmap source file.
To take advantage of the hash type, we bucket all our Media IDs into
buckets of 1000 (we just take the ID, divide by 1000 and discard the
remainder). That determines which key we fall into; next, within the
hash that lives at that key, the Media ID is the lookup key within
the hash, and the user ID is the value. An example, given a Media ID
of 1155315, which means it falls into bucket 1155 (1155315 / 1000 =
1155):
HSET "mediabucket:1155" "1155315" "939" HGET "mediabucket:1155"
"1155315"
"939" The size difference was pretty striking; with our 1,000,000 key prototype (encoded into 1,000 hashes of 1,000 sub-keys each),
Redis only needs 16MB to store the information. Expanding to 300
million keys, the total is just under 5GB — which in fact, even fits
in the much cheaper m1.large instance type on Amazon, about 1/3 of the
cost of the larger instance we would have needed otherwise. Best of
all, lookups in hashes are still O(1), making them very quick.
If you’re interested in trying these combinations out, the script we
used to run these tests is available as a Gist on GitHub (we also
included Memcached in the script, for comparison — it took about 52MB
for the million keys)
I've carefully read https://redis.io/topics/memory-optimization but I'm still confused. Basically, it says to cap the number of keys in each hash map (HSET). But what about the number of keys in each HSET?
If I have 1,000,000 keys for a certain prefix, each one with a unique value. Suppose they're integers looking like "12345689". If I "shard" the keys by taking the first two characters (e.g. "12") and the remainder as the "sub key" (e.g. "3456789"), then for each hash I'm going to have 1,000,000 / 100 = 10,000 keys each (theoretically). Is that too many?
My (default) config is:
redis-store:6379> config get hash-*
1) "hash-max-ziplist-entries"
2) "512"
3) "hash-max-ziplist-value"
4) "64"
So, if I shard the 1,000,000 keys per prefix this way, I'll have fewer than 512. Actually, I'll have 100 (e.g. "12" or "99"). But what about within each one? There'll theoretically be 10,000 keys each. Does that mean I break the limit and can't benefit from the space optimization that hash maps offer?
You can use the following formula to calculate the internal data overhead of a HASH for each key:
3 * next_power(n) * size_of(pointer)
Here n is the number of keys in your HASH. I think you are using the x64 version of Redis, so size_of(pointer) is 8. So for each 10,000 keys in your HASH you would have at least 3 * 10,000 * 8 = 240,000 bytes of overhead.
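As a rough worked example of that formula (a sketch; next_power rounds n up to the nearest power of two):

public class HashOverhead {
    // 3 * next_power(n) * size_of(pointer), per the formula above
    static long overheadBytes(long n, int pointerSizeBytes) {
        long nextPower = Long.highestOneBit(n);
        if (nextPower < n) {
            nextPower <<= 1; // round n up to the next power of two
        }
        return 3L * nextPower * pointerSizeBytes;
    }

    public static void main(String[] args) {
        // For n = 10,000 and 8-byte pointers: 3 * 16,384 * 8 = 393,216 bytes,
        // while 3 * 10,000 * 8 = 240,000 bytes is the looser "at least" bound above.
        System.out.println(overheadBytes(10_000, 8));
    }
}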
UPDATED
Please keep in mind that hash-max-ziplist-entries is not a silver bullet. Please look at the article Under the hood of Redis #2: a ziplist's overhead can be calculated as 21 * n, and at the same time, for a saving of up to 10x in RAM you get a write-speed drop of up to 30 times and up to 100 times in reading. So with a total of 1,000,000 entries in a HASH you could hit a critical performance breakdown.
You can read more about Redis HASH internals in Under the hood of Redis #1.
After some extensive research I've finally understood how hash-max-ziplist-entries works.
https://www.peterbe.com/plog/understanding-redis-hash-max-ziplist-entries
Basically, you use just one hash map, or you break it up into multiple hash maps if you need to store more keys in it than hash-max-ziplist-entries is set to.
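For reference, a minimal Jedis sketch of the sharding scheme described in the question above (the "myprefix:" key name is made up):

import redis.clients.jedis.Jedis;

public class PrefixShard {
    // The first two characters pick the hash, the rest is the field.
    static void put(Jedis jedis, String id, String value) {
        jedis.hset("myprefix:" + id.substring(0, 2), id.substring(2), value);
    }

    static String get(Jedis jedis, String id) {
        return jedis.hget("myprefix:" + id.substring(0, 2), id.substring(2));
    }
}

With ~10,000 fields per hash and the default hash-max-ziplist-entries of 512, these hashes would use the regular encoding unless that setting is raised (with the read/write slowdown described above).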
I am wondering if there is a size limit for an S3 bucket, or can I store objects without limit?
I need this information to decide whether or not I need to write a cleaner tool.
You can store unlimited objects in your S3 bucket. However, there are limits on the individual objects stored:
An object can be 0 bytes to 5TB.
The largest object that can be uploaded in a single PUT is 5 gigabytes.
For objects larger than 100 megabytes, customers should consider using the Multipart Upload capability.
If you have too many objects, then consider adding a randomized prefix to your object names.
When your workload is a mix of request types, introduce some randomness to key names by adding a hash string as a prefix to the key name. By introducing randomness to your key names the I/O load will be distributed across multiple index partitions. For example, you can compute an MD5 hash of the character sequence that you plan to assign as the key and add 3 or 4 characters from the hash as a prefix to the key name.
https://aws.amazon.com/premiumsupport/knowledge-center/s3-bucket-performance-improve/
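For illustration, a small Java sketch of the hash-prefix idea from that quote (the 4-character prefix length and the "/" separator are arbitrary choices):

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class RandomizedKey {
    // Prepend a few characters of the key's MD5 hash so I/O spreads across index partitions.
    static String withHashPrefix(String key) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(key.getBytes(StandardCharsets.UTF_8));
        String hex = String.format("%032x", new BigInteger(1, digest));
        return hex.substring(0, 4) + "/" + key; // e.g. "<4 hex chars>/logs/2017-01-01/app.log"
    }
}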
Yes, you can store objects without limit.
BTW, if your consideration is performance, see
http://aws.typepad.com/aws/2012/03/amazon-s3-performance-tips-tricks-seattle-hiring-event.html
I need to create a table that would contain a slice of data produced by a continuously running process. This process generates messages that contain two mandatory components, among other things: a globally unique message UUID, and a message timestamp.
Those messages would be later retrieved by the UUID.
In addition, on a regular basis I would need to delete all messages from that table that are too old, i.e. whose timestamps are more than X away from the current time.
I've been reading the DynamoDB v2 documentation (e.g. Local Secondary Indexes) trying to figure out how to organize my table and whether or not I need a secondary index to perform searches for messages to delete. There might be a simple answer to my question, but I am somehow confused...
So should I just create a table with the UUID as the hash and messageTimestamp as the range key (together with a "message" attribute that would contain the actual message), and then not create any secondary indices? In the examples that I've seen, the hash was something that was not unique (e.g. ForumName under the above link). In my case, the hash would be unique. I am not sure whether that makes any difference.
And if I create the table with hash and range as described, and without a secondary index, then how would I query for all messages that are in a certain timerange regardless of their UUIDs?
DynamoDB has introduced Global Secondary Indexes, which would solve this problem.
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html
We've wrestled with this as well. The best solution we've come up with is to create a second table for storing the time-series data. To do this:
1) Use the date plus a "bucket" id as the hash key (see the sketch after this list)
You could just use the date, but then I'm guessing today's date would become a "hot" key - one that is written to with excessive frequency. This can create a serious bottleneck, as the total throughput for a particular DynamoDB partition is equal to the total provisioned throughput divided by the number of partitions - that means if all your writes go to a single key (today's key) and you have a provisioned throughput of 20 writes per second, then with 20 partitions your effective throughput would be 1 write per second. Any requests beyond this would be throttled. Not a good situation.
The bucket can be a random number from 1 to n, where n should be greater than the number of partitions used by the underlying DB. Determining n is a bit tricky of course because Dynamo does not reveal how many partitions it uses. But we are currently working with the upper limit of 200 based on the example found here. The writeup at this link was the basis for our thinking in coming up with this approach.
2) Use the UUID for the range key
3) Query records by issuing queries for each day and bucket.
This may seem tedious, but it is more efficient than a full scan. Another possibility is to use Elastic MapReduce jobs, but I have not tried that myself yet, so I cannot say how easy/effective it is to work with.
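A minimal sketch of the key construction from steps 1) and 2) above (the "#" separator, the ISO date format, and n = 200 are illustrative choices):

import java.time.LocalDate;
import java.util.UUID;
import java.util.concurrent.ThreadLocalRandom;

public class TimeSeriesKeys {
    static final int NUM_BUCKETS = 200; // upper bound on partitions, per the writeup referenced above

    // Hash key: the date plus a random bucket id, e.g. "2014-05-07#137"
    static String hashKey(LocalDate date) {
        int bucket = ThreadLocalRandom.current().nextInt(1, NUM_BUCKETS + 1);
        return date + "#" + bucket;
    }

    // Range key: the message UUID
    static String rangeKey(UUID messageId) {
        return messageId.toString();
    }
}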
We are still figuring this out ourselves, so I'm interested to hear others' comments. I also found this presentation very helpful in thinking through how best to use Dynamo:
Falling In and Out Of Love with Dynamo
-John
In short, you cannot. All DynamoDB queries MUST contain the primary hash key in the query. Optionally, you can also use the range key and/or a local secondary index. With the current DynamoDB functionality you won't be able to use an LSI as an alternative to the primary index. You are also not able to issue a query with only the range key (you can test this out easily in the AWS Console).
A (costly) workaround that I can think of is to issue a scan of the table, adding a filter based on the timestamp value in order to find out which items to delete. Note that filtering will not reduce the consumed capacity of the scan, as it will still read the whole table.
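A sketch of that scan-and-filter approach with the AWS SDK for Java v1 (the table name "messages" and the attribute name "messageTimestamp" are assumptions):

import java.util.Collections;
import java.util.Map;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.ScanRequest;
import com.amazonaws.services.dynamodbv2.model.ScanResult;

public class FindOldMessages {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();
        long cutoff = System.currentTimeMillis() - 7L * 24 * 60 * 60 * 1000; // e.g. older than 7 days

        Map<String, AttributeValue> lastKey = null;
        do {
            ScanRequest scan = new ScanRequest()
                    .withTableName("messages")
                    .withFilterExpression("messageTimestamp < :cutoff")
                    .withExpressionAttributeValues(
                            Collections.singletonMap(":cutoff", new AttributeValue().withN(Long.toString(cutoff))))
                    .withExclusiveStartKey(lastKey);
            ScanResult result = client.scan(scan);
            // result.getItems() holds the candidate items; collect their keys and delete them here
            lastKey = result.getLastEvaluatedKey();
        } while (lastKey != null && !lastKey.isEmpty());
    }
}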
By default, Redis is configured with 16 databases, numbered 0-15. Is this simply a form of namespacing, or are there performance implications of segregating by database?
For example, if I use the default database (0) and I have 10 million keys, best practices suggest that using the keys command to find keys by wildcard patterns will be inefficient. But what if I store my major keys, perhaps the first 4 segments of 8-segment keys, resulting in a much smaller subset of keys, in a separate database (say database 3)? Will Redis see these as a smaller set of keys, or do all keys across all databases appear as one giant index of keys?
More explicitly put, in terms of time complexity, if my databases look like this:
Database 0: 10,000,000 keys
Database 3: 10,000 keys
will the time complexity of keys calls against Database 3 be O(10m) or will it be O(10k)?
Thanks for your time.
Redis has a separate dictionary for each database. From your example, the keys call against database 3 will be O(10K).
That said, using keys is against best practice. Additionally, using multiple databases for the same application is against best practices as well. If you want to iterate over keys, you should index them in an application-specific way. A SortedSet is a good way to build an index.
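For example, a minimal Jedis sketch of a sorted-set index (the index key name and the use of a timestamp as the score are illustrative):

import java.util.Collection;
import redis.clients.jedis.Jedis;

public class SortedSetIndex {
    // Maintain a sorted set as an index over your keys, then range-scan it instead of using keys.
    static void addToIndex(Jedis jedis, String key, long createdAtMillis) {
        jedis.zadd("index:by-created", createdAtMillis, key);
    }

    // Returned as a Collection because the concrete type (Set or List) depends on the Jedis version.
    static Collection<String> keysCreatedBetween(Jedis jedis, long fromMillis, long toMillis) {
        return jedis.zrangeByScore("index:by-created", fromMillis, toMillis);
    }
}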
References:
The structure redisServer has an array of redisDB. See redisServer in redis.h
Each redisDB has its own dictionary object. See redisDB in redis.h
The keys command operates on the dictionary for the current database.