Get value set of a large map - redis

I have a large map from skuIds (strings) to names, e.g. "AB-1" maps to "hola"; the skuIds are unique, but the names are not.
There are about 1 million skuIds mapped to about 1000 unique names. Now I need to get the unique list of names for any subset of the skuIds.
I tried the hash's HMGET, but retrieving millions of records and looping through them is not efficient; then I tried a sorted set (keeping the same score for the same name), but then I needed the set of scores of the sorted set, which Redis does not provide directly.
This can be done with a Lua script (taking about 10-15 seconds), but I am not sure about the disadvantages of having Lua scripts in the code.

Use HSCAN for that. Also look at this answer about HSCAN usage.
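For illustration, here is a minimal server-side sketch of that approach, assuming the whole mapping lives in a single hash (the key name sku:names is hypothetical). The script HSCANs the hash in batches and returns the deduplicated values:

-- Hypothetical sketch: scan a hash and return its unique values.
-- Invoke as: EVAL <script> 1 sku:names
local cursor = "0"
local seen = {}
repeat
  local res = redis.call("HSCAN", KEYS[1], cursor, "COUNT", 1000)
  cursor = res[1]
  local kv = res[2]
  for i = 2, #kv, 2 do -- kv alternates field, value; here the values are the names
    seen[kv[i]] = true
  end
until cursor == "0"
local names = {}
for name in pairs(seen) do
  names[#names + 1] = name
end
return names

For a subset of skuIds, the same dedup idea works with a single HMGET over the subset instead of scanning the whole hash.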

Related

Indexing large number of strings

I have more than 2 million IDs, each represented as a 10-character string. These IDs correspond to documents that are going to be processed by multiple machines. What is the correct way to create a shared index that keeps track of the IDs that have been processed?
Is Cassandra the right tool to use, or is it overkill?
The frequent operations will be:
(1) Adding IDs to the index.
(2) Checking if an ID exists in the index.
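For what it's worth, both operations map directly onto a Redis set; a minimal sketch (the key name processed_ids and the sample ID are hypothetical):

SADD processed_ids AB12345678
SISMEMBER processed_ids AB12345678

Both SADD and SISMEMBER are O(1), and the set is shared by all machines talking to the same Redis instance.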

redis: Get all keys that contain queried element in set

I'm looking into Redis for my specific need, but I don't see how I could achieve what I need.
I want to store sets of integers (100s to low thousands of entries per set) and then "query" by an input set. All sets that contain all the elements of the query should match (i.e. SDIFF query key should return an empty set); the naive approach is to iterate over all keys.
I don't see how this can be done efficiently (say, about 5 ms per 10k keys).
If you will only query your data by integer, consider using the integer as the key. Instead of:
SADD a 3 5
SADD b 3 7
You can:
SADD int:3 a
SADD int:5 a
SADD int:3 b
SADD int:7 b
Then you use SINTER int:3 int:7 ... to get all matching set names (what you originally used as keys).
If you do need to query both ways, then you may need to do both. This is like modelling a many-to-many relationship in Redis.
Here, you are trading insert time and memory usage for fast query performance. Every time you add a pair, you need two SADDs: SADD setName int and SADD prefix:int setName.
If this extra memory and insert effort is not an option in your case, your next option is to use a Lua script to loop through the keys (pattern-matching your set names) and test the integers of your query with SISMEMBER. See Lua script for Redis which sums the values of keys for an example of looping through a set of keys in Lua.
A Lua script is like a stored procedure; it runs atomically on your Redis server. However, whether it will perform at 5 ms for 10k sets tested against multiple integer members remains to be seen.
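A minimal sketch of such a script, where the key pattern and the queried integers are passed in as hypothetical ARGV parameters:

-- Return every set matching a key pattern that contains all queried integers.
-- ARGV[1] = key pattern (e.g. "myset:*"), ARGV[2..] = integers to test.
-- Note: KEYS blocks the server while it runs; a SCAN loop is gentler.
local matches = {}
for _, key in ipairs(redis.call("KEYS", ARGV[1])) do
  local ok = true
  for i = 2, #ARGV do
    if redis.call("SISMEMBER", key, ARGV[i]) == 0 then
      ok = false
      break
    end
  end
  if ok then
    matches[#matches + 1] = key
  end
end
return matches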

Which approach is better when using Redis?

I'm facing following problem:
I want to keep track of tasks given to users, and I want to store this state in Redis.
I can do:
1) create a list called "dispatched_tasks" holding many objects (username, task)
2) create many (potentially thousands of) lists called dispatched_tasks:username, each usually holding a few objects (task)
Which approach is better? If I only thought of my own comfort, I would choose the second one, as from time to time I will have to search for a particular user's tasks, and the second approach gives me this for free.
But how about Redis? Which approach will be more performant?
Thanks for any help.
Redis supports different kinds of data structures as shown here. There are different approaches you can take:
Scenario 1:
Using a list data type, your list will contain all the task/user combinations for your problem. However, accessing and deleting a task runs in O(n) time complexity (it has to traverse the list to get to the element). This can have an impact on performance if your users have a lot of tasks.
Using sets:
Similar to lists, but you can add/delete/check for existence in O(1), and set elements are unique. So if you add a username/task pair that already exists, it won't be added again.
Scenario 2:
The data types do not change. The only difference is that there will be a lot more keys in Redis, which can increase the memory footprint.
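For illustration, the two layouts might look like this (key names and payloads are hypothetical). Scenario 1, one global set:

SADD dispatched_tasks john:task42

Scenario 2, one set per user, where a user's tasks come for free:

SADD dispatched_tasks:john task42
SMEMBERS dispatched_tasks:john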
From the FAQ:
What is the maximum number of keys a single Redis instance can hold? And what is the max number of elements in a Hash, List, Set, Sorted Set?
Redis can handle up to 2^32 keys, and was tested in practice to handle at least 250 million keys per instance.
Every hash, list, set, and sorted set can hold 2^32 elements.
In other words, your limit is likely the available memory in your system.
What's the Redis memory footprint?
To give you a few examples (all obtained using 64-bit instances):
An empty instance uses ~3 MB of memory.
1 million small key -> string value pairs use ~85 MB of memory.
1 million keys -> hash values, each representing an object with 5 fields, use ~160 MB of memory.
Testing your use case is trivial: use the redis-benchmark utility to generate random data sets, then check the space used with the INFO memory command.

Out of Process in memory database table that supports queries for high speed caching

I have a SQL table that is accessed continually but changes very rarely.
The table is partitioned by UserID, and each user has many records in it.
I want to save database resources and move this table closer to the application in some kind of memory cache.
In process caching is too memory intensive so it needs to be external to the application.
Key Value stores like Redis are proving inefficient due to the overhead of serializing and deserializing the table to and from Redis.
I am looking for something that can store this table (or partitions of data) in memory, but let me query only the information I need without serializing and deserializing large blocks of data for each read.
Is there anything that would provide an out-of-process, in-memory database table that supports queries, suitable for high-speed caching?
Searching has shown that Apache Ignite might be a possible option, but I am looking for more informed suggestions.
Since it's out-of-process, it has to do serialization and deserialization. The problem you're concerned with is how to reduce that serialization/deserialization work. If you use Redis' STRING type, you cannot reduce it.
However, you can use a HASH to solve the problem: map your SQL table to a HASH.
Suppose you have the following table: person: id(varchar), name(varchar), age(int). You can take the person id as the key, and name and age as fields. When you want to look up someone's name, you only need to fetch the name field (HGET person-id name); the other fields won't be deserialized.
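A minimal sketch with a hypothetical row (id 42, name "Alice", age 30):

HMSET person:42 name "Alice" age 30
HGET person:42 name

Only the requested field crosses the network; the rest of the row never leaves the server.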
Ignite is indeed a possible solution for you, since you can optimize serialization/deserialization overhead by using its internal binary representation to access objects' fields. You may refer to this documentation page for more information: https://apacheignite.readme.io/docs/binary-marshaller
Access overhead may also be optimized by disabling the copy-on-read option: https://apacheignite.readme.io/docs/performance-tips#section-do-not-copy-value-on-read
Data collocation by user id is also possible with Ignite: https://apacheignite.readme.io/docs/affinity-collocation
As #for_stack said, a hash will be very suitable for your case.
You said that each user has many rows in the db, indexed by user_id and tag_id, so (user_id, tag_id) uniquely specifies one row. Since every row is functionally dependent on this tuple, you can use the tuple as the HASH key.
For example, if you want to save the row (user_id, tag_id, username, age) with values ("123456", "FDSA", "gsz", 20) into Redis, you could do this:
HMSET 123456:FDSA username "gsz" age 20
When you want to query the username by user_id and tag_id, you can do this:
HGET 123456:FDSA username
So every HASH key will be a combination of user_id and tag_id. If you want the key to be more human-readable, you can add a prefix string such as "USERINFO", e.g. USERINFO:123456:FDSA.
But if you want to query with only a user_id and get all rows for that user_id, the method above is not enough.
You can build a secondary index in Redis for your HASHes.
As said above, we use user_id:tag_id as the HASH key because it uniquely identifies one row. If we want to query all the rows for one user_id, we can use a sorted set as a secondary index recording which HASHes store the info about that user_id.
We can add an entry to the sorted set:
ZADD user_index 0 123456:FDSA
As above, we set the member to the HASH key string and the score to 0. The rule is that every score in this zset must be 0, so that we can use lexicographical ordering for range queries; see ZRANGEBYLEX.
E.g., to get all the rows for user_id 123456:
ZRANGEBYLEX user_index [123456 (123457
It returns all the HASH keys whose prefix is 123456, and we can then use each returned string as a HASH key with HGET or HMGET to retrieve the information we want.
[ means inclusive and ( means exclusive. Why 123457? To get all rows for a user_id, we form the upper bound by adding 1 to the ASCII value of the user_id string's last character.
For more about lex indexes, refer to the article mentioned above.
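One practical note: the row hash and its index entry should be written together, for instance in a transaction. A minimal sketch reusing the hypothetical row above:

MULTI
HMSET 123456:FDSA username "gsz" age 20
ZADD user_index 0 123456:FDSA
EXEC

Otherwise a failure between the two writes could leave a row the index cannot find, or an index entry pointing at nothing.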
You can try Apache Mnemonic, started by Intel: http://incubator.apache.org/projects/mnemonic.html. It supports serde-less features.
For a read-dominant workload, the MySQL MEMORY engine should work fine (write DMLs lock the whole table). This way you don't need to change your data retrieval logic.
Alternatively, if you're okay with changing the data retrieval logic, then Redis is also an option. To add to what #GuangshengZuo has described, there's ReJSON, a dynamically loadable Redis module (for Redis 4+) that implements a document store on top of Redis. It can further relax the requirements for marshalling big structures back and forth over the network.
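For illustration, a minimal ReJSON sketch (the key and fields are hypothetical); only the requested path is returned, not the whole document:

JSON.SET user:42 . '{"name":"Alice","age":30}'
JSON.GET user:42 .name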
With just six principles (which I collected here), it is very easy for a SQL-minded person to adapt to the Redis approach. Briefly, they are:
1) The most important thing: don't be afraid to generate lots of key-value pairs, so feel free to store each row of the table in a different key.
2) Use Redis' hash map data type.
3) Form the key name from the primary key values of the table, joined by a separator (such as ":").
4) Store the remaining fields as a hash.
5) When you want to query a single row, form the key directly and retrieve the result.
6) When you want to query a range, use the wildcard "*" in your key pattern; a short sketch follows below. But be aware that scanning keys blocks other Redis operations, so use this method only if you really have to.
The link just gives a simple table example and how to model it in Redis. Following these six principles, you can continue to think as you would for normal tables (of course, without some not-so-relevant concepts such as CRUD, constraints, relations, etc.).
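A minimal sketch of principles 3) to 6), assuming a hypothetical table person with composite primary key (id, country). Principles 3) and 4), composite key plus fields as a hash:

HMSET person:42:US name "Alice" age 30

Principle 5), a single-row lookup:

HGETALL person:42:US

Principle 6), a range query; SCAN iterates incrementally rather than blocking the way KEYS does:

SCAN 0 MATCH person:42:* COUNT 1000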
Using a combination of Memcache and Redis on top of MySQL comes to mind.

A .plist of 3200 dictionaries

I have a plist with 3200 dictionaries. Each dictionary has 20 key/value pairs. What's the best way to search through it?
I have a string called "id", and what I am doing right now is iterating through all the elements of the array, asking each element (dictionary) for the value of the key "id", comparing that id with the id I have, and breaking if it's found.
This is really slow; I can see a lag of about 1-2 seconds. Is there a better way?
Thanks
What you're doing now is an O(n) operation (linear in the number of items in the list). You can get a "constant time" O(1) lookup if you keep another "lookaside" data structure that helps you index into your list.
Before you write the 3200 item list of dictionaries, create one more special dictionary that maps your IDs to indexes in the big array. In other words, each key will be an ID and its value will be an NSNumber with the index number into the big array. Then save this also (either in the same plist or a separate one).
Then when you need to do a lookup, just do -objectForKey: in the lookaside dictionary, which will immediately give you back the index of the entry you're looking for.
Just make sure your lookaside dictionary always stays in sync if you update the data at run time. Note that this also assumes your IDs are unique (it sounds like they are).
Why don't you use a SQLite database?
The first thing I notice is that you always seem to search on the same id key. If that's the case, you should sort your array of dictionaries by id. You can then do a binary search on the sorted array. Result: finding any dictionary by id takes at most 12 comparisons (log2 3200 ≈ 11.6). By contrast, a linear search through 3200 items averages 1600 operations and might need as many as 3200.
Core Data might be a very good solution here if you need to search on several different keys, and if all those dictionaries have the same keys. NSManagedObject works a lot like NSMutableDictionary, but the framework will take care of indexing for you, and searching is fast and relatively easy.