Redis sorted sets and best way to store uids

I have data consisting of user_ids and the tags of these user ids.
The user_ids occur multiple times and have a pre-specified number of tags (500), although that might change in the future. What must be stored is the user_id, their tags and their counts.
I later want to easily find the tags with the top score, etc. Every time a tag appears, its count is incremented.
My implementation in Redis uses sorted sets:
every user_id is a sorted set
the key is the user_id, which is a hex number
it works like this:
zincrby user_id:x 1 "tag0"
zincrby user_id:x 1 "tag499"
zincrby user_id:y 1 "tag3"
and so on
Keeping in mind that I want to get the tags with the highest score, is there a better way?
The second issue is that right now I'm using "keys *" to retrieve these keys for client-side manipulation, which I know is not meant for production systems.
Plus, for memory reasons, it would be great to iterate through a specified number of keys at a time (in the range of 10000). I know the keys have to be stored in memory, however they don't follow a specific pattern that would allow for partial retrieval, so that I could avoid the "zmalloc" error (4 GB, 64-bit Debian server).
The keys amount to roughly 20 million.
Any thoughts?

My first point would be to note that 4 GB is tight for storing 20M sorted sets. A quick try shows that 20M users, each of them with 20 tags, would take about 8 GB on a 64-bit box (and that accounts for the sorted set ziplist memory optimizations provided with Redis 2.4 - don't even try this with earlier versions).
Sorted sets are the ideal data structure to support your use case. I would use them exactly as you described.
As you pointed out, KEYS cannot be used to iterate on keys. It is rather meant as a debug command. To support key iteration, you need to add a data structure to provide this access path. The only structures in Redis which can support iteration are the list and the sorted set (through the range methods). However, they tend to transform O(n) iteration algorithms into O(n^2) (for list), or O(nlogn) (for zset). A list is also a poor choice to store keys since it will be difficult to maintain it as keys are added/removed.
A more efficient solution is to add an index composed of regular sets. You need to use a hash function to associate a specific user to a bucket, and add the user id to the set corresponding to this bucket. If the user ids are numeric values, a simple modulo function will be enough. If they are not, a simple string hashing function will do the trick.
So to support iteration on user:1000, user:2000 and user:1001, let's choose a modulo 1000 function. user:1000 and user:2000 will be put in bucket index:0 while user:1001 will be put in bucket index:1.
So on top of the zsets, we now have the following keys:
index:0 => set[ 1000, 2000 ]
index:1 => set[ 1001 ]
In the sets, the prefix of the keys is not needed, and leaving it out allows Redis to optimize memory consumption by serializing the sets, provided they are kept small enough (the integer set optimization proposed by Sripathi Krishnan).
The global iteration consists of a simple loop over the buckets from 0 to 1000 (exclusive). For each bucket, the SMEMBERS command is applied to retrieve the corresponding set, and the client can then iterate on the individual items.
Here is an example in Python:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# ----------------------------------------------------
import redis, random

POOL = redis.ConnectionPool(host='localhost', port=6379, db=0)
NUSERS = 10000
NTAGS = 500
NBUCKETS = 1000

# ----------------------------------------------------
# Fill redis with some random data
def fill(r):
    p = r.pipeline()
    # Create only 10000 users for this example
    for id in range(0,NUSERS):
        user = "user:%d" % id
        # Add the user in the index: a simple modulo is used to hash the user id
        # and put it in the correct bucket
        p.sadd( "index:%d" % (id%NBUCKETS), id )
        # Add random tags to the user
        for x in range(0,20):
            tag = "tag:%d" % (random.randint(0,NTAGS))
            p.zincrby( user, tag, 1 )
        # Flush the pipeline every 1000 users
        if id % 1000 == 0:
            p.execute()
            print id
    # Flush one last time
    p.execute()

# ----------------------------------------------------
# Iterate on all the users and display their 5 highest ranked tags
def iterate(r):
    # Iterate on the buckets of the key index
    # The range depends on the function used to hash the user id
    for x in range(0,NBUCKETS):
        # Iterate on the users in this bucket
        for id in r.smembers( "index:%d"%(x) ):
            user = "user:%d" % int(id)
            print user,r.zrevrangebyscore(user,"+inf","-inf", 0, 5, True )

# ----------------------------------------------------
# Main function
def main():
    r = redis.Redis(connection_pool=POOL)
    r.flushall()
    m = r.info()["used_memory"]
    fill(r)
    info = r.info()
    print "Keys: ",info["db0"]["keys"]
    print "Memory: ",info["used_memory"]-m
    iterate(r)

# ----------------------------------------------------
main()
By tweaking the constants, you can also use this program to evaluate the global memory consumption of this data structure.
IMO this strategy is simple and efficient, because it offers O(1) complexity to add/remove users, and true O(n) complexity to iterate on all items. The only downside is that the key iteration order is random.
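For completeness, here is a minimal removal sketch in the same style (it reuses the user:<id>/index:<bucket> naming and the NBUCKETS constant from the example above; treat it as an illustration of the O(1) claim rather than part of the original program):
def remove_user(r, id):
    # Drop the user id from its index bucket, then delete the per-user zset.
    # The pipeline only saves a round trip; each command is cheap on its own.
    p = r.pipeline()
    p.srem("index:%d" % (id % NBUCKETS), id)
    p.delete("user:%d" % id)
    p.execute()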

Related

Redis using members of sorted list to delete external keys

Using SORT, one can sort a set and fetch external keys using the results of the sort as part of the query.
By way of example:
If the external key/values are defined as various keys using the pattern itemkey:<somestring>
And a set holds the members, then issuing the command sort <set key> by nosort get itemkey:* would get the values of the referenced keys.
I would like to be able to sort through the set and delete these individual keys, but it appears that sort <key> by nosort del itemkey:* is not supported.
Any suggestions on how to get the list of values stored in a set and then delete the external keys?
Obviously I can do this with two commands, first getting the list of values and then iterating through the list calling delete, but this is not desirable as I require an atomic operation.
To ensure an atomic operation one can use either transactions or Redis' Lua scripts. For efficiency I decided to go with a script. This way the entire script completes before the next Redis action/request is processed.
In the code snippet below I used loadScript in order to store the script on the Redis side, reducing traffic with every call; the response from loadScript is then used as the identifier for Jedis's evalsha command.
Using Scala (Note Jedis is a Java library, hence the .asJava):
val scriptClearIndexRecipe = """local names = redis.call('SORT', KEYS[1]);
                               | for i, k in ipairs(names) do
                               |   redis.call('DEL', "index:recipe:"..k)
                               | end;
                               | redis.call('DEL', KEYS[1]);
                               | return 0;""".stripMargin

def loadScript(script: String): String = client.scriptLoad(script)

def eval(luaSHA: String, keys: List[String], args: List[String]): AnyRef = {
  client.evalsha(luaSHA, keys.asJava, args.asJava)
}
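For comparison, a rough redis-py sketch of the same load-once/EVALSHA pattern could look like this (the key names recipes:index and index:recipe:* are only illustrative, borrowed from the snippet above):
import redis

r = redis.Redis()

# Same idea as the Scala version: delete every key referenced by the set's
# members, then delete the set itself, all within one atomic script call.
clear_index_lua = """
local names = redis.call('SORT', KEYS[1])  -- assumes numeric members, as in the original snippet
for i, k in ipairs(names) do
    redis.call('DEL', 'index:recipe:' .. k)
end
redis.call('DEL', KEYS[1])
return 0
"""

sha = r.script_load(clear_index_lua)   # cache the script server-side
r.evalsha(sha, 1, "recipes:index")     # 1 = number of KEYS arguments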

Finding value collisions from a changing collection of Redis keys

On my website, users are allowed to keep the same usernames. Moreover, whenever a user logs in, I temporarily save their username in a Redis key with a TTL of 10 minutes.
The question is: is there any way - using Redis - to find all user ids online within the last 10 mins, sharing the same username?
Currently, I'm extracting all the keys' values and finding collisions in Python - which doesn't really help since I need to do this multiple times at runtime (and there's lots of user traffic).
I hypothesize that I could have created sets with a unique username as the key and stored all user ids in the set, to give me O(1) look-ups on users sharing the same username. But then I'd have to sacrifice the 10-minute TTL condition (which I need for every username individually).
Btw Redis/Lua beginner here, hence the noob question (if it is).
Where there is a will, there is a way... :)
Begin by storing the logins in a Sorted Set. Assuming that user id 123 had logged in at time 456 with the username "foo", you can represent that as:
ZADD logins 456 123:foo
Note: you'll also have to remove old elements from that Sorted Set so it doesn't just grow out of control.
Next, you want to search for the users from the last 10 minutes, so you'd use ZRANGEBYSCORE for that. Instead of shipping the entire thing back to your client, use Lua to process it and check for collisions.
The following script example wraps together all of the above:
-- Keys: 1) The logins Sorted Set
-- Args: 1) The epoch value of 'now'
-- 2) The logged in user id
-- 3) The logged in user name
-- Get logins from the last 10 minutes
local l = redis.call('ZRANGEBYSCORE', KEYS[1], ARGV[1]-600, '+inf')
-- "Evict" old logins
redis.call('ZREMRANGEBYSCORE', KEYS[1], '-inf', '(' .. ARGV[1]-600)
-- Store the new login
redis.call('ZADD', KEYS[1], ARGV[1], ARGV[2] .. ':' .. ARGV[3])
local c = {} -- detected name collisions
for _, v in pairs(l) do
    local p = v:find(':') -- no string.split in Lua
    local i = v:sub(1,p-1) -- id
    local n = v:sub(p+1)   -- name
    if n == ARGV[3] then
        c[#c+1] = i
    end
end
return c
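If it helps, this is roughly how such a script can be invoked from redis-py; the script body is a condensed version of the one above, and the logins key, user id 123 and name "foo" mirror the example in the answer:
import time
import redis

r = redis.Redis()

# Condensed version of the script above; register_script caches the SHA and
# transparently falls back to EVAL if the server's script cache is flushed.
collision_lua = """
local l = redis.call('ZRANGEBYSCORE', KEYS[1], ARGV[1]-600, '+inf')
redis.call('ZREMRANGEBYSCORE', KEYS[1], '-inf', '(' .. ARGV[1]-600)
redis.call('ZADD', KEYS[1], ARGV[1], ARGV[2] .. ':' .. ARGV[3])
local c = {}
for _, v in pairs(l) do
    local p = v:find(':')
    if v:sub(p+1) == ARGV[3] then c[#c+1] = v:sub(1, p-1) end
end
return c
"""

check_collisions = r.register_script(collision_lua)

# Record a login for user id 123 named "foo" and get back the ids of any
# other users seen with the same name in the last 10 minutes.
colliding_ids = check_collisions(keys=["logins"], args=[int(time.time()), 123, "foo"])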

Redis: How to increment hash key when adding data?

I'm iterating through data and dumping some to a Redis DB. Here's an example:
hmset id:1 username "bsmith1" department "accounting"
How can I increment the unique ID on the fly and then use that during the next hmset command? This seems like an obvious ask but I can't quite find the answer.
Use another key, a String, for storing the last ID. Before calling HMSET, call INCR on that key to obtain the next ID. Wrap the two commands in a MULTI/EXEC block or a Lua script to ensure the atomicity of the transaction.
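Because INCR returns the freshly incremented value, a minimal redis-py sketch of that idea can use it directly (the counter key user:next_id and the field names are assumptions, not part of the question):
import redis

r = redis.Redis()

def add_user(username, department):
    # INCR is atomic, so concurrent writers each receive a distinct id.
    new_id = r.incr("user:next_id")
    # hset(..., mapping=...) needs a fairly recent redis-py (3.5+);
    # older versions would call hmset() instead.
    r.hset("id:%d" % new_id, mapping={"username": username, "department": department})
    return new_id

add_user("bsmith1", "accounting")   # first call on an empty counter stores id:1
If the two steps must succeed or fail together, they can be moved into a small Lua script as suggested above.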
Like Itamar mentions you can store your index/counter in a separate key. In this example I've chosen the name index for that key.
Python 3
KEY_INDEX = 'index'
r = redis.from_url(host)
def store_user(user):
    r.incr(KEY_INDEX, 1) # If key doesn't exist it will get created
    index = r.get(KEY_INDEX).decode('utf-8') # Decode from byte to string
    int_index = int(index) # Convert from string to int
    result = r.set('user::%d' % int_index, user)
    ...
Note that user::<index> is an arbitrary key chosen by me. You can use whatever you want.
If you have multiple machines writing to the same DB you probably want to use pipelines.

Rails show different object every day

I want to match my user to a different user in his/her community every day. Currently, I use code like this:
@matched_user = User.near(@user).order("RANDOM()").first
But I want to have a different @matched_user on a daily basis. I haven't been able to find anything in Stack or in the APIs that has given me insight on how to do it. I feel it should be simpler than having to resort to a rake task with cron. (I'm on postgres.)
Whenever I find myself hankering for shared 'memory' or transient state, I think to myself "this is what (distributed) caches were invented for".
@matched_user = Rails.cache.fetch(@user.cache_key + '/daily_match', expires_in: 1.day) {
  User.near(@user).order("RANDOM()").first
}
NOTE: While specifying a TTL for a cache entry tells Rails/the cache system to try and keep that value for the given timeframe, there's NO guarantee that it will. In particular, a cache that aggressively tries to reclaim memory may expire an entry well before its desired expires_in time.
For this particular use case it shouldn't be a big deal, but in cases where the business/domain logic demands periodically generated values that are durable, you really have to persist them in your database.
How about using PostgreSQL's SETSEED function? I used the date as the seed so that every day the seed will change, but within a day the seed will be consistent:
User.connection.execute "SELECT SETSEED(#{Date.today.strftime("%y%d%m").to_i/1000000.0})"
@matched_user = User.near(@user).order("RANDOM()").first
You may want to seed a random value after using this so that any future calls to random aren't biased:
random = User.connection.execute("SELECT RANDOM()").to_a.first["random"]
# Same code as above:
User.connection.execute "SELECT SETSEED(#{Date.today.strftime("%y%d%m").to_i/1000000.0})"
@matched_user = User.near(@user).order("RANDOM()").first
# Use random value before seed to make new seed:
User.connection.execute "SELECT SETSEED(#{random})"
I have split these steps into different sections just for readability. You can optimise the query later.
1) Find all user records created before the beginning of today, so that the count stays frozen for the day.
usrs_till_today_morning = User.where("created_at < ?", DateTime.now.in_time_zone(Time.zone).beginning_of_day)
2) Pluck all IDs
user_ids = usrs_till_today_morning.pluck(:id)
3) Today's day of the month will be in the range (1..31) but will remain constant throughout the day.
day_today = Time.now.day
4) Select the same ID for the day
todays_user_id = user_ids[day_today % user_ids.count]
#matched_user = User.find(todays_user_id)
So it will give you a pseudo-random user record while keeping the same record throughout the day!

How to retrieve all hash values from a list in Redis?

In Redis, to store an array of objects we should use a hash for each object and add its key to a list:
HMSET concept:unique_id name "concept"
...
LPUSH concepts concept:unique_id
...
I want to retrieve all the hash values (or objects) in the list, but the list contains only the hash keys, so a two-step command is necessary, right? This is how I'm doing it in Python:
def get_concepts():
    keys = r.lrange("concepts", 0, -1)
    pipe = r.pipeline()
    for key in keys:
        pipe.hgetall(key)
    return pipe.execute()
Is it necessary to iterate and fetch each individual item? Can it be more optimized?
You can use the SORT command to do this:
SORT concepts BY nosort GET *->name GET *->some_key
Where * will expand to each item in the list (which here is already the full concept:<unique_id> key).
Add LIMIT offset count for pagination.
Note that you have to enumerate each field in the hash (each field you want to fetch).
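In redis-py that maps to something like the following (the field names name and some_key come from the command above; groups=True is optional and just pairs the GET results per list item):
import redis

r = redis.Redis()

# One round trip: Redis walks the list and pulls the requested hash fields itself.
rows = r.sort(
    "concepts",
    by="nosort",                     # keep the list order, skip the actual sort
    get=["*->name", "*->some_key"],  # one GET per hash field to fetch
    groups=True,                     # returns [(name, some_key), ...] tuples
)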
Another option is to use the new (in Redis 2.6) EVAL command to execute a Lua script in the Redis server, which could do what you are suggesting, but server-side.
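A rough sketch of that server-side option with redis-py, assuming the concepts list and key layout from the question:
import redis

r = redis.Redis()

# Walk the list server-side and HGETALL each referenced hash in a single call.
fetch_all_lua = """
local result = {}
local keys = redis.call('LRANGE', KEYS[1], 0, -1)
for i, key in ipairs(keys) do
    result[i] = redis.call('HGETALL', key)
end
return result
"""

get_concepts_script = r.register_script(fetch_all_lua)
concepts = get_concepts_script(keys=["concepts"])  # one [field1, value1, ...] array per hash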