Redis atomic pop and add to sorted set, BRPOPLPUSH equivalent - redis

I have one redis list "waiting" and one redis list "partying".
I have a long running process that safely blocks on the "waiting" list item to come along, and then pops it and pushes it onto the "partying" list atomically using BRPOPLPUSH. Awesome.
Users in the "waiting" list are repeatedly querying "am I in the "partying" list yet?", but there is no fast (i.e. < O(n)) method of checking if a user is in a redis list. You have to grab the whole list and loop through it.
So I'm resorting to switching from a redis list to a redis sorted set, with the 'score' as the unix timestamp of when they joined the "waiting" sorted set. I can blocking pop on the lowest score (user at the head of the queue). Using sorted sets, I can use ZSCORE to check in O(1) time if they're on either list, so it's looking hopeful.
How can I perform the nice atomic equivalent of BRPOPLPUSH on sorted sets?
It's like I need a mythical BZRPOPMIN & ZADD= BZRPOPMINZADD. If the process dies between these two, a user will effectively be disappeared from both sets.
Looking into MULTI EXEC transactions in redis, they are not what they appear to be at first glance, they're more like 'pipelines', in that I can't get the result of the first command (BZRPOPMIN) and feed it into the second command (ZADD). I'm very suspicious of putting the blocking BZRPOPMIN into the MULTI too, am I right to be?

How can I perform the nice atomic equivalent of BRPOPLPUSH on sorted sets? 
Sorry, you can't. We actually discussed this when the ZPOP family was added and decided against it: "However I'm not for the BZPOPZADD part, because instead experience with lists shown that this is not a good idea in general, unfortunately, and that that adding safety of message processing may be used other means. The worst thing abut BZPOPZADD and BRPOPLPUSH and so forth are the cascading effects, they create a lot of problems in replication for instance, and our BRPOPLPUSH replication is still not correct in certain ways (we can talk about it if you want)." (ref: https://github.com/antirez/redis/pull/4879#issuecomment-389116241)
 I'm very suspicious of putting the blocking BZRPOPMIN into the MULTI too, am I right to be?
Definitely, and blocking commands can't be called inside a transaction anyway.

Related

REDIS search does not return results consistently

I am reading events from Kafka and depositing into REDIS. Then, we read events using Python and in case we don’t find events we drop/re-create the index.
However, I noticed at times even after re-creating the index I still don’t find events.
I have couple of questions -
[Q1] Is re-indexing a good approach where we are continuously getting a huge flow of events?
[Q2] Also, I noticed during REDIS search at times I do get events and then at another instance query does not return results, can this be related to dropping / re-creating index?
[Q3] Is there a better standard approach to ensure JSON are deposited / retrieved consistenly.
[Q4] Is there an explanation as to why at times everything just seems to work fine continuously for several hours and then does not work at all for few hours.
I would appreciate alternate approaches for this simple use case as I am fairly new to REDIS
Here are some points. I hope it would be helpful.
[Q1]
Re-indexing can be a good approach if you are getting a huge flow of events, as it will help keep your data up-to-date. However, you also need to make sure that your Redis instance is able to handle the increased load otherwise Redis can become very slow when it is re-indexing and may not be able to keep up with the flow of event. And in that case, you may need to scale your Redis instance to handle the load or incremental indexing may be a better option.
[Q2]
There could be many reasons why search results vary, but it is possible that the index is being dropped and re-created, which would cause the event to not be found. There are a few other things that could be happening:
There could be an issue with the search algorithm, which would cause search results to not be returned.
The data in the database could be changing, which would cause search results to not be returned.
[Q3]
There is no one standard approach to ensure JSON are deposited or retrieved consistently. However, some best practices you may consider include using a library or tool that supports serialization and deserialization of JSON data, validating input and output data, and using an editor that highlights errors in JSON syntax. You can also use locking mechanism to ensure that only one process can write to the Redis instance at a time, or using a queue to buffer writes to Redis. Also, different developers may have different preferences and opinions on the best way to handle JSON in Redis. Some possible methods include using commands such as JSET or JGET to manage JSON objects, or using a library such as JRedis to simplify the process.
[Q4]
Redis can be temperamental, and its behavior can vary depending on the specific configuration and usage scenarios. There is no specific explanation for this behavior, but it could be due to various factors such as load on the Redis server, network conditions, or other applications using the same Redis instance. In that case, server will not be able to handle requests properly and will stop responding. If everything is working fine for a few hours and then suddenly stops working, you can try restarting Redis or checking the logs for any errors that may have occurred.

Max value size for Redis

I've been trying to make replay system. So basically when player moves, system saves his datas(movements, location, animation etc.) into JSON file. In the end of the record, JSON file may be over 50 MB. I'd want to save this data into Redis with expire date (24-48 hours).
My questions are;
Is it bad to save over 50 MB into Redis with expire date?
How many datas that over 50 MB can Redis handle without performance loss?
If players make 500 records in 48 hours, may it be bad for Redis?
How many milliseconds does it takes 50 MB data from Redis with average VDS/VPS?
Storing a large object(in terms of size) is not a good practice. You may read it from here. One of the problem is network. You need to send 50MB payload to a redis server in a single call. Also if you save them as one big object, then while retrieving, updating it (a single field, element etc), you need to get 50 MB back from server and parse it to get a single field, update it back end send back to server. That's a serious problem in terms of network.
Instead of redis strings, you may prefer sorted sets or lists depending on your use case. If you are going to store them with timestamps and get the range of events between these timestamps, then sorted sets may be an ideal solution for you. It's good for pagination etc. One of the crucial drawback is the complexity of adding a new element is O(log(N)).
lists may also provide a good playground for your case. You may use LPUSH/RPUSH to add new events to your list, and since Redis lists are implemented with linked lists, both adding a message to the beginning or end of the list is same, O(1), which is great.
Whenever an event happens, you either call ZADD or RPUSH/LPUSH to send the events to redis. If you need to query those to you may use available functions such as ZRANGEBYSCORE or LRANGE depending on your choice.
While designing your keys you may use an identifier such as user-id just like you mentioned in the comments. You will not have the problems with lists/sorted sets like you will have in strings. But choosing which one is most suitable for your depends on your use case for reads/writes or business rules.
Here some useful links to read;
Redis data types intro
Redis data types
Redis labs documentation about data types

Best way to handle timouts on rabbitmq message processing

I am trying to get my head around an issue I have recently encountered and I hope someone will be able to point me in the most reasonable direction of solving it.
I am using Riak KV store and working on CRDT data, where I have some sort of counter inside each CRDT item stored in database.
I have a rabbitmq queue, where each message is a request to increase or decrease a certain amount of aforementioned counters.
Finally, I have a group of service-workers, that listens on the queue, and for each request try to change the amount of counters accordingly.
The issue I have is as follows: While a single worker is processing a request, it may get stuck for a while on a write operation to database – let’s say on a second change of counters out of three. It’s connection with rabbitmq gets lost (timeout) so the message-request gets back on to the queue (I cannot afford to miss one). Then it is picked up by second worker, which begins all processing anew. However, the first worker finishes its work, and as a results I have processed a single message twice.
I can split those increments into single actions, but this still leaves me with dilemma – can still change value of counter twice, if some worker gets stuck on a write operation for a long period.
I do not have possibility of making Riak KV CRDT writes work faster, nor can I accept missing out a message-request. I need to implement some means of checking whether a request was already processed before.
My initial thoughts were to use some alternative, quick KV store for storing rabbitMQ message ID if they are being processed. That way other workers could tell whether they are not starting to process a message that is already parsed elsewhere.
I could use any help and pointers to materials I can read.
You can't have "exactly one delivery" semantic. You can reduce double-sent messages or missed deliveries, so it's up to you to decide which misbehavior is the least inconvenient.
First of all are you sure it's the CRDTs that are too slow ? Are you using simple counters or counters inside maps ? In my experience they are quite fast, although slower than kv. You could try:
- having simple CRDTs (no maps), and more CRDTs objects, to lower their stress( can you split the counters in two ?)
- not using CRDTs but using good old sibling resolution on client side on simple key/values.
- accumulate the count updates orders and apply them in batch, but then you're accepting an increase in latency so it's equivalent to increasing the timeout.
Can you provide some metrics? Like how long the updates take, what numbers you'd expect, if it's as slow when you have few updates or many updates, etc

What is a real world use for ConcurrentBag<T>?

A ConcurrentBag will allow multiple threads to add and remove items from the bag. It is possible that a thread will add an item to the bag and then end up taking that same item right back out. It says that the ConcurrentBag is unordered, but how unordered is it? On a single thread, the bag acts like a Stack. Does unordered mean "not like a linked list"?
What is a real world use for ConcurrentBag?
Because there is no ordering the ConcurrentBag has a performance advantage over ConcurrentStack/Queue. It is implemented by Microsoft as local thread storage. So every thread that adds items does this in it's own space. When retrieving items they come from the local storage. Only when that is empty the thread steals item from another threads storage. So instead of a simple list a ConcurrentBag is a distributed list of items. And is almost lockfree and should scale better with high concurrency.
Unfortunately in .NET 4.0 there was a performance issue (fixed in 4.5) see
http://ayende.com/blog/156097/the-high-cost-of-concurrentbag-in-net-4-0
Bags are really useful for tracking instance counts. For example, if you want to keep a record of which hosts you're servicing web requests for, you can add their IP to the bag when you start servicing the request, and remove it when done.
Using a bag will allow you to tell at a glance which IPs you're currently servicing. It will also let you quickly query whether you're servicing a given IP address.
If you use a set for this rather than a bag, then having multiple concurrent requests from the same IP address will mess up your record-keeping.
Anything where you just need to keep track of what's there and don't need random access or guaranteed order. If you have a thread that adds items to process, and a thread that removes items in order to process them, a concurrent bag would work well if you don't care that they're processed in FIFO order.
Thanks to #Chris Jester-Young I came up with a good, real world, scenario that actually applies to a project i'm working on.
Find - Process - Store
Find - threads 1 & 2 are set to find or scrape data (file system, web, etc). These results are stored in ConcurrentBag1.
Process - threads 3 & 4 are set to take out of ConcurrentBag1, clean/transform/process the data and then store the results in ConcurrentBag2.
Store - threads 5 is set to gather results from ConcurrentBag2 and store the results in SQL.

Redis and Object Versioning

How are people coping with changes to redis object schemas - adding or removing properties from objects?
Sharing from my own experience (one year old project with thousands of user requests per second).
Usually, there were two scenarios for me:
Add new information to existing structures (like, "email" field to a user)
Remove or change existing values in existing structures (like, change format of some field)
Drop stuff from the database
For 1 I keep following simple strategy: degrade gracefully, e.g. if user doesn't have email record - treat it as empty email. Worked all the time.
For 2 and 3 it depends, whether data can be changed/calculated/fixed before releasing or after. I run a job on database that does all the work for me, for few millions of keys it takes considerable time (minutes). If that job can be run only after I release the new code - then degrading gracefully helps a lot, I simply release and then run the job.
PS: If you affect a lot of keys in redis then it is very important to use http://redis.io/topics/pipelining Saves a lot of time.
Take a list of all affected (i.e. you want to fix them in any way) keys or records in pipeline
Do whatever you want on them. If it's possible try to queue writing operations into pipeline too
Send queued operations to redis.
It is also very important for you to make indexes of your structures. I keep sets with ids. Then I simply iterate over SMEMBERS(set_with_ids).
It is much, much better than iterating over KEYS command.
For extremely simple versioning, you could use different database numbers. This could be quite limiting in cases where almost everything is the same between two versions but it's also a very clean way to do it if it will work for you.