doing a R/W test with redis cluster (servers): 1 master + 2 slaves. the following is the key WRITE code:
var trans = redisDatabase.CreateTransaction();
Task<bool> setResult = trans.StringSetAsync(key, serializedValue, TimeSpan.FromSeconds(10));
Task<RedisResult> waitResult = trans.ExecuteAsync("wait", 3, 10000);
trans.Execute();
trans.WaitAll(setResult, waitResult);
using the following as the connection string:
[server1 ip]:6379,[server2 ip]:6379,[server3 ip]:6379,ssl=False,abortConnect=False
running 100 threads which do 1000 loops of the following steps:
generate a GUID as key and random as value of 1024 bytes
writing the key (using the above code)
retrieve the key using "var stringValue =
redisDatabase.StringGet(key, CommandFlags.PreferSlave);"
compare the two values and print an error if they differ.
running this test a few times generates several errors - trying to understand why as the "wait" with (10 seconds!) operation should have guaranteed the write to all slaves before returning.
Any idea?
WAIT isn't supported by SE.Redis as explained by its prolific author at Stackexchange.redis lacks the "WAIT" support
What about improving consistency guarantees, by adding in some "check, write, read" iterations?
SET a new key value pair (master node)
Read it (set CommandFlags to DemandReplica.
Not there yet? Wait and Try X times.
4.a) Not there yet? SET again. go back to (3) or give up
4.b) There? You're "done"
Won't be perfect but it should reduce probability of losing a SET??
Related
I have a piece of code that is essentially executing the following with Infinispan in embedded mode, using version 13.0.0 of the -core and -clustered-lock modules:
#Inject
lateinit var lockManager: ClusteredLockManager
private fun getLock(lockName: String): ClusteredLock {
lockManager.defineLock(lockName)
return lockManager.get(lockName)
}
fun createSession(sessionId: String) {
tryLockCounter.increment()
logger.debugf("Trying to start session %s. trying to acquire lock", sessionId)
Future.fromCompletionStage(getLock(sessionId).lock()).map {
acquiredLockCounter.increment()
logger.debugf("Starting session %s. Got lock", sessionId)
}.onFailure {
logger.errorf(it, "Failed to start session %s", sessionId)
}
}
I take this piece of code and deploy it to kubernetes. I then run it in six pods distributed over six nodes in the same region. The code exposes createSession with random Guids through an API. This API is called and creates sessions in chunks of 500, using a k8s service in front of the pods which means the load gets balanced over the pods. I notice that the execution time to acquire a lock grows linearly with the amount of sessions. In the beginning it's around 10ms, when there's about 20_000 sessions it takes about 100ms and the trend continues in a stable fashion.
I then take the same code and run it, but this time with twelve pods on twelve nodes. To my surprise I see that the performance characteristics are almost identical to when I had six pods. I've been digging in to the code but still haven't figured out why this is, I'm wondering if there's a good reason why infinispan here doesn't seem to perform better with more nodes?
For completeness the configuration of the locks are as follows:
val global = GlobalConfigurationBuilder.defaultClusteredBuilder()
global.addModule(ClusteredLockManagerConfigurationBuilder::class.java)
.reliability(Reliability.AVAILABLE)
.numOwner(1)
and looking at the code the clustered locks is using DIST_SYNC which should spread out the load of the cache onto the different nodes.
UPDATE:
The two counters in the code above are simply micrometer counters. It is through them and prometheus that I can see how the lock creation starts to slow down.
It's correctly observed that there's one lock created per session id, this is per design what we'd like. Our use case is that we want to ensure that a session is running in at least one place. Without going to deep into detail this can be achieved by ensuring that we at least have two pods that are trying to acquire the same lock. The Infinispan library is great in that it tells us directly when the lock holder dies without any additional extra chattiness between pods, which means that we have a "cheap" way of ensuring that execution of the session continues when one pod is removed.
After digging deeper into the code I found the following in CacheNotifierImpl in the core library:
private CompletionStage<Void> doNotifyModified(K key, V value, Metadata metadata, V previousValue,
Metadata previousMetadata, boolean pre, InvocationContext ctx, FlagAffectedCommand command) {
if (clusteringDependentLogic.running().commitType(command, ctx, extractSegment(command, key), false).isLocal()
&& (command == null || !command.hasAnyFlag(FlagBitSets.PUT_FOR_STATE_TRANSFER))) {
EventImpl<K, V> e = EventImpl.createEvent(cache.wired(), CACHE_ENTRY_MODIFIED);
boolean isLocalNodePrimaryOwner = isLocalNodePrimaryOwner(key);
Object batchIdentifier = ctx.isInTxScope() ? null : Thread.currentThread();
try {
AggregateCompletionStage<Void> aggregateCompletionStage = null;
for (CacheEntryListenerInvocation<K, V> listener : cacheEntryModifiedListeners) {
// Need a wrapper per invocation since converter could modify the entry in it
configureEvent(listener, e, key, value, metadata, pre, ctx, command, previousValue, previousMetadata);
aggregateCompletionStage = composeStageIfNeeded(aggregateCompletionStage,
listener.invoke(new EventWrapper<>(key, e), isLocalNodePrimaryOwner));
}
The lock library uses a clustered Listener on the entry modified event, and this one uses a filter to only notify when the key for the lock is modified. It seems to me the core library still has to check this condition on every registered listener, which of course becomes a very big list as the number of sessions grow. I suspect this to be the reason and if it is it would be really really awesome if the core library supported a kind of key filter so that it could use a hashmap for these listeners instead of going through a whole list with all listeners.
I believe you are creating a clustered lock per session id. Is this what you need ? what is the acquiredLockCounter? We are about to deprecate the "lock" method in favour of "tryLock" with timeout since the lock method will block forever if the clustered lock is never acquired. Do you ever unlock the clustered lock in another piece of code? If you shared a complete reproducer of the code will be very helpful for us. Thanks!
I've been playing around with redis to keep track of the ratelimit of an external api in a distributed system. I've decided to create a key for each route where a limit is present. The value of the key is how many request I can still make until the limit resets. And the reset is made by setting the TTL of the key to when the limit will reset.
For that I wrote the following lua script:
if redis.call("EXISTS", KEYS[1]) == 1 then
local remaining = redis.call("DECR", KEYS[1])
if remaining < 0 then
local pttl = redis.call("PTTL", KEYS[1])
if pttl > 0 then
--[[
-- We would exceed the limit if we were to do a call now, so let's send back that a limit exists (1)
-- Also let's send back how much we would have exceeded the ratelimit if we were to ignore it (ramaning)
-- and how long we need to wait in ms untill we can try again (pttl)
]]
return {1, remaining, pttl}
elseif pttl == -1 then
-- The key expired the instant after we checked that it existed, so delete it and say there is no ratelimit
redis.call("DEL", KEYS[1])
return {0}
elseif pttl == -2 then
-- The key expired the instant after we decreased it by one. So let's just send back that there is no limit
return {0}
end
else
-- Great we have a ratelimit, but we did not exceed it yet.
return {1, remaining}
end
else
return {0}
end
Since a watched key can expire in the middle of a multi transaction without aborting it. I assume the same is the case for lua scripts. Therefore I put in the cases for when the ttl is -1 or -2.
After I wrote that script I looked a bit more in depth at the eval command page and found out that a lua script has to be a pure function.
In there it says
The script must always evaluates the same Redis write commands with
the same arguments given the same input data set. Operations performed
by the script cannot depend on any hidden (non-explicit) information
or state that may change as script execution proceeds or between
different executions of the script, nor can it depend on any external
input from I/O devices.
With this description I'm not sure if my function is a pure function or not.
After Itamar's answer I wanted to confirm that for myself so I wrote a little lua script to test that. The scripts creates a key with a 10ms TTL and checks the ttl untill it's less then 0:
redis.call("SET", KEYS[1], "someVal","PX", 10)
local tmp = redis.call("PTTL", KEYS[1])
while tmp >= 0
do
tmp = redis.call("PTTL", KEYS[1])
redis.log(redis.LOG_WARNING, "PTTL:" .. tmp)
end
return 0
When I ran this script it never terminated. It just went on to spam my logs until I killed the redis server. However time dosen't stand still while the script runs, instead it just stops once the TTL is 0.
So the key ages, it just never expires.
Since a watched key can expire in the middle of a multi transaction without aborting it. I assume the same is the case for lua scripts. Therefore I put in the cases for when the ttl is -1 or -2.
AFAIR that isn't the case w/ Lua scripts - time kinda stops (in terms of TTL at least) when the script's running.
With this description I'm not sure if my function is a pure function or not.
Your script's great (without actually trying to understand what it does), don't worry :)
I am new to Infinispan. Even after going through Infinispan User Guide & googling through, I am not able to figure out behavior of Infinispan in below cases :
1) Does it lock on HotRod Client reading when rebalancing is taking place ?
2) How does Infinispan behaves with REPL mode with async & nearCache at HotRod client end ?
(I found that if nearCache is disabled then it can get the data, but not with nearCache. Does it have to anything with nearCache update?)
Server Code :
GlobalConfigurationBuilder globalConfig = GlobalConfigurationBuilder.defaultClusteredBuilder();
globalConfig.transport().clusterName("infiniReplicatedCluster").globalJmxStatistics().enable().allowDuplicateDomains(Boolean.TRUE);
ConfigurationBuilder configBuilder = new ConfigurationBuilder();
EmbeddedCacheManager embeddedCacheManager = new DefaultCacheManager(globalConfig.build());
configBuilder.dataContainer().compatibility().enable().clustering().cacheMode(CacheMode.REPL_ASYNC)
.async().replQueueInterval(120, TimeUnit.SECONDS).useReplQueue(true).hash();
embeddedCacheManager.defineConfiguration("TestCache", configBuilder.build());
Cache<String, TopologyData> cache = embeddedCacheManager.getCache("TestCache");
cache.put("00000", new TopologyData());
HotRodServerConfiguration build = new HotRodServerConfigurationBuilder().build();
HotRodServer server = new HotRodServer();
server.start(build, embeddedCacheManager);
Client Code :
ConfigurationBuilder remoteBuilder = new ConfigurationBuilder();
remoteBuilder.nearCache().mode(NearCacheMode.EAGER).maxEntries(100);
RemoteCacheManager remoteCacheManager = new RemoteCacheManager(remoteBuilder.build());
remoteCache = remoteCacheManager.getCache("TestCache");
System.out.println(remoteCache.get(fetchKey));
With above code, scenarios & results are listed below(all the run were done multiple times, resulting in same result) :
-Without nearCache 1 Key --> got value as expected
-With nearCache (LAZY/EAGER) 1 key --> null
-In same run, same Key two times With nearCache (LAZY/EAGER) --> null(first time) - expected value(next time)
Clarification needed : If a sample code to re-verify HotRod client's load balancing(RoundRobin) behavior in DIST mode. (I am successfully able to check it with REPL mode, & it works as it claims)
State transfer in Infinispan is non-blocking
Not sure I understand completely: you mean to say that, when enabling near caching on a Hot Rod client, reading from an async REPL cache doesn't work ? Does it just hang ? Do you have code that you can share ?
A clarification: Hot Rod goes to the primary owner both in DIST and REPL mode (REPL is simply a special DIST mode where number of owners is equal to the size of the cluster) hashed according to the key and will only use Round Robin if the primary is not responding.
I am trying to understand, how pipe lining in redis works? According to one blog I read, For this code
Pipeline pipeline = jedis.pipelined();
long start = System.currentTimeMillis();
for (int i = 0; i < 100000; i++) {
pipeline.set("" + i, "" + i);
}
List<Object> results = pipeline.execute();
Every call to pipeline.set() effectively sends the SET command to Redis (you can easily see this by setting a breakpoint inside the loop and querying Redis with redis-cli). The call to pipeline.execute() is when the reading of all the pending responses happens.
So basically, when we use pipe-lining, when we execute any command like set above, the command gets executed on the server but we don't collect the response until we executed, pipeline.execute().
However, according to the documentation of pyredis,
Pipelines are a subclass of the base Redis class that provide support for buffering multiple commands to the server in a single request.
I think, this implies that, we use pipelining, all the commands are buffered and are sent to the server, when we execute pipe.execute(), so this behaviour is different from the behaviour described above.
Could someone please tell me what is the right behaviour when using pyreids?
This is not just a redis-py thing. In Redis, pipelining always means buffering a set of commands and then sending them to the server all at once. The main point of pipelining is to avoid extraneous network back-and-forths-- frequently the bottleneck when running commands against Redis. If each command were sent to Redis before the pipeline was run, this would not be the case.
You can test this in practice. Open up python and:
import redis
r = redis.Redis()
p = r.pipeline()
p.set('blah', 'foo') # this buffers the command. it is not yet run.
r.get('blah') # pipeline hasn't been run, so this returns nothing.
p.execute()
r.get('blah') # now that we've run the pipeline, this returns "foo".
I did run the test that you described from the blog, and I could not reproduce the behaviour.
Setting breakpoints in the for loop, and running
redis-cli info | grep keys
does not show the size increasing after every set command.
Speaking of which, the code you pasted seems to be Java using Jedis (which I also used).
And in the test I ran, and according to the documentation, there is no method execute() in jedis but an exec() and sync() one.
I did see the values being set in redis after the sync() command.
Besides, this question seems to go with the pyredis documentation.
Finally, the redis documentation itself focuses on networking optimization (Quoting the example)
This time we are not paying the cost of RTT for every call, but just one time for the three commands.
P.S. Could you get the link to the blog you read?
I need to design a Redis-driven scalable task scheduling system.
Requirements:
Multiple worker processes.
Many tasks, but long periods of idleness are possible.
Reasonable timing precision.
Minimal resource waste when idle.
Should use synchronous Redis API.
Should work for Redis 2.4 (i.e. no features from upcoming 2.6).
Should not use other means of RPC than Redis.
Pseudo-API: schedule_task(timestamp, task_data). Timestamp is in integer seconds.
Basic idea:
Listen for upcoming tasks on list.
Put tasks to buckets per timestamp.
Sleep until the closest timestamp.
If a new task appears with timestamp less than closest one, wake up.
Process all upcoming tasks with timestamp ≤ now, in batches (assuming
that task execution is fast).
Make sure that concurrent worker wouldn't process same tasks. At the same time, make sure that no tasks are lost if we crash while processing them.
So far I can't figure out how to fit this in Redis primitives...
Any clues?
Note that there is a similar old question: Delayed execution / scheduling with Redis? In this new question I introduce more details (most importantly, many workers). So far I was not able to figure out how to apply old answers here — thus, a new question.
Here's another solution that builds on a couple of others [1]. It uses the redis WATCH command to remove the race condition without using lua in redis 2.6.
The basic scheme is:
Use a redis zset for scheduled tasks and redis queues for ready to run tasks.
Have a dispatcher poll the zset and move tasks that are ready to run into the redis queues. You may want more than 1 dispatcher for redundancy but you probably don't need or want many.
Have as many workers as you want which do blocking pops on the redis queues.
I haven't tested it :-)
The foo job creator would do:
def schedule_task(queue, data, delay_secs):
# This calculation for run_at isn't great- it won't deal well with daylight
# savings changes, leap seconds, and other time anomalies. Improvements
# welcome :-)
run_at = time.time() + delay_secs
# If you're using redis-py's Redis class and not StrictRedis, swap run_at &
# the dict.
redis.zadd(SCHEDULED_ZSET_KEY, run_at, {'queue': queue, 'data': data})
schedule_task('foo_queue', foo_data, 60)
The dispatcher(s) would look like:
while working:
redis.watch(SCHEDULED_ZSET_KEY)
min_score = 0
max_score = time.time()
results = redis.zrangebyscore(
SCHEDULED_ZSET_KEY, min_score, max_score, start=0, num=1, withscores=False)
if results is None or len(results) == 0:
redis.unwatch()
sleep(1)
else: # len(results) == 1
redis.multi()
redis.rpush(results[0]['queue'], results[0]['data'])
redis.zrem(SCHEDULED_ZSET_KEY, results[0])
redis.exec()
The foo worker would look like:
while working:
task_data = redis.blpop('foo_queue', POP_TIMEOUT)
if task_data:
foo(task_data)
[1] This solution is based on not_a_golfer's, one at http://www.saltycrane.com/blog/2011/11/unique-python-redis-based-queue-delay/, and the redis docs for transactions.
You didn't specify the language you're using. You have at least 3 alternatives of doing this without writing a single line of code in Python at least.
Celery has an optional redis broker.
http://celeryproject.org/
resque is an extremely popular redis task queue using redis.
https://github.com/defunkt/resque
RQ is a simple and small redis based queue that aims to "take the good stuff from celery and resque" and be much simpler to work with.
http://python-rq.org/
You can at least look at their design if you can't use them.
But to answer your question - what you want can be done with redis. I've actually written more or less that in the past.
EDIT:
As for modeling what you want on redis, this is what I would do:
queuing a task with a timestamp will be done directly by the client - you put the task in a sorted set with the timestamp as the score and the task as the value (see ZADD).
A central dispatcher wakes every N seconds, checks out the first timestamps on this set, and if there are tasks ready for execution, it pushes the task to a "to be executed NOW" list. This can be done with ZREVRANGEBYSCORE on the "waiting" sorted set, getting all items with timestamp<=now, so you get all the ready items at once. pushing is done by RPUSH.
workers use BLPOP on the "to be executed NOW" list, wake when there is something to work on, and do their thing. This is safe since redis is single threaded, and no 2 workers will ever take the same task.
once finished, the workers put the result back in a response queue, which is checked by the dispatcher or another thread. You can add a "pending" bucket to avoid failures or something like that.
so the code will look something like this (this is just pseudo code):
client:
ZADD "new_tasks" <TIMESTAMP> <TASK_INFO>
dispatcher:
while working:
tasks = ZREVRANGEBYSCORE "new_tasks" <NOW> 0 #this will only take tasks with timestamp lower/equal than now
for task in tasks:
#do the delete and queue as a transaction
MULTI
RPUSH "to_be_executed" task
ZREM "new_tasks" task
EXEC
sleep(1)
I didn't add the response queue handling, but it's more or less like the worker:
worker:
while working:
task = BLPOP "to_be_executed" <TIMEOUT>
if task:
response = work_on_task(task)
RPUSH "results" response
EDit: stateless atomic dispatcher :
while working:
MULTI
ZREVRANGE "new_tasks" 0 1
ZREMRANGEBYRANK "new_tasks" 0 1
task = EXEC
#this is the only risky place - you can solve it by using Lua internall in 2.6
SADD "tmp" task
if task.timestamp <= now:
MULTI
RPUSH "to_be_executed" task
SREM "tmp" task
EXEC
else:
MULTI
ZADD "new_tasks" task.timestamp task
SREM "tmp" task
EXEC
sleep(RESOLUTION)
If you're looking for ready solution on Java. Redisson is right for you. It allows to schedule and execute tasks (with cron-expression support) in distributed way on Redisson nodes using familiar ScheduledExecutorService api and based on Redis queue.
Here is an example. First define a task using java.lang.Runnable interface. Each task can access to Redis instance via injected RedissonClient object.
public class RunnableTask implements Runnable {
#RInject
private RedissonClient redissonClient;
#Override
public void run() throws Exception {
RMap<String, Integer> map = redissonClient.getMap("myMap");
Long result = 0;
for (Integer value : map.values()) {
result += value;
}
redissonClient.getTopic("myMapTopic").publish(result);
}
}
Now it's ready to sumbit it into ScheduledExecutorService:
RScheduledExecutorService executorService = redisson.getExecutorService("myExecutor");
ScheduledFuture<?> future = executorService.schedule(new CallableTask(), 10, 20, TimeUnit.MINUTES);
future.get();
// or cancel it
future.cancel(true);
Examples with cron expressions:
executorService.schedule(new RunnableTask(), CronSchedule.of("10 0/5 * * * ?"));
executorService.schedule(new RunnableTask(), CronSchedule.dailyAtHourAndMinute(10, 5));
executorService.schedule(new RunnableTask(), CronSchedule.weeklyOnDayAndHourAndMinute(12, 4, Calendar.MONDAY, Calendar.FRIDAY));
All tasks are executed on Redisson node.
A combined approach seems plausible:
No new task timestamp may be less than current time (clamp if less). Assuming reliable NTP synch.
All tasks go to bucket-lists at keys, suffixed with task timestamp.
Additionally, all task timestamps go to a dedicated zset (key and score — timestamp itself).
New tasks are accepted from clients via separate Redis list.
Loop: Fetch oldest N expired timestamps via zrangebyscore ... limit.
BLPOP with timeout on new tasks list and lists for fetched timestamps.
If got an old task, process it. If new — add to bucket and zset.
Check if processed buckets are empty. If so — delete list and entrt from zset. Probably do not check very recently expired buckets, to safeguard against time synchronization issues. End loop.
Critique? Comments? Alternatives?
Lua
I made something similar to what's been suggested here, but optimized the sleep duration to be more precise. This solution is good if you have few inserts into the delayed task queue. Here's how I did it with a Lua script:
local laterChannel = KEYS[1]
local nowChannel = KEYS[2]
local currentTime = tonumber(KEYS[3])
local first = redis.call("zrange", laterChannel, 0, 0, "WITHSCORES")
if (#first ~= 2)
then
return "2147483647"
end
local execTime = tonumber(first[2])
local event = first[1]
if (currentTime >= execTime)
then
redis.call("zrem", laterChannel, event)
redis.call("rpush", nowChannel, event)
return "0"
else
return tostring(execTime - currentTime)
end
It uses two "channels". laterChannel is a ZSET and nowChannel is a LIST. Whenever it's time to execute a task, the event is moved from the the ZSET to the LIST. The Lua script with respond with how many MS the dispatcher should sleep until the next poll. If the ZSET is empty, sleep forever. If it's time to execute something, do not sleep(i e poll again immediately). Otherwise, sleep until it's time to execute the next task.
So what if something is added while the dispatcher is sleeping?
This solution works in conjunction with key space events. You basically need to subscribe to the key of laterChannel and whenever there is an add event, you wake up all the dispatcher so they can poll again.
Then you have another dispatcher that uses the blocking left pop on nowChannel. This means:
You can have the dispatcher across multiple instances(i e it's scaling)
The polling is atomic so you won't have any race conditions or double events
The task is executed by any of the instances that are free
There are ways to optimize this even more. For example, instead of returning "0", you fetch the next item from the zset and return the correct amount of time to sleep directly.
Expiration
If you can not use Lua scripts, you can use key space events on expired documents.
Subscribe to the channel and receive the event when Redis evicts it. Then, grab a lock. The first instance to do so will move it to a list(the "execute now" channel). Then you don't have to worry about sleeps and polling. Redis will tell you when it's time to execute something.
execute_later(timestamp, eventId, event) {
SET eventId event EXP timestamp
SET "lock:" + eventId, ""
}
subscribeToEvictions(eventId) {
var deletedCount = DEL eventId
if (deletedCount == 1) {
// move to list
}
}
This however has it own downsides. For example, if you have many nodes, all of them will receive the event and try to get the lock. But I still think it's overall less requests any anything suggested here.