Running Multiple Redis Sentinels through ServiceStack

I'm working on a project that uses a Redis Sentinel through ServiceStack. When the project was set up, the original developer used Redis both for caching and for maintaining a series of queues that power the logic of the system. Due to performance issues, we are planning to spin up a new Redis Sentinel box and split the functionality, with caching done on one server and queuing done on another.
I was able to make a couple of small changes to a local instance to split it between two servers by using a RedisManagerPool and a PooledRedisClientManager:
container.Register<IRedisClientsManager>(c => new RedisManagerPool(redCon, poolConfig));
container.Register<PooledRedisClientManager>(c => new PooledRedisClientManager(redCon2Test));

container.Register(c => c.Resolve<IRedisClientsManager>().GetClient());
container.Register(c => c.Resolve<PooledRedisClientManager>().GetClient());

// REDIS CACHE
container.Register(c => c.Resolve<PooledRedisClientManager>().GetCacheClient());

// SESSION
container.Register(c => new SessionFactory(c.Resolve<ICacheClient>()));

// REDIS MQ
container.Register<IMessageService>(c => new RedisMqServer(c.Resolve<IRedisClientsManager>())
{
    DisablePriorityQueues = true,
    DisablePublishingResponses = true,
    RetryCount = 2
});
container.Register(q => q.Resolve<IMessageService>().MessageFactory);
this.RegisterHandlers(container.Resolve<IMessageService>() as RedisMqServer);
The problem, though, is that I don't have Redis Sentinel set up on the machine I'm using, and when I tried to drop a Sentinel connection in as a pooled Redis connection, I got compilation errors on the second Start(). It will let me cast it as a PooledRedisClientManager, but I wasn't sure whether pooled and Sentinel managers would even play well together to begin with:
if (useSentinel)
{
    var hosts = redCon.Split(',');
    var sentinel = new RedisSentinel(hosts, masterName)
    {
        RedisManagerFactory = CreateRedisManager,
        ScanForOtherSentinels = false,
        SentinelWorkerConnectTimeoutMs = 150,
        OnWorkerError = OnWorkerError,
        OnFailover = OnSentinelFailover,
        OnSentinelMessageReceived = (x, y) => Log.Debug($"MSG: {x} DETAIL: {y}")
    };
    container.Register(c => sentinel.Start());

    var hosts2 = redCon.Split(',');
    var sentinel2 = new RedisSentinel(hosts2, masterName)
    {
        RedisManagerFactory = CreatePooledRedisClientManager,
        ScanForOtherSentinels = false,
        SentinelWorkerConnectTimeoutMs = 150,
        OnWorkerError = OnWorkerError,
        OnFailover = OnSentinelFailover,
        OnSentinelMessageReceived = (x, y) => Log.Debug($"MSG: {x} DETAIL: {y}")
    };
    container.Register<PooledRedisClientManager>(c => sentinel2.Start());
}
But honestly, I'm not sure if this is even the correct way to go about this. Should I even be using the pooled manager at all? Is there a good way to register two different Redis Sentinel servers in the container and split them the way I'm attempting?

ServiceStack only allows one IRedisClientsManager implementation per AppHost. If you're using RedisSentinel, its .Start() method will return a pre-configured PooledRedisClientManager that utilizes the RedisSentinel configuration.
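So your primary (caching) cluster only needs the single registration below (a minimal sketch; hosts, masterName and sentinel are taken from the question's code):

var sentinel = new RedisSentinel(hosts, masterName);
// Start() connects to the Sentinels, resolves the current master and
// returns a pooled IRedisClientsManager configured for failover
container.Register<IRedisClientsManager>(c => sentinel.Start());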
If you wanted RedisMqServer to use a different RedisSentinel cluster, you should avoid duplicating Redis registrations in the IOC and just configure it directly on the RedisMqServer, e.g:
container.Register<IMessageService>(c => new RedisMqServer(sentinel2.Start())
{
    DisablePriorityQueues = true,
    DisablePublishingResponses = true,
    RetryCount = 2
});
However, given that RedisSentinel typically requires 6 nodes for a minimal highly available configuration, it seems counterproductive to double the required infrastructure just to have a separate Sentinel cluster for RedisMQ, especially when the load from using Redis as a message transport should be negligible compared to the compute resources needed to process the messages. What's the MQ throughput? You should verify that load on the Redis servers is actually the bottleneck, as that's very unlikely.
I would recommend avoiding this duplicated complexity and using a different MQ server instead: see if Background MQ is an option, where MQ requests are executed in memory on background threads. If you need a distributed MQ, look at Rabbit MQ, which is purpose-built for the task and would require a lot less maintenance than trying to manage two separate RedisSentinel cluster configurations.
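For illustration, either option is a drop-in IMessageService registration (a sketch, not from the question's code; the Rabbit MQ connection string is a placeholder, and RabbitMqServer requires the ServiceStack.RabbitMq package):

// In-memory MQ executed on background threads - no extra infrastructure
container.Register<IMessageService>(c => new BackgroundMqService());

// Or a dedicated Rabbit MQ broker
container.Register<IMessageService>(c => new RabbitMqServer("localhost:5672")
{
    RetryCount = 2
});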

Related

Kafka Parallel Consumer is not splitting work between different processes

I am using confluent parallel-consumer in order to achieve fast writes into different data stores. I implemented my code and everything worked just fine locally with Docker.
Once I started several hosts with several consumers (with the same group id), I noticed that only one of the nodes (processes) was really consuming data. The topic I am reading from has 24 partitions, and I have 3 different nodes; I expected that Kafka would split the work between them.
Here are parts of my code:
fun buildConsumer(config: KafkaConsumerConfig): KafkaConsumer<String, JsonObject> {
    val props = Properties()
    props[ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG] = config.kafkaBootstrapServers
    props[ConsumerConfig.AUTO_OFFSET_RESET_CONFIG] = "earliest"
    props[ConsumerConfig.GROUP_ID_CONFIG] = "myGroup"
    // Auto commit must be false in parallel consumer
    props[ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG] = false
    props[ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG] = StringDeserializer::class.java.name
    props[ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG] = JsonObjectDeSerializer::class.java.name
    return KafkaConsumer<String, JsonObject>(props)
}
private fun createReactParallelConsumer(): ReactorProcessor<String, JsonObject> {
    val options = ParallelConsumerOptions.builder<String, JsonObject>()
        .ordering(ParallelConsumerOptions.ProcessingOrder.KEY)
        .maxConcurrency(10)
        .batchSize(1)
        .consumer(buildConsumer(kafkaConsumerConfig))
        .build()
    return ReactorProcessor(options)
}
And my main code:
pConsumer = createReactParallelConsumer()
pConsumer.subscribe(UniLists.of(kafkaConsumerConfig.kafkaTopic))
pConsumer.react { context ->
    batchProcessor.processBatch(context)
}
Would appreciate any advice
We hit an issue that was closed in version 0.5.2.4: https://github.com/confluentinc/parallel-consumer/issues/409
The parallel consumer client kept old unfinished offsets. Since our consumer was slow (for many different reasons), we reached the end of the retention period (with the earliest strategy), so every time we restarted the consumer it re-scanned all those incompatible offsets, which it did not truncate (AKA the bug). The fix was just updating the version from 0.5.2.3 to 0.5.2.4.

Task queues and result queues with Celery and Rabbitmq

I have implemented Celery with RabbitMQ as Broker. I rely on Celery v4.4.7 since I have read that v5.0+ doesn't support RabbitMQ anymore. RabbitMQ is a MUST in my case.
Everything has been containerized and then deployed as pods within Kubernetes 1.19. I am able to execute long-running tasks and everything apparently looks fine at first glance. However, I have a few concerns which require your expertise.
I have declared inbound and outbound queues, but Celery created its own, and I do not see any messages within those queues (inbound or outbound):
inbound_queue = "_IN"
outbound_queue = "_OUT"

app = Celery()
app.conf.update(
    broker_url = 'pyamqp://%s//' % path,
    broker_heartbeat = None,
    broker_connection_timeout = int(timeout),
    result_backend = 'rpc://',
    result_persistent = True,
    task_queues = (
        Queue(algorithm_queue, Exchange(inbound_queue), routing_key='default', auto_delete=False),
        Queue(result_queue, Exchange(outbound_queue), routing_key='default', auto_delete=False),
    ),
    task_default_queue = inbound_queue,
    task_default_exchange = inbound_exchange,
    task_default_exchange_type = 'direct',
    task_default_routing_key = 'default',
)
@app.task(bind=True,
          name='osmq.tasks.add',
          queue=inbound_queue,
          reply_to=outbound_queue,
          autoretry_for=(Exception,),
          retry_kwargs={'max_retries': 5, 'countdown': 2})
def execute(self, data):
    <method_implementation>
I have implemented callbacks to get results back via REST APIs. However, it randomly does or doesn't return results even when the status is successful. This is probably related to message persistency. In detail: when I use the Flower API to get task info, the status is successful and the result is partially displayed (shortened JSON messages); when I call AsyncResult for the same task, the result is either None or the right one. I do not understand the mechanism between the RabbitMQ queues and Kombu, which seems to cache the result message. I must guarantee that results can be retrieved every time a task has been successfully executed.
def callback(uuid):
    task = app.AsyncResult(uuid)
To correct one premise: Celery 5.0+ still supports RabbitMQ as a broker; specifically, what it no longer supports is amqp:// as a result backend. However, as in your example, rpc:// is supported.
The relevant snippet is here: https://docs.celeryproject.org/en/stable/getting-started/backends-and-brokers/index.html#rabbitmq
We tend to always set ignore_result=True in our implementation, so I can't give any practical tips on how to use rpc://, other than to infer that any response is put on an application-specific queue, instead of being able to be put on a specified queue (or even a different broker / RabbitMQ instance) as with amqp://.

Akka.NET with persistence dropping messages when CPU is under high pressure?

I am doing some performance testing of my PoC. What I see is that my actor is not receiving all the messages sent to it, and performance is very low. I sent around 150k messages to my app, which caused a peak on my processor at 100% utilization. But when I stopped sending requests, 2/3 of the messages had not been delivered to the actor. Here are simple metrics from App Insights:
As evidence, the number of events persisted in Mongo is almost the same as the number of messages my actor received.
Secondly, the performance of processing messages is very disappointing: I get around 300 messages per second.
I know Akka.NET message delivery is at-most-once by default, but I don't get any errors saying that messages were dropped.
Here is code:
Cluster shard registration:
services.AddSingleton<ValueCoordinatorProvider>(provider =>
{
    var shardRegion = ClusterSharding.Get(_actorSystem).Start(
        typeName: "values-actor",
        entityProps: _actorSystem.DI().Props<ValueActor>(),
        settings: ClusterShardingSettings.Create(_actorSystem),
        messageExtractor: new ValueShardMsgRouter());
    return () => shardRegion;
});
Controller:
[ApiController]
[Route("api/[controller]")]
public class ValueController : ControllerBase
{
    private readonly IActorRef _valueCoordinator;

    public ValueController(ValueCoordinatorProvider valueCoordinatorProvider)
    {
        _valueCoordinator = valueCoordinatorProvider();
    }

    [HttpPost]
    public Task<IActionResult> PostAsync(Message message)
    {
        _valueCoordinator.Tell(message);
        return Task.FromResult((IActionResult)Ok());
    }
}
Actor:
public class ValueActor : ReceivePersistentActor
{
    public override string PersistenceId { get; }
    private decimal _currentValue;

    public ValueActor()
    {
        PersistenceId = Context.Self.Path.Name;
        Command<Message>(Handle);
    }

    private void Handle(Message message)
    {
        Context.IncrementMessagesReceived();
        var accepted = new ValueAccepted(message.ValueId, message.Value);
        Persist(accepted, valueAccepted =>
        {
            _currentValue = valueAccepted.BidValue;
        });
    }
}
Message router:
public sealed class ValueShardMsgRouter : HashCodeMessageExtractor
{
    public const int DefaultShardCount = 1_000_000_000;

    public ValueShardMsgRouter() : this(DefaultShardCount)
    {
    }

    public ValueShardMsgRouter(int maxNumberOfShards) : base(maxNumberOfShards)
    {
    }

    public override string EntityId(object message)
    {
        return message switch
        {
            IWithValueId valueMsg => valueMsg.ValueId,
            _ => null
        };
    }
}
akka.conf
akka {
  stdout-loglevel = ERROR
  loglevel = ERROR
  actor {
    debug {
      unhandled = on
    }
    provider = cluster
    serializers {
      hyperion = "Akka.Serialization.HyperionSerializer, Akka.Serialization.Hyperion"
    }
    serialization-bindings {
      "System.Object" = hyperion
    }
    deployment {
      /valuesRouter {
        router = consistent-hashing-group
        routees.paths = ["/values"]
        cluster {
          enabled = on
        }
      }
    }
  }
  remote {
    dot-netty.tcp {
      hostname = "desktop-j45ou76"
      port = 5054
    }
  }
  cluster {
    seed-nodes = ["akka.tcp://valuessystem@desktop-j45ou76:5054"]
  }
  persistence {
    journal {
      plugin = "akka.persistence.journal.mongodb"
      mongodb {
        class = "Akka.Persistence.MongoDb.Journal.MongoDbJournal, Akka.Persistence.MongoDb"
        connection-string = "mongodb://localhost:27017/akkanet"
        auto-initialize = off
        plugin-dispatcher = "akka.actor.default-dispatcher"
        collection = "EventJournal"
        metadata-collection = "Metadata"
        legacy-serialization = off
      }
    }
    snapshot-store {
      plugin = "akka.persistence.snapshot-store.mongodb"
      mongodb {
        class = "Akka.Persistence.MongoDb.Snapshot.MongoDbSnapshotStore, Akka.Persistence.MongoDb"
        connection-string = "mongodb://localhost:27017/akkanet"
        auto-initialize = off
        plugin-dispatcher = "akka.actor.default-dispatcher"
        collection = "SnapshotStore"
        legacy-serialization = off
      }
    }
  }
}
So there are two issues going on here: actor performance and missing messages.
It's not clear from your writeup, but I'm going to make an assumption: 100% of these messages are going to a single actor.
Actor Performance
The end-to-end throughput of a single actor depends on:
The amount of work it takes to route the message to the actor (i.e. through the sharding system, hierarchy, over the network, etc)
The amount of time it takes the actor to process a single message, as this determines the rate at which a mailbox can be emptied; and
Any flow control that affects which messages can be processed when - i.e. if an actor uses stashing and behavior switching, the amount of time an actor spends stashing messages while waiting for its state to change will have a cumulative impact on the end-to-end processing time for all stashed messages.
You will have poor performance due to item 3 on this list. The design that you are implementing calls Persist and blocks the actor from doing any additional processing until the message is successfully persisted. All other messages sent to the actor are stashed internally until the previous one is successfully persisted.
Akka.Persistence offers four options for persisting messages from the point of view of a single actor:
Persist - highest consistency (no other messages can be processed until persistence is confirmed), lowest performance;
PersistAsync - lower consistency, much higher performance. Doesn't wait for the message to be persisted before processing the next message in the mailbox. Allows multiple messages from a single persistent actor to be processed concurrently in-flight - the order in which those events are persisted will be preserved (because they're sent to the internal Akka.Persistence journal IActorRef in that order) but the actor will continue to process additional messages before the persisted ones are confirmed. This means you probably have to modify your actor's in-memory state before you call PersistAsync and not after the fact.
PersistAll - high consistency, but batches multiple persistent events at once. Same ordering and control flow semantics as Persist - but you're just persisting an array of messages together.
PersistAllAsync - highest performance. Same semantics as PersistAsync but it's an atomic batch of messages in an array being persisted together.
To get an idea as to how the performance characteristics of Akka.Persistence changes with each of these methods, take a look at the detailed benchmark data the Akka.NET organization has put together around Akka.Persistence.Linq2Db, the new high performance RDBMS Akka.Persistence library: https://github.com/akkadotnet/Akka.Persistence.Linq2Db#performance - it's a difference between 15,000 per second and 250 per second on SQL; the write performance is likely even higher in a system like MongoDB.
One of the key properties of Akka.Persistence is that it intentionally routes all of the persistence commands through a set of centralized "journal" and "snapshot" actors on each node in a cluster - so messages from multiple persistent actors can be batched together across a small number of concurrent database connections. There are many users running hundreds of thousands of persistent actors simultaneously - if each actor had their own unique connection to the database it would melt even the most robustly vertically scaled database instances on Earth. This connection pooling / sharing is why the individual persistent actors rely on flow control.
You'll see similar performance using any persistent actor framework (i.e. Orleans, Service Fabric) because they all employ a similar design for the same reasons Akka.NET does.
To improve your performance, you will need to either batch received messages together and persist them in a group with PersistAll (think of this as de-bouncing) or use asynchronous persistence semantics using PersistAsync.
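For instance, here is the question's Handle method reworked for PersistAsync (a minimal sketch based on the question's own types; PersistAll batching would follow the same shape):

private void Handle(Message message)
{
    Context.IncrementMessagesReceived();
    var accepted = new ValueAccepted(message.ValueId, message.Value);

    // With PersistAsync, mutate in-memory state up front: the actor keeps
    // processing new messages before this event is confirmed as persisted.
    _currentValue = accepted.BidValue;

    // Events are still written to the journal in order, but the mailbox is
    // no longer blocked waiting on each individual write to complete.
    PersistAsync(accepted, _ => { });
}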
You'll also see better aggregate performance if you spread your workload out across many concurrent actors with different entity ids - that way you can benefit from actor concurrency and parallelism.
Missing Messages
There could be any number of reasons why this might occur - most often it's going to be the result of:
Actors being terminated (not the same as restarting) and dumping all of their messages into the DeadLetter collection;
Network disruptions resulting in dropped connections - this can happen when nodes are sitting at 100% CPU - messages that are queued for delivery at the time can be dropped; and
The Akka.Persistence journal receiving timeouts back from the database will result in persistent actors terminating themselves due to loss of consistency.
You should look for the following in your logs:
DeadLetter warnings / counts
OpenCircuitBreakerExceptions coming from Akka.Persistence
You'll usually see both of those appear together - I suspect that's what is happening to your system. The other possibility could be Akka.Remote throwing DisassociationExceptions, which I would also look for.
You can fix the Akka.Remote issues by changing the heartbeat values for the Akka.Cluster failure-detector in configuration (see https://getakka.net/articles/configuration/akka.cluster.html):
akka.cluster.failure-detector {
  # FQCN of the failure detector implementation.
  # It must implement akka.remote.FailureDetector and have
  # a public constructor with a com.typesafe.config.Config and
  # akka.actor.EventStream parameter.
  implementation-class = "Akka.Remote.PhiAccrualFailureDetector, Akka.Remote"

  # How often keep-alive heartbeat messages should be sent to each connection.
  heartbeat-interval = 1 s

  # Defines the failure detector threshold.
  # A low threshold is prone to generate many wrong suspicions but ensures
  # a quick detection in the event of a real crash. Conversely, a high
  # threshold generates fewer mistakes but needs more time to detect
  # actual crashes.
  threshold = 8.0

  # Number of the samples of inter-heartbeat arrival times to adaptively
  # calculate the failure timeout for connections.
  max-sample-size = 1000

  # Minimum standard deviation to use for the normal distribution in
  # AccrualFailureDetector. Too low standard deviation might result in
  # too much sensitivity for sudden, but normal, deviations in heartbeat
  # inter arrival times.
  min-std-deviation = 100 ms

  # Number of potentially lost/delayed heartbeats that will be
  # accepted before considering it to be an anomaly.
  # This margin is important to be able to survive sudden, occasional,
  # pauses in heartbeat arrivals, due to for example garbage collect or
  # network drop.
  acceptable-heartbeat-pause = 3 s

  # Number of member nodes that each member will send heartbeat messages to,
  # i.e. each node will be monitored by this number of other nodes.
  monitored-by-nr-of-members = 9

  # After the heartbeat request has been sent the first failure detection
  # will start after this period, even though no heartbeat message has
  # been received.
  expected-response-after = 1 s
}
Bump the acceptable-heartbeat-pause = 3 s value to something larger, like 10, 20, or 30 seconds, if needed.
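For example, a minimal override (the 20 s value is illustrative):

akka.cluster.failure-detector {
  # tolerate longer pauses in heartbeat arrivals while nodes are CPU-saturated
  acceptable-heartbeat-pause = 20 s
}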
Sharding Configuration
One last thing I want to point out with your code - the shard count is way too high. You should have about ~10 shards per node. Reduce it to something reasonable.
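For example, reusing the question's own registration (30 is illustrative: ~10 shards per node on a 3-node cluster):

var shardRegion = ClusterSharding.Get(_actorSystem).Start(
    typeName: "values-actor",
    entityProps: _actorSystem.DI().Props<ValueActor>(),
    settings: ClusterShardingSettings.Create(_actorSystem),
    messageExtractor: new ValueShardMsgRouter(30)); // ~10 shards per node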

Keep track of number of connected clients in expressjs across multi server cluster

I'm using SSE and need to keep track of the number of active connections across the cluster. In each instance I can do:
server.getConnections(function(err, count){
    if (err) throw err;
    console.log(count);
});
What would be the best way to keep an accurate total in a cluster with this method and using Redis?
Not knowing all the details of your setup, I would approach this as follows:
1. Assuming all your cluster instances talk to the same Redis instance, set up a key called connection_count on the Redis store: SET connection_count 0.
2. Handle the server.on('connection', (socket) => {...}) event on each cluster node. Inside the handler, atomically increment the value of the connection_count key on the Redis store: INCR connection_count.
3. Handle request.on('close', () => {...}) and atomically decrement the connection_count value on the Redis store from within the event handler: DECR connection_count.

One task per node only in Apache Ignite

I'm relatively new to Apache Ignite. I'm using Ignite compute to distribute tasks to nodes. My goal is a task dispatcher that produces tasks and submits these only to nodes that are "free". One node can only do one task at a time. If all nodes have a task running, the dispatcher shall wait for the next node to become available and then submit the next task.
I could implement this with a queue and async Callables; however, I wonder whether Ignite has a built-in class that does something like this. I'm not sure the ComputeTaskSplitAdapter class is what I need to look at, as I don't fully understand its purpose.
Any help appreciated.
Server nodes can join and leave the cluster while tasks are distributed.
Tasks can take different amounts of time on the nodes, and as soon as a server finishes a task it shall get the next one.
Here's my node code:
JobStealingCollisionSpi spi = new JobStealingCollisionSpi();
spi.setActiveJobsThreshold(1);
IgniteConfiguration cfg = new IgniteConfiguration();
cfg.setCollisionSpi(spi);
Ignition.start(cfg);
And this is my job distribution code (for testing):
JobStealingCollisionSpi spi = new JobStealingCollisionSpi();
spi.setActiveJobsThreshold(1);

IgniteConfiguration cfg = new IgniteConfiguration();
cfg.setCollisionSpi(spi);

Ignition.setClientMode(true);
Ignite ignite = Ignition.start(cfg);

for (int i = 0; i < 10; i++)
{
    ignite.compute().runAsync(new IgniteRunnable()
    {
        @Override
        public void run()
        {
            System.out.print("Sleeping...");
            try
            {
                Thread.sleep(10000);
            }
            catch (InterruptedException e)
            {
                e.printStackTrace();
            }
            System.out.println("Done.");
        }
    });
}
Yes, Apache Ignite has direct support for it. Please take a look at the One-at-a-Time section in the Job Scheduling documentation: https://apacheignite.readme.io/docs/job-scheduling#section-one-at-a-time
Note that every server has its own waiting queue, and servers will move to the next job in their queue immediately after they are done with the previous one.
If you would like even more aggressive scheduling, then you can take a look at Job-Stealing scheduling here: https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/spi/collision/jobstealing/JobStealingCollisionSpi.html
With Job Stealing enabled, servers will steal jobs from the job queues of other servers once their own queue becomes empty. Most of the parameters are configurable.