Akka.Net cluster singleton - handover does not occur when the current singleton node shuts down unexpectedly

I'm trying Akka.Net Cluster Tools in order to use the Singleton behavior, and it seems to work perfectly, but only when the current singleton node (the "host") leaves the cluster gracefully. If I suddenly shut down the host node, the handover does not occur.
Background
I'm building a system that will initially be composed of four nodes. One of those nodes will be the "workers coordinator": it will be responsible for monitoring some data from the database and, when necessary, submitting jobs to the other workers. I was thinking of subscribing to cluster events and using the role-leader-changed event to make an actor (on the leader node) become the coordinator, but I think the Cluster Singleton is a better choice in this case.
Working sample (but only if I gracefully leave the cluster)
private void Start() {
    Console.Title = "Worker";

    var section = (AkkaConfigurationSection)ConfigurationManager.GetSection("akka");
    var config = section.AkkaConfig;

    // Create a new actor system (a container for your actors)
    var system = ActorSystem.Create("SingletonActorSystem", config);

    var cluster = Cluster.Get(system);
    cluster.RegisterOnMemberRemoved(() => MemberRemoved(system));

    var settings = new ClusterSingletonManagerSettings("processorCoordinatorInstance",
        "worker", TimeSpan.FromSeconds(5), TimeSpan.FromSeconds(1));

    var actor = system.ActorOf(ClusterSingletonManager.Props(
            singletonProps: Props.Create<ProcessorCoordinatorActor>(),
            terminationMessage: PoisonPill.Instance,
            settings: settings),
        name: "processorCoordinator");

    string line = Console.ReadLine();
    if (line == "g") { // handover works
        cluster.Leave(cluster.SelfAddress);
        _leaveClusterEvent.WaitOne();
        system.Shutdown();
    } else { // doesn't work
        system.Shutdown();
    }
}

private async void MemberRemoved(ActorSystem actorSystem) {
    await actorSystem.Terminate();
    _leaveClusterEvent.Set();
}
Configuration
akka {
  suppress-json-serializer-warning = on

  actor {
    provider = "Akka.Cluster.ClusterActorRefProvider, Akka.Cluster"
  }

  remote {
    helios.tcp {
      port = 0
      hostname = localhost
    }
  }

  cluster {
    seed-nodes = ["akka.tcp://SingletonActorSystem@127.0.0.1:4053"]
    roles = [worker]
  }
}

Thank you @Horusiath, your answer is totally right! I wasn't able to find this setting in the Akka.NET documentation, and I didn't realize that I was supposed to take a look at the Akka documentation. Thank you very much!
Have you tried to set akka.cluster.auto-down-unreachable-after to some timeout (e.g. 10 sec)? – Horusiath Aug 12 at 11:27
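For reference, a minimal sketch of what that suggestion could look like in the question's HOCON (the 10-second value is only an illustration; see the caution about auto-downing below):

akka {
  cluster {
    seed-nodes = ["akka.tcp://SingletonActorSystem@127.0.0.1:4053"]
    roles = [worker]
    # Marks unreachable members as down after the timeout, allowing the
    # singleton to hand over to the next oldest node. Use with care.
    auto-down-unreachable-after = 10s
  }
}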

Posting this as an answer, as a caution for those who find this post.
Using auto-downing is NOT recommended in a clustered environment, because different parts of the system might decide after some time that the other parts are down, splitting the cluster into two clusters, each with its own cluster singleton.
Related akka docs: https://doc.akka.io/docs/akka/current/split-brain-resolver.html
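For Akka.NET, the rough equivalent of that recommendation is to replace auto-downing with the built-in split brain resolver; a minimal sketch, assuming a recent Akka.NET version (1.4.16+) and the keep-majority strategy (the strategy choice here is an assumption, not from the original posts):

akka.cluster {
  # Let the split brain resolver decide which partition survives,
  # instead of auto-downing unreachable members.
  downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"
  split-brain-resolver {
    active-strategy = keep-majority
  }
}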

Related

Apache Ignite performance problem on Azure Kubernetes Service

I'm using Apache Ignite on Azure Kubernetes as a distributed cache.
Also, I have a web API on Azure based on .NET 6.
The Ignite service runs stably and works very well on AKS.
But on the first request, the API tries to connect to Ignite, and that takes around 3 seconds. After that, Ignite responses take around 100 ms, which is great. Here are my Web API performance outputs for the GetProduct function.
At first, I tried adding the Ignite service as a singleton, but it sometimes failed with 'connection closed'. How can I keep the Ignite connection open at all times? Or does anyone have a better idea?
Here is my latest GetProduct code:
[HttpGet("getProduct")]
public IActionResult GetProduct(string barcode)
{
Stopwatch _stopWatch = new Stopwatch();
_stopWatch.Start();
Product product;
CacheManager cacheManager = new CacheManager();
cacheManager.ProductCache.TryGet(barcode, out product);
if(product == null)
{
return NotFound(new ApiResponse<Product>(product));
}
cacheManager.DisposeIgnite();
_logger.LogWarning("Loaded in " + _stopWatch.ElapsedMilliseconds + " ms...");
return Ok(new ApiResponse<Product>(product));
}
Also, I've added the CacheManager class here:
public CacheManager()
{
    ConnectIgnite();
    InitializeCaches();
}

public void ConnectIgnite()
{
    _ignite = Ignition.StartClient(GetIgniteConfiguration());
}

public IgniteClientConfiguration GetIgniteConfiguration()
{
    var appSettingsJson = AppSettingsJson.GetAppSettings();
    var igniteEndpoints = appSettingsJson["AppSettings:IgniteEndpoint"];
    var igniteUser = appSettingsJson["AppSettings:IgniteUser"];
    var ignitePassword = appSettingsJson["AppSettings:IgnitePassword"];
    var nodeList = igniteEndpoints.Split(",");

    var config = new IgniteClientConfiguration
    {
        Endpoints = nodeList,
        UserName = igniteUser,
        Password = ignitePassword,
        EnablePartitionAwareness = true,
        SocketTimeout = TimeSpan.FromMilliseconds(System.Threading.Timeout.Infinite)
    };
    return config;
}
Make it a singleton. An Ignite node, even in client mode, is supposed to run for the entire lifetime of your application. All Ignite APIs are thread-safe. If you get a connection error, please provide more details (exception stack trace, how you create the singleton, etc.).
You can also try the Ignite thin client which consumes fewer resources and connects instantly: https://ignite.apache.org/docs/latest/thin-clients/dotnet-thin-client.
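As a rough illustration of the "make it a singleton" suggestion, here is a minimal sketch of registering the thin client once in ASP.NET Core DI and injecting it; the ProductController shape and the reuse of GetIgniteConfiguration are assumptions for illustration, not the asker's actual code:

// Program.cs / Startup.ConfigureServices - sketch only.
// Build the thin client once and let DI hand out the same instance everywhere.
// GetIgniteConfiguration() stands for the configuration-building logic
// shown in the question's CacheManager.
services.AddSingleton<IIgniteClient>(_ => Ignition.StartClient(GetIgniteConfiguration()));

// A controller then takes the shared client instead of creating a CacheManager per request.
public class ProductController : ControllerBase
{
    private readonly IIgniteClient _ignite;

    public ProductController(IIgniteClient ignite)
    {
        _ignite = ignite; // the same connection is reused across requests
    }
}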

Exception of Shard 0 already allocated

We are getting an exception after every restart of the service's 3 pods [3 shards] inside an AWS cluster. We are using the NuGet package Akka 1.4.16 for our .NET service with AWS DocumentDB. Attaching the code for your reference; please let us know if we are missing anything.
Exception in ReceiveRecover when replaying event type [Akka.Cluster.Sharding.PersistentShardCoordinator+ShardHomeAllocated] with sequence number [26] for persistenceId
Cause: System.ArgumentException: Shard 0 is already allocated (Parameter 'e')
at Akka.Cluster.Sharding.PersistentShardCoordinator.State.Updated(IDomainEvent e)
at Akka.Cluster.Sharding.PersistentShardCoordinator.ReceiveRecover(Object message)
at Akka.Persistence.Eventsourced.<>c__DisplayClass91_0.g__RecoveryBehavior|0(Object message)
at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
at Akka.Persistence.Eventsourced.<>n__0(Receive receive, Object message)
at Akka.Persistence.Eventsourced.<>c__DisplayClass92_0.b__1(Receive receive, Object message)
private void ConfigureActorSystem(IServiceCollection services)
{
    var conf = new Config();
    var setUp = conf.BootstrapApplication(Configuration["ConnectionString"]
        .Replace("{UserName}", Configuration["UserName"])
        .Replace("{Password}", Configuration["Password"]));
    var dockerconfigsetUp = setUp?.BootstrapFromDocker();
    var bootstrap = BootstrapSetup.Create().WithConfig(dockerconfigsetUp);
    var di = ServiceProviderSetup.Create(services.BuildServiceProvider());
    var actorSystemSetup = bootstrap.And(di);
    var actorSystem = ActorSystem.Create("user-actor-system", actorSystemSetup);
    var shards = 3;

    Cluster.Get(actorSystem).RegisterOnMemberUp(() =>
    {
        var provider = Akka.DependencyInjection.ServiceProvider.For(actorSystem);
        var sharding = ClusterSharding.Get(actorSystem);
        var shardRegion = sharding.Start(
            typeName: nameof(UserActor),
            entityPropsFactory: s => provider.Props<UserActor>(s),
            settings: ClusterShardingSettings.Create(actorSystem),
            messageExtractor: new MessageExtractor(shards)
        );
        Startup.ShardRegion = shardRegion;
    });

    MongoDbPersistence.Get(actorSystem);
}
Ah, in this case the issue is that your Akka.Cluster.Sharding allocation state was saved in a dirty fashion - usually the result of not allowing the cluster to properly shut itself down, so the sharding system is recovering older / no longer valid shard state data.
There are a couple of ways to fix this:
Run https://github.com/petabridge/Akka.Cluster.Sharding.RepairTool - that will purge the old ShardCoordinator data out of Akka.Persistence.
Switch from using Akka.Persistence as the backing state store for Akka.Cluster.Sharding to using DistributedData, which doesn't have this problem even when the cluster shuts down without cleanup: https://getakka.net/articles/clustering/cluster-sharding.html explains how to configure this at the top of the page.
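A minimal sketch of that second option in HOCON, assuming default settings otherwise:

akka.cluster.sharding {
  # Keep the coordinator's shard allocation state in Distributed Data instead of
  # Akka.Persistence, so an unclean shutdown does not leave stale journal events behind.
  state-store-mode = ddata
}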

How to span a ConcurrentDictionary across load-balancer servers when using SignalR hub with Redis

I have an ASP.NET Core web application set up with SignalR, scaled out with Redis.
Using the built-in groups works fine:
Clients.Group("Group_Name");
and it survives multiple load-balanced servers. I'm assuming that SignalR persists those groups in Redis automatically, so all servers know what groups we have and who is subscribed to them.
However, in my situation, I can't just rely on Groups (or Users), as there is no way to map a connectionId back to its group (say, when overriding OnDisconnectedAsync, where only the connection id is known), and you always need the Group_Name to identify the group. I need that mapping to identify which part of the group is online, so that when OnDisconnectedAsync is called, I know which group this connection belongs to and on which side of the conversation it is.
I've done some research, and all the suggestions (including the Microsoft docs) are to use something like:
static readonly ConcurrentDictionary<string, ConversationInformation> connectionMaps;
in the hub itself.
Now, this is a great solution (and thread-safe), except that it exists only in the memory of one of the load-balanced servers, and the other servers have a different instance of this dictionary.
The question is, do I have to persist connectionMaps manually? Using Redis for example?
Something like:
public class ChatHub : Hub
{
    static readonly ConcurrentDictionary<string, ConversationInformation> connectionMaps;

    ChatHub(IDistributedCache distributedCache)
    {
        connectionMaps = distributedCache.Get("ConnectionMaps");
        /// I think connectionMaps should not be static any more.
    }
}
And if yes, is it thread-safe? If not, can you suggest a better solution that works with load balancing?
I have been battling with the same issue on this end. What I've come up with is to persist the collections within the Redis cache, utilising a StackExchange.Redis.IDatabaseAsync alongside locks to handle concurrency.
This unfortunately makes the entire process effectively synchronous, but I couldn't quite figure out a way around it.
Here's the core of what I'm doing; this acquires a lock and returns a deserialised collection from the cache:
private async Task<ConcurrentDictionary<int, HubMedia>> GetMediaAttributes(bool requireLock)
{
    if (requireLock)
    {
        var retryTime = 0;
        try
        {
            while (!await _redisDatabase.LockTakeAsync(_mediaAttributesLock, _lockValue, _defaultLockDuration))
            {
                // wait till we can get a lock on the data, 100ms by default
                await Task.Delay(100);
                retryTime += 100;
                if (retryTime > _defaultLockDuration.TotalMilliseconds)
                {
                    _logger.LogError("Failed to get Media Attributes");
                    return null;
                }
            }
        }
        catch (TaskCanceledException e)
        {
            _logger.LogError("Failed to take lock within the default 5 second wait time " + e);
            return null;
        }
    }

    var mediaAttributes = await _redisDatabase.StringGetAsync(MEDIA_ATTRIBUTES_LIST);
    if (!mediaAttributes.HasValue)
    {
        return new ConcurrentDictionary<int, HubMedia>();
    }
    return JsonConvert.DeserializeObject<ConcurrentDictionary<int, HubMedia>>(mediaAttributes);
}
Updating the collection like so after I'm done manipulating it:
private async Task<bool> UpdateCollection(string redisCollectionKey, object collection, string lockKey)
{
    var success = false;
    try
    {
        success = await _redisDatabase.StringSetAsync(redisCollectionKey, JsonConvert.SerializeObject(collection, new JsonSerializerSettings
        {
            ReferenceLoopHandling = ReferenceLoopHandling.Ignore
        }));
    }
    finally
    {
        await _redisDatabase.LockReleaseAsync(lockKey, _lockValue);
    }
    return success;
}
And when I'm done, I just ensure the lock is released for other instances to grab and use:
private async Task ReleaseLock(string lockKey)
{
    await _redisDatabase.LockReleaseAsync(lockKey, _lockValue);
}
I'd be happy to hear if you find a better way of doing this. I struggled to find any documentation on scale-out with data retention and sharing.
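To tie the helpers above together, a minimal usage sketch inside OnDisconnectedAsync might look like the following; the hash-based key and the removal logic are placeholders for your own connection-to-conversation mapping, not part of the original answer:

public override async Task OnDisconnectedAsync(Exception exception)
{
    // Take the lock and read the shared collection from Redis.
    var mediaAttributes = await GetMediaAttributes(requireLock: true);
    if (mediaAttributes == null)
    {
        // Lock was never acquired, nothing to release or write back.
        await base.OnDisconnectedAsync(exception);
        return;
    }

    // Placeholder: remove whatever entry maps this connection to its conversation.
    var removedSomething = mediaAttributes.TryRemove(Context.ConnectionId.GetHashCode(), out _);

    if (removedSomething)
    {
        // UpdateCollection writes the collection back and releases the lock in its finally block.
        await UpdateCollection(MEDIA_ATTRIBUTES_LIST, mediaAttributes, _mediaAttributesLock);
    }
    else
    {
        // Nothing changed - just release the lock for the other servers.
        await ReleaseLock(_mediaAttributesLock);
    }

    await base.OnDisconnectedAsync(exception);
}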

What is the right way for Akka cluster Singleton startup

I'm trying to implement a startup scenario for an Akka cluster:
Run a cluster seed with a simple actor (a cluster monitor, which shows joins of other members)
Run another cluster member, which uses the ClusterSingletonManager and ClusterSingletonProxy actors for the singleton implementation
But I have a problem:
10:52:41.691 UTC INFO akka.tcp://system@127.0.0.1:9401/user/singletonOfEvents - ClusterSingletonManager state change [Start -> Younger]
and my singleton is not started.
I saw "The singleton actor is always running on the oldest member with specified role." in Akka Cluster Singleton doc. But I can not understand how singleton must be started. Maybe all singletons must be implemented and started in the first seed-node?
As described in Akka documentation, the Cluster Singleton actor instance is started and maintained by the ClusterSingletonManager actor on each of the cluster nodes with the specified role for the singleton. ClusterSingletonManager maintains at most one singleton instance on the oldest node of the cluster with the specified role at any point in time. Should the oldest node (could be the 1st seed node) fail, the next oldest node will be elected. To access the cluster singleton actor, use ClusterSingletonProxy which is present on all nodes with the specified role.
Here's what a sample app that starts the Cluster Singleton might look like:
object Main {
  def workTimeout = 10.seconds

  def main(args: Array[String]): Unit = {
    // Validate arguments host and port from args(0) and args(1)
    // ...
    val role = "worker"
    val conf = ConfigFactory.parseString(s"akka.cluster.roles=[$role]").
      withFallback(ConfigFactory.parseString("akka.remote.netty.tcp.hostname=" + host)).
      withFallback(ConfigFactory.parseString("akka.remote.netty.tcp.port=" + port)).
      withFallback(ConfigFactory.load())
    val system = ActorSystem("ClusterSystem", conf)

    system.actorOf(
      ClusterSingletonManager.props(
        Master.props(workTimeout),
        PoisonPill,
        ClusterSingletonManagerSettings(system).withRole(role)
      ),
      name = "master"
    )

    val singletonAgent = system.actorOf(
      ClusterSingletonProxy.props(
        singletonManagerPath = "/user/master",
        settings = ClusterSingletonProxySettings(system).withRole(role)
      ),
      name = "proxy"
    )
    // ...
  }
  // ...
}

object Master {
  def props(workTimeout: FiniteDuration): Props =
    Props(classOf[Master], workTimeout)
  // ...
}

class Master(workTimeout: FiniteDuration) extends Actor {
  import Master._
  // ...
}
The cluster configurations might look like the following:
akka {
  actor.provider = "akka.cluster.ClusterActorRefProvider"
  remote.netty.tcp.port = 0
  remote.netty.tcp.hostname = 127.0.0.1
  cluster {
    seed-nodes = [
      "akka.tcp://ClusterSystem@10.1.1.1:2552",
      "akka.tcp://ClusterSystem@10.1.1.2:2552"
    ]
    auto-down-unreachable-after = 10s
  }
  // ...
}

Akka.net cluster round-robin-group configuration. Not routing messages

I am trying to configure a cluster group router and wanted to sanity check my assumptions on "how" this works.
I have 2 separate nodes in a cluster; these have the roles "mainservice" and "secondservice". From the "mainservice" I want to send messages to an actor within the "secondservice" using a round-robin-group router.
In the akka hocon config I have the following within the akka.actor.deployment section:
/secondserviceproxy {
  router = round-robin-group
  routees.paths = ["/user/gateway"]
  nr-of-instances = 3
  cluster {
    enabled = on
    allow-local-routees = off
    use-role = secondservice
  }
}
My assumption based on the documentation is that I can create a "secondserviceproxy" actor in the "mainservice" and this handles the routing of messages to any running instances of my "secondservice" on a round-robin basis.
var secondServiceProxy = Context.System.ActorOf(Props.Empty.WithRouter(FromConfig.Instance), "secondserviceproxy");
secondServiceProxy.Tell("Main Service telling me something");
I also made the assumption that the routees.paths property means that messages are sent to an actor in the "secondservice" located at "/user/gateway" in its actor hierarchy.
Is my working assumption correct? This implementation is yielding no results in the "secondservice".
Your assumptions are correct. What's probably happening is that your message is being blasted through the cluster router before the router has had a chance to build its table of routees around the cluster (which it builds from monitoring cluster gossip).
Result? Your message initially ends up in DeadLetters. Later, once the cluster has fully formed, it will go through, because the router knows about its intended recipients around the cluster.
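One way to reduce that initial race is to hold off on the first Tell until this node has actually joined the cluster; a minimal sketch using RegisterOnMemberUp, placed wherever the question's proxy is created (this is an illustration, not the answerer's code):

// Defer creating/using the router until this node is a full cluster member,
// giving the group router time to learn about routees from cluster gossip.
var system = Context.System;
var cluster = Cluster.Get(system);
cluster.RegisterOnMemberUp(() =>
{
    var secondServiceProxy = system.ActorOf(
        Props.Empty.WithRouter(FromConfig.Instance), "secondserviceproxy");
    secondServiceProxy.Tell("Main Service telling me something");
});

Note that the routees themselves live on the "secondservice" nodes, so messages will only be routed once those nodes have joined as well.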
You could verify that if you want by subscribing to dead letters from that actor and checking if that’s where the message is going. You can do that like so:
using Akka.Actor;
using Akka.Event;

namespace Foo {
    public class DeadLetterAwareActor : ReceiveActor {
        protected ILoggingAdapter Log = Context.GetLogger();

        public DeadLetterAwareActor() {
            // subscribe to DeadLetters in the ActorSystem EventStream
            Context.System.EventStream.Subscribe(Self, typeof(DeadLetter));
            Receiving();
        }

        private void Receiving() {
            // is it my message being delivered to DeadLetters?
            Receive<DeadLetter>(msg => msg.Sender.Equals(Self), msg => {
                Log.Info("My message to {0} was not delivered :(", msg.Recipient);
            });
        }
    }
}