Apache Ignite node segmented - ignite

From time to time a node in my cluster gets segmented. The cluster has ~40 nodes, and it happens to only one node at a time. A few times it coincided with heavy GC activity. On the other hand, I have seen similarly heavy GC activity with no node getting segmented. I raised the failure detection timeout above the longest GC pause I was seeing (it is now almost 2x the max GC pause), but that didn't help. How can I figure out whether this is really GC or a network issue?
I doubt it is related to networking, as other nodes would fail as well. When the process is restarted it works fine, so I would rule out network issues.
Where can I look at the code which produces EVT_NODE_SEGMENTED?
I debugged the IgniteConfiguration object and saw that segResolvers is null/empty, so I have no clue where the event is published.
S3-based discovery is used; not sure it matters here (Ignite 1.9).
Under which conditions is such an event produced? When the node is unable to connect to the majority of other nodes, or to all of them?

This event occurs when a node disconnects and can't connect back; see ClientImpl.java and ServerImpl.java.
Look at the logs on the segmented node. There should be a message such as "Node is out of topology (probably, due to short-time network problems)", from which you can figure out the exact problem.
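To tell GC apart from network trouble, it can help to record JVM stalls with timestamps and line them up against the discovery log on the segmented node. Below is a minimal sketch of such a stall detector using only the JDK. The class name PauseMonitor, its method names, and the 10-second timeout are assumptions for illustration, not part of Ignite. Note that it reports any whole-process stall (GC, swapping, CPU starvation), not GC specifically.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not an Ignite class.
final class PauseMonitor {
    /** How much longer than requested a sleep took, in ms (never negative). */
    static long stallMillis(long requestedMs, long elapsedMs) {
        return Math.max(0L, elapsedMs - requestedMs);
    }

    /** True when a stall is at least as long as the configured failure detection timeout. */
    static boolean exceedsTimeout(long stallMs, long failureDetectionTimeoutMs) {
        return stallMs >= failureDetectionTimeoutMs;
    }

    /** Sleeps in short ticks and returns every stall of at least reportThresholdMs. */
    static List<Long> sample(int ticks, long tickMs, long reportThresholdMs)
            throws InterruptedException {
        List<Long> stalls = new ArrayList<>();
        for (int i = 0; i < ticks; i++) {
            long start = System.nanoTime();
            Thread.sleep(tickMs);
            long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            long stall = stallMillis(tickMs, elapsedMs);
            if (stall >= reportThresholdMs) {
                stalls.add(stall);
            }
        }
        return stalls;
    }

    public static void main(String[] args) throws InterruptedException {
        long failureDetectionTimeoutMs = 10_000L; // assumed value; use your own setting
        // Short demo run: 20 ticks of 50 ms, reporting stalls of 200 ms or more.
        for (long stall : sample(20, 50L, 200L)) {
            System.out.printf("%d stall of %d ms, exceeds timeout: %b%n",
                    System.currentTimeMillis(), stall,
                    exceedsTimeout(stall, failureDetectionTimeoutMs));
        }
    }
}
```

In a real node you would run the sampling loop in a daemon thread for the lifetime of the process. If a segmentation happens with no long stall logged around that time, the pause theory becomes much less likely.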

Related

RabbitMQ slowing down after some times

On our RabbitMQ installed in production, we have a performance issue.
To explain the context, we have an initialization batch that creates around ~60k messages. For business reasons, those messages must be treated in strict order and we can't lose any. As such, we have only one queue which is durable and lazy and one consumer (SpringBoot AMQP) with a prefetch of 10. Both are on the same virtual machine.
At first, the processing is fast enough, around 5 to 10 messages per second. But it progressively slows down until it reaches a cap of fewer than 20 messages per hour. It takes approximately 1 hour to reach this point.
After some investigations, we found out that the problem comes from RabbitMQ. When we simply stop and restart it, the performance goes back to normal and then drops slowly again. Doing the same on just the consumer doesn't change anything.
I suspect a resource bottleneck, but I can't find which one, as RAM, CPU, and disk all look fine. I am not really familiar with the Erlang virtual machine or with managing RabbitMQ itself, so I may have missed something.
Does someone have an idea of the source of the problem, or where I could look for more information on what is happening?
RabbitMQ characteristics :
ERL 23.3.2
RabbitMQ 3.8.14

Sometimes RabbitMQ cluster starts to freeze and run slowly

We have a cluster of three nodes. Two of them are on one server, the third node is on the other. All queues have an "ha-all" availability setting.
We ran into a problem that sometimes the cluster stops receiving messages, or receives them very slowly.
Some of the connections are sometimes marked "flow". All nodes are lightly loaded (~10% CPU, ~7% RAM). In addition, we noticed that at such moments the "rabbitmqctl cluster_status" request hangs on all nodes and responds only after 10-15 minutes. At the same time, the connection between servers and nodes is not lost (ping is stable at ~6ms).
Also, during such problems, the ManagementUI freezes or takes a very long time to respond to requests.
What could be the problem?

Ignite slowdown when node count increases to 2

I use Ignite.Net and run ignite in my .net core app process.
My application receives some messages (5000 per second) and I put or remove some keys according to the messages received. The cache mode is replicated, with default Primary_Sync write mode.
Everything is good and I can process up to 20,000 messages/sec.
But when I run another Ignite node on another machine, everything changes. Processing speed drops to about 1000 messages per second.
Perhaps this is because some operations now go over the network, but I just want to put or remove keys on the local instance and replicate the changed keys to the other nodes. Write mode is Primary_Sync, and as I understand it this means Ignite must put or remove the key on the local node (because all nodes hold the same data in replicated mode, so there is no need to distribute keys across nodes) and then replicate the change to the other nodes asynchronously.
Where is the problem?
Is the slowdown due to network operations?
Looking at the code (could not run it - requires messing with SQL server), I can provide the following recommendations:
Use DataStreamer. Always use streamer when adding/removing batches of data.
Try using multiple threads to load the data. Ignite APIs are thread-safe.
Maybe try CacheWriteSynchronizationMode.FullAsync
Together this should result in a noticeable speedup, no matter how many nodes.

Akka.net / Cluster - How to "Heal" the topology when the leader dies?

I set up a basic test topology with Petabridge Lighthouse and two simple test actors that communicate with each other. This works well so far, but there is one problem: Lighthouse (or the underlying Akka.Cluster) makes one of my actors the leader, and when not shutting the node down gracefully (e.g. when something crashes badly or I simply hit "Stop" in VS) the Lighthouse is not usable any more. Tons of exceptions scroll by and it must be restarted.
Is it possible to configure Akka.Cluster .net in a way that the rest of the topology elects a new leader and carries on?
There are 2 things to point out here. One is that if there is a serious risk of your lighthouse node going down, you should probably have more than one.
The akka.cluster.seed-nodes setting can take multiple addresses; the only requirement is that all nodes, including lighthouses, must have them specified in the same order. This way, if one lighthouse goes down, another one can still take over its role.
The other thing is that when a node becomes unreachable (either because the process crashed or the network connection is unavailable), by default the akka.net cluster won't down that node. You need to tell it how to behave when that happens:
At any point you can configure your own IDowningProvider implementation, which will be triggered after a node has been inactive for a certain period. Then you can decide manually what to do. To use it, add the fully qualified type name to the following setting: akka.cluster.downing-provider = "MyNamespace.MyDowningProvider, MyAssembly". Example downing provider implementation can be seen here.
You can specify akka.cluster.auto-down-unreachable-after = 10s (or another time value) to give an unreachable node a timeout in which to rejoin; if it doesn't rejoin before the timeout triggers, it will be kicked out of the cluster. The only risk here is a cluster split brain: under certain conditions a network failure between machines can split your cluster in two, and if that happens with auto-down set up, the two halves of the cluster may each consider the other dead. In this case you could end up with two separate clusters instead of one.
Starting from the next release (Akka.Cluster 1.3.3) a new Split Brain Resolver feature will be available. It will allow you to configure more advanced strategies on how to behave in case of network partitions and machine crashes.
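To make the split-brain risk concrete, here is a toy model of the decision each side of a partition faces. It is written in Java purely for illustration and is not Akka.NET's API: the PartitionDecision class, its methods, and the "keep majority" rule as coded here are assumptions. Timeout-based auto-down always downs whatever it cannot see, so both sides survive; a majority rule lets only one side continue.

```java
// Hypothetical toy model; none of these names exist in Akka.NET.
final class PartitionDecision {
    enum Action { DOWN_UNREACHABLE, DOWN_SELF }

    /** Timeout-based auto-down: every side downs the nodes it cannot reach. */
    static Action autoDown() {
        return Action.DOWN_UNREACHABLE;
    }

    /**
     * Keep-majority sketch: a side survives only if it can still see more than
     * half of the known members (itself included). On an even split neither side
     * qualifies, so in this toy version both sides shut down.
     */
    static Action keepMajority(int reachableIncludingSelf, int totalMembers) {
        return reachableIncludingSelf * 2 > totalMembers
                ? Action.DOWN_UNREACHABLE
                : Action.DOWN_SELF;
    }

    public static void main(String[] args) {
        int total = 5;
        int sideA = 3;
        int sideB = 2;
        // With auto-down both sides keep running: two clusters.
        System.out.println("auto-down, side A: " + autoDown());
        System.out.println("auto-down, side B: " + autoDown());
        // With a majority rule only the larger side keeps running.
        System.out.println("keep-majority, side A: " + keepMajority(sideA, total));
        System.out.println("keep-majority, side B: " + keepMajority(sideB, total));
    }
}
```

The point of the sketch is only that a downing decision based on "what can I still see" needs a rule that at most one side can satisfy; a plain timeout gives no such guarantee.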

Akka.net Cluster Debugging

The title is a bit misleading, so let me explain further.
I have a non-thread-safe dll I have no choice but to use as part of my back end
servers. I can't use it directly in my servers as its threading issues cause
it to crash. So, I created an akka.net cluster of N nodes, each of which hosts a single
actor. All of my API calls that were originally to that bad dll are now routed through
messages to these nodes through a round-robin group. As each node only has a single,
single-threaded actor, I get safe access, but as I have N of them running I get parallelism, of a sort.
In production, I have things configured with auto-down = false and default timings on heartbeats
and so on. This works perfectly. I can fire up new nodes as needed, they get added to the group,
I can remove them with Cluster.Leave and that is happy as well.
My issue is with debugging. In our development environment we keep a cluster of 20 nodes each
exposing a single actor as described above that wraps this dll. We also have a set of nodes that act as
seed nodes and do nothing else.
When our application is run it joins the cluster. This allows it to direct requests through the round-robin
router to the nodes we keep up in our cluster. When doing development and testing and debugging the app, if I configure things to use auto-down = false
we end up with problems whenever a test run crashes or we stop the application without going through
proper cluster leaving logic, such as when we terminate the app with the stop button in the debugger.
Without auto-down, this leaves us with a missing member of the cluster that causes the leader to disallow
additions to the cluster. This means that the next time I run the app to debug, I can't join the cluster and am
stuck.
It seems that I have to have auto-down set to get debugging to work. If it is set, then when I crash my app
the node is removed from the cluster 5 seconds later. When I next fire up my
app, the cluster is back in a happy state and I can join just fine.
The problem with this is that if I am debugging the application and pause it for any amount of time, it is almost immediately
seen as unreachable and then 5 seconds later is thrown out of the cluster. Basically, I can't debug with these settings.
So, I set failure-detector.acceptable-heartbeat-pause = 600s to give me more time to pause the app
while debugging. I will get shut down in 10 min, but I don't often sit in the debugger for that long, so it's an acceptable
trade-off. The issue with this is of course that when I crash the app, or stop it in the debugger, the cluster thinks it
exists for the next 10 minutes. No one tries to talk to these nodes directly, so in theory that isn't a huge issue, but I keep
running into cases where the test I just ran got itself elected as role leader. So the role leader is now dead, but the cluster
doesn't know it yet. This seems to prevent me from joining anything new to the cluster until my 10 min are up. When I try to leave
the cluster nicely, my dead node gets stuck at the exiting state and doesn't get removed for 10 minutes. And I don't always get
notified of the removal either, forcing me to set a timeout on leaving that will cause it to give up.
There doesn't seem to be any way to say "never let me be the leader". When I have run the app with no role set for the cluster
it seems to often get itself elected as the cluster leader causing the same problem
as when the role leader is dead but unknown to be so, but at a larger level.
So, I don't really see any way around this, but maybe someone has some tricks to pull this off. I want to be able to debug
my cluster member without it being thrown out of the cluster, but I also don't want the cluster to think that leader nodes
are around when they aren't, preventing me from rejoining during my next attempt.
Any ideas?