Detect cluster node failure Jboss AS 7.1.1-Final - jboss7.x

I have configured 2 node clusers in Jboss AS 7.1.1-Final. I am planning to use sticky sessions. Meanwhile I am also recording number of active online users in Infinispan cache with node IP from where that user session was created for reporting purpose.
I have taken care of scenarios for login/logout where I would clear our cache entries. Problem is if one of the server node goes down, I need to write clean up routine to clear such records of that node from cache too.
One of the option is to write a client and check at specific interval if server is alive otherwise trigger a clean up routine. This approach would work but I am looking for more cleaner approach if I could detect server node failure that gets notified to other live nodes then I could hit cleanup.
From console I know that it shows when server goes down or comes up. But what would be that listerner to listen to such events. Any thoughts?

If you just need to know when the node leaves within some server module (inside JBoss server) you can use the ViewChanged listener
You cannot get this information on clients connected via REST or memcached protocols - with HotRod protocol it is doable but pretty hackish, you'd have to override TransportFactory.updateServers (probably just extend TcpTransportFactory - see configuration property infinispan.client.hotrod.transport_factory)

Related

GridGain Server partition loss

We have 3 node Gridgain server and there are 3 client nodes deployed in GCP Kubernetes engine. Cluster is native persistence enabled. Also <property name="shutdownPolicy" value="GRACEFUL"/> as shutdown policy. There is one backup for each cache. After automatic cluster restart getting partition loss. Need to reset these partitions by executing control commands.
Can you provide proper solution for this. We have around 60GB persistent data.
<property name="shutdownPolicy" value="GRACEFUL"/> is supposed to protect from partition loss if certain conditions are met:
The caches must be either PARTITIONED with backups > 0 or REPLICATED. Check your configs. Default cache config in Ignite is PARTITIONED with backups = 0 (for historical reasons), so the defaults won't work.
There must be more than one baseline node (only baseline nodes store data!). Here is the doc.
You must stop the nodes in a graceful way. This is a bit tricky since you don't always control this.
If you stop with a kill to the process, make sure it uses SIGTERM and not SIGKILL because the later always kills the process immediately
If you stop with Ignite.close() this should just work
If you stop with Java System.exit() it'll work, but if you use System.halt() - it won't (because halt() is not graceful)
If you use orchestrators such as Kubernetes, you need to make sure they'll stop the nodes gracefully. For example, in Kubernetes you normally have to set terminationGracePeriodSeconds to a high value so that Kubernetes waits for the nodes to finish graceful shutdown instead of killing them.
If you use custom startup scripts, you need to make sure they forward signals to the Ignite process.
To debug this, check the points above. I would normally start by looking at the server logs (with IGNITE_QUIET=false!) to see if "Invoking shutdown hook" message is there. If it isn't there then your shutdown hook isn't getting called, and the problem is one of the points under 3. Otherwise, there should be other log messages explaining the situation.

How to configure Akka.Cluster for services that Crash when binding to port 0

What I am testing is the following scenario:
Start 2 Lighthouses, then start a 3 service that is a member of the cluster. It's seed nodes are configured to be the two Lighthouses that were previously started.
Now this 3rd service has it's HOCON set to bind to port 0, which does it's job and gives me a random port.
Now when I force quit this service to simulate a crash, The logging output from Akka.Net gets REAL chatty (important parts)
AssociationError...Tried to associate with unreachable remote address
address is now gated for 5000ms ... No connection could be made because the target machine actively refused it.
And it seems like it just goes on forever. I assume this is probably harmless and it just looks like a terrible error. The message itself makes sense, the service is literally gone so it can not and will never be able to connect.
Now if I restart the service since it's configured to bind to 0 for Akka.Remoting, it will get an entirely new port, so the Unreachable status of the other failed service will never be resolved.
Is this the expected behavior? I also think there is a configuration setting that might come into play here:
auto-down-unreachable-after
Now this comes with it's own warning about:
Using auto-down implies that two separate clusters will automatically be formed in case of network partition.
Setting this does silence the messages:
auto-down-unreachable-after = 3s
And I get a new message after the node is marked unreachable:
Association to [akka.tcp://ClusterName#localhost:58977] having UID [983892349]is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
Remote actorsystem must be restarted to recover from this situation. Seems pretty serious and something to avoid. At the same time, given that the service joins on a random port, it is irrecoverable. In trying to gain some more knowledge about the UID it seems that it's internally assigned. So I can only guess there would not be any collisions later in time with UIDs, so this would be the proper behavior.
This seems to be the only option outside of
log-info = off
to just silence the logs
I assume the logging of the lighthouse services are chatty, right? That is 'normal' behaviour of the Akka gossip protocol trying to communicate with the crashed node. When this happens, you must configure what you want to do.
The solution for solving this is not always the same for each situation. It could depend for example if you are running the services on a cloud microservices platform for example. But one of the options is indeed 'auto-downing'. This will mark the service as 'UNREACHABLE' (as you can see). This means that the node isn't out of the cluster, but the cluster continues to operate without the crashed node. That's the reason that the same node cannot join, because it is still marked as 'UNREACHABLE'.
Be aware that auto-downing could result into a 'split-brain' of the cluster, where the two parts of the cluster (for example one cluster of 4 nodes gets split into 2 clusters of 2 nodes). This is a situation that you don't want, so this may not be the best solution!
Akka.NET has some other solution to you can configure to correctly deal with this: the Split Brain Resolver. More information how to configure this: https://getakka.net/articles/clustering/split-brain-resolver.html
These are all strategies to prevent 'split-brain' situations and will involve sacrificing nodes to keep the cluster consistent. Use these strategies in combination with for example a microservices orchestration platform (so that instances will restart themselves after crashing/exiting) to create a perfect self-healing Akka cluster.

Apache Ignite Force Server Mode

We are trying to prevent our application startups from just spinning if we cannot reach the remote cluster. From what I've read Force Server Mode states
In this case, discovery will happen as if all the nodes in topology
were server nodes.
What i want to know is:
Does this client then permanently act as a server which would run computes and store caching data?
If connection to the cluster does not happen at first, a later connection to an establish cluster cause issue with consistency? What would be the expect behavior with a Topology version mismatch? Id their potential for a split brain scenario?
No, it's still a client node, but behaves as a server on discovery protocol level. For example, it can start without any server nodes running.
Client node can never cause data inconsistency as it never stores the data. This does not depend forceServerMode flag.

Can Cloudbees instances within an app communicate directly?

I am looking to build an Akka-based application in the cloud, for a garage startup that I'm bootstrapping; by the nature of the app, it's semi-stateful, with as much as possible cached in RAM for performance. (It'll be tolerant of being shut down and restarted periodically, but we want to mostly operate via cached information inside the Actors.)
The architecture is designed for a cluster of servers, communicating between them as necessary so that a user session on node A can query a middleware Actor on node B when appropriate. So my question is, how hard is that in CloudBees?
My understand from this page is that there is no automatic directory service to manage this sort of intra-cluster communication yet, but I can probably live with that -- worse comes to worst, I should be able to manage discovery via the DB, with each node registering itself when it comes up and opening up many-to-many communications with the others.
What I want to check, though, is that this communication is straightforward. Does each node have a reliable local IP that it can advertise for others to contact it on, that is at least stable during this run of the application? Or is there another/better way for a node to advertise its address to the rest of the nodes running this app?
(I assume that the nodes of an app all share the same DB instance.)
Any guidance here would be greatly appreciated. I'd like to choose a hosting provider soon, and keep returning to CloudBees as the most promising-looking of the options...
There are no limitations currently on instances communicating with each other - the trick is in discovering membership. There is an api that will be shortly be released that will allow you to track membership - but for now, the following may work:
To get the port, look at the file names in $PWD/.genapp/ports (as applications can have multiple ports) - (eg System.getenv("PWD") + ".genapp/ports" - list the files in that directory - generally will be just 1 - the file name is the port). There are other ways - for example the "sun.java.command" system property on JVM apps too.
The hostname can be obtained via the usual means (eg InetAddress.getLocalHost().getHostName()): this host
name will be the private name - ie it will resolve to a private IP -
good for node to node communication.
Public IP/hostname: perform a HTTP get (from the server) to the following URL:
http://instance-data/latest/meta-data/public-hostname (will only
return the public IP on the server side of course).
(see http://developer-blog.cloudbees.com/2012/11/finding-port-or-address-of-your.html)
You can then, as you say, on startup, register the appropriate port/private hostname with a DB, and then read that on each node to "seed" the cluster (akka doesn't have to know about all members - just enough seeds) I would think a 2 phase startup: 1: register host/port, 2, look for other members, add them as seed members to the local Akka configuration (may need to periodically do the same for a while, as other nodes startup - to ensure it is seeded enough)
From my reading of Akka setup here: http://doc.akka.io/docs/akka/snapshot/scala/remoting.html
It looks like you can specify the port - so if possible, I would set that to be the app_port environment variable - that means each node can communicate via the private hostname with that port. However, http traffic will also be routed to it - can akka handle this as well - or does it need to have a discrete port for akka and another for any http interface?

Why are my WebLogic clustered MDB app deployments in warning state?

I have a WebLogic cluster on which I've deployed numerous topics and applications that use them. My applications uniformly show themselves in a Warning status. Looking at Monitoring on the deployment, I see the MDB application connects to Server #1, but on server #2 it shows this:
MDB application appName is NOT connected to messaging system.
My JMS Server is targetted to a migratable target, which is in turn targetted to the #1 server and has a cluster identified. And messages sent to either server all flow as expected. I just don't know why these deployments show in a Warning state.
WebLogic 11g
This can be avoided by using the parameter below
<start-mdbs-with-application>false</start-mdbs-with-application>
In the weblogic-application.xml, Setting start-mdbs-with-application to false forces MDBs to defer starting until after the server instance opens its listen port, near the end of the server boot up process.
If you want to perform startup tasks after JMS and JDBC services are available, but before applications and modules have been activated, you can select the Run Before Application Deployments option in the Administration Console (or set the StartupClassMBean’s LoadBeforeAppActivation attribute to “true”).
If you want to perform startup tasks before JMS and JDBC services are available, you can select the Run Before Application Activations option in the Administration Console (or set the StartupClassMBean’s LoadBeforeAppDeployments attribute to “true”).
Refer :http://docs.oracle.com/cd/E13222_01/wls/docs81/ejb/message_beans.html
this is applicable for the versions till 12c and later
I don't like unanswered questions, so I'm going to answer this one.
The problem is resolved, though I was not involved in its resolution. At present the problem only exists for the length of time it takes the JMS subsystem to fully initialize. During that period (with many queues, it can take a while) the JNDI system throws errors and the apps are truly in warning state. Once the JMS is fully initialized, everything goes green.
My belief is that someone corrected something in the JMS Server / Cluster config. I'll never know what it was.