WebSphere 9 ND node agent stopped and the applications are still working. How/why?

This is WebSphere 9 ND. I've stopped the node agent, and the serverStatus.sh script reports that it is down: ADMU0509I: The Node Agent "nodeagent" cannot be reached. Why are the applications still authenticating and apparently still working?

See this article explaining the basic concepts of IBM WebSphere Application Server Network Deployment.
node agent
A node agent manages all managed processes on a WebSphere Application Server on a node by communicating with the Network Deployment Manager to coordinate and synchronize the configuration. A node agent performs management operations on behalf of the Network Deployment Manager. The node agent represents the node in the management cell. Node agents are installed with WebSphere Application Server base, but are not required until the node is added to a cell in a Network Deployment environment.
application server
The application server is the primary component of WebSphere. The server runs a Java™ virtual machine, providing the runtime environment for the application's code. The application server provides containers that specialize in enabling the execution of specific Java application components.
Apps are deployed to the application server, not to the node agent. The role of the node agent is to perform management operations on behalf of the Deployment Manager.
So, if the node agent is stopped, you only lose the ability to manage the servers running under that node; it will not stop already running application servers or the applications deployed to them.
You can validate this by grepping for the server name (e.g. server1) in the list of all running processes:
ps -ef | grep java | grep servername
Sample output (for an app server) is given below:
wasadmin 12345 98765 2 13:18 pts/0 00:04:57 /opt/ibm/WebSphere/AppServer/java/8.0/bin/java -Dosgi.install.area=/opt/ibm/WebSphere/AppServer <collapsed text> cellname nodename servername
where:
wasadmin - the OS username running the application server on that node
12345 - the PID of the application server running on that node
98765 - the PID of the parent process (the node agent); this will be "1" if the node agent is stopped

Related

WebLogic Admin console resiliency

I have a WebLogic cluster with cluster nodes running in 2 VMs for resiliency during a failure of either node. I use WLST scripts to manage the start and stop of the deployed components, as some components must be brought down during specific time frames.
In case the VM in which the admin console is running is down, is there any way to start/stop my deployed components if I'm not able to bring up the admin console?
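One common approach (a sketch, not from the original thread; the credentials, host, port, and domain paths below are placeholders) is to bypass the admin server entirely and drive the managed servers through Node Manager from WLST:
nmConnect('weblogic', 'password', 'node1.example.com', '5556', 'mydomain', '/u01/domains/mydomain', 'ssl')
nmStart('ManagedServer1')
nmKill('ManagedServer1')
nmDisconnect()
Node Manager runs independently on each machine, so this works even while the admin server is down.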

The cluster node is already a member of the cluster

I am trying to create a cluster in Windows Server 2019 to build a high-availability environment with two SQL Server 2019 instances, but I am getting the error "The cluster node is already a member of the cluster". I tried removing nodes from the cluster using multiple PowerShell commands, but they fail because the "Cluster Service" (Windows service) will not start. I also tried the command cluster node nodename /forcecleanup.
Both servers are communicating with each other; they are in the same subnet, the same domain, and the same IP range.
Any help would be highly appreciated!
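For reference, a sketch of the usual cleanup path when the Cluster service will not start (the node name is a placeholder): the FailoverClusters PowerShell module can wipe the stale cluster configuration from the local node with
Clear-ClusterNode -Name "SQLNODE2" -Force
run directly on the affected node, after which the node can be joined to the cluster again.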

How to restart a Service Fabric Application

I have a gMSA service account running a stateless Service Fabric application. The account has recently been added as a member of a new security group. We see that the application is not working, and I think it's because the user claims were loaded at application startup. I've seen that to get this to work with Windows Services we need to restart the service (mmc -> Services, right-click, Restart). I would like to do something similar in Service Fabric.
I see the option of restarting the node, but that is a more heavy-handed approach than I want to use. This is in production and I want to scope the solution to the problem. The other applications on the node do not have an issue, so I would prefer not to bring them down.
Service Fabric Deactivate (pause) vs Deactivate (restart)?
Thanks in advance,
Greg
What you are looking for is the Restart-ServiceFabricDeployedCodePackage command.
The Restart-ServiceFabricDeployedCodePackage cmdlet ends the code package process, which restarts all of the user service replicas hosted in that process. This restart simulates code package process failures in the cluster, which tests the failover recovery paths of your service.
You can specify a code package, or you can specify a ReplicaSelector to restart the node and code package combination where the replica is hosted. This simplifies tests on the primary host node by not having to determine which Service Fabric node is the primary node before restarting that node.
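A minimal usage sketch (the cluster endpoint, node name, application name, and package names below are placeholders for your own values):
Connect-ServiceFabricCluster -ConnectionEndpoint "mycluster:19000"
Restart-ServiceFabricDeployedCodePackage -NodeName "Node01" -ApplicationName fabric:/MyApp -ServiceManifestName "MyServicePkg" -CodePackageName "Code"
This restarts only that code package's host process, so the other applications on the node are left alone.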

How does a GlassFish cluster find active IIOP endpoints?

I have a question that I have been researching without any result. In the GlassFish documentation it is written:
If the GlassFish Server instance on which the application client is
deployed participates in a cluster, the GlassFish Server finds all
currently active IIOP endpoints in the cluster automatically. However,
a client should have at least two endpoints specified for
bootstrapping purposes, in case one of the endpoints has failed.
but I am wondering how this list is created.
I've done some tests with a stand-alone client that runs in its own JVM and makes RMI calls to an application deployed in a GlassFish cluster. I can see from the logs that the IIOP endpoint list is completed automatically and set as the com.sun.appserv.iiop.endpoints system property, but if I stop a server instance or start another one while the client is running, the list remains the one that was created when the JVM was started.
GlassFish clustering is managed by the GMS (Group Management Service) which usually uses UDP Multicast, but can use TCP where that is not available.
See section 4 "Administering GlassFish Server Clusters" in the HA Administration Guide (PDF)
The Group Management Service (GMS) enables instances to participate in a cluster by
detecting changes in cluster membership and notifying instances of the changes. To
ensure that GMS can detect changes in cluster membership, a cluster's GMS settings
must be configured correctly.
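On the client side, the documentation's advice to list at least two endpoints for bootstrapping translates to something like the following when launching a stand-alone client (hosts and ports are placeholders; 3700 is the default IIOP port):
java -Dcom.sun.appserv.iiop.endpoints=host1:3700,host2:3700 -jar my-client.jar
Once the client bootstraps against any live endpoint, the current cluster member list is obtained from the server side; as observed above, though, that list appears to be fixed for the lifetime of the client JVM.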

Windows Server 2008 VM - network services failing

I would really appreciate another perspective on an issue we have been experiencing.
The environment:
We have a small subset of VMs (5 Windows Server 2008 R2 VMs) hosted on a Windows Server 2012 cluster of 8 physical hosts, which supports hundreds of VMs across various OS versions (2008/2012, etc.).
The issue:
Servers within the subset of VMs experience widespread network service failures. The failure presents itself as a loss of connectivity for a large number of network-related services operating on the VMs (including certain critical network-dependent applications).
The impacts:
Server remains online.
Inability to RDP to the servers via Domain Accounts (Local accounts are fine).
Windows event logs associated with Netlogon Failure: Event ID 5719 - This computer was not able to set up a secure session with a domain controller in domain DOWNERGROUP due to the following:
The RPC server is unavailable. This may lead to authentication problems.
Windows event logs associated with Group Policy failure:
Event ID 1054:The processing of Group Policy failed. Windows could not
obtain the name of a domain controller. This could be caused by a name
resolution failure. Verify your Domain Name System (DNS) is configured
and working correctly
Widespread agent failure (AV, monitoring, application) - lack of connectivity to centralised management servers.
The resolution(s): stopping an agent service. Strangely, it is not limited to a specific agent: if we stop agent A, the server comes back to life, but if we instead stop only agent B, the server also comes back to life with agent A still running. Restarting the VM also resolves the issue.
Note that these events do not appear on other VMs hosted on the same host at the time of the outage. Also note that the guest is located on the same host prior to, during, and after the outage.
We have investigated the suspicion that there may be issues with dynamic port range allocation, with the server possibly getting into a bottleneck state. We have implemented the "MaxUserPort" and "TcpTimedWaitDelay" registry parameters and have set them to 65k and 30 respectively.
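For reference, a sketch of how those values are typically applied (assuming "65k" means the 65534 maximum; both values live under the Tcpip\Parameters key and require a reboot to take effect):
reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v MaxUserPort /t REG_DWORD /d 65534 /f
reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v TcpTimedWaitDelay /t REG_DWORD /d 30 /f
MaxUserPort raises the upper bound of the ephemeral port range, and TcpTimedWaitDelay (in seconds) shortens how long closed connections linger in TIME_WAIT.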
Also note that when an outage occurs, it does not always affect the same VMs in the group; often it is 2, 3, 4, or all of the servers.
I'm really just asking if anyone recognizes these symptoms and can relate them to possible causes for our situation.
Any help/discussion would be appreciated.
Well, this turned out to be an interesting resolution.
We discovered that one of our server agents, while not actually showing open ports in Netstat, had over 40,000 handles growing linearly over time.
We had to enable the "Handles" column in Task Manager to be able to see this info.
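A command-line alternative (a sketch, not from the original post) is to list the processes with the most open handles in PowerShell:
Get-Process | Sort-Object Handles -Descending | Select-Object -First 10 Name, Id, Handles
Watching the top entries over time makes a linearly growing handle count like this one easy to spot.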
This was the miracle post...
http://blogs.technet.com/b/kimberj/archive/2012/07/06/sever-quot-hangs-quot-and-ephemeral-port-exhaustion-issues.aspx