I would really appreciate another perspective on an issue we have been experiencing.
The environment:
We have a small subset of VMs (5 Windows Server 2008 R2 VMs) hosted on a Windows Server 2012 cluster of 8 physical hosts, which supports hundreds of VMs across various OSes (2008/2012, etc.).
The issue:
Servers within the subset of VMs experience widespread network SERVICE failures. The failure presents itself as a loss of connectivity for a large number of network-related services running on the VMs (including certain critical network-dependent applications).
The impacts:
Server remains online.
Inability to RDP to the servers via Domain Accounts (Local accounts are fine).
Windows event logs associated with Netlogon Failure: Event ID 5719 - This computer was not able to set up a secure session with a domain controller in domain DOWNERGROUP due to the following:
The RPC server is unavailable. This may lead to authentication problems.
Windows event logs associated with Group Policy failure:
Event ID 1054: The processing of Group Policy failed. Windows could not obtain the name of a domain controller. This could be caused by a name resolution failure. Verify your Domain Name System (DNS) is configured and working correctly.
Widespread agent failure (AV, monitoring, application) - lack of connectivity to centralised management servers.
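For anyone wanting to poke at a server while it is in this state, here is a minimal sketch using built-in Windows tools to check the Netlogon secure channel (DOWNERGROUP is the domain from the event text above) and to get a rough read on ephemeral port pressure; run from an elevated prompt:

rem verify/reset visibility of the secure channel to the domain
nltest /sc_query:DOWNERGROUP
nltest /sc_verify:DOWNERGROUP
rem count sockets stuck in TIME_WAIT (a rough proxy for ephemeral port exhaustion)
netstat -ano | find /c "TIME_WAIT"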
The resolution(s): stopping an agent service. Strangely, it's not limited to a specific agent: if we stop agent A, the server comes back to life; but equally, if we instead stop agent B, the server comes back to life with agent A still running. Restarting the VM also resolves the issue.
Note that these events do not appear on other VMs hosted off the same host at the time of the outage. Also note that the guest is located on the same host prior to, during and after the outage.
We have investigated the suspicion that there may be issues with dynamic range port allocation, with the server possibly getting into a bottleneck state. We have implemented the "MaxUserPort" and "TcpTimedWaitDelay" registry parameters and have set them to 65k and 30 respectively, as sketched below.
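For reference, both values live under the standard Tcpip\Parameters key and need a reboot to take effect; a minimal sketch of what we applied (65534 is the usual maximum for MaxUserPort):

reg add "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" /v MaxUserPort /t REG_DWORD /d 65534 /f
reg add "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" /v TcpTimedWaitDelay /t REG_DWORD /d 30 /f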
Also note that when an outage occurs, it does not always occur on the same VMs in the group. Often it is 2, 3, 4, or all of the servers.
I'm really just asking if anyone can recognise these symptoms and relate them to possible causes for our situation.
Any help/discussion would be appreciated.
Well, this turned out to be an interesting resolution.
We discovered that one of our server agents, while not actually showing open ports in netstat, had over 40,000 handles, growing linearly over time.
I had to enable the "Handles" column in Task Manager to be able to see this info.
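If you'd rather not watch Task Manager by eye, a quick PowerShell sketch that lists the top handle consumers; run it periodically and watch for a count that only ever grows:

# list the ten processes holding the most handles
Get-Process | Sort-Object Handles -Descending | Select-Object -First 10 ProcessName, Id, Handles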
This was the miracle post...
http://blogs.technet.com/b/kimberj/archive/2012/07/06/sever-quot-hangs-quot-and-ephemeral-port-exhaustion-issues.aspx
Related
We use JMeter to do performance testing. I configured 200 threads (200 users), and we have two servers, server A and server B. I tested each individually with 200 users and it works. We also have a load-balancing server, server C, so each request goes to either server A or server B. But if I configure the same JMX script (200 threads) against server C, it gives the error below (though it works for 50 users - no error).
org.apache.http.NoHttpResponseException: The target server failed to respond
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:95)
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:61)
at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
at org.apache.jmeter.protocol.http.sampler.MeasuringConnectionManager$MeasuredConnection.receiveResponseHeader(MeasuringConnectionManager.java:201)
at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
If the issue can be reproduced only under higher loads, it's definitely a server (or load balancer) issue, so congratulations on finding the first bottleneck.
Now you can investigate the reason and suggest fixes; the next steps could be:
Inspect the application under test / load balancer logs - you may find a clue there.
Inspect the application under test / load balancer / database / any other middleware configuration. In the majority of cases the default configuration is good for development and debugging, but you will need to perform some performance tuning before running a prod-like load test.
Collect the main health metrics on the application under test side (CPU, RAM, network, disk, swap usage, etc.). It might be the case that your application simply lacks hardware resources. You can use the built-in tools of the operating system(s), an APM tool, or the JMeter PerfMon Plugin (a minimal command-line sketch follows this list).
Re-run your test with profiling tool telemetry enabled on the application under test side. This will give you an overview of where the application spends the most time and which functions are the "heaviest" or most frequently called, so you will know what to optimise.
Make sure that the load balancer distributes the requests equally (or according to whatever algorithm is configured) between the backend servers. It might be the case that you're hitting only one server; if so, consider adding the DNS Cache Manager to your Test Plan and re-running your test to see if it helps.
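As a minimal sketch of the health-metrics point above, assuming the backend servers are Linux boxes (adjust for your OS), a few built-in commands you could leave running on each server during the test:

# memory and swap usage
free -m
# CPU, run queue, and I/O sampled every 5 seconds
vmstat 5
# socket summary - watch for ballooning TIME_WAIT or orphaned counts
ss -s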
Wall of text (my apologies, but you'll need to read it all):
Error Message: Database Error: Connection Timeout Expired. The timeout period elapsed while attempting to consume the pre-login handshake acknowledgement. This could be because the pre-login handshake failed or the server was una
Environment:
Virtual VMware Server 2008R2 SP1, running SQL 2008 SP3
32GB RAM - about 50 Databases
10Gb LAN connection, datastore storage provided by SSD SAN.
Application is CSTS connecting to SQL Server "DIRGE".
The application is configured to connect to another application for document retrieval, "Onbase", whose database is also stored on DIRGE.
Throughout the day, CSTS will get connection time-outs. It's usually in spurts, so if one user is getting a timeout, usually someone else is getting one as well.
SQL has 28GB of the 32GB allocated. Memory utilization is a consistent >95%.
We cannot add more RAM as 2008R2 standard doesn't see more than 32GB.
CPU utilization was very high at times, and the trend was increasing, so we added a second CPU (2 sockets, 4 cores per socket).
I've scoured the event logs, the SQL logs, and the CSTS error logs looking for a commonality. I'm finding very little. I've resolved all the event log errors; no joy.
NOTE: Onbase server also gets connection time outs to SQL, so I don't believe it's application specific.
Scheduled Events:
Logs are backed up at 8 am, 11 am, 2 pm, and 5 pm.
There's an SSIS package that runs every 15 minutes and takes about 8 minutes to run. However, I did not find any correlation to the timeouts (a query for catching what's running at the moment of a timeout is sketched after this list).
There are maintenance plans that run after hours as well.
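For what it's worth, a hedged T-SQL sketch you could run (or log on a schedule) the moment users report timeouts, to see active requests, their waits, and any blocking; the DMV used here is standard on SQL 2008:

-- active requests with their current wait and any blocking session
SELECT r.session_id,
       r.start_time,
       r.status,
       r.command,
       r.wait_type,
       r.blocking_session_id
FROM sys.dm_exec_requests AS r
WHERE r.session_id > 50;   -- ignore most system sessions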
IPv4 and IPv6 are both enabled.
Clients are referencing the database server by IP, so it's not a name resolution issue.
IP protocols are enabled, with the static port set to 1433.
I ran a portqry from the Onbase server to TCP 1433 and UDP 1434 and it IS listening.
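For anyone repeating this check, a sketch of the portqry syntax as I used it (DIRGE is the SQL server name from above):

rem SQL Server default instance port
portqry -n DIRGE -p tcp -e 1433
rem SQL Browser service
portqry -n DIRGE -p udp -e 1434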
We have a Solarwinds Database Analyzer running and watching this server; it says CPU and RAM are issues. I can get more details from it if anyone is interested.
I've google-fu'd the heck out of this and I just can't seem to find a good answer. From my searching, it seems this is a networking issue, but we've watched the network and I'm not seeing anything that would be the cause. Overall throughput is very low.
I will say this: the ONBASE server is on a different subnet than DIRGE, but I've run a test DB connection using the name, named pipe, and IP, and they all work without issue.
The problem is I'm not a DBA, so I'm learning this on the fly (I'm a Sr Systems Engineer).
I'm curious if someone has a suggestion on how to hunt this down.
Short story: My DDS subscriber cannot see data from my DDS publisher. What am I missing?
Long story:
QNX 6.4.1 VM A -- Broken Publisher. IP ends with .113
QNX 6.4.1 VM B -- Working Publisher. IP ends with .114
Windows 7 -- Subscriber/Main Dev box. IP ends with .203
RTI DDS 5.0 -- Middleware version
I have a QNX VM (hosted on the network, not on my machine) that is publishing some data via RTI DDS. The data never shows up in my Windows 7 subscriber application.
Interestingly enough, I can put the same code on VM B, and the subscriber gets data. Thinking this must be a Windows 7 firewall issue, I swapped VM A's IP address with VM B's, but this did not help.
Using Wireshark, I can see some heartbeat traffic from VM A, but no data. From VM B, I see the heartbeat traffic and the data. Below is a sanitized Wireshark snippet.
NDDS_DISCOVERY_PEERS is set to include both multicast and the explicit IP address of the other side of each conversation. The QOS profiles are the same, and the RTI Analyzer indicates the Match Analysis was successful (all green).
VM A:
NDDS_DISCOVERY_PEERS=udpv4://239.255.0.1,udpv4://127.0.0.1,udpv4://BLAH.203
VM B:
NDDS_DISCOVERY_PEERS=udpv4://239.255.0.1,udpv4://127.0.0.1,udpv4://BLAH.203
Windows 7 box:
NDDS_DISCOVERY_PEERS=udpv4://239.255.0.1,udpv4://127.0.0.1,udpv4://BLAH.113,udpv4://BLAH.114
When I include them in the NDDS_DISCOVERY_PEERS line, other folks on the network can see DDS traffic from VM A with DDS SPY on their Windows 7 box. My Windows 7 box can not.
Windows 7 event log does not appear to show any firewall or WFP stopping the data packets.
RTI DDS Spy run from my Windows 7 machine shows that VM A (0A061071) writers are visible on the network, but no data is being received. It also shows that the readers in my subscriber on my Windows 7 machine are visible, though they show up at an odd IP address.
Bonus question (out of curiosity only, NOT the primary question): why does traffic on my local machine show up in DDS SPY as 192.168.11.1 instead of my machine's IP or even 127.0.0.1?
Main Question: What am I missing?
Update:
route print on my Windows 7 box appears to show that I have joined a multicast group with VM A.
netsh interface ip show joins seemed to concur.
Investigation Update:
I rebooted the VM to no effect.
I rebooted the Windows box to no effect.
I removed the multicast from the NDDS_DISCOVERY_PEERS environment variables on both sides to no effect.
The Windows 7 box has three network interfaces (plus loopback): 1 LAN connection and 2 (unrelated) VM adapters. We are working with the LAN connection. The QNX VM has one network interface (plus loopback). Note that the working VM and the broken VM use different ethernet drivers, as they are slightly different flavors of QNX 6.4.1. The broken one has wm0 as its interface, and the working one has en0. I don't believe this is the issue, but it is a difference.
I ran DDS SPY on the "broken" QNX VM while it was playing back, and I got DDS data. I don't have a good way to sniff the network between where the VM is hosted and my Windows 7 machine to see if the data makes it out of the interface, but the transmitted packet count on the QNX VM's ethernet interface indicates that it is definitely transmitting something, and the Wireshark captures on the Windows 7 machine itself show that at least some traffic is making it through.
Other folks on the LAN here can see the DDS traffic from the "broken" VM, which leads me to believe it is a Windows setup issue rather than a broken VM - I just can't see what it could be. I've re-checked the firewall, to no avail. I would have thought that if it were a firewall issue, the problem would have gone away when I swapped IP addresses between VM A and VM B. In any case, the Windows 7 firewall is currently off, to no avail.
Below are several screens of Wireshark output. I skipped a bunch between the third and the fourth, as after the fourth, the traffic tended to look like the bottom of the fourth until the end.
(Skipped a bunch here)
(Pretty much continues on like the last 11 lines above)
What else should I try?
Update:
To answer Rose's question below: running rtiddsping -publisher on the bad VM and rtiddsping -subscriber on my Windows box works appropriately.
I wonder if this issue is caused by the weird IP address. The address it happens to publish and somehow latch on to belongs to a local VM ethernet adapter (separate from VM A). See the screenshot below.
The address I would like it to bind to is 10.6.6.203. Is there any way I can specify that?
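If the participant is grabbing the wrong NIC, RTI Connext can be told which interface to use via a transport property in the participant QoS. A hedged sketch - the property name here is taken from RTI's documentation for newer releases and may differ slightly in 5.0, so verify against your version; 10.6.6.203 is the address from above:

<participant_qos>
  <property>
    <value>
      <element>
        <!-- restrict the builtin UDPv4 transport to one interface (name per newer RTI docs) -->
        <name>dds.transport.UDPv4.builtin.parent.allow_interfaces_list</name>
        <value>10.6.6.203</value>
      </element>
    </value>
  </property>
</participant_qos>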
More than a year later this happened to me again with a different virtual machine. I had it working yesterday, so I was very suspicious. I scoured all my code changes for the past 24 hours for issues, but didn't find any. Then I decided to see if IT had pushed any patches to my computer.
Guess what? The Windows Firewall had been aggressively updated since the day before. Rules missing or changed, etc. The log said packets were being dropped. I opened the firewall filters up a bit, and suddenly, everything worked again. I hesitate to close this issue, as I am not 100% this was exactly the same as last year, but it feels very similar. I suspect that last year the settings in the firewall were not logging the packet drops.
Long and short of it: if DDS suddenly stops working, check your firewall settings.
A couple of things to try:
Try running rtiddsping -publisher on the broken VM and rtiddsping -subscriber on Windows. This has two advantages:
The data type is small and well-known, so if there's some problem with the data being fragmented due to the different Ethernet drivers, it will not happen with rtiddsping, and may help track down the problem.
Rtiddsping prints out when the publisher and subscriber discover each other, so you will be able to confirm that discovery is completing correctly on both sides. I am guessing discovery is working, because Analyzer is showing both applications, but it is good to confirm.
If you see the same problem with rtiddsping that you see with your application, increase the verbosity to rtiddsping -verbosity 3, and then 5. At the highest verbosity level, this will print (a lot of) additional output, which may give us a hint about what is happening.
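Concretely, the pair of commands might look like the sketch below (assuming domain 0; drop -domainId if you are on the default domain):

On the broken QNX VM:
rtiddsping -publisher -domainId 0
On the Windows 7 box (raise -verbosity to 5 if 3 shows nothing useful):
rtiddsping -subscriber -domainId 0 -verbosity 3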
To answer your bonus question about spy: The reason why spy is showing that IP address is because that is one of the addresses that is being announced as part of discovery. During discovery, a DomainParticipant can announce up to four IP addresses that can be used to reach it. Spy will choose one of those to display, but it may not be the actual address that is being used to communicate with the application. If your machine does not have any interface with the 192.168.11.1 IP address, this could indicate a larger problem. (This may be normal, though - as long as the correct IP is one of the four that are announced.)
Looking through the packet trace images, there is nothing that is obviously the problem. A few things I notice:
There seems to be a normal pattern of heartbeats/ACKNACKs in the final packet trace image. This indicates that there is some bidirectional communication between the two applications.
It is difficult to tell from these images whether the DATA being sent from .113 to .203 consists of participant-to-participant messages, or real discovery messages - except for two packets: packet #805, and packet #816 (fragments 811-815) look like discovery announcements that are being sent to .203. This indicates that you have at least four entities (DataWriters or DataReaders) in your application on .113.
So, discovery data is being sent by the application on .113. It is being received and reassembled by WireShark, but that doesn't always mean it was received correctly by the application.
Packet #816 has a heartbeat on the end of it. It is possible that packet #818 or #819 might be the ACKNACK that is responding to that heartbeat, but I can't be sure from the image. The next step is to look at those ACKNACKs from .203 to .113 to see if .203 thinks it has received all the discovery data. Here is an example of a HB/ACKNACK pair where a discovery DataReader has received all data:
Submessage: HEARTBEAT
...
firstSeqNumber: 1
lastSeqNumber: 1
The heartbeat sequence number is 1, which indicates it has only sent an announcement about a single DataReader.
Submessage: ACKNACK
...
readerSNState: 2/0:
bitmapBase: 2
numBits: 0
The readerSNState is 2/0, meaning it has received everything before sequence number two, and there is nothing missing. If there is something other than a 0 in the bitmap, it indicates the DataReader did not receive some data.
If you can confirm that the application is receiving all the discovery data correctly, it will be helpful if you can use a WireShark filter to show only user data, since the images aren't highlighting discovery vs. user data.
WireShark filter for just rtps2 user data:
rtps2 && (rtps2.traffic_nature == 3 || rtps2.traffic_nature == 1)
We had a similar issue. Here is the environment, very much summarized:
A publisher
A working subscriber (laptop)
A non-working subscriber (desktop)
Both subscribers held exactly the same software (the desktop was a clone of the laptop, via Clonezilla), but rtiddsspy was blind from the desktop's point of view; the opposite direction worked fine, however: the publisher machine's rtiddsspy saw the desktop. The laptop and the publisher always worked well together, as did the laptop and the desktop (they saw each other's subscriptions).
The workaround for this (based on https://community.rti.com/content/forum-topic/discovery-issues) was to increase the MTU on the desktop NIC. Don't ask me why, but it worked.
EDIT: At the beginning, the MTU in the publisher was set to a higher value than the subscriber. So, we changed the MTU in the subscriber to match the publisher's.
We have one VM for BizTalk and a separate VM for the SQL backend. We are using Veeam for backups which basically kicks off a snapshot of the VM. When this snapshot is being finalized on the SQL VM, BizTalk services on the application server fail. Usually they restart automatically but sometimes this requires manual intervention to start the services. The error below is logged on the BizTalk server.
Is there any timeout setting or config changes that will allow BizTalk services to stay up during the snapshot process?
An error occurred that requires the BizTalk service to terminate. The most common causes are the following:
1) An unexpected out of memory error.
OR
2) An inability to connect or a loss of connectivity to one of the BizTalk databases.
The service will shutdown and auto-restart in 1 minute. If the problematic database remains unavailable, this cycle will repeat.
Error message: [DBNETLIB][ConnectionRead (recv()).]General network error. Check your network documentation.
Error source:
BizTalk host name: BizTalkServerApplication
Windows service name: BTSSvc$BizTalkServerApplication
We experienced the same situation and error with both BizTalk 2009 and BizTalk 2013, each set up with two App servers and one SQL DB server.
When our VMware does the final step of the Snapshot backup on the Application servers, it freezes the application server for about 10 seconds, preventing it from receiving packets. On SQL Server 2008 and 2012, it by default will send out keep-alive packets to the clients every 30 seconds (30,000 ms). If the SQL server fails to receive a response back from the App server, it will send out 5 retries (default setting) of the keep-alive request 1 second (1,000 ms) apart. If SQL still does not receive the response back, it will terminate the connection, which will cause the BizTalk hosts on the App server to reset, and in our case, when our German-made ERP system sends its EDI documents over to BizTalk during that reset period, the transmission will fail.
We trapped the issue by running NetMon on the DB and App servers, waiting for the next error message. Upon inspection, we see the five SQL keep-alive packets being sent to the App servers 1 second apart, and at the same time there were NO packets at all received on the Application server. At first guess, one might think they were "just dropped network packets", which is rarely the case. We then made the correlation to the timing of the VM Snapshots, and now confirm each time the snapshot finishes each day, the App servers freeze.
As a short-to-mid-term workaround, we raised the number of keep-alive retries attempted before the connection is declared dead (5 by default) by adding the registry value TcpMaxDataRetransmissions and setting it to 30 (thus about 30 seconds before the client is declared unresponsive). This has masked the problem for us for now; use at your own discretion.
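For reference, a sketch of the registry change as we applied it on the SQL server (this is the standard Tcpip parameters key; a reboot may be needed for it to take effect):

reg add "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" /v TcpMaxDataRetransmissions /t REG_DWORD /d 30 /f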
We are also looking at an Agent-based version of the VM Snapshot, which may alleviate the condition of freezing the server.
Is there any timeout setting or config changes that will allow BizTalk services to stay up during the snapshot process?
Not that I am aware of, however you might want to Google config options in the btsntsvc.exe.config file which is located in your BizTalk installation directory.
All messages that pass through BizTalk are written to the BizTalkMsgBoxDb and its other databases are involved if you are running tracking, BAM etc. The only service that can cache 'stuff' and handle a database outage is the Enterprise Single Sign-On (ESSO) Service. BizTalk therefore needs a persistent connection to the database server to remain 'up', hence why your Host Instance (BizTalkServerApplication) is stopping - it simply wouldn't be able to process messages if the database wasn't there.
I would add that your approach to backups probably isn't supported by Microsoft, and I would further suggest that you seriously consider whether an approach that takes your database server offline during the backup is viable.
BizTalk has a pretty robust backup solution for its various databases built into the product, and I would recommend that you take a look at using this supported method.
If you do need to take snapshots of the database system - say, once a night - you might want to consider stopping the BizTalk host instances, performing the snapshot, and then restarting the host instances through some scripted task, as sketched below.
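A minimal sketch of such a scripted task, using the Windows service name from the error message above (BTSSvc$BizTalkServerApplication; your host names will differ, and a real script would stop every host instance, not just one):

rem stop the host instance before the snapshot
net stop "BTSSvc$BizTalkServerApplication"
rem ... trigger the snapshot here (tooling-specific) ...
rem restart the host instance afterwards
net start "BTSSvc$BizTalkServerApplication"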
You might also want to consider checking whether there are any hotfixes for your version of BizTalk Server included in a Cumulative Update that might help address your problem.
I have a WebLogic cluster on which I've deployed numerous topics and applications that use them. My applications uniformly show themselves in a Warning status. Looking at Monitoring on the deployment, I see the MDB application connects to Server #1, but on server #2 it shows this:
MDB application appName is NOT connected to messaging system.
My JMS server is targeted to a migratable target, which is in turn targeted to the #1 server and has a cluster identified. And messages sent to either server all flow as expected. I just don't know why these deployments show in a Warning state.
WebLogic 11g
This can be avoided by using the parameter below
<start-mdbs-with-application>false</start-mdbs-with-application>
In weblogic-application.xml, setting start-mdbs-with-application to false forces MDBs to defer starting until after the server instance opens its listen port, near the end of the server boot process.
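In context, the descriptor might look like the sketch below. Hedged: in the releases I have seen, the element lives under the ejb section of weblogic-application.xml, so verify against your version's schema:

<weblogic-application xmlns="http://xmlns.oracle.com/weblogic/weblogic-application">
  <ejb>
    <!-- defer MDB startup until the server's listen port is open -->
    <start-mdbs-with-application>false</start-mdbs-with-application>
  </ejb>
</weblogic-application>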
If you want to perform startup tasks after JMS and JDBC services are available, but before applications and modules have been activated, you can select the Run Before Application Deployments option in the Administration Console (or set the StartupClassMBean’s LoadBeforeAppActivation attribute to “true”).
If you want to perform startup tasks before JMS and JDBC services are available, you can select the Run Before Application Activations option in the Administration Console (or set the StartupClassMBean’s LoadBeforeAppDeployments attribute to “true”).
Refer: http://docs.oracle.com/cd/E13222_01/wls/docs81/ejb/message_beans.html
This is applicable for versions up to 12c and later.
I don't like unanswered questions, so I'm going to answer this one.
The problem is resolved, though I was not involved in its resolution. At present the problem only exists for the length of time it takes the JMS subsystem to fully initialize. During that period (with many queues, it can take a while) the JNDI system throws errors and the apps are truly in warning state. Once the JMS is fully initialized, everything goes green.
My belief is that someone corrected something in the JMS Server / Cluster config. I'll never know what it was.