We are running a GlassFish server with around 20 JAX-WS Metro web services. The server is a Core 2 Duo with 8 GB RAM. We are using a single HTTP listener for all the web services, Development is set to True, the request thread count is 2 and the acceptor count is 1.
The minimum and maximum heap sizes are both 1 GB and PermGen is set to 512 MB.
The services access an Oracle database via a Hibernate layer and there are many inter-service calls between the services.
The front end is ASP.NET. Our problem is that when 4-5 users access the application simultaneously for some time (about an hour), the GlassFish server hangs with the CPU at 100%, while memory utilization stays around 10-11%.
We have not been able to find any pointers on how to debug this problem. On some occasions the log file shows java.lang.OutOfMemoryError: PermGen space, but not every time; on many occasions the log file shows no error at all when the server hangs. Also, the GlassFish server does not start if we try to increase the PermGen space. We need some direction on how to diagnose this and move towards a solution.
The GlassFish version we are using is v2.1.
We have the following observations:
1. Adding more HTTP listeners (one listener per 4-5 services) delays the failure somewhat, but not by much.
2. When calling some of the heavy services (one operation at a time) with SoapUI, we also get the hang when running many threads simultaneously (e.g. 8-10 threads).
3. We have observed that, when calling with SoapUI, a service operation that does not call any other services rarely hangs, while a service that calls other services hangs much more frequently.
Related
I'm migrating a service-based integration platform from .NET Framework to .NET Core. The original versions of the integration platform have proven very successful, and compared to replacing it with an 'off the shelf' integration solution, it has a far better ROI.
After redeveloping the code, all tests have been working very well, and I have achieved higher performance with a single IIS server than I could with two IIS servers running the original version.
Except... if I go over ~3 messages/sec with multiple clients, I start seeing duplicate GUID key errors when trying to save instrumentation data to my DB. All of these errors are generated by the on-ramp service. The on-ramp places the message on a queue. The messages are then consumed by an off-ramp service and sent to the destination (for this load test the destination is a file folder).
Even though the off-ramp is also running on the same server as the on-ramp, we do not see any duplication errors generated by the off-ramp. I suspect this is because the queue creates a linear process, so only one instance of the off-ramp is running at any time, versus the on-ramp, which has up to 4 clients firing concurrent messages at its API.
Initially I thought the issue was caused by a static global variable class I had implemented crossing process boundaries. But I would expect the issue to be seen with the off-ramp as well, as the service architecture of both is virtually identical.
Summary of thoughts on issue:
If it were a pure coding issue, errors would also happen at low messaging rates.
The error would also be seen on the off-ramp if the GUID duplication were down to chance.
The on- and off-ramps are both running on the same server, but duplication is only seen on the on-ramp, i.e. the on-ramp is not impacting the off-ramp and vice versa.
Duplication therefore has to be due to shared memory between concurrently running on-ramp instances, triggered by the multiple-client scenario.
To try to resolve the issue I removed the static global variable class, but I'm still seeing the duplication errors.
This issue was never observed in the original IIS implementation (after millions of messages processed). I suspect the issue is with process isolation in the IIS-hosted Kestrel .NET Core service host. From what I have read there is good isolation between different apps (based on IIS path) but not within the same app, so basically within the same IIS app pool. This could explain why .NET Core does not support multiple apps running in the same IIS app pool.
If anyone has a good idea how I can achieve process isolation between instances of the same app running in the same IIS app pool, I would appreciate your thoughts/suggestions.
After running more tests I was able to resolve the issue. The problem was with the scope of the instrumentation variable. At low rates there was never a problem, but at high throughput, the same instrumentation object was being accessed by a second instance of the process.
The issue was difficult to track down due to the short-lived nature of the integration services.
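For anyone hitting the same symptom, this is roughly the kind of scope change involved, shown as a minimal sketch (the class and registration names are illustrative, not the actual platform code): an instrumentation object registered as a singleton, or held in a static field, is shared by concurrent requests, so its GUID gets reused; registering it per request gives each message its own value.

using System;
using Microsoft.Extensions.DependencyInjection;

// Hypothetical instrumentation context; the real class names differ.
public class InstrumentationContext
{
    // Generated once per instance; only unique per message if the
    // instance itself is created per request.
    public Guid MessageId { get; } = Guid.NewGuid();
}

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        // Problematic: a single shared instance means concurrent on-ramp
        // requests all record the same MessageId, causing duplicate-key errors.
        // services.AddSingleton<InstrumentationContext>();

        // Per-request scope: each incoming message gets its own GUID.
        services.AddScoped<InstrumentationContext>();
    }
}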
Thanks to anyone who reviewed the question.
Martin
I've encountered a strange problem with an application I've developed. The application is a Windows service hosting ASP.NET Core 2.0 running on Kestrel. This application receives requests through an IIS site acting as a proxy.
In this application I also use SignalR 2.2.2, integrated using Microsoft.AspNetCore.Owin. All worked well until I noticed that the application was not responding to requests.
Other applications on the same machine and using the same IIS server as proxy were working fine. Restarting the application pool serving the site solved the problem temporarily.
The problem resurfaced, and digging through monitoring information the application seems to hang when there are 400 SignalR SSE connections on the machine. This seems plausible, as I've found that by default OWIN limits the number of concurrent requests to 100 * number of CPUs. (Note that a site on the same machine is serving 5000 requests per minute without breaking a sweat, but those are not long-lived requests like the SignalR ones.)
The problem is that I can't seem to find the same option when hosting OWIN inside ASP.NET Core. Does anyone know whether this could be the cause, and what the correct setting is?
EDIT: I'm fairly certain the issue is caused by the number of SignalR connections opened concurrently, because disabling it in the JavaScript made the problem vanish.
2nd EDIT: SignalR does not seem to be the culprit, as load testing the site with crank, both in test and in production, worked up to 5000 concurrent connections, which is the default IIS limit and is fine by me.
After some trial and error I've been able to identify and correct the problem, but it was no easy task, so I'm leaving this answer behind in case someone else stumbles upon the same problem.
Disabling SignalR did not solve the problem but it made it appear less often.
Thanks to the monitoring in place on the server and IIS, I observed that the problem appeared when the number of connections to the site started growing rapidly. This system primarily makes requests to other services, so it has neither a database nor expensive computations.
Examining the code I've found that there were three problems:
1. A new HttpClient was created for every request, which can exhaust the available sockets, since connections are not reused between requests (this is covered in several blog posts).
2. By default there is a maximum number of concurrent connections from HttpClient to a single domain, and that limit defaults to 2 (!!!) — a sketch addressing both of these points follows this list.
3. The code was waiting synchronously on every web request to another system (this program was ported from an MVC 4 site which never displayed this problem). This worked fine in MVC, but ASP.NET Core is very sensitive to it: the thread pool starts with roughly one thread per core and grows slowly, so the available threads are exhausted quickly and all requests end up waiting. The limit can be raised as a temporary stopgap with ThreadPool.SetMaxThreads(Int32, Int32), but the real solution is to turn all of these calls into async calls.
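To make the first two points concrete, here is a minimal sketch of the pattern (the class name and the limit of 100 are illustrative placeholders; MaxConnectionsPerServer is the knob on HttpClientHandler, while on the full .NET Framework the equivalent is ServicePointManager.DefaultConnectionLimit):

using System.Net.Http;
using System.Threading.Tasks;

public static class DownstreamClient
{
    // A single HttpClient for the whole process: connections are pooled and
    // reused instead of a new socket being opened for every request.
    private static readonly HttpClient Client = new HttpClient(
        new HttpClientHandler
        {
            // Allow more than the small default number of concurrent
            // connections to the same downstream host.
            MaxConnectionsPerServer = 100
        });

    public static Task<string> GetAsync(string url) => Client.GetStringAsync(url);
}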
Once all calls were made async the problem never returned. Basically the problem was thread-pool starvation and ASP.NET Core's sensitivity to it compared with MVC. Here you can find a nice explanation and a detection method using PerfView.
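As an illustration of what the async conversion looks like, a minimal sketch (the controller, routes and downstream URL are invented for the example, not the original code):

using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc;

public class ProxyController : Controller
{
    private static readonly HttpClient Client = new HttpClient();

    // Before: blocking on .Result holds a thread-pool thread for the entire
    // downstream call; under load this is what starves the pool.
    [HttpGet("/sync-example")]
    public string GetSync() =>
        Client.GetStringAsync("http://downstream.example/data").Result;

    // After: the thread is returned to the pool while the downstream call
    // is in flight, so concurrent requests no longer pile up.
    [HttpGet("/async-example")]
    public async Task<string> GetAsync() =>
        await Client.GetStringAsync("http://downstream.example/data");
}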
This could be the issue, but it's unlikely. When hosting on .NET Core you're probably using Kestrel as the web server implementation; to change limits such as the number of concurrent connections you can use the KestrelServerLimits class, as described in the Microsoft documentation.
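If you do want to set those limits explicitly, a minimal sketch for an ASP.NET Core 2.0 host might look like this (the values are arbitrary placeholders and Startup is assumed to be your existing startup class):

using Microsoft.AspNetCore;
using Microsoft.AspNetCore.Hosting;

public class Program
{
    public static void Main(string[] args) =>
        WebHost.CreateDefaultBuilder(args)
            .UseKestrel(options =>
            {
                // KestrelServerLimits: null (the default) means unlimited.
                options.Limits.MaxConcurrentConnections = 1000;
                // Applies to upgraded connections such as WebSockets.
                options.Limits.MaxConcurrentUpgradedConnections = 1000;
            })
            .UseStartup<Startup>()
            .Build()
            .Run();
}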
KestrelServerLimits should not be causing you any problems, since the default value for MaxConcurrentConnections is unlimited.
We have a problem with WebLogic 10.3.2. We installed a standard domain with default parameters. In this domain we have only one managed server, and only one web application running on that managed server.
After installation we face performance problems. Sometimes a user waits 1-2 minutes for an application response. (For example, the user clicks a button and it takes 1-2 minutes to refresh the GUI. It is not a complicated task.)
To overcome these performance problems we defined parameters like:
Configuration -> Server Start -> Arguments
-Xms4g -Xmx6g -Dweblogic.threadpool.MinPoolSize=100 -Dweblogic.threadpool.MaxPoolSize=500
We also changed the application's datasource connection pool parameters on the WebLogic side as below.
Initial Capacity:50
Maximum Capacity:250
Capacity Increment: 10
Statement Cache Type: LRU
Statement Cache Size: 50
We run WebLogic on servers with 32 GB RAM and 16 CPUs. 25% of the server machine's resources are dedicated to WebLogic, but we still have performance problems.
Our target is to serve 300-400 concurrent users without the 1-2 minute waiting time for each application request.
Could defining a work manager solve the performance issue?
Is my datasource or managed bean definition incorrect?
Can anyone help me?
Thanks for your replies
Even though this question is a year old, I am still searching for a good answer. I would appreciate any information that leads me to fully understand this issue of poor performance between communicating web services hosted on the same machine.
I am currently developing a system with several WCF Web Services that communicate intensively.
They are running under IIS7, on the same machine, each service being in a different Application Pool, with multiple workers in the Web Garden.
During the individual evaluation of each Web Service, I can serve 10000-20000 requests per minute quickly and without any resource-consumption issues (processor and memory).
When I test the whole system, or just a subsystem formed by two Web Services, I can't serve more than 2000 requests/minute.
I also observed that communication time between Web Services is a big issue (sometimes more than 10 seconds). But when testing with only 1000 requests per minute everything goes smoothly (connection time of no more than 60 ms).
I have tested the system with both SoapUI and JMeter, but the times were computed from system logs, not from the testing tools.
Memory and network aren't an issue (they are used very little).
Later on, I tested the performance of two communicating WCF Web Services, hosted on two servers and then on the same server. Again there seems to be a bottleneck when the services are on the same machine, lowering the number of connections from tens of thousands to thousands; again, with no memory or processor limits being hit.
As a note, I am working with quite large data in some cases, and some of the required operations are long-running.
I used Performance Monitor to see what's going on (memory, processes, web service and ASP.NET counters, etc.), but I didn't see anything that could indicate what is going wrong.
I also tried all the performance settings and tuning options I could find on the Internet.
Does someone know what could be wrong? Why does the communication between Web Services take so long? Why can the Web Service that serves as the entry point into the system accept 10000 requests/minute when tested alone, but barely 2000 when communicating with another Web Service?
Is it an IIS7 problem? Would my system perform better if each Web Service were deployed on a different server?
I want to better understand how things work internally (IIS and WCF services) to improve performance for current and future systems.
You could try to collect data from the WCF performance counters: concurrent calls, instances, duration, and so on. In addition, WCF throttling provides properties that you can use to limit how many instances or sessions are created at the application level. Performance of the WCF service can also be improved by choosing an appropriate instancing mode.
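As an illustration, here is a minimal self-hosted sketch showing where the throttling behavior attaches (the contract, service class and numbers are made up; for an IIS-hosted service the same values can be set with the serviceThrottling behavior element in configuration):

using System;
using System.ServiceModel;
using System.ServiceModel.Description;

[ServiceContract]
public interface IEchoService
{
    [OperationContract]
    string Echo(string text);
}

public class EchoService : IEchoService
{
    public string Echo(string text) => text;
}

class Program
{
    static void Main()
    {
        using (var host = new ServiceHost(typeof(EchoService),
                                          new Uri("http://localhost:8733/echo")))
        {
            host.AddServiceEndpoint(typeof(IEchoService), new BasicHttpBinding(), "");

            // Throttling: how many calls, service instances and sessions
            // WCF will allow to be active at the same time.
            host.Description.Behaviors.Add(new ServiceThrottlingBehavior
            {
                MaxConcurrentCalls = 64,
                MaxConcurrentInstances = 64,
                MaxConcurrentSessions = 200
            });

            host.Open();
            Console.WriteLine("Service running; press Enter to stop.");
            Console.ReadLine();
        }
    }
}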
Finally, in load testing there are many configurations to apply to the different components: max concurrent HTTP connections, IIS limits, using several load clients, and so on. Your load test is invalidated because of this.
I'm working on a self-hosted WCF application which runs just fine on my PC; however, when I try running it on a VM hosted locally using VMware Player, the service takes some two minutes to return data, whereas the original request took only a few seconds.
The VM has 2 GB RAM and two CPUs, running Windows Server 2008 R2 (on an 8 GB, quad-core host running Windows 7).
Looking at the WCF service trace, I have the following log entries (time/description):
15:41:26.771 From: Processing message 1.
15:41:26.771 Activity boundary.
15:41:26.820 Received a message over a channel.
15:41:26.844 ServiceChannel information.
15:41:26.848 Incoming HTTP request to URI 'http://localhost:8000/Sql/Database' matched operation 'GetDatabase'
15:41:26.944 Message Log Trace
15:43:25.775 To: Execute 'MyProject.ISqlService.GetDatabase'
15:43:25.775 Activity boundary.
15:43:25.947 From: Execute 'MyProject.ISqlService.GetDatabase'
15:43:25.947 Activity boundary.
15:43:25.947 Message Log Trace
15:43:26.134 Throwing an exception.
15:43:26.134 RequestContext aborted
15:43:26.134 Activity boundary.
So the two-minute delay occurs between receiving the incoming HTTP request and the dispatch to the service implementation. This delay occurs whether the request is the first (thus incurring the usual WCF warm-up penalty) or a subsequent one.
While I appreciate that I'm not going to get bare-metal performance from a VM, I'm still concerned about the dire performance, especially as the client tends to time out before the end of the two minutes. Is there anything I can do to improve matters? It's making testing very difficult.
Maybe your processor does not support the VT-x/AMD-V extensions, so virtualization is not hardware-accelerated. Check your hardware using CPU-Z.