why does my glassfish server stop? - glassfish

I have a GF3 server in production. Sometimes, it just stops responding. At least, all web applications do. CPU / memory usage is low, but I can't get any web app on port 8080 to work. Nothing in the logs (5 minutes gap in server.log until I restarted manually). Everything fine after restart... for a while.
Took a jstack output before restarting. Didn't find anything interesting (no code from my apps running, no locks...).
Version = GlassFish v3 (build 74.2), JRE version 1.6.0_19
UPDATE: it comes back by itself after some time (still not acceptable for my clients :-( )
UPDATE: I switched to a new installation of GF 3.1 (was 3.0.1). At the moment (after a couple of hours), one of the applications deployed there has 177 sessions. The problem is that I only have about 12 users (where did all those sessions come from?). The same application deployed under another name has 6 sessions. Could I just be running out of threads in a thread pool or something like that?

I suggest hooking up Visual VM with the GF plugin.
http://visualvm.java.net/index.html
Then when your server "stops", take a look and see what's going on.

If there's nothing interesting in the stack trace, then the problem is likely to be between the client and GlassFish.
In any case I would also suggest upgrading to the latest JDK (1.6.0_24) and GlassFish (3.1).

I was using connection pooling with MySQL, and in some places I forgot to close the database connection. After fixing those mistakes everything was fine.
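A minimal sketch of the kind of fix described above, assuming the connections come from a javax.sql.DataSource backed by the pool. The class name, the method name and the query passed in are hypothetical; the point is only that every resource is closed in a finally block, so the connection goes back to the pool even when the query throws. It uses plain try/finally because the question mentions Java 6; on Java 7 or later, try-with-resources would do the same with less code.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

// Hypothetical helper, not taken from the original application.
final class PooledQuerySketch {

    /** Runs a single-value query and always hands the connection back to the pool. */
    static int countRows(DataSource pool, String sql) throws SQLException {
        Connection con = pool.getConnection();
        try {
            PreparedStatement ps = con.prepareStatement(sql);
            try {
                ResultSet rs = ps.executeQuery();
                try {
                    return rs.next() ? rs.getInt(1) : 0;
                } finally {
                    rs.close();
                }
            } finally {
                ps.close();
            }
        } finally {
            // For a pooled connection, close() returns it to the pool.
            // Skipping this on any code path leaks one pool slot per call.
            con.close();
        }
    }
}
```

When a pool runs dry because of leaks like this, request threads typically end up waiting for a free connection, which would fit the symptoms in the question: low CPU, nothing in the logs, and no response until a restart.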

Related

VPN interferes with mobilefirst adapter deployment

I have a problem with symptoms similar to this question, where adapter deployment hangs at 66% complete. As I'm not sure it's the same problem, I'm starting this new question.
Using MFP 7.0.0, freshly installed on a new instance of Eclipse Luna.
I have a SQL Adapter that normally can be deployed with no difficulty, and these days if one edits the source that deployment occurs automatically.
As it happens I want to use a database that is only accessible via a VPN. So initially I developed some SQL scripts in the Database perspective, using a JDBC URL of this form:
jdbc:db2://the.vpn.host:60006/STUDENT
My scripts work just fine. Now with the VPN still active I attempt to modify my SQL adapter to use that URL, automatic deployment kicks in and bingo, we get to the
Deploy Mobile First adapter (66%)
stage and nothing further happens, ever. This is not just a time-out of a few minutes; it will sit there for hours. As soon as I drop the VPN, the deployment completes.
So my question is two-fold:
1). It seems clear that some aspect of the adapter deployment code is not resilient to network issues - it's clearly not acceptable to hang indefinitely. I speculate that this may hint at the underlying cause of the referenced question.
2). There must be some network peculiarity here. I assume that the deployment process is having trouble reaching the server when the VPN is active. How can I diagnose this?
We have recently identified and corrected the following via an APAR:
PI42968 ADAPTER/APPLICATION DEPLOYMENT TIME CAN BE EXTREMELY SLOW
The fix is now available via IBM Fix Central, so I'd like you to try this fix as I am hopeful it will help in your scenario as well.
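On the diagnosis part of the question, one rough way to check whether a host and port are reachable with the VPN up is a plain TCP connect with an explicit timeout, so the check itself cannot hang. This is only a sketch: the class name is hypothetical, and which host and port the deployment actually talks to is something you would have to fill in yourself.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical diagnostic helper, not part of MobileFirst.
final class ReachabilityProbe {

    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        Socket socket = new Socket();
        try {
            // A non-zero timeout bounds the wait; 0 would mean "wait forever".
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers refused connections, unknown hosts and timeouts.
            return false;
        } finally {
            try {
                socket.close();
            } catch (IOException ignored) {
                // Nothing useful to do if closing fails.
            }
        }
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("usage: ReachabilityProbe <host> <port>");
            return;
        }
        String host = args[0];
        int port = Integer.parseInt(args[1]);
        System.out.println(host + ":" + port + " reachable: " + canConnect(host, port, 3000));
    }
}
```

Running it once with the VPN active and once without, against both the development server and the database host, should show which connection the VPN's routing breaks.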

Instability on Worklight Server

I'm using websphere liberty profile v8.5.5.0 and worklight 6.2.
The full version of my WL and runtime is:
Server version: 6.2.0.00.20140922-2259
Project WAR version: 6.2.0.00.20140922-2259
I've noticed that I sometimes have trouble getting into the worklightconsole: the server takes far too long to answer and most of the time it simply times out.
JVM heap usage is at 60-70% of the total heap, most likely around 1.5 GB.
In the FFDC logs, I sometimes get an error along these lines:
FFDC Incident has been created: "javax.naming.ServiceUnavailableException: ldap.example.com:389; socket closed; remaining name 'o=example' com.ibm.ws.wim.adapter.ldap.LdapConnection 1670" at ffdc.log
My LDAP is connected to this WebSphere via VPN, and I know that WebSphere has historically had trouble dealing with LDAP.
However I don't see any more errors on the logs; the machine eventually recovers and is able to work correctly, but for some time is 'down'.
If I enable tracing, the verbosity overwhelms the machine and I can't even start the worklightconsole, nor continue working with Worklight, for example calling an adapter from an application.
There is one more thing, it seems that this happens more frequently after updates on existing application versions or adapters. Does this ring a bell with anyone?
If I ask for a restart while the machine is sluggish, stopping WebSphere takes quite some time, but it eventually stops normally, and when I start it again everything is fine right off the bat.
Before asking for a PMR, I would like to know if there is something else I could do to troubleshoot this problem.
Thanks in advance.
My initial "smell" of the problem is that sometimes your VPN connection with LDAP is very slow or your LDAP server is taking too long to respond.
My suggestion is that you try WAIT (wait.ibm.com), a non-invasive, easy-to-use diagnostic tool, to investigate further. If you find that the call to LDAP is hanging, then I suggest you try tuning the Liberty LDAP cache; this should help.

ActiveMQ (MQTT) maxes CPU on first client connect

I'm running ActiveMQ (a very recent version) on Linux Mint 15 using Oracle Java 1.7. I've only enabled a single transport, "mqtt+nio+ssl". It boots up fine, SSL is all working, easy!
However, when I make a (mqtt) connection from the same host (different java process), the activemq process starts to consume a whole core. It keeps the core at 100% until I stop it (it stops normally). This sounds like abnormal behaviour to me, but when I turned on debug logging I got nothing that seemed to suggest massive CPU consumption.
Has anyone else seen or resolved this problem?
Can anyone suggest how I should go about analyzing this problem?
Many Thanks!
Obviously this is some sort of bug in ActiveMQ. There's been a lot of work done on the MQTT and AMQP side for the upcoming release of v5.9.0. You can download snapshot builds or the release candidate of 5.9 and test whether it still does this. If it does, then you need to create an issue in the Jira tracker so the team can work on it, preferably with a test case to reproduce it.

Portal running with Glassfish 2.1.1, Liferay 5.2 and SSL get too many blocked threads

I have a portal that runs over SSL on Glassfish and uses Liferay. The last time we sent an email that brought approximately 200 people to the site at the same time to access newly released information, our Glassfish "stalled".
From the server we could see that system resources were ok.
- Glassfish has up to 8 GB to use but was using 5 GB
- The server has 4 CPUs and the overall usage was around 30%
- Glassfish is configured up to 400 HTTP threads.
As soon as we detected that our server wasn't answering users, we started a profiler in order to understand what was going on.
The threads overview showed too many blocked threads.
From the stacks it's not possible to see any code other than sun, grizzly and catalina classes.
I would like to fix this issue, but right now I can't tell whether I should work on our code or replace some component, for example by disabling SSL.
Any thoughts would be much appreciated.
Thanks.
A thread dump might have been easier and less intrusive than a profiler - this might have shown you where the threads are blocked in the actual running system.
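As a rough illustration of what a thread dump tells you, the sketch below uses the standard java.lang.management API to list the threads that are BLOCKED on a monitor, together with the lock they wait for and the thread holding it. The class name is hypothetical, and the code only sees the JVM it runs in, so it would have to be invoked from inside the server; for a running server process, jstack gives the same information from the outside.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only.
final class BlockedThreadReport {

    /** Returns one line per thread that is currently BLOCKED on a monitor. */
    static List<String> describeBlocked() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> lines = new ArrayList<String>();
        // false/false: skip the optional locked-monitor details,
        // which not every JVM supports.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                lines.add(info.getThreadName()
                        + " is blocked on " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> blocked = describeBlocked();
        System.out.println(blocked.size() + " blocked thread(s)");
        for (String line : blocked) {
            System.out.println(line);
        }
    }
}
```

If most HTTP threads turn out to wait on the same lock, the thread that holds it is the place to start looking.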
You'll have to figure out where the blocking occurred: Was it in Liferay's code or in your own? What did you have on the pages, how is the theme done? Also, note that you're running a really old version of Liferay - in case you're running CE this has been out of maintenance for a few years now (Enterprise Edition still being supported, but as you don't mention this, odds are you're running Community Edition (CE))
Further, if you cause situations like the one you describe (sending loads of people at the same time) you might want to load test your system with an artificial load in order to see how it behaves. Also, you might want the landing page to be buffered (this is not to say that 200 users are a lot, but for any such activity you probably want to know that your system can handle it)
Until you prove the opposite, I'd assume that there is some custom component on the page (either a portlet or the theme) that causes a bottleneck and the blocking that you discovered.

Apache Tomcat 6.0.35 is taking 100% CPU in production

I have been using apache-tomcat-6.0.35 in production environment. Our application is hosted on Amazon EC2 using Small Instance. The problem we are facing is that the apache tomcat is using 100% CPU. We have verified it by running htop and it shows multiple threads of tomcat running.
Our application has been developed in Grails 2.0.1.
We are puzzled as to why this is happening. Can anybody suggest a solution?
Thanks
Probable Cause
Most likely this has been caused by the recent Leap Second and its impact on quite some unaware/unprepared IT systems, including parts of Linux, MySQL, Java and indeed Tomcat - see the Wired article about the ‘Leap Second’ Bug Wreaks Havoc Across Web for the whole story:
[...], saying it experienced the leap bug problem with the
Java-happy Tomcat web servers it uses to serve up its site. “Our web
servers running tomcat came close to zero response (we were able to
handle some requests),” read an e-mail from a site spokesman. “We were
able to connect to servers in order to reset them. Only rebooting the
servers cleared up the issue.” [emphasis mine]
Workaround / Fix
Accordingly, the solution usually boils down to turning it off and on again, i.e. restarting the server in question, though you might be able to avoid this by simply setting the date, as suggested e.g. in the context of:
Linux/Tomcat, see July 1 2012 Linux problems? High CPU/Load? Probably caused by the Leap Second!:
Apparently, simply forcing a reset of the date is enough to fix the
problem:
date -s "`date`"
MySQL, see MySQL and the Leap Second, High CPU and the Fix (also linked from the comments on wwwhizz' answer to MySQL high CPU usage, where you'll find two specific variations how to do this depending on your OS):
The fix is quite simple – simply set the date. Alternatively, you can
restart the machine, which also works. Restarting MySQL (or Java, or
whatever) does NOT fix the problem.
Background / Proposed Solutions
Please note that while the underlying issue is utterly tricky, it is anything but unknown in principle; there have been prominent posts/users warning about it, explaining it and offering suggestions on how to deal with it, in particular:
An humble attempt to work around the leap second by Marco Marongiu
Time, technology and leaping seconds by Christopher Pascoe
We can't say anything for sure with the information provided. For a performance issue, I would recommend a profiler, especially JProfiler, to investigate the cause of the problem. That way you will be able to locate where the problem is.
This program has a trial license, I think that's enough for a quick look.
UPDATE: after carefully rereading your question, I see that you have many Tomcat instances running for one website. That could mean that the previous Tomcat instances failed to stop; they are still running and hogging all the resources. You must kill all the old Tomcat processes before trying to start a new one.
You can kill the processes by hand with "kill -9 <pid>" if you are on Linux, before trying to start the server again.