Apache Tomcat 6.0.35 is taking 100% CPU in production

I have been using apache-tomcat-6.0.35 in a production environment. Our application is hosted on Amazon EC2 using a Small instance. The problem we are facing is that Apache Tomcat is using 100% CPU. We have verified it by running htop, which shows multiple Tomcat threads running.
Our application has been developed in Grails 2.0.1.
We are puzzled as to why this is happening. Can anybody suggest any solutions?
Thanks

Probable Cause
Most likely this has been caused by the recent leap second and its impact on quite a few unaware/unprepared IT systems, including parts of Linux, MySQL, Java and indeed Tomcat - see the Wired article 'Leap Second' Bug Wreaks Havoc Across Web for the whole story:
[...], saying it experienced the leap bug problem with the
Java-happy Tomcat web servers it uses to serve up its site. “Our web
servers running tomcat came close to zero response (we were able to
handle some requests),” read an e-mail from a site spokesman. “We were
able to connect to servers in order to reset them. Only rebooting the
servers cleared up the issue.” [emphasis mine]
Workaround / Fix
Accordingly, the solution usually boils down to turning it off and on again, i.e. restarting the server in question, though you might be able to avoid this by simply setting the date, as suggested e.g. in the context of:
Linux/Tomcat, see July 1 2012 Linux problems? High CPU/Load? Probably caused by the Leap Second!:
Apparently, simply forcing a reset of the date is enough to fix the
problem:
date -s "`date`"
MySQL, see MySQL and the Leap Second, High CPU and the Fix (also linked from the comments on wwwhizz' answer to MySQL high CPU usage, where you'll find two specific variations on how to do this depending on your OS):
The fix is quite simple – simply set the date. Alternatively, you can
restart the machine, which also works. Restarting MySQL (or Java, or
whatever) does NOT fix the problem.
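Putting those two suggestions together, a minimal sketch of the check-and-fix on a Linux box (run as root; the grep pattern is just an assumption about what the kernel logged) might look like this:
# See whether the kernel actually inserted a leap second, then force a clock reset.
dmesg | grep -i leap
date -s "$(date)"
# If MySQL or the JVM keeps spinning after the reset, rebooting the server is the fallback.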
Background / Proposed Solutions
Please note that while the underlying issue is utterly tricky, it is by no means unknown in principle; accordingly, there have been prominent posts/users warning about and explaining this and offering suggestions on how to deal with it, in particular:
An humble attempt to work around the leap second by Marco Marongiu
Time, technology and leaping seconds by Christopher Pascoe

We can't say anything for sure with the information provided. For performance issues, I would recommend a profiler, especially JProfiler, to investigate the cause of the problem. That way you will be able to locate where the problem is.
The program has a trial license, which I think is enough for a quick look.
UPDATE: after carefully reading your question, I see that you have many Tomcat instances running for one website. That would mean that previous Tomcat instances failed to stop; they are still running and hogging all the resources. This is possible. You must kill all the old Tomcat processes before trying to start a new one.
You can kill the processes by hand with "kill -9 " if you are on Linux, before trying to start the server again.
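A minimal sketch of how to find and kill the stale instances on Linux (the grep pattern assumes the stock apache-tomcat-6.0.35 directory name, and <pid> is a placeholder to replace with each process id from the listing):
# List running Tomcat JVMs; the [a] trick keeps grep itself out of the listing.
ps aux | grep '[a]pache-tomcat-6.0.35'
# Kill each stale process by its PID, then start Tomcat again.
kill -9 <pid>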

Related

Instability on Worklight Server

I'm using WebSphere Liberty profile v8.5.5.0 and Worklight 6.2.
The full version of my WL and runtime is:
Server version: 6.2.0.00.20140922-2259
Project WAR version: 6.2.0.00.20140922-2259
I've noticed that sometimes I have trouble getting into the worklightconsole; the server takes too long to answer and most of the time it just gives me a timeout.
Regarding the JVM heap, it's at 60-70% of the total heap, most likely 1.5 GB or something like that.
On the FFDC, sometimes I get an error saying something close to:
FFDC Incident has been created: "javax.naming.ServiceUnavailableException: ldap.example.com:389; socket closed; remaining name 'o=example' com.ibm.ws.wim.adapter.ldap.LdapConnection 1670" at ffdc.log
I have my LDAP connected to this WebSphere via VPN, and I know that WebSphere has historically had trouble dealing with LDAP.
However, I don't see any more errors in the logs; the machine eventually recovers and is able to work correctly, but for some time it is 'down'.
If I enable tracing, the verbosity overwhelms the machine and I can't even start the worklightconsole, nor continue to work with Worklight, e.g. calling an adapter from an application.
There is one more thing: it seems that this happens more frequently after updates to existing application versions or adapters. Does this ring a bell with anyone?
If I ask for a restart when the machine is sluggish, stopping WebSphere takes quite some time, but it eventually stops normally, and when I start it again everything is fine right off the bat.
Before opening a PMR, I would like to know if there is something else I could do to troubleshoot this problem.
Thanks in advance.
My initial "smell" of the problem is that sometimes your VPN connection with LDAP is very slow or your LDAP server is taking too long to respond.
My suggestion is that you try using WAIT(wait.ibm.com), it's a non-invasive easy to use diagnostic tool, to further investigate. If you find out the call to LDAP is getting hang then I suggest you try tuning Liberty LDAP cache, this should help.
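As a first, non-invasive check, a quick sketch to run from the Liberty host itself (hostname, port and base DN are taken from the FFDC message above; requires netcat and the OpenLDAP client tools):
# Time a raw TCP connect to the LDAP port, then an anonymous base-level search.
time nc -zv -w 5 ldap.example.com 389
time ldapsearch -x -H ldap://ldap.example.com:389 -b "o=example" -s base
# If either of these stalls for seconds at a time, the VPN/LDAP path is the bottleneck.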

Bottle WSGI server vs Apache

I don't actually have any problem, I'm just a bit curious about something.
I made a Python web framework based on Bottle (http://bottlepy.org/). Today I tried to compare the performance of the Bottle WSGI server and the Apache server. I work on Lubuntu 12.04, using Apache 2, Python 2.7 and the Bottle development version (0.12), and got this surprising result:
As stated in the Bottle documentation, the included WSGI server is only intended for development purposes. The question is, why is the development server faster than the deployment one (Apache)?
As far as I know, a development server is usually slower, since it provides some "debugging" features.
Also, I never had any response in less than 100 ms when developing PHP applications. But look, it is just 13 ms in Bottle.
Can anybody please explain this? It just doesn't make sense to me. A deployment server should be faster than the development one.
Development servers are not necessarily faster than production grade servers, so such an answer is a bit misleading.
The real reason in this case is likely going to be due to lazy loading of your web application on the first request that hits a process. Especially if you don't configure Apache correctly, you could hit this lazy loading quite a bit if your site doesn't get much traffic.
I would suggest you go watch my PyCon talk which deals with some of these issues.
http://lanyrd.com/2013/pycon/scdyzk/
Especially make sure you aren't using prefork MPM. Use mod_wsgi daemon mode in preference.
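A quick way to confirm which MPM you are actually running (on Debian/Ubuntu the binary is usually called apache2ctl rather than apachectl):
# Print the MPM Apache was built with, and any MPM loaded as a module.
apachectl -V | grep -i mpm
apachectl -M 2>/dev/null | grep -i mpm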
A deployment server should be faster than the development one.
True. And it generally is faster... in a "typical" web server environment. To test this, try spinning up 20 concurrent clients and have them make continuous requests to each version of your server. You see, you've only tested 1 request at a time--certainly not a typical web environment. I suspect you'll see different results (we're thinking of both latency AND throughput here) with tens or hundreds of concurrent requests per second.
To put it another way: At 10, 20, 100 requests per second, you might still see ~200ms latency from Apache, but you'd see much worse latency from Bottle's server.
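For a concrete comparison, a sketch using ApacheBench with 20 concurrent clients (the URLs and ports are assumptions; point them at your two deployments of the same app):
# 2000 requests, 20 at a time, against each server.
ab -n 2000 -c 20 http://127.0.0.1:8080/   # Bottle's built-in wsgiref server
ab -n 2000 -c 20 http://127.0.0.1/        # the same app served through Apache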
Incidentally, the Bottle docs do refer to concurrency:
The built-in default server is based on wsgiref WSGIServer. This
non-threading HTTP server is perfectly fine for development and early
production, but may become a performance bottleneck when server load
increases.
It's also worth noting that Apache is doing a lot more than the Bottle reference server is (checking .htaccess files, dispatching to child process/thread, robust logging, etc.) and all those features necessarily add to request latency.
Finally, I'd ask whether you tuned the Apache installation. It's possible that you could configure it to be faster than it is now, e.g. by tuning the MPM, simplifying logging, disabling .htaccess checks.
Hope this helps. And if you do run a concurrent benchmark, please do share the results with us.

Portal running with Glassfish 2.1.1, Liferay 5.2 and SSL gets too many blocked threads

I have a portal which runs over SSL on Glassfish and uses Liferay. The last time we sent an email that brought approximately 200 people at the same time to access released information, our Glassfish "stalled".
From the server we could see that system resources were ok.
- Glassfish has up to 8 GB to use but was using 5 GB
- The server has 4 CPUs and the overall usage was around 30%
- Glassfish is configured with up to 400 HTTP threads.
As soon as we detected that our server wasn't answering users, we started a profiler in order to understand what was going on.
The threads overview shows too many blocked threads:
From the stack it's not possible to see code other than sun, grizzly and catalina classes:
I would like to fix this issue, but right now I can't tell whether I should work on our code or replace some component, such as disabling SSL.
Any thoughts would be very appreciated.
Thanks.
A thread dump might have been easier and less intrusive than a profiler - this might have shown you where the threads are blocked in the actual running system.
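For example, a sketch of capturing a couple of thread dumps from the running Glassfish JVM while it is stalled (replace <pid> with the Glassfish process id; jstack ships with the JDK, and kill -3 writes the dump to the server log instead of stdout):
# Take a dump, wait a few seconds, take another, then compare which threads stay BLOCKED.
jstack -l <pid> > threads-$(date +%H%M%S).txt
kill -3 <pid>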
You'll have to figure out where the blocking occurred: was it in Liferay's code or in your own? What did you have on the pages, and how is the theme done? Also, note that you're running a really old version of Liferay - in case you're running CE, this has been out of maintenance for a few years now (Enterprise Edition is still being supported, but as you don't mention this, odds are you're running Community Edition (CE)).
Further, if you cause situations like the one you describe (sending loads of people at the same time), you might want to load test your system with an artificial load in order to see how it behaves. Also, you might want the landing page to be buffered (this is not to say that 200 users are a lot, but for any such activity you probably want to know that your system can handle it).
Until you prove the opposite, I'd assume that there is some custom component on the page (either a portlet or the theme) that causes a bottleneck and the blocking that you discovered.

Stress testing a server and VPS's vs. Dedicated servers

We used to have a dedicated server (1&1) and very infrequently ran into problems with it.
Recently, we migrated to a VPS (Wiredtree.com) with similar specs to our old dedicated server, but notice frequent problems running out of memory, MySQL having to restart, etc. - both when knowingly running intensive scripts and also just randomly during normal use.
Because of this, we're considering migrating to another VPS - this time at Slicehost - to see if it performs better.
My question is twofold...
Are there straightforward ways we could stress test a VPS at Slicehost to see if the same issues occur, without having to actually migrate everything over?
Also, is it possible that the issues we're facing aren't just because of the provider (Wiredtree) but just the difference between a dedicated box and VPS (despite having similar specs)?
The best way to stress test an environment is to put it under load. If this VPS is hosting a web application, use one of the many available web server benchmark tools: ab, httperf, Siege or http_load. You don't necessarily care that much about the statistics from the tool itself, but more that it puts a predictable load on the server so that you can tune Apache to handle it, or at least not crash and burn.
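For example, a sketch with siege (the URL is an assumption; point it at the application deployed on the trial VPS):
# 25 concurrent simulated users hammering the site for one minute.
siege -c 25 -t 1M http://test.example.com/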
The one problem you have with testing against Slicehost is that you are at the mercy of the Internet and your bandwidth to Slicehost. You may not be able to put enough load on the server to reach a meaningful conclusion.
Instead, you might find it just as valuable to run one of the many virtualization products on the market and set up a VM with comparable specs to the VPS plan you're considering. Local testing over your LAN will allow you to put a higher and more predictable load on the server.
In either case, you don't need to migrate everything, but you will need to set up an environment for your application to run in, with representative data in your database.
A VPS with similar specs to a dedicated server should perform approximately the same, but in order to get good performance, you still need to tune Apache, MySQL and any other long-lived server processes. In my experience, the out-of-the-box configuration of Apache in many Linux distributions is not ideal and will allow far too many child processes, overcommitting memory and sending the server into a swap-death spiral.
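One rough, hedged way to sanity-check that on Linux is to measure the average resident memory per Apache child and size MaxClients so the total fits in RAM (the process name is apache2 on Debian/Ubuntu, httpd on RHEL/CentOS):
# Average RSS per Apache worker, in MB.
ps -o rss= -C apache2 | awk '{sum+=$1; n++} END {if (n) printf "%d procs, avg %.0f MB each\n", n, sum/n/1024}'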

Best way to simulate a WAN network

Simplified, I have an application where data is intended to flow over the internet between two servers. Ideally, I'd like to test at what point the software ceases to function - at what lower-bound limits (bandwidth, latency, dropped packets) things stop working - to test the reliability of the software.
What I thought I would do was the following:
Set up 3 machines (VMware instances)
Install the 2 applications on two of the servers.
Set up the 3rd server to sit between the two machines by doing some sort of magic with Routing and Remote Access on Windows 2003
Install either Traffic Shaper XP or NetLimiter to limit the bandwidth
Run something like TMnetSim Network Simulator to simulate a bad connection.
Does this sound like a good idea or are there easier/better ways of doing this? I'm not that comfortable on Linux and my team mates are even less so.
WANem does exactly this. We have used it both in a virtual machine on the desktop and on a dedicated old pc and it worked great. It can simulate all sorts of broken connectivity.
FreeBSD's ipfw has provisions to simulate links with a given bandwidth, latency or error rate. You could use that FreeBSD machine as the machine "in the middle" in your above setup.
You probably can also run at least one of the endpoints on the same machine if you want to reduce the amount of servers involved.
Someone actually packaged up the settings and whatnot necessary for the FreeBSD solution to this problem and they call it DUMMYNET.
It simulates/enforces queue and bandwidth limitations, delays, packet losses, and multipath effects. It also implements a variant of Weighted Fair Queueing called WF2Q+. It can be used on user's workstations, or on FreeBSD machines acting as routers or bridges.
It can simulate exactly what you want, and it's free and will boot onto commodity hardware. They even have a canned install of it that is small enough to put on a floppy disk (!) that you can download at that link.
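A minimal dummynet sketch on FreeBSD (the numbers are illustrative, not recommendations):
# Send all IP traffic through pipe 1, then shape that pipe:
# 1 Mbit/s, 100 ms of added delay, 1% packet loss.
ipfw add 100 pipe 1 ip from any to any
ipfw pipe 1 config bw 1Mbit/s delay 100ms plr 0.01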
Maybe it is time to learn a bit about Linux, because adding a 50 ms delay to every outgoing packet can be done by typing just one line:
tc qdisc add dev eth0 root netem delay 50ms
For more see the Linux Traffic Control HOWTO
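Building on that one-liner, a hedged sketch that also adds jitter, packet loss and a bandwidth cap (eth0 and all the numbers are assumptions to adjust for your test):
# 50 ms delay with 20 ms jitter and 1% loss on outgoing packets...
tc qdisc add dev eth0 root handle 1: netem delay 50ms 20ms loss 1%
# ...plus a 1 Mbit/s token-bucket rate limit chained under netem.
tc qdisc add dev eth0 parent 1:1 handle 10: tbf rate 1mbit burst 32kbit latency 400ms
# Remove the shaping when the test is done.
tc qdisc del dev eth0 root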
We had a similar requirement some ten years ago - I'll see if I can recall how we managed it.
If I remember, we wrote a socket proxy program which was controlled by inetd on a UNIX box. This socket would accept connections from a client and open equivalent sessions through to the server. It would then loop, passing messages in both directions.
The way we achieved WAN characteristics was to introduce random delays (with upper and lower limits) in both the connection establishment and the passing of data once the link was up.
It also had the feature to drop the link occasionally as WAN links were less reliable for us than local traffic.
I recall we had to make it threaded to stop the delays from affecting reverse traffic on the link.
There is a very good (and free) Microsoft solution for that; we have used it for quite some time and it works great. It can very easily simulate everything (packet loss, low bandwidth, disconnection, latency, ...).
This is the best solution I found for a Windows environment.
More information and a download link can be found here: MARCO blog post
This product has gone through some evolution and is now integrated into Visual Studio as part of the automated testing tooling, but I found the standalone version (which is quite hard to find, so keep a local copy) to work much better. Keep in mind that you need at least two computers (or VMs), since you need to pass traffic through a network adapter in order for the application to work its magic.