If a WebSphere Application is hung on z/OS, what steps should be taken to find the cause?
So far I have taken a heap dump, a javacore, and a system dump.
None of the threads are deadlocked, there are no memory issues, and there does not seem to be an abundance of threads (only ~50, which is fairly normal).
The entire application is inaccessible. By that I mean, any attempt to connect to its web pages hangs and times out.
What could be causing this? I am considering a high-CPU event, but I'm not sure how to check for that retroactively.
I get an error message similar to this one 30 times:
BBOO0221W: WSVR0605W: Thread "WebSphere WLM Dispatch Thread t=008b74f8" (00000075) has been active for 720962 milliseconds and may be hung. There is/are 30 thread(s) in total in the server that may be hung.
at sun.reflect.GeneratedMethodAccessor617.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
at java.lang.reflect.Method.invoke(Method.java:611)
at com.sun.faces.el.MethodBindingImpl.invoke(MethodBindingImpl.java:126)
at com.sun.faces.application.ActionListenerImpl.processAction(ActionListenerImpl.java:72)
at javax.faces.component.UICommand.broadcast(UICommand.java:312)
at javax.faces.component.UIViewRoot.broadcastEvents(UIViewRoot.java:267)
at javax.faces.component.UIViewRoot.processApplication(UIViewRoot.java:381)
at com.sun.faces.lifecycle.InvokeApplicationPhase.execute(InvokeApplicationPhase.java:75)
at com.sun.faces.lifecycle.LifecycleImpl.phase(LifecycleImpl.java:200)
at com.sun.faces.lifecycle.LifecycleImpl.execute(LifecycleImpl.java:90)
at javax.faces.webapp.FacesServlet.service(FacesServlet.java:197)
at com.ibm.ws.webcontainer.servlet.ServletWrapper.service(ServletWrapper.java:1230)
at com.ibm.ws.webcontainer.servlet.ServletWrapper.handleRequest(ServletWrapper.java:779)
at com.ibm.ws.webcontainer.servlet.ServletWrapper.handleRequest(ServletWrapper.java:478)
at com.ibm.ws.webcontainer.servlet.ServletWrapperImpl.handleRequest(ServletWrapperImpl.java:178)
at com.ibm.ws.webcontainer.filter.WebAppFilterChain.invokeTarget(WebAppFilterChain.java:136)
at com.ibm.ws.webcontainer.filter.WebAppFilterChain.doFilter(WebAppFilterChain.java:97)
at org.apache.myfaces.webapp.filter.ExtensionsFilter.doFilter(ExtensionsFilter.java:97)
at com.ibm.ws.webcontainer.filter.FilterInstanceWrapper.doFilter(FilterInstanceWrapper.java:195)
at com.ibm.ws.webcontainer.filter.WebAppFilterChain.doFilter(WebAppFilterChain.java:91)
(truncated)
The "hung" thread themselves don't seem to have any real pattern, they are just normal activity, that should not hang.
One of the best features of z/OS is its diagnostic capabilities - you never have to guess...it's pretty much always possible to find out EXACTLY what's going on.
Personally, I'd start with your system dump and IPCS. Of course, dump reading is a pretty rare skill these days, so if this isn't your thing, the first step might be to find someone with good dump-reading skills. If you're totally stuck, there's a good introduction here.
Start by making sure that your dump has what you think it has...a good percentage of system dumps end up including the wrong address spaces or the wrong data areas, and these are pretty much useless. If you're in this situation, it's time to design exactly the dump options you want so you capture what you need next time the problem occurs.
You'll get a good picture of what's going on inside WebSphere by using the WebSphere IPCS dump formatter - an overview is here.
In the WebSphere address space(s) there will be many threads, and these will have z/OS TCBs (task control blocks). Go through every last TCB (in IPCS, the SUMM FORMAT command or equivalent) and work out whether it's running (potentially looping) or waiting. My bet would be that the threads are waiting on something or other...a lock, an outside signal, a call to DB2, some vendor software, etc. - a good goal is to make a list of all the threads and exactly what each one is waiting for.
Largely, finding a wait reason is about going through the TCB/RB structures to find the PSW and registers at the time of the wait...this tells you the module that's waiting, and most likely you can figure out what's going on from here.
If the system hadn't been hanging for a long time before you took your dump, you might also check the system trace table. It gives you a history of what the address space has been doing, although if the hang has been going on for a long time, there may not be much useful data left.
Also, since WebSphere is a giant UNIX Services application, don't forget to look at the OMVSDATA report if you have this in your dump.
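To make those steps concrete, a rough IPCS pass over the dump might look like the sketch below. The dataset name, job name, and ASID are placeholders, and exact operands vary by z/OS release and by what your dump options captured, so treat this as a starting point rather than a recipe:

```
SETDEF DSNAME('YOUR.DUMP.DATASET')
STATUS
SUMM FORMAT JOBLIST(WASSRV1)
SYSTRACE ASID(X'nn') TIME(LOCAL)
OMVSDATA PROCESS DETAIL
```

SETDEF points IPCS at the dump, STATUS gives the overall picture, SUMM FORMAT walks the TCB/RB chains discussed above, SYSTRACE pulls whatever system trace history survived, and OMVSDATA gives the UNIX Services view.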
Don't forget that you can always reach out for IBM support - you spend a ton of money on software like WebSphere, so reaching out to have them explain what's going on is certainly one of the better ideas.
Good luck!
The application is not responding because all your dispatch threads (30 apparently) are tied up. New requests are just piling up in the WLM queue until some timeout fires. The dispatch timeout in WAS z/OS should eventually abend the servant region and WLM will start a fresh one (unless you've turned off the timeout). There is a good writeup about timeout management for WAS on z/OS here: http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WP102510.
Unfortunately that still doesn't explain why it is stuck in the first place.
I'm new to NServiceBus (4.7.5) and just implemented an NSB host.exe-hosted service (implementing IWantToRunWhenBusStartsAndStops) that detects changes to database tables and notifies subscribing web apps by publishing events, e.g. "CustomerDataWasUpdatedEvent". In the future we will obviously perform the actual updates through message handlers receiving commands, but at the moment this publishing service just polls the database, etc.
It all works well. However, approaching production, I noticed that David Boike, in his latest edition of "Learning NServiceBus", states that classes implementing IWantToRunWhenBusStartsAndStops are really mostly for development and rarely used in production. I set up my database change detection in the Start method and it works nicely; does anyone know why this is discouraged?
Here is the comment in the actual book:
https://books.google.se/books?id=rvpzBgAAQBAJ&pg=PA110&lpg=PA110&dq=nservicebus+iwanttorunwhenbusstartsandstops+in+production+david+boike&source=bl&ots=U6sNII0nm3&sig=qIXffOVFhcy-_3qDnSExRpwRlD4&hl=sv&sa=X&ei=lHWRVc2_BKrWywPB65fIBw&ved=0CBsQ6AEwAA#v=onepage&q=nservicebus%20iwanttorunwhenbusstartsandstops%20in%20production%20david%20boike&f=false
The actual quote is:
...it isn't common to have widespread use of them in a production system.
Uncommon is not the same thing as discouraged.
That said, I do think the author's intent is to highlight the point made further up the page: this is not a good place to be doing lots of coding, as an unhandled exception can cause the whole process to fail.
The author actually does go on to mention a possible use case: loading a resource (or resources) to do work within the handler.
OK, maybe it's just this scenario we have that is a bit uncommon.
Agreed - there is nothing fundamentally wrong with your approach. I recently did the same thing as you for wiring up SqlDependency to listen for database events and then publish a message as a result. In these scenarios there is literally nothing else you can do other than to use IWantToRunAtStartup.
Also, David himself often trawls the nservicebus tag; maybe he'll provide a more definitive answer than mine.
I'll copy the answer I gave in the Particular Software Google Group...
I'll quote myself directly here:
An implementation of IWantToRunWhenBusStartsAndStops is a great place to create a quick interface in order to test messages during debugging by allowing you to send messages based on the console input. Apart from this, it isn't common to have widespread use of them in a production system. One possible production use case will be to provision a resource needed by the endpoint at startup and then tear it down when the endpoint stops.
I think if I could add a little bit of emphasis it would be to "widespread use". I'm not trying to say you won't/can't have an IWantToRunWhenBusStartsAndStops in production code or that avoiding them is a best practice. I am trying to say that having a ton of them is probably a code smell.
Above that paragraph in the book, I warn about IWantToRunWhenBusStartsAndStops not having any ambient transactions or try/catch stuff going on. THAT is really the key part. If you end up throwing an exception in an IWantToRunWhenBusStartsAndStops, you can run into big problems. If you use something like a .NET Timer and then throw an exception, you can crash your process!
Let me tell you how I screwed up on this in my first-ever NServiceBus system. The system (still in use today, from what I hear) is responsible for ingesting more than 3000 RSS feeds (probably a lot more than that now) into a CMS. Processing each feed, breaking it up into items, resizing images, encoding attached video for mobile... all those things were handled in NServiceBus message handlers, which were scaled out to multiple servers, and that was all fantastic.
The problem was the scheduler. I implemented that as an IWantToRunWhenBusStartsAndStops (well, actually IWantToRunAtStartup at that time) and it quickly turned into a mess. I kept the whole table's worth of feed information in memory so that I could calculate when to fire off the next ProcessFeed command. I was using the .NET Timer class, and IIRC, I eventually had to use threading primitives like ManualResetEvent in order to coordinate the activity. And because I was using .NET Timer, if the scheduler threw an exception, that endpoint failed and had to restart. There were lots of weird edge cases, and it was always a quagmire of bugs. Plus, this was now a singleton "commander app", so while the feed/item processors could be scaled out, the scheduler could not.
As I got more experienced with NServiceBus, I realized that each feed should have been a saga, starting from a FeedCreated event, controlled through PauseProcessing and ResumeProcessing commands, using timeouts to control the next processing time, and finally (perhaps) ended via a FeedRemoved event. This would have been MUCH more straightforward and everything would have executed inside transactionally-controlled message handlers.
That experience led me to be a little bit distrustful/skeptical of IWantToRunWhenBusStartsAndStops. I'm not saying it's bad, just something to be aware of. Always be prepared to consider whether what you're trying to do couldn't be better accomplished in another way.
I have an iOS app with a really nasty bug: an operation in my NSOperationQueue will for some reason hang and never finish executing, so additional operations queue up behind it and never execute either. This in turn leads to the app not being able to perform critical functions. I have not yet been able to identify any pattern other than that it occurs on one of my co-workers' devices every week or so. Running the app from Xcode at that point does not help, as killing and relaunching the app resolves the issue for the time being. I've tried attaching the debugger to a running process, and I seem to be able to see log data, but any breakpoints I add are not registering. I've added a breadcrumb trail of NSLogs to try to pinpoint where it's hanging, but this has not yet led to a resolution.
I originally described the bug in another question which is yet to have a clear answer I'm guessing because of the lack of info I'm able to provide around this issue.
A friend once told me that it's possible to save the entire memory stack of an app at a given moment in some form and reload that exact state of memory onto a process on a different device. Does anyone know how I can achieve that? If that's possible, the next time someone encounters that bug I can save that exact state of memory and replicate it to test all my theories of possible solutions. Or is there a different approach to tackling this? As an interim measure, do you think it would make sense to forcefully make the app crash when it enters this state, so actual users would be less confused? I have mixed feelings about this, but the user will have to kill the app from the multitask dock anyway in order to use the app again. I can check the operation queue count or create some kind of timeout code for this until I actually nail the bug.
This sounds like a deadlock on a very rare race condition. You also mentioned using a maxConcurrentOperationCount of 2. This means that either:
1. some operation is blocking the operation queue and waiting for the main thread to release some lock, while main is waiting for the operation to finish, or
2. two operations are waiting on each other to release some lock.
Case 1 seems very unlikely, as the queue allows 2 concurrent operations, so both would have to be completely blocked - unless you are using some system functions that have concurrency issues and block your queue instead of just one thread.
In this case my first attempt to debug would be to connect the debugger and pause execution. After that you can look at the stack traces of all threads. You should be able to find the 2 threads created by your operation queue, after which I would review the responsible functions to find code that might be waiting on some lock. Make sure to take system functions into consideration.
Well, it's quite hard to solve bugs that don't crash the App but just hang a thread. If you can't find the bug by stepping through your code looking for possible deadlocks or race conditions, I would suggest implementing some logging.
Write your log to disk every time you add a log entry. That's not the most efficient way, but if you give a build with logging enabled to your co-worker, you can pull the log from his iPhone when things go wrong, even while the App is still running.
Make sure you log every step you take including the values of important variables around the code that you suspect of breaking the App. This way you can see what the App is doing and what the state of the App is.
Hope this helps a bit. I don't know about restoring the memory state of an App, so I can't help you with that.
Note: if the App were crashing on the bug you could use some other tools, but if I understand correctly that's not the case here, is it?
I read the question describing the bug, and I would try logging to disk what the currently running operations are doing. It seems the operations hang once in a while, and there is a bug in there. If you can log which methods are called while running the operation, it will show you which function call hangs the App, and you can start looking there.
You didn't say this but I presume the bug occurs while a human operator is working with the app? Maybe you should add an automated mode to this app, where the app simulates the same operations that users normally do, using randomized times for starting different actions. Then you can leave the app running unattended on all your devices and increase the chances of seeing the problem.
Also, since the problem appears related to the NSOperationQueue, maybe you should subclass it so that you can add logging to the more interesting methods. For example, each time an operation is added you should log the state of the queue, since you suspect that sometimes it is getting suspended.
Also, as I suggested on your other question, you may want to set up an observer to get notified if the queue ever goes into a suspended state.
Good luck.
Checking assumptions here, since that never hurts: do you actually have evidence that your background threads are hanging? From what you report, the observed behavior is that the tasks you're putting in your background thread are not achieving the outcome that you expected. That doesn't necessarily indicate that the thread has hung—it might just indicate that the particular conditions meant that the thread closed due to all tasks being completed, without the tasks achieving what you wanted them to.
Addition: Given your answer in the comments, it seems to me the next step then is to use logging when an item begins to be executed in the queue so that you can identify which items it is that lead to the queue becoming blocked. Best guess is that it is a certain class of items or certain characteristics of the items if they are all of a certain class. Log enough as the first step of executing each item that you'll have a reasonable characterization of the item, and then once you get a real device that has entered this state, check the logs and see just what conditions are leading to this problem. That should enable you to reliably reproduce the problem on a device during debugging or in the simulator, to then nail it.
In other words—I would focus your attention on identifying the problematic operations first, rather than trying to identify the particular line of code where things are stalling.
In my case, start had to be overridden instead of main.
When in doubt, consult https://developer.apple.com/documentation/foundation/nsoperation#1661262?language=objc for discrepancies with your implementation.
Looking for experience with RabbitMQ, especially in HA configuration using Pacemaker and DRBD as recommended here: http://www.rabbitmq.com/pacemaker.html
The DRBD part in particular makes me nervous, so I'm hoping someone here has real-world experience to share.
Works most of the time. However, you'll have to pay special attention to fencing (split brain) when dealing with DRBD. On a production system it's always a pain to have to fix this kind of issue manually.
We failed to run RabbitMQ in a master/slave setup (multi-state RA). We thought we'd enhance availability; we're back to a single instance now. If anyone else has experience with several RabbitMQ instances running concurrently and backing a master entity, that would be great to share!
I find the lack of tools for debugging Pacemaker when there are issues a big hurdle to deploying to live systems... It's not always clear what Pacemaker is "thinking" or doing. hb_report is not sufficient, unfortunately.
Hope this helps,
D.
We tried a master/slave configuration as well; however, it became difficult to keep all instances up to date with no downtime. And trust me, you want to update RabbitMQ. There are always bugs popping up, either in RabbitMQ itself or in Erlang.
We are getting about 100 crashes per year without any meaningful explanation in the logs. The error log just has a generic "error while starting" in it, and that's pretty much it. Sometimes it won't start after the crash, and most of those times the only solution is to delete all the persistent messages from all instances so that the queue state is synchronized across the cluster. Other times it crashes immediately after launching, and only after multiple repeated attempts will it properly load. Meaning there is no added reliability whatsoever when using master/slave. At least there was none in our case. (RabbitMQ 3.5.3, Erlang 18.0)
It works for production, but only if you keep a copy of the message somewhere in the logs or in the database, from where it can be quickly recovered after a major crash.
Thought I would throw this one out there and see what other people's experiences have been like.
I'm experiencing an issue with a system at work where it stops processing jobs in a queue and 'jams' so to speak. Once the services are restarted the software processes the queue and everything returns to normal.
In my experience so far, I cannot for the life of me figure out what is causing these stoppages. That, and I cannot reproduce the stoppage myself. The queue fails at all different intervals, sometimes running for a month straight, other times failing as close together as twice in 1 day. I have since involved two different vendors and various colleagues within the department and everyone is stumped, and has been for several months.
Since I started, we've isolated the processing to a single server and cranked up the logging, which we've sent to the vendors. Neither has any idea what the problem is.
We've updated a few settings here and there and upgraded client and server pieces, but we have no idea whether the things we are doing are contributing to an overall solution.
So I have a problem that appears to be unreproducible, random and untestable.
Has anyone been involved with any similar situations?
What are some of the ways to solve a situation like this?
Any shared input or experiences would be great.
Cheers,
EDIT: Cranked up the logging, updated all of the components to the latest version, and made sure proper anti-virus exclusions were in place, and so far it has not jammed in over a month!
Use a logging framework that can be turned on in production. You might have too much logging initially, but it should help narrow down the problem; as you get closer, you can narrow the scope of the logging and at the same time increase the verbosity (is that a word?) of the remaining log statements.
In addition to the logging pointed out by Kelly, there is the possibility of a deadlock taking place, since things seem to stop. One option, if this is a Java application, is to use jconsole and connect to the JVM instance. jconsole has a deadlock-detection option which can provide very valuable information when the hangup occurs.
If this is not a Java application and perhaps a .NET application you could make use of this technique.
When things go badly awry in embedded systems I tend to write an error to a special log file in flash and then reboot (there's not much option if, say, you run out of memory).
I realize even that can go wrong, so I try to minimize the risk (by not allocating any memory during the final write, and boosting the writing process's priority).
But that relies on someone retrieving the log file. Now I was considering sending a message over the intertubes to report the error before rebooting.
On second thoughts, of course, it would be better to send that message after reboot, but it did get me to thinking...
What sort of things ought I be doing if I discover an irrecoverable error, and how can I do them as safely as possible in a system which is in an unstable state?
One strategy is to use a section of RAM that is not initialised during power-on/reboot. That can be used to store data that survives a reboot; when your app restarts, early on in the code it can check that memory and see if it contains any useful data. If it does, write it to a log or send it over a comms channel.
How to reserve a section of non-initialised RAM is platform-dependent, and depends on whether you're running a full-blown OS (Linux) that manages RAM initialisation or not. If you're on a small system where RAM initialisation is done by the C start-up code, then your compiler probably has a way to put data (a file-scope variable) in a different section (besides the usual .bss) which is not initialised by the C start-up code.
If the data is not initialised, it will probably contain random data at power-up. To determine whether it contains random data or valid data, use a hash, e.g. CRC-32, to check its validity. If your processor has a way to tell you whether you're in a reboot vs. a power-up reset, you should also use that to treat the data as invalid after a power-up.
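As a minimal sketch of that technique, assuming GCC-style section attributes and a linker script that excludes a ".noinit" section from C start-up initialisation (the section name, the CRASH_MAGIC value, and the record layout are all just illustrative choices):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Crash record that survives a warm reboot because the C start-up code
 * never zeroes or copies this section. The ".noinit" section must be
 * excluded from initialisation in the linker script. */
struct crash_record {
    uint32_t magic;        /* quick sanity check before the CRC */
    char     message[120]; /* last error text */
    uint32_t crc;          /* CRC-32 over magic + message */
};

static struct crash_record crash __attribute__((section(".noinit")));

#define CRASH_MAGIC 0xDEADC0DEu

/* Any CRC-32 will do; this is the usual bitwise implementation. */
static uint32_t crc32(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t c = 0xFFFFFFFFu;
    while (len--) {
        c ^= *p++;
        for (int i = 0; i < 8; i++)
            c = (c >> 1) ^ (0xEDB88320u & -(c & 1u));
    }
    return ~c;
}

/* Call from the fatal-error path, just before forcing the reboot. */
void crash_save(const char *msg)
{
    crash.magic = CRASH_MAGIC;
    strncpy(crash.message, msg, sizeof crash.message - 1);
    crash.message[sizeof crash.message - 1] = '\0';
    crash.crc = crc32(&crash, offsetof(struct crash_record, crc));
}

/* Call early in main(): returns the saved message, or NULL if the RAM
 * contents are just power-up noise. */
const char *crash_retrieve(void)
{
    if (crash.magic == CRASH_MAGIC &&
        crash.crc == crc32(&crash, offsetof(struct crash_record, crc)))
        return crash.message;
    return NULL;
}
```

After reporting the message at boot, it is also worth clearing crash.magic so a stale record can't be reported twice.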
There is no single answer to this. I would start with a watchdog timer, which reboots the system if things go terribly awry.
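As a sketch of the watchdog idea (the register addresses and kick key below are made up for illustration; every MCU does this differently, so check your part's reference manual):

```c
#include <stdint.h>

/* Hypothetical memory-mapped watchdog registers, for illustration only. */
#define WDT_RELOAD (*(volatile uint32_t *)0x40001000u) /* countdown start value */
#define WDT_KICK   (*(volatile uint32_t *)0x40001004u) /* write key to restart countdown */

void do_work(void); /* your application's main work step */

static void wdt_init(void) { WDT_RELOAD = 0x00FFFFFFu; } /* a few seconds of headroom */
static void wdt_kick(void) { WDT_KICK = 0xA5A5A5A5u; }   /* restart the countdown */

int main(void)
{
    wdt_init();
    for (;;) {
        do_work();  /* if this ever hangs, the kicks stop... */
        wdt_kick(); /* ...and the watchdog resets the system */
    }
}
```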
Something else to consider - what is not in a log file is also important. If you have routine updates from various tasks/actions logged then you can learn from what is missing.
Finally, in the case that things go bad and you are still running: enter a critical section, turn off as much of the OS as possible, shut down peripherals, log as much state info as possible, then reboot!
The one thing you want to make sure of is not to corrupt data that might legitimately be in flash. If you try to write information in a crash situation, you need to do so carefully and with the knowledge that the system might be in a very bad state; anything you do needs to be done in a way that doesn't make things worse.
Generally, when I detect a crash state I try to spit information out a serial port. A UART driver that's accessible from a crashed state is usually pretty simple: it just needs to be a polling driver that writes characters to the transmit data register when the busy bit is clear. A crash handler generally doesn't need to play nice with multitasking, so polling is fine, and it generally doesn't need to worry about incoming data, or at least not in a fashion that can't be handled by polling. In fact, a crash handler generally cannot expect multitasking and interrupt handling to be working, since the system is screwed up.
I try to have it write the register file, a portion of the stack and any important OS data structures (the current task control block or something) that might be available and interesting. A watchdog timer usually is responsible for resetting the system in this state, so the crash handler might not have the opportunity to write everything, so dump the most important stuff first (do not have the crash handler kick the watchdog - you don't want to have some bug mistakenly prevent the watchdog from resetting the system).
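A sketch of that kind of crash-time polled UART output is below; the status and data register layout is hypothetical, so substitute your part's transmit-ready flag and data register:

```c
#include <stdint.h>

/* Hypothetical UART registers, for illustration only. */
#define UART_STATUS  (*(volatile uint32_t *)0x40002000u)
#define UART_TXDATA  (*(volatile uint32_t *)0x40002004u)
#define UART_TX_BUSY 0x01u

/* Safe to call from a crash handler: no interrupts, no RTOS, no heap. */
static void crash_putc(char c)
{
    while (UART_STATUS & UART_TX_BUSY)
        ; /* spin until the transmitter is free */
    UART_TXDATA = (uint32_t)c;
}

static void crash_puts(const char *s)
{
    while (*s)
        crash_putc(*s++);
}

static void crash_puthex(uint32_t v)
{
    for (int i = 28; i >= 0; i -= 4)
        crash_putc("0123456789ABCDEF"[(v >> i) & 0xFu]);
}

/* Dump the most important state first: the watchdog may reset the system
 * before we finish, and we must NOT kick it from here. */
void crash_handler(uint32_t pc, uint32_t sp)
{
    crash_puts("\r\n*** CRASH pc=");
    crash_puthex(pc);
    crash_puts(" sp=");
    crash_puthex(sp);
    crash_puts("\r\n");
    /* ...then the rest of the register file and a slice of the stack... */
}
```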
Of course this is most useful in a development setup, since when the device is released it might not have anything attached to the serial port. If you want to be able to capture these kinds of crash dumps after release, then they need to get written somewhere appropriate (like maybe a reserved section of flash - just make sure it's not part of the normal data/file system area unless you're sure it can't corrupt that data). Of course you'd need to have something examine that area at boot so it can be detected and sent somewhere useful or there's no point, unless you might get units back post-mortem and can hook them up to a debugging setup that can look at the data.
I think the most well-known example of proper exception handling is a missile self-destruct. The exception was caused by an arithmetic overflow in software. There obviously was a lot of tracing/recording media involved, because the root cause is known: it was discovered and debugged after the fact.
So, every embedded design must include two features: recording media, like your log file, and a graceful halt: disabling all timers/interrupts, shutting down all ports, and sitting in an infinite loop (or, in the case of a missile, self-destructing).
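As a sketch of what such a graceful halt might look like in C (the interrupt-disable instruction shown is ARM Cortex-M specific, and the two helpers are hypothetical stand-ins for your own recording and shutdown code):

```c
/* Hypothetical helpers from elsewhere in the design. */
void log_to_flash(const char *reason); /* the "recording media" half */
void shutdown_ports(void);             /* drive outputs to safe states */

static inline void disable_interrupts(void)
{
    __asm volatile ("cpsid i"); /* ARM Cortex-M; use your platform's equivalent */
}

/* Fatal-error handler: record, quiesce, halt. */
void fatal_halt(const char *reason)
{
    disable_interrupts(); /* no more timers or interrupts */
    log_to_flash(reason); /* capture why we died */
    shutdown_ports();     /* leave the hardware in a safe state */
    for (;;)
        ; /* sit here until the watchdog or an operator resets us */
}
```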
Writing messages to flash before reboot in embedded systems is often a bad idea. As you point out, no one is going to read the message, and if the problem is not transient you wear out the flash.
When the system is in an inconsistent state there is almost nothing you can do reliably, and the best thing to do is to restart the system as quickly as possible so that you can recover from transient failures (timing, special external events, etc.). In some systems I have written a trap handler that uses some reserved memory so that it can set up the serial port and then emit a stack dump and register contents without requiring extra stack space or clobbering registers.
A simple restart with a dump like that is reasonable because if the problem is transient the restart will resolve the problem and you want to keep it simple and let the device continue. If the problem is not transient you are not going to make forward progress anyway and someone can come along and connect a diagnostic device.
Very interesting paper on failures and recovery: WHY DO COMPUTERS STOP AND WHAT CAN BE DONE ABOUT IT?
For a very simple system, do you have a pin you can wiggle? For example, when you start up, configure it to output high; if things go way south (i.e. a watchdog reset is pending), set it low.
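A tiny sketch of the pin idea, with a hypothetical GPIO output register; one pin serves as a health flag that an external circuit or a logic analyser can watch:

```c
#include <stdint.h>

/* Hypothetical GPIO output register, for illustration only. */
#define GPIO_OUT   (*(volatile uint32_t *)0x40003000u)
#define HEALTH_PIN 0x01u

void health_boot_ok(void)    { GPIO_OUT |=  HEALTH_PIN; } /* high = running normally */
void health_going_down(void) { GPIO_OUT &= ~HEALTH_PIN; } /* low = reset imminent */
```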
Have you ever considered using a garbage collector? And I'm not joking. If you do dynamic allocation at runtime in embedded systems, why not reserve a mark buffer, and mark and sweep when the excrement hits the rotating air blower?
You've probably got the malloc (or whatever) implementation's source, right?
If you don't have library sources for your embedded system forget I ever suggested it, but tell the rest of us what equipment it is in so we can avoid ever using it. Yikes (how do you debug without library sources?).
If your system is already dead... who cares how long it takes? It obviously isn't critical that it be running this instant; if it were, you couldn't risk "dying" like this anyway.