How do you solve a problem that is unreproducible, random and changes are not immediately testable? - testing

Thought I would throw this one out there and see what other people's experiences have been like.
I'm experiencing an issue with a system at work where it stops processing jobs in a queue and 'jams' so to speak. Once the services are restarted the software processes the queue and everything returns to normal.
In my experience so far, I cannot for the life of me figure out what is causing these stoppages. That, and I cannot reproduce the stoppage myself. The queue fails at all different intervals, sometimes running for a month straight, other times failing as close together as twice in 1 day. I have since involved two different vendors and various colleagues within the department and everyone is stumped, and has been for several months.
Since I started, we've isolated the processing to a single server and cranked up the logging, which we've sent to the vendors. Neither of them has any idea what the problem is.
We've updated a few settings here and there and upgraded client and server pieces, but we have no idea whether the things we are doing are contributing to an overall solution.
So I have a problem that appears to be unreproducible, random and untestable.
Has anyone been involved with any similar situations?
What are some of the ways to solve a situation like this?
Any shared input or experiences would be great.
Cheers,
EDIT: Cranked up the logging, updated all of the components to the latest version, and made sure the proper anti-virus exclusions were in place. So far it has not jammed in over a month!

Use a logging framework that can be turned on in production. You may have to log too much at first, but it should help narrow down the problem. As you get closer, you can narrow the scope of the logging and at the same time increase the verbosity (is that a word?) of the remaining log statements.
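A minimal sketch of the "start broad, then narrow" idea, using Python's standard logging module purely as an illustration (the original system's language is unknown). The logger names jobs.queue and jobs.mailer are hypothetical.

```python
import logging

def configure_logging(default_level=logging.DEBUG):
    """Start broad: log everything at DEBUG through the root logger."""
    logging.basicConfig(
        level=default_level,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
        force=True,  # replace any handlers configured earlier (Python 3.8+)
    )

def narrow_logging(suspect, quiet_level=logging.WARNING):
    """Quieten everything except the suspect subsystem, which stays at DEBUG."""
    logging.getLogger().setLevel(quiet_level)
    logging.getLogger(suspect).setLevel(logging.DEBUG)

configure_logging()
narrow_logging("jobs.queue")  # hypothetical name of the subsystem under suspicion

logging.getLogger("jobs.queue").debug("dequeued job id=%s", 42)  # still emitted
logging.getLogger("jobs.mailer").debug("rendered template")      # now suppressed
```

Because the levels are set at runtime, the same two calls could be wired to a config file or admin command so the scope can change without a redeploy.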

In addition to the logging as pointed out by Kelly, there is the possibility of a deadlock taking place, since things seem to stop. One option, if this is a Java application, is to use jconsole and connect to the JVM instance. jconsole has a detect-deadlock option which can provide very valuable information when the hang-up occurs.
If this is not a Java application and perhaps a .NET application you could make use of this technique.

Related

Debugging an intermittently stuck NSOperationQueue

I have an iOS app with a really nasty bug: an operation in my NSOperationQueue will for some reason hang and not finish executing, so additional operations are queued up but never execute. This in turn leads to the app not being able to perform critical functions. I have not yet been able to identify any pattern other than that it occurs on one of my co-worker's devices every week or so. Running the app from Xcode at that point does not help, as killing and relaunching the app resolves the issue for the time being. I've tried attaching the debugger to a running process and I seem to be able to see log data, but any breakpoints I add are not registering. I've added a breadcrumb trail of NSLogs to try to pinpoint where it's hanging, but this has not yet led to a resolution.
I originally described the bug in another question which is yet to have a clear answer I'm guessing because of the lack of info I'm able to provide around this issue.
A friend once told me that it's possible to save the entire memory stack of an app at a given moment in some form and reload that exact state of memory onto a process on a different device. Does anyone know how I can achieve that? If that's possible, the next time someone encounters the bug I can save that exact state of memory and replicate it to test all my theories of possible solutions. Or is there a different approach to tackling this? As an interim measure, do you think it would make sense to force the app to crash when it enters this state, so actual users would be less confused? I have mixed feelings about this, but the user will have to kill the app from the multitask dock anyway in order to use the app again. I can check the operation queue count or create some kind of timeout code for this until I actually nail this bug.
This sounds like a deadlock caused by a very rare race condition. You also mentioned using a maxConcurrentOperationCount of 2. This means that either:
1. some operation is blocking the operation queue and waiting for main to release some lock, while main is waiting for the operation to finish
2. two operations are waiting on each other to release some lock
Case 1 seems very unlikely, since a queue that allows 2 concurrent operations needs both of them blocked before it stalls completely, unless you are using some system functions that have concurrency issues and block your queue instead of just one thread.
In this case my first attempt to debug would be to connect the debugger and pause execution. After that you can look at the stack traces for all threads. You should be able to find the 2 threads that were created by your operation queue, after which I would review the responsible functions to find code that might possibly wait on some lock. Make sure to take system functions into consideration.
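To illustrate what "look at the stack traces for all threads" buys you, here is a rough sketch in Python rather than Objective-C (on iOS you would pause in the Xcode debugger instead). It uses sys._current_frames(), a CPython-specific call; the thread name op-1 and the function blocked_operation are made up for the demo.

```python
import sys
import threading
import traceback

def dump_all_stacks():
    """Return {thread name: formatted stack} for every live thread."""
    names = {t.ident: t.name for t in threading.enumerate()}
    return {
        names.get(ident, f"thread-{ident}"): "".join(traceback.format_stack(frame))
        for ident, frame in sys._current_frames().items()
    }

lock = threading.Lock()
started = threading.Event()

def blocked_operation():
    started.set()
    with lock:  # blocks for as long as the main thread holds the lock
        pass

lock.acquire()  # simulate "main" holding a lock the operation needs
worker = threading.Thread(target=blocked_operation, name="op-1", daemon=True)
worker.start()
started.wait(timeout=5)

stacks = dump_all_stacks()  # snapshot taken while the worker is stuck
lock.release()
worker.join(timeout=5)
print(stacks["op-1"])
```

The printed stack for op-1 ends inside blocked_operation, which is exactly the kind of hint you want: it names the function that is sitting on a lock.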
Well, it's quite hard to solve bugs that don't crash the app but just hang a thread. If you can't find the bug by going through your code step by step, checking for possible deadlocks or race conditions, I would suggest implementing some logging.
Write your log to disk every time you add a log entry. That's not the most efficient way, but if you give a build with logging enabled to your co-worker, you can pull the log from his iPhone when things go wrong, even while the app is still running.
Make sure you log every step you take including the values of important variables around the code that you suspect of breaking the App. This way you can see what the App is doing and what the state of the App is.
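A sketch of "write to disk on every entry", again in Python for illustration only. DurableFileHandler, the breadcrumbs logger name, and the log location are all hypothetical; the point is just that each record is flushed and fsynced before the call returns, so the file is useful even if the process hangs right afterwards.

```python
import logging
import os
import tempfile

class DurableFileHandler(logging.FileHandler):
    """FileHandler that forces each record to disk before returning."""

    def emit(self, record):
        super().emit(record)  # writes the record and flushes Python's buffer
        if self.stream is not None:
            os.fsync(self.stream.fileno())  # ask the OS to commit it to storage

log_path = os.path.join(tempfile.mkdtemp(), "breadcrumbs.log")  # hypothetical location
logger = logging.getLogger("breadcrumbs")
logger.setLevel(logging.DEBUG)
logger.propagate = False
handler = DurableFileHandler(log_path)
handler.setFormatter(logging.Formatter("%(asctime)s %(threadName)s %(message)s"))
logger.addHandler(handler)

pending = 3  # an "important variable" worth recording alongside the step
logger.debug("starting upload op, pending=%d", pending)

# The entry is readable immediately, while the process is still running.
with open(log_path) as f:
    contents = f.read()
```

The fsync per entry is slow, which is fine for a diagnostic build handed to one co-worker but probably not something to ship to everyone.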
Hope this helps a bit. I don't know about restoring the state of memory of an app, so I can't help you with that.
Note: if the app were crashing on the bug you could use some other tools, but if I understand correctly that's not the case here, is it?
I read the question describing the bug and I would try to log to disk what the currently running operations are doing. It seems the operations will hang once in a while and there is a bug in there. If you can log what methods are called while running the operation this will show you what function call will hang the App and you can start looking in there.
You didn't say this but I presume the bug occurs while a human operator is working with the app? Maybe you should add an automated mode to this app, where the app simulates the same operations that users normally do, using randomized times for starting different actions. Then you can leave the app running unattended on all your devices and increase the chances of seeing the problem.
Also, since the problem appears related to the NSOperationQueue, maybe you should subclass it so that you can add logging to the more interesting methods. For example, each time an operation is added you should log the state of the queue, since you suspect that sometimes it is getting suspended.
Also, as I suggested on your other question, you may want to set up an observer to get notified if the queue ever goes into a suspended state.
Good luck.
Checking assumptions here, since that never hurts: do you actually have evidence that your background threads are hanging? From what you report, the observed behavior is that the tasks you're putting in your background thread are not achieving the outcome that you expected. That doesn't necessarily indicate that the thread has hung—it might just indicate that the particular conditions meant that the thread closed due to all tasks being completed, without the tasks achieving what you wanted them to.
Addition: Given your answer in the comments, it seems to me the next step then is to use logging when an item begins to be executed in the queue so that you can identify which items it is that lead to the queue becoming blocked. Best guess is that it is a certain class of items or certain characteristics of the items if they are all of a certain class. Log enough as the first step of executing each item that you'll have a reasonable characterization of the item, and then once you get a real device that has entered this state, check the logs and see just what conditions are leading to this problem. That should enable you to reliably reproduce the problem on a device during debugging or in the simulator, to then nail it.
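As a rough sketch of that approach: if every item logs a line when it begins and another when it ends, then the items that began but never ended are your suspects. The BEGIN/END line format and the item ids below are invented for the example, and the code is Python just to show the log analysis step.

```python
import re

def unfinished_items(log_lines):
    """Return ids of items that logged BEGIN but never logged END, in start order."""
    open_items = {}
    for line in log_lines:
        match = re.search(r"\b(BEGIN|END) item=(\S+)", line)
        if not match:
            continue
        event, item_id = match.groups()
        if event == "BEGIN":
            open_items[item_id] = line  # remember the characterization line
        else:
            open_items.pop(item_id, None)
    return list(open_items)

# Hypothetical log pulled from a device that entered the stuck state.
sample_log = [
    "10:00:01 BEGIN item=a17 kind=thumbnail size=2048",
    "10:00:02 END item=a17",
    "10:00:02 BEGIN item=a18 kind=upload size=900000",
    "10:00:03 BEGIN item=a19 kind=thumbnail size=512",
    "10:00:04 END item=a19",
]
stuck = unfinished_items(sample_log)
print(stuck)
```

Here only a18 never finished, and its BEGIN line already tells you its kind and size, which is the characterization you need to try reproducing it.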
In other words—I would focus your attention on identifying the problematic operations first, rather than trying to identify the particular line of code where things are stalling.
In my case, start (instead of main) had to be overridden.
When in doubt consult https://developer.apple.com/documentation/foundation/nsoperation#1661262?language=objc for discrepancies with your implementation

How stable is RabbitMQ in production (using DRBD and Pacemaker)?

Looking for experience with RabbitMQ, especially in an HA configuration using Pacemaker and DRBD as recommended here: http://www.rabbitmq.com/pacemaker.html
The DRBD part in particular makes me nervous, so I'm hoping someone here has real-world experience to share.
Works most of the time. However, you'll have to pay special attention to fencing (split brain) when dealing with DRBD. On a production system it's always a pain to have to fix these kinds of issues manually.
We failed to run RabbitMQ in a master/slave (multi-state RA). We thought we'd enhance availability. We're back to a single instance now. If anyone else has experience with several RabbitMQ instances running concurrently and backing a master entity that would be great to share!
I find the lack of tools to debug Pacemaker when there are issues is a big hurdle to deploy to live systems... It's not always clear what Pacemaker is "thinking" or doing. hb_report is not sufficient unfortunately.
Hope this helps,
D.
We tried master/slave configuration as well, however it became difficult to maintain all instances up to date with no downtime. And trust me, you want to update RabbitMQ. There are always bugs popping up either in RabbitMQ itself or in Erlang.
We are getting about 100 crashes per year without any meaningful explanation in the logs. The error log just has a generic "error while starting" in it and that's pretty much it. Sometimes it won't start after the crash, and most of those times the only solution is to delete all the persistent messages from all instances, so that the queue state is synchronized across the cluster. Other times it would crash immediately after launching, and only after multiple repeated attempts would it load properly. Meaning there is no added reliability whatsoever when using master/slave. At least there was none in our case. (RabbitMQ 3.5.3, Erlang 18.0)
It works for production, but only if you keep a copy of the message somewhere in the logs or in the database, from where it can be quickly recovered after a major crash.

What methods do you use to test for scalability in web applications?

Our testing system is pretty rudimentary; fire up a browser, see if it works. Recently we ran into problems, found by our client, with our application where the number of users created a slow-down in the application. The application is basically a huge Word document with people editing their own versions all at the same time. Part of the problem came from not knowing how to test multiple instances at the same time. My partner and I thought about how to test this; one idea was to hire out an internet cafe and hire students for an hour to bang on the app.
What are other ways that people have tried to emulate concurrency in testing their web-based application? Most of the advice here is for specific methodology; I'm asking, how do you test it to make sure that it works?
If you have never checked out Selenium, then you need to. It will allow you to do automated web testing through the browser. Ok, so first problem solved.
Now ideally you could use that same script and load it up on a bunch of boxes and run them all at once to get some sort of load testing right? Luckily for you someone has already figured this out, although it is a paid service: Browser Mob. But, it looks like you were willing to spend a little money to do this anyway, and would probably net you better, more repeatable results.
We usually answer the question "can the web application do more than one thing at a time" by using JMeter to produce a simulated HTTP load on the web server.
I find that it helps to distinguish several different types of testing: concurrency (what happens when two events in the system collide), capacity (what happens when there are many overlapping requests), volume (what happens as data accumulates in the system)...
A huge general slowdown, evidenced by response times that fall outside of the SLA, is usually related to capacity problems (with contention as a common cause) or volume (many users, much data, and the system gets slower over time). The former usually requires some sort of multi-threaded request stream; the latter you can usually manage by preloading the volume, and then measuring the response times experienced by a single user.
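To make "multi-threaded request stream" concrete, here is a small Python sketch. It is not JMeter and does not issue real HTTP requests: edit_document is a stand-in that models server-side contention with a shared lock, and run_load is a hypothetical helper. Swapping in a real request function is the obvious next step.

```python
import statistics
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def run_load(action, users, requests_per_user):
    """Call action() from `users` concurrent threads; return every latency in seconds."""
    def one_user(user_index):
        latencies = []
        for _ in range(requests_per_user):
            start = time.perf_counter()
            action()
            latencies.append(time.perf_counter() - start)
        return latencies

    with ThreadPoolExecutor(max_workers=users) as pool:
        per_user = list(pool.map(one_user, range(users)))
    return [latency for user in per_user for latency in user]

# Stand-in for a real request: a shared lock models contention on the server.
document_lock = threading.Lock()

def edit_document():
    with document_lock:
        time.sleep(0.005)

solo = run_load(edit_document, users=1, requests_per_user=5)
crowd = run_load(edit_document, users=8, requests_per_user=5)
print(f"median latency, 1 user:  {statistics.median(solo) * 1000:.1f} ms")
print(f"median latency, 8 users: {statistics.median(crowd) * 1000:.1f} ms")
```

Comparing the single-user run with the crowded run is the useful part: if latency climbs with the number of users while the work per request stays the same, you are looking at contention rather than volume.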
I generally find that separating the load generator from the actual measurement/instrumentation is a good idea. That can be as simple as having a black box over there to generate a typical load, and sitting here with a stop watch measuring the responsiveness of a typical use case.
JMeter http://jmeter.apache.org/

SQL 2005: should I roll my own log shipping?

I'm looking into using log shipping for disaster recovery and I'm getting mixed messages about whether to use the built-in stuff or roll my own. Which do you recommend, please, and if you favour rolling your own what's wrong with the built-in stuff? If I'm going to reinvent the wheel I don't want to make the same mistakes! (We have the Workgroup edition.) Thanks in advance.
There are really two parts to your question:
Is native log shipping good enough?
If not, whose log shipping should I use?
Here's my two cents, but like you're already discovering, a lot of this is based on opinions.
About the first question - native log shipping is fine for small implementations - say, 1-2 servers, a handful of databases, and a full time DBA. In environments like this, the native log shipping's lack of monitoring, alerting, and management isn't a problem. If it breaks, you don't sweat bullets because it's relatively easy to repair. When would it break? For example, if someone accidentally deletes the transaction log backup file before it's restored on the disaster recovery server. (Happens all the time with automated processes.)
When you grow beyond a couple of servers, the lack of management automation starts to become a problem. You want better automated email alerting, alerts when the log shipping gets more than X minutes/hours behind, alerts when the file copying is taking too long, easier handling of multiple secondary servers, etc. That's when people turn to alternate solutions.
About the second question - I'll put it this way. I work for Quest Software, the makers of LiteSpeed, a SQL Server backup & recovery product. I regularly talk to database administrators who use our product and other products like Idera SQLSafe and Red Gate SQL Backup to make their backup management easier. We build GUI tools to automate the log shipping process, give you a nice graphical dashboard showing exactly where your bottlenecks are, and help make sure your butt is covered when your primary datacenter goes down. We sell a lot of licenses. :-)
If you roll your own scripts - and you certainly can - you will be completely alone when your datacenter goes down. You won't have a support line to call, you won't have tools to help you, and you won't be able to tell your coworkers, "Open this GUI and click here to fail over." You'll be trying to walk them through T-SQL scripts in the middle of a disaster. Expert DBAs who have a lot of time on their hands sometimes prefer writing their own scripts, and it does give you a lot of control, but you have to make sure you've got enough time to build them and test them before you bank your job on it.
Have you considered mirroring instead? Here is some documentation to determine if you could do that instead
If you decide to roll your own, here's a nice guide.
I'm assuming you're going this route because Enterprise Edition is so costly?
If you don't need a "live-backup", but really just want a frequently updated backup, I think this approach makes a lot of sense.
One more thing:
Make sure you regularly verify that your backup strategy is working.
I'm pretty sure it's available in Standard, since we're doing some shipping, but I'm not sure about the Workgroup edition - it's pretty stripped down.
I'm always in favor of the packaged solution, mostly because I trust a whole team of MSFT developers more than I trust myself, but that comes with a price for sure. I'd second that any solution you roll on your own has to come with a lag notification piece so that you'll know immediately if it isn't working - how many times do we only find out backup solutions aren't working when somebody needs a backup? Also, think honestly about how much time it will take you to design and roll your own solution, including bug fixes and maintenance - can you really do it more cheaply? Maybe you can, but maybe not.
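For what a "lag notification piece" might boil down to, here is a minimal Python sketch. It assumes your own scripts record when the last log backup was restored on the secondary; shipping_lag_alert, the 30-minute limit, and the timestamps are all illustrative, and actually sending the alert (email, pager) is left out.

```python
from datetime import datetime, timedelta

def shipping_lag_alert(last_restored_at, now, max_lag=timedelta(minutes=30)):
    """Return an alert message if the secondary is too far behind, else None."""
    lag = now - last_restored_at
    if lag > max_lag:
        lag_minutes = int(lag.total_seconds() // 60)
        limit_minutes = int(max_lag.total_seconds() // 60)
        return f"log shipping is {lag_minutes} min behind (limit {limit_minutes} min)"
    return None

now = datetime(2024, 1, 1, 12, 0)
healthy = shipping_lag_alert(datetime(2024, 1, 1, 11, 50), now)  # 10 min behind
behind = shipping_lag_alert(datetime(2024, 1, 1, 10, 15), now)   # 105 min behind
print(healthy)
print(behind)
```

The check should run somewhere other than the primary server, otherwise it goes silent at exactly the moment you need it.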
Also, one problem we ran into with Workgroup edition is that it only supports 5 connections at once, and it seems to start dropping connections if you get more users than that, so we had to upgrade to Standard. We were getting ASP.NET errors that our connections were closed if we left them unattended for even a few seconds, which caused us all kinds of problems.
I would expect this to be close to the last place you'd want to save a few bucks, especially given the likely consequences if you screw up. Would you rather have your job on the line? I don't even think I'd admit it, if I felt I had a chance of getting this one right?
What's your personal upside benefit in this?
I tried the built-in log shipping and found some real problems with it so I developed my own. I blogged about it here.
PS: And just for the record, you definitely get log shipping in the Workgroup edition. I don't know where this Enterprise-only thing started.

Developing in a hostile environment

OK, not that kind of hostile. I'm curious to hear how people deal with developing on big corporate networks that mandate all kinds of developer-unfriendly services and policies on desktops (think ProQuota, over-zealous virus scanners, no local admin, no access to SO). I've previously seen virtual LANs used effectively, or completely separated parallel networks, but these aren't always practical. Any other tips?
The most important thing (if possible) is to recruit support from your boss.
Unless he's a PHB, he will often understand the impact of these restrictions on you, your team, and indirectly on his success. If the requests are reasonable, he can provide the buffer if you do go against IT. In addition, if the entire team or other developers seek the same policies, this "group bargaining power" can be used to create special policies.
Generally speaking, large corporations are over-zealous about legal issues and information security. However, IT departments generally hate dealing with numerous requests for support from the same person. If you show a clear harm to productivity on a project (e.g., you use a lot of temp files and the antivirus hits them), or that your program has to be installable from administrator mode, they will sometimes reach a compromise. You may have to sign something stating you will not use administrative access on your machine to install illegal software, but you'd still get admin.
In the few cases I have gone for job interviews (I'm mostly in academia but worked some in the industry), one of my greatest concerns was the amount of control I had about my computing environment, from hardware, to software, to administrative rights. If I cannot be trusted as a developer to manage my own windows box, I don't feel I should be trusted with a mission-critical system.
I haven't tried this myself, but I once saw someone say that the central IT gave in and let him administer his own workstation, after he complied with the policies by submitting to them a change request form with a list of the first 300 things he wanted changing on his workstation.
Anything that interferes with you doing your job is good to bring up in a meeting.
Ex:
This Virus Scanner runs 4 times a day while I am at work. During that run my compile times take 5 times as long, and the use of my other development tools is brought down to a crawl.
The web filters are overzealous. I have attempted to access sites x, y, and z for extra development information, and have been unable to. The time it took to find a good resource was doubled because of this.
And so on.
Work within the (hostile) rules and give up, quit and find somewhere more enlightened or try to change the organization, your choice.
If you decide to try and change things don't go against IT alone, that will just make you the "trouble maker" and you will never get anywhere, try to get support from your boss and other developers - if you can't get support then you may be better off looking for a new job.
I would explain your issues to your boss and/or sysadmin. If they are receptive and agree it's a good idea to let you have control over your workstation(s), then problem solved; if not, I would walk from the project/job before your probationary period is over.
I was in a similar situation once at a large government corporation, and it turned out that management not being willing to unlock developers' boxes was just the tip of the iceberg of a massive bureaucracy. The project ended up being a huge failure, and by the time I left half of the IT department (not just the project team) had quit.
Just my 2 cents
Yeah. Leave. If your organization is not willing to give you the normal tools that any normal professional programmer should be able to use, then it's time to up your networking skills and update your resume.
Bringing your own laptop with the necessary tools is always a good way to overcome these man-made hurdles
Bring your own laptop but DON'T connect it to the network (and make it obvious that you do not intend to).
Copy stuff, e.g. Visio diagrams, over via USB drive.
If they don't allow USB, you can access the internet from outside and email the files. Using OWA via a browser sometimes gives you more rights to send files.
Sounds like they're doing you a favor. Your code is guaranteed to run as a normal user, doesn't try to write to program files or other sensitive directories, is aware of what issues virus scanners bring to the table, and can handle other issues you wouldn't have normally encountered until installing your apps on a client machine.
As for no access to SO, I'd quit.
Our workplace required a full virus scan every day, so in the morning, when I hooked my laptop up, it was a 2 hour wait before I could do work.
I finally found a solution. MSVC 6 has a built-in debugger. I went into Task Manager, picked the McAfee scanner process, and told it to debug. This fired up MSVC 6, and the scanner froze at a breakpoint. I hit reset, and the problem was gone. About 6 months later they removed the policy and all was good.