OptaPlanner immediately produces a better solution after terminating and restarting the solver

I created a solution based on the task assigning example of OptaPlanner and observe one specific behavior in both the original example and my own solution:
Solving the 100tasks-5employees problem hardly produces any new best scores after half a minute or so, but terminating the solver and restarting it immediately brings up better solutions.
Why does this happen? In my understanding, the repeated construction heuristic does not change any planning entity, as all of them are already initialized. Then local search starts again. Why does it immediately find new best solutions, while simply continuing the first run without interruption does not, or does so far more slowly?

By terminating and restarting the solver, you're effectively causing Late Acceptance to do a reheating. OptaPlanner will do automatic reheating once that Jira issue is prioritized and implemented.
This occurs on a minority of use cases, but if it occurs on a use case, it tends to occur on all of its datasets.
In some cases I've worked around it by configuring multiple <localSearch> phases with <unimprovedSecondsSpentLimit> terminations, but I don't like that approach. Fixing that Jira issue is the only real solution.
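For reference, a minimal sketch of that workaround using OptaPlanner's programmatic SolverConfig API (the config resource name and the 60-second limit are placeholders; the equivalent XML phase chain works the same way):

import java.util.ArrayList;
import java.util.List;

import org.optaplanner.core.config.constructionheuristic.ConstructionHeuristicPhaseConfig;
import org.optaplanner.core.config.localsearch.LocalSearchPhaseConfig;
import org.optaplanner.core.config.phase.PhaseConfig;
import org.optaplanner.core.config.solver.SolverConfig;
import org.optaplanner.core.config.solver.termination.TerminationConfig;

public class ReheatingWorkaround {

    // Chain several local search phases, each giving up after 60 seconds without
    // improvement. When one phase ends and the next begins, Late Acceptance starts
    // fresh, which imitates the manual terminate-and-restart "reheating".
    public static SolverConfig buildConfig() {
        SolverConfig solverConfig = SolverConfig.createFromXmlResource("solverConfig.xml"); // placeholder

        TerminationConfig unimproved1 = new TerminationConfig();
        unimproved1.setUnimprovedSecondsSpentLimit(60L);
        LocalSearchPhaseConfig localSearch1 = new LocalSearchPhaseConfig();
        localSearch1.setTerminationConfig(unimproved1);

        TerminationConfig unimproved2 = new TerminationConfig();
        unimproved2.setUnimprovedSecondsSpentLimit(60L);
        LocalSearchPhaseConfig localSearch2 = new LocalSearchPhaseConfig();
        localSearch2.setTerminationConfig(unimproved2);

        List<PhaseConfig> phases = new ArrayList<>();
        phases.add(new ConstructionHeuristicPhaseConfig()); // no-op if entities are already initialized
        phases.add(localSearch1);
        phases.add(localSearch2);
        solverConfig.setPhaseConfigList(phases);
        return solverConfig;
    }
}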

Make Optaplanner ignore redundant result

I'm having an issue where, in my use case, I just want to receive the single best answer and return with it. I want the result as quickly as possible and want the non-improvement timer to trigger as soon as possible.
A lot of the time OptaPlanner signals that a new solution has been found, but when I review the score, it is a score that was already found. Is there a way to keep OptaPlanner from signaling that it found a new, better solution when it has the exact same score?
Apologies if this is simple to do; I'm new to OptaPlanner and couldn't find what I was looking for after searching around.
I could not find a way to stop OptaPlanner from triggering when the same best score is found.
EDIT:
I just remembered that, because of the random seed (which we want to keep), I spawned 8 threads to solve the same problem. So each thread is individually coming up with the same best answer and notifying the main thread. So this is definitely not an OptaPlanner problem. Your answer just made me think of that. Thank you so much for the impressively quick answer.
The only thing I can think of is to track in the main application whether the same results come up, and to keep a non-improvement timer there to kill the solver threads myself... I'll mark the solution as resolved since I was slow to realize this is an issue in my application.
OptaPlanner should fire the new best solution event only if the score has improved. If you see the event firing without any score improvement, there are two possible explanations:
1. a new best solution has been produced after submitting a ProblemChange, which is expected, as this new best solution reflects an external change
2. there is a score corruption and, as a result, the score is incorrect; try turning on the FULL_ASSERT mode to confirm that
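For the multi-threaded setup described in the EDIT above, the de-duplication can indeed live in the main application. A minimal, hypothetical sketch (it assumes a HardSoftScore; the tracker class and its names are made up for illustration):

import java.util.concurrent.atomic.AtomicReference;

import org.optaplanner.core.api.score.buildin.hardsoft.HardSoftScore;
import org.optaplanner.core.api.solver.Solver;

public class GlobalBestTracker<Solution_> {

    // Best score reported so far by any of the solver threads (null until the first event).
    private final AtomicReference<HardSoftScore> globalBest = new AtomicReference<>();

    // Attach this to each of the 8 solvers; application logic only runs when a solver
    // reports a score that strictly beats the global best, so duplicate scores are ignored.
    public void register(Solver<Solution_> solver) {
        solver.addEventListener(event -> {
            HardSoftScore newScore = (HardSoftScore) event.getNewBestScore();
            HardSoftScore previousBest = globalBest.getAndAccumulate(newScore,
                    (oldBest, candidate) -> oldBest == null || candidate.compareTo(oldBest) > 0
                            ? candidate : oldBest);
            if (previousBest == null || newScore.compareTo(previousBest) > 0) {
                // Genuinely new global best: notify the main thread, reset the
                // non-improvement timer, etc.
                System.out.println("New global best score: " + newScore);
            }
        });
    }
}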

Can I run a custom phase generating an initial solution several times to generate a pool of starting solutions?

I’m using OptaPlanner 8.15.0 to solve a variant of the redistricting problem: geographic areas with an expected revenue are assembled into sales districts that must be contiguous and should be balanced and compact.
The initial solution is generated by a custom phase using a greedy heuristic. Afterwards, a local search hill climbing phase makes big and obvious improvements to it using custom moves. Only then does the “real” optimization start, with a tabu search.
The configuration of the first two phases:
<customPhase>
  <customPhaseCommandClass>project.solver.SeedGrowthInitialSolution</customPhaseCommandClass>
</customPhase>
<localSearch>
  <localSearchType>HILL_CLIMBING</localSearchType>
  <moveListFactory>
    <moveListFactoryClass>project.move.SplitAndJoinMoveFactory</moveListFactoryClass>
  </moveListFactory>
  <termination>
    <unimprovedSecondsSpentLimit>60</unimprovedSecondsSpentLimit>
  </termination>
</localSearch>
It turned out that the quality (i.e. the score) of the overall solution depends very much on the quality of the initial solution after the second phase: the tabu search will improve the solution substantially, but it’s not able to “fix” a bad initial solution.
Is there a way to run the first two phases repeatedly to generate several candidate solutions and then continue with the best of them?
My current workaround is to start several instances manually, watch the logs for the achieved scores after the first two phases and restart all instances where the initial scores are not promising.
The answer is no, when the problem is specified like this. However, we could maybe change the problem statement a bit. If you run multiple independent solvers and, at the end, pick the best result, you can then pass that to a new solver which will start from there. This would be a variant of multi-bet solving. It is not supported out of the box, but would be relatively easy to code yourself, using either the Solver or SolverManager APIs.
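Not an official feature, but a rough sketch of what that multi-bet variant could look like with the SolverManager API (the class name, the two config resources, and the scoreGetter function are placeholders for your own domain classes and setup):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.function.Function;

import org.optaplanner.core.api.score.Score;
import org.optaplanner.core.api.solver.SolverJob;
import org.optaplanner.core.api.solver.SolverManager;
import org.optaplanner.core.config.solver.SolverConfig;

public class MultiBetSolver {

    // Runs the cheap phases (custom construction + hill climbing) on the same problem
    // several times in parallel, keeps the best bet, then hands it to the long tabu
    // search run. scoreGetter extracts the @PlanningScore from your solution class.
    public static <Solution_> Solution_ solveMultiBet(
            Solution_ problem, int bets,
            SolverConfig betConfig, SolverConfig finalConfig,
            Function<Solution_, Score<?>> scoreGetter)
            throws InterruptedException, ExecutionException {

        SolverManager<Solution_, Integer> betManager = SolverManager.create(betConfig);
        List<SolverJob<Solution_, Integer>> jobs = new ArrayList<>();
        for (int i = 0; i < bets; i++) {
            jobs.add(betManager.solve(i, problem)); // independent bets on the same input
        }

        Solution_ bestBet = null;
        for (SolverJob<Solution_, Integer> job : jobs) {
            Solution_ candidate = job.getFinalBestSolution();
            if (bestBet == null
                    || compare(scoreGetter.apply(candidate), scoreGetter.apply(bestBet)) > 0) {
                bestBet = candidate;
            }
        }

        // The second solver starts from the best bet; its construction heuristic is a
        // no-op because every planning entity is already initialized.
        SolverManager<Solution_, Integer> finalManager = SolverManager.create(finalConfig);
        return finalManager.solve(0, bestBet).getFinalBestSolution();
    }

    @SuppressWarnings({"unchecked", "rawtypes"})
    private static int compare(Score<?> a, Score<?> b) {
        return ((Comparable) a).compareTo(b);
    }
}

The bets share one SolverManager, so they run in parallel up to its default (CPU-based) parallel solver count; the final run uses a separate config that contains only the tabu search phase.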
That said, your entire approach is very unconventional. Custom construction heuristic is understandable, as are custom moves. But the fact that you feel the need to separate the moves into different phases makes me curious. Have you benchmarked this to be the best possible approach?
I would think that using the default local search (late acceptance), with your custom moves and our generic moves all thrown into the mix, would yield decent results already. If that fails to happen, then generally either your constraints need to be faster, or the solver needs to run for longer. For large problems, multi-threaded solving can help, too.

Example crash logs from nonatomic variable access on multiple threads?

What sorts of crashes can a lack of thread safety cause? (Sort of a follow-up to Under what circumstances are atomic properties useful?).
Can anyone reproduce a crash example (even if only sporadically) with a test case?
I'm trying to sort out, from a large number of crash logs, what percentage of them are related to a particular integer variable being accessed from multiple threads. (Yes, I've already confirmed that this access can happen in my iOS app; the question is just how often it happens.)
Obviously improper variable access can lead to unexpected effects and more crashes downstream, but I'm interested only in those related directly to the access/mutation of the variable (since downstream effects won't generally be the same from one app to another). Also, only interested in integer (or other completely immutable) variables.
I'm seeking any example error codes / exceptions / crash logs from which I can extract keywords and regular expressions. Even better would be a test app or unit test case which can reproduce a crash with high probability. I tried a simple example in a unit test but couldn't seem to cause a crash.
While this approach sounds logical, race conditions manifest themselves in manifold ways and defy simple characterization within crash logs. Problems arising from data races are particularly vexing because they often result in heisenbugs. And sometimes the code doesn't even crash, but rather just produces incorrect results (which may or may not result in other problems).
I know this isn't the answer you're looking for, but while I can see the appeal of your plan, it is unlikely to be a productive exercise.
If it's a question of prioritizing the thread-safety fix versus other issues, there is no simple answer. The app is just going to be susceptible to crashes and other unpredictable behavior until this data race issue is resolved. But we can't reliably forecast what percentage of your current crash logs will be addressed. IMHO, given that crashing is the quickest way to lose a user-base, this thread-safety fix strikes me as a very high priority.
In terms of identifying the unsynchronized access, the Thread Sanitizer, as outlined in this WWDC video, is an excellent tool.
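To illustrate the "incorrect results rather than a crash" point in a language-neutral way, here is a small Java sketch (not iOS code, just a demonstration of the same phenomenon): two threads increment a plain field a million times each, and the final count usually falls well short of two million, without any crash.

public class LostUpdateDemo {

    // Deliberately unsynchronized: two threads race on this field.
    private static int counter = 0;

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < 1_000_000; i++) {
                counter++; // read-modify-write is not atomic, so updates get lost
            }
        };
        Thread a = new Thread(work);
        Thread b = new Thread(work);
        a.start();
        b.start();
        a.join();
        b.join();
        // Typically prints a value well below 2,000,000 and never crashes,
        // which is exactly why crash logs rarely point at races like this.
        System.out.println("counter = " + counter);
    }
}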

Debugging an intermittently stuck NSOperationQueue

I have an iOS app with a really nasty bug: an operation in my NSOperationQueue will for some reason hang and never finish executing, so additional operations queue up behind it but still don't execute. This in turn leads to the app not being able to perform critical functions. I have not yet been able to identify any pattern other than that it occurs on one of my co-workers' devices every week or so. Running the app from Xcode at that point does not help, as killing and relaunching the app resolves the issue for the time being. I've tried attaching the debugger to a running process, and I seem to be able to see log data, but any breakpoints I add are not registering. I've added a breadcrumb trail of NSLogs to try to pinpoint where it's hanging, but this has not yet led to a resolution.
I originally described the bug in another question which is yet to have a clear answer I'm guessing because of the lack of info I'm able to provide around this issue.
A friend once told me that it's possible to save the entire memory stack of an app at a given moment in some form and reload that exact state of memory onto a process on a different device. Does anyone know how I can achieve that? If that's possible, the next time someone encounters that bug I can save that exact state of memory and replicate it to test all my theories of possible solutions. Or is there a different approach to tackling this? As an interim measure, do you think it would make sense to forcefully make the app crash when it enters this state, so actual users would be less confused? I have mixed feelings about this, but the user will have to kill the app from the multitasking dock anyway in order to use the app again. I can check the operation queue count or create some kind of timeout code for this until I actually nail this bug.
This sounds like a deadlock on a very rare race condition. You also mentioned using a maxConcurrentOperationCount of 2. This means that either:
1. some operation is blocking the operation queue and waiting for main to release some lock, while main is waiting for the operation to finish, or
2. two operations are waiting on each other to release some lock.
Option 1 seems very unlikely, as the queue allows 2 concurrent operations, so both of them would have to be completely blocked, unless you are using some system functions that have concurrency issues and block your queue instead of just one thread.
In this case my first attempt to debug would be to connect the debugger and pause execution. After that you can look at the stack traces for all threads. You should be able to find the 2 threads that are made by your operation queue, after which I would review the responsible functions to find code that might possibly wait on some lock. Make sure to take system functions into consideration.
Well, it's quite hard to solve bugs that don't crash the app but just hang a thread. If you can't find the bug by going through your code step by step, checking whether there are any possible deadlocks or race conditions, I would suggest implementing some logging.
Write your log to disk every time you add a log entry. That's not the most efficient way, but if you give a build with logging enabled to your co-worker, you can pull the log from his iPhone when things go wrong, even while the app is still running.
Make sure you log every step you take, including the values of important variables around the code that you suspect of breaking the app. This way you can see what the app is doing and what its state is.
Hope this helps a bit. I don't know about restoring the memory state of an app, so I can't help you with that.
Note: if the app were crashing on the bug you could use some other tools, but if I understand it correctly, that's not the case here, is it?
I read the question describing the bug, and I would try to log to disk what the currently running operations are doing. It seems the operations hang once in a while, and there is a bug in there. If you can log which methods are called while running the operation, this will show you which function call hangs the app, and you can start looking there.
You didn't say this but I presume the bug occurs while a human operator is working with the app? Maybe you should add an automated mode to this app, where the app simulates the same operations that users normally do, using randomized times for starting different actions. Then you can leave the app running unattended on all your devices and increase the chances of seeing the problem.
Also, since the problem appears related to the NSOperationQueue, maybe you should subclass it so that you can add logging to the more interesting methods. For example, each time an operation is added you should log the state of the queue, since you suspect that sometimes it is getting suspended.
Also, as I suggested on your other question as well, you may want to set up an observer to get notified if the queue ever goes into a suspended state.
Good luck.
Checking assumptions here, since that never hurts: do you actually have evidence that your background threads are hanging? From what you report, the observed behavior is that the tasks you're putting in your background thread are not achieving the outcome that you expected. That doesn't necessarily indicate that the thread has hung—it might just indicate that the particular conditions meant that the thread closed due to all tasks being completed, without the tasks achieving what you wanted them to.
Addition: Given your answer in the comments, it seems to me the next step then is to use logging when an item begins to be executed in the queue so that you can identify which items it is that lead to the queue becoming blocked. Best guess is that it is a certain class of items or certain characteristics of the items if they are all of a certain class. Log enough as the first step of executing each item that you'll have a reasonable characterization of the item, and then once you get a real device that has entered this state, check the logs and see just what conditions are leading to this problem. That should enable you to reliably reproduce the problem on a device during debugging or in the simulator, to then nail it.
In other words—I would focus your attention on identifying the problematic operations first, rather than trying to identify the particular line of code where things are stalling.
In my case, start (instead of main) had to be overridden.
When in doubt consult https://developer.apple.com/documentation/foundation/nsoperation#1661262?language=objc for discrepancies with your implementation

How do you solve a problem that is unreproducible and random, and where changes are not immediately testable?

Thought I would throw this one out there and see what other people's experiences have been like.
I'm experiencing an issue with a system at work where it stops processing jobs in a queue and 'jams' so to speak. Once the services are restarted the software processes the queue and everything returns to normal.
In my experience so far, I cannot for the life of me figure out what is causing these stoppages. That, and I cannot reproduce the stoppage myself. The queue fails at all different intervals, sometimes running for a month straight, other times failing as close together as twice in 1 day. I have since involved two different vendors and various colleagues within the department and everyone is stumped, and has been for several months.
Since I started, we've isolated the processing to a single server and cranked up the logging, which we've sent to the vendors. Neither has any idea what the problem is.
We've updated a few settings here and there and upgraded client and server pieces, but we have no idea if the things we are doing are contributing to an overall solution.
So I have a problem that appears to be unreproducible, random and untestable.
Has anyone been involved with any similar situations?
What are some of the ways to solve a situation like this?
Any shared input or experiences would be great.
Cheers,
EDIT: Cranked up the logging, updated all of the components to the latest version, and made sure proper anti-virus exclusions were in place, and so far it has not jammed in over a month!
Use a logging framework that can be turned on in production. You might have too much logging initially, but it should help narrow down the problem; as you get closer, you can narrow the scope of the logging and at the same time increase the verbosity (is that a word?) of the remaining log statements.
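If the system is one you can add code to (the next answer assumes it might be Java, so this sketch does too, and the class is hypothetical), the pattern looks roughly like this with plain java.util.logging:

import java.util.logging.Level;
import java.util.logging.Logger;

public class QueueProcessor {

    // Hypothetical processor for the jammed queue. The logger stays quiet (INFO) in
    // normal production use and can be dialed up to FINE while hunting the stoppage,
    // without redeploying (the attached handlers must allow FINE as well).
    private static final Logger LOG = Logger.getLogger(QueueProcessor.class.getName());

    public static void setVerbose(boolean verbose) {
        LOG.setLevel(verbose ? Level.FINE : Level.INFO);
    }

    void process(Object job) {
        LOG.fine(() -> "Starting job " + job); // Supplier form: cheap when FINE is off
        // ... actual job processing ...
        LOG.fine(() -> "Finished job " + job);
    }
}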
In addition to the logging pointed out by Kelly, there is the possibility of a deadlock taking place, since things seem to stop. One option, if this is a Java application, is to use jconsole and connect to the JVM instance. jconsole has a detect-deadlock option which can provide very valuable information when the hangup occurs.
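If attaching jconsole at the right moment is difficult (the queue may only jam once a month), the same deadlock check can also be done from inside the application with the standard ThreadMXBean API and logged automatically. A minimal sketch:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockWatchdog {

    // Call this periodically (e.g. from a scheduled task) so that a deadlock gets
    // recorded even if nobody is watching jconsole when the queue jams.
    public static void logDeadlocksIfAny() {
        ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();
        long[] deadlockedIds = threadMXBean.findDeadlockedThreads(); // null if there is no deadlock
        if (deadlockedIds == null) {
            return;
        }
        for (ThreadInfo info : threadMXBean.getThreadInfo(deadlockedIds)) {
            System.err.println("Deadlocked thread: " + info.getThreadName()
                    + " waiting on " + info.getLockName()
                    + " held by " + info.getLockOwnerName());
        }
    }
}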
If this is not a Java application but perhaps a .NET application, you could make use of this technique.