Stopping and restarting the solving process helps find better solutions - OptaPlanner

My colleague and I successfully implemented an exam scheduler based on OptaPlanner as our bachelor thesis.
However, we noticed that the solver gets stuck during the local search phase for hours, and stopping/restarting the solving process helps find new solutions a lot.
I think this behavior is due to how we handle new best solutions.
In fact, we only save better solutions, as illustrated in the example graphic below.
In the local search phase, we use "Step Counting Hill Climbing" with stepCountingHillClimbingSize = 400 and acceptedCountLimit = 1.
The dashed arrows indicate to which solution the new solution is compared.
When the solving process is stopped at step 7 and restarted, the solution from step 5 is loaded, whereas if the solving process is not interrupted, the solution from step 7 becomes the reference solution.
I think this process adds some randomness.
How can I achieve this behaviour with OptaPlanner?
Is this the same problem as described in
OptaPlanner immediately produces better solution after terminating and restarting the solver?
solverConfig.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<solver xmlns="https://www.optaplanner.org/xsd/solver">
  <!--TODO: enable for real production use - gives a small performance boost. See: https://docs.optaplanner.org/latest/optaplanner-docs/html_single/index.html#environmentModeProduction-->
  <!--<environmentMode>NON_REPRODUCIBLE</environmentMode>-->
  <moveThreadCount>AUTO</moveThreadCount>
  <solutionClass>ch.ost.examscheduler.solvers.opta.domain.ExamTimetable</solutionClass>
  <entityClass>ch.ost.examscheduler.solvers.opta.domain.Exam</entityClass>
  <scoreDirectorFactory>
    <constraintProviderClass>ch.ost.examscheduler.solvers.opta.solver.constraints.TimeTableConstraintProvider</constraintProviderClass>
    <constraintStreamImplType>DROOLS</constraintStreamImplType>
    <initializingScoreTrend>ONLY_DOWN/ONLY_DOWN</initializingScoreTrend>
  </scoreDirectorFactory>
  <constructionHeuristic>
    <constructionHeuristicType>FIRST_FIT_DECREASING</constructionHeuristicType>
  </constructionHeuristic>
  <localSearch>
    <unionMoveSelector>
      <cacheType>PHASE</cacheType>
      <changeMoveSelector>
        <filterClass>ch.ost.examscheduler.solvers.opta.solver.filters.AllRoomsUnassignedChangeMoveFilter</filterClass>
      </changeMoveSelector>
    </unionMoveSelector>
    <acceptor>
      <stepCountingHillClimbingSize>400</stepCountingHillClimbingSize>
    </acceptor>
    <forager>
      <acceptedCountLimit>1</acceptedCountLimit>
    </forager>
  </localSearch>
</solver>
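For reference, this is roughly what our manual stop/restart amounts to with the plain Solver API. It is only a sketch: it assumes some termination is configured (or that terminateEarly() is called from the UI, since the config above has none), loadProblem() is a placeholder for our own data loading, and the run count is arbitrary.
import org.optaplanner.core.api.solver.Solver;
import org.optaplanner.core.api.solver.SolverFactory;

import ch.ost.examscheduler.solvers.opta.domain.ExamTimetable;

public class RestartingSolverRunner {

    public static void main(String[] args) {
        SolverFactory<ExamTimetable> solverFactory =
                SolverFactory.createFromXmlResource("solverConfig.xml");

        ExamTimetable best = loadProblem(); // placeholder for our own data loading
        for (int run = 0; run < 2; run++) { // run count is arbitrary, just for illustration
            // Each run builds a fresh solver, so the step counting acceptor forgets its
            // reference solution and the next run restarts from the best solution so far.
            Solver<ExamTimetable> solver = solverFactory.buildSolver();
            best = solver.solve(best); // blocks until the configured termination (or terminateEarly)
        }
    }

    private static ExamTimetable loadProblem() {
        throw new UnsupportedOperationException("hypothetical loader");
    }
}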

Yeah, we've seen this in some other use cases too. The default local search type, Late Acceptance, can benefit from a restart in some use cases. Such a restart is called reheating. We have an issue open to implement reheating.
It looks bad, but usually the gain of the reheat is a rounding error, and the result is still much better than what you'd find elsewhere. But especially on small datasets, where other solvers might do better, it's a pain.
Meanwhile, a workaround (a manual reheat) is to configure the localSearch phase twice and put an unimproved time termination of 30 seconds or so on the first localSearch phase. But what if it gets stuck twice?
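For what it's worth, a sketch of that manual reheat in the question's solverConfig.xml, assuming the 30-second figure from above; the acceptor and forager settings are copied from the original config and the move selectors are omitted for brevity:
<localSearch>
  <termination>
    <unimprovedSecondsSpentLimit>30</unimprovedSecondsSpentLimit>
  </termination>
  <!-- same unionMoveSelector as in the original config -->
  <acceptor>
    <stepCountingHillClimbingSize>400</stepCountingHillClimbingSize>
  </acceptor>
  <forager>
    <acceptedCountLimit>1</acceptedCountLimit>
  </forager>
</localSearch>
<localSearch>
  <!-- second, identical local search phase acts as the manual reheat;
       it runs until the solver-level termination (or terminateEarly) kicks in -->
  <acceptor>
    <stepCountingHillClimbingSize>400</stepCountingHillClimbingSize>
  </acceptor>
  <forager>
    <acceptedCountLimit>1</acceptedCountLimit>
  </forager>
</localSearch>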

Related

Infinispan-8 blocking state during iteration through cache

I am using Infinispan 8.2.11. During iteration through the cache using cache.entrySet().iterator(), the thread gets stuck and does not move. Here is the thread dump I collected:
"EJB default - 32" #586 prio=5 os_prio=0 tid=0x000055ce2f619000 nid=0x2853 runnable [0x00007f8780c7a000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000006efb93ba8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2163)
at org.infinispan.stream.impl.DistributedCacheStream$IteratorSupplier.get(DistributedCacheStream.java:754)
at org.infinispan.util.CloseableSuppliedIterator.getNext(CloseableSuppliedIterator.java:26)
at org.infinispan.util.CloseableSuppliedIterator.hasNext(CloseableSuppliedIterator.java:32)
at org.infinispan.stream.impl.RemovableIterator.getNextFromIterator(RemovableIterator.java:34)
at org.infinispan.stream.impl.RemovableIterator.hasNext(RemovableIterator.java:43)
at org.infinispan.commons.util.Closeables$IteratorAsCloseableIterator.hasNext(Closeables.java:93)
at org.infinispan.stream.impl.RemovableIterator.getNextFromIterator(RemovableIterator.java:34)
at org.infinispan.stream.impl.RemovableIterator.hasNext(RemovableIterator.java:43)
at org.infinispan.commons.util.IteratorMapper.hasNext(IteratorMapper.java:26)
I found an article in the JBoss Community Archive which describes a similar issue: https://developer.jboss.org/thread/271158. There was a fix delivered in Infinispan 9 which I believe resolves this problem: ISPN-9080
Is it possible to backport this fix into Infinispan 8? Unfortunately, I can't upgrade the version of Infinispan in my project.
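For context, the iteration boils down to the pattern below; the cache name and value type are placeholders, and the real code only differs in what it does with each entry:
import java.util.Iterator;
import java.util.Map;

import org.infinispan.Cache;
import org.infinispan.manager.EmbeddedCacheManager;

public class CacheScan {

    // cacheManager wiring and the cache name are placeholders for the real setup
    void scan(EmbeddedCacheManager cacheManager) {
        Cache<String, Object> cache = cacheManager.getCache("myCache");
        Iterator<Map.Entry<String, Object>> it = cache.entrySet().iterator();
        while (it.hasNext()) { // in our case the thread parks here, waiting for remote segments
            Map.Entry<String, Object> entry = it.next();
            // ... process entry ...
        }
    }
}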
Unfortunately, we do not maintain such older versions. The suggested approach is to update to a more recent version. If that is not possible, you could try patching the older version yourself, as the changes are available: https://github.com/infinispan/infinispan/pull/5924/files
Also note that this does not fix the actual issue, only a symptom of it. The actual issue is that the newest topology was not installed for some reason, but the original poster was not able to provide sufficient information to ascertain why.
In any case, I would recommend thinking about updating the Infinispan version.
There are many fixes you might miss, as version 11 is the current stable one, and you may run into issues that have already been fixed.
But the problem in your case is that something happened in your cluster and cluster topology updates were missed.
If you have a stable cluster, this problem will not happen.
So if you can find the cause of the unstable cluster (it could be intentional stopping/starting of nodes) and that is acceptable, you can prevent it.

How to specify simulatedAnnealingStartingTemperature using OptaPlanner benchmark blueprints

I'm attempting to use OptaPlanner benchmark blueprints as described at http://docs.jboss.org/optaplanner/release/6.4.0.Final/optaplanner-docs/html_single/index.html#benchmarkBlueprint.
When I use a solverBenchmarkBluePrintType of EVERY_CONSTRUCTION_HEURISTIC_TYPE_WITH_EVERY_LOCAL_SEARCH_TYPE, I get the following error:
The acceptorType (SIMULATED_ANNEALING) currently requires a simulatedAnnealingStartingTemperature (null).
So I tried adding the following to my benchmark.xml file in the inheritedSolverBenchmark section:
<localSearch>
<acceptor>
<simulatedAnnealingStartingTemperature>0hard/500soft</simulatedAnnealingStartingTemperature>
</acceptor>
</localSearch>
And I get this error:
The exception of the firstFailureSingleBenchmarkRunner
(solution_FIRST_FIT-HILL_CLIMBING_0) is chained. /
java.lang.IllegalStateException Local Search phase (0) needs to start
from an initialized solution, but the planning variable
(HatchEventOrderAllocator3bOrder.assignedHatchEvent) is uninitialized
for the entity com.mm.server.inventory.app.HatchEventOrderAllocator3bOrder#7216ab0f).
Initialize the solution by configuring a Construction Heuristic phase before this phase.
It seems that including the localSearch section in the inheritedSolverBenchmark has side effects that I didn't intend.
How can I pass the simulatedAnnealingStartingTemperature to every iteration of the Simulated Annealing algorithm when using the blueprint?
It's a bug (jira linked). I haven't found a way yet to determine a somewhat reasonable SA default temperature (it's too use-case specific).
Workaround: just ignore the results of SA in the blueprint's benchmark report.
Possible fixes for the jira: remove SA from the blueprint, or use some arbitrary default temperature like 10hard/1000soft. The latter might give people the wrong impression that SA is useless for their case, but the former might not get them to give SA a chance at all...
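As an untested sketch of a further workaround (my assumption, not something stated in the answer above): define the SA benchmark explicitly instead of via the blueprint, so the starting temperature sits next to the SA acceptor rather than in inheritedSolverBenchmark:
<plannerBenchmark>
  <inheritedSolverBenchmark>
    <!-- common problem and solver settings, without any <localSearch> element -->
  </inheritedSolverBenchmark>
  <solverBenchmark>
    <name>Simulated Annealing</name>
    <solver>
      <constructionHeuristic>
        <constructionHeuristicType>FIRST_FIT</constructionHeuristicType>
      </constructionHeuristic>
      <localSearch>
        <acceptor>
          <simulatedAnnealingStartingTemperature>0hard/500soft</simulatedAnnealingStartingTemperature>
        </acceptor>
      </localSearch>
    </solver>
  </solverBenchmark>
</plannerBenchmark>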

Labview Program changes behavior after looking at (not changing) the Block Diagram

My LabVIEW program works like a charm until I look at the block diagram. No changes are made, I do not save, just Ctrl+E and then Ctrl+R.
After that it does not work properly. Only a restart of LabVIEW fixes the problem.
My program controls two scanner arrays for laser cutting simultaneously. To force parallel operation, I use the error handler and loops that wait for a signal from the scanner. But suddenly some loops run more often than they should.
What major thing happens in LabVIEW when I open the block diagram that could mess with my code?
Edit:
It's hard to tell what is happening without violating my non-disclosure agreement.
I'm controlling two independent mirror arrays for laser cutting. While one is running one cutting job, the other is supposed to run the other jobs, just very fast. When the first is finished, they meet at the same position and run the same geometry at the same slow speed. The jobs are provided as *.XML and stored as .NET objects. The device only runs the most recent job and overwrites it when it receives a new one.
I can check whether a job is still running. While this is true, I run a while loop for the other jobs. Now this loop runs a few times too often and even ignores WAIT blocks to a degree. It also skips the part where it reads the XML job file, changes the speed setting back to fast and saves it. It only runs fast once.
@Joe: No, it does not. It only runs well once; afterwards it does not.
Youtube links
The way it is supposed to move
The wrong way
There is exactly one thing I can think of that changes solely by opening the block diagram.
When the block diagram opens, any commented-out or unreachable-code-compiler-eliminated sections of code will load their subVIs. If one of those commented-out sections of code were to somehow interfere with your running code, you might have an issue.
There are only two ways I know of for that to interfere... both of them are fairly improbable.
a) You have some sort of "check for all VIs in memory" or "check for all types in memory" that you're using as a plug-in system. When the commented-out sections load, that would change the VIs in memory. Such systems are not uncommon when parsing XML, so maybe.
b) You are using Run VI method for some dynamically invoked VI to execute as a top-level VI, but by loading the diagram, it discovers that it is a subVI of your current program. A VI cannot simultaneously be top-level and a subVI, so the call to Run VI returns an error.
That's it. I can't think of anything else. Both ideas seem unlikely, but given your claim and a lack of a block diagram, I figured I'd post it as a hypothesis.
In the improbable case that someone has a similar problem: the issue was an XML file that was read at run time. Sometimes multiple instances tried to access it, and this produced the error.
Quick point to check: are debugging and "retain data in wires" disabled? While it may not change the computations, it may certainly change the timing of very tight loops, and that was one of the unexpected program behaviors the OP was referring to.

EdgeNGram: Error instantiating class: 'org.apache.lucene.analysis.ngram.EdgeNGramFilterFactory'

I've set up Solr and so far everything's working just dandy, but now I wanted to add EdgeNGram functionality to my searches. However, as soon as I throw it into my schema.xml, it starts throwing this error:
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
Could not load conf for core collection1: Plugin init failure for [schema.xml]
fieldType "text_en_splitting": Plugin init failure for [schema.xml] analyzer/filter:
Error instantiating class: 'org.apache.lucene.analysis.ngram.EdgeNGramFilterFactory'.
Schema file is /opt/solr/server/solr/collection1/conf/schema.xml
The relevant schema part looks like this:
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
The configuration of the rest of the fieldType is fine, as I've tested it rather extensively. It's just adding this line that throws the error.
Now, I've done some looking around, and usually these errors mean that a .jar is missing (at least according to the two other questions posted on here, though they don't relate to NGram specifically). So I went ahead and dug up lucene-analyzers-common.jar and explicitly added it in my solrconfig.xml, like so:
<lib dir="${solr.install.dir:../../../..}/dist/" regex="lucene-analyzers-common-\d.*\.jar" />
No luck. I know the path is fine, though; I included the mysql_connector the same way. Anyway, I grew pretty tired of this error, so I went ahead and included every single .jar that I could dig up:
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-.*\.jar" />
Nope.jpg
Of course, all of this was accompanied by many bin/solr stop -all commands and restarts, all of them still serving me that pretty red banner in the Solr Admin. I'm on Solr 5.0.0.
Help?
Well, this is awkward, but it's a typical thing to happen to me. Not 5 minutes after I posted this question, I made the error go away (note that I say "made the error go away", not "solved it", because I haven't tested it completely yet).
Anyway, in my filter tag, you see that side="front" attribute? Yeah, bad idea apparently. It's odd though, because I found it in the Apache docs: https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory. I didn't make it up or anything.
Can anyone explain that?
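For completeness, the filter line that stopped throwing for me is just the original one without the side attribute; as far as I can tell, the side parameter was dropped from EdgeNGramFilterFactory in newer Lucene/Solr versions (only front grams are supported now), so the wiki snippet is outdated:
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>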

Debugging Jackrabbit Lucene re-index abort/failure

I'm trying to rebuild the Lucene search index on a Jackrabbit 2.0 instance (actually a Day CRX 2.1 instance) so that I can apply new property boost weights for relevancy scoring. However, it repeatably aborts the indexing at the same point, count 3173000:
*INFO * MultiIndex: indexing... /content/xxxxxx/jcr:content (3173000) (MultiIndex.java, line 1209)
*INFO * RepositoryImpl: Shutting down repository... (RepositoryImpl.java, line 1139)
(company names redacted) leaving the CRX web instance showing
java.lang.IllegalStateException: The repository is not available.
There's no indication in the logs of why it's shutting down. There are no further lines between those two at any higher trace level. The path mentioned exists and is unremarkable. Jackrabbit logs the path every 100 nodes, so it could be any of the next 100 nodes that causes the failure.
Any idea what could possibly have gone wrong, or how I can debug this?
(This, unfortunately, is one of those I'm-out-of-my-depth questions - I can't tell you much more because I don't know where to look.)
Thanks for everyone's suggestions in the comments. The problem was that we had some content with bad HTML: specifically an <li>, closed or not, inside a <select><option>:
<html><body><form>
<select>
<option value="1"><li></option>
</select>
</form></body></html>
This kills javax.swing.text.html.parser.Parser with a StackOverflowError, which is an Error rather than an Exception and so is not caught by the error handling in Jackrabbit's MultiIndex.
I've reported the Parser crash to Oracle and I'll propose a patch to Jackrabbit core that adds extra try/catch blocks around the indexing code to at least log the exact node with the problem and, where possible, recover from the error and carry on indexing. In the case of a StackOverflowError I think this is recoverable: by the time we're back in the exception-handling code, the stack has been unwound to a sensible depth.
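Roughly, the kind of guard meant here looks like this; class, method, and type names are illustrative, not the actual MultiIndex internals:
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch only: a guard of the kind described above, with hypothetical names.
class SafeNodeIndexer {

    private static final Logger log = LoggerFactory.getLogger(SafeNodeIndexer.class);

    interface NodeIndexer {
        void indexNode(String nodeId) throws Exception; // stands in for the real per-node call
    }

    void indexSafely(NodeIndexer indexer, String nodeId) {
        try {
            indexer.indexNode(nodeId);
        } catch (StackOverflowError | Exception e) {
            // StackOverflowError is an Error, so existing catch (Exception) blocks never
            // see it. By the time control reaches this handler the stack has unwound,
            // so we can log the offending node and carry on with the next one.
            log.warn("Skipping node {} after indexing failure", nodeId, e);
        }
    }
}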
In practice I'm not going to be allowed to run a modified Jackrabbit in production here, but at least I've identified and fixed the bad content, so the same problem won't bite us there.