How to prevent SCIP from doing restarts while solving MIPs - scip

I am trying to solve a MIP and I noticed that SCIP sometimes restarts the solve, but the restart accomplishes nothing. It spends a lot of time trying to find cuts and doesn't find any.
The following is the message that comes before the restart is triggered. Is there a way to disable restarting?
Restart triggered after 50 consecutive estimations that the remaining tree will be large
(run 1, node 2309) performing user restart

You can disable restarts by setting
estimation/restarts/restartpolicy = n
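
As a minimal sketch of one way to apply this, the snippet below writes the parameter into a SCIP settings file using the `name = value` line format. The helper name `write_scip_settings` and the file name `norestart.set` are hypothetical, chosen only for illustration; only the parameter name and value come from the answer above.

```python
import tempfile
from pathlib import Path


def write_scip_settings(path, params):
    """Write one 'name = value' line per parameter and return the text written."""
    text = "".join(f"{name} = {value}\n" for name, value in params.items())
    Path(path).write_text(text)
    return text


# Hypothetical file name and location; the parameter is the one from the answer.
settings_path = Path(tempfile.mkdtemp()) / "norestart.set"
settings_text = write_scip_settings(
    settings_path,
    {"estimation/restarts/restartpolicy": "n"},
)
print(settings_text, end="")
```

Assuming the standard SCIP command-line binary, a file like this can then be loaded with the `-s` option when starting the solver, or the same setting can be entered in the interactive shell as `set estimation restarts restartpolicy n`.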

Related

How to fix stalled archival process with ServicePulse?

In my environment I am running ServicePulse v1.14.4 and ServiceControl v3.3.0. I periodically go in to check for error messages, and there are in excess of 280K of them. So I start the archive process and hope they get archived. After 4 days the process is still going, or seems to be hung at 66%.
My question: is there any way to get this back on track without having to restart the services?

Google Compute Engine VM constantly crashes

On the Compute Engine VM in us-west-1b, I run 16 vCPUs near 99% usage. After a few hours, the VM automatically crashes. This is not a one-time incident, and I have to manually restart the VM.
There are a few instances of CPU usage suddenly dropping to around 30%, then bouncing back to 99%.
There are no logs for the VM at the time of the crash. Is there any other way to get the error logs?
How do I prevent VMs from crashing?
CPU usage graph
This could be your process manager saying that your processes are out of resources. You might wanna look into Kernel tuning where you can increase the limits on the number of active processes on your VM/OS and their resources. Or you can try using a bigger machine with more physical resources. In short, your machine is falling short on resources and hence in order to keep the OS up, process manager shuts down the processes. SSH is one of those processes. Once you reset the machine, all comes back to normal.
How the process manager/kernel decides to kill a process varies. It could simply be that a process has stayed up long enough to consume too many resources. Also, note that the OS images you use to create a VM on GCP are custom hardened by Google to limit the malicious capabilities of processes running on such machines.
One of the best ways to tackle this is:
1) Increase the resources of your VM.
2) Then go back to the code and find out whether something is leaking processes or memory.
3) If all else fails, you might wanna do some kernel tuning to make sure your processes have higher priority than other system processes. This is a bad idea, though, since you could end up creating a zombie VM.

Akka.net Cluster Debugging

The title is a bit misleading, so let me explain further.
I have a non thread-safe dll I have no choice but to use as part of my back end
servers. I can't use it directly in my servers as the thread issues it has cause
it to crash. So, I created an akka.net cluster of N nodes each which hosts a single
actor. All of my API calls that were originally to that bad dll are now routed through
messages to these nodes through a round-robin group. As each node only has a single,
single-threaded actor, I get safe access, but as I have N of them running I get parallelism, of a sort.
In production, I have things configured with auto-down = false and default timings on heartbeats
and so on. This works perfectly. I can fire up new nodes as needed, they get added to the group,
I can remove them with Cluster.Leave and that is happy as well.
My issue is with debugging. In our development environment we keep a cluster of 20 nodes each
exposing a single actor as described above that wraps this dll. We also have a set of nodes that act as
seed nodes and do nothing else.
When our application is run it joins the cluster. This allows it to direct requests through the round-robin
router to the nodes we keep up in our cluster. When doing development and testing and debugging the app, if I configure things to use auto-down = false
we end up with problems whenever a test run crashes or we stop the application without going through
proper cluster leaving logic, such as when we terminate the app with the stop button in the debugger.
Without auto-down, this leaves us with a missing member of the cluster that causes the leader to disallow
additions to the cluster. This means that the next time I run the app to debug, I can't join the cluster and am
stuck.
It seems that I have to have auto-down set to get debugging to work. If it is set, then when I crash my app
the node is removed from the cluster 5 seconds later. When I next fire up my
app, the cluster is back in a happy state and I can join just fine.
The problem with this is that if I am debugging the application and pause it for any amount of time, it is almost immediately
seen as unreachable and then 5 seconds later is thrown out of the cluster. Basically, I can't debug with these settings.
So, I set failure-detector.acceptable-heartbeat-pause = 600s to give me more time to pause the app
while debugging. I will get shut down in 10 min, but I don't often sit in the debugger for that long, so it's an acceptable
trade-off. The issue with this is of course that when I crash the app, or stop it in the debugger, the cluster thinks it
exists for the next 10 minutes. No one tries to talk to these nodes directly, so in theory that isn't a huge issue, but I keep
running into cases where the test I just ran got itself elected as role leader. So the role leader is now dead, but the cluster
doesn't know it yet. This seems to prevent me from joining anything new to the cluster until my 10 min are up. When I try to leave
the cluster nicely, my dead node gets stuck at the exiting state and doesn't get removed for 10 minutes. And I don't always get
notified of the removal either, forcing me to set a timeout on leaving that will cause it to give up.
There doesn't seem to be any way to say "never let me be the leader". When I have run the app with no role set for the cluster
it seems to often get itself elected as the cluster leader causing the same problem
as when the role leader is dead but unknown to be so, but at a larger level.
So, I don't really see any way around this, but maybe someone has some tricks to pull this off. I want to be able to debug
my cluster member without it being thrown out of the cluster, but I also don't want the cluster to think that leader nodes
are around when they aren't, preventing me from rejoining during my next attempt.
Any ideas?

Spring Batch restart crashed jobs

Hi spring batch users,
regarding the documentation http://docs.spring.io/spring-batch/reference/htmlsingle/#d5e1320
"If the process died ("kill -9" or server failure) the job is, of course, not running, but the JobRepository has no way of knowing because no-one told it before the process died."
I try to find and restart the stale job executions by using
Set<JobExecution> jobExecutions = jobExplorer.findRunningJobExecutions(jobName);
...
jobExecution.setStatus(FAILED);
jobExecution.setEndTime(new Date());
jobRepository.update(jobExecution);
jobOperator.restart(jobExecution.getId());
But this seems to be very inconvenient.
1) I have to do this before other (new) jobs could be started.
2) I have to handle multiple instances of running servers so findRunningJobExecutions will not do the trick.
You can find other questions regarding this topic:
https://jira.spring.io/browse/BATCH-2433?jql=project%20%3D%20BATCH%20AND%20status%20%3D%20Open%20ORDER%20BY%20priority%20DESC
Spring Batch after JVM crash
I would love to see a solution to register a "start up clean jobs listener". This will still not fix the problems caused by the multi-server environment, because Spring Batch does not know whether a JobExecution marked as STARTED is still running on another instance.
Thanks for any advice
Alex
Your job cannot and should not recover "automatically" from a kill -9 scenario. A kill -9 is treated very differently than your application throwing a caught Exception. The reason for this is that you've effectively pulled the carpet out from under the application without giving it a chance to reach a synchronization point with the database to commit any necessary information to the ExecutionContext or update the job/step status(es). Therefore, the last status touchpoint with the database will remain and the job will still look STARTED.
"OK, fine" you say, "but if I start another execution, I want it to find that STARTED execution, and pick up where it left off." The problem here is that there is no clean way for the application to distinguish a job that is ACTUALLY RUNNING from one that has failed but couldn't update the database. The framework here correctly errs on the side of caution and prevents you from starting a job that already appears to be running, and this is a GOOD thing.
Why? Because let's assume your job was actually still running and you restarted by accident. As coded, the framework will start to spin up, see your running execution and fail with the following message: "A job execution for this job is already running". I can't tell you how many times we've been saved by this because someone accidentally launched a job twice!
If you were to implement the listener you suggest, the 2nd execution would instead be allowed to start and you'd have 2 different JVMs repeating the same work, possibly writing to the same files/tables and causing a huge data mess that could be impossible to clean up.
Trust me, in the event the Linux terminal kills your job or your job dies because the connection to the database has been severed, you WANT human eyes on those execution states before you attempt a restart.
Finally, on the off chance you actually wanted to kill your job, you can leverage several other standard patterns for stopping jobs:
Stop via throw Exception
Stop via JobOperator.stop()

Tracking Chrome and its many processes

I'm trying to keep an eye on how long an application runs. To do this, I capture every process's ID as it starts, and when that process is shut down, I log the time. However, Google's Chrome starts and stops like 6 processes when you start it up and shut it down, meaning each execution of Chrome gets logged multiple times.
Is there a better way to track the execution of an application than by process ID? Or is there, perhaps, a technique for getting around this particular problem? I'd considered not adding a process ID if a process with the same ID was added within a second or so, but that seems exploitable.
Any ideas?
I am not 100% sure, but I would assume that one process in Chrome must be the parent. Try eliminating processes from your list if their parent (PPID) is the same (and not init = PID 1).
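
The parent-PID idea can be sketched roughly as follows. This is an illustrative variation, not a tested solution for Chrome specifically: it assumes you can get a snapshot of (pid, ppid, name) records from some process-listing source, and it keeps only processes whose parent is not another process with the same name. The `Proc` type, the `root_processes` helper, and the sample data are all hypothetical.

```python
from typing import Dict, List, NamedTuple


class Proc(NamedTuple):
    pid: int
    ppid: int
    name: str


def root_processes(procs: List[Proc]) -> List[Proc]:
    """Keep processes whose parent is missing or has a different name."""
    by_pid: Dict[int, Proc] = {p.pid: p for p in procs}
    roots = []
    for p in procs:
        parent = by_pid.get(p.ppid)
        # A child spawned by a same-named parent is treated as part of
        # the same application run and is not logged separately.
        if parent is None or parent.name != p.name:
            roots.append(p)
    return roots


# Hypothetical snapshot: one browser run with three helper processes.
snapshot = [
    Proc(1, 0, "init"),
    Proc(100, 1, "chrome"),
    Proc(101, 100, "chrome"),
    Proc(102, 100, "chrome"),
    Proc(103, 101, "chrome"),
    Proc(200, 1, "editor"),
]
print([p.pid for p in root_processes(snapshot)])  # prints [1, 100, 200]
```

With this filter, only PID 100 would be timed for the browser run. On a real system the records would have to come from the OS or a library that exposes pid, parent pid, and name; comparing names is a heuristic and would also merge unrelated same-named programs that launch each other.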
I ended up just checking if I was adding a duplicate. Not very efficient, but easy and effective. It will serve for now.