How to ensure the stability of a PC? - crash

I need to run an intensive CPU task that maxes out all cores of my CPU at 100%.
After a few days of running this task, I find that the machine becomes unresponsive and I am no longer able to SSH into it. I then have to restart the machine and begin the task again. This task could take several weeks or even months to compute.
I'd like to find a way to run this task to completion.
I've tried running the task on Debian 8 and Ubuntu Server LTS. Both of these operating systems exhibited the same problem. I thought about running the task inside a virtual machine and using cron to snapshot it every hour, but this seems quite extreme and would add overhead.
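I've also been wondering whether the task could checkpoint its own state periodically instead, so a crash only costs the work done since the last save. A rough sketch of what I mean, assuming (purely for illustration) that the workload were in Python and its state could be pickled; every name below is a placeholder:

import os
import pickle
import time

CHECKPOINT = "task_state.pkl"  # hypothetical checkpoint file
SAVE_EVERY = 3600              # seconds between checkpoints (one hour)

def load_state():
    # Resume from the last checkpoint if one exists, otherwise start fresh.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"iteration": 0}    # placeholder initial state

def save_state(state):
    # Write to a temporary file and rename, so a crash mid-write
    # cannot corrupt the previous checkpoint.
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT)

def do_one_unit_of_work(state):
    state["iteration"] += 1    # stands in for the real computation

def run(total_units=10**9):
    state = load_state()
    last_save = time.monotonic()
    while state["iteration"] < total_units:
        do_one_unit_of_work(state)
        if time.monotonic() - last_save >= SAVE_EVERY:
            save_state(state)
            last_save = time.monotonic()
    save_state(state)

if __name__ == "__main__":
    run()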
Why is it that my machine is unstable?
Could it be due to power fluctuations?
Should I try under-clocking the CPU?
Thanks

Related

Server starts much slower in JUnit tests than when starting the application

We have JUnit tests that boot up the entire server and then test some functions in the booted server. However, this takes a lot longer than simply booting the server regularly through its main function, outside of a JUnit test. Why could that be?
More concretely, we're using the dropwizard framework with a jetty server. The logs I get after starting within a unit test are
INFO [2022-09-19 15:12:45,985] org.eclipse.jetty.server.Server: Started #60649ms
and the logs I get with a regular application start are
INFO [2022-09-19 15:15:06,887] org.eclipse.jetty.server.Server: Started #13093ms
As you can see, it's around 6 times faster outside of JUnit.
Is there any reason for that? The server spawned in JUnit isn't 100% identical to the other one, but it comes pretty close. I don't see any reason why it should be that much slower. Is there any "known" reason why this is happening, or is it more likely that it's something about how we spawn the server in our testing environment?
When trying to find the bottleneck, I couldn't identify one single place that runs slower. It seems that everything is just running slower in the tests.
Edit:
Additional information: I'm running on an M1 Mac. My previous results, where the tests were 6 times slower than the server start, were with brew install --cask adoptopenjdk11. However, when I switched to brew install --cask zulu-jdk11, the disparity became much smaller: 18 seconds for test runs, 12 seconds for server starts. It doesn't make the mystery smaller, but it makes me a bit happier.

How long is a gem5 build with "gem5 scons build/X86/gem5.opt -j9" expected to take on a virtual machine?

This is my first time working with gem5. According to gem5.org, the following build command should take about 15 minutes or so to complete: scons build/X86/gem5.opt -j9. But it's been more than an hour since the build started and it's not complete yet. Has anyone experienced the same issue? Is it normal? My machine has 8 cores and I've allocated 16 GB of memory to VMware, on which I am running the build. Could it be a hardware problem such as not enough memory?
So far I have started the build process from scratch a few times with the same results. I've also tried it on a different virtualization platform (VirtualBox), but it's taking the same amount of time to build.
Thanks!

Minishift is too slow to load

I have a boot2docker version of Minishift installed on my laptop. Since I am using Windows 10 Home edition, I am forced to use VirtualBox to run the Minishift OS. However, every time I have to load the OS, Minishift takes ages to boot up.
It takes almost 15 minutes to fully start the system. Plus I have to rsync my changes again as they are lost every time I stop the machine. Is there any solution to this?

TensorFlow very slow on second run (Ubuntu)

I'm having a problem with TensorFlow (CPU) on Ubuntu 14.04 (VM, droplet), where running a script is fast the first time, but when running the same (or another) script directly after completion of the first run, things become very slow.
I'm talking minutes instead of seconds. Even simple test scripts (like those provided in the tutorial) take forever, with no visible CPU load.
For comparison: first run of the test script from the tutorial gives:
real    0m0.790s
user    0m0.688s
sys     0m0.111s
Second run of the same script, directly after completion of the first run gives:
real    2m46.628s
user    0m0.783s
sys     0m0.104s
Eventually, things seem to clear up and performance is back (only for one run though).
I narrowed the problem down to this:
sess = tf.Session()
takes very long. Apparently resources used by a previous Session are not properly released [?]. My scripts use the context manager, like
with tf.Session() as sess:
    sess.run(...)
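For anyone trying to reproduce this, the slow step can be timed in isolation with something like the following (TF 1.x API, as above; the numbers will obviously vary from machine to machine):

import time
import tensorflow as tf

start = time.time()
sess = tf.Session()  # the step that intermittently takes minutes here
print("tf.Session() took %.1f s" % (time.time() - start))
sess.close()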
My latest hypothesis is that this has to do with system properties (virtual machine settings, hypervisor issues interacting with the context manager of TF). Using the Docker container of TF makes no difference. Rebooting didn't help either. The same scripts run OK on OS X.
To make sure it's obvious what happened and that this question is answered: this occurred because TensorFlow was reading from /dev/random instead of /dev/urandom. On some systems, /dev/random can exhaust its supply of randomness and block until more is available, causing the slowdown. This has now been fixed on GitHub; the fixes are included in release 0.6.0 and later.
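If you want to verify this on an affected machine, a quick check (Linux-specific) is the kernel's entropy estimate at the moment the session creation hangs; a very low value points at a drained /dev/random pool. A minimal sketch:

# Print the kernel's current entropy estimate (Linux only).
with open("/proc/sys/kernel/random/entropy_avail") as f:
    print("entropy_avail: %d" % int(f.read()))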

VMware Player VM - 1 core CPU limitation

I'm using a VM with VMware Player to write code and compile.
As my current program is huge, the compilation takes a while (up to 5 minutes),
using 25% of the 4-core CPU on my host, i.e., 100% of one core.
It seems that the VM is limited to using a single core.
Is there a way to optimize the number of cores a VM can use?
I'd like to use 50% or 75% of my 4-core CPU.
Thanks
It sounds like you're limited by the number of parallel build tasks you can run, not by the VM's CPU configuration: by default, make runs a single step at a time. Try running several steps in parallel, e.g., make -j4, or the equivalent for your build system.
On a separate note, a VM may add more overhead than you'd like; consider using Docker to host your development environment.