TensorFlow very slow on second run (Ubuntu) - tensorflow

I'm having a problem with TensorFlow (CPU) on Ubuntu 14.04 (VM, droplet), where running a script is fast the first time, but when running the same (or another) script directly after completion of the first run, things become very slow.
I'm talking minutes instead of seconds. Even simple test scripts (like those provided in the tutorial) take forever, with no visible CPU load.
For comparison: first run of the test script from the tutorial gives:
real    0m0.790s
user    0m0.688s
sys     0m0.111s
Second run of the same script, directly after completion of the first run gives:
real    2m46.628s
user    0m0.783s
sys     0m0.104s
Eventually, things seem to clear up and performance is back (only for one run though).
I narrowed the problem down to this:
sess = tf.Session()
takes a very long time. Apparently resources used by a previous Session are not properly released [?]. My scripts use the context manager, like:
with tf.Session() as sess:
    sess.run(...)
My latest hypothesis is that this has to do with system properties (virtual machine settings, hypervisor issues interacting with the context manager of TF). Using the docker container of TF makes no difference. Rebooting didn't help either. The same scripts run OK on OS X.

To make sure it's obvious what happened and that this question is answered: this occurred because TensorFlow was reading from /dev/random instead of /dev/urandom. On some systems, /dev/random can exhaust its supply of randomness and block until more is available, causing the slowdown. This has now been fixed on GitHub; the fixes are included in release 0.6.0 and later.
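If you want to verify whether your own VM is prone to this, one quick diagnostic (a minimal sketch, assuming a Linux guest; the 256-bit cutoff is just an illustrative threshold) is to read the kernel's entropy estimate:

# Check how much entropy the kernel's /dev/random pool currently has.
# Values near zero mean a read from /dev/random may block, which matches
# the multi-minute stalls described above.
with open("/proc/sys/kernel/random/entropy_avail") as f:
    entropy_bits = int(f.read().strip())

print("entropy_avail:", entropy_bits, "bits")
if entropy_bits < 256:
    print("Low entropy - a process reading /dev/random could block here.")

Headless VMs and droplets tend to gather entropy slowly, which is consistent with the first run being fast and the next run stalling until the pool refills.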

Related

Server start much slower in JUnit tests than when starting application

We have JUnit tests that boot up the entire server and then test some functions in the booted server. However, this takes a lot longer than simply booting the server regularly through its main function, outside of a JUnit test. Why could that be?
More concretely, we're using the dropwizard framework with a jetty server. The logs I get after starting within a unit test are
INFO [2022-09-19 15:12:45,985] org.eclipse.jetty.server.Server: Started #60649ms
and the logs I get with a regular application start are
INFO [2022-09-19 15:15:06,887] org.eclipse.jetty.server.Server: Started #13093ms
As you can see, it's around 6 times faster outside of JUnit.
Is there any reason for that? The server spawned in JUnit isn't 100% identical to the other one, but it comes pretty close. I don't see any reason why it should be that much slower. Is there any "known" reason why this is happening, or is it more likely that it's something about how we spawn the server in our testing environment?
When trying to find the bottleneck, I couldn't identify one single place that runs slower. It seems that everything is just running slower in the tests.
Edit:
Additional information: I'm running on an M1 Mac. My previous results, where the tests were 6 times slower than the server start, were with brew install --cask adoptopenjdk11. However, when I switched to brew install --cask zulu-jdk11, the disparity became much smaller: 18 seconds for test runs, 12 seconds for server starts. It doesn't make the mystery smaller, but it makes me a bit happier.

Google cloud GPU machines reboot abruptly

When training a model on a GPU machine, the training gets interrupted due to some system patch process. Since Google Cloud GPU machines do not have an option for live migration, it is a painful task to restart the training every time this happens. Google has clearly stated that there is no way around this but to restart the machines in this doc.
Is there a clever way to detect if the machine has been rebooted and resume the training automatically?
Sometimes it also happens that, due to some kernel update, the CUDA drivers stop working, the GPU is not visible, and the CUDA drivers need a re-installation. So writing a startup script to resume the training is also not a bulletproof solution.
Yes, there is. If you use TensorFlow, you can use its checkpointing feature to save your progress and pick up where you left off.
One great example of this is provided here: https://github.com/GoogleCloudPlatform/ml-on-gcp/blob/master/gce/survival-training/README-tf-estimator.md
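As a rough illustration (not the exact code from that guide; the model, optimizer, and checkpoint directory below are placeholders), periodic checkpointing with tf.train.Checkpoint and tf.train.CheckpointManager looks roughly like this in TensorFlow 2.x:

import tensorflow as tf

# Placeholder model and optimizer; substitute your own training setup.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam()

# Track everything needed to resume, and keep the last few checkpoints in a
# location that survives a reboot (e.g. a persistent disk or a GCS bucket).
ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer, step=tf.Variable(0))
manager = tf.train.CheckpointManager(ckpt, directory="/mnt/persistent/ckpts", max_to_keep=3)

# On (re)start, restore the latest checkpoint if one exists; otherwise start fresh.
ckpt.restore(manager.latest_checkpoint)

for _ in range(1000):
    # ... run one training step here ...
    ckpt.step.assign_add(1)
    if int(ckpt.step) % 100 == 0:
        manager.save()  # after an abrupt reboot, training resumes from here

Combined with a startup script (or systemd unit) that relaunches the training command, the job picks up from the last saved step after each forced reboot.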

Tensorflow doesn't see GPU and uses CPU instead, how come?

I'm running a Python script using GPU-enabled TensorFlow. However, the program doesn't seem to recognize any GPU and starts using the CPU straight away. What could be the cause of this?
Just want to add to the discussion that TensorFlow may stop seeing a GPU due to a CUDA initialization failure. In other words, TensorFlow detects the GPU but can't dispatch any op onto it, so it falls back to the CPU. In this case, you should see an error in the log like this:
E tensorflow/stream_executor/cuda/cuda_driver.cc:481] failed call to cuInit: CUDA_ERROR_UNKNOWN
The cause is likely a conflict between different processes using the GPU simultaneously. When that is the case, the most reliable way I've found to get TensorFlow working again is to restart the machine. In the worst case, reinstall TensorFlow and/or the NVIDIA driver.
See also one more case when GPU suddenly stops working.
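A quick way to confirm what TensorFlow actually sees (a small sketch; which call applies depends on your TensorFlow version) is to list the visible devices:

import tensorflow as tf

# TensorFlow 2.x: list the physical GPUs TensorFlow can see and initialize.
print(tf.config.list_physical_devices("GPU"))

# TensorFlow 1.x equivalent:
# from tensorflow.python.client import device_lib
# print(device_lib.list_local_devices())

An empty list, together with a cuInit error like the one above, points at the driver/CUDA level rather than at your script.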

Virtualize a CPU without AES-NI

I have an application compiled with AES-NI support, but it supposedly selects the implementation at runtime based on cpuid. I want to test whether it really functions correctly on an old CPU without these dedicated instructions. VirtualBox cannot help because the virtualized CPU is the same as the host's. How can I do such a test without having access to an old CPU?
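Whatever virtualization route you take, it helps to verify which instruction-set flags the guest CPU actually advertises, since that is what the runtime cpuid dispatch will key on. A minimal check, assuming a Linux guest:

# Read the CPU feature flags the OS reports and look for AES-NI ("aes").
# If the flag is absent, the application's runtime dispatch should fall
# back to the software implementation.
with open("/proc/cpuinfo") as f:
    flags = set()
    for line in f:
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
            break

print("AES-NI exposed to guest" if "aes" in flags else "AES-NI not exposed to guest")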

How to ensure the stability of a pc?

I need to run an intensive CPU task that maxes all cores of my CPU to 100%.
After a few days of running this task, I find that the machine becomes unresponsive and I am no longer able to SSH into it. I then have to restart the machine and begin the task again. This task could take several weeks or even months to compute.
I'd like to find a way to run this task to completion.
I've tried running the task on Debian 8 and Ubuntu Server LTS. Both of these operating systems exhibited the same problem. I thought about running the task inside a virtual machine and using cron to snapshot it every hour, but this seems quite extreme and would incur overhead.
Why is it that my machine is unstable?
Could it be due to power fluctuations?
Should I try under-clocking the CPU?
Thanks