troubles caused by tensorflow image's LD_LIBRARY_PATH - gpu

I've installed DC/OS v1.8.4. The destination node has GPU resources and the NVIDIA driver is installed. I tried to deploy TensorFlow in a Mesos container, but it failed. The only error message in Mesos's stderr is:
mesos-containerizer: error while loading shared libraries: libmesos-1.0.1.so: cannot open shared object file: No such file or directory
But I can successfully deploy other services, such as nginx and WordPress (also in Mesos containers).
The problem may be caused by the TensorFlow image: its parent CUDA image resets LD_LIBRARY_PATH:
ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:${PATH}
ENV LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64
In OpenDCOS, before the mesos-agent starts, it sets the executor's LD_LIBRARY_PATH environment variable to "/opt/mesosphere/lib" so that the executor can locate the shared objects it needs. In the case above, however, LD_LIBRARY_PATH is reset by the TensorFlow image, so the executor fails to start.
Does anyone know how OpenDCOS handles this problem? Do I have to modify these public CUDA images?

GPUs are only officially supported in DC/OS 1.9+.
For (unsupported) instructions on getting GPUs to work in 1.8, please see my answer to this question on the DC/OS mailing list:
https://groups.google.com/a/dcos.io/d/msg/users/HEgcUfRRqzk/inIBmapMCQAJ
Additionally, there is a known issue with setting LD_LIBRARY_PATH in your container image on pre-1.9 clusters (though it usually manifests as a missing libssl.so library).
In your case, the CUDA container is setting LD_LIBRARY_PATH, which overrides the LD_LIBRARY_PATH setting that DC/OS relies on to find its library files. This is obviously a bug in DC/OS and has since been fixed in 1.9. The best (unsupported) workaround for this is to run
sudo ldconfig /opt/mesosphere/lib
on all of your nodes to put /opt/mesosphere/lib into the default library path. You will have to redo this on every reboot or, alternatively, add /opt/mesosphere/lib to a file under /etc/ld.so.conf.d/ to make it durable (maybe /etc/ld.so.conf.d/dcos.conf?).
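The durable variant of the workaround could be sketched roughly as follows. This is only an illustration, not something DC/OS ships: the helper name ensure_ld_conf_entry is hypothetical, and the file name dcos.conf is just the tentative suggestion from the paragraph above.

```python
from pathlib import Path


def ensure_ld_conf_entry(conf_file, lib_dir):
    """Add lib_dir to an ld.so.conf.d-style file unless it is already listed.

    Returns True if the file was changed, False otherwise.
    (Hypothetical helper, written for illustration only.)
    """
    path = Path(conf_file)
    text = path.read_text() if path.exists() else ""
    if lib_dir in (line.strip() for line in text.splitlines()):
        return False
    # Keep existing entries and make sure the new one starts on its own line.
    if text and not text.endswith("\n"):
        text += "\n"
    path.write_text(text + lib_dir + "\n")
    return True
```

On a node this would be called as root with "/etc/ld.so.conf.d/dcos.conf" and "/opt/mesosphere/lib", followed by running sudo ldconfig once so the loader cache picks up the new directory.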
The JIRA issue addressing the underlying problem can be found here:
https://issues.apache.org/jira/browse/MESOS-7027

Airflow Broken DAG error - The version of cryptography does not match the loaded shared object

I'm on Airflow 1.10.12 and am seeing this error in the UI:
Broken DAG: [/home/airflow/dags/something.py] The version of cryptography does not match the loaded shared object. This can happen if you have multiple copies of cryptography installed in your Python path. Please try creating a new virtual environment to resolve this issue. Loaded python version: 2.9.2, shared object version: b'2.9'
The DAGs compile on the machine with no errors, but these messages appear for almost all of them.
I have also recreated the virtualenv multiple times, but the error persists.
Anyone seen this before?
It turns out that a Celery host had an extra scheduler running that was inserting the errors into the database. After I stopped the extra scheduler, the messages went away.
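Since the error message blames multiple copies of cryptography on the Python path, one way to check each host (including any host that might be running a stray scheduler) is to list where the package can be found. The following is a minimal sketch; find_package_copies is a hypothetical helper, and it only looks for regular package directories on the given search paths.

```python
import os
import sys


def find_package_copies(package, search_paths=None):
    """Return the distinct directories on the search paths that contain `package`.

    Hypothetical helper for illustration; it only detects ordinary packages
    (a directory with an __init__.py), not zipped or namespace packages.
    """
    paths = sys.path if search_paths is None else search_paths
    found = []
    for entry in paths:
        candidate = os.path.join(entry or ".", package)
        if os.path.isfile(os.path.join(candidate, "__init__.py")):
            real = os.path.realpath(candidate)
            if real not in found:
                found.append(real)
    return found
```

Running find_package_copies("cryptography") with the same interpreter that each Airflow process uses shows whether more than one copy is visible on that host.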

Using Pycharm to connect to cluster 'ImportError: libcuda.so.1: cannot open shared object file: No such file or directory'

I want to use PyCharm on my own laptop to connect to our Linux cluster and use tensorflow-gpu on it.
However, it says:
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory.
When I connect to the cluster from a terminal and use TensorFlow GPU there, there's no problem.
However, when I use a Python remote interpreter in PyCharm, the error happens when importing tensorflow-gpu:
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
Failed to load the native TensorFlow runtime.
On the cluster, 'libcuda.so.1' is located at
'/usr/lib64/nvidia/libcuda.so.1'
and
'/usr/lib64/libcuda.so.1'.
I tried to add them as LD_LIBRARY_PATH in the Environment variables of the PyCharm run configuration:
LD_LIBRARY_PATH=/usr/lib64/libcuda.so.1\;/usr/lib64/nvidia/libcuda.so.1
but it doesn't work.
I can use other packages, like numpy and sklearn, normally.
What's more, my TensorFlow GPU version corresponds to CUDA 9.0. If the error were about CUDA, I would expect it to say it cannot find libcuda.so.9; however, it shows libcuda.so.1.
I can also use tensorflow-gpu fine through a terminal and ssh, so I think the problem might be in the PyCharm settings?
What do I need to change in the PyCharm settings apart from adding LD_LIBRARY_PATH to the Environment variables?
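One thing worth noting about the value tried above: on Linux, LD_LIBRARY_PATH is a colon-separated list of directories, not of library files, so the dynamic loader would not interpret that value as intended. A small sketch of building the expected form from the two file locations follows; library_path_from_files is a hypothetical helper, and whether this alone fixes the PyCharm remote interpreter is not confirmed in this thread.

```python
import posixpath


def library_path_from_files(lib_files, existing=""):
    """Build a Linux LD_LIBRARY_PATH value from paths to shared-library files.

    Uses the directory of each file, keeps any directories already present in
    `existing`, drops duplicates, and joins with ':' (the Linux separator,
    used explicitly because the laptop running PyCharm may not be Linux).
    """
    dirs = []
    for lib in lib_files:
        directory = posixpath.dirname(lib)
        if directory and directory not in dirs:
            dirs.append(directory)
    for directory in existing.split(":"):
        if directory and directory not in dirs:
            dirs.append(directory)
    return ":".join(dirs)


value = library_path_from_files(
    ["/usr/lib64/libcuda.so.1", "/usr/lib64/nvidia/libcuda.so.1"]
)
print("LD_LIBRARY_PATH=" + value)  # prints LD_LIBRARY_PATH=/usr/lib64:/usr/lib64/nvidia
```

The variable has to be in place before the Python process starts, so it belongs in the run configuration's environment rather than in os.environ inside the script.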

nv-nsight-cu-cli caused Tensorflow to fail

I've downloaded the newest Nsight Compute profiling tool and I want to use it to benchmark TensorFlow applications. The code I'm using is here. It runs perfectly fine when I execute it directly, and benchmarking it with nvprof ./mnist.py works without any problem. However, when I try to run it with the command sudo ./nv-nsight-cu-cli [path to the file], I get the following error:
ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory
I suspect that nv-nsight-cu-cli somehow doesn't recognize the environment variable at all. Is there a fix or workaround?
You need to search for differences between the two environments:
- env variables
  - LD_LIBRARY_PATH
- /etc/ld.so.conf
- /etc/ld.so.conf.d/*
- cuBLAS
  - Is the installation complete and not broken?
  - Is it installed at the same location on both machines?
  - Versions
- ...
You can start with locate libcublas.so on both machines to see if there's a difference. Alternatively, you can run strace -f -e open on the program to check where it tries to load libcublas.so from.
Your error has (for now) nothing to do with GPUs: libcublas.so.9.0 simply cannot be found. Find it, find out why TensorFlow cannot find it, and your problem will be solved.
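The environment-variable part of that comparison can be sketched as below. It assumes the output of env has been captured once for the working run and once for the failing one (for example env and sudo env, since sudo often runs commands with a reset environment). The function names and the sample library directory are hypothetical.

```python
def parse_env_dump(text):
    """Parse the output of `env` (NAME=value per line) into a dict."""
    env = {}
    for line in text.splitlines():
        if "=" in line:
            name, value = line.split("=", 1)
            env[name] = value
    return env


def diff_env(env_a, env_b):
    """Return {name: (value_in_a, value_in_b)} for variables that differ.

    A value of None means the variable is not set in that environment.
    """
    names = sorted(set(env_a) | set(env_b))
    return {
        name: (env_a.get(name), env_b.get(name))
        for name in names
        if env_a.get(name) != env_b.get(name)
    }


# Sample dumps; /usr/local/cuda/lib64 is an assumed location, not from the question.
working = parse_env_dump("PATH=/usr/bin\nLD_LIBRARY_PATH=/usr/local/cuda/lib64\n")
failing = parse_env_dump("PATH=/usr/bin\n")
print(diff_env(working, failing))
```

If LD_LIBRARY_PATH shows up only on the working side, that would support the suspicion in the question that the tool is not seeing the variable.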
It appears that GP100 is not supported by the tool at this moment.
The answer is found here:
Nsight Compute only supports Pascal (other than GP100) and later GPUs.

How to get missing modules in Ocaml?

[Solved (at bottom). Installed XQuartz and reinstalled OCaml with X11 via brew, then restarted the machine.]
I'm learning OCaml and am going through these documentation pages, and I need to install some modules (graphics).
I'm missing the Graphics module in OCaml. After trying to load it in the toplevel (the REPL, right?) with:
$ ocaml
OCaml version blahblah
# #load "graphics.cma";;
# open Graphics;;
I get the error message:
Cannot find file graphics.cma.
So I wander over to this question and after not finding the file with the command:
ls `ocamlc -where`/graphics*
I read that this means that:
Graphics is not installed and you have to reinstall OCaml compiler enabling Graphics.
Does this mean I have to recompile OCaml every time I need a new module? I'm not sure what he meant by that.
Then, I tried to install Graphics with: opam install graphics.
I got this error:
This package relies on external (system) dependencies that may be missing. `opam depext lablgl.1.05' may help you find the correct installation for your system.
So I did opam depext lablgl.1.05
After this, I tried opam install graphics again, but it failed with this error:
#=== ERROR while installing graphics.1.0 ======================================#
# opam-version 1.2.2
# os darwin
# command ocamlc -custom graphics.cma -o test
# path /Users/alexanderkleinhans/.opam/system/build/graphics.1.0
# compiler system (4.02.2)
# exit-code 2
# env-file /Users/alexanderkleinhans/.opam/system/build/graphics.1.0/graphics-24451-7afd23.env
# stdout-file /Users/alexanderkleinhans/.opam/system/build/graphics.1.0/graphics-24451-7afd23.out
# stderr-file /Users/alexanderkleinhans/.opam/system/build/graphics.1.0/graphics-24451-7afd23.err
### stderr ###
# File "_none_", line 1:
# Error: Cannot find file graphics.cma
=-=- Error report -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
The following actions failed
∗ install graphics 1.0
No changes have been performed
=-=- graphics.1.0 troubleshooting -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
=> This package checks whether the Graphics library was compiled.
This error says Cannot find file graphics.cma, which brings me back to this question and to what the steps are to get graphics.cma (and other modules as I might need them).
I thought opam was a package manager for OCaml (it installs modules, right?)
EDIT:
I ran brew info ocaml, and I did install with x11, so I thought this meant I should have it...
ocaml: stable 4.04.1 (bottled), devel 4.05.0+beta3, HEAD
General purpose programming language in the ML family
https://ocaml.org/
/usr/local/Cellar/ocaml/4.04.1 (1,730 files, 194.4MB) *
Poured from bottle on 2017-06-13 at 15:23:43
From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/ocaml.rb
==> Requirements
Optional: x11 ✘
==> Options
--with-flambda
Install with flambda support
--with-x11
Install with the Graphics module
--devel
Install development version 4.05.0+beta3
--HEAD
Install HEAD version
EDIT 2:
Running
brew install Caskroom/cask/xquartz
brew reinstall ocaml --with-x11
Allowed me to compile, but running gave me this fatal exception. Seems to be an X11 thing?
Fatal error: exception Graphics.Graphic_failure("Cannot open display ")
Solved
So I think the two steps that were necessary were to make sure OCaml was installed with X11. Note that `brew info ocaml` seemed to give wrong information (it said OCaml was installed with X11, but a reinstall was necessary). On OSX, I also needed to install XQuartz.
brew install Caskroom/cask/xquartz
brew reinstall ocaml --with-x11
After this I COULD compile, but got an error on execution. This was simply solved by restarting which I read was necessary after installation of xquartz.
After that I could run fine.
The graphics module is an optional part of the base OCaml install, not an external module. This explains why you can't install it using OPAM. The OPAM module that you show is only testing whether it is installed in the current OCaml system. It can't (and hence doesn't try to) install graphics as a separate module.
For this reason, installing graphics (when it's not already installed) is unusually tricky. There's no need to recompile OCaml for installing most (if not all) other modules.
For what it's worth, I am running macOS 10.12.4, and I used "opam switch" to switch my OCaml system to the 4.03.0 release. In the resulting environment, the Graphics module is installed, and I have no trouble running the examples at the website you mention. (For the first, I see concentric red and yellow circles, for example.)
You might try doing "opam switch" to switch to a recent version of the compiler, and see if this gets things going for you. In the past I have had trouble getting Graphics to work, but it is working great for me now.

TensorCPU build: dependencies downloading issue

I am trying to build the CPU version of TensorFlow on CentOS 6.5, but I am stuck:
bazel build -c opt //tensorflow/cc:tutorials_example_trainer
........
WARNING: Sandboxed execution is not supported on your system and thus hermeticity of actions cannot be guaranteed. See http://bazel.io/docs/bazel-user-manual.html#sandboxing for more information. You can turn off this warning via --ignore_unsupported_sandboxing.
INFO: Downloading from http://www.ijg.org/files/jpegsrc.v9a.tar.gz: 0B
It tries to download the jpeg/eigen/png tar files but is unable to do so because my machine has no internet connectivity.
I can download all these dependencies and put them somewhere within the TensorFlow source directory so that the build procedure detects them automatically.
Could you please suggest the path to that directory (relative to the TensorFlow source root directory)? Or is there a file that needs modification?
I tried placing them under $TENSOR_SRC_ROOT/tensorflow/contrib/cmake/external, but that did not help.
Eagerly awaiting your replies.
Thanks to the Bazel development team: setting HTTP_PROXY and HTTPS_PROXY in my environment resolved the issue.
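A quick pre-flight check for those two variables before starting the build could look like the sketch below. missing_proxy_vars is a hypothetical helper, and accepting the lowercase spellings is an assumption about what a given tool honors, not something stated in this thread.

```python
import os


def missing_proxy_vars(environ=None):
    """Return which of HTTP_PROXY / HTTPS_PROXY are unset or empty.

    Lowercase spellings (http_proxy, https_proxy) are also accepted here;
    that is an assumption, so check what your build tool actually reads.
    """
    env = os.environ if environ is None else environ
    missing = []
    for name in ("HTTP_PROXY", "HTTPS_PROXY"):
        if not (env.get(name) or env.get(name.lower())):
            missing.append(name)
    return missing


print(missing_proxy_vars())
```

An empty list means both variables are visible to the process that is about to launch the build.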