nv-nsight-cu-cli caused Tensorflow to fail

I've downloaded the newest Nsight Compute profiling tool and I want to use it to benchmark Tensorflow applications. The code I'm using is here. It runs perfectly fine when I execute it, and when I benchmark it with nvprof ./mnist.py there is no problem at all. However, when I try to run it with the command sudo ./nv-nsight-cu-cli [path to the file], I get the following error:
ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory
I suspect that nv-nsight-cu-cli somehow didn't recognize the environment variables at all. Is there any workaround?

You need to search for differences in both environments:
env variables:
    LD_LIBRARY_PATH
/etc/ld.so.conf
/etc/ld.so.conf.d/*
cuBLAS:
    Is installation complete/not broken?
    Is it installed at the same location on both machines?
    Versions
    ...
You can start with locate libcublas.so on both machines to see if there's a difference. Alternatively, you can run the program under strace -f -e open to check where it tries to load libcublas.so from.
Your error has (for now) nothing to do with GPUs: libcublas.so.9.0 simply cannot be found. Find it, find out why Tensorflow cannot find it, and your problem will be solved.
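As a minimal sketch (not part of the original answer), the check below asks the dynamic loader for libcublas.so.9.0 via Python's ctypes; running it once from the shell where nvprof works and once via sudo will show whether the environment the profiler sees is different.

import ctypes
import os

# Print the loader-relevant environment and try to resolve the library directly.
print("LD_LIBRARY_PATH =", os.environ.get("LD_LIBRARY_PATH", "<not set>"))
try:
    ctypes.CDLL("libcublas.so.9.0")
    print("libcublas.so.9.0 was found by the dynamic loader")
except OSError as exc:
    print("the dynamic loader could not find it:", exc)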

It appears that GP100 is not supported by the tool at this moment.
The answer is found here:
Nsight Compute only supports Pascal (other than GP100) and later GPUs.

Related

Running remote Pycharm interpreter with tensorflow and cuda (with module load)

I am using a remote computer in order to run my program on its GPU. My program contains some code with tensorflow functions, and for easier debugging with PyCharm I would like to connect via SSH with a remote interpreter to the computer with the GPU. This part can be done easily, since PyCharm has this option, so I can connect there. However, tensorflow is not loaded automatically, so I get an import error.
Note that in our institution we run module load cuda/10.0 and module load tensorflow/1.14.0 each time the computer is loaded. Now this part is the tricky one: opening a remote terminal creates another session which is not related to the remote interpreter session, so it does not affect the remote interpreter's modules.
I know that module load generally configures environment variables; however, I am not sure how I can export those environment variables into the ones PyCharm configures before a run.
Any help would be appreciated. Thanks in advance.
The workaround turned out to be relatively simple: first, I installed the EnvFile plugin, as explained here: https://stackoverflow.com/a/42708476/13236698
Then I created a .env file with a quick Python script: it extracts all environment variables and their values from os.environ and writes them to a file in the format <env_variable>=<variable_value>, saved with a .env extension. Then I loaded that file into PyCharm, and voila - all tensorflow modules were loaded fine.
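A minimal sketch of that quick script, assuming it is run inside the session where the modules have already been loaded (the output file name is just an example):

import os

# Dump every environment variable of the current session in <name>=<value> form.
with open("remote_modules.env", "w") as fh:
    for name, value in os.environ.items():
        fh.write("{}={}\n".format(name, value))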

jpype.getDefaultJVMPath() fails when I try accessing JVM from python3

I am currently using Python 3, Java 8, and JPype 0.6.3 on Windows 10.
jpype.getDefaultJVMPath() fails with an error :
raise JVMNotFoundException("No JVM shared library file ({0}) "
jpype._jvmfinder.JVMNotFoundException: No JVM shared library file (jvm.dll) found. Try setting up the JAVA_HOME environment variable properly.
My JAVA_HOME points to C:\Program Files (x86)\Java\jdk1.8.0_241
If I try starting the JVM directly by passing the jvm.dll path ("C:\Program Files (x86)\Java\jdk1.8.0_241\jre\bin\client\jvm.dll"), the Python program crashes.
I have already given executable permission to the .dll file.
Could anyone please help me fix this issue for the above system specifications?
It is possible that your JVM architecture (32 bit) does not match your Python (64 bit). This would cause the symptoms you are describing.
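As a minimal sketch (an assumption, not from the original answer): the "Program Files (x86)" path in the question suggests a 32-bit JDK, so it is worth confirming the interpreter's bitness, since a 64-bit Python cannot load a 32-bit jvm.dll.

import platform
import struct

# Report whether this Python build is 32-bit or 64-bit.
print("Python build:", platform.architecture()[0])
print("Pointer size:", struct.calcsize("P") * 8, "bit")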
It turns out that the shared code I was using required a specific version of one of the drivers. I still don't understand it well enough to explain why, but with the older driver version (from a colleague) everything works!
from jpype import *

# Start the JVM by passing the path to libjvm.so explicitly instead of relying
# on getDefaultJVMPath()/JAVA_HOME.
startJVM("/home/user_name/Downloads/ideaIC-2022.2.3/idea-IC-222.4345.14/jbr/lib/server/libjvm.so", "-ea")
java.lang.System.out.println("hello world")
shutdownJVM()
This works for me: set the path manually in startJVM, as shown above.
If that path is not correct for your machine, search your Linux filesystem for libjvm.so and use the path you find.

Troubles caused by the tensorflow image's LD_LIBRARY_PATH

I've installed DC/OS v1.8.4. The destination node has GPU resources and the NVIDIA driver has also been installed. I tried to deploy tensorflow in a Mesos container, but it failed; there is only one error message in Mesos's stderr:
mesos-containerizer: error while loading shared libraries: libmesos-1.0.1.so: cannot open shared object file: No such file or directory
But I can deploy other services successfully, such as nginx and wordpress (also in Mesos containers).
The problem may be caused by the tensorflow image: its parent CUDA image resets LD_LIBRARY_PATH:
ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:${PATH}
ENV LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64
In Open DC/OS, before the mesos-agent starts, the executor's LD_LIBRARY_PATH environment variable is set to "/opt/mesosphere/lib" so that the executor can locate the necessary .so files. But in the case above, LD_LIBRARY_PATH is reset by the tensorflow image, so the executor fails to start!
Does anyone know how Open DC/OS handles this problem? Should these public CUDA images be modified?
GPUs are only officially supported in DC/OS 1.9+
For (unsupported) instructions on getting GPUs to work in 1.8, please see my answer to this question on the DC/OS mailing list:
https://groups.google.com/a/dcos.io/d/msg/users/HEgcUfRRqzk/inIBmapMCQAJ
Additionally, there is also a known issue with setting LD_LIBRARY_PATH in your container image for pre-1.9 clusters (though it usually manifests as a missing libssl.so library).
In your case, the CUDA container is setting LD_LIBRARY_PATH, which overrides the LD_LIBRARY_PATH setting that DC/OS relies on to find its library files. This is obviously a bug in DC/OS and has since been fixed in 1.9. The best (unsupported) workaround for this is to run
sudo ldconfig /opt/mesosphere/lib
on all of your nodes to put /opt/mesosphere/lib into the default library path. You will have to redo this on every reboot, or alternatively add /opt/mesosphere/lib to a file under /etc/ld.so.conf.d/ (maybe /etc/ld.so.conf.d/dcos.conf?) to make it durable, as sketched below.
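A hedged sketch of that durable variant, to be run as root on each agent; the file name dcos.conf is only the suggestion made above, not something DC/OS requires.

import subprocess

# Add /opt/mesosphere/lib to the loader search path and rebuild the cache.
with open("/etc/ld.so.conf.d/dcos.conf", "w") as conf:
    conf.write("/opt/mesosphere/lib\n")

subprocess.check_call(["ldconfig"])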
The JIRA ticket addressing the underlying issue can be found here:
https://issues.apache.org/jira/browse/MESOS-7027

Systemtap libdwfl error on Linux

I am trying to set up the SystemTap tool for profiling OS processes on a virtual Linux machine. I am using VirtualBox to run the image. Via
rpm -q kernel
and
cat /proc/version
The version obtained is:
Linux version 2.6.32-5-686 (Debian 2.6.32-48squeeze4)
I have downloaded and installed the tool correctly and wrote a simple program (.stp). However, I keep getting the same error, about which I have searched for information in many places without success:
After executing:
sudo stap my_profiler.stp
I get:
semantic error: libdwfl failure (all kernel modules found): no error
Pass 3: translation failed. Try again with another '--vp 001' option.
According to https://sourceware.org/systemtap/SystemTap_Beginners_Guide/errors.html
⁠semantic error: libdwfl failure
There was a problem processing the debugging information. In most cases, this error results from the installation of a kernel-debuginfo package whose version does not match the probed kernel exactly. The installed kernel-debuginfo package itself may have some consistency or correctness problems.
I have found no relevant information on the "kernel-debuginfo" package. I have also tried the verbose option without benefit. I even tried with an old Snapshot of the VM. Any ideas?
The code of the .stp program I ran:
probe timer.profile {
    printf("Process: %s\n", execname())
    printf("Process ID: %d\n", pid())
}
Found the problem! It seems that I was using the wrong version of the Linux kernel: the default kernel supplied by the version I wrote in the question. That version (the 2.6.32-5-686 one) apparently has problems with the debug info, so all I did was try the same with another version (Linux 3.9.6 with gcc 4.7.2, Debian 4.7.2-5) and it worked without trouble :)

Command line to run a program with both xlwt and abaqusConstants modules

Windows Machine, Python 2.4.
I have a program that imports both xlwt/xlrd and abaqusConstants module.
When I run my program with the command line: abaqus python abc.py, I get "ImportError: No module named xlwt/xlrd"
When I run my program with the command line: c:\python24\python.exe abc.py, I get "ImportError: No module named abaqusConstants".
The program ran perfectly when I ran it on my system, where xlrd/xlwt was present in c:\python24\lib and Abaqus was installed on the C drive. When I tried to access xlrd/xlwt from my organisation's common share, the above problem appeared.
Is it because Abaqus is not present in the common share? How do I rectify this issue? Please tell me what command line to use.
The module abaqusConstants is only available in Abaqus kernel executions of Python, so you need to run the script with abaqus python. Make sure that your PYTHONPATH variable is set properly to include the directory where xlwt/xlrd exists; see Using matplotlib (for python 2.6) with Abaqus 6.12 for a similar issue.
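As a minimal sketch (an alternative to exporting PYTHONPATH, assuming the shared xlwt/xlrd copy lives on a network share; the share path below is a placeholder, not a real location), the script itself can extend sys.path before the imports:

import sys

# Placeholder path to the directory on the common share that contains xlwt/xlrd.
sys.path.append(r"\\org-share\python-libs")

import xlwt
import xlrd
from abaqusConstants import *  # resolvable only when run via "abaqus python"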