Cloud ML Tensorflow and Cudnn versions compatibility - tensorflow

I would like to use Tensorflow 1.3 (and maybe 1.4) on Cloud ML. Im running jobs on multi-GPU machines on Cloud ML
I do that by specifying the tensorflow version in the setup.py as shown below:
from setuptools import setup
REQUIRED_PACKAGES = ['tensorflow==1.3.0']
setup(
name='my-image-classification',
install_requires=REQUIRED_PACKAGES,
version='1.0',
packages=['my_image_classification',
'my_image_classification/foo',
'my_image_classification/bar',
'my_image_classification/utils'],
)
What is the cudnn library that is installed on Cloud ML? Is it compatible with tensorflow 1.3 and tensorflow 1.3+ ?
I was able to start the jobs, but the performance is 10X lower than the expected value, and I'm curious if there is a problem with the underlying linking of Libraries
Edit:
I'm pretty confident now that the Cudnn versions on Cloud ML dont match what is required for Tensorflow 1.3. I noticed that Tensorflow 1.3 jobs are missing the "Creating Tensorflow device (/gpu:0...) " Logs which appear when I run a job with the default available Tensorflow on cloud ml

DISCLAIMER: using anything but 1.0, 1.2 is not officially supported as of 2017/11/01.
You need to specify the GPU-enabled version of TensorFlow:
REQUIRED_PACKAGES = ['tensorflow-gpu==1.3.0']
But the version of pip is out-of-date so you need to force that to update first.

Related

How to use the GPU with Tensorflow 2.11?

According to this link: //pypi.org/project/tensorflow-gpu/ , the "tensorflow-gpu" package is no longer supported and users should instead use the "tensorflow" package, which supposedly supports the GPU.
However after, installing the tensorflow 2.11 package, it will not even detect my GPU device. It only runs on the CPU. How does one use the GPU with Tensorflow 2.11?
It appears that Tensorflow 2.10 is the last version to support the GPU on windows: https://discuss.tensorflow.org/t/2-10-last-version-to-support-native-windows-gpu/12404

is CUDA 11 with RTX 3080 support tensorflow and keras?

I attached RTX 3080 to my computer. but when training on keras 2.3.1 and tensorflow 1.15, I got some error "failed to run cuBLAS_STATUS_EXECUTION_FAILED, did not mem zero GPU location . . . check failed:start_event !=nullptr && stop_event != nullptr" I think the problem is that recently released rtx 3080 and CUDA 11 is not yet support the keras 2.xx and tensorflow 1.xx. is this right? And what make that problem?
At the moment of writing this, currently Nvidia 30xx series only fully support CUDA version 11.x, see https://forums.developer.nvidia.com/t/can-rtx-3080-support-cuda-10-1/155849/2
Tensorflow 1.15 wasn't fully supported on CUDA since version 10.1 and newer, for probably similar reason as described in the link above. Unfortunately TensorFlow version 1.x is no longer supported or maintained, see https://github.com/tensorflow/tensorflow/issues/43629#issuecomment-700709796
TensorFlow 2.4 is your best bet with an Ampere GPU. It has now a stable release, and it has official support for CUDA 11.0, see https://www.tensorflow.org/install/source#gpu
As TensorFlow 1.x is never going to be updated or maintained by TensorFlow team, I would strongly suggest moving to TensorFlow 2.x, excluding personal preferences, it's better in almost every way and has tf.compat module for backwards compatibility with TensorFlow 1.x code, if rewriting you code base is not an option. However, even that module is no longer maintained, really showing that version 1.x is dead, see https://www.tensorflow.org/guide/versions#what_is_covered
However, if you're dead set on using TensorFlow 1.15, you might have a chance with Nvidia Tensorflow, which apparently has support for version 1.15 on Ampere GPUs, see https://developer.nvidia.com/blog/accelerating-tensorflow-on-a100-gpus/

Why has gpu stopped working for me in google colab?

I am a university professor trying to learn deep learning for a possible class in the future. I have been using google colab with GPU support for the past couple of months. Just recently, the GPU device is not found. But, I am doing everything that I have done in the past. I can't imagine that I have done anything wrong because I am just working through tutorials from books and the tensorflow 2.0 tutorials site.
tensorflow 2 on Colab GPU was broken recently due to an upgrade from CUDA 10.0 to CUDA 10.1. As of this afternoon, the issue should be resolved for the tensorflow builds bundled with Colab. That is, if you run the following magic command:
%tensorflow_version 2.x
then import tensorflow will import a working, GPU-compatible tensorflow 2.0 version.
Note, however, if you attempt to install a version of tensorflow using pip install tensorflow-gpu or similar, the result may not work in Colab due to system incompatibilities.
See https://colab.research.google.com/notebooks/tensorflow_version.ipynb for more information.

SageMaker and TensorFlow 2.0

What is the best way to run TensorFlow 2.0 with AWS Sagemeker?
As of today (Aug 7th, 2019) AWS does not provide TensorFlow 2.0 SageMaker containers, so my understanding is that I need to build my own.
What is the best Base image to use? Example Dockerfile?
EDIT: Amazon SageMaker does now support TF 2.0 and higher.
SageMaker + TensorFlow docs: https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html
Supported Tensorflow versions (and Docker URIs): https://aws.amazon.com/releasenotes/available-deep-learning-containers-images
Original answer
Here is an example Dockerfile that uses the underlying SageMaker Containers library (this is what is used in the official pre-built Docker images):
FROM tensorflow/tensorflow:2.0.0b1
RUN pip install sagemaker-containers
# Copies the training code inside the container
COPY train.py /opt/ml/code/train.py
# Defines train.py as script entrypoint
ENV SAGEMAKER_PROGRAM train.py
For more information on this approach, see https://docs.aws.amazon.com/sagemaker/latest/dg/build-container-to-train-script-get-started.html
Update on 10 nov 2019:
There is now a way to use Tensorflow 2 in SageMaker, event though there is no shortcut to start TF 2 directly from SageMaker console.
Start a conda Python3 Kernel
Make some updates (one in each code cell):
!pip install --upgrade pip # pip 19.0 or higher is required for TF 2
!pip install --upgrade setuptools # Otherwise you'll get annoying warnings about bad installs
Install Tensorflow 2
!pip install --user --upgrade tensorflow
Given the doc, this will install in $HOME.
Nota:
If you are using a GPU based instance of SageMaker, replace tensorflow by tensorflow-gpu.
You now can use TF 2 in your instance. This only needs to be done once, as long as the instance remains up.
To test, just run in the next cell:
import tensorflow as tf
print(tf.__version__)
You should see 2.0.0 or higher.
As of now, the best image that you can use to build Tensorflow 2.0 is 2.0.0b1, which is the latest version of Tensorflow 2.0(image) that is available now.Please find the link here. You also have an image 2.0.0b1-py3, which comes with Python3 (3.5 for Ubuntu 16-based images; 3.6 for Ubuntu 18-based images).
If you feel this answer is useful, kindly accept this answer and/or up vote it. Thanks.
Here from the SageMaker team. To use SageMaker seamlessly, it's recommended that you try out Amazon SageMaker Studio. It's similar to a Jupyter Notebook instance but has a variety of SageMaker services already integrated within it. A huge plus point is that you can select from a variety of kernels and for TensorFlow 2.0 you can select any of these that already have all the required packages installed:
Python 3 (TensorFlow 2.1 Python 3.6 CPU Optimized)
Python 3 (TensorFlow 2.1 Python 3.6 GPU Optimized)
Python 3 (TensorFlow 2.3 Python 3.7 CPU Optimized)
Python 3 (TensorFlow 2.3 Python 3.7 GPU Optimized)
In Studio, you can select the Kernel drop-down located in top-right corner of your SageMaker Studio instance:
Shown below is a screenshot of the UI interface in SageMaker Studio of the available kernels that are compatible with TensorFlow 2.0. Note that all kernels within SageMaker Studio are continuously tested and work seamlessly with all SageMaker Services.

"No module named tensorflow" when tensorflow-gpu is required

I built a python tensorflow package and uploaded to run on ml engine.
"tensorflow-gpu==1.8.0" (no tensorflow) is set to be required in my setup.py.
The ML engine run fails at "import tensorflow as tf" saying "No module named tensorflow".
The ML engine run works fine when I'm only requiring "tensorflow==1.8.0" but I believe tensorflow-gpu is needed to use GPU.
Any ideas how to solve this issue?
Thanks
You need to set --runtime-version=1.8 when submitting the job. Consequently, you don't need to manually specify TF in setup.py. In fact, if that's the only package you are requiring, you can omit setup.py altogether.
Update 2018/06/29:
Explanation: different versions of TensorFlow require different versions of NVIDIA's drivers and software stack. The --runtime-version is guaranteed to have the right version of the drivers for that particular version of TensorFlow. You can technically set the version of tensorflow-gpu in your setup.py, but that version must be compatible with the NVIDIA stack present in the --runtime-version you've selected (defaults to the very old TF 1.0).
This also happens when you have multiple python versions. In that case you have to specify the relevant python version for tf installation. For example,"python3 setup.py" instead of "python setup.py".