Distributed TensorFlow demos - tensorflow

Recently, TensorFlow added a distributed training module. What are the prerequisites for distributed training? I mean an environment like this:
tensorflow >= 0.8, Kubernetes, a shared file system, gcloud?
And they have released the example code:
Is there any way to run the TensorFlow cluster example when you only have HDFS and no shared file system? Where will the model files be stored?

Each computer will need to have TensorFlow installed (and in my experience, they should all be the same version; I had a few issues mixing versions 0.8 and 0.9).
Once that is set up, each computer will need access to the code it is going to run (main.py, for example). We use an NFS share for this, but you could just as easily git pull on each machine to get the latest copy of your code.
Then you just need to start them up. We would just ssh to each machine in our most basic setup, but if you have a cluster manager like Kubernetes, it may be different for you.
As for checkpoints, I believe only the chief worker writes to the checkpoint files, if that's what your last question was asking.
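For a rough picture of what each machine runs, here is a minimal sketch using the tf.train APIs from that era (r0.8/r0.9); the host names, ports, and logdir are placeholders, and exact API names vary a little between versions:

import numpy as np
import tensorflow as tf

# Hypothetical two-machine cluster; swap in your own host:port pairs.
cluster = tf.train.ClusterSpec({
    "ps": ["machine-a:2222"],
    "worker": ["machine-a:2223", "machine-b:2222"],
})

# Every machine runs this same script, each with its own job_name and
# task_index (passed in via flags, for example).
job_name, task_index = "worker", 0
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()  # parameter servers just host the variables
else:
    # replica_device_setter places variables on the ps job automatically.
    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        x = tf.placeholder(tf.float32, [None, 1])
        w = tf.Variable(tf.zeros([1, 1]))
        loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - 1.0))
        global_step = tf.Variable(0, trainable=False, name="global_step")
        train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
            loss, global_step=global_step)

    # Only the chief (task 0) writes checkpoints; logdir should be a path
    # every worker can reach, e.g. an NFS mount.
    sv = tf.train.Supervisor(is_chief=(task_index == 0),
                             logdir="/shared/train_logs")
    with sv.managed_session(server.target) as sess:
        step = 0
        while not sv.should_stop() and step < 1000:
            _, step = sess.run([train_op, global_step],
                               feed_dict={x: np.ones((8, 1))})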
Let me know if you have further questions.

Related

Is it possible to run .ipynb notebooks locally using GPU acceleration? How?

Every time I need to train a 'large' deep learning model I do it from Google Colab, as it allows you to use GPU acceleration.
My PC has a dedicated GPU, and I was wondering if it is possible to use it to run my notebooks locally at a reasonable speed. Is it possible to train models using my PC's GPU? If so, how?
I am open to working with DataSpell, VSCode, or any other IDE.
Nicholas Renotte has a great 'Getting Started' video that goes through the entire process of setting up GPU-accelerated notebooks on your PC. The part you're interested in starts around the 12-minute mark.
Yes, it is possible to run .ipynb notebooks locally with GPU acceleration. To do so, you will need to install a framework such as TensorFlow or PyTorch, together with the NVIDIA driver and the CUDA/cuDNN libraries it depends on, and configure the environment accordingly. Depending on the IDE you choose, you may also need to install the relevant plugins and packages.
In terms of IDEs, DataSpell, VSCode, PyCharm, and Jupyter Notebook are all suitable for running notebooks locally with GPU acceleration.
Once everything is installed, the framework will generally detect the GPU automatically; at most you may need a few lines in the notebook to select which GPU(s) to use. After that, you can run the notebook locally with GPU acceleration.
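As a quick sanity check once everything is installed, a couple of lines (a minimal sketch, assuming TensorFlow 2.x) will tell you whether the notebook kernel actually sees the GPU:

import tensorflow as tf

# Lists the GPUs TensorFlow can see; an empty list means it will silently
# fall back to the CPU (usually a driver / CUDA / cuDNN mismatch).
print(tf.config.list_physical_devices("GPU"))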

Google Cloud Deep Learning On Linux VM throws Unknown Cuda Error

I am trying to set up a deep learning VM on Google Cloud but I keep running into the same issue over and over again.
I will follow all the steps, set up a N1-highmem-8 (8 vCPU, 52gb Memory) instance, add a single T4 GPU and select the Deep Learning Image: TensorFlow 2.4 m69 CUDA 110 image. That's it.
After that, I will ssh into the vm, run the script that installs all the NVIDIA drivers and... when I begin using it, by simply running
from tensorflow.keras.layers import Input, Dense
i = Input((100,))
x = Dense(500)(i)
I keep getting failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error. By that point I haven't installed anything and haven't done anything custom, just the vanilla image from GCP.
What is more concerning is that, even if I delete the VM and then create a new one with the same config, sometimes the error won't happen immediately and sometimes it's there right off the bat.
Has anyone encountered this? I've googled around to see if anyone has faced this issue, and while I came across suggestions, all of them are old and have not worked for me. Moreover, the suggestions on the NVIDIA support forums tell me to reinstall everything, and the whole point of using a pre-built GCP image specifically for deep learning is so that I don't have to enter the hell of installing and resolving issues with NVIDIA drivers.
The issue is fixed in the M74 image, but you are using M69, so follow one of the two fixes provided in the Google Cloud public forum.
You can mitigate the issue in one of two ways:
Fix #1: Use the latest DLVM image (M74 or later) in a new VM instance. A fix was released in the M74 image, so you will no longer be affected by this issue.
Fix #2: Patch your existing instance running images older than M74.
Run the following via an SSH session on the affected instance:
gsutil cp gs://dl-platform-public-nvidia/b191551132/restart_patch.sh /tmp/restart_patch.sh
chmod +x /tmp/restart_patch.sh
sudo /tmp/restart_patch.sh
sudo service jupyter restart
This only needs to be done once, and does not need to be rerun each time the instance is rebooted.

What is the difference between Tensorflow installation via source vs using pip?

I want to run TensorFlow on a very standard machine setup (Windows 64-bit) and have read that TensorFlow has greater performance if built from source, as it is optimised for your system. When installing TensorFlow via pip, why does pip not select the optimal build for your system?
Also, if you did install via pip, is there a way to tell whether the optimal build has been installed, or is the only way of knowing simply remembering how you installed it?
Google has taken the position that it is not reasonable to build TF for every possible instruction set out there. They only release generic builds for Linux, Mac, and Windows. You must build from source if you want all the optimizations for your particular machine's instruction set.
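One rough way to tell, short of remembering how you installed it (a heuristic, not an official API): the generic pip wheels log which CPU instructions they were not compiled to use. For example, on TensorFlow 1.x:

import tensorflow as tf

# A generic pip build typically logs a line like
#   "Your CPU supports instructions that this TensorFlow binary was not
#    compiled to use: AVX2 FMA"
# the first time it runs; a build compiled from source on your own machine
# will not emit it, because those instructions were enabled at compile time.
sess = tf.Session()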

TensorFlow without jupyter notebook

Do I absolutely need to use Jupyter Notebook to run TensorFlow on Windows?
I tried the object detection example with a Jupyter notebook; it works, but I'm not really comfortable with it. I'm used to Notepad++ and running Python directly on Windows without a virtual environment.
I tried to copy and paste all the code, but I ran into many bugs.
No, it is not compulsory to use Jupyter Notebook to run TensorFlow on Windows. I personally use PyCharm as my IDE and Anaconda for dependency management (this is completely optional).
I would recommend using a proper IDE instead of Notepad++, because it's much easier to debug with an IDE. You'll also be cloning a lot from Git when you start developing your own models, and the open-source models out there usually have a lot of classes and methods in them (take Google's Inception net, for example).
Alternatively, you could start posting about the bugs you are facing, and we can all start helping you.
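To illustrate, the same code that runs in a notebook cell works unchanged as a plain script (a trivial sketch; save it as hello_tf.py and run python hello_tf.py from a regular command prompt):

import tensorflow as tf

# No notebook required - any Python interpreter with TensorFlow installed
# can run this directly.
print("TensorFlow version:", tf.__version__)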

Automatic packing of server-side product as Docker and OVA image

We develop a server-side solution, and to ease its deployment we would like to provide our customers with two options:
1. Docker image
2. VM image in OVA format
The images should be automatically created by our build machine.
As of today, we use Packer for this purpose. First we create the Docker image and then install it into a preconfigured virtual machine image (using the 'virtualbox-ovf' builder). This works pretty well, but there are some problems with this solution.
First, our VM includes the Docker framework and two OSes (the host's and Docker's), so our VM image is roughly twice the size of the Docker image. Second, to base our solution on another Linux distro, we have to manually configure a new VM.
We are looking for a 'Dockerfile'-style solution to create and configure the VM automatically and then export it in OVA format. The 'virtualbox-iso' builder is the obvious way to do this, but the build process would take much longer.
If you are willing to use Debian as your base OS, then you could look at TurnKey Linux's TKLDev. It's probably a bit of a learning curve initially, but it's a pretty cool thing IMO (although I'm very biased - see the disclaimer below). TKLDev will build you a TurnKey (Debian-based) ISO with your software installed on top. Then, using Buildtasks, you can convert the ISO to OVA, VMDK, LXC, Docker, OpenStack, etc...
Unfortunately Buildtasks is not very well documented, but significant chunks of it are in bash, so if you are handy with a Linux command line you could work it out. Otherwise ask on the TurnKey forums.
The initial development (moving from Packer to TKLDev) may take a little while, but once the heavy lifting is done, the creation of an ISO (in a guest VM on a modern multicore PC) takes about 10-15 minutes, the OVA probably another ~5, and the Docker image another ~5.
If you wanted it to build automatically, you could use a hook to trigger a fresh TKLDev build (including the Buildtasks image creation) every time a commit is made to a repo. I know that git supports this, and I assume that other version control systems allow something similar.
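For example, a server-side git post-receive hook can kick off the build on each push (a hypothetical sketch; run-tkldev-build.sh is a made-up wrapper around the TKLDev build and the Buildtasks conversions, and the hook could equally be plain shell):

#!/usr/bin/env python3
# post-receive - drop this into the bare repo's hooks/ directory and mark
# it executable; git runs it after every successful push.
import subprocess

# run-tkldev-build.sh is a hypothetical wrapper script that runs the
# TKLDev build and then the Buildtasks image conversions.
subprocess.check_call(["/usr/local/bin/run-tkldev-build.sh"])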
Also if the appliance that you are making is open source then perhaps it could be added to the TurnKey Linux library?
Disclaimer: I work with TurnKey Linux. :)
FWIW this is essentially the process we use to create our library of appliances in most virtualisation formats known to humankind!